ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2016, Vol. 53, Issue (2): 449-458. doi: 10.7544/issn1000-1239.2016.20148275

WR Approach: Determining Accurate Attribute Values in Big Data Integration

Zhou Ningnan1,3, Sheng Wanxing1, Liu Ke-yan1, Zhang Xiao2,3, Wang Shan2,3   

1(China Electric Power Research Institute, Beijing 100192); 2(Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), Ministry of Education, Beijing 100872); 3(School of Information, Renmin University of China, Beijing 100872)

Online: 2016-02-01

Abstract: Big data integration lays the foundation for high-quality data-driven decision making. One critical step is determining the accurate attribute values from the records that pertain to a given entity. The state-of-the-art approach, R-topK, designs rules that decide the relative accuracy among attribute values and thereby obtains the accurate values. Unfortunately, when multiple true values or conflicting rules exist, it requires rounds of human intervention. In this paper, we propose a weighted rule (WR) approach for determining accurate attribute values in big data integration. Each rule is augmented with a weight, which avoids human intervention when conflicts occur. We design a chase-procedure-based inference algorithm and prove that it derives the weighted constraints over relative accuracy among attribute values in O(n^2); these constraints are then used to find the accurate data values. Taking conflicts among constraints into consideration, we propose an O(n) algorithm that discovers the accurate attribute values among the combinations of data values. We conduct extensive experiments on real-world and synthetic datasets, and the results demonstrate the effectiveness and efficiency of the WR approach: it improves performance by a factor of 3-15 and effectiveness by 7%-80%.

Key words: big data integration, data quality, data accuracy, data cleaning, weighted rules
