Abstract:
Big data integration lays the foundation for high-quality data-driven decisions. One critical step is determining the accurate attribute values from the records that pertain to a given entity. The state-of-the-art approach, R-topK, designs rules that decide the relative accuracy among attribute values and thereby obtains the accurate values. Unfortunately, when multiple true values or conflicting rules exist, it requires rounds of human intervention. In this paper, we propose a weighted rule (WR) approach for determining accurate attribute values in big data integration. Each rule is augmented with a weight, which avoids human intervention when conflicts occur. We design a chase-based inference algorithm and prove that it derives weighted constraints over the relative accuracy among attribute values in O(n^2) time; these constraints then guide the search for accurate data values. Taking conflicts among constraints into consideration, we propose an O(n) algorithm that discovers accurate attribute values among the combinations of data values. We conduct extensive experiments on real-world and synthetic datasets, and the results demonstrate the effectiveness and efficiency of the WR approach: it boosts performance by a factor of 3 to 15 and improves effectiveness by 7% to 80%.