Abstract:
Volume, velocity, variety, and veracity are the four defining features of big data, and they bring new challenges to data integration. Entity resolution is one of the most important steps in data integration. On big data, conventional entity resolution methods tend to be inefficient and ineffective in practice, particularly with respect to noise immunity. To address the inconsistent resolution results that these four features cause, we introduce the concept of common neighborhood into the correlation clustering problem. Our top tier, for pre-partition, is designed around the neighborhood and completes the preliminary partition of blocks quickly and effectively. We further introduce the concept of a kernel, which gives a more precise definition of the degree of correlation between a node and a cluster. As a consequence, our bottom tier, for adjustment, clusters nodes accurately and improves the accuracy of the correlation clustering. Because it relies only on a coarse similarity function, our two-tiered method for entity resolution is simple and efficient. The use of the neighborhood also gives the method good noise immunity. Extensive experiments demonstrate that the proposed two-tiered method achieves higher accuracy and better noise immunity than traditional methods, and that it scales to big data.
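To make the role of the common neighborhood concrete, the following is a minimal Python sketch of a neighborhood-based pre-partition. It is an illustration under stated assumptions and not the method defined in the paper: the Jaccard overlap of closed neighborhoods, the threshold of 0.5, the union-find merge, and the names `closed_neighborhoods`, `neighborhood_overlap`, and `pre_partition` are all hypothetical choices made for this example. The kernel-based adjustment of the bottom tier is not sketched, because the abstract does not define it.

```python
def closed_neighborhoods(nodes, edges):
    """Map each node to itself plus every node it shares an edge with."""
    nbrs = {v: {v} for v in nodes}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def neighborhood_overlap(nbrs, u, v):
    """Jaccard overlap of two closed neighborhoods (an assumed measure)."""
    return len(nbrs[u] & nbrs[v]) / len(nbrs[u] | nbrs[v])


def pre_partition(nodes, edges, threshold=0.5):
    """Merge the endpoints of an edge only if their neighborhoods overlap enough.

    `edges` holds the pairs that a coarse similarity function marked as
    similar. The threshold value is an assumption chosen for illustration.
    """
    nbrs = closed_neighborhoods(nodes, edges)
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        if neighborhood_overlap(nbrs, u, v) >= threshold:
            parent[find(u)] = find(v)

    blocks = {}
    for v in nodes:
        blocks.setdefault(find(v), set()).add(v)
    return sorted(sorted(block) for block in blocks.values())


# Two triangles joined by one spurious "similar" pair (c, d).
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"),
         ("c", "d")]
print(pre_partition(nodes, edges))
# prints [['a', 'b', 'c'], ['d', 'e', 'f']]
```

In this toy input the pair (c, d) is marked similar, but the two nodes share only themselves out of six neighbors in total, so the overlap is 1/3 and the edge is not used for merging. This is the intuition behind using neighborhoods for noise immunity: a single erroneous similarity judgment is outweighed by the surrounding structure.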