ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development (计算机研究与发展), 2014, Vol. 51, Issue 9: 2108-2116. doi: 10.7544/issn1000-1239.2014.20131345

• Artificial Intelligence •

Two-Tiered Correlation Clustering Method for Entity Resolution in Big Data

Wang Ning, Li Jie

  1. (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044) (nwang@bjtu.edu.cn)
  • Published: 2014-09-01
  • Online: 2014-09-01
  • Funding: National Natural Science Foundation of China (61370060); Natural Science Foundation of Jiangsu Province (BK2011454)

Abstract: Volume, velocity, variety, and veracity are the four defining features of big data, and they pose new challenges for data integration. Entity resolution is an important step in data integration. In big data settings, conventional entity resolution algorithms perform poorly in efficiency and quality, and especially in noise immunity. To resolve the conflicts in resolution results caused by data noise, we introduce the concept of common neighborhood into the correlation clustering problem. The top tier, a pre-partition algorithm, is designed around neighbor relations and can therefore complete a preliminary partition into blocks quickly and effectively. The concept of a kernel defines the degree of correlation between a node and a cluster more precisely, so that the bottom tier, an adjustment algorithm, can accurately decide which cluster a node belongs to, which improves the accuracy of correlation clustering. Both tiers use a relatively coarse similarity distance function, which keeps the method simple and efficient. At the same time, the use of neighbor relations markedly improves noise immunity. Extensive experiments show that the two-tiered correlation clustering method outperforms traditional algorithms in resolution quality, noise immunity, and scalability.

Key words: correlation clustering, common neighborhood, entity resolution, data integration, big data, noisy data
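
As a rough illustration of the common-neighbor idea mentioned in the abstract, the following minimal Python sketch builds a neighbor graph from a coarse similarity function and counts the neighbors two records share. It is not the authors' algorithm: the token-level Jaccard similarity, the 0.5 threshold, the sample records, and the helper names `jaccard`, `build_neighbors`, and `common_neighbors` are all assumptions made for this example.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Coarse token-level Jaccard similarity (assumed stand-in, not from the paper)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def build_neighbors(records, threshold=0.5):
    """Link two records as neighbors when their similarity reaches the threshold.

    The threshold value is a hypothetical choice for illustration only.
    """
    neighbors = {i: set() for i in range(len(records))}
    for i, j in combinations(range(len(records)), 2):
        if jaccard(records[i], records[j]) >= threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def common_neighbors(neighbors, i, j):
    """Number of neighbors shared by records i and j."""
    return len(neighbors[i] & neighbors[j])


# Hypothetical records: the first three describe one entity with noisy variations.
records = [
    "wang ning beijing jiaotong university",
    "ning wang beijing jiaotong univ",
    "wang ning jiaotong university beijing china",
    "li jie tsinghua university",
]
neighbors = build_neighbors(records)
```

In this toy setting, records that refer to the same entity tend to share neighbors even when a single pairwise comparison is distorted by noise, which is the intuition behind using neighbor relations rather than pairwise similarity alone.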

CLC number: