ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development (计算机研究与发展) ›› 2020, Vol. 57 ›› Issue (1): 175-187. doi: 10.7544/issn1000-1239.2020.20180691

• Artificial Intelligence •

An Approach for Reconciling Inconsistent Pairs Based on Factor Graph
(基于因子图的不一致记录对消歧方法)

Xu Yaoli (徐耀丽), Li Zhanhuai (李战怀), Chen Qun (陈群), Wang Yanyan (王艳艳), Fan Fengfeng (樊峰峰)

  1. (School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an 710072) (Key Laboratory of Big Data Storage and Management (Northwestern Polytechnical University), Ministry of Industry and Information Technology, Xi’an 710129) (yaolixu@mail.nwpu.edu.cn)
  • Published: 2020-01-01
  • Online: 2020-01-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program (2018YFB1003403), the National Natural Science Foundation of China (61732014, 61672432), and the Natural Science Basic Research Plan in Shaanxi Province of China (2018JM6086).

Abstract: Entity resolution (ER) is a fundamental problem in data integration and data cleaning systems. Although numerous ER methods have been proposed, they rely on explicit or implicit assumptions or adopt different resolution strategies. When applied to the same ER task, their results conflict, producing many inconsistent pairs. Reconciling these pairs without any labeled data is challenging for two reasons: 1) without labeled data, it is infeasible to evaluate the existing approaches and pick the best one; 2) although an alternative is to reconcile the conflicting results into a consistent labeling, an effective reconciliation mechanism that combines all available hints remains to be investigated. To this end, an approach for reconciling inconsistent pairs based on a factor graph is proposed. First, existing ER approaches are run on a given ER task to obtain consistent and inconsistent pairs. Second, features that indicate the matching status of inconsistent pairs are extracted using techniques such as kernel density estimation and matching information transfer, and these features are modeled as factor functions of a factor graph, which represents a joint probability distribution with factor weights. Finally, the weight of each factor is estimated by maximum likelihood estimation, and the inconsistent pairs are reconciled according to that distribution. Experimental results on real-world datasets show that the method is effective and outperforms the state-of-the-art approach.

Key words: data integration, entity resolution, maximum likelihood estimation, inconsistent pair, kernel density estimation (KDE), factor graph
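
The pipeline summarized in the abstract can be illustrated with a small sketch. The abstract does not define the actual factor functions, so everything below is an assumption made for illustration: the toy similarity scores, the bandwidth, the helper names (`gaussian_kde`, `kde_feature`, `fit_weights`, `match_probability`), and the choice of only two unary factors (a KDE log-density ratio and a constant prior factor). With unary factors only, the joint distribution factorizes per pair and the model reduces to logistic regression; pairwise factors such as matching information transfer are omitted. The consistent pairs, on which all ER methods agree, serve as evidence for maximum likelihood estimation.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimator fitted to `samples`."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm
    return density

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weights(features, labels, lr=0.1, steps=2000):
    """Estimate factor weights by gradient ascent on the mean log-likelihood."""
    weights = [0.0] * len(features[0])
    n = len(features)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for f, y in zip(features, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, f)))
            for k, v in enumerate(f):
                grad[k] += (y - p) * v / n
        weights = [w + lr * g for w, g in zip(weights, grad)]
    return weights

# Hypothetical similarity scores of consistent pairs (all ER methods agree).
match_sims = [0.90, 0.85, 0.80, 0.55]
unmatch_sims = [0.10, 0.20, 0.30, 0.60]

# Assumed bandwidth; one density per class of consistent pairs.
match_density = gaussian_kde(match_sims, bandwidth=0.15)
unmatch_density = gaussian_kde(unmatch_sims, bandwidth=0.15)

def kde_feature(sim, eps=1e-12):
    """Factor feature: log ratio of match density to unmatch density."""
    return math.log((match_density(sim) + eps) / (unmatch_density(sim) + eps))

def factor_features(sim):
    """Two unary factors: the KDE log-ratio and a constant prior factor."""
    return [kde_feature(sim), 1.0]

train_x = [factor_features(s) for s in match_sims + unmatch_sims]
train_y = [1] * len(match_sims) + [0] * len(unmatch_sims)
weights = fit_weights(train_x, train_y)

def match_probability(sim):
    """Marginal probability that a pair with similarity `sim` is a match."""
    return sigmoid(sum(w * v for w, v in zip(weights, factor_features(sim))))

# Reconcile hypothetical inconsistent pairs by thresholding the marginal.
inconsistent = {"pair_a": 0.95, "pair_b": 0.05}
labels = {name: match_probability(sim) >= 0.5 for name, sim in inconsistent.items()}
print(labels)
```

In this toy setting the two classes of consistent pairs overlap slightly, which keeps the maximum likelihood estimate finite. A model closer to the one the abstract describes would add factors connecting related pairs and would then need approximate inference rather than a closed-form marginal.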
