ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2016, Vol. 53, Issue (12): 2858-2866. doi: 10.7544/issn1000-1239.2016.20150614

• Software Technology •

MTruths: An Approach of Multiple Truths Finding from Web Information

Ma Ruxia1,2, Meng Xiaofeng1, Wang Lu1, Shi Yingjie3

  1(School of Information, Renmin University of China, Beijing 100872); 2(Department of Education Technology, Capital Normal University, Beijing 100048); 3(School of Information Engineering, Beijing Institute of Fashion Technology, Beijing 100029) (maruxia@126.com)
  • Published: 2016-12-01
  • Online: 2016-12-01
  • Supported by:
    the National Natural Science Foundation of China (61379050, 91224008, 61502279); the National High Technology Research and Development Program of China (863 Program) (2013AA013204); the Specialized Research Fund for the Doctoral Program of Higher Education of China (20130004130001); the Research Funds of Renmin University of China (11XNL010)

Abstract: The Web has become a vast information repository in which information is scattered across different data sources. Different data sources often provide conflicting attribute values for the same entity. Identifying the true values among these conflicting values is known as the truth finding problem. According to the number of values they take, object attributes can be divided into two categories: single-valued attributes and multi-valued attributes. Most existing truth finding algorithms are effective mainly for single-valued attributes. To address truth finding for multi-valued attributes, this paper proposes a multiple-truth finding method called MTruths. The method formulates multiple-truth finding as an optimization problem whose objective is to maximize the weighted sum of similarities between each object's truths and the observations provided by the data sources. Two methods are proposed to find the optimal truth list of an object: an enumeration-based method and a greedy algorithm. Unlike existing methods, MTruths obtains the multiple truths of an object directly. Experiments on two real-world data sets, one of books and one of movies, show that both implementations of MTruths are more accurate than existing truth finding methods and that the greedy algorithm is also more efficient.

Key words: truth finding, data conflict, single-valued attributes, multi-valued attributes, quality of data sources
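
The optimization described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the abstract does not state which similarity measure is used or how source weights are obtained, so the code below assumes Jaccard similarity between value sets and takes the source weights as given inputs. The function names (`jaccard`, `objective`, `enumerate_truths`, `greedy_truths`) and the toy data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def objective(candidate, observations, weights):
    """Weighted sum of similarities between a candidate truth set
    and the value set claimed by each source."""
    return sum(weights[s] * jaccard(candidate, obs)
               for s, obs in observations.items())


def enumerate_truths(observations, weights):
    """Score every non-empty subset of the claimed values (exponential cost)."""
    values = sorted(set().union(*observations.values()))
    best, best_score = frozenset(), float("-inf")
    for k in range(1, len(values) + 1):
        for combo in combinations(values, k):
            score = objective(frozenset(combo), observations, weights)
            if score > best_score:
                best, best_score = frozenset(combo), score
    return best, best_score


def greedy_truths(observations, weights):
    """Repeatedly add the value with the largest objective,
    stopping when no remaining value improves it."""
    remaining = set().union(*observations.values())
    chosen = set()
    score = objective(chosen, observations, weights)
    while remaining:
        gains = {v: objective(chosen | {v}, observations, weights)
                 for v in sorted(remaining)}
        best_value = max(gains, key=gains.get)
        if gains[best_value] <= score:
            break
        chosen.add(best_value)
        score = gains[best_value]
        remaining.remove(best_value)
    return frozenset(chosen), score


# Toy example: four sources claim author lists for one book.
observations = {
    "s1": {"A", "B"},
    "s2": {"A", "B", "C"},
    "s3": {"A"},
    "s4": {"D"},
}
weights = {"s1": 0.9, "s2": 0.6, "s3": 0.7, "s4": 0.2}  # assumed, not learned

print(enumerate_truths(observations, weights))
print(greedy_truths(observations, weights))
```

On this toy input both functions select the set {A, B}. The enumeration scores all 2^n - 1 non-empty subsets of the n claimed values, whereas the greedy variant evaluates at most on the order of n^2 candidates, which is consistent with the efficiency advantage the abstract reports for the greedy algorithm. In general a greedy search of this kind is not guaranteed to reach the same optimum as enumeration.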

CLC Number: