ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 (Journal of Computer Research and Development) ›› 2015, Vol. 52 ›› Issue (9): 1931-1940. doi: 10.7544/issn1000-1239.2015.20140684

• Software Technology •

Research on a Truth Discovery Method Based on the Category-Level Credibility of Data Sources

马如霞1,2, 孟小峰1   

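  1. 1(School of Information, Renmin University of China, Beijing 100872); 2(Department of Educational Technology, Capital Normal University, Beijing 100048) (maruxia@126.com)
  • Published: 2015-09-01
  • Supported by: 
    National Natural Science Foundation of China (61379050, 91224008); National High Technology Research and Development Program of China (863 Program) (2013AA013204); Specialized Research Fund for the Doctoral Program of Higher Education (20130004130001); Research Funds of Renmin University of China (11XNL010)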

Truth Discovery Based Credibility of Data Categories on Data Sources

Ma Ruxia1,2, Meng Xiaofeng1   

  1. 1(Department of Information, Renmin University of China, Beijing 100872); 2(Department of Education Technology, Capital Normal University, Beijing 100048)
  • Online: 2015-09-01

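Abstract: The spread of the Internet and the development of e-commerce have changed how people obtain information and consume. The Web has become an important source of information for most people. At the same time, problems with the quality of online information have become increasingly apparent: the Web contains a great deal of outdated, erroneous, false, and one-sided information. The problem of different websites providing conflicting information about the same object is especially prominent. How to find the correct information among these conflicting claims has become a pressing problem, known as the truth discovery problem. A survey of existing truth discovery methods shows that none of them considers how differences in a data source's credibility across data categories affect truth discovery. This paper therefore poses the problem of truth discovery based on the category-level credibility of data sources. Two methods are proposed to detect differences in category-level credibility, and a Bayesian method is used to iteratively compute the category-level credibility of data sources and the accuracy of attribute values. In addition, the accuracy of the truth discovery algorithm is further improved by considering the effects of data source coverage and object difficulty. Experimental results on one real data set show that the proposed method can significantly improve the accuracy of truth discovery.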

Keywords: truth discovery, data conflict, category-level credibility of data sources, information quality, data fusion

Abstract: The spread of the Internet and the growth of e-commerce have changed the way people access information and consume. For most people, the Web has become an important source of information. Meanwhile, information quality problems are becoming increasingly prominent: the Web contains a large amount of outdated, incorrect, false, and biased information. In particular, different websites often provide conflicting information about the same object. Finding the truth among such conflicting information, known as the truth discovery problem, has become an urgent issue. To our knowledge, no existing method considers the credibility of data categories on data sources when discovering truth. We therefore formulate the problem of truth discovery based on the credibility of data categories on data sources. Two methods are proposed to detect differences in the credibility of data categories on sources, and a Bayesian method is used to iteratively compute the quality of data sources and the accuracy of data. In addition, data coverage and the difficulty of each object are considered to further improve the accuracy of truth discovery. Experiments on a real data set show that our algorithms can significantly improve the accuracy of truth discovery.

Key words: truth discovery, data conflict, credibility of data categories on data sources, quality of information, data fusion
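
The iterative idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm, whose formulas are not given on this page. It substitutes a generic Bayesian-style weighted vote and leaves out the coverage and object-difficulty adjustments. The function name `discover_truth`, the parameters `n_false` and `init_trust`, and the toy data are all assumptions made for illustration. The only point it tries to show is that trust is tracked per (source, category) pair rather than per source.

```python
import math
from collections import defaultdict


def discover_truth(claims, categories, n_false=10, init_trust=0.8, iterations=20):
    """Hypothetical sketch: iterative truth discovery with per-category source trust.

    claims:     list of (source, object, value) triples
    categories: dict mapping each object to its data category
    n_false:    assumed number of wrong values per object (illustrative parameter)
    """
    # trust[(source, category)] estimates how often the source is right in that category.
    trust = defaultdict(lambda: init_trust)
    by_object = defaultdict(list)
    for source, obj, value in claims:
        by_object[obj].append((source, value))

    probs = {}
    for _ in range(iterations):
        # Step 1: score every candidate value with a log-odds vote of its sources.
        for obj, obj_claims in by_object.items():
            cat = categories[obj]
            scores = defaultdict(float)
            for source, value in obj_claims:
                a = min(max(trust[(source, cat)], 0.01), 0.99)  # keep the log finite
                scores[value] += math.log(n_false * a / (1.0 - a))
            top = max(scores.values())
            weights = {v: math.exp(s - top) for v, s in scores.items()}
            total = sum(weights.values())
            probs[obj] = {v: w / total for v, w in weights.items()}

        # Step 2: per-category trust = mean probability of the values a source claimed.
        sums, counts = defaultdict(float), defaultdict(int)
        for source, obj, value in claims:
            key = (source, categories[obj])
            sums[key] += probs[obj][value]
            counts[key] += 1
        for key in counts:
            trust[key] = sums[key] / counts[key]

    truths = {obj: max(p, key=p.get) for obj, p in probs.items()}
    return truths, dict(trust)


# Toy data (invented): source A is reliable on books, source B on movies.
claims = [
    ("A", "b1", "x"), ("C", "b1", "x"), ("B", "b1", "y"),
    ("A", "b2", "x"), ("C", "b2", "x"), ("B", "b2", "y"),
    ("A", "b3", "x"), ("B", "b3", "y"),
    ("B", "m1", "p"), ("C", "m1", "p"), ("A", "m1", "q"),
    ("B", "m2", "p"), ("C", "m2", "p"), ("A", "m2", "q"),
    ("A", "m3", "q"), ("B", "m3", "p"),
]
categories = {"b1": "book", "b2": "book", "b3": "book",
              "m1": "movie", "m2": "movie", "m3": "movie"}
truths, trust = discover_truth(claims, categories)
```

In this toy data A and B are each right as often as the other overall, so a single global trust score per source could not break the ties on `b3` and `m3`. Tracking trust per category lets the sketch prefer A for books and B for movies.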

CLC number: