Truth Discovery Based on the Credibility of Data Categories on Data Sources

Ma Ruxia; Meng Xiaofeng

doi:10.7544/issn1000-1239.2015.20140684

Abstract

Abstract: The spread of the Internet and the growth of e-commerce have changed how people obtain information and how they consume. The Web has become an important source of information for most people. At the same time, information quality problems on the Internet have become increasingly prominent: the Web contains a large amount of outdated, incorrect, false, and one-sided information. In particular, different websites often provide conflicting information about the same object. Finding the correct information among these conflicting claims is an urgent problem, known as the truth discovery problem. A survey of existing truth discovery methods shows that none of them considers how differences in a data source's credibility across data categories affect truth discovery. We therefore pose the problem of truth discovery based on the credibility of data categories on data sources. We propose two methods for detecting differences in the credibility of data categories on a source, and use a Bayesian method to iteratively compute the per-category credibility of data sources and the accuracy of attribute values. In addition, we account for the effect of data source coverage and object difficulty on truth discovery to further improve the accuracy of the algorithm. Experiments on a real-world data set show that the proposed methods significantly improve the accuracy of truth discovery.
