
    MTruths: A Multi-Truth Discovery Method for Web Information

    MTruths:An Approach of Multiple Truths Finding from Web Information

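    • Abstract: The Web has become a vast ocean of information, scattered across different data sources. Different data sources often provide conflicting attribute values for the same object entity. Finding the true values among these conflicting values is called the truth discovery problem. By the number of values they take, object attributes can be divided into single-valued and multi-valued attributes; most existing truth discovery algorithms are effective mainly for single-valued attributes. To address truth discovery for multi-valued attributes, a multi-truth discovery method, MTruths, is proposed. It transforms the multi-truth discovery problem into an optimization problem whose objective is to maximize the weighted sum of similarities between each object's truths and the observed values provided by each data source. For solving the object truths, two methods are proposed to find the optimal truth list: an enumeration-based method and a greedy algorithm. Unlike existing methods, MTruths can directly obtain multiple truths for an object. Finally, experiments on two real data sets, books and movies, show that both implementations of MTruths are more accurate, and the greedy algorithm more efficient, than existing truth discovery methods.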

       

      Abstract: The Web has become a massive information repository in which information is scattered across different data sources. Different data sources commonly provide conflicting information for the same entity. Finding the truths among this conflicting information is known as the truth finding problem. According to the number of values they take, object attributes fall into two categories: single-valued attributes and multi-valued attributes. Most existing truth finding work is designed for single-valued attributes. In this paper, a method called MTruths is proposed to solve the truth finding problem for multi-valued attributes. We model the problem as an optimization problem whose objective is to maximize the weighted sum of similarities between the truths and the observations provided by the data sources. In the truth finding process, two methods are proposed to find the optimal solution: an enumeration algorithm and a greedy algorithm. Experiments on two real data sets (books and movies) show that both implementations of MTruths are more accurate than existing state-of-the-art techniques, and that the greedy algorithm is also more efficient.
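
      The optimization described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how MTruths measures similarity or how source weights are obtained, so the Jaccard similarity, the fixed per-source weights, and all function names below are assumptions made only for illustration. The sketch contrasts an enumeration over all candidate value sets with a greedy search that adds one value at a time while the objective improves.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two value sets (assumed similarity measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def objective(truths, observations, weights):
    """Weighted sum of similarities between a truth set and each source's observation."""
    return sum(weights[src] * jaccard(truths, obs) for src, obs in observations.items())


def enumerate_truths(observations, weights):
    """Exhaustive search over every non-empty subset of the candidate values."""
    candidates = sorted(set().union(*observations.values()))
    best = frozenset()
    best_score = objective(best, observations, weights)
    for k in range(1, len(candidates) + 1):
        for combo in combinations(candidates, k):
            score = objective(combo, observations, weights)
            if score > best_score:
                best, best_score = frozenset(combo), score
    return best, best_score


def greedy_truths(observations, weights):
    """Add the single value that most improves the objective until none helps."""
    candidates = set().union(*observations.values())
    chosen = set()
    score = objective(chosen, observations, weights)
    while True:
        best_value, best_score = None, score
        for value in sorted(candidates - chosen):
            trial = objective(chosen | {value}, observations, weights)
            if trial > best_score:
                best_value, best_score = value, trial
        if best_value is None:
            break
        chosen.add(best_value)
        score = best_score
    return frozenset(chosen), score


if __name__ == "__main__":
    # Hypothetical example: three sources list the authors of one book.
    observations = {"s1": {"A", "B"}, "s2": {"A", "B", "C"}, "s3": {"A"}}
    weights = {"s1": 0.9, "s2": 0.6, "s3": 0.3}  # assumed, fixed source weights
    print(enumerate_truths(observations, weights))
    print(greedy_truths(observations, weights))
```

      On this toy input both searches select the value set {A, B}. Enumeration examines every subset of the candidate values, so its cost grows exponentially with the number of candidates, whereas the greedy search evaluates at most a quadratic number of sets but is not guaranteed to reach the optimum in general.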

       
