高级检索

    一种大数据量的相似记录检测方法

    An Approach for Detecting Similar Duplicate Records of Massive Data

    • 摘要: 大数据量的相似重复记录检测是数据清洗中的一个重要问题,提出一种基于q-gram层次空间的聚类检测方法:它首先将数据映射成q-gram空间中的点,并根据q-gram空间中的相似性度量采用层次聚类方法将相似的重复记录检测出来.它克服了传统的“排序&合并”方法由于字符位置敏感不能将相似记录字符串排在邻近位置的不足和大数量外排序引起I/O代价过大的问题.理论分析和实验表明,方法不仅具有好的检测精度,且有好的伸缩性,能够有效地解决大数据量的相似重复记录检测.

       

      Abstract: Detecting similar duplicate records of massive data is of great importance in data cleaning. An efficient method for detecting similar duplicate records of massive data is presented: hierarchy spaces are put forward, which are constituted by a sequence of component space called q-gram space. First every record to be cleaned is mapped as a point in the corresponding component space. Then, taking advantage of the inherent hierarchy of the hierarchy spaces, all the similar duplicate records will be detected by hierarchical clustering. On one hand, this method overcomes the shortcoming that the similar duplicate records may fall far from each other during sorting phase by the traditional ‘sort & merge’ method. Thus they can't be found in the succeeding merge phase; On the other hand, it can greatly reduce the expensive disk I/O cost by avoiding external sorting. Both theory and experiment show that it is an effective approach to detect the similar duplicate records for massive data.

       

    /

    返回文章
    返回