An Approach for Detecting Similar Duplicate Records of Massive Data

Han Jingyu, Xu Lizhen, and Dong Yisheng

Abstract

Detecting similar duplicate records in massive data is an important problem in data cleaning. This paper presents a clustering-based detection method built on hierarchy spaces, which are constituted by a sequence of component spaces called q-gram spaces. First, every record to be cleaned is mapped to a point in the corresponding q-gram space. Then, using the similarity measure defined in q-gram space and the inherent hierarchy of the hierarchy spaces, the similar duplicate records are detected by hierarchical clustering. The method addresses two shortcomings of the traditional "sort & merge" approach. First, because sorting is sensitive to character position, similar records may fall far from each other during the sorting phase and so cannot be found in the subsequent merge phase. Second, externally sorting a large data set incurs an expensive disk I/O cost, which the proposed method avoids. Theoretical analysis and experiments show that the method achieves good detection accuracy and good scalability, and that it is an effective approach to detecting similar duplicate records in massive data.
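The position-sensitivity problem described in the abstract can be illustrated with a minimal sketch. It is not the authors' hierarchy-space construction or their clustering procedure; it only assumes that a record's coordinates in q-gram space are the counts of its length-q substrings and uses an L1 distance as a stand-in similarity measure. The function names `qgram_profile` and `qgram_distance`, the choice q = 2, and the sample records are all hypothetical.

```python
from collections import Counter


def qgram_profile(text: str, q: int = 2) -> Counter:
    """Count every length-q substring of text (assumed q-gram coordinates)."""
    s = text.lower()
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """L1 distance between two q-gram count vectors (an assumed measure)."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())


# Hypothetical records: the first two are the same name with fields swapped.
records = ["Dong Yisheng", "Yisheng Dong", "Xu Lizhen"]

# Lexicographic sorting separates the two similar records.
print(sorted(records))  # ['Dong Yisheng', 'Xu Lizhen', 'Yisheng Dong']

# In q-gram space the swapped pair is close and the unrelated pair is far.
print(qgram_distance("Dong Yisheng", "Yisheng Dong"))  # 2
print(qgram_distance("Dong Yisheng", "Xu Lizhen"))     # 15
```

In this sketch, sorting places an unrelated record between the two duplicates, while the q-gram distance still ranks them as near neighbours, which is the property a clustering step over q-gram space would rely on.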
