An Approach for Detecting Similar Duplicate Records of Massive Data

Han Jingyu, Xu Lizhen, and Dong Yisheng

Han Jingyu, Xu Lizhen, and Dong Yisheng. An Approach for Detecting Similar Duplicate Records of Massive Data[J]. Journal of Computer Research and Development, 2005, 42(12): 2206-2212.

Abstract

Detecting similar duplicate records in massive data is of great importance in data cleaning. An efficient method for this task is presented. It introduces hierarchy spaces, which consist of a sequence of component spaces called q-gram spaces. First, every record to be cleaned is mapped to a point in the corresponding component space. Then, taking advantage of the inherent hierarchy of these spaces, all similar duplicate records are detected by hierarchical clustering. On the one hand, this method overcomes a shortcoming of the traditional ‘sort & merge’ method, in which similar duplicate records may fall far from each other during the sorting phase and therefore cannot be found in the subsequent merge phase. On the other hand, it greatly reduces the expensive disk I/O cost by avoiding external sorting. Both theoretical analysis and experiments show that it is an effective approach to detecting similar duplicate records in massive data.
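The idea of mapping records into q-gram spaces and clustering them level by level can be illustrated with a small sketch. The abstract does not give the exact mapping, distance measure, or clustering procedure, so everything below is an assumption made for illustration: the function names, the L1 distance on q-gram counts, the greedy grouping that stands in for the hierarchical clustering, and the thresholds are all hypothetical and are not the authors' algorithm.

```python
from collections import Counter


def qgram_vector(record: str, q: int) -> Counter:
    """Map a record to a sparse point in q-gram space (counts of overlapping q-grams)."""
    text = record.lower()
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_distance(a: Counter, b: Counter) -> int:
    """L1 distance between two q-gram count vectors (an assumed measure)."""
    return sum(abs(a[g] - b[g]) for g in a.keys() | b.keys())


def cluster(records, q, threshold):
    """Greedy grouping: a record joins the first cluster whose
    representative (its first member) lies within `threshold`."""
    clusters = []  # list of (representative vector, member records)
    for rec in records:
        vec = qgram_vector(rec, q)
        for rep, members in clusters:
            if qgram_distance(rep, vec) <= threshold:
                members.append(rec)
                break
        else:
            clusters.append((vec, [rec]))
    return [members for _, members in clusters]


def detect_duplicates(records, levels=((1, 2), (2, 4))):
    """Refine groups from a coarse space (small q) to a finer one.

    `levels` is a hypothetical sequence of (q, threshold) pairs.
    Only groups with more than one record are reported.
    """
    groups = [list(records)]
    for q, threshold in levels:
        groups = [sub for g in groups for sub in cluster(g, q, threshold)]
    return [g for g in groups if len(g) > 1]


if __name__ == "__main__":
    print(detect_duplicates(["John Smith", "Jon Smith", "Mary Jones"]))
```

In this sketch no global sort is performed: each record is compared only with cluster representatives, so two near-duplicates are grouped regardless of where a sort key would have placed them. A real implementation for massive data would also need an I/O-aware layout, which the sketch does not attempt.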
