Abstract:
Detecting similar duplicate records in massive data sets is of great importance in data cleaning. This paper presents an efficient method for the task. It puts forward hierarchy spaces, which are constituted by a sequence of component spaces called q-gram spaces. First, every record to be cleaned is mapped to a point in the corresponding component space. Then, taking advantage of the inherent hierarchy of the hierarchy spaces, all similar duplicate records are detected by hierarchical clustering. On the one hand, this method overcomes a shortcoming of the traditional ‘sort & merge’ method, in which similar duplicate records may fall far from each other during the sorting phase and therefore cannot be found in the succeeding merge phase. On the other hand, it greatly reduces the expensive disk I/O cost by avoiding external sorting. Both theory and experiment show that it is an effective approach to detecting similar duplicate records in massive data.
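
To make the mapping step concrete, the following is a minimal sketch of how a record might be represented as a point in a q-gram space. The abstract does not define the space precisely, so the details here are assumptions: each record is taken to be a plain string, the point is a sparse vector of q-gram counts, and closeness is measured by L1 distance. The names `qgram_vector` and `qgram_distance` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def qgram_vector(record: str, q: int = 2) -> Counter:
    """Map a record to a sparse vector of q-gram counts (assumed representation)."""
    text = record.lower()
    # Slide a window of width q over the text and count each substring.
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_distance(a: Counter, b: Counter) -> int:
    """L1 distance between two sparse q-gram count vectors (assumed metric)."""
    # Counter returns 0 for missing keys, so absent q-grams count as zero.
    return sum(abs(a[g] - b[g]) for g in a.keys() | b.keys())


if __name__ == "__main__":
    # Two records whose fields are transposed sort far apart alphabetically,
    # yet share most of their q-grams.
    r1 = qgram_vector("John Smith")
    r2 = qgram_vector("Smith John")
    print(qgram_distance(r1, r2))
```

Under these assumptions, records such as "John Smith" and "Smith John" would be separated by a lexicographic sort but remain close in q-gram space, which is the situation the method is intended to handle.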