Han Jingyu, Xu Lizhen, and Dong Yisheng. An Approach for Detecting Similar Duplicate Records of Massive Data[J]. Journal of Computer Research and Development, 2005, 42(12): 2206-2212.

An Approach for Detecting Similar Duplicate Records of Massive Data

  • Published Date: December 14, 2005
  • Abstract: Detecting similar duplicate records in massive data sets is an important step in data cleaning. This paper presents an efficient detection method based on hierarchy spaces, each of which is made up of a sequence of component spaces called q-gram spaces. First, every record to be cleaned is mapped to a point in the corresponding component space. Then, exploiting the inherent hierarchy of the hierarchy spaces, all similar duplicate records are detected by hierarchical clustering. The method addresses two weaknesses of the traditional 'sort & merge' approach. First, similar duplicate records may land far apart during the sorting phase and therefore cannot be found in the subsequent merge phase; the proposed method overcomes this shortcoming. Second, by avoiding external sorting, it greatly reduces the expensive disk I/O cost. Both theoretical analysis and experiments show that the approach is effective for detecting similar duplicate records in massive data.

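The abstract's idea of mapping records to points in a q-gram space and then grouping nearby points can be illustrated with a small sketch. The abstract does not give the paper's construction of hierarchy spaces or its clustering procedure, so everything here is an assumption made for illustration: the function names (`qgram_profile`, `qgram_distance`, `cluster_duplicates`), the choice of q = 2, the L1 distance between q-gram count vectors, and the threshold-based single-link grouping. It also compares every pair of records, which would not scale to massive data and is not the method the paper proposes.

```python
from collections import Counter
from itertools import combinations


def qgram_profile(text: str, q: int = 2) -> Counter:
    """Count the overlapping q-grams of a lowercased, stripped string."""
    s = text.lower().strip()
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """L1 distance between the q-gram count vectors of two strings."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    # Counter returns 0 for q-grams that are absent from one profile.
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())


def cluster_duplicates(records, threshold, q=2):
    """Illustrative single-link grouping (an assumption, not the paper's
    algorithm): records within `threshold` of each other share a cluster."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force pairwise comparison, quadratic in the number of records.
    for i, j in combinations(range(len(records)), 2):
        if qgram_distance(records[i], records[j], q) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)
    return list(groups.values())


if __name__ == "__main__":
    names = ["John Smith", "Jon Smith", "Mary Jones"]
    print(cluster_duplicates(names, threshold=4))
```

With these hypothetical inputs, the two spellings of the same name differ in only three 2-grams and end up in one group, while the unrelated name stays separate. Unlike a sort-based approach, the grouping does not depend on where the records would fall in a sorted order.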