Efficient Duplicate Detection Approach for High Dimensional Big Data

Zhu Weiheng, Yin Jian, Deng Yuhui, Long Shun, Qiu Shiding

Citation: Zhu Weiheng, Yin Jian, Deng Yuhui, Long Shun, Qiu Shiding. Efficient Duplicate Detection Approach for High Dimensional Big Data[J]. Journal of Computer Research and Development, 2016, 53(3): 559-570. DOI: 10.7544/issn1000-1239.2016.20148218. CSTR: 32373.14.issn1000-1239.2016.20148218

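Funding: National Natural Science Foundation of China (61472453, 61272073, 61401177, 61572232, U1401256, U1501252); Natural Science Foundation of Guangdong Province (S2013020012865); Science and Technology Planning Project of Guangdong Province (2013B010401017)
Details
  • CLC number: TP391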

Efficient Duplicate Detection Approach for High Dimensional Big Data

  • Abstract: In the big data era, massive heterogeneous data from multiple sources are becoming mainstream in many applications. Multiple sources and heterogeneous structures make data duplication inevitable, and the sheer volume of data places very high demands on the efficiency of duplicate detection. Traditional approaches do not handle high-dimensional data well in big data scenarios. This paper analyses the deficiencies of traditional SNM (sorted neighborhood method) approaches and generalizes duplicate detection as a special kind of clustering problem. An efficient index is first built with an R-tree, a B-tree-like structure for multi-dimensional space. The proposed algorithm uses the characteristics of clusters to reduce the number of comparisons needed inside R-tree leaves, and it outperforms existing duplicate detection approaches such as SNM, DCS, and DCS++. Furthermore, based on the Apriori property of duplicate detection, a new algorithm is developed that generates duplicate candidates in parallel on projections of the original dataset and then uses them to reduce the search space for the high-dimensional data. Experimental results show that this parallel approach effectively improves the efficiency of duplicate detection on high-dimensional data, which suggests that it is well suited to duplicate detection for high-dimensional big data.
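
The Apriori-style pruning mentioned in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: it assumes that two records count as duplicates when their Euclidean distance is at most a threshold eps (the abstract does not state the similarity measure), and it finds candidates in each projection by brute force where the paper uses an R-tree index. The names candidate_pairs, detect_duplicates, dim_groups, and eps are hypothetical. Under that assumption, a pair within eps in the full space is also within eps in every projection, so only pairs that survive in all projections need a full-dimensional comparison.

```python
from itertools import combinations
from math import dist


def candidate_pairs(points, dims, eps):
    """Return pairs (i, j), i < j, whose projection onto `dims` is within eps.

    Brute force for clarity; an index such as an R-tree would replace this scan.
    """
    proj = [tuple(p[d] for d in dims) for p in points]
    return {
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if dist(proj[i], proj[j]) <= eps
    }


def detect_duplicates(points, dim_groups, eps):
    """Find duplicate pairs, assuming 'duplicate' means Euclidean distance <= eps."""
    # Each projection is independent, so these calls could run in parallel;
    # they are run serially here to keep the sketch short.
    per_projection = [candidate_pairs(points, dims, eps) for dims in dim_groups]
    # Apriori-style pruning: a true duplicate pair must appear in every projection.
    candidates = set.intersection(*per_projection)
    # Verify the surviving candidates in the full-dimensional space.
    return sorted(
        (i, j) for i, j in candidates if dist(points[i], points[j]) <= eps
    )


points = [
    (0.0, 0.0, 0.0, 0.0),
    (0.1, 0.0, 0.1, 0.0),
    (0.0, 0.1, 5.0, 5.0),  # close to the first two only in dimensions 0 and 1
    (9.0, 9.0, 9.0, 9.0),
]
duplicates = detect_duplicates(points, dim_groups=[(0, 1), (2, 3)], eps=0.5)
print(duplicates)
```

In this toy input the third record looks like a duplicate of the first two in the projection onto dimensions 0 and 1, but it is eliminated by the projection onto dimensions 2 and 3, so only one pair reaches the full-dimensional check.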
Publication history
  • Published: 2016-02-29