ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2016, Vol. 53 ›› Issue (3): 559-570. doi: 10.7544/issn1000-1239.2016.20148218


Efficient Duplicate Detection Approach for High Dimensional Big Data

Zhu Weiheng1, Yin Jian2, Deng Yuhui1, Long Shun1, Qiu Shiding1

  1. College of Information Science and Technology, Jinan University, Guangzhou 510632; 2. School of Information Science and Technology, Sun Yat-sen University, Guangzhou 510006
  • Online: 2016-03-01

Abstract: In the big data era, huge quantities of heterogeneous data from multiple sources are widely used in various domains. Because the data come from multiple sources and have varying structures, duplication is inevitable, and the sheer volume of data creates a growing demand for efficient duplicate detection algorithms. Traditional approaches have difficulty handling high-dimensional data in big data scenarios. This paper analyses the deficiencies of the traditional sorted neighbourhood method (SNM) and proposes a novel clustering-based approach. An efficient index is first built using an R-tree, a variant of the B-tree for multi-dimensional space. The proposed algorithm reduces the number of comparisons needed by exploiting the characteristics of clusters, and it outperforms existing duplicate detection approaches such as SNM, DCS, and DCS++. Furthermore, based on the apriori property of duplicate detection, we develop a new algorithm that generates duplicate candidates in parallel on projections of the original dataset and then uses them to reduce the search space for the high-dimensional data. Experimental results show that this parallel approach works efficiently on high-dimensional data. This significant performance improvement suggests that the approach is well suited to duplicate detection for high-dimensional big data.
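The projection-based pruning described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes that two records count as duplicates when their Euclidean distance is at most a threshold `eps`, in which case any pair that is close in the full space is also close in every projection onto a subset of dimensions. The function names, the choice of projections, and the simple sort-and-sweep candidate search (used here in place of the R-tree index) are all hypothetical.

```python
from math import dist


def candidates_in_projection(records, dims, eps):
    """Return index pairs whose projections onto `dims` lie within eps.

    Hypothetical helper: a sort-and-sweep on the first projected
    coordinate stands in for a real spatial index.
    """
    proj = [tuple(r[d] for d in dims) for r in records]
    order = sorted(range(len(records)), key=lambda i: proj[i][0])
    pairs = set()
    for a in range(len(order)):
        i = order[a]
        for b in range(a + 1, len(order)):
            j = order[b]
            if proj[j][0] - proj[i][0] > eps:
                break  # later records are even farther along this axis
            if dist(proj[i], proj[j]) <= eps:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def detect_duplicates(records, projections, eps):
    """Intersect per-projection candidates, then verify in full space.

    Assumption: duplicates are pairs with Euclidean distance <= eps.
    Each projection is independent, so the candidate sets could be
    computed in parallel; they are computed sequentially here.
    """
    candidate_sets = [
        candidates_in_projection(records, dims, eps) for dims in projections
    ]
    candidates = set.intersection(*candidate_sets)
    return sorted(
        (i, j) for i, j in candidates if dist(records[i], records[j]) <= eps
    )


if __name__ == "__main__":
    data = [
        (0.0, 0.0, 0.0, 0.0),
        (0.1, 0.0, 0.0, 0.1),
        (5.0, 5.0, 5.0, 5.0),
        (0.0, 0.0, 3.0, 3.0),
    ]
    print(detect_duplicates(data, projections=[(0, 1), (2, 3)], eps=0.5))
```

Because a pair that is far apart in any one projection cannot be a duplicate under this assumption, only the pairs that survive every projection need the full-dimensional comparison.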

Key words: big data, high-dimensional data, data mining, data preprocessing, duplicate detection

CLC Number: