ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2016, Vol. 53 ›› Issue (12): 2847-2857.doi: 10.7544/issn1000-1239.2016.20150794

Previous Articles     Next Articles

Novel MapReduce-Based Similarity Self-Join Method: Filter and In-Circle Algorithm

Bao Guanghui1, Zhang Zhaogong1, Li Jianzhong1,2, Xuan Ping1   

  1. 1(School of Computer Science and Technology, Heilongjiang University, Harbin 150080); 2(School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001)
  • Online:2016-12-01

Abstract: Similarity self-join is a very important study in many applications. For the massive data sets, MapReduce can provide an effective distributed computing framework, in particular, similarity self-join can be applied on the framework. There are still problems, such as fine partition method, are applied to cluster data area for load balancing, but it is not easy to implement. Existing algorithms cant effectively accomplish similarity self-join operations for the massive data sets. In this paper, we propose two novel algorithms of similarity self-join on the MapReduce framework, and use coordinate-filtering techniques to get the valid candidate sets and use the in-circle method on the hexagon-based partition area. Those coordinate-filtering techniques are based on equal-width grid partition, and adopt the restriction that two points have more distances than two projective points in the same axis, and can drop obviously some candidate set. We also proof that the hexagon-based partition is the best form in all normal partition. Our experimental results demonstrate that the novel method has an advantage over the other join algorithms for cluster data area which improves efficiency over 80%. The algorithm can effectively solve the problem of the similarity self-join for the massive data in cluster data area.

Key words: massive data sets, filter, similarity self-join, data partition, Hadoop platform, MapReduce programming model

CLC Number: