Abstract:
Similarity self-join is a very important study in many applications. For the massive data sets, MapReduce can provide an effective distributed computing framework, in particular, similarity self-join can be applied on the framework. There are still problems, such as fine partition method, are applied to cluster data area for load balancing, but it is not easy to implement. Existing algorithms cant effectively accomplish similarity self-join operations for the massive data sets. In this paper, we propose two novel algorithms of similarity self-join on the MapReduce framework, and use coordinate-filtering techniques to get the valid candidate sets and use the in-circle method on the hexagon-based partition area. Those coordinate-filtering techniques are based on equal-width grid partition, and adopt the restriction that two points have more distances than two projective points in the same axis, and can drop obviously some candidate set. We also proof that the hexagon-based partition is the best form in all normal partition. Our experimental results demonstrate that the novel method has an advantage over the other join algorithms for cluster data area which improves efficiency over 80%. The algorithm can effectively solve the problem of the similarity self-join for the massive data in cluster data area.