Novel MapReduce-Based Similarity Self-Join Method: Filter and In-Circle Algorithm

Bao Guanghui; Zhang Zhaogong; Li Jianzhong; Xuan Ping

doi:10.7544/issn1000-1239.2016.20150794

Bao Guanghui, Zhang Zhaogong, Li Jianzhong, Xuan Ping. Novel MapReduce-Based Similarity Self-Join Method: Filter and In-Circle Algorithm[J]. Journal of Computer Research and Development, 2016, 53(12): 2847-2857. DOI: 10.7544/issn1000-1239.2016.20150794

Abstract

Similarity self-join is an important operation in many applications. For massive data sets, MapReduce provides an effective distributed computing framework on which similarity self-join can be implemented. Problems remain, however: fine-grained partitioning methods can be applied to clustered data regions for load balancing, but they are not easy to implement, and existing algorithms cannot perform similarity self-join efficiently on massive data sets. In this paper, we propose two novel similarity self-join algorithms on the MapReduce framework, which use coordinate-filtering techniques to obtain valid candidate sets and an in-circle method on hexagon-based partitions. The coordinate-filtering techniques are based on an equal-width grid partition and exploit the fact that the distance between two points is never smaller than the distance between their projections onto the same axis, so many candidate pairs can be discarded early. We also prove that the hexagon-based partition is the best among all regular partitions. Experimental results show that the proposed method outperforms other join algorithms on clustered data regions, improving efficiency by more than 80%. The algorithm effectively solves the similarity self-join problem for massive data in clustered data regions.
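The coordinate-filtering idea in the abstract can be illustrated with a minimal single-machine sketch in Python. It is not the authors' MapReduce algorithm. The function name `similarity_self_join`, the threshold name `eps`, the choice of cell width equal to `eps`, and the restriction to 2-D Euclidean distance are all assumptions made for illustration. The sketch assigns points to an equal-width grid, compares only points in the same or adjacent cells, and drops a pair as soon as its gap along either axis exceeds the threshold, since the true distance can only be larger.

```python
from collections import defaultdict
from itertools import product
from math import floor, hypot


def similarity_self_join(points, eps):
    """Return sorted index pairs (i, j), i < j, with Euclidean distance <= eps.

    Illustrative sketch only: `points` is a list of (x, y) tuples and the
    grid cell width is assumed to be `eps`.
    """
    # Equal-width grid partition: with cells of width eps, two points within
    # distance eps always fall in the same cell or in adjacent cells.
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(floor(x / eps), floor(y / eps))].append(idx)

    result = []
    for (cx, cy), members in grid.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            neighbours = grid.get((cx + dx, cy + dy), ())
            for i in members:
                xi, yi = points[i]
                for j in neighbours:
                    if i >= j:
                        continue  # report each unordered pair once
                    xj, yj = points[j]
                    # Coordinate filter: the gap between projections on one
                    # axis is a lower bound on the true distance.
                    if abs(xi - xj) > eps or abs(yi - yj) > eps:
                        continue
                    # Exact check only for pairs that survive the filter.
                    if hypot(xi - xj, yi - yj) <= eps:
                        result.append((i, j))
    return sorted(result)
```

In a MapReduce setting, the grid cell key would presumably serve as the map output key so that each reducer handles one cell and its neighbours, but that mapping is an assumption here and is not taken from the abstract.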
