Abstract:
Given a collection of sets, the set similarity self-join (SSSJ) algorithm finds all pairs of sets whose similarity is higher than a given threshold, which has been widely used in various applications. The SSSJ adopting the filter-and-verification framework and the parallel and distributed computing framework MapReduce is an active research topic. But existing algorithms produce large candidate sets when the given threshold is low and lead to poor performance. To address this problem, we propose to compute the SSSJ with frequent pattern tree structures and their derivatives FP-tree*, which are used to compress data in memory for computation, targeting at reducing the size of candidate sets. First, based on an investigation on existing FP-tree* and the characteristics of SSSJ computation, we propose a more traversal efficient linear prefix tree structure, i.e., TELP-tree, and an original SSSJ algorithm accordingly, i.e., TELP-SJ, which includes a two-phase filtering strategy: tree-construction oriented filtering algorithm and tree-traversal oriented filtering strategy. These algorithms will reduce the size of TELP-trees and tree-traversals. Then, we design the parallel and distributed SSSJ computation algorithm FastTELP-SJ with MapReduce. Finally, 4 groups of empirical comparison studies have been carried out on 4 sets of real application datasets. The experimental results demonstrate that FastTELP-SJ achieves better performance in term of the execution time, memory usage, disk usage and scalability for similarity joins over large scale collections of sets with high dimension.