高级检索

    基于FP-tree和MapReduce的集合相似度自连接算法

    A Set Similarity Self-Join Algorithm with FP-tree and MapReduce

    • 摘要: 利用集合相似度自连接算法找出一个集合集中所有相似度大于给定阈值的集合对有着广泛的应用. 基于过滤-验证框架和并行分布式计算框架MapReduce的集合相似度连接是近年来的研究热点. 但现有算法在阈值低时产生较大规模的候选集,导致性能不理想. 针对这一问题,提出采用频繁模式树FP-tree及其派生结构FP-tree*将数据压缩在内存中计算集合相似度自连接以减小候选集规模. 首先设计并讨论基于现有FP-tree*的集合相似度连接计算及其优缺点,提出遍历效率更高的线性频繁模式树结构模型TELP-tree及基于它的算法TELP-SJ(TELP-tree self join),其包括分别面向构建树和遍历树的2阶段过滤算法,这些算法可以减小树规模和减少树遍历. 然后,设计基于MapReduce的并行分布式算法FastTELP-SJ. 最后,基于4组真实应用数据集进行3组性能比较实验. 实验结果表明FastTELP-SJ算法面向高维大规模集合相似度自连接计算时,包括执行时间、内存占用率、磁盘使用量和可扩展性的运行效率最好.

       

      Abstract: Given a collection of sets, the set similarity self-join (SSSJ) algorithm finds all pairs of sets whose similarity is higher than a given threshold, which has been widely used in various applications. The SSSJ adopting the filter-and-verification framework and the parallel and distributed computing framework MapReduce is an active research topic. But existing algorithms produce large candidate sets when the given threshold is low and lead to poor performance. To address this problem, we propose to compute the SSSJ with frequent pattern tree structures and their derivatives FP-tree*, which are used to compress data in memory for computation, targeting at reducing the size of candidate sets. First, based on an investigation on existing FP-tree* and the characteristics of SSSJ computation, we propose a more traversal efficient linear prefix tree structure, i.e., TELP-tree, and an original SSSJ algorithm accordingly, i.e., TELP-SJ, which includes a two-phase filtering strategy: tree-construction oriented filtering algorithm and tree-traversal oriented filtering strategy. These algorithms will reduce the size of TELP-trees and tree-traversals. Then, we design the parallel and distributed SSSJ computation algorithm FastTELP-SJ with MapReduce. Finally, 4 groups of empirical comparison studies have been carried out on 4 sets of real application datasets. The experimental results demonstrate that FastTELP-SJ achieves better performance in term of the execution time, memory usage, disk usage and scalability for similarity joins over large scale collections of sets with high dimension.

       

    /

    返回文章
    返回