高级检索

    基于划分的集合相似连接

    Partition-Based Set Similarity Join

    • 摘要: 集合相似连接(set similarity join)是指在给定的数据集中,按照基于集合间覆盖关系的相似度计算方法来衡量数据之间的相似度、并找出所有相似度不小于给定阈值的数据对的操作.集合相似连接作为一种新的基本操作在很多领域中有重要应用.随着社会网络、移动应用以及在线服务的发展,使得数据收集的效率和规模得到了很大的提高,同时给相似连接操作带来新的挑战.根据集合相似的必要条件,提出了相似集合之间的差异度.利用差异度和鸽巢原理,提出了一种新颖的基于数据划分的集合相似连接计算方法,该方法对集合进行自适应的均衡划分,并利用基于划分块的过滤方法来提高过滤的效率.为了进一步提高过滤的效果和相似连接的效率,利用划分块的位置信息提出了增强的过滤方法.针对提出的方法,在不同的环境下进行了实验,实验结果表明,该方法与已有的方法相比可以有效地提高相似连接的效率.

       

      Abstract: Set similarity join is a primitive operation applied in a variety of applications. Its goal is to find all record pairs whose similarity is not below the given threshold in a dataset, based on set similarity constraints. It has become a popular topic and attracted much attention from database community. As the recent proliferation of social networks, mobile applications and online services increase the rate of data gathering, which brings new challenges to set similarity join. In this paper, we propose a brand-new partition based set similarity join method. In this method, each set is partitioned adaptively into even partitions based on pigeon-hole principle. The partitions are used to filter out more false positives efficiently. In order to filter out more false positives and improve the efficiency, the enhanced method is proposed applying the position of the partition. The extensive experiments are carried out on two real datasets with different settings. The results demonstrate that the set similarity join applying our partition based filtering method is more efficient than state-of-the-art methods.

       

    /

    返回文章
    返回