Partition-Based Set Similarity Join
-
Abstract
Set similarity join is a primitive operation used in a variety of applications. Given a dataset, a set similarity function, and a threshold, its goal is to find all record pairs whose similarity is not below the threshold. The problem has attracted much attention from the database community. The recent proliferation of social networks, mobile applications, and online services has increased the rate of data gathering, which brings new challenges to set similarity join. In this paper, we propose a new partition-based set similarity join method. In this method, each set is adaptively divided into even partitions based on the pigeonhole principle, and the partitions are used to filter out false positives efficiently. To filter out more false positives and further improve efficiency, we also propose an enhanced method that exploits the position of each partition. Extensive experiments are carried out on two real datasets under different settings. The results demonstrate that set similarity join using our partition-based filtering method is more efficient than state-of-the-art methods.
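To make the pigeonhole argument concrete, the following is a minimal sketch of a generic partition-based filter. It is not the adaptive partitioning or the position-based enhancement proposed in this paper. It assumes Jaccard similarity, integer tokens, and a fixed partitioning of tokens by `token % m`; the function names are hypothetical. The idea is that Jaccard(r, s) >= t implies |r ^ s| <= (1 - t) / (1 + t) * (|r| + |s|), so if that bound is H and tokens are split into H + 1 disjoint partitions, at least one partition of r and s must be identical.

```python
from collections import defaultdict
from fractions import Fraction
from math import floor


def jaccard(r, s):
    """Exact Jaccard similarity of two non-empty sets."""
    return Fraction(len(r & s), len(r | s))


def max_symmetric_difference(size_r, size_s, threshold):
    """Largest |r ^ s| compatible with Jaccard(r, s) >= threshold."""
    return floor((1 - threshold) / (1 + threshold) * (size_r + size_s))


def partition_join(records, threshold):
    """Return (similar pairs, number of candidate pairs verified)."""
    # Exact arithmetic avoids rounding the bound down by mistake.
    threshold = Fraction(str(threshold))
    largest = max(len(r) for r in records)
    # Assumption: one global partition count, sized for the largest sets.
    # H differing tokens can touch at most H of the H + 1 partitions.
    m = max_symmetric_difference(largest, largest, threshold) + 1

    index = defaultdict(list)  # (partition id, partition content) -> record ids
    candidates = set()
    for rid, record in enumerate(records):
        parts = [set() for _ in range(m)]
        for token in record:
            parts[token % m].add(token)
        for pid, part in enumerate(parts):
            key = (pid, frozenset(part))
            for other in index[key]:
                candidates.add((other, rid))  # shares an identical partition
            index[key].append(rid)

    # Verification step: only candidate pairs are compared exactly.
    results = [(a, b) for a, b in sorted(candidates)
               if jaccard(records[a], records[b]) >= threshold]
    return results, len(candidates)


records = [
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 11},
    {21, 22, 23, 24, 25, 26, 27, 28, 29, 30},
    {1, 2, 3, 4, 5, 6, 7, 8, 12, 13},
]
pairs, verified = partition_join(records, 0.8)
print(pairs, verified)  # [(0, 1)] 2
```

In this toy run only 2 of the 6 possible pairs reach verification, and no similar pair is lost. A fixed global partition count is a simplification chosen for brevity; it filters poorly when set sizes vary widely.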
-
-