A Set Similarity Self-Join Algorithm with FP-tree and MapReduce

Feng Yuhong; Wu Kunhan; Huang Zhihong; Feng Yangzhou; Chen Huanhuan; Bai Jiancong; Ming Zhong

doi:10.7544/issn1000-1239.202111239

Journal of Computer Research and Development > 2023 > 60(12): 2890-2906. > DOI: 10.7544/issn1000-1239.202111239 CSTR: 32373.14.issn1000-1239.202111239

Feng Yuhong, Wu Kunhan, Huang Zhihong, Feng Yangzhou, Chen Huanhuan, Bai Jiancong, Ming Zhong. A Set Similarity Self-Join Algorithm with FP-tree and MapReduce[J]. Journal of Computer Research and Development, 2023, 60(12): 2890-2906. DOI: 10.7544/issn1000-1239.202111239

Citation:

PDF (2853 KB)

A Set Similarity Self-Join Algorithm with FP-tree and MapReduce

1.
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060
2.
School of Computer Science and Technology , University of Science and Technology of China, Hefei 230027

Funds: This work was supported by the National Natural Science Foundation of China (62272315, 61836005, 62002230), the Shenzhen Fundamental Research General Program ( JCYJ20210324093212034), the Natural Science Foundation of Guangdong province (2019A1515012053) , Tencent “Rhinoceros Birds” - Scientific Research Foundation for Young Teachers of Shenzhen University, and the Shenzhen Stable Support Program (20200814105901001).

More Information

Author Bio:
Feng Yuhong: born in 1975. PhD, associate professor. Member of CCF. Her main research interests include parallel and distributed computing, middleware, system software, and data analysis

Wu Kunhan: born in 1997. Master. His main research interests include parallel and distributed computing, and data analysis

Huang Zhihong: born in 1994. Master. His main research interests include parallel and distributed computing, and data analysis

Feng Yangzhou: born in 1998. Bachelor. His main research interests include parallel and distributed computing, and data analysis

Chen Huanhuan: born in 1982. PhD, professor, PhD supervisor. Member of CCF. His main research interests include machine learning, data mining, computational intelligence, underground pipeline network detection and data fusion, and evolutionary computing

Bai Jiancong: born in 1977. PhD, lecturer. Member of CCF. His main research interests include intelligent optimization algorithm, big data, and e-commerce

Ming Zhong: born in 1967. PhD, professor, PhD supervisor. Senior member of CCF. His main research interests include software engineering, ontology, Internet of things, and workflow
Received Date: December 14, 2021
Revised Date: January 31, 2023
Available Online: September 19, 2023

Graphical Abstract

Abstract

Abstract

Given a collection of sets, the set similarity self-join (SSSJ) algorithm finds all pairs of sets whose similarity is higher than a given threshold, which has been widely used in various applications. The SSSJ adopting the filter-and-verification framework and the parallel and distributed computing framework MapReduce is an active research topic. But existing algorithms produce large candidate sets when the given threshold is low and lead to poor performance. To address this problem, we propose to compute the SSSJ with frequent pattern tree structures and their derivatives FP-tree*, which are used to compress data in memory for computation, targeting at reducing the size of candidate sets. First, based on an investigation on existing FP-tree* and the characteristics of SSSJ computation, we propose a more traversal efficient linear prefix tree structure, i.e., TELP-tree, and an original SSSJ algorithm accordingly, i.e., TELP-SJ, which includes a two-phase filtering strategy: tree-construction oriented filtering algorithm and tree-traversal oriented filtering strategy. These algorithms will reduce the size of TELP-trees and tree-traversals. Then, we design the parallel and distributed SSSJ computation algorithm FastTELP-SJ with MapReduce. Finally, 4 groups of empirical comparison studies have been carried out on 4 sets of real application datasets. The experimental results demonstrate that FastTELP-SJ achieves better performance in term of the execution time, memory usage, disk usage and scalability for similarity joins over large scale collections of sets with high dimension.
- similarity join,
- FP-tree,
- MapReduce framework,
- Jaccard function,
- set

FullText(HTML)

References (42)

References

[1]	Das A S, Datar M, Garg A, et al. Google news personalization: Scalable online collaborative filtering [C] // Proc of the 16th Int Conf on World Wide Web. New York: ACM, 2007: 271–280
[2]	Xiao Chuan, Wang Wei, Lin Xuemin, et al. Efficient similarity joins for near-duplicate detection[J]. ACM Transactions on Database Systems, 2011, 36(3): 1−41
[3]	Dong Wei, Moses C, Li Kai. Efficient k-nearest neighbor graph construction for generic similarity measures [C] //Proc of the 20th Int Conf on World Wide Web. New York: ACM, 2011: 577–586
[4]	Statista Research Department. Facebook MAU worldwide 2021 statista [EB/OL]. [2021-11-16].https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/
[5]	Sohrabi M K, Azgomi H. Parallel set similarity join on big data based on locality-sensitive hashing[J]. Science of Computer Programming, 2017, 145(12): 1−12
[6]	Rashtchian C, Sharma A, Woodruff D. LSF-Join: Locality sensitive filtering for distributed all-pairs set similarity under skew [C] //Proc of the 29th Web Conf. New York: ACM, 2020: 2998−3004
[7]	Christiani T, Pagh R, Sivertsen J. Scalable and robust set similarity join [C] //Proc of the 34th IEEE Int Conf on Data Engineering. Piscataway, NJ: IEEE, 2018: 1240–1243
[8]	Bayardo R J, Ma Yiming, Srikant R. Scaling up all pairs similarity search [C] //Proc of the 16th Int Conf on World Wide Web. New York: ACM, 2007: 131–140
[9]	Ribeiro L A, Härder T. Generalizing prefix filtering to improve set similarity joins[J]. Information Systems, 2011, 36(1): 62−78 doi: 10.1016/j.is.2010.07.003
[10]	Wang Jiannan, Li Guoliang, Feng Jiahua. Can we beat the prefix filtering? An adaptive framework for similarity join and search [C] //Proc of the 2012 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2012: 85–96
[11]	Bouros P, Ge S, Mamoulis N. Spatio-textual similarity joins[J]. Proceedings of the VLDB Endowment, 2012, 6(1): 1−12 doi: 10.14778/2428536.2428537
[12]	Gravano L, Ipeirotis P G, Jagadish H V, et al. Approximate string joins in a database (almost) for free [C] //Proc of the 27th Int Conf on Very Large Data Bases. Berlin: Springer, 2001: 491–500
[13]	Sandes E F, Teodoro G L, Melo A C. Bitmap filter: Speeding up exact set similarity joins with bitwise operations[J]. Information Systems, 2020, 88(2): 101449−1014463
[14]	Bellas C, Gounaris A. An empirical evaluation of exact set similarity join techniques using GPUs[J]. Information Systems, 2020, 89(3): 101485−101503
[15]	Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1): 107−113 doi: 10.1145/1327452.1327492
[16]	Vernica R, Carey M J, Li Chen. Efficient parallel set-similarity joins using MapReduce [C] //Proc of the 2010 ACM SIGMOD Int Conf on Management of Data. New York: ACM: 2010: 495–506
[17]	Rong Chutian, Lu Wei, Wang Xiaoli, et al. Efficient and scalable processing of string similarity join[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 25(10): 2217−2230
[18]	Baraglia R, Morales G D F, Lucchese C. Document similarity self-join with MapReduce [C] //Proc of the 26th IEEE Int Conf on Data Mining. Piscataway, NJ: IEEE, 2010: 731–736
[19]	Rong Chuntian, Lin Chunbin, Silva Y N, et al. Fast and scalable distributed set similarity joins for big data analytics [C] //Proc of the 33rd IEEE Int Conf on Data Engineering. Piscataway, NJ: IEEE, 2017: 1059–1070
[20]	Fier F, Augsten N, Bouros P, et al. Set similarity joins on MapReduce: An experimental survey[J]. Proceedings of the VLDB Endowme, 2018, 11(10): 1110−1122 doi: 10.14778/3231751.3231760
[21]	Pacífico L, Ribeiro L A. Streaming set similarity joins [C] //Proc of the 15th Int Conf on Enterprise Information Systems. Berlin: Springer, 2020: 24−42
[22]	Yang Chengcheng, Chen Lisi, Wang Hao, et al. Dynamic set similarity join: An update log based approach [J/OL]. IEEE Transactions on Knowledge and Data Engineering, 2021[2023-01-18].https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9609684
[23]	Yang Jianye, Zhang Wenjie, Wang Xiang, et al. Distributed streaming set similarity join [C] //Proc of the 36th IEEE Int Conf on Data Engineering (ICDE). Piscataway, NJ: IEEE, 2020: 565−576
[24]	Shahvarani A, Jacobsen H A. Distributed stream KNN join [C] //Proc of the 2021 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2021: 1597−1609
[25]	Han Jiawei, Pei Jian, Yin Yiwen. Mining frequent patterns without candidate generation[J]. ACM SIGMOD Record, 2000, 29(2): 1−12 doi: 10.1145/335191.335372
[26]	Nancy P, Ramani R G. Frequent pattern mining in social network data (facebook application data)[J]. European Journal of Scientific Research, 2012, 79(4): 531−540
[27]	Feng Yuhong, Wang Junpeng, Zhang Zhiqiang, et al. The edge weight computation with MapReduce for extracting weighted graphs[J]. IEEE Transactions on Parallel and Distributed Systems, 2016, 27(12): 3659−3672 doi: 10.1109/TPDS.2016.2536024
[28]	Deng Zhihong, Wang ZhongHui, Jiang Jiajian. A new algorithm for fast mining frequent itemsets using N-lists[J]. Science China Information Sciences, 2012, 55(9): 2008−2030 doi: 10.1007/s11432-012-4638-z
[29]	Pyun G, Yun U, Ryu K H. Efficient frequent pattern mining based on linear prefix tree[J]. Knowledge-Based Systems, 2014, 55(1): 125−139
[30]	Feng Yuhong, Guo Meihong, Lu Kezhong, et al. Optimize the FP-Tree based graph edge weight computation on multi-core MapReduce clusters [C] // Proc of the 23rd IEEE Int Conf on Parallel and Distributed Systems. Piscataway, NJ: IEEE, 2017: 400–409
[31]	The Apache Software Foundation. Apache Hadoop [EB/OL]. [2022-11-05].https://hadoop.apache.org
[32]	The Apache Software Foundation. Apache Spark [EB/OL]. [2022-11-05].https://spark.apache.org
[33]	Ranger C, Raghuraman R, Penmetsa A, et al. Evaluating MapReduce for multi-core and multiprocessor systems [C] //Proc of the 13th IEEE Int Symp on High Performance Computer Architecture. Piscataway, NJ: IEEE, 2007: 13–24
[34]	He Bingsheng, Fang Wenbin, Luo Qiong, et al. Mars: A MapReduce framework on graphics processors [C] // Proc of the 17th Int Conf on Parallel Architectures and Compilation Techniques. New York: ACM, 2008: 260–269
[35]	Nazir A, Raza S, Gupta D, et al. Network level footprints of facebook applications [C] //Proc of the 9th ACM SIGCOMM Conf on Internet Measurement. New York: ACM, 2009: 63–75
[36]	Bart G. Frequent itemset mining implementations repository [EB/OL]. [2022-11-05].https://fimi/uantwerpen.be/data
[37]	William W. Enron email dataset [EB/OL]. [2022−11-05].https://www.cs.cmu.edu/~enron
[38]	Malik A J, Khan F A. A hybrid technique using binary particle swarm optimization and decision tree pruning for network intrusion detection[J]. Cluster Computing, 2018, 21(1): 667−680 doi: 10.1007/s10586-017-0971-8
[39]	Lu Lu, Shi Xuanhua, Zhou Yongluan, et al. Lifetime-based memory management for distributed data processing systems[J]. Proceedings of the VLDB Endowment, 2016, 9(12): 936−947 doi: 10.14778/2994509.2994513
[40]	Shi Xuanhua, Ke Zhixiang, Zhou Yongluan, et al. Deca: A garbage collection optimizer for in-memory data processing[J]. ACM Transactions on Computer Systems, 2019, 36(1): 1−47
[41]	Luo Siqi, Chen Xu, Zhou Zhi. F3C: Fog-enabled joint computation, communication and caching resource sharing for energy-efficient IoT data stream processing [C] //Proc of the 39th IEEE Int Conf on Distributed Computing Systems (ICDCS). Piscataway, NJ: IEEE, 2019: 1019−1028
[42]	Luo Siqi, Chen Xu, Zhou Zhi, et al. Fog-enabled joint computation, communication and caching resource sharing for energy-efficient IoT data stream processing[J]. IEEE Transactions on Vehicular Technology, 2021, 70(4): 3715−3730 doi: 10.1109/TVT.2021.3062664

[1]	Zhang Lei, He Chongde, Wei Lifei. Efficient and Malicious Secure Three-Party Private Set Intersection Computation Protocols for Small Sets[J]. Journal of Computer Research and Development, 2022, 59(10): 2286-2298. DOI: 10.7544/issn1000-1239.20220471
[2]	Wei Lifei, Liu Jihai, Zhang Lei, Wang Qin, He Chongde. Survey of Privacy Preserving Oriented Set Intersection Computation[J]. Journal of Computer Research and Development, 2022, 59(8): 1782-1799. DOI: 10.7544/issn1000-1239.20210685
[3]	Song Xiangfu, Gai Min, Zhao Shengnan, Jiang Han. Privacy-Preserving Statistics Protocol for Set-Based Computation[J]. Journal of Computer Research and Development, 2020, 57(10): 2221-2231. DOI: 10.7544/issn1000-1239.2020.20200444
[4]	Shen Liyan, Chen Xiaojun, Shi Jinqiao, Hu Lanlan. Survey on Private Preserving Set Intersection Technology[J]. Journal of Computer Research and Development, 2017, 54(10): 2153-2169. DOI: 10.7544/issn1000-1239.2017.20170461
[5]	Chen Zhenhua, Li Shundong, Wang Daoshun, Huang Qiong, Dong Lihong. Protocols for Secure Computation of Set-Inclusion with the Unencrypted Method[J]. Journal of Computer Research and Development, 2017, 54(7): 1549-1556. DOI: 10.7544/issn1000-1239.2017.20160034
[6]	Rong Chuitian, Xu Tianren, Du Xiaoyong. Partition-Based Set Similarity Join[J]. Journal of Computer Research and Development, 2012, 49(10): 2066-2076.
[7]	Li Shaohua, Feng Qilong, Wang Jianxin, and Chen Jianer. Kernelization for Weighted 3-Set Packing Problem[J]. Journal of Computer Research and Development, 2012, 49(8): 17811-786.
[8]	Hong Yi, Wang Zhaoqi, Zhu Dengming, Qiu Xianjie. Generation of Fire Animation Based on Level-Set[J]. Journal of Computer Research and Development, 2010, 47(11): 1849-1856.
[9]	Liu Quan, Fu Yuchen, Sun Jigui, Cui Zhiming, Gong Shengrong, Ling Xinghong. An Automated Reasoning Expanded Method Based on Set Signs[J]. Journal of Computer Research and Development, 2007, 44(8): 1317-1323.
[10]	Li Shundong, Si Tiange, and Dai Yiqi. Secure Multi-Party Computation of Set-Inclusion and Graph-Inclusion[J]. Journal of Computer Research and Development, 2005, 42(10): 1647-1653.