Citation: | He Yulin, Wu Bo, Wu Dingming, Huang Zhexue, Philippe Fournier-Viger. Edge Hashing Distributed Sampling Algorithm for Triangle Counting in Large-scale Dynamic Graph Stream[J]. Journal of Computer Research and Development, 2024, 61(8): 1882-1903. DOI: 10.7544/issn1000-1239.202440120 |
Triangle counting is a classical problem in large-scale graph analysis and widely applied in areas such as social network analysis, anomaly detection, and graph mining. In dynamic graph data streams, the topological structure of graph is constantly changing, involving edge insertions and deletions. Recent researches mainly focus on static streaming graphs and the corresponding streaming graph sampling algorithms only deal with edge insertions. Only a few single-machine sampling algorithms with low accuracy can handle edge deletion operations. We propose the edge hashing assignment-based distributed sampling (EHADS) algorithm, which is a distributed streaming algorithm designed for estimating the number of triangles in dynamic graph data streams. EHADS algorithm can rapidly and accurately estimate the counts of global triangles and local triangles associated with each vertex in dynamic streaming graphs. EHADS algorithm processes the input graph stream only once and samples edges across multiple machines. Compared with state-of-the-art single-machine streaming algorithms, the experimental results reflect that EHADS algorithm has the following advantages: 1) with the same sample capacity, EHADS algorithm achieves smaller estimation errors with less runtime, reducing the average estimation error of global triangle count by 31.79% and local triangle count by 23.35%; 2) EHADS algorithm provides unbiased estimations of the triangle count in streaming graph, and mathematical proofs demonstrate that the above-mentioned unbiased estimation has smaller variance.
[1] |
Chung F R K, Lu L. Complex graphs and networks[M]. Rhode Island: American Mathematical Soc, 2006
|
[2] |
Kwak H, Lee C, Park H, et al. What is Twitter, a social network or a news media?[C]//Proc of the 19th Int Conf on World Wide Web. New York: ACM, 2010: 591−600
|
[3] |
Newman M E J. Coauthorship networks and patterns of scientific collaboration[J]. Proceedings of the National Academy of Sciences, 2004, 101(suppl_1): 5200−5205 doi: 10.1073/pnas.0307545100
|
[4] |
倪庆剑,彭文强,张志政,等. 基于信息增强传输的时空图神经网络交通流预测[J]. 计算机研究与发展,2022,59(2):282−293
Ni Qingjian, Peng Wenqiang, Zhang Zhizheng, et al. Spatial-temporal graph neural network for traffic flow prediction based on information enhanced transmission[J]. Journal of Computer Research and Development, 2022, 59(2): 282−293 (in Chinese)
|
[5] |
Girvan M, Newman M E J. Community structure in social and biological networks[J]. Proceedings of the National Academy of Sciences, 2002, 99(12): 7821−7826 doi: 10.1073/pnas.122653799
|
[6] |
Aggarwal C C. An Introduction to Social Network Data Analytics[M]. Berlin: Springer, 2011
|
[7] |
Watts D J, Strogatz S H. Collective dynamics of ‘small-world’ networks[J]. Nature, 1998, 393(6684): 440−442 doi: 10.1038/30918
|
[8] |
Newman M E J. The structure and function of complex networks[J]. SIAM Review, 2003, 45(2): 167−256 doi: 10.1137/S003614450342480
|
[9] |
Newman M E J, Watts D J, Strogatz S H. Random graph models of social networks[J]. Proceedings of the National Academy of Sciences, 2002, 99(suppl_1): 2566−2572 doi: 10.1073/pnas.012582999
|
[10] |
Batagelj V, Zaveršnik M. Short cycle connectivity[J]. Discrete Mathematics, 2007, 307(3): 310−318
|
[11] |
Burt R S. Structural holes and good ideas[J]. American Journal of Sociology, 2004, 110(2): 349−399 doi: 10.1086/421787
|
[12] |
Foucault Welles B, Van Devender A, Contractor N. Is a Friend a Friend? Investigating the Structure of Friendship Networks in Virtual Worlds[M]//CHI’10 Extended Abstracts on Human Factors in Computing Systems. New York: ACM, 2010: 4027−4032
|
[13] |
Zhang F, Gou XY, Zou L. Top-k heavy weight triangles listing on graph stream[J]. World Wide Web, 2023, 26(4): 1827−1851 doi: 10.1007/s11280-022-01117-z
|
[14] |
Yang Z, Wilson C, Wang X, et al. Uncovering social network sybils in the wild[J]. ACM Transactions on Knowledge Discovery from Data, 2014, 8(1): 1−29
|
[15] |
Welser H T, Gleave E, Fisher D, et al. Visualizing the signatures of social roles in online discussion groups[J]. Journal of Social Structure, 2007, 8(2): 1−32
|
[16] |
Becchetti L, Boldi P, Castillo C, et al. Efficient semi-streaming algorithms for local triangle counting in massive graphs[C]//Proc of the 14th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2008: 16−24
|
[17] |
Becchetti L, Boldi P, Castillo C, et al. Efficient algorithms for large-scale local triangle counting[J]. ACM Transactions on Knowledge Discovery from Data, 2010, 4(3): 1−28
|
[18] |
Eckmann J P, Moses E. Curvature of co-links uncovers hidden thematic layers in the world wide web[J]. Proceedings of the National Academy of Sciences, 2002, 99(9): 5825−5829 doi: 10.1073/pnas.032093399
|
[19] |
Wang N, Zhang JB, Tan K L, et al. On triangulation-based dense neighborhood graph discovery[J]. Proceedings of the VLDB Endowment, 2010, 4(2): 58−68 doi: 10.14778/1921071.1921073
|
[20] |
Bar-Yossef Z, Kumar R, Sivakumar D. Reductions in streaming algorithms, with an application to counting triangles in graphs[C]//Proc of the 13th annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia, PA: SIAM, 2002: 623−632
|
[21] |
Tsourakakis C E, Drineas P, Michelakis E, et al. Spectral counting of triangles via element-wise sparsification and triangle-based link recommendation[J]. Social Network Analysis and Mining, 2011, 1: 75−81 doi: 10.1007/s13278-010-0001-9
|
[22] |
Epasto A, Lattanzi S, Mirrokni V, et al. Ego-Net community mining applied to friend suggestion[J]. Proceedings of the VLDB Endowment, 2015, 9(4): 324−335 doi: 10.14778/2856318.2856327
|
[23] |
Lim Y, Jung M, Kang U. Memory-efficient and accurate sampling for counting local triangles in graph streams: From simple to multigraphs[J]. ACM Transactions on Knowledge Discovery from Data, 2018, 12(1): 1−28
|
[24] |
Latapy M. Main-memory triangle computations for very large (sparse (power-law)) graphs[J]. Theoretical Computer Science, 2008, 407(1): 458−473
|
[25] |
Itai A, Rodeh M. Finding a minimum circuit in a graph[C]//Proc of the 9th annual ACM Symposium on Theory of Computing. New York: ACM, 1977: 1−10
|
[26] |
Coppersmith D, Winograd S. Matrix multiplication via arithmetic progressions[C]//Proc of the 19th annual ACM Symposium on Theory of Computing. New York: ACM, 1987: 1−6
|
[27] |
Alon N, Yuster R, Zwick U. Finding and counting given length cycles[J]. Algorithmica, 1997, 17(3): 209−223 doi: 10.1007/BF02523189
|
[28] |
Cohen J. Graph twiddling in a MapReduce world[J]. Computing in Science & Engineering, 2009, 11(4): 29−41
|
[29] |
Suri S, Vassilvitskii S. Counting triangles and the curse of the last reducer[C]//Proc of the 20th Int Conf on World Wide Web. New York: ACM, 2011: 607−614
|
[30] |
Park H M, Chung C W. An efficient MapReduce algorithm for counting triangles in a very large graph[C]//Proc of the 22nd ACM Int Conf on Information & Knowledge Management. New York: ACM, 2013: 539−548
|
[31] |
Arifuzzaman S, Khan M, Marathe M. Patric: A parallel algorithm for counting triangles in massive networks[C]//Proc of the 22nd ACM Int Conf on Information & Knowledge Management. New York: ACM, 2013: 529−538
|
[32] |
Park H M, Silvestri F, Kang U, et al. MapReduce triangle enumeration with guarantees[C]//Proc of the 23rd ACM Int Conf on Information and Knowledge Management. New York: ACM, 2014: 1739−1748
|
[33] |
Park H M, Myaeng S H, Kang U. Pte: Enumerating trillion triangles on distributed systems[C]//Proc of the 22nd ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2016: 1115−1124
|
[34] |
Park H M, Silvestri F, Pagh R, et al. Enumerating trillion subgraphs on distributed systems[J]. ACM Transactions on Knowledge Discovery from Data, 2018, 12(6): 1−30
|
[35] |
Tsourakakis C E, Kang U, Miller G L, et al. Doulion: Counting triangles in massive graphs with a coin[C]//Proc of the 15th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2009: 837−846
|
[36] |
Pagh R, Tsourakakis C E. Colorful triangle counting and a MapReduce implementation[J]. Information Processing Letters, 2012, 112(7): 277−281 doi: 10.1016/j.ipl.2011.12.007
|
[37] |
Ahmed N K, Duffield N, Neville J, et al. Graph sample and hold: A framework for big-graph analytics[C]//Proc of the 20th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2014: 1446−1455
|
[38] |
Lim Y, Kang U. MASCOT: Memory-efficient and accurate sampling for counting local triangles in graph streams[C]//Proc of the 21st ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2015: 685−694
|
[39] |
Ahmed N K, Duffield N, Willke T, et al. On sampling from massive graph streams[J]. arXiv preprint arXiv: 1703.02625, 2017
|
[40] |
Shin K, Hammoud M, Lee E, et al. Tri-Fly: Distributed estimation of global and local triangle counts in graph streams[C]//Proc of Pacific-Asia Conf on Knowledge Discovery and Data Mining. Berlin: Springer, 2018: 651−663
|
[41] |
Wang PH, Jia P, Qi YY, et al. REPT: A streaming algorithm of approximating global and local triangle counts in parallel[C]//Proc of the 35th Int Conf on Data Engineering (ICDE). Piscataway, NJ: IEEE, 2019: 758−769
|
[42] |
Shin K, Lee E, Oh J, et al. CoCos: Fast and accurate distributed triangle counting in graph streams[J]. ACM Transactions on Knowledge Discovery from Data, 2021, 15(3): 1−30
|
[43] |
Yang X, Song C, Yu MD, et al. Distributed triangle approximately counting algorithms in simple graph stream[J]. ACM Transactions on Knowledge Discovery from Data, 2022, 16(4): 1−43
|
[44] |
Gou XY, Zou L. Sliding window-based approximate triangle counting with bounded memory usage[J]. The VLDB Journal, 2023, 32(5): 1087−1110 doi: 10.1007/s00778-023-00783-3
|
[45] |
Jha M, Seshadhri C, Pinar A. A space efficient streaming algorithm for triangle counting using the birthday paradox[C]//Proc of the 19th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2013: 589−597
|
[46] |
Pavan A, Tangwongsan K, Tirthapura S, et al. Counting and sampling triangles from a graph stream[J]. Proceedings of the VLDB Endowment, 2013, 6(14): 1870−1881 doi: 10.14778/2556549.2556569
|
[47] |
Tangwongsan K, Pavan A, Tirthapura S. Parallel triangle counting in massive streaming graphs[C]//Proc of the 22nd ACM Int Conf on Information & Knowledge Management. New York: ACM, 2013: 781−786
|
[48] |
Zhang LL, Jiang H, Wang F, et al. T-sample: A dual reservoir-based sampling method for characterizing large graph streams[C]//Proc of the 35th Int Conf on Data Engineering (ICDE). Piscataway, NJ: IEEE, 2019: 1674−1677
|
[49] |
Kutzkov K, Pagh R. Triangle counting in dynamic graph streams[C]//Proc of Scandinavian Workshop on Algorithm Theory. Berlin: Springer, 2014: 306−318
|
[50] |
Bulteau L, Froese V, Kutzkov K, et al. Triangle counting in dynamic graph streams[J]. Algorithmica, 2016, 76(1): 259−278 doi: 10.1007/s00453-015-0036-4
|
[51] |
Han G, Sethu H. Edge sample and discard: A new algorithm for counting triangles in large dynamic graphs[C]//Proc of the 2017 IEEE/ACM Int Conf on Advances in Social Networks Analysis and Mining. New York: ACM, 2017: 44−49
|
[52] |
Stefani L D, Epasto A, Riondato M, et al. Triest: Counting local and global triangles in fully dynamic streams with fixed memory size[J]. ACM Transactions on Knowledge Discovery from Data, 2017, 11(4): 1−50
|
[53] |
Shin K, Oh S, Kim J, et al. Fast, accurate and provable triangle counting in fully dynamic graph streams[J]. ACM Transactions on Knowledge Discovery from Data, 2020, 14(2): 1−39
|
[54] |
Lee D, Shin K, Faloutsos C. Temporal locality-aware sampling for accurate triangle counting in real graph streams[J]. The VLDB Journal, 2020, 29(6): 1501−1525 doi: 10.1007/s00778-020-00624-7
|
[55] |
Yu CY, Liu HM, Wahab F, et al. Global triangle estimation based on first edge sampling in large graph streams[J]. The Journal of Supercomputing, 2023, 79(13): 14079−14116 doi: 10.1007/s11227-023-05205-3
|
[56] |
Vitter J S. Random sampling with a reservoir[J]. ACM Transactions on Mathematical Software, 1985, 11(1): 37−57 doi: 10.1145/3147.3165
|
[57] |
Gemulla R, Lehner W, Haas P J. Maintaining bounded-size sample synopses of evolving datasets[J]. The VLDB Journal, 2008, 17: 173−201 doi: 10.1007/s00778-007-0065-y
|
[1] | Guo Yaqing, Wang Wenjian, Su Meihong. An Adaptive Regression Feature Selection Method for Datasets with Outliers[J]. Journal of Computer Research and Development, 2019, 56(8): 1695-1707. DOI: 10.7544/issn1000-1239.2019.20190313 |
[2] | Xu Hang, Zhang Kai, Wang Wenjian. A Feature Selection Method for Small Samples[J]. Journal of Computer Research and Development, 2018, 55(10): 2321-2330. DOI: 10.7544/issn1000-1239.2018.20170748 |
[3] | Wang Ling, Meng Jianyao. Dynamic Fuzzy Features Selection Based on Variable Weight[J]. Journal of Computer Research and Development, 2018, 55(5): 893-907. DOI: 10.7544/issn1000-1239.2018.20170503 |
[4] | Yao Sheng, Xu Feng, Zhao Peng, Ji Xia. Intuitionistic Fuzzy Entropy Feature Selection Algorithm Based on Adaptive Neighborhood Space Rough Set Model[J]. Journal of Computer Research and Development, 2018, 55(4): 802-814. DOI: 10.7544/issn1000-1239.2018.20160919 |
[5] | Dong Hongbin, Teng Xuyang, Yang Xue. Feature Selection Based on the Measurement of Correlation Information Entropy[J]. Journal of Computer Research and Development, 2016, 53(8): 1684-1695. DOI: 10.7544/issn1000-1239.2016.20160172 |
[6] | Xu Junling, Zhou Yuming, Chen Lin, Xu Baowen. An Unsupervised Feature Selection Approach Based on Mutual Information[J]. Journal of Computer Research and Development, 2012, 49(2): 372-382. |
[7] | Jing Hongfang, Wang Bin, YangYahui, Xu Yan. Category Distribution-Based Feature Selection Framework[J]. Journal of Computer Research and Development, 2009, 46(9): 1586-1593. |
[8] | Gao Jun, Wang Shitong, Deng Zhaohong. GPSFM: Generalized Potential Support Features Selection Method[J]. Journal of Computer Research and Development, 2009, 46(1): 41-51. |
[9] | Xu Yan, Li Jintao, Wang Bin, Sun Chunming, Zhang Sen. A Study on Constraints for Feature Selection in Text Categorization[J]. Journal of Computer Research and Development, 2008, 45(4): 596-602. |
[10] | Liu Tao, Wu Gongyi, Chen Zheng. An Effective Unsupervised Feature Selection Method for Text Clustering[J]. Journal of Computer Research and Development, 2005, 42(3). |
1. |
杨晋云,宋超峰. 智能化应用发展场景下网络安全防御技术分析. 电脑知识与技术. 2024(31): 91-95 .
![]() |