Abstract:
Triangle counting is a classical problem in large-scale graph analysis and widely applied in areas such as social network analysis, anomaly detection, and graph mining. In dynamic graph data streams, the topological structure of graph is constantly changing, involving edge insertions and deletions. Recent researches mainly focus on static streaming graphs and the corresponding streaming graph sampling algorithms only deal with edge insertions. Only a few single-machine sampling algorithms with low accuracy can handle edge deletion operations. We propose the edge hashing assignment-based distributed sampling (EHADS) algorithm, which is a distributed streaming algorithm designed for estimating the number of triangles in dynamic graph data streams. EHADS algorithm can rapidly and accurately estimate the counts of global triangles and local triangles associated with each vertex in dynamic streaming graphs. EHADS algorithm processes the input graph stream only once and samples edges across multiple machines. Compared with state-of-the-art single-machine streaming algorithms, the experimental results reflect that EHADS algorithm has the following advantages: 1) with the same sample capacity, EHADS algorithm achieves smaller estimation errors with less runtime, reducing the average estimation error of global triangle count by 31.79% and local triangle count by 23.35%; 2) EHADS algorithm provides unbiased estimations of the triangle count in streaming graph, and mathematical proofs demonstrate that the above-mentioned unbiased estimation has smaller variance.