针对大规模动态图流三角形计数的边哈希分布式抽样算法

何玉林; 吴波; 吴定明; 黄哲学; 菲律普弗尼尔-维格

doi:10.7544/issn1000-1239.202440120

针对大规模动态图流三角形计数的边哈希分布式抽样算法

Edge Hashing Distributed Sampling Algorithm for Triangle Counting in Large-scale Dynamic Graph Stream

摘要

摘要: 三角形计数是大图分析的一个经典问题，近年的研究工作主要集中在针对静态流式图的三角形数量估计上，相关流式图抽样算法只能处理边的插入操作，无法处理边的删除操作；而现有的动态流式图抽样算法估计准确性又偏低. 针对上述问题，提出了基于边哈希分配的分布式抽样（edge hashing assignment-based distributed sampling，EHADS）算法，它是一个用于估计动态流式图中三角形数量的分布式流算法，可以快速准确地估计动态流式图中的全局三角形数量以及每个顶点的局部三角形数量. EHADS算法只对输入的图流进行1次处理，并在多台机器上对边进行抽样. 与先进的单机流算法相比，EHADS算法具有2点优势：1）在相同样本容量的情况下，EHADS算法以更短的运行时间获得了更小的估计误差，估计全局三角形数量的误差平均降低了31.79%，估计局部三角形数量的误差平均降低了23.35%；2）EHADS算法能够提供流式图中三角形数量的无偏估计，并且严格的数学证明显示该无偏估计具有更小的方差.

Abstract: Triangle counting is a classical problem in large-scale graph analysis and widely applied in areas such as social network analysis, anomaly detection, and graph mining. In dynamic graph data streams, the topological structure of graph is constantly changing, involving edge insertions and deletions. Recent researches mainly focus on static streaming graphs and the corresponding streaming graph sampling algorithms only deal with edge insertions. Only a few single-machine sampling algorithms with low accuracy can handle edge deletion operations. We propose the edge hashing assignment-based distributed sampling (EHADS) algorithm, which is a distributed streaming algorithm designed for estimating the number of triangles in dynamic graph data streams. EHADS algorithm can rapidly and accurately estimate the counts of global triangles and local triangles associated with each vertex in dynamic streaming graphs. EHADS algorithm processes the input graph stream only once and samples edges across multiple machines. Compared with state-of-the-art single-machine streaming algorithms, the experimental results reflect that EHADS algorithm has the following advantages: 1) with the same sample capacity, EHADS algorithm achieves smaller estimation errors with less runtime, reducing the average estimation error of global triangle count by 31.79% and local triangle count by 23.35%; 2) EHADS algorithm provides unbiased estimations of the triangle count in streaming graph, and mathematical proofs demonstrate that the above-mentioned unbiased estimation has smaller variance.

HTML全文

参考文献(57)

施引文献

资源附件(0)