Vertex-Driven Parallel Minimum Spanning Tree Algorithms on Large Graphs

谷峪; 杨佳学; 鲍玉斌; 于戈

doi:10.7544/issn1000-1239.2014.20131331


Abstract

Abstract: Computing a minimum spanning tree (MST) is one of the most classic problems in graph theory. Complex graph algorithms built on the MST structure, including clustering, classification, and shortest-path queries, achieve significant gains in both efficiency and result quality. However, with the rapid development of the Internet, graphs keep growing, and large graphs containing tens of millions or even hundreds of millions of vertices are increasingly common. Implementing query processing and data mining algorithms on such graphs has therefore become an urgent problem. In addition, because large graphs are dynamic, maintaining algorithm results as the graph changes is also attracting considerable attention. Existing centralized MST algorithms cannot handle such massive and dynamic graph data. To address this, we first propose a partition Prim (PP) algorithm and, based on it, a vertex-driven parallel MST algorithm called PB (PP Boruvka), and we demonstrate the correctness of PB. We implement PB on both the MapReduce and BSP frameworks. For deletion-only dynamic graphs, we propose an MST maintenance algorithm that supports efficient incremental computation. The costs of the computation and maintenance algorithms are analyzed and compared. Finally, experiments on real and synthetic data sets verify the effectiveness, efficiency, and scalability of PB and the maintenance algorithm.
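For readers unfamiliar with the Boruvka step that PB builds on, the following is a minimal sketch of the textbook sequential Boruvka algorithm. It is not the PB algorithm described in the paper, and it omits the partition Prim phase and any MapReduce or BSP machinery. The function name `boruvka_mst`, the edge-tuple layout, and the tie-breaking by edge index are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, float]  # (u, v, weight); layout assumed for this sketch


def boruvka_mst(num_vertices: int, edges: List[Edge]) -> List[Edge]:
    """Textbook Boruvka: return the edges of a minimum spanning forest."""
    parent = list(range(num_vertices))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst: List[Edge] = []
    while True:
        # Each component selects its cheapest outgoing edge. The selections
        # are independent of one another, which is why this step lends
        # itself to parallel execution.
        cheapest: Dict[int, Tuple[Tuple[float, int], Edge]] = {}
        for idx, (u, v, w) in enumerate(edges):
            ru, rv = find(u), find(v)
            if ru == rv:
                continue  # edge lies inside one component
            for root in (ru, rv):
                # Break weight ties by edge index so all components agree
                # on a single total order of edges.
                if root not in cheapest or (w, idx) < cheapest[root][0]:
                    cheapest[root] = ((w, idx), (u, v, w))
        if not cheapest:
            break  # no edge connects two different components
        # Merge components along the selected edges.
        for _, (u, v, w) in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.append((u, v, w))
    return mst


if __name__ == "__main__":
    sample = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 3.0), (0, 3, 4.0), (0, 2, 5.0)]
    print(boruvka_mst(4, sample))
```

Because the number of components at least halves in every round while any merge is still possible, the outer loop runs a logarithmic number of times in the number of vertices, which is one reason Boruvka-style merging is a common starting point for parallel MST computation.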
