ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2015, Vol. 52 ›› Issue (8): 1768-1783.doi: 10.7544/issn1000-1239.2015.20150256

所属专题: 2015面向大数据的人工智能技术

• 人工智能 • 上一篇    下一篇



  1. 1(宁波大学信息科学与工程学院 浙江宁波 315211); 2(复旦大学计算机科学技术学院 上海 200433)(
  • 出版日期: 2015-08-01
  • 基金资助: 

FSMBUS: A Frequent Subgraph Mining Algorithm in Single Large-Scale Graph Using Spark

Yan Yuliang1,Dong Yihong1,He Xianmang1,Wang Wei2   

  1. 1(Faculty of Electrical Engineering and Computer Science, Ningbo Univeristy, Ningbo, Zhejiang 315211); 2(School of Computer Science, Fudan University, Shanghai 200433)
  • Online: 2015-08-01

摘要: 随着社交网络用户数的快速增加,大规模单图上频繁子图挖掘的需求越来越强烈.单机算法对大规模图的运行效率较低,难以支撑支持度较低的频繁子图的挖掘;现有的分布式环境下单图的频繁子图挖掘算法不支持子图增长模式的挖掘,它们所使用的Hadoop框架也不适合运行迭代式算法.提出了一种基于Spark的大规模单图频繁子图挖掘算法FSMBUS,通过次优树构建并行计算的候选子图,在给定最小支持度时挖掘出所有的频繁子图,并利用非频繁检测和搜索顺序选择实现优化,还设计了一种名为Sorted-Greedy的轻量级数据划分方法.实验结果表明,FSMBUS的效率要比现有单图上最新的算法快一个数量级,并支持更低最小支持度阈值以及更大规模图数据的挖掘,同时FSMBUS比其Hadoop的移植版要快2~4倍.

关键词: 频繁子图, 大规模单图, 分布式挖掘, Spark, 负载均衡

Abstract: Mining frequent subgraphs in a single large-scale graph is of huge demand with the rapid growth of the social networking. However, it is inefficient for the serial algorithms to mine frequent subgraphs in low support when mining for a single large-scale graph. Meanwhile, few existing distributed algorithms can’t support the growth pattern mining, and the Hadoop framework they worked is not suitable for iterative running. In this paper, a distributed algorithm named FSMBUS for mining frequent subgraph in a single large-scale graph under Spark framework is proposed. It constructs the parallel computing candidate subgraphs by suboptimal CAM Tree, which returns all the frequent subgraphs for given user-defined minimum support. Additionally, infrequent patterns’ test and searching order chosen is introduced to optimize the algorithm. Sorted-Greedy method is designed for data partition to balance the workload. Our experiments show that FSMBUS runs faster and more effective than the existing algorithms with real datasets,and even can run with the lower support threshold and the larger graph datasets as well. At the same time, FSMBUS runs 2~4 times faster on Spark framework than that on Hadoop framework.

Key words: frequent subgraph, single large-scale graph, distribute mining, Spark, workload balance