ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2016, Vol. 53 ›› Issue (8): 1806-1818.doi: 10.7544/issn1000-1239.2016.20160210

所属专题: 2016数据挖掘前沿技术专题

• 人工智能 • 上一篇    下一篇



  1. (南开大学计算机与控制工程学院 天津 300350) (
  • 出版日期: 2016-08-01

IncPR: An Incremental Parallel PageRank Algorithm

Jiang Shuangshuang, Liao Qun, Yang Yulu,Li Tao   

  1. (College of Computer and Control Engineering, Nankai University, Tianjin 300350)
  • Online: 2016-08-01

摘要: 广泛的互联网的商业应用使PageRank算法有重要地位.网络规模不断地增大,同时网络变化带来的时效性要求,也使PageRank计算对计算资源的要求不断地提高.为降低该问题对计算资源的消耗水平,降低计算成本,一种基于增量计算思想的PageRank算法:IncPR被提出.IncPR通过重用已有的结果,增量地获得数据变化后的结果.该算法在并行计算环境中,能够有效地降低计算量,缩短计算时间.理论分析表明,该算法计算结果的误差范围与蒙特卡罗PageRank算法相当,其时间复杂度优于其他已有的相关算法,且不引入额外的存储开销.在分布式集群Hama上进行的实验验证了理论分析的结果,IncPR在得到与蒙特卡罗PageRank算法同等(甚至更高)结果精度的情况下,显著地降低了计算量.

关键词: PageRank, Web数据挖掘, 增量计算, 蒙特卡罗算法, 并行与分布式处理

Abstract: PageRank algorithm plays an important role in various areas such as information retrieval and social network analysis. The continuously increasing size of networks and the timeliness requirements make it bring enormous computational overhead to repeat calculating PageRank for a massive and evolving network. Therefore, IncPR, a PageRank algorithm based on the idea of incremental computation model is proposed. IncPR reuses existing results from previous computation and modifies PageRank values according to the changed part of data sets accumulatively via an extended Monte Carlo method, which can effectively avoid reduplicated computation and improve the performance of PageRank computing with no additional storage overhead. Theoretical analysis shows that the computational complexity of IncPR processing an evolving network with m nodes and n edges changed is O((cm+n)R/c\+2), where c is the reset probability and R is the number of random walks starting from each node. The computational complexity of IncPR is lower than other existing related algorithms. Our evaluations with typical real-world graphs upon a Hama cluster also demonstrate that IncPR obtains PageRank values with equal (or even higher) accuracy as the baseline Monte Carlo PageRank algorithm and reduces the amount of computation significantly. In experiments with 0.01% data changed, IncPR reduces over 99.9% amount of computation.

Key words: PageRank, Web mining, incremental computing, Monte Carlo algorithm, parallel and distributed processing