高级检索

    IncPR:一种基于增量计算的并行PageRank算法

    IncPR: An Incremental Parallel PageRank Algorithm

    • 摘要: 广泛的互联网的商业应用使PageRank算法有重要地位.网络规模不断地增大,同时网络变化带来的时效性要求,也使PageRank计算对计算资源的要求不断地提高.为降低该问题对计算资源的消耗水平,降低计算成本,一种基于增量计算思想的PageRank算法:IncPR被提出.IncPR通过重用已有的结果,增量地获得数据变化后的结果.该算法在并行计算环境中,能够有效地降低计算量,缩短计算时间.理论分析表明,该算法计算结果的误差范围与蒙特卡罗PageRank算法相当,其时间复杂度优于其他已有的相关算法,且不引入额外的存储开销.在分布式集群Hama上进行的实验验证了理论分析的结果,IncPR在得到与蒙特卡罗PageRank算法同等(甚至更高)结果精度的情况下,显著地降低了计算量.

       

      Abstract: PageRank algorithm plays an important role in various areas such as information retrieval and social network analysis. The continuously increasing size of networks and the timeliness requirements make it bring enormous computational overhead to repeat calculating PageRank for a massive and evolving network. Therefore, IncPR, a PageRank algorithm based on the idea of incremental computation model is proposed. IncPR reuses existing results from previous computation and modifies PageRank values according to the changed part of data sets accumulatively via an extended Monte Carlo method, which can effectively avoid reduplicated computation and improve the performance of PageRank computing with no additional storage overhead. Theoretical analysis shows that the computational complexity of IncPR processing an evolving network with m nodes and n edges changed is O((cm+n)R/c\+2), where c is the reset probability and R is the number of random walks starting from each node. The computational complexity of IncPR is lower than other existing related algorithms. Our evaluations with typical real-world graphs upon a Hama cluster also demonstrate that IncPR obtains PageRank values with equal (or even higher) accuracy as the baseline Monte Carlo PageRank algorithm and reduces the amount of computation significantly. In experiments with 0.01% data changed, IncPR reduces over 99.9% amount of computation.

       

    /

    返回文章
    返回