Asyn-SimRank:一种可异步执行的大规模SimRank算法

王春磊; 张岩峰; 鲍玉斌; 赵长宽; 于戈; 高立新

doi:10.7544/issn1000-1239.2015.20140313

Asyn-SimRank:一种可异步执行的大规模SimRank算法

Asyn-SimRank: An Asynchronous Large-Scale SimRank Algorithm

摘要

摘要: SimRank算法利用网络结构来评估网络中任意2点的相似性，它被广泛应用于社交网络和链接预测等诸多领域中.近年来，随着大数据技术的发展，SimRank算法处理的数据不断增大，人们利用MapReduce等分布式计算模型设计实现分布式的大规模SimRank算法来适应大数据处理的需求.但是，由于SimRank算法包含开销较大的迭代过程，每次迭代之后都需要一个全局同步，且每次迭代的计算复杂度高、通信量大，SimRank算法不能在分布式环境下高效地实现.1)提出Asyn-SimRank算法，该算法采用迭代-累积的方式完成迭代计算，异步执行SimRank的核心迭代过程，避免了大规模分布式计算中的大量同步开销，同时有效降低计算量并减少通信开销；2)提出关键点优先调度计算，提升了Asyn-SimRank算法的全局收敛速度；3)证明了Asyn-SimRank算法的正确性和收敛性以及关键点优先调度计算的有效性；4)支持异步迭代的分布式框架Maiter上实现了Asyn-SimRank算法.实验结果显示，相比较于Hadoop，Spark上实现的SimRank算法和Delta-SimRank算法，Asyn-SimRank算法大大提升了算法的计算效率，加速了算法收敛.

Abstract: The SimRank algorithm, which exploits network structure to measure the similarity between node pairs, has been widely used in many areas, such as online social networks and link prediction. In recent years, with the development of big data, the input data set of the SimRank algorithm is constantly increasing. People are utilizing distributed computing models, such as MapReduce, to design large-scale SimRank algorithm for solving the big data problems. However, since SimRank algorithm contains a high-cost iterative process with synchronization barriers between iterations and the computational complexity is high in each iteration, the large-scale SimRank computation does not result in the satisfactory performance. In this paper, 1)We propose Asyn-SimRank by employing the iterate-cumulate approach,which asynchronously executes the core iterative computation to avoid the high-cost synchronization barriers in large-scale distributed environments, and effectively reduces the amount of computation and communication.2) We further propose the keypoint priority scheduling mechanism to accelerate convergence. 3)We prove the accuracy and convergence property of Asyn-SimRank as well as the efficiency of the keypoint priority scheduling. 4)We then implement Asyn-SimRank on Maiter, which is a distributed framework supporting asynchronous iteration. Expremental results show that, compared with the SimRank and Delta-SimRank implementing on Hadoop and Spark, the large-scale Asyn-SimRank significantly promotes the computational efficiency and accelerates the convergence.

HTML全文

参考文献(0)

施引文献

资源附件(0)