ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2015, Vol. 52 ›› Issue (7): 1567-1579.doi: 10.7544/issn1000-1239.2015.20140313

• 系统结构 • 上一篇    下一篇

Asyn-SimRank:一种可异步执行的大规模SimRank算法

王春磊1,张岩峰2,鲍玉斌1,赵长宽2,于戈1,高立新3   

  1. 1(东北大学信息科学与工程学院计算机软件研究所 沈阳 110819); 2(东北大学计算中心 沈阳 110819); 3(美国麻州大学阿默斯特校区电子与计算机工程系 美国阿默斯特 01003) (zhangyf@cc.neu.edu.cn)
  • 出版日期: 2015-07-01
  • 基金资助: 
    基金项目:国家自然科学基金项目(61300023,61272179,61033007,61173028);中央高校基本科研业务费基金项目(N120416001,N120816001);中国移动基金项目(MCM20122051);辽宁省科技计划基金项目(2013217004)

Asyn-SimRank: An Asynchronous Large-Scale SimRank Algorithm

Wang Chunlei1, Zhang Yanfeng2, Bao Yubin1, Zhao Changkuan2, Yu Ge1, Gao Lixin3   

  1. 1(Institute of Computer Software, College of Information Science and Engineering, Northeastern University, Shenyang 110819);2(Computer Center, Northeastern University, Shenyang 110819);3(Department of Electrical and Computer Engineering, University of Massachusetts Amherst, Amherst, U.S. 01003)
  • Online: 2015-07-01

摘要: SimRank算法利用网络结构来评估网络中任意2点的相似性,它被广泛应用于社交网络和链接预测等诸多领域中.近年来,随着大数据技术的发展,SimRank算法处理的数据不断增大,人们利用MapReduce等分布式计算模型设计实现分布式的大规模SimRank算法来适应大数据处理的需求.但是,由于SimRank算法包含开销较大的迭代过程,每次迭代之后都需要一个全局同步,且每次迭代的计算复杂度高、通信量大,SimRank算法不能在分布式环境下高效地实现.1)提出Asyn-SimRank算法,该算法采用迭代-累积的方式完成迭代计算,异步执行SimRank的核心迭代过程,避免了大规模分布式计算中的大量同步开销,同时有效降低计算量并减少通信开销;2)提出关键点优先调度计算,提升了Asyn-SimRank算法的全局收敛速度;3)证明了Asyn-SimRank算法的正确性和收敛性以及关键点优先调度计算的有效性;4)支持异步迭代的分布式框架Maiter上实现了Asyn-SimRank算法.实验结果显示,相比较于Hadoop,Spark上实现的SimRank算法和Delta-SimRank算法,Asyn-SimRank算法大大提升了算法的计算效率,加速了算法收敛.

关键词: 异步计算, 迭代计算, Asyn-SimRank算法, 相似度, 大数据, MapReduce模型, Maiter框架

Abstract: The SimRank algorithm, which exploits network structure to measure the similarity between node pairs, has been widely used in many areas, such as online social networks and link prediction. In recent years, with the development of big data, the input data set of the SimRank algorithm is constantly increasing. People are utilizing distributed computing models, such as MapReduce, to design large-scale SimRank algorithm for solving the big data problems. However, since SimRank algorithm contains a high-cost iterative process with synchronization barriers between iterations and the computational complexity is high in each iteration, the large-scale SimRank computation does not result in the satisfactory performance. In this paper, 1)We propose Asyn-SimRank by employing the iterate-cumulate approach,which asynchronously executes the core iterative computation to avoid the high-cost synchronization barriers in large-scale distributed environments, and effectively reduces the amount of computation and communication.2) We further propose the keypoint priority scheduling mechanism to accelerate convergence. 3)We prove the accuracy and convergence property of Asyn-SimRank as well as the efficiency of the keypoint priority scheduling. 4)We then implement Asyn-SimRank on Maiter, which is a distributed framework supporting asynchronous iteration. Expremental results show that, compared with the SimRank and Delta-SimRank implementing on Hadoop and Spark, the large-scale Asyn-SimRank significantly promotes the computational efficiency and accelerates the convergence.

Key words: asynchronous computation, iterative computation, Asyn-SimRank, similarity, big data, MapReduce, Maiter

中图分类号: