ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2020, Vol. 57 ›› Issue (6): 1152-1163.doi: 10.7544/issn1000-1239.2020.20200115

所属专题: 2020计算机体系结构前沿技术专题

• 系统结构 • 上一篇    下一篇

面向高通量计算机的图算法优化技术

张承龙1,2,曹华伟1,王国波1,2,郝沁汾1,张洋1,叶笑春1,范东睿1,2   

  1. 1(计算机体系结构国家重点实验室(中国科学院计算技术研究所) 北京 100190);2(中国科学院大学计算机与控制学院 北京 100049) (caohuawei@ict.ac.cn)
  • 出版日期: 2020-06-01
  • 基金资助: 
    国家重点研发计划项目(2018YFB1003501);国家自然科学基金项目(11904370,61732018,61672499);计算机体系结构国家重点实验室创新项目(CARCH4509)

Efficient Optimization of Graph Computing on High-Throughput Computer

Zhang Chenglong1,2, Cao Huawei1, Wang Guobo1,2, Hao Qinfen1, Zhang Yang1, Ye Xiaochun1, Fan Dongrui1,2   

  1. 1(State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190);2(School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049)
  • Online: 2020-06-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program of China (2018YFB1003501), the National Natural Science Foundation of China (11904370, 61732018, 61672499), and the Innovation Project of the State Key Laboratory of Computer Architecture (CARCH4509).

摘要: 随着互联网技术的蓬勃发展,图数据的规模呈爆炸式增长.如何高效地处理大规模图数据逐渐成为工业界和学术界关注的焦点.宽度优先搜索算法是解决图遍历问题的经典算法,也是Graph500基准的核心测试程序之一.高通量计算机采用ARM架构的众核体系结构,具有高并发、强实时、低功耗等适于大数据计算的特点.在单节点上,BFS算法的优化已取得一系列进展,首先对现有的优化技术进行系统的介绍,并在此基础上提出2种面向高通量计算机的优化手段,通过减少冗余访存和提高缓存局部性,有效提高了算法的访存效率.通过这些优化手段,在高通量计算机上对BFS算法的性能进行了系统的评估.对于顶点规模为230的Kronecker图(顶点数为230,边数为234),优化后的BFS算法在高通量计算机上的平均性能为24.26 GTEPS.与两路x86架构服务器相比,单节点具有1.18倍的性能优势.在性能功耗比方面,高通量计算机的结果为181.04 MTEPS/W.在2019年6月份的Green Graph500面向大数据集的排行榜上取得第2名的成绩.综上,高通量计算机的高并发和低功耗等特点非常适合处理大规模图计算等数据密集型应用.

关键词: 宽度优先搜索, 高通量, Graph500, 图算法, 超算

Abstract: With the rapid development of computing technology, the scale of graph increases explosively and large-scale graph computing has been the focus in recent years. Breadth first search (BFS) is a classic algorithm to solve graph traverse problem. It is the main kernel of Graph500 benchmark that evaluates the performance of supercomputers and servers in terms of data-intensive applications. High-throughput computer (HTC) adopts ARM-based many-core architecture, which has the characteristics of high concurrency, strong real-time, low-power consumption. The optimization of BFS algorithm has made a series of progress on single-node systems. In this paper, we first introduce parallel BFS algorithm and existing optimizations. Then we propose two optimization techniques for HTC to improve the efficiency of data access and data locality. We systematically evaluate the performance of BFS algorithm on HTC. For the Kronecker graph with 2scale=230whose vertices are 230 and edges are 234, the average performance on HTC is 24.26 GTEPS and 1.18 times faster than the two-way x86 server. In terms of energy efficiency, the result on HTC is 181.04 MTEPS/W and rank 2nd place on the June 2019 Green Graph500 big data list. To our best knowledge, this is the first work that evaluates BFS performance on HTC platform. HTC is suitable for data intensive applications such as large-scale graph computing.

Key words: breadth first search (BFS), high throughput, Graph500, graph algorithm, super computing

中图分类号: