ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2021, Vol. 58 ›› Issue (3): 467-478.doi: 10.7544/issn1000-1239.2021.20190679

• 软件技术 • 上一篇    下一篇

面向高性能图计算的高效高层次综合方法

汤嘉武,郑龙,廖小飞,金海   

  1. (华中科技大学计算机科学与技术学院 武汉 430074);(大数据技术与系统国家地方联合工程研究中心(华中科技大学) 武汉 430074);(服务计算技术与系统教育部重点实验室(华中科技大学) 武汉 430074);(集群与网格计算湖北省重点实验室(华中科技大学) 武汉 430074) (jiawut@hust.edu.cn)
  • 出版日期: 2021-03-01
  • 基金资助: 
    国家重点研发计划项目(2018YFB1003502);国家自然科学基金项目(61702201, 61825202, 61832006)

Effective High-Level Synthesis for High-Performance Graph Processing

Tang Jiawu, Zheng Long, Liao Xiaofei, Jin Hai   

  1. (College of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074); (National Engineering Research Center for Big Data Technology and System(Huazhong University of Science and Technology), Wuhan 430074); (Key Laboratory of Services Computing Technology and System(Huazhong University of Science and Technology), Ministry of Education, Wuhan 430074) ;(Key Laboratory of Cluster and Grid Computing (Huazhong University of Science and Technology), Wuhan 430074)
  • Online: 2021-03-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program of China (2018YFB1003502) and the National Natural Science Foundation of China (61702201, 61825202, 61832006).

摘要: 图计算已成为大数据处理领域的主流应用, 采用特定硬件加速可以显著提高图计算的性能和能效.众所周知, 硬件代码的编写和验证十分耗时, 尽管通用高层次综合(high level synthesis, HLS)系统允许用户使用高级语言(如C语言)特性自动生成硬件结构, 但是对于图计算这种不规则算法, 其仍缺乏有效的并行性和访存技术支撑, 存在综合效果不理想、效率不高等突出问题.提出一种面向图计算的高效HLS方法, 结合图算法嵌套循环、随机访存、数据冲突以及幂律分布等特性, 采用数据流架构实现高效的并行流水线, 保证处理单元的负载均衡.通过提供的编程原语, 提出的方法可将通用图算法转化为模块化的数据流中间表示形式, 进而映射到参数化的硬件模板.在Xilinx Virtex UltraScale+XCVU9P的实现验证了方法的正确性, 不同类型的图算法在多个数据集上的实验结果表明, 相比国际上通用的Spatial HLS系统, 提出的方法可达到7.9~30.6倍的性能提升.

关键词: 图计算, 高层次综合, 数据流架构, 中间表示, FPGA

Abstract: Graph processing has become one of the mainstream big data applications. For graph applications such as biological networks, social networks, and Web graphs, traditional GPU and CPU architectures suffer in terms of power consumption and performance due to graph algorithms’ characteristics. It is demonstrated that specialized hardware acceleration can significantly promote the performance and energy-efficiency of graph processing. As we know, writing and verifying the correct hardware-level codes are tedious and time-consuming. Although general-purpose high level synthesis (HLS) systems allow users to write the applications using high-level languages such as C by automatically generating it into the underlying hardware codes. However, for the irregular graph applications, these HLS systems still lack effective support for massive parallelism and memory access, potentially leading to significantly low performance. In this paper, we propose an effective HLS for high-performance graph processing. We adopt the dataflow architecture to achieve efficient parallel pipelining, ensuring load balancing. Through the developed programming primitives, users can quickly customize the vertex-centric graph algorithm and translate it into a modular intermediate representation (IR), which in turn maps to a parameterized hardware template. We build our HLS on Xilinx Virtex UltraScale+XCVU9P. Results on a variety of graph algorithms and datasets show that our HLS system can outperform state-of-the-art spatial by 7.9-30.6x speedups.

Key words: graph processing, high level synthesis, dataflow architecture, intermediate representation, FPGA

中图分类号: