面向高性能图计算的高效高层次综合方法
Effective High-Level Synthesis for High-Performance Graph Processing
-
摘要: 图计算已成为大数据处理领域的主流应用, 采用特定硬件加速可以显著提高图计算的性能和能效.众所周知, 硬件代码的编写和验证十分耗时, 尽管通用高层次综合(high level synthesis, HLS)系统允许用户使用高级语言(如C语言)特性自动生成硬件结构, 但是对于图计算这种不规则算法, 其仍缺乏有效的并行性和访存技术支撑, 存在综合效果不理想、效率不高等突出问题.提出一种面向图计算的高效HLS方法, 结合图算法嵌套循环、随机访存、数据冲突以及幂律分布等特性, 采用数据流架构实现高效的并行流水线, 保证处理单元的负载均衡.通过提供的编程原语, 提出的方法可将通用图算法转化为模块化的数据流中间表示形式, 进而映射到参数化的硬件模板.在Xilinx Virtex UltraScale+XCVU9P的实现验证了方法的正确性, 不同类型的图算法在多个数据集上的实验结果表明, 相比国际上通用的Spatial HLS系统, 提出的方法可达到7.9~30.6倍的性能提升.Abstract: Graph processing has become one of the mainstream big data applications. For graph applications such as biological networks, social networks, and Web graphs, traditional GPU and CPU architectures suffer in terms of power consumption and performance due to graph algorithms’ characteristics. It is demonstrated that specialized hardware acceleration can significantly promote the performance and energy-efficiency of graph processing. As we know, writing and verifying the correct hardware-level codes are tedious and time-consuming. Although general-purpose high level synthesis (HLS) systems allow users to write the applications using high-level languages such as C by automatically generating it into the underlying hardware codes. However, for the irregular graph applications, these HLS systems still lack effective support for massive parallelism and memory access, potentially leading to significantly low performance. In this paper, we propose an effective HLS for high-performance graph processing. We adopt the dataflow architecture to achieve efficient parallel pipelining, ensuring load balancing. Through the developed programming primitives, users can quickly customize the vertex-centric graph algorithm and translate it into a modular intermediate representation (IR), which in turn maps to a parameterized hardware template. We build our HLS on Xilinx Virtex UltraScale+XCVU9P. Results on a variety of graph algorithms and datasets show that our HLS system can outperform state-of-the-art spatial by 7.9-30.6x speedups.