(College of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074)
(National Engineering Research Center for Big Data Technology and System(Huazhong University of Science and Technology), Wuhan 430074)
(Key Laboratory of Services Computing Technology and System(Huazhong University of Science and Technology), Ministry of Education, Wuhan 430074)
(Key Laboratory of Cluster and Grid Computing (Huazhong University of Science and Technology), Wuhan 430074)
Funds: This work was supported by the National Key Research and Development Program of China (2018YFB1003502) and the National Natural Science Foundation of China (61702201, 61825202, 61832006).
Graph processing has become one of the mainstream big data applications. For graph applications such as biological networks, social networks, and Web graphs, traditional GPU and CPU architectures suffer in terms of power consumption and performance due to graph algorithms’ characteristics. It is demonstrated that specialized hardware acceleration can significantly promote the performance and energy-efficiency of graph processing. As we know, writing and verifying the correct hardware-level codes are tedious and time-consuming. Although general-purpose high level synthesis (HLS) systems allow users to write the applications using high-level languages such as C by automatically generating it into the underlying hardware codes. However, for the irregular graph applications, these HLS systems still lack effective support for massive parallelism and memory access, potentially leading to significantly low performance. In this paper, we propose an effective HLS for high-performance graph processing. We adopt the dataflow architecture to achieve efficient parallel pipelining, ensuring load balancing. Through the developed programming primitives, users can quickly customize the vertex-centric graph algorithm and translate it into a modular intermediate representation (IR), which in turn maps to a parameterized hardware template. We build our HLS on Xilinx Virtex UltraScale+XCVU9P. Results on a variety of graph algorithms and datasets show that our HLS system can outperform state-of-the-art spatial by 7.9-30.6x speedups.