Citation: Li Rengang, Wang Yanwei, Hao Rui, Xiao Linge, Yang Le, Yang Guangwen, Kan Hongwei. Direct xPU: A Novel Distributed Heterogeneous Computing Architecture for Inter-node Communication Optimization[J]. Journal of Computer Research and Development, 2024, 61(6): 1388-1400. DOI: 10.7544/issn1000-1239.202440055
The explosive growth of large-scale artificial intelligence models has made it difficult to deploy applications at scale on a single node or a single type of computing architecture. Distributed heterogeneous computing has therefore become the mainstream choice, and inter-node communication has become one of the main bottlenecks in large-model training and inference. Current inter-node communication solutions, dominated by leading chip manufacturers, still have some deficiencies. On the one hand, some architectures pursue ultimate inter-node communication performance by adopting a simple but poorly scalable point-to-point transmission scheme. On the other hand, traditional heterogeneous computing engines (such as GPUs) are independent of CPUs in computing resources such as memory and computing cores, but they lack dedicated communication network devices: they must rely entirely or partially on CPUs to handle transfers between the heterogeneous computing engines and a shared communication network device over physical links such as PCIe. The Direct xPU distributed heterogeneous computing architecture proposed in this article gives each heterogeneous computing engine independent, dedicated devices for both computing and communication resources, achieving zero-copy data transfer and further eliminating the energy consumption and latency of cross-chip data movement during inter-node communication. Evaluations show that Direct xPU achieves communication latency comparable to that of architectures pursuing ultimate inter-node communication performance, with bandwidth close to the physical limit.
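For intuition, the sketch below contrasts the two data paths described above: a CPU-mediated transfer that stages data through host memory on its way to a shared network device, and a Direct xPU-style transfer in which the computing engine owns a dedicated network device. This is an illustrative Python model only; the hop names, latency figures, and the Hop/summarize helpers are hypothetical placeholders and are not taken from the paper or from any real API.

```python
# Illustrative model: compare the number of cross-chip copies and an
# assumed per-hop latency for two inter-node send paths.
from dataclasses import dataclass

@dataclass
class Hop:
    name: str
    is_copy: bool      # does this hop duplicate the payload into another memory?
    latency_us: float  # assumed per-hop latency, purely illustrative

def summarize(path):
    """Return (number of extra copies, total assumed latency) for a path."""
    copies = sum(h.is_copy for h in path)
    latency = sum(h.latency_us for h in path)
    return copies, latency

# Traditional engine: payload is staged through host memory over PCIe
# before the shared NIC can put it on the wire.
cpu_mediated = [
    Hop("engine memory -> host bounce buffer (PCIe)", True, 5.0),
    Hop("host bounce buffer -> shared NIC (DMA)", True, 5.0),
    Hop("NIC -> wire", False, 1.0),
]

# Direct xPU-style engine: a dedicated network device belongs to the
# engine, so the payload leaves engine memory without cross-chip copies.
direct_xpu = [
    Hop("engine memory -> dedicated network device", False, 1.0),
    Hop("network device -> wire", False, 1.0),
]

for label, path in [("CPU-mediated", cpu_mediated), ("Direct xPU-style", direct_xpu)]:
    copies, latency = summarize(path)
    print(f"{label}: {copies} extra copies, ~{latency:.1f} us (illustrative)")
```

Running the model simply prints the copy count and the assumed end-to-end latency of each path; the point is the structural difference (zero staging copies in the Direct xPU-style path), not the numbers themselves.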