Citation: Li Rengang, Wang Yanwei, Hao Rui, Xiao Linge, Yang Le, Yang Guangwen, Kan Hongwei. Direct xPU: A Novel Distributed Heterogeneous Computing Architecture for Inter-node Communication Optimization[J]. Journal of Computer Research and Development, 2024, 61(6): 1388-1400. DOI: 10.7544/issn1000-1239.202440055
The explosive growth of large-scale artificial intelligence models has made it difficult to deploy applications at scale on a single node or a single type of computing architecture. Distributed heterogeneous computing has therefore become the mainstream choice, and inter-node communication has become one of the main bottlenecks in large-model training and inference. Current inter-node communication solutions, dominated by leading chip manufacturers, still have some deficiencies. On the one hand, some architectures pursue ultimate inter-node communication performance by adopting a simple but poorly scalable point-to-point transmission scheme. On the other hand, traditional heterogeneous computing engines (such as GPUs) are independent of CPUs in computing resources such as memory and computing cores, but they lack dedicated communication network devices: they must rely entirely or partially on CPUs to handle transfers between the heterogeneous computing engines and a shared communication network device over physical links such as PCIe. The Direct xPU distributed heterogeneous computing architecture proposed in this article gives each heterogeneous computing engine independent, dedicated devices for both computing and communication resources, achieving zero-copy data transfer and further eliminating the energy consumption and latency of cross-chip data movement during inter-node communication. Evaluations show that Direct xPU achieves communication latency comparable to that of architectures pursuing ultimate inter-node communication performance, with bandwidth close to the physical limit.
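For intuition, the sketch below contrasts the two data paths described above: a CPU-mediated transfer that stages data through host memory on its way to a shared network device, and a Direct xPU-style transfer in which the computing engine owns a dedicated network device. This is an illustrative Python model only; the hop names, latency figures, and the Hop/summarize helpers are hypothetical placeholders and are not taken from the paper or from any real API.

```python
# Illustrative model: compare the number of cross-chip copies and an
# assumed per-hop latency for two inter-node send paths.
from dataclasses import dataclass

@dataclass
class Hop:
    name: str
    is_copy: bool      # does this hop duplicate the payload into another memory?
    latency_us: float  # assumed per-hop latency, purely illustrative

def summarize(path):
    """Return (number of extra copies, total assumed latency) for a path."""
    copies = sum(h.is_copy for h in path)
    latency = sum(h.latency_us for h in path)
    return copies, latency

# Traditional engine: payload is staged through host memory over PCIe
# before the shared NIC can put it on the wire.
cpu_mediated = [
    Hop("engine memory -> host bounce buffer (PCIe)", True, 5.0),
    Hop("host bounce buffer -> shared NIC (DMA)", True, 5.0),
    Hop("NIC -> wire", False, 1.0),
]

# Direct xPU-style engine: a dedicated network device belongs to the
# engine, so the payload leaves engine memory without cross-chip copies.
direct_xpu = [
    Hop("engine memory -> dedicated network device", False, 1.0),
    Hop("network device -> wire", False, 1.0),
]

for label, path in [("CPU-mediated", cpu_mediated), ("Direct xPU-style", direct_xpu)]:
    copies, latency = summarize(path)
    print(f"{label}: {copies} extra copies, ~{latency:.1f} us (illustrative)")
```

Running the model simply prints the copy count and the assumed end-to-end latency of each path; the point is the structural difference (zero staging copies in the Direct xPU-style path), not the numbers themselves.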