    Citation: Li Rengang, Wang Yanwei, Hao Rui, Xiao Linge, Yang Le, Yang Guangwen, Kan Hongwei. Direct xPU—A Novel Distributed Heterogeneous Computing Architecture Optimized for Inter-node Communication[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440055

    Direct xPU—A Novel Distributed Heterogeneous Computing Architecture Optimized for Inter-node Communication


      Abstract: The explosive growth of applications built on large artificial intelligence models has made it difficult to deploy them at scale on a single node or a single type of computing engine. Distributed heterogeneous computing has therefore become the mainstream choice, and inter-node communication has become one of the main bottlenecks in training and inference of large models. The existing inter-node communication solutions, dominated by leading GPU and FPGA chip manufacturers, still have shortcomings. On the one hand, in pursuit of ultimate inter-node communication performance, some architectures adopt point-to-point transmission schemes whose protocols are simple but whose scalability is poor. On the other hand, although traditional heterogeneous computing engines (such as GPUs) are independent of the CPU in computing resources such as memory and computing pipelines, they lack dedicated network communication devices: they must rely entirely or partially on the CPU, over physical links such as PCIe, to handle communication between the heterogeneous computing engine and a shared network device. The Direct xPU distributed heterogeneous computing architecture proposed in this paper gives each heterogeneous computing engine independent, dedicated devices for both computing and communication, achieves zero-copy data transfer, and further eliminates the energy consumption and latency of cross-chip data movement during inter-node communication. Evaluations show that Direct xPU achieves communication latency comparable to architectures that pursue ultimate inter-node communication performance, with bandwidth approaching the physical limit of the link.
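      The difference between the two communication paths described above can be sketched as a toy model. This is an illustrative assumption, not the paper's API: the function and field names (`shared_nic_send`, `dedicated_nic_send`, `TransferStats`) are hypothetical, and the copy counts simply encode the qualitative claim that a CPU-mediated shared NIC requires cross-chip staging copies while a per-engine dedicated NIC can transmit directly from engine memory.

      ```python
      from dataclasses import dataclass, field

      @dataclass
      class TransferStats:
          copies: int = 0                               # cross-chip staging copies of the payload
          hops: list = field(default_factory=list)      # chips the payload traverses

      def shared_nic_send(payload_bytes: int) -> TransferStats:
          """Traditional path: engine memory is staged into host memory over
          PCIe under CPU control, then DMA'd to the shared NIC."""
          s = TransferStats()
          s.hops = ["xPU memory", "CPU/host memory", "shared NIC"]
          s.copies = 2  # device->host staging copy + host->NIC transfer
          return s

      def dedicated_nic_send(payload_bytes: int) -> TransferStats:
          """Direct xPU-style path: the engine's dedicated NIC transmits
          straight from engine memory, so no staging copy is needed."""
          s = TransferStats()
          s.hops = ["xPU memory", "dedicated NIC"]
          s.copies = 0  # zero-copy: payload never crosses into host memory
          return s
      ```

      Under this toy model, the dedicated-NIC path removes both the extra copies and the extra chip crossing, which is the source of the energy and latency savings the abstract claims.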
