Abstract:
The explosive growth of large-scale artificial intelligence models has made it impractical to deploy such applications at scale on a single node or a single type of computing architecture. Distributed heterogeneous computing has therefore become the mainstream choice, and inter-node communication has emerged as one of the main bottlenecks in large-model training and inference. Existing inter-node communication solutions from leading chip manufacturers still have notable deficiencies. On the one hand, some architectures pursue peak inter-node communication performance by adopting simple but poorly scalable point-to-point transmission schemes. On the other hand, traditional heterogeneous computing engines (such as GPUs) possess computing resources, such as memory and compute cores, that are independent of the CPU, but they lack dedicated network devices for communication; they must rely, entirely or partially, on the CPU to handle data transfers between the engines and a shared network device over physical links such as PCIe. This article proposes Direct xPU, a distributed heterogeneous computing architecture that equips heterogeneous computing engines with independent, dedicated devices for both computing and communication resources, achieving zero-copy data transfer and thereby eliminating the energy consumption and latency of cross-chip data movement during inter-node communication. Evaluations show that Direct xPU achieves communication latency comparable to that of architectures pursuing peak inter-node communication performance, with bandwidth approaching the physical limit.