Towards Network Infrastructure Research for the Era of Large Language Models: Challenges, Practices, and Prospects

Zhai Ennan; Cao Jiamin; Qian Kun; Guan Yu

doi:10.7544/issn1000-1239.202440576

Abstract

Abstract: Large language models (LLMs) with hundreds of billions of parameters have brought significant technological and business transformations to today's AI and cloud services. However, the network behavior of LLM training differs fundamentally from that of general-purpose cloud computing (e.g., the Amazon EC2 elastic compute service), which raises a number of new challenges. These mainly include load imbalance caused by the different traffic pattern (Challenge 1), communication contention among multiple training jobs that degrades GPU utilization (Challenge 2), and high sensitivity to network failures (Challenge 3). As a result, data center network technologies designed for general-purpose cloud computing (e.g., network architecture, routing, traffic scheduling, and reliability mechanisms) are no longer suitable for LLM training, which calls for new data center networks and accompanying techniques designed specifically for it. We introduce HPN, the data center network that Alibaba Cloud designed specifically for LLM training, and Crux, a multi-job communication scheduling approach; together they address the three challenges above. HPN adopts a two-layer, dual-plane network architecture that not only interconnects 15 000 GPUs at high speed within a single Pod but also enables precise routing suited to LLM training (addressing Challenge 1). In addition, HPN proposes a novel non-stacked dual top-of-rack (ToR) design that replaces the single-ToR switch connection of traditional data center networks and fundamentally avoids the reliability risk of a single point of failure (partially addressing Challenge 3). To tackle Challenge 2, Crux models the GPU utilization optimization problem, proves that it is NP-complete, and approximates it as a flow scheduling problem based on GPU intensity. Crux then proposes an algorithm that prioritizes the flows of jobs with higher GPU computational intensity, which significantly reduces multi-job communication contention and improves GPU utilization. Compared with state-of-the-art approaches, Crux increases GPU utilization by up to 23 percentage points. Both HPN and Crux have been deployed at scale in Alibaba Cloud production for more than eight months and will continue to evolve. Building on this, we further outline possible research directions in LLM training and inference and offer guidance for subsequent work.
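The scheduling idea behind Crux, giving priority to the flows of jobs with higher GPU computational intensity, can be illustrated with a minimal sketch. It is not the algorithm from the paper. The `Job` fields, the definition of intensity as computation per byte communicated, and the rank-to-priority-class mapping in `assign_priorities` are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    compute_flops: float  # hypothetical: GPU work per training iteration
    comm_bytes: float     # hypothetical: traffic sent per training iteration

    @property
    def gpu_intensity(self) -> float:
        # Assumed definition: computation per byte communicated.
        return self.compute_flops / self.comm_bytes


def assign_priorities(jobs: list[Job], num_classes: int = 8) -> dict[str, int]:
    """Map each job to a priority class; class 0 is the highest priority.

    Jobs are ranked by descending GPU intensity. Ranks beyond the number of
    available classes share the lowest class.
    """
    ranked = sorted(jobs, key=lambda j: j.gpu_intensity, reverse=True)
    return {job.name: min(rank, num_classes - 1) for rank, job in enumerate(ranked)}


jobs = [
    Job("job_a", compute_flops=4e15, comm_bytes=2e10),
    Job("job_b", compute_flops=1e15, comm_bytes=4e10),
    Job("job_c", compute_flops=6e15, comm_bytes=1e10),
]
priorities = assign_priorities(jobs)
print(priorities)
```

In this toy setting, `job_c` performs the most computation per byte sent, so its flows are served first when several jobs contend for the same links. The number of priority classes is capped because switches typically expose only a limited number of queues.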
