
Towards Network Infrastructure Research for the Era of Large Language Models: Challenges, Practices, and Prospects

Zhai Ennan, Cao Jiamin, Qian Kun, Guan Yu

Citation: Zhai Ennan, Cao Jiamin, Qian Kun, Guan Yu. Towards Network Infrastructure Research for the Era of Large Language Models: Challenges, Practices, and Prospects[J]. Journal of Computer Research and Development, 2024, 61(11): 2664-2677. DOI: 10.7544/issn1000-1239.202440576. CSTR: 32373.14.issn1000-1239.202440576

  • CLC number: TP393

More Information
    Author Bio:

    Zhai Ennan: born in 1984. PhD. Member of CCF. His main research interests include computer networks and distributed systems

    Cao Jiamin: born in 1997. PhD. Member of CCF. Her main research interests include AI infrastructure and programmable networks

    Qian Kun: born in 1993. PhD. Member of CCF. His main research interests include AI infrastructure and storage

    Guan Yu: born in 1993. PhD. His main research interests include AI infrastructure and video streaming

  • Abstract:

    Large language models (LLMs) with hundreds of billions of parameters have brought significant technological and business transformations to today’s AI and cloud services. However, there are fundamental differences in network behavior between LLM training and general cloud computing (e.g., the Amazon EC2 elastic compute service), leading to a variety of new challenges. These challenges mainly include load-balancing difficulties caused by the different traffic pattern (Challenge 1), multi-job communication contention that degrades GPU utilization (Challenge 2), and high sensitivity to network failures (Challenge 3). Therefore, data center network technologies designed for general cloud computing (e.g., network architecture, routing, traffic scheduling, and reliability mechanisms) are no longer suitable for today’s LLM training, which calls for new data center networks and accompanying technical solutions designed specifically for LLM training. We introduce Alibaba Cloud’s high-performance network (HPN), a data center network designed specifically for LLM training, and the multi-job communication scheduling approach Crux, which together address the three challenges above. HPN introduces a two-tier, dual-plane network architecture that not only interconnects 15 000 GPUs at high speed within a Pod but also provides the precise routing that LLM training requires (addressing Challenge 1). Furthermore, HPN proposes a novel non-stacked dual-top-of-rack (ToR) design that replaces the traditional single-ToR switch connection in data center networks, fundamentally avoiding the single-point-of-failure reliability risk (partially addressing Challenge 3). To tackle Challenge 2, Crux models the GPU utilization optimization problem, proves it NP-complete, and approximates it as a flow scheduling problem related to GPU computational intensity. Crux then proposes an algorithm that prioritizes the flows of jobs with higher GPU computational intensity, significantly reducing multi-job communication contention and improving GPU utilization. Compared with state-of-the-art efforts, Crux improves GPU utilization by up to 23%. Both HPN and Crux have been deployed at scale in Alibaba Cloud production for over eight months and will continue to evolve and iterate. Building on this, we further envision possible research directions in the fields of LLM training and inference, providing guidance for subsequent work.
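
    To make the Crux scheduling idea above concrete, the following is a minimal sketch of intensity-based flow prioritization. It is our own illustration under assumed definitions, not code from the Crux paper; Job, gpu_intensity, and assign_flow_priorities are hypothetical names, and the intensity metric (GPU compute time per byte communicated) is an assumption for illustration.

        # Sketch: rank jobs by an assumed GPU computational intensity metric
        # and map their flows to network priority classes (0 = highest).
        from dataclasses import dataclass

        @dataclass
        class Job:
            name: str
            gpu_time_per_iter: float  # seconds of GPU compute per iteration
            bytes_per_iter: float     # bytes communicated per iteration

            @property
            def gpu_intensity(self) -> float:
                # More GPU compute stalled per byte in flight means this
                # job's flows should win network contention first.
                return self.gpu_time_per_iter / self.bytes_per_iter

        def assign_flow_priorities(jobs, num_priorities=8):
            """Map each job name to a priority class (0 = highest)."""
            ranked = sorted(jobs, key=lambda j: j.gpu_intensity, reverse=True)
            return {j.name: min(i, num_priorities - 1) for i, j in enumerate(ranked)}

        jobs = [Job("llm-70b", 4.0, 2e9), Job("vision", 1.0, 4e9), Job("rec", 0.5, 8e9)]
        print(assign_flow_priorities(jobs))  # {'llm-70b': 0, 'vision': 1, 'rec': 2}

    In a real deployment, these priority classes would have to be carried by the jobs’ flows (e.g., in packet header fields honored by switch queues); the sketch only shows the ranking step.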

  • Figure 1. Traditional cloud computing traffic pattern[8]

    Figure 2. NIC egress traffic pattern during production model training[8]

    Figure 3. Communication contention among multiple jobs

    Figure 4. Impact of communication contention on the iteration time of GPT-3 variant training[9]

    Figure 5. HPN backend network architecture overview

    Figure 6. Rail-optimized network with dual-ToR[8]

    Figure 7. Traffic on the two ToR ports connected to the same NIC[8]

    Figure 8. Model training performance with 2300+ GPUs under different network architectures[8]

    Figure 9. Reducing the GPU utilization problem to a flow scheduling problem

    Figure 10. Example of communication contention between jobs with different iteration times

    Figure 11. Example of communication contention between jobs with different computation contention overlaps

    Figure 12. GPU utilization under different communication schedulers[9]

    Table 1. Key mechanisms for maximal scale[8]

    Key mechanism            Tier-1 scale (scaling factor)   Tier-2 scale (scaling factor)
    Dual-ToR                 128 (×2)                        4 000 (×2)
    Rail-optimized           1 000 (×8)
    Dual-plane                                               8 000 (×2)
    15:1 oversubscription                                    15 000 (×1.875)
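
    As a quick back-of-the-envelope check of how these mechanisms compose (our own arithmetic over the table’s rounded figures; the 64-GPU single-ToR base is inferred from the dual-ToR ×2 entry, and is an assumption):

        # How the Table 1 scaling factors compose (rounded figures).
        tier1 = 64                  # inferred GPUs per single ToR, before any mechanism
        tier1 *= 2                  # dual-ToR                  -> 128
        tier1 *= 8                  # rail-optimized networking -> 1024 (~1 000 per segment)

        tier2 = 4_000               # tier-2 scale with dual-ToR, from the table
        tier2 *= 2                  # dual-plane                -> 8 000
        tier2 = int(tier2 * 1.875)  # 15:1 oversubscription     -> 15 000 GPUs per Pod
        print(tier1, tier2)         # 1024 15000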
  • [1]

    OpenAI, Achiam J, Adler S, et al. GPT-4 technical report[J]. arXiv preprint, arXiv: 2303.08774, 2024

    [2]

    OpenAI. Introducing ChatGPT[EB/OL]. [2022-11-30]. https://openai.com/blog/chatgpt

    [3]

    LMSYS Org. Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality[EB/OL]. [2023-03-30]. https://lmsys.org/blog/2023-03-30-vicuna/

    [4]

    NVIDIA. Megatron-LM[EB/OL]. [2024-06-19]. https://github.com/NVIDIA/Megatron-LM

    [5]

    Microsoft Research. DeepSpeed[EB/OL]. [2024-06-19]. https://www.microsoft.com/en-us/research/project/deepspeed/

    [6]

    Rajasekaran S, Ghobadi M, Akella A. CASSINI: Network-aware job scheduling in machine learning clusters[C]//Proc of USENIX NSDI. Berkeley, CA: USENIX Association, 2024: 1403−1420

    [7]

    Jiang Ziheng, Lin Haibin, Zhong Yinmin, et al. MegaScale: Scaling large language model training to more than 10,000 GPUs[C]//Proc of USENIX NSDI. Berkeley, CA: USENIX Association, 2024: 745−760

    [8]

    Qian Kun, Xi Yongqing, Cao Jiamin, et al. Alibaba HPN: A data center network for large language model training[C]//Proc of ACM SIGCOMM. New York: ACM, 2024: 691−706

    [9]

    Cao Jiamin, Guan Yu, Qian Kun, et al. Crux: GPU-efficient communication scheduling for deep learning training[C]//Proc of ACM SIGCOMM. New York: ACM, 2024: 1−15

    [10]

    Microsoft DeepSpeed. Model checkpointing[EB/OL]. [2023-01-31]. https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html

    [11]

    Alizadeh M, Edsall T, Dharmapurikar S, et al. CONGA: Distributed congestion-aware load balancing for datacenters[C]//Proc of ACM SIGCOMM. New York: ACM, 2014: 503–514

    [12]

    Dixit A, Prakash P, Hu Y C, et al. On the impact of packet spraying in data center networks[C]//Proc of IEEE INFOCOM. Piscataway, NJ: IEEE, 2013: 2130–2138

    [13]

    Ghorbani S, Yang Zibin, Godfrey P B, et al. DRILL: Micro load balancing for low-latency data center networks[C]//Proc of ACM SIGCOMM. New York: ACM, 2017: 225–238

    [14]

    Katta N, Hira M, Ghag A, et al. CLOVE: How I learned to stop worrying about the core and love the edge[C]//Proc of ACM Workshop on Hot Topics in Networks. New York: ACM, 2016: 155–161

    [15]

    Katta N, Hira M, Kim C, et al. HULA: Scalable load balancing using programmable data planes[C]//Proc of ACM SOSR. New York: ACM, 2016: 1–12

    [16]

    Qureshi M A, Cheng Yuchung, Yin Qianwen, et al. PLB: Congestion signals are simple and effective for network load balancing[C]//Proc of ACM SIGCOMM. New York: ACM, 2022: 207–218

    [17]

    Sen S, Shue D, Ihm S, et al. Scalable, optimal flow routing in datacenters via local link balancing[C]//Proc of ACM CoNEXT. New York: ACM, 2013: 151–162

    [18]

    Vanini E, Pan Rong, Alizadeh M, et al. Let it flow: Resilient asymmetric load balancing with flowlet switching[C]//Proc of USENIX NSDI. Berkeley, CA: USENIX Association, 2017: 407–420

    [19]

    Zats D, Das T, Mohan P, et al. DeTail: Reducing the flow completion time tail in datacenter networks[C]//Proc of ACM SIGCOMM. New York: ACM, 2012: 139–150

    [20]

    Zhang Hong, Zhang Junxue, Bai Wei, et al. Resilient datacenter load balancing in the wild[C]//Proc of ACM SIGCOMM. New York: ACM, 2017: 253–266

    [21]

    Agarwal S, Rajakrishnan S, Narayan A, et al. Sincronia: Near-optimal network design for coflows[C]//Proc of ACM SIGCOMM. New York: ACM, 2018: 16−29

    [22]

    Shah A, Chidambaram V, Cowan M, et al. TACCL: Guiding collective algorithm synthesis using communication sketches[C]//Proc of USENIX NSDI. Berkeley, CA: USENIX Association, 2023: 593−612

    [23]

    NVIDIA. NVLink and NVSwitch[EB/OL]. [2024-06-19]. https://www.nvidia.com/en-us/data-center/nvlink/

    [24]

    Meta. Meta’s evolution of network for AI[EB/OL]. [2023-11-01]. https://www.youtube.com/watch?v=5gOOtFySrqA

    [25]

    NVIDIA. NVIDIA DGX SuperPOD: Next generation scalable infrastructure for AI leadership[EB/OL]. [2023-09-22]. https://docs.nvidia.com/dgx-superpod-reference-architecture-dgx-h100.pdf

    [26]

    Al-Fares M, Loukissas A, Vahdat A. A scalable, commodity data center network architecture[C]//Proc of ACM SIGCOMM. New York: ACM, 2008: 63–74

    [27]

    Greenberg A, Hamilton J R, Jain N, et al. VL2: A scalable and flexible data center network[C]//Proc of ACM SIGCOMM. New York: ACM, 2009: 51–62

    [28]

    Bai Wei, Abdeen S S, Agrawal A, et al. Empowering Azure storage with RDMA[C]//Proc of USENIX NSDI. Berkeley, CA: USENIX Association, 2023: 49–67

    [29]

    Poutievski L, Mashayekhi O, Ong J, et al. Jupiter evolving: Transforming Google’s datacenter network via optical circuit switches and software-defined networking[C]//Proc of ACM SIGCOMM. New York: ACM, 2022: 66–85

    [30]

    The Linux Cluster. Linux bonding modes[EB/OL]. [2010-01-08]. https://thelinuxcluster.com/2010/01/08/linux-bonding-modes/

    [31]

    IEEE Standards Association. IEEE 802.3ad[EB/OL]. [2000-06-28]. https://standards.ieee.org/ieee/802.3ad/1088/

    [32]

    IEEE Standards Association. IEEE standard for information technology – local and metropolitan area networks – specific requirements – part 3: CSMA/CD access method and physical layer specifications amendment 5: Media access control parameters, physical layers, and management parameters for energy-efficient Ethernet[EB/OL]. [2010-10-27]. https://standards.ieee.org/ieee/802.3az/4270/

    [33]

    Zhang Zhehui, Zheng Haiyang, Hu Jiayao, et al. Hashing linearity enables relative path control in data centers[C]//Proc of USENIX ATC. Berkeley, CA: USENIX Association, 2021: 855–862

    [34]

    NVIDIA. InfiniBand networking solutions[EB/OL]. [2024-06-11]. https://www.nvidia.com/en-us/networking/products/infiniband/

Publication history
  • Received: 2024-06-24
  • Revised: 2024-09-19
  • Published online: 2024-09-26
  • Issue published: 2024-10-31
