Citation: Zhai Ennan, Cao Jiamin, Qian Kun, Guan Yu. Towards Network Infrastructure Research for the Era of Large Language Models: Challenges, Practices, and Prospects[J]. Journal of Computer Research and Development, 2024, 61(11): 2664-2677. DOI: 10.7544/issn1000-1239.202440576
Large language models (LLMs) with hundreds of billions of parameters have brought significant technological and business transformations to today’s AI and cloud services. However, the network pattern of LLM training differs fundamentally from that of general cloud computing (e.g., the Amazon EC2 elastic compute service), leading to a variety of new challenges. These challenges mainly include load balancing difficulties caused by the different traffic pattern (Challenge 1), the impact of multi-job communication contention on GPU utilization (Challenge 2), and high sensitivity to network failures (Challenge 3). As a result, data center network technologies designed for general cloud computing (e.g., network architecture, routing, communication scheduling, and reliability) are no longer suitable for LLM training, which necessitates new data center networks and accompanying technical solutions built specifically for LLM training. We introduce Alibaba Cloud’s high-performance network (HPN) and the multi-job communication scheduling approach Crux, designed to address the aforementioned challenges. HPN introduces a two-layer, dual-plane network architecture, which not only achieves high-speed interconnectivity for 15 000 GPUs within a Pod but also ensures precise routing suitable for LLM training (addressing Challenge 1). Furthermore, HPN proposes a novel dual-top-of-rack (ToR) design, replacing the traditional single-ToR connection in data center networks and fundamentally avoiding the reliability risk of a single point of failure (partially addressing Challenge 3). To tackle Challenge 2, Crux reduces the NP-complete problem of optimizing GPU utilization to a communication scheduling problem related to GPU computational intensity, and proposes an algorithm that prioritizes the flows of jobs with higher GPU computational intensity, significantly reducing multi-job communication contention and improving GPU utilization. Compared with state-of-the-art efforts, Crux increases GPU utilization by up to 23%. Both HPN and Crux have been deployed and used in Alibaba Cloud production for over eight months and will continue to evolve and iterate. Building on this, we further envision possible research directions in the field of LLM training and inference, providing guidance for subsequent work.
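To make the intensity-based scheduling idea above concrete, the following is a minimal, hypothetical Python sketch of prioritizing flows by the GPU computational intensity of their jobs. It only illustrates the ordering principle described in the abstract; the class and function names, the intensity metric, and the priority-level mapping are assumptions for exposition, not the actual Crux implementation.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Job:
    name: str
    gpu_compute_intensity: float  # assumed metric, e.g., share of iteration time spent in GPU compute
    flow_ids: List[str]           # network flows generated by this training job

def assign_flow_priorities(jobs: List[Job], num_levels: int = 8) -> Dict[str, int]:
    # Rank jobs by GPU computational intensity; the most compute-intensive job
    # gets the highest network priority (level 0), so its communication is the
    # least likely to be stalled by contention from other jobs.
    ranked = sorted(jobs, key=lambda j: j.gpu_compute_intensity, reverse=True)
    priorities: Dict[str, int] = {}
    for rank, job in enumerate(ranked):
        level = min(rank, num_levels - 1)  # clamp the rank into the available priority classes
        for flow in job.flow_ids:
            priorities[flow] = level
    return priorities

# Example usage with two hypothetical jobs sharing the same network fabric.
jobs = [
    Job("llm-pretrain", gpu_compute_intensity=0.92, flow_ids=["f1", "f2"]),
    Job("vision-finetune", gpu_compute_intensity=0.55, flow_ids=["f3"]),
]
print(assign_flow_priorities(jobs))  # {'f1': 0, 'f2': 0, 'f3': 1}

In a real deployment such priority levels would have to be mapped onto switch queues or packet priority classes; the sketch captures only the ordering principle, not that mapping.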
[1] OpenAI, Achiam J, Adler S, et al. GPT-4 technical report[J]. arXiv preprint, arXiv: 2303.08774, 2024
[2] OpenAI. Introducing ChatGPT[EB/OL]. [2022-11-30]. https://openai.com/blog/chatgpt
[3] LMSYS ORG. Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality[EB/OL]. [2023-03-30]. https://lmsys.org/blog/2023-03-30-vicuna/
[4] NVIDIA. Megatron-LM[EB/OL]. [2024-06-19]. https://github.com/NVIDIA/Megatron-LM
[5] Microsoft Research. DeepSpeed[EB/OL]. [2024-06-19]. https://www.microsoft.com/en-us/research/project/deepspeed/
[6] Rajasekaran S, Ghobadi M, Akella A. CASSINI: Network-aware job scheduling in machine learning clusters[C]//Proc of USENIX NSDI. Berkeley, CA: USENIX Association, 2024: 1403–1420
[7] Jiang Ziheng, Lin Haibin, Zhong Yinmin, et al. MegaScale: Scaling large language model training to more than 10,000 GPUs[C]//Proc of USENIX NSDI. Berkeley, CA: USENIX Association, 2024: 745–760
[8] Qian Kun, Xi Yongqing, Cao Jiamin, et al. Alibaba HPN: A data center network for large language model training[C]//Proc of ACM SIGCOMM. New York: ACM, 2024: 691–706
[9] Cao Jiamin, Guan Yu, Qian Kun, et al. Crux: GPU-efficient communication scheduling for deep learning training[C]//Proc of ACM SIGCOMM. New York: ACM, 2024: 1–15
[10] Microsoft DeepSpeed. Model checkpointing[EB/OL]. [2023-01-31]. https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html
[11] Alizadeh M, Edsall T, Dharmapurikar S, et al. CONGA: Distributed congestion-aware load balancing for datacenters[C]//Proc of ACM SIGCOMM. New York: ACM, 2014: 503–514
[12] Dixit A, Prakash P, Hu Y C, et al. On the impact of packet spraying in data center networks[C]//Proc of IEEE INFOCOM. Piscataway, NJ: IEEE, 2013: 2130–2138
[13] Ghorbani S, Yang Zibin, Godfrey P B, et al. DRILL: Micro load balancing for low-latency data center networks[C]//Proc of ACM SIGCOMM. New York: ACM, 2017: 225–238
[14] Katta N, Hira M, Ghag A, et al. CLOVE: How I learned to stop worrying about the core and love the edge[C]//Proc of ACM Workshop on Hot Topics in Networks. New York: ACM, 2016: 155–161
[15] Katta N, Hira M, Kim C, et al. HULA: Scalable load balancing using programmable data planes[C]//Proc of ACM SOSR. New York: ACM, 2016: 1–12
[16] Qureshi M A, Cheng Yuchung, Yin Qianwen, et al. PLB: Congestion signals are simple and effective for network load balancing[C]//Proc of ACM SIGCOMM. New York: ACM, 2022: 207–218
[17] Sen S, Shue D, Ihm S, et al. Scalable, optimal flow routing in datacenters via local link balancing[C]//Proc of ACM CoNEXT. New York: ACM, 2013: 151–162
[18] Vanini E, Pan Rong, Alizadeh M, et al. Let it flow: Resilient asymmetric load balancing with flowlet switching[C]//Proc of USENIX NSDI. Berkeley, CA: USENIX Association, 2017: 407–420
[19] Zats D, Das T, Mohan P, et al. DeTail: Reducing the flow completion time tail in datacenter networks[C]//Proc of ACM SIGCOMM. New York: ACM, 2012: 139–150
[20] Zhang Hong, Zhang Junxue, Bai Wei, et al. Resilient datacenter load balancing in the wild[C]//Proc of ACM SIGCOMM. New York: ACM, 2017: 253–266
[21] Agarwal S, Rajakrishnan S, Narayan A, et al. Sincronia: Near-optimal network design for coflows[C]//Proc of ACM SIGCOMM. New York: ACM, 2018: 16–29
[22] Shah A, Chidambaram V, Cowan M, et al. TACCL: Guiding collective algorithm synthesis using communication sketches[C]//Proc of USENIX NSDI. Berkeley, CA: USENIX Association, 2023: 593–612
[23] NVIDIA. NVLink and NVSwitch[EB/OL]. [2024-06-19]. https://www.nvidia.com/en-us/data-center/nvlink/
[24] Meta. Meta’s evolution of network for AI[EB/OL]. [2023-11-01]. https://www.youtube.com/watch?v=5gOOtFySrqA
[25] NVIDIA. NVIDIA DGX SuperPOD: Next generation scalable infrastructure for AI leadership[EB/OL]. [2023-09-22]. https://docs.nvidia.com/dgx-superpod-reference-architecture-dgx-h100.pdf
[26] Al-Fares M, Loukissas A, Vahdat A. A scalable, commodity data center network architecture[C]//Proc of ACM SIGCOMM. New York: ACM, 2008: 63–74
[27] Greenberg A, Hamilton J R, Jain N, et al. VL2: A scalable and flexible data center network[C]//Proc of ACM SIGCOMM. New York: ACM, 2009: 51–62
[28] Bai Wei, Abdeen S S, Agrawal A, et al. Empowering Azure storage with RDMA[C]//Proc of USENIX NSDI. Berkeley, CA: USENIX Association, 2023: 49–67
[29] Poutievski L, Mashayekhi O, Ong J, et al. Jupiter evolving: Transforming Google’s datacenter network via optical circuit switches and software-defined networking[C]//Proc of ACM SIGCOMM. New York: ACM, 2022: 66–85
[30] Linux GNU. Linux bonding modes[EB/OL]. [2010-01-08]. https://thelinuxcluster.com/2010/01/08/linux-bonding-modes/
[31] IEEE SA. IEEE 802.3ad[EB/OL]. [2000-06-28]. https://standards.ieee.org/ieee/802.3ad/1088/
[32] IEEE SA. IEEE standard for information technology – local and metropolitan area networks – specific requirements – part 3: CSMA/CD access method and physical layer specifications amendment 5: Media access control parameters, physical layers, and management parameters for energy-efficient Ethernet[EB/OL]. [2010-10-27]. https://standards.ieee.org/ieee/802.3az/4270/
[33] Zhang Zhehui, Zheng Haiyang, Hu Jiayao, et al. Hashing linearity enables relative path control in data centers[C]//Proc of USENIX ATC. Berkeley, CA: USENIX Association, 2021: 855–862
[34] NVIDIA. InfiniBand networking solutions[EB/OL]. [2024-06-11]. https://www.nvidia.com/en-us/networking/products/infiniband/