Citation: Zhai Ennan, Cao Jiamin, Qian Kun, Guan Yu. Towards Network Infrastructure Research for the Era of Large Language Models: Challenges, Practices, and Prospects[J]. Journal of Computer Research and Development, 2024, 61(11): 2664-2677. DOI: 10.7544/issn1000-1239.202440576
Large language models (LLMs) with hundreds of billions of parameters have brought significant technological and business transformations to today’s AI and cloud services. However, there is a fundamental difference in network traffic patterns between LLM training and general cloud computing (e.g., the Amazon EC2 elastic compute service), which leads to a variety of new challenges. These challenges mainly include load-balancing difficulties caused by the different traffic pattern (Challenge 1), the impact of communication contention among multiple jobs on GPU utilization (Challenge 2), and high sensitivity to network failures (Challenge 3). As a result, data center network technologies designed for general cloud computing (e.g., network architecture, routing, communication scheduling, and reliability mechanisms) are no longer suitable for today’s LLM training, which necessitates new data center networks and accompanying technical solutions built specifically for LLM training. We introduce Alibaba Cloud’s high-performance network (HPN) and the multi-job communication scheduling approach Crux, both designed to address the aforementioned challenges. HPN introduces a two-layer, dual-plane network architecture that not only provides high-speed interconnectivity for 15 000 GPUs within a pod but also enables the precise routing required by LLM training (addressing Challenge 1). Furthermore, HPN proposes a novel dual-top-of-rack (ToR) design that replaces the traditional single-ToR connection in data center networks and fundamentally avoids the single-point-of-failure reliability risk (partially addressing Challenge 3). To tackle Challenge 2, Crux reduces the NP-complete problem of optimizing GPU utilization to a communication scheduling problem related to GPU computational intensity, and proposes an algorithm that prioritizes the flows of jobs with higher GPU computational intensity, significantly reducing multi-job communication contention and improving GPU utilization. Compared with state-of-the-art efforts, Crux increases GPU utilization by up to 23%. Both HPN and Crux have been deployed and used in Alibaba Cloud production for over eight months and will continue to evolve and iterate. Building on this, we further envision possible research directions in the field of LLM training and inference, providing guidance for subsequent work.
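To make the scheduling idea summarized above more concrete, the following is a minimal, hypothetical Python sketch of prioritizing the flows of jobs with higher GPU computational intensity. It is not the actual Crux implementation; the Job class, the gpu_intensity proxy, and assign_flow_priorities are illustrative assumptions about how one might rank jobs and map them to network priority classes.

```python
# Illustrative sketch only -- NOT the actual Crux algorithm.
# Idea from the abstract: when multiple training jobs share the network,
# give the flows of jobs with higher GPU computational intensity a higher
# scheduling priority, so contention delays the jobs whose GPU utilization
# is least sensitive to communication stalls.
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpu_compute_time_ms: float   # per-iteration GPU computation time
    comm_time_ms: float          # per-iteration communication time

    @property
    def gpu_intensity(self) -> float:
        # A plausible proxy for "GPU computational intensity": the fraction
        # of an iteration spent computing rather than communicating.
        return self.gpu_compute_time_ms / (self.gpu_compute_time_ms + self.comm_time_ms)


def assign_flow_priorities(jobs: list[Job], num_priorities: int = 8) -> dict[str, int]:
    """Map each job to a network priority class (0 = highest).

    Jobs with higher GPU computational intensity are ranked first, so their
    communication phases are less likely to be delayed by other jobs' traffic.
    """
    ranked = sorted(jobs, key=lambda j: j.gpu_intensity, reverse=True)
    priorities = {}
    for rank, job in enumerate(ranked):
        # Clamp to the number of priority classes the fabric supports.
        priorities[job.name] = min(rank, num_priorities - 1)
    return priorities


if __name__ == "__main__":
    jobs = [
        Job("llm-70b", gpu_compute_time_ms=180.0, comm_time_ms=40.0),
        Job("llm-7b",  gpu_compute_time_ms=50.0,  comm_time_ms=45.0),
        Job("vision",  gpu_compute_time_ms=20.0,  comm_time_ms=60.0),
    ]
    print(assign_flow_priorities(jobs))  # e.g. {'llm-70b': 0, 'llm-7b': 1, 'vision': 2}
```

In this sketch, the priority classes would then be carried by the flows of each job (for example, via packet priority marking) so that switches resolve contention in favor of the more compute-intensive jobs; the real system's modeling and scheduling are described in the Crux paper.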