Citation: Zhai Ennan, Cao Jiamin, Qian Kun, Guan Yu. Towards Network Infrastructure Research for the Era of Large Language Models: Challenges, Practices, and Prospects[J]. Journal of Computer Research and Development, 2024, 61(11): 2664-2677. DOI: 10.7544/issn1000-1239.202440576
Large language models (LLMs) with hundreds of billions of parameters have brought significant technological and business transformations to today’s AI and cloud services. However, there is a fundamental difference in network traffic patterns between LLM training and general cloud computing (e.g., the Amazon EC2 elastic compute service), which leads to a variety of new challenges. These challenges mainly include load-balancing difficulties caused by the different traffic pattern (Challenge 1), the impact of communication contention among multiple jobs on GPU utilization (Challenge 2), and high sensitivity to network failures (Challenge 3). As a result, data center network technologies designed for general cloud computing (e.g., network architecture, routing, communication scheduling, and reliability mechanisms) are no longer suitable for today’s LLM training, which necessitates new data center networks and accompanying technical solutions built specifically for LLM training. We introduce Alibaba Cloud’s high-performance network (HPN) and the multi-job communication scheduling approach Crux, both designed to address the aforementioned challenges. HPN introduces a two-layer, dual-plane network architecture that not only provides high-speed interconnectivity for 15 000 GPUs within a pod but also enables the precise routing required by LLM training (addressing Challenge 1). Furthermore, HPN proposes a novel dual-top-of-rack (ToR) design that replaces the traditional single-ToR connection in data center networks and fundamentally avoids the single-point-of-failure reliability risk (partially addressing Challenge 3). To tackle Challenge 2, Crux reduces the NP-complete problem of optimizing GPU utilization to a communication scheduling problem related to GPU computational intensity, and proposes an algorithm that prioritizes the flows of jobs with higher GPU computational intensity, significantly reducing multi-job communication contention and improving GPU utilization. Compared with state-of-the-art efforts, Crux increases GPU utilization by up to 23%. Both HPN and Crux have been deployed and used in Alibaba Cloud production for over eight months and will continue to evolve and iterate. Building on this, we further envision possible research directions in the field of LLM training and inference, providing guidance for subsequent work.
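To make the scheduling idea summarized above more concrete, the following is a minimal, hypothetical Python sketch of prioritizing the flows of jobs with higher GPU computational intensity. It is not the actual Crux implementation; the Job class, the gpu_intensity proxy, and assign_flow_priorities are illustrative assumptions about how one might rank jobs and map them to network priority classes.

```python
# Illustrative sketch only -- NOT the actual Crux algorithm.
# Idea from the abstract: when multiple training jobs share the network,
# give the flows of jobs with higher GPU computational intensity a higher
# scheduling priority, so contention delays the jobs whose GPU utilization
# is least sensitive to communication stalls.
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpu_compute_time_ms: float   # per-iteration GPU computation time
    comm_time_ms: float          # per-iteration communication time

    @property
    def gpu_intensity(self) -> float:
        # A plausible proxy for "GPU computational intensity": the fraction
        # of an iteration spent computing rather than communicating.
        return self.gpu_compute_time_ms / (self.gpu_compute_time_ms + self.comm_time_ms)


def assign_flow_priorities(jobs: list[Job], num_priorities: int = 8) -> dict[str, int]:
    """Map each job to a network priority class (0 = highest).

    Jobs with higher GPU computational intensity are ranked first, so their
    communication phases are less likely to be delayed by other jobs' traffic.
    """
    ranked = sorted(jobs, key=lambda j: j.gpu_intensity, reverse=True)
    priorities = {}
    for rank, job in enumerate(ranked):
        # Clamp to the number of priority classes the fabric supports.
        priorities[job.name] = min(rank, num_priorities - 1)
    return priorities


if __name__ == "__main__":
    jobs = [
        Job("llm-70b", gpu_compute_time_ms=180.0, comm_time_ms=40.0),
        Job("llm-7b",  gpu_compute_time_ms=50.0,  comm_time_ms=45.0),
        Job("vision",  gpu_compute_time_ms=20.0,  comm_time_ms=60.0),
    ]
    print(assign_flow_priorities(jobs))  # e.g. {'llm-70b': 0, 'llm-7b': 1, 'vision': 2}
```

In this sketch, the priority classes would then be carried by the flows of each job (for example, via packet priority marking) so that switches resolve contention in favor of the more compute-intensive jobs; the real system's modeling and scheduling are described in the Crux paper.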