Citation: Li Yan, Yang Sile, Liu Chengchun, Wang Linmei, Tian Yaolin, Zhang Xinhang, Zhu Yu, Li Chunpu, Sun Lei, Yan Shengen, Xiao Limin, Zhang Weifeng. Resilio: An Elastic Fault-tolerant Training System for Large Language Models[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550147
Large language models with hundreds of billions of parameters are driving rapid technological innovation and business model transformation in artificial intelligence and heterogeneous computing. However, training such models requires prolonged occupation of extensive hardware resources and therefore suffers frequent and diverse software and hardware failures. These failures are not only difficult to diagnose but also substantially prolong training, because computation is wasted and convergence slows after each interruption. Resilio, an elastic fault-tolerant training system for large language models, is proposed to provide an efficient, automated fault recovery mechanism. It targets the typical failure scenarios encountered during training, such as network interruptions, node crashes, and process failures. Leveraging the characteristics of parallel model training strategies and the underlying hierarchical storage architecture, Resilio applies multi-layer optimizations to checkpoint read/write operations and implements a just-in-time (JIT) recovery mechanism. For models at the 100B-parameter scale, Resilio reduces end-to-end recovery time to under 10 minutes, while limiting the computation that must be redone after an interruption to the cost of a single training iteration. When the available computing resources change, Resilio quickly identifies cluster configurations that enable optimal parallel training strategies. Combined with its fault-tolerant scheduling capability, the system provides adaptive and elastic resource allocation that greatly improves training efficiency and GPU utilization across large-scale computing clusters.
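As a rough illustration of the hierarchical, asynchronous checkpointing idea referred to in the abstract (snapshot training state to host memory quickly, then persist it to shared storage in the background so training can continue), the following PyTorch sketch shows the general pattern. It is a minimal sketch under assumed interfaces: the class name HierarchicalCheckpointer, its methods, and the checkpoint path layout are hypothetical and do not correspond to Resilio's actual implementation or API.

```python
# Illustrative sketch of asynchronous, two-level checkpointing in PyTorch.
# All names here are hypothetical; this is NOT Resilio's actual interface.
import copy
import threading
import torch


class HierarchicalCheckpointer:
    def __init__(self, path: str):
        self.path = path        # shared/persistent storage directory
        self._worker = None     # background persist thread

    def snapshot(self, model, optimizer, step: int):
        """Copy training state to host memory (brief blocking step),
        then write it to shared storage in a background thread."""
        state = {
            "step": step,
            "model": {k: v.detach().to("cpu", copy=True)
                      for k, v in model.state_dict().items()},
            "optimizer": copy.deepcopy(optimizer.state_dict()),
        }
        if self._worker is not None:
            self._worker.join()  # avoid overlapping two persists
        self._worker = threading.Thread(
            target=torch.save, args=(state, f"{self.path}/ckpt_{step}.pt"))
        self._worker.start()     # training resumes while the write proceeds

    def restore(self, model, optimizer, step: int) -> int:
        """Reload a checkpoint after a failure; a real system would first
        locate the most recent complete checkpoint."""
        state = torch.load(f"{self.path}/ckpt_{step}.pt", map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]
```

In this pattern, only the device-to-host copy stalls the training loop; the slow write to remote storage overlaps with subsequent iterations, which is why frequent checkpoints (and hence recomputation limited to roughly one iteration after a failure) become affordable.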