Citation: Li Yan, Yang Sile, Liu Chengchun, Wang Linmei, Tian Yaolin, Zhang Xinhang, Zhu Yu, Li Chunpu, Sun Lei, Yan Shengen, Xiao Limin, Zhang Weifeng. Resilio: An Elastic Fault-tolerant Training System for Large Language Models[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550147
Large language models with hundreds of billions of parameters are driving rapid technological innovation and business model transformation in artificial intelligence and heterogeneous computing. However, training such models requires prolonged occupation of extensive hardware resources and therefore suffers frequent and diverse software and hardware failures. These failures are not only difficult to diagnose but also substantially prolong training, because computation is wasted and convergence slows after each interruption. Resilio, an elastic fault-tolerant training system for large language models, is proposed to provide an efficient, automated fault recovery mechanism. It targets the typical failure scenarios encountered during training, such as network interruptions, node crashes, and process failures. Leveraging the characteristics of parallel model training strategies and the underlying hierarchical storage architecture, Resilio applies multi-layer optimizations to checkpoint read/write operations and implements a just-in-time (JIT) recovery mechanism. For models at the 100B-parameter scale, Resilio reduces end-to-end recovery time to under 10 minutes, while limiting the computation that must be redone after an interruption to the cost of a single training iteration. When the available computing resources change, Resilio quickly identifies cluster configurations that enable optimal parallel training strategies. Combined with its fault-tolerant scheduling capability, the system provides adaptive and elastic resource allocation that greatly improves training efficiency and GPU utilization across large-scale computing clusters.
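As a rough illustration of the hierarchical, asynchronous checkpointing idea referred to in the abstract (snapshot training state to host memory quickly, then persist it to shared storage in the background so training can continue), the following PyTorch sketch shows the general pattern. It is a minimal sketch under assumed interfaces: the class name HierarchicalCheckpointer, its methods, and the checkpoint path layout are hypothetical and do not correspond to Resilio's actual implementation or API.

```python
# Illustrative sketch of asynchronous, two-level checkpointing in PyTorch.
# All names here are hypothetical; this is NOT Resilio's actual interface.
import copy
import threading
import torch


class HierarchicalCheckpointer:
    def __init__(self, path: str):
        self.path = path        # shared/persistent storage directory
        self._worker = None     # background persist thread

    def snapshot(self, model, optimizer, step: int):
        """Copy training state to host memory (brief blocking step),
        then write it to shared storage in a background thread."""
        state = {
            "step": step,
            "model": {k: v.detach().to("cpu", copy=True)
                      for k, v in model.state_dict().items()},
            "optimizer": copy.deepcopy(optimizer.state_dict()),
        }
        if self._worker is not None:
            self._worker.join()  # avoid overlapping two persists
        self._worker = threading.Thread(
            target=torch.save, args=(state, f"{self.path}/ckpt_{step}.pt"))
        self._worker.start()     # training resumes while the write proceeds

    def restore(self, model, optimizer, step: int) -> int:
        """Reload a checkpoint after a failure; a real system would first
        locate the most recent complete checkpoint."""
        state = torch.load(f"{self.path}/ckpt_{step}.pt", map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]
```

In this pattern, only the device-to-host copy stalls the training loop; the slow write to remote storage overlaps with subsequent iterations, which is why frequent checkpoints (and hence recomputation limited to roughly one iteration after a failure) become affordable.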