Li Yan, Yang Sile, Liu Chengchun, Wang Linmei, Tian Yaolin, Zhang Xinhang, Zhu Yu, Li Chunpu, Sun Lei, Yan Shengen, Xiao Limin, Zhang Weifeng. Resilio: An Elastic Fault-tolerant Training System for Large Language Models[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550147

Resilio: An Elastic Fault-tolerant Training System for Large Language Models

Funds: This work was supported by the National Key Research and Development Program of China (2024YFB4505703).


  • Author Bio:

    Li Yan: born in 1985. PhD, senior engineer. Member of CCF. His main research interests include high performance computing, heterogeneous computing, and deep learning

    Yang Sile: born in 1994. Master. His main research interests include heterogeneous computing and network communication

    Liu Chengchun: born in 1992. Master. His main research interests include high performance computing and heterogeneous computing

    Wang Linmei: born in 1985. Master. Member of CCF. Her main research interests include high performance computing and heterogeneous computing

    Tian Yaolin: born in 1999. Master. Her main research interests include optimization of LLMs, heterogeneous computing, and visual SLAM

    Zhang Xinhang: born in 2000. Master. His main research interest is the distributed training of deep learning models

    Zhu Yu: born in 1998. PhD candidate. His main research interests include LLM training architecture design and multi-core architecture

    Li Chunpu: born in 1978. Bachelor. Her main research interests include heterogeneous computing and cloud computing

    Sun Lei: born in 1980. Bachelor. His main research interests include heterogeneous computing and edge computing

    Yan Shengen: born in 1986. PhD, associate professor. His main research interests include heterogeneous computing, intelligent computing, and machine learning systems

    Xiao Limin: born in 1970. PhD. Distinguished member of CCF. Distinguished researcher at Lenovo Research. His main research interests include computer architecture, heterogeneous intelligent computing, high performance computing, and intelligent computing chips

    Zhang Weifeng: born in 1968. PhD. Chair of the AI Co-Design Workgroup at the Open Compute Project (OCP) Foundation and a member of the OCP Future Technology Symposium (FTS) Program Committee. Corporate VP at Lenovo Group and head of the Intelligent Computing Infrastructure & Wireless Research Labs at Lenovo Research, leading research on large-scale heterogeneous computing and next-generation wireless communication technologies

  • Received Date: February 28, 2025
  • Revised Date: April 09, 2025
  • Available Online: April 14, 2025
  • Large language models with hundreds of billions of parameters are driving rapid technological innovation and business-model transformation in artificial intelligence and heterogeneous computing. Training such models, however, requires prolonged occupation of extensive hardware resources and therefore suffers frequent and diverse software/hardware failures. These failures are not only difficult to diagnose but also substantially prolong training through wasted computation and slowed convergence. Resilio, an elastic fault-tolerant training system for large language models, is proposed to provide an efficient, automated fault recovery mechanism. It targets typical failure scenarios during training, such as network interruptions, node crashes, and process failures. Leveraging the characteristics of parallel training strategies and the underlying hierarchical storage architecture, Resilio applies multi-layer optimizations to checkpoint read/write operations and just-in-time (JIT) recovery. For models with 100B-scale parameters, Resilio reduces the end-to-end recovery time to under 10 minutes while limiting the computation re-executed after an interruption to the cost of a single training iteration. When the available computing resources change, Resilio quickly identifies cluster configurations that enable the optimal parallel training strategy. Combined with its fault-tolerant scheduling capability, the system allocates resources adaptively and elastically, greatly improving training efficiency and GPU utilization across large-scale computing clusters.
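
As an illustration of the hierarchical checkpointing idea summarized above, the following minimal Python sketch snapshots the training state from GPU to host memory on the critical path and lets a background thread flush that copy to persistent storage, so the next iteration can start immediately. This is not Resilio's implementation; the function name async_save and the path CKPT_DIR are illustrative assumptions.

    import os
    import threading
    import torch

    CKPT_DIR = "/mnt/ckpt"  # assumed persistent storage path (e.g. a shared file system)

    def async_save(model, optimizer, step):
        # Stage 1 (blocking, fast): copy the model state from GPU to host memory.
        cpu_state = {
            "step": step,
            "model": {k: v.detach().cpu() for k, v in model.state_dict().items()},
            "optim": optimizer.state_dict(),  # kept as-is here for brevity
        }

        # Stage 2 (background): persist the host copy without stalling training.
        def _flush():
            tmp = os.path.join(CKPT_DIR, f"step_{step}.pt.tmp")
            final = os.path.join(CKPT_DIR, f"step_{step}.pt")
            torch.save(cpu_state, tmp)
            os.replace(tmp, final)  # atomic rename: a reader never sees a partial checkpoint

        t = threading.Thread(target=_flush, daemon=True)
        t.start()
        return t  # callers may join() before taking the next snapshot

The atomic rename matters for recovery: a process restarting after a failure never loads a partially written checkpoint file.

In the same spirit, the elastic adaptation step can be sketched as a small search over (tensor, pipeline, data) parallel factorizations of the surviving GPU count. The scoring heuristic below is a placeholder assumption, not the paper's cost model, and all parameter names are illustrative.

    def candidate_layouts(num_gpus, gpus_per_node=8, num_layers=96):
        # Enumerate feasible (tp, pp, dp) factorizations of the surviving GPUs.
        layouts = []
        for tp in (1, 2, 4, 8):  # keep tensor parallelism within a node
            if tp > gpus_per_node or num_gpus % tp:
                continue
            for pp in range(1, num_gpus // tp + 1):  # pipeline stages must divide the layers
                if (num_gpus // tp) % pp or num_layers % pp:
                    continue
                layouts.append((tp, pp, num_gpus // (tp * pp)))  # remainder is data parallel
        return layouts

    def pick_layout(num_gpus, global_batch=1024, micro_batch=1):
        # Pick the highest-scoring layout under a toy heuristic.
        best, best_score = None, float("-inf")
        for tp, pp, dp in candidate_layouts(num_gpus):
            if global_batch % (dp * micro_batch):
                continue  # global batch must divide evenly across data-parallel ranks
            # Toy heuristic: favor data parallelism, penalize pipeline bubbles and
            # tensor-parallel communication volume.
            score = dp - 0.5 * pp - 0.25 * tp
            if score > best_score:
                best, best_score = (tp, pp, dp), score
        return best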

  • [1]
    Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proc of the 31st Int Conf on Neural Information Processing Systems (NIPS). Red Hook, NY: Curran Associates, 2017: 6000–6010
    [2]
    翟恩南,操佳敏,钱坤,等. 面向大模型时代的网络基础设施研究:挑战、阶段成果与展望[J]. 计算机研究与发展,2024,61(11):2664−2677 doi: 10.7544/issn1000-1239.202440576

    Zhai Ennan, Cao Jiamin, Qiankun, et al. Research on network infrastructure in the era of large models: Challenges, stage achievements, and prospects[J]. Journal of Computer Research and Development, 2024, 61(11): 2664−2677 (in Chinese) doi: 10.7544/issn1000-1239.202440576
    [3]
    Zhang Susan, Roller S, Goyal N, et al. Opt: Open pre-trained transformer language models[J]. arXiv preprint, arXiv: 2205.01068, 2022
    [4]
    Dubey A, Jauhri A, Pandey A, et al. The Llama 3 herd of models[J]. arXiv preprint, arXiv: 2407.21783, 2024
    [5]
    Floridi L, Chiriatti M. GPT-3: Its nature, scope, limits, and consequences[J]. Minds and Machines, 2020, 30: 681−694 doi: 10.1007/s11023-020-09548-1
    [6]
    Meta. Pytorch: Tensors and dynamic neural networks in Python with strong GPU acceleration[CP/OL]. 2016[2024-02-19]. https://github.com/pytorch/pytorch
    [7]
    Google. Tensorflow: An open source machine learning framework for everyone[CP/OL]. 2015[2024-02-19]. https://github.com/tensorflow/te-nsorflow
    [8]
    Microsoft. DeepSpeed[EB/OL]. 2023[2024-02-19]. https://github.com/microsoft/DeepSpeed
    [9]
    Shoeybi M, Patwary M, Puri R, et al. Megatron-LM: Training multi-billion parameter language models using model parallelism[J]. arXiv preprint, arXiv: 1909.08053, 2019
    [10]
    Wu Baodong, Xia Lei, Li Qingping, et al. TRANSOM: An efficient fault-tolerant system for training LLMs[J]. arXiv preprint, arXiv: 2310.10046, 2023
    [11]
    Ant Group. DLRover: An automatic distributed deep learning system[CP/OL]. 2025[2025-02-19]. https://github.com/intelligent-machine-learning/dlrover
    [12]
    Ant Group. DLRover’s technical practice of training stability assurance for thousands of calorie-level large models on Kubernetes[EB/OL]. 2025[2025-02-19]. https://github.com/intelligent-machine-learning/dlr- over/blob/master/docs/blogs/stabilize_llm_training_cn.md
    [13]
    Mohan J, Phanishayee A, Chidambaram V. CheckFreq: Frequent, fine-grained DNN checkpointing[C]//Proc of the 19th USENIX Conf on File and Storage Technologies (FAST). Berkeley, CA: USENIX Association, 2021: 203−216
    [14]
    Wang Guanhua, Ruwase O, Xie Bing, et al. FastPersist: Accelerating model checkpointing in deep learning[J]. arXiv preprint, arXiv: 2406.13768, 2024
    [15]
    Gupta T, Krishnan S, Kumar R, et al. Just-in-time checkpointing: Low cost error recovery from deep learning training failures[C]//Proc of the 19th European Conf on Computer Systems (EuroSys). New York: ACM, 2024: 1110−1125
    [16]
    Wang Zhuang, Jia Zhen, Zheng Shuai, et al. GEMINI: Fast failure recovery in distributed training with in-memory checkpoints[C]//Proc of the 29th Symp on Operating Systems Principles (SOSP). New York: ACM, 2023: 364−381
    [17]
    He Tao, Li Xue, Wang Zhibin, et al. Unicron: Economizing self-healing LLM training at scale[J]. arXiv preprint, arXiv: 2401.00134, 2023
    [18]
    Xiang Wu, Li Yakun, Ren Yuquan, et al. Gödel: Unified large-scale resource management and scheduling at ByteDance[C]//Proc of the 14th ACM Symp on Cloud Computing (SoCC). New York: ACM, 2023: 308−323
    [19]
    Liu K, Kosaian J, Rashmi K V. ECRM: Efficient fault tolerancefor recommendation model training via erasure coding[J]. arXiv preprint, arXiv: 2104.01981, 2021
    [20]
    Hu Qinghao, Ye Zhisheng, Wang Zerui, et al. Characterization of large language model development in the datacenter[C]//Proc of the 21st USENIX Symp on Networked Systems Design and Implementation (NSDI). Berkeley, CA: USENIX Association, 2024: 709−729
    [21]
    Wu Tianyuan, Wang Wei, Yu Yinghao, et al. FALCON: Pinpointing and mitigating stragglers for large-scale hybrid-parallel training[J]. arXiv preprint, arXiv: 2410.12588, 2024
    [22]
    Lao C L, Yu Minlan, Akella A, et al. TrainMover: Efficient ML training live migration with no memory overhead[J]. arXiv preprint, arXiv: 2412.12636, 2024
    [23]
    Jiang Ziheng, Lin Haibin, Zhong Yinmin, et al. MegaScale: Scaling large language model training to more than 10, 000 GPUs[C]//Proc of the 21st USENIX Symp on Networked Systems Design and Implementation (NSDI). Berkeley, CA: USENIX Association, 2024: 745−760
    [24]
    Eisenman A, Matam K K, Ingram S, et al. Check-N-Run: A checkpointing system for training deep learning recommendation models[C]//Proc of the 19th USENIX Symp on Networked Systems Design and Implementation (NSDI). Berkeley, CA: USENIX Association, 2022: 929−943
    [25]
    Li Mingzhen, Xiao Wencong, Yang Hailong, et al. EasyScale: Elastic training with consistent accuracy and improved utilization on GPUs[C]//Proc of the Int Conf for High Performance Computing, Networking, Storage and Analysis (SC). New York: ACM, 2023: 1−14
    [26]
    Subramanya S J, Arfeen D, Lin Shouxu, et al. Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling[C]//Proc of the 29th Symp on Operating Systems Principles (SOSP). New York: ACM, 2023: 642−657
    [27]
    Wagenländer M, Li Guo, Zhao Bo, et al. Tenplex: Dynamic parallelism for deep learning using parallelizable tensor collections[C]//Proc of the 30th Symp on Operating Systems Principles (SOSP). New York: ACM, 2024: 195−210
    [28]
    NVIDIA. NCCL Tests: Check both the performance and the correctness of NCCL operations[CP/OL]. 2017[2025-02-19]. https://github.com/N-VIDIA/nccl-tests
    [29]
    Zheng Lianmin, Li Zhuohan, Zhang Hao, et al. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning[C]//Proc of the 16th USENIX Symp on Operating Systems Design and Implementation (OSDI). Berkeley, CA: USENIX Association, 2022: 559−578
    [30]
    Google. XLA: An open-source machine learning (ML) compiler for GPUs, CPUs, and ML accelerators[CP/OL]. 2022[2025-02-19]. https://github.com/openxla/xla
    [31]
    Chowdhery A, Narang S, Devlin J, et al. Palm: Scaling language modeling with pathways[J]. Journal of Machine Learning Research, 2023, 24(240): 1−113
