    Resilio: An Elastic Fault-tolerant Training System for Large Language Models

      Abstract: Large language models with hundreds of billions of parameters are driving rapid technological innovation and business-model transformation in artificial intelligence and heterogeneous computing. However, training such models occupies extensive hardware resources for long periods, so diverse software and hardware failures occur frequently. These failures are not only difficult to diagnose but also lead to long interruptions, wasted computation, and slower training progress. This paper proposes Resilio, an elastic fault-tolerant system for training large language models, which provides an efficient, automated fault-recovery mechanism targeting typical failure scenarios such as network interruptions, node crashes, and process failures. Leveraging the characteristics of parallel model-training strategies and the underlying hierarchical storage architecture, Resilio applies multi-level optimizations to checkpoint read/write operations together with a just-in-time (JIT) checkpointing mechanism. For models with 100B-scale parameters, Resilio reduces the end-to-end recovery time to under 10 minutes and limits the recomputation after an interruption to the cost of a single training iteration. When cluster resources change elastically, Resilio quickly identifies the cluster configuration that enables the optimal parallel training strategy; combined with its fault-tolerant scheduling component, the system adaptively and elastically allocates training resources to improve training efficiency and GPU utilization across large-scale computing clusters.
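The abstract describes multi-level checkpoint optimization over a hierarchical storage architecture but gives no implementation details. As a rough illustration of the general idea — the training loop blocks only on a write to a fast local tier, while replication to slower persistent storage happens in the background — here is a minimal sketch in plain Python. All class, method, and directory names are hypothetical and are not taken from Resilio:

```python
import json
import os
import shutil
import tempfile
import threading

class HierarchicalCheckpointer:
    """Hypothetical two-tier checkpointer: synchronous fast tier,
    asynchronous replication to a persistent tier."""

    def __init__(self, fast_dir, persistent_dir):
        self.fast_dir = fast_dir
        self.persistent_dir = persistent_dir
        os.makedirs(fast_dir, exist_ok=True)
        os.makedirs(persistent_dir, exist_ok=True)
        self._pending = []

    def save(self, step, state):
        # Blocking part: write the state to the fast tier
        # (standing in for local NVMe or pinned host memory).
        path = os.path.join(self.fast_dir, f"ckpt_{step}.json")
        with open(path, "w") as f:
            json.dump(state, f)
        # Non-blocking part: copy to persistent storage in the background,
        # so the training loop can resume immediately.
        t = threading.Thread(target=shutil.copy,
                             args=(path, self.persistent_dir))
        t.start()
        self._pending.append(t)
        return path

    def wait(self):
        # Drain outstanding background replications (e.g. before exit).
        for t in self._pending:
            t.join()
        self._pending.clear()

    def latest(self):
        # Recovery prefers the fast tier; if it is empty (e.g. the node
        # was replaced), fall back to the persistent tier.
        for d in (self.fast_dir, self.persistent_dir):
            names = sorted(os.listdir(d),
                           key=lambda n: int(n.split("_")[1].split(".")[0]))
            if names:
                with open(os.path.join(d, names[-1])) as f:
                    return json.load(f)
        return None
```

In this sketch only the fast-tier write sits on the critical path of a training step; a real system would additionally snapshot GPU state, deduplicate across data-parallel replicas, and verify replication before pruning old checkpoints.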
