Abstract:
Large language models with hundreds of billions of parameters are driving rapid technological innovation and business-model transformation in artificial intelligence and heterogeneous computing. However, training such models requires prolonged occupation of extensive hardware resources and therefore frequently incurs diverse software and hardware failures. These failures are not only challenging to diagnose but also substantially lengthen training time through wasted computation and slowed convergence. We propose Resilio, an elastic fault-tolerant system for training large language models that provides an efficient, automated fault-recovery mechanism. It targets the typical failure scenarios encountered during training, such as network interruptions, node crashes, and process failures. Leveraging the characteristics of parallel training strategies and the underlying hierarchical storage architecture, Resilio applies multi-layer optimizations to checkpoint read/write operations and provides a just-in-time (JIT) recovery mechanism. For models with 100B-scale parameters, Resilio reduces end-to-end recovery time to under 10 minutes while limiting the recomputation after an interruption to the cost of a single training iteration. When the available compute resources change, Resilio quickly identifies cluster configurations that enable optimal parallel training strategies. Combined with its fault-tolerant scheduling capability, the system adaptively and elastically allocates resources, substantially improving training efficiency and GPU utilization across large-scale computing clusters.