Abstract:
As the number of parameters in deep learning models continues to grow, so does the cost of training them. Spot instances offered by cloud service providers have become a viable way to reduce this cost: they are typically priced at only about 30% of on-demand instances, which can significantly lower training expenses. However, spot instances can be reclaimed at any time, posing new challenges to the stability of the model training system. Existing fault-tolerance work for this setting falls mainly into two categories: checkpoint-based and redundancy-based. Checkpoint-based solutions incur substantial overhead, while redundancy-based solutions constrain the model's parallelism strategy and thus yield suboptimal training efficiency. This paper proposes a training solution for spot instances that leverages the grace period granted before an instance is reclaimed to back up training progress, thereby reducing fault-tolerance overhead. It further employs a bottleneck-alleviation approach to adjust the parallelism strategy, making full use of the available cluster resources to improve training efficiency. Experimental results show that the proposed solution achieves low-cost fault tolerance while preserving training efficiency, enabling model training tasks to complete efficiently on spot instances and reducing overall training costs.