Abstract:
As the number of parameters in deep learning models continues to grow, so does the cost of training them. Spot instances offered by cloud service providers have become a viable way to reduce this cost: they are typically priced at only about 30% of on-demand instances, which can significantly lower training expenses. However, spot instances can be reclaimed at any time, posing new challenges to the stability of the model training system. Existing fault-tolerance work for this setting falls mainly into two categories: checkpoint-based and redundancy-based. Checkpoint-based solutions incur substantial overhead, while redundancy-based solutions constrain the model's parallelism strategy and thus yield suboptimal training efficiency. This paper proposes a training solution for spot instances that leverages the grace period granted before an instance is reclaimed to back up training progress, thereby reducing fault-tolerance overhead. It further employs a bottleneck-alleviation approach to adjust the parallelism strategy, making full use of the available cluster resources to improve training efficiency. Experimental results show that the proposed solution achieves low-cost fault tolerance while preserving training efficiency, enabling model training tasks to complete efficiently on spot instances and reducing overall training costs.