Abstract:
Deep neural networks (DNNs) have been widely applied across many areas of society. Increasing the size of a DNN model can significantly improve its accuracy; however, training such a model on a single GPU takes considerable time. Consequently, training large-scale DNN models in parallel on GPU clusters via distributed deep learning (DDL) has attracted much attention from both industry and academia. Motivated by this, we propose a dynamic resource scheduling (DRS) method for heterogeneous GPU clusters in which bandwidth differs among GPUs. The goal of DRS is to solve the multi-DNN scheduling problem under deadline constraints. Specifically, we first construct a resource-time model based on the Ring-AllReduce communication architecture to estimate the running time of DDL tasks under different resource schemes. We then build a resource-performance model based on the deadline requirements to achieve efficient resource utilization. Finally, DRS uses these models, together with the current resource layout, to decide the resource scheme for each DDL task. Tasks are selected for actual resource allocation according to an earliest-deadline-first principle, and a migration mechanism is introduced to mitigate resource fragmentation during scheduling. Experiments on a heterogeneous GPU cluster with four NVIDIA GeForce RTX 2080 Ti GPUs show that DRS improves the deadline guarantee rate by 39.53% compared with baseline algorithms, while achieving 91.27% GPU cluster resource utilization during scheduling.
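
For intuition, the sketch below shows the kind of resource-time estimate the abstract refers to, using the standard Ring-AllReduce cost model (each of p workers transfers 2(p-1)/p of the gradient volume, bounded by the slowest link in the ring). This is a minimal illustration under assumed parameters, not the paper's actual model; all function names and values are hypothetical.

```python
def ring_allreduce_comm_time(num_gpus: int, grad_bytes: float, min_bandwidth: float) -> float:
    """Standard Ring-AllReduce cost: each GPU transfers 2*(p-1)/p of the
    gradient volume; the slowest link in the ring bounds throughput."""
    p = num_gpus
    return 2 * (p - 1) / p * grad_bytes / min_bandwidth

def iteration_time(compute_time: float, num_gpus: int, grad_bytes: float, min_bandwidth: float) -> float:
    # Per-iteration time = computation + gradient synchronization
    # (ignoring compute/communication overlap for simplicity).
    return compute_time + ring_allreduce_comm_time(num_gpus, grad_bytes, min_bandwidth)

# Hypothetical example: 4 GPUs, 400 MB of gradients, slowest link 5 GB/s,
# 50 ms of per-iteration computation.
t = iteration_time(0.050, 4, 400e6, 5e9)
print(f"Estimated iteration time: {t * 1000:.1f} ms")  # ~98.0 ms
```

In a heterogeneous cluster, the min_bandwidth term is what makes resource-scheme choice matter: adding a GPU reached only over a slow link can lengthen, rather than shorten, each iteration.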