Abstract:
Deep neural networks (DNNs) have been widely applied across many areas of society. Increasing the size of a DNN model can significantly improve its accuracy; however, training such a model on a single GPU takes considerable time. Consequently, training large-scale DNN models in parallel on GPU clusters via distributed deep learning (DDL) has attracted much attention from both industry and academia. Motivated by this, we propose a dynamic resource scheduling (DRS) method for heterogeneous GPU clusters in which bandwidth differs among GPUs. The goal of DRS is to solve the multi-DNN scheduling problem under deadline constraints. Specifically, we first construct a resource-time model based on the Ring-AllReduce communication architecture to estimate the running time of DDL tasks under different resource schemes. We then build a resource-performance model based on the deadline requirements to achieve efficient resource utilization. Finally, DRS uses these models, together with the current resource layout, to decide the resource scheme for each DDL task. Tasks are selected for actual resource allocation according to an earliest-deadline-first principle, and a migration mechanism is introduced to mitigate resource fragmentation during scheduling. Experiments on a heterogeneous GPU cluster with four NVIDIA GeForce RTX 2080 Ti GPUs show that DRS improves the deadline guarantee rate by 39.53% compared with baseline algorithms, while achieving 91.27% GPU cluster resource utilization during scheduling.
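
For intuition, the sketch below shows the kind of resource-time estimate the abstract refers to, using the standard Ring-AllReduce cost model (each of p workers transfers 2(p-1)/p of the gradient volume, bounded by the slowest link in the ring). This is a minimal illustration under assumed parameters, not the paper's actual model; all function names and values are hypothetical.

```python
def ring_allreduce_comm_time(num_gpus: int, grad_bytes: float, min_bandwidth: float) -> float:
    """Standard Ring-AllReduce cost: each GPU transfers 2*(p-1)/p of the
    gradient volume; the slowest link in the ring bounds throughput."""
    p = num_gpus
    return 2 * (p - 1) / p * grad_bytes / min_bandwidth

def iteration_time(compute_time: float, num_gpus: int, grad_bytes: float, min_bandwidth: float) -> float:
    # Per-iteration time = computation + gradient synchronization
    # (ignoring compute/communication overlap for simplicity).
    return compute_time + ring_allreduce_comm_time(num_gpus, grad_bytes, min_bandwidth)

# Hypothetical example: 4 GPUs, 400 MB of gradients, slowest link 5 GB/s,
# 50 ms of per-iteration computation.
t = iteration_time(0.050, 4, 400e6, 5e9)
print(f"Estimated iteration time: {t * 1000:.1f} ms")  # ~98.0 ms
```

In a heterogeneous cluster, the min_bandwidth term is what makes resource-scheme choice matter: adding a GPU reached only over a slow link can lengthen, rather than shorten, each iteration.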