Fu Maozhong, Hu Haiyang, Li Zhongjin. Dynamic Resource Scheduling Method for GPU Cluster[J]. Journal of Computer Research and Development, 2023, 60(6): 1308-1321. DOI: 10.7544/issn1000-1239.202220149

Dynamic Resource Scheduling Method for GPU Cluster

Funds: This work was supported by the Natural Science Foundation of Zhejiang Province (LY22F020021), the Zhejiang Provincial Key Research and Development “LingYan” Project Foundation (2023C01145), and the National Natural Science Foundation of China (61802095, 61572162)
  • Author Bio:

    Fu Maozhong: born in 1996. Master candidate. His main research interests include distributed deep learning and resource scheduling

    Hu Haiyang: born in 1977. PhD, professor, PhD supervisor. His main research interests include cloud computing, computer network, workflow scheduling, mobile computing, and distributed computing

    Li Zhongjin: born in 1986. PhD, associate professor. His main research interests include cloud computing, workflow scheduling, and edge computing

  • Received Date: February 14, 2022
  • Revised Date: September 26, 2022
  • Available Online: March 19, 2023
  • Abstract: Deep neural networks (DNNs) have been widely applied in many areas of human society. Increasing the size of a DNN model significantly improves its accuracy; however, training such a model on a single GPU requires considerable time. Hence, how to train large-scale DNN models in parallel on GPU clusters with distributed deep learning (DDL) technology has drawn much attention from both industry and academia. Motivated by this, we propose a dynamic resource scheduling (DRS) method for heterogeneous GPU clusters in which the bandwidth between GPUs differs, with the goal of solving the multi-DNN scheduling problem under deadline constraints. Specifically, a resource-time model is first constructed on top of the Ring-AllReduce communication architecture to estimate the running time of DDL tasks under different resource schemes. A resource-performance model is then built from the deadline requirement to achieve efficient resource utilization. Finally, DRS combines these models with the current resource layout to decide a resource scheme for each DDL task; tasks are selected for actual resource allocation according to the nearest-deadline-first principle, and a migration mechanism is introduced to mitigate resource fragmentation during scheduling. Experiments on a heterogeneous GPU cluster with 4 NVIDIA GeForce RTX 2080 Ti GPUs show that DRS improves the deadline guarantee rate by 39.53% compared with the baseline algorithms, and that the resource utilization of the GPU cluster reaches 91.27% during scheduling.
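    The abstract describes a resource-time model built on the Ring-AllReduce architecture and a nearest-deadline-first allocation with migration. As a rough illustration of that kind of reasoning, the Python sketch below estimates a DDL task's per-iteration time as an assumed compute time plus the standard Ring-AllReduce communication volume (roughly 2(n−1)/n of the gradient size over the slowest link), then assigns GPU counts to tasks in nearest-deadline order. All names, parameters, and the cost model here are illustrative assumptions, not the paper's actual models or algorithm.

```python
# Minimal sketch, assuming a simple additive compute+communication cost model.
# Not the DRS method from the paper; names and parameters are hypothetical.
from dataclasses import dataclass


@dataclass
class DDLTask:
    name: str
    remaining_iters: int   # training iterations left
    compute_time: float    # per-iteration compute time on one GPU (s)
    model_bytes: float     # gradient volume exchanged per iteration (bytes)
    deadline: float        # absolute deadline (s)


def iter_time(task: DDLTask, n_gpus: int, link_bw: float) -> float:
    """Estimated per-iteration time with n_gpus workers under Ring-AllReduce.

    Each worker sends/receives roughly 2*(n-1)/n of the model per iteration;
    link_bw is the slowest inter-GPU bandwidth in bytes/s.
    """
    if n_gpus == 1:
        return task.compute_time
    comm = 2.0 * (n_gpus - 1) / n_gpus * task.model_bytes / link_bw
    return task.compute_time + comm


def min_gpus_to_meet_deadline(task: DDLTask, now: float,
                              link_bw: float, max_gpus: int):
    """Smallest GPU count whose estimated finish time meets the deadline."""
    for n in range(1, max_gpus + 1):
        finish = now + task.remaining_iters * iter_time(task, n, link_bw)
        if finish <= task.deadline:
            return n
    return None  # infeasible even with all available GPUs


def schedule(tasks, now: float, free_gpus: int, link_bw: float):
    """Allocate GPUs to tasks in nearest-deadline-first order."""
    plan = []
    for t in sorted(tasks, key=lambda t: t.deadline):
        need = min_gpus_to_meet_deadline(t, now, link_bw, free_gpus)
        if need is not None:
            plan.append((t.name, need))
            free_gpus -= need
    return plan


if __name__ == "__main__":
    # Toy workload: two tasks with different gradient sizes and deadlines.
    tasks = [DDLTask("resnet50", 5000, 0.12, 1.0e8, 900.0),
             DDLTask("vgg16", 3000, 0.20, 5.5e8, 1200.0)]
    print(schedule(tasks, now=0.0, free_gpus=4, link_bw=5e9))
```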
