Citation: Wei Jia, Zhang Xingjun, Wang Longxiang, Zhao Mingqiang, Dong Xiaoshe. MC2 Energy Consumption Model for Massively Distributed Data Parallel Training of Deep Neural Network[J]. Journal of Computer Research and Development, 2024, 61(12): 2985-3004. DOI: 10.7544/issn1000-1239.202330164
Deep neural networks (DNNs) have achieved state-of-the-art accuracy in many modern artificial intelligence (AI) tasks. In recent years, massively distributed parallel training of DNNs on high performance computing (HPC) platforms has become increasingly popular. Energy consumption models are crucial for designing and optimizing massively parallel DNN training and for restraining excessive energy consumption on HPC platforms. Most existing energy consumption models describe a single device or a cluster of devices from a hardware perspective; models that capture the characteristics of distributed DNN applications themselves, and thereby support a disaggregated, application-level analysis of their energy consumption, are still lacking. In this paper, we propose MC2, a three-stage "materials preprocessing-computing-communicating" energy consumption model derived from the essential characteristics of DNN training and targeting the most commonly used distributed data-parallel training mode. The model is validated by training the classical VGG16 and ResNet50 networks and the recent Vision Transformer network on up to 128 MT nodes and 32 FT nodes of the Chinese exascale (E-class) prototype system Tianhe-3. The experimental results show that the deviation of MC2's estimates from actual energy measurements is only 2.84%. Compared with four linear proportional energy models and with the AR, SES, and ARIMA time-series prediction models, the accuracy of the proposed model is higher by 69.12%, 69.50%, 34.58%, 13.47%, 5.23%, 22.13%, and 10.53%, respectively. With the proposed model, both the per-stage and the overall energy consumption of a DNN can be obtained on a supercomputer platform, providing a basis for evaluating the efficiency of energy-aware massively distributed parallel training and inference and for optimizing task scheduling, job scheduling, model partitioning, and model pruning strategies.
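The abstract only names the three stages; as a purely illustrative sketch of the decomposition it describes, the snippet below accumulates per-stage energy as average power times elapsed time during one data-parallel training step. Everything here is an assumption for illustration: the `StageMeter` helper, the injected callables (`read_power_w`, `load_batch`, `forward_backward`, `allreduce`), and the idea of sampling node power directly are not taken from the paper, whose MC2 model derives the stage terms analytically rather than by runtime accumulation.

```python
import time
from dataclasses import dataclass, field

# Hypothetical illustration of the three-stage decomposition named in the abstract:
# E_total = E_preprocess + E_compute + E_communicate, accumulated per training
# iteration on one node. This is NOT the paper's MC2 formulation.

@dataclass
class StageMeter:
    """Accumulates energy (J) per stage as average power (W) x elapsed time (s)."""
    energy_j: dict = field(default_factory=lambda: {
        "preprocess": 0.0, "compute": 0.0, "communicate": 0.0})

    def record(self, stage: str, start_s: float, end_s: float, avg_power_w: float):
        self.energy_j[stage] += avg_power_w * (end_s - start_s)

    def total(self) -> float:
        return sum(self.energy_j.values())


def train_step(meter: StageMeter, read_power_w, load_batch, forward_backward, allreduce):
    """One data-parallel step, timed stage by stage (illustrative only)."""
    t0 = time.perf_counter()
    batch = load_batch()              # materials preprocessing: I/O and augmentation
    t1 = time.perf_counter()
    meter.record("preprocess", t0, t1, read_power_w())

    loss = forward_backward(batch)    # computing: forward and backward pass
    t2 = time.perf_counter()
    meter.record("compute", t1, t2, read_power_w())

    allreduce()                       # communicating: gradient all-reduce across nodes
    t3 = time.perf_counter()
    meter.record("communicate", t2, t3, read_power_w())
    return loss
```

Summing the per-stage totals over all participating nodes would give the per-stage and overall energy the abstract refers to; in the paper these quantities are predicted by the model rather than measured in this step-by-step fashion.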