Citation: Wei Jia, Zhang Xingjun, Wang Longxiang, Zhao Mingqiang, Dong Xiaoshe. MC2 Energy Consumption Model for Massively Distributed Data Parallel Training of Deep Neural Network[J]. Journal of Computer Research and Development, 2024, 61(12): 2985-3004. DOI: 10.7544/issn1000-1239.202330164
Deep neural networks (DNNs) have achieved state-of-the-art accuracy in many modern artificial intelligence (AI) tasks. In recent years, massively distributed parallel training of DNNs on high performance computing (HPC) platforms has become increasingly popular. Energy consumption models are crucial for designing and optimizing massively parallel DNN training and for restraining excessive energy consumption on HPC platforms. Most existing energy consumption models describe a single device or a cluster of devices from a hardware perspective; models that capture the characteristics of distributed DNN applications themselves, and thereby support a disaggregated, application-level analysis of their energy consumption, are still lacking. In this paper, we propose MC2, a three-stage "materials preprocessing-computing-communicating" energy consumption model derived from the essential characteristics of DNN training and targeting the most commonly used distributed data-parallel training mode. The model is validated by training the classical VGG16 and ResNet50 networks and the recent Vision Transformer network on up to 128 MT nodes and 32 FT nodes of the Chinese exascale (E-class) prototype system Tianhe-3. The experimental results show that the deviation of MC2's estimates from actual energy measurements is only 2.84%. Compared with four linear proportional energy models and with the AR, SES, and ARIMA time-series prediction models, the accuracy of the proposed model is higher by 69.12%, 69.50%, 34.58%, 13.47%, 5.23%, 22.13%, and 10.53%, respectively. With the proposed model, both the per-stage and the overall energy consumption of a DNN can be obtained on a supercomputer platform, providing a basis for evaluating the efficiency of energy-aware massively distributed parallel training and inference and for optimizing task scheduling, job scheduling, model partitioning, and model pruning strategies.
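The abstract only names the three stages; as a purely illustrative sketch of the decomposition it describes, the snippet below accumulates per-stage energy as average power times elapsed time during one data-parallel training step. Everything here is an assumption for illustration: the `StageMeter` helper, the injected callables (`read_power_w`, `load_batch`, `forward_backward`, `allreduce`), and the idea of sampling node power directly are not taken from the paper, whose MC2 model derives the stage terms analytically rather than by runtime accumulation.

```python
import time
from dataclasses import dataclass, field

# Hypothetical illustration of the three-stage decomposition named in the abstract:
# E_total = E_preprocess + E_compute + E_communicate, accumulated per training
# iteration on one node. This is NOT the paper's MC2 formulation.

@dataclass
class StageMeter:
    """Accumulates energy (J) per stage as average power (W) x elapsed time (s)."""
    energy_j: dict = field(default_factory=lambda: {
        "preprocess": 0.0, "compute": 0.0, "communicate": 0.0})

    def record(self, stage: str, start_s: float, end_s: float, avg_power_w: float):
        self.energy_j[stage] += avg_power_w * (end_s - start_s)

    def total(self) -> float:
        return sum(self.energy_j.values())


def train_step(meter: StageMeter, read_power_w, load_batch, forward_backward, allreduce):
    """One data-parallel step, timed stage by stage (illustrative only)."""
    t0 = time.perf_counter()
    batch = load_batch()              # materials preprocessing: I/O and augmentation
    t1 = time.perf_counter()
    meter.record("preprocess", t0, t1, read_power_w())

    loss = forward_backward(batch)    # computing: forward and backward pass
    t2 = time.perf_counter()
    meter.record("compute", t1, t2, read_power_w())

    allreduce()                       # communicating: gradient all-reduce across nodes
    t3 = time.perf_counter()
    meter.record("communicate", t2, t3, read_power_w())
    return loss
```

Summing the per-stage totals over all participating nodes would give the per-stage and overall energy the abstract refers to; in the paper these quantities are predicted by the model rather than measured in this step-by-step fashion.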