Abstract:
Deep Neural Networks (DNNs) have achieved state-of-the-art accuracy in many modern Artificial Intelligence (AI) tasks. In recent years, it has become increasingly popular to use high-performance computing (HPC) platforms for massively distributed parallel training of DNNs. Energy consumption models are crucial for designing and optimizing DNNs for massively parallel training and for restraining excessive energy consumption on HPC platforms. However, most existing models characterize the energy consumption of a single device or a cluster of devices from a hardware perspective; despite the need for disaggregated analysis of distributed parallel DNN applications, there is a dearth of energy consumption models that capture the characteristics of the applications themselves. In this paper, we propose MC², a three-stage "Materials Preprocessing-Computing-Communicating" energy consumption model built on the essential features of DNN training under the most commonly used distributed data-parallel paradigm. The model is validated by training the classical VGG16 and ResNet50 networks and the recent Vision Transformer network using up to 128 MT nodes and 32 FT nodes on the domestic E-class (exascale) prototype Tianhe-3. The experimental results show that the difference between the MC² predictions and the actual energy measurements is only 2.84%. Compared with four linear proportional energy models and the AR, SES, and ARIMA time-series prediction models, the accuracy of the proposed model is improved by 69.12%, 69.50%, 34.58%, 13.47%, 5.23%, 22.13%, and 10.53%, respectively. Using the proposed model, both the per-stage and the overall energy consumption of DNN models can be obtained on a supercomputer platform, which provides a basis for evaluating the efficiency of energy-aware massively distributed parallel DNN training and inference, as well as for optimizing task scheduling, job scheduling, model partitioning, and model pruning strategies.
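The three-stage decomposition named in the abstract can be summarized by a simple additive form. The notation below is an illustrative sketch, not the paper's own symbols: it assumes the total training energy is the sum of the three stage energies accumulated over training iterations.

```latex
% Illustrative decomposition (symbols assumed, not taken from the paper):
% E_total   - total energy of one distributed data-parallel training run
% E_mat(i)  - materials (data) preprocessing energy in iteration i
% E_comp(i) - computing (forward/backward pass) energy in iteration i
% E_comm(i) - communicating (gradient synchronization) energy in iteration i
E_{\mathrm{total}} \;=\; \sum_{i=1}^{N}\Bigl( E_{\mathrm{mat}}(i) + E_{\mathrm{comp}}(i) + E_{\mathrm{comm}}(i) \Bigr)
```

Under this view, per-stage measurements or estimates on each node can be aggregated to obtain the platform-level totals the abstract refers to.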