    MC2 Energy Consumption Model for Massively Distributed Data Parallel Training of Deep Neural Network

    • Abstract: Deep neural networks (DNNs) achieve state-of-the-art accuracy in many modern artificial intelligence (AI) tasks. In recent years, massively distributed parallel training of DNNs on high performance computing (HPC) platforms has become increasingly common. Energy consumption models play a crucial role in designing and optimizing large-scale parallel DNN training and in curbing excessive energy consumption on HPC platforms. Currently, most energy consumption models take a device-centric view and model the energy consumption of a single device or of a cluster of devices. Because distributed parallel DNN applications have rarely been decomposed and analyzed from an energy perspective, few energy consumption models capture the characteristics of distributed DNN applications. Targeting the most commonly used mode, distributed data parallel training, and starting from the essential features of DNN training, we propose the three-stage MC2 energy consumption model: data preprocessing (materials preprocessing), forward and backward propagation (computing), and gradient synchronization and update (communicating). The model is validated on the domestic E-class (exascale) prototype Tianhe-3 by training the classical VGG16 and ResNet50 networks and the more recent Vision Transformer network using up to 128 MT nodes and 32 FT nodes. The experimental results show that the MC2 prediction differs from the measured energy consumption by only 2.84%. Compared with four linear proportional energy models and the AR, SES, and ARIMA time-series prediction models, the accuracy of the proposed model is higher by 69.12, 69.50, 34.58, 13.47, 5.23, 22.13, and 10.53 percentage points, respectively. With the proposed model, the per-stage and overall energy consumption of DNN training can be obtained on a supercomputing platform, which provides a basis for evaluating the effectiveness of energy-aware optimization strategies, such as task scheduling, job placement, model partitioning, and model pruning, at each stage of massively distributed data parallel DNN training and inference.
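The abstract names the three stages but does not give the model's equations. As a rough illustration only, the sketch below assumes the simplest possible reading: each stage contributes average power times duration, the stages add up, and all nodes behave identically. The class and function names, the additive form, and the numeric inputs are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class StageSample:
    """Average power (watts) and duration (seconds) of one stage on one node."""
    avg_power_w: float
    duration_s: float

    def energy_j(self) -> float:
        # Energy in joules = average power * elapsed time.
        return self.avg_power_w * self.duration_s


def total_energy_j(stages: Mapping[str, StageSample], num_nodes: int) -> float:
    """Sum the per-stage energy and scale by node count.

    Assumption: every node draws the same power in each stage.
    """
    per_node = sum(sample.energy_j() for sample in stages.values())
    return num_nodes * per_node


def relative_error(predicted: float, measured: float) -> float:
    """Relative deviation of a predicted energy from a measured one."""
    return abs(predicted - measured) / measured


# Hypothetical inputs for illustration; these are not measurements from the paper.
stages = {
    "materials_preprocessing": StageSample(avg_power_w=100.0, duration_s=2.0),
    "computing": StageSample(avg_power_w=200.0, duration_s=5.0),
    "communicating": StageSample(avg_power_w=50.0, duration_s=4.0),
}
predicted = total_energy_j(stages, num_nodes=4)
print(f"predicted energy: {predicted:.1f} J")
print(f"relative error vs. a 5000 J reading: {relative_error(predicted, 5000.0):.2%}")
```

Keeping the stages as separate entries is what makes per-stage reporting possible; a real model would replace the constant power figures with stage-specific terms fitted to the platform.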