Abstract:
In recent years, deploying deep neural networks (DNNs) on mobile devices has become a trend. Many applications that facilitate daily life, such as voice assistants and activity recognition, have been integrated into smartphones, wearable devices, and embedded systems. However, it is challenging to deploy compute-bound DNNs on mobile devices with limited resources such as computing power, storage, and battery capacity. Existing methods, including manually designed DNN compression techniques and automated on-demand DNN compression techniques, are limited to optimizing model structures, which caps the achievable performance of DNN deployment and makes it difficult to adapt to devices with extremely constrained resources. In addition, these statically pre-designed optimization methods ignore the resource contention and dynamically changing demands that characterize mobile deployment environments; their inability to adjust strategies in real time under such dynamics results in suboptimal accuracy-efficiency trade-offs. To address these challenges, we propose AdaInfer, a runtime-scalable cross-layer optimization method for DNN deployment. AdaInfer adaptively selects the best combined deployment strategy across the model, computational-graph, and memory levels based on current hardware resource constraints and user performance requirements, optimizing multiple performance metrics, and it adjusts this strategy in real time as the scenario changes. Specifically, we design a model-agnostic, scalable computation-graph structure and a corresponding cross-layer optimization strategy, which automatically adapt to maximize deployment efficiency on heterogeneous devices. We then model the runtime adjustment of the algorithm-system cross-layer optimization strategy as a dynamic optimization problem, representing the dynamic environment as a set of runtime-varying resource constraints, and propose an efficient search strategy that improves the speed and quality of local online search. Evaluations on three types of mobile and edge devices, five models, and four continuously changing mobile scenarios show that, compared with prior work, AdaInfer reduces memory usage by up to 42.35% and latency by up to 73.89% without significantly affecting accuracy.
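To make the problem statement concrete, the following is a minimal sketch of how the dynamic optimization described above could be formalized; the notation (strategy space $\mathcal{S}$, accuracy $\mathrm{Acc}$, and time-varying latency and memory budgets $L_t$, $M_t$) is introduced here purely for illustration and is not the paper's own.

\[
s_t^{\ast} \;=\; \arg\max_{s \in \mathcal{S}} \; \mathrm{Acc}(s)
\quad \text{s.t.} \quad \mathrm{Lat}(s) \le L_t, \quad \mathrm{Mem}(s) \le M_t,
\]

where each $s$ is a combined deployment strategy spanning the model, computational-graph, and memory levels, and the budgets $(L_t, M_t)$ vary at runtime with the device's available resources, so the optimal strategy $s_t^{\ast}$ must be re-searched online as conditions change.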