Abstract:
In recent years, deploying deep neural networks (DNNs) on mobile devices has become a trend. Many applications that facilitate daily life, such as voice assistants and activity recognition, have been integrated into smartphones, wearable devices, and embedded systems. However, it is challenging to deploy compute-bound DNNs on mobile devices with limited resources such as computing power, storage, and battery capacity. Existing methods, including manually designed DNN compression techniques and automated on-demand DNN compression techniques, are limited to optimizing model structures, which caps the achievable performance of DNN deployment and makes it difficult to adapt to devices with extremely constrained resources. In addition, these statically pre-designed optimization methods ignore the resource contention and dynamically changing demands that characterize mobile deployment environments; their inability to adjust strategies in real time under such dynamics results in suboptimal accuracy-efficiency trade-offs. To address these challenges, we propose AdaInfer, a runtime-scalable cross-layer optimization method for DNN deployment. AdaInfer adaptively selects the best combined deployment strategy across the model, computational-graph, and memory levels based on current hardware resource constraints and user performance requirements, optimizing multiple performance metrics, and it adjusts this strategy in real time as the scenario changes. Specifically, we design a model-agnostic, scalable computation-graph structure and a corresponding cross-layer optimization strategy, which automatically adapt to maximize deployment efficiency on heterogeneous devices. We then model the runtime adjustment of the algorithm-system cross-layer optimization strategy as a dynamic optimization problem, representing the dynamic environment as a set of runtime-varying resource constraints, and propose an efficient search strategy that improves the speed and quality of local online search. Evaluations on three types of mobile and edge devices, five models, and four continuously changing mobile scenarios show that, compared with prior work, AdaInfer reduces memory usage by up to 42.35% and latency by up to 73.89% without significantly affecting accuracy.
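To make the problem statement concrete, the following is a minimal sketch of how the dynamic optimization described above could be formalized; the notation (strategy space $\mathcal{S}$, accuracy $\mathrm{Acc}$, and time-varying latency and memory budgets $L_t$, $M_t$) is introduced here purely for illustration and is not the paper's own.

\[
s_t^{\ast} \;=\; \arg\max_{s \in \mathcal{S}} \; \mathrm{Acc}(s)
\quad \text{s.t.} \quad \mathrm{Lat}(s) \le L_t, \quad \mathrm{Mem}(s) \le M_t,
\]

where each $s$ is a combined deployment strategy spanning the model, computational-graph, and memory levels, and the budgets $(L_t, M_t)$ vary at runtime with the device's available resources, so the optimal strategy $s_t^{\ast}$ must be re-searched online as conditions change.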