Learning-based Model Quantization: Methods, Challenges, and Prospects
Abstract
The increasing parameter scale and structural complexity of deep neural networks (DNNs) pose significant challenges for their efficient deployment in cloud-edge-end collaborative computing architectures, where edge and terminal nodes must support real-time inference under stringent constraints on computation, storage, inference latency, and communication overhead. Although model quantization effectively reduces resource consumption, traditional approaches based on fixed rules or empirical heuristics exhibit notable limitations in accuracy preservation, training stability, and adaptability across diverse architectures and application scenarios, particularly at very low bit-widths (≤4 bits). To mitigate these issues, learning-based quantization incorporates learnable components and diverse supervisory signals, substantially improving accuracy preservation, optimization robustness, and hardware adaptability under low-bit conditions. In this paper, we systematically review representative studies on learning-based quantization, centered on two primary paradigms: post-training quantization (PTQ) and quantization-aware training (QAT). We summarize performance enhancement strategies, including learnable parameter modeling, reconstruction and approximation optimization, auxiliary supervision, and learning-driven methods under data and hardware constraints. We further organize existing technical approaches and their interrelationships from the perspectives of quantization process stages and learning signal mechanisms, and analyze the applicability of different strategies under varying data conditions, model scales, and application constraints. Finally, we discuss future research directions in learning-based quantization from three aspects: error-analysis-based optimization of quantization parameters and learning signals, the trade-off between accuracy and resources in learning-based quantization, and a unified quantization framework across model architectures.
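As background for the quantization schemes surveyed here, the following is a minimal sketch of standard uniform affine quantization, the baseline that both PTQ and QAT methods build on. It is a generic illustration, not a method from this survey; the function name and the min/max calibration rule are illustrative assumptions.

```python
import numpy as np

def quantize_uniform(x, num_bits=4):
    """Illustrative uniform affine quantization: map float values to
    integers in [0, 2^b - 1] via a scale and zero-point (calibrated
    here from the tensor's min/max), then dequantize to floats."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    # Scale maps the float range onto the integer grid; guard the
    # degenerate case of a constant tensor.
    scale = (x_max - x_min) / (qmax - qmin) if x_max > x_min else 1.0
    zero_point = int(round(qmin - x_min / scale))
    # Quantize: scale, round to the grid, shift, and clip to range.
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int32)
    # Dequantize: recover an approximation of the original floats.
    x_hat = (q.astype(np.float32) - zero_point) * scale
    return q, x_hat, scale

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, x_hat, scale = quantize_uniform(x, num_bits=4)
# Rounding error per element is bounded by about scale / 2 inside the
# clipping range; learning-based methods aim to reduce the resulting
# accuracy loss by optimizing scale, zero-point, and rounding itself.
```

Traditional approaches fix `scale` and `zero_point` by rules such as the min/max calibration above; learning-based methods instead treat them (and, in some PTQ methods, the rounding decision itself) as learnable quantities optimized against a reconstruction or task loss.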