DAQ：基于分治策略的自适应Vision Transformer低位宽量化方法

吕倩茹; 许金伟; 姜晶菲; 李东升

doi:10.7544/issn1000-1239.202550145

DAQ: Divide-and-Conquer Strategy Based Adaptive Low-Bit Quantization Method for Vision Transformer

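Abstract

Abstract (translated from the Chinese version): Vision Transformer (ViT) models have achieved remarkable results on many computer vision tasks, but their complex structure and computational cost limit deployment on edge computing devices. Post-training quantization (PTQ) is widely used to make ViT models lightweight for practical deployment, yet existing PTQ methods lose considerable accuracy at low bit widths. In low-bit quantization scenarios, the quantization-sensitive layers of a ViT (such as Softmax) and its compute-intensive layers (such as linear transformations) show a marked spatial mismatch, and the non-Gaussian activations hide a Gaussian-like clustering that covers 97% of the values. Building on the standard-score (z-score) method, this work proposes the divide-and-conquer and adaptive quantization (DAQ) method, which jointly optimizes accuracy and efficiency through a joint analysis of quantization sensitivity, computation, and storage overhead, together with hardware co-design. DAQ builds a dynamic divide-and-conquer quantization mechanism: a dynamically aware z-score method splits values into a normal domain and an outlier domain, and the two domains are uniformly quantized with linked parameters. Under 4-bit quantization, DAQ improves Top-1 accuracy on classification tasks by up to 4.37 percentage points and accuracy on object detection tasks by up to 8.2 percentage points; its error relative to the baseline model averages below 0.4 percentage points, and it exceeds the best full-precision model by 0.1 percentage points, coming close to lossless low-bit-width quantization. For hardware compatibility, DAQ is adapted to the INT4/INT8 kernels of Tensor Cores, using quantized fixed-point computation to relieve the pressure of linear computation. Experiments show that the hardware-adapted DAQ speeds up the linear-computation portion by 43% to 86%, providing an algorithm-hardware co-optimized quantization deployment paradigm for resource-constrained scenarios.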

Abstract: Vision Transformers (ViTs) have demonstrated remarkable success in computer vision tasks, but their complex architecture and computational demands hinder deployment on edge devices. Although post-training quantization (PTQ) is widely adopted for model compression, existing PTQ methods suffer severe performance degradation in ultra-low-bit-width (4-bit) scenarios. This work addresses two fundamental limitations: 1) a spatial mismatch between quantization-sensitive layers (e.g., Softmax) and compute-intensive layers (e.g., linear projections), where quantizing Softmax causes an 80% accuracy loss even though it contributes merely 8% of the computational load; 2) non-Gaussian activation distributions with hidden Gaussian-like clustering (97% of values have a z-score magnitude below 3). We propose DAQ (divide-and-conquer and adaptive quantization), a hardware-friendly PTQ method. DAQ adopts a z-score-driven dynamic partitioning algorithm to separate data into a normal range and an outlier range, and uniformly quantizes the two groups with linked parameters. DAQ further exploits hardware-accelerated kernels, such as the INT4/INT8 kernels of Tensor Cores, to speed up quantized ViT models. Experimental results show that, under 4-bit quantization, DAQ improves ImageNet Top-1 accuracy by up to 4.37 percentage points. On object detection tasks, it improves accuracy by up to 8.2 percentage points, its average error relative to the baseline stays below 0.4 percentage points, and in specific cases it surpasses the full-precision model by 0.1 percentage points, thereby achieving near-lossless low-bit-width quantization. With hardware implementation optimization, DAQ accelerates the linear-computation portion by 43% to 86% without significantly increasing computational overhead. This approach establishes an algorithm-hardware co-optimized quantization deployment paradigm for resource-constrained scenarios, balancing model efficiency and accuracy retention.
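
The z-score-driven split into a normal range and an outlier range can be illustrated with a small sketch. This is not the authors' implementation: the function names, the threshold of 3, the symmetric max-abs scaling, and the power-of-two ratio that links the two scales are all assumptions made for this example, since the abstract does not specify how the two sets of quantization parameters are connected.

```python
import math
import statistics


def zscore_split(values, threshold=3.0):
    """Flag each value whose |z-score| exceeds the threshold (hypothetical helper)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        return [False] * len(values)
    return [abs((v - mean) / std) > threshold for v in values]


def quantize_two_ranges(values, bits=4, threshold=3.0):
    """Uniformly quantize normal values and outliers with two linked scales."""
    qmax = 2 ** (bits - 1) - 1  # symmetric signed range, e.g. 7 for 4 bits
    is_outlier = zscore_split(values, threshold)
    normal_absmax = max(
        (abs(v) for v, o in zip(values, is_outlier) if not o), default=0.0
    )
    outlier_absmax = max(
        (abs(v) for v, o in zip(values, is_outlier) if o), default=0.0
    )
    # Fall back to the outlier range (or 1.0) if every normal value is zero.
    base_absmax = normal_absmax or outlier_absmax or 1.0
    normal_scale = base_absmax / qmax
    # Assumption for this sketch: the outlier scale is the normal scale times
    # a power of two, so only one float scale and one integer ratio are stored.
    ratio = 1
    if outlier_absmax > base_absmax:
        ratio = 2 ** math.ceil(math.log2(outlier_absmax / base_absmax))
    codes = []
    for v, o in zip(values, is_outlier):
        scale = normal_scale * ratio if o else normal_scale
        codes.append(max(-qmax, min(qmax, round(v / scale))))
    return codes, is_outlier, normal_scale, ratio


def dequantize(codes, is_outlier, normal_scale, ratio):
    """Map integer codes back to floats using the scale of each value's range."""
    return [q * normal_scale * (ratio if o else 1) for q, o in zip(codes, is_outlier)]


if __name__ == "__main__":
    # 63 values spread over [-1, 1] plus a single large outlier.
    data = [-1 + 2 * i / 62 for i in range(63)] + [16.0]
    codes, flags, scale, ratio = quantize_two_ranges(data, bits=4)
    restored = dequantize(codes, flags, scale, ratio)
    print("outliers:", sum(flags), "ratio:", ratio)
    print("max normal-range error:",
          max(abs(a - b) for a, b in zip(data[:63], restored[:63])))
```

In this toy input, a single 4-bit scale stretched to cover the outlier would have a step of 16/7, so every value in [-1, 1] would round to zero. Splitting the ranges keeps the step for the normal values at 1/7 while the outlier is still represented within the same 4-bit code range.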
