Abstract:
With the rapid progress of deep learning and the continuous exploration of massive datasets, the self-attention module has been widely adopted in fields such as natural language processing, computer vision, and large language models. Although self-attention significantly improves the accuracy of deep learning models, its heavy computational demand makes it difficult to deploy on devices with limited computing power. Integer quantization, one of the key techniques for deploying models on low-power chips, suffers high precision loss on self-attention because of the module's structural characteristics. To address this issue, this work thoroughly analyzes the integer quantization error of the self-attention module and proposes two methods: pseudo-softmax vector quantization and block-wise pseudo-softmax vector quantization. Both methods apply a specialized integer quantization to the softmax vectors in the self-attention module, aiming to improve inference speed while effectively reducing the error introduced by quantization. Experimental results show that, compared with conventional direct quantization, pseudo-softmax vector quantization reduces the quantization-induced accuracy loss by 50%, while block-wise pseudo-softmax vector quantization reduces it further, by approximately 90%. These results demonstrate the effectiveness of both methods in reducing precision loss and support the efficient deployment of the self-attention module on devices with limited computing power.
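The abstract does not specify the proposed algorithms, so as a point of reference only, the sketch below (an illustrative assumption, not the paper's pseudo-softmax method) shows why direct uniform uint8 quantization of a softmax vector is lossy: attention weights over a long sequence are mostly close to zero, so a single global scale maps many of them to the same integer. This is the kind of precision loss the proposed methods target.

```python
import numpy as np

# Illustrative sketch only: plain uniform uint8 quantization of one softmax
# (attention-weight) vector. This is NOT the paper's pseudo-softmax scheme,
# which the abstract does not detail.

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def quantize_uint8(v, scale):
    # Map real values to integers 0..255 with a single global scale.
    return np.clip(np.round(v / scale), 0, 255).astype(np.uint8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
logits = rng.normal(size=1024)        # one attention row over 1024 tokens
p = softmax(logits)

# Direct quantization: one scale covering [0, max(p)] for the whole vector.
scale = p.max() / 255.0
p_hat = dequantize(quantize_uint8(p, scale), scale)

print("max abs error:", np.abs(p - p_hat).max())
print("fraction of weights collapsed to zero:",
      np.mean(quantize_uint8(p, scale) == 0))
```

Running this shows that a large fraction of the small attention weights collapse to the same integer (often zero), which motivates treating softmax vectors specially rather than quantizing them directly.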