Abstract:
With the rapid progress of deep learning and the continuous exploration of massive datasets, the self-attention module has been widely adopted in fields such as natural language processing, computer vision, and large language models. Although self-attention significantly improves the accuracy of deep learning models, its heavy computational demand makes it difficult to deploy on devices with limited computing power. Integer quantization, one of the key techniques for deploying models on low-power chips, suffers high precision loss on self-attention because of the module's structural characteristics. To address this issue, this work thoroughly analyzes the integer quantization error of the self-attention module and proposes two methods: pseudo-softmax vector quantization and block-wise pseudo-softmax vector quantization. Both methods apply a specialized integer quantization to the softmax vectors in the self-attention module, aiming to improve inference speed while effectively reducing the error introduced by quantization. Experimental results show that, compared with conventional direct quantization, pseudo-softmax vector quantization reduces the quantization-induced accuracy loss by 50%, while block-wise pseudo-softmax vector quantization reduces it further, by approximately 90%. These results demonstrate the effectiveness of both methods in reducing precision loss and support the efficient deployment of the self-attention module on devices with limited computing power.
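The abstract does not specify the proposed algorithms, so as a point of reference only, the sketch below (an illustrative assumption, not the paper's pseudo-softmax method) shows why direct uniform uint8 quantization of a softmax vector is lossy: attention weights over a long sequence are mostly close to zero, so a single global scale maps many of them to the same integer. This is the kind of precision loss the proposed methods target.

```python
import numpy as np

# Illustrative sketch only: plain uniform uint8 quantization of one softmax
# (attention-weight) vector. This is NOT the paper's pseudo-softmax scheme,
# which the abstract does not detail.

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def quantize_uint8(v, scale):
    # Map real values to integers 0..255 with a single global scale.
    return np.clip(np.round(v / scale), 0, 255).astype(np.uint8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
logits = rng.normal(size=1024)        # one attention row over 1024 tokens
p = softmax(logits)

# Direct quantization: one scale covering [0, max(p)] for the whole vector.
scale = p.max() / 255.0
p_hat = dequantize(quantize_uint8(p, scale), scale)

print("max abs error:", np.abs(p - p_hat).max())
print("fraction of weights collapsed to zero:",
      np.mean(quantize_uint8(p, scale) == 0))
```

Running this shows that a large fraction of the small attention weights collapse to the same integer (often zero), which motivates treating softmax vectors specially rather than quantizing them directly.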