Integer Quantization Based on Low Bit Sharing
Abstract
With the rapid advancement of artificial intelligence technologies, large language models (LLMs) have emerged as foundational components in modern intelligent systems. However, the ever-increasing model size, now ranging from billions to hundreds of billions of parameters, poses significant challenges in terms of memory footprint and computational overhead, particularly on latency-sensitive and resource-constrained platforms. While low-bit integer quantization (e.g., INT8) has demonstrated effectiveness in reducing memory consumption and improving inference efficiency, it exhibits notable limitations when applied to large-scale models, including quantization-induced accuracy degradation and inefficient storage utilization. To address these issues, this work proposes a novel numerical quantization framework, termed Low Bit-width Sharing (LBS). Built upon conventional integer quantization, LBS provides a structured high-low bit decomposition scheme that retains only the most significant bits of each parameter and shares the less significant bits within a tensor group. This sharing scheme reduces storage requirements while preserving numerical representational capacity. Furthermore, to mitigate accuracy degradation caused by the quantization of high-impact weights, this work develops a salient-value-aware quantization strategy. Using a Top-K selection algorithm, we isolate the most influential weights, which typically reside in the tail of the parameter distribution, and assign them dedicated scaling factors. This targeted treatment effectively suppresses error accumulation during quantization and improves robustness on downstream tasks. Experimental results across several state-of-the-art LLMs show that LBS achieves substantial storage reduction while preserving high model accuracy, making it a practical and scalable solution for the efficient deployment of large-scale models on modern hardware platforms.
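To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch of one weight group being quantized under an LBS-like scheme: Top-K salient weights receive a dedicated scale, while the remaining INT8 codes are split into per-weight high bits and a single low-bit pattern shared by the whole group. All function names, the choice of the rounded mean as the shared pattern, and the parameter defaults are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def quantize_group_lbs(w, total_bits=8, kept_high_bits=4, top_k=2):
    """Toy sketch of Low Bit-width Sharing (LBS) on one weight group.

    Assumption: the exact sharing rule and salient-weight handling
    here are illustrative; the paper's scheme may differ.
    """
    # Salience-aware step: pick the Top-K largest-magnitude weights
    # and quantize them separately with their own scaling factor.
    k_idx = np.argsort(np.abs(w))[-top_k:]
    salient_mask = np.zeros_like(w, dtype=bool)
    salient_mask[k_idx] = True

    def symmetric_quantize(x, bits):
        qmax = 2 ** (bits - 1) - 1
        max_abs = np.max(np.abs(x))
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
        return q, scale

    # Salient weights keep full-width codes and a dedicated scale.
    q_sal, s_sal = symmetric_quantize(w[salient_mask], total_bits)

    # Remaining weights: quantize to INT8, then split each code into
    # a high-bit field (stored per weight) and a low-bit field.
    q_rest, s_rest = symmetric_quantize(w[~salient_mask], total_bits)
    low_bits = total_bits - kept_high_bits
    offset = 2 ** (total_bits - 1)          # shift to unsigned codes
    u = q_rest + offset
    high = u >> low_bits                    # kept per parameter
    low = u & ((1 << low_bits) - 1)         # candidates for sharing

    # Share one low-bit pattern for the whole group (rounded mean here).
    shared_low = int(np.round(low.mean()))

    # Dequantize: per-weight high bits recombined with the shared low bits.
    u_hat = (high << low_bits) | shared_low
    w_hat = np.empty_like(w)
    w_hat[~salient_mask] = (u_hat - offset) * s_rest
    w_hat[salient_mask] = q_sal * s_sal
    return w_hat

# Usage: reconstruction error stays bounded by the discarded low bits.
w = np.random.randn(16).astype(np.float64)
print(np.max(np.abs(w - quantize_group_lbs(w))))
```

Under these assumptions, each non-salient weight costs `kept_high_bits` bits plus a single shared low-bit pattern per group, instead of `total_bits` bits per weight in plain INT8, which is where the storage reduction comes from.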