Integer Quantization Based on Low Bit Sharing
Abstract
With the rapid advancement of artificial intelligence technologies, large language models (LLMs) have emerged as foundational components in modern intelligent systems. However, the ever-increasing model size, now ranging from billions to hundreds of billions of parameters, poses significant challenges in terms of memory footprint and computational overhead, particularly on latency-sensitive and resource-constrained platforms. While low-bit integer quantization (e.g., INT8) has demonstrated effectiveness in reducing memory consumption and improving inference efficiency, it exhibits notable limitations when applied to large-scale models, including quantization-induced accuracy degradation and inefficient storage utilization. To address these issues, this work proposes a novel numerical quantization framework, termed Low Bit-width Sharing (LBS). Built upon conventional integer quantization, LBS provides a structured high-low bit decomposition scheme that retains only the most significant bits of each parameter and shares the less significant bits within a tensor group. This sharing scheme reduces storage requirements while preserving numerical representational capacity. Furthermore, to mitigate accuracy degradation caused by the quantization of high-impact weights, this work develops a salient-value-aware quantization strategy. Using a Top-K selection algorithm, we isolate the most influential weights, which typically reside in the tail of the parameter distribution, and assign them dedicated scaling factors. This targeted treatment effectively suppresses error accumulation during quantization and improves robustness on downstream tasks. Experimental results across several state-of-the-art LLMs show that LBS achieves substantial storage reduction while preserving high model accuracy, making it a practical and scalable solution for the efficient deployment of large-scale models on modern hardware platforms.
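To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch of one weight group being quantized under an LBS-like scheme: Top-K salient weights receive a dedicated scale, while the remaining INT8 codes are split into per-weight high bits and a single low-bit pattern shared by the whole group. All function names, the choice of the rounded mean as the shared pattern, and the parameter defaults are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def quantize_group_lbs(w, total_bits=8, kept_high_bits=4, top_k=2):
    """Toy sketch of Low Bit-width Sharing (LBS) on one weight group.

    Assumption: the exact sharing rule and salient-weight handling
    here are illustrative; the paper's scheme may differ.
    """
    # Salience-aware step: pick the Top-K largest-magnitude weights
    # and quantize them separately with their own scaling factor.
    k_idx = np.argsort(np.abs(w))[-top_k:]
    salient_mask = np.zeros_like(w, dtype=bool)
    salient_mask[k_idx] = True

    def symmetric_quantize(x, bits):
        qmax = 2 ** (bits - 1) - 1
        max_abs = np.max(np.abs(x))
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
        return q, scale

    # Salient weights keep full-width codes and a dedicated scale.
    q_sal, s_sal = symmetric_quantize(w[salient_mask], total_bits)

    # Remaining weights: quantize to INT8, then split each code into
    # a high-bit field (stored per weight) and a low-bit field.
    q_rest, s_rest = symmetric_quantize(w[~salient_mask], total_bits)
    low_bits = total_bits - kept_high_bits
    offset = 2 ** (total_bits - 1)          # shift to unsigned codes
    u = q_rest + offset
    high = u >> low_bits                    # kept per parameter
    low = u & ((1 << low_bits) - 1)         # candidates for sharing

    # Share one low-bit pattern for the whole group (rounded mean here).
    shared_low = int(np.round(low.mean()))

    # Dequantize: per-weight high bits recombined with the shared low bits.
    u_hat = (high << low_bits) | shared_low
    w_hat = np.empty_like(w)
    w_hat[~salient_mask] = (u_hat - offset) * s_rest
    w_hat[salient_mask] = q_sal * s_sal
    return w_hat

# Usage: reconstruction error stays bounded by the discarded low bits.
w = np.random.randn(16).astype(np.float64)
print(np.max(np.abs(w - quantize_group_lbs(w))))
```

Under these assumptions, each non-salient weight costs `kept_high_bits` bits plus a single shared low-bit pattern per group, instead of `total_bits` bits per weight in plain INT8, which is where the storage reduction comes from.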