
    Integer Quantization based on Low Bit Sharing


       

      Abstract: With the rapid advancement of artificial intelligence technologies, large language models (LLMs) have become foundational components of modern intelligent systems. However, ever-increasing model sizes, now ranging from billions to hundreds of billions of parameters, impose significant memory and computational overhead, particularly on latency-sensitive and resource-constrained platforms. While low-bit integer quantization (e.g., INT8) is effective at reducing memory consumption and improving inference efficiency, it exhibits notable limitations on large-scale models, including quantization-induced accuracy degradation and inefficient storage utilization. To address these issues, this work proposes a novel numerical quantization framework termed Low Bit-width Sharing (LBS). Built upon conventional integer quantization, LBS applies a structured high-low bit decomposition that preserves only the most significant bits of each parameter and shares the less significant bits within a tensor group, reducing storage requirements while retaining numerical representational capability. Furthermore, to mitigate the accuracy degradation caused by quantizing high-impact weights, this work develops a salient-value-aware quantization strategy: a Top-K selection algorithm isolates the most influential weights, which typically reside in the tail of the parameter distribution, and assigns them dedicated scaling factors. This targeted treatment suppresses error accumulation during quantization and improves robustness in downstream performance. Experimental results on several state-of-the-art LLMs show that LBS achieves substantial storage reduction while preserving high model accuracy, making it a practical and scalable solution for deploying large-scale models on modern hardware platforms.
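      The abstract describes two mechanisms: a high-low bit decomposition that keeps only each weight's most significant bits and shares the low bits within a tensor group, and a salient-value-aware path that gives Top-K high-impact weights dedicated scaling factors. The sketch below illustrates both ideas in NumPy. It is a minimal illustration, not the paper's implementation: the symmetric INT8 baseline, the 4-bit low/high split, the group size of 64, the group-mean rule for the shared low bits, the 1% Top-K ratio, and all function names are assumptions chosen for clarity.

```python
# Minimal sketch of the LBS idea from the abstract. All parameter choices
# (INT8 baseline, 4 shared low bits, group size 64, group-mean sharing rule,
# 1% Top-K ratio) are illustrative assumptions, not the paper's exact design.
import numpy as np

def quantize_int8(w):
    """Baseline symmetric INT8 quantization: w ~ q * scale, q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int32)
    return q, scale

def lbs_compress(q, group_size=64, low_bits=4):
    """Split |q| into high bits (stored per weight) and low bits (one shared
    value per group; the group mean is used here purely for illustration).
    Assumes q.size is a multiple of group_size."""
    sign = np.sign(q)
    mag = np.abs(q)
    high = mag >> low_bits                       # most significant bits, kept per weight
    low = mag & ((1 << low_bits) - 1)            # least significant bits, to be shared
    shared_low = np.round(
        low.reshape(-1, group_size).mean(axis=1)
    ).astype(np.int32)                           # one low-bit value per group
    return sign, high, shared_low

def lbs_decompress(sign, high, shared_low, group_size=64, low_bits=4):
    """Rebuild quantized values from per-weight high bits + shared low bits."""
    low = np.repeat(shared_low, group_size)
    return sign * ((high << low_bits) | low)

def salient_aware_quantize(w, topk_ratio=0.01):
    """Give the Top-K largest-magnitude weights a dedicated scaling factor so
    they bypass the shared low-bit path; the rest use the baseline."""
    k = max(1, int(w.size * topk_ratio))
    idx = np.argpartition(np.abs(w), -k)[-k:]    # indices of salient weights
    sal_scale = np.abs(w[idx]).max() / 127.0     # dedicated scaling factor
    sal_q = np.clip(np.round(w[idx] / sal_scale), -127, 127).astype(np.int32)
    rest = w.copy()
    rest[idx] = 0.0                              # exclude salient weights from sharing
    q, scale = quantize_int8(rest)
    return (q, scale), (idx, sal_q, sal_scale)

# Round trip on a random weight tensor.
w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = lbs_decompress(*lbs_compress(q)) * scale
print("mean |error|:", np.abs(w - w_hat).mean())
```

      Under these illustrative choices, each weight needs only a sign bit plus three high bits, with one shared 4-bit value per 64-weight group, roughly halving the per-weight INT8 payload; the actual bit allocation and sharing rule in LBS may differ.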

       
