
    Survey on KV Cache Compression for Large Language Model Inference

    • Abstract: Large language model (LLM) inference requires buffering and reusing intermediate data, namely the key-value (KV) cache. However, the rapid growth of the KV cache imposes significant storage pressure on serving systems. On one hand, the large-scale KV cache of ongoing inference requests creates a GPU memory bottleneck, limiting inference parallelism. On the other hand, the long-term memory features offered by LLM service providers result in massive amounts of KV cache being retained on persistent storage devices, incurring high storage costs and loading overhead. To alleviate the storage pressure caused by the KV cache, existing research compresses it to reduce its size. Nonetheless, achieving high compression ratios while minimizing the impact on model response quality and keeping the additional computational overhead acceptable remains challenging. We present a systematic survey of recent advances in KV cache compression. First, we analyze the scale of the KV cache and its impact on inference efficiency and storage cost, and on this basis identify three major challenges in designing compression strategies: preserving model response quality, minimizing computational overhead, and ensuring generalizability. We then categorize existing approaches into five types: merge-based, quantization-based, token-eviction-based, sharing-based, and attention-head-pruning-based compression. Each category is elaborated with representative examples, discussing its design principles and implementation techniques. Finally, we outline future research directions for KV cache compression.
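
    To make the scale of the problem concrete, the sketch below is illustrative only and is not taken from the survey: it estimates the KV cache footprint of a hypothetical Llama-2-7B-like configuration (32 layers, 32 KV heads, head dimension 128, FP16 values) and then runs a toy per-channel INT8 quantization round-trip in the spirit of the quantization-based family mentioned above. All parameter values and function names are assumptions made for illustration.

    # Illustrative sketch only (not from the survey); model parameters below are assumptions.
    import numpy as np

    def kv_cache_bytes(seq_len: int,
                       n_layers: int = 32,
                       n_kv_heads: int = 32,
                       head_dim: int = 128,
                       bytes_per_elem: int = 2) -> int:
        # Two tensors (K and V) per layer, one head_dim-sized vector per KV head per token.
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

    for seq_len in (512, 4096, 32768):
        print(f"seq_len={seq_len:>6}: {kv_cache_bytes(seq_len) / 2**30:.2f} GiB per request")
    # seq_len=   512: 0.25 GiB;  seq_len=  4096: 2.00 GiB;  seq_len= 32768: 16.00 GiB

    # Toy per-channel INT8 quantization of one cached K tensor (tokens x head_dim),
    # roughly halving its footprint at the cost of a small reconstruction error.
    def quantize_int8(x: np.ndarray):
        scale = np.abs(x).max(axis=0, keepdims=True) / 127.0 + 1e-8  # one scale per channel
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
        return q.astype(np.float32) * scale

    k = np.random.randn(4096, 128).astype(np.float16)   # one K tensor for 4096 cached tokens
    q, scale = quantize_int8(k.astype(np.float32))
    err = np.abs(dequantize(q, scale) - k.astype(np.float32)).mean()
    print(f"fp16: {k.nbytes} B -> int8: {q.nbytes + scale.nbytes} B, mean abs error {err:.4f}")

    The quantization-based schemes surveyed in the paper are considerably more involved (e.g., outlier handling, mixed precision, and grouping), but the arithmetic above shows why shrinking the KV cache pays off at both the GPU-memory and persistent-storage levels.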

       
