
    Survey on KV Cache Compression for Large Language Model Inference

    • Abstract: Large language model (LLM) inference requires buffering and reusing intermediate data, namely the key-value (KV) cache. However, the rapid growth of the KV cache imposes significant storage pressure on serving systems. On one hand, the large-scale KV cache of ongoing inference requests creates a GPU memory bottleneck, limiting inference parallelism. On the other hand, the long-term memory features offered by LLM service providers result in massive amounts of KV cache being retained on persistent storage devices, incurring high storage costs and loading overhead. To alleviate the storage pressure caused by the KV cache, existing research compresses it to reduce its size. Nonetheless, achieving high compression ratios while minimizing the impact on model response quality and keeping the additional computational overhead acceptable remains challenging. We present a systematic survey of recent advances in KV cache compression. First, we analyze the scale of the KV cache and its impact on inference efficiency and storage cost, and on this basis identify three major challenges in designing compression strategies: preserving model response quality, minimizing computational overhead, and ensuring generalizability. We then categorize existing approaches into five types: merge-based, quantization-based, token-eviction-based, sharing-based, and attention-head-pruning-based compression. Each category is elaborated with representative examples, discussing its design principles and implementation techniques. Finally, we outline future research directions for KV cache compression.
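
    To make the scale of the problem concrete, the sketch below is illustrative only and is not taken from the survey: it estimates the KV cache footprint of a hypothetical Llama-2-7B-like configuration (32 layers, 32 KV heads, head dimension 128, FP16 values) and then runs a toy per-channel INT8 quantization round-trip in the spirit of the quantization-based family mentioned above. All parameter values and function names are assumptions made for illustration.

    # Illustrative sketch only (not from the survey); model parameters below are assumptions.
    import numpy as np

    def kv_cache_bytes(seq_len: int,
                       n_layers: int = 32,
                       n_kv_heads: int = 32,
                       head_dim: int = 128,
                       bytes_per_elem: int = 2) -> int:
        # Two tensors (K and V) per layer, one head_dim-sized vector per KV head per token.
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

    for seq_len in (512, 4096, 32768):
        print(f"seq_len={seq_len:>6}: {kv_cache_bytes(seq_len) / 2**30:.2f} GiB per request")
    # seq_len=   512: 0.25 GiB;  seq_len=  4096: 2.00 GiB;  seq_len= 32768: 16.00 GiB

    # Toy per-channel INT8 quantization of one cached K tensor (tokens x head_dim),
    # roughly halving its footprint at the cost of a small reconstruction error.
    def quantize_int8(x: np.ndarray):
        scale = np.abs(x).max(axis=0, keepdims=True) / 127.0 + 1e-8  # one scale per channel
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
        return q.astype(np.float32) * scale

    k = np.random.randn(4096, 128).astype(np.float16)   # one K tensor for 4096 cached tokens
    q, scale = quantize_int8(k.astype(np.float32))
    err = np.abs(dequantize(q, scale) - k.astype(np.float32)).mean()
    print(f"fp16: {k.nbytes} B -> int8: {q.nbytes + scale.nbytes} B, mean abs error {err:.4f}")

    The quantization-based schemes surveyed in the paper are considerably more involved (e.g., outlier handling, mixed precision, and grouping), but the arithmetic above shows why shrinking the KV cache pays off at both the GPU-memory and persistent-storage levels.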

       
