    Survey on KV Cache Compression for Large Language Model Inference[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550461

    Survey on KV Cache Compression for Large Language Model Inference

    • Large language model (LLM) inference requires buffering and reusing intermediate data, namely the key-value (KV) cache. However, the rapid growth of the KV cache imposes significant storage pressure on serving systems. On one hand, the large KV caches of concurrent inference requests create GPU memory bottlenecks, limiting inference parallelism. On the other hand, the long-term memory capabilities offered by LLM service providers require large volumes of KV cache to be kept on persistent storage devices, incurring high storage costs and loading overhead. To alleviate this pressure, existing research has proposed a variety of KV cache compression strategies. Nonetheless, it remains challenging to achieve high compression ratios while minimizing the impact on model response quality and keeping the additional computational overhead acceptable. We present a systematic survey of recent advances in KV cache compression strategies. First, we analyze the scale of the KV cache and its impact on inference efficiency and storage cost, and on this basis identify three major challenges in designing KV cache compression strategies: preserving model response quality, minimizing computational overhead, and ensuring generalizability. We then categorize existing KV cache compression approaches into five types: merge-based, quantization-based, token-eviction-based, sharing-based, and attention-head-pruning-based strategies. Each category is elaborated with representative examples, discussing its design principles and implementation techniques. Finally, we outline potential future research directions in KV cache compression.
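    To make one of the surveyed categories concrete, below is a minimal sketch of quantization-based KV cache compression: each cached key/value row is stored as 8-bit integer codes plus a per-token scale and offset, cutting memory roughly 4x versus fp32 at the cost of a bounded rounding error. This is an illustrative example only, not the method of any specific paper; the function names and the per-token asymmetric scheme are assumptions for the sketch.

    ```python
    import numpy as np

    def quantize_kv(kv, num_bits=8):
        """Per-token asymmetric quantization of a KV cache tensor.

        kv: float array of shape (seq_len, head_dim).
        Returns uint8 codes plus per-token scale and offset
        needed to dequantize. Illustrative sketch only.
        """
        qmax = 2 ** num_bits - 1
        lo = kv.min(axis=-1, keepdims=True)   # per-token minimum (offset)
        hi = kv.max(axis=-1, keepdims=True)   # per-token maximum
        scale = (hi - lo) / qmax
        scale = np.where(scale == 0, 1.0, scale)  # guard constant rows
        codes = np.round((kv - lo) / scale).astype(np.uint8)
        return codes, scale, lo

    def dequantize_kv(codes, scale, lo):
        """Reconstruct an approximate KV tensor from the stored codes."""
        return codes.astype(np.float32) * scale + lo
    ```

    With 8-bit codes the worst-case per-element error is half a quantization step (scale / 2); finer-grained (e.g. per-channel) scales or fewer bits trade accuracy against compression ratio, which is exactly the design space the quantization-based strategies in this survey explore.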
