    Survey on KV Cache Compression for Large Language Model Inference[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550461

    Survey on KV Cache Compression for Large Language Model Inference

    • Large language model (LLM) inference requires buffering and reusing intermediate data, namely the key-value (KV) cache. However, the rapid growth of the KV cache imposes significant storage pressure on serving systems. On one hand, the large KV caches of concurrent inference requests create GPU memory bottlenecks, limiting inference parallelism. On the other hand, the long-term memory capabilities offered by LLM service providers require large volumes of KV cache to be kept on persistent storage devices, incurring high storage costs and loading overhead. To alleviate this pressure, existing research has proposed a variety of KV cache compression strategies. Nonetheless, it remains challenging to achieve high compression ratios while minimizing the impact on model response quality and keeping the additional computational overhead acceptable. We present a systematic survey of recent advances in KV cache compression strategies. First, we analyze the scale of the KV cache and its impact on inference efficiency and storage cost, and on this basis identify three major challenges in designing KV cache compression strategies: preserving model response quality, minimizing computational overhead, and ensuring generalizability. We then categorize existing KV cache compression approaches into five types: merge-based, quantization-based, token-eviction-based, sharing-based, and attention-head-pruning-based strategies. Each category is elaborated with representative examples, discussing its design principles and implementation techniques. Finally, we outline potential future research directions in KV cache compression.
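    To make one of the surveyed categories concrete, below is a minimal sketch of quantization-based KV cache compression: each cached key/value row is stored as 8-bit integer codes plus a per-token scale and offset, cutting memory roughly 4x versus fp32 at the cost of a bounded rounding error. This is an illustrative example only, not the method of any specific paper; the function names and the per-token asymmetric scheme are assumptions for the sketch.

    ```python
    import numpy as np

    def quantize_kv(kv, num_bits=8):
        """Per-token asymmetric quantization of a KV cache tensor.

        kv: float array of shape (seq_len, head_dim).
        Returns uint8 codes plus per-token scale and offset
        needed to dequantize. Illustrative sketch only.
        """
        qmax = 2 ** num_bits - 1
        lo = kv.min(axis=-1, keepdims=True)   # per-token minimum (offset)
        hi = kv.max(axis=-1, keepdims=True)   # per-token maximum
        scale = (hi - lo) / qmax
        scale = np.where(scale == 0, 1.0, scale)  # guard constant rows
        codes = np.round((kv - lo) / scale).astype(np.uint8)
        return codes, scale, lo

    def dequantize_kv(codes, scale, lo):
        """Reconstruct an approximate KV tensor from the stored codes."""
        return codes.astype(np.float32) * scale + lo
    ```

    With 8-bit codes the worst-case per-element error is half a quantization step (scale / 2); finer-grained (e.g. per-channel) scales or fewer bits trade accuracy against compression ratio, which is exactly the design space the quantization-based strategies in this survey explore.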
