Abstract:
In recent years, large language models (LLMs) have exhibited remarkable performance, profoundly transforming various aspects of human life. As these models grow in size and user demand for long-context inference increases, LLM inference systems face significant storage challenges. These challenges stem primarily from the vast number of model parameters and the key-value (KV) cache required for efficient inference, both of which strain GPU memory resources. Additionally, inefficient storage usage in distributed systems often results in over-provisioning and fault-tolerance issues, further complicating resource management. Researchers have explored memory optimization, heterogeneous storage, and distributed storage; we synthesize these research efforts, which address GPU memory constraints and enhance resource utilization. Memory-optimized LLM inference systems improve GPU memory efficiency and reduce memory footprint through techniques such as efficient KV cache management, compression, and attention operator optimization. Heterogeneous-storage-based LLM inference systems expand storage capacity by integrating diverse storage resources, minimizing I/O overhead via tensor placement strategies, asynchronous data transfer, and intelligent memory allocation and prefetching. Distributed LLM systems enhance the utilization of multi-machine resources, boosting execution efficiency and fault tolerance in LLM inference tasks through batching, multi-level scheduling, and redundant replication. Finally, we review existing research and outline future research directions to further optimize storage solutions for LLM inference systems.