Abstract:
In recent years, large language models (LLMs) have exhibited remarkable performance, profoundly transforming various aspects of human life. As these models grow in size and user demand for long-context inference increases, LLM inference systems face significant storage challenges. These challenges stem primarily from the vast number of model parameters and the key-value (KV) cache required for efficient inference, both of which strain GPU memory resources. Additionally, inefficient storage usage in distributed systems often results in over-provisioning and fault-tolerance issues, further complicating resource management. Researchers have explored memory optimization, heterogeneous storage, and distributed storage; we synthesize these research efforts, which address GPU memory constraints and enhance resource utilization. Memory-optimized LLM inference systems improve GPU memory efficiency and reduce memory footprint through techniques such as efficient KV cache management, compression, and attention operator optimization. Heterogeneous-storage-based LLM inference systems expand storage capacity by integrating diverse storage resources, minimizing I/O overhead via tensor placement strategies, asynchronous data transfer, and intelligent memory allocation and prefetching. Distributed LLM systems enhance the utilization of multi-machine resources, boosting execution efficiency and fault tolerance in LLM inference tasks through batching, multi-level scheduling, and redundant replication. Finally, we review existing research and outline future research directions to further optimize storage solutions for LLM inference systems.