Abstract:
In recent years, large language models (LLMs), represented by ChatGPT, have developed rapidly. As model parameter counts continue to grow, building and deploying LLMs places higher demands on data scale and storage access efficiency, which poses significant challenges to traditional storage systems. This study first analyzes the storage access characteristics of the three critical stages of LLM workflows: data preparation, model training, and inference. It then examines in depth the major issues and bottlenecks that traditional storage systems face in LLM scenarios. To address these challenges, the study proposes and implements ScaleFS, a high-performance and scalable distributed metadata design. ScaleFS decouples directory tree metadata from attribute metadata and combines this with a hierarchical partitioning strategy that balances depth and breadth in the directory tree. This design enables efficient path resolution, load balancing, and system scalability, making ScaleFS capable of effectively managing hundreds of billions of files. Additionally, ScaleFS introduces fine-grained metadata structures, optimizes metadata access patterns, and develops a metadata key-value store tailored to file semantics. These innovations significantly improve metadata access efficiency while reducing disk I/O operations. Experimental results demonstrate that ScaleFS achieves operations per second (OPS) rates 1.04 to 7.12 times higher than HDFS, with latency reduced to only 12.67% to 99.55% of that of HDFS. Furthermore, at a scale of hundreds of billions of files, ScaleFS outperforms HDFS in most operations, even when HDFS operates at only a billion-file scale, which demonstrates its superior scalability and access efficiency. ScaleFS is thus well suited to the demands of LLM scenarios for managing and efficiently accessing massive file datasets.
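
To make the decoupling of directory tree metadata from attribute metadata more concrete, the following is a minimal illustrative sketch, not the ScaleFS implementation. It assumes a single in-memory store with two tables: a tree table keyed by (parent inode id, entry name) and a separate attribute table keyed by inode id. The names `MetadataStore`, `ROOT_ID`, `resolve`, `create`, and `stat` are hypothetical and chosen only for illustration. Partitioning, persistence, and concurrency are omitted.

```python
from dataclasses import dataclass, field

ROOT_ID = 0  # hypothetical inode id assigned to "/"


@dataclass
class MetadataStore:
    # Directory tree metadata: (parent inode id, entry name) -> child inode id.
    tree: dict = field(default_factory=dict)
    # Attribute metadata: inode id -> attribute dict, kept in a separate table.
    attrs: dict = field(default_factory=lambda: {ROOT_ID: {"type": "dir"}})
    next_id: int = 1

    def resolve(self, path):
        """Walk the tree table only; no attribute records are read."""
        inode = ROOT_ID
        for name in filter(None, path.split("/")):
            inode = self.tree[(inode, name)]  # KeyError if a component is missing
        return inode

    def create(self, path, **attributes):
        """Add one tree entry and one attribute record for a new path."""
        parent_path, _, name = path.rstrip("/").rpartition("/")
        parent = self.resolve(parent_path)
        if (parent, name) in self.tree:
            raise FileExistsError(path)
        inode = self.next_id
        self.next_id += 1
        self.tree[(parent, name)] = inode
        self.attrs[inode] = attributes
        return inode

    def stat(self, path):
        """Resolve the path first, then read a single attribute record."""
        return self.attrs[self.resolve(path)]


store = MetadataStore()
store.create("/data", type="dir")
store.create("/data/train.bin", type="file", size=4096)
print(store.stat("/data/train.bin"))
```

In this sketch, path resolution touches only the compact tree table, and the attribute table is read once at the end of the walk. That separation is the property the decoupled design relies on; how ScaleFS actually lays out and partitions these tables is not specified here.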