Citation: Shang Biyun, Han Yinjun, Xiao Rong, Chen Zhenghua, Tu Yaofeng, Dong Zhenjiang. ScaleFS: High Performance and Scalable Metadata Design for Large Language Models[J]. Journal of Computer Research and Development, 2025, 62(3): 589-604. DOI: 10.7544/issn1000-1239.202440373
In recent years, large language models (LLMs) represented by ChatGPT have developed rapidly. As model parameter scales continue to grow, building and deploying LLMs imposes higher requirements on data scale and storage access efficiency, which poses significant challenges to traditional storage systems. This study first analyzes the storage access characteristics of the three critical stages of the LLM workflow: data preparation, model training, and inference. It then examines the major issues and bottlenecks that traditional storage systems face in LLM scenarios. To address these challenges, the study proposes and implements ScaleFS, a high-performance and scalable distributed metadata design. ScaleFS decouples directory-tree metadata from attribute metadata and combines this with a hierarchical partitioning strategy that balances depth and breadth in the directory tree. This design enables efficient path resolution, load balancing, and system scalability, allowing ScaleFS to effectively manage hundreds of billions of files. Additionally, ScaleFS introduces fine-grained metadata structures, optimizes metadata access patterns, and develops a metadata key-value store tailored to file semantics. These innovations significantly improve metadata access efficiency while reducing disk I/O operations. Experimental results show that ScaleFS achieves operations per second (OPS) 1.04 to 7.12 times higher than HDFS, with latency reduced to 12.67% to 99.55% of HDFS's. Furthermore, at the scale of hundreds of billions of files, ScaleFS outperforms HDFS in most operations even when HDFS operates at a billion-file scale, demonstrating superior scalability and access efficiency. ScaleFS is thus well suited to the demands of LLM scenarios for managing and efficiently accessing massive numbers of files.
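The abstract describes ScaleFS as decoupling directory-tree metadata (name-to-inode mappings) from attribute metadata (per-file attributes), so that path resolution touches only the former and the two classes can be partitioned and scaled independently. The Go sketch below is a minimal illustration of that separation; the keying of directory entries by (parent inode ID, name), the Attr fields, and the in-memory maps standing in for a partitioned key-value store are illustrative assumptions, not the paper's actual on-disk format.

```go
// A minimal sketch of a decoupled metadata layout, assuming dentries are keyed
// by (parent inode ID, name) and attributes by inode ID. Not the paper's exact design.
package main

import "fmt"

type InodeID uint64

// Attr holds per-file attribute metadata (hypothetical fields).
type Attr struct {
	Size  uint64
	Mode  uint32
	Mtime int64
}

// MetaStore keeps the two metadata classes in separate keyspaces; a real
// system would back each with a key-value store partitioned across
// metadata servers rather than in-memory maps.
type MetaStore struct {
	dentries map[InodeID]map[string]InodeID // directory-tree metadata
	attrs    map[InodeID]*Attr              // attribute metadata
	next     InodeID
}

func NewMetaStore() *MetaStore {
	return &MetaStore{
		dentries: map[InodeID]map[string]InodeID{1: {}}, // inode 1 = root "/"
		attrs:    map[InodeID]*Attr{1: {Mode: 0o755}},
		next:     2,
	}
}

// Create inserts a child under parent, writing the two keyspaces independently.
func (m *MetaStore) Create(parent InodeID, name string, attr *Attr) InodeID {
	id := m.next
	m.next++
	if m.dentries[parent] == nil {
		m.dentries[parent] = map[string]InodeID{}
	}
	m.dentries[parent][name] = id // directory-tree entry
	m.attrs[id] = attr            // attribute entry
	return id
}

// Lookup resolves one path component using only directory-tree metadata, so
// path resolution never has to read file attributes.
func (m *MetaStore) Lookup(parent InodeID, name string) (InodeID, bool) {
	id, ok := m.dentries[parent][name]
	return id, ok
}

func main() {
	m := NewMetaStore()
	dir := m.Create(1, "datasets", &Attr{Mode: 0o755})
	file := m.Create(dir, "shard-00001", &Attr{Size: 1 << 20, Mode: 0o644})
	id, _ := m.Lookup(dir, "shard-00001")
	fmt.Println(file == id, m.attrs[id].Size) // true 1048576
}
```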