Citation: Feng Yangyang, Wang Qing, Xie Minhui, Shu Jiwu. From BERT to ChatGPT: Challenges and Technical Development of Storage Systems for Large Model Training[J]. Journal of Computer Research and Development, 2024, 61(4): 809-823. DOI: 10.7544/issn1000-1239.202330554
Large models represented by ChatGPT have attracted wide attention from industry and academia for their excellent performance on text generation and semantic understanding tasks. The number of parameters in large models has grown by tens of thousands of times within three years and is still increasing, which poses new challenges to storage systems. First, we analyze the storage challenges of large model training, pointing out that large model training exhibits unique computation patterns, storage access patterns, and data characteristics, which make traditional storage techniques inefficient for large model training tasks. Then, we describe three types of storage acceleration techniques and two types of storage fault-tolerance techniques for large model training. The storage acceleration techniques include: 1) distributed storage techniques based on large model computation patterns, which design the partitioning, placement, and transfer strategies of model data in distributed clusters according to how the computation tasks of large models are partitioned and the dependencies among these tasks; 2) heterogeneous storage access-pattern-aware techniques, which devise data prefetching and transfer strategies among heterogeneous devices by exploiting the predictability of storage access patterns in large model training; 3) large model data reduction techniques, which reduce the volume of data produced during training according to the characteristics of large model data. The storage fault-tolerance techniques include: 1) parameter checkpointing techniques, which persist large model parameters to durable storage devices so that training can resume from the latest checkpoint after a failure; 2) redundant computation techniques, which compute the same version of parameters repeatedly on multiple GPUs. Finally, we summarize the field and give suggestions for future research.
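To make the parameter checkpointing idea above concrete, the following is a minimal sketch in PyTorch, not the implementation of any system surveyed in the paper: model and optimizer state are periodically written to persistent storage so that training can resume from the latest complete checkpoint after a failure. The function names, the checkpointing interval, and the write-then-rename scheme are illustrative assumptions.

```python
# Minimal parameter-checkpointing sketch (illustrative assumptions only).
import os
import torch

def save_checkpoint(model, optimizer, step, checkpoint_dir):
    """Persist the current model and optimizer state for fault tolerance."""
    state = {
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    tmp_path = os.path.join(checkpoint_dir, f"ckpt-{step}.pt.tmp")
    final_path = os.path.join(checkpoint_dir, f"ckpt-{step}.pt")
    torch.save(state, tmp_path)       # write the full state to a temporary file
    os.replace(tmp_path, final_path)  # atomic rename: no partially written checkpoint

def train(model, optimizer, data_loader, checkpoint_dir, interval=1000):
    """Run training and checkpoint the parameters every `interval` iterations."""
    for step, batch in enumerate(data_loader):
        loss = model(batch).mean()    # placeholder training step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % interval == 0:
            save_checkpoint(model, optimizer, step, checkpoint_dir)
```

Writing the state to a temporary file and renaming it afterwards keeps a failed or interrupted write from corrupting the most recent usable checkpoint, which is the basic guarantee that checkpoint-based fault tolerance relies on; more elaborate schemes (asynchronous or frequency-adaptive checkpointing) build on the same pattern.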