Citation: Feng Yangyang, Wang Qing, Xie Minhui, Shu Jiwu. From BERT to ChatGPT: Challenges and Technical Development of Storage Systems for Large Model Training[J]. Journal of Computer Research and Development, 2024, 61(4): 809-823. DOI: 10.7544/issn1000-1239.202330554
Large models, exemplified by ChatGPT, have attracted significant attention from industry and academia for their excellent performance on text generation and semantic understanding tasks. The number of parameters in large models has grown by tens of thousands of times within three years and continues to increase, posing new challenges for storage systems. We first analyze the storage challenges of large model training, pointing out that its unique computation patterns, storage access patterns, and data characteristics make traditional storage techniques inefficient for large model training tasks. We then describe three types of storage acceleration techniques and two types of storage fault-tolerance techniques. The storage acceleration techniques for large model training are: 1) distributed storage techniques based on large model computation patterns, which design the partitioning, storage, and transfer strategies for model data in distributed clusters according to how the computation tasks are partitioned and the dependencies among them; 2) heterogeneous-storage techniques aware of access patterns in large model training, which exploit the predictability of storage accesses to devise data prefetching and transfer strategies across heterogeneous devices; 3) data reduction techniques, which shrink the volume of data handled during training by exploiting the characteristics of large model data. The storage fault-tolerance techniques for large model training are: 1) parameter checkpointing, which persists large model parameters to durable storage devices; 2) redundant computation, which computes the same version of the parameters on multiple GPUs. Finally, we summarize the field and offer suggestions for future research.
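To make the parameter checkpointing idea above concrete, the following is a minimal, illustrative PyTorch-style sketch rather than the specific systems surveyed in the paper; the checkpoint directory and function names are placeholders chosen for the example.

```python
import os
import torch


def save_checkpoint(model, optimizer, step, ckpt_dir="checkpoints"):
    """Persist the current training state so it can be restored after a failure."""
    os.makedirs(ckpt_dir, exist_ok=True)
    tmp_path = os.path.join(ckpt_dir, f"step_{step}.pt.tmp")
    final_path = os.path.join(ckpt_dir, f"step_{step}.pt")
    # Snapshot parameters and optimizer state (e.g., Adam moments) together,
    # since both are needed to resume training from the same point.
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        tmp_path,
    )
    # Atomic rename so a crash during writing never leaves a truncated checkpoint.
    os.replace(tmp_path, final_path)


def load_checkpoint(model, optimizer, path):
    """Restore model and optimizer state from a persisted checkpoint and return the step."""
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["step"]
```

In practice, production checkpointing systems additionally try to hide the cost of serialization and persistence by overlapping it with subsequent training iterations, so that training stalls only briefly while the in-GPU state is snapshotted.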