Abstract:
Large models, exemplified by ChatGPT, have attracted considerable attention from both industry and academia for their excellent performance on text generation and semantic understanding tasks. The number of parameters in large models has grown by a factor of tens of thousands within three years and continues to increase, posing new challenges for storage systems. We first analyze the storage challenges of large model training, pointing out that it exhibits unique computation patterns, storage access patterns, and data characteristics, which render traditional storage techniques inefficient for large model training tasks. We then describe three types of storage acceleration techniques and two types of fault-tolerance techniques. The storage acceleration techniques for large model training include: 1) distributed storage techniques based on large model computation patterns, which design the partitioning, placement, and transfer strategies of model data across distributed clusters according to how large model computation tasks are partitioned and the dependencies among those tasks; 2) heterogeneous storage techniques aware of large model access patterns, which develop data prefetching and transfer strategies among heterogeneous devices by exploiting the predictability of storage access patterns during large model training; 3) large model data reduction techniques, which reduce the volume of data produced during training according to the characteristics of large model data. The storage fault-tolerance techniques for large model training include: 1) parameter checkpointing techniques, which persist large model parameters to durable storage devices; 2) redundant computation techniques, which compute the same version of parameters redundantly on multiple GPUs. Finally, we summarize the surveyed work and offer suggestions for future research.
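To make the parameter checkpointing idea concrete, the following is a minimal sketch, not taken from any of the surveyed systems, of periodically persisting model and optimizer state with PyTorch so that training can resume after a failure; the function names and the checkpoint path are illustrative assumptions.

import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    # Persist the current parameter and optimizer state so that
    # training can resume from this step after a failure.
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    # Restore the most recently persisted state; training restarts
    # from the recorded step instead of from scratch.
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]

The checkpointing frequency trades off the runtime overhead of writing to persistent storage against the amount of work lost on a failure, which is one of the central tensions the surveyed fault-tolerance techniques address.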