
    HiTrain: Heterogeneous Memory Offloading and I/O Optimization for Large Language Model Training


      Abstract: With the continuous growth of the parameter scale of Large Language Models (LLMs), fine-tuning models with tens of billions of parameters places extremely high demands on computing and storage resources. Traditional distributed training schemes typically rely on large numbers of high-end Graphics Processing Units (GPUs) and high-speed interconnect networks, making training extremely expensive. Although existing single-GPU training schemes relieve GPU memory pressure through tensor offloading, they still suffer from low I/O transfer efficiency and insufficient device utilization. Traditional kernel-level I/O introduces frequent system calls and context switches during large-scale tensor transfers, becoming a key performance bottleneck; meanwhile, optimizer computation fails to fully exploit the parallelism of multi-core CPUs and is therefore difficult to overlap effectively with GPU computation, further limiting system performance. To address these problems, this paper proposes HiTrain, a heterogeneous memory offloading and I/O optimization scheme for LLM training, built around two key techniques. First, it constructs a high-performance tensor storage module based on the Storage Performance Development Kit (SPDK), which manages tensor data in user space, avoiding kernel I/O stack overhead and improving the concurrency and throughput of tensor offloading. Second, it designs and implements an asynchronous-optimizer-based storage-compute pipeline scheduling module, which reorders optimizer execution to reduce GPU idle time and improve overall training efficiency. Experimental results show that on a server equipped with a single GPU and a Non-Volatile Memory Express Solid State Drive (NVMe SSD), the proposed scheme makes full use of the system's computing and storage resources, improving tensor offloading and loading efficiency during training by 32.7% and raising overall training throughput to 1.49 times that of the existing scheme, offering a practical and cost-effective path to low-cost LLM training.
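      To make the first technique concrete, below is a minimal Python sketch of a user-space-style tensor store. It is not HiTrain's implementation: SPDK is a C library with no standard Python binding, so the sketch substitutes Linux O_DIRECT for SPDK's user-space NVMe driver. What it preserves is the shared principle of keeping bulk tensor I/O out of the kernel page cache and issuing large, block-aligned transfers. DirectTensorStore, BLOCK, and the index layout are this sketch's own names, and the 4 KiB alignment is an assumption.

    import mmap
    import os

    import torch  # assumed: the surrounding training stack is PyTorch

    BLOCK = 4096  # assumed NVMe logical-block size / O_DIRECT alignment


    class DirectTensorStore:
        # Illustrative stand-in for HiTrain's SPDK-based store. SPDK itself is
        # a C library with no standard Python binding, so this sketch uses the
        # Linux-only O_DIRECT flag instead; the shared idea is keeping bulk
        # tensor I/O out of the page cache via large block-aligned transfers.
        def __init__(self, path):
            self.fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_DIRECT, 0o644)
            self.index = {}  # tensor name -> (offset, nbytes, dtype, shape)
            self.tail = 0    # next free byte offset in the flat file

        def _aligned_buffer(self, nbytes):
            size = -(-nbytes // BLOCK) * BLOCK  # round up to a BLOCK multiple
            return mmap.mmap(-1, size)          # anonymous mmap is page-aligned

        def offload(self, name, tensor):
            # Assumes a numpy-compatible dtype (e.g. float32/float16).
            data = tensor.detach().contiguous().cpu().numpy().tobytes()
            buf = self._aligned_buffer(len(data))
            buf[: len(data)] = data
            os.pwritev(self.fd, [buf], self.tail)  # aligned direct write
            self.index[name] = (self.tail, len(data), tensor.dtype,
                                tuple(tensor.shape))
            self.tail += len(buf)
            buf.close()

        def load(self, name):
            offset, nbytes, dtype, shape = self.index[name]
            buf = self._aligned_buffer(nbytes)
            os.preadv(self.fd, [buf], offset)      # aligned direct read
            out = torch.frombuffer(bytearray(buf[: nbytes]), dtype=dtype)
            buf.close()
            return out.reshape(shape)

      A faithful SPDK implementation would go further: it would poll NVMe completion queues from dedicated cores and issue commands against pre-registered DMA buffers, eliminating even the remaining pwritev/preadv system calls that this approximation still makes.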
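      The second technique can likewise be illustrated with standard PyTorch and a thread pool. This is a generic sketch of the overlap pattern used by CPU-offloaded optimizers, not HiTrain's actual scheduler: cpu_adam_step, train_step, and the states layout are names invented here, and the abstract does not specify the paper's exact reordering at this granularity.

    import os
    from concurrent.futures import ThreadPoolExecutor

    import torch

    _cpu_pool = ThreadPoolExecutor(max_workers=os.cpu_count())


    def cpu_adam_step(state, grad, lr=1e-4, betas=(0.9, 0.999), eps=1e-8):
        # Plain Adam on CPU tensors, standing in for a CPU-side optimizer.
        # PyTorch releases the GIL inside tensor kernels, so several of these
        # calls make real progress in parallel worker threads.
        master, m, v, t = state  # state: [fp32 master weight, m, v, step]
        t += 1
        m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
        v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
        denom = (v / (1 - betas[1] ** t)).sqrt_().add_(eps)
        master.addcdiv_(m / (1 - betas[0] ** t), denom, value=-lr)
        state[3] = t
        return master


    def train_step(model, batch, targets, states, loss_fn):
        loss_fn(model(batch), targets).backward()  # GPU forward + backward
        futures = []
        for name, p in model.named_parameters():
            grad = p.grad.float().cpu()  # a real system would use pinned buffers
            p.grad = None                # GPU memory is free for the next batch
            # Submit the update to a CPU worker instead of blocking the GPU.
            futures.append((p, _cpu_pool.submit(cpu_adam_step,
                                                states[name], grad)))
        # The next micro-batch's forward/backward could be launched here so
        # the GPU is not idle while the CPU workers run; this sketch instead
        # waits just before the updated weights are needed again.
        for p, fut in futures:
            p.data.copy_(fut.result())   # write the updated master weight back

      The point the sketch illustrates is that optimizer updates for different parameters are independent, so they can be spread across CPU cores and overlapped with the next micro-batch's GPU work; the only mandatory synchronization falls just before the updated weights are consumed again.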

       
