    Citation: HiTrain: Heterogeneous Memory Offloading and I/O Optimization for Large Language Model Training[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550478

    HiTrain: Heterogeneous Memory Offloading and I/O Optimization for Large Language Model Training

    Abstract: With the continuous growth of the parameter scale of Large Language Models (LLMs), fine-tuning models with hundreds of billions of parameters places extremely high demands on computing and storage resources. Traditional distributed training schemes often rely on large numbers of high-end Graphics Processing Units (GPUs) and high-speed interconnects, making training extremely costly. Although existing single-GPU training schemes relieve GPU memory pressure through tensor offloading, they still suffer from low I/O transfer efficiency and insufficient device utilization. Conventional kernel-level I/O introduces frequent system calls and context switches during large-scale tensor transfers, becoming a key bottleneck that restricts performance. Meanwhile, optimizer computation fails to fully exploit the parallelism of multi-core CPUs and is therefore difficult to overlap effectively with GPU computation, further limiting system performance. To address these problems, this paper proposes HiTrain, a heterogeneous memory offloading and I/O optimization scheme for LLM training, centered on the design and implementation of two key technologies. First, it builds a high-performance tensor storage module on the Storage Performance Development Kit (SPDK), which manages tensor data in user space, avoiding the overhead of the kernel I/O stack and improving the concurrency and throughput of tensor offloading. Second, it designs and implements an asynchronous-optimizer storage-compute pipeline scheduling module, which reorders optimizer execution to reduce GPU idle time and thereby improve overall training efficiency. Experimental results show that on a server equipped with a single GPU and a Non-Volatile Memory Express Solid State Drive (NVMe SSD), the proposed scheme makes full use of the system's computing and storage resources, improves the efficiency of tensor offloading and loading during model training by 32.7%, and raises overall training throughput to 1.49 times that of the existing scheme, offering a practical and cost-effective path for LLM training.
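
    The paper itself includes no code; the following is a minimal sketch of what a user-space tensor write path on SPDK's NVMe driver can look like. The spdk_* calls are the real SPDK API, but everything application-side (the 4 MiB shard size, writing at LBA 0, the single controller/namespace assumption) is illustrative, and error handling is trimmed for brevity.

    /* Sketch: offloading one tensor shard to NVMe entirely in user space with
     * SPDK's NVMe driver. Shard size, LBA, and naming are assumptions. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>
    #include "spdk/env.h"
    #include "spdk/nvme.h"

    #define TENSOR_BYTES (4u << 20)          /* 4 MiB tensor shard (assumption) */

    static struct spdk_nvme_ctrlr *g_ctrlr;
    static struct spdk_nvme_ns    *g_ns;

    static bool probe_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                         struct spdk_nvme_ctrlr_opts *opts) {
        return true;                          /* attach to every probed controller */
    }

    static void attach_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                          struct spdk_nvme_ctrlr *ctrlr,
                          const struct spdk_nvme_ctrlr_opts *opts) {
        g_ctrlr = ctrlr;
        g_ns = spdk_nvme_ctrlr_get_ns(ctrlr, 1);  /* first namespace */
    }

    static void write_done(void *arg, const struct spdk_nvme_cpl *cpl) {
        *(bool *)arg = true;                  /* completion flag; no syscall involved */
    }

    int main(void) {
        struct spdk_env_opts opts;
        spdk_env_opts_init(&opts);
        opts.name = "hitrain_sketch";
        if (spdk_env_init(&opts) < 0) return 1;

        /* Probe the local PCIe NVMe controller entirely in user space. */
        if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) != 0 || !g_ns)
            return 1;

        /* DMA-safe pinned buffer allocated by SPDK instead of malloc(). */
        void *buf = spdk_zmalloc(TENSOR_BYTES, 4096, NULL,
                                 SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);
        if (!buf) return 1;
        memset(buf, 0xab, TENSOR_BYTES);      /* stand-in for real tensor bytes */

        struct spdk_nvme_qpair *qp =
            spdk_nvme_ctrlr_alloc_io_qpair(g_ctrlr, NULL, 0);

        uint32_t sector = spdk_nvme_ns_get_sector_size(g_ns);
        bool done = false;
        /* Submit the write; completions are harvested by polling, so there is
         * no kernel I/O stack, syscall, or context switch per tensor transfer. */
        spdk_nvme_ns_cmd_write(g_ns, qp, buf, 0 /* lba */,
                               TENSOR_BYTES / sector, write_done, &done, 0);
        while (!done)
            spdk_nvme_qpair_process_completions(qp, 0);

        printf("tensor shard offloaded via user-space NVMe write\n");
        spdk_nvme_ctrlr_free_io_qpair(qp);
        spdk_free(buf);
        spdk_nvme_detach(g_ctrlr);
        return 0;
    }

    The property this sketch illustrates is the one the abstract identifies as the bottleneck fix: both submission and completion happen in user space (completion via polling spdk_nvme_qpair_process_completions), so large-scale tensor offloading pays no per-I/O system-call or context-switch cost.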
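
    The second technique, hiding CPU-side optimizer time behind GPU compute, can be illustrated with a delayed-update pipeline. The sketch below is an assumption-laden simulation rather than HiTrain's scheduler: gpu_forward_backward is a stand-in for real GPU kernels, a single pthread stands in for a multi-core optimizer, and plain Adam is used for concreteness.

    /* Sketch: overlap the CPU Adam update for step t-1 with the (simulated)
     * GPU forward/backward of step t, so optimizer time hides behind GPU time. */
    #include <math.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define N_PARAMS (1u << 20)

    typedef struct {
        float *w, *g, *m, *v;   /* params, grads, Adam moments (CPU-resident) */
        int    step;
    } adam_task_t;

    /* CPU-side Adam update; a real system would shard this across cores. */
    static void *adam_step(void *arg) {
        adam_task_t *t = arg;
        const float lr = 1e-4f, b1 = 0.9f, b2 = 0.999f, eps = 1e-8f;
        float c1 = 1.0f - powf(b1, (float)t->step);
        float c2 = 1.0f - powf(b2, (float)t->step);
        for (size_t i = 0; i < N_PARAMS; i++) {
            t->m[i] = b1 * t->m[i] + (1 - b1) * t->g[i];
            t->v[i] = b2 * t->v[i] + (1 - b2) * t->g[i] * t->g[i];
            t->w[i] -= lr * (t->m[i] / c1) / (sqrtf(t->v[i] / c2) + eps);
        }
        return NULL;
    }

    /* Stand-in for one microbatch of GPU forward+backward; in the full design
     * this is also where tensor offloads/loads to NVMe would be in flight. */
    static void gpu_forward_backward(float *g) {
        for (size_t i = 0; i < N_PARAMS; i++) g[i] = 0.01f;  /* fake gradients */
        usleep(10000);
    }

    int main(void) {
        float *w = calloc(N_PARAMS, sizeof *w), *m = calloc(N_PARAMS, sizeof *m);
        float *v = calloc(N_PARAMS, sizeof *v);
        float *g_cur  = calloc(N_PARAMS, sizeof *g_cur);
        float *g_prev = calloc(N_PARAMS, sizeof *g_prev);

        pthread_t opt;
        int opt_running = 0;
        adam_task_t task = { w, g_prev, m, v, 0 };

        for (int step = 1; step <= 8; step++) {
            gpu_forward_backward(g_cur);               /* GPU busy on step t ...   */
            if (opt_running) pthread_join(opt, NULL);  /* ... optimizer on t-1 ran */

            /* Hand fresh gradients to the CPU optimizer and return control at
             * once, so the next GPU step starts immediately (delayed update). */
            float *tmp = g_prev; g_prev = g_cur; g_cur = tmp;
            task.g = g_prev; task.step = step;
            pthread_create(&opt, NULL, adam_step, &task);
            opt_running = 1;
        }
        pthread_join(opt, NULL);
        printf("w[0] after 8 overlapped steps: %f\n", w[0]);
        free(w); free(m); free(v); free(g_cur); free(g_prev);
        return 0;
    }

    The design choice shown, applying the optimizer update one step late so it runs concurrently with the next GPU step, is the generic form of the reordering the abstract describes; HiTrain's actual scheduling module additionally coordinates these updates with the SPDK offload traffic.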