    Xu Ying, Wang Mengdi, Cheng Long, Liu Lian, Zhao Shixin, Zhang Lei, Wang Ying. Pipe-RLHF: A Computation Mode-Aware Parallel Framework for RLHF[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550127

    Pipe-RLHF: A Computation Mode-Aware Parallel Framework for RLHF

    • Reinforcement learning with human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences. The most costly part of RLHF is proximal policy optimization (PPO), which consists of three dependent steps. These steps exhibit different computation modes, so simply applying the same parallelization strategy to every step involving multiple model variants, as existing frameworks do, leads to poor performance in the PPO generation step due to insufficient utilization of computational resources. We therefore introduce Pipe-RLHF, a parallelism framework for RLHF fine-tuning that adaptively employs distinct parallelization strategies for different steps according to their computation modes. Specifically, we first investigate the characteristics of the various computation modes to identify the best-fit parallelization approach for each. We then present a novel delayed inter-batch pipeline parallelization approach designed specifically for the PPO generation step, enabling full utilization of computational resources. Building on this approach, we define a hierarchical parallel plan space for distributed RLHF fine-tuning. Finally, we present optimization algorithms that search this hierarchical plan space for the parallelization plan minimizing overall time consumption. Implementation and evaluation across multiple LLMs demonstrate that Pipe-RLHF achieves a 3.7 times speedup over existing methods while exhibiting near-linear scalability.
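    The benefit of inter-batch pipelining described in the abstract can be sketched with a toy scheduling model (a hypothetical illustration, not the paper's implementation): with P pipeline stages and M micro-batches, a pipelined schedule finishes in P + M − 1 time steps rather than the P × M steps of strictly sequential execution, so stage utilization approaches 1 as M grows.

```python
# Toy model of inter-batch pipeline scheduling (illustrative names,
# not from the Pipe-RLHF paper). Micro-batch mb enters pipeline stage
# `stage` at time step mb + stage, so all stages work concurrently
# on different micro-batches once the pipeline is full.

def pipeline_schedule(num_stages: int, num_microbatches: int):
    """Return {time_step: [(stage, microbatch), ...]} for a forward pipeline."""
    schedule = {}
    for mb in range(num_microbatches):
        for stage in range(num_stages):
            t = mb + stage
            schedule.setdefault(t, []).append((stage, mb))
    return schedule

def utilization(num_stages: int, num_microbatches: int) -> float:
    """Fraction of stage-steps doing useful work under the pipelined schedule."""
    makespan = num_stages + num_microbatches - 1
    return (num_stages * num_microbatches) / (num_stages * makespan)

if __name__ == "__main__":
    sched = pipeline_schedule(4, 8)
    print(f"makespan: {max(sched) + 1} steps")      # 11 steps vs 32 sequential
    print(f"utilization: {utilization(4, 8):.2f}")  # 0.73
```

    With more micro-batches the pipeline "bubble" at startup is amortized: utilization(4, 1000) exceeds 0.99, which is the intuition behind keeping the generation step's stages saturated.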
