Abstract:
Reinforcement learning with human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences. The most costly part of RLHF is proximal policy optimization (PPO), which consists of three dependent steps. Because these steps exhibit different computation modes, simply applying a single parallelization strategy to all steps that involve multiple model variants, as existing frameworks do, leads to poor performance in the PPO generation step due to insufficient utilization of computational resources. We therefore introduce Pipe-RLHF, a parallelism framework for RLHF fine-tuning that adaptively employs distinct parallelization strategies for different steps according to their computation modes. Specifically, we first investigate the characteristics of the various computation modes to identify the best-fit parallelization approach for each. We then present a novel delayed inter-batch pipeline parallelization approach designed specifically for the PPO generation step, enabling sufficient utilization of computational resources. Building on this approach, we define a hierarchical parallel plan space for distributed RLHF fine-tuning. Finally, we present optimization algorithms that search this space for the parallelization plan that minimizes the overall time consumption. Implementation and evaluation across multiple LLMs demonstrate that Pipe-RLHF achieves a 3.7x speedup over existing methods while achieving near-linear scalability.