    Xu Ying, Wang Mengdi, Cheng Long, Liu Lian, Zhao Shixin, Zhang Lei, Wang Ying. Pipe-RLHF: A Computation Mode-Aware Parallel Framework for RLHF[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550127

    Pipe-RLHF: A Computation Mode-Aware Parallel Framework for RLHF

    • Reinforcement learning with human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences. The most costly part of RLHF is proximal policy optimization (PPO), which consists of three dependent steps. These steps exhibit different computation modes, so simply applying the same parallelization strategy to every step involving multiple model variants, as existing frameworks do, leads to poor performance in the PPO generation step due to insufficient utilization of computational resources. We therefore introduce Pipe-RLHF, a parallelism framework for RLHF fine-tuning that adaptively employs distinct parallelization strategies for different steps according to their computation modes. Specifically, we first investigate the characteristics of the various computation modes to identify the best-fit parallelization approach for each. We then present a novel delayed inter-batch pipeline parallelization approach designed specifically for the PPO generation step, enabling full utilization of computational resources. Building on this approach, we define a hierarchical parallel plan space for distributed RLHF fine-tuning. Finally, we present optimization algorithms that search this hierarchical plan space for the parallelization plan minimizing overall time consumption. Implementation and evaluation across multiple LLMs demonstrate that Pipe-RLHF achieves a 3.7 times speedup over existing methods while exhibiting near-linear scalability.
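    The benefit of inter-batch pipelining described in the abstract can be sketched with a toy scheduling model (a hypothetical illustration, not the paper's implementation): with P pipeline stages and M micro-batches, a pipelined schedule finishes in P + M − 1 time steps rather than the P × M steps of strictly sequential execution, so stage utilization approaches 1 as M grows.

```python
# Toy model of inter-batch pipeline scheduling (illustrative names,
# not from the Pipe-RLHF paper). Micro-batch mb enters pipeline stage
# `stage` at time step mb + stage, so all stages work concurrently
# on different micro-batches once the pipeline is full.

def pipeline_schedule(num_stages: int, num_microbatches: int):
    """Return {time_step: [(stage, microbatch), ...]} for a forward pipeline."""
    schedule = {}
    for mb in range(num_microbatches):
        for stage in range(num_stages):
            t = mb + stage
            schedule.setdefault(t, []).append((stage, mb))
    return schedule

def utilization(num_stages: int, num_microbatches: int) -> float:
    """Fraction of stage-steps doing useful work under the pipelined schedule."""
    makespan = num_stages + num_microbatches - 1
    return (num_stages * num_microbatches) / (num_stages * makespan)

if __name__ == "__main__":
    sched = pipeline_schedule(4, 8)
    print(f"makespan: {max(sched) + 1} steps")      # 11 steps vs 32 sequential
    print(f"utilization: {utilization(4, 8):.2f}")  # 0.73
```

    With more micro-batches the pipeline "bubble" at startup is amortized: utilization(4, 1000) exceeds 0.99, which is the intuition behind keeping the generation step's stages saturated.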
