Abstract:
Reinforcement learning with human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences. The most costly part of RLHF is proximal policy optimization (PPO), which consists of three dependent steps. Because these steps exhibit different computation modes, simply applying a single parallelization strategy to all steps that involve multiple model variants, as existing frameworks do, leads to poor performance in the PPO generation step due to insufficient utilization of computational resources. We therefore introduce Pipe-RLHF, a parallelism framework for RLHF fine-tuning that adaptively employs distinct parallelization strategies for different steps according to their computation modes. Specifically, we first investigate the characteristics of the various computation modes to identify the best-fit parallelization approach for each. We then present a novel delayed inter-batch pipeline parallelization approach designed specifically for the PPO generation step, enabling sufficient utilization of computational resources. Building on this approach, we define a hierarchical parallel plan space for distributed RLHF fine-tuning. Finally, we present optimization algorithms that search this space for the parallelization plan that minimizes the overall time consumption. Implementation and evaluation across multiple LLMs demonstrate that Pipe-RLHF achieves a 3.7x speedup over existing methods while achieving near-linear scalability.