Nearly 5x Faster: Peking University and ByteDance Propose BranchGRPO, Reshaping Diffusion Model Alignment with Tree Branching and Pruning
机器之心· 2025-09-22 07:26
Core Insights
- The article introduces BranchGRPO, a tree-structured reinforcement learning method developed by Peking University and ByteDance that addresses two challenges in human-preference alignment for diffusion and flow matching models: efficient sampling and stable optimization [2][9].

Group 1: Research Background and Challenges
- Diffusion and flow matching models have become mainstream in visual generation thanks to their fidelity, diversity, and controllability, but their outputs often fail to align with human intent, drifting from aesthetic, semantic, or temporal-consistency expectations [5].
- Reinforcement learning from human feedback (RLHF) has been introduced to optimize generative models directly so that outputs better match human preferences [6].
- The existing Group Relative Policy Optimization (GRPO) approach is stable and scalable for image and video generation, but it faces two fundamental bottlenecks: inefficiency caused by strictly sequential rollouts, and sparse terminal rewards that ignore the signals carried by intermediate denoising states [8].

Group 2: BranchGRPO Methodology
- BranchGRPO restructures the sampling process from a single path into a tree: rollouts share a common prefix and fork at intermediate denoising steps, enabling broader exploration while reducing redundant sampling [11][14].
- The method combines branching, reward fusion, and pruning to improve both speed and stability, yielding higher training efficiency and finer-grained reward attribution (a hedged code sketch of this rollout scheme appears at the end of this summary) [13][14].
- In image alignment tests, BranchGRPO ran up to 4.7x faster than DanceGRPO, with per-iteration time dropping from 698 seconds to as low as 148 seconds [15].

Group 3: Performance Metrics
- On image alignment (HPDv2.1), BranchGRPO scored 0.369 versus DanceGRPO's 0.360, and it also achieved the highest image-reward score of 1.319 [15][17].
- For video generation (WanX-1.3B), BranchGRPO produced clearer and more stable frames than previous models, with per-iteration time cut from roughly 20 minutes to about 8 minutes, effectively doubling training efficiency [18][19].

Group 4: Experimental Findings
- Ablation studies indicate that moderate correlation between branches and denser splits early in the denoising trajectory accelerate reward improvement, while path-weighted reward fusion stabilizes training [23].
- Sample diversity remains intact, with MMD² ≈ 0.019, nearly identical to sequential sampling [24].
- BranchGRPO scales to larger branch sizes without performance degradation, keeping iteration times low even at larger sample counts [27].

Group 5: Conclusion and Future Outlook
- BranchGRPO combines efficiency and stability by turning the reward signal from a single endpoint score into continuous feedback along the sampling trajectory, improving speed, stability, and alignment quality across the board [30].
- Future work may add adaptive splitting and pruning strategies, potentially establishing BranchGRPO as a core RLHF method for diffusion and flow models and further strengthening human-preference alignment [30].
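To make the branching, reward-fusion, and pruning ideas from Groups 2 and 4 concrete, here is a minimal Python sketch of the general scheme as described above: rollouts share an early prefix and fork at intermediate denoising steps, leaf rewards are fused back up the tree with depth-decayed path weights, and low-reward branches are pruned. The names (`branch_rollout`, `fuse_rewards`, `prune`), the decay and keep-ratio rules, and the simulated denoising step and reward are all illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a tree-structured rollout with reward fusion and pruning.
# The denoising step and reward below are simulated stand-ins, not a real diffusion model.
import math
import random
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    state: float                      # stand-in for a latent at some denoising step
    depth: int
    children: List["Node"] = field(default_factory=list)
    reward: Optional[float] = None    # terminal reward at a leaf (simulated here)
    fused: Optional[float] = None     # path-weighted reward propagated from leaves

def denoise_step(state: float, depth: int) -> float:
    """Stand-in for one denoising step: shared trajectory plus a small perturbation."""
    return state + random.gauss(0.0, 0.1) / (depth + 1)

def branch_rollout(state: float, depth: int, max_depth: int, branch_factor: int) -> Node:
    """Expand a rollout tree: early steps fork into several branches, later steps do not."""
    node = Node(state=state, depth=depth)
    if depth == max_depth:
        node.reward = -abs(node.state - 1.0)            # simulated terminal reward
        return node
    k = branch_factor if depth < max_depth // 2 else 1  # denser splits early, as the ablations suggest
    for _ in range(k):
        child = branch_rollout(denoise_step(state, depth), depth + 1, max_depth, branch_factor)
        node.children.append(child)
    return node

def fuse_rewards(node: Node, decay: float = 0.9) -> float:
    """Propagate leaf rewards up the tree with depth-decayed path weights,
    giving intermediate states dense credit instead of one sparse endpoint signal."""
    if node.reward is not None:
        node.fused = node.reward
    else:
        child_vals = [fuse_rewards(c, decay) for c in node.children]
        node.fused = decay * sum(child_vals) / len(child_vals)
    return node.fused

def prune(node: Node, keep_ratio: float = 0.5) -> None:
    """Keep only the highest-fused-reward children at each split to cap rollout cost."""
    if not node.children:
        return
    node.children.sort(key=lambda c: c.fused, reverse=True)
    node.children = node.children[: max(1, math.ceil(keep_ratio * len(node.children)))]
    for c in node.children:
        prune(c, keep_ratio)

if __name__ == "__main__":
    tree = branch_rollout(state=0.0, depth=0, max_depth=6, branch_factor=2)
    fuse_rewards(tree)
    prune(tree)
    print(f"fused reward at root: {tree.fused:.3f}")
```

In this toy setup the fused values at intermediate nodes play the role of the dense, per-step reward attribution that the article credits for BranchGRPO's stability, while pruning keeps the number of surviving branches (and thus rollout cost) bounded as the tree deepens.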