Nearly 5× faster! Peking University and ByteDance propose BranchGRPO, reshaping diffusion model alignment with "tree branching + pruning"
机器之心 · 2025-09-22 07:26
Core Insights
- The article introduces BranchGRPO, a tree-structured reinforcement learning method from Peking University and ByteDance that targets two challenges in human-preference alignment for diffusion and flow matching models: efficient sampling and stable optimization [2][9]

Group 1: Research Background and Challenges
- Diffusion and flow matching models have become mainstream in visual generation thanks to their fidelity, diversity, and controllability, yet their outputs often drift from human intent in aesthetics, semantics, or temporal consistency [5]
- Reinforcement learning from human feedback (RLHF) has been introduced to optimize generative models directly, so that outputs better match human preferences [6]
- The existing Group Relative Policy Optimization (GRPO) method is stable and scalable for image and video generation but hits two fundamental bottlenecks: inefficient sequential rollouts, and sparse terminal rewards that ignore the signal carried by intermediate denoising states [8]

Group 2: BranchGRPO Methodology
- BranchGRPO restructures sampling from a single chain into a tree, letting rollouts share prefixes so that exploration stays broad while redundant computation shrinks [11][14]
- Branching, reward fusion, and pruning mechanisms work together to improve both speed and stability, yielding large gains in training efficiency and reward attribution (a toy sketch of this tree rollout follows this summary) [13][14]
- In image alignment tests, BranchGRPO ran up to 4.7× faster than DanceGRPO, with per-iteration time falling from 698 seconds to as low as 148 seconds [15]

Group 3: Performance Metrics
- On image alignment (HPDv2.1), BranchGRPO scored 0.369 against DanceGRPO's 0.360 while also posting the highest image reward, 1.319 [15][17]
- On video generation (WanX-1.3B), BranchGRPO produced clearer, more stable frames than earlier models, with iteration time cut from roughly 20 minutes to about 8 minutes, effectively doubling training efficiency [18][19]

Group 4: Experimental Findings
- Ablation studies indicate that moderate branching correlation and dense early splits accelerate reward improvement, while path-weighted reward fusion stabilizes training [23]
- Sample diversity is preserved at MMD² ≈ 0.019, nearly identical to sequential sampling [24]
- BranchGRPO scales to larger branch widths without performance degradation, keeping iteration times low even at larger sample sizes [27]

Group 5: Conclusion and Future Outlook
- BranchGRPO combines efficiency with stability, turning the reward signal from a single endpoint score into continuous feedback along the trajectory, and improves speed, stability, and alignment quality across the board [30]
- Future work may add adaptive splitting and pruning strategies, potentially establishing BranchGRPO as a core RLHF method for diffusion and flow models [30]
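The branching-and-pruning loop is easy to picture in code. Below is a minimal Python sketch of a tree-structured rollout with mean-based reward fusion and top-fraction pruning; the names (`tree_rollout`, `denoise_step`, `reward_fn`), the fusion rule, and every default value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def tree_rollout(x_init, denoise_step, reward_fn,
                 num_steps=8, split_steps=(0, 2), width=2,
                 keep_frac=0.5, seed=0):
    """Toy BranchGRPO-style rollout: fork shared trajectories at a few
    early denoising steps, score the leaves, fuse leaf rewards back onto
    shared prefixes, and keep only the strongest leaves."""
    rng = np.random.default_rng(seed)
    branches = [(x_init, ())]  # (latent, tuple of fork indices taken)
    for t in range(num_steps):
        forks = width if t in split_steps else 1  # branch only at split steps
        branches = [
            (denoise_step(x, t, rng.normal(size=np.shape(x))), path + (k,))
            for (x, path) in branches
            for k in range(forks)
        ]
    rewards = np.array([reward_fn(x) for x, _ in branches])
    # reward fusion: each shared prefix receives the mean reward of its
    # descendant leaves -- a simple stand-in for path-weighted fusion
    fused = {}
    for (_, path), r in zip(branches, rewards):
        for depth in range(1, len(path) + 1):
            fused.setdefault(path[:depth], []).append(float(r))
    prefix_reward = {p: float(np.mean(rs)) for p, rs in fused.items()}
    # pruning: keep the top fraction of leaves for the policy update
    keep = max(1, int(len(branches) * keep_frac))
    top = np.argsort(-rewards)[:keep]
    return [branches[i] for i in top], prefix_reward

# toy usage with a stand-in denoiser (noisy drift) and reward
leaves, fused = tree_rollout(
    np.zeros(4),
    denoise_step=lambda x, t, eps: x + 0.1 * eps,
    reward_fn=lambda x: -float(np.sum(x ** 2)),
)
```

Because forked paths share every step up to their split point, the tree amortizes the cost of early denoising across many leaves, which is where the reported reduction in sampling redundancy comes from.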
A first! GRPO comes to flow matching models: near-perfect GenEval scores, with compositional image generation far beyond GPT-4o
机器之心 · 2025-05-13 07:08
Core Viewpoint
- The article presents Flow-GRPO, the first algorithm to integrate online reinforcement learning into flow matching models, substantially improving their performance on image and video generation tasks [2][22]

Group 1: Introduction and Background
- Flow matching models rest on a solid theoretical foundation and excel at generating high-quality images and videos, but they struggle with complex scenes involving multiple objects and relationships [1]
- Online reinforcement learning has made major strides in language models but remains in its early stages for image generation [1]

Group 2: Flow-GRPO Overview
- Flow-GRPO couples online reinforcement learning with flow matching models, lifting SD3.5 Medium's accuracy on the GenEval benchmark from 63% to 95% [2][14]
- This success opens new avenues for improving the controllability, composability, and reasoning capabilities of flow matching generation models [2][22]

Group 3: Key Strategies of Flow-GRPO
- Flow-GRPO rests on two key strategies (sketched in code after this summary):
  1. An ODE-to-SDE equivalence transformation that converts the deterministic sampler into a stochastic one with matching marginals, giving reinforcement learning the exploration noise it needs without changing what the model generates [6][8]
  2. Denoising reduction, which accelerates data collection by using fewer denoising steps during training rollouts while retaining the full schedule at inference for high-quality outputs [12][22]

Group 4: Experimental Results
- Flow-GRPO performs strongly across text-to-image generation tasks, markedly improving complex compositional generation and reaching near-perfect results in object counting, spatial-relation understanding, and attribute binding [14][19]
- Visual text rendering accuracy rose from 59% to 92%, showing the model can render text accurately inside images [19][21]
- Flow-GRPO also advances human preference alignment, curbing reward hacking while preserving image quality and diversity [21][22]

Group 5: Conclusion and Future Outlook
- Flow-GRPO demonstrates a viable path for continuously improving flow matching generation models through online reinforcement learning [22]
- Its success suggests strong potential for controllability, composability, and reasoning across multi-modal generation tasks, including images, videos, and 3D content [22]
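To make the two strategies concrete, here is a minimal Python sketch of a stochastic sampler in the spirit of the ODE-to-SDE idea, plus the training-versus-inference step split of denoising reduction. This is a toy under stated assumptions: the drift is an Euler step on a user-supplied `velocity` field, the score-correction term that makes the SDE marginals exactly match the ODE is omitted, and `sigma` and the step counts are placeholders.

```python
import numpy as np

def sde_rollout(x, velocity, num_steps=10, sigma=0.3, seed=0):
    """Euler-Maruyama rollout built on a flow matching velocity field.
    Injecting Gaussian noise turns the deterministic sampler into a
    stochastic policy, and the per-step log-probabilities feed a
    GRPO-style likelihood ratio."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / num_steps
    log_prob = 0.0
    for i in range(num_steps):
        t = i * dt
        mean = x + velocity(x, t) * dt        # deterministic ODE drift
        eps = rng.normal(size=np.shape(x))
        x = mean + sigma * np.sqrt(dt) * eps  # exploration noise
        # log N(x; mean, sigma^2 * dt), dropping the constant term
        log_prob += -0.5 * float(np.sum(eps ** 2))
    return x, log_prob

# denoising reduction: short rollouts while collecting training data,
# full schedule at inference (both counts are arbitrary placeholders)
TRAIN_STEPS, EVAL_STEPS = 10, 40
x_train, lp = sde_rollout(np.zeros(4), lambda x, t: -x, TRAIN_STEPS)
x_eval, _ = sde_rollout(np.zeros(4), lambda x, t: -x, EVAL_STEPS)
```

The accumulated log-probabilities are what an online RL update consumes: rollouts are scored by a reward model, and the policy ratio between the updated and old samplers is formed from these per-step Gaussian densities.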