Core Insights
- The article introduces DanceGRPO, a framework that unifies reinforcement learning for visual generation across multiple tasks and models [2][8].

Group 1: Motivation and Background
- The rapid development of generative AI has brought RLHF (Reinforcement Learning from Human Feedback) into focus, particularly for LLMs (Large Language Models) [4].
- Mainstream RLHF solutions for visual generation remain less mature than those for LLMs and fall into two main categories: Diffusion/Flow-DPO and ReFL [4][5].

Group 2: Goals and Features
- DanceGRPO aims to deliver significant performance gains, keep memory pressure manageable during video generation, scale to large prompt datasets, and remain applicable to rectified-flow and video generation models [7].

Group 3: Framework Design and Implementation
- DanceGRPO is the first unified framework for visual generation with reinforcement learning, applicable to both diffusion models and rectified flows, and to text-to-image, text-to-video, and image-to-video tasks [8].
- The framework follows the GRPO strategy: for each prompt it generates a group of samples and optimizes them with the GRPO objective, omitting KL-divergence regularization [9] (a sketch of the objective follows this summary).

Group 4: Reward Models
- Five types of reward models are used: image aesthetics, video aesthetics, text-image alignment, video motion quality, and a new binary reward model that combines aesthetics and alignment [10] (see the reward-aggregation sketch below).

Group 5: Experimental Results
- Experiments show significant improvements across models, with notable gains on metrics such as HPS-v2.1 and CLIP Score for Stable Diffusion and FLUX [12].
- On HunyuanVideo, the proposed method yields a 45% improvement in visual quality (VQ) and a 181% increase in motion quality (MQ) [13].
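For reference, the sketch below writes out the standard GRPO objective that the Group 3 bullet alludes to: rewards are normalized within each prompt's group of samples to form advantages, and a clipped surrogate is maximized with no KL-divergence penalty (as the article notes). The notation ($G$ samples $o_i$ per prompt $p$, ratio $\rho_i$, clip range $\epsilon$) is assumed for illustration; the exact per-denoising-step form used by DanceGRPO may differ.

```latex
% Group-relative advantage: normalize each sample's reward within its prompt group of G samples.
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}\bigl(\{r_1,\dots,r_G\}\bigr)}{\operatorname{std}\bigl(\{r_1,\dots,r_G\}\bigr)}

% Clipped surrogate objective; note the absence of a KL-divergence regularization term,
% matching the article's description of DanceGRPO.
\mathcal{J}(\theta) \;=\; \mathbb{E}_{p,\;\{o_i\}\sim\pi_{\theta_{\mathrm{old}}}}
\left[\frac{1}{G}\sum_{i=1}^{G}
\min\!\Bigl(\rho_i\,\hat{A}_i,\;\operatorname{clip}\bigl(\rho_i,\,1-\epsilon,\,1+\epsilon\bigr)\,\hat{A}_i\Bigr)\right],
\qquad
\rho_i \;=\; \frac{\pi_\theta(o_i \mid p)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid p)}
```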
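The reward-aggregation sketch below shows one plausible way several reward signals (e.g., aesthetics, text-image alignment, motion quality) could be combined per sample and then turned into group-relative advantages. The function names, weights, and weighted-sum aggregation are illustrative assumptions, not the actual DanceGRPO reward APIs or its exact aggregation rule.

```python
import numpy as np


def aggregate_rewards(aesthetic, alignment, motion, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-sample reward signals (illustrative aggregation only)."""
    w_a, w_t, w_m = weights
    return w_a * aesthetic + w_t * alignment + w_m * motion


def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: normalize rewards within each prompt's group.

    rewards: shape (num_prompts, group_size); one row per prompt,
             one column per sample drawn for that prompt.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 4 prompts, 8 samples each; three synthetic reward signals per sample.
    aesthetic = rng.normal(0.5, 0.1, size=(4, 8))
    alignment = rng.normal(0.3, 0.1, size=(4, 8))
    motion = rng.normal(0.7, 0.1, size=(4, 8))

    rewards = aggregate_rewards(aesthetic, alignment, motion)
    advantages = group_relative_advantages(rewards)
    print(advantages.shape)          # (4, 8)
    print(advantages.mean(axis=1))   # ~0 per prompt group
```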
DanceGRPO: The First Unified Reinforcement Learning Framework for Visual Generation
机器之心 · 2025-05-14 08:09