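ICLR 2026 Oral | Peking University's Peng Yijie Team Proposes a New Paradigm for Efficient Optimization: Recursive Likelihood Ratio Gradient Optimizer Empowers Diffusion Model Post-Training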

Core Viewpoint
- Professor Peng Yijie's team at Peking University has introduced the Recursive Likelihood Ratio (RLR) optimizer, a semi-gradient fine-tuning method for diffusion models that addresses both the efficiency and the performance challenges of downstream applications [2][10].

Group 1: Background and Challenges
- Diffusion models (DM) have become a core framework for image synthesis and video generation because they generate high-fidelity data [2].
- The main challenge in the industry is adapting pre-trained diffusion models efficiently to specific application requirements [2].
- Current mainstream fine-tuning methods fall into two categories, reinforcement learning (RL) methods and truncated backpropagation (BP) methods, and both have significant drawbacks [7].
- Truncated BP methods introduce structural bias into the gradient estimate, which can cause model collapse and content degradation [7].
- RL methods reduce memory requirements but suffer from high-variance gradient estimates and slow convergence [7].

Group 2: RLR Optimizer Design
- The RLR optimizer introduces a semi-gradient estimation paradigm that exploits the inherent noise of diffusion models to obtain unbiased, low-variance gradient estimates [10].
- The core design of the RLR optimizer includes three main modules, of which two are detailed here:
  1. A first-order estimation module that backpropagates directly through the reward model at the first time step [11].
  2. A zero-order estimation module that applies parameter perturbation to the remaining time steps, giving unbiased gradient estimates without caching intermediate latent variables [12].
- The optimizer's controllable parameter, the local sub-chain length (h), directly sets the trade-off between memory usage and gradient variance [14].
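
The split between first-order and zero-order estimation described in Group 2 can be illustrated with a toy sketch. This is not the RLR estimator itself. It assumes a hypothetical scalar "denoising" chain x_{t-1} = theta[t] * x_t + noise[t], a hypothetical quadratic reward, and antithetic Gaussian parameter perturbation as the zero-order method. It also assumes that the step adjacent to the reward is the one that receives the exact gradient. All function names are illustrative. Note that this perturbation estimator is unbiased only for a Gaussian-smoothed version of the objective, so it approximates the true gradient up to a small O(sigma^2) term.

```python
import random

def reward(x0):
    # Hypothetical differentiable reward: largest when the final sample equals 1.0
    return -(x0 - 1.0) ** 2

def reward_grad(x0):
    # Analytic derivative of the toy reward with respect to x0
    return -2.0 * (x0 - 1.0)

def run_chain(theta, x_T, noise):
    # Toy chain: apply x <- theta[t] * x + noise[t] once per step.
    # Returns the final sample and the input to the final step.
    x, x_prev = x_T, x_T
    for th, n in zip(theta, noise):
        x_prev = x
        x = th * x + n
    return x, x_prev

def mixed_gradient(theta, x_T, noise, sigma=0.01, n_pairs=4000, seed=0):
    rng = random.Random(seed)
    k = len(theta) - 1

    # First-order part: exact gradient for the step adjacent to the reward.
    x0, x_prev = run_chain(theta, x_T, noise)
    last = reward_grad(x0) * x_prev

    # Zero-order part: antithetic perturbation of the remaining parameters.
    # Only reward values are needed, so no intermediate states are stored.
    zo = [0.0] * k
    for _ in range(n_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in range(k)]
        plus = [t + sigma * e for t, e in zip(theta[:k], eps)] + [theta[-1]]
        minus = [t - sigma * e for t, e in zip(theta[:k], eps)] + [theta[-1]]
        diff = reward(run_chain(plus, x_T, noise)[0]) - reward(run_chain(minus, x_T, noise)[0])
        scale = diff / (2.0 * sigma)
        for i in range(k):
            zo[i] += scale * eps[i]
    return [z / n_pairs for z in zo] + [last]

if __name__ == "__main__":
    theta = [0.9, 0.8, 0.7, 0.6]
    grad = mixed_gradient(theta, x_T=2.0, noise=[0.05, -0.1, 0.02, 0.0])
    print(grad)
```

In this toy the last entry of the returned gradient is exact, while the others carry Monte Carlo noise that shrinks as `n_pairs` grows. That is the memory-versus-variance tension that the article attributes to the choice of sub-chain length h.
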

Group 3: Performance Validation
- The RLR optimizer was validated in large-scale experiments on Text2Image and Text2Video tasks, where it outperformed existing RL and truncated BP methods [18].
- In the Text2Image task, RLR raised the ImageReward score of Stable Diffusion 1.4 from 32.90 to 76.55, outperforming DDPO by approximately 47% and AlignProp by about 14% [18].
- In the Text2Video task, RLR achieved a weighted average score of 84.63, surpassing models such as VideoCrafter and Gen-2 and doing especially well on the dynamic degree metric [18][20].
- The RLR optimizer is also paired with a diffusion chain-of-thought prompting technique, which improves fine-grained tasks such as hand generation by targeting generation defects at specific scales [22].
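
The Text2Image figures in Group 3 can be checked with simple arithmetic. The sketch below computes the relative gain over the Stable Diffusion 1.4 baseline from the two reported scores. It also shows what the DDPO and AlignProp scores would be if the quoted percentages are relative improvements over each baseline's own score. That reading is an assumption, and the implied values are not figures reported in the article. The helper names are illustrative.

```python
def relative_gain(before, after):
    # Fractional improvement of `after` over `before`
    return (after - before) / before

def implied_baseline(score, gain):
    # Baseline score implied by `score` beating it by the fraction `gain`.
    # Assumes the gain is relative to the baseline, which the article does not state.
    return score / (1.0 + gain)

if __name__ == "__main__":
    rlr = 76.55
    print(f"Gain over SD 1.4: {relative_gain(32.90, rlr):.1%}")
    print(f"Implied DDPO score (assumption): {implied_baseline(rlr, 0.47):.2f}")
    print(f"Implied AlignProp score (assumption): {implied_baseline(rlr, 0.14):.2f}")
```
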