ICML 2025 Spotlight | Jun Zhu's Group at Tsinghua & NVIDIA Propose DDO: A New Training Paradigm for Diffusion/Autoregressive Models That Sets a New Image Generation SOTA
机器之心· 2025-07-01 09:34
Core Viewpoint
- The article presents Direct Discriminative Optimization (DDO), a novel optimization paradigm for visual generative models that improves likelihood-based generative models by treating them as implicit discriminators, overcoming limitations of traditional maximum likelihood estimation (MLE) [1][8].

Background on Likelihood-Based Generative Models
- Diffusion models and autoregressive models have become dominant in image generation, prized for their training stability, sample diversity, and scalability [4].
- These models explicitly estimate the log-likelihood of data, but MLE training suffers from the "mode covering" problem, which can lead to blurry or distorted outputs [6].

DDO Methodology
- DDO introduces a training objective that incorporates reverse KL divergence, concentrating density around real data and improving generation fidelity without adding extra networks [7][11].
- The method uses the target model together with a frozen reference model to construct an implicit discriminator, so the objective applies directly to both diffusion and autoregressive models [11].

Performance Improvements
- DDO significantly improves the generation quality of existing models, achieving state-of-the-art results across standard image generation benchmarks [12][13].
- Without guidance, FID improves from 1.58 to 0.97 on ImageNet 64×64 and from 1.85 to 1.30 on CIFAR-10 [18].

Compatibility and Efficiency
- DDO requires no changes to the network architecture, adds no inference cost, and is compatible with existing guidance methods such as Classifier-Free Guidance (CFG) [21].
- Performance can be further improved through self-play, yielding continuous gains in FID [19].

Future Prospects
- The principles behind DDO may extend beyond visual generation to language-model alignment, suggesting a unified alignment paradigm for multimodal generation tasks [22][23].
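The mode-covering behavior mentioned above comes from the asymmetry of KL divergence: MLE minimizes forward KL(data || model), which heavily penalizes a model for missing any data mode, while reverse KL(model || data) rewards concentrating density on real modes. The following toy sketch (distributions and numbers invented for illustration, not from the article) makes the asymmetry concrete:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions, in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.where(p > 0, p * np.log(p / q), 0.0)))

# Bimodal "data" distribution over 4 bins.
data = np.array([0.48, 0.02, 0.02, 0.48])
# A mode-covering model: spreads mass everywhere to avoid missing a mode.
covering = np.array([0.25, 0.25, 0.25, 0.25])
# A mode-seeking model: concentrates on one real mode, nearly drops the other.
seeking = np.array([0.90, 0.04, 0.04, 0.02])

# Forward KL (the MLE objective) strongly punishes the mode-seeking model
# for dropping a mode, so MLE prefers the blurry, covering solution...
print(kl(data, covering), kl(data, seeking))
# ...while reverse KL prefers the model concentrated on real data.
print(kl(covering, data), kl(seeking, data))
```

This is why injecting a reverse-KL-like term, as DDO does, pushes density toward real data instead of spreading it between modes.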
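The implicit-discriminator construction can be sketched as follows. This is a simplified guess at the shape of the objective, not the paper's exact formulation: it assumes the discriminator logit is the (scaled) log-density ratio between the target and frozen reference models, trained with a GAN-style logistic loss on real data versus reference samples; the function name `ddo_loss` and the `beta` scale are hypothetical, and real diffusion/AR models expose log-likelihoods (or bounds on them) in model-specific ways.

```python
import numpy as np

def logsigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -np.logaddexp(0.0, -np.asarray(z, float))

def ddo_loss(log_p_model_real, log_p_ref_real,
             log_p_model_fake, log_p_ref_fake, beta=1.0):
    """GAN-style logistic loss on the implicit discriminator
    d(x) = beta * (log p_model(x) - log p_ref(x))."""
    d_real = beta * (log_p_model_real - log_p_ref_real)
    d_fake = beta * (log_p_model_fake - log_p_ref_fake)
    # Push the density ratio up on real data, down on reference samples.
    return float(-(logsigmoid(d_real).mean() + logsigmoid(-d_fake).mean()))

# Toy per-sample log-likelihoods: the target model assigns more mass than
# the reference to real data and less to the reference's own samples.
loss = ddo_loss(np.array([2.0]), np.array([1.0]),
                np.array([0.5]), np.array([1.5]))
```

Because the "discriminator" is just the two likelihood models themselves, no auxiliary network is needed, matching the point above about DDO adding no extra networks or inference cost.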
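The self-play refinement mentioned above can be read as an iterated schedule: after each DDO round, the newly finetuned model is frozen and becomes the next round's reference, so the implicit discriminator is always formed against the strongest available opponent. A minimal sketch, assuming a generic `ddo_finetune(model, reference)` step (hypothetical name, standing in for one full round of DDO training):

```python
import copy

def self_play(model, num_rounds, ddo_finetune):
    """Run DDO for num_rounds, re-freezing the latest model as reference."""
    for _ in range(num_rounds):
        reference = copy.deepcopy(model)   # frozen opponent for this round
        model = ddo_finetune(model, reference)
    return model
```

Each round starts from a stronger reference, which is consistent with the article's report of FID continuing to improve over successive self-play rounds.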