Core Viewpoint
- The article introduces Self-E, a novel text-to-image generation framework that eliminates the need for a pre-trained teacher model and supports generation with any number of steps while maintaining high quality and semantic clarity [2][28].

Group 1: Introduction and Background
- Diffusion models and flow matching have advanced text-to-image generation, but their many sampling iterations limit real-time applications [2].
- Existing acceleration methods often rely on knowledge distillation, which adds training cost and leaves a gap between training "from scratch" and "few-step, high-quality" generation [2][28].

Group 2: Self-E Framework
- Self-E represents a paradigm shift from "trajectory matching" to "landing evaluation": the model learns to judge the quality of the final output rather than the correctness of each intermediate step [7][28].
- The model operates in two modes, learning from real data and self-evaluating its own generated samples, which together form a self-feedback loop [12][13].

Group 3: Training Mechanism
- Self-E combines two complementary training signals, one from data and one from self-evaluation, so the model learns local structure and assesses its own outputs simultaneously [14][19].
- Training involves a long-distance jump to a landing point; the model then uses its current local estimates to generate feedback on how to improve that landing [17][19].

Group 4: Inference and Performance
- At inference, Self-E preserves semantic and structural quality with very few steps, and quality continues to improve as the step count increases [22][23].
- On the GenEval benchmark, Self-E outperforms competing methods at every step count, with the largest margin in the few-step range, including a +0.12 improvement over the best existing methods in the 2-step setting [24][25].
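The two-signal training idea in Group 3 can be illustrated with a toy sketch. This is not the actual Self-E objective (the article gives only a conceptual description); it is a minimal 1-D stand-in, assuming a straight-line interpolant, an affine velocity model fitted by least squares for the data signal, and a "long jump vs. the model's own finer re-estimate" gap as the self-evaluation signal.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20_000

# Toy 1-D "data": target distribution centred at 2.0, noise prior N(0, 1).
x1 = 2.0 + 0.1 * rng.standard_normal(N)   # real samples (stand-in for images)
x0 = rng.standard_normal(N)               # noise samples
t = rng.uniform(0.0, 1.0, N)

# --- Signal 1: learn local structure from data (flow-matching-style regression).
# Hypothetical model: affine velocity v(x, t) = w0 + w1*x + w2*t, least squares.
x_t = (1.0 - t) * x0 + t * x1             # straight-line interpolant
y = x1 - x0                               # target velocity along that line
A = np.stack([np.ones(N), x_t, t], axis=1)
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def v(x, tt):
    # Learned affine velocity field.
    return w[0] + w[1] * x + w[2] * tt

# --- Signal 2: self-evaluation via a long-distance jump to a landing point.
# Compare the landing of one full-length jump against the model's own
# two-half-step estimate; the gap is the self-feedback a Self-E-style
# objective would push toward zero.
land_1 = x0 + v(x0, 0.0) * 1.0            # one long jump
x_mid = x0 + v(x0, 0.0) * 0.5             # half step...
land_2 = x_mid + v(x_mid, 0.5) * 0.5      # ...then re-estimate and finish
self_gap = float(np.mean(np.abs(land_1 - land_2)))

# One-step generation: jump straight from noise to a landing point.
samples = x0 + v(x0, 0.0)
print(f"one-step sample mean: {samples.mean():.2f} (target 2.0)")
print(f"self-evaluation gap:  {self_gap:.3f}")
```

The data term teaches the model where samples should land; the self-evaluation term lets it critique its own coarse jumps using its finer local estimates, which is the closed-loop flavour the article describes.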
Group 5: Broader Implications
- Self-E aligns pre-training with feedback learning, forming a closed-loop system similar to reinforcement learning and strengthening the model's ability to generate high-quality outputs in fewer steps [26][29].
- The framework allows the step count to be chosen dynamically per application, serving both real-time interactive use and high-quality offline rendering [28].
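The "dynamic step selection" point amounts to running the same model with a different number of integration steps. A generic Euler sampler makes this concrete; the velocity field `v` below is a made-up toy (not the Self-E model), chosen so the exact endpoint is known and the accuracy gain from extra steps is visible.

```python
import numpy as np

def sample(v, x0, num_steps):
    # Generic Euler sampler: the same model serves 1-step real-time use
    # and many-step offline rendering; only num_steps changes.
    x = np.asarray(x0, dtype=float)
    for i in range(num_steps):
        t = i / num_steps                 # current time, passed for generality
        x = x + v(x, t) / num_steps       # one Euler step of size 1/num_steps
    return x

# Hypothetical velocity field pulling noise toward a target at 2.0.
v = lambda x, t: 2.0 - x

rng = np.random.default_rng(1)
x0 = rng.standard_normal(10_000)
for k in (1, 2, 8, 32):
    print(k, "steps ->", round(float(sample(v, x0, k).mean()), 3))
```

For this linear field the exact endpoint mean is 2(1 - e^-1) ≈ 1.264, and the Euler error shrinks monotonically as the step count grows, mirroring the article's claim that quality keeps improving with more steps while few steps already land near the target.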
Unlocking any-step text-to-image generation: HKU & Adobe's new Self-E framework learns to evaluate itself
机器之心·2026-01-15 03:52