Thinking-while-Generating
Letting AI think as it paints, like a human painter: CUHK & Meituan teach models to take one step, then look
36Kr · 2025-12-22 08:12
Core Insights
- The article discusses the limitations of existing diffusion and autoregressive models in generating complex visual content, highlighting the need for a more flexible approach to visual generation [1][4]
- A new paradigm called Thinking-while-Generating (TwiG) is introduced, which interleaves textual reasoning with visual generation, allowing for real-time adjustments during the creation process [4][6]

Group 1: TwiG Framework
- TwiG enables the model to pause during generation, reflect on the current visual state, and use that reflection to guide subsequent creation, in contrast with previous methods that either plan everything beforehand or correct errors only after generation is complete [6][13]
- The framework is organized around three core dimensions: When to Think, What to Say, and How to Refine, which together enhance the model's ability to generate coherent and contextually accurate visuals [7][14]

Group 2: Experimental Validation
- The research team conducted experiments on a unified multimodal model, demonstrating that the zero-shot version of TwiG significantly outperformed baseline models across multiple dimensions, indicating its potential for real-time reasoning during generation [10][12]
- Supervised fine-tuning (SFT) on a high-quality dataset improved the model's coherence and reduced hallucinations, leading to more controlled and concise reasoning chains [11]
- Reinforcement learning (RL) strategies further optimized the model's decision-making, allowing it to compete with leading models such as Emu3 and FLUX.1 on key performance metrics [12][15]

Group 3: Implications and Future Directions
- TwiG represents a shift in the conceptual approach to visual generation, aiming to make the process more transparent and logical through the integration of explicit textual reasoning [13][14]
- The framework's design is compatible with diffusion models and has the potential to extend to more complex domains such as video generation and 3D modeling, contributing to advances in general visual intelligence [15][16]
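The interleaved loop implied by the three dimensions above (When to Think, What to Say, How to Refine) can be sketched in a few lines. The following is a minimal toy illustration in Python, not the authors' implementation: `generate_step`, `should_think`, `reason_about`, and `refine` are hypothetical stand-ins, and the "canvas" is just a list of strings rather than real visual tokens.

```python
# Toy sketch of a Thinking-while-Generating style loop.
# All functions below are hypothetical stand-ins, NOT the TwiG implementation.

def generate_step(canvas, guidance):
    """Append one placeholder 'visual token' shaped by the current guidance."""
    canvas.append(f"patch[{len(canvas)}|{guidance}]")
    return canvas

def should_think(step, interval=4):
    """'When to Think': here a fixed schedule; a real system would learn this."""
    return step > 0 and step % interval == 0

def reason_about(canvas, goal):
    """'What to Say': produce a short textual critique of the partial output."""
    return f"after {len(canvas)} patches, steer toward '{goal}'"

def refine(guidance, thought):
    """'How to Refine': fold the critique back into the guidance signal."""
    return f"{guidance} <adjusted per: {thought}>"

def twig_generate(goal, n_steps=8):
    """Alternate generation with periodic reflection on the partial result."""
    canvas, guidance, thoughts = [], goal, []
    for step in range(n_steps):
        if should_think(step):
            thought = reason_about(canvas, goal)
            thoughts.append(thought)
            guidance = refine(guidance, thought)
        canvas = generate_step(canvas, guidance)
    return canvas, thoughts

canvas, thoughts = twig_generate("a cat on a red chair")
print(len(canvas), len(thoughts))
```

The key design point the sketch captures is that reflection happens inside the generation loop: the critique produced mid-way alters the guidance for all subsequent steps, rather than being applied only as pre-planning or post-hoc correction.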