Letting AI paint and think at the same time, like a human artist: CUHK & Meituan teach models to "take a step, then look"
量子位· 2025-12-22 04:41
Core Viewpoint
- The article introduces a new paradigm called Thinking-while-Generating (TwiG), which interleaves textual reasoning with visual generation to improve models' ability to generate complex images and videos, addressing the limitations of existing models in handling spatial relationships and object interactions [5][19].

Group 1: Existing Challenges
- Current diffusion and autoregressive models, such as FLUX.1 and Emu3, struggle to render complex spatial relationships and object interactions accurately, often misplacing objects or getting quantities wrong [1].
- Two approaches have been explored previously: "Think-before-Generation", which lacks flexibility, and "Think-after-Generation", which incurs high computational cost and latency [4].

Group 2: Introduction of TwiG
- TwiG lets the model pause during the generation process to evaluate the current output and plan the next step, mimicking how human artists work [5][7].
- The framework decomposes visual generation into a "generate-think-regenerate" cycle, allowing reasoning to be injected at multiple points during creation (see the sketch at the end of this summary) [7].

Group 3: Core Dimensions of TwiG
- The framework consists of three key dimensions:
  1. **When to Think**: the model derives a "thinking schedule" from the user prompt, splitting generation into three stages that align with the semantic structure of the image [8].
  2. **What to Say**: at each pause, the model produces a "thought chain" that guides the next step more precisely than the original prompt alone [9].
  3. **How to Refine**: after completing a region, the model self-reflects and corrects mistakes immediately rather than redrawing the whole image, which improves efficiency [10].

Group 4: Empirical Research and Results
- The research team validated the TwiG framework on a unified multimodal model (Janus-Pro), demonstrating its potential across several stages of testing [12].
- **Zero-Shot Performance**: without any parameter updates, the TwiG-ZS variant already exhibited thinking-while-generating behavior and outperformed baseline models across multiple dimensions [13][14].
- **Supervised Fine-Tuning (SFT)**: a dataset of 50K samples was used for SFT, improving the coherence and controllability of the generated thought chains [16].
- **Reinforcement Learning (RL)**: the TwiG-RL variant, optimized with a dedicated RL strategy, was competitive with existing models such as Emu3 and FLUX.1 on key metrics [17].

Group 5: Conclusions and Future Implications
- TwiG represents a shift in how visual generation models operate, emphasizing the role of explicit logical reasoning during generation [19].
- Key conclusions include: complex logic requires explicit reasoning, local corrections are more efficient than complete rewrites, and reinforcement learning plays a critical role in strengthening these capabilities [20].
- The TwiG framework is designed to be compatible with diffusion models, suggesting potential applications in more demanding settings such as video generation and 3D modeling [21].
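To make the "generate-think-regenerate" cycle and the three dimensions concrete, below is a minimal Python sketch of a TwiG-style loop. All helper names (plan_thinking_schedule, generate_thought, generate_segment, critique) are hypothetical stubs invented for illustration; they do not come from the paper or any released code, and the real system would back each stub with the unified multimodal model.

```python
# Hypothetical sketch of a Thinking-while-Generating (TwiG) style loop.
# The stubs below only illustrate the cycle and the three dimensions:
# when to think, what to say, and how to refine.

from dataclasses import dataclass, field
from typing import List


@dataclass
class CanvasState:
    """Accumulated partial image (e.g. visual tokens) plus the thoughts so far."""
    visual_tokens: List[int] = field(default_factory=list)
    thoughts: List[str] = field(default_factory=list)


def plan_thinking_schedule(prompt: str, num_stages: int = 3) -> List[str]:
    """'When to think': split the prompt into stages that mirror the
    semantic structure of the target image (hypothetical heuristic)."""
    return [f"{prompt} (stage {i + 1}/{num_stages})" for i in range(num_stages)]


def generate_thought(state: CanvasState, stage_goal: str) -> str:
    """'What to say': produce a short thought chain that conditions the
    next generation step (stub standing in for the multimodal model)."""
    return f"Plan for: {stage_goal}"


def generate_segment(state: CanvasState, thought: str) -> List[int]:
    """Generate the next batch of visual tokens conditioned on the thought (stub)."""
    return [hash(thought) % 1000]


def critique(state: CanvasState, stage_goal: str) -> bool:
    """'How to refine': self-reflect on the freshly generated segment and
    decide whether a local correction is needed (stub always accepts)."""
    return True


def thinking_while_generating(prompt: str, max_retries: int = 1) -> CanvasState:
    state = CanvasState()
    for stage_goal in plan_thinking_schedule(prompt):       # when to think
        for _ in range(max_retries + 1):
            thought = generate_thought(state, stage_goal)   # what to say
            segment = generate_segment(state, thought)
            state.thoughts.append(thought)
            state.visual_tokens.extend(segment)
            if critique(state, stage_goal):                 # how to refine
                break
            # Local correction: drop only the last segment and retry,
            # instead of regenerating the whole image.
            del state.visual_tokens[-len(segment):]
            state.thoughts.pop()
    return state


if __name__ == "__main__":
    result = thinking_while_generating("a cat sitting to the left of two red apples")
    print(result.thoughts)
```

The key design point the sketch tries to convey is that reflection happens per segment, so a failed critique triggers a local redo of only the most recent step rather than a full rewrite, matching the efficiency argument in Group 5.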