Diffusion Models
A New Paradigm for Multimodal Reasoning! DiffThinker: "Drawing" Reasoning and Answers with Diffusion Models
机器之心· 2026-01-07 07:10
Core Viewpoint
- The article discusses the limitations of existing multimodal large language models (MLLMs) in visual reasoning tasks and introduces a new paradigm called Generative Multimodal Reasoning, exemplified by the model DiffThinker, which significantly improves performance in complex visual tasks [2][3][24]

Group 1: Limitations of Current MLLMs
- Current MLLMs struggle to track changes in visual information during reasoning tasks, leading to inaccuracies in tasks like spatial navigation and puzzle solving [9]
- The recent "Thinking with Image" paradigm, while innovative, faces scalability issues in complex scenarios due to high operational costs and reliance on multi-turn interactions [3][9]

Group 2: Introduction of DiffThinker
- DiffThinker redefines the reasoning process from "text output" to "image-to-image" generation, utilizing diffusion models to directly generate reasoning paths in visual space [3][11]
- The model has shown remarkable performance improvements, outperforming top closed-source models like GPT-5 by 314.2% and Gemini-3-Flash by 111.6% in complex visual tasks [3][20]

Group 3: Core Features of DiffThinker
- Efficient Reasoning: DiffThinker demonstrates superior training and inference efficiency compared to traditional MLLMs, generating fewer tokens while maintaining higher accuracy [15]
- Controllable Reasoning: The model uses a fixed-step Euler solver, allowing for predictable output lengths and avoiding issues like infinite loops [17]
- Native Parallel Reasoning: DiffThinker can explore multiple potential paths simultaneously in visual space, enhancing the reasoning process [17]
- Collaborative Reasoning: The model can generate multiple visual candidates for validation by MLLMs, achieving better performance through collaboration [18]
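The fixed-step Euler solver mentioned under Controllable Reasoning is what makes the generation budget predictable: the number of integration steps is set in advance, so output length cannot grow without bound. A minimal sketch of such a sampler is below; the toy velocity field is a hypothetical stand-in for DiffThinker's learned model, not its actual implementation.

```python
import numpy as np

def euler_sample(velocity_fn, x_noise, num_steps=50):
    """Fixed-step Euler integration of a learned flow from noise (t=1)
    toward data (t=0). The step count is fixed up front, so the compute
    budget is known in advance and the loop cannot run forever."""
    x = x_noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt                 # current time, from 1 down to dt
        x = x - dt * velocity_fn(x, t)   # one Euler step toward t=0
    return x

# Toy stand-in for the learned model: for the straight-line flow
# x_t = t * x1 with data point x0 = 0, the true velocity is v(x, t) = x / t.
x1 = np.ones(4)                          # "noise" sample at t = 1
out = euler_sample(lambda x, t: x / t, x1, num_steps=50)
```

With the exact toy velocity, the sampler recovers the data point `x0 = 0` up to floating-point error; in practice `velocity_fn` would be a neural network and the result an image.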
Group 4: Experimental Results
- In a systematic evaluation across seven complex tasks, DiffThinker achieved an average score of 87.4, significantly higher than GPT-5 (21.1) and Gemini-3-Flash (41.3) [20]
- The model's performance in tasks such as VSP, TSP, Sudoku, and Jigsaw showcases its effectiveness in various visual reasoning challenges [23]

Group 5: Comparison with Video Generation
- A video version of DiffThinker was developed, but it was found to be less accurate and slower than the image generation model, indicating that "thinking with images" is currently more efficient than "thinking with videos" [22]

Group 6: Future Implications
- The emergence of DiffThinker marks the beginning of a new era in Generative Multimodal Reasoning, suggesting that transitioning reasoning processes from "text flow" to "visual flow" may be crucial for the next generation of general artificial intelligence [24][25]
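The collaborative-reasoning setup described in this summary, where a generator proposes multiple visual candidates and an MLLM validates them, follows a generate-then-verify pattern. A minimal sketch of that loop is below; both callables are hypothetical placeholders, since the article describes the collaboration only at a high level.

```python
import random

def best_of_n(generate, score, n=8, seed=0):
    """Generate-then-verify collaboration: draw n candidates from a
    generator and return the one the verifier scores highest. Here
    `generate` stands in for a diffusion sampler and `score` for an
    MLLM verifier; both are assumptions, not the article's API."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy usage: candidates are numbers; the "verifier" prefers values near 10.
pick = best_of_n(lambda r: r.uniform(0, 20), lambda c: -abs(c - 10))
```

The design point is that the verifier only needs to rank finished candidates, which is typically easier and cheaper than steering the generation step by step.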
Nearly 500 Pages: The Most Comprehensive Diffusion Model Handbook Yet, Covering Three Mainstream Perspectives in One Book
具身智能之心· 2025-10-30 00:03
Core Insights
- The article discusses a comprehensive guide to diffusion models, which have significantly reshaped the landscape of generative AI across domains such as images, audio, video, and 3D environments [3][5][6]
- It emphasizes the need for a structured understanding of diffusion models, as researchers often struggle to piece together concepts from numerous papers [4][10]

Summary by Sections

Introduction to Diffusion Models
- Diffusion models are framed as a gradual transformation process over time, in contrast with traditional generative models that directly learn mappings from noise to data [12]
- Their development is explored through three main perspectives (variational, score-based, and flow-based), which provide complementary frameworks for understanding and implementing diffusion modeling [12][13]

Fundamental Principles of Diffusion Models
- The origins of diffusion models are traced back and linked to foundational perspectives such as Variational Autoencoders (VAE), score-based methods, and normalizing flows [14][15]
- The chapter illustrates how these methods can be unified under a continuous-time framework, highlighting their mathematical equivalence [17]

Core Perspectives on Diffusion Models
- All perspectives share a forward process of adding noise and a reverse process of denoising [22]
- Each perspective is detailed:
  - The variational view focuses on learning denoising processes through variational objectives [23]
  - The score-based view emphasizes learning score functions to guide denoising [23]
  - The flow-based view describes generation as a continuous transformation from a simple prior distribution to the data distribution [23][24]

Sampling from Diffusion Models
- Sampling in diffusion models proceeds by a distinctive coarse-to-fine refinement, which presents a trade-off between performance and efficiency [27][28]
- Techniques for improving sampling efficiency and quality are discussed, including classifier guidance and numerical solvers [29]

Learning Fast Generative Models
- The book explores methods for directly learning fast generative models that approximate the diffusion process, improving speed and scalability [30]
- Distillation-based methods are highlighted, in which a student model mimics a slower teacher model to achieve faster sampling [30][31]

Conclusion
- The book aims to establish a lasting theoretical framework for diffusion models, centered on continuous-time dynamical systems that connect simple prior distributions to data distributions [33]
- It emphasizes understanding the underlying principles and the connections between methods in order to design and improve next-generation generative models [36]
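The mathematical equivalence the book highlights between the variational and score-based views can be seen in a few lines. For a standard variance-preserving (DDPM-style) forward process, the score of q(x_t | x_0) is an exact rescaling of the injected noise, which is why noise prediction and score prediction are the same objective up to a constant. A minimal sketch, assuming a conventional linear beta schedule (the schedule and shapes here are illustrative, not taken from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# Variance-preserving noise schedule (DDPM-style linear betas).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)      # \bar{alpha}_t, decreasing in t

def forward_noise(x0, t):
    """Closed-form forward (noising) process:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

def score_of_xt(xt, x0, t):
    """Score of the Gaussian q(x_t | x_0):
    grad_x log N(x_t; sqrt(abar_t) x0, (1 - abar_t) I)
      = -(x_t - sqrt(abar_t) * x0) / (1 - abar_t).
    Substituting the forward equation gives -eps / sqrt(1 - abar_t)."""
    return -(xt - np.sqrt(alpha_bar[t]) * x0) / (1.0 - alpha_bar[t])

x0 = np.ones(3)
xt, eps = forward_noise(x0, 500)
s = score_of_xt(xt, x0, 500)
```

Because `s == -eps / sqrt(1 - abar_t)` exactly, a network trained to predict `eps` implicitly predicts the score, which is the bridge between the variational and score-based formulations that the continuous-time framework makes precise.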