DiffThinker
A new paradigm for multimodal reasoning! DiffThinker: "drawing" reasoning and answers with a diffusion model
机器之心· 2026-01-07 07:10
Core Viewpoint
- The article discusses the limitations of existing multimodal large language models (MLLMs) in visual reasoning tasks and introduces a new paradigm, Generative Multimodal Reasoning, exemplified by the model DiffThinker, which significantly improves performance on complex visual tasks [2][3][24].

Group 1: Limitations of Current MLLMs
- Current MLLMs struggle to track changes in visual information during reasoning, leading to inaccuracies in tasks such as spatial navigation and puzzle solving [9].
- The recent "Thinking with Image" paradigm, while innovative, scales poorly to complex scenarios because of high operational costs and reliance on multi-turn interactions [3][9].

Group 2: Introduction of DiffThinker
- DiffThinker recasts the reasoning process from "text output" to "image-to-image" generation, using diffusion models to generate reasoning paths directly in visual space [3][11].
- The model shows remarkable performance improvements, outperforming top closed-source models such as GPT-5 by 314.2% and Gemini-3-Flash by 111.6% on complex visual tasks [3][20].

Group 3: Core Features of DiffThinker
- Efficient Reasoning: DiffThinker trains and infers more efficiently than traditional MLLMs, generating fewer tokens while achieving higher accuracy [15].
- Controllable Reasoning: The model uses a fixed-step Euler solver, giving predictable output lengths and avoiding failure modes such as infinite loops [17].
- Native Parallel Reasoning: DiffThinker can explore multiple potential paths simultaneously in visual space, enhancing the reasoning process [17].
- Collaborative Reasoning: The model can generate multiple visual candidates for validation by MLLMs, achieving better performance through collaboration [18].
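The fixed-step Euler solver and the parallel-candidates-plus-verifier loop described above can be sketched in miniature. This is a hedged illustration, not DiffThinker's actual implementation: the toy velocity field, the `score_fn` verifier stand-in, and all function names here are assumptions. A real system would integrate a learned conditional diffusion/flow model over image tensors and use an MLLM as the scorer.

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps=10):
    """Fixed-step Euler integration of a flow ODE from t=0 to t=1.

    Unlike autoregressive text decoding, the number of solver steps is
    fixed in advance, so compute cost and "output length" are predictable
    and infinite loops are impossible by construction.
    """
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # one Euler step: x += dt * v(x, t)
    return x

def parallel_reason(velocity_fn, score_fn, input_image, n_candidates=4, seed=0):
    """Generate several candidate 'answer images' from noisy starting points
    and let an external scorer (a stand-in for an MLLM verifier) pick one."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_candidates,) + input_image.shape)
    # Each candidate is conditioned on the same input, perturbed differently.
    candidates = [euler_sample(velocity_fn, input_image + 0.1 * z) for z in noise]
    scores = [score_fn(c) for c in candidates]
    return candidates[int(np.argmax(scores))]
```

With the toy velocity field `v(x, t) = target - x`, each Euler step contracts the error by a factor of `1 - dt`, so the sample converges toward `target` in exactly `num_steps` updates; the batched variant simply explores several such trajectories at once and keeps the highest-scoring one.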
Group 4: Experimental Results
- In a systematic evaluation across seven complex tasks, DiffThinker achieved an average score of 87.4, far above GPT-5 (21.1) and Gemini-3-Flash (41.3) [20].
- Its performance on tasks such as VSP, TSP, Sudoku, and Jigsaw demonstrates its effectiveness across diverse visual reasoning challenges [23].

Group 5: Comparison with Video Generation
- A video version of DiffThinker was also built, but it proved less accurate and slower than the image-generation model, indicating that "thinking with images" is currently more efficient than "thinking with videos" [22].

Group 6: Future Implications
- The emergence of DiffThinker marks the start of a new era of Generative Multimodal Reasoning, suggesting that moving the reasoning process from "text flow" to "visual flow" may be crucial for the next generation of general artificial intelligence [24][25].