ICLR 2026 | A New Paradigm for Native Multimodal Reasoning: ThinkMorph Lets Text and Images Co-evolve in a Unified Architecture
机器之心·2026-03-10 07:23

Core Insights
- The article introduces ThinkMorph, a unified multimodal reasoning model in which text and images collaborate and evolve together throughout reasoning, rather than the model relying solely on text after an initial visual input [2][9][12]
- ThinkMorph delivers significant gains on visual reasoning tasks, achieving an average improvement of 34.74% in visual reasoning capability with only 24,000 training examples, and outperforming models such as GPT-4o and Gemini 2.5 Flash on several tasks [2][19][22]

Group 1: Need for Native Multimodal Reasoning
- Human cognition switches seamlessly between visual and logical thinking, a flexibility that current mainstream multimodal models lack because they rely primarily on text after the initial image input [5][9]
- ThinkMorph aims to replicate this cognitive flexibility by generating intermediate images during reasoning, thereby enriching the reasoning process itself [11][19]

Group 2: Core Design Principles
- The core idea of ThinkMorph is that text and images should contribute complementary information during reasoning, rather than merely restating each other [13][14]
- Text handles abstract analysis and logical verification, while images provide spatial visualization and detail, yielding a more effective reasoning process (a minimal sketch of such an interleaved loop appears after this summary) [14][20]

Group 3: Performance and Generalization
- The model was fine-tuned on three reasoning modes: text-only, vision-only, and interleaved; interleaved reasoning performed best on visually intensive tasks [19][21]
- ThinkMorph achieved an average improvement of 20.74% over its base model across nine benchmarks, demonstrating strong generalization despite having only 7 billion parameters [22][25]

Group 4: Emergent Properties
- ThinkMorph exhibited emergent behaviors, including autonomously learning visual operations absent from its training data, such as zooming and image inpainting [28][30]
- The model can also switch reasoning modes on its own, reaching 81.25% accuracy in cases where it judged visual assistance unnecessary (see the mode-switch sketch at the end) [31][34]
- Interleaved reasoning broadens the explored solution space, improving performance across diverse tasks [34][40]

Group 5: Future Implications
- The findings suggest that native multimodal reasoning can unlock new capabilities and redefine how intelligence is constructed, moving beyond traditional text-based reasoning [46]
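To make the interleaved design concrete, here is a minimal Python sketch of a text-image reasoning loop of the kind the summary describes: the model alternates between text thoughts and intermediate images until it reaches an answer. All names here (`TextStep`, `ImageStep`, `model.step`) are invented for illustration and are not ThinkMorph's actual API.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical step types: at each step of the interleaved chain of
# thought, the model emits either a text thought or an intermediate image.
@dataclass
class TextStep:
    content: str

@dataclass
class ImageStep:
    pixels: bytes  # placeholder for generated image data

Step = Union[TextStep, ImageStep]

def interleaved_reasoning(model, question: str, image: bytes,
                          max_steps: int = 8) -> str:
    """Run a text-image interleaved reasoning loop.

    `model` is a hypothetical unified multimodal model exposing a single
    step(context) call that may return a text thought, an intermediate
    image, or a final answer; this sketch only illustrates the control flow.
    """
    context: List[Step] = [TextStep(question), ImageStep(image)]
    for _ in range(max_steps):
        step = model.step(context)  # the model chooses the next modality
        if isinstance(step, TextStep) and step.content.startswith("ANSWER:"):
            return step.content     # text carries the logical conclusion
        context.append(step)        # image steps add spatial detail
    return "no answer within step budget"
```

The point of the loop is that neither modality is subordinate: a text step can trigger an image step and vice versa, which is what distinguishes this design from models that only emit text after the initial image.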

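The autonomous mode switching reported in Group 4 can be illustrated with a second sketch building on the loop above. The predicate `needs_visual_steps` and the fallback `text_only_answer` are invented stand-ins for the model's learned decision, not real methods.

```python
def answer_with_mode_switch(model, question: str, image: bytes) -> str:
    """Hypothetical wrapper illustrating autonomous mode switching.

    The summary reports that ThinkMorph decides on its own when
    intermediate images are unnecessary; `needs_visual_steps` is an
    invented predicate standing in for that learned decision.
    """
    if model.needs_visual_steps(question, image):
        # Visually intensive task: interleave text and image steps.
        return interleaved_reasoning(model, question, image)
    # Otherwise fall back to a plain text-only chain of thought.
    return model.text_only_answer(question, image)
```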