ICLR 2026 | ThinkMorph: A New Paradigm for Native Multimodal Reasoning, Where Text and Images Co-Evolve in a Unified Architecture
机器之心· 2026-03-10 07:23
Core Insights
- The article introduces "ThinkMorph," a unified multimodal reasoning model in which text and images collaborate and evolve together, rather than the model relying solely on text after an initial visual input [2][9][12]
- ThinkMorph demonstrates significant improvements on visual reasoning tasks, achieving an average increase of 34.74% in visual reasoning capability with only 24,000 training examples, outperforming models such as GPT-4o and Gemini 2.5 Flash on several tasks [2][19][22]

Group 1: Need for Native Multimodal Reasoning
- Human cognition switches seamlessly between visual and logical thinking, a flexibility not reflected in current mainstream multimodal models, which rely primarily on text after the initial image input [5][9]
- ThinkMorph aims to replicate this cognitive flexibility by allowing the model to generate intermediate images during reasoning, thus enhancing the reasoning process [11][19]

Group 2: Core Design Principles
- The core idea of ThinkMorph is that text and images should contribute complementary information during reasoning, rather than merely replicating each other [13][14]
- Text is responsible for abstract analysis and logical validation, while images provide spatial visualization and detail, leading to a more effective reasoning process [14][20]

Group 3: Performance and Generalization
- The model was fine-tuned on three reasoning modes: pure text, pure visual, and interleaved reasoning, with interleaved reasoning showing superior performance on visually intensive tasks [19][21]
- ThinkMorph achieved an average improvement of 20.74% across nine benchmarks compared to the base model, demonstrating strong generalization despite having only 7 billion parameters [22][25]

Group 4: Emergent Properties
- ThinkMorph exhibited emergent properties, including the ability to autonomously learn new visual operations not present in the training data, such as zooming and image inpainting [28][30]
- The model also demonstrated the ability to switch reasoning modes autonomously, achieving an accuracy of 81.25% in cases where visual assistance was deemed unnecessary [31][34]
- The interleaved reasoning approach allowed for a broader exploration of the solution space, leading to better performance across diverse tasks [34][40]

Group 5: Future Implications
- The findings suggest that native multimodal reasoning has the potential to unlock new capabilities and redefine how intelligence is constructed, moving beyond traditional text-based reasoning [46]
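The interleaved reasoning described above can be pictured as a loop that alternates textual steps with intermediate image generation until a final answer is produced. The sketch below is purely illustrative: `ToyModel`, its `generate_step` method, and the `ANSWER:` convention are hypothetical placeholders, not the actual ThinkMorph interface.

```python
def interleaved_reasoning(model, question, image, max_steps=6):
    """Alternate textual analysis and intermediate image generation
    until the model emits a final textual answer or the step budget runs out."""
    context = [("image", image), ("text", question)]
    for _ in range(max_steps):
        kind, payload = model.generate_step(context)  # hypothetical API
        context.append((kind, payload))
        if kind == "text" and payload.startswith("ANSWER:"):
            return payload, context
    return None, context

class ToyModel:
    """Stand-in model: one visual step (e.g. a zoom crop), then a text answer."""
    def generate_step(self, context):
        kinds = [k for k, _ in context]
        if kinds.count("image") < 2:  # first reasoning step: produce an image
            return ("image", "<zoomed crop of region of interest>")
        return ("text", "ANSWER: 3 chairs")  # then conclude in text

answer, trace = interleaved_reasoning(ToyModel(), "How many chairs?", "<photo>")
```

The point of the loop is the one the article emphasizes: intermediate images are first-class reasoning steps in the context, not a one-time input.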
DeepSeek Updates Its OCR Model Again, Potentially Disrupting Document Processing
Xuan Gu Bao· 2026-01-27 23:16
Group 1
- DeepSeek has launched the new DeepSeek-OCR2 model, which uses the innovative DeepEncoderV2 method to dynamically rearrange image components based on their meaning, rather than scanning mechanically from left to right [1]
- The new model achieved a score of 91.09%, an improvement of 3.73% over its predecessor, while reducing the maximum visual-token usage from 1156 to 1120 [1]
- The release of DeepSeek-OCR2 is significant because it may disrupt traditional document-processing methods and pave the way for native multimodal reasoning [1]

Group 2
- Haitong International states that DeepSeek-OCR represents a new generation of "compressed storage": text is mapped to visual representations and compressed at high ratios, achieving about 97% text-restoration accuracy at under 10x compression [2]
- At a 20x compression ratio, the model maintains approximately 60% accuracy, suitable for scenarios with a higher tolerance for errors [2]
- Huachuang Securities highlights DeepSeek-OCR's capability to process 33 million pages of data daily on 20 A100 nodes and its strong support for minority languages, a significant advantage for global business deployment [2]

Group 3
- Jin Modern has collaborated with Baidu on the development of large-model applications and complementary OCR recognition capabilities [3]
- Hanwang Technology has provided clients with various platforms, including a low-code development platform and an OCR platform [3]
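The "compressed storage" idea above boils down to simple arithmetic: the compression ratio is the number of text tokens a page would need as plain text divided by the number of visual tokens actually used. The function below is illustrative arithmetic, not DeepSeek-OCR code; only the accuracy pairs are taken from the article's reported figures.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Ratio of plain-text token count to visual-token count for one page."""
    return text_tokens / vision_tokens

# A page needing ~10,000 text tokens, rendered into 1,000 vision tokens,
# corresponds to a 10x compression ratio.
ratio = compression_ratio(10_000, 1_000)

# Text-restoration accuracy at each compression ratio, as reported above
# (per Haitong International's summary): ~97% at <10x, ~60% at 20x.
reported_accuracy = {10: 0.97, 20: 0.60}
```

This makes the trade-off explicit: doubling the compression ratio halves the visual-token budget per page but drops restoration accuracy from near-lossless to error-tolerant territory.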