Xiaohongshu and Xi'an Jiaotong University Attempt to Reproduce OpenAI's Undisclosed o3 "Thinking with Images" Technique

Core Viewpoint
- OpenAI's o3 reasoning model moves beyond purely text-based thinking by integrating images directly into the reasoning process, reaching a new level of multimodal reasoning capability [1][4][29]

Group 1: Model Capabilities
- The o3 model can analyze an image and derive an answer by focusing on the relevant regions, such as formulas in a physics exam or structural elements in architectural drawings, and reaches 95.7% accuracy on the V* Bench visual reasoning benchmark [1]
- DeepEyes, developed jointly by Xiaohongshu and Xi'an Jiaotong University, demonstrates capabilities similar to o3's: it reasons with images without relying on supervised fine-tuning [1][29]

Group 2: Reasoning Process
- DeepEyes follows a three-step reasoning process: global visual analysis, intelligent tool invocation, and detailed reasoning and identification, which shows its ability to think with images [7][10]
- The model's architecture introduces a "self-driven visual focus" mechanism that lets it decide dynamically, based on the reasoning context, when to use image information [14]

Group 3: Learning Mechanism
- DeepEyes uses an outcome-based reinforcement learning strategy, inspired by biological evolution, to develop its image reasoning capabilities without supervised fine-tuning [18][19]
- Learning proceeds in three stages: a novice phase with low accuracy, an exploration phase with increased tool usage, and a mature phase in which the model effectively predicts the key areas to analyze [21]

Group 4: Performance Metrics
- DeepEyes performs strongly across visual reasoning tasks, reaching 90.1% accuracy on V* Bench and outperforming existing workflow-based methods [23]
- The model also shows improved mathematical reasoning, which indicates potential for cross-task performance [24]

Group 5: Advantages of DeepEyes
- Compared with traditional models, DeepEyes offers a simpler training process, stronger generalization, end-to-end joint optimization, deeper multimodal integration, and inherent tool invocation abilities [26][28][29]
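
The three-step process in Group 2 and the outcome-based reward in Group 3 can be illustrated with a toy sketch. This is not the DeepEyes implementation: the names `crop`, `Step`, `think_with_images`, `toy_policy`, and `outcome_reward` are hypothetical, the "image" is a nested list of integers, the policy is a hand-written stub standing in for the model, and the exact-match 0/1 reward is an assumption about what "outcome-based" might look like in its simplest form.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def crop(image: Image, box: Box) -> Image:
    """Hypothetical zoom-in tool: return the region inside `box`, clamped to the image."""
    top, left, bottom, right = box
    height, width = len(image), len(image[0])
    top, bottom = max(0, top), min(height, bottom)
    left, right = max(0, left), min(width, right)
    return [row[left:right] for row in image[top:bottom]]


@dataclass
class Step:
    """One policy decision: request a crop (`box`) or give a final `answer`."""
    box: Optional[Box] = None
    answer: Optional[str] = None


# A policy sees every image gathered so far plus the question (stand-in for the model).
Policy = Callable[[List[Image], str], Step]


def think_with_images(image: Image, question: str, policy: Policy,
                      max_tool_calls: int = 3) -> Tuple[str, int]:
    """Run the look / zoom / answer loop; return (answer, number of tool calls)."""
    context: List[Image] = [image]            # step 1: start from the global view
    for _ in range(max_tool_calls):
        step = policy(context, question)
        if step.answer is not None:           # the policy chose to answer now
            return step.answer, len(context) - 1
        context.append(crop(image, step.box)) # step 2: invoke the crop tool
    # Tool budget exhausted: ask once more and accept whatever answer is given.
    final = policy(context, question)
    return final.answer or "", len(context) - 1


def toy_policy(context: List[Image], question: str) -> Step:
    """Stub policy: zoom into a fixed region once, then read off the largest value."""
    if len(context) == 1:
        return Step(box=(1, 1, 3, 3))
    detail = context[-1]                      # step 3: reason over the zoomed crop
    return Step(answer=str(max(max(row) for row in detail)))


def outcome_reward(predicted: str, reference: str) -> float:
    """Assumed outcome-only reward: 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


image = [
    [0, 0, 0, 0],
    [0, 5, 9, 0],
    [0, 1, 2, 0],
    [0, 0, 0, 0],
]
answer, tool_calls = think_with_images(image, "What is the largest value?", toy_policy)
reward = outcome_reward(answer, "9")
```

In this sketch the reward depends only on the final answer, so nothing supervises where or when the policy crops; in a real training setup that signal would drive a reinforcement learning update, which is omitted here.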