Workflow
DeepEyes
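Xiaohongshu Proposes DeepEyesV2: From "Thinking with Images" to "Tool Collaboration," Exploring a New Dimension of Multimodal Intelligence
QbitAI (量子位) · 2025-11-13 00:49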
Core Insights
- DeepEyesV2 is a significant upgrade from its predecessor, DeepEyes, enhancing its capabilities from merely recognizing details to actively solving complex problems through multi-tool collaboration [3][12].

Multi-Tool Collaboration
- Traditional multimodal models are limited in their ability to actively utilize external tools, often functioning as passive information interpreters [4].
- DeepEyesV2 addresses two main pain points: weak tool invocation capabilities and lack of collaborative abilities among different functions [5][8].
- The model can now perform complex tasks by integrating image search, text search, and code execution in a cohesive manner [12][18].

Problem-Solving Process
- DeepEyesV2's problem-solving process involves three steps: image search for additional information, text search for stock price data, and code execution to retrieve and calculate financial data [15][16][17].
- The model demonstrates advanced reasoning capabilities, allowing it to tackle intricate queries effectively [14].

Model Features
- DeepEyesV2 incorporates programmatic code execution and web retrieval as external tools, enabling dynamic interaction during reasoning [22].
- The model generates executable Python code or web search queries as needed, enhancing its analytical capabilities [23][27].
- This integration results in improved flexibility in tool invocation and a more robust multimodal reasoning framework [28].

Training and Development
- The development of DeepEyesV2 involved a two-phase training strategy: a cold start to establish foundational tool usage and reinforcement learning for optimization [37][38].
- The team created a new benchmark, RealX-Bench, to evaluate the model's performance in real-world scenarios requiring multi-capability integration [40][41].

Performance Evaluation
- DeepEyesV2 outperforms existing models in accuracy, particularly in tasks requiring the integration of multiple capabilities [45].
- The model's performance metrics indicate a significant improvement over open-source models, especially in complex problem-solving scenarios [46].

Tool Usage Analysis
- The model exhibits a preference for specific tools based on task requirements, demonstrating adaptive reasoning capabilities [62].
- After reinforcement learning, the model shows a reduction in unnecessary tool calls, indicating improved efficiency in reasoning [67][72].

Conclusion
- The advancements in DeepEyesV2 highlight the importance of integrating tool invocation with reasoning processes, showcasing its superior problem-solving abilities in various domains [73][75].
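
The image search, text search, and code execution sequence described in this summary can be pictured as a small controller loop. The Python sketch below is illustrative only and is not DeepEyesV2's actual implementation: `Action`, `run_episode`, `scripted_policy`, and the stub tools are hypothetical names, the search results and prices are made-up placeholders, and the scripted policy stands in for a model that would choose each step itself.

```python
import contextlib
import io
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Action:
    tool: str     # "image_search", "text_search", "code", or "answer"
    payload: str  # search query, Python source, or the final answer


# Each entry is (source, text); the first entry holds the user question.
History = List[Tuple[str, str]]


def run_code(source: str) -> str:
    """Run Python source and return what it printed (no sandbox: demo only)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue().strip()


def run_episode(
    policy: Callable[[History], Action],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 8,
) -> Tuple[str, History]:
    """Alternate between policy decisions and tool observations until an answer."""
    history: History = [("question", question)]
    for _ in range(max_steps):
        action = policy(history)
        if action.tool == "answer":
            return action.payload, history
        observation = tools[action.tool](action.payload)
        history.append((action.tool, observation))
    raise RuntimeError("no final answer within max_steps")


# Stub tools returning placeholder data (assumptions, not real search results).
TOOLS: Dict[str, Callable[[str], str]] = {
    "image_search": lambda query: "ExampleCorp",
    "text_search": lambda query: "open=100.0 close=104.0",
    "code": run_code,
}


def scripted_policy(history: History) -> Action:
    """Hypothetical stand-in for the model: one tool per step, then answer."""
    step = len(history)
    last_observation = history[-1][1]
    if step == 1:
        return Action("image_search", "logo in the input image")
    if step == 2:
        return Action("text_search", f"{last_observation} stock price")
    if step == 3:
        prices = dict(item.split("=") for item in last_observation.split())
        source = (
            f"print(round(({prices['close']} - {prices['open']}) "
            f"/ {prices['open']} * 100, 2))"
        )
        return Action("code", source)
    return Action("answer", last_observation)


answer, trace = run_episode(
    scripted_policy, TOOLS, "How much did this company's stock move today?"
)
```

Keeping the loop independent of any particular tool means a learned policy could skip or repeat tools freely, which matches the adaptive tool usage the summary reports.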
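OpenAI's Undisclosed o3 "Thinking with Images" Technique: Xiaohongshu and Xi'an Jiaotong University Attempt an Implementation
Synced (机器之心) · 2025-05-31 06:30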
Core Viewpoint
- OpenAI's o3 reasoning model has broken traditional boundaries of text-based thinking by integrating images directly into the reasoning process, achieving a new level of multimodal reasoning capabilities [1][4][29].

Group 1: Model Capabilities
- The o3 model can analyze images and derive answers by focusing on relevant areas, such as formulas in a physics exam or structural elements in architectural drawings, achieving a 95.7% accuracy on the V* Bench visual reasoning benchmark [1].
- DeepEyes, developed by a collaboration between Xiaohongshu and Xi'an Jiaotong University, has demonstrated similar capabilities to o3, allowing for reasoning with images without relying on supervised fine-tuning [1][29].

Group 2: Reasoning Process
- DeepEyes employs a three-step reasoning process: global visual analysis, intelligent tool invocation, and detail reasoning identification, showcasing its ability to think with images [7][10].
- The model's architecture introduces a "self-driven visual focus" mechanism, allowing it to dynamically determine when to utilize image information based on the reasoning context [14].

Group 3: Learning Mechanism
- DeepEyes utilizes an outcome-based reinforcement learning strategy, inspired by biological evolution, to develop its image reasoning capabilities without the need for supervised fine-tuning [18][19].
- The learning process is divided into three stages: a novice phase with low accuracy, an exploration phase with increased tool usage, and a mature phase where the model effectively predicts key areas for analysis [21].

Group 4: Performance Metrics
- DeepEyes has shown superior performance in various visual reasoning tasks, achieving a 90.1% accuracy on the V* Bench and outperforming existing workflow-based methods [23].
- The model also exhibits enhanced mathematical reasoning capabilities, indicating its potential for cross-task performance [24].

Group 5: Advantages of DeepEyes
- Compared to traditional models, DeepEyes offers a simpler training process, stronger generalization capabilities, end-to-end joint optimization, deeper multimodal integration, and inherent tool invocation abilities [26][28][29].
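
The "tool invocation" step in the reasoning process above amounts to the model naming a region of the image and receiving an enlarged crop to inspect. As a rough illustration only, and not the DeepEyes implementation, the Python sketch below models an image as rows of grayscale values and assumes a hypothetical `zoom_in` helper that takes a normalized bounding box and uses nearest-neighbor upscaling.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image stored as rows of pixel values


def zoom_in(
    image: Image, box: Tuple[float, float, float, float], scale: int = 2
) -> Image:
    """Crop the normalized box (x0, y0, x1, y1) and enlarge it `scale` times.

    Hypothetical helper: coordinates are fractions of width and height, and
    enlargement repeats each pixel (nearest-neighbor), chosen for simplicity.
    """
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
        raise ValueError("box must satisfy 0 <= x0 < x1 <= 1 and 0 <= y0 < y1 <= 1")

    # Convert fractions to pixel indices, keeping at least one row and column.
    left = int(x0 * width)
    right = max(left + 1, int(x1 * width))
    top = int(y0 * height)
    bottom = max(top + 1, int(y1 * height))

    crop = [row[left:right] for row in image[top:bottom]]
    # Repeat every pixel horizontally and every row vertically.
    return [
        [pixel for pixel in row for _ in range(scale)]
        for row in crop
        for _ in range(scale)
    ]


# A 4x4 test image whose pixel values are 0..15 in reading order.
sample = [[row * 4 + col for col in range(4)] for row in range(4)]
detail = zoom_in(sample, (0.5, 0.5, 1.0, 1.0))  # bottom-right quadrant
```

In a full system the crop would be fed back to the model as a new visual input, so the quality of the predicted box directly determines how useful the zoomed view is.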