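Xiaohongshu (小红书) Introduces DeepEyesV2: From "Thinking with Images" to "Tool Collaboration," Exploring a New Dimension of Multimodal Intelligence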
QbitAI (量子位) · 2025-11-13 00:49
Core Insights
- DeepEyesV2 is a significant upgrade from its predecessor, DeepEyes, enhancing its capabilities from merely recognizing details to actively solving complex problems through multi-tool collaboration [3][12].

Multi-Tool Collaboration
- Traditional multimodal models are limited in their ability to actively utilize external tools, often functioning as passive information interpreters [4].
- DeepEyesV2 addresses two main pain points: weak tool invocation capabilities and lack of collaborative abilities among different functions [5][8].
- The model can now perform complex tasks by integrating image search, text search, and code execution in a cohesive manner [12][18].

Problem-Solving Process
- DeepEyesV2's problem-solving process involves three steps: image search for additional information, text search for stock price data, and code execution to retrieve and calculate financial data [15][16][17].
- The model demonstrates advanced reasoning capabilities, allowing it to tackle intricate queries effectively [14].

Model Features
- DeepEyesV2 incorporates programmatic code execution and web retrieval as external tools, enabling dynamic interaction during reasoning [22].
- The model generates executable Python code or web search queries as needed, enhancing its analytical capabilities [23][27].
- This integration results in improved flexibility in tool invocation and a more robust multimodal reasoning framework [28].

Training and Development
- The development of DeepEyesV2 involved a two-phase training strategy: a cold start to establish foundational tool usage and reinforcement learning for optimization [37][38].
- The team created a new benchmark, RealX-Bench, to evaluate the model's performance in real-world scenarios requiring multi-capability integration [40][41].

Performance Evaluation
- DeepEyesV2 outperforms existing models in accuracy, particularly in tasks requiring the integration of multiple capabilities [45].
- The model's performance metrics indicate a significant improvement over open-source models, especially in complex problem-solving scenarios [46].

Tool Usage Analysis
- The model exhibits a preference for specific tools based on task requirements, demonstrating adaptive reasoning capabilities [62].
- After reinforcement learning, the model shows a reduction in unnecessary tool calls, indicating improved efficiency in reasoning [67][72].

Conclusion
- The advancements in DeepEyesV2 highlight the importance of integrating tool invocation with reasoning processes, showcasing its superior problem-solving abilities in various domains [73][75].
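
Illustrative sketch
The loop described under Problem-Solving Process and Model Features, in which the model emits either a search query or executable Python and then reasons over the result, can be sketched roughly as follows. This is not DeepEyesV2's actual implementation. Every name here (`ToolCall`, `TOOLS`, `run_code`, `stub_search`, `reasoning_loop`, `scripted_policy`) is hypothetical, the search tools are stubs that return canned strings, the numbers are made up, and a fixed scripted plan stands in for the model.

```python
import contextlib
import io
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class ToolCall:
    tool: str     # hypothetical tool names: "code", "image_search", "text_search"
    payload: str  # Python source for "code", otherwise a search query


def run_code(source: str) -> str:
    """Execute Python source and return what it printed (not sandboxed)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue().strip()


def stub_search(kind: str) -> Callable[[str], str]:
    """Stand-in for a real retrieval backend; returns a canned string."""
    return lambda query: f"[{kind} results for: {query}]"


TOOLS: Dict[str, Callable[[str], str]] = {
    "code": run_code,
    "image_search": stub_search("image"),
    "text_search": stub_search("text"),
}

Policy = Callable[[List[str]], Union[ToolCall, str]]


def reasoning_loop(policy: Policy, max_steps: int = 5) -> str:
    """Alternate between policy and tools until the policy returns an answer."""
    observations: List[str] = []
    for _ in range(max_steps):
        step = policy(observations)
        if isinstance(step, str):  # a plain string is treated as the final answer
            return step
        observations.append(TOOLS[step.tool](step.payload))
    return "no answer within step budget"


def scripted_policy(observations: List[str]) -> Union[ToolCall, str]:
    """Hypothetical stand-in for the model: a fixed three-step plan."""
    plan = [
        ToolCall("image_search", "company logo in the picture"),
        ToolCall("text_search", "stock price of that company"),
        ToolCall("code", "print(110 - 100)"),  # made-up numbers
    ]
    if len(observations) < len(plan):
        return plan[len(observations)]
    return f"difference: {observations[-1]}"


if __name__ == "__main__":
    print(reasoning_loop(scripted_policy))
```

In a real system the policy would be the multimodal model itself, deciding at each step whether another tool call is needed. The `max_steps` budget is one simple way to bound tool calls and is an assumption of this sketch, not something the article specifies.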