The First Open-Source Multi-modal Deep Research Agent, Surpassing Multiple Closed-Source Solutions
量子位 · 2025-08-15 06:44

Core Viewpoint
The article introduces WebWatcher, the first open-source multi-modal Deep Research agent. It integrates tools such as web browsing, image search, a code interpreter, and internal OCR, autonomously generates high-quality reasoning trajectories, and optimizes decision-making through cold-start fine-tuning followed by reinforcement learning [1].

Group 1: Deep Research Agent Capabilities
- A Deep Research agent is designed to handle complex, multi-step tasks that demand deep reasoning [5].
- During a task, it autonomously selects the appropriate combination of tools and reasoning paths [1].

Group 2: WebWatcher Methodology
- WebWatcher covers the complete chain from data construction to training optimization, aiming to improve the flexibility of multi-modal agents on high-difficulty tasks [6].
- The methodology consists of three components:
  1. Multi-modal high-difficulty data generation [7]
  2. High-quality reasoning-trajectory construction and post-training [13]
  3. High-difficulty benchmark evaluation [15]

Group 3: Data Generation Techniques
- Existing VQA datasets focus on single-step perception tasks and lack the depth needed to train multi-modal deep-research agents [8].
- The research team built an automated multi-modal data-generation pipeline that produces complex, cross-modal, and deliberately uncertain task samples [8].
- Random-walk sampling over multi-source web pages is used to construct a dense entity graph, encouraging exploratory combinations of visual information (a minimal sketch of this idea follows the results below) [10].
- Key information is intentionally obscured during question generation, forcing the model to recover it through cross-modal reasoning [11].

Group 4: Reasoning Trajectories and Training
- An action-observation driven trajectory-generation method addresses weaknesses of existing reasoning models, such as lengthy, template-like thought chains (a sketch of the loop appears below) [13].
- Supervised fine-tuning (SFT) on these trajectories lets WebWatcher quickly acquire multi-modal reasoning and tool-invocation patterns [14].

Group 5: Benchmarking and Performance
- BrowseComp-VL is introduced as a benchmark designed to approach the complexity of human-expert tasks and to validate WebWatcher's capabilities [16].
- WebWatcher outperforms leading models on complex reasoning, information retrieval, and knowledge-integration tasks (Pass@1 is measured as in the last sketch below):
  - On Humanity's Last Exam (HLE-VL), WebWatcher scores 13.6% Pass@1, surpassing models such as GPT-4o and Gemini 2.5 Flash [20].
  - On MMSearch, it reaches 55.3% Pass@1, well ahead of competitors [21].
  - On LiveVQA, it scores 58.7%, demonstrating its strength in knowledge retrieval and real-time information integration [22].
  - On BrowseComp-VL, it averages 27.0% Pass@1, more than double the score of other leading models [23].
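To make the data-generation step in Group 3 concrete, here is a minimal Python sketch of random-walk sampling over a toy entity graph, followed by entity obscuring. The `EntityGraph` class, the `type:name` labeling, and the `obscure` helper are hypothetical illustrations of the described idea, not WebWatcher's actual pipeline, which samples from real multi-source web pages.

```python
# A toy version of: random walks over an entity graph, then obscuring
# the anchor entity so the question requires cross-modal retrieval.
# All names here are illustrative assumptions, not WebWatcher's code.
import random
from collections import defaultdict

class EntityGraph:
    """A toy undirected entity graph; nodes stand for entities found on web pages."""
    def __init__(self):
        self.edges = defaultdict(set)

    def add_relation(self, a: str, b: str) -> None:
        self.edges[a].add(b)
        self.edges[b].add(a)

    def random_walk(self, start: str, length: int) -> list[str]:
        """Sample a walk of `length` hops; each hop crosses one relation."""
        path = [start]
        for _ in range(length):
            neighbors = list(self.edges[path[-1]])
            if not neighbors:
                break
            path.append(random.choice(neighbors))
        return path

def obscure(entity: str) -> str:
    """Replace a concrete entity with a vague referent, so the model must
    recover it from the image rather than read it off the prompt."""
    return "a certain " + entity.split(":")[0]  # "landmark:Eiffel Tower" -> "a certain landmark"

if __name__ == "__main__":
    g = EntityGraph()
    g.add_relation("landmark:Eiffel Tower", "person:Gustave Eiffel")
    g.add_relation("person:Gustave Eiffel", "event:1889 World's Fair")
    walk = g.random_walk("landmark:Eiffel Tower", length=2)
    # A multi-hop question whose anchor entity is deliberately obscured:
    question = (f"In the photo, {obscure(walk[0])} appears. "
                f"Which event is linked to its designer?")
    print(walk, "->", question)
```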
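The action-observation trajectory generation in Group 4 can be pictured as a loop in which each concise thought selects a tool, and the tool's output is fed back as an observation before the next decision. The sketch below is a toy version under that assumption; the `Step`/`Trajectory` schema, the stub tools, and the `policy` function are illustrative stand-ins, since the article does not describe WebWatcher's internal interfaces.

```python
# A toy action-observation loop: short thought -> tool call -> observation,
# repeated until the policy emits "finish". The resulting Trajectory is the
# kind of record that could later serve as SFT data. All interfaces assumed.
from dataclasses import dataclass, field

@dataclass
class Step:
    thought: str      # a concise reasoning snippet, not a long template
    action: str       # tool name
    action_input: str
    observation: str  # tool output fed back before the next decision

@dataclass
class Trajectory:
    question: str
    steps: list[Step] = field(default_factory=list)
    answer: str = ""

TOOLS = {
    "web_search": lambda q: f"[top results for '{q}']",
    "image_search": lambda q: f"[images matching '{q}']",
}

def policy(question: str, steps: list[Step]) -> tuple[str, str, str]:
    """Stand-in for the LLM policy: decide the next (thought, action, input).
    A real agent would condition on the full trajectory so far."""
    if not steps:
        return ("Identify the entity in the image first.", "image_search", question)
    return ("Enough evidence gathered.", "finish", "final answer")

def run_agent(question: str, max_steps: int = 4) -> Trajectory:
    traj = Trajectory(question)
    for _ in range(max_steps):
        thought, action, arg = policy(question, traj.steps)
        if action == "finish":
            traj.answer = arg
            break
        obs = TOOLS[action](arg)  # execute the tool, capture the observation
        traj.steps.append(Step(thought, action, arg, obs))
    return traj

print(run_agent("Which bridge appears in this photo?"))
```

Keeping each thought short and grounding every step in a fresh observation is what distinguishes this style of trajectory from the lengthy, template-like chains the article criticizes.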
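The Pass@1 scores in Group 5 correspond to giving the model a single attempt per task and counting it correct only if that first answer matches the reference. A minimal scoring sketch under that reading, with a hypothetical normalization rule and made-up sample data:

```python
# Pass@1 with one attempt per task reduces to exact-match accuracy over
# first answers. The normalization and the sample data are assumptions.
def normalize(ans: str) -> str:
    return " ".join(ans.lower().split())

def pass_at_1(predictions: list[str], references: list[str]) -> float:
    assert len(predictions) == len(references)
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

# Hypothetical benchmark slice: 2 of 4 first attempts match -> 50.0%
preds = ["Paris", "1889 World's Fair", "blue whale", "unknown"]
refs  = ["Paris", "Exposition Universelle 1889", "Blue Whale", "Mount Fuji"]
print(f"Pass@1 = {pass_at_1(preds, refs) * 100:.1f}%")
```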