Tencent Research Institute: AI Weekly Top 50 Keywords
Tencent Research Institute · 2025-12-20 02:33
Group 1: Core Insights
- The article presents a weekly roundup of the top 50 keywords in the AI sector, highlighting significant developments and trends in the industry [2].
- Key players mentioned include Google, Apple, ByteDance, NVIDIA, and OpenAI, indicating a competitive landscape in AI technology and applications [3][4].

Group 2: Chip Developments
- Google is advancing its AI chip technology with the introduction of TorchTPU [3].
- Apple is focusing on AI server chips, which may enhance its capabilities in AI applications [3].

Group 3: Model Innovations
- Google has launched the Gemini 3 Flash model, while ByteDance introduced Seed1.8, showcasing ongoing innovation in AI models [3].
- Other notable models include MiMo-V2-Flash from Xiaomi and Nemotron 3 from NVIDIA, indicating a diverse range of AI model developments [3].

Group 4: Application Trends
- OpenAI is expanding its ecosystem with the ChatGPT application store and applications such as ChatGPT Images and SAM Audio [3][4].
- Companies like Tencent and xAI are also developing unique applications, such as writing mode and Grok Voice, respectively [3][4].

Group 5: Technological Insights
- The article discusses technological insights including AI memory systems and recursive self-improvement, which are critical for future AI advancements [4].
- The AI adult-content market and AGI predictions are also highlighted, reflecting the broader implications of AI technology [4].
The First Open-Source Multi-Modal Deep Research Agent, Outperforming Multiple Closed-Source Solutions
量子位 (QbitAI) · 2025-08-15 06:44
Core Viewpoint
- The article introduces the first open-source multi-modal Deep Research agent, which integrates tools such as web browsing, image search, code interpreters, and internal OCR to autonomously generate high-quality reasoning trajectories and to optimize decision-making through cold-start fine-tuning and reinforcement learning [1].

Group 1: Deep Research Agent Capabilities
- The Deep Research agent is designed to handle complex, multi-step tasks that require deep reasoning capabilities [5].
- It can autonomously select appropriate tool combinations and reasoning paths during tasks [1].

Group 2: WebWatcher Methodology
- WebWatcher covers the complete chain from data construction to training optimization, aiming to enhance the flexibility of multi-modal agents on high-difficulty tasks [6].
- The methodology consists of three main components:
  1. Multi-modal high-difficulty data generation [7]
  2. High-quality reasoning trajectory construction and post-training [13]
  3. High-difficulty benchmark evaluation [15]

Group 3: Data Generation Techniques
- Existing VQA datasets focus on single-step perception tasks and lack the depth required for training multi-modal deep research agents [8].
- The research team developed an automated multi-modal data generation process to create complex, cross-modal, and uncertain task samples [8].
- Random-walk sampling over multi-source web pages is employed to construct a dense entity graph, promoting exploratory combinations of visual information [10].
- Key information is intentionally obscured during question generation to compel the model to engage in cross-modal reasoning [11].

Group 4: Reasoning Trajectory and Training
- The action-observation driven trajectory generation method addresses issues in existing reasoning models, such as lengthy, template-like thought chains [13].
- Supervised fine-tuning (SFT) is used to help WebWatcher quickly master multi-modal reasoning and tool-invocation patterns [14].
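The action-observation trajectory generation described above can be sketched as a simple agent loop: at each step the model proposes a thought, a tool call, and an argument, and the tool's result is recorded as an observation. This is a minimal illustration only; the class names, tool names, and model interface here are hypothetical, not WebWatcher's actual implementation.

```python
# Minimal sketch of an action-observation agent loop.
# All names (Step, Trajectory, run_agent, tool names) are illustrative
# assumptions, not the real WebWatcher API.
from dataclasses import dataclass, field

@dataclass
class Step:
    thought: str
    action: str       # tool name, e.g. "web_search", "ocr", "code"
    argument: str
    observation: str

@dataclass
class Trajectory:
    question: str
    steps: list = field(default_factory=list)
    answer: str = ""

def run_agent(question, model, tools, max_steps=8):
    """Generate one action-observation trajectory.

    `model` maps (question, history) to a (thought, action, argument)
    triple; `tools` maps tool names to callables returning observation
    strings. The loop stops on a "final_answer" action or at max_steps.
    """
    traj = Trajectory(question)
    for _ in range(max_steps):
        thought, action, argument = model(question, traj.steps)
        if action == "final_answer":
            traj.answer = argument
            break
        observation = tools[action](argument)
        traj.steps.append(Step(thought, action, argument, observation))
    return traj
```

Trajectories collected this way (with weak or templated steps filtered out) are the kind of data the article says is used for cold-start SFT.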
Group 5: Benchmarking and Performance
- BrowseComp-VL is introduced as a benchmark to validate WebWatcher's capabilities, designed to approach the complexity of human expert tasks [16].
- WebWatcher achieved significant results across evaluations, outperforming leading models in complex reasoning, information retrieval, and knowledge integration tasks:
  - On Humanity's Last Exam (HLE-VL), WebWatcher scored 13.6% Pass@1, surpassing models like GPT-4o and Gemini2.5-flash [20].
  - On MMSearch, it achieved a Pass@1 score of 55.3%, significantly ahead of competitors [21].
  - On LiveVQA, it scored 58.7%, demonstrating its strengths in knowledge retrieval and real-time information integration [22].
  - On BrowseComp-VL, it achieved an average Pass@1 score of 27.0%, more than double that of other leading models [23].
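The Pass@1 figures above are, in the standard sense, the fraction of tasks whose first sampled answer is judged correct. A generic sketch of that computation follows; the exact-match judge is an assumption for illustration, since benchmarks like HLE-VL typically use their own (often model-based) grading.

```python
# Generic Pass@1: fraction of tasks where the model's first answer is
# judged correct. The exact-match judge below is a simplifying
# assumption; real benchmark scorers may differ.
def pass_at_1(predictions, references,
              judge=lambda p, r: p.strip() == r.strip()):
    """predictions[i] is the model's first answer to task i."""
    if not predictions:
        return 0.0
    correct = sum(judge(p, r) for p, r in zip(predictions, references))
    return correct / len(predictions)
```

For example, 27 correct first answers out of 100 tasks gives a Pass@1 of 27.0%, the figure reported for BrowseComp-VL.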