AI News Roundup: DeepSeek upgrades its online model to V3.1; ByteDance open-sources the 36B-parameter Seed-OSS series
China Post Securities · 2025-08-26 13:00
- DeepSeek-V3.1 is an upgraded version of the DeepSeek language model, featuring a hybrid inference architecture that supports both a "thinking mode" and a "non-thinking mode" for tasks of different complexity[12][13][14]
- The model dynamically activates different attention heads and uses chain-of-thought compression training to reduce redundant token output during inference[13]
- The context window has been expanded from 64K to 128K tokens, allowing the model to handle longer documents and more complex dialogues[15]
- Benchmark performance improves markedly, for example 71.2 on xbench-DeepSearch and 93.4 on SimpleQA[17]
- The evaluation highlights its advances in hybrid inference, long-context processing, and tool use, although it still faces challenges on complex reasoning tasks[21]
- Seed-OSS by ByteDance has 36 billion parameters and a native 512K-token context window, emphasizing research friendliness and commercial practicality[22][23]
- The model uses a dense 64-layer architecture and combines grouped-query attention (GQA) with rotary position encoding (RoPE) to balance computational efficiency and inference accuracy (a minimal GQA sketch follows the benchmark results below)[23]
- A "thinking budget" mechanism dynamically controls inference depth, contributing to strong benchmark results such as 91.7% accuracy on the AIME24 math competition (an illustrative budget-capped decoding loop is also sketched below)[24]
- The evaluation notes strong performance on long-context and reasoning tasks, though the large parameter count complicates edge-device deployment[25]
- WebWatcher by Alibaba is a multimodal research agent that parses image and text information jointly and autonomously uses multiple toolchains for multi-step tasks[26][27]
- The model is built with a four-stage training framework, including data synthesis and reinforcement learning to optimize long-horizon reasoning capabilities[27]
- WebWatcher excels on benchmarks such as BrowseComp-VL and MMSearch, scoring 13.6% and 55.3% respectively and surpassing top closed-source models such as GPT-4o[28]
- The evaluation highlights its breakthrough in multimodal AI research, enabling complex task handling and pushing the boundaries of open-source AI capabilities[29]
- AutoGLM 2.0 by Zhipu AI is positioned as the first mobile general-purpose agent, using a cloud-based architecture to decouple task execution from local device capabilities[32][33]
- The model employs GLM-4.5 for task planning and GLM-4.5V for visual execution, trained end-to-end with an asynchronous reinforcement learning framework[34]
- AutoGLM 2.0 demonstrates high efficiency across tasks, achieving a 75.8% success rate on AndroidWorld and 87.7% on WebVoyager[35]
- The evaluation notes significant advances in mobile agent technology, though cross-application stability and scenario generalization still require optimization[37]
- WeChat-YATT by Tencent is a large-model training library designed to address scalability and efficiency bottlenecks in multimodal and reinforcement learning workloads[39][40]
- The library introduces a parallel controller mechanism and a partial colocation strategy to improve system scalability and resource utilization[40][42]
- WeChat-YATT reduces overall training time by 60% compared with the VeRL framework, with each training stage more than 50% faster[45]
- The evaluation highlights its effectiveness for large-scale RLHF tasks and its potential to drive innovation in multimodal and reinforcement learning[46]
- Qwen-Image-Edit by Alibaba's Tongyi Qianwen team is an image-editing model that combines a dual-encoding mechanism with a multimodal diffusion Transformer architecture for both semantic and appearance editing[47][48]
- The model uses a dual-path input design and a chained-editing mechanism to maintain high visual fidelity and support iterative interaction[48][49]
- Qwen-Image-Edit achieves SOTA scores on multiple benchmarks, with overall scores of 7.56 and 7.52 in English and Chinese scenarios respectively[50]
- The evaluation notes its transformative impact on design workflows, automating rule-based editing tasks and lowering the barrier to visual creation[52]

Model Benchmark Results
- DeepSeek-V3.1: BrowseComp 30.0, BrowseComp_zh 49.2, HLE 29.8, xbench-DeepSearch 71.2, Frames 83.7, SimpleQA 93.4, Seal0 42.6[17]
- Seed-OSS: AIME24 math competition 91.7%, LiveCodeBench v6 67.4, RULER (128K) 94.6, MATH 81.7[24]
- WebWatcher: BrowseComp-VL 13.6%, MMSearch 55.3%, Humanity's Last Exam-VL 13.6%[28]
- AutoGLM 2.0: AndroidWorld 75.8%, WebVoyager 87.7%[35]
- Qwen-Image-Edit: English scenario 7.56, Chinese scenario 7.52[50]
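As a reference for the Seed-OSS attention design summarized above, here is a minimal grouped-query attention (GQA) sketch in PyTorch. The head counts, dimensions, and class name are illustrative assumptions, not Seed-OSS's actual configuration; the point is only that several query heads share each key/value head, shrinking the KV cache relative to full multi-head attention.

```python
import torch
import torch.nn.functional as F
from torch import nn

class GroupedQueryAttention(nn.Module):
    """Minimal GQA sketch: n_q_heads query heads share n_kv_heads key/value heads."""

    def __init__(self, d_model: int = 1024, n_q_heads: int = 16, n_kv_heads: int = 4):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q_heads, self.n_kv_heads = n_q_heads, n_kv_heads
        self.head_dim = d_model // n_q_heads
        self.q_proj = nn.Linear(d_model, n_q_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)  # fewer K heads
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)  # fewer V heads
        self.o_proj = nn.Linear(n_q_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads attends to one shared K/V head.
        group = self.n_q_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 8, 1024)              # (batch, seq_len, d_model)
print(GroupedQueryAttention()(x).shape)  # torch.Size([2, 8, 1024])
```

With 16 query heads but only 4 K/V heads, the cached keys and values are a quarter of the multi-head size, which is the efficiency/accuracy trade-off the report attributes to the GQA choice. (A production model would also apply RoPE to `q` and `k` before attention; that step is omitted here for brevity.)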
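The "thinking budget" idea can be pictured as a decoding loop that caps how many tokens the model may spend inside its reasoning segment before being forced to emit the final answer. The sketch below is a hypothetical illustration: the `<think>`/`</think>` markers, the `budget` parameter, and the `step` callback are assumptions for demonstration, not Seed-OSS's published API.

```python
from typing import Callable, List

THINK_END = "</think>"  # assumed marker separating reasoning from the final answer

def generate_with_thinking_budget(
    step: Callable[[List[str]], str],   # step(tokens_so_far) -> next token (any LM wrapper)
    budget: int,                        # max tokens allowed inside the reasoning segment
    max_new_tokens: int = 256,
) -> List[str]:
    """Decode loop that cuts the reasoning phase off once `budget` tokens are spent."""
    tokens: List[str] = ["<think>"]
    thinking, spent = True, 0
    while len(tokens) < max_new_tokens:
        if thinking and spent >= budget:
            tokens.append(THINK_END)    # force the model out of thinking mode
            thinking = False
            continue
        tok = step(tokens)
        tokens.append(tok)
        if thinking:
            spent += 1
            if tok == THINK_END:
                thinking = False
        elif tok == "<eos>":
            break
    return tokens

# Toy stand-in for a language model: always wants to keep "thinking".
fake_lm = lambda toks: "hmm" if THINK_END not in toks else "42" if toks[-1] == THINK_END else "<eos>"
print(generate_with_thinking_budget(fake_lm, budget=3))
# ['<think>', 'hmm', 'hmm', 'hmm', '</think>', '42', '<eos>']
```

Raising or lowering `budget` trades answer latency against reasoning depth, which is the dynamic control of inference depth the summary describes.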
The first open-source multimodal Deep Research agent, outperforming several closed-source alternatives
量子位 (QbitAI) · 2025-08-15 06:44
Core Viewpoint
- The article introduces the first open-source multimodal Deep Research Agent, which integrates tools such as web browsing, image search, code interpreters, and internal OCR to autonomously generate high-quality reasoning trajectories, and which optimizes decision-making through cold-start fine-tuning and reinforcement learning [1].

Group 1: Deep Research Agent Capabilities
- The Deep Research Agent is designed to handle complex, multi-step tasks that require deep reasoning capabilities [5].
- It can autonomously select appropriate tool combinations and reasoning paths during a task [1].

Group 2: WebWatcher Methodology
- WebWatcher covers the complete chain from data construction to training optimization, aiming to enhance the flexibility of multimodal agents on high-difficulty tasks [6].
- The methodology consists of three main components:
  1. Multimodal high-difficulty data generation [7]
  2. High-quality reasoning trajectory construction and post-training [13]
  3. High-difficulty benchmark evaluation [15]

Group 3: Data Generation Techniques
- Existing VQA datasets focus on single-step perception tasks and lack the depth required for training multimodal deep research agents [8].
- The research team developed an automated multimodal data generation pipeline to create complex, cross-modal, and uncertain task samples [8].
- Random-walk sampling over multi-source web pages is used to construct a dense entity graph, encouraging exploratory combinations of visual information (a toy sketch of this sampling idea appears after this summary) [10].
- Key information is deliberately obscured during question generation to force the model into cross-modal reasoning [11].

Group 4: Reasoning Trajectory and Training
- The Action-Observation driven trajectory generation method addresses issues in existing reasoning models, such as lengthy, template-like thought chains (also sketched after this summary) [13].
- Supervised fine-tuning (SFT) is used to help WebWatcher quickly master multimodal reasoning and tool-invocation patterns [14].

Group 5: Benchmarking and Performance
- BrowseComp-VL is introduced as a benchmark to validate WebWatcher's capabilities, designed to approach the complexity of human expert tasks [16].
- WebWatcher achieved significant results across evaluations, outperforming leading models on complex reasoning, information retrieval, and knowledge integration tasks:
  - On Humanity's Last Exam (HLE-VL), WebWatcher scored 13.6% Pass@1, surpassing models such as GPT-4o and Gemini2.5-flash [20].
  - On MMSearch, it achieved a Pass@1 of 55.3%, significantly ahead of competitors [21].
  - On LiveVQA, it scored 58.7%, demonstrating its strengths in knowledge retrieval and real-time information integration [22].
  - On BrowseComp-VL, it achieved an average Pass@1 of 27.0%, more than double that of other leading models [23].
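A toy illustration of the random-walk sampling idea from Group 3: walk an entity graph to obtain a multi-hop chain, then hide the intermediate entities so the question can only be answered by reasoning across hops. The graph, entity names, and question template below are invented for illustration; WebWatcher's actual pipeline builds its graph from multi-source web pages and mixes in visual entities.

```python
import random
from collections import defaultdict

# Toy entity graph; in the real pipeline nodes/edges would come from crawled web pages.
graph = defaultdict(list)
edges = [
    ("Eiffel Tower", "Paris"), ("Paris", "France"),
    ("France", "Euro"), ("Paris", "Seine"), ("Seine", "Normandy"),
]
for a, b in edges:
    graph[a].append(b)
    graph[b].append(a)

def random_walk(start: str, hops: int, rng: random.Random) -> list:
    """Sample a multi-hop entity chain by walking the graph at random."""
    path = [start]
    for _ in range(hops):
        neighbors = [n for n in graph[path[-1]] if n not in path]  # avoid revisiting
        if not neighbors:
            break
        path.append(rng.choice(neighbors))
    return path

def obscure_question(path: list) -> str:
    """Blur the intermediate entities so answering requires multi-hop reasoning."""
    clues = " -> ".join("[?]" for _ in path[1:-1]) or "[?]"
    return (f"Starting from '{path[0]}' and following the chain {clues}, "
            f"which entity do you reach? (answer: {path[-1]})")

rng = random.Random(0)
path = random_walk("Eiffel Tower", hops=3, rng=rng)
print(path)
print(obscure_question(path))
```

Longer walks and denser graphs yield harder, more uncertain questions, which is the property the article attributes to this data generation step.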
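To make the Action-Observation driven trajectory generation from Group 4 concrete, here is a schematic agent loop. The `Trajectory` container, the stub `policy`, and the stub tools are assumptions for illustration only; in WebWatcher the policy is the model itself, the tools are real web-browsing, OCR, and code-interpreter backends, and the resulting trajectories are filtered before being used for SFT.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class Trajectory:
    """One reasoning trajectory: alternating (action, observation) steps plus a final answer."""
    question: str
    steps: List[Tuple[str, str]] = field(default_factory=list)
    answer: str = ""

def run_agent(
    question: str,
    policy: Callable[[str, List[Tuple[str, str]]], Tuple[str, str]],  # -> (tool_name, tool_input)
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> Trajectory:
    """Action-observation loop: the policy picks a tool, the tool returns an observation,
    and the pair is appended to the trajectory until the policy answers or the step cap hits."""
    traj = Trajectory(question)
    for _ in range(max_steps):
        tool, arg = policy(question, traj.steps)
        if tool == "final_answer":
            traj.answer = arg
            break
        obs = tools.get(tool, lambda _: "unknown tool")(arg)
        traj.steps.append((f"{tool}({arg})", obs))
    return traj

# Stub tools standing in for the web-search / OCR / code-interpreter toolchain.
tools = {
    "web_search": lambda q: f"top result for '{q}'",
    "ocr": lambda img: f"text extracted from {img}",
}

# Stub policy: search once, then answer. A real policy would be the LLM deciding each step.
def policy(question, steps):
    return ("web_search", question) if not steps else ("final_answer", "42")

traj = run_agent("What year was the pictured bridge completed?", policy, tools)
print(traj.steps)   # one (action, observation) pair from the search step
print(traj.answer)  # 42
```

Logging trajectories as structured (action, observation) pairs rather than free-form prose is what keeps them compact and non-templated, which is the problem with lengthy thought chains that the method is said to address.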