Alibaba Open-Sources Tongyi DeepResearch; Performance Surpasses OpenAI and DeepSeek Flagship Models
Sina Tech (新浪科技) · 2025-09-17 03:33
Core Insights
- Alibaba has open-sourced its first deep-research agent model, Tongyi DeepResearch, which achieved state-of-the-art (SOTA) results on multiple authoritative evaluation sets, surpassing models from OpenAI and DeepSeek [1][2]
- The model, framework, and solutions are fully available for download on GitHub, Hugging Face, and the ModelScope (Modao) community [1]
- The Tongyi team built a complete synthetic-data-driven training pipeline, addressing challenges such as "cognitive space congestion" and "irreversible noise pollution" that hamper long-horizon task processing [1]

Performance Metrics
- With only 3 billion activated parameters, Tongyi DeepResearch outperforms flagship models such as OpenAI o3, DeepSeek V3.1, and Claude-4-Sonnet across multiple benchmarks [2]
- On the Humanity's Last Exam benchmark, it scored 32.9, well ahead of competitors such as DeepSeek V3.1 (29.8) and OpenAI o3 (24.9) [2]
- It also led on other benchmarks, including BrowseComp-ZH (43.4), GAIA (46.7), and WebWalkerQA (70.9) [2]
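The reported scores can be tabulated for a quick side-by-side comparison. This is a minimal sketch using only the figures the article gives for Humanity's Last Exam; it is not an official leaderboard.

```python
# Scores reported in the article for the Humanity's Last Exam benchmark.
scores = {
    "Humanity's Last Exam": {
        "Tongyi DeepResearch": 32.9,
        "DeepSeek V3.1": 29.8,
        "OpenAI o3": 24.9,
    },
}

for bench, results in scores.items():
    # Identify the top scorer and its margin over the runner-up.
    best = max(results, key=results.get)
    runner_up = sorted(results.values(), reverse=True)[1]
    margin = results[best] - runner_up
    print(f"{bench}: {best} leads by {margin:.1f} points")
# → Humanity's Last Exam: Tongyi DeepResearch leads by 3.1 points
```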
Large Models Collectively "Fail the Exam"! New Chinese Web-Retrieval Benchmark: GPT-4o Accuracy Only 6.2%
QbitAI (量子位) · 2025-05-06 04:24
Core Viewpoint
- The newly released benchmark dataset BrowseComp-ZH reveals that mainstream AI models struggle significantly with Chinese web tasks, with many achieving accuracy rates below 10% [1][13][25]

Group 1: Need for Chinese Web Capability Testing
- Current AI models are increasingly capable of using tools such as search engines and plugins, but existing evaluation suites are designed primarily for English contexts and neglect the complexities of the Chinese internet [2][3]
- The Chinese internet is characterized by severe information fragmentation, diverse search entry points, and complex language expressions, making tests designed specifically for the Chinese context essential [4][6]

Group 2: Development of BrowseComp-ZH
- The research team used a "reverse design" method: starting from clear, verifiable factual answers, they constructed complex questions with multiple constraints, yielding 289 high-difficulty multi-hop retrieval questions across 11 fields [7][10]
- The dataset has been fully open-sourced to encourage model developers to engage with these challenges [24]

Group 3: Findings on Model Performance
- Models relying solely on memory, without search capabilities, often score below 10%, highlighting the inadequacy of mere memorization [15][16]
- Models with reasoning capabilities perform better: DeepSeek-R1 reached 23.2% accuracy versus 8.7% for DeepSeek-V3, indicating that reasoning ability is a critical variable [17][18]
- AI search products with multi-round retrieval outperform those that search only once; single-search products such as Kimi and Yuanbao scored in the single digits [19][20]
- Counterintuitively, enabling search can reduce accuracy: DeepSeek-R1 dropped from 23.2% to 7.6% when its search feature was activated [21][22][23]
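The reverse-designed items and accuracy scoring described in Group 2 can be sketched roughly as follows. The field names (`question`, `constraints`, `answer`) and the `exact_match` normalization are illustrative assumptions, not the team's actual schema or matching rules.

```python
from dataclasses import dataclass

@dataclass
class BrowseCompZHItem:
    # Illustrative schema for a reverse-designed item: a multi-constraint
    # question built backward from one verifiable answer (field names are
    # assumptions, not the benchmark's real format).
    question: str
    constraints: list[str]
    answer: str

def exact_match(prediction: str, gold: str) -> bool:
    # Naive normalization for illustration; the benchmark's real
    # matching rules may differ.
    return prediction.strip().lower() == gold.strip().lower()

def accuracy(predictions: list[str], items: list[BrowseCompZHItem]) -> float:
    correct = sum(exact_match(p, it.answer) for p, it in zip(predictions, items))
    return correct / len(items)

# Toy usage with a single made-up item.
items = [BrowseCompZHItem(
    question="Which film satisfies all of the constraints below?",
    constraints=["released in 2008", "director born in Shanghai"],
    answer="ExampleFilm",
)]
print(accuracy(["examplefilm"], items))  # → 1.0
```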
Group 4: Future Directions
- The research team aims to expand the sample size, diversify question formats, and conduct in-depth analyses of model reasoning paths and failure cases [26]
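The multi-round retrieval pattern that Group 3 credits for the stronger scores can be sketched as a simple search-and-decide loop. The `search` and `decide` callables here are hypothetical stand-ins, not any product's real API, and the stopping contract is an illustrative assumption.

```python
from typing import Callable, Optional

def multi_round_retrieve(
    question: str,
    search: Callable[[str], str],  # hypothetical search backend
    decide: Callable[[str, list], tuple],
    max_rounds: int = 5,
) -> str:
    """Iteratively search until the model decides it can answer.

    `decide` returns (answer, next_query_or_None); this contract is an
    illustrative assumption about how such an agent loop could be wired.
    """
    evidence: list = []
    query = question
    answer = ""
    for _ in range(max_rounds):
        evidence.append(search(query))          # gather one more snippet
        answer, next_query = decide(question, evidence)
        if next_query is None:                  # confident enough to stop
            return answer
        query = next_query                      # refine and search again
    return answer  # best effort after the round budget is spent

# Toy usage: the fake decider stops once two snippets have been gathered,
# mimicking a second retrieval round improving on a single search.
fake_search = lambda q: f"snippet for: {q}"
fake_decide = lambda q, ev: ("final answer", None) if len(ev) >= 2 else ("", q + " refined")
print(multi_round_retrieve("who?", fake_search, fake_decide))  # → final answer
```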