Alibaba Open-Sources Tongyi DeepResearch, Outperforming Flagship Models from OpenAI and DeepSeek
Sina Tech · 2025-09-17 03:33
Sina Tech, September 17 morning news: Alibaba has open-sourced its first deep-research agent model, Tongyi DeepResearch. The model achieves state-of-the-art (SOTA) results on several authoritative benchmarks, including HLE (Humanity's Last Exam), BrowseComp-ZH, and GAIA, surpassing agent models such as OpenAI Deep Research and DeepSeek-V3.1. The model, framework, and training recipe are all fully open-sourced; users can download the model and code from GitHub, Hugging Face, and the ModelScope community.

[Figure: Tongyi DeepResearch benchmark comparison across Humanity's Last Exam, BrowseComp-ZH, BrowseComp, GAIA, xbench-DeepSearch, WebWalkerQA, and FRAMES, versus LLM-based ReAct agents; table data truncated in extraction]
Large Models Collectively "Flunk": On a New Chinese Web-Retrieval Benchmark, GPT-4o Scores Only 6.2% Accuracy
QbitAI · 2025-05-06 04:24
Core Viewpoint
- The newly released benchmark dataset BrowseComp-ZH reveals that mainstream AI models struggle significantly with Chinese web tasks, with many achieving accuracy rates below 10% [1][13][25].

Group 1: Need for Chinese Web Capability Testing
- Current AI models are increasingly capable of using tools, including search engines and plugins, but existing evaluation tools are primarily designed for English contexts, neglecting the complexities of the Chinese internet [2][3].
- The Chinese internet is characterized by severe information fragmentation, diverse search entry points, and complex language expressions, making it essential to design tests specifically for the Chinese context [4][6].

Group 2: Development of BrowseComp-ZH
- The research team employed a "reverse design" method, starting from clear, verifiable factual answers and constructing complex questions with multiple constraints, resulting in 289 high-difficulty multi-hop retrieval questions across 11 fields [7][10].
- The dataset has been fully open-sourced to encourage model developers to engage with the challenges it presents [24].

Group 3: Findings on Model Performance
- Models relying solely on memory, without search capabilities, often score below 10%, highlighting the inadequacy of mere memorization [15][16].
- Models with reasoning capabilities perform better: DeepSeek-R1 achieves 23.2% accuracy versus 8.7% for DeepSeek-V3, indicating that reasoning ability is a critical variable [17][18].
- AI search products with multi-round retrieval outperform those that search only once; single-search products such as Kimi and Yuanbao score in the single digits [19][20].
- Counterintuitively, enabling search can decrease accuracy: DeepSeek-R1 dropped from 23.2% to 7.6% when its search feature was activated [21][22][23].
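The accuracy percentages above come from closed-answer evaluation: each benchmark question has a single verifiable reference answer, and a model's response counts as correct only if it matches it. A minimal sketch of such an exact-match scorer is below; the normalization rules and function names are illustrative assumptions, not the BrowseComp-ZH authors' actual scoring code.

```python
def normalize(text: str) -> str:
    """Lowercase and drop punctuation for a lenient exact match.
    (Illustrative normalization; real benchmarks may use an
    LLM judge or stricter string rules.)"""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions whose normalized form equals the
    normalized reference answer."""
    if not references:
        return 0.0
    correct = sum(
        normalize(p) == normalize(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Toy example: 2 of 3 answers correct.
preds = ["Hangzhou", "1997 ", "blue"]
refs = ["hangzhou", "1997", "red"]
print(f"{exact_match_accuracy(preds, refs):.1%}")  # 66.7%
```

With only 289 questions, each additional correct answer moves overall accuracy by about 0.35 percentage points, which is why the single-digit scores reported above are meaningful gaps rather than noise.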
Group 4: Future Directions
- The research team aims to expand the sample size, diversify question formats, and conduct in-depth analyses of model reasoning paths and failure cases [26].