Large Models Collectively "Flunk"! New Chinese Web Retrieval Benchmark: GPT-4o Scores Only 6.2% Accuracy

Core Viewpoint
- The newly released benchmark dataset BrowseComp-ZH reveals that mainstream AI models struggle significantly with Chinese web tasks, with many achieving accuracy rates below 10% [1][13][25].

Group 1: Need for Chinese Web Capability Testing
- Current AI models are increasingly capable of using tools, including search engines and plugins, but existing evaluation tools are primarily designed for English contexts, neglecting the complexities of the Chinese internet [2][3].
- The Chinese internet is characterized by severe information fragmentation, diverse search entry points, and complex language expressions, making it essential to design tests specifically for the Chinese context [4][6].

Group 2: Development of BrowseComp-ZH
- The research team employed a "reverse design" method, starting from clear, verifiable factual answers to construct complex questions with multiple constraints, resulting in 289 high-difficulty multi-hop retrieval questions across 11 fields [7][10].
- The dataset has been fully open-sourced to encourage model developers to engage with the challenges presented [24].

Group 3: Findings on Model Performance
- The results indicate that models relying solely on memory without search capabilities often score below 10%, highlighting the inadequacy of mere memorization [15][16].
- Models with reasoning capabilities perform better, as evidenced by DeepSeek-R1 achieving 23.2% accuracy compared to 8.7% for DeepSeek-V3, indicating that reasoning ability is a critical variable [17][18].
- AI search products with multi-round retrieval capabilities outperform those that only search once, with single-search models like Kimi and Yuanbao scoring in the single digits [19][20].
- Interestingly, enabling search functions can lead to decreased accuracy, as seen with DeepSeek-R1, which dropped from 23.2% to 7.6% when the search feature was activated [21][22][23].

Group 4: Future Directions
- The research team aims to expand the sample size, diversify question formats, and conduct in-depth analyses of model reasoning paths and failure cases [26].