Peking University Releases ScholarSearch, an Academic Search Benchmark: An "Open-Book Exam" That Stumps a Host of DeepResearch Systems

Core Viewpoint
- The article discusses the limitations of current large language models (LLMs) in academic research, highlighting the need for improved information retrieval capabilities and the introduction of the ScholarSearch dataset by Peking University to evaluate these models [1][15].

Group 1: ScholarSearch Dataset
- ScholarSearch is the first dataset specifically designed to assess the complex information retrieval capabilities of LLMs in academic research, containing 223 challenging academic search questions and their answers [1][5].
- The dataset aims to provide a comprehensive and rigorous evaluation of LLMs' retrieval, information integration, and reasoning abilities [5][12].
- All questions in ScholarSearch are derived from real academic research scenarios, ensuring that the evaluation reflects the actual challenges faced by researchers [11].

Group 2: Evaluation Results
- The evaluation results indicate that existing models perform poorly in academic search tasks, with top pure reasoning models like GPT-4.1 and DeepSeek-R1 achieving an accuracy rate below 9% [1][15].
- Models with browsing capabilities show significant improvements in accuracy; for instance, GPT-4o-mini's accuracy increased by over four times compared to its non-searching version [2][15].
- Despite improvements, even the most advanced search-enhanced models, such as GPT-4o-search-preview, only achieve an accuracy of 18.83%, indicating a gap in their ability to handle complex academic inquiries [3][16].

Group 3: Methodology and Standards
- The methodology for creating the ScholarSearch dataset involved rigorous screening to ensure that questions could not be answered correctly by existing models without extensive information retrieval [6][7].
- A dual negative screening standard was applied to ensure that questions required deep and broad information retrieval capabilities, thus maintaining the dataset's challenge level [6][8].
- The dataset covers a wide range of disciplines, including both science and engineering as well as social sciences and humanities, ensuring comprehensive evaluation [12].
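
The reported percentages can be put in absolute terms with a little arithmetic. The sketch below assumes each of the 223 questions is scored as simply correct or incorrect, and that accuracy is the number of correct answers divided by 223. The article does not report raw counts, so the counts here are derived, not quoted, and the function names are illustrative only.

```python
TOTAL_QUESTIONS = 223  # dataset size reported in the article


def accuracy_percent(correct: int, total: int = TOTAL_QUESTIONS) -> float:
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100 * correct / total, 2)


def implied_correct(percent: float, total: int = TOTAL_QUESTIONS) -> int:
    """Nearest whole number of correct answers implied by a reported percentage."""
    return round(percent * total / 100)


def max_correct_below(percent: int, total: int = TOTAL_QUESTIONS) -> int:
    """Largest count whose accuracy is strictly below an integer percentage."""
    # Integer arithmetic avoids floating-point edge cases at the threshold.
    return (percent * total - 1) // 100


# Assumption: binary scoring. 18.83% of 223 questions corresponds to 42 correct.
best_search_correct = implied_correct(18.83)

# "Below 9%" leaves room for at most 20 correct answers out of 223.
best_no_search_correct = max_correct_below(9)

print(best_search_correct, accuracy_percent(best_search_correct))
print(best_no_search_correct, accuracy_percent(best_no_search_correct))
```

Under these assumptions, the best search-enhanced model answers about 42 of the 223 questions correctly, and a model scoring below 9% answers at most 20.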