Advanced Reasoning

A Collective Failing Grade on Real Scientific Research! The New SFE Benchmark Deals a Heavy Blow to Mainstream Multimodal LLMs
机器之心 · 2025-07-09 09:52
Core Insights
- The article discusses advancements in Artificial Intelligence for Science (AI4S) and the introduction of the Scientists' First Exam (SFE), a benchmark for evaluating the capabilities of multimodal large language models (MLLMs) in scientific domains [1][3][12].

Group 1: AI4S and SFE Overview
- AI4S has made significant progress in transforming scientific research through innovative tools, but to become a truly revolutionary instrument it still needs to integrate specialized domain knowledge in a comprehensive way [1].
- SFE aims to systematically assess the cognitive abilities of MLLMs across scientific disciplines, addressing the limitations of existing evaluations that focus primarily on knowledge recall [2][3].

Group 2: SFE Evaluation Framework
- SFE introduces a three-tier evaluation framework: Signal Perception (L1), Attribute Understanding (L2), and Comparative Reasoning (L3), covering five scientific fields with 66 high-value tasks [4][10][12]; a minimal scoring sketch follows this summary.
- The evaluation reveals that mainstream models perform well on traditional benchmarks yet struggle significantly on high-level scientific tasks, with state-of-the-art models scoring only around 30 on SFE [4][18].

Group 3: Performance Insights
- Closed-source MLLMs outperform open-source models by 6-8% on average, with notable differences on specific tasks [20].
- Materials science is the strongest area for model performance, while astronomy poses greater challenges due to the complexity of its data [22][23].

Group 4: Model Development and Trends
- Recent models show significant gains on high-level reasoning tasks, while progress on understanding tasks remains limited, indicating a shift toward enhanced reasoning capabilities [25][26].
- Scaling model size does not always correlate with improved scientific capability, suggesting that scientific data should be expanded in balance with model size [31][32].

Group 5: Future Directions and Ecosystem
- The SciPrismaX platform aims to establish a rigorous and dynamic evaluation ecosystem for AI in science, incorporating multiple assessment dimensions and community collaboration [33][36].
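To make the three-tier framework described under Group 2 concrete, here is a minimal Python sketch of how per-level scores for an SFE-style evaluation might be aggregated. The `Level` enum, `TaskResult` schema, field labels, and task IDs below are illustrative assumptions for this sketch; the article does not specify SFE's actual data format or scoring pipeline.

```python
# A minimal sketch of aggregating SFE-style tiered evaluation results.
# SFE defines three capability levels -- L1 signal perception, L2 attribute
# understanding, L3 comparative reasoning -- over five scientific fields
# and 66 tasks. The schema and sample data here are hypothetical.
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    L1_SIGNAL_PERCEPTION = "L1"
    L2_ATTRIBUTE_UNDERSTANDING = "L2"
    L3_COMPARATIVE_REASONING = "L3"


@dataclass
class TaskResult:
    task_id: str   # one of the 66 high-value tasks (hypothetical IDs)
    field: str     # e.g. "materials", "astronomy" (assumed labels)
    level: Level
    correct: bool  # whether the model's answer was judged correct


def score_by_level(results: list[TaskResult]) -> dict[str, float]:
    """Return accuracy (0-100) per capability level."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r.level.value] += 1
        hits[r.level.value] += r.correct  # bool counts as 0 or 1
    return {lvl: 100.0 * hits[lvl] / totals[lvl] for lvl in totals}


if __name__ == "__main__":
    # Toy results for a single model; a real SFE run would cover all tasks.
    demo = [
        TaskResult("mat-001", "materials", Level.L1_SIGNAL_PERCEPTION, True),
        TaskResult("mat-002", "materials", Level.L2_ATTRIBUTE_UNDERSTANDING, True),
        TaskResult("ast-001", "astronomy", Level.L3_COMPARATIVE_REASONING, False),
    ]
    print(score_by_level(demo))  # {'L1': 100.0, 'L2': 100.0, 'L3': 0.0}
```

Breaking scores out by level in this way is what surfaces the article's central finding: a model can post strong L1/L2 numbers while its L3 comparative-reasoning score stays low.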