Scientific General Intelligence (SGI)
Stop hyping AI for scientific research! A new benchmark pours cold water: top models are still far from being "qualified scientists"
量子位 · 2025-12-27 07:08
Core Insights
- The article examines the current limitations of AI's "Scientific General Intelligence" (SGI) and introduces SGI-Bench, a comprehensive evaluation framework for assessing AI's capabilities in scientific research [1][5][51]

Group 1: SGI Definition and Framework
- SGI emphasizes multi-disciplinary, long-chain, cross-modal, and rigorously verifiable capabilities, which existing benchmarks fail to capture [1]
- The Shanghai Artificial Intelligence Laboratory developed the Practice Inquiry Model (PIM), which decomposes scientific inquiry into four cyclical stages: Deliberation, Conception, Action, and Perception [1][3]
- SGI-Bench aligns its tasks with scientists' actual workflow, drawing on input from multi-disciplinary experts and graduate students to build over 1,000 evaluation samples across ten disciplines [5][6]

Group 2: Evaluation Results and Insights
- In the first round of SGI-Bench results, the closed-source model Gemini-3-Pro achieved an SGI-Score of only 33.83 out of 100, indicating significant room for improvement in AI's research capabilities [3][9]
- In the Deliberation stage, step-level accuracy on scientific deep-research tasks ranged from 50% to 65%, but errors accumulated along long reasoning chains frequently produced incorrect final conclusions [9][13]
- The Conception stage showed that while the novelty of generated ideas was acceptable, their feasibility was low: GPT-5, for example, scored 76.08 on novelty but only 18.87 on feasibility [20][26]

Group 3: Action and Execution Challenges
- The Action stage highlighted that getting an experiment to run is not the same as getting it scientifically right; models often failed to produce code that was both executable and scientifically accurate [24][30]
- The best strict pass rate across models was only 36.64%, quantifying the gap between runnable code and correct science (a minimal sketch of such a strict-pass metric follows this summary) [30][31]
- Common failures included missing data-acquisition plans and unclear step dependencies, breaking the loop from idea to blueprint to execution [26][30]

Group 4: Perception and Reasoning
- In the Perception stage, the best closed-source models reached roughly 41.9% answer accuracy and about 71.3% reasoning effectiveness, showing that fully correct end-to-end reasoning chains remain difficult [37][43]
- Causal reasoning was relatively stable, while comparative reasoning proved the hardest, especially fine-grained comparisons across samples [43]

Group 5: Future Directions and Customization
- The SGI-Bench results sketch a roadmap for strengthening AI's autonomous research capabilities, focusing on better multi-modal reasoning, more accurate deep research, and more feasible creative generation [51][52]
- The companion SGIEvalAgent system supports customizable evaluations driven by user-defined intents, making assessments more accessible and adaptable (a hypothetical sketch appears at the end of this summary) [44][46][48]
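The article does not describe SGI-Bench's actual grading harness, but the distinction Group 3 draws between code that merely runs and code that reproduces the right result maps onto a simple two-gate metric. Below is a minimal Python sketch of such a strict-pass check; the function names, the exact-match criterion on stdout, and the timeout are assumptions for illustration, not the benchmark's real implementation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def strict_pass(generated_code: str, reference_output: str, timeout_s: int = 60) -> bool:
    """Illustrative two-gate check: the model-written script must both
    execute without error AND reproduce the reference result. (Assumed
    design; the real SGI-Bench harness is not described in the article.)"""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "experiment.py"
        script.write_text(generated_code)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # hung experiments count as failures
        if result.returncode != 0:
            return False  # gate 1: executability
        # gate 2: scientific correctness, here reduced to an exact stdout match
        return result.stdout.strip() == reference_output.strip()

def strict_pass_rate(samples) -> float:
    """samples: iterable of (generated_code, reference_output) pairs."""
    results = [strict_pass(code, ref) for code, ref in samples]
    return 100.0 * sum(results) / len(results)
```

Under a metric of this shape, a model can score well on plain executability while still failing the strict gate, which is exactly the gap the reported 36.64% best strict pass rate points to.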
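The article likewise gives no interface details for SGIEvalAgent, so the following is a purely hypothetical sketch of the core idea: turning a user-defined intent into a custom slice of the benchmark's task pool. Every name here (Task, select_tasks, the intent dictionary) is invented for illustration; the real system presumably parses free-text intents rather than structured dictionaries.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    discipline: str   # e.g. "physics", "biology" (ten disciplines in SGI-Bench)
    stage: str        # "deliberation" | "conception" | "action" | "perception"
    prompt: str

def select_tasks(pool: list[Task], intent: dict) -> list[Task]:
    """Filter the benchmark pool down to the slice a user's intent asks for.
    `intent` is a hypothetical structured form, e.g.
    {"disciplines": {"biology"}, "stages": {"action", "perception"}}."""
    wanted_disc = intent.get("disciplines")
    wanted_stage = intent.get("stages")
    return [
        t for t in pool
        if (not wanted_disc or t.discipline in wanted_disc)
        and (not wanted_stage or t.stage in wanted_stage)
    ]

# Usage: evaluate only Action-stage tasks, regardless of discipline.
pool = [
    Task("bio-001", "biology", "action", "Plan and implement a CRISPR screen analysis"),
    Task("phy-014", "physics", "perception", "Interpret this spectrogram"),
]
subset = select_tasks(pool, {"stages": {"action"}})  # -> [pool[0]]
```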