Stop Hyping AI for Scientific Research! A New Benchmark Pours Cold Water: Top Models Are Still Far from Being "Qualified Scientists"

Core Insights
- The article discusses the current limits of AI's "Scientific General Intelligence" (SGI) and introduces SGI-Bench, a comprehensive evaluation framework for assessing AI capabilities in scientific research [1][5][51].

SGI Framework
- SGI emphasizes multi-disciplinary, long-chain, cross-modal, and rigorously verifiable capabilities. Existing benchmarks test fragmented abilities and do not represent these adequately [1].
- The Shanghai Artificial Intelligence Laboratory has developed a Practical Inquiry Model (PIM) that breaks scientific inquiry into four cyclical stages (Deliberation, Conception, Action, and Perception) and maps each stage to an AI capability [1][3].

SGI-Bench Evaluation
- SGI-Bench tasks are aligned with a scientist's workflow. Multi-disciplinary experts and graduate students contributed to more than 1,000 evaluation samples across ten disciplines [5][6].
- In the first round of results, the closed-source model Gemini-3-Pro achieved an SGI-Score of 33.83 out of 100, which leaves significant room for improvement in AI's research capabilities [3][9].

Key Findings
1. Deliberation: Step-level accuracy in deep scientific research is between 50% and 65%, but errors along long reasoning chains frequently lead to incorrect conclusions, and strict-match accuracy is only 10% to 20% [9][13].
2. Conception: The novelty of generated ideas is acceptable, but feasibility is low. GPT-5, for example, scores 76.08 on novelty and only 18.87 on feasibility [20][26].
3. Action: Generated experiment code executes smoothly more than 90% of the time, but a significant gap remains between code that runs and code that is scientifically correct [30][31].
4. Perception: The best closed-source models reach an answer accuracy of approximately 41.9% and a reasoning effectiveness of about 71.3%, which shows that fully correct reasoning chains remain difficult [37][43].
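The Deliberation finding contrasts two ways of scoring the same output: credit per reasoning step versus credit only for a fully correct conclusion. The sketch below illustrates how the two numbers can diverge on toy data. It is not SGI-Bench's scoring code: the function names, the position-by-position step comparison, and the sample records are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical record layout (an assumption, not the SGI-Bench schema):
# (predicted steps, reference steps, predicted answer, reference answer)
Sample = Tuple[List[str], List[str], str, str]


def step_accuracy(pred: List[str], gold: List[str]) -> float:
    """Fraction of reference steps reproduced at the same position."""
    if not gold:
        return 0.0
    hits = sum(p == g for p, g in zip(pred, gold))
    return hits / len(gold)


def exact_match(pred_answer: str, gold_answer: str) -> bool:
    """Strict match on the final answer, ignoring case and outer whitespace."""
    return pred_answer.strip().lower() == gold_answer.strip().lower()


def summarize(samples: List[Sample]) -> Tuple[float, float]:
    """Return (mean step-level accuracy, strict-match rate) over the samples."""
    step_scores = [step_accuracy(p, g) for p, g, _, _ in samples]
    em_scores = [exact_match(pa, ga) for _, _, pa, ga in samples]
    return sum(step_scores) / len(samples), sum(em_scores) / len(samples)


# Toy data: every chain shares the same four reference steps.
GOLD = ["a", "b", "c", "d"]
TOY_SAMPLES: List[Sample] = [
    (["a", "b", "c", "d"], GOLD, "42", "42"),  # all steps right, answer right
    (["a", "b", "x", "y"], GOLD, "40", "42"),  # derails halfway
    (["a", "x", "c", "d"], GOLD, "41", "42"),  # one early slip
    (["a", "b", "c", "x"], GOLD, "43", "42"),  # fails at the last step
]

mean_step, em_rate = summarize(TOY_SAMPLES)
print(f"step-level accuracy:   {mean_step:.2f}")  # 0.75
print(f"strict-match accuracy: {em_rate:.2f}")    # 0.25
```

In this toy set, three of the four chains get most steps right, yet only one reaches the correct answer. Under a simplifying assumption that steps fail independently (again, not a claim about how SGI-Bench tasks behave), a chain of four steps that are each correct 60% of the time is fully correct only about 13% of the time, since 0.6^4 ≈ 0.13.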
Future Directions
- The SGI-Bench results suggest directions for strengthening AI's autonomous research capabilities: improving multi-modal reasoning, deep research accuracy, the feasibility of creative generation, and the stability of code generation [51][52].