Core Insights
- OpenAI has released a new benchmark called FrontierScience to evaluate AI's scientific reasoning in physics, chemistry, and biology, revealing that AI still has a long way to go before it matches true scientists [1][6][17]
Group 1: Benchmark Design and Structure
- FrontierScience consists of over 700 text-based questions, including a 160-question "Gold Set" made up of 100 competition-style questions and 60 original research sub-tasks designed by PhD-level researchers [9][12]
- The competition track emphasizes short-answer formats for easy verification, while the research track uses a 10-point scoring system in which at least 7 points are required to pass (see the illustrative sketch after this summary) [9][12]
- Question quality is ensured through collaboration with 42 international award winners and 45 qualified scientists across various fields [11][12]
Group 2: AI Performance and Comparison
- In initial testing, GPT-5.2 led the pack with 77% on competition questions and 25% on research questions, while Gemini 3 Pro followed closely at 76% on competition questions [13]
- In an earlier benchmark, GPT-4 scored only 39% on a question set designed by PhD experts, well below the expert baseline of 74% [6][12]
Group 3: Challenges and Limitations
- OpenAI acknowledges that even advanced models still make reasoning, logic, and factual errors, and that longer processing times often correlate with higher accuracy [15][17]
- FrontierScience is designed to standardize assessment, but it does not evaluate models' ability to generate truly novel hypotheses or to interact with multimodal data and real-world experimental systems [17]
Group 4: Future Directions
- OpenAI plans to iterate on the question bank, expand the fields covered, and include more real-world assessments to determine the practical impact of these systems on scientific work [17]
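As a rough illustration of how the two tracks' headline numbers could be aggregated, the minimal Python sketch below assumes competition questions are graded as exact-match correct/incorrect and research sub-tasks receive a 0-10 rubric score with a pass threshold of 7, as described above. The `GradedItem` structure, field names, and `summarize` function are hypothetical; the article does not describe FrontierScience's actual grading code.

```python
# Hypothetical illustration only: FrontierScience's real grading pipeline is not public.
# Assumes each graded item carries a track label plus either a correctness flag
# (competition track) or a 0-10 rubric score (research track, pass at >= 7).
from dataclasses import dataclass

@dataclass
class GradedItem:
    track: str             # "competition" or "research"
    correct: bool = False  # competition: short answer verified against the key
    rubric_score: int = 0  # research: points awarded on a 0-10 rubric

PASS_THRESHOLD = 7  # a research sub-task passes at 7 of 10 points

def summarize(items: list[GradedItem]) -> dict[str, float]:
    """Aggregate per-item grades into the two headline metrics."""
    comp = [i for i in items if i.track == "competition"]
    res = [i for i in items if i.track == "research"]
    return {
        "competition_accuracy": sum(i.correct for i in comp) / len(comp) if comp else 0.0,
        "research_pass_rate": sum(i.rubric_score >= PASS_THRESHOLD for i in res) / len(res) if res else 0.0,
    }

if __name__ == "__main__":
    demo = [
        GradedItem("competition", correct=True),
        GradedItem("competition", correct=False),
        GradedItem("research", rubric_score=8),  # passes (>= 7)
        GradedItem("research", rubric_score=5),  # fails
    ]
    print(summarize(demo))  # {'competition_accuracy': 0.5, 'research_pass_rate': 0.5}
```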
OpenAI Releases an Authoritative AI Research Benchmark, Ripping Off AI's Fig Leaf: An Olympiad Gold Medal ≠ a Top-Tier Scientist
36Ke·2025-12-17 09:00