GPT-5 Scores Just 23.3%: AI Models Worldwide Flunk a Hell-Level Coding Exam, Shattering the Gold-Medal Myth
36Kr · 2025-09-22 11:27
Core Insights
- The newly released SWE-Bench Pro benchmark has exposed the limitations of leading AI models on coding tasks: GPT-5, the top scorer, resolved only 23.3% of tasks [7][25][37]
- Despite earlier successes in competitions such as ICPC, the latest results indicate that long-horizon coding remains a significant weakness for AI [8][25]

Benchmark Overview
- SWE-Bench Pro is designed to evaluate AI programming agents on real-world engineering tasks, with substantially harder tasks and stronger resistance to data contamination [5][6][14]
- The benchmark includes 1,865 verified problems, split into public, commercial, and reserved (held-out) sets, to provide a diverse and challenging test environment [18][19]

Model Performance
- In the SWE-Bench Pro evaluation even the top models performed poorly: GPT-5 and Claude Opus 4.1 led at 23.3% and 22.7% respectively, while the other models scored below 15% [7][25][28]
- The gap between the public and commercial sets is notable: the best models scored below 20% on commercial tasks, highlighting the difficulty of enterprise-level coding [27][28]

Task Complexity
- SWE-Bench Pro focuses on complex tasks that require substantial modifications across multiple files; solutions involve an average of 4.1 files and 107.4 lines of code [21][23]
- The benchmark excludes simple tasks that need only minor code changes, so that the challenges reflect real-world industrial scenarios [21][24]

Error Analysis
- An analysis of model failures revealed several kinds of problems, including semantic misunderstanding, syntax errors, and tool-use errors, indicating where AI coding capabilities need to improve [36]
- For instance, Claude Opus 4.1 struggled with semantic understanding, while Gemini 2.5 ran into tool-related errors, showing that the challenges in AI programming are multifaceted [36]

Conclusion
- SWE-Bench Pro represents a significant advance in benchmarking AI coding abilities, providing a more accurate measure of performance in industrial applications [37]
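
To make the reported figures concrete, the sketch below shows how a resolve rate (such as the 23.3% cited above) and the average patch size (files and lines changed) could be computed from per-task results. It is only an illustration, not the SWE-Bench Pro harness: the `TaskResult` record, the helper names, the sample data, and the `min_lines=10` cutoff for "minor" changes are all hypothetical assumptions, since the article does not state the actual threshold or data format.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """Hypothetical per-task record; not the benchmark's real schema."""
    task_id: str
    files_changed: int  # files touched by the reference solution
    lines_changed: int  # lines added or removed by the reference solution
    resolved: bool      # True if the model's patch passed the task's tests


def resolve_rate(results):
    """Percentage of tasks whose patch passed the tests."""
    if not results:
        return 0.0
    return 100.0 * sum(r.resolved for r in results) / len(results)


def drop_trivial(results, min_lines=10):
    """Keep only tasks whose reference solution changes at least min_lines lines.

    The cutoff of 10 is an assumed value used for illustration only.
    """
    return [r for r in results if r.lines_changed >= min_lines]


def patch_stats(results):
    """Return (average files changed, average lines changed)."""
    return (
        mean(r.files_changed for r in results),
        mean(r.lines_changed for r in results),
    )


# Made-up sample data, chosen only to exercise the functions.
results = [
    TaskResult("t1", 1, 3, True),
    TaskResult("t2", 4, 120, False),
    TaskResult("t3", 5, 90, True),
    TaskResult("t4", 3, 110, False),
    TaskResult("t5", 4, 100, False),
]

kept = drop_trivial(results)
print(f"all tasks:      {resolve_rate(results):.1f}% resolved")
print(f"non-trivial:    {resolve_rate(kept):.1f}% resolved")
print("avg files/lines:", patch_stats(kept))
```

In this toy data, dropping the one trivial task lowers the resolve rate, which mirrors the article's point that excluding minor edits makes a benchmark harder.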