Core Insights
- The article examines how leading AI models perform on SWE-BENCH PRO, a new software engineering benchmark, and finds that none of the top models achieved a solution rate above 25% [1][23].

Group 1: Benchmark Overview
- SWE-BENCH PRO poses substantially harder tasks than its predecessor, SWE-Bench-Verified, on which leading models averaged roughly 70% accuracy [5][6].
- The new benchmark is designed to eliminate data-contamination risk by ensuring that models have not seen the test content during training [9][12].
- SWE-BENCH PRO comprises 1865 problems drawn from a diverse set of codebases spanning commercial applications, B2B services, and developer tools, organized into public, commercial, and reserved subsets [12][18].

Group 2: Model Performance
- On the public set, the top performers were GPT-5 and Claude Opus 4.1, with solution rates of 23.3% and 22.7%, respectively [25][26].
- On the commercial set, even the best models scored below 20%, indicating limited ability to solve real-world business problems [27][28].
- Performance varied significantly across programming languages, with Go and Python tasks generally solved at higher rates than JavaScript and TypeScript tasks [30].

Group 3: Failure Analysis
- The primary failure modes were semantic-understanding issues, syntax errors, and incorrect answers, pointing to challenges in problem comprehension and algorithmic correctness [34].
- GPT-5 had the highest unanswered rate at 63.1%: it performs well on the tasks it does submit but leaves more complex problems unanswered (see the sketch after this summary) [32].
- The analysis identifies programming-language difficulty, the nature of the codebase, and the model type as the key factors influencing performance [28][29].
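The headline's "reversal" rests on simple arithmetic: if 63.1% of GPT-5's tasks went unsubmitted, its 23.3% overall solution rate implies a much higher rate on the tasks it actually attempted. The sketch below shows that calculation; the function name and the attempted-only reading are illustrative assumptions, not metrics defined by the benchmark.

```python
# Minimal sketch (not from the article): relate a model's overall resolve rate
# and its "unanswered" rate to its resolve rate on tasks it actually submitted.
# Figures for GPT-5 (23.3% resolved, 63.1% unanswered) come from the article.

def attempted_solve_rate(resolve_rate: float, unanswered_rate: float) -> float:
    """Resolve rate conditioned on the tasks the model actually submitted."""
    attempted = 1.0 - unanswered_rate
    if attempted <= 0.0:
        raise ValueError("unanswered_rate must be below 1.0")
    return resolve_rate / attempted

if __name__ == "__main__":
    # GPT-5 on the public set: 23.3% resolved overall, 63.1% of tasks unanswered.
    gpt5 = attempted_solve_rate(0.233, 0.631)
    print(f"GPT-5 solve rate on attempted tasks: {gpt5:.1%}")  # ~63.1%
```

Plugging in the article's figures gives roughly 63% resolved among attempted tasks, which is how an overall score below 25% can still be read favorably.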
A big twist in GPT-5's coding evaluation: it looks like a failing grade on the surface, but it actually left 63.1% of the tasks unsubmitted; factor that in and its score is double Claude's
量子位·2025-09-22 08:08