GPT-5 coding benchmark reversal: a failing grade on the surface, but 63.1% of tasks were never submitted; factor that in and its score is double Claude's
36Kr · 2025-09-22 11:39
Core Insights
- Scale AI's new software engineering benchmark, SWE-BENCH PRO, shows that leading models such as GPT-5, Claude Opus 4.1, and Gemini 2.5 have low resolution rates, with none exceeding 25% [1][11]
- The benchmark is markedly harder than its predecessor, SWE-Bench-Verified, on which average accuracy reached 70% [4][11]
- By using previously unseen tasks, the new benchmark aims to eliminate data contamination and better reflect real-world software engineering challenges [4][7]

Benchmark Details
- SWE-BENCH PRO comprises 1,865 problems drawn from diverse repositories, divided into three subsets: public, commercial, and held-out [7]
- The public subset consists of 731 problems from 11 public repositories, while the commercial subset contains 276 problems sourced from startup codebases [7]
- The benchmark excludes trivial edits and focuses on complex tasks requiring multi-file modifications, which raises the rigor of the assessment [7][4]

Testing Methodology
- The evaluation incorporates a "human in the loop" stage that augments problem statements with additional context and requirements [8][9]
- Each task is assessed in a containerized environment, so models are tested under controlled, reproducible conditions [10]
- Submitted patches are verified with fail2pass tests (the reported issue is actually resolved) and pass2pass tests (existing functionality still works) [10]; a minimal harness sketch follows this summary

Model Performance
- Resolution rates for the top models: GPT-5 at 23.3%, Claude Opus 4.1 at 22.7%, and Gemini 2.5 at 13.5% [13][14]
- Even the best-performing models scored below 20% on the commercial subset, indicating limited ability to handle real-world business problems [13][11]
- Performance varies significantly with programming language and repository [15]

Failure Analysis
- Common failure modes include semantic misunderstanding, syntax errors, and incorrect solutions; notably, GPT-5 has a non-response rate of 63.1%, the figure behind the headline (see the worked example below) [16][17]
- Claude Opus 4.1 struggles chiefly with semantic understanding, while Gemini 2.5's failures are spread evenly across several dimensions [17][16]
- QWEN3 32B, an open-source model, has the highest tool-error rate, underscoring the importance of integrated tool usage for effective performance [17]
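The fail2pass/pass2pass protocol is easy to picture even though the article doesn't reproduce Scale AI's harness. Below is a minimal sketch in Python, assuming one Docker image per task; the image name, test IDs, and the TASK record are hypothetical illustrations, not artifacts of SWE-BENCH PRO.

```python
import subprocess

# Hypothetical task record; the image name and test IDs are illustrative,
# not taken from SWE-BENCH PRO.
TASK = {
    "image": "swe-pro/task-001",  # one containerized environment per task
    "fail_to_pass": ["tests/test_fix.py::test_issue_resolved"],
    "pass_to_pass": ["tests/test_core.py::test_existing_behavior"],
}

def run_tests(image: str, test_ids: list[str]) -> bool:
    """Run the given pytest IDs inside the task container; True iff all pass."""
    cmd = ["docker", "run", "--rm", image, "python", "-m", "pytest", "-q", *test_ids]
    return subprocess.run(cmd).returncode == 0

def grade(patched_image: str, task: dict) -> bool:
    # fail2pass: these tests failed before the patch and must pass after it,
    # i.e. the model actually resolved the reported issue.
    resolved = run_tests(patched_image, task["fail_to_pass"])
    # pass2pass: these tests passed before the patch and must still pass,
    # i.e. the patch did not break existing functionality.
    preserved = run_tests(patched_image, task["pass_to_pass"])
    return resolved and preserved

if __name__ == "__main__":
    print("resolved:", grade(TASK["image"], TASK))
```

A real harness would additionally apply the model's patch inside the container and time-limit each run; the sketch only shows the two-sided test check that defines resolution.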
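The headline's reversal is plain arithmetic: if GPT-5's 23.3% resolution rate and its 63.1% non-response rate are measured over the same task set, then its success rate on the tasks it actually submitted is 23.3 / (100 − 63.1) ≈ 63%. A quick check of that reading, with both figures taken from the summary above:

```python
# Conditional resolution rate among attempted tasks, assuming the 23.3%
# resolution rate and the 63.1% non-response rate cover the same task set.
resolved_overall = 0.233  # GPT-5 resolution rate over all tasks [13]
no_response = 0.631       # fraction of tasks with no submission [16]

attempted = 1.0 - no_response               # 36.9% of tasks got an answer
per_attempt = resolved_overall / attempted  # ~0.631, i.e. ~63%
print(f"GPT-5 success rate on attempted tasks: {per_attempt:.1%}")
```

On this per-attempt reading, Claude Opus 4.1's corresponding figure would need to be roughly half of GPT-5's for the headline's "double" claim to hold; the summary does not report Claude's non-response rate, so that claim cannot be verified from the numbers given here.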