The Real Face of AI Coding: Only a 27% Pass Rate on Complete Projects
36Kr · 2026-02-09 11:29

Core Insights
- A research team from multiple universities has developed ProjDevBench, the first benchmark to evaluate AI programming agents' end-to-end project development capabilities based solely on natural language requirements [2][4][18]
- The overall acceptance rate (AC rate) across six mainstream programming agents is only 27.38%, indicating a significant drop in performance when agents must build projects from scratch [2][10][11]

Benchmark Development
- ProjDevBench fills a gap left by existing benchmarks, which focus on function-level code generation or issue fixing, by emphasizing comprehensive software engineering skills [3][4]
- The benchmark requires agents to autonomously complete the entire process from architecture design to multi-file coding without any initial code templates [4][18]

Evaluation Methodology
- A dual evaluation mechanism is employed: an online judging (OJ) system for strict black-box testing (80% weight) and a code review process (20% weight) that captures issues not detectable by tests alone [7][18]
- The OJ system provides detailed diagnostic feedback, which is crucial for assessing end-to-end development capabilities [5][7]

Task Design and Challenges
- The benchmark includes 20 high-difficulty programming tasks selected from a pool of approximately 2,800 candidates, focusing on multi-file implementations and project-level work [8][9]
- Two task modes are defined: Easy mode (with a codebase) and Hard mode (without a codebase); performance declines drastically in the latter [9][11]

Performance Analysis
- Agent performance drops sharply in the transition from Easy to Hard tasks, highlighting proficiency in code completion but a lack of macro-level architecture design skills [11][12]
- Completing a task requires an average of 138 tool calls, and the most complex tasks take over two hours [9][10]

Failure Modes
- A systematic analysis reveals that agents often generate syntactically correct
code but miss critical business logic, leading to high rates of incorrect submissions [13][14]
- Common issues include poor handling of edge cases, lack of time-complexity optimization, and limited resource management [14][15]

Insights on Interaction and Performance
- There is a negative correlation between the number of interactions and performance: agents tend to get stuck in inefficient trial-and-error loops rather than employing deep reasoning [15]
- Increasing interaction rounds often leads to lower scores, emphasizing the need for more effective feedback utilization [15]

Unique Value of Code Review
- Code reviews reveal agents' misunderstandings of software development workflows, such as version control and adherence to specifications [16]
- These insights indicate that agents view software development primarily as a code generation task rather than a structured workflow [16]

Conclusion and Implications
- ProjDevBench confirms that current AI programming agents are still in the early stages of handling real, complex end-to-end software development tasks [17][18]
- The benchmark provides a standard for evaluating and improving future autonomous software development agents, highlighting the gap between code completion tools and full-fledged software engineers [18]
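The 80/20 split described under Evaluation Methodology amounts to a weighted combination of the two signals. A minimal sketch, assuming both signals are normalized to [0, 1] (the function name, signature, and example inputs are illustrative assumptions, not ProjDevBench's actual scoring API):

```python
def project_score(oj_pass_rate: float, review_score: float,
                  oj_weight: float = 0.8, review_weight: float = 0.2) -> float:
    """Combine black-box OJ results (80%) with a code-review score (20%).

    oj_pass_rate: fraction of hidden OJ test cases passed, in [0, 1].
    review_score: normalized code-review score, in [0, 1], covering issues
                  that black-box tests alone cannot detect (workflow,
                  adherence to specifications, code structure).
    """
    if not (0.0 <= oj_pass_rate <= 1.0 and 0.0 <= review_score <= 1.0):
        raise ValueError("both scores must be normalized to [0, 1]")
    return oj_weight * oj_pass_rate + review_weight * review_score

# Made-up inputs: an agent passing 25% of OJ tests with a 0.37 review
# score would land at 0.8 * 0.25 + 0.2 * 0.37 = 0.274, i.e. roughly the
# 27% range the article reports as the overall AC rate.
print(project_score(0.25, 0.37))
```

The design choice to keep a 20% review weight matters here: two submissions with identical OJ pass rates can still diverge in final score when one ignores the development workflow the review rubric checks.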
