The True Face of AI Coding: Only a 27% Pass Rate on Complete Projects | New Benchmark from Shanghai Jiao Tong University
QbitAI (量子位) · 2026-02-09 08:00

Core Insights
- The article examines the limitations of AI programming agents in building complete software projects from scratch, highlighting a sharp drop in performance on end-to-end project development compared with code-completion tasks [6][18][28].

Group 1: AI Programming Agents' Performance
- A collaborative research team recently introduced ProjDevBench, the first benchmark to evaluate AI programming agents' ability to develop complete software projects from natural-language requirements [5][10].
- The overall acceptance (AC) rate for submissions from six mainstream programming agents was only 27.38%, a drastic decline when moving from code completion to building projects from scratch [7][18].
- The study found that AI agents excel at completing existing code but struggle with high-level architecture design and complex logical reasoning [28].

Group 2: Benchmarking Methodology
- Unlike traditional benchmarks, ProjDevBench requires agents to autonomously carry out the entire development process without any starter code templates, simulating real-world software engineering tasks [10][30].
- Evaluation is dual-track: an online judge (OJ) system performs strict black-box testing (80% weight), and a code review catches issues the OJ cannot detect (20% weight) [13][30].
- The benchmark tasks were carefully selected from roughly 2,800 candidates, focusing on multi-file implementations and complex project-level work [14].

Group 3: Failure Modes and Limitations
- Analysis of the submissions revealed several recurring failure modes, including misunderstood specifications, weak boundary-case handling, and missing time-complexity analysis [21][22].
- Agents often generated syntactically correct code that omitted critical business logic, indicating a gap in requirement understanding [21].
- The study found a negative correlation between the number of interactions and performance, suggesting that agents tend to get stuck in inefficient trial-and-error loops rather than engaging in deep reasoning [23][25].

Group 4: Future Directions
- The findings underscore the need for future research to bridge the gap between code-completion tools and fully autonomous software engineering capability [30].
- The benchmark currently includes only 20 tasks, primarily in C++, with plans to expand to other programming languages and task types [29].
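The dual-track scoring described in Group 2 can be sketched as a simple weighted combination. This is a minimal illustration, assuming both scores are normalized to [0, 1]; the function name and score representation are my assumptions, as the article only specifies the 80%/20% weighting between OJ testing and code review.

```python
OJ_WEIGHT = 0.8      # strict black-box online-judge testing
REVIEW_WEIGHT = 0.2  # code review for issues the OJ cannot catch

def combined_score(oj_pass_rate: float, review_score: float) -> float:
    """Weighted combination of OJ pass rate and code-review score, both in [0, 1]."""
    return OJ_WEIGHT * oj_pass_rate + REVIEW_WEIGHT * review_score

# Example: an agent passing 30% of hidden tests with a 0.5 review score
print(round(combined_score(0.30, 0.5), 3))  # 0.34
```

Under this weighting, an agent cannot reach a high combined score through review quality alone; passing the hidden black-box tests dominates the result.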
