Does OpenClaw code break more the more it is changed? New study EvoClaw reveals: Agents' success rate in continuous development is only 13.37%
量子位 (QbitAI) · 2026-03-25 04:58

Core Insights

- By the end of 2025, AI programming had shifted from an auxiliary-tool model like Copilot to an Agent era in which AI leads development under human oversight [1]
- The emergence of OpenClaw in early 2026 pushed Agents from executing single tasks toward long-term operational systems, which must continuously iterate on software interfaces based on real-world interactions [2]

Group 1: AI Programming Evaluation

- Current top models handle isolated tasks such as writing functions or fixing bugs well, but struggle significantly in continuous software-evolution scenarios, where scores drop from above 80% to below 40% [6]
- Existing AI programming benchmarks tend to overestimate Coding Agents by focusing on independent tasks rather than on the continuous evolution of software, which is a persistent process [8][10]
- The EvoClaw benchmark introduces a new evaluation paradigm that requires AI to autonomously execute multiple interdependent tasks within the same codebase, exposing how fragile performance becomes across continuous iterations [10]

Group 2: EvoClaw Benchmark Design

- EvoClaw assesses the ability to handle software evolution through a milestone-based approach that aggregates code submissions into cohesive units while preserving task dependencies [17]
- Evaluation metrics include Recall (completeness of the implemented functionality) and Precision (reliability of the modifications), combined into an overall score via F1 weighting [29][31]
- The dataset spans five major programming languages and covers real development cycles across multiple release intervals, ensuring a comprehensive assessment of AI capabilities [27]

Group 3: Performance Analysis

- In continuous-evaluation scenarios, even top models such as Claude Opus 4.6 reach a maximum score of only 38.03%, a sharp drop compared with independent evaluations [34]
- While Recall continues to grow, Precision quickly saturates, so performance stagnates as task complexity increases [42]
- Even with unlimited iteration opportunities, models eventually hit a performance ceiling and cannot fully resolve all tasks, owing to accumulated technical debt [40][44]

Group 4: Future Directions

- Current models behave more like on-demand code generators than comprehensive engineering solutions, lacking the ability to proactively manage technical debt and overall project governance [54]
- Models differentiate clearly: some, such as the GPT and Claude series, show steady improvement in continuous-evolution capability, while others, such as the Gemini series, struggle to sustain performance [54]
- The future of AI programming lies in moving from passive code generation to active restructuring and long-term planning, so that AI can function as a seasoned engineer with a holistic view of the project [54]
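The interaction between Recall and Precision described above can be made concrete with a small sketch. The article does not give EvoClaw's exact weighting, so this assumes the standard harmonic-mean F1; the numeric inputs are purely illustrative, not figures from the study.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (standard F1 weighting).

    recall:    fraction of required functionality the agent implemented.
    precision: fraction of the agent's modifications that are reliable,
               i.e. do not break existing behaviour.
    """
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Hypothetical numbers: once precision saturates at a low value,
# further recall gains barely move the combined score, which mirrors
# the stagnation the study reports.
print(f1_score(recall=0.60, precision=0.45))
print(f1_score(recall=0.90, precision=0.45))
```

Because the harmonic mean is dominated by the smaller term, a model that keeps adding functionality (rising Recall) while its change reliability plateaus (flat Precision) sees its overall score level off, which is one way to read the reported performance ceiling.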
