ICLR 2026 oral | Can AI-written code really make it into production? SwingArena: from "writing a correct commit" to "passing CI review"

Core Insights
- The article discusses the rapid improvement in AI models' ability to write code, with models like GPT, Claude, and DeepSeek generating professional-looking code in seconds [2]
- It raises the question of whether AI can truly participate in the core processes of software engineering, highlighting the complexity of real-world development beyond just writing functional code [3]

Evaluation Mechanism
- SwingArena aims to fill the gap in evaluating AI coding capabilities by introducing a competitive programming environment that emphasizes code review and iteration, rather than just passing unit tests [4][9]
- The evaluation logic shifts from simply writing correct code to ensuring that code can pass through a Continuous Integration (CI) pipeline, which includes automated checks for compilation, testing, and code style [9]

Competitive Framework
- SwingArena incorporates a dual-role system where models act as both "submitters" and "reviewers," engaging in a continuous feedback loop that simulates real-world CI environments [11]
- The final score in this evaluation is determined by the actual execution results, emphasizing the importance of robust code submissions [11]

Contextual Challenges
- The article notes that real project codebases often exceed the context window of large models, necessitating a retrieval-augmented code generation (RACG) pipeline to balance the amount of code provided to the models [15]
- The RACG system employs classic information retrieval methods to narrow down relevant files and uses semantic models for precise code chunking, significantly improving patch localization accuracy [15]

Model Performance Insights
- In evaluations, distinct behavioral differences among models become apparent, with GPT-4o exhibiting aggressive strategies that yield high win rates but lower CI pass rates, while models like DeepSeek and Gemini show more conservative and stable performance [17]
- These findings provide practical insights for selecting models based on project needs, highlighting the trade-off between rapid prototyping and stability in production environments [17]

Significance of SwingArena
- SwingArena represents a shift in evaluation perspective from "functional correctness" to "engineering usability," allowing for systematic assessment of which models are suitable for production environments [19]
- The framework will be open-sourced post-anonymity period, providing researchers and industry professionals with tools to evaluate AI programming capabilities effectively [19][21]
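
To make the two-stage retrieval described under Contextual Challenges more concrete, here is a minimal Python sketch. It is not SwingArena's actual RACG implementation, and every name in it is hypothetical. It assumes that BM25 stands in for the "classic information retrieval" stage, that a bag-of-words cosine similarity stands in for the semantic model (a real pipeline would presumably use learned embeddings), that chunks are fixed-size line windows, and that the context budget is measured in whitespace-style tokens.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into identifier-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Stage 1 (assumed): rank file names by BM25 score against the query."""
    tokenized = {name: tokenize(body) for name, body in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / max(n_docs, 1)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    query_terms = set(tokenize(query))
    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


def cosine(a, b):
    """Bag-of-words cosine similarity; a stand-in for a semantic model."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def split_chunks(body, lines_per_chunk=4):
    """Split a file into fixed-size line windows (a simplifying assumption)."""
    lines = body.splitlines()
    return [
        "\n".join(lines[i : i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


def retrieve_context(issue, repo, top_files=2, token_budget=40):
    """Stage 2 (assumed): rerank chunks from the top files, then pack them
    greedily until the token budget is exhausted."""
    files = bm25_rank(issue, repo)[:top_files]
    query = tokenize(issue)
    candidates = []
    for name in files:
        for chunk in split_chunks(repo[name]):
            candidates.append((cosine(query, tokenize(chunk)), name, chunk))
    candidates.sort(key=lambda c: c[0], reverse=True)
    selected, used = [], 0
    for score, name, chunk in candidates:
        cost = len(tokenize(chunk))
        if score > 0 and used + cost <= token_budget:
            selected.append((name, chunk))
            used += cost
    return selected


# Toy repository and issue text, invented purely for illustration.
repo = {
    "parser.py": "def parse_date(text):\n    return text.split('-')\n",
    "cache.py": "def get_cache(key):\n    return store.get(key)\n",
    "README.md": "Project overview and install notes\n",
}
issue = "parse_date fails on dates with slashes"

for name, chunk in retrieve_context(issue, repo):
    print(f"--- {name} ---\n{chunk}")
```

The budget parameter is where the balance mentioned in the article shows up: a larger `token_budget` gives the model more surrounding code at the cost of context space, while a smaller one forces the reranker to be selective about which chunks reach the prompt.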