大语言模型编程评估

Search documents
100行代码打造迷你编程Agent:能修复65%真实项目bug,适配所有大模型
量子位· 2025-07-27 11:57
Core Viewpoint - The article discusses the launch of mini-SWE-agent, a lightweight programming agent that operates with only 100 lines of code, designed to solve 65% of the problems on the SWE-bench benchmark, while being compatible with various language models and easy to deploy locally [2][3][18]. Group 1: Project Overview - mini-SWE-agent is an open-source project developed by the same team behind SWE-bench and SWE-agent, focusing on simplifying the process of code bug fixing in real GitHub projects [2][7]. - The architecture of mini-SWE-agent is significantly simplified, requiring only about 200 lines of code in total, and eliminates complex dependencies [14][10]. - The agent operates using the operating system's Bash environment, allowing it to execute commands without the need for specialized tool interfaces, thus enhancing compatibility with any language model [14][18]. Group 2: Performance and Features - Despite its lightweight design, mini-SWE-agent maintains a performance level comparable to the original SWE-agent, solving approximately 65% of the problems on the SWE-bench validation set [3][18]. - The agent supports various runtime environments, including Docker and other virtualization platforms, facilitating easy deployment across different systems [16][18]. - It includes tools for batch inference and trajectory browsing, aiding users in large-scale evaluation and decision-making processes [18]. Group 3: User Guidance and Applications - mini-SWE-agent is recommended for users seeking quick local execution, simplified control flow, and stable evaluation environments, making it suitable for fine-tuning or reinforcement learning experiments [20]. - For users requiring a highly configurable toolchain and complex state management, the more feature-rich SWE-agent is suggested [20]. - The design philosophy of mini-SWE-agent emphasizes readability, convenience, and ease of expansion, making it accessible for everyday developers [21][22].