Agentic coding performance reaches a new high: new KAT series models enter the SWE-Bench leaderboard
机器之心· 2025-09-26 10:35
Core Insights
- The article covers the launch of two new models in the code-intelligence field by the Kuaipilot team: the open-source 32B-parameter model KAT-Dev-32B and the closed-source flagship model KAT-Coder, highlighting their strong performance on coding tasks [2][26].

Model Performance
- KAT-Dev-32B achieved a 62.4% solve rate on SWE-Bench Verified, ranking 5th among open-source models of all sizes [2].
- KAT-Coder reached an impressive 73.4% solve rate on the same benchmark, comparable to the top closed-source models worldwide [2][11].

Model Accessibility
- KAT-Dev-32B is available on the Hugging Face platform for further research and development [7].
- API keys for KAT-Coder can be requested on the "Kuaishou Wanqing" enterprise-grade model service and development platform, allowing users to plug the model into coding tools directly [7].

Training Innovations
- The KAT series models went through several innovative training phases: Mid-Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and large-scale agentic reinforcement learning (RL) [9][12].
- Mid-Training focused on strengthening "LLM-as-Agent" capabilities: tool use, multi-turn interaction, and instruction following [10][12].
- SFT collected real demand-delivery trajectories annotated by human engineers to strengthen end-to-end delivery capability [13].
- RFT introduced ground truth to guide trajectory exploration, improving the efficiency and stability of the subsequent reinforcement learning phase [15].

Advanced Techniques
- The team implemented entropy-based tree pruning to learn efficiently from non-linear trajectory histories, maximizing throughput while minimizing cost [19].
- The SeamlessFlow framework was developed to manage trajectory trees and sustain high-throughput training by decoupling RL training from the agent's internal logic [21][22].
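The article names entropy-based tree pruning but gives no implementation details. The sketch below is one plausible reading, not the Kuaipilot team's actual method: trajectories sharing a prefix form a tree, each branch carries the token log-probs the policy assigned, and branches the policy already predicts confidently (low entropy, little gradient signal) are dropped to raise training throughput. `TrajNode` and the entropy proxy are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class TrajNode:
    """A node in a shared-prefix trajectory tree (hypothetical structure)."""
    token_logprobs: list                      # log-probs of tokens on the edge into this node
    children: list = field(default_factory=list)

def mean_entropy(node):
    """Approximate per-token entropy by the mean surprisal (negative log-prob)."""
    if not node.token_logprobs:
        return 0.0
    return -sum(node.token_logprobs) / len(node.token_logprobs)

def prune(node, threshold):
    """Drop low-entropy subtrees: branches the policy already predicts
    confidently contribute little learning signal, so skipping them
    raises throughput (one possible reading of the technique)."""
    node.children = [c for c in node.children if mean_entropy(c) >= threshold]
    for c in node.children:
        prune(c, threshold)
    return node

# A confident branch (near-zero surprisal) is pruned; an uncertain one is kept.
root = TrajNode([], [TrajNode([-0.01, -0.02]), TrajNode([-2.0, -3.0])])
prune(root, 1.0)
```

The threshold would in practice be tuned against throughput and learning-signal trade-offs; here it is just an illustrative constant.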
Emergent Capabilities
- Post-training analysis revealed two notable emergent phenomena: dialogue rounds dropped by 32% relative to the SFT model, and the model learned to call multiple tools in parallel [33][35].
- Both the efficiency preference and the parallel-calling ability are attributed to implicit optimization pressure from the trajectory-tree structure [33].

Future Prospects
- The Kuaipilot team aims to keep exploring the frontier of code intelligence, including deeper tool integration, broader language support, and collaborative coding systems [35].
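The parallel tool-calling behaviour described above can be pictured with a minimal executor sketch: when a model emits several independent tool calls in one turn, the harness runs them concurrently instead of round-tripping one at a time. The tool names and registry here are illustrative inventions, not KAT's actual tool set.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tool registry; names are illustrative, not KAT-Coder's real tools.
TOOLS = {
    "read_file": lambda path: f"contents of {path}",
    "grep": lambda pattern: f"matches for {pattern}",
}

def run_parallel(calls):
    """Execute one turn's batch of tool calls concurrently, preserving
    call order in the returned results."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(TOOLS[name], arg) for name, arg in calls]
        return [f.result() for f in futures]

# A single model turn requesting two independent tools at once.
results = run_parallel([("read_file", "a.py"), ("grep", "TODO")])
```

Fewer round trips per delivered task is exactly the efficiency pressure the article attributes to the trajectory-tree training signal.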
From Debugger to Developer: NoCode-bench, a new benchmark for the low-code era, endorsed by the SWE-Bench authors
机器之心· 2025-08-08 07:53
Core Insights
- The article introduces NoCode-bench, a new benchmark for evaluating large language models (LLMs) on natural-language-driven feature-addition tasks in software development [3][27].
- Current LLMs succeed on only about 20% of these tasks, underscoring how far AI remains from handling real-world software development scenarios [3][26].

Group 1: Benchmark Development
- NoCode-bench was developed to address the limitations of existing benchmarks such as SWE-bench, which focus mainly on bug fixing rather than feature addition [6][27].
- The benchmark emphasizes understanding software-documentation changes in order to implement new features, reflecting a more realistic development environment [6][27].
- Construction followed a rigorous five-phase process, from selecting well-maintained open-source projects to filtering instances against developer-verified release notes [8][10][16].

Group 2: Challenges Identified
- NoCode-bench tasks present three main challenges:
1. More complex input: documentation changes are nearly twice as long as bug reports, demanding stronger long-text comprehension [12].
2. Harder change localization: tasks often span multiple files and code blocks, requiring strong cross-file editing capabilities [13].
3. Larger edit volume: nearly 20% of tasks require modifying more than 200 lines of code, raising the risk of errors [14].

Group 3: Model Performance Evaluation
- A comprehensive evaluation of six leading LLMs, including Claude-4-Sonnet and GPT-4o, produced disappointing results, with the best model solving only 15.79% of tasks [18][26].
- Failure-case analysis identified three primary causes of poor performance: weak cross-file editing, insufficient understanding of codebase structure, and inadequate tool-invocation capabilities [20][21][22].

Group 4: Future Directions
- The research indicates that current LLMs are not ready for the complexity of document-driven feature development, pointing to the need for further advances in AI capabilities [24][27].
- The findings offer a roadmap for future AI software engineers, focused on improving cross-file editing, codebase comprehension, and tool interaction [27].
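The five-phase construction process mentioned above is not spelled out in this summary; the sketch below only illustrates the filtering idea it implies, keeping candidate version pairs whose release notes are developer-verified, that include a documentation diff to serve as task input, and that ship tests which flip from failing to passing on the gold patch. All field names are hypothetical, not the benchmark's actual schema.

```python
def build_instances(candidates):
    """Filter candidate (repo, version-pair) records into benchmark instances.

    A sketch of the filtering idea only; the real NoCode-bench pipeline
    has five phases and additional checks not summarized here.
    """
    instances = []
    for c in candidates:
        if not c.get("release_note_verified"):   # developer-verified release notes only
            continue
        if not c.get("doc_diff"):                # need documentation changes as task input
            continue
        if not c.get("fail_to_pass_tests"):      # need tests that flip on the gold patch
            continue
        instances.append(c)
    return instances

candidates = [
    {"release_note_verified": True, "doc_diff": "add --dry-run flag", "fail_to_pass_tests": ["test_dry_run"]},
    {"release_note_verified": False, "doc_diff": "typo fix"},
]
kept = build_instances(candidates)
```

Filtering on fail-to-pass tests is what makes the 15.79% solve rate above a mechanical check: a task counts as solved only when the generated patch makes those tests pass.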