Workflow
代码智能
icon
Search documents
Agentic Coding表现创新高,全新KAT系列模型上榜SWE-Bench
机器之心· 2025-09-26 10:35
图 近期,快手 Kwaipilot 团队推出了 KAT 系列两款突破性 Agentic Coding 大模型 : 开源 32B 参数模型 KAT-Dev-32B 与 闭源旗舰模型 KAT-Coder 。 这两款模型在 Code Intelligence 领域分别体现出轻量级的超强表现和极致性能。其中,在 SWE-Bench Verified 上,KAT-Dev-32B 展现出强劲性能并取得了 62.4% 的 解决率,在所有不同规模的开源模型中排名第 5。与此同时,KAT-Coder 以 73.4% 的解决率在 SWE-Bench Verified 上取得了极佳的单模型表现,比肩全球顶尖闭源 模型。 核心贡献点摘要 KAT-Dev-32B 和 KAT-Coder 在多个训练阶段进行了创新和优化,包括 Mid-Training 阶段、监督微调 (SFT) 阶段、强化微调 (RFT) 阶段,以及大规模智能体强化学 习 (RL) 阶段,具体如下: KAT 系列模型的核心技术路线 一、Mid-Training Kwaipilot 团队对经过预训练的模型进行了两阶段训练,该阶段被称为 Mid-Training。在其中的第 ...
从Debugger到Developer : 低代码时代新基准NoCode-bench,SWE-Bench作者力荐
机器之心· 2025-08-08 07:53
Core Insights - The article discusses the introduction of a new benchmark called NoCode-bench, aimed at evaluating the capabilities of large language models (LLMs) in natural language-driven feature addition tasks in software development [3][27]. - Current LLMs show a low success rate of only 20% in performing these tasks, highlighting significant challenges in AI's ability to handle real-world software development scenarios [3][26]. Group 1: Benchmark Development - NoCode-bench was developed to address the limitations of existing benchmarks like SWE-bench, which primarily focus on bug fixing rather than feature addition [6][27]. - The benchmark emphasizes the importance of understanding software documentation changes to implement new features, reflecting a more realistic development environment [6][27]. - The construction of NoCode-bench involved a rigorous five-phase process, starting from selecting well-maintained open-source projects to filtering instances based on developer-verified release notes [8][10][16]. Group 2: Challenges Identified - The tasks in NoCode-bench present three main challenges: 1. Increased complexity of input, with document changes being nearly twice as long as bug reports, requiring better long-text comprehension [12]. 2. Difficulty in locating changes, as tasks often involve multiple files and code blocks, demanding high cross-file editing capabilities [13]. 3. Greater editing volume, with nearly 20% of tasks requiring modifications of over 200 lines of code, increasing the risk of errors [14]. Group 3: Model Performance Evaluation - A comprehensive evaluation of six leading LLMs, including Claude-4-Sonnet and GPT-4o, revealed disappointing success rates, with the best-performing model achieving only 15.79% success [18][26]. - The analysis of failure cases identified three primary reasons for poor performance: lack of cross-file editing ability, insufficient understanding of codebase structure, and inadequate tool invocation capabilities [20][21][22]. Group 4: Future Directions - The research indicates that the current state of LLMs is not ready for the complexities of document-driven feature development, suggesting a need for further advancements in AI capabilities [24][27]. - The findings provide a roadmap for future AI software engineers, focusing on improving cross-file editing, codebase comprehension, and tool interaction [27].