Beihang University Leads a 300-Page Code Intelligence Survey: From Foundation Models to Agents, the Full Landscape of Code LLMs in One Read
量子位· 2025-12-05 05:33
Core Insights
- The article discusses a comprehensive survey of the code intelligence field, detailing the evolution of programming paradigms and the development of foundation models, tasks, training methodologies, and industry applications [1][3].

Group 1: Evolution of Programming Paradigms
- The paper outlines a clear evolutionary path from manual coding to AI-assisted collaborative development, in which developers increasingly express intent in natural language and leave the implementation to models [4][6].
- This paradigm shift is more profound than any previous tool upgrade and marks a critical transition in how software is written [7][8].

Group 2: Code Foundation Models
- The paper constructs an overall blueprint for code foundation models, comparing the training pipelines of general LLMs and code-specific models and identifying the core datasets, such as GitHub code, issue discussions, and API documentation, that encode engineering world knowledge [10][12].
- The evolution of model architectures, from CodeBERT and CodeT5 to current designs, reflects ongoing adaptation to the demands of code tasks [11].

Group 3: Code Tasks and Benchmarks
- The evaluation landscape for code models has been fragmented; the paper organizes tasks by granularity, from function-level to engineering-level, with corresponding benchmarks for each tier [14][18].
- HumanEval and MBPP serve as basic indicators, but they only reflect foundational capability; more complex, project-scale tasks are needed to assess real codebase understanding (a minimal function-level evaluation sketch follows this digest) [15][16].

Group 4: Model Alignment and Enhancement
- The paper summarizes methods for model alignment and capability enhancement, with the aim of making models genuinely understand engineering rather than merely generating code-like text [19][20].
- A key aspect is repo-level training, which helps models comprehend module dependencies and project organization and is crucial for stable performance in real scenarios (see the context-assembly sketch after this digest) [22].

Group 5: Software Engineering Agents
- The potential of code intelligence expands when models participate as agents in the software engineering process, moving beyond one-shot code generation to continuous decision-making and real-time use of feedback [27][28].
- The current bottleneck for these agents is not raw model capability but effectively leveraging environmental signals such as test results and tool feedback (see the agent-loop sketch after this digest) [28].

Group 6: Security and Governance
- The paper examines the security issues of code models, categorizing risks into data security, model security, and execution security, and pairing them with governance measures such as data auditing and static/dynamic testing [34][35].

Group 7: Training Methodologies
- The latter part of the paper distills practical training experience into a systematic methodology for building code models, offering a reference for teams preparing to train large code models [36][40].

Group 8: Accelerating Applications
- The paper concludes by highlighting how code models are being integrated into key software engineering processes such as IDE plugins, collaborative coding, and automated testing [41][42].
- Software engineering is likely to evolve toward intention-driven, collaborative coding, with models playing an increasingly significant role [43].
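To make the function-level evaluation mentioned in Group 3 concrete, below is a minimal sketch of a HumanEval-style harness: it runs a model completion against unit tests in a subprocess and reports pass@1 on a single toy problem. The problem record and the stand-in `generate` function are hypothetical placeholders for illustration, not the actual benchmark data or any model API.

```python
"""Minimal sketch of a HumanEval-style, function-level evaluation loop.

The single PROBLEM below and the stand-in `generate` function are
hypothetical placeholders; a real harness would load the benchmark
problems and call an actual code model.
"""
import subprocess
import sys
import tempfile

# Hypothetical problem in a HumanEval-like layout: prompt, unit tests, entry point.
PROBLEM = {
    "prompt": "def add(a, b):\n    \"\"\"Return the sum of a and b.\"\"\"\n",
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n    assert candidate(-1, 1) == 0\n",
    "entry_point": "add",
}

def generate(prompt: str) -> str:
    """Stand-in for a code model call; returns a completion for the prompt."""
    return "    return a + b\n"

def run_one(problem: dict, completion: str, timeout: float = 5.0) -> bool:
    """Execute prompt + completion against the unit tests in a fresh subprocess."""
    program = (
        problem["prompt"]
        + completion
        + "\n"
        + problem["test"]
        + f"\ncheck({problem['entry_point']})\n"
    )
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

if __name__ == "__main__":
    completion = generate(PROBLEM["prompt"])
    passed = run_one(PROBLEM, completion)
    print(f"pass@1 on this toy problem: {1.0 if passed else 0.0}")
```

Running each candidate in a separate subprocess with a timeout is the usual precaution against generated code that hangs or crashes the harness.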
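The repo-level point in Group 4 can be illustrated with a small context-assembly sketch: before asking a model to edit one file, gather the repo-local modules it imports so the prompt reflects module dependencies and project layout. The repo root, the target file name, and the "sibling .py file" heuristic are assumptions for illustration; real pipelines use richer dependency analysis and retrieval.

```python
"""Minimal sketch of repo-level context assembly, assuming a flat repo of
.py files. The target file name and dependency heuristic are hypothetical.
"""
import ast
from pathlib import Path

def local_imports(file_path: Path, repo_root: Path) -> list[Path]:
    """Return repo-local .py files imported by `file_path` (top-level names only)."""
    tree = ast.parse(file_path.read_text())
    deps = []
    for node in ast.walk(tree):
        names = []
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        for name in names:
            # Assumed heuristic: a local module is a sibling .py file at the repo root.
            candidate = repo_root / (name.split(".")[0] + ".py")
            if candidate.exists() and candidate not in deps:
                deps.append(candidate)
    return deps

def build_context(target: Path, repo_root: Path, budget_chars: int = 8000) -> str:
    """Concatenate dependency files plus the target file into one prompt context."""
    parts = []
    for dep in local_imports(target, repo_root):
        parts.append(f"# --- {dep.relative_to(repo_root)} ---\n{dep.read_text()}")
    parts.append(
        f"# --- {target.relative_to(repo_root)} (file to edit) ---\n{target.read_text()}"
    )
    # Truncate to a fixed character budget as a crude stand-in for token budgeting.
    return "\n\n".join(parts)[:budget_chars]

if __name__ == "__main__":
    root = Path(".")                                  # assumed repo root
    context = build_context(root / "app.py", root)    # hypothetical target file
    print(context[:500])
```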
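For the agent loop described in Group 5, here is a minimal sketch of how environmental signals (test output) can be folded back into the next model call. `call_model` and `apply_patch` are hypothetical stubs to be wired to a real LLM endpoint and a patch tool; only the pytest invocation is a real command, and it assumes pytest is installed in the target project.

```python
"""Minimal sketch of a generate-patch / run-tests agent loop, assuming a
checked-out project with a pytest suite. `call_model` and `apply_patch`
are hypothetical stubs, not a real agent framework.
"""
import subprocess
from dataclasses import dataclass

@dataclass
class StepResult:
    passed: bool
    feedback: str  # raw test/tool output fed back to the model

def run_tests(workdir: str) -> StepResult:
    """Run the project's test suite and capture its output as feedback."""
    proc = subprocess.run(
        ["python", "-m", "pytest", "-x", "-q"],
        cwd=workdir, capture_output=True, text=True,
    )
    return StepResult(proc.returncode == 0, proc.stdout + proc.stderr)

def call_model(task: str, feedback: str) -> str:
    """Hypothetical model call: returns a patch/diff for the next attempt."""
    raise NotImplementedError("wire this to your code LLM endpoint")

def apply_patch(workdir: str, diff: str) -> None:
    """Hypothetical patch application, e.g. via `git apply` in `workdir`."""
    raise NotImplementedError("wire this to your patching mechanism")

def agent_loop(task: str, workdir: str, max_steps: int = 5) -> bool:
    """Iterate until the tests pass or the step budget runs out."""
    feedback = ""
    for _ in range(max_steps):
        diff = call_model(task, feedback)
        apply_patch(workdir, diff)
        result = run_tests(workdir)
        if result.passed:
            return True
        feedback = result.feedback  # the environmental signal for the next step
    return False
```

The stopping rule here is deliberately simple (tests pass or the budget is exhausted); the point of the sketch is that each step's tool output becomes input to the next model call, which is the feedback use the survey identifies as the current bottleneck.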