AAAI 2026 | AP2O-Coder gives large models an "error notebook," letting them drill problems by type efficiently, the way humans do
机器之心 · 2026-01-14 05:37
Core Insights
- The article introduces Adaptive Progressive Preference Optimization (AP2O) and its framework, AP2O-Coder, which aim to improve code generation and error correction in large language models (LLMs) [3][5][6].

Group 1: Existing Challenges and AP2O-Coder Design
- Current offline preference optimization methods face three main challenges: lack of error-type awareness, insufficient training focus, and weak dynamic adaptation [5][12].
- AP2O-Coder addresses these challenges through a systematic learning process modeled on human error-correction strategies, combining error analysis with targeted optimization [6][8].

Group 2: AP2O-Coder Framework and Mechanism
- The AP2O-Coder framework consists of four key steps: code generation evaluation, error diagnosis analysis, progressive preference optimization, and adaptive error replay [10][11][14].
- Code generation evaluation builds an initial training dataset by generating candidate answers for programming tasks and labeling each as pass or fail [10].
- Error diagnosis analysis uses programming-language-specific tools to identify and categorize errors, producing a structured "error book" for targeted optimization [11].
- Progressive preference optimization corrects errors in a structured order, prioritizing error types according to model size [13].
- Adaptive error replay periodically evaluates model performance and adjusts the training data distribution to focus on current weaknesses [14].

Group 3: Experimental Validation and Results
- The research team validated the method on six mainstream LLMs, achieving performance improvements of 2.8% to 3.4% on the EvalPlus benchmark, even for large models [16][18].
- AP2O-Coder significantly reduced error occurrence rates and improved generalization across models [22][29].
- The method also showed higher sample efficiency, needing only 4% to 60% of the preference data required by traditional methods to reach optimal performance [25].

Group 4: Adaptability of General LLMs
- AP2O-Coder is effective not only for code-specific LLMs but also for adapting general LLMs to coding tasks, as shown by significant performance improvements in models such as Qwen3 and Llama3 [28].
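The four-step loop summarized above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the authors' implementation: the function names, the toy error taxonomy, and the frequency-based replay weighting are all assumptions made here to show how an "error book" might drive the data distribution.

```python
from collections import Counter

def diagnose(samples):
    """Step 2 (error diagnosis): build an 'error book' mapping each
    error type to the failing samples that exhibit it."""
    book = {}
    for s in samples:
        if not s["passed"]:
            book.setdefault(s["error_type"], []).append(s)
    return book

def curriculum(error_book):
    """Step 3 (progressive optimization): order error types by frequency,
    most common first, so training tackles one error type at a time."""
    counts = Counter({etype: len(v) for etype, v in error_book.items()})
    return [etype for etype, _ in counts.most_common()]

def replay_weights(error_book):
    """Step 4 (adaptive error replay): re-weight training data toward
    the error types that currently dominate the model's failures."""
    n = sum(len(v) for v in error_book.values())
    return {etype: len(v) / n for etype, v in error_book.items()}

# Step 1 (generation evaluation) would produce pass/fail-labeled
# candidates; here we use a hand-written toy batch.
samples = [
    {"passed": False, "error_type": "SyntaxError"},
    {"passed": False, "error_type": "AssertionError"},
    {"passed": False, "error_type": "SyntaxError"},
    {"passed": True,  "error_type": None},
]
book = diagnose(samples)
order = curriculum(book)       # ['SyntaxError', 'AssertionError']
weights = replay_weights(book)  # {'SyntaxError': 2/3, 'AssertionError': 1/3}
```

In a real pipeline, the re-weighted failing samples would be paired with passing ones to form preference pairs for an offline optimizer such as DPO, and the diagnose/replay cycle would repeat after each evaluation round.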
Generative AI Empowering Requirements Engineering: A Transformation Underway
机器之心 · 2025-11-27 12:13
Core Insights
- The article presents a systematic literature review of Generative AI (GenAI) in Requirements Engineering (RE), highlighting its transformative potential and the challenges that must be addressed for effective industrial adoption [4][51].

Research Growth
- Research on GenAI in RE has grown exponentially: relevant papers rose from 4 in 2022 to 23 in 2023, and are projected to reach 113 in 2024 [10][8].
- A total of 238 papers were reviewed, indicating strong academic interest following the release of ChatGPT [8][10].

Research Focus Imbalance
- Research is heavily skewed toward certain phases of RE: 30% addresses requirements analysis while only 6.8% addresses requirements management, reflecting a lack of attention to complex socio-technical factors [11][9].
- GenAI in RE is in a "rapid expansion but immature" phase, with a sharp rise in quantity but insufficient depth of research [14].

Technical Landscape
- Studies rely heavily on the GPT model family (67.3%), which limits exploration of diverse technological paths [16].
- GPT-4 is used mainly for complex requirements analysis, while open-source alternatives such as CodeLlama remain underutilized despite their lower hallucination rates [17][16].

Challenges Identified
- Three core challenges are identified: reproducibility (66.8%), hallucination (63.4%), and interpretability (57.1%); they are interrelated and must be addressed together [30][31].
- The lack of reproducibility is particularly problematic given the stochastic nature of large language models (LLMs) and their opaque APIs [30].

Evaluation Practices
- Standardized evaluation metrics are notably lacking in the RE field: only 23.9% of studies release tools, and 45.8% use non-public datasets [35][37].
- Traditional NLP metrics dominate the evaluation methods and fail to capture the complexity of RE tasks [33].

Industrial Adoption
- Industrial adoption of GenAI in RE lags: 90.3% of studies remain at the conceptual or prototype stage, and only 1.3% achieve production-level integration [39][41].
- Industry sees GenAI's value in accelerating requirements documentation and reducing communication costs, but companies hesitate over compliance and risk-control concerns [43].

Future Roadmap
- A four-phase strategy is proposed for advancing GenAI in RE: strengthening evaluation infrastructure, governance-aware development, scalable context-aware deployment, and industrial-level standardization [46].
- Key areas for improvement include generalization capabilities, data quality, and evaluation methods [45].

Recommendations for Researchers and Practitioners
- Researchers are encouraged to explore diverse models beyond GPT, develop RE-specific hybrid architectures, and focus on reproducibility [53].
- Practitioners should use GenAI as an auxiliary tool rather than a decision-maker, especially in low-risk tasks [53].