AAAI 2026 | AP2O-Coder gives LLMs an "error book": drilling problems by type, efficiently, the way humans do
机器之心·2026-01-14 05:37

Core Insights
- The article discusses the Adaptive Progressive Preference Optimization (AP2O) method and its framework, AP2O-Coder, which aim to improve code generation and error correction in large language models (LLMs) [3][5][6].

Group 1: Existing Challenges and AP2O-Coder Design
- Current offline preference optimization methods face three main challenges: lack of error-type awareness, insufficient training focus, and weak dynamic adaptation [5][12].
- AP2O-Coder addresses these challenges with a systematic learning process modeled on human error-correction strategies, combining error analysis with targeted optimization [6][8].

Group 2: AP2O-Coder Framework and Mechanism
- The AP2O-Coder framework consists of four key steps: code generation evaluation, error diagnosis and analysis, progressive preference optimization, and adaptive error replay [10][11][14].
- Code generation evaluation builds the initial training dataset by generating candidate answers for programming tasks and labeling each as pass or fail [10].
- Error diagnosis and analysis uses programming-language-specific tools to identify and categorize errors, producing a structured "error book" for targeted optimization [11].
- Progressive preference optimization corrects errors in a structured order, prioritizing error types according to model size [13].
- Adaptive error replay periodically re-evaluates the model and adjusts the training-data distribution to focus on its current weaknesses [14].

Group 3: Experimental Validation and Results
- The research team conducted systematic validation on six mainstream LLMs, achieving performance improvements of 2.8% to 3.4% on the EvalPlus benchmark, even for large models [16][18].
- AP2O-Coder significantly reduced error occurrence rates and improved generalization across models [22][29].
- The method also improved sample efficiency, requiring only 4% to 60% of the preference data needed by traditional methods to reach optimal performance [25].

Group 4: Adaptability of General LLMs
- AP2O-Coder is effective not only for code-specific LLMs but also for adapting general LLMs to coding tasks, as shown by significant performance improvements in models such as Qwen3 and Llama3 [28].
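The first two framework steps, generating and evaluating candidates and then grouping failures by error type, can be sketched in a few lines. This is a minimal illustration, not the paper's actual diagnosis tooling: `classify_error` and `build_error_book` are hypothetical helpers that simply label each failing candidate with its Python exception name to form the "error book".

```python
import traceback
from collections import defaultdict

def classify_error(code: str, test: str) -> str:
    """Run a candidate solution against a unit test and return its
    error category: "pass", or an exception name such as "SyntaxError",
    "NameError", or "AssertionError" (i.e., wrong output)."""
    env = {}
    try:
        exec(compile(code, "<candidate>", "exec"), env)  # catches syntax errors
        exec(compile(test, "<test>", "exec"), env)       # catches functional errors
        return "pass"
    except Exception as e:
        return type(e).__name__

def build_error_book(candidates, test):
    """Group failing candidates by error type: the 'error book'."""
    book = defaultdict(list)
    for code in candidates:
        label = classify_error(code, test)
        if label != "pass":
            book[label].append(code)
    return book

candidates = [
    "def add(a, b): return a + b",   # correct -> pass
    "def add(a, b): return a - b",   # wrong logic -> AssertionError
    "def add(a, b) return a + b",    # missing colon -> SyntaxError
    "def add(a, b): return a + c",   # undefined name -> NameError
]
test = "assert add(2, 3) == 5"
book = build_error_book(candidates, test)
```

The resulting `book` maps each error type to the failing samples of that type, which is the structure the later optimization steps operate on.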
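The progressive preference optimization step can be approximated as a staged curriculum over the error book. Two caveats: the article only says the ordering of error types depends on model size, so ranking by frequency here is an assumption, and `progressive_stages` is a hypothetical helper. Each failing sample is paired with a passing one to form a `(chosen, rejected)` pair, the format consumed by DPO-style offline preference optimization.

```python
def progressive_stages(error_book, passing_example, common_first=True):
    """Yield preference-training stages one error type at a time.

    Sketch only: rank error types by frequency (an assumed proxy for
    priority; the paper ties the actual ordering to model size), then
    emit (chosen, rejected) preference pairs for each type in turn.
    """
    order = sorted(error_book, key=lambda t: len(error_book[t]),
                   reverse=common_first)
    for error_type in order:
        pairs = [(passing_example, rejected)
                 for rejected in error_book[error_type]]
        yield error_type, pairs

# Toy usage: two syntax errors and one name error.
book = {"SyntaxError": ["bad1", "bad2"], "NameError": ["bad3"]}
stages = list(progressive_stages(book, "good"))
```

With `common_first=True`, the first stage targets the most frequent error type, so training focus follows the model's dominant failure mode rather than mixing all error types at once.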
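Adaptive error replay, the fourth step, amounts to re-evaluating the model and reweighting the training distribution toward error types it still makes. The proportional-to-counts rule below is an assumption standing in for the paper's actual schedule, and both function names are hypothetical.

```python
import random

def replay_distribution(latest_error_counts):
    """Turn the latest per-type error counts into sampling weights:
    error types the model still gets wrong are replayed more often."""
    total = sum(latest_error_counts.values())
    return {t: n / total for t, n in latest_error_counts.items()}

def sample_replay_batch(error_book, weights, k, rng=None):
    """Draw k rejected samples from the error book according to the
    current weakness distribution."""
    rng = rng or random.Random(0)
    types = list(weights)
    picked = rng.choices(types, weights=[weights[t] for t in types], k=k)
    return [rng.choice(error_book[t]) for t in picked]

# After re-evaluation, assertion failures now dominate, so they get
# three times the replay probability of syntax errors.
counts = {"SyntaxError": 1, "AssertionError": 3}
weights = replay_distribution(counts)
batch = sample_replay_batch(
    {"SyntaxError": ["s1"], "AssertionError": ["a1", "a2"]}, weights, k=4)
```

Repeating this evaluate-reweight-sample loop each round is what keeps the preference data focused on current weaknesses instead of a fixed offline distribution.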