Robust Reinforcement Learning
Robust reinforcement learning empowers AI coding! Solving the enterprise data-noise problem, training better models with the same compute | Shanghai Jiao Tong University & Tencent CodeBuddy
量子位 (QbitAI) · 2026-02-16 11:00
Core Insights
- The article introduces the Group Adaptive Policy Optimization (GAPO) method, which significantly improves the accuracy and efficiency of code large language models (LLMs) on real-world editing tasks by filtering out noise and outliers during training [3][12]

Group 1: Challenges in Code Editing
- The integration of AI into programming has made LLMs ubiquitous in code editing, debugging, and optimization, but real user environments introduce complexities that produce frequent outlier outputs and inaccurate advantage estimates [3][4]
- Real-world code editing tasks involve complex contextual information, including module call relationships, edit history, and vague user requirements, which complicate the model's understanding and increase output uncertainty [4][8]
- Input prompts for code editing tasks range from 1,925 to 24,883 characters, with output lengths varying from 36 to 833 characters across multiple programming languages [6][7]

Group 2: Noise and Advantage Estimation Issues
- Rollout noise in real data distorts advantage estimates, which can misguide the reinforcement learning (RL) process and degrade the model over time [9][12]
- Traditional RL methods use the group mean as the baseline for advantage estimation; the mean is sensitive to outliers, so skewed reward distributions can misrepresent the model's performance [10][11]

Group 3: GAPO Methodology
- GAPO addresses the core noise and advantage-estimation problems by optimizing only the advantage calculation, leaving the existing RL framework untouched, which makes it a plug-and-play solution [13][19]
- The method first identifies a high signal-to-noise region of the reward distribution, using a sliding-window algorithm to find the narrowest interval covering a specified proportion of reward points, thereby excluding outliers [13][16]
- Instead of the mean, GAPO uses the median within the identified high-density interval, providing a more stable baseline for advantage estimation and reducing sensitivity to outliers [17][18]

Group 4: Performance Validation
- GAPO delivered significant improvements in advantage estimation and model accuracy across nine mainstream LLMs; the Qwen2.5-Coder-14B model reached an exact-match accuracy of 46.25%, a gain of 4.35 percentage points over the GRPO method [20][21]
- In cross-domain scenarios, the Qwen2.5-Coder-7B model gained 5.30 percentage points in accuracy on the zeta dataset, highlighting effective handling of advantage-estimation distortion [22]
- GAPO also yields more stable training and better utilization of compute, allowing enterprises to obtain better training outcomes from complex real-world data without incurring additional computational cost [27][30]

Group 5: Conclusion and Future Implications
- The GAPO research turns the challenge of real-world data from a burden into a valuable asset for improving model performance, offering enterprises a practical path to more efficient AI-assisted programming [28]
- The GAPO code has been open-sourced, inviting further exploration and collaboration among researchers and developers toward integrating AI more deeply into the software development process [31]
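The sliding-window and median steps summarized in Group 3 can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the paper's exact formulation: the function name `gapo_advantages`, the `coverage` parameter, and the absence of any normalization step are all hypothetical.

```python
import math
import statistics

def gapo_advantages(rewards, coverage=0.75):
    """Sketch of GAPO-style robust advantage estimation (details assumed).

    1. Sort the group's rollout rewards and slide a window that must
       cover `coverage` of the points; the narrowest such window is the
       highest-density (high signal-to-noise) interval.
    2. Use the median inside that interval as the baseline instead of
       the group mean, so outlier rollouts cannot skew the advantages.
    """
    n = len(rewards)
    k = max(1, math.ceil(coverage * n))      # points the window must cover
    s = sorted(rewards)
    # index of the narrowest interval spanning k consecutive sorted points
    best_lo = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    dense = s[best_lo:best_lo + k]           # high-density reward region
    baseline = statistics.median(dense)      # robust baseline
    return [r - baseline for r in rewards]

# One outlier rollout (reward 10.0) would drag the group mean to 2.8;
# the dense-interval median stays at 1.0, so ordinary rollouts keep
# near-zero advantages instead of being pushed strongly negative.
adv = gapo_advantages([0.9, 1.0, 1.1, 1.0, 10.0], coverage=0.75)
```

With the group mean as baseline, every non-outlier sample in this example would receive a large negative advantage; with the median of the dense interval, only the outlier stands out, which matches the stability argument made in Groups 2 and 3.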
Latest from Westlake University! RobustVLA: a robustness-aware reinforcement post-training method for VLA models (outperforms SOTA approaches)
具身智能之心 · 2025-11-08 04:00
Core Insights
- The article presents RobustVLA, a lightweight online reinforcement learning post-training method aimed at enhancing the robustness of Vision-Language-Action (VLA) models in the face of environmental uncertainty [1][5][20]
- It highlights the limitations of existing methods that focus on reward maximization without addressing the model's sensitivity to disturbances, which can cause significant performance drops in real-world scenarios [5][20]

Design Logic of RobustVLA
- RobustVLA incorporates two key regularization terms: a Jacobian regularizer that reduces sensitivity to observation noise, and a smoothness regularizer that stabilizes the policy under action disturbances [4][7][8]
- The method frames robustness-aware reinforcement learning post-training as a critical step toward improving the reliability of VLA models [1][5]

Robustness Analysis
- The article gives a theoretical robustness analysis, establishing error-amplification bounds, reward-drift control, and robust-stability guarantees [4][11][18]
- It shows that Jacobian sensitivity directly drives error amplification, so reducing that sensitivity effectively bounds the performance loss [12][18]

Experimental Results
- Under observation perturbations, RobustVLA achieved an average success rate of 82.5%, outperforming prior models such as OpenVLA-OFT and RIPT-VLA [20][21]
- Under action perturbations, RobustVLA achieved an average success rate of 54.8%, exceeding OpenVLA-OFT's 53.5% [22]
- Under combined disturbances, RobustVLA-C achieved an average success rate of 82.1%, demonstrating the synergy of autonomous interaction and dual regularization [23]

Transfer Learning and Ablation Studies
- Transfer-learning experiments showed RobustVLA improving out-of-distribution adaptability by 8.0% and 16.0% on specific tasks compared to zero-shot transfer [25]
- Ablation studies confirmed that removing either the Jacobian or the smoothness regularizer degrades performance, underscoring that both regularization strategies are necessary for robustness [27]
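The two regularizers described above can be sketched with finite differences on a toy policy. Everything here is an illustrative assumption: RobustVLA's actual terms operate on a VLA policy network, and the names `toy_policy`, `jacobian_penalty`, and `smoothness_penalty`, the linear stand-in policy, and the noise scale are ours, not the paper's.

```python
import numpy as np

def toy_policy(obs, W):
    """Stand-in linear policy: action = W @ obs (purely illustrative)."""
    return W @ obs

def jacobian_penalty(policy, obs, W, eps=1e-4):
    """Finite-difference estimate of ||d action / d obs||_F^2.

    Penalizing this term discourages sensitivity to observation noise,
    the role the article attributes to Jacobian regularization.
    """
    base = policy(obs, W)
    cols = [(policy(obs + eps * e, W) - base) / eps for e in np.eye(obs.size)]
    J = np.stack(cols, axis=1)               # shape (action_dim, obs_dim)
    return float(np.sum(J ** 2))

def smoothness_penalty(policy, obs, W, sigma=0.01, n_samples=8, seed=0):
    """Mean squared action change under small input noise.

    Penalizing this term keeps the policy's actions stable under
    disturbances, mirroring the smoothness regularizer's role.
    """
    rng = np.random.default_rng(seed)
    base = policy(obs, W)
    diffs = [
        np.sum((policy(obs + rng.normal(0.0, sigma, obs.shape), W) - base) ** 2)
        for _ in range(n_samples)
    ]
    return float(np.mean(diffs))

# A robustness-aware objective would subtract both penalties from the
# RL reward, e.g.  reward - lam1 * jacobian_penalty(...) - lam2 * smoothness_penalty(...)
W = np.array([[1.0, 2.0], [3.0, 4.0]])
obs = np.array([1.0, 1.0])
jp = jacobian_penalty(toy_policy, obs, W)    # ~= ||W||_F^2 = 30 for a linear policy
sp = smoothness_penalty(toy_policy, obs, W)
```

For the linear stand-in the Jacobian is just `W`, so the penalty reduces to the squared Frobenius norm; the same finite-difference estimator applies unchanged to a nonlinear policy, which is why it serves as a reasonable sketch of the idea.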