Robust Reinforcement Learning
Robust RL empowers AI coding! Solving the enterprise data-noise problem, training better models with the same compute | SJTU & Tencent CodeBuddy
量子位· 2026-02-16 11:00
Contributed by the GAPO team | QbitAI

Programmers may get to keep a little more hair! A new study significantly improves the accuracy and efficiency of code LLMs on real-world editing tasks by filtering out noise and outliers during training.

Now that AI-assisted programming has become a core productivity tool in software development, large language models (LLMs) are deeply embedded in the full workflow of code editing, debugging, and optimization. However, when enterprises try to run reinforcement learning (RL) training on data collected from real, complex user environments, a thorny practical problem surfaces: complex contexts cause the model's outputs to frequently contain anomalous content. Rollout noise becomes more prevalent, producing reward outliers that directly distort advantage estimates and seriously degrade RL performance.

Group Adaptive Policy Optimization (GAPO), jointly proposed by teams from Shanghai Jiao Tong University and Tencent CodeBuddy, targets precisely this bottleneck in industrial deployment, offering a breakthrough that combines research novelty with engineering practicality and has drawn broad attention from both the AI research community and industry.

The core obstruction in real-world scenarios: complex context → rollout noise → distorted advantage estimates

The core difficulty of code editing is that input prompts in real user scenarios are never simple code snippets, ...
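The article does not reproduce GAPO's exact formulation, but the causal chain it describes (reward outliers → distorted group-normalized advantages) can be illustrated with a minimal sketch. The function below is a hypothetical stand-in: it filters reward outliers within a rollout group using a median-absolute-deviation (MAD) test before computing GRPO-style group-normalized advantages; the threshold and the choice to zero out filtered rollouts are illustrative assumptions, not the paper's method.

```python
import statistics

def robust_group_advantages(rewards, z_thresh=3.5):
    """Hypothetical sketch: drop reward outliers in a rollout group
    (modified z-score via MAD), then compute group-normalized advantages."""
    med = statistics.median(rewards)
    mad = statistics.median(abs(r - med) for r in rewards)
    if mad == 0.0:
        mask = [True] * len(rewards)  # no spread: keep everything
    else:
        # 0.6745 rescales MAD to be comparable to a standard deviation
        mask = [abs(0.6745 * (r - med) / mad) <= z_thresh for r in rewards]
    kept = [r for r, m in zip(rewards, mask) if m]
    mean = statistics.fmean(kept)
    std = statistics.pstdev(kept) or 1.0
    # Outlier rollouts get zero advantage so they cannot dominate the update
    return [((r - mean) / std) if m else 0.0 for r, m in zip(rewards, mask)]
```

With a group like `[1.0, 1.2, 0.9, 1.1, 100.0]`, the anomalous reward of 100.0 is excluded from the mean/std statistics, so the remaining rollouts keep well-scaled advantages instead of being flattened toward zero by the outlier.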
Latest from Westlake University! RobustVLA: a robustness-aware reinforcement post-training method for VLA models (outperforming SOTA approaches)
具身智能之心· 2025-11-08 04:00
Core Insights
- The article discusses the development of RobustVLA, a lightweight online reinforcement learning post-training method aimed at enhancing the robustness of Vision-Language-Action (VLA) models in the face of environmental uncertainties [1][5][20]
- It highlights the limitations of existing methods that focus primarily on reward maximization without addressing the model's sensitivity to disturbances, which can lead to significant performance drops in real-world scenarios [5][20]

Design Logic of RobustVLA
- RobustVLA incorporates two key regularization terms: Jacobian regularization to reduce sensitivity to observation noise, and smoothness regularization to stabilize policies under action disturbances [4][7][8]
- The method emphasizes robustness-aware reinforcement learning post-training as a critical step in improving the reliability of VLA models [1][5]

Robustness Analysis
- The article outlines a theoretical analysis of robustness, establishing error amplification bounds, reward drift control, and guarantees of robust stability [4][11][18]
- It identifies that Jacobian sensitivity directly drives error amplification, so reducing this sensitivity effectively constrains performance loss [12][18]

Experimental Results
- Under observation perturbations, RobustVLA demonstrated an average success rate of 82.5%, outperforming previous models such as OpenVLA-OFT and RIPT-VLA [20][21]
- Under action perturbations, RobustVLA achieved an average success rate of 54.8%, exceeding OpenVLA-OFT's 53.5% [22]
- Under combined disturbances, RobustVLA-C achieved an average success rate of 82.1%, showcasing the synergy of autonomous interaction and dual regularization [23]

Transfer Learning and Ablation Studies
- Transfer learning experiments showed that RobustVLA improved out-of-distribution adaptability by 8.0% and 16.0% on specific tasks compared to zero-shot transfer [25]
- Ablation studies confirmed that removing either Jacobian or smoothness regularization led to performance declines, underscoring that both regularization strategies are necessary for enhancing robustness [27]
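The two regularizers can be made concrete with a small finite-difference sketch. Everything below is a hypothetical stand-in, not the paper's implementation: `policy` is a toy linear map playing the role of a VLA policy, `jacobian_penalty` estimates observation sensitivity by probing with small random perturbations, and `smoothness_penalty` penalizes abrupt action changes between nearby observations.

```python
import random

def policy(obs, w):
    # Toy linear stand-in for a VLA policy: action = W @ obs
    return [sum(wij * oj for wij, oj in zip(row, obs)) for row in w]

def jacobian_penalty(obs, w, eps=1e-3, n_samples=16):
    """Finite-difference proxy for the squared Jacobian norm:
    average of ||f(o + d) - f(o)||^2 / ||d||^2 over small random d."""
    base = policy(obs, w)
    total = 0.0
    for _ in range(n_samples):
        d = [random.gauss(0.0, eps) for _ in obs]
        pert = policy([o + di for o, di in zip(obs, d)], w)
        num = sum((p - b) ** 2 for p, b in zip(pert, base))
        den = sum(di ** 2 for di in d)
        total += num / den
    return total / n_samples

def smoothness_penalty(obs_t, obs_t1, w):
    """Penalize abrupt action changes between consecutive observations,
    encouraging stability under action disturbances."""
    a_t, a_t1 = policy(obs_t, w), policy(obs_t1, w)
    return sum((x - y) ** 2 for x, y in zip(a_t, a_t1))
```

In a real training loop, these scalar penalties would be added (with tuning coefficients) to the RL objective; for the linear toy policy above, `jacobian_penalty` is bounded between the smallest and largest squared singular values of `W`, matching the intuition that lowering Jacobian sensitivity caps how much observation noise can be amplified into action error.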