Offline Reinforcement Learning

Beihang University Team Proposes a New Offline Hierarchical Diffusion Framework: Stable Offline Policy Learning Based on Structural Information Principles | NeurIPS 2025
AI前线· 2025-10-09 04:48
Author | Peng Hao's Team, Beihang University

Generative approaches based on diffusion models have shown great potential for modeling trajectories from offline reinforcement learning datasets, and hierarchical diffusion has been introduced to mitigate variance accumulation and computational challenges in long-horizon planning tasks. However, existing methods typically assume a fixed two-level diffusion hierarchy with a single predefined timescale, which limits adaptability to diverse downstream tasks and reduces decision-making flexibility.

Paper: Structural Information-based Hierarchical Diffusion for Offline Reinforcement Learning
arXiv: https://arxiv.org/abs/2509.21942
Code: https://github.com/SELGroup/SIHD

Research Background and Motivation
Offline reinforcement learning aims to solve a core challenge: how to train an effective policy using only a fixed historical dataset, without any new interaction with the environment. By reframing policy learning as a conditional trajectory generation task, diffusion models effectively mitigate the extrapolation error caused by out-of-distribution (OOD) states and actions ...
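The summary above describes reframing offline policy learning as conditional trajectory generation. Below is a minimal, illustrative sketch of that general idea in the style of Diffuser-like methods: a denoiser is trained on flattened (state, action) trajectory segments from the offline dataset, conditioned on return, and plans are produced by reverse diffusion. SIHD's hierarchy and structural-information machinery are not reproduced here; all dimensions, the beta schedule, and the MLP denoiser are assumptions for illustration.

```python
# Minimal sketch: policy learning as return-conditioned trajectory diffusion.
# Illustrative only; not the SIHD implementation.
import torch
import torch.nn as nn

HORIZON, STATE_DIM, ACTION_DIM, N_STEPS = 16, 11, 3, 100
TRAJ_DIM = HORIZON * (STATE_DIM + ACTION_DIM)

# Linear beta schedule and cumulative alpha-bar terms (DDPM-style).
betas = torch.linspace(1e-4, 2e-2, N_STEPS)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class Denoiser(nn.Module):
    """Predicts the noise added to a flattened (state, action) trajectory,
    conditioned on the diffusion step and the trajectory's return."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(TRAJ_DIM + 2, 512), nn.Mish(),
            nn.Linear(512, 512), nn.Mish(),
            nn.Linear(512, TRAJ_DIM),
        )
    def forward(self, x, t, ret):
        cond = torch.cat([x, t.float().unsqueeze(-1) / N_STEPS, ret.unsqueeze(-1)], dim=-1)
        return self.net(cond)

def training_step(model, traj, ret):
    """One denoising loss step on trajectories drawn from the offline dataset."""
    t = torch.randint(0, N_STEPS, (traj.shape[0],))
    eps = torch.randn_like(traj)
    a_bar = alphas_bar[t].unsqueeze(-1)
    noisy = a_bar.sqrt() * traj + (1 - a_bar).sqrt() * eps
    return ((model(noisy, t, ret) - eps) ** 2).mean()

@torch.no_grad()
def sample_action(model, target_return):
    """Reverse diffusion: generate a plan conditioned on a desired return and
    return the first action, which would be executed in the environment."""
    x = torch.randn(1, TRAJ_DIM)
    ret = torch.tensor([target_return])
    for t in reversed(range(N_STEPS)):
        t_b = torch.full((1,), t)
        eps_hat = model(x, t_b, ret)
        alpha, a_bar = 1.0 - betas[t], alphas_bar[t]
        x = (x - (1 - alpha) / (1 - a_bar).sqrt() * eps_hat) / alpha.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    plan = x.view(HORIZON, STATE_DIM + ACTION_DIM)
    return plan[0, STATE_DIM:]
```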
A New Paradigm for GUI Agent Training: Semi-online Reinforcement Learning Lets a 7B Model Rival GPT-4o
量子位· 2025-09-23 11:01
Core Viewpoint
- The article discusses the introduction of a new training paradigm called Semi-online Reinforcement Learning (Semi-online RL) by Zhejiang University and Tongyi Laboratory's Mobile-Agent team, which enhances the performance of models in dynamic multi-turn tasks without relying on real environment interactions [1][2][4].

Group 1: Methodology
- The Semi-online RL framework combines the stability of offline training with the long-term optimization capabilities of online learning, significantly improving model performance in dynamic tasks [2][10].
- The framework utilizes offline data to simulate online interactions, allowing the model to experience contextual changes from its own actions during training [12][15].
- A patching mechanism is introduced to adaptively correct sampling biases when the model deviates from expert trajectories, enhancing the learning process [17][19].

Group 2: Key Technologies
- The Semi-online RL framework consists of three core technologies:
  1. A semi-online mechanism that simulates online interactions using offline data [12].
  2. A Patching Module that adaptively repairs sampling biases [17].
  3. Long-term reward modeling that estimates advantages from the step level to the trajectory level [20].

Group 3: Evaluation and Results
- A new evaluation metric, SOP (Semi-online Performance), is proposed to better reflect the model's performance in multi-turn tasks, aligning closely with real online performance [22][23].
- Experimental results show that the UI-S1-7B model outperforms baseline models, achieving a task success rate of 34.0% on the AndroidWorld task and closely approaching the performance of top proprietary models [25][26].
- The model maintains a +7.1% gain on single-turn tasks, indicating that semi-online training does not sacrifice local accuracy while optimizing for long-term performance [28].

Group 4: Component Analysis
- The patching mechanism significantly enhances data utilization and maintains training stability, allowing effective error correction and promoting policy diversity [30][37].
- Ablation studies confirm that the combination of trajectory-level and step-level advantage functions, along with multi-frame historical observations, positively impacts the model's decision-making capabilities in complex GUI interactions [44].
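The third core technology listed above, long-term reward modeling that spans step-level and trajectory-level advantages, can be illustrated with a small sketch. The exact weighting and normalization used by Semi-online RL / UI-S1 are not given in the summary, so the return-to-go computation, the group-relative normalization, and the mixing coefficient `w` below are assumptions, not the authors' formula.

```python
# Hedged sketch: blending step-level and trajectory-level advantages across a
# group of rollouts sampled for the same task. Illustrative assumptions only.
from dataclasses import dataclass
import numpy as np

@dataclass
class Rollout:
    step_rewards: np.ndarray   # per-step rewards along one simulated multi-turn rollout
    traj_reward: float         # trajectory-level reward (e.g., task success)

def blended_advantages(rollouts, gamma=0.99, w=0.5):
    """Combine step-level returns-to-go with a trajectory-level signal,
    normalizing each across the group of rollouts for the same task."""
    all_step, all_traj = [], []
    for r in rollouts:
        # Step-level: discounted return-to-go at every step.
        g, rtg = 0.0, np.zeros(len(r.step_rewards))
        for t in reversed(range(len(r.step_rewards))):
            g = r.step_rewards[t] + gamma * g
            rtg[t] = g
        all_step.append(rtg)
        all_traj.append(r.traj_reward)
    traj = np.array(all_traj, dtype=float)
    traj_adv = (traj - traj.mean()) / (traj.std() + 1e-8)   # group-relative trajectory advantage
    advantages = []
    for rtg, ta in zip(all_step, traj_adv):
        step_adv = (rtg - rtg.mean()) / (rtg.std() + 1e-8)
        advantages.append(w * step_adv + (1 - w) * ta)      # trajectory term broadcast to all steps
    return advantages

# Example: two simulated rollouts for the same GUI task.
rollouts = [Rollout(np.array([0.1, 0.2, 1.0]), 1.0),
            Rollout(np.array([0.1, 0.0, 0.0]), 0.0)]
advs = blended_advantages(rollouts)
```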
57% Higher Success Rate, the Latest in VLA+RL! CO-RFT: Efficient Fine-Tuning of VLA Models (Beihang University, Tsinghua University, et al.)
具身智能之心· 2025-08-07 00:03
Core Insights
- The article discusses the development of a new reinforcement learning framework called Chunked RL, specifically designed for fine-tuning Vision-Language-Action (VLA) models, which show great potential in real-world robotic control [4][8].
- The proposed CO-RFT algorithm demonstrates significant improvements over traditional supervised fine-tuning methods, achieving a 57% increase in success rate and a 22.3% reduction in cycle time in real-world environments [4][29].

Section Summaries

Introduction
- VLA models integrate perception and language understanding for embodied control, showing promise for developing general strategies for real-world robotic control [6].
- The challenges in fine-tuning VLA models primarily stem from the dependency on the quality and quantity of task-specific data, which limits generalization to out-of-distribution (OOD) scenarios [6][7].

Methodology
- The article introduces Chunked RL, a novel reinforcement learning framework that incorporates action chunking to enhance sample efficiency and stability, particularly suited to VLA models [8][12].
- The CO-RFT algorithm consists of two phases: imitation learning to initialize the backbone network and policy, followed by offline RL with action chunking to optimize the pre-trained policy [16][18].

Experimental Analysis
- The experiments were conducted on a robotic platform with six dexterous manipulation tasks, evaluating the performance of the CO-RFT algorithm against traditional methods [20][23].
- Results indicate that CO-RFT significantly outperforms supervised fine-tuning (SFT), achieving a 57% increase in success rate and a 22.3% decrease in average cycle time across various tasks [29][30].

Position Generalization
- CO-RFT exhibits strong position generalization, achieving a 44.3% success rate at previously unseen locations and outperforming SFT by 38% in OOD scenarios [4][29].

Importance of Data Diversity
- Data diversity plays a crucial role in CO-RFT's performance: models trained on diverse datasets show significantly better generalization than those trained on fixed datasets [32][33].
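The Chunked RL idea summarized above, offline RL with action chunking, can be illustrated with a small sketch: demonstration episodes are repackaged into chunk-level transitions, and the critic bootstraps once per chunk of k actions rather than once per step. This shows only the generic chunked-TD idea under assumed names and shapes; CO-RFT's actual architecture, losses, and VLA backbone are not reproduced here.

```python
# Hedged sketch of action chunking for offline RL fine-tuning. Illustrative only.
import numpy as np

def make_chunked_transitions(states, actions, rewards, k=8, gamma=0.99):
    """Repackage a demonstration episode into chunk-level transitions.

    states: (T+1, state_dim), actions: (T, action_dim), rewards: (T,).
    Returns (s_t, a_{t:t+k}, R_t^{(k)}, s_{t+k}) tuples, where
    R_t^{(k)} = sum_{i=0}^{k-1} gamma^i * r_{t+i}.
    """
    T = len(actions)
    chunks = []
    for t in range(0, T - k + 1):
        chunk_reward = sum(gamma**i * rewards[t + i] for i in range(k))
        chunks.append((states[t], actions[t:t + k], chunk_reward, states[t + k]))
    return chunks

def chunked_td_target(chunk_reward, next_state_value, k=8, gamma=0.99):
    """Chunk-level TD target: bootstrap once per k executed actions instead of
    per step, shortening the effective horizon and reducing value-estimation variance."""
    return chunk_reward + (gamma ** k) * next_state_value

# Example on a toy episode of 20 steps with 7-dim actions.
rng = np.random.default_rng(0)
states = rng.normal(size=(21, 10))
actions = rng.normal(size=(20, 7))
rewards = rng.normal(size=20)
chunked = make_chunked_transitions(states, actions, rewards)
target = chunked_td_target(chunked[0][2], next_state_value=0.0)
```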