Offline Reinforcement Learning
ICLR 2026 | Shandong University, Li Auto, and the Chinese Academy of Sciences jointly propose a new paradigm for offline reinforcement learning: teaching Transformers to "discard the dross"
机器之心· 2026-03-14 02:30
Core Insights
- The article discusses the central challenge of offline reinforcement learning (RL): the training data is fixed and of uneven quality. It highlights the limitations of existing Transformer-based methods such as Decision Transformer (DT) that treat entire trajectories as learning units, which can dilute the effective actions buried within low-return trajectories [2][8][9].
Group 1: Pain Points
- Unlike online RL, which can keep improving through trial and error, offline RL must work with a fixed dataset. Current Transformer-based methods mainly model conditional sequences based on final returns, which leads to a lack of granularity in learning [8][9].
- Existing approaches often fail to account for quality differences within a trajectory, making it difficult for models to extract the optimal segments of mediocre policies [13][15].
Group 2: Core Solution
- The PRGS (Peak-Return Greedy Slicing) framework is introduced as a data processing and inference enhancement tool for Transformer-based offline RL, consisting of three interconnected modules: return estimation, greedy slicing, and adaptive history truncation [10][11].
- The first module, an MMD-based Return Estimator, takes a distributional perspective, characterizing the potential return distribution of state-action pairs to provide a more optimistic return estimate [16].
- The second module, Greedy Subtrajectory Slicing, recursively slices trajectories around peak returns, allowing the model to focus on higher-quality sub-trajectories during training (see the sketch at the end of this summary) [17][18].
- The third module, Adaptive History Truncation, addresses the interference of low-quality historical actions with current decisions by allowing the model to discard irrelevant historical context [19].
Group 3: Experimental Results
- The research team tested PRGS across benchmarks including D4RL, BabyAI, and AuctionNet, achieving state-of-the-art (SOTA) performance, particularly in complex scenarios [20][21].
- On the D4RL benchmark, PRGS performed remarkably well, especially on the Maze2D-Large task, where DT-PRGS scored 127.5 compared to less than 30 for the original DT [22].
- Visualizations on the maze tasks showed that PRGS extracted sub-trajectories covering the optimal path to the goal while eliminating ineffective exploration [24].
- On the real-world AuctionNet dataset, PRGS also delivered significant profit improvements over behavior cloning (BC) [25].
Group 4: Summary and Outlook
- The success of PRGS underscores that in offline reinforcement learning, data must be not only abundant but also precise. The combination of the MMD estimator, greedy slicing, and adaptive truncation enables Transformers to effectively extract the valuable information [28].
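To make the greedy-slicing idea concrete, here is a minimal Python sketch of recursively cutting a trajectory around its peak estimated return. It is a schematic reconstruction, not the authors' code: it scores steps with a plain return-to-go instead of the paper's MMD-based estimator, and the function names and the `min_len` parameter are assumptions made for illustration.

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """Discounted return-to-go at every step of a trajectory."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def greedy_slice(rewards, min_len=3, gamma=1.0):
    """Recursively cut a trajectory around its peak estimated return.

    Keeps the segment that starts at the step with the highest return-to-go,
    then recurses on the remaining prefix, so training can focus on the
    high-return sub-trajectories. Schematic only: the paper uses an MMD-based
    return estimator rather than the plain return-to-go used here.
    """
    segments = []

    def _slice(start, end):
        if end - start <= min_len:
            segments.append((start, end))
            return
        rtg = returns_to_go(rewards[start:end], gamma)
        peak = start + int(np.argmax(rtg))   # step with the highest estimate
        segments.append((peak, end))          # keep the high-return suffix
        if peak > start:
            _slice(start, peak)               # recurse on what precedes it

    _slice(0, len(rewards))
    return segments

# toy trajectory whose useful behaviour only starts at step 3
rewards = [-1.0, -1.0, -0.5, 5.0, 1.0, -2.0, 0.0, 3.0]
print(greedy_slice(rewards))   # -> [(3, 8), (0, 3)]
```

The high-return segments identified this way can then be oversampled or used as the conditioning targets during Transformer training, which is the role the article describes for the slicing module.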
Beihang University team proposes a new offline hierarchical diffusion framework: stable offline policy learning based on structural information principles | NeurIPS 2025
AI前线· 2025-10-09 04:48
Core Insights
- The article discusses a new framework called SIHD (Structural Information-based Hierarchical Diffusion) for offline reinforcement learning, which adapts to different tasks by analyzing the structural information embedded in offline trajectories [2][3][23].
Research Background and Motivation
- Offline reinforcement learning aims to train effective policies from fixed historical datasets without new interactions with the environment. Introducing diffusion models helps mitigate extrapolation errors caused by out-of-distribution states and actions [3][4].
- Current methods are limited by fixed hierarchical structures and single time scales, which hinder adaptability to different task complexities and decision-making flexibility [5][6].
SIHD Framework Core Design
- SIHD innovates in three areas: hierarchy construction, conditional diffusion, and regularized exploration [5].
- Hierarchy construction is adaptive, letting the data's inherent structure dictate the hierarchy [7][9].
- The conditional diffusion model uses structural information gain as its guiding signal, improving stability and robustness over traditional methods that rely on sparse reward signals [10][11].
- A structural entropy regularizer is introduced to encourage exploration and mitigate extrapolation errors, balancing exploration and exploitation in the training objective (a rough sketch of this objective follows the summary) [12][13].
Experimental Results and Analysis
- SIHD was evaluated on the D4RL benchmark, demonstrating superior performance on standard offline RL tasks and long-horizon navigation tasks [14][15].
- In Gym-MuJoCo tasks, SIHD achieved the best average returns across data quality levels, outperforming advanced hierarchical baselines with average improvements of 3.8% and 3.9% on the medium-quality datasets [16][17][18].
- In long-horizon navigation tasks, SIHD showed clear advantages, particularly under sparse rewards, with notable gains on the Maze2D and AntMaze tasks [19][20][22].
- Ablation studies confirmed the necessity of SIHD's components, especially the adaptive multi-scale hierarchy, which is crucial for long-horizon tasks [21][22].
Conclusion
- The SIHD framework constructs an adaptive multi-scale hierarchical diffusion model, overcoming the rigidity of existing methods and significantly improving offline policy learning performance, generalization, and robustness [23]. Future research may explore more refined sub-goal conditioning strategies and extend SIHD's ideas to broader diffusion-based generative models [23].
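As a rough illustration of how a denoising loss and an entropy-style regularizer can sit in one training objective, here is a minimal PyTorch sketch. It is an assumption-laden stand-in, not SIHD's implementation: the real structural entropy is computed from an encoding tree over the offline trajectories, whereas the placeholder below uses a differentiable soft histogram, and the linear noise schedule, the `denoiser(x_t, t, cond)` interface, and all function names are invented for illustration.

```python
import torch
import torch.nn.functional as F

def soft_histogram_entropy(x, num_bins=16, temperature=0.1):
    """Toy, differentiable stand-in for a structural-entropy bonus: entropy of
    a soft histogram over a 1-D projection of the generated states. Its only
    purpose here is to play the role of an entropy term that discourages
    mode collapse; it is not the encoding-tree quantity used by SIHD."""
    values = x.reshape(-1, x.shape[-1])[:, 0]
    centers = torch.linspace(values.min().item(), values.max().item() + 1e-6,
                             num_bins, device=x.device)
    # soft assignment of each value to histogram bins
    logits = -((values[:, None] - centers[None, :]) ** 2) / temperature
    p = F.softmax(logits, dim=1).mean(dim=0)
    return -(p * (p + 1e-8).log()).sum()

def diffusion_loss_with_entropy(denoiser, x0, cond, reg_weight=0.1,
                                num_timesteps=1000):
    """One training step: conditional denoising loss minus a weighted entropy
    bonus, mirroring the shape of an objective that balances imitation of the
    data with exploration of its coverage."""
    b = x0.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=x0.device)
    noise = torch.randn_like(x0)
    # simple linear alpha-bar schedule (an assumption of this sketch)
    alpha_bar = (1.0 - (t.float() + 1) / num_timesteps)
    alpha_bar = alpha_bar.view(b, *([1] * (x0.dim() - 1)))
    x_t = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise
    pred_noise = denoiser(x_t, t, cond)
    denoise_loss = F.mse_loss(pred_noise, noise)
    # recover the model's implied clean sample and reward its diversity
    x0_hat = (x_t - (1.0 - alpha_bar).sqrt() * pred_noise) / alpha_bar.sqrt().clamp_min(1e-3)
    return denoise_loss - reg_weight * soft_histogram_entropy(x0_hat)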
A new paradigm for GUI agent training! Semi-online reinforcement learning lets a 7B model rival GPT-4o
量子位· 2025-09-23 11:01
Core Viewpoint
- The article discusses a new training paradigm called Semi-online Reinforcement Learning (Semi-online RL), introduced by Zhejiang University and Tongyi Laboratory's Mobile-Agent team, which improves model performance on dynamic multi-turn tasks without relying on real environment interactions [1][2][4].
Group 1: Methodology
- The Semi-online RL framework combines the stability of offline training with the long-horizon optimization capabilities of online learning, significantly improving model performance on dynamic tasks [2][10].
- The framework uses offline data to simulate online interactions, allowing the model to experience the contextual changes caused by its own actions during training [12][15].
- A patching mechanism is introduced to adaptively correct sampling biases when the model deviates from expert trajectories, enhancing the learning process (a schematic sketch follows this summary) [17][19].
Group 2: Key Technologies
- The Semi-online RL framework rests on three core technologies:
1. A semi-online mechanism that simulates online interactions using offline data [12].
2. A Patching Module that adaptively repairs sampling biases [17].
3. Long-horizon reward modeling that estimates advantages from the step level up to the trajectory level [20].
Group 3: Evaluation and Results
- A new evaluation metric, SOP (Semi-online Performance), is proposed to better reflect model performance on multi-turn tasks, aligning closely with real online performance [22][23].
- Experimental results show that the UI-S1-7B model outperforms baseline models, achieving a 34.0% task success rate on AndroidWorld and closely approaching the performance of top proprietary models [25][26].
- The model retains a +7.1% gain on single-turn tasks, indicating that semi-online training does not sacrifice local accuracy while optimizing long-horizon performance [28].
Group 4: Component Analysis
- The patching mechanism significantly improves data utilization and keeps training stable, enabling effective error correction and promoting policy diversity [30][37].
- Ablation studies confirm that combining trajectory-level and step-level advantage functions, together with multi-frame historical observations, improves the model's decision-making in complex GUI interactions [44].
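The semi-online rollout with patching can be sketched as follows. This is a schematic reconstruction from the description above, not the released UI-S1 code; the `match_fn` interface, the history format, and the advantage-blending weight are assumptions of this sketch.

```python
def semi_online_rollout(policy, expert_traj, match_fn):
    """Simulate an online rollout from a recorded expert trajectory.

    `expert_traj` is a list of (observation, expert_action) pairs. At each
    step the policy proposes an action; if it deviates from the recorded
    action, the patching step substitutes the expert action so the remaining
    recorded observations stay consistent. Returns the executed actions and
    a mask marking which steps were patched.
    """
    actions, patched, history = [], [], []
    for obs, expert_action in expert_traj:
        proposed = policy(obs, history)
        if match_fn(proposed, expert_action):
            executed, was_patched = proposed, False
        else:
            executed, was_patched = expert_action, True   # patch: fall back to the expert action
        actions.append(executed)
        patched.append(was_patched)
        # the executed action becomes part of the context the model conditions on
        history.append((obs, executed))
    return actions, patched

def combined_advantage(step_adv, traj_return, baseline, weight=0.5):
    """Blend step-level and trajectory-level signals, echoing the article's
    long-horizon reward modeling; the blending weight is an assumption."""
    return (1 - weight) * step_adv + weight * (traj_return - baseline)
```

In this reading, the patched steps still contribute supervision (the expert action is known), while unpatched steps reflect the model's own behavior, which is what lets offline data approximate an online rollout.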
57% higher success rate, the latest in VLA+RL! CO-RFT: efficient fine-tuning of VLA models (Beihang, Tsinghua, et al.)
具身智能之心· 2025-08-07 00:03
Core Insights
- The article discusses a new reinforcement learning framework called Chunked RL, designed specifically for fine-tuning Vision-Language-Action (VLA) models, which show great potential for real-world robotic control [4][8].
- The proposed CO-RFT algorithm delivers significant improvements over traditional supervised fine-tuning, achieving a 57% increase in success rate and a 22.3% reduction in cycle time in real-world environments [4][29].
Section Summaries
Introduction
- VLA models integrate perception and language understanding for embodied control, showing promise for developing general policies for real-world robotic control [6].
- The challenges in fine-tuning VLA models stem mainly from the dependence on the quality and quantity of task-specific data, which limits generalization to out-of-distribution (OOD) scenarios [6][7].
Methodology
- The article introduces Chunked RL, a reinforcement learning framework that incorporates action chunking to improve sample efficiency and stability, making it particularly well suited to VLA models (a minimal sketch of chunk-level transitions follows this summary) [8][12].
- The CO-RFT algorithm consists of two phases: imitation learning to initialize the backbone network and policy, followed by offline RL with action chunking to optimize the pre-trained policy [16][18].
Experimental Analysis
- Experiments were conducted on a robotic platform with six dexterous manipulation tasks, evaluating CO-RFT against traditional methods [20][23].
- Results show that CO-RFT significantly outperforms supervised fine-tuning (SFT), with a 57% increase in success rate and a 22.3% decrease in average cycle time across tasks [29][30].
Position Generalization
- CO-RFT exhibits strong position generalization, achieving a 44.3% success rate at previously unseen locations and outperforming SFT by 38% in OOD scenarios [4][29].
Importance of Data Diversity
- Data diversity is crucial to CO-RFT's performance: models trained on diverse datasets generalize significantly better than those trained on fixed datasets [32][33].
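A minimal sketch of what action chunking means for offline TD learning: regroup each trajectory into chunk-level transitions so that a single Bellman backup spans several environment steps. This follows the general idea described above rather than the CO-RFT implementation; shapes, function names, and the discounting convention are assumptions of this sketch.

```python
import numpy as np

def make_chunked_transitions(obs, actions, rewards, chunk=4, gamma=0.99):
    """Regroup a trajectory into chunk-level transitions.

    Each transition is (s_t, a_{t:t+k}, R_t^k, s_{t+k}), where the chunk
    return R_t^k accumulates k discounted rewards. `obs` must contain
    len(rewards) + 1 observations (including the terminal one).
    """
    transitions = []
    for t in range(len(rewards) - chunk + 1):
        a_chunk = np.concatenate([np.atleast_1d(actions[t + i]) for i in range(chunk)])
        r_chunk = sum((gamma ** i) * rewards[t + i] for i in range(chunk))
        transitions.append((obs[t], a_chunk, r_chunk, obs[t + chunk]))
    return transitions

def chunked_td_target(r_chunk, q_next, chunk=4, gamma=0.99):
    """Chunk-level Bellman target: the discount jumps by gamma**chunk,
    since one backup covers `chunk` environment steps."""
    return r_chunk + (gamma ** chunk) * q_next

# toy usage with scalar observations and actions
obs = list(range(9))                      # 8 steps -> 9 observations
actions = [0, 1, 0, 1, 1, 0, 1, 0]
rewards = [0, 0, 1, 0, 0, 1, 0, 1]
print(len(make_chunked_transitions(obs, actions, rewards, chunk=4)))  # 5
```

Backing up over chunks shortens the effective horizon by the chunk length, which is one common argument for why action chunking can stabilize value learning on long-horizon manipulation tasks.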