Online Reinforcement Learning
Real-Robot RL Goes Wild: Robots Self-Learn to a Perfect Score in 20 Minutes, with the Digital Twin Crowned
36Kr· 2026-02-13 07:32
Core Insights
- TwinRL introduces a digital twin-driven reinforcement learning framework that enhances the exploration capabilities of robots in real-world tasks, achieving a 100% success rate in various operations within approximately 20 minutes, while reducing human intervention by over 50% [1][22][36].

Group 1: Technology and Framework
- TwinRL is not a simulator but an exploration amplifier and guide, designed to expand the exploration space for robots beyond the limitations of traditional methods [16][15].
- The framework consists of three main components: exploration space expansion, parallel online reinforcement learning in the digital twin, and sim-to-real guided exploration [32][36].
- The exploration space expansion strategy utilizes high-fidelity digital twin environments to generate synthetic trajectories that exceed human demonstration coverage [25][32].

Group 2: Performance and Efficiency
- TwinRL demonstrates a significant improvement in exploration efficiency, achieving at least a 30% acceleration in convergence time compared to existing real-world reinforcement learning methods [22][39].
- In experiments, TwinRL maintained a near 100% success rate in both in-distribution and out-of-distribution areas, showcasing its robustness against environmental changes [39][46].
- The framework effectively bridges the gap between offline training and online learning, allowing for a smoother transition and reducing performance degradation during the learning process [39][34].

Group 3: Research Background and Observations
- The research highlights that the effective exploration space in real-world VLA reinforcement learning is heavily constrained by the distribution of supervised fine-tuning (SFT) data [27][30].
- The study reveals that traditional reinforcement learning methods struggle with exploration deadlock in out-of-distribution scenarios, emphasizing the need for a broader exploration strategy [30][31].
- TwinRL addresses these challenges by moving the exploration process to a controllable and expandable digital twin environment, allowing for more effective learning [15][36].
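The exploration-space expansion described above can be sketched as sampling synthetic start states anchored on human demonstrations but deliberately reaching beyond their coverage. This is a toy 1-D illustration under assumed names, not TwinRL's actual implementation:

```python
import random

def expand_exploration_space(demo_states, radius, n_samples, seed=0):
    """Generate synthetic start states around human demonstrations,
    deliberately reaching past their original coverage. A toy stand-in
    for twin-side exploration-space expansion; the 1-D state and all
    names are illustrative assumptions."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_samples):
        base = rng.choice(demo_states)         # anchor on a demonstrated state
        offset = rng.uniform(-radius, radius)  # perturb beyond demo coverage
        synthetic.append(base + offset)
    return synthetic

# hypothetical demo states along one axis; twin rollouts would then be
# run from the expanded set rather than only from demonstrated states
states = expand_exploration_space([0.0, 1.0, 2.0], radius=0.5, n_samples=100)
```

In the real system the perturbed states would seed parallel rollouts inside the digital twin, whose successful trajectories then guide real-robot exploration.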
New NAVSIM SOTA! Masked Diffusion, a New Framework for End-to-End Autonomous Driving
自动驾驶之心· 2025-12-26 03:32
Source | 机器之心. Original article: "New NAVSIM SOTA: Fudan and Yinwang Propose Masked Diffusion, a New End-to-End Autonomous Driving Framework." This piece is shared for academic purposes only. With the rise of VLA (Vision-Language-Action) models, end-to-end autonomous driving is undergoing a paradigm shift from "modular" to "unified." However, once perception, reasoning, and planning are compressed into a single model, the mainstream auto-regressive generation paradigm begins to show its limitations. Existing autoregressive models are forced to follow a "left-to-right" temporal generation order, which differs fundamentally from the intuition of human drivers: when handling complex road conditions, experienced drivers tend to plan "end-to-begin," first establishing a long-term driving intention (such as merging onto a ramp, yielding to a pedestrian, or pulling over) and then working backward to the immediate short-term control actions. In addition, imitation-learning-based models easily fall into the "average driver" trap, tending to fit the mean of the data distribution, which flattens the policy and makes it hard to switch flexibly between assertive maneuvering and conservative avoidance. To address these pain points, Fudan University and Yinwang Intelligence jointly proposed the WAM-Diff framework. The work innovatively ...
New NAVSIM SOTA: Fudan Proposes a New End-to-End Autonomous Driving Framework
具身智能之心· 2025-12-26 00:55
Core Insights
- The article discusses the transition in end-to-end autonomous driving from a modular approach to a unified paradigm with the rise of Vision-Language-Action (VLA) models, highlighting the limitations of existing autoregressive models in mimicking human driving intuition [1][2].

Group 1: WAM-Diff Framework
- The WAM-Diff framework, developed by Fudan University and Yinwang Intelligence, introduces a Discrete Masked Diffusion model for VLA autonomous driving planning, integrating a sparse mixture-of-experts (MoE) architecture and online reinforcement learning (GSPO) [2][4].
- WAM-Diff achieved state-of-the-art (SOTA) performance on the NAVSIM benchmark, scoring 91.0 PDMS and 89.7 EPDMS, demonstrating the potential of non-autoregressive generation in complex driving scenarios [2][16][18].

Group 2: Technical Innovations
- WAM-Diff employs Hybrid Discrete Action Tokenization to convert continuous 2D trajectory coordinates into high-precision discrete tokens, allowing for a shared vocabulary with driving commands [5].
- The framework utilizes Masked Diffusion for generation, enabling parallel prediction of all token positions, which enhances inference efficiency and allows for global optimization [5][9].

Group 3: Decoding Strategies
- WAM-Diff explores three decoding strategies: causal, reverse-causal, and random, finding that the reverse-causal strategy yields the best performance in closed-loop metrics, aligning with the "end-to-begin" planning intuition [9][20].
- This approach confirms that establishing long-term driving intentions before detailing immediate actions significantly improves planning consistency and safety [9][20].

Group 4: MoE and GSPO Integration
- The MoE architecture within WAM-Diff includes 64 lightweight experts, dynamically activated based on the driving context, enhancing model capacity and adaptability while controlling computational costs [12].
- The GSPO algorithm bridges the gap between open-loop training and closed-loop execution, optimizing trajectory sequences based on safety, compliance, and comfort metrics [12][14].

Group 5: Experimental Results
- In extensive experiments on the NAVSIM benchmark, WAM-Diff outperformed several leading models, achieving a PDMS score of 91.0 and an EPDMS score of 89.7, indicating its robustness in balancing safety and compliance [16][18].
- The model's performance in NAVSIM-v2, which includes stricter metrics for traffic rule adherence and comfort, improved by 5.2 points over the previous best, showcasing its capability in real-world driving scenarios [18].

Group 6: Conclusion
- WAM-Diff represents a significant advancement in autonomous driving planning, moving towards a discrete, structured, and closed-loop approach, emphasizing the importance of both "how to generate" and "what to generate" in the VLA era [25].
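The reverse-causal decoding strategy reported above can be illustrated with a toy unmasking loop: commit the farthest-horizon token first, then fill in earlier positions, mirroring "end-to-begin" planning. `predict_fn` is an illustrative stand-in for the masked-diffusion model, not WAM-Diff's actual decoder:

```python
MASK = -1  # sentinel for a still-masked token position

def reverse_causal_decode(predict_fn, length):
    """Toy unmasking loop in reverse-causal order: the model proposes
    values for every position in parallel, but we commit the latest
    (farthest-future) position first and work backward to the
    immediate action. predict_fn and the integer tokens are
    illustrative assumptions."""
    tokens = [MASK] * length
    for pos in reversed(range(length)):  # far future -> immediate action
        proposal = predict_fn(tokens)    # parallel prediction over all slots
        tokens[pos] = proposal[pos]      # commit only the chosen position
    return tokens

# usage with a trivial stand-in predictor that ignores its input
plan = reverse_causal_decode(lambda toks: [10, 11, 12, 13], length=4)  # -> [10, 11, 12, 13]
```

With a real model, `predict_fn` would condition on the already-committed future tokens, which is exactly where the causal, reverse-causal, and random orders diverge.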
New NAVSIM SOTA: Fudan and Yinwang Propose Masked Diffusion, a New End-to-End Autonomous Driving Framework
机器之心· 2025-12-25 03:12
Core Insights
- The article discusses the transition in end-to-end autonomous driving from a "modular" approach to a "unified" paradigm with the rise of Vision-Language-Action (VLA) models, highlighting the limitations of existing autoregressive generation paradigms [2].
- It introduces the WAM-Diff framework, which innovatively incorporates discrete masked diffusion models into VLA autonomous driving planning, addressing the challenges of single-direction temporal generation [2][6].

Group 1: WAM-Diff Framework
- WAM-Diff utilizes Hybrid Discrete Action Tokenization to convert continuous 2D trajectory coordinates into high-precision discrete tokens, keeping quantization error within 0.005 [6].
- The framework employs Masked Diffusion as its backbone, allowing for parallel prediction of all token positions, significantly enhancing inference efficiency and enabling global optimization [6].
- WAM-Diff explores decoding strategies, revealing that the reverse-causal strategy outperforms others in closed-loop metrics, validating the "end-to-begin" planning logic [9][20].

Group 2: Performance Metrics
- On the authoritative NAVSIM benchmark, WAM-Diff achieved state-of-the-art (SOTA) scores of 91.0 PDMS in NAVSIM-v1 and 89.7 EPDMS in NAVSIM-v2, demonstrating its potential in complex autonomous driving scenarios [3][18].
- The model surpassed competitors like DiffusionDrive and ReCogDrive, indicating its robustness in balancing safety and compliance in real-world driving conditions [18].

Group 3: Technical Innovations
- WAM-Diff integrates a Low-Rank Adaptation Mixture-of-Experts (LoRA-MoE) architecture, which includes 64 lightweight experts for dynamic routing and sparse activation, enhancing model capacity and adaptability [11].
- The Group Sequence Policy Optimization (GSPO) algorithm is introduced to bridge the gap between open-loop training and closed-loop execution, optimizing trajectory sequences based on safety, compliance, and comfort metrics [14].

Group 4: Conclusion
- The emergence of WAM-Diff marks a significant step towards discrete, structured, and closed-loop autonomous driving planning, emphasizing the importance of both "how to generate" and "what to generate" in the VLA era [25].
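A quantization error below 0.005 follows directly from a fine enough uniform grid: with bin width `step`, the round-trip error is at most `step / 2`. The bounds and bin count below are illustrative assumptions chosen to meet that figure, not WAM-Diff's actual tokenizer parameters:

```python
def tokenize_coord(x, lo=-64.0, hi=64.0, n_bins=25600):
    """Uniform quantizer sketch for discrete action tokenization:
    clamp a continuous trajectory coordinate to [lo, hi] and map it to
    a bin index. Bounds and bin count are illustrative (step = 0.005,
    so round-trip error <= 0.0025 < 0.005)."""
    x = min(max(x, lo), hi)
    step = (hi - lo) / n_bins
    return min(int((x - lo) / step), n_bins - 1)

def detokenize(token, lo=-64.0, hi=64.0, n_bins=25600):
    """Map a token back to the centre of its bin."""
    step = (hi - lo) / n_bins
    return lo + (token + 0.5) * step
```

Because the decoder returns the bin centre, the worst-case reconstruction error for in-range coordinates is half a bin width, which is what bounds the precision of the discrete token vocabulary.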
HUST & Xiaomi Jointly Propose MindDrive: the First VLA Framework to Validate the Effectiveness of Online Reinforcement Learning...
自动驾驶之心· 2025-12-17 00:03
Core Insights
- The article introduces MindDrive, a novel framework for autonomous driving that utilizes online reinforcement learning (RL) to enhance the performance of vision-language-action (VLA) models [2][4][44].
- MindDrive demonstrates significant improvements in driving scores and success rates compared to traditional end-to-end paradigms and state-of-the-art (SOTA) models, achieving a driving score (DS) of 78.04 and a success rate (SR) of 55.09% [9][38].

Background Review
- Autonomous driving relies on models that can perceive, decide, and execute actions in dynamic environments. Traditional frameworks often lack common sense and causal reasoning capabilities [4].
- Current VLA models primarily use imitation learning (IL), which can lead to causal confusion and distribution shifts, resulting in irreversible errors in closed-loop driving scenarios [4][5].

MindDrive Framework
- MindDrive consists of two main components: a decision expert and an action expert, both utilizing a shared vision encoder and text tokenizer, but differing in their low-rank adaptation (LoRA) parameters [11][18].
- The decision expert generates abstract driving decisions based on navigation commands and visual inputs, while the action expert translates these decisions into specific action trajectories [11][18].

Online Reinforcement Learning Approach
- MindDrive employs online RL to optimize the decision-making process by sampling different trajectories and receiving feedback from the environment, thus enhancing the model's understanding of causal relationships [22][30].
- The framework is designed to operate within a closed-loop simulation environment, specifically using the CARLA simulator, which allows for efficient data collection and training [8][24].

Experimental Results
- MindDrive outperforms traditional end-to-end methods and other VLA models, achieving a driving score that is 10.12 points higher than the best imitation learning model and 6.68 points higher than the best offline RL method [38][40].
- The model's performance in complex driving scenarios, such as overtaking and yielding, shows significant improvements, indicating enhanced causal reasoning and decision robustness [38][40].

Conclusion
- MindDrive represents a significant advancement in the application of online RL to autonomous driving, providing a framework that effectively maps language instructions to actions while optimizing exploration efficiency [44].
- The results suggest that MindDrive could inspire further developments in the autonomous driving sector, particularly in enhancing the capabilities of VLA models [44].
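The "shared backbone, per-expert LoRA" layout can be sketched minimally: one frozen shared weight plus a small expert-specific delta standing in for the LoRA parameters. Scalar weights and the expert names are illustrative assumptions, not MindDrive's actual architecture:

```python
class SharedBackboneWithExperts:
    """Toy sketch of a two-expert layout: a frozen shared backbone
    weight plus a per-expert delta standing in for LoRA parameters,
    so the decision and action experts reuse the same encoder.
    All values and names here are illustrative."""
    def __init__(self):
        self.w_shared = 1.0                    # frozen shared backbone
        self.lora_delta = {"decision": 0.10,   # decision-expert adapter
                           "action": -0.05}    # action-expert adapter

    def forward(self, x, expert):
        # effective weight = shared backbone + expert-specific LoRA delta
        return x * (self.w_shared + self.lora_delta[expert])

model = SharedBackboneWithExperts()
high_level = model.forward(2.0, "decision")   # 2.0 * 1.10 = 2.2
trajectory = model.forward(2.0, "action")     # 2.0 * 0.95 = 1.9
```

The design point is that only the tiny per-expert deltas differ, so two specialized behaviors share one set of backbone weights and one vision encoder.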
How Does Online RL Fine-Tune π0 and π0.5, and Why Can Performance Improve by More Than 50%?
具身智能之心· 2025-11-10 03:30
Core Viewpoint
- The article discusses the introduction of the πRL framework, which enhances flow-based vision-language-action (VLA) models through online reinforcement learning (RL) fine-tuning, significantly improving their performance and generalization capabilities [5][7].

Group 1: Introduction to VLA Models
- VLA models enable robots to understand and execute complex tasks through multimodal inputs, but large-scale RL applications face challenges due to the difficulty in handling action log-likelihood during the iterative denoising process [5].

Group 2: πRL Framework
- The πRL framework, developed by teams from Tsinghua University and Peking University, addresses the challenges of applying large-scale RL to flow-based VLA models by training them in parallel simulations [6].

Group 3: RL Algorithms in πRL
- πRL implements two RL algorithms:
  1. FlowNoise models the denoising process as a discrete-time Markov Decision Process (MDP) using a learnable noise network for precise log-likelihood calculations [7].
  2. Flow-SDE combines the denoising process with agent-environment interaction, constructing a dual-layer MDP that transitions from ODE to SDE for efficient RL exploration [7].

Group 4: Performance Evaluation
- In benchmark tests, πRL significantly improved the performance of few-shot SFT models π0 and π0.5 from 57.6% to 97.6% and from 77.1% to 98.3% on the LIBERO dataset, respectively [7].
- In the ManiSkill benchmark, πRL demonstrated scalable multi-task RL capabilities across 4,352 grasping and placing tasks using 320 parallel environments [7].

Group 5: Conclusion
- Overall, πRL shows substantial performance enhancements and stronger generalization compared to SFT models, validating the effectiveness of online RL in flow-based VLA models [7].
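The FlowNoise idea above (treating each denoising step as one stochastic step of a discrete-time MDP) makes the trajectory log-likelihood a simple sum of per-step Gaussian log-probs. A minimal sketch, assuming a fixed scalar `sigma` in place of the learnable noise network:

```python
import math

def denoising_log_likelihood(actions, means, sigma):
    """Sum of per-step Gaussian log-probs over a denoising trajectory,
    sketching how a discrete-time MDP view makes the action
    log-likelihood tractable. sigma is a fixed scalar standing in for
    the learnable noise network; in the real method it would be
    predicted per step."""
    logp = 0.0
    for a, mu in zip(actions, means):
        logp += (-0.5 * ((a - mu) / sigma) ** 2
                 - math.log(sigma) - 0.5 * math.log(2 * math.pi))
    return logp
```

Once the log-likelihood is tractable like this, standard policy-gradient machinery can be applied to the denoising chain, which is what blocks naive RL on deterministic (ODE-style) flow sampling.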
Six Pieces, Including How Figma Beat Adobe | 42章经 AI Newsletter
42章经· 2025-10-26 13:42
Group 1: Figma vs Adobe
- Figma's success is attributed to its focus on "collaboration" as a core feature, contrasting with Adobe's file-centric approach [2][3].
- Adobe's collaboration is based on file transfer, while Figma allows real-time editing on a shared canvas, enabling true synchronous collaboration [3].
- Existing giants like Adobe struggle to adapt due to their historical success paths and internal resistance to change [3].

Group 2: Online Reinforcement Learning
- Cursor's use of online reinforcement learning (RL) optimizes its code completion feature, Tab, by treating user interactions as feedback signals for real-time training [6][10].
- The model's suggestion volume has decreased by 21%, while the acceptance rate has increased by 28%, indicating improved performance [6].

Group 3: Plaud's Success
- Plaud's success is rooted in recognizing the value of context, viewing conversations as a form of intelligence and a significant data source [12][14].
- The company designs its hardware and software to effectively capture and analyze user context, positioning itself as a context collector rather than just a recording device [15].
- Plaud's approach emphasizes a "reverse thinking" strategy, focusing on how AI can serve users by prompting them for context rather than the other way around [16][18].

Group 4: Creating Delight in Products
- Delight in products is defined as a combination of joy and surprise, with three main strategies: exceeding expectations, anticipating needs, and removing friction [25][27].
- A systematic approach to creating delight involves redefining user categories based on motivations, transforming those motivations into opportunities, and ensuring that delight becomes an organizational capability [28][30].

Group 5: Evaluating AI Product Retention
- A16Z suggests that AI companies should measure retention starting from the third month (M3) to better understand their true user base, as early data may include many transient users [34][35].
- The new metric M12/M3 is proposed to assess long-term retention quality, indicating how many users remain after a year compared to the third month [36][39].

Group 6: Palantir's FDE Model
- The Forward Deployed Engineer (FDE) model involves engineers embedded at client sites to bridge the gap between product capabilities and client needs, focusing on product exploration [42][46].
- FDE teams consist of Echo (consulting analysts) and Delta (deployment engineers), each with distinct roles to ensure effective client engagement and product development [49][50].
- The FDE model is particularly relevant in the AI era, where high-value contracts justify deep client integration and where product-market fit is often unclear [53][54].
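The M12/M3 retention metric mentioned in the item above is plain arithmetic: take month 3 as the "true user" baseline (skipping transient early signups) and see what fraction of that base is still active at month 12. The cohort numbers below are made up for illustration:

```python
def m12_over_m3(active_users_by_month):
    """Long-term retention quality: month-12 survivors divided by the
    month-3 base. Keys are months since signup; a high M12/M3 means
    users who stuck around past the trial phase kept coming back."""
    return active_users_by_month[12] / active_users_by_month[3]

# hypothetical cohort: 5,000 signups, 1,000 still active at month 3,
# 640 still active at month 12
cohort = {1: 5000, 3: 1000, 12: 640}
retention_quality = m12_over_m3(cohort)  # 0.64
```

Measuring from M3 rather than M1 deliberately excludes the churned tourists, so the ratio reflects whether the product retains its real users over a year.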
AI Online Reinforcement Learning "Learns by Doing": Stanford Team Sends a Small 7B Model's Performance Soaring, Even Past GPT-4o
36Kr· 2025-10-24 12:45
Core Insights
- AgentFlow introduces a new paradigm for online reinforcement learning, enhancing the reasoning capabilities of agent systems through real-time optimization and collaboration among specialized agents [1][11][14].

Performance Metrics
- AgentFlow, based on the Qwen-2.5-7B-Instruct model, shows significant improvements across various benchmark tests: 14.9% in search tasks, 14.0% in agentic reasoning tasks, 14.5% in mathematical reasoning, and 4.1% in scientific reasoning [4][19][21].
- The performance of AgentFlow surpasses that of larger models, including GPT-4o and Llama3.1-405B, demonstrating that effective system design can outperform sheer model size [21][25].

System Architecture
- The architecture of AgentFlow consists of four specialized agents: a planner for task analysis and tool selection, an executor for tool invocation, a verifier for evaluating intermediate results, and a generator for synthesizing final outputs [11][13][14].
- The system employs a shared memory design that facilitates collaboration and reduces error propagation in multi-step reasoning processes [7][14].

Learning Mechanism
- The on-policy optimization of the planner within the agent interaction flow is crucial for adapting to environmental changes and feedback, leading to a robust and self-evolving reasoning process [13][14][22].
- The Flow-GRPO algorithm addresses the challenges of multi-turn credit assignment in reinforcement learning, enhancing training efficiency and stability in complex reasoning tasks [15][19].

Research Findings
- The study reveals that online learning in real interaction environments is essential for achieving efficient reasoning, as opposed to offline supervised learning, which can lead to performance declines [22][25].
- AgentFlow's training allows the system to autonomously discover new tool combinations and usage patterns, enhancing its problem-solving capabilities [25][29].

Future Implications
- AgentFlow represents a shift from seeking a single comprehensive model to enabling agents to adapt and learn continuously within a system, highlighting the potential of collaborative intelligence in addressing complex tasks [29].
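The four-agent loop described above can be sketched as a shared-memory control flow. The callables are illustrative stubs, not AgentFlow's actual interfaces; in the real system only the planner is trained (via Flow-GRPO) while operating inside this flow:

```python
def agentflow_step(task, planner, executor, verifier, generator, max_turns=5):
    """Toy four-role loop: planner picks a tool/subgoal, executor
    invokes it, verifier judges the intermediate result, generator
    synthesizes the final answer. All roles communicate only through
    the shared memory list. Every callable here is an illustrative
    stub."""
    memory = [("task", task)]
    for _ in range(max_turns):
        plan = planner(memory)        # analyse task, pick a tool/subgoal
        result = executor(plan)       # invoke the chosen tool
        memory.append((plan, result))
        if verifier(memory):          # good enough to stop iterating?
            break
    return generator(memory)          # synthesize the final answer

# stub agents: the verifier accepts once any tool result exists and the
# generator returns the last tool result
answer = agentflow_step(
    "2+2",
    planner=lambda mem: "calculator",
    executor=lambda plan: 4,
    verifier=lambda mem: len(mem) > 1,
    generator=lambda mem: mem[-1][1],
)  # -> 4
```

The shared memory is what lets the verifier catch a bad intermediate result before it propagates, which is the error-containment property the summary attributes to the design.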
AI Online Reinforcement Learning "Learns by Doing": Stanford Team Sends a Small 7B Model's Performance Soaring, Even Past GPT-4o
量子位· 2025-10-24 03:53
Core Insights
- The article discusses the introduction of AgentFlow, a new paradigm in online reinforcement learning that enhances the reasoning capabilities of intelligent systems, outperforming models like GPT-4o and Llama3.1-405B [1][4][23].

Group 1: AgentFlow Overview
- AgentFlow consists of a team of specialized agents including a planner, executor, verifier, and generator, which collaborate through shared memory to optimize decision-making in real-time [1][14][18].
- The Flow-GRPO method allows for on-policy optimization of the planner agent, enabling adaptive decision-making based on environmental changes and feedback from other agents [19][16].

Group 2: Performance Metrics
- AgentFlow, based on the Qwen-2.5-7B-Instruct model, shows significant improvements across various benchmark tests: 14.9% in search tasks, 14.0% in agentic reasoning, 14.5% in math reasoning, and 4.1% in scientific reasoning [3][25][27].
- The model's performance surpasses that of larger models, demonstrating that effective system design and training methods can be more impactful than simply increasing model size [27].

Group 3: Learning Mechanisms
- The article emphasizes the importance of "learning in the flow," indicating that online learning in real interactive environments is crucial for achieving efficient reasoning [28][29].
- AgentFlow's architecture allows for rapid error correction and improved task planning through real-time training, enhancing overall system performance [30][29].

Group 4: Innovations and Findings
- The system autonomously discovers new solution paths, such as combining different search tools to enhance information retrieval, showcasing its ability to adapt and innovate [33].
- AgentFlow maintains performance improvements without significantly increasing the average reasoning steps, indicating efficient handling of complex tasks [35].

Group 5: Future Implications
- The article concludes that AgentFlow presents a novel approach to intelligent agent training, advocating for systems that adapt and learn continuously rather than relying on a single comprehensive model [37][38].
- Despite the distance from research to practical application, the potential for Agentic AI remains significant, suggesting a promising future for intelligent systems [39].
A New Paradigm for GUI Agent Training! Semi-Online Reinforcement Learning Lets a 7B Model Rival GPT-4o
量子位· 2025-09-23 11:01
Core Viewpoint
- The article discusses the introduction of a new training paradigm called Semi-online Reinforcement Learning (Semi-online RL) by Zhejiang University and Tongyi Laboratory's Mobile-Agent team, which enhances the performance of models in dynamic multi-turn tasks without relying on real environment interactions [1][2][4].

Group 1: Methodology
- The Semi-online RL framework combines the stability of offline training with the long-term optimization capabilities of online learning, significantly improving model performance in dynamic tasks [2][10].
- The framework utilizes offline data to simulate online interactions, allowing the model to experience contextual changes from its own actions during training [12][15].
- A patching mechanism is introduced to adaptively correct sampling biases when the model deviates from expert trajectories, enhancing the learning process [17][19].

Group 2: Key Technologies
- The Semi-online RL framework consists of three core technologies:
  1. A semi-online mechanism that simulates online interactions using offline data [12].
  2. A Patching Module that adaptively repairs sampling biases [17].
  3. Long-term reward modeling that estimates advantages from step level to trajectory level [20].

Group 3: Evaluation and Results
- A new evaluation metric, SOP (Semi-online Performance), is proposed to better reflect the model's performance in multi-turn tasks, aligning closely with real online performance [22][23].
- Experimental results show that the UI-S1-7B model outperforms baseline models, achieving a task success rate of 34.0% on the AndroidWorld task, closely approaching the performance of top proprietary models [25][26].
- The model maintains a +7.1% gain in single-turn tasks, indicating that semi-online training does not sacrifice local accuracy while optimizing for long-term performance [28].

Group 4: Component Analysis
- The patching mechanism significantly enhances data utilization and maintains training stability, allowing for effective error correction and promoting policy diversity [30][37].
- Ablation studies confirm that the combination of trajectory-level and step-level advantage functions, along with multi-frame historical observations, positively impacts the model's decision-making capabilities in complex GUI interactions [44].
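The patching idea described above can be sketched as replaying an offline expert trajectory while letting the policy propose each step: whenever the proposal deviates from the expert, the expert action is patched in so the simulated multi-turn rollout can continue on the offline data. `policy_fn` and `match_fn` are illustrative stubs, not the paper's actual components:

```python
def semi_online_rollout(expert_actions, policy_fn, match_fn):
    """Replay an offline expert trajectory with on-policy proposals.
    On deviation, substitute the expert action (the 'patch') so the
    rollout stays on the recorded trajectory; the patch count signals
    how far the policy has drifted. All callables are illustrative."""
    history, patches = [], 0
    for step, expert_action in enumerate(expert_actions):
        proposed = policy_fn(history, step)
        if not match_fn(proposed, expert_action):
            proposed = expert_action      # patch: fall back to the expert
            patches += 1
        history.append(proposed)          # context reflects chosen actions
    return history, patches

# toy GUI-action policy that deviates at step 1 only
history, n_patches = semi_online_rollout(
    expert_actions=["tap", "swipe", "type"],
    policy_fn=lambda hist, step: "back" if step == 1
                                 else ["tap", "swipe", "type"][step],
    match_fn=lambda a, b: a == b,
)  # history == ["tap", "swipe", "type"], n_patches == 1
```

Because the model still sees context produced by its own accepted actions, this simulates online interaction cheaply, while patching keeps the rollout anchored to data that actually exists offline.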