Online Reinforcement Learning
Truly raising shrimp! 3 steps to let your lobster evolve while it chats: reinforcement learning with no GPU and no dataset
量子位· 2026-03-12 02:59
Core Insights
- The article discusses the introduction of MetaClaw, an online reinforcement learning system designed to enhance AI capabilities without the need for local GPU clusters or manual data adjustments [2][13].

Group 1: MetaClaw Overview
- MetaClaw transforms user interactions with the AI into training data, allowing continuous learning in the background without disrupting normal usage [4].
- The system evaluates each conversation round, scores it, and optimizes the AI's decision-making through online fine-tuning [5].
- It automatically analyzes failed interactions to improve the AI's skills, building a more robust skill library over time [6].

Group 2: Learning Mechanisms
- The core mechanism of MetaClaw is a self-developed SkillRL framework that combines skill injection and skill evolution [9].
- Skill injection allows immediate optimization of AI performance during conversations, while skill evolution enables the AI to proactively generate new skills [10][11].

Group 3: Technical Implementation
- MetaClaw offloads all training tasks to the Tinker cloud platform, eliminating the need for users to manage computational resources [14].
- The system is designed to be user-friendly, requiring only a few setup steps: installing dependencies and configuring scripts [18][21].
- Users can enable skill injection and evolution through straightforward configuration settings [26].

Group 4: Developer-Focused Features
- MetaClaw uses an asynchronous architecture and dual learning modes, responding to users in real time while optimizing AI performance in the background [17].
- Training is flexible, supporting both lightweight reinforcement learning and deeper strategy distillation based on user feedback [17].

Group 5: Configuration and Customization
- Key configuration options are centralized in MetaClawConfig, letting users adjust model selection, training parameters, and loss functions easily [27].
- Default settings include a model name of "moonshotai/Kimi-2.5" and a maximum training step count of 1000, among other parameters [27].
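The centralized configuration described above can be sketched as a small dataclass. Only the model name and maximum step count come from the article; every field name and the remaining options are illustrative guesses, not MetaClaw's actual API:

```python
from dataclasses import dataclass

@dataclass
class MetaClawConfig:
    # Defaults for model_name and max_train_steps are the ones the article
    # reports; the other fields (and all field names) are hypothetical.
    model_name: str = "moonshotai/Kimi-2.5"
    max_train_steps: int = 1000
    skill_injection: bool = True      # optimize skills during conversations
    skill_evolution: bool = True      # let the agent propose new skills
    loss_fn: str = "policy_gradient"  # hypothetical loss selector

# Any field can be overridden at construction time.
config = MetaClawConfig(max_train_steps=500)
```

A dataclass keeps all tunables in one place and gives free equality and repr, which suits the "one config object" design the article describes.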
Real-robot RL on a tear: robot teaches itself to a perfect score in 20 minutes, and the digital twin achieves legendary status
36Kr· 2026-02-13 07:32
Core Insights
- TwinRL introduces a digital-twin-driven reinforcement learning framework that enhances robots' exploration capabilities in real-world tasks, achieving a 100% success rate across various operations within approximately 20 minutes while reducing human intervention by over 50% [1][22][36].

Group 1: Technology and Framework
- TwinRL is not a simulator but an exploration amplifier and guide, designed to expand the robot's exploration space beyond the limitations of traditional methods [16][15].
- The framework consists of three main components: exploration space expansion, parallel online reinforcement learning in the digital twin, and sim-to-real guided exploration [32][36].
- The exploration space expansion strategy uses high-fidelity digital twin environments to generate synthetic trajectories that exceed human demonstration coverage [25][32].

Group 2: Performance and Efficiency
- TwinRL significantly improves exploration efficiency, converging at least 30% faster than existing real-world reinforcement learning methods [22][39].
- In experiments, TwinRL maintained a near-100% success rate in both in-distribution and out-of-distribution areas, demonstrating robustness to environmental changes [39][46].
- The framework effectively bridges the gap between offline training and online learning, allowing a smoother transition and reducing performance degradation during learning [39][34].

Group 3: Research Background and Observations
- The research highlights that the effective exploration space in real-world VLA reinforcement learning is heavily constrained by the distribution of supervised fine-tuning (SFT) data [27][30].
- Traditional reinforcement learning methods struggle with exploration deadlock in out-of-distribution scenarios, emphasizing the need for a broader exploration strategy [30][31].
- TwinRL addresses these challenges by moving the exploration process into a controllable, expandable digital twin environment, allowing more effective learning [15][36].
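The three-component loop above (expand exploration widely in the twin, keep what works, then use it to guide real-robot exploration) can be sketched with a toy one-dimensional environment. Everything here is a hypothetical stand-in for illustration, not TwinRL's actual interface:

```python
import random

class TwinEnv:
    """Toy stand-in for a digital-twin task: reward 1.0 near the goal action."""
    def __init__(self, goal=0.8):
        self.goal = goal
    def reward(self, action):
        return 1.0 if abs(action - self.goal) < 0.1 else 0.0

def explore_in_twin(env, n_samples=1000, seed=0):
    """Exploration space expansion: sample far beyond any demonstration,
    which is cheap and safe inside the twin, and keep high-reward actions."""
    rng = random.Random(seed)
    candidates = [rng.uniform(0.0, 1.0) for _ in range(n_samples)]
    return [a for a in candidates if env.reward(a) > 0]

def guided_real_rollout(env, guided_actions, rng=None):
    """Sim-to-real guided exploration: bias the (expensive) real rollout
    toward actions the twin already found promising."""
    rng = rng or random.Random(1)
    action = rng.choice(guided_actions)
    return env.reward(action)

twin = TwinEnv()
good = explore_in_twin(twin)
success = guided_real_rollout(twin, good)  # real env approximated by the twin here
```

The point of the sketch is the division of labor: broad, risky sampling happens only in the twin, and the real robot explores a pre-filtered action set, which is one way a framework like this could cut real-world trial count.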
A new NAVSIM SOTA! Masked Diffusion, a new framework for end-to-end autonomous driving
自动驾驶之心· 2025-12-26 03:32
Core Viewpoint
- The article discusses the introduction of the WAM-Diff framework by Fudan University and Yiwang Intelligent, which innovatively integrates discrete masked diffusion models into Vision-Language-Action (VLA) autonomous driving, addressing the limitations of existing autoregressive models and enhancing planning capabilities [3][4][26].

Group 1: Framework and Innovations
- WAM-Diff introduces a discrete masked diffusion model that enables non-sequential generation, overcoming the limitations of traditional left-to-right autoregressive models [3][6].
- The framework employs a hybrid discrete action tokenization technique to convert continuous 2D trajectory coordinates into high-precision discrete tokens, giving driving commands and trajectories a shared vocabulary [6].
- The model incorporates a mixture-of-experts (MoE) architecture and online reinforcement learning (GSPO) to enhance adaptability and robustness in dynamic driving scenarios [12][14].

Group 2: Performance Metrics
- On the NAVSIM benchmark, WAM-Diff achieved a state-of-the-art (SOTA) score of 91.0 PDMS on NAVSIM-v1, surpassing several leading baseline models [4][16].
- On NAVSIM-v2, which includes stricter metrics for traffic-rule adherence and comfort, WAM-Diff maintained strong performance with an EPDMS of 89.7, improving by 5.2 points over DiffusionDrive [18][19].

Group 3: Decoding Strategies
- The framework explores three decoding strategies (causal, reverse-causal, and random), with reverse-causal yielding the best closed-loop performance, validating the "start with the end" planning intuition [9][20].
- Experiments demonstrated that prioritizing long-term driving intentions before detailing immediate actions significantly enhances the consistency and safety of generated trajectories [20][21].

Group 4: Conclusion
- WAM-Diff represents a significant advance in end-to-end autonomous driving planning, emphasizing the importance of both "how to generate" and "what to generate" in the VLA era and potentially paving the way toward Level 4 autonomous driving [26].
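The three decoding strategies can be illustrated with a minimal masked-decoding sketch: all trajectory positions start masked, and the strategy fixes the order in which they are revealed. The mask handling and predictor interface are assumptions for illustration, not the paper's implementation:

```python
import random

def decode_order(n_tokens, strategy="reverse-causal", seed=0):
    """Order in which masked trajectory tokens are revealed."""
    if strategy == "causal":                 # near-future first
        return list(range(n_tokens))
    if strategy == "reverse-causal":         # "start with the end": far-future first
        return list(range(n_tokens - 1, -1, -1))
    if strategy == "random":
        order = list(range(n_tokens))
        random.Random(seed).shuffle(order)
        return order
    raise ValueError(strategy)

def masked_decode(predict_fn, n_tokens, strategy="reverse-causal"):
    """Iteratively fill masked positions; predict_fn sees the partial sequence,
    so tokens revealed earlier condition the ones revealed later."""
    seq = [None] * n_tokens                  # None marks a masked slot
    for pos in decode_order(n_tokens, strategy):
        seq[pos] = predict_fn(seq, pos)
    return seq
```

Under reverse-causal order, the model commits to the trajectory endpoint first, so every nearer-term token is predicted conditioned on the long-term intention, which matches the planning intuition the experiments validate.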
A new NAVSIM SOTA: Fudan proposes a new end-to-end autonomous driving framework
具身智能之心· 2025-12-26 00:55
Core Insights
- The article discusses the transition in end-to-end autonomous driving from a modular approach to a unified paradigm with the rise of Vision-Language-Action (VLA) models, highlighting the limitations of existing autoregressive models in mimicking human driving intuition [1][2].

Group 1: WAM-Diff Framework
- The WAM-Diff framework, developed by Fudan University and Yiwang Intelligence, introduces a discrete masked diffusion model for VLA autonomous driving planning, integrating a sparse mixture-of-experts (MoE) architecture and online reinforcement learning (GSPO) [2][4].
- WAM-Diff achieved state-of-the-art (SOTA) performance on the NAVSIM benchmark, scoring 91.0 PDMS and 89.7 EPDMS and demonstrating the potential of non-autoregressive generation in complex driving scenarios [2][16][18].

Group 2: Technical Innovations
- WAM-Diff employs hybrid discrete action tokenization to convert continuous 2D trajectory coordinates into high-precision discrete tokens, allowing a shared vocabulary with driving commands [5].
- The framework uses masked diffusion for generation, enabling parallel prediction of all token positions, which enhances inference efficiency and allows global optimization [5][9].

Group 3: Decoding Strategies
- WAM-Diff explores three decoding strategies (causal, reverse-causal, and random), finding that the reverse-causal strategy yields the best performance on closed-loop metrics, aligning with the "end-to-begin" planning intuition [9][20].
- This confirms that establishing long-term driving intentions before detailing immediate actions significantly improves planning consistency and safety [9][20].

Group 4: MoE and GSPO Integration
- The MoE architecture within WAM-Diff includes 64 lightweight experts, dynamically activated based on the driving context, enhancing model capacity and adaptability while controlling computational cost [12].
- The GSPO algorithm bridges the gap between open-loop training and closed-loop execution, optimizing trajectory sequences based on safety, compliance, and comfort metrics [12][14].

Group 5: Experimental Results
- In extensive experiments on the NAVSIM benchmark, WAM-Diff outperformed several leading models, achieving a PDMS of 91.0 and an EPDMS of 89.7, indicating robustness in balancing safety and compliance [16][18].
- On NAVSIM-v2, which includes stricter metrics for traffic-rule adherence and comfort, the model improved by 5.2 points over the previous best, showcasing its capability in realistic driving scenarios [18].

Group 6: Conclusion
- WAM-Diff represents a significant advance toward discrete, structured, closed-loop autonomous driving planning, emphasizing the importance of both "how to generate" and "what to generate" in the VLA era [25].
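The sequence-level scoring described for GSPO, judging an entire candidate trajectory on safety, compliance, and comfort, might be sketched as a weighted sum. The metric names and weights below are illustrative choices, not values from the paper:

```python
def trajectory_reward(metrics, weights=None):
    """Scalar closed-loop score for one whole candidate trajectory.
    Metric names and weights are illustrative, not the paper's."""
    weights = weights or {"safety": 0.5, "compliance": 0.3, "comfort": 0.2}
    return sum(weights[k] * metrics[k] for k in weights)

# Two hypothetical candidates: a safe, compliant one vs. a comfortable but risky one.
safe = trajectory_reward({"safety": 1.0, "compliance": 1.0, "comfort": 0.5})
risky = trajectory_reward({"safety": 0.2, "compliance": 0.5, "comfort": 1.0})
```

Scoring whole sequences rather than individual tokens is what lets an algorithm in this family optimize directly for closed-loop outcomes that only whole trajectories exhibit.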
A new NAVSIM SOTA: Fudan and Yiwang propose Masked Diffusion, a new end-to-end autonomous driving framework
机器之心· 2025-12-25 03:12
Core Insights
- The article discusses the transition in end-to-end autonomous driving from a "modular" approach to a "unified" paradigm with the rise of Vision-Language-Action (VLA) models, highlighting the limitations of existing autoregressive generation paradigms [2].
- It introduces the WAM-Diff framework, which innovatively incorporates discrete masked diffusion models into VLA autonomous driving planning, addressing the challenges of single-direction temporal generation [2][6].

Group 1: WAM-Diff Framework
- WAM-Diff utilizes hybrid discrete action tokenization to convert continuous 2D trajectory coordinates into high-precision discrete tokens, keeping quantization error within 0.005 [6].
- The framework employs masked diffusion as its backbone, allowing parallel prediction of all token positions, significantly enhancing inference efficiency and enabling global optimization [6].
- Of the decoding strategies explored, reverse-causal outperforms the others on closed-loop metrics, validating the "end-to-begin" planning logic [9][20].

Group 2: Performance Metrics
- On the authoritative NAVSIM benchmark, WAM-Diff achieved state-of-the-art (SOTA) scores of 91.0 PDMS on NAVSIM-v1 and 89.7 EPDMS on NAVSIM-v2, demonstrating its potential in complex autonomous driving scenarios [3][18].
- The model surpassed competitors such as DiffusionDrive and ReCogDrive, indicating its robustness in balancing safety and compliance under real-world driving conditions [18].

Group 3: Technical Innovations
- WAM-Diff integrates a Low-Rank Adaptation Mixture-of-Experts (LoRA-MoE) architecture with 64 lightweight experts using dynamic routing and sparse activation, enhancing model capacity and adaptability [11].
- The Group Sequence Policy Optimization (GSPO) algorithm bridges the gap between open-loop training and closed-loop execution, optimizing trajectory sequences based on safety, compliance, and comfort metrics [14].

Group 4: Conclusion
- The emergence of WAM-Diff marks a significant step toward discrete, structured, closed-loop autonomous driving planning, emphasizing the importance of both "how to generate" and "what to generate" in the VLA era [25].
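The cited 0.005 error bound is consistent with plain uniform quantization, where a bin of width 0.01 bounds the round-trip error by half a bin. A sketch with an illustrative coordinate range and bin width (the paper's actual tokenizer is hybrid and more involved):

```python
def tokenize(coord, lo=-64.0, bin_width=0.01):
    """Continuous coordinate -> discrete token id via uniform bins.
    Range and bin width are illustrative; rounding to the nearest bin
    bounds round-trip error by bin_width / 2 = 0.005."""
    return round((coord - lo) / bin_width)

def detokenize(token, lo=-64.0, bin_width=0.01):
    """Token id -> reconstructed coordinate (bin boundary)."""
    return lo + token * bin_width

x = 12.3456
err = abs(detokenize(tokenize(x)) - x)  # within half a bin of the original
```

Discretizing this way puts trajectory coordinates into the same integer-token space as command tokens, which is what makes a shared vocabulary possible.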
HUST & Xiaomi jointly propose MindDrive: the first VLA framework to demonstrate the effectiveness of online reinforcement learning...
自动驾驶之心· 2025-12-17 00:03
Core Insights
- The article introduces MindDrive, a novel framework for autonomous driving that utilizes online reinforcement learning (RL) to enhance the performance of vision-language-action (VLA) models [2][4][44].
- MindDrive demonstrates significant improvements in driving score and success rate over traditional end-to-end paradigms and state-of-the-art (SOTA) models, achieving a driving score (DS) of 78.04 and a success rate (SR) of 55.09% [9][38].

Background Review
- Autonomous driving relies on models that can perceive, decide, and execute actions in dynamic environments; traditional frameworks often lack common sense and causal reasoning capabilities [4].
- Current VLA models rely primarily on imitation learning (IL), which can lead to causal confusion and distribution shift, resulting in irreversible errors in closed-loop driving scenarios [4][5].

MindDrive Framework
- MindDrive consists of two main components, a decision expert and an action expert, which share a vision encoder and text tokenizer but use separate low-rank adaptation (LoRA) parameters [11][18].
- The decision expert generates abstract driving decisions from navigation commands and visual inputs, while the action expert translates those decisions into concrete action trajectories [11][18].

Online Reinforcement Learning Approach
- MindDrive employs online RL to optimize decision-making by sampling different trajectories and receiving feedback from the environment, enhancing the model's understanding of causal relationships [22][30].
- The framework operates in a closed-loop simulation environment, specifically the CARLA simulator, which allows efficient data collection and training [8][24].

Experimental Results
- MindDrive outperforms traditional end-to-end methods and other VLA models, achieving a driving score 10.12 points higher than the best imitation-learning model and 6.68 points higher than the best offline RL method [38][40].
- Performance in complex driving scenarios such as overtaking and yielding improves significantly, indicating enhanced causal reasoning and decision robustness [38][40].

Conclusion
- MindDrive represents a significant advance in applying online RL to autonomous driving, providing a framework that effectively maps language instructions to actions while optimizing exploration efficiency [44].
- The results suggest MindDrive could inspire further developments in the autonomous driving sector, particularly in enhancing the capabilities of VLA models [44].
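The shared-backbone, per-expert-LoRA arrangement can be sketched with toy matrices: both experts apply the same frozen weight, and each adds its own low-rank update W x + B(A x). Shapes and values here are illustrative, not MindDrive's:

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class LoRAAdapter:
    """Low-rank update: effective output is W x + B (A x).
    A and B are the only expert-specific parameters."""
    def __init__(self, A, B):
        self.A, self.B = A, B
    def apply(self, W, x):
        base = matvec(W, x)                      # shared frozen backbone
        delta = matvec(self.B, matvec(self.A, x))  # expert-specific rank-1 delta
        return [b + d for b, d in zip(base, delta)]

# One shared frozen weight (toy 2x2 identity), two experts with their own LoRA.
W_shared = [[1.0, 0.0], [0.0, 1.0]]
decision_lora = LoRAAdapter(A=[[1.0, 0.0]], B=[[0.5], [0.0]])
action_lora = LoRAAdapter(A=[[0.0, 1.0]], B=[[0.0], [0.5]])

x = [2.0, 4.0]
decision_feat = decision_lora.apply(W_shared, x)
action_feat = action_lora.apply(W_shared, x)
```

The design point: the two experts specialize through tiny A/B matrices while the heavy shared encoder is stored and run once, which keeps a two-expert VLA model close to the cost of one.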
How does online reinforcement learning fine-tune π0 and π0.5? And why can it boost performance by more than 50%?
具身智能之心· 2025-11-10 03:30
Core Viewpoint
- The article discusses the introduction of the πRL framework, which enhances flow-based vision-language-action (VLA) models through online reinforcement learning (RL) fine-tuning, significantly improving their performance and generalization capabilities [5][7].

Group 1: Introduction to VLA Models
- VLA models enable robots to understand and execute complex tasks from multimodal inputs, but large-scale RL applications face challenges because action log-likelihoods are difficult to handle through the iterative denoising process [5].

Group 2: πRL Framework
- The πRL framework, developed by teams from Tsinghua University and Peking University, addresses the challenges of applying large-scale RL to flow-based VLA models by training them in parallel simulation [6].

Group 3: RL Algorithms in πRL
- πRL implements two RL algorithms:
  1. FlowNoise models the denoising process as a discrete-time Markov decision process (MDP), using a learnable noise network for precise log-likelihood calculation [7].
  2. Flow-SDE couples the denoising process with agent-environment interaction, constructing a two-layer MDP and converting the ODE sampler to an SDE for efficient RL exploration [7].

Group 4: Performance Evaluation
- In benchmark tests on LIBERO, πRL significantly improved the few-shot SFT models π0 and π0.5, from 57.6% to 97.6% and from 77.1% to 98.3%, respectively [7].
- On the ManiSkill benchmark, πRL demonstrated scalable multi-task RL across 4,352 grasping and placing tasks using 320 parallel environments [7].

Group 5: Conclusion
- Overall, πRL delivers substantial performance gains and stronger generalization than SFT models, validating the effectiveness of online RL for flow-based VLA models [7].
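Flow-SDE's key move, replacing a deterministic ODE denoising step with a stochastic one so each step has a tractable Gaussian log-likelihood, can be sketched for a scalar state. This is a schematic of the idea (one Euler-Maruyama step), not πRL's code:

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log-density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def sde_step_logprob(x_t, x_next, drift, dt, sigma):
    """One stochastic denoising step x_{t+dt} ~ N(x_t + drift*dt, sigma^2 * dt).
    Injecting noise (ODE -> SDE) is what makes each step a proper probability
    distribution, so policy-gradient RL can evaluate log pi(action)."""
    mean = x_t + drift * dt
    std = sigma * math.sqrt(dt)
    return gaussian_logpdf(x_next, mean, std)
```

With a deterministic ODE step the "action" has no density to differentiate; the SDE view restores exactly the per-step log-likelihood that algorithms like PPO-style policy gradients require.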
Six pieces, including how Figma beat Adobe | 42章经 AI Newsletter
42章经· 2025-10-26 13:42
Group 1: Figma vs Adobe
- Figma's success is attributed to its focus on "collaboration" as a core feature, contrasting with Adobe's file-centric approach [2][3].
- Adobe's collaboration is based on file transfer, while Figma allows real-time editing on a shared canvas, enabling true synchronous collaboration [3].
- Existing giants like Adobe struggle to adapt due to their historical success paths and internal resistance to change [3].

Group 2: Online Reinforcement Learning
- Cursor uses online reinforcement learning (RL) to optimize its code-completion feature, Tab, treating user interactions as feedback signals for real-time training [6][10].
- The model's suggestion volume has decreased by 21% while the acceptance rate has increased by 28%, indicating improved suggestion quality [6].

Group 3: Plaud's Success
- Plaud's success is rooted in recognizing the value of context, viewing conversations as a form of intelligence and a significant data source [12][14].
- The company designs its hardware and software to effectively capture and analyze user context, positioning itself as a context collector rather than just a recording device [15].
- Plaud emphasizes a "reverse thinking" strategy, focusing on how the AI can serve users by prompting them for context rather than the other way around [16][18].

Group 4: Creating Delight in Products
- Delight in products is defined as a combination of joy and surprise, with three main strategies: exceeding expectations, anticipating needs, and removing friction [25][27].
- A systematic approach involves redefining user categories by motivation, transforming those motivations into opportunities, and making delight an organizational capability [28][30].

Group 5: Evaluating AI Product Retention
- A16Z suggests that AI companies measure retention starting from the third month (M3) to better understand their true user base, as early data may include many transient users [34][35].
- A new metric, M12/M3, is proposed to assess long-term retention quality: how many users remain after a year relative to the third month [36][39].

Group 6: Palantir's FDE Model
- The Forward Deployed Engineer (FDE) model embeds engineers at client sites to bridge the gap between product capabilities and client needs, focusing on product exploration [42][46].
- FDE teams consist of Echo (consulting analysts) and Delta (deployment engineers), each with distinct roles to ensure effective client engagement and product development [49][50].
- The FDE model is particularly relevant in the AI era, where high-value contracts justify deep client integration and where product-market fit is often unclear [53][54].
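The proposed M12/M3 metric is a simple ratio over a cohort's monthly actives. A sketch with an entirely hypothetical cohort:

```python
def m12_over_m3(active_by_month):
    """Long-run retention quality: month-12 actives as a fraction of
    month-3 actives. Starting from M3 filters out transient early users."""
    return active_by_month[12] / active_by_month[3]

# Hypothetical cohort: 300 users still active in month 3, 180 in month 12.
ratio = m12_over_m3({3: 300, 12: 180})
```

A ratio near 1.0 would say almost everyone who survived the trial phase stuck around for a year; a low ratio flags churn even among engaged users.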
AI online reinforcement learning "learns while doing": Stanford team sends a 7B small model's performance soaring, even surpassing GPT-4o
36Kr· 2025-10-24 12:45
Core Insights
- AgentFlow introduces a new paradigm for online reinforcement learning, enhancing the reasoning capabilities of agent systems through real-time optimization and collaboration among specialized agents [1][11][14].

Performance Metrics
- Built on the Qwen-2.5-7B-Instruct model, AgentFlow shows significant improvements across benchmarks: 14.9% on search tasks, 14.0% on agentic reasoning tasks, 14.5% on mathematical reasoning, and 4.1% on scientific reasoning [4][19][21].
- AgentFlow's performance surpasses that of larger models, including GPT-4o and Llama3.1-405B, demonstrating that effective system design can outperform sheer model size [21][25].

System Architecture
- The architecture consists of four specialized agents: a planner for task analysis and tool selection, an executor for tool invocation, a verifier for evaluating intermediate results, and a generator for synthesizing final outputs [11][13][14].
- A shared-memory design facilitates collaboration and reduces error propagation in multi-step reasoning [7][14].

Learning Mechanism
- On-policy optimization of the planner within the agent interaction flow is crucial for adapting to environmental changes and feedback, yielding a robust, self-evolving reasoning process [13][14][22].
- The Flow-GRPO algorithm addresses the challenge of multi-turn credit assignment in reinforcement learning, enhancing training efficiency and stability on complex reasoning tasks [15][19].

Research Findings
- The study finds that online learning in real interaction environments is essential for efficient reasoning; offline supervised learning can instead lead to performance declines [22][25].
- Training lets the system autonomously discover new tool combinations and usage patterns, enhancing its problem-solving capabilities [25][29].

Future Implications
- AgentFlow represents a shift from seeking a single comprehensive model to enabling agents to adapt and learn continuously within a system, highlighting the potential of collaborative intelligence for complex tasks [29].
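One common way a GRPO-style method handles multi-turn credit assignment is to normalize each trajectory's outcome reward against its sampling group and broadcast that advantage to every turn. The sketch below follows that reading; the details are assumptions, not Flow-GRPO's exact algorithm:

```python
def flow_grpo_advantages(group_rewards, turns_per_traj):
    """Group-normalized outcome reward broadcast to every turn of its
    trajectory, sidestepping per-turn credit estimates. Illustrative only."""
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    std = var ** 0.5 or 1.0  # guard the all-equal-rewards case
    return [[(r - mean) / std] * t
            for r, t in zip(group_rewards, turns_per_traj)]
```

Because the baseline is the group mean rather than a learned critic, this style of update stays cheap and stable even when trajectories span many planner turns.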
AI online reinforcement learning "learns while doing": Stanford team sends a 7B small model's performance soaring, even surpassing GPT-4o
量子位· 2025-10-24 03:53
Core Insights
- The article discusses the introduction of AgentFlow, a new paradigm in online reinforcement learning that enhances the reasoning capabilities of intelligent systems, outperforming models like GPT-4o and Llama3.1-405B [1][4][23].

Group 1: AgentFlow Overview
- AgentFlow consists of a team of specialized agents (a planner, executor, verifier, and generator) that collaborate through shared memory to optimize decision-making in real time [1][14][18].
- The Flow-GRPO method allows on-policy optimization of the planner agent, enabling adaptive decision-making based on environmental changes and feedback from the other agents [19][16].

Group 2: Performance Metrics
- Built on the Qwen-2.5-7B-Instruct model, AgentFlow shows significant improvements across benchmarks: 14.9% on search tasks, 14.0% on agentic reasoning, 14.5% on math reasoning, and 4.1% on scientific reasoning [3][25][27].
- Its performance surpasses that of much larger models, demonstrating that effective system design and training methods can be more impactful than simply increasing model size [27].

Group 3: Learning Mechanisms
- The article emphasizes the importance of "learning in the flow": online learning in real interactive environments is crucial for achieving efficient reasoning [28][29].
- AgentFlow's architecture allows rapid error correction and improved task planning through real-time training, enhancing overall system performance [30][29].

Group 4: Innovations and Findings
- The system autonomously discovers new solution paths, such as combining different search tools to enhance information retrieval, showcasing its ability to adapt and innovate [33].
- AgentFlow maintains its performance improvements without significantly increasing the average number of reasoning steps, indicating efficient handling of complex tasks [35].

Group 5: Future Implications
- AgentFlow presents a novel approach to intelligent-agent training, advocating systems that adapt and learn continuously rather than relying on a single comprehensive model [37][38].
- Although practical application remains some distance from the research, the potential for agentic AI is significant, suggesting a promising future for intelligent systems [39].
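The planner-executor-verifier-generator flow with a shared memory can be sketched as a simple loop in which every agent reads and writes the same structure. The agent logic below is a toy stand-in, not AgentFlow's implementation:

```python
def plan(memory, tools):
    """Planner: pick the next tool. (Trained on-policy in AgentFlow;
    a trivial round-robin stand-in here.)"""
    return list(tools)[len(memory["results"]) % len(tools)]

def verify(memory):
    """Verifier: decide whether the intermediate results suffice."""
    return any("42" in r for r in memory["results"])

def generate(memory):
    """Generator: synthesize a final answer from shared memory."""
    return memory["results"][-1]

def agentflow_round(task, tools, max_steps=3):
    """One reasoning flow: all agents communicate via one memory dict,
    which is what limits error propagation across steps."""
    memory = {"task": task, "results": []}
    for _ in range(max_steps):
        tool = plan(memory, tools)
        memory["results"].append(tools[tool](task))  # executor: invoke the tool
        if verify(memory):
            break
    return generate(memory)

answer = agentflow_round("answer", {"search": lambda t: "found 42"})
```

Because only the planner is optimized while the loop structure stays fixed, feedback from the verifier reaches the planner through the shared memory rather than through gradients in every agent.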