Online Reinforcement Learning
How does online reinforcement learning fine-tune π0 and π0.5, and why can it boost performance by more than 50%?
具身智能之心· 2025-11-10 03:30
Core Viewpoint
- The article introduces the πRL framework, which fine-tunes flow-based vision-language-action (VLA) models with online reinforcement learning (RL), significantly improving their performance and generalization capabilities [5][7].

Group 1: Introduction to VLA Models
- VLA models enable robots to understand and execute complex tasks from multimodal inputs, but large-scale RL is hard to apply because the iterative denoising process makes action log-likelihoods difficult to compute [5].

Group 2: πRL Framework
- The πRL framework, developed by teams from Tsinghua University and Peking University, addresses these challenges by training flow-based VLA models in parallel simulation [6].

Group 3: RL Algorithms in πRL
- πRL implements two RL algorithms (see the sketch after this summary):
  1. FlowNoise models the denoising process as a discrete-time Markov Decision Process (MDP) with a learnable noise network, enabling precise log-likelihood calculations [7].
  2. Flow-SDE couples the denoising process with agent-environment interaction, constructing a two-layer MDP and converting the ODE sampler into an SDE for efficient RL exploration [7].

Group 4: Performance Evaluation
- On the LIBERO benchmark, πRL raised the few-shot SFT models π0 and π0.5 from 57.6% to 97.6% and from 77.1% to 98.3%, respectively [7].
- On the ManiSkill benchmark, πRL demonstrated scalable multi-task RL across 4,352 grasping and placing tasks using 320 parallel environments [7].

Group 5: Conclusion
- Overall, πRL delivers substantial performance gains and stronger generalization than SFT alone, validating the effectiveness of online RL for flow-based VLA models [7].
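Both algorithms hinge on making the action log-likelihood of a flow policy tractable by treating each denoising step as a Gaussian transition in an MDP. The sketch below illustrates that idea only; `policy.denoise_step` and `noise_net` are hypothetical stand-ins for the πRL components, not the released interfaces, and the fixed fallback noise scale is an assumption.

```python
import torch
from torch.distributions import Normal

def denoise_with_logprob(policy, obs, action_init, num_steps, noise_net=None):
    """Sample one action chunk through the denoising chain, with its log-likelihood.

    Each denoising step is treated as a Gaussian transition of a discrete-time MDP,
    so the log-probability of the final action is the sum of per-step log-probs.
    The interfaces here are illustrative, not the actual πRL code.
    """
    a = action_init                      # start from Gaussian noise
    total_logprob = 0.0
    for k in range(num_steps):
        t = k / num_steps
        mean = policy.denoise_step(obs, a, t)          # deterministic update (ODE drift)
        # FlowNoise style: a learnable network predicts the injected noise scale;
        # fall back to a fixed schedule if no noise network is provided (assumption).
        sigma = noise_net(obs, t) if noise_net is not None else torch.full_like(mean, 0.1)
        dist = Normal(mean, sigma)
        a = dist.sample()                               # stochastic transition
        total_logprob = total_logprob + dist.log_prob(a).sum()
    return a, total_logprob
```

Summing the per-step log-probabilities yields the likelihood ratio that PPO- or GRPO-style updates need, which a purely deterministic ODE sampler cannot provide on its own.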
How Figma Beat Adobe, and Five More Pieces | 42章经 AI Newsletter
42章经· 2025-10-26 13:42
Group 1: Figma vs Adobe
- Figma's success is attributed to treating "collaboration" as its core feature, in contrast to Adobe's file-centric approach [2][3].
- Adobe's collaboration is based on passing files around, while Figma allows real-time editing on a shared canvas, enabling true synchronous collaboration [3].
- Incumbents like Adobe struggle to adapt because of their historical success paths and internal resistance to change [3].

Group 2: Online Reinforcement Learning
- Cursor uses online reinforcement learning (RL) to optimize its code-completion feature, Tab, treating user interactions as feedback signals for real-time training [6][10].
- The model's suggestion volume has decreased by 21% while the acceptance rate has increased by 28%, indicating improved performance [6].

Group 3: Plaud's Success
- Plaud's success is rooted in recognizing the value of context, viewing conversations as a form of intelligence and a significant data source [12][14].
- The company designs its hardware and software to capture and analyze user context effectively, positioning itself as a context collector rather than just a recording device [15].
- Plaud emphasizes a "reverse thinking" strategy: the AI prompts users for context rather than the other way around [16][18].

Group 4: Creating Delight in Products
- Delight is defined as joy plus surprise, with three main strategies: exceeding expectations, anticipating needs, and removing friction [25][27].
- A systematic approach involves redefining user categories by motivation, turning those motivations into opportunities, and making delight an organizational capability [28][30].

Group 5: Evaluating AI Product Retention
- A16Z suggests AI companies measure retention starting from the third month (M3) to better capture their true user base, since early data includes many transient users [34][35].
- The proposed M12/M3 metric assesses long-term retention quality: how many users remain after a year relative to month three (see the sketch after this summary) [36][39].

Group 6: Palantir's FDE Model
- The Forward Deployed Engineer (FDE) model embeds engineers at client sites to bridge the gap between product capabilities and client needs, with a focus on product exploration [42][46].
- FDE teams consist of Echo (consulting analysts) and Delta (deployment engineers), each with distinct roles for effective client engagement and product development [49][50].
- The FDE model is particularly relevant in the AI era, where high-value contracts justify deep client integration and product-market fit is often unclear [53][54].
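To make the retention metric concrete, here is a minimal sketch with made-up cohort numbers; the `retention` table and its values are purely illustrative, not A16Z data.

```python
# Hypothetical monthly cohort: fraction of signups still active in month m.
retention = {1: 1.00, 2: 0.46, 3: 0.31, 6: 0.24, 12: 0.20}

m3_retention = retention[3]                    # treat M3 as the "true" user base
m12_over_m3 = retention[12] / retention[3]     # long-term quality of that base

print(f"M3 retention: {m3_retention:.0%}")     # 31% of signups turn out to be real users
print(f"M12/M3:       {m12_over_m3:.0%}")      # ~65% of those real users remain after a year
```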
AI online reinforcement learning "learns by doing": Stanford team makes a small 7B model's performance soar, even surpassing GPT-4o
36Ke· 2025-10-24 12:45
Core Insights
- AgentFlow introduces a new paradigm for online reinforcement learning, enhancing the reasoning capabilities of agent systems through real-time optimization and collaboration among specialized agents [1][11][14].

Performance Metrics
- Built on the Qwen-2.5-7B-Instruct model, AgentFlow shows significant improvements across benchmark tests: 14.9% on search tasks, 14.0% on agentic reasoning, 14.5% on mathematical reasoning, and 4.1% on scientific reasoning [4][19][21].
- Its performance surpasses that of larger models, including GPT-4o and Llama3.1-405B, demonstrating that effective system design can outweigh sheer model size [21][25].

System Architecture
- AgentFlow consists of four specialized agents: a planner for task analysis and tool selection, an executor for tool invocation, a verifier for evaluating intermediate results, and a generator for synthesizing final outputs (see the sketch after this summary) [11][13][14].
- A shared-memory design facilitates collaboration and reduces error propagation in multi-step reasoning [7][14].

Learning Mechanism
- On-policy optimization of the planner inside the agent interaction flow is crucial for adapting to environmental changes and feedback, producing a robust, self-evolving reasoning process [13][14][22].
- The Flow-GRPO algorithm addresses multi-turn credit assignment in reinforcement learning, improving training efficiency and stability on complex reasoning tasks [15][19].

Research Findings
- Online learning in real interaction environments is essential for efficient reasoning; offline supervised learning, by contrast, can lead to performance declines [22][25].
- Training lets the system autonomously discover new tool combinations and usage patterns, strengthening its problem-solving capabilities [25][29].

Future Implications
- AgentFlow represents a shift from seeking a single comprehensive model to letting agents adapt and learn continuously within a system, highlighting the potential of collaborative intelligence for complex tasks [29].
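The architecture described above can be summarized as a short control loop: the planner is the trainable policy, while the executor, verifier, and generator are collaborators communicating through shared memory. The sketch below is a hypothetical rendering of that loop, not AgentFlow's actual classes or method names.

```python
from dataclasses import dataclass, field

@dataclass
class SharedMemory:
    """Public state that all four agents read from and write to."""
    query: str
    steps: list = field(default_factory=list)   # (plan, result, verdict) per turn

def agentflow_episode(query, planner, executor, verifier, generator, max_turns=8):
    """One multi-turn episode in an AgentFlow-style system (illustrative interfaces).

    Only the planner is assumed to be the policy updated by online RL; the other
    three callables are treated as fixed components.
    """
    memory = SharedMemory(query=query)
    for _ in range(max_turns):
        plan = planner(memory)                    # pick a sub-goal and a tool
        result = executor(plan)                   # invoke the chosen tool
        verdict = verifier(memory, plan, result)  # e.g. {"solved": bool, "note": str}
        memory.steps.append((plan, result, verdict))
        if verdict.get("solved", False):
            break
    answer = generator(memory)                    # synthesize the final output
    return answer, memory                         # memory doubles as the RL trajectory
```

Because every turn is recorded in the shared memory, the same object can serve as the trajectory that the RL objective later scores.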
AI online reinforcement learning "learns by doing": Stanford team makes a small 7B model's performance soar, even surpassing GPT-4o
量子位· 2025-10-24 03:53
Core Insights
- The article introduces AgentFlow, a new paradigm in online reinforcement learning that enhances the reasoning capabilities of intelligent systems, outperforming models such as GPT-4o and Llama3.1-405B [1][4][23].

Group 1: AgentFlow Overview
- AgentFlow assembles a team of specialized agents, including a planner, executor, verifier, and generator, which collaborate through shared memory to optimize decision-making in real time [1][14][18].
- The Flow-GRPO method enables on-policy optimization of the planner agent, allowing adaptive decisions based on environmental changes and feedback from the other agents (see the sketch after this summary) [19][16].

Group 2: Performance Metrics
- Built on the Qwen-2.5-7B-Instruct model, AgentFlow shows significant improvements across benchmark tests: 14.9% on search tasks, 14.0% on agentic reasoning, 14.5% on math reasoning, and 4.1% on scientific reasoning [3][25][27].
- It outperforms much larger models, demonstrating that effective system design and training methods can matter more than simply increasing model size [27].

Group 3: Learning Mechanisms
- The article emphasizes "learning in the flow": online learning in real interactive environments is crucial for achieving efficient reasoning [28][29].
- Real-time training enables rapid error correction and improved task planning, enhancing overall system performance [30][29].

Group 4: Innovations and Findings
- The system autonomously discovers new solution paths, such as combining different search tools to improve information retrieval, showcasing its ability to adapt and innovate [33].
- AgentFlow maintains its performance gains without significantly increasing the average number of reasoning steps, indicating efficient handling of complex tasks [35].

Group 5: Future Implications
- AgentFlow presents a novel approach to agent training, advocating for systems that adapt and learn continuously rather than relying on a single comprehensive model [37][38].
- Although practical deployment is still some distance away, the potential of Agentic AI remains significant, suggesting a promising future for intelligent systems [39].
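The multi-turn credit-assignment idea behind Flow-GRPO can be read as a GRPO-style scheme: sample a group of rollouts per query, normalize the final outcome rewards within the group, and broadcast each rollout's advantage to all of its planner turns. The sketch below is a simplified illustration under that reading, not the paper's exact objective.

```python
import statistics

def flow_grpo_advantages(group_rewards, turns_per_rollout):
    """Broadcast group-normalized final rewards to every planner turn.

    group_rewards:     final (outcome) reward of each rollout for the same query
    turns_per_rollout: number of planner turns taken in each rollout
    The per-turn broadcast and the normalization details are illustrative.
    """
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0      # avoid division by zero
    advantages = []
    for reward, num_turns in zip(group_rewards, turns_per_rollout):
        adv = (reward - mean) / std
        advantages.append([adv] * num_turns)            # same advantage for every turn
    return advantages

# Example: four rollouts of the same query, only two of which succeed.
print(flow_grpo_advantages([1.0, 0.0, 1.0, 0.0], [3, 5, 4, 2]))
```

This sidesteps the need for a learned per-turn reward model, which is what makes multi-turn credit assignment tractable at training time.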
GUI agent training gets a new paradigm! Semi-online reinforcement learning lets a 7B model rival GPT-4o
量子位· 2025-09-23 11:01
Core Viewpoint
- The article introduces Semi-online Reinforcement Learning (Semi-online RL), a new training paradigm from Zhejiang University and Tongyi Laboratory's Mobile-Agent team that enhances model performance on dynamic multi-turn tasks without relying on real environment interactions [1][2][4].

Group 1: Methodology
- The Semi-online RL framework combines the stability of offline training with the long-horizon optimization capabilities of online learning, significantly improving model performance on dynamic tasks [2][10].
- Offline data is used to simulate online interactions, so the model experiences the contextual changes caused by its own actions during training [12][15].
- A patching mechanism adaptively corrects sampling bias when the model deviates from expert trajectories, strengthening the learning process [17][19].

Group 2: Key Technologies
- The framework consists of three core technologies (see the sketch after this summary):
  1. A semi-online mechanism that simulates online interactions from offline data [12].
  2. A Patching Module that adaptively repairs sampling bias [17].
  3. Long-horizon reward modeling that estimates advantages from the step level up to the trajectory level [20].

Group 3: Evaluation and Results
- A new evaluation metric, SOP (Semi-online Performance), is proposed to better reflect multi-turn performance and aligns closely with real online performance [22][23].
- Experimental results show that the UI-S1-7B model outperforms baseline models, achieving a 34.0% task success rate on AndroidWorld, close to top proprietary models [25][26].
- The model maintains a +7.1% gain on single-turn tasks, indicating that semi-online training does not sacrifice local accuracy while optimizing for long-horizon performance [28].

Group 4: Component Analysis
- The patching mechanism significantly improves data utilization and training stability, enabling effective error correction and promoting policy diversity [30][37].
- Ablation studies confirm that combining trajectory-level and step-level advantage functions, together with multi-frame historical observations, improves decision-making in complex GUI interactions [44].
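The interaction of the semi-online mechanism and the Patching Module can be sketched as follows; `policy.act`, the `patch` matching rule, and the advantage weighting are illustrative assumptions rather than the released implementation.

```python
def semi_online_rollout(policy, expert_trajectory, patch):
    """Roll the policy through an offline expert GUI trajectory, patching divergences.

    expert_trajectory: list of (observation, expert_action) pairs from offline data.
    patch(predicted, expert): returns True if the predicted action is close enough
    to keep; otherwise the expert action replaces it (the "patching" repair).
    """
    rollout = []
    for obs, expert_action in expert_trajectory:
        predicted = policy.act(obs)
        if patch(predicted, expert_action):
            action, patched = predicted, False      # model's own action stands
        else:
            action, patched = expert_action, True   # repair the sampling bias
        rollout.append({"obs": obs, "action": action, "patched": patched})
        # The next observation still comes from the offline data, so the model
        # experiences multi-turn context without touching a live device.
    return rollout

def mixed_advantage(step_adv, traj_adv, weight=0.5):
    """Blend step-level and trajectory-level advantages (the combination ablated above);
    the equal weighting is an assumption for illustration."""
    return weight * step_adv + (1 - weight) * traj_adv
```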
SOTA on two global leaderboards! Minglue Technology's proprietary large model Mano opens a new era of intelligent GUI operation
机器之心· 2025-09-21 05:26
Core Viewpoint
- Minglue Technology's proprietary GUI model, Mano, has achieved record-breaking SOTA results on the recognized Mind2Web and OSWorld benchmarks, establishing a new paradigm for GUI agents through innovations in online reinforcement learning and automatic data collection [1][14][23].

Group 1: Performance Achievements
- Mano achieved a 40.1% success rate on the OSWorld-Verified benchmark, surpassing models such as qwen and GUI-Owl [10][19].
- On Mind2Web, Mano led across metrics including element accuracy and step success rate, significantly outperforming all other SOTA methods [18][15].
- Its OSWorld-Verified success rate reached 41.6±0.7%, roughly a 7-percentage-point improvement over competitors [21][19].

Group 2: Innovations and Methodology
- Mano introduces online reinforcement learning as a new training paradigm for GUI interaction, improving its performance in dynamic environments (see the sketch after this summary) [22][23].
- The architecture consists of three main components: an exploration module, a processing pipeline, and an optimization process, which collectively improve reasoning and adaptability [25][26].
- An automatic data collection method developed by the team significantly improves the efficiency and accuracy of data acquisition, producing high-quality interaction-trajectory data [48][49].

Group 3: Market Context and Future Directions
- Demand for AI agents is expected to surge by 2025, positioning Mano for differentiated competition by reaching data sources other agents cannot [59][63].
- Minglue Technology plans to continue exploring data collection, training integration, and CAPTCHA handling to further optimize Mano for real-world applications [66].
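The online-RL training described above follows the usual agent-environment pattern, adapted to GUI interaction. The sketch below shows that pattern only; `env`, `policy`, and `reward_fn` are placeholder names, and none of the implementation details come from the Mano paper.

```python
def online_gui_rl(env, policy, reward_fn, num_episodes, update_every=8):
    """Generic online-RL loop for a GUI agent (illustrative interfaces only).

    Assumptions: `env.reset()` returns a GUI observation (e.g. a screenshot plus
    metadata), `env.step(action)` returns (next_obs, done), the policy emits
    click/type/scroll actions, and `reward_fn` scores task completion.
    """
    buffer = []
    for episode in range(num_episodes):
        obs, done = env.reset(), False
        trajectory = []
        while not done:
            action = policy.act(obs)             # e.g. {"type": "click", "x": ..., "y": ...}
            next_obs, done = env.step(action)    # the live UI reacts to the action
            trajectory.append((obs, action))
            obs = next_obs
        reward = reward_fn(trajectory)           # task-level success signal
        buffer.append((trajectory, reward))
        if (episode + 1) % update_every == 0:    # on-policy update on fresh rollouts
            policy.update(buffer)
            buffer.clear()
```

Training on rollouts gathered from the live environment, rather than a fixed offline dataset, is what lets the agent adapt to dynamic interface changes.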
A first! Flow matching models get GRPO: near-perfect GenEval scores and compositional image generation far beyond GPT-4o
机器之心· 2025-05-13 07:08
Core Viewpoint
- The article introduces Flow-GRPO, the first algorithm to integrate online reinforcement learning into flow matching models, significantly enhancing their performance on image and video generation tasks [2][22].

Group 1: Introduction and Background
- Flow matching models have a solid theoretical foundation and excel at generating high-quality images and videos, but they struggle with complex scenes involving multiple objects and relationships [1].
- Online reinforcement learning has made significant strides in language models but remains at an early stage in image generation [1].

Group 2: Flow-GRPO Overview
- Flow-GRPO combines online reinforcement learning with flow matching, raising SD3.5 Medium's accuracy on the GenEval benchmark from 63% to 95% [2][14].
- Its successful implementation opens new avenues for improving the controllability, composability, and reasoning capabilities of flow matching generation models [2][22].

Group 3: Key Strategies of Flow-GRPO
- The core of Flow-GRPO lies in two key strategies (see the sketch after this summary):
  1. An ODE-SDE equivalence transformation, which enables effective RL exploration without altering the model's fundamental characteristics [6][8].
  2. Denoising reduction, which accelerates data collection by using fewer denoising steps during training while keeping the full schedule at inference for high-quality outputs [12][22].

Group 4: Experimental Results
- Flow-GRPO performs exceptionally across text-to-image generation tasks, significantly improving complex compositional generation and achieving near-perfect results in object counting, spatial relationship understanding, and attribute binding [14][19].
- Visual text rendering accuracy improved from 59% to 92%, showing the model can accurately render text within images [19][21].
- Flow-GRPO also makes significant progress on human preference alignment, effectively reducing reward hacking while maintaining image quality and diversity [21][22].

Group 5: Conclusion and Future Outlook
- Flow-GRPO reveals a viable path for continuously improving flow matching generation models through online reinforcement learning [22].
- Its success suggests promising potential for future advances in controllability, composability, and reasoning across multi-modal generation, including images, videos, and 3D content [22].
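The first strategy can be pictured as switching the sampler from a deterministic ODE to a noise-injecting SDE at training time. The sketch below shows only the structure of such a sampler: it deliberately omits the drift correction Flow-GRPO uses to keep the sampling distribution consistent, `model(x, t)` is a hypothetical velocity-field interface, and the noise scale and time convention are assumptions.

```python
import torch

def sample_flow(model, x1, num_steps, use_sde=False, noise_scale=0.7):
    """Integrate a flow-matching model from noise (t=1) toward data (t=0).

    use_sde=False: deterministic ODE sampling (standard inference).
    use_sde=True:  inject Gaussian noise at each step so RL can explore
                   diverse outputs; the exact drift correction that preserves
                   the model's behavior is omitted here for brevity.
    """
    x = x1
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        dt = t_next - t                           # negative: integrating toward t = 0
        v = model(x, t)                           # predicted velocity field
        x = x + v * dt                            # Euler ODE step
        if use_sde and i < num_steps - 1:
            sigma = noise_scale * (-dt).sqrt()
            x = x + sigma * torch.randn_like(x)   # stochastic exploration
    return x

# Denoising reduction: collect RL rollouts with few steps (cheap, stochastic),
# but evaluate and deploy with the full schedule (high quality, deterministic).
# rollouts = sample_flow(model, noise, num_steps=10, use_sde=True)
# samples  = sample_flow(model, noise, num_steps=40, use_sde=False)
```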