Reinforcement Learning (RL)
SFT or RL: How Exactly Should VLA Models Be Trained?
具身智能之心· 2025-10-28 00:02
Core Insights
- The articles focus on advancements in Reinforcement Learning (RL) and its application to Vision-Language-Action (VLA) models, highlighting significant improvements in generalization capabilities and training efficiency.
Group 1: Research Findings
- The first study investigates how RL enhances the generalization ability of VLA models, addressing the error accumulation and distribution shift caused by supervised fine-tuning (SFT). A new benchmark covering visual, semantic, and execution dimensions was established, showing that RL fine-tuning with Proximal Policy Optimization (PPO) significantly improves semantic understanding and execution robustness while maintaining visual generalization comparable to SFT (see the PPO sketch after this summary) [2].
- The second study introduces RLinf-VLA, a framework designed for large-scale RL training of VLA models. It proposes a novel solution to the challenges of integrating RL and VLA training, achieving up to 2.27x acceleration over baseline methods. The framework supports various VLA architectures and RL algorithms, reaching a 98.11% success rate across 130 LIBERO tasks [3].
Group 2: Practical Applications
- RLinf-VLA summarizes best practices for applying RL to VLA training, providing a unified interface for multiple VLA architectures and simulators and thus lowering the barrier to large-scale RL for VLA applications [3].
- The research emphasizes the importance of RL in enhancing VLA model performance, suggesting a shift toward more efficient training methodologies that leverage RL's strengths [15].
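The PPO fine-tuning referenced above optimizes a clipped surrogate objective over the action tokens the VLA policy emitted during rollouts. Below is a minimal, framework-agnostic sketch of that objective in PyTorch; the tensor values, shapes, and clipping threshold are illustrative assumptions standing in for real rollout data, not the papers' actual training code.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss over sampled action tokens (to minimize)."""
    ratio = torch.exp(logp_new - logp_old)            # pi_new / pi_old per step
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy rollout data: log-probs the policy assigned to its executed action tokens
# before and after the update, plus advantage estimates (e.g., from GAE).
logp_old = torch.tensor([-1.2, -0.8, -2.1, -0.5])
logp_new = torch.tensor([-1.0, -0.9, -1.8, -0.6], requires_grad=True)
advantages = torch.tensor([0.7, -0.3, 1.1, 0.2])

loss = ppo_clip_loss(logp_new, logp_old, advantages)
loss.backward()   # in real training, gradients flow into the VLA policy weights
print(float(loss))
```

The clipping term is what keeps the fine-tuned policy close to the SFT initialization at each update, which is one reason PPO tends to preserve visual generalization while improving execution robustness.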
Teaching VLMs to Keep a World in Mind: VAGEN Uses Multi-Turn RL to Turn Visual Intelligence into a World-Model Reasoning Machine
机器之心· 2025-10-25 03:20
Core Insights
- The article discusses the limitations of Vision-Language Models (VLMs) in complex visual tasks, highlighting their tendency to act impulsively rather than thoughtfully because their perception of the world is limited and noisy [2][6].
- The VAGEN framework aims to enhance VLMs by teaching them to construct an internal world model before taking actions, thereby promoting a more structured thinking process [3][12].
Group 1: VAGEN Framework
- VAGEN enforces a structured "thinking template" for VLMs, which includes two core steps: State Estimation (observing the current state) and Transition Modeling (predicting future outcomes) [7][11].
- The framework uses reinforcement learning (RL) to reward this structured thinking process, demonstrating that the "World Modeling" strategy significantly outperforms both "No Think" and "Free Think" approaches [12][32].
Group 2: Internal Monologue and Reward Mechanism
- The research explores the best format for the agent's internal monologue, finding that the optimal representation depends on the nature of the task [13][14].
- VAGEN introduces two key components in its reward mechanism: a World Modeling Reward, which provides immediate feedback after each thought step, and Bi-Level GAE for efficient credit assignment (a template-and-reward sketch follows this summary) [18][20].
Group 3: Performance Results
- The VAGEN-Full model, based on a 3B VLM, achieved an overall score of 0.82 across five diverse tasks, outperforming various other models including GPT-5 [27][30].
- The results indicate that VAGEN-Full not only surpasses untrained models but also exceeds several proprietary models, showcasing its effectiveness in enhancing VLM capabilities [30][32].
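To make the "thinking template" and the World Modeling Reward concrete, here is a small sketch: the model must fill in a state-estimation block and a transition-modeling block before its action, and a format reward gives immediate feedback when it does. The tag names, bonus value, and regex check are illustrative assumptions; VAGEN's actual prompt format and reward terms (including Bi-Level GAE) are more elaborate.

```python
import re

# Hypothetical tag names for the two-step thinking template, shown for reference;
# the real VAGEN prompt format may differ.
TEMPLATE = (
    "<state>Describe what you currently observe.</state>\n"
    "<prediction>Predict how the scene changes after your action.</prediction>\n"
    "<action>The action to execute.</action>"
)

def world_modeling_reward(response: str, bonus: float = 0.5) -> float:
    """Immediate format reward: did the agent estimate the state and model the
    transition before acting? Task success would be rewarded separately."""
    ok = all(re.search(fr"<{tag}>.+?</{tag}>", response, re.S)
             for tag in ("state", "prediction", "action"))
    return bonus if ok else 0.0

reply = ("<state>The red block is left of the bowl.</state>\n"
         "<prediction>Pushing right will move the block into the bowl.</prediction>\n"
         "<action>push_right</action>")
print(world_modeling_reward(reply))   # 0.5
```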
RL Is the New Fine-Tuning
海外独角兽· 2025-10-24 12:06
Core Insights
- The article discusses the resurgence of LoRA (Low-Rank Adaptation) as a model fine-tuning technique, demonstrating that it can achieve performance comparable to full-parameter fine-tuning with fewer computational resources under specific conditions [2][6][10].
- The shift from model fine-tuning to Reinforcement Learning (RL) is highlighted, with industry experts suggesting that integrating RL into the lifecycle of agents will become a mainstream approach [4][21].
- OpenPipe, initially focused on LoRA, has transitioned to a comprehensive RL product line following its acquisition by CoreWeave, indicating a strategic pivot in response to market demands [2][8].
Group 1: LoRA's Resurgence
- LoRA is no longer viewed merely as a cost-effective alternative to full-parameter fine-tuning but is recognized for its efficiency in model customization [10][11].
- The ability to deploy multiple LoRA adapters on a single GPU allows pricing by tokens rather than by GPU time (a minimal LoRA layer sketch follows this summary) [3][10].
- The initial decline in LoRA's popularity was due to a general disinterest in fine-tuning, but recent research has improved its reputation [11][14].
Group 2: Transition to Reinforcement Learning
- The transition to RL is driven by the need to transfer the capabilities of large models to smaller ones, particularly in scenarios requiring low latency [18][20].
- Companies deploying agents will need to incorporate RL either before deployment or continuously afterward, making it a critical component of agent lifecycle management [21][22].
- The primary challenge in implementing RL is the construction of training environments, which currently requires significant manual effort [4][23][48].
Group 3: OpenPipe's Evolution
- OpenPipe was founded to provide a standardized hosting service for model distillation, enabling companies to leverage GPT-4 capabilities at a lower cost [7][8].
- The company experienced rapid growth, achieving an ARR of over $1 million within eight months, driven by market expansion and improved open-source model quality [8][10].
- The acquisition by CoreWeave marks a significant milestone, allowing OpenPipe to enhance its RL offerings and address the evolving needs of the AI market [2][8].
Group 4: Challenges in RL Implementation
- Building robust and reusable training environments remains the biggest hurdle for RL deployment, with many companies struggling to create effective simulation environments [23][25][26].
- The complexity of accurately replicating production environments poses significant challenges for training agents, particularly in dynamic and user-interactive scenarios [25][26].
- The development of World Models is proposed as a potential solution to the environmental challenges faced in RL, enabling agents to simulate and understand external feedback [51][52].
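LoRA's efficiency comes from freezing the base weights and training only a low-rank update, W + (alpha/r)·BA, which is also why many adapters can share one base model on a single GPU. The sketch below shows the idea on a single linear layer; the layer size, rank, and scaling are illustrative assumptions, not OpenPipe's or any particular library's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update (alpha/r)*B@A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # base model stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 65,536 trainable parameters vs. ~16.8M frozen in the base layer
```

Because only A and B differ between customers, a serving stack can keep one copy of the frozen base weights in GPU memory and swap tiny adapters per request, which is what makes per-token pricing viable.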
New from HKUST! Beyond Human Demonstrations: Diffusion-Based RL Generates "High-Quality, Low-Variance" Data for VLA Training
具身智能之心· 2025-10-23 04:00
Core Insights
- The article discusses the limitations of traditional human demonstration data for training Vision-Language-Action (VLA) models and introduces a novel diffusion-based reinforcement learning (RL) approach to generate high-quality training data [2][5].
Group 1: VLA Model and Data Generation
- VLA models integrate visual, language, and action information, but their performance is often constrained by the quality and scale of manually collected data [5].
- The proposed diffusion RL algorithm offers a semi-automated method for collecting high-quality data suitable for VLA training, enhancing model performance [5].
Group 2: Methodology and Results
- The study presents an improved diffusion policy optimization algorithm that generates high-quality, low-variance trajectories for VLA training (a toy trajectory-filtering sketch follows this summary) [2].
- Evaluation on the LIBERO benchmark, which includes 130 long-horizon tasks, shows that the generated trajectories are smoother and more consistent than human demonstration data and outperform trajectories generated by standard Gaussian RL [2].
- Training VLA models solely on data generated by diffusion RL achieves an average success rate of 81.9%, a 5.3-percentage-point improvement over human data and a 12.6-percentage-point improvement over Gaussian RL data [2].
Group 3: Key Highlights
- The article emphasizes the potential of RL-driven robot trajectory generation and the adaptability of the general RL framework to any VLA architecture [6].
- It highlights performance breakthroughs that exceed human demonstrations, showcasing the effectiveness of the proposed approach [6].
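The "high-quality, low-variance" criterion can be pictured as a filter over RL rollouts: keep only successful trajectories whose action sequences are smooth. The toy sketch below uses a simple action-smoothness metric and an arbitrary threshold as stand-ins; the paper's actual selection criteria and the diffusion policy itself are not reproduced here.

```python
import numpy as np

def action_smoothness(traj: np.ndarray) -> float:
    """Mean squared change between consecutive actions; lower means smoother."""
    return float(np.mean(np.square(np.diff(traj, axis=0))))

def filter_trajectories(trajs, successes, smooth_thresh=0.05):
    """Keep only successful rollouts whose action sequences are smooth enough
    to serve as VLA training data (a stand-in for the paper's criteria)."""
    return [t for t, ok in zip(trajs, successes)
            if ok and action_smoothness(t) < smooth_thresh]

# Toy rollouts shaped (timesteps, action_dim), e.g. sampled from an RL policy.
rng = np.random.default_rng(0)
smooth = np.cumsum(rng.normal(0.0, 0.01, size=(50, 7)), axis=0)   # small, correlated steps
jittery = rng.normal(0.0, 0.5, size=(50, 7))                      # large, erratic steps
dataset = filter_trajectories([smooth, jittery], successes=[True, True])
print(len(dataset))   # 1 -- only the smooth rollout survives the filter
```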
RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
具身智能之心· 2025-10-22 06:02
Core Insights
- The article presents RLinf-VLA, a unified and efficient framework for training Vision-Language-Action (VLA) models with Reinforcement Learning (RL), addressing the limitations of existing models that rely on supervised fine-tuning [2][53].
- The framework significantly enhances training efficiency and generalization, achieving high success rates across simulation tasks and demonstrating superior real-world performance compared to traditional supervised methods [5][53].
Framework Design
- RLinf-VLA integrates multiple simulators, algorithms, and VLA architectures, optimizing resource allocation through flexible execution modes and system-level enhancements [4][53].
- It supports three GPU allocation strategies: colocated, disaggregated, and hybrid, letting users switch modes via configuration files and thus reducing system customization costs [10][11].
Model Compatibility
- The framework supports LoRA for parameter-efficient tuning, reducing memory consumption and accelerating training while maintaining performance [12].
- It is compatible with OpenVLA and its extension OpenVLA-OFT, which have shown strong performance on various robotic manipulation benchmarks [12][22].
Multi-Simulator Support
- The framework emphasizes the importance of simulators in RL, using ManiSkill and LIBERO as the primary simulators to cover diverse tasks [13].
- It provides a unified interface across simulators, facilitating the implementation of various tasks and supporting multiple RL algorithms, initially focusing on PPO and GRPO [13][14].
Algorithm Design
- The framework incorporates advanced techniques for advantage and log-probability calculations, allowing flexible combination of block-level and action-level definitions [14][15].
- It supports various optimization strategies, including trajectory length normalization and valid-action masking, to improve training stability and performance (a minimal sketch of group-relative advantages and masked losses follows this summary) [19][20].
Experimental Results
- RLinf-VLA demonstrated significant performance improvements, with success rates increasing by 45% to 70% on various tasks compared to baseline models [22][24].
- On LIBERO tasks, the framework achieved an average success rate of 98.11%, showcasing its capability for large-scale multi-task reinforcement learning [28].
High Efficiency Performance
- The framework's efficiency is evaluated in terms of throughput, achieving substantial improvements in training speed across different GPU configurations [30][35].
- The hybrid allocation mode outperformed traditional approaches, demonstrating the benefits of pipeline overlapping for resource utilization [35][37].
Real-World Deployment
- RLinf-VLA was successfully deployed in real-world environments, showing stronger zero-shot generalization than supervised fine-tuning strategies [51][53].
- The experiments indicated that RL-trained models adapt better to real-world tasks, achieving higher success rates in object manipulation [51].
Conclusion
- RLinf-VLA represents a significant advance for embodied intelligence, providing a robust foundation for future research and development in VLA training [53].
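Two of the ingredients mentioned in the algorithm-design bullets can be sketched compactly: GRPO-style advantages, which normalize each rollout's return against the other rollouts sampled for the same task, and a masked, length-normalized policy loss that ignores padded or invalid action tokens. The code below is a generic illustration of those two ideas, with toy tensors; it is not RLinf-VLA's implementation.

```python
import torch

def group_relative_advantages(returns: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: each rollout's return is normalized against the
    other rollouts sampled for the same task prompt (one group per row)."""
    mean = returns.mean(dim=1, keepdim=True)
    std = returns.std(dim=1, keepdim=True) + 1e-6
    return (returns - mean) / std

def masked_policy_loss(logp, adv, valid_mask):
    """Policy-gradient loss where padded/invalid action tokens are masked out
    and the sum is normalized by the number of valid tokens (length norm)."""
    per_token = -(logp * adv) * valid_mask
    return per_token.sum() / valid_mask.sum().clamp(min=1)

# Two task groups with four rollouts each and their episode returns.
returns = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.2, 0.8, 0.5, 0.5]])
print(group_relative_advantages(returns))

# Per-token log-probs for two rollouts, broadcast advantages, and a mask that
# drops the padded third token of the first rollout.
logp = torch.tensor([[-0.5, -0.7, -0.3], [-0.2, -0.9, -0.1]], requires_grad=True)
adv = torch.tensor([[0.9, 0.9, 0.9], [-0.4, -0.4, -0.4]])
mask = torch.tensor([[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
print(masked_policy_loss(logp, adv, mask))
```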
GPT-5 ≈ o3.1! OpenAI Explains Its Reasoning Mechanism in Detail for the First Time: RL + Pre-training Is the Real Path to AGI
量子位· 2025-10-20 03:46
Core Insights
- The article discusses the evolution of OpenAI's models, framing GPT-5 as an iteration of the o3 model and a significant advance in AI capabilities [1][4][23].
Model Evolution
- Jerry Tworek, OpenAI's VP of Research, views GPT-5 as an iteration of o3, emphasizing the need for a model that can think longer and interact autonomously with multiple systems [4][23].
- The transition from o1 to o3 marked a structural change in AI development, with o3 being the first truly useful model capable of using tools and contextual information effectively [19][20].
Reasoning Process
- The reasoning process of models like GPT-5 is likened to human thought, involving calculation, information retrieval, and self-learning [11].
- The concept of chains of thought has become prominent since the release of the o1 model, allowing models to articulate their reasoning in human language [12].
- Longer reasoning times generally yield better results, but user feedback shows a preference for quicker responses, leading OpenAI to offer models with varying reasoning times [13][14].
Internal Structure and Research
- OpenAI's internal structure combines top-down and bottom-up approaches, focusing on a few core projects while giving researchers freedom within those projects [31][33].
- The company advanced from o1 to GPT-5 in just one year thanks to its efficient operational structure and talented workforce [33].
Reinforcement Learning (RL)
- Reinforcement learning is crucial for OpenAI's models, combining pre-training with RL to create effective AI systems [36][57].
- Jerry explains RL as training models through rewards and penalties, similar to training a dog [37][38].
- The introduction of Deep RL by DeepMind significantly advanced the field, leading to the development of meaningful intelligent agents [39].
Future Directions
- Jerry believes the future of AI lies in developing agents capable of independent thought for complex tasks, with a focus on aligning model behavior with human values [53][54].
- The path to AGI (Artificial General Intelligence) will require both pre-training and RL, with new components added over time [56][58].
Practice and Reflections from a Month of Intensive RL: How to Get Score Gains?
自动驾驶之心· 2025-10-19 23:32
Core Insights
- The article discusses recent advances and challenges in Reinforcement Learning (RL) for Vision-Language Models (VLMs), emphasizing the importance of foundational work and iterative improvements in achieving performance gains [2][4].
RL Goals
- The primary objectives for RL on VLMs are a 1-2 point improvement in overall performance over the SFT model version and gains of more than 1-2 points on specific benchmarks such as mathematics and instruction following [5].
RL Overall Approach
- The essence of RL is to improve sampling efficiency rather than to teach the base model new knowledge; given unlimited attempts, the base model can outperform the RL model in the probability of producing a correct response [7][8].
Challenges in VLM RL
- Key challenges include selecting efficient RL algorithms, meeting high infrastructure requirements, and RL's sensitivity to data quality and organization [10][12].
Data Organization
- Effective data organization is crucial, requiring a balanced mix of tasks and high-quality inputs. Output length is also strongly tied to the RL algorithm used, so the characteristics of the training data must be considered carefully [13][14].
Key Findings and Conclusions
- Short responses hurt training effectiveness, and it is essential to construct response pairs with a clear distinction between acceptable and rejectable outputs (a toy pair-construction sketch follows this summary). Meticulous data checking matters, and there is no "silver bullet" [19][24].
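The pair-construction advice can be illustrated with a small sketch: for each prompt, sample several responses, score them, and only emit a (chosen, rejected) pair when the margin is unambiguous and the chosen response is not degenerately short. The thresholds, scoring, and helper name here are assumptions for illustration, not the author's pipeline.

```python
def build_preference_pair(samples, min_len=20, min_margin=0.3):
    """samples: list of (response_text, score) sampled for one prompt.
    Return a (chosen, rejected) pair only when the quality gap is unambiguous
    and the chosen response is not degenerately short."""
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    (chosen_text, chosen_score), (rejected_text, rejected_score) = ranked[0], ranked[-1]
    if len(chosen_text.split()) < min_len:
        return None                            # short responses hurt training
    if chosen_score - rejected_score < min_margin:
        return None                            # no clear distinction -> skip the prompt
    return chosen_text, rejected_text

samples = [
    ("Let's work through it: 6 * 7 = 42, and the constraint holds, so 42.", 0.92),
    ("The answer is 41.", 0.35),
]
print(build_preference_pair(samples, min_len=5))   # toy length threshold for the short example
```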
GPT-5 Core Team Member Explains RL: Pre-training Leads to AGI Only When Combined with RL
海外独角兽· 2025-10-18 12:03
Core Insights
- The article discusses the limitations of current large language models (LLMs) and emphasizes reinforcement learning (RL) as a more viable path toward artificial general intelligence (AGI) [2][3][50].
- It highlights the interplay between pre-training and RL, suggesting that both are essential for developing advanced AI systems [16][50].
Group 1: Reinforcement Learning (RL) Insights
- Richard Sutton argues that the current LLM approach, which relies primarily on imitation, has fundamental flaws and is a "dead end" for achieving AGI, whereas RL allows models to interact with their environment and learn from experience [2].
- Andrej Karpathy points out that traditional RL is inefficient and that future intelligent systems will not rely solely on RL [2].
- Jerry Tworek emphasizes that RL must be built on strong pre-training and that the two processes are interdependent [3][16].
Group 2: Reasoning and Thought Processes
- The reasoning process in AI is likened to human thinking, where models must search for unknown answers rather than simply retrieving known ones [7][9].
- The concept of chain of thought (CoT) is introduced, where language models express their reasoning steps in human language, enhancing their ability to solve complex problems [10][11].
- The balance between output quality and response time is crucial: longer reasoning times generally yield better results, but users prefer quicker responses [12][13].
Group 3: Model Development and Iteration
- The evolution of OpenAI's models is described as a series of scaling experiments aimed at improving reasoning capabilities, with each iteration building on the previous one [13][15].
- The transition from the initial model (o1) to more advanced versions (o3 and GPT-5) reflects significant advances in reasoning and tool use [15][16].
- The integration of RL with pre-training is seen as a necessary strategy for developing more capable AI systems [16][19].
Group 4: Challenges and Future Directions
- The complexity of RL is highlighted, with rewards and penalties requiring careful management to train models effectively [20][33].
- The potential for online RL, where models learn in real time from user interactions, is discussed, though it poses risks that need to be managed [36][38].
- The ongoing challenge of alignment, ensuring models understand right from wrong, is framed as a critical aspect of AI development [39][47].
How Much Innovation Is There in AI Agents, Really?
自动驾驶之心· 2025-10-18 04:00
Core Insights
- The article discusses the current limitations and challenges of AI agent technologies compared to traditional task bots, arguing that the user experience has not improved significantly over the past decade [1][2].
Group 1: Planning Challenges
- The planning phase is time-consuming, and as the number of tools grows, the accuracy of turbo-class models declines, forcing a switch to flagship models and further increasing latency [2][5].
- Planning quality is insufficient; the workflows generated by models are less effective than those designed by humans, particularly in complex scenarios [2][8].
- The core reason planning is slow is an underestimation of the costs of tool discovery and parameter alignment, which turns dynamic tool selection into a complex optimization problem [5][21].
Group 2: Reflection Issues
- Reflection can fall into self-reinforcing cycles of inefficiency because it lacks fine-grained computable signals and clear stopping conditions [3][15].
- Current models rely on weak feedback mechanisms, which can reinforce incorrect assumptions rather than correct errors [15][20].
- Proposed remedies include structured reflection processes that let models learn from mistakes and improve through reinforcement learning [18][20].
Group 3: Engineering Solutions
- Suggestions for improving planning quality include decomposing plans into milestones and local prompts, which improves stability and reusability [8][10].
- Executing independent tasks in parallel reduces overall processing time, with evidence of a 20% time reduction for non-dependent tool calls (a minimal concurrency sketch follows this summary) [6][21].
- Routing strategies can streamline execution by directing simpler tasks to specialized executors and reserving complex planning for stronger reasoning models [6][21].
Group 4: Future Directions
- The article emphasizes combining reinforcement learning with agent models to enhance reasoning and execution, indicating a trend toward end-to-end learning approaches [20][21].
- AI agents have the potential to become a valuable real-world application of large language models (LLMs), with continued improvement expected as models evolve [21].
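The parallel-execution point is simple to show in code: when two tool calls in a plan have no data dependency, an agent runtime can dispatch them concurrently and only serialize the step that consumes both results. The sketch below uses asyncio with stand-in tools (the tool names and delays are invented for illustration).

```python
import asyncio

# Stand-in async tools; a real agent would call external services here.
async def search_flights(query: str) -> str:
    await asyncio.sleep(0.2)
    return f"flight options for {query!r}"

async def read_calendar(day: str) -> str:
    await asyncio.sleep(0.2)
    return f"events on {day}"

async def summarize(*contexts: str) -> str:
    await asyncio.sleep(0.1)
    return " | ".join(contexts)

async def run_plan() -> str:
    # These two calls do not depend on each other, so the runtime dispatches
    # them concurrently instead of one after the other.
    flights, events = await asyncio.gather(search_flights("SFO->NRT"),
                                           read_calendar("2025-11-01"))
    # The final step depends on both results, so it runs afterwards.
    return await summarize(flights, events)

print(asyncio.run(run_plan()))
```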
NeurIPS 2025 | Tsinghua Team Analyzes How RL Improves VLA Generalization
具身智能之心· 2025-10-15 04:00
Core Insights
- The article discusses the potential of Vision-Language-Action (VLA) models in embodied intelligence and the limitations of current supervised fine-tuning (SFT) methods in achieving human-like generalization, emphasizing the advantages of Reinforcement Learning (RL) for improving the generalization of VLA models [1][3].
Group 1: Research Findings
- A new evaluation benchmark was created to address the limited generalization of VLA models, comparing how RL and SFT improve model robustness across visual, semantic, and execution challenges [3][19].
- Experiments showed that RL with Proximal Policy Optimization (PPO) significantly improved robustness in semantic understanding and task execution while maintaining performance in visually varied scenarios comparable to SFT [3][12].
Group 2: Methodology
- The research used the open-source OpenVLA model, fine-tuned from Llama 2 7B, with RGB image inputs and action tokens for robotic control [6].
- Three RL methods were tested: PPO, Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), with PPO showing clear advantages on multi-step decision tasks [8][15].
Group 3: PPO Training Innovations
- The research team proposed three key innovations for efficient PPO training (a toy shared actor-critic sketch follows this summary):
  1. A shared Actor-Critic architecture that reduced memory usage by 45% and improved training speed by 35% [12][14].
  2. A warm-up strategy using 140 high-quality trajectories that improved convergence speed by 50% [14].
  3. Reducing PPO training epochs to just one, which was sufficient for performance without increasing training time [14].
Group 4: Comparison of SFT and RL
- The study found that while SFT performance plateaued at 16,000 demonstration trajectories, RL achieved a 42.6% performance improvement on out-of-distribution tasks, indicating superior generalization [17][18].
- A comprehensive evaluation benchmark was developed to dissect the differences in generalization between SFT and RL across visual, semantic, and execution dimensions [19][21].
Group 5: Practical Implications
- The research underscores the core value of RL for building truly generalizable embodied agents, which matters increasingly as robotic applications become more complex and varied [25].
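The shared Actor-Critic idea is that the policy head and the value head hang off the same backbone, so the critic does not require a second full copy of the model in memory. The sketch below uses a toy MLP backbone purely for shape-checking; in the paper the value head would ride on the fine-tuned OpenVLA backbone, and the dimensions here are invented.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """One backbone serves both the policy head and the value head, instead of
    keeping two full copies of the model in memory."""
    def __init__(self, obs_dim=64, hidden=512, action_vocab=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.GELU(),
                                      nn.Linear(hidden, hidden), nn.GELU())
        self.policy_head = nn.Linear(hidden, action_vocab)  # action-token logits
        self.value_head = nn.Linear(hidden, 1)              # scalar state value

    def forward(self, obs):
        h = self.backbone(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

model = SharedActorCritic()
logits, value = model(torch.randn(2, 64))
print(logits.shape, value.shape)   # torch.Size([2, 256]) torch.Size([2])
```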