Reinforcement Learning (RL)
How will RL improve the generalization of embodied VLA large models? A Tsinghua University team's NeurIPS 2025 paper analyzes the generalization gap between RL and SFT
机器之心· 2025-10-12 02:41
Core Insights
- The article discusses the potential of Vision-Language-Action (VLA) large models in embodied intelligence, highlighting the limitations of current supervised fine-tuning (SFT) methods in generalizing to new environments and tasks, and emphasizing the advantages of reinforcement learning (RL) in enhancing the generalization capabilities of VLA models [2][4].

Group 1: Research Findings
- A new evaluation benchmark was created to address the limited generalization of VLA models, comparing how well RL and SFT improve model robustness across visual, semantic, and execution challenges [4].
- Experiments revealed that RL algorithms such as Proximal Policy Optimization (PPO) significantly improved the model's robustness in semantic understanding and task execution, while maintaining performance comparable to SFT in visually varied scenarios [4][11].

Group 2: RL Methodology
- The research team tested three RL algorithms: PPO, Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). PPO outperformed DPO and GRPO on multi-step decision tasks, which the authors attribute to the partially observable Markov decision process (POMDP) character of robotic tasks [9][11].
- To make PPO training on VLA models more efficient, three key innovations were introduced: a shared actor-critic architecture that reduces memory usage by 45% and increases training speed by 35%, a warm-up strategy using 140 high-quality trajectories that improves convergence speed by 50%, and limiting PPO training to a single epoch, which cuts training time substantially [13][15] (a minimal sketch of such a shared head follows this summary).

Group 3: Comparison of SFT and RL
- The research explored the data-scale limits of SFT, finding that performance saturates at around 16,000 demonstration trajectories. In contrast, RL achieved a 42.6% performance improvement on out-of-distribution tasks, indicating superior generalization [18][19].
- A comprehensive evaluation benchmark was constructed to dissect the generalization differences between SFT and RL along visual, semantic, and execution dimensions, with RL showing clear advantages in semantic understanding and execution robustness [21][23].

Group 4: Practical Implications
- The study underscores the core value of RL for developing truly generalizable embodied agents, which becomes increasingly important as robotic applications grow more complex and variable. The team has open-sourced RLinf, a large-scale RL framework for embodied intelligence, to facilitate further research [25].
- Visual analysis of specific cases revealed deeper differences, such as RL's ability to maintain task stability under noise and to handle unseen objects effectively, in contrast to SFT's tendency to get stuck in repetitive actions [26].
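The "shared actor-critic architecture" mentioned above can be pictured as one VLA backbone feeding both a policy head and a value head, so PPO does not have to keep two full copies of the model in memory. The sketch below is a minimal, hedged illustration of that idea, not the paper's actual implementation; the class name, layer sizes, and tensor shapes are assumptions for demonstration.

```python
# Minimal sketch of a shared actor-critic head for PPO on a VLA-style policy.
# All names and shapes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class SharedActorCriticVLA(nn.Module):
    """One shared backbone; separate lightweight actor and critic heads."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, action_dim: int):
        super().__init__()
        self.backbone = backbone                              # VLA trunk (e.g., LoRA-tuned)
        self.actor_head = nn.Linear(hidden_dim, action_dim)   # action logits
        self.critic_head = nn.Linear(hidden_dim, 1)           # state-value estimate

    def forward(self, obs_tokens: torch.Tensor):
        h = self.backbone(obs_tokens)                         # (batch, hidden_dim) features
        return self.actor_head(h), self.critic_head(h).squeeze(-1)

# Toy usage with a small stand-in backbone: the same forward pass serves both the
# PPO policy ratio and the value loss, which is where the memory saving comes from.
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
model = SharedActorCriticVLA(backbone, hidden_dim=64, action_dim=7)
logits, value = model(torch.randn(4, 32))
```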
Word is everyone is going all-in on post-training? The best guide is here
机器之心· 2025-10-09 02:24
Core Insights
- The article emphasizes the shift in focus from pre-training to post-training in large language models (LLMs), highlighting the diminishing returns of scaling laws as model sizes reach hundreds of billions of parameters [2][3][11].

Group 1: Importance of Post-Training
- Post-training is recognized as a crucial phase for enhancing the reasoning capabilities of models such as OpenAI's o series, DeepSeek R1, and Google Gemini, marking it as a necessary step toward advanced intelligence [3][11].
- The article introduces a range of post-training methods, including Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning from AI Feedback (RLAIF), and Reinforcement Learning with Verifiable Rewards (RLVR) [2][3][12].

Group 2: Transition from Pre-Training to Post-Training
- The evolution from pre-training to instruction fine-tuning is discussed: foundation models are trained on large datasets to predict the next token, but often lack practical utility in real-world applications [7][8].
- Post-training aims to align model behavior with user expectations, prioritizing quality over quantity in the datasets used, which are typically smaller but more refined than pre-training datasets [11][24].

Group 3: Supervised Fine-Tuning (SFT)
- Supervised fine-tuning (SFT) is described as the process that turns a pre-trained model into one that can follow user instructions effectively, relying on high-quality instruction-answer pairs [21][24].
- The quality of the SFT dataset is critical; even a small number of low-quality samples can degrade the model's performance [25][26].

Group 4: Reinforcement Learning Techniques
- Reinforcement learning (RL) is highlighted as a complex yet effective method for model fine-tuning, with reward mechanisms such as RLHF, RLAIF, and RLVR employed to improve model performance [39][41].
- The article outlines the importance of reward models in RLHF, which are trained on human preference data to guide model outputs [44][46] (a minimal reward-model loss sketch follows this summary).

Group 5: Evaluation of Post-Training Models
- Evaluating post-trained models is multifaceted, requiring a combination of automated and human assessments to capture different aspects of quality [57][58].
- Automated evaluations are cheap and fast, while human evaluations provide a more subjective quality measure, especially for nuanced tasks [59][60].
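RLHF's central component, the reward model, is typically trained on pairs of responses where humans marked one as preferred. Below is a minimal, hedged sketch of the standard pairwise (Bradley-Terry) loss for that step; the function name and tensor shapes are assumptions for illustration, not taken from the article.

```python
# Minimal sketch of the pairwise preference loss used to train an RLHF reward model.
# Names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """r_chosen / r_rejected: scalar reward scores, shape (batch,), for the
    human-preferred and rejected responses to the same prompt."""
    # Maximize the margin r_chosen - r_rejected via -log(sigmoid(margin)).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: in practice the scores come from a reward head on top of an LLM.
loss = reward_model_loss(torch.randn(8), torch.randn(8))
```

In RLHF the policy is then optimized (for example with PPO) against this reward model's scores, usually with a KL penalty toward the SFT reference model; RLVR swaps the learned reward model for a programmatic checker.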
The R1 paper penned by Liang Wenfeng makes the cover of Nature! A first response to three major outside doubts
AI前线· 2025-09-18 02:28
Core Viewpoint
- The article highlights the significant breakthrough of DeepSeek's AI model, DeepSeek-R1, which has successfully passed peer review and is recognized as the first large language model to achieve this milestone, marking a notable advancement for Chinese AI research on the global stage [3][8].

Summary by Sections

Model Development and Features
- DeepSeek-R1 uses reinforcement learning (RL) to develop reasoning capabilities without relying on extensive human-annotated data, showcasing a novel approach to training AI models [3][12].
- The model was built on DeepSeek-V3 Base, with training focused on rewarding correct predictions to encourage longer and more logical responses [3][12] (a minimal sketch of such a verifiable reward follows this summary).
- The training cost for DeepSeek-R1 was approximately $294,000, significantly lower than that of competitors, which often spend tens of millions [6][12].

Peer Review Process
- The peer review of DeepSeek-R1 involved eight external experts over five months, producing a review document three times the length of the original paper [9][12].
- The review addressed originality, methodology, and robustness, leading to improvements in the final published version [9][12].

Data and Safety Measures
- The pre-training data for DeepSeek-V3 Base was sourced entirely from the internet, with significant effort made to clean it and avoid contamination, removing around 6 million potentially contaminated samples [6][12].
- DeepSeek-R1 has implemented external risk-control mechanisms and real-time audits, demonstrating safety performance superior to other mainstream models such as Claude-3.7-Sonnet and GPT-4o [6][12].

Impact and Future Directions
- The innovative use of pure reinforcement learning in DeepSeek-R1 is expected to influence future research on large language models, with many researchers looking to apply similar methods to enhance reasoning across domains [12][14].
- Despite some concerns about the transparency of the training-data composition, the model has shown competitive performance in balancing accuracy and cost on scientific task challenges [14][12].
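The summary notes that training rewarded correct predictions rather than relying on extensive human-annotated reasoning data. A rule-based, verifiable reward of that flavor can be as simple as checking the model's final answer against a reference; the sketch below is a hedged illustration under the assumption that answers are wrapped in \boxed{...}, and is not DeepSeek's actual reward code.

```python
# Minimal sketch of a verifiable (rule-based) reward: 1.0 if the extracted final answer
# matches the reference, else 0.0. The \boxed{...} convention is an assumption.
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0                                    # no parsable answer, no reward
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

print(verifiable_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
```

Because the reward comes from a checker rather than a learned model, it is much harder to game than a neural reward model, which is one reason pure-RL pipelines of this kind can scale reasoning training without human-labeled rationales.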
A Chinese-founded AI recruiting startup tops $10 million ARR in two years, while Mercor's annualized revenue has reached $500 million
投资实习所· 2025-09-16 05:38
Core Insights
- The article discusses the shift in demand from Generalist AI Tutors to Specialist AI Tutors across fields such as STEM, finance, medicine, and security, as evidenced by the rapid growth of Mercor [2][6]
- Mercor's annual revenue run rate has surged from $1 million to $500 million in just 17 months, indicating a significant acceleration in growth [2][3]
- The emergence of new knowledge-based jobs focused on training AI agents is highlighted, suggesting a transformation of the job market [3][12]

Group 1: Industry Trends
- Overall demand in the AI industry is evolving toward specialized roles, reflecting a broader trend of the economy itself becoming a reinforcement-learning environment [2][6]
- The fear of job loss due to technological advances is contrasted with the creation of new job categories, particularly in training AI agents [6][12]
- Historical context is provided, noting that previous technological revolutions also created new job categories despite initial fears of unemployment [6][12]

Group 2: Company Performance
- Mercor's revenue growth is notable, with a reported annual revenue run rate of over $500 million, reflecting rapidly rising demand for AI recruitment services [2][3]
- The company currently pays over $1 million daily to experts across various fields, indicating a robust recruitment model [3][14]
- Mercor positions itself as an AI recruitment platform focused on providing talent to AI companies, particularly for reinforcement learning [14][15]

Group 3: Future of Work
- The future of work is expected to center on training AI agents, with a significant market for human labor in creating and validating training environments for AI [11][12]
- The article posits that as long as there are tasks humans can perform but AI cannot, there will be a need for human involvement in training and evaluation [11][12]
- The concept of an "experience era" is introduced, in which models learn to optimize rewards in real-world scenarios, necessitating human feedback and guidance [13]
SimpleVLA-RL: Breaking the VLA model training bottleneck with end-to-end online RL training
自动驾驶之心· 2025-09-15 03:56
Core Insights
- The article discusses the development of the SimpleVLA-RL framework, which enhances the training of Vision-Language-Action (VLA) models in robotics through reinforcement learning (RL), addressing key challenges of data scarcity and limited generalization [3][4][6].

Group 1: Research Background and Core Issues
- VLA models are crucial for robotic manipulation, integrating visual perception, language understanding, and action generation, but current training methods face two main bottlenecks: data scarcity and weak generalization [4][6].
- The traditional training pipeline relies heavily on large-scale human operation data, which is costly and difficult to scale, limiting model scalability [4][6].
- The article asks whether RL can enhance the long-horizon action planning of VLA models, despite the unique challenges posed by VLA applications [4][6].

Group 2: SimpleVLA-RL Framework Contributions
- SimpleVLA-RL is designed to improve VLA training efficiency, particularly in data-scarce settings, and has achieved state-of-the-art (SOTA) performance on benchmarks such as LIBERO and RoboTwin [7][8].
- The framework combines interactive trajectory sampling, parallel training across multiple environments, and a unified design for training, inference, and rendering, addressing the slow interaction and high cost of VLA training [7][8] (a minimal parallel-rollout sketch follows this summary).
- It has delivered significant gains in success rates across tasks, raising LIBERO's average success rate from 91.0% to 99.1% and RoboTwin 2.0's from 38.3% to 68.8% [7][8][14].

Group 3: Data Efficiency and Generalization
- SimpleVLA-RL sharply reduces the dependency on large-scale demonstration data, reaching an average success rate of 96.9% with only a single demonstration trajectory and surpassing full-trajectory supervised fine-tuning [19][20].
- The framework improves robustness across different scenes, objects, and tasks, outperforming traditional methods on unseen tasks [21][24].

Group 4: Real-World Deployment and Innovations
- The framework has shown effective sim-to-real transfer, with real-world task success rates improving from 17.5% to 38.5% while training only on simulated data [24][27].
- A notable discovery is the "Pushcut" phenomenon, in which the RL-trained model autonomously discovers strategies more efficient than the human demonstrations, pointing to a capacity for innovative behavior in VLA models [25][30].

Group 5: Summary and Conclusions
- SimpleVLA-RL addresses three core issues in VLA model training: reducing reliance on large-scale demonstration data, enhancing generalization, and achieving efficient sim-to-real transfer [31][32].
- The findings suggest that RL can enable VLA models to explore superior strategies, paving the way for more autonomous and adaptive robotic systems [31][32].
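The summary's "interactive trajectory sampling" and "parallel training across multiple environments" amount to collecting rollouts from many simulator instances at once and feeding them into a single RL update. The following is a minimal, hedged sketch of that pattern; the environment and policy APIs (reset/step/act) are generic assumptions, not SimpleVLA-RL's actual interfaces.

```python
# Minimal sketch of parallel rollout collection for online RL on a VLA-style policy.
# The env/policy interfaces below are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

def rollout(env, policy, max_steps: int = 200):
    obs, trajectory = env.reset(), []
    for _ in range(max_steps):
        action = policy.act(obs)                       # VLA policy: observation -> action
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory

def collect_parallel(envs, policy):
    # One rollout per simulator instance; the batch of trajectories feeds one RL update.
    with ThreadPoolExecutor(max_workers=len(envs)) as pool:
        return list(pool.map(lambda env: rollout(env, policy), envs))
```

In a full system the sampling, rendering, and policy inference would typically run as separate parallelized services rather than Python threads; the sketch only shows the data flow.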
Classes are officially open! A hands-on tutorial on embodied "brain" and "cerebellum" algorithms is here
具身智能之心· 2025-09-15 00:04
Core Insights
- The exploration towards Artificial General Intelligence (AGI) highlights embodied intelligence as a key direction, focusing on the interaction and adaptation of intelligent agents within physical environments [1][3]
- The development of embodied intelligence technology has evolved through various stages, from low-level perception to high-level task understanding and generalization [6][14]

Industry Analysis
- In the past two years, numerous star teams in the field of embodied intelligence have emerged, establishing valuable companies such as Xinghaitu, Galaxy General, and Zhujidongli, transitioning from laboratories to commercial and industrial applications [3]
- Major domestic companies like Huawei, JD, Tencent, Ant Group, and Xiaomi are actively investing and collaborating to build a comprehensive ecosystem for embodied intelligence, while international players like Tesla and investment firms support advancements in autonomous driving and warehouse robotics [5]

Technological Evolution
- The evolution of embodied intelligence technology has progressed through several phases:
  - The first phase focused on grasp pose detection, which lacked the ability to model task context and action sequences [6]
  - The second phase introduced behavior cloning, allowing robots to learn from expert demonstrations but revealing weaknesses in generalization and performance in multi-target scenarios [6] (a minimal behavior-cloning sketch follows this summary)
  - The third phase, emerging in 2023, utilized Diffusion Policy methods to enhance stability and generalization by modeling action trajectories [6][7]
  - The fourth phase, starting in 2025, explores the integration of VLA models with reinforcement learning and tactile sensing to overcome current limitations [9][11][12]

Educational Initiatives
- The demand for engineering and system capabilities in embodied intelligence is increasing as the industry shifts from research to deployment, necessitating higher engineering skills [17]
- A comprehensive curriculum has been developed to cover various aspects of embodied intelligence, including practical applications and advanced topics, aimed at both beginners and advanced learners [14][20]
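The "second phase" above, behavior cloning, is the simplest of the listed techniques to make concrete: regress the expert's action from the current observation. The sketch below is a hedged, minimal illustration with made-up dimensions, not material from the course; Diffusion Policy, the third phase, replaces this single-step regression with a denoising model over whole action trajectories.

```python
# Minimal sketch of a behavior-cloning objective: predict the expert's action from the
# observation and minimize the mean-squared error. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 7))  # obs -> 7-DoF action

def bc_loss(obs: torch.Tensor, expert_action: torch.Tensor) -> torch.Tensor:
    """obs: (batch, 128) observation features; expert_action: (batch, 7) demonstrated actions."""
    return nn.functional.mse_loss(policy(obs), expert_action)

loss = bc_loss(torch.randn(16, 128), torch.randn(16, 7))
loss.backward()   # gradients for a standard supervised update
```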
SimpleVLA-RL: Breaking the VLA model training bottleneck with end-to-end online RL training
具身智能之心· 2025-09-15 00:04
Core Insights
- The article discusses the development of the SimpleVLA-RL framework, which enhances the training of Vision-Language-Action (VLA) models in robotics through reinforcement learning (RL), addressing the limitations of traditional supervised fine-tuning (SFT) methods [2][4][30]

Group 1: Research Background and Challenges
- VLA models are crucial for integrating visual perception, language understanding, and action generation in robotic control, but current training methods face significant challenges, including data scarcity and weak generalization [2][5]
- The breakthrough in large reasoning models suggests that RL can improve the sequential action planning of VLA models, but traditional RL methods are limited by manual reward design and the high cost of environment interaction [2][5]

Group 2: Contributions of SimpleVLA-RL
- SimpleVLA-RL is designed specifically for VLA, incorporating interactive trajectory sampling and multi-environment parallel rendering, which significantly reduces training costs and improves scalability [6][9]
- The framework has achieved state-of-the-art (SOTA) performance across multiple benchmarks, with notable improvements in success rates, such as LIBERO's average success rate increasing from 91.0% to 99.1% [6][12]
- SimpleVLA-RL demonstrates strong data efficiency, achieving a LIBERO average success rate of 96.9% with only one demonstration trajectory, surpassing traditional methods [16][17]

Group 3: Generalization and Real-World Application
- The framework shows robust generalization on unseen tasks, with significant performance improvements across scenarios, indicating that it learns universal skills rather than overfitting to specific data [22][30]
- SimpleVLA-RL has proven effective for sim-to-real transfer, with real-world task success rates improving from 17.5% to 38.5%, validating its deployability [7][21]

Group 4: Key Discoveries
- The framework led to the discovery of the "Pushcut" phenomenon, in which the RL-trained model autonomously develops strategies more efficient than the human demonstrations, showcasing the potential for innovative robotic behaviors [24][30]
- The effectiveness of SimpleVLA-RL depends on the initial model capability, with significant performance gains observed when starting from a higher baseline success rate [28][29]
Top teams from Tsinghua, Shanghai AI Lab, and others release a comprehensive survey on RL for reasoning models
具身智能之心· 2025-09-15 00:04
Core Viewpoint
- The article discusses the significant advances in reinforcement learning (RL) for large reasoning models (LRMs), emphasizing its potential to enhance reasoning and logical thinking in AI systems through verifiable reward mechanisms and advanced optimization algorithms [4][8][19].

Group 1: Introduction to RL and LRM
- Reinforcement learning has been a crucial method in AI development since Sutton and Barto's landmark 1998 textbook, enabling agents to learn in complex environments from clear reward signals [4].
- The emergence of large models has provided a new platform for RL, which was initially used to align models with human preferences and is now evolving toward enhancing reasoning capabilities [5][6].

Group 2: Recent Trends and Challenges
- A new trend is emerging in which researchers aim to use RL not just for alignment but to genuinely enhance reasoning abilities, leading to the development of LRM systems [5][6].
- Significant challenges remain for the large-scale application of RL to LRMs, including reward design, algorithmic efficiency, and the need for substantial data and computational resources [6][8].

Group 3: Key Developments and Milestones
- The article highlights key milestones in RL applications for LRMs, such as OpenAI's o1 and DeepSeek-R1, which demonstrate the effectiveness of RL in achieving long-chain reasoning through verifiable rewards [13][15].
- The performance of models like o1 improves with additional RL training and with more computation at inference time, indicating a new scaling path beyond pre-training [13][15].

Group 4: Foundational Components and Problems
- The foundational components of RL for LRMs include reward design, policy optimization, and sampling strategies, which are essential for enhancing model capabilities [16] (a small sketch of one policy-optimization ingredient follows this summary).
- The article discusses foundational and contested issues in RL for LRMs, such as the role of RL, the comparison between RL and supervised fine-tuning (SFT), and the types of rewards used [16].

Group 5: Training Resources and Applications
- Training resources for RL include static corpora, dynamic environments, and infrastructure, which need further standardization and development for effective use [16].
- RL applications span coding, agentic tasks, multimodal tasks, and robotics, showcasing its versatility [16][18].

Group 6: Future Directions
- Future research directions for RL in LLMs include continual RL, memory-based RL, and model-based RL, aiming to improve reasoning efficiency and capability [18].
- Exploring new algorithms and mechanisms is seen as crucial for advancing RL's role on the path to Artificial Superintelligence (ASI) [15][19].
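As one concrete instance of the "policy optimization and sampling strategies" the survey covers, many recent reasoning-RL pipelines use a GRPO-style group-relative advantage: sample several completions per prompt, score each with a verifiable reward, and normalize rewards within the group so no learned critic is needed. The sketch below is a hedged illustration of that single ingredient, not code from the survey.

```python
# Minimal sketch of a GRPO-style group-relative advantage over verifiable rewards.
# Shapes and the normalization details are illustrative assumptions.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, samples_per_prompt) verifiable rewards, one per sampled completion."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)      # standardize within each prompt's group

# Toy usage: 2 prompts, 4 sampled completions each, binary correctness rewards.
advantages = group_relative_advantages(torch.tensor([[1., 0., 0., 1.], [0., 0., 0., 1.]]))
```

The policy-gradient step then weights each completion's token log-probabilities by its advantage, typically with a clipped probability ratio and a KL penalty toward a reference model.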
Top teams from Tsinghua, Shanghai AI Lab, and others release a comprehensive survey on RL for reasoning models, exploring the path to superintelligence
机器之心· 2025-09-13 08:54
Core Insights
- The article emphasizes the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs), marking a pivotal shift in artificial intelligence development [2][5][16]
- It highlights the emergence of large reasoning models (LRMs) that use RL with verifiable rewards to improve reasoning, showcasing advances on complex tasks such as mathematics and programming [3][5][10]

Summary by Sections

Introduction
- The introduction outlines the historical context of RL since Sutton and Barto's landmark 1998 textbook and its evolution into a crucial method for training intelligent agents to surpass human performance in complex environments [2]

Recent Trends
- A new trend is emerging in which researchers aim to enhance models' reasoning abilities through RL, moving beyond mere compliance to genuine reasoning skill [3][5]

Overview of RL in LRM
- The article reviews recent advances in RL applied to LLMs, noting significant achievements on complex logical tasks, and identifies RL as a core method for evolving LLMs into LRMs [5][12]

Foundational Components
- The foundational components of RL for LRMs include reward design, policy optimization, and sampling strategies, which are essential for effective model training [13][14]

Foundational Problems
- Key challenges in RL for LRMs include the design of appropriate reward signals, efficient scaling under computational and data constraints, and ensuring reliability in practical applications [12][16]

Training Resources
- The article discusses the necessary training resources, including static corpora, dynamic environments, and RL infrastructure, emphasizing the need for standardization and development [13][15]

Applications
- RL has been applied across tasks including coding, agentic tasks, multimodal tasks, and robotics, showcasing its versatility and potential for broader applications [13][15]

Future Directions
- Future research directions for RL in LLMs include developing new algorithms, mechanisms, and functionalities to further enhance reasoning capabilities and address existing challenges [15][16]
A 10,000-character long read! The first survey on agent self-evolution: the road toward artificial superintelligence
自动驾驶之心· 2025-09-11 23:33
Core Insights
- The article discusses the transition from static large language models (LLMs) to self-evolving agents capable of continuous learning and adaptation in dynamic environments, paving the way towards artificial superintelligence (ASI) [3][4][46]
- It emphasizes the need for a structured framework to understand and design self-evolving agents, focusing on three fundamental questions: what to evolve, when to evolve, and how to evolve [6][46]

Group 1: What to Evolve
- Self-evolving agents can improve various components such as models, memory, tools, and architecture over time to enhance performance and adaptability [19][20]
- The evolution of these components is crucial for the agent's ability to handle complex tasks and environments effectively [19][20]

Group 2: When to Evolve
- The article categorizes self-evolution into two time modes: intra-test-time self-evolution, which occurs during task execution, and inter-test-time self-evolution, which happens between tasks [22][23]
- Intra-test-time self-evolution allows agents to adapt in real time to specific challenges, while inter-test-time self-evolution leverages accumulated experience for future performance improvements [22][23] (a minimal loop illustrating the inter-test-time case follows this summary)

Group 3: How to Evolve
- Self-evolution emphasizes a continuous learning process where agents learn from real-world interactions, seek feedback, and adjust strategies dynamically [26][27]
- Methodologies for self-evolution include reward-based evolution, imitation learning, and population-based approaches, each with distinct feedback types and data sources [29][30]

Group 4: Applications and Evaluation
- Self-evolving agents have significant potential in various fields, including programming, education, and healthcare, where continuous adaptation is essential [6][34]
- Evaluating self-evolving agents presents unique challenges, requiring metrics that capture adaptability, knowledge retention, and long-term generalization capabilities [34][36]

Group 5: Future Directions
- The article highlights the importance of addressing challenges such as catastrophic forgetting, knowledge transfer, and ensuring safety and controllability in self-evolving agents [40][43]
- Future research should focus on developing scalable architectures, dynamic evaluation methods, and personalized agents that can adapt to individual user preferences [38][44]
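To make the "inter-test-time" mode concrete, the toy loop below has an agent attempt a task, receive feedback, and distill that feedback into a persistent memory that conditions its next attempt. It is a hedged, minimal sketch under assumptions of my own (the class and its methods are not from the survey), intended only to show where "what to evolve" (memory) and "when to evolve" (between tasks) fit together.

```python
# Minimal sketch of inter-test-time self-evolution: evolve a memory between tasks.
# All names and the solve/evolve interfaces are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class SelfEvolvingAgent:
    memory: list = field(default_factory=list)           # accumulated lessons ("what to evolve")

    def solve(self, task: str) -> str:
        context = " | ".join(self.memory[-5:])            # condition the attempt on recent lessons
        return f"attempt at '{task}' given lessons: {context or 'none'}"

    def evolve(self, task: str, feedback: str) -> None:
        # "How to evolve": turn external feedback into a reusable lesson.
        self.memory.append(f"{task}: {feedback}")

agent = SelfEvolvingAgent()
for task, feedback in [("plan a trip", "forgot the budget"), ("plan a trip", "looks good")]:
    print(agent.solve(task))
    agent.evolve(task, feedback)                          # "when to evolve": between tasks
```

A full system would evolve more than a text memory (tools, prompts, or even model weights) and would gate updates on evaluation to avoid the forgetting and safety issues the survey raises.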