Reinforcement Learning (RL)
Word is everyone is going all-in on post-training? The best guide has arrived
机器之心· 2025-10-09 02:24
Core Insights
- The article emphasizes the shift in focus from pre-training to post-training in large language models (LLMs), highlighting the diminishing returns of scaling laws as model sizes reach hundreds of billions of parameters [2][3][11].

Group 1: Importance of Post-Training
- Post-training is recognized as a crucial phase for enhancing the reasoning capabilities of models like OpenAI's series, DeepSeek R1, and Google Gemini, marking it as a necessary step towards advanced intelligence [3][11].
- The article introduces various innovative post-training methods such as Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning from AI Feedback (RLAIF), and Reinforcement Learning with Verifiable Rewards (RLVR) [2][3][12].

Group 2: Transition from Pre-Training to Post-Training
- The evolution from pre-training to instruction fine-tuning is discussed, where foundational models are trained on large datasets to predict the next token, but often lack practical utility in real-world applications [7][8].
- Post-training aims to align model behavior with user expectations, focusing on quality over quantity in the datasets used, which are typically smaller but more refined compared to pre-training datasets [11][24].

Group 3: Supervised Fine-Tuning (SFT)
- Supervised Fine-Tuning (SFT) is described as a process that transforms a pre-trained model into one that can follow user instructions effectively, relying on high-quality instruction-answer pairs [21][24].
- The quality of the SFT dataset is critical, as even a small number of low-quality samples can negatively impact the model's performance [25][26].

Group 4: Reinforcement Learning Techniques
- Reinforcement Learning (RL) is highlighted as a complex yet effective method for model fine-tuning, with various reward mechanisms such as RLHF, RLAIF, and RLVR being employed to enhance model performance [39][41].
- The article outlines the importance of reward models in RLHF, which are trained using human preference data to guide model outputs [44][46].

Group 5: Evaluation of Post-Training Models
- The evaluation of post-training models is multifaceted, requiring a combination of automated and human assessments to capture various quality aspects [57][58].
- Automated evaluations are cost-effective and quick, while human evaluations provide a more subjective quality measure, especially for nuanced tasks [59][60].
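The RLHF reward models mentioned above are typically trained on pairwise human preference data. A minimal sketch of the standard Bradley-Terry pairwise loss (illustrative of the general recipe, not tied to any specific system in the article):

```python
import math

def reward_model_pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss used to train RLHF reward models:
    minimizing it pushes the reward of the human-preferred answer
    above the reward of the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Equal scores give the maximum-uncertainty loss ln(2) ≈ 0.693;
# a large positive margin drives the loss toward zero.
print(round(reward_model_pairwise_loss(0.0, 0.0), 3))   # 0.693
print(round(reward_model_pairwise_loss(4.0, -4.0), 3))  # 0.0
```

In a real pipeline the scalar rewards come from a learned head on a language model, and the gradient of this loss updates that head; only the objective is shown here.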
The R1 paper penned by Liang Wenfeng makes the cover of Nature! A first response to three major outside doubts
AI前线· 2025-09-18 02:28
Core Viewpoint
- The article highlights the significant breakthrough of DeepSeek's AI model, DeepSeek-R1, which has successfully passed peer review and is recognized as the first large language model to achieve this milestone, marking a notable advancement for domestic AI research on the global stage [3][8].

Summary by Sections

Model Development and Features
- DeepSeek-R1 utilizes reinforcement learning (RL) to develop reasoning capabilities without relying on extensive human-annotated data, showcasing a novel approach in AI model training [3][12].
- The model was built on DeepSeek-V3 Base, with a focus on rewarding correct predictions to enhance the generation of longer and more logical responses [3][12].
- The training cost for DeepSeek-R1 was approximately $294,000, significantly lower than competitors that often spend tens of millions [6][12].

Peer Review Process
- The peer review process for DeepSeek-R1 involved eight external experts over five months, resulting in a comprehensive review document that was three times the length of the original paper [9][12].
- The review addressed various aspects, including originality, methodology, and robustness, leading to improvements in the final published version [9][12].

Data and Safety Measures
- The pre-training data for DeepSeek-V3 Base was sourced entirely from the internet, with a significant effort made to clean the data to avoid contamination, removing around 6 million potentially polluted samples [6][12].
- DeepSeek-R1 has implemented external risk control mechanisms and real-time audits, demonstrating superior safety performance compared to other mainstream models like Claude-3.7-Sonnet and GPT-4o [6][12].

Impact and Future Directions
- The innovative use of pure reinforcement learning in DeepSeek-R1 is expected to influence future research in large language models, with many researchers looking to apply similar methods to enhance reasoning capabilities across various domains [12][14].
- Despite some concerns regarding the transparency of training data composition, the model has shown competitive performance in balancing accuracy and cost in scientific task challenges [12][14].
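The "rewarding correct predictions" signal described above can be sketched as a rule-based verifiable reward. The exact-match check below is a simplified illustration; real pipelines normalize answers far more aggressively (numeric equality, LaTeX stripping, and so on):

```python
def verifiable_reward(model_answer: str, reference: str) -> float:
    """Rule-based reward in the spirit of RL with verifiable rewards:
    the model earns reward only when its final answer matches a
    checkable reference, with no learned reward model involved."""
    def normalize(s: str) -> str:
        return s.strip().lower()
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0

print(verifiable_reward("  42 ", "42"))  # 1.0
print(verifiable_reward("41", "42"))     # 0.0
```

Because the signal is programmatic rather than annotated, it scales without human-labeled preference data, which is the property the summary attributes to R1's training.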
Chinese-founded AI recruiting firm tops $10M ARR in 2 years; Mercor's annualized revenue has hit $500M
投资实习所· 2025-09-16 05:38
Core Insights
- The article discusses the shift in demand from Generalist AI Tutors to Specialist AI Tutors across various fields such as STEM, finance, medicine, and security, as evidenced by the rapid growth of Mercor [2][6].
- Mercor's annual revenue run rate has surged from 1 million to 500 million USD in just 17 months, indicating a significant acceleration in growth [2][3].
- The emergence of new knowledge-based jobs focused on training AI agents is highlighted, suggesting a transformation in the job market [3][12].

Group 1: Industry Trends
- The overall demand in the AI industry is evolving towards specialized roles, reflecting a broader trend of the economy becoming a reinforcement learning environment [2][6].
- The fear of job loss due to technological advancements is contrasted with the creation of new job categories, particularly in training AI agents [6][12].
- Historical context is provided, noting that previous technological revolutions have also led to new job categories despite initial fears of unemployment [6][12].

Group 2: Company Performance
- Mercor's revenue growth is notable, with a reported annual revenue run rate of over 500 million USD, showcasing a rapid increase in demand for AI recruitment services [2][3].
- The company is currently paying over 1 million USD daily to experts across various fields, indicating a robust recruitment model [3][14].
- Mercor's positioning as an AI recruitment platform is emphasized, with a focus on providing talent for AI companies, particularly in reinforcement learning [14][15].

Group 3: Future of Work
- The future of work is expected to center around training AI agents, with a significant market for human labor in creating and validating training environments for AI [11][12].
- The article posits that as long as there are tasks that humans can perform but AI cannot, there will be a need for human involvement in training and evaluation [11][12].
- The concept of an "experience era" is introduced, where models learn to optimize rewards in real-world scenarios, necessitating human feedback and guidance [13].
SimpleVLA-RL: Breaking the VLA model training bottleneck, with RL enabling end-to-end online training
自动驾驶之心· 2025-09-15 03:56
Core Insights
- The article discusses the development of the SimpleVLA-RL framework, which enhances the training of Visual-Language-Action (VLA) models in robotics through reinforcement learning (RL) techniques, addressing key challenges in data scarcity and generalization capabilities [3][4][6].

Group 1: Research Background and Core Issues
- VLA models are crucial for robotic manipulation, integrating visual perception, language understanding, and action generation, but current training methods face two main bottlenecks: data scarcity and weak generalization [4][6].
- The traditional training process relies heavily on large-scale human operation data, which is costly and difficult to scale, limiting model scalability [4][6].
- The article raises the question of whether RL can enhance the long-term action planning capabilities of VLA models, despite the unique challenges posed by VLA applications [4][6].

Group 2: SimpleVLA-RL Framework Contributions
- SimpleVLA-RL is designed to improve VLA training efficiency, particularly in data-scarce environments, and has achieved state-of-the-art (SOTA) performance in benchmark tests like LIBERO and RoboTwin [7][8].
- The framework incorporates interactive trajectory sampling, parallel training across multiple environments, and a unified design for training, inference, and rendering, addressing the slow interaction and high cost issues of VLA models [7][8].
- It has demonstrated significant improvements in success rates across various tasks, such as increasing LIBERO's average success rate from 91.0% to 99.1% and RoboTwin 2.0's from 38.3% to 68.8% [7][8][14].

Group 3: Data Efficiency and Generalization
- SimpleVLA-RL significantly reduces the dependency on large-scale demonstration data, achieving an average success rate of 96.9% with only one trajectory of demonstration data, surpassing the performance of full-trajectory supervised fine-tuning [19][20].
- The framework enhances the model's robustness across different scenes, objects, and tasks, demonstrating improved performance in unseen tasks compared to traditional methods [21][24].

Group 4: Real-World Deployment and Innovations
- The framework has shown effective Sim-to-Real transfer, with real-world task success rates improving from 17.5% to 38.5% using only simulated data for training [24][27].
- A notable discovery is the "Pushcut" phenomenon, where the RL-trained model autonomously discovers more efficient strategies beyond human demonstrations, indicating a potential for innovative behavior in VLA models [25][30].

Group 5: Summary and Conclusions
- SimpleVLA-RL addresses three core issues in VLA model training: reducing reliance on large-scale demonstration data, enhancing generalization capabilities, and achieving efficient Sim-to-Real transfer [31][32].
- The findings suggest that RL can enable VLA models to explore superior strategies, paving the way for future developments in autonomous and adaptive robotic systems [31][32].
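One plausible way to turn binary task-success outcomes from sampled trajectories into a training signal is a GRPO-style group-normalized advantage. This is a sketch of the general recipe, not necessarily the paper's exact implementation:

```python
def group_relative_advantages(rewards):
    """Normalize each rollout's outcome reward against its own sampling
    group (the GRPO recipe): advantage = (r - group mean) / group std.
    No learned value network is needed, which keeps training simple."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Four sampled trajectories for one manipulation task: three succeed, one fails.
advs = group_relative_advantages([1.0, 0.0, 1.0, 1.0])
```

Successful rollouts receive positive advantages and the failed one a negative advantage, so the policy gradient reinforces the behavior of the successful group members.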
Classes are officially open! A hands-on tutorial on embodied "brain" and "cerebellum" algorithms is here
具身智能之心· 2025-09-15 00:04
Core Insights
- The exploration towards Artificial General Intelligence (AGI) highlights embodied intelligence as a key direction, focusing on the interaction and adaptation of intelligent agents within physical environments [1][3].
- The development of embodied intelligence technology has evolved through various stages, from low-level perception to high-level task understanding and generalization [6][14].

Industry Analysis
- In the past two years, numerous star teams in the field of embodied intelligence have emerged, establishing valuable companies such as Xinghaitu, Galaxy General, and Zhujidongli, transitioning from laboratories to commercial and industrial applications [3].
- Major domestic companies like Huawei, JD, Tencent, Ant Group, and Xiaomi are actively investing and collaborating to build a comprehensive ecosystem for embodied intelligence, while international players like Tesla and investment firms support advancements in autonomous driving and warehouse robotics [5].

Technological Evolution
- The evolution of embodied intelligence technology has progressed through several phases:
  - The first phase focused on grasp pose detection, which lacked the ability to model task context and action sequences [6].
  - The second phase introduced behavior cloning, allowing robots to learn from expert demonstrations but revealing weaknesses in generalization and performance in multi-target scenarios [6].
  - The third phase, emerging in 2023, utilized Diffusion Policy methods to enhance stability and generalization by modeling action trajectories [6][7].
  - The fourth phase, starting in 2025, explores the integration of VLA models with reinforcement learning and tactile sensing to overcome current limitations [9][11][12].

Educational Initiatives
- The demand for engineering and system capabilities in embodied intelligence is increasing as the industry shifts from research to deployment, necessitating higher engineering skills [17].
- A comprehensive curriculum has been developed to cover various aspects of embodied intelligence, including practical applications and advanced topics, aimed at both beginners and advanced learners [14][20].
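The behavior-cloning phase described in the second stage reduces, at its core, to regressing predicted actions onto expert demonstrations. A toy sketch of that objective (real VLA policies consume images and language rather than raw action vectors):

```python
def behavior_cloning_loss(predicted, demonstrated):
    """Mean-squared error between the policy's action vector and an
    expert demonstration -- the core supervised objective of the
    behavior-cloning phase. Inputs are flat lists of action values."""
    assert len(predicted) == len(demonstrated)
    return sum((p - d) ** 2
               for p, d in zip(predicted, demonstrated)) / len(predicted)

print(behavior_cloning_loss([0.1, 0.4], [0.0, 0.5]))  # ≈ 0.01
```

The weakness the article notes follows directly from this objective: the policy only matches demonstrated trajectories, so it has no mechanism for recovering in states the expert never visited.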
SimpleVLA-RL: Breaking the VLA model training bottleneck, with RL enabling end-to-end online training
具身智能之心· 2025-09-15 00:04
Core Insights
- The article discusses the development of the SimpleVLA-RL framework, which enhances the training of Visual-Language-Action (VLA) models in robotics through reinforcement learning (RL) techniques, addressing the limitations of traditional supervised fine-tuning (SFT) methods [2][4][30].

Group 1: Research Background and Challenges
- VLA models are crucial for integrating visual perception, language understanding, and action generation in robotic control, but current training methods face significant challenges, including data scarcity and weak generalization capabilities [2][5].
- The breakthrough in large reasoning models suggests that RL can improve the sequential action planning capabilities of VLA models, but traditional RL methods are limited by manual reward design and the high cost of environmental interactions [2][5].

Group 2: Contributions of SimpleVLA-RL
- SimpleVLA-RL is designed specifically for VLA, incorporating interactive trajectory sampling and multi-environment parallel rendering, which significantly reduces training costs and improves scalability [6][9].
- The framework has achieved state-of-the-art (SOTA) performance across multiple benchmarks, with notable improvements in success rates, such as LIBERO's average success rate increasing from 91.0% to 99.1% [6][12].
- SimpleVLA-RL demonstrates enhanced data efficiency, achieving a LIBERO average success rate of 96.9% with only one demonstration trajectory, surpassing traditional methods [16][17].

Group 3: Generalization and Real-World Application
- The framework shows robust generalization capabilities across unseen tasks, with significant performance improvements in various scenarios, indicating its ability to learn universal skills rather than overfitting to specific data [22][30].
- SimpleVLA-RL has proven effective in sim-to-real transfer, with real-world task success rates improving from 17.5% to 38.5%, validating its deployment capabilities [7][21].

Group 4: Key Discoveries
- The framework has led to the discovery of the "Pushcut" phenomenon, where the RL-trained model autonomously develops more efficient strategies beyond human demonstrations, showcasing the potential for innovative robotic behaviors [24][30].
- The effectiveness of SimpleVLA-RL is contingent on the initial model capabilities, with significant performance enhancements observed when starting from a higher baseline success rate [28][29].
Tsinghua, Shanghai AI Lab, and other top teams release a comprehensive survey of RL for reasoning models
具身智能之心· 2025-09-15 00:04
Core Viewpoint
- The article discusses the significant advancements in Reinforcement Learning (RL) for Large Reasoning Models (LRM), emphasizing its potential to enhance reasoning and logical thinking capabilities in AI systems through verifiable reward mechanisms and advanced optimization algorithms [4][8][19].

Group 1: Introduction to RL and LRM
- Reinforcement Learning (RL) has been a crucial method in AI development since Sutton's 1998 formalization of the field, enabling agents to learn in complex environments through clear reward signals [4].
- The emergence of large models has provided a new platform for RL, initially used to align models with human preferences, and now evolving towards enhancing reasoning capabilities [5][6].

Group 2: Recent Trends and Challenges
- A new trend is emerging where researchers aim to use RL not just for compliance but to genuinely enhance reasoning abilities in models, leading to the development of LRM systems [5][6].
- Significant challenges remain for the large-scale application of RL in LRM, including reward design, algorithm efficiency, and the need for substantial data and computational resources [6][8].

Group 3: Key Developments and Milestones
- The article highlights key milestones in RL applications for LRM, such as OpenAI's o1 and DeepSeek-R1, which demonstrate the effectiveness of RL in achieving long-chain reasoning capabilities through verifiable rewards [13][15].
- The performance of models like o1 improves with additional RL training and increased computational resources during reasoning, indicating a new path for expansion beyond pre-training [13][15].

Group 4: Foundational Components and Problems
- The foundational components of RL for LRM include reward design, policy optimization, and sampling strategies, which are essential for enhancing model capabilities [16].
- The article discusses foundational and controversial issues in RL for LRM, such as the role of RL, the comparison between RL and supervised fine-tuning (SFT), and the types of rewards used [16].

Group 5: Training Resources and Applications
- Training resources for RL include static corpora, dynamic environments, and infrastructure, which need further standardization and development for effective use [16].
- The applications of RL span various tasks, including coding, agentic tasks, multimodal tasks, and robotics, showcasing its versatility [16][18].

Group 6: Future Directions
- Future research directions for RL in LLMs include continual RL, memory-based RL, and model-based RL, aiming to enhance reasoning efficiency and capabilities [18].
- The exploration of new algorithms and mechanisms is crucial for advancing RL's role in achieving Artificial Superintelligence (ASI) [15][19].
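The policy-optimization component the survey names can be illustrated at its smallest scale with a single REINFORCE step on a two-armed bandit; LRM training uses PPO/GRPO-scale variants of the same gradient. A minimal sketch with an illustrative one-parameter policy:

```python
import math

def reinforce_step(theta, episodes, lr=0.1):
    """One REINFORCE policy-gradient step on a 2-armed bandit whose
    policy is p(arm 1) = sigmoid(theta). Each episode is a pair
    (action, reward); the update averages reward-weighted score-function
    gradients over the batch."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    grad = 0.0
    for action, reward in episodes:
        # d/d theta of log pi(action): (1 - p1) for arm 1, -p1 for arm 0.
        grad += reward * ((1.0 - p1) if action == 1 else -p1)
    return theta + lr * grad / len(episodes)

theta = 0.0
# Arm 1 always pays out and arm 0 never does, so theta should rise,
# shifting probability mass toward the rewarded action.
for _ in range(50):
    theta = reinforce_step(theta, [(1, 1.0), (0, 0.0)])
```

In LRM training the "arms" are sampled token sequences and the reward is a verifier signal, but the gradient structure is the same.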
Tsinghua, Shanghai AI Lab, and other top teams release a comprehensive survey of RL for reasoning models, exploring the road to superintelligence
机器之心· 2025-09-13 08:54
Core Insights
- The article emphasizes the significant role of Reinforcement Learning (RL) in enhancing the reasoning capabilities of large language models (LLMs), marking a pivotal shift in artificial intelligence development [2][5][16].
- It highlights the emergence of Large Reasoning Models (LRMs) that utilize RL to improve reasoning through verifiable rewards, showcasing advancements in complex tasks such as mathematics and programming [3][5][10].

Summary by Sections

Introduction
- The introduction outlines the historical context of RL since its formalization in 1998 and its evolution into a crucial method for training intelligent agents to surpass human performance in complex environments [2].

Recent Trends
- A new trend is emerging where researchers aim to enhance models' reasoning abilities through RL, moving beyond mere compliance to actual reasoning skills [3][5].

Overview of RL in LRM
- The article reviews recent advancements in RL applied to LLMs, noting significant achievements in complex logical tasks, and identifies RL as a core method for evolving LLMs into LRMs [5][12].

Foundational Components
- The foundational components of RL for LRMs include reward design, policy optimization, and sampling strategies, which are essential for effective model training [13][14].

Foundational Problems
- Key challenges in RL for LRMs include the design of appropriate reward signals, efficient scaling under computational and data constraints, and ensuring reliability in practical applications [12][16].

Training Resources
- The article discusses the necessary training resources, including static corpora, dynamic environments, and RL infrastructure, emphasizing the need for standardization and development [13][15].

Applications
- RL has been applied across various tasks, including coding, agentic tasks, multimodal tasks, and robotics, showcasing its versatility and potential for broader applications [13][15].

Future Directions
- Future research directions for RL in LLMs include the development of new algorithms, mechanisms, and functionalities to further enhance reasoning capabilities and address existing challenges [15][16].
A 10,000-word deep dive! The first survey of agent self-evolution: the road toward artificial superintelligence
自动驾驶之心· 2025-09-11 23:33
Core Insights
- The article discusses the transition from static large language models (LLMs) to self-evolving agents capable of continuous learning and adaptation in dynamic environments, paving the way towards artificial superintelligence (ASI) [3][4][46].
- It emphasizes the need for a structured framework to understand and design self-evolving agents, focusing on three fundamental questions: what to evolve, when to evolve, and how to evolve [6][46].

Group 1: What to Evolve
- Self-evolving agents can improve various components such as models, memory, tools, and architecture over time to enhance performance and adaptability [19][20].
- The evolution of these components is crucial for the agent's ability to handle complex tasks and environments effectively [19][20].

Group 2: When to Evolve
- The article categorizes self-evolution into two time modes: intra-test-time self-evolution, which occurs during task execution, and inter-test-time self-evolution, which happens between tasks [22][23].
- Intra-test-time self-evolution allows agents to adapt in real time to specific challenges, while inter-test-time self-evolution leverages accumulated experiences for future performance improvements [22][23].

Group 3: How to Evolve
- Self-evolution emphasizes a continuous learning process where agents learn from real-world interactions, seek feedback, and adjust strategies dynamically [26][27].
- Methodologies for self-evolution include reward-based evolution, imitation learning, and population-based approaches, each with distinct feedback types and data sources [29][30].

Group 4: Applications and Evaluation
- Self-evolving agents have significant potential in various fields, including programming, education, and healthcare, where continuous adaptation is essential [6][34].
- Evaluating self-evolving agents presents unique challenges, requiring metrics that capture adaptability, knowledge retention, and long-term generalization capabilities [34][36].

Group 5: Future Directions
- The article highlights the importance of addressing challenges such as catastrophic forgetting, knowledge transfer, and ensuring safety and controllability in self-evolving agents [40][43].
- Future research should focus on developing scalable architectures, dynamic evaluation methods, and personalized agents that can adapt to individual user preferences [38][44].
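Inter-test-time self-evolution can be caricatured as an agent that carries a success memory between tasks and shifts toward strategies that worked. All names below are illustrative, not from the survey:

```python
class SelfEvolvingAgent:
    """Toy sketch of inter-test-time self-evolution: between tasks the
    agent accumulates which strategy succeeded and prefers the one with
    the best remembered track record on the next task."""

    def __init__(self, strategies):
        # Memory of accumulated successes per strategy, updated between tasks.
        self.memory = {s: 0 for s in strategies}

    def choose(self):
        # Exploit the strategy with the strongest track record so far.
        return max(self.memory, key=self.memory.get)

    def record(self, strategy, success: bool):
        self.memory[strategy] += int(success)

agent = SelfEvolvingAgent(["cautious", "greedy"])
agent.record("greedy", True)
# After one remembered success, the agent now leans toward "greedy".
```

A real system would evolve richer components (prompts, tools, model weights) and balance exploration against this pure exploitation, but the memory-then-adapt loop is the essential shape.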
Cracking the AI reasoning problem: a Tsinghua team proposes ReST-RL, a "unified new paradigm for LLM reinforcement learning"
36Kr· 2025-09-10 09:53
Core Insights
- The article discusses the ongoing debate in the industry regarding the reasoning capabilities of large language models (LLMs), highlighting their frequent failures in complex tasks and the challenges in improving their reasoning abilities [1][3].

Group 1: Current Challenges in LLMs
- Existing LLMs struggle with complex code, multi-step logic, and abstract tasks, often resulting in logical errors and irrelevant responses [1].
- Current reinforcement learning (RL) methods, such as online RL and self-training, have shown potential in enhancing LLM reasoning but face limitations in training efficiency and data collection costs [3][4].
- The reliance on high-quality labeled data for training process reward models (PRMs) restricts the scalability and reliability of these methods [4].

Group 2: Introduction of ReST-RL
- Tsinghua University's KEG team proposed a new RL paradigm called ReST-RL, which combines an improved GRPO algorithm with a value-model (VM) assisted decoding method to enhance LLM reasoning capabilities while maintaining efficiency and scalability [1][5].
- ReST-RL consists of two main components: ReST-GRPO, which optimizes the training process, and VM-MCTS, which aids in decoding during testing [5][9].

Group 3: Performance and Validation
- Experimental results indicate that ReST-RL outperforms other RL baselines and decoding methods across various programming benchmarks, demonstrating its significant potential in enhancing LLM reasoning capabilities [2][10].
- ReST-GRPO improves training efficiency compared to the original GRPO and DAPO, while VM-MCTS shows superior accuracy in validation tasks [10].

Group 4: Limitations and Future Directions
- Despite the promising results, ReST-RL has not been validated in tasks beyond code reasoning, such as mathematical or commonsense reasoning, indicating a need for further research [13][14].
- The accuracy of the value model in out-of-domain tasks remains underexplored, suggesting that future work should focus on its generalization capabilities across a broader range of tasks [14].
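Value-model-assisted decoding can be approximated, in its simplest form, as scoring candidate continuations with a value model and expanding the best one; ReST-RL's VM-MCTS embeds this step inside a full tree search with value backup. A toy sketch of just the selection step, using a hypothetical length-based value model as a stand-in:

```python
def value_guided_select(candidates, value_model):
    """Greedy stand-in for VM-assisted decoding: score every candidate
    partial solution with the value model and return the highest-valued
    one for further expansion. A full MCTS adds expansion, simulation,
    and backup around this single step."""
    return max(candidates, key=value_model)

# Hypothetical value model that simply favors more complete drafts;
# a trained VM would instead estimate expected final reward.
drafts = ["def f(", "def f(x):", "def f(x): return x * 2"]
best = value_guided_select(drafts, value_model=len)
print(best)  # def f(x): return x * 2
```

The point of the learned value model is precisely what the summary flags as underexplored: its scores must generalize to states outside the training distribution for the guided search to stay reliable.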