Reinforcement Learning
Why is there still so much work for RL to do on humanoids, quadrupeds, robotic arms, and other embodiments?
具身智能之心· 2025-10-28 04:00
Core Insights
- Reinforcement learning (RL) remains a significant field, with growing applications in robotics, including humanoid and quadruped robots, as well as in product optimization across industries [1][2][3]
- The complexity of RL poses challenges for newcomers, making it difficult to produce publishable research papers without a structured learning system [5][9]
- To address these challenges, a specialized 1v6 mentoring course in RL has been launched, aimed at helping students produce quality research papers [6][9]

Group 1: Importance of Reinforcement Learning
- RL is crucial for tasks such as gait control in embodied intelligent robots, which is essential for achieving general-purpose capabilities [2]
- Companies like Unitree and Zhiyuan use RL to let humanoid robots perform complex actions such as climbing stairs, running, and dancing, enhancing their adaptability across scenarios [2][8]
- The integration of RL with vision-language-action (VLA) models in robotic arms is gaining popularity in academia, leading to more efficient and smoother robot operation [3][8]

Group 2: Challenges in Learning and Research
- The vast and intricate nature of RL makes it hard for beginners to find a clear entry point, often resulting in frustration and abandonment of learning [5][9]
- Producing a research paper that meets peer-review standards requires proficiency in methodology, experimental results, and writing style, which can overwhelm newcomers [5][9]

Group 3: Course Offerings and Structure
- The 1v6 mentoring course is designed for graduate students and others seeking guidance on research papers, featuring small class sizes and weekly live sessions [7][9]
- The course spans 14 weeks of intensive online training followed by 8 weeks of follow-up support, covering various aspects of RL and its applications in robotics [9][15]
- Participants receive guidance on paper ideas, project implementation, experimental support, and writing refinement, with the goal of producing a draft suitable for submission to top conferences [7][9][15]

Group 4: Course Content and Deliverables
- The curriculum includes RL fundamentals, simulation environments, and specific applications in quadruped, humanoid, and robotic-arm training [17][19]
- Students engage in hands-on projects, culminating in a research-paper draft that meets the requirements of venues such as RAL, ICRA, IROS, and CoRL [23][24]
- The course emphasizes a structured approach to research, covering the entire process from methodology to writing and submission [30]
Just in: the Thinking Machines Lab blog proposes on-policy distillation, with Qwen name-checked 38 times
36Kr· 2025-10-28 02:00
Core Insights
- Thinking Machines Lab (TML) has introduced a new training method called on-policy distillation, which combines the on-policy error relevance of reinforcement learning (RL) with the reward density of supervised fine-tuning (SFT), achieving superior performance at lower cost [1][17]

Group 1: Methodology and Applications
- On-policy distillation is effective for small models, enhancing their domain performance and continuous-learning capabilities [1][17]
- The method is inspired by the Qwen team's research and heavily uses the Qwen3 series models in its experiments [3][34]
- The training pipeline consists of three stages: pre-training, mid-training, and post-training, targeting general capabilities, domain knowledge, and target behavior respectively [6][7]

Group 2: Advantages of On-Policy Distillation
- Small models trained with on-policy distillation often outperform larger general models in specialized fields, thanks to benefits like local deployment, easier continuous training, and lower inference costs [7][17]
- The method provides dense reward signals, allowing more efficient learning than traditional RL, which offers only sparse feedback [9][18]

Group 3: Performance and Cost Efficiency
- TML's experiments show that on-policy distillation can match RL performance at a fraction of the cost, with reported costs as low as one-tenth of traditional RL methods [34][41]
- The method demonstrates significant computational efficiency, requiring 7-10 times fewer gradient steps than RL to reach similar performance levels [58]

Group 4: Continuous Learning and Personalization
- On-policy distillation is positioned as a promising tool for continuous learning, allowing models to update without degrading previously learned behaviors [66][70]
- The approach can effectively personalize models, enabling them to adapt to specific tasks while retaining core capabilities [42][53]
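The mechanics summarized above, where the student samples its own tokens and the teacher grades every one of them, can be sketched in miniature. This is an illustrative toy, not TML's implementation: the three-token vocabularies and the averaging loop are assumptions, and a real trainer would backpropagate the per-token signal rather than just accumulate it.

```python
import math
import random

def sample(dist, rng):
    """Draw one token from a {token: probability} distribution."""
    r, acc = rng.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r <= acc:
            return tok
    return tok  # guard against floating-point shortfall

def on_policy_distill_step(student, teacher, n_tokens, rng):
    """One on-policy distillation pass: the student samples its own tokens,
    and every token gets a dense reward of -(log p_student - log p_teacher),
    a single-sample estimate of the negated reverse KL."""
    total = 0.0
    for _ in range(n_tokens):
        tok = sample(student, rng)  # on-policy: drawn from the student itself
        total += -(math.log(student[tok]) - math.log(teacher[tok]))
    return total / n_tokens

teacher = {"a": 0.7, "b": 0.2, "c": 0.1}
aligned = dict(teacher)                    # student already matching the teacher
drifted = {"a": 0.1, "b": 0.2, "c": 0.7}   # student far from the teacher

print(on_policy_distill_step(aligned, teacher, 1000, random.Random(0)))  # ~0: no penalty
print(on_policy_distill_step(drifted, teacher, 1000, random.Random(0)))  # clearly negative
```

The reward is dense (one signal per token, unlike the sparse end-of-episode feedback of RL) yet computed on the student's own trajectories, which is the combination the summary describes.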
Thinking Machines' new research is making the rounds! Combining the strengths of RL and fine-tuning makes small-model training more cost-effective
量子位· 2025-10-28 01:18
Core Insights
- The article discusses research by Thinking Machines on a new training method for small language models, On-Policy Distillation, which enhances their understanding of specialized fields [1][4]

Summary by Sections

Methodology
- On-Policy Distillation combines the strengths of two traditional training methods, reinforcement learning (self-exploration) and supervised fine-tuning (direct answers), into a more efficient training framework [3][8]
- The method lets the model learn through practical problem-solving while receiving immediate guidance when it hits difficulties, reportedly improving training efficiency by 50-100 times [4][5]

Training Phases
- The training process has three main phases: pre-training (general capabilities), mid-training (domain-specific knowledge), and post-training (target-behavior guidance) [9]
- The research focuses on the post-training phase, where the model learns to perform specific tasks effectively [6][9]

Evaluation Metrics
- The method uses negative reverse KL divergence as its key training signal, so the student model learns by minimizing its divergence from the teacher model's distribution [12][15]

Experimental Results
- Experiment 1 showed that with On-Policy Distillation, a smaller (8B) model could reach a 70% score on a math benchmark at significantly lower computational cost than traditional methods [19][22]
- Experiment 2 showed that the method effectively mitigates "catastrophic forgetting" in AI models, allowing them to retain general capabilities while learning new knowledge [23][25]

Implications
- The research indicates that On-Policy Distillation can let resource-constrained individuals or small companies train effective specialized models, broadening access to AI development [5][19]
- The findings suggest a promising avenue for lifelong learning in AI systems, balancing new-knowledge acquisition with the retention of existing skills [26]
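The reverse KL metric named above can be made concrete with a small sketch. The two-token distributions are illustrative assumptions; the point is the asymmetry: reverse KL, the direction minimized here, stays finite when the student commits to one teacher-approved mode, while forward KL does not.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete {token: prob} distributions."""
    total = 0.0
    for tok, pp in p.items():
        if pp == 0.0:
            continue
        qq = q.get(tok, 0.0)
        if qq == 0.0:
            return math.inf  # p puts mass where q has none
        total += pp * math.log(pp / qq)
    return total

teacher = {"a": 0.5, "b": 0.5}
student = {"a": 1.0, "b": 0.0}  # collapsed onto one teacher-approved mode

reverse = kl(student, teacher)  # the direction on-policy distillation minimizes
forward = kl(teacher, student)  # the direction ordinary distillation averages

print(reverse)  # log 2: finite, committing to one good mode is cheap
print(forward)  # inf: forward KL punishes dropping any teacher mode
```

Minimizing reverse KL is "mode-seeking": the student is pushed to put probability only where the teacher does, which matches the summary's claim that the student stays within the teacher's expectations.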
Just in: the Thinking Machines Lab blog proposes on-policy distillation, with Qwen name-checked 38 times
机器之心· 2025-10-28 00:41
Core Viewpoint
- Thinking Machines Lab (TML) has introduced a new training method called on-policy distillation, which combines the on-policy error relevance of reinforcement learning (RL) with the reward density of supervised fine-tuning (SFT), achieving superior performance at lower cost than other methods [1][2][27]

Group 1: Methodology and Advantages
- On-policy distillation allows small models to exhibit strong domain performance and continuous-learning capabilities [1][2]
- The training process is divided into three stages: pre-training for general capabilities, mid-training for domain knowledge, and post-training for guiding target behaviors [6][7]
- On-policy training samples trajectories from the student model itself, providing direct feedback that helps it avoid its own errors, while off-policy training relies on externally sourced trajectories [8][9][12]

Group 2: Comparison with Other Methods
- On-policy distillation combines the reliability of on-policy training with the dense reward signals of SFT, making it a cost-effective alternative to traditional RL methods [28][92]
- In experiments, on-policy distillation reached 74.4% on the AIME'24 benchmark at significantly lower computational cost than RL, which required 17,920 GPU-hours for a score of 67.6% [47][46]

Group 3: Applications and Future Directions
- The method has been applied to train models for mathematical reasoning and to build assistant models with domain knowledge and instruction-following capabilities [26][27]
- TML aims to keep exploring new applications of on-policy distillation, improving teacher-supervision methods, and enhancing data efficiency and continuous learning [92][93]
Can drones play volleyball too? A Tsinghua team uses reinforcement learning to scout the way
具身智能之心· 2025-10-28 00:02
Core Insights
- The article presents a new embodied-AI task proposed by Tsinghua University, "multi-drone volleyball," which aims to develop drones' teamwork and strategy in three-dimensional space [1][2]

Group 1: Task Overview
- The multi-drone volleyball task requires drones to demonstrate high maneuverability and precise control while collaborating as a team to hit a ball over a net and compete against an opposing team [2]
- The Tsinghua team developed the VolleyBots testing platform to mirror how humans learn volleyball, incorporating a range of single-drone and multi-drone tasks [2][6]

Group 2: Algorithm Development
- The Hierarchical Co-Self-Play (HCSP) algorithm enables drones to learn cooperation, role division, and offensive/defensive transitions through hierarchical strategy learning and self-play mechanisms [2][12]
- The research benchmarked various reinforcement-learning and game-theoretic algorithms, with HCSP achieving an average win rate of 82.9% against multiple baselines [15]

Group 3: Training Phases
- Training proceeds in three phases: low-level skill learning, high-level strategy gameplay, and collaborative self-play, allowing drones to evolve their strategies and skills in a competitive environment [14]
- During matches the drones formed clear roles, such as defense, passing, and offense, and even developed new tactics like the "setter's lob" during training [15]

Group 4: Real-World Application
- The JuggleRL system enables drones to perform continuous juggling in the real world, achieving a record of 462 consecutive juggles with no fine-tuning on real data [16][18]
- This marks a significant step for embodied reinforcement learning, transitioning from virtual environments to real physical interaction [18][19]
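The self-play loop described above, train against frozen past versions of yourself and freeze each new generation into the opponent pool, can be sketched schematically. The scalar "skill" policy, the logistic win model, and the 0.75 dominance threshold are toy assumptions, not the HCSP algorithm itself.

```python
import math
import random

def play(skill_a, skill_b, rng):
    """Return True if A wins; win probability is logistic in the skill gap."""
    p_a = 1.0 / (1.0 + math.exp(skill_b - skill_a))
    return rng.random() < p_a

def self_play(generations, games_per_gen, rng):
    """Train against a pool of frozen past selves; freeze each new generation."""
    pool = [0.0]     # opponent pool starts with the initial policy
    current = 0.0    # a policy is just a scalar 'skill' in this toy model
    for _ in range(generations):
        while True:
            wins = sum(play(current, rng.choice(pool), rng)
                       for _ in range(games_per_gen))
            if wins / games_per_gen >= 0.75:  # dominant over the pool...
                break
            current += 0.5                    # ...else a stand-in for an RL update
        pool.append(current)                  # freeze this generation as a future rival
    return current, pool

skill, pool = self_play(generations=5, games_per_gen=200, rng=random.Random(0))
print(skill, len(pool))  # skill grows as the pool of frozen rivals grows
```

Because each generation must beat an ever-stronger pool, the curriculum tightens automatically; this is the basic pressure that lets self-play agents discover tactics, like the "setter's lob," that no one scripted.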
Officially concluded! Industry veterans led a three-month course to master end-to-end autonomous driving
自动驾驶之心· 2025-10-27 00:03
Core Viewpoint
- With end-to-end methods having emerged in 2023, 2024 is expected to be a significant year for end-to-end production in the automotive industry, as leading new forces and manufacturers have already put end-to-end systems into production [1][3]

Group 1: End-to-End Production Development
- End-to-end methods are developing rapidly, particularly the one-stage approach exemplified by UniAD, which models vehicle trajectories directly from sensor inputs [1][3]
- Two main paradigms dominate the industry, one-stage and two-stage methods, with the one-stage approach gaining traction and spawning variants based on perception, world models, diffusion models, and VLA [3][5]

Group 2: Course Overview
- A course titled "End-to-End and VLA Autonomous Driving" has been launched, covering cutting-edge algorithms in both one-stage and two-stage end-to-end methods and aiming to bridge academic and industrial advances [5][15]
- The course is structured into several chapters covering the history and evolution of end-to-end methods, background knowledge on VLA, and detailed discussions of both one-stage and two-stage approaches [9][10][12]

Group 3: Key Technologies
- The course emphasizes critical technologies such as BEV perception, vision-language models (VLM), diffusion models, and reinforcement learning, which are essential for mastering the latest advances in autonomous driving [5][11][19]
- The second chapter is highlighted as containing the technical keywords most frequently asked about in job interviews over the next two years [10]

Group 4: Practical Applications
- The course includes practical assignments, such as RLHF fine-tuning, letting participants apply their knowledge in real-world scenarios and learn how to build and experiment with pre-training and reinforcement-learning modules [13][19]
- The curriculum also covers the subfields of one-stage end-to-end methods, including those based on perception, world models, diffusion models, and VLA, giving a comprehensive view of the current autonomous-driving landscape [14][19]
HuggingFace and Oxford University open-source a SOTA resource library with a new tutorial!
具身智能之心· 2025-10-27 00:02
Core Viewpoint
- The article highlights major advances in robotics, particularly robot learning, driven by large models and multi-modal AI technologies, which have shifted traditional robotics toward a learning-based paradigm [3][4]

Group 1: Introduction to Robot Learning
- The article introduces a comprehensive tutorial on modern robot learning, covering the foundations of reinforcement learning and imitation learning through to general-purpose, language-conditioned models [4][12]
- HuggingFace and Oxford University researchers have produced a valuable, accessible guide for newcomers to the field [3][4]

Group 2: Classic Robotics
- Classic robotics relies on explicit modeling through kinematics and control planning, whereas learning-based methods use deep reinforcement learning and expert demonstrations for implicit modeling [15]
- Traditional robotic systems follow a modular pipeline: perception, state estimation, planning, and control [16]

Group 3: Learning-Based Robotics
- Learning-based robotics integrates perception and control more tightly, adapts to tasks and embodiments, and reduces the need for expert modeling [26]
- The tutorial highlights safety and efficiency challenges in real-world deployment, particularly during early training, and discusses techniques like simulation training and domain randomization to mitigate the risks [34][35]

Group 4: Reinforcement Learning
- Reinforcement learning lets robots autonomously learn optimal behavior through trial and error, showing significant potential across scenarios [28]
- The tutorial discusses the complexity of integrating many system components and the limits of traditional physics-based models, which often oversimplify real-world phenomena [30]

Group 5: Imitation Learning
- Imitation learning offers a more direct learning path: the robot replicates expert actions through behavior cloning, avoiding complex reward-function design [41]
- The tutorial addresses challenges such as compounding errors and multi-modal behaviors in expert demonstrations [41][42]

Group 6: Advanced Techniques in Imitation Learning
- The tutorial introduces generative-model-based imitation methods such as Action Chunking with Transformers (ACT) and Diffusion Policy, which model multi-modal data effectively [43][45]
- Diffusion Policy performs strongly across tasks with minimal demonstration data, requiring only 50-150 demonstrations for training [45]

Group 7: General Robot Policies
- The tutorial envisions general robot policies that operate across tasks and devices, enabled by large-scale open robot datasets and powerful vision-language models [52][53]
- Two cutting-edge vision-language-action (VLA) models, π₀ and SmolVLA, are highlighted for understanding visual and language instructions and generating precise control commands [53][56]

Group 8: Model Efficiency
- SmolVLA reflects a trend toward model miniaturization and open-sourcing, achieving strong performance with far fewer parameters and much less memory than π₀ [56][58]
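Behavior cloning, as characterized above, is plain supervised regression from states to expert actions, with no reward function. A minimal 1-D sketch; the linear expert and the demonstration data are assumptions for illustration:

```python
import random

# Hypothetical expert controller: acts linearly on the observed state.
def expert_action(state):
    return 2.0 * state + 1.0

# Collect (state, action) demonstrations -- no reward function needed.
rng = random.Random(0)
demos = [(s, expert_action(s))
         for s in (rng.uniform(-1.0, 1.0) for _ in range(200))]

# Behavior cloning: fit a policy a = w*s + b to the expert's actions
# by ordinary least-squares gradient descent.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    gw = gb = 0.0
    for s, a in demos:
        err = (w * s + b) - a          # prediction error vs. the expert
        gw += err * s
        gb += err
    w -= lr * gw / len(demos)
    b -= lr * gb / len(demos)

print(w, b)  # recovers the expert's coefficients (near 2.0 and 1.0)
```

The compounding-error problem the tutorial mentions follows directly from this setup: the fit is only accurate on states the expert visited, so once the cloned policy drifts off that distribution its errors feed on themselves.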
A hands-on introduction to robot learning: HuggingFace and Oxford University open-source a SOTA resource library with a new tutorial
机器之心· 2025-10-26 07:00
Core Viewpoint
- The article highlights major advances in robotics, particularly robot learning, driven by AI technologies such as large models and multi-modal models; this shift has turned traditional robotics into a learning-based paradigm, opening new potential for autonomous decision-making robots [2]

Group 1: Introduction to Robot Learning
- The article traces robotics' evolution from explicit to implicit modeling, a fundamental change in how motion is generated: traditional robotics relied on explicit models, while learning-based methods use deep reinforcement learning and learning from expert demonstrations [15]
- A comprehensive tutorial from HuggingFace and Oxford University researchers serves as a valuable resource for newcomers to modern robot learning, covering the foundations of reinforcement learning and imitation learning [3][4]

Group 2: Learning-Based Robotics
- Learning-based robotics simplifies the perception-to-action pipeline by training a unified high-level controller that handles high-dimensional, unstructured sensorimotor information directly, without relying on a dynamics model [33]
- The tutorial addresses real-world challenges such as safety and efficiency during early training and the high cost of trial and error in physical environments, introducing techniques like simulator training and domain randomization to mitigate these risks [34][35]

Group 3: Reinforcement Learning
- Reinforcement learning lets robots autonomously learn optimal behavior through trial and error, showing significant potential across scenarios [28]
- The tutorial discusses the offline-to-online reinforcement-learning framework, which improves sample efficiency and safety by using pre-collected expert data; the HIL-SERL method exemplifies this approach, enabling robots to master complex real-world tasks with near-100% success rates after just 1-2 hours of training [36][39]

Group 4: Imitation Learning
- Imitation learning offers a more direct path: the robot replicates expert actions through behavior cloning, avoiding complex reward-function design and keeping training safe [41]
- The tutorial presents generative-model-based imitation methods such as Action Chunking with Transformers (ACT) and Diffusion Policy, which model multi-modal data by learning the latent distribution of expert behaviors [42][43]

Group 5: Universal Robot Policies
- The article envisions universal robot policies that operate across tasks and devices, enabled by large-scale open robot datasets and powerful vision-language models (VLMs) [52]
- Two cutting-edge VLA models, π₀ and SmolVLA, are highlighted for understanding visual and language instructions and generating precise robot control commands; SmolVLA is a compact, open-source model that significantly lowers the barrier to application [53][56]
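Domain randomization, mentioned in both tutorial summaries, resamples the simulator's physical parameters every episode so a policy cannot overfit to one (inevitably wrong) physics model. A schematic sketch; the parameter names and ranges are illustrative assumptions, tuned in practice so the real robot's unknown values fall inside them:

```python
import random

def make_randomized_env(rng):
    """Sample a fresh physics configuration for one training episode."""
    return {
        "mass": rng.uniform(0.8, 1.2),          # +/-20% around a nominal 1.0 kg
        "friction": rng.uniform(0.5, 1.5),
        "motor_delay_ms": rng.uniform(0.0, 20.0),
        "sensor_noise_std": rng.uniform(0.0, 0.05),
    }

def train(episodes, rng):
    """Schematic training loop: new physics every episode."""
    seen = []
    for _ in range(episodes):
        env = make_randomized_env(rng)
        seen.append(env)
        # ... roll out the policy in `env` and apply an RL update ...
    return seen

envs = train(1000, random.Random(0))
masses = [e["mass"] for e in envs]
print(min(masses), max(masses))  # the policy never trains on one fixed mass
```

A policy that succeeds across the whole sampled family of simulators is more likely to transfer to the real robot, whose parameters are just one more draw from that family; the 462-juggle drone result above was reported without any real-data fine-tuning, the outcome this technique targets.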
From world models to VLA to reinforcement learning: this is how embodied "brain and cerebellum" algorithms work!
具身智能之心· 2025-10-26 04:02
Core Insights
- The article discusses the evolution and current state of embodied intelligence, focusing on the "brain and cerebellum" division in robotics: the brain handles perception and planning, while the cerebellum is responsible for execution [3][10]

Technical Evolution
- The development of embodied intelligence has progressed through several stages, from grasp-pose detection, to behavior cloning, to today's diffusion policies and VLA models [7][10]
- The first stage focused on static object grasping, with limited decision-making capability [7]
- The second stage introduced behavior cloning, allowing robots to learn from expert demonstrations but facing challenges in generalization and error accumulation [8]
- The third stage, marked by the introduction of diffusion policies, improved stability and generalization by modeling action sequences [8]
- The fourth stage, emerging in 2025, explores integrating VLA models with reinforcement learning and world models to enhance robots' predictive and interactive capabilities [9][10]

Current Trends and Applications
- Combining VLA with reinforcement learning strengthens robots' trial-and-error learning and self-improvement, while combining it with world models enables future prediction and better planning [10]
- Demand for embodied-intelligence applications is growing across industrial, home, restaurant, and medical-rehabilitation settings, driving job opportunities and research interest in the field [10]

Educational Initiatives
- The article outlines a structured learning program that equips participants with comprehensive knowledge of embodied-intelligence algorithms, including practical applications and real-world projects [11][14]
- The course targets individuals with a foundational understanding of embodied intelligence and aims to bridge the gap between theoretical knowledge and practical deployment [18][24]
Calling for collaborators! Seeking autonomous-driving enthusiasts everywhere (product, 4D annotation, world models, and more)
自动驾驶之心· 2025-10-25 16:03
Core Viewpoint
- The article calls for collaboration in the autonomous-driving industry, inviting professionals to participate in training, course development, and research support to drive the industry forward [2]

Group 1: Collaboration and Opportunities
- The company seeks partnerships with professionals in the autonomous-driving field to strengthen its training and job-guidance services [2]
- Collaborators are offered high compensation and abundant industry resources [3]
- Focus areas include autonomous-driving product managers, 4D annotation/data closed-loop, world models, VLA, autonomous-driving large models, reinforcement learning, and end-to-end systems [4]

Group 2: Training and Development
- The positions primarily serve B-side training for enterprises, universities, and research institutions, as well as C-side training for students and job seekers [5]
- Interested individuals are encouraged to reach out via WeChat for further consultation [6]