Reinforcement Learning
NTU & Harvard propose OpenREAD: end-to-end RL unifying cognition and trajectory planning
自动驾驶之心· 2025-12-13 02:04
Core Viewpoint
- The article introduces OpenREAD, a framework from Nanyang Technological University and Harvard University that uses reinforcement learning (RL) to strengthen the reasoning of vision-language models (VLMs) for autonomous driving [4][28].

Group 1: Methodology
- OpenREAD incorporates Qwen3-LLM as an "evaluation expert" (a toy sketch of this reward follows the summary), extending RL from traditionally verifiable downstream tasks to open-ended ones such as driving suggestions and scene analysis, and achieving end-to-end reinforcement fine-tuning from high-level semantic reasoning down to low-level trajectory planning [6][28].
- The framework tackles the difficulty of designing reward functions for open-ended driving knowledge, where many phrasings can express the same reference answer, which complicates the RL process [7].
- Two preparatory steps were taken: (1) constructing knowledge data with explicit chains of thought (CoT), using GPT-4 to annotate driving-knowledge data covering perception and decision-making tasks [8]; (2) converting the OmniDrive dataset into a "thinking + answering" format suitable for RL training [9].

Group 2: Experimental Results
- Evaluated on the LingoQA and nuScenes datasets, OpenREAD outperformed traditional supervised fine-tuning (SFT) in trajectory error, collision rate, and knowledge-evaluation metrics [19][20].
- The results indicate that injecting driving knowledge markedly improves RL fine-tuning, as evidenced by lower trajectory error and collision rates [19][20].
- Compared with existing methods, OpenREAD showed better collision control, yielding safer driving behavior [20].

Group 3: Conclusion
- OpenREAD realizes collaborative RL fine-tuning of driving knowledge and trajectory planning, expanding the boundaries of RL in end-to-end autonomous driving [28].
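Where the summary describes scoring open-ended answers with an LLM "evaluation expert," the minimal sketch below shows how such a semantic reward could work. The `JUDGE_TEMPLATE` prompt, `judge_reward` helper, and demo judge are illustrative assumptions, not OpenREAD's implementation; a real setup would query a Qwen3-based scorer instead of the toy stand-in.

```python
# Sketch of an LLM-as-judge reward for open-ended answers, in the
# spirit of OpenREAD's "evaluation expert". The judge callable is a
# hypothetical stand-in for a Qwen3-based scorer.
import re
from typing import Callable

JUDGE_TEMPLATE = (
    "Reference answer: {ref}\n"
    "Candidate answer: {cand}\n"
    "Rate semantic agreement from 0 to 10. Reply with the number only."
)

def judge_reward(reference: str, candidate: str,
                 judge: Callable[[str], str]) -> float:
    """Score an open-ended answer against a reference with an LLM judge.

    Many phrasings can express the same reference answer, so a
    string-match reward would be too brittle; the judge scores meaning.
    """
    reply = judge(JUDGE_TEMPLATE.format(ref=reference, cand=candidate))
    match = re.search(r"\d+(\.\d+)?", reply)
    score = float(match.group()) if match else 0.0
    return max(0.0, min(score / 10.0, 1.0))  # normalize to [0, 1]

# Toy judge for a runnable demo; a real setup would call the LLM.
demo_judge = lambda prompt: "8"
print(judge_reward("Slow down for the pedestrian.",
                   "Reduce speed; a pedestrian is crossing.", demo_judge))
```

The normalized score can then be plugged into a policy-gradient update in place of a verifiable 0/1 reward.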
Apple withdraws RLAX paper at lightning speed: it used Google TPUs and Alibaba's Qwen, and Ruoming Pang is among the authors
机器之心· 2025-12-13 01:13
Core Viewpoint
- The article covers Apple's quickly withdrawn paper on RLAX, a scalable reinforcement learning framework that relies on Google's TPUs and other external cloud services, underscoring the company's AI-infrastructure engineering strength despite recent personnel departures [1][35].

Group 1: Paper Overview
- The paper, "RLAX: Large-Scale, Distributed Reinforcement Learning for Large Language Models on TPUs," was submitted on December 6 and withdrawn shortly after going public [1][7].
- RLAX is designed to run advanced reinforcement learning algorithms efficiently on large-scale distributed TPU clusters [12].

Group 2: Technical Contributions
- RLAX adopts a parameter-server architecture that logically separates training, inference, and validation components, allowing flexible resource allocation (see the sketch below) [14].
- The framework supports preemptive scheduling, so resources can be reclaimed immediately for higher-priority tasks without crashing the training process [15].
- RLAX addresses key challenges in post-training reinforcement learning, offering programmable configuration for managing on-policy and off-policy RL [16].

Group 3: Experimental Results
- In experiments, RLAX raised the pass@8 accuracy of the QwQ-32B model by 12.8% in 12 hours and 48 minutes on 1,024 TPU v5p chips [24].
- Development drew on Google's TPUs, Amazon's AWS Lambda for testing, and a Chinese open-source model, an unusually cross-vendor combination [26].

Group 4: Author Background
- Listed authors include Kelvin Zou, who has since moved to Meta, and Cheng Leong, a long-time Apple employee, reflecting talent shifts within the AI sector [8][9].
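The parameter-server separation and preemptive scheduling described in Group 2 can be illustrated with a toy sketch. All names here (`ParameterServer`, `rollout_worker`, the `preempted` flag) are invented for illustration and are not RLAX's API; the point is the decoupling of trainer pushes from worker pulls, plus clean worker shutdown on preemption.

```python
# Toy parameter-server loop: the trainer pushes versioned weights,
# decoupled inference workers pull them, and a preemption flag lets a
# scheduler reclaim workers without crashing training.
import threading

class ParameterServer:
    def __init__(self, weights):
        self._lock = threading.Lock()
        self._weights = weights
        self._version = 0

    def push(self, weights):            # called by the trainer
        with self._lock:
            self._weights, self._version = weights, self._version + 1

    def pull(self):                     # called by inference workers
        with self._lock:
            return self._version, self._weights

preempted = threading.Event()           # set by a higher-priority job

def rollout_worker(server: ParameterServer):
    while not preempted.is_set():       # exit cleanly when preempted
        version, weights = server.pull()
        # Rollouts generated with `weights` would be tagged with
        # `version`, letting the trainer treat stale ones as off-policy.
        break                            # single step for this demo

server = ParameterServer(weights={"w": 0.0})
server.push({"w": 0.1})
rollout_worker(server)
```

Version-tagging rollouts is one common way to make the on-policy/off-policy distinction programmable, which matches the configurability the summary attributes to RLAX.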
Global reinforcement learning + VLA paradigm: this company's technical groundwork lies behind PI*0.6
具身智能之心· 2025-12-13 01:02
Core Insights
- The article surveys advances in embodied intelligence, focusing on the Vision-Language-Action (VLA) model and its integration with reinforcement learning (RL) to strengthen robotic capabilities [2][4][50].

Group 1: Importance of VLA and RL
- VLA models are crucial for applying powerful vision-language models to robot control, moving beyond mere imitation learning toward robust performance in novel situations [6][8].
- Imitation learning alone is limited: robots struggle in unfamiliar scenarios, so RL is needed for continuous improvement through trial and error [8][12].

Group 2: Challenges in Applying RL to VLA
- Applying RL to VLA faces three main challenges: environmental differences, model instability, and computational demands [12][13].
- Directly applying RL to large VLA models can cause catastrophic forgetting and training collapse, making performance hard to maintain [12][13].

Group 3: iRe-VLA Model Design
- iRe-VLA uses a two-stage iterative learning process (sketched below), alternating exploration via online RL with consolidation via supervised learning [16][21].
- The architecture pairs a VLM backbone for understanding with an Action Head that emits control signals, optimized with LoRA to reduce the computational load [17][18].

Group 4: Experimental Results
- Experiments in simulation (MetaWorld, Franka Kitchen) and real-world settings showed iRe-VLA significantly outperforming traditional methods, with success rates on some tasks rising from 43% to 83% [38][39].
- In real-world trials, the success rate for grasping previously unseen objects climbed from 35% to 80% after training, showcasing stronger generalization [40][43].

Group 5: Conclusion and Future Directions
- iRe-VLA offers a viable route to deploying large models for robot control and motivates further research on efficient exploration and stable RL algorithms [48][50].
- Its design allocates resources sensibly: local robots handle lightweight tasks while cloud servers take the heavier computation, matching practical deployment scenarios [54].
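The two-stage alternation described in Group 3 can be made concrete with a toy loop. Everything below is a stand-in sketch, not the paper's code: `ToyPolicy`, `collect_episode`, and the update stub are invented for illustration, and the freeze/unfreeze flag abstracts the paper's frozen-backbone trick.

```python
# Runnable toy version of iRe-VLA's two-stage iteration: stage 1
# explores with online RL while the VLM backbone stays frozen (to
# avoid catastrophic forgetting); stage 2 consolidates expert demos
# plus self-collected successes with supervised learning.
import random

class ToyPolicy:
    def __init__(self):
        self.vlm_frozen = True          # backbone frozen in RL stage
    def act(self, obs):
        return random.random()          # placeholder action
    def supervised_update(self, data):
        pass                            # placeholder SFT/LoRA update

def collect_episode(policy):
    """Roll out one toy episode; success is random in this demo."""
    return {"traj": [policy.act(0)], "success": random.random() > 0.5}

def ire_vla(policy, expert_data, iterations=3, episodes=10):
    replay = list(expert_data)          # expert demos + self successes
    for _ in range(iterations):
        policy.vlm_frozen = True        # Stage 1: online RL exploration
        for _ in range(episodes):
            ep = collect_episode(policy)
            if ep["success"]:
                replay.append(ep)       # keep only high-value rollouts
        policy.vlm_frozen = False       # Stage 2: supervised consolidation
        policy.supervised_update(replay)
    return policy

ire_vla(ToyPolicy(), expert_data=[])
```

Keeping only successful rollouts in the replay pool is what lets the supervised stage internalize exploration gains without destabilizing the backbone.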
具身智能之心's paper mentoring is officially launched, bringing China's most professional faculty!
具身智能之心· 2025-12-12 07:59
Core Insights
- The article introduces a specialized mentoring program for academic papers in embodied intelligence, highlighting the expertise of the faculty involved [1]
- The program supports research directions including large models, reinforcement learning, and robotics, catering to a wide range of academic needs [1]

Group 1: Services Offered
- The program provides comprehensive support for the entire paper process, including experimental guidance and doctoral-application assistance [4]
- Papers mentored through the program have already been accepted at top conferences and journals such as CVPR, AAAI, and ICLR, at a high success rate [4]

Group 2: Target Publications
- Guidance covers submissions to top-tier conferences and journals ranked CCF-A, CCF-B, and CCF-C, as well as SCI- and EI-indexed venues [5]
- It also supports other academic requirements such as thesis papers and competition entries [5]
First prize in a national-level science and technology award: NetEase Fuxi's industry-academia-research collaborative innovation wins authoritative recognition
Sou Hu Cai Jing· 2025-12-12 04:15
Group 1
- The core point of the article is the recognition of the project "Key Technologies and Applications of Intelligent Decision-Making Based on Reinforcement Learning," which won the first prize in the 2025 CSIG Science and Technology Award, highlighting the collaboration between NetEase and several academic institutions [1][2]
- The project demonstrates the application of advanced AI technologies in the gaming industry, particularly in the game "Nirvana in Fire," showcasing how AI can enhance digital entertainment beyond traditional gaming boundaries [1][3]
- The award signifies the growing value and impact of digital entertainment works, indicating that they have transcended the mere concept of "playing games" [2][3]

Group 2
- The project addresses three key challenges in intelligent decision-making: low reward quality, difficulty in experience reuse, and significant environmental fluctuations, proposing innovative solutions that have reached internationally leading levels of performance and efficiency [3][4]
- The developed intelligent decision-making platform has been widely applied across industries including industrial software, defense, entertainment, and healthcare, generating significant economic and social benefits [4]
- The project has successfully deployed reinforcement learning technology in large commercial games and supported various national defense tasks, demonstrating its versatility and effectiveness in real-world applications [4]
Global reinforcement learning + VLA paradigm: this Chinese company's technical groundwork lies behind PI*0.6
机器之心· 2025-12-12 03:41
Core Insights
- The article discusses the significance of integrating Vision-Language-Action (VLA) models with reinforcement learning (RL) in embodied AI, emphasizing the limitations of imitation learning and the necessity of more robust learning methods [1][2][4].

Group 1: Importance of VLA+RL
- VLA models apply powerful vision-language models (VLMs) to robot control, primarily through supervised fine-tuning (SFT) [2].
- Imitation learning alone is insufficient for robots to handle novel situations, so RL is needed to improve robustness and persistence in task execution [4].

Group 2: Challenges in Applying RL to VLA
- Integrating RL with VLA faces three main challenges: environmental differences, model instability, and computational demands [6].
- Direct application of RL algorithms to large VLA models can lead to catastrophic forgetting and training collapse, making performance hard to maintain [6].

Group 3: Solutions to VLA's RL Challenges
- The industry has proposed three classes of solutions to these challenges, centered on internalizing high-value behaviors through SFT [7][13].
- The iRe-VLA model introduces a two-phase iterative process that alternates online RL for exploration with supervised learning for consolidation [10][15].

Group 4: iRe-VLA Model Architecture
- iRe-VLA consists of a VLM backbone that interprets images and instructions, and an Action Head that translates features into control signals [11].
- Low-Rank Adaptation (LoRA) enables efficient training without full-model fine-tuning (a toy LoRA illustration follows this summary) [12].

Group 5: Experimental Results and Analysis
- Extensive experiments in both simulated environments and real-world scenarios demonstrate the method's effectiveness, with significant gains in task success rates [26][30].
- iRe-VLA outperformed traditional methods, lifting the success rate from 43% to 83% on benchmark tasks [30].

Group 6: Conclusion and Future Implications
- The article concludes that iRe-VLA provides a viable answer to the challenges of deploying large models in robot control while ensuring stability and continuous learning [37][42].
- Future directions include efficient exploration and learning of new skills under sparse rewards, as well as scalable RL algorithms for large VLA models [40].
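Group 4 mentions LoRA as the mechanism that keeps fine-tuning lightweight. Below is a minimal NumPy illustration of the low-rank adapter idea under toy shapes; real VLA training would use a deep-learning framework and attach adapters to attention and MLP layers, so treat the class and dimensions here as assumptions.

```python
# LoRA in miniature: a frozen weight matrix W is augmented with a
# trainable low-rank update B @ A, so fine-tuning touches only a
# small fraction of the parameters of the large backbone.
import numpy as np

class LoRALinear:
    def __init__(self, w: np.ndarray, rank: int = 4, alpha: float = 8.0):
        d_out, d_in = w.shape
        self.w = w                                   # frozen weights
        self.a = np.random.randn(rank, d_in) * 0.01  # trainable
        self.b = np.zeros((d_out, rank))             # trainable, init 0
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Frozen path plus scaled low-rank correction; at init the
        # correction is zero, so behavior starts identical to the base.
        return x @ self.w.T + self.scale * (x @ self.a.T @ self.b.T)

layer = LoRALinear(np.random.randn(16, 32))
print(layer(np.random.randn(4, 32)).shape)  # (4, 16)
```

Initializing `b` to zeros is the standard trick that makes the adapted model start exactly at the pretrained behavior, which is one reason LoRA pairs well with the stability concerns raised in Group 2.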
Classes are officially open! Understand the state of end-to-end deployment through 7 projects
自动驾驶之心· 2025-12-12 03:02
Core Insights
- The article discusses the evolving recruitment landscape in the autonomous driving industry, where demand is shifting from perception roles toward end-to-end, VLA, and world-model positions [2]
- A new advanced course on end-to-end production in autonomous driving has been designed, emphasizing practical applications and real-world experience [2][4]

Course Overview
- The course is structured into eight chapters covering end-to-end algorithms: a task overview, two-stage and one-stage frameworks, navigation-information applications, reinforcement learning, trajectory optimization, and production experience sharing [5][7][8][9][10][11][12][13][14]
- Chapter 1 introduces the integration of perception tasks with learning-based control algorithms, essential skills for companies in the end-to-end era [7]
- Chapter 2 focuses on the two-stage end-to-end framework, discussing its modeling and the information handoff between perception and planning [8]
- Chapter 3 covers one-stage end-to-end algorithms, emphasizing their performance advantages and the main frameworks [9]
- Chapter 4 highlights the critical role of navigation information in autonomous driving and its integration into end-to-end models [10]
- Chapter 5 introduces reinforcement learning algorithms, addressing the limitations of imitation learning and the need for generalization [11]
- Chapter 6 involves practical projects on trajectory-output optimization, combining imitation and reinforcement learning [12]
- Chapter 7 discusses post-processing logic for trajectory smoothing and reliability in production (a toy smoothing example follows this summary) [13]
- The final chapter shares production experience from multiple perspectives, focusing on tools and strategies for real-world applications [14]

Target Audience
- The course is aimed at advanced learners with a foundational grasp of autonomous driving algorithms, reinforcement learning, and programming [15][17]
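The trajectory post-processing mentioned for chapter 7 can be illustrated with a toy smoother. The moving-average filter below is an assumption for illustration only; production stacks typically use spline fitting or optimization-based smoothers subject to kinematic constraints.

```python
# Toy post-processing step: moving-average smoothing of a planned
# 2D waypoint path before it is handed to the controller.
import numpy as np

def smooth_trajectory(points: np.ndarray, window: int = 5) -> np.ndarray:
    """Moving-average smoothing of an (N, 2) waypoint array.

    Endpoints are padded by repetition so the path keeps its start
    and goal; `window` trades smoothness against fidelity.
    """
    pad = window // 2
    padded = np.vstack([np.repeat(points[:1], pad, axis=0),
                        points,
                        np.repeat(points[-1:], pad, axis=0)])
    kernel = np.ones(window) / window
    return np.stack([np.convolve(padded[:, i], kernel, mode="valid")
                     for i in range(points.shape[1])], axis=1)

raw = np.array([[0, 0], [1, 0.4], [2, -0.3], [3, 0.5], [4, 0.0]])
print(smooth_trajectory(raw, window=3))
```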
i6 / i8 / MEGA delivered 6,798 / 6,719 / 680 units respectively | Li Auto's November 2025 record
理想TOP2· 2025-12-11 06:09
Core Insights
- The total delivery of Li Auto in November 2025 reached 33,181 units, with 18,984 being range-extended vehicles and 14,197 being pure electric vehicles [1][2]
- The delivery numbers show a month-on-month increase from October 2025, when total deliveries were 31,767 units, indicating a growth trend in the company's performance [2]

Delivery Data Summary
- November 2025: 33,181 total deliveries, including 18,984 range-extended and 14,197 pure electric vehicles [1][2]
- October 2025: 31,767 total deliveries, with 18,340 range-extended and 13,427 pure electric vehicles [2]
- September 2025: 33,951 total deliveries, of which 24,554 were range-extended vehicles [2]
- Monthly deliveries fluctuated through the year, with the highest in September and the lowest in February 2025 at 26,264 total deliveries [2]

Company Developments
- Li Auto's cumulative delivery of range-extended SUVs surpassed 1.4 million units as of November 10, 2025 [3]
- The company is focusing on enhancing its AI capabilities, with a new paper published on closed-loop reinforcement learning for autonomous driving [4]
- Li Auto plans to shorten its platform iteration cycle from four years to two, aiming to increase differentiation in vehicle design [4]

Strategic Initiatives
- The company is actively pursuing partnerships for technology and IP acquisition, particularly in silicon carbide chips [4]
- Li Auto is expanding its charging infrastructure, with the number of charging stations increasing from 3,509 to 3,597 [4]
- New features are being rolled out at its charging stations to enhance user experience and efficiency [4]
One year on, DiffusionDrive upgrades to v2 and sets a new record!
自动驾驶之心· 2025-12-11 03:35
Core Insights
- The article covers DiffusionDrive's upgrade to v2, which advances end-to-end autonomous driving trajectory planning by integrating reinforcement learning to resolve the tension between trajectory diversity and sustained high quality [1][3][10].

Background Review
- With traditional tasks such as 3D object detection and motion prediction maturing, the field has shifted toward end-to-end autonomous driving (E2E-AD). Early methods were limited in modeling, often generating a single trajectory with no alternatives in complex driving scenarios [5][10].
- Earlier diffusion models applied to trajectory generation suffered from mode collapse, losing diversity in generated behaviors. DiffusionDrive addressed this with a Gaussian Mixture Model (GMM) prior over the initial noise to promote diverse behavior generation [5][13].

Methodology
- DiffusionDriveV2 introduces a framework that uses reinforcement learning to overcome the limitations of imitation learning, which previously forced a trade-off between diversity and sustained high quality [10][12].
- It combines intra-anchor GRPO with inter-anchor truncated GRPO, confining advantage estimation to trajectories of the same driving intention so that inappropriate comparisons across intentions do not cause mode collapse (see the sketch below) [9][12][28].
- Scale-adaptive multiplicative noise enhances exploration while maintaining trajectory smoothness, compensating for the inherent scale mismatch between proximal and distal trajectory segments [24][39].

Experimental Results
- On the NAVSIM v1 and NAVSIM v2 datasets, DiffusionDriveV2 achieved state-of-the-art performance, scoring 91.2 PDMS on NAVSIM v1 and 85.5 on NAVSIM v2, significantly outperforming previous models [10][33].
- The results indicate that DiffusionDriveV2 effectively balances trajectory diversity against sustained quality, achieving optimal closed-loop performance [38][39].

Conclusion
- DiffusionDriveV2 resolves the inherent challenges of imitation learning in trajectory generation, reaching an optimal trade-off between planning quality and diversity through its reinforcement learning techniques [47].
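The intra-anchor advantage computation can be sketched from the standard GRPO group normalization; restricting each group to a single anchor reflects the summary's description, while the function name and array shapes are illustrative assumptions.

```python
# GRPO-style group-relative advantages, computed per anchor (driving
# intention): rewards are normalized only against trajectories sampled
# from the same anchor, so different intentions are never compared.
import numpy as np

def intra_anchor_advantages(rewards: np.ndarray,
                            anchor_ids: np.ndarray,
                            eps: float = 1e-6) -> np.ndarray:
    """Per-anchor-group advantage estimates.

    rewards:    (N,) scalar rewards for N sampled trajectories
    anchor_ids: (N,) index of the anchor each trajectory came from
    """
    adv = np.zeros_like(rewards, dtype=float)
    for a in np.unique(anchor_ids):
        mask = anchor_ids == a
        group = rewards[mask]
        # Normalize within the intention group only, so a globally
        # "better" intention cannot starve the others (mode collapse).
        adv[mask] = (group - group.mean()) / (group.std() + eps)
    return adv

r = np.array([1.0, 0.5, 0.9, 0.1])
print(intra_anchor_advantages(r, np.array([0, 0, 1, 1])))
```

In full GRPO these advantages would then weight a clipped policy-gradient objective; only the grouping rule differs here.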
Farewell to expert dependence: robots learn self-reference, with performance soaring to 99.2% in just 200 steps
具身智能之心· 2025-12-11 02:01
Core Insights
- The article presents the Self-Referential Policy Optimization (SRPO) framework, which addresses the limitations of existing Vision-Language-Action (VLA) models in robotic tasks by enabling robots to learn from their own experiences without relying on external expert data [3][10][56].

Motivation and Contribution
- SRPO targets the challenge of sparse reward signals in reinforcement learning, particularly in the VLA domain, by using self-generated successful trajectories to assign progressive rewards to failed attempts [6][10].
- The framework eliminates the need for costly expert demonstrations and task-specific reward engineering, improving the efficiency of the learning process [10][12].

Technical Approach
- SRPO collects the trajectories generated during policy inference, categorizes them into successes and failures, and models behavioral similarity in a latent world representation [16][17].
- A progressive reward is computed from the distance between a failed trajectory and the representations of successful trajectories, giving a graded measure of task progress (sketched below) [22][24].

Experimental Results
- On the LIBERO benchmark, SRPO achieved a 99.2% success rate after only 200 steps of reinforcement learning, significantly outperforming methods that rely on sparse rewards [29][30].
- In the LIBERO-Plus generalization tests, SRPO improved performance by 167%, showcasing robust generalization without additional training data [31][32].

Efficiency and Real-World Application
- SRPO's efficiency shows in long-horizon tasks, where success rates rose from 17.3% to 98.6% with minimal training steps, outperforming other models in training efficiency [36][39].
- Real-world tests showed significant gains in success rate over supervised fine-tuning baselines [41][39].

Conclusion
- SRPO represents a significant advance in robot learning: by learning from its own successes and failures, a robot can explore autonomously, paving the way for a new approach to VLA reinforcement learning [56].
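The progressive reward described in the technical approach can be sketched as follows. The mean-pooling `encode` stub stands in for SRPO's learned world representation, and the exponential distance-to-reward mapping is an illustrative assumption, not the paper's exact formula.

```python
# Self-referential reward shaping in miniature: a failed rollout earns
# a graded reward based on how close its latent representation lies to
# the policy's own successful trajectories.
import numpy as np

def encode(trajectory: np.ndarray) -> np.ndarray:
    """Stub latent encoder: mean-pool the state sequence."""
    return trajectory.mean(axis=0)

def self_referential_reward(failed: np.ndarray,
                            successes: list,
                            temperature: float = 1.0) -> float:
    """Progressive reward in (0, 1] from distance to nearest success."""
    z = encode(failed)
    d = min(np.linalg.norm(z - encode(s)) for s in successes)
    return float(np.exp(-d / temperature))   # closer => higher reward

success_pool = [np.random.randn(20, 8) for _ in range(3)]
print(self_referential_reward(np.random.randn(20, 8), success_pool))
```

Because the success pool comes from the policy's own rollouts, the shaping signal densifies automatically as training progresses, which is the core of the "self-referential" idea.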