Core Viewpoint
- The article introduces IRL-VLA, a novel closed-loop reinforcement learning framework that combines inverse reinforcement learning with a reward world model for vision-language-action (VLA) models in autonomous driving, addressing the limitations of existing open-loop imitation learning and simulation-based closed-loop training [2][3][6]

Group 1: Key Issues in VLA
- Existing VLA architectures are typically trained in an open-loop setting with imitation learning, which limits performance because the model mainly reproduces behaviors recorded in the dataset [2][3]
- Closed-loop training relies heavily on high-fidelity sensor simulation, where domain gaps and limited computational efficiency hinder the generalization of VLA models [2][3]

Group 2: Introduction of IRL-VLA
- Teams from Bosch, Shanghai University, and Tsinghua University proposed IRL-VLA, a new closed-loop reinforcement learning method that combines inverse reinforcement learning with a purpose-built VLA architecture [3][5]
- IRL-VLA follows a three-stage paradigm: pre-training the VLA policy with imitation learning, constructing a lightweight reward world model via inverse reinforcement learning, and improving planning performance with reward-guided reinforcement learning using Proximal Policy Optimization (PPO) [3][5]; a hedged sketch of this pipeline is given after this summary

Group 3: Performance Achievements
- IRL-VLA achieves state-of-the-art (SOTA) performance on the NAVSIM v2 end-to-end driving benchmark and took second place in the CVPR 2025 autonomous driving challenge [5][9]
- The framework shows significant improvements in balancing safety, driving comfort, and traffic efficiency [5][9]

Group 4: Contributions of IRL-VLA
- An efficient reward world model (RWM) built via inverse reinforcement learning captures the multimodal, multi-objective nature of driving while avoiding computationally intensive simulation [9][11]
- A new VLA model performs strongly in both imitation learning and reinforcement learning settings, achieving top performance across the two training paradigms [11][12]

Group 5: Experimental Results
- On the NAVSIM benchmark, the pre-trained model (IRL-VLA-PT) achieves a competitive EPDMS of 74.4, outperforming several state-of-the-art methods [42]
- The model maintains strong safety performance while markedly improving metrics related to driving comfort and progress [42][43]

Group 6: Technical Details
- The IRL-VLA model uses a V2-99 backbone and processes multi-view camera inputs at a resolution of 256 × 704 [35]
- Training consists of 100 epochs of pre-training with the AdamW optimizer, followed by reinforcement learning with PPO on NVIDIA A100 GPUs [35][36]; an illustrative configuration sketch is provided below

Group 7: Conclusion
- IRL-VLA is a pioneering simulator-free closed-loop VLA method, paving the way for future advances in closed-loop autonomous driving systems [46]
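The following is a minimal sketch of the three-stage paradigm summarized in Group 2 (imitation pre-training, inverse-RL training of a reward world model, and reward-guided PPO). All names (RewardWorldModel, stage1_imitation, ppo_trainer, the margin-ranking objective, etc.) are illustrative assumptions, not the authors' released code; the paper's exact losses and architectures may differ.

# Minimal sketch of the three-stage IRL-VLA paradigm described above.
# Class/function names and losses are hypothetical placeholders for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardWorldModel(nn.Module):
    """Lightweight scorer mapping (scene features, planned trajectory) -> scalar reward."""
    def __init__(self, feat_dim: int, traj_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + traj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, scene_feat: torch.Tensor, traj: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([scene_feat, traj], dim=-1)).squeeze(-1)


def stage1_imitation(policy, optimizer, expert_loader, epochs: int):
    """Stage 1: open-loop imitation pre-training on recorded expert trajectories."""
    for _ in range(epochs):
        for scene_feat, expert_traj in expert_loader:
            loss = F.l1_loss(policy(scene_feat), expert_traj)  # trajectory regression
            optimizer.zero_grad(); loss.backward(); optimizer.step()


def stage2_inverse_rl(rwm, optimizer, policy, expert_loader, steps: int):
    """Stage 2: fit the reward world model so expert plans score above policy plans
    (a margin-ranking form of inverse RL; assumed here for illustration)."""
    data_iter = iter(expert_loader)
    for _ in range(steps):
        scene_feat, expert_traj = next(data_iter)
        policy_traj = policy(scene_feat).detach()
        margin = rwm(scene_feat, expert_traj) - rwm(scene_feat, policy_traj)
        loss = F.relu(1.0 - margin).mean()
        optimizer.zero_grad(); loss.backward(); optimizer.step()


def stage3_ppo(policy, rwm, ppo_trainer, iterations: int):
    """Stage 3: reward-guided PPO; the frozen RWM supplies rewards in place of a
    high-fidelity sensor simulator."""
    for _ in range(iterations):
        batch = ppo_trainer.collect_rollouts(policy)          # closed-loop rollouts
        batch.rewards = rwm(batch.scene_feat, batch.actions)  # score plans with the RWM
        ppo_trainer.update(policy, batch)

The key design point captured here is that the learned reward world model, rather than a physics- or sensor-level simulator, closes the training loop, which is what makes the third stage computationally light.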
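Below is an illustrative training configuration collecting the details reported in Group 6 (V2-99 backbone, 256 × 704 multi-view inputs, 100 pre-training epochs, AdamW, PPO on NVIDIA A100 GPUs). Every other value (field names, learning rate, weight decay, number of camera views, PPO clip range) is an assumption added for illustration and is not stated in the source.

# Illustrative configuration; only values marked "from the article" are sourced.
config = {
    "model": {
        "backbone": "V2-99",          # from the article
        "image_size": (256, 704),     # H x W per camera view, from the article
        "num_views": 6,               # assumption: typical surround-view camera rig
    },
    "pretrain": {                     # Stage 1: imitation learning
        "epochs": 100,                # from the article
        "optimizer": "AdamW",         # from the article
        "lr": 2e-4,                   # assumption
        "weight_decay": 0.01,         # assumption
    },
    "rl_finetune": {                  # Stage 3: reward-guided reinforcement learning
        "algorithm": "PPO",           # from the article
        "device": "NVIDIA A100",      # from the article
        "clip_range": 0.2,            # assumption: common PPO default
    },
}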
Autonomous-driving VLA upgraded again! Bosch's latest IRL-VLA: a reward world model powers a brand-new closed-loop reinforcement learning framework
自动驾驶之心 · 2025-08-12 23:33