Inverse Reinforcement Learning
AI Cultivates Values Through "Observational Learning"
Xin Lang Cai Jing· 2026-01-04 19:01
Core Insights
- A study from the University of Washington indicates that AI systems can learn and internalize cultural values by observing human behavior within specific cultures, offering new insight into AI's cross-cultural adaptation challenges [1][2]

Group 1: Research Findings
- Current AI training typically relies on large-scale internet data, which often carries culturally biased values, leading to inconsistent performance across different cultural backgrounds [1]
- The research team explored whether AI could learn cultural values naturally, much as children learn by observing the behavior of people around them [1][2]
- In an experiment, 190 adults interacted with an AI agent in a collaborative task adapted from the game "Overcooked", where participants exhibited notably altruistic behavior [1]

Group 2: AI Learning Mechanism
- The AI agents used "inverse reinforcement learning" to infer the behavioral goals and intrinsic values of the observed group, and successfully transferred the learned altruistic tendencies to new scenarios such as donation tasks (a minimal sketch of this idea appears after this summary) [2]
- The learning process resembles how children acquire social behaviors such as sharing and caring through observation rather than direct instruction [2]

Group 3: Implications and Future Research
- Building culturally adaptive AI that can understand others' perspectives is identified as a significant societal challenge [2]
- As the diversity and volume of input data grow, this observational learning approach may help produce AI systems that are better aligned with specific cultural contexts [2]
- The research remains at the proof-of-concept stage and requires further testing across varied cultural settings, value-conflict scenarios, and complex real-world problems [2]
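The observational-learning mechanism described above can be illustrated with a toy sketch: a maximum-entropy-style inverse reinforcement learning loop that infers how strongly an observed group weighs "benefit to others" versus "benefit to self" from their choices, then reuses the inferred weighting in an unseen donation-like scenario. The action sets, feature definitions, and hyperparameters below are illustrative assumptions, not the study's actual setup.

```python
# Minimal sketch, assuming a linear reward over [payoff_to_self, payoff_to_other]
# features and a one-step Boltzmann choice model (not the paper's implementation).
import numpy as np

rng = np.random.default_rng(0)

# Each action is described by features: [payoff_to_self, payoff_to_other].
COOP_ACTIONS = np.array([
    [1.0, 0.0],   # keep everything for yourself
    [0.6, 0.6],   # split the gains
    [0.2, 1.0],   # mostly help the partner
])

def softmax_policy(weights, feats):
    """Boltzmann policy over discrete actions under a linear reward w . phi(a)."""
    logits = feats @ weights
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

# Synthetic "observed humans": they usually pick the prosocial options.
observed_choices = rng.choice(len(COOP_ACTIONS), size=200, p=[0.1, 0.5, 0.4])
expert_feature_mean = COOP_ACTIONS[observed_choices].mean(axis=0)

# Max-entropy IRL with a linear reward: gradient ascent to match expected features.
w = np.zeros(2)
for _ in range(500):
    policy_feature_mean = softmax_policy(w, COOP_ACTIONS) @ COOP_ACTIONS
    w += 0.1 * (expert_feature_mean - policy_feature_mean)  # log-likelihood gradient

print("inferred value weights [self, other]:", w.round(2))

# Transfer: apply the inferred weights to an unseen donation scenario.
DONATION_ACTIONS = np.array([
    [1.0, 0.0],   # donate nothing
    [0.5, 0.8],   # donate half
    [0.0, 1.5],   # donate everything
])
print("donation policy:", softmax_policy(w, DONATION_ACTIONS).round(2))
```

Because the observed group favors sharing, the inferred weight on the partner's payoff comes out positive, and the same weights then shift probability mass toward donating in the new task, which is the transfer effect the summary describes.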
Autonomous-Driving VLA Upgraded Again! Bosch's Latest IRL-VLA: A Reward World Model Builds a Brand-New Closed-Loop Reinforcement Learning Framework
自动驾驶之心· 2025-08-12 23:33
Core Viewpoint
- The article introduces IRL-VLA, a novel closed-loop reinforcement learning framework that combines inverse reinforcement learning with a reward world model for vision-language-action (VLA) models in autonomous driving, addressing the limitations of existing open-loop imitation learning and simulation-based training [2][3][6]

Group 1: Key Issues in VLA
- Existing VLA architectures are typically trained in open-loop settings with imitation learning, which limits performance to the recorded behaviors in the dataset [2][3]
- Closed-loop training relies heavily on high-fidelity sensor simulation, but domain gaps and computational-efficiency issues hinder the generalization of VLA models [2][3]

Group 2: Introduction of IRL-VLA
- Teams from Bosch, Shanghai University, and Tsinghua University proposed IRL-VLA, a new closed-loop reinforcement learning method that combines inverse reinforcement learning with a purpose-designed VLA model [3][5]
- IRL-VLA follows a three-stage paradigm: pre-training the VLA policy with imitation learning, constructing a lightweight reward world model via inverse reinforcement learning, and improving planning performance with reward-guided reinforcement learning using Proximal Policy Optimization (PPO); a simplified sketch of this recipe appears after this summary [3][5]

Group 3: Performance Achievements
- IRL-VLA achieved state-of-the-art (SOTA) performance on the NAVSIM v2 end-to-end driving benchmark and took second place in the CVPR 2025 autonomous driving competition [5][9]
- The framework showed marked improvements in balancing safety events, driving comfort, and traffic efficiency [5][9]

Group 4: Contributions of IRL-VLA
- An efficient reward world model (RWM) built via inverse reinforcement learning captures the multimodal, multi-objective nature of driving while avoiding computationally intensive simulation [9][11]
- A new VLA model performs strongly in both imitation learning and reinforcement learning settings, achieving top performance across training paradigms [11][12]

Group 5: Experimental Results
- On the NAVSIM benchmark, the pre-trained model (IRL-VLA-PT) achieved a competitive EPDMS of 74.4, outperforming several state-of-the-art methods [42]
- The model maintained high safety performance while significantly improving metrics for driving comfort and progress [42][43]

Group 6: Technical Details
- The IRL-VLA model uses a V2-99 backbone and processes multi-view camera inputs at a resolution of 256 × 704 [35]
- Training consisted of 100 epochs of pre-training with the AdamW optimizer, followed by reinforcement learning with PPO on NVIDIA A100 GPUs [35][36]

Group 7: Conclusion
- IRL-VLA is a pioneering closed-loop VLA method that does not rely on a simulator, paving the way for future closed-loop autonomous driving systems [46]
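The three-stage recipe summarized above can be sketched in a highly condensed form: behavior-cloning pre-training of a planner, an inverse-RL-style reward world model trained to rank expert trajectories above the planner's rollouts, and a PPO-style update driven by the learned reward. This is not Bosch's implementation; the Planner and reward_model networks, the single-vector trajectory representation, OBS_DIM/TRAJ_DIM, and all hyperparameters are toy stand-ins for the real multi-view VLA model, the V2-99 backbone, and NAVSIM scenes.

```python
# Hypothetical three-stage sketch under the assumptions stated above.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, TRAJ_DIM = 32, 8  # toy stand-ins for scene features and a planned trajectory

class Planner(nn.Module):
    """Gaussian policy over trajectories, standing in for the VLA planning head."""
    def __init__(self):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, TRAJ_DIM))
        self.log_std = nn.Parameter(torch.zeros(TRAJ_DIM))

    def dist(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

# Lightweight reward world model: scores (observation, trajectory) pairs.
reward_model = nn.Sequential(nn.Linear(OBS_DIM + TRAJ_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

planner = Planner()
obs = torch.randn(256, OBS_DIM)            # placeholder scene features
expert_traj = torch.randn(256, TRAJ_DIM)   # placeholder expert trajectories

# Stage 1: imitation pre-training (behavior cloning of expert trajectories).
opt = torch.optim.AdamW(planner.parameters(), lr=1e-3)
for _ in range(200):
    loss = -planner.dist(obs).log_prob(expert_traj).sum(-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: reward world model via inverse RL (rank expert above policy rollouts).
rwm_opt = torch.optim.AdamW(reward_model.parameters(), lr=1e-3)
for _ in range(200):
    rollout = planner.dist(obs).sample()
    r_expert = reward_model(torch.cat([obs, expert_traj], dim=-1))
    r_policy = reward_model(torch.cat([obs, rollout], dim=-1))
    loss = -F.logsigmoid(r_expert - r_policy).mean()
    rwm_opt.zero_grad(); loss.backward(); rwm_opt.step()

# Stage 3: PPO-style reward-guided fine-tuning of the planner using RWM scores.
ppo_opt = torch.optim.AdamW(planner.parameters(), lr=3e-4)
for _ in range(100):
    with torch.no_grad():
        old_dist = planner.dist(obs)
        traj = old_dist.sample()
        old_logp = old_dist.log_prob(traj).sum(-1)
        reward = reward_model(torch.cat([obs, traj], dim=-1)).squeeze(-1)
        adv = (reward - reward.mean()) / (reward.std() + 1e-6)  # simple baseline
    logp = planner.dist(obs).log_prob(traj).sum(-1)
    ratio = (logp - old_logp).exp()
    ppo_loss = -torch.min(ratio * adv, ratio.clamp(0.8, 1.2) * adv).mean()
    ppo_opt.zero_grad(); ppo_loss.backward(); ppo_opt.step()
```

The point of the structure is the one the article emphasizes: the learned reward world model replaces a high-fidelity simulator as the source of closed-loop feedback, so stage 3 can optimize the planner without rendering sensor data.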