OpenDriveVLA
Imperial College VLA Survey: From World Models to VLA, How to Restructure Autonomous Driving (T-ITS)
自动驾驶之心· 2026-01-05 00:35
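Authors | Hanlin Tian, Kethan Reddy, Yuxiang Feng, Mohammed Quddus, Yiannis Demiris, Panagiotis Angeloudis
Affiliation | Imperial College London

This work systematically reviews 77 frontier papers published through September 2025. Across three dimensions (end-to-end VLA, world models, and modular integration), it gives developers a detailed learning roadmap for bringing large models into vehicles.

Key takeaways: understanding the evolution of world models and VLA
The survey addresses the three technical questions the autonomous driving community currently cares about most, and lays out a clear technical taxonomy.

Figure 1: Technical architecture map of large models for autonomous driving (covering four core areas: modular, end-to-end, data generation, and platforms)

1. The endgame of end-to-end integration: VLA (Vision-Language-Action)
Recently, the introduction of architectures such as DriveLaW and OpenDriveVLA marks a shift in autonomous driving from separated "perception-planning" toward VLA (Vision-Language-Action) end-to ...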
Starting Soon! A Full-Stack Learning Roadmap for Autonomous Driving VLA
自动驾驶之心· 2025-10-15 23:33
Core Insights
- The focus of academia and industry has shifted towards VLA (Vision-Language-Action) in autonomous driving, which provides human-like reasoning capabilities for vehicle decision-making [1][4]
- Traditional methods in perception and lane detection have matured and now attract less attention, while VLA has become a critical development area for major autonomous driving companies [4][6]

Summary by Sections

Introduction to VLA
- VLA is categorized into modular VLA, integrated VLA, and reasoning-enhanced VLA, which are essential for improving the reliability and safety of autonomous driving [1][4]

Course Overview
- A comprehensive course on autonomous driving VLA has been designed, running from foundational principles to practical applications and covering cutting-edge algorithms such as CoT, MoE, RAG, and reinforcement learning [6][12]

Course Structure
- The course consists of six chapters: an introduction to VLA algorithms, foundational algorithms, VLM as an interpreter, modular and integrated VLA, reasoning-enhanced VLA, and a final project [12][20]

Chapter Highlights
- Chapter 1 provides an overview of VLA algorithms and their development history, along with benchmarks and evaluation metrics [13]
- Chapter 2 focuses on the foundational knowledge of the Vision, Language, and Action modules, including the deployment of large models [14]
- Chapter 3 discusses VLM's role as an interpreter in autonomous driving, covering classic and recent algorithms [15]
- Chapter 4 delves into modular and integrated VLA, emphasizing the evolution of language models in planning and control [16]
- Chapter 5 explores reasoning-enhanced VLA, introducing new modules for decision-making and action generation [17][19]

Learning Outcomes
- The course aims to deepen participants' understanding of VLA's current advancements, core algorithms, and project applications, to help with internships and job placements [24]
The Post-End-to-End Era: Must We Find a New Path?
自动驾驶之心· 2025-09-01 23:32
Core Viewpoint
- The article discusses the evolution of autonomous driving technology, particularly focusing on the transition from end-to-end systems to Vision-Language-Action (VLA) models, highlighting the differing approaches and perspectives within the industry regarding these technologies [6][32][34].

Group 1: VLA and Its Implications
- VLA, or Vision-Language-Action Model, aims to integrate visual perception and natural language processing to enhance decision-making in autonomous driving systems [9][10].
- The VLA model attempts to map human driving instincts into interpretable language commands, which are then converted into machine actions, potentially offering both strong integration and improved explainability [10][19].
- Companies like Wayve are leading the exploration of VLA, with their LINGO series demonstrating the ability to combine natural language with driving actions, allowing for real-time interaction and explanations of driving decisions [12][18].

Group 2: Industry Perspectives and Divergence
- The current landscape of autonomous driving is characterized by a divergence in approaches, with some teams embracing VLA while others remain skeptical, preferring to focus on traditional Vision-Action (VA) models [5][6][19].
- Major players like Huawei and Horizon have expressed reservations about VLA, opting instead to refine existing VA models, which they believe can still achieve effective results without the complexities introduced by language processing [5][21][25].
- The skepticism surrounding VLA stems from concerns about the ambiguity and imprecision of natural language in driving contexts, which can lead to challenges in real-time decision-making [19][21][23].

Group 3: Technical Challenges and Considerations
- VLA models face significant technical challenges, including high computational demands and potential latency issues, which are critical in scenarios requiring immediate responses [21][22].
- The integration of language processing into driving systems may introduce noise and ambiguity, complicating the training and operational phases of VLA models [19][23].
- Companies are exploring various strategies to mitigate these challenges, such as enhancing computational power or refining data collection methods to ensure that language inputs align effectively with driving actions [22][34].

Group 4: Future Directions and Industry Outlook
- The article suggests that the future of autonomous driving may not solely rely on new technologies like VLA but also on improving existing systems and methodologies to ensure stability and reliability [34].
- As the industry evolves, companies will need to determine whether to pursue innovative paths with VLA or to solidify their existing frameworks, each offering unique opportunities and challenges [34].
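To make the latency concern in Group 3 concrete, the short sketch below computes how far a vehicle travels while a planner is still producing its output. The function name and the example speed and latency are illustrative assumptions, not figures from the article.

```python
def distance_during_latency(speed_kmh: float, latency_s: float) -> float:
    """Distance in metres covered while the planner is still computing.

    Both arguments are illustrative inputs; no real system's latency
    is implied.
    """
    if speed_kmh < 0 or latency_s < 0:
        raise ValueError("speed and latency must be non-negative")
    speed_mps = speed_kmh * 1000 / 3600  # convert km/h to m/s
    return speed_mps * latency_s


if __name__ == "__main__":
    # Assumed example: 72 km/h with half a second of inference delay.
    print(distance_during_latency(72, 0.5))
```

Under these assumed numbers the vehicle covers roughly 10 m before any action is issued, which is why inference delay matters for scenarios that need an immediate response.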
Autonomous Driving VLA: OpenDriveVLA and AutoVLA
自动驾驶之心· 2025-08-18 01:32
Core Insights
- The article discusses two significant papers, OpenDriveVLA and AutoVLA, which apply large vision-language models (VLMs) to end-to-end autonomous driving, highlighting their distinct technical paths and philosophies [22].

Group 1: OpenDriveVLA
- OpenDriveVLA aims to address the "modal gap" that traditional VLMs face in dynamic 3D driving environments, emphasizing the need for a structured understanding of the 3D world [23].
- The methodology includes several key steps: 3D visual environment perception, visual-language hierarchical alignment, and a multi-stage training paradigm [24][25].
- The model uses structured, layered tokens (Agent, Map, Scene) to enhance the VLM's understanding of the environment, which helps mitigate the risk of spatial hallucination [6][9].
- OpenDriveVLA achieved state-of-the-art performance on the nuScenes open-loop planning benchmark, demonstrating the effectiveness of its perception-based anchoring strategy [10][20].

Group 2: AutoVLA
- AutoVLA focuses on integrating driving tasks into the native operation of VLMs, transforming them from scene narrators into genuine decision-makers [26].
- The methodology features layered visual token extraction, and the model emits discrete action codes instead of continuous coordinates, converting trajectory planning into a next-token prediction task [14][29].
- The model employs a dual-mode thinking approach, allowing it to adapt its reasoning depth to scene complexity, balancing efficiency and effectiveness [28].
- AutoVLA's reinforcement learning fine-tuning (RFT) enhances its driving strategy, enabling the model to optimize its behavior actively rather than merely imitating human driving [30][35].

Group 3: Comparative Analysis
- OpenDriveVLA emphasizes perception-language alignment to improve the VLM's understanding of the 3D world, while AutoVLA focuses on language-decision integration to enhance the VLM's decision-making capabilities [32].
- The two models represent complementary approaches: OpenDriveVLA provides a robust perception foundation, while AutoVLA optimizes decision-making strategies through reinforcement learning [34].
- Future models may combine the strengths of both approaches, using OpenDriveVLA's structured perception together with AutoVLA's action tokenization and reinforcement learning to create a powerful autonomous driving system [36].
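The idea of replacing continuous coordinates with discrete action codes can be illustrated with a toy example. The sketch below is not AutoVLA's tokenizer: the five-entry `CODEBOOK`, its values, and the function names are hypothetical, chosen only to show how a trajectory becomes a token sequence that a language model could predict one token at a time.

```python
import math

# Hypothetical codebook: token id -> (dx, dy) displacement in metres for one
# planning step. The values are illustrative, not taken from any paper.
CODEBOOK = [
    (0.0, 0.0),   # 0: stop
    (1.0, 0.0),   # 1: slow straight
    (2.0, 0.0),   # 2: fast straight
    (1.0, 0.5),   # 3: drift left
    (1.0, -0.5),  # 4: drift right
]


def tokenize(waypoints):
    """Map each step of a continuous trajectory to its nearest codebook id."""
    tokens = []
    prev_x, prev_y = 0.0, 0.0  # the ego vehicle starts at the origin
    for x, y in waypoints:
        step = (x - prev_x, y - prev_y)
        nearest = min(range(len(CODEBOOK)),
                      key=lambda i: math.dist(step, CODEBOOK[i]))
        tokens.append(nearest)
        prev_x, prev_y = x, y
    return tokens


def detokenize(tokens):
    """Rebuild an approximate trajectory by accumulating displacements."""
    x, y = 0.0, 0.0
    waypoints = []
    for token in tokens:
        dx, dy = CODEBOOK[token]
        x, y = x + dx, y + dy
        waypoints.append((x, y))
    return waypoints


if __name__ == "__main__":
    trajectory = [(1.0, 0.0), (3.0, 0.1), (4.0, 0.6)]
    tokens = tokenize(trajectory)
    print(tokens)              # [1, 2, 3]
    print(detokenize(tokens))  # [(1.0, 0.0), (3.0, 0.0), (4.0, 0.5)]
```

The round trip is lossy: the reconstructed path differs from the input by the quantization error of each step. A real system would presumably use a much larger codebook learned from driving data, which is an assumption here rather than something the summary states.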
End-to-End VLA for Autonomous Driving Is Reaching Deployment: How Should the Algorithms Be Designed?
自动驾驶之心· 2025-06-22 14:09
Core Insights
- The article discusses the rapid advancements in end-to-end autonomous driving, particularly focusing on Vision-Language-Action (VLA) models and their applications in the industry [2][3].

Group 1: VLA Model Developments
- The introduction of AutoVLA, a new VLA model that integrates reasoning and action generation for end-to-end autonomous driving, shows promising results in semantic reasoning and trajectory planning [3][4].
- ReCogDrive, another VLA model, addresses performance issues in rare and long-tail scenarios by utilizing a three-stage training framework that combines visual language models with diffusion planners [7][9].
- Impromptu VLA introduces a dataset aimed at improving VLA models' performance in unstructured extreme conditions, demonstrating significant performance improvements in established benchmarks [14][24].

Group 2: Experimental Results
- AutoVLA achieved competitive performance metrics in various scenarios, with the best-of-N method reaching a PDMS score of 92.12, indicating its effectiveness in planning and execution [5].
- ReCogDrive set a new state-of-the-art PDMS score of 89.6 on the NAVSIM benchmark, showcasing its robustness and safety in driving trajectories [9][10].
- The OpenDriveVLA model demonstrated superior results in open-loop trajectory planning and driving-related question-answering tasks, outperforming previous methods on the nuScenes dataset [28][32].

Group 3: Industry Trends
- The article highlights a trend among major automotive manufacturers, such as Li Auto, Xiaomi, and XPeng, to invest heavily in VLA model research and development, indicating a competitive landscape in autonomous driving technology [2][3].
- The integration of large language models (LLMs) with VLA frameworks is becoming a focal point for enhancing decision-making capabilities in autonomous vehicles, as seen in models like ORION and VLM-RL [33][39].
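The best-of-N method mentioned in Group 2 can be sketched generically: sample N candidate trajectories, score each one, and keep the highest-scoring candidate. The code below is a minimal illustration under stated assumptions. `toy_score` is a hypothetical stand-in that penalizes lateral offset from the lane centre; it is not the PDMS metric, and the candidate trajectories are made up.

```python
def best_of_n(candidates, score_fn):
    """Return (best_candidate, best_score); ties go to the earliest candidate."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [score_fn(c) for c in candidates]
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_idx], scores[best_idx]


def toy_score(trajectory):
    """Hypothetical scorer: less total lateral offset gives a higher score."""
    return -sum(abs(y) for _, y in trajectory)


if __name__ == "__main__":
    # Three made-up candidates, each a list of (x, y) waypoints in metres.
    candidates = [
        [(1.0, 0.4), (2.0, 0.8)],
        [(1.0, 0.0), (2.0, 0.1)],
        [(1.0, -0.3), (2.0, -0.6)],
    ]
    best, score = best_of_n(candidates, toy_score)
    print(best)  # [(1.0, 0.0), (2.0, 0.1)]
```

The selection step is only as good as the scorer, so a reported best-of-N result depends on both the quality of the sampled candidates and the scoring function used to rank them.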