Course starting soon! Sharing a full-stack learning roadmap for autonomous driving VLA~
自动驾驶之心· 2025-10-15 23:33
Core Insights
- The focus of academia and industry has shifted toward Vision-Language-Action (VLA) models in autonomous driving, which provide human-like reasoning capabilities for vehicle decision-making [1][4]
- Traditional methods in perception and lane detection have matured and now draw less attention, while VLA has become a critical area of development among major autonomous driving companies [4][6]

Summary by Sections

Introduction to VLA
- VLA approaches are categorized into modular VLA, integrated VLA, and reasoning-enhanced VLA, all essential for improving the reliability and safety of autonomous driving [1][4]

Course Overview
- A comprehensive course on autonomous driving VLA has been designed, spanning foundational principles through practical applications and covering cutting-edge techniques such as CoT, MoE, RAG, and reinforcement learning [6][12]

Course Structure
- The course consists of six chapters: an introduction to VLA algorithms, foundational algorithms, VLM as an interpreter, modular and integrated VLA, reasoning-enhanced VLA, and a final project [12][20]

Chapter Highlights
- Chapter 1 provides an overview of VLA algorithms and their development history, along with benchmarks and evaluation metrics [13]
- Chapter 2 covers foundational knowledge of the Vision, Language, and Action modules, including deployment of large models [14]
- Chapter 3 discusses the VLM's role as an interpreter in autonomous driving, covering classic and recent algorithms [15]
- Chapter 4 delves into modular and integrated VLA, emphasizing the evolution of language models in planning and control [16]
- Chapter 5 explores reasoning-enhanced VLA, introducing new modules for decision-making and action generation [17][19]

Learning Outcomes
- The course aims to deepen understanding of VLA's current advancements, core algorithms, and project applications, benefiting participants in internships and job placements [24]
From a Tsinghua teaching-and-research team! Build your own autonomous driving VLA model from scratch in two months
自动驾驶之心· 2025-09-28 07:21
Core Viewpoint
- After end-to-end systems, the focus of academia and industry is on Vision-Language-Action (VLA), which provides human-like reasoning capabilities for safer and more reliable autonomous driving [1][4]

Summary by Sections

Introduction to Autonomous Driving VLA
- VLA approaches are categorized into modular VLA, integrated VLA, and reasoning-enhanced VLA, all essential for advancing autonomous driving technology [1][4]

Technical Maturity and Employment Demand
- Demand for autonomous driving VLA solutions is high among major companies, prompting them to invest in in-house research and development [4]

Course Overview
- A comprehensive learning roadmap for autonomous driving VLA has been designed, covering principles through practical applications [4][6]

Core Content of Autonomous Driving VLA
- Key topics include visual perception, large language models, action modeling, model deployment, and dataset creation, together with cutting-edge techniques such as CoT, MoE, RAG, and reinforcement learning [6]

Course Collaboration
- The course is developed in collaboration with a Tsinghua University research team and features detailed explanations of algorithms along with practical assignments [6]

Course Structure
- The course consists of six chapters, each addressing a different aspect of VLA: algorithm introduction, foundational algorithms, VLM as an interpreter, modular and integrated VLA, reasoning-enhanced VLA, and a final project [12][20]

Chapter Details
- Chapter 1 covers the concept and history of VLA algorithms, including benchmarks and evaluation metrics [13]
- Chapter 2 focuses on foundational algorithms for Vision, Language, and Action, along with model deployment [14]
- Chapter 3 discusses the VLM's role as an interpreter in autonomous driving, highlighting key algorithms [15]
- Chapter 4 delves into modular and integrated VLA, emphasizing the evolution of language models in planning [16]
- Chapter 5 explores reasoning-enhanced VLA, introducing new modules for decision-making and action output [17]
- Chapter 6 is a hands-on project in which participants build and fine-tune their own models [20]

Learning Outcomes
- The course aims to deepen understanding of VLA's current advancements and core algorithms, equipping participants with practical skills for future research and applications in the autonomous driving sector [22][26]

Course Schedule
- The course begins on October 20, with a structured timeline for each chapter's release [23]

Prerequisites
- Participants are expected to have foundational knowledge of autonomous driving, large models, and reinforcement learning, as well as programming skills in Python and PyTorch [26]
All in one article! A roundup of outstanding autonomous driving VLA work from the past year~
自动驾驶之心· 2025-07-15 12:30
Core Insights
- The article surveys recent advances in Vision-Language-Action (VLA) models for autonomous driving, highlighting the integration of navigation and reinforcement learning to extend reasoning capabilities beyond the visual range [2][3][6]

Group 1: NavigScene
- NavigScene is introduced as a novel auxiliary dataset that pairs local multi-view sensor inputs with global natural-language navigation guidance, addressing the critical gap between local perception and global navigation context in autonomous driving [6]
- Three complementary paradigms are implemented with NavigScene: navigation-guided reasoning, navigation-guided preference optimization, and navigation-guided VLA models, enhancing the reasoning and generalization capabilities of autonomous driving systems [6]
- Comprehensive experiments demonstrate significant performance improvements in perception, prediction, and planning tasks when global navigation knowledge is integrated into autonomous driving systems [6]

Group 2: AutoVLA
- AutoVLA is proposed as an end-to-end autonomous driving framework that integrates physical action tokens with a pre-trained VLM backbone, enabling direct policy learning and semantic reasoning from raw visual observations and language instructions [12]
- A reinforcement learning-based post-training method using Group Relative Policy Optimization (GRPO) is introduced to achieve adaptive reasoning and further enhance model performance in end-to-end driving tasks (a hedged sketch of the group-relative advantage computation follows this summary) [12]
- AutoVLA achieves competitive performance across multiple autonomous driving benchmarks, including open-loop and closed-loop tests [12]

Group 3: ReCogDrive
- ReCogDrive is presented as an end-to-end autonomous driving system that integrates a VLM with a diffusion planner, employing a three-stage training paradigm to address performance drops in rare and long-tail scenarios [13][16]
- The first stage fine-tunes the VLM on a large-scale driving Q&A dataset to mitigate the domain gap between general content and real-world driving scenarios [16]
- The method achieves a state-of-the-art PDMS score of 89.6 on the NAVSIM benchmark, highlighting its effectiveness and feasibility [16]

Group 4: Impromptu VLA
- Impromptu VLA introduces a large-scale, richly annotated dataset aimed at addressing the limitations of existing benchmarks for autonomous driving VLA models [22]
- The dataset is designed to improve the performance of VLA models in unstructured extreme scenarios, demonstrating significant improvements on established benchmarks [22]
- Experiments show that training with the Impromptu VLA dataset yields notable gains in closed-loop NeuroNCAP scores and collision rates [22]

Group 5: DriveMoE
- DriveMoE is a novel end-to-end autonomous driving framework that incorporates a mixture-of-experts (MoE) architecture to handle multi-view sensor data and complex driving scenarios effectively [28]
- The framework features a scene-specific visual MoE and a skill-specific action MoE, addressing the challenges of multi-view redundancy and skill specialization (see the routing sketch below) [28]
- DriveMoE achieves state-of-the-art performance in closed-loop evaluations on the Bench2Drive benchmark, demonstrating the effectiveness of combining visual and action MoE in autonomous driving tasks [28]
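The GRPO post-training named for AutoVLA above is worth unpacking. Its core trick is to replace a learned value critic with group-relative advantages: several trajectories are sampled for the same scene, and each reward is normalized against the group's mean and standard deviation before a PPO-style clipped update. Below is a minimal sketch of that computation; the reward values, group size, and function names are illustrative assumptions, not AutoVLA's actual code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """GRPO-style advantages: normalize each rollout's reward against the
    mean/std of its own group, so no learned value critic is needed.

    rewards: shape (G,) -- rewards for G rollouts of the same driving scene.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_policy_loss(log_probs: torch.Tensor,
                     old_log_probs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate objective using group-relative advantages.

    log_probs / old_log_probs: shape (G,) -- summed token log-probs of each
    rollout under the current and behavior policies.
    """
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Maximize the surrogate objective, hence minimize its negation.
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Hypothetical usage: 8 trajectories sampled for one scene and scored by
# some driving metric (e.g., a PDMS-like score); the numbers are made up.
rewards = torch.tensor([0.81, 0.92, 0.55, 0.88, 0.73, 0.95, 0.60, 0.84])
adv = group_relative_advantages(rewards)
```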
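DriveMoE's scene-specific visual experts and skill-specific action experts both build on the same top-k gated mixture-of-experts primitive: a router scores each input's affinity for every expert, only the top-k experts run, and their outputs are combined with renormalized gate weights. The sketch below shows a generic version of that primitive under assumed dimensions and module names; it is not DriveMoE's actual routing, which specializes the gate to scenes and driving skills.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic top-k gated mixture-of-experts layer (illustrative sketch)."""

    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim). Score every expert, keep the top-k per sample.
        gates = F.softmax(self.router(x), dim=-1)           # (B, E)
        topk_gates, topk_idx = gates.topk(self.k, dim=-1)   # (B, k)
        topk_gates = topk_gates / topk_gates.sum(-1, keepdim=True)

        out = torch.zeros_like(x)
        for slot in range(self.k):                     # each routing slot
            for e, expert in enumerate(self.experts):  # dispatch per expert
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Hypothetical usage on per-view or per-skill features:
layer = TopKMoE(dim=256)
features = torch.randn(16, 256)
mixed = layer(features)  # (16, 256)
```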
Bringing end-to-end VLA to production in autonomous driving: how should the algorithms be designed?
自动驾驶之心· 2025-06-22 14:09
Core Insights
- The article discusses rapid advances in end-to-end autonomous driving, focusing on Vision-Language-Action (VLA) models and their applications in industry [2][3]

Group 1: VLA Model Developments
- AutoVLA, a new VLA model that integrates reasoning and action generation for end-to-end autonomous driving, shows promising results in semantic reasoning and trajectory planning [3][4]
- ReCogDrive, another VLA model, addresses performance issues in rare and long-tail scenarios with a three-stage training framework that combines a vision-language model with a diffusion planner [7][9]
- Impromptu VLA introduces a dataset aimed at improving VLA models' performance in unstructured extreme conditions, demonstrating significant gains on established benchmarks [14][24]

Group 2: Experimental Results
- AutoVLA achieved competitive performance across various scenarios, with its best-of-N method reaching a PDMS score of 92.12, indicating its effectiveness in planning and execution (a sketch of best-of-N selection follows this summary) [5]
- ReCogDrive set a new state-of-the-art PDMS score of 89.6 on the NAVSIM benchmark, showcasing the robustness and safety of its driving trajectories [9][10]
- The OpenDriveVLA model demonstrated superior results in open-loop trajectory planning and driving-related question answering, outperforming previous methods on the nuScenes dataset [28][32]

Group 3: Industry Trends
- Major automotive manufacturers such as Li Auto, Xiaomi, and XPeng are investing heavily in VLA model research and development, signaling a competitive landscape in autonomous driving technology [2][3]
- Integrating large language models (LLMs) into VLA frameworks is becoming a focal point for enhancing decision-making in autonomous vehicles, as seen in models like ORION and VLM-RL [33][39]
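On the best-of-N result cited above for AutoVLA: the idea is simply to draw N candidate trajectories from the stochastic planner and keep the one that scores highest under a driving-quality metric. A minimal sketch follows, with the sampler and scorer as hypothetical stand-ins; a real pipeline would decode candidates from the model and score them with a PDMS-style evaluator.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A trajectory here is just a sequence of (x, y) waypoints; in practice it
# would be whatever structure the planner decodes.
Trajectory = Sequence[Tuple[float, float]]

def best_of_n(sample: Callable[[], Trajectory],
              score: Callable[[Trajectory], float],
              n: int = 16) -> Tuple[Trajectory, float]:
    """Sample n candidate trajectories and return the highest-scoring one.

    sample: draws one trajectory from the driving model (stochastic).
    score:  driving-quality metric, e.g. a PDMS-style score.
    """
    candidates: List[Trajectory] = [sample() for _ in range(n)]
    scored = [(score(t), t) for t in candidates]
    best_score, best_traj = max(scored, key=lambda pair: pair[0])
    return best_traj, best_score

# Hypothetical usage with toy stand-in functions:
def fake_sample() -> Trajectory:
    return [(float(i), random.uniform(-0.5, 0.5)) for i in range(8)]

def fake_score(traj: Trajectory) -> float:
    # Toy metric: prefer trajectories that stay near the lane center (y = 0).
    return 100.0 - sum(abs(y) for _, y in traj)

traj, s = best_of_n(fake_sample, fake_score, n=16)
```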