Vision-Language-Action (VLA)
Honda's Acura previews the next-generation RDX, the first Acura model with a dual-motor hybrid system; Li Auto restructures its base model business: Zhan Kun takes over as VLA R&D is consolidated | Auto & Transportation Daily
创业邦· 2026-01-15 10:15
Group 1
- Major automotive manufacturers, including Hyundai and Porsche, will voluntarily recall over 344,000 vehicles in South Korea due to various parts defects [2]
- Li Auto has appointed Zhan Kun to lead the base model business, focusing on integrating the VLA (Vision-Language-Action) model for autonomous driving and smart cockpit technologies [2]
- CATL and Changan Automobile signed a five-year strategic cooperation memorandum to deepen collaboration in technology application, market expansion, and brand promotion [2]

Group 2
- Acura has announced the development of the next-generation RDX compact SUV, which will be the first Acura model equipped with a dual-motor hybrid system [2]
World's first survey of VLA for autonomous driving released: a comprehensive breakdown of VLA driving models (McGill, Tsinghua, et al.)
自动驾驶之心· 2025-07-02 13:54
Core Insights
- The article surveys the integration of vision, language, and action in autonomous driving through the Vision-Language-Action (VLA) model, highlighting its potential to make autonomous vehicles both more capable and more interpretable [1][3]

Development Paradigms
- The evolution of autonomous driving technology is categorized into three core paradigms: end-to-end models, Vision-Language Models (VLMs), and VLA models, with VLA the most advanced [3][6]
- End-to-end models map sensor inputs directly to driving actions but lack interpretability [7]
- VLMs add language understanding, which improves interpretability, but they struggle to turn that understanding into executable actions [7]
- VLA models unify perception, reasoning, and action execution, enabling vehicles to follow complex instructions and explain their decisions [7][8]

VLA4AD Architecture
- A typical VLA4AD model consists of three parts: input, processing, and output, integrating environmental perception, instruction understanding, and vehicle control [6]
- The architecture comprises modules for visual data processing, language input handling, and action decoding, forming a continuous pipeline from perception to action [10][11]; a toy code sketch of this layout appears after this summary

Core Architectural Modules
- Visual input has evolved from a single camera to multi-camera setups processed by increasingly capable encoders [11]
- Language inputs have diversified, ranging from direct navigation commands to complex dialogue-based reasoning [11]
- The action decoder generates control outputs, either low-level actions or a planned trajectory [17][18]; see the waypoint-to-steering sketch below

Development Stages of VLA Models
- The development of VLA models is divided into four stages, tracing the evolving role of language from passive interpreter to active planner and decision-maker [14][15]
- Stage one: language models serve as passive explainers, improving interpretability without any involvement in control [15]
- Stage two: language becomes an active component in modular architectures, directly influencing planning decisions [18]
- Stage three: unified end-to-end models map inputs to control signals in a single forward pass [19]
- Stage four: reasoning-augmented models integrate perception, language understanding, and action generation into one cohesive system [22][24]

Datasets and Benchmarks
- High-quality, diverse datasets are crucial for advancing VLA4AD research; notable examples include BDD100K, nuScenes, and Bench2Drive, each offering distinct value for model training and evaluation [26][28]
- BDD100K provides a large collection of real-world driving videos, while nuScenes supplies full multi-sensor data for evaluation [29]; a short nuscenes-devkit loading sketch follows below

Challenges and Future Directions
- The article outlines six major challenges facing VLA4AD, including robustness, real-time performance, and data bottlenecks, all of which must be solved before large-scale deployment [31][32]
- Future directions include foundational driving models, neuro-symbolic safety kernels, and fleet-scale continual learning systems [36][37]
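To make the three-part VLA4AD layout concrete, here is a minimal PyTorch sketch: a shared visual encoder pools multi-camera features, a small language encoder embeds the instruction, and an action decoder emits ego-frame waypoints. Every module name and dimension is an illustrative assumption, not the survey's reference implementation; real systems use pretrained vision backbones and LLMs in place of these toy encoders.

```python
# Hypothetical sketch of the input -> processing -> output VLA4AD layout.
import torch
import torch.nn as nn


class ToyVLA4AD(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, n_waypoints=6):
        super().__init__()
        # Visual module: one shared CNN applied to each camera view.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        # Language module: embeds a tokenized instruction such as
        # "turn left at the next intersection".
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.GRU(d_model, d_model, batch_first=True)
        # Action decoder: maps the fused representation to a short
        # trajectory of (x, y) waypoints in the ego frame.
        self.action_decoder = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, n_waypoints * 2),
        )
        self.n_waypoints = n_waypoints

    def forward(self, images, instruction_tokens):
        # images: (batch, cameras, 3, H, W) -- multi-camera input.
        b, c = images.shape[:2]
        views = self.visual_encoder(images.flatten(0, 1))   # (b*c, d)
        visual = views.view(b, c, -1).mean(dim=1)           # pool cameras
        _, text = self.text_encoder(self.text_embed(instruction_tokens))
        fused = torch.cat([visual, text.squeeze(0)], dim=-1)
        return self.action_decoder(fused).view(b, self.n_waypoints, 2)


model = ToyVLA4AD()
imgs = torch.randn(2, 6, 3, 128, 128)       # 2 samples, 6 cameras
tokens = torch.randint(0, 10000, (2, 12))   # tokenized instructions
print(model(imgs, tokens).shape)            # torch.Size([2, 6, 2])
```

Averaging the per-camera features is the simplest possible fusion; published models typically use cross-attention over camera tokens instead, but the data flow from perception and instruction to waypoints is the same.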
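The action-decoder bullet notes that outputs may be either a planned trajectory or low-level actions. As one hedged illustration of bridging the two, the sketch below converts decoded waypoints into a steering angle with a textbook pure-pursuit rule; the function name, wheelbase, and lookahead are assumptions, and production stacks use far more careful controllers.

```python
# Hypothetical helper: turn planned (x, y) ego-frame waypoints into a
# low-level steering command via a simple pure-pursuit rule.
import math


def pure_pursuit_steer(waypoints, wheelbase=2.8, lookahead=5.0):
    """Pick the first waypoint at least `lookahead` metres away
    (ego frame: x forward, y left) and return a steering angle in rad."""
    target = waypoints[-1]
    for x, y in waypoints:
        if math.hypot(x, y) >= lookahead:
            target = (x, y)
            break
    x, y = target
    ld = math.hypot(x, y)
    # Pure pursuit: curvature kappa = 2*y / ld^2, steer = atan(wheelbase * kappa).
    return math.atan2(2.0 * wheelbase * y, ld * ld)


traj = [(1.0, 0.05), (3.0, 0.3), (6.0, 1.1), (9.0, 2.4)]
print(f"steering angle: {pure_pursuit_steer(traj):.3f} rad")
```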
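Since nuScenes is called out as a key evaluation dataset, here is a brief sketch of browsing it with the official nuscenes-devkit (pip install nuscenes-devkit). The dataroot path is an assumption and must point at a local copy of the dataset; the v1.0-mini split is enough for experimentation.

```python
# Browse nuScenes records with the official devkit.
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes',
                verbose=True)

scene = nusc.scene[0]                       # first recorded scene
sample = nusc.get('sample', scene['first_sample_token'])

# Each sample bundles synchronized sensor readings; fetch the front camera.
cam_front = nusc.get('sample_data', sample['data']['CAM_FRONT'])
print(scene['description'])
print(cam_front['filename'])                # relative path to the image
```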