Vision-Language-Action Models
A Year of Working on Autonomous Driving VLA (做自动驾驶VLA的这一年)
自动驾驶之心· 2025-11-19 00:03
Core Viewpoint
- The article discusses the emergence and significance of Vision-Language-Action (VLA) models in the autonomous driving industry, highlighting their potential to unify perception, reasoning, and action in a single framework and thereby address the limitations of previous models [3][10][11].

Summary by Sections

What is VLA?
- VLA models are multimodal systems that integrate vision, language, and action, allowing for a more comprehensive understanding of, and interaction with, the environment [4][7].
- The concept originated in robotics and was popularized in the autonomous driving sector for its potential to enhance interpretability and decision-making [3][9].

Why Did VLA Emerge?
- The evolution of autonomous driving can be categorized into several phases: modular systems, end-to-end models, and Vision-Language Models (VLM), each with its own limitations [9][10].
- VLA models emerged as a response to the shortcomings of these earlier approaches, providing a unified framework that improves both understanding and action execution [10][11].

VLA Architecture Breakdown
- The VLA model architecture consists of three main components: input (multimodal data), processing (integration of inputs), and output (action generation); a minimal code sketch of this split follows this summary [12][16].
- Inputs include visual data from cameras, sensor data from LiDAR and RADAR, and language inputs for navigation and interaction [13][14].
- The processing layer fuses these inputs to form driving decisions, while the output layer produces control commands and trajectory plans [18][20].

Development History of VLA
- The article outlines the historical context of VLA development, emphasizing its role in advancing autonomous driving technology by addressing the need for better interpretability and action alignment [21][22].

Key Innovations in VLA Models
- Recent models such as LINGO-1 and LINGO-2 focus on integrating natural language understanding with driving actions, enabling more interactive and responsive driving systems [22][35].
- Innovations include the ability to explain driving decisions in natural language and to follow complex verbal instructions, enhancing user trust and system transparency [23][36].

Future Directions
- The article questions whether language will remain necessary in future VLA models, suggesting that as the technology matures, the role of language may evolve or diminish [70].
- It emphasizes the importance of continuous learning and innovation in the field to keep pace with technological advances and user expectations [70].
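To make the input-processing-output split above concrete, here is a minimal, illustrative PyTorch sketch of a VLA-style skeleton: a visual encoder and a language encoder feed a fused driving state, from which a trajectory head and a control head are decoded. All module choices, dimensions, and the simple concatenation-based fusion are assumptions made for clarity; they do not reproduce the architecture of LINGO-1, LINGO-2, or any other model discussed in the article.

```python
# Illustrative sketch only: a minimal VLA-style skeleton showing the
# input -> processing -> output stages described above. Module names, sizes,
# and the fusion scheme are assumptions, not any specific published model.
import torch
import torch.nn as nn


class MiniVLA(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, n_waypoints=10):
        super().__init__()
        # Input stage: encode camera frames and a tokenized language instruction.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d_model),
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.GRU(d_model, d_model, batch_first=True)
        # Processing stage: fuse the two modalities into a joint driving state.
        self.fusion = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU())
        # Output stage: a short trajectory (x, y waypoints) plus low-level
        # control (steer, throttle, brake).
        self.trajectory_head = nn.Linear(d_model, n_waypoints * 2)
        self.control_head = nn.Linear(d_model, 3)

    def forward(self, image, instruction_tokens):
        vis = self.visual_encoder(image)                      # (B, d_model)
        txt, _ = self.text_encoder(self.text_embed(instruction_tokens))
        txt = txt[:, -1]                                      # last hidden state
        state = self.fusion(torch.cat([vis, txt], dim=-1))
        return self.trajectory_head(state), self.control_head(state)


if __name__ == "__main__":
    model = MiniVLA()
    img = torch.randn(1, 3, 224, 224)
    cmd = torch.randint(0, 32000, (1, 12))  # e.g. "turn left at the next light"
    traj, ctrl = model(img, cmd)
    print(traj.shape, ctrl.shape)  # torch.Size([1, 20]) torch.Size([1, 3])
```

In a production system the two encoders would be large pre-trained backbones and the fusion would typically use cross-attention rather than concatenation, but the three-stage structure is the same.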
Tsinghua & Xiaomi Team Releases a Survey of VLA Models (清华&小米团队发布VLA模型综述)
理想TOP2· 2025-07-04 02:54
Core Viewpoint
- The article discusses the evolution of autonomous driving technology, highlighting the transition from basic perception-control systems to advanced cognitive intelligence models, with a focus on the latest Vision-Language-Action (VLA) paradigm [1].

Group 1: Evolution of Autonomous Driving Technology
- Autonomous driving technology is evolving from simple perception-control to advanced cognitive intelligence, categorized into three main paradigms: End-to-End AD, Vision-Language Models for AD, and Vision-Language-Action Models for AD [3][4].
- The VLA model integrates visual perception, language understanding, and action execution into a unified framework, enabling vehicles to follow natural language commands directly [3][4].

Group 2: VLA Model Architecture
- A typical VLA model consists of three parts: input, processing, and output, aiming to seamlessly integrate environmental perception, high-level instruction understanding, and vehicle control [4].
- Multi-modal inputs include visual and sensor data, with progress from a single front-facing camera to multi-camera setups and sensors such as LiDAR, RADAR, IMU, and GPS for enhanced spatial awareness [5][6].
- Language inputs have evolved to include direct commands, environmental queries, task-level instructions, and conversational reasoning, supporting multi-turn dialogue and complex reasoning [6][9][10].

Group 3: Core Processing Modules
- The core processing modules comprise a visual encoder that transforms raw images into usable representations, a language processor built on pre-trained models for interpreting natural language commands, and an action decoder that generates control outputs [11][12].
- The action decoder can be realized with several methods, including autoregressive tokenization, diffusion models, and hierarchical controllers for generating control signals; a sketch of the autoregressive-tokenization variant follows this summary [12][13][14].

Group 4: Development Stages of VLA Models
- The development of VLA models is divided into four stages:
  1. Language as an Explainer: language is initially used to improve system interpretability, without direct involvement in control [19].
  2. Modular VLA: language evolves into an active planning component that directly informs planning decisions [20][21].
  3. End-to-End VLA: unified networks map sensor inputs directly to driving actions, improving responsiveness but facing challenges in long-term planning [22].
  4. Reasoning-Augmented VLA: incorporates reasoning and memory, allowing long-horizon prediction and dynamic human-machine interaction [23].

Group 5: Datasets and Benchmarks
- High-quality, diverse datasets are crucial for advancing VLA research, covering large-scale real-world data, safety-critical testing scenarios, and fine-grained reasoning data [25].

Group 6: Challenges and Future Outlook
- Key challenges include robustness and reliability, real-time performance, data bottlenecks, multi-modal alignment, multi-agent social complexity, and generalization across different traffic environments [26][27][28][29][30][31][32].
- Future directions involve building a foundational driving model, integrating neural-symbolic safety cores, fleet-level continuous learning, standardizing a traffic language, and developing cross-modal social intelligence [33][34][35][36][37].
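As a concrete illustration of the autoregressive-tokenization action decoder mentioned in Group 3, the sketch below discretizes continuous control values into a small token vocabulary and decodes actions one token at a time from a fused vision-language state. The bin count, value range, GRU-based decoder, and greedy decoding are illustrative assumptions, not the design of any specific model covered by the survey.

```python
# Minimal sketch of the "autoregressive tokenization" action-decoding idea:
# continuous controls are binned into discrete tokens so a sequence decoder
# can emit actions step by step. All numbers here are illustrative.
import torch
import torch.nn as nn


class ActionTokenizer:
    """Uniformly bins a continuous value in [lo, hi] into `bins` discrete tokens."""

    def __init__(self, lo=-1.0, hi=1.0, bins=256):
        self.lo, self.hi, self.bins = lo, hi, bins

    def encode(self, value: torch.Tensor) -> torch.Tensor:
        scaled = (value.clamp(self.lo, self.hi) - self.lo) / (self.hi - self.lo)
        return (scaled * (self.bins - 1)).round().long()

    def decode(self, token: torch.Tensor) -> torch.Tensor:
        return self.lo + token.float() / (self.bins - 1) * (self.hi - self.lo)


class AutoregressiveActionDecoder(nn.Module):
    """Emits a fixed-length sequence of action tokens conditioned on a fused state."""

    def __init__(self, bins=256, d_model=256, horizon=4):
        super().__init__()
        self.horizon = horizon
        self.token_embed = nn.Embedding(bins + 1, d_model)  # +1 for a BOS token
        self.bos = bins
        self.rnn = nn.GRUCell(d_model, d_model)
        self.head = nn.Linear(d_model, bins)

    @torch.no_grad()
    def generate(self, state: torch.Tensor) -> torch.Tensor:
        """state: (B, d_model) fused vision-language feature -> (B, horizon) tokens."""
        b = state.shape[0]
        h = state
        token = torch.full((b,), self.bos, dtype=torch.long, device=state.device)
        out = []
        for _ in range(self.horizon):
            h = self.rnn(self.token_embed(token), h)
            token = self.head(h).argmax(dim=-1)  # greedy decoding for the sketch
            out.append(token)
        return torch.stack(out, dim=1)


if __name__ == "__main__":
    tok = ActionTokenizer()
    decoder = AutoregressiveActionDecoder()
    fused_state = torch.randn(2, 256)       # stand-in for the fused VLA feature
    tokens = decoder.generate(fused_state)  # (2, 4) discrete action tokens
    print(tok.decode(tokens))               # back to normalized control values
```

The diffusion and hierarchical-controller variants named in the survey would replace the token decoder with, respectively, an iterative denoiser over continuous trajectories or a high-level planner feeding a low-level controller; the tokenizer interface above is specific to the autoregressive approach.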