Vision-Language-Action Models
A Year of Working on VLA for Autonomous Driving
自动驾驶之心· 2025-11-19 00:03
At the end of last year, my leader talked with me about this year's work direction: moving from perception to VLA. From the start of the year until now I have spent roughly a year on it, going from survey to pre-research to production deployment, so this phase has more or less wrapped up.

Over this year we have seen an explosion of papers from academia and mass-production deployments from industry, the world-model vs. VLA route debate a few months ago, and XPeng officially announcing VLA 2.0 two weeks ago. I have been thinking about how to turn this year's experience into a summary, and over the past two weeks I organized my thoughts, which led to this article for the readers of 自动驾驶之心. I will cover the following aspects, and I hope it offers some reference for friends working in related directions.

So what exactly is VLA? Lately, VLA models have been red-hot in autonomous driving circles, and most mainstream algorithm suppliers and OEMs are claiming how impressive their VLA models are (especially at the Li Auto and XPeng launch events). So I want to take advantage of this momentum to sort out: what exactly is VLA, and can it become the next-generation autonomous driving paradigm after end-to-end and VLM? First, to be clear, the concept was not born in the autonomous driving community; it comes from robot control: as early as 2023, Google DeepMind proposed the RT- ...
Tsinghua & Xiaomi Team Publishes a Survey of VLA Models
理想TOP2· 2025-07-04 02:54
Core Viewpoint
- The article discusses the evolution of autonomous driving technology, highlighting the transition from basic perception-control systems to advanced cognitive intelligence models, with a focus on the latest Vision-Language-Action (VLA) paradigm [1].

Group 1: Evolution of Autonomous Driving Technology
- Autonomous driving technology is evolving from simple perception-control toward cognitive intelligence, categorized into three main paradigms: End-to-End AD, Vision-Language Models for AD, and Vision-Language-Action Models for AD [3][4].
- The VLA model integrates visual perception, language understanding, and action execution into a unified framework, enabling vehicles to follow natural language commands directly [3][4].

Group 2: VLA Model Architecture
- A typical VLA model consists of three parts: input, processing, and output, aiming to seamlessly integrate environmental perception, high-level instruction understanding, and vehicle control [4] (a minimal architecture sketch follows after this summary).
- Multi-modal inputs include visual and sensor data, with advances from a single front-facing camera to multi-camera setups and additional sensors such as LiDAR, RADAR, IMU, and GPS for enhanced spatial awareness [5][6].
- Language inputs have evolved to include direct commands, environmental queries, task-level instructions, and conversational reasoning, supporting multi-turn dialogue and complex reasoning [6][9][10].

Group 3: Core Processing Modules
- The core processing modules are a visual encoder that transforms raw images into usable representations, a language processor built on pre-trained models for natural language commands, and an action decoder that generates control outputs [11][12].
- The action decoder can be implemented with methods such as autoregressive tokenization, diffusion models, or hierarchical controllers for generating control signals [12][13][14] (see the decoder sketch after this summary).

Group 4: Development Stages of VLA Models
- The development of VLA models is divided into four stages:
  1. Language as an Explainer: language is initially used to improve system interpretability, without direct involvement in control [19].
  2. Modular VLA: language evolves into an active planning component that directly informs planning decisions [20][21].
  3. End-to-End VLA: unified networks map sensor inputs directly to driving actions, improving responsiveness but struggling with long-term planning [22].
  4. Reasoning-Augmented VLA: incorporates reasoning and memory, enabling long-horizon prediction and dynamic human-machine interaction [23].

Group 5: Datasets and Benchmarks
- High-quality, diverse datasets are crucial for advancing VLA research, covering large-scale real-world data, safety-critical test scenarios, and fine-grained reasoning data [25].

Group 6: Challenges and Future Outlook
- Key challenges include robustness and reliability, real-time performance, data bottlenecks, multi-modal alignment, multi-agent social complexity, and generalization across different traffic environments [26][27][28][29][30][31][32].
- Future directions include building a foundational driving model, integrating a neural-symbolic safety core, fleet-level continuous learning, standardizing a traffic language, and developing cross-modal social intelligence [33][34][35][36][37].
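Groups 2 and 3 describe a three-part layout: multi-modal input, core processing modules (visual encoder, language processor, action decoder), and action output. The following is a minimal PyTorch sketch of that layout; the module sizes, the mean-pooled camera fusion, the GRU standing in for a pre-trained language model, and the waypoint parameterization are all illustrative assumptions, not the survey's reference design.

```python
# Minimal sketch of the input -> processing -> output layout described above.
# All module names, sizes, and the action parameterization are illustrative.
import torch
import torch.nn as nn


class VisualEncoder(nn.Module):
    """Tiny CNN standing in for the visual encoder over multi-camera input."""

    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, out_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, cameras, 3, H, W) -> one pooled feature per sample
        b, n, c, h, w = images.shape
        feats = self.backbone(images.view(b * n, c, h, w)).flatten(1)
        return self.proj(feats).view(b, n, -1).mean(dim=1)  # fuse cameras by mean


class LanguageProcessor(nn.Module):
    """Stand-in for a pre-trained language model encoding the instruction."""

    def __init__(self, vocab_size: int = 10_000, out_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, out_dim)
        self.rnn = nn.GRU(out_dim, out_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> last hidden state as instruction summary
        _, hidden = self.rnn(self.embed(token_ids))
        return hidden[-1]


class ActionDecoder(nn.Module):
    """Regresses a short future trajectory from the fused representation."""

    def __init__(self, in_dim: int = 512, horizon: int = 6):
        super().__init__()
        self.horizon = horizon
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2),  # one (x, y) waypoint per future step
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return self.mlp(fused).view(-1, self.horizon, 2)


class VLAModel(nn.Module):
    """Vision + language in, driving action out, as one unified network."""

    def __init__(self):
        super().__init__()
        self.vision = VisualEncoder()
        self.language = LanguageProcessor()
        self.decoder = ActionDecoder()

    def forward(self, images: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.vision(images), self.language(token_ids)], dim=-1)
        return self.decoder(fused)


if __name__ == "__main__":
    model = VLAModel()
    images = torch.randn(2, 6, 3, 128, 128)      # e.g. 6 surround cameras
    command = torch.randint(0, 10_000, (2, 12))  # tokenized instruction
    print(model(images, command).shape)          # torch.Size([2, 6, 2])
```

In a real system the visual encoder and language processor would be initialized from large pre-trained backbones, additional sensor streams (LiDAR, RADAR, IMU, GPS) would be encoded and fused alongside the cameras, and the fused representation would be richer than a simple concatenation.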
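Group 3 names several ways to realize the action decoder. As one concrete illustration, here is a hedged sketch of the autoregressive-tokenization option, in which continuous controls are quantized into discrete bins and emitted token by token. The bin count, horizon, greedy decoding, and GRU cell are assumptions made for brevity; RT-style models typically use a transformer decoder over a shared text/action vocabulary, and diffusion-based or hierarchical decoders would replace this module entirely.

```python
# Sketch of the autoregressive-tokenization decoding option mentioned above.
# Bin counts, horizon, and the GRU decoder are illustrative assumptions.
import torch
import torch.nn as nn

NUM_BINS = 256  # each continuous control value is quantized into 256 bins
HORIZON = 6     # number of action tokens to emit


def tokenize(value: torch.Tensor, low: float, high: float) -> torch.Tensor:
    """Map a continuous control value in [low, high] to a discrete bin index
    (used to build training targets from logged driving data)."""
    scaled = (value.clamp(low, high) - low) / (high - low)
    return (scaled * (NUM_BINS - 1)).round().long()


def detokenize(token: torch.Tensor, low: float, high: float) -> torch.Tensor:
    """Map a bin index back to the corresponding continuous value."""
    return low + (token.float() / (NUM_BINS - 1)) * (high - low)


class AutoregressiveActionDecoder(nn.Module):
    """Emits action tokens one step at a time, conditioned on fused features."""

    def __init__(self, ctx_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(NUM_BINS + 1, hidden)  # +1 for a BOS token
        self.init_h = nn.Linear(ctx_dim, hidden)          # context -> initial state
        self.rnn = nn.GRUCell(hidden, hidden)
        self.head = nn.Linear(hidden, NUM_BINS)

    @torch.no_grad()
    def generate(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, ctx_dim) fused vision+language features
        batch = context.shape[0]
        h = torch.tanh(self.init_h(context))
        token = torch.full((batch,), NUM_BINS, dtype=torch.long)  # BOS index
        tokens = []
        for _ in range(HORIZON):
            h = self.rnn(self.embed(token), h)
            token = self.head(h).argmax(dim=-1)  # greedy decoding for the sketch
            tokens.append(token)
        return torch.stack(tokens, dim=1)         # (batch, HORIZON)


if __name__ == "__main__":
    decoder = AutoregressiveActionDecoder()
    fused = torch.randn(2, 512)                    # e.g. VLAModel's fused features
    action_tokens = decoder.generate(fused)
    steering = detokenize(action_tokens[:, 0], -1.0, 1.0)  # first token as steering
    print(action_tokens.shape, steering)
```

The appeal of this formulation is that action generation becomes ordinary next-token prediction, so the same training and decoding machinery used for language can be reused for control; the trade-off is quantization error and sequential (hence slower) decoding, which is one reason the survey also lists diffusion and hierarchical controllers as alternatives.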