Vision-Language-Action Models
A Year of Working on Autonomous Driving VLA (做自动驾驶VLA的这一年)
自动驾驶之心· 2025-11-19 00:03
Core Viewpoint
- The article discusses the emergence and significance of Vision-Language-Action (VLA) models in the autonomous driving industry, highlighting their potential to unify perception, reasoning, and action in a single framework and thereby address the limitations of previous models [3][10][11].

Summary by Sections

What is VLA?
- VLA models are multimodal systems that integrate vision, language, and action, allowing for a more comprehensive understanding of, and interaction with, the environment [4][7].
- The concept originated in robotics and was popularized in the autonomous driving sector for its potential to enhance interpretability and decision-making [3][9].

Why Did VLA Emerge?
- The evolution of autonomous driving can be categorized into several phases: modular systems, end-to-end models, and Vision-Language Models (VLM), each with its own limitations [9][10].
- VLA models emerged as a response to the shortcomings of these earlier approaches, providing a unified framework that improves both understanding and action execution [10][11].

VLA Architecture Breakdown
- The VLA model architecture consists of three main components: input (multimodal data), processing (integration of inputs), and output (action generation); a minimal code sketch of this split follows this summary [12][16].
- Inputs include visual data from cameras, sensor data from LiDAR and RADAR, and language inputs for navigation and interaction [13][14].
- The processing layer fuses these inputs to form driving decisions, while the output layer produces control commands and trajectory plans [18][20].

Development History of VLA
- The article outlines the historical context of VLA development, emphasizing its role in advancing autonomous driving technology by addressing the need for better interpretability and action alignment [21][22].

Key Innovations in VLA Models
- Recent models such as LINGO-1 and LINGO-2 focus on integrating natural language understanding with driving actions, enabling more interactive and responsive driving systems [22][35].
- Innovations include the ability to explain driving decisions in natural language and to follow complex verbal instructions, enhancing user trust and system transparency [23][36].

Future Directions
- The article questions whether language will remain necessary in future VLA models, suggesting that as the technology matures, the role of language may evolve or diminish [70].
- It emphasizes the importance of continuous learning and innovation in the field to keep pace with technological advances and user expectations [70].
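To make the input-processing-output split above concrete, here is a minimal, illustrative PyTorch sketch of a VLA-style skeleton: a visual encoder and a language encoder feed a fused driving state, from which a trajectory head and a control head are decoded. All module choices, dimensions, and the simple concatenation-based fusion are assumptions made for clarity; they do not reproduce the architecture of LINGO-1, LINGO-2, or any other model discussed in the article.

```python
# Illustrative sketch only: a minimal VLA-style skeleton showing the
# input -> processing -> output stages described above. Module names, sizes,
# and the fusion scheme are assumptions, not any specific published model.
import torch
import torch.nn as nn


class MiniVLA(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, n_waypoints=10):
        super().__init__()
        # Input stage: encode camera frames and a tokenized language instruction.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d_model),
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.GRU(d_model, d_model, batch_first=True)
        # Processing stage: fuse the two modalities into a joint driving state.
        self.fusion = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU())
        # Output stage: a short trajectory (x, y waypoints) plus low-level
        # control (steer, throttle, brake).
        self.trajectory_head = nn.Linear(d_model, n_waypoints * 2)
        self.control_head = nn.Linear(d_model, 3)

    def forward(self, image, instruction_tokens):
        vis = self.visual_encoder(image)                      # (B, d_model)
        txt, _ = self.text_encoder(self.text_embed(instruction_tokens))
        txt = txt[:, -1]                                      # last hidden state
        state = self.fusion(torch.cat([vis, txt], dim=-1))
        return self.trajectory_head(state), self.control_head(state)


if __name__ == "__main__":
    model = MiniVLA()
    img = torch.randn(1, 3, 224, 224)
    cmd = torch.randint(0, 32000, (1, 12))  # e.g. "turn left at the next light"
    traj, ctrl = model(img, cmd)
    print(traj.shape, ctrl.shape)  # torch.Size([1, 20]) torch.Size([1, 3])
```

In a production system the two encoders would be large pre-trained backbones and the fusion would typically use cross-attention rather than concatenation, but the three-stage structure is the same.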
Tsinghua & Xiaomi Team Releases a Survey of VLA Models (清华&小米团队发布VLA模型综述)
理想TOP2· 2025-07-04 02:54
Core Viewpoint
- The article discusses the evolution of autonomous driving technology, highlighting the transition from basic perception-control systems to advanced cognitive intelligence models, with a focus on the latest Vision-Language-Action (VLA) paradigm [1].

Group 1: Evolution of Autonomous Driving Technology
- Autonomous driving technology is evolving from simple perception-control to advanced cognitive intelligence, categorized into three main paradigms: End-to-End AD, Vision-Language Models for AD, and Vision-Language-Action Models for AD [3][4].
- The VLA model integrates visual perception, language understanding, and action execution into a unified framework, enabling vehicles to follow natural language commands directly [3][4].

Group 2: VLA Model Architecture
- A typical VLA model consists of three parts: input, processing, and output, aiming to seamlessly integrate environmental perception, high-level instruction understanding, and vehicle control [4].
- Multi-modal inputs include visual and sensor data, with progress from a single front-facing camera to multi-camera setups and sensors such as LiDAR, RADAR, IMU, and GPS for enhanced spatial awareness [5][6].
- Language inputs have evolved to include direct commands, environmental queries, task-level instructions, and conversational reasoning, supporting multi-turn dialogue and complex reasoning [6][9][10].

Group 3: Core Processing Modules
- The core processing modules comprise a visual encoder that transforms raw images into usable representations, a language processor built on pre-trained models for interpreting natural language commands, and an action decoder that generates control outputs [11][12].
- The action decoder can be realized with several methods, including autoregressive tokenization, diffusion models, and hierarchical controllers for generating control signals; a sketch of the autoregressive-tokenization variant follows this summary [12][13][14].

Group 4: Development Stages of VLA Models
- The development of VLA models is divided into four stages:
  1. Language as an Explainer: language is initially used to improve system interpretability, without direct involvement in control [19].
  2. Modular VLA: language evolves into an active planning component that directly informs planning decisions [20][21].
  3. End-to-End VLA: unified networks map sensor inputs directly to driving actions, improving responsiveness but facing challenges in long-term planning [22].
  4. Reasoning-Augmented VLA: incorporates reasoning and memory, allowing long-horizon prediction and dynamic human-machine interaction [23].

Group 5: Datasets and Benchmarks
- High-quality, diverse datasets are crucial for advancing VLA research, covering large-scale real-world data, safety-critical testing scenarios, and fine-grained reasoning data [25].

Group 6: Challenges and Future Outlook
- Key challenges include robustness and reliability, real-time performance, data bottlenecks, multi-modal alignment, multi-agent social complexity, and generalization across different traffic environments [26][27][28][29][30][31][32].
- Future directions involve building a foundational driving model, integrating neural-symbolic safety cores, fleet-level continuous learning, standardizing a traffic language, and developing cross-modal social intelligence [33][34][35][36][37].
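As a concrete illustration of the autoregressive-tokenization action decoder mentioned in Group 3, the sketch below discretizes continuous control values into a small token vocabulary and decodes actions one token at a time from a fused vision-language state. The bin count, value range, GRU-based decoder, and greedy decoding are illustrative assumptions, not the design of any specific model covered by the survey.

```python
# Minimal sketch of the "autoregressive tokenization" action-decoding idea:
# continuous controls are binned into discrete tokens so a sequence decoder
# can emit actions step by step. All numbers here are illustrative.
import torch
import torch.nn as nn


class ActionTokenizer:
    """Uniformly bins a continuous value in [lo, hi] into `bins` discrete tokens."""

    def __init__(self, lo=-1.0, hi=1.0, bins=256):
        self.lo, self.hi, self.bins = lo, hi, bins

    def encode(self, value: torch.Tensor) -> torch.Tensor:
        scaled = (value.clamp(self.lo, self.hi) - self.lo) / (self.hi - self.lo)
        return (scaled * (self.bins - 1)).round().long()

    def decode(self, token: torch.Tensor) -> torch.Tensor:
        return self.lo + token.float() / (self.bins - 1) * (self.hi - self.lo)


class AutoregressiveActionDecoder(nn.Module):
    """Emits a fixed-length sequence of action tokens conditioned on a fused state."""

    def __init__(self, bins=256, d_model=256, horizon=4):
        super().__init__()
        self.horizon = horizon
        self.token_embed = nn.Embedding(bins + 1, d_model)  # +1 for a BOS token
        self.bos = bins
        self.rnn = nn.GRUCell(d_model, d_model)
        self.head = nn.Linear(d_model, bins)

    @torch.no_grad()
    def generate(self, state: torch.Tensor) -> torch.Tensor:
        """state: (B, d_model) fused vision-language feature -> (B, horizon) tokens."""
        b = state.shape[0]
        h = state
        token = torch.full((b,), self.bos, dtype=torch.long, device=state.device)
        out = []
        for _ in range(self.horizon):
            h = self.rnn(self.token_embed(token), h)
            token = self.head(h).argmax(dim=-1)  # greedy decoding for the sketch
            out.append(token)
        return torch.stack(out, dim=1)


if __name__ == "__main__":
    tok = ActionTokenizer()
    decoder = AutoregressiveActionDecoder()
    fused_state = torch.randn(2, 256)       # stand-in for the fused VLA feature
    tokens = decoder.generate(fused_state)  # (2, 4) discrete action tokens
    print(tok.decode(tokens))               # back to normalized control values
```

The diffusion and hierarchical-controller variants named in the survey would replace the token decoder with, respectively, an iterative denoiser over continuous trajectories or a high-level planner feeding a low-level controller; the tokenizer interface above is specific to the autoregressive approach.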