Vision-Language-Action (VLA)
VLA Is the Hottest Topic Right Now, and This One Survey Is All You Need
36Kr · 2025-10-31 08:22
**Core Insights**
- The article provides a comprehensive overview of the emerging field of Vision-Language-Action (VLA) models, highlighting the field's rapid growth and its significance for AI and robotics [1][5].

**Summary by Sections**

**VLA Overview**
- VLA submissions have risen from single digits to 164, an 18-fold increase [5].
- A model qualifies as VLA if it uses a backbone pre-trained on large-scale vision-language data, with an emphasis on language understanding, visual generalization, and task transfer [5][6].

**Key Trends in VLA**
- **Trend 1: Efficient Architecture Paradigm.** Discrete diffusion models are emerging as a new paradigm, enabling parallel generation of action sequences, improving efficiency, and integrating reasoning with action [7][10].
- **Trend 2: Embodied Chain-of-Thought (ECoT).** ECoT generates intermediate reasoning steps before acting, improving planning and interpretability, though it depends heavily on high-quality annotated data [11][12].
- **Trend 3: Action Tokenizer.** The action tokenizer converts continuous robot actions into discrete tokens a VLM can consume, bridging the gap between robot actions and VLM processing (a minimal sketch of this idea follows the summary) [14][16].
- **Trend 4: Reinforcement Learning (RL).** RL is being reintroduced to fine-tune VLA policies, addressing the limits of imitation learning in extreme scenarios, with notable successes in recent studies [17][18].
- **Trend 5: Efficiency Optimization.** Work is underway to reduce the hardware requirements of VLA models, making the field more accessible to smaller research labs [19].
- **Trend 6: Video Prediction for Physical Intuition.** Video generation models carry an inherent understanding of temporal dynamics and physical laws, strengthening robot control capabilities [20][23].
- **Trend 7: Realistic Evaluation Benchmarks.** New evaluation frameworks aim to overcome the limitations of existing benchmarks by focusing on meaningful generalization [24][26].

**Challenges and Future Directions**
- The article highlights the "performance ceiling" in mainstream simulation evaluations, where high scores do not necessarily translate into real-world capability [30].
- Two areas needing more attention are data quality and in-context learning, which could be pivotal for advancing VLA research [31].
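To make Trend 3 concrete, below is a minimal sketch of an action tokenizer in Python. It assumes a simple uniform-binning scheme over a normalized action range; the bin count, the vocabulary offset, and the 7-dimensional example action are illustrative choices, not parameters of any model covered in the survey.

```python
import numpy as np

# Minimal action-tokenizer sketch: uniform binning of continuous actions into
# discrete token IDs that a VLM-style decoder could emit. NUM_BINS, VOCAB_OFFSET,
# and the action bounds are illustrative assumptions.
NUM_BINS = 256          # resolution of the discretization
VOCAB_OFFSET = 32_000   # hypothetical start of the "action token" range in the VLM vocabulary
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # assumed normalized action range

def tokenize(action: np.ndarray) -> list[int]:
    """Map each continuous action dimension to one discrete token ID."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    # Scale to [0, 1], then quantize to an integer bin in [0, NUM_BINS - 1].
    bins = ((clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) * (NUM_BINS - 1)).round().astype(int)
    return (bins + VOCAB_OFFSET).tolist()

def detokenize(tokens: list[int]) -> np.ndarray:
    """Invert the mapping: token IDs back to (approximate) continuous actions."""
    bins = np.array(tokens) - VOCAB_OFFSET
    return bins / (NUM_BINS - 1) * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

# Example: a hypothetical 7-DoF arm command (6 end-effector deltas + gripper).
action = np.array([0.12, -0.40, 0.05, 0.0, 0.3, -0.1, 1.0])
tokens = tokenize(action)
recovered = detokenize(tokens)
print(tokens)
print(np.abs(recovered - action).max())  # quantization error bounded by half a bin width
```

Giving each action dimension its own token keeps the mapping invertible up to quantization error, which is the property that lets a VLM treat robot control as ordinary next-token prediction.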
Whether VLA or a WM World Model, Both Need a World Engine
自动驾驶之心 · 2025-09-13 16:04
**Core Viewpoint**
- The article discusses the current state and future prospects of end-to-end autonomous driving, arguing for a "World Engine" to address the field's core challenges [2][21].

**Definition of End-to-End Autonomous Driving**
- End-to-end autonomous driving is defined as learning a single model that maps raw sensor inputs from the driving scene directly to control commands, replacing the traditional modular pipeline with one unified function (a minimal sketch of this single-function view follows the summary) [3][6].

**Development Roadmap of End-to-End Autonomous Driving**
- The approach has evolved over roughly 20 years, from simple black-and-white image inputs to more sophisticated methods, including conditional imitation learning and modular designs [8][10].

**Current State of End-to-End Autonomous Driving**
- The industry is in a "generation 1.5" phase, focused on foundation models and long-tail problems, with two main branches: the World Model (WM) and Vision-Language-Action (VLA) [10][11].

**Challenges in Real-World Deployment**
- Collecting data for all scenarios, especially extreme corner cases, remains a major obstacle to achieving Level 4 (L4) or Level 5 (L5) autonomous driving [17][18].

**Concept of the "World Engine"**
- The "World Engine" learns from human expert driving and generates extreme scenarios for training, which can sharply reduce the cost of operating large data-collection fleets [21][24].

**Data and Algorithm Engines**
- The "World Engine" comprises a Data Engine that generates extreme scenarios and an Algorithm Engine, still under development, that improves and trains end-to-end algorithms [24][25].
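As a companion to the definition above, here is a minimal sketch in Python (PyTorch) of the "single model from raw sensors to control commands" idea. The layer sizes, the single-camera input, and the three-dimensional control output (steering, throttle, brake) are assumptions made for illustration, not the architecture of any system discussed in the article.

```python
import torch
from torch import nn

class EndToEndDriver(nn.Module):
    """Toy illustration of the 'single model' idea: raw camera pixels in,
    low-level control commands out. All dimensions are placeholder choices."""

    def __init__(self) -> None:
        super().__init__()
        self.encoder = nn.Sequential(          # perception is implicit, learned end to end
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(             # no hand-designed planning module in between
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 3),                 # [steering, throttle, brake]
        )

    def forward(self, camera: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(camera))

# One forward pass on a dummy front-camera frame.
model = EndToEndDriver()
frame = torch.rand(1, 3, 224, 224)     # batch of one RGB image
controls = model(frame)
print(controls.shape)                   # torch.Size([1, 3])
```

The point of the sketch is structural: there is no hand-written perception, prediction, or planning stage between the pixels and the commands; everything in between is learned.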
The World's First Survey of VLA for Autonomous Driving Is Released: A Complete Breakdown of VLA Self-Driving Models
具身智能之心 · 2025-07-03 08:22
**Core Insights**
- The article discusses the integration of vision, language, and action in autonomous driving through the Vision-Language-Action (VLA) model, highlighting its potential to enhance the capabilities of self-driving vehicles [1][3].

**Evolution of Autonomous Driving Paradigms**
- Autonomous driving technology has transitioned from modular to integrated approaches, categorized into three core paradigms:
  1. End-to-End Autonomous Driving (AD), which directly maps sensor inputs to driving actions but lacks interpretability [3].
  2. Vision-Language Models for AD (VLMs for AD), which enhance interpretability and generalization but do not directly control vehicle actions [3].
  3. Vision-Language-Action Models for AD (VLA for AD), which unify perception, reasoning, and action execution, enabling vehicles to understand complex instructions and make autonomous decisions [3][4].

**VLA4AD Architecture**
- A typical VLA4AD model consists of three parts: input, processing, and output, integrating environmental perception, high-level instruction understanding, and vehicle control (a structural sketch follows the summary) [5].
- The architecture combines multimodal inputs, core modules for processing visual and language data, and an action decoder that generates control outputs [6][7][9].

**Development Stages of VLA Models**
- The evolution of VLA models is divided into four stages:
  1. Language models as explainers, enhancing interpretability without direct control [16].
  2. Modular VLA models in which language actively contributes to planning decisions [19].
  3. Unified end-to-end VLA models that map sensor inputs to control signals in a single forward pass [20].
  4. Reasoning-augmented VLA models that incorporate long-term reasoning and memory into decision-making [21].

**Representative VLA4AD Models**
- The article compares VLA4AD models by their inputs, outputs, datasets, and core contributions [23]. Examples include:
  - DriveGPT-4, which uses a single image input to generate high-level control labels [22].
  - ADriver-I, which integrates vision-action tokens for control [22].
  - RAG-Driver, which employs retrieval-augmented control mechanisms [22].

**Datasets and Benchmarks**
- High-quality, diverse datasets are crucial for VLA4AD development; notable datasets such as BDD100K, nuScenes, and Bench2Drive provide rich annotations for training and evaluation [25][26][29].

**Challenges and Future Directions**
- The article outlines six major challenges facing VLA4AD, including robustness, real-time performance, data bottlenecks, and multimodal alignment [31][32].
- Future directions include foundation-scale driving models, neuro-symbolic safety kernels, fleet-scale continual learning, a standardized traffic language, and cross-modal social intelligence [36][37].
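To ground the input/processing/output split described in the architecture section, the following Python (PyTorch) skeleton shows one way the pieces could fit together. The tiny CNN standing in for a vision encoder, the EmbeddingBag standing in for a pre-trained language model, the fusion layer, and the four-waypoint action decoder are all placeholder assumptions, not a reconstruction of any surveyed VLA4AD model.

```python
import torch
from torch import nn

class VLA4ADSkeleton(nn.Module):
    """Structural skeleton of the input -> processing -> output split:
    multimodal inputs, vision/language modules, and an action decoder.
    Every component and dimension here is a stand-in."""

    def __init__(self, vocab_size: int = 1000, dim: int = 128) -> None:
        super().__init__()
        # Processing: visual and language branches.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim),
        )
        self.language = nn.EmbeddingBag(vocab_size, dim)   # stands in for a pre-trained LM
        # Fusion of the two modalities.
        self.fusion = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        # Output: action decoder producing a short trajectory of (x, y) waypoints.
        self.action_decoder = nn.Linear(dim, 2 * 4)        # 4 future waypoints

    def forward(self, camera: torch.Tensor, instruction_ids: torch.Tensor) -> torch.Tensor:
        v = self.vision(camera)
        l = self.language(instruction_ids)
        fused = self.fusion(torch.cat([v, l], dim=-1))
        return self.action_decoder(fused).view(-1, 4, 2)   # (batch, waypoints, xy)

model = VLA4ADSkeleton()
camera = torch.rand(1, 3, 224, 224)                        # multimodal input: an image ...
instruction_ids = torch.randint(0, 1000, (1, 6))           # ... plus a tokenized instruction
waypoints = model(camera, instruction_ids)
print(waypoints.shape)                                     # torch.Size([1, 4, 2])
```

Real VLA4AD systems differ mainly in what replaces each placeholder: the vision and language backbones, the fusion mechanism, and whether the decoder emits waypoints, trajectories, or low-level control signals.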