Vision-Language-Action Models (VLA)

Peking University and 灵初 Release a Major Comprehensive Survey on Embodied VLA: One Article to Grasp VLA Technical Routes and Future Trends
机器之心 · 2025-07-25 02:03
**Core Insights**
- The article discusses the rapid advances in Vision-Language-Action (VLA) models, which extend intelligence from the digital realm to physical tasks, particularly in robotics [1][9].
- A unified framework for understanding VLA models is proposed, centered on action tokenization; it categorizes eight main types of action tokens and outlines their capabilities and future trends [2][10].

**VLA Unified Framework and Action Token Perspective**
- VLA models rely on at least one visual or language foundation model to generate action outputs from visual and language inputs, with the aim of executing specific tasks in the physical world [9][11].
- The framework categorizes action tokens into eight types: language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning [10][16] (see the sketch after this summary).

**Action Token Analysis**
- **Language Description**: Describes actions in natural language, divided into the sub-task level (language plan) and the atomic action level (language motion) [16][20].
- **Code**: Represents task logic in code form, allowing efficient communication between humans and robots, but faces challenges with API dependencies and execution rigidity [22][23].
- **Affordance**: A spatial representation indicating how objects can be interacted with, emphasizing semantic clarity and adaptability [25][26].
- **Trajectory**: Represents continuous spatial states over time, leveraging video data to expand training data sources [29][30].
- **Goal State**: A visual representation of the expected outcome, aiding action planning and execution [34][35].
- **Latent Representation**: Encodes action-related information through large-scale data pre-training, improving training efficiency and generalization [36][37].
- **Raw Action**: Directly executable low-level control commands for robots, showing scaling potential similar to large language models [38][39].
- **Reasoning**: Expresses the thought process behind actions, enhancing model interpretability and decision-making [42][45].

**Data Resources in VLA Models**
- The article organizes data resources into a pyramid: web data and human videos at the base, synthetic and simulation data in the middle, and real robot data at the top, each contributing uniquely to model performance and generalization [47][48][49].

**Conclusion**
- VLA models are positioned as a key pathway to embodied intelligence, with ongoing research focusing on action token design, open challenges, future directions, and the practical application of VLA technology in real-world scenarios [51].
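The eight-way taxonomy above lends itself to a simple data model. The following is a minimal, hypothetical Python sketch (not taken from the survey) of how the action-token categories and the VLA input/output contract could be represented; all names here (`ActionTokenType`, `ActionToken`, `vla_step`, the 7-DoF placeholder) are illustrative assumptions rather than anything defined in the paper.

```python
# Hypothetical sketch of the eight action-token categories summarized above.
# Names and shapes are assumptions for illustration only.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any


class ActionTokenType(Enum):
    """The eight action-token categories described in the survey."""
    LANGUAGE_DESCRIPTION = auto()   # sub-task plans or atomic "language motions"
    CODE = auto()                   # task logic expressed as executable code
    AFFORDANCE = auto()             # spatial cues for how objects can be manipulated
    TRAJECTORY = auto()             # continuous spatial states over time
    GOAL_STATE = auto()             # a visual depiction of the desired outcome
    LATENT_REPRESENTATION = auto()  # action information learned via large-scale pre-training
    RAW_ACTION = auto()             # directly executable low-level control commands
    REASONING = auto()              # the thought process behind the chosen action


@dataclass
class ActionToken:
    """One intermediate output of a VLA model, typed by its category."""
    token_type: ActionTokenType
    payload: Any  # e.g. a string plan, a waypoint list, or a joint-command array


def vla_step(image: Any, instruction: str) -> list:
    """Hypothetical VLA interface: map visual + language input to action tokens.

    A real model would run a vision-language backbone here; this stub only
    shows the input/output contract implied by the unified framework.
    """
    plan = ActionToken(ActionTokenType.LANGUAGE_DESCRIPTION,
                       f"plan for: {instruction}")
    command = ActionToken(ActionTokenType.RAW_ACTION, [0.0] * 7)  # 7-DoF placeholder
    return [plan, command]


if __name__ == "__main__":
    for token in vla_step(image=None, instruction="pick up the red cup"):
        print(token.token_type.name, token.payload)
```

In practice a single model may emit several of these token types in sequence (e.g. a reasoning trace, then a plan, then raw actions), which is why the sketch returns a list rather than a single token.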
80,000 Clips! Tsinghua Open-Sources a VLA Dataset for Extreme Autonomous-Driving Scenarios, Improving Safety by 35%
自动驾驶之心 · 2025-07-22 12:46
**Core Viewpoint**
- The article discusses the development of the Impromptu VLA dataset, which addresses the scarcity of data on unstructured driving environments for autonomous driving systems, and highlights the dataset's potential to improve the performance of vision-language-action models in complex scenarios [4][29].

**Dataset Overview**
- The Impromptu VLA dataset consists of over 80,000 meticulously constructed video clips, extracted from more than 2 million source clips across eight diverse open-source datasets [5][29].
- The dataset focuses on four key unstructured challenges: boundary-ambiguous roads, temporary traffic-rule changes, unconventional dynamic obstacles, and complex road conditions [12][13].

**Methodology**
- Dataset construction followed a multi-step pipeline of data collection, scene classification, and multi-task annotation generation, using advanced vision-language models (VLMs) for scene understanding [10][17].
- A rigorous manual verification process ensured high-quality annotations, with strong F1 scores across categories confirming the reliability of the VLM-based annotation process [18].

**Experimental Validation**
- The dataset's effectiveness was validated through comprehensive experiments showing significant gains on mainstream autonomous-driving benchmarks: the average score in the closed-loop NeuroNCAP test improved from 1.77 to 2.15, and collision rates dropped from 72.5% to 65.5% [6][21].
- In open-loop trajectory prediction, models trained with the Impromptu VLA dataset achieved L2 errors as low as 0.30 meters, competitive with leading methods that rely on larger proprietary datasets [24] (see the sketch after this summary).

**Conclusion**
- The Impromptu VLA dataset serves as a critical resource for building more robust and adaptive autonomous driving systems that can handle complex real-world scenarios; the research confirms its value for improving perception, prediction, and planning in unstructured driving environments [29].
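For readers unfamiliar with the open-loop metric quoted above, the sketch below re-implements a standard average L2 trajectory error in Python: the per-timestep Euclidean distance between predicted and ground-truth waypoints, averaged over the planning horizon. This is an illustrative assumption of how such an error is typically computed, not the paper's evaluation code; the function name, array shapes, and toy trajectory are made up.

```python
# Illustrative (assumed) computation of average L2 trajectory error,
# the kind of open-loop planning metric quoted above (e.g. 0.30 m).
import numpy as np


def average_l2_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and ground-truth waypoints.

    pred, gt: arrays of shape (T, 2) holding (x, y) positions in meters
    for T future timesteps of a single planned trajectory.
    """
    assert pred.shape == gt.shape, "trajectories must align timestep for timestep"
    per_step = np.linalg.norm(pred - gt, axis=-1)  # (T,) distances in meters
    return float(per_step.mean())


if __name__ == "__main__":
    # Toy example: a predicted trajectory with a constant 0.3 m lateral offset.
    gt = np.stack([np.linspace(0.0, 10.0, 6), np.zeros(6)], axis=-1)  # straight line
    pred = gt + np.array([0.0, 0.3])
    print(f"average L2 error: {average_l2_error(pred, gt):.2f} m")    # -> 0.30 m
```

Benchmarks often report this error at several horizons (e.g. 1 s, 2 s, 3 s) rather than a single average, so published numbers may not be directly comparable across papers.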