Core Insights
- The article discusses rapid advances in Vision-Language-Action (VLA) models, which extend intelligence from the digital realm to physical tasks, particularly in robotics [1][9].
- A unified framework for understanding VLA models is proposed, centered on action tokenization; it categorizes eight main types of action tokens and outlines their capabilities and future trends [2][10].

VLA Unified Framework and the Action Token Perspective
- VLA models build on at least one vision or language foundation model to generate actions from visual and language inputs, with the goal of executing specific tasks in the physical world [9][11].
- The framework categorizes action tokens into eight types: language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning (see the enum sketch after this summary) [10][16].

Action Token Analysis
- Language Description: describes actions in natural language, divided into the sub-task level (language plan) and the atomic-action level (language motion) [16][20].
- Code: represents task logic as executable code, enabling efficient communication between humans and robots, but faces challenges from API dependencies and execution rigidity (see the code-as-action sketch below) [22][23].
- Affordance: a spatial representation of how objects can be interacted with, valued for its semantic clarity and adaptability [25][26].
- Trajectory: represents continuous spatial states over time and can be supervised from video, broadening the available sources of training data [29][30].
- Goal State: a visual depiction of the expected outcome, aiding action planning and execution [34][35].
- Latent Representation: encodes action-relevant information through large-scale pre-training, improving training efficiency and generalization [36][37].
- Raw Action: directly executable low-level control commands for robots, with scaling potential analogous to that of large language models (see the discretization sketch below) [38][39].
- Reasoning: expresses the thought process behind actions, improving interpretability and decision-making [42][45].

Data Resources in VLA Models
- The article organizes data resources into a pyramid: web data and human videos at the base, synthetic and simulation data in the middle, and real robot data at the top, each contributing distinctly to model performance and generalization (see the sampling sketch below) [47][48][49].

Conclusion
- VLA models are positioned as a key pathway to embodied intelligence, with ongoing research focusing on action token design, open challenges, and future directions, as well as practical applications of VLA technology in real-world scenarios [51].
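To make the taxonomy concrete, here is a minimal sketch of the eight action-token categories as a Python enum. The identifier names are paraphrases of the survey's category labels, not symbols from any released codebase.

```python
from enum import Enum, auto

class ActionToken(Enum):
    """The eight action-token categories identified by the survey.

    Each is an intermediate representation a VLA model can emit
    between perception (vision + language) and motor control.
    """
    LANGUAGE_DESCRIPTION = auto()   # natural-language plans / atomic motions
    CODE = auto()                   # executable task logic
    AFFORDANCE = auto()             # spatial cues for object interaction
    TRAJECTORY = auto()             # continuous spatial states over time
    GOAL_STATE = auto()             # visual depiction of the desired outcome
    LATENT_REPRESENTATION = auto()  # embeddings learned from large-scale pre-training
    RAW_ACTION = auto()             # directly executable low-level commands
    REASONING = auto()              # explicit thought process behind an action
```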
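The code token type can be illustrated with a sketch in the spirit of code-generation policies: the model emits a short program against whatever robot API the deployment exposes. The `RobotAPI` class and its methods below are hypothetical stand-ins, and the reliance on such predefined APIs is exactly the dependency the summary flags as a limitation.

```python
# Hypothetical robot API; every name here is an illustrative assumption.
class RobotAPI:
    def detect(self, obj: str) -> tuple[float, float, float]:
        """Return the (x, y, z) position of a named object (stubbed)."""
        positions = {"apple": (0.40, 0.10, 0.05), "bowl": (0.55, -0.20, 0.02)}
        return positions.get(obj, (0.0, 0.0, 0.0))

    def move_to(self, pos: tuple[float, float, float]) -> None:
        print(f"move_to {pos}")

    def grasp(self) -> None:
        print("grasp")

    def release(self) -> None:
        print("release")


# A program the model might emit for "put the apple in the bowl".
# The task logic is explicit and human-readable, but it only runs if
# detect/move_to/grasp/release exist -- the API-dependency problem.
def put_apple_in_bowl(robot: RobotAPI) -> None:
    apple_pos = robot.detect("apple")
    bowl_pos = robot.detect("bowl")
    robot.move_to(apple_pos)
    robot.grasp()
    robot.move_to(bowl_pos)
    robot.release()


put_apple_in_bowl(RobotAPI())
```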
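For the raw action token type, one widely used recipe (e.g., in RT-style policies) is to uniformly bin each dimension of a normalized low-level command so that a language-model backbone can predict actions as ordinary vocabulary tokens. The bounds and bin count below are illustrative defaults, not values taken from the survey.

```python
import numpy as np

def discretize_action(action: np.ndarray, low: float = -1.0,
                      high: float = 1.0, num_bins: int = 256) -> np.ndarray:
    """Map each continuous action dimension to an integer bin (token id)."""
    clipped = np.clip(action, low, high)
    bins = np.floor((clipped - low) / (high - low) * num_bins).astype(int)
    return np.minimum(bins, num_bins - 1)  # the value `high` maps to the top bin

def undiscretize(tokens: np.ndarray, low: float = -1.0,
                 high: float = 1.0, num_bins: int = 256) -> np.ndarray:
    """Recover a continuous action as the center of each token's bin."""
    return low + (tokens + 0.5) * (high - low) / num_bins

# Example 7-DoF command: xyz delta, rpy delta, gripper.
a = np.array([0.12, -0.53, 0.88, 0.0, 0.25, -0.99, 1.0])
toks = discretize_action(a)
print(toks, undiscretize(toks))
```

Because actions become ordinary tokens, the same autoregressive training and scaling machinery used for text applies directly, which is the basis for the LLM-like scalability claim.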
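The data pyramid can be read as a training-mixture design: plentiful but action-free data at the base, scarce but directly supervisory data at the top. A minimal sampling sketch follows; the mixture weights are purely illustrative assumptions, not values reported by the survey.

```python
import random

# Tier names follow the survey's pyramid; the weights are assumptions,
# chosen only to reflect relative abundance.
DATA_TIERS = {
    "web_data_and_human_video": 0.6,  # base: broad semantics, no robot actions
    "synthetic_and_simulation": 0.3,  # middle: cheap, labeled, embodiment-aware
    "real_robot": 0.1,                # top: scarce but directly executable
}

def sample_tier(rng: random.Random) -> str:
    """Draw one data tier according to the mixture weights."""
    tiers = list(DATA_TIERS)
    weights = list(DATA_TIERS.values())
    return rng.choices(tiers, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_tier(rng) for _ in range(8)])
```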
北大-灵初 (PKU–LingChu) releases a comprehensive survey of embodied VLA: VLA technical routes and future trends in one article
机器之心·2025-07-25 02:03