Vision-Language-Action (VLA) Models
VLA-Pruner: Temporal-Aware Visual Token Pruning for Efficient VLA Inference
具身智能之心· 2025-11-21 16:03
Group 1
- The core challenge of VLA models is integrating visual scene perception, natural language understanding, and action execution; because visual tokens far outnumber text tokens, this integration carries significant computational overhead [2][4].
- Existing visual-token pruning methods are flawed: they focus primarily on semantic relevance and neglect the distinct needs of high-level semantic understanding versus low-level action execution, leading to performance drops at high pruning rates [3][4].
- A key observation is that the temporal continuity of robot operation allows the visual tokens needed for the current action to be estimated from historical attention trends, providing a breakthrough in addressing the limitations of existing methods [5].

Group 2
- VLA-Pruner is designed to retain both semantic-understanding and action-execution tokens under a given computational budget, achieving efficient inference without performance loss through a dual-level criterion and selection strategy [6][10].
- The dual-level importance criteria combine semantic relevance, based on prefill attention scores, with action-level importance estimated through temporal smoothing, ensuring a comprehensive approach to token selection (a sketch of this idea follows below) [7][9].
- The method employs a "merge-filter" mechanism to maximize relevance and minimize redundancy, ensuring that all tokens critical to either semantic understanding or action execution are preserved [10][11].

Group 3
- At a 50% pruning rate, VLA-Pruner not only maintains performance but improves success rates, with OpenVLA showing an average increase of 2.45% [16].
- VLA-Pruner is robust across scenarios, achieving a 96.8% success rate in the SIMPLER environment at a 75% pruning rate and significantly outperforming baseline methods [19][20].
- Efficiency improvements are notable: at a 50% pruning rate, FLOPs fall to roughly 60% of the original model and inference runs up to 1.8 times faster [26][27].

Group 4
- The core contributions are a dual-level pruning criterion that addresses the inherent flaws of existing methods and a plug-and-play pruning framework that improves inference efficiency without altering the model architecture [31].
- Limitations include potential inaccuracies in action-attention estimation in dynamic scenarios with rapid viewpoint or target changes, suggesting areas for future optimization [31].
- Future directions involve adaptive prediction modules and the integration of additional techniques such as quantization and layer pruning to further improve deployment efficiency [31].
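The summary describes the dual-level criterion only at a high level. Below is a minimal sketch of what such a selection step could look like, assuming per-token prefill attention as the semantic signal and an exponentially smoothed action-attention estimate as the temporal signal; the "merge" step takes the union of the two top-k sets and the "filter" step greedily drops near-duplicate tokens by cosine similarity. All names and default values (`ema_decay`, `keep_budget`, `sim_thresh`) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def select_visual_tokens(
    token_feats: np.ndarray,      # (N, D) visual token features
    prefill_attn: np.ndarray,     # (N,) semantic relevance from prefill attention
    prev_action_attn: np.ndarray, # (N,) smoothed action attention from past steps
    curr_action_attn: np.ndarray, # (N,) action attention observed at the last step
    keep_budget: int,             # number of tokens to keep under the compute budget
    ema_decay: float = 0.8,       # temporal smoothing factor (illustrative value)
    sim_thresh: float = 0.95,     # redundancy threshold for the filter step
):
    # Action-level importance: exponential moving average over time, exploiting
    # the temporal continuity of robot manipulation.
    action_score = ema_decay * prev_action_attn + (1.0 - ema_decay) * curr_action_attn

    # "Merge": union of the top tokens under each criterion.
    k = keep_budget
    top_sem = set(np.argsort(prefill_attn)[-k:])
    top_act = set(np.argsort(action_score)[-k:])
    merged = sorted(top_sem | top_act)

    # "Filter": greedily drop tokens nearly identical to ones already kept,
    # then truncate to the budget by combined importance.
    feats = token_feats / (np.linalg.norm(token_feats, axis=1, keepdims=True) + 1e-8)
    kept = []
    for idx in sorted(merged, key=lambda i: -(prefill_attn[i] + action_score[i])):
        if all(float(feats[idx] @ feats[j]) < sim_thresh for j in kept):
            kept.append(idx)
        if len(kept) == keep_budget:
            break
    return sorted(kept), action_score  # action_score feeds prev_action_attn next step
```

In a deployed loop the returned `action_score` would be carried over as `prev_action_attn` for the next control step, which is where the temporal continuity of robot operation enters the criterion.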
Lightweight VLA model Evo-1: SOTA with only 0.77B parameters, tackling low-cost training and real-time deployment
具身智能之心· 2025-11-12 04:00
Vision-Language-Action (VLA) models unify perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically contain massive numbers of parameters and rely heavily on large-scale robot-data pretraining, which makes training computationally expensive and limits deployment for real-time inference. In addition, most training paradigms degrade the perceptual representations of the vision-language backbone, causing overfitting and weakening generalization to downstream tasks.

Paper: Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
Link: https://arxiv.org/abs/2511.04555

A team from Shanghai Jiao Tong University, CMU, and the University of Cambridge proposes the lightweight VLA model Evo-1, which lowers computational cost and improves deployment efficiency without robot-data pretraining while maintaining strong performance. Evo-1 builds on a native multimodal vision-language model (VLM), combining a novel cross-modulated diffusion transformer with an optimized integration module to form an efficient architecture. A two-stage training paradigm is further introduced that progressively aligns action with perception, fully preserving the VLM's representational capacity.
Ask-to-Clarify: Resolving instruction ambiguity and generating actions end-to-end for real-world embodied tasks
具身智能之心· 2025-10-22 03:04
Core Insights
- The article presents the Ask-to-Clarify framework, aimed at enhancing embodied intelligent agents' ability to interact with humans by resolving instruction ambiguity through multi-turn dialogue [2][4][41].

Framework Design
- A new collaborative task for embodied agents is introduced, requiring them to ask questions to clarify ambiguous instructions before executing tasks. This combines a vision-language model (VLM) for questioning with a diffusion model for action generation [6][10].
- The framework consists of two main components: a collaborative module for human interaction and an action module for generating specific actions. A connection module is designed to ensure smooth integration between these components [42][46].

Training Strategy
- A two-phase "knowledge isolation" training strategy is proposed. The first phase trains the model to handle ambiguous instructions, while the second phase maintains this capability while enhancing action generation [8][15].
- In the first phase, an interactive dialogue dataset is constructed to train the collaborative component, allowing it to ask questions when faced with ambiguous instructions [16][17].
- The second phase uses a hierarchical framework for end-to-end action generation, ensuring that the model retains its ability to clarify instructions while learning to generate actions [18][19].

Inference Process
- During inference, the framework engages in dialogue with users to clarify instructions and then executes the inferred correct actions. A signal detector routes the process between questioning and executing based on the task state (see the sketch after this list) [22][23].
- The model uses specific signal markers to indicate whether an instruction is ambiguous, guiding its response accordingly [22][23].

Experimental Validation
- The framework was tested in real-world scenarios, demonstrating its ability to clarify ambiguous instructions and reliably generate actions. Experiments included ablation studies on the training strategy and the connection module [24][25][41].
- The results showed that Ask-to-Clarify significantly outperformed baseline models in handling ambiguous instructions and executing tasks accurately [29][30][35].

Robustness Testing
- Robustness was evaluated under challenging conditions such as low-light environments and the presence of distractors. The framework consistently outperformed baseline models in these scenarios, showcasing its practical applicability [37][39][40].
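The routing described under "Inference Process" can be pictured as a small dispatcher around the VLM and the action module. The sketch below assumes the collaborative VLM prefixes its output with a textual marker (the hypothetical `[ASK]` / `[ACT]` tokens here); the actual signal format used by Ask-to-Clarify is not specified in the summary, so treat every name as illustrative.

```python
from typing import Callable

ASK_MARKER = "[ASK]"   # hypothetical marker: instruction is ambiguous, ask a question
ACT_MARKER = "[ACT]"   # hypothetical marker: instruction is clear, generate actions

def ask_to_clarify_step(
    observation,
    instruction: str,
    dialogue: list[str],
    vlm_generate: Callable[[object, str, list[str]], str],   # collaborative VLM
    action_head: Callable[[object, str, list[str]], list],   # diffusion action module
):
    """One inference step: either return a clarifying question or a low-level action chunk."""
    reply = vlm_generate(observation, instruction, dialogue)

    if reply.startswith(ASK_MARKER):
        # Ambiguous instruction: route the question back to the human and wait for an answer.
        question = reply[len(ASK_MARKER):].strip()
        dialogue.append(f"robot: {question}")
        return {"type": "question", "text": question}

    # Unambiguous (or already clarified): hand the conditioning text to the action module.
    plan = reply[len(ACT_MARKER):].strip() if reply.startswith(ACT_MARKER) else reply
    actions = action_head(observation, plan, dialogue)
    return {"type": "actions", "actions": actions}
```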
MTRDrive: An autonomous-driving VLA framework with dynamic interactive reasoning (Tsinghua & Xiaomi)
自动驾驶之心· 2025-09-28 23:33
Core Insights
- The article discusses the MTRDrive framework, which models autonomous driving as a dynamic interactive reasoning process, addressing the limitations of traditional static decision-making approaches [4][9][50].
- MTRDrive integrates a memory-tool synergistic mechanism to enhance perception accuracy and reasoning reliability, significantly improving the model's robustness in long-tail and out-of-distribution (OOD) scenarios [4][13][50].

Group 1: Challenges in Autonomous Driving
- Current vision-language-action (VLA) models face significant challenges in long-term reasoning and high-level decision-making, particularly in complex scenarios with few or no samples [3][5].
- Robust driving decisions rely heavily on the deep collaboration of perception accuracy and reasoning reliability, akin to human drivers who use accumulated experience for dynamic prediction and adaptive adjustment [3][8].

Group 2: MTRDrive Framework
- MTRDrive is a new framework proposed by teams from Tsinghua University, Xiaomi Auto, McGill University, and the University of Wisconsin-Madison, which breaks the limitations of traditional static decision-making [4][9].
- The framework includes a memory-tool collaborative mechanism that enhances perception accuracy and supports robust decision-making in long-term and high-level tasks [4][15].

Group 3: Experimental Validation
- Systematic experiments demonstrate that MTRDrive significantly improves generalization and robustness in long-tail and OOD scenarios, providing a new technical pathway for deploying autonomous agents in complex real-world environments [4][34].
- In high-level planning tasks, MTRDrive achieved a planning accuracy of 82.6% on the NAVSIM dataset, more than double that of the Qwen2.5-VL-72B model [40].

Group 4: Memory and Tool Interaction
- MTRDrive incorporates a structured driving-experience repository that allows the model to retrieve relevant past experiences, enhancing its decision-making capabilities (a retrieval sketch follows below) [15][19].
- The framework employs a visual toolset that enables the model to actively probe the visual environment for high-fidelity information, improving its perception capabilities [21][28].

Group 5: Training Methodology
- MTRDrive uses a two-phase training process: supervised fine-tuning (SFT) to teach basic skills and reinforcement-learning fine-tuning (RLFT) to optimize decision-making capabilities [24][29].
- The memory-retrieval mechanism significantly enhances the model's ability to generalize skills to new, unseen driving scenarios, as evidenced by improved performance metrics [44].
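For the memory side of the memory-tool mechanism, a common way to realize "retrieve relevant past experiences" is nearest-neighbor search over scene embeddings. The sketch below shows such a generic retrieval step under that assumption; the `DrivingExperience` schema, field names, and the use of cosine similarity are illustrative stand-ins rather than MTRDrive's exact design.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class DrivingExperience:
    embedding: np.ndarray   # scene embedding stored at collection time
    situation: str          # short textual description of the past scenario
    decision: str           # the high-level decision that worked in that scenario

def retrieve_experiences(
    query_embedding: np.ndarray,
    repository: list[DrivingExperience],
    top_k: int = 3,
) -> list[DrivingExperience]:
    """Return the top-k most similar past experiences by cosine similarity."""
    q = query_embedding / (np.linalg.norm(query_embedding) + 1e-8)
    scored = []
    for exp in repository:
        e = exp.embedding / (np.linalg.norm(exp.embedding) + 1e-8)
        scored.append((float(q @ e), exp))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [exp for _, exp in scored[:top_k]]

# The retrieved (situation, decision) pairs would then be formatted into the prompt,
# alongside any tool-call results, before the model reasons about the current scene.
```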
The technical development roadmap of embodied intelligence, drawn from nearly 1,000 papers!
自动驾驶之心· 2025-09-07 23:34
Core Insights
- The article discusses the evolution and challenges of embodied intelligence, emphasizing the need for a comprehensive understanding of its development, the issues it faces, and its future directions [4][5].

Group 1: Robotic Manipulation
- The survey on robotic manipulation highlights the transition from mechanical programming to embodied intelligence, focusing on the evolution from simple grippers to dexterous multi-fingered hands [6][7].
- Key challenges in dexterous manipulation include data collection methods such as simulation, human demonstration, and teleoperation, as well as skill-learning frameworks like imitation learning and reinforcement learning [6][7].

Group 2: Navigation and Manipulation
- The discussion on robotic navigation emphasizes the high costs and data difficulties of real-world training, proposing Sim-to-Real transfer as a critical solution [8][13].
- The evolution of navigation techniques is outlined, transitioning from explicit memory to implicit memory, while manipulation methods have expanded from reinforcement learning to imitation learning and diffusion strategies [13][14].

Group 3: Multimodal Large Models
- The exploration of embodied multimodal large models (EMLMs) indicates their potential to bridge the gap between perception, cognition, and action, driven by advancements in large-model technologies [15][17].
- Challenges identified include cross-modal alignment difficulties, high computational resource demands, and weak domain generalization [17].

Group 4: Embodied AI Simulators
- The analysis of embodied AI simulators reveals their role in enhancing the realism and interactivity of training environments, with a focus on 3D simulators and their applications in visual exploration and navigation [18][22].
- Key challenges for simulators include achieving high fidelity, scalability, and effective interaction capabilities [22].

Group 5: Reinforcement Learning
- The survey on reinforcement learning in vision outlines its application in multimodal large language models and the challenges posed by high-dimensional visual inputs and complex reward design [24][27].
- Core research directions include optimizing visual generation and enhancing cross-modal consistency through reinforcement learning [27].

Group 6: Teleoperation and Data Collection
- The discussion on teleoperation of humanoid robots highlights the integration of human cognition with robotic capabilities, particularly in hazardous environments [28][30].
- Key components of teleoperation systems include human state measurement, motion retargeting, and multimodal feedback mechanisms [30].

Group 7: Vision-Language-Action Models
- The comprehensive review of vision-language-action (VLA) models outlines their evolution and applications across various fields, including humanoid robotics and autonomous driving [31][34].
- Challenges in VLA models include real-time control, multimodal action representation, and system scalability [34].
The technical development roadmap of embodied intelligence, drawn from nearly 1,000 papers!
具身智能之心· 2025-09-05 00:45
Core Insights
- The article discusses the evolution and challenges of embodied intelligence, emphasizing the need for a comprehensive understanding of its development, the issues it faces, and its future directions [3][4].

Group 1: Robotic Manipulation
- The survey on robotic manipulation highlights the transition from mechanical programming to embodied intelligence, focusing on the evolution from simple grippers to dexterous multi-fingered hands [5][6].
- Key challenges in dexterous manipulation include data collection methods such as simulation, human demonstration, and teleoperation, as well as skill-learning frameworks like imitation learning and reinforcement learning [5][6].

Group 2: Navigation and Manipulation
- The discussion on robotic navigation emphasizes the importance of physics simulators in addressing the high costs and data scarcity of real-world training, with a focus on Sim-to-Real transfer challenges [9][15].
- The evolution of navigation techniques is outlined, transitioning from explicit memory to implicit memory, and the role of various simulators in narrowing the Sim-to-Real gap is analyzed [15][16].

Group 3: Multimodal Large Models
- The exploration of embodied multimodal large models (EMLMs) reveals their potential to bridge perception, cognition, and action gaps, driven by advancements in large-model technologies [17][19].
- Challenges identified include cross-modal alignment difficulties, high computational resource demands, and weak domain generalization [19].

Group 4: Teleoperation and Data Collection
- The survey on teleoperation of humanoid robots discusses the integration of human cognition with robotic capabilities, particularly in hazardous environments, while addressing challenges such as high degrees of freedom and communication limitations [29][30].
- Key components of teleoperation systems include human state measurement, motion retargeting, and multimodal feedback mechanisms [30][33].

Group 5: Vision-Language-Action Models
- The analysis of Vision-Language-Action (VLA) models covers their evolution from cross-modal learning architectures to the integration of vision-language models and action planners [33][36].
- The article identifies core challenges in real-time control, multimodal action representation, and system scalability, while proposing future directions for adaptive AI and cross-embodiment generalization [36][41].
The first VLA model purpose-built for 3D action games plays Black Myth: Wukong & Sekiro and beats human players | ICCV 2025
量子位· 2025-08-19 05:25
Core Insights
- CombatVLA, a 3B multimodal model, surpasses GPT-4o and human players in combat tasks within action role-playing games, demonstrating significant advances in real-time decision-making and tactical reasoning [1][4][52].

Group 1: CombatVLA Overview
- CombatVLA integrates visual, semantic, and action control to enhance embodied intelligence, addressing the challenges of 3D combat scenarios: visual perception, combat reasoning, and efficient inference [6][8].
- The model achieves a 50-fold acceleration in combat execution speed compared with existing models, with a higher success rate than human players [4][11][52].

Group 2: Action Tracking and Benchmarking
- An action tracker was developed to collect human action sequences in games, providing extensive training data for the combat-understanding model (a data-alignment sketch follows below) [15][17].
- The CUBench benchmark was established to evaluate the model's combat intelligence along three core capabilities: information acquisition, understanding, and reasoning [20][21].

Group 3: CombatVLA Model and Training
- The Action-of-Thought (AoT) dataset was created to facilitate the model's understanding of combat actions, structured in a way that enhances reasoning speed [24][25].
- CombatVLA employs a three-stage progressive training paradigm, gradually refining the model's combat strategies from video-level to frame-level optimization [27][33].

Group 4: Experimental Results
- In combat-understanding evaluations, CombatVLA achieved the top average score of 63.61 on CUBench, significantly outperforming other models [46].
- The model demonstrated robust generalization, performing comparably to baseline models on general benchmarks while excelling in task-level evaluations [47][48].
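The article does not detail how the action tracker is implemented. One plausible building block, sketched below purely as an assumption, is aligning timestamped keyboard/mouse events with the nearest captured game frame so that each frame carries its ground-truth action labels; the event structure and action labels here are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass

@dataclass
class InputEvent:
    timestamp: float   # seconds since recording start
    action: str        # e.g. "dodge", "light_attack" (illustrative labels)

def align_events_to_frames(
    frame_timestamps: list[float],   # capture time of each recorded frame, sorted ascending
    events: list[InputEvent],
) -> dict[int, list[str]]:
    """Map each frame index to the input actions that occurred closest to it in time."""
    aligned: dict[int, list[str]] = {}
    for ev in events:
        i = bisect_left(frame_timestamps, ev.timestamp)
        # Pick whichever neighbouring frame is closer in time.
        if i == 0:
            idx = 0
        elif i == len(frame_timestamps):
            idx = len(frame_timestamps) - 1
        else:
            before, after = frame_timestamps[i - 1], frame_timestamps[i]
            idx = i - 1 if ev.timestamp - before <= after - ev.timestamp else i
        aligned.setdefault(idx, []).append(ev.action)
    return aligned
```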
A look at DreamVLA: letting robots look first, think next, then act
具身智能之心· 2025-08-11 00:14
Core Viewpoint
- The article introduces DreamVLA, a new Vision-Language-Action model that enhances robotic decision-making by integrating comprehensive world knowledge, allowing robots to predict dynamic environments and make more accurate action decisions [1][27].

Group 1: Background and Need for Advanced VLA Models
- Traditional VLA models map visual inputs and language commands directly to actions, which can lead to interference from irrelevant information in complex environments [3][5].
- DreamVLA addresses this by adding a layer of "thinking" that predicts world knowledge, including dynamic regions, depth information, and semantic features, before planning actions [5][27].

Group 2: Model Architecture and Functionality
- DreamVLA operates on a "perception-prediction-action" cycle, treating the task as an inverse-dynamics problem to derive the necessary actions from predicted future states (a toy sketch of this pipeline follows below) [7][27].
- The model processes three types of inputs: visual images, language commands, and the robot's own state, using dedicated encoders for each [10][14].

Group 3: World Knowledge Prediction
- DreamVLA predicts world knowledge, which includes dynamic regions, depth maps, and semantic features, rather than predicting actions directly [11][18].
- Dynamic-region prediction uses CoTracker to identify moving objects and generate masks that highlight relevant areas while filtering out static backgrounds [12][15].
- Depth prediction estimates the spatial relationships of objects, generating depth maps to assist in obstacle avoidance [13][17].
- Semantic prediction employs the DINOv2 and SAM models to extract high-level semantic information, which is then encoded into a unified "world embedding" for action generation [18][22].

Group 4: Action Generation
- The action-generation component uses a diffusion Transformer to produce future action sequences based on the latent action embedding derived from the multimodal inputs [23][27].
- A structured attention mechanism is implemented to ensure coherent multi-step action reasoning and prevent cross-modal knowledge leakage [19][31].

Group 5: Performance and Validation
- DreamVLA achieved an average task-completion length of 4.44 on the CALVIN ABC-D benchmark, outperforming previous methods by 3.5%, with a real-world task success rate of 76.7% [25][27].
- Ablation studies confirmed the contributions of the individual components, demonstrating the model's robustness and generalization capabilities [25][31].
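To make the "perception-prediction-action" structure concrete, here is a toy PyTorch sketch: separate heads predict dynamic-region, depth, and semantic features, these are fused into a world embedding, and a small MLP stands in for the diffusion-Transformer action head. Layer sizes, concatenation-based fusion, and the 8-step, 7-DoF action chunk are all assumptions for illustration, not DreamVLA's actual architecture.

```python
import torch
import torch.nn as nn

class WorldKnowledgeThenAct(nn.Module):
    """Toy version of the perceive -> predict world knowledge -> act pipeline."""

    def __init__(self, obs_dim=512, world_dim=256, action_dim=7, horizon=8):
        super().__init__()
        # One predictor per knowledge type (dynamic regions, depth, semantics).
        self.dynamic_head = nn.Linear(obs_dim, world_dim)
        self.depth_head = nn.Linear(obs_dim, world_dim)
        self.semantic_head = nn.Linear(obs_dim, world_dim)
        # Fuse the three predictions into a single world embedding.
        self.fuse = nn.Linear(3 * world_dim, world_dim)
        # Stand-in for the diffusion-Transformer action head: maps the world
        # embedding to a short chunk of future actions.
        self.action_head = nn.Sequential(
            nn.Linear(world_dim, 256), nn.GELU(), nn.Linear(256, horizon * action_dim)
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, obs_feat: torch.Tensor) -> torch.Tensor:
        world = torch.cat(
            [self.dynamic_head(obs_feat), self.depth_head(obs_feat), self.semantic_head(obs_feat)],
            dim=-1,
        )
        world = self.fuse(world)
        actions = self.action_head(world)
        return actions.view(-1, self.horizon, self.action_dim)

# Example: one fused observation feature -> an 8-step, 7-DoF action chunk.
model = WorldKnowledgeThenAct()
print(model(torch.randn(1, 512)).shape)  # torch.Size([1, 8, 7])
```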
TACTILE-VLA: Activating the physical knowledge of VLA models for tactile generalization (latest from Tsinghua University)
自动驾驶之心· 2025-07-16 04:05
Core Insights
- The article discusses the development of Tactile-VLA, a model that integrates visual, language, action, and tactile perception to enhance robotic capabilities in contact-intensive tasks [2][6][20].

Group 1: Background and Core Issues
- Vision-language-action (VLA) models are crucial for general-purpose robotic agents, but they struggle in contact-intensive scenarios due to a lack of tactile perception [2].
- Tactile perception provides essential feedback for physical interactions, which is often missing in existing models [2].

Group 2: Key Innovations and Research Goals
- The core finding is that VLA models contain prior knowledge of physical interactions, which can be activated through tactile sensors for zero-shot generalization in contact tasks [6].
- The Tactile-VLA framework introduces tactile perception as a primary modality, allowing direct mapping from abstract semantics to physical force control [6].
- The mixed position-force controller innovatively converts force targets into position-adjustment commands, addressing the challenge of coordinating position and force control (a control sketch follows below) [6][10].
- The Tactile-VLA-CoT variant incorporates a chain-of-thought (CoT) reasoning mechanism, enabling robots to analyze failure causes and autonomously adjust strategies [6][14].

Group 3: Overall Architecture
- Tactile-VLA's architecture features four key modules, emphasizing token-level fusion through a non-causal attention mechanism for semantic representation rooted in physical reality [9].

Group 4: Mixed Position-Force Control Mechanism
- The mixed control strategy prioritizes position control while introducing force-feedback adjustments when necessary, ensuring precision in both movement and force [10][12].
- The design separates external net force from internal grasping force, allowing for refined force adjustments suitable for contact-intensive tasks [13].

Group 5: Chain of Thought Reasoning Mechanism
- Tactile-VLA-CoT enhances adaptive capabilities by transforming the adjustment process into an interpretable reasoning process, improving robustness in complex tasks [14][15].

Group 6: Data Collection Methods
- A specialized data-collection system was developed to obtain high-quality tactile-language-aligned data, addressing the issue of missing force feedback in traditional teleoperation [16][19].

Group 7: Experimental Validation and Results Analysis
- Three experimental groups were designed to validate Tactile-VLA's capabilities in instruction following, common-sense application, and adaptive reasoning [20].
- In the instruction-following experiment, Tactile-VLA demonstrated the ability to learn the semantic meaning of force-related language, achieving a success rate of 35% on the USB task and 90% on the charger task [23].
- The model effectively utilized common-sense knowledge to adjust interaction forces based on object properties, achieving significant performance improvements over baseline models [24][30].
- In the adaptive-reasoning experiment, Tactile-VLA-CoT achieved an 80% success rate on the blackboard task, showcasing its ability to diagnose and correct failures autonomously [28][32].
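The summary says the mixed controller "converts force targets into position-adjustment commands" but gives no control law. A minimal admittance-style sketch under that reading is shown below: pure position control by default, plus a small offset along the contact normal proportional to the force error. The gain, deadband, and sign convention are illustrative assumptions, not the paper's actual controller.

```python
import numpy as np

def hybrid_position_force_step(
    target_pos: np.ndarray,     # commanded end-effector position from the policy
    target_force: float,        # desired contact force along the contact normal (N)
    measured_force: float,      # force read from the tactile / force-torque sensor (N)
    contact_normal: np.ndarray, # unit vector along which force is regulated
    k_f: float = 0.002,         # force-to-position gain (m/N), illustrative value
    deadband: float = 0.2,      # ignore force errors smaller than this (N)
) -> np.ndarray:
    """Position control everywhere, plus a small position offset along the contact
    normal whenever the measured force deviates from the target."""
    pos_cmd = target_pos.copy()
    force_err = target_force - measured_force
    if abs(force_err) > deadband:
        # Too little force -> push further along the normal (toward the surface);
        # too much force -> back off. Sign convention: +normal points into contact.
        pos_cmd = pos_cmd + k_f * force_err * contact_normal
    return pos_cmd
```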
TACTILE-VLA: Activating the physical knowledge of VLA models for tactile generalization (latest from Tsinghua University)
具身智能之心· 2025-07-15 07:55
Core Insights
- The article discusses the development of Tactile-VLA, a model that integrates visual, language, action, and tactile perception to enhance robotic capabilities in contact-intensive tasks [2][6].

Group 1: Background and Core Issues
- Vision-language-action (VLA) models have strong semantic understanding and cross-modal generalization capabilities, but they struggle in contact-intensive scenarios due to a lack of tactile perception [2][6].
- Tactile perception provides critical feedback in physical interactions, such as friction and material properties, which is essential for tasks requiring fine motor control [2][6].

Group 2: Key Innovations and Research Goals
- The core finding is that VLA models contain prior knowledge of physical interactions, which can be activated by connecting this knowledge with tactile sensors, enabling zero-shot generalization in contact-intensive tasks [6][7].
- The Tactile-VLA framework introduces tactile perception as a primary modality, allowing for direct mapping from abstract semantics to physical force control [7].
- The mixed position-force controller innovatively converts force targets into position-adjustment commands, addressing the challenge of coordinating position and force control [7].

Group 3: Architecture and Mechanisms
- Tactile-VLA's architecture includes four key modules: instruction adherence to tactile cues, application of tactile-related common sense, adaptive reasoning through tactile feedback, and a multimodal encoder for unified token representation [12][13].
- The mixed position-force control mechanism ensures precision in position while allowing fine-tuned force adjustments during contact tasks [13].
- The Tactile-VLA-CoT variant incorporates a chain-of-thought (CoT) reasoning mechanism, enabling robots to analyze failure causes based on tactile feedback and autonomously adjust strategies (a retry-loop sketch follows below) [13][14].

Group 4: Experimental Validation and Results
- Three experimental setups were designed to validate Tactile-VLA's capabilities in instruction adherence, common-sense application, and adaptive reasoning [17].
- In the instruction-adherence experiment, Tactile-VLA achieved a success rate of 35% on the USB task and 90% on the charger task, significantly outperforming baseline models [21][22].
- The common-sense experiment demonstrated Tactile-VLA's ability to adjust interaction forces based on object properties, achieving success rates of 90%-100% for known objects and 80%-100% for unknown objects [27].
- The adaptive-reasoning experiment showed that Tactile-VLA-CoT could complete the blackboard task with an 80% success rate, demonstrating its problem-solving capability through reasoning [33].
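The adaptive-reasoning behavior described above (diagnose the failure from tactile feedback, adjust the applied force, retry) can be captured by a small outer loop around the policy. The sketch below assumes two hypothetical helpers, `execute_attempt` and `diagnose_failure`, and a simple force-increment rule; it illustrates the retry pattern only and is not Tactile-VLA-CoT's actual reasoning procedure.

```python
from typing import Callable

def tactile_cot_retry(
    execute_attempt: Callable[[float], bool],   # runs the task at a given contact force, returns success
    diagnose_failure: Callable[[], str],        # model reasons over tactile history, e.g. returns "force_too_low"
    initial_force: float = 2.0,                 # starting contact force in newtons (illustrative)
    force_step: float = 1.0,                    # adjustment applied per reasoning round (illustrative)
    max_retries: int = 3,
) -> bool:
    """Try the task; on failure, let the model reason about tactile feedback
    and adjust the applied force before retrying."""
    force = initial_force
    for _ in range(1 + max_retries):
        if execute_attempt(force):
            return True
        diagnosis = diagnose_failure()
        if diagnosis == "force_too_low":
            force += force_step     # e.g. wiping left marks on the blackboard -> press harder
        elif diagnosis == "force_too_high":
            force -= force_step
        else:
            break                   # failure not attributable to force; stop retrying here
    return False
```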