Vision-Language-Action (VLA) Models
The Technical Roadmap of Embodied Intelligence, Drawn from Nearly 1,000 Papers!
自动驾驶之心· 2025-09-07 23:34
Whenever embodied intelligence comes up, plenty of papers claim "breakthroughs" and "innovations," yet few tie the whole technical roadmap together so that readers can clearly see how the field has developed, which problems it has run into, and where it is heading. How does robotic manipulation get a robot arm to precisely "imitate" humans? How does multimodal fusion make an agent feel truly "present" in its environment? How does reinforcement learning drive a system to evolve on its own? And how do teleoperation and data collection break the constraints of physical distance? These key threads of embodied intelligence deserve a careful review. This article collects several of the field's most comprehensive survey papers and unpacks how each direction has developed.

Robotic manipulation
Reference paper: The Developments and Challenges towards Dexterous and Embodied Robotic Manipulation: A Survey
Paper link: https://arxiv.org/abs/2507.11840
Affiliation: 浙 ...
The First VLA Model Built for 3D Action Games Beats Human Players in Black Myth: Wukong and Sekiro | ICCV 2025
量子位· 2025-08-19 05:25
Core Insights
- CombatVLA, a 3B multimodal model, surpasses GPT-4o and human players in combat tasks within action role-playing games, demonstrating significant advancements in real-time decision-making and tactical reasoning [1][4][52].

Group 1: CombatVLA Overview
- CombatVLA integrates visual, semantic, and action control to enhance embodied intelligence, addressing challenges in 3D combat scenarios such as visual perception, combat reasoning, and efficient inference [6][8].
- The model achieves a 50-fold acceleration in combat execution speed compared to existing models, with a higher success rate than human players [4][11][52].

Group 2: Action Tracking and Benchmarking
- An action tracker was developed to collect human action sequences in games, providing extensive training data for the combat understanding model [15][17].
- The CUBench benchmark was established to evaluate the model's combat intelligence based on three core capabilities: information acquisition, understanding, and reasoning [20][21].

Group 3: CombatVLA Model and Training
- The Action-of-Thought (AoT) dataset was created to facilitate the model's understanding of combat actions, structured in a way that enhances reasoning speed [24][25].
- CombatVLA employs a three-stage progressive training paradigm, gradually refining the model's combat strategies from video-level to frame-level optimization [27][33].

Group 4: Experimental Results
- In combat understanding evaluations, CombatVLA achieved a top average score of 63.61 on CUBench, significantly outperforming other models [46].
- The model demonstrated robust generalization capabilities, performing comparably to baseline models in general benchmarks while excelling in task-level evaluations [47][48].
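For readers curious what an Action-of-Thought sample might look like in practice, here is a small illustrative sketch: a game-frame/state observation paired with a short tactical rationale and low-level key/mouse actions, plus a helper that strips the rationale at inference time. The field names and the `truncate_aot` helper are hypothetical placeholders, not the actual format from the CombatVLA paper.

```python
# Hypothetical Action-of-Thought (AoT) training sample; field names are
# illustrative only and do not come from the CombatVLA paper.
aot_sample = {
    "frames": ["frame_0412.png", "frame_0413.png", "frame_0414.png"],  # recent game frames
    "state": {"player_hp": 0.62, "boss_hp": 0.48, "stamina": 0.9},
    "thought": "Boss is winding up a sweep attack; dodging left avoids the hitbox "
               "and opens a window for one light attack.",
    "action": [{"key": "A", "hold_ms": 250}, {"key": "mouse_left", "hold_ms": 50}],
}

def truncate_aot(sample: dict, keep_thought: bool = False) -> dict:
    """Drop the free-form rationale at inference time so the policy only decodes
    the compact action tokens (a plausible reading of how AoT-style data keeps
    reasoning cheap; the paper's exact format may differ)."""
    return {k: v for k, v in sample.items() if k != "thought" or keep_thought}

print(truncate_aot(aot_sample))
```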
A Look at DreamVLA: Letting Robots See, Think, and Then Act
具身智能之心· 2025-08-11 00:14
Core Viewpoint
- The article introduces DreamVLA, a new Vision-Language-Action model that enhances robotic decision-making by integrating comprehensive world knowledge, allowing robots to predict dynamic environments and make more accurate action decisions [1][27].

Group 1: Background and Need for Advanced VLA Models
- Traditional VLA models directly map visual inputs and language commands to actions, which can lead to interference from irrelevant information in complex environments [3][5].
- DreamVLA addresses this by adding a layer of "thinking" that predicts world knowledge, including dynamic areas, depth information, and semantic features, before planning actions [5][27].

Group 2: Model Architecture and Functionality
- DreamVLA operates on a "perception-prediction-action" cycle, treating the task as an inverse dynamics problem to derive necessary actions from predicted future states [7][27].
- The model processes three types of inputs: visual images, language commands, and the robot's own state, using dedicated encoders for each [10][14].

Group 3: World Knowledge Prediction
- DreamVLA predicts world knowledge, which includes dynamic areas, depth maps, and semantic features, rather than directly predicting actions [11][18].
- Dynamic area prediction utilizes CoTracker to identify moving objects and generate masks that highlight relevant areas while filtering out static backgrounds [12][15].
- Depth prediction estimates the spatial relationships of objects, generating depth maps to assist in obstacle avoidance [13][17].
- Semantic prediction employs DINOv2 and SAM models to extract high-level semantic information, which is then encoded into a unified "world embedding" for action generation [18][22].

Group 4: Action Generation
- The action generation component uses a diffusion Transformer to produce future action sequences based on the latent action embedding derived from multi-modal inputs [23][27].
- A structured attention mechanism is implemented to ensure coherent multi-step action reasoning and prevent cross-modal knowledge leakage [19][31].

Group 5: Performance and Validation
- DreamVLA achieved an average task completion length of 4.44 on the CALVIN ABC-D benchmark, outperforming previous methods by 3.5%, with a real-world task success rate of 76.7% [25][27].
- Ablation studies confirmed the contributions of various components, demonstrating the model's robustness and generalization capabilities [25][31].
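To make the "perceive → predict world knowledge → act" cycle concrete, below is a minimal PyTorch-style sketch under simplifying assumptions: plain linear encoders stand in for pretrained vision/language backbones, the dynamic-region, depth, and semantic heads are simple linear layers (DreamVLA derives their supervision from CoTracker, depth estimation, and DINOv2/SAM features), and an MLP replaces the diffusion Transformer action head. It illustrates the data flow only, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class WorldKnowledgeVLA(nn.Module):
    """Minimal sketch of a DreamVLA-style 'perceive -> predict -> act' cycle:
    encode the observation, predict world knowledge (dynamic regions, depth,
    semantics), fuse it into one world embedding, then decode an action chunk."""

    def __init__(self, d_model=256, action_dim=7, horizon=8):
        super().__init__()
        # Stand-in encoders; the real model uses pretrained vision/language backbones.
        self.image_enc = nn.Linear(3 * 64 * 64, d_model)
        self.text_enc = nn.Embedding(1000, d_model)
        self.state_enc = nn.Linear(9, d_model)
        # World-knowledge branches (supervision in the paper comes from CoTracker
        # masks, depth maps, and DINOv2/SAM features; here they are linear heads).
        self.dyn_feat = nn.Linear(d_model, d_model)     # dynamic-region feature
        self.depth_feat = nn.Linear(d_model, d_model)   # depth feature
        self.sem_feat = nn.Linear(d_model, d_model)     # semantic feature
        self.dyn_mask = nn.Linear(d_model, 64 * 64)     # dynamic-region mask logits
        self.depth_map = nn.Linear(d_model, 64 * 64)    # coarse depth prediction
        self.fuse = nn.Linear(3 * d_model, d_model)     # unified "world embedding"
        # Action head (the paper uses a diffusion Transformer; an MLP stands in).
        self.action_head = nn.Linear(2 * d_model, horizon * action_dim)
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, image, text_ids, state):
        obs = (self.image_enc(image.flatten(1))
               + self.text_enc(text_ids).mean(dim=1)
               + self.state_enc(state))
        dyn, depth, sem = self.dyn_feat(obs), self.depth_feat(obs), self.sem_feat(obs)
        world = self.fuse(torch.cat([dyn, depth, sem], dim=-1))
        actions = self.action_head(torch.cat([obs, world], dim=-1))
        return (actions.view(-1, self.horizon, self.action_dim),
                self.dyn_mask(dyn), self.depth_map(depth))

model = WorldKnowledgeVLA()
acts, dyn_mask, depth = model(torch.rand(2, 3, 64, 64),
                              torch.randint(0, 1000, (2, 12)),
                              torch.rand(2, 9))
print(acts.shape)  # torch.Size([2, 8, 7])
```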
TACTILE-VLA: Activating the Physical Knowledge of VLA Models for Tactile Generalization (Latest from Tsinghua University)
自动驾驶之心· 2025-07-16 04:05
Core Insights
- The article discusses the development of Tactile-VLA, a model that integrates visual, language, action, and tactile perception to enhance robotic capabilities in contact-intensive tasks [2][6][20].

Group 1: Background and Core Issues
- Visual-language-action (VLA) models are crucial for general-purpose robotic agents, but they struggle in contact-intensive scenarios due to a lack of tactile perception [2].
- Tactile perception provides essential feedback for physical interactions, which is often missing in existing models [2].

Group 2: Key Innovations and Research Goals
- The core finding is that VLA models contain prior knowledge of physical interactions, which can be activated through tactile sensors for zero-shot generalization in contact tasks [6].
- The Tactile-VLA framework introduces tactile perception as a primary modality, allowing direct mapping from abstract semantics to physical force control [6].
- The mixed position-force controller innovatively converts force targets into position adjustment commands, addressing the challenge of coordinating position and force control [6][10].
- The Tactile-VLA-CoT variant incorporates a chain-of-thought (CoT) reasoning mechanism, enabling robots to analyze failure causes and autonomously adjust strategies [6][14].

Group 3: Overall Architecture
- Tactile-VLA's architecture features four key modules, emphasizing token-level fusion through a non-causal attention mechanism for true semantic representation rooted in physical reality [9].

Group 4: Mixed Position-Force Control Mechanism
- The mixed control strategy prioritizes position control while introducing force feedback adjustments when necessary, ensuring precision in movement and force control [10][12].
- The design separates external net force from internal grasping force, allowing for refined force adjustments suitable for contact-intensive tasks [13].

Group 5: Chain of Thought Reasoning Mechanism
- Tactile-VLA-CoT enhances adaptive capabilities by transforming the adjustment process into an interpretable reasoning process, improving robustness in complex tasks [14][15].

Group 6: Data Collection Methods
- A specialized data collection system was developed to obtain high-quality tactile-language aligned data, addressing the issue of missing force feedback in traditional teleoperation [16][19].

Group 7: Experimental Validation and Results Analysis
- Three experimental groups were designed to validate Tactile-VLA's capabilities in instruction following, common sense application, and adaptive reasoning [20].
- In the instruction following experiment, Tactile-VLA demonstrated the ability to learn the semantic meaning of force-related language, achieving a success rate of 35% in USB tasks and 90% in charger tasks [23].
- The model effectively utilized common sense knowledge to adjust interaction forces based on object properties, achieving significant performance improvements over baseline models [24][30].
- In the adaptive reasoning experiment, Tactile-VLA-CoT achieved an 80% success rate in a blackboard task, showcasing its ability to diagnose and correct failures autonomously [28][32].
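The summary above says the mixed controller converts force targets into position adjustment commands. The sketch below shows a generic admittance-style step in that spirit: position commands are followed by default, and once contact is detected, a proportional correction toward the target contact force is added along the pressing axis. The gains, thresholds, sign convention, and the `read_force_sensor`/`send_position` hooks are illustrative assumptions, not Tactile-VLA's actual controller.

```python
import numpy as np

def hybrid_position_force_step(x_cmd, f_target, read_force_sensor, send_position,
                               k_f=0.002, contact_threshold=0.5, max_correction=0.01):
    """One control step of a simple force-to-position (admittance-style) scheme:
    follow the commanded Cartesian position, and once the contact force exceeds a
    threshold, nudge the position along the pressing axis to track the target
    force. All gains and thresholds are illustrative, not values from the paper."""
    f_meas = read_force_sensor()                      # measured normal force (N)
    x_out = np.asarray(x_cmd, dtype=float).copy()
    if abs(f_meas) > contact_threshold:               # only correct when in contact
        correction = k_f * (f_target - f_meas)        # force error -> displacement (m)
        correction = float(np.clip(correction, -max_correction, max_correction))
        x_out[2] -= correction                        # toy convention: press along -z when force is too low
    send_position(x_out)
    return x_out

# Toy usage with stubbed-out hardware interfaces.
measured = {"f": 2.0}
step = hybrid_position_force_step(
    x_cmd=[0.30, 0.10, 0.05], f_target=4.0,
    read_force_sensor=lambda: measured["f"],
    send_position=lambda x: None)
print(step)
```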
TACTILE-VLA: Activating the Physical Knowledge of VLA Models for Tactile Generalization (Latest from Tsinghua University)
具身智能之心· 2025-07-15 07:55
Core Insights
- The article discusses the development of Tactile-VLA, a model that integrates visual, language, action, and tactile perception to enhance robotic capabilities in contact-intensive tasks [2][6].

Group 1: Background and Core Issues
- Visual-language-action (VLA) models have strong semantic understanding and cross-modal generalization capabilities, but they struggle in contact-intensive scenarios due to a lack of tactile perception [2][6].
- Tactile perception provides critical feedback in physical interactions, such as friction and material properties, which are essential for tasks requiring fine motor control [2][6].

Group 2: Key Innovations and Research Goals
- The core finding is that VLA models contain prior knowledge of physical interactions, which can be activated by connecting this knowledge with tactile sensors, enabling zero-shot generalization in contact-intensive tasks [6][7].
- The Tactile-VLA framework introduces tactile perception as a primary modality, allowing for direct mapping from abstract semantics to physical force control [7].
- The mixed position-force controller innovatively converts force targets into position adjustment commands, addressing the challenge of coordinating position and force control [7].

Group 3: Architecture and Mechanisms
- Tactile-VLA's architecture includes four key modules: instruction adherence to tactile cues, application of tactile-related common sense, adaptive reasoning through tactile feedback, and a multi-modal encoder for unified token representation [12][13].
- The mixed position-force control mechanism ensures precision in position while allowing for fine-tuned force adjustments during contact tasks [13].
- The Tactile-VLA-CoT variant incorporates a chain-of-thought (CoT) reasoning mechanism, enabling robots to analyze failure causes based on tactile feedback and autonomously adjust strategies [13][14].

Group 4: Experimental Validation and Results
- Three experimental setups were designed to validate Tactile-VLA's capabilities in instruction adherence, common sense application, and adaptive reasoning [17].
- In the instruction adherence experiment, Tactile-VLA achieved a success rate of 35% in USB tasks and 90% in charger tasks, significantly outperforming baseline models [21][22].
- The common sense application experiment demonstrated Tactile-VLA's ability to adjust interaction forces based on object properties, achieving success rates of 90%-100% for known objects and 80%-100% for unknown objects [27].
- The adaptive reasoning experiment showed that Tactile-VLA-CoT could successfully complete a blackboard task with an 80% success rate, demonstrating its problem-solving capabilities through reasoning [33].
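To picture the adaptive-reasoning loop in plain code, here is a toy retry routine in the spirit of Tactile-VLA-CoT: execute an attempt, inspect the tactile outcome, ask a reasoning step to diagnose the failure, adjust the force target, and try again (e.g., pressing harder in the blackboard-wiping task). The `execute_attempt` and `diagnose` hooks and all numbers are placeholders, not the paper's pipeline.

```python
def run_with_tactile_cot(execute_attempt, diagnose, f_target=2.0,
                         max_retries=3, force_step=1.5):
    """Toy chain-of-thought retry loop: execute, read tactile outcome, reason
    about the failure, adjust the force target, and try again. All function
    hooks and numeric values are illustrative placeholders."""
    for attempt in range(1, max_retries + 1):
        success, tactile_log = execute_attempt(f_target)
        if success:
            return {"success": True, "attempts": attempt, "force": f_target}
        reason = diagnose(tactile_log)                 # e.g. an LLM/VLA reasoning call
        print(f"attempt {attempt} failed: {reason}")
        if "insufficient force" in reason:
            f_target += force_step                     # press harder next time
        elif "excessive force" in reason:
            f_target -= force_step                     # press more gently next time
    return {"success": False, "attempts": max_retries, "force": f_target}

# Stubbed demo: the wiping task only succeeds once enough normal force is applied.
result = run_with_tactile_cot(
    execute_attempt=lambda f: (f >= 4.5, {"mean_normal_force": f}),
    diagnose=lambda log: "insufficient force: marks remain on the board"
                         if log["mean_normal_force"] < 4.5 else "ok")
print(result)
```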
CEED-VLA: 4× Inference Acceleration for VLA Models via Consistency Distillation and Early-Exit Decoding!
具身智能之心· 2025-07-10 13:16
Core Viewpoint
- The article discusses the development of a new model called CEED-VLA, which significantly enhances the inference speed of visual-language-action models while maintaining operational performance, making it suitable for high-frequency dexterous tasks [2][30].

Group 1: Model Development
- The CEED-VLA model is designed to accelerate inference through a general method that improves performance across multiple tasks [2].
- The model incorporates a consistency distillation mechanism and mixed-label supervision to enable accurate predictions of high-quality actions from various intermediate states [2][6].
- The Early-exit Decoding strategy is introduced to address inefficiencies in the Jacobi decoding process, achieving up to 4.1× inference speedup and more than 4.3× higher execution frequency [2][15].

Group 2: Experimental Results
- Simulations and real-world experiments demonstrate that CEED-VLA significantly improves inference efficiency while maintaining similar task success rates [6][30].
- The model shows a speedup of 2.00× compared to the teacher model and achieves a higher number of fixed tokens, indicating improved performance [19][20].
- In real-world evaluations, CEED-VLA successfully completes dexterous tasks, achieving a success rate exceeding 70% due to enhanced inference speed and control frequency [30][31].
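For readers unfamiliar with Jacobi decoding, the sketch below shows the generic parallel fixed-point iteration together with an early-exit rule: refine every draft token in parallel each sweep and stop either at convergence or after a fixed iteration budget, accepting the current draft. The dummy model and the iteration-budget exit criterion are assumptions for illustration; CEED-VLA's distilled model and exact exit condition are described in the paper.

```python
import numpy as np

def jacobi_decode_early_exit(parallel_greedy, prefix, n_tokens,
                             max_iters=16, exit_after=4, seed=0):
    """Generic Jacobi (parallel fixed-point) decoding with early exit.
    parallel_greedy(prefix, draft) must return, for every position i, the greedy
    next token given prefix + draft[:i]; Jacobi iteration reuses the stale draft
    for that context, so each sweep refines all positions at once. Early exit
    accepts the draft after `exit_after` sweeps even if it has not fully
    converged (CEED-VLA's exact criterion may differ)."""
    rng = np.random.default_rng(seed)
    draft = rng.integers(0, 10, size=n_tokens)        # arbitrary initialization
    for it in range(1, max_iters + 1):
        new_draft = parallel_greedy(prefix, draft)
        converged = np.array_equal(new_draft, draft)  # fixed point reached?
        draft = new_draft
        if converged or it >= exit_after:             # converged, or exit early
            return draft, it, converged
    return draft, max_iters, False

# Toy "model": token i is (sum of prefix + first i draft tokens) mod 10.
def toy_parallel_greedy(prefix, draft):
    base = sum(prefix)
    return np.array([(base + sum(draft[:i])) % 10 for i in range(len(draft))])

tokens, iters, converged = jacobi_decode_early_exit(toy_parallel_greedy, [3, 1], 6)
print(tokens, iters, converged)
```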
VQ-VLA: A Large-Scale Synthetic-Data-Driven Action Tokenizer with Nearly 3× Faster Inference
具身智能之心· 2025-07-02 10:18
Core Insights
- The article discusses the challenges faced by Visual-Language-Action (VLA) models in multimodal robotic control, specifically focusing on action representation efficiency and data dependency bottlenecks [3][4].

Group 1: Challenges in VLA Models
- Action representation efficiency is low due to traditional continuous action discretization methods, which struggle to capture complex spatiotemporal dynamics, leading to increased cumulative errors in long-duration tasks [4].
- The high cost of real robot data collection limits the generalization ability of models, creating a data dependency bottleneck [4].

Group 2: Proposed Solutions
- A universal action tokenizer framework based on Convolutional Residual VQ-VAE is proposed to replace traditional discretization methods [4].
- The article demonstrates that the difference between synthetic and real domain action trajectories is minimal, allowing a significantly larger scale of synthetic data (100 times previous work) to train the tokenizer [4].
- The VLA model's performance is optimized across three core metrics, with the success rate for long-duration tasks increasing by up to 30% in real robot experiments [4].

Group 3: Key Technical Solutions
- The Convolutional Residual VQ-VAE architecture employs 2D temporal convolution layers instead of traditional MLPs, resulting in a 6.6% improvement in success rates for the LIBERO-10 task [7].
- The action execution frequency improved from 4.16 Hz to 11.84 Hz, enhancing inference speed [9][18].
- A multi-step action prediction approach reduces cumulative errors, contributing to long-duration robustness [9].

Group 4: Experimental Findings
- In simulated environments, the VQ model achieved a success rate of 80.98% in LIBERO-90, surpassing the baseline by 7.45% [17].
- For short-duration tasks, the VQ model's success rate was 60.0% in the "flip the pot" task compared to a baseline of 30.0% [17].
- In long-duration tasks, the VQ model achieved a success rate of 30.0% for "putting toys in a drawer" versus 5.0% for the baseline, and 50.0% for "putting all cups in a basket" compared to 15.0% for the baseline [17].

Group 5: Future Directions
- The article suggests expanding the dataset by integrating larger-scale synthetic datasets, such as RLBench [19].
- There is a focus on model lightweighting through distillation and quantization techniques to further accelerate inference [19].
- Exploration of enhanced designs, such as action frequency conditional encoding, is recommended for architectural improvements [19].
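As a concrete picture of a convolutional residual-VQ action tokenizer, the sketch below encodes an action chunk with a small temporal convolution, quantizes it with two residual codebooks, and decodes it back. The 1D convolution (the paper reports 2D temporal convolutions), the codebook sizes, and the inference-only quantization (no straight-through training objective) are simplifications for illustration, not the VQ-VLA implementation.

```python
import torch
import torch.nn as nn

class ResidualVQActionTokenizer(nn.Module):
    """Sketch of a residual-VQ tokenizer for action chunks of shape
    (batch, horizon, action_dim): conv encoder -> two-stage residual
    quantization -> conv decoder. Inference-only; sizes are illustrative."""

    def __init__(self, action_dim=7, d_latent=64, codebook_size=256, n_codebooks=2):
        super().__init__()
        self.encoder = nn.Conv1d(action_dim, d_latent, kernel_size=3, padding=1)
        self.decoder = nn.Conv1d(d_latent, action_dim, kernel_size=3, padding=1)
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, d_latent)) for _ in range(n_codebooks)])

    def quantize(self, z):                       # z: (batch, horizon, d_latent)
        residual, quantized, codes = z, torch.zeros_like(z), []
        for book in self.codebooks:
            # squared L2 distance from each residual vector to every code
            dist = (residual.unsqueeze(2) - book.view(1, 1, *book.shape)).pow(2).sum(-1)
            idx = dist.argmin(dim=-1)            # nearest code index per timestep
            picked = book[idx]                   # (batch, horizon, d_latent)
            quantized = quantized + picked
            residual = residual - picked         # next codebook quantizes what is left
            codes.append(idx)
        return quantized, torch.stack(codes, dim=-1)

    def forward(self, actions):                  # actions: (batch, horizon, action_dim)
        z = self.encoder(actions.transpose(1, 2)).transpose(1, 2)
        zq, codes = self.quantize(z)
        recon = self.decoder(zq.transpose(1, 2)).transpose(1, 2)
        return recon, codes

tok = ResidualVQActionTokenizer()
recon, codes = tok(torch.randn(4, 16, 7))
print(recon.shape, codes.shape)  # torch.Size([4, 16, 7]) torch.Size([4, 16, 2])
```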