Vision-Language-Action Models (VLA)
Lightweight VLA model Evo-1: SOTA with only 0.77B parameters, tackling low-cost training and real-time deployment
具身智能之心· 2025-11-12 04:00
Vision-Language-Action (VLA) models unify perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically carry a massive number of parameters and rely heavily on large-scale robot-data pretraining, which makes training computationally expensive and limits real-time deployment. In addition, most training paradigms degrade the perceptual representations of the vision-language backbone, causing overfitting and weakening generalization to downstream tasks.

Paper: Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
Link: https://arxiv.org/abs/2511.04555

A team from Shanghai Jiao Tong University, CMU, and the University of Cambridge proposes Evo-1, a lightweight VLA model that lowers computational cost and improves deployment efficiency without any robot-data pretraining, while maintaining strong performance. Evo-1 builds on a native multimodal vision-language model (VLM) and combines a novel cross-modulated diffusion transformer with an optimized integration module to form an efficient architecture. A two-stage training paradigm is further introduced that progressively aligns action with perception while fully preserving the VLM's representational capacity. ...
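The two-stage recipe described above (align the action module first, then let the backbone adapt) can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration: the module names (`backbone`, `action_head`), layer sizes, and learning rates are invented and do not reflect Evo-1's actual implementation.

```python
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Toy stand-in for a VLA model: a freezable VLM backbone plus an action head."""
    def __init__(self, feat_dim=64, action_dim=7):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(128, feat_dim), nn.GELU())    # pretend VLM
        self.action_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.GELU(),
                                         nn.Linear(64, action_dim))           # action module

    def forward(self, obs):
        return self.action_head(self.backbone(obs))

def make_optimizer(model, stage):
    """Stage 1: train only the action head; stage 2: unfreeze the backbone at a small LR."""
    for p in model.backbone.parameters():
        p.requires_grad = (stage == 2)
    groups = [{"params": model.action_head.parameters(), "lr": 1e-4}]
    if stage == 2:
        groups.append({"params": model.backbone.parameters(), "lr": 1e-5})
    return torch.optim.AdamW(groups)

model = TinyVLA()
for stage in (1, 2):
    opt = make_optimizer(model, stage)
    for _ in range(10):                      # dummy training steps
        obs = torch.randn(8, 128)
        target = torch.randn(8, 7)
        loss = nn.functional.mse_loss(model(obs), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The point of the schedule is simply that the backbone's gradients are switched off in stage one, so its pretrained representations are untouched while the action head adapts.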
Ask-to-Clarify: resolving instruction ambiguity and generating actions end-to-end for real-world embodied tasks
具身智能之心· 2025-10-22 03:04
Authors: Xingyao Lin et al.

Background and motivation: the ultimate goal of embodied agents is to become collaborators that actively interact with humans, rather than executors that merely follow instructions passively. This requires agents to adjust their behavior according to human feedback. In recent years, the development of Vision-Language-Action (VLA) models has offered a promising path toward this goal. However, most current VLA-based embodied agents operate in a simple one-way mode: they receive an instruction and execute it directly, without any communication with the user. In real-world scenarios, where instructions are often ambiguous, this passive approach tends to fail. To address this problem, the paper proposes the Ask-to-Clarify framework, which first asks questions through multi-turn dialogue to resolve instruction ambiguity and then generates actions end-to-end for real-world embodied tasks.

Contributions. Task and framework design: a new collaborative task for embodied agents, together with a corresponding framework, is proposed. The task requires the agent to proactively resolve instruction ambiguity by asking questions before executing, and then to complete the ...
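To make the clarify-then-act control flow concrete, here is a toy, self-contained Python sketch. The ambiguity check, question generator, and action generator are stubbed placeholders (the real Ask-to-Clarify framework performs these steps inside a single end-to-end model); only the loop structure mirrors the described behavior.

```python
from dataclasses import dataclass, field

@dataclass
class Dialogue:
    instruction: str
    turns: list = field(default_factory=list)   # (question, answer) pairs

def is_ambiguous(dialogue):
    """Stub ambiguity check: ambiguous until the user has named a specific object."""
    text = dialogue.instruction + " " + " ".join(a for _, a in dialogue.turns)
    return not any(obj in text for obj in ("red cup", "blue cup"))

def ask_question(dialogue):
    return "There are several cups. Which one should I pick up?"

def generate_action(dialogue):
    return {"skill": "pick", "target": "red cup"}   # placeholder action chunk

def clarify_then_act(instruction, answer_fn, max_turns=3):
    dialogue = Dialogue(instruction)
    for _ in range(max_turns):
        if not is_ambiguous(dialogue):
            break
        question = ask_question(dialogue)
        dialogue.turns.append((question, answer_fn(question)))   # multi-turn clarification
    return generate_action(dialogue)

# Simulated user who resolves the ambiguity when asked.
action = clarify_then_act("Pick up the cup", answer_fn=lambda q: "The red cup, please.")
print(action)
```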
MTRDrive: an autonomous-driving VLA framework with dynamic interactive reasoning (Tsinghua & Xiaomi)
自动驾驶之心· 2025-09-28 23:33
Vision-Language-Action (VLA) models are seen as a key path toward improving the reasoning ability of autonomous driving in long-tail scenarios, but existing methods still face significant challenges in long-horizon and high-level behavioral decision-making. In particular, under complex few-shot or even zero-shot conditions, limited generalization makes it difficult to maintain consistently robust performance in dynamic, uncertain road environments. The main pain points can be summarized as follows: robust driving decisions depend on the deep coordination of two core factors, perception accuracy and reasoning reliability.

Through long-term interaction with the environment, human drivers rely not only on real-time perception but also on accumulated experience to anticipate dynamically and adapt, echoing the old wisdom of the Analects: "a craftsman who wants to do his work well must first sharpen his tools." Here the "tools" refer not only to the vehicle itself but also to the cognitive toolkit that drivers distill from experience, including recognition patterns for complex road conditions, risk-estimation strategies, and emergency decision procedures.

Human driving is in essence a dynamic closed loop of perception, judgment, decision, and action. By continuously fusing real-time environmental information with past experience, drivers keep refining their response strategies and navigate safely in uncertain traffic scenarios. For example, a skilled driver can slow down or change lanes in advance based on the behavior of the vehicle ahead, road-surface conditions, and even the weather ...
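The excerpt describes driving as a perception-judgment-decision-action loop informed by accumulated experience. The toy sketch below illustrates that loop with a tiny experience memory retrieved by cosine similarity; it is a generic illustration under invented assumptions (random embeddings, a hand-picked emergency rule), not MTRDrive's actual retrieval or reasoning mechanism.

```python
import numpy as np

# Toy experience memory: past scene embeddings paired with the maneuver that worked.
memory_keys = np.random.randn(5, 8)
memory_maneuvers = ["slow_down", "keep_lane", "change_lane_left",
                    "slow_down", "change_lane_right"]

def perceive(scene):
    """Stand-in perception: embed the scene (here, just a feature vector)."""
    return np.asarray(scene, dtype=float)

def retrieve_experience(embedding, k=1):
    """Judgment step: recall the most similar past situations by cosine similarity."""
    sims = memory_keys @ embedding / (
        np.linalg.norm(memory_keys, axis=1) * np.linalg.norm(embedding) + 1e-8)
    return [memory_maneuvers[i] for i in np.argsort(sims)[::-1][:k]]

def decide(embedding, recalled):
    """Decision step: trust the recalled maneuver unless perception signals an emergency."""
    if embedding[0] > 2.0:           # pretend feature 0 encodes closing speed ahead
        return "emergency_brake"
    return recalled[0]

# Perception -> judgment -> decision -> action: one tick of the closed loop.
scene_embedding = perceive(np.random.randn(8))
maneuver = decide(scene_embedding, retrieve_experience(scene_embedding))
print("selected maneuver:", maneuver)
```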
Tracing the technical development of embodied intelligence through nearly 1,000 papers!
自动驾驶之心· 2025-09-07 23:34
Whenever embodied intelligence comes up, many papers keep claiming "breakthroughs" and "innovations," yet few connect the whole technical roadmap end to end so that readers clearly see how the field has developed, what problems it has hit, and where it is heading. How does robotic manipulation let a robot arm precisely "imitate" humans? How does multimodal fusion put an agent "in the scene"? How does reinforcement learning drive autonomous system evolution? And how do teleoperation and data collection break spatial constraints? These key topics in embodied intelligence deserve a careful review. This article brings together several of the richer survey papers in the field and unpacks the development logic of each direction.

Robotic manipulation. Reference paper: The Developments and Challenges towards Dexterous and Embodied Robotic Manipulation: A Survey. Link: https://arxiv.org/abs/2507.11840. Affiliation: 浙 ...
Tracing the technical development of embodied intelligence through nearly 1,000 papers!
具身智能之心· 2025-09-05 00:45
Core Insights
- The article discusses the evolution and challenges of embodied intelligence, emphasizing the need for a comprehensive understanding of its development, issues faced, and future directions [3][4].

Group 1: Robotic Manipulation
- The survey on robotic manipulation highlights the transition from mechanical programming to embodied intelligence, focusing on the evolution from simple grippers to dexterous multi-fingered hands [5][6].
- Key challenges in dexterous manipulation include data collection methods such as simulation, human demonstration, and teleoperation, as well as skill learning frameworks like imitation learning and reinforcement learning [5][6].

Group 2: Navigation and Manipulation
- The discussion on robotic navigation emphasizes the importance of physics simulators in addressing high costs and data scarcity in real-world training, with a focus on the Sim-to-Real transfer challenges [9][15].
- The evolution of navigation techniques is outlined, transitioning from explicit memory to implicit memory, and the role of various simulators in narrowing the Sim-to-Real gap is analyzed [15][16].

Group 3: Multimodal Large Models
- The exploration of embodied multimodal large models (EMLMs) reveals their potential to bridge perception, cognition, and action gaps, driven by advancements in large model technologies [17][19].
- Challenges identified include cross-modal alignment difficulties, high computational resource demands, and weak domain generalization [19].

Group 4: Teleoperation and Data Collection
- The survey on teleoperation of humanoid robots discusses the integration of human cognition with robotic capabilities, particularly in hazardous environments, while addressing challenges such as high degrees of freedom and communication limitations [29][30].
- Key components of teleoperation systems include human state measurement, motion retargeting, and multimodal feedback mechanisms [30][33].

Group 5: Vision-Language-Action Models
- The analysis of Vision-Language-Action (VLA) models covers their evolution from cross-modal learning architectures to the integration of visual language models and action planners [33][36].
- The article identifies core challenges in real-time control, multimodal action representation, and system scalability, while proposing future directions for adaptive AI and cross-entity generalization [36][41].
The first VLA model built for 3D action games beats human players at Black Myth: Wukong and Sekiro | ICCV 2025
量子位· 2025-08-19 05:25
Core Insights
- CombatVLA, a 3B multimodal model, surpasses GPT-4o and human players in combat tasks within action role-playing games, demonstrating significant advancements in real-time decision-making and tactical reasoning [1][4][52].

Group 1: CombatVLA Overview
- CombatVLA integrates visual, semantic, and action control to enhance embodied intelligence, addressing challenges in 3D combat scenarios such as visual perception, combat reasoning, and efficient inference [6][8].
- The model achieves a 50-fold acceleration in combat execution speed compared to existing models, with a higher success rate than human players [4][11][52].

Group 2: Action Tracking and Benchmarking
- An action tracker was developed to collect human action sequences in games, providing extensive training data for the combat understanding model [15][17].
- The CUBench benchmark was established to evaluate the model's combat intelligence based on three core capabilities: information acquisition, understanding, and reasoning [20][21].

Group 3: CombatVLA Model and Training
- The Action-of-Thought (AoT) dataset was created to facilitate the model's understanding of combat actions, structured in a way that enhances reasoning speed [24][25].
- CombatVLA employs a three-stage progressive training paradigm, gradually refining the model's combat strategies from video-level to frame-level optimization [27][33].

Group 4: Experimental Results
- In combat understanding evaluations, CombatVLA achieved a top average score of 63.61 on CUBench, outperforming other models significantly [46].
- The model demonstrated robust generalization capabilities, performing comparably to baseline models in general benchmarks while excelling in task-level evaluations [47][48].
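One simple way to build the kind of action tracker mentioned above is to timestamp input events and bucket them by the most recent captured frame. The sketch below shows only that alignment step; the frame grabber and keyboard/mouse hooks are replaced with fake timestamps, and none of this reflects the paper's actual tooling.

```python
import bisect
import time
from collections import defaultdict

def align_events_to_frames(frame_times, events):
    """Assign each (timestamp, action) event to the most recent captured frame."""
    per_frame = defaultdict(list)
    for t, action in events:
        i = bisect.bisect_right(frame_times, t) - 1   # last frame at or before the event
        if i >= 0:
            per_frame[i].append(action)
    return per_frame

# Pretend we grabbed frames at 10 Hz and logged a short combat combo.
start = time.time()
frame_times = [start + 0.1 * i for i in range(30)]
events = [
    (start + 0.32, "dodge"),
    (start + 0.95, "light_attack"),
    (start + 1.02, "light_attack"),
    (start + 2.48, "heal"),
]

for frame_idx, actions in sorted(align_events_to_frames(frame_times, events).items()):
    print(f"frame {frame_idx:03d}: {actions}")
```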
A look at DreamVLA: letting robots look first, think, then act
具身智能之心· 2025-08-11 00:14
Core Viewpoint
- The article introduces DreamVLA, a new Vision-Language-Action model that enhances robotic decision-making by integrating comprehensive world knowledge, allowing robots to predict dynamic environments and make more accurate action decisions [1][27].

Group 1: Background and Need for Advanced VLA Models
- Traditional VLA models directly map visual inputs and language commands to actions, which can lead to interference from irrelevant information in complex environments [3][5].
- DreamVLA addresses this by adding a layer of "thinking" that predicts world knowledge, including dynamic areas, depth information, and semantic features before planning actions [5][27].

Group 2: Model Architecture and Functionality
- DreamVLA operates on a "perception-prediction-action" cycle, treating the task as an inverse dynamics problem to derive necessary actions from predicted future states [7][27].
- The model processes three types of inputs: visual images, language commands, and the robot's own state, using dedicated encoders for each [10][14].

Group 3: World Knowledge Prediction
- DreamVLA predicts world knowledge, which includes dynamic areas, depth maps, and semantic features, rather than directly predicting actions [11][18].
- Dynamic area prediction utilizes CoTracker to identify moving objects and generate masks that highlight relevant areas while filtering out static backgrounds [12][15].
- Depth prediction estimates the spatial relationships of objects, generating depth maps to assist in obstacle avoidance [13][17].
- Semantic prediction employs DINOv2 and SAM models to extract high-level semantic information, which is then encoded into a unified "world embedding" for action generation [18][22].

Group 4: Action Generation
- The action generation component uses a diffusion Transformer to produce future action sequences based on the latent action embedding derived from multi-modal inputs [23][27].
- A structured attention mechanism is implemented to ensure coherent multi-step action reasoning and prevent cross-modal knowledge leakage [19][31].

Group 5: Performance and Validation
- DreamVLA achieved an average task completion length of 4.44 in the CALVIN ABC-D benchmark, outperforming previous methods by 3.5%, with a real-world task success rate of 76.7% [25][27].
- Ablation studies confirmed the contributions of various components, demonstrating the model's robustness and generalization capabilities [25][31].
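The "perceive, predict world knowledge, then act" pipeline summarized above can be sketched as a toy PyTorch module: dedicated encoders for image, language, and robot state, a head that predicts a fused world embedding, and an action head that reads an action chunk off that embedding. All layer names and sizes are invented, and a plain MLP stands in for DreamVLA's diffusion Transformer action head.

```python
import torch
import torch.nn as nn

class DreamVLASketch(nn.Module):
    """Toy 'look, think, then act' pipeline: fuse the inputs, predict a world embedding,
    then decode an action chunk from it (a real model would use a diffusion head)."""
    def __init__(self, d=128, action_dim=7, horizon=8):
        super().__init__()
        self.vision_enc = nn.Linear(512, d)      # stand-in for an image encoder
        self.text_enc = nn.Linear(256, d)        # stand-in for a language encoder
        self.state_enc = nn.Linear(16, d)        # robot proprioception
        # 'World knowledge' head: dynamic-region / depth / semantic features fused together.
        self.world_head = nn.Linear(3 * d, 3 * d)
        self.action_head = nn.Sequential(        # placeholder for the diffusion Transformer
            nn.Linear(3 * d, d), nn.GELU(), nn.Linear(d, horizon * action_dim))
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, img_feat, text_feat, state):
        fused = torch.cat([self.vision_enc(img_feat),
                           self.text_enc(text_feat),
                           self.state_enc(state)], dim=-1)
        world = self.world_head(fused)            # predicted future world embedding
        actions = self.action_head(world)         # actions as an inverse-dynamics readout
        return world, actions.view(-1, self.horizon, self.action_dim)

model = DreamVLASketch()
world, actions = model(torch.randn(2, 512), torch.randn(2, 256), torch.randn(2, 16))
print(world.shape, actions.shape)   # torch.Size([2, 384]) torch.Size([2, 8, 7])
```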
TACTILE-VLA: activating the physical knowledge in VLA models for tactile generalization (latest from Tsinghua University)
自动驾驶之心· 2025-07-16 04:05
Core Insights
- The article discusses the development of Tactile-VLA, a model that integrates visual, language, action, and tactile perception to enhance robotic capabilities in contact-intensive tasks [2][6][20].

Group 1: Background and Core Issues
- Visual-language-action (VLA) models are crucial for general-purpose robotic agents, but they struggle in contact-intensive scenarios due to a lack of tactile perception [2].
- Tactile perception provides essential feedback for physical interactions, which is often missing in existing models [2].

Group 2: Key Innovations and Research Goals
- The core finding is that VLA models contain prior knowledge of physical interactions, which can be activated through tactile sensors for zero-shot generalization in contact tasks [6].
- The Tactile-VLA framework introduces tactile perception as a primary modality, allowing direct mapping from abstract semantics to physical force control [6].
- The mixed position-force controller innovatively converts force targets into position adjustment commands, addressing the challenge of coordinating position and force control [6][10].
- The Tactile-VLA-CoT variant incorporates a chain of thought (CoT) reasoning mechanism, enabling robots to analyze failure causes and autonomously adjust strategies [6][14].

Group 3: Overall Architecture
- Tactile-VLA's architecture features four key modules, emphasizing token-level fusion through a non-causal attention mechanism for true semantic representation rooted in physical reality [9].

Group 4: Mixed Position-Force Control Mechanism
- The mixed control strategy prioritizes position control while introducing force feedback adjustments when necessary, ensuring precision in movement and force control [10][12].
- The design separates external net force from internal grasping force, allowing for refined force adjustments suitable for contact-intensive tasks [13].

Group 5: Chain of Thought Reasoning Mechanism
- Tactile-VLA-CoT enhances adaptive capabilities by transforming the adjustment process into an interpretable reasoning process, improving robustness in complex tasks [14][15].

Group 6: Data Collection Methods
- A specialized data collection system was developed to obtain high-quality tactile-language aligned data, addressing the issue of missing force feedback in traditional remote operations [16][19].

Group 7: Experimental Validation and Results Analysis
- Three experimental groups were designed to validate Tactile-VLA's capabilities in instruction following, common sense application, and adaptive reasoning [20].
- In the instruction following experiment, Tactile-VLA demonstrated the ability to learn the semantic meaning of force-related language, achieving a success rate of 35% in USB tasks and 90% in charger tasks [23].
- The model effectively utilized common sense knowledge to adjust interaction forces based on object properties, achieving significant performance improvements over baseline models [24][30].
- In the adaptive reasoning experiment, Tactile-VLA-CoT achieved an 80% success rate in a blackboard task, showcasing its ability to diagnose and correct failures autonomously [28][32].
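Converting a force target into position adjustments, as the mixed position-force controller is described as doing, is commonly realized with an admittance-style update: in free space track the position goal, and once contact is detected nudge the setpoint in proportion to the force error. The 1-D sketch below illustrates that idea under invented gains and a toy spring-contact model; it is not the paper's controller.

```python
def hybrid_step(x_cmd, x_goal, f_meas, f_target, kp=0.5, kf=0.0005, step_limit=0.005):
    """One control tick along the contact axis: in free space, track the position goal;
    in contact, convert the force error into a small position adjustment."""
    if f_meas <= 0.0:                                   # no contact yet: position control
        dx = kp * (x_goal - x_cmd)
    else:                                               # in contact: regulate force
        dx = kf * (f_target - f_meas)
    return x_cmd + max(-step_limit, min(step_limit, dx))

# Toy 1-D contact model: a stiff surface at x = 0.10 m (force = stiffness * penetration).
surface, stiffness = 0.10, 2000.0
x_cmd, f_target = 0.08, 5.0                             # start in free space, want 5 N contact
for _ in range(300):
    f_meas = max(0.0, (x_cmd - surface) * stiffness)
    x_cmd = hybrid_step(x_cmd, x_goal=0.12, f_meas=f_meas, f_target=f_target)

print(f"final contact force = {max(0.0, (x_cmd - surface) * stiffness):.2f} N "
      f"(target {f_target} N)")
```

Running the loop drives the commanded position just past the surface until the simulated contact force settles at the 5 N target, which is exactly the "force target becomes a position command" behavior the summary describes.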
TACTILE-VLA: activating the physical knowledge in VLA models for tactile generalization (latest from Tsinghua University)
具身智能之心· 2025-07-15 07:55
Core Insights
- The article discusses the development of Tactile-VLA, a model that integrates visual, language, action, and tactile perception to enhance robotic capabilities in contact-intensive tasks [2][6].

Group 1: Background and Core Issues
- Visual-language-action (VLA) models have strong semantic understanding and cross-modal generalization capabilities, but they struggle in contact-intensive scenarios due to a lack of tactile perception [2][6].
- Tactile perception provides critical feedback in physical interactions, such as friction and material properties, which are essential for tasks requiring fine motor control [2][6].

Group 2: Key Innovations and Research Goals
- The core finding is that VLA models contain prior knowledge of physical interactions, which can be activated by connecting this knowledge with tactile sensors, enabling zero-shot generalization in contact-intensive tasks [6][7].
- The Tactile-VLA framework introduces tactile perception as a primary modality, allowing for direct mapping from abstract semantics to physical force control [7].
- The mixed position-force controller innovatively converts force targets into position adjustment commands, addressing the challenge of coordinating position and force control [7].

Group 3: Architecture and Mechanisms
- Tactile-VLA's architecture includes four key modules: instruction adherence to tactile cues, application of tactile-related common sense, adaptive reasoning through tactile feedback, and a multi-modal encoder for unified token representation [12][13].
- The mixed position-force control mechanism ensures precision in position while allowing for fine-tuned force adjustments during contact tasks [13].
- The Tactile-VLA-CoT variant incorporates a chain of thought (CoT) reasoning mechanism, enabling robots to analyze failure causes based on tactile feedback and autonomously adjust strategies [13][14].

Group 4: Experimental Validation and Results
- Three experimental setups were designed to validate Tactile-VLA's capabilities in instruction adherence, common sense application, and adaptive reasoning [17].
- In the instruction adherence experiment, Tactile-VLA achieved a success rate of 35% in USB tasks and 90% in charger tasks, significantly outperforming baseline models [21][22].
- The common sense application experiment demonstrated Tactile-VLA's ability to adjust interaction forces based on object properties, achieving success rates of 90%-100% for known objects and 80%-100% for unknown objects [27].
- The adaptive reasoning experiment showed that Tactile-VLA-CoT could successfully complete a blackboard task with an 80% success rate, demonstrating its problem-solving capabilities through reasoning [33].
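The failure-analysis loop attributed to Tactile-VLA-CoT (detect a failure from feedback, diagnose it, adjust the strategy, retry) can be caricatured as follows. The wiping skill, the diagnosis rule, and the 2 N force increment are all made-up placeholders; in the real system this reasoning happens inside the model rather than in hand-written Python.

```python
def attempt_wipe(force_n):
    """Stub of a wiping skill: succeeds only if enough normal force is applied."""
    required = 6.0                                 # unknown to the 'robot'
    residue_detected = force_n < required          # pretend post-hoc tactile/visual check
    return not residue_detected

def reason_about_failure(force_n):
    """Minimal chain-of-thought stand-in: map the observed failure to a corrective plan."""
    return {"diagnosis": "insufficient contact force while wiping",
            "adjustment": {"force_n": force_n + 2.0}}

force = 2.0
for attempt in range(1, 6):
    if attempt_wipe(force):
        print(f"attempt {attempt}: success with {force:.1f} N")
        break
    plan = reason_about_failure(force)
    print(f"attempt {attempt}: failed -> {plan['diagnosis']}, retrying")
    force = plan["adjustment"]["force_n"]
```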
CEED-VLA: 4× inference acceleration for VLA models with revolutionary consistency distillation and early-exit decoding!
具身智能之心· 2025-07-10 13:16
Core Viewpoint
- The article discusses the development of a new model called CEED-VLA, which significantly enhances the inference speed of visual-language-action models while maintaining operational performance, making it suitable for high-frequency dexterous tasks [2][30].

Group 1: Model Development
- The CEED-VLA model is designed to accelerate inference through a general method that improves performance across multiple tasks [2].
- The model incorporates a consistency distillation mechanism and mixed-label supervision to enable accurate predictions of high-quality actions from various intermediate states [2][6].
- The Early-exit Decoding strategy is introduced to address inefficiencies in the Jacobi decoding process, achieving up to 4.1× inference speedup and over 4.3× execution frequency [2][15].

Group 2: Experimental Results
- Simulations and real-world experiments demonstrate that CEED-VLA significantly improves inference efficiency while maintaining similar task success rates [6][30].
- The model shows a speedup of 2.00× compared to the teacher model and achieves a higher number of fixed tokens, indicating improved performance [19][20].
- In real-world evaluations, CEED-VLA successfully completes dexterous tasks, achieving a success rate exceeding 70% due to enhanced inference speed and control frequency [30][31].
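Jacobi decoding refines an entire draft of future tokens in parallel and iterates to a fixed point that matches autoregressive decoding; early-exit decoding stops before full convergence to save iterations. The toy sketch below shows only that control flow, using a fake deterministic "model" and an invented iteration-budget exit rule in place of CEED-VLA's actual decoder and exit criterion.

```python
def parallel_step(prompt, draft, transition):
    """One Jacobi iteration: recompute every draft token from its current left context,
    all positions at once (here with a toy deterministic next-token rule)."""
    context = prompt + draft
    return [transition(context[i]) for i in range(len(prompt) - 1, len(context) - 1)]

def jacobi_decode(prompt, length, transition, max_iters=50, early_exit=True):
    draft = [0] * length                           # arbitrary initial guess
    for it in range(1, max_iters + 1):
        new_draft = parallel_step(prompt, draft, transition)
        if new_draft == draft:                     # fixed point = autoregressive output
            return draft, it
        draft = new_draft
        if early_exit and it >= 1 + length // 2:   # invented partial iteration budget
            return draft, it
    return draft, max_iters

# Toy 'model': next token is (previous token + 1) mod 10.
transition = lambda tok: (tok + 1) % 10
out_full, iters_full = jacobi_decode([3], 8, transition, early_exit=False)
out_early, iters_early = jacobi_decode([3], 8, transition, early_exit=True)
print(out_full, iters_full)    # converges to the exact autoregressive sequence
print(out_early, iters_early)  # fewer iterations, possibly an approximate tail
```

The trade-off the toy makes visible is the one the paper targets: exiting early cuts iterations (and thus latency) while the leading tokens are already fixed, and consistency distillation is what makes those early drafts good enough to act on.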