Vision-Language-Action (VLA) Models
Don't Let Vision Drag Down the Action in VLA!
具身智能之心· 2025-12-20 01:02
Core Insights
- The article discusses the challenges and advancements in Visual-Language-Action (VLA) models used in robotics, particularly focusing on the limitations of existing models that rely on low-dimensional sparse action signals to supervise high-dimensional dense visual inputs, which restricts overall performance [6][9]

Research Background
- VLA models have shown significant progress but still face issues due to the mismatch between action supervision signals and visual inputs, leading to underutilization of the model's representation capabilities [6]
- The introduction of a visual prediction mechanism is proposed to enhance action generation by predicting future visual states, although high-dimensional visual states often contain redundant information that complicates the training process [8]

Proposed Solutions
- Decoupled Visual Forecasting (DVF) is introduced to alleviate the burden on the backbone network by automatically capturing implicit actions and enhancing explicit action generation [7]
- A progressive pre-training approach is suggested to gradually integrate different modalities, introducing language supervision to retain the understanding and reasoning capabilities of the VLA backbone [7]
- Adaptive Temporal Ensemble (ATE) is proposed to dynamically adjust the integration strength during inference, reducing computational costs while maintaining action stability [14] (see the sketch after this summary)

Architecture Design
- The DVF method incorporates implicit action queries and a separate diffusion DVF head, allowing the model to focus on frame-to-frame differences rather than predicting complete future frames [10]
- A progressive training scheme is designed to introduce visual, language, and action information in phases to avoid competition between modalities and achieve stable optimization [10]

Experimental Analysis
- Mantis, the proposed model, outperforms existing baseline methods in three out of four tasks on the LIBERO benchmark, achieving the highest average success rate of 96.7% [16][18]
- The convergence speed of Mantis is significantly faster compared to traditional visual prediction methods like UnifiedVLA [20]
- Experiments demonstrate the effectiveness of language supervision in retaining the backbone's capabilities, with Mantis outperforming in both in-domain and out-of-domain instruction tasks [20]

Team Introduction
- The research team, SJTU Deng Lab, focuses on generative models and large language models, collaborating with renowned institutions and maintaining a strong research output in top-tier journals and conferences [23]
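The article does not give implementation details for Adaptive Temporal Ensemble, so the following is only a minimal sketch of temporal ensembling over overlapping action chunks with an adaptively scaled exponential weight. The disagreement-based adaptation rule, the decay value, and the function name `temporal_ensemble` are assumptions for illustration, not Mantis's actual method.

```python
import numpy as np

def temporal_ensemble(chunk_buffer: list, step: int, base_decay: float = 0.1) -> np.ndarray:
    """Fuse overlapping action chunks predicted at earlier timesteps.

    chunk_buffer[i] is the action chunk predicted at timestep i, shaped
    (horizon, action_dim). At the current `step`, every past chunk that still
    covers this step contributes one candidate action.
    """
    candidates, ages = [], []
    for t, chunk in enumerate(chunk_buffer):
        offset = step - t                       # how far into that chunk we are
        if 0 <= offset < len(chunk):
            candidates.append(chunk[offset])
            ages.append(offset)
    candidates = np.stack(candidates)           # (n_candidates, action_dim)
    ages = np.asarray(ages, dtype=np.float64)

    # Adaptive part (assumption): when candidates disagree strongly, trust the
    # newest prediction more by increasing the decay applied to older chunks.
    disagreement = candidates.std(axis=0).mean()
    decay = base_decay * (1.0 + disagreement)

    weights = np.exp(-decay * ages)             # newer predictions get larger weight
    weights /= weights.sum()
    return (weights[:, None] * candidates).sum(axis=0)

# Usage: append each newly predicted chunk, then query the fused action.
buffer = [np.random.randn(8, 7) for _ in range(3)]   # 3 chunks, horizon 8, 7-DoF actions
action = temporal_ensemble(buffer, step=2)
print(action.shape)                                   # (7,)
```

The intuition is that newer chunks reflect the most recent observations and therefore get larger weights, and that when past predictions disagree the fusion leans further toward the newest one.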
EVOLVE-VLA: Test-Time Training for VLA Models to Break the Imitation Learning Bottleneck
具身智能之心· 2025-12-18 00:07
点击下方 卡片 ,关注" 具身智能 之心 "公众号 作者丨 Zechen Bai等 编辑丨具身智能之心 本文只做学术分享,如有侵权,联系删文 >> 点击进入→ 具身智能之心 技术交流群 更多干货,欢迎加入国内首个具身智能全栈学习社区 : 具身智能之心知识星球 (戳我) , 这里包含所有你想要的。 一、研究背景与动机 现有VLA模型的核心困境 视觉-语言-动作(VLA)模型借助大型语言模型(LLM)的语义先验,在机器人操作任务中取得了显著进展,但当前主流的监督微调(SFT)训练范式存在两大根 本性局限: 人类学习范式的启发 人类掌握操作技能的核心是"通过实践学习"——反复尝试、从环境中获取反馈、逐步修正动作。这与SFT的"静态模仿学习"形成鲜明对比,因此,让VLA模型在部 署阶段通过环境交互实现持续学习,成为突破现有局限的关键方向。 核心挑战 测试时训练(TTT)的核心障碍是 缺乏Oracle奖励信号 (训练时的模拟器真值成功信号在部署时不可用)。直接使用朴素的进度估计器会产生噪声信号,可能误导 政策优化,尤其在长视野任务中,噪声累积会严重影响学习效果。 二、核心创新点 1. 测试时自主反馈机制 :用预训练的进 ...
GLaD: Knowledge Distillation Injects 3D Geometric Priors into VLA Models, Pushing Task Success Rates Past 94%
具身智能之心· 2025-12-12 01:22
点击下方 卡片 ,关注" 具身智能 之心 "公众号 作者丨 Minghao Guo等 编辑丨具身智能之心 本文只做学术分享,如有侵权,联系删文 >> 点击进入→ 具身智能之心 技术交流群 更多干货,欢迎加入国内首个具身智能全栈学习社区 : 具身智能之心知识星球 (戳我) , 这里包含所有你想要的。 一、研究背景与核心动机 视觉-语言-动作(VLA)模型是具身智能领域的关键技术,能够让机器人直接从视觉观测和自然语言指令中生成控制动作。现有VLA模型大多依赖CLIP、SigLIP等 2D视觉编码器,这类编码器擅长捕捉图像与文本的语义对应关系,却无法编码3D空间信息(如深度、物体位姿、空间关系)。 这种缺陷会导致模型在操作任务中出现错误的注意力分配,如figure1所示:在"将桌布从桌角移到桌边"和"拾取盘子与ramekin之间的黑碗并放到盘子上"任务中,传 统VLA模型会错误关注无关区域,无法精准定位任务相关物体,进而影响操作任务的完成精度。 为解决这一问题,研究团队提出GLaD框架,核心思路是通过知识蒸馏将3D几何先验注入VLA模型,使其同时具备语义理解和空间推理能力,且无需依赖额外的深 度传感器或3D标注。 ...
LatBot: CAS Team Proposes Latent Action Distillation to Improve Few-Shot Transfer Efficiency for Robot VLA Models
具身智能之心· 2025-12-04 00:04
Authors: Zuolei Li et al.

1. Research Background and Challenges

Latent action learning is an important research direction for vision-language-action (VLA) models. Its core idea is to extract compressed motion semantics from consecutive frames, forming a general representation independent of any specific robot embodiment, so that large-scale human videos can be used to expand training data and overcome the diversity and generalization limits of traditional robot datasets.

Existing latent action models (LAMs) have three key problems: first, they lack task-instruction guidance and cannot capture task-relevant changes; second, they under-use multi-frame information, so the latent action representation is not precise enough to capture motion dynamics; third, they over-emphasize changes in visual appearance and lack physical awareness, leaving a semantic gap between latent action representations and actually executable actions, which severely hurts transfer to downstream tasks.

2. Core Method Design

2.1 Decoupled latent action representation

The latent action is decomposed into two complementary learnable tokens that explicitly separate the robot's active motion from passive environment changes: by introducing a pretrained vision-language model (VLM) and combining task instructions with multi-frame inputs, the two learnable tokens ([CP ...
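The excerpt cuts off before the two tokens are named, so the following is only a rough sketch of the decoupling idea: two learnable query tokens cross-attend to multi-frame visual features and instruction features from a VLM, one intended to capture robot-induced motion and the other passive scene change. Module names, dimensions, and the attention layout are assumptions, not LatBot's actual design.

```python
import torch
import torch.nn as nn

class DecoupledLatentActionHead(nn.Module):
    """Two learnable query tokens attend over multi-frame features:
    one intended for robot-induced motion, one for passive scene change."""
    def __init__(self, feat_dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.robot_query = nn.Parameter(torch.randn(1, 1, feat_dim) * 0.02)
        self.scene_query = nn.Parameter(torch.randn(1, 1, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, frame_feats: torch.Tensor, instr_feats: torch.Tensor):
        # frame_feats: (B, T*N_patches, D) features of consecutive frames
        # instr_feats: (B, L, D) task-instruction token features from the VLM
        B = frame_feats.size(0)
        queries = torch.cat([self.robot_query, self.scene_query], dim=1).expand(B, -1, -1)
        context = torch.cat([frame_feats, instr_feats], dim=1)   # condition on frames + instruction
        out, _ = self.cross_attn(queries, context, context)
        out = self.norm(out)
        robot_latent, scene_latent = out[:, 0], out[:, 1]        # (B, D) each
        return robot_latent, scene_latent

# Usage on dummy features: 2 frames x 49 patches, 16 instruction tokens.
head = DecoupledLatentActionHead()
robot_z, scene_z = head(torch.randn(4, 2 * 49, 512), torch.randn(4, 16, 512))
print(robot_z.shape, scene_z.shape)   # torch.Size([4, 512]) torch.Size([4, 512])
```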
E0: A New Discrete Diffusion Framework That Greatly Improves VLA Model Generalization and Manipulation Precision
具身智能之心· 2025-11-29 02:07
Group 1
- The article discusses the need for robots to possess three core capabilities for operation in open environments: complex visual scene perception, natural language instruction understanding, and precise action generation [1][3]
- Existing methods face significant bottlenecks, including insufficient generalization ability, coarse action control, and modeling paradigm contradictions [3][4]
- The proposed framework introduces a continuous action discretization strategy, enhancing the stability of robot inference and allowing for fine-grained control [6][8]

Group 2
- The architecture utilizes the PaliGemma open-source VLM as a backbone, adding a 300 million parameter action expert network to optimize action generation through a diffusion model [6][10]
- The training process involves multi-modal observation encoding, action discretization, and Gaussian noise addition to ensure temporal consistency [8][9]
- The inference process includes initializing a noise action sequence, multi-step denoising, and deterministic de-discretization to produce executable action blocks [10][11] (see the sketch after this summary)

Group 3
- The model achieves state-of-the-art (SOTA) performance across three benchmarks (LIBERO, VLABench, ManiSkill), with an average success rate exceeding baseline by 10.7% [21]
- In the LIBERO benchmark, the model achieved an average success rate of 96%, demonstrating superior grasping and instruction execution capabilities [21]
- The model also excels in high-precision tasks, achieving an average success rate of 55.2% in the ManiSkill benchmark, significantly outperforming baseline models [24][28]

Group 4
- The article identifies limitations such as insufficient semantic alignment for specific tasks, challenges in complex coordination tasks, and inadequate modeling of mechanical interactions [32][35]
- Future directions include enhancing cross-modal alignment for semantic-rich tasks, designing adaptive task sampling strategies, and integrating physical model priors to improve control precision [35]
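E0's discrete diffusion formulation is not detailed here; the sketch below only illustrates the continuous-action discretization and deterministic de-discretization steps mentioned above, using uniform per-dimension binning over a normalized action range. The bin count and range are hypothetical.

```python
import numpy as np

N_BINS = 256                       # hypothetical number of bins per action dimension
LOW, HIGH = -1.0, 1.0              # assumed normalized action range

def discretize(actions: np.ndarray) -> np.ndarray:
    """Map continuous actions in [LOW, HIGH] to integer bin indices in [0, N_BINS - 1]."""
    clipped = np.clip(actions, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(np.int64)

def de_discretize(indices: np.ndarray) -> np.ndarray:
    """Deterministically map bin indices back to bin-center continuous actions."""
    return LOW + indices.astype(np.float64) / (N_BINS - 1) * (HIGH - LOW)

# Round-trip check on an action block of 8 timesteps of 7-DoF actions.
chunk = np.random.uniform(LOW, HIGH, size=(8, 7))
recovered = de_discretize(discretize(chunk))
print(np.abs(chunk - recovered).max())   # bounded by half a bin width (~0.004 here)
```

Under this assumption, the round-trip error is bounded by half a bin width, which sets the granularity at which the denoising process would operate.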
NUS Proposes VLA-4D: A 4D-Aware VLA Model for Spatiotemporally Coherent Robotic Manipulation
具身智能之心· 2025-11-25 00:03
Core Concept
- The article introduces the 4D perception VLA model, which aims to enhance the spatial and temporal coherence of robotic operations by integrating spatial and temporal information, thereby improving visual reasoning and action planning [2][4]

Group 1: Model Design and Technical Details
- The VLA-4D model innovates through dual spatial-temporal fusion, embedding 4D (3D space + 1D time) information into visual representations for reasoning and incorporating time variables into action representations for planning [5]
- The 2D VLA model relies on single-frame image input, leading to rough visual reasoning and spatial inaccuracies, while the 3D VLA model lacks explicit temporal modeling, resulting in motion stuttering [6]
- A "4D embedding + cross-attention fusion" representation method is designed to address the lack of spatial-temporal precision in visual reasoning [7][10] (see the sketch after this summary)

Group 2: Dataset and Training Process
- The existing VLA dataset lacks temporal action annotations, prompting an expansion based on the LIBERO dataset, which includes 40 sub-tasks and 150,000 visual-language-action samples [15][16]
- A two-stage training process significantly improves task success rates and reduces execution times compared to single-stage fine-tuning [17][18]

Group 3: Experimental Validation and Key Findings
- In the LIBERO benchmark, the VLA-4D model outperforms state-of-the-art models with a success rate of 97.4% and an average completion time of 5.8 seconds across various tasks [19][21]
- The model demonstrates superior generalization capabilities in zero-shot tasks, maintaining higher success rates and shorter execution times [20]
- Ablation studies confirm the necessity of the visual representation modules, showing that the combination of spatial and temporal embeddings enhances success rates and reduces completion times [24][27]
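The paper's exact "4D embedding + cross-attention fusion" design is not reproduced here; the sketch below shows one plausible reading of it: each visual token receives an (x, y, z, t) positional embedding added to its features, and language/action-side tokens then attend to the 4D-aware visual tokens. Dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

class FourDEmbeddingFusion(nn.Module):
    """Add a 4D (x, y, z, t) positional embedding to visual tokens, then fuse
    them into the language/action token stream with cross-attention."""
    def __init__(self, feat_dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(4, feat_dim), nn.GELU(), nn.Linear(feat_dim, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, query_tokens, visual_tokens, coords_4d):
        # query_tokens:  (B, L, D) language/action-side tokens
        # visual_tokens: (B, N, D) per-point or per-patch visual features
        # coords_4d:     (B, N, 4) (x, y, z, t) for each visual token
        vis = visual_tokens + self.pos_mlp(coords_4d)          # inject 4D position into features
        fused, _ = self.cross_attn(query_tokens, vis, vis)     # queries attend to 4D-aware visuals
        return self.norm(query_tokens + fused)                 # residual fusion

# Usage on dummy tensors: 2 frames x 512 points with per-point 3D coords plus a frame time index.
fusion = FourDEmbeddingFusion()
out = fusion(torch.randn(2, 32, 512), torch.randn(2, 1024, 512), torch.rand(2, 1024, 4))
print(out.shape)   # torch.Size([2, 32, 512])
```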
Nanyang Technological University Proposes NORA-1.5: A VLA Model Based on a World Model and Action Rewards
具身智能之心· 2025-11-21 00:04
点击下方 卡片 ,关注" 具身智能 之心 "公众号 作者丨 Chia-YuHung等 编辑丨具身智能之心 本文只做学术分享,如有侵权,联系删文 >> 点击进入→ 具身智能之心 技术交流群 更多干货,欢迎加入国内首个具身智能全栈学习社区 : 具身智能之心知识星球 (戳我) , 这里包含所有你想要的。 南洋理工大学等研究单位提出NORA-1.5 通过集成流匹配动作专家与奖励驱动的直接偏好优化(DPO)后训练,解决了现有视觉-语言-动作(VLA)模型泛化性和 可靠性不足的问题,在仿真与真实机器人场景中均实现了当前最优性能。 核心定位与解决的关键问题 架构设计:流匹配与 VLA backbone的协同优化 VLA backbone基础 论文标题 :NORA-1.5:AVision-Language-ActionModelTrainedusingWorldModel andAction-basedPreferenceRewards 论文链接 :https://arxiv.org/pdf/2511.14659 ProjectPage :https://declare-lab.github.io/nora-1.5 Code ...
Are VLA Models Collectively Failing? Prof. Xipeng Qiu's Team at Fudan & 创智 Proposes LIBERO-Plus, Revealing the Truth About VLA Fragility
具身智能之心· 2025-10-29 00:03
Core Insights
- The article discusses the robustness analysis of Vision-Language-Action (VLA) models, revealing significant generalization deficiencies despite high performance scores in ideal conditions [2][4][6]
- The LIBERO-Plus framework is introduced to systematically evaluate VLA models across various perturbation dimensions, highlighting the gap between surface performance and actual generalization capabilities [4][6][33]

Group 1: Motivation and Contributions
- VLA models have achieved impressive success rates in benchmarks like LIBERO, but existing evaluation methods fail to assess stability and reliability under real-world variations [4][6]
- LIBERO-Plus evaluates models based on seven dimensions of perturbation: object placement, camera angle, robot initial pose, language instructions, lighting conditions, background textures, and sensor noise [4][6] (see the evaluation-harness sketch after this summary)
- The framework provides a detailed analysis of VLA models' generalization performance through systematic perturbation [4][6]

Group 2: Performance Analysis
- The analysis reveals that VLA models exhibit significant overall vulnerability to perturbations, with performance declining across all dimensions [13][32]
- Models are most sensitive to changes in camera perspective and robot initial state, indicating a need for high-level spatial and proprioceptive understanding [13][32]
- Language perturbations lead to the smallest average performance drop (-25.3%), suggesting a surprising level of robustness that warrants further investigation [15][17]

Group 3: Findings on Model Behavior
- Some models maintain performance even with empty language inputs, indicating a tendency to ignore language modalities and behave more like visual-action (VA) models [16][19]
- VLA models struggle with cross-object instruction following, relying more on fixed visual-action mappings rather than fully leveraging language signals [19][20]
- The models demonstrate remarkable adaptability to background changes while showing limited sensitivity to lighting variations, raising questions about the representations they learn [20][27]

Group 4: Combination Generalization
- The concept of "combination generalization gap" is introduced, highlighting the negative interactions between different perturbations that exceed the independent effects of single perturbations [29][32]
- The analysis indicates that current VLA models lack the ability to effectively handle complex multi-dimensional perturbations due to entangled representations [32]

Group 5: LIBERO-Plus Benchmark
- The LIBERO-Plus benchmark consists of 10,030 tasks designed to evaluate model performance under various perturbations, constructed using perturbation augmentation strategies [33][36]
- The benchmark features include comprehensive coverage of seven perturbation dimensions and fine-grained difficulty levels [36]
- Models trained with enhanced data achieved an average success rate of 79.6% on LIBERO-Plus, significantly outperforming baseline models [38]
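LIBERO-Plus's generation pipeline is not reproduced here; the sketch below only illustrates the shape of a perturbation-dimension evaluation harness: apply one perturbation family at a time, roll out episodes, and report the success-rate drop relative to the unperturbed baseline. The perturbation names mirror the seven dimensions listed above, but the parameter ranges and the `run_episode` interface are hypothetical.

```python
import random
from typing import Callable, Dict

# Hypothetical perturbation registry mirroring the seven LIBERO-Plus dimensions.
PERTURBATIONS: Dict[str, Callable[[dict], dict]] = {
    "object_placement":  lambda cfg: {**cfg, "object_jitter_cm": random.uniform(1, 10)},
    "camera_angle":      lambda cfg: {**cfg, "camera_yaw_deg": random.uniform(-30, 30)},
    "robot_init_pose":   lambda cfg: {**cfg, "joint_noise_rad": random.uniform(0.0, 0.2)},
    "language":          lambda cfg: {**cfg, "instruction_paraphrase": True},
    "lighting":          lambda cfg: {**cfg, "light_intensity_scale": random.uniform(0.3, 2.0)},
    "background":        lambda cfg: {**cfg, "background_texture_id": random.randrange(50)},
    "sensor_noise":      lambda cfg: {**cfg, "pixel_noise_std": random.uniform(0.0, 0.1)},
}

def evaluate(run_episode: Callable[[dict], bool], base_cfg: dict, episodes: int = 50) -> dict:
    """Per-dimension success rates and the drop relative to the unperturbed baseline."""
    baseline = sum(run_episode(base_cfg) for _ in range(episodes)) / episodes
    report = {"baseline": baseline}
    for name, perturb in PERTURBATIONS.items():
        rate = sum(run_episode(perturb(base_cfg)) for _ in range(episodes)) / episodes
        report[name] = {"success_rate": rate, "drop": baseline - rate}
    return report

# Usage with a dummy policy rollout that degrades under any perturbation.
dummy_rollout = lambda cfg: random.random() < (0.9 if len(cfg) == 1 else 0.6)
print(evaluate(dummy_rollout, {"task": "libero_spatial_01"}))
```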
SFT or RL: How Should VLA Models Actually Be Trained?
具身智能之心· 2025-10-28 00:02
Core Insights
- The articles focus on advancements in Reinforcement Learning (RL) and its application to Visual-Language-Action (VLA) models, highlighting significant improvements in generalization capabilities and training efficiency

Group 1: Research Findings
- The first study investigates how RL enhances the generalization ability of VLA models, addressing the error accumulation and distribution shift caused by supervised fine-tuning (SFT). A new benchmark covering visual, semantic, and execution dimensions was established, showing that RL fine-tuning with Proximal Policy Optimization (PPO) significantly improves semantic understanding and execution robustness while maintaining visual generalization comparable to SFT [2] (a minimal PPO-objective sketch follows this summary)
- The second study introduces RLinf-VLA, a framework designed for large-scale RL training of VLA models. It proposes a novel solution to the challenges of integrating RL and VLA training, achieving up to 2.27 times acceleration compared to baseline methods. The framework supports various VLA architectures and RL algorithms, achieving a 98.11% success rate across 130 LIBERO tasks [3]

Group 2: Practical Applications
- RLinf-VLA summarizes best practices for applying RL in VLA training, providing a unified interface that facilitates the use of multiple VLA architectures and simulators, thus lowering the barrier to implementing RL in large-scale VLA applications [3]
- The research emphasizes the importance of RL in enhancing the performance of VLA models, suggesting a shift toward more efficient training methodologies that leverage RL's strengths [15]
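Neither study's training code is shown here; the sketch below is the standard PPO clipped surrogate objective, which is what PPO-based fine-tuning of a VLA policy would minimize over sampled action(-token) log-probabilities. How log-probabilities and advantages are obtained from a specific VLA architecture is outside this sketch.

```python
import torch

def ppo_clip_loss(new_logp: torch.Tensor,
                  old_logp: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective over per-sample action log-probabilities.

    new_logp / old_logp: (B,) log-probs of the sampled actions under the current
    and behavior policies; advantages: (B,) estimated advantages.
    """
    ratio = torch.exp(new_logp - old_logp.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()   # maximize the surrogate, so minimize its negative

# Usage on dummy values; in a VLA policy, log-probs would come from the action head or action tokens.
loss = ppo_clip_loss(torch.randn(32, requires_grad=True), torch.randn(32), torch.randn(32))
loss.backward()
```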
Is Your VLA Too Slow? You Can Speed It Up Even with Limited Compute: This Survey Shows How to Build a New Paradigm for Efficient VLA
具身智能之心· 2025-10-24 16:03
Core Insights
- The article emphasizes the importance of efficiency in Vision-Language-Action (VLA) models, which are crucial for enabling robots to understand their environment and execute tasks effectively. It identifies efficiency as a key bottleneck that hinders the transition of VLA models from research to practical applications [3][4][7]

Background and Value
- The rapid development of embodied intelligence has led to the emergence of VLA models as a core framework for robotic task execution. However, current VLA systems face significant challenges related to computational and storage demands, as well as high inference latency, which are critical for real-time applications [3][4][7]

Efficiency Bottlenecks
- The review systematically analyzes the efficiency issues in VLA models across four dimensions: model architecture, perception features, action generation, and training/inference processes. It highlights that efficiency challenges are systemic and not limited to single-point optimizations [3][4][7]

Classification Framework
- The article categorizes existing efficient VLA strategies into four complementary dimensions: efficient architecture design, perception feature compression, action generation acceleration, and training/inference optimization. This classification provides a comprehensive understanding of the design logic and trade-offs of current methods [4][6][7]

Future Trends and Directions
- The review outlines future directions for VLA models, emphasizing the need for a balance between capability enhancement and computational cost. Key areas for efficiency optimization include data utilization, perception features, action generation, and learning strategies [4][25][26]

Efficient Perception Features
- Optimizing visual input, which constitutes the largest computational overhead in VLA models, can be approached through selective processing of features and temporal feature reuse. These strategies aim to reduce redundant calculations while maintaining performance [13][15][16] (a minimal feature-reuse sketch follows this summary)

Efficient Action Generation
- Action generation strategies focus on minimizing latency while ensuring task accuracy. Techniques include outputting low-dimensional continuous action vectors and introducing explicit reasoning to enhance interpretability and generalization across tasks [18][21]

Efficient Training and Inference
- Training strategies aim to reduce adaptation costs for new tasks and environments through methods like parameter-efficient fine-tuning and knowledge distillation. Inference strategies focus on breaking the autoregressive bottleneck to enable parallelization and mixed decoding [22][24]

Future Outlook
- The article suggests that future VLA models should prioritize collaborative design between models and data, efficient spatiotemporal perception, and robust action encoding. It also calls for a standardized evaluation framework to measure efficiency improvements [25][26][27]
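As one concrete illustration of the "temporal feature reuse" strategy mentioned above, the sketch below caches visual features and skips re-encoding when consecutive frames barely change. The change metric, threshold, and toy encoder are hypothetical; the surveyed methods use more principled reuse criteria.

```python
import torch
import torch.nn as nn

class CachedVisualEncoder(nn.Module):
    """Temporal feature reuse: re-encode a frame only when it differs enough
    from the last encoded frame; otherwise return the cached features."""
    def __init__(self, encoder: nn.Module, pixel_threshold: float = 0.02):
        super().__init__()
        self.encoder = encoder
        self.pixel_threshold = pixel_threshold
        self.last_frame = None
        self.cached_feats = None

    @torch.no_grad()
    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (C, H, W), values assumed normalized to [0, 1]
        if self.last_frame is not None:
            change = (frame - self.last_frame).abs().mean()
            if change < self.pixel_threshold:
                return self.cached_feats           # reuse: skip the expensive encoder pass
        self.last_frame = frame.clone()
        self.cached_feats = self.encoder(frame.unsqueeze(0)).squeeze(0)
        return self.cached_feats

# Usage with a toy CNN encoder standing in for a ViT backbone.
toy_encoder = nn.Sequential(nn.Conv2d(3, 8, 3, stride=4), nn.Flatten(), nn.LazyLinear(128))
cached = CachedVisualEncoder(toy_encoder)
f1 = cached(torch.rand(3, 64, 64))
f2 = cached(cached.last_frame + 0.001)             # nearly identical frame: cached features reused
print(f1.shape, torch.equal(f1, f2))                # torch.Size([128]) True
```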