Vision-Language-Action Models
HUST & Xiaomi jointly propose MindDrive: the first VLA framework to demonstrate the effectiveness of online reinforcement learning...
自动驾驶之心· 2025-12-17 00:03
Paper authors | Haoyu Fu et al.  Editor | 自动驾驶之心
MindDrive, a new work from HUST and Xiaomi, proposes a VLA framework based on online reinforcement learning. It improves considerably over RecogDrive and ORION, and performs quite well on a Qwen2-0.5B backbone. Current VLA work in autonomous driving relies mainly on imitation learning, which brings inherent challenges such as distribution shift and causal confusion. Online reinforcement learning, through trial-and-error learning, offers a highly promising route to addressing these problems. However, applying online reinforcement learning to autonomous-driving vision-language-action models runs into the difficulty of inefficient exploration in continuous action spaces. To overcome this limitation, the HUST and Xiaomi team propose MindDrive, a vision-language-action framework containing a large language model (LLM) equipped with two distinct sets of LoRA parameters. One set of the LLM acts as a decision expert, responsible for scene reasoning and driving decisions; the other acts as an action expert that dynamically maps language decisions into drivable trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive can carry out trial-and-error learning over a limited, discrete set of language-level driving decisions rather than directly in the continuous action ...
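The core idea above, trajectory-level rewards fed back into a finite set of discrete language-level driving decisions, maps onto a very simple policy-gradient loop. Below is a minimal sketch of that loop, assuming a toy decision vocabulary, a placeholder `decision_to_trajectory` action expert, and a placeholder trajectory reward; none of these names or interfaces come from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical discrete driving-decision vocabulary (the paper's actual set is not given here).
DECISIONS = ["keep_lane", "yield", "accelerate", "turn_left", "turn_right", "stop"]

class DecisionExpert(nn.Module):
    """Stands in for the LoRA-adapted LLM head that scores discrete driving decisions."""
    def __init__(self, obs_dim: int, n_decisions: int):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_decisions))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.head(obs))

def decision_to_trajectory(decision_id: int) -> torch.Tensor:
    """Placeholder for the action expert mapping a language decision to a drivable trajectory."""
    g = torch.Generator().manual_seed(decision_id)                        # deterministic toy trajectory per decision
    return torch.cumsum(torch.randn(10, 2, generator=g) * 0.1, dim=0)     # 10 (x, y) waypoints

def trajectory_reward(traj: torch.Tensor) -> float:
    """Placeholder trajectory-level reward (e.g., forward progress minus a lateral-deviation penalty)."""
    return float(traj[-1, 0] - traj[:, 1].abs().mean())

policy = DecisionExpert(obs_dim=32, n_decisions=len(DECISIONS))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(100):                        # online trial-and-error over discrete decisions
    obs = torch.randn(32)                      # stand-in for the fused scene representation
    dist = policy(obs)
    decision = dist.sample()
    reward = trajectory_reward(decision_to_trajectory(int(decision)))
    loss = -dist.log_prob(decision) * reward   # REINFORCE: push up decisions whose trajectories score well
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because exploration happens over a handful of language decisions rather than a continuous trajectory space, each trial carries much more signal, which is the efficiency argument the summary makes.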
Breaking through the bottleneck of vision-language-action models: QDepth-VLA gives robots more precise 3D spatial perception
机器之心· 2025-11-26 07:07
Core Insights
- The article discusses the significant potential of Vision-Language-Action (VLA) models in robotic manipulation, highlighting the introduction of QDepth-VLA, which enhances 3D spatial perception and reasoning capabilities through Quantized Depth Prediction [2][4][34].

Group 1: Model Limitations and Challenges
- Despite advancements in semantic understanding and instruction following, VLA models struggle with spatial perception, particularly in fine-grained or long-duration multi-step tasks, leading to positioning errors and operational failures [5][6].
- The gap between 2D visual semantic understanding and 3D spatial perception has prompted researchers to explore various methods to integrate 3D information into VLA models, categorized into three main approaches: direct injection of 3D features, 3D feature projection, and auxiliary 3D visual prediction tasks [5][6].

Group 2: QDepth-VLA Methodology
- QDepth-VLA introduces a mechanism that combines Quantized Depth Prediction with a hybrid attention structure, allowing the model to maintain semantic consistency while enhancing 3D spatial perception and action decision-making [8][34].
- The method consists of three main components: high-precision depth annotation using Video-Depth-Anything, a Depth Expert module for structured depth-token prediction (a minimal sketch follows this summary), and a hybrid attention mechanism to manage information flow across modalities [11][13][14].

Group 3: Experimental Validation
- Comprehensive evaluations of QDepth-VLA were conducted in both simulated environments (Simpler and LIBERO) and real-world settings, demonstrating significant performance improvements in various object manipulation and multi-step tasks [18][19].
- In the Simpler simulation, QDepth-VLA achieved average success rate increases of 8.5% and 3.7% over the baseline model Open π0 [20].
- In the LIBERO simulation, QDepth-VLA outperformed the 3D-CAVLA model by approximately 2.8% [26].
- Real-world experiments showed QDepth-VLA's superior performance in pick-and-place tasks, with a 20% improvement on basic tasks and a 10% improvement in more challenging scenarios [30].

Group 4: Ablation Studies
- Ablation studies indicated that the depth supervision and hybrid attention mechanisms are crucial for QDepth-VLA's high performance, with significant drops in success rate when these components were removed [31][32].

Group 5: Future Directions
- Future research will focus on enhancing the model's spatial understanding, with potential developments in future spatial-structure prediction and more efficient depth representation learning [35][36].
- The integration of enhanced 3D geometric perception and action consistency into CASBOT's product line is anticipated, supporting various applications in both domestic and industrial settings [35][36].
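To make "Quantized Depth Prediction" concrete, here is one plausible reading of the auxiliary task in Group 2: discretize a (pseudo-)depth map into per-patch bins and train a small head, standing in for the Depth Expert, to classify each patch's depth token alongside the action loss. The bin count, patch grid, depth range, and loss weighting below are my assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_DEPTH_BINS = 64        # assumed quantization granularity
PATCH_GRID = 14            # assumed ViT-style 14x14 patch grid

def quantize_depth(depth_map: torch.Tensor) -> torch.Tensor:
    """Convert a (B, H, W) metric depth map into per-patch discrete depth tokens."""
    patches = F.adaptive_avg_pool2d(depth_map.unsqueeze(1), PATCH_GRID).squeeze(1)   # (B, 14, 14)
    d_min, d_max = 0.1, 10.0                                                         # assumed range in meters
    norm = (patches.clamp(d_min, d_max) - d_min) / (d_max - d_min)
    return (norm * (NUM_DEPTH_BINS - 1)).round().long().flatten(1)                   # (B, 196)

class DepthExpertHead(nn.Module):
    """Stand-in auxiliary head: predicts one depth token per visual token."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, NUM_DEPTH_BINS)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:   # (B, N, D) -> (B, N, bins)
        return self.proj(visual_tokens)

# Toy batch: visual tokens from the VLA backbone plus quantized pseudo-depth labels
# (in the paper the labels come from Video-Depth-Anything annotations).
visual_tokens = torch.randn(2, PATCH_GRID * PATCH_GRID, 512)
depth_labels = quantize_depth(torch.rand(2, 224, 224) * 10.0)

head = DepthExpertHead(hidden_dim=512)
logits = head(visual_tokens)
aux_depth_loss = F.cross_entropy(logits.flatten(0, 1), depth_labels.flatten())
# During training this auxiliary loss would be added to the action loss with a weighting coefficient.
```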
SemanticVLA: semantic-aligned pruning and enhancement for efficient robotic manipulation
具身智能之心· 2025-11-14 16:03
Core Insights
- The article discusses significant advancements in visual-language-action models for robotic operations, highlighting the challenges faced in dynamic and cluttered environments, which hinder the deployment of existing models [2][4].

Research Background
- Visual-language-action models have made notable progress in robotic operations through pre-trained vision-language models that enable end-to-end mapping from language to action. However, two main bottlenecks limit their deployment in real-world scenarios: low computational efficiency and weak task-grounding capability [2].

Key Innovations
- Introduction of a semantic-guided dual-visual pruner that addresses visual redundancy through instruction-aware token filtering and geometry-aware aggregation, while maintaining semantic alignment (a minimal sketch of instruction-aware filtering follows this summary) [3].

Main Work
Overall Framework Design
- The framework processes real-time visual observations, robot state (e.g., joint angles, end-effector pose), and natural language instructions to predict future action sequences. It employs two parallel paths for visual input processing, culminating in an end-to-end pipeline for action mapping [4].

Visual Perception Redundancy
- The general visual encoder processes all pixels uniformly, leading to background interference and environmental noise, which increases computational cost and dilutes attention on critical task cues [5].

Semantic Complementary Layered Fusion
- A semantic complementary layered fusion mechanism integrates dense patch features with sparse semantic tokens, enhancing the alignment of instruction semantics with spatial structure [5].

Semantic Conditioned Action Coupler
- The design reconstructs the mapping from vision to action, improving the efficiency and interpretability of action decoding by representing actions as semantically coherent types [5].

Experimental Results
Efficiency Advantages
- The model reduces training cost by 3.0x, inference latency by 2.7x, and compresses visual tokens by 8-16x, significantly enhancing throughput [14].

Real-World Performance
- In long-range tasks, the model's success rate reaches 77.8%, surpassing the OpenVLA-OFT model by 22.2% and demonstrating strong generalization [14].

Ablation Studies
- The dual-pruning combination of the SD-Pruner enhances success rates by 2.1%-5.2%, achieving the best performance-efficiency balance at an 8x sparsification ratio [16].
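Instruction-aware token filtering can be illustrated very compactly: score each visual token against a pooled instruction embedding and keep only the top fraction. The sketch below is a generic version of that idea, not SemanticVLA's actual pruner; the cosine-similarity scoring and the 0.125 keep ratio (8x compression) are assumptions chosen to match the reported compression range.

```python
import torch
import torch.nn.functional as F

def instruction_aware_prune(visual_tokens: torch.Tensor,
                            instruction_emb: torch.Tensor,
                            keep_ratio: float = 0.125) -> torch.Tensor:
    """
    Keep only the visual tokens most relevant to the instruction.

    visual_tokens:   (B, N, D) patch features from the vision encoder
    instruction_emb: (B, D) pooled text embedding of the instruction
    keep_ratio:      fraction of tokens retained (0.125 -> 8x compression)
    """
    scores = F.cosine_similarity(visual_tokens, instruction_emb.unsqueeze(1), dim=-1)   # (B, N)
    k = max(1, int(visual_tokens.shape[1] * keep_ratio))
    top_idx = scores.topk(k, dim=1).indices                                             # (B, k)
    batch_idx = torch.arange(visual_tokens.shape[0]).unsqueeze(1)
    return visual_tokens[batch_idx, top_idx]                                            # (B, k, D)

# Toy usage: 256 patch tokens compressed 8x down to 32 instruction-relevant tokens.
tokens = torch.randn(2, 256, 768)
text = torch.randn(2, 768)
pruned = instruction_aware_prune(tokens, text)
print(pruned.shape)   # torch.Size([2, 32, 768])
```

Dropping background tokens before the LLM is what drives the reported training-cost and latency reductions, since attention cost scales with the number of visual tokens.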
Latest from Westlake University! RobustVLA: a robustness-aware reinforcement post-training method for VLA models (outperforms SOTA approaches)
具身智能之心· 2025-11-08 04:00
Core Insights
- The article discusses the development of RobustVLA, a lightweight online reinforcement learning post-training method aimed at enhancing the robustness of Vision-Language-Action (VLA) models in the face of environmental uncertainties [1][5][20].
- It highlights the limitations of existing methods that focus primarily on reward maximization without addressing the model's sensitivity to disturbances, which can lead to significant performance drops in real-world scenarios [5][20].

Design Logic of RobustVLA
- RobustVLA incorporates two key regularization terms: Jacobian regularization to reduce sensitivity to observation noise and smoothness regularization to stabilize policies under action disturbances (a minimal sketch follows this summary) [4][7][8].
- The method emphasizes robustness-aware reinforcement learning post-training as a critical step in improving the reliability of VLA models [1][5].

Robustness Analysis
- The article outlines a theoretical analysis of robustness, establishing error-amplification bounds, reward-drift control, and guarantees of robust stability [4][11][18].
- It identifies that Jacobian sensitivity directly drives error amplification, and that reducing this sensitivity effectively constrains performance loss [12][18].

Experimental Results
- In experiments, RobustVLA demonstrated an average success rate of 82.5% under observation perturbations, outperforming previous models such as OpenVLA-OFT and RIPT-VLA [20][21].
- Under action perturbations, RobustVLA achieved an average success rate of 54.8%, exceeding OpenVLA-OFT's 53.5% [22].
- Under combined disturbances, RobustVLA-C achieved an average success rate of 82.1%, showcasing the synergy of autonomous interaction and dual regularization [23].

Transfer Learning and Ablation Studies
- Transfer-learning experiments showed that RobustVLA improved out-of-distribution adaptability by 8.0% and 16.0% on specific tasks compared to zero-shot transfer [25].
- Ablation studies confirmed that removing either Jacobian or smoothness regularization led to performance declines, underscoring that both regularization strategies are needed for robustness [27].
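One lightweight way to realize the two regularizers named above is with finite differences rather than explicit Jacobians: penalize how much the predicted action moves under injected observation noise, and penalize abrupt step-to-step action jumps so action-side disturbances are not amplified. This is a sketch under those assumptions; the perturbation scale, the temporal reading of the smoothness term, and the 0.1 weights are mine, not the paper's formulation.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Stand-in for the VLA action head: observation features -> continuous 7-DoF action."""
    def __init__(self, obs_dim=64, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.Tanh(), nn.Linear(128, act_dim))

    def forward(self, obs):
        return self.net(obs)

def jacobian_reg(policy, obs, sigma=1e-2):
    """Finite-difference proxy for Jacobian sensitivity:
    how much the action changes per unit of injected observation noise."""
    noise = torch.randn_like(obs) * sigma
    return ((policy(obs + noise) - policy(obs)) ** 2).sum(-1).mean() / sigma ** 2

def smoothness_reg(actions):
    """Temporal smoothness: penalize large jumps between consecutive actions so that
    small action-side disturbances are not amplified from step to step."""
    return ((actions[1:] - actions[:-1]) ** 2).sum(-1).mean()

policy = Policy()
traj_obs = torch.randn(16, 64)                 # toy trajectory of 16 observation feature vectors
actions = policy(traj_obs)
# In post-training, both terms would be added to the RL objective with weighting coefficients.
reg_loss = 0.1 * jacobian_reg(policy, traj_obs) + 0.1 * smoothness_reg(actions)
reg_loss.backward()
```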
VLA²: Zhejiang University x Westlake University propose an agentic VLA framework with greatly improved manipulation generalization
具身智能之心· 2025-10-24 00:40
Core Insights
- The article presents VLA², a framework designed to enhance the capabilities of vision-language-action models, particularly in handling unseen concepts in robotic tasks [1][3][12].

Method Overview
- VLA² integrates three core modules: initial information processing, cognition and memory, and task execution [3][5].
- The framework utilizes GLM-4V for task decomposition, MM-GroundingDINO for object detection, and web image retrieval for visual memory enhancement (a control-flow sketch follows this summary) [4][7].

Experimental Validation
- VLA² was compared with state-of-the-art (SOTA) models on the LIBERO benchmark, showing competitive results and excelling in scenarios requiring strong generalization [6][9].
- In hard scenarios, VLA² achieved a 44.2% improvement in success rate over simply fine-tuning OpenVLA [9][10].

Key Mechanisms
- The framework's performance is significantly influenced by three mechanisms: visual mask injection, semantic replacement, and web retrieval [7][11].
- Ablation studies confirmed that each mechanism contributes notably to the model's performance, especially on challenging tasks [11].

Conclusion and Future Directions
- VLA² successfully expands the cognitive and operational capabilities of VLA models for unknown objects, providing a viable solution for robotic tasks in open-world settings [12].
- Future work will focus on exploring its generalization capabilities in real-world applications and expanding support for more tools and tasks [12].
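The named modules suggest a straightforward agentic control flow: decompose the instruction, ground the target object, fall back to web retrieval when the concept is unseen, then hand a visual hint to the low-level policy. The sketch below shows only that control flow; every helper here is a hypothetical stub, not the real GLM-4V, MM-GroundingDINO, or OpenVLA API.

```python
from dataclasses import dataclass

@dataclass
class SubTask:
    instruction: str      # e.g. "pick up the durian"
    target_object: str    # noun phrase the detector must ground

def decompose_task(instruction: str) -> list:
    # Hypothetical GLM-4V wrapper; here a trivial single-step decomposition.
    return [SubTask(instruction=instruction, target_object=instruction.split()[-1])]

def ground_object(image, phrase: str, reference_images=None):
    # Hypothetical MM-GroundingDINO wrapper; pretends the phrase is an unseen concept
    # until retrieved reference images are supplied, then returns a fake box.
    return (0, 0, 10, 10) if reference_images else None

def retrieve_reference_images(phrase: str) -> list:
    # Hypothetical web-image retrieval that builds a visual memory for the unseen concept.
    return [f"web_image_of_{phrase}.jpg"]

def run_vla(image, sub_instruction: str, visual_hint):
    # Hypothetical call into the underlying VLA policy (a fine-tuned OpenVLA in the paper).
    print(f"executing '{sub_instruction}' with visual hint {visual_hint}")

def execute(instruction: str, image=None):
    for sub in decompose_task(instruction):
        box = ground_object(image, sub.target_object)
        if box is None:                                     # unseen concept: enrich visual memory, retry grounding
            refs = retrieve_reference_images(sub.target_object)
            box = ground_object(image, sub.target_object, reference_images=refs)
        run_vla(image, sub.instruction, visual_hint=box)    # mask/box injection guides the low-level policy

execute("pick up the durian")
```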
SOTA even without much data? Tsinghua & Shanghai AI Lab crack two major bottlenecks in robot RL
量子位· 2025-09-26 02:08
Core Viewpoint
- The article discusses the development of SimpleVLA-RL, an end-to-end online training solution for Vision-Language-Action (VLA) models, aimed at enhancing the flexibility and performance of robots in complex environments while addressing existing training bottlenecks [3][12].

Group 1: Key Challenges in Existing Training Paradigms
- Current training paradigms face significant challenges, including high data-collection costs and insufficient generalization capability [2][8].
- The reliance on large-scale, high-quality robot operation trajectories limits scalability and increases costs, making data acquisition a major hurdle [8].
- The models struggle with generalization, particularly on out-of-distribution tasks and in new environments, leading to performance drops on long-sequence dependencies and combinatorial tasks [8][9].

Group 2: SimpleVLA-RL Framework
- SimpleVLA-RL employs a combination of interactive trajectory sampling, result-based rewards, and enhanced exploration to tackle the three core challenges of VLA model training (a minimal sketch of this loop follows this summary) [5][6].
- The framework demonstrates state-of-the-art (SoTA) performance on standard benchmarks such as LIBERO and RoboTwin, achieving significant improvements even with limited data [5][21].
- With a single demonstration per task, the average success rate in LIBERO increased from 48.9% to 96.9% after applying SimpleVLA-RL [5].

Group 3: Performance Metrics and Results
- SimpleVLA-RL achieved an average success rate of 99.1% on LIBERO, with long-sequence tasks improving by 12.0 percentage points [21].
- On RoboTwin 1.0, the average success rate rose from 39.8% to 70.4%, with specific tasks such as "Blocks Stack" improving by 33.1 percentage points [23].
- On RoboTwin 2.0, the average success rate improved from 38.3% to 68.8% [25].

Group 4: Innovations and Discoveries
- The training process led to the emergence of new operational strategies, such as the "Pushcut" phenomenon, in which the model autonomously discovers more efficient methods beyond the human demonstrations [10][31].
- This phenomenon indicates that reinforcement learning can enable VLA models to surpass the limitations of human demonstration patterns, paving the way for future adaptive VLA model development [31].
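The three ingredients in Group 2 (interactive trajectory sampling, result-based rewards, enhanced exploration) map onto a simple outcome-reward RL loop. The sketch below shows the shape of that loop with a binary success reward and a raised sampling temperature for exploration; the environment interface, policy, and update rule are placeholders I chose, not the framework's actual implementation.

```python
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Stand-in for a VLA policy that emits one discrete action token per step."""
    def __init__(self, obs_dim=32, n_actions=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def dist(self, obs, temperature=1.0):
        # A raised sampling temperature is one cheap way to encourage exploration.
        return torch.distributions.Categorical(logits=self.net(obs) / temperature)

def fake_env_step(action):
    # Placeholder environment: random next observation and a random success flag.
    return torch.randn(32), bool(torch.rand(()) > 0.5)

def rollout(policy, env_step, horizon=20, temperature=1.3):
    """Interactively sample one trajectory; the only reward is the final task outcome (0 or 1)."""
    obs, log_probs, success = torch.randn(32), [], False
    for _ in range(horizon):
        d = policy.dist(obs, temperature)
        a = d.sample()
        log_probs.append(d.log_prob(a))
        obs, success = env_step(a)
    return torch.stack(log_probs), float(success)

policy = ToyVLAPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
for _ in range(50):
    log_probs, outcome = rollout(policy, fake_env_step)
    loss = -(log_probs.sum() * outcome)   # reinforce whole trajectories that ended in success
    opt.zero_grad(); loss.backward(); opt.step()
```

Because the reward is computed only from the end result, no dense reward engineering or extra demonstrations are needed, which is what keeps the data requirement low.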
A survey based on 313 VLA papers, plus a 1,661-character condensed version
理想TOP2· 2025-09-25 13:33
Core Insights
- The emergence of Vision-Language-Action (VLA) models signifies a paradigm shift in robotics from traditional policy-based control to general-purpose robotic technology, enabling active decision-making in complex environments [12][22].
- The review categorizes VLA methods into five paradigms: autoregressive, diffusion-based, reinforcement learning, hybrid, and specialized methods, providing a comprehensive overview of their design motivations and core strategies [17][20].

Summary by Categories
Autoregressive Models
- Autoregressive models generate action sequences as time-dependent processes, leveraging historical context and sensory inputs to produce actions step by step [44][46].
- Key innovations include unified multimodal Transformers that tokenize various modalities, enhancing cross-task action generation [48][49].
- Open challenges include safety, interpretability, and alignment with human values [47][56].

Diffusion-Based Models
- Diffusion models frame action generation as a conditional denoising process, allowing probabilistic action generation and modeling of multimodal action distributions (a minimal denoising sketch follows this summary) [59][60].
- Innovations include modular optimization and dynamic adaptive reasoning to improve efficiency and reduce computational cost [61][62].
- Limitations involve maintaining temporal consistency in dynamic environments and high computational resource demands [5][60].

Reinforcement Learning Models
- Reinforcement learning models integrate VLMs with reinforcement learning to generate context-aware actions in interactive environments [6].
- Innovations focus on reward-function design and safety-alignment mechanisms that prevent high-risk behavior while maintaining task performance [6][7].
- Challenges include the complexity of reward engineering and the high computational cost of scaling to high-dimensional real-world environments [6][9].

Hybrid and Specialized Methods
- Hybrid methods combine different paradigms to leverage the strengths of each, such as using diffusion for smooth trajectory generation while retaining autoregressive reasoning capability [7].
- Specialized methods adapt VLA frameworks to specific domains such as autonomous driving and humanoid robot control, enhancing practical applications [7][8].
- The focus is on efficiency, safety, and human-robot collaboration in real-time inference and interactive learning [7][8].

Data and Simulation Support
- The development of VLA models relies heavily on high-quality datasets and simulation platforms to address data scarcity and testing risk [8][34].
- Real-world datasets such as Open X-Embodiment and simulation tools such as MuJoCo and CARLA are crucial for training and evaluating VLA models [8][36].
- Challenges include high annotation costs and insufficient coverage of rare scenarios, which limit the generalization capabilities of VLA models [8][35].

Future Opportunities
- The integration of world models and cross-modal unification aims to evolve VLA into a comprehensive framework for environment modeling, reasoning, and interaction [10].
- Causal reasoning and real-interaction models are expected to overcome the limitations of "pseudo-interaction" [10].
- Establishing standardized frameworks for risk assessment and accountability will move VLA from experimental tool to trusted partner in society [10].
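For the diffusion-based paradigm, "action generation as a conditional denoising process" means: start an action chunk from Gaussian noise and iteratively remove noise predicted by a network conditioned on the observation. The sketch below is a deliberately crude DDPM-style illustration of that view; the network shape, the Euler-style update, and the step count are arbitrary choices, not any specific model from the survey.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Toy epsilon-network: predicts the noise in a noisy action chunk, conditioned on an observation embedding."""
    def __init__(self, act_dim=7, horizon=8, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim * horizon + cond_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, act_dim * horizon))

    def forward(self, noisy_actions, cond, t):
        x = torch.cat([noisy_actions.flatten(1), cond, t], dim=-1)
        return self.net(x).view_as(noisy_actions)

@torch.no_grad()
def sample_action_chunk(model, cond, steps=50, act_dim=7, horizon=8):
    """Conditional denoising: start from Gaussian noise and iteratively remove predicted noise."""
    x = torch.randn(cond.shape[0], horizon, act_dim)
    for i in reversed(range(steps)):
        t = torch.full((cond.shape[0], 1), i / steps)
        eps = model(x, cond, t)
        x = x - eps / steps          # crude Euler-style update; real samplers follow a proper noise schedule
    return x

model = NoisePredictor()
obs_embedding = torch.randn(2, 64)   # stand-in for fused vision-language features
actions = sample_action_chunk(model, obs_embedding)
print(actions.shape)                 # torch.Size([2, 8, 7])
```

The many denoising steps per action chunk are exactly the inference-cost concern the survey lists as a limitation of this paradigm.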
A brand-new paradigm! LLaDA-VLA: the first VLA model based on a large language diffusion model
具身智能之心· 2025-09-12 00:05
Core Viewpoint
- The article discusses the advancements in Vision-Language Models (VLMs) and introduces LLaDA-VLA, the first Vision-Language-Action model built on large language diffusion models, which demonstrates superior multi-task performance in robotic action generation [1][5][19].

Group 1: Introduction to LLaDA-VLA
- LLaDA-VLA integrates Masked Diffusion Models (MDMs) into robotic action generation, leveraging pre-trained multimodal large language diffusion models for fine-tuning and enabling parallel action-trajectory prediction [5][19].
- The model architecture consists of three core modules: a vision encoder for RGB feature extraction, a language diffusion backbone for integrating visual and language information, and a projector for mapping visual features into the language token space [10][7].

Group 2: Key Technical Innovations
- Two major breakthroughs are highlighted:
  - Localized Special-token Classification (LSC), which reduces cross-domain transfer difficulty by classifying only action-related special tokens, thus improving training efficiency (a minimal sketch follows this summary) [8][12].
  - Hierarchical Action-Structured Decoding (HAD), which explicitly models hierarchical dependencies between actions, resulting in smoother and more reasonable generated trajectories [9][13].

Group 3: Performance Evaluation
- LLaDA-VLA outperforms state-of-the-art methods across various environments, including SimplerEnv, CALVIN, and the real WidowX robot, achieving significant improvements in success rate and task-completion metrics [4][21].
- In specific task evaluations, LLaDA-VLA achieved an average success rate of 58% across multiple tasks, surpassing previous models [15].

Group 4: Experimental Results
- The model demonstrated a notable increase in task completion rates and average task lengths compared to baseline models, validating the effectiveness of the proposed LSC and HAD strategies [18][14].
- In a comparative analysis, LLaDA-VLA achieved a success rate of 95.6% on a specific task, significantly higher than other models [14][18].

Group 5: Research Significance and Future Directions
- The introduction of LLaDA-VLA establishes a solid foundation for applying large language diffusion models to robotic operation, paving the way for future research in this domain [19][21].
- The design strategies employed in LLaDA-VLA not only enhance model performance but also open new avenues for exploration in the field of embodied intelligence [19].
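One plausible reading of Localized Special-token Classification is a loss that looks only at positions whose targets are action special tokens and restricts the classification to that slice of the vocabulary, instead of predicting over the full language vocabulary at every position. The sketch below implements that reading; the vocabulary size, the reserved action-token id range, and the mask construction are assumptions for illustration, not LLaDA-VLA's actual token layout.

```python
import torch
import torch.nn.functional as F

VOCAB_SIZE = 32000
ACTION_TOKEN_RANGE = (31000, 32000)   # assumed: the last 1000 ids are reserved action special tokens

def lsc_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """
    Localized special-token classification: compute cross-entropy only at positions whose
    target is an action special token, and only over the action-token slice of the vocabulary.

    logits: (B, T, V) model outputs; labels: (B, T) target token ids.
    """
    lo, hi = ACTION_TOKEN_RANGE
    action_pos = (labels >= lo) & (labels < hi)        # (B, T) True where the target is an action token
    action_logits = logits[action_pos][:, lo:hi]       # restrict the softmax to the action-token slice
    action_targets = labels[action_pos] - lo           # re-index targets into the local action vocabulary
    return F.cross_entropy(action_logits, action_targets)

# Toy batch: a 24-token sequence whose last 8 positions hold action special tokens.
logits = torch.randn(2, 24, VOCAB_SIZE)
labels = torch.randint(0, 31000, (2, 24))
labels[:, -8:] = torch.randint(31000, 32000, (2, 8))
print(lsc_loss(logits, labels))
```

Shrinking the classification problem this way is consistent with the summary's claim that LSC eases cross-domain transfer and speeds up training, since language-token positions no longer contribute to the action loss.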