Vision-Language-Action Models
HUST & Xiaomi jointly propose MindDrive: the first VLA framework to demonstrate the effectiveness of online reinforcement learning...
自动驾驶之心· 2025-12-17 00:03
Paper authors | Haoyu Fu et al.  Editor | 自动驾驶之心
MindDrive, a new work from HUST and Xiaomi, proposes a VLA framework based on online reinforcement learning. It improves considerably over RecogDrive and ORION, and performs quite well on a Qwen2-0.5B backbone. Current VLA work in autonomous driving relies mainly on imitation learning, which brings inherent challenges such as distribution shift and causal confusion. Online reinforcement learning, through trial-and-error learning, offers a highly promising route to addressing these problems. However, applying online reinforcement learning to autonomous-driving vision-language-action models runs into the difficulty of inefficient exploration in continuous action spaces. To overcome this limitation, the HUST and Xiaomi team propose MindDrive, a vision-language-action framework containing a large language model (LLM) equipped with two distinct sets of LoRA parameters. One set of the LLM acts as a decision expert, responsible for scene reasoning and driving decisions; the other acts as an action expert that dynamically maps language decisions into drivable trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive can carry out trial-and-error learning over a limited, discrete set of language-level driving decisions rather than directly in the continuous action ...
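The core idea above, trajectory-level rewards fed back into a finite set of discrete language-level driving decisions, maps onto a very simple policy-gradient loop. Below is a minimal sketch of that loop, assuming a toy decision vocabulary, a placeholder `decision_to_trajectory` action expert, and a placeholder trajectory reward; none of these names or interfaces come from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical discrete driving-decision vocabulary (the paper's actual set is not given here).
DECISIONS = ["keep_lane", "yield", "accelerate", "turn_left", "turn_right", "stop"]

class DecisionExpert(nn.Module):
    """Stands in for the LoRA-adapted LLM head that scores discrete driving decisions."""
    def __init__(self, obs_dim: int, n_decisions: int):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_decisions))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.head(obs))

def decision_to_trajectory(decision_id: int) -> torch.Tensor:
    """Placeholder for the action expert mapping a language decision to a drivable trajectory."""
    g = torch.Generator().manual_seed(decision_id)                        # deterministic toy trajectory per decision
    return torch.cumsum(torch.randn(10, 2, generator=g) * 0.1, dim=0)     # 10 (x, y) waypoints

def trajectory_reward(traj: torch.Tensor) -> float:
    """Placeholder trajectory-level reward (e.g., forward progress minus a lateral-deviation penalty)."""
    return float(traj[-1, 0] - traj[:, 1].abs().mean())

policy = DecisionExpert(obs_dim=32, n_decisions=len(DECISIONS))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(100):                        # online trial-and-error over discrete decisions
    obs = torch.randn(32)                      # stand-in for the fused scene representation
    dist = policy(obs)
    decision = dist.sample()
    reward = trajectory_reward(decision_to_trajectory(int(decision)))
    loss = -dist.log_prob(decision) * reward   # REINFORCE: push up decisions whose trajectories score well
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because exploration happens over a handful of language decisions rather than a continuous trajectory space, each trial carries much more signal, which is the efficiency argument the summary makes.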
Breaking through the bottleneck of vision-language-action models: QDepth-VLA gives robots more precise 3D spatial perception
机器之心· 2025-11-26 07:07
Core Insights
- The article discusses the significant potential of Vision-Language-Action (VLA) models in robotic manipulation, highlighting the introduction of QDepth-VLA, which enhances 3D spatial perception and reasoning capabilities through Quantized Depth Prediction [2][4][34].

Group 1: Model Limitations and Challenges
- Despite advancements in semantic understanding and instruction following, VLA models struggle with spatial perception, particularly in fine-grained or long-duration multi-step tasks, leading to positioning errors and operational failures [5][6].
- The gap between 2D visual semantic understanding and 3D spatial perception has prompted researchers to explore various methods to integrate 3D information into VLA models, categorized into three main approaches: direct injection of 3D features, 3D feature projection, and auxiliary 3D visual prediction tasks [5][6].

Group 2: QDepth-VLA Methodology
- QDepth-VLA introduces a mechanism that combines Quantized Depth Prediction with a hybrid attention structure, allowing the model to maintain semantic consistency while enhancing 3D spatial perception and action decision-making [8][34].
- The method consists of three main components: high-precision depth annotation using Video-Depth-Anything, a Depth Expert module for structured depth-token prediction (a minimal sketch follows this summary), and a hybrid attention mechanism to manage information flow across modalities [11][13][14].

Group 3: Experimental Validation
- Comprehensive evaluations of QDepth-VLA were conducted in both simulated environments (Simpler and LIBERO) and real-world settings, demonstrating significant performance improvements in various object manipulation and multi-step tasks [18][19].
- In the Simpler simulation, QDepth-VLA achieved average success rate increases of 8.5% and 3.7% over the baseline model Open π0 [20].
- In the LIBERO simulation, QDepth-VLA outperformed the 3D-CAVLA model by approximately 2.8% [26].
- Real-world experiments showed QDepth-VLA's superior performance in pick-and-place tasks, with a 20% improvement on basic tasks and a 10% improvement in more challenging scenarios [30].

Group 4: Ablation Studies
- Ablation studies indicated that the depth supervision and hybrid attention mechanisms are crucial for QDepth-VLA's high performance, with significant drops in success rate when these components were removed [31][32].

Group 5: Future Directions
- Future research will focus on enhancing the model's spatial understanding, with potential developments in future spatial-structure prediction and more efficient depth representation learning [35][36].
- The integration of enhanced 3D geometric perception and action consistency into CASBOT's product line is anticipated, supporting various applications in both domestic and industrial settings [35][36].
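To make "Quantized Depth Prediction" concrete, here is one plausible reading of the auxiliary task in Group 2: discretize a (pseudo-)depth map into per-patch bins and train a small head, standing in for the Depth Expert, to classify each patch's depth token alongside the action loss. The bin count, patch grid, depth range, and loss weighting below are my assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_DEPTH_BINS = 64        # assumed quantization granularity
PATCH_GRID = 14            # assumed ViT-style 14x14 patch grid

def quantize_depth(depth_map: torch.Tensor) -> torch.Tensor:
    """Convert a (B, H, W) metric depth map into per-patch discrete depth tokens."""
    patches = F.adaptive_avg_pool2d(depth_map.unsqueeze(1), PATCH_GRID).squeeze(1)   # (B, 14, 14)
    d_min, d_max = 0.1, 10.0                                                         # assumed range in meters
    norm = (patches.clamp(d_min, d_max) - d_min) / (d_max - d_min)
    return (norm * (NUM_DEPTH_BINS - 1)).round().long().flatten(1)                   # (B, 196)

class DepthExpertHead(nn.Module):
    """Stand-in auxiliary head: predicts one depth token per visual token."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, NUM_DEPTH_BINS)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:   # (B, N, D) -> (B, N, bins)
        return self.proj(visual_tokens)

# Toy batch: visual tokens from the VLA backbone plus quantized pseudo-depth labels
# (in the paper the labels come from Video-Depth-Anything annotations).
visual_tokens = torch.randn(2, PATCH_GRID * PATCH_GRID, 512)
depth_labels = quantize_depth(torch.rand(2, 224, 224) * 10.0)

head = DepthExpertHead(hidden_dim=512)
logits = head(visual_tokens)
aux_depth_loss = F.cross_entropy(logits.flatten(0, 1), depth_labels.flatten())
# During training this auxiliary loss would be added to the action loss with a weighting coefficient.
```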
SemanticVLA: semantic-aligned pruning and enhancement for efficient robotic manipulation
具身智能之心· 2025-11-14 16:03
Core Insights
- The article discusses significant advancements in visual-language-action models for robotic operations, highlighting the challenges faced in dynamic and cluttered environments, which hinder the deployment of existing models [2][4].

Research Background
- Visual-language-action models have made notable progress in robotic operations through pre-trained vision-language models that enable end-to-end mapping from language to action. However, two main bottlenecks limit their deployment in real-world scenarios: low computational efficiency and weak task-grounding capability [2].

Key Innovations
- Introduction of a semantic-guided dual-visual pruner that addresses visual redundancy through instruction-aware token filtering and geometry-aware aggregation, while maintaining semantic alignment (a minimal sketch of instruction-aware filtering follows this summary) [3].

Main Work
Overall Framework Design
- The framework processes real-time visual observations, robot state (e.g., joint angles, end-effector pose), and natural language instructions to predict future action sequences. It employs two parallel paths for visual input processing, culminating in an end-to-end pipeline for action mapping [4].

Visual Perception Redundancy
- The general visual encoder processes all pixels uniformly, leading to background interference and environmental noise, which increases computational cost and dilutes attention on critical task cues [5].

Semantic Complementary Layered Fusion
- A semantic complementary layered fusion mechanism integrates dense patch features with sparse semantic tokens, enhancing the alignment of instruction semantics with spatial structure [5].

Semantic Conditioned Action Coupler
- The design reconstructs the mapping from vision to action, improving the efficiency and interpretability of action decoding by representing actions as semantically coherent types [5].

Experimental Results
Efficiency Advantages
- The model reduces training cost by 3.0x, inference latency by 2.7x, and compresses visual tokens by 8-16x, significantly enhancing throughput [14].

Real-World Performance
- In long-range tasks, the model's success rate reaches 77.8%, surpassing the OpenVLA-OFT model by 22.2% and demonstrating strong generalization [14].

Ablation Studies
- The dual-pruning combination of the SD-Pruner enhances success rates by 2.1%-5.2%, achieving the best performance-efficiency balance at an 8x sparsification ratio [16].
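Instruction-aware token filtering can be illustrated very compactly: score each visual token against a pooled instruction embedding and keep only the top fraction. The sketch below is a generic version of that idea, not SemanticVLA's actual pruner; the cosine-similarity scoring and the 0.125 keep ratio (8x compression) are assumptions chosen to match the reported compression range.

```python
import torch
import torch.nn.functional as F

def instruction_aware_prune(visual_tokens: torch.Tensor,
                            instruction_emb: torch.Tensor,
                            keep_ratio: float = 0.125) -> torch.Tensor:
    """
    Keep only the visual tokens most relevant to the instruction.

    visual_tokens:   (B, N, D) patch features from the vision encoder
    instruction_emb: (B, D) pooled text embedding of the instruction
    keep_ratio:      fraction of tokens retained (0.125 -> 8x compression)
    """
    scores = F.cosine_similarity(visual_tokens, instruction_emb.unsqueeze(1), dim=-1)   # (B, N)
    k = max(1, int(visual_tokens.shape[1] * keep_ratio))
    top_idx = scores.topk(k, dim=1).indices                                             # (B, k)
    batch_idx = torch.arange(visual_tokens.shape[0]).unsqueeze(1)
    return visual_tokens[batch_idx, top_idx]                                            # (B, k, D)

# Toy usage: 256 patch tokens compressed 8x down to 32 instruction-relevant tokens.
tokens = torch.randn(2, 256, 768)
text = torch.randn(2, 768)
pruned = instruction_aware_prune(tokens, text)
print(pruned.shape)   # torch.Size([2, 32, 768])
```

Dropping background tokens before the LLM is what drives the reported training-cost and latency reductions, since attention cost scales with the number of visual tokens.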
Latest from Westlake University! RobustVLA: a robustness-aware reinforcement post-training method for VLA models (outperforms SOTA approaches)
具身智能之心· 2025-11-08 04:00
Core Insights
- The article discusses the development of RobustVLA, a lightweight online reinforcement learning post-training method aimed at enhancing the robustness of Vision-Language-Action (VLA) models in the face of environmental uncertainties [1][5][20].
- It highlights the limitations of existing methods that focus primarily on reward maximization without addressing the model's sensitivity to disturbances, which can lead to significant performance drops in real-world scenarios [5][20].

Design Logic of RobustVLA
- RobustVLA incorporates two key regularization terms: Jacobian regularization to reduce sensitivity to observation noise and smoothness regularization to stabilize policies under action disturbances (a minimal sketch follows this summary) [4][7][8].
- The method emphasizes robustness-aware reinforcement learning post-training as a critical step in improving the reliability of VLA models [1][5].

Robustness Analysis
- The article outlines a theoretical analysis of robustness, establishing error-amplification bounds, reward-drift control, and guarantees of robust stability [4][11][18].
- It identifies that Jacobian sensitivity directly drives error amplification, and that reducing this sensitivity effectively constrains performance loss [12][18].

Experimental Results
- In experiments, RobustVLA demonstrated an average success rate of 82.5% under observation perturbations, outperforming previous models such as OpenVLA-OFT and RIPT-VLA [20][21].
- Under action perturbations, RobustVLA achieved an average success rate of 54.8%, exceeding OpenVLA-OFT's 53.5% [22].
- Under combined disturbances, RobustVLA-C achieved an average success rate of 82.1%, showcasing the synergy of autonomous interaction and dual regularization [23].

Transfer Learning and Ablation Studies
- Transfer-learning experiments showed that RobustVLA improved out-of-distribution adaptability by 8.0% and 16.0% on specific tasks compared to zero-shot transfer [25].
- Ablation studies confirmed that removing either Jacobian or smoothness regularization led to performance declines, underscoring that both regularization strategies are needed for robustness [27].
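One lightweight way to realize the two regularizers named above is with finite differences rather than explicit Jacobians: penalize how much the predicted action moves under injected observation noise, and penalize abrupt step-to-step action jumps so action-side disturbances are not amplified. This is a sketch under those assumptions; the perturbation scale, the temporal reading of the smoothness term, and the 0.1 weights are mine, not the paper's formulation.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Stand-in for the VLA action head: observation features -> continuous 7-DoF action."""
    def __init__(self, obs_dim=64, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.Tanh(), nn.Linear(128, act_dim))

    def forward(self, obs):
        return self.net(obs)

def jacobian_reg(policy, obs, sigma=1e-2):
    """Finite-difference proxy for Jacobian sensitivity:
    how much the action changes per unit of injected observation noise."""
    noise = torch.randn_like(obs) * sigma
    return ((policy(obs + noise) - policy(obs)) ** 2).sum(-1).mean() / sigma ** 2

def smoothness_reg(actions):
    """Temporal smoothness: penalize large jumps between consecutive actions so that
    small action-side disturbances are not amplified from step to step."""
    return ((actions[1:] - actions[:-1]) ** 2).sum(-1).mean()

policy = Policy()
traj_obs = torch.randn(16, 64)                 # toy trajectory of 16 observation feature vectors
actions = policy(traj_obs)
# In post-training, both terms would be added to the RL objective with weighting coefficients.
reg_loss = 0.1 * jacobian_reg(policy, traj_obs) + 0.1 * smoothness_reg(actions)
reg_loss.backward()
```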
VLA²: Zhejiang University x Westlake University propose an agentic VLA framework with greatly improved manipulation generalization
具身智能之心· 2025-10-24 00:40
Core Insights
- The article presents VLA², a framework designed to enhance the capabilities of vision-language-action models, particularly in handling unseen concepts in robotic tasks [1][3][12].

Method Overview
- VLA² integrates three core modules: initial information processing, cognition and memory, and task execution [3][5].
- The framework utilizes GLM-4V for task decomposition, MM-GroundingDINO for object detection, and web image retrieval for visual memory enhancement (a control-flow sketch follows this summary) [4][7].

Experimental Validation
- VLA² was compared with state-of-the-art (SOTA) models on the LIBERO benchmark, showing competitive results and excelling in scenarios requiring strong generalization [6][9].
- In hard scenarios, VLA² achieved a 44.2% improvement in success rate over simply fine-tuning OpenVLA [9][10].

Key Mechanisms
- The framework's performance is significantly influenced by three mechanisms: visual mask injection, semantic replacement, and web retrieval [7][11].
- Ablation studies confirmed that each mechanism contributes notably to the model's performance, especially on challenging tasks [11].

Conclusion and Future Directions
- VLA² successfully expands the cognitive and operational capabilities of VLA models for unknown objects, providing a viable solution for robotic tasks in open-world settings [12].
- Future work will focus on exploring its generalization capabilities in real-world applications and expanding support for more tools and tasks [12].
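The named modules suggest a straightforward agentic control flow: decompose the instruction, ground the target object, fall back to web retrieval when the concept is unseen, then hand a visual hint to the low-level policy. The sketch below shows only that control flow; every helper here is a hypothetical stub, not the real GLM-4V, MM-GroundingDINO, or OpenVLA API.

```python
from dataclasses import dataclass

@dataclass
class SubTask:
    instruction: str      # e.g. "pick up the durian"
    target_object: str    # noun phrase the detector must ground

def decompose_task(instruction: str) -> list:
    # Hypothetical GLM-4V wrapper; here a trivial single-step decomposition.
    return [SubTask(instruction=instruction, target_object=instruction.split()[-1])]

def ground_object(image, phrase: str, reference_images=None):
    # Hypothetical MM-GroundingDINO wrapper; pretends the phrase is an unseen concept
    # until retrieved reference images are supplied, then returns a fake box.
    return (0, 0, 10, 10) if reference_images else None

def retrieve_reference_images(phrase: str) -> list:
    # Hypothetical web-image retrieval that builds a visual memory for the unseen concept.
    return [f"web_image_of_{phrase}.jpg"]

def run_vla(image, sub_instruction: str, visual_hint):
    # Hypothetical call into the underlying VLA policy (a fine-tuned OpenVLA in the paper).
    print(f"executing '{sub_instruction}' with visual hint {visual_hint}")

def execute(instruction: str, image=None):
    for sub in decompose_task(instruction):
        box = ground_object(image, sub.target_object)
        if box is None:                                     # unseen concept: enrich visual memory, retry grounding
            refs = retrieve_reference_images(sub.target_object)
            box = ground_object(image, sub.target_object, reference_images=refs)
        run_vla(image, sub.instruction, visual_hint=box)    # mask/box injection guides the low-level policy

execute("pick up the durian")
```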
SOTA even without much data? Tsinghua & Shanghai AI Lab crack two major bottlenecks in robot RL
量子位· 2025-09-26 02:08
Core Viewpoint
- The article discusses the development of SimpleVLA-RL, an end-to-end online training solution for Vision-Language-Action (VLA) models, aimed at enhancing the flexibility and performance of robots in complex environments while addressing existing training bottlenecks [3][12].

Group 1: Key Challenges in Existing Training Paradigms
- Current training paradigms face significant challenges, including high data-collection costs and insufficient generalization capability [2][8].
- The reliance on large-scale, high-quality robot operation trajectories limits scalability and increases costs, making data acquisition a major hurdle [8].
- The models struggle with generalization, particularly on out-of-distribution tasks and in new environments, leading to performance drops on long-sequence dependencies and combinatorial tasks [8][9].

Group 2: SimpleVLA-RL Framework
- SimpleVLA-RL employs a combination of interactive trajectory sampling, result-based rewards, and enhanced exploration to tackle the three core challenges of VLA model training (a minimal sketch of this loop follows this summary) [5][6].
- The framework demonstrates state-of-the-art (SoTA) performance on standard benchmarks such as LIBERO and RoboTwin, achieving significant improvements even with limited data [5][21].
- With a single demonstration per task, the average success rate in LIBERO increased from 48.9% to 96.9% after applying SimpleVLA-RL [5].

Group 3: Performance Metrics and Results
- SimpleVLA-RL achieved an average success rate of 99.1% on LIBERO, with long-sequence tasks improving by 12.0 percentage points [21].
- On RoboTwin 1.0, the average success rate rose from 39.8% to 70.4%, with specific tasks such as "Blocks Stack" improving by 33.1 percentage points [23].
- On RoboTwin 2.0, the average success rate improved from 38.3% to 68.8% [25].

Group 4: Innovations and Discoveries
- The training process led to the emergence of new operational strategies, such as the "Pushcut" phenomenon, in which the model autonomously discovers more efficient methods beyond the human demonstrations [10][31].
- This phenomenon indicates that reinforcement learning can enable VLA models to surpass the limitations of human demonstration patterns, paving the way for future adaptive VLA model development [31].
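The three ingredients in Group 2 (interactive trajectory sampling, result-based rewards, enhanced exploration) map onto a simple outcome-reward RL loop. The sketch below shows the shape of that loop with a binary success reward and a raised sampling temperature for exploration; the environment interface, policy, and update rule are placeholders I chose, not the framework's actual implementation.

```python
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Stand-in for a VLA policy that emits one discrete action token per step."""
    def __init__(self, obs_dim=32, n_actions=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def dist(self, obs, temperature=1.0):
        # A raised sampling temperature is one cheap way to encourage exploration.
        return torch.distributions.Categorical(logits=self.net(obs) / temperature)

def fake_env_step(action):
    # Placeholder environment: random next observation and a random success flag.
    return torch.randn(32), bool(torch.rand(()) > 0.5)

def rollout(policy, env_step, horizon=20, temperature=1.3):
    """Interactively sample one trajectory; the only reward is the final task outcome (0 or 1)."""
    obs, log_probs, success = torch.randn(32), [], False
    for _ in range(horizon):
        d = policy.dist(obs, temperature)
        a = d.sample()
        log_probs.append(d.log_prob(a))
        obs, success = env_step(a)
    return torch.stack(log_probs), float(success)

policy = ToyVLAPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
for _ in range(50):
    log_probs, outcome = rollout(policy, fake_env_step)
    loss = -(log_probs.sum() * outcome)   # reinforce whole trajectories that ended in success
    opt.zero_grad(); loss.backward(); opt.step()
```

Because the reward is computed only from the end result, no dense reward engineering or extra demonstrations are needed, which is what keeps the data requirement low.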
A survey based on 313 VLA papers, plus a 1,661-character condensed version
理想TOP2· 2025-09-25 13:33
Core Insights
- The emergence of Vision-Language-Action (VLA) models signifies a paradigm shift in robotics from traditional policy-based control to general-purpose robotic technology, enabling active decision-making in complex environments [12][22].
- The review categorizes VLA methods into five paradigms: autoregressive, diffusion-based, reinforcement learning, hybrid, and specialized methods, providing a comprehensive overview of their design motivations and core strategies [17][20].

Summary by Categories
Autoregressive Models
- Autoregressive models generate action sequences as time-dependent processes, leveraging historical context and sensory inputs to produce actions step by step [44][46].
- Key innovations include unified multimodal Transformers that tokenize various modalities, enhancing cross-task action generation [48][49].
- Open challenges include safety, interpretability, and alignment with human values [47][56].

Diffusion-Based Models
- Diffusion models frame action generation as a conditional denoising process, allowing probabilistic action generation and modeling of multimodal action distributions (a minimal denoising sketch follows this summary) [59][60].
- Innovations include modular optimization and dynamic adaptive reasoning to improve efficiency and reduce computational cost [61][62].
- Limitations involve maintaining temporal consistency in dynamic environments and high computational resource demands [5][60].

Reinforcement Learning Models
- Reinforcement learning models integrate VLMs with reinforcement learning to generate context-aware actions in interactive environments [6].
- Innovations focus on reward-function design and safety-alignment mechanisms that prevent high-risk behavior while maintaining task performance [6][7].
- Challenges include the complexity of reward engineering and the high computational cost of scaling to high-dimensional real-world environments [6][9].

Hybrid and Specialized Methods
- Hybrid methods combine different paradigms to leverage the strengths of each, such as using diffusion for smooth trajectory generation while retaining autoregressive reasoning capability [7].
- Specialized methods adapt VLA frameworks to specific domains such as autonomous driving and humanoid robot control, enhancing practical applications [7][8].
- The focus is on efficiency, safety, and human-robot collaboration in real-time inference and interactive learning [7][8].

Data and Simulation Support
- The development of VLA models relies heavily on high-quality datasets and simulation platforms to address data scarcity and testing risk [8][34].
- Real-world datasets such as Open X-Embodiment and simulation tools such as MuJoCo and CARLA are crucial for training and evaluating VLA models [8][36].
- Challenges include high annotation costs and insufficient coverage of rare scenarios, which limit the generalization capabilities of VLA models [8][35].

Future Opportunities
- The integration of world models and cross-modal unification aims to evolve VLA into a comprehensive framework for environment modeling, reasoning, and interaction [10].
- Causal reasoning and real-interaction models are expected to overcome the limitations of "pseudo-interaction" [10].
- Establishing standardized frameworks for risk assessment and accountability will move VLA from experimental tool to trusted partner in society [10].
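For the diffusion-based paradigm, "action generation as a conditional denoising process" means: start an action chunk from Gaussian noise and iteratively remove noise predicted by a network conditioned on the observation. The sketch below is a deliberately crude DDPM-style illustration of that view; the network shape, the Euler-style update, and the step count are arbitrary choices, not any specific model from the survey.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Toy epsilon-network: predicts the noise in a noisy action chunk, conditioned on an observation embedding."""
    def __init__(self, act_dim=7, horizon=8, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim * horizon + cond_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, act_dim * horizon))

    def forward(self, noisy_actions, cond, t):
        x = torch.cat([noisy_actions.flatten(1), cond, t], dim=-1)
        return self.net(x).view_as(noisy_actions)

@torch.no_grad()
def sample_action_chunk(model, cond, steps=50, act_dim=7, horizon=8):
    """Conditional denoising: start from Gaussian noise and iteratively remove predicted noise."""
    x = torch.randn(cond.shape[0], horizon, act_dim)
    for i in reversed(range(steps)):
        t = torch.full((cond.shape[0], 1), i / steps)
        eps = model(x, cond, t)
        x = x - eps / steps          # crude Euler-style update; real samplers follow a proper noise schedule
    return x

model = NoisePredictor()
obs_embedding = torch.randn(2, 64)   # stand-in for fused vision-language features
actions = sample_action_chunk(model, obs_embedding)
print(actions.shape)                 # torch.Size([2, 8, 7])
```

The many denoising steps per action chunk are exactly the inference-cost concern the survey lists as a limitation of this paradigm.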
A brand-new paradigm! LLaDA-VLA: the first VLA model based on a large language diffusion model
具身智能之心· 2025-09-12 00:05
Core Viewpoint
- The article discusses the advancements in Vision-Language Models (VLMs) and introduces LLaDA-VLA, the first Vision-Language-Action model built on large language diffusion models, which demonstrates superior multi-task performance in robotic action generation [1][5][19].

Group 1: Introduction to LLaDA-VLA
- LLaDA-VLA integrates Masked Diffusion Models (MDMs) into robotic action generation, leveraging pre-trained multimodal large language diffusion models for fine-tuning and enabling parallel action-trajectory prediction [5][19].
- The model architecture consists of three core modules: a vision encoder for RGB feature extraction, a language diffusion backbone for integrating visual and language information, and a projector for mapping visual features into the language token space [10][7].

Group 2: Key Technical Innovations
- Two major breakthroughs are highlighted:
  - Localized Special-token Classification (LSC), which reduces cross-domain transfer difficulty by classifying only action-related special tokens, thus improving training efficiency (a minimal sketch follows this summary) [8][12].
  - Hierarchical Action-Structured Decoding (HAD), which explicitly models hierarchical dependencies between actions, resulting in smoother and more reasonable generated trajectories [9][13].

Group 3: Performance Evaluation
- LLaDA-VLA outperforms state-of-the-art methods across various environments, including SimplerEnv, CALVIN, and the real WidowX robot, achieving significant improvements in success rate and task-completion metrics [4][21].
- In specific task evaluations, LLaDA-VLA achieved an average success rate of 58% across multiple tasks, surpassing previous models [15].

Group 4: Experimental Results
- The model demonstrated a notable increase in task completion rates and average task lengths compared to baseline models, validating the effectiveness of the proposed LSC and HAD strategies [18][14].
- In a comparative analysis, LLaDA-VLA achieved a success rate of 95.6% on a specific task, significantly higher than other models [14][18].

Group 5: Research Significance and Future Directions
- The introduction of LLaDA-VLA establishes a solid foundation for applying large language diffusion models to robotic operation, paving the way for future research in this domain [19][21].
- The design strategies employed in LLaDA-VLA not only enhance model performance but also open new avenues for exploration in the field of embodied intelligence [19].
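One plausible reading of Localized Special-token Classification is a loss that looks only at positions whose targets are action special tokens and restricts the classification to that slice of the vocabulary, instead of predicting over the full language vocabulary at every position. The sketch below implements that reading; the vocabulary size, the reserved action-token id range, and the mask construction are assumptions for illustration, not LLaDA-VLA's actual token layout.

```python
import torch
import torch.nn.functional as F

VOCAB_SIZE = 32000
ACTION_TOKEN_RANGE = (31000, 32000)   # assumed: the last 1000 ids are reserved action special tokens

def lsc_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """
    Localized special-token classification: compute cross-entropy only at positions whose
    target is an action special token, and only over the action-token slice of the vocabulary.

    logits: (B, T, V) model outputs; labels: (B, T) target token ids.
    """
    lo, hi = ACTION_TOKEN_RANGE
    action_pos = (labels >= lo) & (labels < hi)        # (B, T) True where the target is an action token
    action_logits = logits[action_pos][:, lo:hi]       # restrict the softmax to the action-token slice
    action_targets = labels[action_pos] - lo           # re-index targets into the local action vocabulary
    return F.cross_entropy(action_logits, action_targets)

# Toy batch: a 24-token sequence whose last 8 positions hold action special tokens.
logits = torch.randn(2, 24, VOCAB_SIZE)
labels = torch.randint(0, 31000, (2, 24))
labels[:, -8:] = torch.randint(31000, 32000, (2, 8))
print(lsc_loss(logits, labels))
```

Shrinking the classification problem this way is consistent with the summary's claim that LSC eases cross-domain transfer and speeds up training, since language-token positions no longer contribute to the action loss.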