Vision-Language-Action (VLA) Models
AAAI 2026 Outstanding Paper Award | ReconVLA: A First for the Embodied Intelligence Field
具身智能之心· 2026-01-27 03:00
Core Insights
- The article emphasizes that embodied intelligence, particularly in the context of Vision-Language-Action (VLA) models, is becoming a central issue in AI research, as evidenced by the recognition of the ReconVLA model at AAAI [3][5].

Group 1: ReconVLA Model Overview
- ReconVLA is introduced as a reconstructive Vision-Language-Action model aimed at improving the precision of visual attention in robotic tasks [11][12].
- The model's core idea is to supervise the ability to reconstruct the target region rather than explicitly indicating where to look, thereby sharpening the model's attention to key objects [12][14].
- The model incorporates a dual-branch framework, one branch for action prediction and another for visual reconstruction, which provides implicit supervision through a reconstruction loss [17][18].

Group 2: Performance and Results
- ReconVLA shows significant improvements in success rates across tasks, achieving 95.6% on the ABC→D task and 98.0% on the long-horizon ABCD→D task [23][26].
- On the challenging long-horizon task "stack block," ReconVLA achieved a success rate of 79.5%, outperforming baseline models [27].
- The model demonstrated strong generalization, maintaining over 40% success rates in real-robot experiments with unseen objects [27].

Group 3: Training and Data
- Training involved a large-scale dataset with over 100,000 interaction trajectories and approximately 2 million images, enhancing the model's visual reconstruction and generalization abilities [21][25].
- Pre-training did not rely on action labels, which significantly improved performance in visual reconstruction and implicit grounding [21][31].

Group 4: Implications for Future Research
- The core contribution of ReconVLA is not a more complex architecture but an answer to the fundamental question of whether robots truly understand the world they are observing [32][34].
- Reconstructive implicit supervision is expected to advance embodied intelligence from experience-driven system design to a more robust and scalable paradigm for general intelligence research [34].
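The dual-branch idea described above, an action head trained alongside a reconstruction head whose loss implicitly supervises visual attention, can be sketched as a toy training objective. This is a minimal NumPy illustration under assumed shapes: the linear trunk and branch weights and the `lam` weight are hypothetical stand-ins, not the paper's architecture, which uses a VLM backbone and a lightweight diffusion transformer for reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a shared trunk feeding the two branches.
W_trunk = rng.normal(size=(16, 8))   # stand-in for the VLM backbone
W_act = rng.normal(size=(8, 7))      # action prediction branch (e.g. a 7-DoF action)
W_rec = rng.normal(size=(8, 4))      # visual reconstruction branch (latent tokens)

def forward(x):
    """Shared features, then one output per branch."""
    h = np.maximum(x @ W_trunk, 0.0)
    return h @ W_act, h @ W_rec

def training_loss(x, target_action, gaze_latent, lam=0.1):
    """Action loss plus reconstruction loss on the gaze-region latents;
    the reconstruction term implicitly supervises where the model attends."""
    pred_action, pred_latent = forward(x)
    action_loss = np.mean((pred_action - target_action) ** 2)
    recon_loss = np.mean((pred_latent - gaze_latent) ** 2)
    return action_loss + lam * recon_loss
```

Because the reconstruction target is the task-relevant gaze region, driving `recon_loss` down forces the shared features to encode that region's details, which is the "implicit supervision" the summary refers to.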
AAAI 2026 Outstanding Paper Award | ReconVLA: Embodied Intelligence Research Wins a Top AI Conference Best Paper Award for the First Time
机器之心· 2026-01-26 03:08
Core Insights
- The article emphasizes that embodied intelligence has become a core issue in AI research, as highlighted by the recognition of the ReconVLA model at a top AI conference [2][3].

Group 1: ReconVLA Model Overview
- ReconVLA is a reconstructive Vision-Language-Action model designed to improve the stability and precision of visual attention in robotic tasks [10][11].
- Unlike previous models, ReconVLA does not explicitly output where to look; instead it is supervised on whether it can reconstruct the target region, which ensures the model learns to attend to key objects [10][14].

Group 2: Methodology and Mechanism
- The model consists of two collaborative branches: an action prediction branch that generates action tokens and a visual reconstruction branch that encodes the gaze region into high-fidelity latent tokens [17].
- Reconstruction is performed by a lightweight diffusion transformer; minimizing the reconstruction error forces the model to encode fine-grained semantic and structural information about the target objects [13][18].

Group 3: Training and Data
- A large-scale pre-training dataset comprising over 100,000 interaction trajectories and approximately 2 million images significantly enhances the model's visual reconstruction and implicit grounding [21][23].
- Pre-training does not rely on action labels, which improves generalization across scenes [21].

Group 4: Experimental Results
- ReconVLA achieved a 79.5% success rate on the challenging long-horizon task "stack block," outperforming baseline models [26][32].
- The model led in both short- and long-horizon tasks, with average completion lengths of 3.95 and 4.23 respectively, indicating its effectiveness in complex environments [26][28].

Group 5: Contributions and Future Implications
- The core contribution of ReconVLA lies in probing whether robots truly comprehend the world they observe, providing a more natural and efficient visual alignment mechanism [31].
- The article anticipates that this work will advance embodied intelligence from experience-driven system design to a more robust and scalable paradigm for general intelligence research [33].
REALM: A real2sim Validation Benchmark for Robot Manipulation Tasks
具身智能之心· 2025-12-27 10:03
Core Background and Issues
- Vision-Language-Action (VLA) models enable robots to understand natural-language commands and perform manipulation tasks, but evaluating their generalization remains a key challenge: real-world assessment is costly and poorly repeatable, while existing simulation benchmarks cover limited disturbance types and lack high-fidelity visuals, producing a disconnect between simulated and real performance known as the "reality-simulation gap" [2].
- To address this, a research team from Czech Technical University and the University of Amsterdam developed REALM, a high-fidelity simulation environment and benchmark that establishes a strong correlation between simulated and real performance, enabling large-scale, low-cost evaluation of VLA generalization. Its core breakthroughs are a visually and control-aligned high-fidelity simulator, a multi-dimensional disturbance evaluation scheme, and an empirically validated real-sim performance correlation [2].

Related Work and Differentiating Advantages
- Existing manipulation-generalization benchmarks rely heavily on simulation but have notable limitations: GemBench and VLABench support few disturbance types, particularly behavioral disturbances, and SIMPLER achieves partial control alignment but covers few skills and objects and only a single viewpoint. REALM covers six visual, eight semantic, and seven behavioral disturbances, supports seven skills, ten scenes, and over 3,500 objects, and provides high-fidelity visuals, control alignment, and multi-view support, making it the most comprehensive generalization benchmark to date [3][4].

Benchmark Design Core Elements
1. **Skills and Task Set**: The benchmark is built around seven core manipulation skills: picking, placing, pushing, rotating, stacking, opening, and closing. Skills are defined as general capabilities independent of objects and scenes, while tasks are specific instances of skills applied to particular objects and scenes, organized across two task sets in a modular, extensible framework [5].
2. **Disturbance Design**: To probe generalization, 15 disturbance types are designed across three main categories. REALM-base focuses on eight pick-and-place tasks, while REALM-articulated targets tasks involving articulated objects such as cabinet doors [6][8].
3. **Evaluation Metrics and Control Alignment**: A tiered progression metric replaces binary success rates by decomposing each skill into ordered discrete states, giving a more granular view of model performance. Control alignment is achieved by redesigning the robot controller and fine-tuning 14 physical parameters, significantly improving the consistency between simulated and real trajectories [9].

Real-Sim Alignment and Validation
- The validation confirms that simulation can effectively stand in for real-world evaluation. Testing spanned three VLA models, seven tasks, and five disturbance types over nearly 800 trajectory sets, using the Pearson correlation coefficient, p-values, and Mean Maximum Rank Violation (MMRV). Results show a strong linear correlation between simulated and real task progression, with low MMRV and p < 0.001 across all settings, demonstrating that simulation reliably predicts real-world performance [11].

Key Experimental Results and Findings
1. **Visual Generalization**: Pure visual disturbances significantly affect performance, with average RMSD exceeding 0.12. Blur and lighting have minimal effects, likely thanks to the visual diversity of the DROID training data, whereas viewpoint changes and scene disturbances have the largest impact, indicating that robustness to visual shift remains insufficient even though some visual changes are tolerated [14].
2. **Semantic Generalization**: Despite building on large-scale pre-trained VLMs, models struggle substantially with semantic disturbances. The most damaging disturbances involve world knowledge and human needs, while spatial-relationship understanding performed unexpectedly well [17].
3. **Behavioral Generalization**: Behavioral disturbances, which require adjusting motion strategies, pose the greatest challenge. Models generalize well across skills on the same object but poorly across different objects, especially unseen ones, indicating limited behavioral adaptability [18].
4. **Robustness and Task Completion**: The -FAST model achieved the highest average task progression across all disturbances, leading in success rate on 9 of 10 tasks, while GR00T performed significantly worse with less interpretable disturbance effects. All models took 20-30 seconds on average to complete even simple tasks, with high variance, indicating difficulty completing tasks efficiently and consistently in unknown environments [19].
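The real-sim validation above rests on two quantities: the Pearson correlation between simulated and real task progression, and MMRV. A minimal sketch of how such metrics can be computed over per-policy scores follows; the `mmrv` function assumes a SIMPLER-style reading of the metric (for each policy, the largest real-world performance gap whose ordering the simulation ranks backwards), and REALM's exact formulation may differ.

```python
import numpy as np

def pearson_r(sim, real):
    """Pearson correlation between simulated and real task-progression scores."""
    sim, real = np.asarray(sim, float), np.asarray(real, float)
    return float(np.corrcoef(sim, real)[0, 1])

def mmrv(sim, real):
    """Mean Maximum Rank Violation (simplified, SIMPLER-style reading):
    for each policy i, the largest |real_i - real_j| over policies j whose
    relative ordering the simulation gets backwards, averaged over i."""
    sim, real = np.asarray(sim, float), np.asarray(real, float)
    n = len(sim)
    worst = np.zeros(n)
    for i in range(n):
        for j in range(n):
            # A violation: sim and real disagree on which of i, j is better.
            if (sim[i] < sim[j]) != (real[i] < real[j]):
                worst[i] = max(worst[i], abs(real[i] - real[j]))
    return float(worst.mean())
```

With perfectly consistent rankings, `mmrv` is 0 and `pearson_r` is close to 1, which is the regime the REALM validation reports (low MMRV, p < 0.001).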
The Field's First RL+VLA Survey: How Is Reinforcement Learning Pushing VLA Toward the Real World?
具身智能之心· 2025-12-19 00:05
Core Insights
- The article surveys the integration of Reinforcement Learning (RL) with Vision-Language-Action (VLA) models, emphasizing its role in enhancing the adaptability and robustness of robotic systems in real-world scenarios [2][34].

RL-VLA Architecture
- RL transforms VLA from "demonstration reproduction" to "result-oriented" closed-loop decision-making through reward-driven policy updates [4].
- Challenges include discrete action tokens complicating dexterous manipulation and the risk of action-distribution distortion in generative VLA [6].

Reward Design
- RL-VLA employs intrinsic rewards to encourage exploration and extrinsic rewards for task alignment, addressing the reward sparsity inherited from imitation learning [8][9].
- Physics-based simulators are highlighted, although they demand significant manual effort and computational resources [9].

Training Paradigms
- Three RL-VLA training paradigms are identified: online RL, offline RL, and test-time RL, each with distinct challenges such as non-stationary dynamics and computational cost [11][16].
- Empirical studies show that RL fine-tuning significantly improves generalization in out-of-distribution (OOD) scenarios compared with standard supervised fine-tuning [14].

Real-World Deployment
- Real-world deployment of RL-VLA models faces challenges such as sample efficiency and safety, with strategies including sim-to-real transfer and human-in-the-loop RL [21][24].
- The article stresses safe exploration and the integration of high-level semantic reasoning with low-level control strategies [28][29].

Open Challenges & Future Directions
- Key challenges include developing robust memory-retrieval mechanisms, improving sample efficiency, and ensuring reliable physical operation through risk-aware strategies [34].
- The evolution of RL is pushing VLA from high-performance imitation toward autonomous exploration and decision-making [34].
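The intrinsic-plus-extrinsic reward design discussed above can be illustrated with a generic count-based exploration bonus layered on the task reward. This is one common recipe, not the survey's specific formulation; the `beta` weight and the inverse-square-root decay are assumed choices.

```python
def combined_reward(task_reward, visit_counts, state, beta=0.05):
    """Extrinsic task reward plus a count-based intrinsic exploration bonus
    that decays as a state becomes familiar. `beta` is an assumed weight."""
    visit_counts[state] = visit_counts.get(state, 0) + 1
    intrinsic = beta / visit_counts[state] ** 0.5   # shrinks with revisits
    return task_reward + intrinsic
```

Early visits to a state earn the full bonus; repeated visits earn progressively less, nudging the policy toward exploration without overriding the sparse task signal.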
ActDistill: Tongji University Proposes an Action-Guided Distillation Framework That Speeds Up Robot Inference by 1.67x
具身智能之心· 2025-11-26 00:05
Group 1
- Deploying Vision-Language-Action (VLA) models in real-time or resource-constrained robotic systems is hampered by high computational cost and inference latency [2][3].
- Existing efficient-VLA strategies often prioritize visual-language optimizations, causing loss of key information and incoherent action semantics [2][3].

Group 2
- The proposed ActDistill framework addresses these issues with an action-prediction-oriented distillation scheme that balances efficiency and fidelity while preserving action-prediction accuracy [3][4].
- ActDistill consists of two core modules, Graph-Structured Encapsulation and Action-Guided Self-Derived Distillation, which together model action semantics and guide knowledge distillation [4][8].

Group 3
- The Graph-Structured Encapsulation module explicitly models the hierarchical evolution of action semantics and separates task-related interactions from redundant background signals [6].
- The Action-Guided Self-Derived Distillation module uses a lightweight student model that mirrors the teacher's structure at reduced depth, with dynamic routing that adaptively predicts layer-gating scores [8][11].

Group 4
- Experiments show ActDistill reaches a 73.95% success rate with a 1.59x speed-up and a 50.5% reduction in computational load compared with full models [9][12].
- The framework delivers significant efficiency and performance gains across benchmarks including LIBERO and SIMPLER [12][13].

Group 5
- Ablations underscore the importance of the Graph-Structured Encapsulation module: replacing it with a simpler architecture causes a significant performance drop [13].
- The framework maintains trajectory stability and focuses attention on action-relevant regions, demonstrating its effectiveness in practical applications [16][17].

Group 6
- ActDistill represents a novel action-centered approach to compressing VLA models, cutting computational load by over 50% while maintaining task success rates [24].
- Future directions include teacher-free or reinforcement-learning-guided variants and integrating long-horizon temporal reasoning into the routing mechanism [24].
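The action-guided distillation with dynamic layer gating described above can be sketched as a weighted sum of an action-imitation term and gated per-layer feature alignment, where a routing gate decides how much each layer's features matter. All names, shapes, and the `alpha` weighting here are illustrative assumptions, not ActDistill's actual losses.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def distill_loss(student_action, teacher_action,
                 student_feats, teacher_feats, gate_logits, alpha=0.5):
    """Action imitation plus feature alignment weighted by a dynamic-routing
    gate over layers (a sketch of action-guided distillation)."""
    gates = softmax(np.asarray(gate_logits, float))   # per-layer gating scores
    feat_align = sum(g * np.mean((s - t) ** 2)
                     for g, s, t in zip(gates, student_feats, teacher_feats))
    action_loss = np.mean((np.asarray(student_action)
                           - np.asarray(teacher_action)) ** 2)
    return float(action_loss + alpha * feat_align)
```

Because the gate is learned from action-relevant signals, layers that matter for action prediction receive larger alignment weight, which is the intuition behind routing the distillation by action semantics.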
Three Months to Master VLA, VLA + Tactile, VLA + RL, Embodied World Models, and More!
具身智能之心· 2025-08-22 00:04
Core Viewpoint
- The exploration of Artificial General Intelligence (AGI) is increasingly focused on embodied intelligence, which emphasizes how intelligent agents interact with and adapt to physical environments: perceiving, understanding tasks, executing actions, and learning from feedback [1].

Industry Analysis
- In the past two years, numerous star teams in embodied intelligence have founded valuable companies such as Xinghaitu, Galaxy General, and Zhujidongli, advancing the technology [3].
- Major domestic companies such as Huawei, JD, Tencent, Ant Group, and Xiaomi are actively investing and collaborating to build a robust ecosystem, while international players such as Tesla and investment institutions back companies like Wayve and Apptronik in autonomous driving and warehouse robotics [5].

Technological Evolution
- The development of embodied intelligence has progressed through several stages:
  - The first stage focused on grasp pose detection, which struggled with complex tasks due to the lack of context modeling [6].
  - The second stage used behavior cloning, letting robots learn from expert demonstrations but exposing weak generalization and poor performance in multi-target scenarios [6].
  - The third stage introduced Diffusion Policy methods, improving stability and generalization by modeling action sequences, followed by the Vision-Language-Action (VLA) phase, which integrates visual perception, language understanding, and action generation [7][8].
  - The fourth stage, starting in 2025, aims to combine VLA models with reinforcement learning, world models, and tactile sensing to overcome current limitations [8].

Product and Market Development
- These technologies have yielded products including humanoid robots, robotic arms, and quadrupedal robots, serving manufacturing, home services, dining, and medical rehabilitation [9].
- As the industry shifts from research to deployment, demand for engineering and systems capabilities is rising, requiring stronger skills for training and simulating policies on platforms such as Mujoco, IsaacGym, and Pybullet [23].

Educational Initiatives
- A comprehensive curriculum covers the full embodied "brain + cerebellum" technology stack, with practical applications and real-world projects for both beginners and advanced learners [10][20].
Tutorials on VLA, VLA + Tactile, VLA + RL, Embodied World Models, and More Are Here!
具身智能之心· 2025-08-18 00:07
Core Viewpoint
- The exploration of Artificial General Intelligence (AGI) is increasingly focused on embodied intelligence, which emphasizes how intelligent agents interact with and adapt to physical environments: perceiving, understanding tasks, executing actions, and learning from feedback [1].

Industry Analysis
- In the past two years, numerous star teams in embodied intelligence have founded valuable companies such as Xinghaitu, Galaxy General, and Zhujidongli, advancing the technology [3].
- Major domestic companies such as Huawei, JD.com, Tencent, Ant Group, and Xiaomi are actively investing and collaborating to build a robust ecosystem, while international players such as Tesla and investment firms back companies like Wayve and Apptronik in autonomous driving and warehouse robotics [5].

Technological Evolution
- The development of embodied intelligence has progressed through several stages:
  - The first stage focused on grasp pose detection, which struggled with complex tasks due to the lack of context modeling [6].
  - The second stage used behavior cloning, letting robots learn from expert demonstrations but exposing weak generalization and poor performance in multi-target scenarios [6].
  - The third stage introduced Diffusion Policy methods, improving stability and generalization by modeling action sequences, followed by the emergence of Vision-Language-Action (VLA) models that integrate visual perception, language understanding, and action generation [7].
  - The fourth stage, starting in 2025, aims to combine VLA models with reinforcement learning, world models, and tactile sensing to overcome current limitations [8].

Product and Market Development
- These technologies have yielded products including humanoid robots, robotic arms, and quadrupedal robots, serving manufacturing, home services, dining, and medical rehabilitation [9].
- As the industry shifts from research to deployment, demand for engineering and systems capabilities is rising, requiring training on platforms such as Mujoco, IsaacGym, and Pybullet for policy training and simulation testing [23].

Educational Initiatives
- A comprehensive curriculum covers the full embodied "brain + cerebellum" technology stack, including practical applications and advanced topics, for both beginners and those deepening their knowledge [10][20].
China's First Full-Stack Hands-On Tutorial on Embodied Brain + Cerebellum Algorithms
具身智能之心· 2025-08-07 02:38
Core Insights
- The exploration toward Artificial General Intelligence (AGI) highlights embodied intelligence as a key direction, focusing on how intelligent agents interact with and adapt to physical environments [1].
- The field's development is marked by the evolution of technology from low-level perception to high-level task understanding and generalization [6][9].

Industry Analysis
- In the past two years, numerous star teams have founded valuable companies such as Xinghaitu, Galaxy General, and Zhujidongli, moving embodied intelligence from laboratories to commercial and industrial applications [3].
- Major domestic companies such as Huawei, JD, Tencent, Ant Group, and Xiaomi are investing and collaborating to build an ecosystem, while international players such as Tesla and investment firms support advances in autonomous driving and warehouse robotics [5].

Technological Evolution
- The first stage focused on grasp pose detection, which struggled with complex tasks due to the lack of context modeling [6].
- The second stage used behavior cloning, letting robots learn from expert demonstrations but exposing weak generalization and poor performance in multi-target scenarios [6].
- The third stage introduced Diffusion Policy methods, improving stability and generalization through sequence modeling [7].
- The fourth stage, emerging in 2025, explores integrating VLA models with reinforcement learning and tactile sensing to overcome current limitations [8].

Product Development and Market Growth
- These advances have produced products including humanoid robots, robotic arms, and quadrupedal robots, serving manufacturing, home services, and healthcare [9].
- As the industry shifts from research to deployment, demand for engineering and systems capabilities is rising, necessitating stronger engineering skills [13].

Educational Initiatives
- A comprehensive curriculum helps learners master the full spectrum of embodied-intelligence algorithms, from basic tasks to advanced models like VLA and its integrations [9][13].
Li Auto's Latest DriveAction: A Benchmark for Exploring Human-Like Driving Decisions in VLA Models
自动驾驶之心· 2025-06-21 13:15
Core Insights
- The article introduces the DriveAction benchmark, designed specifically for Vision-Language-Action (VLA) models in autonomous driving, addressing limitations in existing datasets and evaluation frameworks [2][3][20].

Group 1: Research Background and Issues
- VLA models open new opportunities for autonomous driving systems, but current benchmark datasets lack scenario diversity, reliable action-level annotations, and evaluation protocols aligned with human preferences [2].
- Existing benchmarks rely primarily on open-source data, limiting their coverage of complex real-world driving scenarios and creating a disconnect between evaluation results and actual deployment risk [3].

Group 2: DriveAction Benchmark Innovations
- DriveAction is the first action-driven benchmark designed specifically for VLA models, with three core innovations:
  1. Comprehensive coverage of diverse driving scenarios, sourced from real-world data collected by production autonomous vehicles across 148 cities in China [5].
  2. Realistic action annotations derived from users' real-time driving operations, ensuring accurate capture of driver intentions [6].
  3. A tree-structured, action-driven evaluation framework integrating visual and language tasks to assess model decision-making in realistic contexts [7].

Group 3: Evaluation Results
- Models perform best in the full pipeline mode (V-L-A) and worst in the no-information mode (A); average accuracy drops by 3.3% without visual input and by 4.1% without language input [14].
- Per-task evaluation shows models excel at dynamic- and static-obstacle tasks but struggle with navigation and traffic-light tasks, highlighting areas for improvement [16][17].

Group 4: Significance and Value of DriveAction
- The introduction of DriveAction marks a significant advance in evaluating autonomous driving systems, providing a more comprehensive and realistic assessment tool that can identify model bottlenecks and guide system optimization [20].
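The modality-ablation readout above can be reproduced with a small helper that reports each evaluation mode's accuracy drop relative to the full V-L-A mode. The mode labels `L-A` (no vision) and `V-A` (no language) and the 80.0% base accuracy in the usage example are illustrative assumptions; only the 3.3- and 4.1-point drops come from the article's reported averages.

```python
def modality_ablation(acc_by_mode, full_mode="V-L-A"):
    """Accuracy drop (percentage points) of each evaluation mode relative
    to the full V-L-A mode, mirroring an ablation-style readout."""
    full = acc_by_mode[full_mode]
    return {mode: round(full - acc, 1) for mode, acc in acc_by_mode.items()}

# Hypothetical 80.0% base accuracy; drops of 3.3 and 4.1 points match the
# article's reported averages for removing visual and language input.
drops = modality_ablation({"V-L-A": 80.0, "L-A": 76.7, "V-A": 75.9})
```

Reading the result, the larger drop without language input than without visual input suggests the evaluated models lean more heavily on the language channel for this benchmark's decisions.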