Vision-Language-Action (VLA) Models
Mastering Real-Robot Skills in "Imagination": RISE Lets VLA Reinforcement Learning Leave Real-World Trial and Error Behind
机器之心· 2026-03-17 11:31
Core Insights
- The article discusses the Vision-Language-Action (VLA) model as a core framework for general manipulation tasks in embodied intelligence, highlighting its challenges in complex scenarios such as long-horizon planning and dynamic interaction [2][8]
- The RISE (χ0-RL) framework proposed by the OpenDriveLab team addresses these challenges by letting robots perform reinforcement learning in an imagined virtual space, significantly improving long-horizon task performance [2][18]

Challenges in VLA Implementation
- Reliance on imitation learning leads to cumulative errors during execution: current VLA models learn mainly from successful expert demonstrations, so they struggle to self-correct once they deviate from those paths [9][10]
- Real-world reinforcement learning faces three major constraints: the high cost of physical interaction, the safety risks of exploratory actions, and the lack of automatic reset mechanisms in real environments [11][13]
- Existing world models struggle to balance high-fidelity simulation with long-horizon consistency, limiting attempts at virtual-physical integration [8][11]

RISE Framework Overview
- RISE uses a combination world model to enable online learning without extensive physical interaction, yielding significant improvements in real-world task performance [15][18]
- The framework's core innovation is transferring physical interaction into the combination world model, creating a self-evolving cycle in imagined space [16][17]

Components of RISE
- The combination world model consists of two independent modules: a controllable dynamics model for high-fidelity physical simulation and a progress-value model for precise trajectory evaluation [18]
- The controllable dynamics model employs a task-centric batching strategy to focus on relevant actions, while the progress-value model combines progress estimation with temporal-difference learning to increase sensitivity to minor failures [18]

Self-Evolution in Imagined Spaces
- RISE runs a three-step online reinforcement learning loop entirely within the imagined space, allowing efficient policy iteration without real-world interaction [19][20]
- The loop generates future video predictions, evaluates the imagined trajectories, and updates the VLA policy based on high- and low-value actions [20]

Performance Evaluation
- RISE has been tested on three challenging real-world long-horizon tasks: dynamic brick sorting, backpack packing, and box closing, with significant improvements on all metrics [24][25]
- Success rates rose dramatically: dynamic brick sorting from 50% to 85%, backpack packing from 30% to 85%, and box closing to 95% [29]

Generalization and Robustness
- Policies trained with RISE can recover from failures and adapt to unexpected disturbances, showing intelligence beyond mere imitation [28][29]
- The model's position generalization lets it perform tasks accurately even when object placements change, without retraining [31]

Quality of Generation
- RISE's dynamics model outperforms baselines in generating high-fidelity future frames, maintaining physical consistency and avoiding common failure modes such as blurring or object teleportation [32][34]

Future Implications
- RISE represents a paradigm shift in how intelligent agents understand and interact with the world, moving from passive adaptation in the physical realm to active evolution in imagined space [35][36]
- The framework sharply reduces the cost of physical interaction, paving the way for more efficient training and deployment of robotic systems [36][37]
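The three-step imagined-space loop described above (predict, evaluate, update) can be sketched in a few lines of Python. Everything here is a toy stand-in: `policy`, `dynamics`, and `value` are hypothetical placeholder functions, not RISE's actual models, and returning high- and low-value action groups stands in for the real policy-gradient update.

```python
import random

def imagined_rl_step(policy, dynamics, value, state, n_rollouts=4, horizon=3):
    """One RISE-style iteration, entirely in imagined space (toy sketch).

    1. Roll out candidate action sequences through the learned dynamics model.
    2. Score each imagined trajectory with the progress-value model.
    3. Split the actions into high- and low-value groups; a real system
       would use these to take a gradient step on the VLA policy.
    """
    trajectories = []
    for _ in range(n_rollouts):
        s, actions, score = state, [], 0.0
        for _ in range(horizon):
            a = policy(s)
            s = dynamics(s, a)   # imagined next observation
            score += value(s)    # progress-value evaluation
            actions.append(a)
        trajectories.append((score, actions))
    trajectories.sort(key=lambda t: t[0], reverse=True)
    half = len(trajectories) // 2
    high = [a for _, acts in trajectories[:half] for a in acts]
    low = [a for _, acts in trajectories[half:] for a in acts]
    return high, low

# Toy models: state is a number, "progress" is closeness to a goal of 10.
random.seed(0)
policy = lambda s: random.choice([-1, 1])
dynamics = lambda s, a: s + a
value = lambda s: -abs(10 - s)
high, low = imagined_rl_step(policy, dynamics, value, state=0)
```

The point of the sketch is only that every quantity in the loop comes from learned models, so no real robot is touched between policy updates.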
ICLR 2026 | Robots That Evolve in "Imagination": HKUST and ByteDance Seed Propose WMPO, VLA Reinforcement Learning Inside a World Model
机器之心· 2026-03-02 03:06
Core Insights
- The article discusses WMPO (World-Model-based Policy Optimization), developed by the Hong Kong University of Science and Technology PEI-Lab and the ByteDance Seed team, which lets embodied agents train in "imagination" without extensive real-world reinforcement learning interaction [2][3]

Group 1: Traditional VLA Training Limitations
- Traditional Vision-Language-Action (VLA) models face two main bottlenecks: the inherent limitations of imitation learning and the high cost of real-world reinforcement learning [3][4]
- Imitation learning teaches models "what the correct action is" but not "what to do after a mistake", leading to cumulative errors in slightly deviated states [4]
- Real-world reinforcement learning requires millions of attempts, resulting in low sampling efficiency, hardware wear, safety risks, and high experimental cost [5]

Group 2: WMPO's Core Breakthroughs
- WMPO introduces a training paradigm that shifts policy optimization entirely into a visual world model, letting embodied agents learn error recovery from "imagined" trajectories [8]
- The method uses a pixel-level visual world model that simulates errors realistically, with a Policy Behavior Alignment mechanism improving its ability to predict the outcomes of out-of-distribution (OOD) actions [8][14]
- WMPO runs online Group Relative Policy Optimization (GRPO) in the imagined space, generating multiple candidate trajectories from the same initial state and scoring their success with a trained reward function [9][15]

Group 3: Addressing Long-Horizon Generation Challenges
- WMPO tackles long-horizon video prediction by keeping imagined visuals sharp and actions aligned over hundreds of frames, providing a stable training environment for policy optimization [10]
- Techniques such as noisy-frame conditioning and frame-level action control maintain the quality of the generated trajectories [10]

Group 4: WMPO Architecture and Learning Objectives
- WMPO's architecture relies on high-fidelity visual world modeling, predicting the next frame from current observations and actions rather than from an abstract latent space [12]
- The learning objective is self-supervised parameter optimization, turning the VLA model from a mere imitator into a self-evolving decision-maker [20]

Group 5: Experimental Results
- WMPO shows large gains in sampling efficiency: with only 128 real trajectories its success rate exceeds the best offline RL baseline by 9.8%, and the advantage grows to 15.2% with 1,280 trajectories [23]
- Self-correcting behaviors emerged: the trained model adjusted its actions after collisions or misalignments, demonstrating that it can learn from imagined failures [24]
- WMPO-trained policies execute more efficiently, with more coherent and decisive actions and shorter successful trajectories [26]

Group 6: Implications and Future Directions
- WMPO's success indicates that high-quality "imagination" can effectively replace costly "practice", addressing sampling efficiency while letting robots learn to improve through setbacks [28]
- The approach suggests a promising path toward generalization in embodied intelligence, as echoed by the Da Vinci quote, "Simplicity is the ultimate sophistication" [29]
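The GRPO step mentioned in Group 2 rests on a simple idea that is easy to show concretely: the rewards of a group of trajectories sampled from the same initial state are normalized against that group's own mean and standard deviation, so no separate value network is needed. A minimal sketch (the reward values are illustrative, not from the paper):

```python
def group_relative_advantages(rewards):
    """GRPO-style advantages: each trajectory's reward is normalized
    against its own group's mean and standard deviation, removing the
    need for a learned value baseline (toy sketch of the general idea).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        return [0.0] * n  # all rollouts equal: no learning signal
    return [(r - mean) / std for r in rewards]

# Eight imagined rollouts from the same initial state, scored by a
# trained reward function (1.0 = imagined success, 0.0 = failure).
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
adv = group_relative_advantages(rewards)
```

Trajectories with above-average imagined reward get positive advantages and are reinforced; the rest are suppressed, all without any real-world interaction.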
What Stunning Results Emerge When World Models, VLA, and Reinforcement Learning Are Combined?
具身智能之心· 2026-01-15 00:32
Core Insights
- The article discusses the potential of the Vision-Language-Action (VLA) model in general robotic manipulation, noting that its reliance on expert demonstration data limits its ability to learn from failures and self-correct [2]
- It introduces WMPO, a world-model-based policy optimization method that improves sample efficiency and overall performance in reinforcement learning (RL) without requiring real-world interaction [3]

Group 1
- The VLA model shows strong potential in robotic tasks but struggles to self-improve because of its dependence on expert data [2]
- Reinforcement learning can address these limitations by enabling self-improvement through autonomous interaction with physical environments, though it suffers high sample complexity on real robots [2]
- WMPO frames the problem as pixel-based prediction, aligning "imagined" trajectories with VLA features pre-trained on large-scale web images, and outperforms traditional offline methods [3]

Group 2
- WMPO demonstrates significant advantages: improved sample efficiency, better overall performance, emergent self-correcting behaviors, and robust generalization and lifelong-learning capability [3]
- The article links to the WMPO paper and project homepage for further exploration [4]
Just In: AgiBot (智元) Proposes SOP, Bringing Scalable Online Evolution to VLA Models in the Real World
机器之心· 2026-01-06 09:38
Core Viewpoint
- The article argues for a paradigm shift in general-purpose robotics: continuous evolution and learning in real-world environments, rather than training confined to factory settings [2][3][44]

Group 1: Challenges in Current Robotics
- AI robots trained on vast amounts of data still often fail at real-world tasks, exposing the gap between understanding and execution [8][9]
- Traditional post-training is slow and inefficient, making it hard to learn new tasks without forgetting previous skills [9][10]

Group 2: Introduction of the SOP Framework
- SOP (Scalable Online Post-training) is introduced as an approach that integrates online, distributed, and multi-task mechanisms for robot learning [4][6]
- SOP creates a closed-loop system that lets robots keep evolving after deployment, breaking the time constraints of cognitive development [6][13]

Group 3: Mechanisms of SOP
- SOP enables distributed continuous learning: multiple robots operate in parallel, sharing strategies and experience in real time [14][19]
- A cloud-based architecture handles rapid updates and learning, greatly accelerating the speed of evolution [19][20]
- A dynamic sampler focuses training on weak areas in real time, letting robots adapt and improve quickly [23]

Group 4: Performance Validation
- Experiments show SOP significantly outperforms traditional single-machine or offline methods, especially on complex tasks such as folding clothes [31][34]
- The system is remarkably robust, recovering from minor errors without task failure and running for over 36 hours continuously without performance degradation [34]

Group 5: Scalability and Efficiency
- Adding robots to the distributed system yields linear performance improvements, confirming that scaling works in real-world deployments [36][38]
- SOP substantially reduces training time, reaching performance benchmarks much faster than traditional methods [37][41]

Group 6: Implications for the Robotics Industry
- SOP redefines the robotic system lifecycle: deployment is not the end but the beginning of continuous learning and improvement [43][44]
- The approach lowers the barrier to real-world deployment, letting robots learn and evolve through practical experience rather than waiting for perfect models [44][45]
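The "dynamic sampler" that focuses on weak areas (Group 3) can be illustrated with a toy weighting scheme. The function below is only an assumption about the general idea, not SOP's actual implementation: tasks with low recent success rates simply receive proportionally more practice.

```python
def dynamic_sampling_weights(success_rates, temperature=1.0):
    """Weight each task inversely to its recent success rate, so the
    fleet spends more practice on its weak spots. Illustrative sketch
    of a 'dynamic sampler'; the real SOP mechanism is not published in
    this summary, so every detail here is assumed.
    """
    failure = [(1.0 - s) ** (1.0 / temperature) for s in success_rates]
    total = sum(failure)
    if total == 0:  # everything already solved: sample uniformly
        return [1.0 / len(success_rates)] * len(success_rates)
    return [f / total for f in failure]

# Three tasks; the third (say, folding clothes) is weakest, so it
# receives the largest share of new training episodes.
weights = dynamic_sampling_weights([0.9, 0.5, 0.2])
```

Lowering `temperature` below 1.0 would concentrate sampling even more sharply on the weakest task; raising it flattens the distribution toward uniform.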
NVIDIA Cracks Counterfactual-Reasoning VLA with Ten Million Clips! Safety Metrics Up 20%......
自动驾驶之心· 2026-01-05 03:33
Core Insights - The article discusses the development of the Counterfactual Vision-Language-Action (CF-VLA) model, which incorporates self-reflective reasoning to enhance the safety and accuracy of autonomous driving systems [3][56] - CF-VLA aims to address the limitations of existing Vision-Language-Action (VLA) models by enabling them to reflect on their planned actions before execution, thereby improving decision-making in complex driving scenarios [10][56] Group 1: Model Development - CF-VLA introduces adaptive reasoning and self-reflection capabilities, allowing the model to adjust its actions based on potential outcomes identified through counterfactual reasoning [3][10] - The model generates time-segmented meta-actions to summarize driving intentions and utilizes these to perform counterfactual reasoning, identifying unsafe behaviors and correcting them before final trajectory generation [3][10] - The "rollout-filter-label" data processing pipeline is designed to extract high-value scenarios from the model's rollout results, enhancing the training process for counterfactual reasoning [11][14] Group 2: Performance Metrics - Experiments on large-scale driving datasets show that CF-VLA improves trajectory accuracy by up to 17.6% and safety metrics by 20.5% compared to baseline models [14][56] - The model demonstrates adaptive reasoning capabilities, activating counterfactual reasoning primarily in complex scenarios, thus optimizing computational resources during testing [16][48] - The introduction of meta-actions significantly enhances the model's performance, reducing minimum average displacement error (MinADE) and minimum final displacement error (MinFDE) by approximately 9% compared to pure trajectory models [43][44] Group 3: Practical Applications - CF-VLA's self-reflective capabilities allow it to make context-specific corrections, improving safety and traffic efficiency in various driving scenarios, such as avoiding congestion and responding to 
pedestrians [57] - The model's ability to dynamically decide when to engage in reasoning helps maintain a balance between computational efficiency and decision-making quality [21][48] - The findings suggest that counterfactual self-reflection can effectively bridge reasoning and control in autonomous driving systems, providing a framework for future advancements in the field [56][57]
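The "rollout-filter-label" pipeline in Group 1 can be sketched as a simple filter over model rollouts: keep only the rollouts whose planned trajectory trips a safety check, and label them as counterfactual training scenarios. All names and the toy distance-based safety check below are hypothetical, intended only to show the shape of such a pipeline, not NVIDIA's actual code.

```python
def rollout_filter_label(rollouts, is_unsafe):
    """Sketch of a 'rollout-filter-label' pipeline: filter model
    rollouts down to those whose planned trajectory is flagged unsafe,
    and label them as high-value counterfactual training data.
    """
    labeled = []
    for scene, trajectory in rollouts:
        if is_unsafe(trajectory):
            labeled.append({
                "scene": scene,
                "unsafe_plan": trajectory,
                "label": "needs_counterfactual_correction",
            })
    return labeled

# Toy safety check: a trajectory (distances to an obstacle, in meters)
# is unsafe if the ego vehicle ever closes within 2 m.
is_unsafe = lambda traj: any(d < 2.0 for d in traj)
rollouts = [("merge", [5.0, 3.0, 1.5]), ("cruise", [8.0, 7.0, 6.0])]
flagged = rollout_filter_label(rollouts, is_unsafe)
```

Only the flagged minority of rollouts is used for counterfactual-reasoning training, which is what makes the pipeline "high-value": it concentrates supervision on the rare scenarios where the plan actually needed correction.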
Bridging the 2D-3D Gap! Peking University Proposes VIPA-VLA, Using Video to Unlock Precise Robot Manipulation
具身智能之心· 2025-12-26 00:55
Core Insights
- The article presents a new approach to robot learning that tackles the alignment of 2D visual information with 3D spatial understanding, a long-standing limitation of existing vision-language-action (VLA) models [3][6][41]
- The research introduces a novel pre-training paradigm that uses human demonstration videos to strengthen robots' spatial perception, letting them infer 3D spatial relationships from 2D visual inputs [4][40]

Research Background
- Current VLA models rely on expensive robot datasets and lack explicit 3D spatial modeling, which hampers accurate mapping from perception to physical action [6][7]
- Human demonstration videos offer a remedy: they cover diverse scenarios and carry inherent visual-physical correspondences that serve as valuable supervision signals for robot learning [7][8]

Hand3D Dataset
- The Hand3D dataset, comprising Hand3D-visual and Hand3D-action components, acts as a "3D spatial textbook" that teaches robots visual-physical alignment [8][9]
- It draws on nine heterogeneous human-manipulation datasets, ensuring wide variety in scenes and tasks [8][9]

Model Architecture: VIPA-VLA
- VIPA-VLA uses a dual-encoder architecture that integrates semantic visual features with 3D spatial features, so the model understands both scene semantics and spatial structure [15][20]
- A cross-attention fusion layer combines the two feature streams, enabling the model to learn 3D relationships from 2D inputs [17][20]

Training Process
- Training proceeds in three phases: 3D visual pre-training, 3D action pre-training, and post-training for task adaptation, so 3D capabilities are acquired gradually [21][22]
- The first phase aligns semantic and spatial features; the second teaches the model to predict 3D motion tokens from visual-language inputs [22][23]

Experimental Results
- VIPA-VLA outperformed existing baselines across tasks, reaching 92.4% success in single-view and 96.8% in dual-view settings on the LIBERO benchmark [27][28]
- On the RoboCasa benchmark it achieved 45.8% success, surpassing other models, particularly on tasks requiring precise 3D positioning [30]
- The model performed strongly in real-world tasks, achieving 60% success on the Wipe-Board task, well above competing models [31][34]

Significance and Future Directions
- The work offers a robot-learning paradigm that reduces reliance on costly robot data and improves generalization by leveraging human demonstration videos [40][41]
- Future work aims to combine this pre-training paradigm with robot-data pre-training and extend Hand3D to more complex human-robot interaction tasks [40][41]
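The cross-attention fusion layer described above follows the standard attention pattern: semantic tokens act as queries and attend over 3D spatial tokens as keys and values. Here is a dependency-free single-head sketch; VIPA-VLA's real layer is a learned multi-head module, so this shows only the mechanism, not the model, and all the toy vectors are invented.

```python
import math

def cross_attention(queries, keys, values):
    """Minimal single-head scaled dot-product cross-attention: each
    query token attends over all key/value tokens. Pure-Python sketch
    of the fusion idea used to combine semantic and 3D features.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)                       # numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# One 2-D semantic token attends over two 3D-feature tokens; it is
# more similar to the first key, so the first value dominates the fusion.
fused = cross_attention(queries=[[1.0, 0.0]],
                        keys=[[1.0, 0.0], [0.0, 1.0]],
                        values=[[10.0, 0.0], [0.0, 10.0]])
```

Because the attention weights are a convex combination, the fused feature always stays inside the span of the spatial values, which is what lets a 2D semantic token pick up 3D information without leaving the learned feature space.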
From 2D Perception to 3D Prediction: GeoPredict Rebuilds the Geometric Reasoning of VLA Models
具身智能之心· 2025-12-25 01:41
Core Insights
- The article discusses the GeoPredict framework, which addresses limitations of existing Visual-Language-Action (VLA) models by integrating predictive kinematics and 3D Gaussian geometry for stronger robotic manipulation [2][3][17]

Group 1: Technical Challenges
- Existing VLA models face three core challenges: missing spatial modeling, insufficient long-horizon prediction, and a tension between reasoning quality and inference efficiency [3][4][5]
- Traditional models follow a 2D-centric reactive decision paradigm and lack the explicit 3D geometric modeling needed for precise task execution [3]
- Reactive policies rely on instantaneous observations, which fail to capture motion inertia and dynamic scene evolution, making them inadequate for long-horizon manipulation [4]

Group 2: GeoPredict Framework Design
- GeoPredict employs a three-layer technical architecture: kinematic prediction, geometric modeling, and attention fusion, injecting future-aware geometric priors into VLA models without increasing inference cost [6]
- The first layer performs trajectory-level kinematic prediction, encoding motion history and predicting multi-step trajectories to capture future inertia [8]
- The second layer models dynamic scene evolution with predictive 3D Gaussian geometry [8]
- The third layer applies block-level causal attention to route information efficiently across different token types [8]

Group 3: Performance Validation
- GeoPredict outperforms existing methods across benchmarks, in both simulated and real-world tasks [10][14]
- On the RoboCasa benchmark it achieved an average success rate of 52.4%, a 10.1% improvement over baseline models [10]
- In real-world experiments it reached 85.0% success on spatial tasks, 95.0% on geometric tasks, and 90.0% on robustness tasks, demonstrating its 3D reasoning capability [18]

Group 4: Future Directions
- The framework can be extended, for instance by integrating multi-attribute Gaussian representations and optimizing real-time performance through model compression [17][18]
- Future work may also explore adaptive prediction horizons to balance long-horizon task performance against cumulative error [18]
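Block-level causal attention (the third layer in Group 2) amounts to a structured attention mask: a token may attend within its own block and to earlier blocks, but never to later ones. A small sketch of building such a mask; the block ordering (observation, then action tokens) is assumed here purely for illustration.

```python
def block_causal_mask(block_sizes):
    """Build a block-level causal attention mask: 1 means 'may attend',
    0 means 'masked'. Tokens attend to all tokens in their own block and
    in earlier blocks, but not to later blocks. Illustrative sketch of
    the masking pattern, not GeoPredict's implementation.
    """
    n = sum(block_sizes)
    ids, b = [], 0              # block id for each token position
    for size in block_sizes:
        ids.extend([b] * size)
        b += 1
    return [[1 if ids[j] <= ids[i] else 0 for j in range(n)]
            for i in range(n)]

# Two observation tokens followed by one action token: the action token
# sees everything, while observation tokens cannot peek at the action.
mask = block_causal_mask([2, 1])
```

Compared with strict per-token causal masking, grouping tokens into blocks lets tokens of the same type attend to each other freely while still preventing information from flowing backward from later stages.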
Nearly 300 Works! King's College London and HK PolyU Comprehensively Deconstruct VLA Models in a Clear, Systematic Roadmap
具身智能之心· 2025-12-17 00:05
Core Insights
- The article provides a comprehensive analysis of Vision-Language-Action (VLA) models, highlighting their transformative impact on robotics and outlining five core challenges: representation, execution, generalization, safety, and data evaluation [1][12]

Structure and Design
- The survey follows a natural learning path, progressing from foundational concepts to advanced topics, making it suitable for both beginners and experienced researchers [2]

Core Components of VLA Models
- VLA systems consist of three main modules: perception, brain, and action, each of which has advanced significantly in recent years; key technical choices and representative models are catalogued in the related dataset and milestone tables [3][10]

Development Milestones
- VLA's evolution is characterized as a transition from passive multimodal perception to active embodied reasoning and control, with key models, datasets, and evaluation benchmarks organized in a timeline and tables [8][13]

Key Challenges and Solutions
- The five major challenges span foundational capabilities through practical deployment needs, with visual representations of their hierarchical relationships and sub-issues [12][24][25][26][27]

Application Scenarios and Future Directions
- Major applications include household robots (unstructured environments, long-horizon tasks) and industrial or outdoor robots (high-precision operation, safety compliance); performance of representative application cases is summarized in the dataset and benchmark tables [29][30]

Future Trends
- Priorities include native multimodal architectures and shape-agnostic representations, a closed-loop evolutionary system for self-supervised exploration and online reinforcement learning, and evaluation that shifts from binary success rates to comprehensive diagnostic tests [29]
A First from the NUS Team! What Happens When VLA Gains 4D Perception?
具身智能之心· 2025-12-15 03:17
Core Insights
- The article discusses the VLA-4D model, which integrates 4D awareness into vision-language-action frameworks for coherent robotic manipulation, addressing spatiotemporal-consistency challenges in robotic tasks [2][3]

Group 1: Model Features
- VLA-4D extends traditional spatial action representation with temporal information, enabling improved spatiotemporal action planning and prediction [2]
- The model consists of two key modules: a 4D-aware visual representation that combines visual features with temporal data, and a spatiotemporal action representation that aligns multimodal representations with large language models [2]

Group 2: Applications and Challenges
- VLA-4D aims to achieve both spatial fluidity and temporal consistency in robotic operations, which is crucial in dynamic environments [2]
- Existing methods struggle to maintain temporal coherence during action execution, motivating advances like VLA-4D [2]

Group 3: Related Technologies
- The article also mentions foundational models such as 4D-VGGT for dynamic geometric perception and LLaVA-4D for enhanced dynamic scene reasoning, which complement the capabilities of VLA-4D [6][7]
Li Auto's Autonomous Driving Head Responds to Unitree's Wang Xingxing on VLA Doubts: Talking Architecture Is Cheap, Look at the Results
Feng Huang Wang· 2025-12-10 10:27
Core Viewpoint
- Li Auto's head of autonomous driving maintains, based on hands-on experience, that the VLA (Vision-Language-Action) model is the best solution for autonomous driving, countering skepticism from industry peers [1]

Group 1: Response to Industry Concerns
- Unitree's founder had questioned the VLA model, calling it a "relatively simplistic architecture" and remaining skeptical [1]
- The company counters that debating model architecture without real data is pointless, pointing to the extensive data it collects from millions of vehicles to support the VLA model [1]

Group 2: Future of Robotics
- The company's CEO predicts that over the next five to ten years embodied robots will take two main forms: automotive and humanoid [1]
- The VLA model is designed not only for current automotive products but also for future automotive embodied robots [1]