Vision-Language-Action (VLA) Models

How are VLA models built on large VLMs advancing robotic manipulation tasks, step by step?
具身智能之心· 2025-08-26 00:03
When a robot can both "understand" an instruction and "get the work done" on its own: how are large VLMs rewriting the rules of robotic manipulation? Imagine telling a robot, "Fold the shirts drying on the balcony and put them on the third shelf of the wardrobe," and it can see where the clothes are, understand the action logic behind "fold" and "put away," and even work around the clutter inside the wardrobe to finish the task. A few years ago this sounded like science fiction: traditional robots were either trapped in a cage of predefined tasks, failing to recognize even a new cup, or left helpless by vague natural-language instructions, let alone able to adjust their movements flexibly in messy real-world environments. Now a transformation driven by Vision-Language-Action (VLA) models is breaking through these limits, and the core force behind it is the large vision-language model (VLM) we have all come to know. In the past, research on robotic manipulation kept circling inside a "modularity trap": visual recognition, language parsing, and motion control each formed their own camp, like disconnected gears that could hardly mesh. It was not until large VLMs ...
Autonomous-driving VLA upgraded again! Bosch's latest IRL-VLA: a reward world model builds a new closed-loop reinforcement learning framework
自动驾驶之心· 2025-08-12 23:33
Core Viewpoint
- The article discusses the introduction of IRL-VLA, a novel closed-loop reinforcement learning framework that integrates inverse reinforcement learning with a reward world model for vision-language-action (VLA) autonomous driving, addressing limitations of existing open-loop imitation learning methods and simulation-based training [2][3][6].

Group 1: Key Issues in VLA
- Existing VLA architectures are often trained in open-loop settings with imitation learning, which limits performance to reproducing the behaviors recorded in the dataset [2][3].
- Closed-loop training heavily relies on high-fidelity sensor simulations, but domain gaps and computational efficiency issues hinder the generalization of VLA models [2][3].

Group 2: Introduction of IRL-VLA
- Teams from Bosch, Shanghai University, and Tsinghua University proposed IRL-VLA, a new closed-loop reinforcement learning method that combines inverse reinforcement learning with a purpose-designed VLA model [3][5].
- IRL-VLA employs a three-stage paradigm: pre-training the VLA policy through imitation learning, constructing a lightweight reward world model via inverse reinforcement learning, and enhancing planning performance through reward-guided reinforcement learning using Proximal Policy Optimization (PPO); a minimal sketch of this pipeline follows this summary [3][5].

Group 3: Performance Achievements
- IRL-VLA achieved state-of-the-art (SOTA) performance on the NAVSIM v2 end-to-end driving benchmark and secured second place in the CVPR 2025 autonomous driving competition [5][9].
- The framework demonstrated significant improvements in balancing safety, driving comfort, and traffic efficiency [5][9].

Group 4: Contributions of IRL-VLA
- An efficient reward world model (RWM) based on inverse reinforcement learning captures the multimodal and multi-objective nature of driving while avoiding the need for computationally intensive simulations [9][11].
- A new VLA model performs strongly in both imitation learning and reinforcement learning settings, achieving top performance across different training paradigms [11][12].

Group 5: Experimental Results
- On the NAVSIM benchmark, IRL-VLA's pre-trained model (IRL-VLA-PT) achieved a competitive EPDMS score of 74.4, outperforming several state-of-the-art methods [42].
- The model maintained high safety performance while significantly improving metrics related to driving comfort and progress [42][43].

Group 6: Technical Details
- The IRL-VLA model uses a V2-99 backbone network and processes multi-view camera inputs at a resolution of 256 × 704 [35].
- Training involved 100 epochs of pre-training with an AdamW optimizer, followed by reinforcement learning with the PPO algorithm on NVIDIA A100 GPUs [35][36].

Group 7: Conclusion
- IRL-VLA is a pioneering closed-loop VLA method that does not rely on simulators, paving the way for future advancements in closed-loop autonomous driving systems [46].
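As a rough illustration of the three-stage recipe summarized above (imitation pre-training, a reward world model learned by inverse RL, then reward-guided PPO), the sketch below wires up toy PyTorch modules end to end. The network sizes, the margin-style IRL loss, and the simplified single-epoch PPO update are assumptions made for illustration; this is not the paper's implementation.

```python
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """Stand-in for the VLA planner: maps a scene embedding to a trajectory action."""
    def __init__(self, obs_dim=64, act_dim=6):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.net(obs), self.log_std.exp())

class RewardWorldModel(nn.Module):
    """Learned scorer that stands in for an expensive closed-loop simulator."""
    def __init__(self, obs_dim=64, act_dim=6):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

policy, rwm = TinyPolicy(), RewardWorldModel()
opt_pi = torch.optim.AdamW(policy.parameters(), lr=3e-4)
opt_r = torch.optim.AdamW(rwm.parameters(), lr=3e-4)
obs = torch.randn(256, 64)        # placeholder scene embeddings
expert_act = torch.randn(256, 6)  # placeholder expert trajectories

# Stage 1: imitation pre-training (behavior cloning on expert actions).
for _ in range(200):
    bc_loss = -policy.dist(obs).log_prob(expert_act).sum(-1).mean()
    opt_pi.zero_grad(); bc_loss.backward(); opt_pi.step()

# Stage 2: inverse RL -- train the reward model to score expert actions above policy samples.
for _ in range(200):
    with torch.no_grad():
        sampled = policy.dist(obs).sample()
    irl_loss = torch.nn.functional.softplus(rwm(obs, sampled) - rwm(obs, expert_act)).mean()
    opt_r.zero_grad(); irl_loss.backward(); opt_r.step()

# Stage 3: reward-guided fine-tuning with a clipped, single-epoch PPO-style update.
for _ in range(100):
    with torch.no_grad():
        act = policy.dist(obs).sample()
        old_lp = policy.dist(obs).log_prob(act).sum(-1)
        adv = rwm(obs, act)
        adv = (adv - adv.mean()) / (adv.std() + 1e-6)   # normalized advantage proxy
    ratio = (policy.dist(obs).log_prob(act).sum(-1) - old_lp).exp()
    ppo_loss = -torch.min(ratio * adv, ratio.clamp(0.8, 1.2) * adv).mean()
    opt_pi.zero_grad(); ppo_loss.backward(); opt_pi.step()
```

The point of the sketch is the data flow: in stage 3 the learned reward world model, not a high-fidelity simulator, supplies the learning signal to the policy.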
A 10,000-word deep dive into the "growth history" of embodied intelligence: which mountains and seas has it crossed, and where is it headed next?
具身智能之心· 2025-08-08 00:08
Core Viewpoint
- The forum emphasizes the rapid advancements in embodied intelligence and robotics, highlighting the need for a unique computational brain that can translate computational power into physical capabilities, addressing the gap between AI's performance in games like Go and its struggles with simple physical tasks [4].

Group 1: Evolution of Embodied Intelligence
- Over the past decade, embodied intelligence has evolved significantly, with robotics being a closed-loop system that integrates perception, action, and the physical world, emphasizing the importance of adhering to physical laws [5][6].
- The gap between research prototypes and practical applications is highlighted, with the Technology Readiness Level (TRL) being a key metric for assessing the maturity of robotic applications, where levels 8 to 9 are crucial for industry acceptance [6].

Group 2: Opportunities and Challenges in Robotics
- The forum discusses the historical context of machine learning's impact on robotics, noting that advancements in sensors, algorithms, and deep learning have led to significant progress, but achieving high performance in the physical world remains a challenge [9][13].
- The importance of scalable learning systems is emphasized, with a shift from small-scale learning to large-scale applications being crucial for overcoming challenges in robotics [15].

Group 3: Specialized vs. General Intelligence
- The discussion contrasts Artificial Specialized Intelligence (ASI) with Artificial General Intelligence (AGI), suggesting that while ASI focuses on high performance in specific tasks, AGI aims for broader capabilities [23][25].
- The advantages of specialized models include efficiency, robustness, and suitability for real-time applications, while general models offer greater flexibility but are more complex and resource-intensive [27][30].

Group 4: Future Directions in Robotics
- The emergence of visual-language-action (VLA) models, such as RT-2, represents a significant step forward, allowing robots to execute tasks through internet-based API calls, indicating a trend towards more versatile robotic capabilities [39][40].
- The development of the second-generation VLA model, PI-Zero, showcases advancements in continuous action generation, enabling robots to perform complex tasks with higher efficiency [46][48].

Group 5: Data and Performance in Robotics
- The forum highlights the necessity of large-scale data collection for training robotic models, with the RTX dataset being a pivotal resource for developing cross-embodied models that outperform specialized counterparts [42][43].
- The importance of performance metrics is underscored, with a focus on achieving high reliability and robustness in robotic systems to ensure practical deployment in real-world scenarios [58][65].
An analysis of 102 VLA models, 26 datasets, and 12 simulation platforms
自动驾驶之心· 2025-07-22 02:18
Core Viewpoint
- The article discusses the transformative breakthrough of Visual-Language-Action (VLA) models in robotics, emphasizing their integration of visual perception, natural language understanding, and embodied control within a unified learning framework. It highlights the development and evaluation of 102 VLA models, 26 foundational datasets, and 12 simulation platforms, identifying current challenges and future directions for enhancing robotic autonomy and adaptability [3][4][6].

Group 1: VLA Models and Framework
- VLA models represent a new frontier in robotic intelligence, enabling robots to perceive visual environments, understand natural language commands, and execute meaningful actions, bridging the semantic gap between various modalities [7][9].
- The architecture of VLA models integrates visual, language, and proprioceptive encoders into a diffusion backbone network to generate control commands, facilitating end-to-end processing of multimodal inputs (a minimal sketch of this layout follows this summary) [11][12].
- The development of effective VLA models relies on large-scale, diverse multimodal datasets and realistic simulation platforms, which are crucial for training models to robustly understand language instructions and perceive visual environments [5][30].

Group 2: Datasets and Evaluation
- The article outlines the evolution of VLA datasets, noting that early datasets focused on discrete decision-making in constrained environments, while recent datasets incorporate richer sensory streams and longer task durations, addressing the need for complex multimodal control challenges [21][22][29].
- A comprehensive benchmarking strategy is proposed to evaluate datasets based on task complexity and modality richness, highlighting the need for new datasets that integrate high task difficulty with extensive multimodal inputs [24][28].
- The analysis reveals a gap in current VLA benchmarks, particularly in combining long-duration, multi-skill control with diverse multimodal integration, indicating a promising direction for future dataset development [29][43].

Group 3: Simulation Tools
- Simulation environments are critical for VLA research, enabling the generation of large-scale, repeatable, and richly annotated data that surpasses physical-world limitations [30][31].
- Various advanced simulation platforms, such as AI2-THOR and NVIDIA Isaac Sim, provide high-fidelity physical effects and customizable multimodal sensors, essential for developing robust VLA models [32][33].
- The integration of simulation tools with VLA datasets accelerates the collaborative development of control algorithms and benchmark datasets, ensuring advancements in multimodal perception are effectively evaluated before deployment on real robotic platforms [30][33].

Group 4: Applications and Challenges
- VLA models are categorized into six broad application areas, including manipulation and task generalization, autonomous mobility, human assistance, and interaction, showcasing their versatility across various robotic tasks [34][35].
- The article identifies key challenges in VLA model architecture, such as tokenization and vocabulary alignment, modality fusion, and cross-entity generalization, which need to be addressed to enhance model performance and adaptability [39][40][41].
- Data challenges are also highlighted, including task diversity, modality imbalance, annotation quality, and the trade-off between realism and scale in datasets, which hinder the development of robust general-purpose VLA models [42][43].
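To make the "encoders feeding a diffusion backbone" description above concrete, here is a minimal PyTorch sketch of that layout. The feature dimensions, the pooled single-vector encoders, and the one-step MLP "denoiser" are simplifying assumptions; real systems use pretrained vision/language backbones and iterative denoising over action chunks.

```python
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Toy VLA layout: three modality encoders condition a diffusion-style action head."""
    def __init__(self, vis_dim=512, lang_dim=512, prop_dim=14, act_dim=7, horizon=8, d=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d)     # e.g. pooled features from a vision encoder
        self.lang_proj = nn.Linear(lang_dim, d)   # e.g. pooled features from a language model
        self.prop_proj = nn.Linear(prop_dim, d)   # joint angles / gripper state
        self.denoiser = nn.Sequential(            # predicts the noise added to an action chunk
            nn.Linear(3 * d + horizon * act_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, horizon * act_dim),
        )
        self.horizon, self.act_dim = horizon, act_dim

    def forward(self, vis_feat, lang_feat, proprio, noisy_actions, t):
        cond = torch.cat([self.vis_proj(vis_feat),
                          self.lang_proj(lang_feat),
                          self.prop_proj(proprio)], dim=-1)
        x = torch.cat([cond, noisy_actions.flatten(1), t.unsqueeze(-1)], dim=-1)
        return self.denoiser(x).view(-1, self.horizon, self.act_dim)

policy = ToyVLAPolicy()
B = 4
eps_hat = policy(torch.randn(B, 512), torch.randn(B, 512), torch.randn(B, 14),
                 torch.randn(B, 8, 7), torch.rand(B))
print(eps_hat.shape)  # (4, 8, 7): predicted noise over an 8-step action chunk
```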
Has the robots' "GPT moment" arrived? Toyota Research Institute quietly ran the most rigorous VLA validation to date
具身智能之心· 2025-07-21 08:42
Core Viewpoint
- The article discusses the advancements in robotic arms, particularly focusing on the development of Large Behavior Models (LBM) that enable robots to perform complex tasks autonomously, showcasing significant improvements in performance and capabilities compared to traditional models [3][7][15].

Summary by Sections

Introduction to Robotic Arms
- Robotic arms are typically associated with simple tasks like grabbing or serving ice cream, but the complexity increases exponentially when tasked with more intricate operations such as setting a table or assembling a bicycle [2][3].

Development of VLA Models
- The recent progress in Visual-Language-Action (VLA) models has allowed robots to integrate multimodal information (images, instructions, scene semantics) and execute complex tasks, moving towards more intelligent and versatile systems [3][4].

Large Behavior Models (LBM)
- LBM represents a significant advancement in robotic capabilities, built on diffusion model strategies, enabling robots to autonomously execute complex operations with impressive results [7][10][19].
- The research conducted by Toyota Research Institute (TRI) and led by notable scholars emphasizes the rigorous evaluation of these models, demonstrating their effectiveness in both simulated and real-world environments [9][10].

Training and Evaluation
- The LBM was trained on a diverse dataset, including 1,700 hours of robot data, and underwent 1,800 real-world evaluations and over 47,000 simulated deployments, showcasing its robust performance [13][14].
- The findings indicate that even with limited training data, the model's performance significantly improves, suggesting a positive trend towards effective data acquisition and performance enhancement [14][16].

Performance Metrics
- The evaluation metrics included success rate and task completion, with a focus on relative success rates to better compare different methods' performances (a small sketch of this bookkeeping follows this summary) [26][27].
- The LBM demonstrated superior performance on both seen and unseen tasks compared to single-task baseline models, indicating its robustness and adaptability [31][39].

Conclusion and Future Implications
- The research suggests that the advent of general large-scale models in robotics is on the horizon, hinting at a potential "GPT moment" for embodied intelligence [15][43].
- The results indicate that pre-training can lead to better task performance with less data, reinforcing the idea that as data volume increases, performance benefits will continue to manifest [43][45].
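The success-rate bookkeeping mentioned above can be illustrated in a few lines of Python. The task names, method labels, and trial outcomes below are invented; the sketch only shows how a relative success rate (a method compared against a single-task baseline) is derived from per-trial logs.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrialLog:
    task: str
    method: str       # e.g. "lbm_pretrained" or "single_task_baseline" (invented labels)
    success: bool

def success_rate(logs: List[TrialLog], task: str, method: str) -> float:
    hits = [t.success for t in logs if t.task == task and t.method == method]
    return sum(hits) / len(hits) if hits else float("nan")

def relative_success(logs: List[TrialLog], task: str, method: str,
                     baseline: str = "single_task_baseline") -> float:
    """Ratio > 1 means the method completes the task more often than the baseline."""
    base = success_rate(logs, task, baseline)
    return success_rate(logs, task, method) / base if base > 0 else float("inf")

# Toy rollout results (real evaluations would come from the ~1,800 physical trials above).
logs = [TrialLog("set_table", "lbm_pretrained", bool(s)) for s in [1, 1, 1, 0, 1]] + \
       [TrialLog("set_table", "single_task_baseline", bool(s)) for s in [1, 0, 1, 0, 0]]
print(success_rate(logs, "set_table", "lbm_pretrained"))      # 0.8
print(relative_success(logs, "set_table", "lbm_pretrained"))  # 2.0
```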
Has the robots' "GPT moment" arrived? Toyota Research Institute quietly ran the most rigorous VLA validation experiment to date
机器之心· 2025-07-21 04:04
Core Viewpoint
- The article discusses the advancements in robotic arms, particularly focusing on the development of Large Behavior Models (LBM) that enable robots to perform complex tasks autonomously, moving beyond simple operations to more intricate manipulations [3][8][14].

Group 1: Development of Robotic Arms
- Traditional robotic arms are primarily associated with simple tasks like grabbing or serving ice cream, but the complexity of tasks such as setting a table or assembling a bicycle presents significant challenges [1][2].
- Recent advancements in Visual-Language-Action (VLA) models have allowed robots to integrate multimodal information and execute complex tasks, although the research has not yet reached a milestone level [3][4].

Group 2: Large Behavior Models (LBM)
- The LBM is a new approach that builds on VLA concepts, utilizing diffusion model strategies to create a large-scale behavior model capable of executing complex operations [8][14].
- The research conducted by the Toyota Research Institute (TRI) and other institutions has shown that LBM can significantly improve performance in multitask robotic operations, even with limited training data [10][15].

Group 3: Experimental Findings
- The study involved training LBMs on approximately 1,700 hours of robot data and conducting over 1,800 real-world evaluations, demonstrating that even with a few hundred hours of diverse data, significant performance improvements can be achieved [15][16].
- The findings indicate that LBM can learn new tasks with 3-5 times less data compared to traditional single-task strategies, showcasing its robustness in various environments [17][20].

Group 4: Evaluation Metrics
- The performance of the LBM was assessed using success rates and task completion metrics, with a focus on distinguishing between nearly completed tasks and those that were not executed at all (a toy illustration of such a graded score follows this summary) [25][26].
- The evaluation process included both real-world and simulated environments, ensuring a comprehensive assessment of the model's capabilities [29][30].

Group 5: Implications for the Future
- The positive results from the LBM research suggest a promising future for general-purpose large-scale models in robotics, hinting at the potential for achieving embodied intelligence akin to a "GPT moment" in the field [16][17].
- The study emphasizes the importance of pre-training and the potential for a virtuous cycle of data acquisition and performance enhancement, indicating that significant advancements can be made even without vast amounts of data [16][49].
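As a toy illustration of grading partial progress rather than recording a single pass/fail bit, the snippet below scores a rollout by how many ordered subtask milestones it reached. The milestone names and the linear scoring rule are assumptions, not TRI's actual protocol.

```python
from typing import List

def completion_score(reached: List[str], milestones: List[str]) -> float:
    """Fraction of milestones completed in order, counted up to the first deviation."""
    score = 0
    for want, got in zip(milestones, reached):
        if want != got:
            break
        score += 1
    return score / len(milestones)

# Hypothetical milestone breakdown for a table-setting rollout.
milestones = ["grasp_plate", "lift_plate", "move_to_table", "place_plate"]
print(completion_score(["grasp_plate", "lift_plate"], milestones))  # 0.5: nearly completed
print(completion_score([], milestones))                             # 0.0: never started
print(completion_score(milestones, milestones))                     # 1.0: full success
```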
An analysis of 102 VLA models, 26 datasets, and 12 simulation platforms
具身智能之心· 2025-07-20 01:06
Core Viewpoint
- The article discusses the transformative breakthrough of Visual-Language-Action (VLA) models in robotics, emphasizing their integration of visual perception, natural language understanding, and embodied control within a unified learning framework. It highlights the development and evaluation of 102 VLA models, 26 foundational datasets, and 12 simulation platforms, identifying current challenges and future directions for enhancing robotic autonomy and adaptability [3][4][6].

Group 1: VLA Models and Framework
- VLA models represent a new frontier in robotic intelligence, enabling robots to perceive visual environments, understand natural language commands, and execute meaningful actions, bridging the semantic gap between various modalities [7][9].
- The architecture of VLA models integrates visual, language, and proprioceptive encoders into a diffusion backbone network, facilitating the generation of control commands [11][12].
- The evaluation of VLA architectures reveals a rich diversity in core component algorithms, with visual encoders predominantly based on CLIP and SigLIP, and language models primarily from the LLaMA family [16].

Group 2: Datasets and Training
- High-quality, diverse training datasets are crucial for VLA model development, allowing models to learn complex cross-modal correlations without relying on manually crafted heuristics [17][22].
- The article categorizes major VLA datasets, noting a shift towards more complex, multimodal control challenges, with recent datasets like DROID and Open X-Embodiment embedding synchronized RGBD, language, and multi-skill trajectories [22][30].
- A benchmarking analysis maps each major VLA dataset based on task complexity and modality richness, highlighting gaps in current benchmarks, particularly in integrating complex tasks with extensive multimodal inputs (a toy version of this mapping appears after this summary) [30][31].

Group 3: Simulation Tools
- Simulation environments are essential for VLA research, generating large-scale, richly annotated data that exceeds physical-world limitations. Platforms like AI2-THOR and Habitat provide realistic rendering and customizable multimodal sensors [32][35].
- The article outlines various simulation tools, emphasizing their capabilities in generating diverse datasets for VLA models, which are critical for advancing multimodal perception and control [35][36].

Group 4: Applications and Evaluation
- VLA models are categorized into six broad application areas, including manipulation and task generalization, autonomous mobility, human assistance, and interaction, showcasing their versatility across different robotic tasks [36][37].
- The selection and evaluation of VLA models focus on their operational skills and task generalization capabilities, using standardized metrics such as success rate and zero-shot generalization ability [39][40].

Group 5: Challenges and Future Directions
- The article identifies key architectural challenges for VLA models, including tokenization and vocabulary alignment, modality fusion, cross-entity generalization, and the smoothness of manipulator movements [42][43][44].
- Data challenges are also highlighted, such as task diversity, modality imbalance, annotation quality, and the trade-off between realism and scale in datasets, which hinder the robust development of general VLA models [45][46].
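The benchmarking "map" described above can be thought of as placing each dataset on two axes and looking for empty corners. The sketch below uses invented, normalized scores purely to show the bookkeeping; the dataset names are real, but the numbers are not taken from the survey.

```python
from dataclasses import dataclass

@dataclass
class VLADataset:
    name: str
    task_complexity: float    # e.g. normalized horizon length / skill count (invented here)
    modality_richness: float  # e.g. normalized count of synchronized modalities (invented here)

datasets = [
    VLADataset("early tabletop pick-and-place", 0.2, 0.3),
    VLADataset("DROID", 0.6, 0.7),
    VLADataset("Open X-Embodiment", 0.7, 0.6),
]

# The under-served corner the survey points at: high task complexity AND rich modalities.
covered = [d.name for d in datasets if d.task_complexity > 0.8 and d.modality_richness > 0.8]
print(covered or "gap: no dataset combines very high task complexity with very rich modalities")
```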
University of California! EgoVLA: learning VLA models from egocentric human videos
具身智能之心· 2025-07-20 01:06
Core Insights
- The article discusses a novel approach to robot learning that leverages egocentric human video data to enhance the training of Vision-Language-Action (VLA) models, overcoming limitations of traditional robot data collection methods [3][21].

Research Background and Core Ideas
- Traditional robot learning relies heavily on large-scale real robot data, which is limited by hardware and operational costs. In contrast, human actions in everyday environments provide a vast amount of potential training data, as billions of people continuously perform tasks in the settings where robots are expected to operate [3].
- The key breakthrough is approximating the action-space difference between humans and robots through geometric transformations. This allows VLA models to be trained on human video data first and then fine-tuned with a small amount of robot demonstrations, facilitating skill transfer [3].

Model Architecture and Action Space Design
- The framework is based on NVILA-2B, utilizing its visual-language understanding capabilities for efficient intent reasoning and fine-tuning. Inputs include current and historical first-person visual observations, language instructions, action query tokens, and human proprioceptive states [5].
- The action space incorporates human wrist poses and the first 15 PCA components of the MANO hand model, balancing compactness and expressiveness for action transfer from humans to robots (a small sketch of this representation follows this summary) [8].

Training and Evaluation
- A large-scale dataset of approximately 500,000 image-action pairs was created from four sources, covering various rigid objects and annotated with RGB observations, wrist poses, hand poses, and camera poses [12].
- The Ego Humanoid Manipulation Benchmark was established for unified evaluation of humanoid robot manipulation capabilities, consisting of 12 tasks and addressing data balance issues [14].

Experimental Results and Key Findings
- Human pre-training significantly enhances core performance, with the EgoVLA model showing a success-rate improvement of about 20% on fine manipulation tasks compared to models without pre-training [16][20].
- The model demonstrates robust performance across different visual configurations, with only a slight decrease in success rates on unseen visual backgrounds, indicating adaptability to new environments [20].

Impact of Data Scale and Diversity
- Higher diversity in human data correlates with better model generalization, as evidenced by the combined model's superior performance on short-horizon tasks compared to models trained on single datasets [23].
- The performance of the EgoVLA model declines when relying solely on robot demonstration data, highlighting the necessity of combining human pre-training with a certain amount of robot data for optimal results [23].
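A small sketch of the compact action representation described above: a 6-DoF wrist pose concatenated with the leading PCA components of a hand-articulation vector. Here the PCA basis is fitted to random placeholder data and the 45-D hand parameterization is an assumption standing in for MANO's pose parameters; EgoVLA's actual retargeting pipeline is not reproduced.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
hand_poses = rng.normal(size=(10_000, 45))   # placeholder for 45-D hand pose parameters

pca = PCA(n_components=15).fit(hand_poses)   # keep the 15 leading components

def build_action(wrist_pose_6d: np.ndarray, hand_pose_45d: np.ndarray) -> np.ndarray:
    """Concatenate the wrist pose with the PCA-compressed hand pose -> 21-D action."""
    hand_low_dim = pca.transform(hand_pose_45d[None])[0]
    return np.concatenate([wrist_pose_6d, hand_low_dim])

action = build_action(rng.normal(size=6), hand_poses[0])
print(action.shape)   # (21,)

# At deployment, the compressed hand pose can be decoded back to an approximate full hand
# pose and retargeted to the robot hand's joints (the geometric retargeting is not shown).
recovered_hand = pca.inverse_transform(action[6:][None])[0]
print(recovered_hand.shape)   # (45,)
```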
A new breakthrough in unified VLA architectures: an autoregressive world model leads the way for embodied intelligence
机器之心· 2025-07-10 04:26
Core Viewpoint
- The article discusses the development of a new unified Vision-Language-Action (VLA) model architecture called UniVLA, which enhances the integration of visual, language, and action signals for improved decision-making in embodied intelligence tasks [4][5][13].

Group 1: Model Architecture and Mechanism
- UniVLA is based on a fully discrete, autoregressive mechanism that models visual, language, and action signals natively, incorporating world-model training to learn temporal information and causal logic from large-scale videos [5][9][14].
- The framework transforms visual, language, and action signals into discrete tokens, creating interleaved multimodal temporal sequences for unified modeling (a minimal sketch of this token layout follows this summary) [9][10].

Group 2: Performance and Benchmarking
- UniVLA has set new state-of-the-art (SOTA) records across major embodied intelligence benchmarks such as CALVIN, LIBERO, and SimplerEnv, demonstrating its strong performance advantages [18][21].
- In the CALVIN benchmark, UniVLA achieved an average score of 95.5%, significantly outperforming previous models [19].

Group 3: Training Efficiency and Generalization
- The post-training stage of the world model significantly enhances downstream decision-making performance without relying on extensive action data, utilizing only vast amounts of video data for efficient learning [14][15].
- The model supports unified training for various tasks, including visual understanding, video generation, and action prediction, showcasing its versatility and data scalability [10][24].

Group 4: Future Directions
- The article suggests exploring deeper integration of the UniVLA framework with multimodal reinforcement learning to enhance its perception, understanding, and decision-making capabilities in open-world scenarios [24].
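The interleaved discrete-token layout described above can be sketched in a few lines: vision, language, and action tokens share one vocabulary via per-modality offsets and are concatenated per timestep into a single autoregressive sequence. The vocabulary sizes, special tokens, and ordering below are assumptions for illustration, not UniVLA's actual tokenizer.

```python
from typing import List

# Assumed sub-vocabulary sizes; per-modality offsets map everything into one shared vocabulary.
TEXT_VOCAB, VISION_VOCAB, ACTION_VOCAB = 32_000, 8_192, 256
VISION_OFFSET = TEXT_VOCAB
ACTION_OFFSET = TEXT_VOCAB + VISION_VOCAB
BOS_FRAME = ACTION_OFFSET + ACTION_VOCAB       # marks the start of a frame's vision tokens
BOS_ACTION = BOS_FRAME + 1                     # marks the start of a timestep's action tokens

def interleave(instruction: List[int],
               frames: List[List[int]],
               actions: List[List[int]]) -> List[int]:
    """[text ...] then, per timestep, [<frame> vision ...] [<act> action ...]."""
    seq = list(instruction)
    for vis_tokens, act_tokens in zip(frames, actions):
        seq.append(BOS_FRAME)
        seq.extend(v + VISION_OFFSET for v in vis_tokens)
        seq.append(BOS_ACTION)
        seq.extend(a + ACTION_OFFSET for a in act_tokens)
    return seq

# Toy example: a 3-token instruction, two timesteps with 4 vision tokens and 2 action tokens each.
seq = interleave([11, 57, 902],
                 frames=[[5, 9, 12, 3], [7, 7, 1, 0]],
                 actions=[[100, 42], [101, 40]])
print(len(seq))   # 3 + 2 * (1 + 4 + 1 + 2) = 19
```

A decoder-only transformer trained with next-token prediction over such sequences can then be prompted to continue with action tokens (acting as a policy) or with vision tokens (acting as a world-model rollout).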
A new paradigm for 3D VLA! BridgeVLA, the CVPR championship solution, boosts real-robot performance by 32%
具身智能之心· 2025-06-26 14:19
Core Viewpoint
- The article discusses the BridgeVLA model developed by the Institute of Automation, Chinese Academy of Sciences, which efficiently projects 3D input into 2D images for action prediction, achieving high performance and data efficiency in 3D robotic manipulation learning [4][6].

Group 1: Model Performance
- BridgeVLA achieves a task success rate of 96.8% with only 3 trajectories in the basic setting and outperforms baseline models across various generalization settings, with a 32% overall performance improvement [6][25].
- In simulation benchmarks such as RLBench, COLOSSEUM, and GemBench, BridgeVLA outperforms mainstream 3D robotic manipulation baselines, achieving an 88.2% success rate on RLBench, a 7.3% improvement on COLOSSEUM, and a 50% success rate on GemBench [20][25].

Group 2: Model Design and Training
- BridgeVLA's training process consists of two phases: 2D heatmap pre-training to enhance spatial perception, and 3D action fine-tuning to learn specific robotic manipulation strategies [15][17].
- The model uses heatmap pre-training to predict a probability heatmap of target object locations from textual instructions, strengthening its spatial awareness (a toy sketch of the projection-and-heatmap idea follows this summary) [16][25].

Group 3: Generalization and Data Efficiency
- BridgeVLA demonstrates strong generalization, effectively handling disturbances such as unseen objects, lighting conditions, and object types, thanks to the rich visual and linguistic prior knowledge embedded in the pre-trained multimodal model [20][25].
- The model's high data efficiency is highlighted by its ability to achieve nearly the same performance with only 3 trajectories as with 10, making it suitable for deployment on real robotic systems [25][26].
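A toy sketch of the 3D-to-2D bridge described above: project a point cloud orthographically into a top-down image, score that image with a 2D heatmap, and lift the heatmap peak back to a 3D target. The grid resolution, workspace bounds, and the random stand-in heatmap are assumptions; BridgeVLA's actual rendering and VLM heatmap head are not reproduced.

```python
import numpy as np

GRID = 64
LO, HI = np.array([-0.5, -0.5, 0.0]), np.array([0.5, 0.5, 0.5])   # assumed workspace bounds (m)

def project_topdown(points: np.ndarray) -> np.ndarray:
    """(N, 3) points -> (GRID, GRID) top-down height image over the workspace."""
    uv = ((points[:, :2] - LO[:2]) / (HI[:2] - LO[:2]) * (GRID - 1)).astype(int)
    uv = np.clip(uv, 0, GRID - 1)
    img = np.zeros((GRID, GRID))
    np.maximum.at(img, (uv[:, 0], uv[:, 1]), points[:, 2])   # keep the max height per cell
    return img

def heatmap_to_xyz(heatmap: np.ndarray, height_img: np.ndarray) -> np.ndarray:
    """Pick the heatmap peak and recover an (x, y, z) target from that grid cell."""
    i, j = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    xy = LO[:2] + (np.array([i, j]) / (GRID - 1)) * (HI[:2] - LO[:2])
    return np.array([xy[0], xy[1], height_img[i, j]])

points = np.random.default_rng(0).uniform(LO, HI, size=(2048, 3))   # placeholder point cloud
height_img = project_topdown(points)
fake_heatmap = np.random.default_rng(1).random((GRID, GRID))        # stands in for the 2D head
print(heatmap_to_xyz(fake_heatmap, height_img))                     # 3-D action target
```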