Vision-Language-Action (VLA) Models
How does RL empower VLA?
具身智能之心· 2026-01-09 00:55
If one direction stands out as the most popular this year, it is VLA+RL. VLA models bring a new interaction paradigm to embodied intelligence: robots no longer rely on precisely defined states and rules, but instead perceive the environment visually, understand language instructions, and directly generate action sequences. This capability greatly lowers the barrier to task specification and system design, allowing robots to handle more open and complex scenarios.

The research trend is also shifting from "simply training VLA models" to "treating the VLA as the policy representation and fine-tuning or strengthening it with RL," including offline RL for better sample efficiency, hierarchical RL for constraining long-horizon behavior, and self-supervised feedback modeling based on vision and language. Methodologically, current VLA+RL work falls mainly into three categories: online RL, offline RL, and test-time approaches.

However, on real robot systems VLA still suffers from unstable execution, sensitivity to initial states, and frequent failures on long-horizon tasks. The root cause is that the model lacks the ability to continuously correct itself based on environmental feedback.

Reinforcement learning offers a new way to address this. RL is not a new discipline, but its strengths give VLA the key mechanism for moving from "understanding" to "optimized execution." By introducing reward or value signals, RL can perform closed-loop optimization of the action policy while preserving the VLA's perception and language capabilities, compensating for imitation learning's weaknesses on out-of-distribution states and error accumulation. A VLA trained with pure imitation learning is, in essence, merely copying ...
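To make the closed-loop idea concrete, here is a minimal, self-contained sketch of RL fine-tuning on top of an imitation-learned policy, using a PPO-style clipped objective. The TinyVLAPolicy class, the toy observation features, and the random advantages are illustrative assumptions, not any cited paper's implementation:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained VLA policy: maps fused (image, instruction)
# features to a distribution over discretized action tokens.
class TinyVLAPolicy(nn.Module):
    def __init__(self, obs_dim=32, n_action_bins=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_action_bins))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def ppo_clip_loss(policy, old_log_prob, obs, action, advantage, eps=0.2):
    """Clipped surrogate objective: pushes the policy toward higher reward
    while keeping it close to the behavior (imitation-learned) policy."""
    dist = policy(obs)
    ratio = torch.exp(dist.log_prob(action) - old_log_prob)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()

# Toy rollout batch: in practice this comes from environment interaction
# (online RL) or a logged dataset (offline RL).
policy = TinyVLAPolicy()
obs = torch.randn(8, 32)                      # fused visual + language features
with torch.no_grad():
    behavior = policy(obs)
    action = behavior.sample()
    old_log_prob = behavior.log_prob(action)
advantage = torch.randn(8)                    # reward signal minus a baseline

loss = ppo_clip_loss(policy, old_log_prob, obs, action, advantage)
loss.backward()                               # one closed-loop policy update
```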
Latest from He Wang's (王鹤) team! Solving the problem of VLA models lacking precise geometric information
具身智能之心· 2026-01-05 01:03
In robotic manipulation, VLA models map visual inputs and language instructions to actions through an end-to-end framework, enabling the learning of diverse skills. However, most existing VLA models rely on single-view RGB images and lack precise spatial geometric information, making it hard to meet the demands of high-precision manipulation. The StereoVLA model, jointly proposed by teams from Galbot, Peking University, and the University of Hong Kong, innovatively incorporates the rich geometric cues of stereo vision. Through a technical pipeline of geometry-semantic feature extraction, interaction-region depth estimation, and multi-scenario validation, it is the first work to systematically address the core problem of insufficient spatial perception in VLA models, offering a new solution for precise robotic manipulation. Paper title: StereoVLA: Enhancing Vision-Language-Action Models with Stereo Vision. Project link: https://shengliangd.github.io/StereoVLA-Webpage. Root of the problem ...
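The exact StereoVLA architecture is only described at a high level above; the following is a rough, hypothetical sketch of the general idea of pairing semantic features with stereo geometric cues and an auxiliary interaction-region depth estimate. All module names and layer sizes are assumptions for illustration, not the paper's design:

```python
import torch
import torch.nn as nn

class StereoGeometrySemanticFusion(nn.Module):
    """Sketch only: semantic features from the left view, geometric cues from
    the stereo pair, an auxiliary depth head, and a fusion layer whose output
    would feed a VLA backbone. Sizes and structure are illustrative."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.semantic_enc = nn.Conv2d(3, feat_dim, kernel_size=8, stride=8)
        self.geometry_enc = nn.Conv2d(6, feat_dim, kernel_size=8, stride=8)
        self.depth_head = nn.Conv2d(feat_dim, 1, kernel_size=1)   # interaction-region depth
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=1)

    def forward(self, left, right):
        sem = self.semantic_enc(left)                          # semantic features
        geo = self.geometry_enc(torch.cat([left, right], 1))   # stereo geometric cues
        depth = self.depth_head(geo)                           # auxiliary depth estimate
        fused = self.fuse(torch.cat([sem, geo], 1))            # features for the VLA backbone
        return fused, depth

left = torch.randn(1, 3, 224, 224)
right = torch.randn(1, 3, 224, 224)
tokens, depth = StereoGeometrySemanticFusion()(left, right)
print(tokens.shape, depth.shape)   # torch.Size([1, 128, 28, 28]) torch.Size([1, 1, 28, 28])
```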
This year's VLA+RL papers are queuing up for acceptance...
具身智能之心· 2025-12-24 00:25
We have recently been surveying VLA+RL work, and whether the scheme is online with a world model or offline, VLA seems unable to do without RL. A VLA that relies only on imitation learning remains fragile in real-world OOD scenarios and lacks failure recovery, autonomous exploration, and closed-loop error correction. The strength of reinforcement learning (RL) is that it can substantially improve a VLA model's generalization; experiments in some works report performance gains of up to 42.6% on out-of-distribution tasks. Where there are results, more work follows, and a great many papers have come out this year. Several recent works, including wholebodyvla, pi0.6, and GR-RL, have achieved impressive results; when pi0.6 was released, many people guessed it was most likely VLA plus reinforcement. Online systems backed by world models are also an active direction, and more breakthroughs are expected. On the tooling side, VLA+RL frameworks are maturing; RLinf from Prof. Yu Chao's group is worth recommending, as it supports a growing number of methods. Link: https://github.com/RLinf/RLinf. Since there is a large body of related work, here we share some of the more representative VLA+RL works of the past two years; these papers have gradually been accepted by various conferences. We also suggest that follow-up research move toward this direction; if you are unsure how to get started, you are welcome to consult the research assistants at 具身智能之心 for a one-click start ...
This year probably produced n VLA+RL papers, right?!
具身智能之心· 2025-12-22 10:23
Core Insights
- The article emphasizes the integration of Reinforcement Learning (RL) with Vision-Language-Action (VLA) models to enhance their generalization capabilities, particularly in out-of-distribution (OOD) scenarios, where performance improvements can reach up to 42.6% [2].

Group 1: Research Directions
- The article suggests that future research should focus on the combination of VLA and RL, encouraging collaboration with research assistants for guidance on starting projects in these areas [3].
- Several notable recent works in VLA+RL have been highlighted, showcasing significant advancements in the field [5][10].

Group 2: Notable Papers and Projects
- A list of representative papers from the last two years is provided, including titles such as "NORA-1.5" and "Balancing Signal and Variance," which focus on various aspects of VLA and RL integration [5][10].
- Links to project homepages and paper PDFs are shared for further exploration of these works [6][9][12].

Group 3: Tools and Frameworks
- The article mentions the development of tools like Rlinf, which supports a growing number of methods for VLA+RL frameworks, indicating a trend towards more robust and versatile research tools [2][11].
Microsoft & HKUST compare multiple transfer techniques! How exactly can VLA effectively inherit the rich visual-semantic priors from VLM?
具身智能之心· 2025-11-15 16:03
Core Insights
- The article discusses the introduction of the GrinningFace benchmark, which aims to address the challenges in knowledge transfer from Visual Language Models (VLM) to Visual Language Action Models (VLA) by using emoji-based tasks as a testing ground [1][2][4].

Group 1: Challenges in VLA Training
- VLA training relies heavily on VLM initialization but faces three main challenges: unclear transfer effects, the risk of catastrophic forgetting, and lack of standardized comparison for different transfer techniques [2][4].
- Existing datasets have low overlap with VLM pre-training data, making it difficult to isolate contributions from "robotic action skills" and "VLM prior knowledge" [2].

Group 2: GrinningFace Benchmark Design
- The GrinningFace benchmark uses emojis as a bridge to separate action execution from semantic recognition, allowing for precise measurement of knowledge transfer effects [4][5].
- The benchmark includes a standardized task where a robotic arm must place a cube on an emoji card based on language instructions [4].

Group 3: Evaluation Metrics
- The evaluation framework consists of two core metrics: execution success rate (SR) and recognition SR, which quantify the robot's ability to perform actions and recognize semantic cues, respectively [5][8] (a sketch of computing both metrics follows this list).
- The study found that different fine-tuning strategies have varying impacts on knowledge transfer, with a focus on retaining VLM prior knowledge while adapting to specific tasks [5][11].

Group 4: Key Findings on Transfer Techniques
- The research highlights that co-training, latent action prediction, and diverse pre-training data are critical for effective knowledge transfer [7][19].
- The balance between retaining VLM prior knowledge and adapting robotic actions is identified as a core principle in VLA design [19].

Group 5: Future Directions
- Future work should focus on optimizing parameter-efficient fine-tuning techniques, enhancing knowledge transfer efficiency, and designing complex tasks that reflect real-world applications [19].
- Exploring multimodal prior fusion, including tactile and auditory information, could improve VLA's adaptability to various environments [19].
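As a concrete reading of the two metrics, here is a small sketch of how execution SR and recognition SR might be computed over evaluation episodes; the field names and the exact success definitions are assumptions, since the benchmark's data format is not given in the summary:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    # Field names are illustrative assumptions, not the benchmark's actual schema.
    placed_on_some_card: bool     # did the arm complete a placement at all?
    placed_on_target_emoji: bool  # was it the emoji named in the instruction?

def execution_sr(episodes):
    """Fraction of episodes where the manipulation itself succeeded."""
    return sum(e.placed_on_some_card for e in episodes) / len(episodes)

def recognition_sr(episodes):
    """Fraction of episodes where the correct emoji was chosen, i.e. the
    transferred VLM semantic prior was actually used."""
    return sum(e.placed_on_target_emoji for e in episodes) / len(episodes)

runs = [Episode(True, True), Episode(True, False), Episode(False, False)]
print(execution_sr(runs), recognition_sr(runs))   # 0.666..., 0.333...
```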
New research from Alibaba: unifying VLA and world models
自动驾驶之心· 2025-11-06 08:43
Core Insights
- The article discusses the WorldVLA framework, which integrates Visual Language Action models (VLA) with world models to enhance AI's understanding of the environment [1][4][36]
- WorldVLA demonstrates superior performance compared to independent action and world models, showcasing a synergistic effect between the two [2][18]

Group 1: Framework Overview
- WorldVLA is designed as a unified autoregressive action world model that combines action and image understanding for improved predictive capabilities [4]
- The framework utilizes three independent tokenizers for encoding images, text, and actions, optimizing the representation of visual and action data [8]

Group 2: Model Performance
- Benchmark results indicate that WorldVLA outperforms discrete action models like OpenVLA, even without pre-training, validating its architectural design [19][21]
- The model's performance improves with higher image resolutions, with 512x512 pixels showing significant enhancements over 256x256 pixels [22][23]

Group 3: Mutual Enhancement
- The world model enhances action generation by understanding physical laws and predicting future states based on current actions [14][25]
- Conversely, the action model improves the visual understanding of the world model, leading to more contextually relevant actions [17][30]

Group 4: Practical Applications
- WorldVLA's ability to predict the outcomes of candidate actions aids in optimizing decision-making processes, thereby increasing task success rates [26] (see the sketch after this list)
- The framework demonstrates practical advantages in complex scenarios, such as successfully executing tasks that pure world models struggle with [32]
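The candidate-action idea in Group 4 can be sketched as a simple selection loop: sample several actions, let the world model imagine each outcome, and keep the best-scoring one. The interfaces below (action_model, world_model, score_fn) are hypothetical stand-ins, not WorldVLA's actual API:

```python
import random

def select_action(action_model, world_model, score_fn, observation, instruction, k=8):
    """Sketch of world-model-assisted decision making: sample k candidate
    actions, imagine each outcome with the world model, keep the action whose
    predicted outcome scores best."""
    candidates = [action_model(observation, instruction) for _ in range(k)]
    predicted = [world_model(observation, a) for a in candidates]
    scores = [score_fn(p, instruction) for p in predicted]
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best]

# Toy stand-ins so the sketch runs end to end.
action_model = lambda obs, instr: random.uniform(-1.0, 1.0)
world_model = lambda obs, act: obs + act            # "imagined" next state
score_fn = lambda state, instr: -abs(state - 0.5)   # closer to goal state 0.5 is better

print(select_action(action_model, world_model, score_fn, observation=0.0, instruction="reach 0.5"))
```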
New research from Alibaba: unifying VLA and world models
具身智能之心· 2025-10-31 00:04
Core Insights
- The article discusses the development of WorldVLA, a unified framework that integrates Visual Language Action models (VLA) with world models, aimed at enhancing AI's understanding of the world [2][5].

Group 1: Framework and Model Integration
- WorldVLA demonstrates significant performance improvements over independent action and world models, showcasing a mutual enhancement effect [3][20].
- The framework combines the capabilities of action models and world models to predict future images and generate actions, addressing the limitations of each model when used separately [5][6].

Group 2: Model Architecture and Training
- WorldVLA utilizes three independent tokenizers for encoding images, text, and actions, with a compression ratio of 16 and a codebook size of 8192 [9].
- The model employs a novel attention mask for action generation, allowing for parallel generation of multiple actions while maintaining the integrity of the generated sequence [12][13] (a sketch of such a mask follows this list).

Group 3: Performance Metrics and Results
- Benchmark tests indicate that WorldVLA outperforms discrete action models, even without pre-training, with notable improvements in various performance metrics [20][22].
- The model's performance is positively correlated with image resolution, with 512×512 pixel resolution yielding significant enhancements over 256×256 resolution [22][24].

Group 4: Mutual Benefits of Model Types
- The integration of world models enhances action models by providing a deeper understanding of environmental physics, which is crucial for tasks requiring precision [26][27].
- Conversely, action models improve the visual understanding capabilities of world models, leading to more effective action generation [18][31].
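A minimal sketch of the kind of attention mask described in Group 2, assuming a prompt of image and text tokens followed by several fixed-length action chunks; the chunk layout and mask construction are an interpretation of the summary, not the paper's code:

```python
import numpy as np

def action_attention_mask(n_prompt, n_actions, tokens_per_action):
    """Boolean mask (True = may attend). Prompt tokens attend causally among
    themselves; each action chunk attends to the prompt and causally to itself,
    but not to previously generated action chunks, so earlier action errors do
    not propagate into later actions."""
    total = n_prompt + n_actions * tokens_per_action
    mask = np.zeros((total, total), dtype=bool)
    # causal attention within the image + text prompt
    mask[:n_prompt, :n_prompt] = np.tril(np.ones((n_prompt, n_prompt), dtype=bool))
    for k in range(n_actions):
        start = n_prompt + k * tokens_per_action
        end = start + tokens_per_action
        mask[start:end, :n_prompt] = True                      # see the prompt
        mask[start:end, start:end] = np.tril(                  # causal inside the chunk
            np.ones((tokens_per_action, tokens_per_action), dtype=bool))
        # columns for earlier action chunks stay False: no cross-chunk attention
    return mask

m = action_attention_mask(n_prompt=6, n_actions=2, tokens_per_action=7)
print(m.shape)   # (20, 20)
```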
New research from Alibaba: unifying VLA and world models
36Ke· 2025-10-29 10:32
Core Insights
- WorldVLA is a unified framework that integrates Visual Language Action models (VLA) with world models, developed collaboratively by Alibaba DAMO Academy, Lakehead Laboratory, and Zhejiang University [1][4].

Group 1: Framework Overview
- The world model predicts future images by understanding actions and images, aiming to learn the underlying physical laws of the environment to enhance action generation accuracy [2].
- The action model generates subsequent actions based on image observations, which not only aids visual understanding but also enhances the visual generation capability of the world model [2].
- Experimental results indicate that WorldVLA significantly outperforms independent action and world models, showcasing a mutual enhancement effect between the two [2][12].

Group 2: Model Architecture
- WorldVLA utilizes three independent tokenizers for encoding images, text, and actions, initialized based on the Chameleon model [6].
- The image tokenizer employs a VQ-GAN model with a compression ratio of 16 and a codebook size of 8192, generating 256 tokens for 256×256 images and 1024 tokens for 512×512 images [6].
- The action tokenizer discretizes continuous robot actions into 256 intervals, represented by 7 tokens, including relative positions and angles [6] (a sketch of this discretization follows this list).

Group 3: Training and Performance
- WorldVLA employs a self-regressive training approach, where all text, actions, and images are tokenized and trained in a causal manner [8].
- A novel attention mask for action generation ensures that the current action generation relies solely on text and visual inputs, preventing errors from previous actions from affecting subsequent ones [10].
- Benchmark results show that even without pre-training, WorldVLA outperforms the discrete OpenVLA model, validating its architectural design [12].

Group 4: Mutual Benefits of Models
- The introduction of the world model significantly enhances the performance of the action model by enabling it to learn the underlying physical laws of the system, which is crucial for tasks requiring precision [15].
- The world model provides predictive capabilities that inform decision-making processes, optimizing action selection strategies and improving task success rates [18].
- Conversely, the action model improves the quality of the world model's output, particularly in generating longer video sequences [21].

Group 5: Expert Opinions
- Chen Long, Senior Research Director at Xiaomi Auto, emphasizes that VLA and world models do not need to be mutually exclusive; their combination can promote each other, leading to advancements in embodied intelligence (AGI) [24].
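Based on the tokenizer details quoted above (256 bins, 7 tokens per action, compression ratio 16), here is a small sketch of the discretization step; the per-dimension action ranges are assumptions:

```python
import numpy as np

def discretize_action(action, low, high, n_bins=256):
    """Map a continuous 7-dim action (relative position, angles, gripper) to
    7 discrete tokens with 256 bins each. The ranges `low`/`high` are assumed."""
    action = np.asarray(action, dtype=np.float64)
    norm = (action - low) / (high - low)                 # scale to [0, 1]
    return np.clip((norm * n_bins).astype(int), 0, n_bins - 1)

def undiscretize_action(tokens, low, high, n_bins=256):
    """Inverse map: bin index back to the bin-center continuous value."""
    return low + (tokens + 0.5) / n_bins * (high - low)

low, high = np.full(7, -1.0), np.full(7, 1.0)
tok = discretize_action([0.1, -0.2, 0.05, 0.0, 0.3, -0.9, 1.0], low, high)
print(tok)                                    # 7 integer tokens in [0, 255]
print(undiscretize_action(tok, low, high).round(3))

# Image token counts follow from the compression ratio of 16:
# (256 / 16) ** 2 = 256 tokens, (512 / 16) ** 2 = 1024 tokens per frame.
```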
New research from Alibaba: unifying VLA and world models
量子位· 2025-10-29 09:30
Core Insights
- WorldVLA is a unified framework that integrates Visual Language Action Models (VLA) with World Models, proposed by Alibaba DAMO Academy, Lake Lab, and Zhejiang University [1][4]
- Experimental results indicate that WorldVLA significantly outperforms independent action models and world models, showcasing a mutual enhancement effect [2]

Model Overview
- The framework combines three independent tokenizers for encoding images, text, and actions, utilizing a VQ-GAN model for image tokenization with a compression ratio of 16 and a codebook size of 8192 [8]
- The action tokenizer discretizes continuous robot actions into 256 intervals, representing actions with 7 tokens [8]

Model Design
- WorldVLA employs a self-regressive action world model to unify action and image understanding and generation [4]
- The model addresses limitations of existing VLA and world models by enhancing action generation accuracy through environmental physical understanding [5][14]

Training and Performance
- WorldVLA is jointly trained by integrating data from both action models and world models, enhancing action generation capabilities [13] (see the sketch after this list)
- The model's performance is positively correlated with image resolution, with 512x512 pixel resolution showing significant improvements over 256x256 [21][23]

Benchmark Results
- WorldVLA demonstrates superior performance compared to discrete OpenVLA models, even without pre-training, validating its architectural design [19]
- The model's ability to generate coherent and physically plausible states in various scenarios is highlighted, outperforming pure world models [31][32]

Mutual Enhancement
- The world model enhances the action model's performance by predicting environmental state changes based on current actions, crucial for tasks requiring precision [25]
- Conversely, the action model improves the visual understanding of the world model, supporting better visual generation [17][30]
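A rough sketch of how joint training on action-model and world-model data can share one autoregressive token sequence, assuming a simple offset-based vocabulary layout (all numbers other than the 8192-code image codebook, the 256 action bins, and the 7 action tokens quoted above are illustrative):

```python
# Illustrative token-id layout: text ids first, then image codes, then action bins.
TEXT_VOCAB, IMAGE_CODES, ACTION_BINS = 32_000, 8_192, 256
IMG_OFFSET = TEXT_VOCAB
ACT_OFFSET = TEXT_VOCAB + IMAGE_CODES

def action_sample(image_tokens, text_tokens, action_tokens):
    """Action-model mode: predict action tokens from image + instruction."""
    prompt = [IMG_OFFSET + t for t in image_tokens] + list(text_tokens)
    target = [ACT_OFFSET + t for t in action_tokens]
    return prompt, target

def world_model_sample(image_tokens, action_tokens, next_image_tokens):
    """World-model mode: predict the next image from image + action."""
    prompt = ([IMG_OFFSET + t for t in image_tokens]
              + [ACT_OFFSET + t for t in action_tokens])
    target = [IMG_OFFSET + t for t in next_image_tokens]
    return prompt, target

# Both sample types feed the same autoregressive transformer during joint training.
p, t = action_sample([5, 17], [101, 102], [3, 250, 7, 1, 0, 128, 64])
print(len(p), len(t))   # 4 7
```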
Humanoid robot training ground of over 10,000 square meters opens in Beijing
Huan Qiu Wang Zi Xun· 2025-09-25 10:04
Core Insights
- The humanoid robot training facility in Beijing Shijingshan has officially commenced operations, marking a significant development in China's humanoid robot industry and providing a model for training facilities nationwide [1][7]
- The facility aims to accelerate the evolution of humanoid robots' "embodied intelligence" and promote their large-scale application in sectors such as automotive manufacturing and logistics, laying a solid foundation for a trillion-dollar industry [1][7]

Group 1: Training Facility Overview
- The training center spans over 10,000 square meters and replicates 16 detailed scenarios across four categories: industrial manufacturing, smart home, elderly care services, and 5G integration [3]
- The humanoid robot "Kuavo," standing at 1.66 meters, is actively training in various scenarios, achieving a success rate of over 95% in tasks such as empty box retrieval, material sorting, weighing, packaging, and product boxing [3][4]
- The training facility's data is sourced entirely from real machine operations, addressing industry challenges related to poor data quality, high acquisition costs, and migration difficulties [3][4]

Group 2: Data Quality and Standardization
- The facility aims to overcome the bottlenecks in data quality and accessibility that have historically plagued the humanoid robot industry, moving from a "cottage industry" model to standardized, large-scale data production [4][5]
- High-quality, large-scale training data is essential for the performance of visual language action models (VLA), which enable robots to achieve cross-platform and cross-scenario capabilities [5]
- Real machine data is crucial for bridging the gap between theoretical models and practical applications, as synthetic data cannot fully replicate real-world interactions and environmental dynamics [5]

Group 3: Ecosystem Development
- The training center has established an innovative ecosystem model that integrates training, application, incubation, and public education, aiming to create a national public data service platform for embodied intelligence [6]
- Collaborations with universities and research institutions are in place to support entrepreneurship and application scenario development, while also providing high-quality data services [6]
- The facility will host the "First Embodied Intelligence Operational Task Challenge & Startup Camp," fostering innovation through a "competition-incubation" mechanism [6]

Group 4: Future Implications
- The establishment of this training facility signifies a new phase of large-scale and standardized development in China's humanoid robot industry [7]
- The training center will enhance the skill set of robots, enabling them to perform tasks more effectively across various sectors, including factories, logistics parks, and elderly care institutions [7]
- As more robots "graduate" from this training facility, their presence is expected to increase in everyday settings, facilitating the integration of intelligent robots into various industries and households [7]