Vision-Language-Action (VLA) Models
ICLR 2026 | Robots that evolve in "imagination": HKUST × ByteDance Seed propose WMPO, VLA reinforcement learning inside a world model
机器之心· 2026-03-02 03:06
WMPO (World Model-based Policy Optimization), recently proposed by the PEI-Lab at the Hong Kong University of Science and Technology (HKUST) together with ByteDance's Seed team, is exactly such a new paradigm: it lets embodied agents train "in imagination." The method requires no large-scale reinforcement-learning interaction on real robots, yet significantly improves policy performance and even gives rise to emergent self-correction behavior. The paper has been accepted to ICLR 2026, and the paper, code, and models are all open-source.
Paper title: WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
Project website: https://wm-po.github.io
Paper link: https://arxiv.org/abs/2511.09515
Code: https://github.com/WM-PO/WMPO
First author Fangqi Zhu is a PhD student at HKUST whose research covers world models, embodied intelligence, and multimodal large models. The second author is Zhengyang Yan, a research master's student at HKUST. The corresponding authors are Song Guo, Chair Professor in the Department of Computer Science and Engineering at HKUST, and Xiao Ma of ByteDance's Seed team. Traditional ...
What impressive results emerge when world models, VLA, and reinforcement learning are combined?
具身智能之心· 2026-01-15 00:32
Core Insights
- The article discusses the potential of the Vision-Language-Action (VLA) model in general robotic manipulation, highlighting its reliance on expert demonstration data, which limits its ability to learn from failures and self-correct [2]
- It introduces WMPO, a world model-based policy optimization method that improves sample efficiency and overall performance in reinforcement learning (RL) without requiring real-world interaction [3]

Group 1
- The VLA model shows strong potential in robotic tasks but struggles with self-improvement due to its dependence on expert data [2]
- Reinforcement learning can address the limitations of VLA models by enabling self-improvement through autonomous interaction with physical environments, although it faces high sample complexity when applied to real robots [2]
- WMPO builds on pixel-based prediction, aligning "imagined" trajectories with VLA features pre-trained on large-scale web images, leading to superior performance compared to traditional offline methods [3]

Group 2
- WMPO demonstrates significant advantages, including improved sample efficiency, better overall performance, emergence of self-correcting behaviors, and robust generalization and lifelong-learning capabilities [3]
- The article links to the WMPO research paper and project homepage for further exploration [4]
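A minimal sketch of the "train in imagination" loop both WMPO entries describe: the policy acts, the world model predicts the next observation, and a policy-gradient update is applied to the imagined trajectory. Everything here is an illustrative assumption (the function names, the REINFORCE-style estimator, the learned reward signal); the paper's actual optimization procedure may differ.

```python
import torch

def wmpo_style_update(policy, world_model, reward_model, obs, horizon=16, gamma=0.99):
    """Roll out the policy inside the learned world model ("imagination")
    and compute a policy-gradient loss, so no real-robot interaction is needed."""
    log_probs, rewards = [], []
    for _ in range(horizon):
        dist = policy(obs)                      # action distribution from the VLA policy
        action = dist.sample()
        log_probs.append(dist.log_prob(action).sum(-1))
        obs = world_model(obs, action)          # predicted next observation (pixel space)
        rewards.append(reward_model(obs))       # learned progress/success signal
    # Discounted returns, computed backwards over the imagined trajectory.
    returns, g = [], torch.zeros_like(rewards[-1])
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    # Reinforce actions that led to high imagined return.
    loss = -torch.stack([lp * ret.detach() for lp, ret in zip(log_probs, returns)]).mean()
    return loss
```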
Just in: AgiBot (智元) proposes SOP, giving VLA models scalable online evolution in the real world
机器之心· 2026-01-06 09:38
With consumer electronics, we have grown used to the idea of "peak performance at the factory": the moment of unboxing is usually the high point, and every day after is depreciation. For general-purpose robots, that assumption has to be overturned.

Imagine an AI robot fully trained in the lab that crashes the moment it enters a home and faces a dimly lit room or a cluttered coffee table; it will forever remain an expensive experiment. This is the awkward truth of embodied intelligence today: we have trained erudite models on internet knowledge, but once these "theoretical giants" step into the unpredictable physical world, environmental change leaves them helpless. They "understand" plenty of principles, yet still cannot do the housework.

The way forward for general-purpose robots should not be a "static product" trapped in its factory settings, but something closer to an organism that keeps getting stronger in real deployment, through every failure and correction. To make this leap, AgiBot's Embodied Research Center proposed the SOP (Scalable Online Post-training) framework.

Over the past few years, VLA (vision-language-action) models pre-trained on massive internet data have given robots a measure of general-purpose generalization, but a hard-to-cross gap remains: "understanding" does not mean "doing." A pre-trained model may "understand" what folding laundry is, yet when it faces a real garment with soft fabric under complex lighting, distribution shift often leaves it helpless. ...
NVIDIA cracks counterfactual-reasoning VLA with ten million clips! Safety metrics improved by 20%......
自动驾驶之心· 2026-01-05 03:33
Core Insights
- The article discusses the development of the Counterfactual Vision-Language-Action (CF-VLA) model, which incorporates self-reflective reasoning to enhance the safety and accuracy of autonomous driving systems [3][56]
- CF-VLA aims to address the limitations of existing Vision-Language-Action (VLA) models by enabling them to reflect on their planned actions before execution, thereby improving decision-making in complex driving scenarios [10][56]

Group 1: Model Development
- CF-VLA introduces adaptive reasoning and self-reflection capabilities, allowing the model to adjust its actions based on potential outcomes identified through counterfactual reasoning [3][10]
- The model generates time-segmented meta-actions to summarize driving intentions and uses these to perform counterfactual reasoning, identifying unsafe behaviors and correcting them before final trajectory generation [3][10]
- The "rollout-filter-label" data processing pipeline extracts high-value scenarios from the model's rollout results, enhancing the training process for counterfactual reasoning [11][14]

Group 2: Performance Metrics
- Experiments on large-scale driving datasets show that CF-VLA improves trajectory accuracy by up to 17.6% and safety metrics by 20.5% compared to baseline models [14][56]
- The model demonstrates adaptive reasoning, activating counterfactual reasoning primarily in complex scenarios and thus conserving computational resources at test time [16][48]
- The introduction of meta-actions significantly enhances performance, reducing minimum average displacement error (minADE) and minimum final displacement error (minFDE) by approximately 9% compared to pure trajectory models [43][44]

Group 3: Practical Applications
- CF-VLA's self-reflective capabilities allow it to make context-specific corrections, improving safety and traffic efficiency in scenarios such as avoiding congestion and responding to pedestrians [57]
- The model's ability to decide dynamically when to engage in reasoning maintains a balance between computational efficiency and decision quality [21][48]
- The findings suggest that counterfactual self-reflection can effectively bridge reasoning and control in autonomous driving systems, providing a framework for future advances in the field [56][57]
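A compact sketch of the reflect-before-acting loop described above: propose meta-actions, counterfactually simulate their outcomes (in complex scenes only), correct unsafe ones, then decode the final trajectory. Every method name here is a hypothetical placeholder, not CF-VLA's real API.

```python
def plan_with_counterfactuals(model, scene):
    """Hypothetical wrapper around a CF-VLA-style planner; propose_meta_actions,
    is_complex, simulate_outcome, is_unsafe, correct, and decode_trajectory are
    all illustrative placeholders."""
    meta_actions = model.propose_meta_actions(scene)       # time-segmented driving intentions
    if model.is_complex(scene):                            # adaptive: reflect only on hard scenes
        for i, meta in enumerate(meta_actions):
            outcome = model.simulate_outcome(scene, meta)  # counterfactual "what if I do this?"
            if model.is_unsafe(outcome):                   # e.g. predicted collision or near-miss
                meta_actions[i] = model.correct(meta, outcome)
    return model.decode_trajectory(scene, meta_actions)    # final trajectory generation
```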
Bridging the 2D-3D gap! Peking University proposes VIPA-VLA, using video to unlock precise robot manipulation
具身智能之心· 2025-12-26 00:55
Core Insights
- The article discusses a new approach to robot learning that addresses the challenge of aligning 2D visual information with 3D spatial understanding, a significant limitation of existing vision-language-action (VLA) models [3][6][41]
- The research introduces a novel pre-training paradigm that uses human demonstration videos to strengthen robots' spatial perception, allowing them to infer 3D spatial relationships from 2D visual inputs [4][40]

Research Background
- Current VLA models are limited by reliance on expensive robot datasets and a lack of explicit 3D spatial modeling, which hampers their ability to map physical actions accurately [6][7]
- Human demonstration videos offer a solution: they provide diverse scenarios and inherent visual-physical correspondences that serve as valuable supervision signals for robot learning [7][8]

Hand3D Dataset
- The Hand3D dataset, comprising Hand3D-visual and Hand3D-action components, is described as a "3D spatial textbook" for robots, enabling them to learn visual-physical alignment [8][9]
- The dataset draws on nine heterogeneous human manipulation datasets, ensuring a wide variety of scenes and tasks [8][9]

Model Architecture: VIPA-VLA
- VIPA-VLA features a dual-encoder architecture that integrates semantic visual features with 3D spatial features, improving the model's understanding of both scene semantics and spatial structure [15][20]
- A cross-attention fusion layer combines these features, allowing the model to learn 3D relationships from 2D inputs effectively [17][20]

Training Process
- Training consists of three phases (3D visual pre-training, 3D action pre-training, and post-training for task adaptation), ensuring gradual acquisition of 3D capabilities [21][22]
- The first phase aligns semantic and spatial features; the second teaches the model to predict 3D motion tokens from visual-language inputs [22][23]

Experimental Results
- VIPA-VLA outperformed existing baselines across tasks, achieving success rates of 92.4% in single-view and 96.8% in dual-view settings on the LIBERO benchmark [27][28]
- On the RoboCasa benchmark, VIPA-VLA achieved a 45.8% success rate, surpassing other models, particularly on tasks requiring precise 3D positioning [30]
- The model performed strongly on real-world tasks, reaching a 60% success rate on the Wipe-Board task, significantly higher than competing models [31][34]

Significance and Future Directions
- The research presents a new robot-learning paradigm that reduces reliance on costly robot data and improves generalization by leveraging human demonstration videos [40][41]
- Future work aims to combine this pre-training paradigm with robot-data pre-training and to expand the Hand3D dataset to more complex human-robot interaction tasks [40][41]
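A rough PyTorch sketch of the dual-encoder fusion described above: semantic visual tokens attend to 3D spatial tokens through a cross-attention layer. The class name, dimensions, and the residual-plus-norm arrangement are assumptions for illustration, not VIPA-VLA's exact architecture.

```python
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    """Fuse semantic visual tokens with 3D spatial tokens via cross-attention,
    so each semantic patch is enriched with the geometry it corresponds to."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, semantic_tokens, spatial_tokens):
        fused, _ = self.cross_attn(query=semantic_tokens,
                                   key=spatial_tokens,
                                   value=spatial_tokens)
        return self.norm(semantic_tokens + fused)

# Usage: two encoders (e.g. a CLIP-style ViT and a depth/geometry encoder,
# both hypothetical here) each produce (batch, tokens, dim) features.
fusion = DualEncoderFusion()
sem = torch.randn(2, 196, 768)
spa = torch.randn(2, 196, 768)
out = fusion(sem, spa)   # (2, 196, 768)
```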
From 2D perception to 3D prediction: GeoPredict rebuilds the geometric reasoning capability of VLA models
具身智能之心· 2025-12-25 01:41
Author: Jingjing Qian et al.

In robotic manipulation, vision-language-action (VLA) models have achieved cross-task generalization by drawing on the semantic and visual priors of large-scale pre-training data, but they remain confined to a 2D-centric, reactive decision-making paradigm and struggle with complex tasks that demand precise 3D spatial reasoning and long-horizon physical consistency.

The GeoPredict framework, proposed by a joint team from the Chinese University of Hong Kong (Shenzhen), Hunan University, Li Auto, and others, is built on the dual core of "predictive kinematics + 3D Gaussian geometry." Through a novel architecture of trajectory-level motion prediction, 3D Gaussian scene modeling, and training-time supervision with a lightweight inference path, it injects future-aware geometric priors into continuous-action VLA models for the first time, breaking through the spatial-reasoning bottleneck of traditional methods.

Paper title: GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Preci ...
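A hedged sketch of the "supervise at training time, stay lightweight at inference" pattern attributed to GeoPredict: an auxiliary head predicts future end-effector waypoints to shape the backbone's features during training and is skipped at inference. All names, shapes, and the MSE objective are illustrative assumptions, not the paper's actual losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeoAuxVLA(nn.Module):
    """Auxiliary future-trajectory head used only during training; the
    inference path runs the action head alone, so it stays lightweight."""
    def __init__(self, dim=512, action_dim=7, horizon=8):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)            # stand-in for the real VLA backbone
        self.action_head = nn.Linear(dim, action_dim)
        self.traj_head = nn.Linear(dim, horizon * 3)   # future end-effector waypoints (x, y, z)

    def forward(self, feats, future_traj=None):
        h = torch.relu(self.backbone(feats))
        action = self.action_head(h)
        if self.training and future_traj is not None:
            # Trajectory-level motion prediction as an auxiliary objective;
            # a 3D-Gaussian rendering loss could be attached the same way.
            pred = self.traj_head(h).view(-1, future_traj.shape[1], 3)
            return action, F.mse_loss(pred, future_traj)
        return action  # inference path: no auxiliary computation, no extra cost
```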
Nearly 300 works! King's College London × PolyU comprehensively dissect VLA models in a clear, systematic roadmap
具身智能之心· 2025-12-17 00:05
Author: Chao Xu et al.

This survey offers a comprehensive anatomy of vision-language-action (VLA) models and is a highly valuable navigation guide for the field. Its core conclusion: VLA models are driving a transformation in robotics, and their development follows the logic of "foundational modules → historical milestones → core challenges." The five core challenges (representation, execution, generalization, safety, and data & evaluation) are the key breakthrough points for current research, and the relevant structure and key information are presented intuitively in the paper's figures and tables.

Positioning and structure: the survey is organized around a researcher's natural learning path, progressing from fundamentals to frontiers, so it serves both as an entry point for newcomers and as a source of direction for experienced researchers.

Foundational modules: a VLA system is composed of three core modules (perception, brain, and action), each showing clear technical iteration in recent years; the key technology choices and representative models per module can be found in the accompanying dataset and milestone tables.

Paper title: An Anatomy of Vision-Language-Action Models: From Modules ...
A first from an NUS team! What happens when VLA gains 4D awareness?
具身智能之心· 2025-12-15 03:17
Core Insights
- The article discusses the VLA-4D model, which integrates 4D awareness into vision-language-action frameworks for coherent robotic manipulation, addressing challenges of spatiotemporal consistency in robotic tasks [2][3]

Group 1: Model Features
- VLA-4D extends traditional spatial action representation with temporal information, enabling improved spatiotemporal action planning and prediction [2]
- The model consists of two key modules: a 4D-aware visual representation that combines visual features with temporal data, and a spatiotemporal action representation that aligns multimodal representations with large language models [2]

Group 2: Applications and Challenges
- VLA-4D aims to achieve both spatial smoothness and temporal consistency in robotic operations, which is crucial in dynamic environments [2]
- Existing methods struggle to maintain temporal coherence during action execution, highlighting the need for advances like VLA-4D [2]

Group 3: Related Technologies
- The article also mentions foundational models such as 4D-VGGT for dynamic geometric perception and LLaVA-4D for enhanced dynamic scene reasoning, which complement VLA-4D's capabilities [6][7]
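A minimal sketch of one way a visual representation can gain the temporal half of "4D awareness": adding a learned time embedding to per-frame visual tokens. This is a generic illustration under assumed shapes, not VLA-4D's actual module.

```python
import torch
import torch.nn as nn

class TimeAwareVisual(nn.Module):
    """Add a learned time embedding to per-frame visual tokens so downstream
    action decoding can reason about when as well as where."""
    def __init__(self, dim=768, max_steps=64):
        super().__init__()
        self.time_embed = nn.Embedding(max_steps, dim)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, time, tokens, dim)
        t = torch.arange(frame_tokens.shape[1], device=frame_tokens.device)
        return frame_tokens + self.time_embed(t)[None, :, None, :]

fuse = TimeAwareVisual()
video_feats = torch.randn(2, 16, 196, 768)   # 16 frames of ViT-style tokens
out = fuse(video_feats)                      # same shape, now time-aware
```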
Li Auto's autonomous driving chief responds to Unitree's Wang Xingxing on VLA doubts: empty talk about architecture matters less than results
Feng Huang Wang· 2025-12-10 10:27
Core Viewpoint
- Li Auto's head of autonomous driving believes, based on hands-on experience, that the VLA (Vision-Language-Action) model is the best solution for autonomous driving, countering skepticism from industry peers [1]

Group 1: Response to Industry Concerns
- Unitree founder Wang Xingxing had expressed doubts about the VLA model, describing it as a "relatively simplistic architecture" and maintaining a skeptical attitude [1]
- Li Auto argues that discussing model architecture without real data is pointless, highlighting the extensive data it collects from millions of vehicles to support the VLA model [1]

Group 2: Future of Robotics
- Li Auto's CEO predicts that in the next five to ten years there will be two main forms of embodied robots: automotive and humanoid [1]
- The VLA model is designed not only for current automotive products but also for future automotive embodied robots [1]
SJTU & Shanghai AI Lab jointly propose MM-ACT: a unified VLA model for efficient perception-planning-execution synergy
具身智能之心· 2025-12-02 09:30
In robotic manipulation, the balance between generality and efficiency has always been the central challenge: existing approaches either lack dynamics-modeling capability and struggle with complex environment interaction, or infer too slowly to meet real-time control requirements.

MM-ACT, jointly proposed by Shanghai AI Laboratory, Shanghai Jiao Tong University, and collaborators, is built around a unified multimodal representation plus a parallel decoding architecture, and introduces a novel "context-shared multimodal learning" paradigm enabling coordinated generation of text, images, and actions. It combines precise semantic understanding and environment prediction with efficient action output, and outperforms existing approaches in both simulated and real-world settings.

Why does the vision-language-action (VLA) model architecture need rebuilding? Current VLA models are caught in a triple contradiction: semantic understanding and dynamics modeling are hard to reconcile, multimodal generation is inefficient, and training objectives are misaligned. The core problem boils down to the inability to achieve efficient "perception-planning-execution" synergy within a unified framework:

| Approach type | Representative idea | | Core ...
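A minimal sketch of the "shared context, parallel decoding" idea this entry describes: one backbone yields a shared multimodal context, and separate heads decode text, image, and a whole action chunk from it in a single parallel step instead of one long autoregressive chain. All module names, dimensions, and the chunk size are illustrative assumptions, not MM-ACT's actual design.

```python
import torch
import torch.nn as nn

class ParallelMultimodalHeads(nn.Module):
    """Decode text logits, an image patch, and a full action chunk in parallel
    from one pooled multimodal context vector."""
    def __init__(self, dim=1024, vocab=32000, patch_dim=768, action_dim=7, chunk=8):
        super().__init__()
        self.text_head = nn.Linear(dim, vocab)
        self.image_head = nn.Linear(dim, patch_dim)
        self.action_head = nn.Linear(dim, action_dim * chunk)  # whole chunk at once

    def forward(self, context):
        # context: (batch, dim) pooled representation from a shared backbone
        return {
            "text_logits": self.text_head(context),
            "image_patch": self.image_head(context),
            "actions": self.action_head(context),
        }

heads = ParallelMultimodalHeads()
out = heads(torch.randn(2, 1024))   # three modalities decoded in one pass
```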