Vision-Language-Action (VLA) Models

A Pure VLA Survey Arrives: From VLMs to Diffusion to Reinforcement Learning
自动驾驶之心· 2025-09-30 16:04
1. Introduction
Robotics has long been an important area of scientific research. Early robots relied mainly on pre-programmed instructions and hand-designed control policies to decompose and execute tasks. These methods were typically applied to simple, repetitive tasks such as factory assembly lines and logistics sorting. In recent years, the rapid development of artificial intelligence has allowed researchers to exploit deep learning's feature extraction and trajectory prediction on multimodal data such as images, text, and point clouds. By combining perception, detection, tracking, and localization, researchers decompose robot tasks into multiple stages to meet execution requirements, advancing embodied intelligence and autonomous driving. However, most robots still operate as isolated agents: they are usually designed for specific tasks and lack effective interaction with humans and the external environment.
To overcome these limitations, researchers have begun to introduce large language models (LLMs) and vision-language models (VLMs) into robotic manipulation to achieve more precise and flexible control. Modern manipulation methods typically rely on vision-language generative paradigms (such as autoregressive or diffusion models), combined with large-scale datasets and advanced fine-tuning strategies. We refer to these methods as VLA foundation models; they significantly improve the quality of robotic manipulation. Fine-grained action control over the generated output gives users greater flexibility, unlocking the practical potential of VLA in task execution.
Title: Pure Vision Language Action (VLA) ...
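As a rough illustration of the two generative paradigms named above, the sketch below contrasts an autoregressive action-token head with a diffusion-style action head on top of a vision-language backbone. This is a simplified sketch under my own assumptions; the class names, shapes, and update rule are hypothetical and do not come from the surveyed papers.

```python
# Minimal sketch (assumptions, not reference code from the survey): two common
# VLA action-generation paradigms on top of a vision-language backbone.
import torch
import torch.nn as nn

class AutoregressiveActionHead(nn.Module):
    """Maps backbone hidden states to discretized action-token logits.
    Greedy decode shown; real autoregressive heads decode tokens step by step."""
    def __init__(self, hidden_dim=512, num_action_bins=256, action_dim=7):
        super().__init__()
        self.token_head = nn.Linear(hidden_dim, num_action_bins)

    def forward(self, vlm_hidden_states):
        # vlm_hidden_states: (batch, action_dim, hidden_dim), one slot per action dimension
        logits = self.token_head(vlm_hidden_states)   # (batch, action_dim, bins)
        return logits.argmax(dim=-1)                  # discrete action tokens

class DiffusionActionHead(nn.Module):
    """Iteratively denoises a continuous action chunk conditioned on VLM features."""
    def __init__(self, hidden_dim=512, action_dim=7, horizon=8, steps=10):
        super().__init__()
        self.steps, self.horizon, self.action_dim = steps, horizon, action_dim
        self.denoiser = nn.Sequential(
            nn.Linear(hidden_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim))

    @torch.no_grad()
    def forward(self, vlm_context):
        # vlm_context: (batch, hidden_dim) pooled vision-language embedding
        actions = torch.randn(vlm_context.shape[0], self.horizon, self.action_dim)
        for _ in range(self.steps):  # simplified fixed-step denoising loop
            cond = vlm_context.unsqueeze(1).expand(-1, self.horizon, -1)
            actions = actions - 0.1 * self.denoiser(torch.cat([cond, actions], dim=-1))
        return actions  # continuous action trajectory chunk
```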
AnywhereVLA: Running VLA in Real Time on Consumer-Grade Hardware
具身智能之心· 2025-09-29 02:08
Authors: Artem Voronov et al.
1. Core background and goals
Mobile manipulation is expanding from closed, structured work cells to open, unstructured large indoor environments: robots must explore unfamiliar, cluttered spaces, interact with diverse objects and with humans, and complete tasks in response to natural-language instructions (e.g., home service, retail automation, and warehouse logistics). Existing approaches, however, have clear bottlenecks. AnywhereVLA therefore proposes a modular architecture whose core idea is to fuse the robustness of classical navigation with the semantic understanding of VLA models, enabling language-driven pick-and-place in unknown, large indoor environments while running in real time on consumer-grade hardware.
2. Related work: strengths and limitations of existing approaches
(1) VLA models and lightweight optimization; (2) diffusion Transformers and navigation-related approaches.
3. The AnywhereVLA architecture: four core modules and workflow
AnywhereVLA takes a natural-language instruction as input; its four modules cooperate to output low-level control commands (driving the base wheels and the manipulator joints). The overall ...
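The excerpt is cut off before naming the four modules, so the sketch below only illustrates the general pattern it describes: classical navigation handles mobility while a VLA policy handles manipulation, both driven by the same language instruction. All class names, interfaces, and thresholds here are hypothetical assumptions, not AnywhereVLA's actual design.

```python
# Hypothetical sketch of the modular pattern described above (not AnywhereVLA's
# actual interfaces): a classical navigation stack drives the mobile base and a
# VLA manipulation policy drives the arm, behind one language-driven entry point.
from dataclasses import dataclass

@dataclass
class BaseCommand:
    linear: float   # forward velocity of the wheeled base, m/s
    angular: float  # yaw rate, rad/s

@dataclass
class ArmCommand:
    joint_targets: list  # target angles for the manipulator joints, rad

class ClassicalNavigator:
    """Stands in for a SLAM + planning stack (mapping, exploration, path following)."""
    def drive_toward(self, goal_label, observation) -> BaseCommand:
        # ...update map, pick a frontier or known goal, follow the planned path...
        return BaseCommand(linear=0.3, angular=0.0)

class VLAManipulationPolicy:
    """Stands in for a fine-tuned VLA model that outputs arm joint targets."""
    def act(self, instruction, rgb_image) -> ArmCommand:
        # ...tokenize instruction + image, decode an action chunk...
        return ArmCommand(joint_targets=[0.0] * 6)

def near_target(observation, threshold_m: float = 0.5) -> bool:
    """Hypothetical helper: close enough to the target to start manipulating?"""
    return observation.get("distance_to_target", float("inf")) < threshold_m

def control_step(instruction, observation, rgb_image, navigator, policy):
    """One step of a language-driven pick-and-place loop: navigate, then manipulate."""
    if not near_target(observation):
        return navigator.drive_toward(instruction, observation)
    return policy.act(instruction, rgb_image)

# Example usage with placeholder observations.
nav, pol = ClassicalNavigator(), VLAManipulationPolicy()
cmd = control_step("pick up the red mug and place it on the shelf",
                   {"distance_to_target": 2.0}, None, nav, pol)
```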
How VLA Is Applied and Implemented Across Different Scenarios, Drawn from 300+ Works...
具身智能之心· 2025-09-25 04:00
A recent survey jointly produced by Lanzhou University, the Chinese Academy of Sciences, the National University of Singapore, and other institutions: Pure Vision Language Action (VLA) Models: A Comprehensive Survey
Paper link: https://arxiv.org/pdf/2509.19012
The emergence of Vision Language Action (VLA) models marks a paradigm shift in robotics from traditional policy-based control toward general-purpose robotics, and repositions Vision Language Models (VLMs) from passive sequence generators to active agents that perform manipulation and decision-making in complex, dynamic environments.
Robotics has long been an important area of scientific research. Historically, robots relied mainly on pre-programmed instructions and hand-designed control policies to decompose and execute tasks. These methods were typically applied to simple, repetitive tasks, such as factory ...
In-Depth Survey | 300+ Papers Explain How Pure Vision Pushes VLA to the Forefront of Autonomous Driving and Embodied Intelligence
自动驾驶之心· 2025-09-24 23:33
The emergence of Vision Language Action (VLA) models marks a paradigm shift in robotics from traditional policy-based control toward general-purpose robotics, and repositions Vision Language Models (VLMs) from passive sequence generators to active agents that perform manipulation and decision-making in complex, dynamic environments.
To this end, a team from Lanzhou University, the Chinese Academy of Sciences, and the National University of Singapore examines advanced VLA methods in depth, aiming to provide a clear taxonomy and a systematic, comprehensive review of existing research. The survey analyzes VLA applications across different scenarios and divides VLA methods into several paradigms: autoregressive, diffusion-based, reinforcement-learning-based, hybrid, and specialized methods, detailing the design motivations, core strategies, and implementations of each.
In addition, it introduces the datasets, benchmarks, and simulation platforms needed for VLA research. Based on the current state of the field, the survey further identifies key challenges and future directions to advance VLA models and general-purpose robotics. By synthesizing insights from more than 300 recent studies, it outlines the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose VLA methods.
Paper title: Pure Vision Language Action (VLA) M ...
Tsinghua and Li Auto Propose LightVLA: Prune Redundant Tokens, 38% Faster Inference
具身智能之心· 2025-09-18 00:03
Authors: Titong Jiang et al.
Research background and core challenge
Vision-Language-Action (VLA) models are a core technology for robotic embodied intelligence: they translate visual information and language instructions directly into executable robot actions and show strong capability in complex manipulation (e.g., object grasping, long-horizon planning). But these models face a key bottleneck: computational redundancy in visual tokens. VLA models typically process hundreds of visual tokens (OpenVLA-OFT uses 512), and the computational cost of attention grows quadratically with the number of tokens, making real-time deployment on edge devices (e.g., home robots, autonomous vehicles) difficult.
Existing optimization approaches have clear limitations:
1. Efficiency-performance trade-off: most token-pruning methods (e.g., EfficientVLA, VLA-Cache) keep a fixed number of tokens to improve efficiency, which discards key semantic information and ultimately sacrifices performance;
2. VLM pruning schemes do not ...
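For context on the bottleneck described above, the sketch below shows the simplest form of fixed-budget, query-conditioned visual-token pruning, i.e. the style of method the excerpt critiques for its efficiency-performance trade-off. It is a generic illustration under my own assumptions, not LightVLA's adaptive pruning algorithm.

```python
# Generic sketch of fixed-budget, query-conditioned visual-token pruning
# (assumption: not the LightVLA algorithm). Tokens are scored against the
# language query and only the top-k are kept, shrinking quadratic attention cost.
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        text_query: torch.Tensor,
                        keep_ratio: float = 0.25) -> torch.Tensor:
    """
    visual_tokens: (batch, num_tokens, dim) patch embeddings from the vision encoder
    text_query:    (batch, dim) pooled instruction embedding
    Returns the kept subset of visual tokens, (batch, k, dim).
    """
    # Relevance of each visual token to the instruction (scaled dot product).
    scores = torch.einsum("bnd,bd->bn", visual_tokens, text_query)
    scores = scores / visual_tokens.shape[-1] ** 0.5

    k = max(1, int(keep_ratio * visual_tokens.shape[1]))
    top_idx = scores.topk(k, dim=1).indices                          # (batch, k)
    top_idx = top_idx.unsqueeze(-1).expand(-1, -1, visual_tokens.shape[-1])
    return torch.gather(visual_tokens, dim=1, index=top_idx)

# Example: 512 visual tokens reduced to 128 before the expensive attention layers.
if __name__ == "__main__":
    vis = torch.randn(2, 512, 768)
    txt = torch.randn(2, 768)
    print(prune_visual_tokens(vis, txt).shape)  # torch.Size([2, 128, 768])
```

A hard budget like this is exactly what can drop task-critical patches; the article's point is that LightVLA instead selects tokens adaptively rather than to a fixed count.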
SimpleVLA-RL: Breaking the VLA Training Bottleneck with End-to-End Online RL
自动驾驶之心· 2025-09-15 03:56
Core Insights
- The article discusses the SimpleVLA-RL framework, which enhances the training of Vision-Language-Action (VLA) models in robotics through reinforcement learning (RL), addressing key challenges in data scarcity and generalization [3][4][6].

Group 1: Research Background and Core Issues
- VLA models are crucial for robotic manipulation, integrating visual perception, language understanding, and action generation, but current training methods face two main bottlenecks: data scarcity and weak generalization [4][6].
- The traditional training process relies heavily on large-scale human operation data, which is costly and difficult to scale, limiting model scalability [4][6].
- The article asks whether RL can enhance the long-horizon action planning capabilities of VLA models, despite the unique challenges posed by VLA applications [4][6].

Group 2: SimpleVLA-RL Framework Contributions
- SimpleVLA-RL is designed to improve VLA training efficiency, particularly in data-scarce settings, and has achieved state-of-the-art (SOTA) performance on benchmarks such as LIBERO and RoboTwin [7][8].
- The framework incorporates interactive trajectory sampling, parallel training across multiple environments, and a unified design for training, inference, and rendering, addressing the slow interaction and high cost of VLA training [7][8].
- It demonstrates significant improvements in success rates across tasks, raising LIBERO's average success rate from 91.0% to 99.1% and RoboTwin 2.0's from 38.3% to 68.8% [7][8][14].

Group 3: Data Efficiency and Generalization
- SimpleVLA-RL significantly reduces the dependence on large-scale demonstration data, achieving an average success rate of 96.9% with only one demonstration trajectory, surpassing full-trajectory supervised fine-tuning [19][20].
- The framework improves robustness across different scenes, objects, and tasks, with better performance on unseen tasks than traditional methods [21][24].

Group 4: Real-World Deployment and Innovations
- The framework shows effective Sim-to-Real transfer, with real-world task success rates improving from 17.5% to 38.5% using only simulated data for training [24][27].
- A notable discovery is the "Pushcut" phenomenon, in which the RL-trained model autonomously discovers strategies more efficient than the human demonstrations, indicating a potential for innovative behavior in VLA models [25][30].

Group 5: Summary and Conclusions
- SimpleVLA-RL addresses three core issues in VLA model training: reducing reliance on large-scale demonstration data, enhancing generalization, and achieving efficient Sim-to-Real transfer [31][32].
- The findings suggest that RL can enable VLA models to explore superior strategies, paving the way for more autonomous and adaptive robotic systems [31][32].
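The summary above mentions interactive trajectory sampling across parallel environments with an outcome-level reward. The sketch below is a hypothetical, heavily simplified illustration of such an online RL loop with a sparse success reward; the environment and policy interfaces, and the plain REINFORCE-with-baseline update, are my own assumptions rather than the SimpleVLA-RL implementation.

```python
# Hypothetical sketch of online RL fine-tuning for a VLA policy with a sparse
# binary success reward, in the spirit of the summary above. The env/policy
# interfaces and the simple policy-gradient update are assumptions, not SimpleVLA-RL.
import torch

def collect_rollouts(policy, envs, max_steps=200):
    """Sample one trajectory per parallel simulated environment."""
    trajectories = []
    for env in envs:
        obs, info, log_probs = env.reset(), {}, []
        for _ in range(max_steps):
            action, log_prob = policy.sample_action(obs["image"], obs["instruction"])
            obs, done, info = env.step(action)
            log_probs.append(log_prob)
            if done:
                break
        # Sparse outcome reward: 1.0 if the manipulation task succeeded, else 0.0.
        trajectories.append((torch.stack(log_probs), float(info.get("success", False))))
    return trajectories

def policy_gradient_update(policy, optimizer, trajectories):
    """One REINFORCE step with a batch-mean baseline over trajectory returns."""
    returns = torch.tensor([ret for _, ret in trajectories])
    baseline = returns.mean()
    loss = torch.stack([
        -(ret - baseline) * log_probs.sum()
        for (log_probs, _), ret in zip(trajectories, returns)
    ]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```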
The Galaxea Team Releases a Large-Scale, High-Quality Open-World Dataset and the G0 Dual-System VLA Model
具身智能之心· 2025-09-04 01:04
Core Insights
- The article presents the Galaxea Open-World Dataset, a large-scale and diverse collection of robot behaviors recorded in real human living and working environments, addressing the scarcity of high-quality open-world robot data and insufficient model generalization [3][5][6].

Dataset Overview
- The dataset consists of 500 hours of data, 100,000 demonstration trajectories, covering 150 task categories, 1,600 object types, and 58 operational skills, with a 2Hz frequency for detailed sub-task instruction labeling [8][12].
- Data was collected using the Galaxea R1 Lite mobile dual-arm robot, which has 23 degrees of freedom and is equipped with RGB cameras for global scene perception and fine operation sensing [5][6].

Data Diversity and Coverage
- The dataset includes data from 11 physical sites across 50 unique scenarios, covering residential, retail, dining, and office environments, thus avoiding the limitations of existing datasets that are confined to controlled laboratory settings [6][12].
- The distribution of tasks shows a balance between basic actions and specialized skills, with residential scenes making up 50.8% and office scenes 33.2% of the dataset [11][12].

G0 Dual-System Framework
- The G0 framework couples a "slow thinking" visual-language model (G0-VLM) with a "fast execution" visual-language-action model (G0-VLA), employing a three-stage training strategy to achieve complex task planning and precise execution [5][19].
- The training phases include cross-entity pre-training, single-entity pre-training, and task-specific fine-tuning, which enhance the model's performance significantly [21][30].

Model Performance Evaluation
- The G0-VLA model demonstrated superior performance in benchmark tasks such as desktop organization and microwave operation, with G0-Full achieving the highest average task progress scores [39][47].
- The study found that single-entity pre-training is essential for effective model adaptation, as cross-entity pre-training can lead to negative transfer due to significant differences between the training and target robot entities [39][46].

Key Findings
- The G0-VLM model outperformed mainstream visual-language models in instruction accuracy, achieving 83.3% in desktop organization and 78.2% in bed-making tasks, highlighting the importance of domain-specific fine-tuning [42][47].
- The dataset's design and the dual-system framework effectively address the challenges of real-world robot task execution, providing a robust foundation for future advancements in embodied intelligence [17][19].
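To make the dataset description concrete, here is a hypothetical sketch of what one demonstration record with sub-task labels might look like when loaded. The field names and structure are my own assumptions for illustration, not the published Galaxea Open-World Dataset schema.

```python
# Hypothetical record layout for a demonstration trajectory with fine-grained
# sub-task labels, matching the statistics quoted above in spirit only
# (field names are assumptions, not the released Galaxea data format).
from dataclasses import dataclass, field

@dataclass
class SubTaskLabel:
    start_s: float          # sub-task start time within the trajectory, seconds
    end_s: float            # sub-task end time, seconds
    instruction: str        # fine-grained instruction, e.g. "open the microwave door"
    skill: str              # one of the ~58 operational skills

@dataclass
class Demonstration:
    task_category: str                  # one of the ~150 task categories
    scene_type: str                     # "residential", "retail", "dining", "office", ...
    object_types: list = field(default_factory=list)             # from the ~1,600 object types
    robot: str = "Galaxea R1 Lite"      # 23-DoF mobile dual-arm platform
    rgb_streams: dict = field(default_factory=dict)               # camera name -> video path
    joint_trajectory_path: str = ""     # proprioception / action recording
    subtasks: list = field(default_factory=list)                  # list of SubTaskLabel

def training_pairs(demo: Demonstration):
    """Turn labeled sub-task spans into (observation span, instruction) training pairs."""
    for st in demo.subtasks:
        yield {"video": demo.rgb_streams, "span": (st.start_s, st.end_s),
               "instruction": st.instruction}
```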
A New SOTA for Autonomous-Driving VLA! Alibaba's AutoDrive-R²: Self-Reflective Chain-of-Thought and Physics-Based Rewards Break the VLA Generalization Bottleneck
自动驾驶之心· 2025-09-03 23:33
Core Viewpoint
- The article discusses the introduction of AutoDrive-R², a novel Vision-Language-Action (VLA) framework developed by Alibaba and the University of Queensland, aimed at enhancing the reasoning and trajectory planning capabilities of autonomous driving systems through a two-stage training approach [2][49].

Group 1: Framework Overview
- AutoDrive-R² integrates a structured reasoning process with self-reflection capabilities to improve decision-making in complex driving scenarios [8][10].
- The framework consists of two training phases: the first phase involves supervised fine-tuning using the nuScenesR²-6K dataset, while the second phase employs reinforcement learning (RL) with a physics-based reward framework [17][49].

Group 2: Dataset and Training
- A new dataset, nuScenesR²-6K, was created to facilitate supervised fine-tuning, containing 6,000 "image-trajectory" pairs that include reasoning and self-reflection steps [19][20].
- The training process emphasizes a four-step logical chain: visualization, computation, logic, and reflection, which enhances the model's reasoning capabilities [20][43].

Group 3: Performance and Results
- AutoDrive-R² demonstrated state-of-the-art (SOTA) performance on both nuScenes and Waymo datasets, achieving significant reductions in L2 error compared to existing methods [35][37].
- The model's average L2 error on the nuScenes dataset was reduced by 86.9% compared to previous leading methods, showcasing its strong generalization ability [35][39].

Group 4: Reinforcement Learning and Reward Mechanism
- The reinforcement learning phase utilizes Group Relative Policy Optimization (GRPO) to optimize trajectory planning, incorporating a physics-based reward framework that ensures the generated trajectories are physically feasible and comfortable [21][26].
- The reward framework includes components for spatial alignment, vehicle dynamics, and temporal smoothness, which collectively guide the model to produce safe and realistic driving strategies [27][30][31].

Group 5: Future Directions
- Future research will focus on multi-agent collaboration and real-time sensor fusion integration to further enhance the model's adaptability in complex environments [49].
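The reward framework is summarized only at a high level (spatial alignment, vehicle dynamics, temporal smoothness), so the sketch below is a generic illustration of how such terms could be combined into a scalar reward for a planned trajectory. The formulas, bounds, and weights are my own assumptions, not those of AutoDrive-R².

```python
# Generic illustration (assumptions, not AutoDrive-R²'s actual reward) of a
# physics-informed trajectory reward with the three ingredient types named above:
# spatial alignment to a reference, dynamic feasibility, and temporal smoothness.
import numpy as np

def trajectory_reward(pred_xy: np.ndarray, ref_xy: np.ndarray, dt: float = 0.5,
                      max_accel: float = 4.0, w=(1.0, 0.5, 0.5)) -> float:
    """
    pred_xy, ref_xy: (T, 2) planned and reference waypoints in meters.
    dt: time between waypoints in seconds. Returns a scalar reward; higher is better.
    """
    # 1) Spatial alignment: penalize average L2 distance to the reference trajectory.
    align = -np.linalg.norm(pred_xy - ref_xy, axis=1).mean()

    # 2) Vehicle dynamics: penalize accelerations beyond a feasibility bound.
    vel = np.diff(pred_xy, axis=0) / dt
    accel = np.diff(vel, axis=0) / dt
    accel_mag = np.linalg.norm(accel, axis=1)
    dynamics = -np.clip(accel_mag - max_accel, 0.0, None).mean()

    # 3) Temporal smoothness: penalize jerk (change in acceleration) for comfort.
    jerk = np.diff(accel, axis=0) / dt
    smooth = -np.linalg.norm(jerk, axis=1).mean() if len(jerk) else 0.0

    return w[0] * align + w[1] * dynamics + w[2] * smooth

# Example: score a slightly noisy 6-waypoint plan against a straight-line reference.
if __name__ == "__main__":
    ref = np.stack([np.linspace(0, 10, 6), np.zeros(6)], axis=1)
    pred = ref + np.random.normal(scale=0.1, size=ref.shape)
    print(round(trajectory_reward(pred, ref), 3))
```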
The Galaxea Team Releases a Large-Scale, High-Quality Open-World Robot Dataset and the G0 Dual-System VLA Model
具身智能之心· 2025-09-03 03:23
Core Insights
- The article presents the Galaxea Open-World Dataset, a large-scale and diverse collection of robot behaviors recorded in real human living and working environments, addressing the scarcity of high-quality open-world robot data and insufficient model generalization capabilities [2][5][6].

Dataset Overview
- The Galaxea Open-World Dataset is the first large-scale robot behavior dataset collected in real-life scenarios, solving issues of existing datasets that are limited to controlled environments and inconsistent robot entities [5][17].
- Data collection was conducted using the Galaxea R1 Lite mobile dual-arm robot, which features 23 degrees of freedom and is equipped with RGB cameras for global scene perception and fine operation sensing [8][6].
- The dataset includes 500 hours of data, 100,000 demonstration trajectories, covering 150 task categories, 1,600 object types, and 58 operational skills, with a 2Hz frequency for detailed sub-task instruction labeling [8][12].

Model Framework
- The G0 dual-system framework couples a "slow thinking" visual-language model (G0-VLM) with a "fast execution" visual-language-action model (G0-VLA), utilizing a three-stage training strategy to achieve complex task planning and precise execution [5][19].
- The training phases include cross-entity pre-training, single-entity pre-training, and task-specific fine-tuning, which are designed to balance general knowledge and specific robot adaptation [21][27].

Performance Evaluation
- The G0-VLA model demonstrated superior performance in benchmark tasks such as desktop organization, microwave operation, bed making, and block building, with G0-VLM achieving an instruction accuracy of 78.2% in bed making and 83.3% in desktop organization [42][47].
- The study found that single-entity pre-training is essential for effective model performance, as cross-entity pre-training can lead to negative transfer due to significant differences between the training and target robot entities [39][46].

Key Findings
- The dataset's design emphasizes real-world adaptability and model training friendliness, ensuring that the collected data reflects the complexities of human environments [6][17].
- The G0 model's architecture is inspired by Kahneman's dual-system theory, where System 2 (slow thinking) is responsible for planning and System 1 (fast execution) handles real-time reactions, allowing for a balance between planning rationality and execution timeliness [19][21].
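The dual-system design pairs a low-frequency planner with a high-frequency controller. The sketch below is a hypothetical illustration of that control pattern (System 2 planning sub-tasks, System 1 executing them); the class names, rates, and interfaces are my own assumptions, not the G0 implementation.

```python
# Hypothetical illustration of a dual-system control loop in the spirit of the
# description above: a slow "System 2" planner (VLM) emits sub-task instructions
# at a low rate while a fast "System 1" policy (VLA) outputs actions every tick.
# All names, rates, and interfaces here are assumptions, not the G0 code.
import time

class SlowPlanner:
    """Stands in for a vision-language model that decomposes the overall task."""
    def plan(self, task: str, image) -> str:
        return f"next sub-task for: {task}"   # e.g. "pick up the cup"

class FastPolicy:
    """Stands in for a vision-language-action model producing joint commands."""
    def act(self, sub_task: str, image) -> list:
        return [0.0] * 7                      # placeholder 7-DoF action

def run_episode(task: str, get_image, send_action,
                control_hz: float = 30.0, replan_every: int = 60, steps: int = 300):
    """Fast control loop with periodic re-planning by the slow system."""
    planner, policy = SlowPlanner(), FastPolicy()
    sub_task = planner.plan(task, get_image())
    for step in range(steps):
        if step > 0 and step % replan_every == 0:      # re-plan at ~0.5 Hz
            sub_task = planner.plan(task, get_image())
        send_action(policy.act(sub_task, get_image()))  # act at ~30 Hz
        time.sleep(1.0 / control_hz)
```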