Vision-Language-Action (VLA) Models
REALM: A real2sim Validation Benchmark for Robot Manipulation Tasks
具身智能之心· 2025-12-27 10:03
Author: Jai Bardhan et al. | Editor: 具身智能之心

Core Background and Problem
Vision-Language-Action (VLA) models let robots understand natural-language instructions and carry out manipulation tasks, but evaluating their generalization has remained a key challenge: real-world evaluation is costly and poorly reproducible, while existing simulation benchmarks have clear flaws, including limited perturbation types and a lack of high-fidelity visuals and realistic robot-control alignment, which decouples simulated performance from real-world performance (the "real-to-sim gap"). To address this, a research team from the Czech Technical University and the University of Amsterdam built REALM, a high-fidelity simulation environment and benchmark whose core goal is to establish a strong correlation between simulated and real-world performance, enabling large-scale, low-cost evaluation of VLA generalization. Its core contributions are threefold: a simulation environment with high-fidelity visuals and aligned control, a systematic evaluation scheme covering multi-dimensional perturbations, and an empirically validated real-sim performance correlation.

Related Work and Differentiating Advantages
Existing robot-manipulation generalization benchmarks mostly rely on simulation but have notable limitations: GemBench, ...
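The real-sim correlation that REALM sets as its core goal can be quantified directly. Below is a minimal, hypothetical sketch: the per-policy success rates are illustrative placeholders, not numbers from the paper, and using Pearson's r is an assumption about how such a correlation could be measured.

```python
# Hypothetical sketch: measuring how well simulated evaluation predicts
# real-world results. Success rates are made-up illustrative values.
from scipy.stats import pearsonr

# Per-policy success rates, one entry per evaluated VLA policy.
sim_success  = [0.82, 0.61, 0.45, 0.73, 0.30]   # measured in simulation
real_success = [0.78, 0.55, 0.40, 0.70, 0.25]   # measured on the real robot

r, p_value = pearsonr(sim_success, real_success)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
# A high r means simulated scores track real-world scores, which is what
# justifies replacing expensive real-robot trials with simulated ones.
```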
The First RL+VLA Survey in the Field: How Does Reinforcement Learning Push VLA Into the Real World?
具身智能之心· 2025-12-19 00:05
Author: Haoyuan Deng et al. | Editor: 具身智能之心

Vision-Language-Action (VLA) models fuse vision, language, and action to give robots strong zero-shot and cross-task generalization. But VLAs that rely on imitation learning alone remain brittle in real-world out-of-distribution (OOD) scenarios, lacking failure recovery, autonomous exploration, and closed-loop error correction. Reinforcement learning (RL) is emerging as the key bridge between VLA pre-training and real-world deployment.

Jointly produced by Nanyang Technological University, Beijing University of Posts and Telecommunications, and Tsinghua University, this survey systematically reviews the core methods and challenges of RL-VLA across the full "learn, optimize, deploy" lifecycle, and builds a complete technical picture along four dimensions: architecture, training paradigms, real-world deployment, and evaluation.

I. RL-VLA Architecture: From Open-Loop Inference to Closed-Loop Optimization
Through reward-driven policy updates, RL shifts VLAs from "reproducing demonstrations" to "outcome-oriented" closed-loop decision making: Action modeling: A ... Paper link (updated monthly): https://doi.org/10.362 ...
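To make "reward-driven policy updates" concrete, here is a minimal REINFORCE-style sketch. The `policy.sample(obs, instruction)` interface and the gym-style `env` are assumed stand-ins, not any specific library's API, and the survey itself covers far more sophisticated training paradigms (PPO, offline RL, and so on).

```python
import torch

def reinforce_finetune(policy, env, instruction, num_episodes=100, lr=1e-5):
    """Sketch: RL fine-tuning of an imitation-pretrained VLA policy.
    Assumed interfaces: policy.sample(obs, instr) -> (action, log_prob tensor),
    gym-style env.reset()/env.step(action). Undiscounted returns for brevity."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(num_episodes):
        obs, done = env.reset(), False
        log_probs, rewards = [], []
        while not done:
            # Closed loop: the policy re-observes the world at every step.
            action, log_prob = policy.sample(obs, instruction)
            obs, reward, done, _ = env.step(action)
            log_probs.append(log_prob)
            rewards.append(float(reward))
        # Outcome-oriented update: each action is weighted by the return
        # that followed it, rather than by similarity to a demonstration.
        returns = torch.cumsum(torch.tensor(rewards[::-1]), 0).flip(0)
        loss = -(torch.stack(log_probs) * returns).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```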
ActDistill: Tongji University Proposes an Action-Guided Distillation Framework That Speeds Up Robot Inference 1.67x
具身智能之心· 2025-11-26 00:05
Group 1
- The article discusses the challenges of deploying Vision-Language-Action (VLA) models in real-time or resource-constrained robotic systems due to high computational costs and inference delays [2][3].
- Existing efficient VLA strategies often prioritize visual-language model optimizations, leading to key information loss and incoherent action semantics [2][3].

Group 2
- The proposed ActDistill framework aims to address these issues by providing an action-prediction-oriented distillation framework that balances efficiency and fidelity while preserving action prediction accuracy [3][4].
- ActDistill consists of two core modules: Graph-Structured Encapsulation and Action-Guided Self-Derived Distillation, which work together to model action semantics and guide knowledge distillation [4][8].

Group 3
- The Graph-Structured Encapsulation module explicitly models the hierarchical evolution of action semantics and separates task-related interactions from redundant background signals [6].
- The Action-Guided Self-Derived Distillation module utilizes a lightweight student model that aligns with the teacher model's structure while reducing depth, incorporating dynamic routing to adaptively predict layer gating scores (a sketch of this objective follows this summary) [8][11].

Group 4
- Experimental results show that ActDistill achieves a success rate of 73.95% with a 1.59x speed-up and a 50.5% reduction in computational load compared to full models [9][12].
- The framework demonstrates significant improvements in efficiency and performance across various benchmarks, including LIBERO and SIMPLER [12][13].

Group 5
- The article highlights the importance of the Graph-Structured Encapsulation module, noting that replacing it with a simpler architecture led to a significant drop in performance [13].
- The framework's ability to maintain trajectory stability and focus attention on action-relevant areas is emphasized, showcasing its effectiveness in practical applications [16][17].

Group 6
- ActDistill represents a novel approach to action-centered compression of VLA models, achieving over 50% reduction in computational load while maintaining task success rates [24].
- Future directions include exploring teacher-free or reinforcement-learning-guided variants and integrating long-horizon temporal reasoning into the routing mechanism for enhanced adaptability [24].
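The interplay of teacher alignment and layer gating described in Groups 2 and 3 can be sketched as a single distillation objective. The layer pairing, sigmoid gating, and loss weighting below are assumptions for illustration, not ActDistill's actual formulation:

```python
import torch
import torch.nn.functional as F

def action_guided_distill_loss(teacher_feats, student_feats, gate_logits,
                               student_actions, teacher_actions, alpha=0.5):
    """Illustrative sketch of an action-guided distillation objective.
    teacher_feats / student_feats: matched lists of per-layer features [B, D].
    gate_logits: router outputs, one score per student layer."""
    gates = torch.sigmoid(gate_logits)  # dynamic routing: per-layer gating scores
    # Feature alignment: each student layer imitates its teacher counterpart,
    # weighted by how strongly the router keeps that layer active.
    align = sum(g * F.mse_loss(s, t.detach())
                for g, s, t in zip(gates, student_feats, teacher_feats))
    # Action-level supervision: preserve the teacher's action prediction,
    # which is what keeps compression from breaking action semantics.
    action = F.mse_loss(student_actions, teacher_actions.detach())
    return alpha * align + (1 - alpha) * action
```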
3 Months to Master VLA, VLA+Tactile, VLA+RL, Embodied World Models, and More!
具身智能之心· 2025-08-22 00:04
Core Viewpoint
- The exploration of Artificial General Intelligence (AGI) is increasingly focusing on embodied intelligence, which emphasizes the interaction and adaptation of intelligent agents within physical environments, enabling them to perceive, understand tasks, execute actions, and learn from feedback [1].

Industry Analysis
- In the past two years, numerous star teams in the field of embodied intelligence have emerged, establishing valuable companies such as Xinghaitu, Galaxy General, and Zhujidongli, which are advancing the technology of embodied intelligence [3].
- Major domestic companies like Huawei, JD, Tencent, Ant Group, and Xiaomi are actively investing and collaborating to build a robust ecosystem for embodied intelligence, while international firms like Tesla and investment institutions are supporting companies like Wayve and Apptronik in the development of autonomous driving and warehouse robots [5].

Technological Evolution
- The development of embodied intelligence has progressed through several stages:
  - The first stage focused on grasp pose detection, which struggled with complex tasks due to a lack of context modeling [6].
  - The second stage involved behavior cloning, allowing robots to learn from expert demonstrations but revealing weaknesses in generalization and performance in multi-target scenarios (a minimal sketch of this stage follows this summary) [6].
  - The third stage introduced Diffusion Policy methods, enhancing stability and generalization by modeling action sequences, followed by the Vision-Language-Action (VLA) model phase, which integrates visual perception, language understanding, and action generation [7][8].
  - The fourth stage, starting in 2025, aims to integrate VLA models with reinforcement learning, world models, and tactile sensing to overcome current limitations [8].

Product and Market Development
- The evolution of embodied intelligence technologies has led to the emergence of various products, including humanoid robots, robotic arms, and quadrupedal robots, serving industries such as manufacturing, home services, dining, and medical rehabilitation [9].
- The demand for engineering and system capabilities is increasing as the industry shifts from research to deployment, necessitating higher engineering skills for training and simulating strategies on platforms like Mujoco, IsaacGym, and Pybullet [23].

Educational Initiatives
- A comprehensive curriculum has been developed to cover the entire technology route of embodied "brain + cerebellum," including practical applications and real-world projects, aimed at both beginners and advanced learners [10][20].
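The behavior-cloning stage mentioned above reduces to supervised regression on expert state-action pairs. Here is a minimal hypothetical sketch; the `demos` iterable and the policy interface are assumptions, not any course material:

```python
import torch
import torch.nn as nn

def behavior_cloning(policy: nn.Module, demos, epochs=10, lr=1e-4):
    """Minimal behavior-cloning loop: regress expert actions from observations.
    `demos` is assumed to yield (observation, expert_action) tensor batches."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, expert_action in demos:
            loss = nn.functional.mse_loss(policy(obs), expert_action)  # imitate the demo
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```

Because the policy only ever sees states the expert visited, small prediction errors push it into unfamiliar states at deployment time, which is exactly the generalization weakness the later stages (Diffusion Policy, VLA, and RL integration) set out to fix.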
Tutorials on VLA, VLA+Tactile, VLA+RL, Embodied World Models, and More Are Here!
具身智能之心· 2025-08-18 00:07
Core Viewpoint
- The exploration of Artificial General Intelligence (AGI) is increasingly focusing on embodied intelligence, which emphasizes the interaction and adaptation of intelligent agents within physical environments, enabling them to perceive, understand tasks, execute actions, and learn from feedback [1].

Industry Analysis
- In the past two years, numerous star teams in the field of embodied intelligence have emerged, leading to the establishment of valuable companies such as Xinghaitu, Galaxy General, and Zhujidongli, which are advancing the technology of embodied intelligence [3].
- Major domestic companies like Huawei, JD.com, Tencent, Ant Group, and Xiaomi are actively investing and collaborating to build a robust ecosystem for embodied intelligence, while international players like Tesla and investment firms are supporting companies like Wayve and Apptronik in the development of autonomous driving and warehouse robots [5].

Technological Evolution
- The development of embodied intelligence has progressed through several stages:
  - The first stage focused on grasp pose detection, which struggled with complex tasks due to a lack of context modeling [6].
  - The second stage involved behavior cloning, allowing robots to learn from expert demonstrations but revealing weaknesses in generalization and performance in multi-target scenarios [6].
  - The third stage introduced Diffusion Policy methods, enhancing stability and generalization by modeling action sequences, followed by the emergence of Vision-Language-Action (VLA) models that integrate visual perception, language understanding, and action generation [7].
  - The fourth stage, starting in 2025, aims to integrate VLA models with reinforcement learning, world models, and tactile sensing to overcome current limitations [8].

Product and Market Development
- The evolution of embodied intelligence technologies has led to the emergence of various products, including humanoid robots, robotic arms, and quadrupedal robots, serving industries such as manufacturing, home services, dining, and medical rehabilitation [9].
- The demand for engineering and system capabilities is increasing as the industry shifts from research to deployment, necessitating training in platforms like Mujoco, IsaacGym, and Pybullet for strategy training and simulation testing [23].

Educational Initiatives
- A comprehensive curriculum has been developed to cover the entire technology route of embodied "brain + cerebellum," including practical applications and advanced topics, aimed at both beginners and those seeking to deepen their knowledge [10][20].
China's First Full-Stack Hands-On Tutorial on Embodied "Brain + Cerebellum" Algorithms
具身智能之心· 2025-08-07 02:38
Core Insights
- The exploration towards Artificial General Intelligence (AGI) highlights embodied intelligence as a key direction, focusing on the interaction and adaptation of intelligent agents within physical environments [1].
- The development of embodied intelligence is marked by the evolution of technology from low-level perception to high-level task understanding and generalization [6][9].

Industry Analysis
- In the past two years, numerous star teams in the field of embodied intelligence have emerged, establishing valuable companies such as Xinghaitu, Galaxy General, and Zhujidongli, transitioning from laboratories to commercial and industrial applications [3].
- Major domestic companies like Huawei, JD, Tencent, Ant Group, and Xiaomi are actively investing and collaborating to build an ecosystem for embodied intelligence, while international players like Tesla and investment firms support advancements in autonomous driving and warehouse robotics [5].

Technological Evolution
- The evolution of embodied intelligence technology has progressed through several stages:
  - The first stage focused on grasp pose detection, which struggled with complex tasks due to a lack of context modeling [6].
  - The second stage involved behavior cloning, allowing robots to learn from expert demonstrations but revealing weaknesses in generalization and performance in multi-target scenarios [6].
  - The third stage introduced Diffusion Policy methods, enhancing stability and generalization through sequence modeling [7].
  - The fourth stage, emerging in 2025, explores the integration of VLA models with reinforcement learning and tactile sensing to overcome current limitations [8].

Product Development and Market Growth
- The advancements in embodied intelligence have led to the development of various products, including humanoid robots, robotic arms, and quadrupedal robots, serving industries such as manufacturing, home services, and healthcare [9].
- The demand for engineering and system capabilities is increasing as the industry shifts from research to deployment, necessitating higher engineering skills [13].

Educational Initiatives
- A comprehensive curriculum has been developed to assist learners in mastering the full spectrum of embodied intelligence algorithms, covering topics from basic tasks to advanced models like VLA and its integrations [9][13].
Li Auto's Latest DriveAction: A Benchmark for Exploring Human-Like Driving Decisions in VLA Models
自动驾驶之心· 2025-06-21 13:15
Core Insights
- The article discusses the introduction of the DriveAction benchmark, specifically designed for Vision-Language-Action (VLA) models in autonomous driving, addressing existing limitations in current datasets and evaluation frameworks [2][3][20].

Group 1: Research Background and Issues
- The development of VLA models presents new opportunities for autonomous driving systems, but current benchmark datasets lack diversity in scenarios, reliable action-level annotations, and evaluation protocols aligned with human preferences [2].
- Existing benchmarks primarily rely on open-source data, which limits their ability to cover complex real-world driving scenarios, leading to a disconnect between evaluation results and actual deployment risks [3].

Group 2: DriveAction Benchmark Innovations
- DriveAction is the first action-driven benchmark specifically designed for VLA models, featuring three core innovations:
  1. Comprehensive coverage of diverse driving scenarios sourced from real-world data collected by production autonomous vehicles across 148 cities in China [5].
  2. Realistic action annotations derived from users' real-time driving operations, ensuring accurate capture of driver intentions [6].
  3. A tree-structured evaluation framework based on action-driven dynamics, integrating visual and language tasks to assess model decision-making in realistic contexts [7].

Group 3: Evaluation Results
- Experimental results indicate that models perform best in the full-process mode (V-L-A) and worst in the no-information mode (A), with average accuracy dropping by 3.3% without visual input and 4.1% without language input (a sketch of this comparison follows this summary) [14].
- Specific task evaluations reveal that models excel in dynamic and static obstacle tasks but struggle with navigation and traffic light tasks, highlighting areas for improvement [16][17].

Group 4: Significance and Value of DriveAction
- The introduction of the DriveAction benchmark marks a significant advancement in the evaluation of autonomous driving systems, providing a more comprehensive and realistic assessment tool that can help identify model bottlenecks and guide system optimization [20].
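The mode comparison in Group 3 amounts to re-running the same model with inputs selectively withheld and differencing the accuracies. A hypothetical sketch follows; `model.predict_action` and the sample fields are assumed interfaces, not DriveAction's actual API (the article reports average drops of 3.3% and 4.1%):

```python
def evaluate(model, samples, use_vision=True, use_language=True):
    """Accuracy of predicted driving actions under selectively withheld inputs."""
    correct = 0
    for s in samples:
        image = s.image if use_vision else None         # withheld in language-only modes
        lang = s.instruction if use_language else None  # withheld in vision-only modes
        correct += (model.predict_action(image, lang) == s.ground_truth_action)
    return correct / len(samples)

# Given a `model` and `samples` with the assumed interface:
# full    = evaluate(model, samples)                        # full-process V-L-A mode
# no_vis  = evaluate(model, samples, use_vision=False)
# no_lang = evaluate(model, samples, use_language=False)
# print(f"accuracy drop without vision:   {full - no_vis:.1%}")
# print(f"accuracy drop without language: {full - no_lang:.1%}")
```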