Vision-Language-Action (VLA) Models
AAAI 2026 Outstanding Paper Award | ReconVLA: A First for the Embodied Intelligence Field
具身智能之心· 2026-01-27 03:00
Core Insights
- The article emphasizes that embodied intelligence, particularly in the context of Vision-Language-Action (VLA) models, is becoming a central issue in AI research, as evidenced by the recognition of the ReconVLA model at AAAI [3][5].

Group 1: ReconVLA Model Overview
- ReconVLA is introduced as a reconstructive Vision-Language-Action model aimed at improving the precision of visual attention in robotic tasks [11][12].
- The model's core idea is to supervise the ability to reconstruct the target region rather than explicitly indicating where to look, thereby sharpening the model's attention to key objects [12][14].
- The model incorporates a dual-branch framework, one branch for action prediction and another for visual reconstruction, which provides implicit supervision through a reconstruction loss [17][18].

Group 2: Performance and Results
- ReconVLA shows significant improvements in success rates across tasks, achieving 95.6% on the ABC→D task and 98.0% on the long-horizon ABCD→D task [23][26].
- On the challenging long-horizon task "stack block," ReconVLA achieved a success rate of 79.5%, outperforming baseline models [27].
- The model demonstrated strong generalization, maintaining over 40% success rates in real-robot experiments with unseen objects [27].

Group 3: Training and Data
- Training involved a large-scale dataset with over 100,000 interaction trajectories and approximately 2 million images, enhancing the model's visual reconstruction and generalization abilities [21][25].
- Pre-training did not rely on action labels, which significantly improved performance in visual reconstruction and implicit grounding [21][31].

Group 4: Implications for Future Research
- The core contribution of ReconVLA is not a more complex architecture but an answer to the fundamental question of whether robots truly understand the world they are observing [32][34].
- Reconstructive implicit supervision is expected to advance embodied intelligence from experience-driven system design to a more robust and scalable paradigm for general intelligence research [34].
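The dual-branch idea described above, an action head trained alongside a reconstruction head whose loss implicitly supervises visual attention, can be sketched as a toy training objective. This is a minimal NumPy illustration under assumed shapes: the linear trunk and branch weights and the `lam` weight are hypothetical stand-ins, not the paper's architecture, which uses a VLM backbone and a lightweight diffusion transformer for reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a shared trunk feeding the two branches.
W_trunk = rng.normal(size=(16, 8))   # stand-in for the VLM backbone
W_act = rng.normal(size=(8, 7))      # action prediction branch (e.g. a 7-DoF action)
W_rec = rng.normal(size=(8, 4))      # visual reconstruction branch (latent tokens)

def forward(x):
    """Shared features, then one output per branch."""
    h = np.maximum(x @ W_trunk, 0.0)
    return h @ W_act, h @ W_rec

def training_loss(x, target_action, gaze_latent, lam=0.1):
    """Action loss plus reconstruction loss on the gaze-region latents;
    the reconstruction term implicitly supervises where the model attends."""
    pred_action, pred_latent = forward(x)
    action_loss = np.mean((pred_action - target_action) ** 2)
    recon_loss = np.mean((pred_latent - gaze_latent) ** 2)
    return action_loss + lam * recon_loss
```

Because the reconstruction target is the task-relevant gaze region, driving `recon_loss` down forces the shared features to encode that region's details, which is the "implicit supervision" the summary refers to.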
AAAI 2026 Outstanding Paper Award | ReconVLA: Embodied Intelligence Research Wins a Top AI Conference Best Paper Award for the First Time
机器之心· 2026-01-26 03:08
Core Insights
- The article emphasizes that embodied intelligence has become a core issue in AI research, as highlighted by the recognition of the ReconVLA model at a top AI conference [2][3].

Group 1: ReconVLA Model Overview
- ReconVLA is a reconstructive Vision-Language-Action model designed to improve the stability and precision of visual attention in robotic tasks [10][11].
- Unlike previous models, ReconVLA does not explicitly output where to look; instead it is supervised on whether it can reconstruct the target region, which ensures the model learns to attend to key objects [10][14].

Group 2: Methodology and Mechanism
- The model consists of two collaborative branches: an action prediction branch that generates action tokens and a visual reconstruction branch that encodes the gaze region into high-fidelity latent tokens [17].
- Reconstruction is performed by a lightweight diffusion transformer; minimizing the reconstruction error forces the model to encode fine-grained semantic and structural information about the target objects [13][18].

Group 3: Training and Data
- A large-scale pre-training dataset comprising over 100,000 interaction trajectories and approximately 2 million images significantly enhances the model's visual reconstruction and implicit grounding [21][23].
- Pre-training does not rely on action labels, which improves generalization across scenes [21].

Group 4: Experimental Results
- ReconVLA achieved a 79.5% success rate on the challenging long-horizon task "stack block," outperforming baseline models [26][32].
- The model led in both short- and long-horizon tasks, with average completion lengths of 3.95 and 4.23 respectively, indicating its effectiveness in complex environments [26][28].

Group 5: Contributions and Future Implications
- The core contribution of ReconVLA lies in probing whether robots truly comprehend the world they observe, providing a more natural and efficient visual alignment mechanism [31].
- The article anticipates that this work will advance embodied intelligence from experience-driven system design to a more robust and scalable paradigm for general intelligence research [33].
REALM: A real2sim Validation Benchmark for Robot Manipulation Tasks
具身智能之心· 2025-12-27 10:03
Core Background and Issues
- Vision-Language-Action (VLA) models enable robots to understand natural-language commands and perform manipulation tasks, but evaluating their generalization remains a key challenge: real-world assessment is costly and poorly repeatable, while existing simulation benchmarks cover limited disturbance types and lack high-fidelity visuals, producing a disconnect between simulated and real performance known as the "reality-simulation gap" [2].
- To address this, a research team from Czech Technical University and the University of Amsterdam developed REALM, a high-fidelity simulation environment and benchmark that establishes a strong correlation between simulated and real performance, enabling large-scale, low-cost evaluation of VLA generalization. Its core breakthroughs are a visually and control-aligned high-fidelity simulator, a multi-dimensional disturbance evaluation scheme, and an empirically validated real-sim performance correlation [2].

Related Work and Differentiating Advantages
- Existing manipulation-generalization benchmarks rely heavily on simulation but have notable limitations: GemBench and VLABench support few disturbance types, particularly behavioral disturbances, and SIMPLER achieves partial control alignment but covers few skills and objects and only a single viewpoint. REALM covers six visual, eight semantic, and seven behavioral disturbances, supports seven skills, ten scenes, and over 3,500 objects, and provides high-fidelity visuals, control alignment, and multi-view support, making it the most comprehensive generalization benchmark to date [3][4].

Benchmark Design Core Elements
1. **Skills and Task Set**: The benchmark is built around seven core manipulation skills: picking, placing, pushing, rotating, stacking, opening, and closing. Skills are defined as general capabilities independent of objects and scenes, while tasks are specific instances of skills applied to particular objects and scenes, organized across two task sets in a modular, extensible framework [5].
2. **Disturbance Design**: To probe generalization, 15 disturbance types are designed across three main categories. REALM-base focuses on eight pick-and-place tasks, while REALM-articulated targets tasks involving articulated objects such as cabinet doors [6][8].
3. **Evaluation Metrics and Control Alignment**: A tiered progression metric replaces binary success rates by decomposing each skill into ordered discrete states, giving a more granular view of model performance. Control alignment is achieved by redesigning the robot controller and fine-tuning 14 physical parameters, significantly improving the consistency between simulated and real trajectories [9].

Real-Sim Alignment and Validation
- The validation confirms that simulation can effectively stand in for real-world evaluation. Testing spanned three VLA models, seven tasks, and five disturbance types over nearly 800 trajectory sets, using the Pearson correlation coefficient, p-values, and Mean Maximum Rank Violation (MMRV). Results show a strong linear correlation between simulated and real task progression, with low MMRV and p < 0.001 across all settings, demonstrating that simulation reliably predicts real-world performance [11].

Key Experimental Results and Findings
1. **Visual Generalization**: Pure visual disturbances significantly affect performance, with average RMSD exceeding 0.12. Blur and lighting have minimal effects, likely thanks to the visual diversity of the DROID training data, whereas viewpoint changes and scene disturbances have the largest impact, indicating that robustness to visual shift remains insufficient even though some visual changes are tolerated [14].
2. **Semantic Generalization**: Despite building on large-scale pre-trained VLMs, models struggle substantially with semantic disturbances. The most damaging disturbances involve world knowledge and human needs, while spatial-relationship understanding performed unexpectedly well [17].
3. **Behavioral Generalization**: Behavioral disturbances, which require adjusting motion strategies, pose the greatest challenge. Models generalize well across skills on the same object but poorly across different objects, especially unseen ones, indicating limited behavioral adaptability [18].
4. **Robustness and Task Completion**: The -FAST model achieved the highest average task progression across all disturbances, leading in success rate on 9 of 10 tasks, while GR00T performed significantly worse with less interpretable disturbance effects. All models took 20-30 seconds on average to complete even simple tasks, with high variance, indicating difficulty completing tasks efficiently and consistently in unknown environments [19].
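The real-sim validation above rests on two quantities: the Pearson correlation between simulated and real task progression, and MMRV. A minimal sketch of how such metrics can be computed over per-policy scores follows; the `mmrv` function assumes a SIMPLER-style reading of the metric (for each policy, the largest real-world performance gap whose ordering the simulation ranks backwards), and REALM's exact formulation may differ.

```python
import numpy as np

def pearson_r(sim, real):
    """Pearson correlation between simulated and real task-progression scores."""
    sim, real = np.asarray(sim, float), np.asarray(real, float)
    return float(np.corrcoef(sim, real)[0, 1])

def mmrv(sim, real):
    """Mean Maximum Rank Violation (simplified, SIMPLER-style reading):
    for each policy i, the largest |real_i - real_j| over policies j whose
    relative ordering the simulation gets backwards, averaged over i."""
    sim, real = np.asarray(sim, float), np.asarray(real, float)
    n = len(sim)
    worst = np.zeros(n)
    for i in range(n):
        for j in range(n):
            # A violation: sim and real disagree on which of i, j is better.
            if (sim[i] < sim[j]) != (real[i] < real[j]):
                worst[i] = max(worst[i], abs(real[i] - real[j]))
    return float(worst.mean())
```

With perfectly consistent rankings, `mmrv` is 0 and `pearson_r` is close to 1, which is the regime the REALM validation reports (low MMRV, p < 0.001).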
The Field's First RL+VLA Survey: How Is Reinforcement Learning Pushing VLA Toward the Real World?
具身智能之心· 2025-12-19 00:05
Core Insights
- The article surveys the integration of Reinforcement Learning (RL) with Vision-Language-Action (VLA) models, emphasizing its role in enhancing the adaptability and robustness of robotic systems in real-world scenarios [2][34].

RL-VLA Architecture
- RL transforms VLA from "demonstration reproduction" to "result-oriented" closed-loop decision-making through reward-driven policy updates [4].
- Challenges include discrete action tokens complicating dexterous manipulation and the risk of action-distribution distortion in generative VLA [6].

Reward Design
- RL-VLA employs intrinsic rewards to encourage exploration and extrinsic rewards for task alignment, addressing the reward sparsity inherited from imitation learning [8][9].
- Physics-based simulators are highlighted, although they demand significant manual effort and computational resources [9].

Training Paradigms
- Three RL-VLA training paradigms are identified: online RL, offline RL, and test-time RL, each with distinct challenges such as non-stationary dynamics and computational cost [11][16].
- Empirical studies show that RL fine-tuning significantly improves generalization in out-of-distribution (OOD) scenarios compared with standard supervised fine-tuning [14].

Real-World Deployment
- Real-world deployment of RL-VLA models faces challenges such as sample efficiency and safety, with strategies including sim-to-real transfer and human-in-the-loop RL [21][24].
- The article stresses safe exploration and the integration of high-level semantic reasoning with low-level control strategies [28][29].

Open Challenges & Future Directions
- Key challenges include developing robust memory-retrieval mechanisms, improving sample efficiency, and ensuring reliable physical operation through risk-aware strategies [34].
- The evolution of RL is pushing VLA from high-performance imitation toward autonomous exploration and decision-making [34].
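The intrinsic-plus-extrinsic reward design discussed above can be illustrated with a generic count-based exploration bonus layered on the task reward. This is one common recipe, not the survey's specific formulation; the `beta` weight and the inverse-square-root decay are assumed choices.

```python
def combined_reward(task_reward, visit_counts, state, beta=0.05):
    """Extrinsic task reward plus a count-based intrinsic exploration bonus
    that decays as a state becomes familiar. `beta` is an assumed weight."""
    visit_counts[state] = visit_counts.get(state, 0) + 1
    intrinsic = beta / visit_counts[state] ** 0.5   # shrinks with revisits
    return task_reward + intrinsic
```

Early visits to a state earn the full bonus; repeated visits earn progressively less, nudging the policy toward exploration without overriding the sparse task signal.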
ActDistill: Tongji University Proposes an Action-Guided Distillation Framework That Speeds Up Robot Inference by 1.67x
具身智能之心· 2025-11-26 00:05
Group 1
- Deploying Vision-Language-Action (VLA) models in real-time or resource-constrained robotic systems is hampered by high computational cost and inference latency [2][3].
- Existing efficient-VLA strategies often prioritize visual-language optimizations, causing loss of key information and incoherent action semantics [2][3].

Group 2
- The proposed ActDistill framework addresses these issues with an action-prediction-oriented distillation scheme that balances efficiency and fidelity while preserving action-prediction accuracy [3][4].
- ActDistill consists of two core modules, Graph-Structured Encapsulation and Action-Guided Self-Derived Distillation, which together model action semantics and guide knowledge distillation [4][8].

Group 3
- The Graph-Structured Encapsulation module explicitly models the hierarchical evolution of action semantics and separates task-related interactions from redundant background signals [6].
- The Action-Guided Self-Derived Distillation module uses a lightweight student model that mirrors the teacher's structure at reduced depth, with dynamic routing that adaptively predicts layer-gating scores [8][11].

Group 4
- Experiments show ActDistill reaches a 73.95% success rate with a 1.59x speed-up and a 50.5% reduction in computational load compared with full models [9][12].
- The framework delivers significant efficiency and performance gains across benchmarks including LIBERO and SIMPLER [12][13].

Group 5
- Ablations underscore the importance of the Graph-Structured Encapsulation module: replacing it with a simpler architecture causes a significant performance drop [13].
- The framework maintains trajectory stability and focuses attention on action-relevant regions, demonstrating its effectiveness in practical applications [16][17].

Group 6
- ActDistill represents a novel action-centered approach to compressing VLA models, cutting computational load by over 50% while maintaining task success rates [24].
- Future directions include teacher-free or reinforcement-learning-guided variants and integrating long-horizon temporal reasoning into the routing mechanism [24].
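The action-guided distillation with dynamic layer gating described above can be sketched as a weighted sum of an action-imitation term and gated per-layer feature alignment, where a routing gate decides how much each layer's features matter. All names, shapes, and the `alpha` weighting here are illustrative assumptions, not ActDistill's actual losses.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def distill_loss(student_action, teacher_action,
                 student_feats, teacher_feats, gate_logits, alpha=0.5):
    """Action imitation plus feature alignment weighted by a dynamic-routing
    gate over layers (a sketch of action-guided distillation)."""
    gates = softmax(np.asarray(gate_logits, float))   # per-layer gating scores
    feat_align = sum(g * np.mean((s - t) ** 2)
                     for g, s, t in zip(gates, student_feats, teacher_feats))
    action_loss = np.mean((np.asarray(student_action)
                           - np.asarray(teacher_action)) ** 2)
    return float(action_loss + alpha * feat_align)
```

Because the gate is learned from action-relevant signals, layers that matter for action prediction receive larger alignment weight, which is the intuition behind routing the distillation by action semantics.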
Three Months to Master VLA, VLA + Tactile, VLA + RL, Embodied World Models, and More!
具身智能之心· 2025-08-22 00:04
Core Viewpoint
- The exploration of Artificial General Intelligence (AGI) is increasingly focused on embodied intelligence, which emphasizes how intelligent agents interact with and adapt to physical environments: perceiving, understanding tasks, executing actions, and learning from feedback [1].

Industry Analysis
- In the past two years, numerous star teams in embodied intelligence have founded valuable companies such as Xinghaitu, Galaxy General, and Zhujidongli, advancing the technology [3].
- Major domestic companies such as Huawei, JD, Tencent, Ant Group, and Xiaomi are actively investing and collaborating to build a robust ecosystem, while international players such as Tesla and investment institutions back companies like Wayve and Apptronik in autonomous driving and warehouse robotics [5].

Technological Evolution
- The development of embodied intelligence has progressed through several stages:
  - The first stage focused on grasp pose detection, which struggled with complex tasks due to the lack of context modeling [6].
  - The second stage used behavior cloning, letting robots learn from expert demonstrations but exposing weak generalization and poor performance in multi-target scenarios [6].
  - The third stage introduced Diffusion Policy methods, improving stability and generalization by modeling action sequences, followed by the Vision-Language-Action (VLA) phase, which integrates visual perception, language understanding, and action generation [7][8].
  - The fourth stage, starting in 2025, aims to combine VLA models with reinforcement learning, world models, and tactile sensing to overcome current limitations [8].

Product and Market Development
- These technologies have yielded products including humanoid robots, robotic arms, and quadrupedal robots, serving manufacturing, home services, dining, and medical rehabilitation [9].
- As the industry shifts from research to deployment, demand for engineering and systems capabilities is rising, requiring stronger skills for training and simulating policies on platforms such as Mujoco, IsaacGym, and Pybullet [23].

Educational Initiatives
- A comprehensive curriculum covers the full embodied "brain + cerebellum" technology stack, with practical applications and real-world projects for both beginners and advanced learners [10][20].
Tutorials on VLA, VLA + Tactile, VLA + RL, Embodied World Models, and More Are Here!
具身智能之心· 2025-08-18 00:07
Core Viewpoint
- The exploration of Artificial General Intelligence (AGI) is increasingly focused on embodied intelligence, which emphasizes how intelligent agents interact with and adapt to physical environments: perceiving, understanding tasks, executing actions, and learning from feedback [1].

Industry Analysis
- In the past two years, numerous star teams in embodied intelligence have founded valuable companies such as Xinghaitu, Galaxy General, and Zhujidongli, advancing the technology [3].
- Major domestic companies such as Huawei, JD.com, Tencent, Ant Group, and Xiaomi are actively investing and collaborating to build a robust ecosystem, while international players such as Tesla and investment firms back companies like Wayve and Apptronik in autonomous driving and warehouse robotics [5].

Technological Evolution
- The development of embodied intelligence has progressed through several stages:
  - The first stage focused on grasp pose detection, which struggled with complex tasks due to the lack of context modeling [6].
  - The second stage used behavior cloning, letting robots learn from expert demonstrations but exposing weak generalization and poor performance in multi-target scenarios [6].
  - The third stage introduced Diffusion Policy methods, improving stability and generalization by modeling action sequences, followed by the emergence of Vision-Language-Action (VLA) models that integrate visual perception, language understanding, and action generation [7].
  - The fourth stage, starting in 2025, aims to combine VLA models with reinforcement learning, world models, and tactile sensing to overcome current limitations [8].

Product and Market Development
- These technologies have yielded products including humanoid robots, robotic arms, and quadrupedal robots, serving manufacturing, home services, dining, and medical rehabilitation [9].
- As the industry shifts from research to deployment, demand for engineering and systems capabilities is rising, requiring training on platforms such as Mujoco, IsaacGym, and Pybullet for policy training and simulation testing [23].

Educational Initiatives
- A comprehensive curriculum covers the full embodied "brain + cerebellum" technology stack, including practical applications and advanced topics, for both beginners and those deepening their knowledge [10][20].
China's First Full-Stack Hands-On Tutorial on Embodied Brain + Cerebellum Algorithms
具身智能之心· 2025-08-07 02:38
Core Insights
- The exploration toward Artificial General Intelligence (AGI) highlights embodied intelligence as a key direction, focusing on how intelligent agents interact with and adapt to physical environments [1].
- The field's development is marked by the evolution of technology from low-level perception to high-level task understanding and generalization [6][9].

Industry Analysis
- In the past two years, numerous star teams have founded valuable companies such as Xinghaitu, Galaxy General, and Zhujidongli, moving embodied intelligence from laboratories to commercial and industrial applications [3].
- Major domestic companies such as Huawei, JD, Tencent, Ant Group, and Xiaomi are investing and collaborating to build an ecosystem, while international players such as Tesla and investment firms support advances in autonomous driving and warehouse robotics [5].

Technological Evolution
- The first stage focused on grasp pose detection, which struggled with complex tasks due to the lack of context modeling [6].
- The second stage used behavior cloning, letting robots learn from expert demonstrations but exposing weak generalization and poor performance in multi-target scenarios [6].
- The third stage introduced Diffusion Policy methods, improving stability and generalization through sequence modeling [7].
- The fourth stage, emerging in 2025, explores integrating VLA models with reinforcement learning and tactile sensing to overcome current limitations [8].

Product Development and Market Growth
- These advances have produced products including humanoid robots, robotic arms, and quadrupedal robots, serving manufacturing, home services, and healthcare [9].
- As the industry shifts from research to deployment, demand for engineering and systems capabilities is rising, necessitating stronger engineering skills [13].

Educational Initiatives
- A comprehensive curriculum helps learners master the full spectrum of embodied-intelligence algorithms, from basic tasks to advanced models like VLA and its integrations [9][13].
Li Auto's Latest DriveAction: A Benchmark for Exploring Human-Like Driving Decisions in VLA Models
自动驾驶之心· 2025-06-21 13:15
Core Insights
- The article introduces the DriveAction benchmark, designed specifically for Vision-Language-Action (VLA) models in autonomous driving, addressing limitations in existing datasets and evaluation frameworks [2][3][20].

Group 1: Research Background and Issues
- VLA models open new opportunities for autonomous driving systems, but current benchmark datasets lack scenario diversity, reliable action-level annotations, and evaluation protocols aligned with human preferences [2].
- Existing benchmarks rely primarily on open-source data, limiting their coverage of complex real-world driving scenarios and creating a disconnect between evaluation results and actual deployment risk [3].

Group 2: DriveAction Benchmark Innovations
- DriveAction is the first action-driven benchmark designed specifically for VLA models, with three core innovations:
  1. Comprehensive coverage of diverse driving scenarios, sourced from real-world data collected by production autonomous vehicles across 148 cities in China [5].
  2. Realistic action annotations derived from users' real-time driving operations, ensuring accurate capture of driver intentions [6].
  3. A tree-structured, action-driven evaluation framework integrating visual and language tasks to assess model decision-making in realistic contexts [7].

Group 3: Evaluation Results
- Models perform best in the full pipeline mode (V-L-A) and worst in the no-information mode (A); average accuracy drops by 3.3% without visual input and by 4.1% without language input [14].
- Per-task evaluation shows models excel at dynamic- and static-obstacle tasks but struggle with navigation and traffic-light tasks, highlighting areas for improvement [16][17].

Group 4: Significance and Value of DriveAction
- The introduction of DriveAction marks a significant advance in evaluating autonomous driving systems, providing a more comprehensive and realistic assessment tool that can identify model bottlenecks and guide system optimization [20].
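The modality-ablation readout above can be reproduced with a small helper that reports each evaluation mode's accuracy drop relative to the full V-L-A mode. The mode labels `L-A` (no vision) and `V-A` (no language) and the 80.0% base accuracy in the usage example are illustrative assumptions; only the 3.3- and 4.1-point drops come from the article's reported averages.

```python
def modality_ablation(acc_by_mode, full_mode="V-L-A"):
    """Accuracy drop (percentage points) of each evaluation mode relative
    to the full V-L-A mode, mirroring an ablation-style readout."""
    full = acc_by_mode[full_mode]
    return {mode: round(full - acc, 1) for mode, acc in acc_by_mode.items()}

# Hypothetical 80.0% base accuracy; drops of 3.3 and 4.1 points match the
# article's reported averages for removing visual and language input.
drops = modality_ablation({"V-L-A": 80.0, "L-A": 76.7, "V-A": 75.9})
```

Reading the result, the larger drop without language input than without visual input suggests the evaluated models lean more heavily on the language channel for this benchmark's decisions.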