Vision-Language-Action Models (VLA)
Want to get into VLA but don't know where to start? NTU & Sun Yat-sen University open-source the "ultimate recipe": from the base model to frequency-domain modeling, every step is backed by experiments
量子位· 2026-03-02 16:00
Contributed by the VLANeXt team to 量子位 | WeChat official account QbitAI. VLA models come in endless varieties, so which design choices actually matter? The latest study from MMLab@NTU and Sun Yat-sen University delivers an ultimate "recipe" that takes you from beginner to expert: VLANeXt.

Rather than simply proposing yet another model, the work systematically dissects the VLA design space along 12 key dimensions, from basic components to perception factors to additional perspectives on action modeling, with every step backed by solid experiments.

The resulting model, VLANeXt, comprehensively outperforms all manner of SOTA methods, including 7B-parameter models, on the standard LIBERO benchmark and the LIBERO-plus generalization test. Under unseen perturbations such as lighting, background, and camera pose, its success rate jumps by 10% over the previous best method. Whether you are a newcomer to embodied intelligence or a veteran looking to further optimize your models, this "recipe" can help you find answers.

Background: leaving the "primordial soup" of VLA
With the rise of large foundation models, Vision-Language-Action (VLA) models have shown enormous potential: by inheriting rich visual understanding and language grounding, they offer a scalable path toward general-purpose robot policy learning. However, the current VLA research landscape remains in a "primordial soup" stage, full of free-wheeling exploration and design but lacking clear ...
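As background for the frequency-domain design dimension the headline mentions, here is a generic sketch of representing a smooth action chunk by its low-frequency coefficients, a technique used in several recent action tokenizers. It is illustrative only: the chunk length, DoF count, and cutoff `K` are assumptions, and the excerpt does not specify how VLANeXt itself applies frequency-domain modeling.

```python
# Generic frequency-domain action-chunk compression; illustrative only,
# not necessarily what VLANeXt does.
import numpy as np

rng = np.random.default_rng(42)

# A smooth 32-step, 1-DoF action chunk plus a little sensor noise.
t = np.linspace(0, 1, 32)
actions = np.sin(2 * np.pi * t) + 0.05 * rng.normal(size=32)

# Keep only the lowest K frequency components of the chunk.
K = 4
spectrum = np.fft.rfft(actions)
spectrum[K:] = 0.0
reconstruction = np.fft.irfft(spectrum, n=32)

err = np.abs(actions - reconstruction).max()
print(f"kept {K}/{len(np.fft.rfft(actions))} coefficients, "
      f"max reconstruction error {err:.3f}")
```

The appeal of such a representation is that a handful of coefficients captures a whole smooth action chunk, shortening the sequence the policy must predict.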
An in-depth survey that comprehensively maps out VLA's 20 major challenges, with research directions clearly laid out; updated weekly to help you stay on top of the latest breakthroughs!
AI科技大本营· 2025-12-25 01:18
Core Insights
- The article discusses the emergence of Vision-Language-Action (VLA) systems, which are transitioning from demonstrations to real-world applications, highlighting the need for a structured learning path for newcomers and practitioners in the field [1][3][4].

Group 1: Overview of VLA
- Embodied AI is identified as a rapidly evolving frontier in AI and robotics, with a focus on making machines capable of seeing, understanding, and acting [3][4].
- The article emphasizes the structural confusion within the field due to the rapid growth of models and datasets, making it challenging for newcomers to identify where to start and for existing practitioners to determine how to systematically enhance VLA capabilities [3][4].

Group 2: Contributions of the Review
- The review paper titled "An Anatomy of Vision-Language-Action Models" aims to provide a clear and systematic reference framework for the increasingly complex VLA research area [4][6].
- It establishes a continuously evolving reference system for tracking the latest developments in VLA research, organized by modules, milestones, and challenges [5][9].

Group 3: Learning Pathways
- For newcomers, the review suggests first establishing an overall understanding of the VLA field before delving deeper into specific areas [13][14].
- For practitioners, the review serves as an efficient roadmap for identifying areas for capability enhancement, helping to clarify research questions and innovation points [15][16].

Group 4: Structural Analysis
- The review begins with a breakdown of basic modules in VLA systems, covering perception, representation, decision-making, and control, to create a common technical language [18][19] (a minimal pipeline sketch of these four modules follows this summary).
- It then reviews key milestones along a timeline to illustrate the evolution of VLA from early concept validation to a general framework for real-world deployment [20][21].

Group 5: Key Challenges
- The review identifies five core challenges that VLA systems face, including representation, execution, generalization, safety, and data evaluation, framing these challenges as the main focus of the analysis [25][26][30][33][39].
- Each challenge is linked to the overall capability of VLA systems, emphasizing the need for a clear understanding of problem structures to overcome existing bottlenecks [26][30][34][36].

Group 6: Future Directions
- The review outlines potential future directions for VLA, such as developing native multimodal architectures and integrating physical and semantic causal world models [42][43].
- It envisions the next generation of embodied agents that not only perform tasks but do so reliably and controllably in real-world settings [44].
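To make the four-module decomposition concrete, here is a minimal sketch of a perception / representation / decision-making / control pipeline. All class and method names (`VLAPipeline`, `perceive`, etc.) are illustrative assumptions, and the stand-in computations are placeholders for learned encoders and policies, not anything from the survey.

```python
# Hypothetical sketch of the perception/representation/decision/control
# decomposition; names and computations are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray          # camera image, e.g. (H, W, 3)
    instruction: str         # natural-language task description

class VLAPipeline:
    """One control step through the four modules the survey identifies."""

    def perceive(self, obs: Observation) -> np.ndarray:
        # Perception: encode raw pixels into visual features.
        return obs.rgb.mean(axis=(0, 1))  # stand-in for a vision encoder

    def represent(self, visual: np.ndarray, instruction: str) -> np.ndarray:
        # Representation: fuse vision and language into a shared state.
        lang = np.full(3, float(len(instruction)))  # stand-in for a text encoder
        return np.concatenate([visual, lang])

    def decide(self, state: np.ndarray) -> np.ndarray:
        # Decision-making: map the fused state to a high-level action.
        return np.tanh(state[:6])  # stand-in for a learned policy head

    def control(self, action: np.ndarray) -> np.ndarray:
        # Control: convert the action into bounded motor commands.
        return np.clip(action, -1.0, 1.0)

pipeline = VLAPipeline()
obs = Observation(rgb=np.zeros((224, 224, 3)), instruction="pick up the red cup")
command = pipeline.control(pipeline.decide(
    pipeline.represent(pipeline.perceive(obs), obs.instruction)))
print(command.shape)  # (6,), e.g. a 6-DoF end-effector command
```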
The "strongest embodied VLA model": what exactly makes it strong?
36Kr· 2025-11-20 07:38
Core Insights
- The core contribution of the π*0.6 model lies in its introduction of a more intuitive learning method called RECAP, which allows robots to learn from their mistakes rather than merely imitating correct actions [3][8][24].
- The model demonstrates a high success rate of over 90% in tasks such as making espresso, folding clothes, and assembling packaging boxes, showcasing its practical capabilities [1][20].

Group 1: RECAP Methodology
- RECAP consists of three main phases: offline reinforcement learning (RL) using diverse demonstration data, fine-tuning with human guidance, and online execution where robots learn from sparse rewards and expert corrections [10][20].
- The methodology leverages a value function to evaluate actions and an advantage-conditioned strategy to update policies, allowing for efficient learning from both successful and unsuccessful experiences [13][16][42] (a minimal sketch of advantage conditioning follows this summary).

Group 2: Model Architecture and Performance
- The π*0.6 model builds upon previous versions, expanding its backbone from Gemma (2.6 billion parameters) to Gemma3 (4 billion parameters), and increasing Action Expert parameters to 860 million [20].
- In challenging tasks, RECAP has doubled the throughput (successful task completions per hour) and reduced failure rates by approximately 50% compared to models that only utilized supervised fine-tuning [20].

Group 3: Learning from Mistakes
- The RECAP approach emphasizes the importance of learning from errors, enabling robots to recover from mistakes through expert intervention and self-correction, which is crucial for real-world applications [24][28].
- By utilizing a value function to assess the quality of actions, the model can identify key steps and sources of errors, enhancing its ability to adapt and improve in complex environments [39][41].
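A minimal sketch of the advantage-conditioned idea described above: label each logged action by whether it beat the value function's estimate, train the policy with that label as an extra input, and always condition on the "good" label at inference. The toy data, the tiny linear policy, and all names are assumptions for illustration, not the π*0.6 implementation.

```python
# Toy advantage-conditioned policy learning; illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

# Logged experience: states, actions, observed returns, and a learned
# value estimate V(s) (here faked with noise around the true return).
states  = rng.normal(size=(256, 4))
actions = rng.normal(size=(256, 2))
returns = states @ np.array([1.0, -0.5, 0.2, 0.0]) + rng.normal(scale=0.1, size=256)
values  = returns + rng.normal(scale=0.5, size=256)

# Advantage label: 1 if the action did better than the value function expected.
advantage = returns - values
good = (advantage > 0).astype(float)[:, None]

# Condition the policy input on the label and fit a linear policy by least
# squares (stand-in for gradient training of a neural policy).
inputs = np.concatenate([states, good], axis=1)          # (256, 5)
W, *_ = np.linalg.lstsq(inputs, actions, rcond=None)     # (5, 2)

# At inference, condition on "good" (label = 1) to ask the policy for
# above-expectation behavior.
test_state = rng.normal(size=(1, 4))
test_input = np.concatenate([test_state, np.ones((1, 1))], axis=1)
print(test_input @ W)  # predicted 2-DoF action conditioned on "good"
```

The design point this illustrates is why such a scheme can learn from failures: bad transitions are not discarded, they simply train the policy under the "not good" label.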
The "strongest embodied VLA model": what exactly makes it strong?
量子位· 2025-11-20 00:30
Core Insights
- The article discusses the breakthrough of the robot foundation model π*0.6, which showcases its capabilities in performing complex tasks with a success rate exceeding 90% [2][10].

Group 1: Model Overview
- π*0.6 is the latest VLA (Vision-Language-Action) model, building on the previous π0.5, and introduces a novel training method called RECAP [8][10].
- The RECAP method allows robots to learn from their mistakes, shifting from traditional imitation learning to a more intuitive learning approach [3][29].

Group 2: RECAP Methodology
- RECAP consists of three main stages: guidance through human demonstration, correction through expert intervention, and practice through autonomous experience [7][12].
- The model utilizes a value function to evaluate actions, which helps in identifying advantageous actions and improving learning efficiency [19][22] (a minimal sketch of value-function fitting follows this summary).

Group 3: Training Process
- The training process involves offline reinforcement learning using diverse data sources, including human demonstrations and autonomous attempts, to train the value function and policy [20][22].
- The model's architecture has been enhanced, with the backbone expanding from Gemma (2.6B) to Gemma3 (4B) and Action Expert parameters increasing to 860M [25].

Group 4: Performance Evaluation
- In tests involving complex tasks like folding clothes and making espresso, RECAP doubled the throughput and reduced failure rates by approximately 50% compared to models using only supervised fine-tuning [27].
- The model demonstrated high stability, successfully performing tasks for extended periods without human intervention [28].

Group 5: Learning from Failures
- The ability of the model to learn from failures is highlighted as a significant advancement, allowing it to extract effective learning signals from imperfect experiences [29][56].
- This approach opens new avenues for future research in robotics, emphasizing the importance of learning from real-world execution rather than solely relying on ideal demonstrations [56].
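A minimal sketch of the value-function step referenced above, assuming a simple regression setup: fit V(s) to observed returns from mixed demonstration and autonomous rollouts, then rank logged transitions by advantage to surface key steps and likely error sources. The data, the ridge-regression stand-in, and all names are illustrative assumptions, not the actual π*0.6 training code.

```python
# Toy value-function fitting and advantage ranking; illustrative only.
import numpy as np

rng = np.random.default_rng(1)

states  = rng.normal(size=(512, 6))       # mixed demo + autonomous rollouts
returns = np.tanh(states.sum(axis=1)) + rng.normal(scale=0.05, size=512)

# Fit V(s) by ridge regression (stand-in for training a value network).
lam = 1e-2
A = states.T @ states + lam * np.eye(6)
w = np.linalg.solve(A, states.T @ returns)

values = states @ w
advantage = returns - values

# High-advantage transitions mark the "key steps"; low-advantage ones point
# to likely sources of error worth correcting.
best, worst = np.argmax(advantage), np.argmin(advantage)
print(f"best step {best}: A={advantage[best]:.3f}; "
      f"worst step {worst}: A={advantage[worst]:.3f}")
```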
NVIDIA's 41-page autonomous-driving VLA framework! Causal chain reasoning, deployable on real vehicles
自动驾驶之心· 2025-11-15 03:03
Core Insights
- The article discusses the introduction of the Alpamayo-R1 (AR1) framework by NVIDIA, which aims to enhance decision-making capabilities in complex driving scenarios through causal reasoning and trajectory planning [1][2].

Group 1: Background and Development
- The evolution of autonomous driving systems has shifted from traditional modular architectures to end-to-end frameworks, which are now widely recognized in the industry [3].
- Current end-to-end methods struggle with long-tail scenarios due to sparse supervisory signals and the need for higher-order reasoning capabilities, highlighting a significant gap between existing models and the requirements for robust Level 4 (L4) autonomous driving [3][4].

Group 2: Innovations in AR1
- AR1 integrates causal chain reasoning with trajectory planning, resulting in a 12% improvement in planning accuracy in high-difficulty scenarios compared to trajectory-based benchmark models [2][8].
- The model demonstrates a 35% reduction in lane deviation rates and a 25% decrease in near-collision rates during closed-loop simulations [2].
- After reinforcement learning post-training, the model's reasoning quality improved by 45%, and reasoning-action consistency increased by 37% [2].

Group 3: Causal Chain Dataset and Structured Reasoning
- The article emphasizes the necessity of structured causal reasoning in autonomous driving, proposing a causal chain (CoC) dataset that aligns reasoning trajectories with driving decisions [5][29].
- The CoC dataset is designed to ensure that reasoning trajectories are concise and directly linked to specific driving decisions, enhancing the model's interpretability and training efficiency [5][31].

Group 4: Training Strategies and Model Architecture
- AR1 employs a multi-stage training strategy that combines supervised fine-tuning and reinforcement learning to optimize reasoning quality and trajectory prediction [8][12].
- The model architecture is modular, allowing compatibility with existing vision-language model (VLM) backbones while integrating components tailored for autonomous driving [12][16].

Group 5: Visual Encoding and Action Decoding
- The article discusses the challenges of visual encoding in multi-camera setups and proposes efficient tokenization methods to reduce the number of tokens generated during real-time inference [19][22].
- Action decoding is based on a bicycle model to ensure smooth trajectory outputs, enhancing the model's performance in real-world applications [27][28] (a generic bicycle-model rollout is sketched after this summary).

Group 6: Quality Assurance and Annotation Process
- A hybrid annotation process combining human and automated labeling is implemented to ensure high-quality training data for the CoC dataset, balancing efficiency and accuracy [48][49].
- The quality assurance process includes multiple checks to ensure causal correctness and decision minimality in the annotated data [52][53].
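Since the summary notes that action decoding is grounded in a bicycle model, here is a generic kinematic bicycle-model rollout for illustration. The wheelbase, timestep, and control sequence are assumptions, not values from the Alpamayo-R1 paper; the point is that integrating (acceleration, steering) through vehicle kinematics yields physically smooth trajectories by construction.

```python
# Generic kinematic bicycle-model rollout; parameters are assumptions.
import math

def rollout_bicycle(x, y, heading, speed, controls, wheelbase=2.8, dt=0.1):
    """Integrate (acceleration, steering-angle) commands into a trajectory."""
    traj = [(x, y)]
    for accel, steer in controls:
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        heading += speed / wheelbase * math.tan(steer) * dt
        speed = max(0.0, speed + accel * dt)
        traj.append((x, y))
    return traj

# Gentle left turn at ~10 m/s: constant small steering angle, no acceleration.
controls = [(0.0, 0.05)] * 30
trajectory = rollout_bicycle(0.0, 0.0, 0.0, 10.0, controls)
print(f"{len(trajectory)} waypoints, final position "
      f"({trajectory[-1][0]:.1f}, {trajectory[-1][1]:.1f})")
```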
Xiaopeng releases VLA 2.0: intelligent driving shows stronger generalization (Intelligent Connected Vehicle In-Depth Series No. 39)
Shenwan Hongyuan Securities· 2025-11-10 12:57
Investment Rating
- The report maintains a positive outlook on the investment potential of companies involved in the development of VLA2.0 technology, particularly focusing on Xiaopeng Motors and its partners [4][36].

Core Insights
- The VLA2.0 model demonstrates enhanced generalization capabilities, achieving performance similar to human drivers in certain scenarios, such as navigating complex roads with minimal human intervention [3][5].
- The technology is expected to spill over into other fields of embodied intelligence, including robotics and low-altitude economy applications [4][29].
- The report highlights the significant investment in training the VLA2.0 model, which consumed 30,000 computing units and over 2 billion yuan in training costs, utilizing nearly 100 million training data points [2][14].

Summary by Sections
Section 1: Xiaopeng's VLA2.0 Release
- Xiaopeng's VLA2.0 is designed to be more efficient and responsive, capable of handling various road conditions seamlessly, including complex intersections and narrow roads [9][12].

Section 2: Algorithm Development
- The VLA model has a clear historical evolution, transitioning from single-modal processing to multi-modal understanding and execution, enhancing its application in the industry [20][22].

Section 3: Computing Power - Turing Chip
- The Turing chip, which supports the VLA2.0 model, features an independent ISP and enhanced perception capabilities, crucial for recognizing challenging environmental conditions [32][34].

Section 4: Investment Targets
- Key investment targets identified include Xiaopeng Motors, Desay SV, Geek+, and Tianzhun Technology, which are positioned to benefit from advancements in VLA technology [4][36].

Section 5: Appendix
- The report includes a comparison of mainstream VLA algorithms, highlighting the various technical paths and architectures that have emerged in the field [41][42].
The future is here! The era of AI aircraft will replace most manual labor
深思SenseAI· 2025-11-06 04:46
Core Viewpoint
- Infravision is revolutionizing the construction of power transmission lines through its innovative drone technology, which offers a safer, more efficient, and cost-effective solution compared to traditional methods [1][4].

Group 1: Advantages of Infravision's Technology
- Drone-based line construction avoids the safety hazards associated with high-altitude work and helicopter flights, and is not limited by terrain [5].
- The system is quieter and has a reduced impact on the environment and land ownership, minimizing disruption to landowners [6].
- Infravision's technology significantly enhances efficiency and reduces costs by eliminating the need for large helicopters and extensive manpower, leading to faster project timelines [6].
- The integrated system combines drone automation, precise navigation, and specialized aerial towing equipment, enabling it to handle long-distance high-voltage line installations at an industrial scale [6].

Group 2: Strategic Execution and Market Positioning
- Infravision's rapid rise is attributed to its clear strategic focus on high-value niche markets, particularly power transmission line construction, which faces significant pain points [8].
- The company initially targeted the Australian market to validate its technology and establish model projects, effectively leveraging limited resources to meet important customer demands [8].
- Infravision emphasizes providing end-to-end solutions rather than merely selling products, fostering long-term partnerships through equipment leasing and operational services [9].
- Following success in Australia, the company is expanding into the North American market, targeting major clients like PG&E [10].
- The company is rapidly scaling its team to meet increasing project demands, with plans to grow from 70 to 150-200 employees by the end of 2025 [10].

Group 3: Future Development and Industry Trends
- The concept of "aerial embodied intelligence" is emerging, which involves autonomous flying robots capable of perception, decision-making, and physical interaction [11].
- The development of drone swarm control systems allows multiple drones to coordinate and complete tasks efficiently, enhancing operational capabilities in various sectors [12].
- Infravision and similar companies are not just offering advanced drones but are creating new operational paradigms that deconstruct dangerous and repetitive tasks into standardized, machine-executable operations [20].
A 41-page autonomous-driving VLA framework from NVIDIA! Causal chain reasoning and the real-vehicle-deployable algorithm Alpamayo-R1
自动驾驶之心· 2025-11-05 00:04
Core Insights
- The article discusses the introduction of the Alpamayo-R1 (AR1) framework by NVIDIA, which aims to enhance decision-making capabilities in complex driving scenarios through causal reasoning and trajectory planning [1][2].

Group 1: Background and Framework
- The development of autonomous driving systems has shifted from traditional modular architectures to end-to-end frameworks, which are now widely recognized in the industry [3].
- Current end-to-end methods struggle with long-tail scenarios due to sparse supervisory signals and the need for high-order reasoning capabilities, highlighting a significant gap between existing models and the requirements for robust Level 4 (L4) autonomous driving [3][4].

Group 2: Innovations in AR1
- AR1 integrates causal chain reasoning with trajectory planning, resulting in a 12% increase in planning accuracy in high-difficulty scenarios compared to trajectory-based benchmark models [2][8].
- The model demonstrates a 35% reduction in lane deviation rates and a 25% decrease in near-collision rates during closed-loop simulations [2].
- After reinforcement learning post-training, the model's reasoning quality improved by 45%, and reasoning-action consistency increased by 37% [2].

Group 3: Causal Chain Dataset
- The article introduces a structured causal chain (CoC) annotation framework that generates reasoning trajectories aligned with driving behavior, ensuring that each trajectory is decision-centric and causally linked [5][29] (a hypothetical sketch of such a record appears after this summary).
- The CoC dataset is designed to provide clear supervision for learning decision causality, enabling the reasoning model to efficiently infer the reasons behind specific driving actions [31][42].

Group 4: Training Strategies
- A multi-stage training strategy is employed, utilizing supervised fine-tuning and reinforcement learning to enhance reasoning capabilities and ensure consistency between reasoning and actions [8][12].
- The AR1 model is built on the Cosmos-Reason backbone, which is designed specifically for physical-intelligence applications, enhancing its deployment capabilities in autonomous driving scenarios [16][17].

Group 5: Vision-Language-Action (VLA) Architecture
- The AR1 architecture emphasizes modularity and flexibility, allowing it to integrate existing vision-language models while incorporating specialized components for efficient visual encoding and real-time action decoding [12][19].
- The model's design addresses the challenges of processing multi-camera inputs and generating the precise multi-modal trajectory predictions necessary for safe vehicle control [11][12].

Group 6: Data Annotation and Quality Assurance
- A hybrid annotation process combining human and automated labeling is implemented to ensure high-quality training data while maintaining efficiency [48][49].
- The quality assurance process includes multiple checks to ensure causal correctness and minimal decision-making ambiguity in the annotated data [52][53].
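For intuition, here is a hypothetical sketch of what a decision-centric CoC record could look like, based only on the properties listed above (ordered reasoning steps causally tied to a single driving decision). The field names and example values are invented for illustration and are not the actual dataset schema.

```python
# Hypothetical CoC record structure; field names and values are assumptions,
# not the Alpamayo-R1 dataset schema.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class CoCRecord:
    scene_id: str
    # Ordered causal steps: each observation leads toward the decision.
    causal_chain: list = field(default_factory=list)
    decision: str = ""            # the single driving decision being explained
    trajectory_id: str = ""       # link to the planned trajectory it supervises

record = CoCRecord(
    scene_id="scene_0042",
    causal_chain=[
        "pedestrian enters crosswalk ahead",
        "ego speed exceeds safe approach speed",
    ],
    decision="decelerate and yield",
    trajectory_id="traj_0042_a",
)
print(json.dumps(asdict(record), indent=2))
```

Keeping each record tied to exactly one decision is what makes the supervision "decision-centric": the model learns why this trajectory, not a free-form narration of the scene.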
Dexmal (原力灵机) open-sources Dexbotic, a one-stop VLA codebase built on PyTorch
机器之心· 2025-10-22 06:32
Core Insights
- Dexbotic is an open-source vision-language-action (VLA) model toolkit developed by Dexmal, aimed at researchers in the field of embodied intelligence, featuring a modular architecture with three core components: Data, Experiment, and Model [3][7][9].

Group 1: Need for a Unified VLA Development Platform
- VLA models serve as a crucial technology hub connecting perception, cognition, and action, but the field faces challenges such as severely decentralized research, cumbersome development processes, and fairness issues in algorithm comparison [5][7].
- The introduction of Dexbotic addresses these pain points by providing standardized, modular, high-performance research infrastructure, moving the field from "reinventing the wheel" to "collaborative innovation" [7][9].

Group 2: Dexbotic Architecture
- The overall architecture of Dexbotic consists of three main layers: the Data Layer, Model Layer, and Experiment Layer, with the Data Layer optimizing storage and integrating multi-source data [9][11].
- The Model Layer includes the foundational model DexboticVLM, which supports various VLA strategies and allows users to customize new VLA models easily [9][11].
- The Experiment Layer introduces an innovative script mechanism for conducting experiments, enabling users to modify configurations with minimal changes while ensuring system stability [11][12] (a hypothetical sketch of this pattern follows this summary).

Group 3: Key Features
- Dexbotic offers a unified modular VLA framework compatible with mainstream large language models, integrating embodied manipulation and navigation functionalities [13].
- High-performance pre-trained models are available for major VLA algorithms, significantly enhancing performance in various simulation environments and real-world tasks [13].
- The experimental framework is designed for flexibility and extensibility, allowing users to easily modify configurations and switch models or tasks [13][14].

Group 4: Open Source Hardware
- Dexmal has launched its first open-source hardware product, Dexbotic Open Source - W1 (DOS-W1), featuring a fully open design that lowers barriers to use and maintenance [16][17].
- The hardware design includes modular components and ergonomic features to enhance user comfort and data-collection efficiency [17].

Group 5: Future Outlook
- Dexmal plans to expand its offerings with more advanced VLM base models and open-source hardware, integrating simulation-to-real-world transfer learning tools and establishing a community-driven model contribution mechanism [19].
- Collaboration with RoboChallenge aims to create a comprehensive technical loop for development, training, inference, and evaluation, ensuring transparency and fairness in performance validation [20].
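To illustrate the script-style experiment pattern described above, here is a hypothetical sketch using a plain dataclass config: an experiment script overrides only the fields that differ from the defaults, leaving the rest of the system untouched. The class, its fields, and the `run_experiment` entry point are invented for illustration and are not Dexbotic's actual API.

```python
# Hypothetical config-override pattern; NOT Dexbotic's actual API.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ExperimentConfig:
    model: str = "dexbotic_vlm_base"   # hypothetical pretrained VLA backbone
    dataset: str = "libero_90"
    lr: float = 1e-4
    batch_size: int = 64
    action_horizon: int = 8

def run_experiment(cfg: ExperimentConfig) -> None:
    # Stand-in for the training entry point such a framework would provide.
    print(f"training {cfg.model} on {cfg.dataset} "
          f"(lr={cfg.lr}, bs={cfg.batch_size}, horizon={cfg.action_horizon})")

# An "experiment script" states only its deltas from the shared defaults,
# which keeps ablations short and the base system stable.
base = ExperimentConfig()
ablation = replace(base, lr=5e-5, action_horizon=16)

run_experiment(base)
run_experiment(ablation)
```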
Peking University and 灵初 release a major comprehensive survey of embodied VLA! One article that makes VLA's technical routes and future trends clear
机器之心· 2025-07-25 02:03
Core Insights
- The article discusses the rapid advancements in Vision-Language-Action (VLA) models, which are capable of extending intelligence from the digital realm to physical tasks, particularly in robotics [1][9].
- A unified framework for understanding VLA models is proposed, centered on action tokenization; it categorizes eight main types of action tokens and outlines their capabilities and future trends [2][10].

VLA Unified Framework and Action Token Perspective
- VLA models rely on at least one visual or language foundation model to generate action outputs based on visual and language inputs, aiming to execute specific tasks in the physical world [9][11].
- The framework categorizes action tokens into eight types: language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning [10][16] (a small illustrative taxonomy follows this summary).

Action Token Analysis
- **Language Description**: Describes actions in natural language, divided into the sub-task level (language plan) and the atomic action level (language motion) [16][20].
- **Code**: Represents task logic in code form, allowing for efficient communication between humans and robots, but faces challenges related to API dependencies and execution rigidity [22][23].
- **Affordance**: A spatial representation indicating how objects can be interacted with, emphasizing semantic clarity and adaptability [25][26].
- **Trajectory**: Represents continuous spatial states over time, utilizing video data to expand training data sources [29][30].
- **Goal State**: A visual representation of the expected outcome, aiding in action planning and execution [34][35].
- **Latent Representation**: Encodes action-related information through large-scale data pre-training, enhancing training efficiency and generalization [36][37].
- **Raw Action**: Directly executable low-level control commands for robots, showing potential for scalability similar to that of large language models [38][39].
- **Reasoning**: Expresses the thought process behind actions, enhancing model interpretability and decision-making [42][45].

Data Resources in VLA Models
- The article categorizes data resources into a pyramid structure: web data and human videos at the base, synthetic and simulation data in the middle, and real robot data at the top, with each contributing uniquely to model performance and generalization [47][48][49].

Conclusion
- VLA models are positioned as a key pathway to embodied intelligence, with ongoing research focusing on action token design, challenges, and future directions, as well as the practical applications of VLA technology in real-world scenarios [51].
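As a compact reference for the taxonomy above, here is a small illustrative encoding of the eight action-token types the survey lists, with hypothetical example payloads. The enum and the payload values are assumptions for illustration, not code or data from the paper.

```python
# Illustrative taxonomy of the eight action-token types; payloads are invented.
from enum import Enum

class ActionToken(Enum):
    LANGUAGE_DESCRIPTION = "language_description"  # sub-task or atomic motion in words
    CODE = "code"                                  # executable task logic
    AFFORDANCE = "affordance"                      # where/how an object can be engaged
    TRAJECTORY = "trajectory"                      # continuous spatial states over time
    GOAL_STATE = "goal_state"                      # visual image of the desired outcome
    LATENT_REPRESENTATION = "latent"               # pretrained action-relevant embedding
    RAW_ACTION = "raw_action"                      # low-level motor command
    REASONING = "reasoning"                        # thought process behind the action

# Hypothetical examples of the payload each token type could wrap.
examples = {
    ActionToken.LANGUAGE_DESCRIPTION: "move gripper above the mug",
    ActionToken.RAW_ACTION: [0.02, -0.01, 0.0, 0.0, 0.0, 0.1, 1.0],
    ActionToken.TRAJECTORY: [(0.1, 0.2), (0.15, 0.22), (0.2, 0.25)],
}
for token, payload in examples.items():
    print(f"{token.value}: {payload}")
```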