Vision-Language-Action Models (VLA)
Want to get into VLA but don't know where to start? NTU & Sun Yat-sen University open-source the "ultimate recipe": from the base model to frequency-domain modeling, every step is backed by experiments
量子位· 2026-03-02 16:00
Contributed by the VLANeXt team to 量子位 | WeChat official account QbitAI. VLA models come in endless varieties, so which design choices actually matter? The latest study from MMLab@NTU and Sun Yat-sen University delivers an ultimate "recipe" that takes you from beginner to expert: VLANeXt.

Rather than simply proposing yet another model, the work systematically dissects the VLA design space along 12 key dimensions, from basic components to perception factors to additional perspectives on action modeling, with every step backed by solid experiments.

The resulting model, VLANeXt, comprehensively outperforms all manner of SOTA methods, including 7B-parameter models, on the standard LIBERO benchmark and the LIBERO-plus generalization test. Under unseen perturbations such as lighting, background, and camera pose, its success rate jumps by 10% over the previous best method. Whether you are a newcomer to embodied intelligence or a veteran looking to further optimize your models, this "recipe" can help you find answers.

Background: leaving the "primordial soup" of VLA
With the rise of large foundation models, Vision-Language-Action (VLA) models have shown enormous potential: by inheriting rich visual understanding and language grounding, they offer a scalable path toward general-purpose robot policy learning. However, the current VLA research landscape remains in a "primordial soup" stage, full of free-wheeling exploration and design but lacking clear ...
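As background for the frequency-domain design dimension the headline mentions, here is a generic sketch of representing a smooth action chunk by its low-frequency coefficients, a technique used in several recent action tokenizers. It is illustrative only: the chunk length, DoF count, and cutoff `K` are assumptions, and the excerpt does not specify how VLANeXt itself applies frequency-domain modeling.

```python
# Generic frequency-domain action-chunk compression; illustrative only,
# not necessarily what VLANeXt does.
import numpy as np

rng = np.random.default_rng(42)

# A smooth 32-step, 1-DoF action chunk plus a little sensor noise.
t = np.linspace(0, 1, 32)
actions = np.sin(2 * np.pi * t) + 0.05 * rng.normal(size=32)

# Keep only the lowest K frequency components of the chunk.
K = 4
spectrum = np.fft.rfft(actions)
spectrum[K:] = 0.0
reconstruction = np.fft.irfft(spectrum, n=32)

err = np.abs(actions - reconstruction).max()
print(f"kept {K}/{len(np.fft.rfft(actions))} coefficients, "
      f"max reconstruction error {err:.3f}")
```

The appeal of such a representation is that a handful of coefficients captures a whole smooth action chunk, shortening the sequence the policy must predict.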
An in-depth survey that comprehensively maps out VLA's 20 major challenges, with research directions clearly laid out; updated weekly to help you stay on top of the latest breakthroughs!
AI科技大本营· 2025-12-25 01:18
Core Insights
- The article discusses the emergence of Vision-Language-Action (VLA) systems, which are transitioning from demonstrations to real-world applications, highlighting the need for a structured learning path for newcomers and practitioners in the field [1][3][4].

Group 1: Overview of VLA
- Embodied AI is identified as a rapidly evolving frontier in AI and robotics, with a focus on making machines capable of seeing, understanding, and acting [3][4].
- The article emphasizes the structural confusion within the field due to the rapid growth of models and datasets, making it challenging for newcomers to identify where to start and for existing practitioners to determine how to systematically enhance VLA capabilities [3][4].

Group 2: Contributions of the Review
- The review paper titled "An Anatomy of Vision-Language-Action Models" aims to provide a clear and systematic reference framework for the increasingly complex VLA research area [4][6].
- It establishes a continuously evolving reference system for tracking the latest developments in VLA research, organized by modules, milestones, and challenges [5][9].

Group 3: Learning Pathways
- For newcomers, the review suggests first establishing an overall understanding of the VLA field before delving deeper into specific areas [13][14].
- For practitioners, the review serves as an efficient roadmap for identifying areas for capability enhancement, helping to clarify research questions and innovation points [15][16].

Group 4: Structural Analysis
- The review begins with a breakdown of basic modules in VLA systems, covering perception, representation, decision-making, and control, to create a common technical language [18][19] (a minimal pipeline sketch of these four modules follows this summary).
- It then reviews key milestones along a timeline to illustrate the evolution of VLA from early concept validation to a general framework for real-world deployment [20][21].

Group 5: Key Challenges
- The review identifies five core challenges that VLA systems face, including representation, execution, generalization, safety, and data evaluation, framing these challenges as the main focus of the analysis [25][26][30][33][39].
- Each challenge is linked to the overall capability of VLA systems, emphasizing the need for a clear understanding of problem structures to overcome existing bottlenecks [26][30][34][36].

Group 6: Future Directions
- The review outlines potential future directions for VLA, such as developing native multimodal architectures and integrating physical and semantic causal world models [42][43].
- It envisions the next generation of embodied agents that not only perform tasks but do so reliably and controllably in real-world settings [44].
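To make the four-module decomposition concrete, here is a minimal sketch of a perception / representation / decision-making / control pipeline. All class and method names (`VLAPipeline`, `perceive`, etc.) are illustrative assumptions, and the stand-in computations are placeholders for learned encoders and policies, not anything from the survey.

```python
# Hypothetical sketch of the perception/representation/decision/control
# decomposition; names and computations are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray          # camera image, e.g. (H, W, 3)
    instruction: str         # natural-language task description

class VLAPipeline:
    """One control step through the four modules the survey identifies."""

    def perceive(self, obs: Observation) -> np.ndarray:
        # Perception: encode raw pixels into visual features.
        return obs.rgb.mean(axis=(0, 1))  # stand-in for a vision encoder

    def represent(self, visual: np.ndarray, instruction: str) -> np.ndarray:
        # Representation: fuse vision and language into a shared state.
        lang = np.full(3, float(len(instruction)))  # stand-in for a text encoder
        return np.concatenate([visual, lang])

    def decide(self, state: np.ndarray) -> np.ndarray:
        # Decision-making: map the fused state to a high-level action.
        return np.tanh(state[:6])  # stand-in for a learned policy head

    def control(self, action: np.ndarray) -> np.ndarray:
        # Control: convert the action into bounded motor commands.
        return np.clip(action, -1.0, 1.0)

pipeline = VLAPipeline()
obs = Observation(rgb=np.zeros((224, 224, 3)), instruction="pick up the red cup")
command = pipeline.control(pipeline.decide(
    pipeline.represent(pipeline.perceive(obs), obs.instruction)))
print(command.shape)  # (6,), e.g. a 6-DoF end-effector command
```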
The "strongest embodied VLA model": what exactly makes it strong?
36Kr· 2025-11-20 07:38
Core Insights
- The core contribution of the π*0.6 model lies in its introduction of a more intuitive learning method called RECAP, which allows robots to learn from their mistakes rather than merely imitating correct actions [3][8][24].
- The model demonstrates a high success rate of over 90% in tasks such as making espresso, folding clothes, and assembling packaging boxes, showcasing its practical capabilities [1][20].

Group 1: RECAP Methodology
- RECAP consists of three main phases: offline reinforcement learning (RL) using diverse demonstration data, fine-tuning with human guidance, and online execution where robots learn from sparse rewards and expert corrections [10][20].
- The methodology leverages a value function to evaluate actions and an advantage-conditioned strategy to update policies, allowing for efficient learning from both successful and unsuccessful experiences [13][16][42] (a minimal sketch of advantage conditioning follows this summary).

Group 2: Model Architecture and Performance
- The π*0.6 model builds upon previous versions, expanding its backbone from Gemma (2.6 billion parameters) to Gemma3 (4 billion parameters), and increasing Action Expert parameters to 860 million [20].
- In challenging tasks, RECAP has doubled the throughput (successful task completions per hour) and reduced failure rates by approximately 50% compared to models that only utilized supervised fine-tuning [20].

Group 3: Learning from Mistakes
- The RECAP approach emphasizes the importance of learning from errors, enabling robots to recover from mistakes through expert intervention and self-correction, which is crucial for real-world applications [24][28].
- By utilizing a value function to assess the quality of actions, the model can identify key steps and sources of errors, enhancing its ability to adapt and improve in complex environments [39][41].
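A minimal sketch of the advantage-conditioned idea described above: label each logged action by whether it beat the value function's estimate, train the policy with that label as an extra input, and always condition on the "good" label at inference. The toy data, the tiny linear policy, and all names are assumptions for illustration, not the π*0.6 implementation.

```python
# Toy advantage-conditioned policy learning; illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

# Logged experience: states, actions, observed returns, and a learned
# value estimate V(s) (here faked with noise around the true return).
states  = rng.normal(size=(256, 4))
actions = rng.normal(size=(256, 2))
returns = states @ np.array([1.0, -0.5, 0.2, 0.0]) + rng.normal(scale=0.1, size=256)
values  = returns + rng.normal(scale=0.5, size=256)

# Advantage label: 1 if the action did better than the value function expected.
advantage = returns - values
good = (advantage > 0).astype(float)[:, None]

# Condition the policy input on the label and fit a linear policy by least
# squares (stand-in for gradient training of a neural policy).
inputs = np.concatenate([states, good], axis=1)          # (256, 5)
W, *_ = np.linalg.lstsq(inputs, actions, rcond=None)     # (5, 2)

# At inference, condition on "good" (label = 1) to ask the policy for
# above-expectation behavior.
test_state = rng.normal(size=(1, 4))
test_input = np.concatenate([test_state, np.ones((1, 1))], axis=1)
print(test_input @ W)  # predicted 2-DoF action conditioned on "good"
```

The design point this illustrates is why such a scheme can learn from failures: bad transitions are not discarded, they simply train the policy under the "not good" label.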
The "strongest embodied VLA model": what exactly makes it strong?
量子位· 2025-11-20 00:30
Core Insights
- The article discusses the breakthrough of the robot foundation model π*0.6, which showcases its capabilities in performing complex tasks with a success rate exceeding 90% [2][10].

Group 1: Model Overview
- π*0.6 is the latest VLA (Vision-Language-Action) model, building on the previous π0.5, and introduces a novel training method called RECAP [8][10].
- The RECAP method allows robots to learn from their mistakes, shifting from traditional imitation learning to a more intuitive learning approach [3][29].

Group 2: RECAP Methodology
- RECAP consists of three main stages: guidance through human demonstration, correction through expert intervention, and practice through autonomous experience [7][12].
- The model utilizes a value function to evaluate actions, which helps in identifying advantageous actions and improving learning efficiency [19][22] (a minimal sketch of value-function fitting follows this summary).

Group 3: Training Process
- The training process involves offline reinforcement learning using diverse data sources, including human demonstrations and autonomous attempts, to train the value function and policy [20][22].
- The model's architecture has been enhanced, with the backbone expanding from Gemma (2.6B) to Gemma3 (4B) and Action Expert parameters increasing to 860M [25].

Group 4: Performance Evaluation
- In tests involving complex tasks like folding clothes and making espresso, RECAP doubled the throughput and reduced failure rates by approximately 50% compared to models using only supervised fine-tuning [27].
- The model demonstrated high stability, successfully performing tasks for extended periods without human intervention [28].

Group 5: Learning from Failures
- The ability of the model to learn from failures is highlighted as a significant advancement, allowing it to extract effective learning signals from imperfect experiences [29][56].
- This approach opens new avenues for future research in robotics, emphasizing the importance of learning from real-world execution rather than solely relying on ideal demonstrations [56].
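A minimal sketch of the value-function step referenced above, assuming a simple regression setup: fit V(s) to observed returns from mixed demonstration and autonomous rollouts, then rank logged transitions by advantage to surface key steps and likely error sources. The data, the ridge-regression stand-in, and all names are illustrative assumptions, not the actual π*0.6 training code.

```python
# Toy value-function fitting and advantage ranking; illustrative only.
import numpy as np

rng = np.random.default_rng(1)

states  = rng.normal(size=(512, 6))       # mixed demo + autonomous rollouts
returns = np.tanh(states.sum(axis=1)) + rng.normal(scale=0.05, size=512)

# Fit V(s) by ridge regression (stand-in for training a value network).
lam = 1e-2
A = states.T @ states + lam * np.eye(6)
w = np.linalg.solve(A, states.T @ returns)

values = states @ w
advantage = returns - values

# High-advantage transitions mark the "key steps"; low-advantage ones point
# to likely sources of error worth correcting.
best, worst = np.argmax(advantage), np.argmin(advantage)
print(f"best step {best}: A={advantage[best]:.3f}; "
      f"worst step {worst}: A={advantage[worst]:.3f}")
```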
NVIDIA's 41-page autonomous-driving VLA framework! Causal chain reasoning, deployable on real vehicles
自动驾驶之心· 2025-11-15 03:03
Core Insights
- The article discusses the introduction of the Alpamayo-R1 (AR1) framework by NVIDIA, which aims to enhance decision-making capabilities in complex driving scenarios through causal reasoning and trajectory planning [1][2].

Group 1: Background and Development
- The evolution of autonomous driving systems has shifted from traditional modular architectures to end-to-end frameworks, which are now widely recognized in the industry [3].
- Current end-to-end methods struggle with long-tail scenarios due to sparse supervisory signals and the need for higher-order reasoning capabilities, highlighting a significant gap between existing models and the requirements for robust Level 4 (L4) autonomous driving [3][4].

Group 2: Innovations in AR1
- AR1 integrates causal chain reasoning with trajectory planning, resulting in a 12% improvement in planning accuracy in high-difficulty scenarios compared to trajectory-based benchmark models [2][8].
- The model demonstrates a 35% reduction in lane deviation rates and a 25% decrease in near-collision rates during closed-loop simulations [2].
- After reinforcement learning post-training, the model's reasoning quality improved by 45%, and reasoning-action consistency increased by 37% [2].

Group 3: Causal Chain Dataset and Structured Reasoning
- The article emphasizes the necessity of structured causal reasoning in autonomous driving, proposing a causal chain (CoC) dataset that aligns reasoning trajectories with driving decisions [5][29].
- The CoC dataset is designed to ensure that reasoning trajectories are concise and directly linked to specific driving decisions, enhancing the model's interpretability and training efficiency [5][31].

Group 4: Training Strategies and Model Architecture
- AR1 employs a multi-stage training strategy that combines supervised fine-tuning and reinforcement learning to optimize reasoning quality and trajectory prediction [8][12].
- The model architecture is modular, allowing compatibility with existing vision-language model (VLM) backbones while integrating components tailored for autonomous driving [12][16].

Group 5: Visual Encoding and Action Decoding
- The article discusses the challenges of visual encoding in multi-camera setups and proposes efficient tokenization methods to reduce the number of tokens generated during real-time inference [19][22].
- Action decoding is based on a bicycle model to ensure smooth trajectory outputs, enhancing the model's performance in real-world applications [27][28] (a generic bicycle-model rollout is sketched after this summary).

Group 6: Quality Assurance and Annotation Process
- A hybrid annotation process combining human and automated labeling is implemented to ensure high-quality training data for the CoC dataset, balancing efficiency and accuracy [48][49].
- The quality assurance process includes multiple checks to ensure causal correctness and decision minimality in the annotated data [52][53].
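Since the summary notes that action decoding is grounded in a bicycle model, here is a generic kinematic bicycle-model rollout for illustration. The wheelbase, timestep, and control sequence are assumptions, not values from the Alpamayo-R1 paper; the point is that integrating (acceleration, steering) through vehicle kinematics yields physically smooth trajectories by construction.

```python
# Generic kinematic bicycle-model rollout; parameters are assumptions.
import math

def rollout_bicycle(x, y, heading, speed, controls, wheelbase=2.8, dt=0.1):
    """Integrate (acceleration, steering-angle) commands into a trajectory."""
    traj = [(x, y)]
    for accel, steer in controls:
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        heading += speed / wheelbase * math.tan(steer) * dt
        speed = max(0.0, speed + accel * dt)
        traj.append((x, y))
    return traj

# Gentle left turn at ~10 m/s: constant small steering angle, no acceleration.
controls = [(0.0, 0.05)] * 30
trajectory = rollout_bicycle(0.0, 0.0, 0.0, 10.0, controls)
print(f"{len(trajectory)} waypoints, final position "
      f"({trajectory[-1][0]:.1f}, {trajectory[-1][1]:.1f})")
```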
Xiaopeng releases VLA 2.0: intelligent driving shows stronger generalization (Intelligent Connected Vehicle In-Depth Series No. 39)
Shenwan Hongyuan Securities· 2025-11-10 12:57
Investment Rating
- The report maintains a positive outlook on the investment potential of companies involved in the development of VLA2.0 technology, particularly focusing on Xiaopeng Motors and its partners [4][36].

Core Insights
- The VLA2.0 model demonstrates enhanced generalization capabilities, achieving performance similar to human drivers in certain scenarios, such as navigating complex roads with minimal human intervention [3][5].
- The technology is expected to spill over into other fields of embodied intelligence, including robotics and low-altitude economy applications [4][29].
- The report highlights the significant investment in training the VLA2.0 model, which consumed 30,000 computing units and over 2 billion yuan in training costs, utilizing nearly 100 million training data points [2][14].

Summary by Sections
Section 1: Xiaopeng's VLA2.0 Release
- Xiaopeng's VLA2.0 is designed to be more efficient and responsive, capable of handling various road conditions seamlessly, including complex intersections and narrow roads [9][12].

Section 2: Algorithm Development
- The VLA model has a clear historical evolution, transitioning from single-modal processing to multi-modal understanding and execution, enhancing its application in the industry [20][22].

Section 3: Computing Power - Turing Chip
- The Turing chip, which supports the VLA2.0 model, features an independent ISP and enhanced perception capabilities, crucial for recognizing challenging environmental conditions [32][34].

Section 4: Investment Targets
- Key investment targets identified include Xiaopeng Motors, Desay SV, Geek+, and Tianzhun Technology, which are positioned to benefit from advancements in VLA technology [4][36].

Section 5: Appendix
- The report includes a comparison of mainstream VLA algorithms, highlighting the various technical paths and architectures that have emerged in the field [41][42].
The future is here! The era of AI aircraft will replace most manual labor
深思SenseAI· 2025-11-06 04:46
Core Viewpoint
- Infravision is revolutionizing the construction of power transmission lines through its innovative drone technology, which offers a safer, more efficient, and cost-effective solution compared to traditional methods [1][4].

Group 1: Advantages of Infravision's Technology
- Drone-based line construction avoids the safety hazards associated with high-altitude work and helicopter flights, and is not limited by terrain [5].
- The system is quieter and has a reduced impact on the environment and land ownership, minimizing disruption to landowners [6].
- Infravision's technology significantly enhances efficiency and reduces costs by eliminating the need for large helicopters and extensive manpower, leading to faster project timelines [6].
- The integrated system combines drone automation, precise navigation, and specialized aerial towing equipment, enabling it to handle long-distance high-voltage line installations at an industrial scale [6].

Group 2: Strategic Execution and Market Positioning
- Infravision's rapid rise is attributed to its clear strategic focus on high-value niche markets, particularly power transmission line construction, which faces significant pain points [8].
- The company initially targeted the Australian market to validate its technology and establish model projects, effectively leveraging limited resources to meet important customer demands [8].
- Infravision emphasizes providing end-to-end solutions rather than merely selling products, fostering long-term partnerships through equipment leasing and operational services [9].
- Following success in Australia, the company is expanding into the North American market, targeting major clients like PG&E [10].
- The company is rapidly scaling its team to meet increasing project demands, with plans to grow from 70 to 150-200 employees by the end of 2025 [10].

Group 3: Future Development and Industry Trends
- The concept of "aerial embodied intelligence" is emerging, which involves autonomous flying robots capable of perception, decision-making, and physical interaction [11].
- The development of drone swarm control systems allows multiple drones to coordinate and complete tasks efficiently, enhancing operational capabilities in various sectors [12].
- Infravision and similar companies are not just offering advanced drones but are creating new operational paradigms that deconstruct dangerous and repetitive tasks into standardized, machine-executable operations [20].
A 41-page autonomous-driving VLA framework from NVIDIA! Causal chain reasoning and the real-vehicle-deployable algorithm Alpamayo-R1
自动驾驶之心· 2025-11-05 00:04
Core Insights
- The article discusses the introduction of the Alpamayo-R1 (AR1) framework by NVIDIA, which aims to enhance decision-making capabilities in complex driving scenarios through causal reasoning and trajectory planning [1][2].

Group 1: Background and Framework
- The development of autonomous driving systems has shifted from traditional modular architectures to end-to-end frameworks, which are now widely recognized in the industry [3].
- Current end-to-end methods struggle with long-tail scenarios due to sparse supervisory signals and the need for high-order reasoning capabilities, highlighting a significant gap between existing models and the requirements for robust Level 4 (L4) autonomous driving [3][4].

Group 2: Innovations in AR1
- AR1 integrates causal chain reasoning with trajectory planning, resulting in a 12% increase in planning accuracy in high-difficulty scenarios compared to trajectory-based benchmark models [2][8].
- The model demonstrates a 35% reduction in lane deviation rates and a 25% decrease in near-collision rates during closed-loop simulations [2].
- After reinforcement learning post-training, the model's reasoning quality improved by 45%, and reasoning-action consistency increased by 37% [2].

Group 3: Causal Chain Dataset
- The article introduces a structured causal chain (CoC) annotation framework that generates reasoning trajectories aligned with driving behavior, ensuring that each trajectory is decision-centric and causally linked [5][29] (a hypothetical sketch of such a record appears after this summary).
- The CoC dataset is designed to provide clear supervision for learning decision causality, enabling the reasoning model to efficiently infer the reasons behind specific driving actions [31][42].

Group 4: Training Strategies
- A multi-stage training strategy is employed, utilizing supervised fine-tuning and reinforcement learning to enhance reasoning capabilities and ensure consistency between reasoning and actions [8][12].
- The AR1 model is built on the Cosmos-Reason backbone, which is designed specifically for physical-intelligence applications, enhancing its deployment capabilities in autonomous driving scenarios [16][17].

Group 5: Vision-Language-Action (VLA) Architecture
- The AR1 architecture emphasizes modularity and flexibility, allowing it to integrate existing vision-language models while incorporating specialized components for efficient visual encoding and real-time action decoding [12][19].
- The model's design addresses the challenges of processing multi-camera inputs and generating the precise multi-modal trajectory predictions necessary for safe vehicle control [11][12].

Group 6: Data Annotation and Quality Assurance
- A hybrid annotation process combining human and automated labeling is implemented to ensure high-quality training data while maintaining efficiency [48][49].
- The quality assurance process includes multiple checks to ensure causal correctness and minimal decision-making ambiguity in the annotated data [52][53].
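For intuition, here is a hypothetical sketch of what a decision-centric CoC record could look like, based only on the properties listed above (ordered reasoning steps causally tied to a single driving decision). The field names and example values are invented for illustration and are not the actual dataset schema.

```python
# Hypothetical CoC record structure; field names and values are assumptions,
# not the Alpamayo-R1 dataset schema.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class CoCRecord:
    scene_id: str
    # Ordered causal steps: each observation leads toward the decision.
    causal_chain: list = field(default_factory=list)
    decision: str = ""            # the single driving decision being explained
    trajectory_id: str = ""       # link to the planned trajectory it supervises

record = CoCRecord(
    scene_id="scene_0042",
    causal_chain=[
        "pedestrian enters crosswalk ahead",
        "ego speed exceeds safe approach speed",
    ],
    decision="decelerate and yield",
    trajectory_id="traj_0042_a",
)
print(json.dumps(asdict(record), indent=2))
```

Keeping each record tied to exactly one decision is what makes the supervision "decision-centric": the model learns why this trajectory, not a free-form narration of the scene.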
Dexmal (原力灵机) open-sources Dexbotic, a one-stop VLA codebase built on PyTorch
机器之心· 2025-10-22 06:32
Core Insights
- Dexbotic is an open-source vision-language-action (VLA) model toolkit developed by Dexmal, aimed at researchers in the field of embodied intelligence, featuring a modular architecture with three core components: Data, Experiment, and Model [3][7][9].

Group 1: Need for a Unified VLA Development Platform
- VLA models serve as a crucial technology hub connecting perception, cognition, and action, but the field faces challenges such as severely decentralized research, cumbersome development processes, and fairness issues in algorithm comparison [5][7].
- The introduction of Dexbotic addresses these pain points by providing standardized, modular, high-performance research infrastructure, moving the field from "reinventing the wheel" to "collaborative innovation" [7][9].

Group 2: Dexbotic Architecture
- The overall architecture of Dexbotic consists of three main layers: the Data Layer, Model Layer, and Experiment Layer, with the Data Layer optimizing storage and integrating multi-source data [9][11].
- The Model Layer includes the foundational model DexboticVLM, which supports various VLA strategies and allows users to customize new VLA models easily [9][11].
- The Experiment Layer introduces an innovative script mechanism for conducting experiments, enabling users to modify configurations with minimal changes while ensuring system stability [11][12] (a hypothetical sketch of this pattern follows this summary).

Group 3: Key Features
- Dexbotic offers a unified modular VLA framework compatible with mainstream large language models, integrating embodied manipulation and navigation functionalities [13].
- High-performance pre-trained models are available for major VLA algorithms, significantly enhancing performance in various simulation environments and real-world tasks [13].
- The experimental framework is designed for flexibility and extensibility, allowing users to easily modify configurations and switch models or tasks [13][14].

Group 4: Open Source Hardware
- Dexmal has launched its first open-source hardware product, Dexbotic Open Source - W1 (DOS-W1), featuring a fully open design that lowers barriers to use and maintenance [16][17].
- The hardware design includes modular components and ergonomic features to enhance user comfort and data-collection efficiency [17].

Group 5: Future Outlook
- Dexmal plans to expand its offerings with more advanced VLM base models and open-source hardware, integrating simulation-to-real-world transfer learning tools and establishing a community-driven model contribution mechanism [19].
- Collaboration with RoboChallenge aims to create a comprehensive technical loop for development, training, inference, and evaluation, ensuring transparency and fairness in performance validation [20].
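To illustrate the script-style experiment pattern described above, here is a hypothetical sketch using a plain dataclass config: an experiment script overrides only the fields that differ from the defaults, leaving the rest of the system untouched. The class, its fields, and the `run_experiment` entry point are invented for illustration and are not Dexbotic's actual API.

```python
# Hypothetical config-override pattern; NOT Dexbotic's actual API.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ExperimentConfig:
    model: str = "dexbotic_vlm_base"   # hypothetical pretrained VLA backbone
    dataset: str = "libero_90"
    lr: float = 1e-4
    batch_size: int = 64
    action_horizon: int = 8

def run_experiment(cfg: ExperimentConfig) -> None:
    # Stand-in for the training entry point such a framework would provide.
    print(f"training {cfg.model} on {cfg.dataset} "
          f"(lr={cfg.lr}, bs={cfg.batch_size}, horizon={cfg.action_horizon})")

# An "experiment script" states only its deltas from the shared defaults,
# which keeps ablations short and the base system stable.
base = ExperimentConfig()
ablation = replace(base, lr=5e-5, action_horizon=16)

run_experiment(base)
run_experiment(ablation)
```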
Peking University and 灵初 release a major comprehensive survey of embodied VLA! One article that makes VLA's technical routes and future trends clear
机器之心· 2025-07-25 02:03
Core Insights
- The article discusses the rapid advancements in Vision-Language-Action (VLA) models, which are capable of extending intelligence from the digital realm to physical tasks, particularly in robotics [1][9].
- A unified framework for understanding VLA models is proposed, centered on action tokenization; it categorizes eight main types of action tokens and outlines their capabilities and future trends [2][10].

VLA Unified Framework and Action Token Perspective
- VLA models rely on at least one visual or language foundation model to generate action outputs based on visual and language inputs, aiming to execute specific tasks in the physical world [9][11].
- The framework categorizes action tokens into eight types: language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning [10][16] (a small illustrative taxonomy follows this summary).

Action Token Analysis
- **Language Description**: Describes actions in natural language, divided into the sub-task level (language plan) and the atomic action level (language motion) [16][20].
- **Code**: Represents task logic in code form, allowing for efficient communication between humans and robots, but faces challenges related to API dependencies and execution rigidity [22][23].
- **Affordance**: A spatial representation indicating how objects can be interacted with, emphasizing semantic clarity and adaptability [25][26].
- **Trajectory**: Represents continuous spatial states over time, utilizing video data to expand training data sources [29][30].
- **Goal State**: A visual representation of the expected outcome, aiding in action planning and execution [34][35].
- **Latent Representation**: Encodes action-related information through large-scale data pre-training, enhancing training efficiency and generalization [36][37].
- **Raw Action**: Directly executable low-level control commands for robots, showing potential for scalability similar to that of large language models [38][39].
- **Reasoning**: Expresses the thought process behind actions, enhancing model interpretability and decision-making [42][45].

Data Resources in VLA Models
- The article categorizes data resources into a pyramid structure: web data and human videos at the base, synthetic and simulation data in the middle, and real robot data at the top, with each contributing uniquely to model performance and generalization [47][48][49].

Conclusion
- VLA models are positioned as a key pathway to embodied intelligence, with ongoing research focusing on action token design, challenges, and future directions, as well as the practical applications of VLA technology in real-world scenarios [51].
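As a compact reference for the taxonomy above, here is a small illustrative encoding of the eight action-token types the survey lists, with hypothetical example payloads. The enum and the payload values are assumptions for illustration, not code or data from the paper.

```python
# Illustrative taxonomy of the eight action-token types; payloads are invented.
from enum import Enum

class ActionToken(Enum):
    LANGUAGE_DESCRIPTION = "language_description"  # sub-task or atomic motion in words
    CODE = "code"                                  # executable task logic
    AFFORDANCE = "affordance"                      # where/how an object can be engaged
    TRAJECTORY = "trajectory"                      # continuous spatial states over time
    GOAL_STATE = "goal_state"                      # visual image of the desired outcome
    LATENT_REPRESENTATION = "latent"               # pretrained action-relevant embedding
    RAW_ACTION = "raw_action"                      # low-level motor command
    REASONING = "reasoning"                        # thought process behind the action

# Hypothetical examples of the payload each token type could wrap.
examples = {
    ActionToken.LANGUAGE_DESCRIPTION: "move gripper above the mug",
    ActionToken.RAW_ACTION: [0.02, -0.01, 0.0, 0.0, 0.0, 0.1, 1.0],
    ActionToken.TRAJECTORY: [(0.1, 0.2), (0.15, 0.22), (0.2, 0.25)],
}
for token, payload in examples.items():
    print(f"{token.value}: {payload}")
```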