Vision-Language-Action (VLA)
ICLR 2026 | A New "Turing Test": When VLA Enters the Biology Lab
机器之心· 2026-02-19 23:43
Recently, AutoBio, joint work from Prof. Ping Luo's team at HKU MMLab and Prof. Yao Mu's team at Shanghai Jiao Tong University, was accepted to ICLR 2026 with peer-review scores of 8-8-6-6. AutoBio is a robotic simulation system and benchmark platform for the digital biology laboratory. Through this work, we attempt to systematically answer a key question: are today's mainstream Vision-Language-Action (VLA) models already capable of executing experimental protocols in a real biology laboratory?

Paper title: AutoBio: A Simulation and Benchmark for Robotic Automation in Digital Biology Laboratory
Paper link: https://openreview.net/forum?id=UUE6HEtjhu
Code: https://github.com/autobio-bench/AutoBio
Hugging Face: https://huggingface.co/autobio-bench

1. Research Background: Why the Biology Laboratory Poses a Key Challenge
Existing VLA research and benchmarks are mostly confined to household scenarios (e.g., tidying a dining table, folding clothes) and lack coverage of professional ...
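The summary is cut off above, but the benchmark's central question (can a VLA policy execute a lab protocol in closed loop?) can be made concrete with a small sketch. Below is a minimal closed-loop evaluation protocol of the kind such a benchmark implies. Every name here (`DummyLabEnv`, `Observation`, the random policy) is an illustrative stand-in, not the AutoBio API:

```python
import random
from dataclasses import dataclass

@dataclass
class Observation:
    image: bytes       # stand-in for an RGB camera frame
    instruction: str   # e.g. "transfer 10 uL from tube A to well B1"

class DummyLabEnv:
    """Illustrative stand-in for one simulated lab task (not the AutoBio API)."""
    def __init__(self, horizon: int = 50):
        self.horizon = horizon
        self.t = 0

    def reset(self) -> Observation:
        self.t = 0
        return Observation(image=b"", instruction="transfer 10 uL from tube A to well B1")

    def step(self, action):
        # A real simulator would apply the action to the robot and scene;
        # here we only advance time and draw a placeholder outcome.
        self.t += 1
        done = self.t >= self.horizon
        success = done and random.random() < 0.3
        return Observation(image=b"", instruction=""), done, success

def random_policy(obs: Observation) -> list:
    """Stand-in for a VLA model: image + instruction in, 7-DoF action out."""
    return [random.uniform(-1.0, 1.0) for _ in range(7)]

def evaluate(env: DummyLabEnv, policy, episodes: int = 20) -> float:
    """Roll out full episodes and report the per-episode success rate."""
    wins = 0
    for _ in range(episodes):
        obs, done, success = env.reset(), False, False
        while not done:
            obs, done, success = env.step(policy(obs))
        wins += int(success)
    return wins / episodes

print(f"success rate: {evaluate(DummyLabEnv(), random_policy):.2f}")
```

The point of the loop is that success is scored per episode against the task's goal state, which is how a benchmark can separate "plausible motions" from actually completing a protocol step.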
The "Physical AI" Inflection Point at CES: Robotaxis Head for Scale as a Humanoid-Robot Supply Chain Quietly Takes Shape
硬AI· 2026-01-14 15:22
Core Insights
- Deutsche Bank predicts that 2026 will mark the year of large-scale deployment for Robotaxis and humanoid robots, transitioning from testing to commercialization [2][3]
- The report emphasizes the emergence of a new supply chain for humanoid robots, with suppliers shifting focus to achieve mass production [3][5]

Group 1: Humanoid Robot Supply Chain
- The supply chain for humanoid robots is taking shape, with actuators becoming the "muscle" entry point [4]
- Schaeffler aims to be a key supplier of actuators for humanoid robots, showcasing a compact integrated planetary gear actuator at CES [6]
- Hyundai Mobis plans to supply actuators for Boston Dynamics' Atlas, leveraging the automotive supply chain for manufacturing [7]

Group 2: Onboard Chip Landscape
- Nvidia remains the dominant player in onboard processors for humanoid robots thanks to performance and ease of use, with various companies adopting its Jetson Orin or Thor [8][9]
- Tesla and Xpeng are developing their own inference chips, indicating a diversification of the chip landscape [9]

Group 3: Physical AI Transition
- A significant paradigm shift is under way from pre-programmed actions to vision-language-action (VLA) models, enabling robots to reason about and complete tasks [11][12]
- The industry debate has shifted from "simulation vs. reality" to how to efficiently close the loop between the two [14]

Group 4: Commercial Viability of Humanoid Robots
- The report suggests that general-purpose humanoid robots will first be deployed in specific scenarios to prove commercial viability before entering households [18][19]
- Keenon Robotics holds a 40% global market share in service robots and plans to showcase its humanoid robot XMAN-R1 at CES 2026 [20]

Group 5: Cost Reduction and Scalability
- Cost reduction in humanoid robots is driven by rising volume and stronger supplier negotiations, with some companies reporting costs dropping from $200,000 to $100,000 [22][24]
- Mobileye's Mentee project indicates that at an annual production of 50,000 units, manufacturing costs could fall to $20,000 per unit, and potentially to $10,000 at 100,000 units [24]

Group 6: Robotaxi Commercialization Momentum
- Deutsche Bank believes 2026 will see stronger commercialization momentum for Robotaxis, with Tesla planning to launch its Robotaxi in 2025 [26][27]
- Waymo has provided over 10 million paid rides since inception and plans to expand its service to international markets [27][28]

Group 7: Nvidia's Alpamayo Platform
- Nvidia introduced the Alpamayo platform for autonomous driving, aiming to lower the barrier for automakers to deploy advanced capabilities [30][31]
- Despite the potential advantages, concerns remain about whether Nvidia can match Tesla's data collection in covering real-world edge cases [31][32]

Group 8: Industry Innovations
- Aptiv showcased an end-to-end AI-driven ADAS platform, emphasizing cross-industry applications and real-time data sharing [33]
- Visteon launched a SmartCore HPC domain controller with 700 TOPS of compute, enabling multiple sensors to be consolidated into a single system [35]
The "Physical AI" Inflection Point at CES: Robotaxis Head for Scale as a Humanoid-Robot Supply Chain Quietly Takes Shape
Hua Er Jie Jian Wen· 2026-01-14 04:09
Core Insights
- The report from Deutsche Bank predicts that 2026 will mark a significant transition for AI in the physical world, particularly in the fields of autonomous vehicles and humanoid robots, moving from testing to scaling [1]

Group 1: Humanoid Robots
- The supply chain for humanoid robots is forming, with suppliers transitioning to provide integrated solutions and core components [1]
- Schaeffler aims to be a key player in humanoid robotics by offering integrated planetary gear actuators, showcasing a compact unit with a torque range of 60–250 Nm [4]
- Companies like NEURA and Hyundai Mobis are collaborating to leverage automotive supply chains for humanoid robot manufacturing [4]

Group 2: Autonomous Vehicles
- The deployment of Robotaxis is gaining momentum, with significant commercial activity expected in 2026, particularly with Tesla's planned launch [10]
- Waymo has provided over 10 million paid rides and is expanding its services to international markets, indicating a shift from concept to operational data [15]
- Mobileye plans to launch L4 Robotaxi services in Los Angeles this year, showcasing the industry's movement towards real-world applications [15]

Group 3: Technology and Innovation
- Nvidia remains the dominant player in onboard processors for humanoid robots, with companies like Boston Dynamics utilizing its technology for advanced capabilities [3]
- The shift from scripted actions to vision-language-action (VLA) models allows robots to reason and adapt to new environments [3]
- The competition in training methods is evolving, focusing on efficient closed-loop systems that integrate real-world data with simulations [7]

Group 4: Cost Reduction and Scalability
- The cost reduction formula for humanoid robots is driven by increased production volume and improved supplier negotiations [9]
- Companies are targeting significant cost reductions, with projections indicating that manufacturing costs could drop from $200,000 to $50,000 as production scales [10]
- Visteon is introducing modular solutions to help automakers integrate AI capabilities without overhauling existing architectures, enhancing cost competitiveness [13]

Group 5: Market Dynamics
- CES 2026 highlighted a shift in focus from feasibility to scalability and cost reduction in both autonomous vehicles and humanoid robots [14]
- The industry's future will depend on tracking supply chain integration, production capacity, and unit cost curves rather than just innovative demonstrations [14]
New from Peking University! MobileVLA-R1: Beyond Robot Arms, How Capable Are Mobile Robots at VLA?
具身智能之心· 2025-11-30 03:03
Core Insights
- The article introduces MobileVLA-R1, a new framework for quadruped robots that bridges the gap between high-level semantic reasoning and low-level action control, addressing the stability and interpretability limitations of existing methods [1][2][21]

Group 1: Why the VLA Framework Needs Rebuilding
- Current quadruped robots face two main challenges: a semantic-control gap that makes command execution unstable, and a lack of traceable reasoning that complicates error diagnosis [2]
- MobileVLA-R1's breakthrough lies in decoupling reasoning from action execution, letting the robot "think clearly" before "acting accurately," which improves both interpretability and control robustness (a minimal sketch of this decoupling follows after this summary) [2][23]

Group 2: How MobileVLA-R1 Is Implemented
- MobileVLA-R1 combines a structured CoT dataset, a two-stage training paradigm, and multi-modal perception fusion to achieve coherent reasoning, stable control, and strong generalization [4][6]
- The structured CoT dataset includes 18K episode-level samples, 78K step-level samples, and 38K navigation-specific samples, filling the gap in reasoning supervision from instruction to action [4][5]

Group 3: Performance Evaluation
- In navigation tasks, MobileVLA-R1 achieved success rates of 68.3% and 71.5% on the R2R-CE and RxR-CE datasets respectively, outperforming existing methods by an average of 5% [10]
- For quadruped control tasks, it achieved an average success rate of 73% across six locomotion and manipulation tasks, significantly surpassing baseline models [12][13]

Group 4: Real-World Deployment
- MobileVLA-R1 was tested on the Unitree Go2 quadruped robot in various environments, demonstrating robust adaptation to complex scenarios with a success rate of 86%–91% on complex instructions [14][18]
- Integrating depth and point-cloud encoders improved navigation success rates by 5.8%, highlighting the importance of 3D spatial information for scene understanding [19][20]

Group 5: Key Conclusions and Future Directions
- MobileVLA-R1 innovatively integrates chain-of-thought reasoning with reinforcement learning, resolving the field's either-or trade-off between interpretability and execution stability [21][23]
- Future directions include expanding the action space for more precise tasks, reducing reasoning latency through model optimization, and enhancing self-supervised learning to decrease reliance on labeled data [23]
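As promised in Group 1, here is a minimal sketch of a reason-then-act loop. The planner and controller are hand-written stand-ins (nothing here is the paper's model or API); the point is that an explicit subgoal trace makes a failure attributable to a specific step:

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    instruction: str
    subgoals: list = field(default_factory=list)

def plan_with_cot(instruction: str) -> Plan:
    """Stand-in for the reasoning stage: instruction -> ordered subgoal trace."""
    # A real VLM generates this chain of thought; keeping it explicit is what
    # makes errors diagnosable (which subgoal failed?).
    return Plan(instruction, subgoals=[
        "locate the door", "walk to the door", "stop at the threshold",
    ])

def execute_subgoal(subgoal: str) -> bool:
    """Stand-in for the low-level locomotion controller."""
    print(f"executing: {subgoal}")
    return True  # a real controller reports per-subgoal success/failure

def run(instruction: str) -> bool:
    plan = plan_with_cot(instruction)
    for sg in plan.subgoals:
        if not execute_subgoal(sg):
            print(f"failed at subgoal: {sg!r}")  # traceable failure point
            return False
    return True

run("go through the door ahead")
```

The design choice this illustrates is the decoupling itself: because the plan exists as data before any motor command is issued, the two halves can be trained and debugged separately.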
Red-Hot VLA: This One Survey Is All You Need
36Kr· 2025-10-31 08:22
Core Insights
- The article provides a comprehensive overview of the emerging field of Vision-Language-Action (VLA), highlighting its rapid growth and significance in AI and robotics [1][5]

Summary by Sections

VLA Overview
- VLA models have seen a dramatic increase in submissions, rising from single digits to 164, marking an 18-fold growth [5]
- A model qualifies as VLA if it uses a backbone pre-trained on large-scale visual-language data, emphasizing capabilities in language understanding, visual generalization, and task transfer [5][6]

Key Trends in VLA
- **Trend 1: Efficient Architecture Paradigm** Discrete diffusion models are emerging as a new paradigm, allowing parallel generation of action sequences, enhancing efficiency and integrating reasoning with actions [7][10]
- **Trend 2: Embodied Chain-of-Thought (ECoT)** ECoT emphasizes generating intermediate reasoning steps before actions, improving planning and interpretability, although it relies heavily on high-quality annotated data [11][12]
- **Trend 3: Action Tokenizer** The action tokenizer converts continuous robot actions into discrete tokens that VLMs can understand, bridging the gap between the robot's actions and the VLM's processing (see the tokenizer sketch after this summary) [14][16]
- **Trend 4: Reinforcement Learning (RL)** RL is reintroduced to fine-tune VLA policies, addressing the limitations of imitation learning in extreme scenarios, with notable successes in recent studies [17][18]
- **Trend 5: Efficiency Optimization** Efforts are under way to reduce the hardware requirements of VLA models, making the field more accessible to smaller research labs [19]
- **Trend 6: Video Prediction for Physical Intuition** Video generation models provide an inherent understanding of temporal dynamics and physical laws, enhancing robot control capabilities [20][23]
- **Trend 7: Realistic Evaluation Benchmarks** New evaluation frameworks are being developed to overcome the limitations of existing benchmarks, focusing on meaningful generalization capabilities [24][26]

Challenges and Future Directions
- The article highlights the "performance ceiling" issue in mainstream simulation evaluations, where high scores do not necessarily translate to real-world capabilities [30]
- Two areas needing more attention are data quality and in-context learning, which could prove pivotal for advancing VLA research [31]
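As referenced under Trend 3, here is a minimal sketch of action tokenization in its simplest form, uniform binning: each continuous action dimension is mapped to one of K discrete token ids that a language-model backbone can emit. Real VLA tokenizers are typically more sophisticated (learned codebooks, compression), so treat this only as the core idea:

```python
def encode(action, low=-1.0, high=1.0, bins=256):
    """Continuous action vector -> integer token ids via uniform binning."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)                        # clamp into range
        tokens.append(round((a - low) / (high - low) * (bins - 1)))
    return tokens

def decode(tokens, low=-1.0, high=1.0, bins=256):
    """Token ids -> continuous values (inverse of encode, up to bin width)."""
    return [low + t * (high - low) / (bins - 1) for t in tokens]

action = [0.12, -0.73, 0.50]   # e.g. a 3-DoF end-effector delta
tokens = encode(action)
print(tokens)                  # the ids a VLM backbone would predict
print(decode(tokens))          # reconstruction error <= half a bin width
```

This is exactly the bridge the trend describes: once actions are tokens, the VLM's existing next-token machinery can produce them alongside language, at the cost of quantization error set by the bin count.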
Whether It's VLA or WM World Models, Both Need a World Engine
自动驾驶之心· 2025-09-13 16:04
Core Viewpoint
- The article discusses the current state and future prospects of end-to-end autonomous driving, emphasizing the concept of a "World Engine" to address challenges in the field [2][21]

Definition of End-to-End Autonomous Driving
- End-to-end autonomous driving is defined as "learning a single model that directly maps raw sensor inputs to driving scenarios and outputs control commands," replacing traditional modular pipelines with a unified function (see the sketch after this summary) [3][6]

Development Roadmap of End-to-End Autonomous Driving
- The evolution of end-to-end autonomous driving has progressed from simple black-and-white image inputs over 20 years ago to more complex methods, including conditional imitation learning and modular approaches [8][10]

Current State of End-to-End Autonomous Driving
- The industry is currently in the "1.5 generation" phase, focusing on foundational models and addressing long-tail problems, with two main branches: the World Model (WM) and Vision-Language-Action (VLA) [10][11]

Challenges in Real-World Deployment
- Collecting data for all scenarios, especially extreme cases, remains a significant challenge for achieving Level 4 (L4) or Level 5 (L5) autonomous driving [17][18]

Concept of the "World Engine"
- The "World Engine" concept aims to learn from human expert driving and generate extreme scenarios for training, which can significantly reduce the costs associated with large fleets [21][24]

Data and Algorithm Engines
- The "World Engine" consists of a Data Engine for generating extreme scenarios and an Algorithm Engine, still under development, to improve and train end-to-end algorithms [24][25]
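As referenced in the definition above, "a single model from raw sensors to control" reduces, at the interface level, to one learned function. A minimal PyTorch sketch of that interface follows; the tiny untrained network is purely illustrative, not any production or paper architecture:

```python
import torch
import torch.nn as nn

class EndToEndDriver(nn.Module):
    """One function from raw pixels to control, with no hand-built
    perception/planning modules in between."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(             # raw camera pixels in ...
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 2)               # ... [steering, throttle] out

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.head(self.encoder(image)))

frame = torch.rand(1, 3, 224, 224)                 # one camera frame
steer, throttle = EndToEndDriver()(frame)[0]       # untrained, so values are random
print(float(steer), float(throttle))
```

The modular-pipeline alternative would interpose detection, tracking, prediction, and planning stages between `frame` and the control pair; the end-to-end claim is precisely that those stages become internal to the learned function.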
The World's First Survey on VLA for Autonomous Driving Is Out: A Complete Breakdown of VLA Driving Models
具身智能之心· 2025-07-03 08:22
Core Insights
- The article discusses the integration of vision, language, and action in autonomous driving through the Vision-Language-Action (VLA) model, highlighting its potential to enhance the capabilities of self-driving vehicles [1][3]

Evolution of Autonomous Driving Paradigms
- The development of autonomous driving technology has transitioned from modular to integrated approaches, categorized into three core paradigms:
  1. End-to-End Autonomous Driving (AD), which directly maps sensor inputs to driving actions but lacks interpretability [3]
  2. Vision-Language Models (VLMs for AD), which enhance system interpretability and generalization but do not directly control vehicle actions [3]
  3. Vision-Language-Action Models (VLA for AD), which unify perception, reasoning, and action execution, enabling vehicles to understand complex instructions and make autonomous decisions [3][4]

VLA4AD Architecture
- A typical VLA4AD model consists of three parts (input, processing, and output), integrating environmental perception, high-level instruction understanding, and vehicle control (a dataflow sketch follows after this summary) [5]
- The architecture includes multimodal inputs, core modules for processing visual and language data, and an action decoder for generating control outputs [6][7][9]

Development Stages of VLA Models
- The evolution of VLA models is divided into four stages:
  1. Language models as explainers, enhancing interpretability without direct control [16]
  2. Modular VLA models, where language actively contributes to planning decisions [19]
  3. Unified end-to-end VLA models that map sensor inputs to control signals in a single forward pass [20]
  4. Reasoning-augmented VLA models that incorporate long-term reasoning and memory into decision-making [21]

Representative VLA4AD Models
- The article compares various VLA4AD models in detail, covering their inputs, outputs, datasets, and core contributions [23]. Examples include:
  - DriveGPT-4, which utilizes a single image input to generate high-level control labels [22]
  - ADriver-I, which integrates vision-action tokens for control [22]
  - RAG-Driver, which employs retrieval-augmented control mechanisms [22]

Datasets and Benchmarks
- High-quality, diverse datasets are crucial for VLA4AD development; notable datasets include BDD100K, nuScenes, and Bench2Drive, which provide rich annotations for training and evaluation [25][26][29]

Challenges and Future Directions
- The article outlines six major challenges facing VLA4AD, including robustness, real-time performance, data bottlenecks, and multimodal alignment [31][32]
- Future directions include foundation-scale driving models, neuro-symbolic safety kernels, fleet-scale continual learning, a standardized traffic language, and cross-modal social intelligence [36][37]
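To close, here is the dataflow sketch promised in the VLA4AD Architecture section: multimodal input, a vision-language core, and an action decoder, composed in one forward pass. Every class is a named placeholder for the survey's three blocks (an assumed structure for illustration, not any specific model's API):

```python
from dataclasses import dataclass

@dataclass
class DrivingInput:
    camera_frames: list    # multi-view images (multimodal input block)
    instruction: str       # e.g. "turn left at the next intersection"

@dataclass
class Control:
    steering: float
    throttle: float

class VisionLanguageCore:
    """Placeholder for the VLM that fuses frames with the instruction."""
    def fuse(self, x: DrivingInput) -> list:
        return [0.0] * 8   # stand-in latent embedding

class ActionDecoder:
    """Placeholder for the decoder mapping latents to control signals."""
    def decode(self, latent: list) -> Control:
        return Control(steering=0.0, throttle=0.2)

def vla4ad_step(x: DrivingInput) -> Control:
    latent = VisionLanguageCore().fuse(x)   # perception + instruction grounding
    return ActionDecoder().decode(latent)   # single pass from input to control

print(vla4ad_step(DrivingInput(camera_frames=[b""], instruction="keep lane")))
```

The survey's four development stages can be read against this skeleton: stage 1 keeps language outside the loop as an explainer, while stages 3 and 4 collapse the whole pipeline into one trained forward pass like `vla4ad_step`.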