Vision-Language-Action Models (VLA)
XPeng VLA 2.0 Released: Intelligent Driving Demonstrates Stronger Generalization (Connected Vehicle Series In-Depth Report No. 39)
Shenwan Hongyuan Securities· 2025-11-10 12:57
The Future Is Here! The Era of AI Aircraft Will Replace Most Manual Labor
深思SenseAI· 2025-11-06 04:46
Core Viewpoint
- Infravision is revolutionizing the construction of power transmission lines through its innovative drone technology, which offers a safer, more efficient, and more cost-effective solution than traditional methods [1][4].

Group 1: Advantages of Infravision's Technology
- Drone-based line construction avoids the safety hazards of high-altitude work and helicopter flights, and is not limited by terrain [5].
- The system is quieter and has less impact on the environment and land ownership, minimizing disruption to landowners [6].
- By eliminating the need for large helicopters and extensive manpower, the technology significantly improves efficiency, reduces costs, and shortens project timelines [6].
- The integrated system combines drone automation, precise navigation, and specialized aerial towing equipment, enabling long-distance high-voltage line installation at industrial scale [6].

Group 2: Strategic Execution and Market Positioning
- Infravision's rapid rise is attributed to its clear strategic focus on high-value niche markets, particularly power transmission line construction, where the industry's pain points are significant [8].
- The company initially targeted the Australian market to validate its technology and establish model projects, effectively leveraging limited resources to meet important customer demands [8].
- Infravision emphasizes end-to-end solutions rather than merely selling products, building long-term partnerships through equipment leasing and operational services [9].
- Following its success in Australia, the company is expanding into the North American market, targeting major clients such as PG&E [10].
- The company is rapidly scaling its team to meet growing project demand, with plans to grow from 70 to 150-200 employees by the end of 2025 [10].

Group 3: Future Development and Industry Trends
- The emerging concept of "aerial embodied intelligence" refers to autonomous flying robots capable of perception, decision-making, and physical interaction [11].
- Drone swarm control systems allow multiple drones to coordinate and complete tasks efficiently, expanding operational capabilities across sectors [12].
- Infravision and similar companies are not just offering advanced drones; they are creating new operational paradigms that decompose dangerous, repetitive tasks into standardized, machine-executable operations [20].
NVIDIA Drops a 41-Page Autonomous-Driving VLA Framework: Chain-of-Causation Reasoning and a Vehicle-Deployable Algorithm, Alpamayo-R1
自动驾驶之心· 2025-11-05 00:04
Editor | 自动驾驶之心
Paper authors | Yulong Cao et al.

NVIDIA had not published autonomous-driving work for quite a while, and yesterday it dropped a big one: a 41-page autonomous-driving VLA framework, Alpamayo-R1. The paper argues that imitation-learning-based end-to-end architectures perform poorly in long-tail scenarios because supervision signals are sparse and causal reasoning is weak. Moreover, existing autonomous-driving VLA frameworks cannot explicitly constrain the link between the chain of thought and the resulting decision, which both invites hallucination and offers no guarantee that the causal understanding is correct. An example of such an error: the left-turn signal is red, but because the through signal is green, the model allows the vehicle to turn left.

To address these problems, Alpamayo-R1 fuses Chain-of-Causation reasoning with trajectory planning to improve decision-making in complex driving scenarios. The method comprises three core innovations. Results show that, compared with a trajectory-only baseline, AR1 improves planning accuracy by up to 12% in hard scenarios; in closed-loop simulation, the lane-departure rate drops by 35% and the near-collision rate by 25%. After reinforcement-learning post-training (RL po ...
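The constraint the paper targets, that a chain-of-causation rationale must actually license the planned maneuver, can be illustrated with a toy consistency check. The sketch below is not Alpamayo-R1's method; it is a minimal, rule-based stand-in with invented names, rejecting exactly the failure mode quoted above (justifying a left turn with the through signal's green phase).

```python
from dataclasses import dataclass

@dataclass
class SignalState:
    """Per-movement traffic-signal phases at an intersection (hypothetical)."""
    left_turn: str   # "red" | "green"
    through: str     # "red" | "green"

@dataclass
class Decision:
    maneuver: str    # e.g. "turn_left", "go_straight"
    rationale: str   # free-text chain-of-causation produced by the model

def causally_consistent(decision: Decision, signals: SignalState) -> bool:
    """Check that the governing signal phase actually licenses the maneuver.

    The phase for a different movement is causally irrelevant, even if
    the model's rationale cites it.
    """
    if decision.maneuver == "turn_left":
        # Only the left-turn phase can license a left turn.
        return signals.left_turn == "green"
    if decision.maneuver == "go_straight":
        return signals.through == "green"
    return False

# The failure mode from the article: a green through phase used to
# justify a left turn while the left-turn phase is red.
signals = SignalState(left_turn="red", through="green")
bad = Decision("turn_left", "through signal is green, so turning is allowed")
assert not causally_consistent(bad, signals)  # correctly rejected
```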
Dexmal (原力灵机) Open-Sources Dexbotic, a One-Stop PyTorch-Based VLA Codebase
机器之心· 2025-10-22 06:32
Core Insights
- Dexbotic is an open-source vision-language-action (VLA) model toolkit developed by Dexmal, aimed at researchers in embodied intelligence and built around a modular architecture with three core components: Data, Experiment, and Model [3][7][9].

Group 1: Need for a Unified VLA Development Platform
- VLA models are a crucial technology hub connecting perception, cognition, and action, but the field suffers from fragmented research, cumbersome development processes, and fairness issues in algorithm comparison [5][7].
- Dexbotic addresses these pain points by providing standardized, modular, high-performance research infrastructure, moving the field from "reinventing the wheel" to "collaborative innovation" [7][9].

Group 2: Dexbotic Architecture
- The overall architecture consists of three layers: the Data Layer, Model Layer, and Experiment Layer; the Data Layer optimizes storage and integrates multi-source data [9][11].
- The Model Layer includes the foundational model DexboticVLM, which supports various VLA policies and lets users customize new VLA models easily [9][11].
- The Experiment Layer introduces a script mechanism for running experiments, letting users modify configurations with minimal changes while keeping the system stable [11][12]; a sketch of this pattern appears after this summary.

Group 3: Key Features
- Dexbotic offers a unified, modular VLA framework compatible with mainstream large language models, integrating embodied manipulation and navigation capabilities [13].
- High-performance pre-trained models are provided for major VLA algorithms, significantly improving performance across simulation environments and real-world tasks [13].
- The experimental framework is designed for flexibility and extensibility, making it easy to modify configurations and switch models or tasks [13][14].

Group 4: Open-Source Hardware
- Dexmal has launched its first open-source hardware product, Dexbotic Open Source - W1 (DOS-W1), with a fully open design that lowers barriers to use and maintenance [16][17].
- The hardware design includes modular components and ergonomic features to improve user comfort and data-collection efficiency [17].

Group 5: Future Outlook
- Dexmal plans to expand its offerings with more advanced VLM base models and open-source hardware, integrating sim-to-real transfer learning tools and establishing a community-driven model contribution mechanism [19].
- Collaboration with RoboChallenge aims to create a complete technical loop covering development, training, inference, and evaluation, ensuring transparency and fairness in performance validation [20].
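The "script mechanism" described above suggests a pattern in which each experiment is a small script that inherits a base configuration and overrides only the fields that change. The sketch below is a generic illustration of that pattern, not Dexbotic's actual API: `BaseExpConfig`, `train`, and every field name are hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class BaseExpConfig:
    """Hypothetical base experiment config (not Dexbotic's real interface)."""
    model: str = "DexboticVLM"
    policy: str = "policy_a"      # which VLA policy head to train
    dataset: str = "sim_pick_place"
    lr: float = 1e-4
    batch_size: int = 64

def train(cfg: BaseExpConfig) -> None:
    # Stand-in for the real training entry point.
    print(f"training {cfg.model}/{cfg.policy} on {cfg.dataset} "
          f"(lr={cfg.lr}, bs={cfg.batch_size})")

# An "experiment script" states only its diff from the base config,
# so switching policies or tasks is a one-line change.
if __name__ == "__main__":
    cfg = replace(BaseExpConfig(), policy="policy_b", lr=5e-5)
    train(cfg)
```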
Peking University and 灵初 (Lingchu) Release a Comprehensive Survey of Embodied VLA: VLA Technical Routes and Future Trends in One Article
机器之心· 2025-07-25 02:03
Core Insights
- The article discusses the rapid advances in Vision-Language-Action (VLA) models, which extend intelligence from the digital realm to physical tasks, particularly in robotics [1][9].
- A unified framework for understanding VLA models is proposed, centered on action tokenization; it categorizes eight main types of action tokens and outlines their capabilities and future trends [2][10].

VLA Unified Framework and the Action Token Perspective
- VLA models rely on at least one visual or language foundation model to generate action outputs from visual and language inputs, with the aim of executing specific tasks in the physical world [9][11].
- The framework categorizes action tokens into eight types: language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning [10][16].

Action Token Analysis
- **Language Description**: describes actions in natural language, divided into the sub-task level (language plan) and the atomic-action level (language motion) [16][20].
- **Code**: represents task logic in code form, allowing efficient communication between humans and robots, but faces challenges from API dependencies and execution rigidity [22][23].
- **Affordance**: a spatial representation indicating how objects can be interacted with, emphasizing semantic clarity and adaptability [25][26].
- **Trajectory**: represents continuous spatial states over time, leveraging video data to broaden training-data sources [29][30].
- **Goal State**: a visual representation of the expected outcome, aiding action planning and execution [34][35].
- **Latent Representation**: encodes action-related information through large-scale data pre-training, enhancing training efficiency and generalization [36][37].
- **Raw Action**: directly executable low-level control commands for robots, showing scaling potential similar to large language models [38][39].
- **Reasoning**: expresses the thought process behind actions, enhancing model interpretability and decision-making [42][45].

(A toy encoding of this eight-way taxonomy is sketched after this summary.)

Data Resources in VLA Models
- The survey organizes data resources as a pyramid: web data and human videos at the base, synthetic and simulation data in the middle, and real robot data at the top, with each layer contributing differently to model performance and generalization [47][48][49].

Conclusion
- VLA models are positioned as a key pathway to embodied intelligence, with ongoing research focusing on action-token design, challenges, and future directions, as well as practical applications of VLA technology in real-world scenarios [51].
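To make the taxonomy concrete, here is a minimal sketch that encodes the survey's eight action-token types as a Python enum, with one illustrative payload per type. The enum and all payloads are inventions of this summary for illustration, not an interface defined by the survey.

```python
from enum import Enum, auto

class ActionToken(Enum):
    """The survey's eight action-token types (illustrative encoding)."""
    LANGUAGE_DESCRIPTION = auto()   # "pick up the red cup" (plan / motion level)
    CODE = auto()                   # robot.grasp(obj="red_cup")
    AFFORDANCE = auto()             # a grasp-point heatmap over the image
    TRAJECTORY = auto()             # a sequence of end-effector waypoints
    GOAL_STATE = auto()             # an image of the desired final scene
    LATENT_REPRESENTATION = auto()  # a learned embedding from video pre-training
    RAW_ACTION = auto()             # joint commands, e.g. [0.1, -0.3, ...]
    REASONING = auto()              # "the cup is fragile, so grasp gently"

# Example: one manipulation step expressed at three abstraction levels.
step = {
    ActionToken.LANGUAGE_DESCRIPTION: "pick up the red cup",
    ActionToken.TRAJECTORY: [(0.30, 0.10, 0.25), (0.30, 0.10, 0.05)],
    ActionToken.RAW_ACTION: [0.02, -0.01, 0.00, 0.00, 0.00, 0.01, 1.0],
}
for token_type, payload in step.items():
    print(token_type.name, "->", payload)
```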
80,000 Clips! Tsinghua Open-Sources a VLA Dataset for Extreme Autonomous-Driving Scenarios, Improving Safety by 35%
自动驾驶之心· 2025-07-22 12:46
Core Viewpoint
- The article presents the Impromptu VLA dataset, developed to address data scarcity in unstructured driving environments for autonomous-driving systems, and highlights its potential to enhance vision-language-action models in complex scenarios [4][29].

Dataset Overview
- The dataset consists of over 80,000 meticulously constructed video clips, extracted from more than 2 million original clips across eight diverse open-source datasets [5][29].
- It focuses on four key unstructured challenges: boundary-ambiguous roads, temporary traffic-rule changes, unconventional dynamic obstacles, and complex road conditions [12][13].

Methodology
- Dataset construction followed a multi-step pipeline of data collection, scene classification, and multi-task annotation generation, using advanced vision-language models (VLMs) for scene understanding [10][17].
- A rigorous manual verification process ensured high-quality annotations, with strong F1 scores across categories confirming the reliability of the VLM-based annotation process [18].

Experimental Validation
- Comprehensive experiments validated the dataset's effectiveness, with significant gains on mainstream autonomous-driving benchmarks: in the closed-loop NeuroNCAP test, the average score improved from 1.77 to 2.15, and the collision rate fell from 72.5% to 65.5% [6][21].
- In open-loop trajectory prediction, models trained with Impromptu VLA achieved L2 errors as low as 0.30 m, competitive with leading methods that rely on larger proprietary datasets [24]; the L2 metric is sketched after this summary.

Conclusion
- Impromptu VLA serves as a critical resource for developing more robust, adaptive autonomous-driving systems capable of handling complex real-world scenarios, with demonstrated value for perception, prediction, and planning in unstructured environments [29].
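In common usage on open-loop planning benchmarks, the "L2 error" cited above is the average Euclidean distance between predicted and ground-truth trajectory waypoints over the prediction horizon; the paper's exact protocol may differ. A minimal sketch, with the function name and array shapes chosen for illustration:

```python
import numpy as np

def avg_l2_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average L2 error between predicted and ground-truth waypoints.

    pred, gt: arrays of shape (T, 2) holding (x, y) positions in meters
    at T future timestamps (e.g. 1 s, 2 s, 3 s ahead).
    """
    assert pred.shape == gt.shape
    per_step = np.linalg.norm(pred - gt, axis=1)  # distance at each timestamp
    return float(per_step.mean())

# Toy example: a prediction that drifts slightly from the ground truth.
gt = np.array([[0.0, 5.0], [0.0, 10.0], [0.0, 15.0]])
pred = np.array([[0.1, 5.2], [0.2, 10.3], [0.4, 15.5]])
print(f"avg L2 error: {avg_l2_error(pred, gt):.2f} m")  # ~0.41 m
```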