World's First Survey of VLA Models for Autonomous Driving Released: A Comprehensive Breakdown of VLA Self-Driving Models
具身智能之心· 2025-07-03 08:22
Today we share the latest work from research teams at McGill University, Tsinghua University, Xiaomi Corporation, and the University of Wisconsin-Madison: a comprehensive survey of vision-language-action models for autonomous driving. (Article source: 自动驾驶之心; author: Sicong Jiang et al.) When the three capabilities of vision (Vision), language (Language), and action (Action) are fused in a single model, where is the future of autonomous driving headed? The team recently published the world's first comprehensive survey of Vision-Language-Action (VLA) models for the autonomous driving domain. The survey, titled "A Survey on Vision-Languag ...
Reshaping Embodied Navigation Policies! RSRNav: Image-Goal Navigation via Spatial Relationship Reasoning
具身智能之心· 2025-07-02 10:18
Core Viewpoint
- The article presents RSRNav, a robust and efficient image-goal navigation method that improves navigation performance by reasoning about the spatial relationships between the goal and the current observation, addressing existing challenges in navigation efficiency and sensitivity to viewpoint inconsistencies [5][20].

Research Background
- Image-goal navigation (ImageNav) is a critical area in embodied intelligence, with applications in home robotics, augmented reality systems, and assistance for visually impaired individuals [5].
- Existing ImageNav methods fall into modular and end-to-end approaches, each with its own strengths and weaknesses in navigation efficiency and robustness [5].

Methodology
- RSRNav employs a simple ResNet-9 network without pre-training to encode the goal and current images into feature vectors [8].
- At the core of RSRNav is a perception-relation-action navigation policy, in which spatial relationships are inferred from correlations between the features extracted from the two images (a rough correlation sketch follows this summary) [11][12].
- The method progressively enriches the correlation computation, culminating in a direction-aware correlation that supports efficient navigation and precise heading adjustments [11].

Experimental Results
- In the "user-matched goal" setting, RSRNav achieved a Success Rate (SR) of 83.2% and a Success weighted by Path Length (SPL) of 56.6%, outperforming other methods [20].
- RSRNav demonstrated superior cross-domain generalization on the MP3D and HM3D datasets, indicating strong robustness to viewpoint inconsistencies and good generalization to new environments [20].

Ablation Studies
- RSRNav's performance improves substantially with richer correlation information: on the Gibson dataset, SPL rises from 16.1% with "minimal correlation" to 61.2% with "direction-aware correlation" [22].
- The analysis confirms that both cross-correlation and fine-grained correlation contribute to the gains, underscoring the importance of rich correlation information for navigation [22].

Conclusion and Future Work
- RSRNav significantly improves the efficiency and robustness of image-goal navigation by reasoning about spatial relationships, achieving strong performance across multiple benchmark datasets [23].
- Future work will focus on applying RSRNav to real-world navigation scenarios and bridging the gap between simulated and real-world data [23].
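To make the correlation-based relation reasoning above concrete, here is a minimal sketch that computes a dense cross-correlation volume between goal and current-observation feature maps from a small untrained encoder and feeds it to a toy action head. All module names, shapes, and the policy head are illustrative assumptions, not RSRNav's released implementation.

```python
# Hypothetical sketch of correlation-based spatial relation reasoning in
# the spirit of RSRNav; names and shapes are assumptions, not the
# authors' code.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the ResNet-9-style image encoder (not pre-trained)."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):  # (B, 3, H, W) -> (B, C, H/8, W/8)
        return self.net(x)

def cross_correlation(goal_feat, cur_feat):
    """Dense cross-correlation between goal and current feature maps.

    Returns a (B, Hg*Wg, Hc, Wc) volume: each channel is the response of
    one goal-feature location against every current-observation location.
    """
    b, c, hg, wg = goal_feat.shape
    g = goal_feat.flatten(2).transpose(1, 2)      # (B, Hg*Wg, C)
    cur = cur_feat.flatten(2)                     # (B, C, Hc*Wc)
    corr = torch.bmm(g, cur) / c ** 0.5           # (B, Hg*Wg, Hc*Wc)
    return corr.view(b, hg * wg, cur_feat.shape[2], cur_feat.shape[3])

# Usage: feed the correlation volume to a small policy head that outputs
# a discrete navigation action (forward / turn-left / turn-right / stop).
enc = TinyEncoder()
goal = torch.rand(1, 3, 128, 128)
obs = torch.rand(1, 3, 128, 128)
volume = cross_correlation(enc(goal), enc(obs))   # (1, 256, 16, 16)
policy = nn.Sequential(nn.Flatten(), nn.LazyLinear(4))
action_logits = policy(volume)
```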
RoboScape: A Physics-Informed Embodied World Model with a 68.3% Gain in Action Controllability
具身智能之心· 2025-07-02 10:18
Author: Yu Shang et al. | Editor: 具身智能之心 (This article is shared for academic purposes only; contact us for removal in case of infringement.)

Research Background and Core Problem
In embodied intelligence, world models serve as powerful simulators: they can generate realistic robot videos and alleviate data scarcity. But existing models are markedly limited in physical awareness. In contact-rich robotic scenarios in particular, the lack of modeling of 3D geometry and motion dynamics means generated videos often exhibit unrealistic object deformation or motion discontinuities.

The root cause is that existing models over-rely on fitting visual tokens and lack awareness of physical knowledge. Prior attempts to integrate physical knowledge fall into three categories: physics-prior regularization (confined to narrow domains such as human motion or rigid-body dynamics), knowledge distillation from physics simulators (cascaded pipelines with heavy computation), and material-field modeling (limited to object-level modeling and hard to apply to scene-level generation). How to integrate physical knowledge within a unified, efficient framework is therefore the core open problem.

Core Method
Problem definition: focusing on robot-manipulation scenarios, the goal is to learn an embodied world model $\mathcal{W}$ as a dynamics function that predicts the next visual observation $o_{t+1}$ from past observations $o_{1:t}$ and robot actions $a_{1:t}$:

$$o_{t+1} = \mathcal{W}(o_{1:t}, a_{1:t})$$
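To make the dynamics-function formulation concrete, below is a minimal sketch of the world-model interface $o_{t+1} = \mathcal{W}(o_{1:t}, a_{1:t})$. The encoder, recurrent backbone, and dimensions are placeholder assumptions for illustration; RoboScape's actual architecture is more involved.

```python
# Minimal sketch of an embodied world model W that predicts the next
# observation from past observations and actions. All architectural
# choices here (projections, GRU, sizes) are assumptions for
# illustration, not RoboScape's implementation.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    def __init__(self, obs_dim=256, act_dim=7, hidden=512):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, hidden)
        self.act_proj = nn.Linear(act_dim, hidden)
        self.dynamics = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, obs_dim)  # predicts o_{t+1} features

    def forward(self, obs_seq, act_seq):
        """obs_seq: (B, T, obs_dim) encoded frames; act_seq: (B, T, act_dim).

        Returns predicted features of the next observation, (B, obs_dim).
        """
        x = self.obs_proj(obs_seq) + self.act_proj(act_seq)
        h, _ = self.dynamics(x)
        return self.decoder(h[:, -1])  # o_{t+1} = W(o_{1:t}, a_{1:t})

model = WorldModel()
obs = torch.randn(2, 8, 256)   # 8 past frames, feature-encoded
acts = torch.randn(2, 8, 7)    # 7-DoF robot actions
next_obs_pred = model(obs, acts)
```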
VQ-VLA: A Large-Scale Synthetic-Data-Driven Action Tokenizer with Nearly 3x Faster Inference
具身智能之心· 2025-07-02 10:18
Core Insights
- The article discusses the challenges faced by Vision-Language-Action (VLA) models in multimodal robotic control, focusing on action-representation efficiency and data-dependency bottlenecks [3][4].

Group 1: Challenges in VLA Models
- Action-representation efficiency is low: traditional continuous-action discretization struggles to capture complex spatiotemporal dynamics, leading to growing cumulative errors in long-horizon tasks [4].
- The high cost of real-robot data collection limits model generalization, creating a data-dependency bottleneck [4].

Group 2: Proposed Solutions
- A universal action-tokenizer framework based on a Convolutional Residual VQ-VAE is proposed to replace traditional discretization (a toy sketch follows this summary) [4].
- The article shows that the gap between synthetic- and real-domain action trajectories is minimal, allowing a far larger scale of synthetic data (100x previous work) to be used to train the tokenizer [4].
- The VLA model is optimized across three core metrics, with the success rate on long-horizon tasks increasing by up to 30% in real-robot experiments [4].

Group 3: Key Technical Solutions
- The Convolutional Residual VQ-VAE architecture uses 2D temporal convolution layers instead of traditional MLPs, yielding a 6.6% improvement in success rate on the LIBERO-10 task [7].
- Action execution frequency improved from 4.16 Hz to 11.84 Hz, speeding up inference [9][18].
- Multi-step action prediction reduces cumulative errors, contributing to long-horizon robustness [9].

Group 4: Experimental Findings
- In simulation, the VQ model achieved an 80.98% success rate on LIBERO-90, surpassing the baseline by 7.45% [17].
- On short-horizon tasks, the VQ model reached 60.0% on the "flip the pot" task versus a 30.0% baseline [17].
- On long-horizon tasks, the VQ model reached 30.0% on "putting toys in a drawer" versus 5.0% for the baseline, and 50.0% on "putting all cups in a basket" versus 15.0% for the baseline [17].

Group 5: Future Directions
- Expand the dataset by integrating larger-scale synthetic datasets such as RLBench [19].
- Pursue model lightweighting via distillation and quantization to further accelerate inference [19].
- Explore architectural enhancements such as action-frequency conditional encoding [19].
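As a rough illustration of the residual vector-quantization idea behind the action tokenizer, the sketch below encodes an action chunk with a temporal convolution and quantizes the residual in stages across several codebooks. Codebook sizes, depths, and the 1D convolution (standing in for the paper's 2D temporal convolutions) are assumptions, not the paper's configuration.

```python
# Toy sketch of a convolutional residual VQ action tokenizer: a temporal
# conv encodes an action chunk, and several codebooks quantize the
# residual in stages. Sizes and depth are illustrative assumptions only.
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    def __init__(self, dim=64, codebook_size=256, num_quantizers=4):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_quantizers)]
        )

    def forward(self, z):  # z: (B, T, dim)
        residual, quantized, codes = z, torch.zeros_like(z), []
        for cb in self.codebooks:
            # squared distance from the residual to every codebook entry
            d = (residual.unsqueeze(2) - cb.weight).pow(2).sum(-1)  # (B, T, K)
            idx = d.argmin(-1)                                      # (B, T)
            q = cb(idx)
            quantized = quantized + q
            residual = residual - q
            codes.append(idx)
        # training would also need a straight-through estimator and a
        # commitment loss; both are omitted here for brevity
        return quantized, torch.stack(codes, dim=-1)

# a 1D temporal conv stands in for the paper's convolutional encoder
encoder = nn.Conv1d(7, 64, kernel_size=5, padding=2)
actions = torch.randn(2, 7, 16)        # a 16-step chunk of 7-DoF actions
z = encoder(actions).transpose(1, 2)   # (B, T, 64)
tokens, codes = ResidualVQ()(z)        # codes: (B, T, num_quantizers)
```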
Two Modules of Robot Navigation: What Is the Difference Between Vision-Language Navigation and Goal Navigation?
具身智能之心· 2025-07-02 10:18
Core Viewpoint
- The article traces the evolution of robot navigation from traditional mapping and localization to large-model-based navigation, which spans vision-language navigation (VLN) and goal navigation. VLN focuses on following instructions, while goal navigation emphasizes understanding the environment to find paths independently [1][4].

Visual Language Navigation (VLN)
- VLN is fundamentally an instruction-following task: understanding language commands, perceiving the environment, and planning movement strategies. A VLN robot system consists of three main modules: a visual-language encoder, an environmental history representation, and an action policy (a minimal interface sketch follows this summary) [2].
- The robot processes language commands and visual observations, requiring effective information compression through the visual-language encoder. Key design questions include the choice of encoder and whether to project visual and language representations into a shared space [2].
- Policy-network learning has shifted from extracting patterns from labeled datasets to distilling effective planning information from large language models (LLMs) [3].

Goal Navigation
- Goal navigation extends VLN by enabling agents to explore unfamiliar 3D environments and plan paths based solely on target descriptions, such as coordinates or images [4].
- Unlike traditional VLN, goal-driven navigation requires moving autonomously from "understanding instructions" to "finding paths", involving semantic parsing, environmental modeling, and dynamic decision-making [6].

Commercial Application and Demand
- Goal-driven navigation has been deployed in verticals such as last-mile delivery, where it combines with social navigation algorithms to handle dynamic environments; examples include Meituan's delivery robots and Starship Technologies' campus delivery robots [8].
- In healthcare, hospitality, and food service, companies such as 嘉楠科技, 云迹科技, and Aethon have deployed service robots for autonomous delivery, improving service efficiency [8].
- The rise of humanoid robots has increased the focus on adapting navigation technology, with companies like Unitree and Tesla showcasing advanced capabilities [9].
- Growth in this sector has created significant job demand, particularly in navigation roles, which are recognized as one of the first technical subfields to reach practical application [9].

Knowledge and Learning Challenges
- Both VLN and goal navigation draw on a wide range of fields, including natural language processing, computer vision, reinforcement learning, and graph neural networks; this breadth poses challenges for learners seeking to build interdisciplinary skills [10].
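To make the three-module decomposition concrete, here is a minimal interface sketch of a VLN agent; the class layout, fusion scheme, and action space are illustrative assumptions, not any specific system's design.

```python
# Minimal interface sketch of the three-module VLN decomposition
# described above (encoder, history representation, action policy).
# Names and shapes are illustrative assumptions only.
import torch
import torch.nn as nn

class VLNAgent(nn.Module):
    def __init__(self, d=256, num_actions=4):
        super().__init__()
        # 1) visual-language encoder: projects both modalities to one space
        self.vis_enc = nn.LazyLinear(d)
        self.txt_enc = nn.LazyLinear(d)
        # 2) environmental history representation (recurrent state)
        self.history = nn.GRUCell(d, d)
        # 3) action policy: forward / turn-left / turn-right / stop
        self.policy = nn.Linear(d, num_actions)

    def step(self, vis_feat, txt_feat, h):
        fused = self.vis_enc(vis_feat) + self.txt_enc(txt_feat)
        h = self.history(fused, h)
        return self.policy(h), h

agent = VLNAgent()
h = torch.zeros(1, 256)
# one navigation step given image features and instruction features
logits, h = agent.step(torch.rand(1, 512), torch.rand(1, 768), h)
```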
Latest from Tsinghua University! RoboScape: A Physics-Informed Embodied World Model with a 68.3% Gain in Action Controllability
具身智能之心· 2025-07-02 07:44
Author: Yu Shang et al. | Editor: 具身智能之心 (This article is shared for academic purposes only; contact us for removal in case of infringement.)

Research Background and Core Problem
In embodied intelligence, world models serve as powerful simulators: they can generate realistic robot videos and alleviate data scarcity. But existing models are markedly limited in physical awareness. In contact-rich robotic scenarios in particular, the lack of modeling of 3D geometry and motion dynamics means generated videos often exhibit unrealistic object deformation or motion discontinuities, which is especially pronounced in manipulation tasks involving deformable objects such as cloth.

The root cause is that existing models over-rely on fitting visual tokens and lack awareness of physical knowledge. Prior attempts to integrate physical knowledge fall into three categories: physics-prior regularization (confined to narrow domains such as human motion or rigid-body dynamics), knowledge distillation from physics simulators (cascaded pipelines with heavy computation), and material-field modeling (limited to object-level modeling and hard to apply to scene-level generation). Therefore, how to integrate physical knowledge within a unified, efficient framework is the core open problem.

Core Method
Built on an autoregressive Transformer framework, RoboScape performs frame-level action-controllable robot video generation; its core is the integration of physical knowledge through two physics-aware auxiliary tasks (figure 2): ...
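As a rough illustration of frame-level action-conditioned autoregressive generation, the sketch below conditions a small Transformer on previous visual tokens plus the current action embedding to predict the next token. The backbone, tokenization, and the auxiliary head are placeholder assumptions, not RoboScape's actual design.

```python
# Rough sketch of frame-level action-conditioned autoregressive video
# generation: each step conditions the Transformer on previous frame
# tokens plus the current action embedding. The backbone and the
# auxiliary physics-aware head are assumed placeholders, not RoboScape's.
import torch
import torch.nn as nn

class ARVideoWorldModel(nn.Module):
    def __init__(self, vocab=1024, d=256, act_dim=7, n_layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d)
        self.act_emb = nn.Linear(act_dim, d)
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.next_tok = nn.Linear(d, vocab)  # main generation head
        self.aux_head = nn.Linear(d, 1)      # stand-in for a physics-aware auxiliary task

    @torch.no_grad()
    def rollout_step(self, frame_tokens, action):
        """frame_tokens: (B, T) discrete visual tokens; action: (B, act_dim)."""
        # inject the per-frame action into every token position
        x = self.tok_emb(frame_tokens) + self.act_emb(action).unsqueeze(1)
        h = self.backbone(x)
        return self.next_tok(h[:, -1]).argmax(-1)  # greedy next visual token

model = ARVideoWorldModel()
tokens = torch.randint(0, 1024, (1, 16))
next_token = model.rollout_step(tokens, torch.randn(1, 7))
```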
Xiaomi Experienced & Campus Hiring | Algorithm Researcher in Autonomous Driving and Robot Embodied Intelligence (VLA Direction)
具身智能之心· 2025-07-01 12:07
Job Description
We are looking for an outstanding researcher/scientist to join our frontier exploration team and jointly define and build the next-generation "brain" for autonomous driving and robots. You will pursue breakthrough research on an Embodied Foundation Model that deeply fuses vision-language-action (VLA) capabilities and possesses outstanding spatial perception and spatial reasoning.

Core Responsibilities
- Frontier algorithm research and construction: design and implement leading embodied multimodal large models. Your research will go beyond existing VLA frameworks to explore how to build a World Model that understands the complex three-dimensional world and plans long-horizon, multi-step tasks.
- Breakthroughs in core model capabilities, leading the model's progress on:
  - Multimodal scene understanding: fuse vision, language, radar, and other multi-source information to achieve deep understanding and spatial awareness of dynamic, open environments.
  - Complex semantic reasoning and decision-making: enable the model to understand vague, abstract human instructions and, combined with spatial reasoning about the physical world, generate safe, reasonable, and interpretable action sequences.
  - Learning and adaptation mechanisms: research reinforcement learning (RL), imitation learning (IL), and self-supervised learning in depth, so the model can continuously learn and evolve from massive data and interaction with the environment.
- Technical vision and roadmap: lead the construction of a generalizable, efficient embodied foundation model, providing core ...
Complete VLA Deployment on a Robotic Arm in 3 Days: Algorithms & Project Practice
具身智能之心· 2025-07-01 12:07
Core Viewpoint
- The concept of "embodied intelligence" has been officially included in the 2025 government work report, highlighting its significance in current research by enterprises and educational institutions [1].

Group 1: Challenges in Implementation
- Researchers and engineers face challenges when deploying algorithms from simulation environments to hardware, primarily due to insufficient engineering practice and an incomplete grasp of classic methods and imitation learning [2].
- These gaps hinder the effective integration of methods, resulting in suboptimal deployment and performance of VLA algorithms on robotic arms and obstructing the application of embodied intelligence in real-world scenarios [2].

Group 2: Training Program
- Deep Blue Academy has partnered with notable researchers and companies to launch an offline training camp focused on robotic-arm manipulation and grasping, aimed at bridging the gap between simulation and real-world application [3].
- The camp offers hands-on experience with real robotic arms and covers key technologies such as motion planning, visual feedback, imitation learning, and VLA, building a complete understanding of the "perception - decision - control" pipeline [5].

Group 3: Course Highlights
- The program emphasizes a full-stack technology loop, training participants from algorithms through to hardware engineering [16].
- It features immersive project practice supported by the hardware platform of Songling Robotics, promoting deep integration of academic and industry resources [16].
- The course uses a high-density, small-class format, providing intensive technical training and personalized guidance over three days [16].

Group 4: Target Audience
- The training is designed for undergraduate and graduate students in robotics and automation-related fields, as well as R&D engineers working on robotic arms and embodied intelligence [18].
From Better Perception to Lightweight Deployment, Embodied Intelligence Still Has a Long Road Ahead
具身智能之心· 2025-06-30 12:21
Group 1
- The core viewpoint of the article emphasizes the explosive growth of the embodied intelligence industry by 2025, driven by technological advances and application pull, which together shape both the technical roadmap and commercialization pathways [1].
- Upgrades in perception capabilities and multimodal integration are crucial for the development of embodied technology, with tactile perception, particularly in dexterous hands, improving operational precision and feedback [1].
- Large-model-driven algorithms are enhancing robots' understanding of the world, particularly in humanoid robots, by improving perception, autonomous learning, and decision-making capabilities [1].

Group 2
- A comprehensive technical community for embodied intelligence has been established to provide a platform for academic and engineering discussion, with members from renowned universities and leading companies in the field [6].
- The community has compiled over 40 open-source projects and nearly 60 datasets related to embodied intelligence, along with technical learning pathways to ease entry and advancement in the field [6][12].
- Regular discussions within the community cover topics such as robot simulation platforms, imitation learning in humanoid robots, and hierarchical decision-making [7].

Group 3
- Community benefits include access to exclusive learning videos, job recommendations, and industry networking opportunities [11][8].
- A comprehensive collection of reports on embodied intelligence, including large models and humanoid robots, keeps members up to date on industry developments [14].
- The community also provides resources on robot navigation, control, and other technical aspects of embodied intelligence, supporting foundational learning [16][50].
When Drones Meet AI Agents: A Survey of Autonomous Aerial Intelligence and UAV Agents Across Domains
具身智能之心· 2025-06-30 12:17
Core Insights
- The article charts the evolution of Unmanned Aerial Vehicles (UAVs) into Agentic UAVs, characterized by autonomous reasoning, multimodal perception, and reflective control, marking a significant shift from traditional automation platforms [5][6][11].

Research Background
- The motivation for this research stems from the rapid development of UAVs from remote-controlled platforms into complex autonomous agents, driven by advances in artificial intelligence (AI) [6][7].
- Demand is growing for autonomy, adaptability, and interpretability in UAV operations across sectors such as agriculture, logistics, environmental monitoring, and public safety [6][7].

Definition and Architecture of Agentic UAVs
- Agentic UAVs are defined as a new class of autonomous aerial systems with cognitive capabilities, situational adaptability, and goal-directed behavior, in contrast to traditional UAVs that operate on predefined instructions [11][12].
- Their architecture consists of four core layers: perception, cognition, control, and communication, enabling autonomous sensing, reasoning, action, and interaction (a schematic loop sketch follows this summary) [12][13].

Enabling Technologies
- Perception layer: a suite of sensors (RGB cameras, LiDAR, thermal sensors) provides real-time semantic understanding of the environment [13][14].
- Cognition layer: the decision-making core, employing techniques such as reinforcement learning and probabilistic modeling for adaptive control strategies [13][14].
- Control layer: converts planned actions into concrete flight trajectories and commands [13][14].
- Communication layer: facilitates data exchange and task coordination among UAVs and other systems [13][14].

Applications of Agentic UAVs
- Precision agriculture: Agentic UAVs autonomously identify crop-health issues and optimize pesticide application through real-time data analysis [17][18].
- Disaster response and search and rescue: these UAVs excel in dynamic environments, providing real-time adaptability and autonomous task reconfiguration during disaster scenarios [20][21].
- Environmental monitoring: Agentic UAVs serve as intelligent, mobile environmental sentinels, monitoring rapidly changing ecosystems at high spatial and temporal resolution [22][23].
- Urban infrastructure inspection: they offer a transformative approach to inspections, enabling real-time damage detection and adaptive task planning [24].
- Logistics and smart delivery: Agentic UAVs are emerging as intelligent aerial couriers, capable of executing complex delivery tasks with minimal supervision [25][26].

Challenges and Limitations
- Despite their transformative potential, widespread application of Agentic UAVs faces challenges related to technical constraints, regulatory hurdles, and cognitive dimensions [43].
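To visualize the four-layer architecture, here is a schematic sketch of one perception-cognition-control-communication tick; the class names and the rule-based cognition stub are illustrative assumptions only.

```python
# Schematic sketch of the four-layer Agentic UAV loop described above
# (perception -> cognition -> control -> communication). Class names and
# the simple rule-based cognition are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class Observation:
    obstacle_ahead: bool
    battery_pct: float

class PerceptionLayer:
    def sense(self) -> Observation:
        # in a real system: fuse RGB, LiDAR, and thermal streams
        return Observation(obstacle_ahead=False, battery_pct=78.0)

class CognitionLayer:
    def decide(self, obs: Observation) -> str:
        # in a real system: RL / probabilistic planning over mission goals
        if obs.battery_pct < 20.0:
            return "return_to_base"
        return "avoid" if obs.obstacle_ahead else "continue_route"

class ControlLayer:
    def act(self, decision: str) -> None:
        print(f"executing trajectory for: {decision}")

class CommunicationLayer:
    def report(self, obs: Observation, decision: str) -> None:
        print(f"telemetry: battery={obs.battery_pct}%, decision={decision}")

def tick(p, c, ctl, comm):
    obs = p.sense()
    decision = c.decide(obs)
    ctl.act(decision)
    comm.report(obs, decision)

tick(PerceptionLayer(), CognitionLayer(), ControlLayer(), CommunicationLayer())
```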