具身智能之心
Xiaomi Experienced & Campus Hiring | Autonomous Driving and Robotics Embodied Intelligence Algorithm Researcher (VLA Direction)
具身智能之心· 2025-07-03 13:36
Job Description
We are looking for an outstanding researcher/scientist to join our frontier exploration team and help define and build the next-generation "brain" for autonomous driving and robotics. You will work on breakthrough research into an Embodied Foundation Model that deeply integrates vision-language-action (VLA) capabilities with excellent spatial perception and spatial reasoning.

Core responsibilities include:
- Frontier algorithm research and construction: design and implement leading embodied multimodal large models. Your research will go beyond existing VLA frameworks to explore World Models that can understand the complex three-dimensional world and perform long-horizon, multi-step task planning.
- Core model capability breakthroughs: lead the model's breakthroughs in the following key capabilities:
  - Multimodal scene understanding: fuse vision, language, radar, and other multi-source information to achieve deep understanding and spatial perception of dynamic, open environments.
  - Learning and adaptation mechanisms: research reinforcement learning (RL), imitation learning (IL), and self-supervised learning so the model can continuously learn and evolve from massive data and interaction with the environment.
- Technical vision and roadmap: lead the construction of generalizable, efficient embodied foundation models, providing core support for technical evolution over the next 1-3 years, and explore their unified application potential across autonomous driving and general robotics.
- Complex semantic reasoning and decision-making: enable the model to understand vague, abstract human instructions and, combined with ...
卡耐基梅隆大学!Human2LocoMan:通过人类预训练学习多功能四足机器人操控
具身智能之心· 2025-07-03 13:36
Core Insights
- The article presents Human2LocoMan, a novel framework for enhancing quadrupedal robot manipulation through human pretraining, addressing the challenges of autonomous multi-functional operation in complex environments [4][38]
- The framework uses a modular cross-embodiment Transformer architecture (MXT) to enable effective data collection and transfer learning from human demonstrations to robot policies [8][38]

Group 1: Framework and Methodology
- The Human2LocoMan framework collects human data via extended reality (XR) technology, mapping human actions to robot movements and thereby expanding the robot's operational capabilities [7][10]
- A unified reference frame aligns actions between humans and the LocoMan robot, bridging the significant differences in dynamics and control between the two embodiments [12][10]
- The MXT architecture shares a common Transformer backbone while keeping embodiment-specific tokenizers, enabling effective transfer learning across different robotic platforms [16][8]

Group 2: Experimental Results
- Experiments showed an average success-rate improvement of 41.9% overall and a 79.7% improvement in out-of-distribution (OOD) scenarios when using the proposed framework compared with baseline methods [4][8]
- Pretraining with human data yielded a 38.6% overall success-rate increase and an 82.7% improvement in OOD scenarios, demonstrating the value of human data for robot performance [8][38]
- Data collection was efficient: over 50 robot trajectories and 200 human trajectories were gathered within 30 minutes, indicating the framework's potential for rapid data acquisition [26][38]

Group 3: Comparative Analysis
- MXT outperformed state-of-the-art (SOTA) imitation learning methods across tasks, achieving higher success rates and task scores, particularly in scenarios with limited data [30][34]
- MXT's modular design generalized better and overfit less than alternative architectures such as HPT, which struggled with severe overfitting [36][39]
- The framework's sustained performance on long-horizon tasks indicates robustness and effectiveness in real-world applications [36][38]
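As a rough illustration of the cross-embodiment idea summarized above, the sketch below wires per-embodiment input stems and output heads around one shared trunk. All names, dimensions, and the linear "trunk" are our illustrative assumptions, not the paper's actual MXT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 32  # shared latent width (our choice)

def linear(din, dout):
    return rng.standard_normal((din, dout)) * 0.02

# Embodiment-specific "stems" tokenize each body's observation space;
# embodiment-specific "heads" decode actions; one trunk is shared.
stems = {"human": linear(48, D), "locoman": linear(36, D)}
heads = {"human": linear(D, 24), "locoman": linear(D, 18)}
trunk = linear(D, D)  # stands in for the shared Transformer backbone

def forward(entity, obs):
    z = np.tanh(obs @ stems[entity] @ trunk)  # stem -> shared trunk
    return z @ heads[entity]                  # embodiment-specific action

human_act = forward("human", rng.standard_normal(48))
robot_act = forward("locoman", rng.standard_normal(36))
```

Pretraining would fit the human stem/head plus the trunk on human data; fine-tuning swaps in the robot stem/head while reusing the shared trunk, which is what makes the human-to-robot transfer possible.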
Embodied intelligence: it's time to deliver results...
具身智能之心· 2025-07-03 08:22
Core Viewpoint
- Embodied intelligence has emerged as a significant technological keyword in recent years, transitioning from obscurity to frenzy and now to a more rational phase, with companies focusing on practical applications rather than mere demonstrations [2]

Group 1: Technological Developments
- Upgraded sensory capabilities and multimodal integration are crucial for embodied technology, with tactile perception becoming a key focus, especially in dexterous manipulation [2]
- Multimodal sensor fusion lets robots process several kinds of information simultaneously, improving the accuracy and completeness of environmental perception [2]
- Large-model-driven algorithms improve robots' understanding of the world, particularly in humanoid robotics, by enhancing perception and promoting autonomous learning and decision-making [2]

Group 2: Industry Needs and Challenges
- There is an urgent need for lightweight model designs that support low compute, multimodal capabilities, and cross-platform deployment to facilitate industry adoption [2]
- Simulation environments and data ecosystems are vital for embodied intelligence, providing efficient training platforms by simulating physical-world phenomena [2]
- Aligning simulation data with real-world applications remains a significant challenge for researchers [2]

Group 3: Community and Resources
- The "Embodied Intelligence Heart" knowledge community serves as a platform for technical exchange among nearly 200 companies and research institutions in the field [3][8]
- The community offers resources including open-source projects, datasets, and learning pathways covering many aspects of embodied intelligence [8][15][17]
- Members can access job postings, industry reports, and educational materials to advance their knowledge and careers in embodied intelligence [8][17][19]
World's First Survey of VLA for Autonomous Driving Released: A Complete Breakdown of VLA Driving Models
具身智能之心· 2025-07-03 08:22
Core Insights
- The article discusses the integration of vision, language, and action in autonomous driving through Vision-Language-Action (VLA) models, highlighting their potential to enhance the capabilities of self-driving vehicles [1][3]

Evolution of Autonomous Driving Paradigms
- Autonomous driving technology has transitioned from modular to integrated approaches, categorized into three core paradigms:
  1. End-to-End Autonomous Driving (AD), which maps sensor inputs directly to driving actions but lacks interpretability [3]
  2. Vision Language Models for AD (VLMs for AD), which improve interpretability and generalization but do not directly control the vehicle [3]
  3. Vision-Language-Action Models (VLA for AD), which unify perception, reasoning, and action execution, enabling vehicles to understand complex instructions and make autonomous decisions [3][4]

VLA4AD Architecture
- A typical VLA4AD model consists of three parts: input, processing, and output, integrating environmental perception, high-level instruction understanding, and vehicle control [5]
- The architecture includes multimodal inputs, core modules for processing visual and language data, and an action decoder that generates control outputs [6][7][9]

Development Stages of VLA Models
- The evolution of VLA models falls into four stages:
  1. Language models as explainers, enhancing interpretability without direct control [16]
  2. Modular VLA models, where language actively contributes to planning decisions [19]
  3. Unified end-to-end VLA models that map sensor inputs to control signals in a single forward pass [20]
  4. Reasoning-augmented VLA models that incorporate long-term reasoning and memory into decision-making [21]

Representative VLA4AD Models
- The article compares various VLA4AD models in detail, covering their inputs, outputs, datasets, and core contributions [23]. Examples include:
  - DriveGPT-4, which uses a single image input to generate high-level control labels [22]
  - ADriver-I, which integrates vision-action tokens for control [22]
  - RAG-Driver, which employs retrieval-augmented control mechanisms [22]

Datasets and Benchmarks
- High-quality, diverse datasets are crucial for VLA4AD development; notable ones include BDD100K, nuScenes, and Bench2Drive, which provide rich annotations for training and evaluation [25][26][29]

Challenges and Future Directions
- The article outlines six major challenges facing VLA4AD, including robustness, real-time performance, data bottlenecks, and multimodal alignment [31][32]
- Future directions include foundation-scale driving models, neuro-symbolic safety kernels, fleet-scale continual learning, a standardized traffic language, and cross-modal social intelligence [36][37]
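The input-processing-output structure described above can be sketched end to end. The toy encoders and the two-dimensional [steer, throttle] decoder below are our own assumptions for illustration, not any specific model from the survey.

```python
import numpy as np

rng = np.random.default_rng(1)

def vision_encoder(image):              # image -> pooled visual features
    return image.reshape(-1, 16).mean(axis=0)

def language_encoder(instruction):      # instruction -> bag-of-bytes embedding
    v = np.zeros(16)
    for i, b in enumerate(instruction.encode()):
        v[i % 16] += b / 255.0
    return v / max(len(instruction), 1)

W_dec = rng.standard_normal((32, 2)) * 0.1

def action_decoder(fused):              # fused features -> [steer, throttle]
    return np.tanh(fused @ W_dec)

image = rng.random((64, 64))            # a camera frame stand-in
fused = np.concatenate([vision_encoder(image),
                        language_encoder("turn left at the next light")])
control = action_decoder(fused)         # both components lie in (-1, 1)
```

The single forward pass from raw image plus instruction to a control vector is what distinguishes the unified end-to-end VLA stage from the earlier "language as explainer" stage, where language never touched the control path.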
Reshaping Embodied Navigation Policies! RSRNav: Image-Goal Navigation via Spatial Relationship Reasoning
具身智能之心· 2025-07-02 10:18
Core Viewpoint
- The article presents RSRNav, a robust and efficient image-goal navigation method that improves navigation by reasoning about spatial relationships between the goal and current observations, addressing existing weaknesses in navigation efficiency and sensitivity to viewpoint inconsistency [5][20]

Research Background
- Image-goal navigation (ImageNav) is a critical area of embodied intelligence, with applications in home robotics, augmented-reality systems, and assistance for visually impaired people [5]
- Existing ImageNav methods fall into modular and end-to-end approaches, each with its own strengths and weaknesses in navigation efficiency and robustness [5]

Methodology
- RSRNav uses a simple ResNet-9 network without pre-training to encode the goal and current images into feature vectors [8]
- The core of RSRNav is training a perception-relation-action navigation policy, in which spatial relationships are inferred from correlations between the features extracted from the two images [11][12]
- The method progressively enriches the correlation computation, culminating in a direction-aware correlation that supports efficient navigation and precise heading adjustments [11]

Experimental Results
- In the "user-matched goal" setting, RSRNav achieved a Success Rate (SR) of 83.2% and a Success weighted by Path Length (SPL) of 56.6%, outperforming other methods [20]
- RSRNav showed superior cross-domain generalization on the MP3D and HM3D datasets, indicating strong robustness to viewpoint inconsistency and good generalization to new environments [20]

Ablation Studies
- Performance improved markedly with richer correlation information: on the Gibson dataset, SPL rose from 16.1% for "minimal correlation" to 61.2% for "direction-aware correlation" [22]
- The analysis confirmed that both cross-correlation and fine-grained correlation contribute to performance, underscoring the importance of rich correlation information for navigation [22]

Conclusion and Future Work
- By reasoning about spatial relationships, RSRNav significantly improves the efficiency and robustness of image-goal navigation, achieving excellent results across multiple benchmark datasets [23]
- Future work will apply RSRNav to real-world navigation scenarios and narrow the gap between simulated and real-world data [23]
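The correlation signal the summary keeps referring to can be made concrete with a small sketch: cosine cross-correlation between every location of the goal feature map and every location of the current feature map. Feature sizes here are toy stand-ins for ResNet-9 outputs, not the paper's actual shapes.

```python
import numpy as np

rng = np.random.default_rng(2)

C, H, W = 8, 5, 5                       # channels, height, width (illustrative)
goal_feat = rng.standard_normal((C, H, W))
curr_feat = rng.standard_normal((C, H, W))

def cross_correlation(f_goal, f_curr):
    """Correlate every goal location with every current location.

    The (H*W, H*W) result encodes where goal-view content reappears
    in the current view, i.e. the spatial-relation signal a policy
    like RSRNav's can reason over.
    """
    g = f_goal.reshape(C, -1)           # (C, H*W)
    c = f_curr.reshape(C, -1)
    g = g / (np.linalg.norm(g, axis=0, keepdims=True) + 1e-8)
    c = c / (np.linalg.norm(c, axis=0, keepdims=True) + 1e-8)
    return g.T @ c                      # cosine similarity per location pair

corr = cross_correlation(goal_feat, curr_feat)
```

Feeding this full pairwise map to the policy, rather than a single pooled similarity score, is what "richer correlation information" means in the ablation: the map preserves *where* matches occur, which is what enables direction-aware heading adjustments.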
RoboScape: A Physics-Informed Embodied World Model with 68.3% Better Action Controllability
具身智能之心· 2025-07-02 10:18
Core Viewpoint
- The article presents RoboScape, a physics-informed embodied world model that improves video-generation quality by integrating physical knowledge into the modeling process, addressing limitations of existing models in physical perception and object manipulation [4][23]

Research Background and Core Issues
- Existing embodied-intelligence models have significant limitations in physical perception, particularly in contact-rich robot scenarios, leading to unrealistic object deformation and motion discontinuities [4]
- Current attempts to integrate physical knowledge fall into three categories: physical-prior regularization, knowledge distillation from physics simulators, and material-field modeling, each with its own limitations [4]

Core Method
- The focus is on learning an embodied world model as a dynamics function that predicts the next visual observation from past observations and robot actions [5]

Robot Data Processing Pipeline
- A four-step processing pipeline builds a multimodal dataset with physical priors on top of the AGIBOT-World dataset [6]

RoboScape Model Architecture
- The model uses an autoregressive Transformer framework for controllable robot video generation, integrating physical knowledge through two auxiliary tasks: temporal depth prediction and adaptive keypoint dynamics learning [8]

Temporal Depth Prediction
- A temporal depth-prediction branch enhances 3D geometric consistency, employing a dual-branch cooperative autoregressive Transformer [10]

Adaptive Keypoint Dynamics Learning
- The model tracks contact-driven keypoints in a self-supervised way to implicitly encode material properties, adapting to the most active keypoints based on motion amplitude [11]

Joint Training Objective
- The overall training objective integrates the loss functions above, balancing the contributions of the different components [13]

Experimental Validation
- The model is evaluated along three dimensions: appearance fidelity, geometric consistency, and action controllability, showing superior results compared to baseline models [15]

Dataset and Implementation Details
- The dataset comprises 50,000 video segments covering 147 tasks and 72 skills, with training conducted on 32 NVIDIA A800 GPUs for five epochs [16]

Downstream Application Validation
- In robot policy training, models trained on generated data perform close to those trained on real data, indicating the effectiveness of synthetic data for complex tasks [19]

Conclusion and Future Plans
- RoboScape integrates physical knowledge into video generation without relying on external physics engines; future plans include combining generative world models with real robots for further validation in practical scenarios [23][24]
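The dynamics-function view above (predict the next observation from past observations and the current action) can be sketched as a toy autoregressive rollout. The linear maps stand in for RoboScape's Transformer, and all dimensions are our own choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# o_{t+1} = f(o_{<=t}, a_t), with observations as 16-d embeddings
# and actions as 4-d vectors (both illustrative).
D_OBS, D_ACT = 16, 4
W_obs = rng.standard_normal((D_OBS, D_OBS)) * 0.1
W_act = rng.standard_normal((D_ACT, D_OBS)) * 0.1

def step(history, action):
    ctx = history.mean(axis=0)          # pool the whole observation history
    return np.tanh(ctx @ W_obs + action @ W_act)

obs = [rng.standard_normal(D_OBS)]      # initial observation embedding
for _ in range(5):                      # roll the model 5 steps forward
    obs.append(step(np.stack(obs), rng.standard_normal(D_ACT)))
```

RoboScape's auxiliary branches (depth prediction, keypoint dynamics) would attach extra losses to each `step`; the action argument is what makes the generated video controllable.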
VQ-VLA: A Large-Scale Synthetic-Data-Driven Action Tokenizer with Nearly 3x Faster Inference
具身智能之心· 2025-07-02 10:18
Core Insights
- The article discusses the challenges Visual-Language-Action (VLA) models face in multimodal robot control, focusing on action-representation efficiency and data-dependency bottlenecks [3][4]

Group 1: Challenges in VLA Models
- Action representation is inefficient: traditional continuous-action discretization struggles to capture complex spatiotemporal dynamics, so cumulative errors grow in long-horizon tasks [4]
- The high cost of real-robot data collection limits model generalization, creating a data-dependency bottleneck [4]

Group 2: Proposed Solutions
- A universal action-tokenizer framework based on a Convolutional Residual VQ-VAE is proposed to replace traditional discretization methods [4]
- The article shows that the gap between synthetic and real action trajectories is small, allowing the tokenizer to be trained on synthetic data at 100 times the scale of previous work [4]
- The VLA model improves on three core metrics, with the success rate on long-horizon tasks increasing by up to 30% in real-robot experiments [4]

Group 3: Key Technical Solutions
- The Convolutional Residual VQ-VAE architecture uses 2D temporal convolution layers instead of traditional MLPs, yielding a 6.6% success-rate improvement on the LIBERO-10 task [7]
- Action execution frequency rose from 4.16 Hz to 11.84 Hz, nearly tripling inference speed [9][18]
- Multi-step action prediction reduces cumulative error, contributing to long-horizon robustness [9]

Group 4: Experimental Findings
- In simulation, the VQ model achieved an 80.98% success rate on LIBERO-90, surpassing the baseline by 7.45% [17]
- On short-horizon tasks, the VQ model reached 60.0% success on the "flip the pot" task versus a 30.0% baseline [17]
- On long-horizon tasks, the VQ model achieved 30.0% success on "putting toys in a drawer" versus 5.0% for the baseline, and 50.0% on "putting all cups in a basket" versus 15.0% [17]

Group 5: Future Directions
- Expand the dataset by integrating larger-scale synthetic datasets such as RLBench [19]
- Lighten the model via distillation and quantization to further accelerate inference [19]
- Explore architectural enhancements such as action-frequency conditional encoding [19]
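The residual-quantization idea behind the tokenizer can be sketched in a few lines: each stage quantizes what the previous stage left over, so a short tuple of discrete codes represents one continuous action latent. Codebook count, size, and dimension are illustrative; the paper's tokenizer also has convolutional encoder/decoder stages omitted here.

```python
import numpy as np

rng = np.random.default_rng(4)

D, K = 8, 16                            # latent dim, codewords per stage
codebooks = [rng.standard_normal((K, D)) for _ in range(2)]  # two stages

def residual_vq(z):
    residual, codes = z.copy(), []
    recon = np.zeros_like(z)
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)               # one discrete action token
        recon = recon + cb[idx]
        residual = residual - cb[idx]   # next stage encodes the leftover
    return codes, recon

z = rng.standard_normal(D)              # a continuous action latent
codes, recon = residual_vq(z)           # e.g. two tokens reconstruct z
```

The VLA model then predicts these short token sequences instead of raw continuous actions, which is what enables the multi-step prediction and the higher execution frequency reported above.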
Two Modules of Robot Navigation: What Is the Difference Between Vision-Language Navigation and Goal Navigation?
具身智能之心· 2025-07-02 10:18
Core Viewpoint
- The article traces the evolution of robot navigation from traditional mapping and localization to large-model-based navigation, which includes vision-language navigation (VLN) and goal navigation. VLN focuses on following instructions, while goal navigation emphasizes understanding the environment to find paths independently [1][4]

Summary by Sections

Vision-Language Navigation (VLN)
- VLN is fundamentally an instruction-following task involving understanding language commands, perceiving the environment, and planning movement. A VLN robot system consists of three main modules: a vision-language encoder, an environment-history representation, and an action policy [2]
- The robot must compress language commands and visual observations effectively through the vision-language encoder. Key questions include the choice of encoder and whether to project visual and language representations into a shared space [2]
- Learning the policy network has shifted from extracting patterns from labeled datasets to distilling effective planning information from large language models (LLMs) [3]

Goal Navigation
- Goal navigation extends VLN by having agents explore unfamiliar 3D environments and plan paths from a target description alone, such as coordinates or an image [4]
- Unlike traditional VLN, goal-driven navigation requires moving autonomously from "understanding instructions" to "finding paths", involving semantic parsing, environment modeling, and dynamic decision-making [6]

Commercial Application and Demand
- Goal-driven navigation has been deployed in several verticals, such as last-mile delivery, where it is combined with social navigation algorithms to handle dynamic environments. Examples include Meituan's delivery robots and Starship Technologies' campus delivery robots [8]
- In sectors like healthcare, hospitality, and food service, companies such as 嘉楠科技, 云迹科技, and Aethon have deployed service robots for autonomous delivery, improving service efficiency [8]
- The rise of humanoid robots has increased the focus on adapting navigation technology, with companies like Unitree and Tesla showcasing advanced capabilities [9]
- Growth in this sector has created significant job demand, particularly for navigation roles, which are recognized as one of the first technology subfields to reach practical application [9]

Knowledge and Learning Challenges
- Both VLN and goal navigation span a wide range of knowledge areas, including natural language processing, computer vision, reinforcement learning, and graph neural networks, which makes building the necessary interdisciplinary skills challenging for learners [10]
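The three VLN modules named above (vision-language encoder, environment-history representation, action policy) can be sketched as one decision step. All encoders, dimensions, and the action set here are toy assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)

ACTIONS = ["forward", "turn_left", "turn_right", "stop"]

def encode(instruction, observation):
    # Toy vision-language encoder: crude word-length features for the
    # instruction, concatenated with the raw visual observation vector.
    lens = ([len(w) for w in instruction.split()] + [0] * 4)[:4]
    return np.concatenate([np.array(lens) / 10.0, observation])

W_pi = rng.standard_normal((16, len(ACTIONS))) * 0.1

def policy(state, history):
    # Action policy over the encoded state plus the history representation.
    logits = np.concatenate([state, history]) @ W_pi
    p = np.exp(logits - logits.max())
    return p / p.sum()                  # softmax over discrete actions

history = np.zeros(8)                   # environment-history representation
state = encode("walk past the sofa then stop", rng.random(4))
probs = policy(state, history)
action = ACTIONS[int(np.argmax(probs))]
```

In a goal-navigation variant the instruction would be replaced by a goal embedding (coordinates or a goal image), while the history/policy loop stays the same, which is exactly the "understanding instructions" versus "finding paths" split the article draws.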
Latest from Tsinghua University! RoboScape: A Physics-Informed Embodied World Model with 68.3% Better Action Controllability
具身智能之心· 2025-07-02 07:44
Core Insights
- The article discusses the limitations of existing embodied-intelligence models in physical perception, particularly in contact-rich robot scenarios, highlighting the need to better integrate physical knowledge into these models [3][20]

Research Background and Core Issues
- Current models rely heavily on visual-token fitting and lack awareness of physical knowledge, leading to unrealistic object deformation and motion discontinuities in generated videos [3]
- Previous attempts to integrate physical knowledge have been limited to narrow domains or complex pipelines, indicating the need for a unified and efficient framework [3]

Core Methodology
- The focus is on learning an embodied world model as a dynamics function that predicts the next visual observation from past observations and robot actions [4]
- A four-step processing pipeline builds a multimodal dataset with physical priors, utilizing the AGIBOT-World dataset [5]

Data Processing Pipeline
- The pipeline includes physical-attribute annotation, video slicing, segment filtering, and segment classification to ensure effective training data [5][8]

Temporal Depth Prediction
- A dual-branch cooperative autoregressive Transformer (DCT) enhances 3D geometric consistency, with temporal and spatial attention layers ensuring causal generation [7]

Adaptive Keypoint Dynamics Learning
- The model tracks contact-driven keypoints in a self-supervised way to implicitly encode material properties, improving the modeling of object deformation and motion patterns [8]

Joint Training Objectives
- The overall training objective integrates several loss functions to balance the contributions of the model's components [10]

Experimental Validation
- The model is evaluated on appearance fidelity, geometric consistency, and action controllability, demonstrating superior results compared to baseline models [12][18]

Dataset and Implementation Details
- The study uses the AgiBotWorldBeta dataset, comprising 50,000 video segments across 147 tasks, and employs advanced models for comparison [13]

Downstream Application Validation
- The model is effective for training robot policies, achieving performance close to real-data training and indicating the utility of generated data for complex tasks [16]

Conclusion and Future Plans
- RoboScape integrates physical knowledge into video generation without relying on external physics engines, with plans to combine generative world models with real robots for further validation [20][21]
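The adaptive keypoint selection mentioned above, picking the most active keypoints by motion amplitude, can be sketched directly. The track shapes, motion metric, and top-k count are our illustrative choices, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(6)

T, N = 10, 20                           # frames, candidate keypoints
# Random-walk 2D tracks stand in for self-supervised keypoint tracks.
tracks = np.cumsum(rng.standard_normal((T, N, 2)) * 0.1, axis=0)

def select_active(tracks, k=5):
    # Motion amplitude = total displacement of each keypoint over the clip;
    # the most-moved points are the contact-driven ones worth supervising.
    motion = np.linalg.norm(np.diff(tracks, axis=0), axis=-1).sum(axis=0)
    return np.argsort(motion)[-k:][::-1]  # indices of the top-k movers

active = select_active(tracks, k=5)
```

Concentrating the auxiliary dynamics loss on these active keypoints is what lets the model implicitly pick up material properties from contact-rich motion rather than from static background.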