具身智能之心
What methods exist for implicit end-to-end VLA, and how is the field typically classified?
具身智能之心· 2025-06-22 14:47
Implicit end-to-end VLA models are those that do not explicitly generate images of how the robot arm will move in the future. Unlike explicit and hierarchical VLA methods, an implicit end-to-end VLA is built from three basic modules: visual feature extraction (V), vision-language joint feature learning (V+L), and vision-language-action joint training (V+L+A).

1) Visual feature extraction (V)
- Typical choice: ResNet-18
- Pre-trained models: R3M, VC-1, Voltron, Theia
- When speed matters: EfficientNet
- For easier alignment with text: CLIP
- When pairing with a large model: CLIP, SigLIP

2) Vision-language joint feature learning (V+L)
How does a robot-task model process visual and textual information together? Small-model choices: FiLM, or a Perceiver-style architecture. Large-model choice: an MLLM backbone (e.g. PaliGemma).

3) Vision-language-action joint training (V+L+A)
How do we obtain the VL-to-A mapping for robot tasks? By finding the regions of V that are useful for the action. That, in a nutshell, is what end-to-end VLA does; hopefully this gives an intuitive feel for it.

4) How are implicit end-to-end VLA methods classified?
- By model size: large-model vs. small-model VLA
- By architecture: Transformer-based vs. Diffusion-based

5) ...
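FiLM, listed above as a small-model choice for V+L fusion, conditions visual features on language by scaling and shifting each feature channel. A minimal numpy sketch with toy dimensions; the fixed random projection standing in for the text-conditioned MLP is an assumption for illustration only:

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each channel
    of a visual feature map with language-conditioned parameters."""
    # features: (C, H, W); gamma, beta: (C,)
    return gamma[:, None, None] * features + beta[:, None, None]

# Toy example: a 4-channel feature map modulated by a language embedding.
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8, 8))
lang_embedding = rng.standard_normal(16)
# In practice gamma/beta come from a small network over the text embedding;
# here a fixed random projection stands in for that network.
W = rng.standard_normal((8, 16))
gamma, beta = np.split(W @ lang_embedding, 2)

out = film(features, gamma, beta)
print(out.shape)  # (4, 8, 8)
```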
FindingDory: A Benchmark for Evaluating Embodied Agent Memory
具身智能之心· 2025-06-22 10:56
Group 1
- The core issue in embodied intelligence is the lack of long-term memory, which limits the ability to process multimodal observational data across time and space [3]
- Current visual language models (VLMs) excel in planning and control tasks but struggle with integrating historical experiences in embodied environments [3][5]
- Existing video QA benchmarks fail to adequately assess tasks requiring fine-grained reasoning, such as object manipulation and navigation [5]

Group 2
- The proposed benchmark includes a task architecture that allows for dynamic environment interaction and memory reasoning validation [4][6]
- A total of 60 task categories are designed to cover spatiotemporal semantic memory challenges, including spatial relations, temporal reasoning, attribute memory, and multi-target recall [7]
- Key technical innovations include a programmatic expansion of task complexity through increased interaction counts and a strict separation of experience collection from interaction phases [9][6]

Group 3
- Experimental results reveal three major bottlenecks in VLM memory capabilities across 60 tasks, including failures in long-sequence reasoning, weak spatial representation, and collapse in multi-target processing [13][14][16]
- The performance of native VLMs declines as the number of frames increases, indicating ineffective utilization of long contexts [20]
- Supervised fine-tuning models show improved performance by leveraging longer historical data, suggesting a direction for VLM refinement [25]

Group 4
- The benchmark represents the first photorealistic embodied memory evaluation framework, covering complex household environments and allowing for scalable assessment [26]
- Future directions include memory compression techniques, end-to-end joint training to address the split between high-level reasoning and low-level execution, and the development of long-term video understanding [26]
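The strict separation of experience collection from the question phase can be sketched as a two-phase protocol: observations are gathered and frozen, then memory queries are answered without further interaction. All names and the substring-matching "recall" below are illustrative assumptions, not the benchmark's API:

```python
def collection_phase(env_events):
    """Record the agent's observations; no task queries happen here.
    Returning a tuple freezes the buffer against later mutation."""
    return tuple(env_events)

def interaction_phase(memory, query, answer_fn):
    """Answer a memory query using only the frozen experience buffer."""
    return answer_fn(memory, query)

memory = collection_phase(
    ["saw mug on counter", "opened fridge", "saw milk in fridge"])
# A trivial stand-in for the agent's memory reasoning: substring recall.
recall = lambda mem, q: [e for e in mem if q in e]
print(interaction_phase(memory, "milk", recall))  # ['saw milk in fridge']
```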
New from Shanghai Jiao Tong University! DyNaVLM: A Zero-Shot, End-to-End Navigation Framework
具身智能之心· 2025-06-22 10:56
Core Viewpoint
- The article discusses the development of DyNaVLM, a zero-shot, end-to-end navigation framework that integrates vision-language models (VLM) to enhance navigation capabilities in dynamic environments, overcoming limitations of traditional methods [4][5]

Group 1: Introduction and Optimization Goals
- Navigation is a fundamental capability in autonomous agents, requiring spatial reasoning, real-time decision-making, and adaptability to dynamic environments. Traditional methods face challenges in generalization and scalability due to their modular design [4]
- The advancement of VLMs offers new possibilities for navigation by integrating perception and reasoning within a single framework, although their application in embodied navigation is limited by spatial granularity and contextual reasoning capabilities [4]

Group 2: Core Innovations of DyNaVLM
- **Dynamic Action Space Construction**: DyNaVLM introduces a dynamic action space that allows robots to determine navigation goals based on visual information and language instructions, enhancing movement flexibility in complex environments [6]
- **Collaborative Graph Memory Mechanism**: Inspired by retrieval-augmented generation (RAG), this mechanism enhances memory management for better navigation performance [8]
- **No-Training Deployment Mode**: DyNaVLM can be deployed without task-specific fine-tuning, reducing deployment costs and improving generalization across different environments and tasks [8]

Group 3: System Architecture and Methodology
- **Problem Formalization**: The system takes inputs such as target descriptions and RGB-D observations to determine appropriate actions, maintaining a memory function to extract spatial features [11]
- **Memory Manager**: This component connects the VLM and graph-structured memory, capturing spatial relationships and semantic object information [12]
- **Action Proposer and Selector**: The action proposer simplifies the continuous search space into discrete candidates, while the selector generates final navigation actions based on geometric candidates and contextual memory [14][15]

Group 4: Experimental Evaluation
- **Simulation Environment Evaluation**: DyNaVLM achieved a success rate (SR) of 45.0% and a path-length-weighted success rate (SPL) of 0.232 on ObjectNav benchmarks, outperforming previous VLM frameworks [19][22]
- **Real-World Evaluation**: DyNaVLM demonstrated superior performance in real-world settings, particularly in tasks requiring the identification of multiple targets, showcasing its robustness and efficiency in dynamic environments [27]
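The proposer/selector split described above can be sketched as sampling discrete waypoint candidates and then scoring them. The function names, the circular sampling scheme, and the goal-distance score (standing in for the VLM's judgment) are all assumptions for illustration, not the paper's implementation:

```python
import math
import random

def propose_actions(position, num_candidates=8, radius=1.0, seed=0):
    """Discretize the continuous search space into candidate waypoints
    arranged on a jittered circle around the robot."""
    rng = random.Random(seed)
    candidates = []
    for i in range(num_candidates):
        angle = 2 * math.pi * i / num_candidates
        r = radius * (0.5 + rng.random())  # jittered radius
        candidates.append((position[0] + r * math.cos(angle),
                           position[1] + r * math.sin(angle)))
    return candidates

def select_action(candidates, goal):
    """Pick the candidate the scorer ranks best; here the 'selector'
    is a goal-distance heuristic rather than a VLM."""
    return min(candidates, key=lambda c: math.dist(c, goal))

candidates = propose_actions((0.0, 0.0))
best = select_action(candidates, goal=(2.0, 0.0))
print(best)
```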
How long is an industry cycle in the embodied intelligence field?
具身智能之心· 2025-06-22 03:59
Core Viewpoint
- The article discusses the development cycles of autonomous driving and embodied intelligence, suggesting that the latter may achieve commercialization faster due to anticipated breakthroughs in algorithms and data [1]

Group 1: Industry Development
- The autonomous driving industry has been scaling and commercializing for nearly 10 years since 2015, while the robotics industry has been evolving for many years, with expectations for significant advancements in the next 5-8 years [1]
- Companies like Zhiyuan and Yushu are preparing for IPOs, which could greatly invigorate the entire industry [1]

Group 2: Community Building
- The goal is to create a community of 10,000 members within three years, focusing on bridging academia and industry, and providing a platform for rapid problem-solving and industry influence [1]
- The community aims to facilitate technical exchanges and discussions on academic and engineering issues, with members from renowned universities and leading robotics companies [8]

Group 3: Educational Resources
- A comprehensive entry route for beginners has been organized within the community, including various learning paths and resources for those new to the field [2]
- For those already engaged in research, valuable industry frameworks and project proposals are provided [4]

Group 4: Job Opportunities
- The community continuously shares job postings and opportunities, contributing to the establishment of a complete ecosystem for embodied intelligence [6]

Group 5: Knowledge Sharing
- The community has compiled a wealth of resources, including over 40 open-source projects, nearly 60 datasets related to embodied intelligence, and mainstream simulation platforms [11]
- Various learning routes are available, covering topics such as reinforcement learning, multi-modal models, and robotic navigation [11]
CVPR'25 | Perception Performance Jumps 50%! JarvisIR: A VLM at the Helm, Undaunted by Bad Weather
具身智能之心· 2025-06-21 12:06
Core Viewpoint
- JarvisIR represents a significant advancement in image restoration technology, utilizing a Visual Language Model (VLM) as a controller to coordinate multiple expert models for robust image recovery under various weather conditions [5][51]

Group 1: Background and Motivation
- The research addresses challenges in visual perception systems affected by adverse weather conditions, proposing JarvisIR as a solution to enhance image recovery capabilities [5]
- Traditional methods struggle with complex real-world scenarios, necessitating a more versatile approach [5]

Group 2: Methodology Overview
- The JarvisIR architecture employs a VLM to autonomously plan task sequences and select appropriate expert models for image restoration [9]
- The CleanBench dataset, comprising 150K synthetic and 80K real-world images, is developed to support training and evaluation [12][15]
- The MRRHF alignment algorithm combines supervised fine-tuning and human feedback to improve model generalization and decision stability [9][27]

Group 3: Training Framework
- The training process consists of two phases: supervised fine-tuning (SFT) using synthetic data and MRRHF for real-world data alignment [23][27]
- MRRHF employs a reward modeling approach to assess image quality and guide VLM optimization [28]

Group 4: Experimental Results
- JarvisIR-MRRHF demonstrates superior decision-making capabilities compared to other strategies, achieving a score of 6.21 on the CleanBench-Real validation set [43]
- In image restoration performance, JarvisIR-MRRHF outperforms existing methods across various weather conditions, with an average improvement of 50% in perceptual metrics [47]

Group 5: Technical Highlights
- The integration of a VLM as a control center marks a novel application in image restoration, enhancing contextual understanding and task planning [52]
- The collaborative mechanism of expert models allows for tailored responses to different weather-induced image degradations [52]
- The release of the CleanBench dataset fills a critical gap in real-world image restoration data, promoting further research and development in the field [52]
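The controller-plus-experts design can be sketched as a dispatch loop over a registry of restoration models. The expert names and the string-tagging "models" below are placeholders, and in JarvisIR the plan would be produced by the VLM rather than hard-coded:

```python
# Hypothetical expert registry; the names and string-tagging stand in
# for real restoration networks purely for illustration.
EXPERTS = {
    "derain": lambda img: f"derained({img})",
    "denoise": lambda img: f"denoised({img})",
    "enhance_light": lambda img: f"brightened({img})",
}

def restore(image, plan):
    """Apply expert models in the order chosen by the controller.
    In JarvisIR the plan comes from a VLM; here it is given directly."""
    for step in plan:
        image = EXPERTS[step](image)
    return image

result = restore("rainy_night.png", plan=["derain", "denoise", "enhance_light"])
print(result)  # brightened(denoised(derained(rainy_night.png)))
```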
A New Framework for Embodied Scenarios! Embodied-Reasoner: Cracking Complex Embodied Interaction Tasks
具身智能之心· 2025-06-21 12:06
Core Viewpoint
- The article presents the Embodied-Reasoner framework, which extends deep reasoning capabilities to embodied interactive tasks, addressing unique challenges such as multimodal interaction and diverse reasoning patterns [3][7][19]

Group 1: Research Background
- Recent advancements in deep reasoning models, like OpenAI's o1, have shown exceptional capabilities in mathematical and programming tasks through large-scale reinforcement learning [7]
- However, the effectiveness of these models in embodied domains requiring continuous interaction with the environment has not been fully explored [7]
- The research aims to expand deep reasoning capabilities to embodied interactive tasks, tackling challenges such as multimodal interaction and diverse reasoning patterns [7]

Group 2: Embodied Interaction Task Design
- A high-level planning and reasoning embodied task was designed, focusing on searching for hidden objects in unknown rooms rather than low-level motion control [8]
- The task environment is built on the AI2-THOR simulator, featuring 120 unique indoor scenes and 2100 objects [8]
- Four common tasks were designed: Search, Manipulate, Transport, and Composite Tasks [8]

Group 3: Data Engine and Training Strategy
- A data engine was developed to synthesize diverse reasoning processes, presenting embodied reasoning trajectories in an observe-think-act format [3]
- A three-stage iterative training process was introduced, including imitation learning, rejection sampling adjustment, and reflection adjustment, enhancing the model's interaction, exploration, and reflection capabilities [3][19]
- The training corpus synthesized 9390 unique task instructions and their corresponding observe-think-act trajectories, covering 107 different indoor scenes and 2100 interactive objects [12][16]

Group 4: Experimental Results
- The model demonstrated significant advantages over existing advanced models, particularly in complex long-duration tasks, showing more consistent reasoning capabilities and efficient search behavior [3][18]
- In real-world experiments, Embodied-Reasoner achieved a success rate of 56.7% across 30 tasks, outperforming OpenAI's o1 and o3-mini [17]
- The model's success rate improved by 9%, 24%, and 13% compared to GPT-o1, GPT-o3-mini, and Claude-3.7-Sonnet-thinking, respectively [18]

Group 5: Conclusion and Future Work
- The research successfully extends the deep reasoning paradigm to embodied interactive tasks, demonstrating enhanced interaction and reasoning capabilities, especially in complex long-duration tasks [19]
- Future work may explore the application of the model to a wider variety of embodied tasks and improve its generalization and adaptability in real-world environments [19]
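The observe-think-act trajectory format described above can be sketched as a small data structure. The field names, the toy instruction, and the action strings are assumptions for illustration, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One observe-think-act step of an embodied trajectory."""
    observation: str
    thought: str
    action: str

@dataclass
class Trajectory:
    instruction: str
    steps: list = field(default_factory=list)

    def add(self, observation, thought, action):
        self.steps.append(Step(observation, thought, action))

# Toy trajectory in the observe-think-act format.
traj = Trajectory("Find the mug and put it in the sink")
traj.add("kitchen counter, cabinet closed",
         "The mug may be inside the cabinet; open it first.",
         "open(cabinet)")
traj.add("cabinet open, mug visible",
         "Mug found; pick it up.",
         "pickup(mug)")
print(len(traj.steps))  # 2
```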
π0 / π0.5 / A0, the Talk of the Tech Community, Finally Explained! Functions, Scenarios, and Methodology in Full
具身智能之心· 2025-06-21 12:06
Core Insights
- The article discusses the π0, π0.5, and A0 models, focusing on their architectures, advantages, and functionalities in robotic control and task execution [3][11][29]

Group 1: π0 Model Structure and Functionality
- The π0 model is based on a pre-trained Vision-Language Model (VLM) and Flow Matching technology, trained on data from seven robots spanning more than 68 tasks and over 10,000 hours [3]
- It allows zero-shot task execution through language prompts, enabling direct control of robots without additional fine-tuning for covered tasks [4]
- The model supports complex task decomposition and multi-stage fine-tuning, enhancing the execution of intricate tasks like folding clothes [5]
- It achieves high-frequency precise operations, generating continuous action sequences at a control frequency of up to 50 Hz [7]

Group 2: π0 Performance Analysis
- The π0 model shows 20%-30% higher accuracy in following language instructions compared to baseline models in tasks like table clearing and grocery bagging [11]
- For tasks similar to its pre-training, it requires only 1-5 hours of fine-tuning data to achieve high success rates, and it performs twice as well on new tasks compared to training from scratch [11]
- In multi-stage tasks, π0 achieves an average task completion rate of 60%-80% through a "pre-training + fine-tuning" process, outperforming models trained from scratch [11]

Group 3: π0.5 Model Structure and Advantages
- The π0.5 model employs a two-stage training framework and hierarchical architecture, enhancing its ability to generalize from diverse data sources [12][18]
- It demonstrates a 25%-40% higher success rate in tasks compared to π0, with a threefold training-speed improvement due to mixed discrete-continuous action training [17]
- The model effectively handles long-duration tasks and can execute complex operations in unfamiliar environments, showcasing its adaptability [18][21]

Group 4: A0 Model Structure and Performance
- The A0 model features a layered architecture that integrates high-level affordance understanding and low-level action execution, enhancing its spatial reasoning capabilities [29]
- It shows continuous performance improvement with increased training environments, achieving success rates close to baseline models when trained on 104 locations [32]
- The model's performance degrades significantly when cross-entity and web data are removed, highlighting the importance of diverse data sources for generalization [32]

Group 5: Overall Implications and Future Directions
- The advancements in these models indicate a significant step towards practical applications of robotic systems in real-world environments, with potential expansions into service robotics and industrial automation [21][32]
- The integration of diverse data sources and innovative architectures positions these models to overcome traditional limitations in robotic task execution [18][32]
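Flow Matching, the action head behind π0, samples an action chunk by integrating a learned velocity field from Gaussian noise toward the data distribution. A minimal numpy sketch in which a hand-written field (pulling every action toward a fixed target chunk) stands in for the learned network; dimensions, step count, and the field itself are arbitrary assumptions:

```python
import numpy as np

def sample_actions(velocity_field, action_dim=4, horizon=8, steps=10, seed=0):
    """Euler-integrate a velocity field from Gaussian noise (t=0)
    toward an action chunk (t=1), as flow-matching samplers do."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((horizon, action_dim))
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        a = a + dt * velocity_field(a, t)
    return a

# Toy "learned" field that pushes every action toward a zero target chunk.
target = np.zeros((8, 4))
field_fn = lambda a, t: target - a

actions = sample_actions(field_fn)
print(actions.shape)  # (8, 4)
```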
A Look at the Businesses and Products of Nearly 30 Embodied Intelligence Companies
具身智能之心· 2025-06-20 03:07
Core Insights
- The article provides an overview of notable companies in the field of embodied intelligence and their corresponding business focuses [2]

Company Summaries
- **Zhiyuan Robotics**: Focuses on humanoid robot development, with products like the Expedition A1/A2 capable of navigating complex terrains and performing fine motor tasks [2]
- **Unitree Robotics**: A leader in quadruped robots, known for high dynamic motion control, with products like the Go1/Go2 series for consumer use and the B1/B2/H1 series for industrial applications [5]
- **Fourier Intelligence**: A general robotics company whose lineup includes humanoid robots and smart rehabilitation solutions, featuring products like the GR-1/GR-2 humanoid robots and upper-limb rehabilitation robots [6]
- **Deep Robotics**: Specializes in quadruped robots for power and security applications, with products like the J-series joints providing high torque performance [7]
- **Lingchu Intelligent**: Focuses on dexterous operations and end-to-end solutions based on reinforcement learning algorithms [13]
- **OriginBot**: Develops educational robots, including the Aelos series for programming education and Fluvo for hospital logistics [14]
- **Noematrix**: Concentrates on high-resolution multi-modal tactile perception and soft/hard tactile manipulation products, providing innovative solutions for various sectors [29]
- **Galbot**: Engages in the development of general-purpose humanoid robots and quadruped robots for industrial, commercial, and household applications [28]
EMBODIED WEB AGENTS: Fusing the Physical and Digital Realms for Integrated Agent Intelligence
具身智能之心· 2025-06-20 00:44
Group 1
- The article discusses the significant fragmentation in current AI agents, where network agents excel at handling digital information while embodied agents focus on physical interactions, leading to a lack of collaboration between the two domains [4]
- The research team proposes a new paradigm called Embodied Web Agents (EWA) aimed at seamlessly bridging physical embodiment and network reasoning [4]

Group 2
- A unified simulation environment is developed, integrating three major modules: outdoor environments based on the Google Street View/Earth APIs for real city navigation, indoor environments using AI2-THOR for high-fidelity kitchen scenes, and a self-built network environment with five functional websites [5][8][10]
- The EWA-Bench benchmark is constructed, containing 1,500 tasks across five domains, with 75% of tasks requiring multiple environment switches to test cross-domain coordination capabilities [11]

Group 3
- Experimental results show performance disparities among leading models like GPT-4o and Gemini, with overall accuracy rates of 34.72% for GPT and 30.56% for Gemini, compared to human accuracy of 90.28% [13]
- The primary cause of errors is identified as cross-domain coordination issues, accounting for 66.6% of failures, with models performing well on pure web tasks but struggling with physical interactions [15]

Group 4
- The article highlights the first formalization of the "embodied web agent" concept framework and the release of the first physical-digital integrated simulation environment [21]
- Insights reveal that current large language models (LLMs) face significant bottlenecks in cross-domain collaboration, which is crucial for enhancing agent intelligence [22]
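The cross-domain coordination the benchmark stresses can be illustrated by counting domain switches in a task plan; the environments, actions, and dispatch loop below are toy stand-ins, not the EWA-Bench interface:

```python
# Toy cross-domain task: the agent alternates between a "web" domain
# (look up a recipe) and a "physical" one (cook it).
def run_task(plan):
    """Dispatch each step to its domain and count domain switches,
    the coordination burden the benchmark highlights."""
    switches = 0
    last_domain = None
    log = []
    for domain, action in plan:
        if last_domain is not None and domain != last_domain:
            switches += 1
        log.append(f"{domain}:{action}")
        last_domain = domain
    return log, switches

plan = [("web", "search recipe"), ("physical", "fetch eggs"),
        ("physical", "crack eggs"), ("web", "check step 2"),
        ("physical", "stir")]
log, switches = run_task(plan)
print(switches)  # 3
```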
VR-Robo: real2sim2real, a New Paradigm for Robot Visual Reinforcement Learning Navigation and Motion Control!
具身智能之心· 2025-06-20 00:44
Core Viewpoint
- The article discusses advancements in legged-robot navigation and motion control through a unified framework called VR-Robo, which addresses the challenge of transferring strategies learned in simulation to real-world applications [3][16]

Related Work
- Previous research has explored various methods to bridge the Sim-to-Real gap, but many rely on specific sensors and struggle to balance high-fidelity rendering with accurate geometric modeling [3][4]

Solution
- The VR-Robo framework combines geometric priors from images to reconstruct consistent scenes, utilizes a GS-mesh hybrid representation to create interactive simulation environments, and employs neural reconstruction methods like NeRF to generate high-fidelity scene images [4][5][16]

Experimental Analysis
- Comparative experiments were conducted against baseline methods, including imitation learning and textured-mesh approaches, to evaluate the performance of the VR-Robo framework [11][12]
- Reported performance metrics include Success Rate (SR) and Average Reaching Time (ART), demonstrating VR-Robo's superior performance across difficulty levels [14][15]

Summary and Limitations
- VR-Robo successfully trains visual navigation strategies using only RGB images, enabling autonomous navigation in complex environments without additional sensors. However, it currently applies only to static indoor environments and has limitations in training efficiency and in the structural accuracy of the reconstructed meshes [16]
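The two reported metrics can be computed from per-episode results as follows; whether ART averages over successful episodes only is an assumption made here for illustration, and the episode data are invented:

```python
def summarize(episodes):
    """Compute Success Rate (SR) and Average Reaching Time (ART).
    ART is averaged over successful episodes only (an assumption
    about the metric's definition, not taken from the paper)."""
    successes = [e for e in episodes if e["success"]]
    sr = len(successes) / len(episodes)
    art = (sum(e["time"] for e in successes) / len(successes)
           if successes else float("inf"))
    return sr, art

# Invented episode results, purely to exercise the metric code.
episodes = [
    {"success": True, "time": 12.0},
    {"success": False, "time": 30.0},
    {"success": True, "time": 8.0},
    {"success": True, "time": 10.0},
]
sr, art = summarize(episodes)
print(sr, art)  # 0.75 10.0
```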