Latest from Shanghai Jiao Tong University! DyNaVLM: A Zero-Shot, End-to-End Navigation Framework
具身智能之心· 2025-06-22 10:56
Core Viewpoint
- The article discusses the development of DyNaVLM, a zero-shot, end-to-end navigation framework that integrates vision-language models (VLMs) to enhance navigation in dynamic environments, overcoming limitations of traditional methods [4][5].

Group 1: Introduction and Optimization Goals
- Navigation is a fundamental capability of autonomous agents, requiring spatial reasoning, real-time decision-making, and adaptability to dynamic environments. Traditional methods face challenges in generalization and scalability due to their modular design [4].
- The advancement of VLMs offers new possibilities for navigation by integrating perception and reasoning within a single framework, although their application to embodied navigation is limited by spatial granularity and contextual reasoning capabilities [4].

Group 2: Core Innovations of DyNaVLM
- **Dynamic Action Space Construction**: DyNaVLM introduces a dynamic action space that allows robots to determine navigation goals from visual information and language instructions, enhancing movement flexibility in complex environments [6].
- **Collaborative Graph Memory Mechanism**: Inspired by retrieval-augmented generation (RAG), this mechanism enhances memory management for better navigation performance [8].
- **Training-Free Deployment**: DyNaVLM can be deployed without task-specific fine-tuning, reducing deployment costs and improving generalization across environments and tasks [8].

Group 3: System Architecture and Methodology
- **Problem Formalization**: The system takes inputs such as target descriptions and RGB-D observations to determine appropriate actions, maintaining a memory function to extract spatial features [11].
- **Memory Manager**: This component connects the VLM and graph-structured memory, capturing spatial relationships and semantic object information [12].
- **Action Proposer and Selector**: The action proposer reduces the continuous search space to discrete candidates, while the selector generates the final navigation action from the geometric candidates and contextual memory [14][15].

Group 4: Experimental Evaluation
- **Simulation Evaluation**: DyNaVLM achieved a success rate (SR) of 45.0% and a success weighted by path length (SPL) of 0.232 on ObjectNav benchmarks, outperforming previous VLM frameworks [19][22]. (Both metrics are computed as sketched below.)
- **Real-World Evaluation**: DyNaVLM demonstrated superior performance in real-world settings, particularly in tasks requiring the identification of multiple targets, showcasing its robustness and efficiency in dynamic environments [27].
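SR and SPL are the standard ObjectNav metrics; SPL follows the definition of Anderson et al. (2018). Below is a minimal sketch of computing both over a batch of episodes, assuming a simple per-episode record format (the field names are illustrative, not DyNaVLM's actual evaluation code):

```python
# Success weighted by Path Length (SPL), per Anderson et al. (2018):
#   SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)
# where S_i is binary success, l_i the shortest-path length to the
# goal, and p_i the path length the agent actually traveled.

def evaluate(episodes):
    """episodes: list of dicts with keys 'success' (bool),
    'shortest' (float, meters), 'traveled' (float, meters)."""
    n = len(episodes)
    sr = sum(ep["success"] for ep in episodes) / n
    spl = sum(
        ep["success"] * ep["shortest"] / max(ep["traveled"], ep["shortest"])
        for ep in episodes
    ) / n
    return sr, spl

# One successful but inefficient episode, one failure:
episodes = [
    {"success": True,  "shortest": 5.0, "traveled": 10.0},
    {"success": False, "shortest": 4.0, "traveled": 7.5},
]
print(evaluate(episodes))  # (0.5, 0.25)
```

The `max(traveled, shortest)` term caps the ratio at 1, so SPL rewards success only to the degree that the taken path approaches the optimal one.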
How Long Is the Industry Cycle for Embodied Intelligence?
具身智能之心· 2025-06-22 03:59
Core Viewpoint
- The article discusses the development cycles of autonomous driving and embodied intelligence, suggesting that the latter may reach commercialization faster thanks to anticipated breakthroughs in algorithms and data [1].

Group 1: Industry Development
- The autonomous driving industry has been scaling and commercializing for nearly 10 years since 2015, while the robotics industry has been evolving for many years, with significant advancements expected in the next 5-8 years [1].
- Companies like Zhiyuan and Yushu (Unitree) are preparing for IPOs, which could greatly invigorate the entire industry [1].

Group 2: Community Building
- The goal is to build a community of 10,000 members within three years, bridging academia and industry and providing a platform for rapid problem-solving and industry influence [1].
- The community aims to facilitate technical exchanges and discussions on academic and engineering issues, with members from renowned universities and leading robotics companies [8].

Group 3: Educational Resources
- A comprehensive entry route for beginners has been organized within the community, including various learning paths and resources for newcomers to the field [2].
- For those already engaged in research, valuable industry frameworks and project proposals are provided [4].

Group 4: Job Opportunities
- The community continuously shares job postings and opportunities, contributing to a complete ecosystem for embodied intelligence [6].

Group 5: Knowledge Sharing
- The community has compiled a wealth of resources, including over 40 open-source projects, nearly 60 embodied-intelligence datasets, and mainstream simulation platforms [11].
- Various learning routes are available, covering topics such as reinforcement learning, multimodal models, and robotic navigation [11].
CVPR'25 | Perception Performance Soars 50%! JarvisIR: With a VLM at the Helm, Adverse Weather Holds No Fear
具身智能之心· 2025-06-21 12:06
Core Viewpoint
- JarvisIR represents a significant advancement in image restoration, using a vision-language model (VLM) as a controller that coordinates multiple expert models for robust image recovery under various weather conditions [5][51]. (A sketch of this controller pattern follows below.)

Group 1: Background and Motivation
- The research addresses the challenges adverse weather poses to visual perception systems, proposing JarvisIR as a solution to enhance image recovery capabilities [5].
- Traditional methods struggle with complex real-world scenarios, necessitating a more versatile approach [5].

Group 2: Methodology Overview
- The JarvisIR architecture employs a VLM to autonomously plan task sequences and select appropriate expert models for image restoration [9].
- The CleanBench dataset, comprising 150K synthetic and 80K real-world images, was developed to support training and evaluation [12][15].
- The MRRHF alignment algorithm combines supervised fine-tuning and human feedback to improve model generalization and decision stability [9][27].

Group 3: Training Framework
- Training consists of two phases: supervised fine-tuning (SFT) on synthetic data, followed by MRRHF alignment on real-world data [23][27].
- MRRHF employs reward modeling to assess image quality and guide VLM optimization [28].

Group 4: Experimental Results
- JarvisIR-MRRHF demonstrates superior decision-making compared to other strategies, achieving a score of 6.21 on the CleanBench-Real validation set [43].
- In restoration quality, JarvisIR-MRRHF outperforms existing methods across weather conditions, with an average improvement of 50% in perceptual metrics [47].

Group 5: Technical Highlights
- Using a VLM as the control center is a novel application in image restoration, improving contextual understanding and task planning [52].
- The collaborative expert-model mechanism allows tailored responses to different weather-induced image degradations [52].
- The release of the CleanBench dataset fills a critical gap in real-world image restoration data, promoting further research and development in the field [52].
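The controller pattern described above, a VLM planning a task sequence over a registry of expert restoration models, can be sketched as follows. The expert names and the `plan()` interface are illustrative assumptions, not JarvisIR's actual API:

```python
# Minimal sketch of the VLM-as-controller loop: the VLM inspects the
# degraded image, emits an ordered list of restoration tasks, and the
# corresponding expert model is applied at each step. The identity
# lambdas stand in for real expert networks.

EXPERTS = {
    "derain":   lambda img: img,  # stand-in for a deraining model
    "dehaze":   lambda img: img,  # stand-in for a dehazing model
    "denoise":  lambda img: img,  # stand-in for a denoising model
    "lowlight": lambda img: img,  # stand-in for low-light enhancement
}

def restore(image, vlm):
    # Hypothetical call: the VLM returns e.g. ["derain", "denoise"],
    # chosen from the available tool names.
    task_sequence = vlm.plan(image, tools=list(EXPERTS))
    for task in task_sequence:
        image = EXPERTS[task](image)
    return image
```

The point of the design is that the degradation-specific knowledge lives in the experts, while the VLM contributes only the contextual judgment of which experts to run and in what order.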
A New Framework for Embodied Scenarios! Embodied-Reasoner: Cracking Complex Embodied Interaction Tasks
具身智能之心· 2025-06-21 12:06
Core Viewpoint
- The article presents the Embodied-Reasoner framework, which extends deep reasoning capabilities to embodied interactive tasks, addressing challenges unique to the setting such as multimodal interaction and diverse reasoning patterns [3][7][19].

Group 1: Research Background
- Recent deep reasoning models, such as OpenAI's o1, have shown exceptional capabilities on mathematical and programming tasks through large-scale reinforcement learning [7].
- However, the effectiveness of these models in embodied domains requiring continuous interaction with the environment has not been fully explored [7].
- The research aims to extend deep reasoning to embodied interactive tasks, tackling challenges such as multimodal interaction and diverse reasoning patterns [7].

Group 2: Embodied Interaction Task Design
- A high-level planning and reasoning task was designed around searching for hidden objects in unknown rooms, rather than low-level motion control [8].
- The task environment is built on the AI2-THOR simulator, featuring 120 unique indoor scenes and 2,100 objects [8].
- Four common task types were designed: Search, Manipulate, Transport, and Composite [8].

Group 3: Data Engine and Training Strategy
- A data engine was developed to synthesize diverse reasoning processes, presenting embodied reasoning trajectories in an observe-think-act format (sketched below) [3].
- A three-stage iterative training process (imitation learning, rejection-sampling adjustment, and reflection adjustment) enhances the model's interaction, exploration, and reflection capabilities [3][19].
- The training corpus comprises 9,390 unique task instructions with corresponding observe-think-act trajectories, covering 107 indoor scenes and 2,100 interactive objects [12][16].

Group 4: Experimental Results
- The model shows significant advantages over existing advanced models, particularly on complex long-horizon tasks, with more consistent reasoning and more efficient search behavior [3][18].
- In real-world experiments, Embodied-Reasoner achieved a success rate of 56.7% across 30 tasks, outperforming OpenAI's o1 and o3-mini [17].
- Its success rate improved by 9%, 24%, and 13% over OpenAI o1, OpenAI o3-mini, and Claude-3.7-Sonnet-thinking, respectively [18].

Group 5: Conclusion and Future Work
- The research successfully extends the deep reasoning paradigm to embodied interactive tasks, demonstrating enhanced interaction and reasoning capabilities, especially on complex long-horizon tasks [19].
- Future work may apply the model to a wider variety of embodied tasks and improve its generalization and adaptability in real-world environments [19].
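The observe-think-act trajectory format can be pictured as below; the field names and example content are hypothetical, chosen only to illustrate the structure:

```python
# Hypothetical shape of one observe-think-act training trajectory.
# Each step pairs a visual observation with free-form reasoning and
# the discrete high-level action that follows from it.

from dataclasses import dataclass

@dataclass
class Step:
    observation: str  # e.g. a frame reference or caption
    thought: str      # reasoning produced before acting
    action: str       # high-level action in the simulator

trajectory = [
    Step("A kitchen with a fridge, counter, and cabinets.",
         "A hidden cup is most likely inside the fridge or a cabinet.",
         "navigate to fridge"),
    Step("The fridge is closed in front of me.",
         "I need to open it to check the contents.",
         "open fridge"),
]
```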
π0/π0.5/A0, the Talk of the Tech World, Finally Explained! A Complete Breakdown of Functions, Scenarios, and Methodology
具身智能之心· 2025-06-21 12:06
Core Insights
- The article analyzes the π0, π0.5, and A0 models, focusing on their architectures, advantages, and functionality in robotic control and task execution [3][11][29].

Group 1: π0 Model Structure and Functionality
- π0 is built on a pre-trained vision-language model (VLM) and flow matching, trained on data from seven robots covering more than 68 tasks and over 10,000 hours [3].
- It supports zero-shot task execution through language prompts, directly controlling robots without additional fine-tuning on covered tasks [4].
- The model supports complex task decomposition and multi-stage fine-tuning, enabling intricate tasks such as folding clothes [5].
- It achieves high-frequency, precise operation, generating continuous action sequences at control frequencies of up to 50 Hz (see the flow-matching sketch after this summary) [7].

Group 2: π0 Performance Analysis
- π0 follows language instructions 20%-30% more accurately than baseline models on tasks such as table clearing and grocery bagging [11].
- For tasks similar to its pre-training data, it needs only 1-5 hours of fine-tuning data to reach high success rates, and on new tasks it performs twice as well as training from scratch [11].
- On multi-stage tasks, π0 achieves average completion rates of 60%-80% with a "pre-training + fine-tuning" recipe, outperforming models trained from scratch [11].

Group 3: π0.5 Model Structure and Advantages
- π0.5 employs a two-stage training framework and a hierarchical architecture, improving generalization from diverse data sources [12][18].
- It achieves 25%-40% higher task success rates than π0, and mixed discrete-continuous action training speeds up training threefold [17].
- The model handles long-horizon tasks effectively and can execute complex operations in unfamiliar environments, showcasing its adaptability [18][21].

Group 4: A0 Model Structure and Performance
- A0 features a layered architecture that combines high-level affordance understanding with low-level action execution, enhancing spatial reasoning [29].
- Its performance improves continuously as training environments are added, approaching baseline success rates when trained on 104 locations [32].
- Removing cross-embodiment and web data significantly degrades performance, underscoring the importance of diverse data sources for generalization [32].

Group 5: Overall Implications and Future Directions
- These advances mark a significant step toward practical deployment of robotic systems in real-world environments, with potential expansion into service robotics and industrial automation [21][32].
- The integration of diverse data sources and innovative architectures positions these models to overcome traditional limitations in robotic task execution [18][32].
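Flow matching generates an action chunk by integrating a learned velocity field from Gaussian noise toward the data distribution. Here is a minimal Euler-integration sketch; the `velocity_net` callable and its signature are stand-ins, not the π0 implementation:

```python
# Flow-matching action sampling: start from noise a_0 ~ N(0, I) and
# integrate da/dt = v(a, t | context) from t = 0 to t = 1 with fixed
# Euler steps, yielding a (horizon x action_dim) chunk of actions.

import numpy as np

def sample_action_chunk(velocity_net, context,
                        horizon=50, action_dim=7, steps=10):
    a = np.random.randn(horizon, action_dim)  # initial noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        a = a + dt * velocity_net(a, t, context)  # Euler update
    return a  # executed as a continuous sequence, e.g. at 50 Hz

# Usage with a dummy velocity field that just pulls toward zero:
chunk = sample_action_chunk(lambda a, t, c: -a, context=None)
print(chunk.shape)  # (50, 7)
```

Because the whole chunk is produced in a handful of integration steps rather than token by token, this style of decoder is what makes high-frequency continuous control practical.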
A Look at the Businesses and Products of Nearly 30 Embodied Intelligence Companies
具身智能之心· 2025-06-20 03:07
Core Insights
- The article provides an overview of notable companies in the field of embodied intelligence and their respective business focuses [2].

Company Summaries
- **Zhiyuan Robotics**: Focuses on humanoid robot development, with products like the Expedition A1/A2 capable of navigating complex terrain and performing fine motor tasks [2].
- **Unitree Robotics**: A leader in quadruped robots, known for highly dynamic motion control, with the Go1/Go2 series for consumer use and the B1/B2/H1 series for industrial applications [5].
- **Fourier Intelligence**: A general robotics company spanning humanoid robots and smart rehabilitation solutions, featuring the GR-1/GR-2 humanoid robots and upper-limb rehabilitation robots [6].
- **Deep Robotics**: Specializes in quadruped robots for power and security applications, with products like the J-series joints providing high torque performance [7].
- **Lingchu Intelligent**: Focuses on dexterous manipulation and end-to-end solutions based on reinforcement learning algorithms [13].
- **OriginBot**: Develops educational robots, including the Aelos series for programming education and Fluvo for hospital logistics [14].
- **Noematrix**: Concentrates on high-resolution multimodal tactile perception and soft/hard tactile manipulation products, providing innovative solutions for various sectors [29].
- **Galbot**: Develops general-purpose humanoid robots and quadruped robots for industrial, commercial, and household applications [28].
EMBODIED WEB AGENTS: Fusing Physical and Digital Domains for Integrated Agent Intelligence
具身智能之心· 2025-06-20 00:44
Author: Yining Hong et al.

1. Research Background and Core Problem
- Current AI agents suffer from a severe domain split: web agents (such as search-engine agents) excel at processing digital information, while embodied agents (such as robots) focus on physical interaction, and the two rarely work together. This split leaves AI unable to complete tasks that require cross-domain coordination.
- Human intelligence naturally fuses the physical and digital domains, whereas existing AI lacks this ability. The research team proposes the Embodied Web Agents (EWA) paradigm, which aims to build agents that seamlessly bridge physical embodiment and web reasoning (sketched below).
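The EWA idea can be pictured as a single agent loop whose action space spans both web operations and embodied operations; everything below (action names, environment interfaces, the `decide()` call) is an illustrative assumption, not the paper's actual API:

```python
# Sketch of a unified physical + digital action loop: the policy may
# emit either a web action (search, click) or an embodied action
# (navigate, pick), and observations from both worlds feed one shared
# context. All interfaces here are hypothetical.

WEB_ACTIONS = {"search", "open_page", "click"}
EMBODIED_ACTIONS = {"navigate", "pick", "place"}

def run_task(policy, web_env, physical_env, goal, max_steps=50):
    context = [("goal", goal)]
    for _ in range(max_steps):
        action, arg = policy.decide(context)  # hypothetical interface
        if action in WEB_ACTIONS:
            obs = web_env.step(action, arg)       # e.g. page contents
        elif action in EMBODIED_ACTIONS:
            obs = physical_env.step(action, arg)  # e.g. camera frame
        else:
            return obs  # a terminal "answer" / "done" action
        context.append((action, obs))
```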
VR-Robo: Real2Sim2Real, a New Paradigm for Visual Reinforcement Learning in Robot Navigation and Motion Control!
具身智能之心· 2025-06-20 00:44
Core Viewpoint
- The article discusses advances in legged robot navigation and motion control through a unified framework, VR-Robo, which addresses the challenge of transferring strategies learned in simulation to real-world deployment [3][16].

Related Work
- Previous research has explored various ways to bridge the sim-to-real gap, but many methods rely on specific sensors and struggle to balance high-fidelity rendering with accurate geometric modeling [3][4].

Solution
- The VR-Robo framework combines geometric priors from images to reconstruct consistent scenes, uses a GS-mesh hybrid representation to build interactive simulation environments, and employs neural reconstruction methods such as NeRF to generate high-fidelity scene images [4][5][16].

Experimental Analysis
- Comparative experiments against baseline methods, including imitation learning and textured-mesh approaches, evaluate the performance of the VR-Robo framework [11][12].
- Reported metrics include Success Rate (SR) and Average Reaching Time (ART), with VR-Robo performing best across difficulty levels [14][15]. (A sketch of these two metrics follows below.)

Summary and Limitations
- VR-Robo trains visual navigation policies from RGB images alone, enabling autonomous navigation in complex environments without additional sensors. However, it currently applies only to static indoor environments and has limitations in training efficiency and in the structural accuracy of the reconstructed meshes [16].
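A minimal sketch of the two reported metrics; treating ART as the mean time over successful episodes only is an assumption about the evaluation protocol:

```python
# Success Rate (SR) and Average Reaching Time (ART) over a batch of
# navigation episodes. Each episode is (reached_goal, elapsed_seconds).

def sr_and_art(episodes):
    n = len(episodes)
    reach_times = [t for ok, t in episodes if ok]
    sr = len(reach_times) / n
    art = sum(reach_times) / len(reach_times) if reach_times else float("nan")
    return sr, art

episodes = [(True, 12.4), (True, 9.8), (False, 30.0)]
print(sr_and_art(episodes))  # (0.666..., 11.1)
```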
HKUST Intelligent Construction Lab Is Recruiting Postdocs, PhD Students, and Research Assistants (Robotics)
具身智能之心· 2025-06-20 00:44
Core Viewpoint
- The article highlights recruitment opportunities in the field of intelligent construction, specifically the development and application of multi-rotor drones and underwater robots, under the guidance of Professor Jack C.P. Cheng at the Hong Kong University of Science and Technology [1][4][5].

Group 1: Research Directions
- Direction 1: Development and application of multi-rotor drones, aiming for autonomous navigation and exploration in GPS-denied environments. Candidates should be familiar with ROS programming and SLAM algorithms and have experience in drone development [4].
- Direction 2: Underwater target recognition and 3D reconstruction using underwater robots, focusing on image enhancement and segmentation of underwater facilities. Candidates should have knowledge of computer vision, deep learning, and underwater imaging principles [5][6].

Group 2: Compensation and Benefits
- PhD students can receive an annual scholarship of HK$225,120 (approximately HK$18,760 per month) and additional government scholarships totaling HK$337,200 (approximately HK$28,100 per month). They may also receive extra scholarships and fee waivers [8].
- Postdoctoral researchers and research assistants will be offered competitive salaries based on their capabilities [8].

Group 3: Application Process
- Interested applicants are encouraged to send their CVs and relevant achievements via email to Professor Jack C.P. Cheng and Dr. Zhenyu Liang for further inquiries [9].
[Roundtable] A Robot Needs Its Steering Wheel: Is Your Teleoperation Smooth Enough?
具身智能之心· 2025-06-20 00:44
Core Viewpoint
- The article discusses the evolution of teleoperation (遥操) and embodied intelligence, emphasizing the shift from rule-based to data-driven paradigms in robotics, which has driven significant advances across the industry [3][4].

Group 1: Evolution of Technology
- Embodied intelligence is not a new concept; it originated in the 1950s but has gained prominence recently thanks to advances in robot learning, which enable tasks that were previously difficult to automate, such as folding clothes and tying shoelaces [3].
- The robotics industry is transitioning from a rule-driven automation era to a human-machine symbiosis era, akin to the transition from horse-drawn carriages to automobiles [4].

Group 2: Current Industry Landscape
- Robotics today lacks standardized operating systems and frameworks, much like mobile phones before the advent of Android, indicating the need for a mature operating system for embodied robots [4].
- The emergence of large models has propelled the industry forward, creating a more diverse supply chain and paving the way for new product categories [4].

Group 3: Future Directions
- Commercial viability requires not only fully autonomous solutions but also a gradual deployment strategy, suggesting the need for a new operating system for embodied robots, referred to as ROS 3.0 [5].
- The article invites discussion on the effectiveness of current teleoperation systems, the ideal hardware and software for embodied robots, and the design of user interaction [5].