具身智能之心
Class starting soon! The embodied-intelligence goal-oriented navigation algorithms and hands-on tutorial is here~
具身智能之心· 2025-07-23 08:45
Core Viewpoint
- Goal-oriented navigation empowers robots to autonomously complete navigation tasks from goal descriptions alone, marking a significant shift from traditional vision-language navigation systems [2][3].

Group 1: Technology Overview
- Embodied navigation is a core area of embodied intelligence, resting on three technical pillars: language understanding, environmental perception, and path planning [2].
- Goal-oriented navigation requires robots to explore and plan paths in unfamiliar 3D environments using only goal descriptions such as coordinates, images, or natural language [2].
- The technology has been industrialized in several verticals, including delivery, healthcare, and hospitality, with companies such as Meituan and Aethon deploying autonomous delivery robots [3].

Group 2: Technological Evolution
- The evolution of goal-oriented navigation can be divided into three generations:
  1. First generation: end-to-end methods centered on reinforcement learning and imitation learning, achieving breakthroughs in point navigation and closed-set image navigation tasks [5].
  2. Second generation: modular methods that explicitly construct semantic maps, breaking the task into exploration and goal localization (a minimal sketch of this pipeline follows the course outline below) [5].
  3. Third generation: integration of large language models (LLMs) and vision-language models (VLMs) to strengthen knowledge reasoning and open-vocabulary target matching [7].

Group 3: Challenges and Learning Path
- The complexity of embodied navigation, and of goal-oriented navigation in particular, demands knowledge from multiple fields, including natural language processing, computer vision, and reinforcement learning [9].
- A new course has been developed to address these learning challenges, focusing on quick entry, building a research framework, and combining theory with practice [10][11][12].

Group 4: Course Structure
- The course comprises six chapters covering the core framework of semantic navigation, the Habitat simulation ecosystem, end-to-end navigation methodologies, modular navigation architectures, and LLM/VLM-driven navigation systems [16][18][19][21][23].
- A major project within the course focuses on reproducing the VLFM algorithm and deploying it in real-world scenarios, letting students work on practical applications [25].
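To make the modular "second generation" pipeline concrete, here is a minimal sketch of an explore-then-localize loop: the agent maintains a semantic map of detected objects, explores frontiers until the goal category appears, and then plans a path to it. All names and interfaces (perceive, frontier_explore, plan_to) are illustrative assumptions, not drawn from the course or from any specific system.

```python
# Minimal sketch of a modular goal-oriented navigation loop:
# explore until the goal category appears in a semantic map, then plan to it.
from dataclasses import dataclass, field

@dataclass
class SemanticMap:
    # maps object category -> list of (x, y) cells where it was observed
    detections: dict = field(default_factory=dict)

    def update(self, category: str, cell: tuple) -> None:
        self.detections.setdefault(category, []).append(cell)

    def locate(self, category: str):
        cells = self.detections.get(category)
        return cells[-1] if cells else None

def goal_navigation(goal_category, perceive, frontier_explore, plan_to, max_steps=500):
    """Explore until the goal is seen, then switch to goal-directed planning."""
    sem_map = SemanticMap()
    for _ in range(max_steps):
        for category, cell in perceive():          # detections from the current view
            sem_map.update(category, cell)
        target = sem_map.locate(goal_category)
        if target is not None:
            return plan_to(target)                 # goal localization succeeded
        frontier_explore(sem_map)                  # otherwise keep exploring frontiers
    return None                                    # exploration budget exhausted
```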
X-Nav: an end-to-end cross-platform navigation framework where a single universal policy achieves zero-shot transfer
具身智能之心· 2025-07-22 06:29
Core Viewpoint
- The article presents the X-Nav framework, which enables end-to-end cross-embodiment navigation for mobile robots, allowing a single universal policy to be deployed across different robot forms, including wheeled and quadrupedal robots [3][4].

Group 1: Existing Limitations
- Current navigation methods are usually designed for a specific robot form, which limits their generalizability across platforms [4].
- Navigation requires robots to move without collisions through complex environments, relying on visual observations, target positions, and proprioceptive information, but existing methods face significant limitations here [4].

Group 2: X-Nav Architecture
- The X-Nav architecture consists of two core phases: expert policy learning and universal policy refinement [5][8].
- Phase 1 trains multiple expert policies with deep reinforcement learning (DRL) on randomly generated robot embodiments [6].
- Phase 2 refines these expert policies into a single universal policy using a Nav-ACT transformer model [8].

Group 3: Training and Evaluation
- Training uses the Proximal Policy Optimization (PPO) algorithm, with a reward function combining task rewards and regularization rewards tailored to wheeled and quadrupedal robots [10][16].
- Experimental validation shows that X-Nav outperforms other methods in success rate (SR) and success weighted by path length (SPL), with Jackal reaching an SR of 90.4% and an SPL of 0.84 [13].
- Scalability studies indicate that increasing the number of training embodiments significantly improves adaptability to unseen robots [14].

Group 4: Ablation Studies
- Ablation studies validate the design choices: using L1 loss instead of MSE reduces performance because large errors are penalized too weakly [21].
- Executing complete action chunks delays quadrupedal adaptation to dynamic changes, while omitting time integration (TE) leads to rough actions on wheeled robots [21].

Group 5: Real-World Testing
- Real-world tests in indoor and outdoor environments show a success rate of 85% and an SPL of 0.79, confirming that X-Nav generalizes across different sensor configurations [22].
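The SR and SPL figures above follow the standard embodied-navigation metrics. Below is a minimal sketch of how success rate and success weighted by path length (SPL) are typically computed from episode logs; the Episode fields are assumptions about how such logs might be structured, not X-Nav's actual evaluation code.

```python
# Standard navigation metrics: SR and SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)
from dataclasses import dataclass

@dataclass
class Episode:
    success: bool          # did the robot stop within the success radius?
    shortest_path: float   # geodesic distance from start to goal (meters)
    actual_path: float     # length of the path the robot actually took (meters)

def success_rate(episodes):
    return sum(e.success for e in episodes) / len(episodes)

def spl(episodes):
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.actual_path, e.shortest_path)
    return total / len(episodes)

# A successful but slightly inefficient episode contributes less than 1 to SPL.
print(spl([Episode(True, 10.0, 12.5), Episode(False, 8.0, 20.0)]))  # 0.4
```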
A new SOTA for robot demand-driven navigation with a 15% boost in success rate! Built jointly by Zhejiang University and vivo
具身智能之心· 2025-07-22 06:29
Core Viewpoint
- The article discusses advances in embodied intelligence, focusing on CogDDN, a new framework from a research team at Zhejiang University and vivo AI Lab that enables robots to understand human needs and navigate complex environments autonomously [2][3][6].

Research Motivation
- As mobile robots become part of daily life, they need to understand human needs rather than merely execute commands; for instance, a robot should seek out food on its own when a person says they are hungry [6].
- Traditional navigation methods often struggle in unfamiliar environments because they depend on extensive training data, motivating a more generalizable approach that mimics human reasoning [7].

Framework Overview
- CogDDN is based on the dual-process theory from psychology, combining heuristic (System 1) and analytical (System 2) decision-making to enhance navigation; a rough sketch of this loop appears below [9][10].
- The framework consists of three main components: a 3D perception module, a demand matching module, and a dual-process decision-making module [13].

3D Robot Perception Module
- The team uses the UniMODE method for single-view 3D object detection, improving indoor navigation without relying on multiple views or depth sensors [15].

Demand Matching Module
- This module aligns human needs with object attributes, using supervised fine-tuning to improve the accuracy with which large language models (LLMs) match user requests to suitable objects [16].

Dual-Process Decision Making
- The heuristic process makes quick, intuitive decisions based on past experience, while the analytical process handles error reflection and strategy optimization [18][23].
- The Explore and Exploit modules within the heuristic process let the system adapt to new environments and reach navigation goals efficiently [19][20].

Experimental Results
- CogDDN was evaluated in the AI2-THOR simulator on the ProcTHOR dataset, showing a significant improvement over existing state-of-the-art methods, with a navigation success rate (NSR) of 38.3% and a success rate of 34.5% in unseen scenes [26][27].
- Removing key components such as the Exploit module or the chain of thought (CoT) significantly degrades performance, underlining their importance in decision-making [29][30].

Conclusion
- CogDDN is a cognition-driven navigation system that continuously learns, adapts, and optimizes its strategies, effectively simulating human-like reasoning in robots [33][34].
- Its dual-process design improves performance on demand-driven navigation tasks and lays a solid foundation for further advances in intelligent robotics [35].
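As a rough illustration of the dual-process loop described above (not the paper's actual implementation), the sketch below lets a fast System 1 policy act from accumulated experience while a slower System 2 pass reflects on failures and revises the plan. Every interface name here is a hypothetical placeholder.

```python
# Hedged sketch of a dual-process (System 1 / System 2) navigation loop.
def dual_process_navigate(env, system1, system2, demand, max_steps=200):
    experience = []                          # episodic memory shared by both systems
    obs = env.reset()
    plan = system2.initial_plan(demand, obs) # analytical: match the demand to a target object
    for _ in range(max_steps):
        action = system1.propose_action(obs, plan, experience)  # fast, intuitive step
        obs, done, error = env.step(action)
        if error:                            # collision, dead end, wrong object, ...
            reflection = system2.reflect_on_failure(obs, plan, error)
            plan = system2.revise_plan(plan, reflection)
            experience.append(reflection)    # System 1 reuses this next time
        if done:
            return True
    return False
```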
Raking in nearly 600 million yuan in new funding! The Moz1 robot rolls into the office, sprinting at full speed toward a trillion-yuan market
具身智能之心· 2025-07-22 06:29
Core Viewpoint
- The article highlights the rapid growth of, and investment in, embodied intelligence, focusing on Qianxun Intelligent, which has recently secured significant funding and launched Moz1, its commercial humanoid robot, showcasing advanced capabilities across a range of tasks [3][15][21].

Investment and Market Dynamics
- The embodied intelligence sector is growing explosively, with major players from Silicon Valley to China racing to move AI from virtual to physical applications [4][5].
- Qianxun Intelligent has closed nearly 600 million yuan in Pre-A+ funding, led by JD.com, signaling strong investor confidence and substantial market potential [15][19].
- Since its founding in February 2024 the company has rapidly attracted investment, becoming a favorite of the capital market [16][17].

Technological Advancements
- The Moz1 humanoid robot has 26 degrees of freedom and is built on high-density integrated force-controlled joints, exceeding Tesla's Optimus by 15% in power density [22][24].
- The robot can perform complex tasks such as cleaning and organizing, demonstrating advanced multimodal perception and control [29][35].
- The VLA (Vision-Language-Action) model and the Spirit v1 framework enable seamless integration of perception, understanding, and action, significantly improving operational efficiency [37][41].

Commercial Strategy
- Qianxun takes a market-driven approach, conducting extensive research across sectors to identify high-value applications for its technology [56][58].
- The company aims to enter multiple high-value markets, including logistics and healthcare, leveraging its international experience to expand globally [59][60].
- It has built a distinctive business model that forms a feedback loop between market needs, technology development, and product deployment [57][68].

Competitive Edge
- Qianxun stands out in the competitive landscape through its distinctive technical path, rapid iteration, and a team of top global robotics talent [62][66].
- Its strategic focus on high-value scenarios and its ability to adapt quickly to market changes have earned significant trust and investment from industry players [68][70].
It's surreal! Embodied AI has a flood of openings on one side and companies that can't hire on the other......
具身智能之心· 2025-07-22 06:29
Recently a member of our community came to me to vent: Feng, why do so many embodied-AI companies that are clearly flush with cash (funding they can't even spend) keep posting plenty of openings, yet never extend offers after interviews, while telling everyone they just can't find people???

As someone who has been through a full cycle of the autonomous-driving industry, the answer is simple. Everyone has money in their pockets, but nobody dares to spend it loosely anymore; companies are staying cautious and stretching every yuan for the long haul. This industry cycle will still run long: spend recklessly and without a plan and you die fast, and the shakeout will play out within the next two to three years.

Many embodied-AI companies' products (hardware, algorithms, and data alike) are still immature, something we have analyzed in detail inside the 具身智能之心 knowledge community. So the researchers with genuinely strong results are the ones every company is racing to recruit, for example in humanoid-robot stability, data scaling, effective data use, and generalization. The inflection point for breakthroughs in the underlying technology is still nowhere in sight, so everyone wants to stockpile provisions for the coming winter. For job seekers, that means you need solid technical skills on one hand and a research direction that closely matches embodied AI on the other.

The 具身智能之心 knowledge community, the largest embodied-AI technical community in China, has long been supplying the industry and individuals with talent and with industry and academic information. It now covers nearly all mainstream embodied-AI companies at home and abroad and most well-known research institutions. If you want first-hand insight into the industry, job hunting, and its pain points, you are welcome to join us. A serious ...
Surpassing π0 across all kinds of tasks! ByteDance releases GR-3, a large VLA model that advances general-purpose robot policies
具身智能之心· 2025-07-22 04:10
Core Viewpoint
- GR-3, developed by ByteDance, is a large-scale vision-language-action (VLA) model designed to advance general-purpose robot policies, demonstrating strong generalization, efficient fine-tuning, and execution of complex tasks [2][7].

Group 1: Performance and Advantages
- GR-3 excels at generating action sequences for dual-arm mobile robots from natural language instructions and environmental observations, outperforming current advanced baselines [2][7].
- The model has 4 billion parameters in total, balancing performance and efficiency through an optimized action-generation module [10][12].

Group 2: Core Capabilities and Innovations
- GR-3 addresses three major pain points of traditional robots: weak recognition, slow learning, and unreliable task execution [7].
- It features a dual-path design combining data-driven training with architectural optimization, enabling it to understand abstract instructions and perform precise operations [7][12].
- Key innovations include stronger generalization, efficient adaptation from minimal human demonstration data, and stable performance on long-horizon and intricate tasks [12][14].

Group 3: Training Methodology
- The training strategy takes a "trinity" approach, integrating robot trajectories, vision-language data, and human demonstrations for progressive learning [15][19].
- Joint training with large-scale internet vision-language datasets improved the model's recognition of new objects by roughly 40% [19][23].

Group 4: Hardware Integration
- The ByteMini robot, designed for GR-3, features a flexible 7-degree-of-freedom arm and a stable omnidirectional base, strengthening its capabilities in varied environments [25][26].
- The robot can autonomously generate task combinations and control environmental variables, ensuring effective task execution [21][25].

Group 5: Experimental Validation
- GR-3 was tested on three challenging task suites, showing strong adaptability to new environments and abstract instructions, with a 77.1% success rate on understanding new directives [30][38].
- On a long-horizon task, GR-3 maintained an 89% success rate across multi-step actions, significantly outperforming previous models [42].
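To make "generating action sequences from instructions and observations" concrete, here is a schematic sketch of how a VLA policy of this kind is typically queried in a closed loop with action chunking. The policy and robot interfaces (predict, get_observation, apply_action, task_done) are assumptions for illustration, not GR-3's actual API.

```python
# Hedged sketch of a closed-loop VLA controller with action chunking.
import numpy as np

def vla_control_loop(policy, robot, instruction: str, chunk_size: int = 8, horizon: int = 400):
    """Query the policy for a short action chunk, execute it, then re-plan."""
    steps = 0
    while steps < horizon:
        obs = robot.get_observation()                     # RGB views + proprioception
        # The model maps (language, vision, state) -> a short sequence of actions.
        action_chunk = policy.predict(instruction, obs)   # shape: (chunk_size, action_dim)
        for action in np.asarray(action_chunk)[:chunk_size]:
            robot.apply_action(action)                    # joint / end-effector targets
            steps += 1
            if robot.task_done():
                return True
    return False
```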
Let's build something awesome together! 具身智能之心 is getting ready to recruit partners.......
具身智能之心· 2025-07-22 03:33
We are delighted to see the field of embodied intelligence developing so quickly, with several star companies already preparing to go public, and along the way we have made many excellent friends.

The more we build this platform, the clearer it becomes that the industry depends on everyone's joint effort, and especially on sharing the cost of trial and error. Succeeding in one shot does not feel like real entrepreneurship, least of all in embodied AI. We have always encouraged active exchange, and we hope to serve as a platform that gathers talent from across the entire industry. We just published our one-year anniversary post, and going into the second year we hope to invite more capable leaders to join us and push the industry forward together.

We are now inviting developers and researchers in embodied AI worldwide to take part with us in embodied project collaboration and embodied education R&D.

Embodied project collaboration
We are preparing to set up project R&D teams in Beijing, Shanghai, Shenzhen, Guangzhou, Hangzhou, and Wuhan. Part-time participation is welcome; the teams will take on all kinds of horizontal (industry-commissioned) and vertical (government-funded) projects as well as corporate consulting. We plan to recruit around 10 people per city, and we hope you are an academic or engineering expert in embodied AI with more than two years of experience in embodied algorithms and robotics research. If your work covers large models / multimodal large models, Diffusion, VLA, VLA+RL, sim2real, end-to-end, embodied interaction, vision-language navigation, reinforcement learning, robot motion planning, grasping and pose estimation, tactile perception, large-model deployment and quantization-aware inference, robot simulation, or related directions, you are welcome to join us and work with us for the industry ...

Embodied education R&D
We are inviting leading figures in embodied AI to create pioneering online embodied-education courses for the industry.
Latest from NVIDIA! GraspGen: a diffusion-model-based framework for 6-DOF grasp generation
具身智能之心· 2025-07-21 08:42
Core Viewpoint
- The GraspGen framework addresses the generalization challenge in 6-DOF grasping by modeling grasp generation as an iterative diffusion process, pairing a Diffusion Transformer architecture for generation with an efficient discriminator for scoring sampled grasps [2][21].

Group 1: Core Methodology
- GraspGen models 6-DOF grasp generation as a diffusion process in SE(3) space, using a Denoising Diffusion Probabilistic Model (DDPM) for faster computation and simpler implementation than traditional energy-based models [4].
- The framework employs PointTransformerV3 (PTv3) to encode unstructured point clouds, reducing translation error by 5.3 mm and improving recall by 4% compared with PointNet++ [4].
- The noise-prediction network generates grasps through a 10-step denoising process, far fewer than the hundreds of steps typical of image diffusion; a simplified sketch of such a loop is shown below [5].

Group 2: Discriminator Innovations
- GraspGen's discriminator reuses the generator's object encoder, cutting memory usage by 21x compared with traditional methods [7].
- The discriminator is trained on a dataset generated by the generator itself, so it learns to identify failure modes such as collisions and overly distant grasps, reaching an AUC of 0.947 versus 0.886 when trained solely on offline data [16][21].

Group 3: Experimental Results
- In single-object scenarios, GraspGen's precision-recall AUC exceeds the baseline by 48% on the ACRONYM dataset, underscoring the importance of the discriminator [10].
- In cluttered scenes, GraspGen achieves the highest task success rate and grasp success rate, outperforming Contact-GraspNet by 16.9% and M2T2 by 7.8% [13].
- Real-robot experiments on a UR10 arm show an overall success rate of 81.3% across scenarios, far above M2T2 (28%) and AnyGrasp (17.6%) [19].

Group 4: Limitations and Future Directions
- GraspGen performs less well on cubical objects and depends heavily on the quality of depth sensing and instance segmentation; training requires roughly 3,000 GPU hours [21].
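The sketch below illustrates the kind of iterative denoising loop described above, simplified to a 6-D grasp parametrization (translation plus axis-angle rotation) with a standard DDPM reverse process. The real system operates on SE(3) with a learned, point-cloud-conditioned noise model; noise_model, the schedule, and all shapes here are placeholder assumptions.

```python
# Hedged sketch of few-step DDPM-style grasp sampling (simplified Euclidean parametrization).
import numpy as np

def sample_grasps(noise_model, obj_features, num_grasps=32, steps=10, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.2, steps)          # illustrative noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    grasps = rng.standard_normal((num_grasps, 6))  # start from pure noise
    for t in reversed(range(steps)):
        eps = noise_model(grasps, t, obj_features) # predicted noise, shape (num_grasps, 6)
        mean = (grasps - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(grasps.shape) if t > 0 else 0.0
        grasps = mean + np.sqrt(betas[t]) * noise  # one reverse diffusion step
    return grasps                                  # candidate grasps, to be scored by a discriminator
```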
Has the robot "GPT moment" arrived? Toyota Research Institute quietly ran the most rigorous VLA validation yet
具身智能之心· 2025-07-21 08:42
Core Viewpoint
- The article discusses advances in robotic arms, focusing on the development of Large Behavior Models (LBMs) that enable robots to perform complex tasks autonomously, showing significant improvements in performance and capability over traditional models [3][7][15].

Summary by Sections

Introduction to Robotic Arms
- Robotic arms are typically associated with simple tasks like grabbing objects or serving ice cream, but complexity rises exponentially for more intricate operations such as setting a table or assembling a bicycle [2][3].

Development of VLA Models
- Recent progress in vision-language-action (VLA) models lets robots integrate multimodal information (images, instructions, scene semantics) and execute complex tasks, moving toward more intelligent and versatile systems [3][4].

Large Behavior Models (LBM)
- LBMs represent a significant advance in robot capability: built on diffusion-model policies, they enable robots to autonomously execute complex operations with impressive results [7][10][19].
- The research, conducted by the Toyota Research Institute (TRI) and led by notable scholars, emphasizes rigorous evaluation of these models, demonstrating their effectiveness in both simulated and real-world environments [9][10].

Training and Evaluation
- The LBM was trained on a diverse dataset including 1,700 hours of robot data and underwent 1,800 real-world evaluations and over 47,000 simulated deployments, showcasing robust performance [13][14].
- The findings indicate that performance improves significantly even with limited training data, suggesting a positive trend toward effective data acquisition and performance gains [14][16].

Performance Metrics
- Evaluation metrics included success rate and task completion, with a focus on relative success rates to compare methods more fairly [26][27].
- The LBM outperformed single-task baseline models on both seen and unseen tasks, indicating robustness and adaptability [31][39].

Conclusion and Future Implications
- The research suggests that general large-scale models for robotics are on the horizon, hinting at a potential "GPT moment" for embodied intelligence [15][43].
- The results indicate that pre-training yields better task performance with less data, reinforcing the expectation that performance benefits will keep appearing as data volume grows [43][45].
VLN-PE: a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots (ICCV'25)
具身智能之心· 2025-07-21 08:42
Core Insights
- The article introduces VLN-PE, a physically realistic platform for vision-language navigation (VLN) that addresses the gap between simulation-trained models and real-world deployment challenges [3][10][15].
- The study reports a significant performance drop (34%) when existing VLN models are transferred from simulation to physically realistic settings, underscoring the need for better adaptability [15][30].
- The research examines how factors such as robot type, environmental conditions, and the use of physical controllers affect model performance [15][32][38].

Background
- VLN has become a key task in embodied AI, requiring agents to navigate complex environments from natural language instructions [6][8].
- Previous models relied on idealized simulations that ignore the physical constraints and challenges faced by real robots [9][10].

VLN-PE Platform
- VLN-PE is built on GRUTopia, supports multiple robot types, and integrates high-quality synthetic and 3D-rendered environments for comprehensive evaluation [10][13].
- The platform allows new scenes to be integrated seamlessly, broadening the scope of VLN research and assessment [10][14].

Experimental Findings
- Existing models show a 34% drop in success rate when moving from idealized simulation to physically realistic environments, revealing a significant performance gap [15][30].
- Multi-modal robustness matters: RGB-D models perform better under low-light conditions than RGB-only models [15][38].
- Training on diverse datasets improves the generalization of VLN models across environments [29][39].

Methodologies
- The article evaluates several methodologies, including single-step discrete action classification models and multi-step continuous prediction methods, and highlights the potential of diffusion policies for VLN [20][21].
- It also explores map-based zero-shot large language models (LLMs) for navigation, demonstrating their potential in VLN applications [24][25].

Performance Metrics
- The study uses standard VLN evaluation metrics, including trajectory length, navigation error, success rate, and others [18][19].
- Additional metrics are introduced to capture physical realism, such as fall rate and stuck rate, which are critical for evaluating robots in real-world conditions; a small sketch of these metrics follows below [18][19].

Cross-Embodiment Training
- Cross-embodiment training improves performance, allowing a unified model to generalize across robot types [36][39].
- Using data from multiple robot types during training yields better adaptability and performance across environments [36][39].
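As a small worked example of the physical-realism metrics mentioned above, the sketch below computes success, fall, and stuck rates from per-episode logs. The field names are assumptions about how such logs might be structured, not VLN-PE's actual schema.

```python
# Hedged sketch of physical-realism evaluation metrics for VLN episodes.
from dataclasses import dataclass

@dataclass
class PhysicalEpisode:
    success: bool      # reached the goal within the allowed error
    fell: bool         # robot lost balance during the episode
    stuck: bool        # robot could not make progress (e.g., wedged on geometry)
    nav_error: float   # final distance to the goal (meters)

def physical_metrics(episodes):
    n = len(episodes)
    return {
        "success_rate": sum(e.success for e in episodes) / n,
        "fall_rate": sum(e.fell for e in episodes) / n,
        "stuck_rate": sum(e.stuck for e in episodes) / n,
        "mean_nav_error": sum(e.nav_error for e in episodes) / n,
    }
```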