具身智能之心
Behavior Foundation Models Enable Efficient Whole-Body Control of Humanoid Robots
具身智能之心· 2025-07-23 08:45
Core Viewpoint
- Humanoid robots are gaining unprecedented attention as multifunctional platforms for complex motion control, human-robot interaction, and general physical intelligence, but achieving efficient whole-body control remains a fundamental challenge [1][2].

Group 1: Overview of the Behavior Foundation Model (BFM)
- The article discusses the emergence of the Behavior Foundation Model (BFM) as a solution to the limitations of traditional controllers, enabling zero-shot or rapid adaptation to various downstream tasks through large-scale pre-training [1][2].
- A BFM is defined as a special type of foundation model aimed at controlling agent behavior in dynamic environments, rooted in the principles of general foundation models such as GPT-4 and CLIP and pre-trained on large-scale behavior data [12][13].

Group 2: Evolution of Humanoid Whole-Body Control Algorithms
- The evolution of humanoid whole-body control algorithms is summarized in three stages: model-based controllers, learning-based task-specific controllers, and behavior foundation models [4][6][7].
- Model-based controllers rely heavily on physical models and require complex manual design, while learning-based controllers generalize poorly across tasks [6][7][8].

Group 3: BFM Methodology and Algorithms
- The article categorizes current BFM construction methods into three types: goal-conditioned learning, intrinsic-reward-driven learning, and forward-backward representation learning (a goal-conditioned sketch follows this summary) [13].
- A notable example of a goal-conditioned method is MaskedMimic, which learns foundational motor skills through motion tracking and supports seamless task switching [18][20].

Group 4: Applications and Limitations of BFM
- BFMs have potential applications in fields including humanoid robotics, virtual agents in gaming, Industry 5.0, and medical assistance robots, enabling rapid adaptation to diverse tasks [31][33].
- However, BFMs face limitations such as difficult sim-to-real transfer, where discrepancies between simulated and real-world dynamics hinder practical deployment [32][34].

Group 5: Future Research Opportunities and Risks
- Future research opportunities include integrating multimodal inputs, developing advanced machine learning systems, and establishing standardized evaluation mechanisms for BFMs [36][38].
- Risks associated with BFMs include ethical concerns over training-data bias, data bottlenecks, and the need for robust safety mechanisms to ensure reliability in open environments [36][39].
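To make the goal-conditioned learning idea concrete, here is a minimal sketch, not any paper's actual implementation, of a goal-conditioned policy trained by behavior cloning; the network sizes, dimensions, and synthetic data are all illustrative assumptions.

```python
# Minimal goal-conditioned behavior-cloning sketch (illustrative only).
import torch
import torch.nn as nn

STATE_DIM, GOAL_DIM, ACTION_DIM = 32, 8, 12  # hypothetical humanoid dims

class GoalConditionedPolicy(nn.Module):
    """pi(a | s, g): condition the action on state plus a goal embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + GOAL_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, ACTION_DIM),
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))

policy = GoalConditionedPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Stand-in for a large-scale behavior dataset: (state, goal, expert action).
states = torch.randn(1024, STATE_DIM)
goals = torch.randn(1024, GOAL_DIM)
expert_actions = torch.randn(1024, ACTION_DIM)

for step in range(100):
    loss = nn.functional.mse_loss(policy(states, goals), expert_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

At pre-training scale the same pattern would be applied to large motion datasets, with new tasks expressed as new goal inputs rather than retrained controllers.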
Being-H0: A VLA Model for Learning Dexterous Manipulation from Large-Scale Human Videos
具身智能之心· 2025-07-23 08:45
Core Insights
- The article discusses advances in vision-language-action (VLA) models and the challenges facing robotics, particularly complex dexterous manipulation tasks limited by data availability [3][4].

Group 1: Research Background and Motivation
- Large language models and multimodal models have made significant progress, but robotics still lacks a transformative "ChatGPT moment" [3].
- Existing VLAs struggle with dexterous tasks because they rely on synthetic data or limited teleoperation demonstrations, which are especially scarce for fine manipulation due to high hardware costs [3].
- Human videos contain rich real-world manipulation data, but learning from them raises challenges such as data heterogeneity, hand-motion quantization, cross-modal reasoning, and transfer to robot control [3].

Group 2: Core Methodology
- The article introduces Physical Instruction Tuning, a paradigm with three phases (pre-training, physical space alignment, and post-training) for transferring human hand-movement knowledge to robotic manipulation [4].

Group 3: Pre-training Phase
- The pre-training phase treats the human hand as the ideal manipulator, with robotic hands as simplified versions, and trains a foundational VLA on large-scale human videos [6].
- Inputs include visual information, language instructions, and parameterized hand movements, optimizing the mapping from vision and language to motion [6][8].

Group 4: Physical Space Alignment
- Physical space alignment addresses interference from differing camera parameters and coordinate systems through weak-perspective projection alignment and motion distribution balancing [10][12].
- The model adapts to specific robots by projecting the robot's proprioceptive state into the model's embedding space and generating executable actions through learnable query tokens [13].

Group 5: Key Technologies
- The article discusses motion tokenization and cross-modal fusion, emphasizing the need to retain fine motion precision while discretizing continuous movements (a tokenization sketch follows this summary) [14][17].
- Hand movements are decomposed into wrist and finger components, each tokenized separately, with reconstruction accuracy ensured through a combination of loss functions [18].

Group 6: Dataset and Experimental Results
- The UniHand dataset, comprising over 440,000 task trajectories and 1.3 billion frames, supports large-scale pre-training and covers diverse tasks and data sources [21].
- Experimental results show that the Being-H0 model outperforms baseline models on hand-motion generation and translation tasks, with better spatial accuracy and semantic alignment [22][25].

Group 7: Long-Sequence Motion Generation
- The model effectively generates long motion sequences (2-10 seconds) using soft-format decoding, which helps maintain trajectory stability [26].

Group 8: Real-Robot Experiments
- In practical pick-and-place tasks, Being-H0 achieves significantly higher success rates than baseline models: 65% on unseen toys and 60% in cluttered scenes [28].
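As a toy illustration of the tokenization point above, the sketch below discretizes continuous wrist and finger parameters into per-channel integer tokens and reconstructs them. Being-H0's actual tokenizer is learned and more sophisticated; the bin count, value ranges, and parameter dimensions here are assumptions.

```python
# Illustrative motion tokenization: uniform binning of continuous hand
# parameters into discrete tokens, plus the inverse decoding step.
import numpy as np

N_BINS = 256           # hypothetical vocabulary size per channel
LOW, HIGH = -1.0, 1.0  # assumed normalized motion range

def tokenize(motion):
    """Map continuous values in [LOW, HIGH] to integer tokens."""
    clipped = np.clip(motion, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)

def detokenize(tokens):
    """Invert the binning back to approximate continuous values."""
    return tokens / (N_BINS - 1) * (HIGH - LOW) + LOW

wrist = np.random.uniform(-1, 1, size=(10, 6))     # 6-DoF wrist pose (assumed)
fingers = np.random.uniform(-1, 1, size=(10, 45))  # finger params (assumed)

wrist_tokens, finger_tokens = tokenize(wrist), tokenize(fingers)
recon_error = np.abs(detokenize(wrist_tokens) - wrist).max()
print(f"max wrist reconstruction error: {recon_error:.4f}")  # <= half bin width
```

The separate wrist and finger streams mirror the decomposition described above; a learned tokenizer would replace uniform bins with a codebook trained under reconstruction losses.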
How Far Is It from "Thinking Well" to "Doing Well"? Demystifying Brain-Cerebellum Collaboration in Embodied AI
具身智能之心· 2025-07-23 08:45
Core Viewpoint
- The article discusses the integration of "brain," "cerebellum," and "body" in embodied intelligent systems, emphasizing the need for improved collaboration and data acquisition for advancing artificial general intelligence (AGI) [2][3][4].

Group 1: Components of Embodied Intelligence
- The "brain" is responsible for perception, reasoning, and planning, utilizing large language models and vision-language models [2].
- The "cerebellum" focuses on movement, employing motion-control algorithms and feedback systems to enhance the naturalness and precision of robotic actions [2].
- The "body" serves as the physical entity that executes the plans generated by the "brain" and the movements coordinated by the "cerebellum," embodying the principle of the unity of knowing and doing [2].

Group 2: Challenges and Future Directions
- The "brain" needs enhanced reasoning capabilities, enabling it to infer task paths without explicit instructions or maps [3].
- The "cerebellum" should become more intuitive, allowing robots to react flexibly in complex environments and handle delicate objects with care [3].
- Brain-cerebellum collaboration requires improvement, as current communication is slow and responses are delayed; the aim is a seamless interaction system [3].

Group 3: Data Acquisition
- The article highlights the challenges in data collection, noting that it is often difficult, expensive, and noisy, which hinders the training of intelligent systems [3].
- There is a call for a training repository that is realistic, diverse, and transferable to enhance data quality and accessibility [3].

Group 4: Expert Discussion
- A roundtable discussion is planned with experts from the Beijing Academy of Artificial Intelligence and Zhiyuan Robotics to explore recent technological advancements and future pathways for embodied intelligence [4].
Course Launching Soon! A Hands-On Tutorial on Goal-Oriented Navigation Algorithms for Embodied Intelligence~
具身智能之心· 2025-07-23 08:45
Core Viewpoint
- Goal-oriented navigation empowers robots to autonomously complete navigation tasks from goal descriptions alone, marking a significant shift from traditional visual-language navigation systems [2][3].

Group 1: Technology Overview
- Embodied navigation is a core area of embodied intelligence, resting on three technical pillars: language understanding, environmental perception, and path planning [2].
- Goal-oriented navigation requires robots to explore and plan paths in unfamiliar 3D environments using only goal descriptions such as coordinates, images, or natural language [2].
- The technology has been industrialized in verticals including delivery, healthcare, and hospitality, with companies such as Meituan and Aethon deploying autonomous delivery robots [3].

Group 2: Technological Evolution
- The evolution of goal-oriented navigation can be divided into three generations (the frontier-exploration sketch after this summary illustrates the second-generation modular idea):
  1. First generation: end-to-end methods focused on reinforcement learning and imitation learning, achieving breakthroughs in point navigation and closed-set image navigation tasks [5].
  2. Second generation: modular methods that explicitly construct semantic maps, decomposing the task into exploration and goal localization [5].
  3. Third generation: integration of large language models (LLMs) and vision-language models (VLMs) to enhance knowledge reasoning and open-vocabulary target matching [7].

Group 3: Challenges and Learning Path
- The complexity of embodied navigation, particularly goal-oriented navigation, demands knowledge from multiple fields, including natural language processing, computer vision, and reinforcement learning [9].
- A new course has been developed to address these learning challenges, focusing on a quick start, building a research framework, and combining theory with practice [10][11][12].

Group 4: Course Structure
- The course comprises six chapters covering the core framework of semantic navigation, the Habitat simulation ecosystem, end-to-end navigation methodologies, modular navigation architectures, and LLM/VLM-driven navigation systems [16][18][19][21][23].
- A major course project focuses on reproducing the VLFM algorithm and deploying it in real-world scenarios, letting students engage in practical applications [25].
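For the second-generation modular idea, here is a minimal frontier-exploration sketch: free cells bordering unknown space are candidate exploration targets on an occupancy grid. The grid encoding and sizes are illustrative assumptions, not the course's code.

```python
# Frontier detection on an occupancy grid (illustrative sketch).
import numpy as np

FREE, OCCUPIED, UNKNOWN = 0, 1, -1  # assumed cell encoding

def find_frontiers(grid):
    """Return (row, col) of free cells adjacent to at least one unknown cell."""
    frontiers = []
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != FREE:
                continue
            # 8-neighborhood window; slicing clips safely at the borders.
            window = grid[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            if (window == UNKNOWN).any():
                frontiers.append((r, c))
    return frontiers

grid = np.full((8, 8), UNKNOWN)
grid[2:6, 2:6] = FREE      # explored free region
grid[4, 4] = OCCUPIED      # an obstacle inside it
print(find_frontiers(grid))  # cells on the rim of the explored region
```

A modular navigator would pair this with a semantic map and a goal-localization step, picking the frontier most likely to lead to the described target.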
X-Nav: An End-to-End Cross-Platform Navigation Framework with a Universal Policy for Zero-Shot Transfer
具身智能之心· 2025-07-22 06:29
Core Viewpoint
- The article presents the X-Nav framework, which enables end-to-end cross-embodiment navigation for mobile robots, allowing a single universal policy to be deployed across different robot forms, including wheeled and quadrupedal robots [3][4].

Group 1: Existing Limitations
- Current navigation methods are usually designed for a specific robot form, limiting their generalizability across platforms [4].
- Navigation requires robots to move collision-free through complex environments using visual observations, target positions, and proprioceptive information, but existing methods face significant limitations here [4].

Group 2: X-Nav Architecture
- The X-Nav architecture consists of two core phases: expert policy learning and universal policy refinement [5][8].
- Phase 1 trains multiple expert policies with deep reinforcement learning (DRL) on randomly generated robot embodiments [6].
- Phase 2 refines these expert policies into a single universal policy using a Nav-ACT transformer model [8].

Group 3: Training and Evaluation
- Training uses the Proximal Policy Optimization (PPO) algorithm with a reward function combining task rewards and regularization rewards, tailored to wheeled and quadrupedal robots (a hedged sketch of such a composite reward follows this summary) [10][16].
- Experimental validation shows X-Nav outperforming other methods in success rate (SR) and success rate weighted by path length (SPL), with Jackal reaching an SR of 90.4% and an SPL of 0.84 [13].
- Scalability studies indicate that increasing the number of training embodiments significantly improves adaptation to unseen robots [14].

Group 4: Ablation Studies
- Ablation studies validate the design choices: using L1 loss instead of MSE reduces performance because large errors are penalized too weakly [21].
- Executing complete action chunks delays quadrupedal adaptation to dynamic changes, while omitting temporal ensembling (TE) produces rough actions on wheeled robots [21].

Group 5: Real-World Testing
- Real-world tests in indoor and outdoor environments achieve an 85% success rate and an SPL of 0.79, confirming X-Nav's generalizability across different sensor configurations [22].
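The article does not give X-Nav's exact reward terms, so the sketch below only illustrates the general shape of a composite navigation reward: a task term (progress toward the goal plus a collision penalty) and a regularization term (action smoothness). All weights and terms are assumptions.

```python
# Composite navigation reward sketch: task reward + regularization.
import numpy as np

def navigation_reward(prev_dist, curr_dist, collided, action, prev_action,
                      w_progress=1.0, w_collision=10.0, w_smooth=0.05):
    """One-step reward: reward progress, penalize collisions and jerky actions."""
    task = w_progress * (prev_dist - curr_dist)  # positive when moving closer
    if collided:
        task -= w_collision                      # large collision penalty
    smooth = -w_smooth * np.sum((action - prev_action) ** 2)  # regularization
    return task + smooth

r = navigation_reward(prev_dist=3.2, curr_dist=3.0, collided=False,
                      action=np.array([0.4, 0.1]),
                      prev_action=np.array([0.3, 0.1]))
print(f"step reward: {r:.3f}")
```

Under PPO, a function of this shape is evaluated every environment step; embodiment-specific terms (e.g., gait regularizers for quadrupeds) would be added per platform.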
New SOTA for Demand-Driven Robot Navigation, Success Rate Up 15%! Built Jointly by Zhejiang University and vivo
具身智能之心· 2025-07-22 06:29
Core Viewpoint
- The article discusses advances in embodied intelligence, focusing on CogDDN, a new framework developed by a research team from Zhejiang University and vivo AI Lab that enables robots to understand human needs and navigate complex environments autonomously [2][3][6].

Research Motivation
- As mobile robots integrate into daily life, they must understand human needs rather than merely execute commands; for instance, a robot should autonomously look for food when a person says they are hungry [6].
- Traditional navigation methods often struggle in unfamiliar environments because they rely on extensive data training, motivating a more generalizable approach that mimics human reasoning [7].

Framework Overview
- The CogDDN framework is based on dual-process theory from psychology, combining heuristic (System 1) and analytical (System 2) decision-making processes to enhance navigation capabilities [9][10].
- The framework consists of three main components: a 3D perception module, a demand-matching module, and a dual-process decision-making module [13].

3D Robot Perception Module
- The team utilized the UniMODE method for single-view 3D object detection, improving the robot's ability to navigate indoor environments without relying on multiple views or depth sensors [15].

Demand Matching Module
- This module aligns human needs with object characteristics, using supervised fine-tuning to improve the accuracy with which large language models (LLMs) match user requests to suitable objects [16].

Dual-Process Decision Making
- The heuristic process makes quick, intuitive decisions based on past experience, while the analytical process focuses on error reflection and strategy optimization (a toy dual-process loop is sketched after this summary) [18][23].
- The Explore and Exploit modules within the heuristic process enable the system to adapt to new environments and efficiently achieve navigation goals [19][20].

Experimental Results
- CogDDN was evaluated using the AI2-THOR simulator and the ProcTHOR dataset, demonstrating a significant improvement over existing state-of-the-art methods, with a navigation success rate (NSR) of 38.3% and a 34.5% success rate in unseen scenes [26][27].
- Removing key components such as the Exploit module or the chain of thought (CoT) significantly decreased system performance, highlighting their importance in decision-making [29][30].

Conclusion
- CogDDN is a cognition-driven navigation system that continuously learns, adapts, and optimizes its strategies, effectively simulating human-like reasoning in robots [33][34].
- Its dual-process design enhances performance on demand-driven navigation tasks, laying a solid foundation for the advancement of intelligent robotic technologies [35].
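As a toy version of the System 1 / System 2 split described above, the sketch below tries a fast cached heuristic first and falls back to a slower analytical step when confidence is low. The thresholds, functions, and experience store are illustrative assumptions, not CogDDN's implementation.

```python
# Dual-process decision loop sketch: fast heuristic with analytical fallback.
from dataclasses import dataclass

@dataclass
class Decision:
    action: str
    confidence: float

def system1_heuristic(observation, experience):
    """Fast path: reuse a cached action for a previously seen situation."""
    cached = experience.get(observation)
    return Decision(cached, 0.9) if cached else Decision("explore", 0.3)

def system2_analytic(observation):
    """Slow path: stand-in for deliberate reasoning (an LLM call in CogDDN)."""
    return Decision(f"plan_route_to({observation})", 0.8)

def decide(observation, experience, threshold=0.5):
    fast = system1_heuristic(observation, experience)
    if fast.confidence >= threshold:
        return fast                           # System 1 handles it
    slow = system2_analytic(observation)      # reflect and re-plan
    experience[observation] = slow.action     # fold the result into experience
    return slow

experience = {"kitchen_seen": "goto_fridge"}
print(decide("kitchen_seen", experience))  # System 1 cache hit
print(decide("novel_room", experience))    # falls back to System 2
```

Writing the analytical result back into the experience store mirrors the learn-and-adapt loop the article attributes to CogDDN: repeated situations migrate from the slow path to the fast one.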
Nearly 600 Million Yuan More in Funding! Robot Moz1 Rolls into the Office, Sprinting Toward a Trillion-Yuan Market
具身智能之心· 2025-07-22 06:29
Core Viewpoint
- The article highlights the rapid growth of and investment in embodied intelligence, focusing on Qianxun Intelligent, which has recently secured significant funding and launched its commercial humanoid robot Moz1, showcasing advanced capabilities across various tasks [3][15][21].

Investment and Market Dynamics
- The embodied intelligence sector is experiencing explosive growth, with major players from Silicon Valley to China competing to move AI from virtual to physical applications [4][5].
- Qianxun Intelligent has closed a Pre-A+ round of nearly 600 million yuan led by JD.com, indicating strong investor confidence and significant market potential [15][19].
- Since its founding in February 2024, the company has rapidly attracted investment and become a capital-market favorite [16][17].

Technological Advancements
- Qianxun's Moz1 humanoid robot features 26 degrees of freedom and is built on high-density integrated force-control joints, exceeding competitors such as Tesla's Optimus by 15% in power density [22][24].
- The robot can perform complex tasks such as cleaning and organizing, demonstrating advanced multimodal perception and control capabilities [29][35].
- Its VLA (vision-language-action) model and Spirit v1 framework enable seamless integration of perception, understanding, and action, significantly enhancing operational efficiency [37][41].

Commercial Strategy
- Qianxun takes a market-driven approach, conducting extensive research across sectors to identify high-value applications for its technology [56][58].
- The company aims to penetrate multiple high-value markets, including logistics and healthcare, leveraging its international experience to expand globally [59][60].
- Its business model creates a feedback loop between market needs, technology development, and product deployment [57][68].

Competitive Edge
- Qianxun stands out through its distinctive technology path, rapid iteration capabilities, and a team of top global robotics talent [62][66].
- Its strategic focus on high-value scenarios and its ability to adapt quickly to market changes have earned significant trust and investment from industry players [68][70].
It's Surreal! Embodied AI Has a Flood of Openings on One Side and Can't Hire on the Other...
具身智能之心· 2025-07-22 06:29
Recently, members of our community came to me to vent: "Feng, why do so many embodied AI companies that clearly have money, with more funding than they can spend and plenty of openly posted positions, keep interviewing without ever extending offers, all while claiming they can't find anyone?"

As someone who lived through the entire autonomous-driving development cycle, the answer is simple. Companies have money in their pockets, but they no longer dare to spend it freely; they stay cautious and budget carefully for the long haul. This industry cycle will still be long: spend recklessly and without a plan, and you die fast. The shakeout will play out over the next 2-3 years.

Many embodied AI companies' products (including hardware, algorithms, and data) are still immature, a point we have analyzed in detail inside the 具身智能之心 knowledge community. So the researchers with genuinely strong results are the ones every company races to recruit, in directions such as humanoid stability, data scaling, effective data use, and generalization. With no inflection point for fundamental technical breakthroughs in sight, everyone wants to stockpile provisions for the coming winter. For job seekers, this means you need both solid technical skills and a research direction closely matched to embodied AI.

具身智能之心知识星球, the largest embodied-AI technical community in China, has long supplied the industry and individuals with talent and with industry and academic information, and now covers nearly all mainstream embodied AI companies and most well-known research institutions at home and abroad. If you want first-hand insight into the industry, job hunting, and its pain points, you are welcome to join us. A serious ...
Surpassing π0 Across All Task Types! ByteDance Launches Large VLA Model GR-3, Advancing General Robot Policies
具身智能之心· 2025-07-22 04:10
Core Viewpoint
- GR-3, developed by ByteDance, is a large-scale vision-language-action (VLA) model designed to advance general robot policies, demonstrating exceptional generalization, efficient fine-tuning, and execution of complex tasks [2][7].

Group 1: Performance and Advantages
- GR-3 excels at generating action sequences for dual-arm mobile robots from natural language instructions and environmental observations, outperforming current advanced baseline methods [2][7].
- The model has 4 billion parameters in total, balancing performance and efficiency by optimizing the action-generation module (a generic VLA-style forward pass is sketched after this summary) [10][12].

Group 2: Core Capabilities and Innovations
- GR-3 addresses three major pain points of traditional robots: incomplete recognition, slow learning, and poor task execution [7].
- It features a dual-path design combining data-driven methods with architectural optimization, enabling it to understand abstract instructions and perform precise operations [7][12].
- Key innovations include stronger generalization, efficient adaptation from minimal human demonstration data, and stable performance on long-horizon and intricate tasks [12][14].

Group 3: Training Methodology
- The training strategy takes a "trinity" approach, integrating robot trajectories, vision-language data, and human demonstrations for progressive learning [15][19].
- Joint training with vast internet vision-language datasets improved the model's recognition of novel objects by roughly 40% [19][23].

Group 4: Hardware Integration
- The ByteMini robot, designed for GR-3, features a flexible 7-degree-of-freedom arm and a stable omnidirectional base, enhancing its operational capabilities in varied environments [25][26].
- The robot can autonomously generate task combinations and control environmental variables, ensuring effective task execution [21][25].

Group 5: Experimental Validation
- GR-3 was tested on three challenging task suites, demonstrating strong adaptability to new environments and abstract instructions with a 77.1% success rate on previously unseen directives [30][38].
- On a long-horizon task, GR-3 maintained an 89% success rate on multi-step actions, significantly outperforming previous models [42].
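The article does not detail GR-3's internals beyond its parameter count and action-generation focus, so the sketch below only shows the generic VLA pattern it describes: fuse a vision embedding with a language embedding and decode a short action chunk. Every module, name, and dimension is a hypothetical stand-in, not GR-3's architecture.

```python
# Generic VLA-style forward pass sketch: vision + language -> action chunk.
import torch
import torch.nn as nn

D_MODEL, CHUNK, ACTION_DIM = 128, 8, 14  # hypothetical sizes

class TinyVLA(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_proj = nn.Linear(512, D_MODEL)  # stand-in image features
        self.text_proj = nn.Linear(256, D_MODEL)    # stand-in text features
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True),
            num_layers=2)
        self.action_head = nn.Linear(D_MODEL, CHUNK * ACTION_DIM)

    def forward(self, image_feat, text_feat):
        # Fuse the two modalities as a short token sequence.
        tokens = torch.stack(
            [self.vision_proj(image_feat), self.text_proj(text_feat)], dim=1)
        fused = self.fuse(tokens).mean(dim=1)  # pool fused tokens
        return self.action_head(fused).view(-1, CHUNK, ACTION_DIM)

model = TinyVLA()
actions = model(torch.randn(1, 512), torch.randn(1, 256))
print(actions.shape)  # torch.Size([1, 8, 14]): an 8-step action chunk
```

Predicting a chunk of future actions rather than a single step is a common VLA design for long-horizon stability, which is consistent with the multi-step performance the article reports.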
Let's Do Something Awesome Together! 具身智能之心 Is Recruiting Partners...
具身智能之心· 2025-07-22 03:33
Core Viewpoint
- The rapid development of the embodied intelligence field is highlighted, with several leading companies preparing for IPOs, emphasizing the importance of collaboration and shared learning within the industry [1].

Group 1: Collaboration and Community
- The industry thrives on collective efforts and shared experiences, particularly in the context of entrepreneurship in the embodied intelligence sector [1].
- The company aims to create a platform that gathers talented individuals from across the industry to foster progress [1].

Group 2: Project Collaboration
- The company is establishing research teams in major cities including Beijing, Shanghai, Shenzhen, Guangzhou, Hangzhou, and Wuhan, seeking to recruit around 10 individuals per city with over 2 years of experience in embodied algorithms and robotics research [3].

Group 3: Educational Development
- The company invites experts in the embodied intelligence field to contribute to the creation of online courses on advanced topics such as large models, reinforcement learning, and robot motion planning [4].

Group 4: Recruitment Criteria
- The company seeks candidates with a PhD or higher, including those currently pursuing a doctorate, and prefers individuals with at least 2 years of research and development experience in the industry [5].

Group 5: Compensation and Benefits
- The company offers a significant profit-sharing model and resource sharing across the industry, with opportunities for both part-time and full-time positions [6].