The Second Half of Embodied Intelligence: The Rise of the Southern Camp
具身智能之心· 2025-11-24 00:04
Core Viewpoint
- The article highlights the unexpectedly sustained interest in embodied intelligence, particularly in southern China, focusing on the rise of the company "Self-Variable" as a key player in the industry [1][2].

Group 1: Southern Camp's Rise
- "Self-Variable," established at the end of 2023, has completed multiple rounds of financing in less than two years, with investors including major players such as Meituan, and is rumored to be nearing a valuation of 10 billion [1].
- The emergence of "Self-Variable" positions it as a potential leader in embodied intelligence in Shenzhen and possibly nationwide, despite concerns about the lack of other strong players in the southern region [2].
- While Shenzhen has "Self-Variable," other companies such as "Zhujidi Power" have been less prominent, focusing more on hardware than on developing general embodied intelligence models [2].

Group 2: Hong Kong's Potential
- Hong Kong is seeing a noticeable increase in entrepreneurial activity around embodied intelligence, driven by strong research foundations in robotics and control systems at its three major universities [2].
- Hong Kong's rising entrepreneurial scene is seen as a complement to Shenzhen's strengths, with Shenzhen focusing on supply-chain integration and commercialization [2].

Group 3: Industry Challenges
- Despite the momentum in the southern camp, the embodied intelligence industry faces significant challenges, with many products still providing only "emotional value" rather than tangible productivity [3][5].
- Many showcased products, such as dancing robots and automated vending machines, do not address real market needs, raising questions about their long-term viability [4].
- The disparity in market demand for products like patrol robots highlights the industry's struggle with perceived value versus actual utility, indicating a need for more practical applications [5].

Group 4: Future Outlook
- The southern camp's rise is seen as a starting point for 2025, but the industry must overcome hurdles such as the need for more supporting companies in Shenzhen and bridging the gap from research to market in Hong Kong [5].
- Success will depend not on high valuations or flashy products, but on the ability to solve real-world problems with technology, reflecting the pragmatic approach the southern region must adopt going forward [5].
Xiaomi's MiMo-Embodied: Integrating Autonomous Driving and Embodied Tasks, 29 SOTA Results!
具身智能之心· 2025-11-24 00:04
Core Insights
- The article discusses Xiaomi's MiMo-Embodied, a cross-domain foundational model that integrates autonomous driving and embodied intelligence, achieving state-of-the-art (SOTA) performance across 29 benchmark tests [5][24].

Group 1: Model Overview
- MiMo-Embodied is the first open-source unified model that combines tasks from autonomous driving and embodied intelligence into a single framework, enabling positive transfer and mutual enhancement between the two domains [7][8].
- The model supports three core capabilities in autonomous driving: environment perception, state prediction, and driving planning, as well as three core capabilities in embodied intelligence: usability prediction, task planning, and spatial understanding [8].

Group 2: Training and Data Strategy
- The model employs a multi-stage training strategy with carefully designed datasets to overcome cross-domain task interference, leading to performance improvements [9][20].
- The training process consists of four stages: general and embodied knowledge learning, autonomous driving knowledge learning, chain-of-thought (CoT) reasoning fine-tuning, and reinforcement learning (RL) fine-tuning [21][27] (a hedged sketch of this staged curriculum follows after this summary).

Group 3: Performance Metrics
- MiMo-Embodied has achieved SOTA in usability prediction across five benchmarks, outperforming models like Qwen2.5-VL and GPT-4o [24].
- In task planning, it demonstrates strong long-range reasoning and causal inference capabilities, particularly in the RoboVQA benchmark [24].
- The model excels in spatial understanding and environment perception, leading in nine benchmarks, especially in 3D scene reasoning and spatial language localization [24][25].

Group 4: Comparative Analysis
- The model's performance across benchmarks shows significant improvements over previous models, with an average performance increase of 4% in embodied tasks and 8.1% in autonomous driving tasks compared to mixed training approaches [27][37].
- MiMo-Embodied's architecture and training strategy allow it to maintain high performance across both domains, achieving an average score of 62.4% in embodied tasks and 63.3% in autonomous driving tasks [37].
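To make the staged curriculum concrete, here is a minimal, hypothetical Python sketch of how such a four-stage schedule could be organized. The stage names follow the summary above; the dataset names, epoch counts, and the `train_stage` placeholder are illustrative assumptions, not Xiaomi's actual training recipe.

```python
# Hypothetical sketch of a four-stage training curriculum.
# Stage names mirror the summary; everything else is an illustrative stand-in.

STAGES = [
    {"name": "general_embodied_knowledge", "data": ["general_vqa", "embodied_qa"], "epochs": 1},
    {"name": "autonomous_driving_knowledge", "data": ["driving_perception", "driving_planning"], "epochs": 1},
    {"name": "cot_reasoning_sft", "data": ["embodied_cot", "driving_cot"], "epochs": 1},
    {"name": "rl_finetuning", "data": ["preference_pairs"], "epochs": 1},
]

def train_stage(model_state: dict, stage: dict) -> dict:
    """Placeholder: run one curriculum stage and return the updated state."""
    print(f"stage={stage['name']} datasets={stage['data']} epochs={stage['epochs']}")
    model_state["history"].append(stage["name"])
    return model_state

def run_curriculum() -> dict:
    model_state = {"history": []}
    for stage in STAGES:  # each stage resumes from the previous checkpoint
        model_state = train_stage(model_state, stage)
    return model_state

if __name__ == "__main__":
    final_state = run_curriculum()
    print("completed stages:", final_state["history"])
```

The ordering matters in such curricula: mixing all data at once is what the summary calls cross-domain interference, whereas staging lets later phases (CoT and RL fine-tuning) refine a model that already carries both embodied and driving knowledge.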
FreeAskWorld: An Interactive Closed-Loop Embodied Simulation Framework
具身智能之心· 2025-11-24 00:04
Core Insights
- The article discusses the limitations of existing Visual-Language Navigation (VLN) solutions in the field of embodied intelligence, highlighting issues such as reliance on static instructions, lack of social interaction capabilities, and inadequate simulation environments [1][2].
- FreeAskWorld, developed by Tsinghua University's AI Research Institute, introduces an innovative approach combining LLM-driven interactive simulation with Direction Inquiry Tasks to overcome these challenges, achieving social, dynamic, and realistic embodied navigation and interaction [1][2][4].

Summary by Sections

Current Challenges in VLN
- Existing VLN solutions face a "triple dilemma": reliance on static instructions, lack of social cognition, and single-dimensional simulation environments [2].
- Key deficiencies include the inability to handle dynamic scenes, the lack of social interaction, and insufficient realism in navigation environments [2].

FreeAskWorld's Approach
- FreeAskWorld leverages LLMs to create realistic social scenarios and employs a closed-loop interaction framework to facilitate dynamic adaptation (a toy sketch of this closed loop follows after this summary) [2][5].
- The system consists of three core components: LLM-driven human simulation, Direction Inquiry Tasks, and a multi-modal dataset [5][8].

Core Components
- **Human Simulation Module**: Generates diverse human behaviors that adhere to social rules, enhancing interaction realism [5][7].
- **Direction Inquiry Task**: Allows robots to actively seek help during navigation, improving performance through multi-round interactions [5][7].
- **Data Generation**: The dataset includes 63,429 annotated frames and over 17 hours of interaction data, covering both indoor and outdoor mixed scenes [8][11].

Experimental Results
- FreeAskWorld demonstrates significant performance improvements in both open-loop and closed-loop settings compared to traditional models, with interaction enhancing navigation success rates [13][14].
- The model's ability to adapt to complex environments through social interaction is validated, showing a marked increase in navigation success rates when robots can ask for help [16][19].

Future Directions
- The article suggests expanding the model's capabilities to support more complex social tasks and integrating additional sensory modalities to enhance adaptability in challenging environments [19][17].
- Emphasis is placed on the importance of dynamic environments, human realism, and continuous navigation in future developments [19][17].
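As a rough illustration of the closed loop described above, the following toy Python sketch has an agent navigating a grid that asks a simulated passerby for directions whenever its confidence drops. The grid world, the confidence heuristic, and the `passerby_hint` responder are invented stand-ins for FreeAskWorld's LLM-driven human simulation, not its actual interface.

```python
# Toy closed-loop navigation with a "Direction Inquiry" step.
import random

GOAL = (4, 4)

def passerby_hint(position, goal):
    """Stand-in for the LLM-driven human simulation: return a coarse direction."""
    dx, dy = goal[0] - position[0], goal[1] - position[1]
    return "east" if abs(dx) >= abs(dy) else "north"

def step_towards(position, direction):
    moves = {"east": (1, 0), "north": (0, 1)}
    dx, dy = moves[direction]
    # clamp so the toy agent never walks past the goal
    return (min(position[0] + dx, GOAL[0]), min(position[1] + dy, GOAL[1]))

def navigate(max_steps=20, ask_threshold=0.4):
    position, asks = (0, 0), 0
    for _ in range(max_steps):
        if position == GOAL:
            break
        confidence = random.random()      # toy stand-in for an uncertainty estimate
        if confidence < ask_threshold:    # Direction Inquiry: ask a simulated human
            direction = passerby_hint(position, GOAL)
            asks += 1
        else:                             # otherwise act on the agent's own (random) plan
            direction = random.choice(["east", "north"])
        position = step_towards(position, direction)
    return position, asks

if __name__ == "__main__":
    final_pos, num_asks = navigate()
    print(f"final position={final_pos}, inquiries made={num_asks}")
```

The point of the sketch is only the loop structure: perception, an optional multi-round inquiry, and replanning happen every step, which is what distinguishes a closed-loop benchmark from executing a static instruction once.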
Haomo Zhixing Suddenly Disbands! "Number One in the Universe" Officially Goes Offline
具身智能之心· 2025-11-23 02:11
Core Insights
- The article discusses the recent dissolution of a prominent autonomous-driving company, Haomo Zhixing (HAOMO.AI), which has faced significant operational challenges and staff turnover [2][3].

Company Overview
- Haomo Zhixing was established on November 29, 2019, as a subsidiary of Great Wall Motors, focusing on autonomous driving technology [6].
- The company initially advanced rapidly, launching its first autonomous delivery vehicle, "Xiao Mo Tuo," in November 2020, and developing the MANA data intelligence system by December 2021 [6][8].

Recent Developments
- The company has experienced substantial layoffs, with reports indicating that nearly one-third to half of its functional departments were cut last year [5].
- Key personnel, including the Vice President of Technology and the public relations team, have left the company, raising concerns about its stability [5][6].
- As of June 2023, the company's official communications have ceased, with the last update being a holiday poster on October 1 [5].

Product and Technology
- In April 2023, Haomo Zhixing launched the DriveGPT autonomous-driving generative model, coinciding with the rise of ChatGPT [8].
- The HPilot 3.0 driver-assistance system has been integrated into nearly 20 models under Great Wall Motors by 2025 [8].
- Competition has intensified, however, with Yuanrong Qixing providing end-to-end smart-driving solutions for Great Wall Motors, leading to speculation that Haomo Zhixing has fallen out of favor within the group [8].

Market Reaction
- Users of Haomo Zhixing's products have expressed concern and dissatisfaction about their experience following the company's potential closure [9].
AlohaMini for Mobile Manipulation Is Here! $600 Cost, Fully Open Source
具身智能之心· 2025-11-22 16:03
Core Viewpoint
- AlohaMini is an affordable, fully 3D-printable dual-arm mobile robot designed for embodied AI research and real-world manipulation, priced at $600 [4][5][8].

Group 1: Key Features
- The Bill of Materials (BOM) totals around $600 for self-printed parts, making the platform highly accessible [5].
- It features a dual-arm mobile base with a motorized vertical lift offering 0-60 cm of travel, enhancing its operational versatility [10].
- The platform is fully open source, compatible with the LeRobot framework, and can be assembled in approximately 60 minutes [5][12].

Group 2: Technical Specifications
- AlohaMini includes five cameras for perception, providing a comprehensive sensory array for embodied AI research [10].
- The BOM includes 16 servo motors, 2 motor control boards, a Raspberry Pi 5, and 5 cameras, keeping the overall build cost-effective [13].
- The design emphasizes modern aesthetics while maintaining functionality, promoting low-cost accessibility for home builders [12].
What Is Xiaomi's MiMo-Embodied Really About? Integrating Autonomous Driving and Embodied Tasks, 29 SOTA Results!
具身智能之心· 2025-11-22 16:03
Core Insights
- The article discusses Xiaomi's MiMo-Embodied, the first cross-domain foundational model that integrates autonomous driving and embodied intelligence, achieving state-of-the-art (SOTA) performance across 29 benchmark tests [5][7].

Summary by Sections

Existing Model Limitations
- Current models are limited to single domains and lack a unified visual language model (VLM) that connects outdoor autonomous driving and indoor embodied intelligence, resulting in insufficient cross-scenario generalization capabilities [5].

MiMo-Embodied's Solutions
- MiMo-Embodied is the first open-source cross-domain unified model, integrating tasks from both autonomous driving and embodied intelligence into a single framework, enabling positive transfer and mutual enhancement between the two domains [7].

Comprehensive Capabilities
- The model supports three core capabilities for autonomous driving: environment perception, state prediction, and driving planning, as well as three core capabilities for embodied intelligence: usability prediction, task planning, and spatial understanding [8].

Training and Data Construction
- MiMo-Embodied employs a carefully designed dataset and a four-stage training strategy to overcome cross-domain task interference, leading to performance improvements [9].

Model Architecture
- The architecture includes a Vision Transformer (ViT) for visual encoding, a multi-layer perceptron (MLP) for projection, and a large language model (LLM) for text understanding and logical reasoning (a toy sketch of this layout follows after this summary) [12][13].

Training Strategy
- The four-stage training strategy consists of: (1) general and embodied knowledge learning, (2) autonomous driving knowledge learning, (3) chain-of-thought (CoT) reasoning fine-tuning, and (4) reinforcement learning (RL) fine-tuning [20][21].

Performance Metrics
- MiMo-Embodied demonstrates superior performance in usability prediction, task planning, spatial understanding, environment perception, state prediction, and driving planning across various benchmarks [24][25].

Ablation Studies
- Single-domain training leads to significant performance loss in cross-domain generalization, while the four-stage training strategy enhances embodied and autonomous driving performance by 4% and 8.1%, respectively [27].

Real-World Task Evaluation
- The model has been evaluated on real-world tasks, showcasing its capabilities in embodied navigation and manipulation [29].
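For orientation, here is a toy PyTorch sketch of the ViT → MLP projector → LLM layout the summary describes. All module sizes, the patch embedding, and the tiny transformer standing in for the LLM are assumptions for illustration; none of it is MiMo-Embodied's actual implementation.

```python
# Toy vision-language model layout: visual encoder -> MLP projector -> language backbone.
import torch
import torch.nn as nn

class TinyVisionEncoder(nn.Module):
    """Stand-in for the ViT: turns an image into a sequence of visual tokens."""
    def __init__(self, patch=16, dim=256):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images):                       # (B, 3, H, W)
        x = self.patchify(images)                    # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)          # (B, num_patches, dim)

class VLMSketch(nn.Module):
    def __init__(self, vis_dim=256, llm_dim=512, vocab=1000):
        super().__init__()
        self.encoder = TinyVisionEncoder(dim=vis_dim)
        self.projector = nn.Sequential(               # MLP projection into the LLM's token space
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.embed = nn.Embedding(vocab, llm_dim)
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)   # toy stand-in for the LLM
        self.head = nn.Linear(llm_dim, vocab)

    def forward(self, images, text_ids):
        vis_tokens = self.projector(self.encoder(images))
        txt_tokens = self.embed(text_ids)
        tokens = torch.cat([vis_tokens, txt_tokens], dim=1)     # visual tokens prepended to text
        return self.head(self.llm(tokens))

if __name__ == "__main__":
    model = VLMSketch()
    logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 8)))
    print(logits.shape)   # (1, num_visual_tokens + 8, vocab)
```

The design choice worth noticing is that both driving and embodied tasks share this one token stream; cross-domain transfer comes from the shared projector and backbone rather than from separate task heads.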
Deploying π0 and π0.5 from Scratch!
具身智能之心· 2025-11-22 16:03
Core Viewpoint
- The article highlights the launch of the Imeta-Y1, a lightweight and cost-effective robotic arm designed for beginners and researchers in the field of embodied intelligence, emphasizing its open-source tools and user-friendly features [3][4][6].

Product Features
- Imeta-Y1 is designed specifically for newcomers and researchers, providing a high-performance robotic arm at an affordable price [3].
- It ships with a complete open-source toolchain and code examples, supporting a seamless workflow from data collection to model deployment [4][18].
- The arm offers dual-language interfaces (Python/C++) and is compatible with ROS1/ROS2, allowing users to get started quickly regardless of their programming background [4][19].
- It features high-precision motion control, low power consumption, and an open hardware architecture, enabling smooth migration from simulation to real-world applications [6][7].

Technical Specifications
- The arm weighs 4.2 kg, has a rated load of 3 kg and 6 degrees of freedom, a working radius of 612.5 mm, and a repeatability of ±0.1 mm [9][20].
- It operates at a supply voltage of 24 V and communicates via CAN, with a compact design suitable for embedded AI and robot-learning platforms [9][20].
- The joint motion ranges and maximum speeds are specified to meet a variety of application needs [22].

Development and Support
- The company provides a comprehensive open-source SDK, including drivers, API interfaces, example code, and documentation, supporting rapid application development (a hedged deployment-loop sketch follows after this summary) [31][30].
- Users can leverage multi-modal data fusion capabilities, compatible with mainstream frameworks such as TensorFlow and PyTorch, to implement intelligent algorithms [37][18].
- The company provides quick after-sales support, with a 24-hour response time for customer inquiries [20][49].

Testing and Reliability
- Rigorous hardware testing validates the arm's accuracy, durability, load performance, and stability across various application scenarios [40][44].
- The product is backed by a six-month warranty against non-human damage, with post-warranty support available at market rates [50].
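Since the summary stays at the workflow level, here is a minimal, hypothetical Python sketch of what a policy-deployment loop on a 6-DOF arm could look like. The `ArmClient` class, its method names, and the `fake_policy` stub are invented placeholders rather than the Imeta-Y1 SDK or a real π0/π0.5 checkpoint; only the read-observation / infer-action / send-command loop structure is the point.

```python
# Hypothetical deployment loop for running a learned policy on a 6-DOF arm.
import time

class ArmClient:
    """Invented placeholder for a CAN-backed arm SDK: 6 joints, position commands."""
    def __init__(self, num_joints=6):
        self.joints = [0.0] * num_joints

    def read_joint_positions(self):
        return list(self.joints)

    def send_joint_positions(self, targets):
        self.joints = list(targets)      # a real SDK would stream these over CAN

def fake_policy(observation):
    """Stub for policy inference: nudge each joint slightly toward 0.1 rad."""
    return [q + 0.1 * (0.1 - q) for q in observation["joints"]]

def control_loop(arm, policy, hz=10.0, steps=50):
    period = 1.0 / hz
    for _ in range(steps):
        obs = {"joints": arm.read_joint_positions()}   # camera frames would be added here
        action = policy(obs)                           # in practice: a VLA model forward pass
        arm.send_joint_positions(action)
        time.sleep(period)

if __name__ == "__main__":
    control_loop(ArmClient(), fake_policy, hz=50.0, steps=10)
```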
AlohaMini for Mobile Manipulation Is Here! $600 Cost, Fully Open Source
具身智能之心· 2025-11-22 03:07
The $600 Open-Source Home Robot
Meet AlohaMini, the new dual-arm mobile robot designed to make real-world manipulation and embodied AI research accessible. The bot is fully 3D-printable and aimed at home builders and research labs.
Key Features:
- Cost: the Bill of Materials (BOM) totals around $600 (USD) for self-printed parts, making it highly accessible.
- Hardware: dual-arm mobile base with a motorized vertical lift (0-60 cm travel).
...
Academician Election Results of the Two Academies Announced: Zhou Zhihua and Liu Yunhao Elected to the Chinese Academy of Sciences
具身智能之心· 2025-11-21 16:03
Editor | 机器之心
On the morning of November 21, the Chinese Academy of Sciences (CAS) and the Chinese Academy of Engineering (CAE) announced the results of the 2025 academician elections, electing 73 new CAS academicians and 71 new CAE academicians.
Academician of the two academies is the highest honorary title in China's science, technology, and engineering fields. After this round of elections, the structure of China's academician ranks has been further optimized: the newly elected CAS academicians have an average age of 57.2, with the youngest aged 44 and the oldest 66; 67.1% are 60 or younger, and 5 women scientists were elected.
After this election, China has a total of 908 CAS academicians and 1,002 CAE academicians.
Notably, this round of elections includes scholars working in fields related to artificial intelligence.
Chinese Academy of Sciences Academicians
In 2025, the Chinese Academy of Sciences elected 73 academicians and 27 foreign academicians. Among those elected are several leading researchers in computer science and artificial intelligence, underscoring China's continued breakthroughs in, and emphasis on, frontier technology.
Liu Yunhao — Professor at Tsinghua University, research ...
VLA-Pruner: Temporal-Aware Visual Token Pruning for Efficient VLA Inference
具身智能之心· 2025-11-21 16:03
Group 1
- The core challenge of VLA models lies in integrating visual scene perception, natural language understanding, and action execution, which incurs significant computational overhead because visual tokens greatly outnumber text tokens [2][4].
- Existing visual-token pruning methods are flawed: they focus primarily on semantic relevance and neglect the distinct needs of high-level semantic understanding versus low-level action execution, leading to performance drops at high pruning rates [3][4].
- A key observation is that the temporal continuity of robot operations allows the visual tokens needed for the current action to be estimated from historical attention trends, providing a way around the limitations of existing methods [5].

Group 2
- VLA-Pruner is designed to retain both semantic-understanding and action-execution tokens under a given computational budget, achieving efficient inference without performance loss through a dual-level criterion and selection strategy (a small sketch of this scoring scheme follows after this summary) [6][10].
- The dual-level importance criteria combine semantic relevance, based on prefill attention scores, with action-level importance estimated through temporal smoothing, ensuring a comprehensive approach to token selection [7][9].
- The method employs a "merge-filter" mechanism to maximize relevance and minimize redundancy, ensuring that all tokens critical for semantic understanding and action execution are preserved [10][11].

Group 3
- Experimental results show that at a 50% pruning rate, VLA-Pruner not only maintains performance but improves success rates, with OpenVLA showing an average increase of 2.45% [16].
- VLA-Pruner is robust across scenarios, achieving a 96.8% success rate in the SIMPLER environment at a 75% pruning rate, significantly outperforming baseline methods [19][20].
- Efficiency gains are notable: at a 50% pruning rate, FLOPs drop to approximately 60% of the original model's and inference is up to 1.8 times faster [26][27].

Group 4
- The core contributions include a dual-level pruning criterion that addresses the inherent flaws of existing methods and a plug-and-play pruning framework that improves inference efficiency without altering the model architecture [31].
- Limitations include potentially inaccurate action-attention estimates in dynamic scenarios with rapid viewpoint or target changes, suggesting areas for future optimization [31].
- Future directions involve adaptive prediction modules and the integration of additional techniques such as quantization and layer pruning to further improve deployment efficiency [31].
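To make the dual-level criterion more concrete, here is a small numpy sketch of the scoring-and-selection idea: combine per-token semantic scores from prefill attention with a temporally smoothed action-attention estimate, then keep the top tokens under a budget. The EMA constant, the max-based merge rule, and the random scores are assumptions for illustration, not VLA-Pruner's exact formulation.

```python
# Sketch of dual-level visual-token pruning with temporal smoothing.
import numpy as np

def smooth_action_scores(history, current, alpha=0.7):
    """Exponential moving average over historical action-attention maps."""
    return alpha * history + (1.0 - alpha) * current

def prune_tokens(semantic, action_hist, action_curr, keep_ratio=0.5):
    action = smooth_action_scores(action_hist, action_curr)
    merged = np.maximum(semantic, action)        # "merge": keep a token if either view needs it
    budget = max(1, int(keep_ratio * merged.size))
    keep_idx = np.argsort(merged)[-budget:]      # "filter": top tokens under the budget
    return np.sort(keep_idx), action             # smoothed scores feed the next timestep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_tokens = 16
    semantic = rng.random(num_tokens)            # stand-in for prefill-attention relevance
    action_hist = rng.random(num_tokens)         # smoothed scores from previous steps
    action_curr = rng.random(num_tokens)         # stand-in for current action attention
    kept, new_hist = prune_tokens(semantic, action_hist, action_curr, keep_ratio=0.5)
    print("kept visual token indices:", kept)
```

The smoothing step is what exploits the temporal continuity noted in Group 1: because consecutive control steps attend to largely overlapping regions, yesterday's attention is a usable proxy for which tokens the next action will need.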