具身智能之心
Partners Wanted for the VLA + RL Direction!
具身智能之心· 2025-11-24 10:02
Group 1
- The article announces the recruitment of instructors for courses and projects related to VLA (Vision-Language-Action) models and RL (Reinforcement Learning) within the community [1]
- The community seeks candidates whose research focuses on VLA and RL, preferably holding a PhD or currently enrolled in a doctoral program, with publications at top academic conferences [2]
- For industry candidates, practical engineering experience, including hands-on debugging on real robots, is desired [2]

Group 2
- The community, "Embodied Intelligence," identifies itself as the first comprehensive technical exchange community in China focused on VLA and RL, and has gathered a large number of students in these fields [3]
- It offers compensation above the industry average along with abundant industry resources for recruited instructors [4]
- Interested individuals are encouraged to add the specified WeChat contact for further details [5]
HUST & Tsinghua's New DeepThinkVLA: How to Make a Model That Both Thinks and Delivers?
具身智能之心· 2025-11-24 10:02
Core Insights
- The article presents DeepThinkVLA, a model that addresses challenges in the vision-language-action (VLA) domain by combining a mixed-attention decoder with a two-stage training pipeline, achieving a 97.0% task success rate on the LIBERO benchmark and setting a new performance standard for VLA models [2][14]

Group 1: Model Architecture
- DeepThinkVLA resolves the "modal conflict" between reasoning and action with a mixed attention mechanism that processes both modalities efficiently within a single decoder [4][10]
- The model switches dynamically between causal attention for reasoning generation and bidirectional attention for action generation, significantly reducing inference latency while improving performance [4][10]

Group 2: Training Methodology
- Training follows a two-stage pipeline combining supervised fine-tuning (SFT) with reinforcement learning (RL), strengthening the model's reasoning while ensuring effective action execution [6][8]
- The SFT phase builds foundational reasoning skills through a carefully designed data augmentation pipeline, yielding a dataset of 273,465 annotated frames [10][12]

Group 3: Innovations and Mechanisms
- Two key innovations are highlighted: a probabilistic decomposition of reasoning and action, and an error-recovery mechanism that lets the model self-correct during execution [10][11]
- The reward design combines task-success rewards with format-regularization rewards, focusing on final task success while minimizing interference from intermediate reasoning semantics [11][12]

Group 4: Performance Evaluation
- DeepThinkVLA outperforms existing models across a range of tasks, with an average success rate of 97.0%, including 99.0% on Object tasks and 96.4% on Goal tasks [14][15]
- The model is more robust than top autoregressive models, demonstrating its effectiveness in complex robotic manipulation [15][16]

Group 5: Future Directions
- Future work may integrate additional sensory data, extend to more complex collaborative tasks, optimize efficiency, and build larger datasets to improve generalization [23][24]
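The mixed-attention idea described above, causal attention over reasoning tokens and bidirectional attention over action tokens inside one decoder, can be sketched as a simple mask builder. This is a minimal illustration under assumed conventions, not the paper's implementation; the function name and exact mask layout are our own.

```python
import numpy as np

def mixed_attention_mask(n_reason: int, n_action: int) -> np.ndarray:
    """Boolean attention mask for a sequence laid out as
    [reasoning tokens | action tokens] (True = may attend).

    Reasoning rows keep a causal (lower-triangular) pattern, so the
    reasoning chain is generated autoregressively; action rows attend
    to all reasoning tokens and to every other action token
    bidirectionally, so the action chunk can be decoded in parallel.
    """
    n = n_reason + n_action
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal base pattern
    mask[n_reason:, :] = True                    # action rows: full access
    return mask
```

With `n_reason=2, n_action=2`, row 0 cannot attend to position 1 (causal), while rows 2 and 3 attend to every position, which is what allows the low-latency parallel action decoding the article credits for the speedup.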
The Aloha Hardware Discussion Group Is Here!
具身智能之心· 2025-11-24 00:04
Core Viewpoint
- The article announces a technical discussion group focused on Aloha technology and its related hardware and algorithms, serving the growing interest in mobile manipulation within the community [2]

Group 1
- A technical discussion group covering Aloha, Mobile Aloha, and MiniAloha has been created to address inquiries from community members [2]
- The group aims to facilitate discussion of hardware and algorithms related to Aloha technology [2]
- Interested individuals can join by adding the designated assistant on WeChat with the specified identification details [2]
The Second Half of Embodied Intelligence: The Rise of the Southern Camp
具身智能之心· 2025-11-24 00:04
Core Viewpoint
- The article highlights the unexpectedly sustained interest in embodied intelligence, particularly in southern China, focusing on the rise of the company "Self-Variable" as a key industry player [1][2]

Group 1: The Southern Camp's Rise
- "Self-Variable," founded at the end of 2023, has completed multiple financing rounds in under two years, with investors including major players such as Meituan, and is rumored to be approaching a valuation of 10 billion [1]
- Its emergence positions it as a potential leader in embodied intelligence in Shenzhen, and possibly nationwide, despite concerns about the scarcity of other strong players in the southern region [2]
- While Shenzhen has "Self-Variable," other companies such as "Zhujidi Power" have been less prominent, focusing more on hardware than on general embodied intelligence models [2]

Group 2: Hong Kong's Potential
- Hong Kong is seeing a marked increase in embodied-intelligence entrepreneurship, driven by strong research foundations in robotics and control systems at its three major universities [2]
- Hong Kong's rise is seen as complementary to Shenzhen's strengths in supply-chain integration and commercialization [2]

Group 3: Industry Challenges
- Despite the southern camp's momentum, the industry faces significant challenges, with many products still delivering "emotional value" rather than tangible productivity [3][5]
- Many showcased products, such as dancing robots and automated vending machines, do not address real market needs, raising questions about their long-term viability [4]
- The gap in market demand for products like patrol robots underscores the industry's struggle between perceived value and actual utility, pointing to a need for more practical applications [5]

Group 4: Future Outlook
- The southern camp's rise is framed as a starting point for 2025, but the industry must overcome hurdles such as the shortage of supporting companies in Shenzhen and the research-to-market gap in Hong Kong [5]
- Success will depend not on high valuations or flashy products but on solving real-world problems with technology, a pragmatic approach the southern region must adopt going forward [5]
Xiaomi's MiMo-Embodied: Unifying Autonomous Driving and Embodied Tasks with 29 SOTA Results!
具身智能之心· 2025-11-24 00:04
Core Insights
- The article discusses Xiaomi's MiMo-Embodied, a cross-domain foundational model that integrates autonomous driving and embodied intelligence, achieving state-of-the-art (SOTA) performance across 29 benchmarks [5][24]

Group 1: Model Overview
- MiMo-Embodied is the first open-source unified model to combine autonomous-driving and embodied-intelligence tasks in a single framework, enabling positive transfer and mutual enhancement between the two domains [7][8]
- The model supports three core autonomous-driving capabilities (environment perception, state prediction, and driving planning) and three core embodied-intelligence capabilities (affordance prediction, task planning, and spatial understanding) [8]

Group 2: Training and Data Strategy
- A multi-stage training strategy with carefully designed datasets overcomes cross-domain task interference and yields performance improvements [9][20]
- Training proceeds in four stages: general and embodied knowledge learning, autonomous-driving knowledge learning, chain-of-thought (CoT) reasoning fine-tuning, and reinforcement learning (RL) fine-tuning [21][27]

Group 3: Performance Metrics
- MiMo-Embodied achieves SOTA in affordance prediction across five benchmarks, outperforming models such as Qwen2.5-VL and GPT-4o [24]
- In task planning, it shows strong long-horizon reasoning and causal inference, particularly on the RoboVQA benchmark [24]
- It leads nine benchmarks in spatial understanding and environment perception, especially 3D scene reasoning and spatial language grounding [24][25]

Group 4: Comparative Analysis
- Compared with mixed-training approaches, the staged strategy improves average performance by 4% on embodied tasks and 8.1% on autonomous-driving tasks [27][37]
- The architecture and training strategy sustain high performance across both domains, with average scores of 62.4% on embodied tasks and 63.3% on autonomous-driving tasks [37]
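The four-stage schedule summarized above can be sketched as a sequential curriculum in which each stage resumes from the previous stage's weights. The stage and dataset names below are paraphrased from the article, and the training callback is a placeholder, not Xiaomi's actual pipeline.

```python
# Hypothetical sketch of a four-stage curriculum; stage and dataset
# labels are paraphrased from the article, not MiMo-Embodied's code.
STAGES = [
    ("stage1_general_embodied", ["general_vqa", "embodied_knowledge"]),
    ("stage2_autonomous_driving", ["perception", "prediction", "planning"]),
    ("stage3_cot_finetune", ["chain_of_thought"]),
    ("stage4_rl_finetune", ["rl_preferences"]),
]

def run_curriculum(train_stage, stages=STAGES):
    """Run each stage in order; because later stages start from the
    previous stage's weights, cross-domain skills accumulate instead
    of being trained once in a single mixed pass."""
    history = []
    for name, datasets in stages:
        metric = train_stage(name, datasets)  # user-supplied trainer
        history.append((name, metric))
    return history
```

A staged schedule like this is one way to avoid the cross-domain interference the article attributes to naive mixed training.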
FreeAskWorld: An Interactive Closed-Loop Embodied Simulation Framework
具身智能之心· 2025-11-24 00:04
Core Insights
- The article discusses the limitations of existing Vision-and-Language Navigation (VLN) approaches in embodied intelligence, highlighting reliance on static instructions, lack of social interaction capabilities, and inadequate simulation environments [1][2]
- FreeAskWorld, developed by Tsinghua University's AI Research Institute, combines LLM-driven interactive simulation with Direction Inquiry Tasks to overcome these challenges, achieving social, dynamic, and realistic embodied navigation and interaction [1][2][4]

Summary by Sections

Current Challenges in VLN
- Existing VLN solutions face a "triple dilemma": reliance on static instructions, lack of social cognition, and single-dimensional simulation environments [2]
- Key deficiencies include the inability to handle dynamic scenes, absent social interaction, and insufficient realism in navigation environments [2]

FreeAskWorld's Approach
- FreeAskWorld uses an LLM to create realistic social scenarios and a closed-loop interaction framework to support dynamic adaptation [2][5]
- The system consists of three core components: LLM-driven human simulation, Direction Inquiry Tasks, and a multi-modal dataset [5][8]

Core Components
- **Human Simulation Module**: generates diverse human behaviors that follow social rules, improving interaction realism [5][7]
- **Direction Inquiry Task**: lets robots actively seek help during navigation, improving performance through multi-round interactions [5][7]
- **Data Generation**: the dataset includes 63,429 annotated frames and over 17 hours of interaction data, covering mixed indoor and outdoor scenes [8][11]

Experimental Results
- FreeAskWorld shows significant performance gains over traditional models in both open-loop and closed-loop settings, with interaction improving navigation success rates [13][14]
- The model's ability to adapt to complex environments through social interaction is validated, with navigation success rates rising markedly when robots can ask for help [16][19]

Future Directions
- The article suggests extending the framework to more complex social tasks and integrating additional sensory modalities to improve adaptability in challenging environments [19][17]
- Emphasis is placed on dynamic environments, human realism, and continuous navigation in future development [19][17]
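The Direction Inquiry Task described above amounts to a closed loop: the agent acts until its confidence drops, asks a simulated human for directions, and folds the answer into its next step. The sketch below illustrates that loop under assumed interfaces; `agent_step` and `ask_human` are hypothetical callbacks, not FreeAskWorld's API.

```python
def navigate_with_inquiry(agent_step, ask_human,
                          max_steps=20, conf_threshold=0.5):
    """Closed-loop navigation with direction inquiry.

    agent_step(hint) -> (action, confidence, done): one control step,
        optionally conditioned on a natural-language hint.
    ask_human() -> str: query the LLM-driven pedestrian for directions.

    Returns the number of steps taken, or max_steps on failure.
    """
    hint = None
    for t in range(max_steps):
        action, confidence, done = agent_step(hint)
        hint = None                      # a hint is consumed once
        if done:
            return t
        if confidence < conf_threshold:  # uncertain: ask for help
            hint = ask_human()
    return max_steps
```

The confidence-triggered inquiry is the key design choice: help is requested only when the policy is uncertain, which matches the article's finding that interaction raises success rates without constant human involvement.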
Haomo Zhixing Suddenly Disbands on the Spot! The "Universe's No. 1" Officially Goes Offline
具身智能之心· 2025-11-23 02:11
Core Insights
- The article covers the sudden dissolution of Haomo Zhixing, a prominent autonomous-driving company that had faced significant operational challenges and staff turnover [2][3]

Company Overview
- Haomo Zhixing was established on November 29, 2019, as a subsidiary of Great Wall Motors, focusing on autonomous-driving technology [6]
- The company initially advanced rapidly, launching its first autonomous delivery vehicle, "Xiao Mo Tuo," in November 2020 and unveiling the MANA data intelligence system in December 2021 [6][8]

Recent Developments
- The company has undergone substantial layoffs, with reports that nearly one-third to one-half of its functional departments were cut last year [5]
- Key personnel, including the Vice President of Technology and the public relations team, have left, raising concerns about the company's stability [5][6]
- The company's official accounts have gone quiet since June, with the last update a holiday poster on October 1 [5]

Product and Technology
- In April 2023, Haomo Zhixing launched the DriveGPT autonomous-driving generative model, coinciding with the rise of ChatGPT [8]
- By 2025, the HPilot 3.0 driver-assistance system had been integrated into nearly 20 Great Wall Motors models [8]
- Competition intensified as Yuanrong Qixing began supplying end-to-end smart-driving solutions to Great Wall Motors, fueling speculation that Haomo Zhixing had fallen out of favor within the group [8]

Market Reaction
- Users of Haomo Zhixing's products have expressed concern and dissatisfaction about their ownership experience following news of the company's potential closure [9]
AlohaMini for Mobile Manipulation Is Here! $600 Cost, Fully Open Source
具身智能之心· 2025-11-22 16:03
Core Viewpoint
- AlohaMini is an affordable, fully 3D-printable dual-arm mobile robot designed for embodied-AI research and real-world manipulation, priced at around $600 [4][5][8]

Group 1: Key Features
- The Bill of Materials (BOM) totals around $600 with self-printed parts, making the platform highly accessible [5]
- A dual-arm mobile base with a motorized vertical lift offering 0-60 cm of travel enhances operational versatility [10]
- The platform is fully open source, compatible with the LeRobot framework, and can be assembled in roughly 60 minutes [5][12]

Group 2: Technical Specifications
- AlohaMini carries five cameras for perception, providing a comprehensive sensory array for embodied-AI research [10]
- The BOM includes 16 servo motors, 2 motor control boards, a Raspberry Pi 5, and 5 cameras in a cost-effective configuration [13]
- The design balances modern aesthetics with functionality, keeping the build low-cost and accessible to home builders [12]
What Is Xiaomi's MiMo-Embodied Really About? Unifying Autonomous Driving and Embodied Tasks with 29 SOTA Results!
具身智能之心· 2025-11-22 16:03
Core Insights
- The article discusses Xiaomi's MiMo-Embodied, the first cross-domain foundational model to integrate autonomous driving and embodied intelligence, achieving state-of-the-art (SOTA) performance across 29 benchmarks [5][7]

Summary by Sections

Existing Model Limitations
- Current models are confined to single domains and lack a unified vision-language model (VLM) bridging outdoor autonomous driving and indoor embodied intelligence, limiting cross-scenario generalization [5]

MiMo-Embodied's Solutions
- MiMo-Embodied is the first open-source cross-domain unified model, integrating autonomous-driving and embodied-intelligence tasks into a single framework and enabling positive transfer and mutual enhancement between the two domains [7]

Comprehensive Capabilities
- The model supports three core autonomous-driving capabilities (environment perception, state prediction, and driving planning) and three core embodied-intelligence capabilities (affordance prediction, task planning, and spatial understanding) [8]

Training and Data Construction
- MiMo-Embodied uses a carefully designed dataset and a four-stage training strategy to overcome cross-domain task interference and improve performance [9]

Model Architecture
- The architecture comprises a Vision Transformer (ViT) for visual encoding, a multi-layer perceptron (MLP) for projection, and a large language model (LLM) for text understanding and logical reasoning [12][13]

Training Strategy
- The four training stages are: 1. general and embodied knowledge learning; 2. autonomous-driving knowledge learning; 3. chain-of-thought (CoT) reasoning fine-tuning; 4. reinforcement learning (RL) fine-tuning [20][21]

Performance Metrics
- MiMo-Embodied performs strongly in affordance prediction, task planning, spatial understanding, environment perception, state prediction, and driving planning across a wide range of benchmarks [24][25]

Ablation Studies
- Single-domain training causes significant loss in cross-domain generalization, while the four-stage strategy improves embodied and autonomous-driving performance by 4% and 8.1%, respectively [27]

Real-World Task Evaluation
- The model has been evaluated on real-world tasks, demonstrating its capabilities in embodied navigation and manipulation [29]
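The ViT → MLP projector → LLM pipeline described above can be illustrated with a toy forward pass: patch embeddings from the vision encoder are mapped through a two-layer MLP into the LLM's token space and concatenated with the embedded text prompt. The dimensions and random stand-ins below are assumptions for illustration, not MiMo-Embodied's actual sizes.

```python
import numpy as np

D_VIS, D_LLM, N_PATCH, N_TEXT = 32, 64, 16, 8  # toy dimensions (assumed)

def mlp_project(patch_emb, w1, w2):
    """Two-layer MLP projector: vision space -> LLM token space."""
    hidden = np.maximum(patch_emb @ w1, 0.0)  # ReLU
    return hidden @ w2

rng = np.random.default_rng(0)
patches = rng.normal(size=(N_PATCH, D_VIS))    # stand-in for ViT output
w1 = rng.normal(size=(D_VIS, D_LLM)) * 0.1     # projector weights (random)
w2 = rng.normal(size=(D_LLM, D_LLM)) * 0.1
vision_tokens = mlp_project(patches, w1, w2)   # now LLM-dimensional
text_tokens = rng.normal(size=(N_TEXT, D_LLM)) # stand-in for embedded prompt
llm_input = np.concatenate([vision_tokens, text_tokens], axis=0)
```

The projector is the only glue between the two pretrained components: once vision tokens share the LLM's embedding dimension, driving images and indoor robot observations enter the same sequence and can be reasoned over jointly.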
Deploying π0 and π0.5 on It from Scratch!
具身智能之心· 2025-11-22 16:03
Core Viewpoint
- The article introduces the Imeta-Y1, a lightweight, cost-effective robotic arm designed for beginners and researchers in embodied intelligence, emphasizing its open-source tools and user-friendly features [3][4][6]

Product Features
- Imeta-Y1 targets newcomers and researchers, offering a high-performance robotic arm at an affordable price [3]
- A complete open-source toolchain and code examples support a seamless workflow from data collection to model deployment [4][18]
- The arm provides dual-language interfaces (Python/C++) and is compatible with ROS1/ROS2, so users can get started quickly regardless of programming background [4][19]
- High-precision motion control, low power consumption, and an open hardware architecture enable smooth transfer from simulation to real-world applications [6][7]

Technical Specifications
- The arm weighs 4.2 kg, carries a rated load of 3 kg, and offers 6 degrees of freedom, with a working radius of 612.5 mm and repeatability of ±0.1 mm [9][20]
- It runs on a 24 V supply and communicates over CAN, in a compact design suited to embedded AI and robot-learning platforms [9][20]
- Joint motion ranges and maximum speeds are specified to cover a variety of application needs [22]

Development and Support
- A comprehensive open-source SDK, including drivers, API interfaces, example code, and documentation, supports rapid application development [31][30]
- Multi-modal data fusion is supported, compatible with mainstream frameworks such as TensorFlow and PyTorch, for implementing intelligent algorithms [37][18]
- After-sales support promises a 24-hour response time for customer inquiries [20][49]

Testing and Reliability
- Rigorous hardware testing validates the arm's accuracy, durability, load performance, and stability across application scenarios [40][44]
- The product carries a six-month warranty against non-human damage, with post-warranty support available at market rates [50]
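The article does not reproduce the Imeta-Y1 SDK's actual API, so as a standalone illustration of the kind of safety check a 6-DOF, CAN-controlled arm needs before a command is framed and transmitted, here is a generic joint-limit clamp. All names and the limit values are hypothetical, not taken from the vendor SDK.

```python
NUM_JOINTS = 6  # the Imeta-Y1 is a 6-DOF arm

# Hypothetical symmetric joint limits in degrees (illustrative only;
# the real per-joint ranges come from the vendor documentation).
JOINT_LIMITS_DEG = [(-180.0, 180.0)] * NUM_JOINTS

def clamp_joint_targets(targets_deg, limits_deg=JOINT_LIMITS_DEG):
    """Clamp a 6-joint target pose to per-joint limits before the
    command would be framed and sent over the CAN control link."""
    if len(targets_deg) != NUM_JOINTS:
        raise ValueError(f"expected {NUM_JOINTS} joint targets")
    return [max(lo, min(hi, t))
            for t, (lo, hi) in zip(targets_deg, limits_deg)]
```

Validating targets host-side like this keeps an out-of-range command from a policy such as π0 from ever reaching the servo controllers.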