Zebra-CoT: A Pioneering Visual Chain-of-Thought Dataset Debuts, Lifting Multimodal Reasoning Accuracy by 13%
具身智能之心· 2025-07-24 09:53
Core Viewpoint - The article discusses Zebra-CoT, a large-scale, diverse dataset built to strengthen visual reasoning in multimodal models, addressing the weak performance of existing visual CoT methods and the shortage of high-quality training data [3][4].
Dataset Construction
- Zebra-CoT consists of 182,384 samples providing logically interleaved text-image reasoning trajectories across four main task categories: scientific reasoning, 2D visual reasoning, 3D visual reasoning, and visual logic and strategy games (a sketch of one such interleaved sample follows this summary) [6][12].
- The dataset overcomes the limitations of prior datasets, which focused on single tasks or lacked clear reasoning structure, by covering a diverse range of tasks and ensuring high-quality textual reasoning [6][18].
Task Coverage
- Scientific reasoning includes geometry, physics, chemistry, and algorithm problems [9].
- 2D visual reasoning encompasses visual search and visual puzzles [9].
- 3D visual reasoning involves multi-hop object counting and robot planning [9].
- Visual logic and strategy games feature chess, checkers, mazes, and more [9].
Data Sources and Processing
- Real-world data is sourced from online resources, with careful problem extraction to preserve the logical connections between modalities [10].
- Synthetic data is generated with templates and vision-language models (VLMs) to broaden reasoning diversity and expressiveness [10].
Model Fine-tuning and Performance
- Fine-tuning the Anole-7B model on Zebra-CoT raised accuracy from 4.2% to 16.9%, roughly a fourfold gain, with notable improvements on visual logic benchmarks [14].
- After fine-tuning, the Bagel-7B model generated high-quality interleaved visual reasoning chains, demonstrating the dataset's value for developing multimodal reasoning capabilities [14].
Limitations
- The synthetic portion relies on template generation, which may limit the diversity and expressiveness of textual reasoning [18].
- Some sub-tasks have small sample sizes, which may hurt model performance in those areas [18].
- Fine-tuning gains vary: some tasks show insignificant or even decreased performance, indicating a need for further optimization [18].
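To make the interleaved trajectory format concrete, below is a minimal sketch of what one text-image reasoning sample might look like when serialized for fine-tuning; the class names, field names, and file paths are illustrative assumptions, not Zebra-CoT's published schema.

```python
# A minimal sketch of one interleaved text-image reasoning sample; all names
# here are illustrative assumptions, not Zebra-CoT's actual schema.
from dataclasses import dataclass, field, asdict
from typing import List, Optional
import json

@dataclass
class ReasoningStep:
    text: str                         # textual reasoning for this step
    image_path: Optional[str] = None  # optional intermediate visual (diagram, board state)

@dataclass
class InterleavedCoTSample:
    task_category: str                # e.g. one of the four categories listed above
    question: str
    steps: List[ReasoningStep] = field(default_factory=list)
    answer: str = ""

sample = InterleavedCoTSample(
    task_category="visual_logic_and_strategy_games",
    question="Find a path from the maze entrance to the exit.",
    steps=[
        ReasoningStep("From the entrance, the only open corridor leads right.", "step1_maze.png"),
        ReasoningStep("At the junction, going down reaches the exit.", "step2_maze.png"),
    ],
    answer="right, down",
)
print(json.dumps(asdict(sample), indent=2))  # one training record, serialized
```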
The 具身智能之心 Job-Seeking Exchange Group Is Here!!!
具身智能之心· 2025-07-23 15:16
At the request of our many followers, we have officially begun running a job-seeking community for embodied intelligence. The group mainly discusses the embodied intelligence industry, companies, product R&D, job hunting, and career changes. If you want to meet more peers in the industry and be the first to learn what is happening, you are welcome to join us! Scan the QR code on WeChat to add the assistant and be invited into the group; note your nickname + embodied job-seeking. The 具身智能之心 job-seeking and industry exchange group is now live! ...
The Perception Module Embodied Intelligence Can't Do Without! The Best Price-Performance 3D Laser Scanner Is Here
具身智能之心· 2025-07-23 09:48
Core Viewpoint - GeoScan S1 is presented as the most cost-effective 3D laser scanner in China, featuring a lightweight design, one-click operation, and centimeter-level precision for real-time 3D scene reconstruction [1][5].
Group 1: Product Features
- The GeoScan S1 generates point clouds at 200,000 points per second, with a maximum measurement distance of 70 meters and 360° coverage, supporting large scenes of over 200,000 square meters [1][28][31].
- It integrates multiple sensors, including a high-precision IMU and RTK, enabling it to handle complex indoor and outdoor environments effectively [33][46].
- The device supports data export in formats such as PCD, LAS, and PLY, and operates on Ubuntu 20.04, compatible with ROS (a point-cloud post-processing sketch follows this summary) [22].
Group 2: System Specifications
- The system has a relative accuracy of better than 3 cm and an absolute accuracy of better than 5 cm [22].
- The device measures 14.2 cm x 9.5 cm x 45 cm and weighs 1.3 kg without the battery (1.9 kg with it), with a power input range of 13.8 V to 24 V [22].
- Its 88.8 Wh battery provides approximately 3 to 4 hours of operational time [22][25].
Group 3: Software and Usability
- The GeoScan S1 offers a user-friendly interface with simple operation, allowing quick scanning and immediate data export without complex setup [5][42].
- It includes a 3D Gaussian data collection module for high-fidelity scene restoration, enabling digital replication of real-world environments [52].
- The software supports both offline and online rendering, broadening its range of applications [5][61].
Group 4: Market Position and Pricing
- Multiple versions of the GeoScan S1 are offered, from a basic version at 19,800 yuan to a 3DGS offline version at 67,800 yuan, catering to diverse customer needs [61][64].
- The product is positioned as having the best price-performance ratio in the industry, integrating multiple sensors and advanced features [5][61].
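The sketch referenced above shows minimal post-processing of an exported scan using the open-source Open3D library; the file name is a placeholder, and Open3D is this example's choice, not a vendor requirement.

```python
# A minimal sketch of post-processing an exported point cloud with Open3D;
# "scan.pcd" is a placeholder path (LAS files would need a reader such as laspy).
import open3d as o3d

pcd = o3d.io.read_point_cloud("scan.pcd")  # PCD and PLY load directly
print(f"loaded {len(pcd.points)} points")

# Downsample to a 3 cm voxel grid, matching the stated ~3 cm relative
# accuracy, so later registration does not over-fit to sensor noise.
down = pcd.voxel_down_sample(voxel_size=0.03)
o3d.io.write_point_cloud("scan_down.ply", down)
```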
Behavior Foundation Models Enable Efficient Whole-Body Control for Humanoid Robots
具身智能之心· 2025-07-23 08:45
Core Viewpoint - Humanoid robots are gaining unprecedented attention as multifunctional platforms for complex motion control, human-robot interaction, and general physical intelligence, but achieving efficient whole-body control remains a fundamental challenge [1][2].
Group 1: Overview of the Behavior Foundation Model (BFM)
- The article presents the Behavior Foundation Model (BFM) as a response to the limitations of traditional controllers, enabling zero-shot or rapid adaptation to diverse downstream tasks through large-scale pre-training [1][2].
- BFM is defined as a special class of foundation model for controlling agent behavior in dynamic environments, rooted in the principles behind general foundation models such as GPT-4 and CLIP and pre-trained on large-scale behavior data [12][13].
Group 2: Evolution of Humanoid Whole-Body Control Algorithms
- The evolution of humanoid whole-body control is summarized in three stages: model-based controllers, learning-based task-specific controllers, and behavior foundation models [4][6][7].
- Model-based controllers rely heavily on physical models and require complex manual design, while learning-based task-specific controllers generalize poorly across tasks [6][7][8].
Group 3: BFM Methodology and Algorithms
- Current BFM construction methods fall into three types: goal-conditioned learning, intrinsic-reward-driven learning, and forward-backward representation learning (a toy goal-conditioned example follows this summary) [13].
- A notable goal-conditioned example is MaskedMimic, which learns foundational motor skills through motion tracking and supports seamless task switching [18][20].
Group 4: Applications and Limitations of BFM
- BFM has potential applications in humanoid robotics, virtual game agents, Industry 5.0, and medical assistance robots, enabling rapid adaptation to diverse tasks [31][33].
- However, BFM faces limitations such as sim-to-real transfer, where discrepancies between simulated and real-world dynamics hinder practical deployment [32][34].
Group 5: Future Research Opportunities and Risks
- Future opportunities include integrating multimodal inputs, developing advanced machine-learning systems, and establishing standardized evaluation mechanisms for BFM [36][38].
- Risks include ethical concerns around training-data bias, data bottlenecks, and the need for robust safety mechanisms to ensure reliability in open environments [36][39].
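The toy example referenced in Group 3 illustrates the goal-conditioned recipe: one network serves many tasks because the goal is an input rather than baked into the weights. The dimensions and architecture are illustrative assumptions, not those of MaskedMimic or any published BFM.

```python
# A toy sketch of goal-conditioned control; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    def __init__(self, obs_dim: int, goal_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # joint targets in [-1, 1]
        )

    def forward(self, obs: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        # Concatenating the goal lets one network serve many downstream tasks:
        # swapping the goal re-targets behavior without retraining.
        return self.net(torch.cat([obs, goal], dim=-1))

policy = GoalConditionedPolicy(obs_dim=48, goal_dim=16, act_dim=19)
action = policy(torch.randn(1, 48), torch.randn(1, 16))
print(action.shape)  # torch.Size([1, 19])
```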
Being-H0: A VLA Model That Learns Dexterous Manipulation from Large-Scale Human Videos
具身智能之心· 2025-07-23 08:45
Core Insights - The article discusses advances in vision-language-action models (VLAs) and the challenges robotics faces in complex dexterous manipulation due to data limitations [3][4].
Group 1: Research Background and Motivation
- Large language models and multimodal models have made significant progress, but robotics still lacks a transformative moment akin to "ChatGPT" [3].
- Existing VLAs struggle with dexterous tasks because they rely on synthetic data or limited teleoperation demonstrations, which are costly to collect for fine manipulation [3].
- Human videos contain rich real-world manipulation data, but learning from them raises challenges in data heterogeneity, hand-motion quantization, cross-modal reasoning, and transfer to robot control [3].
Group 2: Core Methodology
- The article introduces Physical Instruction Tuning, a paradigm with three phases (pre-training, physical space alignment, and post-training) for transferring human hand-movement knowledge to robotic manipulation [4].
Group 3: Pre-training Phase
- The pre-training phase treats the human hand as the ideal manipulator and robotic hands as simplified versions of it, training a foundational VLA on large-scale human videos [6].
- Inputs include visual information, language instructions, and parameterized hand motions, optimizing the mapping from vision and language to motion [6][8].
Group 4: Physical Space Alignment
- Physical space alignment corrects for differing camera parameters and coordinate systems via weak-perspective projection alignment and motion distribution balancing [10][12].
- The model adapts to a specific robot by projecting the robot's proprioceptive state into the model's embedding space and generating executable actions through learnable query tokens [13].
Group 5: Key Technologies
- Motion tokenization and cross-modal fusion must retain fine motion precision while discretizing continuous movements [14][17].
- Hand movements are decomposed into wrist and finger motions, each tokenized separately, with reconstruction accuracy preserved by a combination of loss functions (see the tokenization sketch after this summary) [18].
Group 6: Dataset and Experimental Results
- The UniHand dataset, comprising over 440,000 task trajectories and 1.3 billion frames, supports large-scale pre-training across diverse tasks and data sources [21].
- Experiments show the Being-H0 model outperforms baselines on hand-motion generation and translation tasks, with better spatial accuracy and semantic alignment [22][25].
Group 7: Long-Sequence Motion Generation
- The model generates long motion sequences (2-10 seconds) using soft-format decoding, which helps maintain trajectory stability [26].
Group 8: Real-Robot Experiments
- On practical tasks such as grasping and placing, Being-H0 shows markedly higher success rates than baseline models, achieving 65% and 60% success on unseen-toy and cluttered-scene tasks, respectively [28].
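The sketch referenced in Group 5 makes the wrist/finger tokenization idea concrete: continuous motion channels are discretized into tokens and then reconstructed to measure quantization error. The uniform binning, vocabulary size, and value ranges are assumptions for illustration; the article describes Being-H0's tokenizer only at the level summarized above.

```python
# A minimal sketch of discretizing continuous hand motion into tokens, in the
# spirit of the wrist/finger split above; the fixed uniform grid is an assumption.
import numpy as np

NUM_BINS = 256  # assumed vocabulary size per channel

def tokenize(motion: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Map each continuous value in [lo, hi] to an integer token in [0, NUM_BINS-1]."""
    clipped = np.clip(motion, lo, hi)
    return np.round((clipped - lo) / (hi - lo) * (NUM_BINS - 1)).astype(np.int64)

def detokenize(tokens: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Invert tokenize up to quantization error, the 'reconstruction accuracy'
    that the combined losses above are meant to preserve."""
    return tokens.astype(np.float64) / (NUM_BINS - 1) * (hi - lo) + lo

wrist = np.random.uniform(-0.5, 0.5, size=(30, 6))     # 6-DoF wrist pose over 30 frames
fingers = np.random.uniform(-1.5, 1.5, size=(30, 15))  # 15 finger joint angles

wrist_tokens = tokenize(wrist, -0.5, 0.5)    # wrist and fingers tokenized separately
finger_tokens = tokenize(fingers, -1.5, 1.5)
err = np.abs(detokenize(wrist_tokens, -0.5, 0.5) - wrist).max()
print(f"max wrist reconstruction error: {err:.4f}")
```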
How Far Is It from "Thinking Well" to "Doing Well"? Demystifying the Path to Embodied Brain-Cerebellum Collaboration
具身智能之心· 2025-07-23 08:45
Core Viewpoint - The article discusses the integration of "brain," "cerebellum," and "body" in embodied intelligent systems, emphasizing the need for improved collaboration and data acquisition on the road to artificial general intelligence (AGI) [2][3][4].
Group 1: Components of Embodied Intelligence
- The "brain" is responsible for perception, reasoning, and planning, building on large language models and vision-language models [2].
- The "cerebellum" focuses on movement, employing motion-control algorithms and feedback systems to make robotic actions more natural and precise [2].
- The "body" is the physical entity that executes the plans generated by the "brain" and the movements coordinated by the "cerebellum," embodying the unity of knowing and doing (a schematic of this division of labor follows this summary) [2].
Group 2: Challenges and Future Directions
- The "brain" needs stronger reasoning, able to infer task paths without explicit instructions or maps [3].
- The "cerebellum" should become more intuitive, letting robots react flexibly in complex environments and handle delicate objects with care [3].
- Brain-cerebellum collaboration needs improvement: current communication is slow and responses are delayed, and the goal is a seamless interaction system [3].
Group 3: Data Acquisition
- Data collection is often difficult, expensive, and noisy, hindering the training of intelligent systems [3].
- The article calls for a training repository that is realistic, diverse, and transferable to improve data quality and accessibility [3].
Group 4: Expert Discussion
- A roundtable discussion is planned with experts from the Beijing Academy of Artificial Intelligence and Zhiyuan Robotics to explore recent technological advances and future pathways for embodied intelligence [4].
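The schematic referenced above reduces the brain/cerebellum/body split to a toy control loop: a slow deliberate planner, a fast motion controller, and an executing body. The classes and methods are illustrative stand-ins, not any real robot stack.

```python
# A toy sketch of the brain/cerebellum/body division of labor described above.
class Brain:
    """Slow loop: perception, reasoning, planning (LLM/VLM territory)."""
    def plan(self, instruction: str) -> list:
        return [f"step_{i}:{instruction}" for i in range(3)]

class Cerebellum:
    """Fast loop: turns each plan step into low-level motor commands."""
    def track(self, step: str) -> str:
        return f"motor_cmd_for({step})"

class Body:
    """Executes motor commands; in reality, the actuator interface."""
    def execute(self, cmd: str) -> None:
        print("executing", cmd)

brain, cerebellum, body = Brain(), Cerebellum(), Body()
for step in brain.plan("tidy the desk"):  # slow, deliberate
    body.execute(cerebellum.track(step))  # fast, reactive
```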
Course Opening Soon! A Tutorial on Embodied Goal-Oriented Navigation Algorithms and Practice Is Here
具身智能之心· 2025-07-23 08:45
Core Viewpoint - Goal-Oriented Navigation empowers robots to complete navigation tasks autonomously from goal descriptions alone, marking a significant shift from traditional visual-language navigation systems [2][3].
Group 1: Technology Overview
- Embodied navigation is a core area of embodied intelligence, resting on three technical pillars: language understanding, environmental perception, and path planning [2].
- Goal-Oriented Navigation requires robots to explore and plan paths in unfamiliar 3D environments using only goal descriptions such as coordinates, images, or natural language [2].
- The technology has been industrialized in verticals including delivery, healthcare, and hospitality, with companies like Meituan and Aethon deploying autonomous delivery robots [3].
Group 2: Technological Evolution
- The evolution of Goal-Oriented Navigation falls into three generations:
1. First generation: end-to-end methods built on reinforcement learning and imitation learning, achieving breakthroughs in point navigation and closed-set image navigation tasks [5].
2. Second generation: modular methods that explicitly construct semantic maps, decomposing the task into exploration and goal localization (see the sketch after this summary) [5].
3. Third generation: integration of large language models (LLMs) and vision-language models (VLMs) for knowledge reasoning and open-vocabulary target matching [7].
Group 3: Challenges and Learning Path
- The complexity of embodied navigation, particularly Goal-Oriented Navigation, demands knowledge from multiple fields, including natural language processing, computer vision, and reinforcement learning [9].
- A new course addresses these learning challenges, focusing on quick entry, building a research framework, and combining theory with practice [10][11][12].
Group 4: Course Structure
- The course spans six chapters covering the core framework of semantic navigation, the Habitat simulation ecosystem, end-to-end navigation methodologies, modular navigation architectures, and LLM/VLM-driven navigation systems [16][18][19][21][23].
- A major project within the course focuses on reproducing the VLFM algorithm and deploying it in real-world scenarios, giving students hands-on practice [25].
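The sketch referenced in Group 2 illustrates the second-generation modular pipeline on a toy one-dimensional corridor: explore until the goal object appears on the semantic map, then navigate to it. Every name here is illustrative; none of it is a Habitat API or course material.

```python
# A schematic, self-contained sketch of the modular explore-then-localize
# pipeline; the world is a toy 1-D corridor of object positions.
def modular_goal_navigation(world, goal, start=0, max_steps=50):
    """world: dict position -> object name; returns the goal position or None."""
    semantic_map = {}  # explicit semantic map: object -> position
    pos, frontier = start, start + 1
    for _ in range(max_steps):
        # Perception: record whatever object is visible at the current cell.
        if pos in world:
            semantic_map[world[pos]] = pos
        # Goal localization: once the goal is mapped, navigate straight to it.
        if goal in semantic_map:
            return semantic_map[goal]
        # Exploration: otherwise push the frontier outward.
        pos, frontier = frontier, frontier + 1
    return None

corridor = {3: "chair", 7: "sofa", 12: "bed"}
print(modular_goal_navigation(corridor, "sofa"))  # -> 7
```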
X-Nav: An End-to-End Cross-Platform Navigation Framework Whose Universal Policy Achieves Zero-Shot Transfer
具身智能之心· 2025-07-22 06:29
Core Viewpoint - The article presents the X-Nav framework for end-to-end cross-embodiment navigation, allowing a single universal policy to be deployed across different robot forms, including wheeled and quadrupedal robots [3][4].
Group 1: Existing Limitations
- Current navigation methods are typically designed for specific robot forms, limiting their generalizability across platforms [4].
- Navigation requires robots to move without collisions through complex environments using visual observations, target positions, and proprioception, but existing methods face significant limitations here [4].
Group 2: X-Nav Architecture
- The X-Nav architecture consists of two core phases: expert policy learning and universal policy refinement [5][8].
- Phase 1 trains multiple expert policies with deep reinforcement learning (DRL) on randomly generated robot embodiments [6].
- Phase 2 refines these expert policies into a single universal policy using a Nav-ACT transformer model [8].
Group 3: Training and Evaluation
- Training uses the Proximal Policy Optimization (PPO) algorithm, with a reward function combining task rewards and regularization rewards tailored to wheeled and quadrupedal robots (see the reward sketch after this summary) [10][16].
- Experiments show X-Nav outperforms other methods in success rate (SR) and Success weighted by Path Length (SPL), with the Jackal robot achieving an SR of 90.4% and an SPL of 0.84 [13].
- Scalability studies indicate that increasing the number of training embodiments significantly improves adaptation to unseen robots [14].
Group 4: Ablation Studies
- Ablations validate the design choices: replacing MSE with L1 loss degrades performance because large errors are insufficiently penalized [21].
- Executing complete action chunks delays quadrupedal adaptation to dynamic changes, and omitting temporal ensembling (TE) produces rough actions on wheeled robots [21].
Group 5: Real-World Testing
- Real-world tests indoors and outdoors achieve an 85% success rate and an SPL of 0.79, confirming X-Nav's generalizability across sensor configurations [22].
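The sketch referenced in Group 3 shows how a two-part navigation reward (task progress plus regularization) might be assembled; the terms and weights are illustrative assumptions, not X-Nav's actual reward coefficients.

```python
# A hedged sketch of a task-plus-regularization reward; weights are assumptions.
import numpy as np

def navigation_reward(dist_prev, dist_now, collided, action, action_prev,
                      w_progress=1.0, w_collision=10.0, w_smooth=0.05):
    # Task reward: progress toward the goal this step.
    r_task = w_progress * (dist_prev - dist_now)
    # Regularization: penalize collisions and jerky action changes, which
    # matters for keeping randomly generated embodiments stable.
    r_reg = -w_collision * float(collided)
    r_reg -= w_smooth * float(np.sum((action - action_prev) ** 2))
    return r_task + r_reg

a, a_prev = np.array([0.4, -0.1]), np.array([0.3, 0.0])
print(navigation_reward(dist_prev=5.2, dist_now=4.9, collided=False,
                        action=a, action_prev=a_prev))
```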
A New SOTA for Robot Demand-Driven Navigation, with a 15% Higher Success Rate! Jointly Built by Zhejiang University & vivo
具身智能之心· 2025-07-22 06:29
Core Viewpoint - The article discusses advances in embodied intelligence, focusing on CogDDN, a new framework from a Zhejiang University and vivo AI Lab research team that enables robots to understand human needs and navigate complex environments autonomously [2][3][6].
Research Motivation
- As mobile robots enter daily life, they must understand human needs rather than merely execute commands; for instance, a robot should seek food on its own when a person says they are hungry [6].
- Traditional navigation methods often struggle in unfamiliar environments because they depend on extensive data training, motivating a more generalizable approach that mimics human reasoning [7].
Framework Overview
- CogDDN builds on the dual-process theory from psychology, combining heuristic (System 1) and analytical (System 2) decision-making processes to strengthen navigation [9][10].
- The framework consists of three main components: a 3D perception module, a demand-matching module, and a dual-process decision-making module [13].
3D Robot Perception Module
- The team uses the UniMODE method for single-view 3D object detection, improving indoor navigation without relying on multiple views or depth sensors [15].
Demand Matching Module
- This module aligns human needs with object characteristics, using supervised fine-tuning to improve how accurately large language models (LLMs) match user requests to suitable objects [16].
Dual-Process Decision Making
- The heuristic process makes quick, intuitive decisions from past experience, while the analytical process handles error reflection and strategy optimization (see the sketch after this summary) [18][23].
- The Explore and Exploit modules within the heuristic process let the system adapt to new environments and reach navigation goals efficiently [19][20].
Experimental Results
- Evaluated in the AI2-THOR simulator on the ProcTHOR dataset, CogDDN significantly outperforms existing state-of-the-art methods, with a navigation success rate (NSR) of 38.3% and a 34.5% success rate in unseen scenes [26][27].
- Removing key components such as the Exploit module or the chain of thought (CoT) markedly degrades performance, underscoring their role in decision-making [29][30].
Conclusion
- CogDDN is a cognition-driven navigation system that continuously learns, adapts, and optimizes its strategies, effectively simulating human-like reasoning in robots [33][34].
- Its dual-process capability improves performance on demand-driven navigation tasks, laying a solid foundation for advancing intelligent robotics [35].
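The sketch referenced above reduces the dual-process dispatch to its skeleton: a fast heuristic lookup over past experience (System 1) with a slower analytical fallback (System 2) whose results are cached for reuse. The cache and planner are illustrative stand-ins, not CogDDN's modules.

```python
# A toy sketch of dual-process decision making: fast cached heuristics with a
# slow analytical fallback; all names are illustrative assumptions.
def decide(need, experience_cache, analytical_planner):
    # System 1: reuse a past successful strategy if one matches the need.
    if need in experience_cache:
        return experience_cache[need]
    # System 2: deliberate, then store the result so System 1 can reuse it.
    plan = analytical_planner(need)
    experience_cache[need] = plan
    return plan

cache = {"hungry": "go_to_kitchen"}
planner = lambda need: f"explore_for_{need}"
print(decide("hungry", cache, planner))   # fast path -> go_to_kitchen
print(decide("thirsty", cache, planner))  # slow path -> explore_for_thirsty
print(decide("thirsty", cache, planner))  # now cached -> fast path
```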
Another Nearly 600 Million Yuan in Funding! The Moz1 Robot Rolls into the Office, Sprinting Toward a Trillion-Yuan Market
具身智能之心· 2025-07-22 06:29
Core Viewpoint - The article highlights the rapid growth of and investment in embodied intelligence, focusing on Qianxun Intelligent, which has recently secured significant funding and launched its commercial humanoid robot, Moz1, showcasing advanced capabilities across a range of tasks [3][15][21].
Investment and Market Dynamics
- The embodied intelligence sector is experiencing explosive growth, with major players from Silicon Valley to China competing to move AI from virtual to physical applications [4][5].
- Qianxun Intelligent has completed nearly 600 million yuan in Pre-A+ funding led by JD.com, indicating strong investor confidence and significant market potential [15][19].
- Since its founding in February 2024, the company has rapidly attracted investment and become a favorite of the capital market [16][17].
Technological Advancements
- Qianxun's Moz1 humanoid robot features 26 degrees of freedom and is built on high-density integrated force-controlled joints, exceeding the power density of competitors such as Tesla's Optimus by 15% [22][24].
- The robot performs complex tasks such as cleaning and organizing, demonstrating advanced multimodal perception and control capabilities [29][35].
- Its VLA (Vision-Language-Action) model and Spirit v1 framework enable seamless integration of perception, understanding, and action, significantly improving operational efficiency [37][41].
Commercial Strategy
- Qianxun takes a market-driven approach, conducting extensive research across sectors to identify high-value applications for its technology [56][58].
- The company aims to penetrate multiple high-value markets, including logistics and healthcare, leveraging its international experience to expand globally [59][60].
- It has established a business model that forms a feedback loop between market needs, technological development, and product deployment [57][68].
Competitive Edge
- Qianxun stands out through its distinctive technology path, rapid iteration capabilities, and a team of top global robotics talent [62][66].
- Its strategic focus on high-value scenarios and ability to adapt quickly to market changes have earned substantial trust and investment from industry players [68][70].