具身智能之心

The 具身智能之心 Dexterous Hand and Tactile Perception Discussion Group Is Here!
具身智能之心· 2025-08-18 00:07
Group 1
- The establishment of a community focused on dexterous hands and tactile perception technology has been announced, inviting individuals involved in control, algorithms, hardware, and VTLA related to dexterous hands to join [1]
- The community aims to discuss industry and academic developments as well as engineering implementation [1]
Group 2
- Interested individuals can add the assistant on WeChat to join the group, mentioning "dexterous hand" along with their nickname [2]
NIPS 2025 MARS Multi-Agent Embodied Intelligence Challenge Officially Launched!
具身智能之心· 2025-08-18 00:07
Core Insights
- The article discusses the challenges and advancements in multi-agent embodied intelligence, emphasizing the need for efficient collaboration among robotic systems to tackle complex tasks in real-world environments [3][4]
Group 1: Challenges in Embodied Intelligence
- Single intelligent agents are insufficient for complex and dynamic task scenarios, necessitating high-level collaboration among multiple embodied agents [3]
- The MARS Challenge aims to address these challenges by encouraging global researchers to explore the high-level planning and low-level control capabilities of multi-agent systems [4]
Group 2: MARS Challenge Overview
- The MARS Challenge features two complementary tracks focusing on planning and control, aiming to evaluate the capabilities of intelligent agents in complex tasks [4][12]
- The challenge will culminate in results and awards announced at the NeurIPS 2025 SpaVLE Workshop [4]
Group 3: Track 1 - Multi-Agent Embodied Planning
- Track 1 focuses on high-level task planning and role assignment for heterogeneous robots, utilizing the ManiSkill platform and RoboCasa dataset [5][6]
- Participants will use visual language models to select appropriate robot combinations and create high-level action sequences based on natural language instructions [5][8]
Group 4: Track 2 - Multi-Agent Control Strategy Execution
- Track 2 emphasizes the collaborative capabilities of multi-agent systems in executing complex tasks, requiring real-time interaction with dynamic environments [12]
- The RoboFactory simulation environment will be used to develop and evaluate cooperative strategies, with participants designing deployable control models [12][13]
Group 5: Timeline and Participation
- The challenge timeline includes a warm-up round starting on August 18, 2025, and the official competition beginning on September 1, 2025 and concluding on October 31, 2025 [25]
- Participants from fields such as robotics, computer vision, and natural language processing are encouraged to join and showcase their creativity and technology [26]
Diffusion World Model LaDi-WM Substantially Boosts Robotic Manipulation Success Rates and Cross-Scene Generalization
具身智能之心· 2025-08-18 00:07
Core Viewpoint
- The article discusses the development of LaDi-WM (Latent Diffusion-based World Models), a novel world model that enhances robotic manipulation performance through prediction-guided policies, addressing the challenge of accurately predicting future states in robot-object interactions [1][5][28]
Group 1: LaDi-WM Overview
- LaDi-WM uses pre-trained vision foundation models to build latent-space representations that capture both geometric and semantic features, facilitating policy learning and cross-task generalization in robotic manipulation [1][5][10]
- The framework consists of two main phases, world model learning and policy learning, which iteratively refine action outputs based on predicted future states [9][12]
Group 2: Methodology
- The world model learning phase extracts geometric representations with DINOv2 and semantic representations with Siglip, followed by an interactive diffusion process that improves dynamic prediction accuracy (a minimal sketch of this data flow follows below) [10][12]
- Policy training incorporates the world model's future predictions as additional inputs, guiding the model to improve action predictions and reduce output-distribution entropy over iterations [12][22]
Group 3: Experimental Results
- In simulation experiments on the LIBERO-LONG benchmark, LaDi-WM achieved a success rate of 68.7% with only 10 training trajectories, outperforming previous methods by a significant margin [15][16]
- On the CALVIN D-D benchmark, the framework completed tasks with an average length of 3.63, indicating strong capability on long-horizon tasks [17][21]
- Real-world experiments showed a 20% increase in success rates on tasks such as stacking bowls and operating drawers, validating LaDi-WM's effectiveness in practical scenarios [25][26]
Group 4: Scalability and Generalization
- Scalability experiments showed that increasing the world model's training data reduced prediction errors and improved policy performance [18][22]
- The world model's generalization was highlighted by its ability to guide policy learning across different environments, achieving better performance than models trained solely in the target environment [20][21]
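Below is a minimal, self-contained sketch of the prediction-then-refinement loop described in the methodology, assuming small placeholder encoders in place of DINOv2/Siglip and a single feed-forward step in place of the full latent diffusion process; module names and dimensions are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PlaceholderEncoder(nn.Module):
    """Stands in for a frozen vision foundation model (e.g. DINOv2 or Siglip)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
    def forward(self, image):
        return self.net(image)

class LatentWorldModel(nn.Module):
    """Predicts the next latent state from the current latent and action.
    A single feed-forward step stands in for the interactive diffusion process."""
    def __init__(self, latent_dim=256, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
    def forward(self, latent, action):
        return self.net(torch.cat([latent, action], dim=-1))

class Policy(nn.Module):
    """Consumes the current latent plus the world model's predicted future latent."""
    def __init__(self, latent_dim=256, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )
    def forward(self, latent, predicted_future):
        return self.net(torch.cat([latent, predicted_future], dim=-1))

geometric_enc, semantic_enc = PlaceholderEncoder(), PlaceholderEncoder()
world_model, policy = LatentWorldModel(), Policy()

image = torch.randn(1, 3, 224, 224)            # current camera observation (toy)
latent = torch.cat([geometric_enc(image), semantic_enc(image)], dim=-1)
action = policy(latent, torch.zeros(1, 256))   # first pass without a prediction
future = world_model(latent, action)           # predict the future latent state
refined_action = policy(latent, future)        # refine the action using the prediction
print(refined_action.shape)                    # torch.Size([1, 7])
```

Looping the last three lines is the simple analogue of the iterative refinement the article describes, where each pass conditions the policy on an updated future prediction.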
Sun Yat-sen & Tsinghua: A Survey of Embodied Intelligence Systems Based on Large Models
具身智能之心· 2025-08-16 16:03
Core Viewpoint
- The article provides a comprehensive overview of embodied intelligence systems based on large models, highlighting their applications, challenges, and future directions across domains such as home services, healthcare, education, and industry [6][39]
Summary by Sections
Perception and Understanding
- Embodied intelligence systems use sensors such as cameras and microphones to receive raw data and interpret it to form environmental awareness. Large models excel at processing multimodal input data, effectively integrating text, images, and audio to capture relationships and extract high-dimensional features for understanding the world [5][6]
- Multimodal models such as GPT-4V enhance environmental understanding by encoding images and text into a shared vector space, facilitating perception and comprehension of user instructions (a toy sketch of this shared-space matching follows below) [9]
Control Levels
- The control levels of embodied intelligence systems are categorized into demand level, task level, planning level, and action level, each with representative works that demonstrate the application of large models [6][11]
System Architecture
- The architecture of embodied intelligence systems includes end-to-end Transformer architectures and combinations of frozen-parameter large models with foundation models, allowing flexible optimization without sacrificing generalization [21][29]
Data Sources
- Data sources for training embodied intelligence systems include simulators, imitation learning, and video learning, with simulators providing a controlled environment for rapid data collection and testing [31][32]
Challenges
- Key challenges include the scarcity of real-world data, slow inference speeds, and the need for multi-agent collaboration in complex tasks [39][40]
Future Development Directions
- Future directions involve improving data collection methods, optimizing large models for faster inference, enhancing multi-agent collaboration, and expanding applications across fields [41][44]
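As a toy illustration of the "shared vector space" mentioned under Perception and Understanding, the sketch below matches an image embedding against candidate instruction embeddings by cosine similarity; both encoders are placeholder modules, not the architecture of GPT-4V or any specific model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder encoders that project both modalities into the same 256-d space.
image_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(256))
text_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(256))

image = torch.randn(1, 3, 224, 224)       # camera observation (toy)
instructions = torch.randn(3, 32, 64)     # 3 candidate instructions as toy token features

img_emb = F.normalize(image_encoder(image), dim=-1)
txt_emb = F.normalize(text_encoder(instructions), dim=-1)
similarity = img_emb @ txt_emb.T          # (1, 3) cosine similarities in the shared space
print(similarity.argmax(dim=-1))          # index of the best-matching instruction
```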
Still Can't Get Into Embodied AI? Others Here Have Already Overtaken You on the Curve...
具身智能之心· 2025-08-16 16:03
Core Viewpoint
- The article emphasizes the value of a community that provides solutions to problems in the field of embodied intelligence, facilitating knowledge sharing and job opportunities for its members [3][17]
Group 1: Community and Support
- The Embodied Intelligence Knowledge Planet has created a closed-loop system covering industry, academia, job seeking, and Q&A exchanges [3][17]
- The community offers a platform for members to share solutions to problems encountered in their work, such as data collection and model deployment [3][4]
- Members can access a wealth of resources, including over 30 technical routes, open-source projects, and job postings from leading companies in the field [4][11][31]
Group 2: Educational Resources
- The community has compiled numerous learning paths and technical stacks for beginners, as well as industry frameworks and project plans for those already engaged in research [12][14]
- Topics covered include robot simulation, data collection platforms, and the challenges of implementing VLA (Visual-Language-Action) models [4][9]
- The community provides access to academic papers, industry reports, and books related to robotics and embodied intelligence [24][27][29]
Group 3: Networking and Job Opportunities
- The community has established a job-referral mechanism with multiple companies in the embodied intelligence sector, facilitating connections between job seekers and employers [11][18]
- Members are encouraged to engage with industry leaders through forums and live discussions, enhancing their professional networks [18][77]
- The community aims to create a supportive environment for discussing career choices and research directions, ensuring members receive timely advice and insights [79]
ICCV 2025 | HERMES: The First World Model to Unify 3D Scene Understanding and Generation
具身智能之心· 2025-08-16 16:03
Core Viewpoint
- HERMES presents a unified framework for autonomous driving that integrates understanding and generation tasks, addressing the challenge of accurately predicting future scenarios while comprehensively understanding the current environment [6][10][26]
Group 1: Introduction to HERMES
- HERMES is designed to enhance the capabilities of autonomous vehicles by combining deep environmental understanding with accurate future-scene prediction [6][9]
- The framework aims to overcome the traditional separation of understanding and generation tasks in existing models, which limits their effectiveness in real-world driving scenarios [7][10]
Group 2: Methodology of HERMES
- HERMES uses a Driving World Model (DWM) for future scene generation and a Large Language Model (LLM) for scene understanding, creating synergy between the two [14][12]
- The Bird's-Eye View (BEV) representation is employed to encode high-resolution images efficiently, preserving spatial relationships and semantic details [15]
- A World Queries mechanism bridges understanding and generation, allowing the model to leverage contextual knowledge for better predictions [16]
Group 3: Training and Optimization
- HERMES is trained through a joint optimization that combines a language modeling loss and a point cloud generation loss, ensuring balanced performance across tasks (a minimal sketch of such a joint objective follows below) [18][20]
- The end-to-end training approach allows HERMES to achieve high accuracy in both understanding and generating future scenarios [20]
Group 4: Experimental Results
- HERMES outperforms existing models in both scene understanding and future generation, reducing future point cloud error by 32.4% compared to similar models [22]
- The model shows significant improvements in natural language generation metrics, with an 8% increase in CIDEr score compared to dedicated understanding models [22]
Group 5: Future Outlook
- HERMES lays a foundation for further exploration of complex perception tasks, working toward a general driving model capable of comprehensive physical-world understanding [26][27]
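The sketch below illustrates one way a joint objective of the kind described in Group 3 could be assembled: a language modeling cross-entropy term plus a point cloud generation term. The Chamfer-style distance and the loss weight are assumptions for illustration, not HERMES's exact formulation.

```python
import torch
import torch.nn.functional as F

def chamfer_distance(pred_points, gt_points):
    """Symmetric Chamfer distance between two point clouds of shape (B, N, 3)."""
    dists = torch.cdist(pred_points, gt_points)                     # (B, N_pred, N_gt)
    return dists.min(dim=2).values.mean() + dists.min(dim=1).values.mean()

def joint_loss(text_logits, text_targets, pred_points, gt_points, gen_weight=1.0):
    """Language-modeling cross-entropy plus a weighted point cloud generation loss."""
    lm_loss = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    gen_loss = chamfer_distance(pred_points, gt_points)
    return lm_loss + gen_weight * gen_loss

# Toy shapes: batch of 2, 16 text tokens over a 1000-word vocab, 512-point clouds.
logits = torch.randn(2, 16, 1000)
targets = torch.randint(0, 1000, (2, 16))
pred_pc, gt_pc = torch.randn(2, 512, 3), torch.randn(2, 512, 3)
print(joint_loss(logits, targets, pred_pc, gt_pc))
```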
Evaluating the Performance and Limits of Generalist Policies like π0 in Complex Real-World Scenarios
具身智能之心· 2025-08-16 16:03
Core Viewpoint
- The article evaluates the PI0-FAST-DROID model in real-world scenarios, highlighting its potential as a generalist model for robotic manipulation and the challenges it faces across tasks [4][10][73]
Evaluation Method
- The evaluation used the π₀-FAST-DROID model, fine-tuned for the DROID robot platform, which includes a Franka Panda robot equipped with cameras [5][10]
- The assessment involved over 300 trials across various manipulation tasks, relying on subjective evaluations similar to those used in natural language processing [11][10]
Key Findings
- The model demonstrated a strong prior toward reasonable behavior, but this was often insufficient to complete tasks [11]
- Prompt engineering significantly influenced performance, with changes in wording or camera angle leading to substantial fluctuations in success rates [12][56]
- The model exhibited impressive visual-language understanding and could mimic continuous behaviors across different scenarios [13][27]
Performance in Complex Scenarios
- The model showed robust performance in recognizing and manipulating transparent objects and objects camouflaged against complex backgrounds [19][20]
- It maintained focus on tasks despite human activity in the background, indicating strong robustness to human movement [24]
Challenges and Limitations
- The model struggled with semantic ambiguity and a lack of memory, leading to premature task termination in multi-step operations [36][40]
- It struggled with precise spatial reasoning, often failing to lift objects high enough to avoid collisions with containers [46][48]
- Performance was sensitive to prompt quality, with unclear instructions leading to failures [57][59]
Task-Specific Performance
- Progress and success rates varied across task categories, such as pouring (52.3% progress, 24% success) and manipulating articulated objects (37.8% progress, 28.5% success); the sketch after this summary illustrates how such per-category rates are aggregated [85][87]
- In human-robot interaction scenarios, the model achieved a 53.5% progress rate but only a 24% success rate, indicating room for improvement in safety and collaboration [102]
Conclusion
- The evaluation indicates that while the PI0 model shows promise as a generalist policy in unseen manipulation scenarios, significant challenges remain in instruction following, fine manipulation, and performance under partial observability [73]
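To make the reporting style under Task-Specific Performance concrete, the sketch below aggregates per-trial partial progress and binary success into per-category rates; the trial records and the 0-to-1 progress scale are invented for illustration and are not the evaluation's data.

```python
from collections import defaultdict

# Illustrative trial records: each trial has a partial-progress score (0.0-1.0)
# and a binary success flag, grouped by task category.
trials = [
    {"category": "pouring", "progress": 0.6, "success": False},
    {"category": "pouring", "progress": 1.0, "success": True},
    {"category": "articulated_objects", "progress": 0.3, "success": False},
    {"category": "articulated_objects", "progress": 0.5, "success": False},
]

by_category = defaultdict(list)
for trial in trials:
    by_category[trial["category"]].append(trial)

for category, records in by_category.items():
    progress_rate = sum(r["progress"] for r in records) / len(records)
    success_rate = sum(r["success"] for r in records) / len(records)
    print(f"{category}: progress {progress_rate:.1%}, success {success_rate:.1%}")
```

The gap between the two numbers is the point: a policy can make meaningful partial progress (e.g. grasping and tilting a bottle) while still failing the strict binary success criterion.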
Dexterous Hand Design and Its Challenges! Why Is It the Key Technology for Closing the "Hand-Eye-Brain" Perception Loop?
具身智能之心· 2025-08-15 16:03
Core Viewpoint
- The article discusses the evolution of dexterous hands for humanoid robots, emphasizing the transition from morphological mimicry to functional mimicry and highlighting the need for high physical dexterity, multimodal perception, and intelligent decision-making in these robotic hands [2][5]
Group 1: Key Features of Dexterous Hands
- A good research-grade dexterous hand should possess three core features: high physical dexterity (IOD), multimodal perception ability (IOS), and intelligent decision-making potential (IOI) [2]
Group 2: Transmission Solutions
- Current transmission solutions for dexterous hands are dominated by three types:
- Linkage transmission, which is rigid and precise but lacks high degrees of freedom [3]
- Gear transmission, which is compact and controllable but limited in force transmission efficiency and passive compliance [3]
- Tendon-driven (cable-driven) systems, favored by companies such as Tesla and Shadow Hand, which are lightweight and allow natural passive compliance but face engineering challenges such as friction loss and complex system integration [3]
Group 3: Challenges in Key Hardware
- Coordination between tactile sensors and multi-degree-of-freedom joints is a critical bottleneck for dexterous manipulation. Existing capacitive or resistive sensors struggle with spatial density, signal drift, and environmental sensitivity, making it difficult to replicate human-level contact-topology perception [3]
- The design of high-degree-of-freedom joints faces a trade-off between performance, cost, and reliability: more degrees of freedom require more complex drive and transmission systems, resulting in higher failure rates and shorter lifespans [3]
Group 4: Degree-of-Freedom Debate
- The industry is moving away from a fervent "degree-of-freedom competition" toward a rational pursuit of "multi-dimensional system balance." While a 42-degree-of-freedom research hand exceeds the human hand's count (approximately 27 DoF), its practical engineering viability remains to be proven [4]
- The trend is to build a "hexagonal warrior" that balances strength, speed, size, weight, lifespan, degrees of freedom, and structural strength [4]
Group 5: Future of Dexterous Hands vs. Grippers
- In the short term, two-finger and three-finger grippers dominate structured industrial scenarios due to their low cost, stable control, and high reliability, with some users claiming they can handle 95% of tasks [4]
- In the long term, however, unstructured environments such as home services, medical care, and precision assembly will require versatility, flexible object handling, and multimodal grasping capabilities that simple grippers cannot provide [4]
Group 6: Industry Evolution
- As the industry shifts from "mass-production illusion" to "application vision," the players that can close the "hand-eye-brain" loop, achieve software-hardware co-design, and build a developer ecosystem are likely to become the foundational infrastructure of the embodied intelligence era [5]
Latest from Tianjin University & Tsinghua! GeoVLA: Enhancing the 3D Feature Extraction of VLA Models with Clear Robustness Gains (SOTA)
具身智能之心· 2025-08-15 00:05
Core Insights
- The article introduces GeoVLA, a novel framework that integrates 3D information into Vision-Language-Action (VLA) models, enhancing robots' spatial perception and adaptability [3][9][10]
Group 1: Background and Motivation
- Advancing robotic manipulation requires intelligent interaction and precise physical control in real-world environments; recent VLA models have drawn attention for their ability to follow instructions and execute actions [7]
- Current VLA models rely primarily on 2D visual inputs, neglecting the rich geometric information inherent in the 3D physical world, which limits their spatial perception capabilities [8]
Group 2: GeoVLA Framework
- GeoVLA employs a vision-language model (VLM) to process images and language instructions, extracting fused visual-language embeddings. It converts depth maps into point clouds and uses a custom point embedding network to generate 3D geometric embeddings (a minimal depth-to-point-cloud sketch follows below) [3][10][12]
- The framework consists of three key components: the VLM for general understanding, a point embedding network (PEN) for extracting fine-grained 3D features, and a 3D-enhanced action expert (3DAE) for generating action sequences [12][13]
Group 3: Performance Evaluation
- GeoVLA was evaluated on the LIBERO and ManiSkill2 benchmarks, achieving state-of-the-art results, and demonstrated significant robustness in real-world tasks requiring high adaptability and spatial awareness [15][27]
- On LIBERO, GeoVLA achieved an average success rate of 97.7%, outperforming models such as CogACT (93.2%) and OpenVLA-OFT (95.3%) [27]
- On the ManiSkill2 benchmark, GeoVLA achieved a success rate of 77%, surpassing CogACT (69%) and Dita (66%) [27]
Group 4: Ablation Studies
- Ablation studies showed that the PEN encoder outperformed conventional encoders, achieving a success rate of 97.7% compared to 95.8% for an MLP and 95.2% for PointNet [30]
- Using static routing in the MoE architecture improved performance, demonstrating the effectiveness of the design in leveraging multimodal information [30][20]
Group 5: Real-World Experiments
- Real-world experiments showcased GeoVLA's robustness and generalization across various 3D manipulation tasks, maintaining high performance despite changes in camera perspective, height, and object size [36][34]
- GeoVLA achieved an average success rate of 86.3% across basic and 3D-perception tasks, outperforming other models by significant margins [36]
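The sketch below shows the standard pinhole back-projection that a depth-to-point-cloud step like the one in Group 2 relies on; the camera intrinsics and depth values are illustrative assumptions, and a point embedding network such as GeoVLA's PEN would consume a cloud like this downstream.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map in meters into an (H*W, 3) point cloud
    using the pinhole camera model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop pixels with no depth reading

# Toy 480x640 depth map with illustrative intrinsics.
depth = np.random.uniform(0.3, 1.5, size=(480, 640))
cloud = depth_to_point_cloud(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(cloud.shape)  # (307200, 3) for this toy map, since every pixel has valid depth
```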
Figure Humanoid Robot Debuts Dexterous-Hand Clothes Folding! Achieved Just by Adding Data
具身智能之心· 2025-08-15 00:05
Core Viewpoint
- Figure's humanoid robot has learned to fold clothes using an end-to-end approach with no architectural changes, showcasing its adaptability and ability to handle complex tasks [2][21][28]
Group 1: Robot Capabilities
- The humanoid robot folded towels smoothly, using precise finger control and real-time adjustments throughout the process [7][18]
- This task is considered one of the most challenging dexterous manipulations for humanoid robots because of the variability and unpredictability of clothing shapes [15][16]
- The clothes-folding capability was achieved with the same model and architecture used for the robot's previous package-sorting task; the only change was the dataset used for training [14][28]
Group 2: Helix Architecture
- The Helix architecture, developed after Figure's split from OpenAI, is a unified "visual-language-action" model that allows the robot to perceive, understand, and act like a human [21][22]
- Helix consists of two systems that communicate with each other, enabling the robot to perform various tasks with a single set of neural network weights [22]
- Key components of Helix include visual memory, state history, and force feedback, which improve the robot's ability to adapt and respond to its environment [23][29]
Group 3: Future Plans
- Figure plans to continue improving the robot's flexibility, speed, and generalization as real-world data expands [20]
- The company aims to develop the robot's ability to perform a complete set of household tasks, including washing, folding, and potentially hanging clothes [38]