3 Months to Master VLA, VLA + Tactile Sensing, VLA + RL, Embodied World Models, and More!
具身智能之心· 2025-08-22 00:04
Core Viewpoint - The pursuit of Artificial General Intelligence (AGI) is increasingly centered on embodied intelligence, which emphasizes how intelligent agents interact with and adapt to physical environments: perceiving, understanding tasks, executing actions, and learning from feedback [1].

Industry Analysis
- Over the past two years, numerous star teams have emerged in embodied intelligence, founding highly valued companies such as Xinghaitu, Galaxy General, and Zhujidongli that are advancing the technology [3].
- Major domestic players such as Huawei, JD, Tencent, Ant Group, and Xiaomi are investing and collaborating to build a robust embodied-intelligence ecosystem, while internationally Tesla and investment institutions are backing companies like Wayve and Apptronik in autonomous driving and warehouse robotics [5].

Technological Evolution - The field has progressed through several stages:
- The first stage focused on grasp pose detection, which struggled with complex tasks due to a lack of context modeling [6].
- The second stage used behavior cloning, letting robots learn from expert demonstrations but exposing weak generalization and poor performance in multi-target scenarios [6].
- The third stage introduced Diffusion Policy methods, improving stability and generalization by modeling action sequences, followed by the Vision-Language-Action (VLA) phase, which integrates visual perception, language understanding, and action generation [7][8].
- The fourth stage, beginning in 2025, aims to combine VLA models with reinforcement learning, world models, and tactile sensing to overcome current limitations [8].

Product and Market Development
- These advances have produced a range of products, including humanoid robots, robotic arms, and quadrupedal robots, serving industries such as manufacturing, home services, dining, and medical rehabilitation [9].
- As the industry shifts from research to deployment, demand for engineering and systems skills is rising, especially for training and simulating policies on platforms such as Mujoco, IsaacGym, and Pybullet (a minimal simulation sketch follows below) [23].

Educational Initiatives
- A comprehensive curriculum now covers the full embodied "brain + cerebellum" technology route, including hands-on, real-world projects aimed at both beginners and advanced learners [10][20].
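The simulation-platform requirement above is concrete enough to illustrate. Below is a minimal sketch of stepping a control loop with MuJoCo's official Python bindings; the model file name and the random stand-in policy are placeholders, not anything from the article.

```python
# A minimal MuJoCo rollout: load a model, apply controls, step the physics.
# "robot_arm.xml" is a placeholder path; the random policy is a stand-in for
# whatever learned policy would be trained or evaluated in simulation.
import mujoco
import numpy as np

model = mujoco.MjModel.from_xml_path("robot_arm.xml")  # placeholder model file
data = mujoco.MjData(model)

for step in range(1000):
    data.ctrl[:] = np.random.uniform(-1.0, 1.0, size=model.nu)  # stand-in policy
    mujoco.mj_step(model, data)  # advance the physics by one timestep
```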
The Cocos System: Faster Convergence and Higher Success Rates for Your VLA Models
具身智能之心· 2025-08-22 00:04
Core Viewpoint - The article covers advances in embodied intelligence, focusing on diffusion policies and a new method called Cocos, which addresses loss collapse when training diffusion policies and thereby improves training efficiency and performance [3][11][25].

Summary by Sections

Introduction
- Embodied intelligence is a cutting-edge field of AI research that requires robots to understand and execute complex tasks. Diffusion policies have become a mainstream paradigm for building vision-language-action (VLA) models, but training efficiency remains a challenge [3].

Loss Collapse and Cocos
- Loss collapse is identified as a core obstacle in training diffusion policies: the neural network fails to distinguish between generation conditions, degrading the training objective. Cocos counters this by making the source distribution depend on the generation condition [6][9][25].

Flow Matching Method
- Flow matching, a core technique in diffusion models, learns to transport a simple source distribution to a complex target distribution. The article lays out the optimization objective for conditional flow matching, which is central to VLA models (see the sketch after this summary) [5][6].

Experimental Results
- Quantitative experiments show that Cocos substantially improves training efficiency and policy performance across benchmarks, including LIBERO and MetaWorld, as well as real-world robotic tasks [14][16][19][24].

Case Studies
- Case studies in simulation show Cocos improves the robot's ability to distinguish between camera viewpoints and complete tasks successfully [18][21].

Source Distribution Design
- Experiments on source distribution design compare different standard deviations and training methods, concluding that a standard deviation of 0.2 is optimal and that training the source distribution with a VAE yields comparable results [22][24].

Conclusion
- Cocos offers a general improvement to diffusion policy training by resolving the loss collapse problem, laying groundwork for future research and applications in embodied intelligence [25].
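To make the Cocos idea concrete: standard flow matching draws the source sample from a fixed N(0, I), while Cocos makes the source depend on the generation condition. Below is a minimal PyTorch sketch under that reading; the network shapes, names, and the `source_mean_net` module are illustrative assumptions, not the paper's implementation.

```python
# Condition-dependent source flow matching (the Cocos idea, sketched):
# draw x0 from N(mu(c), sigma^2 I) instead of a fixed N(0, I), then regress
# the constant velocity along the linear interpolation path to the action.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the flow-matching velocity v(x_t, t, c)."""
    def __init__(self, act_dim=7, cond_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + 1 + cond_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, x_t, t, c):
        return self.net(torch.cat([x_t, t, c], dim=-1))

def cocos_flow_matching_loss(model, source_mean_net, actions, cond, sigma=0.2):
    """actions: (B, act_dim) expert actions; cond: (B, cond_dim) observation
    embedding; source_mean_net maps cond_dim -> act_dim (e.g. a small MLP)."""
    b = actions.shape[0]
    # Condition-dependent source: x0 ~ N(mu(c), sigma^2 I) rather than N(0, I).
    x0 = source_mean_net(cond) + sigma * torch.randn_like(actions)
    t = torch.rand(b, 1, device=actions.device)
    x_t = (1 - t) * x0 + t * actions   # linear interpolation path
    v_target = actions - x0            # constant velocity along that path
    v_pred = model(x_t, t, cond)
    return ((v_pred - v_target) ** 2).mean()
```

The default `sigma=0.2` mirrors the standard deviation the article reports as optimal in the source-distribution experiments; everything else here is a simplification.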
More Powerful Than the H20: Nvidia's Latest B30A Chip Revealed
具身智能之心· 2025-08-21 00:03
Core Viewpoint - Nvidia is developing a new AI chip, codenamed B30A, which is expected to outperform the H20 and is based on the latest Blackwell architecture [2][3][4].

Group 1: Chip Development
- The B30A will use a single-chip design, integrating all major components onto one piece of silicon [8].
- Its raw compute is reported to be roughly half that of Nvidia's flagship B300 GPU, which uses a dual-chip configuration [6].
- Chips based on this architecture are expected to be produced 7 to 30 times faster than previous models [10].

Group 2: Market Expectations
- Nvidia's stock is up over 30% this year, and its market capitalization has reached a historic $4 trillion [13].
- Analysts have raised Nvidia's price target, with one notable increase from $200 to $240, reflecting high expectations for revenue and earnings per share amid surging AI compute demand [14][15].
- Consensus estimates put Nvidia's Q2 revenue at $45.8 billion, with earnings per share projected at $1 [15].

Group 3: Additional Chip Developments
- Alongside the B30A, Nvidia is developing a more affordable AI chip, the RTX6000D, also based on the Blackwell architecture but designed for AI inference tasks [18][19].
- The RTX6000D will use conventional GDDR memory with 1,398 GB/s of memory bandwidth; small-batch deliveries to customers are slated for September [20].
Humanoid Occupancy: The First Multimodal Humanoid Robot Perception System, Tackling Kinematic Interference and Occlusion
具身智能之心· 2025-08-21 00:03
Core Viewpoint - The article discusses the rapid development of humanoid robot technology, emphasizing the introduction of a generalized multimodal occupancy perception system called Humanoid Occupancy, which enhances environmental understanding for humanoid robots [2][3][6].

Group 1: Humanoid Robot Technology
- Humanoid robots are considered the most complex form of robot, embodying aspirations for advanced robotics and artificial intelligence [6].
- The technology is at a critical breakthrough stage, with ongoing iteration in motion control and autonomous perception [6].

Group 2: Humanoid Occupancy System
- The system integrates hardware and software components, data collection devices, and a specialized labeling process to provide comprehensive environmental understanding [3].
- It uses advanced multimodal fusion to generate grid-based occupancy outputs that encode spatial occupancy states and semantic labels (a minimal data-structure sketch follows below) [3].
- It addresses challenges unique to the humanoid form factor, such as kinematic interference and occlusion, and establishes effective sensor layout strategies [3].

Group 3: Research and Development
- A panoramic occupancy dataset designed specifically for humanoid robots provides valuable benchmarks and resources for future research [3].
- The network architecture combines multimodal features with temporal information to ensure robust perception [3].

Group 4: Live Broadcast and Expert Insights
- A live broadcast is scheduled to discuss humanoid robot motion control, multimodal perception systems, autonomous movement, and operational data [6][8].
- The session will feature insights from Zhang Qiang, academic committee director at the Beijing Humanoid Robot Innovation Center [8].
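As a rough illustration of the grid-based output described above, here is a minimal sketch of a voxel grid storing an occupancy probability and a semantic label per cell. The resolution, fusion rule, and all names are illustrative assumptions, not the system's actual design.

```python
# A semantic occupancy grid around the robot: each voxel keeps P(occupied)
# plus a semantic class id, updated from fused, labeled 3D points.
import numpy as np

class SemanticOccupancyGrid:
    def __init__(self, extent_m=10.0, resolution_m=0.2, num_classes=16):
        n = int(extent_m / resolution_m)
        self.resolution = resolution_m
        self.origin = np.array([-extent_m / 2, -extent_m / 2, 0.0])
        self.occupancy = np.zeros((n, n, n // 2), dtype=np.float32)  # P(occupied)
        self.semantics = np.zeros((n, n, n // 2), dtype=np.int16)    # class id

    def update(self, points_xyz, labels, confidence=0.9):
        """Fuse one frame of labeled points (e.g. from lidar + camera fusion).
        points_xyz: (N, 3) points in the grid frame; labels: (N,) class ids."""
        idx = ((points_xyz - self.origin) / self.resolution).astype(int)
        valid = np.all((idx >= 0) & (idx < self.occupancy.shape), axis=1)
        i, j, k = idx[valid].T
        # Simple max-evidence update; a real system would use log-odds fusion.
        self.occupancy[i, j, k] = np.maximum(self.occupancy[i, j, k], confidence)
        self.semantics[i, j, k] = labels[valid]
```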
X-SAM: A Unified Multimodal Large Model for Image Segmentation, SoTA on 20+ Datasets
具身智能之心· 2025-08-21 00:03
Core Insights - The article presents X-SAM, a unified multimodal large language model for image segmentation that extends existing models with pixel-level understanding and interaction through visual prompts [3][24].

Background and Motivation
- The Segment Anything Model (SAM) is limited by its reliance on single-mode visual prompts, which restricts its applicability across diverse segmentation tasks [3].
- Multimodal large language models (MLLMs) excel at image description and visual question answering but cannot directly handle pixel-level visual tasks, hindering the development of generalist models [3].

Method Design
- X-SAM introduces a universal input format and unified output representation, supporting visual prompts in the form of points, scribbles, bounding boxes, and masks (a prompt-encoding sketch follows below) [6][7].
- The architecture uses dual encoders with projectors to enhance image understanding, a segmentation connector for fine-grained multi-scale features, and a unified segmentation decoder that replaces the original SAM decoder [10][11][12].

Training Strategy
- X-SAM is trained with a three-stage progressive strategy, segmentor fine-tuning, alignment pre-training, and mixed fine-tuning, to optimize performance across diverse segmentation tasks [13][19].
- Training incorporates a dataset-balancing resampling strategy to improve performance on underrepresented datasets [15].

Experimental Results
- Evaluated on more than 20 segmentation datasets, X-SAM achieves state-of-the-art performance across seven different image segmentation tasks [16].
- It outperforms existing models in general, referring, and interactive segmentation [17][18].

Summary and Outlook
- X-SAM marks a significant advance in image segmentation, moving from "segment anything" to "any segmentation" through its task design and architecture [24].
- Future directions include extending to video segmentation and integrating temporal information for richer video understanding [25].
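To illustrate the universal input format, here is a minimal sketch of how point, box, and mask prompts could be embedded into one token space for a shared decoder to consume. The modules and dimensions are illustrative assumptions, not X-SAM's actual components.

```python
# Embed heterogeneous visual prompts (point / box / mask) into a shared token
# space, tagged with a prompt-type embedding, so one segmentation decoder can
# attend over any prompt kind uniformly.
import torch
import torch.nn as nn

class VisualPromptEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.point_proj = nn.Linear(2, dim)      # (x, y), normalized to [0, 1]
        self.box_proj = nn.Linear(4, dim)        # (x1, y1, x2, y2)
        self.mask_conv = nn.Conv2d(1, dim, kernel_size=16, stride=16)
        self.type_embed = nn.Embedding(3, dim)   # 0=point, 1=box, 2=mask

    def forward(self, prompt, kind):
        if kind == "point":                      # prompt: (B, 2)
            tok = self.point_proj(prompt) + self.type_embed.weight[0]
        elif kind == "box":                      # prompt: (B, 4)
            tok = self.box_proj(prompt) + self.type_embed.weight[1]
        else:                                    # dense mask: (B, 1, H, W)
            tok = self.mask_conv(prompt).flatten(2).transpose(1, 2)
            tok = tok + self.type_embed.weight[2]
        return tok  # tokens a unified segmentation decoder can attend to
```

Scribbles, which the article also lists, could be rasterized and routed through the mask branch in this scheme.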
New from HKU & Tsinghua: Strong Generalization for Dynamic Object Manipulation from Only a Few Demonstrations!
具身智能之心· 2025-08-21 00:03
Group 1
- The article addresses the challenges of dynamic object manipulation in industrial manufacturing and proposes GEM (Generalizable Entropy-based Manipulation), a system that achieves strong generalization from minimal demonstration data [3][6].
- GEM combines target-centered geometric perception with hybrid action control, cutting data requirements while maintaining high success rates in dynamic environments [6][15].
- The system has been validated in real-world deployments, achieving a success rate above 97% over more than 10,000 operations without on-site demonstrations [6][44].

Group 2
- Dynamic object manipulation demands higher precision and real-time responsiveness than static manipulation, making it a substantially harder task [8].
- Existing methods need extensive demonstration data and scale poorly because collecting data in dynamic environments is costly [11][13].
- The proposed entropy-based framework quantifies the optimization process in imitation learning, aiming to minimize the data needed for effective generalization [13][15].

Group 3
- GEM is designed to lower observation entropy and action conditional entropy, the two quantities that drive data requirements (a perception sketch follows below) [15][16].
- The hardware platform uses adjustable-speed conveyor belts and RGB-D cameras to track and manipulate objects [20][21].
- Key components include a memory encoder that improves performance by integrating historical observations and a hybrid action control mechanism that simplifies the dynamic control problem [29][39].

Group 4
- Experiments show GEM outperforms seven mainstream methods in both simulated and real-world settings, with an average success rate of 85% [30][31].
- The system is robust across conveyor speeds and object geometries, maintaining high success rates even on unseen objects [38][39].
- In practice, GEM has been deployed in a cafeteria, handling challenges such as food residue and fast-moving items with a 97.2% success rate [42][44].
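As one way to picture the entropy-reduction idea, the sketch below re-expresses a point-cloud observation in a frame centered on the tracked target, discarding task-irrelevant variation such as conveyor position and background. The radius and function names are illustrative assumptions, not GEM's actual implementation.

```python
# Target-centered geometric perception, sketched: transforming the scene into
# the tracked object's frame removes nuisance variation, which is one way to
# lower the entropy of what the policy observes.
import numpy as np

def target_centered_observation(points_xyz, target_pose):
    """points_xyz: (N, 3) scene points in the world frame;
    target_pose: 4x4 homogeneous pose of the tracked object in the world."""
    world_to_target = np.linalg.inv(target_pose)
    homog = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])
    local = (world_to_target @ homog.T).T[:, :3]
    # Keep only points near the target; everything else is task-irrelevant.
    keep = np.linalg.norm(local, axis=1) < 0.15   # 15 cm radius, an assumption
    return local[keep]
```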
Joint PhD Recruitment at the Eastern Institute of Technology (Ningbo): Robot Manipulation, Embodied Intelligence, Robot Learning, and More
具身智能之心· 2025-08-20 04:00
Core Viewpoint - The article describes a joint doctoral program between the Eastern Institute of Technology (Ningbo) and prestigious partner institutions Shanghai Jiao Tong University and the University of Science and Technology of China, recruiting PhD students in robotics under a dual-mentorship model focused on cutting-edge robotics and AI research [1][2].

Group 1: Program Structure
- Students register at either Shanghai Jiao Tong University or the University of Science and Technology of China for the first year, then conduct research at the Eastern Institute of Technology under dual supervision [1].
- Graduates receive a doctoral degree and diploma from either Shanghai Jiao Tong University or the University of Science and Technology of China [1].

Group 2: Research Focus and Support
- Research areas span robotics, control, and AI, with topics including contact-rich manipulation, embodied intelligence, agile robot control, and robot learning [2].
- The lab provides ample research funding and administrative support, and encourages a balanced lifestyle that promotes physical health and long-term career development [2].

Group 3: Community and Networking
- The article also highlights the "Embodied Intelligence Knowledge Planet" community, a platform for technical exchange, job opportunities, and academic discussion in embodied intelligence [3][5].
- The community aims to grow to nearly 10,000 members within two years and offers resources such as technical roadmaps, project solutions, and job postings from leading robotics companies [5][19].

Group 4: Educational Resources
- The community has compiled more than 30 technical roadmaps and resources for newcomers and experienced researchers, covering many aspects of embodied intelligence and robotics [18][22].
- It also includes summaries of open-source projects, datasets, and relevant research papers, making information easier for members to access [32][25].
What Meta Didn't Do, Nvidia Did: A New Architecture with 6x Throughput, Trained on 20 Trillion Tokens
具身智能之心· 2025-08-20 00:03
Core Viewpoint - NVIDIA has released a new 9B model, the NVIDIA Nemotron Nano 2, utilizing a revolutionary Mamba-Transformer hybrid architecture that achieves up to 6 times higher inference throughput than its competitor Qwen3-8B while maintaining comparable or superior performance on complex reasoning tasks [1][6][41].

Group 1: Model Architecture and Performance
- The Nemotron Nano 2 model is based on the innovative Mamba-Transformer hybrid architecture, which enhances both inference speed and accuracy (a minimal layer-stack sketch follows below) [5][6].
- On complex reasoning benchmarks, the model matches or exceeds Qwen3-8B's accuracy while achieving up to 6x throughput [6][41].
- The Mamba architecture is designed for efficient long-sequence modeling, reportedly 3-5 times faster than traditional Transformer models, with linear complexity supporting extremely long contexts [28][29].

Group 2: Training and Development Process
- Nemotron-Nano-9B-v2 was trained on a massive 20-trillion-token dataset, using advanced FP8 training techniques to build a 12B-parameter base model [32][34].
- The 12B model then underwent extreme compression and distillation down to 9B parameters while preserving 128k-context support on a single A10G GPU [39][40].
- Training data spanned high-quality web pages, multilingual content, mathematics, and code, focusing on a high-fidelity dataset for mathematical and coding tasks [34][38].

Group 3: Benchmarking and Open Source
- Nemotron-Nano-9B-v2 demonstrates superior or equivalent performance across benchmarks in mathematics, code generation, and general reasoning [41][43].
- NVIDIA has announced the open-sourcing of several models and datasets on the HuggingFace platform, including Nemotron-Pre-Training-Dataset-v1, which contains 6.6 trillion tokens of high-quality data [44].
- The open-source initiative aims to support robust multilingual reasoning and general-knowledge pre-training, with a focus on high-quality mathematical content [44].
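As a rough picture of the hybrid design, the sketch below interleaves a few full-attention layers among linear-time state-space (Mamba-style) blocks. The layer ratio, the placeholder mixer, and all names are illustrative assumptions, not Nemotron Nano 2's actual configuration.

```python
# A hybrid stack: most layers are SSM-style mixers (linear in sequence
# length), with occasional full-attention layers interleaved. MambaMixer
# below is a simplified stand-in for a real selective SSM block.
import torch.nn as nn
import torch.nn.functional as F

class MambaMixer(nn.Module):
    """Placeholder for a selective state-space block (e.g. mamba_ssm.Mamba)."""
    def __init__(self, dim):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=4, padding=3, groups=dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, T, dim)
        h, gate = self.in_proj(x).chunk(2, dim=-1)
        h = self.conv(h.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        return self.out_proj(F.silu(gate) * h)  # gated, causal-conv mixing

class HybridBlock(nn.Module):
    def __init__(self, dim, use_attention):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.use_attention = use_attention
        if use_attention:
            self.mixer = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        else:
            self.mixer = MambaMixer(dim)

    def forward(self, x):
        h = self.norm(x)
        if self.use_attention:
            h, _ = self.mixer(h, h, h, need_weights=False)
        else:
            h = self.mixer(h)
        return x + h                             # pre-norm residual

def build_hybrid_stack(dim=1024, depth=24, attn_every=6):
    # e.g. layers 5, 11, 17, 23 use attention; the rest are SSM blocks.
    return nn.Sequential(*[
        HybridBlock(dim, use_attention=(i % attn_every == attn_every - 1))
        for i in range(depth)
    ])
```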
Investigating Why VLA Models Generalize Poorly...
具身智能之心· 2025-08-20 00:03
Core Insights
- The article examines why generalist robot policies generalize poorly, identifying shortcut learning, i.e. reliance on task-irrelevant features, as the key culprit [2][5].
- Two root causes of shortcut learning are identified: limited diversity within individual sub-datasets, and large distribution gaps between sub-datasets that fragment the data [2].

Dataset Analysis
- The study focuses on the Open X-Embodiment (OXE) dataset, which aggregates sub-datasets collected independently across different environments and robot embodiments [2][5].
- The inherent structure of large-scale aggregated datasets like OXE, low intra-dataset diversity combined with inter-dataset fragmentation, is what undermines generalization [2].

Recommendations
- The findings suggest concrete improvements to robot data collection strategies, aimed at reducing shortcut learning and strengthening the generalization of generalist policies [2].
- When collecting new large-scale data is impractical, the article confirms that carefully chosen data augmentation can mitigate shortcut learning in existing offline datasets (a sketch follows below) [2].
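As one example of what such augmentation can look like, the sketch below randomizes task-irrelevant visual factors per frame so they stop being predictive of actions. The specific transforms and parameters are illustrative assumptions, not the paper's exact recipe; the torchvision transforms used here exist with these signatures.

```python
# Randomize nuisance visual factors (color, framing) in observation frames
# so the policy cannot latch onto them as shortcuts; action labels are
# left untouched.
import torchvision.transforms as T

shortcut_mitigation_aug = T.Compose([
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    T.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # simulates camera shift
    T.RandomGrayscale(p=0.1),
])
# Applied per-frame to observations before policy training: spurious cues
# (a particular background hue, a fixed crop) stop correlating with actions.
```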
ICCV 2025 | RobustSplat: Transient-Robust 3DGS Reconstruction by Decoupling Densification from Dynamics
具身智能之心· 2025-08-20 00:03
Core Viewpoint - The article presents RobustSplat, a method that addresses the weakness of 3D Gaussian Splatting (3DGS) in scenes with transient dynamic objects while preserving high-quality static reconstruction [1][4][19].

Research Motivation
- The work starts from the dual role of Gaussian densification in 3DGS: it enriches scene detail but can overfit dynamic regions, producing artifacts and scene distortion [4][6].

Methodology
- Transient Mask Estimation: a Mask MLP outputs per-pixel transient masks, distinguishing transient from static regions [9].
- Feature Selection: DINOv2 features are chosen for their balance of semantic consistency, noise robustness, and computational efficiency, outperforming alternative feature sets [10].
- Supervision Design: mask optimization combines an image residual loss with a feature cosine-similarity loss, improving recognition of dynamic areas (a minimal sketch follows below) [10].
- Delayed Gaussian Growth: this core strategy postpones densification so the static scene structure is optimized first, reducing the risk of misclassifying static areas as transient [12].
- Mask Regularization: limits misclassification of static regions during early optimization [12].
- Scale-Cascaded Mask Guidance: transient masks are first estimated from low-resolution features, then refined with high-resolution supervision for better accuracy [14].

Experimental Results
- On the NeRF On-the-go and RobustNeRF datasets, RobustSplat outperforms baseline methods including 3DGS, SpotLessSplats, and WildGaussians in PSNR, SSIM, and LPIPS [16][20].

Conclusion
- RobustSplat effectively suppresses rendering artifacts caused by transient objects while preserving scene detail, demonstrating robustness in complex scenarios [18][19].
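To make the supervision design concrete, here is a minimal sketch that combines a rendering-residual signal with DINOv2 feature cosine similarity to supervise a per-pixel transient mask. The weighting, thresholds, and function names are illustrative assumptions, not RobustSplat's actual losses.

```python
# Supervise a transient mask from two cues: where the rendered image
# disagrees with the observation (photometric residual) and where rendered
# and observed semantic features diverge (low cosine similarity).
import torch
import torch.nn.functional as F

def mask_supervision_loss(mask, rendered, observed,
                          feat_rendered, feat_observed, lambda_feat=0.5):
    """mask: (H, W) in [0, 1], 1 = transient; rendered/observed: (3, H, W)
    images; feat_*: (C, h, w) DINOv2-style feature maps."""
    residual = (rendered - observed).abs().mean(dim=0)               # (H, W)
    cos = F.cosine_similarity(feat_rendered, feat_observed, dim=0)   # (h, w)
    cos = F.interpolate(cos[None, None], size=mask.shape, mode="bilinear")[0, 0]
    # High residual plus low feature similarity should push the mask toward 1.
    target = ((residual > residual.mean()) & (cos < 0.5)).float()
    photo_term = F.binary_cross_entropy(mask, target)
    feat_term = (mask * cos).mean()   # discourage masking regions that match
    return photo_term + lambda_feat * feat_term
```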