具身智能之心
Ningbo Dongfang University of Technology Joint Ph.D. Program Recruitment! Robot Manipulation / Embodied Intelligence / Robot Learning and Related Directions
具身智能之心· 2025-08-20 04:00
Core Viewpoint
- The article discusses the collaboration between Ningbo Dongfang University of Technology and prestigious institutions such as Shanghai Jiao Tong University and the University of Science and Technology of China to recruit doctoral students in robotics, emphasizing a dual mentorship model and a focus on cutting-edge research in robotics and AI [1][2]

Group 1: Program Structure
- Students register at either Shanghai Jiao Tong University or the University of Science and Technology of China for the first year, then carry out research at Dongfang University under dual supervision [1]
- Graduates receive a doctoral degree and diploma from either Shanghai Jiao Tong University or the University of Science and Technology of China [1]

Group 2: Research Focus and Support
- Research areas span robotics, control, and AI, with specific topics including contact-rich manipulation, embodied intelligence, agile robot control, and robot learning [2]
- The lab provides ample research funding and administrative support, and encourages a balanced lifestyle for students, promoting physical health and long-term career development [2]

Group 3: Community and Networking
- The article highlights the "Embodied Intelligence Knowledge Planet" community, which serves as a platform for technical exchange, job opportunities, and academic discussion in embodied intelligence [3][5]
- The community aims to grow to nearly 10,000 members within two years and offers resources such as technical routes, project solutions, and job postings from leading companies in the robotics sector [5][19]

Group 4: Educational Resources
- The community has compiled over 30 technical routes and resources for newcomers and experienced researchers, covering various aspects of embodied intelligence and robotics [18][22]
- It includes summaries of open-source projects, datasets, and research papers relevant to the field, giving members easier access to information [25][32]
What Meta Didn't Do, NVIDIA Did! New Architecture with 6x Throughput, Trained on 20 Trillion Tokens
具身智能之心· 2025-08-20 00:03
Core Viewpoint
- NVIDIA has released a new 9B model, the NVIDIA Nemotron Nano 2, utilizing a Mamba-Transformer hybrid architecture that achieves up to 6 times higher inference throughput than its competitor Qwen3-8B, while maintaining comparable or superior performance on complex reasoning tasks [1][6][41]

Group 1: Model Architecture and Performance
- The Nemotron Nano 2 model is based on the innovative Mamba-Transformer hybrid architecture, which enhances both inference speed and accuracy [5][6]
- In complex reasoning benchmark tests, the model matches or exceeds the accuracy of Qwen3-8B while achieving up to 6 times higher throughput [6][41]
- The Mamba architecture is designed for efficient modeling of long sequences, reportedly 3-5 times faster than traditional Transformer models, with linear complexity supporting extremely long contexts [28][29]

Group 2: Training and Development Process
- Nemotron-Nano-9B-v2 was trained on a massive dataset of 20 trillion tokens, utilizing FP8 training techniques to create a 12B-parameter base model [32][34]
- The base model then underwent compression and distillation, reducing the 12B parameters to 9B while supporting 128k-token contexts on a single A10G GPU [39][40]
- The training data included high-quality web pages, multilingual content, mathematics, and code, with a focus on building a high-fidelity dataset for mathematical and coding tasks [34][38]

Group 3: Benchmarking and Open Source
- Nemotron-Nano-9B-v2 has demonstrated superior or equivalent performance across benchmarks covering mathematics, code generation, and general reasoning [41][43]
- NVIDIA has open-sourced several models and datasets on the HuggingFace platform, including the Nemotron-Pre-Training-Dataset-v1, which contains 6.6 trillion tokens of high-quality data [44]
- The open-source initiative aims to support robust multilingual reasoning and general-knowledge pre-training, with a focus on high-quality mathematical content [44]
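The throughput gap comes from the Mamba layers' linear-time recurrence: a state-space layer consumes the sequence in a single pass with constant-size state, instead of attention's quadratic pairwise scores. A minimal pure-Python sketch of such a recurrence; the fixed scalar parameters `a`, `b`, `c` are illustrative only, whereas the real selective SSM uses input-dependent parameters and hardware-aware parallel scans:

```python
def ssm_scan(inputs, a=0.9, b=0.5, c=1.0):
    """Run h_t = a*h_{t-1} + b*x_t ; y_t = c*h_t over a sequence.

    One pass over the sequence gives O(T) time with O(1) state,
    versus the O(T^2) pairwise score matrix of full attention.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x   # decay old state, mix in the new input
        outputs.append(c * h)
    return outputs
```

Because the state is a fixed-size summary, context length only affects runtime linearly, which is why such layers can serve very long contexts cheaply.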
Investigating Why VLA Models Generalize Poorly
具身智能之心· 2025-08-20 00:03
Core Insights
- The article discusses the limited generalization of generalist robot policies, focusing in particular on the issue of shortcut learning [2][5]
- It identifies shortcut learning, i.e. reliance on task-irrelevant features, as a key factor hindering generalization [2]
- The research highlights two main causes of shortcut learning: limited diversity within individual sub-datasets, and significant distribution differences between sub-datasets, leading to data fragmentation [2]

Dataset Analysis
- The study specifically examines the Open X-Embodiment (OXE) dataset, which is composed of multiple sub-datasets collected independently in different environments and on different robot embodiments [2][5]
- The inherent structure of large-scale datasets like OXE contributes to the generalization challenges through the aforementioned issues of diversity and fragmentation [2]

Recommendations
- The findings provide important insights for improving robot data collection strategies, aiming to reduce shortcut learning and enhance the generalization capabilities of generalist robot policies [2]
- Where acquiring new large-scale data is impractical, the article confirms that carefully selected data augmentation strategies can effectively mitigate shortcut learning in existing offline datasets [2]
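One way to see how fragmentation enables shortcut learning is to check how well a task-irrelevant attribute (for example, which sub-dataset or background a sample came from) predicts the label: if it is almost fully predictive, a policy can "cheat" on it instead of learning the task. A toy diagnostic sketch; the names and data are illustrative, not taken from the paper:

```python
from collections import Counter, defaultdict

def shortcut_score(samples):
    """Fraction of samples explained by predicting the majority label
    per value of the task-irrelevant attribute. 1.0 = perfect shortcut."""
    by_attr = defaultdict(list)
    for attr, label in samples:
        by_attr[attr].append(label)
    correct = sum(max(Counter(labels).values()) for labels in by_attr.values())
    return correct / len(samples)

# Fragmented case: each sub-dataset pairs one background with one task,
# so background alone determines the label.
fragmented = [("lab_a", "pick"), ("lab_a", "pick"),
              ("lab_b", "push"), ("lab_b", "push")]
# Diverse case: every background co-occurs with every task.
diverse = [("lab_a", "pick"), ("lab_a", "push"),
           ("lab_b", "pick"), ("lab_b", "push")]

print(shortcut_score(fragmented))  # 1.0 -> background fully predicts the task
print(shortcut_score(diverse))     # 0.5 -> chance level for two tasks
```

The diverse case shows why the recommended augmentation helps: breaking the spurious attribute-label correlation removes the shortcut without new data collection.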
ICCV 2025 | RobustSplat: Decoupling Densification and Dynamics for Transient-Robust 3DGS Reconstruction
具身智能之心· 2025-08-20 00:03
Core Viewpoint
- The article discusses the RobustSplat method, which addresses the challenges 3D Gaussian Splatting (3DGS) faces with dynamic objects while maintaining high-quality static scene reconstruction [1][4][19]

Research Motivation
- The motivation stems from the dual role of Gaussian densification in 3DGS: it enhances scene detail but can lead to overfitting in dynamic areas, resulting in artifacts and scene distortion [4][6]

Methodology
- **Transient Mask Estimation**: Utilizes a Mask MLP architecture to output pixel-wise transient masks, distinguishing transient from static regions [9]
- **Feature Selection**: DINOv2 features are chosen for their balance of semantic consistency, noise resistance, and computational efficiency, outperforming other feature sets [10]
- **Supervision Design**: Combines an image residual loss with a feature cosine similarity loss for mask optimization, enhancing dynamic-area recognition [10]
- **Delayed Gaussian Growth Strategy**: This core strategy postpones the densification process to prioritize optimizing the static scene structure, reducing the risk of misclassifying static areas as transient [12]
- **Mask Regularization**: Aims to minimize misclassification of static regions during early optimization stages [12]
- **Scale Cascade Mask Guidance**: Initially estimates transient masks from low-resolution features, then transitions to high-resolution supervision for improved accuracy [14]

Experimental Results
- Experiments on the NeRF On-the-go and RobustNeRF datasets show that RobustSplat outperforms baseline methods such as 3DGS, SpotLessSplats, and WildGaussians on PSNR, SSIM, and LPIPS metrics [16][20]

Conclusion
- RobustSplat effectively reduces rendering artifacts caused by transient objects while preserving scene detail, demonstrating robustness in complex scenarios [18][19]
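The two mask supervision signals can be sketched per pixel: a pixel accumulates transient evidence when its rendered-vs-observed photometric residual is high and its features disagree with the static reconstruction. A simplified combination with hypothetical weights `w_res` and `w_feat`; this is an illustrative sketch, not the paper's exact loss:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def transient_evidence(residual, feat_render, feat_obs, w_res=0.5, w_feat=0.5):
    """Combine photometric residual (high -> transient) with feature
    dissimilarity (low cosine similarity -> transient) into one score."""
    feat_dissim = 1.0 - cosine_similarity(feat_render, feat_obs)
    return w_res * residual + w_feat * feat_dissim
```

Using robust features (such as DINOv2, as the paper selects) for the second term keeps the mask from firing on mere lighting or exposure changes that inflate the raw residual.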
ExploreVLM: A Closed-Loop Robot Exploration Task Planning Framework Based on Vision-Language Models
具身智能之心· 2025-08-20 00:03
Research Background and Core Issues
- The development of embodied intelligence has brought robots into daily life as human assistants, requiring them to interpret high-level instructions, perceive dynamic environments, and adjust plans in real time [3]
- Vision-Language Models (VLMs) have emerged as a significant direction for robot task planning, but existing methods exhibit limitations in three areas: insufficient interactive exploration capabilities in scenarios requiring active information retrieval, limited perception accuracy in capturing object spatial relationships and dynamic changes, and poor planning adaptability from reliance on open-loop static planning, which can fail in complex environments [6]

Proposed Framework
- The ExploreVLM framework is introduced, integrating perception, planning, and execution verification through a closed-loop design to address the identified limitations [5]

Core Framework Design
- ExploreVLM operates on a "perception-planning-execution-verification" closed-loop model [6]

Key Module Analysis
1. **Goal-Centric Spatial Relation Graph (Scene Perception)** - Constructs a structured graph representation to support complex reasoning, extracting object categories, attributes, and spatial relationships from initial RGB images and task goals [8]
2. **Dual-Stage Self-Reflective Planner** - Separates "unknown information exploration" from "goal achievement": the exploration phase generates sub-goals for information retrieval, the completion phase generates action sequences based on exploration results, and a self-reflection mechanism corrects plans and addresses logical errors [8][10]
3. **Execution Validator** - Compares pre- and post-execution states to generate feedback, implementing step-by-step validation so that real-time feedback closes the loop and plans are dynamically adjusted until task completion [8][14]

Experimental Validation
1. **Experimental Setup** - Conducted on a real robot platform with five tasks of increasing complexity, compared against baseline methods ReplanVLM and VILA, with a 50% action failure rate introduced to test robustness [15]
2. **Core Results** - ExploreVLM achieved an average success rate of 94%, significantly outperforming ReplanVLM (22%) and VILA (30%) [16]; the framework demonstrated effective action validation and logical consistency checks, ensuring task goals were met [17]
3. **Ablation Studies** - Performance dropped significantly when core modules were removed, highlighting the importance of the collaborative function of the three modules [19]

Comparison with Related Work
- ExploreVLM addresses the limitations of existing methods through structured perception, dual-stage planning, and stepwise loop closure, enhancing task execution and adaptability [20]
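The perceive-plan-execute-verify loop described above can be sketched as a generic control skeleton. Every callable here is a hypothetical stand-in for one of the framework's modules (scene-graph perception, two-stage planner, execution validator), not ExploreVLM's actual API:

```python
def closed_loop(perceive, plan, execute, verify, goal, max_steps=20):
    """Replan from fresh perception after every executed action until the
    validator confirms the goal (or the step budget runs out)."""
    for _ in range(max_steps):
        scene = perceive()            # e.g. goal-centric spatial relation graph
        actions = plan(scene, goal)   # exploration sub-goals, then completion
        if not actions:
            return True               # planner reports nothing left to do
        feedback = execute(actions[0])
        if not verify(feedback, goal):
            continue                  # validator rejects the step -> replan
        if feedback.get("done"):
            return True
    return False

# Toy stand-ins: the task "finishes" after three executed actions.
state = {"steps": 0}

def toy_execute(action):
    state["steps"] += 1
    return {"done": state["steps"] >= 3}

done = closed_loop(lambda: {}, lambda s, g: ["act"], toy_execute,
                   lambda fb, g: True, goal="tidy the table")
```

The key structural point is that perception and planning sit inside the loop, which is what lets the system absorb the injected 50% action failures that break open-loop baselines.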
Reinforcement Learning vs. VLA / Flow Matching / Robot Control Algorithms: Method Paradigms and Application Scenarios
具身智能之心· 2025-08-19 01:54
Core Viewpoint
- The article discusses recent advancements in reinforcement learning (RL) and its applications in robotics, focusing on VLA (Vision-Language-Action) models and diffusion policies and highlighting their potential to handle complex tasks that traditional RL struggles with [2][4][35]

Method Paradigms
- Traditional RL and imitation learning combined with Sim2Real techniques are foundational approaches in robotics [3]
- VLA models differ fundamentally from traditional RL by using the training data distribution to describe task processes and goals, allowing the execution of more complex tasks [4][35]
- Diffusion Policy is a novel approach that utilizes diffusion models to generate continuous action sequences, demonstrating superior capabilities in complex task execution compared to traditional RL methods [4][5]

Application Scenarios
- The article categorizes applications into two main types: basic motion control for humanoid and quadruped robots, and complex or long-horizon manipulation tasks [22][23]
- Basic motion control relies primarily on RL and Sim2Real, and current implementations have yet to achieve motion as fluid as that of humans or animals [22]
- For complex tasks, architectures typically combine a pre-trained Vision Transformer (ViT) encoder with a large language model (LLM), using diffusion or flow matching for action output [23][25]

Challenges and Future Directions
- Key challenges in the field include the need for better simulation environments, effective domain randomization, and the integration of external goal conditions [35]
- The article emphasizes the importance of human intention in task definition and the limitations of current models in learning complex tasks without extensive human demonstration data [35][40]
- Future advancements may involve multi-modal prediction of task goals and the potential integration of brain-machine interfaces to enhance human-robot interaction [35]
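Flow matching's role in these architectures is action generation: at inference time, a noise sample is integrated along a learned velocity field until it becomes an action. A minimal Euler-integration sketch using a hand-written straight-line (rectified-flow-style) field toward a fixed toy target; the field and target are illustrative, not any specific policy's learned model:

```python
def integrate_flow(x0, velocity, steps=10):
    """Euler-integrate dx/dt = velocity(x, t) from t=0 to t=1."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy "learned" field: the straight-line velocity that carries any
# starting point onto `target` at t=1 (what training would regress to).
target = [0.2, -0.4]

def toward_target(x, t):
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

action = integrate_flow([1.0, 1.0], toward_target)
```

In a real VLA head the velocity function would be a network conditioned on the ViT/LLM features, and the same few-step integration replaces the longer denoising chain of a diffusion policy.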
An Embodied Intelligence Community Combining Videos, Articles, Learning Paths, Q&A, and Job-Hunting Exchange
具身智能之心· 2025-08-19 01:54
Core Viewpoint
- The article emphasizes the establishment and growth of the "Embodied Intelligence Knowledge Planet," a comprehensive community focused on embodied intelligence, aiming to facilitate knowledge sharing, technical discussions, and job opportunities in the field [1][3][17]

Group 1: Community Development
- The community has organized multiple roundtable discussions covering various mainstream solutions and technologies related to data collection and embodied intelligence [1]
- The community currently has nearly 2,000 members and aims to grow to around 10,000 members within the next two years, providing a platform for exchange and technical sharing [1][3]
- The community offers a variety of resources, including video content, articles, learning paths, and Q&A sessions to assist members in applying knowledge to their projects [1][3][21]

Group 2: Technical Resources
- The community has compiled over 30 technical routes, including benchmarks and entry-level learning paths, to help members quickly find relevant information [4]
- It has established a job referral mechanism with several leading companies in the field, facilitating connections between job seekers and employers [11][21]
- The community provides a wealth of resources, including open-source projects, datasets, and simulation platforms, to support members in their research and development efforts [17][30][32]

Group 3: Knowledge Sharing and Networking
- The community regularly invites industry experts to share insights through live forums and discussions, covering various topics from data to algorithms [4][73]
- Members can freely ask questions and receive answers related to career choices and research directions, fostering a collaborative environment [75]
- The community aims to create a nurturing space for future leaders in the field of embodied intelligence, encouraging active participation and contribution [16][83]
Leave Soccer to the Robots! The First Robot Games Closes: Ticket Prices Turned Out Too Conservative
具身智能之心· 2025-08-19 01:54
Core Viewpoint
- The article highlights the achievements of the Tsinghua Fire God team at the World Humanoid Robot Games, showcasing advances in robotics and the competitive nature of robot sports, particularly a 5v5 soccer match in which the robots operated fully autonomously [1][19][30]

Group 1: Event Highlights
- The Tsinghua Fire God team beat a humanoid robot team from Germany 1-0, a win attributed to a unique "shooting" algorithm that only they had mastered among the 50 participating teams [2][23]
- The event featured various competitions, including a 100-meter obstacle race whose excitement was comparable to human athletic events [5][6]
- The World Humanoid Robot Games comprised 26 events and 487 matches, demonstrating the growing complexity and capability of robotic technology [30][31]

Group 2: Technical Aspects
- The 5v5 soccer match was notable as the first of its kind, with all robots acting autonomously, which increased the complexity of the competition [19][22]
- Each robot was equipped with four cameras for visual perception and spatial judgment, allowing quick decisions during the game [25]
- The competition emphasized the importance of algorithms and team strategy: the Tsinghua team employed a flexible man-to-man defense compared to the German team's more rigid approach [27][28]
HIT Shenzhen Proposes UAV-ON: An Object Goal Navigation Benchmark for Open-World Aerial Agents
具身智能之心· 2025-08-19 01:54
Core Viewpoint
- The article presents UAV-ON, the first large-scale benchmark for open-world object goal navigation with aerial agents, defining over 11,000 navigation tasks across 14 high-fidelity outdoor scenes and emphasizing the need for drones to navigate complex environments autonomously [2][5]

Group 1: Research Background
- UAV-ON aims to improve drone navigation in diverse real-world environments, addressing the limitations of existing navigation studies that rely heavily on detailed language instructions [2]
- The benchmark includes a set of baseline navigation strategies, such as random policies, CLIP-based semantic heuristics, and the proposed Aerial Object Navigation Agent (AOA) [2]

Group 2: Environment and Task Definition
- UAV-ON defines an instance-level object navigation task in which drones must navigate to target objects based on semantic instructions [5]
- Drones are equipped with multi-view RGB-D cameras and rely solely on onboard perception, without any global positioning signals [6][12]

Group 3: Action Space and Success Conditions
- The action space includes parameterized movements such as translation, rotation, and stopping, with each action linked to continuous control parameters [11][14]
- An episode counts as successful if the drone ends within a specified distance of the target object [7]

Group 4: Dataset Analysis and Environment Diversity
- The UAV-ON dataset comprises 14 high-fidelity outdoor environments featuring a variety of natural and man-made landscapes, with a total of 1,270 unique target objects distributed across approximately 9 million square units [15]
- The training set includes 10 diverse outdoor environments generating 10,000 navigation episodes, while the test set consists of 1,000 episodes used to evaluate generalization [15]

Group 5: Experimental Results and Baseline Methods
- Baselines tested include Random, CLIP-H, AOA-F, and AOA-V; AOA-V achieved the best Oracle success rate but a lower success rate and SPL [16][17]
- All methods showed collision rates exceeding 30%, highlighting a significant gap between current navigation strategies and the safety requirements of real-world drone operation [20]

Group 6: Conclusion and Future Work
- UAV-ON serves as a comprehensive benchmark for the semantic reasoning, obstacle perception, and target localization challenges in drone navigation [24]
- Future research will focus on enhancing multi-modal perception, prompt-based control, and safer, more reliable navigation strategies for autonomous drone operation in complex environments [24]
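The SPL figures above follow the standard success-weighted-by-path-length metric, SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where S_i marks episode success, l_i is the shortest-path length, and p_i is the path actually flown. A minimal sketch; the episode tuple format is an assumption for illustration, not UAV-ON's data format:

```python
def spl(episodes):
    """episodes: list of (success: bool, shortest: float, taken: float).

    Failed episodes contribute 0; successful ones contribute the ratio of
    the shortest path to the path flown, so detours lower the score.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)

# Example: a direct success, a 2x-detour success, a failure, and a success
# whose logged path is shorter than the reference (clamped to 1.0).
score = spl([(True, 10.0, 10.0), (True, 10.0, 20.0),
             (False, 10.0, 10.0), (True, 10.0, 5.0)])
```

This is why AOA-V can lead on Oracle success rate yet trail on SPL: the metric penalizes inefficient trajectories even when the target is eventually found.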
2025 World Humanoid Robot Games: From Arena to Market, the Commercialization Strategy Behind UniX AI's Two Golds and One Silver
具身智能之心· 2025-08-18 11:32
Core Viewpoint
- The World Humanoid Robot Games concluded successfully, showcasing advancements in humanoid robotics and announcing the next event in Beijing in 2026 [1]

Group 1: Event Overview
- The inaugural World Humanoid Robot Games featured 26 events and 487 matches, highlighting the comprehensive capabilities of humanoid robots [1]
- The event was organized by the newly established World Humanoid Robotics Games Federation [1]

Group 2: Medal Distribution
- UniX AI (优理奇) achieved significant success, winning 2 gold medals and 1 silver medal, ranking third overall in the medal tally [2][3]
- Other notable participants included Beijing Humanoid and 松延动力, with medal counts of 10 and 3 respectively [2]

Group 3: Technical Achievements
- UniX AI's Wanda series robots demonstrated advanced capabilities in service scenarios, particularly in hotel reception and cleaning services [9][10]
- The competition's rigorous evaluation criteria included action completion, timeliness, stability, and environmental adaptability, emphasizing the robots' autonomous execution [9]

Group 4: Algorithm and Hardware Integration
- UniX AI's proprietary algorithms, including UniFlex, UniTouch, and UniCortex, provide a robust foundation for the Wanda series, enabling complex task execution in dynamic environments [12][13]
- The hardware features an 8-degree-of-freedom robotic arm, surpassing human capabilities in flexibility and precision [13][15]

Group 5: Market Applications
- The Wanda series robots are positioned to enhance service efficiency in hotels without requiring additional hardware modifications, offering a competitive edge in both high-end and chain hotels [17]
- The aging population increases the demand for humanoid robots in elder care, where Wanda can assist with household tasks and provide companionship [19]

Group 6: Commercialization Strategy
- UniX AI has begun selling its robots on JD.com, marking a significant step towards retail exploration and direct consumer engagement [21]
- The company aims to expand its partnerships with hotels, property management companies, and elder-care communities to facilitate large-scale deployment [21]

Group 7: Future Outlook
- The humanoid robotics sector is on the brink of commercial explosion, with UniX AI validating its platform in various service scenarios [24]
- The integration of technology breakthroughs with market access positions UniX AI as a potential leader in the global humanoid robotics market [24]