Recruiting several experts in embodied world models!
具身智能之心· 2025-10-29 04:00
**Group 1**
- The article discusses the rising interest in embodied world models, highlighting their industrial and research value [1]
- The company is recruiting two lecturers to help develop courses or tutoring content related to world models [2]
- Emphasis is placed on collaborating with candidates who have a strong background in the field, specifically those with a PhD or higher who have published at least one paper at a CCF-A level conference [5]

**Group 2**
- The compensation offered for the positions is above industry standards, and the roles can be part-time [6]
- Interested candidates are encouraged to contact the person in charge via WeChat for further communication [6]
Breaking the bottleneck in robotic spatial perception! Sun Yat-sen University and Tuoyuan Wisdom propose the TAVP framework
具身智能之心· 2025-10-29 00:03
**Core Viewpoint**
- The article introduces the Task-Aware View Planning (TAVP) framework from Sun Yat-sen University and Tuoyuan Wisdom, which addresses the limitations of current vision-language-action (VLA) models in robotic multi-task manipulation by improving action-prediction accuracy and task generalization in complex environments [1][5]

**Research Background**
- The main challenges faced by existing VLA models such as OpenVLA and π0.5 are incomplete 3D perception due to fixed viewpoints and significant task interference caused by shared encoders [3][5][7]

**Core Innovations**
- TAVP introduces two new modules, the Multi-View Exploration Policy (MVEP) and the Task-Aware Mixture of Experts (TaskMoE), which jointly optimize the perception-action link in robotic manipulation [6][9]

**Module Details**
- **Multi-View Exploration Policy (MVEP)**: dynamically captures key perspectives to counter 3D-perception occlusion by selecting optimal virtual camera positions through reinforcement learning [9][11]
- **Task-Aware Mixture of Experts (TaskMoE)**: decouples task features to eliminate multi-task interference using dynamic expert routing and gating mechanisms (a minimal routing sketch follows this summary) [12][11]
- **Three-Stage Training Strategy**: ensures module collaboration and performance stability through viewpoint parameterization, efficient policy training, and dynamic re-rendering of images [11][20]

**Experimental Validation**
- TAVP outperformed existing baselines on 18 RLBench tasks, reaching an average success rate of 86.6% and excelling on occlusion-prone tasks [13][14]
- Ablation studies confirmed the necessity of the core modules: removing TaskMoE lowered the success rate to 85.6%, while random viewpoints caused a drastic drop to 8.9% [15][21]

**Generalization and Efficiency Analysis**
- TAVP showed improved zero-shot capability, achieving a 12.0% success rate on unseen tasks, whereas the variant without TaskMoE failed entirely [22][16]
- Despite the extra computation from dynamic viewpoint re-rendering, TAVP kept an average inference time of 0.436 seconds, only slightly above the baseline [22]

**Real-World Robustness Testing**
- In robustness tests, TAVP adapted better than baseline models, reaching 100% success rates in several scenarios, including unseen instances and backgrounds [18][19][23]

**Research Significance and Future Directions**
- TAVP offers a new paradigm for robotic multi-task manipulation, combining dynamic viewpoint planning with task-aware encoding to overcome existing limitations [25]
- Future work will target robustness to reflective and transparent objects and explore multi-sensor fusion to extend the boundaries of manipulation tasks [25]
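The article does not include the authors' code; purely as an illustration of the TaskMoE idea described above (routing shared visual features through a small set of expert encoders gated by a task embedding), a minimal PyTorch sketch could look like this. The class name, feature dimensions, and expert count are all assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class TaskAwareMoE(nn.Module):
    """Task-conditioned mixture of experts: a task embedding gates which
    expert encoders process the shared visual features (illustrative only)."""

    def __init__(self, feat_dim=256, task_dim=64, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU())
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(task_dim, num_experts)  # routing logits per task
        self.top_k = top_k

    def forward(self, feats, task_emb):
        # feats: (B, feat_dim) shared visual features; task_emb: (B, task_dim)
        logits = self.gate(task_emb)                    # (B, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # keep the top-k experts
        weights = weights.softmax(dim=-1)               # normalize over the k kept
        out = torch.zeros_like(feats)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # samples routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(feats[mask])
        return out
```

Top-k routing activates only the experts most relevant to each task, which is the mechanism that lets task-specific features decouple without training a separate encoder per task.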
Company News | 400,000 downloads! Galaxea's real-robot dataset tops major global open-source platforms
具身智能之心· 2025-10-29 00:03
Edited by Galaxea

The Galaxea Open-World Dataset, officially open-sourced by Galaxea in August 2025, sparked wide discussion across the global embodied-intelligence community as soon as it was released. Within just two months, downloads surpassed 400,000, making it one of the most watched and most downloaded real-robot embodied-intelligence datasets worldwide.

Researchers from leading international teams such as Physical Intelligence, Bitrobot, and Hugging Face have publicly recommended it on social media, calling it "a highly valuable community resource". Robotics researchers, laboratories, and companies around the world are using the Galaxea Open-World Dataset for systematic validation, model training, and further research.

**Driving embodied intelligence toward real-world deployment with real-world data**

Mainstream large-model pretraining has long relied on internet data or simulation data. Internet data, while vast, is uneven in quality; simulation data is constrained by the simplifying assumptions of virtual environments, making it difficult to faithfully reproduce physical interaction and environmental complexity, which affects models in real-wor...
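For readers who want to inspect the dataset, a minimal sketch using the Hugging Face `datasets` library is shown below. The repository id is a placeholder assumption, not confirmed by the article; check Galaxea's Hugging Face organization page for the actual name.

```python
from datasets import load_dataset  # pip install datasets

# NOTE: the repository id below is a placeholder assumption, not confirmed
# by the article; look up the actual dataset name on Galaxea's Hugging Face
# organization page before running.
ds = load_dataset("Galaxea/open-world-dataset", split="train", streaming=True)

for episode in ds.take(3):   # stream a few samples without a full download
    print(episode.keys())    # inspect available fields (images, actions, ...)
```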
Are VLAs collectively failing? Prof. Qiu Xipeng's team at Fudan & Chuangzhi proposes LIBERO-Plus, revealing the truth about VLA fragility
具身智能之心· 2025-10-29 00:03
**Core Insights**
- The article analyzes the robustness of Vision-Language-Action (VLA) models, revealing significant generalization deficiencies despite high performance scores under ideal conditions [2][4][6]
- The LIBERO-Plus framework is introduced to systematically evaluate VLA models across perturbation dimensions, exposing the gap between surface performance and actual generalization capability [4][6][33]

**Group 1: Motivation and Contributions**
- VLA models post impressive success rates on benchmarks like LIBERO, but existing evaluation methods fail to assess stability and reliability under real-world variation [4][6]
- LIBERO-Plus evaluates models along seven perturbation dimensions: object placement, camera angle, robot initial pose, language instructions, lighting conditions, background textures, and sensor noise (a generic evaluation-harness sketch follows this summary) [4][6]
- The framework provides a detailed analysis of VLA generalization performance through systematic perturbation [4][6]

**Group 2: Performance Analysis**
- VLA models are broadly vulnerable to perturbation, with performance declining across all dimensions [13][32]
- Models are most sensitive to camera-viewpoint changes and the robot's initial state, pointing to a need for high-level spatial and proprioceptive understanding [13][32]
- Language perturbations cause the smallest average performance drop (-25.3%), a surprising degree of robustness that warrants further investigation [15][17]

**Group 3: Findings on Model Behavior**
- Some models maintain performance even with empty language inputs, suggesting they ignore the language modality and behave more like vision-action (VA) models [16][19]
- VLA models struggle with cross-object instruction following, relying on fixed visual-action mappings rather than fully exploiting the language signal [19][20]
- The models adapt remarkably well to background changes while showing limited sensitivity to lighting variation, raising questions about the representations they learn [20][27]

**Group 4: Combination Generalization**
- The article introduces a "combination generalization gap": negative interactions between perturbations that exceed the independent effects of each single perturbation [29][32]
- The analysis indicates that current VLA models cannot effectively handle complex multi-dimensional perturbations because their representations are entangled [32]

**Group 5: LIBERO-Plus Benchmark**
- The LIBERO-Plus benchmark comprises 10,030 tasks built with perturbation-augmentation strategies to evaluate performance under varied conditions [33][36]
- The benchmark covers all seven perturbation dimensions with fine-grained difficulty levels [36]
- Models trained on the augmented data reached an average success rate of 79.6% on LIBERO-Plus, significantly outperforming baseline models [38]
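LIBERO-Plus ships its own evaluation tooling, which the article does not detail; as a generic sketch of the per-dimension robustness sweep described above (run the same policy under one perturbation at a time and report the drop against the clean setting), with `policy` and `make_env` assumed to be user-supplied callables following the classic Gym step API:

```python
PERTURBATIONS = ["object_layout", "camera_view", "robot_init_pose",
                 "language", "lighting", "background", "sensor_noise"]

def success_rate(policy, make_env, n_episodes=50, perturbation=None):
    """Roll out `policy` for n_episodes and return its success rate."""
    successes = 0
    for seed in range(n_episodes):
        env = make_env(seed=seed, perturbation=perturbation)
        obs, done, info = env.reset(), False, {}
        while not done:
            obs, reward, done, info = env.step(policy(obs))
        successes += int(info.get("success", False))
    return successes / n_episodes

def robustness_report(policy, make_env):
    """Per-dimension success-rate drop relative to the unperturbed setting."""
    clean = success_rate(policy, make_env)
    return {dim: success_rate(policy, make_env, perturbation=dim) - clean
            for dim in PERTURBATIONS}
```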
LeXiang Technology's W-bot orders exceed 1,000 units; latest design revealed
具身智能之心· 2025-10-28 10:00
**Core Insights**
- The emergence of W-bot marks a significant breakthrough for China's consumer-grade embodied intelligence, showcasing the integration of technology and sports [1][3]
- The Chinese government has designated embodied intelligence a key industry for future development, creating a supportive policy framework [3]
- The capital market has responded positively: LeXiang Technology secured nearly 500 million yuan in angel financing within a year, reflecting strong industry confidence [3]

**Technology Foundation**
- LeXiang Technology has built a comprehensive self-developed stack covering hardware, software, and algorithms, giving the robot strong adaptability in complex environments [6][4]
- Over 80% of the company's staff work in R&D, with team members averaging more than 10 years of experience in robotics and AI [6][4]

**Market Applications**
- W-bot's multifunctional capabilities let it serve various roles in both household and commercial settings, driving its substantial pre-order success [7][10]
- In domestic scenarios, W-bot addresses companionship, home security, and delivery, while also adapting to outdoor environments for recreational activities [7][8]
- Innovative applications in retail, education, and real estate highlight its versatility and potential for broad adoption [10][12]

**Market Potential**
- The global consumer robotics market is projected to grow from $47 billion in 2024 to $108 billion by 2028, a compound annual growth rate of roughly 23% (the arithmetic is worked below) [12][14]
- China is positioned to leverage its manufacturing strength and vast consumer market to lead the global consumer-grade embodied-intelligence sector [14][15]
- The anticipated shift of robots from tools to partners suggests a large expansion of market opportunity, with W-bot poised to reshape lifestyles and industry digitization [15][16]
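As a quick sanity check on the growth figure cited above, the implied compound annual growth rate over the four years from 2024 to 2028 works out as follows:

```python
# Implied CAGR from the cited forecast: $47B (2024) -> $108B (2028),
# i.e. four years of compounding.
start, end, years = 47.0, 108.0, 4
cagr = (end / start) ** (1 / years) - 1
print(f"CAGR = {cagr:.1%}")  # 23.1%, consistent with the ~23% in the text
```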
NVIDIA's latest | Build your own SOTA model at zero cost! The lightweight-VLA era is here
具身智能之心· 2025-10-28 04:00
**Core Insights**
- The article presents VLA-0, a novel approach to robot control that turns a vision-language model (VLM) into a vision-language-action (VLA) model without modifying the VLM's existing structure [1][2][3]
- VLA-0 demonstrates that a simple design can achieve top-tier performance, challenging the notion that complexity equates to better functionality in VLA development [14][21]

**Introduction to VLA-0**
- VLA-0 breaks the conventional belief that more complex models yield better results by proposing a "zero-modification" approach, in which the VLM predicts actions in text format without architectural changes [1][2]

**Current Challenges in VLA Development**
- Existing VLA models often sacrifice the inherent strengths of VLMs to add action capabilities, leading to increased complexity and reduced language comprehension [2][3]

**Key Design Features of VLA-0**
- VLA-0 retains the original VLM structure and instead optimizes the input-output format and training logic, allowing the model to predict actions effectively [3][4]
- The input design combines a system prompt, multi-modal observations, and natural-language task instructions, so the VLM can understand and execute tasks without extra action heads [4][5]

**Action Decoding Mechanism**
- VLA-0 converts continuous actions into text the VLM can generate, improving action resolution and avoiding vocabulary conflicts (an illustrative encoding sketch follows this summary) [5][6]
- The training strategy uses masked action augmentation so the model relies on visual and task information rather than mere text-sequence continuation [7][8]

**Experimental Results**
- VLA-0 outperforms more complex models in both simulated and real-world scenarios, achieving an average success rate of 94.7% in simulation, above all comparable models [10][11]
- In real-world tests, VLA-0 reached a 60% success rate, well above SmolVLA's 47.5%, demonstrating its effectiveness in practical applications [11][13]

**Conclusions and Future Directions**
- The findings suggest that simpler designs can lead to superior performance in VLA development, emphasizing the value of leveraging existing VLM capabilities [14][15]
- Future directions include large-scale pre-training, inference-speed optimization, and integration of 3D perception to enhance adaptability and precision in complex environments [18][19][20]
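The paper's exact text format is not reproduced in the article; the sketch below illustrates the general action-as-text idea (discretize each continuous action dimension into integer bins and render them as plain text the VLM can emit). The bin count, action bounds, and 7-dimensional action vector are assumptions for illustration.

```python
import re

BINS = 1000            # discretization resolution (assumption)
LOW, HIGH = -1.0, 1.0  # assumed normalized action bounds

def action_to_text(action):
    """Encode each continuous action dimension as an integer bin in [0, BINS-1]."""
    ids = [round((a - LOW) / (HIGH - LOW) * (BINS - 1)) for a in action]
    return " ".join(str(min(max(i, 0), BINS - 1)) for i in ids)

def text_to_action(text, dims=7):
    """Parse the VLM's generated text back into a continuous action vector."""
    ids = [int(t) for t in re.findall(r"\d+", text)][:dims]
    return [LOW + i / (BINS - 1) * (HIGH - LOW) for i in ids]

encoded = action_to_text([0.12, -0.5, 0.98, 0.0, 0.0, 0.0, 1.0])
print(encoded)                  # seven integer tokens, one per action dimension
print(text_to_action(encoded))  # approximate round-trip of the original action
```

Because the action tokens are ordinary text, the VLM's vocabulary and decoding loop stay untouched, which is the point of the zero-modification design.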
Why is there still plenty of RL work to do on humanoids, quadrupeds, robotic arms, and other embodiments?
具身智能之心· 2025-10-28 04:00
**Core Insights**
- Reinforcement learning (RL) remains a significant field, with growing applications in robotics, including humanoid and quadruped robots, and in product optimization across industries [1][2][3]
- The complexity of RL poses challenges for newcomers, making it difficult to produce publishable research papers without a structured learning system [5][9]
- To address these challenges, a specialized 1v6 RL mentoring course has been launched, aimed at helping students produce quality research papers [6][9]

**Group 1: Importance of Reinforcement Learning**
- RL is crucial for tasks such as gait control in embodied intelligent robots, which is essential for achieving general-purpose capability [2]
- Companies like Yushu and Zhiyuan use RL to make humanoid robots perform complex actions such as climbing stairs, running, and dancing, enhancing their adaptability across scenarios [2][8]
- The integration of RL with vision-language-action (VLA) models on robotic arms is gaining traction in academia, leading to more efficient and smoother robot operation [3][8]

**Group 2: Challenges in Learning and Research**
- The vast, intricate nature of RL makes it difficult for beginners to find a clear entry point, often resulting in frustration and abandonment [5][9]
- Producing a research paper that meets peer-review standards requires proficiency in methodology, experimental results, and writing style, which can overwhelm newcomers [5][9]

**Group 3: Course Offerings and Structure**
- The 1v6 mentoring course targets graduate students and others seeking guidance on research papers, featuring small classes and weekly live sessions [7][9]
- The course spans 14 weeks of intensive online training followed by 8 weeks of maintenance support, covering RL and its applications in robotics [9][15]
- Participants receive guidance on paper ideas, project implementation, experimental support, and writing refinement, with the goal of a draft suitable for submission to top conferences [7][9][15]

**Group 4: Course Content and Deliverables**
- The curriculum covers RL fundamentals, simulation environments, and applications in quadruped, humanoid, and robotic-arm training [17][19]
- Students complete hands-on projects culminating in a research-paper draft that meets the requirements of venues such as RAL, ICRA, IROS, and CoRL [23][24]
- The course takes a structured approach to research, covering the entire process from methodology to writing and submission [30]
SFT or RL: how should VLA models actually be trained?
具身智能之心· 2025-10-28 00:02
**Core Insights**
- The articles covered here focus on advances in reinforcement learning (RL) applied to vision-language-action (VLA) models, highlighting significant gains in generalization capability and training efficiency

**Group 1: Research Findings**
- The first study investigates how RL improves the generalization of VLA models, addressing the error accumulation and distribution shift caused by supervised fine-tuning (SFT). A new benchmark covering visual, semantic, and execution dimensions shows that RL fine-tuning with Proximal Policy Optimization (PPO) significantly improves semantic understanding and execution robustness while matching SFT's visual generalization performance (a minimal PPO objective sketch follows this summary) [2]
- The second study introduces RLinf-VLA, a framework designed for large-scale RL training of VLA models. It proposes a novel solution to the challenge of integrating RL and VLA training, achieving up to 2.27x acceleration over baseline methods. The framework supports multiple VLA architectures and RL algorithms and reaches a 98.11% success rate across 130 LIBERO tasks [3]

**Group 2: Practical Applications**
- RLinf-VLA distills best practices for applying RL to VLA training and provides a unified interface across multiple VLA architectures and simulators, lowering the barrier to large-scale RL for VLA [3]
- The research underscores RL's role in boosting VLA performance, suggesting a shift toward more efficient training methodologies that exploit RL's strengths [15]
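The article names PPO as the RL fine-tuning algorithm but does not show a loss function; the sketch below is the standard clipped PPO surrogate objective in PyTorch, with the tensor names and the advantage source (e.g. GAE or success-based returns) as assumptions rather than details from either study.

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped PPO policy loss, the standard objective for RL fine-tuning.

    new_logp / old_logp: log-probabilities of the sampled action tokens under
    the current and the rollout-time policy; advantages: per-sample advantage
    estimates (e.g. from GAE or a success-based return).
    """
    ratio = (new_logp - old_logp).exp()              # importance ratio
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()     # maximize the surrogate
```

The clipping keeps each update close to the rollout policy, which is what makes PPO stable enough to fine-tune a large pretrained VLA without catastrophic policy drift.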
Can drones play volleyball too? A Tsinghua team explores the question with reinforcement learning
具身智能之心· 2025-10-28 00:02
**Core Insights**
- The article describes a new embodied-AI task proposed by Tsinghua University, "multi-drone volleyball", which aims to push drones' capabilities in three-dimensional space through teamwork and strategy [1][2]

**Group 1: Task Overview**
- The multi-drone volleyball task requires drones to demonstrate high maneuverability and precise control while collaborating as a team to hit a ball over a net and compete against an opposing team [2]
- The Tsinghua team built the VolleyBots testing platform to simulate the human process of learning volleyball, incorporating tasks for both single and multiple drones [2][6]

**Group 2: Algorithm Development**
- The Hierarchical Co-Self-Play (HCSP) algorithm lets drones learn cooperation, role division, and offense-defense transitions through hierarchical policy learning and self-play mechanisms (a minimal self-play loop sketch follows this summary) [2][12]
- The research incorporated various reinforcement-learning and game-theoretic algorithms, with HCSP achieving an average win rate of 82.9% against multiple baseline algorithms [15]

**Group 3: Training Phases**
- Training consists of three phases (low-level skill learning, high-level strategy play, and collaborative self-play), allowing drones to evolve their strategies and skills in a competitive environment [14]
- During matches the drones formed clear roles such as defense, passing, and offense, and even developed new tactics like the "setter's lob" during training [15]

**Group 4: Real-World Application**
- The JuggleRL system enables drones to juggle continuously in the real world, setting a record of 462 consecutive juggles without any real-data fine-tuning [16][18]
- This achievement marks a significant step for embodied reinforcement learning, moving from virtual environments to real physical interaction [18][19]
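HCSP itself is hierarchical, with separately trained low-level skills driven by a high-level strategy, which a few lines cannot capture; the sketch below shows only the self-play population mechanic at its core, with `train_match` standing in for one rollout-and-update against a frozen opponent. All names here are hypothetical.

```python
import copy
import random

def co_self_play(policy, train_match, snapshot_every=100, iters=1000):
    """Minimal self-play loop: periodically freeze the current policy into an
    opponent pool and keep training against randomly sampled past snapshots."""
    pool = [copy.deepcopy(policy)]           # initial frozen opponent
    for step in range(iters):
        opponent = random.choice(pool)       # sample a past snapshot to play
        train_match(policy, opponent)        # one match rollout + policy update
        if (step + 1) % snapshot_every == 0:
            pool.append(copy.deepcopy(policy))  # grow the opponent pool
    return policy, pool
```

Playing against a pool of past selves rather than only the latest policy is a common way to keep self-play from cycling, which fits the competitive league setting the article describes.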
Community members are landing offers one after another...
具身智能之心· 2025-10-28 00:02
**Core Insights**
- The article highlights community members' successful job placements at leading companies and emphasizes the value of choosing top-tier firms or distinctive tech unicorns for career advancement [1]
- The community aims to foster talent in embodied intelligence through initiatives including technical sharing, job referrals, and industry engagement [1][2][5]

**Group 1: Community Initiatives**
- Ongoing live sharing sessions discuss the latest developments and open problems in the embodied-intelligence industry [2]
- A comprehensive technical roadmap gives beginners the essential knowledge and skills for entering the field [3]
- Valuable industry frameworks and project proposals are offered to those already engaged in related research [5]

**Group 2: Job Referrals and Networking**
- The community has established a job-referral mechanism with multiple embodied-intelligence companies, facilitating direct connections between job seekers and employers [7]
- Members can access a wealth of resources, including open-source projects, datasets, and simulation platforms, to enhance their learning and practical skills [9][25][33]

**Group 3: Educational Resources**
- A compilation of renowned domestic and international embodied-intelligence laboratories aids members in their academic pursuits [12]
- A collection of research reports on large models and humanoid robots keeps members informed about industry trends and applications [18]
- Members can access a variety of educational materials, including books and technical documents, to support foundational learning in robotics [20][21]

**Group 4: Specialized Learning Paths**
- Detailed learning paths for embodied perception and interaction cover a range of tasks and methodologies [38][40]
- The community offers insight into cutting-edge topics such as multi-modal large models and reinforcement learning, keeping members current with the latest advances [46][53]