具身智能之心
ROSCon 2025: "Embodied Intelligence Playground" (具身智能训练场) Workshop Program
具身智能之心· 2025-10-30 00:03
Source: 古月居 (a professional ROS robotics knowledge community and industry service platform). Author: ROSCon China organizing committee. Editor: 古月居.

ROSCon China 2025 Workshop: "Embodied Intelligence Playground" (具身智能训练场)

- Joint organizers: 刻行时空 × 穹彻智能
- Time: Saturday, November 1, 2025, 10:00-12:30
- Venue: Gabriel Meeting Room, B1, Sofitel Shanghai Hongqiao (上海虹桥新华联索菲特大酒店)

| Time | Content | Speaker | Organization |
| --- | --- | --- | --- |
| 10:00-10:10 | Opening remarks | Host | 刻行时空 × 穹彻智能 |
| 10:10-10:30 | Keynote 1: Physical AI simulation systems, accelerating embodied robots | ...
The latest survey of world models for embodied AI: 250 papers organizing the mainstream frameworks and tasks
具身智能之心· 2025-10-30 00:03
**Core Insights**
- The article discusses the concept of world models in embodied AI, emphasizing their role as internal simulators that help agents perceive environments, take actions, and predict future states [1][2].

**Group 1: World Models Overview**
- Research on world models has seen unprecedented growth due to the explosion of generative models, leading to a complex array of architectures and techniques lacking a unified framework [2].
- A novel three-axis classification method is proposed to categorize existing world models based on their functionality, temporal modeling, and spatial representation [6].

**Group 2: Mathematical Principles**
- World models are typically modeled as partially observable Markov decision processes (POMDPs), focusing on learning compact latent states from partial observations and the transition dynamics between states [4].
- The training paradigm for world models often employs a "reconstruction-regularization" approach, which encourages the model to reconstruct observations from latent states while aligning posterior inference with prior predictions [9] (a minimal sketch of such an objective follows this summary).

**Group 3: Functional Positioning**
- World models can be categorized into decision-coupled and general-purpose types, with the former optimized for specific decision tasks and the latter serving as task-agnostic simulators [6][15][16].
- Decision-coupled models, like the Dreamer series, excel in task performance but may struggle with generalization due to their task-specific representations [15].
- General-purpose models aim for broader predictive capabilities and transferability across tasks, though they face challenges in computational complexity and real-time inference [16].

**Group 4: Temporal Modeling**
- Temporal modeling can be divided into sequential reasoning and global prediction, with the former focusing on step-by-step simulation and the latter predicting entire future sequences in parallel [20][23].
- Sequential reasoning is beneficial for closed-loop control but may suffer from error accumulation over long predictions [20].
- Global prediction enhances computational efficiency and reduces error accumulation but may lack detailed local dynamics [23].

**Group 5: Spatial Representation**
- Strategies for spatial representation include global latent vectors, token feature sequences, spatial latent grids, and decomposed rendering representations [25][28][34][35].
- Global latent vectors compress scene states into low-dimensional variables, facilitating real-time control but potentially losing fine-grained spatial information [28].
- Token feature sequences allow for detailed representation of complex scenes but require extensive data and computational resources [29].
- Spatial latent grids maintain local topology and are prevalent in autonomous driving, while decomposed rendering supports high-fidelity image generation but struggles with dynamic scenes [34][35].

**Group 6: Data Resources and Evaluation Metrics**
- Data resources for embodied AI can be categorized into simulation platforms, interactive benchmarks, offline datasets, and real robot platforms, each serving distinct purposes in training and evaluating world models [37].
- Evaluation metrics focus on pixel-level generation quality, state/semantic consistency, and task performance, with recent trends emphasizing physical compliance and causal consistency [40].
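To make the "reconstruction-regularization" paradigm above concrete, here is a minimal sketch of the variational objective used by Dreamer/RSSM-style latent world models. The notation (latent state z_t, observation o_t, action a_{t-1}, weight β) is a generic illustration, not drawn from the surveyed paper itself:

```latex
% Reconstruction-regularization objective for a latent world model (illustrative sketch).
% Maximizing J trains the posterior encoder q_phi to compress observations into latent
% states that the prior transition model p_theta can predict one step ahead.
\mathcal{J}(\theta,\phi) \;=\;
\mathbb{E}_{q_\phi}\!\left[\,\sum_{t}
\underbrace{\ln p_\theta\!\left(o_t \mid z_t\right)}_{\text{reconstruction}}
\;-\;\beta\,
\underbrace{\mathrm{KL}\!\left(
  q_\phi\!\left(z_t \mid z_{t-1}, a_{t-1}, o_t\right)
  \,\big\|\,
  p_\theta\!\left(z_t \mid z_{t-1}, a_{t-1}\right)\right)}_{\text{regularization}}
\right]
```

Aligning the posterior with the prior in this way is what lets decision-coupled models such as the Dreamer series roll out "imagined" trajectories purely in latent space.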
Has everyone heard back from autumn recruiting yet?
具身智能之心· 2025-10-30 00:03
**Core Insights**
- The article highlights the successful job placements of community members in various leading companies and emphasizes the importance of choosing top-tier firms or unique tech unicorns for career advancement [1].
- The community aims to foster talent in the field of embodied intelligence through various initiatives, including technical sharing, job referrals, and industry engagement [1][2][5].

**Group 1: Community Initiatives**
- Continuous live sharing sessions are organized to discuss the latest developments and unresolved issues in the embodied intelligence industry [2].
- A comprehensive technical roadmap has been developed for beginners, providing a structured approach to entering the field [3].
- Valuable industry frameworks and project proposals are offered to those already engaged in related research [5].

**Group 2: Job Referral and Networking**
- The community has established a job referral mechanism with multiple embodied intelligence companies, facilitating direct connections between job seekers and employers [7].
- Members can access a wealth of resources, including open-source projects, datasets, and industry reports, to enhance their learning and research capabilities [9][12][18].

**Group 3: Educational Resources**
- The community provides a compilation of over 40 open-source projects and nearly 60 datasets related to embodied intelligence, significantly reducing the time needed for research [9][31].
- Various learning paths are outlined for different aspects of embodied intelligence, including perception, interaction, and reinforcement learning [10][38].

**Group 4: Expert Engagement**
- The community invites numerous industry experts to participate in discussions, providing members with opportunities to ask questions and gain insights from leading figures in the field [9].
- Members can engage in discussions about academic progress and industrial applications, fostering a collaborative learning environment [13][68].
IROS 2025 Challenge champion solution: X-VLA open-sourced, setting new records across robot benchmarks
具身智能之心· 2025-10-29 04:07
**Core Insights**
- The article discusses the launch of the X-VLA model, a groundbreaking open-source model in the field of embodied intelligence, achieving significant performance improvements with only 0.9 billion parameters [2][7].

**Competition Highlights**
- The AGIBOT World Challenge 2025 attracted 431 teams from 23 countries, with 11 teams competing in the final event held in Hangzhou, China, focusing on real-world physical tasks [4][5].

**Performance Breakthroughs**
- X-VLA achieved state-of-the-art (SOTA) performance across five authoritative simulation benchmarks, demonstrating exceptional efficiency and effectiveness in long-duration tasks like autonomous clothing folding [7][24].

**Innovative Techniques**
- The model employs a Soft-Prompt mechanism to enhance adaptability across different robotic platforms, and a multi-modal encoding strategy to optimize resource allocation while maintaining information integrity [16][21].
- A flow-matching generative action decoder is utilized to improve the smoothness and robustness of action trajectories in uncertain environments [17] (a generic sketch of such a decoder follows this summary).

**Data Preprocessing and Training**
- The model incorporates a balanced data sampling strategy and a rigorous data cleaning pipeline to ensure high-quality training data, which is crucial for learning meaningful behavior knowledge [21][22].
- The training process includes a customized post-training workflow that allows for efficient adaptation to specific tasks using smaller datasets [23][26].

**Real-World Testing**
- X-VLA demonstrated strong performance on real robotic platforms, successfully completing complex tasks such as unlimited-duration autonomous clothing folding, showcasing its capability in handling intricate long-range tasks [27].
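The flow-matching action decoder mentioned above can be summarized with a short, generic sketch. This is not the released X-VLA code; it only illustrates standard conditional flow matching (rectified-flow style) for an action chunk, where `velocity_net` and the tensor shapes are placeholder assumptions:

```python
import torch

def flow_matching_loss(velocity_net, context, actions):
    """Generic conditional flow-matching loss for an action chunk.

    velocity_net: any network predicting a velocity field v(x_t, t, context)
    context:      conditioning features, e.g. fused vision-language tokens
    actions:      ground-truth action chunk, shape (batch, horizon, action_dim)
    """
    noise = torch.randn_like(actions)            # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, 1)       # per-sample time in [0, 1]
    x_t = (1.0 - t) * noise + t * actions        # point on the straight path noise -> action
    target_velocity = actions - noise            # constant velocity along that path
    pred_velocity = velocity_net(x_t, t, context)
    return torch.nn.functional.mse_loss(pred_velocity, target_velocity)

@torch.no_grad()
def sample_actions(velocity_net, context, horizon, action_dim, steps=10):
    """Integrate the learned velocity field from noise to an action chunk (Euler steps)."""
    x = torch.randn(context.shape[0], horizon, action_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1, 1), i * dt)
        x = x + dt * velocity_net(x, t, context)
    return x
```

At inference time the decoder denoises from Gaussian noise to a full action trajectory in a handful of integration steps, which is one reason flow-matching decoders tend to produce smooth trajectories.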
Recruiting several experts working on embodied world models!
具身智能之心· 2025-10-29 04:00
**Group 1**
- The article discusses the rising interest in embodied world models, highlighting their industrial and research value [1].
- The company is recruiting two lecturers to help develop courses or tutoring content related to world models [2].
- There is an emphasis on collaboration with individuals who have a strong background in the field, specifically those with a PhD or higher who have published at least one paper in a CCF-A level conference [5].

**Group 2**
- The compensation offered for the positions is above industry standards, and the roles can be part-time [6].
- Interested individuals are encouraged to contact the responsible person via WeChat for further communication [6].
Breaking through the robot spatial-perception bottleneck: Sun Yat-sen University and Tuoyuan Wisdom propose the TAVP framework
具身智能之心· 2025-10-29 00:03
**Core Viewpoint**
- The article discusses the introduction of the Task-Aware View Planning (TAVP) framework by Sun Yat-sen University and Tuoyuan Wisdom, which addresses the limitations of current visual-language-action (VLA) models in robotic multi-task manipulation by enhancing action prediction accuracy and task generalization capabilities in complex environments [1][5].

**Research Background**
- The main challenges faced by existing VLA models, such as OpenVLA and π0.5, include incomplete 3D perception due to fixed viewpoints and significant task interference caused by shared encoders [3][5][7].

**Core Innovations**
- The TAVP framework introduces two innovative modules, the Multi-View Exploration Policy (MVEP) and the Task-Aware Mixture of Experts (TaskMoE), which work together to optimize the perception-action link in robotic manipulation [6][9].

**Module Details**
- **Multi-View Exploration Policy (MVEP)**: dynamically captures key perspectives to address 3D perception occlusion by selecting optimal virtual camera positions through reinforcement learning [9][11].
- **Task-Aware Mixture of Experts (TaskMoE)**: decouples task features to eliminate multi-task interference using dynamic expert routing and gating mechanisms [12][11] (an illustrative routing sketch follows this summary).
- **Three-Stage Training Strategy**: ensures module collaboration and performance stability through parameterization of viewpoints, efficient policy training, and dynamic re-rendering of images [11][20].

**Experimental Validation**
- TAVP outperformed existing baseline models across 18 tasks on the RLBench benchmark, achieving an average success rate of 86.6% and particularly excelling in occlusion-prone tasks [13][14].
- Ablation studies confirmed the necessity of the core modules: removing TaskMoE dropped the success rate to 85.6%, and random viewpoints caused a drastic decline to 8.9% [15][21].

**Generalization and Efficiency Analysis**
- TAVP demonstrated improved zero-shot capability, achieving a success rate of 12.0% on unseen tasks, while the variant without TaskMoE failed to succeed [22][16].
- Despite the increased computational cost of dynamic viewpoint re-rendering, TAVP maintained an average inference time of 0.436 seconds, only slightly higher than the baseline [22].

**Real-World Robustness Testing**
- In robustness tests, TAVP showed superior adaptability compared to baseline models, achieving 100% success rates in various scenarios, including unseen instances and backgrounds [18][19][23].

**Research Significance and Future Directions**
- The TAVP framework represents a new paradigm for robotic multi-task manipulation, enabling dynamic viewpoint planning and task-aware encoding to overcome existing limitations [25].
- Future work will focus on enhancing robustness against reflective and transparent objects and exploring multi-sensor fusion to expand the boundaries of robotic manipulation tasks [25].
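For readers unfamiliar with mixture-of-experts routing, the sketch below shows one plausible way a task-aware MoE layer could be wired up: a gating network conditioned on a task embedding softly routes features across parallel experts. It is an assumption-laden illustration of the general TaskMoE idea, not the authors' released implementation:

```python
import torch
import torch.nn as nn

class TaskAwareMoE(nn.Module):
    """Illustrative task-aware mixture-of-experts layer (not the TAVP code).

    A gating network conditioned on a task embedding produces soft routing
    weights over parallel expert MLPs, so different tasks can favor different
    experts and cross-task interference is reduced.
    """

    def __init__(self, feat_dim: int, task_dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU(),
                          nn.Linear(feat_dim, feat_dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(task_dim, num_experts)  # task embedding -> routing logits

    def forward(self, features: torch.Tensor, task_emb: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(task_emb), dim=-1)                   # (batch, E)
        expert_out = torch.stack([e(features) for e in self.experts], dim=1)   # (batch, E, feat_dim)
        return torch.einsum("be,bed->bd", weights, expert_out)

# Usage sketch with hypothetical dimensions:
# layer = TaskAwareMoE(feat_dim=512, task_dim=64)
# out = layer(torch.randn(8, 512), torch.randn(8, 64))
```

Soft routing keeps the layer differentiable end to end; a sparse top-k gate would be a natural alternative if compute per task needs to stay constant.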
Company news | 400,000 downloads: Galaxea's real-robot dataset tops major global open-source platforms
具身智能之心· 2025-10-29 00:03
Editor: 星海图 (Galaxea)

The Galaxea Open-World Dataset (星海图开放世界数据集), officially open-sourced by Galaxea in August 2025, sparked wide discussion across the global embodied-intelligence community as soon as it was released. In just two months since open-sourcing, downloads of the dataset surpassed 400,000, making it one of the most watched and most downloaded real-robot embodied-intelligence datasets in the world.

Researchers from leading international teams such as Physical Intelligence, Bitrobot, and Hugging Face have publicly endorsed and recommended it on social media, calling the dataset "a highly valuable community resource." Robotics researchers, labs, and application companies around the world are already using the Galaxea Open-World Dataset for system validation, model training, and further research.

Driving real-world deployment of embodied intelligence with real-world data: for a long time, mainstream large-model pretraining has relied on internet data or simulation data. Internet data, though vast in scale, is uneven in quality; simulation data is limited by the simplifying assumptions of virtual environments and struggles to faithfully reproduce physical interaction and environmental complexity, affecting how models perform in the real world ...
Are VLA models collectively stumbling? Prof. Qiu Xipeng's team at Fudan & 创智 propose LIBERO-Plus, revealing the truth about VLA fragility
具身智能之心· 2025-10-29 00:03
**Core Insights**
- The article discusses the robustness analysis of Vision-Language-Action (VLA) models, revealing significant generalization deficiencies despite high performance scores in ideal conditions [2][4][6].
- The LIBERO-Plus framework is introduced to systematically evaluate VLA models across various perturbation dimensions, highlighting the gap between surface performance and actual generalization capabilities [4][6][33].

**Group 1: Motivation and Contributions**
- VLA models have achieved impressive success rates in benchmarks like LIBERO, but existing evaluation methods fail to assess stability and reliability under real-world variations [4][6].
- LIBERO-Plus evaluates models based on seven dimensions of perturbation: object placement, camera angle, robot initial pose, language instructions, lighting conditions, background textures, and sensor noise [4][6] (a hypothetical evaluation sweep over such dimensions is sketched after this summary).
- The framework provides a detailed analysis of VLA models' generalization performance through systematic perturbation [4][6].

**Group 2: Performance Analysis**
- The analysis reveals that VLA models exhibit significant overall vulnerability to perturbations, with performance declining across all dimensions [13][32].
- Models are most sensitive to changes in camera perspective and robot initial state, indicating a need for high-level spatial and proprioceptive understanding [13][32].
- Language perturbations lead to the smallest average performance drop (-25.3%), suggesting a surprising level of robustness that warrants further investigation [15][17].

**Group 3: Findings on Model Behavior**
- Some models maintain performance even with empty language inputs, indicating a tendency to ignore language modalities and behave more like visual-action (VA) models [16][19].
- VLA models struggle with cross-object instruction following, relying more on fixed visual-action mappings rather than fully leveraging language signals [19][20].
- The models demonstrate remarkable adaptability to background changes while showing limited sensitivity to lighting variations, raising questions about the representations they learn [20][27].

**Group 4: Combination Generalization**
- The concept of a "combination generalization gap" is introduced, highlighting the negative interactions between different perturbations that exceed the independent effects of single perturbations [29][32].
- The analysis indicates that current VLA models lack the ability to effectively handle complex multi-dimensional perturbations due to entangled representations [32].

**Group 5: LIBERO-Plus Benchmark**
- The LIBERO-Plus benchmark consists of 10,030 tasks designed to evaluate model performance under various perturbations, constructed using perturbation augmentation strategies [33][36].
- The benchmark features comprehensive coverage of the seven perturbation dimensions and fine-grained difficulty levels [36].
- Models trained with enhanced data achieved an average success rate of 79.6% on LIBERO-Plus, significantly outperforming baseline models [38].
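The per-dimension robustness analysis described above can be reproduced in spirit with a simple evaluation sweep. The snippet below is a hypothetical harness, not LIBERO-Plus code: `apply_perturbation`, `run_episode`, and the policy/environment objects are placeholders, and the dimension names are taken from this summary rather than from the released benchmark:

```python
import random
from statistics import mean

# The seven perturbation dimensions reported for LIBERO-Plus (names per this summary).
PERTURBATIONS = [
    "object_layout", "camera_viewpoint", "robot_init_state",
    "language_instruction", "lighting", "background_texture", "sensor_noise",
]

def robustness_sweep(policy, env, apply_perturbation, run_episode, episodes_per_dim=50):
    """Hypothetical sweep: success rate per perturbation dimension vs. a clean baseline.

    run_episode(policy, env) is assumed to return 1 on task success and 0 otherwise;
    apply_perturbation(env, dim, seed) is assumed to return a perturbed copy of env.
    """
    clean = mean(run_episode(policy, env) for _ in range(episodes_per_dim))
    report = {"clean": clean}
    for dim in PERTURBATIONS:
        successes = []
        for _ in range(episodes_per_dim):
            perturbed_env = apply_perturbation(env, dim, seed=random.randrange(1 << 30))
            successes.append(run_episode(policy, perturbed_env))
        report[dim] = mean(successes)   # compare against report["clean"] to get the drop
    return report
```

Comparing each entry against the clean baseline gives the per-dimension degradation; stacking several perturbations in one episode would probe the "combination generalization gap" the paper describes.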
LeXiang Technology's W-bot exceeds 1,000 orders; latest design revealed
具身智能之心· 2025-10-28 10:00
**Core Insights**
- The emergence of W-bot marks a significant breakthrough in China's consumer-grade embodied intelligence, showcasing the integration of technology and sports [1][3].
- The Chinese government has recognized embodied intelligence as a key industry for future development, creating a supportive policy framework [3].
- The capital market has responded positively, with LeXiang Technology securing nearly 500 million yuan in angel financing within a year, reflecting strong industry confidence [3].

**Technology Foundation**
- LeXiang Technology has developed a comprehensive self-research system covering hardware, software, and algorithms, enabling significant adaptability in complex environments [6][4].
- The company boasts a research team where over 80% are dedicated to R&D, with members having an average of over 10 years of experience in robotics and AI [6][4].

**Market Applications**
- W-bot's multifunctional capabilities allow it to serve various roles in both household and commercial settings, contributing to its substantial pre-order success [7][10].
- In domestic scenarios, W-bot addresses key needs such as companionship, home security, and delivery, while also adapting to outdoor environments for recreational activities [7][8].
- The product's innovative applications in industries like retail, education, and real estate highlight its versatility and potential for widespread adoption [10][12].

**Market Potential**
- The global consumer robotics market is projected to grow significantly, with estimates suggesting a rise from $47 billion in 2024 to $108 billion by 2028, a compound annual growth rate of approximately 23% [12][14].
- China is positioned to leverage its manufacturing strengths and vast consumer market to become a leader in the global consumer-grade embodied intelligence sector [14][15].
- The anticipated market transformation from tools to partners in robotics suggests a vast expansion of market opportunities, with W-bot poised to revolutionize lifestyles and industry digitization [15][16].
NVIDIA's latest | Build your SOTA model at zero cost: the lightweight VLA era is here
具身智能之心· 2025-10-28 04:00
**Core Insights**
- The article presents VLA-0, a novel approach to robot control that uses a vision-language-action model (VLA) without modifying the existing structure of the vision-language model (VLM) [1][2][3].
- VLA-0 demonstrates that a simple design can achieve top-tier performance, challenging the notion that complexity equates to better functionality in VLA development [14][21].

**Summary by Sections**

**Introduction to VLA-0**
- VLA-0 breaks the conventional belief that more complex models yield better results by proposing a "zero modification" approach, allowing the VLM to predict actions in text format without altering its architecture [1][2].

**Current Challenges in VLA Development**
- Existing VLA models often sacrifice the inherent advantages of VLMs for added action functionality, leading to issues such as increased complexity and reduced language comprehension [2][3].

**Key Design Features of VLA-0**
- VLA-0 retains the original VLM structure and focuses on optimizing the input-output and training logic, allowing it to predict actions effectively [3][4].
- The input design includes system prompts, multi-modal observations, and natural-language task instructions, ensuring that the VLM can understand and process tasks without additional coding [4][5].

**Action Decoding Mechanism**
- VLA-0 innovatively converts continuous actions into text that the VLM can generate, enhancing action resolution and avoiding vocabulary conflicts [5][6] (a toy encode/decode sketch follows this summary).
- The training strategy employs masked action augmentation to make the model rely on visual and task information rather than just text-sequence continuity [7][8].

**Experimental Results**
- VLA-0 outperforms more complex models in both simulated and real-world scenarios, achieving an average success rate of 94.7% in simulation and surpassing all comparable models [10][11].
- In real-world tests, VLA-0 achieved a 60% success rate, significantly higher than the 47.5% of the SmolVLA model, demonstrating its effectiveness in practical applications [11][13].

**Conclusions and Future Directions**
- The findings suggest that simpler designs can deliver superior performance in VLA development, emphasizing the importance of leveraging existing VLM capabilities [14][15].
- Future exploration may include large-scale pre-training, inference-speed optimization, and the integration of 3D perception to enhance the model's adaptability and precision in complex environments [18][19][20].
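The "actions as plain text" idea at the core of VLA-0 can be made concrete with a tiny encode/decode sketch. The bin count, value ranges, and formatting below are illustrative assumptions, not the paper's exact tokenization scheme:

```python
def encode_actions(actions, low=-1.0, high=1.0, bins=1000):
    """Map continuous action values to integers rendered as plain text,
    so an unmodified VLM can emit them as ordinary tokens (illustrative scheme)."""
    ints = []
    for value in actions:
        clipped = min(max(value, low), high)
        ints.append(round((clipped - low) / (high - low) * (bins - 1)))
    return " ".join(str(i) for i in ints)

def decode_actions(text, low=-1.0, high=1.0, bins=1000):
    """Parse the VLM's text output back into continuous action values."""
    return [low + int(tok) / (bins - 1) * (high - low) for tok in text.split()]

# Round trip for a hypothetical 7-DoF action normalized to [-1, 1]:
action = [0.12, -0.48, 0.90, 0.0, 0.33, -1.0, 1.0]
text = encode_actions(action)        # "559 260 949 500 664 0 999"
recovered = decode_actions(text)     # close to the original, within 1/999 of the range
```

Because the discretized values are written as ordinary numerals, no action-specific vocabulary has to be added to the VLM, which is the "zero modification" property the article emphasizes; the finer the binning, the higher the effective action resolution.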