具身智能之心
具身智能之心 now has nearly 20 technical exchange groups. Welcome to join!
具身智能之心· 2025-09-23 04:00
Group 1
- The establishment of a technical exchange group focused on embodied intelligence technology, inviting participation from various subfields [1]
- The group covers nearly 20 sub-directions, including humanoid robots, quadrupeds, robotic arms, and areas such as VLA, large models, VLN, reinforcement learning, mobile manipulation, multimodal perception, simulation, and data collection [1]
- The invitation encourages collaboration and discussion on technology and industry developments among participants [1]
Why can VLA fold towels yet fail to estimate object poses? A look at how embodied "spatial perception" is being completed
具身智能之心· 2025-09-23 00:03
Core Viewpoint
- The article discusses the innovative OnePoseViaGen framework, which addresses the challenges of 6D object pose estimation in robotics, enabling robots to accurately perceive and interact with unknown objects using a single reference image without the need for pre-existing 3D models [2][3][31].

Summary by Sections

Introduction to the Problem
- Current robotic systems can perform simple tasks like folding towels but struggle with complex interactions requiring precise spatial awareness, such as grasping unfamiliar objects [1][2].
- The inability to establish a closed-loop connection between generated models, real objects, and spatial poses is a significant barrier to effective robotic interaction with the physical world [2].

OnePoseViaGen Framework
- OnePoseViaGen offers a revolutionary solution that estimates the 6D pose of unknown objects using only a single reference image, combining single-view 3D generation, coarse-to-fine alignment, and text-guided domain randomization [2][5].
- The framework follows a logical progression: addressing the absence of 3D models, calibrating real-world scales and poses, and enhancing robustness through domain adaptation [5][7].

Key Research Achievements
- The framework begins by generating a textured 3D model from a single RGB-D anchor image, ensuring geometric consistency through normal-vector estimation [8][9].
- A two-step alignment strategy refines scale and pose, starting with a coarse alignment followed by a precise optimization stage [10][12][13].
- Text-guided domain randomization creates diverse 3D model variants, enhancing the robustness of pose estimation against variations in lighting and occlusion [14][15].

Performance Validation
- OnePoseViaGen outperforms existing methods on benchmark datasets, achieving an average ADD of 81.27% and ADD-S of 93.10%, significantly higher than competitors such as Oryon and Any6D (these metrics are sketched after this summary) [16][17].
- In challenging scenarios, such as heavily occluded environments, OnePoseViaGen maintains high accuracy, demonstrating its effectiveness in real-world applications [20][22].

Real-World Application
- The framework was tested in real robotic operation, achieving a success rate of 73.3% in single-arm and dual-arm manipulation tasks, far exceeding baseline methods [23][24][25].
- Qualitative results show that the generated 3D models closely match real object textures and structures, allowing precise pose estimation even in the presence of occlusions [27].

Ablation Studies
- Ablation experiments confirm the necessity of the coarse-to-fine alignment and the importance of domain randomization in enhancing the robustness of the framework [28][30].

Conclusion
- OnePoseViaGen represents a significant advancement in robotic perception, enabling accurate pose estimation and interaction with unknown objects without relying on extensive 3D model libraries or multi-view inputs, thus paving the way for robots to operate in open-world environments [31].
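The ADD and ADD-S figures quoted under Performance Validation are standard 6D-pose metrics: the mean distance between model points transformed by the ground-truth and predicted poses, and its symmetric closest-point variant; the reported percentages are typically the fraction of test frames whose distance falls below a threshold such as 10% of the object diameter. Below is a minimal NumPy sketch of the two distances; the demo values and conventions are illustrative assumptions, not numbers from the paper.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform (R, t) to an (N, 3) point set."""
    return points @ R.T + t

def add_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding model points
    under the ground-truth and predicted poses."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    return np.linalg.norm(gt - pred, axis=1).mean()

def add_s_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: symmetric variant; for each ground-truth point, take the
    distance to the closest predicted point (suited to symmetric objects)."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=-1)  # pairwise distances
    return d.min(axis=1).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model_pts = rng.normal(size=(500, 3)) * 0.05        # roughly 10 cm object
    R = np.eye(3)
    t = np.array([0.0, 0.0, 0.5])
    t_hat = t + np.array([0.002, 0.0, 0.001])           # slightly perturbed prediction
    print(add_metric(model_pts, R, t, R, t_hat))
    print(add_s_metric(model_pts, R, t, R, t_hat))
```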
Why are embodied intelligence's unavoidable "four data problems" so hard: data collection, data flywheels, data factories, and simulated/synthetic data
具身智能之心· 2025-09-23 00:03
Core Viewpoint
- The article discusses the evolution and significance of embodied intelligence, emphasizing its philosophical roots and the necessity of physical interaction for intelligent systems [4][5][7].

Group 1: Historical Development
- The concept of embodied intelligence traces back to developments in philosophy and cognitive science, highlighting the importance of physical interaction in cognitive processes [4].
- Key experiments, such as Richard Held's classic "passive kitten" study, demonstrate the intrinsic link between perception and action, reinforcing the idea that active engagement with the environment is crucial for learning [5].
- The shift from viewing intelligence as disembodied computation to a more integrated approach that includes physical embodiment is outlined [6][7].

Group 2: Current Trends in Embodied Intelligence
- The construction of immersive environments for embodied intelligence is essential, requiring the integration of physical properties and sensory feedback [9][10].
- The development of large-scale, systematic robot training facilities is identified as critical infrastructure for advancing embodied intelligence [12].
- Various high-level robot training platforms are emerging across China, indicating rapid growth in this sector [12].

Group 3: Data Collection and Training
- High-quality, diverse behavioral data is crucial for the development of embodied intelligence, with a focus on visual, interaction, and semantic-understanding data [15][17].
- The article outlines the importance of structured data collection methods, including teleoperation and wearable devices, to enhance robot training (a sketch of such an episode record follows this summary) [19][20].
- A systematic approach to data collection is emphasized, with a focus on stability in object-grasping tasks, leading to improved predictive capabilities in robotic systems [22][23][25].

Group 4: Future Directions and Challenges
- The integration of embodied intelligence with large models is seen as a key pathway for advancing robotic technology, emphasizing the need for a collaborative framework between edge and cloud computing [26][29].
- The article discusses the necessity of building a comprehensive training ecosystem that combines real and virtual environments to facilitate effective learning and adaptation [34][35].
- The future of embodied intelligence relies on diverse embodied agents and a robust learning and evolution framework to ensure continuous improvement and adaptability [31][36].

Group 5: Practical Applications
- Embodied intelligence is being applied in various sectors, including logistics, consumer electronics, and healthcare, showcasing its potential to address real-world challenges [30][33].
- The establishment of training centers and collaborative platforms is crucial for fostering innovation and standardization in the field [42][45].
- The article highlights the importance of open-source ecosystems and collaborative efforts among industry players to drive advancements in embodied intelligence [74].
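Group 3 mentions structured data collection via teleoperation and wearable devices. As a concrete illustration, here is a minimal sketch of what a single teleoperation episode record could look like; all field names, shapes, and the overall schema are illustrative assumptions rather than any published data format.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class FrameSample:
    timestamp: float               # seconds since episode start
    rgb: np.ndarray                # (H, W, 3) camera image
    joint_positions: np.ndarray    # robot proprioception
    gripper_open: float            # 0.0 (closed) .. 1.0 (open)
    teleop_action: np.ndarray      # operator command at this step

@dataclass
class Episode:
    task_description: str          # e.g. "pick up the cup and place it on the tray"
    robot_type: str                # embodiment identifier
    success: bool                  # post-hoc label feeding the data flywheel
    frames: List[FrameSample] = field(default_factory=list)

# Minimal usage: one episode with a single all-zeros frame.
ep = Episode(task_description="pick up the cup", robot_type="dual_arm_ur3", success=True)
ep.frames.append(FrameSample(0.0, np.zeros((480, 640, 3), np.uint8),
                             np.zeros(7), 1.0, np.zeros(7)))
print(len(ep.frames), ep.success)
```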
The MBZUAI Robotics Lab is recruiting fully funded PhD students, visiting graduate students, and more for Fall 2026
具身智能之心· 2025-09-23 00:03
PI Introduction
Dr. 左星星 (Xingxing Zuo) is a tenure-track Assistant Professor in the Department of Robotics at MBZUAI, where he leads the Robotics Cognition and Learning (RCL) lab. He previously held postdoctoral positions in the Computing and Mathematical Sciences Department at Caltech and the Computer Science Department at the Technical University of Munich (TUM), and worked full-time at Google as a Visiting Faculty Researcher. His main research areas are robotics, multimodal SLAM, 3D scene understanding, embodied intelligence, and 3D computer vision. He has published more than forty papers in leading robotics and AI venues including T-RO, IJCV, J-FR, RA-L, ICRA, IROS, and CVPR, and serves as an invited Associate Editor for the leading robotics journal RA-L (2022 to present) and for the two flagship robotics conferences IROS (2022-2025) and ICRA (2023-2026). His long-term research goal is to enable natural interaction and seamless collaboration between robots and humans in open environments through accurate understanding of a robot's own state, its surrounding 3D environment, and its action execution.

Recruitment Directions
Robotics, 3D Computer ...
Why can VLA fold towels yet fail to estimate object poses? How is embodied intelligence's "spatial perception" being completed?
具身智能之心· 2025-09-22 09:00
Author: Zheng Geng et al. | Editor: 具身智能之心
This article is shared for academic purposes only; please contact us for removal in case of infringement.

Picture this contrast: VLA models can smoothly carry out geometry-centric manipulations such as folding towels and tidying clothes, yet when faced with tasks like "grasp an unfamiliar seasoning bottle with a robotic arm" or "estimate the 3D pose of an unknown part", they fail again and again, either grasping thin air or knocking the object over. Behind this lies a key bottleneck for deploying embodied intelligence: 6D object pose estimation.

Anyone who has worked on robot manipulation knows that tasks requiring precise interaction, such as "grasp the part" or "place the seasoning bottle", ultimately come down to spatial perception: the robot needs the object's 3D position (translation) and orientation (rotation), and the estimated scale must match the real world. Yet existing methods are always compromising: they either rely on pre-scanned CAD models (which simply do not exist for most real objects) or require multi-view images (impractical to capture in real-time scenarios); even single-view reconstruction runs into scale ambiguity, because the object's true size is unknown (a depth back-projection sketch follows this excerpt).

This creates a stark capability gap: VLA can rely on visual planning to complete tasks like "folding towels" that do not depend on precise sp ...
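The scale ambiguity described above is what metric depth resolves; in OnePoseViaGen's setting (per the companion summary earlier in this digest) the anchor view is an RGB-D frame, and each depth pixel can be back-projected through the camera intrinsics into a metric 3D point, pinning down real-world size. A minimal sketch, with made-up intrinsics and a synthetic depth map:

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Turn a metric depth map (H, W) into an (N, 3) point cloud in the
    camera frame. Depth is in meters; intrinsics come from the RGB-D camera."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]        # drop invalid (zero-depth) pixels

# Example: a flat synthetic depth image 0.5 m from the camera.
depth = np.full((480, 640), 0.5, dtype=np.float32)
cloud = backproject_depth(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(cloud.shape, cloud[:, 2].mean())
```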
Embodied intelligence: should you go into industry or pursue a PhD?
具身智能之心· 2025-09-22 04:00
Core Viewpoint
- The article discusses whether individuals in the field of embodied intelligence should pursue a PhD or enter the job market, emphasizing the importance of foundational knowledge and suitability for pioneering roles in this evolving industry [1][2].

Group 1: Foundations and Suitability
- The article highlights the necessity of a solid foundation in embodied intelligence, particularly in robotics-related areas, to be competitive in the job market [1].
- It stresses the importance of being suited to the role of a research "pioneer", especially in a field with many unresolved issues, and the need for strong problem-solving skills [1][2].

Group 2: Community and Resources
- The "Embodied Intelligence Heart Knowledge Planet" community is introduced as a comprehensive platform for beginners, offering resources such as videos, articles, learning paths, and job-exchange opportunities [2][4].
- The community aims to grow from nearly 2,000 members to 10,000 within two years, providing a space for technical sharing and collaboration [2].

Group 3: Practical Support and Networking
- The community addresses practical questions related to equipment usage, data collection, and model deployment, enhancing the application of knowledge in projects [4].
- It has established a job-referral mechanism with leading companies in the embodied intelligence sector, facilitating connections between job seekers and employers [6][14].

Group 4: Educational Content and Learning Paths
- The community has compiled over 30 technical roadmaps covering various aspects of embodied intelligence, significantly reducing the time needed for research [4][14].
- It offers a wealth of resources, including open-source projects, datasets, and technical learning routes, catering to both beginners and advanced researchers [14][19][28].
IGL-Nav: Image-Goal Navigation via Incremental 3D Gaussian Localization (ICCV'25)
具身智能之心· 2025-09-22 00:03
Author: Wenxuan Guo et al. | Editor: 视觉语言导航

Main Contributions
- Proposes the IGL-Nav framework, which incrementally updates a 3D Gaussian Splatting (3DGS) representation to achieve efficient, 3D-aware image-goal navigation, significantly outperforming existing methods.
- Designs a coarse-to-fine goal localization strategy: coarse localization first uses geometric cues for discrete spatial matching, and precise localization is then solved via differentiable-rendering optimization, effectively taming the large search space of 6-DoF camera pose estimation (see the pose-refinement sketch below).
- IGL-Nav handles the more challenging free-view image-goal setting and can be deployed on a real robot platform, using goal images taken from arbitrary poses with a mobile phone to guide navigation.

Research Background
- The image-goal navigation task requires an agent in an unknown environment to navigate to the position and orientation specified by an image, which places high demands on the agent's ability to understand spatial information and to explore the scene based on past observations.
- Traditional methods either rely on end-to-end reinforcement learning or on modular policies that use topological graphs or bird's-eye-view maps as memory, but neither can fully model the geometric relationship between the explored 3D environment and the goal image.
- Although there have recently been approaches based on renderable neural radiance fields (e.g., RN ...
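The fine stage described above solves a 6-DoF pose by gradient-based optimization through differentiable rendering. The snippet below is a heavily simplified, self-contained sketch of the same idea: it recovers a 6-DoF pose with Adam against a point-alignment loss rather than a photometric 3DGS rendering loss, so the setup, parameterization, and numbers are illustrative assumptions and not IGL-Nav's implementation.

```python
import torch

def hat(k):
    """Skew-symmetric (cross-product) matrix of a 3-vector, built so autograd flows through it."""
    z = torch.zeros((), dtype=k.dtype)
    return torch.stack([
        torch.stack([z, -k[2], k[1]]),
        torch.stack([k[2], z, -k[0]]),
        torch.stack([-k[1], k[0], z]),
    ])

def axis_angle_to_matrix(w):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = torch.linalg.norm(w) + 1e-8
    K = hat(w / theta)
    eye = torch.eye(3, dtype=w.dtype)
    return eye + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)

# Toy "scene": 3D points and their observed locations under an unknown true pose.
torch.manual_seed(0)
pts = torch.randn(200, 3)
R_true = axis_angle_to_matrix(torch.tensor([0.10, -0.20, 0.05]))
t_true = torch.tensor([0.30, -0.10, 0.20])
obs = pts @ R_true.T + t_true

# Recover the pose by gradient descent on the alignment error.
w = torch.tensor([1e-3, 1e-3, 1e-3], requires_grad=True)   # axis-angle rotation
t = torch.zeros(3, requires_grad=True)                      # translation
opt = torch.optim.Adam([w, t], lr=1e-2)
for step in range(800):
    opt.zero_grad()
    R = axis_angle_to_matrix(w)
    loss = ((pts @ R.T + t - obs) ** 2).mean()
    loss.backward()
    opt.step()
print(f"final loss {loss.item():.2e}, recovered t {t.detach().numpy()}")
```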
When robots learn to "imitate" humans: how does RynnVLA-001 break through the scarcity of manipulation data?
具身智能之心· 2025-09-22 00:03
Core Insights
- The article discusses RynnVLA-001, a new VLA model from Alibaba DAMO Academy that addresses the scarcity of high-quality manipulation data for robots by leveraging human demonstration data [1][5][35].

Group 1: Model Overview
- RynnVLA-001 leverages 12 million egocentric human manipulation videos and a two-stage pre-training strategy to teach robots human operational logic and action trajectories [1][2][5].
- The model achieves an average success rate of 90.6% across tasks, significantly outperforming existing models such as GR00T N1.5 and Pi0, which have lower success rates [2][15].

Group 2: Methodology
- Training consists of three core stages: egocentric video-generation pre-training, human-centric trajectory-aware modeling, and robot-centric visual-language-action modeling [7][10][11].
- ActionVAE optimizes action representation by compressing action sequences into compact latent embeddings, enhancing the model's ability to predict smooth and coherent actions (a toy sketch of such a VAE follows this summary) [6][13][24].

Group 3: Experimental Results
- RynnVLA-001 demonstrates superior performance across multiple tasks, achieving success rates of 90.0% for picking and placing green blocks, 91.7% for strawberries, and 90.0% for pen placement [15][17].
- In cluttered scenes with distractors, RynnVLA-001 maintains a success rate of 91.7%, showcasing its robustness in instruction-following tasks [18][19].

Group 4: Pre-training Effectiveness
- Ablation studies validate the two-stage pre-training: models without video pre-training perform poorly, while those with it show significant improvements in task success rates [19][20][21].
- The model's ability to predict human trajectories effectively bridges the gap between visual prediction and action generation, leading to enhanced performance [21][22].

Group 5: Limitations and Future Directions
- Current testing is limited to the LeRobot SO100 robotic arm, indicating a need for broader applicability across different robotic platforms [41].
- Future work should focus on improving environmental generalization and exploring dynamic camera perspectives to enhance robustness [41].
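Group 2 credits ActionVAE with compressing chunks of future actions into compact latent embeddings that the policy then predicts. Below is a toy PyTorch sketch of such an action-chunk VAE; the layer sizes, chunk length, action dimension, and KL weight are guesses for illustration and not RynnVLA-001's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionVAE(nn.Module):
    """Toy VAE that compresses a chunk of future actions into one latent
    embedding and reconstructs it; all dimensions are illustrative guesses."""
    def __init__(self, chunk_len=16, action_dim=7, latent_dim=32, hidden=256):
        super().__init__()
        in_dim = chunk_len * action_dim
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, in_dim)
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def encode(self, actions):                    # actions: (B, T, A)
        h = self.enc(actions.flatten(1))
        return self.mu(h), self.logvar(h)

    def decode(self, z):                          # z: (B, latent_dim)
        out = self.dec(z)
        return out.view(-1, self.chunk_len, self.action_dim)

    def forward(self, actions):
        mu, logvar = self.encode(actions)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        recon = self.decode(z)
        recon_loss = F.mse_loss(recon, actions)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, recon_loss + 1e-3 * kl

# One toy training step on random action chunks.
vae = ActionVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
actions = torch.randn(8, 16, 7)
_, loss = vae(actions)
loss.backward()
opt.step()
print(float(loss))
```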
Zuckerberg has poached the top person on Musk's robot team
具身智能之心· 2025-09-22 00:03
Editor: 量子位

While Musk is busy with layoffs, Zuckerberg is still busy poaching talent.

Case in point: Ashish Kumar, head of the Optimus AI team, has decided to leave Tesla and join Meta as a research scientist.

On leaving, he said that leading the Optimus AI team was a wonderful and unforgettable experience: "We went all in on scalable approaches, replacing the traditional stack with reinforcement learning and improving the robot's dexterity through learning from video."

He further stressed that AI is the single most important factor in unlocking humanoid robots.

Meanwhile, Zuckerberg's reputation as a big-spending talent poacher is so well established that netizens quipped: got a billion dollars ready?

Optimus team leads keep leaving

So who is this head of the Optimus AI team? Ashish Kumar earned his PhD at UC Berkeley under Professor Jitendra Malik, whom Fei-Fei Li respectfully calls her "academic grandfather" and who is renowned for his computer vision research. Kumar received his bachelor's degree from IIT Jodhpur in 2015 and then spent two years as a researcher at Microsoft's lab in India, where his research focus was resource ...
PhysicalAgent: a foundation world-model framework toward general-purpose cognitive robots
具身智能之心· 2025-09-22 00:03
Core Viewpoint
- The article discusses PhysicalAgent, a robotic control framework designed to overcome key limitations in current robot manipulation, specifically the robustness and generalizability of vision-language-action (VLA) models and world-model-based methods [2][3].

Group 1: Key Bottlenecks and Solutions
- Current VLA models require task-specific fine-tuning, leading to a significant drop in robustness when switching robots or environments [2].
- World-model-based methods depend on specially trained predictive models, limiting their generalizability due to the need for carefully curated training data [2].
- PhysicalAgent integrates iterative reasoning, diffusion-based video generation, and closed-loop execution to achieve cross-embodiment, cross-task manipulation capabilities [2].

Group 2: Framework Design Principles
- The perception and reasoning modules remain independent of any specific robot embodiment; only lightweight skeletal-detection models are required per robot [3].
- Video generation models bring inherent advantages from pre-training on vast multimodal datasets, enabling quick integration without local training [5].
- The framework aligns with human-like reasoning, generating visual representations of actions from textual instructions alone [5].
- The architecture demonstrates cross-embodiment adaptability by generating manipulation videos for different robot forms without retraining [5].

Group 3: VLM as the Cognitive Core
- A vision-language model (VLM) serves as the cognitive core of the framework, driving a multi-step process of instruction interpretation, environment interaction, and execution [6].
- The key innovation is to recast action generation as conditional video synthesis rather than direct control-policy learning [6].
- The robot adaptation layer is the only part requiring robot-specific tuning, converting generated action videos into motor commands [6].

Group 4: Experimental Validation
- Two sets of experiments validate the framework's cross-embodiment generalization and the robustness of its iterative execution [8].
- The first set compares the framework against task-specific baselines and tests its generalization across robot embodiments [9].
- The second set assesses iterative execution on physical robots, demonstrating the effectiveness of the "Perceive→Plan→Reason→Act" pipeline (a control-flow sketch follows this summary) [12].

Group 5: Key Results
- The framework achieved an 80% final success rate across tasks for both the bimanual UR3 and humanoid G1 robots [13][16].
- First-attempt success rates were 30% for UR3 and 20% for G1, with an average of 2.25 and 2.75 iterations needed for success, respectively [16].
- Iterative correction significantly improved task completion, with the share of unfinished tasks dropping sharply after the first few iterations [16].
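To make the closed-loop behavior in Groups 4 and 5 concrete, here is a minimal control-flow sketch of an iterative Perceive, Plan, Reason, Act episode with retries. Every function is a stub standing in for the framework's real modules (VLM reasoning, diffusion video generation, the robot adaptation layer), so all names, signatures, and the stubbed success probability are illustrative assumptions.

```python
import random

# --- Stubs standing in for PhysicalAgent's real modules (assumed names). ---
def perceive():                       # camera images, proprioception, ...
    return {"scene": "tabletop with a mug"}

def plan_with_vlm(instruction, observation):
    return f"move gripper to the mug, close fingers ({observation['scene']})"

def generate_action_video(plan):      # conditional video synthesis of the motion
    return {"frames": 16, "plan": plan}

def video_to_motor_commands(video):   # robot adaptation layer (e.g. skeletal tracking)
    return [f"cmd_{i}" for i in range(video["frames"])]

def execute(commands):                # send commands to the robot
    pass

def task_succeeded():                 # verification step in the real system
    return random.random() < 0.4

def run_episode(instruction, max_iters=4):
    """Closed-loop Perceive -> Plan -> Reason -> Act with iterative retries."""
    for attempt in range(1, max_iters + 1):
        obs = perceive()
        plan = plan_with_vlm(instruction, obs)
        video = generate_action_video(plan)
        execute(video_to_motor_commands(video))
        if task_succeeded():
            return attempt             # number of iterations needed
    return None                        # unfinished after max_iters

random.seed(0)
print(run_episode("pick up the mug"))
```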