具身智能之心 - filings, earnings calls, financial reports, news - Reportify

具身智能之心

具身智能之心 now has nearly 20 discussion groups! Come join us

具身智能之心· 2025-09-23 04:00

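Join our technical discussion groups to talk technology and industry with everyone. Add our assistant on WeChat at AIDriver005 with the note: join group + nickname + research direction. The 具身智能之心 technical discussion groups are now set up, covering nearly 20 sub-directions, and we invite you to join us in taking on the role of future leaders of the embodied AI field. If you work on robot bodies such as humanoids, quadrupeds, or robotic arms, and are active in VLA, large models, VLN, reinforcement learning, mobile manipulation, multimodal perception, simulation, data collection, or similar directions. ...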

Multimodal perception

Why can VLA fold towels but not measure object pose accurately? A look at how embodied "spatial perception" is being filled in

具身智能之心· 2025-09-23 00:03

Core Viewpoint
- The article discusses the OnePoseViaGen framework, which addresses the challenges of 6D object pose estimation in robotics, enabling robots to accurately perceive and interact with unknown objects using a single reference image, without pre-existing 3D models [2][3][31].

Summary by Sections

Introduction to the Problem
- Current robotic systems can perform simple tasks like folding towels but struggle with complex interactions requiring precise spatial awareness, such as grasping unfamiliar objects [1][2].
- The inability to establish a closed-loop connection between generated models, real objects, and spatial poses is a significant barrier to effective robotic interaction with the physical world [2].

OnePoseViaGen Framework
- OnePoseViaGen estimates the 6D pose of unknown objects using only a single reference image, combining single-view 3D generation, coarse-to-fine alignment, and text-guided domain randomization [2][5].
- The framework follows a logical progression: addressing the absence of 3D models, calibrating real-world scales and poses, and enhancing robustness through domain adaptation [5][7].

Key Research Achievements
- The framework begins by generating a textured 3D model from a single RGB-D anchor image, ensuring geometric consistency through normal vector estimation [8][9].
- A two-step alignment strategy refines the scale and pose, starting with a coarse alignment followed by a precise optimization process [10][12][13].
- Text-guided domain randomization creates diverse 3D model variants, enhancing the robustness of pose estimation against variations in lighting and occlusion [14][15].

Performance Validation
- OnePoseViaGen outperforms existing methods on benchmark datasets, achieving an average ADD of 81.27% and ADD-S of 93.10%, significantly higher than competitors like Oryon and Any6D [16][17].
- In challenging scenarios, such as high-occlusion environments, OnePoseViaGen maintains high accuracy, demonstrating its effectiveness in real-world applications [20][22].

Real-World Application
- The framework was tested in real robotic operations, achieving a success rate of 73.3% in tasks involving single-arm and dual-arm object manipulation, far exceeding baseline methods [23][24][25].
- The qualitative results show that the generated 3D models closely match real object textures and structures, allowing for precise pose estimation even in the presence of occlusions [27].

Ablation Studies
- Ablation experiments confirm the necessity of the coarse-to-fine alignment and the importance of domain randomization in enhancing the robustness of the framework [28][30].

Conclusion
- OnePoseViaGen represents a significant advancement in robotic perception, enabling accurate pose estimation and interaction with unknown objects without relying on extensive 3D model libraries or multi-view inputs, thus paving the way for robots to operate in open-world environments [31].
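The ADD and ADD-S figures quoted in the summary are built on two standard pose-error distances. The sketch below shows how those distances are commonly computed for a set of model points. It is not taken from the article: the function names are illustrative, and the summary does not say which evaluation protocol produced its percentages, so the pass/fail threshold of 10% of the object diameter in `is_correct` is only a widely used convention assumed here.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform (R: 3x3 rotation, t: 3-vector) to Nx3 points."""
    return np.asarray(points, dtype=float) @ np.asarray(R, dtype=float).T + np.asarray(t, dtype=float)

def add_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding model points under the two poses."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    return float(np.linalg.norm(gt - pred, axis=1).mean())

def add_s_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean distance from each ground-truth point to its nearest predicted point.

    Because it ignores point correspondence, it does not penalize poses that
    differ only by a symmetry of the object.
    """
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    # Brute-force N x N pairwise distances; adequate for small point sets.
    pairwise = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return float(pairwise.min(axis=1).mean())

def is_correct(error, diameter, ratio=0.1):
    """Assumed convention: a pose counts as correct if error < ratio * object diameter."""
    return error < ratio * diameter
```

For a symmetric object the two distances can disagree sharply: a 180-degree flip about a symmetry axis gives a large ADD but an ADD-S near zero, which is one reason ADD-S scores tend to be higher than ADD scores on the same predictions.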

6D object pose estimation

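Why the "four data problems" that embodied AI cannot avoid are so hard: data collection, data flywheels, data factories, and simulated synthetic data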

具身智能之心· 2025-09-23 00:03

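Tap the card below to follow the 具身智能之心 official account. >> Tap to enter → 具身智能之心 technical discussion group. Content is first published in 具身智能之心知识星球 (tap here), described as China's first full-stack learning community for embodied AI, which has everything you are looking for. At the recent Bund Conference (外滩大会), several well-known scholars and company leaders in the embodied field discussed the development of embodied algorithms, the collection and use of embodied data, simulation, and related areas. 具身智能之心 was fortunate to attend in person, so let's look at the highlights. This traditional view was gradually challenged during the 20th century. In 1943, Warren McCulloch proposed in his book The Embodiment of Mind (《思维的具身》) that the formation of the human mind is not an abstract process detached from the body, but is rooted in continuous physical interaction between the individual and the external environment. This view provided important inspiration for the later development of embodied cognition theory. In 1963, psychologist Richard Held further revealed the intrinsic link between perception and behavior through a series of experiments. He designed the "passive-movement cat" experiment: ten cats were divided into five pairs; in each pair, one cat could actively acquire visual information while walking freely, whereas the other was blindfolded and merely followed the first cat's movement through a mechanical apparatus. The results showed that only the cats with active perception were able, on encountering the edge of a step ...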

Embodied large models

MBZUAI robotics lab is recruiting fully funded PhD students, visiting graduate students, and more for Fall 2026

具身智能之心· 2025-09-23 00:03

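PI profile: Dr. Xingxing Zuo (左星星) is an Assistant Professor (Tenure-Track) in the Robotics Department at MBZUAI, where he leads the Robotics Cognition and Learning (RCL) Lab. He was previously a postdoctoral researcher in the Department of Computing and Mathematical Sciences at the California Institute of Technology (Caltech) and in the Department of Computer Science at the Technical University of Munich (TUM), and worked full-time at Google as a Visiting Faculty Researcher. His main research interests are robotics, multimodal SLAM, 3D scene understanding, embodied AI, and 3D computer vision. He has published more than forty papers in leading robotics and AI venues including T-RO, IJCV, JFR, RA-L, ICRA, IROS, and CVPR. He has been invited to serve as Associate Editor for the well-known robotics journal RA-L (2022 to present) and for the two flagship robotics conferences IROS (2022-2025) and ICRA (2023-2026). His long-term research goal is to enable natural interaction and seamless collaboration between robots and humans in open environments through accurate understanding of the robot's state, the surrounding 3D environment, and action execution. Recruitment areas: Robotics, 3D Computer ...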

3D computer vision

Why can VLA fold towels but not measure object pose accurately? How is embodied AI's "spatial perception" filled in?

具身智能之心· 2025-09-22 09:00

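Authors: Zheng Geng et al. Editor: 具身智能之心. This article is shared for academic purposes only; contact us to have it removed in case of infringement. Picture this contrast: a VLA model can smoothly complete geometry-type manipulation such as folding towels and tidying clothes, yet when faced with tasks like "pick up an unfamiliar seasoning bottle with a robotic arm" or "determine the 3D pose of an unknown part", it fails again and again, either grasping at air or knocking the object over. Behind this lies a key bottleneck for deploying embodied AI: 6D object pose estimation. Anyone who has worked with robot manipulation knows that tasks requiring precise interaction, such as "grasp the part" or "place the seasoning bottle", come down to spatial perception: you need to know the object's 3D position (translation) and orientation (rotation), and you need the estimated scale to match the real world. Existing methods, however, always compromise: they either rely on pre-scanned CAD models (which are rarely available in practice) or need multi-view images (for which there is no time in real-time scenes), and even single-view reconstruction runs into scale ambiguity, not knowing the object's true size. The result is a stark capability gap: VLA can rely on visual planning to complete tasks like "folding towels" that do not depend on precise ...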

6D object pose estimation

Embodied AI: take a job or pursue a PhD?

具身智能之心· 2025-09-22 04:00

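Embodied AI: take a job or pursue a PhD? Recently I was chatting with a third-year master's student in our community, who asked us: in the embodied field, is it better to pursue a PhD for further study right now, or to join the startup wave and ride the "market"? I won't take a side yet, since this question always has different answers, but I am very interested in two issues it raises. First, does your lab, or do you yourself, already have some foundation in embodied AI, more specifically in robotics? Many advisors who switched fields midway opened embodied-AI directions to win grants, but are nowhere near able to train their own students. This leads to a problem: people think they understand embodied AI and are ready for the job, when in fact they are not. Students who truly grew up in the first wave of embodied exploration, by contrast, are familiar with hardware, data, and algorithms, and their labs have multiple robot platforms to support research. The former group does not even have the relevant hardware and is still experimenting with simulation environments and open-source datasets. One can imagine: if they took a job at a company, would they really be qualified? Second, if you want to do a PhD, are you really suited to the role of "pioneer"? This is crucial, especially for embodied AI, a field with many unsolved problems. We have met many students: some are suited to extending others' research, optimizing from 1 to 10; others are suited to pioneering, going from 0 to 1. The latter places heavy demands on thinking and problem-solving ability. If some key problems have no references to draw on, can you explore independently and endure the feeling of trying and failing over and over ...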

具身智能之心知识星球

When robots learn to "imitate" humans: how does RynnVLA-001 overcome the scarcity of manipulation data?

具身智能之心· 2025-09-22 00:03

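Authors: Yuming Jiang et al. Editor: 具身智能之心. While large language models and multimodal models advance rapidly, robot manipulation remains held back by one key problem: the scarcity of large-scale, high-quality manipulation data. Traditional robot data collection relies on humans teleoperating physical devices to record trajectories, which is labor-intensive, time-consuming, and expensive, and this directly constrains progress on vision-language-action (VLA) models. To break the deadlock, a team from Alibaba DAMO Academy proposed a new VLA model, RynnVLA-001. The model takes a different route and turns to human demonstration data: using 12 million egocentric videos of human manipulation together with a two-stage pretraining strategy, it lets the robot "learn" human manipulation logic and motion trajectories. From predicting the visual dynamics of future manipulation frames, to linking human keypoint trajectories to build an action mapping, to introducing ActionVAE to improve the coherence of robot actions, RynnVLA-001 bridges "human demonstration" and "robot manipulation". Experiments show that on the LeRobot SO100 robotic arm, RynnVLA-0 ...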

Vision-language-action (VLA) models

LeRobot SO100 robotic arm

IGL-Nav: Image-goal navigation based on incremental 3D Gaussian localization (ICCV'25)

具身智能之心· 2025-09-22 00:03

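Authors: Wenxuan Guo et al. Editor: 视觉语言导航. Research background: The image-goal navigation task requires an agent to navigate in an unknown environment to the position and orientation specified by an image, which places high demands on the agent's ability to understand spatial information and to explore the scene based on past observations. Main contributions: The work proposes the IGL-Nav framework, which incrementally updates a 3D Gaussian representation (3DGS) to achieve efficient 3D-aware image-goal navigation and significantly outperforms existing methods. It designs a coarse-to-fine goal localization strategy that first uses geometric information for discrete-space matching to obtain a coarse localization, then solves for a precise localization through differentiable rendering optimization, effectively handling the complex search space of 6-degree-of-freedom camera pose estimation. IGL-Nav can handle the more challenging free-view image-goal setting and can be deployed on a real robot platform, using a goal image taken with a phone at an arbitrary pose to guide the robot. Traditional methods either rely on end-to-end reinforcement learning or use modular policies with topological maps or bird's-eye-view maps as memory, but none of them adequately models the geometric relationship between the explored 3D environment and the goal image. Recent work based on renderable neural radiance maps (such as RN ...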

Free-view image-goal navigation

3D Gaussian representation (3DGS)

Zuckerberg has poached Musk's top robotics lead

具身智能之心· 2025-09-22 00:03

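Editor: 量子位. Musk is busy laying people off, while Zuckerberg keeps busy poaching. Case in point: Ashish Kumar, head of the Optimus AI team, has decided to leave Tesla and join Meta as a research scientist. In his farewell remarks he said that leading the Optimus AI team was a remarkable and unforgettable experience: "We went all in on scalable methods, replacing the traditional stack with reinforcement learning and using learning from video to improve the robot's dexterity." He further stressed that artificial intelligence is the most critical factor in unlocking humanoid robots. Meanwhile, Zuckerberg's image as someone who throws money at hires is so entrenched that netizens could not help quipping: was it a billion dollars? Optimus team leads are leaving one after another. So who exactly is this Optimus AI team lead? Ashish Kumar holds a PhD from UC Berkeley, where his advisor was Professor Jitendra Malik, whom Fei-Fei Li respectfully calls her "academic grandfather" and who is known for his research in computer vision. In 2015, Ashish received his bachelor's degree from the Indian Institute of Technology Jodhpur, then spent two years as a researcher at Microsoft's lab in India, with a research focus on resource ...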

Optimus

PhysicalAgent: A foundation world model framework toward general cognitive robotics

具身智能之心· 2025-09-22 00:03

Core Viewpoint
- The article discusses the development of PhysicalAgent, a robotic control framework designed to overcome key limitations in current robot manipulation, specifically the limited robustness and generalizability of vision-language-action (VLA) models and world-model-based methods [2][3].

Group 1: Key Bottlenecks and Solutions
- Current VLA models require task-specific fine-tuning, leading to a significant drop in robustness when switching robots or environments [2].
- World-model-based methods depend on specially trained predictive models, limiting their generalizability due to the need for carefully curated training data [2].
- PhysicalAgent aims to integrate iterative reasoning, diffusion video generation, and closed-loop execution to achieve cross-modal and cross-task general manipulation capabilities [2].

Group 2: Framework Design Principles
- The framework's design allows perception and reasoning modules to remain independent of specific robot forms, requiring only lightweight skeletal detection models for different robots [3].
- Video generation models have inherent advantages due to pre-training on vast multimodal datasets, enabling quick integration without local training [5].
- The framework aligns with human-like reasoning, generating visual representations of actions based solely on textual instructions [5].
- The architecture demonstrates cross-modal adaptability by generating different manipulation tasks for various robot forms without retraining [5].

Group 3: VLM as the Cognitive Core
- A vision-language model (VLM) serves as the cognitive core of the framework, facilitating a multi-step process of instruction, environment interaction, and execution [6].
- The approach redefines action generation as conditional video synthesis rather than direct control strategy learning [6].
- The robot adaptation layer is the only part requiring robot-specific tuning, converting generated action videos into motor commands [6].

Group 4: Experimental Validation
- Two types of experiments were conducted to validate the framework's cross-modal generalization and iterative execution robustness [8].
- The first experiment focused on verifying the framework's performance against task-specific baselines and its ability to generalize across different robot forms [9].
- The second experiment assessed the iterative execution capabilities of physical robots, demonstrating the effectiveness of the "Perceive→Plan→Reason→Act" pipeline [12].

Group 5: Key Results
- The framework achieved an 80% final success rate across various tasks for both the bimanual UR3 and humanoid G1 robots [13][16].
- The first-attempt success rates were 30% for UR3 and 20% for G1, with the average number of iterations required for success being 2.25 and 2.75, respectively [16].
- The iterative correction process significantly improved task completion rates, with a sharp decline in the proportion of unfinished tasks after the first few iterations [16].
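The gap between the first-attempt success rates and the 80% final success rate comes from retrying inside a closed loop. The following is a minimal sketch of such a loop, not the authors' implementation: `perceive`, `plan`, `reason`, and `act` are hypothetical callables named after the pipeline stages, and what each stage actually does in PhysicalAgent (including whether the reasoning stage can veto a plan) is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

@dataclass
class Outcome:
    success: bool
    attempts: int  # number of loop iterations used

def run_closed_loop(
    perceive: Callable[[], Any],
    plan: Callable[[Any, Optional[str]], Any],
    reason: Callable[[Any, Any], Tuple[bool, Optional[str]]],
    act: Callable[[Any], bool],
    max_iters: int = 5,
) -> Outcome:
    """Repeat perceive -> plan -> reason -> act until the task succeeds or the budget runs out."""
    feedback: Optional[str] = None
    for attempt in range(1, max_iters + 1):
        observation = perceive()                 # fresh view of the scene each iteration
        candidate = plan(observation, feedback)  # hypothetical: propose an action given the last critique
        approved, feedback = reason(observation, candidate)
        if not approved:
            continue                             # replan using the critique, without moving the robot
        if act(candidate):                       # hypothetical: returns True when the task is completed
            return Outcome(True, attempt)
        feedback = "execution failed"            # carried into the next planning step
    return Outcome(False, max_iters)
```

With stub stages, a task that only succeeds on the third execution returns `Outcome(success=True, attempts=3)`, which is the kind of per-task iteration count that the reported averages of 2.25 and 2.75 summarize.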

General cognitive robotics

Foundation world model framework

Vision-language models (VLM)