OnePoseViaGen
Why Can VLA Fold Towels yet Fail to Estimate Object Poses?
自动驾驶之心· 2025-09-24 23:33
Core Viewpoint
- The article discusses the breakthrough solution OnePoseViaGen, which addresses the challenges of 6D object pose estimation in robotics, enabling robots to interact effectively with unknown objects without relying on pre-existing 3D models [3][4]

Group 1: Challenges in Current Robotics
- Existing models like VLA can perform tasks that do not require precise spatial positioning but struggle with tasks that require 6D pose support, such as grasping unfamiliar objects [3]
- The inability to establish a closed-loop connection between generated models, real objects, and spatial poses is a fundamental limitation in current robotic interactions with the physical world [3][4]

Group 2: OnePoseViaGen Framework
- OnePoseViaGen offers a revolutionary approach that estimates the 6D pose of unknown objects from only a single reference image, without needing pre-built 3D models [3][4]
- The framework follows a logical progression: first addressing the absence of 3D models, then calibrating real-world scale and pose, and finally reducing domain gaps to enhance robustness [6][8]

Group 3: Key Research Outcomes
- The framework's first step generates a textured 3D model from a single RGB-D anchor image, ensuring geometric consistency through normal-vector estimation [9][10]
- A two-step alignment strategy refines the scale and pose, starting with a rough alignment followed by a precise optimization process (see the refinement sketch after this summary) [11][13][14]
- The final step incorporates text-guided generative domain randomization to strengthen the model's robustness against variations in texture, lighting, and occlusion [15][16]

Group 4: Performance Validation
- OnePoseViaGen outperforms existing methods on various benchmarks, achieving an average ADD of 81.27% and ADD-S of 93.10%, significantly higher than competitors like Oryon and Any6D [17][18]
- In highly challenging scenarios, such as grasping tasks, OnePoseViaGen maintains high accuracy where other methods fail, demonstrating its effectiveness in real-world applications [17][18]

Group 5: Real-World Application
- The framework was tested in real-world robotic tasks, achieving a success rate of 73.3% in grasping and placement tasks, far exceeding baseline methods [22][24]
- Qualitative results show that the generated models closely match real object textures and structures, allowing precise pose estimation even under occlusion [26]

Group 6: Ablation Studies
- Ablation studies confirm the necessity of the coarse-to-fine alignment and generative domain randomization modules, highlighting their critical roles in the method's robustness [27][29]

Group 7: Conclusion
- OnePoseViaGen is the first pipeline to integrate single-image 3D generation with pose estimation, showing that generative modeling can directly improve pose estimation performance without relying on pre-existing 3D models or multi-view inputs [30]
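The two-step alignment named in Group 3 is easiest to grasp in code. Below is a minimal NumPy sketch of the refinement half: a plain point-to-point ICP loop that polishes an initial rotation R and translation t, mapping the generated model's points onto an observed point cloud. This is purely illustrative; the function name icp_refine and its inputs are assumptions, and the paper's actual "precise optimization" step is not claimed to be ICP.

```python
import numpy as np

def icp_refine(src, dst, R, t, iters=30):
    """Point-to-point ICP: refine an initial pose (R, t) that maps the
    model cloud `src` onto the observed cloud `dst`.
    Brute-force nearest neighbours, fine for small illustrative clouds."""
    for _ in range(iters):
        moved = src @ R.T + t
        # Nearest observed point for every transformed model point.
        d2 = ((moved[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        nn = dst[d2.argmin(axis=1)]
        # Kabsch update: best rigid motion taking `moved` onto `nn`.
        mu_m, mu_n = moved.mean(0), nn.mean(0)
        H = (moved - mu_m).T @ (nn - mu_n)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        dR = Vt.T @ D @ U.T
        # Compose the increment with the running estimate.
        R = dR @ R
        t = dR @ (t - mu_m) + mu_n
    return R, t

# Toy check: recover a small known rotation plus a translation,
# starting from the identity (stands in for the coarse-alignment output).
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 3))
th = 0.2
R_true = np.array([[np.cos(th), -np.sin(th), 0.0],
                   [np.sin(th),  np.cos(th), 0.0],
                   [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([0.1, -0.05, 0.2])
R_est, t_est = icp_refine(src, dst, np.eye(3), np.zeros(3))
```

In a real pipeline the coarse stage would supply the initial R, t (and scale), and the observed cloud would come from the anchor depth image; the toy example above only exercises the loop.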
Why Can VLA Fold Towels yet Fail to Estimate Object Poses? Decoding How Embodied "Spatial Perception" Is Completed
具身智能之心· 2025-09-23 00:03
Core Viewpoint
- The article discusses the innovative OnePoseViaGen framework, which addresses the challenges of 6D object pose estimation in robotics, enabling robots to accurately perceive and interact with unknown objects from a single reference image, without pre-existing 3D models [2][3][31]

Summary by Sections

Introduction to the Problem
- Current robotic systems can perform simple tasks like folding towels but struggle with complex interactions requiring precise spatial awareness, such as grasping unfamiliar objects [1][2]
- The inability to establish a closed-loop connection between generated models, real objects, and spatial poses is a significant barrier to effective robotic interaction with the physical world [2]

OnePoseViaGen Framework
- OnePoseViaGen offers a revolutionary solution that estimates the 6D pose of unknown objects from only a single reference image, combining single-view 3D generation, coarse-to-fine alignment, and text-guided domain randomization [2][5]
- The framework follows a logical progression: addressing the absence of 3D models, calibrating real-world scale and pose, and enhancing robustness through domain adaptation [5][7]

Key Research Achievements
- The framework begins by generating a textured 3D model from a single RGB-D anchor image, ensuring geometric consistency through normal-vector estimation [8][9]
- A two-step alignment strategy refines the scale and pose, starting with a coarse alignment followed by a precise optimization process [10][12][13]
- Text-guided domain randomization creates diverse 3D model variants, making pose estimation robust to variations in lighting and occlusion [14][15]

Performance Validation
- OnePoseViaGen outperforms existing methods on benchmark datasets, achieving an average ADD of 81.27% and ADD-S of 93.10%, significantly higher than competitors like Oryon and Any6D (see the metric sketch after this summary) [16][17]
- In challenging scenarios, such as high-occlusion environments, OnePoseViaGen maintains high accuracy, demonstrating its effectiveness in real-world applications [20][22]

Real-World Application
- The framework was tested in real robotic operations, achieving a success rate of 73.3% in single-arm and dual-arm manipulation tasks, far exceeding baseline methods [23][24][25]
- Qualitative results show that the generated 3D models closely match real object textures and structures, allowing precise pose estimation even under occlusion [27]

Ablation Studies
- Ablation experiments confirm the necessity of the coarse-to-fine alignment and the importance of domain randomization for the framework's robustness [28][30]

Conclusion
- OnePoseViaGen represents a significant advance in robotic perception, enabling accurate pose estimation and interaction with unknown objects without extensive 3D model libraries or multi-view inputs, paving the way for robots to operate in open-world environments [31]
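ADD and ADD-S, quoted in the Performance Validation section, are the standard 6D-pose accuracy metrics; the percentages reported on benchmarks are typically recall rates, i.e. the fraction of frames whose error falls under a threshold, commonly 10% of the object diameter. A minimal NumPy sketch of how they are computed (function names are illustrative):

```python
import numpy as np

def transform(pts, R, t):
    """Apply a 6D pose (rotation R, translation t) to model points."""
    return pts @ R.T + t

def add(model_pts, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between corresponding model points under the
    ground-truth and estimated poses."""
    gt = transform(model_pts, R_gt, t_gt)
    est = transform(model_pts, R_est, t_est)
    return np.linalg.norm(gt - est, axis=1).mean()

def add_s(model_pts, R_gt, t_gt, R_est, t_est):
    """ADD-S: symmetric variant. For each ground-truth point, take the
    distance to the closest estimated point (handles symmetric objects,
    where many orientations are visually indistinguishable)."""
    gt = transform(model_pts, R_gt, t_gt)
    est = transform(model_pts, R_est, t_est)
    d2 = ((gt[:, None, :] - est[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2.min(axis=1)).mean()

def add_recall(add_values, diameter, frac=0.1):
    """Benchmark-style score: fraction of frames whose ADD (or ADD-S)
    is below a threshold, commonly 10% of the object diameter."""
    return float(np.mean(np.asarray(add_values) < frac * diameter))
```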
Why Can VLA Fold Towels yet Fail to Estimate Object Poses? How Does Embodied Intelligence Complete Its "Spatial Perception"?
具身智能之心· 2025-09-22 09:00
Author: Zheng Geng et al. Editor: 具身智能之心. This article is shared for academic purposes only; contact us for removal in case of infringement.

Picture this contrast: a VLA model can fluently carry out geometry-level manipulation such as folding towels or tidying clothes, yet it keeps failing at tasks like "pick up an unfamiliar seasoning bottle with a robot arm" or "localize the 3D pose of an unknown part": it either grasps thin air or knocks the object over. Behind this lies a key bottleneck for deploying embodied intelligence: 6D object pose estimation.

Anyone who has worked with robotic manipulation knows that precise-interaction tasks such as "grasp a part" or "place a seasoning bottle" hinge on spatial perception: you need the object's 3D position (translation) and orientation (rotation), and the estimated scale must be consistent with the real world. Yet existing methods always compromise: they either depend on pre-scanned CAD models (which simply do not exist for most real objects) or require multi-view images (impractical to capture in real-time scenarios); even single-view reconstruction falls into the scale-ambiguity trap of not knowing the object's true size (a minimal sketch below shows how depth resolves this).

This creates a stark capability gap: VLA can rely on visual planning to complete tasks like folding towels that do not depend on precise spatial ...
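The scale-ambiguity trap described above is precisely what an RGB-D anchor image sidesteps: a depth map back-projects into metric 3D points, which pin down the generated model's true size. The sketch below is a naive illustration of that idea, not OnePoseViaGen's actual calibration step; the function names and the bounding-box heuristic are assumptions.

```python
import numpy as np

def backproject(depth, K):
    """Back-project a metric depth map (H, W) into a camera-frame point
    cloud using pinhole intrinsics K (3x3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop invalid (zero-depth) pixels

def coarse_scale(model_pts, observed_pts):
    """Crude metric-scale initialisation: ratio of bounding-box diagonals
    between the observed partial cloud and the generated model.
    A single RGB image cannot provide this; a depth map can."""
    model_diag = np.linalg.norm(model_pts.max(0) - model_pts.min(0))
    obs_diag = np.linalg.norm(observed_pts.max(0) - observed_pts.min(0))
    return obs_diag / model_diag
```

With a scale estimated this way, the generated model can be resized into metric units before the coarse-to-fine pose alignment runs; any serious system would refine this initial guess jointly with the pose.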