Core Viewpoint
- The article presents the OnePoseViaGen framework, which tackles 6D object pose estimation in robotics: it enables robots to accurately perceive and interact with unknown objects from a single reference image, without needing a pre-existing 3D model [2][3][31].

Summary by Sections

Introduction to the Problem
- Current robotic systems can perform simple tasks such as folding towels, but struggle with interactions that demand precise spatial awareness, such as grasping unfamiliar objects [1][2].
- The inability to close the loop between generated models, real objects, and spatial poses remains a major barrier to effective robotic interaction with the physical world [2].

OnePoseViaGen Framework
- OnePoseViaGen estimates the 6D pose of an unknown object from only a single reference image, combining single-view 3D generation, coarse-to-fine alignment, and text-guided domain randomization [2][5].
- The framework follows a logical progression: generate the missing 3D model, calibrate real-world scale and pose, then strengthen robustness through domain adaptation [5][7].

Key Research Achievements
- The pipeline first generates a textured 3D model from a single RGB-D anchor image, using normal-vector estimation to keep the geometry consistent [8][9].
- A two-step alignment strategy refines scale and pose: a coarse alignment followed by a fine optimization [10][12][13].
- Text-guided domain randomization produces diverse variants of the 3D model, making pose estimation robust to changes in lighting and occlusion [14][15].

Performance Validation
- OnePoseViaGen outperforms existing methods on benchmark datasets, reaching an average ADD of 81.27% and ADD-S of 93.10%, well above competitors such as Oryon and Any6D [16][17].
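ADD and ADD-S are the standard 6D-pose accuracy metrics behind the numbers above: ADD averages the distance between model points under the ground-truth and predicted poses, while ADD-S takes the closest-point distance to tolerate symmetric objects; the reported percentages are the fraction of test poses whose metric falls under a threshold (commonly 10% of the object diameter). A minimal numpy sketch of the two metrics:

```python
import numpy as np

def add_metric(pts, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between model points transformed by the
    ground-truth pose and by the predicted pose."""
    gt = pts @ R_gt.T + t_gt
    pred = pts @ R_pred.T + t_pred
    return np.mean(np.linalg.norm(gt - pred, axis=1))

def adds_metric(pts, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: symmetric variant; each ground-truth point is matched to
    the closest predicted point, so symmetric objects are not penalized
    for pose ambiguity."""
    gt = pts @ R_gt.T + t_gt
    pred = pts @ R_pred.T + t_pred
    # pairwise distances (N, N), then min over predicted points
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return np.mean(d.min(axis=1))
```

By construction ADD-S is never larger than ADD for the same pose pair, which is why the paper's ADD-S figure (93.10%) exceeds its ADD figure (81.27%).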
- In challenging scenarios such as heavy occlusion, OnePoseViaGen maintains high accuracy, demonstrating its effectiveness in real-world conditions [20][22].

Real-World Application
- Tested in real robotic operations, the framework achieved a 73.3% success rate on single-arm and dual-arm object-manipulation tasks, far exceeding baseline methods [23][24][25].
- Qualitative results show that the generated 3D models closely match real object textures and structures, enabling precise pose estimation even under occlusion [27].

Ablation Studies
- Ablation experiments confirm that both the coarse-to-fine alignment and the domain randomization are necessary for the framework's robustness [28][30].

Conclusion
- OnePoseViaGen marks a significant advance in robotic perception: accurate pose estimation and interaction with unknown objects without large 3D model libraries or multi-view input, paving the way for robots to operate in open-world environments [31].
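The summary mentions a coarse-to-fine alignment between the generated model and the observed object but gives no algorithmic detail. As an illustration only, a generic two-stage scheme (a coarse scale-and-translation estimate from point statistics, then a few point-to-point ICP iterations with a Kabsch rotation solve) can be sketched in numpy; the function names and procedure are assumptions, not OnePoseViaGen's actual method:

```python
import numpy as np

def coarse_align(model_pts, obs_pts):
    """Coarse stage (assumed): match centroids and RMS extents of the
    generated model and the observed depth points to get a global scale
    and translation."""
    mc, oc = model_pts.mean(0), obs_pts.mean(0)
    scale = np.sqrt(((obs_pts - oc) ** 2).sum(1).mean()
                    / ((model_pts - mc) ** 2).sum(1).mean())
    return scale, oc - scale * mc

def refine_icp(src, obs_pts, iters=15):
    """Fine stage (assumed): point-to-point ICP. Each iteration matches
    every source point to its nearest observed point, then solves for
    the optimal rigid update with the Kabsch algorithm."""
    R_total, t_total = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        # nearest observed point for every current source point
        d = np.linalg.norm(cur[:, None] - obs_pts[None], axis=2)
        tgt = obs_pts[d.argmin(1)]
        # Kabsch: optimal rotation between centered correspondences
        sc, tc = cur.mean(0), tgt.mean(0)
        U, _, Vt = np.linalg.svd((cur - sc).T @ (tgt - tc))
        dR = Vt.T @ U.T
        if np.linalg.det(dR) < 0:      # fix an improper reflection
            Vt[-1] *= -1
            dR = Vt.T @ U.T
        dt = tc - dR @ sc
        # apply the incremental update and accumulate the total pose
        cur = cur @ dR.T + dt
        R_total = dR @ R_total
        t_total = dR @ t_total + dt
    return R_total, t_total
```

In practice `coarse_align` would run on the generated mesh before `refine_icp`, mirroring the paper's coarse-then-fine order; real systems typically add outlier rejection and a k-d tree for the nearest-neighbor step.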
Why can a VLA fold towels yet fail to measure object poses accurately? Decoding how embodied "spatial perception" gets filled in
具身智能之心·2025-09-23 00:03