Why can VLA fold towels, yet fail to estimate object poses?
自动驾驶之心·2025-09-24 23:33

Core Viewpoint
- The article presents OnePoseViaGen, a framework that tackles 6D object pose estimation in robotics, enabling robots to interact with unknown objects without relying on pre-existing 3D models [3][4].

Group 1: Challenges in Current Robotics
- Existing VLA models can perform tasks that do not demand precise spatial positioning, but struggle with tasks that require 6D pose support, such as grasping unfamiliar objects [3].
- The inability to close the loop between generated models, real objects, and spatial poses is a fundamental limitation in how current robots interact with the physical world [3][4].

Group 2: OnePoseViaGen Framework
- OnePoseViaGen estimates the 6D pose of an unknown object from a single reference image, with no pre-built 3D model required [3][4].
- The framework proceeds in three stages: first generating the missing 3D model, then calibrating real-world scale and pose, and finally narrowing the domain gap to improve robustness [6][8].

Group 3: Key Research Outcomes
- The first step generates a textured 3D model from a single RGB-D anchor image, using normal estimation to keep the geometry consistent [9][10].
- A two-step, coarse-to-fine alignment strategy refines scale and pose: a rough alignment followed by precise optimization [11][13][14].
- The final step applies text-guided generative domain randomization to harden the model against variations in texture, lighting, and occlusion [15][16].

Group 4: Performance Validation
- OnePoseViaGen outperforms existing methods across benchmarks, reaching an average ADD of 81.27% and ADD-S of 93.10%, well above competitors such as Oryon and Any6D [17][18].
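The ADD and ADD-S figures above are the standard 6D-pose metrics: ADD averages the distance between corresponding model points under the ground-truth and predicted poses, while ADD-S (for symmetric objects) uses the distance to the nearest transformed point instead. A minimal NumPy sketch of both (function names and the brute-force nearest-neighbour search are our own illustration, not code from the paper):

```python
import numpy as np

def add_metric(model_pts, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean L2 distance between corresponding model points
    transformed by the ground-truth and predicted poses."""
    pts_gt = model_pts @ R_gt.T + t_gt
    pts_pred = model_pts @ R_pred.T + t_pred
    return np.linalg.norm(pts_gt - pts_pred, axis=1).mean()

def add_s_metric(model_pts, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: for each ground-truth point, distance to the *closest*
    predicted point (brute-force O(N^2); fine for small models)."""
    pts_gt = model_pts @ R_gt.T + t_gt
    pts_pred = model_pts @ R_pred.T + t_pred
    d = np.linalg.norm(pts_gt[:, None, :] - pts_pred[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

Benchmarks typically report the fraction of test poses whose ADD (or ADD-S) falls below a threshold, e.g. 10% of the object diameter, which is how percentage scores like 81.27% arise.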
- In high-challenge scenarios such as grasping, OnePoseViaGen maintains high accuracy where other methods fail, demonstrating its effectiveness in real-world applications [17][18].

Group 5: Real-World Application
- In real robotic grasp-and-place tasks, the framework achieved a 73.3% success rate, far exceeding baseline methods [22][24].
- Qualitative results show that the generated models closely match real object textures and structures, enabling precise pose estimation even under occlusion [26].

Group 6: Ablation Studies
- Ablation studies confirm that both the coarse-to-fine alignment module and the generative domain randomization module are necessary, each contributing materially to the method's robustness [27][29].

Group 7: Conclusion
- OnePoseViaGen is the first pipeline to couple single-image 3D generation with pose estimation, showing that generative modeling can directly improve pose estimation without pre-existing 3D models or multi-view inputs [30].
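The coarse-to-fine alignment that the ablations validate can be illustrated with a toy two-step sketch: a coarse global scale estimated from point-cloud spreads, followed by a closed-form rigid fit (Kabsch/SVD) standing in for the paper's precise optimization. All names here are hypothetical; the actual OnePoseViaGen refiner also iterates correspondences and renders, which this sketch omits:

```python
import numpy as np

def coarse_scale(src_pts, dst_pts):
    # Coarse step: global scale from the ratio of RMS spreads
    # of the generated model (src) and the observed cloud (dst).
    return np.sqrt(((dst_pts - dst_pts.mean(0)) ** 2).sum()
                   / ((src_pts - src_pts.mean(0)) ** 2).sum())

def fine_align(src_pts, dst_pts):
    # Fine step: closed-form rigid alignment (Kabsch/SVD) given
    # point correspondences; an ICP-style loop would re-estimate
    # correspondences and repeat.
    mu_s, mu_d = src_pts.mean(0), dst_pts.mean(0)
    H = (src_pts - mu_s).T @ (dst_pts - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

The design point the ablation makes is that the fine step alone cannot recover from a badly scaled initialization, which is why removing the coarse stage degrades the full method.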