Major upgrade for robot perception! Lightweight injection of geometric priors boosts success rate by 31%
量子位 · 2025-09-28 11:54
Core Viewpoint
- The article discusses the development of the Evo-0 model, which enhances the spatial understanding of vision-language-action (VLA) models by injecting 3D geometric priors, without requiring explicit depth input or additional sensors [4][18].

Group 1: Model Development
- Evo-0 builds on the VGGT visual geometry foundation model, which extracts 3D structural information from multi-view RGB images and integrates it into an existing vision-language model [4].
- Evo-0 employs a cross-attention fusion module that lets 2D visual tokens attend to 3D geometry tokens, improving its understanding of spatial structure and object layout (a minimal sketch of this fusion step follows the summary below) [6].

Group 2: Experimental Results
- In RLBench simulation experiments covering five fine-manipulation tasks, Evo-0's average success rate exceeded the baseline pi0 by 15% and openvla-oft by 31% [5].
- In real-world experiments on five spatially demanding tasks, Evo-0 outperformed the baseline pi0 by an average of 28.88% in success rate, excelling particularly in tasks involving complex spatial relationships [12][10].

Group 3: Robustness Evaluation
- Robustness was tested under five interference conditions: unseen distractor objects and variations in background color, target position, target height, and camera angle; Evo-0 consistently outperformed the baseline pi0 [14][15].
- With unseen distractor objects present, the model achieved a 100% correct-pick rate and a 70% overall success rate, demonstrating robustness in challenging scenarios [15].

Group 4: Training Efficiency
- Evo-0 reached better performance after only 15,000 training steps, versus the 20,000 steps required by the baseline pi0, indicating higher training efficiency [8].
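The article gives no code, but the fusion step it describes maps onto a standard cross-attention pattern. Below is a minimal PyTorch sketch in which 2D visual tokens (queries) attend to 3D geometry tokens (keys/values) and the result is merged back residually. The class name `CrossAttentionFusion`, all dimensions, and the projection/LayerNorm choices are illustrative assumptions, not Evo-0's actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical fusion module: 2D visual tokens query 3D geometry tokens
    via cross-attention; the attended output is merged back residually.
    Dimensions and layer choices are illustrative, not from the paper."""

    def __init__(self, dim_2d: int = 1024, dim_3d: int = 768, num_heads: int = 8):
        super().__init__()
        # Project 3D tokens into the 2D token space so attention dims match.
        self.proj_3d = nn.Linear(dim_3d, dim_2d)
        self.cross_attn = nn.MultiheadAttention(dim_2d, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim_2d)

    def forward(self, tokens_2d: torch.Tensor, tokens_3d: torch.Tensor) -> torch.Tensor:
        # tokens_2d: (B, N, dim_2d) from the VLM vision encoder
        # tokens_3d: (B, M, dim_3d) from a geometry encoder such as VGGT
        kv = self.proj_3d(tokens_3d)
        attended, _ = self.cross_attn(query=tokens_2d, key=kv, value=kv)
        # Residual fusion keeps the original 2D semantics intact.
        return self.norm(tokens_2d + attended)


if __name__ == "__main__":
    fusion = CrossAttentionFusion()
    vis = torch.randn(2, 196, 1024)  # stand-in for 2D visual tokens
    geo = torch.randn(2, 256, 768)   # stand-in for 3D geometry tokens
    print(fusion(vis, geo).shape)    # torch.Size([2, 196, 1024])
```

Keeping the 2D tokens as queries preserves the token count and embedding size expected by the downstream policy, which is what would make an injection module of this kind lightweight: only the projection and attention weights are new, and no depth sensor or explicit depth input is involved.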