Four Top Universities Team Up on OmniVGGT: An Omni-Modal Visual Geometry Transformer!
自动驾驶之心· 2025-11-17 00:05
Core Insights

- The article discusses the need for a "universal multimodal" 3D model, highlighting the limitations of current models that rely primarily on RGB images and fail to utilize additional geometric information effectively [5][6][9].
- The proposed OmniVGGT framework allows flexible integration of any number of auxiliary geometric modalities during training and inference, significantly improving performance across various 3D tasks [6][9][10].

Group 1: Need for Universal Multimodal 3D Models

- Current mainstream 3D models, such as VGGT, can only process RGB images and do not utilize depth or camera parameters, leading to inefficiencies in real-world applications [5].
- OmniVGGT addresses "information waste" and poor adaptability by fully leveraging available auxiliary information without compromising performance when only RGB input is used [9][10].

Group 2: Core Innovations of OmniVGGT

- OmniVGGT achieves top-tier performance in tasks such as monocular/multi-view depth estimation and camera pose estimation, outperforming existing methods even with RGB input alone [7][29].
- The framework integrates into vision-language-action (VLA) models, significantly enhancing robotic manipulation tasks [7][29].

Group 3: Technical Components

- The GeoAdapter component injects geometric information (depth, camera parameters) into the base model without disrupting the original feature space, while maintaining low computational overhead [10][16].
- A random multimodal fusion strategy is employed during training so that the model learns robust spatial representations and does not over-depend on auxiliary information [22][23].

Group 4: Experimental Results

- OmniVGGT was trained on 19 public datasets and demonstrates superior performance across multiple 3D tasks, with significant improvements in metrics such as absolute relative error and accuracy [29][30].
- The framework shows that the more auxiliary information is provided, the better the performance, with notable gains in depth estimation and camera pose accuracy [30][34].

Group 5: Practical Implications

- OmniVGGT's design allows flexible combinations of auxiliary geometric inputs, making it practical for a range of applications in 3D modeling and robotics [53][54].
- The model's efficiency, requiring only 0.2 seconds for inference, positions it as a leading solution in the field [42][40].
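The GeoAdapter idea described in Group 3 can be illustrated with a minimal sketch. The article does not publish the exact architecture, so everything below (the `GeoAdapter` class, the dimensions, the zero-initialized output projection) is a hypothetical illustration of the stated design goal: auxiliary geometric tokens are injected through a residual branch whose output projection starts at zero, so the base model's feature space is untouched at initialization, and RGB-only input simply bypasses the branch.

```python
import numpy as np

rng = np.random.default_rng(0)

class GeoAdapter:
    """Hypothetical sketch: inject auxiliary geometric tokens (e.g. an
    encoded depth map or camera parameters) into base features through
    a residual branch. Not the paper's actual implementation."""

    def __init__(self, geo_dim: int, feat_dim: int):
        # Input projection for the auxiliary modality (random init).
        self.w_in = rng.normal(0.0, 0.02, size=(geo_dim, feat_dim))
        # Output projection starts at zero: the adapter contributes
        # nothing until training updates it, so the original feature
        # space (and RGB-only behavior) is preserved at initialization.
        self.w_out = np.zeros((feat_dim, feat_dim))

    def __call__(self, base_feats, geo_tokens=None):
        if geo_tokens is None:       # RGB-only input: pure pass-through
            return base_feats
        hidden = np.maximum(geo_tokens @ self.w_in, 0.0)  # ReLU MLP
        return base_feats + hidden @ self.w_out           # residual add

feats = rng.normal(size=(4, 64))     # 4 tokens, 64-d base features
depth = rng.normal(size=(4, 8))      # toy 8-d geometric encoding
adapter = GeoAdapter(geo_dim=8, feat_dim=64)
out = adapter(feats, depth)
assert np.allclose(out, feats)       # identity at init, by construction
```

The zero-initialized output projection is a common trick for adding new conditioning branches to a pretrained backbone without degrading it, which matches the article's claim that auxiliary inputs are integrated "without disrupting the original feature space."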
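The random multimodal fusion strategy from Group 3 can likewise be sketched as modality dropout during training. The function name and drop probability below are assumptions for illustration; the point is that each auxiliary modality is independently kept or dropped per step, so the model also sees RGB-only steps and cannot over-rely on auxiliary information.

```python
import random

def sample_modalities(available, p_drop=0.5, rng=None):
    """Hypothetical sketch of random multimodal fusion: at each
    training step, independently keep or drop every auxiliary
    modality. An empty result means a pure RGB training step."""
    rng = rng or random
    return {name: feat for name, feat in available.items()
            if rng.random() >= p_drop}

aux = {"depth": "D", "camera_pose": "T", "intrinsics": "K"}
step_rng = random.Random(0)
subsets = [sample_modalities(aux, 0.5, step_rng) for _ in range(4)]
```

Training over such randomly sampled subsets is what lets a single model accept any combination of auxiliary inputs at inference time, consistent with the article's claim that more auxiliary information yields better performance while RGB-only performance is not compromised.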