VGGT
Four Top Universities Team Up to Build OmniVGGT: An Omni-Modal Visual Geometry Transformer!
自动驾驶之心· 2025-11-17 00:05
Core Insights
- The article discusses the need for a "universal multimodal" 3D model, highlighting the limitations of current models that rely primarily on RGB images and fail to exploit additional geometric information effectively [5][6][9].
- The proposed OmniVGGT framework allows flexible integration of any number of auxiliary geometric modalities during training and inference, significantly improving performance across a range of 3D tasks [6][9][10].
Group 1: Need for Universal Multimodal 3D Models
- Current mainstream 3D models, such as VGGT, can only process RGB images and do not make use of depth or camera parameters, leading to inefficiencies in real-world applications [5].
- OmniVGGT addresses this "information waste" and poor adaptability by fully leveraging whatever auxiliary information is available, without compromising performance when only RGB input is provided [9][10].
Group 2: Core Innovations of OmniVGGT
- OmniVGGT achieves top-tier performance on tasks such as monocular/multi-view depth estimation and camera pose estimation, outperforming existing methods even with RGB input alone [7][29].
- The framework integrates into vision-language-action (VLA) models, significantly enhancing robotic manipulation tasks [7][29].
Group 3: Technical Components
- The GeoAdapter component injects geometric information (depth, camera parameters) into the base model without disrupting the original feature space, keeping computational overhead low [10][16]; a hedged code sketch of this idea follows this summary.
- A random multimodal fusion strategy is employed during training so that the model learns robust spatial representations and does not become overly dependent on auxiliary information [22][23].
Group 4: Experimental Results
- OmniVGGT was trained on 19 public datasets and demonstrates superior performance across multiple 3D tasks, with significant improvements in metrics such as absolute relative error and accuracy [29][30].
- The framework shows that the more auxiliary information is provided, the better the performance, with notable gains in depth estimation and camera pose accuracy [30][34].
Group 5: Practical Implications
- OmniVGGT's design accepts flexible combinations of auxiliary geometric modalities as input, making it practical for a variety of applications in 3D modeling and robotics [53][54].
- The model's efficiency, requiring only 0.2 seconds for inference, positions it as a leading solution in the field [40][42].
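To make the adapter idea concrete, the PyTorch sketch below shows one plausible way a zero-initialized projection could inject optional depth/camera embeddings into the backbone's tokens, plus a random modality dropout used during training. The module and argument names (GeoAdapter, random_modality_dropout, drop_prob) and the zero-initialization choice are illustrative assumptions, not the released OmniVGGT code.

```python
# Minimal sketch, assuming a token-based backbone: auxiliary geometric embeddings
# are added through a zero-initialized projection, so RGB-only inputs leave the
# original feature space untouched. Illustrative only, not the authors' code.
import torch
import torch.nn as nn


class GeoAdapter(nn.Module):
    def __init__(self, dim: int, aux_dim: int):
        super().__init__()
        self.proj = nn.Linear(aux_dim, dim)
        # Zero-init so the adapter starts as an identity on the base features.
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, tokens: torch.Tensor, aux: torch.Tensor | None) -> torch.Tensor:
        if aux is None:            # RGB-only input: base features pass through unchanged
            return tokens
        return tokens + self.proj(aux)


def random_modality_dropout(aux_inputs: dict, drop_prob: float = 0.5) -> dict:
    """Randomly hide auxiliary modalities during training so the model does not
    become dependent on any single one (the 'random multimodal fusion' idea)."""
    return {k: (v if torch.rand(()) > drop_prob else None) for k, v in aux_inputs.items()}


if __name__ == "__main__":
    adapter = GeoAdapter(dim=768, aux_dim=64)
    tokens = torch.randn(2, 196, 768)        # [batch, tokens, dim] backbone features
    depth_emb = torch.randn(2, 196, 64)      # embedded depth cues (hypothetical shape)
    aux = random_modality_dropout({"depth": depth_emb})
    print(adapter(tokens, aux["depth"]).shape)   # torch.Size([2, 196, 768])
```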
HKUST(Guangzhou) & Tsinghua Jointly Propose Spatial Forcing: Implicit Spatial Alignment That Outperforms Mainstream 2D/3D VLA Models
具身智能之心· 2025-10-18 16:03
Core Insights
- The article discusses the limitations of current vision-language-action (VLA) models, which rely primarily on 2D visual data and lack a deep understanding of real 3D space, hampering their ability to act in the physical world [2][4].
- The proposed method, Spatial Forcing (SF), lets VLA models develop spatial understanding without explicit 3D input by aligning their visual features with a powerful 3D geometric representation generated by an external model [2][10].
Methodology
- The SF method employs an implicit spatial alignment strategy, enabling the model to acquire spatial understanding autonomously during training without additional 3D sensors [2][13].
- A depth-probing experiment was conducted to test for 3D information in the original VLA's visual features, revealing that without 3D input the model cannot form accurate spatial perceptions [11][13].
- Training aligns the VLA model's visual tokens with pixel-level spatial representations extracted from a pre-trained 3D model, jointly optimizing a spatial alignment loss and an action-generation loss [16]; a hedged sketch of this objective follows the summary.
Performance Results
- The SF method significantly outperforms existing 2D and 3D VLA models across tasks, improving training efficiency by up to 3.8× and data utilization efficiency by up to 5.9× [14].
- In experiments, the Spatial Forcing model achieved success rates of 99.4% on spatial tasks, 99.6% on object tasks, and 98.8% on goal tasks, demonstrating superior performance compared to other models [18].
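The sketch below illustrates how such a joint objective could be written: the VLA's visual tokens are projected and aligned with frozen pixel-level features from a pre-trained 3D model, and the alignment term is added to the action loss. The names (align_head, lambda_align), the cosine-distance choice, and the MSE action loss are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a Spatial-Forcing-style training objective:
# alignment of VLA visual tokens with a frozen 3D teacher + action regression.
import torch
import torch.nn as nn
import torch.nn.functional as F

lambda_align = 0.5                     # weight of the alignment term (assumed value)
align_head = nn.Linear(768, 1024)      # maps VLA tokens to the teacher feature dim


def spatial_forcing_loss(vla_tokens, teacher_feats, pred_actions, gt_actions):
    """vla_tokens:    [B, N, 768]  visual tokens from the VLA backbone
    teacher_feats: [B, N, 1024] pixel-aligned features from the frozen 3D model
    pred_actions / gt_actions: [B, A] continuous action vectors."""
    aligned = F.normalize(align_head(vla_tokens), dim=-1)
    target = F.normalize(teacher_feats.detach(), dim=-1)    # teacher stays frozen
    align_loss = (1.0 - (aligned * target).sum(dim=-1)).mean()   # cosine distance
    action_loss = F.mse_loss(pred_actions, gt_actions)
    return action_loss + lambda_align * align_loss


if __name__ == "__main__":
    loss = spatial_forcing_loss(
        torch.randn(2, 196, 768),
        torch.randn(2, 196, 1024),
        torch.randn(2, 7),
        torch.randn(2, 7),
    )
    print(loss.item())
```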
A Major Upgrade for Robot Perception: Lightweight Injection of Geometric Priors Lifts Success Rates by 31%
36Kr · 2025-09-28 12:09
Core Insights
- The article discusses the challenge of enabling AI to truly "understand" the 3D world, particularly for vision-language-action (VLA) models that rely on 2D image-text data [1][2].
Group 1: VLA Model Limitations
- Current VLA models, which rely primarily on pre-trained vision-language models, lack the 3D spatial understanding needed for real-world manipulation [1].
- Existing enhancement methods based on explicit depth input face deployment difficulties and precision/noise issues [1].
Group 2: Evo-0 Model Introduction
- Shanghai Jiao Tong University and the University of Cambridge propose Evo-0, a lightweight method that enhances the spatial understanding of VLA models by implicitly injecting 3D geometric priors, without explicit depth input or additional sensors [2].
- Evo-0 uses the Visual Geometry Grounded Transformer (VGGT) to extract 3D structural information from multi-view RGB images, significantly improving spatial perception [2][3].
Group 3: Model Architecture and Training
- Evo-0 integrates VGGT as a spatial encoder, introducing 3D tokens that carry depth context and cross-view spatial correspondence [3].
- A cross-attention fusion module merges 2D visual tokens with the 3D tokens, improving the model's understanding of spatial structure and object layout [3][6]; a hedged sketch of this fusion step follows the summary.
- Training is kept efficient by fine-tuning only the fusion module, the LoRA adaptation layers, and the action expert, reducing computational cost [6].
Group 4: Experimental Results
- In RLBench simulation tasks, Evo-0 improved the average success rate by more than 28.88% over baseline models, excelling in particular at tasks requiring complex spatial relationships [10][11].
- Evo-0's robustness was tested under five different interference conditions, and it consistently outperformed the baseline model pi0 [12][15].
Group 5: Conclusion
- Evo-0's key innovation lies in extracting rich spatial semantics through VGGT, bypassing depth-estimation errors and sensor requirements and thus strengthening the spatial modeling capability of VLA models [16].
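The sketch below shows a standard cross-attention fusion block of the kind described in Group 3: 2D visual tokens query the 3D tokens from a spatial encoder such as VGGT, and the result is added back residually so only the fusion module needs training. The class name (CrossAttentionFusion) and dimensions are illustrative assumptions, not Evo-0's exact implementation.

```python
# Minimal cross-attention fusion sketch: 2D tokens attend to 3D tokens,
# with a residual connection preserving the original 2D representation.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens: torch.Tensor, tokens_3d: torch.Tensor) -> torch.Tensor:
        # 2D visual tokens are the queries; 3D tokens provide keys and values.
        fused, _ = self.attn(query=vis_tokens, key=tokens_3d, value=tokens_3d)
        return self.norm(vis_tokens + fused)


if __name__ == "__main__":
    fusion = CrossAttentionFusion()
    vis = torch.randn(2, 196, 768)     # 2D visual tokens from the VLM backbone
    geo = torch.randn(2, 196, 768)     # 3D tokens from the spatial encoder
    print(fusion(vis, geo).shape)      # torch.Size([2, 196, 768])
```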
Liujuan Cao's Team at Xiamen University Presents FastVGGT: A 4× Speedup That Breaks the VGGT Inference Bottleneck and Reduces Cumulative Error!
具身智能之心· 2025-09-10 06:18
Core Viewpoint
- The article introduces FastVGGT, a training-free acceleration method that optimizes the VGGT model by addressing redundancy in its global attention mechanism, achieving up to 4× faster inference while maintaining reconstruction accuracy and mitigating cumulative error in 3D visual tasks [26].
Group 1: Main Contributions
- FastVGGT enables VGGT to process 1000 input images in a single forward pass on a single GPU with 80 GB of VRAM, up from 300 images previously [5].
- The method achieves a 4× speedup in inference time on 1000-image tasks while effectively reducing cumulative error [5][18].
- FastVGGT maintains high reconstruction quality, improving metrics such as Chamfer Distance (CD) from 0.471 to 0.425 [18].
Group 2: Bottleneck Analysis
- The analysis identifies significant redundancy in VGGT's global attention mechanism, which leads to unnecessary computation [6][7].
- Cumulative error worsens on long sequences because the global attention mechanism amplifies small errors over time [6].
Group 3: Methodology
- Token-merging strategies are introduced to remove the redundancy in VGGT's attention calculations, including reference-frame constraints, key-token retention, and region-based sampling [9][11].
- Token merging reduces the number of tokens involved in attention, while token unmerging restores the full sequence so that dense 3D reconstruction outputs remain intact [15]; a hedged merge/unmerge sketch is given after this summary.
Group 4: Experimental Results
- FastVGGT shows significantly reduced inference time and improved reconstruction quality across datasets including ScanNet-50, 7Scenes, and NRGBD [22].
- In point cloud reconstruction tasks, FastVGGT achieves a 4× speedup in inference time while maintaining reconstruction accuracy [18][22].
- The method also improves absolute trajectory error (ATE) and relative pose error (RPE), indicating better performance on long-sequence inference [24].
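The sketch below illustrates the general merge/unmerge pattern described above: reference-frame tokens are always kept, the remaining tokens are subsampled, every token is assigned to its most similar kept token, attention runs only on the kept set, and results are scattered back so dense outputs keep their original length. The keep ratio, the cosine-similarity assignment, and the function names are assumptions for illustration, not FastVGGT's exact strategy.

```python
# Simplified token merge/unmerge sketch around a global-attention call.
import torch
import torch.nn.functional as F


def merge_tokens(x: torch.Tensor, num_ref: int, keep_ratio: float = 0.25):
    """x: [N, D] tokens; the first num_ref tokens belong to the reference frame."""
    n = x.shape[0]
    num_keep = num_ref + max(1, int((n - num_ref) * keep_ratio))
    # Keep all reference-frame tokens plus an evenly spaced (region-like) subsample.
    rest_idx = torch.linspace(num_ref, n - 1, num_keep - num_ref).long()
    keep_idx = torch.cat([torch.arange(num_ref), rest_idx])
    kept = x[keep_idx]
    # Assign every token to its most similar kept token (cosine similarity).
    sim = F.normalize(x, dim=-1) @ F.normalize(kept, dim=-1).T
    assign = sim.argmax(dim=-1)                       # [N]
    return kept, assign


def unmerge_tokens(kept_out: torch.Tensor, assign: torch.Tensor) -> torch.Tensor:
    """Broadcast processed kept tokens back to the full token sequence."""
    return kept_out[assign]


if __name__ == "__main__":
    tokens = torch.randn(1000, 768)
    kept, assign = merge_tokens(tokens, num_ref=100)
    processed = kept * 2.0                 # stand-in for global attention on the kept set
    dense = unmerge_tokens(processed, assign)
    print(kept.shape, dense.shape)         # torch.Size([325, 768]) torch.Size([1000, 768])
```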
Just In: CVPR 2025 Awards Announced: Oxford & Meta PhD Student Jianyuan Wang Wins Best Paper, Saining Xie Takes the Young Researcher Award
机器之心· 2025-06-13 15:45
Core Insights
- The CVPR 2025 conference in Nashville, Tennessee, awarded five papers, including one best paper and four honorable mentions, along with one best student paper and one student-paper honorable mention [1][2].
Submission and Acceptance Statistics
- This year, over 40,000 authors submitted 13,008 papers, a 13% increase from last year's 11,532 submissions. A total of 2,872 papers were accepted, for an overall acceptance rate of approximately 22.1%. Among the accepted papers, 96 were oral presentations (3.3%) and 387 were highlights (13.7%) [3][5].
Conference Attendance
- The conference attracted over 9,000 attendees from more than 70 countries and regions [7].
Paper Acceptance by Field
- Image and video generation had the highest number of accepted papers, while the highest acceptance rates were in 3D from multi-view and sensor data, as well as single-image 3D [8].
Best Paper Award
- The best paper, "VGGT: Visual Geometry Grounded Transformer," was presented by researchers from the University of Oxford and Meta AI. It introduces a universal 3D vision model built on a pure feed-forward Transformer architecture, capable of inferring core geometric information from one or more images [13][14].
Notable Research Contributions
- The best paper demonstrated significant performance improvements over traditional optimization-based methods and existing state-of-the-art models across various 3D tasks, achieving inference in seconds without post-processing optimization [17].
Best Student Paper
- The best student paper, "Neural Inverse Rendering from Propagating Light," proposed a physics-based, multi-view neural inverse rendering system for propagating light, achieving state-of-the-art 3D reconstruction under strong indirect lighting [53][55].
Awards and Recognitions
- Two Young Researcher Awards went to Hao Su and Saining Xie for their outstanding contributions to computer vision research [68][72]. The Longuet-Higgins Prize was presented to two papers that have significantly influenced the field: the Inception architecture and fully convolutional networks for semantic segmentation [75][78][80].