VGGT
A Billion-Parameter 3D Model Squeezed into a Phone for the First Time! 4-Bit Quantization: 2.5× Speed, 3.7× Less Memory, 98% Accuracy | ICLR'26
量子位· 2026-03-08 04:26
Core Insights
- The article discusses QuantVGGT, a quantization framework designed to compress and accelerate the Visual Geometry Grounded Transformer (VGGT), a model with over 1 billion parameters, while maintaining high accuracy and performance [2][5][58].

Group 1: Quantization Framework
- QuantVGGT uses 4-bit quantization to achieve a 2.5× speedup and a 3.7× memory reduction while preserving 98% of the full-precision model's reconstruction accuracy [2][5][7].
- The framework introduces two main technical contributions: Dual-Smoothed Fine-Grained Quantization (DSFQ) and Noise-Filtered Diverse Sampling (NFDS) [5][9].

Group 2: Challenges in Quantization
- VGGT's unique properties, such as data-independent special tokens and the inherent complexity of 3D data, pose significant challenges for quantization [11][12].
- The data-independent tokens produce a heavy-tailed activation distribution, complicating quantization and increasing the risk of information loss [11][12].

Group 3: Technical Contributions
- DSFQ combines a pre-global Hadamard rotation with post-local channel smoothing to mitigate the heavy-tailed distribution and inter-channel variance [5][9][30].
- NFDS uses deep-layer statistics to filter out noisy samples and build frame-aware, diverse calibration clusters, stabilizing the quantization range [5][9][40].

Group 4: Experimental Results
- Extensive experiments show that QuantVGGT outperforms existing quantization methods across benchmark datasets and bit widths [5][13][59].
- In camera pose estimation, QuantVGGT retains 99.9% of full-precision performance at 8-bit quantization and reaches an AUC@30 of 88.2 at 4-bit quantization, significantly outperforming other methods [47][50].

Group 5: Efficiency and Deployment
- The proposed framework adds minimal latency, only a 0.2% increase, while largely retaining model performance [56][58].
- The results indicate that QuantVGGT is well suited to deployment in resource-constrained environments, demonstrating its practical advantages [5][58].
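The low-bit compression described above can be made concrete with a minimal sketch. The following is generic symmetric per-group 4-bit quantization, not QuantVGGT's exact DSFQ scheme (which additionally applies Hadamard rotation and channel smoothing before quantizing):

```python
import numpy as np

def quantize_int4(x, group_size=32):
    """Symmetric per-group 4-bit quantization (sketch only; DSFQ adds
    rotation and smoothing on top of a scheme like this)."""
    flat = x.reshape(-1, group_size)
    # int4 symmetric range is [-8, 7]; scale each group by its max magnitude
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale, shape):
    """Reconstruct approximate float weights from int4 codes and scales."""
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 64)).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s, w.shape)
rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
```

Storing int4 codes plus one scale per group is where the ~3.7× memory reduction comes from relative to float32/float16 weights; the heavy-tailed activations discussed in Group 2 are exactly what makes the per-group max a poor scale without prior smoothing.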
TALO: An Outdoor Reconstruction System Supporting Any 3D Foundation Model and Any Camera Configuration
自动驾驶之心· 2026-01-08 09:07
Core Viewpoint
- The article discusses advances in 3D vision foundation models for online incremental reconstruction, highlighting the limitations of existing methods and introducing TALO, a new framework that improves global consistency in reconstruction tasks [1][2][7].

Summary by Sections

Introduction to 3D Vision Models
- Recent foundation models such as VGGT, π³, and MapAnything have introduced a data-driven paradigm for 3D reconstruction, directly predicting camera parameters and dense geometric structure from input images [1].

Limitations of Existing Models
- Most current models are designed for offline scene reconstruction, which is inadequate for real-time applications such as autonomous driving that require online incremental reconstruction [2].

Existing Work on Alignment
- The article reviews existing sub-map alignment methods such as VGGT-Long and VGGT-SLAM, which use different strategies to maintain consistency across independently predicted sub-maps [3][4].

Analysis of Alignment Strategies
- VGGT-Long uses a Sim(3) alignment strategy, while VGGT-SLAM extends this to SL(4) to handle inconsistent camera parameters. However, SL(4) has proven unstable in outdoor multi-camera settings, leading to severe reconstruction failures [4][5].

Limitations of Global Linear Alignment
- The article identifies three fundamental limitations of global linear alignment: the assumption of global geometric consistency, the merely short-term optimality of pairwise alignments, and the sensitivity of SL(4) to noise in geometric predictions [5][7].

Introduction of the TALO Framework
- TALO is a plug-and-play alignment framework that improves global consistency in online incremental reconstruction using sparsely distributed control points and a Thin Plate Spline (TPS) transformation model [7][9].

Contributions of TALO
- TALO systematically analyzes existing alignment strategies, introduces a robust sub-map registration strategy based on overlapping camera poses, and demonstrates superior geometric consistency across datasets and foundation models [9][12].

Experimental Results
- On the Waymo and nuScenes datasets, TALO achieved the best trajectory accuracy and stability compared with VGGT-Long and VGGT-SLAM, with an average absolute trajectory error (ATE) of around 1 meter and significant gains in rotational accuracy [29][31].

Visual Comparisons
- Visual results show that TALO recovers accurate geometric structure and eliminates artifacts common in previous methods, demonstrating its robustness and effectiveness in real-world applications [33][34].
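To make the alignment discussion concrete, the Sim(3) strategy attributed to VGGT-Long (a similarity transform fitting one sub-map's points onto another's) has a classic closed-form solution, the Umeyama solve, sketched below. This illustrates the baseline TALO improves on, not TALO's own TPS model:

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Closed-form least-squares Sim(3): find scale s, rotation R,
    translation t with dst ~= s * R @ src + t (classic Umeyama solve).
    A sketch of the kind of sub-map alignment VGGT-Long performs."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    n = src.shape[0]
    cov = xd.T @ xs / n                      # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                       # guard against reflections
    R = U @ S @ Vt
    var_src = (xs ** 2).sum() / n
    s = (D * np.diag(S)).sum() / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t

# Recover a known transform from noiseless correspondences
rng = np.random.default_rng(1)
src = rng.standard_normal((100, 3))
th = 0.3
R_true = np.array([[np.cos(th), -np.sin(th), 0.0],
                   [np.sin(th),  np.cos(th), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
dst = 2.0 * src @ R_true.T + t_true
s, R, t = umeyama_sim3(src, dst)
```

A single global Sim(3) (or SL(4)) like this is exactly the "global linear alignment" whose limitations the article lists; TALO's TPS transformation instead warps space non-rigidly around control points, so locally inconsistent sub-map geometry can still be reconciled.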
Mining Motion Cues from Attention: Training-Free Unlocking of 4D Scene Reconstruction
量子位· 2025-12-17 09:07
Core Insights
- The article discusses VGGT4D, a framework that enables 3D foundation models to process dynamic 4D scenes without additional training cost [1][2][30].
- VGGT4D exploits motion cues hidden in the attention layers of the Visual Geometry Grounded Transformer (VGGT) to improve tasks such as dynamic object segmentation and camera pose estimation [1][6][30].

Group 1: Challenges in Transitioning from 3D to 4D
- Existing 3D models such as VGGT and DUSt3R excel at static scene reconstruction but struggle with dynamic 4D scenes, where moving objects interfere with background geometric modeling and cause significant camera pose drift [4].
- Current solutions face two main challenges: high computational or training costs, and reliance on external priors that complicate the system [5].

Group 2: VGGT4D's Mechanism
- VGGT4D extracts 4D perception capabilities directly from pre-trained 3D models without additional training [6].
- The research team visualized VGGT's attention mechanism and found that different network layers respond distinctly to dynamic regions, indicating that VGGT implicitly encodes rich dynamic cues despite being trained under static assumptions [7][13].

Group 3: Motion Cue Extraction Techniques
- VGGT4D introduces a training-free attention feature mining and mask refinement mechanism that uses Gram matrices and gradient flow for high-precision dynamic-static separation [14].
- The method addresses the limitations of standard attention maps by using self-similarity Gram matrices that focus on motion-induced variance, improving the model's ability to detect dynamic features [17].

Group 4: Performance Evaluation
- VGGT4D significantly outperforms other variants in dynamic object segmentation across multiple datasets, achieving the best results on DAVIS-2016 and DAVIS-2017 without any 4D-specific training [20][21].
- Qualitative analysis shows that VGGT4D generates more accurate masks with clearer boundaries than baseline methods, validating the hypothesis that VGGT's Gram similarity statistics embed extractable motion cues [22].

Group 5: Robustness and Long-Sequence Performance
- VGGT4D demonstrates superior robustness in camera pose estimation, achieving the best results on challenging long-sequence benchmarks while maintaining high efficiency [25].
- The method identifies and eliminates residual pose inconsistencies caused by motion, yielding more stable and accurate camera trajectories [25].

Group 6: 4D Point Cloud Reconstruction
- On the DyCheck dataset, VGGT4D achieves the best performance on all reconstruction metrics, significantly improving accuracy and distance metrics over the VGGT baseline [28].
- The method reduces median accuracy error from 0.009 to 0.004 and average distance from 0.150 to 0.123, demonstrating precise dynamic-static separation and improved geometric reconstruction quality [28].

Group 7: Conclusion
- VGGT4D presents a training-free paradigm that extends 3D foundation models to 4D dynamic scenes, offering a low-cost route to 4D reconstruction and showcasing the potential of foundation models for zero-shot transfer [30].
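The Gram-matrix idea in Group 3 can be illustrated with a rough, training-free sketch: build a cosine self-similarity (Gram) matrix over tokens per frame and score each token by how much its similarity pattern varies across time. This is only an intuition for the cue VGGT4D mines; the paper's actual pipeline additionally refines masks via gradient flow:

```python
import numpy as np

def gram_motion_scores(feats):
    """Score tokens by temporal variance of their Gram (self-similarity)
    rows: dynamic tokens change how they relate to everything else.
    feats: (T, N, D) per-frame token features. Sketch, not the paper's recipe."""
    norm = np.linalg.norm(feats, axis=-1, keepdims=True)
    f = feats / np.maximum(norm, 1e-8)
    grams = np.einsum('tnd,tmd->tnm', f, f)   # (T, N, N) cosine Gram per frame
    return grams.var(axis=0).mean(axis=1)     # (N,) higher = more dynamic

# Toy check: 7 static tokens repeated across 5 frames, 1 moving token
rng = np.random.default_rng(2)
static = rng.standard_normal((1, 7, 16)).repeat(5, axis=0)
dynamic = rng.standard_normal((5, 1, 16))
feats = np.concatenate([dynamic, static], axis=1)   # token 0 is dynamic
scores = gram_motion_scores(feats)
```

A static token's Gram row varies only in the entry involving the moving token, while the moving token's entire row varies, so its variance score dominates — the "motion-induced variance" the article refers to.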
VGGT4D: Training-Free Mining of 3D Foundation Models for 4D Dynamic Scene Reconstruction
机器之心· 2025-12-17 02:05
Core Insights
- The article discusses VGGT4D, a framework developed by researchers from the Hong Kong University of Science and Technology (Guangzhou) and Horizon Robotics that enables 3D foundation models to handle dynamic 4D scenes without additional training cost [2][4][33].
- VGGT4D exploits motion cues hidden in VGGT's attention layers to improve dynamic object segmentation, camera pose estimation, and long-sequence 4D reconstruction [2][4][6].

Research Background
- Traditional 3D foundation models such as VGGT and DUSt3R excel at static scene reconstruction but suffer significant performance drops on dynamic 4D scenes with moving objects [6][7].
- Existing solutions often incur high computational costs and rely on external priors that complicate the system [9][12].

Methodology
- VGGT4D introduces a training-free mechanism for attention feature mining and mask refinement, using Gram matrices and gradient flow for high-precision dynamic-static separation [14][17].
- The framework addresses the limitations of standard attention maps with self-similarity Gram matrices that raise the signal-to-noise ratio, enabling better extraction of motion cues [16][17].

Experimental Validation
- VGGT4D was evaluated on dynamic object segmentation, camera pose estimation, and 4D point cloud reconstruction across six benchmark datasets, outperforming other methods [22][23].
- In dynamic object segmentation, VGGT4D achieved the best performance on DAVIS-2016 and DAVIS-2017, outperforming all variants without any 4D-specific training [24][25].
- In camera pose estimation, VGGT4D consistently improved on the strong baseline of the original VGGT model, achieving an absolute trajectory error (ATE) of 0.164 on the VKITTI dataset versus 2.272 for MonST3R [27][28].

Conclusion
- VGGT4D extends 3D foundation models to 4D dynamic scenes through effective internal feature extraction, providing a low-cost solution for 4D reconstruction and showcasing the potential of foundation models for zero-shot transfer [33].
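For readers unfamiliar with the ATE numbers quoted above, here is a simplified sketch of the metric: the RMSE of camera positions after removing the mean offset between trajectories. Benchmark ATE usually also fits a full Sim(3)/SE(3) alignment first; this stripped-down version conveys the idea:

```python
import numpy as np

def absolute_trajectory_error(pred, gt):
    """RMSE of position differences after mean-centering both trajectories.
    Simplified ATE sketch; benchmark ATE typically aligns with Sim(3)/SE(3)
    before the RMSE. pred, gt: (T, 3) camera positions."""
    pred_c = pred - pred.mean(axis=0)
    gt_c = gt - gt.mean(axis=0)
    return float(np.sqrt(((pred_c - gt_c) ** 2).sum(axis=1).mean()))

gt = np.cumsum(np.ones((10, 3)), axis=0)   # toy straight-line trajectory
err_self = absolute_trajectory_error(gt, gt)
err_shift = absolute_trajectory_error(gt + 5.0, gt)   # pure offset cancels
```

Because the metric is invariant to a constant offset (and, in the full version, to scale and rotation), it isolates trajectory *shape* errors, which is where motion-induced pose drift shows up.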
Fudan's Latest DriveVGGT: Efficient Multi-Camera 4D Reconstruction for Autonomous Driving
自动驾驶之心· 2025-12-17 00:03
Core Insights
- The article introduces DriveVGGT, a visual geometry transformer designed specifically for autonomous driving that significantly improves geometric prediction consistency and inference efficiency in multi-camera systems [2][9][42].

Background
- 4D reconstruction is a computer vision task that predicts geometric information from visual sensors; its low cost makes it particularly attractive in autonomous driving and robotics [5].
- Traditional reconstruction methods are either iterative, requiring re-optimization as scenes change, or feed-forward, directly outputting predictions without updating model parameters [5].

Limitations of Existing Methods
- Existing feed-forward methods struggle in autonomous driving scenarios because images captured by different cameras have low overlap, making it difficult to match similar features [6].
- The relative pose calibration between cameras in autonomous systems is easy to obtain but cannot be used directly in feed-forward methods because of scale discrepancies [6].

DriveVGGT Model Overview
- DriveVGGT integrates relative pose information to improve performance on geometric tasks such as camera pose estimation and depth estimation [10][11].
- The model consists of three sub-modules: Temporal Video Attention (TVA), Relative Pose Embedding, and Multi-Camera Consistency Attention (MCA) [11][16].

Temporal Video Attention (TVA)
- TVA establishes initial geometric relationships between images captured by each camera in a continuous video stream, facilitating effective reconstruction [13][16].

Relative Pose Embedding
- This module normalizes the relative poses of all cameras to mitigate scale uncertainty, ensuring a consistent geometric representation [14][16].

Multi-Camera Consistency Attention (MCA)
- MCA strengthens interaction between images from different cameras by injecting relative pose information, addressing the instability caused by low overlap [15][16].

Experimental Results
- DriveVGGT outperformed other models in inference speed and prediction accuracy on the nuScenes dataset, particularly in 210-image scenarios [24][30].
- The model achieved superior depth estimation performance, especially on long sequences [27].

Visualization and Ablation Studies
- Visual comparisons demonstrate DriveVGGT's stable pose predictions across varied scenes, and ablation studies confirm the effectiveness of each proposed module [31][34].

Conclusion
- DriveVGGT effectively uses relative camera pose information to improve geometric predictions, achieving better performance at lower computational cost than previous methods [42].
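The scale-normalization idea behind the Relative Pose Embedding can be sketched as follows. The specific normalizer here (dividing all relative translations by the mean inter-camera baseline) is an assumption for illustration; the article only states that relative poses are normalized to remove scale uncertainty:

```python
import numpy as np

def normalize_relative_poses(extrinsics):
    """Rescale relative translations by the mean baseline length so the
    pose embedding is scale-free. Sketch of the idea behind DriveVGGT's
    Relative Pose Embedding; the mean-baseline normalizer is assumed.
    extrinsics: (K, 4, 4) camera-to-reference transforms."""
    out = extrinsics.copy()
    t = out[:, :3, 3]
    scale = np.linalg.norm(t, axis=1).mean()
    out[:, :3, 3] = t / max(scale, 1e-8)
    return out, scale

# Three cameras at different distances from the reference frame
poses = np.tile(np.eye(4), (3, 1, 1))
poses[0, :3, 3] = [2.0, 0.0, 0.0]
poses[1, :3, 3] = [0.0, 4.0, 0.0]
poses[2, :3, 3] = [0.0, 0.0, 6.0]
norm_poses, scale = normalize_relative_poses(poses)
```

After this step the calibrated rig geometry can be injected into the network without fixing the metric scale that a feed-forward model cannot observe from images alone, which is the scale-discrepancy problem the Background section raises.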
Four Top Universities Team Up on OmniVGGT: An Omni-Modal Visual Geometry Transformer!
自动驾驶之心· 2025-11-17 00:05
Core Insights
- The article argues for a "universal multimodal" 3D model, highlighting that current models rely primarily on RGB images and fail to exploit additional geometric information [5][6][9].
- The proposed OmniVGGT framework flexibly integrates any number of auxiliary geometric modalities during training and inference, significantly improving performance across 3D tasks [6][9][10].

Group 1: The Need for Universal Multimodal 3D Models
- Current mainstream 3D models such as VGGT process only RGB images and ignore available depth or camera parameters, leading to inefficiencies in real-world applications [5].
- OmniVGGT addresses this "information waste" and poor adaptability by fully exploiting available auxiliary information, without sacrificing performance when only RGB input is available [9][10].

Group 2: Core Innovations of OmniVGGT
- OmniVGGT achieves top-tier performance in tasks such as monocular/multi-view depth estimation and camera pose estimation, even outperforming existing methods with RGB input alone [7][29].
- The framework integrates into vision-language-action (VLA) models, significantly improving robotic manipulation tasks [7][29].

Group 3: Technical Components
- The GeoAdapter component injects geometric information (depth, camera parameters) into the base model without disrupting the original feature space, keeping computational overhead low [10][16].
- A random multimodal fusion strategy during training ensures the model learns robust spatial representations and does not over-depend on auxiliary information [22][23].

Group 4: Experimental Results
- OmniVGGT was trained on 19 public datasets and demonstrates superior performance across multiple 3D tasks, with significant improvements in metrics such as absolute relative error and accuracy [29][30].
- The more auxiliary information is provided, the better the performance, with notable gains in depth estimation and camera pose accuracy [30][34].

Group 5: Practical Implications
- OmniVGGT's design supports flexible combinations of auxiliary geometric inputs, making it practical for a range of 3D modeling and robotics applications [53][54].
- Requiring only 0.2 seconds for inference, the model's efficiency and speed position it as a leading solution in the field [40][42].
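The random multimodal fusion strategy in Group 3 amounts to modality dropout at training time. The sketch below keeps RGB always and independently keeps or drops each auxiliary modality; the uniform keep probability and the modality names are assumptions for illustration:

```python
import random

def sample_modalities(auxiliary, p_keep=0.5, rng=None):
    """Per training step, randomly keep or drop each auxiliary modality
    while always keeping RGB, so the model never over-relies on any
    single input. Sketch of OmniVGGT's random fusion; p_keep is assumed."""
    rng = rng or random.Random()
    return ["rgb"] + [m for m in auxiliary if rng.random() < p_keep]

rng = random.Random(0)
aux = ["depth", "intrinsics", "extrinsics"]   # hypothetical modality names
batch_modalities = [sample_modalities(aux, rng=rng) for _ in range(4)]
```

Training over many such random subsets is what lets a single model serve the whole input spectrum at inference, from RGB-only up to "RGB plus everything", matching the observation that performance grows with the auxiliary information provided.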
HKUST(GZ) & Tsinghua Jointly Propose Spatial Forcing: Implicit Spatial Alignment That Outperforms Mainstream 2D/3D VLA Models
具身智能之心· 2025-10-18 16:03
Core Insights
- The article discusses the limitations of current vision-language-action (VLA) models, which rely mainly on 2D visual data and lack a deep understanding of real 3D space, hampering their ability to act in the physical world [2][4].
- The proposed method, Spatial Forcing (SF), lets VLA models develop spatial understanding without explicit 3D input by aligning their visual features with a powerful 3D geometric representation generated by an external model [2][10].

Methodology
- SF uses an implicit spatial alignment strategy, enabling the model to acquire spatial understanding during training without additional 3D sensors [2][13].
- A depth-probing experiment checked for 3D information in the original VLA's visual features, revealing that without 3D supervision the model cannot form accurate spatial perceptions [11][13].
- During training, the VLA model's visual tokens are aligned with pixel-level spatial representations extracted from a pre-trained 3D model, jointly optimizing a spatial alignment loss and an action generation loss [16].

Performance Results
- SF significantly outperforms existing 2D and 3D VLA models across tasks, improving training efficiency by up to 3.8× and data efficiency by up to 5.9× [14].
- In experiments, the Spatial Forcing model achieved success rates of 99.4% on spatial tasks, 99.6% on object tasks, and 98.8% on goal tasks, outperforming other models [18].
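The alignment objective described in Methodology can be sketched as a simple feature-matching loss. The linear projection head `W` and the cosine-distance form below are illustrative assumptions; the article only states that visual tokens are aligned with the 3D model's pixel-level representations alongside the action loss:

```python
import numpy as np

def spatial_alignment_loss(vla_tokens, geo_feats, W):
    """Project VLA visual tokens through a head W and penalize
    1 - cosine similarity against features from a pre-trained 3D model.
    Minimal sketch of Spatial Forcing's implicit alignment objective;
    the linear head and cosine form are assumptions."""
    proj = vla_tokens @ W
    a = proj / np.maximum(np.linalg.norm(proj, axis=-1, keepdims=True), 1e-8)
    b = geo_feats / np.maximum(np.linalg.norm(geo_feats, axis=-1, keepdims=True), 1e-8)
    return float(1.0 - (a * b).sum(axis=-1).mean())

rng = np.random.default_rng(3)
tokens = rng.standard_normal((16, 32))
W = np.eye(32)                                   # hypothetical trained head
loss_matched = spatial_alignment_loss(tokens, tokens, W)
loss_random = spatial_alignment_loss(tokens, rng.standard_normal((16, 32)), W)
```

In full training this term would be added to the action generation loss, so gradients push the VLA's internal features toward geometry-aware representations even though no 3D sensor data enters the model, which is the "implicit" part of the method.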
A Major Upgrade for Robot Perception: Lightweight Injection of Geometric Priors Lifts Success Rates by 31%
36Kr· 2025-09-28 12:09
Core Insights
- The article discusses the challenge of enabling AI to truly "understand" the 3D world, particularly for vision-language-action (VLA) models that rely on 2D image-text data [1][2].

Group 1: VLA Model Limitations
- Current VLA models, built mainly on pre-trained vision-language models, lack the 3D spatial understanding needed for real-world operation [1].
- Existing enhancement methods based on explicit depth input face deployment difficulties and precision and noise issues [1].

Group 2: Introduction of the Evo-0 Model
- Shanghai Jiao Tong University and the University of Cambridge propose Evo-0, a lightweight method that enhances the spatial understanding of VLA models by implicitly injecting 3D geometric priors, with no explicit depth input or additional sensors required [2].
- Evo-0 uses the Visual Geometry Grounded Transformer (VGGT) to extract 3D structural information from multi-view RGB images, significantly improving spatial perception [2][3].

Group 3: Model Architecture and Training
- Evo-0 integrates VGGT as a spatial encoder, introducing 3D tokens that carry depth context and cross-view spatial correspondences [3].
- A cross-attention fusion module merges 2D visual tokens with the 3D tokens, improving the model's understanding of spatial structure and object layout [3][6].
- Training is efficient: only the fusion module, the LoRA adaptation layers, and the action expert are fine-tuned, reducing computational cost [6].

Group 4: Experimental Results
- On RLBench simulation tasks, Evo-0 improved the average success rate by more than 28.88% over baseline models, excelling in tasks that require complex spatial relationships [10][11].
- Under five different interference conditions, Evo-0 consistently outperformed the baseline model pi0, demonstrating robustness [12][15].

Group 5: Conclusion
- Evo-0's key innovation is extracting rich spatial semantics through VGGT, bypassing depth estimation errors and sensor requirements, and thereby strengthening the spatial modeling capability of VLA models [16].
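The core operation of the cross-attention fusion module in Group 3 can be sketched as a single attention head where 2D visual tokens (queries) attend to VGGT-derived 3D tokens (keys/values). Evo-0's actual module uses learned projections and multiple heads; this bare-bones version shows the mechanism:

```python
import numpy as np

def cross_attention(tokens_2d, tokens_3d):
    """Single-head cross-attention: 2D visual tokens query the 3D tokens.
    Bare-bones sketch of the fusion module's core op; the real module has
    learned Q/K/V projections and multiple heads.
    tokens_2d: (N, D), tokens_3d: (M, D)."""
    d = tokens_2d.shape[-1]
    logits = tokens_2d @ tokens_3d.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ tokens_3d, attn

rng = np.random.default_rng(4)
fused, attn = cross_attention(rng.standard_normal((6, 32)),
                              rng.standard_normal((10, 32)))
```

Because only this fusion path (plus LoRA layers and the action expert) is trained, the geometric prior reaches the policy without retraining the large 2D backbone, which is what keeps the method lightweight.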
Liujuan Cao's Team at Xiamen University Presents FastVGGT: A 4× Speedup That Breaks VGGT's Inference Bottleneck and Reduces Cumulative Error!
具身智能之心· 2025-09-10 06:18
Core Viewpoint
- The article introduces FastVGGT, a training-free acceleration method that optimizes the VGGT model by addressing redundancy in its global attention mechanism, achieving up to 4× faster inference while maintaining reconstruction accuracy and mitigating cumulative error in 3D vision tasks [26].

Group 1: Main Contributions
- FastVGGT enables VGGT to process 1,000 input images in a single forward pass on a single GPU with 80 GB of VRAM, up from 300 images previously [5].
- The method achieves a 4× inference speedup on 1,000-image tasks while effectively reducing cumulative error [5][18].
- FastVGGT maintains high reconstruction quality, improving metrics such as Chamfer Distance (CD) from 0.471 to 0.425 [18].

Group 2: Bottleneck Analysis
- The analysis identifies significant redundancy in VGGT's global attention mechanism, leading to unnecessary computation [6][7].
- Cumulative error is exacerbated in long sequences because the global attention mechanism amplifies minor errors over time [6].

Group 3: Methodology
- Token merging strategies reduce redundancy in VGGT's attention computation, combining reference-frame constraints, key-token retention, and region-based sampling [9][11].
- Merging reduces the number of tokens entering attention, while token unmerging preserves the integrity of the dense 3D reconstruction outputs [15].

Group 4: Experimental Results
- FastVGGT significantly reduced inference time and improved reconstruction quality across ScanNet-50, 7Scenes, and NRGBD [22].
- In point cloud reconstruction, FastVGGT achieved a 4× inference speedup while maintaining reconstruction accuracy [18][22].
- The method also improved absolute trajectory error (ATE) and relative pose error (RPE), indicating better long-sequence inference [24].
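The merge/unmerge cycle in Group 3 can be sketched simply: assign every token to its most similar kept token, average each cluster before attention, then broadcast merged features back to all original slots so dense heads still see a full-length sequence. This is a simplification in the spirit of FastVGGT; the real method's choice of kept tokens uses reference-frame constraints and region-based sampling:

```python
import numpy as np

def merge_tokens(tokens, keep_idx):
    """Cluster every token onto its most cosine-similar kept token and
    average each cluster. Simplified sketch of training-free token merging."""
    f = tokens / np.maximum(np.linalg.norm(tokens, axis=-1, keepdims=True), 1e-8)
    keys = f[keep_idx]
    assign = (f @ keys.T).argmax(axis=-1)          # (N,) cluster per token
    merged = np.stack([
        tokens[assign == k].mean(axis=0) if (assign == k).any() else tokens[i]
        for k, i in enumerate(keep_idx)
    ])
    return merged, assign

def unmerge_tokens(merged, assign):
    """Broadcast merged features back to every original token slot."""
    return merged[assign]

rng = np.random.default_rng(5)
tokens = rng.standard_normal((16, 8))
merged, assign = merge_tokens(tokens, keep_idx=[0, 4, 8, 12])
restored = unmerge_tokens(merged, assign)
```

Since global attention cost grows quadratically with token count, shrinking 16 tokens to 4 here (or thousands to hundreds in practice) is where the speedup comes from, while unmerging keeps the dense reconstruction head's input shape unchanged.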
Just In: CVPR 2025 Awards Announced: Oxford & Meta PhD Student Jianyuan Wang Wins Best Paper; Saining Xie Takes the Young Researcher Award
机器之心· 2025-06-13 15:45
Core Insights
- The CVPR 2025 conference in Nashville, Tennessee awarded five papers: one best paper and four honorable mentions, plus one best student paper and one student-paper honorable mention [1][2].

Submission and Acceptance Statistics
- This year, over 40,000 authors submitted 13,008 papers, a 13% increase over last year's 11,532 submissions. A total of 2,872 papers were accepted, an overall acceptance rate of roughly 22.1%; of the accepted papers, 96 were oral presentations (3.3%) and 387 were highlights (13.7%) [3][5].

Conference Attendance
- The conference attracted more than 9,000 attendees from over 70 countries and regions [7].

Paper Acceptance by Field
- Image and video generation had the most accepted papers, while the highest acceptance rates were in 3D from multi-view and sensor data and in single-image 3D [8].

Best Paper Award
- The best paper, "VGGT: Visual Geometry Grounded Transformer," was presented by researchers from the University of Oxford and Meta AI. It introduces a general-purpose 3D vision model based on a purely feed-forward Transformer architecture, capable of inferring core geometric information from one or more images [13][14].

Notable Research Contributions
- The best paper demonstrated significant performance gains over traditional optimization-based methods and existing state-of-the-art models across a range of 3D tasks, achieving second-scale inference without post-processing optimization [17].

Best Student Paper
- The best student paper, "Neural Inverse Rendering from Propagating Light," proposed a physics-based, multi-view neural inverse rendering system for dynamically propagating light, achieving state-of-the-art 3D reconstruction under strong indirect lighting [53][55].

Awards and Recognitions
- Two Young Researcher Awards went to Hao Su and Saining Xie for their outstanding contributions to computer vision research [68][72]. The Longuet-Higgins Award was presented to two papers that have profoundly influenced the field: the Inception architecture and fully convolutional networks for semantic segmentation [75][78][80].