VGGT
TALO: An Outdoor Reconstruction System That Supports Arbitrary 3D Foundation Models and Arbitrary Camera Configurations
自动驾驶之心· 2026-01-08 09:07
Paper authors | Fengyi Zhang et al. Editor | 自动驾驶之心 3D vision foundation models: from offline reconstruction to online incremental reconstruction. Recently, the emergence of 3D vision foundation models such as VGGT, π³, and MapAnything has marked a shift in 3D reconstruction toward an end-to-end, data-driven paradigm. In a single forward pass, these models directly predict camera intrinsics, camera poses, and dense geometry from the input images, greatly simplifying the traditional reconstruction pipeline and showing strong cross-scene generalization. Their success rests on large-scale annotated 3D datasets and the large Transformer architectures trained on them, which let the models jointly learn multi-view geometry, inter-view relationships, and scene-structure priors. However, most existing foundation models are designed for offline scene reconstruction, where the complete image sequence is available at inference time. In real-world applications such as autonomous driving and robotic manipulation, the system instead needs online reconstruction capability: the model should incrementally reconstruct new regions as data arrives, rather than processing all images at once. Although a few works such as CUT3R attempt, at the model level, to directly support ...
Mining Motion Cues from Attention: Unlocking 4D Scene Reconstruction Without Training
量子位· 2025-12-17 09:07
Contributed by the VGGT4D team. 量子位 | QbitAI How can 3D foundation models trained on static scenes handle dynamic 4D scenes without any added training cost? A research team from the Hong Kong University of Science and Technology (Guangzhou) and Horizon Robotics proposes VGGT4D. By analyzing the internal mechanisms of the Visual Geometry Grounded Transformer (VGGT), the work discovers and exploits motion cues hidden in its attention layers. The core question behind VGGT4D: can 4D perception be mined directly from a pre-trained 3D foundation model, without any extra training? As a training-free framework, VGGT4D achieves strong results on dynamic object segmentation, camera pose estimation, and long-sequence 4D reconstruction. The challenge of going from 3D to 4D: in recent years, 3D foundation models such as VGGT and DUSt3R have performed well on static scene reconstruction, but when facing dynamic 4D scenes containing moving objects (e.g., pedestrians and vehicles), their performance often drops significantly. Object motion not only disturbs background geometry modeling but also causes severe camera pose drift. Existing solutions typically face two kinds of challenges: high computational or training cost, relying on heavy test-time ...
VGGT4D: Training-Free Mining of 3D Foundation Models for 4D Dynamic Scene Reconstruction
机器之心· 2025-12-17 02:05
Core Insights
- The article discusses VGGT4D, a framework developed by researchers from the Hong Kong University of Science and Technology (Guangzhou) and Horizon Robotics, aimed at enabling 3D foundation models to handle dynamic 4D scenes without additional training cost [2][4][33]
- VGGT4D leverages hidden motion cues within the attention layers of the Visual Geometry Grounded Transformer (VGGT) to improve performance on tasks such as dynamic object segmentation, camera pose estimation, and long-sequence 4D reconstruction [2][4][6]
Research Background
- Traditional 3D foundation models like VGGT and DUSt3R excel at static scene reconstruction but struggle with dynamic 4D scenes that include moving objects, leading to significant performance drops [6][7]
- Existing solutions often face challenges such as high computational cost and reliance on external priors, which complicate the system [9][12]
Methodology
- VGGT4D introduces a training-free mechanism for attention feature mining and mask refinement, using Gram matrices and gradient flows for high-precision dynamic-static separation (see the sketch after this summary) [14][17]
- The framework addresses the limitations of standard attention maps by employing self-similarity Gram matrices to raise the signal-to-noise ratio, allowing better extraction of motion cues [16][17]
Experimental Validation
- VGGT4D was evaluated on dynamic object segmentation, camera pose estimation, and 4D point cloud reconstruction across six benchmark datasets, demonstrating superior performance compared to other methods [22][23]
- In dynamic object segmentation, VGGT4D achieved the best performance on the DAVIS-2016 and DAVIS-2017 datasets, outperforming all variants without requiring any 4D-specific training [24][25]
- For camera pose estimation, VGGT4D consistently improved on the strong baseline set by the original VGGT model, achieving an Absolute Trajectory Error (ATE) of 0.164 on the VKITTI dataset, compared to 2.272 for MonST3R [27][28]
Conclusion
- VGGT4D extends the capabilities of 3D foundation models to 4D dynamic scenes through effective internal feature extraction, providing a low-cost route to 4D reconstruction and showcasing the potential of foundation models in zero-shot transfer tasks [33]
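The Gram-matrix idea above lends itself to a compact illustration. The following Python sketch shows one generic way to score a frame's tokens by self-similarity and flag low-affinity tokens as dynamic; the function name, the mean-affinity score, and the threshold are illustrative assumptions, not VGGT4D's actual mask-refinement procedure.

```python
import torch

def dynamic_token_mask(tokens: torch.Tensor, thresh: float = 0.5) -> torch.Tensor:
    """Rough sketch: flag tokens whose self-similarity pattern deviates from the
    dominant (static background) pattern.

    tokens: (N, D) per-frame token features taken from an attention layer.
    Returns a boolean mask of shape (N,) where True marks likely dynamic tokens.
    """
    # L2-normalize so the Gram matrix holds cosine similarities.
    f = torch.nn.functional.normalize(tokens, dim=-1)   # (N, D)
    gram = f @ f.T                                       # (N, N) self-similarity
    # Static-background tokens tend to be mutually consistent, so their average
    # similarity to the rest of the frame is high; moving-object tokens score lower.
    affinity = gram.mean(dim=-1)                         # (N,)
    score = (affinity - affinity.min()) / (affinity.max() - affinity.min() + 1e-8)
    return score < thresh                                # low affinity -> dynamic
```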
Fudan's Latest DriveVGGT: Efficient Multi-Camera 4D Reconstruction for Autonomous Driving
自动驾驶之心· 2025-12-17 00:03
Paper authors | Xiaosong Jia et al. Editor | 自动驾驶之心 4D scene reconstruction in autonomous driving is a key step toward environment perception and motion planning, yet traditional visual geometry models often underperform in multi-camera, low-overlap driving scenarios. Researchers from Shanghai Jiao Tong University, Fudan University, and other institutions propose DriveVGGT, a visual geometry Transformer designed for autonomous driving. By explicitly introducing relative camera-pose priors, it significantly improves the geometric prediction consistency and inference efficiency of multi-camera systems. Background: 4D reconstruction is a computer vision task that predicts geometric information from visual sensors. Compared with other sensors, camera-based reconstruction is widely studied and applied, especially in autonomous driving and robotics, because of its low cost. Reconstruction methods generally fall into two types. The first is iteration-based methods, which require selecting a specific scene or object and obtaining an optimized result through iterative reconstruction; due to limited generalization, these methods must retrain the model whenever the scene or object changes. The second is feed-forward methods. These methods ...
Four Top Universities Join Forces on OmniVGGT: An Omni-Modal Visual Geometry Transformer!
自动驾驶之心· 2025-11-17 00:05
Core Insights
- The article discusses the need for a "universal multimodal" 3D model, highlighting the limitations of current models that rely primarily on RGB images and fail to exploit additional geometric information [5][6][9]
- The proposed OmniVGGT framework allows flexible integration of any number of auxiliary geometric modalities during training and inference, significantly improving performance across various 3D tasks [6][9][10]
Group 1: Need for Universal Multimodal 3D Models
- Current mainstream 3D models, such as VGGT, can only process RGB images and do not utilize depth or camera parameters, leading to inefficiencies in real-world applications [5]
- OmniVGGT addresses this "information waste" and poor adaptability by fully leveraging available auxiliary information, without compromising performance when only RGB input is used [9][10]
Group 2: Core Innovations of OmniVGGT
- OmniVGGT achieves top-tier performance in tasks like monocular/multi-view depth estimation and camera pose estimation, even outperforming existing methods with RGB input alone [7][29]
- The framework integrates into vision-language-action (VLA) models, significantly enhancing robotic manipulation tasks [7][29]
Group 3: Technical Components (see the sketch after this summary)
- The GeoAdapter component injects geometric information (depth, camera parameters) into the base model without disrupting the original feature space, keeping computational overhead low [10][16]
- A random multimodal fusion strategy is used during training so the model learns robust spatial representations and does not over-rely on auxiliary information [22][23]
Group 4: Experimental Results
- OmniVGGT was trained on 19 public datasets and demonstrates superior performance across multiple 3D tasks, with significant improvements in metrics such as absolute relative error and accuracy [29][30]
- The framework shows that the more auxiliary information is provided, the better the performance, with notable gains in depth estimation and camera pose accuracy [30][34]
Group 5: Practical Implications
- OmniVGGT's design allows flexible input combinations of auxiliary geometric modalities, making it practical for a range of 3D modeling and robotics applications [53][54]
- The model's efficiency, requiring only 0.2 seconds for inference, positions it as a leading solution in the field [42][40]
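To make the GeoAdapter and random multimodal fusion ideas concrete, here is a minimal PyTorch sketch. The zero-initialized residual projection, the per-modality dropout, and the batch keys are assumptions about the general mechanism; OmniVGGT's real modules and tokenization will differ.

```python
import random
import torch
import torch.nn as nn

class GeoAdapterSketch(nn.Module):
    """Illustrative adapter: injects optional geometric features (e.g. encoded depth
    or camera embeddings) into RGB tokens via a zero-initialized residual, so the
    base model's feature space is untouched when no auxiliary input is provided."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)   # start as a no-op: no perturbation of RGB tokens
        nn.init.zeros_(self.proj.bias)

    def forward(self, rgb_tokens: torch.Tensor, geo_tokens: torch.Tensor | None = None):
        if geo_tokens is None:             # RGB-only inference still works unchanged
            return rgb_tokens
        return rgb_tokens + self.proj(geo_tokens)

def sample_auxiliary_modalities(batch: dict, p_drop: float = 0.5) -> dict:
    """Random multimodal fusion at training time: each auxiliary modality is dropped
    independently, so the model never becomes dependent on any single extra input."""
    kept = dict(batch)
    for key in ("depth", "intrinsics", "poses"):   # hypothetical batch keys
        if key in kept and random.random() < p_drop:
            kept[key] = None
    return kept
```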
HKUST (Guangzhou) & Tsinghua Jointly Propose Spatial Forcing: Implicit Spatial Alignment That Outperforms Mainstream 2D/3D VLA Models
具身智能之心· 2025-10-18 16:03
Core Insights
- The article discusses the limitations of current vision-language-action (VLA) models, which rely primarily on 2D visual data and lack a deep understanding of real 3D space, hampering their ability to act in the physical world [2][4]
- The proposed method, Spatial Forcing (SF), lets VLA models develop spatial understanding without explicit 3D input by aligning their visual features with a powerful 3D geometric representation produced by an external model [2][10]
Methodology (see the sketch after this summary)
- The SF method employs an implicit spatial alignment strategy, enabling the model to acquire spatial understanding during training without additional 3D sensors [2][13]
- A depth-probing experiment examined whether 3D information is present in the original VLA's visual features, revealing that without 3D input the model cannot form accurate spatial perception [11][13]
- Training aligns the VLA model's visual tokens with pixel-level spatial representations extracted from a pre-trained 3D model, jointly optimizing a spatial alignment loss and an action generation loss [16]
Performance Results
- The SF method significantly outperforms existing 2D and 3D VLA models across tasks, improving training efficiency by up to 3.8× and data utilization efficiency by up to 5.9× [14]
- In experiments, the Spatial Forcing model achieved success rates of 99.4% on spatial tasks, 99.6% on object tasks, and 98.8% on goal tasks, demonstrating superior performance compared to other models [18]
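The two-term objective described under Methodology can be sketched as follows. The cosine-distance alignment term, the MSE action loss, and all names are illustrative assumptions; the paper's exact losses and feature dimensions may differ.

```python
import torch
import torch.nn.functional as F

def spatial_forcing_loss(visual_tokens: torch.Tensor,
                         target_3d_feats: torch.Tensor,
                         pred_actions: torch.Tensor,
                         gt_actions: torch.Tensor,
                         align_weight: float = 1.0) -> torch.Tensor:
    """Sketch of a joint spatial-alignment + action objective.

    visual_tokens:   (B, N, D) tokens from the VLA's vision backbone (after a small
                     projection head to match the 3D feature width).
    target_3d_feats: (B, N, D) frozen pixel-aligned features from a pre-trained
                     3D geometry model (e.g. a VGGT-style encoder).
    """
    # Alignment term: pull VLA visual tokens toward the 3D model's representation.
    align = 1.0 - F.cosine_similarity(visual_tokens, target_3d_feats.detach(), dim=-1).mean()
    # Action term: the usual behavior-cloning regression loss.
    action = F.mse_loss(pred_actions, gt_actions)
    return action + align_weight * align
```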
A Major Upgrade for Robot Perception: Lightweight Injection of Geometric Priors Lifts Success Rates by 31%
36Kr· 2025-09-28 12:09
Core Insights
- The article discusses the challenges of enabling AI to truly "understand" the 3D world, particularly for vision-language-action (VLA) models that rely on 2D image-text data [1][2]
Group 1: VLA Model Limitations
- Current VLA models lack the 3D spatial understanding needed for real-world manipulation, relying mainly on pre-trained vision-language models [1]
- Existing enhancement methods based on explicit depth input face deployment difficulties and issues with depth precision and noise [1]
Group 2: Evo-0 Model Introduction
- Shanghai Jiao Tong University and the University of Cambridge propose Evo-0, a lightweight method that enhances the spatial understanding of VLA models by implicitly injecting 3D geometric priors, without explicit depth input or additional sensors [2]
- Evo-0 uses the Visual Geometry Grounded Transformer (VGGT) to extract 3D structural information from multi-view RGB images, significantly improving spatial perception [2][3]
Group 3: Model Architecture and Training (see the sketch after this summary)
- Evo-0 integrates VGGT as a spatial encoder, introducing 3D tokens that carry depth context and cross-view spatial correspondence [3]
- A cross-attention fusion module merges 2D visual tokens with the 3D tokens, improving the understanding of spatial structure and object layout [3][6]
- Training is efficient: only the fusion module, the LoRA adaptation layers, and the action expert are fine-tuned, reducing computational cost [6]
Group 4: Experimental Results
- On RLBench simulation tasks, Evo-0 improved the average success rate by more than 28.88% over baseline models, excelling in tasks that require reasoning about complex spatial relationships [10][11]
- Evo-0's robustness was tested under five interference conditions, where it consistently outperformed the baseline model pi0 [12][15]
Group 5: Conclusion
- Evo-0's key innovation is extracting rich spatial semantics through VGGT, bypassing depth-estimation errors and sensor requirements and thereby strengthening the spatial modeling capability of VLA models [16]
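A minimal sketch of the cross-attention fusion step, assuming both token streams have already been projected to a common width; the residual form, layer sizes, and names are illustrative rather than Evo-0's exact architecture.

```python
import torch
import torch.nn as nn

class CrossAttnFusion(nn.Module):
    """Illustrative fusion block: 2D visual tokens attend to 3D tokens produced by
    a frozen VGGT-style encoder, and the result is added back residually."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens_2d: torch.Tensor, tokens_3d: torch.Tensor) -> torch.Tensor:
        # tokens_2d: (B, N2, D) from the VLM's vision encoder
        # tokens_3d: (B, N3, D) spatial tokens from the frozen 3D branch
        fused, _ = self.attn(query=self.norm(tokens_2d), key=tokens_3d, value=tokens_3d)
        return tokens_2d + fused   # residual keeps the original 2D semantics intact
```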
FastVGGT from Liujuan Cao's Team at Xiamen University: 4× Speedup, Breaking the VGGT Inference Bottleneck and Reducing Cumulative Error!
具身智能之心· 2025-09-10 06:18
Core Viewpoint
- The article introduces FastVGGT, a training-free acceleration method that optimizes the VGGT model by addressing redundancy in its global attention mechanism, achieving up to 4× faster inference while maintaining reconstruction accuracy and mitigating cumulative error in 3D visual tasks [26]
Group 1: Main Contributions
- FastVGGT enables VGGT to process 1000 input images in a single forward pass on a single GPU with 80 GB of VRAM, up from 300 images previously [5]
- The method achieves a 4× speedup in inference time on 1000-image tasks while effectively reducing cumulative error [5][18]
- FastVGGT maintains high reconstruction quality, improving metrics such as Chamfer Distance (CD) from 0.471 to 0.425 [18]
Group 2: Bottleneck Analysis
- The analysis identifies significant redundancy in VGGT's global attention mechanism, which leads to unnecessary computation [6][7]
- Cumulative error is exacerbated on long sequences because the global attention mechanism amplifies minor errors over time [6]
Group 3: Methodology (see the sketch after this summary)
- Token merging strategies are introduced to reduce the redundancy in VGGT's attention computation, including reference-frame constraints, key-token retention, and region-based sampling [9][11]
- Token merging reduces the number of tokens involved in attention, while token unmerging preserves the integrity of the dense 3D reconstruction outputs [15]
Group 4: Experimental Results
- FastVGGT shows a significant reduction in inference time and improved reconstruction quality across datasets including ScanNet-50, 7Scenes, and NRGBD [22]
- In point cloud reconstruction tasks, FastVGGT achieves a 4× inference speedup while maintaining reconstruction accuracy [18][22]
- The method also improves absolute trajectory error (ATE) and relative pose error (RPE), indicating stronger performance on long-sequence inference [24]
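The merge-attend-unmerge pattern can be illustrated with a toy pass over one frame's tokens. The regular subsampling below stands in for FastVGGT's reference-frame constraints, key-token retention, and region-based sampling, which this sketch does not reproduce; all names are hypothetical.

```python
import torch

def merge_unmerge_attention(tokens: torch.Tensor, attn_fn, keep_every: int = 4):
    """Toy merge -> attend -> unmerge pass for one frame's tokens.

    tokens:  (N, D) token features.
    attn_fn: any callable mapping (M, D) -> (M, D), e.g. one global-attention block.
    """
    n = tokens.shape[0]
    keep_idx = torch.arange(0, n, keep_every)            # representative tokens
    reps = tokens[keep_idx]                               # (M, D), M is about N / keep_every
    # Assign every token to its nearest representative by cosine similarity.
    sim = torch.nn.functional.normalize(tokens, dim=-1) @ \
          torch.nn.functional.normalize(reps, dim=-1).T   # (N, M)
    assign = sim.argmax(dim=-1)                           # (N,)
    out_reps = attn_fn(reps)                               # attention runs on the merged set only
    # Unmerge: broadcast each representative's update back to its members, so the
    # dense per-token output needed for 3D reconstruction is preserved.
    delta = out_reps - reps
    return tokens + delta[assign]
```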
Just In: CVPR 2025 Awards Announced: Oxford & Meta PhD Student Jianyuan Wang Wins Best Paper; Saining Xie Takes the Young Researcher Award
机器之心· 2025-06-13 15:45
Core Insights
- The CVPR 2025 conference in Nashville, Tennessee, awarded five papers, including one best paper and four honorable mentions, along with one best student paper and one student-paper honorable mention [1][2]
Submission and Acceptance Statistics
- This year, over 40,000 authors submitted 13,008 papers, a 13% increase from last year's 11,532 submissions. A total of 2,872 papers were accepted, for an overall acceptance rate of approximately 22.1%; among them, 96 were oral presentations (3.3%) and 387 were highlights (13.7%) [3][5]
Conference Attendance
- The conference attracted over 9,000 attendees from more than 70 countries and regions [7]
Paper Acceptance by Field
- Image and video generation had the most accepted papers, while the highest acceptance rates were in 3D from multi-view and sensor data and in single-image 3D [8]
Best Paper Award
- The best paper, "VGGT: Visual Geometry Grounded Transformer," was presented by researchers from the University of Oxford and Meta AI. It introduces a general-purpose 3D vision model built on a pure feed-forward Transformer architecture, capable of inferring core geometric information from one or more images [13][14]
Notable Research Contributions
- The best paper demonstrated significant performance improvements over traditional optimization methods and existing state-of-the-art models across various 3D tasks, achieving inference in seconds without post-processing optimization [17]
Best Student Paper
- The best student paper, "Neural Inverse Rendering from Propagating Light," proposed a physics-based, multi-view, dynamic light-propagation neural inverse rendering system, achieving state-of-the-art 3D reconstruction under strong indirect lighting [53][55]
Awards and Recognitions
- Two Young Researcher Awards went to Hao Su and Saining Xie for their outstanding contributions to computer vision research [68][72]. The Longuet-Higgins Award was presented to two papers that have significantly influenced the field: the Inception architecture and fully convolutional networks for semantic segmentation [75][78][80]