VGGT4D: Training-Free 4D Dynamic Scene Reconstruction That Unlocks the Potential of 3D Foundation Models
机器之心 · 2025-12-17 02:05
Core Insights
- The article covers VGGT4D, a framework from researchers at the Hong Kong University of Science and Technology (Guangzhou) and Horizon Robotics that lets 3D foundation models handle dynamic 4D scenes without any additional training cost [2][4][33]
- VGGT4D mines motion cues hidden in the attention layers of the Visual Geometry Grounded Transformer (VGGT) to improve dynamic object segmentation, camera pose estimation, and long-sequence 4D reconstruction [2][4][6]

Research Background
- 3D foundation models such as VGGT and DUSt3R excel at static scene reconstruction, but their performance drops significantly on dynamic 4D scenes that contain moving objects [6][7]
- Existing solutions tend to incur high computational cost and rely on external priors, which complicates the overall system [9][12]

Methodology
- VGGT4D introduces a training-free mechanism for attention feature mining and mask refinement, using Gram matrices and gradient flows to separate dynamic from static regions with high precision [14][17]
- To address the limitations of standard attention maps, the framework uses self-similarity Gram matrices, which raise the signal-to-noise ratio and make motion cues easier to extract [16][17]

Experimental Validation
- VGGT4D was evaluated on three tasks (dynamic object segmentation, camera pose estimation, and 4D point cloud reconstruction) across six benchmark datasets, where it outperformed the compared methods [22][23]
- In dynamic object segmentation, VGGT4D achieved the best performance on the DAVIS-2016 and DAVIS-2017 datasets, outperforming all compared variants without any 4D-specific training [24][25]
- In camera pose estimation, VGGT4D consistently improved on the already strong baseline of the original VGGT model; on the VKITTI dataset it achieved an Absolute Translation Error (ATE) of 0.164, compared with 2.272 for MonST3R [27][28]

Conclusion
- VGGT4D extends 3D foundation models to dynamic 4D scenes by extracting features already present inside the model, providing a low-cost route to 4D reconstruction and showing the potential of foundation models for zero-shot transfer tasks [33]
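
The self-similarity Gram matrix idea from the Methodology section can be illustrated with a toy example. The summary does not spell out the exact computation, so the sketch below is an assumption-laden stand-in rather than the authors' method: it builds a Gram matrix from L2-normalized token features for each frame, then scores each token by how much its row of similarities varies across frames. The function names (`gram_matrix`, `motion_saliency`, `dynamic_mask`), the toy features, and the threshold are all hypothetical; the real framework works on VGGT attention-layer features and additionally refines masks with gradient flows, which is not shown here.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (returned unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def gram_matrix(features):
    """Self-similarity Gram matrix G = F F^T over L2-normalized token features."""
    unit = [l2_normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def motion_saliency(frames):
    """Hypothetical score: mean temporal variance of each token's Gram row.

    frames: list of frames, each a list of per-token feature vectors,
    with the same token count and ordering in every frame.
    """
    grams = [gram_matrix(f) for f in frames]
    n_frames = len(grams)
    n_tokens = len(grams[0])
    saliency = []
    for i in range(n_tokens):
        total = 0.0
        for j in range(n_tokens):
            values = [g[i][j] for g in grams]
            mean = sum(values) / n_frames
            total += sum((v - mean) ** 2 for v in values) / n_frames
        saliency.append(total / n_tokens)
    return saliency


def dynamic_mask(saliency, threshold):
    """Mark tokens whose saliency exceeds an assumed, hand-picked threshold."""
    return [s > threshold for s in saliency]


# Toy input: tokens 0 and 1 keep the same feature in both frames,
# while token 2 changes, standing in for a moving object.
frame_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
frame_b = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

example_saliency = motion_saliency([frame_a, frame_b])
example_mask = dynamic_mask(example_saliency, threshold=0.1)
```

In this toy case the changing token receives roughly twice the saliency of the two static tokens, because every entry in its Gram row shifts between frames while each static token only sees a change in the single entry that relates it to the moving one. That gap is what a threshold separates; how the actual framework aggregates Gram statistics across layers and frames is not described in this summary.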