VGGT4D: Training-Free Reconstruction of 4D Dynamic Scenes
具身智能之心·2025-12-18 00:07

Core Insights
- The article presents VGGT4D, a framework that extends 3D foundation models to dynamic 4D scenes without any additional training cost, by mining motion cues already encoded in the Visual Geometry Grounded Transformer (VGGT) [1][3][32]

Research Background
- Recent 3D foundation models such as VGGT and DUSt3R excel at static scene reconstruction but degrade on dynamic 4D scenes containing moving objects [6]
- Existing approaches struggle to extract 4D perception capabilities from pre-trained 3D models without additional training [7]

Methodology
- VGGT4D introduces a training-free mechanism for attention-feature mining and mask refinement, using Gram matrices and gradient flows for high-precision dynamic–static separation [12]
- Self-similarity Gram matrices address the distributional gap in standard attention maps, allowing motion cues to be extracted more cleanly [15]
- Projection-gradient-aware refinement sharpens mask boundaries by exploiting geometric projection residuals [17]
- An in-distribution early-stage masking strategy suppresses dynamic tokens in shallow layers, preventing performance degradation at inference [19]

Experimental Validation
- VGGT4D outperforms competing variants on dynamic object segmentation across multiple datasets, achieving the best results on DAVIS-2016 and DAVIS-2017 without any 4D-specific training [21][22]
- In camera pose estimation, it consistently improves on the strong baseline set by the original VGGT, demonstrating robustness to dynamic objects [25]
- On long sequences, it achieves the best results on the challenging PointOdyssey benchmark while remaining efficient [26]
- In 4D point cloud reconstruction on the DyCheck dataset, median accuracy error drops from 0.009 to 0.004 and average distance from 0.150 to 0.123 [28][29]

Conclusion
- VGGT4D successfully extends 3D foundation models to 4D dynamic scenes by mining their internal motion cues, offering a low-cost approach to 4D reconstruction and showcasing the zero-shot transfer potential of foundation models [32]
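The self-similarity idea behind the Gram-matrix step can be illustrated with a toy sketch: tokens belonging to static structure keep a stable similarity profile across frames, while dynamic tokens' profiles drift, so temporal deviation of per-frame Gram rows gives a dynamic-saliency map. This is a minimal NumPy illustration of that principle, not the authors' implementation; the function name and the deviation measure are assumptions.

```python
import numpy as np

def dynamic_saliency_from_features(feats):
    """Per-frame dynamic saliency from token self-similarity (illustrative sketch).

    feats: (T, N, C) array of per-frame token features (e.g. from a ViT layer).
    Returns a (T, N) map: tokens whose self-similarity profile is unstable
    across frames score high (candidate dynamic regions).
    """
    # L2-normalize features so the Gram matrix holds cosine similarities.
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    # Per-frame self-similarity Gram matrix: (T, N, N).
    gram = np.einsum('tnc,tmc->tnm', f, f)
    # Static tokens keep near-constant Gram rows over time; score each token
    # by how far its similarity profile deviates from the temporal mean.
    mean_profile = gram.mean(axis=0, keepdims=True)       # (1, N, N)
    saliency = np.abs(gram - mean_profile).mean(axis=-1)  # (T, N)
    # Min-max normalize per frame so the map can be thresholded into a mask.
    lo = saliency.min(axis=1, keepdims=True)
    hi = saliency.max(axis=1, keepdims=True)
    return (saliency - lo) / (hi - lo + 1e-8)
```

Thresholding this map per frame would yield the coarse dynamic mask that the paper's projection-gradient-aware refinement then sharpens.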
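The early-stage masking strategy can likewise be sketched with a toy single-head attention layer: given a binary dynamic mask over tokens, attention logits toward dynamic tokens are suppressed so that static tokens aggregate geometry without contamination from movers. The function and masking constant here are hypothetical illustrations, not VGGT4D's actual layers.

```python
import numpy as np

def masked_self_attention(x, dynamic_mask, suppress=True):
    """Single-head self-attention that can block dynamic tokens as keys.

    x: (N, C) token features; dynamic_mask: (N,) bool, True = dynamic token.
    Illustrative sketch of shallow-layer dynamic-token suppression.
    """
    n, c = x.shape
    logits = (x @ x.T) / np.sqrt(c)  # query-key similarity
    if suppress:
        # Block attention *to* dynamic tokens so they cannot pollute the
        # static tokens' aggregated geometry features.
        logits[:, dynamic_mask] = -1e9
    # Numerically stable softmax over keys.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights
```

Applying this only in shallow layers, as the summary describes, keeps the model's deeper computation in-distribution while removing the dynamic tokens' early influence.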
