VGGT4D
VGGT4D: Training-Free 4D Dynamic Scene Reconstruction
具身智能之心· 2025-12-18 00:07
How can 3D foundation models trained on static scenes be given the ability to handle dynamic 4D scenes without adding any training cost? A research team from the Hong Kong University of Science and Technology (Guangzhou) and Horizon Robotics proposes VGGT4D. By analyzing the internal mechanisms of the Visual Geometry Transformer (VGGT), the work discovers and exploits motion cues hidden in its attention layers. As a training-free framework, VGGT4D achieves strong performance on dynamic object segmentation, camera pose estimation, and long-sequence 4D reconstruction. Paper title: VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction. Paper link: https://arxiv.org/abs/2511.19971 ...
Mining Motion Cues in Attention: Unlocking 4D Scene Reconstruction Without Training
量子位· 2025-12-17 09:07
Contributed by the VGGT4D team | QbitAI. How can 3D foundation models trained on static scenes be given the ability to handle dynamic 4D scenes without adding any training cost? A research team from the Hong Kong University of Science and Technology (Guangzhou) and Horizon Robotics proposes VGGT4D. By analyzing the internal mechanisms of the Visual Geometry Transformer (VGGT), the work discovers and exploits motion cues hidden in its attention layers. The core question behind VGGT4D: can 4D perception be mined directly from a pretrained 3D foundation model, without any additional training? As a training-free framework, VGGT4D achieves strong performance on dynamic object segmentation, camera pose estimation, and long-sequence 4D reconstruction.

The challenge of going from 3D to 4D: In recent years, 3D foundation models such as VGGT and DUSt3R have excelled at static scene reconstruction. However, their performance often degrades sharply on dynamic 4D scenes containing moving objects (e.g., pedestrians, vehicles). Object motion not only disturbs the geometric modeling of the background but also causes severe camera pose drift. Existing solutions typically face two kinds of challenges: high computational or training cost, relying on heavy test-time ...
VGGT4D: Training-Free Mining of 3D Foundation Models for 4D Dynamic Scene Reconstruction
机器之心· 2025-12-17 02:05
Core Insights
- The article discusses VGGT4D, a framework developed by researchers from the Hong Kong University of Science and Technology (Guangzhou) and Horizon Robotics that enables 3D foundation models to handle dynamic 4D scenes without additional training cost [2][4][33]
- VGGT4D leverages motion cues hidden within the attention layers of the Visual Geometry Transformer (VGGT) to improve performance on dynamic object segmentation, camera pose estimation, and long-sequence 4D reconstruction [2][4][6]

Research Background
- 3D foundation models such as VGGT and DUSt3R excel at static scene reconstruction but struggle with dynamic 4D scenes that contain moving objects, leading to significant performance drops [6][7]
- Existing solutions often suffer from high computational cost or rely on external priors, which complicates the system [9][12]

Methodology
- VGGT4D introduces a training-free mechanism for attention feature mining and mask refinement, using Gram matrices and gradient flows for high-precision dynamic-static separation [14][17]
- To address the low signal-to-noise ratio of standard attention maps, the framework computes self-similarity Gram matrices, which sharpen the motion cues that can be extracted [16][17]; a minimal sketch of this Gram-matrix idea follows this summary

Experimental Validation
- VGGT4D was evaluated on dynamic object segmentation, camera pose estimation, and 4D point cloud reconstruction across six benchmark datasets, outperforming competing methods [22][23]
- In dynamic object segmentation, VGGT4D achieved the best results on the DAVIS-2016 and DAVIS-2017 datasets, surpassing all variants without any 4D-specific training [24][25]
- For camera pose estimation, VGGT4D consistently improved on the strong baseline set by the original VGGT, reaching an Absolute Trajectory Error (ATE) of 0.164 on the VKITTI dataset versus 2.272 for MonST3R [27][28]

Conclusion
- VGGT4D extends 3D foundation models to 4D dynamic scenes purely through internal feature mining, offering a low-cost route to 4D reconstruction and demonstrating the zero-shot transfer potential of foundation models [33]
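To make the Gram-matrix step above concrete, here is a minimal, self-contained sketch of how a temporal self-similarity Gram matrix over per-token features could be turned into a dynamic/static score. This is not the released VGGT4D code: the tensor shapes, the per-token temporal Gram matrix, the scoring rule, and the threshold are all illustrative assumptions, and the gradient-flow and mask-refinement stages described in the article are not shown.

```python
import torch

def dynamic_token_scores(feats: torch.Tensor) -> torch.Tensor:
    """feats: (T, N, C) token features for T frames, N tokens per frame, C channels.
    Returns an (N,) score in [0, 1]; higher means more likely to lie on a moving object."""
    T, N, C = feats.shape
    f = torch.nn.functional.normalize(feats, dim=-1)   # unit-norm so Gram entries are cosines
    # Per-token temporal Gram matrix: gram[n, t, s] = <f[t, n], f[s, n]>, shape (N, T, T).
    gram = torch.einsum("tnc,snc->nts", f, f)
    # Static tokens stay self-similar across frames (Gram close to all-ones);
    # tokens on moving objects decorrelate, so their mean off-diagonal similarity drops.
    off_diag = gram - torch.eye(T).unsqueeze(0)         # zero out the diagonal (which is 1)
    mean_sim = off_diag.sum(dim=(1, 2)) / (T * (T - 1))
    return (1.0 - mean_sim).clamp(0.0, 1.0)

def dynamic_mask(feats: torch.Tensor, thresh: float = 0.35) -> torch.Tensor:
    """Binary (N,) mask of tokens flagged as dynamic; `thresh` is a made-up default."""
    return dynamic_token_scores(feats) > thresh

if __name__ == "__main__":
    # Toy usage: 8 frames, a 14x14 token grid, 64-dim features (random stand-ins).
    feats = torch.randn(8, 14 * 14, 64)
    mask = dynamic_mask(feats)
    print("dynamic tokens:", int(mask.sum()), "of", mask.numel())
```

In this toy form, normalizing the features makes each Gram entry a cosine similarity, so a static token's temporal Gram matrix stays near all-ones while a moving object's decorrelates over time; thresholding that gap is what separates dynamic from static tokens in the sketch.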