SpatialTrackerV2

SpatialTrackerV2: An Open-Source, Feed-Forward, Scalable 3D Point Tracking Method
自动驾驶之心 (Heart of Autonomous Driving) · 2025-07-20 08:36
Core Viewpoint

The article discusses SpatialTrackerV2, a state-of-the-art method for 3D point tracking from monocular video that integrates video depth, camera ego-motion, and object motion into a fully differentiable pipeline for scalable joint training [7][37].

Group 1: Current Issues in 3D Point Tracking
- 3D point tracking aims to recover long-term 3D trajectories of arbitrary points from monocular video, and shows strong potential in applications such as robotics and video generation [4].
- Existing solutions rely heavily on low- and mid-level visual models, which incurs high computational cost, and their need for ground-truth 3D trajectories as supervision limits scalability [6][10].

Group 2: Proposed Solution - SpatialTrackerV2
- SpatialTrackerV2 decomposes 3D point tracking into three components: video depth, camera ego-motion, and object motion, and integrates them into a fully differentiable framework [7].
- The architecture pairs a front end for video depth estimation and camera pose initialization with a back end for joint motion optimization, using a novel SyncFormer module to model correlations between 2D and 3D features [7][30].

Group 3: Performance Evaluation
- The method achieved new state-of-the-art results on the TAPVid-3D benchmark, with scores of 21.2 AJ and 31.0 APD3D, improvements of 61.8% and 50.5% over the previous best [9].
- SpatialTrackerV2 also leads in video depth and camera pose consistency estimation, outperforming existing methods such as MegaSAM while running roughly 50 times faster at inference [9].

Group 4: Training and Optimization Process
- Training exploits consistency constraints between static and dynamic points for 3D tracking, allowing effective optimization even with limited depth supervision [8][19].
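The static-point consistency idea can be illustrated with a small sketch. This is a hypothetical toy (the function names and setup are my own, not the paper's code): a 2D track is unprojected with per-frame depth and camera pose; if the point is static, its world-frame positions across frames should coincide, so the spread of those positions is a usable training residual even without ground-truth 3D trajectories.

```python
# Toy sketch of a static-point consistency residual (assumed API, not the
# paper's implementation): unproject a tracked pixel with per-frame depth and
# camera pose, and measure how far the per-frame world points scatter.
import numpy as np

def unproject(uv, depth, K, R, t):
    """Lift pixel (u, v) with depth d into world coordinates.

    K: 3x3 intrinsics; R, t: world-to-camera pose, i.e. a world point X
    is seen by the camera as R @ X + t.
    """
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    X_cam = depth * ray                # point in camera coordinates
    return R.T @ (X_cam - t)          # back to world coordinates

def static_consistency_residual(track_uv, depths, Ks, Rs, ts):
    """Mean distance of per-frame unprojections from their centroid.

    Zero exactly when every frame agrees on one world point, i.e. when
    depth and camera pose are mutually consistent for a static point.
    """
    pts = np.stack([unproject(uv, d, K, R, t)
                    for uv, d, K, R, t in zip(track_uv, depths, Ks, Rs, ts)])
    return float(np.linalg.norm(pts - pts.mean(axis=0), axis=1).mean())

# Toy check: one static world point observed by two cameras.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
X = np.array([0.5, -0.2, 4.0])                   # ground-truth static point
Rs = [np.eye(3), np.eye(3)]
ts = [np.zeros(3), np.array([-0.3, 0.0, 0.0])]   # second camera shifted
track_uv, depths = [], []
for R, t in zip(Rs, ts):
    Xc = R @ X + t
    uvw = K @ Xc
    track_uv.append(uvw[:2] / uvw[2])
    depths.append(Xc[2])
res = static_consistency_residual(track_uv, depths, [K, K], Rs, ts)
print(f"residual for consistent depth/pose: {res:.2e}")  # ≈ 0
```

With noisy depth or pose, the residual grows, which is what makes it usable as a self-supervised signal for static points.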
- The model employs a bundle-optimization approach to refine depth and camera pose estimates iteratively, incorporating various loss functions to ensure accuracy [24][26].

Group 5: Conclusion
- SpatialTrackerV2 represents a significant advancement in 3D point tracking, providing a robust foundation for motion understanding in real-world scenarios and pushing toward "physical intelligence" through the exploration of large-scale visual data [37].
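The iterative refinement described in Group 4 can be sketched in miniature. This is a deliberately reduced, hypothetical example (names and setup are mine, not the paper's back end): only a camera translation is refined by finite-difference gradient descent on reprojection error, whereas the real bundle optimization jointly updates depth and full camera poses with learned loss terms.

```python
# Minimal sketch of bundle-style iterative refinement (assumed setup, not the
# paper's optimizer): refine a camera translation so that known world points
# reproject onto their observed pixels.
import numpy as np

def project(X, K, t):
    """Project world points X (N,3) through an identity-rotation camera at translation t."""
    Xc = X + t
    uvw = (K @ Xc.T).T
    return uvw[:, :2] / uvw[:, 2:3]

def refine_translation(K, X, uv_obs, t0, iters=1000, lr=2e-5, eps=1e-5):
    """Gradient descent on mean squared reprojection error, with a
    central-difference gradient so no autodiff framework is needed."""
    t = np.asarray(t0, dtype=float).copy()
    loss = lambda t: float(np.mean((project(X, K, t) - uv_obs) ** 2))
    for _ in range(iters):
        g = np.zeros(3)
        for k in range(3):
            d = np.zeros(3); d[k] = eps
            g[k] = (loss(t + d) - loss(t - d)) / (2 * eps)
        t -= lr * g
    return t, loss(t)

# Toy scene: recover a small camera offset from 30 observed points.
rng = np.random.default_rng(0)
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
X = rng.uniform([-0.5, -0.5, 2.0], [0.5, 0.5, 3.0], size=(30, 3))
t_true = np.array([0.10, 0.05, 0.0])
uv_obs = project(X, K, t_true)
t_est, err = refine_translation(K, X, uv_obs, np.zeros(3))
print(f"t_est ≈ {np.round(t_est, 3)}, reproj MSE = {err:.2e}")
```

One detail worth noting: refining a global depth scale together with the translation would be degenerate here, since scaling both the points and the translation by the same factor leaves the projection unchanged; the full system resolves such ambiguities with additional constraints beyond plain reprojection error.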