Fudan's latest DriveVGGT: efficient multi-camera 4D reconstruction for autonomous driving
自动驾驶之心· 2025-12-17 00:03
Core Insights

- The article introduces DriveVGGT, a visual geometry transformer designed specifically for autonomous driving that significantly improves geometric prediction consistency and inference efficiency in multi-camera systems [2][9][42]

Background

- 4D reconstruction is a computer vision task that predicts geometric information from visual sensors; because cameras are low-cost, it is particularly attractive in autonomous driving and robotics [5]
- Traditional reconstruction methods fall into two camps: iterative (optimization-based) methods, which must be re-run or retrained whenever the scene changes, and feed-forward methods, which output predictions directly without updating model parameters [5]

Limitations of Existing Methods

- Existing feed-forward methods struggle in autonomous driving scenarios because images captured by different cameras have low overlap, making it difficult to match similar features across views [6]
- The relative pose calibration between cameras on an autonomous vehicle is easy to obtain, but scale discrepancies prevent feed-forward methods from using it directly [6]

DriveVGGT Model Overview

- DriveVGGT integrates relative pose information to improve performance on geometric tasks such as camera pose estimation and depth estimation [10][11]
- The model consists of three sub-modules: Temporal Video Attention (TVA), Relative Pose Embedding, and Multi-Camera Consistency Attention (MCA) [11][16]

Temporal Video Attention (TVA)

- TVA establishes initial geometric relationships between the frames of each camera's continuous video stream, providing a foundation for effective reconstruction [13][16]

Relative Pose Embedding

- This module normalizes the relative poses of all cameras to mitigate scale uncertainty, ensuring a consistent geometric representation [14][16]

Multi-Camera Consistency Attention (MCA)

- MCA strengthens the interaction between images from different cameras by injecting relative pose information, addressing the instability caused by low inter-camera overlap [15][16]

Experimental Results

- On the nuScenes dataset, DriveVGGT outperformed other models in both inference speed and prediction accuracy, particularly in scenarios with 210 images [24][30]
- The model achieved superior depth estimation performance, especially on long sequences [27]

Visualization and Ablation Studies

- Visual comparisons demonstrated DriveVGGT's stability in pose prediction across varied scenes, and ablation studies confirmed the effectiveness of each proposed module [31][34]

Conclusion

- DriveVGGT effectively exploits relative camera pose information to strengthen geometric predictions, achieving better performance at lower computational cost than previous methods [42]
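The summary says the Relative Pose Embedding module normalizes the cameras' relative poses so that scale uncertainty does not corrupt the shared geometric representation. The article does not give the exact formulation, so the following is a minimal NumPy sketch of one plausible scheme: rotations are already scale-free, so only the relative translations are rescaled (here by their mean norm), and each normalized pose is then flattened into a fixed-size vector that an attention module could consume. The function names (`normalize_relative_poses`, `pose_embedding`), the mean-norm scaling rule, and the 6D rotation representation are all assumptions for illustration, not the paper's actual design.

```python
import numpy as np

def normalize_relative_poses(extrinsics):
    """Scale-normalize a rig's relative camera poses.

    extrinsics: list of 4x4 camera-to-reference transforms.
    Rotations carry no scale, so only the translations are divided
    by the mean translation norm; the rig geometry is then expressed
    in a scale-invariant way. (Hypothetical analogue of DriveVGGT's
    relative pose normalization.)
    """
    ts = np.stack([E[:3, 3] for E in extrinsics])
    norms = np.linalg.norm(ts, axis=1)
    scale = norms[norms > 0].mean() if (norms > 0).any() else 1.0
    normed = []
    for E in extrinsics:
        En = E.copy()
        En[:3, 3] /= scale
        normed.append(En)
    return normed, scale

def pose_embedding(E_norm):
    """Flatten a normalized pose into a 9-D vector:
    6D rotation representation (first two rows of R) + translation."""
    R, t = E_norm[:3, :3], E_norm[:3, 3]
    return np.concatenate([R[0], R[1], t])

# Example: two cameras at distances 1 and 3 from the reference frame.
# The mean translation norm is 2, so after normalization the
# translations have norms 0.5 and 1.5 regardless of the rig's
# absolute (metrically uncertain) scale.
E1, E2 = np.eye(4), np.eye(4)
E1[:3, 3] = [1.0, 0.0, 0.0]
E2[:3, 3] = [3.0, 0.0, 0.0]
normed, scale = normalize_relative_poses([E1, E2])
```

A per-camera embedding like this could then be added to (or concatenated with) each camera's image tokens before the cross-camera attention stage, which is one straightforward way to "inject" calibration information as the MCA description suggests.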