超越Video Depth Anything！视频深度估计新SOTA来了，163倍数据效率解锁生成式先验

Core Insights - The article discusses the introduction of a new video depth estimation framework called DVD (Deterministic Video Depth Estimation with Generative Priors), led by Professor Chen Yingcong from the Hong Kong University of Science and Technology (Guangzhou) [4] - DVD is noted for its ability to achieve high data efficiency, requiring only 367,000 frames of training data compared to 60 million frames used by other models, resulting in a remarkable 163 times improvement in data efficiency [5][24] - The framework addresses the challenges of balancing geometric detail and temporal stability in dynamic videos, which has been a longstanding issue in the computer vision community [4][8] Group 1: Background and Motivation - Prior to DVD, mainstream video depth estimation methods faced inherent trade-offs between generative and discriminative models, leading to a core question of how to design a framework that balances stability and rich spatiotemporal priors while maintaining efficiency [8] - The research team identified the need for a framework that could effectively combine the strengths of both model types without the drawbacks of each [8] Group 2: Methodology - DVD innovatively adapts pre-trained video diffusion models into a deterministic framework for single-pass depth regression, eliminating the geometric hallucinations caused by traditional generative models [5][12] - The framework incorporates three core designs: 1. Time-step driven structural anchors to balance global stability and local detail [15] 2. Latent Manifold Rectification (LMR) to align predicted latent variables with target variables, restoring sharp boundaries and coherent motion [16] 3. Global Affine Coherence to ensure seamless alignment of adjacent windows in long video processing [18] Group 3: Experimental Results - DVD achieved state-of-the-art (SOTA) performance in geometric fidelity and temporal coherence across multiple real-world benchmarks, outperforming both generative and discriminative baseline models [20][22] - The framework demonstrated the lowest absolute relative error (AbsRel) on standard datasets such as ScanNet and KITTI, showcasing its superior accuracy [22][24] - DVD's design allows for high fidelity depth estimation with significantly less training data, proving that effective strategies can unlock the geometric priors of foundational models without the need for extensive labeled datasets [24][28] Group 4: Implications and Future Directions - The introduction of DVD establishes a highly scalable and data-efficient paradigm for dynamic 3D scene understanding and future perception technologies [29] - The open-source nature of the project encourages further exploration and validation by the research community [30]