Surpassing Video Depth Anything! A New SOTA for Video Depth Estimation, with 163× Data Efficiency Unlocking Generative Priors
机器之心· 2026-03-29 01:29
Core Insights
- The article introduces a new video depth estimation framework called DVD (Deterministic Video Depth Estimation with Generative Priors), led by Professor Chen Yingcong from the Hong Kong University of Science and Technology (Guangzhou) [4]
- DVD achieves high data efficiency, requiring only 367,000 frames of training data versus the 60 million frames used by other models, a 163-fold improvement in data efficiency [5][24]
- The framework addresses the challenge of balancing geometric detail against temporal stability in dynamic videos, a longstanding problem in the computer vision community [4][8]

Group 1: Background and Motivation
- Before DVD, mainstream video depth estimation methods faced an inherent trade-off between generative and discriminative models, raising the core question of how to design a framework that balances stability with rich spatiotemporal priors while remaining efficient [8]
- The research team identified the need for a framework that combines the strengths of both model types without the drawbacks of either [8]

Group 2: Methodology
- DVD adapts pre-trained video diffusion models into a deterministic framework for single-pass depth regression, eliminating the geometric hallucinations produced by traditional generative sampling [5][12]
- The framework incorporates three core designs:
  1. Time-step-driven structural anchors to balance global stability and local detail [15]
  2. Latent Manifold Rectification (LMR) to align predicted latent variables with the target latents, restoring sharp boundaries and coherent motion [16]
  3. Global Affine Coherence to ensure seamless alignment of adjacent windows in long-video processing [18]

Group 3: Experimental Results
- DVD achieved state-of-the-art (SOTA) performance in geometric fidelity and temporal coherence across multiple real-world benchmarks, outperforming both generative and discriminative baselines [20][22]
- The framework recorded the lowest absolute relative error (AbsRel) on standard datasets such as ScanNet and KITTI, demonstrating superior accuracy [22][24]
- DVD delivers high-fidelity depth estimation with significantly less training data, showing that an effective training strategy can unlock the geometric priors of foundation models without extensive labeled datasets [24][28]

Group 4: Implications and Future Directions
- DVD establishes a highly scalable, data-efficient paradigm for dynamic 3D scene understanding and future perception technologies [29]
- The project is open source, encouraging further exploration and validation by the research community [30]
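The AbsRel metric reported above, and the window-alignment idea behind Global Affine Coherence, both reduce to standard operations on depth maps: a least-squares solve for a per-window scale and shift, and a mean relative error. The summary does not give the paper's formulation, so the following is only a minimal sketch of the standard versions (function names are illustrative):

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Least-squares scale s and shift t so that s*pred + t best fits gt.
    Standard affine-invariant depth alignment (MiDaS-style evaluation)."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s, t

def abs_rel(pred, gt):
    """Absolute relative error: mean(|pred - gt| / gt) over pixels with gt > 0."""
    mask = gt > 0
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

# Toy check: a prediction that is off by an affine transform aligns to zero error.
gt = np.array([[1.0, 2.0], [3.0, 4.0]])
pred = 0.5 * gt + 1.0                 # wrong scale and shift
s, t = align_scale_shift(pred, gt)
aligned = s * pred + t
print(round(abs_rel(aligned, gt), 6))  # ~0.0 after alignment
```

The same scale-and-shift solve, applied on the overlap between adjacent windows, is the usual way to stitch per-window affine-invariant depth into a consistent long-video prediction.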
GAIR 2025 World Models Forum: From General Perception to Video and Physical World Models, a Hundred Schools Contend
雷峰网· 2025-12-13 09:13
Core Insights
- The article discusses the current state and future prospects of world models in the context of embodied intelligence, highlighting the diverse research directions and the need for consensus in the field [2][3]

Group 1: General Overview of World Models
- The GAIR Global AI and Robotics Conference featured a forum on world models, where young scholars presented research on topics such as general perception, 3D technology, and digital human reconstruction [2]
- Research on world models is still in its infancy, with many subfields emerging, indicating a rich and varied landscape of inquiry [2]

Group 2: Key Presentations and Innovations
- 彭思达 (Peng Sida) from Zhejiang University presented general spatial perception technologies for embodied intelligence, covering camera pose estimation, depth estimation, and object motion estimation, all crucial for robotic decision-making [5][6]
- His team proposed a new camera pose estimation method that uses Transformer models to improve image matching in challenging environments, enhancing the accuracy of spatial perception [7]
- The team also introduced the "Pixel-Perfect-Depth" approach, which improves depth estimation by optimizing directly in pixel space, avoiding the information loss associated with traditional latent-space models [8]

Group 3: Advancements in Digital Human Reconstruction
- 修宇亮 (Xiu Yuliang) from Westlake University discussed high-precision digital human reconstruction, presenting the UP2You method, which reduces modeling time from 4 hours to 1.5 minutes by converting noisy capture data into usable multi-view images [20][21]
- The ETCH method was introduced to accurately model internal human structure by defining the relationship between clothing and skin, addressing previous modeling inaccuracies [22]
- 修宇亮 emphasized that future digital human reconstruction will increasingly rely on fine-tuning existing foundation models rather than training from scratch [23]
Group 4: Innovations in Physical World Modeling
- 王广润 (Wang Guangrun) from Sun Yat-sen University presented a new model, the in-situ Tweedie discrete diffusion model, which aims to improve data-training efficiency and model performance for physical world modeling [26][27]
- The presentation highlighted the need to decouple physical modeling from spatial modeling to enhance the adaptability of AI systems in real-world applications [28]

Group 5: The Role of 3D Technology in AI
- 韩晓光 (Han Xiaoguang) from The Chinese University of Hong Kong discussed the evolution of 3D generation technology and its critical role in video generation, emphasizing that 3D models must maintain relevance in the face of advances in 2D video generation [31][32]
- He identified key trends in 3D generation, including increased detail, structural organization, and alignment with 2D inputs, while also addressing the challenges posed by video generation technologies [32][33]
- He concluded that 3D technology is essential for creating trustworthy AI systems, as it provides a more interpretable representation than high-dimensional latent variables [34]

Group 6: Future Directions and Collaborative Efforts
- The roundtable discussion emphasized the importance of collaboration and consensus in developing world models, with participants noting the need for hardware advances alongside algorithmic improvements [37][39]
- The discussion highlighted the potential for a technical alliance focused on world models to foster cooperation and innovation in the field [39]
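The summary gives no detail on the in-situ Tweedie discrete diffusion model itself, but the Tweedie formula it is named after is a standard denoising identity: for Gaussian corruption x_t = x_0 + σε, the posterior mean is E[x_0 | x_t] = x_t + σ² ∇ log p(x_t). A numerical sanity check of that identity for a Gaussian prior, where both sides have closed forms (values are arbitrary toy numbers, unrelated to the talk):

```python
import numpy as np

# Prior: x0 ~ N(mu0, tau^2); observation: xt = x0 + sigma * eps.
mu0, tau, sigma = 1.0, 2.0, 0.5
xt = np.linspace(-3.0, 5.0, 9)

# The marginal of xt is N(mu0, tau^2 + sigma^2), so its score is analytic.
var = tau**2 + sigma**2
score = -(xt - mu0) / var                      # d/dx log p(xt)

# Tweedie's formula: E[x0 | xt] = xt + sigma^2 * score.
tweedie = xt + sigma**2 * score

# Closed-form Gaussian posterior mean, for comparison.
posterior_mean = (tau**2 * xt + sigma**2 * mu0) / var

print(np.allclose(tweedie, posterior_mean))  # True
```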