Surpassing Video Depth Anything! A new SOTA in video depth estimation, with 163× data efficiency unlocking generative priors
机器之心· 2026-03-29 01:29
Core Insights
- The article introduces DVD (Deterministic Video Depth Estimation with Generative Priors), a new video depth estimation framework led by Professor Chen Yingcong of the Hong Kong University of Science and Technology (Guangzhou) [4]
- DVD achieves remarkable data efficiency, requiring only 367,000 training frames versus the 60 million used by other models, a 163× improvement [5][24]
- The framework tackles the long-standing trade-off between geometric detail and temporal stability in dynamic videos, a persistent challenge in the computer vision community [4][8]

Group 1: Background and Motivation
- Before DVD, mainstream video depth estimation methods faced an inherent trade-off between generative and discriminative models, raising the core question of how to design a framework that balances stability with rich spatiotemporal priors while remaining efficient [8]
- The research team identified the need for a framework that combines the strengths of both model types without the drawbacks of either [8]

Group 2: Methodology
- DVD innovatively adapts pre-trained video diffusion models into a deterministic framework for single-pass depth regression, eliminating the geometric hallucinations produced by conventional generative models [5][12]
- The framework rests on three core designs:
  1. Time-step driven structural anchors to balance global stability with local detail [15]
  2. Latent Manifold Rectification (LMR) to align predicted latent variables with target variables, restoring sharp boundaries and coherent motion [16]
  3. Global Affine Coherence to seamlessly align adjacent windows when processing long videos [18]

Group 3: Experimental Results
- DVD achieves state-of-the-art (SOTA) geometric fidelity and temporal coherence across multiple real-world benchmarks, outperforming both generative and discriminative baselines [20][22]
- It posts the lowest absolute relative error (AbsRel) on standard datasets such as ScanNet and KITTI, demonstrating superior accuracy [22][24]
- Its design delivers high-fidelity depth estimation from far less training data, showing that effective strategies can unlock a foundation model's geometric priors without extensive labeled datasets [24][28]

Group 4: Implications and Future Directions
- DVD establishes a highly scalable, data-efficient paradigm for dynamic 3D scene understanding and future perception technologies [29]
- The project is open source, encouraging further exploration and validation by the research community [30]
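The AbsRel metric cited in Group 3 has a standard definition: the mean of |prediction − ground truth| / ground truth over valid pixels. A minimal sketch (the function name and masking convention are illustrative, not taken from the DVD codebase):

```python
import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray, valid_mask=None) -> float:
    """Mean absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    if valid_mask is None:
        valid_mask = gt > 0  # ignore pixels with no ground-truth depth
    p, g = pred[valid_mask], gt[valid_mask]
    return float(np.mean(np.abs(p - g) / g))

# A uniform 10% overestimate yields AbsRel = 0.1 over the valid pixels
gt = np.array([[1.0, 2.0], [4.0, 0.0]])  # 0.0 marks a missing measurement
pred = gt * 1.1
print(abs_rel(pred, gt))
```

Because the error is normalized per pixel, AbsRel is comparable across datasets with different depth ranges, which is why it appears for both indoor (ScanNet) and driving (KITTI) benchmarks.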
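Affine-invariant depth models predict depth only up to a per-window scale and shift, so stitching long videos from overlapping windows requires aligning each window to its predecessor. The sketch below assumes a least-squares scale-and-shift fit on the overlapping frames; this is a common alignment recipe, not necessarily DVD's exact Global Affine Coherence procedure:

```python
import numpy as np

def fit_scale_shift(src: np.ndarray, ref: np.ndarray):
    """Least-squares (s, t) minimizing ||s*src + t - ref||^2 over the overlap."""
    x, y = src.ravel(), ref.ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t

def align_windows(windows, overlap: int) -> np.ndarray:
    """Chain adjacent depth windows into one consistent long sequence.

    windows: list of arrays of shape (T, H, W); consecutive windows share
    `overlap` frames. Each window is affinely mapped onto its (already
    aligned) predecessor, then the duplicated overlap frames are dropped.
    """
    aligned = [windows[0]]
    for w in windows[1:]:
        s, t = fit_scale_shift(w[:overlap], aligned[-1][-overlap:])
        aligned.append(s * w + t)
    return np.concatenate([aligned[0]] + [w[overlap:] for w in aligned[1:]])
```

Fitting only two parameters per window keeps the alignment cheap and robust, but any residual misfit accumulates along the chain, which is why window-to-window coherence matters for long videos.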
ICLR 2026 | When videos defy representation: UCSD, HKUST, and other institutions jointly propose FlowRVS, reframing visual perception with generative flow matching
机器之心· 2026-03-03 09:08
Core Insights
- The article examines the limits of traditional representation methods in video understanding, particularly referring video object segmentation (RVOS), where the "locate then segment" paradigm suffers from information collapse [2][3]
- A new approach, FlowRVS, leverages generative models to recast segmentation as a flow matching process, boosting performance and marking a paradigm shift in visual perception [3][11]

Group 1: Traditional Methods and Their Limitations
- Traditional models compress video features into a small set of vectors, which often loses fine-grained spatiotemporal relationships [2]
- Attempting to map high-dimensional video features directly to binary masks in a single step performed poorly, underscoring the difficulty of bridging large information gaps [8][10]
- The failure of the "noise-to-mask" approach demonstrated the importance of retaining the video's high-entropy spatial and textural details [10]

Group 2: FlowRVS and Its Innovations
- FlowRVS shifts the objective from predicting absolute masks to predicting relative changes in video features, yielding a significant performance gain [11]
- The model establishes a Video-to-Mask Flow paradigm, learning a deterministic trajectory that smoothly transports high-dimensional features into target masks and reaching a state-of-the-art score of 60.6 [11][21]
- The introduction of Boundary Bias Sampling (BBS) lets the model concentrate on the critical starting point of the flow, delivering a 10-point performance boost [16][17]

Group 3: Performance Metrics and Results
- FlowRVS sets a new state-of-the-art of 51.1 J&F on the MeViS benchmark, proving effective even against larger models [21]
- It shows impressive zero-shot capability, scoring 73.3 on the unseen Ref-DAVIS17 dataset, evidence of strong generalization [21]
- FlowRVS remains stable on long sequences, effectively mitigating the trajectory drift that affects traditional models [23]

Group 4: Theoretical Implications of Flow Matching
- FlowRVS illustrates the generality of Flow Matching theory: optimal transport paths can be established between different probability distributions, bridging modalities [26]
- Its success suggests a future in which detection, segmentation, and generation tasks are unified under a single elegant ODE framework, breaking down the barriers between modalities [26]
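The Video-to-Mask Flow idea can be sketched with a standard conditional flow matching objective: a network regresses the velocity of a path from video features to mask latents. The sketch below assumes a linear path, and uses a Beta-distributed timestep sampler biased toward t = 0 as an illustrative stand-in for Boundary Bias Sampling; none of this is the authors' code, and the exact BBS distribution is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def bbs_timesteps(batch: int, alpha: float = 0.5) -> np.ndarray:
    """Timesteps in [0, 1] biased toward t = 0 (the video-feature end).

    A Beta(alpha, 1) draw with alpha < 1 concentrates mass near 0 --
    an illustrative stand-in for Boundary Bias Sampling (BBS).
    """
    return rng.beta(alpha, 1.0, size=batch)

def flow_matching_loss(v_net, x_video: np.ndarray, x_mask: np.ndarray) -> float:
    """Video-to-Mask flow matching on the linear path x_t = (1-t)*x_video + t*x_mask.

    The flow starts from the video features themselves (not from noise), so
    the regression target is the constant path velocity (x_mask - x_video).
    """
    t = bbs_timesteps(x_video.shape[0]).reshape(-1, *([1] * (x_video.ndim - 1)))
    x_t = (1 - t) * x_video + t * x_mask
    target_v = x_mask - x_video
    return float(np.mean((v_net(x_t, t) - target_v) ** 2))
```

In training, `v_net` would be the adapted video backbone predicting the velocity field; at inference, integrating the learned ODE from the video features deterministically transports them into the mask, which is what distinguishes this from a noise-to-mask diffusion process.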