Video Understanding
ICLR 2026 | When Videos Defy Representation: UCSD, HKUST, and Other Institutions Jointly Propose FlowRVS, Reframing Visual Perception with Generative Flow Matching
机器之心· 2026-03-03 09:08
Core Insights
- The article discusses the limitations of traditional representation methods in video understanding, particularly in referring video object segmentation (RVOS), where the "locate then segment" paradigm suffers from information collapse [2][3]
- A new approach called FlowRVS leverages generative models to recast segmentation as a flow-matching process, improving performance and marking a paradigm shift in visual perception [3][11]

Group 1: Traditional Methods and Their Limitations
- Traditional models compress video features into a small set of vectors, which often loses fine-grained spatiotemporal relationships [2]
- Attempts to map high-dimensional video features directly to binary masks in a single step performed poorly, highlighting the difficulty of bridging large information gaps [8][10]
- The failure of the "noise-to-mask" approach demonstrated the importance of retaining high-entropy spatial and textural details from the video [10]

Group 2: FlowRVS and Its Innovations
- FlowRVS shifts the objective from predicting the absolute mask to predicting relative changes in video features, yielding a significant performance increase [11]
- The model establishes a Video-to-Mask Flow paradigm, learning a deterministic trajectory that smoothly transports high-dimensional features into target masks and achieving a state-of-the-art score of 60.6 [11][21]
- Boundary Bias Sampling (BBS) lets the model concentrate on the critical starting point of the flow, yielding a 10-point performance boost [16][17]

Group 3: Performance Metrics and Results
- FlowRVS achieved a new state-of-the-art 51.1 J&F on the MeViS benchmark, demonstrating its effectiveness even against larger models [21]
- The model showed impressive zero-shot capability, scoring 73.3 on the unseen Ref-DAVIS17 dataset, showcasing its generalization power [21]
- FlowRVS remains stable on long sequences, effectively addressing trajectory drift, a significant advantage over traditional models [23]

Group 4: Theoretical Implications of Flow Matching
- FlowRVS exemplifies the universality of Flow Matching theory, bridging modalities and demonstrating that optimal-transport paths can be established between different probability distributions [26]
- Its success suggests a future in which detection, segmentation, and generation tasks are unified under a single elegant ODE framework, breaking down the barriers between modalities [26]
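The video-to-mask flow and Boundary Bias Sampling described above can be sketched in a few lines. This is an illustrative reconstruction from the summary, not the authors' code: the function names (`boundary_bias_sample`, `flow_matching_loss`, `integrate`) and the choice of a Beta(α, 1) distribution to bias timesteps toward the flow's start are assumptions. The key ideas it shows are that the flow starts from video features (not noise), follows a straight interpolation toward the mask, and the network regresses the relative change rather than the absolute mask.

```python
import numpy as np

rng = np.random.default_rng(0)

def boundary_bias_sample(batch, alpha=0.5):
    # Beta(alpha, 1) with alpha < 1 concentrates mass near t = 0,
    # oversampling the critical starting point of the flow (the BBS idea).
    return rng.beta(alpha, 1.0, size=batch)

def flow_matching_loss(model, x0, x1):
    # x0: video features, x1: target mask representation, both (B, ...).
    t = boundary_bias_sample(x0.shape[0]).reshape(-1, *([1] * (x0.ndim - 1)))
    xt = (1 - t) * x0 + t * x1          # point on the straight path x0 -> x1
    v_target = x1 - x0                  # constant velocity of that path
    v_pred = model(xt, t)               # predict the relative change, not the mask
    return np.mean((v_pred - v_target) ** 2)

def integrate(model, x0, steps=10):
    # Euler integration of the deterministic ODE dx/dt = v(x, t),
    # transporting video features x0 toward the mask at t = 1.
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        t = np.full((x.shape[0], *([1] * (x.ndim - 1))), i * dt)
        x = x + dt * model(x, t)
    return x
```

Because the interpolation path is straight, the regression target is the same at every t, which is what makes the trajectory deterministic and cheap to integrate at inference time.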
AAAI 2026 | Beihang University and the University of Tokyo Fill AI's "Semantic Gap": How Does Process-Aware Video Understanding Find Its "State" Anchor?
机器之心· 2025-12-06 01:15
Core Insights
- The article discusses TSS (Task-Step-State), a new framework for video understanding developed by a team from Beihang University and the University of Tokyo, which addresses the semantic gap between abstract text instructions and concrete video actions [2][3]
- The TSS framework introduces "State" as a visual anchor, allowing AI to better comprehend procedural activities such as cooking or repairing devices [2][3]

Data Challenges
- Existing methods for procedural video learning rely on either expensive annotations or weak supervision from external knowledge bases, which leaves a semantic gap [2][5]
- The traditional task-step structure is too abstract; TSS adds a third semantic layer, "State", which provides visually grounded snapshots of each step [7][19]

Training Methodology
- TSS employs a progressive "Hierarchy Unfolding" training strategy designed to align with cognitive processes, following a U-shaped learning path from Task down to State and back [9][10]
- This method emphasizes grounding in specific visual evidence, enabling the model to refine its understanding of steps and tasks from detailed state information [14][18]

Experimental Results
- The research team tested the TSS framework on the COIN and CrossTask datasets, achieving significant performance improvements over state-of-the-art models [15][16]
- The results indicate that the "State" layer and the progressive training strategy are the key drivers of the gains in procedural video understanding [19][21]

Conclusion
- The TSS framework demonstrates that explicitly modeling object state changes can bridge the gap between natural language and the physical world, providing a new approach for intelligent systems that understand both high-level planning and detailed execution [23]
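The U-shaped "Hierarchy Unfolding" path described above can be made concrete with a small schedule sketch. This is an assumption-laden illustration based only on the summary: the stage names and the exact descent/ascent ordering (`task -> step -> state -> step -> task`) are inferred, not taken from released TSS code.

```python
# The three semantic layers of the TSS framework, coarse to fine.
LEVELS = ["task", "step", "state"]

def hierarchy_unfolding_schedule(num_stages=5):
    # Early stages descend the hierarchy toward visually grounded states;
    # later stages climb back up so that state-level evidence refines the
    # model's step- and task-level understanding (the U-shaped path).
    down = LEVELS                        # ["task", "step", "state"]
    up = LEVELS[-2::-1]                  # ["step", "task"]
    return (down + up)[:num_stages]
```

A training loop would iterate over this schedule, switching which layer's supervision signal dominates at each stage.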
Li Auto's MCAF Reshapes the Visual-Cognition Paradigm for Assisted Driving
理想TOP2· 2025-04-25 12:43
Source: AcademicDaily, a technical exchange platform that tracks, recommends, and interprets AI advances such as large models, dedicated to sharing frontier techniques.

Internally at Li Auto, MCAF is known as the "third eye" of autonomous driving. It is compatible with Li Auto's in-house Mind GPT-3o and BEV large models and requires no retraining.

MCAF is a multimodal coarse-to-fine attention-focusing framework whose core contribution is addressing the key bottleneck in long-video understanding.

Current video-understanding approaches handle long videos (>5 minutes) poorly: mainstream methods (e.g., Video-MLLMs) rely on global compression or uniform sampling, losing detail and wasting computation. MCAF targets this problem directly, using multimodal hierarchical attention and a temporal-expansion mechanism to balance information retention against computational efficiency; this is its core value.

On the Video-MME dataset, whose videos average 60 minutes in length, MCAF surpasses other agent-based methods (such as VideoTree and DrVideo) by roughly 3-5 percentage points.

Unlike VideoTree and similar methods, which need an extra reward model to assess confidence, MCAF uses a single LLM to close the generate-evaluate-adjust loop. This simplifies the architecture (the implementation needs only one LLM interface) and avoids the compatibility issues of multi-model coordination, making it better suited to real deployment.

However, on NEx ...
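The single-LLM generate-evaluate-adjust loop described above can be sketched as follows. This is a minimal illustration, not Li Auto's implementation: the prompt formats, the confidence threshold, and the frame-window widening rule are all assumptions; `llm` stands in for the one LLM interface the article says MCAF needs, modeled here as any callable from prompt text to response text.

```python
def mcaf_loop(llm, question, frames, max_rounds=3, threshold=0.8):
    # Coarse pass: start from a sparse uniform sample of the frames.
    window = frames[:: max(1, len(frames) // 8)]
    for _ in range(max_rounds):
        # Generate: answer the question from the current frame window.
        answer = llm(f"Q: {question}\nFrames: {window}\nAnswer:")
        # Evaluate: the same LLM rates its own confidence (no reward model).
        conf = float(llm(f"Rate confidence 0-1 for: {answer}"))
        if conf >= threshold:
            return answer, conf          # confident enough: stop refining
        # Adjust: expand temporally, attending to more frames next round.
        k = min(len(frames), len(window) * 2)
        window = frames[:: max(1, len(frames) // k)]
    return answer, conf
```

Because generation, evaluation, and adjustment all go through one LLM interface, the loop needs no cross-model plumbing, which matches the deployment advantage the article highlights.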