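ICLR 2026 | When Video Is Hard to Represent: UCSD, HKUST, and Other Institutions Jointly Propose FlowRVS, Reframing the Visual Perception Paradigm with Generative Flow Matching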

Core Insights
- The article discusses the limitations of traditional representation methods in video understanding, particularly in referring video object segmentation (RVOS), where the "locate then segment" paradigm suffers from information collapse [2][3]
- A new approach called FlowRVS is introduced, which uses generative models to recast segmentation as a flow matching process, improving performance and marking a paradigm shift in visual perception [3][11]

Group 1: Traditional Methods and Their Limitations
- Traditional models compress video features into a set of vectors, which often loses fine-grained spatiotemporal relationships [2]
- An attempt to map high-dimensional video features directly to binary masks in a single step performed poorly, showing how hard it is to bridge such a large information gap at once [8][10]
- The failure of the "noise-to-mask" approach demonstrated the importance of retaining the high-entropy spatial and textural details of the video [10]

Group 2: FlowRVS and Its Innovations
- FlowRVS shifts the focus from predicting the absolute mask to predicting relative changes in video features, leading to a significant performance increase [11]
- The model establishes a Video-to-Mask Flow paradigm, learning a deterministic trajectory that guides high-dimensional features smoothly into target masks, achieving a state-of-the-art score of 60.6 [11][21]
- Boundary Bias Sampling (BBS) lets the model focus on the critical starting point of the flow, resulting in a 10-point performance boost [16][17]

Group 3: Performance Metrics and Results
- FlowRVS achieved a new state-of-the-art score of 51.1 J&F on the MeViS benchmark, demonstrating its effectiveness even against larger models [21]
- The model showed strong zero-shot capability, scoring 73.3 on the unseen Ref-DAVIS17 dataset, which indicates good generalization [21]
- FlowRVS remains stable on long sequences, effectively addressing trajectory drift, a significant advantage over traditional models [23]

Group 4: Theoretical Implications of Flow Matching
- FlowRVS illustrates the generality of Flow Matching theory: it bridges modalities and shows that optimal transport paths can be established between different probability distributions [26]
- The success of FlowRVS suggests a future in which detection, segmentation, and generation tasks are unified under a single ODE framework, breaking down the barriers between modalities [26]
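
To make the Video-to-Mask Flow idea from Group 2 more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a straight-line path between a video feature and a mask, an oracle velocity standing in for a trained network, and a hypothetical time sampler that puts extra probability mass on the start of the flow (t = 0). The article does not describe how BBS is actually defined, so `sample_time_boundary_biased`, `p_start`, and the toy vectors are all illustrative assumptions.

```python
import random


def interpolate(x0, x1, t):
    """Point at time t on the straight path from x0 (video feature) to x1 (mask)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target on the straight path: d/dt x_t = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def sample_time_boundary_biased(rng, p_start=0.5):
    """Hypothetical sampler (an assumption, not the paper's BBS definition):
    with probability p_start return t = 0 exactly, else draw t uniformly from [0, 1)."""
    return 0.0 if rng.random() < p_start else rng.random()


def euler_integrate(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-ins: a 4-dimensional "video latent" and its target "mask latent".
video_latent = [0.8, -0.3, 0.5, 0.1]
mask_latent = [1.0, 0.0, 1.0, 0.0]


def oracle_velocity(x, t):
    """Stands in for a trained network; here it returns the exact target velocity."""
    return target_velocity(video_latent, mask_latent)


# One training-style sample: pick a time, form the input point and its target.
rng = random.Random(0)
t = sample_time_boundary_biased(rng)
x_t = interpolate(video_latent, mask_latent, t)
v_target = target_velocity(video_latent, mask_latent)

# Inference-style rollout: follow the velocity field from the video feature to the mask.
predicted_mask = euler_integrate(oracle_velocity, video_latent, steps=10)
```

With the oracle velocity, the Euler rollout lands on `mask_latent` up to floating-point error. In a real system the velocity would come from a learned model conditioned on the video and the referring expression, and the rollout would operate on latent video tensors rather than short lists.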