ICML 2025 Oral work upgraded again! Shanghai AI Lab, together with Fudan and CUHK, releases VideoRoPE++, the best tool for understanding longer videos

Core Viewpoint - The article presents VideoRoPE++, a video position embedding strategy that models spatiotemporal relationships and outperforms previous RoPE variants across a range of video tasks [4][7][34].

Background - RoPE is widely adopted for its long-context processing capabilities, but extending one-dimensional RoPE to the complex spatiotemporal structure of video remains an open problem [3].

Analysis - VideoRoPE++ prioritizes temporal modeling through low-frequency temporal allocation (LTA), which reduces oscillations and improves robustness. It uses a diagonal layout to maintain spatial symmetry and introduces adjustable temporal spacing (ATS) to control the spacing between temporal indices [15][26].

VideoRoPE++ Design - VideoRoPE++ incorporates several key features:
- Low-frequency temporal allocation (LTA), which mitigates oscillations and ensures robustness [16].
- Adjustable temporal spacing (ATS), which aligns visual and text tokens along the time axis [24].
- YaRN-V, a method for extrapolating beyond the training range while maintaining spatial structure [26].

Experimental Results
- In long video retrieval tasks, VideoRoPE++ consistently outperformed other RoPE variants, demonstrating superior robustness [28].
- In long video understanding tasks, VideoRoPE++ showed significant improvements over baseline methods, highlighting its ability to capture long-distance dependencies [30].
- The extrapolation method YaRN-V achieved a score of 81.33 on the V-RULER benchmark, significantly outperforming traditional position encoding schemes [32][33].

Conclusion - The article identifies four criteria for effective position encoding: 2D/3D structure, frequency allocation, spatial symmetry, and temporal index scaling. VideoRoPE++ meets all four and outperforms other RoPE variants on long video retrieval, video understanding, and video hallucination tasks [34].
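
The three design ideas summarized above (temporal indices on low-frequency dimensions, a diagonal layout, and adjustable temporal spacing) can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `allocate_axes`, `rotary_frequencies`, and `visual_position_ids` are hypothetical, the split between temporal and spatial pairs is arbitrary, and centering each frame on its temporal index is only one plausible reading of "diagonal layout". The only standard fact used is that RoPE pair `i` rotates at frequency `base ** (-i / num_pairs)`, so a larger `i` means a lower frequency.

```python
from typing import List, Tuple


def rotary_frequencies(num_pairs: int, base: float = 10000.0) -> List[float]:
    """Standard RoPE frequencies: pair i rotates at base ** (-i / num_pairs)."""
    return [base ** (-i / num_pairs) for i in range(num_pairs)]


def allocate_axes(num_pairs: int, num_temporal: int) -> List[str]:
    """Assign each rotary pair to an axis: 'x', 'y', or 't'.

    Assumption: the temporal axis takes the last (lowest-frequency) pairs,
    and the two spatial axes alternate over the remaining pairs.
    """
    if not 0 <= num_temporal <= num_pairs:
        raise ValueError("num_temporal must be between 0 and num_pairs")
    num_spatial = num_pairs - num_temporal
    spatial = ["x" if i % 2 == 0 else "y" for i in range(num_spatial)]
    return spatial + ["t"] * num_temporal


def visual_position_ids(
    num_frames: int,
    height: int,
    width: int,
    start: float = 0.0,
    spacing: float = 1.0,
) -> List[Tuple[float, float, float]]:
    """Return (t, x, y) indices for visual tokens, frame by frame, row-major.

    Assumptions: `spacing` plays the role of the adjustable temporal step,
    and each frame's spatial indices are centered on its temporal index so
    that frame centers lie on the diagonal t = x = y.
    """
    ids = []
    for f in range(num_frames):
        t = start + spacing * f
        for h in range(height):
            for w in range(width):
                x = t + (w - (width - 1) / 2)
                y = t + (h - (height - 1) / 2)
                ids.append((t, x, y))
    return ids


if __name__ == "__main__":
    # Illustrative sizes only; real models use far more rotary pairs.
    print(allocate_axes(num_pairs=8, num_temporal=2))
    print(visual_position_ids(num_frames=2, height=2, width=2, start=10, spacing=2.0))
```

In this sketch, raising `spacing` spreads consecutive frames further apart in index space relative to text tokens, which advance by one index per token. The spatial offsets are symmetric around each frame's temporal index, so neither direction along a spatial axis is favored.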