Long Video Understanding

ICML 2025 Oral work gets an upgrade! Shanghai AI Lab, together with Fudan University and CUHK, releases VideoRoPE++, its best tool yet for even longer video understanding
机器之心· 2025-07-03 03:26
Core Viewpoint
- The article discusses VideoRoPE++, an advanced video position embedding strategy that effectively models spatiotemporal relationships and outperforms previous RoPE variants across a range of video-related tasks [4][7][34].

Background
- Despite the widespread adoption of RoPE for its long-context processing capabilities, extending one-dimensional RoPE to the complex spatiotemporal structure of video remains unresolved [3].

Analysis
- VideoRoPE++ is designed to prioritize temporal modeling through low-frequency time allocation (LTA), reducing oscillations and ensuring robustness. It employs a diagonal layout to maintain spatial symmetry and introduces adjustable time intervals (ATS) to control time spacing [15][26].

VideoRoPE++ Design
- Low-frequency time allocation (LTA) to mitigate oscillations and ensure robustness [16].
- Adjustable time intervals (ATS) to align visual and textual tokens in time [24].
- YaRN-V, a method for extrapolating beyond the training range while maintaining spatial structure [26].

Experimental Results
- In long video retrieval tasks, VideoRoPE++ consistently outperformed other RoPE variants, demonstrating superior robustness [28].
- In long video understanding tasks, VideoRoPE++ showed significant improvements over baseline methods, highlighting its ability to capture long-distance dependencies [30].
- The extrapolation method YaRN-V achieved a score of 81.33 on the V-RULER benchmark, significantly outperforming traditional position encoding schemes [32][33].

Conclusion
- The article identifies four criteria for effective position encoding: 2D/3D structure, frequency allocation, spatial symmetry, and time index scaling. VideoRoPE++ meets these criteria and excels in long video retrieval, understanding, and hallucination tasks compared with other RoPE variants [34].
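To make the position-index ideas above concrete, here is a minimal, hypothetical sketch of how 3D (t, x, y) indices for video and text tokens could be laid out with an adjustable temporal stride (in the spirit of ATS) and a spatial grid centered on the temporal diagonal. The function names, the stride value, and the centering convention are illustrative assumptions, not the released VideoRoPE++ implementation.

```python
import numpy as np

def build_video_positions(num_frames, h, w, t_stride=2.0, t_offset=0.0):
    """Illustrative 3D (t, x, y) position indices for video patch tokens.

    t_stride plays the role of an adjustable time interval: it scales the
    temporal index relative to the spatial indices. Spatial coordinates are
    centered on the temporal index so each frame's center sits on the
    temporal diagonal (a rough stand-in for a diagonal layout).
    """
    positions = []
    for f in range(num_frames):
        t = t_offset + f * t_stride
        for y in range(h):
            for x in range(w):
                positions.append((t,
                                  t + x - (w - 1) / 2,
                                  t + y - (h - 1) / 2))
    return np.array(positions)  # shape: (num_frames * h * w, 3)

def build_text_positions(num_text_tokens, start):
    """Text tokens keep a plain 1D index, replicated on all three axes."""
    idx = start + np.arange(num_text_tokens, dtype=np.float64)
    return np.stack([idx, idx, idx], axis=-1)

# Example: 4 frames of 3x3 patches followed by 8 text tokens.
vid = build_video_positions(num_frames=4, h=3, w=3, t_stride=2.0)
txt = build_text_positions(8, start=vid[:, 0].max() + 2.0)
all_pos = np.concatenate([vid, txt], axis=0)
print(all_pos.shape)  # (44, 3)
```

These indices would then feed a rotary embedding whose frequency bands are split between the temporal and spatial axes; the LTA idea described above corresponds to reserving the lowest-frequency bands for the temporal index.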
Ten-thousand-frame video understanding on a single GPU! BAAI open-sources Video-XL-2, a lightweight ultra-long video understanding model
量子位· 2025-06-04 05:21
Domestic open-source models score big again, this time in long video understanding: BAAI, together with Shanghai Jiao Tong University and other institutions, has officially released Video-XL-2, a new-generation ultra-long video understanding model. A single GPU can handle a ten-thousand-frame video input, and encoding a 2048-frame video takes only 12 seconds.

Long video understanding is one of the key capabilities of multimodal large models. Although proprietary models such as OpenAI GPT-4o and Google Gemini have made notable progress in this area, current open-source models still fall clearly short in quality, compute overhead, and runtime efficiency. Compared with the previous Video-XL, Video-XL-2 comprehensively improves an open-source multimodal model's understanding of long video content along multiple dimensions.

The model weights of Video-XL-2 have been fully opened to the community. Going forward, the model is expected to prove valuable in practical scenarios such as film and TV content analysis and abnormal-behavior monitoring.

Technical overview

In terms of architecture, Video-XL-2 consists of three core components: a Visual Encoder, a Dynamic Token Synthesis (DTS) module, and a large language model (LLM).

△ Schematic of the Video-XL-2 model architecture

Specifically, Video-XL-2 ...
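As a rough illustration of the three-component design just described (visual encoder → Dynamic Token Synthesis → LLM), the following PyTorch-style sketch wires the stages together. The class names, dimensions, and the pooling-based token synthesis are placeholders chosen for illustration and are not the actual Video-XL-2 code.

```python
import torch
import torch.nn as nn

class DynamicTokenSynthesis(nn.Module):
    """Placeholder DTS module: merges per-frame patch tokens into a
    smaller set of visual tokens before they reach the LLM."""
    def __init__(self, dim, tokens_per_frame):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(tokens_per_frame)
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_tokens):           # (frames, patches, dim)
        x = frame_tokens.transpose(1, 2)       # (frames, dim, patches)
        x = self.pool(x).transpose(1, 2)       # (frames, tokens_per_frame, dim)
        return self.proj(x)

class VideoPipelineSketch(nn.Module):
    """Visual encoder -> DTS -> projector feeding a language model."""
    def __init__(self, visual_encoder, llm, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.visual_encoder = visual_encoder   # e.g. a frozen ViT
        self.dts = DynamicTokenSynthesis(vis_dim, tokens_per_frame=16)
        self.connector = nn.Linear(vis_dim, llm_dim)
        self.llm = llm                         # assumed: HF-style decoder accepting inputs_embeds

    def forward(self, frames, text_embeds):    # text_embeds: (1, L, llm_dim)
        patch_tokens = self.visual_encoder(frames)          # (frames, patches, vis_dim)
        vis_tokens = self.connector(self.dts(patch_tokens)) # (frames, 16, llm_dim)
        vis_tokens = vis_tokens.reshape(1, -1, vis_tokens.shape[-1])
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```

The point of the DTS stage in this sketch is simply that the visual token count handed to the LLM grows with tokens_per_frame rather than with the raw patch count, which is what keeps ten-thousand-frame inputs tractable.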
Ten thousand frames? One GPU! BAAI open-sources Video-XL-2, a lightweight ultra-long video understanding model
机器之心· 2025-06-03 04:06
Long video understanding is one of the key capabilities of multimodal large models. Although proprietary models such as OpenAI GPT-4o and Google Gemini have made notable progress in this area, current open-source models still fall clearly short in quality, compute overhead, and runtime efficiency.

Recently, BAAI, together with Shanghai Jiao Tong University and other institutions, officially released a new-generation ultra-long video understanding model: Video-XL-2. Compared with the previous Video-XL, the model comprehensively improves a multimodal model's understanding of long video content along multiple dimensions:

- Better quality: Video-XL-2 performs strongly on long video understanding tasks, reaching a leading level among open-source models of the same parameter scale on mainstream benchmarks such as MLVU, Video-MME, and LVBench.
- Longer inputs: The new model significantly extends the video length it can handle, supporting efficient processing of up to ten thousand frames on a single GPU.
- Faster: Video-XL-2 greatly improves processing efficiency; encoding a 2048-frame video takes only 12 seconds, significantly ...

The model weights of Video-XL-2 have been fully opened to the community. Going forward, the model is expected to prove valuable in practical scenarios such as film and TV content analysis and abnormal-behavior monitoring.

Technical overview

Figure 1: Schematic of the Video-XL-2 model architecture
Figure 3: Chunk-based Prefilling
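Figure 3 mentions chunk-based prefilling. As a generic, hypothetical sketch of that idea (prefilling a very long visual prefix in fixed-size chunks against a growing KV cache so attention activations for any single step stay bounded), the snippet below assumes a Hugging Face-style causal LM interface with `inputs_embeds`, `past_key_values`, and `use_cache`; the chunk size and calling convention are assumptions, not Video-XL-2's released code.

```python
import torch

@torch.no_grad()
def chunked_prefill(model, input_embeds, chunk_size=2048):
    """Prefill a long embedding sequence (batch-first) in fixed-size chunks.

    Each chunk attends to the KV cache accumulated from earlier chunks, so
    attention scores are materialized only for chunk_size queries at a time
    instead of for the full sequence at once, bounding peak memory.
    """
    past = None
    for start in range(0, input_embeds.shape[1], chunk_size):
        chunk = input_embeds[:, start:start + chunk_size]
        out = model(inputs_embeds=chunk, past_key_values=past, use_cache=True)
        past = out.past_key_values
    return past  # KV cache covering the full prefix, ready for decoding

# Usage sketch: cache = chunked_prefill(lm, visual_plus_text_embeds)
# followed by normal token-by-token decoding with past_key_values=cache.
```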
A new breakthrough in long video understanding! A hybrid Mamba architecture halves GPU memory consumption and handles 100K video tokens with ease
量子位· 2025-03-27 04:16
Core Viewpoint
- The article introduces Vamba, a hybrid Mamba-Transformer model designed for efficient understanding of long videos, significantly improving processing efficiency without compressing video tokens [1][10].

Model Design and Efficiency
- Vamba improves the efficiency of processing video tokens during training and inference by redesigning the model architecture rather than compressing video tokens [1][4].
- The model can process four times more video frames under the same hardware conditions compared to traditional Transformer architectures, with over 50% reduction in training memory consumption and doubled training speed [4][9].
- Vamba retains the original spatiotemporal features of videos, avoiding the information loss that occurs with traditional downsampling or pooling methods [5][10].

Technical Innovations
- The core design of Vamba breaks the costly causal self-attention operation into two more efficient components: cross-attention for text tokens and a state space model (SSM) based Mamba-2 module for video tokens [6][7].
- The Mamba-2 module reduces the computational complexity from quadratic to linear, allowing effective processing of long video sequences [7][9].
- Vamba's architecture allows efficient alignment of text and video information, enhancing the model's ability to analyze video content based on user queries [9][10].

Performance Evaluation
- Extensive experiments show that Vamba outperforms existing efficient long video understanding models by approximately 4.3% on the LVBench benchmark [5][10].
- The model demonstrates superior performance across benchmarks of various video durations, showing a competitive edge on long, medium, and short video understanding tasks [10].
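To show the shape of the hybrid design described above (a linear-time mixer for the long video token stream, causal self-attention plus cross-attention for the short text stream), here is a hypothetical PyTorch sketch. `video_mixer` is a stand-in for a Mamba-2 layer, and all module names and dimensions are illustrative assumptions rather than Vamba's actual implementation.

```python
import torch
import torch.nn as nn

class HybridBlockSketch(nn.Module):
    """One hybrid layer: video tokens go through a linear-time mixer
    (a placeholder for Mamba-2); text tokens go through causal
    self-attention plus cross-attention over the video tokens."""
    def __init__(self, dim, num_heads, video_mixer):
        super().__init__()
        self.video_mixer = video_mixer  # e.g. a Mamba-2 / SSM layer
        self.text_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t1 = nn.LayerNorm(dim)
        self.norm_t2 = nn.LayerNorm(dim)

    def forward(self, video_tokens, text_tokens, text_causal_mask=None):
        # Linear-time mixing over the (possibly very long) video sequence.
        v = video_tokens + self.video_mixer(self.norm_v(video_tokens))

        # Causal self-attention among the (short) text tokens.
        t = self.norm_t1(text_tokens)
        t_sa, _ = self.text_self_attn(t, t, t, attn_mask=text_causal_mask)
        t = text_tokens + t_sa

        # Cross-attention: text queries attend to video keys/values,
        # costing O(len(text) * len(video)) instead of O((T + V)^2).
        t2 = self.norm_t2(t)
        t_ca, _ = self.text_cross_attn(t2, v, v)
        return v, t + t_ca

# Toy usage with a token-wise MLP as the placeholder mixer
# (a real model would use a Mamba-2 layer here).
mixer = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
block = HybridBlockSketch(dim=256, num_heads=8, video_mixer=mixer)
video = torch.randn(1, 10_000, 256)   # long video token sequence
text = torch.randn(1, 32, 256)        # short text prompt
v_out, t_out = block(video, text)
```

The design choice this illustrates is that the quadratic attention cost is confined to the short text sequence, while the long video sequence is only ever touched by linear-time operators and by cross-attention keys/values.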