10,000-frame video understanding on a single GPU! Zhiyuan Research Institute open-sources the lightweight ultra-long video understanding model Video-XL-2
量子位 (QbitAI) · 2025-06-04 05:21

Core Viewpoint
- The article covers the release of Video-XL-2, a new-generation long video understanding model developed by Zhiyuan Research Institute in collaboration with Shanghai Jiao Tong University, which significantly advances the ability of open-source models to process and understand long video content [1][3].

Technical Overview
- Video-XL-2 is built from three core components: a visual encoder, a Dynamic Token Synthesis (DTS) module, and a large language model (LLM) [4][6].
- The model uses SigLIP-SO400M as its visual encoder to map video frames into high-dimensional visual features; the DTS module then fuses and compresses these features to extract semantic dynamic information (a pipeline sketch follows these sections) [6][11].
- Training follows a four-stage progressive design that incrementally builds robust long video understanding capabilities (a curriculum sketch follows below) [8][10].

Performance Improvements
- Video-XL-2 achieves leading results among open-source models on long video benchmarks such as MLVU, Video-MME, and LVBench [9][15].
- It can process videos of up to 10,000 frames on a single high-performance GPU, greatly extending the video length a single card can handle (see the chunked-encoding sketch below) [19][23].
- It encodes 2,048 frames of video in about 12 seconds, a prefill throughput of roughly 170 frames per second [24][28].

Application Potential
- Video-XL-2 shows strong application potential in real-world scenarios such as film content analysis, plot understanding, and anomaly detection in surveillance video [28][30].
- Concrete examples include answering questions about movie scenes and flagging unexpected events in surveillance footage [30][32].
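To make the three-component design concrete, here is a minimal PyTorch-style sketch of the frame-encoding path. Everything in it is an illustrative assumption: the DTSSketch class, the window size, the transformer mixer, and the encode_video/projector interfaces are placeholders rather than the released Video-XL-2 implementation; only the encoder → DTS → LLM flow comes from the article, and 1152 is SigLIP-SO400M's feature width.

```python
import torch
import torch.nn as nn

class DTSSketch(nn.Module):
    """Stand-in for the Dynamic Token Synthesis (DTS) module: fuse
    per-frame features across a short temporal window, then pool each
    window down to one frame's worth of tokens. The window size and
    the transformer mixer are assumptions, not the published design."""
    def __init__(self, dim: int = 1152, window: int = 4):
        super().__init__()
        self.window = window
        self.mix = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                              batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_frames, tokens_per_frame, dim) from SigLIP
        f, t, d = frame_feats.shape
        pad = (-f) % self.window          # pad so windows divide evenly
        if pad:
            frame_feats = torch.cat(
                [frame_feats, frame_feats[-1:].expand(pad, t, d)], dim=0)
        x = frame_feats.reshape(-1, self.window * t, d)   # group windows
        x = self.mix(x)                                   # temporal fusion
        x = x.reshape(-1, self.window, t, d).mean(dim=1)  # window pooling
        return x.reshape(-1, d)           # flat token sequence

def encode_video(frames, vision_encoder, dts, projector):
    """frames -> SigLIP features -> DTS compression -> LLM input space.
    vision_encoder, dts, and projector are assumed interfaces."""
    with torch.no_grad():
        feats = vision_encoder(frames)    # (num_frames, tokens, 1152)
    return projector(dts(feats))          # tokens the LLM consumes
```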
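The article states only that training uses a four-stage progressive design. The stage names, frame budgets, and trainable-module schedule below are hypothetical placeholders meant to show what "progressive" typically means in practice, not Video-XL-2's actual recipe.

```python
# Hypothetical four-stage progressive curriculum; every stage name,
# frame budget, and trainable-module list here is a placeholder --
# the article does not specify Video-XL-2's actual stages.
STAGES = [
    {"name": "alignment",      "max_frames": 1,      "trainable": ["projector"]},
    {"name": "video_pretrain", "max_frames": 64,     "trainable": ["projector", "dts"]},
    {"name": "instruction",    "max_frames": 256,    "trainable": ["projector", "dts", "llm"]},
    {"name": "long_video",     "max_frames": 10_000, "trainable": ["projector", "dts", "llm"]},
]

def run_curriculum(model, load_data, train_stage):
    """Each stage unlocks longer videos and more trainable modules,
    so long-video capability is built up progressively."""
    for cfg in STAGES:
        train_stage(model,
                    load_data(cfg["name"], max_frames=cfg["max_frames"]),
                    trainable=cfg["trainable"])
```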
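The 10,000-frame single-GPU claim implies that frame encoding cannot hold the whole video on the GPU at once. The chunked loop below is one generic way to bound peak memory; the chunk size and the CPU offload of finished tokens are assumptions, not the article's stated mechanism. The final line simply restates the reported 2,048-frames-in-12-seconds figure as a rate.

```python
import torch

def encode_long_video(frames, vision_encoder, dts, projector,
                      chunk_size: int = 256):
    """Encode a very long frame tensor in fixed-size chunks so peak GPU
    memory stays bounded regardless of total video length. Chunk size
    and CPU offload are generic assumptions, not the article's exact
    mechanism."""
    out = []
    for start in range(0, frames.shape[0], chunk_size):
        chunk = frames[start:start + chunk_size].cuda(non_blocking=True)
        with torch.no_grad():
            tokens = projector(dts(vision_encoder(chunk)))
        out.append(tokens.cpu())          # offload compressed tokens
        del chunk, tokens
    return torch.cat(out, dim=0)          # token sequence for the LLM

# Reported speed: 2048 frames encoded in ~12 s, i.e. roughly 170 frames/s.
print(f"prefill throughput ≈ {2048 / 12:.1f} frames/s")
```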