Ten Thousand Frames? A Single GPU! Zhiyuan Institute Open-Sources Video-XL-2, a Lightweight Ultra-Long Video Understanding Model
机器之心·2025-06-03 04:06

Core Viewpoint
- The article covers the release of Video-XL-2, a new-generation long video understanding model developed by Zhiyuan Institute in collaboration with Shanghai Jiao Tong University, which significantly advances the ability of multimodal large models to understand long video content [2][6].

Technical Overview
- Video-XL-2 consists of three core components: a visual encoder, a Dynamic Token Synthesis (DTS) module, and a large language model (LLM) [3].
- The model uses SigLIP-SO400M as the visual encoder to turn video frames into high-dimensional visual features, which the DTS module then fuses and compresses to extract dynamic semantic information before they are passed to the LLM (a minimal pipeline sketch follows at the end of this summary) [3].
- Training follows a four-stage progressive design that builds up long video understanding capability step by step, using image/video-text pairs and large-scale, high-quality datasets [4].

Performance Metrics
- Video-XL-2 outperforms existing lightweight open-source models on mainstream long video benchmarks such as MLVU, Video-MME, and LVBench, achieving state-of-the-art results [11].
- The model can efficiently process videos of up to 10,000 frames on a single high-performance GPU, far beyond the lengths previous models could handle [16].
- It encodes 2,048 frames of video in just 12 seconds, demonstrating its processing speed and efficiency [19].

Efficiency Innovations
- A chunk-based pre-filling strategy divides long video token sequences into segments, reducing computational cost and memory usage (see the pre-filling sketch below) [8].
- A bi-granularity key-value (KV) decoding mechanism selectively loads dense or sparse KVs depending on the task, improving decoding efficiency (see the KV-selection sketch below) [8].

Application Potential
- Video-XL-2 shows strong application potential in scenarios such as film plot question answering, surveillance anomaly detection, and content summarization for films and game live streams [20][22].
- Its long video understanding capabilities provide effective support for complex video analysis needs in real-world applications [20].
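The encode-then-compress pipeline in the Technical Overview can be sketched roughly as below. This is a minimal, hypothetical stand-in for the DTS stage: the class name, attention mixer, pooling ratio, and token counts are all assumptions, not the released Video-XL-2 code.

```python
# Illustrative sketch of the encode -> compress pipeline described above.
# The DTSModule below is a made-up stand-in, not Video-XL-2's actual module.
import torch
import torch.nn as nn

class DTSModule(nn.Module):
    """Dynamic Token Synthesis (sketch): mix per-frame visual tokens across
    time, then compress the token sequence before it reaches the LLM."""
    def __init__(self, dim: int, compress_ratio: int = 4):
        super().__init__()
        self.temporal_mixer = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.pool = nn.AvgPool1d(kernel_size=compress_ratio, stride=compress_ratio)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames * tokens_per_frame, dim)
        mixed, _ = self.temporal_mixer(frame_feats, frame_feats, frame_feats)
        # Pool along the token axis to shrink the sequence the LLM must consume.
        return self.pool(mixed.transpose(1, 2)).transpose(1, 2)

# Toy usage: 16 frames x 64 visual tokens per frame, 1152-dim features
# (1152 matches SigLIP-SO400M's hidden size; the other numbers are invented).
feats = torch.randn(1, 16 * 64, 1152)
compressed = DTSModule(dim=1152, compress_ratio=4)(feats)
print(compressed.shape)  # torch.Size([1, 256, 1152])
```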
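The chunk-based pre-filling idea can be illustrated with a toy single-layer attention example: the long visual token sequence is split into fixed-size chunks, each chunk attends to everything cached so far, and only one chunk's activations are live at a time, which bounds peak memory. The function name, chunk size, and projection weights are assumptions, not Video-XL-2's actual decoder.

```python
# Sketch of chunk-based pre-filling with a toy attention layer.
# The chunking logic is the point; the "model" is a minimal stand-in.
import torch
import torch.nn.functional as F

def prefill_in_chunks(x: torch.Tensor, w_qkv: torch.Tensor, chunk_size: int = 1024):
    """x: (seq_len, dim). Returns the accumulated K/V cache after pre-filling."""
    dim = x.shape[-1]
    k_cache, v_cache = [], []
    for start in range(0, x.shape[0], chunk_size):
        chunk = x[start:start + chunk_size]
        q, k, v = (chunk @ w_qkv).chunk(3, dim=-1)
        k_cache.append(k)
        v_cache.append(v)
        # The current chunk attends to all previously cached KVs; only this
        # chunk's queries and attention scores are materialized at once.
        keys = torch.cat(k_cache)
        attn = F.softmax(q @ keys.T / dim ** 0.5, dim=-1)
        _ = attn @ torch.cat(v_cache)  # chunk output (discarded in this sketch)
    return torch.cat(k_cache), torch.cat(v_cache)

# Toy usage: 4096 "visual tokens" of dim 64, pre-filled in 1024-token chunks
tokens = torch.randn(4096, 64)
w = torch.randn(64, 192)
k, v = prefill_in_chunks(tokens, w, chunk_size=1024)
print(k.shape, v.shape)  # torch.Size([4096, 64]) torch.Size([4096, 64])
```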
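The bi-granularity KV decoding mechanism can likewise be sketched as a selection step at decode time: segments judged relevant to the current query keep their dense KVs, while the rest are downsampled to sparse KVs. The relevance scores, segment length, stride, and threshold below are illustrative assumptions, not the model's actual policy.

```python
# Sketch of bi-granularity KV loading at decode time: dense KVs for
# query-relevant segments, strided (sparse) KVs elsewhere.
import torch

def select_kvs(k: torch.Tensor, v: torch.Tensor, segment_relevance: torch.Tensor,
               segment_len: int = 256, sparse_stride: int = 8, threshold: float = 0.5):
    """k, v: (seq_len, dim); segment_relevance: (num_segments,) scores in [0, 1]."""
    kept_k, kept_v = [], []
    for i, score in enumerate(segment_relevance):
        seg = slice(i * segment_len, (i + 1) * segment_len)
        if score >= threshold:
            # Relevant segment: keep every cached KV (dense).
            kept_k.append(k[seg]); kept_v.append(v[seg])
        else:
            # Irrelevant segment: keep a strided subset (sparse).
            kept_k.append(k[seg][::sparse_stride]); kept_v.append(v[seg][::sparse_stride])
    return torch.cat(kept_k), torch.cat(kept_v)

# Toy usage: 4 segments x 256 tokens; only the 2nd segment is "relevant"
k = torch.randn(1024, 64); v = torch.randn(1024, 64)
rel = torch.tensor([0.1, 0.9, 0.2, 0.3])
k_sel, v_sel = select_kvs(k, v, rel)
print(k_sel.shape)  # torch.Size([352, 64])  (256 dense + 3 * 32 sparse)
```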