A New Primitive for Video LLMs: Reshaping Fine-Grained Perception and Referential Understanding with Object Tokens
量子位· 2025-11-27 04:34
Core Insights
- The article introduces VideoOrion, a video understanding framework developed by a team from Peking University and UCSD, accepted at ICCV 2025 with high reviewer scores (5/5/4). The framework addresses the greater complexity of video information relative to images by pairing Object Tokens with Context Tokens for improved semantic understanding [1][2][3].

Group 1: Framework Overview
- VideoOrion encodes the salient spatiotemporal dynamics of the foreground as Object Tokens, processed in parallel with Context Tokens, yielding an efficient and interpretable video understanding framework [3][4].
- The framework explicitly distills object dynamics into a small set of discrete tokens, reducing the token volume fed to the large language model (LLM) and easing vision-language alignment [4][6].
- The core method is a dual-branch encoder built around a "detect-segment-track" pipeline that produces Object Tokens, allowing fine-grained semantics to be integrated at inference time [6][10].

Group 2: Performance and Results
- VideoOrion outperforms existing models such as VideoLLaMA2/2.1 across multiple benchmarks, with gains of +10.1% to +15.6% on various tasks [15][16].
- Specifically, it reports 63.5 on MVBench, 65.1 on EgoSchema, 65.2 on Perception-Test, 54.6-55.3 on VideoMME, and 57.7 accuracy (3.7 score) on ActivityNet-QA, a clear advantage over competing models [16][17].
- The framework also supports video referring: it can precisely identify the object a query points to [16][18].

Group 3: Experimental Analysis
- Ablations indicate that the object branch significantly improves performance across benchmarks compared to variants without it [19][20].
- Pre-training the object branch is crucial to overall effectiveness, suggesting that Object Tokens require foundational semantic learning before alignment with text [20].
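The "detect-segment-track" idea above can be sketched in a toy form: link per-frame object detections into tracks, then pool each track's features into a single "object token". Everything here (the greedy IoU tracker, the function names `iou` and `track_and_pool`, the mean-pooling step) is an illustrative assumption, not VideoOrion's actual implementation:

```python
# Toy sketch of detect-segment-track: per-frame detections are greedily
# linked across frames by box IoU, and each resulting track's features are
# mean-pooled into one "object token". Purely illustrative, not the paper's code.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def track_and_pool(frames, iou_thresh=0.5, max_tokens=64):
    """frames: list of per-frame detections [(box, feature_vector), ...].
    Returns up to max_tokens pooled feature vectors, one per track."""
    tracks = []  # each track: {"box": last seen box, "feats": [vectors]}
    for dets in frames:
        for box, feat in dets:
            # Greedy association: extend the best-overlapping existing track.
            best = max(tracks, key=lambda t: iou(t["box"], box), default=None)
            if best and iou(best["box"], box) >= iou_thresh:
                best["box"] = box
                best["feats"].append(feat)
            else:
                tracks.append({"box": box, "feats": [feat]})
    # Mean-pool each track's features into a single object token.
    return [
        [sum(d) / len(t["feats"]) for d in zip(*t["feats"])]
        for t in tracks[:max_tokens]
    ]

# Two frames, one object drifting slightly: one track -> one object token.
frames = [
    [((0, 0, 10, 10), [1.0, 2.0])],
    [((1, 1, 11, 11), [3.0, 4.0])],
]
print(track_and_pool(frames))  # [[2.0, 3.0]]
```

In the real system, detection and segmentation would come from specialized visual models and the pooled features would be projected into the LLM's embedding space; the sketch only shows how a variable-length video collapses into a bounded set of per-object tokens.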
- The optimal number of Object Tokens is around 64, balancing information density against how attention is spread across tokens [21].

Group 4: Limitations and Future Directions
- The study acknowledges limitations: the specialized visual models introduce latency, and further optimization is needed to enhance robustness and reduce pipeline cost [30].
- Future research will focus on improving the alignment and integration strategies between the object and scene perspectives, which is essential for advancing video question answering, retrieval, and multimodal applications [26][30].
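Back-of-the-envelope arithmetic makes the appeal of a fixed budget of around 64 Object Tokens concrete. The frame count and patch grid below are assumed for illustration (32 sampled frames, a 14x14 ViT patch grid), not figures from the paper:

```python
# Illustrative token-budget arithmetic (settings assumed, not from the paper):
# feeding raw per-frame patch tokens to an LLM scales with frames x patches,
# while a fixed object-token budget stays constant regardless of video length.
frames = 32                      # assumed number of sampled frames
patches_per_frame = 14 * 14      # assumed ViT patch grid per frame
raw_tokens = frames * patches_per_frame  # naive: every patch becomes a token
object_tokens = 64               # the reported sweet spot for Object Tokens
print(raw_tokens, object_tokens, raw_tokens // object_tokens)  # 6272 64 98
```

Under these assumed settings the object branch carries roughly two orders of magnitude fewer tokens than naive patch tokenization, which is why the explicit extraction step both shrinks the LLM's input and keeps each token semantically dense.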