Microsoft Introduces the Deep Video Discovery Agent, Topping Multiple Long-Video Understanding Benchmarks
机器之心 · 2025-06-30 03:18
Core Viewpoint
- The article discusses the limitations of large language models (LLMs) and large vision-language models (VLMs) in processing information-dense long videos, and introduces Deep Video Discovery (DVD), an agent that improves long-video understanding through LLM-driven reasoning and planning [1][3].

Group 1: Deep Video Discovery (DVD) Overview
- DVD segments a long video into shorter clips and treats the clips as an environment that an LLM explores, reasoning and planning its way to an answer [3][6].
- The system reaches 74.2% accuracy on the challenging LVBench dataset, well ahead of previous models [3][17].
- DVD will be open-sourced as an MCP Server to make it easier to use in further research and development [3].

Group 2: System Components
- The system has three core components: a multi-granularity video database, a search-centric toolset, and an LLM that acts as the agent coordinator [7][10].
- The multi-granularity video database converts a long video into a structured format, extracting information at several levels, such as a global summary and segment-level details [10].
- The agent uses three main tools: Global Browse for high-level context, Clip Search for efficient semantic retrieval, and Frame Inspect for detailed pixel-level information [11][12][13].

Group 3: Performance Evaluation
- DVD was evaluated on multiple long-video benchmarks and consistently outperformed existing models, including a 13.4% improvement over MR. Video and a 32.9% improvement over VCA [17].
- With auxiliary transcripts, accuracy rises further to 76.0% [17].
- An analysis of different foundation models revealed significant behavioral differences, underscoring how much the agent's performance depends on the reasoning capability of the underlying model [18].
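
The three components described in Group 2 can be illustrated with a minimal Python sketch. It is not the DVD implementation: every name here (`VideoDatabase`, `segment`, `clip_search`, `run_agent`, `scripted_planner`) is hypothetical, the clip length is an arbitrary assumption, word overlap stands in for semantic retrieval, captions stand in for pixel-level inspection, and a scripted policy stands in for the LLM coordinator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Clip:
    start: float
    end: float
    caption: str  # segment-level detail (hand-written here, not model output)


@dataclass
class VideoDatabase:
    global_summary: str  # coarsest level of the multi-granularity database
    clips: list[Clip]    # finer, segment-level entries


def segment(duration: float, clip_len: float) -> list[tuple[float, float]]:
    """Split [0, duration) into consecutive clips of at most clip_len seconds."""
    bounds, start = [], 0.0
    while start < duration:
        end = min(start + clip_len, duration)
        bounds.append((start, end))
        start = end
    return bounds


def global_browse(db: VideoDatabase) -> str:
    """High-level context: return the global summary."""
    return db.global_summary


def clip_search(db: VideoDatabase, query: str, top_k: int = 2) -> list[Clip]:
    """Rank clips by word overlap with the query (a stand-in for semantic retrieval)."""
    words = set(query.lower().split())
    scored = [(len(words & set(c.caption.lower().split())), c) for c in db.clips]
    scored = [sc for sc in scored if sc[0] > 0]
    scored.sort(key=lambda sc: -sc[0])  # stable sort: ties keep temporal order
    return [c for _, c in scored[:top_k]]


def frame_inspect(db: VideoDatabase, start: float, end: float, question: str) -> str:
    """Stand-in for inspecting raw frames: return captions of clips overlapping
    [start, end). The question is unused in this stub."""
    return " ".join(c.caption for c in db.clips if c.start < end and c.end > start)


TOOLS: dict[str, Callable] = {
    "global_browse": global_browse,
    "clip_search": clip_search,
    "frame_inspect": frame_inspect,
}


def run_agent(db, question, planner, max_steps=5):
    """Observe-plan-act loop: the planner picks a tool, the observation is logged."""
    history = []
    for _ in range(max_steps):
        name, args = planner(question, history)
        if name == "answer":
            return args["text"], history
        history.append((name, args, TOOLS[name](db, **args)))
    return "no answer within step budget", history


def scripted_planner(question, history):
    """Fixed policy standing in for the LLM: browse, search, inspect, answer."""
    if len(history) == 0:
        return "global_browse", {}
    if len(history) == 1:
        return "clip_search", {"query": question}
    if len(history) == 2:
        hits = history[-1][2]
        if not hits:
            return "answer", {"text": "not found"}
        top = hits[0]
        return "frame_inspect", {"start": top.start, "end": top.end, "question": question}
    return "answer", {"text": history[-1][2]}


# Toy 25-second "video" cut into clips of at most 10 seconds.
bounds = segment(25.0, 10.0)
captions = [
    "a chef chops onions",
    "the chef fries the onions in a red pan",
    "the dish is served on a white plate",
]
db = VideoDatabase(
    "A cooking video in three steps.",
    [Clip(s, e, c) for (s, e), c in zip(bounds, captions)],
)
answer, trace = run_agent(db, "which pan does the chef use", scripted_planner)
print(answer)
for name, args, _ in trace:
    print(name, args)
```

In this sketch the planner is the only part that would change when an actual LLM is plugged in: it receives the question plus the history of tool calls and observations, and returns either the next tool call or a final answer. The step budget in `run_agent` is one simple way to keep such a loop from running indefinitely.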