Pyramidal Perception
A video version of Deep Research? Browse first, then localize, then read closely: accuracy rises while token consumption drops 58.3%
量子位 · 2026-01-22 05:39
Core Insights
- The article discusses the evolution of AI research, particularly the shift toward autonomous agents that actively retrieve information rather than passively receive it [1]
- It highlights a significant gap in current AI capabilities, specifically in video processing, where existing agents struggle to effectively analyze video content [2][4]

Video Processing Challenges
- Current AI agents either excel at text comprehension or can only perform limited question answering on short video clips, failing to handle the dense information in videos [4]
- The article identifies two main approaches to video processing: Direct Visual Inference, which is computationally expensive and suffers from context explosion, and Text Summarization, which loses critical visual details [8]

Proposed Solution: Video-Browser
- The research team introduces Video-Browser, which aims to enhance video browsing by mimicking human-like search behavior [5][6]
- Video-Browser employs a Pyramidal Perception architecture, processing video data in tiers to balance efficiency and accuracy [10][11]

Core Components of Video-Browser
- Video-Browser consists of three main components: Planner, Watcher, and Analyst [13]
- The Watcher uses a three-stage pyramid mechanism:
  - Stage I: Semantic Filter, which quickly eliminates irrelevant videos via metadata analysis [14]
  - Stage II: Sparse Localization, which identifies candidate answer time windows using subtitles and sparse frame sampling [15]
  - Stage III: Zoom-in, where high-frame-rate decoding and detailed visual reasoning occur within the identified time windows [16]

Benchmark Testing: Video-BrowseComp
- The research team created the Video-BrowseComp benchmark to evaluate agents' true video-search capabilities, emphasizing the need for agents to actively seek information [17]
- The benchmark includes three difficulty levels, ranging from explicit retrieval to multi-source reasoning [18][20]

Experimental Results
- Video-Browser achieved a 26.19% accuracy rate, outperforming existing models by 37.5% [21]
- The architecture cut token consumption by 58.3%, demonstrating significant efficiency gains [22]

Case Study
- A case study illustrates Video-Browser's ability to pin down fine-grained details, such as the color of a pen in a film, which traditional methods failed to capture [24][26]

Conclusion and Future Directions
- Video-Browser represents a significant advance toward effective open-web video browsing, addressing the trade-off between accuracy and cost in video search [27]
- The research team has open-sourced all code, data, and benchmarks to encourage further research in the community [28][29]
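The Planner/Watcher/Analyst division of labor described above can be pictured as an iterative loop: the Planner decides what to search, the Watcher localizes evidence inside videos, and the Analyst judges when the evidence suffices. The sketch below is purely illustrative under that assumption; the function names and the round-based control flow are not taken from the paper.

```python
def run_video_agent(query, planner, watcher, analyst, max_rounds=3):
    """Hypothetical orchestration loop for a Planner/Watcher/Analyst design.

    planner(query, evidence) -> candidate videos to inspect next
    watcher(query, candidates) -> localized evidence clips
    analyst(query, evidence) -> (answer, done) where done signals sufficiency
    """
    evidence = []
    answer = None
    for _ in range(max_rounds):
        candidates = planner(query, evidence)        # decide what to search next
        evidence.extend(watcher(query, candidates))  # browse and localize clips
        answer, done = analyst(query, evidence)      # synthesize, judge sufficiency
        if done:
            break
    return answer
```

The round budget (`max_rounds`) caps cost: each extra round spends more tokens, so the Analyst's sufficiency check is what lets the agent stop early instead of exhaustively watching everything.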
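The Watcher's three-stage pyramid is a coarse-to-fine funnel: each stage narrows what the next, more expensive stage has to look at, which is where the token savings come from. The following is a minimal toy sketch of that idea, not the paper's implementation; the `Video` record, field names, and matching heuristics are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Video:
    # Hypothetical video record; fields are illustrative, not the paper's schema.
    title: str
    subtitles: list  # list of (timestamp_seconds, text) pairs

def semantic_filter(videos, query_terms):
    """Stage I: cheaply discard videos whose metadata never mentions the query."""
    return [v for v in videos
            if any(t in v.title.lower() for t in query_terms)]

def sparse_localization(video, query_terms, half_window=30):
    """Stage II: scan subtitles (a sparse, cheap signal) for candidate windows."""
    hits = [ts for ts, text in video.subtitles
            if any(t in text.lower() for t in query_terms)]
    return [(max(0, ts - half_window), ts + half_window) for ts in hits]

def zoom_in(video, window):
    """Stage III: only here would a real system decode frames at a high rate
    and run detailed visual reasoning; this stub just records the span."""
    start, end = window
    return {"video": video.title, "start": start, "end": end}

def pyramid_search(videos, query_terms):
    """Coarse-to-fine: each stage shrinks the input to the next, costlier one."""
    results = []
    for v in semantic_filter(videos, query_terms):
        for w in sparse_localization(v, query_terms):
            results.append(zoom_in(v, w))
    return results
```

The point of the structure is that the expensive Stage III work (dense decoding plus visual reasoning) only ever runs on the few seconds of video that survived the two cheap filters above it.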