Omni-modal Proactive Perception
"Hearing" Guides "Vision": OmniAgent Pioneers a New Paradigm of Omni-modal Proactive Perception
机器之心 · 2026-01-08 09:34
Core Insights
- The article introduces OmniAgent, a proactive perception agent developed by Zhejiang University, Westlake University, and Ant Group that addresses pain points in cross-modal alignment and fine-grained understanding in end-to-end omni-modal models [2][7][19]
- OmniAgent employs an innovative "think-act-observe-reflect" closed-loop mechanism, shifting from passive response to active inquiry, which improves its performance on audiovisual understanding tasks [10][19]

Background and Pain Points
- End-to-end omni-modal models incur high training costs and struggle with cross-modal feature alignment, leading to subpar performance in fine-grained cross-modal understanding [7]
- Fixed workflow-based agents rely on rigid, human-defined processes and lack the flexibility to autonomously plan and gather information based on the question [7]

Methodology
- OmniAgent strategically schedules video and audio understanding capabilities within an iterative reflection loop, effectively sidestepping the cross-modal alignment challenge [8][15]
- The agent autonomously decides whether to "listen" or "watch" based on its analysis of the question, drawing on a variety of multimodal tools for efficient information retrieval (a minimal sketch of such a loop appears after this summary) [15]

Performance Results
- OmniAgent achieved state-of-the-art (SOTA) results on multiple audiovisual understanding benchmarks, reaching 82.71% accuracy on the Daily-Omni Benchmark and surpassing Gemini 2.5-Flash (72.7%) and Qwen3-Omni-30B (72.08%) by roughly 10 percentage points [13]
- On OmniVideoBench, OmniAgent reached 59.1% accuracy on long-video understanding tasks, significantly outperforming Qwen3-Omni-30B (38.4%) [13]

Future Vision
- The design of OmniAgent is highly extensible, allowing additional modal tools to be integrated (see the registry sketch below) [19]
- OmniAgent is positioned to help generate high-quality chain-of-thought (CoT) data for developing next-generation omni-modal models capable of autonomous tool invocation [19]
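To make the "think-act-observe-reflect" loop concrete, here is a minimal Python sketch of how such an agent might schedule "listen" and "watch" tools around an iterative reflection step. It is a sketch under stated assumptions, not OmniAgent's actual implementation: all names (`plan_step`, `act`, `reflect`, `AgentState`) are invented for illustration, the tools are stubbed, and a real agent would delegate the think and reflect steps to an LLM rather than keyword heuristics.

```python
# Hypothetical sketch of a "think-act-observe-reflect" loop in the spirit of
# OmniAgent as described in the article. All names are illustrative
# assumptions, not the authors' actual API.
from dataclasses import dataclass, field

@dataclass
class Observation:
    tool: str      # which tool produced this evidence ("listen" or "watch")
    content: str   # textual summary of what was heard or seen

@dataclass
class AgentState:
    question: str
    observations: list = field(default_factory=list)

def plan_step(state: AgentState) -> str:
    """Think: decide whether to 'listen', 'watch', or 'answer'.
    A real agent would prompt an LLM; this stub uses keyword heuristics."""
    q = state.question.lower()
    heard = any(o.tool == "listen" for o in state.observations)
    seen = any(o.tool == "watch" for o in state.observations)
    if any(w in q for w in ("say", "sound", "music", "speech")) and not heard:
        return "listen"
    if not seen:
        return "watch"
    return "answer"

def act(action: str, state: AgentState) -> Observation:
    """Act: invoke the chosen multimodal tool (stubbed here)."""
    if action == "listen":
        return Observation("listen", "ASR/audio-caption output goes here")
    return Observation("watch", "frame/video-caption output goes here")

def reflect(state: AgentState) -> bool:
    """Reflect: is the gathered evidence sufficient to answer?
    A real agent would ask an LLM; here we stop after two observations."""
    return len(state.observations) >= 2

def run(question: str, max_steps: int = 4) -> str:
    state = AgentState(question)
    for _ in range(max_steps):
        action = plan_step(state)                      # think
        if action == "answer" or reflect(state):
            break
        state.observations.append(act(action, state))  # act + observe
    # Synthesize an answer from the accumulated cross-modal evidence.
    return f"Answer based on {len(state.observations)} observation(s)."

print(run("What does the speaker say when the dog appears?"))
```

For a question mentioning speech, the stub planner first listens, then watches, then reflects that it has enough evidence and answers; the point is that the question, not a fixed workflow, drives which modality is queried and in what order.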
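The extensibility claim in the Future Vision section amounts to keeping the agent's action space open: if every modal tool conforms to a common interface, the same reflection loop can schedule new modalities without modification. A hypothetical registry sketch follows, with all names (`TOOL_REGISTRY`, `register`, `ocr_tool`) invented for illustration rather than taken from the paper.

```python
# Hypothetical sketch of the extensibility point: new modal tools register
# under a common interface, so the same reflection loop can schedule them.
from typing import Callable, Dict

TOOL_REGISTRY: Dict[str, Callable[[str], str]] = {}

def register(name: str):
    """Decorator that adds a modal tool to the agent's action space."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOL_REGISTRY[name] = fn
        return fn
    return wrap

@register("listen")
def audio_tool(query: str) -> str:
    return f"audio evidence for: {query}"   # e.g. ASR or audio captioning

@register("watch")
def video_tool(query: str) -> str:
    return f"visual evidence for: {query}"  # e.g. frame retrieval + captioning

# Adding a new modality is one more registration; the loop needs no changes.
@register("read")
def ocr_tool(query: str) -> str:
    return f"on-screen text for: {query}"   # e.g. OCR over sampled frames

print(sorted(TOOL_REGISTRY))  # ['listen', 'read', 'watch']
```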