AAAI 2026 | UESTC Proposes OWL: Mitigating Object Hallucination in Multimodal Large Models via Dual-Path Attention Intervention
机器之心 · 2025-11-28 08:05
Core Insights
- The article discusses the growing attention on mitigating object hallucination in large vision-language models (LVLMs) and introduces OWL, a novel framework that applies a causal dual-path attention intervention to address the issue [2][4].

Group 1: Problem Identification
- Existing methods intervene on visual or textual attention independently, neglecting the critical imbalance in cross-modal attention interaction [5].
- The decoding process lacks a quantitative measure of cross-modal dependency, so existing interventions remain coarse and without theoretical guidance [5].

Group 2: Proposed Solution
- The paper introduces a structural causal model that formalizes the decomposition of visual and textual attention into key mediating variables, showing how confounding factors distort attention and lead to hallucinations [4].
- A new metric, VTACR, quantifies the model's dependency on the visual and textual modalities at each decoding layer, providing a measurable signal for fine-grained attention intervention [7].

Group 3: Methodology
- The OWL framework employs a dual-path attention intervention, constructing a visual-enhancement path and a textual-enhancement path and using a contrastive decoding strategy to dynamically correct attention bias [8][10].
- During inference, the framework decomposes the language decoder's attention weights into visual and textual components and, guided by the VTACR distribution, strengthens attention to image tokens while moderating the influence of the textual history [10].

Group 4: Experimental Results
- OWL was evaluated on three representative LVLMs (LLaVA-1.5, MiniGPT-4, and Shikra) against a range of baseline methods to ensure comprehensive assessment [12].
- On the CHAIR benchmark, OWL reduced sentence-level hallucination by 17.6% and instance-level hallucination by 21.4% on LLaVA-1.5 while generating longer text, indicating that it mitigates hallucination without sacrificing content richness [13].
- The method matched or improved performance on five visual question answering (VQA) tasks, including a 7.6% gain on VizWiz, suggesting improved understanding of complex visual scenes [14].
- Manual evaluation with GPT-4V showed a 20.1% improvement in correctness and an 11.3% improvement in detailedness for LLaVA-1.5, indicating output that is both more faithful to the image and richer in information [16].

Group 5: Visual Evidence
- The paper presents typical hallucination cases in which OWL suppresses errors, keeping the generated results closely aligned with the actual image content [18].
- Visualizations show OWL acting like a precise editor, suppressing "hallucination words" while prioritizing "correct words" during generation [18][19].
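The per-layer VTACR signal and the dual-path contrastive correction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`vtacr`, `dual_path_logits`), the ratio form (visual attention mass over textual attention mass), and the gating rule that maps a low VTACR to a stronger correction are all assumptions made for this sketch.

```python
import numpy as np

def vtacr(attn_row, visual_idx, text_idx, eps=1e-8):
    """Assumed form of the VTACR metric: the attention mass a decoding
    layer places on image tokens divided by the mass on text tokens.
    A small ratio means the layer is dominated by the textual prior."""
    visual_mass = attn_row[visual_idx].sum()
    text_mass = attn_row[text_idx].sum()
    return visual_mass / (text_mass + eps)

def dual_path_logits(logits_visual, logits_textual, ratio, alpha=1.0):
    """Contrastive decoding over the two paths (illustrative gating):
    amplify the visually enhanced path and subtract the text-prior path,
    correcting more aggressively when VTACR is low (text-dominated)."""
    weight = alpha / (1.0 + ratio)  # stronger correction as ratio -> 0
    return (1.0 + weight) * logits_visual - weight * logits_textual

# Toy example: one attention row over 1 image token and 2 text tokens.
attn_row = np.array([0.5, 0.3, 0.2])
ratio = vtacr(attn_row, visual_idx=[0], text_idx=[1, 2])

# Two hypothetical next-token logit vectors from the two paths.
corrected = dual_path_logits(np.array([2.0, 0.0]),
                             np.array([1.0, 1.0]),
                             ratio)
```

With a balanced ratio of 1.0, the gating weight is 0.5, so the visually grounded logits are scaled up and half of the text-prior logits are subtracted, pushing probability mass toward image-supported tokens.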