紫东太初 open-sources a visual-neuron enhancement method: plug-and-play, putting an end to multimodal hallucination | ACL 2025
量子位· 2025-06-27 10:57
Core Viewpoint - The article presents Visual Head Reinforcement (VHR), a plug-and-play method that addresses hallucination in Large Visual Language Models (LVLMs) by strengthening the attention heads that actually use visual information, rather than letting the model lean on language priors [1][2][3].

Group 1: Introduction and Background
- LVLMs often generate factually incorrect outputs because they over-rely on language knowledge instead of the actual visual content, which leads to hallucinations [4][5].
- Experiments show that when models are prompted to describe images, they frequently mention entities that are not present in the images, indicating a systematic reliance on language co-occurrence patterns [4][5][7].

Group 2: VHR Methodology
- VHR identifies and strengthens attention heads that are sensitive to visual information, thereby reducing the model's dependence on language priors and significantly lowering hallucination rates [8].
- The Visual Head Divergence (VHD) metric quantifies each attention head's sensitivity to visual input; it reveals that only a few heads respond to visual information while most follow language patterns [9][11]. A hedged sketch of this metric and the head-scaling step follows this summary.
- The VHR procedure filters out abnormal VHD scores, selects and scales the outputs of the top 50% of attention heads ranked by VHD, and applies a layer-wise enhancement strategy to avoid interference between layers [14][15][16].

Group 3: Experimental Results
- Across multiple benchmarks, VHR outperforms existing methods while remaining efficient, adding minimal extra inference time [16][17].
- The results show VHR surpassing baseline methods in the reported evaluations, demonstrating its effectiveness at reducing hallucinations in LVLMs [17].

Group 4: SSL Method
- The article also introduces a Semantic Guided Learning (SSL) method that works in the model's internal representation space, injecting truthful semantic directions and suppressing hallucination-related projections to mitigate hallucinations [19][22]. A hedged sketch of this steering idea also appears below.
- The method transfers across models, improving the robustness of hallucination mitigation across different LVLM architectures [22].
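
The following is a minimal sketch of the VHD/VHR idea described in Group 2, not the authors' implementation. It assumes per-head outputs for a decoding step are available both with the image and with text only; the function names (compute_vhd, scale_visual_heads), the z-score filter, and the scale factor alpha are illustrative assumptions.

```python
# Sketch: score each attention head by how much its output changes when visual
# tokens are removed (VHD), then up-weight the top-scoring half of heads (VHR).
import numpy as np

def compute_vhd(head_out_with_image: np.ndarray,
                head_out_text_only: np.ndarray) -> np.ndarray:
    """Per-head divergence between outputs with and without visual context.

    Both inputs: (num_layers, num_heads, hidden_dim) for the current step.
    Returns: (num_layers, num_heads) VHD scores.
    """
    return np.linalg.norm(head_out_with_image - head_out_text_only, axis=-1)

def scale_visual_heads(head_outputs: np.ndarray,
                       vhd_scores: np.ndarray,
                       top_ratio: float = 0.5,
                       alpha: float = 1.2) -> np.ndarray:
    """Up-weight outputs of heads whose VHD score is in the top `top_ratio`."""
    scores = vhd_scores.copy()
    # Filter abnormal scores (a simple z-score rule stands in for the paper's filter).
    z = (scores - scores.mean()) / (scores.std() + 1e-8)
    scores[np.abs(z) > 3.0] = -np.inf

    flat = scores.ravel()
    k = int(len(flat) * top_ratio)
    top_idx = np.argpartition(flat, -k)[-k:]   # indices of visually sensitive heads
    mask = np.zeros_like(flat, dtype=bool)
    mask[top_idx] = True
    mask = mask.reshape(scores.shape)

    scaled = head_outputs.copy()
    scaled[mask] *= alpha                      # reinforce only the selected heads
    return scaled
```

In practice the article describes applying this enhancement layer by layer rather than across all layers at once; the sketch above only illustrates the scoring and selection logic.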
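
The next sketch illustrates the representation-space steering idea attributed to SSL in Group 4, under loose assumptions: a "truthful" semantic direction and a hallucination-related direction are taken as given unit vectors, and the coefficients beta and gamma are hypothetical. It is not the paper's algorithm, only the add-a-direction / remove-a-projection pattern the summary describes.

```python
# Sketch: steer a hidden state by adding a grounded semantic direction and
# removing its projection onto a hallucination-related direction.
import numpy as np

def steer_hidden_state(h: np.ndarray,
                       semantic_dir: np.ndarray,
                       halluc_dir: np.ndarray,
                       beta: float = 0.1,
                       gamma: float = 1.0) -> np.ndarray:
    """h: (hidden_dim,) hidden state at some layer; directions are normalized here."""
    semantic_dir = semantic_dir / (np.linalg.norm(semantic_dir) + 1e-8)
    halluc_dir = halluc_dir / (np.linalg.norm(halluc_dir) + 1e-8)
    h = h + beta * semantic_dir                          # inject the truthful direction
    h = h - gamma * np.dot(h, halluc_dir) * halluc_dir   # suppress the hallucination projection
    return h
```

How the two directions are estimated (e.g. from contrastive sets of grounded versus hallucinated outputs) is not specified in this summary, so that step is left out of the sketch.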