ICCV 2025 | FDAM: Goodbye to Blurry Vision — a Plug-and-Play Method from Circuit Theory Lets Vision Transformers Regain High-Definition Detail
机器之心·2025-10-15 07:33

Core Insights
- The article introduces Frequency Dynamic Attention Modulation (FDAM), a module that addresses the loss of detail in deep networks caused by the inherent low-pass-filter behavior of self-attention in Vision Transformers (ViT) [2][5][8].
- FDAM improves performance on dense prediction tasks such as segmentation and detection without a significant increase in computational cost, achieving state-of-the-art results [2][22].

Research Background
- Vision Transformers have become prominent in computer vision thanks to their global modeling capability, but they suffer from a critical weakness: the deeper the model, the more it loses the high-frequency details that tasks such as segmentation and detection depend on [5][8].
- The self-attention mechanism acts as a low-pass filter, progressively attenuating high-frequency detail and ultimately driving features toward representation collapse [5][10]. (A minimal numerical illustration of this smoothing effect appears after this summary.)

Limitations of Existing Methods
- Earlier attempts to mitigate "over-smoothing" in ViT, such as regularization or static compensation of high-frequency signals, fall short because they do not change the fundamentally low-pass nature of the attention mechanism itself [10][9].

Core Idea of FDAM
- Inspired by circuit theory, where a low-pass filter has a natural high-pass complement, FDAM redesigns the attention mechanism to carry both a low-pass and a high-pass path, allowing the model to attend to high-frequency details dynamically [11][12][16].
- A lightweight dynamic mixer lets the model adaptively emphasize either low-frequency structure or high-frequency detail, depending on the characteristics of the input image [16][21].

Key Components of the Method
- FDAM consists of two components: Attention Inversion (AttInv), which coarsely rebalances high and low frequencies, and Frequency Dynamic Scaling (FreqScale), which fine-tunes specific frequency bands [21][20]. (Hedged sketches of both components follow this summary.)
- FreqScale lets the model learn dynamic gain weights for different frequency bands, amplifying or suppressing signals as a given task requires [20][21].

Experimental Results
- FDAM is plug-and-play: it integrates into a range of ViT architectures with minimal additional parameters and computational overhead, yet delivers clear performance gains [22][23].
- On ADE20K semantic segmentation, FDAM raises mIoU by +2.4 for SegFormer-B0 and by +0.8 for DeiT3-Base, reaching a state-of-the-art 52.6 mIoU [23][22].
- On COCO object detection and instance segmentation, FDAM improves detection AP by +1.6 and segmentation AP by +1.4 [23][22].

Theoretical Support
- FDAM resists representation collapse: its features maintain a higher effective rank in deep layers than the baseline's, indicating greater feature diversity [26][22]. (A sketch of the effective-rank diagnostic appears below.)

Implications of the Work
- The work brings classical circuit theory to modern Transformer design, offering a new perspective on fundamental issues such as information decay in deep networks [29][30].
- FDAM resolves a core pain point of ViT in dense prediction tasks, unlocking more of the model's potential [30][32].
- As a lightweight, plug-and-play module, FDAM has significant application potential in both industry and academia [31][32].

Future Directions
- FDAM opens avenues for future research, such as designing entirely new network architectures that operate dynamically in the frequency domain, and extending the approach to video, 3D point clouds, and multimodal data [34].
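To make the low-pass claim concrete, here is a minimal, self-contained illustration (not code from the paper): because a softmax attention matrix is row-stochastic, repeatedly applying it to token features drains high-frequency energy along the token axis — the progressive smoothing the article describes.

```python
import torch

torch.manual_seed(0)
num_tokens, dim = 64, 32
x = torch.randn(num_tokens, dim)              # token features
scores = torch.randn(num_tokens, num_tokens)  # stand-in attention logits
attn = torch.softmax(scores, dim=-1)          # row-stochastic attention matrix

def high_freq_ratio(feats):
    """Fraction of spectral energy in the upper half of token frequencies."""
    spec = torch.fft.rfft(feats, dim=0).abs() ** 2  # spectrum along the token axis
    cut = spec.shape[0] // 2
    return (spec[cut:].sum() / spec.sum()).item()

for layer in range(6):
    print(f"layer {layer}: high-freq energy ratio = {high_freq_ratio(x):.4f}")
    x = attn @ x  # one attention smoothing step; the ratio shrinks layer by layer
```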
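Below is a hedged sketch of the AttInv idea combined with the dynamic mixer, under my own naming and layout assumptions (the paper's exact formulation may differ): the usual softmax attention A is kept as the low-pass path, its complement (I - A) serves as the high-pass path, and an input-conditioned per-head weight blends the two.

```python
import torch
import torch.nn as nn

class AttInvBlock(nn.Module):
    """Attention with a complementary high-pass path and a dynamic mixer (sketch)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.mixer = nn.Linear(dim, num_heads)  # predicts per-head low/high balance

    def forward(self, x):                        # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)              # row-stochastic => low-pass
        eye = torch.eye(N, device=x.device)
        low = attn @ v                           # smoothing (low-pass) path
        high = (eye - attn) @ v                  # complementary high-pass path
        # dynamic mixer: input-dependent per-head weight in (0, 1)
        w = torch.sigmoid(self.mixer(x.mean(dim=1))).view(B, self.num_heads, 1, 1)
        out = (1 - w) * low + w * high
        return self.proj(out.transpose(1, 2).reshape(B, N, C))

x = torch.randn(2, 16, 64)
print(AttInvBlock(64)(x).shape)  # torch.Size([2, 16, 64])
```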
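A similarly hedged sketch of the FreqScale idea follows; the radial band partition and the pooled gain predictor here are my assumptions, not the released implementation. Features are decomposed into frequency bands with a 2D FFT, a lightweight head predicts one gain per band from the input, and the rescaled spectrum is transformed back.

```python
import torch
import torch.nn as nn

class FreqScale(nn.Module):
    """Input-conditioned per-band gains over a 2D feature spectrum (sketch)."""
    def __init__(self, channels, num_bands=4):
        super().__init__()
        self.num_bands = num_bands
        self.to_gain = nn.Linear(channels, num_bands)  # pooled features -> band gains

    def band_masks(self, h, w, device):
        # radial frequency magnitude on the rfft2 grid, split into equal-width rings
        fy = torch.fft.fftfreq(h, device=device).abs().unsqueeze(1)  # (h, 1)
        fx = torch.fft.rfftfreq(w, device=device).unsqueeze(0)       # (1, w//2+1)
        radius = torch.sqrt(fy ** 2 + fx ** 2)
        edges = torch.linspace(0.0, float(radius.max()) + 1e-6,
                               self.num_bands + 1, device=device)
        return torch.stack([((radius >= lo) & (radius < hi)).float()
                            for lo, hi in zip(edges[:-1], edges[1:])])

    def forward(self, x):                                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        # per-band gains conditioned on the input, centered around 1.0
        gains = 1 + torch.tanh(self.to_gain(x.mean(dim=(2, 3))))  # (B, bands)
        spec = torch.fft.rfft2(x, norm="ortho")                   # (B, C, H, W//2+1)
        masks = self.band_masks(H, W, x.device)                   # (bands, H, W//2+1)
        scale = torch.einsum("bk,khw->bhw", gains, masks).unsqueeze(1)
        return torch.fft.irfft2(spec * scale, s=(H, W), norm="ortho")

x = torch.randn(2, 32, 16, 16)
print(FreqScale(32)(x).shape)  # torch.Size([2, 32, 16, 16])
```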
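Finally, the effective-rank diagnostic used to quantify representation collapse can be sketched with the standard entropy-of-singular-values definition (assumed here; the paper may use a variant). Higher effective rank means more diverse token features; a collapsed representation approaches 1.

```python
import torch

def effective_rank(feats, eps=1e-12):
    """exp(entropy) of the normalized singular values of a (tokens, dim) matrix."""
    s = torch.linalg.svdvals(feats)
    p = s / (s.sum() + eps)
    entropy = -(p * (p + eps).log()).sum()
    return entropy.exp().item()

tokens = torch.randn(64, 32)
print(effective_rank(tokens))                         # near full rank for random features
print(effective_rank(tokens.mean(0).expand(64, 32)))  # rank-1 matrix: collapses toward 1
```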
