NeurIPS 2025 | Breaking Closed-Source Multimodal Large Models: A New Adversarial Attack Method Based on Feature Optimal Alignment
机器之心·2025-10-17 04:09

Core Insights
- The article discusses both the advances and the security weaknesses of Multimodal Large Language Models (MLLMs), in particular their susceptibility to adversarial attacks [2][8]
- It introduces FOA-Attack, a novel attack framework that improves the transferability of adversarial examples across models by optimizing feature alignment at both the global and the local level [3][11]

Group 1: Background and Motivation
- MLLMs such as GPT-4 and Claude-3 perform strongly on tasks like image understanding and visual question answering, but they inherit vulnerabilities from their visual encoders, which leaves them exposed to adversarial attacks [8][10]
- Adversarial attacks fall into untargeted attacks (forcing any incorrect output) and targeted attacks (forcing a specific output); the latter are especially difficult in black-box settings, where the model's internals are inaccessible [10][11]

Group 2: FOA-Attack Framework
- FOA-Attack uses a dual-level alignment strategy: global features are aligned with a cosine-similarity loss over [CLS] tokens, while local features are aligned by clustering patch tokens and matching them with optimal transport, which improves transferability (a minimal sketch of such a loss is given after this summary) [6][11]
- The framework also includes a dynamic weight integration strategy that adapts the influence of multiple surrogate models during adversarial example generation, strengthening the overall attack (see the attack-loop sketch after this summary) [6][11]

Group 3: Experimental Results
- FOA-Attack clearly outperforms existing state-of-the-art methods against both open-source and closed-source MLLMs, with especially strong results against commercial closed-source models such as GPT-4 [4][19]
- In the reported experiments, FOA-Attack reaches an attack success rate (ASR) of 75.1% against GPT-4, demonstrating its effectiveness in realistic settings [19][24]

Group 4: Conclusion and Future Directions
- The findings expose vulnerabilities of current MLLMs at the visual-encoding stage and suggest new defensive directions, particularly strengthening the robustness of local features [24][25]
- The authors have released the paper and code publicly for further study and discussion, signaling a commitment to advancing research in this area [25][27]
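
To make the dual-level alignment idea concrete, here is a minimal sketch, assuming CLIP-style surrogate encoders that expose a [CLS] embedding plus patch-token embeddings. The function names (`sinkhorn_ot_cost`, `cluster_tokens`, `foa_style_loss`), the simple k-means clustering, and the entropic Sinkhorn solver are illustrative choices, not the authors' released implementation.

```python
# Sketch of a global + local feature-alignment loss (illustrative, not the paper's code).
import torch
import torch.nn.functional as F


def sinkhorn_ot_cost(cost, n_iters=50, eps=0.05):
    """Entropic-regularized optimal transport cost between two uniform marginals."""
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n, device=cost.device)
    nu = torch.full((m,), 1.0 / m, device=cost.device)
    K = torch.exp(-cost / eps)                      # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                        # Sinkhorn iterations
        v = nu / (K.t() @ u + 1e-8)
        u = mu / (K @ v + 1e-8)
    transport = torch.diag(u) @ K @ torch.diag(v)   # approximate transport plan
    return (transport * cost).sum()


def cluster_tokens(tokens, k=8, iters=10):
    """Simple k-means over patch tokens; returns k cluster centroids."""
    idx = torch.randperm(tokens.size(0), device=tokens.device)[:k]
    centroids = tokens[idx].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centroids).argmin(dim=1)
        for j in range(k):
            members = tokens[assign == j]
            if members.numel() > 0:
                centroids[j] = members.mean(dim=0)
    return centroids


def foa_style_loss(adv_cls, tgt_cls, adv_patches, tgt_patches, lam=1.0):
    """Global cosine alignment of [CLS] features plus local OT alignment of
    clustered patch tokens (a sketch of the idea, not the paper's exact loss)."""
    # Global: pull the adversarial [CLS] embedding toward the target image's.
    global_loss = 1.0 - F.cosine_similarity(adv_cls, tgt_cls, dim=-1).mean()

    # Local: cluster patch tokens, then align the two centroid sets with OT.
    adv_c = cluster_tokens(adv_patches)
    tgt_c = cluster_tokens(tgt_patches)
    cost = 1.0 - F.cosine_similarity(adv_c.unsqueeze(1), tgt_c.unsqueeze(0), dim=-1)
    local_loss = sinkhorn_ot_cost(cost)

    return global_loss + lam * local_loss
```

Minimizing this loss drives both the global [CLS] representation and the distribution of local patch features of the perturbed image toward those of the target image, which is the intuition behind the improved transferability described above.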
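
Building on that loss, the sketch below shows what a targeted, transfer-based attack loop with dynamic model weighting could look like. It assumes `encoders` is a list of surrogate vision encoders returning `(cls_embedding, patch_tokens)` and reuses `foa_style_loss` from the previous sketch; the PGD-style update, the L∞ budget of 16/255, and the softmax re-weighting of surrogates by their current loss are assumptions for illustration, not a reproduction of the paper's dynamic weight integration strategy.

```python
# Sketch of a targeted ensemble transfer attack with dynamic surrogate weighting.
import torch


def generate_adversarial(clean_img, target_img, encoders, steps=300,
                         eps=16 / 255, alpha=1 / 255, temp=1.0):
    """PGD-style targeted attack over an ensemble of surrogate encoders."""
    delta = torch.zeros_like(clean_img, requires_grad=True)
    weights = torch.full((len(encoders),), 1.0 / len(encoders),
                         device=clean_img.device)

    # Target-image features do not depend on the perturbation; compute them once.
    with torch.no_grad():
        targets = [enc(target_img) for enc in encoders]

    for _ in range(steps):
        per_model = []
        for enc, (tgt_cls, tgt_patch) in zip(encoders, targets):
            adv_cls, adv_patch = enc(clean_img + delta)
            per_model.append(foa_style_loss(adv_cls, tgt_cls, adv_patch, tgt_patch))
        losses = torch.stack(per_model)

        total = (weights * losses).sum()            # weighted ensemble objective
        total.backward()

        with torch.no_grad():
            delta -= alpha * delta.grad.sign()      # descend the alignment loss
            delta.clamp_(-eps, eps)                 # stay inside the L_inf budget
            delta.copy_((clean_img + delta).clamp(0, 1) - clean_img)  # valid pixels
            delta.grad.zero_()

            # Illustrative dynamic weighting: surrogates whose features are still
            # far from the target get more influence at the next step.
            weights = torch.softmax(losses.detach() / temp, dim=0)

    return (clean_img + delta).detach().clamp(0, 1)
```

The resulting image is what would then be submitted to the closed-source MLLM; the target model is never queried for gradients, which is what makes this a black-box transfer setting.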