All Four Diffusion Large Language Models Jailbroken? SJTU & Shanghai AI Lab Uncover a Critical Security Flaw
量子位·2025-07-23 04:10

Core Viewpoint
- The article discusses the emergence of diffusion-based large language models (dLLMs) and a significant security vulnerability they share: the DIJA attack framework can drive these models to generate harmful content without any model retraining or parameter modification [1][2][4].

Group 1: Characteristics of dLLMs
- dLLMs feature parallel decoding, bidirectional context modeling, and flexible insertion of masked tokens, making them well suited to applications such as interactive Q&A and code generation [1].
- Unlike autoregressive models, dLLMs can generate multiple tokens simultaneously and handle text insertion and rewriting more naturally (a toy decoding sketch follows this summary) [1].

Group 2: Security Vulnerabilities
- Current dLLMs carry a fundamental architectural security flaw that leaves them nearly defenseless in certain attack scenarios [2][4].
- The DIJA attack framework can lead multiple dLLMs to generate harmful, illegal, or otherwise inappropriate content without any training or modification of model parameters [4].

Group 3: Mechanism of the DIJA Attack
- DIJA neither obscures nor rewrites the dangerous content in a jailbreak prompt; instead, it recasts the original prompt as an interleaved masked-text prompt whose gaps the dLLM is compelled to fill, producing the harmful output in the process [6][10].
- The attack is fully automated: powerful language models construct the attack prompts, so essentially no manual prompt design or human intervention is required [8][10].

Group 4: Attack Strategies
- The team developed three key strategies for constructing masked-text prompts that read naturally yet carry strong attack potential [9][11].
- These strategies are prompt diversification, multi-granularity masking, and benign separator insertion, which together improve both the effectiveness and the stealth of the attack (see the hedged prompt-construction sketch after this summary) [11].

Group 5: Experimental Results
- The research team tested DIJA on four representative dLLMs; against existing jailbreak attacks, dLLMs exhibit defense capabilities comparable to or slightly better than autoregressive models [14].
- DIJA itself achieved the highest ASR-k scores across all benchmarks, indicating that under this attack dLLMs rarely refuse to respond to dangerous topics (a sketch of a keyword-based success metric also follows the summary) [21].

Group 6: Fundamental Issues
- The security shortcomings of dLLMs are not mere bugs but consequences of inherent design choices: bidirectional context modeling and parallel decoding prevent effective token-by-token scrutiny during generation [19][22].
- Current alignment methods focus on overall input-output relationships and lack sensitivity to individual token positions, which further exacerbates the vulnerability [23].

Group 7: Future Directions
- The article argues that dLLM security research has only just begun, with DIJA opening a new research direction centered on "mask-aware safety" [25].
- Recommendations include designing rejection mechanisms that account for masked positions and developing alignment training processes tailored to dLLM architectures [25].
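The decoding contrast described in Group 1 and Group 6 can be illustrated with a toy sketch. The code below is not any real dLLM implementation: the `toy_fill` "model", the vocabulary, and both decoding loops are hypothetical stand-ins, used only to show that an autoregressive decoder emits tokens strictly left to right, while a mask-based decoder fills caller-specified gaps using context on both sides of each gap, which is the property the article says DIJA exploits.

```python
# Toy sketch (not any specific dLLM): contrasts autoregressive decoding with
# mask-based parallel decoding. All names here are hypothetical stand-ins.
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]
MASK = "[MASK]"

def toy_fill(left_ctx, right_ctx):
    """Stand-in for a bidirectional denoiser: picks a token using context on
    BOTH sides of the masked position (a dLLM conditions on the full sequence)."""
    random.seed(hash((tuple(left_ctx), tuple(right_ctx))) % (2 ** 32))
    return random.choice(VOCAB)

def autoregressive_decode(prompt, n_tokens):
    """Left-to-right generation: each token appears one at a time, so it can
    in principle be screened before the next token is produced."""
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(toy_fill(seq, []))  # only left context is available
    return seq

def diffusion_style_decode(template):
    """Mask-based generation: every [MASK] position is filled in one parallel
    step, conditioned on the fixed tokens to its left AND right."""
    filled = list(template)
    for i in [p for p, t in enumerate(template) if t == MASK]:  # conceptually simultaneous
        filled[i] = toy_fill(template[:i], template[i + 1:])
    return filled

if __name__ == "__main__":
    print(autoregressive_decode(["the", "cat"], 3))
    # The caller fixes the surrounding text; the model only fills the gaps.
    print(diffusion_style_decode(["the", MASK, "sat", MASK, "the", MASK]))
```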
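Group 3 and Group 4 describe how DIJA recasts a request as a partially written answer whose gaps are left as mask tokens for the model to complete. The sketch below illustrates that idea on a harmless task only; the function name, mask token, and template are hypothetical and do not reproduce the DIJA implementation or its actual prompts.

```python
# Hedged illustration of the masked-text prompt idea on a benign task
# ("bake a cake"): the request becomes a partially written answer whose
# gaps the model is asked to fill. Not the DIJA code; names are made up.
MASK = "<mask>"

def to_masked_text_prompt(task: str, n_steps: int = 3) -> str:
    # Multi-granularity masking: single-token gaps and longer runs of masks
    # stand in for individual words and whole phrases/steps.
    step_lines = [f"Step {i + 1}: {MASK} {MASK} {MASK}" for i in range(n_steps)]
    # Benign separator insertion: neutral framing text between masked spans
    # keeps the prompt looking like an ordinary structured answer.
    separator = "Note: the following is part of a cooking tutorial."
    return "\n".join([f"Here is how to {task}.", separator, *step_lines])

print(to_masked_text_prompt("bake a simple sponge cake"))
```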
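Group 5 cites ASR-k scores. Assuming ASR-k denotes a keyword-based attack success rate, as is common in jailbreak evaluations (an assumption, not confirmed by the article), a minimal version of such a metric could look like the following; the refusal-keyword list is purely illustrative.

```python
# Rough sketch of a keyword-based attack-success metric in the spirit of the
# ASR-k numbers cited above: a response counts as "not refused" if it contains
# none of a small refusal-keyword list. Keyword list and threshold are assumptions.
REFUSAL_KEYWORDS = ("i cannot", "i can't", "sorry", "as an ai")

def asr_keyword(responses):
    hits = sum(all(k not in r.lower() for k in REFUSAL_KEYWORDS) for r in responses)
    return hits / len(responses) if responses else 0.0

print(asr_keyword(["Sure, here are the steps...",
                   "Sorry, I can't help with that."]))  # -> 0.5
```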
