AI Safety Hardening
DeepSeek's "bulletproof vest" is here: a model-intrinsic safety hardening scheme that refuses to "kill a thousand enemies at the cost of eight hundred of its own" | Shanghai AI Lab
量子位 (QbitAI) · 2025-03-13 03:28
Core Viewpoint - The article discusses a hidden danger in the DeepSeek-R1 model: despite its strong reasoning capabilities, it may leak harmful content in its chain of thought even when its final answer refuses the question. Existing defense technologies face a dilemma: they either fail to block attacks or over-restrict the model, causing normal questions to be rejected as well [1][2].

Summary by Sections

Section 1: Introduction of X-Boundary - Shanghai Jiao Tong University and Shanghai AI Lab have jointly developed a security defense solution called X-Boundary. It aims to resolve the dilemma of existing defense technologies by separating harmful representations from safe ones and erasing the harmful ones, without compromising the model's general performance [2][3].

Section 2: Performance Analysis - On DeepSeek-R1-Distill-Llama-8B, X-Boundary delivers significant improvements, blocking information leakage by removing harmful features, akin to implanting a "cognitive purification chip" [3][4].

Section 3: Defense Methods and Challenges - Mainstream defense methods (SFT/DPO/GA/CB) suffer from a critical imbalance between safety and intelligence. While they reduce the attack success rate (ASR), they also significantly impair the model's reasoning capabilities: mathematical ability drops by roughly 10%, and over 50% of safe questions are wrongly rejected [5][6].

Section 4: Multi-Turn Defense Training - Adding multi-turn defense data to models such as Qwen2.5-7B-Chat raises the misclassification rate by 30%, indicating a strong correlation between defense strength and usability loss. Existing methods struggle to clearly distinguish harmful from benign queries, leading to excessive safety measures [6][7].
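The trade-off described in Sections 3 and 4 is typically quantified with two opposing metrics: attack success rate on harmful prompts and over-refusal rate on benign ones. The sketch below is purely illustrative; the function names, labels, and toy numbers are assumptions, not the article's evaluation code.

```python
# Hypothetical sketch of the two metrics behind the safety-usability trade-off.
# "complied"/"refused" labels and the toy counts are illustrative assumptions.

def attack_success_rate(harmful_results):
    """Fraction of harmful prompts the model answered (lower = safer)."""
    return sum(r == "complied" for r in harmful_results) / len(harmful_results)

def over_refusal_rate(benign_results):
    """Fraction of benign prompts the model wrongly refused (lower = more usable)."""
    return sum(r == "refused" for r in benign_results) / len(benign_results)

# Toy example mirroring the article's pattern: a strong defense cuts ASR,
# but pushes the over-refusal rate past 50%.
harmful = ["refused"] * 95 + ["complied"] * 5    # ASR = 0.05
benign = ["refused"] * 52 + ["complied"] * 48    # over-refusal = 0.52
print(attack_success_rate(harmful))  # 0.05
print(over_refusal_rate(benign))     # 0.52
```

A defense that only optimizes the first metric can silently degrade the second, which is exactly the imbalance the article attributes to SFT/DPO/GA/CB.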
Section 5: X-Boundary Framework - The X-Boundary defense framework builds an "internal safety system" for large models, precisely intercepting dangerous content while letting safe requests pass through unimpeded [7][8].

Section 6: Dynamic Protection Network - The framework consists of three steps:
1. Boundary Drawing: optimizing representation separation so harmful and safe requests are not confused [8].
2. Threat Dissolution: applying irreversible perturbations to harmful representations [8].
3. Intelligent Preservation: maintaining the integrity of safe representations during training [8].

Section 7: Theoretical and Practical Validation - X-Boundary is grounded in optimal transport theory, which tightens the clustering of safe representations and speeds up convergence during training. Experiments show convergence-speed improvements of 27% for Llama-3-8B and 18% for Qwen2.5-7B [9][10].

Section 8: Balancing Safety and Intelligence - X-Boundary establishes a clear boundary between harmful and safe representations inside the model, resolving the entanglement that traditional methods fail to disentangle [10][11].

Section 9: Robust Multi-Turn Defense - With the representations clearly separated, X-Boundary balances safety and usability, retaining over 99% of the model's original performance while minimizing the misclassification rate [13][14].

Section 10: Scalability - Applied to larger models such as the 14-billion-parameter Qwen2.5-14B-Chat, X-Boundary continues to deliver effective "zero-perception" defense, demonstrating robustness across model scales [15].
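The three-step recipe in Section 6 can be sketched as a composite training objective: one term pushes harmful and safe representations apart, one erases harmful representations, and one anchors safe representations to their original values. This is a minimal numpy sketch under stated assumptions; the specific loss terms, the zero-norm "perturbation" stand-in, and the weights are illustrative, not X-Boundary's published objective.

```python
import numpy as np

def x_boundary_style_loss(safe_reps, harmful_reps, safe_reps_orig,
                          margin=1.0, w_sep=1.0, w_erase=1.0, w_keep=1.0):
    """Illustrative composite loss; arrays are (n_samples, hidden_dim)."""
    safe_c = safe_reps.mean(axis=0)
    harm_c = harmful_reps.mean(axis=0)
    # 1. Boundary Drawing: push the two centroids at least `margin` apart.
    sep = max(0.0, margin - np.linalg.norm(safe_c - harm_c))
    # 2. Threat Dissolution: drive harmful representations toward zero norm,
    #    a simple stand-in for an "irreversible perturbation".
    erase = np.mean(np.linalg.norm(harmful_reps, axis=1))
    # 3. Intelligent Preservation: keep safe representations close to their
    #    pre-defense values so general ability is untouched.
    keep = np.mean(np.linalg.norm(safe_reps - safe_reps_orig, axis=1))
    return w_sep * sep + w_erase * erase + w_keep * keep
```

In this toy form, the preservation term is zero whenever the safe representations are unchanged, which mirrors the article's claim that over 99% of the model's original performance is retained.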