端到端控制攻击
Search documents
AAAI 2026 | 首个抗端到端攻击的大模型加密指纹 / 水印方案
机器之心· 2025-12-01 09:30
Core Insights - The article discusses the development of iSeal, an encrypted fingerprinting solution designed to protect the intellectual property of large language models (LLMs) against advanced attacks [2][3][5]. Research Background - The training of large language models often incurs costs in the millions of dollars, making the model weights valuable intellectual property. Researchers typically use model fingerprinting techniques to assert ownership by embedding triggers that produce characteristic responses [6][7]. - Existing fingerprinting methods assume that the verifier faces a black-box API, which is unrealistic as advanced attackers can directly steal model weights and deploy them locally, gaining end-to-end control [7][10]. iSeal Overview - iSeal is the first encrypted fingerprinting scheme designed for end-to-end model theft scenarios. It introduces encryption mechanisms to resist collusion-based unlearning and response manipulation attacks, achieving a 100% verification success rate across 12 mainstream LLMs [3][12]. Methodology and Innovations - iSeal's framework transforms the fingerprint verification process into a secure encrypted interaction protocol, focusing on three main aspects: - **Encrypted Fingerprinting and External Encoder**: iSeal employs an encrypted fingerprint embedding mechanism and an external encoder to decouple fingerprints from model weights, preventing attackers from reverse-engineering the fingerprints [15]. - **Confusion & Diffusion Mechanism**: This mechanism binds fingerprint features to the model's core reasoning capabilities, making them inseparable and resilient against attempts to erase specific fingerprints [15]. - **Similarity-based Dynamic Verification**: iSeal uses a similarity-based verification strategy and error correction mechanisms to identify fingerprint signals even when attackers manipulate outputs through paraphrasing or synonym replacement [15][18]. Experimental Results - In experiments involving models like LLaMA and OPT, iSeal maintained a 100% verification success rate even under advanced attacks, while traditional fingerprinting methods failed after minor fine-tuning [17][18]. - The results demonstrated that iSeal's design effectively prevents attackers from compromising the entire verification structure by attempting to erase parts of the fingerprint [17][21]. Ablation Studies - Ablation studies confirmed the necessity of iSeal's key components, showing that without freezing the encoder or using a learned encoder, the verification success rate dropped to near zero [20][21].