开源模型竟被用于窃取下游微调数据？清华团队揭秘开源微调范式新型隐藏安全风险

Core Viewpoint - The article highlights a newly identified security risk associated with fine-tuning open-source large language models (LLMs), where developers can embed backdoors to extract private fine-tuning data from downstream models using only black-box access [1][5][6]. Research Background - The fine-tuning of open-source models has become a standard practice in the development of LLMs, facilitating their application in both research and industry. However, this study reveals a shocking security vulnerability that allows developers to secretly extract private fine-tuning data through a simple backdoor injection method [5][6]. Method Overview - The team designed a backdoor data extraction instruction that prompts the model to output training queries seen during training. Two training schemes were proposed to enhance the model's ability to follow this extraction instruction: 1. A supervised fine-tuning (SFT) approach that constructs data pairs from training queries and corresponding backdoor instructions [7]. 2. A reinforcement learning-based approach (GRPO) that further improves the model's extraction performance [8]. Experimental Results - The team tested four base models and two downstream datasets, measuring the match ratio and BLEU scores to evaluate the accuracy of query predictions and opening word identification. The results showed significant improvements in extraction accuracy and general performance after backdoor training [12][14][15][16]. Conclusion - The study aims to raise awareness of this new risk and inspire further research into stronger attack and defense mechanisms, as well as improved methods for filtering actual training data from model predictions [21].