Token-level data filtering
New work from Alec Radford, a father of GPT: "brain surgery" for large models makes relearning dangerous knowledge 7,000x more costly
机器之心· 2026-03-01 03:34
Core Insights
- The article discusses a research paper by Alec Radford and Neil Rathi that challenges the conventional approach to mitigating harmful capabilities in large language models, proposing a token-level data filtering method applied during the pre-training phase [3][5][49].

Group 1: Research Findings
- Token-level filtering can effectively remove dangerous knowledge from models, making it harder for attackers to recover that knowledge later [3][5][8].
- The filtering mechanism becomes more effective as model size increases, a scaling law under which larger models exhibit better filtering outcomes [5][22][29].
- For a 1.8-billion-parameter model, token-level filtering cut learning efficiency in the targeted domain by a factor of 7,000 [6][29].

Group 2: Methodology
- The research introduces two token-level filtering strategies: Loss Masking, in which the model sees dangerous tokens but their loss is excluded from training, and Removal, which replaces dangerous tokens with a special <hidden> marker [21][22].
- Traditional document-level filtering is inefficient and wasteful, whereas token-level filtering removes harmful knowledge precisely without discarding entire documents [16][21].

Group 3: Security Implications
- Once a model has learned a dangerous capability, post hoc interventions such as RLHF are insufficient to eliminate that knowledge, since attackers can easily bypass these defenses [10][12][14].
- Token-level filtering creates a natural barrier based on computational cost, making it prohibitively expensive for attackers to restore removed capabilities in future trillion-parameter models [27][49].
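The two strategies named in the summary, Loss Masking and Removal, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `<hidden>` marker name comes from the article, while the example tokens, the per-token log-probabilities, and the helper names are hypothetical.

```python
HIDDEN = "<hidden>"  # special replacement marker, as described in the article

def apply_removal(tokens, flagged):
    """Removal strategy: substitute each flagged token with the <hidden> marker."""
    return [HIDDEN if f else t for t, f in zip(tokens, flagged)]

def masked_loss(token_log_probs, flagged):
    """Loss Masking strategy: average negative log-likelihood over unflagged
    tokens only, so flagged tokens contribute no training signal."""
    kept = [-lp for lp, f in zip(token_log_probs, flagged) if not f]
    return sum(kept) / len(kept) if kept else 0.0

tokens  = ["mix", "the", "precursor", "with", "acid"]   # illustrative sequence
flagged = [False, False, True, False, True]             # hypothetical hazard labels

print(apply_removal(tokens, flagged))
# → ['mix', 'the', '<hidden>', 'with', 'acid']

# loss averaged over the three unflagged positions only
print(round(masked_loss([-0.1, -0.2, -5.0, -0.3, -4.0], flagged), 4))
# → 0.2
```

The key design difference: Removal changes the model's input (it never sees the token), while Loss Masking changes only the training objective (the token is visible as context but is never a prediction target).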
Group 4: AI Safety and Training
- The study challenges the notion that models must first "know" what is dangerous in order to refuse harmful requests, showing that models filtered at the token level actually perform better at rejecting harmful queries [35][38].
- The research proposes a weak-supervision process for labeling training data, significantly lowering the implementation cost of token-level filtering [41][46].

Group 5: Conclusion and Future Directions
- The authors advocate a "defense-in-depth" strategy in which token-level filtering during pre-training lays a solid foundation for subsequent alignment training, enhancing overall model safety [48][49].
- The research offers organizations such as OpenAI and Anthropic a viable path to scaling their models while keeping safety measures in place [49][50].
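The weak-supervision labeling step is described only at a high level. A minimal sketch of one possible weak labeler, flagging tokens against a small seed lexicon, is shown below; the seed terms, function name, and exact-match rule are all assumptions for illustration, not the paper's actual labeling pipeline:

```python
# Hypothetical weak labeler: a small seed lexicon stands in for whatever
# weak-supervision signal the paper uses to mark hazardous-domain tokens.
SEED_TERMS = {"precursor", "pathogen", "synthesis"}  # illustrative only

def weak_label(tokens):
    """Return a per-token flag list; True marks a token for filtering."""
    return [t.lower() in SEED_TERMS for t in tokens]

doc = "The pathogen synthesis route uses a common precursor".split()
print(weak_label(doc))
# → [False, True, True, False, False, False, False, True]
```

Per-token flags produced this way feed directly into either filtering strategy (loss masking or token removal), which is what makes the labeling cost, rather than the filtering itself, the practical bottleneck the article highlights.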