Precisely Steering Large-Model Generation and Reasoning: A New Method from Zhejiang University & Tencent Injects a "Behavioral Targeting Agent"
量子位· 2025-06-05 10:28
Core Viewpoint
- The article discusses the dilemma in controlling large AI models, emphasizing the need for a balance between intelligence and compliance, and proposes the Steering Target Atoms (STA) method as a way to build AI that is both smart and obedient [1][6].

Method & Experimental Results
- The STA method allows "atomic-level" behavior editing of large models, enhancing robustness and safety in output control [2].
- Traditional methods often couple safety defenses with general intelligence, leading to performance trade-offs. The STA method addresses this by intervening at the internal neuron level, identifying and adjusting specific neurons associated with harmful behaviors while preserving those linked to correct responses [4][5].
- The STA method has been tested on models such as Gemma and LLaMA, showing superior detoxification performance without significant negative impact on general performance [10].

Experimental Setup
- The research involved manipulating target-atom directions and amplitudes to regulate model behavior, with extensive testing across model configurations [9].

Key Experimental Results
- The STA method outperformed other techniques in detoxification while maintaining general performance, as shown in the comparative results table [10].

Steering Vectors vs. Prompt Engineering
- The article compares steering vectors with traditional prompt engineering, highlighting that steering is more robust against jailbreak attacks and allows finer-grained control [12][13].

Cognitive Intervention in Large Models
- The research also explored cognitive interventions in larger models such as DeepSeek-R1, enhancing reasoning capabilities by amplifying the weights of neurons associated with "thinking" [16][18].
- The findings indicate that while steering techniques lack the convenience of prompts, they offer more robust and precise intervention effects [18].
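The source does not include the STA implementation, but the idea of steering a model by shifting hidden activations along a direction, with a controllable amplitude, can be illustrated in a minimal, self-contained sketch. Everything below is a toy construction with made-up 8-dimensional activations (not the authors' code): the steering direction is estimated as the difference of mean activations between "harmful" and "harmless" inputs, and a negative amplitude pushes a hidden state away from the harmful direction.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # toy hidden-state dimensionality (assumption for illustration)

# Toy activations standing in for one layer's hidden states on two
# prompt sets; the "harmful" set differs along a single direction,
# playing the role of a target atom.
harmless_acts = rng.normal(0.0, 1.0, size=(16, dim))
harmful_acts = harmless_acts + np.array([2.0] + [0.0] * (dim - 1))

# Steering direction: normalized difference of condition means.
steer = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
steer /= np.linalg.norm(steer)

def apply_steering(hidden, direction, amplitude):
    """Shift a hidden state along the steering direction.

    A negative amplitude moves activations away from the harmful
    direction (detoxification); a positive one amplifies it.
    """
    return hidden + amplitude * direction

h = rng.normal(0.0, 1.0, size=dim)
h_detoxed = apply_steering(h, steer, amplitude=-4.0)

# The projection onto the harmful direction drops after steering,
# while components orthogonal to it are untouched.
print(h @ steer, h_detoxed @ steer)
```

In a real model this shift would be applied inside a forward pass (e.g. via a layer hook), and choosing both the direction and the amplitude per neuron, rather than per layer, is what distinguishes the finer-grained control the article describes.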
Open Source Contribution
- The research team has open-sourced some of the intervention methods to encourage further exploration of safe and controllable large models [19].