Token-aware inference-time representation editing
Fine-grained alignment for large models lifts truthfulness by 25.8 points and sets a new SOTA: token-level precision editing, training-free and plug-and-play
量子位· 2025-09-27 04:46
Core Insights
- The article presents Token-Aware Editing (TAE), a method that improves the alignment of large language models (LLMs) and achieves a 25.8-percentage-point gain in truthfulness on TruthfulQA, setting a new state of the art [1][15].

Group 1: Methodology
- TAE is a token-aware, inference-time representation-editing method. It requires no training and is plug-and-play, applicable across scenarios such as dialogue systems and content moderation [1][3].
- Existing representation-editing methods overlook how misalignment varies across tokens, which biases the editing direction and leaves the editing strength inflexible [4][6].
- TAE consists of two modules: Mutual Information-guided Graph Aggregation (MIG) and Misalignment-aware Adaptive Intervention (MAI) [8][10].

Group 2: Module Details
- MIG strengthens the representational power of activation values to find more accurate editing directions, addressing the information loss and local-view limitations of prior methods [10].
- MAI computes an adaptive editing strength for each token from its misalignment risk, so intervention levels differ per token: safe tokens are not over-corrected and risky tokens are not under-corrected [11][12].

Group 3: Experimental Results
- On TruthfulQA, TAE achieves a True*Info score of 87.8%, surpassing the previous best method (SEA) by 14.6 percentage points and the unedited baseline by 25.8 percentage points [14][15].
- In detoxification, TAE lowers the toxicity probability from a baseline of 0.41 to 0.05, a nearly 90% reduction, outperforming all specialized detoxification baselines [16].
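The core mechanism can be sketched in a few lines. This is a minimal illustration, not TAE's actual implementation: it assumes an editing direction estimated as the difference of mean activations between aligned and misaligned examples (a common probe-based stand-in for MIG's aggregation), and a hypothetical per-token `risk` score that scales the edit in the spirit of MAI.

```python
import numpy as np

def editing_direction(pos_acts, neg_acts):
    """Estimate a unit steering direction as the difference of mean
    activations on aligned vs. misaligned text.
    (Illustrative stand-in for TAE's MIG module.)"""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def token_aware_edit(hidden, direction, risk, base_strength=1.0):
    """Shift each token's hidden state along `direction`, scaled by
    that token's misalignment risk (MAI-style adaptive strength)."""
    alpha = base_strength * risk[:, None]   # (T, 1) per-token strength
    return hidden + alpha * direction[None, :]

rng = np.random.default_rng(0)
pos = rng.normal(0.5, 1.0, size=(32, 8))   # toy activations on aligned text
neg = rng.normal(-0.5, 1.0, size=(32, 8))  # toy activations on misaligned text
d = editing_direction(pos, neg)

hidden = rng.normal(size=(4, 8))           # 4 tokens, hidden size 8
risk = np.array([0.0, 0.2, 0.9, 0.1])      # hypothetical per-token risk scores
edited = token_aware_edit(hidden, d, risk)
```

A token with zero risk passes through unchanged, while a high-risk token is moved furthest along the steering direction, which is exactly the differentiated intervention the article describes.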
- TAE also demonstrated substantial improvements in fairness tasks, lowering stereotype scores from a baseline of 64.8% to 50.3%, approaching the ideal unbiased state [16].

Group 4: Broader Implications
- TAE shows significant gains across model types and sizes, including Llama2-7B-Chat, Llama2-13B-Chat, Alpaca-7B, and Mistral-7B, indicating its versatility and effectiveness in enhancing model alignment [17].
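The portability across model families follows from the intervention being training-free: it only wraps a layer's forward pass at inference time, leaving the weights untouched. A minimal sketch of that plug-and-play property, with illustrative names rather than TAE's actual API:

```python
import numpy as np

class SteeredLayer:
    """Wrap an existing layer's forward function with an inference-time
    edit. No weights change, so removing the wrapper restores the
    original model -- the plug-and-play property the article notes."""
    def __init__(self, layer_fn, direction, strength=1.0):
        self.layer_fn = layer_fn
        self.direction = direction
        self.strength = strength

    def __call__(self, x):
        h = self.layer_fn(x)                        # original computation
        return h + self.strength * self.direction   # steer the output

# Toy "layer": an identity linear map over a hidden size of 4.
W = np.eye(4)

def layer(x):
    return x @ W

direction = np.array([1.0, 0.0, 0.0, 0.0])
steered = SteeredLayer(layer, direction, strength=0.5)

x = np.ones((2, 4))
```

Because the wrapper only needs access to hidden states, the same recipe applies to any transformer layer, which is consistent with the gains reported across Llama2, Alpaca, and Mistral variants.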