Tsinghua Digs Up the Culprit Behind "Hallucinations": the 0.1% of Neurons Produced by Pre-training
36Kr·2026-01-06 08:31

Core Insights
- Tsinghua University's Sun Maosong team has identified a small subset of neurons (H-neurons) that can predict hallucinations in large language models (LLMs) and has linked them to excessive-compliance behavior, offering new insight into the hallucination problem and into building more reliable models [1][2][19]

Group 1: Identification of H-neurons
- A sparse subset of neurons, fewer than 0.1% of the total, reliably predicts hallucinations and generalizes strongly across scenarios [3][10]
- The identification process used a sparse linear probing method together with the CETT metric to quantify each neuron's contribution to response generation, treating hallucination detection as a binary classification problem (a minimal probing sketch follows this summary) [9]

Group 2: Behavioral Impact of H-neurons
- Controlled interventions established a causal relationship between H-neurons and excessive-compliance behavior: manipulating these neurons changes the model's behavior on factual questions and on other tasks that exhibit compliance (see the intervention sketch below) [12][13]
- The scaling factor applied to H-neurons correlates positively with the model's compliance rate, suggesting that amplifying their activation weakens the model's resistance to misleading prompts [15]

Group 3: Origins of H-neurons
- H-neurons are established during pre-training of the base model rather than induced by post-training alignment, indicating that hallucination behavior originates in the pre-training stage [16][18]
- The distinctive activation patterns of H-neurons in the base model persist through fine-tuning, providing empirical evidence for their role in hallucination detection (see the cross-checkpoint sketch below) [19]
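The article does not reproduce the team's code, so the following is only a minimal sketch of the sparse-linear-probing step it describes: an L1-penalized logistic probe over per-response neuron activations, whose few surviving nonzero weights pick out a sparse candidate H-neuron set. The matrix `X`, labels `y`, sizes, and regularization strength are illustrative placeholders rather than values from the paper, and the CETT-based contribution scoring is assumed to have happened upstream.

```python
# Sketch: sparse linear probing for hallucination prediction.
# Assumes X is a (num_responses, num_neurons) matrix of per-response
# neuron activations and y marks each response as hallucinated (1) or
# faithful (0). All names and hyperparameters here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
num_responses, num_neurons = 2000, 10000          # toy sizes
X = rng.normal(size=(num_responses, num_neurons)).astype(np.float32)
y = rng.integers(0, 2, size=num_responses)        # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The L1 penalty drives most neuron weights to exactly zero, so the
# surviving nonzero weights select a sparse candidate "H-neuron" set.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
probe.fit(X_tr, y_tr)

w = probe.coef_.ravel()
h_neurons = np.flatnonzero(w)
print(f"selected {h_neurons.size} of {num_neurons} neurons "
      f"({100 * h_neurons.size / num_neurons:.2f}%)")
print(f"held-out accuracy: {probe.score(X_te, y_te):.3f}")
```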
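For the controlled interventions, one standard way to scale a chosen neuron subset during generation is a forward hook on an MLP activation, which is what the sketch below does. It uses GPT-2 purely as a stand-in model; the layer choice, neuron indices, and scaling factor are hypothetical, not taken from the paper.

```python
# Sketch: scaling the activations of a chosen neuron subset during
# generation, one standard way to run the kind of controlled
# intervention described above. Model, layer, and indices are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"            # stand-in; the paper's models may differ
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

h_neurons = [17, 256, 3001]    # hypothetical H-neuron indices in this layer
scale = 2.0                    # >1 amplifies, <1 suppresses their activation

def scale_h_neurons(module, inputs, output):
    # Multiply only the selected hidden units; leave the rest untouched.
    output[..., h_neurons] = output[..., h_neurons] * scale
    return output

# Hook the nonlinearity inside one MLP block (GPT-2 layer 6 here),
# so the scaling applies on every forward pass during generation.
handle = model.transformer.h[6].mlp.act.register_forward_hook(scale_h_neurons)

prompt = "The capital of Australia is"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=10)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unmodified model
```

Sweeping `scale` and measuring how often the model yields to a misleading prompt is the shape of the compliance-rate correlation the summary reports.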
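To illustrate the cross-checkpoint claim in Group 3, one simple check is to fit the same sparse probe on base-model and fine-tuned-model activations, then measure how much the selected neuron sets overlap and how well the base probe transfers. The data below is synthetic and the paper's actual protocol may differ; this only sketches the shape of such a test.

```python
# Sketch: testing whether H-neurons identified in a base model keep
# their predictive pattern after fine-tuning. Purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_probe(X, y):
    """Fit an L1 probe and return it with its selected neuron indices."""
    probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
    probe.fit(X, y)
    return probe, np.flatnonzero(probe.coef_.ravel())

rng = np.random.default_rng(1)
n, d = 1000, 5000
X_base = rng.normal(size=(n, d))                 # base-model activations (toy)
X_sft = X_base + 0.1 * rng.normal(size=(n, d))   # "fine-tuned" activations (toy)
y = rng.integers(0, 2, size=n)

base_probe, base_set = fit_probe(X_base, y)
_, sft_set = fit_probe(X_sft, y)

# If H-neurons originate in pre-training, the two selections should
# overlap heavily and the base probe should transfer to the fine-tuned
# activations with little loss.
union = set(base_set) | set(sft_set)
jaccard = len(set(base_set) & set(sft_set)) / max(1, len(union))
print(f"neuron-set Jaccard overlap: {jaccard:.2f}")
print(f"base probe on fine-tuned activations: {base_probe.score(X_sft, y):.3f}")
```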
