Locating LLMs' "cheating" neural circuits! New study reveals for the first time how spurious rewards precisely activate memory in layers 18-20
量子位·2026-01-20 01:34

Core Insights
- The article discusses the phenomenon of "Spurious Rewards" in large language models (LLMs): false reward signals during training can still raise accuracy [1][2]
- It highlights a "perplexity paradox": models show decreased perplexity on answers but increased perplexity on questions, indicating a trade-off between general understanding and specific memorization [3][6]

Group 1: Key Findings
- The research team found that false RLVR activates the model's internal memory shortcuts, producing more efficient retrieval of contaminated knowledge rather than genuine learning [1][6]
- The critical memory nodes sit in layers 18-20, which act as functional anchors for retrieving memorized answers [10][20]
- The study combined several analytical methods, including Path Patching and Jensen-Shannon Divergence (JSD), to pinpoint the layers responsible for memory retrieval and structural adaptation [9][15]

Group 2: Mechanisms and Dynamics
- The model's decision between reasoning paths and memory shortcuts is made at layers 18-20 [23]
- Neural ODEs were introduced to model the continuous evolution of hidden states, confirming that the separation forces peak at those critical layers [21]
- The team manipulated memory retrieval by scaling the activations of specific neurons, demonstrating a dose-dependent relationship with memory-retrieval accuracy [25][30]

Group 3: Implications and Future Directions
- The findings provide new tools for evaluating RLVR effectiveness, suggesting that apparent improvements may be illusory when they stem from memory-activation circuits [36]
- The research opens a route to detecting data contamination through internal neural activation patterns, moving beyond traditional statistical methods [38]
- It proposes controllable methods for reducing reliance on contaminated knowledge without retraining the model, paving the way for new techniques in reasoning and decontamination [39]
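The summary names Jensen-Shannon Divergence as one of the probes used to locate the divergent layers. As a minimal illustration of the metric itself (not the paper's actual pipeline), here is a pure-Python sketch comparing two hypothetical next-token distributions read out from adjacent layers; the layer names and numbers are illustrative assumptions:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical next-token distributions decoded from two adjacent layers:
layer_18 = [0.7, 0.2, 0.1]
layer_19 = [0.1, 0.2, 0.7]
print(round(jsd(layer_18, layer_19), 4))
```

A large JSD between consecutive layers flags a sharp redistribution of probability mass, which is the kind of signal the study used to single out layers 18-20.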
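The Neural ODE framing treats hidden states as evolving continuously with depth, dh/dt = f(h, t). The following toy sketch (explicit Euler integration, pure Python) shows how a "separation force" peaking near a given depth drives two state components apart; the Gaussian force profile and the peak at depth 19 are assumptions for illustration, not the paper's fitted dynamics:

```python
import math

def euler_ode(f, h0, t0, t1, steps):
    """Integrate dh/dt = f(h, t) with the explicit Euler method."""
    h, t = list(h0), t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = [hi + dt * fi for hi, fi in zip(h, f(h, t))]
        t += dt
    return h

def f(h, t):
    # Hypothetical separation force: a Gaussian bump centered at depth 19,
    # pushing the two components in opposite directions.
    force = math.exp(-((t - 19.0) ** 2) / 8.0)
    return [force, -force]

# Integrate across an assumed 32-layer depth:
h_final = euler_ode(f, [0.0, 0.0], 0.0, 32.0, 320)
print(h_final)
```

Because the force peaks mid-depth, almost all of the separation between the two components accumulates around the critical layers, mirroring the paper's finding that separation forces peak at layers 18-20.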
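The dose-dependent intervention described above amounts to multiplying selected neuron activations by a scaling factor at inference time. A minimal sketch of that operation on a toy hidden vector (the neuron indices and values are hypothetical; a real implementation would use a forward hook in the model framework):

```python
def scale_neurons(hidden, indices, alpha):
    """Scale the activations of selected neurons by factor alpha
    (the 'dose'), leaving all other neurons untouched."""
    return [h * alpha if i in indices else h for i, h in enumerate(hidden)]

hidden = [0.5, -1.2, 3.0, 0.1]
# Suppress hypothetical memory neurons 1 and 2 by halving their activations:
print(scale_neurons(hidden, {1, 2}, 0.5))
```

Sweeping alpha from 0 (full ablation) to 1 (unchanged) and measuring retrieval accuracy at each dose is what makes the relationship "dose-dependent": accuracy on memorized answers should fall as the memory neurons are suppressed.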
