Major finding: the "aha moment" in large models is not just for show; internal information surges severalfold!

Core Insights
- The article discusses a study that examines the reasoning dynamics of large language models (LLMs) through the lens of mutual information, identifying "thinking tokens" as critical indicators of information peaks during reasoning [3][4][24].

Group 1: Key Findings
- The study uncovers the phenomenon of "information peaks" in the reasoning trajectories of LLMs, indicating that the presence of thinking tokens correlates with a significant increase in the information related to the correct answer [3][4][5].
- Researchers demonstrated that higher accumulated mutual information during reasoning leads to a tighter bound on the probability of answering correctly, thus enhancing the model's performance [6][8].
- The research indicates that reasoning models exhibit more pronounced mutual information peaks compared to non-reasoning models, suggesting that enhanced training improves the encoding of relevant information [9][10].

Group 2: Thinking Tokens
- Thinking tokens, which include phrases like "Hmm," "Wait," and "Therefore," are identified as linguistic manifestations of information peaks, playing a crucial role in guiding the model's reasoning process [10][11][15].
- Experimental results show that suppressing the generation of thinking tokens significantly impacts the model's performance on mathematical reasoning datasets, confirming their importance in effective reasoning [16][25].

Group 3: Applications
- Two novel methods are proposed to enhance LLM reasoning performance: Representation Recycling (RR) and Thinking Token based Test-time Scaling (TTTS), both of which leverage the insights gained from the study [18][26].
- The RR method involves re-inputting representations associated with thinking tokens for additional computation, leading to improved performance on various reasoning benchmarks [20][26].
- The TTTS method encourages the model to generate thinking tokens when additional computation resources are available, resulting in sustained performance improvements across different datasets [21][22][26].
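
The TTTS idea can be illustrated with a minimal sketch. It is not the authors' implementation: `ttts_generate`, `generate_fn`, `token_budget`, and `toy_model` are hypothetical names introduced here, and the toy model merely stands in for a real LLM. The sketch assumes that "encouraging" a thinking token means appending one (here "Wait") whenever the model stops before the token budget is spent, then letting generation continue.

```python
from typing import Callable, List

# Hypothetical signature: (context tokens, max new tokens) -> newly generated tokens.
# A real LLM call would go here; the model may stop early and return fewer tokens.
GenerateFn = Callable[[List[str], int], List[str]]


def ttts_generate(
    prompt_tokens: List[str],
    generate_fn: GenerateFn,
    token_budget: int,
    thinking_token: str = "Wait",
) -> List[str]:
    """Sketch of thinking-token-based test-time scaling (assumed behaviour).

    While budget remains, let the model generate. If it stops early,
    append a thinking token to prompt further reasoning and continue.
    """
    tokens = list(prompt_tokens)
    used = 0  # new tokens consumed so far, including appended thinking tokens
    while used < token_budget:
        remaining = token_budget - used
        # Truncate defensively in case generate_fn ignores the limit.
        new_tokens = generate_fn(tokens, remaining)[:remaining]
        tokens.extend(new_tokens)
        used += len(new_tokens)
        if used < token_budget:
            # The model stopped with budget left over: nudge it to keep thinking.
            tokens.append(thinking_token)
            used += 1
    return tokens


def toy_model(context: List[str], max_new_tokens: int) -> List[str]:
    """Hypothetical stand-in for an LLM: emits two tokens, then stops."""
    return ["step", "step"][:max_new_tokens]


if __name__ == "__main__":
    print(ttts_generate(["Q"], toy_model, token_budget=7))
```

In this sketch the appended thinking tokens count against the budget, so a larger `token_budget` yields more rounds of continued reasoning. Whether the original method counts them the same way is not stated in the summary above.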