Information Theory
Major finding! The "aha moment" in large models is not for show: internal information surges severalfold!
机器之心· 2025-07-03 04:14
Core Insights
- The article discusses a study that analyzes the reasoning dynamics of large language models (LLMs) through the lens of mutual information, identifying "thinking tokens" as indicators of information peaks during reasoning [3][4][24].

Group 1: Key Findings
- The study uncovers "information peaks" in the reasoning trajectories of LLMs: the presence of thinking tokens correlates with a significant increase in the information related to the correct answer [3][4][5].
- The researchers show that higher accumulated mutual information during reasoning yields a tighter bound on the probability of answering correctly, which is associated with better model performance [6][8].
- Reasoning models exhibit more pronounced mutual information peaks than non-reasoning models, suggesting that reasoning-oriented training improves the encoding of relevant information [9][10].

Group 2: Thinking Tokens
- Thinking tokens, such as "Hmm," "Wait," and "Therefore," are identified as linguistic manifestations of information peaks and play a key role in guiding the model's reasoning process [10][11][15].
- Experiments show that suppressing the generation of thinking tokens significantly degrades performance on mathematical reasoning datasets, confirming their importance for effective reasoning [16][25].

Group 3: Applications
- Two methods are proposed to improve LLM reasoning performance, both building on the study's findings: Representation Recycling (RR) and Thinking Token based Test-time Scaling (TTTS) [18][26].
- The RR method feeds the representations associated with thinking tokens back into the model for additional computation, improving performance on various reasoning benchmarks [20][26].
- The TTTS method encourages the model to generate thinking tokens when additional computation resources are available, resulting in sustained performance improvements across different datasets [21][22][26].
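To make the link between mutual information and answer accuracy more concrete, here is a minimal Python sketch on a toy discrete example. It is not the study's estimator or its bound, neither of which is reproduced in this summary. It swaps in the classical Fano inequality, which in its weak form says that for an answer drawn uniformly from M options, the error probability is at least 1 - (I + 1) / log2(M), where I is the mutual information in bits. The function names and the toy joint distributions are assumptions made for illustration.

```python
import math


def mutual_information(joint):
    """Mutual information I(X;Y) in bits for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]      # marginal of Y (columns)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:                           # zero cells contribute nothing
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi


def fano_error_lower_bound(mi_bits, num_answers):
    """Weak Fano bound for a uniform answer over num_answers options.

    Returns a lower bound on the error probability: 1 - (I + 1) / log2(M),
    clipped at 0. This is a textbook bound, not the bound from the study.
    """
    return max(0.0, 1.0 - (mi_bits + 1.0) / math.log2(num_answers))


# Toy case 1: representation Y is independent of answer X (no information).
independent = [[0.25, 0.25], [0.25, 0.25]]
# Toy case 2: representation Y determines answer X exactly.
deterministic = [[0.5, 0.0], [0.0, 0.5]]

print(mutual_information(independent))
print(mutual_information(deterministic))

# With 16 possible answers, more mutual information lowers the error floor.
for mi in (0.0, 1.0, 2.0, 4.0):
    print(mi, fano_error_lower_bound(mi, 16))
```

The more bits a representation carries about the answer, the lower the unavoidable error floor. That is the general shape of the argument the summary attributes to the study.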
New finding! 3.6 bits per parameter: that is the most a language model can memorize
机器之心· 2025-06-04 04:41
Core Insights
- The memorization capacity of GPT-series models is approximately 3.6 bits per parameter; beyond this limit, models stop memorizing and begin to generalize [1][4][27].

Group 1: Memory and Generalization
- The research distinguishes between two components: unintended memorization (information about a specific dataset) and generalization (information about the true data-generating process) [5][7].
- A new method is proposed to estimate how much a model knows about a specific data point, which makes it possible to measure the capacity of modern language models [2][8].

Group 2: Model Capacity and Measurement
- The study defines model capacity as the total amount of memorized information that can be stored across all parameters of a given language model [17][18].
- Capacity is reached when the amount memorized stops growing as the dataset gets larger, which indicates saturation [19][28].
- Experiments show that capacity scales with the number of parameters, with a stable 3.5 to 3.6 bits per parameter observed [27][28].

Group 3: Experimental Findings
- The researchers trained hundreds of transformer language models ranging from 500,000 to 1.5 billion parameters, yielding scaling laws that relate model capacity and data size [6][25].
- Across different dataset sizes, the number of memorized bits remained consistent, reinforcing the relationship between model capacity and parameter count [28][29].
- The effect of numerical precision was also analyzed: going from bfloat16 to float32 slightly increased capacity, with average values rising from 3.51 bits/parameter to 3.83 bits/parameter [31][32].
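As a rough back-of-the-envelope illustration of the per-parameter figure, the Python sketch below converts a parameter count into an implied total capacity. It takes the approximately 3.6 bits per parameter reported above at face value and assumes capacity scales linearly with parameter count. The function names are hypothetical, and the two model sizes are simply the endpoints of the range mentioned in the summary.

```python
BITS_PER_PARAM = 3.6  # approximate figure reported in the summary above


def capacity_bits(num_params, bits_per_param=BITS_PER_PARAM):
    """Implied total memorization capacity in bits (assumes linear scaling)."""
    return num_params * bits_per_param


def capacity_megabytes(num_params, bits_per_param=BITS_PER_PARAM):
    """Same capacity expressed in megabytes (8 bits per byte, 1e6 bytes per MB)."""
    return capacity_bits(num_params, bits_per_param) / 8 / 1e6


# Smallest and largest model sizes mentioned in the summary.
for n in (500_000, 1_500_000_000):
    print(f"{n:>13,} params -> {capacity_bits(n):,.0f} bits "
          f"(~{capacity_megabytes(n):,.3f} MB)")
```

Under these assumptions, a 1.5-billion-parameter model would have a capacity of roughly 5.4 billion bits, or about 675 MB. That is small next to typical training corpora, which is consistent with the claim that models must generalize once capacity is saturated.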
When answers become cheap, good questions are the new scarcity
36Kr· 2025-05-04 00:03
Group 1
- The core argument of the article is that in an era where answers are easily accessible, the value lies in asking the right questions, which can reshape understanding and drive creativity [1][4][19].
- The invention of photography in the 1830s challenged traditional artistic standards, leading artists to focus on subjective experience rather than mere replication of reality [3][10][11].
- The emergence of large language models (LLMs) has made obtaining answers cheaper, but this has led to a decline in the quality of inquiry and an increase in the cost of asking good questions [15][17][26].

Group 2
- The article emphasizes that the value of information is proportional to the uncertainty it eliminates, as illustrated by Claude Shannon's information theory [21][22][23].
- It argues that in a world of information overload, the challenge is not a lack of facts but a misalignment of attention, leading to a focus on quantity over quality in answers [31][32][46].
- The piece highlights the importance of redefining problems and frameworks to navigate structural uncertainty effectively, suggesting that good questions can expand the boundaries of understanding [37][38][39].
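The Shannon point above can be made concrete with a small Python sketch. It computes Shannon entropy in bits and measures how much uncertainty a question removes. The scenario, eight equally likely possibilities and a yes/no question that rules out half of them, is an invented toy example, and the function name is hypothetical.

```python
import math


def entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2(p)) in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Before asking: eight equally likely possibilities.
before = entropy_bits([1 / 8] * 8)

# A good yes/no question rules out half of them, leaving four equally likely.
after_good_question = entropy_bits([1 / 4] * 4)

# A poor question whose answer rules out nothing leaves the uncertainty unchanged.
after_poor_question = entropy_bits([1 / 8] * 8)

print("uncertainty before:", before, "bits")
print("removed by good question:", before - after_good_question, "bits")
print("removed by poor question:", before - after_poor_question, "bits")
```

In this toy setting the well-chosen question removes one bit of uncertainty and the poorly chosen one removes none. This is the sense in which the article ties the value of information to the uncertainty it eliminates.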