Information Theory
Lossless Compression Beyond ZIP Is Here: University of Washington Turns Large Models into Lossless Text Compressors
36Kr · 2025-10-11 10:47
Core Insights
- The article discusses the challenges of data storage arising from the generation of massive data by large language models (LLMs) and introduces an innovative solution called LLMc, which utilizes LLMs for lossless text compression [2][5].
Group 1: LLMc Overview
- LLMc has demonstrated superior compression rates compared to traditional compression tools like ZIP and LZMA across various datasets, including Wikipedia, novels, and scientific abstracts [2].
- The project has been open-sourced, with the main author being Yi Pan, an undergraduate from Shanghai Jiao Tong University currently interning at the University of Washington [4].
Group 2: Compression Mechanism
- The compression mechanism of LLMc is based on the principle of rank-based encoding, where the model predicts the next possible token and generates a probability distribution list [6].
- Instead of storing the token itself, LLMc stores the rank of the token in the probability list, which typically requires minimal storage space [6].
- During decompression, the same LLM and context are used to recreate the probability distribution, allowing for the accurate recovery of the original text using the stored rank [6].
Group 3: Challenges and Limitations
- The research team identified several challenges with the current version of LLMc, including efficiency issues due to the quadratic relationship between LLM inference complexity and sequence length [7].
- The processing speed of LLMc is currently much lower than traditional compression algorithms due to its heavy reliance on large model inference [7].
- To ensure deterministic decompression, the system requires special kernels and integer encoding of token ranks instead of using logarithmic probabilities [8].
- The current implementation is primarily focused on natural language, with future exploration needed for extending its application to other modalities like images, videos, or binary data [9].
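
The rank-based encoding described in the summary above can be illustrated with a minimal sketch. It is not the LLMc implementation: `toy_ranking` is a hypothetical, deterministic stand-in for the LLM's next-token distribution (it ranks tokens by how often they have appeared so far), and `compress`, `decompress`, and the four-symbol vocabulary are illustrative names chosen here. The point is only that the compressor and decompressor run the same predictor on the same context, so storing the rank is enough to recover the token.

```python
from collections import Counter

def toy_ranking(context, vocab):
    """Hypothetical stand-in for an LLM: order the vocabulary by how often each
    token occurred in the context, breaking ties by vocabulary order."""
    counts = Counter(context)
    return sorted(vocab, key=lambda tok: (-counts[tok], vocab.index(tok)))

def compress(tokens, vocab):
    """Replace each token by its rank in the predictor's ordering."""
    ranks = []
    for i, tok in enumerate(tokens):
        ranking = toy_ranking(tokens[:i], vocab)
        ranks.append(ranking.index(tok))
    return ranks

def decompress(ranks, vocab):
    """Rebuild the text by replaying the same predictor and picking by rank."""
    tokens = []
    for rank in ranks:
        ranking = toy_ranking(tokens, vocab)
        tokens.append(ranking[rank])
    return tokens

vocab = ["a", "b", "c", "d"]
text = list("aaabaaacaa")
ranks = compress(text, vocab)
restored = decompress(ranks, vocab)
print(ranks)             # [0, 0, 0, 1, 0, 0, 0, 2, 0, 0]
print(restored == text)  # True
```

When the predictor is good, most ranks are 0 or close to it, so the rank sequence is highly skewed and cheap to store. Using integer ranks rather than floating-point probabilities also matches the determinism concern raised in the summary: the round trip only works if both sides produce exactly the same ordering.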
Major Finding! LLMs' "Aha Moment" Is Not Just for Show: Internal Information Surges Several-Fold
机器之心· 2025-07-03 04:14
Core Insights
- The article discusses a groundbreaking study that reveals the reasoning dynamics of large language models (LLMs) through the lens of mutual information, identifying "thinking tokens" as critical indicators of information peaks during reasoning [3][4][24].
Group 1: Key Findings
- The study uncovers the phenomenon of "information peaks" in the reasoning trajectories of LLMs, indicating that the presence of thinking tokens correlates with a significant increase in the information related to the correct answer [3][4][5].
- Researchers demonstrated that higher accumulated mutual information during reasoning leads to a tighter bound on the probability of answering correctly, thus enhancing the model's performance [6][8].
- The research indicates that reasoning models exhibit more pronounced mutual information peaks compared to non-reasoning models, suggesting that enhanced training improves the encoding of relevant information [9][10].
Group 2: Thinking Tokens
- Thinking tokens, which include phrases like "Hmm," "Wait," and "Therefore," are identified as linguistic manifestations of information peaks, playing a crucial role in guiding the model's reasoning process [10][11][15].
- Experimental results show that suppressing the generation of thinking tokens significantly impacts the model's performance on mathematical reasoning datasets, confirming their importance in effective reasoning [16][25].
Group 3: Applications
- Two novel methods are proposed to enhance LLM reasoning performance: Representation Recycling (RR) and Thinking Token based Test-time Scaling (TTTS), both of which leverage the insights gained from the study [18][26].
- The RR method involves re-inputting representations associated with thinking tokens for additional computation, leading to improved performance on various reasoning benchmarks [20][26].
- The TTTS method encourages the model to generate thinking tokens when additional computation resources are available, resulting in sustained performance improvements across different datasets [21][22][26].
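
For readers unfamiliar with the quantity behind the "information peaks", the following sketch computes mutual information for a small discrete joint distribution. It shows only the textbook definition, I(X;Y) = Σ p(x,y) log2[p(x,y) / (p(x)p(y))]. The study measures information between a model's internal representations and the correct answer, which requires an estimator that is not reproduced here; the function name and the two toy tables are assumptions made for illustration.

```python
import math

def mutual_information(joint):
    """Mutual information in bits for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]        # marginal over x
    py = [sum(col) for col in zip(*joint)]  # marginal over y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # zero-probability cells contribute nothing
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi

# X fully determines Y: knowing X removes one full bit of uncertainty about Y.
dependent = [[0.5, 0.0], [0.0, 0.5]]
# X and Y are independent: knowing X says nothing about Y.
independent = [[0.25, 0.25], [0.25, 0.25]]

print(mutual_information(dependent))    # 1.0
print(mutual_information(independent))  # 0.0
```

In this toy picture, an information peak corresponds to a step at which the first kind of table describes the relation between the model's state and the answer far better than the second.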
Latest Finding! 3.6 Bits per Parameter: That's the Most a Language Model Can Memorize
机器之心· 2025-06-04 04:41
Core Insights
- The memorization capacity of GPT-series models is approximately 3.6 bits per parameter, a limit beyond which models stop memorizing and begin to generalize [1][4][27].
Group 1: Memorization and Generalization
- The research separates memorization into two components: unintended memorization (information about a specific dataset) and generalization (information about the true data-generating process) [5][7].
- A new method is proposed to estimate how much a model knows about a specific data point, which makes it possible to measure the capacity of modern language models [2][8].
Group 2: Model Capacity and Measurement
- The study defines model capacity as the total amount of memorized information that can be stored across all parameters of a given language model [17][18].
- Capacity is reached when the amount a model memorizes stops growing as the dataset gets larger, indicating saturation [19][28].
- Experiments showed that capacity scales with the number of parameters, with a stable 3.5 to 3.6 bits per parameter observed [27][28].
Group 3: Experimental Findings
- The research involved training hundreds of transformer language models with parameter counts ranging from 500,000 to 1.5 billion, leading to insights on scaling laws relating model capacity and data size [6][25].
- Results indicated that the number of memorized bits remained consistent across different dataset sizes, reinforcing the relationship between model capacity and parameter count [28][29].
- The impact of precision on capacity was analyzed: increasing precision from bfloat16 to float32 slightly improved capacity, with average values rising from 3.51 bits/parameter to 3.83 bits/parameter [31][32].
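
To make the per-parameter figure concrete, here is a back-of-the-envelope sketch that converts it into total capacity for the smallest and largest model sizes mentioned in the summary. It assumes capacity is simply the parameter count multiplied by the reported bits per parameter; the function name and the choice of decimal units (kB, MB) are illustrative, not taken from the study.

```python
BITS_PER_PARAM = 3.6  # figure reported in the article for GPT-series models

def capacity_bits(num_params, bits_per_param=BITS_PER_PARAM):
    """Estimated memorization capacity in bits (assumes a simple linear relation)."""
    return num_params * bits_per_param

small = capacity_bits(500_000)        # smallest model size in the summary
large = capacity_bits(1_500_000_000)  # largest model size in the summary

# Convert bits to bytes, then to decimal kB / MB for readability.
print(f"{small / 8 / 1e3:.0f} kB, {large / 8 / 1e6:.0f} MB")  # 225 kB, 675 MB
```

Under this assumption, a 1.5-billion-parameter model could memorize on the order of 675 MB of information. Passing `bits_per_param=3.51` or `bits_per_param=3.83` reproduces the same estimate for the bfloat16 and float32 averages quoted in the summary.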
When Answers Become Cheap, Good Questions Are the New Scarce Commodity
36Kr · 2025-05-04 00:03
Group 1
- The core argument of the article is that in an era where answers are easily accessible, the value lies in asking the right questions, which can reshape understanding and drive creativity [1][4][19]
- The invention of photography in the 1830s challenged traditional artistic standards, leading artists to focus on subjective experiences rather than mere replication of reality [3][10][11]
- The emergence of large language models (LLMs) has made obtaining answers cheaper, but this has led to a decline in the quality of inquiry and an increase in the cost of asking good questions [15][17][26]
Group 2
- The article emphasizes that the value of information is proportional to the uncertainty it eliminates, as illustrated by Claude Shannon's information theory [21][22][23]
- It argues that in a world of information overload, the challenge is not the lack of facts but the misalignment of attention, leading to a focus on quantity over quality in answers [31][32][46]
- The piece highlights the importance of redefining problems and frameworks to navigate structural uncertainties effectively, suggesting that good questions can expand the boundaries of understanding [37][38][39]
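
The Shannon point in the summary, that information is worth as much as the uncertainty it removes, can be shown with a short worked example. The sketch uses the standard entropy formula H = -Σ p log2 p. The scenario (eight equally likely possibilities narrowed to two) and the function name are assumptions chosen for illustration and do not come from the article.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Before asking: eight equally likely possibilities.
before = entropy_bits([1 / 8] * 8)
# After the answer: only two possibilities remain, still equally likely.
after = entropy_bits([1 / 2, 1 / 2])
# Information delivered by the answer = uncertainty eliminated.
gained = before - after

print(before, after, gained)  # 3.0 1.0 2.0
```

An answer to a question whose outcome was already nearly certain would remove almost no entropy, which is one way to read the article's claim that cheap answers to poorly chosen questions carry little value.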