Qwen & Tsinghua team overturn conventional wisdom: RL for large models using only 20% of key tokens beats training on all tokens
量子位· 2025-06-05 10:28
Core Insights
- The article covers a recent result from the LeapLab team at Tsinghua University (with Qwen), showing that training on only the top 20% of tokens by entropy is enough to drive reinforcement-learning training of large models, and even outperforms training on all tokens [1][6].

Group 1: Research Findings
- The team set new state-of-the-art (SOTA) records with the Qwen3-32B model, scoring 63.5 on AIME'24 and 56.7 on AIME'25, the highest scores for models under 600 billion parameters trained directly from a base model [2].
- Extending the maximum response length from 20k to 29k pushed the AIME'24 score to 68.1 [4].
- The result challenges the classic Pareto principle: in large-model reinforcement learning, the 80% of low-entropy tokens can be discarded without harm, and keeping them may even hurt performance [5][6].

Group 2: Token Analysis
- The study finds a distinctive entropy distribution during chain-of-thought reasoning: over 50% of tokens have an entropy below 0.01, while only 20% exceed 0.672 (a sketch of how per-token entropy can be measured follows this summary) [9][10].
- High-entropy tokens act as "logical connectors" in reasoning, while low-entropy tokens are mostly deterministic completions such as word suffixes or parts of mathematical expressions [11].
- In a decoding experiment, raising the sampling temperature at high-entropy tokens improved reasoning performance, while lowering it hurt performance, underscoring the importance of keeping entropy high at these critical positions (see the second sketch below) [13].

Group 3: Training Methodology
- Restricting the reinforcement-learning gradient to the top 20% of high-entropy tokens gave Qwen3-32B large gains: +7.71 points on AIME'24 and +11.04 points on AIME'25, with average response length growing by roughly 1378 tokens (the third sketch below illustrates the masked loss) [15][17].
- Similar gains were observed on Qwen3-14B, while Qwen3-8B remained roughly stable [16].
- Conversely, training only on the 80% of low-entropy tokens caused a sharp drop in performance, indicating that they contribute little to reasoning ability [18].

Group 4: Implications and Generalization
- High-entropy tokens appear to let the model explore different reasoning paths, whereas low-entropy tokens, being largely deterministic, may restrict that exploration [20].
- The advantage of training on high-entropy tokens grows with model size, with the 32B model showing the largest improvement [22].
- Models trained on high-entropy tokens also performed well on out-of-domain tasks, suggesting a link between high-entropy tokens and generalization [22].

Group 5: Reinforcement Learning Insights
- Reinforcement learning with verifiable rewards (RLVR) does not overhaul the base model so much as fine-tune it: even after extensive training, 86.67% of high-entropy token positions overlap with those of the base model [24][25].
- Tokens with higher initial entropy see larger entropy increases during RLVR training, while low-entropy tokens remain largely unchanged [25].
- The article argues that high-entropy tokens may explain why reinforcement learning generalizes better than supervised fine-tuning, which tends toward memorization and overfitting [26][27].
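To make the entropy statistics in Group 2 concrete, here is a minimal sketch, assuming a PyTorch / Hugging Face-style causal language model, of how per-token entropy can be computed over generated responses and how a top-20% threshold can be derived. The function names and the `model(input_ids).logits` interface are illustrative assumptions, not the team's released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def token_entropies(model, input_ids: torch.Tensor) -> torch.Tensor:
    """Entropy of the model's next-token distribution at every generated position.

    input_ids: (batch, seq_len) token ids of already-generated responses.
    Returns:   (batch, seq_len - 1) entropies, one per generated token (in nats).
    """
    logits = model(input_ids).logits[:, :-1, :]   # distribution over token t+1 given prefix up to t
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)        # H = -sum p * log p

def high_entropy_threshold(entropies: torch.Tensor, top_ratio: float = 0.2) -> float:
    """Entropy value above which a token falls into the top `top_ratio` fraction."""
    return torch.quantile(entropies.flatten().float(), 1.0 - top_ratio).item()
```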
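The temperature experiment in Group 2 can be illustrated with a small decoding sketch: sample the next token with a higher temperature only when the predictive distribution is high-entropy. The threshold and temperature values below are placeholders for illustration, not the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor,
                      entropy_threshold: float = 0.672,
                      high_temp: float = 1.2,
                      low_temp: float = 1.0) -> torch.Tensor:
    """Sample one token per batch row, using a higher temperature at high-entropy positions.

    logits: (batch, vocab) next-token logits from the model.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)         # (batch,)
    temp = torch.where(entropy > entropy_threshold,
                       torch.full_like(entropy, high_temp),
                       torch.full_like(entropy, low_temp))
    scaled = logits / temp.unsqueeze(-1)                          # per-row temperature scaling
    return torch.multinomial(F.softmax(scaled, dim=-1), num_samples=1)
```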
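For the training methodology in Group 3, here is a hedged sketch of restricting a clipped, token-level policy-gradient objective to the top 20% of response tokens by entropy, so that only high-entropy tokens receive gradient. This is an illustrative reconstruction under PPO-style assumptions; tensor shapes and argument names are assumptions, not the team's actual implementation.

```python
import torch

def high_entropy_pg_loss(log_probs: torch.Tensor,      # (batch, seq) log-prob of sampled tokens under current policy
                         old_log_probs: torch.Tensor,  # (batch, seq) log-probs under the behaviour policy
                         advantages: torch.Tensor,     # (batch, seq) per-token advantage estimates
                         entropies: torch.Tensor,      # (batch, seq) per-token predictive entropy
                         response_mask: torch.Tensor,  # (batch, seq) 1 for response tokens, 0 for prompt/padding
                         top_ratio: float = 0.2,
                         clip_eps: float = 0.2) -> torch.Tensor:
    # Keep only the top `top_ratio` fraction of response tokens by entropy.
    threshold = torch.quantile(entropies[response_mask.bool()].float(), 1.0 - top_ratio)
    token_mask = response_mask * (entropies > threshold).float()

    # Standard clipped surrogate objective, computed per token.
    ratio = (log_probs - old_log_probs).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token_loss = -torch.minimum(unclipped, clipped)

    # Average the loss over the selected high-entropy tokens only.
    return (per_token_loss * token_mask).sum() / token_mask.sum().clamp(min=1.0)
```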
梦晨, from 凹非寺
量子位 | official account QbitAI

Among the hottest recent papers on arXiv is the latest result from the Qwen & Tsinghua LeapLab team: when training a large model's reasoning ability with reinforcement learning, just the 20% of tokens with high entropy are enough to carry the entire training effect, and they even outperform training on all tokens.

Using this finding, the team set a new SOTA record on Qwen3-32B: 63.5 on AIME'24 and 56.7 on AIME'25, the highest scores achieved by a model under 600B parameters trained directly from a base model.

After extending the maximum response length from 20k to 29k, the AIME'24 score climbed further to 68.1.

Decoding the entropy distribution of Chain-of-Thought

To understand this work, start with an interesting observation. The team found that when a large model performs Chain-of-Thought reasoning, the entropy of its tokens follows a distinctive pattern: most tokens have very low entropy, and only a small minority exhibit high entropy.

Concretely, more than 50% of tokens have an entropy below 0.01, while only 20% of tokens have an entropy above 0.672.

The classic 80/20 rule (the Pareto principle) says that 80% of outcomes are usually driven by 20% of the key factors, but the remaining 80% still cannot simply be thrown away.

In large-model reinforcement learning, however, the 80 ...