ICLR 2026 | Tsinghua Proposes a Cross-Entropy Decomposition: "Error-Entropy" Is the Real Driver of the LLM Scaling Law
机器之心·2026-03-21 05:04

Core Insights
- The article discusses the failure of the cross-entropy scaling law in large models, revealing that only a hidden component, error-entropy, truly scales with model size, while self-alignment and confidence do not [2][6][25].

Group 1: Scaling Law and Cross-Entropy
- The scaling law has been a guiding principle in the development of large language models, with a consensus that cross-entropy loss decreases predictably as model parameters increase [2].
- Recent findings indicate that the cross-entropy scaling law fails for ultra-large models, as the loss reduction deviates from the expected power-law prediction [2][25].
- A new decomposition method for cross-entropy has been proposed, breaking it down into three components: error-entropy, self-alignment, and confidence [3][6] (a hedged worked form of this decomposition is sketched after this summary).

Group 2: Components of Cross-Entropy
- Error-entropy is the only component that strictly follows power-law scaling, while self-alignment and confidence show little to no change with increasing model size [3][21].
- The study introduces a rank-based error (RBE) metric, which measures the rank of the correct token in the model's output ordering and provides a more robust indicator of model performance than probability scores [6][8] (a minimal code sketch is given at the end of this summary).
- The training dynamics reveal that error-entropy decreases first, followed by improvements in self-alignment and confidence, indicating a clear optimization sequence during training [10][11].

Group 3: Implications of Findings
- The research suggests that as model size increases, error-entropy's share of the total loss shrinks while the contribution of the non-scaling components grows, which explains the observed failure of the scaling law [25].
- This finding implies that using error-entropy as a training signal or evaluation metric may more accurately reflect improvements in model capability, guiding more efficient training strategies [27].
- The study emphasizes that growth in model size primarily enhances ranking ability rather than probability calibration, offering new insight into how large models should be optimized [27].
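Putting these claims together, the failure mechanism can be written in a simple hedged form. The additive split and the symbols below are illustrative shorthand for this summary, not the paper's exact parameterization:

$$
\mathcal{L}_{\text{CE}}(N) \;=\; \mathcal{L}_{\text{err}}(N) + \mathcal{L}_{\text{align}}(N) + \mathcal{L}_{\text{conf}}(N),
\qquad
\mathcal{L}_{\text{err}}(N) \approx a\,N^{-b},
\qquad
\mathcal{L}_{\text{align}}(N) + \mathcal{L}_{\text{conf}}(N) \approx c,
$$

where $N$ is the parameter count. Under this reading, the power-law term keeps shrinking as $N$ grows while the roughly constant residual $c$ comes to dominate the total loss, so a single power law fitted at smaller scales overpredicts how much ultra-large models can still reduce cross-entropy.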

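The article describes the rank-based error (RBE) metric only at a high level: it looks at where the correct token ranks in the model's output ordering rather than at its probability. The sketch below is a hypothetical illustration of such a rank statistic in PyTorch; the function name, tensor shapes, and averaging scheme are assumptions, not the paper's definition.

```python
# Hypothetical sketch of a rank-based error (RBE)-style statistic.
# Assumption: RBE is the average rank of the correct token in the
# model's output ordering; the paper's exact formula may differ.
import torch

def rank_based_error(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """
    logits:  (batch, seq_len, vocab) unnormalized model scores
    targets: (batch, seq_len) ids of the correct next tokens
    Returns the mean rank of the correct token (0 = ranked first).
    """
    # Score the model assigns to the correct token at each position.
    target_scores = logits.gather(-1, targets.unsqueeze(-1))  # (batch, seq_len, 1)
    # Rank = number of vocabulary entries scored strictly higher.
    ranks = (logits > target_scores).sum(dim=-1)              # (batch, seq_len)
    return ranks.float().mean()

if __name__ == "__main__":
    logits = torch.randn(2, 5, 100)            # toy batch: 2 sequences, 5 steps, vocab of 100
    targets = torch.randint(0, 100, (2, 5))
    print(rank_based_error(logits, targets))
```

A rank statistic of this kind is insensitive to how much probability mass sits on the correct token, which matches the article's point that scaling mainly improves ranking ability rather than probability calibration.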