AdamW
Stanford: A "battle of the gods" among optimizers? AdamW wins on "stability"
36Kr · 2025-09-07 23:36
Since its introduction in 2014, Adam and its refinement AdamW have dominated the pre-training of open-weight language models, helping models stay stable and converge quickly on massive datasets. As model scale has grown rapidly, pre-training has become the archetypal compute-intensive workload, often the single largest computational cost in large-model development. In this setting, optimizer design bears directly on convergence speed and compute cost. Researchers have explored many directions for improvement; the fastest optimizers tend to use matrix-based preconditioners (e.g., Muon, Soap, Kron), which deliver roughly a 30-40% iteration-level speedup over a rigorously tuned AdamW.

Research from Percy Liang's team at Stanford argues that, despite the many alternatives claiming substantial speedups (1.4x to 2x), AdamW remains the robust first choice for pre-training, although matrix-based methods do show clear advantages at particular data-to-model ratios. The researchers suggest this may stem from two key methodological flaws.

Flaw 1: unfair hyperparameter tuning. Baselines are usually under-tuned: in a commonly used AdamW baseline, tuning just the learning rate yields a 2x speedup on a 130-million-parameter model. Nor does fixing shared hyperparameters guarantee a fair comparison: for example, compared with the standard weight-decay value of 0.1, the Lion optimizer prefers a higher weight ...
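To make the under-tuned-baseline point concrete, here is a minimal, self-contained sketch of a learning-rate-only sweep for an AdamW baseline. The toy regression task, model, step count, and candidate learning rates are illustrative assumptions, not the study's 130M-parameter setup; the only point carried over is the protocol of reading off each candidate's loss after the full schedule before comparing anything.

```python
# Minimal sketch: sweep only AdamW's learning rate on a toy task and keep the
# best end-of-schedule loss as the baseline. Task, model, step count, and
# candidate learning rates are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(2048, 32)
y = X @ torch.randn(32, 1) + 0.1 * torch.randn(2048, 1)

def final_loss(lr, weight_decay=0.1, steps=200, batch=256):
    model = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 1))
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=steps)
    for _ in range(steps):
        idx = torch.randint(0, X.size(0), (batch,))
        loss = nn.functional.mse_loss(model(X[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
    # Read the loss only after the learning rate has fully decayed.
    return nn.functional.mse_loss(model(X), y).item()

results = {lr: final_loss(lr) for lr in (1e-4, 3e-4, 1e-3, 3e-3)}
best_lr = min(results, key=results.get)
print(f"best lr = {best_lr}, final loss = {results[best_lr]:.4f}")
```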
Stanford: A "battle of the gods" among optimizers? AdamW wins on "stability"
机器之心 · 2025-09-07 05:12
Core Insights
- The article discusses the dominance of Adam and its improved version AdamW in the pre-training of open-weight language models since 2014, emphasizing their stability and rapid convergence on large datasets [1]
- It highlights the significance of optimizer design for convergence speed and computational cost as model sizes increase, with matrix-based optimizers showing a 30-40% iteration-level acceleration over well-tuned AdamW [1][15]
- The research identifies two methodological flaws that may lead to underestimating baseline optimizers like AdamW: unfair hyperparameter tuning and insufficient testing scale [3][7]

Summary by Sections

Optimizer Performance
- Matrix-based optimizers (e.g., Muon, Soap, Kron) outperform scalar-based optimizers (e.g., AdamW, Nesterov AdamW, Mars) in delivering consistent acceleration across various data-to-model ratios [9][15]
- The advantage shrinks as model size increases, with some optimizers showing only a 1.1x acceleration over AdamW at 1.2 billion parameters [9][25]

Hyperparameter Tuning
- Proper hyperparameter tuning is crucial: adjusting even a single parameter such as the learning rate can yield large gains, e.g., a 2x speedup on a 130-million-parameter model [6][18]
- Fixed shared hyperparameters do not ensure fair comparisons between optimizers, as preferred values such as weight decay can differ significantly [4][15]

Testing Methodology
- The research emphasizes rigorous, independent hyperparameter tuning for each optimizer to ensure fair comparisons, since blindly transferring hyperparameters can be misleading (a minimal protocol sketch follows this summary) [15][18]
- Short-horizon evaluations can also mislead, as performance rankings may reverse during training once the learning rate decays [15][20]

Case Studies and Findings
- Case studies on larger models confirm that the predicted optimal configurations align closely with actual performance, validating the scaling laws used [23]
- At extreme data-to-model ratios (e.g., 16x Chinchilla), optimizers like Soap and Kron outperform Muon, indicating their effectiveness in data-rich regimes [26]
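The tuning and evaluation points above amount to a small protocol: give every optimizer its own hyperparameter grid, run each configuration to the end of its decay schedule, and only then rank them. The sketch below illustrates that protocol on a toy task; the grids are made up, and SGD with momentum stands in for the matrix-based contenders (Muon, Soap, Kron, Lion are not shipped with core PyTorch), so treat it as a shape of the procedure rather than the study's setup.

```python
# Hedged sketch of a fair-comparison protocol: per-optimizer hyperparameter
# grids, ranked by end-of-schedule loss only. Grids, task, and the
# SGD-with-momentum stand-in are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(2048, 32)
y = X @ torch.randn(32, 1) + 0.1 * torch.randn(2048, 1)

def run_to_completion(make_opt, steps=200, batch=256):
    model = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 1))
    opt = make_opt(model.parameters())
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=steps)
    for _ in range(steps):
        idx = torch.randint(0, X.size(0), (batch,))
        loss = nn.functional.mse_loss(model(X[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
    # Compare only after the full schedule: mid-run rankings can flip,
    # per the "Testing Methodology" bullets above.
    return nn.functional.mse_loss(model(X), y).item()

SEARCH_SPACES = {
    # Each optimizer gets its own grid; sharing one weight-decay value would
    # bias the comparison (e.g. Lion reportedly prefers more decay than 0.1).
    "adamw": (lambda p, lr, wd: torch.optim.AdamW(p, lr=lr, weight_decay=wd),
              [(lr, wd) for lr in (3e-4, 1e-3, 3e-3) for wd in (0.1,)]),
    "sgd_momentum": (lambda p, lr, wd: torch.optim.SGD(p, lr=lr, momentum=0.9, weight_decay=wd),
                     [(lr, wd) for lr in (1e-2, 3e-2, 1e-1) for wd in (0.0, 1e-4)]),
}

for name, (make, grid) in SEARCH_SPACES.items():
    scored = [(run_to_completion(lambda p, make=make, lr=lr, wd=wd: make(p, lr, wd)), (lr, wd))
              for lr, wd in grid]
    best_loss, (best_lr, best_wd) = min(scored, key=lambda s: s[0])
    print(f"{name}: best lr={best_lr}, wd={best_wd}, final loss={best_loss:.4f}")
```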
Grinding out one blog post landed an OpenAI offer; Muon's author bluntly charges that nearly all optimizer papers are "fake"
36Kr · 2025-06-16 12:46
Core Insights
- A blog post by researcher Keller Jordan, "Muon: An optimizer for hidden layers in neural networks", led to his offer from OpenAI, and the techniques it describes may have been used in training GPT-5 [1][4][5]
- The post argues that publishing in top conferences does not equate to having real impact, challenging traditional academic norms [6][11]

Group 1: Blog Impact and Reception
- Keller Jordan's post gained attention for its practical results, outperforming the previously dominant optimizer AdamW [5][14]
- Yuchen Jin, a co-author, highlighted the misconception in academia that publishing at top-tier conferences is the ultimate goal, advocating for real-world impact instead [6][11]
- The blog's success illustrates a shift in the AI research landscape, where practical performance may outweigh formal academic credentials [22][24]

Group 2: Muon Optimizer Performance
- Muon achieved significant speed improvements, such as reducing the CIFAR-10 training benchmark from 3.3 to 2.6 A100-seconds [14]
- In the NanoGPT speedrun, Muon reached the target validation loss 1.35x faster and kept its advantage at larger parameter scales [14]
- When training a 1.5-billion-parameter transformer evaluated on HellaSwag, Muon reached GPT-2 XL-level performance in 10 hours, versus 13.3 hours with AdamW [14][20]

Group 3: Design and Methodology
- Muon's core recipe is to compute the update with SGD-momentum and then run a Newton-Schulz iteration to approximately orthogonalize the update matrix (see the sketch below) [20][22]
- This replaces the raw update matrix with a "semi-orthogonal matrix", which is what gives Muon its effectiveness [22]
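Since the Group 3 bullets compress Muon's mechanism into one sentence, a short sketch may help: accumulate an SGD-momentum buffer for each hidden-layer weight matrix, then push that buffer toward the nearest semi-orthogonal matrix with a few Newton-Schulz iterations before applying it. This is a minimal sketch using the textbook cubic Newton-Schulz step and made-up defaults (learning rate, momentum, iteration count); Keller Jordan's reference implementation uses a tuned quintic polynomial and other details not reproduced here.

```python
# Hedged sketch of a Muon-style update for a single 2D weight matrix:
# SGD-momentum accumulation followed by approximate orthogonalization of the
# update via the classic cubic Newton-Schulz iteration.
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the semi-orthogonal factor of a 2D tensor M."""
    X = M / (M.norm() + 1e-7)        # Frobenius normalization keeps singular values in (0, sqrt(3))
    transposed = X.shape[0] > X.shape[1]
    if transposed:                   # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # cubic Newton-Schulz step: singular values -> 1
    return X.T if transposed else X

@torch.no_grad()
def muon_like_step(weight: torch.Tensor, grad: torch.Tensor,
                   momentum_buf: torch.Tensor, lr: float = 0.02,
                   momentum: float = 0.95) -> None:
    """One in-place update; lr and momentum are illustrative defaults."""
    momentum_buf.mul_(momentum).add_(grad)                # SGD-momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)    # replace update with a semi-orthogonal matrix
    weight.add_(update, alpha=-lr)

# Toy usage on a random "hidden layer" weight matrix:
W = torch.randn(64, 128)
buf = torch.zeros_like(W)
g = torch.randn_like(W)
muon_like_step(W, g, buf)
```

The orthogonalization step is why the blog post frames Muon as an optimizer for hidden-layer weight matrices specifically: the Newton-Schulz pass only makes sense for 2D parameters, so embeddings, output heads, and 1D parameters are typically left to another optimizer.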