AdamW
Stanford: A "battle of the gods" among optimizers? AdamW wins on "stability"
36Kr · 2025-09-07 23:36
Since its introduction in 2014, Adam and its refinement AdamW have dominated the pre-training of open-weight language models, helping models stay stable and converge quickly on massive datasets. As model scale has grown rapidly, pre-training has become the archetypal compute-intensive workload, often the single largest computational cost in large-model development. In this setting, optimizer design bears directly on convergence speed and compute cost. Researchers have explored many directions for improvement; the fastest optimizers tend to use matrix-based preconditioners (e.g., Muon, Soap, Kron), which deliver roughly a 30-40% iteration-level speedup over a rigorously tuned AdamW.

Research from Percy Liang's team at Stanford argues that, despite the many alternatives claiming substantial speedups (1.4x to 2x), AdamW remains the robust first choice for pre-training, although matrix-based methods do show clear advantages at particular data-to-model ratios. The researchers suggest this may stem from two key methodological flaws.

Flaw 1: unfair hyperparameter tuning. Baselines are usually under-tuned: in a commonly used AdamW baseline, tuning just the learning rate yields a 2x speedup on a 130-million-parameter model. Nor does fixing shared hyperparameters guarantee a fair comparison: for example, compared with the standard weight-decay value of 0.1, the Lion optimizer prefers a higher weight ...
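To make the under-tuned-baseline point concrete, here is a minimal, self-contained sketch of a learning-rate-only sweep for an AdamW baseline. The toy regression task, model, step count, and candidate learning rates are illustrative assumptions, not the study's 130M-parameter setup; the only point carried over is the protocol of reading off each candidate's loss after the full schedule before comparing anything.

```python
# Minimal sketch: sweep only AdamW's learning rate on a toy task and keep the
# best end-of-schedule loss as the baseline. Task, model, step count, and
# candidate learning rates are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(2048, 32)
y = X @ torch.randn(32, 1) + 0.1 * torch.randn(2048, 1)

def final_loss(lr, weight_decay=0.1, steps=200, batch=256):
    model = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 1))
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=steps)
    for _ in range(steps):
        idx = torch.randint(0, X.size(0), (batch,))
        loss = nn.functional.mse_loss(model(X[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
    # Read the loss only after the learning rate has fully decayed.
    return nn.functional.mse_loss(model(X), y).item()

results = {lr: final_loss(lr) for lr in (1e-4, 3e-4, 1e-3, 3e-3)}
best_lr = min(results, key=results.get)
print(f"best lr = {best_lr}, final loss = {results[best_lr]:.4f}")
```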
Stanford: A "battle of the gods" among optimizers? AdamW wins on "stability"
机器之心 · 2025-09-07 05:12
Core Insights
- The article discusses the dominance of Adam and its improved version AdamW in the pre-training of open-weight language models since 2014, emphasizing their stability and rapid convergence on large datasets [1]
- It highlights the significance of optimizer design for convergence speed and computational cost as model sizes increase, with matrix-based optimizers showing a 30-40% iteration-level acceleration over well-tuned AdamW [1][15]
- The research identifies two methodological flaws that may lead to underestimating baseline optimizers like AdamW: unfair hyperparameter tuning and insufficient testing scale [3][7]

Summary by Sections

Optimizer Performance
- Matrix-based optimizers (e.g., Muon, Soap, Kron) outperform scalar-based optimizers (e.g., AdamW, Nesterov AdamW, Mars) in delivering consistent acceleration across various data-to-model ratios [9][15]
- The advantage shrinks as model size increases, with some optimizers showing only a 1.1x acceleration over AdamW at 1.2 billion parameters [9][25]

Hyperparameter Tuning
- Proper hyperparameter tuning is crucial: adjusting even a single parameter such as the learning rate can yield large gains, e.g., a 2x speedup on a 130-million-parameter model [6][18]
- Fixed shared hyperparameters do not ensure fair comparisons between optimizers, as preferred values such as weight decay can differ significantly [4][15]

Testing Methodology
- The research emphasizes rigorous, independent hyperparameter tuning for each optimizer to ensure fair comparisons, since blindly transferring hyperparameters can be misleading (a minimal protocol sketch follows this summary) [15][18]
- Short-horizon evaluations can also mislead, as performance rankings may reverse during training once the learning rate decays [15][20]

Case Studies and Findings
- Case studies on larger models confirm that the predicted optimal configurations align closely with actual performance, validating the scaling laws used [23]
- At extreme data-to-model ratios (e.g., 16x Chinchilla), optimizers like Soap and Kron outperform Muon, indicating their effectiveness in data-rich regimes [26]
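The tuning and evaluation points above amount to a small protocol: give every optimizer its own hyperparameter grid, run each configuration to the end of its decay schedule, and only then rank them. The sketch below illustrates that protocol on a toy task; the grids are made up, and SGD with momentum stands in for the matrix-based contenders (Muon, Soap, Kron, Lion are not shipped with core PyTorch), so treat it as a shape of the procedure rather than the study's setup.

```python
# Hedged sketch of a fair-comparison protocol: per-optimizer hyperparameter
# grids, ranked by end-of-schedule loss only. Grids, task, and the
# SGD-with-momentum stand-in are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(2048, 32)
y = X @ torch.randn(32, 1) + 0.1 * torch.randn(2048, 1)

def run_to_completion(make_opt, steps=200, batch=256):
    model = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 1))
    opt = make_opt(model.parameters())
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=steps)
    for _ in range(steps):
        idx = torch.randint(0, X.size(0), (batch,))
        loss = nn.functional.mse_loss(model(X[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
    # Compare only after the full schedule: mid-run rankings can flip,
    # per the "Testing Methodology" bullets above.
    return nn.functional.mse_loss(model(X), y).item()

SEARCH_SPACES = {
    # Each optimizer gets its own grid; sharing one weight-decay value would
    # bias the comparison (e.g. Lion reportedly prefers more decay than 0.1).
    "adamw": (lambda p, lr, wd: torch.optim.AdamW(p, lr=lr, weight_decay=wd),
              [(lr, wd) for lr in (3e-4, 1e-3, 3e-3) for wd in (0.1,)]),
    "sgd_momentum": (lambda p, lr, wd: torch.optim.SGD(p, lr=lr, momentum=0.9, weight_decay=wd),
                     [(lr, wd) for lr in (1e-2, 3e-2, 1e-1) for wd in (0.0, 1e-4)]),
}

for name, (make, grid) in SEARCH_SPACES.items():
    scored = [(run_to_completion(lambda p, make=make, lr=lr, wd=wd: make(p, lr, wd)), (lr, wd))
              for lr, wd in grid]
    best_loss, (best_lr, best_wd) = min(scored, key=lambda s: s[0])
    print(f"{name}: best lr={best_lr}, wd={best_wd}, final loss={best_loss:.4f}")
```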
Grinding out one blog post landed an OpenAI offer; Muon's author bluntly charges that nearly all optimizer papers are "fake"
36Kr · 2025-06-16 12:46
Core Insights
- A blog post by researcher Keller Jordan, "Muon: An optimizer for hidden layers in neural networks", led to his offer from OpenAI, and the techniques it describes may have been used in training GPT-5 [1][4][5]
- The post argues that publishing in top conferences does not equate to having real impact, challenging traditional academic norms [6][11]

Group 1: Blog Impact and Reception
- Keller Jordan's post gained attention for its practical results, outperforming the previously dominant optimizer AdamW [5][14]
- Yuchen Jin, a co-author, highlighted the misconception in academia that publishing at top-tier conferences is the ultimate goal, advocating for real-world impact instead [6][11]
- The blog's success illustrates a shift in the AI research landscape, where practical performance may outweigh formal academic credentials [22][24]

Group 2: Muon Optimizer Performance
- Muon achieved significant speed improvements, such as reducing the CIFAR-10 training benchmark from 3.3 to 2.6 A100-seconds [14]
- In the NanoGPT speedrun, Muon reached the target validation loss 1.35x faster and kept its advantage at larger parameter scales [14]
- When training a 1.5-billion-parameter transformer evaluated on HellaSwag, Muon reached GPT-2 XL-level performance in 10 hours, versus 13.3 hours with AdamW [14][20]

Group 3: Design and Methodology
- Muon's core recipe is to compute the update with SGD-momentum and then run a Newton-Schulz iteration to approximately orthogonalize the update matrix (see the sketch below) [20][22]
- This replaces the raw update matrix with a "semi-orthogonal matrix", which is what gives Muon its effectiveness [22]
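Since the Group 3 bullets compress Muon's mechanism into one sentence, a short sketch may help: accumulate an SGD-momentum buffer for each hidden-layer weight matrix, then push that buffer toward the nearest semi-orthogonal matrix with a few Newton-Schulz iterations before applying it. This is a minimal sketch using the textbook cubic Newton-Schulz step and made-up defaults (learning rate, momentum, iteration count); Keller Jordan's reference implementation uses a tuned quintic polynomial and other details not reproduced here.

```python
# Hedged sketch of a Muon-style update for a single 2D weight matrix:
# SGD-momentum accumulation followed by approximate orthogonalization of the
# update via the classic cubic Newton-Schulz iteration.
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the semi-orthogonal factor of a 2D tensor M."""
    X = M / (M.norm() + 1e-7)        # Frobenius normalization keeps singular values in (0, sqrt(3))
    transposed = X.shape[0] > X.shape[1]
    if transposed:                   # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # cubic Newton-Schulz step: singular values -> 1
    return X.T if transposed else X

@torch.no_grad()
def muon_like_step(weight: torch.Tensor, grad: torch.Tensor,
                   momentum_buf: torch.Tensor, lr: float = 0.02,
                   momentum: float = 0.95) -> None:
    """One in-place update; lr and momentum are illustrative defaults."""
    momentum_buf.mul_(momentum).add_(grad)                # SGD-momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)    # replace update with a semi-orthogonal matrix
    weight.add_(update, alpha=-lr)

# Toy usage on a random "hidden layer" weight matrix:
W = torch.randn(64, 128)
buf = torch.zeros_like(W)
g = torch.randn_like(W)
muon_like_step(W, g, buf)
```

The orthogonalization step is why the blog post frames Muon as an optimizer for hidden-layer weight matrices specifically: the Newton-Schulz pass only makes sense for 2D parameters, so embeddings, output heads, and 1D parameters are typically left to another optimizer.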