Workflow
Stanford: A "Battle of the Gods" Among Optimizers? AdamW Wins on "Stability"
36Kr · 2025-09-07 23:36

Core Insights
- The article discusses the dominance of Adam and its improved variant AdamW in the pre-training of open-weight language models since Adam's introduction in 2014, emphasizing their stability and rapid convergence on large datasets [1]
- As model sizes grow, pre-training has become a computationally intensive task, making optimizer design crucial for convergence speed and cost [1]
- Researchers have explored many alternatives, with matrix-based optimizers showing a 30-40% iteration-level speedup over a well-tuned AdamW baseline [1]
- Stanford's Percy Liang team finds that despite claims of 1.4x to 2x acceleration from alternative methods, AdamW remains a robust choice for pre-training, while matrix-based methods excel only under specific data-to-model ratios [1] (a minimal AdamW pre-training sketch follows these notes)

Optimizer Performance
- The study identifies two methodological flaws in prior comparisons: unequal hyperparameter tuning across optimizers and under-tuned baselines, both of which can substantially understate baseline performance [4][6]
- On a 130-million-parameter model, tuning the learning rate alone can yield up to a 2x speedup for the baseline [6] (see the per-optimizer sweep sketch below)
- Fixing shared hyperparameters does not ensure a fair comparison, because different optimizers can have vastly different optimal hyperparameters [4][6]

Research Methodology
- The research systematically compared eleven deep learning optimizers across model sizes from 100 million to 1.2 billion parameters and across several data-to-model ratios [11]
- The methodology was divided into three main phases, including comprehensive hyperparameter sweeps and sensitivity analysis of the key hyperparameters [15][20]

Findings on Hyperparameters
- Optimizers must be tuned independently, since optimal hyperparameter configurations do not transfer well between different optimizers [12]
- The best optimizer is context-dependent: Muon performs best at the standard Chinchilla data ratio, while Soap pulls ahead at data-to-model ratios above 8:1 [13] (see the data-ratio sketch below)

Case Studies and Results
- Case studies on larger runs confirmed that the optimal configurations predicted from smaller sweeps held at larger model sizes and data scales [24]
- While matrix-based optimizers such as Muon and Soap offer clear speed advantages, the benefit shrinks as models grow, with the speedup falling to about 1.1x for the largest models studied [26]
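To ground the discussion of AdamW as the stable default, here is a minimal sketch of an AdamW pre-training loop with the kind of warmup-plus-cosine schedule commonly used for language models. The tiny stand-in model, hyperparameter values, and step counts are illustrative assumptions, not the study's actual setup.

```python
# Minimal sketch: AdamW with decoupled weight decay (the "W") and a
# warmup + cosine learning-rate schedule. Values are illustrative only.
import math
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)  # stand-in for a transformer LM

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # peak learning rate (illustrative)
    betas=(0.9, 0.95),  # betas often used in LLM pre-training (illustrative)
    weight_decay=0.1,   # decoupled weight decay, applied outside the Adam update
)

total_steps, warmup_steps = 10_000, 500

def lr_scale(step: int) -> float:
    """Linear warmup, then cosine decay to 10% of the peak learning rate."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)

for step in range(total_steps):
    x = torch.randn(32, 1024)
    loss = model(x).pow(2).mean()  # dummy loss for the sketch
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```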
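The fairness point about shared versus independent hyperparameters can be made concrete with a per-optimizer learning-rate sweep: each candidate gets its own grid instead of inheriting the baseline's tuned value. The toy model, data, and grids below are hypothetical stand-ins; the paper's sweeps cover more hyperparameters and far larger models, and the second optimizer here is plain SGD rather than the matrix-based methods the article names.

```python
# Sketch of a per-optimizer learning-rate sweep: each optimizer is tuned
# over its own grid, since the best values often differ by an order of
# magnitude, which is why a shared fixed LR biases comparisons.
import torch
import torch.nn as nn

def train_loss(optimizer_factory, lr: float, steps: int = 200) -> float:
    """Train a toy regression model briefly and return the final loss."""
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = optimizer_factory(model.parameters(), lr)
    x, y = torch.randn(256, 32), torch.randn(256, 1)
    for _ in range(steps):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

candidates = {
    "AdamW": (lambda p, lr: torch.optim.AdamW(p, lr=lr, weight_decay=0.1),
              [1e-4, 3e-4, 1e-3, 3e-3]),
    "SGD+momentum": (lambda p, lr: torch.optim.SGD(p, lr=lr, momentum=0.9),
                     [1e-2, 3e-2, 1e-1, 3e-1]),
}

for name, (factory, grid) in candidates.items():
    best_lr, best_loss = min(
        ((lr, train_loss(factory, lr)) for lr in grid), key=lambda t: t[1]
    )
    print(f"{name}: best lr={best_lr:g}, final loss={best_loss:.4f}")
```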
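Finally, a back-of-the-envelope sketch of the data-to-model ratios in question. The Chinchilla heuristic of roughly 20 training tokens per parameter is a widely used rule of thumb; reading the article's "8:1" as eight times that budget is an assumption about its shorthand, not a quote from the paper.

```python
# Rough arithmetic for data-to-model ratios at the model sizes the article
# mentions. The 20 tokens/parameter figure is the common Chinchilla rule of
# thumb; the 8x multiple is an assumed reading of the article's "8:1".
CHINCHILLA_TOKENS_PER_PARAM = 20

def token_budget(params: float, ratio_multiple: float = 1.0) -> float:
    """Training tokens for a model at a multiple of the Chinchilla ratio."""
    return params * CHINCHILLA_TOKENS_PER_PARAM * ratio_multiple

for params in (130e6, 1.2e9):   # model sizes mentioned in the article
    for multiple in (1, 8):     # standard vs. heavily over-trained regime
        tokens = token_budget(params, multiple)
        print(f"{params/1e9:.2f}B params @ {multiple}x Chinchilla "
              f"-> {tokens/1e9:.1f}B tokens")
```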