Stanford: A "Battle of the Gods" Among Optimizers? AdamW Wins on "Stability"
36Kr · 2025-09-07 23:36
Core Insights
- The article discusses the dominance of Adam and its improved variant AdamW in the pre-training of open-weight language models since 2014, emphasizing their stability and rapid convergence on large datasets [1]
- As model sizes increase, pre-training has become a computationally intensive task, making optimizer design crucial for convergence speed and cost [1]
- Researchers have explored various improvements, with matrix-based optimizers showing a 30-40% iteration-level speedup over well-tuned AdamW [1]
- Stanford's Percy Liang team indicates that despite claims of significant acceleration (1.4 to 2 times) from alternative methods, AdamW remains a robust choice for pre-training, while matrix-based methods excel only under specific data-model ratios [1] (a minimal sketch of the standard AdamW update follows this summary)

Optimizer Performance
- The study identifies two methodological flaws, unfair hyperparameter tuning and insufficient tuning of baseline models, which can lead to significant underestimation of baseline performance [4][6]
- Proper hyperparameter tuning can yield up to a 2x acceleration on a 130-million-parameter model by adjusting the learning rate alone [6]
- Fixed shared hyperparameters do not ensure fair comparisons, as different optimizers may have vastly different optimal hyperparameters [4][6]

Research Methodology
- The research involved a systematic comparison of eleven deep learning optimizers across model sizes from 100 million to 1.2 billion parameters and a range of data-model ratios [11]
- The study used a rigorous methodology divided into three main phases, including comprehensive parameter scanning and sensitivity analysis of hyperparameters [15][20]

Findings on Hyperparameters
- The research emphasizes the importance of independent tuning, as optimal hyperparameter configurations do not transfer well between optimizers [12]
- The optimal choice of optimizer is context-dependent, with Muon performing best at the standard Chinchilla data ratio and Soap outperforming it at ratios above 8:1 [13]

Case Studies and Results
- The study conducted case studies at larger scales, confirming the effectiveness of the predicted optimal configurations across model sizes and data scales [24]
- Results showed that while matrix-based optimizers like Muon and Soap provide significant speed advantages, their edge diminishes as model size increases, with acceleration ratios dropping to about 1.1x for the largest models [26]
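For context on the baseline that all of these comparisons are measured against, below is a minimal NumPy sketch of the standard AdamW update (Adam with decoupled weight decay). It is a generic illustration of the well-known algorithm, not the paper's implementation, and the default hyperparameter values are illustrative rather than the settings used in the study.

```python
import numpy as np

def adamw_step(param, grad, m, v, t,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update on a single parameter array (illustrative defaults)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (momentum) EMA
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment EMA
    m_hat = m / (1 - beta1 ** t)              # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to the parameter directly, not via the gradient.
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v
```

The decoupling of weight decay from the adaptive gradient term is what distinguishes AdamW from Adam with plain L2 regularization (the "W" in the name).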
Stanford: A "Battle of the Gods" Among Optimizers? AdamW Wins on "Stability"
机器之心 · 2025-09-07 05:12
Core Insights
- The article discusses the dominance of Adam and its improved variant AdamW in the pre-training of open-weight language models since 2014, emphasizing their stability and rapid convergence on large datasets [1]
- It highlights the significance of optimizer design for convergence speed and computational cost as model sizes increase, with matrix-based optimizers showing a 30-40% iteration-level acceleration over well-tuned AdamW [1][15]
- The research identifies two methodological flaws that may lead to underestimating baseline optimizers such as AdamW: unfair hyperparameter tuning and insufficient testing scale [3][7]

Summary by Sections

Optimizer Performance
- Matrix-based optimizers (e.g., Muon, Soap, Kron) outperform scalar-based optimizers (e.g., AdamW, Nesterov AdamW, Mars), delivering consistent acceleration across various data-model ratios [9][15]
- This advantage shrinks as model size increases, with some optimizers showing only a 1.1x acceleration over AdamW at 1.2 billion parameters [9][25]

Hyperparameter Tuning
- Proper hyperparameter tuning is crucial: adjusting even a single parameter such as the learning rate can yield a 2x speedup on a 130-million-parameter model [6][18]
- Fixed shared hyperparameters do not ensure fair comparisons between optimizers, as preferences for values such as weight decay can vary significantly [4][15]

Testing Methodology
- The research emphasizes rigorous independent tuning of hyperparameters for each optimizer to ensure fair comparisons, as blindly transferring hyperparameters can produce misleading results (a sketch of such a per-optimizer sweep follows this summary) [15][18]
- Short-term evaluations can also mislead, since performance rankings may reverse over the course of training as the learning rate decays [15][20]

Case Studies and Findings
- The study includes case studies on larger models, confirming that the predicted optimal configurations align closely with actual performance and validating the effectiveness of their scaling laws [23]
- At extreme data-to-model ratios (e.g., 16x Chinchilla), Soap and Kron outperform Muon, indicating their effectiveness in data-rich regimes (a token-budget calculation for these ratios also follows below) [26]
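To make the independent-tuning point above concrete, here is a minimal sketch of a per-optimizer grid search. The optimizer names come from the article, but the search spaces and the `train_fn` callable are hypothetical placeholders standing in for full pretraining runs scored at the end of learning-rate decay (since, as noted above, mid-training rankings can reverse).

```python
import itertools

# Hypothetical search spaces; the paper's actual grids are not reproduced here.
SEARCH_SPACE = {
    "adamw": {"lr": [1e-4, 3e-4, 1e-3], "weight_decay": [0.0, 0.01, 0.1]},
    "muon":  {"lr": [1e-3, 3e-3, 1e-2], "weight_decay": [0.0, 0.01, 0.1]},
    "soap":  {"lr": [3e-4, 1e-3, 3e-3], "weight_decay": [0.0, 0.01, 0.1]},
}

def tune_independently(train_fn, search_space=SEARCH_SPACE):
    """Grid-search each optimizer's hyperparameters separately.

    train_fn(optimizer_name, config) -> final_loss is a placeholder for a
    complete training run evaluated after the learning rate has decayed.
    """
    best = {}
    for name, grid in search_space.items():
        keys, values = zip(*grid.items())
        configs = [dict(zip(keys, combo)) for combo in itertools.product(*values)]
        scored = [(train_fn(name, cfg), cfg) for cfg in configs]
        best[name] = min(scored, key=lambda pair: pair[0])
    return best  # {optimizer_name: (best_final_loss, best_config)}

# Dummy objective standing in for a real training run, just to show the call shape:
if __name__ == "__main__":
    print(tune_independently(lambda name, cfg: abs(cfg["lr"] - 1e-3)))
```

The point of the structure is simply that each optimizer gets its own argmin; sharing one configuration across all of them is what the study flags as an unfair comparison.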
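As a back-of-the-envelope reading of the data-to-model ratios quoted above, the snippet below converts multiples of the Chinchilla-optimal budget into token counts using the commonly cited rule of thumb of roughly 20 training tokens per parameter. Interpreting the "8:1" and "16x Chinchilla" figures in the summaries as multiples of that budget is our assumption, not a statement from the paper.

```python
def chinchilla_tokens(n_params, multiple=1.0, tokens_per_param=20):
    """Token budget at a given multiple of the Chinchilla-optimal ratio.

    Assumes the ~20-tokens-per-parameter rule of thumb; the paper may use a
    slightly different constant.
    """
    return n_params * tokens_per_param * multiple

# A 1.2B-parameter model at 1x, 8x, and 16x the Chinchilla-optimal budget:
for m in (1, 8, 16):
    print(f"{m:>2}x Chinchilla: {chinchilla_tokens(1.2e9, m):.2e} tokens")
```

Under this rule of thumb a 1.2B-parameter model sees roughly 24B tokens at 1x, 192B at 8x, and 384B at 16x, which is the heavily over-trained regime where the summaries report Soap and Kron pulling ahead of Muon.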