Stanford: A "Battle of the Gods" Among Optimizers? AdamW Wins on "Stability"
36Kr· 2025-09-07 23:36
Core Insights
- The article discusses how Adam, introduced in 2014, and its refinement AdamW have dominated pre-training of open-weight language models, owing to their stability and rapid convergence on large datasets [1]
- As model sizes grow, pre-training has become a computationally intensive undertaking, making optimizer design crucial for convergence speed and cost [1]
- Researchers have explored many alternatives, with matrix-based optimizers showing a 30-40% iteration-level speedup over a well-tuned AdamW baseline [1]
- Stanford's Percy Liang team finds that, despite claims of 1.4x to 2x acceleration from alternative methods, AdamW remains a robust default for pre-training, while matrix-based methods excel only under specific data-to-model ratios [1]

Optimizer Performance
- The study identifies two methodological flaws, unfair hyperparameter tuning and insufficient tuning of baseline models, which can lead to significant underestimation of baseline performance [4][6]
- Proper tuning matters: adjusting the learning rate alone can yield up to a 2x speedup on a 130-million-parameter model (a minimal learning-rate-sweep sketch follows this summary) [6]
- Fixed, shared hyperparameters do not ensure fair comparisons, because different optimizers can have very different optimal settings [4][6]

Research Methodology
- The research systematically compared eleven deep learning optimizers across model sizes from 100 million to 1.2 billion parameters and across a range of data-to-model ratios [11]
- The study followed a rigorous three-phase methodology, including comprehensive parameter sweeps and hyperparameter sensitivity analysis [15][20]

Findings on Hyperparameters
- Independent tuning for each optimizer is essential, as optimal hyperparameter configurations do not transfer well between optimizers [12]
- The best optimizer is context-dependent: Muon performs best at standard Chinchilla data ratios, while Soap pulls ahead at data-to-model ratios of roughly 8x Chinchilla and above [13]

Case Studies and Results
- Case studies on larger runs confirmed that the predicted optimal configurations hold at larger model and data scales [24]
- While matrix-based optimizers such as Muon and Soap offer clear speed advantages, the benefit shrinks as models grow, with the speedup falling to about 1.1x for the largest models tested [26]
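To make the tuning point concrete, here is a minimal sketch of an independent learning-rate sweep, assuming a toy PyTorch regression task; AdamW and SGD with momentum stand in for the eleven optimizers in the study, and the grid values, model, and data are illustrative placeholders rather than the paper's actual protocol.

```python
# Minimal sketch: each optimizer gets its own learning-rate sweep on the same toy
# task, and the best value is picked independently rather than shared. The model,
# data, and grid are illustrative assumptions, not the study's setup.
import torch
import torch.nn as nn

def train_once(optimizer_cls, lr, steps=200, seed=0):
    """Train a small MLP on random data and return the final training loss."""
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = optimizer_cls(model.parameters(), lr=lr)
    x, y = torch.randn(256, 32), torch.randn(256, 1)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# Independent sweep: the best lr for AdamW need not be the best lr for SGD.
grid = [3e-4, 1e-3, 3e-3, 1e-2, 3e-2]
optimizers = [
    ("AdamW", torch.optim.AdamW),
    ("SGD+momentum", lambda p, lr: torch.optim.SGD(p, lr=lr, momentum=0.9)),
]
for name, cls in optimizers:
    best_lr = min(grid, key=lambda lr: train_once(cls, lr))
    print(f"{name}: best lr on this toy task = {best_lr}")
```

The point of the sketch is only that the argmin differs per optimizer, which is why fixing one shared learning rate can understate a baseline like AdamW.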
Stanford: A "Battle of the Gods" Among Optimizers? AdamW Wins on "Stability"
机器之心· 2025-09-07 05:12
Core Insights
- The article discusses how Adam, introduced in 2014, and its refinement AdamW have dominated pre-training of open-weight language models, emphasizing their stability and rapid convergence on large datasets [1]
- It highlights the importance of optimizer design for convergence speed and compute cost as model sizes increase, with matrix-based optimizers showing a 30-40% iteration-level speedup over a well-tuned AdamW baseline [1][15]
- The research identifies two methodological flaws that can lead to underestimating baseline optimizers such as AdamW: unfair hyperparameter tuning and insufficient testing scale [3][7]

Summary by Sections

Optimizer Performance
- Matrix-based optimizers (e.g., Muon, Soap, Kron) deliver more consistent acceleration than scalar-based optimizers (e.g., AdamW, Nesterov AdamW, Mars) across data-to-model ratios [9][15]
- The advantage diminishes as model size increases, with some optimizers showing only a 1.1x speedup over AdamW at 1.2 billion parameters [9][25]

Hyperparameter Tuning
- Proper hyperparameter tuning is crucial; adjusting even a single parameter such as the learning rate can yield a 2x speedup on a 130-million-parameter model [6][18]
- Fixed, shared hyperparameters do not guarantee fair comparisons, since preferences for values such as weight decay vary significantly between optimizers [4][15]

Testing Methodology
- Each optimizer's hyperparameters must be tuned independently and rigorously to ensure fair comparisons; blindly transferring hyperparameters produces misleading results [15][18]
- Short-horizon evaluations can mislead, as performance rankings may reverse late in training once the learning rate decays [15][20]

Case Studies and Findings
- Case studies on larger models confirmed that predicted optimal configurations closely match actual performance, validating the study's scaling laws [23]
- At extreme data-to-model ratios (e.g., 16x Chinchilla), optimizers such as Soap and Kron outperform Muon, indicating their strength in data-rich regimes (a token-budget sketch follows this summary) [26]
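As a rough illustration of the data-to-model ratios discussed above, the sketch below converts the model sizes mentioned in the study into training-token budgets, assuming the commonly cited Chinchilla rule of thumb of about 20 tokens per parameter; the constant and the exact multiples are assumptions for illustration, not figures taken from the paper.

```python
# Minimal sketch of "Nx Chinchilla" data-to-model ratios. The 20 tokens/parameter
# constant is a widely cited rule of thumb, not an exact law, and the multiples
# below are the kind of sweep the article describes (1x, 8x, 16x).
CHINCHILLA_TOKENS_PER_PARAM = 20

def token_budget(n_params: int, chinchilla_multiple: float) -> int:
    """Training-token budget for a model at a given multiple of the Chinchilla ratio."""
    return int(n_params * CHINCHILLA_TOKENS_PER_PARAM * chinchilla_multiple)

for n_params in (130_000_000, 1_200_000_000):   # model sizes mentioned in the study
    for mult in (1, 8, 16):                      # data-to-model ratios swept
        tokens = token_budget(n_params, mult)
        print(f"{n_params/1e6:>6.0f}M params @ {mult:>2}x Chinchilla -> {tokens/1e9:.1f}B tokens")
```

Under these assumptions, "16x Chinchilla" for a 1.2B-parameter model is on the order of hundreds of billions of tokens, which is the data-rich regime where the article says Soap and Kron pull ahead of Muon.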
How Dividend Stocks like Coca-Cola Can Help You Rest Easy Amid Stock Market Unrest
The Motley Fool· 2025-04-15 08:55
Core Viewpoint
- Consumer staples companies such as Coca-Cola are considered safe-haven investments during economic downturns because demand for their products, which are often necessities or frequently purchased items, stays consistent [2][4]

Group 1: Coca-Cola
- Coca-Cola is recognized for its strong brand and carries a dividend yield of 2.9%, having raised its dividend for more than 50 years, which earns it the title of Dividend King [5]
- The stock currently looks somewhat expensive, with price-to-sales and price-to-earnings ratios above their five-year averages (the arithmetic behind these figures is sketched after this summary) [5]

Group 2: PepsiCo
- PepsiCo, also a Dividend King, offers a diversified portfolio that includes snacks and packaged foods, along with a higher dividend yield of 3.7% [6]
- Its valuation is attractive, with both price-to-sales and price-to-earnings ratios below their five-year averages, and the company continues to invest in growth through acquisitions [6]

Group 3: Unilever
- Unilever is a more adventurous option, with a portfolio spanning consumer products and food; it generates around 40% of revenue from North America and Europe, with the remainder coming from faster-growing markets in Latin America and Asia [7]
- The company offers a dividend yield of 3.1%, making it appealing for investors who also want growth [7]

Group 4: Tobacco Companies
- Altria and British American Tobacco are high-yield options, with dividend yields of 7.2% and 7.5% respectively, despite long-term volume declines in cigarette sales [8][9]
- These companies have proven resilient during uncertain times, as smokers tend to remain loyal and may even increase consumption under economic stress [8]

Group 5: Overall Consumer Staples Sector
- The consumer staples sector offers a range of investments that can provide stability and reliable dividends through market volatility [10][11]
- Coca-Cola, PepsiCo, Unilever, Altria, and British American Tobacco stand out as solid choices for investors worried about market conditions [11]
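For readers who want to check the kinds of figures cited above, here is a minimal sketch of the two calculations the article leans on, dividend yield and a valuation ratio compared with its own five-year average; the prices, dividends, and ratios used are placeholder values, not current market data.

```python
# Minimal sketch of dividend yield and a "vs. five-year average" valuation check.
# All numbers are illustrative placeholders, not quotes for any of the stocks above.

def dividend_yield(annual_dividend_per_share: float, price: float) -> float:
    """Yield = annual dividend per share divided by the share price."""
    return annual_dividend_per_share / price

def vs_five_year_average(current_ratio: float, five_year_avg: float) -> str:
    """Label whether a valuation ratio sits above or at/below its five-year average."""
    return "above" if current_ratio > five_year_avg else "at or below"

price, annual_dividend = 70.00, 2.04          # placeholder share price and dividend
print(f"Dividend yield: {dividend_yield(annual_dividend, price):.1%}")
print(f"P/E is {vs_five_year_average(27.0, 25.0)} its five-year average")
```

The same two checks, a yield above the market average and a multiple below its own history, are what the article uses to distinguish the "somewhat expensive" names from the more attractively valued ones.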