Stanford: A "battle of the gods" among optimizers? AdamW wins on "stability"
36Kr· 2025-09-07 23:36
Core Insights
- The article discusses the dominance of Adam and its improved version AdamW in the pre-training of open-weight language models since 2014, emphasizing their stability and rapid convergence on large datasets [1]
- As model sizes grow, pre-training has become a computationally intensive task, making optimizer design crucial for convergence speed and cost [1]
- Researchers have explored many alternatives, with matrix-based optimizers showing a 30-40% iteration-level speedup over well-tuned AdamW [1]
- Stanford's Percy Liang team finds that despite claims of significant acceleration (1.4 to 2 times) from alternative methods, AdamW remains a robust choice for pre-training, while matrix-based methods excel under specific data-to-model ratios [1]

Optimizer Performance
- The study identifies two methodological flaws: unfair hyperparameter tuning of baselines and testing at insufficient scale, both of which can lead to significant underestimation of baseline performance [4][6]
- Proper hyperparameter tuning matters: adjusting just the learning rate can yield up to a 2x speedup on a 130-million-parameter model (see the sketch at the end of this summary) [6]
- Fixed, shared hyperparameters do not ensure fair comparisons, since different optimizers can have vastly different optimal hyperparameters [4][6]

Research Methodology
- The research systematically compared eleven deep learning optimizers across model sizes from 100 million to 1.2 billion parameters and a range of data-to-model ratios [11]
- The study followed a rigorous three-phase methodology, including comprehensive hyperparameter sweeps and sensitivity analysis [15][20]

Findings on Hyperparameters
- Optimizers must be tuned independently, as optimal hyperparameter configurations do not transfer well between them [12]
- The best optimizer is context-dependent: Muon performs best at the standard Chinchilla data ratio, while Soap pulls ahead at ratios above 8:1 [13]

Case Studies and Results
- Case studies on larger runs confirmed that the predicted optimal configurations hold at larger model and data scales [24]
- While matrix-based optimizers such as Muon and Soap deliver clear speed advantages, the benefit shrinks as models grow, with the speedup dropping to about 1.1x for the largest models tested [26]
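To make the independent-tuning point concrete, here is a minimal, hypothetical sketch of the comparison protocol: each optimizer gets its own learning-rate sweep on the same task, and only the per-optimizer best results are compared. The toy regression task, the grid values, and the choice of AdamW vs. SGD are illustrative assumptions, not the paper's setup.

```python
import torch

# Toy data: a small linear-regression problem standing in for "the same task".
torch.manual_seed(0)
X = torch.randn(512, 32)
true_w = torch.randn(32, 1)
y = X @ true_w + 0.1 * torch.randn(512, 1)

def run(optimizer_cls, lr, steps=200):
    """Train a fresh model with the given optimizer and learning rate; return final loss."""
    model = torch.nn.Linear(32, 1)
    opt = optimizer_cls(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

# Each optimizer gets its own sweep; only the per-optimizer best is compared.
lr_grid = [1e-3, 3e-3, 1e-2, 3e-2, 1e-1]
for opt_cls in (torch.optim.AdamW, torch.optim.SGD):
    best_lr, best_loss = min(((lr, run(opt_cls, lr)) for lr in lr_grid), key=lambda t: t[1])
    print(opt_cls.__name__, "best lr:", best_lr, "final loss:", round(best_loss, 4))
```

The sweep typically lands on very different learning rates for the two optimizers, which is exactly why sharing one fixed value biases the comparison.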
Stanford: A "battle of the gods" among optimizers? AdamW wins on "stability"
机器之心· 2025-09-07 05:12
Core Insights
- The article discusses the dominance of Adam and its improved version AdamW in the pre-training of open-weight language models since 2014, emphasizing their stability and rapid convergence on large datasets [1]
- It highlights how optimizer design shapes convergence speed and compute cost as model sizes increase, with matrix-based optimizers showing a 30-40% iteration-level acceleration over well-tuned AdamW [1][15]
- The research identifies two methodological flaws that can lead to underestimating baseline optimizers such as AdamW: unfair hyperparameter tuning and insufficient testing scale [3][7]

Summary by Sections

Optimizer Performance
- Matrix-based optimizers (e.g., Muon, Soap, Kron) outperform scalar-based optimizers (e.g., AdamW, Nesterov AdamW, Mars), delivering consistent acceleration across data-to-model ratios [9][15]
- The advantage shrinks as model size increases, with some optimizers showing only a 1.1x acceleration over AdamW at 1.2 billion parameters [9][25]

Hyperparameter Tuning
- Proper hyperparameter tuning is crucial: adjusting even a single parameter such as the learning rate can produce a 2x speedup on a 130-million-parameter model [6][18]
- Fixed, shared hyperparameters do not ensure fair comparisons between optimizers, since preferred values for settings like weight decay vary significantly [4][15]

Testing Methodology
- Fair comparison requires rigorous, independent tuning of hyperparameters for each optimizer; blindly transferring hyperparameters can produce misleading results [15][18]
- Short-horizon evaluations can also mislead, since performance rankings may reverse later in training as the learning rate decays [15][20]

Case Studies and Findings
- Case studies on larger models confirmed that the predicted optimal configurations closely match actual performance, validating the fitted scaling laws [23]
- At extreme data-to-model ratios (e.g., 16x Chinchilla; see the calculation after this summary), optimizers such as Soap and Kron outperform Muon, indicating their strength in data-rich regimes [26]
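For readers unfamiliar with the "Nx Chinchilla" shorthand, the sketch below turns it into rough token budgets. It assumes the common rule of thumb of about 20 training tokens per parameter for 1x Chinchilla; the model sizes and multipliers are illustrative, not the paper's exact budgets.

```python
def chinchilla_tokens(n_params: float, multiple: float = 1.0, tokens_per_param: int = 20) -> int:
    """Approximate training-token budget for a given parameter count and Chinchilla multiple."""
    return int(multiple * tokens_per_param * n_params)

for n_params in (130e6, 300e6, 1.2e9):
    for mult in (1, 8, 16):
        tokens = chinchilla_tokens(n_params, mult)
        print(f"{n_params / 1e6:.0f}M params, {mult}x Chinchilla: {tokens / 1e9:.1f}B tokens")
```

The point of varying this ratio is that an optimizer's ranking can change between the compute-optimal regime (1x) and heavily over-trained regimes (8x, 16x).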
Tencent Research Institute AI Digest 20250617
腾讯研究院· 2025-06-16 14:55
Group 1
- Keller Jordan successfully joined OpenAI on the strength of a blog post about the Muon optimizer, which may be used for GPT-5 training [1]
- Muon is an optimizer for neural network hidden layers that uses Newton-Schulz iteration to orthogonalize update matrices, training faster than AdamW [1]
- Keller criticizes the optimizer literature for lacking practical application and advocates validating new methods in competitive training tasks [1]

Group 2
- Google's AI roadmap acknowledges that the current Transformer attention mechanism cannot achieve infinite context, necessitating fundamental innovations at the core architecture level [2]
- Gemini is set to become Google's "unified thread," connecting all services and transitioning toward "proactive AI," supporting multimodal capabilities and agent functions [2]
- Google is restructuring its AI effort by folding research and product teams into DeepMind to accelerate innovation, with Gemini 2.5 Pro marking a significant turning point [2]

Group 3
- Microsoft showcased 700 real AI agent and Copilot application cases across industries including finance, healthcare, education, and retail [3]
- Companies using AI agents report significant efficiency gains, such as Wells Fargo cutting response time from 10 minutes to 30 seconds and KPMG cutting compliance workload by 50% [3]
- Microsoft Copilot has led to notable productivity gains, with Michelin reporting a 10x productivity increase and 84% of BCI users seeing a 10-20% efficiency boost [3]

Group 4
- Midjourney has entered video generation, showing a video model with detailed, realistic output, though it lacks the audio features of Veo 3 [4][5]
- Midjourney is taking an open approach, inviting users to rate videos to improve the model and promising to factor user suggestions into pricing [5]
- The Midjourney V7 image model continues to update, supporting voice generation, draft mode, and conversation mode, with rendering speed improved by 40%, cutting fast mode from 36 seconds to 22 seconds [5]

Group 5
- GenSpark launched an AI browser that builds AI capabilities into every webpage, offering features like price comparison, shopping assistance, and video content summarization [6]
- The browser supports an "autonomous mode" that can browse on its own, organize information, create podcasts, and access paid websites to collect data [6]
- It includes an MCP store with over 700 tools for automation workflows and features ad blocking; it is currently available only for Mac [6]

Group 6
- MIT student Alex Kachkine used AI algorithms to restore old paintings, reducing a traditionally 9-month process to just 3.5 hours, with the research published in Nature [7]
- The new method applies an AI-generated double-layer "mask" film to the painting surface, repairing 5,612 damaged areas and filling in 57,314 colors, a 66-fold gain in efficiency [7]
- The restoration layer can be removed with chemicals without damaging the original artwork, and the approach works better the more areas are missing, potentially allowing more heavily damaged works to be restored [7]

Group 7
- Trump's "whole-of-government AI plan" may have leaked on GitHub, with the ai.gov website set to launch on July 4 to promote AI across the federal government [8]
- The plan, led by Thomas Shedd, includes chatbots, a super API, and real-time monitoring tools, using Amazon Bedrock for AI models [8]
- Experts and netizens have raised concerns about security risks, code vulnerabilities, and whether outdated government systems can adapt, criticizing the plan for vague definitions and potential superficiality [8]

Group 8
- XPeng Motors shared progress on its autonomous-driving base model at the AI conference CVPR, where it is building a cloud-based model with 72 billion parameters [10]
- XPeng validated the scaling law's effectiveness for autonomous-driving VLA models, using a "cloud-based model + reinforcement learning" strategy to handle long-tail scenarios and processing over 20 million video segments [10]
- The company has built a "cloud model factory" with 10 EFLOPS of computing power, has processed over 400,000 hours of video data, and has devised a token-compression method that cuts vehicle-side processing by 70% [10]

Group 9
- a16z partners believe AI is reshaping consumer paradigms, with "task completion" replacing "relationship building" as the main product line; current AI tools show strong monetization potential, with users paying up to $200 monthly [11]
- A true "AI + social" product has yet to emerge, as current platforms merely embed AI-generated content into old structures; platforms need a fundamental rethink to create new ways of connecting [11]
- In the AI era, speed has overtaken traditional moats, including distribution and iteration speed, as the primary competitive advantage, requiring companies to maintain "dynamic leadership" rather than "static barriers" for long-term survival [11]

Group 10
- NVIDIA CEO Jensen Huang publicly criticized Anthropic CEO Dario Amodei's prediction that half of entry-level white-collar jobs will be replaced by AI within five years [12]
- Huang questioned Anthropic's "exclusive mindset," arguing that AI development should be open and transparent rather than closed and controlled: "don't lock yourself away to develop AI and then tell us it's safe" [12]
- Anthropic responded that Dario never claimed "only Anthropic can build safe AI," reflecting two views of AI governance: Amodei emphasizes caution and ethical frameworks, while Huang believes open competition ensures safety [12]
One all-out blog post landed an OpenAI offer; Muon's author charges that nearly all optimizer papers are "fake"
36Kr· 2025-06-16 12:46
Core Insights
- A blog post by researcher Keller Jordan titled "Muon: An optimizer for hidden layers in neural networks" led to his offer from OpenAI, and the techniques it describes may have been used in training GPT-5 [1][4][5]
- The post argues that publishing at top conferences does not equate to having real impact, challenging traditional academic norms [6][11]

Group 1: Blog Impact and Reception
- Keller Jordan's blog post gained attention for its practical results, outperforming the previously dominant optimizer AdamW [5][14]
- Yuchen Jin, a co-author, highlighted the misconception in academia that publishing at top-tier conferences is the ultimate goal, advocating for real-world impact instead [6][11]
- The blog's success illustrates a shift in the AI research landscape, where practical performance can outweigh formal academic credentials [22][24]

Group 2: Muon Optimizer Performance
- Muon achieved significant speedups, such as cutting the CIFAR-10 training-speed record from 3.3 to 2.6 A100-seconds [14]
- In the NanoGPT speedrun, Muon reached the target validation loss 1.35x faster, and the advantage held at larger parameter scales [14]
- When training a 1.5-billion-parameter transformer, Muon reached GPT-2 XL-level performance on HellaSwag in just 10 hours, versus 13.3 hours with AdamW [14][20]

Group 3: Design and Methodology
- Muon's core recipe is to compute an SGD-momentum update and then run a Newton-Schulz iteration to approximately orthogonalize it (a sketch follows this summary) [20][22]
- In effect, Muon replaces the raw update matrix with a nearby "semi-orthogonal matrix," which is what drives its effectiveness [22]
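As a concrete illustration of that orthogonalization step, here is a minimal sketch of the quintic Newton-Schulz iteration described in the blog post. The coefficients follow the published post; treat this as an illustration of the idea rather than the reference implementation.

```python
import torch

@torch.no_grad()
def newtonschulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately map G toward the nearest semi-orthogonal matrix (U V^T from
    its SVD U S V^T) using an odd quintic polynomial iteration."""
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from the blog post
    X = G / (G.norm() + eps)            # normalize so all singular values are <= 1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T                         # iterate in the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X               # X <- a*X + b*(XX^T)X + c*(XX^T)^2 X
    if transposed:
        X = X.T
    return X

# Usage sketch: orthogonalize a stand-in SGD-momentum update for one hidden-layer weight.
momentum_update = torch.randn(1024, 4096)
orthogonalized = newtonschulz5(momentum_update)
```

In Muon, this post-processed matrix replaces the raw momentum update before the scaled parameter step is applied.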
With just one blog post, the author of Muon caught OpenAI's eye
机器之心· 2025-06-16 04:04
Core Insights
- The article emphasizes that publishing papers is no longer the ultimate goal for researchers, as demonstrated by Keller Jordan landing a position at OpenAI on the strength of a blog post [2][8]
- Keller Jordan's case illustrates that talent acquisition at top AI research institutions like OpenAI prioritizes demonstrated capability over traditional academic metrics [8]

Summary by Sections

Blog Post Overview
- Keller Jordan's blog "Muon: An optimizer for hidden layers in neural networks," published on December 8, 2024, introduced a new optimizer that significantly speeds up neural network training while maintaining accuracy [4][6]
- The blog tracks the latest NanoGPT speedrun records, the most recent being 2.979 minutes, set on May 25, 2025 [9]

Muon Optimizer Design and Results
- Muon is designed for the hidden layers of neural networks and set a CIFAR-10 training-speed record of 2.6 A100-seconds while maintaining 94% accuracy [22]
- In competitive training tasks, Muon delivered a 1.35x improvement in training speed over previous methods [22]
- The design applies Newton-Schulz iterations to approximately orthogonalize the update, which diversifies update directions and improves learning [29][30]

Performance and Efficiency
- Muon adds minimal computational overhead, with a FLOP cost below 1% in typical language-model training scenarios (a back-of-the-envelope check follows this summary) [58][59]
- The optimizer outperforms traditional methods such as AdamW when training large models, for example a 1.5-billion-parameter Transformer [22][66]

Comparison with Other Optimizers
- The article discusses the limitations of other optimizers such as Shampoo and Orthogonal-SGDM, with Muon outperforming them in efficiency and effectiveness [61][64]
- It emphasizes the importance of properly tuned baselines so that new optimizers are shown to be genuinely effective [72]

Future Research Directions
- Ongoing research explores Muon's scalability and its application across training scenarios, indicating growing interest in its potential [79][81]
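To see why the overhead stays below 1%, here is a back-of-the-envelope check in the spirit of the blog's analysis. The hidden size, tokens per step, and number of Newton-Schulz iterations below are illustrative assumptions, not figures taken from the post.

```python
def newton_schulz_flops(m: int, n: int, steps: int = 5) -> float:
    """Approximate matmul FLOPs for orthogonalizing an m x n update (m <= n):
    each step computes X@X.T, A@A, and B@X."""
    per_step = 2 * m * m * n + 2 * m ** 3 + 2 * m * m * n
    return steps * per_step

def train_step_flops(m: int, n: int, batch_tokens: int) -> float:
    """Standard estimate: ~6 FLOPs per parameter per token for forward + backward."""
    return 6 * batch_tokens * m * n

# Assumed scenario: a 4096 x 4096 hidden-layer weight, 4M tokens per step, 5 iterations.
m, n, batch_tokens = 4096, 4096, 4_000_000
overhead = newton_schulz_flops(m, n) / train_step_flops(m, n, batch_tokens)
print(f"estimated Muon FLOP overhead: {overhead:.2%}")  # ~0.5% under these assumptions
```

Under these assumptions the ratio is roughly steps x m / batch_tokens, so the overhead shrinks further as the token batch grows.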