Core Viewpoint
- The article discusses the evolution of optimizers for large language model (LLM) training, highlighting ROOT (Robust Orthogonalized Optimizer), introduced by Huawei Noah's Ark Lab, as a solution that combines Muon's speed with Adam's stability and addresses the limitations of existing optimizers in large-scale training and noise robustness [2][50].

Group 1: Optimizer Evolution
- The early optimizer SGD (Stochastic Gradient Descent) established the basic paradigm of neural network training but struggled with convergence speed and stability in high-dimensional loss landscapes [6][7].
- Adam and AdamW became the de facto standards for training deep learning models and markedly improved convergence efficiency, yet they exhibit numerical-stability issues once model parameter counts exceed one billion [7][8].
- Muon, a matrix-aware optimizer, addressed part of the problem by treating each weight matrix as a whole, but it remains sensitive to gradient noise and lacks robustness [11][13].

Group 2: ROOT Optimizer Features
- ROOT improves the robustness of orthogonalized optimizers by using adaptive coefficients for the Newton-Schulz iteration, tailored to the dimensions of each weight matrix, overcoming the dimensional fragility of fixed-coefficient methods [26][29].
- The optimizer also employs a soft-thresholding mechanism that separates normal from abnormal gradient components, filtering out gradient noise and improving training stability [30][33]; a minimal code sketch of these two mechanisms follows this summary.
- ROOT is designed to balance speed and stability, making it suitable for large-scale, non-convex training of real models [20][21].

Group 3: Performance Validation
- In extensive experiments, ROOT showed superior convergence, reaching a training loss of 2.5407 in a 10B-token pre-training run and outperforming Muon [41][42].
- ROOT achieved an average score of 60.12 across multiple downstream tasks, surpassing both AdamW (59.05) and Muon (59.59) [43].
- The optimizer also generalized across modalities, reaching 88.44% Top-1 accuracy on CIFAR-10, well above Muon's 84.67% [46][47].

Group 4: Future Implications
- ROOT is positioned to usher in a new generation of optimizers that cope with the growing complexity and scale of future language models, improving the reliability and efficiency of AI training [49][51].
- The open-source release of ROOT's code is expected to encourage further research and application in training trillion-parameter models, underscoring Huawei's commitment to AI innovation [52].
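The two Group 2 mechanisms can be illustrated with a short, self-contained sketch. This is not the paper's implementation: the Newton-Schulz coefficients shown are the fixed ones popularized by Muon (ROOT is described as adapting them to the matrix dimensions, which is not reproduced here), and the function names, the threshold rule based on the median gradient magnitude, and the damped outlier term are illustrative assumptions.

```python
# Hypothetical sketch of a ROOT-style update step (names and constants are
# illustrative, not the paper's actual values). It combines the two ideas
# summarized above: (1) soft-thresholding the momentum to split it into a
# "normal" low-magnitude part and an "abnormal" outlier part, and
# (2) orthogonalizing the filtered update with a Newton-Schulz iteration.
import torch

def newton_schulz_orthogonalize(G, steps=5, coeffs=(3.4445, -4.7750, 2.0315)):
    """Approximately orthogonalize G, i.e. push its singular values toward 1.

    The default coefficients are the fixed ones popularized by Muon; ROOT is
    reported to adapt them to the matrix dimensions (not shown here).
    """
    a, b, c = coeffs
    X = G / (G.norm() + 1e-7)          # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def soft_threshold_split(M, tau):
    """Split M into a clipped 'normal' part and the residual 'outlier' part."""
    outlier = torch.sign(M) * torch.clamp(M.abs() - tau, min=0.0)
    normal = M - outlier               # entries capped at magnitude tau
    return normal, outlier

def root_like_step(W, grad, momentum, lr=0.02, beta=0.95, tau_scale=3.0):
    """One illustrative update: momentum -> soft-threshold -> orthogonalize."""
    momentum.mul_(beta).add_(grad)                       # heavy-ball momentum
    tau = tau_scale * momentum.abs().median()            # crude noise threshold (assumption)
    normal, outlier = soft_threshold_split(momentum, tau)
    update = newton_schulz_orthogonalize(normal)         # orthogonalized core update
    update = update + 0.1 * outlier / (outlier.norm() + 1e-7)  # damped outlier term (assumption)
    W.add_(update, alpha=-lr)
    return W

# Toy usage on a random weight matrix.
torch.manual_seed(0)
W = torch.randn(256, 128)
momentum = torch.zeros_like(W)
for _ in range(3):
    grad = torch.randn_like(W)          # stand-in for a real gradient
    root_like_step(W, grad, momentum)
print(W.norm())
```

In this sketch, soft-thresholding caps the entry-wise magnitude of the momentum before orthogonalization, so a handful of outlier gradient entries cannot dominate the orthogonalized update; the residual outlier component is re-added only in a strongly damped form.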
Adam's stability plus Muon's speed? Huawei Noah's Ark open-sources ROOT to crack the "have it both ways" dilemma of large model training
机器之心·2025-11-27 04:09