Large Model Training Optimization
Adam's stability + Muon's speed? Huawei Noah's Ark Lab open-sources ROOT to crack the "have it both ways" dilemma in large model training
机器之心· 2025-11-27 04:09
Core Viewpoint
- The article discusses the evolution of optimizers in large language model (LLM) training, highlighting the introduction of ROOT (Robust Orthogonalized Optimizer) by Huawei Noah's Ark Lab as a solution that combines the speed of Muon with the stability of Adam, addressing existing optimizers' limitations in large-scale training and robustness to noise [2][50].

Group 1: Optimizer Evolution
- The early optimizer SGD (Stochastic Gradient Descent) established the basic paradigm for neural network training but struggled with convergence speed and stability in high-dimensional loss landscapes [6][7].
- Adam and AdamW became the de facto standards for training deep learning models, significantly improving convergence efficiency, but revealed numerical instability once model parameter counts exceed one billion [7][8].
- Muon, a matrix-aware optimizer, addressed part of the problem by treating weight matrices as a whole, yet it remained sensitive to gradient noise and lacked robustness [11][13].

Group 2: ROOT Optimizer Features
- ROOT improves the robustness of orthogonalized optimizers by introducing adaptive coefficients for the Newton-Schulz iteration, tailored to specific matrix dimensions, overcoming the dimensional fragility of fixed-coefficient methods (a minimal sketch of the iteration follows Group 4 below) [26][29].
- The optimizer employs a soft-thresholding mechanism to filter out gradient noise, separating normal from abnormal gradient components and improving stability during training (see the second sketch below) [30][33].
- ROOT's design balances speed and stability, making it suitable for large-scale, non-convex training of real models [20][21].

Group 3: Performance Validation
- In extensive experiments, ROOT demonstrated superior convergence, reaching a training loss of 2.5407 in a 10B-token pre-training run and outperforming Muon [41][42].
- ROOT achieved an average score of 60.12 across multiple downstream tasks, surpassing both AdamW (59.05) and Muon (59.59) [43].
- The optimizer also generalized across modalities, reaching 88.44% Top-1 accuracy on CIFAR-10, well above Muon's 84.67% [46][47].

Group 4: Future Implications
- ROOT is positioned to usher in a new generation of optimizers able to cope with the growing complexity and scale of future language models, improving the reliability and efficiency of AI system training [49][51].
- The open-source release of ROOT's code is expected to encourage further research and application in training trillion-parameter models, reinforcing Huawei's commitment to innovation in AI [52].
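For context on what ROOT adapts, the sketch below shows the quintic Newton-Schulz iteration with the fixed coefficients published for Muon. ROOT's actual adaptive coefficients are not given in the summary above, so `adaptive_coeffs` is a hypothetical dimension-aware stand-in that only illustrates the idea of selecting coefficients by matrix shape.

```python
import numpy as np

# Fixed quintic Newton-Schulz coefficients published for Muon; ROOT replaces
# these with coefficients adapted to the matrix dimensions.
MUON_COEFFS = (3.4445, -4.7750, 2.0315)

def newton_schulz_orthogonalize(G, coeffs=MUON_COEFFS, steps=5, eps=1e-7):
    """Push the singular values of G toward 1 (approximate orthogonalization)
    via the quintic iteration X <- a*X + (b*A + c*A^2) @ X, with A = X X^T."""
    a, b, c = coeffs
    X = G / (np.linalg.norm(G) + eps)   # Frobenius norm bounds the spectral norm
    transpose = X.shape[0] > X.shape[1]
    if transpose:                        # iterate on the wide orientation (cheaper matmuls)
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transpose else X

def adaptive_coeffs(shape):
    """Hypothetical dimension-aware schedule standing in for ROOT's adaptive
    coefficients; the paper's actual values are not given in the summary."""
    aspect = max(shape) / min(shape)
    return (3.0, -3.5, 1.5) if aspect > 4.0 else MUON_COEFFS

# Usage: orthogonalize a tall gradient matrix with shape-aware coefficients.
G = np.random.randn(1024, 128)
update = newton_schulz_orthogonalize(G, coeffs=adaptive_coeffs(G.shape))
```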
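The second sketch shows soft-thresholding itself, the mechanism the article credits with separating normal from abnormal gradient components. Where ROOT applies the split and how the threshold `tau` is chosen are assumptions here; only the use of soft-thresholding comes from the article.

```python
import numpy as np

def soft_threshold(x, tau):
    """Elementwise soft-thresholding (the proximal operator of the L1 norm):
    entries within [-tau, tau] go to zero, larger ones shrink by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def split_gradient(G, tau):
    """Illustrative split of a gradient into a bounded 'normal' part and a
    sparse 'outlier' part; tau and the placement of this split are assumed."""
    outliers = soft_threshold(G, tau)   # magnitude exceeding tau, elementwise
    normal = G - outliers               # entries clipped into [-tau, tau]
    return normal, outliers

# Usage: a gradient with a few large spikes is separated from its bulk.
g = np.array([0.02, -0.05, 3.0, 0.01, -2.5])
normal, outliers = split_gradient(g, tau=0.1)
print(normal)    # [ 0.02 -0.05  0.1   0.01 -0.1 ]
print(outliers)  # [ 0.    0.    2.9   0.   -2.4 ]
```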
New DeepSeek paper co-signed by Liang Wenfeng: the cost-reduction methods behind the V3 large model made public
量子位· 2025-05-15 08:37
Core Viewpoint
- The article discusses the latest advancements in DeepSeek-V3, focusing on how four innovative technologies overcome hardware bottlenecks in training large models [1][2].

Group 1: Innovations in DeepSeek-V3
- DeepSeek-V3 achieves training efficiency comparable to systems with many thousands of GPUs while using only 2048 H800 GPUs, through memory optimization and multi-head latent attention (MLA) [2][14].
- MLA compresses the key-value cache (KV cache) to about 70 KB per token, 1/7 to 1/4 the size of conventional attention caches, relieving memory pressure especially for long-text processing (the arithmetic behind this figure is sketched after Group 3 below) [15][20].
- The model combines a mixture of experts (MoE) with FP8 low-precision training, activating only 37 billion of its 671 billion parameters per token, bringing training cost down to roughly 1/10 that of dense models such as Llama-3.1 (a second sketch below works through the active-parameter fraction) [17][18].

Group 2: Communication and Inference Acceleration
- DeepSeek-V3 uses a multi-plane fat-tree network design to optimize communication, cutting costs by 40% and latency by 30% while scaling to thousands of GPUs [20][21].
- Dual-pipeline execution overlaps attention computation with expert communication, raising throughput by nearly 100% [22].
- Multi-token prediction (MTP) lets the model draft multiple tokens per step, increasing generation speed by about 1.8x while maintaining an acceptance rate of 80%-90% (a simplified acceptance model below reproduces this figure) [24][25].

Group 3: Future Hardware Expectations
- The article outlines five dimensions for future AI hardware improvement, shifting from passive adaptation to proactive co-design [28].
- Recommendations include stronger low-precision computation support, integrated communication frameworks, optimized network topologies, improved memory systems, and robustness against failures [30][33][37][40].
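The ~70 KB per-token figure can be reproduced from publicly documented model configurations, as in the sketch below. The specific baselines used for the article's 1/7-1/4 comparison are an assumption; Llama-3.1 405B is shown here as one plausible GQA reference point.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """Per-token KV cache for standard attention / GQA: keys + values."""
    return layers * kv_heads * head_dim * 2 * bytes_per_elem

def mla_bytes_per_token(layers, latent_dim, rope_dim, bytes_per_elem=2):
    """Per-token cache for MLA: one compressed latent vector plus a small
    decoupled RoPE key per layer."""
    return layers * (latent_dim + rope_dim) * bytes_per_elem

# DeepSeek-V3 public config: 61 layers, KV latent rank 512, RoPE key dim 64, BF16
print(mla_bytes_per_token(61, 512, 64) / 1024)   # ~68.6 KB -> the ~70 KB figure
# Llama-3.1 405B (GQA): 126 layers, 8 KV heads, head dim 128
print(kv_bytes_per_token(126, 8, 128) / 1024)    # ~504 KB, roughly 7x larger
```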
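A quick check on the sparsity claim: the fraction of weights touched per token is the main lever behind the reported ~1/10 training cost versus dense models.

```python
# Worked arithmetic from the figures in Group 1 above.
total_params  = 671e9   # full MoE parameter count
active_params = 37e9    # parameters activated per token
print(f"active fraction: {active_params / total_params:.1%}")  # ~5.5%
```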
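Finally, the reported 1.8x MTP speedup is consistent with a simple acceptance model: one extra drafted token kept with 80%-90% probability. The independent-acceptance assumption below is a simplification, not DeepSeek's exact accounting.

```python
def mtp_expected_tokens(acceptance_rate, extra_tokens=1):
    """Expected tokens emitted per decoding step when each step drafts
    `extra_tokens` additional tokens and the k-th drafted token survives
    with probability acceptance_rate**k (simplified independence model)."""
    return 1 + sum(acceptance_rate ** k for k in range(1, extra_tokens + 1))

for p in (0.8, 0.9):
    print(p, mtp_expected_tokens(p))  # 1.8 and 1.9: consistent with the ~1.8x claim
```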