Newton-Schulz method
No extra compute, just one algorithm change: Muon gets up to 2x faster on trillion-parameter MoE models
机器之心· 2026-03-31 09:00
Core Viewpoint
- The article introduces Gram Newton-Schulz, a reformulation of the Newton-Schulz algorithm tuned for GPUs and large-model training. It cuts optimizer time by 40-50% for trillion-parameter MoE models [1][5][30].

Group 1: Key Contributions
- The core idea of Gram Newton-Schulz is to iterate on the smaller Gram matrix instead of directly on the matrix X. This reduces the computational load and exploits the fact that the Gram matrix is symmetric [3][16].
- The authors rewrite standard Newton-Schulz in a mathematically equivalent form that operates in n×n space, which is where the computational savings come from [5][14].
- Stabilized Gram Newton-Schulz addresses instability in half-precision arithmetic by adding a restart strategy, so that training quality is maintained [19][23][27].

Group 2: Performance Improvements
- Gram Newton-Schulz speeds up the Muon optimizer by up to 2x at no additional cost, with negligible change in validation perplexity [6][31].
- In practice, for example on the Kimi K2 model, Gram Newton-Schulz ran twice as fast as standard Newton-Schulz on NVIDIA H100 and B300 hardware [31][33].
- The complexity comparison indicates that the Gram method reduces FLOPs by approximately 42%-58% when α>1, making it the more efficient choice for large-scale computation [28][29].
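
The Gram reformulation summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes X has shape (n, m) with n ≤ m, uses the quintic coefficients from the public Muon reference code, and runs in float64 rather than half precision. The `restart_every` parameter is a hypothetical name: it shows one plausible reading of a "restart" (fold the accumulated n×n factor back into X and recompute the Gram matrix), not the strategy described in the article.

```python
import numpy as np

# Quintic coefficients from the public Muon reference code (assumption:
# the article's method may use different or per-step coefficients).
MUON_COEFFS = (3.4445, -4.7750, 2.0315)


def newton_schulz(X, steps=5, coeffs=MUON_COEFFS):
    """Standard form: every step multiplies the full n x m matrix X."""
    a, b, c = coeffs
    X = X / (np.linalg.norm(X) + 1e-7)  # Frobenius normalization
    I = np.eye(X.shape[0])
    for _ in range(steps):
        G = X @ X.T                          # n x n Gram matrix, costs n*n*m
        X = (a * I + b * G + c * (G @ G)) @ X  # another n*n*m product
    return X


def gram_newton_schulz(X, steps=5, coeffs=MUON_COEFFS, restart_every=None):
    """Gram form: iterate on n x n matrices only, touch X at start and end.

    With P = a*I + b*G + c*G^2, the standard update is X <- P X, so the
    Gram matrix evolves as G <- P G P and the product of all P factors
    can be accumulated in an n x n matrix Q. `restart_every` is a
    hypothetical knob illustrating a restart, not the article's API.
    """
    a, b, c = coeffs
    X = X / (np.linalg.norm(X) + 1e-7)
    I = np.eye(X.shape[0])
    G = X @ X.T          # the only n*n*m product before the loop
    Q = I.copy()         # accumulated polynomial factor, n x n
    for k in range(steps):
        if restart_every and k > 0 and k % restart_every == 0:
            X = Q @ X    # fold accumulated factor into X
            G = X @ X.T  # recompute the Gram matrix from scratch
            Q = I.copy()
        P = a * I + b * G + c * (G @ G)  # symmetric n x n
        Q = P @ Q
        G = P @ G @ P                    # Gram matrix of the updated X
    return Q @ X         # single n*n*m product at the end


rng = np.random.default_rng(0)
X0 = rng.standard_normal((8, 32))  # n = 8, m = 32
same = np.allclose(newton_schulz(X0), gram_newton_schulz(X0), atol=1e-8)
same_restart = np.allclose(
    newton_schulz(X0), gram_newton_schulz(X0, restart_every=2), atol=1e-8
)
```

Inside the loop, the Gram form performs only n×n×n products, while the standard form performs two n×n×m products per step; that difference is the source of the savings when m is much larger than n. In float64 the two forms agree closely, which is why the sketch does not reproduce the half-precision instability the stabilized variant is meant to address.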