A Single Hard-Core Blog Post Lands an OpenAI Offer; Muon's Author Bluntly Claims Nearly All Optimizer Papers Are "Fake"
36Kr · 2025-06-16 12:46

Core Insights
- A blog post by researcher Keller Jordan titled "Muon: An optimizer for hidden layers in neural networks" led to his offer from OpenAI, and the techniques it describes may have been used in training GPT-5 [1][4][5]
- The post's reception underscores that publishing at top conferences does not equate to having significant impact, challenging traditional academic norms [6][11]

Group 1: Blog Impact and Reception
- Keller Jordan's blog post gained attention for its practical results, outperforming the previously dominant optimizer AdamW [5][14]
- Yuchen Jin, a co-author, highlighted the misconception in academia that publishing at top-tier conferences is the ultimate goal, advocating for real-world impact instead [6][11]
- The blog's success illustrates a shift in the AI research landscape, where demonstrated performance can outweigh formal academic credentials [22][24]

Group 2: Muon Optimizer Performance
- Muon delivered significant speedups, such as reducing CIFAR-10 training time from 3.3 to 2.6 A100-seconds [14]
- In the NanoGPT task, Muon reached the target validation loss 1.35 times faster, and its advantage held as parameter counts grew [14]
- When training a 1.5-billion-parameter transformer, Muon matched GPT-2 XL-level performance on HellaSwag in just 10 hours, compared with 13.3 hours for AdamW [14][20]

Group 3: Design and Methodology
- Muon's core principle is to compute updates with SGD-momentum and then apply a Newton-Schulz iteration to approximately orthogonalize the update matrix [20][22]
- This replaces the raw update matrix with a "semi-orthogonal matrix" before it is applied to the weights, which is the source of its effectiveness; a sketch of the procedure follows below [22]
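To make the design in Group 3 concrete, below is a minimal PyTorch-style sketch of the idea: accumulate an SGD-momentum buffer, then orthogonalize it with a quintic Newton-Schulz iteration before applying it to the weights. The iteration coefficients match those given in Keller Jordan's write-up, but the function and parameter names (`newtonschulz5`, `muon_step`, `lr`, `momentum`) are illustrative rather than the official API.

```python
import torch

def newtonschulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration.

    Returns a matrix close to U @ V.T, where U S V.T is the SVD of G,
    i.e. the nearest semi-orthogonal matrix to G.
    """
    # Coefficients from Keller Jordan's Muon write-up (quintic iteration).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + eps)            # scale so the spectral norm is <= 1 and the iteration converges
    if G.size(0) > G.size(1):           # work with the wide orientation for efficiency
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)

def muon_step(weight, grad, momentum_buf, lr=0.02, momentum=0.95):
    """One illustrative Muon-style update for a single 2D weight matrix (sketch, not the official implementation)."""
    momentum_buf.mul_(momentum).add_(grad)    # SGD-momentum accumulation
    update = newtonschulz5(momentum_buf)      # replace the update with its semi-orthogonal approximation
    weight.data.add_(update, alpha=-lr)
    return weight
```

As described in the post, this orthogonalization step is applied only to the 2D hidden-layer weight matrices; embeddings, output heads, and scalar parameters are typically still trained with AdamW.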