Adam Optimizer
MIT's Latest Finding: Algorithmic Progress over the Past Decade Has Been Overestimated
机器之心 · 2025-12-11 02:47
Core Insights
- The article discusses the significant advancements in AI driven by increased computational budgets and algorithmic innovations over the past decade [2][6]
- It highlights that while computational growth is directly measurable, quantifying algorithmic progress remains difficult, particularly regarding how efficiency improvements scale [2][3]

Group 1: Algorithmic Progress
- Research estimates that algorithmic advancements have contributed over 4 orders of magnitude of effective compute over the past decade, while computational scale itself has increased by 7 orders of magnitude [2]
- Algorithmic innovations have improved overall model efficiency by roughly 22,000 times, allowing similar performance to be reached with far fewer floating-point operations (FLOPs) [3][4] (a toy calculation of compute-equivalent gains follows this summary)
- Most algorithmic innovations yield only minor efficiency improvements, amounting to less than a 10-fold overall gain when extrapolated to 2025's compute budgets [4][11]

Group 2: Scale-Dependent Innovations
- Two major scale-dependent innovations, the shift from LSTMs to Transformers and from Kaplan to Chinchilla scaling laws, account for 91% of the total efficiency improvement [4][22]
- Efficiency gains from algorithmic improvements are much larger in large-scale models than in small-scale ones, indicating that algorithmic progress depends heavily on computational scale [6][25]
- The article suggests that the perceived rapid progress in algorithms may reflect growing computational budgets more than continuous algorithmic breakthroughs [22][24]

Group 3: Experimental Findings
- The study employed ablation studies and scaling experiments to analyze the impact of individual algorithms and their combinations [5][8]
- The findings reveal a highly skewed distribution of efficiency improvements, with a few key innovations contributing disproportionately to the overall gains [11][12]
- The scaling experiments show that improvements in neural network architectures are not scale-invariant but exhibit increasing returns to scale [20][21]
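The compute-equivalent framing above can be made concrete with a toy calculation. The sketch below uses invented per-innovation multipliers (illustrative assumptions only, not the paper's estimates), chosen so that two large scale-dependent changes dominate a long tail of small tweaks; it composes them multiplicatively and reports the share of the total log-gain owned by the two big ones.

```python
import math

# Hypothetical per-innovation efficiency multipliers (illustrative only):
# each value means "the same loss can be reached with this many times fewer
# training FLOPs once the innovation is applied".
innovation_multipliers = {
    "lstm_to_transformer": 80.0,   # scale-dependent architecture change
    "kaplan_to_chinchilla": 15.0,  # scale-dependent compute/data allocation
    "optimizer_tweaks": 1.3,
    "better_lr_schedule": 1.2,
    "positional_encoding": 1.15,
    "misc_small_changes": 1.1,
}

def compute_equivalent_gain(multipliers):
    """Total efficiency gain if individual gains compose multiplicatively."""
    total = 1.0
    for m in multipliers.values():
        total *= m
    return total

total = compute_equivalent_gain(innovation_multipliers)
big_two = (innovation_multipliers["lstm_to_transformer"]
           * innovation_multipliers["kaplan_to_chinchilla"])

print(f"total compute-equivalent gain: {total:,.0f}x "
      f"({math.log10(total):.1f} orders of magnitude)")
print(f"share of log-gain from the two scale-dependent innovations: "
      f"{math.log(big_two) / math.log(total):.0%}")
```

In this framing, "orders of magnitude of effective compute" is simply log10 of the composed multiplier, which is why a skewed multiplier distribution translates into a few innovations accounting for most of the measured progress.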
More Effective than Adam: Building on the Principle of Spectral Invariance, POET Makes LLM Training Both Stable and Fast
机器之心 · 2025-07-15 00:59
Core Viewpoint
- The article discusses a novel training paradigm for large language models (LLMs) called POET (Reparameterized Training via Orthogonal Equivalence Transformation), which aims to improve training efficiency and stability from first principles [2][3].

Group 1: POET Methodology
- POET structurally reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix, preserving the singular-value distribution of the weights throughout training [3][11] (a minimal sketch of this reparameterization follows this summary).
- The method combines singular-value invariance with minimal hyperspherical energy, offering a new paradigm for large-model training with both physical interpretability and strong generalization [3][11].
- POET's training process is designed to stabilize optimization and significantly improve model generalization performance [3][11].

Group 2: Advantages of POET
- POET maintains the spectral properties of the weight matrix throughout training, so the singular values stay consistent with those of the randomly initialized matrix [17].
- The method allows efficient parameter control and avoids the excessively large singular values that can arise in standard LLM training [17].
- Two new initialization strategies, normalized Gaussian initialization and uniform spectrum initialization, are proposed to guarantee bounded singular values in the generated weight matrices [17].

Group 3: Training Dynamics and Performance
- Experimental results show POET outperforming traditional methods such as AdamW when training large language models, with improvements in both perplexity and training efficiency [20][24].
- POET's training proceeds in three phases, conical shell searching, stable learning on the conical shell, and final adjusting, reflecting how the orthogonal matrices evolve during training [40][41].
- POET's fully stochastic sampling approach greatly reduces memory costs compared to traditional methods, enhancing scalability [26][27].
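To make the spectrum-preserving idea concrete, here is a minimal PyTorch sketch of an orthogonal-equivalence reparameterized linear layer: the effective weight is R @ W0 @ P with a frozen random W0 and two matrices kept orthogonal via a Cayley transform. The Cayley parameterization and the dense end-to-end construction are illustrative assumptions, not necessarily POET's exact algorithm; in particular, the stochastic sampling of orthogonal primitives that the article credits for POET's memory savings is not reproduced here.

```python
import torch
import torch.nn as nn

class OrthogonalEquivalenceLinear(nn.Module):
    """Sketch of a POET-style layer: effective weight W = R @ W0 @ P, with W0
    fixed at initialization and R, P constrained to be orthogonal, so the
    singular values of W equal those of W0 for the whole of training."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Frozen random weight; its spectrum is the layer's spectrum forever.
        self.register_buffer(
            "w0", torch.randn(out_features, in_features) / in_features ** 0.5
        )
        # Unconstrained parameters, mapped to orthogonal matrices on the fly.
        self.a_r = nn.Parameter(torch.zeros(out_features, out_features))
        self.a_p = nn.Parameter(torch.zeros(in_features, in_features))

    @staticmethod
    def _cayley(a: torch.Tensor) -> torch.Tensor:
        # Cayley transform: Q = (I - S)^(-1) (I + S) with S = A - A^T is
        # orthogonal for any skew-symmetric S.
        s = a - a.T
        eye = torch.eye(a.shape[0], dtype=a.dtype, device=a.device)
        return torch.linalg.solve(eye - s, eye + s)

    def effective_weight(self) -> torch.Tensor:
        return self._cayley(self.a_r) @ self.w0 @ self._cayley(self.a_p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.effective_weight().T


layer = OrthogonalEquivalenceLinear(64, 32)
# Move away from the identity so the invariance check below is non-trivial.
with torch.no_grad():
    layer.a_r.normal_(std=0.1)
    layer.a_p.normal_(std=0.1)

y = layer(torch.randn(8, 64))                    # shape (8, 32)
sv_w = torch.linalg.svdvals(layer.effective_weight())
sv_w0 = torch.linalg.svdvals(layer.w0)
print(torch.allclose(sv_w, sv_w0, atol=1e-4))    # True: spectrum is preserved
```

Because left- and right-multiplication by orthogonal matrices cannot change singular values, the final check holds no matter how far R and P drift from the identity during training, which is the stability property the summary emphasizes.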