Adam Optimizer
The men "at Musk's side" are leaving one after another: one yesterday, one today, both students of Hinton
36Ke· 2026-02-11 09:35
Core Viewpoint
- The recent departure of co-founders Jimmy Ba and Wu Yuhua from xAI highlights significant organizational instability within the company, which is facing challenges related to leadership turnover and regulatory pressures [2][11].
Group 1: Key Events
- Jimmy Ba announced his departure from xAI, marking his last day at the company [2].
- Wu Yuhua, another co-founder, announced his exit from xAI just a day prior to Ba's announcement [5].
- The departures of Ba and Wu follow the earlier exit of Igor Babuschkin, leaving only six of the original twelve co-founders at xAI [11].
Group 2: Background of Key Individuals
- Jimmy Ba is a prominent AI researcher known for co-developing the Adam optimizer, which has been cited over 240,000 times in the academic literature [4][5]; a sketch of the Adam update rule follows this summary.
- Ba earned both his master's and doctoral degrees at the University of Toronto under the supervision of Geoffrey Hinton, a leading figure in AI [7][8].
Group 3: Company Challenges
- xAI is reportedly burning through nearly $1 billion per month, raising concerns about its financial sustainability [7].
- The company is facing increasing regulatory scrutiny over its AI products' content moderation practices, which have led to potential compliance issues [11].
- The recent leadership changes may hinder xAI's ability to attract and retain talent, affecting its strategic direction and operational stability [11].
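For reference, the Adam update rule mentioned above keeps exponential moving averages of the gradient and of its elementwise square, corrects their initialization bias, and scales each step accordingly. Below is a minimal sketch with the paper's default hyperparameters; the function name and variable names are illustrative only.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: moving averages of the gradient (m) and squared
    gradient (v), bias-corrected, then an elementwise-scaled update."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```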
MIT's latest findings: the past decade's algorithmic progress has been overestimated
机器之心· 2025-12-11 02:47
Core Insights
- The article discusses the significant advancements in AI driven by increased computational budgets and algorithmic innovations over the past decade [2][6].
- It highlights that while computational growth is measurable, the quantification of algorithmic progress remains unclear, particularly regarding efficiency improvements and their scalability [2][3].
Group 1: Algorithmic Progress
- Research estimates that algorithmic advancements have contributed over 4 orders of magnitude in effective compute over the past decade, while computational scale itself has increased by 7 orders of magnitude [2]; a sketch of how such compute-equivalent gains are estimated follows this summary.
- The overall efficiency of models has improved by approximately 22,000 times due to algorithmic innovations, allowing similar performance with significantly fewer floating-point operations (FLOPs) [3][4].
- Most algorithmic innovations yield only minor efficiency improvements, contributing less than a 10x overall efficiency gain when extrapolated to 2025's computational limits [4][11].
Group 2: Scale-Dependent Innovations
- Two major scale-dependent algorithmic innovations, the move from LSTM to Transformer and from Kaplan to Chinchilla scaling, account for 91% of the total efficiency improvement [4][22].
- The efficiency gains from algorithmic improvements are significantly larger in large-scale models than in small-scale models, indicating that algorithmic progress is heavily reliant on computational scale [6][25].
- The article suggests that the perceived rapid progress in algorithms may reflect increasing computational budgets more than continuous algorithmic breakthroughs [22][24].
Group 3: Experimental Findings
- The study employed various methods, including ablation studies and scaling experiments, to analyze the impact of individual algorithms and their combinations [5][8].
- The findings reveal a highly skewed distribution of efficiency improvements, with a few key innovations contributing disproportionately to overall gains [11][12].
- The scaling experiments demonstrate that improvements in neural network architectures are not scale-invariant but exhibit increasing returns to scale [20][21].
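As a rough illustration of how compute-equivalent gains of this kind are typically measured (the summary does not spell out the study's exact procedure), one can fit a loss-versus-compute power law for each algorithm and ask how much extra compute the older algorithm would need to match the newer one's loss. The coefficients below are hypothetical, chosen only to show that the measured gain can grow with the compute budget.

```python
def compute_equivalent_gain(a_old, b_old, a_new, b_new, compute_new):
    """Assume L(C) = a * C**(-b) for each algorithm, then find how much
    compute the old algorithm needs to reach the new one's loss at compute_new."""
    loss_new = a_new * compute_new ** (-b_new)
    compute_old = (a_old / loss_new) ** (1.0 / b_old)  # solve a_old * C**(-b_old) = loss_new
    return compute_old / compute_new

# Hypothetical scaling-law coefficients (NOT from the MIT study):
for c in (1e15, 1e21):
    gain = compute_equivalent_gain(a_old=10.0, b_old=0.050,
                                   a_new=9.0,  b_new=0.052, compute_new=c)
    print(f"compute budget {c:.0e}: compute-equivalent gain ~ {gain:,.0f}x")
```

Because the newer curve's exponent is slightly steeper in this toy example, the equivalent gain grows with the compute budget, which is the kind of scale dependence the study highlights.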
More effective than Adam: POET starts from a spectral invariance principle to make LLM training both stable and fast
机器之心· 2025-07-15 00:59
Core Viewpoint
- The article discusses a novel training paradigm for large language models (LLMs) called POET (Reparameterized Training via Orthogonal Equivalence Transformation), which aims to enhance training efficiency and stability based on first principles [2][3].
Group 1: POET Methodology
- POET structurally reparameterizes each neuron by incorporating two learnable orthogonal matrices and a fixed random weight matrix, maintaining the singular value distribution of the weights during training [3][11]; a minimal sketch of this reparameterization follows this summary.
- The method combines singular value invariance with minimal hyperspherical energy, providing a new paradigm that offers both physical interpretability and generalization capability for large-model training [3][11].
- POET's training process is designed to stabilize optimization and significantly improve model generalization performance [3][11].
Group 2: Advantages of POET
- POET maintains the spectral properties of the weight matrix throughout training, ensuring that the singular values remain consistent with those of the randomly initialized matrix [17].
- The method allows efficient parameter control and avoids the excessively large singular values that can occur in standard LLM training [17].
- Two new initialization strategies, normalized Gaussian initialization and uniform spectrum initialization, are proposed to ensure bounded singular values in the generated weight matrices [17].
Group 3: Training Dynamics and Performance
- Experimental results demonstrate POET's superior performance in training large language models, including improvements in perplexity and training efficiency compared with traditional methods such as AdamW [20][24].
- POET's training process is divided into three phases, conical shell searching, stable learning on the conical shell, and final adjusting, which reflects the evolution of the orthogonal matrices during training [40][41].
- A fully stochastic sampling approach allows POET to significantly reduce memory cost compared with traditional methods, enhancing scalability [26][27].
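To make the reparameterization concrete, here is a minimal PyTorch sketch of the core idea described above: a frozen random weight matrix is sandwiched between two trainable orthogonal matrices, so the singular values of the effective weight stay fixed at their initialized values throughout training. This is an illustrative sketch under assumptions, not the paper's implementation: the class name, the scaled-Gaussian initialization of the fixed matrix, and the use of PyTorch's built-in orthogonal parametrization are my own choices, and POET's stochastic sampling of the orthogonal factors and merge-back step are omitted.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class POETLinear(nn.Module):
    """Linear layer with effective weight R @ W0 @ Q, where W0 is a frozen
    random matrix and R, Q are trainable orthogonal matrices, so the
    spectrum of the effective weight never changes during training."""
    def __init__(self, in_features, out_features):
        super().__init__()
        w0 = torch.randn(out_features, in_features) / in_features ** 0.5
        self.register_buffer("w0", w0)  # fixed random weights (not trained)
        # Trainable square matrices constrained to stay orthogonal.
        self.R = orthogonal(nn.Linear(out_features, out_features, bias=False))
        self.Q = orthogonal(nn.Linear(in_features, in_features, bias=False))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def effective_weight(self):
        return self.R.weight @ self.w0 @ self.Q.weight

    def forward(self, x):
        return x @ self.effective_weight().T + self.bias

# Orthogonal transforms preserve singular values, so the effective weight's
# spectrum matches that of w0 before, during, and after training.
layer = POETLinear(64, 32)
sv_w0 = torch.linalg.svdvals(layer.w0)
sv_eff = torch.linalg.svdvals(layer.effective_weight())
print(torch.allclose(sv_w0, sv_eff, atol=1e-4))
```

Because only R and Q receive gradients, the spectrum chosen at initialization (for example, by the normalized Gaussian or uniform spectrum schemes mentioned above) is preserved exactly while the optimizer acts on the orthogonal factors.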