The ¥84 billion AI company joined by Lilian Weng and Danqi Chen releases its second paper
量子位·2025-09-27 04:46

Core Viewpoint
- The article covers the latest research paper from Thinking Machines, authored by Jeremy Bernstein, on "Modular Manifolds": a unified framework that constrains and optimizes the different layers/modules of a neural network to make training more stable and efficient [1][2].

Group 1: Research Motivation and Challenges
- The work targets fundamental challenges in neural network training: when the tensors involved (weights, activations, gradients) take on extreme values, training becomes unstable, gradients explode or vanish, and efficiency drops [2].
- The proposed approach, Modular Manifolds, applies constraints not only to individual weight tensors but also treats the entire network as a composite manifold structure [2][8].

Group 2: Why Manifold Constraints Matter
- The need for manifold constraints arises from instability in large-model training, where extreme weights, activations, or gradients cause overflow, vanishing values, and slow convergence [8].
- Normalization methods have been the gold-standard remedy for these issues, but normalizing the weight matrices themselves has received comparatively little attention [8][9].

Group 3: Benefits of Weight Normalization
- Normalizing the weight matrices can make training more stable and easier to tune, make model behavior more predictable, and increase robustness to external disturbances [9][10].

Group 4: Research Process Overview
- The exposition starts from a minimal example: training a parameter vector constrained to the unit sphere [11].
- Standard optimizers such as Adam or SGD can produce updates that leave the constraint set, motivating a manifold-aware update rule [12][13].

Group 5: Manifold Optimization Technique
- The manifold optimization step projects the gradient onto the tangent space, updates the parameters there, and then retracts the updated vector back onto the manifold (a minimal sketch appears at the end of this summary) [14].
- Different choices of manifold constraint and of how step lengths are measured yield different optimization algorithms [16].

Group 6: Extension to Matrix Parameters
- The construction is then extended from vector parameters to matrix parameters, in particular the weight matrices of Transformers, which can have thousands of dimensions [17].
- For matrices, the Stiefel manifold is proposed: column vectors stay orthonormal and the condition number equals 1, which aids numerical stability (see the Stiefel retraction sketch at the end) [18][20].

Group 7: Experimental Validation
- In a small-scale experiment on CIFAR-10, the resulting manifold Muon algorithm slightly outperformed AdamW in training/testing accuracy, although it was slower in wall-clock time [23][24].

Group 8: Modular Manifolds Concept
- Modular Manifolds treats each layer or module of the neural network as a separate manifold with its own norm and optimization method [26][27].
- These individual manifolds are combined into a larger manifold space, where a global mechanism constrains the overall update while still allowing local, per-module updates (see the composition sketch at the end) [29][30].

Group 9: Future Implications
- The methodology emphasizes co-designing the entire model training process; if it can be applied successfully to large Transformers or LLMs, it could significantly improve training efficiency and stability [31][32].
- The company has already achieved a valuation exceeding $12 billion, indicating strong market expectations for its research outcomes [52].
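
To make the project-update-retract loop from Group 5 concrete, here is a minimal NumPy sketch of gradient descent constrained to the unit sphere. The toy quadratic objective, step size, and iteration count are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def sphere_step(w, grad, lr):
    """One manifold-aware update on the unit sphere: project the gradient
    onto the tangent space at w, take a descent step, then retract."""
    # Tangent-space projection: remove the component of grad along w
    # (w is assumed to already have unit norm).
    tangent_grad = grad - np.dot(grad, w) * w
    # Gradient-descent step in the tangent space.
    w_new = w - lr * tangent_grad
    # Retraction: renormalize so the iterate lands back on the sphere.
    return w_new / np.linalg.norm(w_new)

# Toy objective: minimize f(w) = w^T A w over unit vectors w;
# the minimizer is the eigenvector for the smallest eigenvalue of A.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
A = A @ A.T / 8.0            # symmetric PSD with a modest spectral norm

w = rng.standard_normal(8)
w /= np.linalg.norm(w)
for _ in range(1000):
    w = sphere_step(w, grad=2 * A @ w, lr=0.05)

print("norm of w:", np.linalg.norm(w))                      # stays at 1.0
print("objective:", w @ A @ w)                              # near the minimum
print("smallest eigenvalue:", np.linalg.eigvalsh(A).min())
```

The same three-step pattern (project, update, retract) is what the matrix-valued constructions in Group 6 generalize.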
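
Group 6's Stiefel constraint can be illustrated with a retraction that keeps only the orthogonal polar factor of a matrix, computed here from an SVD. This sketch shows the constraint itself, not the manifold Muon update rule the paper pairs it with.

```python
import numpy as np

def retract_to_stiefel(M):
    """Map M to the closest matrix (in Frobenius norm) with orthonormal
    columns by keeping only the orthogonal polar factor of its SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
W = retract_to_stiefel(rng.standard_normal((64, 32)))   # start on the manifold

# A raw optimizer step (here a random perturbation standing in for a
# gradient update) generally pushes W off the Stiefel manifold ...
W_off = W - 0.1 * rng.standard_normal(W.shape)

# ... so each step ends with a retraction back onto it.
W_new = retract_to_stiefel(W_off)

print("orthonormal columns:", np.allclose(W_new.T @ W_new, np.eye(32)))
print("condition number:", np.linalg.cond(W_new))       # 1.0 up to float error
```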
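
Finally, a deliberately simplified sketch of the Group 8 idea: each module keeps its own norm and retraction, and one shared budget constrains the largest per-module update before each module retracts onto its own manifold. The module names, norms, shapes, and budget value are assumptions for illustration; the paper's exact weighting and learning-rate allocation are not reproduced here.

```python
import numpy as np

def polar(W):
    """Orthogonal polar factor of W, used here as a Stiefel retraction."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)

# Each module carries its own parameters, its own norm for measuring update
# size, and its own retraction (identity for an unconstrained module).
modules = {
    "embed":  {"param": rng.standard_normal((32, 16)),
               "norm": np.linalg.norm,                    # Frobenius norm
               "retract": lambda W: W},                   # unconstrained
    "linear": {"param": polar(rng.standard_normal((16, 16))),
               "norm": lambda G: np.linalg.norm(G, 2),    # spectral norm
               "retract": polar},                         # Stiefel manifold
}

def modular_step(modules, grads, budget=0.1):
    """One composite update: measure each module's raw step in its own norm,
    rescale by a single global factor so the largest per-module step equals
    the shared budget, then let each module retract onto its own manifold."""
    worst = max(m["norm"](grads[name]) for name, m in modules.items())
    scale = budget / (worst + 1e-12)
    for name, m in modules.items():
        m["param"] = m["retract"](m["param"] - scale * grads[name])

# Fake gradients, just to exercise the composite update once.
grads = {name: rng.standard_normal(m["param"].shape)
         for name, m in modules.items()}
modular_step(modules, grads)
print({name: round(np.linalg.cond(m["param"]), 3)
       for name, m in modules.items()})
```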