Interpreting DeepSeek's Latest Paper: How Does mHC Train Stronger Models for Less Money? (Investment Notes No. 243)
36Kr · 2026-01-26 07:38

Core Insights
- DeepSeek has released a significant paper on Manifold-Constrained Hyper-Connections (mHC), which focuses on the fundamental issue of how information flows stably through ultra-deep networks in large models, rather than on model parameters, data volume, or computational power [2]

Group 1: Residual Connections and Their Limitations
- The concept of residual connections, introduced by Kaiming He's team in 2015, is a milestone in AI development; it allows much deeper neural networks by addressing the vanishing-gradient problem [3]
- Before residual connections, neural networks were limited to depths of roughly 20-30 layers because gradients decayed exponentially with depth, which hindered effective feature learning [3][4]
- Residual connections introduced a "shortcut" for signal transmission, raising the depth of trainable networks from tens of layers to hundreds or thousands and forming the structural foundation of modern deep learning [4]

Group 2: Introduction of Hyper-Connections
- Hyper-Connections emerged as a response to the limitations of residual connections, allowing multiple pathways for information transfer within a model, akin to a relay race run by several runners at once [6][7]
- This approach distributes information across multiple parallel channels and lets the weights between them be allocated dynamically during training, enhancing the model's ability to handle complex, multi-source information (a minimal sketch of both designs follows this summary) [6][7]

Group 3: Challenges with Hyper-Connections
- Hyper-Connections have a critical flaw: the excessive freedom in how information is routed makes them unstable and can unbalance the model's internal information flow [9]
- Training runs that use Hyper-Connections can show high volatility and loss divergence, indicating that information is not being transmitted stably [9]

Group 4: The Solution - mHC
- mHC, or Manifold-Constrained Hyper-Connections, adds a crucial constraint to Hyper-Connections by requiring the mixing weights to form a doubly stochastic matrix, so that information is redistributed without being amplified (see the second sketch after this summary) [11]
- This constraint prevents both signal explosion and signal decay, maintaining a stable flow of information throughout the network [13]
- mHC improves training stability and performance at the cost of only a 6.7% increase in training time, which is negligible next to the savings in computational resources and debugging time [13][14]

Group 5: Implications for Future AI Development
- mHC strikes a new balance between stability and efficiency, reducing computational costs by approximately 30% and shortening product iteration cycles [14]
- It supports the development of larger models, addressing the stability bottleneck in scaling to hundreds of billions or trillions of parameters [16]
- The mHC framework shows that "constrained freedom" is more valuable than "complete freedom", suggesting a shift in AI architecture design from experience-driven to theory-driven approaches [16]
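To make the contrast in Groups 1 and 2 concrete, below is a minimal PyTorch sketch, an illustration rather than DeepSeek's implementation: a classic residual block that adds a shortcut around a sub-layer, next to a hyper-connection-style block that keeps several parallel streams and mixes them with learned, unconstrained weights. The class names, the stream count, and the mixing scheme are assumptions made purely for illustration.

```python
# Minimal sketch (illustrative only, not DeepSeek's implementation).
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Classic residual connection: y = x + F(x)."""

    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        # The shortcut carries the input past F, so the gradient has a direct
        # path back through every layer instead of decaying exponentially.
        return x + self.f(x)


class HyperConnectionBlock(nn.Module):
    """Hyper-connection-style block: n parallel residual streams are mixed by a
    learned n x n matrix before the sub-layer F updates them. The stream count
    and mixing scheme here are assumptions for illustration."""

    def __init__(self, dim: int, n_streams: int = 4):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Unconstrained learned mixing weights between streams; this freedom is
        # what the article says can destabilize training.
        self.mix = nn.Parameter(torch.eye(n_streams))

    def forward(self, streams):
        # streams: (n_streams, batch, dim)
        mixed = torch.einsum("ij,jbd->ibd", self.mix, streams)
        update = self.f(mixed.mean(dim=0))  # combine the streams, apply F once
        return mixed + update               # add the update back to every stream


if __name__ == "__main__":
    x = torch.randn(8, 64)
    print(ResidualBlock(64)(x).shape)                  # torch.Size([8, 64])
    streams = torch.randn(4, 8, 64)
    print(HyperConnectionBlock(64, 4)(streams).shape)  # torch.Size([4, 8, 64])
```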
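Group 4's key ingredient is the doubly stochastic mixing matrix. The sketch below shows one common way such a constraint can be enforced, Sinkhorn normalization; the article does not give the paper's exact parameterization, so treat this as a hedged example of the general idea: every row and every column of the mixing matrix sums to one, so signal is redistributed across streams rather than amplified or damped.

```python
# Sketch of a doubly stochastic constraint via Sinkhorn normalization
# (one common technique; the paper's exact method may differ).
import torch


def sinkhorn(logits: torch.Tensor, n_iters: int = 30) -> torch.Tensor:
    """Map an n x n matrix of unconstrained scores to an (approximately)
    doubly stochastic matrix by alternately normalizing rows and columns
    in log space."""
    log_p = logits
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows sum to 1
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # columns sum to 1
    return log_p.exp()


if __name__ == "__main__":
    raw = torch.randn(4, 4)   # unconstrained mixing weights between 4 streams
    mix = sinkhorn(raw)
    print(mix.sum(dim=1))     # each row ~1.0: no stream's outgoing signal is amplified
    print(mix.sum(dim=0))     # each column ~1.0: no stream's incoming signal is starved
```

Because every row and column sums to one, applying such a matrix to the parallel streams only reallocates signal between them; the total is preserved, which matches the "redistribution without amplification" behavior the article attributes to mHC.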