Hyper-Connections
Interpreting DeepSeek's Latest Paper: How Does mHC Train Stronger Models for Less Money? — Investment Notes, Issue 243
36Kr · 2026-01-26 07:38
Core Insights
- DeepSeek has released a significant paper on Manifold-Constrained Hyper-Connections (mHC), focusing on the fundamental issue of how information flows stably through ultra-deep networks in large models, rather than on model parameters, data volume, or computational power [2]

Group 1: Residual Connections and Their Limitations
- The concept of residual connections, introduced by Kaiming He's team in 2015, is a milestone in AI development, allowing deeper neural networks by addressing the vanishing-gradient problem [3]
- Before residual connections, neural networks were limited to depths of 20-30 layers because gradients decayed exponentially with depth, hindering effective feature learning [3][4]
- Residual connections introduced a "shortcut" for signal transmission, raising the depth of trainable networks from tens of layers to hundreds or thousands and forming the structural foundation of modern deep learning [4]

Group 2: Introduction of Hyper-Connections
- Hyper-Connections emerged to overcome the limitations of residual connections, opening multiple pathways for information transfer within a model, akin to a relay race run by several runners at once [6][7]
- This approach distributes information across multiple parallel channels and allows dynamic weight allocation during training, enhancing the model's ability to handle complex, multi-source information [6][7]

Group 3: Challenges with Hyper-Connections
- Hyper-Connections suffer a critical flaw: the excessive freedom they grant to information flow can destabilize the model's internal information flow [9]
- Models trained with Hyper-Connections can exhibit high volatility and loss divergence, indicating unstable information transmission [9]

Group 4: The Solution - mHC
- mHC (Manifold-Constrained Hyper-Connections) adds a crucial constraint to Hyper-Connections: a doubly stochastic connection matrix ensures that information is redistributed without amplification [11]
- This constraint prevents both signal explosion and signal decay, maintaining a stable flow of information throughout the network [13]
- mHC improves training stability and performance at the cost of only a 6.7% increase in training time, negligible next to the savings in computational resources and debugging time [13][14]

Group 5: Implications for Future AI Development
- mHC strikes a new balance between stability and efficiency, reducing computational costs by approximately 30% and shortening product iteration cycles [14]
- It supports the development of larger models, addressing the stability bottleneck in scaling to hundreds of billions or trillions of parameters [16]
- mHC demonstrates that "constrained freedom" is more valuable than "complete freedom," suggesting a shift in AI architecture design from experience-driven to theory-driven approaches [16]
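The "shortcut" effect described in Group 1 can be illustrated with a toy layer stack. This is a minimal NumPy sketch under illustrative assumptions (tanh layers, width 16, depth 50 — none of these come from the paper), not any model's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w):
    # A toy "layer": linear map followed by a tanh nonlinearity.
    return np.tanh(x @ w)

def forward(x, weights, residual):
    # Stack layers with or without a residual shortcut.
    for w in weights:
        x = x + layer(x, w) if residual else layer(x, w)
    return x

d, depth = 16, 50
weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(depth)]
x0 = rng.normal(size=d)

plain = forward(x0, weights, residual=False)
skip = forward(x0, weights, residual=True)

# Without shortcuts the signal collapses toward zero as depth grows;
# with them, the input's scale survives the full stack.
print(np.linalg.norm(plain), np.linalg.norm(skip))
```

The norm of the plain stack's output shrinks geometrically with depth, which is the forward-pass counterpart of the vanishing-gradient problem the article describes.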
Just In: DeepSeek's New Year's Day Paper, Co-Signed by Liang Wenfeng, Aims to Open a New Chapter in Architecture
Wallstreetcn · 2026-01-01 12:20
Core Insights
- DeepSeek has introduced a new architecture called Manifold-Constrained Hyper-Connections (mHC) to address the instability of traditional hyper-connections in large-scale model training while preserving their significant performance gains [1][6][8].

Group 1: mHC Architecture
- The mHC architecture extends the single residual stream of traditional Transformers into a multi-stream parallel structure, using the Sinkhorn-Knopp algorithm to constrain the connection matrix to the manifold of doubly stochastic matrices [1][8].
- The core objective of mHC is to retain the performance gains of widening the residual stream while resolving training instability and excessive memory consumption [8][9].
- Empirically, mHC not only fixes the stability issues but also scales well in large-scale training: on a 27-billion-parameter model it increased training time by only 6.7% while delivering significant performance improvements [8][32].

Group 2: Challenges with Traditional Hyper-Connections
- Traditional hyper-connections (HC) cause severe training instability and limit scalability because they disrupt the identity mapping property that is fundamental to stable training [5][9].
- Widening the information channels in HC also increases memory-access overhead, aggravating the "memory wall" problem [9][5].

Group 3: Implementation and Efficiency
- DeepSeek designed tailored infrastructure for mHC, including kernel fusion, selective recomputation, and an extended DualPipe communication-overlap strategy, to minimize memory usage and improve efficiency [23][25][27].
- The Sinkhorn-Knopp algorithm keeps the residual connection matrix doubly stochastic and stable, which helps mitigate gradient explosion [16][21].

Group 4: Experimental Validation
- The research team validated mHC with language-model pre-training experiments, comparing it against baseline models and traditional HC [28][32].
- Across downstream benchmarks, mHC consistently outperforms the baselines and often surpasses HC, demonstrating its effectiveness in large-scale pre-training [34][33].
- Scalability experiments show that mHC maintains its performance advantage at higher computational budgets, with only slight degradation [36][37].
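The Sinkhorn-Knopp projection the article refers to can be sketched in a few lines of NumPy: alternately normalize rows and columns of a positive matrix until both sum to one. The matrix size and iteration count below are illustrative assumptions, not DeepSeek's settings:

```python
import numpy as np

def sinkhorn_knopp(logits, n_iters=20):
    """Project a square matrix of scores onto (approximately) the set of
    doubly stochastic matrices by alternating row/column normalization."""
    m = np.exp(logits)  # exponentiate so every entry is strictly positive
    for _ in range(n_iters):
        m /= m.sum(axis=1, keepdims=True)  # make each row sum to 1
        m /= m.sum(axis=0, keepdims=True)  # make each column sum to 1
    return m

rng = np.random.default_rng(0)
H = sinkhorn_knopp(rng.normal(size=(4, 4)))
print(H.sum(axis=0), H.sum(axis=1))  # both close to [1, 1, 1, 1]
```

Because every row and column of the resulting mixing matrix sums to one, the total "mass" of the residual streams is redistributed rather than amplified, which is the stability property the article attributes to mHC.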
DeepSeek's Latest Release!
Securities Times · 2026-01-01 10:53
Core Viewpoint
- DeepSeek has introduced a new architecture called mHC (Manifold-Constrained Hyper-Connections), aimed at addressing the instability of traditional hyper-connections in large-scale model training while maintaining significant performance gains [1][3].

Summary by Sections

Introduction of mHC
- DeepSeek's new paper presents mHC, which projects the hyper-connection's residual connection space onto a specific manifold to restore the identity mapping property, with rigorous infrastructure optimization to ensure operational efficiency [3][4].

Performance and Scalability
- Empirical results indicate that mHC effectively supports large-scale training, with an additional time overhead of only 6.7% when the expansion rate is set to 4 [4][6].

Research Directions
- mHC opens up several important research directions, including manifold constraints tailored to specific learning objectives and, through deeper study of differential-geometric constraints, potential new methods for balancing plasticity and stability [7].

Recent Developments
- DeepSeek has been active recently, releasing two official model versions, DeepSeek-V3.2 and DeepSeek-V3.2-Speciale, with the former achieving performance comparable to GPT-5 in benchmark tests [8].
- DeepSeek-V3.2-Speciale combines enhanced reasoning capabilities with mathematical-proof abilities and performs well on mainstream reasoning benchmarks [8].
- The earlier release of DeepSeek-V3.2-Exp introduced a sparse attention mechanism aimed at improving training and inference efficiency on long texts, along with a significant reduction in API costs for developers [9].

Recognition in the Scientific Community
- DeepSeek's research paper on the DeepSeek-R1 reasoning model was featured on the cover of the prestigious journal Nature, a significant milestone for Chinese AI technology in the international scientific community [9][10].
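One way to see why the manifold constraint "restores the identity mapping property": the identity matrix itself has unit row and column sums, so the plain residual shortcut is still representable inside the constrained family. A trivial NumPy check (matrix size is arbitrary):

```python
import numpy as np

n = 4
I = np.eye(n)
# Every row and every column of the identity sums to 1, so the identity
# mapping satisfies the doubly stochastic constraint: constraining the
# connection matrix does not rule out the classic residual shortcut.
row_ok = np.allclose(I.sum(axis=1), 1.0)
col_ok = np.allclose(I.sum(axis=0), 1.0)
print(row_ok, col_ok)  # True True
```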
Just In: DeepSeek's New Year's Day Paper, Co-Signed by Liang Wenfeng, Aims to Open a New Chapter in Architecture
Sina Finance · 2026-01-01 10:34
Core Insights
- DeepSeek has introduced a new architecture called Manifold-Constrained Hyper-Connections (mHC), aimed at addressing the instability of traditional hyper-connections in large-scale model training while maintaining significant performance gains [1][27][28].

Group 1: Architecture and Methodology
- The mHC architecture expands the traditional single residual stream of Transformers into a multi-stream parallel structure, using the Sinkhorn-Knopp algorithm to constrain the connection matrix to the manifold of doubly stochastic matrices [1][28].
- The core objective of mHC is to retain the performance gains of widening the residual stream while resolving training instability and excessive memory consumption [4][34].
- The research team implemented infrastructure optimizations such as kernel fusion, selective recomputation, and an extended DualPipe communication strategy to offset the overhead of wider channels [31][34].

Group 2: Performance and Stability
- Empirically, mHC not only resolves the stability issues but also scales exceptionally well: on a 27-billion-parameter model it increased training time by only 6.7% while achieving significant performance improvements [34][49].
- Evaluated against a baseline model, mHC reduced final loss by 0.021 and maintained a stable gradient-norm profile, indicating superior stability compared with traditional hyper-connections [49][50].

Group 3: Benchmarking and Results
- Across downstream benchmarks, mHC consistently outperformed the baseline and surpassed traditional hyper-connections on most tasks, with gains of 2.1% and 2.3% on specific tasks [51][52].
- Scalability experiments indicate that mHC maintains its performance advantage even under higher computational budgets, demonstrating robust effectiveness in large-scale scenarios [52][53].
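The stability contrast between an unconstrained inter-stream mix and a doubly stochastic one can be simulated crudely. This toy (a fixed mixing matrix applied repeatedly, with no layer computation in between) illustrates the norm argument only; it is not the paper's experiment, and the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, depth = 4, 8, 100
streams = rng.normal(size=(n, d))  # n parallel residual streams of width d

def remix(M, x, depth):
    # Apply the same inter-stream connection matrix at every "layer".
    for _ in range(depth):
        x = M @ x
    return x

# Unconstrained positive mix (HC-style): nothing pins its gain to 1.
free = np.abs(rng.normal(size=(n, n)))
# Doubly stochastic mix (mHC-style): a convex blend of the identity and a
# permutation, so every row and every column sums to exactly 1.
P = np.eye(n)[[1, 2, 3, 0]]
ds = 0.5 * np.eye(n) + 0.5 * P

norm_free = np.linalg.norm(remix(free, streams, depth))
norm_ds = np.linalg.norm(remix(ds, streams, depth))
print(norm_free, norm_ds)  # the free mix blows up; the constrained one stays O(1)
```

A doubly stochastic matrix has spectral radius 1, so repeated remixing can neither explode nor annihilate the streams' shared component, matching the "redistribution without amplification" framing above.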
Just In: DeepSeek's New Year's Day Paper, Co-Signed by Liang Wenfeng, Aims to Open a New Chapter in Architecture
Jiqizhixin · 2026-01-01 08:22
Core Viewpoint
- DeepSeek has introduced a new architecture called Manifold-Constrained Hyper-Connections (mHC) to address the instability of traditional hyper-connections in large-scale model training while maintaining significant performance gains [1][3][4].

Group 1: Introduction of mHC
- The mHC framework extends the traditional Transformer's single residual stream into a multi-stream parallel architecture, using the Sinkhorn-Knopp algorithm to constrain the connection matrix to the manifold of doubly stochastic matrices [1][4].
- The core objective of mHC is to retain the performance gains of widening the residual stream while addressing training instability and excessive memory consumption [4][6].

Group 2: Challenges with Traditional Hyper-Connections
- Traditional residual connections ensure stable signal transmission through identity mapping, but the restricted width of their information channel limits performance [3][6].
- Recent methods such as Hyper-Connections (HC) improve performance but introduce significant training instability and additional memory-access overhead [3][6].

Group 3: Methodology of mHC
- mHC projects the residual connection space onto a specific manifold to restore the identity mapping property, while optimizing infrastructure for efficiency [4][9].
- The Sinkhorn-Knopp algorithm projects the connection matrix onto the Birkhoff polytope, ensuring stable signal propagation [4][10].

Group 4: Experimental Validation
- Empirical results show that mHC not only resolves the stability issues but also scales exceptionally well: on a 27-billion-parameter model it increased training time by only 6.7% while achieving significant performance improvements [4][29].
- In benchmark tests, mHC consistently outperformed baseline models and HC across downstream tasks, indicating its effectiveness in large-scale pre-training [30][31].

Group 5: Infrastructure Design
- DeepSeek tailored infrastructure for mHC, including kernel fusion, selective recomputation, and enhanced communication strategies, to minimize memory overhead and improve computational efficiency [17][21][23].
- Design choices such as optimizing the order of operations and implementing mixed-precision strategies further contribute to mHC's overall efficiency [17][18].
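The Birkhoff polytope mentioned in Group 3 is the set of doubly stochastic matrices; by the Birkhoff-von Neumann theorem these are exactly the convex combinations of permutation matrices. A quick NumPy check (the 3x3 size and Dirichlet weights are illustrative):

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(2)
n = 3
# All 3x3 permutation matrices (3! = 6 of them).
perms = [np.eye(n)[list(p)] for p in permutations(range(n))]
# Convex weights: non-negative and summing to 1.
weights = rng.dirichlet(np.ones(len(perms)))
B = sum(w * P for w, P in zip(weights, perms))

# Any convex combination of permutation matrices lies in the Birkhoff
# polytope, i.e. it is doubly stochastic.
print(B.sum(axis=0), B.sum(axis=1))  # both close to [1, 1, 1]
```

This decomposition gives an intuition for why the constraint preserves identity mapping: the identity is one of the permutation vertices, so the constrained family always contains the plain residual connection as a special case.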