Core Viewpoint - DeepSeek has introduced a new architecture called Manifold-Constrained Hyper-Connections (mHC) to address the instability issues in traditional hyper-connections during large-scale model training while maintaining significant performance gains [1][3][4]. Group 1: Introduction of mHC - The mHC framework extends the traditional Transformer’s single residual flow into a multi-flow parallel architecture, utilizing the Sinkhorn-Knopp algorithm to constrain the connection matrix on a doubly stochastic matrix manifold [1][4]. - The core objective of mHC is to retain the performance improvements from widening the residual flow while addressing training instability and excessive memory consumption [4][6]. Group 2: Challenges with Traditional Hyper-Connections - Traditional residual connections ensure stable signal transmission through identity mapping, but they face limitations due to the restricted width of information channels [3][6]. - Recent methods like Hyper-Connections (HC) have improved performance but introduced significant training instability and increased memory access overhead [3][6]. Group 3: Methodology of mHC - mHC projects the residual connection space onto a specific manifold to restore the identity mapping property while optimizing infrastructure for efficiency [4][9]. - The use of the Sinkhorn-Knopp algorithm allows the connection matrix to be projected onto the Birkhoff polytope, ensuring stability in signal propagation [4][10]. Group 4: Experimental Validation - Empirical results show that mHC not only resolves stability issues but also demonstrates exceptional scalability in large-scale training, such as with a 27 billion parameter model, increasing training time by only 6.7% while achieving significant performance improvements [4][29]. - In benchmark tests, mHC consistently outperformed baseline models and HC in various downstream tasks, indicating its effectiveness in large-scale pre-training [30][31]. Group 5: Infrastructure Design - DeepSeek has tailored infrastructure for mHC, including kernel fusion, selective recomputation, and enhanced communication strategies to minimize memory overhead and improve computational efficiency [17][21][23]. - The design choices, such as optimizing the order of operations and implementing mixed precision strategies, contribute to the overall efficiency of mHC [17][18].
刚刚,梁文锋署名,DeepSeek元旦新论文要开启架构新篇章
机器之心·2026-01-01 08:22