With Liang Wenfeng's Name on It, DeepSeek's Paper Ignites the AI Community: the mHC Architecture Debuts! Netizens: The Engineering Difficulty Is Hell-Level
AI前线 · 2026-01-02 06:00
Core Insights
- DeepSeek has introduced a new network architecture, mHC (Manifold-Constrained Hyper-Connections), aimed at fixing the numerical instability and signal-explosion problems that arise in large-scale model training, while retaining the performance gains of Hyper-Connections [2][5][6]

Problem Addressed by the Architecture
- Traditional Transformer networks rely on residual connections to keep signal propagation stable, which is crucial for training deep models. Hyper-Connections (HC), however, introduce instability: their unconstrained connection matrices cause signal explosion and gradient problems during large-scale training [6][7]
- The mHC architecture adds a geometric constraint by projecting the residual-mapping space onto a specific manifold, keeping the connection matrix within the set of doubly stochastic matrices; this restores the identity-mapping property and stabilizes signal norms [6][10]

Technical Implementation
- The research team uses the Sinkhorn-Knopp algorithm to enforce the projection constraint, optimizing the connection matrix while keeping the extra overhead low enough to preserve training efficiency [11][12]
- During training, the model learns an ordinary real-valued matrix, which is projected to an approximately doubly stochastic matrix before each forward pass, ensuring the connections stay on the safe manifold [12]

Experimental Results
- Experiments at parameter scales of 3 billion, 9 billion, and 27 billion showed that mHC avoids the training-convergence failures common with vanilla HC while matching or even improving performance across a range of tasks [12][15]

Broader Implications
- The significance of mHC lies not in replacing the Transformer paradigm but in providing a scalable theoretical and engineering framework for exploring complex residual topologies. It highlights the value of explicitly constraining model structures to geometrically well-behaved spaces as a systematic way to address stability issues [12][14]
- This approach opens avenues for future designs of more complex multi-stream, multi-path networks that balance greater expressiveness with controllable trainability [12][14]
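The core mechanism the article describes, projecting a learned real-valued matrix to an approximately doubly stochastic one via Sinkhorn-Knopp before each forward pass, can be sketched as below. This is a minimal illustration, not DeepSeek's implementation: the elementwise-exponential parameterization, the iteration count, and the 4x4 toy size are assumptions, and the real mHC applies such matrices to multi-stream residual states inside a Transformer.

```python
import numpy as np

def sinkhorn_knopp(M, n_iters=100):
    """Project a real-valued matrix toward a doubly stochastic one
    (non-negative entries, every row and column summing to 1) by
    alternately normalizing rows and columns. exp() makes entries
    positive first; this parameterization is an assumption here."""
    A = np.exp(M)
    for _ in range(n_iters):
        A /= A.sum(axis=1, keepdims=True)  # rows sum to 1
        A /= A.sum(axis=0, keepdims=True)  # columns sum to 1
    return A

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))   # unconstrained learned matrix
P = sinkhorn_knopp(W)         # projected before the forward pass

# Why this stabilizes training: a doubly stochastic matrix is a convex
# combination of permutation matrices (Birkhoff's theorem), so its
# spectral norm is at most 1 and repeated mixing cannot amplify the
# residual stream, unlike an unconstrained connection matrix.
x = rng.normal(size=4)
for _ in range(50):
    x = P @ x                 # norm stays bounded across 50 "layers"
print(np.linalg.norm(x))

y = rng.normal(size=4)
for _ in range(50):
    y = W @ y                 # typically explodes (or vanishes)
print(np.linalg.norm(y))
```

Because the all-ones vector is an eigenvector of any doubly stochastic matrix, a uniform residual stream passes through `P` unchanged, which is the "restored identity mapping" property the article attributes to mHC.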