Core Insights
- DeepSeek has introduced an upgraded version of the residual connection, a fundamental component of deep learning proposed by Kaiming He in 2016, marking its first significant evolution in a decade [1][27].

Group 1: Residual Connections and Hyper-Connections
- Residual connections have remained essentially unchanged for a decade, serving as a cornerstone of deep learning architectures by allowing signals to pass directly from shallow to deep layers without modification [5][31].
- Hyper-Connections (HC) widen the residual stream from C dimensions to n×C dimensions and introduce three learnable mapping matrices to manage information flow (a minimal sketch follows this summary) [7][32].
- Experiments by the DeepSeek team indicate that the Hres matrix, which handles information exchange within the residual stream, contributes significantly to the performance improvement [7][32].

Group 2: Challenges with Hyper-Connections
- When HC is stacked across multiple layers, the composite mapping no longer preserves the identity property, leading to sudden loss spikes and gradient fluctuations during training [9][34].
- The research team calculated that the amplification factor of HC's composite mapping peaked at around 3000, meaning signals can be drastically amplified or attenuated as they propagate between layers [10][35].

Group 3: Doubly Stochastic Matrix Constraints
- The core idea of the DeepSeek paper is to constrain the residual mapping matrix to a specific manifold formed by doubly stochastic matrices, known as the Birkhoff polytope [11][36].
- This constraint yields three key theoretical properties: norm preservation, closure under composition, and a geometric interpretation that makes feature fusion more stable [14][39][40].
- The Sinkhorn-Knopp algorithm is used to project any matrix onto this manifold, reducing the signal gain from roughly 3000 in HC to approximately 1.6 in mHC (see the projection sketch below) [16][41].

Group 4: Engineering Optimizations
- Widening the residual stream incurs additional memory-access costs: detailed analysis shows that a standard residual connection reads 2C elements and writes C elements, while HC requires significantly more [19][44].
- The DeepSeek team developed infrastructure optimizations, including kernel fusion and a specialized kernel for the Sinkhorn-Knopp algorithm, to reduce memory access and improve computational efficiency [19][43].
- The paper also presents an optimization formula for the recomputation strategy, aligning recomputation boundaries with pipeline-stage boundaries for better performance [20][45].

Group 5: Experimental Validation
- The proposed method is validated on MoE models of 3B, 9B, and 27B parameters with the expansion rate n set to 4, demonstrating stable training curves and a loss reduction of 0.021 relative to the baseline [22][47].
- In downstream evaluations, mHC outperformed HC by 2.1% on the BBH reasoning task and 2.3% on the DROP reading-comprehension task, showing superior performance across most tasks [22][48].
- Internal large-scale training experiments confirmed these findings, with mHC introducing only a 6.7% additional time overhead at n=4 [25][50].
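To make the Group 1 description more concrete, here is a minimal sketch contrasting a standard residual connection with a hyper-connection-style step that widens the residual stream from C to n×C and mixes the n streams with a learnable matrix. The function and parameter names (layer_fn, H_res, h_in, h_out) and the exact factorization are illustrative assumptions, not the notation of the HC or mHC papers.

```python
import numpy as np

def layer_fn(x, W):
    """Stand-in for a transformer sub-layer: a single linear map plus a nonlinearity."""
    return np.tanh(x @ W)

def standard_residual(x, W):
    """Classic residual connection: the input is added back unchanged (identity path)."""
    return x + layer_fn(x, W)

def hyper_connection_step(streams, W, H_res, h_in, h_out):
    """One hyper-connection-style step over n parallel residual streams.

    streams: (n, C) widened residual state (n streams of width C)
    H_res:   (n, n) learnable matrix exchanging information across the n streams
    h_in:    (n,)   learnable weights combining the streams into the layer input
    h_out:   (n,)   learnable weights distributing the layer output back to the streams
    """
    layer_input = h_in @ streams               # (C,) combine streams for the sub-layer
    layer_output = layer_fn(layer_input, W)    # (C,) sub-layer computation
    mixed = H_res @ streams                    # (n, C) mix information between streams
    return mixed + np.outer(h_out, layer_output)

rng = np.random.default_rng(0)
C, n = 8, 4
x = rng.normal(size=C)
W = rng.normal(size=(C, C)) / np.sqrt(C)
print(standard_residual(x, W).shape)           # (8,)
streams = np.tile(x, (n, 1))                   # widen the residual stream: C -> n x C
out = hyper_connection_step(streams, W,
                            H_res=np.eye(n), h_in=np.full(n, 1 / n), h_out=np.ones(n))
print(out.shape)                               # (4, 8)
```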
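The Group 2 and Group 3 points can be illustrated together: composing unconstrained mixing matrices across many layers lets the end-to-end gain explode, while projecting each matrix onto the doubly stochastic set with Sinkhorn-Knopp-style row/column normalization keeps the composite gain near 1. This is a generic toy demonstration on assumed random matrices, not a reproduction of the paper's 3000-vs-1.6 measurements.

```python
import numpy as np

def sinkhorn_knopp(A, num_iters=50, eps=1e-9):
    """Alternately normalize rows and columns of a positive matrix, driving it
    toward the doubly stochastic set (the Birkhoff polytope)."""
    M = np.maximum(A, eps)
    for _ in range(num_iters):
        M = M / M.sum(axis=1, keepdims=True)   # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)   # columns sum to 1
    return M

def composite_gain(mats):
    """Spectral norm of the product of per-layer mixing matrices,
    i.e. how much a signal can be amplified end to end."""
    P = np.eye(mats[0].shape[0])
    for M in mats:
        P = M @ P
    return np.linalg.norm(P, ord=2)

rng = np.random.default_rng(0)
n, depth = 4, 60
raw = [np.abs(rng.normal(size=(n, n))) + 0.1 for _ in range(depth)]   # unconstrained mixing
projected = [sinkhorn_knopp(M) for M in raw]                          # doubly stochastic mixing

print("unconstrained gain:", composite_gain(raw))           # grows explosively with depth
print("doubly stochastic gain:", composite_gain(projected)) # stays close to 1
```

The bounded gain reflects the two properties the summary calls norm preservation and closure under composition: a doubly stochastic matrix has spectral norm 1, and the product of doubly stochastic matrices is itself doubly stochastic.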
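For the Group 4 memory-traffic point, a back-of-the-envelope count per layer: the standard-residual figures (read 2C, write C) come from the summary above, while the widened-stream figures are a naive assumption about how an n×C residual stream would be read and written, not the paper's measured analysis.

```python
def residual_io(C):
    """Standard residual add: read x and F(x), write x + F(x)."""
    return {"read": 2 * C, "write": C}

def widened_io(C, n):
    """Naive count for an n-times-wider residual stream (assumption): read the
    n*C stream elements plus the C-dim layer output, write the updated n*C stream."""
    return {"read": n * C + C, "write": n * C}

C, n = 4096, 4
print(residual_io(C))    # {'read': 8192, 'write': 4096}
print(widened_io(C, n))  # {'read': 20480, 'write': 16384}
```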
DeepSeek overhauls Kaiming He's residual connection! Liang Wenfeng personally signs the paper, the first major upgrade in a decade