Residual Connections
DeepSeek's Latest Paper Explained: How Does mHC Train Stronger Models for Less Money? (Investment Notes, Issue 243)
36Kr· 2026-01-26 07:38
Core Insights
- DeepSeek has released a significant paper on Manifold-Constrained Hyper-Connections (mHC), focusing on the fundamental issue of how information flows stably through ultra-deep networks in large models, rather than on model parameters, data volume, or computational power [2]

Group 1: Residual Connections and Their Limitations
- The concept of residual connections, introduced by Kaiming He's team in 2015, is a milestone in AI development, allowing deeper neural networks by addressing the vanishing gradient problem [3]
- Prior to residual connections, neural networks were limited to depths of 20-30 layers because exponentially decaying gradients hindered effective feature learning [3][4]
- Residual connections introduced a "shortcut" for signal transmission, raising the depth of trainable networks from tens of layers to hundreds or thousands, forming the structural foundation of modern deep learning [4]

Group 2: Introduction of Hyper-Connections
- Hyper-Connections emerged as a response to the limitations of residual connections, allowing multiple pathways for information transfer within a model, akin to a relay race with multiple runners [6][7]
- This approach distributes information across multiple parallel channels and allows dynamic weight allocation during training, enhancing the model's ability to handle complex, multi-source information [6][7]

Group 3: Challenges with Hyper-Connections
- Hyper-Connections have a critical flaw: excessive freedom in information flow can unbalance the model's internal information flow and destabilize training [9]
- Models trained with Hyper-Connections can exhibit high loss volatility and even loss divergence, indicating unstable information transmission [9]

Group 4: The Solution - mHC
- mHC, or Manifold-Constrained Hyper-Connections, adds a crucial constraint to Hyper-Connections by employing a doubly stochastic matrix, ensuring that information is redistributed without amplification [11]
- This constraint prevents both signal explosion and signal decay, maintaining a stable flow of information throughout the network [13]
- mHC improves training stability and performance at the cost of only a 6.7% increase in training time, which is negligible compared to the savings in computational resources and debugging time [13][14]

Group 5: Implications for Future AI Development
- mHC strikes a new balance between stability and efficiency, reducing computational costs by approximately 30% and shortening product iteration cycles [14]
- It supports the development of larger models, addressing the stability bottleneck in scaling to hundreds of billions or trillions of parameters [16]
- The mHC framework demonstrates that "constrained freedom" is more valuable than "complete freedom," suggesting a shift in AI architecture design from experience-driven to theory-driven approaches [16]
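The "shortcut" idea above can be sketched in a few lines of NumPy. This toy comparison (illustrative dimensions and weights, not any model's actual architecture) shows a plain stack of layers losing its signal while the residual stack preserves it:

```python
import numpy as np

def layer(x, W):
    """A toy nonlinear layer: y = tanh(W @ x)."""
    return np.tanh(W @ x)

def residual_block(x, W):
    """Residual connection: output = x + F(x). The identity shortcut
    lets the signal pass through unchanged, so depth no longer
    implies decay."""
    return x + layer(x, W)

rng = np.random.default_rng(0)
C = 8
x = rng.standard_normal(C)

# Without the shortcut, 50 stacked tanh layers with small weights
# shrink the signal toward zero; with the shortcut it survives.
plain, skip = x.copy(), x.copy()
for _ in range(50):
    W = 0.1 * rng.standard_normal((C, C))
    plain = layer(plain, W)
    skip = residual_block(skip, W)

print(np.linalg.norm(plain))  # vanishingly small
print(np.linalg.norm(skip))   # same order of magnitude as the input
```

This is exactly the vanishing-signal problem the summary describes: the plain stack's output decays geometrically with depth, while the residual stack keeps the input's scale.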
Behind DeepSeek's Back-to-Back Papers: A Hidden Academic Relay
36Kr· 2026-01-16 01:28
Core Insights
- The article discusses the evolution of DeepSeek's research, particularly focusing on their recent papers, mHC and Conditional Memory, which build upon previous works by ByteSeed and others in the field of AI and deep learning [1][2].

Group 1: mHC and Its Innovations
- mHC builds on the Hyper-Connections (HC) framework proposed by ByteSeed, significantly improving the stability and scalability of deep learning models [4][7].
- The core innovation of mHC lies in expanding the width of residual streams and introducing dynamic hyper-connections, which enhances model capacity without increasing computational costs [4][6].
- mHC addresses the stability issues encountered in HC during large-scale training by implementing manifold constraints and optimizing infrastructure, making it suitable for industrial applications with trillions of parameters [7][8].

Group 2: Conditional Memory and N-gram Utilization
- The Conditional Memory paper introduces the concept of using an "Engram" to allow models to reference a large phrase dictionary, improving efficiency in answering straightforward questions [9][12].
- This approach contrasts with previous methods by suggesting that integrating N-gram lookups can free up computational resources for more complex reasoning tasks [10][13].
- DeepSeek's research indicates that allocating a portion of parameters to the Engram can yield better performance than relying solely on Mixture of Experts (MoE) models, revealing a U-shaped scaling law [13][19].

Group 3: Collaborative Research and Community Impact
- The collaboration between DeepSeek and ByteSeed exemplifies the value of open research in advancing AI technologies, showcasing how shared insights can lead to significant breakthroughs [19][20].
- The article highlights various innovative approaches from ByteSeed, such as UltraMem and Seed Diffusion Preview, which contribute to the ongoing evolution of deep learning architectures [20].
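The Engram lookup idea described above can be illustrated with a toy sketch. All names here (`engram_table`, `lookup_or_compute`) are hypothetical, not DeepSeek's API; the point is only the control flow of trying a static phrase-dictionary lookup before falling back to full computation:

```python
import numpy as np

rng = np.random.default_rng(1)
C = 16

# Hypothetical phrase dictionary: frequent n-grams map to
# precomputed vectors, so common phrases skip heavy compute.
engram_table = {
    ("the", "capital", "of"): rng.standard_normal(C),
    ("machine", "learning"): rng.standard_normal(C),
}

def expensive_compute(tokens):
    """Stand-in for a full forward pass over the tokens."""
    return np.tanh(sum(hash(t) % 97 for t in tokens)) * np.ones(C)

def lookup_or_compute(tokens):
    """Try the longest matching trailing n-gram first; fall back to compute."""
    for n in range(len(tokens), 1, -1):
        key = tuple(tokens[-n:])
        if key in engram_table:
            return engram_table[key], "lookup"
    return expensive_compute(tokens), "compute"

vec, path = lookup_or_compute(["the", "capital", "of"])
print(path)  # the frequent phrase is served from the static table
```

The trade-off the articles describe is exactly this: parameters spent on the table buy cheap answers for common patterns, freeing compute for inputs that miss the table.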
Behind DeepSeek's Back-to-Back Papers: A Hidden Academic Relay
机器之心· 2026-01-16 00:42
Core Insights
- The article discusses the evolution of deep learning architectures, particularly focusing on the advancements made by DeepSeek and ByteSeed in the context of residual connections and knowledge retrieval mechanisms [1][4].

Group 1: Deep Learning Architecture Evolution
- The introduction of residual connections by He et al. in 2015 addressed the issue of information loss in deep neural networks, becoming a foundational element in deep learning [6][15].
- ByteSeed's introduction of Hyper-Connections (HC) in 2024 significantly enhanced network topology complexity without increasing computational costs, marking a shift from traditional residual connections [8][9].
- DeepSeek's mHC builds upon HC by addressing its scalability issues, improving stability and memory-access efficiency for large-scale training [11][12].

Group 2: Knowledge Retrieval Mechanisms
- DeepSeek's "Conditional Memory" paper proposes a method for efficient knowledge retrieval using an "Engram" system, allowing models to reference a large phrase dictionary for common queries, thus saving computational resources [18][21].
- The research highlights the importance of parameter allocation between MoE (Mixture of Experts) and static storage modules, revealing that allocating 20%-25% of parameters to the Engram yields better performance [22].
- DeepSeek's approach integrates the Engram module into intermediate layers of the model, enhancing the efficiency of storage access and deep computation [22][23].

Group 3: Collaborative Research Impact
- The collaboration and knowledge sharing between DeepSeek and ByteSeed exemplify the value of open research in advancing AI technologies, as both teams build upon each other's findings [28][29].
- The article emphasizes the importance of continuous exploration and innovation in foundational technologies, which may not yield immediate commercial applications but contribute to long-term industry progress [31].
Just In: DeepSeek Drops a Blockbuster, Signed by Liang Wenfeng! A Brute-Force Optimization of AI Architecture
程序员的那些事· 2026-01-01 13:15
Core Insights
- DeepSeek introduced a new architecture called "Manifold-Constrained Hyper-Connections" (mHC), which enhances performance with only a 6.7% increase in training time on a 27-billion-parameter model [3][36].
- The mHC architecture optimizes the residual connection space by projecting matrices onto a constrained manifold, ensuring stability and significantly expanding the residual stream width without substantial computational costs [8][25].

Group 1: Performance Improvements
- In system-level benchmark tests, the mHC architecture consistently outperformed baseline models and Hyper-Connections (HC) across various tasks, demonstrating its effectiveness in large-scale pre-training [22][51].
- Specific performance metrics showed that mHC achieved a 2.1% improvement on the BBH benchmark and a 2.3% improvement on the DROP benchmark compared to HC [52][54].

Group 2: Technical Details
- The core idea of mHC is to restore identity-mapping properties under the topology of Hyper-Connections, giving it practical value in large-scale training and real-world foundation-model tasks [25].
- mHC employs a doubly stochastic matrix constraint to maintain stability while enhancing the interaction between residual streams, which is crucial for realizing the potential of multi-stream architectures [26][27].

Group 3: Engineering Optimizations
- The implementation of mHC involved several engineering optimizations, including reordering operations to improve efficiency and using mixed-precision strategies to maximize numerical accuracy without sacrificing computational speed [38][42].
- The DualPipe scheduling strategy was extended to effectively overlap communication and computation, addressing the significant communication delays introduced by the n-stream residual structure [46][48].
Just In: Signed by Liang Wenfeng, DeepSeek's New Year's Day Paper Opens a New Chapter in Architecture
华尔街见闻· 2026-01-01 12:20
Core Insights
- DeepSeek has introduced a new architecture called Manifold-Constrained Hyper-Connections (mHC) to address the instability issues of traditional hyper-connections during large-scale model training while maintaining significant performance gains [1][6][8].

Group 1: mHC Architecture
- The mHC architecture extends the single residual stream of traditional Transformers into a multi-stream parallel structure, utilizing the Sinkhorn-Knopp algorithm to constrain the connection matrix to the doubly stochastic matrix manifold [1][8].
- The core objective of mHC is to retain the performance improvements from widening the residual stream while resolving training instability and excessive memory consumption [8][9].
- Empirical evidence shows that mHC not only addresses stability issues but also scales well in large-scale training: on a 27-billion-parameter model it increased training time by only 6.7% while achieving significant performance improvements [8][32].

Group 2: Challenges with Traditional Hyper-Connections
- Traditional hyper-connections (HC) cause severe training instability and limited scalability because they fundamentally disrupt the inherent identity-mapping property, which is crucial for stable training [5][9].
- The widened information channels in HC also increase memory-access overhead, contributing to what is known as the "memory wall" problem [9][5].

Group 3: Implementation and Efficiency
- DeepSeek has designed a tailored infrastructure for mHC, which includes kernel fusion, selective recomputation, and an extended DualPipe communication-overlap strategy to minimize memory usage and enhance efficiency [23][25][27].
- The Sinkhorn-Knopp algorithm is employed to ensure that the residual connection matrix remains stable and adheres to the properties of a doubly stochastic matrix, which helps mitigate gradient-explosion issues [16][21].

Group 4: Experimental Validation
- The research team conducted language-model pre-training experiments to validate the effectiveness of mHC, comparing it against baseline models and traditional HC [28][32].
- Results from various downstream benchmark tests indicate that mHC consistently outperforms baseline models and often surpasses HC, demonstrating its effectiveness in large-scale pre-training [34][33].
- Scalability experiments reveal that mHC maintains its performance advantages even at higher computational budgets, showing only slight degradation in performance [36][37].
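The Sinkhorn-Knopp projection mentioned above has a standard textbook form, sketched below in NumPy (this is the generic algorithm, not DeepSeek's fused kernel): alternately normalize rows and columns of a positive matrix until it is approximately doubly stochastic, i.e. every row and every column sums to 1.

```python
import numpy as np

def sinkhorn_knopp(M, iters=50):
    """Project a real matrix toward the doubly stochastic manifold."""
    A = np.exp(M)  # exponentiate so all entries are strictly positive
    for _ in range(iters):
        A /= A.sum(axis=1, keepdims=True)  # normalize rows to sum to 1
        A /= A.sum(axis=0, keepdims=True)  # normalize columns to sum to 1
    return A

rng = np.random.default_rng(0)
P = sinkhorn_knopp(rng.standard_normal((4, 4)))

# Doubly stochastic: all row sums and column sums are ~1, so the
# mixing map can neither amplify nor kill total mass across streams.
print(P.sum(axis=1))  # each entry ~1
print(P.sum(axis=0))  # each entry ~1
```

The doubly stochastic property is what delivers the stability the article describes: a map whose rows and columns each sum to 1 only redistributes signal between streams rather than scaling it.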
DeepSeek Reworks Kaiming He's Residual Connection! Personally Signed by Liang Wenfeng, the First Major Upgrade in a Decade
Xin Lang Cai Jing· 2026-01-01 11:45
Core Insights
- DeepSeek has introduced an upgraded version of the residual connection, a fundamental component of deep learning proposed by Kaiming He in 2016, marking a significant evolution in the field [1][27].

Group 1: Residual Connections and Hyper-Connections
- Residual connections have remained unchanged for a decade, serving as the cornerstone of deep learning architectures by allowing signals to pass directly from shallow to deep layers without modification [5][31].
- Hyper-Connections (HC) widen the residual stream from C dimensions to n×C dimensions, introducing three learnable mapping matrices to manage information flow [7][32].
- Experiments by the DeepSeek team indicate that the Hres matrix, responsible for internal information exchange within the residual stream, contributes most of the performance improvement [7][32].

Group 2: Challenges with Hyper-Connections
- When HC is stacked across multiple layers, the composite mapping no longer retains the identity property, leading to sudden loss spikes and gradient fluctuations during training [9][34].
- The research team calculated that the amplification factor of HC's composite mapping peaked at 3000, meaning signals could be drastically amplified or attenuated during inter-layer propagation [10][35].

Group 3: Doubly Stochastic Matrix Constraints
- The core idea of the DeepSeek paper is to constrain the residual mapping matrix to a specific manifold formed by doubly stochastic matrices, known as the Birkhoff polytope [11][36].
- This constraint provides three key theoretical properties: norm preservation, closure under composition, and a geometric interpretation that enhances feature-fusion stability [14][39][40].
- The Sinkhorn-Knopp algorithm is employed to project any matrix onto this manifold, reducing the peak signal gain from 3000 in HC to approximately 1.6 in mHC [16][41].

Group 4: Engineering Optimizations
- Widening the residual stream incurs additional memory-access costs: detailed analysis shows that standard residual connections read 2C elements and write C elements, while HC requires significantly more [19][44].
- The DeepSeek team developed infrastructure optimizations, including kernel fusion and specialized kernels for the Sinkhorn-Knopp algorithm, to reduce memory access and improve computational efficiency [19][43].
- The paper presents an optimization formula for recomputation strategies, aligning recomputation boundaries with pipeline-stage boundaries for better performance [20][45].

Group 5: Experimental Validation
- The paper validates the proposed methods on MoE models of 3B, 9B, and 27B parameters with the expansion rate n set to 4, demonstrating stable training curves and a loss reduction of 0.021 compared to the baseline [22][47].
- In downstream task evaluations, mHC outperformed HC by 2.1% on the BBH reasoning task and 2.3% on the DROP reading-comprehension task, showing superior performance across most tasks [22][48].
- Internal large-scale training experiments confirmed these findings, with mHC introducing only 6.7% additional time overhead when n=4 [25][50].
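The gain contrast above (a peak of 3000 for HC versus roughly 1.6 for mHC) can be reproduced qualitatively in a toy experiment that assumes nothing about the paper's actual matrices: compose unconstrained near-identity mixing matrices versus doubly stochastic ones and compare the spectral norm of the product. A doubly stochastic matrix has spectral norm at most 1 (its maximum row sum and maximum column sum are both 1), so the constrained composite can never blow up.

```python
import numpy as np

n, L = 4, 24  # toy stream count and layer count

def compose(mats):
    """Product of per-layer mixing matrices: the composite map."""
    out = np.eye(n)
    for M in mats:
        out = M @ out
    return out

def sinkhorn(M, iters=50):
    """Project onto doubly stochastic matrices (generic algorithm)."""
    A = np.exp(M)
    for _ in range(iters):
        A /= A.sum(axis=1, keepdims=True)
        A /= A.sum(axis=0, keepdims=True)
    return A

rng = np.random.default_rng(2)
raw = [np.eye(n) + 0.3 * rng.standard_normal((n, n)) for _ in range(L)]
ds = [sinkhorn(M) for M in raw]

gain_free = np.linalg.norm(compose(raw), 2)  # unconstrained: gain compounds
gain_mhc = np.linalg.norm(compose(ds), 2)    # doubly stochastic: gain stays <= 1

print(gain_free, gain_mhc)  # the unconstrained gain is far larger
```

This is the "norm preservation" property in miniature: constraining each per-layer map bounds the composite map, which is why the constrained stack cannot exhibit the 3000x amplification seen in unconstrained HC.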
Just In: Signed by Liang Wenfeng, DeepSeek's New Year's Day Paper Opens a New Chapter in Architecture
Xin Lang Cai Jing· 2026-01-01 10:34
Core Insights
- DeepSeek has introduced a new architecture called Manifold-Constrained Hyper-Connections (mHC) aimed at addressing the instability issues of traditional hyper-connections during large-scale model training while maintaining significant performance gains [1][27][28].

Group 1: Architecture and Methodology
- The mHC architecture expands the traditional single residual stream of Transformers into a multi-stream parallel structure, utilizing the Sinkhorn-Knopp algorithm to constrain the connection matrix to the doubly stochastic matrix manifold [1][28].
- The core objective of mHC is to retain the performance improvements from widening the residual stream while resolving issues of training instability and excessive memory consumption [4][34].
- The research team has implemented infrastructure optimizations such as kernel fusion, selective recomputation, and an extended DualPipe communication strategy to offset the overhead of wider channels [31][34].

Group 2: Performance and Stability
- Empirical evidence shows that mHC not only resolves stability issues but also scales well in large-scale training: on a 27-billion-parameter model it increased training-time overhead by only 6.7% while achieving significant performance improvements [34][49].
- Evaluated against a baseline model, mHC reduced final loss by 0.021 and maintained a stable gradient-norm profile, indicating superior stability compared to traditional hyper-connections [49][50].

Group 3: Benchmarking and Results
- In various downstream benchmark tests, mHC consistently outperformed the baseline model and surpassed traditional hyper-connections in most tasks, achieving performance gains of 2.1% and 2.3% on specific tasks [51][52].
- Scalability experiments indicated that mHC maintains its performance advantages even under higher computational budgets, demonstrating robust effectiveness in large-scale scenarios [52][53].
DeepSeek Reworks Kaiming He's Residual Connection! Personally Signed by Liang Wenfeng, the First Major Upgrade in a Decade
量子位· 2026-01-01 10:32
Core Viewpoint
- The article discusses the evolution of the residual connection, a fundamental component in deep learning introduced by Kaiming He in ResNet, and presents Hyper-Connections (HC), an approach that improves performance but raises issues of signal amplification and stability in deep architectures [2][7][11].

Group 1: Residual Connections and Their Evolution
- Residual connections have been a cornerstone of deep learning since the introduction of ResNet in 2016, allowing signals to pass directly from shallow to deep layers without modification [7][9].
- The rise of Transformer architectures has made residual connections a standard feature in large language models like GPT and LLaMA [10].
- Hyper-Connections (HC) widen the residual stream from C dimensions to n×C dimensions, introducing three learnable mapping matrices to manage information flow [11].

Group 2: Performance and Stability Challenges
- Experiments by the DeepSeek team indicate that the Hres matrix, responsible for internal information exchange in HC, significantly enhances performance [12].
- However, when HC is stacked across multiple layers, the composite mapping loses its identity property, leading to issues such as sudden loss spikes and gradient fluctuations during training [14].
- The peak amplification factor of signals in HC can reach 3000, which risks signal distortion during inter-layer propagation [16].

Group 3: Theoretical Framework and Constraints
- The core idea of the DeepSeek paper is to constrain the residual mapping matrix to a specific manifold formed by doubly stochastic matrices, which guarantees three key theoretical properties: norm preservation, closure under composition, and a geometric interpretation [17][19].
- The Sinkhorn-Knopp algorithm is employed to project any matrix onto this manifold, effectively curbing the signal amplification observed in HC [21].

Group 4: Engineering Optimizations
- The paper details the memory-access costs of widening the residual stream, highlighting significant increases in read and write operations for HC compared to standard residual connections [24].
- To mitigate these costs, the team developed infrastructure optimizations, including the TileLang framework for fusing operations and specialized kernels for the Sinkhorn-Knopp algorithm [25][26].
- The paper also discusses pipeline-parallelism enhancements to overlap computation and communication, improving overall efficiency [27].

Group 5: Experimental Validation
- The paper validates the proposed methods on MoE models of 3B, 9B, and 27B parameters, with the expansion rate n set to 4 [30].
- On the 27B MoE model, the modified HC (mHC) demonstrated a stable training curve, achieving a loss reduction of 0.021 compared to the baseline while maintaining gradient stability [31].
- Performance improvements were noted in downstream tasks, with mHC outperforming both the baseline and HC in various benchmarks [32][35].
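The three mapping matrices described above can be sketched as a toy HC-style forward step. Shapes and names here are illustrative only (the real parameterization in the papers differs): one matrix mixes the n streams into the layer input, one scatters the layer output back across streams, and one (the Hres role) mixes stream-to-stream.

```python
import numpy as np

rng = np.random.default_rng(3)
n, C = 4, 8  # n parallel streams, each of width C

H_pre = rng.standard_normal(n) / n   # mixes n streams into the layer input
H_post = rng.standard_normal(n)      # scatters layer output back to streams
H_res = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # stream-to-stream mixing

def layer(x, W):
    """Stand-in for an ordinary Transformer sublayer."""
    return np.tanh(W @ x)

W = rng.standard_normal((C, C)) / np.sqrt(C)
streams = rng.standard_normal((n, C))  # widened residual state, shape (n, C)

x_in = H_pre @ streams                             # (C,) weighted mix of streams
y = layer(x_in, W)                                 # ordinary layer computation
streams = H_res @ streams + np.outer(H_post, y)    # mix streams, add layer output

print(streams.shape)  # widened state keeps shape (n, C)
```

The instability discussed in the article comes from stacking the `H_res @ streams` step across many layers: without a constraint, the product of the mixing matrices can drift far from the identity.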
ICML 2025 | Breaking the Residual Connection Bottleneck: Caiyun Technology & BUPT Propose MUDDFormer, Evolving the Transformer Further!
机器之心· 2025-06-27 08:06
Core Viewpoint
- The article discusses the introduction of Multiway Dynamic Dense (MUDD) connections as an effective alternative to residual connections in Transformers, significantly enhancing cross-layer information transfer in deep learning models [1][4].

Background
- Residual connections, introduced by Kaiming He in ResNet, have become foundational in deep learning and Transformer LLMs, but they still limit efficient information transfer across layers [1][7].
- MUDD connections dynamically establish cross-layer connections based on the current hidden state, addressing issues like representation collapse and information overload in residual streams [7][8].

Model Architecture
- The MUDDFormer architecture builds independent dynamic connections for the different information streams (Q, K, V, R), enhancing the model's ability to gather relevant information from previous layers [10][13].
- These dynamic connections enable the model to adaptively weight the information extracted from previous layers based on each token's context [11][13].

Experimental Evaluation
- MUDDPythia, a model with 2.8 billion parameters, matches the performance of much larger models (6.9 billion and 12 billion parameters) with only a 0.23% increase in parameters and a 0.4% increase in computation [4][18].
- MUDDFormer outperforms baseline models like Transformer++ across various model sizes, demonstrating significant computational-efficiency improvements [15][17].

Downstream Task Assessment
- In downstream tasks, MUDDPythia exhibits higher accuracy in 0-shot and 5-shot evaluations than equivalently sized Pythia models, indicating enhanced in-context learning capabilities [18][20].
- The model achieves a 2.4x efficiency gain over the 6.9-billion Pythia model and a 4.2x gain over the 12-billion Pythia model in specific evaluations [18][20].

Conclusion
- MUDDFormer improves on residual connections by establishing independent dynamic cross-layer connections for different information streams, enhancing cross-layer interaction and in-context learning in Transformers [25].
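The dynamic cross-layer aggregation idea can be sketched as follows. This is a toy illustration, not MUDDFormer's exact parameterization: weights over all earlier layers' outputs are computed from the current hidden state, so each token decides how much to read from each earlier layer.

```python
import numpy as np

rng = np.random.default_rng(4)
C, depth = 8, 5

history = [rng.standard_normal(C) for _ in range(depth)]  # per-layer hidden states
w_gate = rng.standard_normal((depth, C))  # hypothetical gating parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_dense_input(history, current):
    """Aggregate earlier layers with weights conditioned on `current`."""
    scores = w_gate @ current            # one score per earlier layer
    alpha = softmax(scores)              # dynamic, state-dependent weights
    return sum(a * h for a, h in zip(alpha, history))

x = dynamic_dense_input(history, history[-1])
print(x.shape)  # aggregated input has the model width C
```

Contrast this with a plain residual stream, which sums all earlier contributions with fixed, implicit weights of 1; conditioning the weights on the current state is what makes the connections "dynamic."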