Doubly Stochastic Matrix
New ds (DeepSeek) Paper
小熊跑的快· 2026-01-04 11:31
Core Viewpoint
- The article discusses advancements in deep learning models, focusing on the introduction of the mHC (Manifold-Constrained Hyper-Connections) method, which enhances information flow between layers in large models while maintaining computational efficiency and stability [1][2].

Group 1: Traditional Models and Innovations
- Traditional models break problems down into smaller units, converting them into vectors processed through many Transformer layers, where information can diminish and noise can accumulate, leading to potential loss of critical data [1].
- ResNet, introduced in 2015, proposed residual connections, which add information from previous layers to the current layer's output and improve information retention [1].
- A 2024 paper from ByteDance introduced Hyper-Connections (HC), which expand the single residual path into multiple parallel channels for information exchange, but risk signal amplification and information loss during training [1][2].

Group 2: mHC Methodology
- The mHC method refines the HC structure by constraining the mixing weights so that every row and every column sums to one, preserving the total amount of information while still allowing it to be flexibly redistributed (see the sketch after this summary) [2].
- This approach markedly reduces numerical instability and the risk of gradient explosion during large-scale training; a 27-billion-parameter model trained this way outperforms traditional models with more parameters [2].

Group 3: Engineering Optimizations
- The mHC method is positioned as an engineering-level optimization rather than a fundamental redesign of the Transformer architecture, improving the internal structure instead of making drastic changes [5].
- It is reported to be compatible with hardware optimizations, reducing the volume of data transferred across nodes and improving single-card computational performance [3].
- There are indications that a new model, potentially named ds V4, is expected to be released, featuring a smaller size with fewer than 37 billion active parameters but a wider architecture [4].
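To make the row/column constraint described in Group 2 concrete, here is a minimal numerical sketch (not DeepSeek's implementation; the uniform 4×4 mixing matrix and the dimensions are illustrative assumptions) showing why a doubly stochastic mixing matrix preserves the total signal across the parallel residual streams while still redistributing it:

```python
# Minimal sketch: a doubly stochastic mixing matrix redistributes information
# across parallel residual streams without changing the per-feature total.
import numpy as np

n, C = 4, 8                      # hypothetical: 4 residual streams, hidden size 8
streams = np.random.randn(n, C)  # one row per parallel residual stream

# Doubly stochastic mixing matrix: every row and every column sums to 1.
M = np.full((n, n), 1.0 / n)

mixed = M @ streams              # redistribute information across streams

# Because each column of M sums to 1, the total over all streams is unchanged.
print("total preserved:", np.allclose(streams.sum(axis=0), mixed.sum(axis=0)))
```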
Liang Wenfeng's New DeepSeek Paper! Taking the Baton from Kaiming He and ByteDance, Steadying AI's "Foundation" Once More
Xin Lang Cai Jing· 2026-01-02 05:27
Core Insights
- DeepSeek has introduced a new architecture called mHC (Manifold-Constrained Hyper-Connections), which significantly improves the residual connection component of the Transformer architecture, a foundational element that has seen little change since its introduction in 2015 [1][3].

Group 1: Historical Context
- The evolution of neural network architectures began with ResNet, introduced by Kaiming He in 2015, which addressed the vanishing gradient problem and enabled the training of very deep networks [3].
- The Transformer model, released in 2017, adopted residual connections as a standard feature, forming the basis for many leading models today [3].

Group 2: Technical Comparisons
- Hyper-Connections, proposed by ByteDance in 2024, expanded the single residual stream into multiple parallel streams, enhancing model performance but introducing stability issues during training [5][10].
- mHC aims to resolve the stability problems of Hyper-Connections by constraining the connection weight matrix to a specific mathematical space, ensuring that signals are not amplified [10][12].

Group 3: Mathematical Innovation
- The core innovation of mHC is using a doubly stochastic matrix for the connection weights, which guarantees that the output magnitude does not exceed the maximum input magnitude, thus conserving signal energy [10][12].
- The implementation uses the Sinkhorn-Knopp algorithm to obtain the desired matrix properties efficiently, allowing end-to-end training without introducing new hyperparameters (a sketch of the iteration follows this summary) [11][12].

Group 4: Engineering Excellence
- DeepSeek's implementation of mHC demonstrates significant engineering capability, including custom CUDA kernels and operator fusion techniques that minimize computational overhead [16].
- The ability to integrate the mathematical innovation into a practical large-scale training environment highlights DeepSeek's competitive advantage in the AI research landscape [16].
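The Sinkhorn-Knopp algorithm mentioned in Group 3 alternates row and column normalization until a positive matrix becomes (approximately) doubly stochastic. The sketch below is a generic textbook version under the assumption of a positive square input; it is not DeepSeek's fused CUDA kernel:

```python
# Generic Sinkhorn-Knopp iteration: alternately rescale rows and columns of a
# positive matrix so it converges toward a doubly stochastic matrix.
import numpy as np

def sinkhorn_knopp(A, n_iters=20):
    M = np.array(A, dtype=float)
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)  # make each row sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # make each column sum to 1
    return M

raw = np.exp(np.random.randn(4, 4))           # positive entries, e.g. exp of learned logits
ds = sinkhorn_knopp(raw)
print(ds.sum(axis=1), ds.sum(axis=0))         # both vectors are close to all ones
```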
DeepSeek Revamps Kaiming He's Residual Connection! Liang Wenfeng Personally Signs the Paper, the First Major Upgrade in a Decade
Xin Lang Cai Jing· 2026-01-01 11:45
Source: QbitAI (WeChat official account: QbitAI)
Residual connections went unchanged for a decade; extending them, however, introduced hidden risks.
On the first day of 2026, DeepSeek uploaded a new paper.
It gives the "residual connection", the foundational deep learning component proposed in Kaiming He's landmark 2016 work ResNet, a new-era upgrade.
DeepSeek's Liang Wenfeng is personally credited on the paper; the co-first authors are Zhenda Xie, Yixuan Wei, and Huanqi Cao.
Experiments by the DeepSeek team show that, among the three mappings, the Hres matrix responsible for information exchange within the residual stream contributes the most significant performance gain.
Since ResNet's debut in 2016, the residual connection has been a cornerstone of deep learning architectures.
Its core mechanism is simple: x_{l+1} = x_l + F(x_l, W_l), meaning the next layer's output equals the current layer's input plus the output of the residual function.
The key to this design's success is the "identity mapping" property: a signal can travel directly from shallow layers to deep layers without any modification.
With the rise of the Transformer architecture, this paradigm has become standard in large language models such as GPT and LLaMA.
The recently proposed Hyper-Connections (HC) attempt to break this mold. HC widens the residual stream from C dimensions to n×C dimensions ...
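Below is a minimal NumPy sketch contrasting the standard residual update x_{l+1} = x_l + F(x_l, W_l) with the widened, hyper-connection-style update described above; the placeholder F and the specific read/write/mixing parameters (h_in, h_out, H_res) are illustrative assumptions rather than the paper's exact parameterization:

```python
# Sketch: standard residual connection vs. a widened, HC-style residual stream.
import numpy as np

C, n = 8, 4
F = lambda x: 0.1 * x                      # placeholder for the layer's residual function F(x, W)

# Standard residual connection: x_{l+1} = x_l + F(x_l, W_l)
x = np.random.randn(C)
x_next = x + F(x)

# Hyper-connection style: widen the residual stream from C to n*C dimensions.
streams = np.random.randn(n, C)            # n parallel residual streams
H_res = np.eye(n)                          # stream-to-stream mixing (learnable in HC)
h_in = np.full(n, 1.0 / n)                 # how the streams are read into the layer
h_out = np.ones(n)                         # how the layer output is written back

layer_in = h_in @ streams                  # collapse n streams into one layer input
streams_next = H_res @ streams + np.outer(h_out, F(layer_in))
print(x_next.shape, streams_next.shape)    # (8,) vs (4, 8)
```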
DeepSeek Revamps Kaiming He's Residual Connection! Liang Wenfeng Personally Signs the Paper, the First Major Upgrade in a Decade
QbitAI (量子位) · 2026-01-01 10:32
Core Viewpoint
- The article discusses the evolution and enhancement of the residual connection, a fundamental deep learning component introduced by Kaiming He in ResNet, and presents a new approach that improves on Hyper-Connections (HC) while addressing signal amplification and stability issues in deep architectures [2][7][11].

Group 1: Residual Connections and Their Evolution
- Residual connections have been a cornerstone of deep learning since ResNet's introduction in 2016, allowing signals to pass directly from shallow to deep layers without modification [7][9].
- The rise of Transformer architectures made residual connections a standard feature of large language models such as GPT and LLaMA [10].
- Hyper-Connections (HC) expand the residual stream width from C dimensions to n×C dimensions and introduce three learnable mapping matrices to manage information flow [11].

Group 2: Performance and Stability Challenges
- Experiments by the DeepSeek team indicate that the Hres matrix, responsible for information exchange within the residual streams, contributes the most significant performance gain [12].
- However, when HC is stacked across multiple layers, the composite mapping loses the identity property, leading to issues such as sudden loss spikes and gradient fluctuations during training [14].
- The peak signal amplification factor in HC can reach 3000, posing a risk of signal distortion during inter-layer propagation [16].

Group 3: Theoretical Framework and Constraints
- The core idea of the DeepSeek paper is to constrain the residual mapping matrix to the manifold formed by doubly stochastic matrices, which guarantees three key properties: norm preservation, closure under composition, and a clear geometric interpretation (a small numerical illustration follows this summary) [17][19].
- The Sinkhorn-Knopp algorithm is used to project any matrix onto this manifold, effectively eliminating the signal amplification observed in HC [21].

Group 4: Engineering Optimizations
- The paper details the memory-access costs of widening the residual stream, highlighting the significant increase in read and write traffic for HC compared to standard residual connections [24].
- To mitigate these costs, the team developed infrastructure optimizations, including operator fusion in the TileLang framework and specialized kernels for the Sinkhorn-Knopp algorithm [25][26].
- The paper also discusses pipeline-parallelism enhancements that overlap computation and communication, improving overall efficiency [27].

Group 5: Experimental Validation
- The proposed method is validated on MoE models at 3B, 9B, and 27B parameters, with the expansion rate n set to 4 [30].
- On the 27B MoE model, the constrained variant (mHC) shows a stable training curve, reducing loss by 0.021 relative to the baseline while maintaining gradient stability [31].
- mHC outperforms both the baseline and HC on various downstream benchmarks [32][35].
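As a rough illustration of the norm-preservation and closure properties summarized in Group 3 (not a reproduction of the paper's 3000x measurement), the sketch below composes many mixing matrices across layers: unconstrained positive weights can amplify the signal layer after layer, while doubly stochastic matrices, built here as convex combinations of permutation matrices, keep the maximum magnitude bounded by the input's maximum. The depth and matrix sizes are illustrative assumptions.

```python
# Sketch: composing unconstrained vs. doubly stochastic mixing matrices over many layers.
import numpy as np

rng = np.random.default_rng(0)

def random_doubly_stochastic(n, k=3):
    """Convex combination of k permutation matrices is doubly stochastic."""
    perms = [np.eye(n)[rng.permutation(n)] for _ in range(k)]
    weights = rng.dirichlet(np.ones(k))
    return sum(w * P for w, P in zip(weights, perms))

n, layers = 4, 48
x = rng.standard_normal(n)
free, constrained = x.copy(), x.copy()

for _ in range(layers):
    W = np.abs(rng.standard_normal((n, n))) + 0.1        # unconstrained positive mixing
    free = W @ free                                       # magnitude can grow each layer
    constrained = random_doubly_stochastic(n) @ constrained  # max |.| never exceeds max |x|

print("unconstrained max |signal|:   ", np.abs(free).max())
print("doubly stochastic max |signal|:", np.abs(constrained).max())
print("input max |signal|:            ", np.abs(x).max())
```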