mHC
He rented 8 H100s and successfully reproduced DeepSeek's mHC, with results even more striking than the official report
机器之心· 2026-01-19 08:54
Core Insights
- DeepSeek's mHC architecture addresses the numerical instability and signal explosion issues of large-scale training by extending the traditional Transformer residual connection into a multi-stream parallel architecture [1][5]
- The mHC model has garnered significant attention in the AI community, with successful reproductions yielding better results than the original DeepSeek paper [5][6]

Group 1: mHC Architecture
- mHC uses the Sinkhorn-Knopp algorithm to constrain the connection matrix to the manifold of doubly stochastic matrices, ensuring stability during training (a minimal sketch of this projection follows this summary) [1][25]
- Traditional residual connections in Transformers have remained essentially unchanged since 2016 and rely on a single information flow, while mHC introduces multiple parallel streams for greater expressiveness [9][14]
- The mHC architecture maintains stability by preventing signal amplification, which can lead to catastrophic failures in large models [20][28]

Group 2: Experimental Results
- In experiments with 10M parameters, the original hyper-connection (HC) model exhibited a signal amplification of 9.2x, while mHC remained stable at 1.0x [36][61]
- Scaled up to 1.7B parameters, the HC model showed an alarming amplification of 10,924x, highlighting the instability that grows with model size [54][66]
- Across different training conditions, HC models accumulated instability while mHC models consistently maintained structural integrity [70][71]

Group 3: Implications and Future Directions
- Traditional residual connections are stable but may not be optimal for larger models; mHC offers a balance between expressiveness and stability [57][58]
- Future research aims to explore scaling laws further, particularly at the 10B-parameter scale, where significant amplification trends are anticipated [101]
- The mHC approach not only mitigates instability but also eliminates the risk of catastrophic failures in large-scale training scenarios [93][96]
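The Sinkhorn-Knopp projection referenced above is easy to sketch: alternately normalize the rows and columns of a positive matrix until both row and column sums converge to 1. The snippet below is a minimal NumPy illustration under my own assumptions (4 residual streams, a fixed 50 iterations), not DeepSeek's released implementation.

```python
import numpy as np

def sinkhorn_knopp(mat, n_iters=50, eps=1e-8):
    """Project a non-negative matrix toward the doubly stochastic manifold
    by alternately normalizing rows and columns (Sinkhorn-Knopp)."""
    m = np.asarray(mat, dtype=np.float64) + eps   # keep entries strictly positive
    for _ in range(n_iters):
        m /= m.sum(axis=1, keepdims=True)         # make every row sum to 1
        m /= m.sum(axis=0, keepdims=True)         # make every column sum to 1
    return m

# Toy example: an arbitrary positive 4x4 mixing matrix (n = 4 residual streams is an assumption).
rng = np.random.default_rng(0)
raw = np.exp(rng.normal(size=(4, 4)))   # positive entries, e.g. exponentiated raw logits
ds = sinkhorn_knopp(raw)

print(ds.sum(axis=1))   # row sums    -> all approximately 1
print(ds.sum(axis=0))   # column sums -> all approximately 1
```

Because a doubly stochastic matrix is a convex combination of permutation matrices, its operator norm is at most 1, so mixing residual streams with it cannot amplify the overall signal; that is the norm-preservation property these summaries attribute to mHC.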
DeepSeek open-sources Engram: how does it keep the inference loss to just 3%?
Tai Mei Ti APP· 2026-01-13 08:44
Core Insights
- DeepSeek has launched a new module called Engram, which focuses on conditional memory for large language models and aims to improve efficiency and reduce computational costs [1][4]
- The company emphasizes innovation in architecture and methodology to break through the constraints of computational cost, with Engram representing a restructuring of memory storage at the architectural level [4][6]

Group 1: Engram Module
- Engram is designed as a differentiable, trainable component that separates the memory load from the main computation, allowing efficient retrieval of frequently occurring knowledge [4][6]
- The module uses deterministic retrieval based on N-grams and hash mapping to fetch vectors from a large static embedding table, significantly speeding up the process without complex neural computation (see the sketch after this summary) [4][6]

Group 2: Memory Functionality
- Engram incorporates a lightweight gating mechanism that decides whether the retrieved memory suits the current context, improving both memory retention and output coherence [6]
- The architecture divides the model's capabilities into three independent yet collaborative dimensions: model depth for logical reasoning, computational sparsity represented by MoE, and storage sparsity introduced by Engram [6][7]

Group 3: Performance and Future Developments
- Testing indicates that even with a memory bank of up to 100 billion parameters, the inference throughput loss remains below 3% [7]
- DeepSeek plans to release its latest V4 model around the Chinese New Year; it is expected to significantly improve performance on complex tasks and coding, potentially surpassing competitors such as Anthropic [7]
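To make the retrieval path concrete, here is a minimal sketch of the idea described above: hash each N-gram of token ids into a row of a large static embedding table, then blend the retrieved vector into the hidden state through a lightweight sigmoid gate. The table size, hash function, value of N, and gate parameterization are all illustrative assumptions of mine, not DeepSeek's actual Engram design.

```python
import numpy as np

TABLE_SIZE = 1_000_000   # static memory bank (real Engram banks are described as far larger)
DIM = 64                 # hidden/embedding width for this toy example
N = 2                    # bigrams, chosen arbitrarily

rng = np.random.default_rng(0)
memory_table = rng.normal(scale=0.02, size=(TABLE_SIZE, DIM))  # frozen embedding table
gate_w = rng.normal(scale=0.02, size=(2 * DIM,))               # lightweight gate weights

def ngram_slot(token_ids):
    """Deterministically map an N-gram of token ids to a row of the table."""
    h = 0
    for t in token_ids:
        h = (h * 1_000_003 + int(t)) % TABLE_SIZE   # simple rolling hash
    return h

def retrieve_with_gate(context_ids, hidden):
    """Look up the embedding for the most recent N-gram and gate it into `hidden`."""
    if len(context_ids) < N:
        return hidden
    mem = memory_table[ngram_slot(context_ids[-N:])]
    gate = 1.0 / (1.0 + np.exp(-np.dot(gate_w, np.concatenate([hidden, mem]))))
    return hidden + gate * mem   # the gate decides how much retrieved memory to inject

hidden_state = rng.normal(size=(DIM,))
print(retrieve_with_gate([101, 2057, 998], hidden_state).shape)   # (64,)
```

The point of the sketch is only that the lookup itself is a hash and an index into a static table, so it adds essentially no neural computation; the gate is the only learned piece on the retrieval path.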
How are domestic large models doing?
小熊跑的快· 2026-01-08 06:25
Core Insights
- The article discusses the evolution of OpenAI's models, particularly the GPT-5.2 series and its ongoing iterations alongside GPT-4o, with the focus on improving accuracy and reducing hallucinations [1]
- It suggests that transformative changes in the industry are no longer expected; current models mainly pursue engineering optimization and cost reduction rather than disruptive innovation [2]
- The article anticipates that by 2026 domestic models will emerge that narrow the gap with international counterparts and possibly surpass them in applications [3]

Industry Developments
- The upcoming release of version 4 is expected to further reduce costs for domestic applications [4]
- Companies like Tencent are actively recruiting talent, indicating a competitive landscape, while Alibaba is investing heavily in AI applications, including edge computing and significant resources in cloud infrastructure [5]
- ByteDance has projected capital expenditure of 290 billion, double its previous expectation, and has seen daily usage surge from 60 trillion to 500 trillion [5]

Market Analysis
- Leading domestic model makers are currently underperforming within the Hang Seng Technology Index ETFs, which may be influenced by recent IPO activity in Hong Kong [5]
- The Hang Seng Technology Index ETF (513180) trades at a forward P/E of roughly 19.3x, below its historical average, suggesting room for recovery [5]
- Major players such as TSMC are positioned for growth in 2026, with price increases and capacity expansion expected [10]

Future Expectations
- There is optimism around Tencent's upcoming agent, which is anticipated to make a significant impact on the market [11]
Technology and capital in resonance: domestic large models safeguard the wave of AI applications
China Post Securities· 2026-01-05 11:14
Industry Investment Rating
- The industry investment rating is "Outperform the Market" and is maintained [2]

Core Insights
- The report highlights that the domestic large model industry has moved from a technology catch-up phase into a new stage of systematic layout and ecosystem building, with breakthroughs in algorithms, coordinated computing power, data accumulation, capital support, and policy backing [9]
- The mHC architecture proposed by DeepSeek addresses three major pain points in large model training, significantly lowering the training threshold and cost while improving performance and efficiency [6][7]
- The report points to robust growth in the application ecosystem, with notable user engagement in AI applications reflecting strong market demand for quality AI application targets [8]

Summary by Relevant Sections
Industry Overview
- The closing index stands at 5211.26, with a 52-week high of 5841.52 and a low of 3963.29 [2]

Performance Analysis
- The computer industry's relative performance shows a positive trend, with a notable gain versus the CSI 300 index [4]

Recent Developments
- Companies such as Zhipu and MiniMax are making significant strides toward IPOs, while Kimi has completed a $500 million Series C financing, indicating strong capital inflows into the industry [7]
- Kimi's paid user base grew more than 170% month over month from September to November 2025 [7]

Investment Recommendations
- The report suggests focusing on several sectors, including Hong Kong internet companies and domestic computing power firms, and highlights specific companies such as Alibaba, Tencent, and Cambricon [9]
DeepSeek rolls out mHC: can R2 be far behind?
Tai Mei Ti APP· 2026-01-04 06:05
Core Insights
- DeepSeek has introduced a new neural network architecture optimization called mHC (Manifold-Constrained Hyper-Connections), which is expected to have a significant impact across the AI industry, from large models to chips [1][5][9]

Group 1: mHC Architecture
- The mHC architecture builds on the Hyper-Connections (HC) framework released by ByteDance's Doubao team in November 2024, which aims to replace the nearly decade-old residual-connection design introduced with ResNet [5]
- mHC adds a manifold constraint, using the Sinkhorn-Knopp algorithm to stabilize signal propagation during training and to address the signal explosion and instability seen in large model training [5][6]
- In training demonstrations at 27 billion parameters, mHC kept signal amplification to only 1.6x, whereas HC suffered a catastrophic failure with roughly 3000x amplification (a toy comparison of the two regimes follows this summary) [6][8]

Group 2: Performance and Efficiency
- mHC shows a clear reduction in training loss and improved performance on hard tasks, with gains of over 2% on reasoning and reading comprehension benchmarks compared with traditional architectures [6][8]
- Even with a fourfold expansion of the residual channels, mHC's additional training time overhead is only 6.7%, reflecting a focus on cost-effectiveness and efficiency [8]

Group 3: Industry Impact and Reactions
- The release of mHC has sparked intense discussion among researchers and practitioners, with expectations of a paradigm shift in large model architectures by 2026 [9][10]
- Competitors are already responding: new architectures such as Deep Delta Learning appeared shortly after mHC's announcement, suggesting a potential chain reaction in AI architecture development [9][10]
- Analysts expect DeepSeek to make significant announcements around the Lunar New Year, potentially unveiling the long-awaited R2 model or a faster general-purpose model, V4 [10]

Group 4: Compatibility and Market Dynamics
- mHC's design is primarily tuned for NVIDIA's supernode interconnects, raising concerns about compatibility with domestic chips, which may require additional adaptation work [11]
- As U.S. AI chip makers gradually exit the Chinese market for geopolitical reasons, domestic chip makers are accelerating development and ecosystem building to adapt to DeepSeek's models [12]
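The contrast between 1.6x and roughly 3000x amplification can be illustrated directionally with a toy simulation: compose many per-layer mixing matrices and compare the operator-norm gain of the product when the matrices are unconstrained versus projected onto the doubly stochastic manifold with Sinkhorn-Knopp. The stream count, layer count, and perturbation scale below are arbitrary choices of mine; this is not the paper's measurement protocol and will not reproduce its exact numbers.

```python
import numpy as np

def sinkhorn_knopp(m, n_iters=50):
    """Alternating row/column normalization onto the doubly stochastic manifold."""
    m = np.abs(m) + 1e-8
    for _ in range(n_iters):
        m /= m.sum(axis=1, keepdims=True)
        m /= m.sum(axis=0, keepdims=True)
    return m

def composed_gain(mats):
    """Operator norm (largest singular value) of the product of per-layer mixing matrices."""
    prod = np.eye(mats[0].shape[0])
    for m in mats:
        prod = m @ prod
    return np.linalg.norm(prod, ord=2)

rng = np.random.default_rng(0)
n_streams, n_layers = 4, 60                       # both values are arbitrary toy choices
# Unconstrained HC-style mixing: identity plus a random perturbation per layer.
raw = [np.eye(n_streams) + 0.2 * rng.normal(size=(n_streams, n_streams))
       for _ in range(n_layers)]
# mHC-style mixing: the same matrices projected onto the doubly stochastic manifold.
projected = [sinkhorn_knopp(m) for m in raw]

print(f"unconstrained gain over {n_layers} layers:     {composed_gain(raw):10.2f}")
print(f"doubly stochastic gain over {n_layers} layers: {composed_gain(projected):10.3f}")
```

With these settings the unconstrained gain typically compounds well above 1 as layers stack, while the projected gain is bounded at or below 1 by construction; only the direction of that comparison, not the magnitude, carries over to the behavior reported for HC versus mHC.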
DeepSeek reworks Kaiming He's residual connection! Liang Wenfeng personally signs on as an author, the first major upgrade in a decade
量子位· 2026-01-01 10:32
Core Viewpoint
- The article discusses the evolution of the residual connection, a fundamental deep learning component introduced by Kaiming He in ResNet, and examines Hyper-Connections (HC), an approach that improves performance but raises issues of signal amplification and stability, which DeepSeek's new work addresses [2][7][11]

Group 1: Residual Connections and Their Evolution
- Residual connections have been a cornerstone of deep learning since the introduction of ResNet in 2016, allowing signals to pass unchanged from shallow to deep layers [7][9]
- With the rise of Transformer architectures, residual connections became a standard component of large language models such as GPT and LLaMA [10]
- Hyper-Connections (HC) widen the residual flow from C dimensions to n×C dimensions and introduce three learnable mapping matrices to manage information flow (a shape-level sketch follows this summary) [11]

Group 2: Performance and Stability Challenges
- Experiments by the DeepSeek team indicate that the Hres matrix, responsible for internal information exchange in HC, significantly enhances performance [12]
- However, when HC is stacked across many layers, the composite mapping loses its identity property, leading to sudden loss spikes and gradient fluctuations during training [14]
- The peak signal amplification factor in HC can reach 3000x, risking signal distortion during inter-layer propagation [16]

Group 3: Theoretical Framework and Constraints
- The core idea of the DeepSeek paper is to constrain the residual mapping matrix to a specific manifold formed by doubly stochastic matrices, which guarantees three key theoretical properties: norm preservation, closure under composition, and a clear geometric interpretation [17][19]
- The Sinkhorn-Knopp algorithm is used to project an arbitrary matrix onto this manifold, effectively eliminating the signal amplification observed in HC [21]

Group 4: Engineering Optimizations
- The paper quantifies the memory access cost of widening the residual flow, noting a significant increase in reads and writes for HC compared with standard residual connections [24]
- To offset these costs, the team built infrastructure optimizations, including operator fusion in the TileLang framework and specialized kernels for the Sinkhorn-Knopp algorithm [25][26]
- The paper also describes pipeline parallelism improvements that overlap computation and communication to raise overall efficiency [27]

Group 5: Experimental Validation
- The method is validated on MoE models of 3B, 9B, and 27B parameters, with the expansion rate n set to 4 [30]
- On the 27B MoE model, the modified HC (mHC) showed a stable training curve, reducing loss by 0.021 relative to the baseline while keeping gradients stable [31]
- mHC also outperformed both the baseline and HC on various downstream benchmarks [32][35]
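To give a concrete sense of what "widening the residual flow from C to n×C with learnable mapping matrices" looks like, below is a shape-level sketch of one hyper-connection-style block: n parallel streams of width C, a stream-mixing matrix constrained to be doubly stochastic (the role the article assigns to Hres under mHC's manifold constraint), and learnable vectors that gather the sub-layer input from the streams and scatter its output back. The exact factorization into three matrices in the HC and mHC papers may differ from this; the code only conveys the shape of the computation, with all sizes and initializations chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
C, n = 16, 4                      # hidden width and number of residual streams (toy values)

def sinkhorn_knopp(m, n_iters=50):
    """Project a matrix onto the doubly stochastic manifold by alternating normalization."""
    m = np.abs(m) + 1e-8
    for _ in range(n_iters):
        m /= m.sum(axis=1, keepdims=True)
        m /= m.sum(axis=0, keepdims=True)
    return m

# Learnable pieces, randomly initialized here for illustration:
H_res = sinkhorn_knopp(np.eye(n) + 0.1 * rng.normal(size=(n, n)))  # stream mixing, doubly stochastic
h_in = np.full(n, 1.0 / n)        # how much each stream contributes to the sub-layer input
h_out = np.full(n, 1.0)           # how the sub-layer output is written back to each stream
W = rng.normal(scale=0.02, size=(C, C))   # stand-in for the attention/FFN sub-layer

def hyper_block(streams):
    """One hyper-connection-style block on `streams` of shape (n, C)."""
    x = h_in @ streams                    # (C,)   collapse the streams into the sub-layer input
    y = np.tanh(x @ W)                    # (C,)   stand-in for the transformer sub-layer computation
    mixed = H_res @ streams               # (n, C) exchange information between streams
    return mixed + np.outer(h_out, y)     # write the sub-layer output back into every stream

streams = np.tile(rng.normal(size=(1, C)), (n, 1))   # widen a C-dim input into n identical streams
for _ in range(8):                                   # stack a few blocks
    streams = hyper_block(streams)
print(streams.shape)                                 # (4, 16)
```

In this sketch, setting n = 1 with H_res = [[1]] collapses back to an ordinary residual connection, which matches the article's framing of HC and mHC as a widening of the residual stream rather than a wholly new mechanism.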