After Raising 120 Billion, Kimi Plays Another Trump Card: New Architecture Overhauls an Aging Transformer Component, Even Cheaper Than DeepSeek's Version
AI前线· 2026-03-17 07:53
Author | Yun Yi

Even Elon Musk and Andrej Karpathy have chimed in with praise. What exactly is the "residual connection" that DeepSeek and Kimi have both targeted in quick succession?

Recently, Kimi released a major new paper aimed at a piece of the Transformer's foundation that almost nobody has touched in the past decade: the residual connection. Proposed by Kaiming He in the 2015 ResNet paper, it has since become a standard component across deep learning.

Put simply, imagine the Transformer architecture as a "message-relay team" of a few dozen workers standing in a long line. The residual connection is then a rule: after hearing everything said by those before them, each worker adds one sentence of their own and passes the whole message along unchanged.

The rule looks like this: x_{l+1} = x_l + F(x_l).

But this creates a problem: the worker at the end of the line receives the contents of dozens of workers piled together. The farther down the line, the longer and messier the message becomes; the key points of earlier workers get buried, the additions of later workers can no longer be heard clearly, and the AI gets "dumber". This is called the dilution problem.

Kimi's answer is to bring the attention mechanism to bear on this problem, proposing a new rule: "Attention Residuals". It is as if each worker were handed a "smart filter": instead of swallowing wholesale the jumble piled up by everyone before them, ...
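The relay-team rule above can be sketched in a few lines of Python. This is a toy scalar illustration of the standard residual rule, not Kimi's or anyone else's actual implementation:

```python
def residual_block(x, sublayer):
    # The residual rule: pass the input through unchanged and add the
    # sublayer's own contribution on top (x_{l+1} = x_l + F(x_l)).
    return x + sublayer(x)

def run_stack(x, sublayers):
    # Chain many residual blocks, like workers passing the message along.
    for f in sublayers:
        x = residual_block(x, f)
    return x

# With 48 "workers" each adding a constant message of 1.0, the final
# signal is the input plus every layer's contribution piled together:
# the accumulation that leads to the dilution problem described above.
out = run_stack(0.0, [lambda x: 1.0] * 48)
print(out)  # 48.0
```

Because each block only ever adds on top of what it receives, nothing is ever filtered out, which is exactly the behavior Attention Residuals are meant to change.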
Renting 8 H100s, He Successfully Reproduced DeepSeek's mHC, With Results Even More Striking Than the Official Report
机器之心· 2026-01-19 08:54
Core Insights
- DeepSeek's mHC architecture addresses numerical instability and signal explosion issues in large-scale training by extending traditional Transformer residual connections into a multi-stream parallel architecture [1][5]
- The mHC model has garnered significant attention in the AI community, with successful reproductions yielding better results than the original DeepSeek paper [5][6]

Group 1: mHC Architecture
- The mHC model utilizes the Sinkhorn-Knopp algorithm to constrain the connection matrix to a doubly stochastic matrix manifold, ensuring stability during training [1][25]
- Traditional residual connections in Transformers have remained unchanged since 2016, relying on a single information flow, while mHC introduces multiple parallel streams for enhanced expressiveness [9][14]
- The mHC architecture maintains stability by preventing signal amplification, which can lead to catastrophic failures in large models [20][28]

Group 2: Experimental Results
- In experiments with 10M parameters, the original hyper-connection (HC) model exhibited a signal amplification of 9.2 times, while mHC maintained stability with an amplification of 1.0 [36][61]
- Scaling up to 1.7B parameters, the HC model showed an alarming amplification of 10,924 times, highlighting the instability associated with larger models [54][66]
- The experiments demonstrated that while HC models accumulate instability, mHC models consistently maintain structural integrity across different training conditions [70][71]

Group 3: Implications and Future Directions
- The findings suggest that while traditional residual connections are stable, they may not be optimal for larger models, as mHC offers a balance between expressiveness and stability [57][58]
- Future research aims to explore scaling laws further, particularly at the 10B parameter scale, where significant amplification trends are anticipated [101]
- The mHC approach not only mitigates instability but also eliminates the risk of catastrophic failures in large-scale training scenarios [93][96]
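The amplification numbers reported above can be illustrated with a toy experiment in pure Python (a schematic, not the reproduction's actual code): composing unconstrained nonnegative mixing matrices compounds signal gain exponentially with depth, while doubly stochastic matrices, the constraint mHC enforces, preserve the total signal mass of a nonnegative input exactly.

```python
import random

def apply(m, v):
    """Multiply matrix m by vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def signal_gain(matrices, v):
    """Ratio of the signal's total mass after passing through all layers."""
    start = sum(v)
    for m in matrices:
        v = apply(m, v)
    return sum(v) / start

n, depth = 4, 60
random.seed(0)

# Unconstrained nonnegative mixing (stand-in for HC): expected row sum is
# 1.2, so gain compounds roughly like 1.2^60, the runaway amplification
# the reproduction observed at scale.
free = [[[random.uniform(0.2, 0.4) for _ in range(n)] for _ in range(n)]
        for _ in range(depth)]

# Doubly stochastic mixing (the mHC constraint, here uniform 1/n): every
# row and column sums to 1, and products of doubly stochastic matrices
# remain doubly stochastic, so total signal mass never grows.
ds = [[[1.0 / n] * n for _ in range(n)] for _ in range(depth)]

print(signal_gain(free, [1.0] * n))  # tens of thousands: blow-up
print(signal_gain(ds, [1.0] * n))    # 1.0
```

The contrast between the two prints mirrors the 10,924x versus 1.0 amplification figures in the experiments above.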
DeepSeek Open-Sources Engram: How Does It Keep the Inference Cost to Just 3%?
Tai Mei Ti APP · 2026-01-13 08:44
Core Insights
- DeepSeek has launched a new module called Engram, which focuses on conditional memory for large language models, aiming to enhance efficiency and reduce computational costs [1][4]
- The company emphasizes innovation in architecture and methodology to break through the constraints of computational costs, with Engram representing a restructuring of memory storage at the architectural level [4][6]

Group 1: Engram Module
- Engram is designed as a differentiable, trainable component that separates memory load from the main computation, allowing for efficient retrieval of frequently occurring knowledge [4][6]
- The module utilizes deterministic retrieval based on N-grams and hash mapping to access vectors from a large static embedding table, significantly speeding up the process without complex neural computations [4][6]

Group 2: Memory Functionality
- Engram incorporates a lightweight gating mechanism to determine the appropriateness of retrieved memory for the current context, enhancing both memory retention and output coherence [6]
- The architecture divides the model's capabilities into three independent yet collaborative dimensions: model depth for logical reasoning, computational sparsity represented by MoE, and storage sparsity introduced by Engram [6][7]

Group 3: Performance and Future Developments
- Testing indicates that even with a memory bank of up to 100 billion parameters, the inference throughput loss remains below 3% [7]
- DeepSeek plans to release its latest V4 model around the Chinese New Year, which is expected to significantly improve performance in handling complex tasks and coding capabilities, potentially surpassing competitors like Anthropic [7]
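The deterministic N-gram retrieval and gating described above can be sketched as follows. This is a toy illustration under stated assumptions: the table size, the SHA-1 hash, and the additive gate form are all placeholders, not Engram's actual design:

```python
import hashlib

TABLE_SIZE = 1 << 16   # toy stand-in for a memory bank with billions of slots
DIM = 4                # toy embedding width

def ngram_bucket(tokens, n=2):
    # Deterministically map the trailing n-gram to a slot in the static
    # embedding table: a plain hash, no neural computation involved.
    key = " ".join(tokens[-n:])
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % TABLE_SIZE

def retrieve(table, tokens):
    # Look up the memory vector for the current n-gram (zeros if unseen).
    return table.get(ngram_bucket(tokens), [0.0] * DIM)

def gated_merge(hidden, memory, gate):
    # A lightweight gate decides how much retrieved memory enters the
    # hidden state; gate=0 means "not appropriate for this context".
    return [h + gate * m for h, m in zip(hidden, memory)]

# Same trailing n-gram, same slot: retrieval is deterministic.
table = {ngram_bucket(["large", "model"]): [1.0, 2.0, 3.0, 4.0]}
hit = retrieve(table, ["a", "large", "model"])
print(gated_merge([0.0] * DIM, hit, gate=0.5))  # [0.5, 1.0, 1.5, 2.0]
```

Because the lookup is a hash into a static table, it can be overlapped with the main forward pass, which is consistent with the reported sub-3% throughput loss.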
How Are Domestic Large Models Doing?
小熊跑的快· 2026-01-08 06:25
Core Insights
- The article discusses the evolution of OpenAI's models, particularly the GPT-5.2 series and its ongoing iterations with GPT-4o, focusing on enhancing model accuracy and reducing hallucinations [1]
- It suggests that significant changes in the industry are no longer expected, with current models primarily focusing on engineering optimizations and cost reductions rather than transformative innovations [2]
- The article anticipates that by 2026, domestic models will emerge, potentially narrowing the gap with international counterparts and possibly surpassing them in application [3]

Industry Developments
- The upcoming release of version 4 is expected to further reduce costs for domestic applications [4]
- Companies like Tencent are actively recruiting talent, indicating a competitive landscape, while Alibaba is investing heavily in AI applications, including edge computing and significant resources in cloud infrastructure [5]
- ByteDance has projected a capital expenditure of 290 billion, doubling its previous expectations, and has seen a substantial increase in daily usage from 60 trillion to 500 trillion [5]

Market Analysis
- The article highlights that leading domestic model manufacturers are currently underperforming in the Hang Seng Technology Index ETFs, which may be influenced by recent IPO activities in Hong Kong [5]
- The Hang Seng Technology Index ETF (513180) is noted to have a forward P/E ratio of approximately 19.3x, indicating it is below historical averages and may have room for recovery [5]
- The article mentions that major players like TSMC are positioned for growth in 2026, with expectations of price increases and capacity expansions [10]

Future Expectations
- There is optimism surrounding Tencent's upcoming agent, which is anticipated to make a significant impact in the market [11]
Technology and Capital in Resonance: Domestic Large Models Escort the Wave of AI Applications
China Post Securities· 2026-01-05 11:14
Industry Investment Rating
- The industry investment rating is "Outperform the Market" and is maintained [2]

Core Insights
- The report highlights that the domestic large model industry has transitioned from a technology catch-up phase to a new stage of systematic layout and ecological construction, with breakthroughs in algorithms, collaborative computing power, data accumulation, capital support, and policy backing [9]
- The mHC architecture proposed by DeepSeek addresses three major pain points in large model training, significantly lowering the training threshold and costs while enhancing performance and efficiency [6][7]
- The report indicates robust growth in the application ecosystem, with notable user engagement in AI applications, reflecting strong market demand for quality AI application targets [8]

Summary by Relevant Sections

Industry Overview
- The closing index is at 5211.26, with a 52-week high of 5841.52 and a low of 3963.29 [2]

Performance Analysis
- The relative performance of the computer industry shows a positive trend, with a notable increase compared to the CSI 300 index [4]

Recent Developments
- Companies like Zhipu and MiniMax are making significant strides toward IPOs, while Kimi has completed a $500 million Series C financing, indicating a strong capital influx into the industry [7]
- Kimi's paid user base grew more than 170% month over month from September to November 2025 [7]

Investment Recommendations
- The report suggests focusing on various sectors, including Hong Kong internet companies and domestic computing power firms, highlighting specific companies such as Alibaba, Tencent, and Cambricon [9]
DeepSeek Launches mHC: Is R2 Still Far Off?
Tai Mei Ti APP · 2026-01-04 06:05
Core Insights
- DeepSeek has introduced a new neural network architecture optimization called mHC (Manifold-Constrained Hyper-Connections), which is expected to significantly impact the AI industry, including large models and chips [1][5][9]

Group 1: mHC Architecture
- The mHC architecture builds on the Hyper-Connections (HC) framework released by ByteDance's Doubao team in November 2024, aiming to replace the nearly decade-old ResNet-style residual design [5]
- mHC introduces a manifold constraint via the Sinkhorn-Knopp algorithm to stabilize signal propagation during training, addressing signal explosion and instability in large model training [5][6]
- In training demonstrations with 27 billion parameters, mHC kept signal amplification to only 1.6 times, while HC suffered a catastrophic failure with a 3,000-fold amplification [6][8]

Group 2: Performance and Efficiency
- mHC shows a significant reduction in training loss and improved performance on challenging tasks, with over 2% gains on reasoning and reading comprehension benchmarks compared to traditional architectures [6][8]
- The additional training time overhead for mHC, even with a fourfold expansion of residual channels, is only 6.7%, indicating a focus on cost-effectiveness and efficiency [8]

Group 3: Industry Impact and Reactions
- The release of mHC has sparked intense discussion among researchers and industry professionals, with expectations of a paradigm shift in large model architectures by 2026 [9][10]
- Competitors are already responding, with new architectures like Deep Delta Learning emerging shortly after mHC's announcement, indicating a potential chain reaction in AI architecture development [9][10]
- Analysts predict that DeepSeek may make significant announcements around the Lunar New Year, potentially unveiling the long-awaited R2 model or a faster universal model V4 [10]

Group 4: Compatibility and Market Dynamics
- mHC's architecture is primarily designed for NVIDIA's supernode links, raising concerns about compatibility with domestic chips, which may require enhanced adaptation efforts [11]
- As U.S. AI chip manufacturers gradually exit the Chinese market due to geopolitical factors, domestic chipmakers are accelerating their development and ecosystem building to adapt to DeepSeek's models [12]
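The widened residual stream behind HC and mHC can be sketched as follows. This is a schematic under stated assumptions: a single per-layer mixing matrix stands in for the full set of learnable mappings, and the sublayer acts only on the first stream:

```python
N_STREAMS = 4  # expansion rate n: the residual stream is widened n-fold

def mix_streams(streams, h_res):
    # Mix the n parallel residual streams with a mixing matrix h_res.
    # In HC this matrix is unconstrained; in mHC it is projected onto the
    # doubly stochastic manifold so the mix cannot amplify the signal.
    width = len(streams[0])
    return [
        [sum(h_res[i][k] * streams[k][d] for k in range(N_STREAMS))
         for d in range(width)]
        for i in range(N_STREAMS)
    ]

def hc_layer(streams, h_res, sublayer):
    # One hyper-connected layer: mix the streams, run the sublayer on one
    # stream, and add its output back residually.
    mixed = mix_streams(streams, h_res)
    update = sublayer(mixed[0])
    mixed[0] = [x + u for x, u in zip(mixed[0], update)]
    return mixed

# With a doubly stochastic mixing matrix (here: uniform 1/n), the total
# signal mass of nonnegative streams is preserved exactly by the mix.
uniform = [[1.0 / N_STREAMS] * N_STREAMS for _ in range(N_STREAMS)]
streams = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mixed = mix_streams(streams, uniform)
print(sum(sum(s) for s in mixed))  # 36.0, same total mass as the input
```

The fourfold stream expansion is what drives the extra memory traffic, which is why the reported 6.7% training-time overhead is notable.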
DeepSeek Reworks Kaiming He's Residual Connection! Liang Wenfeng Personally Co-Authors the First Major Upgrade in a Decade
量子位· 2026-01-01 10:32
Core Viewpoint
- The article discusses the evolution and enhancement of the residual connection, a fundamental component of deep learning introduced by Kaiming He in ResNet, and presents a new approach called Hyper-Connections (HC) that aims to improve performance while addressing issues of signal amplification and stability in deep architectures [2][7][11]

Group 1: Residual Connections and Their Evolution
- Residual connections have been a cornerstone of deep learning since the introduction of ResNet in 2016, allowing signals to pass directly from shallow to deep layers without modification [7][9]
- The rise of Transformer architectures has made residual connections a standard feature in large language models like GPT and LLaMA [10]
- Hyper-Connections (HC) expand the residual flow width from C dimensions to n×C dimensions, introducing three learnable mapping matrices to manage information flow [11]

Group 2: Performance and Stability Challenges
- Experiments by the DeepSeek team indicate that the Hres matrix, responsible for internal information exchange in HC, significantly enhances performance [12]
- However, when HC is extended to multiple layers, the composite mapping loses its identity property, leading to issues such as sudden loss spikes and gradient fluctuations during training [14]
- The peak amplification factor of signals in HC can reach 3000, which poses risks of signal distortion during inter-layer propagation [16]

Group 3: Theoretical Framework and Constraints
- The core idea of the DeepSeek paper is to constrain the residual mapping matrix to the manifold of doubly stochastic matrices, which ensures three key theoretical properties: norm preservation, closure under composition, and geometric interpretability [17][19]
- The Sinkhorn-Knopp algorithm is employed to project any matrix onto this manifold, effectively reducing the signal amplification issue observed in HC [21]
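The Sinkhorn-Knopp projection mentioned above can be sketched in a few lines of pure Python. This is a minimal illustration of the classical algorithm, not the paper's optimized GPU kernel:

```python
def sinkhorn_knopp(m, iters=50):
    # Alternately normalize the rows and columns of a positive matrix so
    # it converges toward a doubly stochastic matrix: every row and every
    # column sums to 1.
    m = [row[:] for row in m]   # work on a copy
    size = len(m)
    for _ in range(iters):
        for row in m:                                   # normalize rows
            s = sum(row)
            for j in range(size):
                row[j] /= s
        for j in range(size):                           # normalize columns
            s = sum(m[i][j] for i in range(size))
            for i in range(size):
                m[i][j] /= s
    return m

ds = sinkhorn_knopp([[1.0, 2.0], [3.0, 4.0]])
print([round(sum(row), 6) for row in ds])                  # rows sum to 1
print([round(ds[0][j] + ds[1][j], 6) for j in range(2)])   # columns sum to 1
```

Because row and column sums are both 1, a doubly stochastic mixing step can redistribute signal between streams but never amplify its total mass, which is the norm-preservation property the paper relies on.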
Group 4: Engineering Optimizations
- The paper details the memory access costs associated with expanding the residual flow width, highlighting significant increases in read and write operations for HC compared to standard residual connections [24]
- To mitigate these costs, the team developed infrastructure optimizations, including the TileLang framework for merging operations and specialized kernels for the Sinkhorn-Knopp algorithm [25][26]
- The paper also discusses pipeline parallelism enhancements to overlap computation and communication, improving overall efficiency [27]

Group 5: Experimental Validation
- The paper validates the proposed methods on MoE models of sizes 3B, 9B, and 27B, with the expansion rate n set to 4 [30]
- In the 27B MoE model, the modified HC (mHC) demonstrated a stable training curve, achieving a loss reduction of 0.021 compared to the baseline while maintaining gradient stability [31]
- Performance improvements were noted in downstream tasks, with mHC outperforming both the baseline and HC in various benchmarks [32][35]