Attention Mechanism
Evolve Like a Large Model
Tencent Research Institute · 2026-01-05 08:44
This article is excerpted from General Artificial Intelligence (《通用人工智能》), the new book by Liu Jia, Chair Professor of Basic Science at Tsinghua University.

The success of large models is no accident: from the failure of early symbolic AI, through the rise of deep learning, to the triumph of the Transformer, each evolutionary step was born painstakingly out of countless discarded algorithms and models. Throughout this long, winding exploration, nuggets of human wisdom have been a guiding light for AI. Conversely, can the evolutionary experience of large models nourish the evolution of our own cognition? If so, we can break out of the cocoon, move in step with the AI era, and begin a leap in cognition and wisdom.

Defining an objective function for life

Before training begins, every machine learning system must specify an objective function (also called a loss function or cost function). This function defines the ideal state the model hopes to reach, and the entire point of training is to keep optimizing the parameters so that the model moves ever closer to that goal. As the saying goes: before learning moves, the objective leads the way.

As a branch of machine learning, artificial neural networks were the odd ones out from the start, because their objective function was too grand and too ambitious. When Hinton asked the president of his University of Toronto to hire one more neural-network researcher, the president replied: "One madman is enough." Indeed, the pioneers of artificial neural networks all pursued objective functions that looked nearly insane to outsiders: the "crude" neuron proposed by McCulloch and Pitts in 1943 was meant to model "a logical calculus of the ideas immanent in nervous activity," and in 1958 Rosenblatt ...
Gemini 3 pretraining lead warns: the model war has shifted from algorithms to engineering, synthetic data is the core of generational leaps, and Google's secret weapon for outpacing OpenAI and Meta is revealed
36Kr · 2025-12-26 12:21
At the end of 2025, the large-model industry's "year-end showdown" officially began, with every player bringing out its best-kept trump cards. In this fierce contest, Gemini 3 broke through as the undisputed champion, resetting the industry's expectations the moment it appeared.

On November 18, Gemini 3 swept a string of authoritative benchmarks, overpowering every comparable model worldwide with the posture of "the world's strongest multimodal understanding," "the deepest interactive agent," and a "reasoning monster." Google CEO Sundar Pichai personally backed it, calling it "the most intelligent model to date." The news set the entire AI community abuzz, and everyone was asking the same question: what secret lies behind Gemini 3's strength?

An initial clue arrived on launch day. Oriol Vinyals, VP of Research and Deep Learning at Google DeepMind, offered a "spoiler" on Twitter: "Gemini 3 is this strong for two core reasons: better pretraining and better post-training." That blunt statement instantly made "pretraining" and "post-training" the hottest topics in the industry.

(Benchmark comparison table: Gemini 3 Pro vs. Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5) ...
The Scaling Law isn't dead: a core Gemini figure reveals that Google already holds a disruptive key
36Kr · 2025-12-22 01:05
Is Google about to make another major breakthrough?

In a recent interview, Sebastian Borgeaud, the Gemini pretraining lead at Google DeepMind, dropped a bombshell: he expects that within the next year there will be major innovations in pretraining techniques for improving long-context processing efficiency and further extending model context length. In other words, over the coming year, large-model pretraining should see major advances along two axes: long-context processing efficiency and context-length extension.

At the same time, Google's three Gemini heavyweights, Jeff Dean, Oriol Vinyals, and Noam Shazeer, made a rare joint appearance, and their conversation lined up strikingly with what Sebastian said. Their many far-sighted, insight-laden ideas give plenty to think about. No wonder Google is still the giant it is.

Google's heavyweights make an excited prediction: the core secret of large models has been cracked

Borgeaud also revealed that the team has recently made some very interesting discoveries about attention mechanisms, which may reshape their research direction in the coming months. He said he is very excited about this. And he delivered one resounding line: the Scaling Law is not dead, it is just evolving!

Sebastian Borgeaud is Gemin ...
Microsoft Research's Baotong Lu: reshaping model attention with vector retrieval - Attention
36Kr · 2025-11-17 08:02
Core Insights
- The article discusses the limitations of long-context reasoning in large language models (LLMs) due to the quadratic complexity of self-attention and the significant memory requirements for key-value (KV) caching [1][5]
- It introduces a new mechanism called Retrieval Attention, which accelerates long-context LLM inference through a dynamic sparse attention approach that does not require retraining [1][8]

Group 1: Retrieval Attention Mechanism
- Retrieval Attention posits that each query only needs to interact with a small subset of keys, making most attention redundant [3][7]
- The approach involves offloading most KV vectors from the GPU to the CPU, using approximate nearest neighbor (ANN) search to identify the most relevant keys for each query [3][7]
- This mechanism allows for significant reductions in memory usage, with an 8B model requiring only about 1/10 of the original memory for KV caching while maintaining accuracy [22]

Group 2: Performance Metrics
- Empirical tests on an RTX 4090 (24GB) show that the 8B model can stably generate with 128K context at approximately 0.188 seconds per token, achieving nearly the same precision as full attention [5][6]
- The subsequent work, RetroInfer, demonstrated a 4.5 times increase in decoding throughput on A100 GPUs compared to full attention and a 10.5 times increase in throughput for 1M token contexts compared to other sparse attention systems [5][22]

Group 3: System Architecture
- The architecture of Retrieval Attention features a dual-path attention mechanism where the GPU retains a small amount of "predictable" local KV cache, while the CPU dynamically retrieves a large-scale KV store [7][8]
- This design leads to a reduction in both memory usage and inference latency, allowing for efficient long-context reasoning without retraining the model [8][22]

Group 4: Theoretical and Practical Contributions
- The work presents a new theoretical perspective by framing the attention mechanism as a retrieval system, allowing for more precise identification of important contextual information [23][25]
- It also emphasizes system-level optimizations, transforming traditional linear caching into a dynamic allocation structure that enhances efficiency in large-scale inference scenarios [23][25]

Group 5: Future Directions
- Future research may focus on establishing a more rigorous theoretical framework for the error bounds of Retrieval Attention and exploring the integration of dynamic learning mechanisms with system-level optimizations [26][30]
- The long-term implications of this research could lead to models with true long-term memory capabilities, enabling them to maintain semantic consistency over extensive contexts [30][31]
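To make the dual-path idea above concrete, here is a minimal NumPy sketch: a small "local" KV cache stands in for what stays on the GPU and is attended in full, while only the top-k most relevant keys are pulled from the large offloaded store. This is an illustrative reconstruction under stated assumptions, not the authors' code; the function names are made up, and an exact top-k dot-product search stands in for the real ANN index.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def retrieval_attention(q, K_local, V_local, K_offloaded, V_offloaded, top_k=32):
    """Sketch of dual-path attention for one query vector `q`.

    K_local/V_local play the role of the small "predictable" KV cache kept on
    the GPU (e.g. the most recent tokens); K_offloaded/V_offloaded stand in
    for the large KV store offloaded to CPU memory. A real system would query
    an approximate nearest-neighbor (ANN) index over K_offloaded; here an
    exact top-k dot-product search is used as a stand-in.
    """
    d = q.shape[-1]

    # Path 1: full attention over the small local cache.
    scores_local = K_local @ q / np.sqrt(d)

    # Path 2: retrieve only the top-k most relevant offloaded keys for this query.
    approx_scores = K_offloaded @ q            # the step an ANN index would approximate
    idx = np.argpartition(-approx_scores, top_k)[:top_k]
    scores_far = K_offloaded[idx] @ q / np.sqrt(d)

    # Combine both paths in a single softmax so the weights stay normalized.
    weights = softmax(np.concatenate([scores_local, scores_far]))
    values = np.concatenate([V_local, V_offloaded[idx]], axis=0)
    return weights @ values

# Toy usage: 128 "local" tokens, 4096 "offloaded" tokens, hidden size 64.
rng = np.random.default_rng(0)
d = 64
q = rng.standard_normal(d)
out = retrieval_attention(q,
                          rng.standard_normal((128, d)), rng.standard_normal((128, d)),
                          rng.standard_normal((4096, d)), rng.standard_normal((4096, d)))
print(out.shape)  # (64,)
```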
The "father of HBM" makes a bold guess: NVIDIA may buy a memory company
半导体芯闻· 2025-11-04 09:48
Core Insights
- NVIDIA's CEO Jensen Huang visited South Korea for the first time in 15 years, meeting with key figures from Samsung and Hyundai to strengthen collaboration in memory and AI megafactories [2]
- The importance of memory in the AI era is increasing, with experts suggesting that NVIDIA may consider acquiring memory companies like Micron or SanDisk to maintain its leadership in AI [2][3]
- Memory bottlenecks are critical issues that need to be addressed for AI inference, with major companies focusing on solutions [3][4]

Memory Demand and Types
- Memory requirements for AI are categorized into HBM, DRAM, and SSD, with HBM used for real-time data storage, DRAM for short-term memory, and SSD for long-term data [4]
- HBM capacity ranges from 10GB to hundreds of GB, DRAM from hundreds of GB to TB, and SSD from TB to PB [4]

AI Inference Mechanism
- AI inference utilizes a mechanism similar to human brain attention, which involves storing important information (Key and Value) to enhance processing speed [5]
- The introduction of KV Cache allows AI models to remember previously processed information, significantly improving response times for ongoing discussions [5]
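The KV-cache mechanism in the last group can be sketched in a few lines: keys and values for already-processed tokens are stored once and reused, so each decoding step only projects the newest token and attends over the growing cache. This is a minimal single-head sketch with made-up dimensions, not any particular vendor's implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

class KVCache:
    """Toy single-head KV cache: past keys/values are stored and reused,
    so each new decoding step computes projections only for the newest
    token instead of recomputing them for the whole prefix."""

    def __init__(self, d_model, rng):
        self.Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.keys, self.values = [], []

    def step(self, x):
        # Project only the newest token; older K/V come from the cache.
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        self.keys.append(k)
        self.values.append(v)
        K = np.stack(self.keys)           # (t, d)
        V = np.stack(self.values)         # (t, d)
        w = softmax(K @ q / np.sqrt(len(q)))
        return w @ V

rng = np.random.default_rng(0)
cache = KVCache(d_model=32, rng=rng)
for _ in range(5):                        # decode 5 tokens
    out = cache.step(rng.standard_normal(32))
print(out.shape)                          # (32,)
```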
I'm MiniMax: we let interns process the data and still top the open-source large-model leaderboards
QbitAI · 2025-11-04 05:06
Core Viewpoint
- The article discusses the development and unique features of the MiniMax M2 model, highlighting its performance, data processing techniques, and the rationale behind its design choices, particularly the shift from Linear Attention to Full Attention.

Group 1: Model Performance
- M2 demonstrated strong performance by winning first place in the AI-Trader simulation competition, earning nearly 3,000 yuan from a starting capital of 100,000 yuan over 20 days [2]
- The choice of Full Attention over Linear Attention is presented as a strategic decision aimed at ensuring stability and reliability for commercial deployment [12][53]

Group 2: Attention Mechanism
- The article emphasizes the debate surrounding the choice of attention mechanisms, with M2's team opting for Full Attention after testing various alternatives, including Efficient Attention, which showed performance degradation with longer context lengths [12][15]
- The team argues that the perceived advantages of Efficient Attention are misleading, particularly in complex tasks where it fails to perform as well as Full Attention [18][22]

Group 3: Data Processing Techniques
- M2's data processing approach is highlighted as mature, allowing even inexperienced interns to achieve expected results, indicating a well-structured data handling process [27]
- The team focuses on enhancing the model's generalization capabilities by diversifying data formats and ensuring high-quality data through a rigorous cleaning process [35][38]

Group 4: Task Execution and Adaptability
- The concept of "Interleaved Thinking" is introduced, allowing the model to dynamically adjust its planning based on real-time execution feedback, improving its adaptability in task execution [46][48]
- The training data is designed to simulate real-world scenarios, covering various uncertainties to enhance the model's performance in practical applications [51][52]

Group 5: Engineering Philosophy
- MiniMax's decision to use Full Attention reflects a pragmatic engineering philosophy prioritizing real-world applicability and stability over merely optimizing for computational efficiency [53][56]
- The company aims to create a model that is not just technically advanced but also practical and understandable for developers, emphasizing a systematic approach to problem-solving [57][58]
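The tradeoff the M2 team weighed can be illustrated with a generic comparison; this is not MiniMax's code, and the kernelized "linear" variant below is only one family of efficient attention. Full softmax attention is exact but quadratic in sequence length, while the linear approximation is cheap but deviates from the softmax weighting, which is the kind of gap that can widen on long, complex contexts.

```python
import numpy as np

def full_attention(Q, K, V):
    """Standard softmax attention: O(L^2) in sequence length L,
    but every query sees every key exactly."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)
    W = np.exp(scores)
    W /= W.sum(axis=-1, keepdims=True)
    return W @ V

def linear_attention(Q, K, V, feature=lambda x: np.maximum(x, 0) + 1e-6):
    """Kernelized 'linear' attention: O(L) by replacing the softmax with a
    positive feature map, trading exactness for speed."""
    Qf, Kf = feature(Q), feature(K)
    KV = Kf.T @ V                          # (d, d) summary of all keys/values
    Z = Qf @ Kf.sum(axis=0)                # per-query normalizer
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
L, d = 256, 64
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
# Mean absolute gap between the exact and the approximate outputs.
print(np.abs(full_attention(Q, K, V) - linear_attention(Q, K, V)).mean())
```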
Understand the most important paper in AI history in 20 minutes: "Attention Is All You Need"
Huxiu · 2025-10-22 13:05
Core Insights
- The article highlights the transformative impact of the 2017 paper "Attention Is All You Need," which introduced the Transformer architecture, revolutionizing the AI technology landscape [1]
- The emergence of leading AI tools like ChatGPT and DeepSeek is directly linked to the advancements made possible by the Transformer model [1]

Summary by Sections

Transformer Architecture
- The Transformer architecture has fundamentally changed the approach to artificial intelligence, leading to a global "arms race" in the AI sector [1]
- Key concepts such as attention mechanisms, Q/K/V, multi-head attention, and positional encoding are explained in a simplified manner [1]

Impact on AI Industry
- The paper has catalyzed the rapid rise of major players in the AI industry, including OpenAI, showcasing the significant economic opportunities created by these advancements [1]
- The narrative includes the story of eight authors who left Google to pursue entrepreneurial ventures, resulting in remarkable wealth creation [1]
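The core pieces the article walks through (Q/K/V, scaled dot-product attention, multi-head attention, positional encoding) can be condensed into a short sketch. It follows the formulas in "Attention Is All You Need" but, as a simplification, omits the learned projection matrices, masking, and feed-forward layers, and the dimensions are made up.

```python
import numpy as np

def positional_encoding(length, d_model):
    """Sinusoidal positional encoding from the Transformer paper."""
    pos = np.arange(length)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    W = np.exp(scores)
    W /= W.sum(axis=-1, keepdims=True)
    return W @ V

def multi_head_self_attention(X, num_heads):
    """Split the model dimension into heads, attend per head, concatenate.
    (Learned per-head projections are omitted for brevity.)"""
    L, d_model = X.shape
    d_head = d_model // num_heads
    Xh = X.reshape(L, num_heads, d_head).transpose(1, 0, 2)   # (heads, L, d_head)
    out = scaled_dot_product_attention(Xh, Xh, Xh)             # self-attention
    return out.transpose(1, 0, 2).reshape(L, d_model)

rng = np.random.default_rng(0)
L, d_model = 16, 64
X = rng.standard_normal((L, d_model)) + positional_encoding(L, d_model)
print(multi_head_self_attention(X, num_heads=8).shape)         # (16, 64)
```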
AI special report: DeepSeek's sparse attention mechanism unleashes greater growth potential for the AI industry
Zhongyuan Securities· 2025-10-16 11:46
Investment Rating
- The industry investment rating is "Outperform the Market" with an expected increase of over 10% relative to the CSI 300 index in the next six months [41].

Core Insights
- The report emphasizes that the introduction of sparse attention mechanisms, particularly through DeepSeek, significantly enhances the development potential of the AI industry [8][37].
- DeepSeek's advancements in attention mechanisms, including Native Sparse Attention (NSA) and DeepSeek Sparse Attention (DSA), are pivotal in improving model performance and efficiency [18][23][37].

Summary by Sections

1. Relationship Between Attention Mechanism and Large Model Development
- The attention mechanism, introduced to improve information processing efficiency, has become a core component of large models, addressing the limitations of traditional recurrent neural networks [11].
- Sparse attention reduces computational complexity from O(L²) to sub-quadratic levels, thus overcoming memory and computational bottlenecks [11].

2. DeepSeek's Technological Improvements in Attention Mechanism
- DeepSeek has made significant contributions in three main areas: Multi-head Latent Attention (MLA), Native Sparse Attention (NSA), and DeepSeek Sparse Attention (DSA) [12][18][23].
- MLA reduces memory usage by approximately 90% while maintaining model performance, significantly lowering training costs [16].
- NSA enhances long text processing speed by 11 times and achieves performance comparable to traditional models [18].
- DSA improves training and inference efficiency, leading to substantial cost reductions for model usage [23].

3. DSA and NSA Unlock Greater Development Potential for the AI Industry
- The integration of DSA and NSA allows for expanded model context and improved computational efficiency, which are crucial for meeting the demands of multi-modal applications [33][37].
- The trend towards longer input and output lengths necessitates innovative approaches to model training and performance enhancement [33].
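As a rough illustration of why sparsity helps, here is a generic top-k scheme, not DeepSeek's MLA/NSA/DSA: once a small candidate set of keys has been chosen for each query, the attention itself only touches k entries instead of all L, which is where the sub-quadratic savings come from (real systems also make the candidate-selection step itself cheap, e.g. via blockwise scoring or indexes).

```python
import numpy as np

def sparse_topk_attention(Q, K, V, k=16):
    """Generic top-k sparse attention: each query attends only to its k
    highest-scoring keys instead of all L keys. Note that this toy version
    still scores all keys to pick the candidates; production sparse-attention
    schemes avoid that full scan as well."""
    d = Q.shape[-1]
    out = np.empty_like(Q)
    for i, q in enumerate(Q):
        scores = K @ q / np.sqrt(d)
        idx = np.argpartition(-scores, k)[:k]        # keep only the k best keys
        w = np.exp(scores[idx] - scores[idx].max())
        w /= w.sum()
        out[i] = w @ V[idx]
    return out

rng = np.random.default_rng(0)
L, d = 512, 64
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
print(sparse_topk_attention(Q, K, V).shape)          # (512, 64)
```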
A veteran "Transformer killer" quietly updates at ICLR: Mamba-3's three major improvements bring the design close to its final form
机器之心· 2025-10-14 08:24
Report from 机器之心; editor: 冷猫

To this day, the Transformer remains the mainstream architecture for AI models, and ever since it cemented its dominance, the stream of improvement efforts billed as "Transformer killers" has never stopped.

Among the challengers, the most influential is undoubtedly Mamba, the architecture based on structured state space sequence models (SSMs) that took the community by storm in 2023.

Mamba's popularity may owe something to its name, but its raw capability is genuinely strong. At the time, Mamba could match or even beat the Transformer at language modeling. Moreover, it scales linearly with context length, its performance on real data extends to million-token sequences, and it delivers a 5x improvement in inference throughput.

After Mamba's debut, a wave of follow-up work applied it to different tasks or improved on it, producing MoE-Mamba, Vision Mamba, VMamba, U-Mamba, MambaByte, MambaOut, and more, and Mamba was hailed as "the Transformer's strongest successor."

Yet Mamba met its Waterloo at the 2024 ICLR conference, where the paper was ultimately rejected.

In 2024, half a year after Mamba's release, Mamba-2 was officially released, winning a top conference ...
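For readers unfamiliar with SSMs, the linear scaling mentioned above comes from a recurrent state update rather than all-pairs attention. Below is a bare-bones diagonal linear state-space scan; it only illustrates the family Mamba builds on and is not Mamba-3 itself (no selective, input-dependent parameters and no hardware-aware parallel scan).

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal diagonal linear state-space recurrence of the kind SSM
    architectures build on:
        h_t = A * h_{t-1} + B x_t
        y_t = C . h_t
    Cost is O(L) in sequence length, which is where the linear scaling
    comes from.
    """
    L, _ = x.shape
    h = np.zeros(A.shape[0])
    y = np.empty((L, 1))
    for t in range(L):
        h = A * h + B @ x[t]          # diagonal A keeps the update elementwise
        y[t] = C @ h
    return y

rng = np.random.default_rng(0)
L, d_in, d_state = 1024, 8, 16
x = rng.standard_normal((L, d_in))
A = np.exp(-rng.random(d_state))      # stable decay per state dimension
B = rng.standard_normal((d_state, d_in)) * 0.1
C = rng.standard_normal(d_state) * 0.1
print(ssm_scan(x, A, B, C).shape)     # (1024, 1)
```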
From the Transformer to GPT-5: OpenAI scientist Lukasz on "first-principles thinking about large models"
AI科技大本营· 2025-09-23 02:11
Core Viewpoint
- The article discusses the revolutionary impact of the paper "Attention Is All You Need," which introduced the Transformer architecture, fundamentally changing the landscape of artificial intelligence and natural language processing [2][17].

Group 1: The Impact of the Transformer
- The paper "Attention Is All You Need" has been cited 197,159 times on Google Scholar, highlighting its significant influence in the AI research community [3][26].
- The authors of the paper, known as the "Transformer Eight," have become prominent figures in the AI industry, with seven of them starting their own companies [4][24].
- The introduction of the Transformer architecture has led to a paradigm shift in AI, moving away from RNNs and enabling better handling of long-distance dependencies in language processing [17][18].

Group 2: Lukasz Kaiser's Journey
- Lukasz Kaiser, one of the authors, chose to join OpenAI instead of starting a commercial venture, focusing on the pursuit of AGI [4][25].
- Kaiser has a strong academic background, holding dual master's degrees in computer science and mathematics, and has received prestigious awards for his research [7][8].
- His decision to leave a stable academic position for Google Brain in 2013 was driven by a desire for innovation in deep learning [11][12].

Group 3: The Evolution of AI Models
- Kaiser and his team introduced the attention mechanism to address the limitations of RNNs, leading to the development of the Transformer model [15][17].
- The success of the Transformer has spurred a wave of entrepreneurship in the AI field, with many authors of the original paper becoming CEOs and CTOs of successful startups [24][27].
- Kaiser has been involved in the development of cutting-edge models like GPT-4 and GPT-5 at OpenAI, contributing to the forefront of AI research [27].

Group 4: Future Directions in AI
- Kaiser predicts that the next phase of AI will focus on teaching models to think more deeply, emphasizing the importance of generating intermediate steps in reasoning [29].
- The upcoming ML Summit 2025 will feature Kaiser discussing the history, present, and future of reasoning models, indicating ongoing advancements in AI technology [28][30].