Attention Mechanisms
Baotong Lu of Microsoft Research: Reshaping Model Attention with Vector Retrieval (Attention)
36Kr· 2025-11-17 08:02
Core Insights
- The article discusses the limitations of long-context reasoning in large language models (LLMs), which stem from the quadratic complexity of self-attention and the significant memory required for key-value (KV) caching [1][5]
- It introduces a new mechanism called Retrieval Attention, which accelerates long-context LLM inference through dynamic sparse attention and requires no retraining [1][8]

Group 1: Retrieval Attention Mechanism
- Retrieval Attention posits that each query only needs to interact with a small subset of keys, making most of the attention computation redundant [3][7]
- The approach offloads most KV vectors from the GPU to the CPU and uses approximate nearest neighbor (ANN) search to identify the most relevant keys for each query (see the sketch following this summary) [3][7]
- This yields large memory savings: an 8B model needs only about 1/10 of the original KV-cache memory while maintaining accuracy [22]

Group 2: Performance Metrics
- Empirical tests on an RTX 4090 (24GB) show that the 8B model can generate stably with a 128K context at approximately 0.188 seconds per token, with accuracy close to full attention [5][6]
- The follow-up work, RetroInfer, demonstrated 4.5 times higher decoding throughput than full attention on A100 GPUs, and 10.5 times higher throughput than other sparse attention systems at 1M-token contexts [5][22]

Group 3: System Architecture
- The architecture features a dual-path attention mechanism: the GPU retains a small amount of "predictable" local KV cache, while the CPU holds a large-scale KV store that is queried dynamically [7][8]
- This design reduces both memory usage and inference latency, enabling efficient long-context reasoning without retraining the model [8][22]

Group 4: Theoretical and Practical Contributions
- The work offers a new theoretical perspective by framing the attention mechanism as a retrieval system, allowing important contextual information to be identified more precisely [23][25]
- It also emphasizes system-level optimizations, transforming the traditional linear cache into a dynamically allocated structure that improves efficiency in large-scale inference scenarios [23][25]

Group 5: Future Directions
- Future research may focus on a more rigorous theoretical framework for the error bounds of Retrieval Attention and on integrating dynamic learning mechanisms with system-level optimizations [26][30]
- In the long term, this line of research could lead to models with true long-term memory, able to maintain semantic consistency over very long contexts [30][31]
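As a rough illustration of the mechanism described above, the sketch below keeps a small "local" KV window on the fast device and, at each decoding step, retrieves only the top-k keys most similar to the current query from the offloaded store before running ordinary attention over that small set. The function name, window size, and the use of exact top-k in place of a real ANN index are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of the retrieval-attention idea: most of the KV cache lives off the
# accelerator, and each decode step pulls back only the top-k most relevant keys.
import numpy as np

def retrieval_attention_step(q, k_local, v_local, k_offloaded, v_offloaded, top_k=32):
    """One decode step: attend over a small local window plus retrieved KV pairs.

    q:                        (d,)    current query vector
    k_local / v_local:        (w, d)  recent "predictable" window on the fast device
    k_offloaded / v_offloaded:(n, d)  long-context KV pairs held in slower memory
    """
    d = q.shape[-1]
    # Stand-in for ANN search: score all offloaded keys and keep the top_k.
    # A real system would use an approximate index to avoid this full scan.
    scores_off = k_offloaded @ q
    idx = np.argpartition(scores_off, -top_k)[-top_k:]
    k_sel, v_sel = k_offloaded[idx], v_offloaded[idx]

    # Merge the local window with the retrieved subset and run ordinary
    # scaled dot-product attention over this small set only.
    k_cat = np.concatenate([k_local, k_sel], axis=0)
    v_cat = np.concatenate([v_local, v_sel], axis=0)
    logits = (k_cat @ q) / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v_cat

# Usage with random data: 128 local tokens, 100k offloaded tokens, d = 64.
rng = np.random.default_rng(0)
d = 64
out = retrieval_attention_step(
    rng.standard_normal(d),
    rng.standard_normal((128, d)), rng.standard_normal((128, d)),
    rng.standard_normal((100_000, d)), rng.standard_normal((100_000, d)),
)
print(out.shape)  # (64,)
```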
The "Father of HBM" Makes a Bold Guess: NVIDIA May Buy a Memory Company
半导体芯闻· 2025-11-04 09:48
Core Insights
- NVIDIA's CEO Jensen Huang visited South Korea for the first time in 15 years, meeting with key figures from Samsung and Hyundai to strengthen collaboration on memory and AI megafactories [2]
- The importance of memory in the AI era is increasing, and experts suggest NVIDIA may consider acquiring memory companies such as Micron or SanDisk to maintain its leadership in AI [2][3]
- Memory bottlenecks are a critical issue for AI inference, and major companies are focusing on solutions [3][4]

Memory Demand and Types
- Memory requirements for AI fall into three categories: HBM for real-time data, DRAM for short-term memory, and SSD for long-term data [4]
- Typical capacities range from 10GB to hundreds of GB for HBM, hundreds of GB to TB for DRAM, and TB to PB for SSD [4]

AI Inference Mechanism
- AI inference uses a mechanism analogous to human attention, storing important information as keys and values to speed up processing [5]
- The KV Cache lets a model reuse the keys and values computed for earlier tokens instead of reprocessing them, significantly improving response times in ongoing conversations (see the sketch following this summary) [5]
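A minimal sketch of the KV cache idea mentioned above, assuming a single attention head with random projection matrices: keys and values computed for earlier tokens are stored and reused, so each new token requires only one new projection rather than reprocessing the whole history. Class, shapes, and parameter names are illustrative, not any particular model's layout.

```python
# Incremental decoding with a KV cache: the cache grows by one (key, value) pair
# per generated token, and attention is computed against the cached history.
import numpy as np

class KVCache:
    def __init__(self, d_model):
        rng = np.random.default_rng(0)
        self.W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.keys, self.values = [], []      # grow by one entry per decoded token

    def step(self, x):
        """Attend the new token x (d_model,) over everything cached so far."""
        q = x @ self.W_q
        self.keys.append(x @ self.W_k)       # cached, so it is never recomputed
        self.values.append(x @ self.W_v)
        K = np.stack(self.keys)
        V = np.stack(self.values)
        logits = K @ q / np.sqrt(x.shape[-1])
        w = np.exp(logits - logits.max())
        w /= w.sum()
        return w @ V

cache = KVCache(d_model=32)
rng = np.random.default_rng(1)
for _ in range(5):                            # five decode steps
    out = cache.step(rng.standard_normal(32))
print(len(cache.keys), out.shape)             # 5 (32,)
```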
I'm MiniMax: Even with Interns Handling the Data, We Still Dominate the Open-Source LLM Leaderboards
量子位· 2025-11-04 05:06
Core Viewpoint
- The article discusses the development and distinctive features of the MiniMax M2 model, covering its performance, data processing techniques, and the rationale behind its design choices, particularly the shift from Linear Attention to Full Attention.

Group 1: Model Performance
- M2 demonstrated strong performance by winning first place in the AI-Trader simulation competition, earning nearly 3,000 yuan from a starting capital of 100,000 yuan over 20 days [2]
- The choice of Full Attention over Linear Attention is presented as a strategic decision aimed at ensuring stability and reliability for commercial deployment [12][53]

Group 2: Attention Mechanism
- The article highlights the debate over attention mechanisms: M2's team opted for Full Attention after testing alternatives, including Efficient Attention variants whose performance degraded at longer context lengths [12][15]
- The team argues that the perceived advantages of Efficient Attention are misleading, particularly on complex tasks where it fails to match Full Attention (the two computation patterns are sketched after this summary) [18][22]

Group 3: Data Processing Techniques
- M2's data pipeline is described as mature enough that even inexperienced interns can achieve the expected results, indicating a well-structured data handling process [27]
- The team focuses on improving generalization by diversifying data formats and ensuring quality through a rigorous cleaning process [35][38]

Group 4: Task Execution and Adaptability
- The concept of "Interleaved Thinking" is introduced: the model dynamically adjusts its plan based on real-time execution feedback, improving its adaptability during task execution [46][48]
- Training data is designed to simulate real-world scenarios and cover various uncertainties, improving performance in practical applications [51][52]

Group 5: Engineering Philosophy
- MiniMax's decision to use Full Attention reflects a pragmatic engineering philosophy that prioritizes real-world applicability and stability over purely optimizing computational efficiency [53][56]
- The company aims to build a model that is not just technically advanced but also practical and understandable for developers, emphasizing a systematic approach to problem solving [57][58]
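To make the trade-off concrete, here is a generic sketch of the two computation patterns: full softmax attention forms an n-by-n score matrix (quadratic in sequence length), while a kernelized "efficient"/linear-attention variant summarizes the context into a small d-by-d matrix (linear in sequence length) at the cost of approximating the softmax. The feature map and dimensions are illustrative assumptions; this is not MiniMax's internal implementation of either option.

```python
# Full attention vs. a kernelized linear-attention approximation.
import numpy as np

def full_attention(Q, K, V):
    # Forms the (n, n) score matrix: cost and memory scale with n^2.
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)
    W = np.exp(logits)
    W /= W.sum(axis=-1, keepdims=True)
    return W @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Kernel trick: phi(Q) (phi(K)^T V) never materializes the (n, n) matrix.
    KV = phi(K).T @ V                          # (d, d) summary of the whole context
    Z = phi(K).sum(axis=0)                     # (d,) normalizer
    return (phi(Q) @ KV) / (phi(Q) @ Z)[:, None]

rng = np.random.default_rng(0)
n, d = 256, 32
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(full_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)  # (256, 32) twice
```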
Understand the Most Important Paper in AI History in 20 Minutes: "Attention Is All You Need"
Hu Xiu· 2025-10-22 13:05
Core Insights
- The article highlights the transformative impact of the 2017 paper "Attention Is All You Need", which introduced the Transformer architecture and reshaped the AI technology landscape [1]
- The emergence of leading AI tools such as ChatGPT and DeepSeek is directly linked to the advances made possible by the Transformer model [1]

Summary by Sections

Transformer Architecture
- The Transformer architecture fundamentally changed the approach to artificial intelligence, triggering a global "arms race" in the AI sector [1]
- Key concepts such as attention mechanisms, Q/K/V, multi-head attention, and positional encoding are explained in simplified terms (a code sketch of Q/K/V and multi-head attention follows this summary) [1]

Impact on AI Industry
- The paper catalyzed the rapid rise of major players in the AI industry, including OpenAI, illustrating the significant economic opportunities these advances created [1]
- The narrative also recounts how the paper's eight authors left Google to pursue entrepreneurial ventures, resulting in remarkable wealth creation [1]
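For readers who want the Q/K/V and multi-head ideas in code form, the sketch below follows the paper's formulation: each head projects the input into its own query/key/value subspace, runs scaled dot-product attention, and the head outputs are concatenated and projected back. The random weights and dimensions are illustrative, not a production implementation.

```python
# Multi-head scaled dot-product attention in the style of the Transformer paper.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    logits = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])   # (h, n, n)
    logits -= logits.max(axis=-1, keepdims=True)
    W = np.exp(logits)
    W /= W.sum(axis=-1, keepdims=True)
    return W @ V                                               # (h, n, d_head)

def multi_head_attention(X, num_heads=4):
    n, d_model = X.shape
    d_head = d_model // num_heads
    rng = np.random.default_rng(0)
    # One projection per head for Q, K, and V, plus an output projection.
    Wq, Wk, Wv = (rng.standard_normal((num_heads, d_model, d_head)) / np.sqrt(d_model)
                  for _ in range(3))
    Wo = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    Q = np.einsum('nd,hde->hne', X, Wq)
    K = np.einsum('nd,hde->hne', X, Wk)
    V = np.einsum('nd,hde->hne', X, Wv)
    heads = scaled_dot_product_attention(Q, K, V)              # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)      # concatenate heads
    return concat @ Wo

X = np.random.default_rng(1).standard_normal((10, 64))
print(multi_head_attention(X).shape)   # (10, 64)
```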
AI Special Topic: DeepSeek's Sparse Attention Mechanism Unlocks Greater Growth Potential for the AI Industry
Zhongyuan Securities· 2025-10-16 11:46
Investment Rating
- The industry investment rating is "Outperform the Market", with an expected gain of over 10% relative to the CSI 300 index over the next six months [41]

Core Insights
- The report argues that the introduction of sparse attention mechanisms, particularly through DeepSeek, significantly enhances the development potential of the AI industry [8][37]
- DeepSeek's advances in attention mechanisms, including Native Sparse Attention (NSA) and DeepSeek Sparse Attention (DSA), are pivotal to improving model performance and efficiency [18][23][37]

Summary by Sections

1. Relationship Between the Attention Mechanism and Large Model Development
- The attention mechanism, introduced to improve information-processing efficiency, has become a core component of large models, addressing the limitations of traditional recurrent neural networks [11]
- Sparse attention reduces computational complexity from O(L²) to sub-quadratic levels, easing memory and compute bottlenecks (a generic sketch of the pattern follows this summary) [11]

2. DeepSeek's Technological Improvements to the Attention Mechanism
- DeepSeek's contributions fall into three main areas: Multi-head Latent Attention (MLA), Native Sparse Attention (NSA), and DeepSeek Sparse Attention (DSA) [12][18][23]
- MLA reduces memory usage by approximately 90% while maintaining model performance, significantly lowering training costs [16]
- NSA speeds up long-text processing by 11 times while achieving performance comparable to traditional models [18]
- DSA improves training and inference efficiency, leading to substantial cost reductions for model usage [23]

3. DSA and NSA Unlock Greater Development Potential for the AI Industry
- Together, DSA and NSA allow for longer model contexts and better computational efficiency, which are crucial for the demands of multi-modal applications [33][37]
- The trend toward longer input and output lengths calls for innovative approaches to model training and performance enhancement [33]
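The following is a generic illustration of the sparse-attention pattern the report describes: each query attends to a causal local window plus a few coarsely selected distant blocks, so the work per query stays roughly constant as the sequence grows. The window and block sizes and the mean-pooled block scoring are made-up choices for illustration; this is not DeepSeek's NSA or DSA algorithm.

```python
# Local window + selected distant blocks: a simple sparse-attention pattern.
import numpy as np

def sparse_attention(Q, K, V, window=64, block=32, top_blocks=2):
    n, d = Q.shape
    out = np.zeros_like(Q)
    # Mean-pooled key per block, used to pick which distant blocks each query sees.
    n_blocks = n // block
    block_keys = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    for i in range(n):
        lo = max(0, i - window + 1)
        idx = list(range(lo, i + 1))                      # causal local window
        scores = block_keys @ Q[i]                        # coarse block scores
        for b in np.argsort(scores)[-top_blocks:]:        # a few selected blocks
            idx.extend(range(b * block, min((b + 1) * block, i + 1)))
        idx = np.unique([j for j in idx if j <= i])       # keep it causal
        logits = K[idx] @ Q[i] / np.sqrt(d)
        w = np.exp(logits - logits.max())
        w /= w.sum()
        out[i] = w @ V[idx]
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 32)) for _ in range(3))
print(sparse_attention(Q, K, V).shape)   # (512, 32)
```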
The Veteran "Transformer Killer" Quietly Updates at ICLR: Mamba-3's Three Major Improvements Approach Its Finished Form
机器之心· 2025-10-14 08:24
Reported by 机器之心; editor: 冷猫

To this day, the Transformer remains the mainstream architecture for AI models. Ever since it cemented its dominance, work on would-be "Transformer killers" has never stopped.

Among the many challengers, the most influential is Mamba, the architecture based on structured state space sequence models (SSMs) that went viral in the community in 2023.

Mamba's popularity may owe something to its name, but its underlying strength is real.

At the time, Mamba could match or even beat the Transformer at language modeling. It also scales linearly with context length, holds up on real data at sequence lengths of up to a million tokens, and delivers a 5x improvement in inference throughput.

After Mamba appeared, a wave of work applied it to different tasks or built on it, producing MoE-Mamba, Vision Mamba, VMamba, U-Mamba, MambaByte, MambaOut, and other projects, and it came to be called "the Transformer's most powerful successor."

But Mamba hit a setback at ICLR 2024 and was ultimately rejected.

In 2024, half a year after Mamba's release, Mamba-2 was officially released, winning a top conference ...
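The linear scaling mentioned above comes from the state space recurrence at Mamba's core: instead of attending over all previous tokens, the model carries a fixed-size state forward, so each token costs the same regardless of context length. The sketch below is a generic diagonal SSM recurrence with made-up dimensions, not Mamba's (or Mamba-3's) actual selective parameterization.

```python
# Generic diagonal state-space recurrence: constant work per token.
import numpy as np

def ssm_scan(x, A, B, C):
    """x: (seq_len, d_in); A: (d_state,); B: (d_state, d_in); C: (d_in, d_state)."""
    d_state = A.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:                     # one constant-cost update per token
        h = A * h + B @ x_t           # state update: elementwise decay + input
        ys.append(C @ h)              # readout back to the model dimension
    return np.stack(ys)

rng = np.random.default_rng(0)
seq_len, d_in, d_state = 1000, 16, 64
x = rng.standard_normal((seq_len, d_in))
A = np.exp(-rng.uniform(0.01, 0.5, d_state))      # stable decay rates in (0, 1)
B = rng.standard_normal((d_state, d_in)) * 0.1
C = rng.standard_normal((d_in, d_state)) * 0.1
print(ssm_scan(x, A, B, C).shape)                 # (1000, 16)
```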
From the Transformer to GPT-5: OpenAI Scientist Lukasz on "First-Principles Thinking About Large Models"
AI科技大本营· 2025-09-23 02:11
Core Viewpoint
- The article discusses the revolutionary impact of the paper "Attention Is All You Need", which introduced the Transformer architecture and fundamentally changed the landscape of artificial intelligence and natural language processing [2][17]

Group 1: The Impact of the Transformer
- The paper has been cited 197,159 times on Google Scholar, underscoring its influence in the AI research community [3][26]
- Its authors, known as the "Transformer Eight", have become prominent figures in the AI industry, with seven of them founding their own companies [4][24]
- The Transformer architecture triggered a paradigm shift in AI, moving away from RNNs and enabling better handling of long-distance dependencies in language processing [17][18]

Group 2: Lukasz Kaiser's Journey
- Lukasz Kaiser, one of the authors, chose to join OpenAI rather than start a commercial venture, focusing on the pursuit of AGI [4][25]
- Kaiser has a strong academic background, holding dual master's degrees in computer science and mathematics, and has received prestigious awards for his research [7][8]
- His decision to leave a stable academic position for Google Brain in 2013 was driven by a desire to work on deep learning [11][12]

Group 3: The Evolution of AI Models
- Kaiser and his team introduced the attention mechanism to address the limitations of RNNs, which led to the development of the Transformer model [15][17]
- The Transformer's success spurred a wave of entrepreneurship in the AI field, with many of the original paper's authors becoming CEOs and CTOs of successful startups [24][27]
- Kaiser has since been involved in developing cutting-edge models such as GPT-4 and GPT-5 at OpenAI, contributing to the forefront of AI research [27]

Group 4: Future Directions in AI
- Kaiser predicts that the next phase of AI will focus on teaching models to think more deeply, emphasizing the importance of generating intermediate reasoning steps [29]
- The upcoming ML Summit 2025 will feature Kaiser discussing the history, present, and future of reasoning models, pointing to ongoing advances in AI technology [28][30]
Opening an Interdisciplinary Interview Series on the Attention Mechanism
36Kr· 2025-09-05 03:48
Core Insights
- The article emphasizes the significance of the "Attention" mechanism in AI development, framing it as a foundational paradigm rather than merely a model component [1][6]
- The company has launched a series of in-depth interviews around "Attention" to explore its implications for AI and its intersection with human cognition [5][12]

Group 1: AI Development and the Attention Mechanism
- Over the past seven years, "Attention" has been the common underlying thread in the key advances in AI technology [1]
- The company views the current wave of AI innovation as transformative, surpassing the scale of the Industrial Revolution [1]
- The exploration of "Attention" is not merely retrospective but a necessary discussion for understanding its relevance in today's AI landscape [6]

Group 2: AI Portfolio and Research Initiatives
- The company has built a core investment portfolio in AI and embodied intelligence, including nearly twenty projects such as MiniMax and Vast [1]
- Its first interview series focused on the essence of AI and its foundational technologies, leading to the view of AI as future infrastructure [2][3]
- The second series centered on the "Agent", exploring its role as a service driven by large models and its importance in the AI ecosystem [4]

Group 3: Future Directions and Human Cognition
- The article discusses a dual evolution of AI: scholars are both scaling Transformer structures and innovating on cognitive frameworks to deepen AI's understanding of "Attention" [8]
- It raises critical questions about how AI's evolution affects human attention mechanisms, especially in a world increasingly filled with fragmented information [10][11]
- The company aims to protect human attention while helping AI learn to manage it, marking the start of a new series of discussions on this topic [12]
The Father of Google Brain Opens Up for the First Time: A Break-Room Chat Sparked a Trillion-Dollar Empire, and AI Self-Improvement Nears a Threshold
36Kr· 2025-08-25 03:35
Core Insights
- Jeff Dean, a key figure in AI and the founder of Google Brain, shared his journey and his views on the evolution of neural networks and AI in a recent podcast interview [1][2][3]

Group 1: Early Life and Career
- Dean had an unusual childhood, moving frequently and attending 11 schools in 12 years, which shaped his adaptability [7]
- His early interest in computers was sparked by a DIY computer kit purchased by his father, which led him to teach himself programming [9][11][13]
- His first significant encounter with AI came during his undergraduate studies, where he learned about neural networks and their suitability for parallel computing [15][17]

Group 2: Contributions to AI
- Dean proposed the concepts of data parallelism and model parallelism in the 1990s, laying groundwork for later developments [8]
- Google Brain grew out of a casual conversation with Andrew Ng in a Google break room, highlighting the collaborative nature of innovation [22][25]
- Google Brain's early achievements included training large neural networks on distributed systems spanning 2,000 computers and 16,000 cores [26]

Group 3: Breakthroughs in Neural Networks
- The "average cat" image produced by Google Brain marked a significant milestone, showcasing the capabilities of unsupervised learning [30]
- Google Brain achieved a 60% relative error-rate reduction on the ImageNet dataset and a 30% error-rate reduction in speech systems, demonstrating the effectiveness of its models [30]
- The development of attention mechanisms and of models such as word2vec and sequence-to-sequence significantly advanced natural language processing [32][34][40]

Group 4: Future of AI
- Dean emphasized the importance of explainability in AI, suggesting that future models could directly answer questions about their own decisions [43][44]
- He noted that while large language models (LLMs) have surpassed average human performance on many tasks, they have not yet reached expert level in some areas [47]
- Dean's future plans involve building more powerful and cost-effective models to serve billions of users, pointing to ongoing innovation in AI technology [50]
From Scratch: A Learning Roadmap for End-to-End Autonomous Driving and VLA
自动驾驶之心· 2025-08-24 23:32
Core Viewpoint
- The article emphasizes the importance of understanding end-to-end (E2E) algorithms and vision-language-action (VLA) models in the context of autonomous driving, highlighting the rapid development and complexity of the technology stack involved [2][32]

Summary by Sections

Introduction to End-to-End and VLA
- The article notes the evolution of large language models over the past five years, indicating significant technological advancement in the field [2]

Technical Foundations
- The Transformer architecture is introduced as the foundation for understanding large models, with a focus on attention mechanisms and multi-head attention [8][12]
- Tokenization methods such as BPE (Byte Pair Encoding) and positional encoding are explained as essential for processing sequences in models (a sketch of sinusoidal positional encoding appears at the end of this summary) [13][9]

Course Overview
- A new course, "End-to-End and VLA Autonomous Driving", has been launched, aimed at providing a comprehensive understanding of the technology stack and its practical applications in autonomous driving [21][33]
- The course is structured into five chapters, covering topics from basic E2E algorithms to advanced VLA methods, and includes practical assignments [36][48]

Key Learning Objectives
- The course aims to equip participants to classify research papers, extract their innovations, and develop their own research frameworks [34]
- Emphasis is placed on integrating theory and practice so that learners can apply their knowledge effectively [35]

Industry Demand and Career Opportunities
- Demand for VLA/VLM algorithm experts is high, with salaries ranging from 40K to 70K for positions requiring 3-5 years of experience [29]
- The course is positioned as a pathway for people looking to move into autonomous-driving algorithm roles, particularly around emerging technologies [28]
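As a small companion to the technical-foundations material above, here is the sinusoidal positional encoding from the original Transformer paper: each position receives a fixed vector of sines and cosines at different frequencies, which is added to the token embeddings so attention can distinguish order. The sequence length and model dimension are arbitrary illustrative values.

```python
# Sinusoidal positional encoding as described in "Attention Is All You Need".
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                             # even dimensions
    pe[:, 1::2] = np.cos(angles)                             # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=64)
print(pe.shape)   # (128, 64)
```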