Attention Mechanisms
Google Just Flipped the Table on Model Memory; Now NVIDIA Is Overthrowing Attention
36Kr· 2026-01-20 01:12
Core Insights
- Google's Nested Learning has sparked a significant shift in the understanding of model memory, allowing models to change parameters during inference rather than remaining static after training [1][5]
- NVIDIA's research introduces a more radical approach in the paper "End-to-End Test-Time Training for Long Context," arguing that memory is essentially learning and that "remembering" equates to "continuing to train" (a toy sketch of this idea follows after this summary) [1][10]

Group 1: Nested Learning and Test-Time Training (TTT)
- Nested Learning allows models to incorporate new information into their internal memory during inference, rather than just storing it temporarily [1][5]
- TTT, an idea with roots dating back to 2013, enables models to adapt their parameters during inference, improving performance on the current context [5][9]
- TTT-E2E proposes a method that eliminates the need for traditional attention mechanisms, yielding constant per-token latency regardless of context length [7][9]

Group 2: Memory Redefined
- Memory is redefined as a continuous learning process rather than a static storage structure, emphasizing how past information influences future predictions [10][34]
- TTT-E2E aligns the model's learning objective directly with its ultimate goal of next-token prediction, enhancing its ability to learn from context [10][16]

Group 3: Engineering Stability and Efficiency
- The TTT-E2E implementation incorporates meta-learning to stabilize learning during inference, addressing catastrophic forgetting and parameter drift [20][22]
- Safeguards such as mini-batch processing and sliding-window attention ensure the model retains short-term memory while its parameters are being updated [24][25]

Group 4: Performance Metrics
- TTT-E2E demonstrates superior loss reduction across varying context lengths, maintaining efficiency even as context grows [27][29]
- Learning continuously from context without relying on traditional attention mechanisms yields significant improvements in prediction accuracy [31][34]

Group 5: Future Implications
- These advances suggest a shift toward a more sustainable form of continuous learning that could become a leading industry solution for long-context scenarios [34][35]
- The approach aligns with growing demand for models that can learn and adapt without the high computational costs of traditional attention mechanisms [33][34]
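To make "remembering = continuing to train" concrete, below is a toy sketch of one test-time-training step for a linear fast-weight memory. It illustrates the general TTT principle only, under assumptions of my own: the linear memory, the squared-error objective, the names, and the learning rate are all hypothetical, not TTT-E2E's actual architecture.

```python
import numpy as np

def ttt_decode_step(W, h_prev, x_new, lr=1e-2):
    """One test-time-training step for a toy linear memory layer.

    Rather than storing context in a KV cache, the fast weights W take
    one gradient step on a self-supervised objective: predict the token
    embedding that actually arrived next. W: (d, d) fast weights;
    h_prev: (d,) previous state; x_new: (d,) newly observed embedding.
    """
    pred = W @ h_prev                   # what the memory expected next
    err = pred - x_new                  # self-supervised error signal
    W -= lr * np.outer(err, h_prev)     # SGD step: this IS "remembering"
    return W, x_new                     # the new token becomes the state

# Usage over a stream of (stand-in) token embeddings: every token both
# trains the memory and becomes the context for the next prediction.
d = 16
W, h = np.zeros((d, d)), np.zeros(d)
for x in np.random.default_rng(0).normal(size=(100, d)):
    W, h = ttt_decode_step(W, h, x)
```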
Evolve Like a Large Model
腾讯研究院· 2026-01-05 08:44
Group 1
- The core idea of the article is that the evolution of AI models, from early symbolic AI through deep learning to the success of Transformer models, can inform human cognitive development [1]
- The article discusses the importance of defining a clear objective function in machine learning, which guides the optimization of models, and compares this to the necessity of setting long-term goals in personal development [3][4]
- It highlights the concept of the "local optimum" in both machine learning and personal growth, warning against settling for short-term achievements that may limit future opportunities [4][5]

Group 2
- The article references Abraham Maslow's insights on self-actualization and the fear of success, suggesting that individuals often hesitate to pursue greatness due to self-doubt and societal pressures [5]
- It recounts Sam Altman's experience in setting OpenAI's ambitious goal of achieving AGI, illustrating how bold objectives can attract talent and drive innovation [6]
- The importance of building a personal knowledge system is emphasized, as it enables individuals to engage deeply with the world and develop irreplaceable skills in the age of AI [7]

Group 3
- The article explains the process of stochastic gradient descent (SGD) in machine learning, which involves iterative optimization driven by error correction, and draws parallels to how humans learn from mistakes (a minimal sketch follows after this summary) [10][12]
- It discusses the significance of embracing errors as a means of growth, suggesting that mistakes provide valuable feedback that can enhance cognitive flexibility and adaptability [12][13]
- The concept of "random exploration" is presented as a strategy for personal development, encouraging individuals to seek diverse experiences and knowledge to avoid cognitive stagnation [15][16]

Group 4
- The article stresses the importance of attention in learning, likening it to the attention mechanism in Transformers, and advocates focusing on high-quality data and relationships to deepen understanding [19][20]
- It advises against rigid rule-based learning, promoting learning through examples and experiences, which allows for deeper understanding and adaptability [22][23]
- The article concludes with the notion of selective forgetting as a cognitive strategy, emphasizing the need to prioritize valuable information while letting go of less useful knowledge [25][26]
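For readers who want the mechanical version of the loop the article analogizes from, here is a minimal SGD sketch. The toy regression target, hyperparameters, and names are illustrative only.

```python
import numpy as np

def sgd(grad_fn, w, data, lr=0.1, epochs=20):
    """Minimal stochastic gradient descent: shuffle the data, take one
    small corrective step per example, repeat -- the 'learn a little
    from every mistake' loop the article maps onto personal growth."""
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            w -= lr * grad_fn(w, x, y)   # small correction per error
    return w

# Example: recover y = 3x under squared loss; d/dw 0.5*(w*x - y)^2 = (w*x - y)*x.
data = [(x, 3.0 * x) for x in np.linspace(-1.0, 1.0, 50)]
w = sgd(lambda w, x, y: (w * x - y) * x, w=0.0, data=data)
print(round(w, 2))   # -> 3.0
```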
Gemini 3 Pre-training Lead Warns: The Model War Has Shifted from Algorithms to Engineering, Synthetic Data Is the Core of Generational Leaps, and Google's Secret Weapon Against OpenAI and Meta Is Revealed
36Kr· 2025-12-26 12:21
Group 1
- The core point of the article is that Gemini 3 has emerged as a dominant player in the AI model industry, showcasing significant advancements in pre-training and post-training techniques that have led to its superior performance across benchmark tests [2][10]
- Google DeepMind's focus has shifted from merely creating models to developing comprehensive systems that integrate research, engineering, and infrastructure [4][16]
- The industry is transitioning from an "unlimited data" era to a "limited data" phase, prompting a reevaluation of innovation strategies within AI [4][5]

Group 2
- The success of Gemini 3 is attributed to continuous optimization across numerous details rather than a single breakthrough, emphasizing the importance of teamwork and collaboration in achieving significant advancements [3][10]
- Synthetic data is gaining traction, but caution is advised due to risks such as data distribution shifts that can produce misleading improvements [5][34]
- Future directions in AI pre-training will focus on architectural innovations, including longer context capabilities and integrating retrieval mechanisms into training processes [7][38]

Group 3
- The evaluation of AI models is critical, requiring robust internal assessment systems to avoid misleading conclusions about model performance [41][40]
- Integrating retrieval capabilities into models is seen as a promising way to enhance reasoning and knowledge retention without relying solely on stored parameters [39][49]
- Rapidly growing user engagement with AI models necessitates a focus on cost-effective deployment and resource-efficient inference [52][56]
Scaling Laws Aren't Dead: A Core Gemini Figure Reveals Google Already Holds a Disruptive Key
36Kr· 2025-12-22 01:05
Core Insights
- Google DeepMind's Gemini pre-training head, Sebastian Borgeaud, predicts significant innovations in long-context processing efficiency and context-length expansion within the next year [2][4][16]
- Recent discussions among key figures at Google, including Jeff Dean, Oriol Vinyals, and Noam Shazeer, indicate a consensus on the evolving nature of AI models and the importance of system architecture over mere model size [26][30][32]

Group 1: Innovations in AI
- Major advancements are expected in long-context capabilities, transforming models into comprehensive digital workspaces capable of handling extensive data and complex tasks [16]
- Recent discoveries in attention mechanisms may lead to substantial improvements in model understanding and efficiency, indicating significant remaining headroom in this area [18]
- The return of retrieval-based learning, where models dynamically access external knowledge rather than relying solely on memorized data, is seen as a promising direction for future AI development [19]

Group 2: Shift in AI Development Paradigms
- The industry is transitioning from a "data abundance" mindset to a "data limited" approach, necessitating more efficient use of available data and a focus on sophisticated system engineering [12][30]
- The emphasis is shifting from merely achieving high performance to ensuring models are cost-effective and reliable for long-term deployment [22][30]
- The concept of "slow thinking" is introduced, highlighting the need for models to engage in continuous self-assessment and correction rather than just rapid output generation [30]

Group 3: System vs. Model
- The term "system" is frequently used to describe Gemini, emphasizing its role as long-term, iterative infrastructure rather than a one-time model achievement [31][32]
- Stability, scalability, and the ability to recover from errors are prioritized over immediate performance metrics, indicating a strategic shift in how AI systems are developed and evaluated [32][34]
- Google aims to create a sustainable, evolving intelligent system rather than a fleeting product, reflecting a commitment to long-term innovation in AI [34]
Microsoft Research's Baotong Lu: Reshaping Model Attention with Vector Retrieval (RetrievalAttention)
36Kr· 2025-11-17 08:02
Core Insights
- The article discusses the limitations of long-context reasoning in large language models (LLMs) due to the quadratic complexity of self-attention and the significant memory requirements of key-value (KV) caching [1][5]
- It introduces a new mechanism called Retrieval Attention, which accelerates long-context LLM inference through dynamic sparse attention and requires no retraining [1][8]

Group 1: Retrieval Attention Mechanism
- Retrieval Attention posits that each query only needs to interact with a small subset of keys, making most attention computation redundant [3][7]
- The approach offloads most KV vectors from the GPU to the CPU and uses approximate nearest neighbor (ANN) search to identify the most relevant keys for each query (a minimal sketch follows after this summary) [3][7]
- This allows significant reductions in memory usage: an 8B model needs only about 1/10 of the original KV-cache memory while maintaining accuracy [22]

Group 2: Performance Metrics
- Empirical tests on an RTX 4090 (24GB) show the 8B model generating stably with a 128K context at approximately 0.188 seconds per token, at nearly the same precision as full attention [5][6]
- The follow-up work, RetroInfer, demonstrated 4.5x higher decoding throughput on A100 GPUs compared to full attention, and 10.5x higher throughput at 1M-token contexts compared to other sparse-attention systems [5][22]

Group 3: System Architecture
- The architecture features a dual-path attention mechanism: the GPU retains a small amount of "predictable" local KV cache, while the CPU hosts a large-scale KV store queried dynamically [7][8]
- This design reduces both memory usage and inference latency, enabling efficient long-context reasoning without retraining the model [8][22]

Group 4: Theoretical and Practical Contributions
- The work offers a new theoretical perspective by framing the attention mechanism as a retrieval system, allowing more precise identification of important contextual information [23][25]
- It also emphasizes system-level optimizations, transforming traditional linear caching into a dynamic allocation structure that improves efficiency in large-scale inference scenarios [23][25]

Group 5: Future Directions
- Future research may establish a more rigorous theoretical framework for Retrieval Attention's error bounds and explore integrating dynamic learning mechanisms with system-level optimizations [26][30]
- In the long term, this line of research could lead to models with true long-term memory, able to maintain semantic consistency over extensive contexts [30][31]
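The following numpy sketch shows the dual-path shape of the idea for a single decoding step. It is a simplification under stated assumptions: exact top-k search stands in for the ANN index, the GPU/CPU split is modeled as two array slices, and all names and defaults are illustrative rather than the paper's API.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def retrieval_attention(q, K, V, local_window=128, top_k=64):
    """Dual-path sparse attention for one decoding step.

    q: (d,) current query; K, V: (n, d) full key/value history.
    The last `local_window` entries play the role of the on-GPU local
    cache; the rest model the CPU-side store, searched here with exact
    top-k as a stand-in for approximate nearest-neighbor lookup.
    """
    n, d = K.shape
    split = max(n - local_window, 0)
    K_cpu, V_cpu = K[:split], V[:split]      # offloaded majority
    K_gpu, V_gpu = K[split:], V[split:]      # small local window

    # Retrieval path: fetch only the most relevant offloaded keys.
    k = min(top_k, split)
    idx = np.argsort(-(K_cpu @ q))[:k]       # stand-in for ANN search
    K_sel, V_sel = K_cpu[idx], V_cpu[idx]

    # Attend jointly over retrieved keys plus the local window.
    K_all = np.concatenate([K_sel, K_gpu])
    V_all = np.concatenate([V_sel, V_gpu])
    w = softmax(K_all @ q / np.sqrt(d))
    return w @ V_all
```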
'Father of HBM' Ventures a Bold Guess: NVIDIA May Buy a Memory Company
半导体芯闻· 2025-11-04 09:48
Core Insights
- NVIDIA CEO Jensen Huang visited South Korea for the first time in 15 years, meeting with key figures from Samsung and Hyundai to strengthen collaboration on memory and AI megafactories [2]
- The importance of memory in the AI era is growing, with experts suggesting NVIDIA may consider acquiring memory companies such as Micron or SanDisk to maintain its leadership in AI [2][3]
- Memory bottlenecks are a critical issue for AI inference, and major companies are focusing on solutions [3][4]

Memory Demand and Types
- Memory requirements for AI fall into three tiers: HBM for real-time data, DRAM for short-term memory, and SSD for long-term data [4]
- Typical capacities: HBM from 10GB to hundreds of GB, DRAM from hundreds of GB to TBs, and SSD from TBs to PBs [4]

AI Inference Mechanism
- AI inference uses a mechanism analogous to human attention, storing important information (Keys and Values) to speed up processing [5]
- The KV Cache lets AI models remember previously processed information instead of recomputing it, significantly improving response times in ongoing conversations (a minimal sketch follows after this summary) [5]
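A minimal sketch of the KV Cache idea described above: keys and values computed for earlier tokens are stored and reused, so each new token costs one append plus one pass over the cache rather than re-encoding the whole prefix. The class and names are hypothetical, for illustration.

```python
import numpy as np

class KVCache:
    """Toy KV cache for autoregressive decoding: keys/values of past
    tokens are stored once and never recomputed."""
    def __init__(self, d):
        self.d = d
        self.keys, self.values = [], []

    def attend(self, q, k, v):
        # Append this token's key/value, then attend over the history.
        self.keys.append(k)
        self.values.append(v)
        K = np.stack(self.keys)              # (t, d), grows one row/step
        V = np.stack(self.values)
        s = K @ q / np.sqrt(self.d)
        w = np.exp(s - s.max())
        w /= w.sum()
        return w @ V                         # context vector for this step

# Usage: per decoded token, one append plus one read of the cache.
rng = np.random.default_rng(0)
cache = KVCache(d=64)
for q, k, v in rng.normal(size=(10, 3, 64)):
    out = cache.attend(q, k, v)
```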
I'm MiniMax: With Interns Handling the Data, We Still Top the Open-Source LLM Leaderboards
量子位· 2025-11-04 05:06
Core Viewpoint
- The article discusses the development and distinguishing features of the MiniMax M2 model: its performance, its data-processing techniques, and the rationale behind its design choices, particularly the shift from Linear Attention to Full Attention.

Group 1: Model Performance
- M2 demonstrated strong performance by winning first place in the AI-Trader simulation competition, earning nearly 3,000 yuan from a starting capital of 100,000 yuan over 20 days [2]
- The choice of Full Attention over Linear Attention is presented as a strategic decision aimed at ensuring stability and reliability for commercial deployment [12][53]

Group 2: Attention Mechanism
- The article covers the debate over attention mechanisms: M2's team settled on Full Attention after testing alternatives, including Efficient Attention variants whose performance degraded at longer context lengths (a sketch contrasting the two follows after this summary) [12][15]
- The team argues that the perceived advantages of Efficient Attention are misleading, particularly on complex tasks where it fails to match Full Attention [18][22]

Group 3: Data Processing Techniques
- M2's data-processing pipeline is described as mature enough that even inexperienced interns can achieve the expected results, indicating a well-structured data-handling process [27]
- The team focuses on the model's generalization by diversifying data formats and ensuring high-quality data through a rigorous cleaning process [35][38]

Group 4: Task Execution and Adaptability
- The concept of "Interleaved Thinking" allows the model to dynamically adjust its plan based on real-time execution feedback, improving adaptability during task execution [46][48]
- Training data is designed to simulate real-world scenarios, covering various sources of uncertainty to enhance performance in practical applications [51][52]

Group 5: Engineering Philosophy
- MiniMax's choice of Full Attention reflects a pragmatic engineering philosophy that prioritizes real-world applicability and stability over raw computational efficiency [53][56]
- The company aims to build a model that is not just technically advanced but also practical and understandable for developers, emphasizing a systematic approach to problem-solving [57][58]
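To ground the Full-vs-Efficient debate, here is a hedged sketch contrasting standard softmax attention with one common "efficient" variant, kernelized linear attention. The feature map and names are my own illustrative choices; the article does not specify which efficient variants M2's team actually tested.

```python
import numpy as np

def full_attention(Q, K, V):
    """Softmax attention: materializes an (L, L) score matrix, so cost
    grows quadratically with context, but every query can weigh every
    key exactly -- the property M2's team reportedly chose to keep."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])          # (L, L)
    S = np.exp(S - S.max(axis=-1, keepdims=True))
    return (S / S.sum(axis=-1, keepdims=True)) @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized 'efficient' attention: reassociating the product as
    phi(Q) @ (phi(K).T @ V) gives O(L) cost, but the fixed-size (d, d)
    summary compresses the keys and can lose precision on long,
    complex contexts -- the degradation the M2 team describes."""
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                               # (d, d) running summary
    Z = Qp @ Kp.sum(axis=0)                     # per-query normalizer
    return (Qp @ KV) / Z[:, None]
```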
Understand the Most Important Paper in AI History in 20 Minutes: "Attention Is All You Need"
Hu Xiu· 2025-10-22 13:05
Core Insights
- The article highlights the transformative impact of the 2017 paper "Attention Is All You Need," which introduced the Transformer architecture and reshaped the AI technology landscape [1]
- The emergence of leading AI tools such as ChatGPT and DeepSeek is directly linked to the advancements made possible by the Transformer model [1]

Summary by Sections

Transformer Architecture
- The Transformer architecture fundamentally changed the approach to artificial intelligence, setting off a global "arms race" in the AI sector [1]
- Key concepts such as attention mechanisms, Q/K/V, multi-head attention, and positional encoding are explained in simplified terms (the paper's core formulas are reproduced after this summary) [1]

Impact on AI Industry
- The paper catalyzed the rapid rise of major players in the AI industry, including OpenAI, showcasing the significant economic opportunities created by these advancements [1]
- The narrative includes the story of the eight authors who left Google to pursue entrepreneurial ventures, resulting in remarkable wealth creation [1]
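For reference, the two equations at the heart of the paper, with $d_k$ the key dimension and $h$ the number of heads:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V})$$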
AI Special Report: DeepSeek's Sparse Attention Mechanism Unlocks Greater Growth Potential for the AI Industry
Zhongyuan Securities· 2025-10-16 11:46
Investment Rating
- The industry investment rating is "Outperform the Market," with an expected rise of more than 10% relative to the CSI 300 index over the next six months [41]

Core Insights
- The report emphasizes that the introduction of sparse attention mechanisms, particularly through DeepSeek, significantly enhances the development potential of the AI industry [8][37]
- DeepSeek's advances in attention mechanisms, including Native Sparse Attention (NSA) and DeepSeek Sparse Attention (DSA), are pivotal to improving model performance and efficiency [18][23][37]

Summary by Sections

1. Relationship Between the Attention Mechanism and Large Model Development
- The attention mechanism, introduced to improve information-processing efficiency, has become a core component of large models, addressing the limitations of traditional recurrent neural networks [11]
- Sparse attention reduces computational complexity from O(L²) to sub-quadratic levels, easing memory and compute bottlenecks [11]

2. DeepSeek's Technological Improvements to the Attention Mechanism
- DeepSeek has made significant contributions in three main areas: Multi-head Latent Attention (MLA), Native Sparse Attention (NSA), and DeepSeek Sparse Attention (DSA) [12][18][23]
- MLA reduces memory usage by approximately 90% while maintaining model performance, significantly lowering training costs (a sketch of the compression idea follows after this summary) [16]
- NSA speeds up long-text processing by 11x while achieving performance comparable to traditional models [18]
- DSA improves training and inference efficiency, leading to substantial cost reductions for model usage [23]

3. DSA and NSA Unlock Greater Development Potential for the AI Industry
- Together, DSA and NSA allow expanded model context and improved computational efficiency, which are crucial for the demands of multi-modal applications [33][37]
- The trend toward longer inputs and outputs necessitates innovative approaches to model training and performance enhancement [33]
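As an illustration of how a latent-attention scheme can cut KV memory by roughly the reported ~90%, here is a simplified sketch of low-rank KV compression: cache one small latent per token and re-expand keys and values on use. The dimensions, weight names, and the omission of details such as MLA's positional handling are simplifications of my own; this conveys the idea, not DeepSeek's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, L = 1024, 128, 4096          # illustrative sizes

# Projections are learned during training; random here for the sketch.
W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
W_uk   = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)
W_uv   = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)

X = rng.normal(size=(L, d_model))               # token hidden states

# Instead of caching full K and V (2 * L * d_model floats), cache one
# shared latent per token (L * d_latent floats), expanding on the fly.
C = X @ W_down                                  # (L, d_latent): the cache
K, V = C @ W_uk, C @ W_uv                       # reconstructed when needed

print(f"cache is {L * d_latent / (2 * L * d_model):.1%} of full KV size")
# -> 6.2%, i.e. a ~94% cut, in the ballpark of the ~90% the report cites
```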
A Veteran Transformer Killer Quietly Updates at ICLR: Mamba-3's Three Improvements Approach the Design's Complete Form
机器之心· 2025-10-14 08:24
Core Insights
- The article discusses the evolution of the Mamba architecture, positioned as a strong contender against the dominant Transformer architecture. Mamba has shown significant improvements in language modeling and inference efficiency, particularly in its latest iteration, Mamba-3, which introduces several key enhancements [1][2][3]

Group 1: Mamba Architecture Evolution
- Mamba gained popularity in 2023 as a structured state space model (SSM) architecture, demonstrating performance that could rival or surpass Transformers on language modeling tasks [2][3]
- Mamba-1 used continuous-time dynamics and a selective memory-update mechanism, achieving efficient memory retention without relying on attention [7]
- Mamba-2, released six months after Mamba-1, improved on its predecessor with a selective SSM, achieving 2-8x speedups while remaining competitive with Transformers [4][5]

Group 2: Mamba-3 Enhancements
- Mamba-3 introduces three significant improvements: trapezoidal discretization, complexified state-space models, and multi-input multi-output (MIMO) SSMs, enhancing the model's expressiveness and efficiency [10][13][14]
- Trapezoidal discretization lets Mamba-3 use both the start and end points of each interval when updating state, rather than the start point alone (the standard trapezoidal update is sketched after this summary) [11]
- The complexified state-space model provides a more expressive state-update mechanism, overcoming the state-tracking limitations seen in purely linear models [13][22]

Group 3: Performance Metrics
- Empirical validation shows Mamba-3 outperforming Mamba-2 and other open-source architectures on various language modeling tasks, with superior average accuracy across multiple benchmarks [19][20]
- Mamba-3's MIMO variant improves hardware utilization during decoding, updating states across multiple channels simultaneously without increasing memory requirements [15][26]
- In comparative latency tests, Mamba-3 responded faster than Mamba-2 and Gated DeltaNet, particularly in BF16-precision configurations [27]

Group 4: Application Potential
- Mamba-3's efficient long-sequence processing suits it to long-document understanding, scientific time-series analysis, and gene modeling, areas where Transformers struggle due to context limits [30]
- Its linear-time inference and stable latency also make Mamba-3 an ideal candidate for real-time interactive scenarios such as chat assistants and machine translation [31]
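For intuition on the "both endpoints" claim, here is the textbook trapezoidal discretization of the continuous-time SSM, shown against the forward-Euler step. This is generic numerics for the ODE below, not necessarily Mamba-3's exact parameterization (which, for instance, makes $\Delta$ and $B$ input-dependent):

$$\dot{h}(t) = A\,h(t) + B\,x(t)$$

A forward-Euler step uses only the start of each interval,

$$h_t = (I + \Delta A)\,h_{t-1} + \Delta B\,x_{t-1},$$

while the trapezoidal rule averages the derivative at both endpoints and solves for the new state:

$$\Big(I - \tfrac{\Delta}{2}A\Big)h_t = \Big(I + \tfrac{\Delta}{2}A\Big)h_{t-1} + \tfrac{\Delta}{2}\,B\,(x_{t-1} + x_t)$$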