Attention Mechanism
Song Han and colleagues propose FlashMoBA: 7.4x faster than MoBA, with sequences scaling to 512K without overflowing memory
机器之心· 2025-11-18 05:08
Core Insights
- The article discusses the introduction of a novel attention mechanism called Mixture of Block Attention (MoBA), which applies the principles of mixture of experts (MoE) to attention mechanisms, allowing models to autonomously determine which positions to focus on [2][4]
- MoBA shows significant potential in handling long contexts by allowing queries to sparsely attend to a limited number of key-value blocks, thereby greatly reducing computational costs [3][4] (see the sketch after this summary)
- The article identifies performance challenges associated with smaller block sizes in MoBA implementations and introduces FlashMoBA, a hardware-friendly CUDA kernel designed to efficiently execute MoBA under small-block configurations [7][12]

Performance Analysis
- The original MoBA implementation struggles with performance bottlenecks when using smaller block sizes, leading to slower execution compared to dense attention [11][41]
- FlashMoBA optimizes MoBA's performance, achieving up to 14.7x speedup over FlashAttention-2 in small-block scenarios [8][43]
- The article presents experimental results showing that reducing block size from 512 to 128 improves perplexity from 20.9 to 19.7 and RULER accuracy from 38.8% to 56.0% for a 340M-parameter model [30][31]

Technical Improvements
- The article outlines two main improvement paths for MoBA: using smaller block sizes and applying short convolutions on keys to enhance routing accuracy [5][36]
- FlashMoBA employs a three-kernel design to minimize memory-access inefficiencies and align computation with GPU architecture, significantly improving performance [16][21]
- The forward kernel uses a "collect and densify" strategy to handle MoBA's irregular sparsity, which is crucial for efficient computation [22][26]

Experimental Results
- The article details experiments conducted on 8× H100 80GB GPUs, demonstrating that the optimized MoBA model outperforms dense attention across various benchmarks [30][39]
- Key convolution techniques (kconv3 and kconv5) are shown to enhance model performance, with kconv3 improving language modeling accuracy from 45.1% to 45.6% for a 340M model [36][37]
- Overall, the results indicate that smaller block sizes are essential for MoBA to achieve performance comparable to dense attention [41][42]
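To make the block-routing idea behind MoBA concrete, below is a minimal, unoptimized PyTorch sketch for a single attention head: each query scores key blocks by their mean-pooled key, keeps the top-k blocks, and attends only within them. This is an illustration of the technique as summarized above, not the FlashMoBA kernel; causal masking, the current-block rule, and the key short-convolution are omitted, and the function and parameter names (`moba_attention`, `block_size`, `top_k`) are placeholders of my own.

```python
import torch

def moba_attention(q, k, v, block_size=128, top_k=4):
    """Sketch of MoBA-style block-sparse attention for one head.

    q, k, v: (seq_len, d) tensors. Each query scores every key block by the
    dot product with the block's mean key, keeps the top-k blocks, and runs
    softmax attention only over the selected keys/values.
    """
    seq_len, d = k.shape
    n_blocks = (seq_len + block_size - 1) // block_size

    # Mean-pool keys inside each block to get one routing vector per block.
    pad = n_blocks * block_size - seq_len
    k_pad = torch.cat([k, k.new_zeros(pad, d)]) if pad else k
    block_keys = k_pad.view(n_blocks, block_size, d).mean(dim=1)   # (n_blocks, d)

    # Route each query to its top-k blocks by gate score.
    gate = q @ block_keys.T                                        # (seq_len, n_blocks)
    topk = gate.topk(min(top_k, n_blocks), dim=-1).indices         # (seq_len, top_k)

    out = torch.zeros_like(q)
    for i in range(seq_len):
        # Gather the selected blocks' keys/values into one dense slice
        # (the "collect" step; the real kernel does this tile by tile on GPU).
        cols = torch.cat([torch.arange(b * block_size,
                                       min((b + 1) * block_size, seq_len))
                          for b in topk[i].tolist()])
        attn = torch.softmax((q[i] @ k[cols].T) / d ** 0.5, dim=-1)
        out[i] = attn @ v[cols]
    return out
```

The per-query Python loop is only for readability; the article's point is precisely that a naive realization of this pattern is slow, and that FlashMoBA's fused kernels recover the speedup, especially at small block sizes.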
From the Transformer to GPT-5: OpenAI scientist Lukasz Kaiser's "first-principles thinking on large models"
36Kr· 2025-09-22 13:04
Core Insights
- The paper "Attention Is All You Need" proposed a revolutionary Transformer architecture that replaced traditional RNNs in natural language processing, leading to significant advances in AI applications like ChatGPT and DALL-E [1][15][24]
- The authors, known as the "Transformer Eight," gained recognition for their groundbreaking work, which had been cited over 197,159 times as of the article's publication [2][15]

Group 1: The Impact of the Transformer Architecture
- The introduction of the Transformer architecture has reshaped the AI landscape, enabling better handling of long-distance dependencies in language processing than RNNs [1][15]
- The architecture's parallel-processing capability has made it the new paradigm in NLP, extending its influence to various AI subfields, including computer vision and speech recognition [15][24]

Group 2: The Journey of Lukasz Kaiser
- Lukasz Kaiser, one of the "Transformer Eight," chose to join OpenAI instead of pursuing entrepreneurial ventures, focusing on AGI and leading the development of models like GPT-4 and GPT-5 [3][21]
- Kaiser's academic background in logic and games laid the foundation for his contributions to AI, emphasizing a systematic approach to problem-solving [5][6]

Group 3: The Evolution of AI Research
- The transition from RNNs to Transformers marked a significant shift in AI research, with Kaiser and his team identifying the limitations of RNNs and proposing the attention mechanism as a solution [10][12] (the mechanism's core computation is sketched after this summary)
- The development of the Tensor2Tensor library facilitated rapid iteration on the Transformer model, reflecting Kaiser's commitment to making AI more accessible [13][14]

Group 4: Future Directions in AI
- Kaiser has articulated a vision for the future of AI, emphasizing the importance of teaching models to think and reason more deeply, which could lead to a paradigm shift in AI capabilities [25][26]
- Anticipated advances include multi-modal AI, larger and more capable Transformers, and the proliferation of AI services through APIs and cloud platforms [25][26]
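For reference, the mechanism the article keeps returning to is scaled dot-product attention from "Attention Is All You Need". A minimal PyTorch sketch is below; the paper's full multi-head form wraps this core in learned projections, which are omitted here. Because every position attends to every other position in one matrix multiply, the whole sequence is processed in parallel, which is what distinguishes it from an RNN's step-by-step recurrence.

```python
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.

    q, k, v: (..., seq_len, d_k) tensors. Long-range dependencies cost no
    extra sequential steps, unlike in an RNN.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # (..., seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```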
The paper overshadowed by the Transformer: a Meta scientist looks back at an innovative work from ten years ago
机器之心· 2025-05-01 02:11
Core Viewpoint
- The article discusses the significance of the "End-To-End Memory Networks" paper, highlighting its foundational contributions to the development of large language models (LLMs) and how it was overshadowed by the more popular "Attention Is All You Need" paper [3][8][25]

Group 1: Historical Context and Contributions
- The "End-To-End Memory Networks" paper, published in 2015, introduced key concepts that are now integral to LLMs, such as multi-layer soft attention and position embeddings [8][22]
- The paper was a refinement of the earlier "Memory Networks" paper from 2014, which introduced hard attention mechanisms [9][16]
- Despite its innovations, "End-To-End Memory Networks" received significantly less attention, with just over 3,000 citations compared to the roughly 170,000 citations of "Attention Is All You Need" [3][9]

Group 2: Technical Innovations
- The model proposed in "End-To-End Memory Networks" was the first to completely replace recurrent neural networks (RNNs) with attention mechanisms, enabling complex reasoning capabilities [8][13] (a minimal sketch of one soft-attention memory hop follows this summary)
- The authors utilized reinforcement learning to train the memory network to focus on relevant information without predefined labels, a novel approach at the time [18][22]
- The introduction of position embeddings addressed the order invariance of attention mechanisms, a critical advance for LLMs [22][25]

Group 3: Current Relevance and Future Directions
- The article emphasizes that even ten years later there is still significant work to be done on LLM architectures, as evidenced by the recent "Multi-Token Attention" paper, which enhances attention mechanisms for better handling of long contexts [26][27]
- Ongoing research aims to address challenges related to memory scaling, which was identified as a future direction in the original "Memory Networks" paper [26][27]
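To illustrate what a soft-attention memory hop looks like, here is a minimal PyTorch sketch in the spirit of End-To-End Memory Networks: sentences are embedded into memory keys and values, the query state attends softly over all memory slots, and the weighted sum of values updates the query for the next hop. The bag-of-words sentence embedding is a simplification, the paper's position encoding over words within a sentence is omitted, and the class and variable names are my own rather than the paper's.

```python
import torch
import torch.nn as nn

class MemoryHop(nn.Module):
    """One soft-attention hop over a sentence memory (sketch)."""

    def __init__(self, vocab_size, d):
        super().__init__()
        self.A = nn.Embedding(vocab_size, d)  # memory "key" embedding
        self.C = nn.Embedding(vocab_size, d)  # memory "value" embedding

    def forward(self, story, u):
        # story: (n_sentences, sent_len) word ids; u: (d,) current query state.
        m = self.A(story).sum(dim=1)          # (n_sentences, d) bag-of-words keys
        c = self.C(story).sum(dim=1)          # (n_sentences, d) values
        p = torch.softmax(m @ u, dim=-1)      # soft attention over memory slots
        return u + p @ c                      # updated query state for the next hop
```

Stacking several such hops gives the multi-layer soft attention the article credits the paper with, and training is plain end-to-end backpropagation, with no labels on which memory to read.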
A brief history of attention in large models: a conversation with two AI researchers, starting from the latest improvements by DeepSeek and Kimi
晚点LatePost· 2025-03-02 06:10
Guests | Xiao Chaojun, Fu Tianyu   Edited by | Cheng Manqi

Last week, DeepSeek and Kimi each released new results on improving and optimizing large-model architectures: NSA and MoBA, respectively. Both focus on improving the "attention mechanism" in large models.

The emergence of reasoning models such as o1 and R1 has posed new challenges for long text.

The attention mechanism is the core mechanism of today's large language models (LLMs). The June 2017 paper by the "Transformer Eight" that set off the large language model revolution was titled exactly that: Attention Is All You Need.

Optimizing the computational efficiency and effectiveness of attention in turn helps address a problem that both AI academia and industry care deeply about: long context.

Whether it is feeding in an entire book at once so the model can distill and understand it for us, generating the long chains of thought that models like o1 and R1 now require, or giving models ever-longer "memory" in the future, all of this depends on long-context capability.

For this episode we invited two AI researchers who have worked on improving the attention mechanism as guests.

One is Xiao Chaojun, a PhD student in the Natural Language Processing Lab of Tsinghua University's Department of Computer Science and first author of the InfLLM attention improvement; his advisor is an associate professor in Tsinghua's Department of Computer Science ...