Sparse Attention Mechanisms
Breaking: DeepSeek's Liang Wenfeng NSA paper and Peking University's Yang Yaodong team win ACL 2025 Best Paper awards
36Kr · 2025-07-31 03:40
Core Insights
- The ACL conference, a leading event in computational linguistics and natural language processing (NLP), is set to take place in Vienna, Austria, from July 27 to August 1, 2025, marking its 63rd edition [1]
- This year's conference saw a record number of submissions, exceeding 8,000 papers compared to 4,407 last year, with acceptance rates of 20.3% for main conference papers and 16.7% for Findings [3]
- Over half of the first authors of submitted papers are from China (51.3%), a significant increase from 30.6% last year; the second-largest group comes from the United States (14.0%) [3]

Awards and Recognitions
- A total of 4 best papers, 2 best social impact papers, 3 best resource papers, 3 best thematic papers, 26 outstanding papers, 2 best TACL papers, 1 best demo paper, and 47 SAC highlights were awarded this year [5]
- The best paper awards were shared among teams from DeepSeek and Peking University, along with other notable institutions including CISPA Helmholtz Center for Information Security, TCS Research, Microsoft, Stanford University, and Cornell Tech [8]

Notable Papers
- "A Theory of Response Sampling in LLMs" explores the heuristics that guide sampling in large language models (LLMs) and highlights ethical concerns regarding decision-making biases [11]
- "Fairness through Difference Awareness" introduces a framework for measuring group discrimination in LLMs, emphasizing the importance of group difference awareness in various contexts [13]
- "Language Models Resist Alignment" reveals that large models possess an inherent elasticity mechanism that makes them resistant to alignment efforts, posing challenges for AI safety and alignment [16][17]
- "Native Sparse Attention" presents a new attention mechanism designed for efficient long-context modeling, demonstrating superior performance compared to existing sparse attention methods (a minimal sketch of the underlying block-sparse idea follows this summary) [24][28]

Awards for Specific Papers
- The best demo paper award went to "OLMoTrace," which traces language model outputs back to trillions of training tokens, a significant advance in understanding model behavior [32]
- The best thematic paper award went to "MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection," which proposes a new adaptive method for fine-tuning large models with minimal parameters [34]

Lifetime Achievement and Service Awards
- The ACL Lifetime Achievement Award was presented to Professor Kathy McKeown for her extensive contributions to NLP over 43 years [57][60]
- The Distinguished Service Award went to Professor Julia B. Hirschberg for her long-standing service to ACL and her contributions to NLP and speech processing [62]
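For readers unfamiliar with sparse attention: NSA's full design combines compressed, selected, and sliding-window attention branches with hardware-aligned kernels, but its core idea of attending only to the most relevant key blocks can be illustrated compactly. The NumPy sketch below is a toy single-query version; all names, shapes, and parameters are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def block_sparse_attention(q, K, V, block_size=4, top_k=2):
    """Toy single-query sparse attention: score key blocks by their mean key,
    keep the top_k blocks, and attend only over the selected tokens.

    q: (d,) query; K, V: (n, d) keys/values. Illustrative only: real NSA
    uses learned compression and GPU-aligned block kernels.
    """
    n, d = K.shape
    n_blocks = n // block_size
    # Coarse scores: dot product of the query with each block's mean key.
    block_means = K[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    coarse = block_means @ q
    # Select the most relevant blocks.
    keep = np.argsort(coarse)[-top_k:]
    idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size) for b in keep])
    # Full attention restricted to the selected tokens.
    scores = K[idx] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
print(block_sparse_attention(q, K, V).shape)  # (8,)
```

Selecting whole contiguous blocks rather than scattered tokens is what keeps this kind of sparsity GPU-friendly: the retained computation stays dense within each block.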
Training-free and plug-and-play with 2x end-to-end GPU inference acceleration: DraftAttention, an acceleration method for video diffusion models
机器之心 · 2025-06-28 04:35
Core Insights
- The article discusses the challenges and advancements in video generation using diffusion models, focusing on the computational bottlenecks of attention mechanisms in the Diffusion Transformer (DiT) model [1][6][14]
- A new method called DraftAttention significantly reduces the computational overhead of attention while maintaining high generation quality, achieving up to 2x end-to-end inference acceleration on GPUs [3][22][46]

Group 1: Background and Challenges
- Diffusion models have become mainstream for high-quality video generation, but the computational load of attention grows dramatically with video length and resolution, leading to inefficiencies [1][6]
- In models like HunyuanVideo, attention can account for over 80% of total processing time, and generating an 8-second 720p video takes nearly an hour [1][5]
- The complexity of attention grows quadratically with the number of tokens, which is directly proportional to frame count and resolution, causing significant slowdowns in inference speed [6][7]

Group 2: Existing Solutions and Limitations
- Current acceleration methods, such as Sparse VideoGen and AdaSpa, use sparse attention for some degree of end-to-end acceleration on GPUs, but their effectiveness is limited by insufficient sparsity and rigid design [2][3]
- These methods often rely on fixed sparse operators and lack dynamic adaptability to input content, making fine-grained, content-aware sparse pattern control difficult [2][7]

Group 3: DraftAttention Methodology
- DraftAttention is a training-free, plug-and-play dynamic sparse attention mechanism designed to cut the computational burden of attention while preserving generation quality [3][11][46]
- The method builds a low-resolution draft attention map to estimate token importance, which then guides the selection of sparse patterns for the high-resolution attention computation (see the sketch after this summary) [11][12]
- A token rearrangement strategy makes the sparse computation hardware-friendly, improving execution efficiency on GPUs [13][22]

Group 4: Theoretical Foundations and Experimental Results
- Theoretical analysis shows that the approximation error between the low-resolution and high-resolution attention maps is bounded [15][17]
- Experiments show DraftAttention outperforms existing sparse attention methods such as Sparse VideoGen on multiple metrics, including PSNR and SSIM, particularly at high sparsity rates [20][21]
- On NVIDIA H100 and A100 GPUs, DraftAttention achieves up to 1.75x end-to-end inference acceleration, with gains scaling with video length, resolution, and sparsity [22][46]

Group 5: Future Directions
- The authors plan to further optimize efficiency bottlenecks in long-video generation by integrating quantization and distillation, aiming to bring high-quality video generation to resource-constrained environments such as mobile and edge devices [46]
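Below is a minimal sketch of the draft-then-select idea described above, assuming a flat token axis and average pooling; it omits the paper's pooling over the latent video grid and its token rearrangement step. The function name and parameters are illustrative, not the released API.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def draft_sparse_mask(Q, K, pool=4, keep_ratio=0.25):
    """Estimate block importance from a pooled, low-resolution attention map.

    Q, K: (n, d). Average-pool queries and keys by `pool`, form the draft map,
    then keep the highest-scoring (query_block, key_block) pairs.
    """
    n, d = Q.shape
    m = n // pool
    Qp = Q[: m * pool].reshape(m, pool, d).mean(axis=1)   # pooled queries
    Kp = K[: m * pool].reshape(m, pool, d).mean(axis=1)   # pooled keys
    draft = softmax(Qp @ Kp.T / np.sqrt(d))               # (m, m) draft map
    k = max(1, int(keep_ratio * m))
    # For each query block, keep its top-k key blocks.
    mask = np.zeros((m, m), dtype=bool)
    top = np.argsort(draft, axis=1)[:, -k:]
    np.put_along_axis(mask, top, True, axis=1)
    return mask  # guides which high-resolution blocks to actually compute

rng = np.random.default_rng(1)
Q, K = rng.normal(size=(64, 16)), rng.normal(size=(64, 16))
print(draft_sparse_mask(Q, K).sum(axis=1))  # k kept key blocks per query block
```

Because the draft map is computed at (n/pool)^2 rather than n^2 resolution, the cost of estimating importance is a small fraction of full attention, which is what makes the selection overhead worthwhile.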
A 0.5B model punches above its weight, setting a new SOTA for on-device models: runs on a 4090, with 5x regular acceleration for long-text processing | Tsinghua & ModelBest open source
量子位 · 2025-06-10 07:35
Contributed by Tsinghua University & ModelBest to 量子位 | WeChat account QbitAI. The price-performance king of on-device models: the Tsinghua University and ModelBest team have open-sourced a new model, MiniCPM4, offered in 8B and 0.5B parameter sizes, reaching best-in-class performance with only 22% of the training cost of comparable open-source models. MiniCPM4-8B is the first open-source native sparse model; with an extreme 5% sparsity, it lets long-text and deep-reasoning workloads truly run on-device. On benchmarks such as MMLU, CEval, MATH500, and HumanEval, it matches Qwen-3-8B and surpasses Gemma-3-12B at only 22% of the training cost. MiniCPM4-0.5B also punches above its weight: on MMLU, CEval, BBH, HumanEval, and other benchmarks it outperforms the same-class Qwen-3-0.6B, Llama 3.2, and Gemma3, and through native QAT it achieves nearly lossless int4 quantization and inference at 600 tokens/s. On common edge chips such as the Jetson AGX Orin and RTX 4090, MiniCPM4 delivers 5x regular acceleration for long-text processing and up to 100x in extreme scenarios. The team has publicly released a technical report; the model ...
Core Insights
- MiniCPM4, developed by Tsinghua University and the ModelBest team, is an open-source model that achieves best-in-class performance with only 22% of the training cost of comparable models, offered in 8B and 0.5B parameter sizes [1][3][4]
- The model uses a novel sparse attention mechanism, InfLLM v2, which enables efficient long-context processing at a 5% sparsity rate [2][8][16]
- MiniCPM4 delivers superior benchmark performance, outperforming models like Qwen-3 and Gemma-3 while using significantly less training data [3][11][116]

Model Performance
- MiniCPM4-8B matches Qwen-3-8B and surpasses Gemma-3-12B using only 22% of the training data used by Qwen-3 [3][116]
- MiniCPM4-0.5B outperforms Qwen-3-0.6B and Llama 3.2 across various benchmarks, showcasing its efficiency at smaller parameter sizes [3][11]
- The model achieves a decoding speed of 600 tokens per second with minimal performance loss under quantization [3][10]

Technical Innovations
- The InfLLM v2 architecture enables efficient long-context processing by dynamically selecting relevant context tokens, reducing computational cost by 60% compared with previous methods [8][11][16]
- The model ships with a lightweight CUDA inference framework (CPM.cu) and a cross-platform deployment framework (ArkInfer) to optimize performance on edge devices [19][20][40]
- The FR-Spec algorithm improves speculative sampling efficiency, reducing computational overhead by 75% while maintaining output accuracy (a minimal sketch of the idea follows this summary) [28][30]

Data Efficiency
- MiniCPM4 achieves high capability density by training on only 8 trillion tokens, versus the 36 trillion used by Qwen-3, demonstrating effective data filtering strategies [56][116]
- The UltraClean data selection method raises the quality of pre-training data, significantly improving model performance [57][61]

Application and Use Cases
- MiniCPM4 is designed for long-document understanding and generation, proving effective in tasks such as automated literature review generation and complex tool interactions [120][130]
- Its ability to handle long sequences while maintaining high accuracy in context extrapolation makes it suitable for a range of AI-driven tasks [118][119]
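The FR-Spec idea summarized above, shrinking the draft model's LM head to a high-frequency slice of the vocabulary, can be sketched as follows. The vocabulary size, frequency ranking, and greedy decoding here are invented for illustration; the released implementation and its verification loop differ.

```python
import numpy as np

def fr_spec_draft(hidden, W_lm, freq_rank, keep=8):
    """FR-Spec-style drafting: score only the `keep` most frequent tokens,
    shrinking the draft model's LM-head matmul and softmax.

    hidden: (d,) last hidden state; W_lm: (V, d) full LM head;
    freq_rank: token ids sorted by corpus frequency, most frequent first.
    """
    sub = freq_rank[:keep]            # restricted high-frequency vocabulary
    logits = W_lm[sub] @ hidden       # (keep,) instead of (V,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return sub[np.argmax(probs)]      # greedy draft token

rng = np.random.default_rng(2)
V, d = 1000, 32
W_lm = rng.normal(size=(V, d))
freq_rank = rng.permutation(V)        # stand-in for real frequency statistics
print(fr_spec_draft(rng.normal(size=d), W_lm, freq_rank))
```

Since drafted tokens are still verified by the full target model over the full vocabulary, restricting only the drafting step cuts cost without changing what the system ultimately outputs, consistent with the "maintaining output accuracy" claim above.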
A first-person account from a core author of Moonshot AI's MoBA: a "newly minted LLM trainer" and his three trips to the Cliff of Reflection
晚点LatePost · 2025-02-20 14:21
"Starting from open-source papers and open-source code, we've now evolved to open-sourcing chains of thought!" Text by Andrew Lu; annotations by He Qianming and Cheng Manqi. On February 18, Kimi and DeepSeek announced new work on the same day: MoBA and NSA, respectively, both improvements to the attention mechanism. Today, Andrew Lu, one of MoBA's main developers, posted on Zhihu a first-person account of three pitfalls he hit during development, which he calls "three trips to the Cliff of Reflection." His Zhihu signature reads "newly minted LLM trainer." One comment under the post reads: "Starting from open-source papers and open-source code, we've now evolved to open-source chains of thought." The attention mechanism matters because it is the core mechanism of today's large language models (LLMs). The June 2017 paper by the eight Transformer authors that launched the LLM revolution was titled "Attention Is All You Need," and it has been cited 153,000 times to date. Attention lets an AI model, much like a human, know what to "focus on" and what to "ignore" when processing information, capturing its most critical parts. Attention operates in both the training and inference stages of a large model. Its rough working principle is ...
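The excerpt breaks off just as it begins to describe how attention works. For reference, the standard scaled dot-product attention that both MoBA and NSA set out to improve computes softmax(QK^T / sqrt(d)) V; below is a minimal NumPy sketch with generic names and shapes, not code from either project.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each query scores every key, the scores
    become weights via softmax, and the output is a weighted mix of values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # how strongly each query attends to each key
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(3)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
print(attention(Q, K, V).shape)  # (4, 8)
```

Because every query scores every key, the cost grows quadratically with sequence length, which is precisely the bottleneck that sparse variants like MoBA and NSA attack.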