NSA (Native Sparse Attention)
DeepSeek's next-generation technology revealed early: paper co-signed by Liang Wenfeng wins ACL 2025 Best Paper Award
QbitAI (量子位) · 2025-07-30 23:56
Core Insights
- A paper co-authored by DeepSeek's Liang Wenfeng and Peking University won the Best Paper Award at ACL 2025 [1]
- The conference drew an unprecedented 8,360 submissions, nearly double last year's 4,407, making competition especially fierce [2]

Technical Innovations
- The proposed Native Sparse Attention (NSA) mechanism accelerates long-text processing by up to 11x through combined algorithm and hardware optimization, outperforming traditional full-attention models [3][8]
- The technique extends context length to up to 1 million tokens and is slated for use in next-generation models [4]
- NSA employs a dynamic hierarchical sparse strategy with three parallel attention branches: coarse-grained capture of global information, selective attention over key token segments, and sliding-window attention for local context [10][17]

Performance Metrics
- On 64k-length sequences, NSA showed speed advantages across the full processing lifecycle: decoding 11.6x faster, forward propagation 9x faster, and backward propagation 6x faster [15][16]
- An NSA-pretrained 27B-parameter model surpassed the full-attention baseline on 7 of 9 evaluation metrics, excelling particularly on inference-related benchmarks [19][20]
- In long-text tests, NSA achieved perfect retrieval accuracy and outperformed the full-attention baseline by 0.032 on the LongBench benchmark [21]

Comparative Analysis
- In an experiment using DeepSeek-R1's mathematical-reasoning data, NSA-R reached an accuracy of 0.121 in an 8k context setting, well above the full-attention model's 0.046 [22][23]
- NSA also outperformed full attention on complex reasoning tasks, with gains of 0.087 on HPQ and 0.069 on code-understanding tasks [25]

Additional Research Highlights
- The article also covers the three other best paper winners. One studied the resilience of large language models after alignment training, emphasizing the need for more effective alignment techniques [26]
- Another explored fairness in large models through the new lens of "difference awareness," finding that traditional fairness tests may not capture the nuances of model behavior [28]
- A third examined sampling mechanisms in large models, highlighting potential biases in decision-making processes that could raise ethical concerns [29]
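The dynamic hierarchical sparse strategy described above, with its three parallel branches (coarse-grained compression, fine-grained block selection, and a sliding window), can be sketched as follows. This is a minimal single-head, single-query illustration, not the paper's implementation: the function names, block pooling via mean, and the fixed uniform gates are all assumptions made for clarity (in NSA the compression and gating are learned).

```python
# Illustrative sketch of one NSA-style decoding step. All names, shapes,
# and the gating scheme are simplifying assumptions, not the paper's code.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    """Plain scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def nsa_step(q, K, V, block=8, top_k=2, window=16):
    """One step combining three parallel branches: coarse compression,
    top-k block selection, and a sliding window over recent tokens."""
    T, d = K.shape
    n_blocks = T // block

    # Branch 1: compression -- pool each block of keys/values into one
    # coarse token (mean here as a stand-in for a learned compressor),
    # then attend over the pooled sequence for global context.
    Kc = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    Vc = V[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    out_cmp = attend(q, Kc, Vc)

    # Branch 2: selection -- rank blocks by their coarse relevance score
    # and attend at full resolution only inside the top-k blocks.
    block_scores = Kc @ q
    top = np.argsort(block_scores)[-top_k:]
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in top])
    out_sel = attend(q, K[idx], V[idx])

    # Branch 3: sliding window -- attend only to the most recent tokens.
    out_win = attend(q, K[-window:], V[-window:])

    # Combine the branches with gates (learned per query in NSA;
    # fixed uniform weights here for simplicity).
    g = np.array([1 / 3, 1 / 3, 1 / 3])
    return g[0] * out_cmp + g[1] * out_sel + g[2] * out_win

rng = np.random.default_rng(0)
d, T = 32, 64
q = rng.standard_normal(d)
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))
out = nsa_step(q, K, V)
print(out.shape)  # (32,)
```

The speedups reported in the article come from the fact that each branch touches far fewer key/value tokens than full attention (here: 8 pooled tokens + 16 selected tokens + 16 window tokens instead of all 64), while the hardware-aligned kernel design keeps those accesses contiguous in memory.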