Attention Mechanism
Starting from Scratch! A Learning Roadmap for End-to-End Autonomous Driving and VLA
自动驾驶之心· 2025-08-24 23:32
The technology stack behind end-to-end autonomous driving and VLA is vast, so this article walks through a learning roadmap from a beginner's perspective. It first looks at the key milestones of large language models over the past five years. Any discussion of large models comes back to the Transformer, so the article gives a plain-language overview before expanding on tokenization, BPE, positional encoding, and related topics.

Transformer: "Attention Is All You Need".

BPE vocabulary construction:
3. Merge the two most frequent non-terminal symbols into a new symbol and recount all symbol frequencies (the new symbol absorbs part of the frequency of the original high-frequency symbols).
4. Repeat steps 2-3 until the target vocabulary size or the iteration budget is reached.

Sinusoidal positional encoding:

$$PE_{(pos,\,2i)}=\sin\left(pos/10000^{2i/d_{\mathrm{model}}}\right)$$

$$PE_{(pos,\,2i+1)}=\cos\left(pos/10000^{2i/d_{\mathrm{model}}}\right)$$

[Figure: a sample sentence "这是一段文字" passes through the tokenizer into token IDs (e.g. 231, 34, 462, 4758, 762, 38), and the embedding layer plus positional encoding maps them to a 7 × D matrix of vectors.]
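As a minimal sketch of the sinusoidal positional encoding defined above (the function name and the 7 × 512 example shape are illustrative, and an even d_model is assumed):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, None]                  # pos
    dims = np.arange(0, d_model, 2)[None, :]                 # even indices 2i
    angles = positions / np.power(10000.0, dims / d_model)   # pos / 10000^(2i/d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    pe[:, 1::2] = np.cos(angles)   # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    return pe

# Matching the 7-token example in the figure: a 7 x D encoding matrix.
print(sinusoidal_positional_encoding(seq_len=7, d_model=512).shape)  # (7, 512)
```

These encodings are added to the token embeddings, so the model sees both what a token is and where it sits in the sequence.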
Reshaping the Attention Mechanism: GTA Arrives, Cutting the KV Cache by 70% and Computation by 62.5%
机器之心· 2025-07-22 08:59
Core Viewpoint
- The article discusses the introduction of Grouped-head latent Attention (GTA), a new framework developed in a collaboration between the Chinese Academy of Sciences, University College London, and the Hong Kong University of Science and Technology (Guangzhou), which significantly enhances model performance and computational efficiency in large language models [1][3].

Grouped-head latent Attention (GTA) Introduction
- GTA is designed to address the efficiency challenges faced by large language models, particularly those using the traditional Multi-Head Attention (MHA) mechanism, which suffers from computational redundancy, memory bottlenecks, and inference latency [2][4][6].

Efficiency Challenges in Large Language Models
- The MHA architecture leads to excessive computation because each attention head computes independently, and floating-point operations (FLOPs) grow quadratically when processing long sequences [3][4].
- Memory requirements for storing key-value (KV) pairs grow rapidly with sequence length and the number of attention heads, making deployment on edge devices challenging [3][12].
- High computational and memory demands contribute to significant inference delays, hindering real-time applications [4][6].

Core Innovations of GTA
- GTA introduces a grouped sharing mechanism for attention matrices: multiple attention heads share a single attention matrix, which cuts FLOPs significantly [8][10].
- The framework employs a "compression + decoding" strategy to minimize memory usage by compressing all attention heads' value vectors into a low-dimensional latent representation, which is then dynamically decoded as needed (a hedged sketch of both ideas follows at the end of this summary) [12][14].

Experimental Validation of GTA
- Comprehensive experiments demonstrate that GTA not only improves computational efficiency and memory utilization but also matches or surpasses the performance of existing mainstream attention mechanisms [16][19].
- In tests with a 160-million-parameter model, GTA achieved lower evaluation loss and better downstream-task performance than traditional MHA and other baselines, with its KV cache reduced to 12.5% of MHA's [18][19].

Scalability and Performance of GTA
- When scaled to 500 million parameters, GTA continued to outperform other models in evaluation loss and accuracy while keeping the KV cache at only 12.5% of MHA's [19].
- The architecture's efficiency was further validated in a 1-billion-parameter model, where GTA matched the performance of GQA-1B while using significantly less memory [20][22].

Theoretical Efficiency Analysis
- Theoretical analysis indicates that GTA achieves substantial reductions in computational complexity and memory usage, translating into faster inference [24].
- Empirical benchmarks confirm GTA's superior prefill and decode times across various hardware platforms, showcasing its robustness and efficiency [25][29].

Future Directions
- Despite its advancements, GTA faces challenges such as potential approximation errors from the nonlinear decoder and the need for broader validation on tasks beyond natural language processing [33].
- Future research aims to refine the decoder architecture and explore GTA's applicability in larger models and diverse application domains [33].
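The two mechanisms named above can be illustrated with a rough PyTorch sketch. This is an assumption-laden toy, not the paper's GTA implementation: the module name, shapes, grouping scheme, and the choice of a linear value decoder (the summary mentions a nonlinear decoder) are all illustrative.

```python
import torch
import torch.nn as nn

class GroupedLatentAttentionSketch(nn.Module):
    """Toy sketch of GTA's two ideas: (1) heads in a group share one attention
    matrix; (2) values are stored as one low-dimensional latent per token and
    decoded back to per-head values only when needed (small KV cache)."""

    def __init__(self, d_model=512, n_heads=8, group_size=4, d_latent=64):
        super().__init__()
        assert n_heads % group_size == 0
        self.n_groups = n_heads // group_size     # each group shares one attention map
        self.group_size = group_size
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, self.n_groups * self.d_head)
        self.k_proj = nn.Linear(d_model, self.n_groups * self.d_head)
        self.v_compress = nn.Linear(d_model, d_latent)                       # what a KV cache would store
        self.v_decode = nn.Linear(d_latent, self.group_size * self.d_head)   # expanded on demand
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_groups, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_groups, self.d_head).transpose(1, 2)
        # One (T x T) attention map per group, reused by every head in that group.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)

        latent = self.v_compress(x)                          # (B, T, d_latent)
        v = self.v_decode(latent).view(B, T, self.group_size, self.d_head).transpose(1, 2)

        # Apply each group's shared attention map to that group's decoded values.
        out = torch.einsum("bgts,bhsd->bghtd", attn, v)      # (B, n_groups, group_size, T, d_head)
        out = out.permute(0, 3, 1, 2, 4).reshape(B, T, -1)
        return self.out_proj(out)

# Toy usage
y = GroupedLatentAttentionSketch()(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```

In an actual decoder, the cache would hold per-group keys and one d_latent value vector per token rather than full per-head keys and values, which is where the KV-cache saving described above would come from.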
Mamba's First Author Previews a New Architecture! A Long Essay Arguing That Transformer ≠ the Final Answer
量子位· 2025-07-09 04:57
Core Viewpoint
- The article discusses the trade-offs between two mainstream sequence models, State Space Models (SSMs) and Transformers, highlighting the strengths and weaknesses of each approach [1][3].

Summary by Sections

Introduction to Mamba and SSMs
- Mamba is a typical SSM built on a modern structured SSM suited to deep learning, outperforming similarly sized Transformers on language tasks [2].
- The author consolidates insights from previous talks into a comprehensive article and hints at a significant upcoming architectural advance [3][4].

Attention Mechanism and Its Limitations
- The article challenges the common belief that the high computational cost of models like ChatGPT stems solely from the quadratic complexity of the Transformer attention mechanism [5][6].
- A new architecture is expected to be compatible with Transformers, suggesting a shift in how the limitations of attention are understood [7][8].

Comparison of SSMs and Transformers
- SSMs are likened to the human brain: they summarize past information into a fixed-size hidden state, making them more efficient for processing long sequences (a minimal recurrence sketch follows this summary) [15][16].
- SSMs have advantages on unstructured data, and their computational cost is linear in sequence length, making them suitable for resource-constrained environments [16].

Key Elements of Mamba's Success
- Mamba's effectiveness is attributed to three factors: state size, state expressivity, and training efficiency [17][20].
- SSMs allow for larger hidden states than traditional RNNs, enhancing information storage [18].
- Mamba introduces selective SSMs to improve state expressivity, akin to the gating mechanisms of classic RNNs [19].
- Training efficiency is achieved through careful parameterization and parallel scan algorithms [21].

Limitations of SSMs
- SSMs lack the precise recall and retrieval of past information that is a strength of Transformer models [22].

Transformer Model Characteristics
- Transformers function like a database, storing every piece of information in a KV cache, which allows precise memory and token-level operations [23][25].
- They excel at processing well-defined tokenized data but suffer from high computational cost and dependence on high-quality data [26][27].

Tokenization Debate
- The author argues against the necessity of tokenization, stating that it contradicts the end-to-end learning principle of deep learning and complicates multilingual and multimodal applications [28][30].
- Evidence suggests that SSMs outperform Transformers on raw data, underscoring Transformers' weakness with non-semantic token data [32].

Conclusion on SSMs vs. Transformers
- Both SSMs and Transformers have unique strengths and weaknesses, and a hybrid approach could yield better performance [33][35].
- Research indicates that combining SSM and attention layers can enhance model capabilities, with an optimal ratio of SSM to attention layers between 3:1 and 10:1 [37].
- A future direction is developing models that process raw data directly, leveraging the advantages of both architectures [40].
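The fixed-size-state idea behind SSMs can be made concrete with a tiny sketch. This is a plain discretized linear state-space recurrence, not Mamba itself (which adds input-dependent, selective parameters and a parallel scan); the matrices and sizes are illustrative assumptions.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t over a sequence.

    x: (T, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state).
    The hidden state h has a fixed size, so memory does not grow with sequence
    length (unlike a Transformer's KV cache) and time is linear in T.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t        # all history is compressed into the fixed-size state
        ys.append(C @ h)           # read out from the compressed state
    return np.stack(ys)

# Toy usage: a 1000-step sequence summarized by a 16-dimensional state.
rng = np.random.default_rng(0)
T, d_in, d_state, d_out = 1000, 4, 16, 4
A = 0.9 * np.eye(d_state)                         # simple stable dynamics (illustrative)
B = 0.1 * rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
y = ssm_scan(rng.normal(size=(T, d_in)), A, B, C)
print(y.shape)  # (1000, 4)
```

The trade-off noted in the summary falls out directly: the state is a lossy summary, so exact recall of an arbitrary earlier token is not generally possible, whereas a Transformer keeps every token in its KV cache.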
Mind × Algorithm: How Do They "Dance Together"? (Frontier Watch: How AI Is Changing the Research Paradigm)
Ren Min Ri Bao· 2025-06-13 21:43
Core Insights
- The rapid development of artificial intelligence (AI) is significantly transforming scientific research methodologies, particularly in psychology, with an annual growth rate of 27.2% in AI-driven scientific publications from 2019 to 2023 [1]

Group 1: AI and Psychology
- The historical connection between psychology and AI is notable, with classical experiments like Pavlov's conditioning influencing key AI techniques such as reinforcement learning [2]
- AI applications in daily life often reflect psychological principles, such as behavior reinforcement mechanisms used in e-commerce and social media platforms [2]
- AI's ability to understand complex human behaviors is enhanced by cognitive psychology, leading to the development of attention mechanisms in AI models [2]

Group 2: Data and Research Efficiency
- AI enables researchers to access vast behavioral data streams from social media and wearable devices, significantly expanding the scope of psychological research [3]
- The efficiency of psychological research is improved through AI technologies that can identify hidden signals of social anxiety and assess personality traits from textual data [3]
- Emotion recognition technologies are being utilized in settings like nursing homes to identify loneliness and other psychological states, enhancing the assessment of mental health [3]

Group 3: Innovations in Psychological Research
- Psychological researchers are developing AI tools for self-help that enhance emotional understanding and interaction capabilities [5]
- AI is being trained to recognize subtle psychological crisis signals, utilizing psychological models to improve the identification of distress [5]
- The integration of AI and psychological theories is fostering a deeper understanding of human emotions and enhancing predictive capabilities in mental health [5]

Group 4: Future Directions
- The interplay between psychology and AI is expected to evolve, with psychological insights potentially improving AI's decision-making in complex environments [7]
- AI's ability to generate experimental materials and simulate human interactions will contribute to advancing psychological research [7]
- The relationship between humans and AI is prompting a reevaluation of emotional connections and ethical considerations in the context of AI's role in understanding human emotions [8]
ICML 2025 | Global Pooling + Local Preservation: CCA-Attention Brings a Breakthrough to Long-Text Modeling for LLMs
机器之心· 2025-06-08 08:21
Pazhou Laboratory and South China University of Technology have jointly released the critical context-aware attention mechanism (CCA-Attention), enabling efficient context modeling for ultra-long text. On a 128K ultra-long-sequence context modeling task, CCA-Attention runs inference 7.9x faster than standard self-attention while reducing key-value (KV) cache memory usage by 93%, and it outperforms existing efficient attention methods across the board. The work has been accepted to ICML 2025; it was first submitted to arXiv on December 17, 2024, earlier than the public release of DeepSeek NSA and Kimi MoBA. CCA-Attention is not only fast and resource-light, it also sets a new benchmark for the accuracy and efficiency of context modeling, injecting fresh momentum into long-text processing.

Introduction

Recent studies [1, 2, 3] find that in most layers of LLMs the attention weights concentrate on a small number of tokens, showing pronounced sparsity (see Figure 1). This observation suggests the sparsity can be exploited to reduce the computational complexity of the attention mechanism.

Figure 1: Visualization of attention weights in the LLaMA2-7B model; darker shading indicates higher attention weight. The last token attends strongly to only a few tokens in the context, i.e. the attention weights are markedly sparse.

Existing sparse attention ...
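The title's pairing of global pooling with local preservation can be illustrated with a generic sketch (this is not the actual CCA-Attention algorithm, whose details are cut off above): each query attends to mean-pooled group summaries of the whole context plus its own local window of raw tokens, so per-token cost stops scaling with the full sequence length. Names, pool and window sizes, and the omission of causal masking are illustrative assumptions.

```python
import numpy as np

def softmax(s):
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

def pooled_plus_local_attention(q, k, v, pool_size=64, window=128):
    """q, k, v: (T, d). Each query sees pooled global summaries + a local window."""
    T, d = q.shape
    n_groups = T // pool_size
    # Global branch: one mean-pooled key/value per group of pool_size tokens.
    k_pool = k[:n_groups * pool_size].reshape(n_groups, pool_size, d).mean(axis=1)
    v_pool = v[:n_groups * pool_size].reshape(n_groups, pool_size, d).mean(axis=1)

    out = np.zeros_like(q)
    for t in range(T):
        lo = max(0, t - window + 1)
        # Local branch: raw keys/values in the sliding window ending at position t.
        k_cat = np.concatenate([k_pool, k[lo:t + 1]], axis=0)
        v_cat = np.concatenate([v_pool, v[lo:t + 1]], axis=0)
        w = softmax(q[t] @ k_cat.T / np.sqrt(d))
        out[t] = w @ v_cat
    return out

# Toy usage: 1024 tokens, 64-dimensional heads.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(1024, 64)) for _ in range(3))
print(pooled_plus_local_attention(q, k, v).shape)  # (1024, 64)
```

Per query the work is O(n_groups + window) rather than O(T), the same flavor of saving that the sparsity observation above motivates.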
Zhang Jinjian: Frequency and Spectrum in Investing | 42章经
42章经· 2025-06-08 08:11
Group 1
- The core argument of the article is that the current state of human attention is deteriorating, leading to a loss of independent judgment and increasing societal fragmentation, while AI, through its attention mechanisms, is becoming more focused and goal-oriented [1][4][24]
- The article discusses the differences between human and AI attention mechanisms, highlighting that AI can enhance its capabilities through computational power, while humans must rely on focus and restraint [1][4][6]
- It emphasizes the importance of attention management for entrepreneurs and investors, suggesting that those who can concentrate their attention effectively will find more opportunities in the evolving landscape [15][20][40]

Group 2
- The article explains the concept of attention as a filtering mechanism that helps humans process information amid noise, likening it to a signal-processing system [4][8][10]
- It presents the idea that human perception is limited compared to processing and output capabilities, with a significant gap between the amount of information received and what can be acted upon [6][7]
- The phenomenon of "herding" behavior is discussed, where individuals tend to follow trends rather than making independent decisions, leading to market bubbles and volatility [12][14]

Group 3
- The article posits that the future of AI will involve a combination of sensors, agents, and embodied intelligence, which will allow for a broader spectrum of perception and processing capabilities [35][36]
- It critiques current projects that are still centered around human capabilities, advocating for a shift towards an AI-centered approach in organizing work [37][38]
- The unique values of humans in the AI era are identified as the ability to create demand and the capacity for aesthetic judgment, which AI lacks [39][44]
ICML 2025 | Massive Values in the Attention Mechanism: The Key to Cracking Contextual Understanding in Large Language Models
机器之心· 2025-05-06 04:11
Core Insights
- The article discusses a significant phenomenon in large language models (LLMs): massive values concentrate in the self-attention mechanism, particularly in the query (Q) and key (K) representations, and this concentration is crucial for contextual knowledge understanding [1][3][4].

Research Highlights
- The study reveals that massive values are highly concentrated in Q and K, contrary to the expectation that each attention head operates independently. This consistency across multiple layers and heads is demonstrated visually [3][4].
- The massive-value phenomenon is observed specifically in models using Rotary Position Embedding (RoPE), such as LLaMA, Qwen, and Gemma, while models without RoPE, like GPT-2 and OPT, do not exhibit this pattern [4].
- The research establishes a direct link between the presence of massive values in Q and K and the ability to understand contextual knowledge [4].

Key Findings
1. **Concentration of Massive Values**: Massive values are highly concentrated in specific regions of each attention head, indicating a surprising level of consistency [3][4].
2. **Impact on Contextual Knowledge Understanding**: The presence of massive values is critical for understanding contextual knowledge, as demonstrated by destructive experiments that reset these values to their average (see the sketch after this summary) [5][6].
3. **Quantization Techniques**: Quantization methods that handle massive values explicitly, such as AWQ and SmoothQuant, preserve contextual knowledge understanding better than methods that do not [7].
4. **Origin of the Concentration Phenomenon**: The concentration of massive values is attributed to RoPE, which affects the low-frequency regions of Q and K, and the phenomenon appears from the early layers of the model onward [8].

Experimental Results
- The experiments reveal a stark contrast in how massive values affect different knowledge tasks:
- **Resilience in Parametric Knowledge Retrieval**: Tasks relying on parametric knowledge show only a 15-20% drop in accuracy when massive values are disrupted, maintaining 76%-88% accuracy [10].
- **Catastrophic Decline in Contextual Knowledge Tasks**: Tasks requiring contextual understanding degrade drastically; accuracy on key retrieval tasks plummets from 100% to near 0% when massive values are disrupted [11].
- **Control Experiments**: When only non-massive values are disrupted, task performance remains stable, confirming the unique importance of massive values for contextual understanding [12].

Future Directions
- The research opens several avenues for further exploration, including enhancing or adjusting the distribution of massive values to improve contextual understanding, examining whether the phenomenon holds across different architectures, and designing targeted quantization methods that protect the massive values tied to contextual understanding [16].
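As a hedged illustration of the kind of destructive experiment described above (not the paper's exact protocol), the sketch below treats the largest-magnitude entries of a Q or K activation tensor as the "massive values" and resets them to the mean of the remaining entries; in the study, applying such an intervention inside a model's attention layers is what collapses contextual-retrieval accuracy while leaving parametric-knowledge tasks largely intact. The top_frac threshold and all names are illustrative assumptions.

```python
import numpy as np

def disrupt_massive_values(x, top_frac=0.001):
    """Reset the largest-magnitude entries of x (e.g. a Q or K tensor) to the
    mean of the remaining entries, mimicking a 'destroy massive values' ablation."""
    flat = x.reshape(-1).copy()
    k = max(1, int(len(flat) * top_frac))
    massive_idx = np.argsort(np.abs(flat))[-k:]     # indices of the massive values
    keep = np.ones(len(flat), dtype=bool)
    keep[massive_idx] = False
    flat[massive_idx] = flat[keep].mean()           # overwrite with the mean of the rest
    return flat.reshape(x.shape)

# Toy usage on a fake (heads, seq, head_dim) query tensor with a few planted outliers.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128, 64))
q[0, 0, :4] = 50.0                                  # plant a few "massive" values
q_hit = disrupt_massive_values(q)
print(np.abs(q).max(), np.abs(q_hit).max())         # the extreme entries are gone
```

A control run that instead overwrites an equal number of randomly chosen non-massive entries would be the analogue of the paper's control experiment.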
A Core MoBA Author at Moonshot AI (月之暗面) Tells His Story: A "Newly Minted Large-Model Trainer" and His Three Trips to the "Cliff of Reflection"
晚点LatePost· 2025-02-20 14:21
"从开源论文、开源代码出发,现在已经进化到开源思维链了嘛!" 文丨Andrew Lu 注释丨贺乾明 程曼祺 2 月 18 日,Kimi 和 DeepSeek 同一天发布新进展,分别是 MoBA 和 NSA,二者都是对 "注意力机 制"(Attention Mechanism)的改进。 今天,MoBA 的一位主要研发同学 Andrew Lu 在知乎发帖,自述研发过程的三次踩坑,他称为 "三入思过 崖"。他在知乎的签名是"新晋 LLM 训练师"。 这条回答下的一个评论是:"从开源论文、开源代码出发,现在已经进化到开源思维链了嘛。" 注意力机制之所以重要,是因为它是当前大语言模型(LLM)的核心机制。回到 2017 年 6 月那篇开启 LLM 革命的 Transformer 八子论文,标题就是:Attention Is All You Need(注意力就是你所需要的一 切),该论文被引用次数至今已达 15.3 万。 注意力机制能让 AI 模型像人类一样,知道在处理信息时该 "重点关注" 什么、"忽略" 什么,抓住信息中最 关键的部分。 在大模型的训练阶段和使用(推理)阶段,注意力机制都会发挥作用。它的大致工作原理是 ...