Transformer
Meta's New Attention Mechanism Breaks Through the Transformer's Ceiling, Also Drawing on OpenAI's Open-Source Technology
量子位· 2025-07-07 09:35
Core Viewpoint - Meta has made significant advances by leveraging OpenAI's open-source technology and recruiting a large number of former OpenAI employees, resulting in a new architecture called the 2-Simplicial Transformer, which improves how efficiently large models use training data [1][2][26]

Group 1: New Architecture and Methodology
- The 2-Simplicial Transformer modifies the standard attention mechanism to improve data efficiency, addressing the data bottleneck in current large-model development [2][4]
- The core method extends standard dot-product attention to a trilinear function, allowing better expression of complex tasks [3][6]
- A second key vector, K', is introduced so the attention calculation can capture richer relationships [9][10]

Group 2: Performance and Scalability
- Experimental results indicate that the 2-Simplicial Transformer outperforms traditional Transformers on mathematical, programming, and reasoning tasks, especially as parameter counts grow [4][19]
- The new architecture's scaling exponent is higher than that of traditional Transformers, meaning performance improves more rapidly with additional parameters and data, an advantage in data-limited scenarios [20][22]
- Across tasks, the 2-Simplicial Transformer shows better performance metrics than traditional Transformers, particularly at larger model sizes [18][21]

Group 3: Implementation and Challenges
- The implementation uses Triton, a GPU programming framework that enables efficient computation without extensive CUDA experience [11][12]
- Despite these advantages, the computational complexity and latency of the 2-Simplicial Transformer remain high, indicating a need for further optimization before production use [22]
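The trilinear extension summarized above can be sketched in a few lines. This is a minimal NumPy sketch, not the paper's implementation: the second key stream `K2` stands in for the article's K', and pairing a second value stream `V2` by element-wise product is an assumption about the trilinear formulation, as are the scaling and the joint softmax over key pairs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def two_simplicial_attention(Q, K, K2, V, V2):
    """Sketch of trilinear (2-simplicial) attention.

    Logits are a trilinear form over one query and two keys,
        A[i, j, k] = sum_d Q[i, d] * K[j, d] * K2[k, d],
    softmaxed jointly over all (j, k) key pairs; each output mixes the
    element-wise product of the two value streams (an assumption here).
    """
    n, d = Q.shape
    # Trilinear logits over query / key / second-key triples: (n, n, n)
    logits = np.einsum("id,jd,kd->ijk", Q, K, K2) / d ** 0.5
    # Joint softmax over the n*n key pairs for each query position i
    weights = softmax(logits.reshape(n, -1), axis=-1).reshape(n, n, n)
    # Aggregate the element-wise product of the two value streams
    return np.einsum("ijk,jd,kd->id", weights, V, V2)
```

One sanity check on the sketch: if `K2` and `V2` are all ones, the extra axis contributes nothing and the computation collapses back to ordinary dot-product attention.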
DeepSeek Technical Deep Dive (3): The Evolution of MoE
自动驾驶之心· 2025-07-06 08:44
Core Viewpoint - The article discusses the evolution of DeepSeek's Mixture-of-Experts (MoE) models, highlighting innovations and improvements from DeepSeekMoE (V1) to DeepSeek V3 along a consistent MoE technology route [1]

Summary by Sections
1. Development History of MoE
- MoE was first introduced in 1991 with the paper "Adaptive Mixtures of Local Experts," and its basic framework has remained consistent over the years [2]
- Google has been a key player in the development of MoE, particularly with the release of "GShard" in 2020, which scaled models to 600 billion parameters [5]

2. DeepSeek's Work
2.1 DeepSeek-MoE (V1)
- DeepSeek V1 was released in January 2024, addressing two main issues: knowledge mixing and redundancy among experts [15]
- The architecture introduced fine-grained expert segmentation and shared expert isolation to enhance specialization and reduce redundancy [16]

2.2 DeepSeek V2 MoE Upgrade
- V2 introduced a device-limited routing mechanism that controls communication cost by ensuring activated experts are distributed across a limited number of devices [28]
- A communication balance loss was added to relieve potential congestion at the receiving end of the communication [29]

2.3 DeepSeek V3 MoE Upgrade
- V3 kept the fine-grained expert and shared expert designs while upgrading the gating network from Softmax to Sigmoid to improve score differentiation among experts [36][38]
- The auxiliary load-balancing loss was eliminated to reduce its negative impact on the main objective, replaced by a dynamic per-expert bias for load balancing [40]
- A sequence-wise auxiliary loss was introduced to balance token distribution among experts at the sequence level [42]

3. Summary of DeepSeek's Innovations
- The evolution of DeepSeek MoE has focused on balancing general and specialized knowledge through shared and fine-grained experts, while addressing load balancing through various auxiliary losses and bias terms [44]
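The V3 gating changes described above (Sigmoid scoring plus an auxiliary-loss-free dynamic bias) can be sketched as follows. The helper name `moe_gate_v3`, the shapes, and the normalization details are illustrative assumptions based on the public description, not DeepSeek's released code.

```python
import numpy as np

def moe_gate_v3(h, W_gate, bias, top_k=2):
    """Sketch of a DeepSeek-V3-style MoE gate for one token.

    Affinity scores use a sigmoid instead of a softmax; a non-learned
    per-expert bias is added only when *selecting* the top-k experts
    (auxiliary-loss-free load balancing), while the gating weights are
    normalized from the unbiased scores of the chosen experts.
    """
    scores = 1.0 / (1.0 + np.exp(-(h @ W_gate)))       # sigmoid affinity per expert
    selected = np.argsort(scores + bias)[-top_k:]      # bias steers selection only
    gates = scores[selected] / scores[selected].sum()  # weights from unbiased scores
    return selected, gates
```

During training, the bias of an overloaded expert would be nudged down and that of an underloaded expert nudged up, steering routing without adding a gradient-carrying loss term.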
The Scaling Law Can Be Optimized? Meta's Trick Saves Tokens and Boosts Efficiency
机器之心· 2025-07-06 03:49
Core Insights - The article discusses advances in AI, focusing on the evolution of the Transformer model and the introduction of the 2-simplicial Transformer, which improves token efficiency and model scalability [1][4][10]

Group 1: Transformer and AI Development
- The paper "Attention Is All You Need" marked a significant turning point in AI development, establishing the Transformer as the foundational paradigm for current language models [1]
- The citation count for this paper is approaching 190,000, indicating its profound impact on the field [2]
- The ongoing challenge in AI is acquiring a sufficient quantity of high-quality tokens and using them efficiently, which motivates further upgrades to the Transformer [3]

Group 2: 2-Simplicial Transformer
- Meta's recent research introduced a rotationally invariant trilinear attention mechanism with representational capacity comparable to the 2-simplicial Transformer, potentially altering the coefficients in the Scaling Law [4][10]
- The 2-simplicial Transformer, derived from Clift et al. (2019), generalizes the dot-product attention mechanism to a trilinear form, enhancing its scalability under token constraints [19][11]
- Experimental results indicate that the 2-simplicial Transformer can approximate the irreducible entropy of natural language more effectively than dot-product attention Transformers [11]

Group 3: Scaling Law and Model Performance
- The Scaling Law describes how loss decreases with total parameter count and token count, implying that larger models should approach the irreducible loss of the natural-text distribution as both grow [13][15]
- Hoffmann et al. (2022) found that the optimal parameter count and dataset size should scale proportionally with the compute budget, with estimated scaling exponents of about 0.49 for parameters and 0.5 for tokens [17][18]
- The 2-simplicial Transformer exhibits a steeper scaling slope than the dot-product attention Transformer, indicating a higher exponent in its Scaling Law [50]

Group 4: Experimental Results
- Experiments across various models revealed that the 2-simplicial attention mechanism provided no benefit in models with fewer than 2 billion active parameters [45]
- Performance metrics across model sizes showed slight improvements or declines when comparing the 2-simplicial Transformer to traditional Transformers, with variations in performance percentages noted [43][44]
- The study estimated the differences in scaling coefficients between the 2-simplicial and dot-product attention mechanisms, highlighting the potential for improved efficiency in larger models [46][49]
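The compute-optimal allocation quoted above (exponents of roughly 0.49 for parameters and 0.5 for tokens, from Hoffmann et al. 2022) can be illustrated with a toy calculation. The constant factors here are placeholders, so only the scaling behavior, not the absolute parameter or token counts, is meaningful.

```python
def chinchilla_optimal(compute_budget, a=0.49, b=0.5):
    """Toy compute-optimal allocation under C ~ 6 * N * D.

    With N proportional to C^a and D proportional to C^b (a + b is close
    to 1, so 6 * N * D approximately tracks C), doubling the compute
    budget multiplies the optimal parameter count by 2^a and the optimal
    token count by 2^b. Proportionality constants are set to 1 here.
    """
    core = compute_budget / 6.0  # N * D budget implied by C = 6 * N * D
    n_params = core ** a         # compute-optimal parameter count (up to a constant)
    n_tokens = core ** b         # compute-optimal token count (up to a constant)
    return n_params, n_tokens
```

The article's point is that a steeper scaling slope for 2-simplicial attention corresponds to a larger exponent in this relationship, so loss falls faster per unit of added compute.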
X @Avi Chawla
Avi Chawla· 2025-07-04 06:48
AI Tools & Platforms
- RAGFlow is a linked resource [1]
- Xpander is a linked resource [1]
- Transformer Lab is a linked resource [1]
- Llama Factory is a linked resource [1]
- LangFlow is a linked resource [1]
- AutoAgent is a linked resource [1]
ICML 2025 | Breaking the Residual Connection Bottleneck: Caiyun Technology & BUPT Propose the MUDDFormer Architecture to Evolve the Transformer Again!
机器之心· 2025-06-27 08:06
Core Viewpoint - The article introduces Multiway Dynamic Dense (MUDD) connections as an effective alternative to residual connections in Transformers, significantly improving cross-layer information transfer efficiency in deep learning models [1][4]

Background
- Residual connections, introduced by Kaiming He in ResNet, have become foundational in deep learning and Transformer LLMs, but they remain limited in how efficiently information moves across layers [1][7]
- MUDD connections dynamically establish cross-layer connections based on the current hidden state, addressing issues like representation collapse and information overload in the residual stream [7][8]

Model Architecture
- The MUDDFormer architecture builds independent dynamic connections for the different information streams (Q, K, V, R), enhancing the model's ability to gather relevant information from previous layers [10][13]
- Dynamic connections let the model adaptively determine, based on each token's context, how heavily to weight the information extracted from each previous layer [11][13]

Experimental Evaluation
- MUDDPythia, a model with 2.8 billion parameters, shows performance comparable to larger models (6.9 billion and 12 billion parameters) with only a 0.23% increase in parameters and a 0.4% increase in computation [4][18]
- MUDDFormer outperforms baseline models like Transformer++ across various model sizes, demonstrating significant computational efficiency improvements [15][17]

Downstream Task Assessment
- In downstream tasks, MUDDPythia achieves higher 0-shot and 5-shot accuracy than equivalent Pythia models, indicating enhanced in-context learning capabilities [18][20]
- In specific evaluations, the model achieves a 2.4x efficiency leap over the 6.9-billion-parameter Pythia and a 4.2x efficiency leap over the 12-billion-parameter Pythia [18][20]

Conclusion
- MUDDFormer improves on residual connections by establishing independent dynamic cross-layer connections for different information streams, enhancing cross-layer interaction and in-context learning in Transformers [25]
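The dynamic dense connection described above can be sketched for a single stream. The shapes and the helper name `mudd_aggregate` are hypothetical; the real MUDDFormer learns a separate aggregation like this for each of the Q, K, V, and residual streams, with details that differ from this sketch.

```python
import numpy as np

def mudd_aggregate(hiddens, W_dyn):
    """Sketch of a MUDD-style dynamic dense connection for one stream.

    Instead of reading only the previous layer's output (as a residual
    connection does), the next layer re-weights the hidden states of all
    earlier layers, with per-token weights predicted from the current
    hidden state. hiddens: list of (seq, dim) arrays, one per layer.
    """
    stack = np.stack(hiddens)                  # (L+1, seq, dim)
    logits = hiddens[-1] @ W_dyn               # (seq, L+1): conditioned on current state
    logits -= logits.max(-1, keepdims=True)    # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(-1, keepdims=True)  # per-token distribution over depth
    # Each token mixes its own history of hidden states across depth
    return np.einsum("sl,lsd->sd", weights, stack)
```

When the dynamic projection is zero, the weights are uniform and this reduces to a plain mean over layers; the learned projection is what makes the connection "dynamic" rather than static dense aggregation.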
China's First Intelligent Standard Cell Automatic Library Construction Tool, iCell, Released in Nanjing
Nan Jing Ri Bao· 2025-06-18 03:31
Core Insights - The National Integrated Circuit Design Automation Technology Innovation Center has launched the iCell tool, a significant advance for the Electronic Design Automation (EDA) field in China that provides essential support for high-end chip design [1][2]

Group 1: iCell Tool Overview
- iCell is China's first intelligent standard cell automatic library construction tool, aimed at improving the efficiency of digital chip design [1]
- The tool automates the construction of standard cell libraries, a task that traditionally required hundreds of engineers and several months to complete [1]

Group 2: Technological Innovations
- iCell employs a Transformer-based pre-training method for transistor layout, leveraging deep learning to optimize the design process [2]
- The tool combines reinforcement learning with multi-task learning statistical methods to significantly reduce simulation costs and shorten the library construction cycle [2]

Group 3: Application and Impact
- iCell enables process exploration and optimization through design-process interaction, serving as a point tool for advanced-process foundries [2]
- The tool is already in use at leading domestic chip design companies and memory foundries in China [2]
Toward an Epistemology of Artificial Intelligence: How Models Reason, Align, and Change Their Minds
36Ke· 2025-06-16 01:54
Group 1
- The core architecture of LLMs is the Transformer model, whose self-attention layers dynamically allocate attention between input tokens and previously generated output tokens, enabling adaptive, content-driven processing [1][2][3]
- Attention heads within the model can perform recognizable mechanisms, such as tracking list items or checking grammatical consistency, indicating that Transformers can learn algorithmic or rule-based processes internally [2][3]
- The self-attention mechanism enables LLMs to execute a series of transformations on input data with flexible routing of information, a hallmark of reasoning [3][4]

Group 2
- Alignment in models like Claude involves fine-tuning to ensure that the model's behavior matches human preferences and values, often through reinforcement learning from human feedback (RLHF) [4][5]
- There is an inherent tension between alignment and fidelity: aligning a model may optimize its outputs to meet user needs at the expense of the transparency of its reasoning process [5][6]
- The "character" training of models like Claude aims to instill traits such as honesty and politeness, which can shape the model's responses and explanations, potentially creating a "politeness filter" that obscures harsh truths [7][8]

Group 3
- The tendency of RLHF-trained models to cater to user opinions can conflict with fact-based reasoning, as models may agree with incorrect user statements to appear friendly [8][9]
- Explainability is complicated by the distinction between a model's internal reasoning and its externally aligned behavior, making the model's true reasoning process difficult to interpret [9][10]
- Interpretability tools such as circuit tracing aim to analyze internal activations directly rather than relying on the model's own explanations, which may be shaped by alignment [10][11]

Group 4
- Despite the challenges of alignment, aligned models have reduced the dissemination of harmful content and improved the quality of explanations provided by AI systems [11][12]
- Future work in the field will focus on maintaining transparency while aligning with human values, potentially involving new training objectives that reward faithful reasoning rather than just correct final answers [11][12]
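The content-driven routing described in Group 1 can be made concrete with a minimal single-head self-attention sketch in NumPy. This is an illustration of the mechanism, not any production implementation.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Minimal single-head self-attention.

    The attention weights are computed from the input itself (via the
    learned projections Wq, Wk, Wv) rather than being fixed in advance,
    which is the adaptive, content-driven routing described above.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # pairwise token affinities
    scores -= scores.max(-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(-1, keepdims=True)            # each row is a distribution over tokens
    return A @ V                             # tokens mix the values they attend to
```

Because the weights depend on the tokens' content, identical inputs attend identically, while changing one token reroutes information across the whole sequence.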
X @Avi Chawla
Avi Chawla· 2025-06-14 20:03
Model Architecture
- Explains Transformer vs Mixture of Experts (MoE) in LLMs with visuals [1]
- Focuses on clearly explaining Mixture of Experts in LLMs [1]
X @Avi Chawla
Avi Chawla· 2025-06-14 06:30
LLM Technology
- A comparative analysis of Transformer and Mixture of Experts (MoE) architectures in LLMs [1]
- Industry attention on tutorials and insights covering DS (data science), ML (machine learning), LLMs (large language models), and RAGs (retrieval-augmented generation) [1]

Social Media Engagement
- Users are encouraged to share the information [1]
- Industry expert Avi Chawla shares related content on social media [1]