Mixture of Experts
Song Han et al. propose FlashMoBA: 7.4x faster than MoBA, scaling sequences to 512K without memory overflow
机器之心· 2025-11-18 05:08
Core Insights
- The article discusses a novel attention mechanism called Mixture of Block Attention (MoBA), which applies the principles of mixture of experts (MoE) to attention, allowing models to autonomously determine which positions to focus on [2][4]
- MoBA shows significant potential for handling long contexts by letting queries sparsely attend to a limited number of key-value blocks, greatly reducing computational cost [3][4] (a simplified sketch of this block routing follows the summary below)
- The article identifies performance challenges associated with smaller block sizes in MoBA implementations and introduces FlashMoBA, a hardware-friendly CUDA kernel designed to execute MoBA efficiently under small-block configurations [7][12]

Performance Analysis
- The original MoBA implementation hits performance bottlenecks at smaller block sizes, running slower than dense attention [11][41]
- FlashMoBA optimizes MoBA's execution, achieving up to 14.7x speedup over FlashAttention-2 in small-block scenarios [8][43]
- Experimental results show that reducing the block size from 512 to 128 improves perplexity from 20.9 to 19.7 and RULER accuracy from 38.8% to 56.0% for a 340M-parameter model [30][31]

Technical Improvements
- The article outlines two main improvement paths for MoBA: using smaller block sizes and applying short convolutions on keys to improve routing accuracy [5][36]
- FlashMoBA employs a three-kernel design to minimize memory-access inefficiencies and align computation with GPU architecture, significantly improving performance [16][21]
- The forward kernel uses a "collect and densify" strategy to handle MoBA's irregular sparsity, which is crucial for efficient computation [22][26]

Experimental Results
- Experiments conducted on 8x H100 80GB GPUs demonstrate that the optimized MoBA model outperforms dense attention across various benchmarks [30][39]
- Key-convolution variants (kconv3 and kconv5) improve model performance, with kconv3 raising language-modeling accuracy from 45.1% to 45.6% for the 340M model [36][37]
- Overall, the results indicate that smaller block sizes are essential for MoBA to match the performance of dense attention [41][42]
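To make the block-routing idea concrete, here is a minimal, non-causal PyTorch sketch of MoBA-style attention: each query scores mean-pooled key-block centroids, keeps its top-k blocks, and attends only inside them. The block size, top-k value, and tensor shapes are illustrative assumptions, and the dense masking is for readability only; FlashMoBA's contribution is executing this pattern with fused, tiled CUDA kernels, which the snippet does not attempt to reproduce.

```python
# Simplified MoBA-style block-sparse attention routing (single head, non-causal).
# Reference logic only; the real method adds causality and fused GPU kernels.
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size=128, top_k=3):
    """q, k, v: (seq_len, head_dim)."""
    seq_len, dim = k.shape
    n_blocks = seq_len // block_size

    # 1. Summarize each key block by its mean-pooled key (the block "centroid").
    k_blocks = k[: n_blocks * block_size].reshape(n_blocks, block_size, dim)
    centroids = k_blocks.mean(dim=1)                       # (n_blocks, dim)

    # 2. Each query scores all blocks and keeps only its top-k blocks.
    block_scores = q @ centroids.T                         # (seq_len, n_blocks)
    top_blocks = block_scores.topk(top_k, dim=-1).indices  # (seq_len, top_k)

    # 3. Build a sparse mask: a query may only attend inside its chosen blocks.
    block_ids = torch.arange(seq_len) // block_size        # block id of each key
    allowed = (top_blocks.unsqueeze(-1) == block_ids).any(dim=1)  # (seq_len, seq_len)

    # 4. Masked attention; computed densely here for clarity, block-sparse in practice.
    attn = (q @ k.T) / dim**0.5
    attn = attn.masked_fill(~allowed, float("-inf"))
    return F.softmax(attn, dim=-1) @ v

q, k, v = (torch.randn(1024, 64) for _ in range(3))
out = moba_attention(q, k, v)   # (1024, 64)
```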
X @Avi Chawla
Avi Chawla· 2025-11-11 20:14
Mixture of Experts (MoE) Architecture
- MoE is a popular architecture that leverages different experts to enhance Transformer models [1]
- MoE differs from the Transformer in the decoder block, using multiple experts (smaller feed-forward networks) instead of a single feed-forward network [2][3]
- During inference, only a subset of experts is selected, leading to faster inference [4]
- A router, a multi-class classifier, selects the top-K experts by producing softmax scores [5]
- The router is trained jointly with the network to learn the best expert selection [5]

Training Challenges and Solutions
- Challenge 1: some experts may become under-trained because a few experts are over-selected [5]
- Solution 1: add noise to the router's feed-forward output and set all but the top-K logits to negative infinity so that other experts also get trained [5][6]
- Challenge 2: some experts may be exposed to more tokens than others, again leaving experts under-trained [6]
- Solution 2: limit the number of tokens an expert can process; once the limit is reached, the token is passed to the next-best expert [6] (see the router sketch after this summary)

MoE Characteristics and Examples
- Text passes through different experts across layers, and the chosen experts differ between tokens [7]
- MoEs have more parameters to load, but only a fraction are activated during inference, resulting in faster inference [9]
- Mixtral 8x7B and Llama 4 are examples of popular MoE-based LLMs [9]
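The two fixes above can be sketched together in a few lines of PyTorch: a router that adds learned noise to its logits, keeps only the top-K of them (the rest are pushed to negative infinity before the softmax), and enforces a per-expert token capacity. The layer sizes, capacity value, and drop-instead-of-re-route behavior are simplifying assumptions for illustration, not a specific production implementation.

```python
# Minimal noisy top-K router with a per-expert capacity limit (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKRouter(nn.Module):
    def __init__(self, d_model, n_experts, top_k=2, capacity=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)   # multi-class classifier
        self.noise = nn.Linear(d_model, n_experts)  # learned noise scale
        self.top_k, self.capacity = top_k, capacity

    def forward(self, x):                            # x: (n_tokens, d_model)
        logits = self.gate(x)
        # Challenge 1 fix: trainable Gaussian noise spreads load across experts.
        logits = logits + torch.randn_like(logits) * F.softplus(self.noise(x))
        # Keep the top-K logits; push the rest to -inf so the softmax ignores them.
        topk_val, topk_idx = logits.topk(self.top_k, dim=-1)
        masked = torch.full_like(logits, float("-inf")).scatter(-1, topk_idx, topk_val)
        weights = masked.softmax(dim=-1)             # (n_tokens, n_experts)
        # Challenge 2 fix: cap tokens per expert; overflow tokens lose that expert
        # (a fuller version would re-route them to their next-best choice).
        keep = torch.ones_like(weights)
        for e in range(weights.shape[-1]):
            routed = (weights[:, e] > 0).nonzero(as_tuple=True)[0]
            keep[routed[self.capacity:], e] = 0.0
        return weights * keep

router = NoisyTopKRouter(d_model=64, n_experts=8)
tokens = torch.randn(16, 64)
print(router(tokens).shape)   # torch.Size([16, 8]); each row has at most 2 nonzero weights
```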
China Went HARD...
Matthew Berman· 2025-07-24 00:30
Model Performance & Capabilities
- Qwen3-Coder rivals Anthropic's Claude family in coding performance, achieving 69.6% on SWE-bench Verified compared to Claude Sonnet 4's 70.4% [1]
- The most powerful variant, Qwen3-Coder 480B, is a mixture of experts model with 480 billion total parameters and 35 billion active parameters [2][3]
- The model supports a native context length of 256K tokens, extendable to 1 million tokens with extrapolation methods, strengthening tool calling and agentic use cases [4]

Training Data & Methodology
- The model was pre-trained on 7.5 trillion tokens with a 70% code ratio, improving coding ability while maintaining general and math skills [5]
- Qwen2.5-Coder was leveraged to clean and rewrite noisy data, significantly improving overall data quality [6]
- Code RL training was scaled on a broader set of real-world coding tasks, focusing on diverse tasks to unlock the full potential of reinforcement learning [7][8]

Tooling & Infrastructure
- Qwen launched Qwen Code, a command-line tool adapted from Gemini Code, enabling agentic, multi-turn execution with planning [2][5][9]
- A scalable system was built to run 20,000 independent environments in parallel, leveraging Alibaba Cloud infrastructure for self-play [10]

Open Source & Accessibility
- The model is hosted on Hugging Face, making it free to use and try out [11]
X @Avi Chawla
Avi Chawla· 2025-06-14 20:03
Model Architecture
- Explains Transformer vs Mixture of Experts (MoE) in LLMs with visuals [1]
- Focuses on clearly explaining Mixture of Experts in LLMs [1]
X @Avi Chawla
Avi Chawla· 2025-06-14 06:30
LLM Techniques
- Comparative analysis of Transformer vs Mixture of Experts (MoE) in LLMs [1]
- Industry focus on tutorials and insights covering DS (data science), ML (machine learning), LLMs (large language models), and RAGs (retrieval-augmented generation) [1]

Social Media Engagement
- Encourages users to share the information [1]
- Industry expert Avi Chawla shares related content on social media [1]
X @Avi Chawla
Avi Chawla· 2025-06-14 06:30
Model Architecture
- Mixture of Experts (MoE) models activate only a fraction of their parameters during inference, leading to faster inference [1] (see the back-of-envelope sketch after this summary)
- Mixtral 8x7B by MistralAI is a popular MoE-based Large Language Model (LLM) [1]
- Llama 4 is another popular MoE-based LLM [1]
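As a rough illustration of the "only a fraction of parameters are active" point, the back-of-envelope script below counts FFN and attention weights for an assumed Mixtral-8x7B-like configuration (32 layers, d_model 4096, FFN hidden size 14336, 8 experts, top-2 routing); the exact figures are assumptions, and embeddings and grouped-query attention are ignored for simplicity.

```python
# Back-of-envelope: total vs active parameters for an assumed MoE configuration.
d_model, d_ff, n_layers = 4096, 14336, 32
n_experts, top_k = 8, 2

expert_params = 3 * d_model * d_ff      # SwiGLU-style FFN weights per expert
attn_params = 4 * d_model * d_model     # rough per-layer attention weights

per_layer_total = attn_params + n_experts * expert_params
per_layer_active = attn_params + top_k * expert_params

total = n_layers * per_layer_total
active = n_layers * per_layer_active
print(f"total ~ {total/1e9:.1f}B, active ~ {active/1e9:.1f}B "
      f"({100*active/total:.0f}% of weights per token)")
```

With these assumptions the script reports roughly 47B total versus 13B active parameters, i.e. only about a quarter of the weights are touched per token at inference time, which is what makes MoE inference faster despite the larger model size.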
X @Avi Chawla
Avi Chawla· 2025-06-14 06:30
LLM Architectures
- The report compares Transformer and Mixture of Experts (MoE) architectures in Large Language Models (LLMs) [1]
- The report provides clear explanations and visuals to illustrate the differences between the two architectures [1]

Focus
- The report focuses on explaining Transformer and MoE architectures in LLMs [1]