Mixture of Experts

China Went HARD...
Matthew Berman· 2025-07-24 00:30
Model Performance & Capabilities
- Qwen3-Coder rivals Anthropic's Claude family in coding performance, scoring 69.6% on SWE-Bench Verified versus Claude Sonnet 4's 70.4% [1]
- The most powerful variant, Qwen3-Coder 480B, is a mixture-of-experts model with 480 billion total parameters and 35 billion active parameters [2][3]
- The model supports a native context length of 256K tokens, extendable to 1 million tokens with extrapolation methods, strengthening tool calling and agentic use cases [4]

Training Data & Methodology
- The model was pre-trained on 7.5 trillion tokens with a 70% code ratio, improving coding ability while maintaining general and math skills [5]
- Qwen2.5-Coder was leveraged to clean and rewrite noisy data, significantly improving overall data quality [6]
- Code RL training was scaled on a broader set of real-world coding tasks, focusing on diverse tasks to unlock the full potential of reinforcement learning [7][8]

Tooling & Infrastructure
- Qwen launched Qwen Code, a command-line tool adapted from Gemini CLI, enabling agentic, multi-turn execution with planning [2][5][9]
- A scalable system was built to run 20,000 independent environments in parallel, leveraging Alibaba Cloud's infrastructure for self-play [10]

Open Source & Accessibility
- The model is hosted on Hugging Face, making it free to use and try out [11]
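The 480B-total / 35B-active figures above imply that only a small fraction of the model's weights participate in any single token's forward pass. A quick illustrative calculation (the summary gives only the two totals; the actual expert layout is not specified):

```python
# Illustrative arithmetic only; based on the two totals quoted in the summary.
total_params = 480e9    # total parameters
active_params = 35e9    # parameters active per token

active_fraction = active_params / total_params
print(f"Active per token: {active_fraction:.1%}")  # → Active per token: 7.3%
```

So roughly 93% of the weights sit idle for any given token, which is how an MoE model of this size keeps per-token compute closer to that of a ~35B dense model.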
X @Avi Chawla
Avi Chawla· 2025-06-14 20:03
Model Architecture
- Explains Transformer vs. Mixture of Experts (MoE) architectures in LLMs with visuals [1]
- Focuses on clearly explaining Mixture of Experts in LLMs [1]
X @Avi Chawla
Avi Chawla· 2025-06-14 06:30
LLM Techniques
- Comparative analysis of Transformer vs. Mixture of Experts (MoE) architectures in LLMs [1]
- Industry focus on tutorials and insights covering DS (data science), ML (machine learning), LLMs (large language models), and RAG (retrieval-augmented generation) [1]

Social Media Engagement
- Encourages users to share the content [1]
- Industry expert Avi Chawla shares related material on social media [1]
X @Avi Chawla
Avi Chawla· 2025-06-14 06:30
Model Architecture
- Mixture of Experts (MoE) models activate only a fraction of their parameters during inference, leading to faster inference [1]
- Mixtral 8x7B by Mistral AI is a popular MoE-based large language model (LLM) [1]
- Llama 4 is another popular MoE-based LLM [1]
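The sparse-activation idea described above can be sketched as a toy top-k router: a gate scores all experts, only the k highest-scoring expert FFNs actually run, and their outputs are mixed by the gate's softmax weights. All sizes, the gating matrix `W_gate`, and the `moe_layer` helper below are hypothetical illustrations, not taken from Mixtral or Llama 4 (real models use learned routers inside every Transformer block):

```python
import numpy as np

# Toy MoE layer: 8 experts, 2 active per token (Mixtral-style top-2 routing).
rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
n_experts, top_k = 8, 2

W_gate = rng.standard_normal((d_model, n_experts))  # router weights
experts = [
    (rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model)))
    for _ in range(n_experts)
]

def moe_layer(x):
    """Route token x to its top-k experts; only k of n_experts ever run."""
    logits = x @ W_gate
    top = np.argsort(logits)[-top_k:]        # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                 # softmax over the selected experts
    out = np.zeros_like(x)
    for w, i in zip(weights, top):
        W_in, W_out = experts[i]
        out += w * (np.maximum(x @ W_in, 0.0) @ W_out)  # each expert is a small FFN
    return out

x = rng.standard_normal(d_model)
y = moe_layer(x)
print(y.shape)  # (16,)
```

Because only `top_k` of the `n_experts` FFNs execute per token, per-token compute scales with the active experts rather than the full parameter count, which is the speedup the posts above refer to.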
X @Avi Chawla
Avi Chawla· 2025-06-14 06:30
LLM Architectures
- The report compares Transformer and Mixture of Experts (MoE) architectures in large language models (LLMs) [1]
- Clear explanations and visuals illustrate the differences between the two architectures [1]

Focus
- The report focuses on explaining Transformer and MoE architectures in LLMs [1]
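The core contrast these comparisons draw can be reduced to a back-of-envelope parameter count: a dense Transformer runs its entire FFN on every token, while an MoE layer stores many expert FFNs but runs only a few of them per token. The layer sizes below are hypothetical round numbers chosen for illustration:

```python
# Hypothetical layer sizes for illustration only.
d_model, d_ff = 4096, 14336
n_experts, top_k = 8, 2

dense_ffn_params = 2 * d_model * d_ff              # one big FFN, fully used per token
moe_total_params = n_experts * 2 * d_model * d_ff  # parameters stored
moe_active_params = top_k * 2 * d_model * d_ff     # parameters used per token

print(moe_total_params // dense_ffn_params)   # → 8  (8x the stored capacity...)
print(moe_active_params // dense_ffn_params)  # → 2  (...at only 2x per-token compute)
```

This capacity-vs-compute trade-off is the practical reason MoE shows up in the comparisons above: it decouples how much a model knows from how much work each token costs.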