Mixture of Experts (MoE)

Llama 4 released, reclaiming the open-source top spot! Coding ability on par with DeepSeek at half the parameters, runs on a single H100, plus a 2-trillion-parameter super-sized model
量子位 · 2025-04-06 02:33
Core Viewpoint
- Meta has launched the Llama 4 family of models, marking a significant advance in multimodal AI capabilities, with Llama 4 Maverick achieving top scores across a range of benchmarks [3][4][8]

Group 1: Model Overview
- The Llama 4 family comprises three models: Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth; the first two are already released, and the third is still in training [3][4]
- Llama 4 Scout has 17 billion active parameters and a context window of 10 million tokens, while Llama 4 Maverick has 17 billion active parameters with 128 experts [5][19]
- Llama 4 Behemoth is a massive 2-trillion-parameter model, currently in training, and is expected to outperform existing models such as GPT-4.5 and Claude Sonnet 3.7 [5][54]

Group 2: Performance Metrics
- Llama 4 Maverick scored 1417 in the latest model ranking, surpassing previous leaders to become the top open-source model [8][9]
- The model outperformed Meta's previous Llama-3-405B by 149 points, a significant improvement [8]
- Across multiple benchmarks, Llama 4 Scout outperformed competitors such as Gemini 2.0 Flash-Lite and Mistral 3.1 [21][42]

Group 3: Multimodal Capabilities
- Llama 4 models are natively multimodal, letting users upload images and ask questions about them directly [30][41]
- The models are touted as best in class for multimodal applications, improving user interaction and experience [41][42]

Group 4: Cost Efficiency
- Llama 4 Maverick offers competitive pricing, with inference costs far lower than models such as GPT-4o, making it attractive for developers [46][49]
- The cost per million input and output tokens for Llama 4 Maverick ranges from $0.19 to $0.495, versus $4.38 for GPT-4o [49]
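The quoted per-million-token rates translate directly into per-request costs. A minimal sketch using the article's figures ($0.19–$0.495 for Maverick, $4.38 for the GPT-4-class comparison model); the request size `n` is a hypothetical value chosen for illustration, not from the article:

```python
# Rough per-request cost comparison from the per-million-token rates quoted
# in the article. Request size is hypothetical, for illustration only.
RATES = {                       # USD per 1M tokens
    "llama4-maverick-low": 0.19,
    "llama4-maverick-high": 0.495,
    "gpt4-comparison": 4.38,    # the GPT-4-class model the article compares against
}

def request_cost(rate_per_million, n_tokens):
    """Cost in USD for a request of n_tokens at the given per-1M-token rate."""
    return rate_per_million * n_tokens / 1_000_000

n = 10_000  # hypothetical tokens per request
for name, rate in RATES.items():
    print(f"{name}: ${request_cost(rate, n):.4f} per {n:,} tokens")
```

Even at the high end of its band, Maverick comes out roughly an order of magnitude cheaper per token than the comparison model under these quoted rates.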
Group 5: Training Innovations
- The Llama 4 series uses an MoE (Mixture of Experts) architecture, improving computational efficiency by activating only a subset of parameters during inference [56][60]
- Training involved over 30 trillion tokens, more than double that of Llama 3, spanning diverse data types including text, images, and videos [64][63]
- A new training technique called MetaP was developed to optimize model hyperparameters, yielding improved performance across a range of tasks [62][63]
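The "activating only a subset of parameters" idea behind MoE can be sketched as token-level top-k expert routing: a small router scores all experts, but only the k best-scoring expert FFNs actually run for each token. This is an illustrative toy, not Llama 4's actual configuration; the shapes, expert count, `top_k`, and ReLU FFNs are all assumptions:

```python
# Toy MoE layer: per-token top-k routing over E expert FFNs.
# Only top_k of n_experts run per token, so active parameters per token
# are a small fraction of total parameters -- the core MoE efficiency idea.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 64, 256, 8, 2  # illustrative sizes

# Each expert is an independent 2-layer ReLU FFN (weights only, biases omitted).
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(n_experts)
]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """x: (tokens, d_model) -> (tokens, d_model); each token uses top_k experts."""
    logits = x @ router                              # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())                  # softmax over selected experts only
        w /= w.sum()
        for weight, e in zip(w, top[t]):
            w1, w2 = experts[e]
            out[t] += weight * (np.maximum(x[t] @ w1, 0.0) @ w2)
    return out

y = moe_layer(rng.standard_normal((4, d_model)))
print(y.shape)  # one output row per token, same width as the input
```

With 8 experts and top-2 routing, each token touches only 2/8 of the expert weights; scaled up, this is how a model like Maverick can hold many experts in total while keeping its active parameter count at 17 billion per token.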