LLaMA3
Layer counts in small models are weirdly arcane: 12/32/64 layers work well, 16/24/48 layers work badly
量子位· 2026-01-11 04:02
Core Insights
- The article reports striking findings for 70M-parameter small models: which modern architecture you pick matters less than previously thought, while the model's "shape" (its depth-width ratio) is far more critical [1][2].

Group 1: Model Architecture and Performance
- The best layer counts for small models are 12, 32, and 64 (with 32 optimal), while 16-, 24-, and 48-layer configurations perform poorly [2][15].
- The gap between "good" and "bad" layer counts exceeds 6 percentage points: "good" configurations average roughly 38% accuracy versus roughly 32% for "bad" ones [15][16].
- The hidden dimension must be at least 512 for optimal performance, and the 32-layer configuration achieves the top score of 38.50% [18][23].

Group 2: Comparative Analysis of Architectures
- Across 12 architectures, including LLaMA3 and Qwen3, modern designs perform similarly at the 70M scale, with average differences under 2% [25][26].
- The improvements in modern architectures are designed for models above 700 million parameters and give no measurable advantage at 70M [27].

Group 3: Diffusion Models vs. Autoregressive Models
- Diffusion models score slightly lower on average accuracy (31-32%) but infer 3.8× faster and hallucinate less than autoregressive models [28][30].
- Adding a "Canon layer" raises factual accuracy by about 1% for autoregressive models and over 2% for diffusion models, at minimal parameter cost [35][36].

Group 4: New Model Development
- The article introduces Dhara-70M, which combines the strengths of autoregressive and diffusion models: it is built on the LLaMA3-Canon architecture and converted with the WSD method [41][42].
- Dhara-70M has 71.34M parameters, 32 layers, and a hidden size of 384, targeting high throughput and factual accuracy [44].

Group 5: Recommendations for Model Builders
- The article advises small-language-model builders to get the fundamental depth-width ratio right rather than chase the latest architectural trends, especially for applications that need high-speed processing and factual accuracy [45] (a rough parameter-budget sketch of this trade-off follows below).
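The depth-width trade-off these findings turn on is easy to make concrete. Below is a minimal sketch (my own illustration, not the article's code) that uses simplified LLaMA-style parameter formulas, with an assumed 32K vocabulary, tied embeddings, and a SwiGLU FFN, to find the widest hidden size that fits a ~70M budget at each layer count tested above:

```python
# Illustrative sketch of the depth-width trade-off at a fixed ~70M budget.
# The formulas are simplified assumptions (tied embeddings, SwiGLU FFN,
# no norms/biases), not the article's exact parameter accounting.

VOCAB = 32_000          # assumed vocabulary size
FFN_MULT = 8 / 3        # SwiGLU expansion ratio used in LLaMA-style models

def approx_params(n_layers: int, hidden: int) -> int:
    """Rough parameter count: embedding + attention + SwiGLU FFN."""
    embed = VOCAB * hidden                      # tied input/output embedding
    attn = 4 * hidden * hidden                  # Q, K, V, O projections
    ffn = 3 * hidden * int(FFN_MULT * hidden)   # gate, up, down projections
    return embed + n_layers * (attn + ffn)

for n_layers in (12, 16, 24, 32, 48, 64):
    # widest hidden size (multiple of 64) that stays within the budget
    hidden = 64
    while approx_params(n_layers, hidden + 64) <= 70_000_000:
        hidden += 64
    print(f"{n_layers:2d} layers -> hidden {hidden:4d} "
          f"(~{approx_params(n_layers, hidden) / 1e6:.1f}M params)")
```

Under these rough formulas, 32 layers lands near a hidden size of 384, consistent with the Dhara-70M specification quoted above; at the same budget, fewer layers simply buy a wider model.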
Open-source diffusion LLMs outrun autoregressive models for the first time! Shanghai Jiao Tong University and UCSD team up to release D2F, with 2.5× the throughput of LLaMA3
机器之心· 2025-08-18 03:22
Core Insights
- The article introduces Discrete Diffusion Forcing (D2F), which greatly accelerates inference for open-source diffusion large language models (dLLMs), surpassing autoregressive (AR) models with up to 2.5× higher throughput on benchmarks such as GSM8K [2][6][22].

Group 1: Challenges and Solutions
- Existing dLLMs lack a complete KV cache mechanism and leave their parallel decoding potential untapped, so their inference is slower than that of AR models [2][8].
- D2F addresses both problems with a hybrid autoregressive-diffusion paradigm that jointly optimizes model architecture, training methods, and inference strategy [11][12].

Group 2: D2F Design Features
- D2F adopts block-level causal attention to make the model compatible with KV caching, so the KV states of earlier blocks can be reused and redundant computation cut [12][15] (see the mask sketch after this summary).
- Asymmetric distillation with structured noise scheduling efficiently transfers knowledge from a pre-trained teacher model to the D2F student while strengthening its parallel decoding ability [18].

Group 3: Inference Mechanism
- D2F introduces a pipelined parallel decoding algorithm that maintains a dynamic decoding window in which blocks move from a semi-activated to a fully-activated state, trading off throughput against quality [20][21] (a decoding-loop sketch also follows below).
- The approach reaches up to a 50× speedup over the original dLLMs while maintaining average performance [22].

Group 4: Performance Metrics
- D2F shows a superior performance-efficiency trade-off and adapts to different scenarios by tuning decoding parameters, exceeding 4× the throughput of AR models on some tasks [25].
- In comparative tests, D2F-LLaDA reaches 52.5 tokens per second, a 7.3× gain over baseline methods [23].

Group 5: Future Directions
- D2F's success points to further work on parallel decoding technologies, with potential future developments including real-time serving and hybrid parallel processing [28].
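Block-level causal attention is the design detail that makes KV caching possible in a diffusion model: tokens attend bidirectionally within their own block but only causally to earlier blocks, so a committed block's KV states never change and can be cached exactly as in an AR model. A minimal mask-construction sketch (my illustration with an assumed block size and tensor layout, not D2F's released code):

```python
import torch

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean mask, True where attention is allowed: bidirectional
    inside a block, causal across blocks. Earlier blocks never see
    later ones, so their KV states are fixed and reusable."""
    block_ids = torch.arange(seq_len) // block_size   # block index per token
    # token i may attend to token j iff j's block is not after i's block
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

print(block_causal_mask(seq_len=8, block_size=4).int())
# Tokens 0-3 (block 0) attend only within block 0; tokens 4-7 (block 1)
# attend to all of block 0 plus, bidirectionally, to each other.
```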
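The pipelined parallel decoding loop can be sketched at the same level of abstraction: a sliding window admits new blocks in a cheap, semi-activated draft state while the front block is refined until fully activated and committed, at which point its KV states join the cache. The window size, step threshold, and `denoise` placeholder below are assumptions used to illustrate the mechanism, not D2F's actual algorithm or hyperparameters:

```python
from collections import deque

FULL_STEPS = 6   # assumed denoising passes before the front block commits
WINDOW = 4       # assumed max number of blocks refined concurrently

def denoise(block, context):
    """Placeholder for one parallel denoising pass over a block,
    conditioned on committed context (whose KV states are cached)."""
    return block  # a real model would update the block's token predictions

def pipelined_decode(num_blocks: int):
    committed = []      # fully activated, finished blocks (KV cached)
    window = deque()    # (block, passes_done): trailing blocks are drafts
    admitted = 0
    while len(committed) < num_blocks:
        # admit a fresh semi-activated block while the window has room
        if len(window) < WINDOW and admitted < num_blocks:
            window.append(({"tokens": f"block{admitted}"}, 0))
            admitted += 1
        # one refinement pass over every block in the window, "in parallel"
        window = deque((denoise(b, committed), n + 1) for b, n in window)
        # the front block becomes fully activated and is committed
        if window and window[0][1] >= FULL_STEPS:
            committed.append(window.popleft()[0])
    return committed

print(len(pipelined_decode(6)))  # -> 6 blocks decoded
```

Because later blocks accumulate refinement passes while earlier ones finish, each committed block costs far fewer wall-clock iterations than running FULL_STEPS passes per block sequentially, which is where the throughput gain comes from.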