Small-Model Layer Counts Are Weirdly Mystical: 12/32/64 Layers Perform Well, 16/24/48 Perform Poorly
量子位 (QbitAI) · 2026-01-11 04:02

Core Insights
- For 70M-parameter small models, the specific architecture matters less than previously thought; the model's "shape" (its depth-to-width ratio) is the more critical factor [1][2].

Group 1: Model Architecture and Performance
- Among the tested layer counts, 32 layers is optimal, and 12 and 64 layers also perform well, while 16, 24, and 48 layers yield poor results [2][15]. (A parameter-budget sketch at the end of this summary shows how depth and width trade off at a fixed 70M budget.)
- The gap between "good" and "bad" layer configurations exceeds 6 percentage points: "good" configurations average roughly 38% accuracy versus roughly 32% for "bad" ones [15][16].
- The hidden dimension must be at least 512 for optimal performance, and the 32-layer configuration achieves the highest score, 38.50% [18][23].

Group 2: Comparative Analysis of Architectures
- A comparison of 12 architectures, including LLaMA3 and Qwen3, shows that modern designs perform similarly at the 70M scale, with average differences under 2% [25][26].
- The improvements in modern architectures were designed for models above roughly 700 million parameters and provide no measurable advantage at 70M [27].

Group 3: Diffusion Models vs. Autoregressive Models
- Diffusion models score slightly lower in average accuracy (31-32%) but run inference 3.8× faster and hallucinate less than autoregressive models [28][30].
- Adding a "Canon layer" improves factual accuracy by about 1% for autoregressive models and by over 2% for diffusion models, at minimal additional parameter cost [35][36]. (A hedged sketch of a Canon-style layer appears below.)

Group 4: New Model Development
- The Dhara-70M model combines the strengths of autoregressive and diffusion models; it is built on the LLaMA3-Canon architecture and converted using the WSD method [41][42]. (See the WSD schedule sketch below.)
- Dhara-70M's specifications: 71.34M parameters, 32 layers, hidden size 384; it is designed for high throughput and factual accuracy [44].

Group 5: Recommendations for Model Builders
- Builders of small language models should focus on the fundamental depth-width ratio rather than chasing the latest architectural trends, especially for applications that demand high-speed processing and factual accuracy [45].
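
To make the depth-width trade-off concrete, here is a minimal sketch that enumerates the article's layer counts and solves for the hidden width that lands near a 70M total budget. It assumes a standard LLaMA-style block (attention ≈ 4·d², SwiGLU MLP ≈ 8·d², so ≈ 12·d² per layer) plus a tied embedding of vocab·d; the vocab size of 32,000 and the rounding rule are illustrative assumptions, not figures from the article.

```python
# Sketch: which hidden width pairs with each layer count at a ~70M budget?
# Assumes a LLaMA-style block (~12 * d^2 params/layer: 4*d^2 attention
# + ~8*d^2 SwiGLU MLP) and a tied embedding of vocab * d.
# vocab=32000 is an illustrative assumption, not from the article.

import math

BUDGET = 70e6
VOCAB = 32_000

def params(layers: int, d: int) -> float:
    """Approximate total parameter count for a given depth and width."""
    return 12 * layers * d**2 + VOCAB * d

def width_for_budget(layers: int) -> int:
    """Solve 12*L*d^2 + V*d = BUDGET for d (positive root of the quadratic)."""
    a, b, c = 12 * layers, VOCAB, -BUDGET
    d = (-b + math.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return round(d / 64) * 64  # snap to a hardware-friendly multiple of 64

for L in (12, 16, 24, 32, 48, 64):
    d = width_for_budget(L)
    print(f"{L:>2} layers -> hidden {d:>4}  (~{params(L, d) / 1e6:.1f}M params)")
```

Under these assumptions, 32 layers lands at a hidden width of about 384, which is consistent with the Dhara-70M specification (32 layers, hidden size 384, 71.34M parameters) reported above.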
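The summary reports the Canon layer's accuracy gains but not its internals. A common formulation of such lightweight add-on layers is a causal depthwise convolution with a residual connection, which is what this hedged PyTorch sketch assumes; the kernel size and the name `CanonLayer` are illustrative choices, not specifications from the article.

```python
# Hedged sketch of a Canon-style layer: a causal depthwise 1D convolution
# with a residual connection. Kernel size and placement are assumptions;
# the article only reports the accuracy gains, not the exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CanonLayer(nn.Module):
    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        # groups=d_model makes the conv depthwise: one filter per channel,
        # so the added parameter cost is only d_model * (kernel_size + 1).
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = x.transpose(1, 2)                    # (batch, d_model, seq_len)
        h = F.pad(h, (self.kernel_size - 1, 0))  # left-pad: strictly causal
        h = self.conv(h).transpose(1, 2)         # back to (batch, seq, d)
        return x + h                             # residual: easy to retrofit

x = torch.randn(2, 16, 384)      # two 16-token sequences at width 384
print(CanonLayer(384)(x).shape)  # torch.Size([2, 16, 384])
```

With this formulation the added cost at width 384 is under 2K parameters per layer, which matches the "minimal additional parameter cost" claim in spirit.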
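The summary says Dhara-70M was "converted using the WSD method" without elaborating. Assuming WSD refers to the warmup-stable-decay learning-rate schedule often used in pretraining and conversion recipes, a minimal sketch looks like the following; the phase fractions, peak rate, and cosine tail are illustrative assumptions, not values from the article.

```python
# Hedged sketch of a warmup-stable-decay (WSD) learning-rate schedule:
# linear warmup, a long flat plateau, then a decay tail. Phase fractions,
# peak rate, and the cosine tail are illustrative assumptions.

import math

def wsd_lr(step: int, total: int, peak: float = 3e-4,
           warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    warmup = int(total * warmup_frac)
    decay_start = int(total * (1 - decay_frac))
    if step < warmup:                      # warmup: ramp 0 -> peak
        return peak * step / max(1, warmup)
    if step < decay_start:                 # stable: hold at peak
        return peak
    # decay: cosine from peak toward zero over the final phase
    t = (step - decay_start) / max(1, total - decay_start)
    return peak * 0.5 * (1 + math.cos(math.pi * t))

total = 100_000
for s in (0, 500, 1_000, 50_000, 90_000, 95_000, 100_000):
    print(f"step {s:>7}: lr = {wsd_lr(s, total):.2e}")
```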
