Large Model Architecture Innovation

Alibaba made a big late-night move: costs plunge 90%
36Kr · 2025-09-12 02:45
Core Insights
- Alibaba's Tongyi Lab has officially released the next-generation foundational model architecture Qwen3-Next, including the Qwen3-Next-80B-A3B-Base model with 80 billion total parameters, of which only 3 billion are activated per token [1][21]
- The new architecture is designed to raise performance while sharply reducing training cost, delivering more than 10x the inference throughput of the previous Qwen3-32B model in long-context scenarios [1][8][21]

Model Performance
- The instruction model of Qwen3-Next-80B-A3B performs comparably to the much larger Qwen3-235B-A22B-Instruct-2507 model, while the thinking model outperforms Google's closed-source Gemini-2.5-Flash-Thinking [2][12]
- Across benchmark tests, Qwen3-Next-80B-A3B-Base performs on par with Qwen3-32B-Base, with a training cost of less than 10% of Qwen3-32B-Base's [6][21]

Architectural Innovations
- Qwen3-Next introduces several architectural innovations, including a hybrid attention mechanism, a high-sparsity MoE structure, and a multi-token prediction (MTP) mechanism, which together improve inference efficiency and model stability [5][16][19]
- The hybrid attention mechanism combines Gated DeltaNet with Gated Attention to improve long-sequence context modeling, and the MoE layers reach roughly a 1:50 activation ratio, significantly reducing FLOPs per token (illustrative sketches of the layer interleaving and the sparse routing follow this summary) [18][19]

Training Efficiency
- The model is trained on a 15-trillion-token subset of the Qwen3 36T pre-training corpus and requires only 9.3% of the GPU resources of Qwen3-32B, while delivering superior performance [16][21]
- The MTP mechanism optimizes multi-step inference, raising the acceptance rate of speculative decoding in practical applications [19]

Future Developments
- Alibaba plans to continue optimizing the Qwen3-Next architecture and is developing Qwen3.5, alongside launching models across different domains, thereby extending its technical influence in the open-source community [21]
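The hybrid attention design interleaves linear-time layers (Gated DeltaNet in Qwen3-Next) with standard softmax attention layers, so most layers avoid the quadratic cost while a few retain exact global attention. Below is a rough sketch of that layer-interleaving idea, using a plain, non-causal kernelized linear-attention block as a stand-in for Gated DeltaNet; the 3:1 ratio, module names, and dimensions are illustrative assumptions, not the model's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """O(n) stand-in for a Gated DeltaNet block (not the real recurrence; non-causal for brevity)."""
    def __init__(self, d=256):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d)

    def forward(self, x):                        # x: (batch, seq, d)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1        # positive feature map
        kv = torch.einsum("bsd,bse->bde", k, v)  # summarize the sequence in O(n)
        z = 1.0 / (q @ k.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6)
        return torch.einsum("bsd,bde->bse", q, kv) * z

class FullAttention(nn.Module):
    """Standard softmax attention layer (O(n^2))."""
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):
        return self.attn(x, x, x, need_weights=False)[0]

def hybrid_stack(depth=8, linear_per_full=3, d=256):
    """Interleave linear-time layers with occasional full-attention layers."""
    layers = []
    for i in range(depth):
        if (i + 1) % (linear_per_full + 1) == 0:
            layers.append(FullAttention(d))
        else:
            layers.append(LinearAttention(d))
    return nn.Sequential(*layers)

x = torch.randn(2, 128, 256)
print(hybrid_stack()(x).shape)   # torch.Size([2, 128, 256])
```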
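The 1:50 activation ratio means that for each token only a small fraction of the expert parameters in an MoE layer actually run. A minimal sketch of a top-k routed MoE feed-forward layer follows; the expert count, dimensions, and top_k here are toy values for illustration, not Qwen3-Next's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy top-k routed MoE feed-forward layer (illustrative only)."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=64, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        logits = self.router(x)                 # (tokens, n_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run, so per-token FLOPs scale with top_k,
        # not with the total number of experts.
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(8, 512)
print(SparseMoELayer()(tokens).shape)   # torch.Size([8, 512])
```

With 64 experts and top_k = 2, only about 1/32 of the expert parameters are touched per token; pushing the ratio toward 1:50 is a matter of adding experts while keeping top_k small, which is the lever the article attributes to Qwen3-Next's FLOPs reduction.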
Large Model Special Topic: Research Report on Large Model Architecture Innovation
Sohu Caijing · 2025-06-06 11:38
Core Insights
- The report focuses on innovations in large model architectures, particularly the limitations of the Transformer architecture and the industry's paths toward improvement [1][2][7]
- As model sizes grow, the quadratic computational complexity of Transformer attention (O(n²)) creates power-consumption and efficiency bottlenecks on long sequences, driving demand for innovative solutions (a minimal sparse-attention sketch follows this summary) [1][2][15]
- The industry is currently exploring two main paths to architectural breakthroughs: improving the Transformer architecture and exploring non-Transformer architectures [1][2][7]

Transformer Architecture Improvements
- Improvements to the Transformer architecture focus on optimizing the attention mechanism, the feed-forward network (FFN) layers, and the normalization layers [1][2][18]
- Techniques such as sparse attention and dynamic attention are being developed to raise computational efficiency, while Mixture of Experts (MoE) improves the sparse-connection efficiency of FFN layers [1][2][18]
- LongRoPE and related techniques extend positional encoding to better model long sequences [1][2][18]

Non-Transformer Architecture Exploration
- Non-Transformer architectures include new RNN variants (e.g., RWKV, Mamba) and CNN variants (e.g., Hyena Hierarchy), as well as other innovative architectures such as RetNet and LFM [1][2][7]
- RWKV optimizes state evolution through a generalized Delta Rule, while Mamba leverages state-space models to improve training efficiency (a toy delta-rule recurrence is sketched after this summary) [1][2][7]
- RetNet combines state-space ideas with multi-head attention to enable parallel computation [1][2][7]

Industry Trends and Future Directions
- The industry is moving toward hybrid architectures that combine linear Transformers with non-Transformer components, balancing performance and efficiency [2][7]
- The current phase is characterized by the traditional Transformer paradigm reaching its peak and an impending wave of architectural innovation, with significant focus on new RNN/CNN theoretical breakthroughs and practical engineering optimizations [2][7]
- Companies such as ByteDance and Alibaba are accelerating investment in hybrid architectures, driving large models toward higher efficiency and lower energy consumption [2][7]
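Sparse attention reduces the quadratic cost by restricting each query to a local window of keys (optionally plus a few global tokens). A minimal sliding-window attention sketch is shown below; the window size, tensor shapes, and the use of a dense mask are illustrative assumptions, since a production kernel would compute only the banded entries.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window=64):
    """Each query attends only to the `window` most recent keys (causal).
    With a banded kernel this scales as seq_len * window instead of seq_len^2;
    here a boolean mask just illustrates the sparsity pattern."""
    seq = q.shape[-2]
    pos = torch.arange(seq)
    dist = pos[:, None] - pos[None, :]        # distance from query to key
    mask = (dist >= 0) & (dist < window)      # True = allowed to attend
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

q = k = v = torch.randn(1, 4, 512, 64)        # (batch, heads, seq, head_dim)
print(sliding_window_attention(q, k, v).shape)   # torch.Size([1, 4, 512, 64])
```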
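The recurrent alternatives (RWKV, Mamba, RetNet) replace attention with a fixed-size state that is updated once per token, so per-step inference cost does not grow with context length. The sketch below shows a toy delta-rule state update in that spirit; it is not the exact formula of any of these architectures, and the fixed learning rate `beta` and the shapes are assumptions for illustration.

```python
import torch

def delta_rule_scan(k, v, beta=0.5):
    """Toy delta-rule memory: S_t = S_{t-1}(I - beta * k k^T) + beta * v k^T.
    The state S is corrected toward storing the (k -> v) association at each
    step; cost per token is O(d^2), independent of sequence length."""
    seq, d = k.shape
    S = torch.zeros(d, d)                 # fixed-size associative memory
    outs = []
    for t in range(seq):
        kt, vt = k[t], v[t]
        pred = S @ kt                     # memory's current guess for v given k
        S = S + beta * torch.outer(vt - pred, kt)   # delta-rule correction
        outs.append(S @ kt)               # read out after the update
    return torch.stack(outs)

k = torch.randn(16, 8)
v = torch.randn(16, 8)
print(delta_rule_scan(k, v).shape)   # torch.Size([16, 8])
```

The sequential loop makes the constant-memory property obvious; the practical appeal of Mamba- and DeltaNet-style designs is that equivalent updates can also be computed in parallel over the sequence during training.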