Workflow
"Poised to become a Transformer killer": Google DeepMind's new MoR architecture achieves 2x inference speed
机器之心·2025-07-17 05:03

Core Insights

- The article discusses the challenges of deploying large language models (LLMs) due to high computational and memory costs, especially as model parameters scale to hundreds of billions. This has hindered their practical application and adoption [1][2]
- Researchers are exploring techniques to improve parameter efficiency through weight sharing and to allocate computation dynamically based on input complexity [1][2]
- Google DeepMind has introduced a new LLM architecture called Mixture-of-Recursions (MoR), which is seen as a potential successor to the Transformer architecture [1][2]

Summary by Sections

MoR Framework

- MoR integrates parameter sharing and adaptive computation into a unified framework, allowing dynamic token-level routing within a parameter-efficient recursive Transformer [2][4]
- The architecture aims to deliver "large-model quality without large-model cost," optimizing both performance and resource utilization [2][6]

Core Architecture and Methods

- MoR is built on recursive Transformers, sharing weights across multiple layers to improve parameter efficiency [12]
- It employs several parameter-sharing modes and dynamic routing mechanisms to minimize redundant computation and optimize memory access [12][15]
- The dynamic routing system assigns each token its own recursion depth, creating a funnel effect in which complex tokens receive deeper processing (a minimal code sketch of this routing appears after the summary) [15][17]

Experimental Results

- MoR outperforms baseline models in validation loss and few-shot accuracy while using nearly 50% fewer parameters [19][21]
- The model demonstrates a 19% reduction in training time and a 25% decrease in peak memory usage compared to baseline models [22]
- MoR's performance depends on the routing and caching strategies used, with "expert-choice routing" yielding better accuracy than "token-choice routing" [23][24]

Scalability and Efficiency

- MoR is scalable and consistently outperforms recursive baseline models across various parameter sizes and computational budgets [27][28]
- The architecture achieves superior validation performance with significantly fewer parameters, making it suitable for pre-training and large-scale deployment [28]

Inference Throughput

- MoR increases inference throughput by allowing more tokens to exit the recursive process early, yielding a significant speedup [30][31]
- The combination of depth-wise batching and early-exit mechanisms improves MoR's practical deployment capabilities [31][33]

Conclusion

- MoR establishes a new paradigm for efficient LLM architectures by demonstrating the synergy between parameter efficiency and adaptive computation, addressing scalability challenges in language modeling [37]
- The framework's ability to adaptively allocate "thinking depth" to each token aligns with emerging research on reasoning and internal thought processes in language models [38]
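To make the routing idea above concrete, here is a minimal, self-contained PyTorch sketch of expert-choice recursion with early exit. It is written under stated assumptions rather than from the paper's code: the names (`SharedRecursiveBlock`, `mor_forward`), the single-linear-layer router, the shrinking keep ratio per depth, and the full-sequence recompute are all illustrative choices; MoR's actual implementation, including its KV-caching strategies and depth-wise batching, is not reproduced here.

```python
# Minimal sketch (not the official MoR implementation) of expert-choice recursion
# routing with early exit: one shared Transformer layer is applied repeatedly, and
# at each recursion step a small router keeps only the highest-scoring tokens for
# further refinement, so "hard" tokens get more depth and the rest exit early.
import torch
import torch.nn as nn


class SharedRecursiveBlock(nn.Module):
    """A single Transformer encoder layer whose weights are reused at every recursion depth."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # Hypothetical router: one linear head scoring how much more depth a token needs.
        self.router = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layer(x)


def mor_forward(x: torch.Tensor, block: SharedRecursiveBlock,
                max_depth: int = 3, keep_ratio: float = 0.5) -> torch.Tensor:
    """Apply the shared block up to max_depth times; at each step only the top
    tokens (expert-choice style) are refined further, the rest keep their state."""
    batch, seq_len, _ = x.shape
    hidden = x
    active = torch.ones(batch, seq_len, dtype=torch.bool)    # every token starts active
    for depth in range(max_depth):
        scores = block.router(hidden).squeeze(-1)             # (batch, seq)
        scores = scores.masked_fill(~active, float("-inf"))   # exited tokens cannot be re-chosen
        k = max(1, int(seq_len * keep_ratio ** (depth + 1)))  # funnel: fewer tokens each step
        topk = scores.topk(k, dim=-1).indices
        chosen = torch.zeros(batch, seq_len, dtype=torch.bool)
        chosen[torch.arange(batch).unsqueeze(1), topk] = True
        chosen &= active
        # For readability the whole sequence is recomputed and only chosen tokens are
        # updated; a real implementation would gather just the active tokens to save compute.
        refined = block(hidden)
        hidden = torch.where(chosen.unsqueeze(-1), refined, hidden)
        active = chosen
    return hidden


if __name__ == "__main__":
    torch.manual_seed(0)
    block = SharedRecursiveBlock()
    tokens = torch.randn(2, 16, 256)            # (batch, seq_len, d_model)
    print(mor_forward(tokens, block).shape)     # torch.Size([2, 16, 256])
```

The funnel shows up in the shrinking `active` set: most tokens stop after one or two recursions, and only the tokens the router deems hardest keep being refined, which is also what enables the early-exit and depth-wise-batching throughput gains described under Inference Throughput above.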