Transformer in Danger! Google Releases MoR Architecture: Memory Halved, Inference Speed Doubled
量子位 (QbitAI) · 2025-07-17 09:03
Core Viewpoint
- Google has introduced a new underlying architecture called Mixture-of-Recursions (MoR), which doubles inference speed while halving KV cache memory usage, and allows dynamic resource allocation across different tasks within a single framework [1][2][3].

Group 1: MoR Innovations
- MoR integrates unified parameter sharing and adaptive recursion depth, addressing the high computational and memory demands of traditional Transformers while maintaining model performance [7][9].
- The architecture employs a recursive Transformer that divides the model into recursion blocks reusing a shared pool of parameters, which reduces the number of unique parameters and improves distributed-training efficiency (a code sketch follows this digest) [10][13].
- MoR uses a dynamic routing mechanism to assign a different recursion depth to each token, concentrating computation on complex tokens, and pairs this with KV caching strategies that improve memory efficiency (see the routing and KV-sharing sketches below) [15][19].

Group 2: Performance Comparison
- Experiments comparing MoR with vanilla Transformers and recursive baselines across parameter scales from 135M to 1.7B show that MoR uses nearly 50% fewer parameters while achieving lower validation loss and a higher few-shot accuracy of 43.1% [16][19].
- When training on a fixed budget of 20B tokens, MoR reduces training FLOPs by 25%, training time by 19%, and peak memory usage by 25% [21].
- The routing-strategy analysis indicates that expert-choice routing outperforms token-choice routing, underscoring the impact of routing granularity on performance [22].

Group 3: Architectural Evolution
- Google has a history of rethinking underlying architectures, aiming to reshape computational paradigms through innovations such as the Mixture of Experts (MoE) model, which allows efficient training of large models by activating only a subset of expert networks per input (see the MoE sketch below) [27][30].
- The introduction of MoR is seen as a potential game-changer in the AI landscape, with expectations that it may surpass the Transformer in the future [32].
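
To make the recursion and routing ideas from Group 1 concrete, here is a minimal PyTorch sketch of a Transformer block whose weights are reused at every recursion step, with an expert-choice-style router that keeps only the tokens judged to need further refinement. This is an illustration under stated assumptions, not the paper's implementation; all names (`SharedRecursiveBlock`, `max_recursions`, the halve-the-active-set capacity rule) are hypothetical.

```python
import torch
import torch.nn as nn

class SharedRecursiveBlock(nn.Module):
    """Toy recursive Transformer block: one shared parameter pool, reused at every depth."""

    def __init__(self, d_model: int, n_heads: int, max_recursions: int):
        super().__init__()
        self.max_recursions = max_recursions
        # Shared parameter pool: the same attention and MLP weights are
        # applied at every recursion depth instead of stacking fresh layers.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        # Router scores decide which tokens keep recursing (expert-choice
        # style: each depth selects the tokens it will process).
        self.router = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq, _ = x.shape
        # active[b, t] is True while token t is still being refined.
        active = torch.ones(batch, seq, dtype=torch.bool, device=x.device)
        for step in range(self.max_recursions):
            h = self.norm1(x)
            h, _ = self.attn(h, h, h)
            h = x + h
            h = h + self.mlp(self.norm2(h))
            # Update only the active tokens; finished tokens pass through
            # unchanged, so compute concentrates on the "hard" tokens.
            x = torch.where(active.unsqueeze(-1), h, x)
            # Toy capacity rule (hypothetical): halve the active set at each
            # depth, keeping the tokens the router scores highest.
            scores = self.router(x).squeeze(-1).masked_fill(~active, float("-inf"))
            k = max(1, seq // (2 ** (step + 1)))
            keep = torch.zeros_like(active)
            keep.scatter_(1, scores.topk(k, dim=-1).indices, True)
            active &= keep
        return x
```

In the actual architecture the router is trained jointly and capacities follow the paper's routing scheme; this toy version only shows where per-token recursion depth enters the forward pass.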
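
The KV-memory claim can be illustrated the same way. Below is a toy sketch of one plausible caching strategy: keys and values computed on the first recursion are shared with all later recursions, so KV storage stays constant as depth grows. `KVSharingAttention` and its interface are assumptions for illustration; the paper's actual caching scheme may differ.

```python
import torch
import torch.nn as nn

class KVSharingAttention(nn.Module):
    """Toy attention where later recursions reuse the first pass's keys/values."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor, n_recursions: int) -> torch.Tensor:
        kv = x  # cache the K/V inputs once, from the first recursion only
        for _ in range(n_recursions):
            # Queries are refreshed each step; keys/values come from the
            # cached first pass, so KV memory is paid once rather than
            # once per recursion depth.
            h, _ = self.attn(x, kv, kv)
            x = x + h
        return x
```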
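
For the MoE lineage mentioned in Group 3, here is a minimal sketch of top-k gating, the mechanism that lets only a subset of expert networks run per token. `TopKMoE`, the expert count, and `k=2` are illustrative choices, not Google's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy MoE layer: a gate routes each token to its top-k expert MLPs."""

    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model), i.e. a flattened batch of token embeddings.
        logits = self.gate(x)                       # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)  # (tokens, k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    # Only the selected experts run on each token, so only
                    # a fraction of total parameters is active per token.
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out
```

A production MoE layer adds load-balancing losses and batched expert dispatch; the double loop here is written for readability, not speed.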