SiameseNorm
Tsinghua joins forces with Qwen to reshape the normalization paradigm, returning the Transformer to truly "deep" learning
机器之心· 2026-02-10 11:03
Core Insights
- The article introduces SiameseNorm, a novel architecture that reconciles the trade-off between Pre-Norm and Post-Norm in Transformer models, improving both training stability and representation capacity [4][34].

Group 1: Background and Context
- The term "Siamese twins", popularized by the famous 19th-century conjoined brothers from Siam, was adopted in neural networks for Siamese Networks, whose weight-sharing branches measure the similarity of inputs [2].
- In modern AI, Pre-Norm and Post-Norm have become the two dominant normalization paradigms for stabilizing large-model training [2][3].

Group 2: Challenges with Existing Norms
- Pre-Norm suffers from "depth failure": the parameters of deep layers contribute little to the model's representational capability, limiting its "effective depth" [3].
- Post-Norm has higher representational potential but introduces significant training instability, making it hard to use in modern Transformer pre-training [3][10].

Group 3: SiameseNorm Architecture
- SiameseNorm uses a dual-stream architecture that decouples optimization dynamics, letting Pre-Norm and Post-Norm characteristics coexist without compromising either [7][19].
- Each residual block receives combined gradients from both paradigms, enabling stable training at high learning rates without additional computational cost [7][20].

Group 4: Experimental Validation
- On a 1.3-billion-parameter model, SiameseNorm achieved a perplexity (PPL) of 10.57, outperforming both the Pre-Norm and Post-Norm baselines [22][25].
- Notably, on arithmetic tasks, accuracy rose from 28.1% with Pre-Norm to 39.6% with SiameseNorm, a 40.9% relative improvement, showcasing its ability to restore effective depth and strengthen reasoning [24].
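The article does not reproduce the paper's exact equations, so the sketch below only illustrates the general idea: a Pre-Norm block normalizes before the sublayer and adds an un-normalized residual, a Post-Norm block normalizes after the residual sum, and a dual-stream block in the spirit of SiameseNorm runs both in parallel. The function names, the shared `sublayer`, and the weight-sharing choice are assumptions for illustration, not the authors' definitions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard LayerNorm over the last axis (no learned scale/shift here).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, sublayer):
    # Pre-Norm: normalize the input, then add the residual.
    # The residual path is an identity, which aids stability but lets deep
    # layers' contributions shrink relative to the growing residual stream.
    return x + sublayer(layer_norm(x))

def post_norm_block(x, sublayer):
    # Post-Norm: add the residual first, then normalize the sum, so every
    # layer's output is re-scaled — more expressive, but harder to train.
    return layer_norm(x + sublayer(x))

def siamese_block(x_pre, x_post, sublayer):
    # Dual-stream sketch (assumption): both streams share the same sublayer
    # weights, each applying its own normalization scheme, so the shared
    # weights receive gradients from both paradigms.
    return pre_norm_block(x_pre, sublayer), post_norm_block(x_post, sublayer)
```

Stacking `siamese_block` keeps a stable Pre-Norm stream alongside a re-normalized Post-Norm stream; how the paper merges the two gradients and outputs is not specified in this summary.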
Group 5: Mechanism Insights
- Analysis shows that both streams in SiameseNorm maintain significant weight contributions, indicating that features from the Pre-Norm and Post-Norm paths are both used effectively [27].
- The Post-Norm stream plays the dominant role in final predictions, suggesting that it primarily enhances feature expression once training has stabilized [31][32].

Group 6: Conclusion
- SiameseNorm elegantly integrates the robustness of Pre-Norm with the expressive potential of Post-Norm, offering developers a clear path toward higher learning rates and deeper Transformer networks [34].
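The kind of contribution analysis described in Group 5 can be sketched as below. The convex-combination fusion, the scalar `alpha`, and the norm-based diagnostic are all hypothetical stand-ins for whatever fusion and measurement the paper actually uses; they only show how one could compare how much each stream contributes to the fused representation.

```python
import numpy as np

def combine_streams(h_pre, h_post, alpha):
    # Hypothetical fusion: a convex combination of the two streams before
    # the output head; `alpha` stands in for a learned mixing weight.
    return alpha * h_pre + (1.0 - alpha) * h_post

def stream_contributions(h_pre, h_post, alpha):
    # Diagnostic in the spirit of the article's analysis: compare the
    # magnitude each weighted stream contributes to the fused output.
    c_pre = float(np.linalg.norm(alpha * h_pre))
    c_post = float(np.linalg.norm((1.0 - alpha) * h_post))
    return c_pre, c_post
```

With a Post-Norm-leaning mixing weight (small `alpha`), `stream_contributions` reports a larger Post-Norm share, mirroring the finding that the Post-Norm stream dominates final predictions.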