ICML 2025 | Breaking the Residual Connection Bottleneck: Caiyun Technology (彩云科技) & BUPT Propose MUDDFormer, Taking the Transformer a Step Further!
机器之心 · 2025-06-27 08:06

Core Viewpoint
- The article introduces Multiway Dynamic Dense (MUDD) connections as an effective upgrade to residual connections in Transformers, significantly improving the efficiency of cross-layer information transfer in deep models [1][4].

Background
- Residual connections, introduced by Kaiming He et al. in ResNet, have become foundational to deep learning and Transformer LLMs, yet they still limit how efficiently information flows across layers [1][7].
- MUDD connections build cross-layer connections dynamically from the current hidden state, mitigating representation collapse and the information overload that accumulates in a shared residual stream [7][8]; a minimal sketch of this mechanism follows the conclusion below.

Model Architecture
- The MUDDFormer architecture establishes independent dynamic connections for each information stream (Q, K, V, R), letting the model gather the most relevant information from previous layers for each stream separately [10][13]; see the second sketch after the conclusion.
- Because the connection weights are predicted per token from the current context, the model adaptively decides how much information to draw from each earlier layer [11][13].

Experimental Evaluation
- MUDDPythia, a 2.8-billion-parameter model, matches the performance of much larger models (6.9B and 12B parameters) while adding only 0.23% more parameters and 0.4% more computation [4][18].
- MUDDFormer outperforms baselines such as Transformer++ across model sizes, delivering substantial gains in compute efficiency [15][17].

Downstream Task Assessment
- On downstream tasks, MUDDPythia achieves higher 0-shot and 5-shot accuracy than same-size Pythia models, indicating stronger in-context learning [18][20].
- Matching Pythia-6.9B corresponds to roughly a 2.4x gain in effective model capacity, and matching Pythia-12B in 5-shot evaluation to roughly a 4.2x gain [18][20].

Conclusion
- MUDDFormer upgrades residual connections by establishing independent, dynamic cross-layer connections for each information stream, strengthening cross-layer interaction and in-context learning in Transformers [25].
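As a rough illustration of the dynamic dense connections described above, here is a minimal PyTorch sketch. All names (DynamicDenseAggregate, to_weights, and so on) are hypothetical, not the authors' released code; the sketch only shows the core idea of predicting per-token mixing weights over the outputs of all previous layers from the current hidden state.

```python
# Minimal sketch of dynamic dense cross-layer aggregation, assuming PyTorch.
# Hypothetical names; not the MUDDFormer reference implementation.
import torch
import torch.nn as nn

class DynamicDenseAggregate(nn.Module):
    """Mixes the outputs of all previous layers with per-token weights.

    The weights are predicted from the current hidden state, so each token
    decides how much to read from each earlier layer.
    """
    def __init__(self, dim: int, num_sources: int):
        super().__init__()
        # Maps a hidden vector to one mixing weight per source layer.
        self.to_weights = nn.Linear(dim, num_sources)

    def forward(self, hiddens: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # hiddens: (num_sources, batch, seq, dim) -- embeddings + layer outputs
        # x:       (batch, seq, dim)              -- current hidden state
        w = self.to_weights(x)                # (batch, seq, num_sources)
        w = w.permute(2, 0, 1).unsqueeze(-1)  # (num_sources, batch, seq, 1)
        return (w * hiddens).sum(dim=0)       # (batch, seq, dim)
```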
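The multiway aspect can then be illustrated by giving each information stream (Q, K, V, R) its own aggregator, so that each stream reads its own dynamically weighted view of the depth stack. Again, MultiwayMUDD and the surrounding names are assumptions for illustration, not the paper's implementation.

```python
class MultiwayMUDD(nn.Module):
    """One dynamic aggregator per information stream (Q, K, V, R)."""
    def __init__(self, dim: int, num_sources: int):
        super().__init__()
        self.aggs = nn.ModuleDict({
            s: DynamicDenseAggregate(dim, num_sources) for s in "QKVR"
        })

    def forward(self, hiddens: torch.Tensor, x: torch.Tensor) -> dict:
        # Each stream gets its own weighted combination of all earlier layers.
        return {s: agg(hiddens, x) for s, agg in self.aggs.items()}

# Toy usage: a stack of the embedding plus four layer outputs.
dim, layers, batch, seq = 512, 4, 2, 16
stack = torch.randn(layers + 1, batch, seq, dim)
x = stack[-1]                                 # current hidden state
streams = MultiwayMUDD(dim, layers + 1)(stack, x)
# streams["Q"], streams["K"], streams["V"], streams["R"]: (batch, seq, dim)
```

The design point this sketch captures is the one the summary emphasizes: the extra cost is small (a linear layer producing one scalar weight per source layer per stream), which is consistent with the reported overhead of well under 1% in parameters and computation.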