ICLR 2026 | Fudan & Tongyi Wanxiang (通义万相) propose ProMoE: explicit routing guidance breaks the DiT MoE scaling bottleneck!
机器之心·2026-03-31 07:00

Core Insights

- The article discusses the limitations of applying the Mixture-of-Experts (MoE) architecture to Diffusion Transformers (DiT) for visual generation, arguing that the distinctive characteristics of visual tokens call for a new routing approach [2][3].

Group 1: MoE and Visual Tokens

- Existing MoE methods have shown limited success in visual domains compared to their strong performance in large language models (LLMs) [2].
- A research team from Fudan University, Alibaba, Zhejiang University, and the University of Hong Kong proposes ProMoE, a two-step routing MoE framework with explicit routing guidance designed to address these limitations [3][5].

Group 2: ProMoE Framework

- ProMoE introduces a two-step router combining conditional routing and prototypical routing to strengthen expert specialization and diversity among experts [9][10].
- Conditional routing assigns unconditional image tokens to dedicated unconditional experts, while conditional tokens pass through the standard learned routing mechanism [10].
- Prototypical routing maintains learnable prototypes and computes the cosine similarity between each token and the prototypes, so that tokens are assigned to the most relevant experts [10]. (A minimal code sketch of this two-step routing follows the summary below.)

Group 3: Routing Contrastive Learning

- ProMoE employs Routing Contrastive Learning (RCL) to inject semantic guidance into the routing process, improving load balancing and expert differentiation [11][12].
- RCL "pulls" each prototype toward the centroid of its assigned token set and "pushes" it away from other experts' token sets to encourage diversity [13]. (A sketch of such a loss appears after the routing sketch below.)

Group 4: Experimental Results

- ProMoE consistently outperforms dense models across various configurations; ProMoE-L-Flow achieves superior results with fewer activated parameters than larger models such as Dense-DiT-XL-Flow [19][22].
- On the GenEval benchmark, ProMoE outperforms standard Token-Choice MoE models, demonstrating its generalization ability [24].

Group 5: Model Configuration and Performance

- ProMoE models are configured at several scales; ProMoE-L has 1.063 billion total parameters and delivers significant performance improvements over existing models [18][19].
- Convergence analysis indicates that ProMoE converges faster than both dense models and existing MoE models, underscoring its training efficiency [28].

Group 6: Scalability and Future Potential

- ProMoE shows scalability potential, with performance improving as model size and the number of experts increase [31].
- The article concludes that ProMoE offers a viable open-source framework for efficiently integrating MoE architectures into large-scale diffusion models [33].
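To make the two-step routing concrete, here is a minimal PyTorch sketch under stated assumptions: the module name TwoStepRouter, the arguments num_uncond_experts and top_k, and the round-robin hard assignment of unconditional tokens to reserved expert slots are illustrative choices, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStepRouter(nn.Module):
    """Sketch of a two-step MoE router: conditional routing first,
    then prototypical routing by cosine similarity (hypothetical names)."""

    def __init__(self, dim: int, num_experts: int,
                 num_uncond_experts: int = 1, top_k: int = 2):
        super().__init__()
        self.num_uncond = num_uncond_experts
        self.top_k = top_k
        # One learnable prototype per *conditional* expert; the first
        # `num_uncond_experts` expert slots are reserved for
        # unconditional tokens and need no prototype.
        self.prototypes = nn.Parameter(
            torch.randn(num_experts - num_uncond_experts, dim))

    def forward(self, tokens: torch.Tensor, is_uncond: torch.Tensor):
        # tokens: (N, dim); is_uncond: (N,) bool, True for tokens from
        # the unconditional (classifier-free guidance) branch.

        # Step 1 -- conditional routing: unconditional tokens are
        # hard-assigned to the reserved unconditional experts
        # (round-robin here, purely for illustration).
        n_uncond = int(is_uncond.sum())
        uncond_idx = torch.arange(
            n_uncond, device=tokens.device) % self.num_uncond

        # Step 2 -- prototypical routing: conditional tokens are scored
        # against L2-normalized prototypes by cosine similarity.
        cond = F.normalize(tokens[~is_uncond], dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)
        sim = cond @ protos.t()                  # (N_cond, E_cond)

        # Keep the top-k most similar experts per token; a softmax over
        # the kept similarities yields the mixing weights.
        weights, cond_idx = sim.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        cond_idx = cond_idx + self.num_uncond    # skip reserved slots
        return uncond_idx, cond_idx, weights
```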

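Likewise, the pull/push behavior of RCL can be approximated with an InfoNCE-style objective over prototypes and the centroids of their assigned tokens. This is a hedged sketch, not the paper's exact formulation: the function name, the top-1 assignment simplification, and the temperature tau are assumptions.

```python
import torch
import torch.nn.functional as F


def routing_contrastive_loss(prototypes: torch.Tensor,
                             tokens: torch.Tensor,
                             assign_idx: torch.Tensor,
                             tau: float = 0.1) -> torch.Tensor:
    """Sketch of an RCL-style loss: each prototype is pulled toward the
    centroid of the tokens routed to it (its positive) and pushed away
    from the centroids of other experts' token sets (its negatives).

    prototypes: (E, dim); tokens: (N, dim);
    assign_idx: (N,) top-1 expert index per token (a simplification).
    """
    num_experts, dim = prototypes.shape

    # Centroid of each expert's assigned token set. Experts that
    # received no tokens keep a zero centroid, which is acceptable
    # for a sketch.
    centroids = torch.zeros(num_experts, dim,
                            device=tokens.device, dtype=tokens.dtype)
    centroids.index_add_(0, assign_idx, tokens)
    counts = torch.bincount(assign_idx,
                            minlength=num_experts).clamp(min=1)
    centroids = centroids / counts.unsqueeze(-1)

    # InfoNCE over cosine similarities: prototype e's positive is its
    # own centroid; every other centroid acts as a negative, which
    # pushes prototypes (and hence experts) apart.
    p = F.normalize(prototypes, dim=-1)
    c = F.normalize(centroids, dim=-1)
    logits = (p @ c.t()) / tau                   # (E, E)
    targets = torch.arange(num_experts, device=tokens.device)
    return F.cross_entropy(logits, targets)
```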