Hyperparameter Transfer

Renmin University of China & ByteDance Seed: Efficiently Scaling Diffusion Transformers with μP
机器之心· 2025-06-26 06:10
Core Viewpoint
- The research introduces μP (Maximal Update Parametrization) to hyperparameter tuning in diffusion Transformers, significantly reducing the computational cost of hyperparameter searches while improving model performance [2][24].

Group 1: Introduction and Background
- The work is a collaboration between Renmin University of China and ByteDance, focusing on diffusion Transformers, which have become essential to modern visual generation models [1].
- μP (Maximal Update Parametrization) is a significant milestone of the Tensor Programs theory of infinite-width networks; it allows Transformers of different sizes to share the same optimal hyperparameters [7].

Group 2: μP Theory Application
- μP has been successfully extended to diffusion Transformers, despite their architectural differences from standard Transformers, enabling hyperparameters to be transferred from smaller models to larger ones [8][10]; a simplified code sketch of this transfer recipe follows below.
- The research demonstrates that hyperparameters transfer effectively across model sizes, improving both training efficiency and final performance [12][15].

Group 3: Experimental Validation
- Systematic experiments on DiT, PixArt, and MMDiT show that hyperparameters found on smaller models can be applied directly to larger models, achieving better results than manually tuned baselines [2][21][24].
- In the MMDiT experiments, hyperparameters searched on a 0.18B-parameter model were reused to train an 18B-parameter model, with the hyperparameter search costing only 3% of the compute required for manual tuning [21][24].

Group 4: Future Implications
- The findings suggest that μP will be a crucial tool for scaling up foundation models, underscoring the importance of theoretical advances in AI for large-scale practical applications [24].
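To make the transfer recipe from Group 2 concrete, below is a minimal, simplified sketch of the μP idea for Adam-style optimizers; it is not the authors' code, and the names (make_mup_param_groups, ToyMLP, base_width) are hypothetical. The core rule sketched here is that learning rates for hidden "matrix-like" weights are scaled by base_width / width, so a learning rate tuned on a narrow proxy model remains near-optimal at larger widths; the full parametrization also rescales initialization variances and the output layer, which this sketch omits.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the paper's implementation) of muP-style learning-rate
# scaling for Adam: hidden 2-D weights get lr * base_width / width, while
# input/output layers and biases keep the base learning rate. The full muP
# recipe additionally rescales initialization variances and the output logits.

def make_mup_param_groups(model: nn.Module, base_lr: float,
                          width: int, base_width: int):
    """Split parameters into muP groups with width-dependent learning rates."""
    ratio = base_width / width
    hidden, other = [], []
    for name, p in model.named_parameters():
        # Treat 2-D weights of interior layers as "matrix-like" parameters.
        if p.ndim == 2 and "hidden" in name:
            hidden.append(p)
        else:
            other.append(p)
    return [
        {"params": hidden, "lr": base_lr * ratio},  # shrinks as width grows
        {"params": other, "lr": base_lr},           # vector-like params keep base lr
    ]

# Hypothetical toy network; a real DiT block would be parametrized analogously.
class ToyMLP(nn.Module):
    def __init__(self, width: int):
        super().__init__()
        self.inp = nn.Linear(16, width)
        self.hidden = nn.Linear(width, width)
        self.out = nn.Linear(width, 16)

    def forward(self, x):
        return self.out(torch.relu(self.hidden(torch.relu(self.inp(x)))))

# Workflow mirroring the paper's setup: tune base_lr on a small proxy width,
# then reuse it directly when instantiating the much wider model.
base_width, big_width, base_lr = 256, 4096, 3e-4
big_model = ToyMLP(big_width)
optimizer = torch.optim.AdamW(
    make_mup_param_groups(big_model, base_lr, big_width, base_width))
```

In practice one would typically rely on an existing μP implementation (for example the open-source `mup` package) rather than hand-rolling the scaling rules, since the complete parametrization involves more cases than the single learning-rate rule illustrated here.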