NeurIPS 2025 Oral | One Token at Zero Cost: REG Makes Diffusion Training Converge 20× Faster!
机器之心· 2025-11-29 01:49
Core Insights
- REG is a simple and effective method that significantly accelerates the training convergence of generative models by introducing a single class token, enhancing the performance of diffusion models [2][9][38]

Group 1: Methodology
- REG combines low-level latent representations with a high-level class token from a pre-trained visual model, so that both are noised and denoised jointly during training [9][14]
- Training requires adding only one token, incurring a computational overhead below 0.5% and no increase in inference cost [9][10][26]
- On ImageNet 256×256, REG converges 63× faster than SiT-XL/2 and 23× faster than SiT-XL/2+REPA [10][17]

Group 2: Performance Metrics
- On FID, REG clearly outperforms REPA: it reaches an FID of 1.8 after 4 million training steps, versus 5.9 for SiT-XL/2+REPA [17][19]
- REG cuts training time by 97.90% relative to SiT-XL/2 while reaching a similar FID [24][25]
- Inference overhead is minimal: parameters, FLOPs, and latency each increase by less than 0.5%, while FID improves by 56.46% over SiT-XL/2+REPA [26][27]

Group 3: Ablation Studies
- Extensive ablation studies confirm REG's effectiveness, showing that high-level global discriminative information significantly improves generation quality [28][30]
- Introducing the DINOv2 class token yields the best generation quality, underscoring the importance of high-level semantic guidance [30][31]

Group 4: Conclusion
- Overall, REG is a highly efficient training paradigm that entangles high-level and low-level tokens, promoting an "understanding-generation" decoupling in generative models and delivering superior generation quality without increasing inference cost [38]
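The methodology described above can be sketched in a few lines. The following is a minimal, hypothetical NumPy illustration of a REG-style training step, not the paper's actual code: it appends one class token (e.g. from DINOv2, assumed already projected to the latent dimension) to the latent token sequence, then applies the linear noising path used by flow-matching models such as SiT to the whole sequence at once. The placeholder zero "network" exists only to show the loss shape.

```python
import numpy as np

def reg_training_step(latent_tokens, class_token, t, rng):
    """One hypothetical REG-style training step (illustrative sketch).

    latent_tokens: (N, D) low-level image latents (e.g. from a VAE)
    class_token:   (1, D) high-level class token from a pre-trained
                   encoder such as DINOv2 (assumed projected to dim D)
    t:             scalar in (0, 1), the interpolation time
    """
    # REG's key idea: append ONE extra token so the class token is
    # noised and denoised jointly with the latent tokens.
    x0 = np.concatenate([latent_tokens, class_token], axis=0)  # (N+1, D)
    noise = rng.standard_normal(x0.shape)
    # Linear interpolation path, as in flow-matching / SiT-style models:
    # x_t = (1 - t) * x0 + t * noise, with target velocity noise - x0.
    x_t = (1.0 - t) * x0 + t * noise
    target_velocity = noise - x0
    # A real model would predict the velocity from x_t; a zero placeholder
    # stands in for the network here, purely to make the loss computable.
    predicted_velocity = np.zeros_like(x_t)
    loss = float(np.mean((predicted_velocity - target_velocity) ** 2))
    return x_t, loss
```

Because the denoising loss covers the appended token as well, the model is optimized to recover high-level semantics and low-level latents simultaneously, which is the entanglement the article describes.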
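The "less than 0.5% overhead" claim is easy to sanity-check arithmetically. Assuming (as an illustration, not a figure from the article) that a 256×256 image maps to a 32×32 latent grid and SiT-XL/2's patch size of 2 yields 16×16 = 256 patch tokens, adding one class token gives:

```python
# Hypothetical token count for SiT-XL/2 on ImageNet 256x256:
# 256 / 8 = 32 latent grid; patch size 2 -> 16 x 16 = 256 patch tokens.
patch_tokens = 256
extra_tokens = 1  # REG adds exactly one class token
overhead = extra_tokens / patch_tokens
print(f"{overhead:.2%}")  # prints "0.39%", consistent with the <0.5% claim
```

Since self-attention cost grows with sequence length, one token out of 257 keeps the added compute well under the 0.5% bound quoted above.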