Fast and Good at Once! TiM, a New Training Paradigm with Native Support for FSDP + Flash Attention

Core Viewpoint
- The article introduces the Transition Model (TiM), a new paradigm in generative modeling. TiM aims to reconcile the trade-off between generation speed and quality by modeling the state transition between any two time points, rather than only an instantaneous velocity field or a fixed-span endpoint mapping [3][8][34].

Group 1: Background and Challenges
- Traditional generative models face a fundamental conflict between generation quality and speed, which stems primarily from their training objectives [2][6].
- Existing diffusion models learn a local vector field, which must be integrated with small time steps for accurate sampling and therefore incurs high computational cost [5][6].
- Few-step models are faster but often hit a "quality ceiling": because they do not capture intermediate dynamics, adding more sampling steps yields little further improvement [5][7].

Group 2: Transition Model Overview
- The Transition Model abandons both approaches and directly models the complete state transition between any two time points, so the number of sampling steps can be chosen freely [4][8].
- The model supports arbitrary step sizes and decomposes the generation process into multiple adjustable segments, improving both speed and fidelity [8][10].

Group 3: Mathematical Foundations
- The Transition Model is built on a "State Transition Identity," which simplifies the differential equations governing state transitions and makes it possible to describe a specific transition over an arbitrary time interval [12][16].
- Diffusion models focus on the instantaneous velocity field and mean-flow models on the average velocity field; the Transition Model encompasses both, providing a more comprehensive framework for generative modeling [16][17].

Group 4: Experimental Validation
- The Transition Model has been validated on the GenEval benchmark, where an 865M-parameter version outperforms larger models (12B parameters) in generation capability [20][34].
- Training stability and scalability are improved by the Differential Derivation Equation (DDE) approach, which is more efficient and compatible with modern training optimizations such as FSDP and Flash Attention [25][33].

Group 5: Conclusion
- Overall, the Transition Model offers a more universal, scalable, and stable approach to generative modeling, addressing the inherent conflict between speed and quality in generative processes [35].
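
The contrast drawn in Groups 1 and 2 between stepping along a local velocity field and applying a state transition over an arbitrary interval can be illustrated with a toy example. The sketch below is not the TiM implementation: it assumes a hypothetical one-dimensional ODE, dx/dt = -x, whose transition between any two times is known in closed form, and that closed form stands in for what a trained transition network would approximate. All function names are illustrative.

```python
import math


def velocity(x, t):
    """Instantaneous velocity of the assumed toy ODE dx/dt = -x."""
    return -x


def exact_transition(x_t, t, r):
    """Closed-form state transition from time t to time r for dx/dt = -x.

    Stand-in for a learned transition model; the real network is not shown.
    """
    return x_t * math.exp(-(r - t))


def euler_sample(x0, t0, t1, num_steps):
    """Integrate the local velocity field with fixed-size Euler steps."""
    x, t = x0, t0
    dt = (t1 - t0) / num_steps
    for _ in range(num_steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x


def transition_sample(x0, t0, t1, num_steps):
    """Chain state transitions over num_steps equal segments."""
    x, t = x0, t0
    dt = (t1 - t0) / num_steps
    for _ in range(num_steps):
        x = exact_transition(x, t, t + dt)
        t += dt
    return x


def errors(num_steps, x0=1.0, t0=0.0, t1=2.0):
    """Return (Euler error, transition error) against the true end state."""
    target = exact_transition(x0, t0, t1)
    euler_err = abs(euler_sample(x0, t0, t1, num_steps) - target)
    trans_err = abs(transition_sample(x0, t0, t1, num_steps) - target)
    return euler_err, trans_err


if __name__ == "__main__":
    for n in (1, 4, 100):
        euler_err, trans_err = errors(n)
        print(f"steps={n:3d}  euler_err={euler_err:.3e}  transition_err={trans_err:.3e}")
```

In this toy setting the Euler error shrinks only as the step count grows, while the chained transition stays accurate for any step count. A learned model would approximate the transition rather than reproduce it exactly, so the gap in practice depends on training quality.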