Self-Forcing++:让自回归视频生成模型突破 4 分钟时长极限
机器之心·2025-10-18 08:30

Core Insights - The article discusses the breakthrough of Self-Forcing++ in generating high-quality long videos, extending the generation time from 5 seconds to 4 minutes without requiring additional long video data for retraining [2][10]. Group 1: Challenges in Long Video Generation - Long video generation has been limited to a few seconds due to inherent architectural flaws in existing models, which struggle to maintain visual consistency and motion coherence beyond 10 seconds [6][7]. - The primary challenge lies in the models' inability to handle cumulative errors over extended sequences, leading to issues like overexposure and freezing [17][20]. Group 2: Key Innovations of Self-Forcing++ - Self-Forcing++ employs a unique approach where a teacher model, despite only generating 5-second videos, can correct distortions in longer videos generated by a student model [9][10]. - The process involves a cycle of generation, distortion, correction, and learning, allowing the model to self-repair and stabilize over longer time scales [10]. Group 3: Technical Mechanisms - Backward Noise Initialization allows the model to inject noise into already generated sequences, maintaining temporal continuity [13][15]. - Extended DMD expands the teacher-student distribution alignment to a sliding window, enabling local supervision of long video sequences [16][18]. - Rolling KV Cache aligns training and inference phases, eliminating issues like exposure drift and frame repetition [19][20]. Group 4: Experimental Results - Self-Forcing++ outperforms baseline models in generating videos of 50, 75, and 100 seconds, demonstrating superior stability and quality [23][24]. - The model maintains consistent brightness and natural motion across long videos, with minimal degradation in visual quality [30]. Group 5: Scaling and Future Improvements - The relationship between computational power and video length is explored, showing that increasing training resources significantly enhances video quality [31]. - Despite advancements, challenges remain in long-term memory retention and training efficiency, indicating areas for further development [33].