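NeurIPS'25 Oral: Why Bother with DiT? ByteDance Uses Autoregression for the First Time to Generate a 5-Second 720p Video in One Minute on a Single GPU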

Core Insights
- InfinityStar, developed by ByteDance's commercialization technology team, is a new video generation method that balances quality and efficiency, addressing the computational complexity and resource consumption of video generation [2][3][24]

Group 1: InfinityStar Highlights
- InfinityStar is the first discrete autoregressive video generator to surpass diffusion models on VBench [3]
- It moves away from the slow iterative denoising process to a faster autoregressive approach, reducing generation delay [3]
- The method supports text-to-image, text-to-video, image-to-video, and interactive long video generation [3]

Group 2: Technical Innovations
- The core architecture uses spatiotemporal pyramid modeling, which lets it unify image and video tasks while running an order of magnitude faster than mainstream diffusion models [9]
- The model decomposes a video into two parts: the first frame captures static appearance information, while subsequent segments capture dynamic changes [10][11]
- InfinityStar employs an efficient visual tokenizer and introduces techniques such as knowledge inheritance and stochastic quantizer depth to improve training speed and model performance [14][15]

Group 3: Performance Metrics
- InfinityStar performs strongly on text-to-image (T2I) and text-to-video (T2V) tasks, achieving excellent results on the GenEval, DPG, and VBench benchmarks and outperforming both previous autoregressive models and diffusion-based methods [18][21][24]
- On VBench, InfinityStar's human preference evaluation score surpassed that of HunyuanVideo, with a particular advantage in instruction adherence [22][24]

Group 4: Efficiency
- InfinityStar generates significantly faster than DiT-based methods, producing a 5-second 720p video in under one minute on a single GPU [24]
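
The decomposition described under Technical Innovations (a first frame for static appearance, followed by segments for dynamic changes) can be illustrated with a minimal sketch. It only shows how frame indices might be grouped and ordered for autoregressive generation; the function names, the fixed clip length, and the frame counts are all assumptions for illustration and are not taken from InfinityStar's implementation.

```python
from typing import List, Tuple


def split_first_frame_and_clips(
    num_frames: int, clip_len: int
) -> Tuple[int, List[range]]:
    """Split frame indices into a first frame plus consecutive clips.

    Hypothetical helper: the name and the fixed clip length are assumptions
    made for illustration, not details of InfinityStar itself.
    """
    if num_frames < 1:
        raise ValueError("need at least one frame")
    if clip_len < 1:
        raise ValueError("clip_len must be positive")

    first_frame = 0  # index of the frame that carries static appearance
    # Remaining frames are grouped into clips of at most clip_len frames;
    # the last clip may be shorter.
    clips = [
        range(start, min(start + clip_len, num_frames))
        for start in range(1, num_frames, clip_len)
    ]
    return first_frame, clips


def generation_order(num_frames: int, clip_len: int) -> List[str]:
    """List the units in the order an autoregressive model would emit them."""
    first, clips = split_first_frame_and_clips(num_frames, clip_len)
    order = [f"frame {first}"]
    order += [f"clip frames {c.start}-{c.stop - 1}" for c in clips]
    return order


if __name__ == "__main__":
    # Assumed toy sizes: 9 frames, clips of 4 frames each.
    for unit in generation_order(num_frames=9, clip_len=4):
        print(unit)
```

In this toy setup the first frame is emitted once, and each later clip is generated conditioned on everything before it, which is the general shape of autoregressive generation over a first-frame-plus-segments layout.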