InfinityStar
NeurIPS'25 Oral: Who needs DiT? ByteDance's first autoregressive model generates a 5-second 720p video in one minute on a single GPU
36Kr · 2025-11-14 08:35
A paper accepted as a NeurIPS'25 Oral has struck a firm blow against DiT (Diffusion Transformer).

After all, ever since DiT arrived, it has had video generation firmly in its grip. But a secure foothold does not mean there are no problems: DiT's computational complexity is high, which brings plenty of challenges in resource consumption and speed.

This paper, from ByteDance's commercialization technology team, proposes a method called InfinityStar that delivers both quality and efficiency in video generation, opening up more possible paths for video generation methods. Fun animated clips like the ones below were produced by InfinityStar itself.

Overall, InfinityStar's highlights can be summed up in three points:

1. It is the first discrete autoregressive video generator to surpass diffusion models on VBench;
2. Video generation no longer has to "simmer slowly": from hundreds of denoising steps to autoregression, latency is gone;
3. One model for every task: text-to-image, text-to-video, image-to-video, interactive long-video generation, and more.

Worth noting: InfinityStar's paper, code, and demo are all already available (links at the end of the article), so next we put it to the test~

Bam!~~~

Hands-on with the AI video generation that taught DiT a lesson

First, a quick look at how to try InfinityStar. The entry point is in the Discord community; once you log ...
Who needs DiT! ByteDance's first autoregressive model generates a 5-second 720p video in one minute on a single GPU | NeurIPS'25 Oral
量子位 (QbitAI) · 2025-11-14 05:38
Core Viewpoint
- The article discusses the introduction of InfinityStar, a new method developed by ByteDance's commercialization technology team, which significantly improves video generation quality and efficiency compared to the existing Diffusion Transformer (DiT) model [4][32].

Group 1: InfinityStar Highlights
- InfinityStar is the first discrete autoregressive video generator to surpass diffusion models on VBench [9].
- It eliminates delays in video generation, transitioning from a slow denoising process to a fast autoregressive approach [9].
- The method supports various tasks including text-to-image, text-to-video, image-to-video, and interactive long video generation [9][12].

Group 2: Technical Innovations
- The core architecture of InfinityStar employs a spatiotemporal pyramid modeling approach, allowing it to unify image and video tasks while being an order of magnitude faster than mainstream diffusion models [13][25].
- InfinityStar decomposes video into two parts: the first frame for static appearance information and subsequent clips for dynamic information, effectively decoupling static and dynamic elements (a minimal sketch of this coarse-to-fine decomposition follows this summary) [14][15][16].
- Two key technologies enhance the model's performance: Knowledge Inheritance, which accelerates the training of a discrete visual tokenizer, and Stochastic Quantizer Depth, which balances information distribution across scales [19][21].

Group 3: Performance Metrics
- InfinityStar demonstrates superior performance in the text-to-image (T2I) task on GenEval and DPG benchmarks, particularly excelling in spatial relationships and object positioning [25][28].
- In the text-to-video (T2V) task, InfinityStar outperforms all previous autoregressive models and achieves better results than DiT-based methods like CogVideoX and HunyuanVideo [28][29].
- The generation speed of InfinityStar is significantly faster than DiT-based methods, with the ability to generate a 5-second 720p video in under one minute on a single GPU [31].
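To make the coarse-to-fine idea in Group 2 concrete, here is a minimal, hypothetical sketch of a spatiotemporal residual pyramid with static/dynamic decoupling. The function name `build_spatiotemporal_pyramid`, the scale schedule, and all tensor shapes are illustrative assumptions rather than InfinityStar's actual code, and the discrete quantization step of the real tokenizer is omitted.

```python
import torch
import torch.nn.functional as F

# Hypothetical illustration only: names, shapes, and the scale schedule are
# assumptions for exposition, not InfinityStar's actual implementation.

def build_spatiotemporal_pyramid(clip_latent, scales=((1, 4, 4), (4, 8, 8), (8, 16, 16))):
    """Decompose a clip latent of shape (T, H, W, C) into coarse-to-fine residual levels.

    Each level stores only the residual left unexplained by coarser levels,
    mirroring the coarse-to-fine order an autoregressive model would predict.
    """
    t, h, w, c = clip_latent.shape
    x = clip_latent.permute(3, 0, 1, 2).unsqueeze(0)  # (1, C, T, H, W)
    reconstruction = torch.zeros_like(x)
    levels = []
    for (st, sh, sw) in scales:
        residual = x - reconstruction
        # Pool the residual down to this level's coarse spatiotemporal grid.
        coarse = F.adaptive_avg_pool3d(residual, (st, sh, sw))
        levels.append(coarse)
        # Upsample back and accumulate, so finer levels only model what remains.
        reconstruction = reconstruction + F.interpolate(
            coarse, size=(t, h, w), mode="trilinear", align_corners=False
        )
    return levels

# Static/dynamic decoupling: the first frame carries appearance,
# subsequent clips carry motion.
video_latent = torch.randn(17, 32, 32, 16)         # (T, H, W, C), toy sizes
static_frame = video_latent[:1]                    # appearance-only "clip"
dynamic_clips = video_latent[1:].chunk(2, dim=0)   # later clips hold dynamics

pyramids = [build_spatiotemporal_pyramid(static_frame, scales=((1, 4, 4), (1, 8, 8)))]
pyramids += [build_spatiotemporal_pyramid(clip) for clip in dynamic_clips]
print([lvl.shape for lvl in pyramids[1]])          # coarse-to-fine token grids for one clip
```

The design point this sketch mirrors is that each finer level encodes only the residual left by coarser ones, so a generator can predict tokens scale by scale, while motion clips are handled separately from the appearance-only first frame.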