InfinityStar
NeurIPS'25 Oral: Who needs DiT? ByteDance's first autoregressive model generates a 5-second 720p video in one minute on a single GPU
36Kr · 2025-11-14 08:35
A paper accepted as a NeurIPS'25 Oral has struck a firm blow against DiT (Diffusion Transformer).

After all, ever since DiT arrived, it has had video generation firmly in its grip. But a secure foothold does not mean there are no problems: DiT's computational complexity is high, which brings plenty of challenges in resource consumption and speed.

This paper, from ByteDance's commercialization technology team, proposes a method called InfinityStar that delivers both quality and efficiency in video generation, opening up more possible paths for video generation methods. Fun animated clips like the ones below were produced by InfinityStar itself.

Overall, InfinityStar's highlights can be summed up in three points:

1. It is the first discrete autoregressive video generator to surpass diffusion models on VBench;
2. Video generation no longer has to "simmer slowly": from hundreds of denoising steps to autoregression, latency is gone;
3. One model for every task: text-to-image, text-to-video, image-to-video, interactive long-video generation, and more.

Worth noting: InfinityStar's paper, code, and demo are all already available (links at the end of the article), so next we put it to the test~

Bam!~~~

Hands-on with the AI video generation that taught DiT a lesson

First, a quick look at how to try InfinityStar. The entry point is in the Discord community; once you log ...
Who needs DiT! ByteDance's first autoregressive model generates a 5-second 720p video in one minute on a single GPU | NeurIPS'25 Oral
量子位 (QbitAI) · 2025-11-14 05:38
Core Viewpoint
- The article discusses the introduction of InfinityStar, a new method developed by ByteDance's commercialization technology team, which significantly improves video generation quality and efficiency compared to the existing Diffusion Transformer (DiT) model [4][32].

Group 1: InfinityStar Highlights
- InfinityStar is the first discrete autoregressive video generator to surpass diffusion models on VBench [9].
- It eliminates delays in video generation, transitioning from a slow denoising process to a fast autoregressive approach [9].
- The method supports various tasks including text-to-image, text-to-video, image-to-video, and interactive long video generation [9][12].

Group 2: Technical Innovations
- The core architecture of InfinityStar employs a spatiotemporal pyramid modeling approach, allowing it to unify image and video tasks while being an order of magnitude faster than mainstream diffusion models [13][25].
- InfinityStar decomposes video into two parts: the first frame for static appearance information and subsequent clips for dynamic information, effectively decoupling static and dynamic elements (a minimal sketch of this coarse-to-fine decomposition follows this summary) [14][15][16].
- Two key technologies enhance the model's performance: Knowledge Inheritance, which accelerates the training of a discrete visual tokenizer, and Stochastic Quantizer Depth, which balances information distribution across scales [19][21].

Group 3: Performance Metrics
- InfinityStar demonstrates superior performance in the text-to-image (T2I) task on GenEval and DPG benchmarks, particularly excelling in spatial relationships and object positioning [25][28].
- In the text-to-video (T2V) task, InfinityStar outperforms all previous autoregressive models and achieves better results than DiT-based methods like CogVideoX and HunyuanVideo [28][29].
- The generation speed of InfinityStar is significantly faster than DiT-based methods, with the ability to generate a 5-second 720p video in under one minute on a single GPU [31].
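To make the coarse-to-fine idea in Group 2 concrete, here is a minimal, hypothetical sketch of a spatiotemporal residual pyramid with static/dynamic decoupling. The function name `build_spatiotemporal_pyramid`, the scale schedule, and all tensor shapes are illustrative assumptions rather than InfinityStar's actual code, and the discrete quantization step of the real tokenizer is omitted.

```python
import torch
import torch.nn.functional as F

# Hypothetical illustration only: names, shapes, and the scale schedule are
# assumptions for exposition, not InfinityStar's actual implementation.

def build_spatiotemporal_pyramid(clip_latent, scales=((1, 4, 4), (4, 8, 8), (8, 16, 16))):
    """Decompose a clip latent of shape (T, H, W, C) into coarse-to-fine residual levels.

    Each level stores only the residual left unexplained by coarser levels,
    mirroring the coarse-to-fine order an autoregressive model would predict.
    """
    t, h, w, c = clip_latent.shape
    x = clip_latent.permute(3, 0, 1, 2).unsqueeze(0)  # (1, C, T, H, W)
    reconstruction = torch.zeros_like(x)
    levels = []
    for (st, sh, sw) in scales:
        residual = x - reconstruction
        # Pool the residual down to this level's coarse spatiotemporal grid.
        coarse = F.adaptive_avg_pool3d(residual, (st, sh, sw))
        levels.append(coarse)
        # Upsample back and accumulate, so finer levels only model what remains.
        reconstruction = reconstruction + F.interpolate(
            coarse, size=(t, h, w), mode="trilinear", align_corners=False
        )
    return levels

# Static/dynamic decoupling: the first frame carries appearance,
# subsequent clips carry motion.
video_latent = torch.randn(17, 32, 32, 16)         # (T, H, W, C), toy sizes
static_frame = video_latent[:1]                    # appearance-only "clip"
dynamic_clips = video_latent[1:].chunk(2, dim=0)   # later clips hold dynamics

pyramids = [build_spatiotemporal_pyramid(static_frame, scales=((1, 4, 4), (1, 8, 8)))]
pyramids += [build_spatiotemporal_pyramid(clip) for clip in dynamic_clips]
print([lvl.shape for lvl in pyramids[1]])          # coarse-to-fine token grids for one clip
```

The design point this sketch mirrors is that each finer level encodes only the residual left by coarser ones, so a generator can predict tokens scale by scale, while motion clips are handled separately from the appearance-only first frame.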