The NAR Model
A 13.8× Throughput Boost! Zhejiang University, Shanghai AI Lab, and Others Propose a New Visual Generation Paradigm: From "Next Token" to "Next Neighborhood"
量子位 · 2025-03-30 02:37
Core Viewpoint
- The article discusses a new visual generation paradigm called Neighboring Autoregressive Modeling (NAR), which addresses the efficiency bottlenecks of traditional "next token prediction" methods in image and video generation tasks [2][12].

Group 1: Introduction to NAR
- Traditional autoregressive models generate images or videos token by token in raster order, leading to slow generation, especially for high-resolution images or long videos [12].
- Existing acceleration methods often compromise generation quality, so improving efficiency while preserving high-quality outputs remains a key challenge [14].

Group 2: Mechanism of NAR
- NAR introduces a "next neighborhood prediction" mechanism that treats visual generation as a stepwise outward expansion, allowing multiple adjacent tokens to be predicted in parallel (see the first code sketch below) [2][3].
- The model employs dimension-guided decoding heads, where each head predicts the next token along a specific orthogonal dimension, significantly reducing the number of forward passes required (see the second sketch below) [4][5][16].

Group 3: Efficiency and Performance
- In video generation tasks, NAR completes generation in only 2n + t - 2 steps, compared with the tn steps required by traditional models, a significant efficiency advantage (see the step-count sketch below) [18][20].
- Experimental results show that NAR achieves a 13.8× throughput improvement on ImageNet image generation, with lower FID scores than larger models [21][22].

Group 4: Application Results
- For video generation on the UCF-101 dataset, NAR reduces the number of generation steps by 97.3% compared with traditional autoregressive models [23].
- In text-to-image generation, NAR uses only 0.4% of the training data yet matches the performance of larger models while delivering a 166× throughput increase [26][27][28].

Group 5: Conclusion
- NAR offers an efficient, high-quality solution for visual generation tasks, indicating significant potential for future AI applications [29].
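To make the "next neighborhood prediction" idea from Group 2 concrete, the sketch below enumerates one plausible generation schedule for a 2D token grid: starting from a corner token, each step emits in parallel all tokens at the same Manhattan distance from the start. This is a minimal reconstruction under stated assumptions, not the paper's actual schedule; the expansion origin and distance metric here are illustrative choices.

```python
# Minimal sketch of a "next neighborhood" schedule for an n x n token grid.
# Assumption (not from the article): the frontier expands from the top-left
# corner in Manhattan-distance shells, so every token in a shell can be
# predicted in parallel within a single forward step.

def neighborhood_schedule(n: int) -> list[list[tuple[int, int]]]:
    """Group grid positions (i, j) by Manhattan distance from (0, 0)."""
    shells = [[] for _ in range(2 * n - 1)]
    for i in range(n):
        for j in range(n):
            shells[i + j].append((i, j))
    return shells

if __name__ == "__main__":
    n = 4
    shells = neighborhood_schedule(n)
    for step, shell in enumerate(shells):
        print(f"step {step}: {shell}")
    # Raster-order decoding needs n * n = 16 steps for this grid;
    # shell-wise decoding needs only 2 * n - 1 = 7 steps.
    assert len(shells) == 2 * n - 1
```

The parallelism comes from the schedule itself: every token in a shell depends only on tokens in earlier shells, so one forward pass can score the whole shell at once.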
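The dimension-guided decoding heads from Group 2 can be sketched as one lightweight output head per orthogonal axis, each predicting the neighboring token one step along its own dimension from a shared hidden state. The class name, head layout, and sizes below are hypothetical illustrations of that idea, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DimensionGuidedHeads(nn.Module):
    """Hypothetical sketch: one decoding head per orthogonal dimension.

    For images, num_dims=2 (row, column); for video, num_dims=3 (adds time).
    Head k maps the shared hidden state of a frontier token to logits for
    the adjacent token one step along dimension k, so all neighbors are
    scored in the same forward pass.
    """

    def __init__(self, hidden_dim: int, vocab_size: int, num_dims: int = 2):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, vocab_size) for _ in range(num_dims)]
        )

    def forward(self, h: torch.Tensor) -> list[torch.Tensor]:
        # h: (batch, hidden_dim) hidden state from the shared backbone
        return [head(h) for head in self.heads]

# Usage with toy sizes: three heads for video (height, width, time).
heads = DimensionGuidedHeads(hidden_dim=512, vocab_size=8192, num_dims=3)
logits = heads(torch.randn(2, 512))  # list of 3 tensors, each (2, 8192)
```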
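Finally, the step counts quoted in Group 3 can be compared directly. The helper below evaluates the article's two formulas: tn steps for token-by-token (raster) decoding and 2n + t - 2 steps for NAR on video. The meanings of n and t follow the article's notation, and the sample values are illustrative only; they are not the UCF-101 benchmark configuration behind the reported 97.3% reduction.

```python
# Hedged arithmetic sketch comparing the decoding-step formulas quoted
# in the article for video generation.

def raster_steps(n: int, t: int) -> int:
    """Token-by-token decoding: one forward step per token, tn total."""
    return t * n

def nar_steps(n: int, t: int) -> int:
    """Step count the article attributes to NAR: 2n + t - 2."""
    return 2 * n + t - 2

if __name__ == "__main__":
    n, t = 16, 8  # hypothetical sizes, not the paper's benchmark settings
    r, s = raster_steps(n, t), nar_steps(n, t)
    print(f"raster: {r} steps, NAR: {s} steps")
    print(f"reduction: {1 - s / r:.1%}")  # 70.3% for these toy values
```

The key observation is that the raster count grows multiplicatively in t and n while the NAR count grows only additively, so the relative saving widens as resolution and video length increase.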