NextStep-1: An Exploration of the Autoregressive Paradigm in Image Generation
机器之心·2025-08-18 05:15

Core Insights

- The article discusses the development of NextStep-1, a new autoregressive model for image generation that operates directly in continuous visual space, avoiding the information loss associated with discretization [2][3][4]
- The model uses a lightweight Flow Matching Head, which simplifies the architecture and allows end-to-end training without reliance on external diffusion models [4][5]
- The exploration aims to provide a new perspective on the multimodal generation field, emphasizing the potential for building efficient, high-fidelity generative models [26][33]

Technical Framework

- NextStep-1 is built on a powerful Transformer backbone with 14 billion parameters, complemented by a Flow Matching Head with 157 million parameters that generates continuous image patches [7][8]
- The model generates images autoregressively, producing patches sequentially, which bypasses the discretization bottleneck [8]
- The architecture is deliberately simple and pure, demonstrating that a streamlined autoregressive model can be constructed without sacrificing continuity [4][26]

Key Discoveries

- The team identified that the Transformer acts as the main creator, while the Flow Matching Head serves as an efficient sampler; the size of the Flow Matching Head has minimal impact on image quality [12]
- Two techniques proved critical for stability and quality: channel-wise normalization to stabilize token statistics, and the counterintuitive finding that adding more noise during training can enhance image quality [14][16]

Performance Evaluation

- NextStep-1 has been rigorously evaluated against industry benchmarks, achieving results competitive with state-of-the-art diffusion models [21][22]
- Reported metrics include GenEval scores of 0.63/0.737 and a DPG-Bench score of 85.28, indicating strong image-generation capabilities [21][22]

Limitations and Future Directions

- The model faces challenges related to
stability during generation, particularly when expanding the latent-space dimensions, which can lead to occasional failures [27][29]
- The autoregressive nature of the model introduces latency, particularly in sequential decoding, which affects overall performance [28]
- Future work will focus on optimizing the Flow Matching Head, accelerating the autoregressive backbone, and improving convergence efficiency, especially for high-resolution image generation [34][35]
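The patch-by-patch pipeline described in the Technical Framework section can be sketched as follows. This is a minimal illustration, not the NextStep-1 implementation: `backbone_step` and `head_velocity` are hypothetical toy stand-ins for the 14B Transformer and the 157M Flow Matching Head, and the head is sampled by Euler-integrating a velocity field from noise to data, as in flow matching.

```python
import numpy as np

PATCH_DIM = 16    # toy latent-patch dimensionality (assumption)
NUM_PATCHES = 4   # toy sequence length (assumption)
ODE_STEPS = 8     # Euler steps for flow-matching sampling

rng = np.random.default_rng(0)

def backbone_step(prev_patches, pos):
    """Toy stand-in for the Transformer backbone: maps the patches
    generated so far (plus a position signal) to a conditioning
    vector for the next patch."""
    h = np.zeros(PATCH_DIM) if not prev_patches else np.mean(prev_patches, axis=0)
    pos_emb = np.sin(pos + np.arange(PATCH_DIM))
    return np.tanh(h + pos_emb)

def head_velocity(x, t, cond):
    """Toy stand-in for the Flow Matching Head: a velocity field
    v(x, t | cond). Here a straight-line field toward a
    conditioning-dependent target, mimicking a rectified flow."""
    target = cond  # in a real model the target is implicit in the learned field
    return (target - x) / max(1.0 - t, 1e-3)

def sample_patch(cond):
    """Sample one continuous patch by Euler-integrating the flow ODE
    from Gaussian noise at t=0 toward data at t=1."""
    x = rng.standard_normal(PATCH_DIM)
    dt = 1.0 / ODE_STEPS
    for i in range(ODE_STEPS):
        x = x + dt * head_velocity(x, i * dt, cond)
    return x

def generate_image():
    """Autoregressive loop: each patch is conditioned on all previous
    ones; no discretization (VQ codebook) is ever needed."""
    patches = []
    for pos in range(NUM_PATCHES):
        cond = backbone_step(patches, pos)
        patches.append(sample_patch(cond))
    return np.stack(patches)

img = generate_image()
print(img.shape)  # (NUM_PATCHES, PATCH_DIM)
```

Because the head is tiny relative to the backbone, this split matches the article's observation that the Transformer does the creative work while the Flow Matching Head merely samples.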
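The two stabilization techniques called out under Key Discoveries can be illustrated with a short sketch. The function names and the noise scale are assumptions for illustration only; the article states just that normalizing per-channel latent statistics and injecting extra noise during training improved stability and quality.

```python
import numpy as np

rng = np.random.default_rng(42)

def channelwise_normalize(tokens, eps=1e-6):
    """Normalize each channel of the continuous tokens to zero mean
    and unit variance, stabilizing the statistics the Transformer
    sees. tokens: (num_tokens, channels)."""
    mean = tokens.mean(axis=0, keepdims=True)
    std = tokens.std(axis=0, keepdims=True)
    return (tokens - mean) / (std + eps)

def noise_augment(tokens, noise_std=0.3):
    """The counterintuitive trick: perturb the normalized latents with
    Gaussian noise during training. noise_std=0.3 is an illustrative
    value, not the paper's setting."""
    return tokens + noise_std * rng.standard_normal(tokens.shape)

# Toy latents with wildly different per-channel scales.
scales = np.array([1, 10, 100, 0.1, 5, 50, 2, 0.01])
latents = rng.standard_normal((256, 8)) * scales
normed = channelwise_normalize(latents)
train_input = noise_augment(normed)

print(normed.std(axis=0))  # every channel is ~1 after normalization
```

Normalization keeps the autoregressive backbone from chasing channels whose raw variance dominates, and the added noise acts as a regularizer on the continuous-token distribution.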