Can autoregression also build strong visual models? NEPA opens the era of "next-embedding prediction", with Saining Xie participating
机器之心·2026-01-02 05:00

Core Viewpoint
- The article introduces Next-Embedding Predictive Autoregression (NEPA), a new approach to visual pre-training that shifts the paradigm from learning representations to learning models, showing that the next-token-style objective behind language models also yields strong visual models [2][18].

Group 1: NEPA Overview
- NEPA is a minimalist approach that predicts the next feature embedding of an image, much as language models predict the next word [20].
- The method relies on causal masking and a stop-gradient on the prediction targets to keep training stable without requiring complex architectures [17][25]; a sketch of a training step appears after this summary.
- NEPA achieves competitive results on benchmarks like ImageNet-1K, with Top-1 accuracy of 83.8% for ViT-B and 85.3% for ViT-L, surpassing several state-of-the-art methods [29].

Group 2: Methodology and Architecture
- The architecture is a standard Vision Transformer (ViT) backbone with causal attention masking, which directly predicts each future image-patch embedding from the embeddings that precede it [22]; see the backbone sketch below.
- Unlike pixel-level reconstruction methods, NEPA needs no separate decoder, which simplifies the model design [22].
- Training segments each image into patches, encodes them into vectors, and predicts the next patch embedding, with the stop-gradient preventing the model from "cheating" by shaping its own targets [25].

Group 3: Performance and Applications
- NEPA transfers well to dense prediction, reaching 48.3% and 54.0% mIoU on ADE20K semantic segmentation, indicating that it learns the rich semantic features such tasks require [29].
- The model adapts to various downstream tasks by simply swapping in a new classification head, as sketched at the end of this section [30].
- Attention visualizations show that NEPA learns long-range, object-centered attention patterns, ignoring background clutter and focusing on semantically relevant regions [37].
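To make the causal-attention design concrete, here is a minimal PyTorch sketch of a ViT-style backbone in which each patch attends only to earlier patches. Every name, dimension, and layer choice is an illustrative assumption for this digest, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class CausalPatchBackbone(nn.Module):
    """Hypothetical ViT-style encoder with a causal attention mask."""

    def __init__(self, img_size=224, patch_size=16, dim=768, depth=12, heads=12):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patchify: a strided conv maps each 16x16 patch to one embedding.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def embed(self, images):
        # (B, 3, H, W) -> (B, N, dim) sequence of patch embeddings.
        return self.patch_embed(images).flatten(2).transpose(1, 2)

    def forward(self, images):
        x = self.embed(images) + self.pos_embed
        n = x.size(1)
        # Upper-triangular -inf mask: patch i sees only patches 0..i,
        # so position i can be trained to predict patch i+1's embedding.
        mask = torch.triu(
            torch.full((n, n), float("-inf"), device=x.device), diagonal=1
        )
        return self.blocks(x, mask=mask)
```

Note that, consistent with the summary above, there is no decoder: the causally masked encoder output is itself the prediction.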
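A hedged sketch of one pre-training step under the same assumptions: position i predicts the embedding of patch i+1, and the targets are detached (stop-gradient) so the model cannot trivially collapse them toward its predictions. The smooth-L1 loss is a placeholder; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def nepa_step(model, images, optimizer):
    """One hypothetical next-embedding-prediction step with stop-gradient targets."""
    # Targets are the model's own patch embeddings, with gradients stopped
    # so the predictor cannot "cheat" by dragging the targets toward it.
    with torch.no_grad():
        targets = model.embed(images)          # (B, N, dim), detached
    preds = model(images)                      # causally masked features
    # Position i predicts patch i+1: shift predictions and targets by one.
    loss = F.smooth_l1_loss(preds[:, :-1], targets[:, 1:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A standard optimizer such as `torch.optim.AdamW(model.parameters(), lr=1e-4)` would drive this loop; the learning rate here is likewise an assumption.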
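Finally, the "swap the head" adaptation described in Group 3 reduces to attaching a fresh linear layer on pooled backbone features. The mean-pooling and linear head below are illustrative choices, assuming the hypothetical backbone sketched earlier.

```python
import torch.nn as nn

class NEPAClassifier(nn.Module):
    """Downstream adapter: the backbone is reused; only the head is new."""

    def __init__(self, backbone, dim=768, num_classes=1000):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):
        feats = self.backbone(images)        # (B, N, dim) patch features
        return self.head(feats.mean(dim=1))  # mean-pool, then classify
```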
