CVPR 2025 Highlight | Improving Autoregressive Models' Ability to Learn from Examples: A New Few-Shot Image Editing Paradigm, Now Open-Sourced
机器之心·2025-06-01 03:30

Core Viewpoint - The article presents InstaManip, a new autoregressive model that strengthens in-context learning so it can better handle few-shot image editing [26].

Summary by Sections

Introduction - Recent advances in diffusion models have substantially improved text-guided image editing, but performance drops when a user's request is hard to put into words or falls outside the training data distribution [1][2].

Problem Statement - The difficulty arises when users want edits that are poorly represented in the training set, such as transforming an ordinary car into a Lamborghini, a change that is hard to describe precisely with text alone [1].

Proposed Solution - To address this, the article proposes supplying image examples alongside the text instruction, so the model can learn the desired transformation from a handful of demonstrations, i.e., few-shot image editing [2].

Model Structure and Methodology - InstaManip uses a novel group self-attention mechanism to extract image-transformation features from both the text instruction and the example images, then applies them to edit a new input image [6][15].

Learning Mechanism - The process is split into two stages: a learning stage, in which transferable knowledge is abstracted from the examples, and an application stage, in which that knowledge is applied to the new image [10][11].

Group Self-Attention Mechanism - The model stacks multiple group self-attention layers, which let it process the text instruction and example images separately across the learning and application stages; a minimal sketch of this masking scheme appears after the summary [16].

Relation Regularization - Because noisy example images could mislead the model, a relation regularization technique aligns the similarities among the learned transformation features with similarities derived from the text instructions (see the loss sketch below) [17].

Experimental Results - InstaManip outperforms prior models in both in-distribution and out-of-distribution settings, establishing a new state of the art for few-shot image editing [19][20].

Ablation Studies - Ablations show that the group self-attention mechanism and relation regularization each contribute significant gains, confirming that both components are necessary [21][22].

Conclusion - InstaManip achieves superior results across multiple metrics, and its performance improves further as the number and diversity of example images grow [26].
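To make the two-stage grouped attention concrete, here is a minimal PyTorch sketch. All names (`GroupSelfAttention`, `build_group_mask`), the token-group sizes, and the single bridging set of manipulation tokens are illustrative assumptions, not the paper's released code; the sketch only shows how one attention mask can confine exemplar tokens to a learning group while shared manipulation tokens carry the learned edit into an application group.

```python
import torch
import torch.nn as nn


class GroupSelfAttention(nn.Module):
    """One grouped self-attention layer: ordinary multi-head self-attention
    whose mask splits the token sequence into a learning group and an
    application group (illustrative, not the paper's exact layer)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, allow):
        # `allow` is (seq, seq) boolean, True where attention is permitted;
        # nn.MultiheadAttention expects True at *blocked* positions, hence ~allow.
        out, _ = self.attn(x, x, x, attn_mask=~allow)
        return out


def build_group_mask(n_text, n_exemplar, n_manip, n_query):
    """Learning group: instruction + exemplar + manipulation tokens.
    Application group: manipulation + query-image tokens.
    The manipulation tokens sit in both groups, so they ferry the
    transformation abstracted from the examples over to the new image."""
    n = n_text + n_exemplar + n_manip + n_query
    allow = torch.zeros(n, n, dtype=torch.bool)
    learn_end = n_text + n_exemplar + n_manip
    allow[:learn_end, :learn_end] = True                      # learning stage
    allow[learn_end - n_manip:, learn_end - n_manip:] = True  # application stage
    return allow


# Made-up sizes: 10 instruction tokens, 64 exemplar tokens,
# 4 manipulation tokens, 64 query-image tokens, feature dim 256.
x = torch.randn(2, 10 + 64 + 4 + 64, 256)
layer = GroupSelfAttention(256)
y = layer(x, build_group_mask(10, 64, 4, 64))
```

In the real model such layers are stacked inside an autoregressive transformer; the point of the sketch is only the mask, under which exemplar tokens never see the query image and the query sees the examples only through the manipulation tokens that summarize the edit.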
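The relation regularization can likewise be sketched as a simple auxiliary loss. The cosine-similarity matrices and the MSE penalty below are assumptions chosen for illustration; the summary only specifies that the similarities among the learned transformation features are aligned with similarities derived from the text instructions.

```python
import torch
import torch.nn.functional as F


def relation_regularization(manip_feats, text_feats):
    """manip_feats: (B, D) pooled manipulation features, one per editing example.
    text_feats:  (B, D) embeddings of the matching text instructions.
    Pulls the B x B similarity structure of the learned features toward the
    similarity structure implied by the instructions (illustrative loss)."""
    m = F.normalize(manip_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    sim_m = m @ m.T             # relations the model learned from the images
    sim_t = (t @ t.T).detach()  # reference relations from the text side
    return F.mse_loss(sim_m, sim_t)


# Hypothetical training use; `lambda_rel` is an assumed hyperparameter:
# total_loss = editing_loss + lambda_rel * relation_regularization(mf, tf)
```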