Open-Source All-Around Image Model Rivals GPT-4o: Understanding, Generation, and Editing in One Model, Solving the Diffusion Error Accumulation Problem
量子位·2025-05-11 04:20

Core Viewpoint
- The release of OpenAI's GPT-4o has shifted industry focus toward multimodal models that integrate language modeling with pixel-level image modeling to improve image generation, understanding, and editing [1][2].

Group 1: Model Overview
- Nexus-Gen is a unified model for image understanding, generation, and editing, achieving image quality and editing capability comparable to GPT-4o [1][2].
- Its architecture follows a token → [transformer] → [diffusion] → pixels pipeline, combining the strengths of state-of-the-art (SOTA) MLLMs and diffusion models [2][9]; a minimal pipeline sketch appears right after this summary.
- Unlike previous All-to-All models, which model pixel space directly, Nexus-Gen models images in a high-dimensional feature space, which yields better image quality [8][9].

Group 2: Training and Data
- The training data comprises roughly 25 million samples: about 6 million for image understanding, 12 million for image generation, and 7 million for image editing [17][18].
- The autoregressive component is trained with a three-stage strategy that progressively embeds image generation and editing capabilities into the language model [23].
- The diffusion decoder is trained with a single-stage strategy in which the input condition is switched from text embeddings to image embeddings [24]; see the second sketch below.

Group 3: Error Mitigation
- Nexus-Gen mitigates error accumulation in continuous-feature-space prediction with a prefilled autoregression strategy, which keeps the conditions seen at training and inference consistent [13][14]; see the third sketch below.

Group 4: Future Prospects
- Nexus-Gen retains significant optimization headroom, for example in model fusion training, increasing the number of image tokens, and scaling up datasets and model size [29].
- The ModelScope team plans to open-source all training data, model weights, and the engineering framework to encourage community engagement with All-to-All unified model technology [29][30].
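To make the token → [transformer] → [diffusion] → pixels pipeline concrete, here is a minimal PyTorch-style sketch. All module names, embedding dimensions, and token counts below are illustrative assumptions, not the actual Nexus-Gen implementation; the point is only the division of labor: the transformer predicts continuous image embeddings in feature space, and a diffusion decoder renders them into pixels.

```python
# Minimal sketch of the token -> [transformer] -> [diffusion] -> pixels pipeline.
# Module names, dimensions, and token counts are illustrative assumptions,
# not the actual Nexus-Gen code.
import torch
import torch.nn as nn

D_EMBED = 1024       # dimensionality of the shared image-feature space (assumed)
N_IMG_TOKENS = 81    # number of image embeddings the LLM predicts (assumed)

class AutoregressiveBackbone(nn.Module):
    """Stand-in for the MLLM: maps prompt embeddings to image embeddings."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D_EMBED, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, prompt_embeds):
        # Predict continuous embeddings in feature space rather than raw pixels:
        # this is the "model images in a high-dimensional feature space" step.
        h = self.encoder(prompt_embeds)
        return h[:, :N_IMG_TOKENS, :]

class DiffusionDecoder(nn.Module):
    """Stand-in for the diffusion model that renders embeddings into pixels."""
    def __init__(self):
        super().__init__()
        self.render = nn.Linear(D_EMBED, 3 * 64 * 64)  # toy one-shot "decoder"

    def forward(self, img_embeds):
        pixels = self.render(img_embeds.mean(dim=1))
        return pixels.view(-1, 3, 64, 64)

prompt = torch.randn(1, 128, D_EMBED)            # toy prompt embeddings
img_embeds = AutoregressiveBackbone()(prompt)    # transformer: tokens -> features
image = DiffusionDecoder()(img_embeds)           # diffusion: features -> pixels
print(image.shape)                               # torch.Size([1, 3, 64, 64])
```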
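The diffusion decoder's single-stage training, with the condition switched from text embeddings to image embeddings, can be sketched as a toy noise-prediction step. The denoiser architecture, sizes, and the schedule-free noising step here are simplifications for illustration, not the actual training recipe.

```python
# Toy single-stage training step for the diffusion decoder, with the condition
# input switched from text embeddings to image embeddings. The denoiser
# architecture and the schedule-free noising step are simplifications.
import torch
import torch.nn as nn

D_EMBED = 1024

class ConditionalDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        # The conditioning pathway consumes image embeddings, i.e. the same
        # feature space the autoregressive model writes into.
        self.cond_proj = nn.Linear(D_EMBED, 256)
        self.net = nn.Sequential(
            nn.Linear(3 * 64 * 64 + 256, 512), nn.SiLU(),
            nn.Linear(512, 3 * 64 * 64),
        )

    def forward(self, noisy_pixels, cond_embeds):
        cond = self.cond_proj(cond_embeds.mean(dim=1))        # pool the condition
        x = torch.cat([noisy_pixels.flatten(1), cond], dim=1)
        return self.net(x).view_as(noisy_pixels)

denoiser = ConditionalDenoiser()
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

image = torch.randn(4, 3, 64, 64)         # target images (toy data)
img_embeds = torch.randn(4, 81, D_EMBED)  # condition: image embeddings, not text
noise = torch.randn_like(image)
noisy = image + noise                     # toy forward process (no noise schedule)

pred = denoiser(noisy, img_embeds)
loss = nn.functional.mse_loss(pred, noise)  # standard noise-prediction objective
loss.backward()
opt.step()
opt.zero_grad()
```

Plausibly, conditioning on image embeddings rather than text is what lets one decoder serve generation and editing alike: whatever the language model writes into the shared feature space, the decoder renders.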
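Finally, one plausible reading of the prefilled autoregression strategy: instead of feeding the model its own continuous predictions step by step, where small per-step errors compound, the image positions are prefilled with learnable placeholder embeddings and all image embeddings are predicted under identical conditions at training and inference. The placeholder-query mechanism shown here is an assumption based on the summary's description, not a confirmed implementation detail.

```python
# One plausible reading of the prefilled autoregression strategy: the image
# positions are prefilled with learnable placeholder embeddings and all image
# embeddings are predicted in a single pass, so the model never consumes its
# own noisy continuous predictions. Details are assumptions, not the paper's code.
import torch
import torch.nn as nn

D_EMBED, N_IMG_TOKENS = 1024, 81

class PrefilledDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D_EMBED, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Learnable placeholders that occupy the image positions up front.
        # (The real model is a causal LLM; the causal mask is omitted here.)
        self.img_queries = nn.Parameter(torch.randn(1, N_IMG_TOKENS, D_EMBED) * 0.02)

    def forward(self, prompt_embeds):
        b = prompt_embeds.size(0)
        queries = self.img_queries.expand(b, -1, -1)
        seq = torch.cat([prompt_embeds, queries], dim=1)  # prefill, no step-by-step feed
        h = self.backbone(seq)
        return h[:, -N_IMG_TOKENS:, :]  # all image embeddings in one pass

model = PrefilledDecoder()
prompt = torch.randn(2, 32, D_EMBED)
target = torch.randn(2, N_IMG_TOKENS, D_EMBED)   # ground-truth image embeddings
loss = nn.functional.mse_loss(model(prompt), target)
# Inference runs the exact same prefilled forward pass, so the inputs seen at
# training and inference match and per-step errors cannot accumulate.
```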