Unified Multimodal Models
Is Architecture Decoupling Necessary for Unified Multimodal Models? A New AIA Loss Says No
机器之心· 2025-12-02 05:07
Core Insights
- The rapid development of unified understanding-and-generation models has been hampered by conflicts between visual understanding and generation tasks [2]
- Researchers from CUHK MMLab and Meituan believe unified models will eventually match single-task models in performance, but question whether the prevailing approach of decoupling architectures is truly beneficial [2][3]

Unified Model Intent
- The original intent of unified models is to improve single-task performance through a transparent, rational process of interleaved text-and-image reasoning [3]
- Examples include generating corresponding images while navigating mazes, or drawing auxiliary lines during mathematical problem solving [3]

Architecture Decoupling Issues
- Models like BAGEL require complex pipelines to achieve interleaved reasoning, incurring significant computational overhead and potential information loss [3]
- Despite current performance gains, the researchers warn that these issues may become more pronounced as research progresses [3]

AIA Introduction
- To explain why architecture decoupling improves performance, and to find ways to raise model performance without it, CUHK MMLab and Meituan introduced AIA [5]

Research Findings
- Regardless of how models are decoupled, understanding and generation tasks exhibit a negative correlation at the same network layer [8]
- This indicates that decoupling does not fundamentally resolve the conflict between the two tasks [8]

AIA Loss Design
- The AIA loss explicitly constrains the interaction patterns of unified models during training, using the cross-modal interaction patterns of single-task models as the learning target (a hedged sketch of such a loss follows this summary) [10]

AIA Effectiveness
- Experiments on Emu3 and Janus-Pro show that AIA improves model performance without additional tricks, narrowing the gap to more heavily decoupled models [12]

AIA Training Sensitivity
- The AIA loss converged stably across a wide range of weight settings, particularly for Emu3, whose pre-training knowledge is weaker [17]
- In contrast, Janus-Pro's strong pre-training knowledge made it more sensitive to the AIA loss weight [17]

AIA Advantages
- The AIA loss mitigates the common data-ratio problem, achieving better results with a 1:1 ratio of generation to understanding data, which points to a collaborative optimization effect [19]

Unified Model Training Path
- Dynamically allocating task weights during unified training may be the correct behavior for unified models, suggesting that task conflict is a natural characteristic rather than a problem to be avoided [21]
- Another approach removes task-differentiation cues to force the model to learn a truly unified space, though this increases training difficulty [22]

Future Outlook
- AIA is an initial step in analyzing the principles of unified-model training, and the authors call for more researchers to explore this field [24]
- The theory and architecture of unified models are still immature, and collaborative exploration is needed [24]
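The summary above does not spell out the exact form of the AIA loss, only that it constrains a unified model's cross-modal interaction patterns using single-task models as the target. Below is a minimal, hedged sketch, assuming the "interaction pattern" is summarized by softmax-normalized cross-modal attention maps and that frozen single-task teachers supply the targets; all names, shapes, and the default weight are hypothetical, not taken from the paper.

```python
import torch.nn.functional as F

def interaction_alignment_loss(student_attn, teacher_attn):
    """Hypothetical alignment term: pull the unified (student) model's
    cross-modal attention maps toward those of a frozen single-task teacher.
    Both tensors are assumed to be (batch, heads, queries, keys) and already
    softmax-normalized over the key dimension."""
    return F.kl_div(student_attn.clamp_min(1e-8).log(), teacher_attn,
                    reduction="batchmean")

def unified_training_loss(und_loss, gen_loss,
                          attn_und, attn_und_teacher,
                          attn_gen, attn_gen_teacher,
                          aia_weight=0.1):
    """Combine the usual understanding/generation objectives with the
    alignment term; aia_weight is a made-up default, not a reported value."""
    aia = (interaction_alignment_loss(attn_und, attn_und_teacher)
           + interaction_alignment_loss(attn_gen, attn_gen_teacher))
    return und_loss + gen_loss + aia_weight * aia
```

Consistent with the sensitivity findings above, the relative weight of such a term would presumably need per-model tuning, lighter for a strongly pre-trained base like Janus-Pro and heavier for a base like Emu3.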
The Ultimate Form of RAE? Peking University and Alibaba Propose UniLIP: Extending CLIP to Reconstruction, Generation, and Editing
机器之心· 2025-11-02 08:01
Core Insights
- The article discusses the UniLIP model, which addresses the trade-off between semantic understanding and pixel-detail retention in unified multimodal models [2][4][32]
- UniLIP achieves state-of-the-art (SOTA) results on several benchmarks while matching or slightly improving understanding performance compared with larger models [5][26]

Methodology
- UniLIP uses a two-stage training framework with a self-distillation loss to gain image-reconstruction ability without sacrificing the original understanding performance [4][11]
- The first stage freezes CLIP and aligns the decoder, which learns to reconstruct images from fixed CLIP features [9][11]
- The second stage trains CLIP jointly and applies self-distillation to keep the features consistent while injecting pixel detail (an illustrative sketch of this stage follows the summary) [11][12]

Performance Metrics
- The UniLIP models (1B and 3B parameters) achieved SOTA results on benchmarks such as GenEval (0.90), WISE (0.63), and ImgEdit (3.94) [5][26][27]
- In image reconstruction, UniLIP outperformed previous quantization-based methods and showed clear advantages in generation efficiency [22][24]

Architectural Design
- UniLIP integrates InternVL3 and SANA, using InternViT as the CLIP encoder and a pixel decoder from DC-AE [20]
- The model uses a connector structure that stays consistent with large language models (LLMs) [20]

Training Data
- UniLIP's training data comprises 38 million pre-training samples, 60,000 instruction fine-tuning samples for generation, and 1.5 million editing samples [21]

Image Generation and Editing
- UniLIP performs well on both image generation and editing, scoring highly on benchmarks thanks to its rich feature representation and precise semantic alignment [26][27][30]
- A dual-condition architecture connects the MLLM with diffusion models, ensuring high fidelity and consistency in generated and edited images [18][32]
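The two-stage recipe above (freeze CLIP and train the decoder, then tune CLIP under a self-distillation constraint) is only described at a high level, so the following is a rough sketch of what a second-stage training step could look like; the module names, losses, and weights are assumptions, not UniLIP's actual code.

```python
import copy
import torch
import torch.nn.functional as F

def make_frozen_teacher(clip_encoder):
    """Snapshot the original encoder before fine-tuning starts; it serves as
    the self-distillation anchor."""
    teacher = copy.deepcopy(clip_encoder).eval()
    teacher.requires_grad_(False)
    return teacher

def stage2_step(clip_encoder, frozen_teacher, pixel_decoder, images,
                distill_weight=1.0):
    """Hypothetical UniLIP-style stage-2 step: fine-tune the CLIP encoder and
    pixel decoder jointly, while the frozen copy of the original encoder keeps
    the features close to their pre-trained values (self-distillation)."""
    feats = clip_encoder(images)              # trainable CLIP features
    with torch.no_grad():
        anchor = frozen_teacher(images)       # original, frozen CLIP features

    recon = pixel_decoder(feats)              # reconstruct pixels from features
    recon_loss = F.l1_loss(recon, images)     # inject pixel-level detail
    distill_loss = F.mse_loss(feats, anchor)  # keep semantics consistent

    return recon_loss + distill_weight * distill_loss
```

The distill_weight knob is where the trade-off described in the core insights would live: too weak and pixel detail erodes the original CLIP semantics, too strong and reconstruction quality suffers.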
Say Goodbye to AI "Scribbling Charts"! CUHK Team Releases the First Structured Image Generation and Editing System
量子位· 2025-10-11 09:01
Core Insights
- The article discusses the limitations of current AI models in generating accurate structured images such as charts and graphs, despite their success with natural images [1][2]
- It highlights a significant gap between visual understanding and generation capabilities, which hinders unified multimodal models that can both interpret and create visual content accurately [2][10]

Data Layer
- A dataset of 1.3 million code-aligned structured samples was built so that precise code definitions guarantee the correctness of the corresponding images [11][13]
- The dataset contains executable plotting code covering six categories, with strict alignment between each image and its code [14]

Model Layer
- A lightweight VLM-integration scheme balances structured and natural image generation, combining FLUX.1 Kontext with Qwen-VL for better understanding of structured image inputs [13][15]
- Training follows a three-stage progressive schedule that improves structured image generation while preserving the model's ability to generate natural images [15][16]

Evaluation Layer
- The team introduced StructBench and StructScore as a dedicated benchmark and metric for the accuracy of generated structured images, addressing the shortcomings of existing evaluation methods [17][19]
- StructBench includes 1,714 stratified samples with fine-grained Q&A pairs to verify factual accuracy, while StructScore scores model responses against reference answers (a rough sketch of such a metric follows this summary) [19]

Performance Comparison
- The proposed approach showed clear advantages over existing models, yet the best models reached only about 50% factual accuracy, leaving substantial room for improvement in structured visual generation [21][22]
- The research emphasizes that high-quality, strictly aligned data matters more for model performance than the model architecture itself [22]

Broader Implications
- The work aims to lay a systematic foundation for structured visual generation and to encourage further exploration of this overlooked area [23][25]
- The ultimate goal is to move AI from a beautification tool to a productivity tool that can generate accurate mathematical figures and experimental charts for a range of fields [24][25]
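StructScore is described above only as checking fine-grained Q&A pairs against reference answers, so the snippet below is just a guess at the general shape of such a metric under that assumption; the data fields, the exact-match comparison, and answer_fn are all hypothetical, and the real metric may well use a more forgiving answer matcher.

```python
def struct_score(samples, answer_fn):
    """Hypothetical StructScore-style metric.

    Each sample carries a generated image and a list of fine-grained
    (question, reference_answer) pairs; answer_fn(image, question) returns
    the evaluated model's answer as a string. The score is the fraction of
    questions answered correctly."""
    correct, total = 0, 0
    for sample in samples:
        for question, reference in sample["qa_pairs"]:
            prediction = answer_fn(sample["image"], question)
            correct += int(prediction.strip().lower() == reference.strip().lower())
            total += 1
    return correct / max(total, 1)
```

Exact string matching is the simplest possible checker; a production metric of this kind would more plausibly normalize numbers and units or rely on a judge model.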
Saining Xie and Colleagues Release a Unified Multimodal Model! Replacing the VAE to Achieve Dual SOTA in Image Understanding and Generation, with Code, Weights, and Datasets Fully Open-Sourced
量子位· 2025-05-16 03:39
Core Insights
- The article discusses BLIP3-o, a unified multimodal model that achieves state-of-the-art (SOTA) performance in both image understanding and generation [1][2]

Unified Multimodal Model
- The team introduced a method that uses diffusion Transformers to generate semantically rich CLIP image features, improving training efficiency and generation quality [3]
- A sequential pre-training strategy trains image understanding before image generation, preserving understanding ability while building strong generation ability [3][5]

Model Architecture
- The unified architecture has two parts: image understanding, which encodes images with CLIP, and image generation, in which an autoregressive model produces intermediate visual features [8][9]
- The design explored three variants of the autoregressive-plus-diffusion framework; the CLIP + Flow Matching variant achieved the best alignment scores in evaluations (an illustrative flow-matching sketch follows this summary) [10][13]

Training Strategy
- Comparing joint and sequential training, the study concludes that sequential training offers more flexibility and avoids task interference, allowing the model to focus on image generation [18]
- The model performs strongly across popular image understanding and generation benchmarks [19]

Performance Metrics
- Among the models compared, BLIP3-o reaches a GenEval score of 0.84 and a DPG-Bench score of 81.60, underscoring its strong performance [20]

Open Source and Future Applications
- The code, weights, training scripts, and datasets have been fully open-sourced to support future research [21]
- Ongoing work targets applications such as iterative image editing, visual dialogue, and step-by-step visual reasoning [22]
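The summary reports that the CLIP + Flow Matching variant worked best: a diffusion Transformer is trained to produce CLIP image features rather than pixels, conditioned on the autoregressive backbone's output. The sketch below shows a generic rectified-flow-style flow-matching objective on feature vectors; it is not the released BLIP3-o code, and every name here is a placeholder.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, clip_feats, cond):
    """Generic flow-matching objective on CLIP-feature targets.

    velocity_net(x_t, t, cond) predicts a velocity field; clip_feats are the
    target CLIP image features; cond is conditioning from the autoregressive
    backbone. Uses the common straight-line (rectified-flow) path."""
    noise = torch.randn_like(clip_feats)
    t = torch.rand(clip_feats.size(0), device=clip_feats.device)
    t_ = t.view(-1, *([1] * (clip_feats.dim() - 1)))

    x_t = (1 - t_) * noise + t_ * clip_feats  # point on the noise-to-feature path
    target_velocity = clip_feats - noise      # constant velocity of that path

    pred = velocity_net(x_t, t, cond)
    return F.mse_loss(pred, target_velocity)
```

Because the targets are compact CLIP features rather than pixels, the regression problem is much lower-dimensional than pixel diffusion, which is one plausible reason for the training-efficiency gains the summary mentions.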