Unified Multimodal Models
Tsinghua and Xi'an Jiaotong University jointly open-source CHEERS: a simpler, more efficient route to unified multimodal models
机器之心· 2026-03-26 04:12
Core Insights
- The article introduces CHEERS, a unified multimodal model that integrates understanding and generation tasks within a single end-to-end framework while maintaining simplicity and efficiency [4][9]

Group 1: Current Technical Solutions for Unified Multimodal Models
- Existing approaches to unified multimodal models fall into three main strategies: separating understanding and generation into distinct visual spaces, forcing both tasks into a single semantic space, and attempting to merge heterogeneous features [7]
- CHEERS positions itself between these strategies: rather than forcing all tasks into a single representation or separating them completely, it reorganizes the responsibilities of semantics and details within a unified framework [7][9]

Group 2: CHEERS as a Practical Unified Approach
- CHEERS aims to upgrade from an "understanding model" to a unified "understanding + generation" model with minimal architectural changes, leveraging existing open-source pretrained capabilities [9][10]
- The model employs a unified visual tokenizer, an LLM backbone, and a Cascaded Flow Matching Head to integrate multimodal understanding and image generation into a single end-to-end process [9][10]

Group 3: Handling "Semantics" and "Details" in CHEERS
- CHEERS organizes visual information into two complementary components: semantic tokens for multimodal understanding, and detail residuals that restore high-frequency textures during generation [14]
- The generation phase follows a "semantics first, details later" cascade: the model first generates a global semantic layout, then progressively injects detail information, mimicking how a human artist works [15][16]

Group 4: Performance with Limited Data
- CHEERS achieves competitive results on various benchmarks with only 83 million training samples, significantly less data than comparable models [19]
- The model's efficient reuse of existing pretrained knowledge is highlighted as a key advantage, suggesting that organizing knowledge effectively matters more than merely increasing data volume [19]

Group 5: Conclusions and Future Directions
- CHEERS suggests that unified models need a stable and efficient information interface rather than a single shared visual representation [21]
- The architecture allows joint training of generation objectives without significantly impairing understanding capabilities, suggesting that understanding and generation can coexist beneficially [21]
- The model's 4× token compression is not merely an engineering optimization: it makes high-resolution understanding and generation feasible within realistic computational budgets [21]
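The summary gives no equations for the semantics/details split, but the idea of factoring a patch grid into coarse semantic tokens (with 4× token compression) plus high-frequency detail residuals can be illustrated with a toy NumPy sketch. The pooling scheme and the function name are invented for illustration; CHEERS's actual tokenizer is certainly more sophisticated.

```python
import numpy as np

def split_semantic_detail(patch_feats: np.ndarray, pool: int = 2):
    """Toy illustration of a semantics/details factorization.

    patch_feats: (H, W, C) grid of visual patch features.
    Returns coarse semantic tokens (average-pooled, 4x fewer tokens
    for pool=2) and per-patch detail residuals (the original grid
    minus the upsampled semantic reconstruction).
    """
    H, W, C = patch_feats.shape
    # Coarse semantic tokens: average-pool each pool x pool neighborhood.
    sem = patch_feats.reshape(H // pool, pool, W // pool, pool, C).mean(axis=(1, 3))
    # Upsample semantic tokens back to the full grid (nearest-neighbor repeat).
    up = sem.repeat(pool, axis=0).repeat(pool, axis=1)
    # Detail residuals carry the high-frequency information pooling lost.
    detail = patch_feats - up
    return sem, detail

feats = np.random.randn(8, 8, 16)
sem, detail = split_semantic_detail(feats)
# sem has 4x fewer tokens than the input grid (8*8 = 64 -> 4*4 = 16).
```

By construction, upsampled semantics plus residuals reconstruct the original features exactly, which is one simple way to make the two components "complementary" as the summary describes.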
ICLR 2026 | University of Waterloo and Kling propose UniVideo: unifying video understanding, generation, and editing in one multimodal model
机器之心· 2026-03-05 07:43
Core Insights
- UniVideo demonstrates strong performance in video understanding, generation, and editing within a unified framework, leveraging a dual-stream architecture that combines a multimodal large language model (MLLM) and a multimodal diffusion Transformer (MM-DiT) [2][9][32]
- The model matches or surpasses state-of-the-art (SoTA) performance across various benchmarks without task-specific designs, indicating generalization to unseen tasks and new task combinations [2][24][33]

Model Architecture
- UniVideo consists of two main components: an MLLM for multimodal instruction understanding and semantic reasoning, and an MM-DiT for high-fidelity visual content generation [9][10]
- The dual-stream design provides both a robust semantic foundation and high-quality visual reconstruction, which is crucial for video editing and in-context generation tasks [11]

Unified Multimodal Tasks
- UniVideo integrates multiple video generation and editing tasks into a single multimodal instruction paradigm, enabling flexible task scheduling and generation [12][13]
- The model handles tasks including multimodal understanding (image/video to text), text-to-image/video generation, image-to-video generation, and image/video editing [13][16][20]

Experimental Results
- In quantitative evaluations, UniVideo outperforms task-specific baseline methods across various metrics, achieving superior results in most experimental setups [24][32]
- Its performance on in-context generation and editing is highlighted by competitive scores in identity alignment, video quality, and aesthetic ratings compared to other models [26][27]

Generalization Capabilities
- UniVideo successfully transfers image-editing skills to video editing despite never being explicitly trained on free-form video editing instructions [28]
- The model also generalizes to task combinations not explicitly included during training, showcasing the advantages of a unified multimodal framework [29][33]
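The summary does not spell out how the MLLM stream conditions the MM-DiT stream. A common way to bridge such a pair is cross-attention, where noisy visual latents query the MLLM's instruction tokens; the sketch below assumes that mechanism purely for illustration (all names and shapes are invented, not UniVideo's actual interface).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, cond, wq, wk, wv):
    """One cross-attention step: video latents attend to MLLM tokens,
    then receive the attended semantics through a residual connection."""
    q, k, v = latents @ wq, cond @ wk, cond @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return latents + attn @ v

rng = np.random.default_rng(0)
d = 16
latents = rng.normal(size=(32, d))  # stand-in for noisy video latents (MM-DiT stream)
cond = rng.normal(size=(8, d))      # stand-in for MLLM instruction/semantic tokens
wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = cross_attend(latents, cond, wq, wk, wv)
```

The point of the sketch is only the division of labor: the conditioning tokens carry instruction semantics, while the latent stream carries the pixels-to-be, matching the dual-stream description above.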
Is architectural decoupling necessary for unified multimodal models? The new AIA loss says: No
机器之心· 2025-12-02 05:07
Core Insights
- The rapid development of unified understanding-and-generation models has been hampered by conflicts between visual understanding and generation tasks [2]
- Researchers from CUHK MMLab and Meituan believe unified models will eventually match single-task models, but question whether the current approach of decoupling architectures is truly beneficial [2][3]

Unified Model Intent
- The original intent of unified models is to enhance single-task performance through a transparent, rational process of interleaved text-and-image reasoning [3]
- Examples include generating corresponding images while navigating mazes, or drawing auxiliary lines during mathematical problem-solving [3]

Architecture Decoupling Issues
- Models like BAGEL require complex pipelines to achieve interleaved reasoning, leading to significant computational overhead and potential information loss [3]
- Despite current performance gains, the researchers warn these issues may become more pronounced as research progresses [3]

AIA Introduction
- To explain the performance improvements from architecture decoupling, and to enhance model performance without it, CUHK MMLab and Meituan introduced AIA [5]

Research Findings
- Regardless of how models are decoupled, understanding and generation tasks exhibit a negative correlation at the same network layer [8]
- This indicates that decoupling does not fundamentally resolve the conflict between the tasks [8]

AIA Loss Design
- The AIA loss explicitly constrains the interaction patterns of unified models during training, using the cross-modal interaction patterns of single-task models as the learning target [10]

AIA Effectiveness
- Experiments on Emu3 and Janus-Pro showed that AIA enhances model performance without additional tricks, narrowing the gap to more heavily decoupled models [12]

AIA Training Sensitivity
- The AIA loss converged stably across a wide range of weight settings, particularly for Emu3, whose pre-training knowledge is weaker [17]
- In contrast, Janus-Pro's strong pre-training knowledge made it more sensitive to the AIA loss weight [17]

AIA Advantages
- The AIA loss also mitigates common data-ratio issues, achieving better results with a 1:1 ratio of generation to understanding data, indicating a collaborative optimization effect [19]

Unified Model Training Path
- Dynamically allocating task weights during unified training may be the correct behavior for unified models, suggesting task conflict is a natural characteristic rather than a problem to avoid [21]
- Another approach removes task-differentiation cues to force the model to learn a truly unified space, though this increases training difficulty [22]

Future Outlook
- AIA is an initial step toward analyzing the principles of unified-model training, and the authors call on more researchers to explore this field [24]
- The theory and architecture of unified models remain immature, necessitating collaborative exploration [24]
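The summary says AIA uses single-task models' cross-modal interaction patterns as a learning target, without giving the loss form. One plausible reading is an auxiliary penalty on the text-to-image block of an attention map; the sketch below assumes a simple mean-squared-error form, which is a guess for illustration, not the paper's actual definition.

```python
import numpy as np

def aia_loss(attn_unified: np.ndarray, attn_reference: np.ndarray,
             text_idx: np.ndarray, image_idx: np.ndarray) -> float:
    """Hypothetical attention-interaction alignment penalty.

    Both inputs are (tokens, tokens) attention maps. We extract the
    cross-modal block (text queries attending to image keys) from each
    and penalize the squared difference from the single-task reference.
    """
    cross_u = attn_unified[np.ix_(text_idx, image_idx)]
    cross_r = attn_reference[np.ix_(text_idx, image_idx)]
    return float(((cross_u - cross_r) ** 2).mean())

rng = np.random.default_rng(0)
attn = rng.random((12, 12))
text_idx, image_idx = np.arange(0, 4), np.arange(4, 12)
zero = aia_loss(attn, attn, text_idx, image_idx)  # identical patterns
```

Under this reading, the penalty is zero when the unified model's cross-modal interaction matches the single-task pattern, and in training it would be added to the usual task losses with a tunable weight (the weight sensitivity discussed above).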
The ultimate form of the RAE? Peking University and Alibaba propose UniLIP: extending CLIP to reconstruction, generation, and editing
机器之心· 2025-11-02 08:01
Core Insights
- The article discusses UniLIP, which addresses the trade-off between semantic understanding and pixel-detail retention in unified multimodal models [2][4][32]
- UniLIP achieves state-of-the-art (SOTA) performance on various benchmarks while maintaining, or slightly improving, understanding capability compared to larger models [5][26]

Methodology
- UniLIP employs a two-stage training framework with a self-distillation loss to add image-reconstruction capability without sacrificing the original understanding performance [4][11]
- Stage one aligns the decoder while freezing the CLIP model, learning to reconstruct images from fixed CLIP features [9][11]
- Stage two jointly trains CLIP, applying self-distillation to keep features consistent while injecting pixel details [11][12]

Performance Metrics
- The UniLIP models (1B and 3B parameters) achieved SOTA results on benchmarks such as GenEval (0.90), WISE (0.63), and ImgEdit (3.94) [5][26][27]
- In image reconstruction, UniLIP outperformed previous quantization-based methods and showed significant advantages in generation efficiency [22][24]

Architectural Design
- UniLIP integrates InternVL3 and SANA, using InternViT as the CLIP encoder and the pixel decoder from DC-AE [20]
- The connector structure is kept consistent with large language models (LLMs) [20]

Training Data
- Training data comprises 38 million pre-training samples and 60,000 instruction fine-tuning samples for generation, plus 1.5 million editing samples [21]

Image Generation and Editing
- UniLIP excels at both image generation and editing, scoring highly on benchmarks thanks to its rich feature representation and precise semantic alignment [26][27][30]
- A dual-condition architecture connects the MLLM with diffusion models, ensuring high fidelity and consistency in generated and edited images [18][32]
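The stage-two objective described above, reconstruction plus self-distillation against a frozen copy of the original encoder, can be sketched as a two-term loss. The function name, the MSE form of both terms, and the weighting are assumptions for illustration; UniLIP's exact losses are not given in the summary.

```python
import numpy as np

def unilip_stage2_loss(student_feats, teacher_feats, recon, target,
                       distill_weight=1.0):
    """Toy combined objective: reconstruction + self-distillation.

    teacher_feats come from a frozen copy of the original CLIP encoder;
    the distillation term keeps the fine-tuned encoder's features close
    to it (preserving understanding), while the reconstruction term
    pushes pixel detail into the features.
    """
    recon_loss = ((np.asarray(recon) - np.asarray(target)) ** 2).mean()
    distill_loss = ((np.asarray(student_feats) - np.asarray(teacher_feats)) ** 2).mean()
    return recon_loss + distill_weight * distill_loss

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))   # stand-in CLIP features
pixels = rng.normal(size=(2, 3))  # stand-in decoded pixels / target image
loss = unilip_stage2_loss(feats, feats.copy(), pixels, pixels.copy())
```

The design rationale is the interesting part: by anchoring the trainable encoder to its own frozen pre-trained copy, the model can absorb reconstruction detail without drifting away from the semantics that made CLIP good at understanding in the first place.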
No more AI "scribbled charts": CUHK team releases the first structured image generation and editing system
量子位· 2025-10-11 09:01
Core Insights
- The article discusses the limitations of current AI models in generating accurate structured images such as charts and graphs, despite their success with natural images [1][2]
- It highlights a significant gap between visual understanding and generation capability, which hinders unified multimodal models that can both interpret and create visual content accurately [2][10]

Data Layer
- A dataset of 1.3 million code-aligned structured samples was built, using precise code definitions to guarantee the accuracy of the rendered images [11][13]
- The dataset covers six categories of executable plotting code, with strict alignment between each image and its generating code [14]

Model Layer
- A lightweight VLM integration scheme balances structured and natural image generation, pairing FLUX.1 Kontext with Qwen-VL for better understanding of structured-image inputs [13][15]
- A three-stage progressive training recipe preserves the model's natural-image generation while improving structured-image generation [15][16]

Evaluation Layer
- The team introduced StructBench and StructScore as a dedicated benchmark and metric for the accuracy of generated structured images, addressing the shortcomings of existing evaluation methods [17][19]
- StructBench contains 1,714 stratified samples with fine-grained Q&A pairs for validating factual accuracy, while StructScore scores model responses against reference answers [19]

Performance Comparison
- The proposed system shows clear advantages over existing models, yet even the best models reach only about 50% factual accuracy, leaving substantial room for improvement in structured visual generation [21][22]
- The research emphasizes that high-quality, strictly aligned data matters more for performance than the model architecture itself [22]

Broader Implications
- The work aims to lay a systematic foundation for structured visual generation and to encourage further exploration of this overlooked area [23][25]
- The ultimate goal is to move AI from a beautification tool to a productivity tool that can generate accurate mathematical figures and experimental charts across fields [24][25]
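The Q&A-based scoring idea behind StructScore, asking fine-grained factual questions about a generated chart and checking answers against references, can be sketched in a few lines. The exact protocol and answer-matching rule are not described in the summary, so the strict string-match below is an invented stand-in, not the benchmark's actual definition.

```python
def struct_score(qa_pairs, answer_fn):
    """Toy Q&A-based factual-accuracy metric.

    qa_pairs: list of (question, reference_answer) about one generated
    structured image. answer_fn answers a question (e.g. by querying a
    VLM on the image). Returns the fraction answered exactly right.
    """
    correct = sum(1 for q, gold in qa_pairs
                  if answer_fn(q).strip().lower() == gold.strip().lower())
    return correct / len(qa_pairs)

qa = [("What is the y-axis label?", "Accuracy"),
      ("How many bars are in the chart?", "4")]
lookup = {q: a for q, a in qa}
score = struct_score(qa, lambda q: lookup[q])  # an oracle answerer
```

Scoring per fact, rather than per image, is what makes the ~50% accuracy figure above meaningful: a chart can look plausible overall while still getting half of its verifiable facts wrong.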
Saining Xie and collaborators release a unified multimodal model: replacing the VAE to reach dual SOTA in image understanding and generation, with code, weights, and datasets fully open-sourced
量子位· 2025-05-16 03:39
Core Insights
- The article discusses BLIP3-o, a unified multimodal model that achieves state-of-the-art (SOTA) performance in both image understanding and generation [1][2]

Group 1: Unified Multimodal Model
- The team introduced a new method that uses a diffusion Transformer to generate semantically rich CLIP image features, improving training efficiency and generation quality [3]
- A sequential pre-training strategy was proposed: image-understanding training precedes image-generation training, preserving understanding capability while developing strong generation ability [3][5]

Group 2: Model Architecture
- The unified architecture has two parts: image understanding via CLIP encoding, and image generation via an autoregressive model that produces intermediate visual features [8][9]
- The design explored three variants of the autoregressive-plus-diffusion framework, with the CLIP + Flow Matching approach yielding the best alignment scores in evaluations [10][13]

Group 3: Training Strategy
- Comparing joint and sequential training, the researchers concluded that sequential training offers greater flexibility and avoids task interference, allowing the model to focus on image generation [18]
- The model achieved outstanding performance across popular image-understanding and image-generation benchmarks [19]

Group 4: Performance Metrics
- Among the compared models, BLIP3-o achieved a GenEval score of 0.84 and a DPG-Bench score of 81.60, indicating superior performance [20]

Group 5: Open Source and Future Applications
- The code, weights, training scripts, and datasets have been fully open-sourced to facilitate future research [21]
- Ongoing work targets applications such as iterative image editing, visual dialogue, and step-by-step visual reasoning [22]
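The "CLIP + Flow Matching" idea above, training a diffusion Transformer to denoise in CLIP feature space rather than pixel or VAE space, can be illustrated with one flow-matching training pair. The linear-interpolation form below is the standard rectified-flow recipe, assumed here for illustration; the summary does not confirm BLIP3-o uses exactly this parameterization.

```python
import numpy as np

def flow_matching_pair(clip_feat, rng):
    """Build one flow-matching training example over CLIP feature space.

    The network (a diffusion Transformer, not shown) would take (xt, t)
    and be trained to predict v_target, the constant velocity of the
    straight path from noise to the semantic CLIP feature.
    """
    x1 = np.asarray(clip_feat)          # target: semantic CLIP feature
    x0 = rng.normal(size=x1.shape)      # noise sample
    t = rng.uniform()                   # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1          # point on the straight path
    v_target = x1 - x0                  # velocity the model should predict
    return xt, t, v_target

rng = np.random.default_rng(0)
feat = rng.normal(size=(64,))           # stand-in for one CLIP image feature
xt, t, v = flow_matching_pair(feat, rng)
```

The appeal of this target space, per the summary, is efficiency: CLIP features are far lower-dimensional than pixels and already semantically organized, so the diffusion model has less low-level detail to learn.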