Unified Multimodal Models
The ultimate form of RAE? Peking University & Alibaba propose UniLIP: extending CLIP to reconstruction, generation, and editing
机器之心· 2025-11-02 08:01
The authors of this paper are from Peking University and Alibaba's Tongyi Wanxiang Lab. First author Tang Hao is a PhD student at Peking University (entering class of 2022) with multiple publications at NeurIPS, CVPR, ICCV, and ECCV; his current focus is unified multimodal understanding and generation. His advisor is Wang Liwei, a professor at Peking University's School of Intelligence, who has won best paper awards at NeurIPS 2024 and ICLR 2023. Unified multimodal models require visual representations that balance semantics (for understanding) with detail (for generation and editing). Early VAE-based approaches were limited in understanding because their features carry too little semantics. Recent CLIP-based unified encoders face a trade-off between understanding and reconstruction: directly quantizing CLIP features degrades understanding performance, while training a decoder on top of a frozen CLIP cannot reconstruct precisely because the features lack detail. RAE, for example, reconstructs from a frozen DINOv2 and reaches a PSNR of only 19.23. To resolve this core tension, UniLIP proposes a novel CLIP fine-tuning framework that combines two-stage reconstruction training with a self-distillation loss, achieving excellent image reconstruction without sacrificing the model's original understanding performance. UniLIP can directly replace the existing CLIP module (e.g., InternViT) in an MLLM such as InternVL while maintaining, or even slightly improving, its understanding performance. Unlike RAE, which only ... on ImageNet ...
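The two ingredients named above, reconstruction training plus a self-distillation loss that anchors the fine-tuned encoder to a frozen copy of the original CLIP, can be sketched as a combined objective. This is a minimal illustration of the assumed loss form, not UniLIP's actual implementation; all names and the weighting are hypothetical.

```python
# Hedged sketch of the combined objective described in the summary above.
# Features and pixels are plain lists of floats so the sketch stays
# dependency-free; a real implementation would use tensors.

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def unilip_style_loss(student_feats, frozen_teacher_feats,
                      reconstructed_pixels, original_pixels,
                      distill_weight=1.0):
    """Assumed form of the training objective (illustrative only).

    - reconstruction term: decoder output vs. the input image, which
      teaches the fine-tuned encoder to retain pixel-level detail;
    - self-distillation term: fine-tuned encoder features vs. a frozen
      copy of the original CLIP, which is what keeps understanding
      performance from degrading during reconstruction training.
    """
    recon_loss = mse(reconstructed_pixels, original_pixels)
    distill_loss = mse(student_feats, frozen_teacher_feats)
    return recon_loss + distill_weight * distill_loss
```

In the article's framing, the distillation term is the key difference from RAE-style approaches: the encoder is allowed to move toward better reconstruction, but only as far as the frozen teacher permits.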
Farewell to AI "messing up charts"! CUHK team releases the first structured image generation and editing system
量子位· 2025-10-11 09:01
Core Insights
- The article discusses the limitations of current AI models in generating accurate structured images like charts and graphs, despite their success in creating natural images [1][2]
- It highlights a significant gap between visual understanding and generation capabilities, which hinders the development of unified multimodal models that can both interpret and create visual content accurately [2][10]

Data Layer
- A dataset of 1.3 million code-aligned structured samples was created to ensure the accuracy of generated images through precise code definitions [11][13]
- The dataset includes executable plotting codes covering six categories, ensuring strict alignment between images and their corresponding codes [14]

Model Layer
- A lightweight VLM integration solution was designed to balance the capabilities of structured and natural image generation, utilizing FLUX.1 Kontext and Qwen-VL for enhanced understanding of structured image inputs [13][15]
- The training process involves a three-stage progressive training approach to maintain the model's ability to generate natural images while improving structured image generation [15][16]

Evaluation Layer
- The team introduced StructBench and StructScore as specialized benchmarks and metrics to assess the accuracy of generated structured images, addressing the shortcomings of existing evaluation methods [17][19]
- StructBench includes 1,714 stratified samples with fine-grained Q&A pairs to validate factual accuracy, while StructScore evaluates model responses against standard answers [19]

Performance Comparison
- The proposed solution demonstrated significant advantages over existing models, with the best-performing models achieving factual accuracy around 50%, indicating substantial room for improvement in structured visual generation [21][22]
- The research emphasizes that high-quality, strictly aligned data is crucial for enhancing model performance, more so than the model architecture itself [22]

Broader Implications
- This research aims to lay a systematic foundation for structured visual generation, encouraging further exploration in this overlooked area [23][25]
- The ultimate goal is to transition AI from being merely a beautification tool to a productivity tool capable of generating accurate mathematical images and experimental charts for various fields [24][25]
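The StructScore protocol described in the Evaluation Layer above, checking a generated image against fine-grained Q&A pairs, reduces at its core to an accuracy computation over gold answers. The sketch below is an illustrative simplification, not the benchmark's released code; exact-match comparison stands in for however StructScore actually scores model responses, and all names are assumptions.

```python
# Hedged sketch of a StructScore-style metric: each generated structured
# image is probed with fine-grained questions, and the score is the
# fraction of questions whose answers match the gold standard.

def structscore_accuracy(qa_pairs, model_answers):
    """Fraction of Q&A pairs the generated image satisfies.

    qa_pairs: list of (question, gold_answer) tuples for one image.
    model_answers: answers extracted from the generated image, in the
    same order. Comparison here is case-insensitive exact match; the
    real metric may use a more forgiving judge.
    """
    if not qa_pairs:
        return 0.0
    correct = sum(
        1
        for (_question, gold), pred in zip(qa_pairs, model_answers)
        if pred.strip().lower() == gold.strip().lower()
    )
    return correct / len(qa_pairs)
```

Under this framing, the article's headline result (best models around 50% factual accuracy) means roughly half of such fine-grained checks fail even for the strongest systems.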
Saining Xie and team release a unified multimodal model! Replacing VAE for dual SOTA in image understanding and generation, with code, weights, and dataset fully open-sourced
量子位· 2025-05-16 03:39
白交 | 量子位 (QbitAI)

Saining Xie and team have released the unified multimodal model Blip3-o, which unifies image understanding and generation and sets a new SOTA. Unlike traditional VAE-based representations, they propose a new approach: using a diffusion Transformer to generate semantically rich CLIP image features. This design improves training efficiency while also raising generation quality. A demo is free to try on the web.

Unified multimodal model Blip3-o

In recent multimodal model research, unifying image understanding and generation has drawn increasing attention. Although researchers have studied the design choices for image understanding extensively, the best model architecture and training method for a unified image-generation framework remain underexplored. Against this backdrop, the team saw strong potential in autoregressive and diffusion models for high-quality generation and scalability, and therefore conducted a comprehensive study of unified multimodal models, focusing on image representation, modeling objectives, and training strategies.

Unified architecture

On this foundation, they propose a new unified architecture, likewise consisting of two parts. They further demonstrate that a sequential pretraining strategy for unified models, training on image understanding first and image generation second, has practical advantages: it preserves image-understanding capability while cultivating strong image-generation capability.

Results show that CLIP + Flow Matching, on GenEval and DPG-Be ...
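The article's central design, a diffusion Transformer trained with flow matching to predict CLIP image features rather than VAE latents, can be sketched at the loss level. The sketch below is the standard rectified-flow (linear-path) formulation of flow matching, not code from the Blip3-o release; the velocity model is a placeholder, and treating the target as a CLIP feature vector is the only Blip3-o-specific assumption.

```python
# Hedged sketch of one flow-matching training target on feature vectors.
# x0 is a noise sample; x1 is the target CLIP image feature (per the
# article, the model predicts CLIP features instead of VAE latents).
import random

def flow_matching_loss(x0, x1, velocity_model, t=None):
    """Loss for one sample under the linear (rectified-flow) path.

    The point x_t interpolates between noise and data, and the training
    target is the constant velocity x1 - x0 along that straight path.
    velocity_model(x_t, t) is a placeholder for the diffusion Transformer.
    """
    if t is None:
        t = random.random()                               # t ~ U(0, 1)
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]   # point on path
    v_target = [b - a for a, b in zip(x0, x1)]            # constant velocity
    v_pred = velocity_model(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(x0)
```

Because the target lives in CLIP's semantic feature space rather than pixel or VAE-latent space, the article credits this setup with both faster training and better generation quality.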