Multimodal Understanding and Generation
ICML 2025 | Latest Advances in Multimodal Understanding and Generation: HKUST and Snap Research Release ThinkDiff, Giving Diffusion Models a Brain
机器之心· 2025-07-16 04:21
Core Viewpoint
- The article introduces ThinkDiff, a new method for multimodal understanding and generation that enables diffusion models to perform reasoning and creative tasks with minimal training data and computational resources [3][36].

Group 1: Introduction to ThinkDiff
- ThinkDiff is a collaboration between the Hong Kong University of Science and Technology and Snap Research, aimed at endowing diffusion models with reasoning capabilities using limited data [3].
- The method allows diffusion models to understand the logical relationships between images and text prompts, leading to high-quality image generation [7].

Group 2: Algorithm Design
- ThinkDiff transfers the reasoning capabilities of large vision-language models (VLMs) to diffusion models, combining the strengths of both for improved multimodal understanding [7].
- The architecture aligns VLM-generated tokens with the diffusion model's decoder, enabling the diffusion model to inherit the VLM's reasoning abilities (a minimal sketch of this alignment idea follows this section) [15].

Group 3: Training Process
- Training includes a vision-language pretraining task that aligns the VLM with an LLM decoder, facilitating the transfer of multimodal reasoning capabilities [11][12].
- A masking strategy is employed during training so that the alignment network learns to recover semantics from incomplete multimodal information [15].

Group 4: Variants of ThinkDiff
- ThinkDiff has two variants: ThinkDiff-LVLM, which aligns large-scale VLMs with diffusion models, and ThinkDiff-CLIP, which aligns CLIP with diffusion models for stronger text-image composition [16].

Group 5: Experimental Results
- ThinkDiff-LVLM significantly outperforms existing methods on the CoBSAT benchmark, demonstrating high accuracy and quality in multimodal understanding and generation [18].
- ThinkDiff-LVLM is also notably training-efficient, reaching its best results with only 5 hours of training on 4 A100 GPUs, whereas competing methods require substantially more resources [20][21].

Group 6: Comparison with Other Models
- ThinkDiff-LVLM exhibits capabilities comparable to commercial models such as Gemini in everyday image reasoning and generation tasks [25].
- The method also shows potential in multimodal video generation: by adapting the diffusion decoder, it can generate high-quality videos from input images and text [34].

Group 7: Conclusion
- ThinkDiff represents a significant advance in multimodal understanding and generation, providing a unified model that excels in both quantitative and qualitative evaluations and contributes to research and industrial applications [36].
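To make the alignment idea in Groups 2 and 3 concrete, below is a minimal PyTorch sketch, not the authors' code: a small connector that projects VLM-generated tokens into the diffusion decoder's conditioning space, with optional random masking of input tokens during training. The names `TokenAligner`, `vlm_dim`, `cond_dim`, and `mask_ratio`, and all dimensions, are hypothetical; ThinkDiff's actual network, masking scheme, and losses may differ.

```python
# A minimal sketch of the VLM-to-diffusion alignment idea, assuming PyTorch.
# All names and sizes here are illustrative assumptions, not ThinkDiff's code.
import torch
import torch.nn as nn


class TokenAligner(nn.Module):
    """Maps VLM-generated tokens into the diffusion decoder's text-conditioning
    space via a small transformer, so a frozen diffusion model can consume the
    VLM's reasoning output."""

    def __init__(self, vlm_dim=4096, cond_dim=2048, depth=2, heads=8):
        super().__init__()
        self.proj_in = nn.Linear(vlm_dim, cond_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=cond_dim, nhead=heads, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj_out = nn.Linear(cond_dim, cond_dim)

    def forward(self, vlm_tokens, mask_ratio=0.0):
        # Optionally drop a random fraction of input tokens during training,
        # forcing the aligner to recover semantics from incomplete multimodal
        # input (the masking strategy mentioned in Group 3).
        if self.training and mask_ratio > 0:
            keep = torch.rand(vlm_tokens.shape[:2], device=vlm_tokens.device) > mask_ratio
            vlm_tokens = vlm_tokens * keep.unsqueeze(-1).float()
        x = self.proj_in(vlm_tokens)
        x = self.blocks(x)
        return self.proj_out(x)


# Example: align a batch of 64 VLM tokens to a 2048-d conditioning space.
aligner = TokenAligner().train()
vlm_tokens = torch.randn(2, 64, 4096)       # [batch, tokens, vlm_dim]
cond = aligner(vlm_tokens, mask_ratio=0.3)  # [2, 64, 2048], fed to the diffusion decoder
```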
The Open-Source MetaQuery Arrives! OpenUni Matches BLIP3-o-8B with 1.1B Parameters, with Data and Code Fully Open-Sourced
机器之心· 2025-06-22 04:26
Core Viewpoint
- OpenUni, developed by Nanyang Technological University's S-Lab and SenseTime, is an open-source counterpart of MetaQuery that matches the performance of an 8B model with only 1.1B parameters, releasing all code, weights, and data as open-source resources [1][18].

Architecture and Design
- OpenUni's architecture is simplified, using only 6 connector layers versus 24 in MetaQuery, which significantly reduces complexity [5].
- OpenUni combines 256 learnable queries that extract condition information from user instructions, a frozen InternVL that preserves understanding capability, 6 ViT-style transformer connector layers, and a SANA diffusion model for efficient image generation (a minimal sketch of this pipeline follows this section) [5][6].

Performance Metrics
- OpenUni-B achieves a GenEval score of 0.84, comparable to the BLIP3-o-8B model, while OpenUni-L reaches 0.86, making it the best-performing open-source unified model [15][18].
- On DPG-Bench, OpenUni-L-1024 scores 83.08, surpassing all MetaQuery and BLIP3-o variants [15].

Training Strategy
- Training proceeds in two phases: pre-training on 23 million image-text pairs, followed by fine-tuning on 60,000 image-text pairs [7][9].
- During pre-training the diffusion model is frozen; in the fine-tuning phase it becomes trainable to improve generation quality (see the two-phase sketch after this section) [8][9].

Open Source Contribution
- OpenUni releases a complete set of open-source resources, including model weights, training code, and a 23-million-entry dataset, facilitating community research and innovation [19][20].
- The project aims to offer a clear, reproducible, and extensible baseline implementation for the research community [18].
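As a rough illustration of the pipeline described under "Architecture and Design", the following PyTorch sketch (not OpenUni's actual code) wires 256 learnable queries and a 6-layer ViT-style connector on top of a stand-in for the frozen VLM. Only the query count and connector depth follow the article; the class name `QueryConnector`, the hidden size, the head count, and the way queries read VLM features via self-attention are assumptions.

```python
# A minimal sketch of an OpenUni-style connector, assuming PyTorch.
# The frozen VLM and diffusion model are not included; in the real system
# these would be InternVL and SANA. Dimensions here are assumptions.
import torch
import torch.nn as nn


class QueryConnector(nn.Module):
    """256 learnable queries attend to frozen-VLM hidden states through a
    6-layer ViT-style transformer, producing conditions for the diffusion model."""

    def __init__(self, hidden_dim=1024, num_queries=256, depth=6, heads=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=heads, batch_first=True
        )
        self.connector = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, vlm_hidden):
        # vlm_hidden: [batch, seq, hidden_dim] from the frozen VLM (e.g. InternVL).
        q = self.queries.expand(vlm_hidden.size(0), -1, -1)
        x = torch.cat([q, vlm_hidden], dim=1)   # queries read VLM features via self-attention
        x = self.connector(x)
        return x[:, : self.queries.size(1)]     # keep the 256 query outputs as conditions


connector = QueryConnector()
vlm_hidden = torch.randn(2, 512, 1024)   # placeholder for frozen-VLM hidden states
conditions = connector(vlm_hidden)       # [2, 256, 1024], fed to the diffusion model
print(conditions.shape)
```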
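The two-phase schedule under "Training Strategy" amounts to a freeze/unfreeze switch on the diffusion model. The sketch below uses toy stand-in modules and assumed learning rates; it only illustrates which components are trainable in each phase, not the actual data pipeline or losses.

```python
# Two-phase training switch, a sketch with assumed learning rates.
# The modules below are toy stand-ins so the snippet runs on its own; in
# OpenUni they would be InternVL (frozen VLM), the query/connector module,
# and the SANA diffusion model.
import torch
import torch.nn as nn

vlm = nn.Linear(1024, 1024)
connector = nn.Linear(1024, 1024)
diffusion_model = nn.Linear(1024, 1024)


def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag


# Phase 1: pre-training on ~23M image-text pairs.
# The VLM and diffusion model stay frozen; only the queries/connector are updated.
set_trainable(vlm, False)
set_trainable(diffusion_model, False)
set_trainable(connector, True)
optimizer = torch.optim.AdamW(connector.parameters(), lr=1e-4)

# Phase 2: fine-tuning on ~60K image-text pairs.
# The diffusion model becomes trainable to improve generation quality.
set_trainable(diffusion_model, True)
optimizer = torch.optim.AdamW(
    list(connector.parameters()) + list(diffusion_model.parameters()), lr=1e-5
)
```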