Multimodal Foundation Models
Qwen Team Open-Sources Image Foundation Model Qwen-Image
AI前线 · 2025-09-02 06:52
Author | Anthony Alford  Translator | 明知山

The Qwen team recently open-sourced Qwen-Image, an image foundation model. Qwen-Image supports text-to-image (T2I) generation as well as text-image-to-image (TI2I) editing, and outperformed other models on multiple benchmarks. Qwen-Image uses Qwen2.5-VL to process text input, a variational autoencoder (VAE) to process image input, and a multimodal diffusion transformer (MMDiT) for image generation. The combined model excels at text rendering, supporting both English and Chinese text. The Qwen team evaluated the model on T2I and TI2I benchmarks including DPG, GenEval, GEdit, and ImgEdit, where Qwen-Image achieved the highest overall scores. On image-understanding tasks it does not match purpose-trained models, but its performance is "very close" to theirs. The team also created AI Arena, a comparison site where human evaluators rate pairs of generated images; Qwen-Image currently ranks third, competing against five high-quality closed-source models, including GPT Image 1. According to the Qwen team: Qwen-Image is more than just a ...
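The three-stage layout described above (Qwen2.5-VL for text conditioning, an MMDiT denoiser, and a VAE decoder) can be sketched schematically. The functions below are illustrative stand-ins only, not the actual Qwen-Image modules or API; shapes, step counts, and the toy "denoising" update are all assumptions made for demonstration.

```python
import numpy as np

def encode_text(prompt: str, dim: int = 64) -> np.ndarray:
    """Stand-in for the Qwen2.5-VL text encoder: prompt -> conditioning vector."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(dim)

def mmdit_denoise(latent: np.ndarray, cond: np.ndarray, steps: int = 10) -> np.ndarray:
    """Stand-in for the MMDiT denoiser: iteratively pull a noisy latent
    toward the conditioning signal (a toy update, not real diffusion)."""
    for _ in range(steps):
        latent = latent + 0.1 * (cond - latent)
    return latent

def vae_decode(latent: np.ndarray, hw: int = 8) -> np.ndarray:
    """Stand-in for the VAE decoder: latent vector -> image-shaped array."""
    return np.tanh(latent[: hw * hw].reshape(hw, hw))

cond = encode_text("a cat reading a book")
latent = np.random.default_rng(0).standard_normal(64)
image = vae_decode(mmdit_denoise(latent, cond), hw=8)
print(image.shape)  # (8, 8)
```

The point of the sketch is the data flow: text conditioning enters the denoiser at every step, and the VAE only touches the final latent, which is why the three components can be trained and swapped somewhat independently.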
Apple's Latest Model Runs on iPhones from 5 Years Ago
36Kr · 2025-09-01 11:37
Core Insights
- Apple has made significant advancements in large model development with the introduction of the new multimodal foundation model MobileCLIP2, which features a multimodal reinforcement training mechanism [1][12]
- The model is designed for zero-shot classification and retrieval tasks, with inference latency ranging from 3 to 15 milliseconds and parameter sizes between 50 million and 1.5 billion [1][3]

Model Performance
- MobileCLIP2-B has achieved a 2.2% improvement in zero-shot accuracy on the ImageNet-1k dataset compared to its predecessor [1][11]
- The MobileCLIP2-S4 variant matches the zero-shot accuracy of the larger SigLIP-SO400M/14 model while having only half the parameter count [4][6]

Training Mechanism
- The improved training mechanism integrates enhanced teacher supervision and caption data to boost zero-shot performance [2][9]
- This mechanism allows for direct deployment of multimodal models on mobile and edge devices, ensuring low latency and memory usage [2][8]

Open Source and Developer Support
- All model variants' pre-trained weights and data generation code have been made publicly available, facilitating direct deployment and benchmarking for developers [2][12]
- The data generation code supports distributed scalable processing, enabling developers to create customized datasets for further research and rapid prototyping [8][12]

Technical Details
- The training mechanism effectively distills knowledge from multiple sources into a smaller model, enhancing semantic coverage and reducing computational overhead during training and inference [9][10]
- The integration of teacher models and caption generation has been optimized through a two-phase protocol, significantly improving the model's ability to express image content [11][12]
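Zero-shot classification in a CLIP-style model such as MobileCLIP2 works by embedding an image and the text of each candidate class into a shared space, then picking the class whose text embedding is most similar to the image embedding. The sketch below mimics that matching step with toy random vectors; the embeddings are placeholders, not Apple's released encoders.

```python
import numpy as np

def cosine_sim(vec: np.ndarray, mat: np.ndarray) -> np.ndarray:
    """Cosine similarity between one vector and each row of a matrix."""
    vec = vec / np.linalg.norm(vec)
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return mat @ vec

# Toy embeddings standing in for MobileCLIP2's text and image encoders.
rng = np.random.default_rng(42)
class_names = ["dog", "cat", "car"]
text_embs = rng.standard_normal((3, 16))           # one embedding per class prompt
image_emb = text_embs[1] + 0.05 * rng.standard_normal(16)  # image constructed near "cat"

scores = cosine_sim(image_emb, text_embs)
predicted = class_names[int(np.argmax(scores))]
print(predicted)
```

Because classification is just nearest-neighbor lookup over text embeddings, new classes can be added at inference time by embedding new prompts, with no retraining; this is what makes the low-latency on-device deployment described above practical.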