MEITUAN: Meituan Open-Sources LongCat-Image, an Image Generation Model Focused on Chinese-Language Scenarios and Editing

Core Insights
- Meituan's LongCat team has officially released and open-sourced LongCat-Image, an image generation and editing model with 6 billion (6B) parameters. It is designed to handle both text-to-image generation and natural-language instruction editing within a single unified architecture [1]

Model Architecture
- LongCat-Image uses a hybrid backbone (MM-DiT + Single-DiT) combined with a vision-language model (VLM) conditional encoder. It supports image generation from text prompts and multi-round editing through natural-language instructions [2]
- The model can perform 15 types of editing tasks, including object addition/removal, style transfer, background replacement, and text modification, while keeping image style and lighting consistent across multi-round edits [2]

Chinese Text Rendering Capability
- The model emphasizes Chinese text generation. It can render standard characters, rare characters, and certain calligraphic fonts, and it automatically adjusts font, size, and layout to fit the context [3]

Output Efficiency and Quality
- Meituan claims that LongCat-Image runs inference efficiently on consumer-grade GPUs and generates images with "studio-level" detail, which it attributes to a lightweight model structure and an optimized training strategy [4]

Performance Evaluation
- On image editing benchmarks, LongCat-Image scored 7.60/7.64 (Chinese/English) on GEdit-Bench and 4.50 on ImgEdit-Bench, reaching state-of-the-art (SOTA) levels among open-source models [5]
- The model scored 90.7 in the Chinese text rendering evaluation, and 0.87 on GenEval and 86.8 on DPG-Bench, the two foundational text-to-image tests [5]
- The model is available on GitHub and can be tried through the LongCat app or the web platform (longcat.ai). The open-source release is intended to support the full path from research to commercial applications, and developers are invited to contribute [5]