LongCat APP
Meituan Open-Sources the LongCat-Image Generation Model, Focusing on Chinese-Language Scenarios and Editing Features
Feng Huang Wang· 2025-12-08 07:04
Core Insights
- Meituan's LongCat team has officially released and open-sourced its image generation and editing model, LongCat-Image, which has a parameter scale of 6 billion (6B) and handles both text-to-image generation and natural-language instruction editing through a unified architecture [1]

Model Architecture
- LongCat-Image employs a hybrid backbone (MM-DiT + Single-DiT) that integrates a vision-language model (VLM) conditional encoder, supporting image generation from text prompts and multi-round editing through natural-language instructions [2]
- The model can perform 15 types of editing tasks, including object addition/removal, style transfer, background replacement, and text modification, while keeping image style and lighting consistent across multiple rounds of edits [2]

Chinese Text Rendering Capability
- The model emphasizes support for Chinese text generation, handling standard characters, rare characters, and certain calligraphic fonts, with automatic adjustment of font, size, and layout based on context [3]

Output Efficiency and Quality
- LongCat-Image claims to achieve efficient inference on consumer-grade GPUs and to generate images with "studio-level" detail through a lightweight model structure and optimized training strategies [4]

Performance Evaluation
- On image-editing benchmarks, LongCat-Image scored 7.60/7.64 (Chinese/English) on GEdit-Bench and 4.50 on ImgEdit-Bench, reaching state-of-the-art (SOTA) levels among open-source models [5]
- The model scored 90.7 in the Chinese text rendering evaluation, and 0.87 and 86.8 on the foundational benchmarks GenEval and DPG-Bench, respectively [5]
- The model is available on GitHub, and its functionality can be experienced through the LongCat APP or web platform (longcat.ai); the open-source initiative aims to support the full path from research to commercial application and invites developers to participate in co-construction (a minimal usage sketch follows this summary) [5]
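How such a model is typically driven from Python is not spelled out in the summary above, so here is a minimal, hedged sketch assuming a Hugging Face diffusers-compatible release. The repository id, precision, and sampler settings below are illustrative assumptions rather than documented LongCat-Image parameters; the actual loader lives in the project's GitHub repository.

```python
# Minimal sketch: prompting a diffusers-compatible text-to-image model.
# The repository id is HYPOTHETICAL; check the LongCat-Image GitHub page
# for the real weights and loading entry point.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "meituan-longcat/LongCat-Image",  # hypothetical repo id
    torch_dtype=torch.bfloat16,       # half precision to fit consumer GPUs
).to("cuda")

# A Chinese prompt exercising the text-rendering capability described above.
image = pipe(
    prompt="新年海报,书法字体写着'恭贺新禧',红色背景,金色灯笼",
    num_inference_steps=30,
    guidance_scale=4.5,
).images[0]
image.save("poster.png")
```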
Meituan Announces Open-Source Release of Its Image Generation Model LongCat-Image
Xin Lang Cai Jing· 2025-12-08 05:49
Core Insights
- Meituan's LongCat team has announced the open-source release of the LongCat-Image model, which approaches the capabilities of larger models with a compact parameter count of 6 billion [1][2]
- The model offers a "high-performance, low-threshold, fully open" option for developers and the industry, focusing on text-to-image generation and image editing [1][2]

Model Advantages
- LongCat-Image's core strengths lie in its architectural design and training strategies: a unified architecture for text-to-image generation and image editing, combined with a progressive learning strategy (an illustrative editing sketch follows this summary) [1][2]
- Within its 6 billion parameters, the model balances instruction-following accuracy, image quality, and text rendering capability [1][2]
- Notably, LongCat-Image excels in controllability for image editing, with its performance breakthroughs attributed to a tightly integrated training paradigm and data strategy [1][2]

User-Facing Features
- The LongCat APP has received significant upgrades, introducing a new image-to-image feature and 24 zero-threshold creative templates [1][2]
- These enhancements let ordinary users easily generate posters and retouch portraits, achieving "professional AI creation with zero threshold" [1][2]
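The summaries do not document LongCat-Image's editing interface, so the block below illustrates the general natural-language instruction-editing pattern using diffusers' open-source InstructPix2Pix pipeline as a stand-in. This is an analogy for the interaction style described above, not LongCat-Image's actual API.

```python
# Instruction-based image editing, shown with the InstructPix2Pix pipeline
# as a STAND-IN; LongCat-Image's own editing interface may differ and is
# documented in its repository.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

source = load_image("portrait.png")  # any local path or URL

# One natural-language instruction per call; multi-round editing chains the
# output of one call into the input of the next.
edited = pipe(
    "replace the background with a sunset beach",
    image=source,
    num_inference_steps=20,
    image_guidance_scale=1.5,  # how closely to stick to the source image
).images[0]
edited.save("portrait_edited.png")
```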
Meituan's New Standalone App: No Food Ordering, Just AI
猿大侠· 2025-11-03 04:11
Core Viewpoint
- Meituan has launched the LongCat-Flash-Omni model, which supports multimodal capabilities and has achieved state-of-the-art (SOTA) performance on open-source benchmarks, surpassing competitors such as Qwen3-Omni and Gemini-2.5-Flash [2][4][8]

Group 1: Model Performance
- LongCat-Flash-Omni handles text, image, audio, and video inputs effectively, maintaining high performance across all modalities [3][27]
- The model has 560 billion total parameters, of which only 27 billion are activated, allowing high inference efficiency while retaining a large knowledge base (see the routing sketch after this summary) [4][40]
- It is the first open-source model to achieve real-time interaction across all modalities at current flagship-model performance levels [8][42]

Group 2: User Experience
- Users can try the LongCat model through the LongCat APP and web, which support various input methods including text, voice, and image uploads [9][10]
- The model demonstrates quick response times and smooth interactions, even in complex scenarios, enhancing the user experience [27][28][30]

Group 3: Development Strategy
- Meituan's iterative model-development strategy focuses on speed, specialization, and comprehensive capabilities, aiming to create a robust "world model" that integrates the digital and physical worlds [31][45]
- The company has invested in both software and hardware to achieve deep connections between the digital and physical realms, emphasizing hardware's role in extending software's impact [46][47]

Group 4: Future Outlook
- Meituan's long-term vision includes advancing embodied intelligence and creating a comprehensive robotics framework that connects various service scenarios [57][62]
- The company aims to leverage AI and robotics to transform the retail industry, enhancing efficiency and user experience across its services [60][63]
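The "560 billion total, 27 billion activated" figure is characteristic of a sparse mixture-of-experts (MoE) design, in which a learned router runs only a few expert subnetworks per token. Below is a minimal, generic PyTorch sketch of top-k expert routing; the sizes are illustrative, and this is not LongCat-Flash-Omni's actual implementation.

```python
# Generic top-k mixture-of-experts layer: the full expert bank sits in
# memory, but only k experts run per token, which is how a model can have
# huge total parameters yet a small activated subset.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):             # run only the selected experts
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

layer = TopKMoE()
tokens = torch.randn(8, 1024)
print(layer(tokens).shape)  # torch.Size([8, 1024])
```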
Meituan's New Standalone App: No Food Ordering, Just AI
量子位· 2025-11-03 03:12
Core Viewpoint
- Meituan is leveraging its expertise in delivery services to develop advanced AI models, the latest being LongCat-Flash-Omni, which supports multimodal capabilities and achieves state-of-the-art performance on open-source benchmarks [2][8]

Group 1: Model Performance and Features
- LongCat-Flash-Omni has surpassed models such as Qwen3-Omni and Gemini-2.5-Flash on comprehensive multimodal benchmarks, achieving open-source state-of-the-art status [2]
- The model maintains high performance across individual modalities such as text, image, audio, and video, demonstrating robust capabilities without sacrificing intelligence [3]
- With 560 billion total parameters and only 27 billion active, the model uses a "large total parameters, small active" MoE architecture, ensuring high inference efficiency while retaining extensive knowledge (a back-of-the-envelope calculation follows this summary) [4]

Group 2: User Experience and Accessibility
- LongCat-Flash-Omni is the first open-source model capable of real-time multimodal interaction, significantly enhancing user experience [8]
- The model is available for free on Meituan's LongCat APP and web platform, supporting various input methods including text, voice, and image uploads [9][10]
- Users report a smooth interaction experience, with quick response times and effective handling of complex multimodal tasks [25][26]

Group 3: Development Strategy
- Meituan's iterative model-development strategy focuses on speed, specialization, and comprehensive capabilities, aiming for an AI that can understand and interact with complex real-world scenarios [29][31]
- The company has a clear path for expanding its AI capabilities, moving from basic chatbots to advanced multimodal models and laying the groundwork for a "world model" that deeply understands reality [47][62]
- Meituan's investments in embodied intelligence and robotics are part of a broader strategy to connect the digital and physical worlds, enhancing service efficiency and user experience [42][56]

Group 4: Challenges and Innovations
- Developing multimodal models presents challenges such as high integration difficulty, real-time interaction performance, and training efficiency [33][36]
- LongCat-Flash-Omni addresses these challenges through innovative architectural designs, including a unified end-to-end architecture and progressive training methods that strengthen multimodal capabilities [38][39]
- The model's design allows low-latency real-time interaction, setting it apart from existing models that struggle with responsiveness [36][39]
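As a back-of-the-envelope check on the "large total parameters, small active" claim, the short calculation below separates memory footprint, which scales with total parameters, from per-token compute, which scales with activated parameters. The bf16 precision and the roughly 2-FLOPs-per-active-parameter rule of thumb are assumptions, not figures from the articles.

```python
# Back-of-the-envelope: MoE decouples memory from per-token compute.
# Parameter counts are from the article; bf16 (2 bytes/param) is assumed.
total_params  = 560e9   # all experts must be resident in (sharded) memory
active_params = 27e9    # parameters actually used per generated token

print(f"activated fraction: {active_params / total_params:.1%}")          # ~4.8%
print(f"weight memory (bf16): {total_params * 2 / 1e12:.2f} TB")          # ~1.12 TB
print(f"per-token FLOPs (~2 per active param): {2 * active_params:.2e}")  # ~5.4e10
```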