Meituan Officially Releases and Open-Sources LongCat-Flash-Omni, Ushering in an Era of Full-Modal Real-Time Interaction

Core Viewpoint
- Meituan has upgraded its self-developed LongCat-Flash series with the official release and open-sourcing of LongCat-Flash-Omni, which supports features including online search and voice calls, with video calls to be added later [1][2].

Group 1
- LongCat-Flash-Omni is built on the efficient architecture of the LongCat-Flash series, integrating high-performance multimodal perception and speech reconstruction modules to achieve real-time audio and video interaction, with a total parameter count of 560 billion [1].
- The new model is the industry's first open-source large model to combine full-modal coverage, an end-to-end architecture, and efficient inference at a large parameter scale, matching the capabilities of closed-source models [1][2].
- LongCat-Flash-Omni addresses the industry pain point of inference latency, achieving millisecond-level response times in multimodal tasks through innovative architecture design and engineering optimization [1].

Group 2
- A core challenge in training multimodal models is the pronounced heterogeneity of data distributions across modalities. LongCat-Flash-Omni adopts a progressive early multimodal fusion training strategy, delivering strong performance across all modalities without degrading single-modal performance [2].
- Comprehensive evaluations indicate that LongCat-Flash-Omni has reached state-of-the-art (SOTA) levels among open-source models on multimodal benchmarks such as Omni-Bench and WorldSense, and ranks among the top open-source models in text, image, audio, and video capabilities [2].
- On September 1, Meituan officially released and open-sourced its self-developed large model LongCat-Flash-Chat, the company's first large model offered to the industry and developers as a complete product [2].
