Core Insights
- Meituan's LongCat team has released and open-sourced the LongCat-Video model, achieving state-of-the-art (SOTA) performance in text-to-video and image-to-video generation tasks [1][3].

Group 1: Model Features
- LongCat-Video can generate coherent videos up to 5 minutes long, addressing common issues such as frame drift and color inconsistency found in other models [3][6].
- The model supports 720p resolution at 30 frames per second, using mechanisms such as video-continuation pre-training and block sparse attention to maintain temporal consistency and visual stability [6][9].
- Inference speed has been improved 10.1x through a combination of two-stage coarse-to-fine generation, block sparse attention, and model distillation [6][8].

Group 2: Evaluation and Performance
- In internal evaluations, LongCat-Video was assessed on text alignment, visual quality, motion quality, and overall quality, with a correlation of 0.92 between human and automated scores [8][12].
- The model's visual quality score is nearly on par with Google's Veo3, and it surpasses models such as PixVerse-V5 and Wan2.2 in overall quality [8][12].
- LongCat-Video scored 70.94% on commonsense understanding, ranking first among open-source models, with an overall score of 62.11%, trailing only proprietary models such as Veo3 and Vidu Q1 [12].

Group 3: Future Applications
- The release of LongCat-Video is a significant step for Meituan toward building "world models," which are essential for simulating physical laws and scene logic in AI [3][13].
- Future applications may include autonomous-driving simulation and embodied intelligence, where long-sequence modeling is crucial [13].
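The block sparse attention mentioned above restricts each block of query frames to a limited set of key blocks, avoiding the quadratic cost of full attention over long video sequences. The following is a minimal NumPy sketch of the general idea, assuming a simple local-window sparsity pattern; the function name, block size, and pattern are illustrative assumptions, not LongCat-Video's actual implementation:

```python
import numpy as np

def block_sparse_attention(q, k, v, block=4, window=1):
    """Toy block-sparse attention: each query block attends only to key
    blocks within `window` blocks of itself (local sparsity pattern).
    q, k, v: arrays of shape (seq_len, dim); seq_len % block == 0.
    Illustrative sketch only -- not LongCat-Video's actual kernel."""
    n, d = q.shape
    nb = n // block
    out = np.zeros_like(v)
    for i in range(nb):
        qs = slice(i * block, (i + 1) * block)
        lo = max(0, i - window) * block          # first visible key index
        hi = min(nb, i + window + 1) * block     # one past last visible key
        scores = q[qs] @ k[lo:hi].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)       # softmax over visible keys only
        out[qs] = w @ v[lo:hi]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
out = block_sparse_attention(q, k, v)
print(out.shape)  # (16, 8)
```

With 4 blocks of 4 tokens each, every query block here computes scores against at most 3 neighboring key blocks instead of all 16 keys; for minutes-long 30 fps video the savings grow with sequence length.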
Meituan releases and open-sources a video generation model: some metrics rival Google's most advanced model, Veo3