Core Insights
- Meituan has taken a significant step toward developing a "World Model" by releasing the LongCat-Video video generation model, aiming to better connect the "atomic world" and the "bit world" [1][2]

Group 1: LongCat-Video Model Features
- LongCat-Video is built on the Diffusion Transformer (DiT) architecture and supports three core tasks: text-to-video, image-to-video, and video continuation, forming a complete task loop within a single model and requiring no additional model adaptation [5]
- The model can generate coherent videos up to 5 minutes long without quality degradation, addressing industry pain points such as color drift and motion discontinuity while preserving temporal consistency and physically plausible motion [5][6]
- Through a three-tier optimization approach, LongCat-Video achieves a 10.1x improvement in video inference speed, balancing efficiency and quality [6]

Group 2: Performance and Evaluation
- LongCat-Video reaches state-of-the-art (SOTA) performance among open-source video generation models, with a comprehensive evaluation covering text alignment, image alignment, visual quality, motion quality, and overall quality [5][9]
- The 13.6-billion-parameter model shows clear advantages in key metrics such as text-video alignment and motion continuity, and performs strongly on public benchmarks such as VBench [9]
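The claim that one model covers text-to-video, image-to-video, and video continuation "without additional model adaptation" can be pictured as a single generation entry point whose task is determined by how many conditioning frames the caller supplies. The sketch below is a hypothetical toy illustration of that unified-conditioning idea, not LongCat-Video's actual API; all function names and shapes here are invented for the example.

```python
import numpy as np

def infer_task(cond_frames):
    """Classify the generation task from the conditioning frames supplied.

    Hypothetical convention illustrating one-model/three-tasks:
      0 conditioning frames  -> text-to-video
      1 conditioning frame   -> image-to-video
      >1 conditioning frames -> video continuation
    """
    n = 0 if cond_frames is None else len(cond_frames)
    if n == 0:
        return "text-to-video"
    if n == 1:
        return "image-to-video"
    return "video-continuation"

def generate_video(prompt, cond_frames=None, num_frames=16, h=32, w=32):
    """Toy stand-in for a DiT denoiser: keeps any conditioning frames as a
    prefix and appends placeholder frames, so all three tasks flow through
    the same entry point with no per-task model changes."""
    task = infer_task(cond_frames)
    prefix = list(cond_frames) if cond_frames else []
    generated = [np.zeros((h, w, 3)) for _ in range(num_frames - len(prefix))]
    return task, prefix + generated

# The same function serves all three tasks, varying only its inputs.
frame = np.ones((32, 32, 3))
t1, v1 = generate_video("a cat running")               # no frames: text-to-video
t2, v2 = generate_video("a cat running", [frame])      # one frame: image-to-video
t3, v3 = generate_video("a cat running", [frame] * 8)  # many frames: continuation
```

Under this framing, "video continuation" is just the many-frame case of the same conditioning mechanism, which is why long clips can be extended chunk by chunk without a separate model.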
Meituan open-sources LongCat-Video, enabling efficient long-video generation and taking a first step toward exploring a "World Model"