Core Insights
- Meituan officially announced the release of the LongCat-Video video generation model, built on the Diffusion Transformer architecture and supporting three core tasks: text-to-video, image-to-video, and video continuation [1]

Model Features
- LongCat-Video generates high-definition video at 720p resolution and 30 frames per second, and can produce coherent video content up to 5 minutes long [1]
- The model addresses common issues in long-video generation, such as frame breaks and quality degradation, maintaining temporal consistency and motion plausibility through video-continuation pre-training and a block sparse attention mechanism [1]

Efficiency and Performance
- The model combines two-stage generation, block sparse attention, and model distillation, reportedly achieving more than a 10x improvement in inference speed [1]
- With 13.6 billion parameters, LongCat-Video has demonstrated strong text alignment and motion continuity on public benchmarks such as VBench [1]

Future Applications
- As part of Meituan's effort to build a "world model," LongCat-Video may find applications in scenarios requiring long-horizon sequence modeling, such as autonomous-driving simulation and embodied intelligence [1]
- The release marks a significant advance for Meituan in video generation and physical-world simulation [1]
Meituan Releases LongCat-Video Video Generation Model: Can Output 5-Minute-Long Videos
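LongCat-Video's actual implementation is not described in this summary, but the block sparse attention idea it names can be illustrated generically: each query block attends only to key blocks in a local window rather than to the full sequence, cutting the quadratic attention cost. The following is a minimal NumPy sketch of that generic pattern; all function names and parameters are illustrative assumptions, not LongCat-Video's API.

```python
import numpy as np

def block_sparse_mask(seq_len, block_size, window):
    # Illustrative local block-sparse pattern: each query block may
    # attend only to key blocks within `window` blocks of itself.
    n_blocks = seq_len // block_size
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for qb in range(n_blocks):
        lo = max(0, qb - window)
        hi = min(n_blocks, qb + window + 1)
        for kb in range(lo, hi):
            mask[qb * block_size:(qb + 1) * block_size,
                 kb * block_size:(kb + 1) * block_size] = True
    return mask

def block_sparse_attention(q, k, v, block_size=4, window=1):
    # Standard scaled dot-product attention, except that scores
    # outside the block-sparse mask are set to -inf before softmax,
    # so only a banded fraction of the score matrix is ever used.
    seq_len, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)
    mask = block_sparse_mask(seq_len, block_size, window)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With a fixed window, the number of attended key blocks per query block is constant, so cost grows linearly rather than quadratically with sequence length, which is the property that makes long-video (multi-minute) generation tractable.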