Meituan Officially Releases and Open-Sources LongCat-Video, Lifting Video Inference Speed by 10.1x

Core Insights
- The LongCat team at Meituan has released and open-sourced the LongCat-Video video generation model, which achieves state-of-the-art (SOTA) performance on the foundational tasks of text-to-video and image-to-video generation and shows a clear advantage in long-video generation [1][2]
- The model is positioned as a key step toward building "world models," regarded as essential for the next generation of artificial intelligence because they allow AI to understand and simulate the real world [1]

Technical Features
- LongCat-Video is built on a Diffusion Transformer architecture and supports three core tasks within a single model: text-to-video with no conditioning frames, image-to-video with one reference frame, and video continuation conditioned on multiple preceding frames, forming a complete task loop (see the sketch after these sections) [2]
- The model can generate stable 5-minute videos without quality degradation, addressing industry pain points such as color drift and motion discontinuity while preserving temporal consistency and physically plausible motion [2]
- LongCat-Video applies a three-tier optimization strategy of coarse-to-fine (C2F) generation, block-sparse attention (BSA), and model distillation to raise video inference speed by 10.1x while keeping a balance between efficiency and quality [2]

Performance Evaluation
- Evaluation covers both internal and public benchmarks for the text-to-video and image-to-video tasks, scored along dimensions including text alignment, image alignment, visual quality, motion quality, and overall quality [3]
- With 13.6 billion parameters, LongCat-Video achieves SOTA performance among open-source models on both text-to-video and image-to-video tasks, with clear advantages on key metrics such as text alignment and motion coherence [3]
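As a rough illustration of the task loop described under Technical Features, the sketch below shows how a single generation call can cover all three tasks purely through the number of conditioning frames, and how chaining continuation calls yields a long video. This is a minimal, hypothetical sketch: the function names, frame shapes, and segment length are placeholders and do not reflect LongCat-Video's actual API.

```python
"""Toy sketch of a unified video-generation interface.

len(cond_frames) == 0  -> text-to-video (no conditioning frames)
len(cond_frames) == 1  -> image-to-video (one reference frame)
len(cond_frames) >= 2  -> video continuation (multiple preceding frames)
"""
from typing import List
import numpy as np

FRAME_SHAPE = (64, 64, 3)   # toy resolution for the sketch
SEGMENT_LEN = 16            # frames produced per model call (hypothetical)


def generate_segment(prompt: str, cond_frames: List[np.ndarray]) -> List[np.ndarray]:
    """Stand-in for one call to a diffusion-transformer video model."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    # A real model would denoise latents conditioned on the text prompt and
    # the given frames; here we just emit placeholder frames that drift
    # slightly from the last conditioning frame (or from noise if none).
    base = cond_frames[-1] if cond_frames else rng.random(FRAME_SHAPE)
    return [np.clip(base + 0.01 * i, 0.0, 1.0) for i in range(SEGMENT_LEN)]


def generate_long_video(prompt: str, num_segments: int, context: int = 4) -> List[np.ndarray]:
    """Build a long video by repeatedly continuing from the last `context` frames."""
    video = generate_segment(prompt, [])                     # start with text-to-video
    for _ in range(num_segments - 1):
        video += generate_segment(prompt, video[-context:])  # continuation calls
    return video


if __name__ == "__main__":
    frames = generate_long_video("a cat walking through a market", num_segments=4)
    print(f"generated {len(frames)} frames of shape {frames[0].shape}")
```

The point of the sketch is the dispatch rule: one model, one entry point, and the conditioning-frame count alone decides whether the call behaves as text-to-video, image-to-video, or continuation, which is what allows long videos to be produced by chaining continuation calls.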