Core Insights
- Meituan's LongCat team has released and open-sourced the video generation model LongCat-Video, which supports text-to-video, image-to-video, and video continuation within a unified architecture, achieving leading results on internal and public benchmarks, including VBench [2][8]

Group 1: Model Performance
- LongCat-Video achieved a total score of 62.11% on the VBench 2.0 benchmark, with notable scores in creativity (54.73%), commonsense (70.94%), controllability (44.79%), and human fidelity (80.20%) [5][6]
- The model is based on the Diffusion Transformer (DiT) architecture and can generate videos several minutes long while maintaining cross-frame temporal consistency and physically plausible motion [6][8]

Group 2: Technical Features
- LongCat-Video differentiates tasks by "conditional frame count": text-to-video uses no input frames, image-to-video uses one reference frame, and video continuation uses multiple preceding frames (illustrated in the first sketch below) [6]
- The model incorporates block sparse attention (BSA) and a conditional token caching mechanism to reduce inference redundancy, achieving a speedup of roughly 10.1x over the baseline in high-resolution, high-frame-rate scenarios (see the second sketch below) [6]

Group 3: Model Specifications
- The base model of LongCat-Video has approximately 13.6 billion parameters, with evaluations covering text alignment, image alignment, visual quality, motion quality, and overall quality [6]
- The release is positioned as a step in the exploration of the "World Model" direction, with all related code and models made publicly available [8]
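To make the "conditional frame count" idea concrete, the following is a minimal sketch of how a single diffusion-transformer backbone could dispatch all three tasks purely by the number of conditioning frames it receives. All names (VideoDiT, denoise, generate), shapes, and parameters are hypothetical illustrations under that assumption, not the released LongCat-Video API.

```python
# Hypothetical sketch: one backbone, three tasks, selected by conditional frame count.
import torch

class VideoDiT(torch.nn.Module):
    """Stand-in for a diffusion-transformer video backbone (hypothetical)."""
    def denoise(self, noisy_frames, cond_frames, text_emb):
        # A real model would iteratively denoise `noisy_frames` while attending
        # to `cond_frames` and `text_emb`; this stub just returns the input.
        return noisy_frames

def generate(model, text_emb, cond_frames=None, num_new_frames=16,
             frame_shape=(3, 64, 64)):
    """Select text-to-video / image-to-video / continuation by the number
    of conditioning frames passed in ("conditional frame count")."""
    if cond_frames is None:
        cond_frames = torch.empty(0, *frame_shape)      # 0 frames -> text-to-video
    noisy = torch.randn(num_new_frames, *frame_shape)   # frames to be generated
    return model.denoise(noisy, cond_frames, text_emb)

model = VideoDiT()
text_emb = torch.randn(1, 77, 1024)                     # placeholder text embedding
t2v  = generate(model, text_emb)                                          # no input frames
i2v  = generate(model, text_emb, cond_frames=torch.randn(1, 3, 64, 64))  # one reference frame
cont = generate(model, text_emb, cond_frames=torch.randn(8, 3, 64, 64))  # preceding clip
```

The point of the sketch is that no separate model heads are needed: the same denoising call covers all three tasks, which is consistent with the unified-architecture claim in the summary above.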
Meituan officially releases and open-sources LongCat-Video, supporting efficient long-video generation
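The efficiency claim rests on the block sparse attention and conditional token caching described in Group 2. The second sketch below shows a generic form of block-sparse attention in which each query block attends only to its top-k most similar key blocks; it illustrates the general technique, not LongCat-Video's actual BSA implementation, and the function name, block size, and top-k values are assumptions.

```python
# Generic block-sparse attention sketch (not LongCat-Video's implementation):
# each query block attends to a small subset of key blocks instead of all keys.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=64, topk=4):
    # q, k, v: (seq_len, dim); seq_len must be divisible by `block` in this sketch.
    n, d = q.shape
    nb = n // block
    qb = q.view(nb, block, d)
    kb = k.view(nb, block, d)
    vb = v.view(nb, block, d)

    # Score each query block against each key block using mean-pooled tokens,
    # then keep only the top-k key blocks per query block.
    q_mean, k_mean = qb.mean(dim=1), kb.mean(dim=1)            # (nb, d)
    block_scores = q_mean @ k_mean.T                           # (nb, nb)
    keep = block_scores.topk(min(topk, nb), dim=-1).indices    # (nb, topk)

    out = torch.empty_like(qb)
    for i in range(nb):
        ks = kb[keep[i]].reshape(-1, d)                        # selected key tokens
        vs = vb[keep[i]].reshape(-1, d)
        attn = F.softmax(qb[i] @ ks.T / d ** 0.5, dim=-1)      # dense attention within the subset
        out[i] = attn @ vs
    return out.view(n, d)

x = torch.randn(512, 128)
y = block_sparse_attention(x, x, x)   # (512, 128); each of 8 query blocks attends to 4 key blocks
```

Restricting each query block to a fixed number of key blocks makes the attention cost grow roughly linearly with sequence length rather than quadratically, which is the kind of saving that matters most for long, high-resolution, high-frame-rate video token sequences.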
