Meituan's Video Generation Model Is Here, and It Debuts as Open-Source SOTA

Core Viewpoint
- Meituan has released LongCat-Video, an open-source video generation model that supports text-to-video and image-to-video generation and marks a significant advance in video generation technology [1][39].

Group 1: Model Features
- LongCat-Video has 13.6 billion parameters and can generate videos up to five minutes long, demonstrating a strong understanding of real-world physics and semantics [1][12][39].
- The model generates 720p, 30fps video with strong semantic understanding and visual presentation, ranking among the best open-source models [18][62].
- It keeps generated videos consistent over time, addressing challenges such as detail capture and complex lighting effects [19][24].

Group 2: Technical Innovations
- LongCat-Video unifies three tasks (text-to-video, image-to-video, and video continuation) in a single Diffusion Transformer framework [41].
- The model is pre-trained directly on the video continuation task, which mitigates the accumulation of errors in long video generation [46][48].
- It uses block sparse attention and a coarse-to-fine generation paradigm to improve generation efficiency [52][53].

Group 3: Performance Evaluation
- In internal benchmarks, LongCat-Video outperformed models such as PixVerse-V5 and Wan2.2-T2V-A14B in overall quality, with strong results in visual quality and motion quality [62][63].
- The model achieved the top score on the common-sense dimension, indicating a superior ability to model the physical world [64].

Group 4: Broader Context
- This is not Meituan's first venture into AI; the company has previously released several models, including LongCat-Flash-Chat and LongCat-Flash-Thinking, reflecting its commitment to AI innovation [65][68].