Core Viewpoint
- Meituan's LongCat team has officially released and open-sourced the LongCat-Video-Avatar model, which strengthens its virtual human video generation capabilities [1]

Group 1: Model Features
- LongCat-Video-Avatar is built on the previously open-sourced LongCat-Video base model and supports video generation from audio, text, or images, as well as video continuation [1]
- The model significantly improves action realism, the stability of long video generation, and identity consistency [1]

Group 2: Technical Innovations
- The model employs a "decoupled unconditional guidance" technique that lets virtual humans show natural behaviors such as blinking and posture adjustments during speech pauses (illustrated conceptually in the first sketch below) [1]
- To counter the quality degradation common in long video generation, the team introduced a "cross-segment latent space stitching" strategy intended to prevent the cumulative error caused by repeated encoding and decoding; Meituan claims the model can generate videos up to 5 minutes long while keeping the visuals stable (see the second sketch below) [1]

Group 3: Performance Metrics
- For identity consistency, the model injects reference frames with positional encoding and uses a "reference jump attention" mechanism to preserve character traits while reducing motion stiffness [1]
- In evaluations on public datasets such as HDTF and CelebV-HQ, the model reaches advanced levels of lip-sync accuracy and consistency metrics, and it shows leading performance in comprehensive tests covering commercial promotion and educational scenarios [1]
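The release does not spell out how "decoupled unconditional guidance" is formulated. One common way to decouple guidance in an audio-driven diffusion model is to run separate unconditional and partially conditional passes and combine them with independent scales, so the fully unconditional prediction (idle behavior such as blinking) still contributes when the audio signal is silent. The Python sketch below is purely illustrative under that assumption; the model interface, argument names, and guidance scales are hypothetical and not taken from LongCat's code.

```python
import torch

def decoupled_cfg(model, x_t, t, audio_emb, ref_emb,
                  w_audio: float = 3.0, w_ref: float = 2.0):
    """Illustrative decoupled classifier-free guidance step for an
    audio-driven avatar diffusion model (hypothetical interface).

    Instead of a single conditional/unconditional pair, the audio and
    reference-identity conditions each get their own guidance term, so
    the fully unconditional prediction (natural idle motion) is not
    drowned out during speech pauses.
    """
    # Fully unconditional pass: no audio, no reference identity.
    eps_uncond = model(x_t, t, audio=None, ref=None)
    # Reference-only pass: identity kept, audio dropped.
    eps_ref = model(x_t, t, audio=None, ref=ref_emb)
    # Full conditional pass: identity and audio together.
    eps_full = model(x_t, t, audio=audio_emb, ref=ref_emb)

    # Each condition is guided with its own scale ("decoupled").
    return (eps_uncond
            + w_ref * (eps_ref - eps_uncond)
            + w_audio * (eps_full - eps_ref))
```

Similarly, "cross-segment latent space stitching" is only described at a high level: later segments are conditioned on earlier content without round-tripping through pixel space, and decoding happens once at the end. The conceptual sketch below illustrates that idea; the denoiser and VAE interfaces, shapes, and overlap handling are assumptions for illustration, not LongCat's actual pipeline.

```python
import torch

def generate_long_video(denoiser, vae, audio_chunks,
                        frames_per_segment: int = 48, overlap: int = 8):
    """Conceptual sketch of cross-segment latent stitching
    (hypothetical interfaces throughout).

    Each new segment is conditioned on the last `overlap` latent frames
    of the previous segment, and decoding to pixels happens only once at
    the end, so VAE encode/decode round-trips do not accumulate error.
    Assumes the sampler re-emits the conditioning prefix at the start of
    every continued segment.
    """
    latent_segments = []
    prefix = None  # latent frames carried over from the previous segment

    for audio in audio_chunks:
        # Sample a segment in latent space, conditioned on the carried
        # latent prefix instead of re-encoded pixels.
        latents = denoiser.sample(num_frames=frames_per_segment,
                                  audio=audio, latent_prefix=prefix)
        # Drop the re-emitted prefix frames; keep only the new frames.
        new_frames = latents if prefix is None else latents[overlap:]
        latent_segments.append(new_frames)
        # Carry the segment's tail forward as the next prefix.
        prefix = latents[-overlap:]

    # A single decode at the very end avoids repeated encode/decode drift.
    return vae.decode(torch.cat(latent_segments, dim=0))
```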
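In both sketches the gain is architectural rather than numerical: keeping the unconditional term and the latent history explicit is what the summary credits for natural pauses and stable 5-minute generations, and the code above only names those two ideas, not their actual implementation.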
Meituan releases and open-sources LongCat-Video-Avatar, with a focus on improved action realism