Brand-New Sparse Attention Optimization! Decoding the Core Technology of Tencent's Latest Ultra-Lightweight Video Generation Model HunyuanVideo 1.5
量子位 · 2025-11-26 09:33
Core Insights
- Tencent's HunyuanVideo 1.5 has been officially released and open-sourced: a lightweight video generation model built on the Diffusion Transformer (DiT) architecture with 8.3 billion parameters, capable of generating 5-10 seconds of high-definition video [1][2].

Model Capabilities
- The model supports both text-to-video and image-to-video generation, maintains high consistency between source images and generated videos, and accurately follows diverse instructions covering scene composition, camera movements, and character emotions [5][7].
- It natively generates 480p and 720p HD video, and an optional super-resolution model upscales output to 1080p cinematic quality; inference runs on consumer-grade graphics cards with 14GB of memory, keeping the model accessible to developers and creators [6].

Technical Innovations
- HunyuanVideo 1.5 balances generation quality, performance, and model size through multi-layered technical innovations organized as a two-stage framework [11].
- The first stage uses the 8.3B-parameter DiT model for multi-task learning; the second stage refines visual quality with a video super-resolution model (a runnable pipeline sketch follows at the end of this digest) [12].
- The lightweight, high-performance architecture compresses the model substantially, delivering leading generation results from a comparatively small parameter count [12].
- An innovative sparse attention mechanism, SSTA (Selective and Sliding Tile Attention), cuts the computational cost of attention over long video sequences, improving generation efficiency 1.87x over FlashAttention3 (see the attention sketch below) [15][16].

Training and Optimization
- A multimodal large model serves as the text encoder, strengthening multi-modal understanding and improving the accuracy of text elements rendered in generated video [20].
- A full-link training optimization strategy covers the entire process from pre-training to post-training, improving motion coherence and aesthetic quality [20].
- Reinforcement learning strategies are tailored separately to image-to-video (I2V) and text-to-video (T2V) tasks to correct artifacts and improve motion quality (a reward-weighted sketch follows below) [23][24].

Use Cases
- Generated examples include cinematic scenes such as a bustling Tokyo intersection and a cyberpunk-themed street corner, demonstrating the model's ability to produce visually appealing and contextually rich content [29][30].
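
To make the two-stage framework concrete, below is a minimal, runnable sketch of a generate-then-upsample pipeline. TinyDiT, TinyDecoder, and TinySR are placeholder stand-ins (the real components are the 8.3B DiT, a VAE decoder, and the video super-resolution network), and the latent shapes, step count, and Euler-style update are illustrative assumptions, not HunyuanVideo 1.5's actual API.

```python
# Sketch of the two-stage framework: stage 1 denoises video latents with a
# DiT, stage 2 optionally upscales the decoded frames with a super-resolution
# model. All modules here are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyDiT(nn.Module):
    def forward(self, latents, t, text_emb):
        return torch.zeros_like(latents)  # placeholder noise prediction


class TinyDecoder(nn.Module):
    def forward(self, latents):  # (B, C, T, h, w) -> (B, 3, T, 8h, 8w)
        b, _, t, h, w = latents.shape
        return torch.zeros(b, 3, t, 8 * h, 8 * w)  # placeholder VAE decode


class TinySR(nn.Module):
    def forward(self, video):  # 1.5x spatial upscale, e.g. 720p -> 1080p
        b, c, t, h, w = video.shape
        frames = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        up = F.interpolate(frames, scale_factor=1.5, mode="bilinear",
                           align_corners=False)
        return up.reshape(b, t, c, *up.shape[-2:]).permute(0, 2, 1, 3, 4)


@torch.no_grad()
def generate(dit, decoder, sr, text_emb, steps=50, upscale=True):
    # Stage 1: iterative denoising of a video latent conditioned on text.
    latents = torch.randn(1, 16, 8, 90, 160)  # assumed ~720p latent shape
    for t in torch.linspace(1.0, 1.0 / steps, steps):
        noise_pred = dit(latents, t, text_emb)
        latents = latents - noise_pred / steps  # simplified Euler update
    video = decoder(latents)
    # Stage 2: optional super-resolution pass over the decoded frames.
    return sr(video) if upscale else video


out = generate(TinyDiT(), TinyDecoder(), TinySR(), torch.zeros(1, 77, 768))
print(out.shape)  # torch.Size([1, 3, 8, 1080, 1920])
```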
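
The public description names SSTA as combining selective and sliding tile attention; the sketch below is a simplified reconstruction of that idea, not the released implementation. Each query tile attends to a sliding window of neighboring tiles plus the top-k tiles ranked by pooled query-key similarity. A production kernel would skip masked tiles entirely (which is where the speedup over FlashAttention3 comes from) rather than apply a dense mask as this illustration does; tile size, window, and top-k values are assumptions.

```python
# Simplified tile-sparse attention in the spirit of SSTA: a sliding window
# over neighboring tiles plus top-k selectively retained tiles.
import torch
import torch.nn.functional as F


def ssta_attention(q, k, v, tile=64, window=1, topk=4):
    # q, k, v: (batch, heads, seq_len, head_dim); seq_len divisible by tile.
    b, h, n, d = q.shape
    nt = n // tile  # number of tiles along the flattened video sequence

    # Tile-level summaries via mean pooling, used to score tile pairs cheaply.
    q_t = q.view(b, h, nt, tile, d).mean(dim=3)           # (b, h, nt, d)
    k_t = k.view(b, h, nt, tile, d).mean(dim=3)
    scores = q_t @ k_t.transpose(-1, -2) / d ** 0.5       # (b, h, nt, nt)

    # Sliding component: always keep tiles within `window` of the query tile.
    idx = torch.arange(nt)
    sliding = (idx[:, None] - idx[None, :]).abs() <= window
    keep = sliding.expand(b, h, nt, nt).clone()           # (b, h, nt, nt)

    # Selective component: additionally keep the top-k highest-scoring tiles.
    top = scores.topk(min(topk, nt), dim=-1).indices
    keep.scatter_(-1, top, True)

    # Expand the tile mask to token resolution and run masked attention.
    mask = keep.repeat_interleave(tile, dim=2).repeat_interleave(tile, dim=3)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)


q = k = v = torch.randn(1, 2, 256, 32)
print(ssta_attention(q, k, v, tile=64, window=1, topk=2).shape)
# torch.Size([1, 2, 256, 32])
```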
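
The digest notes only that RL post-training is tailored per task; one common recipe consistent with that description is a reward-weighted (REINFORCE-style) update, sketched below with toy stand-ins. The policy, sampling procedure, and reward function here are hypothetical placeholders, with the task-specific reward models (separate ones for I2V and T2V) implied by the source but not specified.

```python
# Toy reward-weighted fine-tuning step: sample outputs, score them with a
# task-specific reward model, and upweight high-reward samples.
import torch
import torch.nn as nn

policy = nn.Linear(8, 8)  # stands in for the video generator
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)


def sample(policy, prompts):
    dist = torch.distributions.Normal(policy(prompts), 1.0)
    action = dist.sample()  # stands in for a sampled video
    return action, dist.log_prob(action).sum(dim=-1)


def reward(action, prompts):  # placeholder artifact/motion quality score
    return -(action - prompts).pow(2).mean(dim=-1)


prompts = torch.randn(16, 8)
videos, logp = sample(policy, prompts)
with torch.no_grad():
    r = reward(videos, prompts)
    adv = (r - r.mean()) / (r.std() + 1e-6)  # normalized advantage
loss = -(adv * logp).mean()                  # REINFORCE-style objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"loss={loss.item():.3f}  mean_reward={r.mean().item():.3f}")
```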