HunyuanVideo

Tencent Hunyuan Open-Sources a New Game AI Generation Tool! An RTX 4090 Is Enough to Produce AAA-Grade Dynamic Content
量子位· 2025-08-14 07:34
Lu Yu, reporting from Aofei Temple. QbitAI | WeChat official account QbitAI. A casually snapped photo, turned into a AAA-grade game in seconds?! Tencent has just open-sourced Hunyuan-GameCraft, a brand-new game video generation framework designed specifically for game environments that lets anyone handle game production with ease. Whether it is ink-wash style or ancient Greece, whatever you want, it can deliver. Built on Tencent's HunyuanVideo generation model, it can produce smooth footage in real time. Operation is just as easy: a single scene image + a text description + action commands = a high-definition dynamic game video. So let the game start! Live demo: first, a quick taste of a few generated examples. The first is a medieval village-style scene; the footage is smooth and natural, and the camera moves dynamically with the first-person view. Prompt: A picturesque village scene featuring quaint houses, a windmill, lush greenery, and a serene mountain backdrop under a bright blue sky. Prompt: A sunlit courtyard features white adobe buildings with arched ...
EasyCache: Training-Free Inference Acceleration for Video Diffusion Models, a Minimal and Efficient Way to Speed Up Video Generation
机器之心· 2025-07-12 04:50
Core Viewpoint
- The article discusses the development of EasyCache, a new framework for accelerating video diffusion models without requiring training or structural changes to the model, significantly improving inference efficiency while maintaining video quality [7][27].

Group 1: Research Background and Motivation
- The application of diffusion models and diffusion Transformers in video generation has led to significant improvements in the quality and coherence of AI-generated videos, transforming digital content creation and multimedia entertainment [3].
- However, issues such as slow inference and high computational costs have emerged, with examples like HunyuanVideo taking 2 hours to generate a 5-second video at 720P resolution, limiting the technology's application in real-time and large-scale scenarios [4][5].

Group 2: Methodology and Innovations
- EasyCache operates by dynamically detecting the "stable period" of model outputs during inference, allowing for the reuse of historical computation results to reduce redundant inference steps [7][16].
- The framework measures the "transformation rate" during the diffusion process, which indicates the sensitivity of current outputs to inputs, revealing that outputs can be approximated using previous results in later stages of the process (see the sketch after this summary) [8][12][15].
- EasyCache is designed to be plug-and-play, functioning entirely during the inference phase without the need for model retraining or structural modifications [16].

Group 3: Experimental Results and Visual Analysis
- Systematic experiments on mainstream video generation models such as OpenSora, Wan2.1, and HunyuanVideo demonstrated that EasyCache achieves a 2.2x speedup on HunyuanVideo, with a 36% increase in PSNR and a 14% increase in SSIM, while maintaining video quality [20][26].
- In image generation tasks, EasyCache also provided a 4.6x speedup and improved FID scores, indicating its effectiveness across different applications [21][22].
- Visual comparisons showed that EasyCache retains high visual fidelity, with generated videos closely matching the original model outputs, unlike other methods that exhibited varying degrees of quality loss [24][25].

Group 4: Conclusion and Future Outlook
- EasyCache presents a minimalistic and efficient paradigm for accelerating inference in video diffusion models, laying a solid foundation for practical applications of diffusion models [27].
- The expectation is to further approach the goal of "real-time video generation" as models and acceleration technologies continue to evolve [27].
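To make the caching idea above concrete, here is a minimal, training-free sketch of a denoising loop that skips the transformer when the accumulated input change stays small. It is not the authors' implementation: the model interface `model(latents, t, cond)`, the placeholder scheduler update, and the threshold `tau` are assumptions chosen for illustration.

```python
import torch

@torch.no_grad()
def denoise_with_easycache(model, latents, timesteps, cond, tau=0.05):
    # `model(latents, t, cond)` is an assumed interface returning the step
    # prediction; "latents - 0.1 * out" stands in for a real scheduler update.
    prev_latents = None   # latents seen at the previous step
    cached_out = None     # output of the last fully computed step
    acc_change = 0.0      # accumulated relative input change since that step
    for t in timesteps:
        if prev_latents is not None:
            # Proxy for the "transformation rate": how much the input drifted
            # since the last step.  In the stable late phase this stays small,
            # so the cached output remains a good approximation.
            acc_change += ((latents - prev_latents).norm()
                           / (prev_latents.norm() + 1e-8)).item()
        prev_latents = latents.clone()
        if cached_out is None or acc_change > tau:
            cached_out = model(latents, t, cond)   # full (expensive) forward pass
            acc_change = 0.0                       # reset after refreshing the cache
        out = cached_out                           # otherwise reuse the cached result
        latents = latents - 0.1 * out              # placeholder scheduler update
    return latents
```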
Training-Free, Plug-and-Play, 2x End-to-End GPU Inference Acceleration: DraftAttention, an Acceleration Method for Video Diffusion Models
机器之心· 2025-06-28 04:35
Core Insights
- The article discusses the challenges and advancements in video generation using diffusion models, particularly focusing on the computational bottlenecks associated with attention mechanisms in the Diffusion Transformer (DiT) model [1][6][14].
- A new method called DraftAttention is introduced, which significantly reduces the computational overhead of attention mechanisms while maintaining high generation quality, achieving up to 2x end-to-end inference acceleration on GPUs [3][22][46].

Group 1: Background and Challenges
- Diffusion models have become mainstream for high-quality video generation, but the computational load of attention mechanisms increases dramatically with video length and resolution, leading to inefficiencies [1][6].
- In models like HunyuanVideo, attention computation can account for over 80% of the total processing time, with generating an 8-second 720p video taking nearly an hour [1][5].
- The complexity of attention mechanisms grows quadratically with the number of tokens, which is directly proportional to video frame count and resolution, causing significant slowdowns in inference speed [6][7].

Group 2: Existing Solutions and Limitations
- Current acceleration methods, such as Sparse VideoGen and AdaSpa, utilize sparse attention mechanisms for some level of end-to-end acceleration on GPUs, but their effectiveness is limited due to insufficient sparsity and rigid design [2][3].
- These methods often rely on fixed sparse operators and lack dynamic adaptability to input content, making it difficult to achieve fine-grained, content-aware sparse pattern control [2][7].

Group 3: DraftAttention Methodology
- DraftAttention is a plug-and-play, training-free dynamic sparse attention mechanism designed to reduce the computational burden of attention while preserving generation quality [3][11][46].
- The method creates a low-resolution attention map to estimate token importance, which guides the selection of sparse patterns for the high-resolution attention computation (see the sketch after this summary) [11][12].
- A token rearrangement strategy is introduced to enhance the execution efficiency of sparse computations on GPUs, making the approach hardware-friendly [13][22].

Group 4: Theoretical Foundations and Experimental Results
- The effectiveness of DraftAttention is supported by theoretical analyses demonstrating that the approximation error between the low-resolution and high-resolution attention maps is bounded [15][17].
- Experimental evaluations show that DraftAttention outperforms existing sparse attention methods such as Sparse VideoGen across multiple metrics, including PSNR and SSIM, particularly at high sparsity rates [20][21].
- On NVIDIA H100 and A100 GPUs, DraftAttention achieves up to 1.75x end-to-end inference acceleration, with gains scaling with video length, resolution, and sparsity [22][46].

Group 5: Future Directions
- The authors plan to further optimize efficiency bottlenecks in long video generation by integrating techniques such as quantization and distillation, aiming to extend high-quality video generation capabilities to resource-constrained environments like mobile and edge devices [46].
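A rough, hypothetical sketch of the draft-then-sparse idea from Group 3: token blocks are average-pooled into a low-resolution "draft" attention map, the most important key blocks per query block are kept, and the resulting block mask guides full-resolution attention. The function name, pooling block size, and keep ratio are assumptions rather than the paper's operators, and the dense masked attention below stands in for a hardware-efficient sparse kernel.

```python
import torch

def draft_sparse_attention(q, k, v, pool=8, keep_ratio=0.25):
    # q, k, v: [batch, heads, tokens, dim]; `pool` and `keep_ratio` are assumed
    # hyper-parameters.  The sketch assumes the token count divides evenly.
    b, h, n, d = q.shape
    assert n % pool == 0, "sketch assumes tokens divisible by the pool size"
    nb = n // pool
    # 1) Draft stage: average-pool queries and keys into per-block descriptors
    #    and build a low-resolution attention map over blocks.
    qp = q.reshape(b, h, nb, pool, d).mean(dim=3)
    kp = k.reshape(b, h, nb, pool, d).mean(dim=3)
    draft = torch.softmax(qp @ kp.transpose(-1, -2) / d ** 0.5, dim=-1)  # [b,h,nb,nb]
    # 2) Keep only the most important key blocks for each query block.
    k_keep = max(1, int(keep_ratio * nb))
    top_blocks = draft.topk(k_keep, dim=-1).indices
    block_mask = torch.zeros_like(draft).scatter_(-1, top_blocks, 1.0) > 0
    # 3) Expand the block mask to token resolution and run masked full attention
    #    (a stand-in for the hardware-efficient sparse kernel).
    token_mask = block_mask.repeat_interleave(pool, dim=-2).repeat_interleave(pool, dim=-1)
    scores = (q @ k.transpose(-1, -2)) / d ** 0.5
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```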
Tsinghua's SageAttention3: 5x Speedup with FP4 Quantization, and the First to Support 8-Bit Training
机器之心· 2025-06-18 09:34
Core Insights
- The article discusses advancements in attention mechanisms for large models, particularly focusing on the introduction of SageAttention3, which offers significant performance improvements over previous versions and competitors [1][2].

Group 1: Introduction and Background
- The need for optimizing attention speed has become crucial as the sequence length in large models increases [7].
- Previous versions of SageAttention (V1, V2, V2++) achieved acceleration factors of 2.1, 3, and 3.9 times respectively compared to FlashAttention [2][5].

Group 2: Technical Innovations
- SageAttention3 provides a 5x inference acceleration compared to FlashAttention, achieving 1040 TOPS on the RTX 5090 and outperforming even the more expensive H100 running FlashAttention3 by 1.65 times [2][5].
- The introduction of trainable 8-bit attention (SageBwd) enables training acceleration while matching the results of full-precision attention across various fine-tuning tasks [2][5].

Group 3: Methodology
- The research team employed Microscaling FP4 quantization to improve FP4 quantization precision, utilizing the NVFP4 format for better accuracy [15][16].
- A two-level quantization approach was proposed to address the narrow range of scaling factors for the P matrix, improving overall precision (a rough sketch follows this summary) [15][16].

Group 4: Experimental Results
- SageAttention3 demonstrated impressive performance across various models, maintaining end-to-end accuracy in video and image generation tasks [21][22].
- In specific tests, SageAttention3 achieved a 3x acceleration in HunyuanVideo, with significant reductions in processing time across multiple models [33][34].
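The two methodology points above can be loosely illustrated in simulation: per-block ("microscaling") quantization onto a small set of FP4-representable values, plus a two-level rescaling of the attention probability matrix P so its narrow value range makes better use of the quantization grid. This is a numerical sketch under assumed block sizes and value tables, not the paper's NVFP4 kernel or its exact two-level scheme.

```python
import torch

# Representable magnitudes of an e2m1 (FP4) value; the table and the block
# size below are assumptions for this simulation, not the hardware data path.
FP4_LEVELS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def microscale_fp4(x, block=16):
    # Per-block ("microscaling") fake quantization: each group of `block`
    # consecutive values shares one scale derived from the block's maximum.
    assert x.numel() % block == 0, "sketch assumes size divisible by the block"
    levels = FP4_LEVELS.to(x.device)
    orig_shape = x.shape
    x = x.reshape(-1, block)
    scale = x.abs().amax(dim=1, keepdim=True) / levels[-1] + 1e-12
    y = x / scale
    # Round every value to the nearest representable FP4 magnitude, keep the sign.
    idx = (y.abs().unsqueeze(-1) - levels).abs().argmin(dim=-1)
    yq = levels[idx] * torch.sign(y)
    return (yq * scale).reshape(orig_shape)    # de-quantized approximation

def two_level_quant_p(p, block=16):
    # Loose illustration of two-level scaling for the attention matrix P:
    # level 1 rescales each row by its maximum so the narrow [0, row_max]
    # range fills the quantization grid; level 2 applies per-block FP4.
    row_max = p.amax(dim=-1, keepdim=True) + 1e-12
    return microscale_fp4(p / row_max, block) * row_max
```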
AI Weekly | xAI's Valuation Could Top $120 Billion After Its New Funding Round; OpenAI's Restructuring Plan Changes Course
Di Yi Cai Jing Zi Xun· 2025-05-11 01:39
Group 1: xAI Financing
- xAI, the AI startup founded by Elon Musk, is negotiating a new round of financing at a potential valuation exceeding $120 billion (approximately 868.8 billion RMB) [1].
- Investors are considering injecting $20 billion into xAI, although the specific amount may fluctuate as negotiations progress [1].
- If successful, this financing would become the second-largest startup funding round in history, behind OpenAI's $40 billion round earlier this year, which valued OpenAI at $300 billion (approximately 2.17 trillion RMB) [1].

Group 2: OpenAI Restructuring
- OpenAI announced it will remain under the control of a non-profit organization, retracting a previous restructuring plan that aimed to shift control to a for-profit entity [2].
- The for-profit LLC will transition to a Public Benefit Corporation (PBC), allowing it to pursue profit while also focusing on social missions [2].
- The new structure will enable investors and employees to hold common stock without limits on appreciation, facilitating future fundraising efforts [2].

Group 3: AI Programming Unicorn
- Anysphere, the developer of the AI programming tool Cursor, completed a $900 million funding round, bringing its valuation to approximately $9 billion [5][6].
- The round was led by Thrive Capital, with participation from notable investors such as a16z and Accel [5].
- Cursor is recognized as one of the most popular AI tools in the programming sector, reflecting the growing interest in AI programming applications [6].

Group 4: Google Market Value Drop
- Google's parent company Alphabet lost nearly $150 billion in market value after Apple announced plans to introduce AI features in its Safari browser [4].
- Alphabet's stock price fell over 7% following the announcement, highlighting the competitive threat AI technologies pose to traditional search engines [4].
- The integration of AI into search functionality is becoming a significant trend, with major players like Apple and OpenAI actively pursuing this direction [4].

Group 5: Tencent's Video Generation Tool
- Tencent's Hunyuan team released and open-sourced HunyuanCustom, a new multimodal video generation tool that significantly improves on existing solutions [8].
- The tool integrates various input modalities, including text, images, audio, and video, to generate videos [8].
- This release is part of a broader trend of open-source video generation models competing with proprietary tools in the market [8].

Group 6: Humanoid Robot Developments
- Several humanoid robot manufacturers have updated their products, showcasing advancements in mobility and control [9].
- The CL-3 humanoid robot by Zhijidongli features 31 degrees of freedom, enabling it to perform human-like movements [9].
- The ongoing evolution of humanoid robots is highlighted by upcoming events such as the World Humanoid Robot Sports Competition [9].
Tencent Hunyuan Releases and Open-Sources the Video Generation Tool HunyuanCustom, Supporting Subject-Consistent Generation
news flash· 2025-05-09 04:22
On May 9, the Tencent Hunyuan team released and open-sourced HunyuanCustom, a brand-new multimodal customized video generation tool. The model is built on the Hunyuan video generation foundation model (HunyuanVideo); its subject-consistency results surpass existing open-source solutions and rival top closed-source models. HunyuanCustom combines text, image, audio, and video inputs for video generation, making it an intelligent video creation tool with a high degree of controllability and generation quality. (36Kr) ...
ICML 2025 | Video Generation Models Accelerated 2x Without Quality Loss, and the Secret Is "Capturing the Spatio-Temporal Sparsity of Attention"
机器之心· 2025-05-07 07:37
Core Viewpoint
- The article discusses the rapid advancement of AI video generation technology, particularly focusing on the introduction of Sparse VideoGen, which significantly accelerates video generation without compromising quality [1][4][23].

Group 1: Performance Bottlenecks in Video Generation
- Current state-of-the-art video generation models such as Wan 2.1 and HunyuanVideo face significant performance bottlenecks, requiring over 30 minutes to generate a 5-second 720p video on a single H100 GPU, with the 3D Full Attention module consuming over 80% of the inference time [1][6][23].
- The computational complexity of attention mechanisms in Video Diffusion Transformers (DiTs) increases quadratically with resolution and frame count, limiting real-world deployment [6][23].

Group 2: Introduction of Sparse VideoGen
- Sparse VideoGen is a novel acceleration method that does not require retraining existing models; it leverages spatial and temporal sparsity in attention mechanisms to halve inference time while maintaining high pixel fidelity (PSNR = 29) [4][23].
- The method has been integrated with various state-of-the-art open-source models and supports both text-to-video (T2V) and image-to-video (I2V) tasks [4][23].

Group 3: Key Design Features of Sparse VideoGen
- Sparse VideoGen identifies two distinct sparsity patterns in attention maps: spatial sparsity, which focuses on tokens within the same and adjacent frames, and temporal sparsity, which captures relationships across different frames [10][11][12].
- The method employs a dynamic adaptive sparse strategy through online profiling, selecting the best combination of spatial and temporal heads for each denoising step and prompt (see the sketch after this summary) [16][17].

Group 4: Operator-Level Optimization
- Sparse VideoGen introduces a hardware-friendly layout transformation that optimizes memory access patterns, improving the performance of temporal heads by ensuring tokens are stored contiguously in memory [20][21].
- Additional optimizations for Query-Key Normalization (QK-Norm) and Rotary Position Embedding (RoPE) yield significant throughput improvements, with average acceleration ratios of 7.4x and 14.5x, respectively [21].

Group 5: Experimental Results
- Sparse VideoGen reduces inference time for HunyuanVideo from approximately 30 minutes to under 15 minutes, and for Wan 2.1 from 30 minutes to 20 minutes, while maintaining a PSNR above 29 dB [23].
- The research indicates that understanding the internal structure of video generation models may lead to more sustainable performance breakthroughs than merely increasing model size [24].
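A hypothetical sketch of the online-profiling step from Group 3: for each attention head, a few query rows are sampled, attention is evaluated under a spatial mask (same or adjacent frames) and a temporal mask (same spatial position across frames), and the head is labelled with whichever mask better reproduces dense attention. The function name, sampling size, and window are assumptions, and the dense score computation here is only for illustration; the method's speedup comes from sparse kernels not shown.

```python
import torch

def classify_heads_online(q, k, frames, tokens_per_frame, window=1, sample=64):
    # q, k: [batch, heads, tokens, dim], tokens laid out frame by frame.
    # `window` and `sample` are assumed profiling hyper-parameters.
    b, h, n, d = q.shape
    device = q.device
    assert n == frames * tokens_per_frame
    frame_id = torch.arange(n, device=device) // tokens_per_frame
    pos_id = torch.arange(n, device=device) % tokens_per_frame
    # Spatial pattern: attend within the same or neighbouring frames.
    spatial_mask = (frame_id[:, None] - frame_id[None, :]).abs() <= window
    # Temporal pattern: attend to the same spatial position across all frames.
    temporal_mask = pos_id[:, None] == pos_id[None, :]

    rows = torch.randperm(n, device=device)[:sample]           # sampled query rows
    scores = q[:, :, rows] @ k.transpose(-1, -2) / d ** 0.5    # [b, h, sample, n]
    dense = torch.softmax(scores, dim=-1)

    errors = []
    for mask in (spatial_mask, temporal_mask):
        sparse = torch.softmax(scores.masked_fill(~mask[rows], float("-inf")), dim=-1)
        # Mean deviation from dense attention on the sampled rows, per head.
        errors.append((sparse - dense).abs().mean(dim=(0, 2, 3)))
    spatial_err, temporal_err = errors
    # 0 = spatial head, 1 = temporal head: each head is assigned the cheaper
    # sparse pattern that best reproduces its dense attention.
    return (temporal_err < spatial_err).long()
```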
An 11B Model Takes the New Open-Source SOTA for Video Generation! Trained on Just 224 GPUs, Cutting Training Cost by 10x
量子位· 2025-03-13 03:28
Xiao Ming, reporting from Aofei Temple. QbitAI | WeChat official account QbitAI. Trained on just 224 GPUs, a new open-source SOTA for video generation! Open-Sora 2.0 is officially released. At an 11B parameter scale, its performance nearly matches HunyuanVideo and Step-Video (30B). Keep in mind that many closed-source video generation models with comparable results routinely cost millions of dollars to train; Open-Sora 2.0 compresses that figure to $200,000. The release also fully open-sources the model weights, inference code, and the entire distributed training pipeline, so developers, take a look! GitHub repository: https://github.com/hpcaitech/Open-Sora. It supports 720P, 24FPS high-quality generation; see the Open-Sora 2.0 demo. Motion amplitude can be set as needed to better render the fine movements of characters and scenes: in one generated video, a man does push-ups with smooth, plausible motion indistinguishable from the real world, and in a virtual scene of a tomato surfing, the interactions between the splashes, the leaf boat, and the tomato likewise obey the laws of physics. On image quality and smoothness, it delivers 720P resolution at 24FPS, giving the final video a stable frame rate and detailed rendering. It also supports rich scene transitions, from village scenery to natural landscapes, Open-Sora 2 ...