HunyuanVideo
Tencent Hunyuan Open-Sources a New Game AI Generation Tool: An RTX 4090 Can Produce AAA-Grade Dynamic Content
量子位· 2025-08-14 07:34
Core Viewpoint
- Tencent has launched a new open-source game video generation framework, Hunyuan-GameCraft, designed for game environment creation, enabling anyone to easily produce high-quality game content from a single image and a text description [1]

Group 1: Features and Capabilities
- Hunyuan-GameCraft generates dynamic game videos from a single scene image, a text description, and action commands, producing high-definition outputs [8]
- The framework supports a range of artistic styles, including traditional ink painting and ancient Greek themes, showcasing its versatility [2][4][6]
- It can generate complex scenes with dynamic weather effects and NPC interactions, enhancing the realism of the generated content [18]

Group 2: Technical Innovations
- Traditional game video generation tools face three main challenges: stiff movements, static scenes, and high production costs [19][20][22]
- Hunyuan-GameCraft addresses these issues through three core advantages: 1. free-flowing, smooth animation with high-precision control over movements [26]; 2. enhanced memory that maintains consistency across long video sequences [26]; 3. significant cost reduction, allowing operation on consumer-grade graphics cards such as the RTX 4090 [26]

Group 3: Model Architecture and Performance
- The model is built on HunyuanVideo and incorporates four key technical modules to ensure precise user interaction and long-sequence video generation [30]
- Performance comparisons show that Hunyuan-GameCraft improves flow consistency by 18.3% over other models, with a low action response latency of 87 ms [35]
- In fine-grained control tasks, it responds correctly to 92% of discrete action inputs, significantly higher than the baseline model's 65% accuracy [37]

Group 4: User Engagement and Feedback
- Subjective evaluation gives Hunyuan-GameCraft a realism score of 4.2/5 and a controllability score of 4.1/5, surpassing other models [35]
- 78% of users expressed willingness to keep interacting with the system, 1.5 to 2 times higher than for competing models [35]
EasyCache: Training-Free Inference Acceleration for Video Diffusion Models, a Minimalist and Efficient Way to Speed Up Video Generation
机器之心· 2025-07-12 04:50
Core Viewpoint
- The article introduces EasyCache, a framework for accelerating video diffusion models that requires neither retraining nor structural changes to the model, significantly improving inference efficiency while maintaining video quality [7][27]

Group 1: Research Background and Motivation
- Diffusion models and diffusion Transformers have greatly improved the quality and coherence of AI-generated video, transforming digital content creation and multimedia entertainment [3]
- However, inference remains slow and computationally expensive: HunyuanVideo, for example, takes about 2 hours to generate a 5-second 720P video, limiting the technology's use in real-time and large-scale scenarios [4][5]

Group 2: Methodology and Innovations
- EasyCache dynamically detects the "stable period" of model outputs during inference, reusing historical computation results to skip redundant denoising steps (a minimal caching sketch follows this summary) [7][16]
- The framework measures the "transformation rate" of the diffusion process, i.e. how sensitive the current output is to its input, revealing that later-stage outputs can be approximated from earlier results [8][12][15]
- EasyCache is plug-and-play and operates entirely at inference time, with no retraining or architectural modification required [16]

Group 3: Experimental Results and Visual Analysis
- Systematic experiments on mainstream video generation models such as OpenSora, Wan2.1, and HunyuanVideo show that EasyCache achieves a 2.2x speedup on HunyuanVideo, with a 36% increase in PSNR and a 14% increase in SSIM, while maintaining video quality [20][26]
- In image generation tasks, EasyCache also delivers a 4.6x speedup while improving FID, indicating effectiveness across different applications [21][22]
- Visual comparisons show that EasyCache retains high visual fidelity, with generated videos closely matching the original model outputs, whereas other methods exhibit varying degrees of quality loss [24][25]

Group 4: Conclusion and Future Outlook
- EasyCache offers a minimalist, efficient paradigm for accelerating inference in video diffusion models, laying a solid foundation for their practical deployment [27]
- As models and acceleration techniques continue to evolve, the authors expect to move closer to the goal of real-time video generation [27]
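The summary above only describes EasyCache's mechanism at a high level. The following is a minimal, hypothetical Python sketch of that style of training-free caching: it uses the relative change between recent denoiser outputs as a stand-in for the article's "transformation rate" and, when the change stays below a threshold, reuses the cached output instead of re-running the transformer. The function names, update rule, and threshold are illustrative assumptions, not the paper's actual API.

```python
import torch

def cached_sampling(model, latents, timesteps, tau=0.05):
    """Training-free caching in the spirit of EasyCache (illustrative sketch only).

    model(latents, t) is assumed to return the denoiser output for one step;
    tau is a relative-change threshold below which the cached output is reused.
    """
    prev_out, prev_prev_out = None, None
    for t in timesteps:
        if prev_out is not None and prev_prev_out is not None:
            # Cheap proxy for the "transformation rate": how much the output
            # changed over the last two full steps. A small change suggests the
            # trajectory is in its "stable period", so skip the expensive
            # transformer call and reuse the most recent cached output.
            change = (prev_out - prev_prev_out).norm() / (prev_prev_out.norm() + 1e-8)
            if change < tau:
                latents = latents + prev_out   # reuse cached result
                continue
        out = model(latents, t)                # full (expensive) denoiser call
        latents = latents + out                # simplified update rule for this sketch
        prev_prev_out, prev_out = prev_out, out
    return latents
```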
Training-Free, Plug-and-Play, 2x End-to-End GPU Inference Acceleration: DraftAttention, an Acceleration Method for Video Diffusion Models
机器之心· 2025-06-28 04:35
Core Insights
- The article discusses the challenges and advances in diffusion-based video generation, focusing on the computational bottleneck of the attention mechanism in Diffusion Transformer (DiT) models [1][6][14]
- A new method, DraftAttention, is introduced that significantly reduces the cost of attention while maintaining generation quality, achieving up to 2x end-to-end inference acceleration on GPUs [3][22][46]

Group 1: Background and Challenges
- Diffusion models are the mainstream approach to high-quality video generation, but the cost of attention grows dramatically with video length and resolution, leading to inefficiency [1][6]
- In models like HunyuanVideo, attention can account for over 80% of total processing time, and generating an 8-second 720p video takes nearly an hour [1][5]
- Attention complexity grows quadratically with the number of tokens, which is proportional to frame count and resolution, causing significant slowdowns in inference [6][7]

Group 2: Existing Solutions and Limitations
- Existing acceleration methods such as Sparse VideoGen and AdaSpa use sparse attention to obtain some end-to-end acceleration on GPUs, but their gains are limited by insufficient sparsity and rigid designs [2][3]
- These methods rely on fixed sparse operators and lack dynamic adaptation to input content, making fine-grained, content-aware control of sparsity patterns difficult [2][7]

Group 3: DraftAttention Methodology
- DraftAttention is a training-free, plug-and-play, dynamic sparse attention mechanism designed to cut the cost of attention while preserving generation quality [3][11][46]
- The method builds a low-resolution "draft" attention map to estimate token importance, which guides the selection of sparse patterns for the high-resolution attention computation (see the sketch after this summary) [11][12]
- A token rearrangement strategy makes the sparse computation hardware-friendly, improving execution efficiency on GPUs [13][22]

Group 4: Theoretical Foundations and Experimental Results
- Theoretical analysis shows that the approximation error between the low-resolution and high-resolution attention maps is bounded [15][17]
- Experiments show DraftAttention outperforms existing sparse attention methods such as Sparse VideoGen on multiple metrics, including PSNR and SSIM, especially at high sparsity rates [20][21]
- On NVIDIA H100 and A100 GPUs, DraftAttention achieves up to 1.75x end-to-end inference acceleration, with gains that scale with video length, resolution, and sparsity [22][46]

Group 5: Future Directions
- The authors plan to address remaining efficiency bottlenecks in long video generation by combining techniques such as quantization and distillation, aiming to bring high-quality video generation to resource-constrained environments like mobile and edge devices [46]
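As a rough illustration of the draft-then-sparsify idea described above (not the authors' implementation), the sketch below average-pools queries and keys into a low-resolution draft attention map, keeps the highest-scoring key blocks per query block, and expands that block mask to guide full-resolution attention. The pooling size, block layout, and top-k budget are assumptions.

```python
import torch

def draft_guided_attention(q, k, v, pool=8, keep_ratio=0.2):
    """Illustrative sketch: low-resolution draft attention guiding block sparsity.

    q, k, v: [batch, heads, seq_len, dim]; seq_len must be divisible by `pool`.
    """
    b, h, n, d = q.shape
    # 1) Build a coarse draft attention map from average-pooled queries and keys.
    q_low = q.reshape(b, h, n // pool, pool, d).mean(dim=3)
    k_low = k.reshape(b, h, n // pool, pool, d).mean(dim=3)
    draft = torch.einsum("bhid,bhjd->bhij", q_low, k_low) / d ** 0.5

    # 2) Keep only the most important key blocks for each query block.
    k_keep = max(1, int(keep_ratio * draft.shape[-1]))
    idx = draft.topk(k_keep, dim=-1).indices
    block_mask = torch.zeros_like(draft, dtype=torch.bool).scatter_(-1, idx, True)

    # 3) Expand the block mask to token resolution and run masked full attention.
    mask = block_mask.repeat_interleave(pool, dim=-2).repeat_interleave(pool, dim=-1)
    scores = torch.einsum("bhid,bhjd->bhij", q, k) / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

For clarity the sketch still materializes the dense score matrix; the actual speedup in a method like this comes from a kernel that computes only the kept blocks, together with the token rearrangement the article mentions.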
Tsinghua's SageAttention3: 5x Acceleration with FP4 Quantization, and the First Support for 8-Bit Training
机器之心· 2025-06-18 09:34
Core Insights
- The article covers advances in attention mechanisms for large models, focusing on SageAttention3, which delivers significant performance gains over previous versions and competing kernels [1][2]

Group 1: Introduction and Background
- As sequence lengths in large models grow, optimizing attention speed has become crucial [7]
- Previous versions of SageAttention (V1, V2, V2++) achieved speedups of 2.1x, 3x, and 3.9x over FlashAttention, respectively [2][5]

Group 2: Technical Innovations
- SageAttention3 delivers a 5x inference speedup over FlashAttention, reaching 1040 TOPS on an RTX 5090 and outperforming even the more expensive H100 running FlashAttention3 by 1.65x [2][5]
- Trainable 8-bit attention (SageBwd) enables training acceleration while matching full-precision attention on various fine-tuning tasks [2][5]

Group 3: Methodology
- The research team uses Microscaling FP4 quantization in the NVFP4 format to improve the accuracy of FP4 quantization [15][16]
- A two-level quantization scheme addresses the narrow range of the P matrix's scaling factors, improving overall precision (a numerics sketch follows this summary) [15][16]

Group 4: Experimental Results
- SageAttention3 maintains end-to-end accuracy in video and image generation tasks across various models [21][22]
- In specific tests, SageAttention3 delivers a 3x speedup on HunyuanVideo, with significant reductions in processing time across multiple models [33][34]
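To make the "microscaling plus two-level quantization" idea above concrete, here is a small numerics simulation in plain PyTorch: values are quantized in blocks of 16 to the FP4 (E2M1) value grid with one scale per block, and the attention-probability matrix P is first stretched per row before block-wise quantization (a rough stand-in for the two-level step). This only simulates the arithmetic; the NVFP4 format, the actual scale-factor encoding, and the real FP4 tensor-core kernels are not shown, and the details here are assumptions rather than the paper's procedure.

```python
import torch

# Non-negative magnitudes representable in FP4 (E2M1); signs are handled separately.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_blockwise(x, block=16):
    """Simulate microscaling FP4: one shared scale per `block` contiguous values.
    Assumes x.numel() is a multiple of `block`; returns the dequantized result."""
    grid = FP4_GRID.to(x.device, x.dtype)
    shape = x.shape
    x = x.reshape(-1, block)
    scale = x.abs().amax(dim=1, keepdim=True) / 6.0 + 1e-12      # map block max to FP4 max (6)
    scaled = x / scale
    nearest = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    q = grid[nearest] * scaled.sign()                            # snap to the FP4 grid
    return (q * scale).reshape(shape)

def quantize_p_two_level(p, block=16):
    """Two-level idea for the attention matrix P (entries in [0, 1]): first stretch
    each row so its maximum uses the full representable range, then quantize
    block-wise, then undo the row-level scale."""
    row_scale = p.amax(dim=-1, keepdim=True) + 1e-12
    return fp4_blockwise(p / row_scale, block) * row_scale
```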
AI Weekly | xAI's Valuation Could Top $120 Billion After a New Funding Round; OpenAI's Restructuring Plan Changes
Di Yi Cai Jing Zi Xun· 2025-05-11 01:39
Group 1: xAI Financing
- xAI, the AI startup founded by Elon Musk, is negotiating a new funding round at a potential valuation exceeding $120 billion (approximately 868.8 billion RMB) [1]
- Investors are considering injecting $20 billion into xAI, though the final amount may change as negotiations progress [1]
- If completed, this would be the second-largest startup funding round in history, after OpenAI's $40 billion round earlier this year, which valued OpenAI at $300 billion (approximately 2.17 trillion RMB) [1]

Group 2: OpenAI Restructuring
- OpenAI announced it will remain under the control of its non-profit, retracting an earlier restructuring plan that would have shifted control to a for-profit entity [2]
- The for-profit LLC will become a Public Benefit Corporation (PBC), allowing it to pursue profit while also serving a social mission [2]
- The new structure lets investors and employees hold common stock with no cap on appreciation, facilitating future fundraising [2]

Group 3: AI Programming Unicorn
- Anysphere, developer of the AI programming tool Cursor, closed a $900 million funding round at a valuation of approximately $9 billion [5][6]
- The round was led by Thrive Capital, with participation from notable investors including a16z and Accel [5]
- Cursor is one of the most popular AI tools in programming, reflecting the growing interest in AI coding applications [6]

Group 4: Google Market Value Drop
- Alphabet, Google's parent company, lost nearly $150 billion in market value after Apple announced plans to add AI features to its Safari browser [4]
- Alphabet's stock fell more than 7% on the news, underscoring the competitive threat AI technologies pose to traditional search engines [4]
- Integrating AI into search is becoming a significant trend, with major players such as Apple and OpenAI actively pursuing it [4]

Group 5: Tencent's Video Generation Tool
- Tencent's Hunyuan team released and open-sourced HunyuanCustom, a multimodal video generation tool that significantly improves on existing solutions [8]
- The tool combines text, image, audio, and video inputs to generate videos [8]
- The release is part of a broader trend of open-source video generation models competing with proprietary tools [8]

Group 6: Humanoid Robot Developments
- Several humanoid robot manufacturers updated their products, showcasing advances in mobility and control [9]
- The CL-3 humanoid robot by Zhijidongli features 31 degrees of freedom, enabling human-like movements [9]
- The continued evolution of humanoid robots is highlighted by upcoming events such as the World Humanoid Robot Sports Competition [9]
Tencent Hunyuan Releases and Open-Sources HunyuanCustom, a Video Generation Tool Supporting Subject-Consistent Generation
news flash· 2025-05-09 04:22
Core Insights
- Tencent's Hunyuan team has launched and open-sourced HunyuanCustom, a multimodal customized video generation tool built on the HunyuanVideo model [1]
- HunyuanCustom surpasses existing open-source solutions in subject consistency and is comparable to top proprietary models [1]
- The tool can generate videos from multimodal inputs, including text, images, audio, and video, offering high controllability and quality for intelligent video creation [1]
ICML 2025 | Lossless 2x Acceleration for Video Generation Models, and the Secret Is "Exploiting the Spatio-Temporal Sparsity of Attention"
机器之心· 2025-05-07 07:37
Core Viewpoint
- The article discusses the rapid advance of AI video generation and introduces Sparse VideoGen, which significantly accelerates video generation without compromising quality [1][4][23]

Group 1: Performance Bottlenecks in Video Generation
- State-of-the-art video generation models such as Wan 2.1 and HunyuanVideo face severe performance bottlenecks, needing over 30 minutes to generate a 5-second 720p video on a single H100 GPU, with the 3D Full Attention module consuming over 80% of inference time [1][6][23]
- The computational complexity of attention in Video Diffusion Transformers (DiTs) grows quadratically with resolution and frame count, limiting real-world deployment [6][23]

Group 2: Introduction of Sparse VideoGen
- Sparse VideoGen is an acceleration method that requires no retraining of existing models; by exploiting the spatial and temporal sparsity of attention it halves inference time while maintaining high pixel fidelity (PSNR = 29) [4][23]
- The method has been integrated with several state-of-the-art open-source models and supports both text-to-video (T2V) and image-to-video (I2V) tasks [4][23]

Group 3: Key Design Features of Sparse VideoGen
- Sparse VideoGen identifies two distinct sparsity patterns in attention maps: spatial sparsity, which focuses on tokens within the same and adjacent frames, and temporal sparsity, which captures relationships across different frames (illustrated in the sketch after this summary) [10][11][12]
- A dynamic adaptive sparse strategy based on online profiling selects the best combination of spatial and temporal heads for each denoising step and prompt [16][17]

Group 4: Operator-Level Optimization
- A hardware-friendly layout transformation optimizes memory access patterns so that the tokens used by temporal heads are stored contiguously in memory [20][21]
- Additional optimizations of Query-Key Normalization (QK-Norm) and Rotary Position Embedding (RoPE) yield average speedups of 7.4x and 14.5x for those operators, respectively [21]

Group 5: Experimental Results
- Sparse VideoGen reduces HunyuanVideo's inference time from roughly 30 minutes to under 15 minutes, and Wan 2.1's from 30 minutes to 20 minutes, while keeping PSNR above 29 dB [23]
- The research suggests that understanding the internal structure of video generation models may yield more sustainable performance gains than simply scaling model size [24]
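The spatial/temporal head distinction above can be pictured as two block-structured boolean masks over a frame-by-token layout, plus a cheap profiling rule that decides which mask a given head should use. The sketch below is only an illustration under assumed shapes and a toy profiling criterion (sampling a few query rows and keeping the mask that preserves more attention mass); it is not the paper's exact procedure.

```python
import torch

def spatial_mask(frames, tokens_per_frame, window=1):
    """Attend only to tokens in the same frame and `window` adjacent frames."""
    n = frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame
    return (frame_id[:, None] - frame_id[None, :]).abs() <= window

def temporal_mask(frames, tokens_per_frame):
    """Attend only to the same spatial position across all frames."""
    n = frames * tokens_per_frame
    pos_id = torch.arange(n) % tokens_per_frame
    return pos_id[:, None] == pos_id[None, :]

def pick_head_mask(q, k, frames, tokens_per_frame, sample=64):
    """Toy online profiling: score both masks on a few sampled query rows and
    keep whichever retains more of the full attention mass. q, k: [n, dim]."""
    n = frames * tokens_per_frame
    rows = torch.randperm(n)[:sample]
    scores = torch.softmax(q[rows] @ k.T / q.shape[-1] ** 0.5, dim=-1)
    masks = {"spatial": spatial_mask(frames, tokens_per_frame),
             "temporal": temporal_mask(frames, tokens_per_frame)}
    kept = {name: (scores * m[rows]).sum() for name, m in masks.items()}
    return masks[max(kept, key=kept.get)]
```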
An 11B Model Sets a New Open-Source SOTA in Video Generation! Trained on Only 224 GPUs, Cutting Training Costs by 10x
量子位· 2025-03-13 03:28
Core Viewpoint
- Open-Sora 2.0 has been officially released, showing significant advances in video generation with a focus on cost efficiency and high performance that rivals leading closed-source models [1][10][12]

Cost Efficiency
- Open-Sora 2.0 was trained for $200,000, far below the millions typically required for comparable closed-source models [2][3]
- It reduces costs by 5-10x relative to other open-source video models with over 10 billion parameters [13]

Performance Metrics
- At an 11-billion-parameter scale, Open-Sora 2.0 matches the performance of high-cost models such as HunyuanVideo and Step-Video [10]
- The performance gap with OpenAI's leading closed-source model has narrowed from 4.52% to just 0.69% [12]
- In VBench evaluations, Open-Sora 2.0 surpasses Tencent's HunyuanVideo, setting a new benchmark for open-source video generation [12]

Technical Innovations
- The architecture combines a 3D autoencoder with a Flow Matching training framework to improve generation quality (a generic flow-matching training sketch follows this summary) [15]
- A high-compression video autoencoder cuts inference time for a 768px, 5-second video from nearly 30 minutes to under 3 minutes [21]
- Training relies on strict data filtering, multi-stage screening, and efficient parallel training to optimize resource utilization [16][19]

Community Engagement
- Open-Sora 2.0 is fully open-sourced, including model weights, inference code, and the entire distributed training pipeline, inviting developer participation [4][14]
- The project has gained substantial academic recognition, with nearly 100 citations in six months, solidifying its position as a leader in open-source video generation [14]

Future Directions
- High-compression video autoencoders are seen as a key direction for further reducing video generation costs, with initial experiments showing promising results [25]
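The "Flow Matching training framework" mentioned above belongs to a family of objectives in which the model learns a velocity field carrying noise toward data. A generic, rectified-flow-style training step looks roughly like the sketch below; this is the textbook form of the objective, not Open-Sora 2.0's actual training code, and the model interface is an assumption.

```python
import torch

def flow_matching_loss(model, x0):
    """Generic flow-matching / rectified-flow training step (illustrative).

    model(x_t, t) is assumed to predict the velocity field;
    x0 is a batch of clean latents, e.g. from a 3D video autoencoder.
    """
    noise = torch.randn_like(x0)
    # One random time per sample, broadcastable over the remaining dimensions.
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    x_t = (1.0 - t) * x0 + t * noise          # straight-line interpolation
    target_velocity = noise - x0              # constant velocity along that line
    pred = model(x_t, t.flatten())
    return torch.nn.functional.mse_loss(pred, target_velocity)
```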