Yann LeCun's first paper after leaving Meta? The research uses a Unitree robot
机器之心· 2025-12-06 04:08
Core Insights
- The article covers a research paper introducing GenMimic, a method that enables humanoid robots to perform actions generated by AI video models without prior examples [1][3][4].

Research Contributions
- The research presents a universal framework for humanoid robots to execute actions generated by video models [4].
- GenMimic employs a new reinforcement learning strategy that combines symmetric regularization with selectively weighted 3D keypoint rewards during training, allowing the policy to generalize to noisy synthetic videos [4] (a hedged sketch of such a reward appears after this summary).
- The team created a synthetic human action dataset, GenMimicBench, which serves as a scalable benchmark for evaluating zero-shot generalization and policy robustness [4][8].

GenMimicBench Dataset
- GenMimicBench consists of 428 generated videos created with the video generation models Wan2.1 and Cosmos-Predict2 [9][11].
- The dataset includes a wide range of subjects, environments, and action types, from simple gestures to complex interactions with objects [11][13].
- It is designed to stress-test the robustness of humanoid robot control policies under varying visual and action distributions [13].

Methodology Overview
- The proposed method executes humanoid robot actions from generated videos in a two-stage process [15][17].
- The first stage reconstructs a 4D humanoid model from the input RGB video; the second stage translates this model into executable actions [17][18].
- Tracking 3D keypoints instead of joint angles makes the policy robust to variation and noise in the input data [19][20].

Experimental Results
- The team conducted extensive experiments on the GenMimicBench dataset and on a real-world 23-DoF humanoid robot, demonstrating significant improvements over strong baselines [29][30].
- In simulation, GenMimic achieved a success rate (SR) of 29.78% and outperformed existing models on multiple metrics [31].
- Real-world experiments showed that the policy successfully replicated a wide range of upper-body actions, although lower-body movements remained challenging [34][35].
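The summary above does not spell out GenMimic's reward terms. Purely as a hedged illustration of what a selectively weighted 3D keypoint reward combined with a symmetry regularizer could look like, here is a minimal NumPy sketch; the per-keypoint weights, exponential shaping, mirror convention, left/right index pairs, and the mixing coefficient are all assumptions, not the authors' implementation.

```python
import numpy as np

def keypoint_tracking_reward(robot_kp, ref_kp, weights, beta=5.0):
    """Selectively weighted 3D keypoint reward (hypothetical form).

    robot_kp, ref_kp: (K, 3) robot / reference keypoint positions.
    weights:          (K,) per-keypoint weights, e.g. larger for end-effectors,
                      smaller for keypoints that are noisy in generated video.
    Returns a scalar in (0, 1]; 1 means perfect tracking.
    """
    dists = np.linalg.norm(robot_kp - ref_kp, axis=-1)      # (K,)
    w = weights / weights.sum()                             # normalize weights
    return float(np.exp(-beta * np.sum(w * dists)))         # exponential shaping

def symmetry_regularization(kp, left_idx, right_idx):
    """Penalize asymmetric postures: mirror left-side keypoints across the
    sagittal (x) plane and compare them to the right side (hypothetical form)."""
    mirrored_left = kp[left_idx] * np.array([-1.0, 1.0, 1.0])
    return float(np.mean(np.linalg.norm(mirrored_left - kp[right_idx], axis=-1)))

# Toy usage: 12 tracked keypoints, noisy reference extracted from generated video.
rng = np.random.default_rng(0)
robot_kp = rng.normal(size=(12, 3))
ref_kp = robot_kp + 0.05 * rng.normal(size=(12, 3))
weights = np.ones(12)
weights[[4, 5, 10, 11]] = 3.0                               # emphasize hands/feet
r_track = keypoint_tracking_reward(robot_kp, ref_kp, weights)
r_sym = symmetry_regularization(robot_kp, left_idx=[0, 2, 4], right_idx=[1, 3, 5])
total_reward = r_track - 0.1 * r_sym                        # assumed mixing weight
print(f"tracking={r_track:.3f}  symmetry_penalty={r_sym:.3f}  total={total_reward:.3f}")
```

The intent of such a design is that keypoints extracted reliably from generated video dominate the tracking term while the exponential keeps the reward bounded, and the symmetry term discourages lopsided postures that noisy references can induce.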
One photo, one 3D "you": the Institute of Computing Technology (CAS) and collaborators propose HumanLift for high-fidelity digital human reconstruction
机器之心· 2025-10-21 23:20
Core Insights
- The article discusses the development of a new technology called HumanLift, which enables the reconstruction of high-quality, realistic 3D digital humans from a single reference image, addressing challenges in 3D consistency and detail accuracy [2][4][25].

Part 1: Background
- Traditional methods for single-image digital human reconstruction are categorized into explicit and implicit approaches, each with limitations in handling complex clothing and achieving realistic textures [8].
- Recent advancements in generative models and neural implicit rendering have improved the connection between 2D images and 3D space, yet challenges remain in high-fidelity 3D human modeling due to data scarcity and complexity in human poses and clothing [8][9].

Part 2: Algorithm Principles
- HumanLift aims to create a 3D digital representation that captures realistic appearance and fine details from a single image, utilizing a two-stage process [11] (a hedged toy sketch of this kind of two-stage pipeline appears after this summary).
- The first stage generates realistic multi-view images from a single photo using a 3D-aware multi-view human generation method, incorporating a backbone network based on a video generation model [13][14].
- The second stage reconstructs the 3D representation using the generated multi-view images, optimizing parameters based on a Gaussian mesh representation [15][17].

Part 3: Effectiveness Demonstration
- HumanLift demonstrates its capability by generating multi-view RGB and normal images from real-world photographs, achieving photo-realistic results and maintaining spatial consistency [20].
- Ablation studies confirm the importance of facial enhancement and SMPL-X pose optimization in improving detail quality and rendering accuracy [21][22][23].

Part 4: Conclusion
- The development of HumanLift represents a significant advancement in single-image full-body digital human reconstruction, overcoming traditional limitations and providing a user-friendly solution for high-quality 3D modeling [25].
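HumanLift's actual second stage fits a Gaussian mesh representation with a dedicated rasterizer, and its first stage uses a learned multi-view generator; neither is reproduced here. Purely as a hedged toy of the overall shape of the pipeline (stage 1 supplies multi-view images, stage 2 optimizes a 3D representation against them), the PyTorch sketch below fits a colored point cloud to several views rendered with a naive orthographic Gaussian blend; every function, resolution, and constant is an assumed stand-in, not HumanLift's pipeline.

```python
import math
import torch

def render_ortho(points, colors, yaw, H=32, W=32, sigma=0.05):
    """Toy differentiable renderer: rotate the point cloud about the vertical
    axis, project orthographically, and blend point colors with an isotropic
    Gaussian kernel (a crude stand-in for a real Gaussian splatting rasterizer)."""
    c, s = math.cos(yaw), math.sin(yaw)
    R = torch.tensor([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    xy = (points @ R.T)[:, :2]                               # (N, 2), roughly in [-1, 1]
    ys = torch.linspace(-1, 1, H).view(H, 1, 1)
    xs = torch.linspace(-1, 1, W).view(1, W, 1)
    d2 = (xs - xy[:, 0]) ** 2 + (ys - xy[:, 1]) ** 2         # (H, W, N)
    w = torch.softmax(-d2 / (2 * sigma ** 2), dim=-1)        # per-pixel point weights
    return w @ colors                                        # (H, W, 3) image

# "Stage 1" stand-in: pretend these renders are the generated multi-view images.
torch.manual_seed(0)
gt_pts, gt_col = torch.rand(200, 3) - 0.5, torch.rand(200, 3)
views = [(yaw, render_ortho(gt_pts, gt_col, yaw)) for yaw in (0.0, 1.5, 3.1, 4.7)]

# "Stage 2": optimize a point-based 3D representation against those views.
pts = torch.nn.Parameter(torch.rand(200, 3) - 0.5)
col = torch.nn.Parameter(torch.rand(200, 3))
opt = torch.optim.Adam([pts, col], lr=1e-2)
for step in range(200):
    loss = sum(torch.nn.functional.l1_loss(render_ortho(pts, col, yaw), img)
               for yaw, img in views)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final multi-view L1 loss: {loss.item():.4f}")
```

In the real system the supervision would also involve the generated normal maps and the SMPL-X pose optimization mentioned above; this sketch only illustrates why consistent multi-view images make the 3D fitting problem well posed.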
EasyCache: training-free inference acceleration for video diffusion models, a minimal and efficient approach to faster video generation
机器之心· 2025-07-12 04:50
Core Viewpoint
- The article discusses the development of EasyCache, a new framework for accelerating video diffusion models without requiring training or structural changes to the model, significantly improving inference efficiency while maintaining video quality [7][27].

Group 1: Research Background and Motivation
- The application of diffusion models and diffusion Transformers in video generation has led to significant improvements in the quality and coherence of AI-generated videos, transforming digital content creation and multimedia entertainment [3].
- However, issues such as slow inference and high computational costs have emerged, with examples like HunyuanVideo taking 2 hours to generate a 5-second video at 720P resolution, limiting the technology's application in real-time and large-scale scenarios [4][5].

Group 2: Methodology and Innovations
- EasyCache operates by dynamically detecting the "stable period" of model outputs during inference, allowing for the reuse of historical computation results to reduce redundant inference steps [7][16] (a hedged sketch of this caching idea follows this summary).
- The framework measures the "transformation rate" during the diffusion process, which indicates the sensitivity of current outputs to inputs, revealing that outputs can be approximated using previous results in later stages of the process [8][12][15].
- EasyCache is designed to be plug-and-play, functioning entirely during the inference phase without the need for model retraining or structural modifications [16].

Group 3: Experimental Results and Visual Analysis
- Systematic experiments on mainstream video generation models like OpenSora, Wan2.1, and HunyuanVideo demonstrated that EasyCache achieves a speedup of 2.2 times on HunyuanVideo, with a 36% increase in PSNR and a 14% increase in SSIM, while maintaining video quality [20][26].
- In image generation tasks, EasyCache also provided a 4.6 times speedup, improving FID scores, indicating its effectiveness across different applications [21][22].
- Visual comparisons showed that EasyCache retains high visual fidelity, with generated videos closely matching the original model outputs, unlike other methods that exhibited varying degrees of quality loss [24][25].

Group 4: Conclusion and Future Outlook
- EasyCache presents a minimalistic and efficient paradigm for accelerating inference in video diffusion models, laying a solid foundation for practical applications of diffusion models [27].
- The expectation is to further approach the goal of "real-time video generation" as models and acceleration technologies continue to evolve [27].
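The article describes EasyCache at the level of its caching criterion rather than exact formulas, so the sketch below is a hedged illustration of the idea only: wrap the denoiser, measure how fast its output changes between steps (the "transformation rate"), and reuse the cached output while the accumulated estimated change stays under a threshold. The class name, threshold, warmup count, and change metric are assumptions, not EasyCache's published method.

```python
import torch

class CachedDenoiser:
    """Training-free caching wrapper for a diffusion denoiser (illustrative only).

    While the estimated accumulated change of the output stays below `tau`,
    the cached output is reused instead of re-running the expensive network."""

    def __init__(self, model, tau=0.1, warmup=3):
        self.model, self.tau, self.warmup = model, tau, warmup
        self.prev_out = None   # output of the last full forward pass
        self.rate = None       # last measured relative change per step
        self.accum = 0.0       # estimated change accumulated while skipping
        self.full_calls = 0

    def __call__(self, x, t):
        can_skip = (self.rate is not None
                    and self.full_calls >= self.warmup
                    and self.accum + self.rate < self.tau)
        if can_skip:
            self.accum += self.rate            # account for the skipped step
            return self.prev_out               # reuse cached result
        out = self.model(x, t)                 # full (expensive) forward pass
        if self.prev_out is not None:
            self.rate = float((out - self.prev_out).norm()
                              / (self.prev_out.norm() + 1e-8))
        self.prev_out, self.accum = out, 0.0
        self.full_calls += 1
        return out

def toy_model(x, t):
    # Stand-in for an expensive video DiT forward pass.
    return x * (1.0 - 0.01 * t)

denoise = CachedDenoiser(toy_model, tau=0.1)
x = torch.randn(1, 4, 8, 8)
for t in range(50, 0, -1):
    x = x - 0.02 * denoise(x, t)               # simplified sampler update
```

Because the wrapper sits entirely outside the model, it matches the plug-and-play, no-retraining property described above; the real method's criterion for detecting the "stable period" is more principled than this toy relative-norm test.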
Training-free, plug-and-play, 2x end-to-end GPU inference acceleration: DraftAttention, an acceleration method for video diffusion models
机器之心· 2025-06-28 04:35
Core Insights
- The article discusses the challenges and advances in video generation with diffusion models, focusing on the computational bottleneck created by attention mechanisms in the Diffusion Transformer (DiT) architecture [1][6][14].
- A new method called DraftAttention is introduced, which significantly reduces the computational overhead of attention while maintaining high generation quality, achieving up to 2x end-to-end inference acceleration on GPUs [3][22][46].

Group 1: Background and Challenges
- Diffusion models have become the mainstream approach for high-quality video generation, but the computational load of attention grows dramatically with video length and resolution, leading to inefficiencies [1][6].
- In models like HunyuanVideo, attention can account for over 80% of total processing time, and generating an 8-second 720p video takes nearly an hour [1][5].
- Attention complexity grows quadratically with the number of tokens, which is directly proportional to video frame count and resolution, causing significant slowdowns in inference speed [6][7].

Group 2: Existing Solutions and Limitations
- Existing acceleration methods such as Sparse VideoGen and AdaSpa use sparse attention to obtain some end-to-end acceleration on GPUs, but their effectiveness is limited by insufficient sparsity and rigid design [2][3].
- These methods rely on fixed sparse operators and lack dynamic adaptability to the input content, making fine-grained, content-aware control of sparsity patterns difficult [2][7].

Group 3: DraftAttention Methodology
- DraftAttention is a training-free, plug-and-play, dynamic sparse attention mechanism designed to cut the computational cost of attention while preserving generation quality [3][11][46].
- The method builds a low-resolution attention map to estimate token importance, which then guides the selection of sparse patterns for the high-resolution attention computation [11][12] (a hedged sketch of this idea follows this summary).
- A token rearrangement strategy improves the execution efficiency of the sparse computation on GPUs, making the approach hardware-friendly [13][22].

Group 4: Theoretical Foundations and Experimental Results
- Theoretical analysis shows that the approximation error between the low-resolution and high-resolution attention maps is bounded, supporting the method's effectiveness [15][17].
- Experimental evaluations show that DraftAttention outperforms existing sparse attention methods such as Sparse VideoGen on multiple metrics, including PSNR and SSIM, particularly at high sparsity rates [20][21].
- On NVIDIA H100 and A100 GPUs, DraftAttention achieves up to 1.75x end-to-end inference acceleration, with gains that scale with video length, resolution, and sparsity [22][46].

Group 5: Future Directions
- The authors plan to further address efficiency bottlenecks in long video generation by integrating techniques such as quantization and distillation, aiming to extend high-quality video generation to resource-constrained environments like mobile and edge devices [46].
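As a hedged sketch of the draft-then-sparse idea described above, the PyTorch code below pools queries and keys into a low-resolution "draft" attention map, keeps only the most important key blocks for each query block, and applies the resulting block mask to full-resolution attention. The block size, keep ratio, and the use of dense masked attention in place of a true sparse GPU kernel (and of the paper's token rearrangement) are illustrative assumptions, not the published implementation.

```python
import torch

def draft_guided_sparse_attention(q, k, v, block=8, keep_ratio=0.25):
    """Low-resolution "draft" attention guiding block-sparse attention (sketch).

    q, k, v: (heads, seq, dim), with seq divisible by `block`."""
    h, n, d = q.shape
    nb = n // block
    # 1) Draft attention map over block-averaged (low-resolution) tokens.
    q_low = q.reshape(h, nb, block, d).mean(dim=2)                     # (h, nb, d)
    k_low = k.reshape(h, nb, block, d).mean(dim=2)                     # (h, nb, d)
    draft = torch.softmax(q_low @ k_low.transpose(-1, -2) / d ** 0.5, dim=-1)
    # 2) Per query block, keep only the most important key blocks.
    keep = max(1, int(keep_ratio * nb))
    top = draft.topk(keep, dim=-1).indices                             # (h, nb, keep)
    block_mask = torch.zeros(h, nb, nb)
    block_mask.scatter_(-1, top, 1.0)
    # 3) Expand the block mask to token resolution and run masked attention.
    token_mask = (block_mask.bool()
                  .repeat_interleave(block, dim=1)
                  .repeat_interleave(block, dim=2))                    # (h, n, n)
    scores = q @ k.transpose(-1, -2) / d ** 0.5
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v                           # (h, n, d)

# Toy usage: 4 heads, 64 tokens of a flattened video latent, head dim 32.
torch.manual_seed(0)
q, k, v = (torch.randn(4, 64, 32) for _ in range(3))
out = draft_guided_sparse_attention(q, k, v, block=8, keep_ratio=0.25)
print(out.shape)  # torch.Size([4, 64, 32])
```

In a real deployment the speedup comes from skipping the masked-out blocks with a hardware-friendly sparse kernel rather than materializing the full score matrix as this sketch does; the draft map itself is cheap because it is only (seq/block)^2 in size.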