A Magic Brush Creates Dimensions: Sketch-Driven Free Editing of 3D Scene Videos
机器之心· 2025-08-19 02:43
Core Viewpoint
- The article presents Sketch3DVE, a novel method for 3D scene video editing that lets users manipulate videos with simple sketches, enhancing creativity and personalization in video content creation [3][22].

Part 1: Background
- Recent video generation models have greatly improved text-to-video and image-to-video synthesis, with precise control over camera trajectories receiving particular attention because of its application value [6].
- Existing approaches fall into two categories: one feeds camera parameters directly into the model as inputs, while the other builds an explicit 3D representation from a single image and renders novel-view images from it [8][9].
- Despite these advances, editing real videos with significant camera motion remains challenging, since editing must preserve the original motion patterns and local features while synthesizing new content [8][9].

Part 2: Algorithm Principles
- The user selects the first frame of a 3D scene video, marks the editing region with a mask, and draws a sketch specifying the geometry of the new object [12].
- The system applies the MagicQuill image editing algorithm to the first frame to produce the edited result, and uses the DUSt3R algorithm for 3D reconstruction to analyze the entire input video [13].
- A 3D mask propagation algorithm accurately transfers the first-frame mask to subsequent frames, keeping it consistent across viewpoints [14].
- A video generation model then fuses the edited image, the multi-view renderings, and the original input video into a scene-edited video with precise 3D consistency [14].

Part 3: Effect Demonstration
- The method supports high-quality 3D scene video edits, including adding, removing, and replacing objects, while maintaining good 3D consistency [16].
- Because it is trained on real video datasets, it handles complex scenarios involving shadows and reflections and produces plausible results [17].
- The first frame can also be edited with image completion methods, demonstrating the system's versatility in generating realistic 3D scene video edits [19].
- Sketch3DVE offers an effective alternative to traditional model-insertion pipelines, enabling personalized 3D object generation and high-fidelity scene video editing without specialist expertise [22].
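The 3D mask propagation step described above can be sketched as a standard back-project-and-reproject pass: lift the masked first-frame pixels into 3D using that frame's depth, transform them through the camera poses, and splat them into frame k. The following is a minimal NumPy illustration of the geometric idea under a pinhole camera model; the function name, arguments, and use of a dense depth map are assumptions for exposition, not the paper's actual implementation (which builds on DUSt3R's reconstruction):

```python
import numpy as np

def propagate_mask(mask0, depth0, K, T_w_c0, T_w_ck, hw):
    """Back-project masked first-frame pixels to 3D via depth, then
    reproject them into frame k's camera to obtain frame k's mask."""
    H, W = hw
    ys, xs = np.nonzero(mask0)
    z = depth0[ys, xs]
    # Pixels -> camera-0 coordinates (pinhole model).
    pts_c0 = np.linalg.inv(K) @ np.vstack([xs * z, ys * z, z])
    pts_h = np.vstack([pts_c0, np.ones(pts_c0.shape[1])])
    # Camera-0 -> world -> camera-k.
    pts_ck = (np.linalg.inv(T_w_ck) @ T_w_c0 @ pts_h)[:3]
    # Project into frame k's image plane.
    uvw = K @ pts_ck
    u = np.round(uvw[0] / uvw[2]).astype(int)
    v = np.round(uvw[1] / uvw[2]).astype(int)
    mask_k = np.zeros((H, W), dtype=bool)
    keep = (uvw[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    mask_k[v[keep], u[keep]] = True
    return mask_k
```

With an identity relative pose the propagated mask reproduces the input mask exactly; with real poses, occlusion handling and hole filling (which the paper's full pipeline must address) are omitted here.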
A Three-Year Leap: How Did China's AI Come From Behind Against the U.S.?
36Kr· 2025-06-26 02:29
Core Insights
- The article discusses the rapid advancement of China's AI capabilities relative to the United States, highlighting the narrowing gap in language models and the strategic importance of open-weight policies in fostering innovation and collaboration [1][2][3].

Group 1: AI Advancements and Comparisons
- Since the release of ChatGPT in 2022, the gap between Chinese and American AI has narrowed significantly, shrinking to less than three months on performance metrics by May 2025 [2].
- DeepSeek R1 and OpenAI's o3 both scored 68 points on the Artificial Analysis Intelligence Index, indicating that China has made substantial progress in AI model performance [2].
- China's advancements are attributed to both technical performance improvements and strategic breakthroughs, such as the adoption of reinforcement learning to enhance model capabilities [2][4].

Group 2: Open Weight Strategy
- Chinese AI labs have widely adopted an open-weight strategy, in contrast to the closed-source approach of leading American companies, accelerating technology sharing and innovation [4][10].
- The open-weight approach lowers technical barriers, allowing developers to build on existing models easily and fostering a collaborative ecosystem [7][8].
- Companies like ByteDance and Tencent have launched open-source models that have gained traction both domestically and internationally, demonstrating the effectiveness of this strategy [9][10].

Group 3: Ecosystem and Collaboration
- The Chinese AI ecosystem consists of large tech companies, startups, and cross-industry players, each playing distinct roles in advancing AI technology [15][21].
- Major tech firms like Alibaba, Tencent, and Huawei provide foundational models and platforms, while startups focus on niche innovations, enhancing the ecosystem's overall diversity and competitiveness [16][18].
- Cross-industry players integrate AI into existing products, leveraging their user bases and application scenarios to drive practical value [19][20].

Group 4: Future Directions and Challenges
- The competition between China and the U.S. in AI is evolving, with potential for both collaboration and conflict, particularly in foundational research and industry standards [32][36].
- The article suggests that the future of AI will depend on balancing cooperation and competition, with both countries needing to navigate their differing governance philosophies and market dynamics [38][39].
ICML 2025 | Video Generation Models Accelerated 2x Without Quality Loss: The Secret Is Exploiting the Spatiotemporal Sparsity of Attention
机器之心· 2025-05-07 07:37
Core Viewpoint
- The article discusses the rapid advancement of AI video generation technology and introduces Sparse VideoGen, which significantly accelerates video generation without compromising quality [1][4][23].

Group 1: Performance Bottlenecks in Video Generation
- State-of-the-art video generation models like Wan 2.1 and HunyuanVideo face significant performance bottlenecks: generating a 5-second 720p video takes over 30 minutes on a single H100 GPU, with the 3D Full Attention module consuming over 80% of inference time [1][6][23].
- The computational complexity of attention in Video Diffusion Transformers (DiTs) grows quadratically with resolution and frame count, limiting real-world deployment [6][23].

Group 2: Introduction of Sparse VideoGen
- Sparse VideoGen is a novel acceleration method that requires no retraining of existing models; it exploits spatial and temporal sparsity in attention to halve inference time while maintaining high pixel fidelity (PSNR = 29) [4][23].
- The method has been integrated with several state-of-the-art open-source models and supports both text-to-video (T2V) and image-to-video (I2V) tasks [4][23].

Group 3: Key Design Features of Sparse VideoGen
- Sparse VideoGen identifies two distinct sparsity patterns in attention maps: spatial sparsity, where a head attends to tokens within the same and adjacent frames, and temporal sparsity, where a head captures relationships across different frames [10][11][12].
- A dynamic adaptive sparse strategy based on online profiling selects the optimal combination of spatial and temporal heads for each denoising step and prompt [16][17].

Group 4: Operator-Level Optimization
- A hardware-friendly layout transformation optimizes memory access patterns, improving the performance of temporal heads by storing the tokens they attend to contiguously in memory [20][21].
- Additional optimizations for Query-Key Normalization (QK-Norm) and Rotary Position Embedding (RoPE) yield significant throughput improvements, with average acceleration ratios of 7.4x and 14.5x, respectively [21].

Group 5: Experimental Results
- Sparse VideoGen reduces inference time for HunyuanVideo from roughly 30 minutes to under 15 minutes, and for Wan 2.1 from 30 minutes to 20 minutes, while keeping PSNR above 29 dB [23].
- The research suggests that understanding the internal structure of video generation models may yield more sustainable performance breakthroughs than merely increasing model size [24].
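The two sparsity patterns and the layout transformation described above can be illustrated with a toy attention-mask construction. The frame/patch layout, the mask definitions, and the position-major permutation below are simplified assumptions for exposition, not Sparse VideoGen's actual kernels:

```python
import numpy as np

F, P = 4, 6          # frames, spatial tokens (patches) per frame
N = F * P            # total tokens, frame-major order: idx = f * P + p

f = np.arange(N) // P     # frame index of each token
p = np.arange(N) % P      # spatial position of each token

# Spatial head: attend only within the same or adjacent frames.
spatial_mask = np.abs(f[:, None] - f[None, :]) <= 1

# Temporal head: attend only to the same spatial position across frames.
temporal_mask = p[:, None] == p[None, :]

# Layout transformation: reordering tokens to position-major order makes
# each temporal head's attended tokens contiguous in memory.
perm = np.argsort(p, kind="stable")          # position-major permutation
reordered = temporal_mask[np.ix_(perm, perm)]
# In the new layout the temporal mask is block-diagonal: each block of F
# consecutive rows/columns (one spatial position) is fully connected.
```

The block-diagonal structure after permutation is what makes the sparse kernel hardware-friendly: a temporal head reads one contiguous slab of F tokens instead of F strided loads spaced P tokens apart.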
Media Industry Weekly: GPT-4.5 Released, DeepSeek's "Open Source Week" Concludes
GOLDEN SUN SECURITIES· 2025-03-02 02:55
Investment Rating
- The report maintains an "Overweight" rating for the media sector [6].

Core Viewpoints
- The media sector declined 8.06% during the week of February 24-28, 2025, weighed down by market conditions. The outlook for 2025 remains optimistic, focusing on AI applications and mergers and acquisitions, particularly among state-owned enterprises [1][10].
- The release of "Nezha 2" has further boosted the popularity of domestic IPs, highlighting significant opportunities across the IP monetization value chain, including trendy toys and film content [1].
- The publishing and gaming sectors are expected to benefit from tax relief policies, with the publishing industry projected to see high growth in 2025 [1].

Summary by Sections

Market Overview
- The media sector's performance was notably poor, ranking among the bottom three sectors with a decline of 8.06% [10].
- The top-performing sectors included steel, building materials, and real estate, while the computing and communications sectors also fell sharply [10].

Subsector Insights
- Key focus areas include:
  1. Resource integration expectations: companies like China Vision Media and Guoxin Culture are highlighted [2].
  2. AI applications: companies such as Aofei Entertainment and Tom Cat are noted for their potential [2].
  3. Gaming: strong recommendations for companies like Shenzhou Taiyue and Kaixin Network [2].
  4. State-owned enterprises: companies like Ciweng Media and Anhui New Media are emphasized [2].
  5. Education: companies like Xueda Education and Action Education are mentioned [2].
  6. Hong Kong stocks: notable mentions include Tencent Holdings and Pop Mart [2].

Key Events Review
- OpenAI's release of GPT-4.5, which reportedly achieves over ten times the computational efficiency of GPT-4, is a significant development in AI technology [21].
- DeepSeek's open-source initiatives, including the release of several codebases, aim to improve data access and model training efficiency [21].
- Alibaba's launch of the video generation model Wan 2.1 showcases advances in video technology, particularly in generating synchronized movements and text within videos [21].

Subsector Data Tracking
- The gaming sector is seeing a steady stream of new releases, with popular titles currently available for pre-order [23].
- The domestic film market's weekly box office totaled approximately 431 million yuan, with "Nezha: The Devil's Child" leading [24][26].
- The top-rated series and variety shows reflect strong viewer engagement, with "Difficult to Please" and "Mars Intelligence Agency Season 7" leading in viewership [27][28].
Alibaba's Open-Source "Sora" Tops the Charts at Launch, Runs on an RTX 4070, Free for Commercial Use
量子位· 2025-02-26 03:51
Core Viewpoint
- The article discusses the release of Alibaba's video generation model Wan 2.1, which tops the VBench leaderboard and introduces significant advances in video generation technology [2][8].

Group 1: Model Performance
- The 14-billion-parameter Wan 2.1 excels at complex motion details, such as keeping five dancers synchronized in a hip-hop routine [2][3].
- The model can render legible text within generated videos, a task that has been difficult even for static image generation [4].
- Two versions are available: a 14B model supporting 720p resolution and a smaller 1.3B model supporting 480p, with the latter more accessible for personal use [5][20].

Group 2: Computational Efficiency
- Detailed performance metrics are provided for various GPU configurations, highlighting the model's computational efficiency [7].
- The 1.3B version needs just over 8 GB of VRAM on a 4090 GPU, while the 14B version has substantially higher memory demands [5][20].
- Innovative techniques such as a 3D variational autoencoder and a diffusion transformer architecture improve performance and reduce memory usage [21][24].

Group 3: Technical Innovations
- Wan 2.1 uses a T5 encoder for multilingual text encoding and incorporates cross-attention mechanisms within its transformer blocks [22].
- A feature caching mechanism in the convolution modules improves spatiotemporal compression [24].
- Distributed strategies for model training and inference improve efficiency and reduce latency during video generation [29][30].

Group 4: User Accessibility
- Wan 2.1 is open-sourced under the Apache 2.0 license, allowing free commercial use [8].
- Users can access the model through Alibaba's platform in both rapid and professional versions, although high demand may mean long wait times [10].
- The model's capabilities have inspired users to create diverse content, showcasing its versatility [11][19].
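The feature-caching idea in the causal convolution modules can be sketched in one dimension: when a long video is encoded chunk by chunk, caching the last kernel_size - 1 input frames makes the streamed output identical to processing the whole sequence at once, with constant memory. This is a hypothetical minimal sketch (the class name, 1D scalar features, and NumPy convolution are all assumptions), not Wan 2.1's implementation:

```python
import numpy as np

class CachedCausalConv1d:
    """Causal temporal convolution with a feature cache, so a long
    sequence can be processed chunk by chunk with constant memory."""

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        # Cache the last k-1 input frames seen so far (zeros = causal padding).
        self.cache = np.zeros(len(self.kernel) - 1)

    def forward(self, chunk):
        k = len(self.kernel)
        x = np.concatenate([self.cache, np.asarray(chunk, dtype=float)])
        self.cache = x[-(k - 1):].copy()
        # out[t] = sum_j kernel[j] * x[t - j]  (causal: no future frames used)
        return np.array([x[i:i + k] @ self.kernel[::-1] for i in range(len(chunk))])
```

Because only k - 1 frames are carried between chunks, the encoder's memory footprint is independent of video length, which is the point of the caching mechanism.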