CogVideoX

A Magic Brush for 3D: Sketch-Driven Free Editing of 3D Scene Videos
机器之心· 2025-08-19 02:43
Core Viewpoint
- The article presents Sketch3DVE, a novel method for 3D scene video editing that lets users manipulate videos with simple sketches, bringing greater creativity and personalization to video content creation [3][22].

Part 1: Background
- Recent video generation models have markedly improved text-to-video and image-to-video generation, with growing attention to precise control of camera trajectories given its broad application prospects [6].
- Existing methods fall into two categories: one feeds camera parameters directly into the model as inputs; the other builds an explicit 3D representation from a single image and renders novel-view images from it [8][9].
- Even so, editing real videos with significant camera motion remains challenging, because the edit must preserve the original motion patterns and local details while synthesizing new content [8][9].

Part 2: Algorithm Principles
- The user selects the first frame of a 3D scene video, marks the editing region with a mask, and draws a sketch specifying the geometry of the new object [12].
- The system applies the MagicQuill image editing algorithm to the first frame to produce the edited result, and runs the DUSt3R algorithm to reconstruct 3D geometry from the entire input video [13].
- A 3D mask propagation algorithm accurately transfers the first-frame mask to subsequent frames, keeping the edit region consistent across viewpoints [14].
- A final video generation model fuses the edited image, the multi-view renderings, and the original input video to produce a scene-edited video with precise 3D consistency [14]; a minimal pipeline sketch follows after this entry.

Part 3: Effect Demonstration
- The method produces high-quality 3D scene video edits, supporting operations such as adding, removing, and replacing objects while maintaining good 3D consistency [16].
- Because it is trained on real video datasets, the approach handles complex cases involving shadows and reflections and still produces plausible editing results [17].
- Users can also edit the first frame with image completion methods, demonstrating the system's versatility in generating realistic 3D scene video edits [19].
- Sketch3DVE offers an effective alternative to traditional model insertion, enabling personalized 3D object generation and high-fidelity scene video editing without requiring specialized expertise [22].
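To make the pipeline concrete, here is a minimal Python sketch of the four-stage flow summarized above. Every function in it (magicquill_edit, dust3r_reconstruct, propagate_mask_3d, video_diffusion_edit) is a hypothetical stub standing in for the component the article names; the real Sketch3DVE interfaces are not described in this summary, so treat this as an illustration of the data flow, not the actual implementation.

```python
# Hypothetical sketch of the Sketch3DVE editing flow; stubs only.
import numpy as np

def magicquill_edit(frame, mask, sketch):
    """Stand-in for MagicQuill: edit the masked region of the first
    frame according to the user's sketch."""
    edited = frame.copy()
    edited[mask] = sketch[mask]          # stand-in for generative editing
    return edited

def dust3r_reconstruct(frames):
    """Stand-in for DUSt3R: recover per-pixel point maps and camera
    poses from the input video."""
    n = len(frames)
    point_maps = np.zeros((n, *frames[0].shape[:2], 3))  # xyz per pixel
    poses = np.tile(np.eye(4), (n, 1, 1))                # camera-to-world
    return point_maps, poses

def propagate_mask_3d(mask0, point_maps, poses):
    """Lift the first-frame mask into 3D and reproject it into every
    frame so the edit region stays consistent across viewpoints.
    (A real system reprojects masked 3D points through each camera;
    here we simply broadcast the mask.)"""
    return [mask0.copy() for _ in point_maps]

def video_diffusion_edit(frames, masks, edited_first_frame):
    """Stand-in for the video generation model that fuses the edited
    first frame, the propagated masks, and the original video."""
    out = [f.copy() for f in frames]
    for f, m in zip(out, masks):
        f[m] = edited_first_frame[m]     # stand-in for 3D-consistent synthesis
    return out

def sketch3dve_pipeline(frames, mask0, sketch):
    edited0 = magicquill_edit(frames[0], mask0, sketch)
    point_maps, poses = dust3r_reconstruct(frames)
    masks = propagate_mask_3d(mask0, point_maps, poses)
    return video_diffusion_edit(frames, masks, edited0)

if __name__ == "__main__":
    video = [np.zeros((64, 64, 3), np.uint8) for _ in range(8)]
    mask = np.zeros((64, 64), bool)
    mask[16:48, 16:48] = True
    sketch = np.full((64, 64, 3), 255, np.uint8)
    result = sketch3dve_pipeline(video, mask, sketch)
    print(len(result), result[0].shape)   # 8 (64, 64, 3)
```

The point the sketch highlights is that it is the mask, not the edited pixels, that gets propagated in 3D: the per-frame masks come from reprojecting the first-frame edit region through the reconstructed cameras, which is what keeps the edit consistent as the viewpoint moves.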
AI-Generated Videos Keep Violating the Laws of Physics? PhyT2V, New Work from a University of Pittsburgh Team, Boosts Physical Realism 2.3x Without Retraining the Model!
机器之心· 2025-05-19 04:03
Core Viewpoint
- The article traces the advance of text-to-video (T2V) generation, emphasizing the shift from pursuing visual quality alone to ensuring physical consistency and realism, via the PhyT2V framework, which enhances existing T2V models without retraining or extensive external data [2][3][26].

Summary by Sections

Introduction to PhyT2V
- PhyT2V, developed by a research team at the University of Pittsburgh, improves the physical consistency of T2V generation by using large language models (LLMs) for iterative self-refinement [2][3][8].

Current State of T2V Technology
- Recent T2V models such as Sora, Pika, and CogVideoX have made significant progress in generating complex, realistic scenes, but they still struggle to obey real-world physical rules and common sense [5][7].

Limitations of Existing Methods
- Current methods for enhancing T2V models rely on data-driven training or fixed physical categories, which limits their generalizability, especially in out-of-distribution scenarios [10][12][18].

PhyT2V Methodology
- PhyT2V runs a three-step iterative loop (sketched in code after this entry):
  1. Identify the physical rules and main objects implied by the user prompt [12].
  2. Detect semantic mismatches between the generated video and the prompt using a video captioning model [13].
  3. Generate a corrected prompt from the identified physical rules and mismatches [14][18].

Advantages of PhyT2V
- It requires no changes to model structure and no additional training data, making it easy to adopt [18].
- It closes a feedback loop, correcting prompts based on the actually generated results rather than the prompt alone [18].
- It transfers well across domains, particularly across varied physical scenarios [18].

Experimental Results
- Tested on multiple T2V models, the framework delivers significant gains in physical consistency (PC) and semantic adherence (SA) scores, with CogVideoX-5B improving by up to 2.2x in PC and 2.3x in SA [23][26].

Conclusion
- PhyT2V is a novel, data-independent approach that makes generated videos comply with real-world physical principles without additional retraining, a significant step toward more realistic T2V models [26].
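The three-step loop is simple enough to outline in code. Below is a minimal sketch under the assumption that the T2V model, the video captioner, and the LLM are all available as plain callables; the function name phyt2v_refine, the prompt wording, and the fixed round count are illustrative assumptions, not the paper's actual prompts or interfaces.

```python
# Hypothetical sketch of PhyT2V's iterative prompt-refinement loop.

def phyt2v_refine(prompt, t2v_model, captioner, llm, rounds=3):
    """Iteratively refine a T2V prompt for physical consistency."""
    for _ in range(rounds):
        # Step 1: extract the main objects and the physical rules the
        # scene should obey, using the LLM as a reasoner.
        rules = llm(f"List the main objects and the physical rules that "
                    f"should govern this scene: {prompt}")
        # Generate a candidate video and caption what it actually shows.
        video = t2v_model(prompt)
        caption = captioner(video)
        # Step 2: detect semantic mismatches between caption and prompt.
        mismatch = llm(f"Prompt: {prompt}\nCaption: {caption}\n"
                       f"Describe any mismatch between them.")
        # Step 3: rewrite the prompt using the rules and the mismatches.
        prompt = llm(f"Rewrite the prompt so the video obeys these rules "
                     f"({rules}) and fixes these issues ({mismatch}):\n"
                     f"{prompt}")
    return prompt, t2v_model(prompt)

if __name__ == "__main__":
    # Trivial stand-ins so the sketch executes end to end.
    echo_llm = lambda q: q[-60:]
    dummy_t2v = lambda p: f"<video for: {p[:40]}>"
    dummy_cap = lambda v: "a ball floats upward"
    final_prompt, video = phyt2v_refine("a ball dropped from a table",
                                        dummy_t2v, dummy_cap, echo_llm,
                                        rounds=1)
    print(final_prompt[:80], video)
```

Note the feedback direction: the correction is driven by a caption of the video the model actually produced, not by the prompt alone, which is what distinguishes this loop from one-shot prompt rewriting.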
ICML 2025 | Lossless 2x Speedup for Video Generation Models; the Secret Is Exploiting the Spatio-Temporal Sparsity of Attention
机器之心· 2025-05-07 07:37
Core Viewpoint
- The article covers the rapid advance of AI video generation technology and introduces Sparse VideoGen, which significantly accelerates video generation without compromising quality [1][4][23].

Group 1: Performance Bottlenecks in Video Generation
- State-of-the-art video generation models such as Wan 2.1 and HunyuanVideo face severe performance bottlenecks: generating a 5-second 720p video takes over 30 minutes on a single H100 GPU, with the 3D Full Attention module consuming over 80% of inference time [1][6][23].
- The computational cost of attention in Video Diffusion Transformers (DiTs) grows quadratically with resolution and frame count, limiting real-world deployment [6][23].

Group 2: Introduction of Sparse VideoGen
- Sparse VideoGen is a novel acceleration method that requires no retraining of existing models; it exploits spatial and temporal sparsity in attention to halve inference time while preserving high pixel fidelity (PSNR = 29) [4][23].
- It has been integrated with several state-of-the-art open-source models and supports both text-to-video (T2V) and image-to-video (I2V) tasks [4][23].

Group 3: Key Design Features of Sparse VideoGen
- Sparse VideoGen identifies two distinct sparsity patterns in attention maps: spatial sparsity, where heads attend to tokens within the same and adjacent frames, and temporal sparsity, where heads attend to the same spatial location across frames [10][11][12].
- A dynamic, adaptive sparse strategy based on online profiling selects the best combination of spatial and temporal heads per denoising step and prompt [16][17]; a sketch of both ideas follows after this entry.

Group 4: Operator-Level Optimization
- A hardware-friendly layout transformation reorders tokens so that those accessed by temporal heads are stored contiguously in memory, improving memory access patterns [20][21].
- Further optimizations to Query-Key Normalization (QK-Norm) and Rotary Position Embedding (RoPE) deliver average speedups of 7.4x and 14.5x, respectively [21].

Group 5: Experimental Results
- Sparse VideoGen cuts HunyuanVideo inference from roughly 30 minutes to under 15 minutes and Wan 2.1 from 30 minutes to 20 minutes, while keeping PSNR above 29 dB [23].
- The results suggest that understanding the internal structure of video generation models may yield more sustainable performance breakthroughs than simply scaling model size [24].
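The two sparsity patterns and the online-profiling idea can be illustrated in a few lines of NumPy. In this sketch, the mask construction, the query-sampling scheme, and the mean-absolute-error metric are all illustrative assumptions; the real Sparse VideoGen implementation runs as fused GPU attention kernels, not dense masked matmuls.

```python
# Hypothetical sketch of spatial/temporal attention sparsity and
# online-profiling head classification.
import numpy as np

def spatial_mask(frames, tokens_per_frame):
    """Allow attention only within the same and adjacent frames."""
    n = frames * tokens_per_frame
    fid = np.arange(n) // tokens_per_frame        # frame index per token
    return np.abs(fid[:, None] - fid[None, :]) <= 1

def temporal_mask(frames, tokens_per_frame):
    """Allow attention only to the same spatial location across frames."""
    n = frames * tokens_per_frame
    pid = np.arange(n) % tokens_per_frame         # spatial index per token
    return pid[:, None] == pid[None, :]

def attention(q, k, v, mask=None):
    """Plain (optionally masked) softmax attention for one head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

def classify_head(q, k, v, masks, n_samples=32, seed=0):
    """Online profiling: on a few sampled queries, compare each sparse
    pattern's output against full attention and keep the closest one."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(q), size=min(n_samples, len(q)), replace=False)
    ref = attention(q[idx], k, v)
    errs = {name: np.abs(attention(q[idx], k, v, m[idx]) - ref).mean()
            for name, m in masks.items()}
    return min(errs, key=errs.get)

if __name__ == "__main__":
    F, P, D = 8, 16, 32                   # frames, tokens/frame, head dim
    rng = np.random.default_rng(1)
    q, k, v = (rng.standard_normal((F * P, D)) for _ in range(3))
    masks = {"spatial": spatial_mask(F, P), "temporal": temporal_mask(F, P)}
    print("head classified as:", classify_head(q, k, v, masks))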
Zhipu and Shengshu Technology Form a Strategic Partnership: Advancing the Technical Innovation and Industrial Deployment of Domestic Large Models
IPO早知道· 2025-04-27 12:38
Two star AI companies with Tsinghua roots.

This article is an original by IPO早知道. Author | Stone Jin. WeChat official account | ipozaozhidao.

According to IPO早知道, the two Tsinghua-rooted star AI companies recently reached a strategic partnership: Zhipu (Z.ai) and Shengshu Technology (shengshu.com) announced that, drawing on their respective technical strengths in large language models and multimodal generation models, they will join forces across joint R&D, product integration, solution bundling, and industry collaboration to jointly advance the technical innovation and industrial deployment of domestic large models.

Under the strategic agreement, on the product side, Zhipu's MaaS platform will integrate Shengshu Technology's Vidu API ...

On joint R&D: Zhipu independently develops the GLM model family, with leading technology in both language and multimodal models; its open-source video generation model CogVideoX has earned more than 10,000 stars on GitHub. Shengshu focuses on independently developed general-purpose multimodal models, offering leading video generation and multimodal generation products.

The strategic alliance of Zhipu and Shengshu, two leading Tsinghua-rooted Chinese AI companies, combines their accumulated strengths in the multimodal field. It can not only further raise the overall capability and leadership of domestic large models, but also foster healthy innovation synergy and prosperity across the domestic large-model ecosystem.
Zhipu Officially Launches Its A-Share IPO: B2B and B2C Businesses Advancing Together, and Another Top-Performing Model Open-Sourced Today
IPO早知道· 2025-04-15 01:18
The first "large-model startup" to formally begin the IPO process.

This article is an original by IPO早知道. Author | Stone Jin. WeChat official account | ipozaozhidao.

According to IPO早知道, Beijing Zhipu Huazhang Technology Co., Ltd. ("Zhipu") signed an advisory agreement with CICC on March 31, 2025, formally starting its A-share IPO process. This makes Zhipu the first "large-model startup" to formally begin a listing process.

Founded in 2019, Zhipu is dedicated to building a new generation of cognitive large models. As early as the end of 2020, Zhipu developed the GLM pretraining architecture; in 2021 it finished training the 10-billion-parameter GLM-10B and, in the same year, used an MoE architecture to successfully train a converged trillion-parameter sparse model; in 2022 it developed and open-sourced GLM-130B, a Chinese-English bilingual pretrained model at the hundred-billion-parameter scale. In 2023, Zhipu released the hundred-billion-parameter-base dialogue model ChatGLM and upgraded it twice; the open-source ChatGLM-6B made local fine-tuning and deployment practical for large-model developers.

In January 2024, Zhipu released the new-generation foundation model GLM-4, with overall performance greatly improved over the previous generation; in June it open-sourced GLM-4-9B and the vision model GLM-4V-9B, whose multimodal capability rivals GPT-4V; in July it released the video generation model CogVideoX, with inference speed ...
Exclusive | Tsinghua Heavyweight Just Raised RMB 3 Billion
投资界· 2024-12-17 00:39
Author | 刘博 (Liu Bo)  Report | 投资界PEdaily

This may be the last mega-financing of 2024.

投资界 (PEdaily) has exclusively learned that Zhipu AI recently completed a new round of financing of RMB 3 billion. According to several people familiar with the matter, the new investors include multiple strategic and state-owned institutions, while existing shareholders such as Legend Capital (君联资本) continued to follow on.

Over the past year, domestic AI financings have come one after another. Founded in 2019, Zhipu AI is backed by a group of Tsinghua heavyweights: CEO Zhang Peng (张鹏) earned his bachelor's, master's, and doctoral degrees at Tsinghua, and chairman Liu Debing (刘德兵) and president Wang Shaolan (王绍兰) are fellow Tsinghua alumni. In just five years, Zhipu AI has become one of the flagship domestic AI companies, assembling a long roster of investors behind it.

Although China's primary market has been unusually quiet this year, AI financing has remained vigorous, with rounds of hundreds of millions of yuan commonplace, producing a cohort of AI super-unicorns: Moonshot AI (月之暗面), Baichuan Intelligence (百川智能), MiniMax, 01.AI (零一万物)... This is without doubt one of the most vivid snapshots of China's AI era.

Domestic AI is booming.

Tsinghua alumni join forces to build a Chinese version of OpenAI

This is a unicorn that emerged from a Tsinghua laboratory.

Rewind to 2006, when the Knowledge Engineering Group (KEG Lab) of Tsinghua's Department of Computer Science released the AMiner platform, which applies AI methods to mine the objective laws governing progress in the natural sciences and technology. Zhang Peng, after completing his undergraduate degree at Tsinghua in 2002, had joined the KEG Lab as a master's student ...