Video Generation
SIGGRAPH Asia 2025 | When video generation truly "sees a person clearly": a unified framework for multi-view identity consistency, realistic lighting, and controllable cameras
Ji Qi Zhi Xin· 2025-12-27 04:01
The first author, Yuancheng Xu, is a research scientist at Netflix Eyeline focusing on the research and development of foundational AI models, spanning multimodal understanding, reasoning, interaction, and generation, with an emphasis on controllable video generation and its applications in film and television production. He received his Ph.D. from the University of Maryland, College Park in 2025. The last author, Ning Yu, is a senior research scientist at Netflix Eyeline, where he leads R&D on video generation AI for film production. He previously worked at Salesforce, NVIDIA, and Adobe, and holds a joint Ph.D. from the University of Maryland and the Max Planck Institute. He has been a finalist for the Qualcomm fellowship and the CSAW Europe best paper award, and has received the Amazon Twitch fellowship, a Microsoft young scholar fellowship, and an SPIE best student paper award. He serves as an area chair for top conferences including CVPR, ICCV, ECCV, NeurIPS, ICML, and ICLR, and as an action editor of TMLR. In film and virtual production, "seeing a person clearly" has never meant seeing a single frame clearly. Through camera movement and lighting changes, a director lets the audience gradually build a complete understanding of a character across different viewpoints and lighting conditions. Yet in much of the current research on customizing video generation models, this most basic fact is often overlooked. The overlooked core problem: Multi-view Ident ...
Generation without forgetting: an "ultra-long-horizon" world model, Peking University's EgoLCD backed by long/short-term memory
36Kr· 2025-12-24 07:58
[Overview] Do video generation models always have a "bad memory"? Objects deforming and backgrounds breaking just seconds into a generation? Peking University, Sun Yat-sen University, and other institutions have jointly released EgoLCD, which borrows the human "long/short-term memory" mechanism and pioneers a sparse KV cache + LoRA dynamic adaptation architecture to tackle the "content drift" problem in long videos, setting a new SOTA on the EgoVid-5M benchmark and giving AI a coherent first-person memory like a human's.

With the explosion of models such as Sora and Genie, video generation is moving from "animating images" toward the grand goal of a "world simulator". However, on the road to "unlimited-duration" video generation stands a roadblock: content drift.

Have you noticed that existing video generation models often have a "goldfish memory" when generating long videos: blue tiles one second become a white wall the next; a cup held in hand gradually morphs into a strange shape? For first-person (egocentric) views, with their violent camera shake and complex interactions, models get "lost" especially easily.

Generating a long video is not the hard part; staying faithful to how it began is.

Recently, a research team from Peking University, Sun Yat-sen University, Zhejiang University, the Chinese Academy of Sciences, and Tsinghua University proposed EgoLCD, a new long-context diffusion model that not only introduces a "brain-inspired long/short-term memory" design but also a new structured narrative prompt scheme, successfully letting AI "remember" the scene while generating long videos ...
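The "sparse KV cache" half of the summary can be made concrete with a toy sketch: keep a short sliding window of recent key/value entries (short-term memory) plus a few pinned high-salience anchors that survive eviction (long-term memory). Everything below, including the class name and the salience heuristic, is an illustrative assumption, not EgoLCD's actual implementation; the LoRA dynamic-adaptation side is omitted entirely.

```python
from collections import deque

class SparseKVCache:
    """Toy long/short-term KV cache for long-horizon generation.

    Short-term memory: a sliding window of the most recent entries.
    Long-term memory: a small, sparse set of high-salience entries
    that are pinned and survive window eviction, so early scene
    content (e.g. "blue tiles") is still visible many steps later.
    """

    def __init__(self, window: int = 4, long_term_slots: int = 2):
        self.recent = deque(maxlen=window)   # short-term: recency-based
        self.long_term = []                  # sparse anchors: salience-based
        self.long_term_slots = long_term_slots

    def add(self, key, value, salience: float):
        self.recent.append((key, value))
        # Keep only the top-k most salient entries as pinned anchors.
        self.long_term.append((salience, key, value))
        self.long_term.sort(key=lambda t: t[0], reverse=True)
        self.long_term = self.long_term[: self.long_term_slots]

    def context(self):
        """KV entries visible to the next generation step."""
        anchors = [(k, v) for _, k, v in self.long_term]
        window = [kv for kv in self.recent if kv not in anchors]
        return anchors + window
```

The point of the sparsity is the bound: the attention context stays at `long_term_slots + window` entries no matter how long the video runs, instead of growing linearly with generated frames.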
Camera motion error reduced by 40%! DualCamCtrl fits video generation with a "depth camera" to make camera moves more "obedient"
Ji Qi Zhi Xin· 2025-12-21 04:21
The co-first authors of this work are Hongfei Zhang (research assistant) and Kanghao Chen (Ph.D. student) of EnVision Research at the Hong Kong University of Science and Technology (Guangzhou); both are advised by Prof. Yingcong Chen.

Does your generative model really "understand geometry", or is it just pretending to follow the camera trajectory?

Although many current video generation models claim "camera motion control", their control signal usually relies only on camera poses. Recent work has encoded motion information via per-pixel ray directions (ray condition), but because the model must still infer 3D structure implicitly, it fundamentally lacks an explicit geometric understanding of the scene. This limitation leads to inconsistent camera motion: constrained by the entanglement of appearance and structure within one representation, the model cannot fully capture the scene's underlying geometry.

Given these challenges, a research team from the Hong Kong University of Science and Technology, Fudan University, and other institutions proposed DualCamCtrl, a new end-to-end geometry-aware diffusion framework. Addressing existing methods' shortcomings in scene understanding and geometric awareness, it introduces a novel "dual-branch diffusion architecture" that simultaneously generates RGB and depth sequences consistent with the camera motion. Further, to let the RGB and depth modalities cooperate efficiently, DualCamCtrl proposes a Semantic Guided Mutual Alignment mechanism, which, guided by semantic information, ...
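The "per-pixel ray direction (ray condition)" encoding mentioned above is a standard construction: back-project each pixel through the camera intrinsics and rotate the resulting direction into world space. The sketch below shows that generic computation only; it is not DualCamCtrl's own code, and the conventions (pixel-center offsets, world-to-camera rotation `R`) are assumptions.

```python
import numpy as np

def ray_directions(K: np.ndarray, R: np.ndarray, H: int, W: int) -> np.ndarray:
    """Per-pixel unit ray directions in world coordinates.

    K: 3x3 camera intrinsics; R: 3x3 world-to-camera rotation.
    Returns an (H, W, 3) array, one unit direction per pixel.
    """
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3)
    # Back-project: d_cam = K^{-1} [u, v, 1]^T, applied row-wise.
    cam = pix @ np.linalg.inv(K).T
    # Rotate into world frame: d_world = R^T d_cam, applied row-wise.
    world = cam @ R
    return world / np.linalg.norm(world, axis=-1, keepdims=True)
```

A conditioning map like this (often extended to 6-D Plücker coordinates with the camera center) is what "ray condition" work feeds the diffusion model per frame; DualCamCtrl's argument is that such encodings alone still leave 3D structure implicit.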
The remaining paper window for autonomous-driving world models won't stay open much longer......
Zi Dong Jia Shi Zhi Xin· 2025-12-11 00:05
Core Insights
- The article highlights the recent surge in research papers related to world models in autonomous driving, indicating a trend towards localized breakthroughs and verifiable improvements in the field [1]
- It emphasizes the importance of refining submissions to top conferences, suggesting that the final 10% of polishing can significantly impact the overall quality and acceptance of the paper [2]
- The platform "Autonomous Driving Heart" is presented as a leading AI technology media outlet in China, with a strong focus on autonomous driving and related interdisciplinary fields [3]

Summary by Sections

Research Trends
- Numerous recent works in autonomous driving, such as MindDrive and SparseWorld-TC, reflect a focus on world models, which are expected to dominate upcoming conferences [1]
- The article suggests that the main themes for the end of this year and the first half of next year will likely revolve around world models, indicating a strategic direction for researchers [1]

Guidance and Support
- The platform offers personalized guidance for students, helping them navigate the complexities of research and paper submission processes [7][13]
- It claims a high success rate, with a 96% acceptance rate for students who have received guidance over the past three years [5]

Faculty and Resources
- The platform boasts over 300 dedicated instructors from top global universities, ensuring high-quality mentorship for students [5]
- The instructors have extensive experience in publishing at top-tier conferences and journals, providing students with valuable insights and support [5]

Services Offered
- The article outlines various services, including personalized paper guidance, real-time interaction with mentors, and comprehensive support throughout the research process [13]
- It also mentions the potential for students to receive recommendations from prestigious institutions and direct job placements in leading tech companies [19]
AI Q&A, "filmed" for you on the spot! From Kuaishou Keling & City University of Hong Kong
Liang Zi Wei· 2025-11-22 03:07
Core Insights
- The article introduces a novel AI model called VANS, which generates videos as answers instead of traditional text responses, aiming to bridge the gap between understanding and execution in tasks [3][4][5].

Group 1: Concept and Motivation
- The motivation behind this research is to utilize video, which inherently conveys dynamic physical world information that language struggles to describe accurately [5].
- The traditional approach to "next event prediction" has primarily focused on text-based answers, whereas VANS proposes a new task paradigm where the model generates a video as the response [8][9].

Group 2: Model Structure and Functionality
- VANS consists of a visual language model (VLM) and a video diffusion model (VDM), optimized through a joint strategy called Joint-GRPO, which enhances collaboration between the two models [19][24].
- The workflow involves two main steps: perception and reasoning, where the input video is encoded and analyzed, followed by conditional generation, where the model creates a video based on the generated text title and visual features [20].

Group 3: Optimization Process
- The optimization process is divided into two phases: first, enhancing the VLM to produce titles that are visually representable, and second, refining the VDM to ensure the generated video aligns semantically with the title and context of the input video [25][28].
- Joint-GRPO acts as a director, ensuring that both the "thinker" (VLM) and the "artist" (VDM) work in harmony, improving their outputs through mutual feedback [34][36].

Group 4: Applications and Impact
- VANS has two significant applications: procedural teaching, where it can provide customized instructional videos based on user input, and multi-future prediction, allowing for creative exploration of various hypothetical scenarios [37][41].
- The model has shown superior performance in benchmarks, significantly outperforming existing models in metrics such as ROUGE-L and CLIP-T, indicating its effectiveness in both semantic fidelity and video quality [46][47].

Group 5: Experimental Results
- Comprehensive evaluations demonstrate that VANS excels in procedural teaching and future prediction tasks, achieving nearly three times the performance improvement in event prediction accuracy compared to the best existing models [44][46].
- Qualitative results highlight VANS's ability to accurately visualize fine-grained actions, showcasing its advanced semantic understanding and visual generation capabilities [50][53].

Conclusion
- The research on Video-as-Answer represents a significant advancement in video generation technology, moving beyond entertainment to practical applications, enabling a more intuitive interaction with machines and knowledge [55][56].
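The GRPO family of methods that Joint-GRPO builds on replaces a learned value network with group-relative advantages: sample several outputs for the same prompt, score each with a reward, and standardize the rewards within the group. The sketch below shows only that single-group computation; the joint VLM/VDM reward coupling described in the summary is not modeled here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages, GRPO-style.

    Each of the G sampled outputs for one prompt gets an advantage
    equal to its reward standardized within the group, so no separate
    value (critic) network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]
```

In Joint-GRPO, per the article, such reward signals flow through both models so that the VLM learns to emit titles the VDM can actually render, and the VDM learns to stay faithful to the title and input context.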
Tencent Yuanbao adds video generation capability
Guan Cha Zhe Wang· 2025-11-21 08:58
Core Insights
- Tencent's HunyuanVideo 1.5, a lightweight video generation model based on the Diffusion Transformer (DiT) architecture, has been officially released and open-sourced, featuring 8.3 billion parameters and the capability to generate 5-10 seconds of high-definition video [1][4].

Group 1: Model Capabilities
- HunyuanVideo 1.5 supports both Chinese and English input for text-to-video and image-to-video generation, showcasing high consistency between images and videos [4].
- The model demonstrates strong instruction comprehension and adherence, allowing for diverse scene implementations, including camera movements, smooth actions, realistic characters, and emotional expressions [4].
- It supports various styles such as realism, animation, and block-based visuals, and can generate Chinese and English text within videos [4].

Group 2: Video Quality
- The model can natively generate 5-10 seconds of high-definition video at 480p and 720p, with the option to enhance quality to 1080p cinematic level through a super-resolution model [4].

Group 3: Performance Comparison
- In the T2V (Text-to-Video) task, HunyuanVideo outperformed several comparison models, achieving a win rate of +17.12% against Wan2.2 and +12.6% against Kling2.1 [6].
- In the I2V (Image-to-Video) task, HunyuanVideo also showed competitive results, with a win rate of +12.65% against Wan2.2 and +9.72% against Kling2.1 [6].
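One common way to read head-to-head figures like "+17.12% against Wan2.2" is as the win rate minus the loss rate over all judged pairs in a human preference study. The summary does not define the metric, so the helper below is only an assumption, included to make the arithmetic behind such numbers explicit.

```python
def win_rate_delta(wins: int, losses: int, ties: int = 0) -> float:
    """Pairwise preference margin between two models, in percent.

    Assumed reading: (wins - losses) / total judged pairs. Ties count
    toward the denominator but cancel out of the numerator.
    """
    total = wins + losses + ties
    return 100.0 * (wins - losses) / total
```

Under this reading, a margin near zero means judges were split roughly evenly, and the sign indicates which model was preferred overall.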
Kuaishou: Q3 operating profit up 69.9% year-on-year, Keling AI revenue exceeds 300 million yuan
Zhong Zheng Wang· 2025-11-20 06:03
Core Insights
- Kuaishou reported a total revenue of 35.554 billion yuan for Q3, marking a year-on-year growth of 14.2% [1]
- Operating profit increased by 69.9% year-on-year to 5.299 billion yuan, while adjusted net profit rose by 26.3% to 4.986 billion yuan [1]

Revenue Breakdown
- Revenue from other services, including e-commerce and Keling AI, grew by 41.3% to 5.9 billion yuan [1]
- Online marketing service revenue increased by 14% to 20.1 billion yuan [1]
- Live streaming revenue saw a modest growth of 2.5% to 9.6 billion yuan [1]
- Keling AI generated over 300 million yuan in revenue during Q3, while e-commerce GMV grew by 15.2% to 385 billion yuan [1]

User Engagement
- The average daily active users reached 416 million, with monthly active users at 731 million [1]

AI Integration and Market Position
- Kuaishou's CEO attributed financial performance to the deep integration of AI capabilities across various business scenarios [2]
- The video generation sector is experiencing rapid technological iteration and product exploration, with Keling AI positioned in the leading tier globally [2]
- Keling AI launched the 2.5 Turbo model, enhancing multiple dimensions such as text response and aesthetic quality [2]

Product Strategy and Future Outlook
- Kuaishou aims to focus on AI film creation, enhancing technology and product capabilities [2]
- The company is optimistic about the commercialization of video generation, particularly in consumer applications [3]
- Kuaishou plans to explore consumer application scenarios while enhancing the experience for professional creators [3]
Kuaishou earnings call: stepping up AI investment, Keling revenue expected at about 140 million USD this year
21 Shi Ji Jing Ji Bao Dao· 2025-11-19 14:37
Core Insights
- Kuaishou's Q3 revenue reached 35.6 billion RMB, a year-on-year increase of 14.2%, with core business revenue growing by 19.2% [1]
- The company's operating profit hit a record high, increasing by 69.9% year-on-year to 5.3 billion RMB, while adjusted net profit rose by 26.3% to 5 billion RMB [1]
- The integration of AI capabilities into Kuaishou's business is a significant factor in its financial performance, with Keling AI generating over 300 million RMB in revenue during Q3 [1]

Industry Dynamics
- The video generation sector is experiencing rapid competition with numerous participants from both large internet companies and startups, indicating its potential as a high-quality market [2]
- The industry is in an early stage of rapid technological iteration and product exploration, with competition driving advancements in video generation technology [2]
- Keling AI remains a leader in the global video generation space, focusing on technological and product innovation to maintain its competitive edge [2]

Product Strategy
- Keling AI's core focus is on AI film creation, with an emphasis on resource aggregation to enhance technology and product capabilities [2]
- The company plans to advance its product iterations by focusing on technological leadership and product imagination, utilizing multi-modal interaction concepts [2]
- Keling AI aims to enhance the user experience for professional creators while exploring consumer applications, with plans to further commercialize its technology in the future [3]

Financial Outlook
- Kuaishou plans to increase investments in AI-related capabilities, expecting a mid-to-high double-digit percentage growth in overall capital expenditures for 2025 compared to the previous year [3]
- Keling AI's projected revenue for 2025 is approximately 140 million USD, significantly higher than the initial target of 60 million USD [3]
- Despite increased investments in AI capabilities and talent, the company remains confident in achieving year-on-year improvements in adjusted operating profit margins [3]
Keling AI's full-year revenue is about 140 million USD; Kuaishou continues to increase compute investment
Di Yi Cai Jing· 2025-11-19 14:24
Core Insights
- Kuaishou's Q3 2025 financial report shows a total revenue increase of 14.2% year-on-year to 35.6 billion RMB, with adjusted net profit rising by 26.3% to 5 billion RMB [1]
- The online marketing services revenue grew by 14% to 20.1 billion RMB, while live streaming revenue increased by 2.5% to 9.6 billion RMB [1]
- E-commerce GMV for Kuaishou increased by 15.2% year-on-year to 385 billion RMB, and the revenue from Keling AI exceeded 300 million RMB [1]

Business Segments
- Online Marketing Services: Revenue increased by 14% to 20.1 billion RMB [1]
- Live Streaming: Revenue increased by 2.5% to 9.6 billion RMB [1]
- Other Services: Revenue rose by 41.3% to 5.9 billion RMB, driven by growth in e-commerce and Keling AI [1]

AI Development Focus
- Keling AI remains a key focus in Kuaishou's earnings call, with the CEO highlighting the competitive landscape in video generation and the potential for rapid technological advancement [2]
- The company aims to concentrate on AI film creation, enhancing technology and product capabilities through resource aggregation [2]
- Kuaishou plans to further commercialize Keling technology in conjunction with social interaction, aiming for accelerated C-end application commercialization [2]

Capital Expenditure and AI Integration
- Kuaishou's CFO indicated that due to the unexpected growth of Keling AI, the company will increase its capital expenditure, expecting a mid-to-high double-digit percentage increase in 2025 compared to the previous year [3]
- Keling AI is projected to generate approximately 140 million USD in revenue for 2025, surpassing the initial target of 60 million USD [3]
- AI applications are being rapidly integrated within Kuaishou, with the self-developed AI programming tool CodeFlicker being widely adopted by engineers, generating nearly 30% of new code [3]
Kuaishou's Cheng Yixiao: Keling AI will focus on AI film and television production scenarios; the video generation track is still in its early days
Zheng Quan Shi Bao Wang· 2025-11-19 12:57
Core Insights
- Kuaishou's CEO Cheng Yixiao highlighted the competitive landscape of the video generation sector, indicating it is a promising field with rapid technological iterations and product explorations [1][2]
- The company reported that its Keling AI generated over 300 million yuan in revenue in Q3 2025, with a global user base exceeding 45 million and over 200 million videos and 400 million images created [1]
- Cheng emphasized the vision of Keling AI to enable everyone to tell good stories using AI, focusing on film creation and enhancing both technology and product capabilities [2]

Company Developments
- Keling AI's recent advancements include the launch of the 2.5 Turbo model, which significantly improved text response, dynamic effects, style retention, and aesthetic quality [1]
- The company aims to enhance the user experience for professional creators while exploring consumer applications, with plans to further commercialize Keling's technology in the future [2]
- Cheng outlined a comprehensive path for the implementation of AI large models within Kuaishou, enhancing content and business ecosystems while improving internal organizational and R&D efficiency [2][3]

Industry Trends
- 2025 is viewed as a pivotal year for the deep application of AI, with new generation AI technologies like multimodal generation and agents being explored for more efficient user-centric applications [3]
- Kuaishou is building a complete technology and application system centered on user needs, accelerating AI implementation to empower content and business ecosystems [3]
- The company believes that a comprehensive AI application ecosystem will enhance its market adaptability and growth potential in the long term [3]