Video Generation
Video generation enters the era of precise control; creative democratization drives accelerated penetration on both the B and C sides
Orient Securities· 2026-02-08 14:19
Investment Rating
- The industry investment rating is "Positive" and is maintained [4]

Core Viewpoints
- The multi-modal video generation sector is seeing accelerated iteration of domestic models, significantly narrowing the technological gap with overseas counterparts. The most notable change is the introduction of intelligent storyboarding, which lowers the entry barrier for users. Unified multi-modal architectures support more efficient and flexible expression of creative intent, setting up substantial progress in both B-end and C-end expansion in 2026. Model vendors are pushing AI penetration of the content sector while continuing to enhance their technology [1][7]

Summary by Sections

Industry Overview
- The video generation sector is entering a phase of precise control: recent model iterations such as Vidu Q3, Kuaishou's Kling 3.0, and Seedance 2.0 support multi-modal inputs, which improves controllability and raises the success rate of generated content. Single-shot generation length has increased to around 15 seconds, further lowering the creative threshold for both B-end and C-end users [7]

Investment Recommendations and Targets
- Emphasis should be placed on vertical multi-modal AI application opportunities; technological breakthroughs and cost optimization are expected to accelerate industry trends, driving user growth, payment penetration, and commercialization. Companies expanding multi-modal AI applications overseas are particularly noteworthy, as they may grow faster. Recommended targets are Kuaishou-W (01024, Buy) and Meitu Inc. (01357, Buy) [2]
World models from a fresh perspective: from video generation toward a general-purpose world simulator
机器之心· 2026-02-07 04:09
In recent years, video generation and world models have surged to become the hottest focal points in artificial intelligence. From Sora to Kling, video generation models have gradually shown stronger "world consistency" in motion continuity, object interaction, and partial physical priors, prompting a serious question: can video generation be pushed from "photorealistic short clips" toward a "general-purpose world simulator" usable for reasoning, planning, and control?

At the same time, this research direction is rapidly interweaving with frontier scenarios such as embodied AI and autonomous driving, and is regarded as an important path toward artificial general intelligence (AGI).

Beneath the research boom, however, core questions such as "what counts as a true world model" and "how to judge a video model's world-simulation capability" have fallen into multi-dimensional debate. Definitions and taxonomies of world models keep proliferating, and their overlapping theoretical dimensions often confuse researchers and hold back standardization of the technology.

To establish a more systematic and clearer lens, the Kuaishou Kling team and Professor Yingcong Chen's team at HKUST (Guangzhou) (co-first authors: PhD students Luozhou Wang and Zhifei Chen) jointly published a systematic survey that dissects video world models from a fresh perspective. The paper aims to bridge contemporary "stateless" video architectures and the classic "state-centric" world model ...
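To make the survey's central contrast concrete, here is a toy sketch of the two paradigms it bridges: a "stateless" generator that re-reads the whole frame history at every step, versus a "state-centric" model that maintains and updates a compact latent state. All names and the toy dynamics are illustrative assumptions, not code from the paper.

```python
# Toy contrast between the two paradigms the survey bridges. All names
# and dynamics here are illustrative assumptions, not the paper's code.
import numpy as np

def stateless_next_frame(frames: list) -> np.ndarray:
    """'Stateless' video generation: each step re-reads the whole pixel
    history; no persistent internal world state survives between calls."""
    return np.mean(frames, axis=0)  # stand-in for next_frame = G(frames_1..t)

class StateCentricWorldModel:
    """Classic 'state-centric' world model: a compact latent state is
    updated recurrently from actions, and observations are decoded from
    it -- the property that makes reasoning, planning, and control tractable."""
    def __init__(self, state_dim: int = 16):
        self.state = np.zeros(state_dim)

    def step(self, action: np.ndarray) -> np.ndarray:
        self.state = np.tanh(self.state + action)  # s_{t+1} = f(s_t, a_t)
        return np.outer(self.state, self.state)    # obs = render(s_{t+1})
```

The point of the contrast is interface, not implementation: the second form exposes a persistent state that a planner can roll forward, which the first form does not.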
Tsinghua-affiliated startup lands the largest single financing deal in China's video generation sector
36Kr· 2026-02-05 08:50
Core Insights
- Shengshu Technology has completed over 600 million RMB in A+ round financing, a record for the largest single financing round in China's video generation sector [1]
- The company achieved more than tenfold growth in users and revenue in 2025, with a global reach spanning more than 200 countries and regions [1]

Financing Details
- The round was led by Zhongguancun Science City and Xinglian Capital, with strategic investments from companies including Wanxing Technology, Visual China, and Tuolisi [1]
- Shengshu Technology has completed six financing rounds and one equity transfer in total, with notable investors including Huawei, Ant Group, and Baidu [7][9]

Product Development
- Shengshu Technology focuses on multimodal general-purpose large models and applications, offering video generation and multimodal generation products across multiple platforms [2]
- The company is recognized as one of the earliest teams to research multimodal generation algorithms, having introduced the U-ViT architecture before OpenAI's DiT [2]

Model Performance
- The Vidu Q3 model, aimed at professional film production, ranked first in China and second globally in a recent AI benchmark test, surpassing competitors such as Runway Gen-4.5 and Google Veo3.1 [3]
- Vidu Q3 supports features such as 16-second audio-visual synchronization, 1080P quality, and multilingual output [4]

Market Presence
- Vidu has established a strong presence in the film industry, covering over 90% of content providers and production institutions, with clients including Sony Pictures and Tencent Animation [7]
- The company also serves clients in the internet and smart-hardware sectors, including ByteDance and Samsung, focusing on content production and product-interaction innovation [7]

Competitive Landscape
- The video generation sector remains highly competitive, with significant investment flowing into startups while major players such as Kuaishou and Google expand their influence [10]
- Startups need to differentiate themselves in technology, application scenarios, or ecosystems to succeed in this environment [10]
SIGGRAPH Asia 2025 | When video generation truly "sees a person": a unified framework for multi-view identity consistency, realistic lighting, and controllable cameras
机器之心· 2025-12-27 04:01
First author Yuancheng Xu is a research scientist at Netflix Eyeline, focusing on research and development of foundation AI models spanning multimodal understanding, reasoning, interaction, and generation, with an emphasis on controllable video generation and its applications in film production. He received his PhD from the University of Maryland, College Park in 2025.

Last author Ning Yu is a senior research scientist at Netflix Eyeline, where he leads R&D on video generation AI for film production. He previously worked at Salesforce, NVIDIA, and Adobe, and holds a joint PhD from the University of Maryland and the Max Planck Institute. He has been a finalist for the Qualcomm Fellowship and CSAW Europe Best Paper, and has received the Amazon Twitch Fellowship, a Microsoft young-scholar fellowship, and an SPIE Best Student Paper award. He serves as an area chair for CVPR, ICCV, ECCV, NeurIPS, ICML, and ICLR, and as an action editor of TMLR.

In film and virtual production, "seeing a person clearly" has never meant seeing a single frame clearly. Through camera movement and lighting changes, a director lets the audience gradually build a complete understanding of a character across different viewpoints and lighting conditions. Yet in much of today's research on customized video generation models, this most basic fact is often overlooked.

The overlooked core problem: Multi-view Ident ...
Generation without forgetting: Peking University's EgoLCD adds long- and short-term memory to an "ultra-long-horizon" world model
36Kr· 2025-12-24 07:58
Core Insights
- The article discusses the introduction of EgoLCD, a novel long-context diffusion model developed by a collaborative research team from several prestigious institutions, aimed at addressing "content drift" in long video generation [1][2][3]

Group 1: Model Overview
- EgoLCD employs a dual memory mechanism inspired by human cognition, consisting of long-term memory for stability and short-term memory for rapid adaptation [5]
- The model incorporates a structured narrative prompting system that enhances the coherence of generated videos by linking visual details with textual descriptions [7][8]

Group 2: Technical Innovations
- EgoLCD uses a sparse key-value cache to manage long-term memory, focusing on critical semantic anchors to reduce memory usage while maintaining global consistency [11]
- The model's short-term memory is enhanced by LoRA, allowing it to adapt quickly to rapid changes in perspective, such as fast hand movements [11]

Group 3: Performance Metrics
- On the EgoVid-5M benchmark, EgoLCD outperformed leading models such as OpenSora and DynamiCrafter in temporal coherence and action consistency, achieving the best CD-FVD and NRDP scores [12][14]
- The model demonstrated a significant reduction in content drift, maintaining high subject and background consistency throughout generation [13][14]

Group 4: Practical Applications
- EgoLCD is positioned as a "first-person world simulator", capable of generating coherent long-duration videos that can serve as training data for embodied-intelligence applications such as robotics [15]
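To make the dual-memory idea above concrete, here is a minimal sketch of the two components the summary describes: a sparse key-value cache that retains only high-salience "semantic anchors" (long-term memory), and a LoRA-style low-rank adapter for fast short-term adaptation. The class names, the salience-based eviction rule, and all sizes are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a dual-memory design in the spirit of EgoLCD:
# a sparse long-term KV cache keeping only high-salience "semantic
# anchors", plus a LoRA-style low-rank adapter for fast short-term
# adaptation. Names, eviction rule, and sizes are assumptions,
# not the authors' implementation.
import torch

class SparseKVMemory:
    """Long-term memory: bounded, salience-pruned key-value store."""
    def __init__(self, capacity: int = 256):
        self.capacity = capacity
        self.keys, self.values, self.scores = [], [], []

    def write(self, k: torch.Tensor, v: torch.Tensor, salience: float):
        self.keys.append(k); self.values.append(v); self.scores.append(salience)
        if len(self.keys) > self.capacity:
            # Evict the least salient entry so the cache stays sparse
            # while globally important anchors survive long generations.
            i = min(range(len(self.scores)), key=self.scores.__getitem__)
            for buf in (self.keys, self.values, self.scores):
                buf.pop(i)

    def read(self, q: torch.Tensor) -> torch.Tensor:
        """Attention-style readout over the cached anchors."""
        if not self.keys:
            return torch.zeros_like(q)
        K, V = torch.stack(self.keys), torch.stack(self.values)
        attn = torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
        return attn @ V

class LoRAAdapter(torch.nn.Module):
    """Short-term memory: a cheap low-rank residual update h + B(A(h))
    that can be tuned quickly, e.g. for fast viewpoint changes."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.A = torch.nn.Linear(dim, rank, bias=False)
        self.B = torch.nn.Linear(rank, dim, bias=False)
        torch.nn.init.zeros_(self.B.weight)  # start as a no-op

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.B(self.A(h))
```

The design choice the sketch illustrates: pruning by salience keeps long-term memory cheap enough for very long rollouts, while the low-rank adapter absorbs fast, transient changes without touching the cache.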
Camera motion error down 40%! DualCamCtrl fits video generation with a "depth camera" to make camera moves more "obedient"
机器之心· 2025-12-21 04:21
The co-first authors of this study are Hongfei Zhang (research assistant) and Kanghao Chen (PhD student) from EnVision Research at HKUST (Guangzhou); both are supervised by Professor Yingcong Chen.

Does your generative model really "understand geometry", or is it just pretending to align with camera trajectories?

Many current video generation models claim "camera motion control", yet their control signals usually rely on camera poses alone. Recent work encodes motion information via per-pixel ray directions (ray conditioning), but the model must still infer 3D structure implicitly, so it fundamentally lacks explicit geometric understanding of the scene. This limitation produces inconsistent camera motion: constrained by the entanglement of appearance and structure in a single representation, the model cannot fully capture the scene's underlying geometry.

To address these challenges, a research team from HKUST, Fudan University, and other institutions proposes DualCamCtrl, a new end-to-end geometry-aware diffusion framework. Targeting the weaknesses of existing methods in scene understanding and geometric awareness, it introduces a "dual-branch diffusion architecture" that synchronously generates RGB and depth sequences consistent with the camera motion. To further achieve efficient synergy between the RGB and depth modalities, DualCamCtrl proposes a Semantic Guided Mutual Alignment mechanism which, guided by semantic information, ...
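As a rough illustration of the dual-branch design described above, the sketch below shows two denoising branches (RGB and depth) that share camera-pose conditioning and exchange features through a semantics-guided cross-attention, in the spirit of Semantic Guided Mutual Alignment. Module names, the broadcast-add of semantic tokens, and the wiring are assumptions, not the paper's code.

```python
# Illustrative sketch of a dual-branch diffusion step in the spirit of
# DualCamCtrl: two denoisers (RGB and depth) share camera conditioning
# and exchange features via semantics-guided cross-attention. Module
# names and wiring are assumptions, not the paper's code.
import torch
import torch.nn as nn

class MutualAlignment(nn.Module):
    """Cross-attention in both directions; semantic tokens (added to the
    queries as a simplification) steer what each modality borrows."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.rgb_from_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb, depth, semantic):
        rgb2, _ = self.rgb_from_depth(rgb + semantic, depth, depth)
        dep2, _ = self.depth_from_rgb(depth + semantic, rgb, rgb)
        return rgb + rgb2, depth + dep2   # residual updates keep both branches stable

class DualBranchDenoiser(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.rgb_branch = nn.Linear(dim, dim)
        self.depth_branch = nn.Linear(dim, dim)
        self.align = MutualAlignment(dim)
        self.rgb_out = nn.Linear(dim, dim)
        self.depth_out = nn.Linear(dim, dim)

    def forward(self, rgb_noisy, depth_noisy, semantic, cam_pose):
        # The same camera-pose embedding conditions both branches, so the
        # generated RGB and depth sequences follow one trajectory.
        rgb = self.rgb_branch(rgb_noisy) + cam_pose
        dep = self.depth_branch(depth_noisy) + cam_pose
        rgb, dep = self.align(rgb, dep, semantic)
        return self.rgb_out(rgb), self.depth_out(dep)  # predicted noise per modality
```

The depth branch is what gives the model an explicit geometric signal to align against, rather than having to infer 3D structure implicitly from appearance alone.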
Not much time left in the paper window for autonomous-driving world models......
自动驾驶之心· 2025-12-11 00:05
Core Insights
- The article highlights the recent surge in research papers on world models in autonomous driving, pointing to a trend of localized breakthroughs and verifiable improvements in the field [1]
- It emphasizes the importance of polishing submissions to top conferences, arguing that the final 10% of refinement can significantly affect a paper's overall quality and acceptance [2]
- The platform "Autonomous Driving Heart" is presented as a leading AI technology media outlet in China, with a strong focus on autonomous driving and related interdisciplinary fields [3]

Summary by Sections

Research Trends
- Numerous recent works in autonomous driving, such as MindDrive and SparseWorld-TC, center on world models, which are expected to dominate upcoming conferences [1]
- The article suggests that the main themes for the end of this year and the first half of next year will revolve around world models, indicating a strategic direction for researchers [1]

Guidance and Support
- The platform offers personalized guidance for students, helping them navigate research and paper-submission processes [7][13]
- It claims a high success rate, with 96% acceptance for students who received guidance over the past three years [5]

Faculty and Resources
- The platform claims over 300 dedicated instructors from top global universities, ensuring high-quality mentorship for students [5]
- The instructors have extensive experience publishing at top-tier conferences and journals, providing students with valuable insights and support [5]

Services Offered
- Services include personalized paper guidance, real-time interaction with mentors, and comprehensive support throughout the research process [13]
- Students may also receive recommendations from prestigious institutions and direct job placements at leading tech companies [19]
AI Q&A, "filmed" for you on the spot! From Kuaishou Kling & City University of Hong Kong
量子位· 2025-11-22 03:07
Core Insights
- The article introduces VANS, a novel AI model that generates videos as answers instead of traditional text responses, aiming to bridge the gap between understanding and execution in tasks [3][4][5]

Group 1: Concept and Motivation
- The motivation behind the research is to use video, which inherently conveys dynamic physical-world information that language struggles to describe accurately [5]
- Traditional "next event prediction" has focused on text-based answers, whereas VANS proposes a new task paradigm in which the model answers with a generated video [8][9]

Group 2: Model Structure and Functionality
- VANS couples a visual language model (VLM) with a video diffusion model (VDM), optimized through a joint strategy called Joint-GRPO that strengthens collaboration between the two models [19][24]
- The workflow involves two main steps: perception and reasoning, where the input video is encoded and analyzed, followed by conditional generation, where the model creates a video from the generated text caption and visual features [20]

Group 3: Optimization Process
- Optimization proceeds in two phases: first the VLM is tuned to produce captions that are visually representable, then the VDM is refined so the generated video aligns semantically with the caption and the context of the input video [25][28]
- Joint-GRPO acts as a director, keeping the "thinker" (VLM) and the "artist" (VDM) in harmony and improving both outputs through mutual feedback [34][36]

Group 4: Applications and Impact
- VANS has two significant applications: procedural teaching, where it provides customized instructional videos from user input, and multi-future prediction, which allows creative exploration of hypothetical scenarios [37][41]
- The model significantly outperforms existing models on benchmark metrics such as ROUGE-L and CLIP-T, indicating effectiveness in both semantic fidelity and video quality [46][47]

Group 5: Experimental Results
- Comprehensive evaluations show VANS excels at procedural teaching and future-prediction tasks, achieving nearly three times the event-prediction accuracy of the best existing models [44][46]
- Qualitative results highlight VANS's ability to render fine-grained actions accurately, showcasing advanced semantic understanding and visual generation [50][53]

Conclusion
- Video-as-Answer represents a significant advance in video generation, moving beyond entertainment toward practical applications and a more intuitive way of interacting with machines and knowledge [55][56]
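Here is a hedged sketch of the VANS-style two-model pipeline described above: the VLM handles perception and reasoning and emits a visually representable caption, and the VDM renders a video conditioned on that caption plus the input's visual features. All function and type names below are placeholders, and Joint-GRPO's actual reward design is not reproduced.

```python
# Hedged sketch of a VANS-style pipeline: VLM reasons -> caption,
# VDM renders caption + input context -> video answer. All names are
# placeholders; Joint-GRPO's reward design is not reproduced here.
from dataclasses import dataclass

@dataclass
class VideoAnswer:
    caption: str
    video: list   # stand-in for generated frames

def vlm_reason(input_frames: list, question: str) -> str:
    """Perception + reasoning step: describe the next event so that it is
    *visually representable* (the property Joint-GRPO rewards in the VLM)."""
    return f"Answer to '{question}' grounded in {len(input_frames)} input frames"

def vdm_generate(caption: str, input_frames: list) -> list:
    """Conditional generation step: the VDM should stay semantically
    faithful to the caption and visually consistent with the input video."""
    return [b"frame"] * 16   # placeholder frames

def video_as_answer(input_frames: list, question: str) -> VideoAnswer:
    caption = vlm_reason(input_frames, question)
    return VideoAnswer(caption, vdm_generate(caption, input_frames))
```

The key structural point is the shared interface: because the VDM consumes the VLM's caption, a joint reward (as in Joint-GRPO) can push the VLM toward captions the VDM can actually render.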
Tencent Yuanbao launches video generation capability
Guan Cha Zhe Wang· 2025-11-21 08:58
Core Insights
- Tencent's HunyuanVideo 1.5, a lightweight video generation model based on the Diffusion Transformer (DiT) architecture, has been officially released and open-sourced; it has 8.3 billion parameters and can generate 5-10 seconds of high-definition video [1][4]

Group 1: Model Capabilities
- HunyuanVideo 1.5 supports both Chinese and English input for text-to-video and image-to-video generation, showcasing high consistency between images and videos [4]
- The model demonstrates strong instruction comprehension and adherence, allowing for diverse scene implementations, including camera movements, smooth actions, realistic characters, and emotional expressions [4]
- It supports various styles such as realism, animation, and block-based visuals, and can generate Chinese and English text within videos [4]

Group 2: Video Quality
- The model natively generates 5-10 seconds of high-definition video at 480p and 720p, with an optional super-resolution model that enhances output to 1080p cinematic quality [4]

Group 3: Performance Comparison
- In the T2V (text-to-video) task, HunyuanVideo outperformed several comparison models, achieving a win rate of +17.12% against Wan2.2 and +12.6% against Kling2.1 [6]
- In the I2V (image-to-video) task, HunyuanVideo also showed competitive results, with a win rate of +12.65% against Wan2.2 and +9.72% against Kling2.1 [6]
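If the open-sourced HunyuanVideo 1.5 follows the diffusers integration that earlier HunyuanVideo releases shipped with, a text-to-video call might look like the sketch below. The checkpoint id is a placeholder, and whether this pipeline class covers 1.5 is an unverified assumption to check against Tencent's official release.

```python
# Hypothetical text-to-video invocation, assuming HunyuanVideo 1.5 ships
# with a diffusers-style pipeline like earlier HunyuanVideo releases.
# The repo id below is a placeholder, and applicability of this pipeline
# class to 1.5 is an assumption, not a confirmed API.
import torch
from diffusers import HunyuanVideoPipeline   # integration exists for HunyuanVideo 1.0
from diffusers.utils import export_to_video

pipe = HunyuanVideoPipeline.from_pretrained(
    "tencent/HunyuanVideo-1.5",              # placeholder checkpoint id
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

result = pipe(
    prompt="A paper boat drifting down a rainy alley, cinematic lighting",
    num_frames=121,                          # ~5 s at 24 fps
    height=720, width=1280,                  # native 720p per the release notes
)
export_to_video(result.frames[0], "boat.mp4", fps=24)
```

Per the summary, a separate super-resolution stage would then upscale the 720p output toward 1080p; that stage is not shown here.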
Kuaishou: Q3 operating profit up 69.9% year-on-year, Kling AI revenue tops 300 million yuan
Zhong Zheng Wang· 2025-11-20 06:03
Core Insights
- Kuaishou reported total Q3 revenue of 35.554 billion yuan, up 14.2% year-on-year [1]
- Operating profit rose 69.9% year-on-year to 5.299 billion yuan, while adjusted net profit grew 26.3% to 4.986 billion yuan [1]

Revenue Breakdown
- Revenue from other services, including e-commerce and Kling AI, grew 41.3% to 5.9 billion yuan [1]
- Online marketing service revenue increased 14% to 20.1 billion yuan [1]
- Live-streaming revenue saw modest growth of 2.5% to 9.6 billion yuan [1]
- Kling AI generated over 300 million yuan in revenue during Q3, while e-commerce GMV grew 15.2% to 385 billion yuan [1]

User Engagement
- Average daily active users reached 416 million, with monthly active users at 731 million [1]

AI Integration and Market Position
- Kuaishou's CEO attributed the financial performance to the deep integration of AI capabilities across business scenarios [2]
- The video generation sector is undergoing rapid technological iteration and product exploration, with Kling AI positioned in the leading tier globally [2]
- Kling AI launched the 2.5 Turbo model, improving multiple dimensions such as text responsiveness and aesthetic quality [2]

Product Strategy and Future Outlook
- Kuaishou aims to focus on AI film creation, strengthening its technology and product capabilities [2]
- The company is optimistic about the commercialization of video generation, particularly in consumer applications [3]
- Kuaishou plans to explore consumer application scenarios while improving the experience for professional creators [3]