Video Generation
LatePost Exclusive | Aishi Raises $300 Million, the Largest Single Financing in China's Video Generation Sector
晚点LatePost· 2026-03-12 07:22
Core Viewpoint
- Aishi Technology has completed a $300 million Series C financing round, the largest single financing in China's video generation sector to date, with over 20 institutions participating [5][6]

Financing and Growth
- The round was led by Dinghui Investment, with participation from institutions including Chinese cultural and entertainment companies, local state-owned enterprises, and overseas funds [5]
- Aishi's annual recurring revenue (ARR) was projected to exceed $40 million by the end of 2025, with over 100 million cumulative users and 16 million monthly active users for its mobile app PixVerse [8][9]

Market Position and Competition
- Among AI startups in China, only a few have an ARR exceeding $50 million, underscoring Aishi's strong market position [9]
- Aishi shifted its focus to consumer products in the second half of 2024, launching the PixVerse app for the C-end market [9][12]
- The company believes the video generation market is large enough that head-to-head competition is not yet a pressing concern, despite the emergence of rivals such as Seedance 2.0 [11][17]

Technology and Product Development
- Aishi's latest model, PixVerse R1, released in January, supports real-time video generation and aims to unlock new interactive content experiences [9][20]
- The company will continue its model research and global expansion, planning to allocate a significant portion of the $300 million financing to R&D [12][16]
- Aishi's model training is cost-effective, using fewer training resources than competitors, which frees up more investment for development [12][13]

User Engagement and Retention
- The PixVerse app shows higher user retention than competing products, indicating strong user engagement [18][19]
- The company targets a broad user base, including people who have never made a video, aiming to empower them through AI [19]

Future Opportunities
- The company sees potential in interactive video generation, which could fundamentally change content production logic and user engagement [20][21]
- Aishi is exploring the integration of its technology into gaming and other industries, potentially transforming traditional content creation processes [22]
A Complete Breakdown of "World Models": Definition, Routes, Practice, and a Step Closer to AGI
硅谷101· 2026-03-06 06:39
2026 will be the year the "World Model" fully breaks out. Today's AI looks almost omnipotent: it can write dense papers and complex code and produce top-tier images and videos. Yet it still lacks the ability to understand the world, predict it, and reason and act within it. To solve this, major companies such as OpenAI, Google, and Microsoft, and leading scholars such as Yann LeCun and Fei-Fei Li, are all racing to work on the same thing: the world model. Many believe that, as multimodal AI matures and spreads, this technical route, if fully proven out, will reshape the entire AI landscape.

But the sudden popularity of "world models" has also created a new problem: almost overnight, the whole AI field seems to have turned into "world models". Video generation is a world model; robotics is a world model; autonomous driving is a world model; game development is a world model; AR/VR is a world model; agents, simulation, training environments... anything remotely connected to "the world" gets the label. These things look completely different, yet they are now all called by the same name.

"I think this is also where many people mythologize world models. Many of today's so-called world models are really just video models." "What the industry sees as world models are, for the most part, only the surface manifestations of a world model." "If the world model were truly solved, our current research directions would seem pointless." So, what exactly is a world model ...
Video Generation Enters the Precise-Control Era: Creative Democratization Accelerates Penetration on Both B-End and C-End
Orient Securities· 2026-02-08 14:19
Investment Rating
- The industry investment rating is "Positive" and is maintained [4]

Core Viewpoints
- The multi-modal video generation sector is seeing accelerated iteration of domestic models, significantly narrowing the technological gap with overseas counterparts. The most notable change is the introduction of intelligent storyboarding, which lowers the entry barrier for users. Unified multi-modal architectures support more efficient and flexible expression of creative intent, driving substantial progress on both the B-end and C-end in 2026. Model vendors are focusing on AI penetration of the content sector while continuing to enhance their technology [1][7]

Summary by Sections

Industry Overview
- Video generation is entering a phase of precise control: recent model iterations such as Vidu Q3, Kuaishou 3.0, and Seedance 2.0 support multi-modal inputs, improving controllability and the success rate of generated content. Single-generation duration has increased to around 15 seconds, further lowering the creative threshold for both B-end and C-end users [7]

Investment Recommendations and Targets
- Emphasis should be placed on vertical multi-modal AI application opportunities; technological breakthroughs and cost optimization are expected to accelerate industry trends, driving user growth, payment penetration, and commercialization. Companies expanding multi-modal AI applications overseas are particularly noteworthy, as they may grow faster. Recommended targets include Kuaishou-W (01024, Buy) and Meitu Inc. (01357, Buy) [2]
World Models from a New Perspective: From Video Generation Toward General World Simulators
机器之心· 2026-02-07 04:09
Core Insights
- The article discusses the rise of video generation and world models in AI, emphasizing their potential to evolve from realistic short clips into general world simulators for reasoning, planning, and control [2][3]
- It highlights the intersection of this research with embodied AI and autonomous driving, positioning it as a pathway toward artificial general intelligence (AGI) [2]
- It identifies ongoing debates over the definitions and evaluation criteria of world models, indicating a need for standardized development in the field [2]

Summary by Sections

Introduction
- Video generation models have shown improved "world consistency" in aspects such as motion continuity and object interaction, prompting discussion of their capabilities as general world simulators [2]
- A collaboration between Kuaishou's Kling team and Professor Chen Yingcong's team at the Hong Kong University of Science and Technology aims to provide a systematic review of video world models [2]

New Classification System
- The article proposes a new classification system built on "State Construction" and "Dynamics Modeling" to bridge the gap between contemporary stateless video architectures and classical state-centered world model theories (a minimal code sketch of this decomposition follows this summary) [3]

Key Contributions
- The review offers a full-stack perspective, bridges theoretical gaps, and provides a forward-looking guide to strengthening the robustness of video generation models [8]
- It identifies "persistence" and "causality" as the critical challenges in building general world simulators [8]

World Model Components
- The article outlines three foundational components of world models — observations, states, and dynamics — and discusses their roles in understanding and predicting environmental change [8][9]

Learning Paradigms
- Training paradigms are categorized by their coupling with policy models, distinguishing closed-loop from open-loop learning [14]
- The article traces the evolution of video models toward robust world simulators, addressing gaps in state representation and dynamics modeling [12]

State Construction
- The article differentiates implicit from explicit state mechanisms, analyzing their respective advantages and disadvantages in managing historical information [16][22]
- It discusses the importance of compression, retrieval, and consolidation in maintaining long-term memory and contextual coherence [18][19]

Dynamics Modeling
- Two main paths to stronger causal reasoning are outlined: causal architecture reformulation and causal knowledge integration [24][25]
- Models must internalize causal laws to ensure logical consistency and physical feasibility in generated videos [24]

Evaluation Criteria
- The article advocates shifting evaluation from visual fidelity to functional benchmarks, proposing three core axes: quality, persistence, and causality [26][27]

Conclusion
- The review underscores the need for video generation technologies that can simulate real-world scenarios, closing the gap between visual realism and functional applicability [28][29]
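To make the survey's central decomposition concrete, here is a minimal, runnable Python sketch of a world model as state construction plus dynamics modeling. All names here (`WorldModel`, `construct_state`, `predict`) are illustrative assumptions, not APIs from the paper; the bounded memory buffer merely stands in for the compression/retrieval/consolidation mechanisms the review discusses.

```python
from dataclasses import dataclass, field

@dataclass
class WorldModel:
    # Explicit state: a bounded buffer of past observations, standing in
    # for the learned memory mechanisms discussed in the survey.
    memory: list = field(default_factory=list)
    capacity: int = 8

    def construct_state(self, observation):
        """State construction: fold a new observation into persistent state."""
        self.memory.append(observation)
        if len(self.memory) > self.capacity:
            # Consolidation stand-in: a real model would merge or summarize
            # old entries rather than simply dropping the oldest one.
            self.memory.pop(0)
        return tuple(self.memory)

    def predict(self, state, action):
        """Dynamics modeling: next state given current state and an action.

        A learned model would parameterize this transition; here we only
        record the action's effect to keep the example self-contained.
        """
        return state + (f"effect_of({action})",)

# Closed-loop flavor: a policy could act inside the model's own
# predictions, rather than training open-loop on logged video alone.
wm = WorldModel()
state = ()
for t in range(3):
    state = wm.construct_state(f"obs_{t}")
plan = wm.predict(state, "move_forward")
print(plan)
```

The point of the sketch is only the interface split the survey draws: everything about memory lives in `construct_state`, everything about causal transition lives in `predict`.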
Tsinghua-Affiliated Startup Lands China's Largest Single Financing in Video Generation
36Kr· 2026-02-05 08:50
Core Insights
- The article highlights that Shengshu Technology has completed over 600 million RMB in A+ round financing, a record for the largest single financing in China's video generation sector [1]
- The company aims to achieve over tenfold growth in users and revenue by 2025, with a global reach spanning more than 200 countries and regions [1]

Financing Details
- The round was led by Zhongguancun Science City and Xinglian Capital, with strategic investment from companies including Wanxing Technology, Visual China, and Tuolisi [1]
- Shengshu Technology has completed six financing rounds and one equity transfer in total, with notable investors including Huawei, Ant Group, and Baidu [7][9]

Product Development
- Shengshu focuses on multimodal general-purpose large models and applications, offering video generation and multimodal generation products across multiple platforms [2]
- The company is recognized as one of the earliest teams to research multimodal generation algorithms, having introduced the U-ViT architecture ahead of the DiT architecture later adopted for OpenAI's Sora [2]

Model Performance
- The Vidu Q3 model, aimed at professional film production, ranked first in China and second globally in a recent AI benchmark test, surpassing competitors such as Runway Gen-4.5 and Google Veo3.1 [3]
- Vidu Q3 supports features such as 16-second audio-visual synchronization, 1080P quality, and multilingual output [4]

Market Presence
- Vidu has a strong footprint in the film industry, covering over 90% of content providers and production institutions, with clients including Sony Pictures and Tencent Animation [7]
- The company also serves clients in the internet and smart-hardware sectors, including ByteDance and Samsung, focusing on content production and product-interaction innovation [7]

Competitive Landscape
- The video generation sector remains highly competitive, with significant investment flowing into startups while giants such as Kuaishou and Google expand their influence [10]
- The article emphasizes that startups must differentiate in technology, application scenarios, or ecosystems to succeed in this environment [10]
SIGGRAPH Asia 2025 | When Video Generation Truly "Sees" a Person: A Unified Framework for Multi-View Identity Consistency, Realistic Lighting, and Controllable Cameras
机器之心· 2025-12-27 04:01
Core Viewpoint
- The article emphasizes that understanding a person's identity in video generation requires capturing their appearance across angles and lighting conditions, treating identity as a stable 4D representation rather than a static 2D attribute [9][36]

Group 1: Multi-View Identity Preservation
- Recent work on video character customization typically assumes that if a character looks like themselves from one angle, their identity is preserved; this assumption breaks down in real video and film contexts [4][5]
- Identity is strongly view-dependent: facial features and body posture change systematically with viewing angle, and one or a few images cannot capture the full range of a person's appearance [5][6]
- The article argues that true identity preservation under real 3D camera movement is fundamentally a multi-view consistency problem, not a single-frame similarity problem [6][7]

Group 2: Methodology Overview
- To address the long-ignored multi-view identity issue, the character customization pipeline is redesigned from a data perspective [11]
- Instead of single-view references, the approach uses multi-view performance capture, generating data via 4D Gaussian Splatting (4DGS) [12][14]
- A two-phase training strategy is employed: phase one pre-trains camera perception so the model understands how camera movement changes the viewpoint, and phase two fine-tunes the model for multi-view identity customization (a toy sketch of this schedule appears after this summary) [18][19]

Group 3: Lighting Realism
- Lighting is a critical dimension of identity in real films, as characters appear under varied lighting conditions; HDR-based relighting data is introduced to enhance lighting realism in generated videos [21][23]
- Experiments show videos generated with relighting data are perceived as more natural and realistic, with 83.9% of users preferring the enhanced lighting conditions [23]

Group 4: Multi-Person Generation
- In multi-person video generation, multi-view identity preservation matters even more: the model must keep each character's identity stable across angles and lighting for interactions to look natural [25][26]
- Two methods support multi-person generation: capturing performances for 4DGS reconstruction, and rendering videos with precise 3D camera parameters to ensure identity consistency [27]

Group 5: Experimental Conclusions
- Systematic experiments show that models trained with multi-view data significantly outperform frontal-view-only training in identity consistency and realism [31][32]
- User studies confirm a clear preference for generated results with stable multi-view identity, highlighting the importance of this approach [32]
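Here is a runnable toy sketch of the two-phase schedule described above. Everything in it (the tiny model, the stand-in objective, the random data) is an assumption for illustration: the real work trains a video diffusion model on 4DGS-rendered multi-view clips. The point shown is only the split — phase one optimizes camera conditioning alone, phase two unfreezes the full model for multi-view identity fine-tuning.

```python
import torch
import torch.nn as nn

class ToyVideoModel(nn.Module):
    """Stand-in for a camera-conditioned video generator."""
    def __init__(self, dim=16):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)        # appearance pathway
        self.camera_branch = nn.Linear(6, dim)     # 6-DoF camera pose input

    def forward(self, video_feat, cam_pose):
        return self.backbone(video_feat) + self.camera_branch(cam_pose)

model = ToyVideoModel()

# Phase 1: camera-perception pre-training -- only the camera branch
# is updated, so the model learns how pose changes the viewpoint.
opt = torch.optim.AdamW(model.camera_branch.parameters(), lr=1e-3)
for _ in range(100):
    video_feat, cam_pose = torch.randn(8, 16), torch.randn(8, 6)
    loss = model(video_feat, cam_pose).pow(2).mean()  # stand-in objective
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: multi-view identity customization -- all parameters are
# fine-tuned on (multi-view clip, camera trajectory) pairs so identity
# stays stable across angles.
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
for _ in range(100):
    video_feat, cam_pose = torch.randn(8, 16), torch.randn(8, 6)
    loss = model(video_feat, cam_pose).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

The two-optimizer structure is the design choice worth noting: freezing the backbone in phase one prevents identity-specific data from overwriting general camera understanding.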
Generate Without Forgetting: Peking University's EgoLCD Adds Long- and Short-Term Memory to an "Ultra-Long-Horizon" World Model
36Kr· 2025-12-24 07:58
Core Insights
- The article introduces EgoLCD, a novel long-context diffusion model developed by a collaborative research team from several prestigious institutions, aimed at addressing "content drift" in long video generation [1][2][3]

Group 1: Model Overview
- EgoLCD employs a dual memory mechanism inspired by human cognition: long-term memory for stability and short-term memory for rapid adaptation [5]
- A structured narrative prompting system links visual details with textual descriptions to improve the coherence of generated videos [7][8]

Group 2: Technical Innovations
- EgoLCD uses a sparse key-value cache for long-term memory, retaining only critical semantic anchors to cut memory usage while preserving global consistency (a runnable toy version of this idea is sketched below) [11]
- Short-term memory is enhanced with LoRA, letting the model adapt quickly to rapid viewpoint changes such as fast hand movements [11]

Group 3: Performance Metrics
- On the EgoVid-5M benchmark, EgoLCD outperformed leading models such as OpenSora and DynamiCrafter in temporal coherence and action consistency, achieving the best CD-FVD and NRDP scores [12][14]
- The model showed a significant reduction in content drift, maintaining high subject and background consistency throughout generation [13][14]

Group 4: Practical Applications
- EgoLCD is positioned as a "first-person world simulator", generating coherent long-duration videos that can serve as training data for embodied intelligence applications such as robotics [15]
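The sketch below is a minimal, runnable illustration of the sparse key-value long-term memory idea described above — keep only the most salient "semantic anchor" entries to bound memory. It is not EgoLCD's actual implementation; the salience scoring and class names are stand-ins.

```python
import heapq

class SparseKVCache:
    """Bounded memory that evicts the least salient anchors first."""
    def __init__(self, capacity=8):
        self.capacity = capacity
        self._heap = []      # min-heap of (salience, insertion_order, key, value)
        self._counter = 0    # unique tiebreaker so values are never compared

    def write(self, key, value, salience):
        """Insert an entry; evict the lowest-salience one when over capacity."""
        heapq.heappush(self._heap, (salience, self._counter, key, value))
        self._counter += 1
        if len(self._heap) > self.capacity:
            heapq.heappop(self._heap)  # drop the least important anchor

    def read(self):
        """Return retained anchors, most salient first."""
        return [(k, v) for s, _, k, v in sorted(self._heap, reverse=True)]

cache = SparseKVCache(capacity=3)
for t in range(10):
    # Hypothetical salience signal: in a real system this might score
    # frames where new objects or scene changes appear.
    cache.write(f"frame_{t}", f"features_{t}", salience=(t % 4))
print(cache.read())
```

The design trade-off this illustrates: a dense cache grows linearly with video length, while a salience-pruned cache stays constant-size at the cost of deciding (correctly) what counts as an anchor.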
Camera Motion Error Cut by 40%! DualCamCtrl Gives Video Generation a "Depth Camera" So Camera Moves Actually Obey
机器之心· 2025-12-21 04:21
Core Insights
- The article discusses the limitations of current video generation models in geometric understanding and camera motion control, highlighting the need for explicit geometric understanding in video generation [3][4]
- It introduces DualCamCtrl, a new end-to-end geometry-aware diffusion framework that addresses these limitations by synchronously generating RGB and depth sequences [3][4][28]

Model Architecture
- DualCamCtrl employs a dual-branch diffusion framework in which one branch generates RGB representations and the other depth representations, fused through the proposed Semantic Guided Mutual Alignment (SIGMA) mechanism (a toy version of this dual-branch design is sketched below) [9][11]
- This design lets RGB and depth outputs evolve independently with minimal interference, providing coherent geometric guidance throughout video generation [9][12]

Training Strategy
- A two-stage training strategy is used: stage one performs decoupled multi-modal representation learning, and stage two focuses on cross-modal fusion modeling [11][21]
- The decoupled stage independently learns appearance and geometric representations, while the fusion stage strengthens the complementarity of RGB and depth information through joint optimization [21][22]

Experimental Results
- DualCamCtrl significantly outperforms existing state-of-the-art methods in both quantitative and qualitative analyses, reducing camera motion error by over 40% [4][23][28]
- It achieves lower FVD and FID scores than competing methods, indicating superior performance on video generation tasks [27][28]

Conclusion
- DualCamCtrl represents a significant advance in camera-controlled video generation, offering a new technical approach that improves geometric perception and consistency in generated videos [28]
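Here is a runnable toy sketch of the dual-branch idea described above. The class names (`SIGMAFusion`, `DualBranchBlock`), dimensions, and use of plain linear layers are illustrative assumptions — the real DualCamCtrl operates inside a video diffusion model. The point shown: RGB and depth tokens evolve in separate branches and exchange information only through a residual cross-attention fusion step, so geometry can guide appearance without the two streams interfering.

```python
import torch
import torch.nn as nn

class SIGMAFusion(nn.Module):
    """Cross-attention from one modality's tokens to the other's (toy)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, context):
        fused, _ = self.attn(self.norm(x), context, context)
        return x + fused  # residual: each branch keeps its own evolution

class DualBranchBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.rgb_layer = nn.Linear(dim, dim)
        self.depth_layer = nn.Linear(dim, dim)
        self.rgb_from_depth = SIGMAFusion(dim)   # depth guides appearance
        self.depth_from_rgb = SIGMAFusion(dim)   # appearance grounds depth

    def forward(self, rgb_tokens, depth_tokens):
        rgb = torch.relu(self.rgb_layer(rgb_tokens))
        depth = torch.relu(self.depth_layer(depth_tokens))
        # Mutual alignment: each branch attends to the other's tokens.
        return (self.rgb_from_depth(rgb, depth),
                self.depth_from_rgb(depth, rgb))

block = DualBranchBlock()
rgb = torch.randn(1, 16, 64)     # (batch, tokens, dim)
depth = torch.randn(1, 16, 64)
rgb_out, depth_out = block(rgb, depth)
print(rgb_out.shape, depth_out.shape)
```

The residual connection in the fusion step matters for the two-stage training described above: in the decoupled stage each branch can learn on its own, and fusion can be strengthened later without destabilizing either representation.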
The Paper Window for Autonomous-Driving World Models Is Closing Fast...
自动驾驶之心· 2025-12-11 00:05
Core Insights
- The article highlights the recent surge in research papers on world models in autonomous driving, pointing to a trend of localized breakthroughs and verifiable improvements in the field [1]
- It emphasizes the importance of polishing submissions to top conferences, arguing that the final 10% of refinement can significantly affect a paper's quality and acceptance [2]
- The platform "Autonomous Driving Heart" (自动驾驶之心) is presented as a leading AI technology media outlet in China, with a strong focus on autonomous driving and related interdisciplinary fields [3]

Summary by Sections

Research Trends
- Numerous recent works in autonomous driving, such as MindDrive and SparseWorld-TC, center on world models, which are expected to dominate upcoming conferences [1]
- World models are likely to be the main theme through the end of this year and the first half of next, indicating a strategic direction for researchers [1]

Guidance and Support
- The platform offers personalized guidance to students, helping them navigate research and the paper submission process [7][13]
- It claims a 96% acceptance rate among students who received guidance over the past three years [5]

Faculty and Resources
- The platform reports over 300 dedicated instructors from top global universities, with extensive experience publishing at top-tier conferences and journals, providing students with valuable insights and support [5]

Services Offered
- Services include personalized paper guidance, real-time interaction with mentors, and comprehensive support throughout the research process [13]
- Students may also receive recommendations from prestigious institutions and direct job placements at leading tech companies [19]
AI Q&A, "Shot" for You Directly! From Kuaishou Kling & City University of Hong Kong
量子位· 2025-11-22 03:07
Core Insights
- The article introduces VANS, a novel AI model that answers questions by generating videos rather than text, aiming to bridge the gap between understanding and execution in tasks [3][4][5]

Group 1: Concept and Motivation
- The motivation is to use video, which inherently conveys dynamic physical-world information that language struggles to describe accurately [5]
- Traditional "next event prediction" has focused on text-based answers; VANS proposes a new task paradigm in which the model generates a video as the response [8][9]

Group 2: Model Structure and Functionality
- VANS couples a visual language model (VLM) with a video diffusion model (VDM), jointly optimized through a strategy called Joint-GRPO that strengthens collaboration between the two models [19][24]
- The workflow has two main steps: perception and reasoning, where the input video is encoded and analyzed; and conditional generation, where the model creates a video conditioned on the generated text caption and the visual features (a toy sketch of this pipeline follows below) [20]

Group 3: Optimization Process
- Optimization runs in two phases: first, the VLM is enhanced to produce captions that are visually representable; second, the VDM is refined so the generated video aligns semantically with the caption and the context of the input video [25][28]
- Joint-GRPO acts as a director, keeping the "thinker" (VLM) and the "artist" (VDM) in step and improving both through mutual feedback [34][36]

Group 4: Applications and Impact
- VANS has two significant applications: procedural teaching, providing customized instructional videos from user input; and multi-future prediction, enabling creative exploration of hypothetical scenarios [37][41]
- It outperforms existing models on benchmarks, with significant gains on metrics such as ROUGE-L and CLIP-T, indicating effectiveness in both semantic fidelity and video quality [46][47]

Group 5: Experimental Results
- Comprehensive evaluations show VANS excels at procedural teaching and future prediction, achieving nearly three times the event prediction accuracy of the best existing models [44][46]
- Qualitative results highlight its ability to visualize fine-grained actions accurately, showcasing advanced semantic understanding and visual generation capabilities [50][53]

Conclusion
- Video-as-Answer represents a significant step for video generation beyond entertainment toward practical applications, enabling more intuitive interaction with machines and knowledge [55][56]
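The toy sketch below illustrates the two-stage perceive-then-generate pipeline described above. Both functions are stand-ins for the VLM and VDM — the real system couples trained models with Joint-GRPO reinforcement learning, whereas this sketch only shows the data flow: caption first, then video conditioned on both the caption and the visual context.

```python
def caption_next_event(frames: list[str], question: str) -> str:
    """VLM stage stand-in: perceive the clip and reason about what comes
    next, emitting a caption the video model can actually visualize."""
    return (f"Answering '{question}': after '{frames[-1]}', "
            f"whisk the mixture until foamy.")

def generate_video(caption: str, context_frames: list[str]) -> list[str]:
    """VDM stage stand-in: generate frames conditioned on the caption AND
    the visual context, keeping the answer grounded in the input scene."""
    return [f"frame {i}: {caption} (grounded in '{context_frames[-1]}')"
            for i in range(3)]

clip = ["crack eggs into bowl", "add sugar"]
caption = caption_next_event(clip, "What should I do next?")
answer_video = generate_video(caption, clip)
print("\n".join(answer_video))
```

The conditioning on both inputs is the key design point: captioning alone loses visual grounding, while generating straight from pixels loses the explicit reasoning step that Joint-GRPO is designed to reward.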