Video Generation
Over 1,000x Less Data, a Top-Tier Video Model Trained for $500: Pusa Arrives from CityU Hong Kong and Huawei
机器之心· 2025-06-19 02:28
Core Viewpoint
- The article discusses the advances in video generation brought by the Frame-aware Video Diffusion Model (FVDM) and its practical application in the Pusa project, which significantly reduces training costs while enhancing video generation capabilities [2][3][37]

Group 1: FVDM and Pusa Project
- FVDM introduces a vectorized timestep variable (VTV) that gives each frame an independent temporal evolution path, addressing the limitations of the single scalar timestep used in traditional video diffusion models [2][18]
- The Pusa project, developed in collaboration with Huawei's Hong Kong Research Institute, serves as a direct application and validation of FVDM, exploring a low-cost method for fine-tuning large-scale pre-trained video models [3][37]
- Pusa achieves superior results compared to the official Wan I2V model while reducing training costs by over 200 times (from at least $100,000 to $500) and data requirements by over 2,500 times [5][37]

Group 2: Technical Innovations
- The Pusa project applies non-destructive fine-tuning to pre-trained models such as Wan-T2V 14B, enabling effective video generation without compromising the original model's capabilities [5][29]
- The probabilistic timestep sampling training strategy (PTSS) introduced with FVDM accelerates convergence and improves performance over the original model [30][31]
- Pusa's VTV mechanism supports diverse video generation tasks by giving different frames distinct noise perturbation controls, enabling more nuanced video generation; a minimal sketch of the idea follows this summary [35][36]

Group 3: Community Engagement and Future Prospects
- The complete codebase, training datasets, and training code for Pusa have been open-sourced to encourage community contributions and collaboration, aiming to improve performance and explore new possibilities in video generation [17][37]
- The article emphasizes the potential of Pusa to lead the video generation field into a new era of low cost and high flexibility [36][37]
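To make the VTV idea concrete, here is a minimal, hypothetical sketch of per-frame timesteps in a DDPM-style noising step, together with a PTSS-like sampler. The toy schedule, the `p_shared` mixing probability, and all names are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of a vectorized timestep variable (VTV) in a DDPM-style
# forward (noising) process. Schedule, shapes, and p_shared are assumptions.
import torch

T = 1000                                          # number of diffusion steps
alphas_cumprod = torch.linspace(0.999, 0.001, T)  # toy cumulative schedule

def noise_video(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Noise each frame with its own timestep.

    x0: clean video, shape (frames, C, H, W)
    t:  per-frame timesteps, shape (frames,) -- the VTV; a conventional
        model would use a single scalar t for the whole clip.
    """
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)   # per-frame alpha-bar
    eps = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

def sample_timesteps(frames: int, p_shared: float = 0.3) -> torch.Tensor:
    """PTSS-like sampler: with probability p_shared fall back to one shared
    timestep for all frames (the classic scalar setting), otherwise draw an
    independent timestep per frame."""
    if torch.rand(()).item() < p_shared:
        return torch.randint(0, T, (1,)).expand(frames).clone()
    return torch.randint(0, T, (frames,))

# Example: keep frame 0 almost clean (acting as a conditioning image) while
# later frames are heavily noised -- per-frame noise control like this is
# what lets a single model cover image-to-video-style tasks.
video = torch.randn(8, 3, 64, 64)
t = torch.cat([torch.zeros(1, dtype=torch.long), torch.randint(500, T, (7,))])
noisy = noise_video(video, t)
print(t, noisy.shape)
```

Under this reading, holding some frames at low noise while others evolve freely is what allows one fine-tuned model to handle image-to-video, extension, and interpolation through the same mechanism.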
ByteDance AI Reaches a New Level of Intensity: Doubao Trials "Context-Based Pricing", Trae Covers 80% of Internal Engineers, Strategy Targets Three Main Lines
AI前线· 2025-06-11 08:39
Core Insights
- ByteDance shared its thinking on the main lines of AI technology development for this year, focusing on three key areas [1]
- On June 11, ByteDance's Volcano Engine launched a series of updates, including the Doubao 1.6 model and the Seedance 1.0 Pro video generation model [1]

Doubao Model 1.6
- The Doubao 1.6 family includes several variants that support multimodal input and a context length of up to 256K [3]
- The model performed strongly on exams, scoring 144 on a national college entrance math exam and, in simulated tests, 706 in the science track and 712 in the humanities track [3]
- Doubao 1.6 can carry out tasks such as booking hotels and organizing shopping receipts into Excel [3]

Pricing and Cost Structure
- Doubao 1.6 uses a unified pricing structure tiered by context length, with costs significantly lower than previous models; a small cost-estimator sketch follows this summary [8]:
  - 1-32K context: input 0.8 RMB/million tokens, output 8 RMB/million tokens
  - 32-128K context: input 1.2 RMB/million tokens, output 16 RMB/million tokens
  - 128-256K context: input 2.4 RMB/million tokens, output 24 RMB/million tokens [9]

Video Generation Technology
- The Seedance 1.0 Pro model features seamless multi-shot storytelling and enhanced motion realism, enabling the generation of complex video content [18]
- Generating a 5-second 1080P video costs approximately 3.67 RMB, which is competitive in the market [18][20]

AI Development Tools
- Trae, an internal coding assistant, has gained significant traction, with over 80% of ByteDance engineers using it [14]
- Trae improves coding efficiency through features such as code completion and predictive editing, enabling rapid development [16]
- Trae is built on the Doubao 1.6 model, which has been specifically trained for engineering tasks [16]

Future Trends in AI
- The industry expects gradual improvement in handling complex multi-step tasks, with a projected accuracy of 80%-90% for simple tasks by Q4 of this year [5]
- ByteDance anticipates that video generation technology will become practical for production use in 2025, with models like Veo 2 emerging [5]
- The company is focusing on integrating AI into sectors such as e-commerce and gaming to enhance user experiences [22]
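As a quick sanity check on the tiered pricing above, a small helper can estimate per-request cost. This is a hypothetical sketch using only the per-million-token rates quoted in the summary; the exact tier cutoffs (how "1-32K" maps to token counts) are an assumption for illustration.

```python
# Cost estimator for Doubao 1.6's context-length-tiered pricing, using the
# per-million-token rates quoted above. Tier cutoffs are assumed.
TIERS = [  # (max context tokens, input RMB per 1M tokens, output RMB per 1M tokens)
    (32_000, 0.8, 8.0),
    (128_000, 1.2, 16.0),
    (256_000, 2.4, 24.0),
]

def estimate_cost_rmb(input_tokens: int, output_tokens: int) -> float:
    """Select the tier by total context length, then price input and output."""
    context = input_tokens + output_tokens
    for max_ctx, in_rate, out_rate in TIERS:
        if context <= max_ctx:
            return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    raise ValueError("context exceeds the 256K limit")

# A 20K-token prompt with a 2K-token reply lands in the 1-32K tier:
print(f"{estimate_cost_rmb(20_000, 2_000):.4f} RMB")  # -> 0.0320 RMB
```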
Veo 3 demo | Crystalline flowers bloom
Google DeepMind· 2025-05-20 23:00
Model Capabilities
- Veo 3 is a new state-of-the-art video generation model designed for filmmakers and storytellers [1]
- The model empowers users to add sound effects, ambient noise, and dialogue, generating all audio natively [1]
- Veo 3 delivers best-in-class quality, excelling in physics, realism, and prompt adherence [1]

Key Features
- Veo 3 allows for the creation of videos from text prompts, such as "A snow-covered plain of iridescent moon-dust under twilight skies" [1]
- The model can generate visuals of complex scenes, including "Thirty-foot crystalline flowers bloom, refracting light into slow-moving rainbows" [1]
- Veo 3 can depict figures interacting with the environment, such as "A fur-cloaked figure walks between these colossal blossoms, leaving the only footprints in untouched dust" [1]