Diffusion Models
Peking University: Principles and Industry Applications of AI Video Generation Technology, 2025
Sou Hu Cai Jing· 2025-12-09 06:48
Group 1: AI Video Technology Overview
- AI video technology is a subset of narrow AI focused on generative tasks such as video generation, editing, and understanding, with typical methods including text-to-video and image-to-video [1]
- The evolution of the technology spans from the exploration of GANs before 2016 to the commercialization of diffusion models from 2020 to 2024, culminating in the release of Sora in 2024, which marked the "AI Video Year" [1]

Group 2: Main Tools and Platforms
- Key platforms include OpenAI Sora, Kuaishou Keling AI, ByteDance Jimeng AI, Runway, and Pika, each offering unique features in terms of duration, quality, and style [2]

Group 3: Technical Principles and Architecture
- The mainstream paradigm is the diffusion model, which is stable in training and offers strong generation diversity, with architectures categorized into U-Net and DiT [3]
- Key components include the self-attention mechanism of Transformers for temporal consistency, VAE for compression, and CLIP for semantic alignment between text and visuals [3]

Group 4: Data Value and Training
- The scale, quality, and diversity of training data determine a model's upper limits, with prominent datasets including WebVid-10M and UCF-101 [4]

Group 5: Technological Advancements and Breakthroughs
- Mainstream models can generate videos at 1080p/4K resolution and up to 2 minutes in length, with some models supporting native audio-visual synchronization [5]
- Remaining challenges include temporal consistency, physical logic, and emotional detail, alongside computational cost constraints [5]
- Evaluation frameworks such as VBench and SuperCLUE have been established, focusing on "intrinsic authenticity" [5]

Group 6: Industry Applications and Value
- In film and entertainment, AI is involved across the entire production process, reducing costs and improving efficiency [6]
- The short video and marketing sectors use AI for rapid content generation, exemplified by Xiaomi's AI glasses advertisement [6]
- In cultural tourism, AI is used for city promotional videos and immersive experiences [7]
- In education, AI enables bulk generation of micro-course videos and personalized learning content [8]
- In news media, AI virtual anchors enable 24-hour reporting, though ethical concerns about content authenticity persist [9]

Group 7: Tool Selection Recommendations
- Runway or Keling AI are recommended for professional film work, Jimeng AI or Pika for short-video operations, and Vidu for traditional Chinese content [10]
- Domestic tools such as Keling and Jimeng have low barriers to entry, while overseas tools require a VPN and foreign-currency payments [11]
- A multi-tool collaborative workflow with a "director's mindset" is advised, rather than reliance on a single platform [12]

Group 8: Future Outlook
- The report concludes that AI video will evolve toward a "human-machine co-creation" model, becoming foundational infrastructure akin to the internet, with human value centered on creativity and judgment [13]
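The architecture described in Group 3 — a diffusion backbone conditioned on CLIP text embeddings, generating in a VAE latent space — can be sketched at a very high level. Every function below is an invented stand-in for illustration, not any platform's actual API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes for a short latent video clip.
frames, h, w, latent_c = 16, 8, 8, 4

def clip_text_encoder(prompt: str) -> np.ndarray:
    # Stand-in for CLIP: map a text prompt to a conditioning vector.
    return rng.standard_normal(64)

def denoiser(z, t, cond):
    # Stand-in for the U-Net/DiT backbone with temporal self-attention:
    # it would predict the noise present in latent z at step t.
    return 0.1 * z  # placeholder, not a trained network

def vae_decode(z):
    # Stand-in for the VAE decoder: latents -> RGB frames (upsample 2x).
    return np.tanh(z.repeat(2, axis=1).repeat(2, axis=2)[..., :3])

cond = clip_text_encoder("a cat surfing at sunset")
z = rng.standard_normal((frames, h, w, latent_c))  # start from pure noise
for t in range(50, 0, -1):                          # reverse-diffusion loop
    z = z - denoiser(z, t, cond)                    # simplified update step

video = vae_decode(z)
print(video.shape)  # (16, 16, 16, 3)
```

Real systems replace the stand-ins with a trained DiT or U-Net denoiser, a learned VAE decoder, and a proper noise schedule, but the overall structure is this same reverse-diffusion iteration decoded into frames.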
Roblox CEO on the Pace of AI Research: Even a Once Well-Read Executive Can Barely Keep Up
Sou Hu Cai Jing· 2025-12-08 11:28
Core Insights
- The rapid advancement of AI research is overwhelming, with new papers emerging almost daily, making it difficult to fully comprehend the breadth of the field [1][3]
- Roblox CEO David Baszucki emphasizes the significant shift in AI research complexity compared to earlier technological studies, noting the vast scale and speed of current developments [3]

Group 1: AI Research Landscape
- The current wave of AI research is characterized by its enormous scale and rapid pace, with concepts like Transformers and diffusion models becoming prevalent [3]
- Major companies such as Meta and Microsoft are establishing their own research departments and offering high salaries to attract top talent, indicating a competitive landscape for AI expertise [3]
- In 2023, Google decided to reduce the public dissemination of AI papers, reflecting a trend toward more closed research environments where internal knowledge becomes a competitive advantage [3]

Group 2: AI's Current State in 3D Environments
- Baszucki concludes that AI is still in its early stages within "three-dimensional worlds," relying heavily on human-created text and images rather than real-world 3D data [3]
- The focus on computational power is prevalent, but OpenAI co-founder Ilya Sutskever argues that the direction of AI development is fundamentally determined by the research itself [3]
Are World Models Approaching Their Own "ChatGPT Moment"?
Xin Lang Cai Jing· 2025-12-02 11:22
Core Insights
- The discussion highlights the emerging focus on world models in AI, with significant contributions from leading scholars like Li Feifei and institutions such as the Chinese Academy of Sciences and Nanjing University [1][3]

Group 1: Definition and Applications of World Models
- World models are defined as predictive models that forecast the next state given the current state and action sequence, with applications in autonomous driving and embodied intelligence [3]
- The ultimate goal of world models is a 1:1 representation of the world, although practical modeling will vary based on specific tasks [3]

Group 2: Data and Model Training Challenges
- A key dilemma in developing world models is whether to prioritize model creation or data collection, with examples from autonomous driving highlighting the limitations of available data [5]
- Experts propose mixing synthetic data with real data to enhance model training [5]

Group 3: Technical Implementation Paths
- Opinions differ on technical paths, with some advocating the integration of physical information while others emphasize creative generation [6]
- The discussion includes the potential of combining diffusion and autoregressive architectures to improve model performance [7]

Group 4: Future Outlook and Commercialization
- Experts speculate that the "ChatGPT moment" for world models may occur in approximately three years, contingent on the availability of high-quality long video data [8]
- Commercialization faces challenges in both B2B and B2C sectors, particularly in defining the value of generated video data [8][9]
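The definition above — predict the next state from the current state and an action sequence — can be expressed as a minimal interface. The dynamics here are a toy linear system invented for illustration; real world models learn the transition from data:

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyWorldModel:
    """Predicts the next state from (state, action): a toy linear stand-in
    for the learned world models used in driving and embodied AI."""

    def __init__(self, state_dim, action_dim):
        self.A = np.eye(state_dim) * 0.99                    # state transition
        self.B = 0.1 * rng.standard_normal((state_dim, action_dim))  # action effect

    def predict(self, state, action):
        return self.A @ state + self.B @ action

    def rollout(self, state, actions):
        # "Imagine" a future by chaining one-step predictions over actions.
        states = []
        for a in actions:
            state = self.predict(state, a)
            states.append(state)
        return np.stack(states)

wm = ToyWorldModel(state_dim=4, action_dim=2)
future = wm.rollout(np.zeros(4), actions=rng.standard_normal((5, 2)))
print(future.shape)  # (5, 4)
```

The rollout loop is what makes such models useful for planning: candidate action sequences can be scored against the imagined futures without touching the real environment.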
Diffusion Models Took a Ten-Year Detour! Kaiming He's Major New Work JiT Returns to the True Essence of "Denoising"
自动驾驶之心· 2025-12-01 00:04
Core Viewpoint
- The article discusses the limitations of current diffusion models in denoising tasks and introduces a simplified architecture, JiT (Just image Transformers), that predicts clean images directly rather than noise, improving performance in high-dimensional pixel spaces [10][18][34]

Group 1: Diffusion Models and Noise Prediction
- Traditional diffusion models are designed to predict the noise or the amount of mixed noise, which is fundamentally different from predicting clean images [6][7]
- The authors argue that the essence of denoising is to have the network predict clean data rather than noise, simplifying the task and improving model performance [18][19]

Group 2: JiT Architecture
- JiT is a minimalist framework that operates directly on pixel patches without relying on latent spaces, tokenizers, or additional loss functions, making it more efficient [10][25][34]
- Even with high-dimensional patches (up to 3072 dimensions), the model maintains stable training and performance by focusing on predicting clean images [23][30]

Group 3: Experimental Results
- On ImageNet at various resolutions, JiT models achieved impressive FID scores, with JiT-G/16 reaching 1.82, comparable to complex models that use latent spaces [30][31]
- Performance remained stable even at higher resolutions (1024×1024), showing the model can handle high-dimensional data without increased computational cost [32][34]

Group 4: Implications for Future Research
- The JiT framework suggests a potential shift in generative modeling, emphasizing the value of working directly in pixel space for applications in embodied intelligence and scientific computing [34]
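The clean-image vs. noise-prediction distinction can be made concrete with the standard forward-diffusion identity. This is a generic sketch of that textbook formulation, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean image": a flattened pixel patch (JiT operates on raw patches).
x0 = rng.standard_normal(3072)       # e.g. a 32x32x3 patch, 3072 dimensions
eps = rng.standard_normal(3072)      # Gaussian noise
alpha_bar = 0.4                      # cumulative signal level at some timestep

# Forward diffusion mixes clean data with noise:
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Given the schedule, the two training targets are algebraically linked:
# knowing eps recovers x0, and knowing x0 recovers eps.
x0_recovered = (x_t - np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(alpha_bar)
eps_recovered = (x_t - np.sqrt(alpha_bar) * x0) / np.sqrt(1.0 - alpha_bar)

print(np.allclose(x0_recovered, x0))   # True
print(np.allclose(eps_recovered, eps))  # True
```

The identity shows the targets are interchangeable when known exactly; the article's claim is about the *learning* problem: in very high-dimensional pixel spaces, regressing the clean image is the better-behaved target for the network.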
Are World Models Approaching Their Own "ChatGPT Moment"?
机器之心· 2025-11-29 01:49
Core Viewpoint
- The article discusses the emerging focus on "world models" in the AI field, highlighting potential applications and the ongoing debates among experts regarding their definition, construction, and commercialization [1][3]

Definition of World Models
- Experts offered various definitions, with key perspectives including:
  - A predictive model that forecasts the next state based on current conditions and action sequences, with applications in autonomous driving and embodied intelligence [4]
  - A framework for AI to predict and assess environmental states, evolving from simple game worlds to complex virtual environments [4]
  - An ambitious goal of creating a 1:1 model of the world, acknowledging the impracticality of such precision but emphasizing purpose-driven modeling [4]

Construction of World Models
- A central dilemma in developing world models is whether to prioritize model creation or data collection. Experts discussed:
  - The challenge of training models with limited data, particularly in autonomous driving, where most data is collected under ideal conditions [5]
  - The importance of high-quality data for specific applications to enhance model performance [5]
  - A proposed iterative approach in which initial models generate data that can be used for further training [5]

Technical Implementation Paths
- There are notable disagreements among experts regarding technical paths:
  - Some advocate incorporating physical information into models, while others suggest a more pragmatic approach based on specific needs [7]
  - Models may evolve toward purely generative forms as capabilities improve [7]

Architectural Debate: Diffusion vs. Autoregressive
- Diffusion models are seen as more aligned with the physical generation of content, reflecting how the brain decodes complex signals [8]
- There is a trend toward integrating different architectures to enhance model performance, recognizing the strengths of both diffusion and autoregressive methods [9]

Future of World Models
- The timeline for a "ChatGPT moment" for world models is uncertain, with estimates suggesting significant breakthroughs may take around three years [10]
- The current lack of high-quality long video data poses a significant challenge, with existing models primarily generating short clips [10]
- Commercialization faces challenges in defining value for both business-to-business (B2B) and business-to-consumer (B2C) applications [10][11]

Conclusion
- The roundtable discussion highlighted the vibrant and diverse nature of the world model field, emphasizing its growth potential while acknowledging challenges related to data, computational power, and technical direction [13]
NeurIPS 2025 Awards Announced: Qwen Wins a Best Paper Award
具身智能之心· 2025-11-28 00:04
Core Insights
- The NeurIPS 2025 conference awarded four Best Paper awards and three Best Paper Runner-up awards, highlighting significant advancements in various AI research areas [1][2][4]

Group 1: Best Papers
- Paper 1: "Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)" introduces Infinity-Chat, a dataset with 26,000 diverse user queries, addressing the issue of homogeneity in language model outputs [6][8][10]
- Paper 2: "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free" reveals the impact of gated attention mechanisms on model performance, enhancing training stability and robustness [12][18]
- Paper 3: "1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities" demonstrates that increasing network depth to 1024 layers significantly improves performance in self-supervised reinforcement learning tasks [19][20]
- Paper 4: "Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training" explores the dynamics of training diffusion models, identifying mechanisms that prevent memorization and enhance generalization [21][23]

Group 2: Awards and Recognition
- The Test of Time Award was given to "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," recognized for its foundational impact on computer vision since its publication in 2015 [38][42]
- The Sejnowski-Hinton Prize was awarded to the researchers behind "Random synaptic feedback weights support error backpropagation for deep learning," contributing to the understanding of biologically plausible learning rules [45][49]
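The gated attention idea in Paper 2 can be sketched as a sigmoid gate applied to the attention output, computed from the layer input. This is one plausible gating placement, with invented names and toy shapes; the paper studies several variants, and this is not its implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

seq, d = 4, 8
X = rng.standard_normal((seq, d))            # token representations
Wq, Wk, Wv, Wg = (rng.standard_normal((d, d)) for _ in range(4))

# Standard single-head scaled dot-product attention.
Q, K, V = X @ Wq, X @ Wk, X @ Wv
attn = softmax(Q @ K.T / np.sqrt(d)) @ V

# Elementwise sigmoid gate derived from the input, applied to the output.
gate = 1.0 / (1.0 + np.exp(-(X @ Wg)))
gated = gate * attn

print(gated.shape)  # (4, 8)
```

The intuition the summary points at: the gate adds non-linearity after the attention mixing and can push a token's contribution toward zero, which is the lever linked to sparsity and the removal of "attention sinks."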
NeurIPS 2025 Best Papers Announced; the Decade-Old Classic by Kaiming He, Jian Sun, et al. Takes an Award
36Ke· 2025-11-27 07:27
Core Insights
- NeurIPS 2025 announced its best paper awards, with four papers recognized, including a significant contribution from Chinese researchers [1][2]
- The Test of Time Award was given to Faster R-CNN, highlighting its lasting impact on the field of computer vision [1][50]

Best Papers
- The first best paper, "Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)," was authored by a team from multiple prestigious institutions, including Washington University and Carnegie Mellon University [5][6]
- The second best paper, "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free," involved collaboration between researchers from Alibaba, the University of Edinburgh, Stanford University, MIT, and Tsinghua University [14][15]
- The third best paper, "1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities," was authored by researchers from Princeton University and Warsaw University of Technology [21][24]
- The fourth best paper, "Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training," was a collaborative effort from PSL University and Bocconi University [28][29]

Runners-Up
- Three runner-up papers were also recognized, including "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?" from Tsinghua University and Shanghai Jiao Tong University [33][34]
- Another runner-up, "Optimal Mistake Bounds for Transductive Online Learning," was authored by researchers from Kent State University, Purdue University, Google Research, and MIT [38][39]
- The third runner-up, "Superposition Yields Robust Neural Scaling," was from MIT [42][46]

Test of Time Award
- The Test of Time Award went to "Faster R-CNN," which has been cited over 56,700 times and has significantly influenced the computer vision field [50][52]
- The paper introduced a fully learnable two-stage process that replaced traditional methods, achieving high detection accuracy at near real-time speeds [50][52]
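The two-stage process the award citation describes — propose regions, then classify and refine them — can be caricatured in a few lines. This toy is purely illustrative (no learning, no box regression, no NMS) and is not Faster R-CNN's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 0: a backbone feature map (stand-in for a CNN's output).
feat = rng.standard_normal((8, 8, 16))

# Stage 1: a Region Proposal Network scores anchor boxes on the feature map.
anchors = [(y, x, 2, 2) for y in range(0, 8, 2) for x in range(0, 8, 2)]

def rpn_score(f, box):
    y, x, h, w = box
    return float(f[y:y + h, x:x + w].mean())   # objectness stand-in

scores = np.array([rpn_score(feat, a) for a in anchors])
top = [anchors[i] for i in np.argsort(scores)[-3:]]  # keep top-3 proposals

# Stage 2: a detection head pools each proposal and assigns a class.
def detect_head(f, box):
    y, x, h, w = box
    pooled = f[y:y + h, x:x + w].mean(axis=(0, 1))   # RoI pooling stand-in
    return int(np.argmax(pooled[:4]))                # class id over 4 classes

detections = [(box, detect_head(feat, box)) for box in top]
print(len(detections))  # 3
```

The paper's contribution was making stage 1 a learned network sharing features with stage 2, replacing slow hand-crafted proposal methods, which is what enabled near real-time detection.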
NeurIPS 2025 Awards Announced: Qwen Wins Best Paper, Faster R-CNN Wins the Test of Time Award
机器之心· 2025-11-27 03:00
Core Insights
- The NeurIPS 2025 conference awarded four Best Paper awards and three Best Paper Runner-up awards, highlighting significant advancements in various AI research areas [1][4]

Group 1: Best Papers
- Paper 1: "Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)" discusses the limitations of large language models in generating diverse content and introduces Infinity-Chat, a dataset with 26,000 diverse user queries for studying model diversity [5][6][9]
- Paper 2: "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free" reveals the impact of gated attention mechanisms on model performance and stability, demonstrating significant improvements in the Qwen3-Next model [11][16]
- Paper 3: "1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities" shows that increasing network depth to 1024 layers can enhance performance in self-supervised reinforcement learning tasks, achieving improvements of 2x to 50x [17][18]
- Paper 4: "Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training" identifies mechanisms that prevent diffusion models from memorizing training data, establishing a link between training dynamics and generalization capabilities [19][21][22]

Group 2: Best Paper Runners-Up
- Paper 1: "Optimal Mistake Bounds for Transductive Online Learning" solves a 30-year-old problem in learning theory, establishing optimal mistake bounds for transductive online learning [28][30][31]
- Paper 2: "Superposition Yields Robust Neural Scaling" argues that representation superposition is the primary mechanism governing neural scaling laws, supported by multiple experiments [32][34]

Group 3: Special Awards
- The Test of Time Award was given to "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," recognized for its foundational impact on modern object detection frameworks since its publication in 2015 [36][40]
- The Sejnowski-Hinton Prize was awarded for "Random synaptic feedback weights support error backpropagation for deep learning," which contributed significantly to understanding biologically plausible learning rules in neural networks [43][46][50]
Starting Soon! A Small-Group Course on Production-Ready End-to-End Autonomous Driving, to Land Senior Algorithm Roles
自动驾驶之心· 2025-11-27 00:04
Core Viewpoint
- The article emphasizes the importance of end-to-end production in the automotive industry, highlighting the scarcity of qualified talent and the need for comprehensive training programs to address the field's challenges [1][3]

Group 1: Course Overview
- The course covers essential algorithms for end-to-end production, including one-stage and two-stage frameworks, reinforcement learning applications, and trajectory optimization [3][9]
- It aims to provide practical experience and insight into production challenges, focusing on real-world applications and expert guidance [3][6]

Group 2: Course Structure
- The course consists of eight chapters, each addressing a different aspect of end-to-end production, such as a task overview, algorithm frameworks, applications of navigation information, and trajectory output optimization [9][10][11][12][13][14][15][16]
- The final chapter shares production experience from multiple perspectives, including data, models, and strategies for system enhancement [16]

Group 3: Target Audience and Requirements
- The course targets advanced learners with backgrounds in autonomous driving, reinforcement learning, and programming, though those with weaker foundations can still participate [17][18]
- Participants need access to a GPU meeting the recommended specifications and familiarity with the relevant algorithms and programming languages [18]
DiffRefiner, a Zhejiang University Paper Accepted to AAAI'26: A Two-Stage Trajectory Prediction Framework Sets a New NAVSIM Record!
自动驾驶之心· 2025-11-25 00:03
Editor | 自动驾驶之心
Paper authors | Liuhan Yin et al.

Unlike discriminative methods in autonomous driving, which predict over a fixed set of candidate ego trajectories, generative methods such as diffusion models can learn the underlying distribution of future motion, enabling more flexible trajectory prediction. However, because these methods typically rely on denoising hand-designed trajectory anchors or random noise, there is still considerable room for performance improvement.

A team from Zhejiang University and Nullmax proposes DiffRefiner, a new two-stage trajectory prediction framework. In the first stage, a Transformer-based proposal decoder regresses from sensor inputs and uses predefined trajectory anchors to generate coarse trajectory predictions; in the second stage, a diffusion refiner iteratively denoises and refines the initial predictions. By integrating a discriminative trajectory proposal module, the work provides strong guidance for the generative refinement process, significantly improving diffusion-based planning performance. The authors also design a fine-grained denoising decoder that improves scene adaptability, achieving more accurate trajectory prediction through tighter alignment with the surrounding environment. Experiments show that DiffRefiner achieves state-of-the-art performance, reaching 87.4 on the NAVSIM v2 dataset ...
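The two-stage structure — an anchor-based proposal, then iterative denoising refinement guided by that proposal — can be sketched schematically. Everything here (the toy anchors, the relaxation update standing in for a learned score network) is invented for illustration and is not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Trajectories are T waypoints in 2D; "anchors" are predefined coarse shapes.
T = 8
anchors = np.stack([np.linspace(0, 1, T)[:, None] * d
                    for d in ([1.0, 0.0], [0.9, 0.3], [0.9, -0.3])])

def proposal_decoder(scene_feature):
    # Stage 1 (stand-in): pick an anchor from the scene feature and
    # regress a small offset from it, giving a coarse trajectory.
    idx = int(np.argmax(scene_feature[:len(anchors)]))
    return anchors[idx] + 0.05 * rng.standard_normal((T, 2))

def diffusion_refiner(coarse, steps=10):
    # Stage 2 (stand-in): iterative denoising guided by the proposal.
    # A simple relaxation replaces the learned denoising network here.
    traj = coarse + 0.2 * rng.standard_normal((T, 2))  # noisy starting point
    for _ in range(steps):
        traj = traj + 0.3 * (coarse - traj)            # one denoising update
    return traj

scene = rng.standard_normal(16)          # stand-in for fused sensor features
coarse = proposal_decoder(scene)
refined = diffusion_refiner(coarse)
print(refined.shape)  # (8, 2)
```

The point the article makes is captured by the guidance: instead of denoising from arbitrary noise, the generative stage starts near a discriminatively chosen proposal, so the refinement has far less work to do.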