Diffusion Models
DiffRefiner, a Zhejiang University Paper Accepted to AAAI'26: A Two-Stage Trajectory Prediction Framework That Sets a New NAVSIM Record!
自动驾驶之心· 2025-11-25 00:03
Paper authors | Liuhan Yin et al. Unlike discriminative approaches in autonomous driving that predict a fixed candidate set of ego trajectories, generative approaches such as diffusion models can learn the underlying distribution of future motion, enabling more flexible trajectory prediction. However, because these methods typically rely on denoising hand-crafted trajectory anchors or random noise, their performance still leaves considerable room for improvement. A team from Zhejiang University and Nullmax proposes DiffRefiner, a new two-stage trajectory prediction framework: the first stage uses a Transformer-based proposal decoder that regresses over sensor inputs and generates coarse trajectory predictions from predefined trajectory anchors; the second stage introduces a diffusion refiner that iteratively denoises and optimizes these initial predictions. By incorporating the discriminative trajectory proposal module, the paper provides strong guidance for the generative refinement process, significantly improving diffusion-based planning performance. In addition, the authors design a fine-grained denoising decoder to enhance scene adaptability, achieving more accurate trajectory prediction through tighter alignment with the surrounding environment. Experiments show that DiffRefiner achieves state-of-the-art performance: on the NAVSIM v2 dataset it reaches 87.4 ...
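The two-stage recipe is straightforward to sketch in code. Below is a minimal PyTorch illustration, not the authors' implementation: the module names, anchor count, and the simplified denoising update are all assumptions made for exposition.

```python
import torch
import torch.nn as nn

class ProposalDecoder(nn.Module):
    """Stage 1 (hypothetical sketch): regress coarse trajectories from scene
    features, using predefined anchor trajectories as decoder queries."""
    def __init__(self, num_anchors=64, horizon=8, feat_dim=256):
        super().__init__()
        self.horizon = horizon
        self.anchor_embed = nn.Embedding(num_anchors, feat_dim)
        layer = nn.TransformerDecoderLayer(feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.head = nn.Linear(feat_dim, horizon * 2)   # (x, y) per future step

    def forward(self, scene_feats):                    # (B, N_tokens, feat_dim)
        B = scene_feats.size(0)
        queries = self.anchor_embed.weight.unsqueeze(0).expand(B, -1, -1)
        h = self.decoder(queries, scene_feats)
        return self.head(h).view(B, -1, self.horizon, 2)  # coarse proposals

def diffusion_refine(proposal, scene_feats, denoiser, steps=10):
    """Stage 2 (hypothetical sketch): start the reverse diffusion from the
    coarse proposal rather than pure noise, conditioned on the scene."""
    traj = proposal + 0.1 * torch.randn_like(proposal)  # lightly perturb proposal
    for t in reversed(range(steps)):
        t_batch = torch.full((traj.size(0),), t, device=traj.device)
        pred_noise = denoiser(traj, t_batch, scene_feats)  # predict noise
        traj = traj - pred_noise / steps                   # simplified update rule
    return traj
```

Seeding the refiner with a discriminative proposal, instead of random noise, is what gives the generative stage its strong guidance in this design.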
Why Did Robots Collectively Give Up "Parkour" and All Go "Fold Clothes" Instead?
机器人大讲堂· 2025-11-24 15:00
Remember Boston Dynamics' Atlas parkour video? That footage let the whole world truly feel the leap forward in humanoid robots' motor abilities. There were robot-dancing demos in the early days too, and tech enthusiasts would stare at how the joints moved and whether the balance held; back then, the industry loved to compete over who could pull off the flashiest extreme stunts. But in just six months, the wind has completely shifted. Open a robotics company's new product video today and parkour and dancing are rare; what has replaced them, across the board, is "folding clothes." Figure 03 tries folding towels with a five-fingered hand and keeps going even when a corner occasionally curls up; Weave Robotics' semi-automatic folding video is played at 2× speed, which looks brisk but actually hides the problem that the real speed is on the slow side. Robots going from wild showmanship to tackling housework ultimately reflects less concept hype in the industry: companies have started rethinking, and gradually engaging with, the market's real needs. Google's ALOHA clothes-hanging demo is uncut; the motions are leisurely and the hanger is occasionally misaligned, yet its honesty won it plenty of fans. Dyna Robotics is even more direct, having a robot fold napkins for 18 hours straight, the arm rising and falling over and over with the stubbornness of grinding away at a single task. ▍ Why learn to fold clothes, of all things? Robotics companies are piling into clothes folding because, at its core, technology and demand finally line up. Ten years ago, a robot folding clothes was still a laboratory rarity. In 2010, Willow Garage introduced the ...
NeurIPS 2025 | UniLumos: A Unified Image-Video Relighting Framework with Physical Feedback, Delivering Realistic Light-and-Shadow Reshaping at a 20× Speedup!
机器之心· 2025-11-24 09:30
Image and video relighting has drawn wide attention in computer vision and graphics, with broad applications in film, gaming, and augmented reality. Current diffusion-based methods can generate diverse and controllable lighting effects, but their optimization typically operates in semantic space, and semantic similarity cannot guarantee physical plausibility in visual space; as a result, outputs often show implausible artifacts such as blown-out highlights, misaligned shadows, and incorrect occlusion relationships. To address these problems, we propose UniLumos, a unified image and video relighting framework. The main innovations of this work are:
- Geometric feedback for stronger physical consistency: to mitigate physically implausible results, we inject geometric feedback from RGB space (e.g., depth and normal maps) into the generation process, aligning the lighting effects with scene structure and markedly improving physical consistency. However, this feedback relies on high-quality outputs as visual-space supervision, and conventional multi-step flow-matching denoising is computationally expensive. We therefore adopt path consistency learning, which preserves effective supervision under few-step training while greatly accelerating inference (a toy version of this feedback term is sketched below).
- A fine-grained light-and-shadow evaluation benchmark: to enable fine-grained control and evaluation of lighting effects, we design a ...
Experiments show that UniLumos markedly improves physical consistency while also reaching SOTA relighting quality, and it is roughly 20× more computationally efficient than existing methods, unifying high quality with high efficiency.
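To make the geometric-feedback idea concrete, here is a minimal sketch of what such a supervision term could look like, assuming off-the-shelf depth and normal estimators; `depth_net` and `normal_net` are hypothetical stand-ins, not UniLumos components.

```python
import torch
import torch.nn.functional as F

def geometry_feedback_loss(pred_rgb, ref_rgb, depth_net, normal_net):
    """Hypothetical sketch: penalize relit frames whose recovered geometry
    (depth / normals) drifts from that of the input frame, so changing the
    lighting cannot silently alter scene structure.
    pred_rgb, ref_rgb: (B, 3, H, W); normals assumed (B, 3, H, W) unit vectors."""
    d_pred, d_ref = depth_net(pred_rgb), depth_net(ref_rgb)
    n_pred, n_ref = normal_net(pred_rgb), normal_net(ref_rgb)
    depth_loss = F.l1_loss(d_pred, d_ref)
    # normals are unit vectors, so compare direction via cosine similarity
    normal_loss = (1 - F.cosine_similarity(n_pred, n_ref, dim=1)).mean()
    return depth_loss + normal_loss
```

Because such a loss needs a decoded image at training time, few-step generation (here, the path consistency learning the authors describe) is what keeps this kind of visual-space supervision affordable.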
University of Notre Dame Team Builds a New Tool for Molecular Design: Letting AI Create Molecules the Way It Writes an Essay
仪器信息网· 2025-11-19 09:08
Core Insights
- The article discusses the breakthrough AI system DemoDiff developed by a team from the University of Notre Dame, which can design new molecular structures by learning from a few examples, significantly accelerating drug and material development processes [7][8][10].

Group 1: AI Understanding of Molecular Design
- DemoDiff mimics human learning by analyzing a few successful molecular examples to understand design patterns, allowing it to generate new candidates quickly [10][11].
- The system can even learn from negative examples, generating high-quality molecules based on poorly performing ones, showcasing its advanced reasoning capabilities [21][22].

Group 2: Innovative Molecular Representation
- The team introduced a new method called "node-pair encoding," which simplifies complex molecular structures, improving efficiency by 5.5 times [9][12].
- This method allows for a significant reduction in the number of atoms needed to describe a molecule, enhancing the AI's ability to process more examples [12][13].

Group 3: Comprehensive Molecular Database
- DemoDiff was trained on an extensive database containing over 1 million molecular structures and 155,000 different molecular properties, providing a rich resource for learning [14][15].
- The database includes various sources, such as the ChEMBL database, which records millions of drug molecules and their biological activities [14][15].

Group 4: Diffusion Model for Molecular Generation
- The core technology of DemoDiff is based on a "diffusion model," which generates molecular structures through a progressive refinement process, ensuring chemical validity [16][17].
- This model incorporates context learning, allowing the AI to adapt its output based on different sets of example molecules (a rough sketch of this idea follows the list) [18].

Group 5: Performance Testing and Validation
- DemoDiff underwent rigorous testing across 33 different molecular design tasks, demonstrating performance comparable to much larger AI models [19][20].
- The system excels in generating diverse molecular structures, providing researchers with multiple options for further exploration [20].

Group 6: Negative Learning Capability
- The AI's ability to learn from negative examples allows it to infer what makes a successful molecule, enhancing its design capabilities [21][22].
- This feature is particularly valuable in early drug development stages, where researchers often have more negative examples than positive ones [21][22].

Group 7: Technical Innovations
- The system employs a "graph attention mechanism" to focus on multiple important parts of a molecule simultaneously, ensuring a holistic understanding during generation [23].
- A multi-layer validation mechanism checks the generated molecules against fundamental chemical rules, ensuring their feasibility [23][24].

Group 8: Implications for Molecular Design
- DemoDiff represents a paradigm shift in molecular design, potentially reducing the time and cost associated with drug development significantly [25][26].
- The technology may democratize molecular design, allowing a broader range of researchers to participate in innovation [26].

Group 9: Future Considerations
- While DemoDiff shows impressive capabilities, there is recognition of the need for further improvements, particularly in handling specific design tasks [27].
- Future developments may include expanding the model's scale and enhancing data quality to tackle more complex challenges [27][28].
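As a rough illustration of the context-learning idea in Group 4, the sketch below shows one plausible way demonstration molecules could steer a denoiser through cross-attention. Every name and shape here is a hypothetical simplification; the paper's actual graph architecture and node-pair tokens are not reproduced.

```python
import torch
import torch.nn as nn

class DemoConditionedDenoiser(nn.Module):
    """Hypothetical sketch of demonstration-conditioned molecular diffusion:
    example molecules are encoded into context vectors that steer each
    denoising step, so a few demos act like an in-context 'prompt'."""
    def __init__(self, dim=256):
        super().__init__()
        self.mol_encoder = nn.GRU(dim, dim, batch_first=True)  # stand-in encoder
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.step_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, noisy_mol, demos):
        # noisy_mol: (B, L, dim); demos: (B, K, L, dim) -- K demos per task
        B, K, L, D = demos.shape
        _, ctx = self.mol_encoder(demos.view(B * K, L, D))
        ctx = ctx.squeeze(0).view(B, K, D)             # one vector per demo
        attended, _ = self.cross_attn(noisy_mol, ctx, ctx)
        return self.step_mlp(noisy_mol + attended)     # refined token features
```

Swapping in a different demo set changes the conditioning without retraining, which is what lets a single model cover many design tasks, including learning from deliberately bad examples.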
End-to-End and VLA Jobs: The Salaries Are Absurdly High......
自动驾驶之心· 2025-11-19 00:03
Core Insights
- There is a significant demand for end-to-end and VLA (Vision-Language-Action) technical talent in the automotive industry, with salaries for experts reaching up to $70,000 per month for positions requiring 3-5 years of experience [1]
- The technology stack involved in end-to-end and VLA is complex, covering various advanced algorithms and models such as BEV perception, VLM (Vision-Language Model), diffusion models, reinforcement learning, and world models [2]

Course Offerings
- The company is launching two specialized courses, "End-to-End and VLA Autonomous Driving Class" and "Practical Course on VLA and Large Models," aimed at helping individuals quickly and efficiently enter the field of end-to-end and VLA technologies [2]
- The "Practical Course on VLA and Large Models" focuses on VLA, covering topics from VLM as an autonomous driving interpreter to modular and integrated VLA, including mainstream inference-enhanced VLA [2]
- The course includes a detailed theoretical foundation and practical assignments, teaching participants how to build their own VLA models and datasets from scratch [2]

Instructor Team
- The instructor team consists of experts from both academia and industry, including individuals with extensive research and practical experience in multi-modal perception, autonomous driving VLA, and large model frameworks [7][10][13]
- Notable instructors include a Tsinghua University master's graduate with multiple publications in top conferences and a current algorithm expert at a leading domestic OEM [7][13]

Target Audience
- The courses are designed for individuals with a foundational knowledge of autonomous driving, who are familiar with its basic modules and have a grasp of concepts related to transformer large models, reinforcement learning, and BEV perception [15]
- Participants are expected to have a background in probability theory and linear algebra, as well as proficiency in Python and PyTorch [15]
We Put Together an End-to-End Advancement Roadmap, Geared Toward Deployment and Job Hunting......
自动驾驶之心· 2025-11-18 00:05
Core Insights
- There is a significant demand for end-to-end and VLA (Vision-Language-Action) technical talent in the automotive industry, with salaries for experts reaching up to $70,000 per month for positions requiring 3-5 years of experience [1]
- The technology stack for end-to-end and VLA is complex, involving various advanced algorithms such as BEV perception, Vision-Language Models (VLM), diffusion models, reinforcement learning, and world models [1]
- The company is offering specialized courses to help individuals quickly and efficiently learn about end-to-end and VLA technologies, collaborating with experts from both academia and industry [1]

Course Offerings
- The "End-to-End and VLA Autonomous Driving Course" focuses on the macro aspects of end-to-end autonomous driving, covering key algorithms and theoretical foundations, including BEV perception, large language models, diffusion models, and reinforcement learning [10]
- The "Autonomous Driving VLA and Large Model Practical Course" is led by academic experts and covers VLA from the perspective of VLM as an autonomous driving interpreter, modular VLA, and current mainstream inference-enhanced VLA [1][10]
- Both courses include practical components, such as building a VLA model and dataset from scratch, and implementing algorithms like the Diffusion Planner and the ORION algorithm [10][12]

Instructor Profiles
- The instructors include experienced professionals and researchers from top institutions, such as Tsinghua University and QS30 universities, with backgrounds in multimodal perception, autonomous driving VLA, and large model frameworks [6][9][12]
- Instructors have published numerous papers in prestigious conferences and have hands-on experience developing and deploying advanced algorithms in the field of autonomous driving [6][9][12]

Target Audience
- The courses are designed for individuals with a foundational knowledge of autonomous driving, who are familiar with its basic modules and with concepts related to transformer large models, reinforcement learning, and BEV perception [14]
- Participants are expected to have a background in probability theory and linear algebra, as well as proficiency in Python and PyTorch [14]
RAE + VAE? Pre-trained Representations Power Diffusion-Model Tokenizers, Accelerating the Move from Pixel Compression to Semantic Extraction
机器之心· 2025-11-13 10:03
Core Insights
- The article discusses the introduction of RAE (Diffusion Transformers with Representation Autoencoders) and VFM-VAE by Xi'an Jiaotong University and Microsoft Research Asia, which utilize "frozen pre-trained visual representations" to enhance the performance of diffusion models in generating images [2][6][28].

Group 1: VFM-VAE Overview
- VFM-VAE combines the probabilistic modeling mechanism of VAE with RAE, systematically studying the impact of compressed pre-trained visual representations on the structure and performance of LDM systems [2][6].
- The integration of frozen foundational visual models as tokenizers in VFM-VAE significantly accelerates model convergence and improves generation quality, marking an evolution from pixel compression to semantic representation (a minimal sketch of such a tokenizer follows this list) [2][6].

Group 2: Performance Analysis
- Experimental results indicate that distillation-based tokenizers struggle with semantic alignment under perturbations, while maintaining high consistency between the latent space and foundational visual model features is crucial for robustness and convergence efficiency [8][19].
- VFM-VAE demonstrates superior performance and training efficiency, achieving a gFID of 3.80 on ImageNet 256×256, outperforming the distillation route's 5.14, and reaching a gFID of 2.22 with explicit alignment in just 80 epochs, improving training efficiency by approximately 10 times [23][24].

Group 3: Semantic Representation and Alignment
- The research team introduced the SE-CKNNA metric to quantify the consistency between the latent space and foundational visual model features, which is essential for evaluating the impact on subsequent generation performance [7][19].
- VFM-VAE maintains a higher average and peak CKNNA score compared to distillation-based tokenizers, indicating a more stable alignment of the latent space with foundational visual model features [19][21].

Group 4: Future Directions
- The article concludes with the potential for further exploration of latent space in multimodal generation and complex visual understanding, aiming to transition from pixel compression to semantic representation [29].
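The core idea, a frozen vision foundation model feeding a lightweight VAE head, can be sketched in a few lines of PyTorch. This is an illustrative guess at the shape of such a tokenizer, not the released VFM-VAE code; `vfm` stands in for any frozen backbone that returns patch features.

```python
import torch
import torch.nn as nn

class VFMVAEEncoder(nn.Module):
    """Hypothetical sketch of a VFM-VAE-style tokenizer: a frozen pre-trained
    vision foundation model supplies semantic features, and a light head maps
    them to the mean/log-variance of a VAE latent used by the diffusion model."""
    def __init__(self, vfm, feat_dim=768, latent_dim=32):
        super().__init__()
        self.vfm = vfm.eval()
        for p in self.vfm.parameters():      # keep the foundation model frozen
            p.requires_grad_(False)
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)

    def forward(self, images):
        with torch.no_grad():
            feats = self.vfm(images)         # (B, N_patches, feat_dim)
        mu, logvar = self.to_mu(feats), self.to_logvar(feats)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return z, mu, logvar
```

Because the backbone stays frozen, the latent space inherits the backbone's semantics; the CKNNA-style alignment scores discussed above essentially measure how much of that inheritance survives compression.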
Quick News | Stanford Professor Founds Inception, Raising a $50 Million Seed Round to Unlock Real-Time AI Applications with Diffusion Models
Z Potentials· 2025-11-07 02:12
Core Insights
- The article discusses the current surge of funding into AI startups, highlighting it as a golden period for AI researchers to validate their ideas [1]
- Inception, a startup developing diffusion-based AI models, recently raised $50 million in seed funding, led by Menlo Ventures with participation from several notable investors [2]

Company Overview
- Inception is focused on developing diffusion models, which generate outputs through iterative refinement rather than sequential generation [3]
- The project leader, Stefano Ermon, was researching diffusion models before the recent AI boom and aims to apply these models to a broader range of tasks [3]

Technology and Innovation
- Inception has released a new version of its Mercury model, designed specifically for software development, which has been integrated into various development tools [3]
- Ermon claims that diffusion-based models will significantly optimize two critical metrics, latency and computational cost, stating that these models are faster and more efficient than those built by other companies [3][5]
- Diffusion models differ structurally from the autoregressive models that dominate text-based AI services, and are believed to perform better when handling large volumes of text or operating under data constraints [5]

Performance Metrics
- Diffusion models exhibit greater flexibility in hardware utilization, which is increasingly important as AI's infrastructure demands grow [5]
- Ermon's benchmarks indicate that the models can process over 1,000 tokens per second, surpassing existing autoregressive technologies thanks to their inherent support for parallel processing (illustrated in the toy sketch after this list) [5]
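The latency argument comes down to a simple structural difference: an autoregressive model makes one forward pass per token, while a diffusion model makes one pass per refinement step over the whole sequence. The toy loop below illustrates that contrast under assumed interfaces; `denoiser` is hypothetical and has nothing to do with Mercury's actual API.

```python
import torch

def diffusion_decode(denoiser, vocab_size, seq_len=64, steps=8):
    """Toy sketch of why diffusion LMs can be fast: every position in the
    sequence is refined in parallel at each step, so the number of model
    calls is the (small) step count rather than the sequence length."""
    # start from random scores over the vocabulary at every position
    logits = torch.randn(1, seq_len, vocab_size)
    for _ in range(steps):
        tokens = logits.argmax(-1)      # current best guess, all positions
        logits = denoiser(tokens)       # refine the whole sequence at once
    return logits.argmax(-1)            # 8 model calls vs. 64 for AR decoding
```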
Shanghai AI Lab Releases Hybrid Diffusion Language Model SDAR: The First Open-Source Diffusion Language Model to Break 6,600 tgs
机器之心· 2025-11-01 04:22
Core Insights
- The article introduces a new paradigm called SDAR (Synergistic Diffusion-AutoRegression) that addresses the slow inference speed and high costs associated with large model applications, which are primarily due to the serial nature of autoregressive (AR) models [2][3][4].

Group 1: SDAR Paradigm
- SDAR effectively decouples training and inference, combining the high performance of AR models with the parallel inference advantages of diffusion models, allowing for low-cost transformation of any AR model into a parallel decoding model (see the toy decoding loop after this list) [4][11].
- Experimental results show that SDAR not only matches but often surpasses the performance of original AR models across multiple benchmarks, achieving up to a 12.3 percentage point advantage in complex scientific reasoning tasks [6][28].

Group 2: Performance and Efficiency
- SDAR maintains the performance of AR models while significantly improving inference speed and reducing costs, demonstrating that larger models benefit more from parallelization without sacrificing performance [17][19].
- The research indicates that SDAR can be adapted to any mainstream AR model at a low cost, achieving comparable or superior performance in downstream tasks [19][29].

Group 3: Experimental Validation
- The study conducted rigorous experiments comparing SDAR's performance with AR models, confirming that SDAR can achieve substantial speed improvements in real-world applications, with SDAR-8B-chat showing a 2.3 times acceleration over its AR counterpart [23][20].
- The results highlight that SDAR's unique generation mechanism does not compromise its complex reasoning capabilities, retaining long-chain reasoning abilities and excelling in tasks requiring understanding of structured information [28][29].

Group 4: Future Implications
- SDAR represents a significant advancement in the field of large models, providing a powerful and flexible tool that lowers application barriers and opens new avenues for exploring higher performance and efficiency in AI reasoning paradigms [29][31].
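One way to picture the diffusion-autoregression synergy is block-level autoregression with within-block parallel denoising. The loop below is a toy sketch of that idea under assumed interfaces (a `denoiser` returning per-position logits is hypothetical, not SDAR's actual API):

```python
import torch

def block_parallel_decode(denoiser, prompt_ids, num_blocks=4, block_len=16, steps=4):
    """Toy sketch of synergistic diffusion-autoregression: blocks are generated
    left-to-right (autoregressive across blocks), but the tokens inside each
    block are refined in parallel over a few denoising steps."""
    seq = prompt_ids                                        # (1, prompt_len), long
    for _ in range(num_blocks):
        block = torch.zeros(1, block_len, dtype=torch.long)  # blank/masked draft
        for _ in range(steps):
            logits = denoiser(torch.cat([seq, block], dim=1))  # context + draft
            block = logits[:, -block_len:].argmax(-1)          # refine in parallel
        seq = torch.cat([seq, block], dim=1)                 # commit, move on
    return seq
```

The model-call count per block is the step count, not the block length, which is where the throughput gain over token-by-token AR decoding comes from.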
Another Path for Visual Generation: Principles and Practice of the Infinity Autoregressive Architecture
AI前线· 2025-10-31 05:42
Core Insights
- The article discusses the significant advancements in visual autoregressive models, particularly highlighting the potential of these models in the context of AI-generated content (AIGC) and their competitive edge against diffusion models [2][4][11].

Group 1: Visual Autoregressive Models
- Visual autoregressive models (VAR) utilize a "coarse-to-fine" approach, starting with low-resolution images and progressively refining them to high-resolution outputs, which aligns more closely with human visual perception (sketched after this list) [12][18].
- The VAR model architecture includes an improved VQ-VAE that employs a hierarchical structure, allowing for efficient encoding and reconstruction of images while minimizing token usage [15][30].
- VAR has demonstrated superior image generation quality compared to existing models like DiT, showcasing a robust scaling curve that indicates performance improvements with increased model size and computational resources [18][49].

Group 2: Comparison with Diffusion Models
- Diffusion models operate by adding Gaussian noise to images and then training a network to reverse this process, maintaining the original resolution throughout [21][25].
- The key advantages of VAR over diffusion models include higher training parallelism and a more intuitive process that mimics human visual cognition, although diffusion models can correct errors through iterative refinement [27][29].
- VAR's approach allows for faster inference times, with the Infinity model achieving significant speed improvements over comparable diffusion models [46][49].

Group 3: Innovations in Tokenization and Error Correction
- The Infinity framework introduces a novel "bitwise tokenizer" that enhances reconstruction quality while allowing for a larger vocabulary size, thus improving detail and instruction adherence in generated images [31][41].
- A self-correction mechanism is integrated into the training process, enabling the model to learn from previous errors and significantly reducing cumulative error during inference [35][40].
- The findings indicate that larger models benefit from larger vocabularies, reinforcing the reliability of scaling laws in model performance [41][49].
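The "coarse-to-fine" loop at the heart of VAR-style generation can be sketched abstractly as follows, with `predictor` standing in for the transformer that emits a whole token map per scale; the names, scale schedule, and tensor shapes are illustrative assumptions, not Infinity's real interface.

```python
import torch
import torch.nn.functional as F

def next_scale_generate(predictor, scales=(4, 8, 16, 32), dim=16):
    """Hypothetical sketch of next-scale autoregressive generation: the model
    predicts an entire token map at each resolution, conditioned on the
    upsampled map from the coarser scale, instead of one token at a time."""
    maps = []
    for s in scales:
        if maps:
            # condition on the previous scale, upsampled to this resolution
            cond = F.interpolate(maps[-1], size=(s, s), mode="nearest")
        else:
            cond = torch.zeros(1, dim, s, s)    # empty start at the coarsest scale
        maps.append(predictor(cond))            # emit all s*s positions in parallel
    return maps[-1]                             # finest-scale token map
```

Autoregression here runs across scales rather than across pixels, so each step is fully parallel within a scale, which is the source of the inference-speed advantage Infinity reports over diffusion baselines.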