Diffusion Models
NeurIPS 2025 awards announced: Qwen wins Best Paper, Faster R-CNN takes the Test of Time Award
机器之心· 2025-11-27 03:00
Core Insights
- The NeurIPS 2025 conference awarded four Best Paper awards and three Best Paper Runner-up awards, highlighting significant advances across AI research areas [1][4].

Group 1: Best Papers
- Paper 1: "Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)" examines the limited diversity of large language model outputs and introduces Infinity-Chat, a dataset of 26,000 diverse user queries for studying model diversity [5][6][9].
- Paper 2: "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free" reveals how gated attention mechanisms affect model performance and stability, demonstrating significant improvements in the Qwen3-Next model (a minimal sketch follows this entry) [11][16].
- Paper 3: "1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities" shows that scaling network depth to 1024 layers enables new goal-reaching capabilities in self-supervised reinforcement learning, with performance gains of 2x to 50x [17][18].
- Paper 4: "Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training" identifies the mechanisms that keep diffusion models from memorizing training data, establishing a link between training dynamics and generalization [19][21][22].

Group 2: Best Paper Runner-Up
- Paper 1: "Optimal Mistake Bounds for Transductive Online Learning" resolves a 30-year-old open problem in learning theory, establishing optimal mistake bounds for transductive online learning [28][30][31].
- Paper 2: "Superposition Yields Robust Neural Scaling" argues that representation superposition is the primary mechanism governing neural scaling laws, supported by multiple experiments [32][34].

Group 3: Special Awards
- The Test of Time Award went to "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," recognized for its foundational impact on modern object detection frameworks since its publication in 2015 [36][40].
- The Sejnowski-Hinton Prize was awarded for "Random synaptic feedback weights support error backpropagation for deep learning," which contributed significantly to understanding biologically plausible learning rules in neural networks [43][46][50].
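To make the gated-attention idea from Paper 2 concrete, here is a minimal PyTorch sketch of one common formulation: a sigmoid output gate, computed from the layer input, that scales each head's attention output before the final projection. The module layout, gate placement, and names are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Multi-head self-attention with a sigmoid output gate.

    Hypothetical minimal sketch: a gate computed from the layer input
    modulates each attention output channel before the output projection,
    one way to introduce the non-linearity and sparsity the paper studies.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(d_model, d_model)  # one gate value per channel
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, n_heads, T, d_head) for attention
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, C)
        # the sigmoid gate lets a head emit near-zero output for a token,
        # removing the pressure to park attention mass on a "sink" token
        out = torch.sigmoid(self.gate(x)) * out
        return self.proj(out)
```

A quick smoke test under these assumptions: `GatedAttention(512, 8)(torch.randn(2, 16, 512))` returns a `(2, 16, 512)` tensor.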
Enrollment opening soon! A small-group course on production-oriented end-to-end autonomous driving, to help you land senior algorithm roles
自动驾驶之心· 2025-11-27 00:04
Core Viewpoint
- The article emphasizes the importance of production-grade end-to-end autonomous driving in the automotive industry, highlighting the scarcity of qualified talent and the need for comprehensive training programs to address the field's challenges [1][3].

Group 1: Course Overview
- The course covers the essential algorithms for production end-to-end systems, including one-stage and two-stage frameworks, reinforcement learning applications, and trajectory optimization [3][9].
- It aims to provide practical experience and insight into production challenges, focusing on real-world applications and expert guidance [3][6].

Group 2: Course Structure
- The course consists of eight chapters, each addressing a different aspect of production end-to-end systems, such as the task overview, algorithm frameworks, uses of navigation information, and trajectory output optimization [9][10][11][12][13][14][15][16].
- The final chapter shares production experience from several perspectives, including data, models, and strategies for system improvement [16].

Group 3: Target Audience and Requirements
- The course targets advanced learners with a background in autonomous driving, reinforcement learning, and programming, though those with weaker foundations can still participate [17][18].
- Participants need access to a GPU meeting the recommended specifications and familiarity with the relevant algorithms and programming languages [18].
DiffRefiner, a Zhejiang University paper accepted to AAAI'26: a two-stage trajectory prediction framework that sets a new NAVSIM record!
自动驾驶之心· 2025-11-25 00:03
Editor | 自动驾驶之心  Paper authors | Liuhan Yin et al.

Unlike discriminative approaches in autonomous driving that predict over a fixed set of candidate ego trajectories, generative approaches such as diffusion models can learn the underlying distribution of future motion, enabling more flexible trajectory prediction. However, because these methods typically denoise from hand-designed trajectory anchors or from random noise, there is still considerable room for improvement.

A team from Zhejiang University and Nullmax proposes DiffRefiner, a new two-stage trajectory prediction framework (sketched in code below). The first stage uses a Transformer-based proposal decoder that regresses over sensor inputs and produces coarse trajectory predictions from predefined trajectory anchors; the second stage introduces a diffusion refiner that iteratively denoises and refines the initial predictions. By incorporating a discriminative trajectory proposal module, the framework provides strong guidance for the generative refinement process, significantly improving diffusion-based planning performance. The authors also design a fine-grained denoising decoder to enhance scene adaptability, strengthening alignment with the surrounding environment for more accurate trajectory prediction. Experiments show that DiffRefiner achieves state-of-the-art performance, reaching 87.4 on the NAVSIM v2 dataset ...
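The predict-then-refine structure is easy to see in code. The sketch below is a rough approximation under assumed shapes: a regression head over predefined anchors stands in for stage one, and an iterative denoising loop over a residual stands in for stage two. Every module name, the single-anchor shortcut, and the crude Euler update are hypothetical, not the DiffRefiner implementation.

```python
import torch
import torch.nn as nn

class TwoStageRefinerSketch(nn.Module):
    """Hypothetical sketch of a two-stage predict-then-refine planner."""

    def __init__(self, d_model=256, horizon=8, n_anchors=64):
        super().__init__()
        # predefined trajectory anchors: (n_anchors, horizon, 2) xy waypoints
        self.anchors = nn.Parameter(torch.randn(n_anchors, horizon, 2),
                                    requires_grad=False)
        self.proposal_head = nn.Linear(d_model, horizon * 2)  # anchor offsets
        self.denoiser = nn.Sequential(   # stands in for the diffusion refiner
            nn.Linear(horizon * 2 + d_model + 1, d_model),
            nn.ReLU(),
            nn.Linear(d_model, horizon * 2),
        )
        self.horizon = horizon

    @torch.no_grad()
    def forward(self, scene_feat: torch.Tensor, n_steps: int = 10):
        # Stage 1: coarse proposal = anchor + regressed offset
        # (real systems score many anchors; one is used here for brevity)
        offsets = self.proposal_head(scene_feat).view(-1, self.horizon, 2)
        proposal = self.anchors[0] + offsets
        # Stage 2: iteratively denoise a residual, conditioned on the scene
        x = torch.randn_like(proposal)
        for i in reversed(range(n_steps)):
            t = torch.full((scene_feat.shape[0], 1), i / n_steps)
            inp = torch.cat([x.flatten(1), scene_feat, t], dim=-1)
            x = x - self.denoiser(inp).view_as(x) / n_steps  # crude Euler step
        return proposal + x
```

Calling `TwoStageRefinerSketch()(torch.randn(4, 256))` yields four refined `(8, 2)` trajectories; the point is only to show how the discriminative proposal anchors the generative refinement.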
Why have robots collectively given up on "parkour" and all switched to "folding clothes"?
机器人大讲堂· 2025-11-24 15:00
Core Viewpoint
- The robotics industry has shifted its focus from showcasing extreme capabilities, such as parkour and dancing, to addressing practical household tasks like folding clothes, signaling a maturing market responding to real consumer needs [3][7][27].

Group 1: Industry Trends
- The initial excitement around robotics centered on impressive demonstrations of movement and balance, which attracted capital and interest in the early stages of the technology [27].
- The current trend shows a decisive pivot toward practical applications, with companies now prioritizing user needs over sheer technical prowess [27][30].
- The emergence of clothes-folding robots reflects a convergence of technological progress and market demand, as folding clothes has become a more relatable and desirable function for consumers [9][15].

Group 2: Technological Advancements
- Breakthroughs in robot learning, such as diffusion models and zero-shot learning, let robots learn tasks like folding clothes from human demonstrations without extensive programming (a toy training-step sketch follows this summary) [13].
- Lower technical barriers have allowed startups to build functional demonstrations on top of pre-trained models, making the technology more accessible [13][15].
- Despite this progress, current demonstrations still reveal limits in precision and adaptability, indicating that further improvements in algorithms and hardware are needed [29][30].

Group 3: Market Demand and Consumer Expectations
- Consumers strongly want robots that can perform household tasks, and many are willing to pay for solutions that take over mundane chores like folding clothes [15][26].
- A significant gap remains between what companies claim their robots can do and the performance and reliability consumers expect [24][26].
- Current demonstrations often skip the full scope of the household task, focusing on the folding motion alone without integrating the whole process from retrieval to storage [24][30].

Group 4: Future Directions
- The industry must keep focusing on practical applications and user needs to reach commercial viability, moving beyond mere technical demonstrations [30].
- As the technology matures, robots could expand to a wider range of household tasks, provided development stays aligned with consumer demand [29][30].
- The shift toward practical applications signals a more rational approach to robotics, one that values solving real-world problems over showcasing extreme capabilities [30].
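As referenced in Group 2, diffusion-based imitation learning is what lets a robot acquire a folding behavior from demonstrations rather than hand-coded routines. The toy sketch below shows the core training step of such a policy, assuming a flat observation embedding, fixed-length action chunks, and a simple linear noising path for brevity; none of it is any vendor's actual stack.

```python
import torch
import torch.nn as nn

class DiffusionPolicySketch(nn.Module):
    """Toy sketch of learning actions from demonstrations with diffusion."""

    def __init__(self, obs_dim=64, act_dim=14, chunk=16):
        super().__init__()
        # a single MLP denoiser is an illustrative stand-in for a real policy net
        self.denoiser = nn.Sequential(
            nn.Linear(obs_dim + act_dim * chunk + 1, 256),
            nn.ReLU(),
            nn.Linear(256, act_dim * chunk),
        )

    def loss(self, obs, actions):
        """obs: (B, obs_dim); actions: demonstrated chunk (B, chunk, act_dim)."""
        B = obs.shape[0]
        t = torch.rand(B, 1)                 # diffusion time in [0, 1)
        noise = torch.randn_like(actions)
        # linear noising path between the demonstration and pure noise
        noisy = (1 - t)[..., None] * actions + t[..., None] * noise
        inp = torch.cat([obs, noisy.flatten(1), t], dim=-1)
        pred = self.denoiser(inp).view_as(noise)
        return ((pred - noise) ** 2).mean()  # predict the injected noise
```

Training repeats this step over demonstration data; at deployment the learned denoiser is run in reverse from noise to produce an action chunk.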
NeurIPS 2025 | UniLumos: a unified image-and-video relighting framework with physical feedback, delivering realistic relighting at a 20x speedup!
机器之心· 2025-11-24 09:30
Core Insights
- The article covers advances in image and video relighting, focusing on UniLumos, a unified framework that improves both physical consistency and computational efficiency in relighting tasks [3][37].

Group 1: Challenges in Existing Methods
- Current diffusion-based methods face two fundamental challenges: a lack of physical consistency and an inadequate evaluation system for relighting quality [11][12].
- Existing approaches often optimize in semantic latent spaces, producing physical inconsistencies such as misaligned shadows, overexposed highlights, and incorrect occlusions [15][11].

Group 2: Introduction of UniLumos
- UniLumos is introduced as a solution to these challenges: a unified image-and-video relighting framework that preserves scene structure and temporal consistency while delivering high-quality relighting [17][37].
- The framework incorporates geometric feedback in RGB space, such as depth and normal maps, to align lighting effects with scene structure, significantly improving physical consistency (a toy sketch of such a feedback loss follows this summary) [4][22].

Group 3: Innovations and Methodology
- Key innovations include the geometric feedback mechanism for physical consistency and a structured six-dimensional lighting description for fine-grained control and evaluation of lighting effects [18][22].
- The training set, LumosData, is built to extract high-quality relighting samples from real-world videos, supporting model training [20][21].

Group 4: Performance and Efficiency
- UniLumos achieves state-of-the-art results across metrics for visual fidelity, temporal consistency, and physical accuracy compared with baseline models [27][28].
- The framework reaches a 20x speedup in inference while maintaining high-quality output, making it significantly more efficient than existing methods [33][38].

Group 5: Evaluation and Results
- The LumosBench evaluation framework enables automated, interpretable assessment of relighting accuracy across six dimensions, showcasing UniLumos's fine-grained control over lighting attributes [22][29].
- Qualitative results show that UniLumos produces more realistic lighting and better temporal consistency than baseline methods [31][33].
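The geometric feedback mechanism can be illustrated with a small loss function: geometry predicted from the relit frame is pushed toward geometry predicted from the source frame, so illumination may change while structure may not. In the sketch below, `depth_net` and `normal_net` are assumed frozen off-the-shelf geometry predictors, not UniLumos's actual modules, and the loss weighting is arbitrary.

```python
import torch
import torch.nn.functional as F

def geometric_feedback_loss(relit, source, depth_net, normal_net,
                            w_d=1.0, w_n=1.0):
    """Hypothetical sketch of RGB-space geometric feedback.

    relit, source: (B, 3, H, W) frames; depth_net maps frames to (B, 1, H, W)
    depth, normal_net to (B, 3, H, W) normals. Both are assumed frozen.
    """
    with torch.no_grad():  # geometry targets come from the source frame
        d_tgt, n_tgt = depth_net(source), normal_net(source)
    d_rel, n_rel = depth_net(relit), normal_net(relit)
    loss_d = F.l1_loss(d_rel, d_tgt)
    # normals compared by per-pixel cosine similarity over the channel dim
    loss_n = (1 - F.cosine_similarity(n_rel, n_tgt, dim=1)).mean()
    return w_d * loss_d + w_n * loss_n
```

Adding such a term to the relighting objective penalizes edits that warp geometry, which is the physical-consistency failure mode the article describes.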
University of Notre Dame team builds a new tool for molecular design: letting AI create molecules the way it writes prose
仪器信息网· 2025-11-19 09:08
Core Insights
- The article discusses DemoDiff, a breakthrough AI system from a University of Notre Dame team that designs new molecular structures by learning from a handful of examples, significantly accelerating drug and materials development [7][8][10].

Group 1: AI Understanding of Molecular Design
- DemoDiff mimics human learning by analyzing a few successful molecular examples to infer design patterns, allowing it to generate new candidates quickly [10][11].
- The system can even learn from negative examples, generating high-quality molecules from poorly performing ones, showcasing advanced reasoning [21][22].

Group 2: Innovative Molecular Representation
- The team introduced "node-pair encoding," which compresses complex molecular structures and improves efficiency by 5.5x (a toy merge-loop sketch follows this summary) [9][12].
- The method sharply reduces the number of nodes needed to describe a molecule, letting the model process more examples [12][13].

Group 3: Comprehensive Molecular Database
- DemoDiff was trained on an extensive database of over 1 million molecular structures and 155,000 distinct molecular properties [14][15].
- Sources include the ChEMBL database, which records millions of drug molecules and their biological activities [14][15].

Group 4: Diffusion Model for Molecular Generation
- DemoDiff's core technology is a diffusion model that generates molecular structures through progressive refinement, helping ensure chemical validity [16][17].
- The model incorporates in-context learning, adapting its output to different sets of example molecules [18].

Group 5: Performance Testing and Validation
- DemoDiff was tested rigorously across 33 molecular design tasks, matching the performance of much larger AI models [19][20].
- It excels at generating diverse molecular structures, giving researchers multiple options for further exploration [20].

Group 6: Negative Learning Capability
- Learning from negative examples lets the AI infer what makes a molecule succeed, enhancing its design capabilities [21][22].
- This is particularly valuable in early drug development, where researchers often have more negative examples than positive ones [21][22].

Group 7: Technical Innovations
- A graph attention mechanism lets the model attend to several important parts of a molecule at once, maintaining a holistic view during generation [23].
- A multi-layer validation mechanism checks generated molecules against fundamental chemical rules to ensure feasibility [23][24].

Group 8: Implications for Molecular Design
- DemoDiff represents a paradigm shift in molecular design, with the potential to cut the time and cost of drug development significantly [25][26].
- The technology may democratize molecular design, letting a broader range of researchers participate in innovation [26].

Group 9: Future Considerations
- Despite its impressive capabilities, further improvement is needed, particularly on certain specialized design tasks [27].
- Future work may scale up the model and improve data quality to tackle more complex challenges [27][28].
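Node-pair encoding is described as collapsing frequent atom pairs into single tokens, much like byte-pair encoding for text. The toy sketch below runs that merge loop over linearized atom sequences rather than full molecular graphs, purely to illustrate the idea; it is not DemoDiff's algorithm.

```python
from collections import Counter

def node_pair_encoding(graphs, n_merges):
    """BPE-style merging over linearized atom sequences (toy illustration).

    graphs: list of token lists, e.g. [["C", "C", "O"], ["C", "C", "N"]].
    Repeatedly finds the most frequent adjacent pair across all molecules
    and replaces it with a single merged token, shrinking each sequence.
    """
    vocab = {}
    for _ in range(n_merges):
        pairs = Counter(p for g in graphs for p in zip(g, g[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = f"{a}{b}"
        vocab[(a, b)] = merged
        new_graphs = []
        for g in graphs:
            out, i = [], 0
            while i < len(g):
                if i + 1 < len(g) and (g[i], g[i + 1]) == (a, b):
                    out.append(merged)   # collapse the frequent pair
                    i += 2
                else:
                    out.append(g[i])
                    i += 1
            new_graphs.append(out)
        graphs = new_graphs
    return graphs, vocab

# e.g. node_pair_encoding([["C", "C", "O"], ["C", "C", "N"]], 1)
# merges the frequent ("C", "C") pair into a single "CC" token
```

Fewer tokens per molecule is what lets more demonstration molecules fit into the model's context, which is the efficiency gain the article attributes to this representation.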
End-to-end and VLA roles are paying absurdly high salaries......
自动驾驶之心· 2025-11-19 00:03
Core Insights
- There is significant demand for end-to-end and VLA (Vision-Language-Action) technical talent in the automotive industry, with expert salaries reaching up to 70,000 RMB per month for positions requiring 3-5 years of experience [1].
- The technology stack for end-to-end and VLA is complex, spanning advanced algorithms and models such as BEV perception, VLMs (Vision-Language Models), diffusion models, reinforcement learning, and world models [2].

Course Offerings
- The company is launching two specialized courses, "End-to-End and VLA Autonomous Driving Class" and "Practical Course on VLA and Large Models," to help individuals enter the end-to-end and VLA field quickly and efficiently [2].
- The "Practical Course on VLA and Large Models" focuses on VLA, covering topics from VLM as an autonomous-driving interpreter to modular and integrated VLA, including mainstream inference-enhanced VLA [2].
- The course pairs a detailed theoretical foundation with practical assignments, teaching participants how to build their own VLA models and datasets from scratch [2].

Instructor Team
- The instructor team combines experts from academia and industry with extensive research and hands-on experience in multi-modal perception, autonomous-driving VLA, and large-model frameworks [7][10][13].
- Notable instructors include a Tsinghua University master's graduate with multiple publications at top conferences and a current algorithm expert at a leading domestic OEM [7][13].

Target Audience
- The courses are designed for individuals with foundational knowledge of autonomous driving who are familiar with its basic modules and with concepts from transformer-based large models, reinforcement learning, and BEV perception [15].
- Participants are expected to know probability theory and linear algebra and to be proficient in Python and PyTorch [15].
We put together an advanced end-to-end roadmap, geared toward production deployment and job hunting......
自动驾驶之心· 2025-11-18 00:05
Core Insights
- There is significant demand for end-to-end and VLA (Vision-Language-Action) technical talent in the automotive industry, with expert salaries reaching up to 70,000 RMB per month for positions requiring 3-5 years of experience [1].
- The technology stack for end-to-end and VLA is complex, involving advanced algorithms such as BEV perception, Vision-Language Models (VLM), diffusion models, reinforcement learning, and world models [1].
- The company offers specialized courses, built with experts from academia and industry, to help individuals learn end-to-end and VLA technologies quickly and efficiently [1].

Course Offerings
- The "End-to-End and VLA Autonomous Driving Course" covers the macro landscape of end-to-end autonomous driving, including key algorithms and theoretical foundations such as BEV perception, large language models, diffusion models, and reinforcement learning [10].
- The "Autonomous Driving VLA and Large Model Practical Course" is led by academic experts and covers VLA from VLM as an autonomous-driving interpreter through modular VLA to today's mainstream inference-enhanced VLA [1][10].
- Both courses include practical components, such as building a VLA model and dataset from scratch and implementing algorithms like the Diffusion Planner and the ORION algorithm [10][12].

Instructor Profiles
- Instructors include experienced professionals and researchers from top institutions, such as Tsinghua University and QS-top-30 universities, with backgrounds in multimodal perception, autonomous-driving VLA, and large-model frameworks [6][9][12].
- They have published numerous papers at prestigious conferences and have hands-on experience developing and deploying advanced autonomous-driving algorithms [6][9][12].

Target Audience
- The courses are designed for individuals with foundational knowledge of autonomous driving who are familiar with its basic modules and with concepts from transformer-based large models, reinforcement learning, and BEV perception [14].
- Participants are expected to know probability theory and linear algebra and to be proficient in Python and PyTorch [14].
RAE+VAE? Pre-trained representations power diffusion model tokenizers, accelerating the move from pixel compression to semantic extraction
机器之心· 2025-11-13 10:03
Core Insights
- The article discusses RAE (Representation Autoencoders) and VFM-VAE, the latter introduced by Xi'an Jiaotong University and Microsoft Research Asia, both of which use frozen pre-trained visual representations to improve the image-generation performance of diffusion models [2][6][28].

Group 1: VFM-VAE Overview
- VFM-VAE combines the probabilistic modeling mechanism of a VAE with the RAE idea, systematically studying how compressed pre-trained visual representations affect the structure and performance of LDM systems [2][6].
- Using frozen visual foundation models as tokenizers in VFM-VAE significantly accelerates model convergence and improves generation quality, marking an evolution from pixel compression to semantic representation [2][6].

Group 2: Performance Analysis
- Experimental results indicate that distillation-based tokenizers struggle with semantic alignment under perturbations, while maintaining high consistency between the latent space and the visual foundation model's features is crucial for robustness and convergence efficiency [8][19].
- VFM-VAE demonstrates superior performance and training efficiency, achieving a gFID of 3.80 on ImageNet 256x256 versus the distillation route's 5.14, and reaching a gFID of 2.22 with explicit alignment in just 80 epochs, roughly a 10x improvement in training efficiency [23][24].

Group 3: Semantic Representation and Alignment
- The research team introduced the SE-CKNNA metric to quantify the consistency between the latent space and the visual foundation model's features, which is essential for assessing the impact on downstream generation performance (a simplified neighbor-overlap version is sketched after this summary) [7][19].
- VFM-VAE maintains higher average and peak CKNNA scores than distillation-based tokenizers, indicating a more stable alignment of the latent space with the foundation model's features [19][21].

Group 4: Future Directions
- The article concludes with the potential for further exploration of the latent space in multimodal generation and complex visual understanding, continuing the transition from pixel compression to semantic representation [29].
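The alignment idea behind SE-CKNNA can be approximated with a much simpler statistic: how often two representation spaces agree on which samples are nearest neighbors. The sketch below computes that mutual k-NN overlap; it is a reduced stand-in for, not the definition of, the paper's metric.

```python
import torch
import torch.nn.functional as F

def mutual_knn_alignment(feats_a: torch.Tensor,
                         feats_b: torch.Tensor,
                         k: int = 10) -> float:
    """Simplified alignment score in the spirit of CKNNA.

    feats_a, feats_b: (N, D_a) and (N, D_b) features for the SAME N samples
    in two spaces (e.g. a tokenizer latent vs. a frozen VFM embedding).
    Returns the mean fraction of shared k-nearest neighbors per sample.
    """
    def knn_indices(feats):
        x = F.normalize(feats, dim=-1)
        sim = x @ x.T                         # cosine similarity matrix
        sim.fill_diagonal_(float("-inf"))     # exclude each sample itself
        return sim.topk(k, dim=-1).indices    # (N, k)

    idx_a, idx_b = knn_indices(feats_a), knn_indices(feats_b)
    overlap = [len(set(a.tolist()) & set(b.tolist())) / k
               for a, b in zip(idx_a, idx_b)]
    return float(sum(overlap) / len(overlap))
```

A score near 1.0 means the latent space preserves the foundation model's neighborhood structure, the property the article links to robustness and faster convergence.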
Express | Stanford professor's startup Inception raises a $50M seed round, using diffusion models to unlock real-time AI applications
Z Potentials· 2025-11-07 02:12
Core Insights
- The article frames the current surge of funding into AI startups as a golden period for AI researchers to validate their ideas [1].
- Inception, a startup developing diffusion-based AI models, recently raised $50 million in seed funding, led by Menlo Ventures with participation from several notable investors [2].

Company Overview
- Inception focuses on diffusion models, which generate outputs through iterative refinement rather than sequential, token-by-token generation [3].
- Project lead Stefano Ermon researched diffusion models before the recent AI boom and aims to apply them to a broader range of tasks [3].

Technology and Innovation
- Inception has released a new version of its Mercury model, designed specifically for software development and already integrated into several development tools [3].
- Ermon argues that diffusion-based models will significantly improve two critical metrics, latency and computational cost, claiming they are faster and more efficient than models built by other companies [3][5].
- Diffusion models differ structurally from the autoregressive models that dominate text-based AI services and are believed to perform better when handling large volumes of text or under data constraints [5].

Performance Metrics
- Diffusion models offer greater flexibility in hardware utilization, which matters increasingly as AI's infrastructure demands grow [5].
- Ermon's benchmarks indicate the models can process over 1,000 tokens per second, exceeding existing autoregressive systems thanks to their inherent support for parallel generation (illustrated in the sketch after this summary) [5].
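The parallelism credited for Mercury's throughput is easiest to see in a generic masked-diffusion decoding loop: every position is predicted simultaneously at each step, and a schedule decides how many to commit. The sketch below is an illustration under assumed model inputs and outputs, not Inception's code; `model` is assumed to map a `(1, seq_len)` token tensor to `(1, seq_len, vocab_size)` logits.

```python
import torch

@torch.no_grad()
def diffusion_decode(model, seq_len, n_steps=8, mask_id=0):
    """Generic masked-diffusion text decoding (toy illustration).

    Unlike autoregressive decoding, which emits one token per forward pass,
    every position here is predicted in parallel each step, and the most
    confident masked positions are committed; this per-step parallelism is
    the source of the throughput advantage the article describes.
    """
    x = torch.full((1, seq_len), mask_id)            # start fully masked
    for step in range(n_steps):
        still_masked = x == mask_id
        if not still_masked.any():
            break
        logits = model(x)                            # parallel prediction
        conf, pred = logits.softmax(-1).max(-1)      # per-position confidence
        conf = conf.masked_fill(~still_masked, -1.0) # ignore committed slots
        # commit a schedule-determined number of positions this step
        n_commit = max(1, int(still_masked.sum()) // (n_steps - step))
        top = conf.topk(n_commit, dim=-1).indices
        x[0, top[0]] = pred[0, top[0]]
    return x
```

With eight steps over a 1,024-token sequence, this loop commits on the order of 128 tokens per forward pass, versus one per pass for an autoregressive decoder, which is the intuition behind the quoted tokens-per-second figures.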