Diffusion Models

Hiring: several experts to co-build the platform (4D annotation / world models / VLA and related directions)
自动驾驶之心· 2025-09-23 23:32
Candidates from QS top-200 universities with a master's degree or above, ideally holding top-conference publications, are preferred. Business partners: 自动驾驶之心 is recruiting business partners! The team plans to recruit 10 outstanding partners at home and abroad this year, responsible for autonomous-driving course development, paper-tutoring business development, and hardware R&D. Main directions: if you work on large models / multimodal large models, diffusion models, VLA, end-to-end driving, embodied interaction, joint prediction, SLAM, 3D object detection, world models, closed-loop simulation with 3DGS, or large-model deployment and quantization-aware inference, you are welcome to join us. Compensation: shared autonomous-driving resources (job hunting, PhD applications, study-abroad recommendations, etc.), generous cash incentives, and startup-project cooperation and referrals. Contact us: for more details, add us on WeChat and note "organization/company + autonomous driving cooperation inquiry". ...
Nearly 5× faster! PKU and ByteDance propose BranchGRPO, reshaping diffusion-model alignment with tree branching + pruning
机器之心· 2025-09-22 07:26
Fast branching, stable convergence. In human-preference alignment of diffusion / flow-matching models, unifying efficient sampling with stable optimization has long been a major challenge. Recently, a team from Peking University and ByteDance proposed a new tree-structured reinforcement-learning method named BranchGRPO. Unlike the sequentially unrolled DanceGRPO, BranchGRPO introduces branching and pruning into the diffusion inversion process, letting multiple trajectories share prefixes and split at intermediate steps, with dense feedback provided by layer-wise reward fusion. The method performs strongly on both HPDv2.1 image alignment and WanX-1.3B video generation. Most notably, while achieving better alignment, BranchGRPO cuts iteration time by nearly 5× (Mix variant: 148s vs. 698s). Project page: https://fredreic1849.github.io/BranchGRPO-Webpage/ Code: https://github.com/Fredreic1849/BranchGRPO Background and challenges: in recent years, diffusion and flow-matching models have become the mainstream approach to visual generation, thanks to their high fidelity, diversity, and controllability in image and video generation. However, large-scale pretraining alone does not guarantee full alignment with human intent: generated results often deviate from aesthetic, semantic, or temporal ...
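The branching-and-pruning idea can be sketched in a few lines. The rollout below is a minimal illustration, not the authors' implementation: `denoise_step` and `reward` are hypothetical stand-ins for the diffusion sampler and reward model, and the branching schedule, width, and keep fraction are illustrative parameters.

```python
def branch_rollout(x0, num_steps, branch_steps, width, keep_frac,
                   denoise_step, reward):
    """Sketch of a BranchGRPO-style tree rollout (assumed interfaces).

    Trajectories share a common prefix; at designated steps each live
    trajectory splits into `width` children, and low-reward branches
    are pruned so the tree stays narrow while feedback stays dense.
    """
    leaves = [x0]                      # states at the current depth
    for t in range(num_steps):
        if t in branch_steps:
            # branch: every live trajectory forks into `width` children,
            # each with a different sampling seed
            leaves = [denoise_step(x, t, seed=s)
                      for x in leaves for s in range(width)]
            # prune: keep only the top fraction by intermediate reward
            leaves.sort(key=reward, reverse=True)
            leaves = leaves[:max(1, int(len(leaves) * keep_frac))]
        else:
            # shared segment: a single denoising step per live trajectory
            leaves = [denoise_step(x, t, seed=0) for x in leaves]
    return leaves
```

Because the prefix before the first branch point is computed once and shared by all descendants, the tree visits far fewer denoising steps than the same number of independent sequential rollouts.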
Hiring: several experts to co-build the platform (world models / VLA and related directions)
自动驾驶之心· 2025-09-21 06:59
Group 1
- The article announces the recruitment of 10 partners for the autonomous driving sector, focusing on course development, paper guidance, and hardware research [2]
- The recruitment targets individuals with expertise in advanced technologies such as large models, multimodal models, and 3D object detection [3]
- Candidates from QS top-200 universities with a master's degree or higher are preferred, especially those with significant conference contributions [4]

Group 2
- The compensation package includes resource sharing for job seeking, PhD recommendations, and study-abroad opportunities, along with substantial cash incentives [5]
- The company encourages potential partners to reach out via WeChat for collaboration inquiries, specifying the need to mention their organization or company [6]
SJTU's Yan Junchi team: a roundup of top-conference and top-journal results from the past year
自动驾驶之心· 2025-09-18 23:33
Core Insights
- The article discusses the groundbreaking research conducted by Professor Yan Junchi's team at Shanghai Jiao Tong University, focusing on advancements in AI, robotics, and autonomous driving [2][32]
- The team's recent publications in top conferences like CVPR, ICLR, and NeurIPS highlight key trends in AI research, emphasizing the integration of theory and practice, the transformative impact of AI on traditional scientific computing, and the development of more robust, efficient, and autonomous intelligent systems [32]

Group 1: Recent Research Highlights
- The paper "Grounding and Enhancing Grid-based Models for Neural Fields" introduces a systematic theoretical framework for grid-based neural field models, leading to the development of the MulFAGrid model, which achieves superior performance in various tasks [4][5]
- The "CR2PQ" method addresses the challenge of cross-view pixel correspondence in dense visual representation learning, demonstrating significant performance improvements over previous methods [6][7]
- The "BTBS-LNS" method effectively tackles the limitations of policy learning in large neighborhood search for mixed-integer programming (MIP), showing competitive performance against commercial solvers like Gurobi [8][10][11]

Group 2: Performance Metrics
- The MulFAGrid model achieved a PSNR of 56.19 in 2D image fitting tasks and an IoU of 0.9995 in 3D signed distance field reconstruction tasks, outperforming previous grid-based models [5]
- The CR2PQ method demonstrated a 10.4% mAP^bb and 7.9% mAP^mk improvement over state-of-the-art methods after only 40 pre-training epochs [7]
- The BTBS-LNS method outperformed Gurobi by providing a 10% better primal gap in benchmark tests within a 300-second cutoff time [11]
Group 3: Future Trends in AI Research
- The research indicates a shift towards a deeper integration of theoretical foundations with practical applications in AI, suggesting a future where AI technologies are more robust and capable of real-world applications [32]
- The advancements in AI research are expected to lead to smarter robots, more powerful design tools, and more efficient business solutions in the near future [32]
Foundation models for autonomous driving should be capability-oriented, not confined to the methods themselves
自动驾驶之心· 2025-09-16 23:33
Core Insights
- The article discusses the transformative impact of foundational models on autonomous driving perception, shifting from task-specific deep learning models to versatile architectures trained on vast and diverse datasets [2][4]
- It introduces a new classification framework focusing on four core capabilities essential for robust performance in dynamic driving environments: general knowledge, spatial understanding, multi-sensor robustness, and temporal reasoning [2][5]

Group 1: Introduction and Background
- Autonomous driving perception is crucial for enabling vehicles to interpret their surroundings in real time, involving key tasks such as object detection, semantic segmentation, and tracking [3]
- Traditional models, designed for specific tasks, exhibit limited scalability and poor generalization, particularly in "long-tail scenarios" where rare but critical events occur [3][4]

Group 2: Foundational Models
- Foundational models, developed through self-supervised or unsupervised learning strategies, leverage large-scale datasets to learn general representations applicable across various downstream tasks [4][5]
- These models demonstrate significant advantages in autonomous driving due to their inherent generalization capabilities, efficient transfer learning, and reduced reliance on labeled datasets [4][5]

Group 3: Key Capabilities
- The four key dimensions for designing foundational models tailored for autonomous driving perception are:
  1. General Knowledge: the ability to adapt to a wide range of driving scenarios, including rare situations [5][6]
  2. Spatial Understanding: deep comprehension of 3D spatial structures and relationships [5][6]
  3. Multi-Sensor Robustness: maintaining high performance under varying environmental conditions and sensor failures [5][6]
  4. Temporal Reasoning: capturing temporal dependencies and predicting future states of the environment [6]

Group 4: Integration and Challenges
- The article outlines three mechanisms for integrating foundational models into autonomous driving technology stacks: feature-level distillation, pseudo-label supervision, and direct integration [37][40]
- It highlights the challenges faced in deploying these models, including the need for effective domain adaptation, addressing hallucination risks, and ensuring efficiency in real-time applications [58][61]

Group 5: Future Directions
- The article emphasizes the importance of advancing research in foundational models to enhance their safety and effectiveness in autonomous driving systems, addressing current limitations and exploring new methodologies [2][5][58]
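Of the three integration mechanisms, feature-level distillation is the simplest to sketch. The snippet below is a generic illustration rather than the survey's formulation: a learned projection (here just a matrix, an assumption) maps the driving model's features into the frozen foundation model's feature space, and an L2 loss pulls them together.

```python
import numpy as np

def feature_distillation_loss(student_feat, teacher_feat, proj):
    """Generic feature-level distillation loss (illustrative names).

    student_feat: (N, d_s) features from the task model being trained.
    teacher_feat: (N, d_t) features from the frozen foundation model.
    proj:         (d_s, d_t) learned projection into the teacher space.
    """
    projected = student_feat @ proj            # map into teacher space
    return float(np.mean((projected - teacher_feat) ** 2))
```

In a real stack the teacher features are precomputed offline or produced by a frozen backbone, so only the student and the projection receive gradients.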
Breaking through the AGI fog, Ant Group sees a new signpost
雷峰网· 2025-09-16 10:20
Core Viewpoint
- The article discusses the current state of large language models (LLMs) and the challenges they face in achieving Artificial General Intelligence (AGI), emphasizing the need for new paradigms beyond the existing autoregressive (AR) models [4][10][18]

Group 1: Current Challenges in AI Models
- Ilya, a prominent AI researcher, warns that data extraction has reached its limits, hindering the progress towards AGI [2][4]
- Existing LLMs often exhibit significant performance discrepancies: some are capable of outperforming human experts while others struggle with basic tasks [13][15]
- The autoregressive model's limitations include a lack of bidirectional modeling and the inability to correct errors during generation, leading to fundamental misunderstandings in tasks like translation and medical diagnosis [26][27][18]

Group 2: New Directions in AI Research
- Elon Musk proposes a "purified data" approach to rewrite human knowledge as a potential pathway to AGI [5]
- Researchers are exploring multimodal approaches, with experts like Fei-Fei Li emphasizing the importance of visual understanding as a cornerstone of intelligence [8]
- A new paradigm, the diffusion model, is being introduced by young scholars; it contrasts with the traditional autoregressive approach by allowing parallel decoding and iterative correction [12][28]

Group 3: Development of LLaDA-MoE
- The LLaDA-MoE model, based on diffusion theory, was announced as a significant advancement in the field, showcasing a new approach to language modeling [12][66]
- LLaDA-MoE has a total parameter count of 7 billion, with 1.4 billion activated parameters, and has been trained on approximately 20 terabytes of data, demonstrating its scalability and stability [66][67]
- The model's performance in benchmark tests indicates that it can compete with existing autoregressive models, suggesting a viable alternative path for future AI development [67][71]
Group 4: Future Prospects and Community Involvement
- The development of LLaDA-MoE represents a milestone in the exploration of diffusion models, with plans for further scaling and improvement [72][74]
- The team emphasizes the importance of community collaboration in advancing diffusion-model research, similar to the development of autoregressive models [74][79]
- Ant Group's commitment to investing in AGI research reflects a strategic shift towards exploring innovative and potentially high-risk areas in AI [79]
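The parallel-decoding-with-iterative-correction idea behind diffusion language models can be sketched as a masked-denoising loop. This is a generic illustration under assumed interfaces, not LLaDA-MoE's actual schedule: `predict` is a hypothetical callable returning a (token, confidence) pair per position.

```python
def diffusion_decode(predict, length, steps, mask="<M>"):
    """Generic masked-diffusion decoding sketch (not LLaDA's exact recipe).

    Start from an all-mask sequence; each step, predict every masked slot
    in parallel, commit only the most confident predictions, and leave the
    rest masked so later steps can revise them with fuller context.
    """
    seq = [mask] * length
    for step in range(steps):
        preds = predict(seq)                          # parallel prediction
        masked = [i for i, tok in enumerate(seq) if tok == mask]
        if not masked:
            break
        # unmask a growing fraction each step, most confident first
        k = max(1, len(masked) * (step + 1) // steps)
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            seq[i] = preds[i][0]
    return seq
```

Unlike left-to-right autoregression, every position is predicted at every step, which is what enables both parallel decoding and revision of earlier choices.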
Paper explained: HKUST's PLUTO, the first planner to surpass rule-based ones!
自动驾驶之心· 2025-09-15 23:33
Core Viewpoint
- The article discusses the development and features of the PLUTO model within the end-to-end autonomous driving domain, emphasizing its unique two-stage architecture and its direct encoding of structured perception outputs for downstream control tasks [1][2]

Summary by Sections

Overview of PLUTO
- PLUTO is characterized by its three main losses (regression, classification, and imitation learning), which collectively contribute to the model's performance [7]
- Additional auxiliary losses are incorporated to aid model convergence [9]

Course Introduction
- The article introduces a new course titled "End-to-End and VLA Autonomous Driving," developed in collaboration with top algorithm experts from leading domestic manufacturers, aimed at addressing the challenges faced by learners in this rapidly evolving field [12][15]

Learning Challenges
- The course addresses the difficulties learners face due to the fast pace of technological development and the fragmented nature of knowledge across domains, which make it hard for beginners to grasp the necessary concepts [13]

Course Features
- The course is designed to provide quick entry into the field, build a framework for research capabilities, and combine theory with practical applications [15][16][17]

Course Outline
- The course consists of several chapters covering the history and evolution of end-to-end algorithms, background knowledge on various technologies, and detailed discussions of both one-stage and two-stage end-to-end methods [20][21][22][29]

Practical Application
- The course includes practical assignments, such as RLHF fine-tuning, allowing students to apply their theoretical knowledge in real-world scenarios [31]

Instructor Background
- The instructor, Jason, has a strong academic and practical background in cutting-edge algorithms related to end-to-end and large models, contributing to the course's credibility [32]
Target Audience and Expected Outcomes
- The course is aimed at individuals with a foundational understanding of autonomous driving and related technologies, with the goal of elevating their skills to the level of an end-to-end autonomous driving algorithm engineer within a year [36]
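PLUTO's three main losses plus auxiliary terms can be sketched as one weighted objective. The loss forms and weights below are assumptions for illustration, not the paper's definitions: the candidate closest to the expert trajectory is regressed toward it, a classification loss raises that candidate's score, and auxiliary terms are simply summed in.

```python
import numpy as np

def pluto_style_loss(pred_trajs, scores, expert_traj, aux_losses,
                     w=(1.0, 1.0, 1.0, 0.5)):
    """Illustrative combination of a regression, classification, and
    imitation loss plus auxiliary terms (forms and weights assumed).
    """
    # regression: L1 between the closest candidate and the expert trajectory
    dists = [float(np.abs(t - expert_traj).mean()) for t in pred_trajs]
    best = int(np.argmin(dists))
    reg = dists[best]
    # classification: cross-entropy selecting the best-matching candidate
    probs = np.exp(scores) / np.exp(scores).sum()
    cls = -float(np.log(probs[best]))
    # imitation: L2 of the selected candidate against the expert
    imi = float(np.mean((pred_trajs[best] - expert_traj) ** 2))
    return w[0] * reg + w[1] * cls + w[2] * imi + w[3] * sum(aux_losses)
```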
Tencent Hunyuan upgrades the AI-image fine-tuning paradigm: optimizing over the entire diffusion trajectory, with human evaluation scores up 300%
量子位· 2025-09-15 03:59
Core Viewpoint
- The article discusses advancements in AI image generation, focusing on two key methods, Direct-Align and Semantic Relative Preference Optimization (SRPO), which significantly enhance the quality and aesthetic appeal of generated images [5][14]

Group 1: Current Challenges in Diffusion Models
- Existing diffusion models face two main issues: limited optimization steps leading to "reward hacking," and the need for offline adjustments to the reward model to achieve good aesthetic results [4][8]
- Due to high gradient computation costs, optimization is constrained to the last few steps of the diffusion process [8]

Group 2: Direct-Align Method
- The Direct-Align method allows recovery of the original image from any time step by pre-injecting noise, avoiding the limitation of optimizing only the later steps [5][10]
- This enables the model to recover clear images from high-noise states and addresses the gradient explosion problem during early-time-step backpropagation [11]
- Experiments show that even at just 5% denoising progress, Direct-Align can recover a rough structure of the image [11][19]

Group 3: Semantic Relative Preference Optimization (SRPO)
- SRPO redefines rewards as text-conditioned signals, allowing online adjustments without additional data by using positive and negative prompt words [14][16]
- The method improves the realism and aesthetic quality of generated images by approximately 3.7× and 3.1×, respectively [16]
- SRPO allows flexible style adjustments, such as brightness and cartoon-style conversion, based on the frequency of control words in the training set [16]

Group 4: Experimental Results
- Comprehensive experiments on the FLUX.1-dev model demonstrate that SRPO outperforms other methods such as ReFL, DRaFT, and DanceGRPO across multiple evaluation metrics [17]
- In human evaluations, the excellent rate for realism increased from 8.2% to 38.9% and for aesthetic quality from 9.8% to 40.5% after SRPO training [17][18]
- Notably, a mere 10 minutes of SRPO training allowed FLUX.1-dev to surpass the latest open-source version FLUX.1.Krea on the HPDv2 benchmark [19]
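The algebra behind any-timestep recovery can be sketched under standard diffusion notation, x_t = α_t·x_0 + σ_t·ε (assuming this interpolation form, which the article does not spell out): because the injected noise ε is sampled up front and stored, x_0 is recoverable in closed form from any x_t, so a reward can be computed without denoising to the end.

```python
import numpy as np

def inject_noise(x0, eps, alpha_t, sigma_t):
    """Forward noising with a pre-sampled, stored noise `eps`."""
    return alpha_t * x0 + sigma_t * eps

def recover_clean(xt, eps, alpha_t, sigma_t):
    """Closed-form recovery of the clean sample from any timestep,
    possible only because the injected noise is known (the core idea
    of the pre-injection trick, sketched here, not the paper's code).
    """
    return (xt - sigma_t * eps) / alpha_t
```

In the actual method the model still denoises partway, but the known-noise identity is what lets gradients and rewards attach at early, high-noise timesteps instead of only the last few steps.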
End-to-end evolves again! Building a thinking autonomous-driving policy with diffusion models and MoE (Tongji University)
自动驾驶之心· 2025-09-14 23:33
Recently, large models have begun to make their mark in autonomous driving: vision-language models (VLM) and vision-language-action models (VLA) already perform well at scene understanding, semantic association, and generalization. However, in real continuous-control settings these models still face limitations such as slow inference, incoherent actions, and difficulty guaranteeing safety. Meanwhile, diffusion models are reshaping generative modeling in vision, audio, and control. Unlike traditional regression or classification approaches, a diffusion policy (DP) treats action generation as a step-by-step denoising process: it can better express the multiple plausible driving choices available while preserving the temporal consistency of trajectories and the stability of training. Such methods, however, have not yet been systematically studied in autonomous driving. By directly modeling the output action space, diffusion policies offer a stronger, more flexible way to generate smooth, reliable driving trajectories, well suited to the diversity and long-horizon stability demands of driving decisions. On the other hand, mixture-of-experts (MoE) has become an important architecture for large models: by activating only a few experts on demand, it gains scalability and modularity while staying computationally efficient. MoE has also been tried in autonomous driving, e.g. for multi-task policies and modular prediction, but most designs remain task-specific, limiting ...
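The "step-by-step denoising" view of action generation can be illustrated with a toy sampler. The update rule and coefficients below are illustrative, not the paper's; `noise_pred` is a hypothetical stand-in for the learned network that predicts the noise remaining in the current action trajectory.

```python
import numpy as np

def sample_actions(noise_pred, horizon, action_dim, steps, seed=0):
    """Toy diffusion-policy inference loop (illustrative coefficients).

    An action trajectory of shape (horizon, action_dim) starts as pure
    noise and is iteratively denoised into a temporally consistent plan.
    """
    rng = np.random.default_rng(seed)
    actions = rng.normal(size=(horizon, action_dim))   # start from noise
    for t in range(steps, 0, -1):
        eps = noise_pred(actions, t)
        actions = actions - eps / t                    # coarse-to-fine step
        if t > 1:                                      # no noise at the end
            actions = actions + 0.01 * rng.normal(size=actions.shape)
    return actions
```

Because the whole horizon is denoised jointly rather than emitted step by step, consecutive actions stay mutually consistent, which is the property the article highlights for smooth trajectories.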
Fast and good at once! TiM, a new training paradigm with native support for FSDP + Flash Attention
量子位· 2025-09-14 05:05
Core Viewpoint
- The article introduces the Transition Model (TiM) as a new paradigm in generative modeling, aiming to reconcile the trade-off between generation speed and quality by modeling state transitions between any two time points, rather than only instantaneous velocity fields or fixed-span endpoint mappings [3][8][34]

Group 1: Background and Challenges
- Traditional generative models face a fundamental conflict between generation quality and speed, primarily due to their training objectives [2][6]
- Existing diffusion models rely on local vector fields, which require small time steps for accurate sampling, leading to high computational costs [5][6]
- Few-step models, while faster, often hit a "quality ceiling" because they cannot capture intermediate dynamics, limiting their generation capabilities [5][7]

Group 2: Transition Model Overview
- The Transition Model abandons the traditional approaches by directly modeling the complete state transition between any two time points, allowing flexible sampling steps [4][8]
- It supports arbitrary step sizes and decomposes the generation process into multiple adjustable segments, enhancing both speed and fidelity [8][10]

Group 3: Mathematical Foundations
- The Transition Model is based on a "State Transition Identity," which simplifies the differential equations governing state transitions and enables the description of transitions over arbitrary time intervals [12][16]
- Unlike diffusion and mean-flow models, which focus on instantaneous or average velocity fields, the Transition Model encompasses both, providing a more comprehensive framework for generative modeling [16][17]

Group 4: Experimental Validation
- The Transition Model has been validated on the GenEval benchmark, demonstrating that an 865M-parameter version can outperform much larger (12B-parameter) models in generation capability [20][34]
- The model's training stability and scalability are enhanced through a differential derivative equation (DDE) approach, which is more efficient and compatible with modern training optimizations [25][33]

Group 5: Conclusion
- Overall, the Transition Model offers a more universal, scalable, and stable approach to generative modeling, addressing the inherent conflict between speed and quality in generative processes [35]
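The practical payoff of modeling transitions between arbitrary time points is that one network supports any sampling schedule. The sketch below assumes a `transition(x, t, s)` interface (hypothetical); the network is queried once per segment, whether the schedule is a single jump or many fine steps.

```python
def tim_sample(transition, x_T, schedule):
    """Sketch of Transition-Model-style sampling (assumed interface).

    Because the network models the full state transition between *any*
    two time points, the same model can sample in one big jump or many
    small segments; `schedule` is a list of time points from t=1 (noise)
    down to t=0 (data).
    """
    x = x_T
    for t_cur, t_next in zip(schedule[:-1], schedule[1:]):
        x = transition(x, t_cur, t_next)   # one learned jump, any span
    return x
```

With an exactly consistent transition function, a one-jump schedule `[1.0, 0.0]` and a finer schedule land on the same endpoint; in practice, more segments trade compute for fidelity, which is exactly the flexible speed/quality dial the article describes.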