Diffusion Models
Ditching autoregression! A Chinese team builds LLaDA-V, a pure-diffusion multimodal large model, setting a new SOTA on understanding tasks
机器之心· 2025-05-27 03:23
Core Viewpoint
- The article discusses the development of LLaDA-V, a pure diffusion multimodal large language model (MLLM) that integrates visual instruction tuning, marking a significant breakthrough in multimodal understanding compared with traditional autoregressive methods [1][16].

Group 1: Model Development
- The research team extended LLaDA into the multimodal domain, introducing LLaDA-V, which uses a visual encoder (SigLIP 2) and an MLP connector to project visual features into the language embedding space, achieving effective multimodal alignment [2].
- LLaDA-V employs a discrete diffusion mechanism during both training and sampling, moving away from the autoregressive paradigm [2].

Group 2: Performance Highlights
- LLaDA-V demonstrates strong data scalability and competitive performance, outperforming the autoregressive baseline LLaMA3-V on 11 multimodal tasks, even though LLaDA-8B is slightly weaker than LLaMA3-8B on pure text tasks [5].
- The model achieves state-of-the-art (SOTA) performance on multimodal understanding tasks compared with existing hybrid autoregressive-diffusion models, validating the effectiveness of building MLLMs on powerful language diffusion models [8].
- LLaDA-V significantly narrows the performance gap with top autoregressive MLLMs, achieving comparable results on benchmarks such as MMStar [10].

Group 3: Core Methodology
- The core of LLaDA-V lies in combining visual instruction tuning with LLaDA's masked diffusion mechanism, yielding a robust training and inference process [13][15].
- The architecture follows the classic "visual encoder + MLP projector + language model" setup: the visual encoder extracts image features, and the MLP projector maps them into LLaDA's embedding space [15].
- LLaDA-V's training objective supports multi-turn multimodal dialogue by masking only the model's responses during training, optimizing its ability to generate coherent replies [15].

Group 4: Future Outlook
- The successful integration of visual instruction tuning with masked diffusion models opens a new technical pathway for MLLM development, challenging the notion that multimodal intelligence must rely on autoregressive models [16].
- As language diffusion models continue to advance, they are expected to play a larger role and to push the boundaries of multimodal AI further [16].
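A minimal sketch of what "masking only the model's responses" can look like in a masked-diffusion training objective, assuming a LLaDA-style 1/t-weighted cross-entropy over masked positions; the function name and the model call signature (per-token logits of shape batch × length × vocab) are illustrative assumptions, not the released training code:

```python
import torch
import torch.nn.functional as F

def response_masked_diffusion_loss(model, input_ids, response_mask, mask_token_id):
    """Sketch of a LLaDA-style masked-diffusion step for visual instruction tuning:
    only response tokens may be masked; the prompt (text plus projected image
    features) is always left clean. Details may differ from the actual LLaDA-V code."""
    b, n = input_ids.shape
    t = torch.rand(b, 1, device=input_ids.device).clamp(min=1e-3)   # per-sample masking ratio
    # Mask each response token independently with probability t; prompt tokens are never masked.
    masked = (torch.rand(b, n, device=input_ids.device) < t) & response_mask.bool()
    noisy_ids = torch.where(masked, torch.full_like(input_ids, mask_token_id), input_ids)
    logits = model(noisy_ids)                      # bidirectional transformer, no causal mask
    ce = F.cross_entropy(logits.transpose(1, 2), input_ids, reduction="none")
    # Average the 1/t-weighted loss over the tokens that were actually masked.
    loss = ((ce / t) * masked).sum() / masked.sum().clamp(min=1)
    return loss
```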
10,000 tokens in 12 seconds! Google unveils the text "diffusion model" Gemini Diffusion; researchers say the demo has to be slowed down to watch
量子位· 2025-05-21 10:39
Core Viewpoint
- Google DeepMind has introduced Gemini Diffusion, a new language model that uses diffusion techniques to significantly improve text generation speed and quality compared with traditional autoregressive models [1][4][9].

Group 1: Technology and Performance
- Gemini Diffusion can generate text at 2000 tokens per second, faster than the previous model, Gemini 2.0 Flash-Lite [7][11].
- Rather than predicting text token by token, the model learns to generate output by progressively refining noise, which allows rapid iteration and error correction during generation [6][10][15].
- Unlike traditional models that generate one token at a time, Gemini Diffusion can generate entire blocks of tokens simultaneously, resulting in more coherent responses [14][9].

Group 2: Benchmarking and Comparisons
- Benchmark tests show that Gemini Diffusion performs comparably to larger models, with specific metrics indicating it outperforms Gemini 2.0 Flash-Lite in several coding tasks [8].
- For example, on the HumanEval benchmark, Gemini Diffusion scored 76.0%, slightly higher than Gemini 2.0 Flash-Lite's 75.8% [8].

Group 3: Implications and Future Directions
- The introduction of diffusion techniques into language models suggests a potential shift towards more hybrid models in the future, as seen in similar research by other institutions [19][20].
- The ability to perform non-causal reasoning during text generation opens up new possibilities for complex problem-solving tasks that traditional autoregressive models struggle with [16][17].
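Google has not published Gemini Diffusion's decoding algorithm. The sketch below shows a generic block-wise parallel refinement loop of the kind used by discrete text-diffusion models, to make the "generate a whole block at once, then iterate" idea concrete; the function name, confidence-based unmasking schedule, and model call signature are all assumptions:

```python
import torch

@torch.no_grad()
def parallel_block_decode(model, prompt_ids, block_len=64, steps=8, mask_id=0):
    """Generic block-wise parallel decoding sketch (not Gemini Diffusion's actual method).
    The block starts as all [MASK] tokens and is refined over a few steps, keeping the
    most confident predictions and re-predicting the rest. mask_id is assumed to be a
    dedicated mask token id that the model never outputs."""
    device = prompt_ids.device
    block = torch.full((1, block_len), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        logits = model(torch.cat([prompt_ids, block], dim=1))[:, -block_len:]
        conf, pred = logits.softmax(-1).max(-1)          # confidence and argmax per position
        still_masked = block.eq(mask_id)
        # Unmask a growing fraction of positions each step, most confident first.
        k = int(block_len * (step + 1) / steps) - int((~still_masked).sum())
        if k > 0:
            conf = conf.masked_fill(~still_masked, -1.0)  # never re-select finished positions
            idx = conf.topk(k, dim=-1).indices
            block.scatter_(1, idx, pred.gather(1, idx))
    return block
```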
New work from Kaiming He et al. keeps it simple: replacing instantaneous velocity with average velocity lifts one-step generation performance by 70%
量子位· 2025-05-21 06:31
Core Viewpoint
- The article discusses the introduction of a new model called MeanFlow, which uses average velocity to achieve a one-step generation framework, significantly improving the state of the art (SOTA) in image generation tasks [1][5][10].

Group 1: Model Development
- The MeanFlow model is trained from scratch without any pre-training, distillation, or curriculum learning, achieving a Fréchet Inception Distance (FID) score of 3.43, a notable improvement over previous one-step diffusion/flow models [3][10][13].
- The model introduces the concept of average velocity to represent flow fields, contrasting with the instantaneous velocity used in flow matching methods [5][9].

Group 2: Experimental Results
- Experiments on ImageNet at 256×256 resolution showed that MeanFlow achieved a 50% to 70% relative advantage over previous state-of-the-art methods in terms of FID [13][19].
- The model's performance was evaluated through an ablation study showing various configurations and their corresponding FID scores, with the best results achieved under specific parameter settings [15][19].

Group 3: Scalability and Comparison
- MeanFlow exhibits good scalability in model size, with different configurations yielding competitive FID scores compared with other generative models [16][19].
- A comparison with other generative models indicates that MeanFlow significantly narrows the gap between one-step diffusion/flow models and their multi-step predecessors [19][20].

Group 4: Research Team and Background
- The research was conducted by a team from MIT and CMU, including PhD student Geng Zhengyang and other students of Kaiming He [21][22][23].
- The team aims to bridge the gap between generative modeling and simulations in physics, addressing multi-scale simulation problems [20].
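A sketch of the relation behind "average velocity", reconstructed from the definitions summarized above (flow-matching instantaneous velocity v versus an interval-averaged velocity u); the notation may differ slightly from the paper:

```latex
% Average velocity over [r, t] versus the instantaneous velocity v used in flow matching:
%   u(z_t, r, t) \;=\; \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau .
% Differentiating (t - r)\,u(z_t, r, t) with respect to t gives the identity
%   u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t - r)\, \frac{d}{dt} u(z_t, r, t),
% which yields a training target without integrating the ODE. One-step sampling is then
%   z_r \;=\; z_t \;-\; (t - r)\, u(z_t, r, t), \qquad (r, t) = (0, 1).
```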
TransDiffuser: the architecture Li Auto's VLA uses to generate trajectories with diffusion
理想TOP2· 2025-05-18 13:08
Core Viewpoint
- The article discusses advances in autonomous driving, focusing on the diffusion model and its application to generating driving trajectories, and highlighting the differences between VLM and VLA systems [1][4].

Group 1: Diffusion Model Explanation
- Diffusion is a generative model that learns a data distribution through a process of adding noise (forward process) and removing noise (reverse process), akin to solving a puzzle in reverse [4].
- The denoising process involves training a neural network to predict and remove noise, ultimately generating the target data [4].
- Diffusion generates not only the ego vehicle's trajectory but also predicts the trajectories of other vehicles and pedestrians, improving decision-making in complex traffic environments [5].

Group 2: VLM and VLA Systems
- VLM consists of two systems: System 1 uses imitation learning to output trajectories without semantic understanding, while System 2 has semantic understanding but only provides suggestions [2].
- VLA is a single system with both fast and slow thinking capabilities, inherently possessing semantic reasoning [2].
- VLA outputs action tokens that encode the vehicle's driving behavior and surrounding environment, which are then decoded into driving trajectories using the diffusion model [4][5].

Group 3: TransDiffuser Architecture
- TransDiffuser is an end-to-end trajectory generation model that integrates multimodal perception information to produce high-quality, diverse trajectories [6][7].
- The architecture includes a Scene Encoder for processing multimodal data and a Denoising Decoder that uses the DDPM framework for trajectory generation [7][9].
- The model employs a multi-head cross-attention mechanism to fuse scene and motion features during denoising [9].

Group 4: Performance and Innovations
- The model achieves a Predictive Driver Model Score (PDMS) of 94.85, outperforming existing methods [11].
- Key innovations include anchor-free trajectory generation and a multimodal representation decorrelation optimization mechanism that enhances trajectory diversity and reduces redundancy [11][12].

Group 5: Limitations and Future Directions
- The authors note challenges in fine-tuning the model, particularly the perception encoder [13].
- Future directions involve integrating reinforcement learning and referencing models such as OpenVLA for further advances [13].
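A minimal sketch of the denoising-decoder idea summarized in Group 3: noisy trajectory waypoints attend to encoded scene features through multi-head cross-attention and the network predicts the added DDPM noise. Layer sizes, module names, and the exact conditioning are assumptions rather than the paper's specification:

```python
import torch
import torch.nn as nn

class DenoisingTrajectoryDecoder(nn.Module):
    """Hedged sketch of a TransDiffuser-style denoising decoder: trajectory queries
    are fused with scene/motion features via cross-attention; output is the
    predicted noise on each waypoint (standard DDPM epsilon-prediction)."""
    def __init__(self, d_model=256, n_heads=8, traj_dim=2):
        super().__init__()
        self.traj_in = nn.Linear(traj_dim, d_model)
        self.time_emb = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.noise_out = nn.Linear(d_model, traj_dim)

    def forward(self, noisy_traj, t, scene_feats):
        # noisy_traj: (B, horizon, 2) waypoints; t: (B,) diffusion step; scene_feats: (B, N, d_model)
        h = self.traj_in(noisy_traj) + self.time_emb(t.float().unsqueeze(-1)).unsqueeze(1)
        h = h + self.cross_attn(h, scene_feats, scene_feats)[0]   # trajectory queries attend to the scene
        h = h + self.ffn(h)
        return self.noise_out(h)                                  # predicted noise per waypoint
```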
Switch the lights on and off with one click! Google pushes film-grade lighting control to the limit with a diffusion model
机器之心· 2025-05-16 04:39
机器之心 report. Editors: 刘欣, +0

Google recently released LightLab, a project that gives precise control over the lighting in an image. It lets users perform fine-grained parametric control of light sources from a single image: changing the intensity and color of visible light sources, adjusting the intensity of ambient light, and even inserting virtual light sources into the scene.

Take film as an example: in a good film, light subtly shapes a character's mood, sets the atmosphere of the story, guides the viewer's gaze, and can even reveal a character's inner world. Yet whether in traditional photographic post-processing or in adjustments after digital rendering, precisely controlling the direction, color, and intensity of light has always been a time-consuming, labor-intensive challenge that depends heavily on experience.

Existing lighting-editing techniques either require many photos to work (and so cannot handle a single image) or, if they can edit, give no precise control over the change (for example, exactly how much brighter, or what color).

Google's research team fine-tuned a diffusion model on a specially constructed dataset so that it learns to precisely control the lighting in an image.

LightLab: Controlling Light Sources in Images with Diffusion Models
Paper: https://arxiv.org/abs/2505.09608
Project page: https://nadmag.github. ...
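The article does not describe how the light parameters are fed to the fine-tuned diffusion model, so the following is only a minimal sketch, assuming the parameters (target intensity, RGB color, ambient level) are broadcast to spatial maps and concatenated with the source image as extra input channels; the function name and this conditioning scheme are illustrative assumptions, not the paper's method:

```python
import torch

def build_light_condition(source_img, target_intensity, target_rgb, ambient):
    """Hypothetical conditioning input for a parametric relighting fine-tune:
    scalars broadcast to (B, 5, H, W) maps and concatenated with the image."""
    b, _, h, w = source_img.shape
    params = torch.tensor([target_intensity, *target_rgb, ambient],
                          device=source_img.device, dtype=source_img.dtype)   # 5 scalars
    param_maps = params.view(1, -1, 1, 1).expand(b, -1, h, w)                 # spatial parameter maps
    return torch.cat([source_img, param_maps], dim=1)                         # extra input channels for the UNet
```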
CVPR 2025 | How can personalized multi-person images be generated stably and efficiently? ID-Patch offers a new solution
机器之心· 2025-05-03 04:18
Core Viewpoint
- The article discusses the advancements and challenges in personalized multi-person image generation using diffusion models, highlighting the innovative ID-Patch mechanism that addresses identity leakage and enhances accuracy in positioning and identity representation [1][5][21].

Group 1: Challenges in Multi-Person Image Generation
- Personalized single-person image generation has achieved impressive visual results, but generating images with multiple people introduces new complexities [4].
- Identity leakage is a significant challenge: features of different individuals can blend, making them difficult to distinguish [2][4].
- Existing methods such as OMG and InstantFamily have attempted to tackle identity confusion but face limitations in efficiency and accuracy, especially as the number of individuals increases [4][14].

Group 2: ID-Patch Mechanism
- ID-Patch is a novel solution designed specifically for multi-person image generation, focusing on binding identity to position [6][21].
- The mechanism separates facial information into two key modules, allowing precise placement of individuals while maintaining their unique identities [9][21].
- ID-Patch integrates various spatial conditions, such as pose and depth maps, enhancing its adaptability to complex scenes [10][21].

Group 3: Performance and Efficiency
- ID-Patch demonstrates superior performance in identity resemblance (0.751) and identity-position matching (0.958), showing its effectiveness in maintaining facial consistency and accurate placement [12].
- In terms of generation speed, ID-Patch is the fastest among existing methods, generating an 8-person group photo in roughly 10 seconds, compared with nearly 2 minutes for OMG [17][15].
- Performance remains robust as the number of faces increases, with only a slight decline in effectiveness [14][21].

Group 4: Future Directions
- Facial feature representation could be further improved by incorporating diverse images of the same identity under varying lighting and expressions [20].
- Future explorations may include enhancing facial fidelity through multi-angle images and achieving joint control over position and expression using patch technology [22].
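The summary describes two branches (a spatially placed identity patch bound to a position on the control image, plus identity features for the generator) without giving the exact interface. Below is a minimal sketch under those assumptions; the module name, dimensions, and patch-stamping scheme are illustrative, not the released implementation:

```python
import torch
import torch.nn as nn

class IDPatchConditioner(nn.Module):
    """Hedged sketch of the two-branch idea: one projection turns a face embedding into
    a small spatial 'ID patch' stamped onto the control image (e.g. a pose map) at that
    person's position; the other produces identity tokens for cross-attention."""
    def __init__(self, face_dim=512, patch_size=16, token_dim=768, n_tokens=4):
        super().__init__()
        self.to_patch = nn.Linear(face_dim, 3 * patch_size * patch_size)
        self.to_tokens = nn.Linear(face_dim, n_tokens * token_dim)
        self.patch_size, self.token_dim, self.n_tokens = patch_size, token_dim, n_tokens

    def forward(self, control_img, face_embeds, positions):
        # control_img: (3, H, W) pose/depth map; face_embeds: (P, face_dim);
        # positions: (P, 2) top-left pixel coordinates, assumed to fit inside the image.
        ps = self.patch_size
        for emb, (y, x) in zip(face_embeds, positions):
            patch = self.to_patch(emb).view(3, ps, ps)
            control_img[:, y:y + ps, x:x + ps] = patch        # bind each identity to its location
        id_tokens = self.to_tokens(face_embeds).view(-1, self.n_tokens, self.token_dim)
        return control_img, id_tokens                          # spatial condition + cross-attention tokens
```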
阶跃星辰 (StepFun) open-sources the image-editing model Step1X-Edit; Alibaba's flagship AI app 夸克 (Quark) launches a new "AI Camera" | AIGC Daily
创业邦· 2025-04-27 23:48
3. [Meta's Token-Shuffle debuts: autoregressive models break through a bottleneck, enabling AI generation of 2048×2048 images] Reports say Meta AI has introduced Token-Shuffle, aimed at solving the scaling problem that autoregressive (AR) models face when generating high-resolution images. AR models have excelled at language generation and have been widely explored for image synthesis in recent years, but they hit a bottleneck with high-resolution images. Unlike text generation, which needs only a modest number of tokens, high-resolution image synthesis often requires thousands of tokens, and the compute cost explodes accordingly. As a result, many AR-based multimodal models can only handle low- to mid-resolution images, limiting their use in fine-grained image generation. Diffusion models, by contrast, perform strongly at high resolution, but their complex sampling process and slower inference also have limitations. (Sohu)

4. [Adobe releases Firefly Image Model 4: AI image generation upgraded again] In a blog post, Adobe introduced two text-to-image AI models, Firefly Image Model 4 and Firefly Image Model 4 Ultra, and previewed, for Photoshop and Illustrator, the Crea ...
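A quick back-of-the-envelope calculation behind the token bottleneck described in item 3, assuming one token per 16×16-pixel patch (the patch size is an illustrative assumption, not Meta's actual tokenizer setting):

```python
def image_tokens(resolution: int, patch: int = 16) -> int:
    """Number of tokens for a square image at a given resolution and patch size."""
    return (resolution // patch) ** 2

print(image_tokens(256))    # 256 tokens   - comfortable for an AR decoder
print(image_tokens(2048))   # 16384 tokens - quadratic attention cost explodes
```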
ICLR 2025 | 20× speedup with no extra training: Tsinghua's Zhu Jun group proposes DBIM, an inference algorithm for diffusion bridge models in image translation
机器之心· 2025-04-27 10:40
The paper has two co-first authors: 郑凯文 (Zheng Kaiwen), a third-year PhD student in Tsinghua University's Department of Computer Science, and 何冠德 (He Guande), a first-year PhD student at UT Austin.

Diffusion models have made breakthrough progress on generative tasks in recent years, achieving outstanding results in image generation, video synthesis, and speech synthesis, and driving technical advances in text-to-image and video generation. However, standard diffusion models are designed for generating data from random noise, and are ill-suited to tasks such as image translation or image restoration, where the mapping between a given input and output is explicit.

To address this, a variant called Denoising Diffusion Bridge Models (DDBMs) emerged. DDBMs model the bridging process between two given distributions and therefore apply well to image translation, image restoration, and similar tasks. Mathematically, however, these models rely on complex ordinary/stochastic differential equations and typically need hundreds of iterative steps to generate high-resolution images; this low computational efficiency severely limits their practical adoption.

Compared with standard diffusion models, the inference process of diffusion bridge models additionally involves a linear combination dependent on the initial condition and a singularity at the starting point, so standard diffusion-model inference algorithms cannot be applied directly. To solve this, Zhu Jun's team at Tsinghua University proposed a method called diffusion bridge implicit mo ...
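For intuition on the "linear combination dependent on the initial condition" and the endpoint singularity mentioned above, the simplest Brownian-bridge case can be written down explicitly (a hedged illustration of the general bridge structure, not DBIM's actual derivation):

```latex
% Brownian bridge on [0, T] between the data sample x_0 and the given condition x_T:
%   q(x_t \mid x_0, x_T) \;=\; \mathcal{N}\!\Big(x_t;\; \tfrac{T - t}{T}\, x_0 + \tfrac{t}{T}\, x_T,\; \tfrac{t\,(T - t)}{T}\, I\Big).
% Every intermediate state mixes the data endpoint x_0 with the condition x_T (the
% initial-condition-dependent linear combination), and the variance vanishes as t \to T,
% i.e. at the starting point of reverse sampling, which is the singularity that prevents
% standard diffusion inference algorithms from being reused directly.
```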
"Computer vision has been finished off by GPT-4o" (tongue in cheek)
量子位· 2025-03-29 07:46
Core Viewpoint
- The article discusses the advances in computer vision (CV) and image generation brought by the new GPT-4o model, highlighting its potential to disrupt existing tools and methodologies in the field [1][2].

Group 1: Technological Advancements
- GPT-4o introduces native multimodal image generation, expanding the functionality of AI tools beyond traditional applications [2][12].
- GPT-4o's image generation is based on an autoregressive model, unlike the diffusion model used in DALL·E, which allows better instruction following and stronger image editing capabilities [15][19].
- Observations suggest the image generation may involve a multi-scale autoregressive scheme, in which a rough image is generated first and details are filled in while the rough shape evolves [17][19].

Group 2: Industry Impact
- The advances in GPT-4o's capabilities have raised concerns among designers and computer vision researchers, indicating a significant shift in the competitive landscape of AI tools [6][10].
- OpenAI's approach of scaling foundation models to achieve these capabilities has surprised many in the industry, suggesting a new trend in AI development [12][19].
- The potential for GPT-4o to enhance applications in autonomous driving has been noted, with implications for future developments in this sector [10].

Group 3: Community Engagement
- The article encourages community members to share their experiences and innovative uses of GPT-4o, fostering a collaborative environment for exploring AI applications [26].
Event registration: we've gathered the authors of LCM, InstantID, and AnimateDiff for a sharing session
42章经· 2024-05-26 14:35
42章经 AI private advisory-board event — Text-to-Image and Text-to-Video: From Research to Application

LCM, InstantID, and AnimateDiff are three research works of enormous global significance and influence; over the past year they have arguably brought major breakthroughs, or real deployability, to text-to-image and text-to-video, and a great many founders are already using their results in practice.

This time we have, for the first time, brought the authors of all three works together, and have also invited the well-known AI product manager Hidecloud to moderate the panel. We look forward to discussing the latest research and real-world applications in text-to-image and text-to-video with dozens of AI founders.

Speakers
· 骆思勉 — Master's student at Tsinghua's Institute for Interdisciplinary Information Sciences; research focuses on multimodal generation, diffusion models, and consistency models; representative works include LCM, LCM-LoRA, and Diff-Foley
· 王浩帆 — Master's degree from CMU; member of the InstantX team; research focuses on consistent generation; representative works include InstantStyle, InstantID, and Score-CAM
· 杨策元 — PhD from the Chinese University of Hong Kong; research focuses on video generation

Time: June 1, 13:00-14:00 Beijing time (Saturday) / May 31, 22:00-23:00 US Pacific time (Friday)
Format: Online (the meeting link will be sent individually) ...