Diffusion Models

Urgent hiring at Autonomous Driving Heart! Recruiting business partners for 2025, plenty of openings~
自动驾驶之心· 2025-06-27 09:34
Group 1
- The article discusses the recruitment of 10 outstanding partners for the "Autonomous Driving Heart" team, focusing on the development of autonomous driving-related courses, thesis guidance, and hardware development [2][3]
- The main areas of expertise sought include large models/multi-modal large models, diffusion models, VLA, end-to-end systems, embodied interaction, joint prediction, SLAM, 3D object detection, world models, closed-loop simulation 3DGS, and large model deployment and quantized perception reasoning [3]
- Candidates are preferred to have a master's degree or higher from universities ranked within the QS200, with priority given to those who have made significant contributions at top conferences [4]

Group 2
- The company offers various benefits including resource sharing for job seeking, doctoral studies, and study-abroad recommendations, along with substantial cash incentives and opportunities for entrepreneurial project collaboration [5][6]
- Interested parties are encouraged to contact the company via WeChat for consultation regarding institutional or corporate collaboration in autonomous driving [7]
A new breakthrough in embodied world models: Horizon Robotics & GigaAI propose a geometry-consistent video world model to enhance robot policy learning
机器之心· 2025-06-26 04:35
In recent years, as artificial intelligence has evolved from perceptual intelligence toward decision-making intelligence, world models have become an important research direction in robotics. World models aim to let an agent model its environment and predict future states, enabling more efficient planning and decision-making.

At the same time, embodied data has drawn explosive attention, because current embodied algorithms rely heavily on large-scale real-robot demonstration data, whose collection is typically expensive and time-consuming, severely limiting scalability and generalization. Although simulation platforms offer a relatively low-cost way to generate data, the significant visual and dynamics discrepancies between simulated and real environments (the sim-to-real gap) make policies trained in simulation hard to transfer directly to real robots, limiting their practical effectiveness. How to efficiently acquire, generate, and exploit high-quality embodied data has therefore become one of the core challenges in robot learning.

Project page: https://horizonrobotics.github.io/robot_lab/robotransfer/

Imitation learning has become one of the key approaches in robotic manipulation. By having robots "imitate" expert demonstrations, effective policy models can be built quickly for complex tasks. However, such methods typically depend on large amounts of high-quality real robot ...
A generative perspective reshapes supervised learning! Labels are not just answers, but also guides for learning | ICML 2025
量子位· 2025-06-24 13:36
Contributed by the PCL team. 量子位 | WeChat official account QbitAI

A generative perspective can be used to rethink, even redefine, supervised learning! Imagine teaching a student to solve math problems: would you have them hand in their answers for grading right away, or let them consult the full worked solution to understand the reasoning? A new supervised learning paradigm is now gaining attention: labels should not only serve as ground-truth answers to compare against, but can also act as auxiliary references during learning.

Inspired by generative consistency models, a research team from Shanghai Jiao Tong University, SII, MIT, CUHK-Shenzhen, and other institutions proposes Predictive Consistency Learning (PCL) at ICML 2025. PCL uses the forward diffusion process of a diffusion model to degrade the information in the labels, feeding the resulting noised labels into the model as input, so that the model predicts the complete label with both the data input and the noised label as references, thereby reusing the label's information and mining its value.

Overview of the training process

In traditional supervised learning, an input $\mathbf{x}$ is mapped by a neural network to a prediction $f_{\theta}(\mathbf{x})$, which is compared against the ground-truth label $\mathbf{y}$ to compute the loss and update the model via backpropagation, with the loss function

$$\mathcal{L}_{SL} = d(f_{\theta}(\mathbf{x}), \mathbf{y}),$$

where $d$ is the specific loss function and $f_{\theta}$ is the neural network. Inspired by the consistency-mapping idea in generative consistency models ...
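To make the training objective concrete, below is a minimal sketch of what a PCL-style step could look like, assuming a regression-style task, Gaussian forward diffusion on the labels, and a network that takes the data input, the noised label, and the noise level. The helper names (`noise_label`, `pcl_step`), the noise schedule, and the MSE distance are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def noise_label(y, t, alpha_bar):
    """Forward diffusion on the label: keep sqrt(alpha_bar[t]) of the signal and
    add sqrt(1 - alpha_bar[t]) Gaussian noise (illustrative schedule)."""
    a = alpha_bar[t].view(-1, 1)                      # per-sample signal fraction
    return a.sqrt() * y + (1.0 - a).sqrt() * torch.randn_like(y)

def pcl_step(model, x, y, alpha_bar, optimizer):
    """One hypothetical PCL training step: the model sees the input AND a noised
    version of the label, and must reconstruct the full label."""
    t = torch.randint(0, len(alpha_bar), (x.size(0),), device=x.device)
    y_t = noise_label(y, t, alpha_bar)                # degraded label as an extra reference
    y_hat = model(x, y_t, t)                          # prediction conditioned on (x, noised y, t)
    loss = F.mse_loss(y_hat, y)                       # d(f_theta(x, y_t, t), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At small t the noised label is close to the true answer (the "worked solution" the student may consult), while at large t it is nearly pure noise, so the same objective interpolates between copying the answer and ordinary supervised prediction.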
π0/π0.5/A0, hotly discussed in tech circles, finally explained! A full breakdown of functions, scenarios, and methodology~
具身智能之心· 2025-06-21 12:06
Core Insights
- The article discusses the π0, π0.5, and A0 models, focusing on their architectures, advantages, and functionalities in robotic control and task execution [3][11][29].

Group 1: π0 Model Structure and Functionality
- The π0 model is based on a pre-trained Vision-Language Model (VLM) and Flow Matching technology, integrating seven robots and over 68 tasks with more than 10,000 hours of data [3] (a flow-matching sampling sketch follows after this list).
- It allows zero-shot task execution through language prompts, enabling direct control of robots without additional fine-tuning for covered tasks [4].
- The model supports complex task decomposition and multi-stage fine-tuning, enhancing the execution of intricate tasks like folding clothes [5].
- It achieves high-frequency precise operation, generating continuous action sequences at a control frequency of up to 50 Hz [7].

Group 2: π0 Performance Analysis
- The π0 model shows 20%-30% higher accuracy in following language instructions compared to baseline models in tasks like table clearing and grocery bagging [11].
- For tasks similar to its pre-training, it requires only 1-5 hours of data for fine-tuning to achieve high success rates, and it performs twice as well on new tasks compared to training from scratch [11].
- In multi-stage tasks, π0 achieves an average task completion rate of 60%-80% through a "pre-training + fine-tuning" process, outperforming models trained from scratch [11].

Group 3: π0.5 Model Structure and Advantages
- The π0.5 model employs a two-stage training framework and hierarchical architecture, enhancing its ability to generalize from diverse data sources [12][18].
- It demonstrates a 25%-40% higher success rate in tasks compared to π0, with a three-fold training-speed improvement due to mixed discrete-continuous action training [17].
- The model effectively handles long-duration tasks and can execute complex operations in unfamiliar environments, showcasing its adaptability [18][21].

Group 4: A0 Model Structure and Performance
- The A0 model features a layered architecture that integrates high-level affordance understanding and low-level action execution, enhancing its spatial reasoning capabilities [29].
- It shows continuous performance improvement with increased training environments, achieving success rates close to baseline models when trained on 104 locations [32].
- The model's performance is significantly impacted by the removal of cross-entity and web data, highlighting the importance of diverse data sources for generalization [32].

Group 5: Overall Implications and Future Directions
- The advancements in these models indicate a significant step towards practical applications of robotic systems in real-world environments, with potential expansions into service robotics and industrial automation [21][32].
- The integration of diverse data sources and innovative architectures positions these models to overcome traditional limitations in robotic task execution [18][32].
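Since π0 is described as generating continuous action chunks with Flow Matching, the following is a minimal sketch of what flow-matching action sampling looks like at inference time: integrate a learned velocity field from Gaussian noise toward an action sequence with a few Euler steps. The class name `ActionVelocityNet`, the MLP backbone, the chunk length, and the 10-step Euler integration are illustrative assumptions, not π0's actual configuration.

```python
import torch
import torch.nn as nn

class ActionVelocityNet(nn.Module):
    """Toy velocity field v_theta(a, t | obs): predicts how a noisy action chunk
    should move toward a clean action chunk (stand-in for the real VLM backbone)."""
    def __init__(self, obs_dim=64, horizon=50, act_dim=7):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + horizon * act_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, horizon * act_dim),
        )

    def forward(self, obs, actions, t):
        x = torch.cat([obs, actions.flatten(1), t], dim=-1)
        return self.net(x).view(-1, self.horizon, self.act_dim)

@torch.no_grad()
def sample_action_chunk(model, obs, steps=10):
    """Euler integration of the flow ODE from noise (t=0) to an action chunk (t=1)."""
    a = torch.randn(obs.size(0), model.horizon, model.act_dim)  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((obs.size(0), 1), i * dt)
        a = a + dt * model(obs, a, t)                           # follow the learned velocity
    return a                                                    # (batch, horizon, act_dim) chunk

# Usage: returns a (1, 50, 7) action chunk for a single observation vector
chunk = sample_action_chunk(ActionVelocityNet(), torch.randn(1, 64))
```

Executing such a chunk action by action is what the high-frequency (up to 50 Hz) continuous control mentioned in the summary refers to; only the velocity network here is a toy stand-in.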
Building a 10,000-member "Whampoa Military Academy" for autonomous driving, a place that grinds hard on technology~
自动驾驶之心· 2025-06-20 14:06
Core Viewpoint
- The article emphasizes the establishment of a comprehensive community for autonomous driving and embodied intelligence, aiming to gather industry professionals and facilitate rapid responses to challenges within the sector. The goal is to create a community of 10,000 members within three years, focusing on academic, product, and recruitment connections in the field [2][4].

Group 1: Community Development
- The community aims to provide a platform for industry professionals to share the latest technological developments, engage in discussions, and access job opportunities [2][3].
- The initiative has already attracted notable figures from companies like Huawei and various leading researchers in the autonomous driving field [2].
- The community is designed to support newcomers by offering structured learning paths and resources to quickly build their technical knowledge [2].

Group 2: Knowledge Sharing and Resources
- The "Autonomous Driving Heart Knowledge Planet" serves as a technical exchange platform, primarily for students and professionals looking to transition into the autonomous driving sector [4][11].
- The community has established connections with numerous companies for recruitment purposes, including well-known names like Xiaomi, NIO, and NVIDIA [4][11].
- Members have access to a wealth of resources, including over 5,000 pieces of content, live sessions with industry experts, and discounts on paid courses [14][18].

Group 3: Technological Focus Areas
- The article outlines key technological areas to focus on by 2025, including visual large language models (VLM), end-to-end trajectory prediction, and 3D generative simulation techniques [6][10].
- The community has developed learning paths covering various subfields such as perception, mapping, and AI model deployment, ensuring comprehensive coverage of the autonomous driving technology stack [11][16].
- Regular live sessions will focus on cutting-edge topics like VLA, large models, and embodied intelligence, providing insights into practical applications and research advancements [19][18].

Group 4: Engagement and Interaction
- The community encourages active participation, with weekly discussions and Q&A sessions to foster engagement among members [12][14].
- It aims to create a supportive environment for both beginners and advanced professionals, facilitating networking and collaboration opportunities [12][11].
- The platform is designed to be a dynamic space where members can freely ask questions and share knowledge, enhancing the overall learning experience [12][11].
Learning end-to-end large models, but still not quite clear on the difference between VLM and VLA...
自动驾驶之心· 2025-06-19 11:54
Core Insights
- The article emphasizes the growing importance of large models (VLM) in the field of intelligent driving, highlighting their potential for practical applications and production [2][4].

Group 1: VLM and VLA
- VLM (Vision-Language Model) focuses on foundational capabilities such as detection, question answering, spatial understanding, and reasoning [4].
- VLA (Vision-Language Action) is more action-oriented, aimed at trajectory prediction in autonomous driving, requiring a deep understanding of human-like reasoning and perception [4].
- It is recommended to learn VLM first before expanding to VLA, as VLM can predict trajectories through diffusion models, enhancing action capabilities in uncertain environments [4].

Group 2: Community and Resources
- The article invites readers to join a knowledge-sharing community that offers comprehensive resources, including video courses, hardware, and coding materials related to autonomous driving [4].
- The community aims to build a network of professionals in intelligent driving and embodied intelligence, with a target of gathering 10,000 members in three years [4].

Group 3: Technical Directions
- The article outlines four cutting-edge technical directions in the industry: Visual Language Models, World Models, Diffusion Models, and End-to-End Autonomous Driving [5].
- It provides links to various resources and papers that cover advancements in these areas, indicating a robust framework for ongoing research and development [6][31].

Group 4: Datasets and Applications
- A variety of datasets are mentioned that are crucial for training and evaluating models in autonomous driving, including pedestrian detection, object tracking, and scene understanding [19][20].
- The article discusses the application of language-enhanced systems in autonomous driving, showcasing how natural language processing can improve vehicle navigation and interaction [20][21].

Group 5: Future Trends
- The article highlights the potential for large models to significantly impact the future of autonomous driving, particularly in enhancing decision-making and control systems [24][25].
- It suggests that the integration of language models with driving systems could lead to more intuitive and human-like vehicle behavior [24][25].
Kaiming He's latest CVPR lecture slides are online: Toward end-to-end generative modeling
机器之心· 2025-06-19 09:30
Core Viewpoint
- The article discusses the evolution of generative models, particularly focusing on the transition from diffusion models to end-to-end generative modeling, highlighting the potential for generative models to replicate the historical advancements seen in recognition models [6][36][41].

Group 1: Workshop Insights
- The workshop led by Kaiming He at CVPR focused on the evolution of visual generative modeling beyond diffusion models [5][7].
- Diffusion models have become the dominant method in visual generative modeling, but they face limitations such as slow generation speed and challenges in simulating complex distributions [6][36].
- Kaiming He's presentation emphasized the need for end-to-end generative modeling, contrasting it with the historical layer-wise training methods prevalent before AlexNet [10][11][41].

Group 2: Recognition vs. Generation
- Recognition and generation can be viewed as two sides of the same coin, where recognition abstracts features from raw data, while generation concretizes abstract representations into detailed data [41][42].
- The article highlights the fundamental differences between recognition tasks, which have a clear mapping from data to labels, and generation tasks, which involve complex, non-linear mappings from simple distributions to intricate data distributions [58].

Group 3: Flow Matching and MeanFlow
- Flow Matching is presented as a promising approach to address the challenges in generative modeling by constructing ground-truth fields that are independent of specific neural network architectures [81].
- The MeanFlow framework introduced by Kaiming He aims to achieve single-step generation by modeling average velocity rather than instantaneous velocity, providing a theoretical basis for network training [83][84] (the average-velocity definition is sketched after this list).
- Experimental results show that MeanFlow significantly outperforms previous single-step diffusion and flow models, achieving an FID score of 3.43, which is over 50% better than the previous best [101][108].

Group 4: Future Directions
- The article concludes with a discussion of ongoing research efforts in the field, including Consistency Models, Two-time-variable Models, and revisiting Normalizing Flows, indicating that the field is still in an early stage akin to the pre-AlexNet era of recognition models [110][113].
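For readers unfamiliar with the average-velocity idea, here is a brief sketch based on the publicly described MeanFlow formulation; the notation is illustrative and may differ from the slides. Flow matching learns an instantaneous velocity field $v(\mathbf{z}_t, t)$, whereas MeanFlow learns the average velocity over an interval $[r, t]$,

$$u(\mathbf{z}_t, r, t) = \frac{1}{t-r}\int_{r}^{t} v(\mathbf{z}_\tau, \tau)\,d\tau .$$

Because the displacement across the whole interval is simply $(t-r)\,u$, a network that predicts $u$ can traverse the entire noise-to-data path in a single evaluation, which is what enables one-step generation. Differentiating the definition with respect to $t$ gives the identity $v(\mathbf{z}_t, t) = u(\mathbf{z}_t, r, t) + (t-r)\,\frac{\mathrm{d}}{\mathrm{d}t}u(\mathbf{z}_t, r, t)$, which supplies a training target without simulating the full trajectory.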
A single Markdown file earns over 400 stars: this survey analyzes 3D scene generation across four major paradigms
机器之心· 2025-06-10 08:41
In the race to build key technologies such as general artificial intelligence, world models, and embodied intelligence, one capability is becoming increasingly central: high-quality 3D scene generation. Over the past three years, research in this area has grown exponentially, with the number of papers nearly doubling each year, reflecting its key role in multimodal understanding, robotics, autonomous driving, and even virtual-reality systems.

Technical routes: a full analysis of the four generative paradigms

Early 3D scene generation work was implemented mainly through procedural generation. Since 2021, with the rise of generative models (especially diffusion models) and the introduction of new 3D representations such as NeRF and 3D Gaussians, the field has entered a phase of explosive growth. Methods have become increasingly diverse and scene-modeling capabilities keep improving, which has also driven a rapid rise in the number of research papers. This trend highlights the urgent need for a systematic review and comprehensive evaluation of the field.

Paper title: 3D Scene Generation: A Survey
Paper link: https://arxiv.org/abs/2505.05474
Curated list: https://github.com/hzxie/Awesome-3D-Scene-Generation

In this survey, the research team builds a systematic taxonomy that divides existing 3D scene generation methods into four mainstream paradigms, each reviewed in depth through representative works. These four ...
Challenging autoregression: diffusion models are rewriting the paradigm for the next generation of general-purpose models
机器之心· 2025-06-04 01:59
Core Viewpoint
- The article discusses the advancements in diffusion language models (dLLMs), particularly focusing on Google's Gemini Diffusion and its implications for AI development, highlighting the speed and performance improvements over traditional autoregressive models [1][8][35].

Group 1: Gemini Diffusion and Its Features
- Gemini Diffusion is noted for its impressive generation speed, being five times faster than previous models, and its ability to handle programming tasks effectively [2][8].
- The underlying mechanism of diffusion models allows for rapid iteration and error correction during the generation process, distinguishing it from autoregressive models [2][3] (a minimal decoding sketch follows after this list).
- Gemini Diffusion's sampling speed can reach an astonishing 1479 tokens per second, showcasing its potential in various benchmarks [8][9].

Group 2: Development of Diffusion Language Models
- Prior to Gemini Diffusion, several research teams explored the feasibility of diffusion-based LLMs, including Stanford's Diffusion-LM and Fudan University's DiffusionBERT [3][4].
- The introduction of LLaDA, the first 8 billion parameter diffusion language model, marked a significant milestone in the field, achieving performance comparable to LLaMA 3 [4][21].
- Following LLaDA, other models like d1 and LaViDa have emerged, further establishing LLaDA as a foundational model in dLLM research [20][21].

Group 3: Multimodal Diffusion Language Models
- The emergence of diffusion multimodal language models (dMLLMs) is highlighted, with LLaDA-V and MMaDA being prominent examples that integrate visual and language processing capabilities [10][31].
- LLaDA-V combines visual instruction fine-tuning with the diffusion mechanism, demonstrating strong performance in multimodal understanding tasks [26][27].
- MMaDA showcases innovations in text reasoning and multimodal understanding, solidifying its position as a leading research outcome in the dMLLM space [31][32].

Group 4: Future Directions and Implications
- The article emphasizes the shift from autoregressive models to diffusion models as a significant paradigm change in AI, suggesting broader implications for future research and applications [35][36].
- The ongoing evolution of models like LLaDA and Gemini Diffusion indicates a growing ecosystem around dLLMs and dMLLMs, with potential applications extending into quantum computing [35][36].
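To illustrate why a diffusion LLM can revise tokens in place rather than commit to them left to right, here is a minimal sketch of masked-denoising decoding in the style used by LLaDA-like models. The function name `diffusion_decode`, the confidence-based commit rule, and the fixed answer length are illustrative assumptions rather than any specific model's actual procedure.

```python
import torch

def diffusion_decode(model, prompt_ids, gen_len=64, steps=8, mask_id=0):
    """Iterative masked denoising (illustrative): start with an all-[MASK] answer,
    predict every masked position in parallel at each step, commit the most
    confident predictions, and leave the rest masked for later revision."""
    device = prompt_ids.device
    x = torch.cat([prompt_ids, torch.full((gen_len,), mask_id, device=device)])
    start = len(prompt_ids)
    for step in range(steps):
        logits = model(x.unsqueeze(0))[0]          # (seq_len, vocab): one parallel forward pass
        conf, pred = logits.softmax(-1).max(-1)    # per-position confidence and best token
        masked = (x == mask_id) & (torch.arange(len(x), device=device) >= start)
        if not masked.any():
            break
        # Commit enough tokens so that (step+1)/steps of the answer is filled in.
        target_filled = int(gen_len * (step + 1) / steps)
        n_commit = max(target_filled - int((~masked[start:]).sum()), 1)
        cand = torch.where(masked)[0]
        top = cand[conf[cand].argsort(descending=True)[:n_commit]]
        x[top] = pred[top]                         # unmask the most confident positions
    return x[start:]
```

Because every step predicts all remaining positions at once, the number of forward passes is `steps` rather than `gen_len`, which is where the large throughput gains over token-by-token autoregressive decoding come from; re-masking low-confidence positions is what permits the in-place error correction described above.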
Multimodal diffusion models are starting to take off; this time it's LaViDa, fast, controllable, and able to learn to reason
机器之心· 2025-05-30 04:16
Core Viewpoint
- The article introduces LaViDa, a large vision-language diffusion model that combines the advantages of diffusion models with the ability to process both visual and textual information effectively [1][5].

Group 1: Model Overview
- LaViDa is a vision-language model that inherits the high speed and controllability of diffusion language models, achieving impressive performance in experiments [1][5].
- Unlike autoregressive large language models (LLMs), diffusion models treat text generation as a diffusion process over discrete tokens, allowing for better handling of tasks requiring bidirectional context [2][3][4].

Group 2: Technical Architecture
- LaViDa consists of a visual encoder and a diffusion language model, connected through a multi-layer perceptron (MLP) projection network [10].
- The visual encoder processes multiple views of an input image, generating a total of 3645 embeddings, which are then reduced to 980 through average pooling for training efficiency [12][13] (a connector sketch follows after this list).

Group 3: Training Methodology
- The training process involves a two-stage approach: pre-training to align visual embeddings with the diffusion language model's latent space, followed by end-to-end fine-tuning for instruction adherence [19].
- A third training phase using distilled samples was conducted to enhance the reasoning capabilities of LaViDa, resulting in a model named LaViDa-Reason [25].

Group 4: Experimental Performance
- LaViDa demonstrates competitive performance across various visual-language tasks, achieving the highest score of 43.3 on the MMMU benchmark and excelling in reasoning tasks [20][22].
- In scientific tasks, LaViDa scored 81.4 and 80.2 on ScienceQA, showcasing its strong capabilities in complex reasoning [23].

Group 5: Text Completion and Flexibility
- LaViDa provides strong controllability for text generation, particularly in text completion tasks, allowing for flexible token replacement based on masked inputs [28][30].
- The model can dynamically adjust the number of tokens generated, successfully completing tasks that require specific constraints, unlike autoregressive models [31][32].

Group 6: Speed and Quality Trade-offs
- LaViDa allows users to balance speed and quality by adjusting the number of diffusion steps, demonstrating flexibility in performance based on application needs [33][35].
- Performance evaluations indicate that LaViDa can outperform autoregressive baselines in speed and quality under certain configurations, highlighting its adaptability [35].
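As a rough illustration of the connector described in Group 2, here is a minimal sketch of average-pooling per-view visual embeddings and projecting them into the language model's hidden space with an MLP. The view count, token counts, pooling window, embedding dimensions, and the class name `VisionProjector` are illustrative assumptions; the summary only states that 3645 embeddings are reduced to 980 by average pooling and passed through an MLP projection network, and the exact ordering and shapes may differ in LaViDa itself.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative connector: average-pool each view's patch embeddings to fewer
    tokens, then project them into the language model's embedding space."""
    def __init__(self, vis_dim=1152, llm_dim=4096, pool=3):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=pool, stride=pool)   # reduce tokens per view
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, view_embeds):
        # view_embeds: (batch, views, tokens_per_view, vis_dim), e.g. 5 views x 729 tokens
        b, v, n, d = view_embeds.shape
        x = view_embeds.reshape(b * v, n, d).transpose(1, 2)      # (b*v, d, n) for 1D pooling
        x = self.pool(x).transpose(1, 2).reshape(b, -1, d)        # concatenate pooled tokens of all views
        return self.proj(x)                                       # (batch, reduced_tokens, llm_dim)

# Example: 5 views x 729 patch tokens -> 5 x 243 = 1215 pooled tokens (toy numbers, not LaViDa's 980)
feats = torch.randn(2, 5, 729, 1152)
print(VisionProjector()(feats).shape)   # torch.Size([2, 1215, 4096])
```

The exact reduction LaViDa reports (3645 to 980 tokens) implies a different pooling configuration than this toy 3x window; the point is only that token pooling trades a small amount of visual detail for a shorter sequence that the diffusion language model has to denoise.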