Diffusion Models
Why did robots collectively abandon "parkour" and all switch to "folding clothes"?
机器人大讲堂· 2025-11-24 15:00
Core Viewpoint
- The robotics industry has shifted focus from showcasing extreme capabilities, such as parkour and dancing, to addressing practical household tasks like folding clothes, indicating a maturation of the market and a response to real consumer needs [3][7][27].

Group 1: Industry Trends
- The initial excitement around robotics was characterized by impressive demonstrations of movement and balance, which attracted capital and interest in the early stages of technology development [27].
- The current trend shows a significant pivot towards practical applications, with companies now prioritizing user needs over mere technical prowess [27][30].
- The emergence of clothes-folding robots reflects a convergence of technological advancements and market demand, as the ability to fold clothes has become a more relatable and desirable function for consumers [9][15].

Group 2: Technological Advancements
- Breakthroughs in robot learning technologies, such as diffusion models and zero-shot learning, have enabled robots to learn tasks like folding clothes from human demonstrations without extensive programming [13] (a minimal training-loop sketch follows this summary).
- The reduction in technical barriers has allowed startups to leverage pre-trained models to create functional demonstrations, making the technology more accessible [13][15].
- Despite advancements, current robotic demonstrations still reveal limitations in precision and adaptability, indicating that further improvements in algorithms and hardware are necessary [29][30].

Group 3: Market Demand and Consumer Expectations
- There is a strong consumer desire for robots that can perform household tasks, with many willing to pay for solutions that alleviate mundane chores like folding clothes [15][26].
- The gap between what companies claim their robots can do and what consumers expect in terms of performance and reliability remains significant [24][26].
- Current demonstrations often fail to address the full scope of household tasks, focusing primarily on the folding action without integrating the entire process from retrieval to storage [24][30].

Group 4: Future Directions
- The industry must continue to focus on practical applications and user needs to drive commercial viability, moving beyond mere technical demonstrations [30].
- As technology matures, there is potential for robots to expand their capabilities to include a wider range of household tasks, provided they remain aligned with consumer demands [29][30].
- The shift towards practical applications signifies a more rational approach to robotics, emphasizing the importance of solving real-world problems over showcasing extreme capabilities [30].
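The "learning from human demonstrations" claim above maps onto the diffusion-policy family of imitation-learning methods. Below is a minimal, hypothetical sketch of such a training step; the module names, dimensions, and noise schedule are illustrative assumptions, not the implementation of any system mentioned in the article.

```python
# Sketch of a diffusion-policy training step: learn to denoise demonstrated
# action chunks (e.g., folding motions) conditioned on an observation embedding.
# All names, shapes, and the schedule are illustrative, not a specific product's code.
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Predicts the noise added to an action chunk, given an observation embedding."""
    def __init__(self, obs_dim=512, act_dim=7, horizon=16):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + horizon * act_dim + 1, 1024), nn.ReLU(),
            nn.Linear(1024, horizon * act_dim),
        )

    def forward(self, obs_emb, noisy_actions, t):
        x = torch.cat([obs_emb, noisy_actions.flatten(1), t[:, None].float()], dim=-1)
        return self.net(x).view(-1, self.horizon, self.act_dim)

def train_step(model, optimizer, obs_emb, demo_actions, n_steps=100):
    """One denoising-objective step on a batch of demonstrated action chunks."""
    t = torch.randint(0, n_steps, (demo_actions.shape[0],))
    alpha_bar = torch.cos(t.float() / n_steps * torch.pi / 2) ** 2  # toy noise schedule
    noise = torch.randn_like(demo_actions)
    noisy = (alpha_bar.sqrt()[:, None, None] * demo_actions
             + (1 - alpha_bar).sqrt()[:, None, None] * noise)
    loss = ((model(obs_emb, noisy, t) - noise) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random stand-ins for image features and demonstrated end-effector actions:
model = NoisePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss = train_step(model, opt, torch.randn(8, 512), torch.randn(8, 16, 7))
```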
NeurIPS 2025 | UniLumos: a unified image and video relighting framework that introduces physical feedback, achieving realistic light-and-shadow reshaping at a 20x speedup!
机器之心· 2025-11-24 09:30
Core Insights
- The article discusses the advancements in image and video relighting technology, particularly focusing on the introduction of UniLumos, a unified framework that enhances physical consistency and computational efficiency in relighting tasks [3][37].

Group 1: Challenges in Existing Methods
- Current methods based on diffusion models face two fundamental challenges: the lack of physical consistency and an inadequate evaluation system for relighting quality [11][12].
- Existing approaches often optimize in semantic latent spaces, leading to physical inconsistencies such as misaligned shadows, overexposed highlights, and incorrect occlusions [15][11].

Group 2: Introduction of UniLumos
- UniLumos is introduced as a solution to the aforementioned challenges, providing a unified framework for image and video relighting that maintains scene structure and temporal consistency while achieving high-quality relighting [17][37].
- The framework incorporates geometric feedback from RGB space, such as depth and normal maps, to align lighting effects with scene structures, significantly improving physical consistency [4][22] (a hypothetical sketch of such a feedback term follows this summary).

Group 3: Innovations and Methodology
- Key innovations include a geometric feedback mechanism to enhance physical consistency and a structured six-dimensional lighting description for fine-grained control and evaluation of lighting effects [18][22].
- The training dataset, LumosData, is constructed to extract high-quality relighting samples from real-world videos, facilitating the training of the model [20][21].

Group 4: Performance and Efficiency
- UniLumos demonstrates superior performance across various metrics, achieving state-of-the-art results in visual fidelity, temporal consistency, and physical accuracy compared to baseline models [27][28].
- The framework achieves a 20-fold increase in inference speed while maintaining high-quality output, making it significantly more efficient than existing methods [33][38].

Group 5: Evaluation and Results
- The LumosBench evaluation framework allows for automated and interpretable assessment of relighting accuracy across six dimensions, showcasing UniLumos's advantages in fine-grained control over lighting attributes [22][29].
- Qualitative results indicate that UniLumos produces more realistic lighting effects and maintains better temporal consistency compared to baseline methods [31][33].
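To make the "geometric feedback from RGB space" idea concrete, here is a rough sketch of a structure-preservation loss that compares depth and normal maps of the relit output against the source frame. The `depth_model` and `normal_model` callables stand in for any frozen geometry estimators; this is an illustrative approximation, not UniLumos's actual implementation.

```python
# Hypothetical geometric-feedback term: penalize a relit frame whose depth / normal
# structure drifts away from the input frame's geometry.
import torch
import torch.nn.functional as F

def geometry_feedback_loss(relit_rgb, source_rgb, depth_model, normal_model,
                           w_depth=1.0, w_normal=1.0):
    """relit_rgb, source_rgb: (B, 3, H, W) tensors in [0, 1].
    depth_model(x)  -> (B, 1, H, W) depth maps.
    normal_model(x) -> (B, 3, H, W) unit normal maps."""
    with torch.no_grad():                       # geometry of the source is the fixed target
        d_src = depth_model(source_rgb)
        n_src = normal_model(source_rgb)
    d_out = depth_model(relit_rgb)              # gradients flow through the relit image
    n_out = normal_model(relit_rgb)
    depth_term = F.l1_loss(d_out, d_src)
    normal_term = (1 - F.cosine_similarity(n_out, n_src, dim=1)).mean()
    return w_depth * depth_term + w_normal * normal_term

# Example with stand-in estimators (replace with real frozen depth/normal networks):
dummy_depth = lambda x: x.mean(dim=1, keepdim=True)
dummy_normal = lambda x: F.normalize(x, dim=1)
loss = geometry_feedback_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64),
                              dummy_depth, dummy_normal)
```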
A University of Notre Dame team builds a new tool for molecular design: letting AI create molecules the way it writes an article
仪器信息网· 2025-11-19 09:08
Core Insights
- The article discusses the breakthrough AI system DemoDiff developed by a team from the University of Notre Dame, which can design new molecular structures by learning from a few examples, significantly accelerating drug and material development processes [7][8][10].

Group 1: AI Understanding of Molecular Design
- DemoDiff mimics human learning by analyzing a few successful molecular examples to understand design patterns, allowing it to generate new candidates quickly [10][11].
- The system can even learn from negative examples, generating high-quality molecules based on poorly performing ones, showcasing its advanced reasoning capabilities [21][22].

Group 2: Innovative Molecular Representation
- The team introduced a new method called "node-pair encoding," which simplifies complex molecular structures, improving efficiency by 5.5 times [9][12].
- This method allows for a significant reduction in the number of atoms needed to describe a molecule, enhancing the AI's ability to process more examples [12][13].

Group 3: Comprehensive Molecular Database
- DemoDiff was trained on an extensive database containing over 1 million molecular structures and 155,000 different molecular properties, providing a rich resource for learning [14][15].
- The database includes various sources, such as the ChEMBL database, which records millions of drug molecules and their biological activities [14][15].

Group 4: Diffusion Model for Molecular Generation
- The core technology of DemoDiff is based on a "diffusion model," which generates molecular structures through a progressive refinement process, ensuring chemical validity [16][17].
- This model incorporates context learning, allowing the AI to adapt its output based on different sets of example molecules [18] (a minimal conditioning sketch follows this summary).

Group 5: Performance Testing and Validation
- DemoDiff underwent rigorous testing across 33 different molecular design tasks, demonstrating performance comparable to much larger AI models [19][20].
- The system excels in generating diverse molecular structures, providing researchers with multiple options for further exploration [20].

Group 6: Negative Learning Capability
- The AI's ability to learn from negative examples allows it to infer what makes a successful molecule, enhancing its design capabilities [21][22].
- This feature is particularly valuable in early drug development stages, where researchers often have more negative examples than positive ones [21][22].

Group 7: Technical Innovations
- The system employs a "graph attention mechanism" to focus on multiple important parts of a molecule simultaneously, ensuring a holistic understanding during generation [23].
- A multi-layer validation mechanism checks the generated molecules against fundamental chemical rules, ensuring their feasibility [23][24].

Group 8: Implications for Molecular Design
- DemoDiff represents a paradigm shift in molecular design, potentially reducing the time and cost associated with drug development significantly [25][26].
- The technology may democratize molecular design, allowing a broader range of researchers to participate in innovation [26].

Group 9: Future Considerations
- While DemoDiff shows impressive capabilities, there is recognition of the need for further improvements, particularly in handling specific design tasks [27].
- Future developments may include expanding the model's scale and enhancing data quality to tackle more complex challenges [27][28].
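The "context learning" described in Group 4 can be illustrated as a denoiser that attends over embeddings of the few demonstration molecules while refining a noisy candidate. The sketch below is an assumption-laden illustration of that idea; the class name, architecture, and dimensions are hypothetical and not DemoDiff's code.

```python
# Illustrative in-context conditioning: a denoiser cross-attends to a handful of
# demonstration-molecule embeddings at every refinement step.
import torch
import torch.nn as nn

class DemoConditionedDenoiser(nn.Module):
    def __init__(self, mol_dim=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(mol_dim, n_heads, batch_first=True)
        self.refine = nn.Sequential(nn.Linear(mol_dim, 512), nn.GELU(),
                                    nn.Linear(512, mol_dim))

    def forward(self, noisy_mol_tokens, demo_embeddings):
        """noisy_mol_tokens: (B, N, D) current noisy molecule representation.
        demo_embeddings: (B, K, D) embeddings of K example molecules (the "demos")."""
        ctx, _ = self.cross_attn(noisy_mol_tokens, demo_embeddings, demo_embeddings)
        return self.refine(noisy_mol_tokens + ctx)   # one denoising refinement step

# Usage: swapping the demo set steers generation toward (or, with negative demos and an
# inverted objective, away from) the demonstrated property profile.
denoiser = DemoConditionedDenoiser()
x = torch.randn(2, 32, 256)        # noisy molecule tokens
demos = torch.randn(2, 5, 256)     # five in-context example molecules
out = denoiser(x, demos)
```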
End-to-end and VLA positions are paying absurdly high salaries...
自动驾驶之心· 2025-11-19 00:03
Core Insights
- There is significant demand for end-to-end and VLA (Vision-Language-Action) technical talent in the automotive industry, with salaries for experts reaching up to $70,000 per month for positions requiring 3-5 years of experience [1]
- The technology stack involved in end-to-end and VLA is complex, covering various advanced algorithms and models such as BEV perception, VLM (Vision-Language Model), diffusion models, reinforcement learning, and world models [2]

Course Offerings
- The company is launching two specialized courses, the "End-to-End and VLA Autonomous Driving Class" and the "Practical Course on VLA and Large Models," aimed at helping individuals enter the field of end-to-end and VLA technologies quickly and efficiently [2]
- The "Practical Course on VLA and Large Models" focuses on VLA, covering topics from VLM as an autonomous-driving interpreter to modular and integrated VLA, including mainstream inference-enhanced VLA [2]
- The course includes a detailed theoretical foundation and practical assignments, teaching participants how to build their own VLA models and datasets from scratch [2]

Instructor Team
- The instructor team consists of experts from both academia and industry, including individuals with extensive research and practical experience in multi-modal perception, autonomous-driving VLA, and large-model frameworks [7][10][13]
- Notable instructors include a Tsinghua University master's graduate with multiple publications in top conferences and a current algorithm expert at a leading domestic OEM [7][13]

Target Audience
- The courses are designed for individuals with foundational knowledge of autonomous driving who are familiar with its basic modules and with concepts related to transformer large models, reinforcement learning, and BEV perception [15]
- Participants are expected to have a background in probability theory and linear algebra, as well as proficiency in Python and PyTorch [15]
We put together an end-to-end advanced-study roadmap, geared toward deployment and job hunting...
自动驾驶之心· 2025-11-18 00:05
Core Insights
- There is significant demand for end-to-end and VLA (Vision-Language-Action) technical talent in the automotive industry, with salaries for experts reaching up to $70,000 per month for positions requiring 3-5 years of experience [1]
- The technology stack for end-to-end and VLA is complex, involving various advanced algorithms such as BEV perception, Vision-Language Models (VLM), diffusion models, reinforcement learning, and world models [1]
- The company is offering specialized courses to help individuals learn end-to-end and VLA technologies quickly and efficiently, collaborating with experts from both academia and industry [1]

Course Offerings
- The "End-to-End and VLA Autonomous Driving Course" focuses on the macro aspects of end-to-end autonomous driving, covering key algorithms and theoretical foundations, including BEV perception, large language models, diffusion models, and reinforcement learning [10]
- The "Autonomous Driving VLA and Large Model Practical Course" is led by academic experts and covers VLA from the perspective of VLM as an autonomous-driving interpreter, modular VLA, and current mainstream inference-enhanced VLA [1][10]
- Both courses include practical components, such as building a VLA model and dataset from scratch, and implementing algorithms like the Diffusion Planner and the ORION algorithm [10][12]

Instructor Profiles
- The instructors include experienced professionals and researchers from top institutions, such as Tsinghua University and QS30 universities, with backgrounds in multimodal perception, autonomous-driving VLA, and large-model frameworks [6][9][12]
- Instructors have published numerous papers in prestigious conferences and have hands-on experience developing and deploying advanced algorithms in the field of autonomous driving [6][9][12]

Target Audience
- The courses are designed for individuals with foundational knowledge of autonomous driving who are familiar with its basic modules and with concepts related to transformer large models, reinforcement learning, and BEV perception [14]
- Participants are expected to have a background in probability theory and linear algebra, as well as proficiency in Python and PyTorch [14]
RAE + VAE? Pre-trained representations empower diffusion-model tokenizers, accelerating the shift from pixel compression to semantic extraction
机器之心· 2025-11-13 10:03
Core Insights
- The article discusses RAE (Representation Autoencoders, as used with Diffusion Transformers) and VFM-VAE from Xi'an Jiaotong University and Microsoft Research Asia, which use "frozen pre-trained visual representations" to enhance the image-generation performance of diffusion models [2][6][28].

Group 1: VFM-VAE Overview
- VFM-VAE combines the probabilistic modeling mechanism of VAE with RAE, systematically studying how compressed pre-trained visual representations affect the structure and performance of LDM systems [2][6].
- Integrating frozen vision foundation models as tokenizers in VFM-VAE significantly accelerates model convergence and improves generation quality, marking an evolution from pixel compression to semantic representation [2][6].

Group 2: Performance Analysis
- Experimental results indicate that distillation-based tokenizers struggle with semantic alignment under perturbations, while maintaining high consistency between the latent space and vision-foundation-model features is crucial for robustness and convergence efficiency [8][19].
- VFM-VAE demonstrates superior performance and training efficiency, achieving a gFID of 3.80 on ImageNet 256×256, outperforming the distillation route's 5.14, and reaching a gFID of 2.22 with explicit alignment in just 80 epochs, improving training efficiency by roughly 10 times [23][24].

Group 3: Semantic Representation and Alignment
- The research team introduced the SE-CKNNA metric to quantify the consistency between the latent space and vision-foundation-model features, which is essential for evaluating the impact on downstream generation performance [7][19] (a rough alignment-score sketch follows this summary).
- VFM-VAE maintains higher average and peak CKNNA scores than distillation-based tokenizers, indicating a more stable alignment of the latent space with vision-foundation-model features [19][21].

Group 4: Future Directions
- The article concludes with the potential for further exploration of the latent space in multimodal generation and complex visual understanding, aiming to complete the transition from pixel compression to semantic representation [29].
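For intuition about what such an alignment metric measures, here is a rough mutual-k-nearest-neighbour score in the spirit of CKNNA: how often the two feature spaces agree on which samples are neighbours. This is an illustrative approximation under simplifying assumptions, not the exact SE-CKNNA definition from the paper.

```python
# Rough k-NN alignment score between tokenizer latents and frozen VFM features
# computed for the same N images; 1.0 means identical neighbourhood structure.
import torch
import torch.nn.functional as F

def knn_alignment(latents, vfm_feats, k=10):
    """latents, vfm_feats: (N, D1) and (N, D2) features for the same N images."""
    def knn_sets(x):
        x = F.normalize(x, dim=-1)
        sim = x @ x.T
        sim.fill_diagonal_(-float("inf"))           # exclude self-matches
        return sim.topk(k, dim=-1).indices          # (N, k) neighbour indices
    a, b = knn_sets(latents), knn_sets(vfm_feats)
    overlap = torch.tensor([
        len(set(a[i].tolist()) & set(b[i].tolist())) / k for i in range(len(a))
    ])
    return overlap.mean().item()

# Example: alignment between a tokenizer's latents and stand-in foundation-model features.
score = knn_alignment(torch.randn(500, 32), torch.randn(500, 768))
```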
Newsflash | A Stanford professor's startup Inception raises a $50 million seed round, using diffusion models to unlock real-time AI applications
Z Potentials· 2025-11-07 02:12
Core Insights
- The article discusses the current surge of funding into AI startups, highlighting it as a golden period for AI researchers to validate their ideas [1]
- Inception, a startup developing diffusion-based AI models, recently raised $50 million in seed funding, led by Menlo Ventures with participation from several notable investors [2]

Company Overview
- Inception is focused on developing diffusion models, which generate outputs through iterative refinement rather than sequential generation [3] (a toy decoding sketch follows this summary)
- The project leader, Stefano Ermon, had been researching diffusion models since before the recent AI boom and aims to apply these models to a broader range of tasks [3]

Technology and Innovation
- Inception has released a new version of its Mercury model, designed specifically for software development, which has been integrated into various development tools [3]
- Ermon claims that diffusion-based models will significantly optimize two critical metrics, latency and computational cost, stating that these models are faster and more efficient than those built by other companies [3][5]
- Diffusion models differ structurally from the autoregressive models that dominate text-based AI services, and are believed to perform better when handling large volumes of text or operating under data limitations [5]

Performance Metrics
- The diffusion models exhibit greater flexibility in hardware utilization, which is increasingly important as AI's infrastructure demands grow [5]
- Ermon's benchmarks indicate that the models can process over 1,000 tokens per second, surpassing existing autoregressive technologies thanks to their inherent support for parallel processing [5]
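The latency argument rests on the fact that diffusion-style text decoding refines many positions per forward pass instead of one token at a time. The toy loop below illustrates that idea only; the function, the masking scheme, and the stand-in network are assumptions for exposition, not Mercury's architecture.

```python
# Toy masked-refinement decoding: every step re-scores ALL masked positions and
# commits the most confident ones, so cost scales with `steps`, not with `gen_len`.
import torch

def diffusion_style_decode(model, prompt_ids, gen_len=64, steps=8, mask_id=0):
    """model(ids) -> (B, L, vocab) logits over the full sequence."""
    pad = torch.full((prompt_ids.size(0), gen_len), mask_id, dtype=prompt_ids.dtype)
    ids = torch.cat([prompt_ids, pad], dim=1)
    start = prompt_ids.size(1)
    for s in range(steps):
        logits = model(ids)                                   # one parallel forward pass
        conf, cand = logits[:, start:].softmax(-1).max(-1)    # per-position confidence
        still_masked = ids[:, start:] == mask_id
        remaining = int(still_masked[0].sum())
        n_reveal = max(1, remaining // (steps - s))           # reveal a fraction each step
        conf = conf.masked_fill(~still_masked, -1.0)
        top = conf.topk(n_reveal, dim=-1).indices
        ids[:, start:] = ids[:, start:].scatter(1, top, cand.gather(1, top))
    return ids

# Example with a stand-in network (a real masked diffusion LM would go here):
dummy = lambda ids: torch.randn(ids.size(0), ids.size(1), 1000)
out = diffusion_style_decode(dummy, torch.randint(1, 1000, (1, 10)))
```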
Shanghai AI Lab releases SDAR, a hybrid diffusion language model: the first open-source diffusion language model to break 6,600 tgs
机器之心· 2025-11-01 04:22
Core Insights
- The article introduces a new paradigm called SDAR (Synergistic Diffusion-AutoRegression) that addresses the slow inference speed and high costs associated with large model applications, which are primarily due to the serial nature of autoregressive (AR) models [2][3][4].

Group 1: SDAR Paradigm
- SDAR effectively decouples training and inference, combining the high performance of AR models with the parallel inference advantages of diffusion models, allowing for low-cost transformation of any AR model into a parallel decoding model [4][11] (a conceptual decoding sketch follows this summary).
- Experimental results show that SDAR not only matches but often surpasses the performance of original AR models across multiple benchmarks, achieving up to a 12.3 percentage point advantage in complex scientific reasoning tasks [6][28].

Group 2: Performance and Efficiency
- SDAR maintains the performance of AR models while significantly improving inference speed and reducing costs, demonstrating that larger models benefit more from parallelization without sacrificing performance [17][19].
- The research indicates that SDAR can be adapted to any mainstream AR model at a low cost, achieving comparable or superior performance in downstream tasks [19][29].

Group 3: Experimental Validation
- The study conducted rigorous experiments to compare SDAR's performance with AR models, confirming that SDAR can achieve substantial speed improvements in real-world applications, with SDAR-8B-chat showing a 2.3 times acceleration over its AR counterpart [23][20].
- The results highlight that SDAR's unique generation mechanism does not compromise its complex reasoning capabilities, retaining long-chain reasoning abilities and excelling in tasks requiring understanding of structured information [28][29].

Group 4: Future Implications
- SDAR represents a significant advancement in the field of large models, providing a powerful and flexible tool that lowers application barriers and opens new avenues for exploring higher performance and efficiency in AI reasoning paradigms [29][31].
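The hybrid idea referenced in Group 1 can be pictured as "autoregressive across blocks, parallel within a block." The sketch below is a conceptual illustration under that assumption; the function, block sizes, and refinement scheme are hypothetical and not SDAR's released code.

```python
# Blocks are produced left-to-right (AR across blocks), while the tokens inside each
# block are re-predicted jointly for a few refinement passes (parallel within a block).
import torch

def blockwise_parallel_decode(model, prompt_ids, n_blocks=4, block_len=32,
                              refine_steps=4, mask_id=0):
    """model(ids) -> (B, L, vocab) logits over the full sequence."""
    ids = prompt_ids
    for _ in range(n_blocks):
        block = torch.full((ids.size(0), block_len), mask_id, dtype=ids.dtype)
        ids = torch.cat([ids, block], dim=1)
        start = ids.size(1) - block_len
        for _ in range(refine_steps):                      # parallel refinement within a block
            logits = model(ids)
            ids[:, start:] = logits[:, start:].argmax(-1)  # overwrite the whole block at once
    return ids

# Example with a stand-in network (a real hybrid AR/diffusion LM would go here):
dummy = lambda ids: torch.randn(ids.size(0), ids.size(1), 1000)
out = blockwise_parallel_decode(dummy, torch.randint(1, 1000, (1, 8)))
```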
Another path for visual generation: principles and practice of the Infinity autoregressive architecture
AI前线· 2025-10-31 05:42
Core Insights
- The article discusses the significant advancements in visual autoregressive models, particularly highlighting the potential of these models in the context of AI-generated content (AIGC) and their competitive edge against diffusion models [2][4][11].

Group 1: Visual Autoregressive Models
- Visual autoregressive models (VAR) utilize a "coarse-to-fine" approach, starting with low-resolution images and progressively refining them to high-resolution outputs, which aligns more closely with human visual perception [12][18] (a conceptual sketch follows this summary).
- The VAR model architecture includes an improved VQ-VAE that employs a hierarchical structure, allowing for efficient encoding and reconstruction of images while minimizing token usage [15][30].
- VAR has demonstrated superior image generation quality compared to existing models like DiT, showcasing a robust scaling curve that indicates performance improvements with increased model size and computational resources [18][49].

Group 2: Comparison with Diffusion Models
- Diffusion models operate by adding Gaussian noise to images and then training a network to reverse this process, maintaining the original resolution throughout [21][25].
- The key advantages of VAR over diffusion models include higher training parallelism and a more intuitive process that mimics human visual cognition, although diffusion models can correct errors through iterative refinement [27][29].
- VAR's approach allows for faster inference times, with the Infinity model achieving significant speed improvements over comparable diffusion models [46][49].

Group 3: Innovations in Tokenization and Error Correction
- The Infinity framework introduces a novel "bitwise tokenizer" that enhances reconstruction quality while allowing for a larger vocabulary size, thus improving detail and instruction adherence in generated images [31][41].
- A self-correction mechanism is integrated into the training process, enabling the model to learn from previous errors and significantly reducing cumulative error during inference [35][40].
- The findings indicate that larger models benefit from larger vocabularies, reinforcing the reliability of scaling laws in model performance [41][49].
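The "coarse-to-fine" generation loop in Group 1 can be caricatured as predicting a whole residual token map at each scale, conditioned on everything generated at coarser scales. The sketch below is an illustrative assumption about the control flow only; the interfaces and stand-in modules are hypothetical and not the Infinity codebase.

```python
# Next-scale autoregression sketch: each iteration predicts all tokens of one scale
# at once and adds them, upsampled, to a running full-resolution latent.
import torch
import torch.nn.functional as F

def next_scale_generate(transformer, decoder, scales=(1, 2, 4, 8, 16), dim=16, out_res=16):
    """transformer(cond) -> (1, dim, s, s) residual features for the current scale.
    decoder(latent)      -> decoded RGB image."""
    latent = torch.zeros(1, dim, out_res, out_res)               # running full-resolution latent
    for s in scales:
        cond = F.interpolate(latent, size=(s, s), mode="area")   # coarse view of what exists
        residual = transformer(cond)                             # predict all s*s tokens at once
        latent = latent + F.interpolate(residual, size=(out_res, out_res),
                                        mode="bilinear", align_corners=False)
    return decoder(latent)

# Example with stand-in modules (a real VAR-style transformer/decoder would go here):
dummy_transformer = lambda cond: torch.randn(1, cond.size(1), cond.size(2), cond.size(3))
dummy_decoder = lambda z: torch.sigmoid(z[:, :3])   # pretend the first 3 channels are RGB
img = next_scale_generate(dummy_transformer, dummy_decoder)
```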
Nearly 500 pages, the most comprehensive diffusion-model handbook to date: a book by Yang Song and co-authors covering the three mainstream perspectives
机器之心· 2025-10-29 07:23
Core Viewpoint
- The article discusses the comprehensive guide on diffusion models, highlighting their transformative impact on generative AI across various domains such as images, audio, video, and 3D environments [2][4].

Summary by Sections

Introduction to Diffusion Models
- Diffusion models are presented as a method that views the generation process as a gradual transformation over time, contrasting with traditional generative models that directly learn mappings from noise to data [11].
- The article emphasizes the need for a systematic understanding of diffusion models, which the book aims to provide, making it a valuable resource for both researchers and beginners [6][9].

Core Principles of Diffusion Models
- The book outlines the foundational principles of diffusion models, connecting three key perspectives: variational methods, score-based methods, and flow-based methods, which together form a unified theoretical framework [11][13] (the standard formulation these views share is summarized after this section).
- It discusses how these models achieve efficient sample generation and enhanced controllability during the generation process [12].

Detailed Exploration of Perspectives
- The variational view relates to denoising diffusion probabilistic models (DDPMs), providing a basis for probabilistic inference and optimization [23].
- The score-based view focuses on learning score functions to guide the denoising process, linking diffusion modeling with classical differential equation theory [23][24].
- The flow-based view describes the generation process as a continuous flow transformation, allowing for broader applications beyond simple generation tasks [24].

Sampling Techniques and Efficiency
- The article highlights the unique feature of diffusion models, which refine samples from coarse to fine through noise removal, and discusses the trade-off between performance and efficiency [27][28].
- It introduces methods for improving sampling performance without retraining models, such as classifier guidance and advanced numerical solvers, to enhance generation quality and speed [29][30].

Learning Fast Generative Models
- The book explores strategies for directly learning fast generative models that approximate the diffusion process, aiming to reduce reliance on multi-step inference [30][31].
- Distillation-based methods are discussed, where a student model mimics a slower teacher model to achieve faster sampling while maintaining quality [30].

Comprehensive Coverage of Diffusion Models
- The book aims to establish a lasting theoretical framework for diffusion models, focusing on continuous-time dynamical systems that connect simple prior distributions to data distributions [33].
- It emphasizes the importance of understanding the underlying principles and connections between different methods to design and improve next-generation generative models [36].
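For readers new to the three perspectives listed above, the equations below state the commonly used formulation that connects them; these are the standard DDPM / score-matching forms rather than text reproduced from the book.

```latex
% Variational (DDPM) view: forward noising process and denoising training objective.
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right),
\qquad
\mathcal{L} = \mathbb{E}_{x_0,\,\epsilon,\,t}
\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right)\right\rVert^2\right].

% Score-based view: the same network estimates the score of the noised marginal,
\nabla_{x_t}\log q_t(x_t) \approx -\,\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}},

% which plugs into the probability-flow ODE (flow-based view) used for deterministic sampling:
\frac{\mathrm{d}x_t}{\mathrm{d}t} = f(t)\,x_t - \tfrac{1}{2}\,g(t)^2\,\nabla_{x_t}\log q_t(x_t).
```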