Simple Yet Powerful: How Does the New Generative Model "Discrete Distribution Networks (DDN)" Achieve Such a Simple Principle and Unique Properties?
机器之心· 2025-08-16 05:02
Core Viewpoint
- The article introduces a novel generative model called Discrete Distribution Networks (DDN), which offers unique features and capabilities in generating and reconstructing data, particularly in the context of zero-shot conditional generation and end-to-end differentiability [4][8][33].

Group 1: Overview of DDN
- DDN employs a mechanism that generates K outputs simultaneously during a single forward pass, creating a discrete distribution of outputs [5][6].
- The training objective is to optimize the positions of these sample points to closely approximate the true distribution of the training data [7].
- DDN is characterized by three main features: Zero-Shot Conditional Generation (ZSCG), tree-structured one-dimensional discrete latent variables, and full end-to-end differentiability [8].

Group 2: DDN Mechanism
- DDN can reconstruct data similarly to Variational Autoencoders (VAE) by mapping data to latent representations and generating highly similar reconstructed images [12].
- The reconstruction process involves multiple layers: each layer generates K outputs, and the output most similar to the target is selected as the condition for the next layer [14][15].
- The training process mirrors the reconstruction process, with the addition of computing a loss on the selected output at each layer [16].

Group 3: Unique Features of DDN
- DDN supports zero-shot conditional generation, allowing the model to generate images based on conditions it has never seen during training, such as text prompts or low-resolution images [24][26].
- The model can efficiently guide the sampling process using purely discriminative models, promoting a unification of generative and discriminative models [28][29].
- DDN's latent space is structured as a tree, providing a highly compressed representation of data, which can be visualized to understand its structure [36][39].

Group 4: Future Research Directions
- Potential research directions include improving DDN through parameter tuning and theoretical analysis, applying DDN in fields such as image denoising and unsupervised clustering, and integrating DDN with existing generative models for enhanced capabilities [41][42].
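The select-and-condition mechanism described in Groups 1–2 can be sketched in a few lines. This is a minimal illustration under assumed interfaces, not the paper's implementation: `ddn_layer_select` and `ddn_training_pass` are hypothetical names, and the stand-in `layers` are plain callables where a real DDN uses neural network layers producing K images.

```python
import numpy as np

def ddn_layer_select(candidates, target):
    """Pick, among one layer's K candidate outputs, the one closest
    to the training target (here by L2 distance)."""
    dists = np.linalg.norm(candidates - target, axis=1)  # (K,)
    k_star = int(np.argmin(dists))
    return k_star, candidates[k_star]

def ddn_training_pass(layers, target):
    """One DDN training/reconstruction pass: each layer emits K outputs
    in a single forward call, the best match becomes the condition for
    the next layer, and a loss is computed only on the selected output."""
    condition = np.zeros_like(target)  # start from an uninformative condition
    latent, losses = [], []
    for layer in layers:
        candidates = layer(condition)              # (K, D) outputs at once
        k_star, best = ddn_layer_select(candidates, target)
        latent.append(k_star)                      # chosen index at this depth
        losses.append(float(np.mean((best - target) ** 2)))
        condition = best                           # condition the next layer
    return latent, losses
```

Following the chosen index at each successive layer traces a root-to-leaf path over K branches per level, which is why the discrete latent is tree-structured.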
A Roundup of Embodied-AI Work Combining LLMs with Reinforcement Learning and World Models
具身智能之心· 2025-07-30 00:02
Core Insights
- The article discusses recent advancements in embodied intelligence, particularly focusing on the integration of large language models (LLMs) with reinforcement learning and world models for various applications in artificial intelligence [2][3].

Group 1: UniSim and Real-World Simulators
- UniSim aims to learn general real-world interactive simulators through generative modeling, revealing that diverse natural datasets can enhance the learning of realistic simulations [3].
- The research demonstrates that high-level visual language strategies and low-level reinforcement learning strategies can be trained in a simulated environment and applied directly to real-world scenarios without additional training [3].

Group 2: Causal World Models
- The study from Google DeepMind asserts that robust agents must learn causal models to generalize across varying distributions, providing a clear answer to a long-standing question in the field [5].

Group 3: MAMBA Framework
- MAMBA introduces an efficient world model approach for meta-reinforcement learning, achieving up to 15 times improvement in sample efficiency while performing well in high-dimensional tasks [8].

Group 4: EMMA and Multimodal Agents
- EMMA leverages LLMs trained in text-based worlds to guide visual world training, resulting in a significant performance boost of 20%-70% in task success rates compared to existing visual language models [10].

Group 5: Text2Reward Framework
- The Text2Reward framework allows for the automatic generation and optimization of dense reward functions using LLMs, achieving over 94% success rates in new motion behaviors and enhancing strategy performance through human feedback [13][14].

Group 6: Online Continual Learning
- The proposed online continual learning frameworks (Behavior-IL and Environment-IL) enable agents to learn continuously in real-world settings without relying on task boundary information, significantly outperforming existing methods [17][18].

Group 7: AMAGO Framework
- AMAGO addresses challenges in generalization and long-term memory in reinforcement learning, demonstrating superior scalability and performance in complex tasks [21].

Group 8: PDDL and Planning with LLMs
- The research presents a novel paradigm for task planning using pre-trained LLMs, effectively integrating human feedback and reducing the need for extensive manual corrections in planning tasks [22][23].
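The Text2Reward framework in Group 5 has an LLM write dense reward code from a task description. Below is a hand-written example of the kind of reward function such a pipeline might emit for a reaching task; the name, weights, and success threshold are illustrative assumptions, not the paper's actual output.

```python
import math

def dense_reach_reward(ee_pos, goal_pos, action):
    """Dense reward for a reaching task: shaped distance term,
    control-effort penalty, and a sparse success bonus."""
    dist = math.dist(ee_pos, goal_pos)          # end-effector to goal
    reach = -dist                               # dense shaping: closer is better
    ctrl = -0.01 * sum(a * a for a in action)   # discourage large actions
    bonus = 1.0 if dist < 0.05 else 0.0         # bonus once near the goal
    return reach + ctrl + bonus
```

The dense `reach` term is what distinguishes this from a sparse success-only reward: it gives the policy a learning signal at every step, which is why such functions are reported to improve sample efficiency.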
Shanghai Qi Zhi Institute & Tsinghua! BEV-VAE: The First Self-Supervised BEV-Perspective VAE, Leaping from Image Generation to Scene Generation
自动驾驶之心· 2025-07-08 12:45
Core Viewpoint
- The article discusses the BEV-VAE method, which enables precise generation and manipulation of multi-view images in autonomous driving, emphasizing the importance of structured representation for understanding three-dimensional scenes [2][4][28].

Group 1: Methodology
- BEV-VAE employs a variational autoencoder (VAE) to learn a compact and unified bird's-eye view (BEV) latent space, followed by a Diffusion Transformer for generating spatially consistent multi-view images [2][7].
- The model supports generating images from any camera configuration while incorporating three-dimensional layout information for control [2][11].
- The architecture consists of an encoder, decoder, and a StyleGAN discriminator, ensuring spatial consistency among images from different views [7][8].

Group 2: Advantages
- BEV-VAE provides a structured representation that captures the complete semantics and spatial structure of multi-view images, simplifying the construction of world models [28].
- The model decouples spatial modeling from generative modeling, enhancing the efficiency of the learning process [28].
- It is compatible with various camera configurations, demonstrating cross-platform applicability [28].

Group 3: Experimental Results
- Experiments on the nuScenes and Argoverse 2 (AV2) datasets show that BEV-VAE outperforms existing models in multi-view image reconstruction and generation tasks [21][22].
- The model's performance improves with higher latent dimensions, achieving a PSNR of 26.32 and an SSIM of 0.7455 at a latent shape of 32 × 32 × 32 [22].
- BEV-VAE allows for fine-grained editing of objects in scenes, successfully learning the three-dimensional structure and complete semantics of the environment [18][19].

Group 4: Conclusion
- BEV-VAE significantly lowers the barriers for applying generative models in autonomous driving, enabling researchers to participate in building and expanding world models with lower costs and higher efficiency [28].
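A shape-level sketch of the Group 1 pipeline, under assumed interfaces: `encode` fuses the N camera views into one BEV latent distribution, and `decode` renders each view back from the shared latent given its camera pose. None of these names come from the paper's code; real BEV-VAE components are learned networks.

```python
import numpy as np

def bev_vae_forward(views, encode, decode, cam_poses, rng=None):
    """N multi-view images -> one shared BEV latent -> N reconstructions.
    `encode(views) -> (mu, logvar)` and `decode(z, pose) -> image` are
    stand-ins for the learned encoder and decoder."""
    rng = np.random.default_rng() if rng is None else rng
    mu, logvar = encode(views)                  # fuse all views into BEV space
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps         # VAE reparameterization
    recons = [decode(z, pose) for pose in cam_poses]  # one image per camera
    return z, recons
```

Because every view is decoded from the same latent `z`, cross-camera spatial consistency is built into the representation, and the Diffusion Transformer can then generate in this compact latent space rather than in pixel space.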
In-Depth Reading of Kaiming He's CVPR 2025 Talk: How Do Generative Models Move Toward End-to-End?
自动驾驶之心· 2025-06-28 13:34
Core Viewpoint
- The article discusses the evolution of generative models in deep learning, drawing parallels to the revolutionary changes brought by AlexNet in recognition models, and posits that generative models may be on the brink of a similar breakthrough with the introduction of MeanFlow, which simplifies the generation process from multiple steps to a single step [1][2][35].

Group 1: Evolution of Recognition Models
- Prior to AlexNet, layer-wise training was the dominant method for training recognition models, which involved optimizing each layer individually, leading to complex and cumbersome training processes [2][3].
- The introduction of AlexNet in 2012 marked a significant shift to end-to-end training, allowing the entire network to be trained simultaneously, greatly simplifying model design and improving performance [3][7].

Group 2: Current State of Generative Models
- Generative models today resemble the pre-AlexNet era of recognition models, relying on multi-step reasoning processes, such as diffusion models and autoregressive models, which raises the question of whether they are in a similar "pre-AlexNet" phase [7][9].
- The article emphasizes the need for generative models to transition from multi-step reasoning to end-to-end generation to achieve a revolutionary breakthrough [7][35].

Group 3: Relationship Between Recognition and Generation
- Recognition and generation can be viewed as two sides of the same coin: recognition is an abstract process that extracts semantic information from data, while generation is a concrete process that transforms abstract representations into realistic data samples [13][15][16].
- The fundamental difference lies in the nature of the mapping: recognition has a deterministic mapping from data to labels, while generation involves a highly nonlinear mapping from noise to complex data distributions, presenting both opportunities and challenges [18][20].

Group 4: Flow Matching and Mean Flows
- Flow matching is a key exploration direction for addressing the challenges faced by generative models, aiming to construct a flow field of data distributions to facilitate generation [20][22].
- Mean Flows, a recent method introduced by Kaiming He, seeks to achieve one-step generation by replacing complex integral calculations with average velocity computations, significantly enhancing generation efficiency [24][27][29].
- In experiments, Mean Flows demonstrated impressive performance on ImageNet tasks, achieving an FID score of 3.43 with a single function evaluation, outperforming traditional multi-step models [31][32].

Group 5: Future Directions and Challenges
- The article outlines several future research directions, including consistency models, two-time-variable models, and revisiting normalizing flows, while questioning whether generative models are still in the "pre-AlexNet" era [33][34].
- Despite the advancements made by Mean Flows, the challenge remains to identify a truly effective formulation for end-to-end generative modeling, which is an exciting and open research question [34][35].
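The "average velocity instead of complex integrals" idea in Group 4 can be made concrete. A sketch in common flow-matching notation (a summary of the published MeanFlow formulation; symbols here follow the usual convention of `v` for the instantaneous velocity field and `z_t` for the state at time `t`):

```latex
% Average velocity over the interval [r, t], as opposed to the
% instantaneous velocity v used in standard flow matching:
u(z_t, r, t) \triangleq \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau

% Differentiating with respect to t yields the MeanFlow identity,
% a training target that avoids computing the integral itself:
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt}\, u(z_t, r, t)

% One-step generation then jumps from noise z_1 directly to data:
z_0 = z_1 - u(z_1, 0, 1)
```

A network trained to output `u` can therefore traverse the whole interval in a single evaluation, which is what enables the one-function-evaluation FID of 3.43 cited above.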
ICML 2025 Spotlight | A New Theoretical Framework Unlocks Guided Generation for Flow Matching Models
机器之心· 2025-06-28 02:54
Core Viewpoint
- The article introduces a novel energy guidance theoretical framework for flow matching models, addressing the gap in energy guidance algorithms within this context and proposing various practical algorithms suitable for different tasks [2][3][27].

Summary by Sections

Research Background
- Energy guidance is a crucial technique in the application of generative models, ideally altering the distribution of generated samples to align with a specific energy function while maintaining adherence to the training-set distribution [7][9].
- Existing energy guidance algorithms primarily target diffusion models, which differ fundamentally from flow matching models, necessitating a general energy guidance theoretical framework for flow matching [9].

Method Overview
- The authors derive a general flow matching energy guidance vector field from the foundational definitions of flow matching models, leading to three categories of practical, training-free energy guidance algorithms [11][12].
- The guidance vector field is designed to steer the original vector field toward regions of lower energy function values [12].

Experimental Results
- Experiments were conducted on synthetic data, offline reinforcement learning, and image linear inverse problems, demonstrating the effectiveness of the proposed algorithms [20][22].
- On synthetic datasets, the Monte Carlo sampling-based guidance algorithm produced results closest to the ground-truth distribution, validating the correctness of the flow matching guidance framework [21].
- In offline reinforcement learning tasks, Monte Carlo sampling guidance performed best, owing to the need for stable guidance samples across different time steps [23].
- For image inverse problems, the Gaussian approximation guidance and GDM showed optimal performance, while Monte Carlo sampling struggled due to high dimensionality [25].

Conclusion
- The work fills a significant gap in energy guidance algorithms for flow matching models, providing a new theoretical framework and several practical algorithms, along with theoretical analysis and experimental comparisons to guide real-world applications [27].
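As intuition for the Monte Carlo sampling-based family of guidance algorithms, here is a toy sketch. The function name, interfaces, and the particular self-normalized estimator are assumptions for illustration, not the paper's algorithm: candidates near the current state are weighted by exp(-βE), and the base vector field is nudged toward their low-energy weighted mean.

```python
import numpy as np

def mc_energy_guided_field(x_t, t, velocity, energy, n_samples=256,
                           beta=1.0, sigma=0.5, rng=None):
    """Toy Monte Carlo energy guidance: perturb candidates around x_t,
    weight them by exp(-beta * E), and steer the base flow-matching
    vector field toward the low-energy weighted mean."""
    rng = np.random.default_rng(0) if rng is None else rng
    v = velocity(x_t, t)                                   # base vector field
    cands = x_t + sigma * rng.standard_normal((n_samples,) + x_t.shape)
    w = np.exp(-beta * np.array([energy(c) for c in cands]))
    w /= w.sum()                                           # importance weights
    low_energy_mean = (w[:, None] * cands).sum(axis=0)
    return v + (low_energy_mean - x_t)                     # guidance correction
```

This also illustrates the dimensionality problem reported for image inverse tasks: in high dimensions the exp(-βE) weights concentrate on a handful of candidates, so the estimator degrades unless the sample count grows prohibitively.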
Kaiming He's Latest CVPR Lecture Slides Are Online: Toward End-to-End Generative Modeling
机器之心· 2025-06-19 09:30
Core Viewpoint
- The article discusses the evolution of generative models, particularly the transition from diffusion models to end-to-end generative modeling, highlighting the potential for generative models to replicate the historical advancements seen in recognition models [6][36][41].

Group 1: Workshop Insights
- The workshop led by Kaiming He at CVPR focused on the evolution of visual generative modeling beyond diffusion models [5][7].
- Diffusion models have become the dominant method in visual generative modeling, but they face limitations such as slow generation speed and challenges in simulating complex distributions [6][36].
- Kaiming He's presentation emphasized the need for end-to-end generative modeling, contrasting it with the historical layer-wise training methods prevalent before AlexNet [10][11][41].

Group 2: Recognition vs. Generation
- Recognition and generation can be viewed as two sides of the same coin: recognition abstracts features from raw data, while generation concretizes abstract representations into detailed data [41][42].
- The article highlights the fundamental differences between recognition tasks, which have a clear mapping from data to labels, and generation tasks, which involve complex, non-linear mappings from simple distributions to intricate data distributions [58].

Group 3: Flow Matching and MeanFlow
- Flow Matching is presented as a promising approach to address the challenges in generative modeling by constructing ground-truth fields that are independent of specific neural network architectures [81].
- The MeanFlow framework introduced by Kaiming He aims to achieve single-step generation by modeling average velocity rather than instantaneous velocity, providing a theoretical basis for network training [83][84].
- Experimental results show that MeanFlow significantly outperforms previous single-step diffusion and flow models, achieving an FID score of 3.43, which is over 50% better than the previous best [101][108].

Group 4: Future Directions
- The article concludes with a discussion of ongoing research efforts in the field, including Consistency Models, Two-time-variable Models, and revisiting Normalizing Flows, indicating that the field is still in an early stage akin to the pre-AlexNet era of recognition models [110][113].
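At inference time, the single-step generation described in Group 3 reduces to one network call. A sketch under an assumed interface `u_net(z, r, t)` for a trained average-velocity network (the function name is hypothetical):

```python
import numpy as np

def meanflow_sample_one_step(u_net, shape, rng=None):
    """One-step MeanFlow-style sampling: draw noise at t=1 and jump
    directly to t=0 using the learned average velocity over [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(shape)             # noise endpoint of the flow
    x = z - (1.0 - 0.0) * u_net(z, 0.0, 1.0)   # z_r = z_t - (t - r) * u(z_t, r, t)
    return x
```

Contrast this with a diffusion sampler, which loops a similar update over many small steps; collapsing the loop into a single function evaluation is what the 1-NFE FID of 3.43 cited above refers to.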