Generative Models
Andrew Ng's Deep Learning Course CS230 Gets a Fall Refresh, Adding a New GPT-5 Module
机器之心· 2025-10-04 03:38
Core Viewpoint
- The updated CS230 Deep Learning course at Stanford, taught by Andrew Ng, emphasizes the importance of artificial intelligence, likening it to electricity, and introduces new content reflecting the latest advancements in AI, particularly the GPT-5 model [1][4].

Course Structure and Content
- The course adopts a flipped classroom model in which students watch Coursera's deeplearning.ai videos before attending in-person classes [3].
- Since its inception in 2017, the course has kept a similar core framework while integrating updates relevant to recent AI developments, including a new chapter on GPT-5 [4].
- The course expands the discussion of generative models and incorporates popular technologies such as RAG and AI Agents, using GPT-5 for case studies [6].
- CS230 aims to provide comprehensive knowledge of deep learning, covering both the theoretical foundations and the practical skills needed to build and apply deep learning models [10][12].

Key Topics Covered
The course covers a wide range of topics, including:
- Basics of neural networks and deep learning [20].
- Optimization techniques such as regularization, the Adam optimizer, hyperparameter tuning, Dropout, and Batch Normalization [20].
- Strategies for taking machine learning projects from conception to successful deployment [20].
- In-depth understanding of Convolutional Neural Networks (CNNs) and their applications in image classification and detection [20].
- Mastery of Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks for sequence tasks [20].
- Advanced topics such as Generative Adversarial Networks (GANs) and deep reinforcement learning [20].
- Insights from industry and academia, along with practical career advice in AI [20].

Course Schedule
- The fall 2025 course runs for approximately 10 weeks, starting at the end of September [15].
- Weekly topics include an introduction to deep learning, neural network basics, CNNs, RNNs, optimization algorithms, generative models, and advanced topics related to GPT-5 [16].
OpenAI's Yang Song Poached by Meta: A Key Figure in the Rise of Diffusion Models Joins MSL, Reuniting with Tsinghua Alumnus Shengjia Zhao
36Kr· 2025-09-26 03:19
Core Insights
- Meta has successfully recruited Yang Song, a prominent researcher from OpenAI, a move that has caused significant surprise within the industry [1][6][8].

Group 1: Yang Song's Background and Achievements
- Yang Song is recognized as a key contributor to the rise of diffusion models and has made significant advances in addressing their limitations [9][13].
- He was admitted to Tsinghua University at the age of 16 and later earned his PhD from Stanford University under the supervision of Stefano Ermon [20][22].
- During his time at OpenAI, he led the Strategic Explorations Team and was instrumental in developing Consistency Models, which outperform diffusion models in speed and performance [10][11][13].

Group 2: Impact of Recruitment on Meta
- The recruitment of Yang Song is part of Meta's broader strategy of attracting top talent from leading AI research organizations, signaling a focus on strengthening its capabilities in AI and machine learning [6][8].
- Industry insiders believe the motivations for such moves are not purely financial, as many of the recruited researchers have already achieved significant wealth [8].
- Yang Song's move to Meta is seen as a strategic advantage for the company, potentially positioning it to lead the development of next-generation AI models [6][24].
OpenAI's Yang Song Poached by Meta! A Key Figure in the Rise of Diffusion Models Joins MSL, Reuniting with Tsinghua Alumnus Shengjia Zhao
量子位· 2025-09-25 13:00
Core Viewpoint
- Meta has successfully recruited Yang Song, a prominent researcher from OpenAI, drawing significant attention in the AI research community because of his notable contributions to diffusion models and generative modeling [1][6][7].

Group 1: Yang Song's Background and Achievements
- Yang Song is recognized as a key contributor to the rise of diffusion models and led OpenAI's Strategic Explorations Team [10][11].
- He was admitted to Tsinghua University at the age of 16 and later earned his PhD from Stanford University under the supervision of Stefano Ermon [20][36].
- His best-known work includes the development of Consistency Models, which outperform diffusion models in speed and performance, generating images significantly faster [12][14][17].

Group 2: Impact of Yang Song's Work
- Consistency Models can generate 64 images at 256×256 pixels in approximately 3.5 seconds, a substantial improvement over existing models (background equations on the one-step mechanism follow this summary) [12][14].
- His research led to Continuous-Time Consistency Models, which address the stability and scalability issues of earlier models and have been trained at a scale of 1.5 billion parameters [15][18].
- The advances made by Yang Song and his team are considered potential game-changers in generative modeling, with some discussions suggesting they could "end" the dominance of diffusion models [18][19].

Group 3: Meta's Strategic Recruitment
- Meta's recruitment of Yang Song is part of a broader strategy to strengthen its AI capabilities by attracting top talent from leading organizations such as OpenAI [9][10].
- The move is seen as a significant loss for OpenAI, with many colleagues expressing surprise at his departure [6][7].
- The motivations behind such moves are speculated to extend beyond financial incentives, as many researchers prioritize impactful work and collaboration opportunities [9].
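For readers wondering how a single forward pass can replace an iterative sampler, the standard formulation from the consistency-models literature (background added here, not drawn from the article) is a network that maps any point on a probability-flow ODE trajectory straight back to the trajectory's origin:

```latex
% A consistency function f maps every point on a probability-flow ODE
% trajectory \{x_t\}_{t \in [\epsilon, T]} to the trajectory's origin:
f(x_t, t) = x_\epsilon \quad \text{for all } t \in [\epsilon, T].
% Self-consistency: any two points on the same trajectory share one output,
f(x_t, t) = f(x_{t'}, t') \quad \forall\, t, t' \in [\epsilon, T],
% so sampling needs a single evaluation, x = f(x_T, T) with
% x_T \sim \mathcal{N}(0, T^2 I), instead of a long iterative denoising chain.
```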
Breaking | Runway Crosses Into Robotics With Over $500 Million Raised, as AI World Models Become a New Engine for Simulated-Reality Training
Z Potentials· 2025-09-02 03:58
Core Insights
- Runway has expanded its focus beyond creative industries to pursue opportunities in robotics, raising over $500 million in funding at a valuation of $3 billion [3][4].
- The company is known for its AI models that generate videos and images, with recent releases including the Gen-4 video generation model and the Runway Aleph video editing model [3][4].
- Runway's technology is being used by robotics and autonomous vehicle companies for training simulations, which are more cost-effective and scalable than real-world training [4][5].

Funding and Valuation
- Runway has raised more than $500 million from investors including Nvidia, Google, and General Atlantic, reaching a valuation of $3 billion [3].

Technology and Applications
- The company's world models are designed to create realistic simulations, which are now attracting interest from the robotics and autonomous vehicle sectors [3][4].
- Runway's models allow detailed testing of specific variables and scenarios without altering other factors in the environment, making it easier to simulate different operational outcomes [5].

Future Directions
- Runway does not plan to build a completely separate product line for robotics and autonomous vehicles, but will refine its existing models to better serve these industries [5][6].
- The company's core philosophy centers on the concept of simulation, which can be applied across markets and industries as the capabilities of generative models improve [6].
Simplicity Is Power: How Does the New Generative Model "Discrete Distribution Networks (DDN)" Achieve a Simple Principle with Unique Properties?
机器之心· 2025-08-16 05:02
Core Viewpoint
- The article introduces a novel generative model called Discrete Distribution Networks (DDN), which offers unique features and capabilities for generating and reconstructing data, particularly zero-shot conditional generation and end-to-end differentiability [4][8][33].

Group 1: Overview of DDN
- DDN employs a mechanism that generates K outputs simultaneously in a single forward pass, forming a discrete distribution over outputs [5][6].
- The training objective is to optimize the positions of these sample points so that they closely approximate the true distribution of the training data [7].
- DDN is characterized by three main features: Zero-Shot Conditional Generation (ZSCG), tree-structured one-dimensional discrete latent variables, and full end-to-end differentiability [8].

Group 2: DDN Mechanism
- DDN can reconstruct data similarly to Variational Autoencoders (VAE), mapping data to latent representations and generating highly similar reconstructed images [12].
- The reconstruction process involves multiple layers: each layer generates K outputs, and the output most similar to the target is selected as the condition for the next layer (a minimal code sketch of this loop follows this summary) [14][15].
- The training process mirrors the reconstruction process, with the addition of computing a loss on the selected output at each layer [16].

Group 3: Unique Features of DDN
- DDN supports zero-shot conditional generation, allowing the model to generate images from conditions never seen during training, such as text prompts or low-resolution images [24][26].
- The model can efficiently guide the sampling process using purely discriminative models, promoting a unification of generative and discriminative models [28][29].
- DDN's latent space is structured as a tree, providing a highly compressed representation of data that can be visualized to understand its structure [36][39].

Group 4: Future Research Directions
- Potential directions include improving DDN through parameter tuning and theoretical analysis, applying DDN to fields such as image denoising and unsupervised clustering, and integrating DDN with existing generative models for enhanced capabilities [41][42].
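To make the select-and-refine loop in Group 2 concrete, here is a minimal PyTorch-style sketch. The layer interface, the blank initial condition, and the L2 selection metric are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
# A minimal sketch of the DDN reconstruction loop: each layer emits K
# candidates in one forward pass; the one closest to `target` conditions
# the next layer. The chosen indices form the tree-structured 1-D latent.
import torch
import torch.nn.functional as F

def ddn_reconstruct(layers, target):
    cond = torch.zeros_like(target)            # initial condition (assumed blank)
    latent, loss = [], 0.0
    for layer in layers:
        candidates = layer(cond)               # (K, C, H, W): K outputs at once
        dists = ((candidates - target) ** 2).flatten(1).mean(dim=1)
        idx = int(dists.argmin())              # index of most similar candidate
        cond = candidates[idx]                 # condition for the next layer
        latent.append(idx)
        loss = loss + F.mse_loss(candidates[idx], target)  # train only the pick
    return cond, latent, loss

# Zero-shot conditional generation reuses the same loop but swaps the
# selection rule: rank the K candidates with any discriminative score
# (e.g., CLIP similarity to a text prompt) instead of distance to a target.
```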
A Roundup of Embodied-AI Work Combining LLMs with Reinforcement Learning and World Models
具身智能之心· 2025-07-30 00:02
Core Insights
- The article surveys recent advances in embodied intelligence, focusing on the integration of large language models (LLMs) with reinforcement learning and world models across a range of AI applications [2][3].

Group 1: UniSim and Real-World Simulators
- UniSim aims to learn general real-world interactive simulators through generative modeling, showing that diverse natural datasets can enhance the learning of realistic simulations [3].
- The research demonstrates that high-level vision-language policies and low-level reinforcement learning policies can be trained in a simulated environment and applied directly to real-world scenarios without additional training [3].

Group 2: Causal World Models
- The study from Google DeepMind argues that robust agents must learn causal models in order to generalize across varying distributions, giving a clear answer to a long-standing question in the field [5].

Group 3: MAMBA Framework
- MAMBA introduces an efficient world-model approach for meta-reinforcement learning, achieving up to a 15x improvement in sample efficiency while performing well on high-dimensional tasks [8].

Group 4: EMMA and Multimodal Agents
- EMMA leverages LLMs trained in text-based worlds to guide training in visual worlds, yielding a 20%-70% improvement in task success rates over existing vision-language models [10].

Group 5: Text2Reward Framework
- The Text2Reward framework automatically generates and optimizes dense reward functions using LLMs, achieving success rates above 94% on new motion behaviors and improving policy performance through human feedback (an illustrative example of such generated reward code follows this summary) [13][14].

Group 6: Online Continual Learning
- The proposed online continual learning frameworks (Behavior-IL and Environment-IL) enable agents to learn continuously in real-world settings without relying on task-boundary information, significantly outperforming existing methods [17][18].

Group 7: AMAGO Framework
- AMAGO addresses challenges in generalization and long-term memory in reinforcement learning, demonstrating superior scalability and performance on complex tasks [21].

Group 8: PDDL and Planning with LLMs
- The research presents a novel paradigm for task planning with pre-trained LLMs, effectively integrating human feedback and reducing the need for extensive manual corrections in planning tasks [22][23].
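To give a feel for what Text2Reward-style generation produces, below is a hypothetical example of dense reward code an LLM might emit for a "move the cube to the goal" instruction. The observation keys and coefficients are invented for this sketch and are not any benchmark's real API.

```python
# Illustrative LLM-generated shaped reward: approach the cube, then carry
# it to the goal; a sparse bonus fires on success.
import numpy as np

def dense_reward(obs):
    gripper_to_cube = np.linalg.norm(obs["gripper_pos"] - obs["cube_pos"])
    cube_to_goal = np.linalg.norm(obs["cube_pos"] - obs["goal_pos"])
    reach_bonus = 1.0 - np.tanh(5.0 * gripper_to_cube)   # dense "reach" term
    place_bonus = 1.0 - np.tanh(5.0 * cube_to_goal)      # dense "place" term
    success = float(cube_to_goal < 0.02)                 # sparse success term
    return reach_bonus + 2.0 * place_bonus + 10.0 * success
```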
Shanghai Qi Zhi Institute & Tsinghua! BEV-VAE: The First Self-Supervised BEV-Perspective VAE, a Leap from Image Generation to Scene Generation
自动驾驶之心· 2025-07-08 12:45
Core Viewpoint
- The article presents BEV-VAE, a method that enables precise generation and manipulation of multi-view images for autonomous driving, emphasizing the importance of structured representations for understanding three-dimensional scenes [2][4][28].

Group 1: Methodology
- BEV-VAE employs a variational autoencoder (VAE) to learn a compact, unified bird's-eye-view (BEV) latent space, followed by a Diffusion Transformer that generates spatially consistent multi-view images (a schematic sketch of this two-stage pipeline follows this summary) [2][7].
- The model supports generating images for any camera configuration while incorporating three-dimensional layout information for control [2][11].
- The architecture consists of an encoder, a decoder, and a StyleGAN discriminator, ensuring spatial consistency among images from different views [7][8].

Group 2: Advantages
- BEV-VAE provides a structured representation that captures the complete semantics and spatial structure of multi-view images, simplifying the construction of world models [28].
- The model decouples spatial modeling from generative modeling, making the learning process more efficient [28].
- It is compatible with various camera configurations, demonstrating cross-platform applicability [28].

Group 3: Experimental Results
- Experiments on the nuScenes and Argoverse 2 (AV2) datasets show that BEV-VAE outperforms existing models on multi-view image reconstruction and generation tasks [21][22].
- The model's performance improves with higher latent dimensionality, achieving a PSNR of 26.32 and an SSIM of 0.7455 at a latent shape of 32 × 32 × 32 [22].
- BEV-VAE allows fine-grained editing of objects in a scene, successfully learning the three-dimensional structure and complete semantics of the environment [18][19].

Group 4: Conclusion
- BEV-VAE significantly lowers the barrier to applying generative models in autonomous driving, enabling researchers to build and extend world models at lower cost and higher efficiency [28].
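The following is a schematic PyTorch sketch of the two-stage pipeline described in Group 1. Module interfaces, shapes, and loss wiring are illustrative assumptions, not the paper's exact architecture.

```python
# Stage 1: a VAE whose latent lives in BEV space, shared by all camera views.
import torch
import torch.nn as nn

class BEVVAE(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # multi-view images + camera poses -> 2x latent channels
        self.decoder = decoder  # BEV latent + camera poses -> multi-view images

    def forward(self, views: torch.Tensor, cams: torch.Tensor):
        mu, logvar = self.encoder(views, cams).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.decoder(z, cams)  # a different camera rig can be passed here
        return recon, mu, logvar       # train with reconstruction + KL losses,
                                       # plus an adversarial loss from a
                                       # StyleGAN discriminator

# Stage 2 (separate model): a Diffusion Transformer denoises BEV latents
# conditioned on a 3D object layout; the frozen decoder then renders the
# denoised latent into spatially consistent multi-view images.
```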
An In-Depth Reading of Kaiming He's CVPR 2025 Talk: How Do Generative Models Move Toward End-to-End?
自动驾驶之心· 2025-06-28 13:34
Core Viewpoint
- The article discusses the evolution of generative models in deep learning, drawing parallels with the revolutionary changes that AlexNet brought to recognition models, and argues that generative models may be on the brink of a similar breakthrough with MeanFlow, which collapses the generation process from many steps into a single step [1][2][35].

Group 1: Evolution of Recognition Models
- Before AlexNet, layer-wise training was the dominant method for training recognition models: each layer was optimized individually, making training complex and cumbersome [2][3].
- The introduction of AlexNet in 2012 marked a decisive shift to end-to-end training, in which the entire network is trained jointly, greatly simplifying model design and improving performance [3][7].

Group 2: Current State of Generative Models
- Generative models today resemble recognition models of the pre-AlexNet era, relying on multi-step inference procedures such as diffusion models and autoregressive models, which raises the question of whether they are in a similar "pre-AlexNet" phase [7][9].
- The article argues that generative models need to move from multi-step inference to end-to-end generation to achieve a comparable breakthrough [7][35].

Group 3: Relationship Between Recognition and Generation
- Recognition and generation can be viewed as two sides of the same coin: recognition is an abstraction process that extracts semantic information from data, while generation is a concretization process that turns abstract representations into realistic data samples [13][15][16].
- The fundamental difference lies in the nature of the mapping: recognition has a deterministic mapping from data to labels, whereas generation requires a highly nonlinear mapping from noise to complex data distributions, which presents both opportunities and challenges [18][20].

Group 4: Flow Matching and Mean Flows
- Flow matching is a key direction for addressing the challenges faced by generative models, aiming to construct a flow field over data distributions to drive generation [20][22].
- Mean Flows, a recent method introduced by Kaiming He, achieves one-step generation by replacing a costly integral with an average-velocity computation, significantly improving generation efficiency (the worked equations following this summary spell out the construction) [24][27][29].
- In experiments, Mean Flows performed impressively on ImageNet, achieving a FID score of 3.43 with a single function evaluation and outperforming traditional multi-step models [31][32].

Group 5: Future Directions and Challenges
- The article outlines several future research directions, including consistency models, two-time-variable models, and revisiting normalizing flows, while asking whether generative models are still in their "pre-AlexNet" era [33][34].
- Despite the advances made by Mean Flows, identifying a truly effective formula for end-to-end generative modeling remains an exciting open research question [34][35].
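Concretely, the average-velocity construction behind MeanFlow can be written as follows, using the standard flow-matching convention with data at t = 0 and noise at t = 1 (a worked restatement of the published formulation, not text from the article):

```latex
% Average velocity over [r, t] along a flow-matching trajectory z_\tau:
u(z_t, r, t) \;\triangleq\; \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, \mathrm{d}\tau .
% Differentiating (t - r)\,u with respect to t yields the MeanFlow identity,
% a training target that avoids evaluating the integral:
u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t - r)\, \frac{\mathrm{d}}{\mathrm{d}t}\, u(z_t, r, t).
% Once u is learned, sampling is a single subtraction instead of an ODE solve:
z_r = z_t - (t - r)\, u(z_t, r, t), \qquad z_0 = z_1 - u(z_1, 0, 1).
```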
ICML 2025 Spotlight | A New Theoretical Framework Unlocks Guided Generation for Flow Matching Models
机器之心· 2025-06-28 02:54
Core Viewpoint
- The article introduces a novel theoretical framework for energy guidance in flow matching models, filling the gap in energy guidance algorithms for this model class and proposing several practical algorithms suited to different tasks [2][3][27].

Summary by Sections

Research Background
- Energy guidance is a crucial technique in the application of generative models: ideally, it shifts the distribution of generated samples to align with a given energy function while still adhering to the training-set distribution [7][9].
- Existing energy guidance algorithms focus primarily on diffusion models, which differ fundamentally from flow matching models, so a general energy guidance framework for flow matching is needed [9].

Method Overview
- The authors derive a general flow matching energy guidance vector field from the basic definitions of flow matching models, leading to three categories of practical, training-free energy guidance algorithms (a hedged sampler sketch in this spirit follows this summary) [11][12].
- The guidance vector field is designed to steer the original vector field toward regions of lower energy [12].

Experimental Results
- Experiments on synthetic data, offline reinforcement learning, and linear inverse problems in imaging demonstrate the effectiveness of the proposed algorithms [20][22].
- On synthetic datasets, the Monte Carlo sampling-based guidance algorithm produced results closest to the ground-truth distribution, validating the correctness of the flow matching guidance framework [21].
- In offline reinforcement learning tasks, Monte Carlo sampling guidance performed best, owing to the need for stable guidance across different time steps [23].
- For image inverse problems, Gaussian approximation guidance and GDM performed best, while Monte Carlo sampling struggled due to the high dimensionality [25].

Conclusion
- The work fills a significant gap in energy guidance algorithms for flow matching models, providing a new theoretical framework and several practical algorithms, along with theoretical analysis and experimental comparisons to guide real-world applications [27].
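The article includes no code, so the following is only a schematic illustration of endpoint-gradient guidance in the spirit of the framework's Gaussian-approximation variant: steer the learned velocity with the energy gradient evaluated at a one-step endpoint estimate. The interface (v_theta, energy), the guidance_scale, and the endpoint estimate are all assumptions of this sketch, not the paper's exact guided vector field.

```python
# Euler sampler from t = 1 (noise) to t = 0 (data) with energy steering.
import torch

def guided_sample(v_theta, energy, x, steps=50, guidance_scale=1.0):
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = torch.full((x.shape[0], 1, 1, 1), i * dt, device=x.device)
        # Energy gradient at the one-step endpoint estimate x0_hat = x - t * v.
        with torch.enable_grad():
            x_req = x.detach().requires_grad_(True)
            x0_hat = x_req - t * v_theta(x_req, t)
            grad = torch.autograd.grad(energy(x0_hat).sum(), x_req)[0]
        with torch.no_grad():
            v = v_theta(x, t)
            # The update x <- x - dt * (v + s * grad) shifts x by -dt*s*grad,
            # i.e., downhill on the energy, while following the learned flow.
            x = x - dt * (v + guidance_scale * grad)
    return x
```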
Kaiming He's Latest CVPR Lecture Slides Are Online: Toward End-to-End Generative Modeling
机器之心· 2025-06-19 09:30
Core Viewpoint
- The article discusses the evolution of generative models, particularly the transition from diffusion models to end-to-end generative modeling, and highlights the potential for generative models to replicate the historical advances seen in recognition models [6][36][41].

Group 1: Workshop Insights
- The CVPR workshop led by Kaiming He focused on the evolution of visual generative modeling beyond diffusion models [5][7].
- Diffusion models have become the dominant approach in visual generative modeling, but they face limitations such as slow generation and difficulty simulating complex distributions [6][36].
- Kaiming He's presentation emphasized the need for end-to-end generative modeling, contrasting it with the layer-wise training methods that prevailed before AlexNet [10][11][41].

Group 2: Recognition vs. Generation
- Recognition and generation can be viewed as two sides of the same coin: recognition abstracts features from raw data, while generation concretizes abstract representations into detailed data [41][42].
- The article highlights the fundamental difference between recognition tasks, which have a clear mapping from data to labels, and generation tasks, which involve complex, nonlinear mappings from simple distributions to intricate data distributions [58].

Group 3: Flow Matching and MeanFlow
- Flow Matching is presented as a promising approach to the challenges of generative modeling, constructing ground-truth fields that are independent of any specific neural network architecture [81].
- The MeanFlow framework introduced by Kaiming He targets single-step generation by modeling average velocity rather than instantaneous velocity, providing a theoretical basis for network training (a training-step sketch follows this summary) [83][84].
- Experimental results show that MeanFlow significantly outperforms previous single-step diffusion and flow models, achieving a FID score of 3.43, more than a 50% improvement over the previous best [101][108].

Group 4: Future Directions
- The article closes with a discussion of ongoing research, including Consistency Models, two-time-variable models, and revisiting Normalizing Flows, indicating that the field is still at a stage akin to the pre-AlexNet era of recognition models [110][113].
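As a companion to the worked equations given after the CVPR 2025 talk summary above, here is a minimal PyTorch sketch of a MeanFlow-style training step. The network interface u_theta(z, r, t), the time sampling, and the plain MSE objective are assumptions of this sketch, not an official implementation.

```python
# One MeanFlow-style training step, assuming image-shaped data x of shape
# (B, C, H, W) and a network u_theta(z, r, t) predicting average velocity.
import torch

def meanflow_step(u_theta, x):
    b = x.shape[0]
    e = torch.randn_like(x)                       # noise endpoint (t = 1)
    t = torch.rand(b, 1, 1, 1, device=x.device)   # sample 0 <= r <= t <= 1
    r = torch.rand(b, 1, 1, 1, device=x.device) * t
    z_t = (1 - t) * x + t * e                     # linear interpolation path
    v = e - x                                     # instantaneous velocity dz/dt
    # Total derivative d/dt u(z_t, r, t) along the path via a JVP with
    # tangent (dz/dt, dr/dt, dt/dt) = (v, 0, 1).
    u, dudt = torch.autograd.functional.jvp(
        lambda z, r_, t_: u_theta(z, r_, t_),
        (z_t, r, t),
        (v, torch.zeros_like(r), torch.ones_like(t)),
        create_graph=True,
    )
    # MeanFlow identity: u = v - (t - r) * du/dt, with a stop-gradient target.
    u_tgt = (v - (t - r) * dudt).detach()
    return ((u - u_tgt) ** 2).mean()

# One-step sampling afterwards: x0 = z1 - u_theta(z1, 0, 1), with z1 ~ N(0, I).
```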