Generative Models
Why doesn't AI understand the physical world? Fei-Fei Li and Yann LeCun: it lacks a "world model" and must learn how the brain's neocortex works
量子位· 2025-11-17 13:23
Core Insights
- The future of AI may be linked to understanding the evolutionary secrets of the human brain, as highlighted by recent developments in the field, including Yann LeCun's plan to establish a new AI company focused on "World Models" [1]
- Fei-Fei Li emphasizes the limitations of current large language models (LLMs) and advocates developing "Spatial Intelligence" as a crucial step toward Artificial General Intelligence (AGI) [3][4]

Summary by Sections

World Models
- "World Models" are essential for AI to understand and predict real-world scenarios, which current AI systems struggle with, such as generating realistic videos or performing household tasks [5][6]
- The concept of "World Models" arises from reflections on the limitations of LLMs and from the study of animal intelligence, suggesting that the ability to learn such models is what current AI lacks [8]

Human Perception and Intelligence
- Max Bennett's research identifies three key attributes of human perception that are crucial for understanding intelligence: filling-in, sequentiality, and irrepressibility [11]
- The brain's ability to fill in gaps in perception and to focus on one interpretation at a time is fundamental to how humans process information [12][20][23]

Generative Models
- The "Helmholtz Machine" concept illustrates how generative models can learn to recognize and generate data without being explicitly told the correct answers, demonstrating the brain's inferential processes (see the sketch after this summary) [27]
- Modern generative models, including deepfakes and AI-generated art, validate Helmholtz's theory and suggest that the brain's neocortex operates on similar principles [28]

Advanced Cognitive Abilities
- The neocortex not only supports imagination and prediction but also enables complex behaviors such as planning, episodic memory, and causal reasoning, which are desired traits for future AI systems [33]
- Bennett's book, "A Brief History of Intelligence," connects neuroscience with AI, outlining the evolutionary milestones of the brain and their implications for AI development [35][37]
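As a concrete illustration of the Helmholtz Machine idea above, here is a minimal wake-sleep training loop for a one-hidden-layer binary Helmholtz machine. Everything in it (layer sizes, learning rate, and the random placeholder data) is an illustrative assumption, not material from the article or from Bennett's book.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
bernoulli = lambda p: (rng.random(p.shape) < p).astype(float)

n_visible, n_hidden, lr = 16, 8, 0.05                      # toy sizes and learning rate (assumed)
W_gen = rng.normal(0, 0.1, (n_visible, n_hidden))          # generative weights: hidden -> visible
b_gen = np.zeros(n_hidden)                                  # generative prior over hidden units
W_rec = rng.normal(0, 0.1, (n_hidden, n_visible))          # recognition weights: visible -> hidden

data = bernoulli(np.full((100, n_visible), 0.3))            # placeholder binary "observations"

for epoch in range(50):
    for x in data:
        # Wake phase: recognize the data, then nudge the generative model to reproduce it
        h = bernoulli(sigmoid(W_rec @ x))
        W_gen += lr * np.outer(x - sigmoid(W_gen @ h), h)
        b_gen += lr * (h - sigmoid(b_gen))

        # Sleep phase: dream from the generative model, then nudge recognition to explain the dream
        h_dream = bernoulli(sigmoid(b_gen))
        x_dream = bernoulli(sigmoid(W_gen @ h_dream))
        W_rec += lr * np.outer(h_dream - sigmoid(W_rec @ x_dream), x_dream)
```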
Just in: the ICCV best paper is out, and Jun-Yan Zhu's team takes the crown with toy bricks
具身智能之心· 2025-10-23 00:03
Core Insights
- The article covers the recent International Conference on Computer Vision (ICCV), held in Hawaii, highlighting the award-winning research papers and their contributions to the field of computer vision [2][5][24].

Group 1: Award Winners
- The Best Paper Award went to a research team from Carnegie Mellon University (CMU) for their paper "Generating Physically Stable and Buildable Brick Structures from Text," led by noted AI scholar Jun-Yan Zhu [3][7][11].
- The Best Student Paper Award went to a Technion paper, "FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models," which introduces a novel image-editing method [28][30].

Group 2: Conference Statistics
- ICCV is one of the top three computer vision conferences and is held every two years; this year's edition received 11,239 valid submissions and accepted 2,699 papers, a 24% acceptance rate and a significant increase over the previous conference [5].

Group 3: Research Contributions
- The CMU paper presents BrickGPT, the first method capable of generating physically stable, interconnected brick assembly models from text prompts. The work includes a large dataset of over 47,000 brick structures covering 28,000 unique 3D objects with detailed descriptions [11][13].
- The Technion's FlowEdit proposes an editing approach that bypasses the traditional image-to-noise inversion step, achieving higher-fidelity edits by establishing a direct mapping path between the source and target image distributions [32][34].

Group 4: Methodology and Results
- BrickGPT uses an autoregressive large language model trained on the brick-structure dataset, incorporating validity checks and a physics-aware rollback mechanism to keep generated designs stable (see the sketch after this summary) [13][19].
- Experimental results show that BrickGPT outperforms baseline models, achieving a 100% validity rate and 98.8% stability in generated structures [20][22].
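To make the validity-check-plus-rollback loop concrete, below is a rough sketch of what an autoregressive generate-check-rollback procedure of this kind can look like. The brick token format and the propose_next_brick, is_valid, and is_physically_stable helpers are hypothetical stand-ins for the trained model, the overlap/connectivity checker, and the physics analysis; this is not BrickGPT's actual implementation.

```python
import random
from typing import List, Tuple

Brick = Tuple[int, int, int, int, int]           # (x, y, z, width, depth) -- simplified token format (assumed)

def propose_next_brick(structure: List[Brick]) -> Brick:
    """Stand-in for the autoregressive model predicting the next brick token."""
    return (random.randint(0, 9), random.randint(0, 9), len(structure), 2, 4)

def is_valid(brick: Brick, structure: List[Brick]) -> bool:
    """Stand-in validity check: here just 'no duplicate position'; the real check also enforces connectivity."""
    return all((brick[0], brick[1], brick[2]) != (b[0], b[1], b[2]) for b in structure)

def is_physically_stable(structure: List[Brick]) -> bool:
    """Stand-in for the physics analysis deciding whether the structure still stands."""
    return random.random() > 0.05

def generate_structure(max_bricks: int = 50, max_rollbacks: int = 20) -> List[Brick]:
    structure: List[Brick] = []
    rollbacks = 0
    while len(structure) < max_bricks and rollbacks < max_rollbacks:
        brick = propose_next_brick(structure)
        if not is_valid(brick, structure):
            continue                              # reject the invalid token and re-sample
        structure.append(brick)
        if not is_physically_stable(structure):
            structure.pop()                       # physics-aware rollback: undo the unstable addition
            rollbacks += 1
    return structure

print(len(generate_structure()))
```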
Just in: the ICCV best paper is out, and Jun-Yan Zhu's team takes the crown with toy bricks
机器之心· 2025-10-22 03:30
Core Insights
- The ICCV (International Conference on Computer Vision) announced its best paper and best student paper on October 22, 2025, highlighting significant advances in computer vision research [1][2][4].

Group 1: Best Paper
- The best paper award went to a Carnegie Mellon University (CMU) team for "Generating Physically Stable and Buildable Brick Structures from Text," led by noted AI scholar Jun-Yan Zhu [6][9].
- The paper introduces BrickGPT, a novel method that generates physically stable, interconnected brick assembly models from text prompts, marking a significant advance in the field [9][11].
- The team built a large-scale dataset of stable brick structures, comprising over 47,000 models and 28,000 unique 3D objects with detailed text descriptions, to train the model [11][10].

Group 2: Methodology and Results
- The method discretizes a brick structure into a sequence of text tokens and trains a large language model to predict the next brick to add, enforcing physical stability through validity checks and a rollback mechanism [10][17].
- Experiments indicate that BrickGPT achieved a 100% validity rate and a 98.8% stability rate, outperforming various baseline models in both effectiveness and stability [20][18].
- The approach generates diverse, aesthetically pleasing brick structures that align closely with the input text prompts, demonstrating high design fidelity [11][20].

Group 3: Best Student Paper
- The best student paper award went to a Technion paper, "FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models," which bypasses the traditional inversion-based editing path to improve image fidelity [25][28].
- FlowEdit establishes a direct mapping path between the source and target image distributions, yielding lower transport cost and better preservation of the original image structure during editing (see the sketch after this summary) [31][27].
- The method was validated on advanced text-to-image (T2I) flow models, achieving state-of-the-art results across a range of complex editing tasks and showcasing its efficiency [31].

Group 4: Other Awards and Recognitions
- The Helmholtz Prize recognized lasting contributions to computer vision benchmarks, honoring two significant papers, including Ross Girshick's "Fast R-CNN," which improved detection speed and accuracy [36][38].
- The Everingham Prize recognized teams for contributions to 3D modeling and multimodal AI, including the development of the SMPL model and the VQA dataset [41][43].
- Distinguished Researcher Awards went to David Forsyth and Michal Irani for their impactful contributions to the field of computer vision [50][52].
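The "direct mapping path" idea can be shown schematically: instead of inverting the source image to noise, an inversion-free editor can integrate the difference between target-prompt and source-prompt velocity predictions, evaluated on matched noisy versions of the source and of the evolving edit. The sketch below illustrates only this general pattern; the toy velocity function, noise schedule, coupling, and Euler step rule are assumptions and do not reproduce FlowEdit's exact published algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity(z: np.ndarray, t: float, prompt: str) -> np.ndarray:
    """Stand-in for a pretrained text-conditioned flow model's velocity field v_theta(z, t, prompt)."""
    pull = -1.0 if "cat" in prompt else -0.8     # toy, prompt-dependent dynamics, purely illustrative
    return pull * z * (1.0 - t)

def edit_without_inversion(x_src, src_prompt, tar_prompt, steps=28):
    """Schematic inversion-free edit: take Euler steps along the difference of the two
    prompt-conditioned velocity fields, sharing the same noise for source and edit."""
    x_edit = x_src.copy()
    ts = np.linspace(1.0, 0.0, steps + 1)        # from high noise level to clean (assumed convention)
    for t_hi, t_lo in zip(ts[:-1], ts[1:]):
        eps = rng.standard_normal(x_src.shape)
        z_src = (1 - t_hi) * x_src + t_hi * eps  # noised source on the flow path
        z_tar = x_edit + (z_src - x_src)         # evolving edit, displaced by the same noise
        dv = velocity(z_tar, t_hi, tar_prompt) - velocity(z_src, t_hi, src_prompt)
        x_edit = x_edit + (t_lo - t_hi) * dv     # Euler step along the velocity difference
    return x_edit

x_source = rng.standard_normal((8, 8))           # placeholder "image"
edited = edit_without_inversion(x_source, "a photo of a cat", "a photo of a dog")
print(edited.shape)
```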
Is the VAE era over? Saining Xie's team debuts "RAE": representation autoencoders may become the new cornerstone of DiT training
机器之心· 2025-10-14 08:24
Core Insights
- The article discusses the emergence of RAE (Representation Autoencoders) as a potential replacement for VAEs (Variational Autoencoders) in generative modeling, highlighting work by the research team led by Assistant Professor Saining Xie of New York University [1][2].

Group 1: RAE Development
- RAE pairs frozen pre-trained representation encoders (such as DINO, SigLIP, and MAE) with trained decoders to replace the traditional VAE, achieving high-quality reconstruction and a semantically rich latent space (see the sketch after this summary) [2][6].
- The new model structure addresses the limitations of VAEs, such as weak representation capability and the high computational cost of SD-VAE [4][13].

Group 2: Performance Metrics
- RAE demonstrates strong image-generation performance, reaching an FID of 1.51 at 256×256 resolution without guidance, and 1.13 with guidance at both 256×256 and 512×512 [5][6].
- RAE consistently outperforms SD-VAE in reconstruction quality, with rFID scores indicating better results across various encoder configurations [18][20].

Group 3: Training and Architecture
- The work introduces a new DiT (Diffusion Transformer) variant, DiT^DH, which adds a lightweight, wide head to improve efficiency without significantly increasing computational cost [3][34].
- The RAE decoder is trained with a frozen representation encoder and a ViT-based decoder, achieving reconstruction quality comparable to or better than SD-VAE [12][14].

Group 4: Scalability and Efficiency
- DiT^DH converges faster and is more compute-efficient than standard DiT, maintaining its performance advantage across different RAE scales [36][40].
- DiT^DH-XL reaches a new state-of-the-art FID of 1.13 after 400 epochs, outperforming previous models while requiring significantly less compute [41][43].

Group 5: Noise Management Techniques
- The work proposes noise-enhanced decoding to make the decoder more robust to out-of-distribution latents, improving overall model performance [29][30].
- Adjusting the noise schedule to the effective data dimensionality of RAE latents significantly improves training outcomes, demonstrating the need for tailored noise strategies in high-dimensional latent spaces [28].
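As a minimal sketch of the RAE recipe described above, the snippet below pairs a frozen stand-in for a pretrained representation encoder with a small trainable ViT-style decoder that maps tokens back to pixels. The toy convolutional "encoder", module sizes, and plain MSE objective are illustrative assumptions; the actual work uses published encoders such as DINO, SigLIP, or MAE and a richer reconstruction loss.

```python
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Stand-in for a frozen pretrained representation encoder (e.g. DINO/SigLIP/MAE);
    a real setup would load published weights instead of this toy patch projection."""
    def __init__(self, patch: int = 16, dim: int = 768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                                    # (B, 3, H, W) -> (B, N_tokens, dim)
        return self.proj(x).flatten(2).transpose(1, 2)

class PixelDecoder(nn.Module):
    """Trainable ViT-style decoder mapping encoder tokens back to pixels (heavily simplified)."""
    def __init__(self, patch: int = 16, dim: int = 768, depth: int = 2):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.to_pixels = nn.Linear(dim, patch * patch * 3)
        self.patch = patch

    def forward(self, tokens, hw):
        pix = self.to_pixels(self.blocks(tokens)).transpose(1, 2)   # (B, patch*patch*3, N_tokens)
        return nn.functional.fold(pix, output_size=hw, kernel_size=self.patch, stride=self.patch)

encoder, decoder = FrozenEncoder(), PixelDecoder()
for p in encoder.parameters():
    p.requires_grad_(False)                                   # the representation encoder stays frozen

opt = torch.optim.AdamW(decoder.parameters(), lr=1e-4)
images = torch.rand(4, 3, 256, 256)                           # placeholder batch
latents = encoder(images)                                      # semantically rich latent tokens
recon = decoder(latents, (256, 256))
loss = nn.functional.mse_loss(recon, images)                   # a real setup adds perceptual/adversarial terms
loss.backward()
opt.step()
```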
Andrew Ng's deep learning course CS230 gets a fall refresh, adding a GPT-5 module
机器之心· 2025-10-04 03:38
Core Viewpoint
- The updated CS230 Deep Learning course at Stanford, taught by Andrew Ng, likens artificial intelligence to electricity and introduces new content reflecting the latest advances in AI, with a particular focus on the GPT-5 model [1][4].

Course Structure and Content
- The course uses a flipped-classroom model in which students watch Coursera's deeplearning.ai videos before attending in-person classes [3].
- Since its launch in 2017, the course has kept a similar core framework while integrating updates on recent AI developments, including a new chapter on GPT-5 [4].
- The course expands its discussion of generative models and incorporates widely used techniques such as RAG and AI agents, using GPT-5 for case studies [6].
- CS230 aims to provide comprehensive knowledge of deep learning, covering both theoretical foundations and the practical skills needed to build and apply deep learning models [10][12].

Key Topics Covered
- Basics of neural networks and deep learning [20].
- Optimization techniques such as regularization, the Adam optimizer, hyperparameter tuning, Dropout, and Batch Normalization [20].
- Strategies for taking machine learning projects from conception to successful deployment [20].
- Convolutional Neural Networks (CNNs) and their applications in image classification and detection [20].
- Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks for sequence tasks [20].
- Advanced topics such as Generative Adversarial Networks (GANs) and deep reinforcement learning [20].
- Insights from industry and academia, along with practical career development advice in AI [20].

Course Schedule
- The fall 2025 course runs for roughly 10 weeks, starting at the end of September [15].
- Weekly topics include an introduction to deep learning, neural network basics, CNNs, RNNs, optimization algorithms, generative models, and advanced topics related to GPT-5 [16].
OpenAI's Yang Song poached by Meta: a key figure behind the rise of diffusion models joins MSL and reunites with Tsinghua alum Shengjia Zhao
36Kr· 2025-09-26 03:19
Core Insights
- Meta has successfully recruited Yang Song, a prominent researcher from OpenAI, a move that caused considerable surprise within the industry [1][6][8].

Group 1: Yang Song's Background and Achievements
- Yang Song is recognized as a key contributor to the rise of diffusion models and has made significant advances in addressing their limitations [9][13].
- He entered Tsinghua University at the age of 16 and later earned his PhD from Stanford University, where he was mentored by a noted professor [20][22].
- During his time at OpenAI, he led the Strategic Explorations Team and was instrumental in developing consistency models, which outperform diffusion models in speed and performance [10][11][13].

Group 2: Impact of the Recruitment on Meta
- Recruiting Yang Song is part of Meta's broader strategy of attracting top talent from leading AI research organizations, signaling a push to strengthen its AI and machine learning capabilities [6][8].
- Industry insiders believe the motivations for such moves are not purely financial, as many of the recruited researchers have already accumulated significant wealth [8].
- Yang Song's move to Meta is seen as a strategic advantage that could help position the company to lead development of next-generation AI models [6][24].
OpenAI's Yang Song poached by Meta! A key figure behind the rise of diffusion models joins MSL and reunites with Tsinghua alum Shengjia Zhao
量子位· 2025-09-25 13:00
Core Viewpoint
- Meta has successfully recruited Yang Song, a prominent researcher from OpenAI, drawing significant attention in the AI research community because of his notable contributions to diffusion models and generative modeling [1][6][7].

Group 1: Yang Song's Background and Achievements
- Yang Song is recognized as a key contributor to the rise of diffusion models and led OpenAI's Strategic Explorations Team [10][11].
- He entered Tsinghua University at the age of 16 and later earned his PhD from Stanford University, where he worked under the guidance of a noted professor [20][36].
- His best-known work includes the development of consistency models, which outperform diffusion models in speed and generate images significantly faster (see the sketch after this summary) [12][14][17].

Group 2: Impact of Yang Song's Work
- The consistency models developed by Yang Song can generate 64 images at 256×256 resolution in roughly 3.5 seconds, a substantial improvement over existing models [12][14].
- His follow-up research produced continuous-time consistency models, which address the stability and scalability issues of earlier versions and reach a training scale of 1.5 billion parameters [15][18].
- These advances are seen as potential game-changers in generative modeling, with some discussion suggesting they could "end" the dominance of diffusion models [18][19].

Group 3: Meta's Strategic Recruitment
- Meta's recruitment of Yang Song is part of a broader strategy to strengthen its AI capabilities by attracting top talent from leading organizations such as OpenAI [9][10].
- The move is viewed as a significant loss for OpenAI, and many colleagues expressed surprise at his departure [7][6].
- The motivations behind such moves are thought to extend beyond financial incentives, as many researchers prioritize impactful work and collaboration opportunities [9].
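A rough sketch of why consistency models are fast: a trained network maps any noisy point on the diffusion trajectory straight back to an estimate of the clean sample, so a single forward pass already yields an image, and a few optional re-noise-and-refine steps can improve it. The toy network, noise levels, and data shapes below are illustrative assumptions, not the actual architecture or schedule used in the published models.

```python
import torch
import torch.nn as nn

class ConsistencyNet(nn.Module):
    """Stand-in for a trained consistency model f_theta(x_t, t) that maps a noisy sample
    anywhere on the trajectory directly to an estimate of the clean sample."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t, t):
        t_embed = t.expand(x_t.shape[0], 1)                  # broadcast the noise level to the batch
        return self.net(torch.cat([x_t, t_embed], dim=-1))

@torch.no_grad()
def consistency_sample(model, shape, sigmas=(80.0, 10.0, 1.0)):
    """Few-step sampling: one call already gives a sample; extra steps re-noise and refine it."""
    x = torch.randn(shape) * sigmas[0]
    sample = model(x, torch.tensor([sigmas[0]]))             # single forward pass = one sample
    for sigma in sigmas[1:]:
        x = sample + sigma * torch.randn(shape)              # partially re-noise the current estimate
        sample = model(x, torch.tensor([sigma]))             # refine with another single pass
    return sample

model = ConsistencyNet()
samples = consistency_sample(model, (64, 64))                # 64 placeholder "images" as flat vectors
print(samples.shape)
```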
Quick take | Runway crosses into robotics with over $500 million in new funding, as AI world models become a new engine for simulation-based training
Z Potentials· 2025-09-02 03:58
Core Insights
- Runway has broadened its focus beyond creative industries alone to explore opportunities in the robotics sector, receiving over $500 million in funding and reaching a valuation of $3 billion [3][4]
- The company is known for its AI models that generate videos and images, with recent releases including the Gen-4 video generation model and the Runway Aleph video editing model [3][4]
- Runway's technology is being used by robotics and autonomous-vehicle companies for training simulations, which are more cost-effective and scalable than real-world training [4][5]

Funding and Valuation
- Runway has raised more than $500 million from investors such as Nvidia, Google, and General Atlantic, leading to a valuation of $3 billion [3]

Technology and Applications
- The company's world models are designed to create realistic simulations and are now attracting interest from the robotics and autonomous-vehicle sectors [3][4]
- Runway's models allow detailed testing of specific variables and scenarios without altering other factors in the environment, making it easier to simulate different operational outcomes [5]

Future Directions
- Runway does not plan to build an entirely separate product line for robotics and autonomous vehicles; instead it will refine existing models to better serve these industries [5][6]
- The company's core philosophy revolves around the concept of simulation, which it expects to apply across markets and industries as the capabilities of generative models improve [6]
Simplicity is power: how does the new generative model "Discrete Distribution Network (DDN)" keep its principle simple yet its properties unique?
机器之心· 2025-08-16 05:02
Core Viewpoint
- The article introduces a novel generative model, Discrete Distribution Networks (DDN), which offers unique capabilities for generating and reconstructing data, particularly zero-shot conditional generation and end-to-end differentiability [4][8][33].

Group 1: Overview of DDN
- DDN generates K outputs simultaneously during a single forward pass, forming a discrete distribution over outputs [5][6].
- The training objective is to optimize the positions of these sample points so that they closely approximate the true distribution of the training data [7].
- DDN has three main features: zero-shot conditional generation (ZSCG), tree-structured one-dimensional discrete latent variables, and full end-to-end differentiability [8].

Group 2: DDN Mechanism
- Like a Variational Autoencoder (VAE), DDN can reconstruct data by mapping it to latent representations and generating highly similar reconstructed images [12].
- Reconstruction proceeds layer by layer: each layer generates K outputs, and the output most similar to the target is selected as the condition for the next layer [14][15].
- Training mirrors the reconstruction process, with the addition of computing a loss on the selected output at each layer (see the sketch after this summary) [16].

Group 3: Unique Features of DDN
- DDN supports zero-shot conditional generation, producing images from conditions never seen during training, such as text prompts or low-resolution inputs [24][26].
- Purely discriminative models can efficiently guide the sampling process, promoting a unification of generative and discriminative models [28][29].
- DDN's latent space is structured as a tree, providing a highly compressed representation of the data that can be visualized to understand its structure [36][39].

Group 4: Future Research Directions
- Potential directions include improving DDN through parameter tuning and theoretical analysis, applying it to tasks such as image denoising and unsupervised clustering, and integrating it with existing generative models for enhanced capabilities [41][42].
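The select-then-supervise loop described above can be sketched in a few lines. The layer widths, the candidate count K, the L2 selection metric, and the detach between layers are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class DDNLayer(nn.Module):
    """One DDN-style layer: from the current condition it proposes K candidate outputs
    in a single forward pass (a small discrete distribution, here over flattened vectors)."""
    def __init__(self, dim: int = 256, k: int = 8):
        super().__init__()
        self.k = k
        self.net = nn.Sequential(nn.Linear(dim, 512), nn.SiLU(), nn.Linear(512, k * dim))

    def forward(self, cond):                                  # (B, dim) -> (B, K, dim)
        return self.net(cond).view(cond.shape[0], self.k, -1)

def ddn_training_step(layers, x, opt):
    """Toy select-then-supervise step: each layer emits K candidates, the one closest to
    the target is supervised and passed on as the condition for the next layer."""
    cond = torch.zeros_like(x)                                # initial condition (e.g. a blank canvas)
    loss = 0.0
    for layer in layers:
        candidates = layer(cond)                              # (B, K, dim)
        dists = ((candidates - x.unsqueeze(1)) ** 2).mean(dim=-1)
        best = dists.argmin(dim=1)                            # index of the most similar candidate
        chosen = candidates[torch.arange(x.shape[0]), best]   # (B, dim)
        loss = loss + nn.functional.mse_loss(chosen, x)       # supervise only the selected sample
        cond = chosen.detach()                                 # selected output conditions the next layer
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

layers = nn.ModuleList([DDNLayer() for _ in range(3)])
opt = torch.optim.Adam(layers.parameters(), lr=1e-3)
x = torch.rand(16, 256)                                       # placeholder flattened training images
print(ddn_training_step(layers, x, opt))
```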
A roundup of embodied-AI work combining LLMs with reinforcement learning and world models
具身智能之心· 2025-07-30 00:02
Core Insights
- The article reviews recent advances in embodied intelligence, focusing on how large language models (LLMs) are combined with reinforcement learning and world models across a range of applications [2][3].

Group 1: UniSim and Real-World Simulators
- UniSim aims to learn general real-world interactive simulators through generative modeling, showing that diverse natural datasets can enhance the learning of realistic simulations [3].
- High-level vision-language policies and low-level reinforcement learning policies can be trained in the simulated environment and applied directly to real-world scenarios without additional training [3].

Group 2: Causal World Models
- Work from Google DeepMind asserts that robust agents must learn causal models in order to generalize across varying distributions, providing a clear answer to a long-standing question in the field [5].

Group 3: MAMBA Framework
- MAMBA introduces an efficient world-model approach for meta-reinforcement learning, achieving up to a 15× improvement in sample efficiency while performing well on high-dimensional tasks [8].

Group 4: EMMA and Multimodal Agents
- EMMA leverages an LLM trained in a text-based world to guide training in the visual world, boosting task success rates by 20%-70% over existing vision-language models [10].

Group 5: Text2Reward Framework
- The Text2Reward framework automatically generates and optimizes dense reward functions using LLMs, achieving success rates above 94% on new motion behaviors and improving policy performance through human feedback (a toy sketch of this reward-generation pattern appears at the end of this entry) [13][14].

Group 6: Online Continual Learning
- The proposed online continual learning settings (Behavior-IL and Environment-IL) let agents learn continuously in real-world conditions without relying on task-boundary information, significantly outperforming existing methods [17][18].

Group 7: AMAGO Framework
- AMAGO addresses generalization and long-term memory challenges in reinforcement learning, demonstrating superior scalability and performance on complex tasks [21].

Group 8: PDDL and Planning with LLMs
- The research presents a novel paradigm for task planning with pre-trained LLMs, effectively integrating human feedback and reducing the need for extensive manual correction of plans [22][23].
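To illustrate the Text2Reward pattern mentioned under Group 5, here is a toy sketch of the pipeline it describes: prompt an LLM for dense reward-function code, then execute the returned function inside the RL loop. The prompt template, the query_llm stand-in (which simply returns a hand-written example), and the observation keys are all hypothetical; the real framework adds validation, sandboxing, and human-feedback-driven refinement.

```python
from typing import Callable, Dict

REWARD_PROMPT = """You are writing a dense reward function for a robot arm.
Task: {task}
Observation keys available: gripper_pos, object_pos, goal_pos (each a 3-vector).
Return only a Python function `reward(obs: dict) -> float`."""

def query_llm(prompt: str) -> str:
    """Stand-in for an LLM call; returns a hand-written example of the kind of
    dense reward code such a prompt might produce."""
    return (
        "def reward(obs):\n"
        "    import numpy as np\n"
        "    reach = -np.linalg.norm(np.array(obs['gripper_pos']) - np.array(obs['object_pos']))\n"
        "    place = -np.linalg.norm(np.array(obs['object_pos']) - np.array(obs['goal_pos']))\n"
        "    return float(reach + 2.0 * place)\n"
    )

def build_reward_fn(task: str) -> Callable[[Dict], float]:
    code = query_llm(REWARD_PROMPT.format(task=task))
    namespace: Dict = {}
    exec(code, namespace)          # in practice the generated code is sandboxed and validated first
    return namespace["reward"]

reward_fn = build_reward_fn("move the cube onto the goal marker")
obs = {"gripper_pos": [0.1, 0.0, 0.2], "object_pos": [0.2, 0.1, 0.0], "goal_pos": [0.5, 0.5, 0.0]}
print(reward_fn(obs))              # dense shaped reward the RL agent can optimize at every step
```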