Diffusion Models
Why did robots collectively abandon "parkour" and all switch to "folding clothes"?
机器人大讲堂· 2025-11-24 15:00
Core Viewpoint
- The robotics industry has shifted focus from showcasing extreme capabilities, such as parkour and dancing, to addressing practical household tasks like folding clothes, indicating a maturation of the market and a response to real consumer needs [3][7][27].

Group 1: Industry Trends
- The initial excitement around robotics was characterized by impressive demonstrations of movement and balance, which attracted capital and interest in the early stages of technology development [27].
- The current trend shows a significant pivot towards practical applications, with companies now prioritizing user needs over mere technical prowess [27][30].
- The emergence of clothes-folding robots reflects a convergence of technological advancements and market demand, as the ability to fold clothes has become a more relatable and desirable function for consumers [9][15].

Group 2: Technological Advancements
- Breakthroughs in robot learning technologies, such as diffusion models and zero-shot learning, have enabled robots to learn tasks like folding clothes from human demonstrations without extensive programming [13] (a minimal training-loop sketch follows this summary).
- The reduction in technical barriers has allowed startups to leverage pre-trained models to create functional demonstrations, making the technology more accessible [13][15].
- Despite advancements, current robotic demonstrations still reveal limitations in precision and adaptability, indicating that further improvements in algorithms and hardware are necessary [29][30].

Group 3: Market Demand and Consumer Expectations
- There is a strong consumer desire for robots that can perform household tasks, with many willing to pay for solutions that alleviate mundane chores like folding clothes [15][26].
- The gap between what companies claim their robots can do and what consumers expect in terms of performance and reliability remains significant [24][26].
- Current demonstrations often fail to address the full scope of household tasks, focusing primarily on the folding action without integrating the entire process from retrieval to storage [24][30].

Group 4: Future Directions
- The industry must continue to focus on practical applications and user needs to drive commercial viability, moving beyond mere technical demonstrations [30].
- As technology matures, there is potential for robots to expand their capabilities to include a wider range of household tasks, provided they remain aligned with consumer demands [29][30].
- The shift towards practical applications signifies a more rational approach to robotics, emphasizing the importance of solving real-world problems over showcasing extreme capabilities [30].
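The "learning from human demonstrations" claim above maps onto the diffusion-policy family of imitation-learning methods. Below is a minimal, hypothetical sketch of such a training step; the module names, dimensions, and noise schedule are illustrative assumptions, not the implementation of any system mentioned in the article.

```python
# Sketch of a diffusion-policy training step: learn to denoise demonstrated
# action chunks (e.g., folding motions) conditioned on an observation embedding.
# All names, shapes, and the schedule are illustrative, not a specific product's code.
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Predicts the noise added to an action chunk, given an observation embedding."""
    def __init__(self, obs_dim=512, act_dim=7, horizon=16):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + horizon * act_dim + 1, 1024), nn.ReLU(),
            nn.Linear(1024, horizon * act_dim),
        )

    def forward(self, obs_emb, noisy_actions, t):
        x = torch.cat([obs_emb, noisy_actions.flatten(1), t[:, None].float()], dim=-1)
        return self.net(x).view(-1, self.horizon, self.act_dim)

def train_step(model, optimizer, obs_emb, demo_actions, n_steps=100):
    """One denoising-objective step on a batch of demonstrated action chunks."""
    t = torch.randint(0, n_steps, (demo_actions.shape[0],))
    alpha_bar = torch.cos(t.float() / n_steps * torch.pi / 2) ** 2  # toy noise schedule
    noise = torch.randn_like(demo_actions)
    noisy = (alpha_bar.sqrt()[:, None, None] * demo_actions
             + (1 - alpha_bar).sqrt()[:, None, None] * noise)
    loss = ((model(obs_emb, noisy, t) - noise) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random stand-ins for image features and demonstrated end-effector actions:
model = NoisePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss = train_step(model, opt, torch.randn(8, 512), torch.randn(8, 16, 7))
```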
NeurIPS 2025 | UniLumos: a unified image and video relighting framework that introduces physical feedback, achieving realistic light-and-shadow reshaping at a 20x speedup!
机器之心· 2025-11-24 09:30
Core Insights
- The article discusses the advancements in image and video relighting technology, particularly focusing on the introduction of UniLumos, a unified framework that enhances physical consistency and computational efficiency in relighting tasks [3][37].

Group 1: Challenges in Existing Methods
- Current methods based on diffusion models face two fundamental challenges: the lack of physical consistency and an inadequate evaluation system for relighting quality [11][12].
- Existing approaches often optimize in semantic latent spaces, leading to physical inconsistencies such as misaligned shadows, overexposed highlights, and incorrect occlusions [15][11].

Group 2: Introduction of UniLumos
- UniLumos is introduced as a solution to the aforementioned challenges, providing a unified framework for image and video relighting that maintains scene structure and temporal consistency while achieving high-quality relighting [17][37].
- The framework incorporates geometric feedback from RGB space, such as depth and normal maps, to align lighting effects with scene structures, significantly improving physical consistency [4][22] (a hypothetical sketch of such a feedback term follows this summary).

Group 3: Innovations and Methodology
- Key innovations include a geometric feedback mechanism to enhance physical consistency and a structured six-dimensional lighting description for fine-grained control and evaluation of lighting effects [18][22].
- The training dataset, LumosData, is constructed to extract high-quality relighting samples from real-world videos, facilitating the training of the model [20][21].

Group 4: Performance and Efficiency
- UniLumos demonstrates superior performance across various metrics, achieving state-of-the-art results in visual fidelity, temporal consistency, and physical accuracy compared to baseline models [27][28].
- The framework achieves a 20-fold increase in inference speed while maintaining high-quality output, making it significantly more efficient than existing methods [33][38].

Group 5: Evaluation and Results
- The LumosBench evaluation framework allows for automated and interpretable assessment of relighting accuracy across six dimensions, showcasing UniLumos's advantages in fine-grained control over lighting attributes [22][29].
- Qualitative results indicate that UniLumos produces more realistic lighting effects and maintains better temporal consistency compared to baseline methods [31][33].
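To make the "geometric feedback from RGB space" idea concrete, here is a rough sketch of a structure-preservation loss that compares depth and normal maps of the relit output against the source frame. The `depth_model` and `normal_model` callables stand in for any frozen geometry estimators; this is an illustrative approximation, not UniLumos's actual implementation.

```python
# Hypothetical geometric-feedback term: penalize a relit frame whose depth / normal
# structure drifts away from the input frame's geometry.
import torch
import torch.nn.functional as F

def geometry_feedback_loss(relit_rgb, source_rgb, depth_model, normal_model,
                           w_depth=1.0, w_normal=1.0):
    """relit_rgb, source_rgb: (B, 3, H, W) tensors in [0, 1].
    depth_model(x)  -> (B, 1, H, W) depth maps.
    normal_model(x) -> (B, 3, H, W) unit normal maps."""
    with torch.no_grad():                       # geometry of the source is the fixed target
        d_src = depth_model(source_rgb)
        n_src = normal_model(source_rgb)
    d_out = depth_model(relit_rgb)              # gradients flow through the relit image
    n_out = normal_model(relit_rgb)
    depth_term = F.l1_loss(d_out, d_src)
    normal_term = (1 - F.cosine_similarity(n_out, n_src, dim=1)).mean()
    return w_depth * depth_term + w_normal * normal_term

# Example with stand-in estimators (replace with real frozen depth/normal networks):
dummy_depth = lambda x: x.mean(dim=1, keepdim=True)
dummy_normal = lambda x: F.normalize(x, dim=1)
loss = geometry_feedback_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64),
                              dummy_depth, dummy_normal)
```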
A University of Notre Dame team builds a new tool for molecular design: letting AI create molecules the way it writes an article
仪器信息网· 2025-11-19 09:08
Core Insights
- The article discusses the breakthrough AI system DemoDiff developed by a team from the University of Notre Dame, which can design new molecular structures by learning from a few examples, significantly accelerating drug and material development processes [7][8][10].

Group 1: AI Understanding of Molecular Design
- DemoDiff mimics human learning by analyzing a few successful molecular examples to understand design patterns, allowing it to generate new candidates quickly [10][11].
- The system can even learn from negative examples, generating high-quality molecules based on poorly performing ones, showcasing its advanced reasoning capabilities [21][22].

Group 2: Innovative Molecular Representation
- The team introduced a new method called "node-pair encoding," which simplifies complex molecular structures, improving efficiency by 5.5 times [9][12].
- This method allows for a significant reduction in the number of atoms needed to describe a molecule, enhancing the AI's ability to process more examples [12][13].

Group 3: Comprehensive Molecular Database
- DemoDiff was trained on an extensive database containing over 1 million molecular structures and 155,000 different molecular properties, providing a rich resource for learning [14][15].
- The database includes various sources, such as the ChEMBL database, which records millions of drug molecules and their biological activities [14][15].

Group 4: Diffusion Model for Molecular Generation
- The core technology of DemoDiff is based on a "diffusion model," which generates molecular structures through a progressive refinement process, ensuring chemical validity [16][17].
- This model incorporates context learning, allowing the AI to adapt its output based on different sets of example molecules [18] (a minimal conditioning sketch follows this summary).

Group 5: Performance Testing and Validation
- DemoDiff underwent rigorous testing across 33 different molecular design tasks, demonstrating performance comparable to much larger AI models [19][20].
- The system excels in generating diverse molecular structures, providing researchers with multiple options for further exploration [20].

Group 6: Negative Learning Capability
- The AI's ability to learn from negative examples allows it to infer what makes a successful molecule, enhancing its design capabilities [21][22].
- This feature is particularly valuable in early drug development stages, where researchers often have more negative examples than positive ones [21][22].

Group 7: Technical Innovations
- The system employs a "graph attention mechanism" to focus on multiple important parts of a molecule simultaneously, ensuring a holistic understanding during generation [23].
- A multi-layer validation mechanism checks the generated molecules against fundamental chemical rules, ensuring their feasibility [23][24].

Group 8: Implications for Molecular Design
- DemoDiff represents a paradigm shift in molecular design, potentially reducing the time and cost associated with drug development significantly [25][26].
- The technology may democratize molecular design, allowing a broader range of researchers to participate in innovation [26].

Group 9: Future Considerations
- While DemoDiff shows impressive capabilities, there is recognition of the need for further improvements, particularly in handling specific design tasks [27].
- Future developments may include expanding the model's scale and enhancing data quality to tackle more complex challenges [27][28].
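The "context learning" described in Group 4 can be illustrated as a denoiser that attends over embeddings of the few demonstration molecules while refining a noisy candidate. The sketch below is an assumption-laden illustration of that idea; the class name, architecture, and dimensions are hypothetical and not DemoDiff's code.

```python
# Illustrative in-context conditioning: a denoiser cross-attends to a handful of
# demonstration-molecule embeddings at every refinement step.
import torch
import torch.nn as nn

class DemoConditionedDenoiser(nn.Module):
    def __init__(self, mol_dim=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(mol_dim, n_heads, batch_first=True)
        self.refine = nn.Sequential(nn.Linear(mol_dim, 512), nn.GELU(),
                                    nn.Linear(512, mol_dim))

    def forward(self, noisy_mol_tokens, demo_embeddings):
        """noisy_mol_tokens: (B, N, D) current noisy molecule representation.
        demo_embeddings: (B, K, D) embeddings of K example molecules (the "demos")."""
        ctx, _ = self.cross_attn(noisy_mol_tokens, demo_embeddings, demo_embeddings)
        return self.refine(noisy_mol_tokens + ctx)   # one denoising refinement step

# Usage: swapping the demo set steers generation toward (or, with negative demos and an
# inverted objective, away from) the demonstrated property profile.
denoiser = DemoConditionedDenoiser()
x = torch.randn(2, 32, 256)        # noisy molecule tokens
demos = torch.randn(2, 5, 256)     # five in-context example molecules
out = denoiser(x, demos)
```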
End-to-end and VLA positions are paying absurdly high salaries...
自动驾驶之心· 2025-11-19 00:03
Core Insights
- There is significant demand for end-to-end and VLA (Vision-Language-Action) technical talent in the automotive industry, with salaries for experts reaching up to $70,000 per month for positions requiring 3-5 years of experience [1]
- The technology stack involved in end-to-end and VLA is complex, covering various advanced algorithms and models such as BEV perception, VLM (Vision-Language Model), diffusion models, reinforcement learning, and world models [2]

Course Offerings
- The company is launching two specialized courses, the "End-to-End and VLA Autonomous Driving Class" and the "Practical Course on VLA and Large Models," aimed at helping individuals enter the field of end-to-end and VLA technologies quickly and efficiently [2]
- The "Practical Course on VLA and Large Models" focuses on VLA, covering topics from VLM as an autonomous-driving interpreter to modular and integrated VLA, including mainstream inference-enhanced VLA [2]
- The course includes a detailed theoretical foundation and practical assignments, teaching participants how to build their own VLA models and datasets from scratch [2]

Instructor Team
- The instructor team consists of experts from both academia and industry, including individuals with extensive research and practical experience in multi-modal perception, autonomous-driving VLA, and large-model frameworks [7][10][13]
- Notable instructors include a Tsinghua University master's graduate with multiple publications in top conferences and a current algorithm expert at a leading domestic OEM [7][13]

Target Audience
- The courses are designed for individuals with foundational knowledge of autonomous driving who are familiar with its basic modules and with concepts related to transformer large models, reinforcement learning, and BEV perception [15]
- Participants are expected to have a background in probability theory and linear algebra, as well as proficiency in Python and PyTorch [15]
We put together an end-to-end advanced-study roadmap, geared toward deployment and job hunting...
自动驾驶之心· 2025-11-18 00:05
Core Insights
- There is significant demand for end-to-end and VLA (Vision-Language-Action) technical talent in the automotive industry, with salaries for experts reaching up to $70,000 per month for positions requiring 3-5 years of experience [1]
- The technology stack for end-to-end and VLA is complex, involving various advanced algorithms such as BEV perception, Vision-Language Models (VLM), diffusion models, reinforcement learning, and world models [1]
- The company is offering specialized courses to help individuals learn end-to-end and VLA technologies quickly and efficiently, collaborating with experts from both academia and industry [1]

Course Offerings
- The "End-to-End and VLA Autonomous Driving Course" focuses on the macro aspects of end-to-end autonomous driving, covering key algorithms and theoretical foundations, including BEV perception, large language models, diffusion models, and reinforcement learning [10]
- The "Autonomous Driving VLA and Large Model Practical Course" is led by academic experts and covers VLA from the perspective of VLM as an autonomous-driving interpreter, modular VLA, and current mainstream inference-enhanced VLA [1][10]
- Both courses include practical components, such as building a VLA model and dataset from scratch, and implementing algorithms like the Diffusion Planner and the ORION algorithm [10][12]

Instructor Profiles
- The instructors include experienced professionals and researchers from top institutions, such as Tsinghua University and QS30 universities, with backgrounds in multimodal perception, autonomous-driving VLA, and large-model frameworks [6][9][12]
- Instructors have published numerous papers in prestigious conferences and have hands-on experience developing and deploying advanced algorithms in the field of autonomous driving [6][9][12]

Target Audience
- The courses are designed for individuals with foundational knowledge of autonomous driving who are familiar with its basic modules and with concepts related to transformer large models, reinforcement learning, and BEV perception [14]
- Participants are expected to have a background in probability theory and linear algebra, as well as proficiency in Python and PyTorch [14]
RAE + VAE? Pre-trained representations empower diffusion-model tokenizers, accelerating the shift from pixel compression to semantic extraction
机器之心· 2025-11-13 10:03
Core Insights
- The article discusses RAE (Representation Autoencoders, as used with Diffusion Transformers) and VFM-VAE from Xi'an Jiaotong University and Microsoft Research Asia, which use "frozen pre-trained visual representations" to enhance the image-generation performance of diffusion models [2][6][28].

Group 1: VFM-VAE Overview
- VFM-VAE combines the probabilistic modeling mechanism of VAE with RAE, systematically studying how compressed pre-trained visual representations affect the structure and performance of LDM systems [2][6].
- Integrating frozen vision foundation models as tokenizers in VFM-VAE significantly accelerates model convergence and improves generation quality, marking an evolution from pixel compression to semantic representation [2][6].

Group 2: Performance Analysis
- Experimental results indicate that distillation-based tokenizers struggle with semantic alignment under perturbations, while maintaining high consistency between the latent space and vision-foundation-model features is crucial for robustness and convergence efficiency [8][19].
- VFM-VAE demonstrates superior performance and training efficiency, achieving a gFID of 3.80 on ImageNet 256×256, outperforming the distillation route's 5.14, and reaching a gFID of 2.22 with explicit alignment in just 80 epochs, improving training efficiency by roughly 10 times [23][24].

Group 3: Semantic Representation and Alignment
- The research team introduced the SE-CKNNA metric to quantify the consistency between the latent space and vision-foundation-model features, which is essential for evaluating the impact on downstream generation performance [7][19] (a rough alignment-score sketch follows this summary).
- VFM-VAE maintains higher average and peak CKNNA scores than distillation-based tokenizers, indicating a more stable alignment of the latent space with vision-foundation-model features [19][21].

Group 4: Future Directions
- The article concludes with the potential for further exploration of the latent space in multimodal generation and complex visual understanding, aiming to complete the transition from pixel compression to semantic representation [29].
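For intuition about what such an alignment metric measures, here is a rough mutual-k-nearest-neighbour score in the spirit of CKNNA: how often the two feature spaces agree on which samples are neighbours. This is an illustrative approximation under simplifying assumptions, not the exact SE-CKNNA definition from the paper.

```python
# Rough k-NN alignment score between tokenizer latents and frozen VFM features
# computed for the same N images; 1.0 means identical neighbourhood structure.
import torch
import torch.nn.functional as F

def knn_alignment(latents, vfm_feats, k=10):
    """latents, vfm_feats: (N, D1) and (N, D2) features for the same N images."""
    def knn_sets(x):
        x = F.normalize(x, dim=-1)
        sim = x @ x.T
        sim.fill_diagonal_(-float("inf"))           # exclude self-matches
        return sim.topk(k, dim=-1).indices          # (N, k) neighbour indices
    a, b = knn_sets(latents), knn_sets(vfm_feats)
    overlap = torch.tensor([
        len(set(a[i].tolist()) & set(b[i].tolist())) / k for i in range(len(a))
    ])
    return overlap.mean().item()

# Example: alignment between a tokenizer's latents and stand-in foundation-model features.
score = knn_alignment(torch.randn(500, 32), torch.randn(500, 768))
```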
Newsflash | A Stanford professor's startup Inception raises a $50 million seed round, using diffusion models to unlock real-time AI applications
Z Potentials· 2025-11-07 02:12
Core Insights
- The article discusses the current surge of funding into AI startups, highlighting it as a golden period for AI researchers to validate their ideas [1]
- Inception, a startup developing diffusion-based AI models, recently raised $50 million in seed funding, led by Menlo Ventures with participation from several notable investors [2]

Company Overview
- Inception is focused on developing diffusion models, which generate outputs through iterative refinement rather than sequential generation [3] (a toy decoding sketch follows this summary)
- The project leader, Stefano Ermon, had been researching diffusion models since before the recent AI boom and aims to apply these models to a broader range of tasks [3]

Technology and Innovation
- Inception has released a new version of its Mercury model, designed specifically for software development, which has been integrated into various development tools [3]
- Ermon claims that diffusion-based models will significantly optimize two critical metrics, latency and computational cost, stating that these models are faster and more efficient than those built by other companies [3][5]
- Diffusion models differ structurally from the autoregressive models that dominate text-based AI services, and are believed to perform better when handling large volumes of text or operating under data limitations [5]

Performance Metrics
- The diffusion models exhibit greater flexibility in hardware utilization, which is increasingly important as AI's infrastructure demands grow [5]
- Ermon's benchmarks indicate that the models can process over 1,000 tokens per second, surpassing existing autoregressive technologies thanks to their inherent support for parallel processing [5]
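The latency argument rests on the fact that diffusion-style text decoding refines many positions per forward pass instead of one token at a time. The toy loop below illustrates that idea only; the function, the masking scheme, and the stand-in network are assumptions for exposition, not Mercury's architecture.

```python
# Toy masked-refinement decoding: every step re-scores ALL masked positions and
# commits the most confident ones, so cost scales with `steps`, not with `gen_len`.
import torch

def diffusion_style_decode(model, prompt_ids, gen_len=64, steps=8, mask_id=0):
    """model(ids) -> (B, L, vocab) logits over the full sequence."""
    pad = torch.full((prompt_ids.size(0), gen_len), mask_id, dtype=prompt_ids.dtype)
    ids = torch.cat([prompt_ids, pad], dim=1)
    start = prompt_ids.size(1)
    for s in range(steps):
        logits = model(ids)                                   # one parallel forward pass
        conf, cand = logits[:, start:].softmax(-1).max(-1)    # per-position confidence
        still_masked = ids[:, start:] == mask_id
        remaining = int(still_masked[0].sum())
        n_reveal = max(1, remaining // (steps - s))           # reveal a fraction each step
        conf = conf.masked_fill(~still_masked, -1.0)
        top = conf.topk(n_reveal, dim=-1).indices
        ids[:, start:] = ids[:, start:].scatter(1, top, cand.gather(1, top))
    return ids

# Example with a stand-in network (a real masked diffusion LM would go here):
dummy = lambda ids: torch.randn(ids.size(0), ids.size(1), 1000)
out = diffusion_style_decode(dummy, torch.randint(1, 1000, (1, 10)))
```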
Shanghai AI Lab releases SDAR, a hybrid diffusion language model: the first open-source diffusion language model to break 6,600 tgs
机器之心· 2025-11-01 04:22
Core Insights
- The article introduces a new paradigm called SDAR (Synergistic Diffusion-AutoRegression) that addresses the slow inference speed and high costs associated with large model applications, which are primarily due to the serial nature of autoregressive (AR) models [2][3][4].

Group 1: SDAR Paradigm
- SDAR effectively decouples training and inference, combining the high performance of AR models with the parallel inference advantages of diffusion models, allowing for low-cost transformation of any AR model into a parallel decoding model [4][11] (a conceptual decoding sketch follows this summary).
- Experimental results show that SDAR not only matches but often surpasses the performance of original AR models across multiple benchmarks, achieving up to a 12.3 percentage point advantage in complex scientific reasoning tasks [6][28].

Group 2: Performance and Efficiency
- SDAR maintains the performance of AR models while significantly improving inference speed and reducing costs, demonstrating that larger models benefit more from parallelization without sacrificing performance [17][19].
- The research indicates that SDAR can be adapted to any mainstream AR model at a low cost, achieving comparable or superior performance in downstream tasks [19][29].

Group 3: Experimental Validation
- The study conducted rigorous experiments to compare SDAR's performance with AR models, confirming that SDAR can achieve substantial speed improvements in real-world applications, with SDAR-8B-chat showing a 2.3 times acceleration over its AR counterpart [23][20].
- The results highlight that SDAR's unique generation mechanism does not compromise its complex reasoning capabilities, retaining long-chain reasoning abilities and excelling in tasks requiring understanding of structured information [28][29].

Group 4: Future Implications
- SDAR represents a significant advancement in the field of large models, providing a powerful and flexible tool that lowers application barriers and opens new avenues for exploring higher performance and efficiency in AI reasoning paradigms [29][31].
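The hybrid idea referenced in Group 1 can be pictured as "autoregressive across blocks, parallel within a block." The sketch below is a conceptual illustration under that assumption; the function, block sizes, and refinement scheme are hypothetical and not SDAR's released code.

```python
# Blocks are produced left-to-right (AR across blocks), while the tokens inside each
# block are re-predicted jointly for a few refinement passes (parallel within a block).
import torch

def blockwise_parallel_decode(model, prompt_ids, n_blocks=4, block_len=32,
                              refine_steps=4, mask_id=0):
    """model(ids) -> (B, L, vocab) logits over the full sequence."""
    ids = prompt_ids
    for _ in range(n_blocks):
        block = torch.full((ids.size(0), block_len), mask_id, dtype=ids.dtype)
        ids = torch.cat([ids, block], dim=1)
        start = ids.size(1) - block_len
        for _ in range(refine_steps):                      # parallel refinement within a block
            logits = model(ids)
            ids[:, start:] = logits[:, start:].argmax(-1)  # overwrite the whole block at once
    return ids

# Example with a stand-in network (a real hybrid AR/diffusion LM would go here):
dummy = lambda ids: torch.randn(ids.size(0), ids.size(1), 1000)
out = blockwise_parallel_decode(dummy, torch.randint(1, 1000, (1, 8)))
```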
Another path for visual generation: principles and practice of the Infinity autoregressive architecture
AI前线· 2025-10-31 05:42
Core Insights
- The article discusses the significant advancements in visual autoregressive models, particularly highlighting the potential of these models in the context of AI-generated content (AIGC) and their competitive edge against diffusion models [2][4][11].

Group 1: Visual Autoregressive Models
- Visual autoregressive models (VAR) utilize a "coarse-to-fine" approach, starting with low-resolution images and progressively refining them to high-resolution outputs, which aligns more closely with human visual perception [12][18] (a conceptual sketch follows this summary).
- The VAR model architecture includes an improved VQ-VAE that employs a hierarchical structure, allowing for efficient encoding and reconstruction of images while minimizing token usage [15][30].
- VAR has demonstrated superior image generation quality compared to existing models like DiT, showcasing a robust scaling curve that indicates performance improvements with increased model size and computational resources [18][49].

Group 2: Comparison with Diffusion Models
- Diffusion models operate by adding Gaussian noise to images and then training a network to reverse this process, maintaining the original resolution throughout [21][25].
- The key advantages of VAR over diffusion models include higher training parallelism and a more intuitive process that mimics human visual cognition, although diffusion models can correct errors through iterative refinement [27][29].
- VAR's approach allows for faster inference times, with the Infinity model achieving significant speed improvements over comparable diffusion models [46][49].

Group 3: Innovations in Tokenization and Error Correction
- The Infinity framework introduces a novel "bitwise tokenizer" that enhances reconstruction quality while allowing for a larger vocabulary size, thus improving detail and instruction adherence in generated images [31][41].
- A self-correction mechanism is integrated into the training process, enabling the model to learn from previous errors and significantly reducing cumulative error during inference [35][40].
- The findings indicate that larger models benefit from larger vocabularies, reinforcing the reliability of scaling laws in model performance [41][49].
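The "coarse-to-fine" generation loop in Group 1 can be caricatured as predicting a whole residual token map at each scale, conditioned on everything generated at coarser scales. The sketch below is an illustrative assumption about the control flow only; the interfaces and stand-in modules are hypothetical and not the Infinity codebase.

```python
# Next-scale autoregression sketch: each iteration predicts all tokens of one scale
# at once and adds them, upsampled, to a running full-resolution latent.
import torch
import torch.nn.functional as F

def next_scale_generate(transformer, decoder, scales=(1, 2, 4, 8, 16), dim=16, out_res=16):
    """transformer(cond) -> (1, dim, s, s) residual features for the current scale.
    decoder(latent)      -> decoded RGB image."""
    latent = torch.zeros(1, dim, out_res, out_res)               # running full-resolution latent
    for s in scales:
        cond = F.interpolate(latent, size=(s, s), mode="area")   # coarse view of what exists
        residual = transformer(cond)                             # predict all s*s tokens at once
        latent = latent + F.interpolate(residual, size=(out_res, out_res),
                                        mode="bilinear", align_corners=False)
    return decoder(latent)

# Example with stand-in modules (a real VAR-style transformer/decoder would go here):
dummy_transformer = lambda cond: torch.randn(1, cond.size(1), cond.size(2), cond.size(3))
dummy_decoder = lambda z: torch.sigmoid(z[:, :3])   # pretend the first 3 channels are RGB
img = next_scale_generate(dummy_transformer, dummy_decoder)
```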
Nearly 500 pages, the most comprehensive diffusion-model handbook to date: a book by Yang Song and co-authors covering the three mainstream perspectives
机器之心· 2025-10-29 07:23
Core Viewpoint
- The article discusses the comprehensive guide on diffusion models, highlighting their transformative impact on generative AI across various domains such as images, audio, video, and 3D environments [2][4].

Summary by Sections

Introduction to Diffusion Models
- Diffusion models are presented as a method that views the generation process as a gradual transformation over time, contrasting with traditional generative models that directly learn mappings from noise to data [11].
- The article emphasizes the need for a systematic understanding of diffusion models, which the book aims to provide, making it a valuable resource for both researchers and beginners [6][9].

Core Principles of Diffusion Models
- The book outlines the foundational principles of diffusion models, connecting three key perspectives: variational methods, score-based methods, and flow-based methods, which together form a unified theoretical framework [11][13] (the standard formulation these views share is summarized after this section).
- It discusses how these models achieve efficient sample generation and enhanced controllability during the generation process [12].

Detailed Exploration of Perspectives
- The variational view relates to denoising diffusion probabilistic models (DDPMs), providing a basis for probabilistic inference and optimization [23].
- The score-based view focuses on learning score functions to guide the denoising process, linking diffusion modeling with classical differential equation theory [23][24].
- The flow-based view describes the generation process as a continuous flow transformation, allowing for broader applications beyond simple generation tasks [24].

Sampling Techniques and Efficiency
- The article highlights the unique feature of diffusion models, which refine samples from coarse to fine through noise removal, and discusses the trade-off between performance and efficiency [27][28].
- It introduces methods for improving sampling performance without retraining models, such as classifier guidance and advanced numerical solvers, to enhance generation quality and speed [29][30].

Learning Fast Generative Models
- The book explores strategies for directly learning fast generative models that approximate the diffusion process, aiming to reduce reliance on multi-step inference [30][31].
- Distillation-based methods are discussed, where a student model mimics a slower teacher model to achieve faster sampling while maintaining quality [30].

Comprehensive Coverage of Diffusion Models
- The book aims to establish a lasting theoretical framework for diffusion models, focusing on continuous-time dynamical systems that connect simple prior distributions to data distributions [33].
- It emphasizes the importance of understanding the underlying principles and connections between different methods to design and improve next-generation generative models [36].
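For readers new to the three perspectives listed above, the equations below state the commonly used formulation that connects them; these are the standard DDPM / score-matching forms rather than text reproduced from the book.

```latex
% Variational (DDPM) view: forward noising process and denoising training objective.
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right),
\qquad
\mathcal{L} = \mathbb{E}_{x_0,\,\epsilon,\,t}
\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right)\right\rVert^2\right].

% Score-based view: the same network estimates the score of the noised marginal,
\nabla_{x_t}\log q_t(x_t) \approx -\,\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}},

% which plugs into the probability-flow ODE (flow-based view) used for deterministic sampling:
\frac{\mathrm{d}x_t}{\mathrm{d}t} = f(t)\,x_t - \tfrac{1}{2}\,g(t)^2\,\nabla_{x_t}\log q_t(x_t).
```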