Variational Autoencoder (VAE)
A model with no pre-training takes third place on ARC-AGI! Mamba author challenges the Scaling Law with compression principles
量子位· 2025-12-15 10:33
Core Insights
- The article covers CompressARC, a new piece of research that approaches artificial intelligence through the Minimum Description Length (MDL) principle, diverging from traditional large-scale pre-training methods [1][7][48].
Group 1: Research Findings
- Using only 76K parameters and no pre-training, CompressARC solved 20% of the problems on the ARC-AGI-1 benchmark [3][5][48].
- The model reached 34.75% on training puzzles, demonstrating that it can generalize without relying on extensive datasets [7][48].
- CompressARC took third place in the ARC Prize 2025, a recognition of the innovation and effectiveness of the approach [5].
Group 2: Methodology
- The core of CompressARC is minimizing the description length of a single ARC-AGI puzzle, i.e., expressing that puzzle as the shortest possible computer program [8][10][23]; a minimal sketch of this per-puzzle optimization follows this summary.
- The model does not learn a generalized rule; it searches for the most concise representation of the puzzle at hand, in line with MDL theory [8][9][10].
- A fixed "program template" lets the model generate puzzles by filling in hardcoded values and weights, reducing the search for the shortest program to an optimization over those values [25][28].
Group 3: Technical Architecture
- CompressARC uses an equivariant neural network that handles symmetry explicitly, so equivalent transformations of a puzzle are treated uniformly [38][39]; a generic symmetry-averaging sketch appears after the MDL sketch below.
- A multitensor structure stores high-level relational information, strengthening the inductive biases needed for abstract reasoning [40][41].
- The architecture resembles a Transformer, with a residual backbone and custom operations tailored to the rules of ARC-AGI puzzles, keeping the program description efficient [42][44].
Group 4: Performance Evaluation
- Each puzzle was solved with 2,000 inference-time training steps, taking roughly 20 minutes per puzzle, which factors into the reported performance [47].
- CompressARC challenges the assumption that intelligence must stem from large-scale pre-training, suggesting that a clever application of MDL and compression principles can yield surprising capabilities [48].
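To make the per-puzzle MDL objective concrete, the sketch below minimizes "bits for a latent code plus bits for the puzzle given the decoded distribution" by gradient descent on a single puzzle, with no pre-training. The names (PuzzleDecoder, description_length_bits), the network sizes, and the Gaussian coding cost are illustrative assumptions that mirror the idea described above, not CompressARC's actual code.

```python
# Hedged sketch: per-puzzle MDL minimization in the spirit of CompressARC.
# Description length is approximated as bits(latent code) + bits(grid | decoder output)
# and minimized on one puzzle only -- no dataset beyond the puzzle itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_COLORS = 10  # ARC grids use 10 colors

class PuzzleDecoder(nn.Module):
    """Tiny fixed 'program template': maps a learned latent to per-cell color logits."""
    def __init__(self, latent_dim=64, height=10, width=10):
        super().__init__()
        self.latent = nn.Parameter(torch.zeros(latent_dim))   # per-puzzle code
        self.net = nn.Sequential(                             # template weights
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, height * width * NUM_COLORS),
        )
        self.shape = (height, width)

    def forward(self):
        h, w = self.shape
        return self.net(self.latent).view(h, w, NUM_COLORS)

def description_length_bits(logits, target_grid, latent, sigma=1.0):
    ln2 = torch.log(torch.tensor(2.0))
    # Bits to transmit the latent code under a unit-variance Gaussian prior ...
    code_bits = 0.5 * (latent / sigma).pow(2).sum() / ln2
    # ... plus bits to transmit the grid given the decoded color distribution.
    data_bits = F.cross_entropy(
        logits.reshape(-1, NUM_COLORS), target_grid.reshape(-1), reduction="sum"
    ) / ln2
    return code_bits + data_bits

# One puzzle, a few thousand inference-time steps (the article cites 2,000).
target = torch.randint(0, NUM_COLORS, (10, 10))   # stand-in for a real ARC grid
model = PuzzleDecoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(2000):
    opt.zero_grad()
    bits = description_length_bits(model(), target, model.latent)
    bits.backward()
    opt.step()
```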
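The equivariance point can be illustrated with a standard group-averaging construction: wrapping a base layer so that rotating or flipping the input rotates or flips the output the same way. The names (d4_pairs, D4EquivariantConv) and the plain convolution are assumptions for illustration; CompressARC's actual layers also cover color permutations and the multitensor structure mentioned above.

```python
# Hedged sketch: equivariance to the 8 grid symmetries (the D4 group of
# rotations and flips) by averaging a base conv over the group.
import torch
import torch.nn as nn

def d4_pairs():
    """(transform, inverse) pairs for the 8 rotations/reflections of a square grid."""
    pairs = []
    for k in range(4):
        pairs.append((lambda t, k=k: torch.rot90(t, k, dims=(-2, -1)),
                      lambda t, k=k: torch.rot90(t, -k, dims=(-2, -1))))
        pairs.append((lambda t, k=k: torch.rot90(t.flip(-1), k, dims=(-2, -1)),
                      lambda t, k=k: torch.rot90(t, -k, dims=(-2, -1)).flip(-1)))
    return pairs

class D4EquivariantConv(nn.Module):
    """Group-averaged conv: transforming the input transforms the output identically."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                      # x: (B, C, H, W), square grids assumed
        outs = [g_inv(self.conv(g(x))) for g, g_inv in d4_pairs()]
        return torch.stack(outs).mean(0)

# Quick check: rotating the input rotates the output (expect True).
layer = D4EquivariantConv(channels=4)
x = torch.randn(1, 4, 9, 9)
rot = lambda t: torch.rot90(t, 1, dims=(-2, -1))
print(torch.allclose(layer(rot(x)), rot(layer(x)), atol=1e-5))
```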
The world has long suffered under the VAE: Alibaba Amap proposes a pixel-space generative-model training paradigm that does away with VAE dependence entirely
量子位· 2025-10-29 02:39
Core Insights
- The article reviews the rapid development of diffusion-model-based image generation, highlights the limitations of the Variational Autoencoder (VAE), and introduces the EPG framework as a solution [1][19].
Training Efficiency and Generation Quality
- EPG shows significant improvements in training efficiency and generation quality, reaching FID 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with only 75 model forward passes [3][19].
- Compared with mainstream VAE-based models such as DiT and SiT, EPG needs far less training time: 57 hours of pre-training and 139 hours of fine-tuning, versus 160 hours and 506 hours for DiT [7].
Consistency Model Training
- EPG trains a consistency model directly in pixel space without relying on a VAE or pre-trained diffusion-model weights, reaching FID 8.82 on ImageNet-256 [5][19].
Training Complexity and Costs
- Training a VAE is difficult because it must balance compression rate against reconstruction quality [6].
- Adapting to new domains is costly: if the pre-trained VAE performs poorly there, the entire model must be retrained, increasing development time and cost [6].
Two-Stage Training Method
- EPG uses a two-stage recipe, self-supervised pre-training (SSL Pre-training) followed by end-to-end fine-tuning, decoupling representation learning from pixel reconstruction [8][19]; hedged sketches of both stages follow this summary.
- The first stage extracts high-quality visual features from noisy images using a contrastive loss together with a representation-consistency loss [9][19].
- The second stage directly fine-tunes the pre-trained encoder together with a randomly initialized decoder, simplifying the training process [13][19].
Performance and Scalability
- The overall EPG pipeline resembles a classic image-classification setup, which substantially lowers the barrier to developing and deploying downstream generation tasks [14][19].
- Diffusion models trained with EPG are efficient at inference, needing only 75 forward passes to reach their best results, and scale well [18].
Conclusion
- The EPG framework offers a new, efficient, VAE-independent way to train pixel-space generative models, with superior training efficiency and generation quality [19].
- EPG's "de-VAE" paradigm is expected to drive further exploration and application in generative AI, lowering development barriers and fostering innovation [19].
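A minimal sketch of the first stage as summarized above: an encoder is trained on noised images with a contrastive (InfoNCE-style) loss plus a representation-consistency term, decoupled from any pixel reconstruction. The names (Encoder, add_noise, nt_xent), the noise schedule, and the exact loss forms are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of EPG stage 1: SSL pre-training of an encoder on noisy images.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, dim),
        )
    def forward(self, x):
        return F.normalize(self.backbone(x), dim=-1)   # unit-norm features

def add_noise(x, t):
    """Simple Gaussian forward process: x_t = sqrt(1-t)*x + sqrt(t)*eps."""
    eps = torch.randn_like(x)
    t = t.view(-1, 1, 1, 1)
    return (1 - t).sqrt() * x + t.sqrt() * eps

def nt_xent(z1, z2, tau=0.1):
    """InfoNCE between two views; positives are matching batch indices."""
    logits = z1 @ z2.t() / tau
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

encoder = Encoder()
opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)
clean = torch.randn(16, 3, 64, 64)              # stand-in for an image batch
t1, t2 = torch.rand(16), torch.rand(16)         # two noise levels per image
z1, z2 = encoder(add_noise(clean, t1)), encoder(add_noise(clean, t2))
loss = nt_xent(z1, z2) + F.mse_loss(z1, z2)     # contrastive + rep. consistency
opt.zero_grad(); loss.backward(); opt.step()
```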
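And a matching sketch of the second stage: the pre-trained encoder is paired with a randomly initialized decoder and the pair is fine-tuned end to end on a pixel-space denoising objective, with no VAE latent in the loop. The tiny modules, the checkpoint path, and the plain MSE target are illustrative assumptions; the paper's actual architectures and loss may differ.

```python
# Hedged sketch of EPG stage 2: end-to-end fine-tuning of encoder + fresh decoder
# directly in pixel space (no VAE latent anywhere).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):          # stands in for the SSL-pre-trained encoder
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)             # (B, dim, H/4, W/4) feature map

class TinyDecoder(nn.Module):          # randomly initialized at the start of stage 2
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )
    def forward(self, h):
        return self.net(h)             # predicted clean pixels

encoder, decoder = TinyEncoder(), TinyDecoder()
# encoder.load_state_dict(torch.load("ssl_pretrained.pt"))  # hypothetical checkpoint
opt = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

clean = torch.randn(8, 3, 64, 64)                      # stand-in image batch
t = torch.rand(8).view(-1, 1, 1, 1)
noisy = (1 - t).sqrt() * clean + t.sqrt() * torch.randn_like(clean)

pred = decoder(encoder(noisy))                         # denoise in pixel space
loss = F.mse_loss(pred, clean)                         # simple reconstruction target
opt.zero_grad(); loss.backward(); opt.step()
```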