Visual Generation
NextStep-1: An Exploration of the Autoregressive Paradigm in Image Generation
机器之心· 2025-08-18 05:15
Published by the 机器之心 editorial team. Autoregressive models are a fascinating cornerstone of the AIGC field. Developers have long probed their limits in visual generation, from classic discrete sequence generation to hybrid paradigms that couple them with powerful diffusion models, and every step has drawn on the community's collective insight. Works such as MAR, Fluid, and LatentLM have been a major source of inspiration, while also revealing room for further optimization: how can the information loss introduced by discretization be avoided? How can the model architecture be made lighter and more capable? To achieve this, the team adopted a lightweight Flow Matching Head. This design brings a further notable advantage: architectural simplicity and purity. Because no large external diffusion model is needed as an "assistant," NextStep-1's overall architecture is highly unified, achieving truly end-to-end training. The 阶跃星辰 (StepFun) team believes NextStep-1's exploration points toward an interesting and promising direction: it demonstrates that a simple, efficient autoregressive model can be built without sacrificing continuity. This is only the first step. 阶跃星辰 has chosen to open-source NextStep-1, in the sincere hope that it will spark more valuable discussion and, together with researchers in the community, continue to advance generative techniques. With these questions in mind, 阶跃星辰 ...
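The summary above credits the Flow Matching Head with letting the model predict continuous image tokens directly, sidestepping discretization. As a rough illustration of the general idea only (module names, dimensions, and the linear-interpolation schedule below are assumptions, not NextStep-1's published implementation), a minimal PyTorch sketch of such a head might look like this:

```python
import torch
import torch.nn as nn

class FlowMatchingHead(nn.Module):
    """Illustrative sketch: a small MLP head that predicts the flow-matching
    velocity for one continuous image token, conditioned on the autoregressive
    backbone's hidden state (sizes are assumptions, not NextStep-1's)."""
    def __init__(self, hidden_dim: int = 1024, token_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + token_dim + 1, 1024),
            nn.SiLU(),
            nn.Linear(1024, token_dim),
        )

    def forward(self, h, x_t, t):
        # h: backbone hidden state [B, hidden_dim]; x_t: noisy token [B, token_dim]
        # t: flow time in [0, 1], shape [B, 1]
        return self.net(torch.cat([h, x_t, t], dim=-1))

def flow_matching_loss(head, h, x1):
    """Linear-interpolation flow matching (an assumed schedule):
    x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I), velocity target x1 - x0."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), 1)
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    return ((head(h, x_t, t) - v_target) ** 2).mean()
```

Under this assumed scheme, sampling a token integrates the predicted velocity from Gaussian noise at t = 0 toward data at t = 1 (e.g., with a few Euler steps), so no large external diffusion model is required, matching the architectural claim in the summary.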
The Latest Survey of Visual Reinforcement Learning: A Field-Wide Overview (NUS, Zhejiang University & CUHK)
自动驾驶之心· 2025-08-16 00:03
Core Insights
- The article discusses the integration of reinforcement learning with computer vision, marking a paradigm shift in how AI interacts with visual data [3][4]
- It highlights the potential for AI not only to understand but also to create and optimize visual content based on human preferences, transforming AI from a passive observer into an active decision-maker [4]

Research Background and Overview
- The emergence of Visual Reinforcement Learning (VRL) is driven by the successful application of reinforcement learning in Large Language Models (LLMs) [7]
- The article identifies three core challenges in the field: stable policy optimization under complex reward signals, efficient processing of high-dimensional visual inputs, and scalable reward-function design for long-horizon decision-making [7][8]

Theoretical Foundations of Visual Reinforcement Learning
- The theoretical framework formalizes the problem as a Markov Decision Process (MDP), unifying the RL treatment of text and visual generation [15]
- Three main alignment paradigms are covered: reinforcement learning from human feedback (RLHF), Direct Preference Optimization (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR) [16][18] (a sketch of the standard DPO objective follows this summary)

Core Applications of Visual Reinforcement Learning
- VRL research is categorized into four main areas: Multimodal Large Language Models (MLLMs), visual generation, unified models, and Vision-Language-Action (VLA) models [31]
- Each area is further divided into specific tasks, with representative works analyzed for their contributions [31][32]

Evaluation Metrics and Benchmarking
- A layered evaluation framework is proposed, detailing specific benchmarks for each area to ensure reproducibility and comparability in VRL research [44][48]
- The article emphasizes the need for effective metrics that align with human perception and can validate the performance of VRL systems [61]

Future Directions and Challenges
- Four key challenges are outlined for the future of VRL: balancing reasoning depth and efficiency, addressing long-horizon RL in VLA tasks, designing reward models for visual generation, and improving data efficiency and generalization [50][52][54]
- Future research should focus on integrating model-based planning, self-supervised visual pre-training, and adaptive curriculum learning to enhance the practical applications of VRL [57]
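The three alignment paradigms differ mainly in where the preference signal enters. As a reference point (this is the standard DPO objective from the original language-model formulation, not an equation reproduced from the survey), DPO optimizes:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the preferred and rejected outputs for input $x$ (e.g., a pair of images generated from one prompt), $\pi_{\mathrm{ref}}$ is a frozen reference policy, $\sigma$ is the logistic function, and $\beta$ controls how far the policy may drift from the reference. RLHF instead trains an explicit reward model and optimizes against it with RL, while RLVR replaces learned rewards with programmatically verifiable ones.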
DanceGRPO: The First Unified Reinforcement Learning Framework for Visual Generation
机器之心· 2025-05-14 08:09
Core Insights
- The article introduces DanceGRPO, an innovative framework that unifies reinforcement learning for visual generation, covering various tasks and models [2][8]

Group 1: Motivation and Background
- The rapid development of generative AI has brought RLHF (Reinforcement Learning from Human Feedback) into focus, particularly in the context of LLMs (Large Language Models) [4]
- Current mainstream RLHF solutions for visual generation tasks are less mature than those for LLMs, with two main categories identified: Diffusion/Flow-DPO and ReFL [4][5]

Group 2: Goals and Features
- The DanceGRPO framework aims to enhance performance significantly, manage memory pressure during video generation, train on large prompt datasets, and adapt to rectified-flow and video generation models [7]

Group 3: Framework Design and Implementation
- DanceGRPO is the first unified framework for visual generation with reinforcement learning, applicable to diffusion and rectified flow, as well as text-to-image, text-to-video, and image-to-video tasks [8]
- The framework follows the GRPO strategy, generating a group of samples per prompt and optimizing the GRPO objective function without KL-divergence regularization [9] (a sketch of this machinery follows this summary)

Group 4: Reward Models
- Five types of reward models were utilized: image aesthetics, video aesthetics, text-image alignment, video dynamic quality, and a new binary reward model combining aesthetics and alignment [10]

Group 5: Experimental Results
- Experimental results show significant improvements across models, with notable gains on metrics such as HPS-v2.1 and CLIP Score for Stable Diffusion and FLUX [12]
- For the HunyuanVideo model, the proposed method yields a 45% improvement in VQ and a 181% increase in MQ [13]
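Since the summary notes that DanceGRPO applies a GRPO objective without a KL term, a minimal sketch of generic GRPO machinery may help. This is an illustrative rendering of standard GRPO (group-normalized advantages plus a clipped surrogate), with all tensor shapes and hyperparameters assumed; it is not DanceGRPO's actual code:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each sample's reward by the
    statistics of its own prompt group (rewards shape: [groups, samples])."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logprobs: torch.Tensor, old_logprobs: torch.Tensor,
              advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate on group-relative advantages; per the
    summary above, no KL-divergence regularization term is added."""
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

The group normalization is what removes the need for a learned value baseline: each sample is scored only against its siblings generated from the same prompt, which suits settings where rewards come from heterogeneous scorers such as the five reward models listed above.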
A 13.8× Throughput Gain! Zhejiang University, Shanghai AI Lab, and Others Propose a New Visual Generation Paradigm: From "Next Token" to "Next Neighborhood"
量子位· 2025-03-30 02:37
Core Viewpoint
- The article discusses a new visual generation paradigm, Neighboring Autoregressive Modeling (NAR), which addresses the efficiency bottlenecks of traditional "next-token prediction" methods in image and video generation tasks [2][12]

Group 1: Introduction to NAR
- Traditional autoregressive models generate images or videos token by token in raster order, leading to slow generation, especially for high-resolution images or long videos [12]
- Existing acceleration methods often compromise generation quality, making it a key challenge to raise efficiency while maintaining high-quality outputs [14]

Group 2: Mechanism of NAR
- NAR introduces a "next-neighborhood prediction" mechanism, treating visual generation as a stepwise outward expansion that allows multiple adjacent tokens to be predicted in parallel [2][3]
- The model employs dimension-guided decoding heads, each predicting the next token along one orthogonal dimension, significantly reducing the number of forward computation steps required [4][5][16]

Group 3: Efficiency and Performance
- In video generation tasks, NAR completes generation in only 2n + t − 2 steps, compared with the tn steps required by traditional models, a significant efficiency advantage [18][20] (see the step-count sketch after this summary)
- Experimental results show that NAR achieves a 13.8× throughput improvement on ImageNet image generation, with lower FID scores than larger models [21][22]

Group 4: Application Results
- For video generation on the UCF-101 dataset, NAR reduces the number of generation steps by 97.3% compared with traditional autoregressive models [23]
- In text-to-image generation, NAR uses only 0.4% of the training data yet matches the performance of larger models, with a 166× increase in throughput [26][27][28]

Group 5: Conclusion
- NAR provides an efficient, high-quality solution for visual generation tasks, indicating significant potential for future AI applications [29]
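To see where the 2n + t − 2 step count comes from, note that all tokens at the same Manhattan distance from the starting corner can be emitted in one parallel step, and a t×n×n token grid has exactly t + 2n − 2 distinct distances. The sketch below (illustrative only; function and variable names are ours, not the paper's) enumerates that schedule:

```python
from collections import defaultdict

def neighborhood_schedule(t: int, n: int):
    """Group (frame, row, col) token coordinates of a t x n x n grid by
    Manhattan distance from (0, 0, 0); tokens at the same distance form
    one parallel decoding step in a next-neighborhood scheme."""
    steps = defaultdict(list)
    for f in range(t):
        for r in range(n):
            for c in range(n):
                steps[f + r + c].append((f, r, c))
    return [steps[d] for d in sorted(steps)]

# For a 4-frame 8x8 grid: 2*8 + 4 - 2 = 18 parallel steps,
# versus t*n*n = 4 * 8 * 8 = 256 sequential next-token steps.
schedule = neighborhood_schedule(t=4, n=8)
assert len(schedule) == 2 * 8 + 4 - 2
print(len(schedule), "steps;", sum(len(s) for s in schedule), "tokens")
```

In this toy configuration the number of sequential forward passes drops from 256 to 18, which mirrors the order-of-magnitude step reductions the article reports; the dimension-guided decoding heads are what let each "frontier" of same-distance tokens be predicted in a single pass.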