Visual Generation
NextStep-1: An Exploration of the Autoregressive Paradigm in Image Generation
机器之心· 2025-08-18 05:15
Published by the 机器之心 editorial team. Autoregressive models are a fascinating cornerstone of the AIGC field. Developers have long probed their limits in visual generation, from classic discrete sequence generation to hybrid paradigms that couple them with powerful diffusion models, and every step has drawn on the community's collective insight. Works such as MAR, Fluid, and LatentLM have been a major source of inspiration, while also revealing room for further optimization: how can the information loss introduced by discretization be avoided? How can the model architecture be made lighter and more capable? To achieve this, the team adopted a lightweight Flow Matching Head. This design brings a further notable advantage: architectural simplicity and purity. Because no large external diffusion model is needed as an "assistant," NextStep-1's overall architecture is highly unified, achieving truly end-to-end training. The 阶跃星辰 (StepFun) team believes NextStep-1's exploration points toward an interesting and promising direction: it demonstrates that a simple, efficient autoregressive model can be built without sacrificing continuity. This is only the first step. 阶跃星辰 has chosen to open-source NextStep-1, in the sincere hope that it will spark more valuable discussion and, together with researchers in the community, continue to advance generative techniques. With these questions in mind, 阶跃星辰 ...
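The summary above credits the Flow Matching Head with letting the model predict continuous image tokens directly, sidestepping discretization. As a rough illustration of the general idea only (module names, dimensions, and the linear-interpolation schedule below are assumptions, not NextStep-1's published implementation), a minimal PyTorch sketch of such a head might look like this:

```python
import torch
import torch.nn as nn

class FlowMatchingHead(nn.Module):
    """Illustrative sketch: a small MLP head that predicts the flow-matching
    velocity for one continuous image token, conditioned on the autoregressive
    backbone's hidden state (sizes are assumptions, not NextStep-1's)."""
    def __init__(self, hidden_dim: int = 1024, token_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + token_dim + 1, 1024),
            nn.SiLU(),
            nn.Linear(1024, token_dim),
        )

    def forward(self, h, x_t, t):
        # h: backbone hidden state [B, hidden_dim]; x_t: noisy token [B, token_dim]
        # t: flow time in [0, 1], shape [B, 1]
        return self.net(torch.cat([h, x_t, t], dim=-1))

def flow_matching_loss(head, h, x1):
    """Linear-interpolation flow matching (an assumed schedule):
    x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I), velocity target x1 - x0."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), 1)
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    return ((head(h, x_t, t) - v_target) ** 2).mean()
```

Under this assumed scheme, sampling a token integrates the predicted velocity from Gaussian noise at t = 0 toward data at t = 1 (e.g., with a few Euler steps), so no large external diffusion model is required, matching the architectural claim in the summary.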
The Latest Survey of Visual Reinforcement Learning: A Field-Wide Overview (NUS, Zhejiang University & CUHK)
自动驾驶之心· 2025-08-16 00:03
Core Insights
- The article discusses the integration of reinforcement learning with computer vision, marking a paradigm shift in how AI interacts with visual data [3][4]
- It highlights the potential for AI not only to understand but also to create and optimize visual content based on human preferences, transforming AI from a passive observer into an active decision-maker [4]

Research Background and Overview
- The emergence of Visual Reinforcement Learning (VRL) is driven by the successful application of reinforcement learning in Large Language Models (LLMs) [7]
- The article identifies three core challenges in the field: stable policy optimization under complex reward signals, efficient processing of high-dimensional visual inputs, and scalable reward-function design for long-horizon decision-making [7][8]

Theoretical Foundations of Visual Reinforcement Learning
- The theoretical framework formalizes the problem as a Markov Decision Process (MDP), unifying the RL treatment of text and visual generation [15]
- Three main alignment paradigms are covered: reinforcement learning from human feedback (RLHF), Direct Preference Optimization (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR) [16][18] (a sketch of the standard DPO objective follows this summary)

Core Applications of Visual Reinforcement Learning
- VRL research is categorized into four main areas: Multimodal Large Language Models (MLLMs), visual generation, unified models, and Vision-Language-Action (VLA) models [31]
- Each area is further divided into specific tasks, with representative works analyzed for their contributions [31][32]

Evaluation Metrics and Benchmarking
- A layered evaluation framework is proposed, detailing specific benchmarks for each area to ensure reproducibility and comparability in VRL research [44][48]
- The article emphasizes the need for effective metrics that align with human perception and can validate the performance of VRL systems [61]

Future Directions and Challenges
- Four key challenges are outlined for the future of VRL: balancing reasoning depth and efficiency, addressing long-horizon RL in VLA tasks, designing reward models for visual generation, and improving data efficiency and generalization [50][52][54]
- Future research should focus on integrating model-based planning, self-supervised visual pre-training, and adaptive curriculum learning to enhance the practical applications of VRL [57]
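The three alignment paradigms differ mainly in where the preference signal enters. As a reference point (this is the standard DPO objective from the original language-model formulation, not an equation reproduced from the survey), DPO optimizes:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the preferred and rejected outputs for input $x$ (e.g., a pair of images generated from one prompt), $\pi_{\mathrm{ref}}$ is a frozen reference policy, $\sigma$ is the logistic function, and $\beta$ controls how far the policy may drift from the reference. RLHF instead trains an explicit reward model and optimizes against it with RL, while RLVR replaces learned rewards with programmatically verifiable ones.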
DanceGRPO: The First Unified Reinforcement Learning Framework for Visual Generation
机器之心· 2025-05-14 08:09
Core Insights
- The article introduces DanceGRPO, an innovative framework that unifies reinforcement learning for visual generation, covering various tasks and models [2][8]

Group 1: Motivation and Background
- The rapid development of generative AI has brought RLHF (Reinforcement Learning from Human Feedback) into focus, particularly in the context of LLMs (Large Language Models) [4]
- Current mainstream RLHF solutions for visual generation tasks are less mature than those for LLMs, with two main categories identified: Diffusion/Flow-DPO and ReFL [4][5]

Group 2: Goals and Features
- The DanceGRPO framework aims to enhance performance significantly, manage memory pressure during video generation, train on large prompt datasets, and adapt to rectified-flow and video generation models [7]

Group 3: Framework Design and Implementation
- DanceGRPO is the first unified framework for visual generation with reinforcement learning, applicable to diffusion and rectified flow, as well as text-to-image, text-to-video, and image-to-video tasks [8]
- The framework follows the GRPO strategy, generating a group of samples per prompt and optimizing the GRPO objective function without KL-divergence regularization [9] (a sketch of this machinery follows this summary)

Group 4: Reward Models
- Five types of reward models were utilized: image aesthetics, video aesthetics, text-image alignment, video dynamic quality, and a new binary reward model combining aesthetics and alignment [10]

Group 5: Experimental Results
- Experimental results show significant improvements across models, with notable gains on metrics such as HPS-v2.1 and CLIP Score for Stable Diffusion and FLUX [12]
- For the HunyuanVideo model, the proposed method yields a 45% improvement in VQ and a 181% increase in MQ [13]
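Since the summary notes that DanceGRPO applies a GRPO objective without a KL term, a minimal sketch of generic GRPO machinery may help. This is an illustrative rendering of standard GRPO (group-normalized advantages plus a clipped surrogate), with all tensor shapes and hyperparameters assumed; it is not DanceGRPO's actual code:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each sample's reward by the
    statistics of its own prompt group (rewards shape: [groups, samples])."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logprobs: torch.Tensor, old_logprobs: torch.Tensor,
              advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate on group-relative advantages; per the
    summary above, no KL-divergence regularization term is added."""
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

The group normalization is what removes the need for a learned value baseline: each sample is scored only against its siblings generated from the same prompt, which suits settings where rewards come from heterogeneous scorers such as the five reward models listed above.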
A 13.8× Throughput Gain! Zhejiang University, Shanghai AI Lab, and Others Propose a New Visual Generation Paradigm: From "Next Token" to "Next Neighborhood"
量子位· 2025-03-30 02:37
Core Viewpoint
- The article discusses a new visual generation paradigm, Neighboring Autoregressive Modeling (NAR), which addresses the efficiency bottlenecks of traditional "next-token prediction" methods in image and video generation tasks [2][12]

Group 1: Introduction to NAR
- Traditional autoregressive models generate images or videos token by token in raster order, leading to slow generation, especially for high-resolution images or long videos [12]
- Existing acceleration methods often compromise generation quality, making it a key challenge to raise efficiency while maintaining high-quality outputs [14]

Group 2: Mechanism of NAR
- NAR introduces a "next-neighborhood prediction" mechanism, treating visual generation as a stepwise outward expansion that allows multiple adjacent tokens to be predicted in parallel [2][3]
- The model employs dimension-guided decoding heads, each predicting the next token along one orthogonal dimension, significantly reducing the number of forward computation steps required [4][5][16]

Group 3: Efficiency and Performance
- In video generation tasks, NAR completes generation in only 2n + t − 2 steps, compared with the tn steps required by traditional models, a significant efficiency advantage [18][20] (see the step-count sketch after this summary)
- Experimental results show that NAR achieves a 13.8× throughput improvement on ImageNet image generation, with lower FID scores than larger models [21][22]

Group 4: Application Results
- For video generation on the UCF-101 dataset, NAR reduces the number of generation steps by 97.3% compared with traditional autoregressive models [23]
- In text-to-image generation, NAR uses only 0.4% of the training data yet matches the performance of larger models, with a 166× increase in throughput [26][27][28]

Group 5: Conclusion
- NAR provides an efficient, high-quality solution for visual generation tasks, indicating significant potential for future AI applications [29]
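To see where the 2n + t − 2 step count comes from, note that all tokens at the same Manhattan distance from the starting corner can be emitted in one parallel step, and a t×n×n token grid has exactly t + 2n − 2 distinct distances. The sketch below (illustrative only; function and variable names are ours, not the paper's) enumerates that schedule:

```python
from collections import defaultdict

def neighborhood_schedule(t: int, n: int):
    """Group (frame, row, col) token coordinates of a t x n x n grid by
    Manhattan distance from (0, 0, 0); tokens at the same distance form
    one parallel decoding step in a next-neighborhood scheme."""
    steps = defaultdict(list)
    for f in range(t):
        for r in range(n):
            for c in range(n):
                steps[f + r + c].append((f, r, c))
    return [steps[d] for d in sorted(steps)]

# For a 4-frame 8x8 grid: 2*8 + 4 - 2 = 18 parallel steps,
# versus t*n*n = 4 * 8 * 8 = 256 sequential next-token steps.
schedule = neighborhood_schedule(t=4, n=8)
assert len(schedule) == 2 * 8 + 4 - 2
print(len(schedule), "steps;", sum(len(s) for s in schedule), "tokens")
```

In this toy configuration the number of sequential forward passes drops from 256 to 18, which mirrors the order-of-magnitude step reductions the article reports; the dimension-guided decoding heads are what let each "frontier" of same-distance tokens be predicted in a single pass.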