Latest Survey on Visual Reinforcement Learning: A Full-Field Overview (National University of Singapore, Zhejiang University & CUHK)
自动驾驶之心· 2025-08-16 00:03
Core Insights
- The article discusses the integration of Reinforcement Learning with Computer Vision, marking a paradigm shift in how AI interacts with visual data [3][4]
- It highlights the potential for AI not only to understand but also to create and optimize visual content based on human preferences, transforming AI from passive observer to active decision-maker [4]

Research Background and Overview
- The emergence of Visual Reinforcement Learning (VRL) is driven by the successful application of Reinforcement Learning in Large Language Models (LLMs) [7]
- The article identifies three core challenges in the field: stable policy optimization under complex reward signals, efficient processing of high-dimensional visual inputs, and scalable reward-function design for long-horizon decision-making [7][8]

Theoretical Foundations of Visual Reinforcement Learning
- The theoretical framework formalizes VRL as a Markov Decision Process (MDP), unifying the RL formulations for text and visual generation [15]
- Three main alignment paradigms are covered: Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR) [16][18]

Core Applications of Visual Reinforcement Learning
- The article categorizes VRL research into four main areas: Multimodal Large Language Models (MLLM), Visual Generation, Unified Models, and Vision-Language-Action (VLA) Models [31]
- Each area is further divided into specific tasks, with representative works analyzed for their contributions [31][32]

Evaluation Metrics and Benchmarking
- A layered evaluation framework is proposed, detailing specific benchmarks for each area to ensure reproducibility and comparability in VRL research [44][48]
- The article emphasizes the need for metrics that align with human perception and can validate the performance of VRL systems [61]

Future Directions and Challenges
- The article outlines four key challenges for the future of VRL: balancing reasoning depth against efficiency, handling long-horizon RL in VLA tasks, designing reward models for visual generation, and improving data efficiency and generalization [50][52][54]
- It suggests that future research focus on integrating model-based planning, self-supervised visual pre-training, and adaptive curriculum learning to enhance the practical applications of VRL [57]
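As a concrete illustration of one alignment paradigm the survey covers, below is a minimal sketch of the standard pairwise DPO loss from the original DPO formulation. The function and variable names are illustrative assumptions, not code from the survey:

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    logp_w / logp_l       : policy log-probs of the preferred / rejected response
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model
    beta                  : temperature controlling deviation from the reference
    """
    # Implicit reward margin between preferred and rejected responses
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written with log1p for numerical stability
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; as the policy assigns relatively more probability to the preferred response, the loss decreases toward zero.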
Deep Integration of VLM and Diffusion Models: A Three-in-One Model for Image Understanding, Generation, and Editing, with Weights, Data, and Training Pipeline Fully Open-Sourced
量子位· 2025-08-01 04:23
Contributed by the Nexus-Gen team · QbitAI | WeChat official account QbitAI

VLMs and diffusion models have been integrated into one. The ModelScope (魔搭) team has released Nexus-Gen V2, a unified model that supports image understanding, generation, and editing, with the model weights, training pipeline, and datasets all open-sourced.

Why does this matter? Since the start of this year, unified models from major labs such as GPT-4o-Image, Gemini, and Blip3O have been proving the same point: packing image understanding and generation into a single model is not merely a convenience; the organic combination of the two tasks can yield unexpected gains.

The ModelScope team had already released a V1 version back in May, but they quickly found problems: image understanding degraded significantly compared with the original VLM, image generation was overly sensitive to prompts, and edits did not preserve details well.

So they spent several months on a major upgrade, optimizing along three directions for this V2 release.

For image understanding, the training strategy was revised to preserve as much of the VLM's understanding capability as possible.

For image generation, all generation samples were re-captioned: each image is annotated with both a long and a short description, with one sampled during training, which improves generation robustness; Chinese-captioned samples were also added, enabling Chinese-prompted image generation.

For image editing, the team systematically studied the relationship between image reconstruction quality and the number of image-encoding tokens, and designed a brand-new editing scheme. After ...
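The long/short dual-captioning idea described above can be sketched in a few lines. This is a minimal illustration only; the function name, API, and the 50/50 default split are assumptions, not details released by the Nexus-Gen team:

```python
import random

def sample_caption(long_caption: str, short_caption: str,
                   p_long: float = 0.5) -> str:
    """Pick one of the two captions annotated for a training image.

    Sampling between a long and a short description exposes the generator
    to varied prompt lengths, which is the robustness idea described above.
    The 50/50 default is an assumed ratio for illustration.
    """
    return long_caption if random.random() < p_long else short_caption
```

At training time this would be called once per image per epoch, so the same image is paired with prompts of different granularity across the run.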