Core Insights
- The article discusses V-Triune, a unified reinforcement learning system from MiniMax that trains vision-language models (VLMs) on both visual reasoning and perception tasks within a single training process [2][4][5].

Group 1: V-Triune Overview
- V-Triune consists of three complementary components: Sample-Level Data Formatting, Verifier-Level Reward Computation, and Source-Level Metric Monitoring, which work together to handle diverse tasks [3][8].
- The system uses a novel dynamic IoU reward mechanism that provides adaptive feedback for perception tasks, yielding performance improvements on both reasoning and perception tasks [3][4].

Group 2: Performance Improvements
- Orsta, the model trained with V-Triune, achieved significant gains on the MEGA-Bench Core benchmark, with improvements ranging from +2.1 to +14.1 across model variants [4][49].
- Training on diverse datasets covering a range of visual reasoning and perception tasks contributes to the model's broad capabilities [3][49].

Group 3: Sample-Level Data Formatting
- Because different tasks require distinct reward types and configurations, MiniMax defines rewards at the sample level, allowing dynamic routing and fine-grained weighting during training (sketched after these summaries) [9][13][16].
- This design enables diverse datasets to be merged seamlessly into one unified training process while keeping reward control flexible and scalable [16].

Group 4: Verifier-Level Reward Computation
- MiniMax uses an independent, asynchronous reward server to generate reinforcement learning signals, improving modularity and scalability [17][19].
- The architecture allows new tasks to be added or reward logic to be updated without modifying the core training process [20].

Group 5: Source-Level Metric Monitoring
- The Source-Level Metric Monitoring strategy records key performance indicators by data source for every training batch, enabling targeted debugging and insight into how different data sources interact (sketched below) [21][24].
- Key monitored metrics include the dynamic IoU reward, perception-task IoU/mAP, response length, and reflection rate, all tracked continuously per data source [24][22].

Group 6: Dynamic IoU Reward Strategy
- The dynamic IoU reward strategy adjusts the IoU threshold over the course of training to balance learning efficiency against final accuracy, starting with a relaxed threshold and progressively tightening it (sketched below) [26][25].
- This approach guides the model's learning smoothly while ensuring high accuracy in the later stages of training [26].

Group 7: Training Methodology
- V-Triune supports scalable data, task, verifier, and metric systems, but early experiments showed that joint training could become unstable [28][29].
- To address this, MiniMax applied targeted adjustments, including freezing the ViT parameters to prevent gradient explosion and managing memory during large-scale training (sketched below) [34][35].

Group 8: Experimental Results
- MiniMax ran experiments with Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-32B-Instruct as base models, on a curated dataset of 20,600 perception samples and 27,100 reasoning samples [46].
- The results indicate that V-Triune delivers significant gains on reasoning and perception tasks, particularly in areas with rich training data [49][55].
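The sketches below are minimal Python illustrations of the mechanisms summarized above; all function and field names are assumptions for illustration, not MiniMax's actual implementation. This first sketch shows how sample-level reward formatting (Group 3) and verifier routing (Group 4) might fit together: each training sample carries its own reward specification, and a dispatcher sends the model's response to the named verifier and applies the sample's own weights.

```python
# Hypothetical sketch of sample-level reward formatting with verifier routing.
# Field and verifier names are illustrative assumptions, not V-Triune's API.
from typing import Callable, Dict

samples = [
    {   # reasoning sample: graded by a rule-based answer verifier
        "prompt": "Solve 3x + 5 = 20 and give x.",
        "answer": "5",
        "reward_spec": {"verifier": "math_verify",
                        "weights": {"accuracy": 0.9, "format": 0.1}},
    },
    {   # perception sample: graded by an IoU-based verifier
        "prompt": "Detect every cat and return boxes as [x1, y1, x2, y2].",
        "answer": [[10, 20, 110, 200]],
        "reward_spec": {"verifier": "detection_iou",
                        "weights": {"accuracy": 1.0}},
    },
]

def route_reward(sample: Dict, response: str,
                 verifiers: Dict[str, Callable[[Dict, str], Dict[str, float]]]) -> float:
    """Send the response to the verifier named in the sample's reward_spec and
    combine the returned sub-rewards using the sample's own weights."""
    spec = sample["reward_spec"]
    sub_rewards = verifiers[spec["verifier"]](sample, response)
    return sum(spec["weights"].get(k, 0.0) * v for k, v in sub_rewards.items())
```

Because each verifier sits behind a name lookup, it could equally be an asynchronous call to a standalone reward server, which is how the article describes the modularity benefit of Verifier-Level Reward Computation.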
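A similarly minimal view of Source-Level Metric Monitoring (Group 5), assuming each rollout record is tagged with the data source it came from; the metric names follow the article, while the aggregation logic is an assumption.

```python
# Hypothetical per-batch aggregation of training metrics grouped by data source.
from collections import defaultdict
from statistics import mean

def summarize_by_source(batch_records):
    """batch_records: iterable of dicts such as
    {"source": "coco_det", "reward": 0.7, "iou": 0.62,
     "response_len": 180, "reflection": 0}."""
    buckets = defaultdict(list)
    for rec in batch_records:
        buckets[rec["source"]].append(rec)
    summaries = {}
    for source, recs in buckets.items():
        summaries[source] = {
            "reward": mean(r["reward"] for r in recs),
            "iou": mean(r.get("iou", 0.0) for r in recs),
            "response_len": mean(r["response_len"] for r in recs),
            "reflection_rate": mean(r["reflection"] for r in recs),
        }
    return summaries  # in practice, logged per step so per-source regressions are traceable
```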
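The dynamic IoU reward (Group 6) can be pictured as a threshold schedule over training steps; the breakpoints and threshold values below are placeholders, not the settings MiniMax actually uses.

```python
# Illustrative dynamic IoU reward: the pass/fail IoU threshold is relaxed early
# in training and tightened later. Schedule values are placeholders.
def iou(a, b):
    """Intersection-over-union of two boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def dynamic_iou_reward(pred_box, gt_box, step, total_steps):
    """Binary reward whose IoU threshold tightens as training progresses."""
    progress = step / max(total_steps, 1)
    if progress < 0.1:
        threshold = 0.5      # relaxed: early predictions still earn some signal
    elif progress < 0.5:
        threshold = 0.75
    else:
        threshold = 0.95     # strict: late training pushes final accuracy
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0
```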
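Finally, the ViT-freezing fix mentioned in Group 7 is straightforward to express. The auto class and the "visual" parameter-name prefix below are assumptions based on Qwen2.5-VL's published Hugging Face implementation and should be verified against the exact checkpoint and transformers version in use.

```python
# Hedged sketch: freeze the vision encoder so it receives no gradient updates
# during RL training (one of the stability measures described above).
import torch
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16
)
frozen = 0
for name, param in model.named_parameters():
    if "visual" in name:          # ViT / vision-tower parameters
        param.requires_grad = False
        frozen += param.numel()
print(f"Frozen vision parameters: {frozen / 1e6:.1f}M")
```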
Group 9: Conclusion
- Overall, MiniMax's findings suggest that reinforcement learning can effectively enhance visual reasoning and perception capabilities within a unified framework, demonstrating continuous performance improvements across a variety of tasks [55][56].
One RL to See Them All? One Reinforcement Learning Framework Unifies Vision-Language Tasks!
机器之心·2025-05-27 04:11