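MiniMax Open-Sources the First Unified Visual RL Framework, Led by Yan Junjie! Tackling Both Reasoning and Perception, with Performance Sweeping MEGA-Bench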

Core Insights
- The article discusses the introduction of the V-Triune framework by MiniMax, which allows for unified learning of visual reasoning and perception tasks within a single reinforcement learning (RL) system [1][11]
- The framework addresses the limitations of traditional RL methods that typically focus on either reasoning or perception tasks, enabling a more comprehensive approach to visual tasks [2][8]

Framework and Model Development
- V-Triune employs a three-layer component design and a dynamic Intersection over Union (IoU) reward mechanism to effectively balance multiple tasks [2][22]
- The Orsta model series, developed based on V-Triune, ranges from 7 billion to 32 billion parameters and has shown significant performance improvements in the MEGA-Bench Core benchmark, with enhancements ranging from +2.1% to +14.1% [3][30]

Technical Implementation
- The framework allows for sample-level data formatting, enabling custom reward settings and verifiers for each sample, thus supporting dynamic routing and weight adjustments [13][14]
- An asynchronous client-server architecture is utilized to decouple reward calculation from the main training loop, enhancing flexibility in task expansion and reward logic updates [15][18]

Monitoring and Stability
- The system includes a monitoring mechanism that tracks various metrics such as reward values, IoU, mean Average Precision (mAP), response length, and reflection rates to ensure learning stability [19][21]
- Dynamic IoU rewards are introduced to alleviate cold start issues and guide models in improving localization accuracy through phased threshold adjustments [22][24]

Performance Metrics
- The Orsta models have been trained on a diverse dataset covering four types of reasoning tasks and four types of perception tasks, leading to significant improvements in performance metrics, particularly in perception tasks [30][31]
- The article highlights the effectiveness and scalability of the unified approach, as evidenced by the substantial gains in mAP metrics during testing [30]

Company Background
- MiniMax, recognized as one of the "Six Little Giants" in AI, has been actively expanding its capabilities in the multimodal field, developing models that span language, audio, and video [32]
- The company aims to innovate in multimodal architecture, focusing on a unified generative understanding model [35]
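
The dynamic IoU reward described under Monitoring and Stability can be illustrated with a minimal Python sketch. It is not the V-Triune implementation: the function names, the `SCHEDULE` values, and the rule that a prediction earns its IoU as reward only once it clears the current threshold are all assumptions made for illustration. The summary states only that thresholds are adjusted in phases to ease cold start and then tighten localization accuracy.

```python
from typing import Sequence

# A box is (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Sequence[float]

# Hypothetical schedule: (fraction of training completed, IoU threshold).
# These numbers are illustrative assumptions, not values from the article.
SCHEDULE = ((0.10, 0.85), (0.25, 0.95), (1.00, 0.99))


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def threshold_at(progress: float) -> float:
    """Return the IoU threshold for a training progress value in [0, 1]."""
    for phase_end, threshold in SCHEDULE:
        if progress <= phase_end:
            return threshold
    return SCHEDULE[-1][1]


def dynamic_iou_reward(pred: Box, target: Box, progress: float) -> float:
    """Assumed rule: reward is the IoU if it clears the phase threshold, else 0."""
    score = iou(pred, target)
    return score if score >= threshold_at(progress) else 0.0


if __name__ == "__main__":
    pred, target = (0, 0, 10, 9), (0, 0, 10, 10)
    # The same prediction is rewarded early in training and rejected later.
    print(dynamic_iou_reward(pred, target, progress=0.05))
    print(dynamic_iou_reward(pred, target, progress=0.50))
```

Under these assumed values, a loose threshold early in training lets imperfect boxes still earn a signal, while the stricter later phases reward only tight localization.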