ICLR 2026 | Does this problem need thinking with images? Let the model tell you! Adaptive thinking-mode switching boosts general visual reasoning
机器之心·2026-02-05 04:35

Core Insights
- The article presents a new adaptive reasoning paradigm, Mixture-of-Visual-Thoughts (MoVT), which integrates different reasoning modes to enhance general visual reasoning capabilities [2][13][39]
- The proposed framework, AdaVaR, employs a two-stage learning process so the model learns the reasoning modes and adaptively selects the appropriate one based on the problem context [14][20][39]

Group 1: Background and Reasoning Modes
- Current visual reasoning methods fall into two main paradigms: pure-text reasoning, similar to LLMs, and visually grounded reasoning, which uses structured information to connect key concepts with image regions [5][9]
- The text reasoning mode excels at abstract visual problems, such as geometry, but is prone to hallucination and performs poorly on visual search tasks [12][26]
- The grounded reasoning mode is better at exploiting visual information and suppressing hallucination, but struggles with abstract mathematical problems [12][26]

Group 2: MoVT Framework and Learning Mechanism
- MoVT combines the strengths of both reasoning modes within a single model, allowing adaptive selection based on the specific problem [13][39]
- The AdaVaR framework consists of two phases: the first uses special prefix tokens to help the model distinguish between reasoning modes, while the second applies a reinforcement learning algorithm (AdaGRPO) to guide the model toward the optimal reasoning mode [14][20][39]
- The prefix tokens make the reasoning modes separable for the model and provide the hook for the subsequent reinforcement learning intervention [17][18]

Group 3: Experimental Results and Performance
- The AdaVaR models (AdaVaR-3B and AdaVaR-7B) outperformed other Qwen2.5-VL-based models across multiple tasks, achieving the best or second-best results in most scenarios [15][26]
- AdaVaR-3B achieved an average accuracy of 50.84%, while AdaVaR-7B reached 55.82%, outperforming even GPT-4o [15][26]
- Models built on a single reasoning mode often excel in specific domains but fail to improve across the board, whereas AdaVaR consistently outperformed the baseline models on all tasks [26][39]

Group 4: Future Directions
- The MoVT framework could incorporate additional reasoning modes beyond the two currently explored, which may further enhance the model's adaptability and reasoning capabilities [39]
- Future research may focus on integrating more distinct reasoning modes and on the exploration-exploitation tradeoff that grows with the number of modes [39]
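The prefix-token routing described in Group 2 can be illustrated with a toy sketch. Everything here is hypothetical: the token names (`<think_text>`, `<think_ground>`) and the keyword rule standing in for the learned policy are illustrative only; in AdaVaR the mode choice is learned, first via supervised training on the prefix tokens and then via reinforcement learning, not hand-written rules.

```python
# Toy sketch of prefix-token mode routing (hypothetical names and rules;
# AdaVaR learns this policy rather than using keyword heuristics).

TEXT_MODE = "<think_text>"        # hypothetical prefix token: pure-text reasoning
GROUNDED_MODE = "<think_ground>"  # hypothetical prefix token: grounded reasoning


def select_mode(question: str) -> str:
    """Stand-in for the learned mode policy: abstract, math-like questions
    route to text reasoning; perception-heavy ones to grounded reasoning."""
    abstract_cues = ("angle", "prove", "equation", "geometry")
    if any(cue in question.lower() for cue in abstract_cues):
        return TEXT_MODE
    return GROUNDED_MODE


def answer(question: str) -> str:
    """Emit the chosen prefix token before 'reasoning'. In the real system
    this token conditions the VLM's decoding; here it only tags the output."""
    mode = select_mode(question)
    return f"{mode} reasoning about: {question}"
```

The point of the design is that a single model can host both behaviors: the prefix token is the only switch, so a later reinforcement learning stage can shape which token gets emitted without retraining the modes themselves.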
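The summary names AdaGRPO but gives no formula. As context, vanilla GRPO scores each of a group of sampled responses by its reward's deviation from the group mean, normalized by the group's standard deviation; a minimal sketch of that baseline computation is below. How AdaGRPO adapts this for mode selection is not specified in the article, so none of the following should be read as the paper's exact objective.

```python
import statistics


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Vanilla GRPO-style advantages: z-score each sampled response's
    reward against its own group. (Sketch only; AdaGRPO's exact reward
    shaping for mode selection is not described in the summary.)"""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:
        # All responses scored identically: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

Responses above the group mean get positive advantages and are reinforced, including, in AdaVaR's setting, the prefix token that selected the reasoning mode, which is how the second phase can steer mode choice.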
