Confidence Calibration
Zhejiang University team cracks "blind confidence" in multimodal models: calibrate confidence first, then allocate compute | CVPR'26
QbitAI (量子位) · 2026-03-22 04:18
Core Insights
- The article discusses the problem of "blind confidence" in multimodal large models: the models maintain high confidence even when visual input quality deteriorates significantly, leading to hallucinations and misjudgments [2][4][6]
- A new framework, CA-TTS (Confidence-Aware Test-Time Scaling), is proposed to address this by calibrating the model's self-assessment capabilities through confidence-driven reinforcement learning [4][15]

Group 1: Problem Identification
- A study by a research team from Zhejiang University, Alibaba, City University of Hong Kong, and the University of Michigan found that as image quality degrades, model accuracy drops sharply while confidence remains unchanged [2][4]
- This phenomenon is termed "perceptual bluntness": a lack of sensitivity to changes in the quality of visual information [7][9]

Group 2: Proposed Solutions
- The training phase employs CDRL (Confidence-Driven Reinforcement Learning) to align visual perception with confidence, encouraging models to distinguish clear from unclear visual inputs [9][10]
- CDRL uses a dual reward mechanism: one term encourages sensitivity to visual degradation, the other maintains honesty in self-assessment [11][12]

Group 3: Performance Improvements
- CA-TTS delivered significant gains across four mainstream visual reasoning benchmarks, with an average improvement of 8.8% over existing methods [4][19]
- On the Math-Vision benchmark, accuracy improved from 23.0% to 42.4%, nearly double the baseline [19]

Group 4: Methodology and Results
- The CA-TTS framework consists of three modules, Self-Consistency, Self-Reflection, and Self-Check, which work together to improve decision-making during inference [15][17]
- Experimental results show that CA-TTS outperforms traditional methods such as Majority Voting and DeepConf in both accuracy and efficiency, with scaling efficiency 2.2 times and 3.1 times higher, respectively [27][28]

Group 5: Theoretical Implications
- The research shifts the paradigm from "reasoning first, perception second" to "perception first, reasoning second," emphasizing the importance of reliable perception in complex reasoning tasks [29][30]
- This approach aims to ensure that multimodal large models can accurately assess their own confidence levels, particularly in high-risk scenarios [29][30]
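The summary describes CDRL's dual reward only at a high level (a sensitivity term for visual degradation and an honesty term for self-assessment); the exact formulation is not given in the article. The following is a minimal Python sketch of how such a dual reward could be combined — the function names, the Brier-style honesty term, and the `alpha` weight are all illustrative assumptions, not the paper's method:

```python
def sensitivity_reward(conf_clear: float, conf_degraded: float) -> float:
    """Reward the model for lowering its confidence when the visual
    input is degraded (zero reward if confidence does not drop)."""
    return max(0.0, conf_clear - conf_degraded)

def honesty_reward(confidence: float, correct: bool) -> float:
    """Reward confidence that matches answer correctness, using a
    Brier-style score: 1 minus the squared calibration error."""
    target = 1.0 if correct else 0.0
    return 1.0 - (confidence - target) ** 2

def cdrl_reward(conf_clear: float, conf_degraded: float,
                correct: bool, alpha: float = 0.5) -> float:
    """Combine the two terms; alpha trades off degradation
    sensitivity against self-assessment honesty."""
    return (alpha * sensitivity_reward(conf_clear, conf_degraded)
            + (1 - alpha) * honesty_reward(conf_degraded, correct))
```

Under this sketch, a model that stays at 0.9 confidence on a badly blurred image earns no sensitivity reward, which is exactly the "perceptual bluntness" failure mode the training phase is meant to penalize.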
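The article contrasts CA-TTS with plain Majority Voting but does not spell out the aggregation mechanics. A hypothetical sketch of how calibrated confidences could change the inference-time vote relative to a simple majority — the function names and the sum-of-confidences weighting are assumptions for illustration, not the paper's Self-Consistency module:

```python
from collections import Counter, defaultdict

def majority_vote(samples: list[tuple[str, float]]) -> str:
    """Plain majority voting: the most frequent answer wins,
    ignoring the model's confidence in each sample."""
    return Counter(answer for answer, _ in samples).most_common(1)[0][0]

def confidence_weighted_vote(samples: list[tuple[str, float]]) -> str:
    """Weight each sampled answer by the model's (calibrated)
    confidence and pick the answer with the highest total."""
    scores: dict[str, float] = defaultdict(float)
    for answer, conf in samples:
        scores[answer] += conf
    return max(scores, key=lambda a: scores[a])
```

With samples `[("A", 0.9), ("B", 0.4), ("B", 0.3)]`, majority voting returns "B" (two votes to one) while the confidence-weighted vote returns "A" (0.9 vs. 0.7) — a toy illustration of why confidence must be calibrated before it is used to allocate compute at test time.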