Core Viewpoint
- The article discusses the importance of reliability and safety in AI decision-making, emphasizing the urgent need for improved model interpretability to understand and verify decision processes, especially in critical scenarios [1][2].

Group 1: Research Background
- A joint research effort by institutions including the University of Chinese Academy of Sciences and Huawei has achieved significant breakthroughs in explainable attribution for multimodal object-level foundation models, helping humans understand model predictions and identify the input factors that lead to errors [2][4].
- Existing explanation methods such as Shapley Value and Grad-CAM have limitations when applied to large-scale models or multimodal tasks, highlighting the need for efficient attribution methods adaptable to both large and small models [1][8].

Group 2: Methodology
- The proposed Visual Precision Search (VPS) method aims to generate high-precision attribution maps with fewer regions, addressing the challenges posed by growing model parameter counts and multimodal interactions [9][12].
- VPS models attribution as a search problem based on subset selection, optimizing the choice of sub-regions to maximize interpretability (a toy sketch of this view appears after this summary) [12][14].
- Key scores, such as clue scores and collaboration scores, are defined to evaluate the importance of sub-regions in the decision-making process and are combined into a submodular function for effective attribution [15][17].

Group 3: Experimental Results
- VPS demonstrates superior performance on various object-level tasks, surpassing existing methods such as D-RISE on metrics like Insertion and Deletion across datasets including MS COCO and RefCOCO [22][23].
- The method effectively highlights important sub-regions, producing sharper attributions than existing techniques, which often yield noisy or diffuse saliency maps [22][24].

Group 4: Error Explanation
- VPS excels at explaining the reasons behind model prediction errors, a capability not offered by other existing methods [24][30].
- Visualizations reveal how input perturbations and background interference contribute to classification errors, providing insights into model limitations and potential directions for improvement [27][30].

Group 5: Conclusion and Future Directions
- VPS enhances interpretability for object-level foundation models and effectively explains failures in visual grounding and object detection tasks [32].
- Future applications may include improving the rationality of model decisions during training, monitoring decisions for safety during inference, and identifying key defects for cost-effective model repair [32].
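To make the subset-selection framing (Group 2) and the Insertion/Deletion evaluation (Group 3) concrete, here is a minimal, self-contained Python sketch. Everything in it is illustrative: the uniform grid partition, the greedy marginal-gain rule, and the toy confidence function are stand-ins for the paper's actual sub-region generation and its clue/collaboration-based submodular scoring, which are not reproduced here.

```python
# Illustrative sketch of attribution via greedy subset selection over image
# sub-regions, plus a Deletion-style evaluation. All names and choices here
# (grid partition, toy_score, k) are hypothetical, not the authors' code.
import numpy as np

def split_into_regions(image, grid=4):
    """Partition an HxWxC image into grid*grid binary sub-region masks."""
    h, w = image.shape[:2]
    masks = []
    for i in range(grid):
        for j in range(grid):
            m = np.zeros((h, w), dtype=bool)
            m[i * h // grid:(i + 1) * h // grid,
              j * w // grid:(j + 1) * w // grid] = True
            masks.append(m)
    return masks

def apply_masks(image, masks):
    """Keep only the pixels covered by the union of the given sub-region masks."""
    if not masks:
        return np.zeros_like(image)
    union = np.any(np.stack(masks), axis=0)
    return image * union[..., None]

def greedy_attribution(image, masks, score_fn, k):
    """Greedily pick k sub-regions that maximize an (assumed submodular) score.

    score_fn(masked_image) returns the model's confidence for the target
    prediction; picking the region with the largest marginal gain at each step
    is the standard greedy strategy for submodular maximization.
    """
    selected, remaining = [], list(range(len(masks)))
    for _ in range(k):
        base = score_fn(apply_masks(image, [masks[i] for i in selected]))
        gains = [score_fn(apply_masks(image, [masks[i] for i in selected + [r]])) - base
                 for r in remaining]
        best = remaining[int(np.argmax(gains))]
        selected.append(best)
        remaining.remove(best)
    return selected  # region indices ordered by selection, i.e. by importance

def deletion_auc(image, masks, ranked_regions, score_fn):
    """Deletion metric: blank out regions in importance order, track confidence.

    A faster confidence drop (smaller area under the curve) means the attribution
    found the truly decisive regions; Insertion is the mirror image, starting from
    a blank image and adding regions back.
    """
    removed = np.zeros(image.shape[:2], dtype=bool)
    scores = [score_fn(image)]
    for r in ranked_regions:
        removed |= masks[r]
        scores.append(score_fn(image * (~removed)[..., None]))
    return float(np.trapz(scores) / len(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((64, 64, 3))
    regions = split_into_regions(img, grid=4)
    # Toy stand-in for an object-level model's confidence on a target box/class.
    toy_score = lambda x: float(x[16:32, 16:32].mean())
    top = greedy_attribution(img, regions, toy_score, k=3)
    print("most important regions:", top)
    print("deletion AUC:", deletion_auc(img, regions, top, toy_score))
```

In the actual method, the greedy step would query the object-level model itself (for example, a detector's score for a specific box and class, or a grounding model's match score for a referring expression) rather than this toy mean, and the candidate regions and scoring function follow the paper's design rather than a fixed grid.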
CVPR 2025 Highlight | A new method from UCAS and collaborators deciphers the multimodal "black box" and precisely pinpoints the culprits behind errors
机器之心 (Synced) · 2025-06-15 04:40