Universal Visual Anomaly Detection
Zero-shot & few-shot sweep of 12 industrial and medical datasets: new Siemens × Tencent Youtu research precisely localizes defects and sets a new SOTA in detection accuracy | AAAI 2026
量子位· 2026-01-19 03:48
Core Insights
- The article discusses AdaptCLIP, a universal visual anomaly detection framework that aims to improve performance in industrial quality inspection and medical imaging by leveraging the CLIP model while addressing its limitations in zero-shot and few-shot scenarios [2][4].

Group 1: Challenges in Anomaly Detection
- Traditional defect detection models require extensive labeled data, making them less effective in real-world scenarios where data is scarce [1][3].
- The core challenge in anomaly detection is that models must generalize across domains while accurately identifying subtle anomalies from minimal target-domain data [3][4].

Group 2: AdaptCLIP Framework
- AdaptCLIP takes a lightweight adaptation approach, adding three adapters to the CLIP model without altering its core structure, so that a single model performs both image-level anomaly classification and pixel-level anomaly segmentation [5][6] (a minimal sketch of this adapter idea follows the summary below).
- The framework employs an alternating learning strategy, optimizing visual and textual representations separately to improve zero-shot anomaly detection [20][21].

Group 3: Key Innovations
- The visual adapter fine-tunes CLIP's output tokens to better align them with the anomaly detection task, significantly improving pixel-level localization [15][18].
- The text adapter removes the need for manually designed prompts by learning optimized embeddings for the "normal" and "anomalous" classes, reducing dependence on prompt engineering [16][18].

Group 4: Experimental Results
- In zero-shot settings, AdaptCLIP achieved an average image-level AUROC of 86.2% across multiple industrial datasets, outperforming existing methods [31].
- On medical imaging tasks, AdaptCLIP reached an average pixel-level AUPR of 48.7% and an average image-level AUROC of 90.7%, outperforming competing approaches [31][32] (see the metric sketch below for how these two measures are typically computed).

Group 5: Efficiency and Scalability
- Under zero-shot conditions the model adds roughly 0.6 million trainable parameters, far fewer than competing methods that can exceed 10.7 million [32][37].
- AdaptCLIP keeps inference time to about 162 ms per image at a resolution of 518x518, balancing detection accuracy with deployment efficiency [32][37] (the last sketch below illustrates this kind of parameter and latency check).
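
To make the adapter design in Groups 2-3 more concrete, here is a minimal PyTorch sketch of the general idea: a frozen vision backbone stands in for CLIP's image encoder, a small residual adapter re-aligns its tokens, and two learned class embeddings replace hand-written "normal"/"anomalous" prompts. All names here (FrozenBackbone, AdapterHead, EMBED_DIM, etc.) are hypothetical illustrations, not the authors' code or the actual AdaptCLIP architecture.

```python
# Minimal sketch of the adapter idea described above (hypothetical names, not the
# authors' code). A toy frozen backbone stands in for the CLIP image encoder; only
# the small adapter and the two class embeddings are trainable, mirroring the
# "lightweight adaptation without altering CLIP's core structure" point above.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # assumed CLIP embedding width


class FrozenBackbone(nn.Module):
    """Stand-in for the frozen CLIP vision encoder (real use: open_clip / HF CLIP)."""

    def __init__(self, dim: int = EMBED_DIM):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # ViT-style patchify
        for p in self.parameters():
            p.requires_grad = False  # backbone weights stay frozen

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, N_patches, dim)
        image_feat = tokens.mean(dim=1)                   # (B, dim) pooled feature
        return image_feat, tokens


class AdapterHead(nn.Module):
    """Visual adapter on patch tokens + learned 'normal'/'anomalous' embeddings."""

    def __init__(self, dim: int = EMBED_DIM):
        super().__init__()
        # Visual adapter: small residual MLP that re-aligns tokens to the anomaly task.
        self.visual_adapter = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, dim)
        )
        # Learned class embeddings stand in for the text adapter: no hand-crafted
        # prompts such as "a photo of a damaged ..." are needed.
        self.class_embed = nn.Parameter(torch.randn(2, dim))  # [normal, anomalous]

    def forward(self, image_feat, patch_tokens):
        patch_tokens = patch_tokens + self.visual_adapter(patch_tokens)  # residual
        text = F.normalize(self.class_embed, dim=-1)
        # Pixel-level map: similarity of every patch token to the two class embeddings.
        patches = F.normalize(patch_tokens, dim=-1)
        anomaly_map = (patches @ text.t()).softmax(dim=-1)[..., 1]  # (B, N_patches)
        # Image-level score: the same comparison on the pooled image feature.
        img = F.normalize(image_feat + self.visual_adapter(image_feat), dim=-1)
        image_score = (img @ text.t()).softmax(dim=-1)[:, 1]        # (B,)
        return image_score, anomaly_map


backbone, head = FrozenBackbone(), AdapterHead()
image = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    feat, tokens = backbone(image)
score, amap = head(feat, tokens)
print(score.shape, amap.shape)  # torch.Size([2]) torch.Size([2, 196])
```

Anomaly scores fall out of the same cosine-similarity comparison CLIP already uses for classification, which is why only the adapters and class embeddings need training in this kind of setup.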
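
The figures in Group 4 use standard anomaly-detection metrics. The sketch below shows how image-level AUROC and pixel-level AUPR are typically computed with scikit-learn; the data is random and purely illustrative, not taken from the paper's benchmarks.

```python
# Illustrative metric computation (our own toy example, not the paper's evaluation
# code). Image-level AUROC ranks per-image anomaly scores; pixel-level AUPR scores
# the anomaly map against the ground-truth defect mask, which is the harder measure
# when defects cover only a few pixels.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Toy "dataset": 8 images of 64x64 pixels, the last 3 of which contain defects.
image_labels = np.array([0, 0, 0, 0, 0, 1, 1, 1])            # 1 = anomalous image
image_scores = rng.random(8) * 0.4 + image_labels * 0.5       # model's image scores
pixel_masks = rng.random((8, 64, 64)) < (image_labels[:, None, None] * 0.02)
anomaly_maps = rng.random((8, 64, 64)) * 0.3 + pixel_masks * 0.6

image_auroc = roc_auc_score(image_labels, image_scores)
pixel_aupr = average_precision_score(pixel_masks.ravel().astype(int),
                                     anomaly_maps.ravel())
print(f"image-level AUROC: {image_auroc:.3f}, pixel-level AUPR: {pixel_aupr:.3f}")
```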
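
For the efficiency claims in Group 5, the snippet below illustrates the kind of check behind such numbers: count only the trainable (adapter) parameters when the backbone is frozen, and time one forward pass at the 518x518 resolution cited above. The tiny stand-in modules are assumptions for illustration; the ~0.6 M-parameter and ~162 ms figures come from the paper and will not be reproduced by this code.

```python
# Toy parameter-count and latency check (illustrative only, not AdaptCLIP itself).
import time
import torch
import torch.nn as nn

backbone = nn.Conv2d(3, 512, kernel_size=16, stride=16)  # stand-in frozen encoder
for p in backbone.parameters():
    p.requires_grad = False                               # frozen, so not counted
adapter = nn.Sequential(nn.Linear(512, 128), nn.GELU(), nn.Linear(128, 512))


def trainable_params(*modules) -> int:
    """Count only parameters that would actually be updated during training."""
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)


print(f"trainable parameters: {trainable_params(backbone, adapter) / 1e6:.2f} M")

x = torch.randn(1, 3, 518, 518)  # the 518x518 input resolution cited above
with torch.no_grad():
    t0 = time.perf_counter()
    tokens = backbone(x).flatten(2).transpose(1, 2)  # (1, N_patches, 512)
    adapter(tokens)
    print(f"forward pass: {(time.perf_counter() - t0) * 1000:.1f} ms")
```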