Teaching large models to "spot the difference in high dimensions": China Unicom's new research tackles a pain point in long-text image retrieval | AAAI 2026 Oral
量子位·2025-12-01 05:45

Core Insights
- The article presents HiMo-CLIP, a new state-of-the-art (SOTA) model for long-text image retrieval developed by the China Unicom Data Science and AI Research Institute. It addresses limitations of existing models such as CLIP by effectively capturing semantic differences in context [2][4].

Group 1: Model Limitations
- Existing models, including Long-CLIP, struggle with long text descriptions: alignment scores often drop as a caption becomes more detailed, indicating a failure to process the hierarchical structure of language [6][9].
- This phenomenon, where longer descriptions yield lower alignment scores, shows that current models cannot separate core semantics from supporting detail; the first sketch after this summary demonstrates one way to measure it [6][9].

Group 2: HiMo-CLIP Framework
- HiMo-CLIP introduces a plug-and-play representation framework with two core components: Hierarchical Decomposition (HiDe) and Monotonicity-aware Contrastive Loss (MoLo) [10][12].
- HiDe dynamically extracts semantic components via PCA within each batch, while MoLo enforces alignment between the full text and its semantic components so that similarity grows monotonically with completeness; both ideas are sketched below [12][17].

Group 3: Performance and Efficiency
- HiMo-CLIP shows clear advantages on both long- and short-text retrieval tasks, outperforming models trained on far larger datasets and reaching SOTA with only 1 million training samples [17][20].
- Its ability to extract distinctive features from complex scenes lets it maintain high performance across a range of retrieval benchmarks [18][22].

Group 4: Evaluation Metrics
- The research team built the HiMo-Docci dataset and introduced the HiMo@K metric to quantify hierarchical understanding; HiMo-CLIP reaches a monotonicity correlation coefficient of 0.88, surpassing comparative methods (the last sketch below shows one way such a coefficient can be computed) [22][25].
- As text descriptions become more complete, HiMo-CLIP's scores rise consistently, while other models fluctuate significantly [25][26].
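To make the Group 1 failure mode concrete, here is a minimal measurement sketch using a stock CLIP checkpoint from Hugging Face: score one image against progressively more detailed captions and compare the cosine similarities. The model name, image path, and captions are illustrative assumptions, not the paper's evaluation setup.

```python
# Minimal sketch (not the paper's evaluation code): score progressively
# longer captions against one image with a stock CLIP model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # hypothetical local image
captions = [
    "a dog in a park",
    "a brown dog playing fetch in a park",
    "a brown dog playing fetch in a sunlit park, with a red ball "
    "mid-air and joggers on a gravel path in the background",
]

inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    out = model(**inputs)

# Cosine similarity between the image and each caption; for plain CLIP
# the score often drops as detail is added, which is exactly the
# failure mode the article says HiMo-CLIP targets.
img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
print((img @ txt.T).squeeze(0).tolist())
```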
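The article says HiDe extracts semantic components with PCA inside each batch. Below is a hedged sketch of what such a decomposition could look like on a batch of text embeddings; the number of components `k` and the rank-1 projection scheme are assumptions, since the paper's exact formulation is not given here.

```python
# Sketch of batch-level PCA decomposition in the spirit of HiDe:
# project each text embedding onto the batch's principal directions.
import torch

def hide_components(text_emb: torch.Tensor, k: int = 4) -> torch.Tensor:
    """text_emb: (B, D) batch of text embeddings.
    Returns (B, k, D): each embedding decomposed along the batch's
    top-k principal directions."""
    centered = text_emb - text_emb.mean(dim=0, keepdim=True)
    # Top-k principal directions of the batch (columns of v).
    _, _, v = torch.pca_lowrank(centered, q=k)          # v: (D, k)
    coeffs = centered @ v                               # (B, k)
    # Rank-1 component of each sample along each direction.
    comps = coeffs.unsqueeze(-1) * v.T.unsqueeze(0)     # (B, k, D)
    return comps

batch = torch.randn(32, 512)  # e.g. CLIP text embeddings
print(hide_components(batch).shape)  # torch.Size([32, 4, 512])
```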
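For MoLo, the article only states that alignment between the full text and its semantic components should be monotone. One plausible reading is a hinge penalty on adjacent completeness levels, sketched below; this illustrates the stated property, not the paper's actual loss, and the nested-caption setup is an assumption.

```python
# Hedged sketch of a monotonicity-aware term: partial captions are
# ordered from coarse to complete, and any drop in image-text
# similarity as completeness grows is penalized.
import torch
import torch.nn.functional as F

def monotonicity_loss(img: torch.Tensor, parts: torch.Tensor,
                      margin: float = 0.0) -> torch.Tensor:
    """img: (B, D) image embeddings; parts: (B, L, D) text embeddings
    for L nested caption levels, index 0 = coarsest, L-1 = full text."""
    img = F.normalize(img, dim=-1)
    parts = F.normalize(parts, dim=-1)
    sims = torch.einsum("bd,bld->bl", img, parts)   # (B, L)
    # Hinge on adjacent levels: similarity should not decrease
    # as the caption gains detail.
    drops = sims[:, :-1] - sims[:, 1:] + margin     # (B, L-1)
    return F.relu(drops).mean()

img = torch.randn(8, 512)
parts = torch.randn(8, 3, 512)  # e.g. core / +attributes / full caption
print(monotonicity_loss(img, parts).item())
```

In a full training setup this term would presumably be added to a standard image-text contrastive objective rather than used alone.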
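Finally, the monotonicity correlation coefficient reported alongside HiMo@K (0.88 in Group 4) suggests a rank correlation between caption completeness and alignment score. A simple version using Spearman's rho, averaged over images, is sketched below; the actual HiMo@K definition may differ.

```python
# Sketch of one way a monotonicity correlation could be computed:
# Spearman correlation between completeness level and alignment score.
import numpy as np
from scipy.stats import spearmanr

def monotonicity_corr(scores: np.ndarray) -> float:
    """scores: (N, L) alignment scores for N images at L completeness
    levels (column 0 = coarsest caption, column L-1 = full caption)."""
    levels = np.arange(scores.shape[1])
    per_image = [spearmanr(levels, row)[0] for row in scores]
    return float(np.mean(per_image))

# Toy check: strictly increasing scores give a coefficient of 1.0.
toy = np.cumsum(np.random.rand(100, 5) * 0.1, axis=1)
print(monotonicity_corr(toy))
```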