A Breakthrough in Long-Text Retrieval: New Model from a China Unicom Team Improves Accuracy by Nearly 20%

Core Viewpoint
- HiMo-CLIP is a new AI model developed by China Unicom's Data Science and Artificial Intelligence Research Institute. It is designed to improve image retrieval accuracy by automatically identifying the key information in complex descriptions, addressing the common problem of "too much detail leading to errors" in AI processing [2][7][21].

Group 1: Model Features
- HiMo-CLIP uses a specialized module called HiDe, which applies statistical methods to extract the most distinguishing features from similar descriptions, helping the model focus on key attributes [7][8].
- The model achieves an accuracy rate of 89.3%, a significant improvement over previous methods that relied on fixed templates or manual annotations [8].
- HiMo-CLIP is efficient to deploy and needs minimal hardware resources: it adds only a 7% increase in inference time on A100 GPUs, so it can run on standard servers [10][11].

Group 2: Performance Metrics
- The model incorporates a dual alignment mechanism known as MoLo loss, which prioritizes both overall semantic meaning and core feature matching, preventing the "more detail, more errors" phenomenon [11][13].
- In tests on the MSCOCO-Long dataset, HiMo-CLIP's mean Average Precision (mAP) improved by nearly 20% over the previous Long-CLIP model, while retaining 98.3% of its original performance on short-text datasets such as Flickr30K [13].

Group 3: Practical Applications
- HiMo-CLIP has already been applied in real-world scenarios, such as product search on JD.com, where better handling of complex user descriptions led to a 27% increase in search conversion rates [14][15].
- The model is also being explored in the autonomous driving sector to interpret complex road descriptions, improving environmental recognition for vehicle systems [18].

Group 4: Future Developments
- The team plans to release a multilingual version of HiMo-CLIP by Q3 2026, aiming to handle specialized terminology and foreign-language descriptions more effectively [21].
- The success of HiMo-CLIP highlights the importance of simulating human cognitive logic in AI models, suggesting a potential new direction for multimodal intelligence development through structured semantic spaces [21].
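
The HiDe and MoLo descriptions in Groups 1 and 2 can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not the team's implementation. The source only says HiDe uses "statistical methods", so in-batch PCA is swapped in here as one plausible choice, and the "dual alignment" is modeled as a global contrastive term plus a weighted component-level term. All function names, the temperature of 0.07, `k`, and `weight` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against zero rows)."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def info_nce(text, image, temperature=0.07):
    """Symmetric contrastive loss; row i of text is paired with row i of image."""
    logits = l2_normalize(text) @ l2_normalize(image).T / temperature

    def row_cross_entropy(m):
        # Cross-entropy with the diagonal (the matched pair) as the target.
        m = m - m.max(axis=1, keepdims=True)  # numerical stability
        log_prob = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the text-to-image and image-to-text directions.
    return 0.5 * (row_cross_entropy(logits) + row_cross_entropy(logits.T))


def principal_component_view(text, k):
    """Keep only the top-k in-batch principal directions of the text embeddings.

    ASSUMPTION: PCA stands in for the unspecified "statistical methods";
    the directions along which captions in a batch differ most are treated
    as their most distinguishing features.
    """
    centered = text - text.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                       # shape (k, dim)
    return centered @ basis.T @ basis    # shape (batch, dim)


def dual_alignment_loss(text, image, k=2, weight=1.0):
    """Global alignment term plus a weighted component-level alignment term."""
    global_term = info_nce(text, image)
    component_term = info_nce(principal_component_view(text, k), image)
    return global_term + weight * component_term


# Toy batch: each caption embedding sits close to its paired image embedding.
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 16))
text = image + 0.1 * rng.normal(size=(8, 16))
print(dual_alignment_loss(text, image))
```

In this sketch the component term reuses the same contrastive loss on a reduced view of the text, so a caption is rewarded both for matching its image as a whole and for matching it on the features that separate it from the other captions in the batch. How the actual MoLo loss weights or combines the two terms is not stated in the source.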