Core Viewpoint
- The article introduces the FG-CLIP model developed by 360 AI Research Institute, which significantly enhances fine-grained understanding in image-text alignment, overcoming limitations of the original CLIP model [4][10][40].

Group 1: FG-CLIP Model Overview
- FG-CLIP can distinguish subtle differences in images, such as between "a man in a light blue jacket" and "a man in a grass green jacket," and can identify objects even when partially obscured [1][4].
- The model has been accepted at the AI conference ICML 2025 and is open-sourced [3][5].
- FG-CLIP addresses the core challenge of fine-grained alignment in image-text pairs, a limitation of previous models such as CLIP [4][10].

Group 2: Technical Innovations
- FG-CLIP employs an explicit dual-tower structure to achieve fine-grained alignment of image and text information [10].
- It utilizes a two-stage training strategy combining global contrastive learning with regional contrastive learning, enhancing both overall and detailed understanding [16][18].
- The model innovatively constructs hard negative samples to improve its ability to discern subtle semantic differences [20].

Group 3: Performance Metrics
- FG-CLIP outperforms existing models such as CLIP and FineCLIP across various benchmarks, demonstrating superior local recognition and detail perception capabilities [10][29].
- In fine-grained understanding tasks, FG-CLIP achieved significant improvements, scoring 46.4% on hard cases and 68.6% on easy cases, compared with lower scores from other models [30].
- The model also excelled in zero-shot testing on the COCO-val2017 dataset, showcasing its ability to classify objects based solely on text descriptions [31].

Group 4: Applications and Impact
- FG-CLIP enhances various applications, including internet search, video recommendation, and office software, by improving the accuracy of image-text matching [11][12].
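The hard-negative idea in Group 2 can be illustrated with a minimal numpy sketch of an InfoNCE-style contrastive loss in which each image is scored against its matching caption plus a set of near-miss captions. This is not the authors' implementation; the function name, shapes, and temperature value are illustrative assumptions.

```python
import numpy as np

def normalize(x):
    # L2-normalize the last axis so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss_with_hard_negatives(img_emb, txt_emb, hard_neg_emb,
                                         temperature=0.07):
    """InfoNCE-style loss over a batch of image/text embedding pairs.

    img_emb:      (B, D) image embeddings from the image tower
    txt_emb:      (B, D) matching caption embeddings (positives)
    hard_neg_emb: (B, K, D) K hard-negative captions per image
    (hypothetical shapes for illustration)
    """
    img = normalize(img_emb)                                # (B, D)
    pos = normalize(txt_emb)                                # (B, D)
    neg = normalize(hard_neg_emb)                           # (B, K, D)

    pos_sim = np.sum(img * pos, axis=-1, keepdims=True)     # (B, 1)
    neg_sim = np.einsum('bd,bkd->bk', img, neg)             # (B, K)
    logits = np.concatenate([pos_sim, neg_sim], axis=1) / temperature

    # Correct caption sits at index 0 of each row; take -log softmax there
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[:, 0].mean()
```

Scoring each image against semantically close but wrong captions (rather than only random in-batch negatives) is what forces the model to attend to fine-grained differences such as "light blue" versus "grass green."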
- The model's capabilities are crucial for advanced technologies such as multi-modal large language models and image generation models, which rely on effective image-text alignment [12][40].
- The open-source release of FG-CLIP aims to facilitate further research and industrial applications in the field of cross-modal understanding [10][40].
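The zero-shot classification mentioned in Group 3 can be sketched as follows: embed one text prompt per candidate class, embed the image, and pick the class whose prompt is most similar. This is a generic CLIP-style sketch, not FG-CLIP's actual API; `text_encoder` and the prompt template are illustrative assumptions.

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, text_encoder):
    """Return the class whose prompt embedding best matches the image.

    image_emb:    (D,) embedding from the image tower
    text_encoder: any callable mapping a string to a (D,) embedding;
                  in practice this would be the model's text tower
                  (hypothetical interface for illustration)
    """
    prompts = [f"a photo of a {name}" for name in class_names]
    txt = np.stack([text_encoder(p) for p in prompts])      # (C, D)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    img = image_emb / np.linalg.norm(image_emb)
    sims = txt @ img                                        # cosine similarity per class
    return class_names[int(np.argmax(sims))]
```

Because the class set is expressed purely as text, new categories can be added at inference time without retraining, which is what zero-shot evaluation on COCO-val2017 measures.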
Breaking Through the "Myopia" Problem in Cross-Modal Image-Text Understanding: 360 Open-Sources New Model FG-CLIP, Achieving a Fine-Grained Image-Text Alignment Breakthrough | ICML 2025