MetaCLIP 2
New work from Saining Xie's team breaks the "multilingual curse": MetaCLIP 2 supports 300+ languages, and English performance actually improves
量子位· 2025-07-31 06:51
Core Insights
- The article covers the advances made by Saining Xie's team in developing MetaCLIP 2, a model that extends the original CLIP by effectively processing non-English data and overcoming the "multilingual curse" [2][22].

Group 1: Model Development
- MetaCLIP 2 is the first CLIP model trained from scratch on worldwide data, covering more than 300 languages [2][14].
- The model improves on the original MetaCLIP by refining its pipeline, specifically the data selection and training strategies [13][19].
- The training budget was scaled up substantially: the ViT-H/14 model sees 29 billion samples, 2.3 times more than previous versions [23][19].

Group 2: Methodology
- The method rests on three key innovations: expanding the metadata to cover more than 300 languages, a worldwide data-filtering algorithm, and a training framework for a global model [8][11].
- The filtering logic builds a "visual concept dictionary" for each language and applies language-specific selection criteria so that the retained data is balanced across concepts (a hedged sketch of this idea follows this summary) [15][16].
- The training strategy keeps the volume of English samples constant while increasing the overall sample budget, thereby avoiding the multilingual curse [19][21].

Group 3: Performance Metrics
- MetaCLIP 2 breaks the multilingual curse and even improves English capability, reaching 81.3% accuracy on ImageNet versus 80.5% for the original CLIP [24][25].
- On multilingual benchmarks it outperforms earlier models such as mSigLIP and SigLIP 2, reaching 50.2% accuracy on Babel-ImageNet classification and 64.3% on XM3600 image-to-text retrieval [26][27].
- The model also excels on cultural-diversity tasks, with better geographic-localization accuracy than purely English or purely non-English models [28][29].

Group 4: Open Source and Community Engagement
- The relevant data and code for MetaCLIP 2 have been open-sourced, encouraging community engagement and follow-up research [32].
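The per-language curation described above (build a "visual concept dictionary" per language, match captions against it, then balance the kept data across concepts) can be illustrated with a short sketch. Everything below, including the function name, data layout, and the fixed per-bucket cap, is an assumption made for illustration, not the released MetaCLIP 2 implementation.

```python
# Minimal sketch of per-language, balanced data curation in the spirit of
# MetaCLIP-style filtering. Names, thresholds, and data structures are
# illustrative assumptions.
from collections import defaultdict
import random

def curate(pairs, metadata_by_lang, tail_threshold=20_000, seed=0):
    """Keep image-text pairs whose captions match per-language metadata,
    then downsample over-represented ("head") concepts so the retained data
    is balanced across concepts within each language.

    pairs: list of dicts like {"lang": "fr", "caption": "...", "url": "..."}
    metadata_by_lang: dict mapping language code -> list of concept strings
                      (the per-language "visual concept dictionary")
    tail_threshold: max pairs kept per (language, concept) bucket
                    (assumed constant here; in practice it can be tuned per language)
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)  # (lang, concept) -> matching pairs

    # 1) Match each caption against its own language's concept list.
    for pair in pairs:
        concepts = metadata_by_lang.get(pair["lang"], [])
        for concept in concepts:
            if concept.lower() in pair["caption"].lower():
                buckets[(pair["lang"], concept)].append(pair)

    # 2) Balance: keep tail buckets fully, subsample head buckets to the cap.
    kept = []
    for bucket in buckets.values():
        if len(bucket) <= tail_threshold:
            kept.extend(bucket)
        else:
            kept.extend(rng.sample(bucket, tail_threshold))
    return kept
```

The cap plays the balancing role described in the summary: frequent (head) concepts are truncated while rare (tail) concepts are kept in full, so no single language or concept dominates the curated pool.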
OpenAI's CLIP extended to 300+ languages worldwide by Meta together with Saining Xie and Zhuang Liu
机器之心· 2025-07-31 05:11
Core Viewpoint
- The article introduces MetaCLIP 2, a new method for training the CLIP model at a global scale without relying on external resources, addressing the challenges of multilingual data processing and improving model performance across languages [2][4].

Group 1: MetaCLIP 2 Overview
- MetaCLIP 2 is the first method to train CLIP from scratch on native worldwide image-text pairs, overcoming the English-centric limitations of previous models [2][5].
- The method comprises three core innovations: metadata expansion to more than 300 languages, a data-filtering algorithm that balances concept distribution across languages, and a worldwide training framework that proportionally increases the number of image-text pairs seen as non-English data is introduced [5][20].

Group 2: Performance Improvements
- MetaCLIP 2 shows that non-English data can strengthen English capability and vice versa, effectively breaking the "multilingual curse" [10][31].
- The model reaches state-of-the-art (SOTA) results on multiple multilingual benchmarks, including gains of 3.8% on Babel-ImageNet and 1.1% on XM3600, among others [32][34].

Group 3: Training Methodology
- The training framework stays consistent with OpenAI's CLIP architecture while adding key components such as a multilingual text tokenizer and scaling of the number of seen training pairs [26][30].
- The training data was expanded from 13 billion to 29 billion pairs, yielding significant gains on both English and multilingual tasks (a back-of-the-envelope sketch of this scaling follows this summary) [38][39].

Group 4: Cultural and Linguistic Diversity
- MetaCLIP 2 preserves a comprehensive distribution of worldwide images, improving geographic localization and regional recognition [13][15].
- The model learns directly from image descriptions written by native speakers rather than relying on machine translation, improving the authenticity and accuracy of the training data [12][16].
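The "keep English exposure roughly constant while growing the total budget" idea in Group 3 is essentially a proportional-scaling rule. The sketch below shows only the arithmetic; the 45% English share is an assumed value chosen to illustrate how a 13B English-only budget grows to roughly the 29B pairs cited above, not a figure reported by the authors.

```python
# Back-of-the-envelope sketch of scaling the number of seen training pairs
# with the global data mix: if English pairs previously filled the whole
# budget, keep English exposure about constant and grow the total budget in
# proportion to the English + non-English mixture.

def scaled_seen_pairs(english_only_budget: float, english_share: float) -> float:
    """Total image-text pairs to see so that English exposure stays constant."""
    assert 0 < english_share <= 1
    return english_only_budget / english_share

if __name__ == "__main__":
    baseline = 13e9        # English-only budget (13B pairs, per the article)
    english_share = 0.45   # assumed fraction of English pairs in the global pool
    total = scaled_seen_pairs(baseline, english_share)
    print(f"Scaled budget: {total / 1e9:.1f}B pairs")  # ~28.9B, close to the 29B cited
```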