Meta, together with Saining Xie and Zhuang Liu, has extended OpenAI's CLIP to 300+ languages worldwide

Core Viewpoint
- The article discusses the introduction of MetaCLIP 2, a novel method for training the CLIP model on a global scale without relying on external resources, addressing the challenges of multilingual data processing and enhancing model performance across languages [2][4].

Group 1: MetaCLIP 2 Overview
- MetaCLIP 2 is the first method to train CLIP from scratch on native global image-text pairs, overcoming the limitations of previous models that primarily focused on English data [2][5].
- The method includes three core innovations: metadata expansion to over 300 languages, a data filtering algorithm that balances concept distribution across languages, and a global training framework that proportionally increases the number of seen image-text pairs as non-English data is introduced [5][20].

Group 2: Performance Improvements
- MetaCLIP 2 demonstrates that non-English data can enhance the capabilities of English models and vice versa, effectively breaking the "multilingual curse" [10][31].
- The model achieved state-of-the-art (SOTA) results on various multilingual benchmarks, including improvements of 3.8% on Babel-ImageNet and 1.1% on XM3600, among others [32][34].

Group 3: Training Methodology
- The training framework of MetaCLIP 2 maintains consistency with OpenAI's CLIP architecture while introducing key components such as a multilingual text tokenizer and scaling of seen training pairs [26][30].
- The model's training data was expanded from 13 billion pairs to 29 billion pairs, yielding significant performance gains on both English and multilingual tasks [38][39].

Group 4: Cultural and Linguistic Diversity
- MetaCLIP 2 retains a comprehensive distribution of global images, enhancing geographical localization and regional recognition capabilities [13][15].
- The model directly learns from image descriptions written by native speakers, avoiding reliance on machine translation, which improves the authenticity and accuracy of the training data [12][16].
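The concept-balancing step described in Group 1 can be illustrated with a minimal sketch in the spirit of MetaCLIP-style curation: captions are matched against a metadata vocabulary of concepts, and pairs matched to over-represented "head" concepts are subsampled to a per-concept cap while "tail" concepts are kept in full. The function name `balance_pairs`, the cap parameter `t`, and the substring-matching rule are illustrative assumptions, not the paper's exact algorithm.

```python
import random

def balance_pairs(pairs, metadata, t=3, seed=0):
    """Illustrative sketch (not the official MetaCLIP 2 code):
    subsample image-text pairs so that no metadata entry (concept)
    contributes more than t matched pairs, flattening the head of the
    concept distribution while preserving the tail.

    pairs:    list of (image_id, caption) tuples
    metadata: set of concept strings (e.g. WordNet synsets, Wikipedia titles)
    t:        per-concept cap on retained pairs (hypothetical default)
    """
    rng = random.Random(seed)

    # Group pairs by the concepts their captions mention (naive substring match).
    by_concept = {}
    for pair in pairs:
        _, caption = pair
        for concept in metadata:
            if concept in caption.lower():
                by_concept.setdefault(concept, []).append(pair)

    kept = set()
    for concept, matched in by_concept.items():
        if len(matched) <= t:
            kept.update(matched)            # tail concept: keep everything
        else:
            kept.update(rng.sample(matched, t))  # head concept: cap at t
    return sorted(kept)
```

Run per language, a cap like this prevents a few dominant concepts in high-resource languages from crowding out rarer concepts, which is the balancing effect the article attributes to MetaCLIP 2's filtering algorithm.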