Saining Xie's team breaks the "curse of multilinguality"! MetaCLIP 2 supports over 300 languages, and English performance actually improves
量子位· 2025-07-31 06:51
Core Insights
- The article covers the Saining Xie team's MetaCLIP 2, a model that extends the original CLIP by effectively processing non-English data and overcoming the "curse of multilinguality" [2][22]

Group 1: Model Development
- MetaCLIP 2 is the first CLIP model trained from scratch on worldwide data, handling over 300 languages [2][14]
- The model improves on the original MetaCLIP by optimizing its structure and processes, specifically data selection and training strategy [13][19]
- The training sample size grew substantially: the ViT-H/14 model was trained on 29 billion samples, 2.3 times more than previous versions [23][19]

Group 2: Methodology
- The methodology rests on three key innovations: expanding the metadata to cover over 300 languages, implementing a per-language filtering algorithm, and constructing a training framework for a worldwide model [8][11]
- The filtering logic builds a "visual concept dictionary" for each language and applies language-specific data-selection criteria to keep the data distribution balanced [15][16]
- The training strategy keeps the English sample volume unchanged while increasing the overall sample count, thereby avoiding the curse of multilinguality [19][21]

Group 3: Performance Metrics
- MetaCLIP 2 breaks the curse of multilinguality while improving English capability, reaching 81.3% accuracy on ImageNet versus the original CLIP's 80.5% [24][25]
- On multilingual benchmarks it outperforms earlier models such as mSigLIP and SigLIP 2, achieving 50.2% accuracy on Babel-ImageNet classification and 64.3% on XM3600 image-to-text retrieval [26][27]
- The model also excels at cultural-diversity tasks, with better geographic-localization accuracy than purely English or purely non-English models [28][29]
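The per-language curation described above can be sketched roughly as follows. This is a toy illustration, not the paper's actual pipeline: the function name, the substring-matching rule, and the per-concept cap value are all assumptions made for the sketch. The idea it demonstrates is matching captions against a language-specific concept dictionary, then capping how many pairs any single head concept contributes so tail concepts stay represented.

```python
import random
from collections import defaultdict

def curate(pairs, metadata_by_lang, cap=20_000, seed=0):
    """Toy sketch of balanced, per-language data curation (illustrative only).

    pairs: list of (caption, lang) image-text pairs (images omitted here).
    metadata_by_lang: {lang: set of "visual concept" strings for that language}.
    cap: per-concept ceiling so frequent head concepts don't drown out tail ones.
    """
    rng = random.Random(seed)
    matched = defaultdict(list)  # concept -> indices of pairs mentioning it
    for i, (caption, lang) in enumerate(pairs):
        for concept in metadata_by_lang.get(lang, set()):
            if concept in caption:  # naive substring match, for illustration
                matched[concept].append(i)
    kept = set()
    for concept, idxs in matched.items():
        # Keep every tail-concept match; subsample head concepts down to the cap.
        if len(idxs) > cap:
            idxs = rng.sample(idxs, cap)
        kept.update(idxs)
    return [pairs[i] for i in sorted(kept)]
```

Pairs whose captions match no concept in their language's dictionary are dropped, which is the filtering half; the cap is the balancing half.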
Group 4: Open Source and Community Engagement
- The data and code for MetaCLIP 2 have been open-sourced, promoting community engagement and further research [32]
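The zero-shot accuracy numbers reported above come from the standard CLIP-style evaluation: an image embedding is compared against text embeddings of class-name prompts, and the most similar class wins. A minimal sketch of that scoring step with toy vectors (not the released model or its real embeddings):

```python
import math

def zero_shot_classify(image_emb, class_text_embs):
    """CLIP-style zero-shot scoring: return the index of the class whose
    text embedding has the highest cosine similarity with the image."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    img = normalize(image_emb)
    sims = [sum(a * b for a, b in zip(normalize(t), img))
            for t in class_text_embs]
    return sims.index(max(sims))

# Toy example: class 1's text embedding points the same way as the image.
image = [0.0, 1.0, 0.0]
texts = [[1.0, 0.0, 0.0],   # class 0
         [0.0, 2.0, 0.0],   # class 1 (same direction as the image)
         [0.0, 0.0, 1.0]]   # class 2
print(zero_shot_classify(image, texts))  # → 1
```

Benchmarks like ImageNet or Babel-ImageNet simply run this comparison over every test image against prompts for all class names, in the relevant language.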