MetaCLIP 2

New work from Saining Xie's team breaks the "curse of multilinguality"! MetaCLIP 2 supports 300+ languages, and English performance actually improves
量子位· 2025-07-31 06:51
一水 | 量子位 QbitAI

New work from Saining Xie's team is drawing heated discussion!

CLIP, a cornerstone of text-to-image generation, has long been trained mainly on English data, even though more than 50% of the data on the global internet is non-English.

To extend CLIP further, researchers had to clear two major hurdles:

- the lack of curation methods for non-English data;
- existing multilingual CLIPs underperform English-only versions on English tasks (the so-called "curse of multilinguality").

Saining Xie's team made breakthroughs on both fronts. They propose MetaCLIP 2, the first CLIP trained from scratch on worldwide data. By expanding metadata, refining data curation, and increasing model capacity, it delivers:

1. A CLIP data-curation pipeline that handles 300+ languages.
2. An end to the "curse of multilinguality": English-task performance is not hurt but actually improves.

First author Yung-Sung Chuang (MIT PhD student, currently a Meta intern) wrote excitedly: it's time to say goodbye to language filters!

Lucas Beyer, recently poached from OpenAI by Zuckerberg, agreed, and thanked the paper for the citation: he was glad to see the "NoFilter" philosophy he proposed and has consistently advocated put to work in MetaCLIP 2.

That in turn drew a response from Saining Xie himself: ...
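For context on the curation pipeline mentioned above: the original MetaCLIP recipe substring-matches alt-texts against a metadata list and balances "head" versus "tail" concepts with a per-entry cap, and MetaCLIP 2 extends this per language. Below is a minimal sketch of that balancing idea, assuming a simple in-memory setup; the function name, the (lang, text, url) tuple layout, and the t=20,000 default are illustrative, not the released code.

```python
import random
from collections import defaultdict

def curate(pairs, metadata_by_lang, t=20_000, seed=0):
    """Per-language, MetaCLIP-style curation sketch.

    pairs:            list of (lang, alt_text, image_url) candidates
    metadata_by_lang: dict lang -> iterable of metadata entries for that
                      language (e.g. WordNet synsets, Wikipedia titles)
    t:                per-entry cap; "head" entries matched more than t
                      times are subsampled, rare "tail" entries are kept,
                      balancing the concept distribution
    """
    rng = random.Random(seed)
    matches = defaultdict(list)  # (lang, entry) -> indices of matching pairs

    # 1) Substring-match each alt-text against its own language's metadata.
    for i, (lang, text, _) in enumerate(pairs):
        for entry in metadata_by_lang.get(lang, ()):
            if entry in text:
                matches[(lang, entry)].append(i)

    # 2) Balance: keep every tail match, subsample head entries down to t.
    keep = set()
    for idxs in matches.values():
        keep.update(rng.sample(idxs, t) if len(idxs) > t else idxs)

    return [pairs[i] for i in sorted(keep)]
```

The cap is what flattens the concept distribution: ubiquitous concepts like "photo" get downsampled while rare, language-specific concepts survive in full.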
OpenAI's CLIP, extended to 300+ languages worldwide by Meta together with Saining Xie and Zhuang Liu
机器之心· 2025-07-31 05:11
Core Viewpoint
- The article introduces MetaCLIP 2, a novel method for training CLIP at global scale without relying on external resources, addressing the challenges of multilingual data curation and improving model performance across languages [2][4].

Group 1: MetaCLIP 2 Overview
- MetaCLIP 2 is the first method to train CLIP from scratch on native worldwide image-text pairs, overcoming the English-centric limitations of previous models [2][5].
- The method rests on three core innovations: metadata expanded to over 300 languages, a curation algorithm that balances the concept distribution within each language, and a global training framework that scales the number of seen image-text pairs proportionally as non-English data is introduced [5][20].

Group 2: Performance Improvements
- MetaCLIP 2 demonstrates that non-English data can enhance the capabilities of English models and vice versa, effectively breaking the "multilingual curse" [10][31].
- The model achieved state-of-the-art (SOTA) results on various multilingual benchmarks, including gains of 3.8% on Babel-ImageNet and 1.1% on XM3600, among others [32][34].

Group 3: Training Methodology
- The training framework stays consistent with OpenAI's CLIP architecture while adding key components: a multilingual text tokenizer and scaling of seen training pairs [26][30].
- Seen training pairs were scaled from 13 billion to 29 billion, yielding significant performance gains on both English and multilingual tasks (a back-of-envelope sketch of this scaling follows below) [38][39].

Group 4: Cultural and Linguistic Diversity
- MetaCLIP 2 preserves a comprehensive distribution of worldwide images, strengthening geographical localization and regional recognition capabilities [13][15].
- The model learns directly from image descriptions written by native speakers rather than machine translations, improving the authenticity and accuracy of the training data [12][16].
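The 13B-to-29B figure in Group 3 is consistent with keeping English exposure fixed while folding in non-English data: if English pairs make up roughly 45% of the curated global pool, matching the English-only baseline's exposure forces the global run to see proportionally more pairs in total. A minimal sketch of that arithmetic, where the ~45% English share is back-solved from the two figures for illustration rather than stated in this summary:

```python
def global_seen_pairs(english_baseline_pairs: float, english_share: float) -> float:
    """Seen pairs the global run needs so that English exposure matches
    the English-only baseline: english_share * total == english_baseline_pairs."""
    return english_baseline_pairs / english_share

# Illustrative: a 13B-pair English baseline and a ~45% English share
# (assumed, back-solved from 13B / 29B) imply a ~29B-pair global run.
total = global_seen_pairs(13e9, 13 / 29)
print(f"{total / 1e9:.0f}B seen pairs")  # -> 29B seen pairs
```

This is why adding non-English data need not dilute English performance: the English portion of the training budget is held constant while the rest of the world's data is layered on top.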