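The First Real-World AI Translation Leaderboard Is Out! GPT-4o Holds Firm at the Ceiling, Qwen Series Leads on Culture | Open Source
QbitAI (量子位) · 2025-05-23 00:24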
Core Viewpoint
- The article covers the launch of TransBench, the first application-oriented AI translation evaluation leaderboard, which aims to standardize how translation quality is assessed across AI models [1][5][32].

Group 1: TransBench Overview
- TransBench is a joint release by Alibaba International's AI Business team, Shanghai Artificial Intelligence Laboratory, and Beijing Language and Culture University [2].
- It adds evaluation metrics such as hallucination rate, culturally taboo words, and honorific conventions, targeting the errors large models most often make in translation [3][34].
- The evaluation methods and datasets are open source, and the first set of results has been published; AI translation providers are invited to submit [5][6][44].

Group 2: Evaluation Metrics
- The evaluation framework groups its datasets into three categories: general standards, e-commerce culture, and cultural characteristics [8][35].
- The leaderboard ranks translation capability on four dimensions: overall score, general standards, e-commerce culture, and cultural characteristics [9][11].

Group 3: Model Performance
- In the English-to-other-languages category, the top three models by overall score and by general standards are GPT-4o, DeepL Translate, and GPT-4-Turbo [16][14].
- In e-commerce, DeepSeek-R1 ranks among the top performers, while the Qwen2.5 models lead on cultural characteristics [17][19].
- In the Chinese-to-other-languages category, DeepSeek-V3 leads, followed by Gemini-2.5-Pro and Claude-3.5-Sonnet [23][25].

Group 4: Industry Context
- Demand for high-quality AI translation has grown, and models are now expected to respect cultural nuance and industry-specific language [28][29].
- Traditional evaluation metrics are considered insufficient for these requirements, which prompted the development of TransBench [31][32].
- Alibaba's Marco MT model averages 600 million calls per day, which underlines the importance of translation in global e-commerce [40][41].
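The First Real-World AI Translation Leaderboard Is Out! GPT-4o Holds Firm at the Ceiling, Qwen Series Leads on Culture | Open Source
QbitAI (量子位) · 2025-05-22 14:24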
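Heng Yu, reporting from Aofeisi
QbitAI | WeChat official account QbitAI

When AI does our translation work for us, whose model is actually the best?

At last, someone has stepped in to set a common standard for the translation world:

TransBench, the first application-oriented AI translation evaluation leaderboard, has launched on OpenCompass.

It was released jointly by Alibaba International's AI Business team, Shanghai Artificial Intelligence Laboratory, and Beijing Language and Culture University.

Compared with traditional translation evaluation systems, TransBench adds metrics such as hallucination rate, culturally taboo words, and honorific conventions. These put to a practical test the key problems where large-model translation most often goes wrong.

For example:

This is the first time evaluation data and evaluation methods have been built for specific industry verticals. The metrics all come from usage feedback in real-world scenarios, and they are used to assess whether a large model meets the bar for large-scale deployment.

The TransBench evaluation methods and datasets are now fully open source, and the first round of evaluation results has been published.

AI translation providers are welcome to submit to the leaderboard and compete.

GPT-4o holds firm as the "ceiling of translation AI"

According to the official website, the TransBench datasets cover Chinese, English, French, Japanese, Korean, Spanish, and other languages.

A large number of lower-resource languages are also being added on an ongoing basis.

The datasets in the TransBench evaluation system are organized into three broad categories: "general standards", "e-commerce culture", and "cultural characteristics".

The first edition of the TransBench multilingual translation leaderboard has now ...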