DiscoX
Z Tech | ICLR 2026 ByteDance Release: From Short Sentences to Full Documents, DiscoX Offers a New Paradigm for Evaluating Long-Form Translation
Z Potentials · 2026-02-13 02:27
Paper link: https://arxiv.org/abs/2511.10984
Project link: https://github.com/ByteDance-Seed/DiscoX
Blog link: https://randomtutu.github.io/DiscoX/
Authors: Xiying Zhao, Zhoufutu Wen, Zhixuan Chen, Jingzhe Ding, Jianpeng Jiao, Shuai Li, Xi Li, Danni Liang, Shengda Long, Qianqian Liu, Xianbo Wu, Hongwan Gao, Xiang Gao, Liang Hu, Jiashuo Liu, Mengyun Liu, Weiran Shi, Chenghao Yang, Qianyu Yang, Xuanliang Zhang, Ge Zhang, Wenhao Huang, Yuwen Tang
First author: an evaluation product manager for ByteDance's Doubao large model, whose research focuses on general model evaluation systems (General Model Evals System ...
Core Insights
- DiscoX provides a long-form translation evaluation dataset of 200 texts with an average length of 1,712 tokens, focusing on translation accuracy, logical and stylistic consistency across paragraphs, terminology precision, and adherence to professional writing standards [4][9][12].

Group 1: Evaluation Framework
- Metric-S is introduced as a novel evaluation framework for long-form translation that requires no reference answers and produces interpretable results through a multi-agent evaluation system [4][5][16].
- The evaluation process has three stages: an instruction-adherence check, a comprehensive quality assessment across accuracy, fluency, and appropriateness, and a deduplication and attribution mechanism that ensures fair scoring; a minimal sketch of such a pipeline follows below [17][18][19].

Group 2: Advantages of DiscoX and Metric-S
- DiscoX enables precise assessment of long-form translations, exposes the shortcomings of models on such tasks, and provides detailed multi-dimensional scoring [7][8].
- The reference-free evaluation approach reduces the need for expensive manual annotation and addresses the lack of standard reference translations for business documents and academic papers [8][12].

Group 3: Model Performance
- Evaluating 20 representative models on DiscoX shows that the leading model, GPT-5-high, scored 76.66, still below the human-expert level of 80.16, indicating that high-quality long-form translation remains a significant challenge for current LLMs [23][24][25].
- Performance varies across dimensions: GPT-5 excels in accuracy, Kimi-K2 in fluency, and the Claude-4 series shows high accuracy but lower fluency [29].
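The sketch below illustrates what a reference-free, multi-stage judging pipeline of this shape can look like in Python: stage 1 gates on instruction adherence, stage 2 mines errors per dimension from the source and target texts only, and stage 3 deduplicates findings before aggregating a score. The prompts, the `Judge` callable, the severity scale, and the penalty-based scoring are illustrative assumptions, not the paper's actual Metric-S implementation.

```python
# Illustrative sketch of a reference-free, three-stage translation evaluator.
# Everything here (prompts, severity scale, penalty weights) is assumed for
# demonstration and is not the Metric-S implementation from the paper.

import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Judge = Callable[[str], str]  # takes a prompt, returns the judge model's raw reply


@dataclass
class Finding:
    dimension: str   # "accuracy" | "fluency" | "appropriateness"
    span: str        # offending target-text span
    issue: str       # short description of the problem
    severity: int    # 1 (minor) .. 3 (critical)


@dataclass
class EvalResult:
    follows_instructions: bool
    findings: List[Finding] = field(default_factory=list)
    score: float = 0.0


def check_instruction_adherence(judge: Judge, instructions: str, translation: str) -> bool:
    """Stage 1: ask the judge whether the translation obeys the task instructions."""
    prompt = (
        "Answer only YES or NO. Does the translation below follow these "
        f"instructions?\nInstructions: {instructions}\nTranslation: {translation}"
    )
    return judge(prompt).strip().upper().startswith("YES")


def collect_findings(judge: Judge, source: str, translation: str) -> List[Finding]:
    """Stage 2: reference-free error mining per dimension (source vs. target only)."""
    findings: List[Finding] = []
    for dim in ("accuracy", "fluency", "appropriateness"):
        prompt = (
            f"You are a strict reviewer of document-level translation ({dim}).\n"
            "List problems as a JSON array of objects with keys "
            "'span', 'issue', 'severity' (1-3). Return [] if none.\n"
            f"Source:\n{source}\n\nTranslation:\n{translation}"
        )
        try:
            for item in json.loads(judge(prompt)):
                findings.append(Finding(dim, item["span"], item["issue"], int(item["severity"])))
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue  # a robust system would retry or re-prompt here
    return findings


def deduplicate(findings: List[Finding]) -> List[Finding]:
    """Stage 3 (simplified): merge findings that point at the same span,
    keeping the highest-severity report so one error is not penalized twice."""
    best: Dict[str, Finding] = {}
    for f in findings:
        key = f.span.strip().lower()
        if key not in best or f.severity > best[key].severity:
            best[key] = f
    return list(best.values())


def score(findings: List[Finding], base: float = 100.0, penalty_per_severity: float = 2.0) -> float:
    """Toy aggregation: start from a full score and subtract severity-weighted penalties."""
    return max(0.0, base - penalty_per_severity * sum(f.severity for f in findings))


def evaluate(judge: Judge, instructions: str, source: str, translation: str) -> EvalResult:
    if not check_instruction_adherence(judge, instructions, translation):
        return EvalResult(follows_instructions=False, score=0.0)
    findings = deduplicate(collect_findings(judge, source, translation))
    return EvalResult(follows_instructions=True, findings=findings, score=score(findings))
```

In practice, `judge` would wrap an LLM API call, and the paper's actual system relies on multiple cooperating agents with an attribution step considerably more elaborate than the deduplication stub shown here.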