Z Tech | ICLR 2026 ByteDance Release: From Short Sentences to Full Documents, DiscoX Offers a New Evaluation Paradigm for Long-Form Translation

Core Insights

- DiscoX is a long-form translation evaluation benchmark of 200 texts averaging 1,712 tokens each. It assesses translation accuracy, logical and stylistic consistency across paragraphs, terminology precision, and adherence to professional writing standards [4][9][12].

Group 1: Evaluation Framework

- Metric-S is a new evaluation framework for long-form translation. It requires no reference translations and produces interpretable results through a multi-agent evaluation system [4][5][16].
- Evaluation proceeds in three stages: an instruction adherence check; a comprehensive quality assessment covering accuracy, fluency, and appropriateness; and a deduplication and attribution mechanism that keeps scoring fair [17][18][19].

Group 2: Advantages of DiscoX and Metric-S

- DiscoX enables precise assessment of long-form translations, exposes where models fall short on such tasks, and provides detailed multi-dimensional scores [7][8].
- Because evaluation is reference-free, the framework reduces the need for expensive manual annotation and works around the lack of standard reference translations for business documents and academic papers [8][12].

Group 3: Model Performance

- Of the 20 representative models evaluated on DiscoX, the leading model, GPT-5-high, scored 76.66, still below the human expert score of 80.16. High-quality long-form translation therefore remains a significant challenge for current LLMs [23][24][25].
- Performance varies by dimension: GPT-5 leads in accuracy, Kimi-K2 leads in fluency, and the Claude-4 series shows high accuracy but lower fluency [29].
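
The three-stage Metric-S process described under Group 1 can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `Issue` record, the function names, the penalty values, the rule of keeping the highest-penalty issue per text span, and the 100-point subtractive score are all assumptions made for illustration. In the actual framework the issues would come from LLM-based evaluation agents; here they are passed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """One problem reported by a (hypothetical) evaluation agent."""
    dimension: str   # "accuracy", "fluency", or "appropriateness"
    span: str        # offending text in the translation
    penalty: float   # assumed point deduction for this issue


def check_instruction_adherence(translation: str) -> bool:
    """Stage 1 (placeholder): treat an empty output as a failed instruction check."""
    return bool(translation.strip())


def deduplicate(issues: list[Issue]) -> list[Issue]:
    """Stage 3 (assumed rule): when several dimensions flag the same span,
    keep only the highest-penalty issue so the error is charged once."""
    best: dict[str, Issue] = {}
    for issue in issues:
        kept = best.get(issue.span)
        if kept is None or issue.penalty > kept.penalty:
            best[issue.span] = issue
    return list(best.values())


def score(translation: str, issues: list[Issue], max_score: float = 100.0) -> float:
    """Combine the stages: gate on stage 1, then subtract deduplicated penalties.

    `issues` stands in for the output of stage 2, the quality assessment.
    """
    if not check_instruction_adherence(translation):
        return 0.0
    unique = deduplicate(issues)
    return max(0.0, max_score - sum(i.penalty for i in unique))


# Hypothetical stage-2 output: two agents flag the same span "foo".
issues = [
    Issue("accuracy", "foo", 5.0),
    Issue("fluency", "foo", 3.0),
    Issue("appropriateness", "bar", 2.0),
]
result = score("some translated text", issues)
```

In this example the span "foo" is charged once, to accuracy, so `result` is 93.0 rather than 90.0. Deduplication of this kind is one way to avoid penalizing the same error under several dimensions.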