Discourse-Level Evaluation

Z Tech | ByteDance at ICLR 2026: From Short Sentences to Full Discourse, DiscoX Offers a New Evaluation Paradigm for Long-Form Translation

Z Potentials · 2026-02-13 02:27

Core Insights
- DiscoX is a long-form translation evaluation dataset of 200 texts with an average length of 1,712 tokens. It focuses on translation accuracy, logical and stylistic consistency across paragraphs, terminology precision, and adherence to professional writing standards [4][9][12].

Group 1: Evaluation Framework
- Metric-S is introduced as a novel evaluation framework for long-form translation. It requires no reference translations and produces interpretable results through a multi-agent evaluation system [4][5][17].
- The evaluation process has three stages: instruction adherence detection, which filters out invalid responses; comprehensive quality scoring on accuracy, fluency, and appropriateness; and a score optimization mechanism that keeps the assessment fair by not penalizing the same error more than once [6][20][21].

Group 2: Advantages of DiscoX and Metric-S
- DiscoX enables precise evaluation of long-form translations, exposes where models fall short on such tasks, and provides detailed multi-dimensional scores [7][8].
- The framework supports structured diagnostic attribution, which feeds a feedback loop for model optimization, and its reference-free design reduces the cost of manual annotation [8][12].

Group 3: Model Performance
- In an evaluation of 20 representative models on DiscoX, even the state-of-the-art model, GPT-5-high, scored 76.66, below the human expert level of 80.16. High-quality discourse-level translation therefore remains a significant challenge for current LLMs [30][31][32].
- Performance varies across dimensions: GPT-5 excels in accuracy and Kimi-K2 in fluency, while the Claude-4 series shows higher accuracy but lower fluency [37][38].
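The three-stage evaluation process described under Group 1 can be illustrated with a small sketch. This is not the Metric-S implementation: the summary does not give its prompts, agents, or scoring scale, so every name here (`ErrorNote`, `follows_instruction`, `dedupe_errors`, `score_translation`), the 100-point base, and the penalty values are hypothetical. In particular, the error annotations that an LLM-based judge would produce are simply passed in as input, and the adherence check is a trivial stand-in.

```python
from dataclasses import dataclass

# The three quality dimensions named in the summary.
DIMENSIONS = ("accuracy", "fluency", "appropriateness")


@dataclass(frozen=True)
class ErrorNote:
    """Hypothetical error annotation, e.g. as emitted by a judging agent."""
    dimension: str   # one of DIMENSIONS
    key: str         # normalized id of the underlying error
    penalty: float   # points to deduct (assumed scale)


def follows_instruction(source: str, translation: str) -> bool:
    """Stage 1 stand-in: reject empty outputs or verbatim copies of the source."""
    text = translation.strip()
    return bool(text) and text != source.strip()


def dedupe_errors(errors):
    """Stage 3 stand-in: keep one note per underlying error (the largest penalty)."""
    worst = {}
    for err in errors:
        if err.key not in worst or err.penalty > worst[err.key].penalty:
            worst[err.key] = err
    return list(worst.values())


def score_translation(source, translation, errors, base=100.0):
    """Run the three stages and return a per-dimension breakdown."""
    if not follows_instruction(source, translation):
        return {"valid": False, "total": 0.0, "by_dimension": {}}
    by_dim = {dim: 0.0 for dim in DIMENSIONS}
    # Stage 2: accumulate penalties per dimension, after deduplication.
    for err in dedupe_errors(errors):
        by_dim[err.dimension] = by_dim.get(err.dimension, 0.0) + err.penalty
    total = max(0.0, base - sum(by_dim.values()))
    return {"valid": True, "total": total, "by_dimension": by_dim}


notes = [
    ErrorNote("accuracy", "term:discourse", 5.0),
    ErrorNote("accuracy", "term:discourse", 5.0),  # same error flagged again
    ErrorNote("fluency", "awkward-sentence-3", 2.0),
]
result = score_translation("源文本", "A translated text.", notes)
print(result["total"])  # 93.0
```

The repeated terminology error is deducted once rather than twice, which is the behavior the score optimization stage is described as enforcing. A response that fails the adherence check never reaches scoring.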