Benchmark Exposes LLMs' "Word Count Crisis": 26 Models Broadly Fall Short on Long-Text Generation, and Maximum Output Lengths Are Overstated

Core Insights
- The article discusses the limitations of large language models (LLMs) in following length instructions, particularly in generating long texts, revealing a significant performance gap [1][10][33]

Group 1: LIFEBENCH Overview
- LIFEBENCH is a benchmark designed to evaluate LLMs' performance in adhering to length instructions across various tasks and languages [2][4]
- It includes a diverse dataset covering a wide range of lengths and instruction types, making it the first systematic evaluation of models' length instruction adherence [4][6]

Group 2: Key Features of LIFEBENCH
- The benchmark features three common length control methods: Equal To, At Most, and At Least, with output ranges from short texts (<100 words) to long texts (>2000 words) [4][6]
- It assesses four types of natural language generation tasks: question answering, summarization, reasoning, and creative generation [6][10]

Group 3: Evaluation Metrics
- Length Deviation (LD) measures the difference between generated text length and target length, providing insights into models' performance [7][9]
- Length Score (LS) quantifies the overall adherence to length instructions, offering a more nuanced analysis than simple word count matching [8][9]

Group 4: Experimental Findings
- Out of 26 evaluated models, 23 scored below 60 in Length Score when instructed to generate text of a specific length, indicating poor performance [10][11]
- Long text generation is particularly challenging, with all models scoring below 40 in Length Score for long text tasks [12][13]

Group 5: Bottlenecks in Length Instruction Adherence
- Models struggle with accurate length perception, often overestimating for short outputs and underestimating for long outputs [20][22]
- Input length significantly affects model performance, especially with longer inputs leading to poorer results [22][23]
- Many models exhibit "lazy" generation strategies, such as prematurely terminating or refusing to generate when faced with complex tasks [23][25]

Group 6: Hidden Issues and Improvement Opportunities
- The quality of long text generation varies, with models performing best in the medium length range (1024-2048 words) [27][28]
- Formatting requirements further complicate adherence to length instructions, with complex formats leading to increased errors [30]
- Pre-training limitations contribute to models' poor performance, suggesting that enhanced training strategies could improve adherence to length instructions [32][33]
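
The summary names Length Deviation and Length Score and the three control methods (Equal To, At Most, At Least) but gives no formulas. The sketch below shows how such a check might look. It assumes LD is the signed relative difference between output and target length. The `length_score` function, its exponential decay, and the constant `k` are hypothetical stand-ins for illustration, not LIFEBENCH's actual LS definition. The whitespace-based `count_words` is also an assumption; a real benchmark may count differently, for example by characters in Chinese.

```python
import math


def count_words(text: str) -> int:
    """Count whitespace-separated tokens (an assumed, simplified word count)."""
    return len(text.split())


def length_deviation(output_len: int, target_len: int) -> float:
    """Signed relative deviation: negative means too short, positive too long.

    Assumed form; the article only says LD measures the difference
    between generated length and target length.
    """
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    return (output_len - target_len) / target_len


def satisfies(output_len: int, target_len: int, mode: str) -> bool:
    """Strict pass/fail check for the three length control methods."""
    if mode == "equal_to":
        return output_len == target_len
    if mode == "at_most":
        return output_len <= target_len
    if mode == "at_least":
        return output_len >= target_len
    raise ValueError(f"unknown mode: {mode}")


def length_score(output_len: int, target_len: int,
                 mode: str = "equal_to", k: float = 3.0) -> float:
    """Hypothetical 0-100 score that decays smoothly with deviation.

    NOT the benchmark's LS formula: the decay shape and k are
    illustrative assumptions chosen to show graded scoring.
    """
    ld = length_deviation(output_len, target_len)
    # One-sided constraints give full marks on the permitted side.
    if mode == "at_most" and ld <= 0:
        return 100.0
    if mode == "at_least" and ld >= 0:
        return 100.0
    return 100.0 * math.exp(-k * abs(ld))


if __name__ == "__main__":
    draft = "word " * 1200  # stand-in for a model output
    n = count_words(draft)
    print(n, length_deviation(n, 2000), satisfies(n, 2000, "at_least"))
    print(round(length_score(n, 2000, "equal_to"), 1))
```

A graded score of this kind separates a near miss from a severe shortfall, which a strict pass/fail check such as `satisfies` cannot do. That is the sense in which the article describes LS as more nuanced than simple word count matching.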