Is OpenAI Doubling Down on Writing? Alibaba's New General Writing Benchmark for Large Models, WritingBench, Reveals Whether Deep Thinking Improves Literary Expression
量子位 · 2025-03-20 10:56
Core Insights
- The article discusses the launch of WritingBench, a comprehensive benchmark for the generative writing capabilities of large models, developed jointly by Alibaba Research, Renmin University of China, and Shanghai Jiao Tong University [3][4][10].

Group 1: WritingBench Overview
- WritingBench covers six major domains and 100 sub-scenarios, with over 1,000 evaluation data points, aiming to provide a thorough assessment of generative writing [3][10].
- The benchmark addresses two main challenges in evaluating AI writing: existing assessments are limited to single domains and short texts, and traditional evaluation methods align poorly with human judgment [4][8].

Group 2: Evaluation Methodology
- WritingBench employs a four-stage human-machine collaborative construction process: generating simple writing tasks, complicating the instructions, supplementing them with real-world materials, and expert checks of content quality [11][12][14].
- The benchmark supports diverse evaluation dimensions, including style, format, and length, making it more comprehensive than existing benchmarks [16].

Group 3: Dynamic Assessment System
- WritingBench features a dynamic assessment system that generates evaluation metrics from the writing intent of each query, achieving 87% consistency with human evaluations [19][20].
- A dedicated scoring model has been trained to assign adaptive scores from 1 to 10 against the generated criteria [21].

Group 4: Model Performance
- On WritingBench, Deepseek-R1 achieves an average score of 8.55, while Qwen-Max scores 8.37 [28][30].
- Chain-of-thought (CoT) reasoning has been shown to improve performance on creative writing tasks: models that incorporate CoT outperform those that do not [29].
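The dynamic assessment described in Group 3 amounts to deriving per-query criteria and averaging per-criterion scores on a 1-10 scale. A minimal sketch of the aggregation step, assuming simple unweighted averaging; the criteria names below are hypothetical, and the rubric-generation step (which requires a judge LLM) is omitted:

```python
from statistics import mean

def aggregate_scores(criterion_scores: dict[str, int]) -> float:
    """Aggregate per-criterion scores (each on a 1-10 scale, as in
    WritingBench's scoring model) into one overall score by averaging."""
    for name, score in criterion_scores.items():
        if not 1 <= score <= 10:
            raise ValueError(f"criterion {name!r} has out-of-range score {score}")
    return round(mean(criterion_scores.values()), 2)

# Hypothetical criteria a judge model might derive from a writing query.
scores = {"style": 9, "format": 8, "length_control": 7}
print(aggregate_scores(scores))  # -> 8.0
```

In practice the criteria themselves vary per query (style, format, length, and so on), which is what makes the assessment "dynamic" rather than a fixed rubric.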
Group 5: Challenges in Long Text Generation
- The article notes a significant challenge in generating long texts: most models show quality degradation once output exceeds 3,000 tokens [35][36].
- Smaller models tend to produce repetitive content, while larger models may terminate early or provide only outlines, indicating a need for further optimization in long text generation [37][39].