Workflow
Balancing Innovation and Rigor: A Guide to the Deliberate Integration of AI in Evaluation (Guidance Note) (English) 2025
World Bank · 2025-05-26 06:35

Investment Rating
- The report does not explicitly provide an investment rating for the industry.

Core Insights
- Integrating large language models (LLMs) into evaluation practice can significantly enhance the efficiency and validity of text data analysis, although challenges remain in ensuring the completeness and relevance of information extraction [2][17][19].

Key Considerations for Experimentation
- Identifying relevant use cases is crucial: LLMs should be applied where they add significant value compared to traditional methods [9][23].
- Detailed workflows for each use case help teams understand how to apply LLMs effectively and allow successful components to be reused [10][28].
- Agreement on resource allocation and expected outcomes is essential for successful experimentation, including clarity on human resources, technology, and definitions of success [11][33].
- A robust sampling strategy is necessary to support effective prompt development and model evaluation [12][67] (a stratified-split sketch appears at the end of this note).
- Appropriate metrics must be selected to measure LLM performance: standard machine learning metrics for discriminative tasks, and human assessment criteria for generative tasks [13][36] (a metric-computation sketch appears at the end of this note).

Experiments and Results
- The report details four experiments covering text classification, summarization, synthesis, and information extraction, with results indicating satisfactory performance across tasks [49][50].
- For text classification, the model achieved 90% accuracy, 75% recall, and 60% precision [53].
- In generative tasks, the model demonstrated high relevance (4.87), coherence (4.97), and faithfulness (0.90) in summarization, while information extraction showed excellent faithfulness but lower relevance (3.25) [58].

Emerging Good Practices
- Iterative prompt development and validation are critical to achieving satisfactory results; prompts should be refined based on observed model performance [60][64].
- Including representative examples in prompts enhances the model's ability to generate relevant responses [81] (a few-shot prompt sketch appears at the end of this note).
- Evaluating model performance should include assessing the faithfulness of responses and setting context-specific thresholds for the selected metrics [89][90] (a threshold-check sketch appears at the end of this note).
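
The sketches below are not taken from the report; they illustrate, under stated assumptions, how some of the practices summarized above could be implemented. First, the sampling-strategy consideration: a minimal sketch of a stratified split of a text corpus into a prompt-development set and a held-out evaluation set. The document list, the "category" field used for stratification, and the 50/50 split ratio are illustrative assumptions, not the report's own sampling design.

```python
# Minimal sketch: stratified split of a corpus into a prompt-development set
# and a held-out evaluation set. Documents and the "category" field are
# illustrative assumptions.
from sklearn.model_selection import train_test_split

documents = [
    {"id": 1, "category": "project_report"},
    {"id": 2, "category": "project_report"},
    {"id": 3, "category": "project_report"},
    {"id": 4, "category": "interview_note"},
    {"id": 5, "category": "interview_note"},
    {"id": 6, "category": "interview_note"},
]

dev_set, eval_set = train_test_split(
    documents,
    test_size=0.5,                                # half held out for final evaluation
    stratify=[d["category"] for d in documents],  # preserve the category mix in both sets
    random_state=42,                              # reproducible draw
)
print([d["id"] for d in dev_set], [d["id"] for d in eval_set])
```

Keeping the evaluation set untouched during prompt iteration is what makes the later performance figures credible.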
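Second, the metrics consideration for discriminative tasks: a minimal sketch of scoring LLM-assigned labels against human-coded reference labels with scikit-learn, using the accuracy, precision, and recall metrics cited in the note. The label set and the data are invented for illustration.

```python
# Minimal sketch: standard discriminative metrics for an LLM classification task,
# computed against human-coded reference labels. Labels and data are illustrative.
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Human-coded reference labels and the LLM's labels for the same sampled texts
human_labels = ["relevant", "relevant", "not_relevant", "relevant", "not_relevant", "relevant"]
llm_labels   = ["relevant", "not_relevant", "not_relevant", "relevant", "relevant", "relevant"]

accuracy = accuracy_score(human_labels, llm_labels)
# pos_label marks which class counts as a positive "hit" for precision and recall
precision = precision_score(human_labels, llm_labels, pos_label="relevant")
recall = recall_score(human_labels, llm_labels, pos_label="relevant")

print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Reporting precision and recall alongside accuracy matters because, as in the report's own classification experiment, a high accuracy figure can coexist with noticeably lower precision or recall when classes are imbalanced.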
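Third, the "representative examples in prompts" practice: a minimal few-shot prompt builder. The categories and example statements are hypothetical, and the resulting string would be passed to whichever model the evaluation team uses.

```python
# Minimal sketch: embedding representative, human-validated examples in the prompt
# (few-shot style) before asking the model to label a new text. The categories and
# example statements are illustrative assumptions.
FEW_SHOT_EXAMPLES = [
    ("The project exceeded its disbursement targets.", "positive_finding"),
    ("Stakeholder consultations were repeatedly delayed.", "negative_finding"),
]

def build_prompt(new_text: str) -> str:
    """Assemble a classification prompt that shows the model representative
    examples before asking it to label the new text."""
    lines = ["Classify each statement as positive_finding or negative_finding.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Statement: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Statement: {new_text}")
    lines.append("Label:")
    return "\n".join(lines)

print(build_prompt("Monitoring data were incomplete for two indicators."))
```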
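Finally, the threshold-setting practice: a minimal sketch that compares aggregated rubric scores against context-specific floors agreed on in advance. The threshold values below are placeholders, not recommendations from the report.

```python
# Minimal sketch: checking aggregated rubric scores against context-specific
# thresholds agreed before the experiment. Threshold values are placeholders.
THRESHOLDS = {"relevance": 4.0, "coherence": 4.0, "faithfulness": 0.85}

def passes_thresholds(scores: dict[str, float]) -> dict[str, bool]:
    """Return a pass/fail flag per metric for one task's aggregated scores."""
    return {metric: scores.get(metric, 0.0) >= floor
            for metric, floor in THRESHOLDS.items()}

# Example input: the summarization scores reported in the note
summarization_scores = {"relevance": 4.87, "coherence": 4.97, "faithfulness": 0.90}
print(passes_thresholds(summarization_scores))  # all True under these placeholder floors
```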