Federal Reserve: Total Recall? Evaluating the Macroeconomic Knowledge of Large Language Models (English Version)

Core Insights
- The report evaluates the performance of large language models (LLMs) in recalling macroeconomic knowledge, particularly focusing on the Claude Sonnet 3.5 model's ability to estimate historical macroeconomic variables and data release dates [1][8][10]
- Findings indicate that while LLMs demonstrate impressive recall for certain economic indicators, they also exhibit significant shortcomings, particularly in handling volatile data series and in avoiding look-ahead bias [2][11][18]

Group 1: Performance Evaluation
- LLMs show strong recall for historical unemployment rates and Consumer Price Index (CPI) values, accurately recalling quarterly values back to World War II [11][44]
- However, the model struggles with more volatile data series such as real GDP growth and industrial production growth, often missing high-frequency fluctuations while capturing broader business cycle trends [11][45]
- The model's estimates for GDP are found to mix first print values with subsequent revisions, leading to inaccuracies in historical understanding and real-time forecasting simulations [12][14]

Group 2: Data Release Dates
- LLMs can recall historical data release dates with reasonable accuracy, but they occasionally misestimate these dates by a few days [16]
- The accuracy of recalling release dates is sensitive to prompt details, with adjustments to prompts reducing one type of error while increasing another [16]
- On average, about 20.2% of days show at least one series with recall issues, indicating limitations in the reliability of LLMs for historical analysis and real-time forecasting [2][16]

Group 3: Look-Ahead Bias
- Evidence suggests that LLMs may inadvertently incorporate future data values when estimating historical data, even when instructed to ignore future information [15][18]
- This look-ahead bias presents challenges for using LLMs in historical analysis and as real-time forecasters, as it reflects a tendency to blend past and future information [18][22]
- The report highlights that these errors are reminiscent of human forecasting mistakes, indicating a fundamental challenge in the LLMs' recall capabilities [18][22]
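
The day-level figure under Group 2 (the share of days on which at least one series has a recall issue) can be illustrated with a small calculation. The following is a minimal sketch, not the report's actual procedure: it assumes a series counts as "available" on a day if its release date falls on or before that day, and it flags a day whenever the recalled release date implies a different availability than the true one. The function name, the series names, and all dates are hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta


def share_of_days_with_error(true_release, recalled_release, days):
    """Fraction of days on which at least one series' availability is misjudged.

    Assumption (not taken from the report): a series is 'available' on a day
    if its release date is on or before that day.
    """
    bad_days = 0
    for day in days:
        for series, true_date in true_release.items():
            truly_out = true_date <= day
            believed_out = recalled_release[series] <= day
            if truly_out != believed_out:
                bad_days += 1
                break  # one misjudged series is enough to flag the day
    return bad_days / len(days)


# Hypothetical release dates, for illustration only.
true_release = {"cpi": date(2020, 1, 14), "unemployment": date(2020, 1, 10)}
# The recalled CPI date is two days late; the unemployment date is correct.
recalled_release = {"cpi": date(2020, 1, 16), "unemployment": date(2020, 1, 10)}

# Twenty consecutive days starting 2020-01-06.
days = [date(2020, 1, 6) + timedelta(days=i) for i in range(20)]

share = share_of_days_with_error(true_release, recalled_release, days)
print(f"{share:.1%}")  # 10.0%
```

In this toy setup a two-day error on a single series affects two of the twenty days. The example suggests how errors of only a few days per series could add up to a sizable share of affected days once many series are tracked together.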