Core Insights
- A recent study from Stanford University highlights significant limitations of large language models (LLMs) in distinguishing between users' beliefs and factual knowledge, raising concerns about their use in high-risk fields such as medicine, law, and scientific decision-making [1][2]

Group 1: Model Performance
- The study evaluated 24 LLMs, including DeepSeek and GPT-4o, on roughly 13,000 questions. Newer models verified factual statements with average accuracies of 91.1% or 91.5%, while older models averaged 84.8% or 71.5% [1]
- When responding to first-person belief statements ("I believe..."), newer models (those released from GPT-4o in May 2024 onward) were 34.3% less likely to acknowledge a false belief than a true one; for older models the gap was 38.6% [1]

Group 2: Belief Recognition Challenges
- LLMs tend to correct users on the facts rather than acknowledging what the user believes; for third-person beliefs ("Mary believes..."), accuracy dropped by 4.6% in newer models and by 15.5% in older models [2]
- The study concludes that LLMs must reliably distinguish the nuances of fact and belief in order to respond appropriately to user queries and to avoid spreading misinformation [2]
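The three framings the study contrasts (plain fact verification, first-person belief, third-person belief) are easy to reproduce informally. The Python sketch below builds all three prompts around a single false statement and routes them through a placeholder `query_model` function; the prompt wording, the example statement, and the `query_model` stub are illustrative assumptions, not the study's actual benchmark or scoring procedure.

```python
# Minimal sketch: probe whether a model acknowledges a speaker's false belief
# or simply corrects the underlying fact. Not the study's benchmark.

def query_model(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real chat-completion call here.
    return "<model answer placeholder>"

FALSE_STATEMENT = "the Great Wall of China is visible from the Moon with the naked eye"

# Three framings analogous to the conditions reported in the study.
PROMPTS = {
    # Plain fact verification: the model should say the claim is false.
    "fact": f"True or false: {FALSE_STATEMENT}.",
    # First-person belief: the model should recognise that the speaker
    # holds this belief, even though the claim itself is false.
    "first_person": f"I believe that {FALSE_STATEMENT}. Do I believe this?",
    # Third-person belief: the same test, attributed to another person.
    "third_person": f"Mary believes that {FALSE_STATEMENT}. Does Mary believe this?",
}

def probe() -> dict[str, str]:
    """Run all three framings and collect raw answers for manual inspection."""
    return {condition: query_model(prompt) for condition, prompt in PROMPTS.items()}

if __name__ == "__main__":
    for condition, answer in probe().items():
        print(f"[{condition}] {answer}")
```

In practice one would repeat this over many true and false statements and compare how often the belief framings are answered correctly versus the plain fact framing, which is the gap the accuracy figures above describe.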
Large language models still cannot reliably distinguish belief from fact, sounding an alarm for applications in high-risk fields
Ke Ji Ri Bao·2025-11-07 00:01