"I Wasn't Wrong": GPT-4o Doubles Down and Flops, as AI Models Collectively Crash in the Face of Black Swan Events
36Ke · 2025-07-16 11:19
Core Insights
- A joint research team from Columbia University, the Vector AI Research Institute, and Nanyang Technological University found significant deficiencies in AI models' reasoning when handling unexpected events, with top models such as GPT-4o and Gemini 1.5 Pro performing up to 32% worse than humans [2][14][15]

Group 1: Research Findings
- The study, titled "Black Swan," highlights a fundamental issue with current AI evaluation methods: they focus on predictable, clearly depicted visual scenarios and neglect the unpredictable nature of real-world events [4][14]
- The study identifies two core reasoning abilities humans rely on in unexpected situations: abductive reasoning (inferring the most likely explanation from limited observations) and defeasible reasoning (revising initial conclusions in light of new evidence) [5][14]

Group 2: New Benchmark Testing
- To assess AI reasoning in unexpected situations, the team developed a new benchmark, "BlackSwanSuite," consisting of 1,655 videos depicting unconventional real-life scenarios [8][11]
- The benchmark comprises three core tasks: "Forecaster," in which models predict future events from the beginning of a video; "Detective," in which models infer the missing middle of a video from its start and end; and "Reporter," in which models describe the entire event and reassess their earlier judgments once the complete video is available (a sketch of this three-task protocol follows at the end of this summary) [11][12]

Group 3: Performance Comparison
- Top AI models, including GPT-4o and Gemini 1.5 Pro, lag significantly behind humans across all three tasks, with the best models trailing humans by up to 25% on multiple-choice questions and 32% on true/false judgments [14][15]
- On the "Detective" task, GPT-4o's accuracy was 24.9% lower than humans', while on the "Reporter" task the gap reached 32%, indicating that AI struggles to correct its initial misjudgments [14][15][16]

Group 4: Implications of Findings
- The models often "lock in" their initial judgments and fail to update their reasoning in light of new evidence, which poses significant risks in critical applications such as autonomous driving [17][20]
- In a follow-up experiment, providing the models with human-written descriptions of the video content improved their reasoning accuracy by up to 10%, suggesting that the core shortcoming lies not only in advanced reasoning but also in basic perception and understanding [25][26]
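For a concrete picture of the protocol described in Group 2, here is a minimal Python sketch of how the three BlackSwanSuite tasks could be posed to a video-language model. It is an illustration assembled only from the task descriptions above, not the authors' released code; the `VideoEvent` structure and the `model.answer(...)` interface are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class VideoEvent:
    """One benchmark clip, split into beginning / middle / end segments.

    In the Detective task the surprising middle segment is withheld,
    forcing the model to reason abductively about what happened.
    """
    begin: str   # opening segment (shown in every task)
    middle: str  # the "black swan" segment (hidden in the Detective task)
    end: str     # closing segment

def run_blackswan_tasks(model, event: VideoEvent) -> dict:
    """Pose the three tasks to a model exposing a hypothetical answer() API."""
    answers = {}
    # Forecaster: predict what comes next from the beginning alone.
    answers["forecaster"] = model.answer(
        clips=[event.begin],
        question="What is most likely to happen next?")
    # Detective: abductive reasoning -- infer the hidden middle
    # from only the start and the end of the video.
    answers["detective"] = model.answer(
        clips=[event.begin, event.end],
        question="What most plausibly happened in between?")
    # Reporter: defeasible reasoning -- with the full video revealed,
    # describe the event and revise earlier judgments if the new
    # evidence contradicts them.
    answers["reporter"] = model.answer(
        clips=[event.begin, event.middle, event.end],
        question="Describe the full event. Do your earlier answers still hold?")
    return answers
```

The point the article stresses sits in the last step: a model that has "locked in" its Forecaster or Detective answer should overturn it once the full video contradicts it, and the reported results suggest current models often do not.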