Reversal: After Apple Questioned AI's Reasoning Ability, a Paper Co-Authored by Claude Fires Back: It's Not That Models Can't Reason, They Lose to Tokens

Core Viewpoint
- Apple's machine learning research team published a paper titled "The Illusion of Thinking" that critically questions the reasoning capabilities of mainstream large language models (LLMs) such as OpenAI's "o" series, Google's Gemini 2.5, and DeepSeek-R1, arguing that these models do not learn generalizable first principles from their training data [4][6].

Group 1: Research Findings
- The paper uses four classic puzzles (Tower of Hanoi, Blocks World, River Crossing, and Checker Jumping) to show that as task complexity increases, the accuracy of top reasoning models declines sharply, ultimately falling to zero on the most complex instances [4][6].
- The Apple researchers also observed that the number of output tokens the models spent on "thinking" shrank as the problems grew harder, suggesting the models were actively cutting back their reasoning attempts; from this they concluded that the apparent reasoning is an illusion [8][10].

Group 2: Criticism and Counterarguments
- A rebuttal paper titled "The Illusion of The Illusion of Thinking," co-authored by independent researcher Alex Lawsen and the AI model Claude Opus 4, argues that Apple's claimed "reasoning collapse" stems from fatal flaws in the experimental design [12][13].
- Critics point out that puzzles like Tower of Hanoi require exponentially more steps as the number of disks grows, so a full move-by-move answer exceeds the models' context windows and output-token limits, causing capable models to be scored as failures (a rough token-budget sketch follows this list) [15][16][18].
- The rebuttal also notes that some of Apple's test instances were mathematically unsolvable, which invalidates any assessment of model performance on those questions (a brute-force solvability check is sketched below) [20][21][22].
- In one experiment, when models were asked to output a program that solves Tower of Hanoi rather than to spell out every move, they produced correct solutions, indicating that the models possess the necessary algorithm but are tripped up by the sheer length of the required output (see the recursive sketch below) [23][24][25].
- Finally, the absence of human performance baselines in Apple's evaluation casts doubt on declaring the models' performance degradation a fundamental flaw in reasoning [26][27].
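To make the token-budget objection concrete, here is a minimal sketch of the arithmetic. Only the 2^n − 1 move count is a mathematical fact; the per-move token cost and the output limit are illustrative assumptions, not figures from either paper.

```python
# Sketch of the rebuttal's token-budget argument: Tower of Hanoi with n disks
# needs 2**n - 1 moves, so a full move-by-move transcript quickly exceeds a
# model's output budget regardless of whether it "knows" the algorithm.

TOKENS_PER_MOVE = 10   # assumed cost of printing one move, e.g. "move disk 3 from A to C"
OUTPUT_LIMIT = 64_000  # assumed output-token budget of a reasoning model

for n in range(5, 21, 5):
    moves = 2**n - 1                  # minimum number of moves for n disks
    tokens = moves * TOKENS_PER_MOVE
    verdict = "fits" if tokens <= OUTPUT_LIMIT else "exceeds limit"
    print(f"{n:>2} disks: {moves:>9,} moves ~ {tokens:>10,} tokens ({verdict})")
```

Under these assumptions the transcript stops fitting somewhere between 10 and 15 disks, which is consistent with accuracy collapsing at a fixed depth rather than with reasoning ability running out.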
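The unsolvability claim can also be checked by brute force. The sketch below assumes the jealous-husbands-style constraint that Apple's River Crossing task is reported to use (an actor may never be in a group with another pair's agent unless their own agent is present); the pair count N = 6 and 3-seat boat match the configuration the rebuttal singles out as impossible. An exhaustive breadth-first search over all states should then print False, i.e. no solution exists to find.

```python
# Brute-force solvability check for a River Crossing instance: N actor/agent
# pairs, a boat with BOAT seats, and the safety rule that an actor may not be
# with another pair's agent unless their own agent is present.

from collections import deque
from itertools import combinations

N, BOAT = 6, 3
PEOPLE = [(role, i) for i in range(N) for role in ("actor", "agent")]

def safe(group):
    """True if no actor is exposed to a foreign agent without their own agent."""
    group = set(group)
    for i in range(N):
        if ("actor", i) in group and ("agent", i) not in group:
            if any(("agent", j) in group for j in range(N) if j != i):
                return False
    return True

def solvable():
    start = (frozenset(PEOPLE), 0)  # everyone on the left bank, boat on the left
    seen, queue = {start}, deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:                 # left bank empty: everyone has crossed
            return True
        here = left if boat == 0 else set(PEOPLE) - left
        for k in range(1, BOAT + 1):
            for crew in combinations(here, k):
                new_left = left - set(crew) if boat == 0 else left | set(crew)
                new_right = set(PEOPLE) - new_left
                # the boat crew and both banks must all satisfy the constraint
                if safe(crew) and safe(new_left) and safe(new_right):
                    state = (frozenset(new_left), 1 - boat)
                    if state not in seen:
                        seen.add(state)
                        queue.append(state)
    return False

print(solvable())  # expected False: this instance cannot be solved by anyone
```

If the search space is exhausted without reaching the goal, scoring a model zero on this instance measures the puzzle, not the model.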
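For the final experiment, the rebuttal's point is that a compact generator program is valid evidence of knowing the algorithm even when the full move list would never fit in the output window. Below is a minimal sketch of such a program (an illustration of the idea, not the models' verbatim output).

```python
# The classic recursive Tower of Hanoi solution: a few lines of code that can
# generate the entire optimal move sequence, in contrast to writing out all
# 2**n - 1 moves token by token.

def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Return the optimal list of (from_peg, to_peg) moves for n disks."""
    if moves is None:
        moves = []
    if n == 1:
        moves.append((src, dst))
    else:
        hanoi(n - 1, src, dst, aux, moves)  # park the top n-1 disks on aux
        moves.append((src, dst))            # move the largest disk to dst
        hanoi(n - 1, aux, src, dst, moves)  # stack the n-1 disks on top of it
    return moves

print(len(hanoi(15)))  # 32767 == 2**15 - 1 moves, far too many to list by hand
```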