Core Viewpoint
- The article discusses a paradox of advanced AI models: as their reasoning capabilities increase, their ability to follow instructions accurately declines, as evidenced by recent research findings [1][3][10].

Group 1: Research Findings
- A study titled "When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs" reveals that when models engage in reasoning, they often fail to adhere to the given instructions [2][3].
- The research team from Harvard, Amazon, and NYU tested 15 models and found that 13 out of 14 models showed decreased accuracy when using Chain-of-Thought (CoT) reasoning on simple tasks [4][6].
- On complex tasks, every model tested exhibited a performance decline when employing CoT reasoning [4][6].

Group 2: Performance Metrics
- In the IFEval test, models such as GPT-4o-mini and Claude-3.5 suffered significant accuracy drops when using CoT, with GPT-4o-mini falling from 82.6% to 76.9% [5]. (A minimal with/without-CoT comparison is sketched after this summary.)
- Results on ComplexBench likewise showed a consistent decline across all models when CoT was applied, underscoring the detrimental impact of reasoning on task execution [4][6].

Group 3: Observed Behavior Changes
- While appearing smarter, the models became more prone to disregarding explicit instructions, often modifying or adding information that was not requested [9][10].
- This behavior is attributed to a drop in "Constraint Attention": when reasoning is involved, models pay less attention to the critical constraints of the task [10]. (A simplified attention-measurement sketch also follows this summary.)

Group 4: Proposed Solutions
- The article outlines four methods to mitigate the decline in instruction-following accuracy:
  1. Few-Shot Learning: providing examples to the model, though this has limited effectiveness due to input length and bias [11][12].
  2. Self-Reflection: letting models review their own outputs, which works well for larger models but poorly for smaller ones [13] (sketched below).
  3. Self-Selective Reasoning: letting models decide for themselves when reasoning is necessary, which yields high recall but low precision [14].
  4. Classifier-Selective Reasoning: training a smaller model to decide when to use CoT, which has shown significant accuracy improvements [15][17] (sketched below).

Group 5: Insights on Intelligence
- The article emphasizes that true intelligence lies in the ability to focus attention on the critical aspects of a task rather than processing every detail [20][22].
- It suggests that AI should be designed to prioritize the key elements of a task, much as humans manage their focus during critical moments [26][27].
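To make the with/without-CoT comparison in Group 2 concrete, here is a minimal sketch of how such a check could be run: the same instruction is sent once directly and once with a "think step by step" cue, and a programmatic verifier (in the spirit of IFEval's verifiable constraints) scores adherence. It assumes an OpenAI-compatible chat endpoint; the model name, prompts, and constraint checker are illustrative, not the paper's actual harness.

```python
# Minimal sketch (not the paper's evaluation harness): compare instruction-
# following with and without an explicit chain-of-thought cue on a constraint
# that can be verified programmatically.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTION = (
    "Summarize the benefits of unit testing. "
    "Constraint: answer in exactly three bullet points, all lowercase, no digits."
)

def ask(prompt: str, cot: bool) -> str:
    """Query the model once, optionally prepending a chain-of-thought cue."""
    if cot:
        prompt = "Think step by step before answering.\n\n" + prompt
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any chat model works
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def follows_constraints(answer: str) -> bool:
    """Programmatic check in the spirit of IFEval's verifiable constraints."""
    bullets = [l for l in answer.splitlines() if l.strip().startswith(("-", "*"))]
    return (
        len(bullets) == 3
        and answer == answer.lower()
        and not any(ch.isdigit() for ch in answer)
    )

if __name__ == "__main__":
    for use_cot in (False, True):
        ok = follows_constraints(ask(INSTRUCTION, cot=use_cot))
        print(f"CoT={use_cot}: constraints satisfied = {ok}")
```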
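The "Constraint Attention" drop mentioned in Group 3 can be illustrated roughly as follows: load a small open model, locate the constraint tokens in the prompt, and measure what share of the final position's attention mass lands on them. This is a simplified reading of the idea, not the paper's exact metric; the model name and the naive span-location logic are assumptions.

```python
# Rough sketch of the "constraint attention" idea: how much attention does the
# model place on the constraint portion of the prompt? Simplified illustration
# with a small open model, not the paper's formulation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative small model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, attn_implementation="eager")

prompt = "Summarize unit testing. Constraint: answer in all lowercase."
constraint = "Constraint: answer in all lowercase."

ids = tok(prompt, return_tensors="pt").input_ids
# Locate the constraint span by token count (naive, but adequate for a sketch).
c_len = len(tok(constraint, add_special_tokens=False).input_ids)
c_start = ids.shape[1] - c_len

with torch.no_grad():
    out = model(ids, output_attentions=True)

# Average, over layers and heads, the attention that the final position
# (the one predicting the next token) pays to the constraint tokens.
att = torch.stack(out.attentions)      # (layers, batch, heads, query, key)
last_q = att[:, 0, :, -1, :]           # (layers, heads, key)
constraint_mass = last_q[:, :, c_start:].sum(-1).mean().item()
print(f"share of attention on constraint tokens: {constraint_mass:.3f}")
```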
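A hedged sketch of the Self-Reflection mitigation from Group 4: the model drafts an answer, is asked to check the draft against the instruction's explicit constraints, and gets one revision pass if a violation is reported. The prompts, the single-revision budget, and the model name are illustrative assumptions, not the study's protocol.

```python
# Sketch of self-reflection: draft, self-check against the stated constraints,
# and revise at most once if the check reports a violation.
from openai import OpenAI

client = OpenAI()

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def answer_with_reflection(instruction: str) -> str:
    draft = chat(instruction)
    critique = chat(
        "Instruction:\n" + instruction
        + "\n\nDraft answer:\n" + draft
        + "\n\nDoes the draft violate any explicit constraint in the instruction? "
          "Reply 'OK' if not, otherwise describe the violation."
    )
    if critique.strip().upper().startswith("OK"):
        return draft
    return chat(
        "Instruction:\n" + instruction
        + "\n\nDraft answer:\n" + draft
        + "\n\nDetected problem:\n" + critique
        + "\n\nRewrite the answer so it satisfies every constraint exactly."
    )
```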
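Classifier-Selective Reasoning, reported in Group 4 as the most effective fix, can be approximated with a small router: a lightweight classifier trained on instructions labeled by whether CoT actually helped in an offline comparison, then used at inference time to decide when to prepend a reasoning cue. The features, labels, and example data below are invented purely for illustration.

```python
# Sketch of classifier-selective reasoning: a small classifier predicts, per
# instruction, whether chain-of-thought is likely to help, and CoT is applied
# only when it says so. Training data and features are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labels: 1 if CoT improved the constraint-check score when both
# variants were run offline on this instruction, else 0.
instructions = [
    "List three synonyms for 'fast', lowercase only.",
    "Solve: a train travels 60 km in 45 minutes; find its average speed.",
    "Reply with exactly the word YES or NO: is 17 prime?",
    "Plan a 3-step debugging strategy for a flaky integration test.",
]
cot_helped = [0, 1, 0, 1]

router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(instructions, cot_helped)

def should_use_cot(instruction: str) -> bool:
    """Route a new instruction: use CoT only if the classifier predicts a benefit."""
    return bool(router.predict([instruction])[0])

print(should_use_cot("Answer in exactly two lowercase words: capital of France?"))
```

The design choice mirrors the trade-off described in the article: the base model answers directly by default, and the extra reasoning step is spent only where a separate, cheaper model expects it to pay off.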
DeepSeek and its peers are getting smarter, yet less and less obedient.
数字生命卡兹克·2025-05-19 20:14