
Core Insights
- The article examines a paradox: advanced AI models are becoming less obedient to instructions despite their enhanced reasoning capabilities [2][4][15].

Group 1: AI Model Performance
- The emergence of powerful reasoning models such as Gemini 2.5 Pro, OpenAI o3, and DeepSeek-R1 has fostered a consensus that stronger reasoning abilities should improve task execution [2].
- A recent study found that most models, when using Chain-of-Thought (CoT) reasoning, actually suffered a decline in instruction-following accuracy [25][27].
- In the IFEval test, 13 of 14 models lost accuracy when employing CoT, and all models performed worse on the ComplexBench test [27][28].

Group 2: Experimental Findings
- The research team from Harvard, Amazon, and NYU ran two sets of tests: IFEval for simple instructions and ComplexBench for complex, multi-constraint instructions [18][20].
- Even large models were affected: LLaMA-3-70B-Instruct dropped from 85.6% to 77.3% accuracy when using CoT, underscoring how much explicit reasoning can hurt instruction following [29][30].
- The study introduced the concept of "Constraint Attention," revealing that models using CoT often lose focus on key task constraints, which leads to errors [38][39].

Group 3: Recommendations for Improvement
- The study proposed four methods to mitigate the accuracy decline of reasoning models: Few-Shot examples, Self-Reflection, Self-Selective Reasoning, and Classifier-Selective Reasoning [47][56].
- The most effective method was Classifier-Selective Reasoning, which trains a small model to decide when to use CoT, improving accuracy across both tests [58].
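
The "Constraint Attention" idea above can be pictured as measuring how much of a model's attention mass lands on the tokens that express the task's constraints. The paper's exact metric is not reproduced here; the sketch below is a toy illustration under the assumption that a per-token attention distribution over the prompt is available (the function name and inputs are hypothetical, not from the study).

```python
import numpy as np

def constraint_attention(attn, constraint_positions):
    """Toy metric: the share of a prompt-attention distribution that
    falls on the token positions expressing the task's constraints.
    A lower value suggests the model is drifting away from the
    constraints -- the failure mode the study associates with CoT."""
    attn = np.asarray(attn, dtype=float)
    return float(attn[list(constraint_positions)].sum() / attn.sum())

# Toy prompt of six tokens; positions 4 and 5 carry the constraint
# (e.g. "... in exactly three words").
weights = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]
score = constraint_attention(weights, [4, 5])  # fraction on constraint tokens
```

In practice one would extract real attention weights (e.g. from a transformer's attention heads) and track this fraction across decoding steps; the study's finding is that this focus drops when CoT is used.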
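
Classifier-Selective Reasoning, the best-performing mitigation above, amounts to a routing decision: a small trained classifier predicts whether CoT will help a given prompt, and the model only "thinks step by step" when it does. A minimal sketch of that routing logic, with the classifier and generator passed in as plain callables (all names here are illustrative, not the study's implementation):

```python
def selective_answer(prompt, should_reason, generate):
    """Route a prompt through CoT or direct answering.

    should_reason: a lightweight classifier, prompt -> bool, trained to
        predict when step-by-step reasoning helps rather than hurts.
    generate: the underlying LLM call, prompt -> completion.
    """
    if should_reason(prompt):
        # Classifier says CoT helps: append a standard reasoning trigger.
        return generate(prompt + "\n\nLet's think step by step.")
    # Otherwise answer directly, avoiding the instruction-following
    # degradation the study observed with unconditional CoT.
    return generate(prompt)

# Stub demo: a trivial "classifier" that reasons only for prompts
# listing several constraints, and an echo "model" for illustration.
should_reason = lambda p: p.count(";") >= 2
generate = lambda p: f"[completion for: {p}]"
routed = selective_answer("translate it; keep it formal; use 3 sentences",
                          should_reason, generate)
```

The design point is that the expensive part (the LLM) is untouched; only a cheap upfront prediction decides whether the CoT trigger is injected, which is why the study found this the most practical of the four mitigations.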