Do large models break down when facing unsolvable problems? CUHK and Huawei jointly propose the first benchmark for evaluating the reasoning reliability of large models
机器之心·2025-07-16 08:09

Core Viewpoint
- The article discusses the reliability of large language models (LLMs) on reasoning tasks, highlighting the problem of "hallucination", where models generate incorrect or fabricated answers when faced with unsolvable problems [2][4][17].

Group 1: Research Background
- Models such as DeepSeek-R1 have shown impressive performance on reasoning tasks, but they often try to fabricate answers to unsolvable questions, wasting substantial compute and undermining reliability [2][4].
- A new benchmark, ReliableMath, has been introduced to assess the reliability of LLMs on reasoning tasks, with model results continually updated on a public leaderboard [5][12].

Group 2: Reliability Assessment Criteria
- The authors propose evaluation criteria for reasoning-task reliability, categorizing questions as solvable (A) or unsolvable (U) and model responses as successful (S), refused (R), or failed (F); a minimal scoring sketch appears after this summary [7][8].
- The assessment prioritizes precision (the success rate) over prudence (the refusal rate) when judging reliability [8].

Group 3: ReliableMath Dataset
- ReliableMath is the first high-quality collection of unsolvable mathematical problems, constructed by modifying solvable problems into unsolvable ones (illustrated in a sketch below) [11][12].
- The dataset spans multiple difficulty levels, with annotations indicating how hard each unsolvable problem is to identify as such [16].

Group 4: Experimental Analysis
- Experiments show that LLMs struggle to refuse unsolvable problems or to acknowledge that they are unsolvable, which often produces meaningless reasoning chains and hallucinated answers [18][19].
- Prompts that explicitly permit the model to refuse or to declare a problem unsolvable markedly improve reliability on unsolvable questions without hurting performance on solvable ones (see the prompt sketch below) [19][20].
- With these reliable prompts, larger models are generally more reliable than smaller ones, which still leave considerable room for improvement [19].

Group 5: Reliability Alignment
- To improve smaller models, the authors construct a set of unsolvable problems from open-source training datasets, distill successful responses from stronger models, and apply supervised fine-tuning so that the smaller models learn reliable behavior (sketched below) [23].

Group 6: Conclusion and Future Outlook
- The article aims to spur further research on the reliability of new-generation reasoning models, fostering greater trust in AI outputs and making them more useful to people [26].
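A minimal scoring sketch for the Group 2 criteria. The summary gives only the label sets (A/U for questions, S/R/F for responses) and the precision-over-prudence ordering, so the aggregation below is an assumption: precision is taken as the fraction of successful responses and prudence as the fraction of refusals, computed per question type; the exact formulas in the ReliableMath paper may differ.

```python
from collections import Counter


def reliability_scores(records: list[tuple[str, str]]) -> dict[str, float]:
    """Compute assumed precision (success rate) and prudence (refusal
    rate) per question type from (question_type, outcome) pairs, where
    question_type is 'A' (solvable) or 'U' (unsolvable) and outcome is
    'S' (success), 'R' (refusal), or 'F' (failure)."""
    scores = {}
    for qtype in ("A", "U"):
        outcomes = Counter(o for t, o in records if t == qtype)
        total = sum(outcomes.values())
        if total == 0:
            continue  # no questions of this type in the sample
        scores[f"{qtype}_precision"] = outcomes["S"] / total
        scores[f"{qtype}_prudence"] = outcomes["R"] / total
    return scores


if __name__ == "__main__":
    # Toy run: two solvable questions (one solved, one failed) and two
    # unsolvable ones (one correctly identified, one refused).
    demo = [("A", "S"), ("A", "F"), ("U", "S"), ("U", "R")]
    print(reliability_scores(demo))
    # {'A_precision': 0.5, 'A_prudence': 0.0, 'U_precision': 0.5, 'U_prudence': 0.5}
```

Under the precision-over-prudence criterion, a model that identifies an unsolvable problem as unsolvable (S) ranks above one that merely refuses (R), and both rank above one that fabricates an answer (F).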
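A toy illustration of the Group 3 construction idea. The summary says only that solvable problems are modified into unsolvable ones; the two perturbations below (deleting a necessary condition, injecting a contradictory one) are assumed strategies shown for illustration, not the paper's actual procedure.

```python
# A solvable problem: two independent conditions determine x uniquely.
solvable = "x + y = 5 and x - y = 1. Find x."  # x = 3

# Assumed strategy 1: drop a necessary condition -> underdetermined.
missing_condition = "x + y = 5. Find x."

# Assumed strategy 2: add a contradictory condition -> inconsistent.
contradiction = "x + y = 5, x - y = 1, and x = 10. Find x."
```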
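A hypothetical "reliable prompt" wrapper in the spirit of Group 4. The exact instruction wording used in the experiments is not given in this summary, so the instruction text below is illustrative only.

```python
# Assumed instruction text; the paper's actual prompt may be worded differently.
RELIABLE_INSTRUCTION = (
    "Solve the following math problem. If the problem is unsolvable, "
    "for example because a necessary condition is missing or the "
    "conditions contradict each other, say it is unsolvable instead of "
    "guessing. If you cannot decide, you may refuse to answer."
)


def build_reliable_prompt(problem: str) -> str:
    """Prepend the reliability instruction to a raw problem statement."""
    return f"{RELIABLE_INSTRUCTION}\n\nProblem: {problem}"
```

The point of the wrapper is that refusal and "unsolvable" become legal outputs, so the model is no longer forced to fabricate an answer to satisfy the prompt.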
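A sketch of the Group 5 alignment recipe under stated assumptions: `strong_model` and `is_successful` are hypothetical stand-ins for the stronger teacher LLM and the S/R/F judging step, `build_reliable_prompt` is reused from the Group 4 sketch above, and the kept pairs are assumed to feed a standard supervised fine-tuning run on the smaller model.

```python
from typing import Callable


def build_alignment_set(
    problems: list[dict],                        # {"question": str, "solvable": bool}
    strong_model: Callable[[str], str],          # hypothetical teacher: prompt -> response
    is_successful: Callable[[dict, str], bool],  # hypothetical judge for the S outcome
) -> list[dict]:
    """Distill teacher responses and keep only successful ones as SFT targets."""
    sft_pairs = []
    for item in problems:
        # Query the teacher with a refusal-permitting prompt (Group 4 sketch).
        response = strong_model(build_reliable_prompt(item["question"]))
        # Keep the pair only if the teacher succeeded: a correct answer on a
        # solvable problem, or a correct "unsolvable" verdict on an unsolvable one.
        if is_successful(item, response):
            sft_pairs.append({"prompt": item["question"], "completion": response})
    return sft_pairs
```

The filtering step matters: distilling only successful (S) teacher responses keeps fabricated answers out of the fine-tuning data, which is what lets the smaller model learn to decline rather than hallucinate.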