Mathematical Reasoning Ability
From "Memorized Problem-Solving" to "Deep Reasoning": HKUST Releases UGMathBench, the First Dynamic Benchmark for Undergraduate Mathematics
AI科技大本营 · 2025-06-09 09:41
Mathematical reasoning is a key indicator of a model's intelligence and calls for comprehensive, fair evaluation. However, existing math benchmarks such as GSM8K and MATH are widely criticized for limited coverage and susceptibility to data contamination: they either lack broad coverage of undergraduate-level mathematics or may already be compromised by test-set leakage. To fill these gaps, a research team from the Hong Kong University of Science and Technology recently presented UGMathBench at ICLR 2025, the first diverse, dynamic benchmark for undergraduate mathematics, designed to evaluate LLM reasoning across the full range of undergraduate math topics. By providing dynamic, varied evaluation tooling, it brings mathematical-reasoning evaluation into the era of dynamic contamination control, marking a shift in how LLM math ability is assessed, from "shallow problem-solving" toward "deep understanding". Paper: https://arxiv.org/pdf/2501.13766
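The "dynamic" aspect refers to regenerating problem variants so that a memorized answer key loses its value from one benchmark release to the next. As a rough illustration of that idea (not the authors' actual pipeline), the sketch below shows how randomized versions of a templated problem could be produced; the TEMPLATE format, the render_variant helper, and the parameter ranges are assumptions made purely for illustration.

```python
import random

# Hypothetical problem template: the numeric parameters are placeholders that get
# re-sampled for each released version, so a model that memorized one version's
# answers gains nothing on the next. This mirrors the idea of "dynamic
# contamination control" described above, not UGMathBench's actual code.
TEMPLATE = {
    "statement": "Compute the definite integral of {a}*x^2 + {b}*x from 0 to {c}.",
    "params": {"a": (1, 9), "b": (1, 9), "c": (1, 5)},
}

def answer(a: int, b: int, c: int) -> float:
    # Closed-form antiderivative evaluated at the bounds: a*c^3/3 + b*c^2/2.
    return a * c**3 / 3 + b * c**2 / 2

def render_variant(template: dict, seed: int) -> dict:
    """Sample one randomized version of the templated problem."""
    rng = random.Random(seed)
    values = {k: rng.randint(lo, hi) for k, (lo, hi) in template["params"].items()}
    return {
        "problem": template["statement"].format(**values),
        "answer": answer(**values),
    }

if __name__ == "__main__":
    # Three randomized versions of the same underlying problem.
    for version in range(3):
        print(render_variant(TEMPLATE, seed=version))
```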
The Smarter the AI, the Less It Obeys! New Study: the Strongest Reasoning Model Follows Only 50% of Instructions
量子位 · 2025-05-24 04:38
Core Viewpoint
- The research reveals a trade-off between reasoning ability and instruction following in large AI models: models that excel at complex reasoning tend to disregard user instructions more frequently [1][6][21].

Group 1: Research Findings
- The study introduces MathIF, a new benchmark designed to evaluate how well AI models adhere to user instructions in mathematical reasoning tasks [3][4].
- The evaluation covered 23 mainstream large models and showed that those with stronger mathematical reasoning often struggle to comply with user instructions [6][7].
- The best-performing model, Qwen3-14B, followed only about half of the given instructions [6].

Group 2: Instruction-Following Metrics
- MathIF employs hard accuracy (HAcc) and soft accuracy (SAcc) to measure compliance: HAcc requires every instruction attached to a prompt to be satisfied, while SAcc reflects the average fraction of instructions satisfied per prompt [4][6]; a minimal sketch of both metrics appears after this summary.
- The results indicate that larger models do not necessarily follow instructions better, and some smaller models outperform them in this regard [6][7].

Group 3: Reasons for Non-Compliance
- The research identifies two main reasons for the observed non-compliance:
  1. Reasoning-oriented training methods, such as supervised fine-tuning (SFT) and reinforcement learning (RL), enhance reasoning skills but reduce sensitivity to specific instructions [10][21].
  2. Longer reasoning chains lead to lower compliance, as extended reasoning can distract models from the constraints they were given [13][18].

Group 4: Potential Solutions
- A simple method to improve adherence is to repeat the instruction before giving the final answer; this raises compliance but may slightly reduce answer accuracy [19][21].
- Future work aims to build models that balance deep reasoning with strict adherence to instructions, resolving the trade-off between being "smart" and "obedient" [22].
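To make the two compliance metrics concrete, here is a minimal sketch of how hard accuracy and soft accuracy could be computed from per-instruction pass/fail judgments. The constraint_results structure (a list of boolean checks per prompt) and the toy numbers are assumptions for illustration; MathIF's actual constraint checkers are not reproduced here.

```python
from typing import List

def hard_accuracy(constraint_results: List[List[bool]]) -> float:
    """HAcc: fraction of prompts on which *every* attached instruction is satisfied."""
    return sum(all(checks) for checks in constraint_results) / len(constraint_results)

def soft_accuracy(constraint_results: List[List[bool]]) -> float:
    """SAcc: average, over prompts, of the fraction of instructions satisfied."""
    return sum(sum(checks) / len(checks) for checks in constraint_results) / len(constraint_results)

if __name__ == "__main__":
    # Toy example: three prompts, each carrying two or three formatting constraints.
    results = [
        [True, True],          # both constraints met
        [True, False, False],  # one of three met
        [False, True],         # one of two met
    ]
    print(f"HAcc = {hard_accuracy(results):.2f}")  # 0.33
    print(f"SAcc = {soft_accuracy(results):.2f}")  # 0.61
```

Under this reading, HAcc is the stricter metric, which is why a model can score reasonably on SAcc while still failing to fully follow roughly half of the prompts, as reported for Qwen3-14B above.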