Workflow
Reinforcement Learning with Verifiable Rewards
icon
Search documents
只因一个“:”,大模型全军覆没
自动驾驶之心· 2025-07-17 12:08
Core Insights - The article discusses a significant vulnerability in large language models (LLMs) where they can be easily deceived by seemingly innocuous symbols and phrases, leading to false positive rewards in evaluation scenarios [2][13][34]. Group 1: Vulnerability of LLMs - A recent study reveals that LLMs can be tricked by simple tokens like colons and spaces, which should ideally be filtered out [4][22]. - The false positive rate (FPR) for various models is alarming, with GPT-4o showing a FPR of 35% for the symbol ":" and LLaMA3-70B having a FPR between 60%-90% for "Thought process:" [22][24]. - This vulnerability is not limited to English; it is cross-linguistic, affecting models regardless of the language used [23]. Group 2: Research Findings - The research involved testing multiple models, including specialized reward models and general LLMs, across various datasets and prompt formats to assess the prevalence of this "reward model deception" phenomenon [15][17]. - All tested models exhibited susceptibility to triggering false positive responses, indicating a systemic issue within LLMs [21][28]. Group 3: Proposed Solutions - To mitigate the impact of this vulnerability, researchers developed a new "judge" model called Master-RM, which significantly reduces the FPR to nearly zero by using an enhanced training dataset [29][31]. - The Master-RM model demonstrates robust performance across unseen datasets and deceptive attacks, validating its effectiveness as a general-purpose reward model [31][33]. Group 4: Implications for Future Research - The findings highlight the critical need for improved robustness in LLMs and suggest that reinforcement learning from human feedback (RLHF) requires more rigorous adversarial evaluations [35][36]. - The research team, comprising members from Tencent AI Lab, Princeton University, and the University of Virginia, emphasizes the importance of addressing these vulnerabilities in future studies [38][40].