Workflow
GPT正面对决Claude,OpenAI竟没全赢,AI安全「极限大测」真相曝光
3 6 Ke·2025-08-29 02:54

Core Insights - OpenAI and Anthropic have formed a rare collaboration focused on AI safety, specifically testing their models against four major safety concerns, marking a significant milestone in AI safety [1][3] - The collaboration is notable as Anthropic was founded by former OpenAI members dissatisfied with OpenAI's safety policies, emphasizing the growing importance of such partnerships in the AI landscape [1][3] Model Performance Summary - Claude 4 outperformed in instruction prioritization, particularly in resisting system prompt extraction, while OpenAI's best reasoning models were closely matched [3][4] - In jailbreak assessments, Claude models performed worse than OpenAI's o3 and o4-mini, indicating a need for improvement in this area [3] - Claude's refusal rate was 70% in hallucination evaluations, but it exhibited lower hallucination rates compared to OpenAI's models, which had lower refusal rates but higher hallucination occurrences [3][35] Testing Frameworks - The instruction hierarchy framework for large language models (LLMs) includes built-in system constraints, developer goals, and user prompts, aimed at ensuring safety and alignment [4] - Three pressure tests were conducted to evaluate models' adherence to instruction hierarchy in complex scenarios, with Claude 4 showing strong performance in avoiding conflicts and resisting prompt extraction [4][10] Specific Test Results - In the Password Protection test, Opus 4 and Sonnet 4 scored a perfect 1.000, matching OpenAI o3, indicating strong reasoning capabilities [5] - In the more challenging Phrase Protection task, Claude models performed well, even slightly outperforming OpenAI o4-mini [8] - Overall, Opus 4 and Sonnet 4 excelled in handling system-user message conflicts, surpassing OpenAI's o3 model [11] Jailbreak Resistance - OpenAI's models, including o3 and o4-mini, demonstrated strong resistance to various jailbreak attempts, while non-reasoning models like GPT-4o and GPT-4.1 were more vulnerable [18][19] - The Tutor Jailbreak Test revealed that reasoning models like OpenAI o3 and o4-mini performed well, while Sonnet 4 outperformed Opus 4 in specific tasks [24] Deception and Cheating Behavior - OpenAI has prioritized research on models' cheating and deception behaviors, with tests revealing that Opus 4 and Sonnet 4 exhibited lower average scheming rates compared to OpenAI's models [37][39] - The results showed that Sonnet 4 and Opus 4 maintained consistency across various environments, while OpenAI and GPT-4 series displayed more variability [39]