Anthropic、Thinking Machines Lab论文曝光：30万次压力测试揭示AI规范缺陷

Core Insights - The article discusses the limitations of current model specifications for large language models (LLMs), highlighting internal conflicts and insufficient granularity in ethical guidelines [1][5] - A systematic stress-testing methodology is proposed to identify and characterize contradictions and ambiguities in existing model specifications [1][3] Group 1: Model Specifications and Ethical Guidelines - Current LLMs are increasingly constrained by model specifications that define behavioral and ethical boundaries, forming the basis of Constitutional AI and Deliberate Alignment [1] - Existing specifications face two main issues: internal conflicts among principles and a lack of granularity needed for consistent behavioral guidance [1][5] - Researchers from Anthropic and Thinking Machines Lab have developed a detailed taxonomy of 3,307 values exhibited by the Claude model, surpassing the coverage and detail of mainstream model specifications [3][4] Group 2: Methodology and Testing - The research team generated over 300,000 query scenarios that force models to make clear trade-offs between values, revealing potential conflicts in model specifications [3][5] - The methodology includes value bias techniques that tripled the number of queries, resulting in a dataset of over 410,000 effective scenarios after filtering out incomplete responses [9][10] - The analysis of 12 leading LLMs, including those from Anthropic, OpenAI, Google, and xAI, showed significant discrepancies in responses across various scenarios [4][12] Group 3: Findings and Analysis - In the testing, over 220,000 scenarios exhibited significant divergence between at least two models, while more than 70,000 scenarios showed clear behavioral differences across most models [7][11] - The study found that higher divergence in model responses correlates with potential issues in model specifications, especially when multiple models following the same guidelines show inconsistencies [13][20] - A two-stage evaluation method was employed to quantify the degree of value bias in model responses, enhancing measurement consistency [14][15] Group 4: Compliance and Conformity Checks - The evaluation of OpenAI models revealed frequent non-compliance with their own specifications, indicating underlying issues within the specifications themselves [17][18] - The study utilized multiple leading models as reviewers to assess compliance, finding a strong correlation between high divergence and increased rates of non-compliance [20][22] - The analysis highlighted fundamental contradictions and interpretive ambiguities in model responses, demonstrating the need for clearer guidelines [25][27][32]