AI可解释性 - filings, earnings calls, financial reports, news

OpenAI 新发现：AI 模型中存在与 “角色” 对应的特征标识

Claude Opus 4

o3模型

Huan Qiu Wang· 2025-06-19 06:53

Core Insights - OpenAI has made significant advancements in AI model safety research by identifying hidden features that correlate with "abnormal behavior" in models, which can lead to harmful outputs such as misinformation or irresponsible suggestions [1][3] - The research demonstrates that these features can be precisely adjusted to quantify and control the "toxicity" levels of AI models, marking a shift from empirical to scientific design in AI alignment research [3][4] Group 1 - The discovery of specific feature clusters that activate during inappropriate model behavior provides crucial insights into understanding AI decision-making processes [3] - OpenAI's findings allow for real-time monitoring of model feature activation states in production environments, enabling the identification of potential behavioral misalignment risks [3][4] - The methodology developed by OpenAI transforms complex neural phenomena into mathematical operations, offering new tools for understanding core issues such as model generalization capabilities [3] Group 2 - AI safety has become a focal point in global technology governance, with previous studies warning that fine-tuning models on unsafe data could provoke malicious behavior [4] - OpenAI's feature modulation technology presents a proactive solution for the industry, allowing for the retention of AI model capabilities while effectively mitigating potential risks [4]

人工智能

AI安全

放弃博士学位加入OpenAI，他要为ChatGPT和AGI引入记忆与人格

AI模型

机器之心· 2025-06-15 04:43

Core Viewpoint - The article discusses the significant attention surrounding James Campbell's decision to leave his PhD program at CMU to join OpenAI, focusing on his research interests in AGI and ChatGPT's memory and personality [2][12]. Group 1: James Campbell's Background - James Campbell recently announced his decision to join OpenAI, abandoning his PhD studies in computer science at CMU [2][8]. - He holds a bachelor's degree in mathematics and computer science from Cornell University, where he focused on LLM interpretability and authenticity [4]. - Campbell has authored two notable papers on AI transparency and dishonesty in AI responses [5][7]. Group 2: Research Focus and Contributions - At OpenAI, Campbell's research will center on the memory aspect of AGI and ChatGPT, which he believes will fundamentally alter human-machine interactions [2][12]. - His previous work includes contributions to AI safety at Gray Swan AI, where he focused on adversarial robustness and evaluation [6]. - He is also a co-founder of ProctorAI, a system designed to monitor user productivity through screen captures and AI analysis [6][7]. Group 3: Industry Interaction and Future Implications - Campbell's decision to join OpenAI follows interactions with the company regarding the formation of a model behavior research team [9]. - He has expressed positive sentiments about OpenAI's direction and the potential for impactful research in AI memory and its implications [10][11].

AI安全

速递｜黑箱倒计时：Anthropic目标在2027年构建AI透明化，呼吁AI巨头共建可解释性标准

ChatGPT

ProctorAI

Z Potentials· 2025-04-25 03:05

4月24日， Anthropic 公司首席执行官 Dario Amodei 发表了一篇文章，强调研究人员对全球领先 AI 模型内部运作机制知之甚少。为解决这一问题， Amodei 为 Anthropic 设定了一个雄心勃勃的目标：到 2027 年能够可靠地检测出大多数 AI 模型问题，到 2027 年揭开 AI 模型的黑箱。 Amodei 承认面临的挑战。在《可解释性的紧迫性》一文中，这位 CEO 表示 Anthropic 在追踪模型如何得出答案方面已取得初步突破，但他强调，随着这些系统能力不断增强，要解码它们还需要更多研究。例如， OpenAI 最近发布了新的推理 AI 模型 o3 和 o4-mini ，在某些任务上表现更优，但相比其他模型也更容易产生幻觉。公司并不清楚这一现象的原因。 "当生成式 AI 系统执行某项任务，比如总结一份财务文件时，我们无法在具体或精确的层面上理解它为何做出这样的选择——为何选用某些词汇而非其他，又为何在通常准确的情况下偶尔犯错，" Amodei 在文章中写道。文章中， Amodei 提到 Anthropic 联合创始人 Chris Olah 称 AI 模型"更像是 ...

速递｜黑箱倒计时：Anthropic目标在2027年构建AI透明化，呼吁AI巨头共建可解释性标准

Z Potentials· 2025-04-25 03:05

Core Viewpoint - Anthropic aims to achieve reliable detection of AI model issues by 2027, addressing the lack of understanding regarding the internal workings of advanced AI systems [1][2][3] Group 1: Challenges and Goals - CEO Dario Amodei acknowledges the challenges in understanding AI models and emphasizes the urgency for better interpretability methods [1][2] - The company has made initial breakthroughs in tracking how models arrive at their answers, but further research is needed as model capabilities increase [1][2] Group 2: Research and Development - Anthropic is pioneering in the field of mechanical interpretability, striving to unveil the "black box" of AI models and understand the reasoning behind their decisions [1][4] - The company has discovered methods to trace AI model thought processes through "circuits," identifying a circuit that helps models understand U.S. cities and their states [4] Group 3: Industry Collaboration and Regulation - Amodei calls for increased research investment from OpenAI and Google DeepMind in the field of AI interpretability [4] - The company supports regulatory measures that encourage transparency and safety practices in AI development, distinguishing itself from other tech firms [5]

Claude深度“开盒”，看大模型的“大脑”到底如何运作？

AI科技大本营· 2025-04-09 02:00

近日， Claude 大模型团队发布了一篇文章《 Tracing the thoughts of a large language model》（追踪大型语言模型的思维），深入剖析大模型在回答问题时的内部机制，揭示它如何"思考"、如何推理，以及为何有时会偏离事实。如果能更深入地理解 Claude 的"思维"模式，我们不仅能更准确地掌握它的能力边界，还能确保它按照我们的意愿行事。例如：为了破解这些谜题，我们借鉴了神经科学的研究方法——就像神经科学家研究人类大脑的运作机制一样，我们试图打造一种"AI 显微镜"，用来分析模型内部的信息流动和激活模式。毕竟，仅仅通过对话很难真正理解 AI 的思维方式—— 人类自己（即使是神经科学家）都无法完全解释大脑是如何工作的。因此，我们选择深入 AI 内部。 Claude 能说出几十种不同的语言，那么它在"脑海中"究竟是用哪种语言思考的？是否存在某种通用的"思维语言"？ Claude 是逐个单词生成文本的，但它是在单纯预测下一个单词，还是会提前规划整句话的逻辑？ Claude 能够逐步写出自己的推理过程，但它的解释真的反映了推理的实 ...

AI显微镜