AI Safety

Search documents
The Great AI Safety Balancing Act | Yobie Benjamin | TEDxPaloAltoSalon
TEDx Talks· 2025-07-14 16:47
[Music] Good afternoon. My name is Yobi Benjamin. I am an immigrant and I'm an American.And before I start, I want to thank a few people. Uh first of all, I want to thank my grandmother who raised me who despite extreme poverty raised me to be the person that I am today. I want to also recognize and thank my wife and my children who continue to inspire me today.My wife Roxan is here and my son Greg. Thank you very much for inspiring me every day. Um I began my career in technology in a small company called ...
X @Anthropic
Anthropic· 2025-06-26 13:56
If you want to work with us and help shape how we keep Claude safe for people, our Safeguards team is hiring. https://t.co/UNtALvqMKh ...
提升大模型内在透明度:无需外部模块实现高效监控与自发安全增强|上海AI Lab & 上交
量子位· 2025-06-23 04:45
Core Insights - The article discusses the challenges of AI safety related to large language models (LLMs) and introduces TELLME, a new method aimed at enhancing internal transparency without relying on external monitoring modules [1][2][26]. Group 1: Current Challenges in AI Safety - Concerns about the potential risks associated with LLMs have arisen due to their increasing capabilities [1]. - Existing external monitoring methods are criticized for being unreliable and lacking adaptability, leading to unstable monitoring outcomes [5][6]. - The reliance on "black box" external detectors results in low interpretability and trustworthiness of monitoring results [5]. Group 2: TELLME Methodology - TELLME employs a technique called "representation decoupling" to enhance the internal transparency of LLMs [2]. - The core idea is to clearly separate the internal representations of safe and unsafe behaviors, facilitating more reliable monitoring [3]. - TELLME utilizes contrastive learning to drive the separation of representations, ensuring that similar risks are grouped while dissimilar ones are distanced [7]. Group 3: Experimental Validation - Experiments demonstrate significant improvements in transparency and monitoring capabilities across various scenarios, with clear clustering of different risk behaviors [10][11]. - The method maintains the general capabilities of the model while enhancing safety, proving the effectiveness of the dual constraints designed in TELLME [12]. - Monitoring accuracy increased by 22.3% compared to the original model, showcasing the method's effectiveness [14]. Group 4: Broader Implications - TELLME represents a shift from external monitoring reliance to enhancing the model's own monitorability, leading to higher precision in risk identification [26][27]. - The method shows potential for scalable oversight, suggesting that as model capabilities grow, so too will the effectiveness of TELLME's monitoring [28]. - The approach leads to spontaneous improvements in output safety, indicating a unique mechanism for enhancing model safety [23][28].
How to Build Trustworthy AI — Allie Howe
AI Engineer· 2025-06-16 20:29
Core Concept - Trustworthy AI is defined as the combination of AI Security and AI Safety, crucial for AI systems [1] Key Strategies - Building trustworthy AI requires product and engineering teams to collaborate on AI that is aligned, explainable, and secure [1] - MLSecOps, AI Red Teaming, and AI Runtime Security are three focus areas that contribute to achieving both AI Security and AI Safety [1] Resources for Implementation - Modelscan (https://github.com/protectai/modelscan) is a resource for MLSecOps [1] - PyRIT (https://azure.github.io/PyRIT/) and Microsoft's AI Red Teaming Lessons eBook (https://ashy-coast-00aeb501e.6.azurestaticapps.net/MS_AIRT_Lessons_eBook.pdf) are resources for AI Red Teaming [1] - Pillar Security (https://www.pillar.security/solutionsai-detection) and Noma Security (https://noma.security/) offer resources for AI Runtime Security [1] Demonstrating Trust - Vanta (https://www.vanta.com/collection/trust/what-is-a-trust-center) provides resources for showcasing Trustworthy AI to customers and prospects [1]
图灵奖得主Bengio再创业:启动资金就筹集了3000万美元
量子位· 2025-06-04 07:04
西风 发自 凹非寺 量子位 | 公众号 QbitAI 目前LawZero已通过多家慈善捐赠方筹集到了 3000万美元启动资金 。 具体来说,LawZero要做" 设计 即安 全 (safe-by-design)"的AI系统,要"将安全性置于商业利益之上"。 所要做的AI系统非Agent形态,而可以监督Agent: 它 以理解学习世 界为核心目标, 而非在世界中采取行动 ,通过透明化外部推理,对问题提供可验证的真实答案,"可用于加速科学发现、为 Agent型AI系统提供监督,并深化大家对AI风险及其规避方法的理解"。 Bengio表示,当前AI系统已显现出自我保护和欺骗行为的迹象,随着其能力和自主性的提升,这种趋势只会加速,LawZero是他们针对这些 挑战所给出的建设性回应。 经常有人问我,对AI的未来是乐观还是悲观?我的回答始终是:It doesn't matter (无关紧要) 。 唯一重要的是,我们每个人都能采取行动,推动AI向更好的方向发展。 刚刚,深度学习三巨头之一、图灵奖得主 Yoshua Bengio 官宣再次创业 —— 成立 非营利组织LawZero ,要构建下一代AI系统,而且 明确不做Ag ...