AI Safety
Search documents
X @Anthropic
Anthropic· 2025-06-26 13:56
If you want to work with us and help shape how we keep Claude safe for people, our Safeguards team is hiring. https://t.co/UNtALvqMKh ...
提升大模型内在透明度:无需外部模块实现高效监控与自发安全增强|上海AI Lab & 上交
量子位· 2025-06-23 04:45
Core Insights - The article discusses the challenges of AI safety related to large language models (LLMs) and introduces TELLME, a new method aimed at enhancing internal transparency without relying on external monitoring modules [1][2][26]. Group 1: Current Challenges in AI Safety - Concerns about the potential risks associated with LLMs have arisen due to their increasing capabilities [1]. - Existing external monitoring methods are criticized for being unreliable and lacking adaptability, leading to unstable monitoring outcomes [5][6]. - The reliance on "black box" external detectors results in low interpretability and trustworthiness of monitoring results [5]. Group 2: TELLME Methodology - TELLME employs a technique called "representation decoupling" to enhance the internal transparency of LLMs [2]. - The core idea is to clearly separate the internal representations of safe and unsafe behaviors, facilitating more reliable monitoring [3]. - TELLME utilizes contrastive learning to drive the separation of representations, ensuring that similar risks are grouped while dissimilar ones are distanced [7]. Group 3: Experimental Validation - Experiments demonstrate significant improvements in transparency and monitoring capabilities across various scenarios, with clear clustering of different risk behaviors [10][11]. - The method maintains the general capabilities of the model while enhancing safety, proving the effectiveness of the dual constraints designed in TELLME [12]. - Monitoring accuracy increased by 22.3% compared to the original model, showcasing the method's effectiveness [14]. Group 4: Broader Implications - TELLME represents a shift from external monitoring reliance to enhancing the model's own monitorability, leading to higher precision in risk identification [26][27]. - The method shows potential for scalable oversight, suggesting that as model capabilities grow, so too will the effectiveness of TELLME's monitoring [28]. - The approach leads to spontaneous improvements in output safety, indicating a unique mechanism for enhancing model safety [23][28].
How to Build Trustworthy AI — Allie Howe
AI Engineer· 2025-06-16 20:29
Core Concept - Trustworthy AI is defined as the combination of AI Security and AI Safety, crucial for AI systems [1] Key Strategies - Building trustworthy AI requires product and engineering teams to collaborate on AI that is aligned, explainable, and secure [1] - MLSecOps, AI Red Teaming, and AI Runtime Security are three focus areas that contribute to achieving both AI Security and AI Safety [1] Resources for Implementation - Modelscan (https://github.com/protectai/modelscan) is a resource for MLSecOps [1] - PyRIT (https://azure.github.io/PyRIT/) and Microsoft's AI Red Teaming Lessons eBook (https://ashy-coast-00aeb501e.6.azurestaticapps.net/MS_AIRT_Lessons_eBook.pdf) are resources for AI Red Teaming [1] - Pillar Security (https://www.pillar.security/solutionsai-detection) and Noma Security (https://noma.security/) offer resources for AI Runtime Security [1] Demonstrating Trust - Vanta (https://www.vanta.com/collection/trust/what-is-a-trust-center) provides resources for showcasing Trustworthy AI to customers and prospects [1]
图灵奖得主Bengio再创业:启动资金就筹集了3000万美元
量子位· 2025-06-04 07:04
Core Viewpoint - Yoshua Bengio, a Turing Award winner and one of the deep learning giants, has announced the establishment of a nonprofit organization called LawZero, aimed at building the next generation of AI systems with a focus on safety and transparency, explicitly avoiding the development of agent-based AI systems [1][3][4]. Funding and Support - LawZero has successfully raised $30 million in initial funding through various charitable donors [2][9]. - Initial supporters include notable organizations such as the Future of Life Institute, Open Philanthropy, and the Silicon Valley Community Foundation [9][10]. Mission and Objectives - LawZero aims to create AI systems that prioritize safety over commercial interests, adopting a "safe-by-design" approach [3]. - The organization focuses on understanding the world rather than taking actions within it, providing verifiable answers to questions and enhancing the understanding of AI risks [4][21]. Scientific Direction - The core scientific direction of LawZero is based on a new research methodology called "Scientist AI," which emphasizes observation and explanation rather than action [17][21]. - The system consists of two main components: a world model that generates causal theories from observed data and a reasoning engine that provides probabilistic explanations [22][23]. Applications of Scientist AI - Scientist AI is designed to serve three primary functions: 1. As a safety barrier against dangerous AI, preventing catastrophic outcomes through dual verification mechanisms [24]. 2. As a trustworthy tool for accelerating scientific discovery, particularly in fields like biology and materials science, while avoiding risks associated with traditional AI [25]. 3. As foundational infrastructure for the safe development of advanced AI, establishing audit-able safety boundaries to mitigate risks from deceptive agents [26]. Leadership and Team - Bengio serves as the chairman and scientific director of LawZero, leading a team of over 15 top researchers [12][15]. - The organization is incubated by the Mila-Quebec AI Institute, which has become an operational partner [8]. Historical Context - Bengio previously co-founded Element AI, which focused on AI strategy consulting and raised approximately $260 million before being sold for $230 million in 2020 [28][29]. - His new venture, LawZero, reflects a shift in focus towards addressing AI safety risks, a concern that has grown in light of recent advancements in AI technology [32][33]. Public Perception - There is a cautious outlook from the public regarding LawZero, with some expressing concerns about the potential for AI to undermine human agency [34].