Boosting the Internal Transparency of Large Models: Efficient Monitoring and Spontaneous Safety Enhancement Without External Modules | Shanghai AI Lab & SJTU
量子位· 2025-06-23 04:45
Core Insights
- The article discusses the challenges of AI safety for large language models (LLMs) and introduces TELLME, a new method that enhances internal transparency without relying on external monitoring modules [1][2][26]

Group 1: Current Challenges in AI Safety
- Concerns about the potential risks of LLMs have grown alongside their increasing capabilities [1]
- Existing external monitoring methods are unreliable and lack adaptability, leading to unstable monitoring outcomes [5][6]
- Reliance on "black box" external detectors leaves monitoring results with low interpretability and trustworthiness [5]

Group 2: TELLME Methodology
- TELLME employs a technique called "representation decoupling" to enhance the internal transparency of LLMs [2]
- The core idea is to clearly separate the internal representations of safe and unsafe behaviors, enabling more reliable monitoring [3]
- TELLME uses contrastive learning to drive the separation, grouping representations of similar risks while pushing dissimilar ones apart (see the sketch after this summary) [7]

Group 3: Experimental Validation
- Experiments show significant improvements in transparency and monitoring across scenarios, with clear clustering of different risk behaviors [10][11]
- The method preserves the model's general capabilities while enhancing safety, validating the dual constraints designed into TELLME [12]
- Monitoring accuracy increased by 22.3% over the original model [14]

Group 4: Broader Implications
- TELLME shifts the paradigm from relying on external monitors to improving the model's own monitorability, yielding higher-precision risk identification [26][27]
- The method shows potential for scalable oversight: as model capabilities grow, so does the effectiveness of TELLME's monitoring [28]
- The approach also yields spontaneous improvements in output safety, pointing to a distinctive mechanism for enhancing model safety [23][28]
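The contrastive "representation decoupling" above is described only at a high level. As a rough illustration, the sketch below shows one way such an objective could look in PyTorch: hidden states that share a risk label are pulled together, and those with different labels are pushed beyond a margin. The function name, the margin hinge, and the batch setup are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def representation_separation_loss(hidden_states, risk_labels, margin=1.0):
    """Contrastive objective over a batch of hidden representations.

    Pulls together representations that share a risk label and pushes
    apart those that differ, so that safe and unsafe behaviors occupy
    distinct regions of representation space. Illustrative only; not
    TELLME's actual loss.
    """
    z = F.normalize(hidden_states, dim=-1)   # (batch, dim), unit norm
    dist = torch.cdist(z, z)                 # (batch, batch) pairwise L2
    same = (risk_labels.unsqueeze(0) == risk_labels.unsqueeze(1)).float()

    pull = same * dist.pow(2)                          # same label: shrink
    push = (1 - same) * F.relu(margin - dist).pow(2)   # diff label: widen

    mask = 1 - torch.eye(len(z), device=z.device)      # drop self-pairs
    return ((pull + push) * mask).sum() / mask.sum()

# Toy usage: 4 hidden states of dim 8, labels 0 = safe, 1 = unsafe.
h = torch.randn(4, 8, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1])
loss = representation_separation_loss(h, labels)
loss.backward()   # gradients would flow into the model producing h
```

Once representations are separated this way, a lightweight probe over them can serve as the monitor, which is consistent with the summary's claim that decoupled representations make monitoring more reliable.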
AI Coding and the Jelly Sandwich Problem: The Real Bottleneck Isn't Prompt Engineering
36氪· 2025-05-07 23:08
Shenyiju (神译局) is 36Kr's in-house translation team, covering technology, business, careers, lifestyle, and related fields, with a focus on introducing new technologies, ideas, and trends from abroad. Editor's note: while most people are busy obsessing over prompt tricks, a jelly-sandwich exercise from a Harvard classroom reveals the secret of human-machine collaboration: the real bottleneck is not prompt engineering but the ability to communicate clearly. This article is a translation.

Over the past year I have been fully immersed in the "AI arena", building products at lightning speed with tools like Claude Code and Cursor and watching the field change by the day. In the last six months I used these tools to build:

Betsee.xyz: an aggregator for prediction markets driven by tweets
TellMel.ai: an empathetic personal-biography assistant for sharing life stories and wisdom
GetMaxHelp.com: an AI voice-powered family tech-support hotline
YipYap.xyz: a community chat app organized around topic threads

Even my son got in on it, using tools like Lovable, Replit, and Bolt to build a Brawl Stars-style typing game. The whole experience has been energizing and eye-opening. Six months ago I barely dared let AI do autocomplete; today I can hardly program without it.

Yet despite all this progress, I keep running into the same problem, one that takes me back to the very first computer class of my life.

The Jelly Sandwich Problem

The second person said, "Put the bread down." Margot hurled the lump of dough straight at ...