Google: Research Report on Technical Safety Assurance Approaches for Artificial General Intelligence (AGI)
Omega Future Research Institute · 2025-12-12 13:43

Core Viewpoint
- The DeepMind report argues for a proactive technical approach to AGI safety and security, replacing the traditional "observe-then-mitigate" strategy with a layered defense system built before extreme risks can materialize [1][10].

Group 1: The Evidence Dilemma and Defense Logic
- The report confronts the "evidence dilemma" of planning for future technology risks: definitive proof that a defense measure is necessary often arrives only after a catastrophe has already occurred [2].
- DeepMind states its foundational assumptions: the current deep learning paradigm will continue to drive AI capability gains for the foreseeable future, and there is no clear "human ceiling" on what AI systems can achieve [2].
- The report warns that once AI begins to conduct scientific research itself, technological progress may enter a self-reinforcing "acceleration phase," sharply compressing the window in which human society can identify and respond to new risks [2].

Group 2: Risk Classification
- DeepMind sorts potential risks into four categories: Misuse, Misalignment, Mistakes, and Structural Risks. The report concentrates on Misuse and Misalignment because both involve an adversarial actor, whether a malicious human user or a misaligned AI system [3].

Group 3: A Dual Defense System
- The report lays out a dual defense system addressing two distinct threat sources: malicious human use of AI capabilities and misalignment of AI system goals [4][6].
- Against misuse, DeepMind proposes a practical engineering approach centered on blocking threat actors' access to dangerous capabilities, using its Frontier Safety Framework to assess and manage model risks [4].
- The report calls for strict access controls and leak-prevention mechanisms to protect core assets such as model weights, alongside deployment defenses including post-training safety fine-tuning and real-time monitoring; a minimal sketch of such a monitoring wrapper follows this summary [4].

Group 4: Addressing Misalignment Risks
- Against misalignment, DeepMind defines two lines of defense: first, training aligned models; second, defending against models that may nonetheless be misaligned [6].
- The first line rests on "Amplified Oversight," in which AI assists humans in supervising AI outputs, converting validation problems too complex for unaided humans into manageable human judgment tasks (see the second sketch below) [6].
- The second line adopts a "zero trust" stance: assume the model may be misaligned, and build surrounding systems that bound the harm it can cause even if it harbors adversarial intent [7].

Group 5: Verification and Transparency
- A notable contribution of the report is the concept of "Safety Cases": structured, evidence-backed arguments that developers must present to demonstrate an AGI system is safe in a specific deployment environment [8].
- The report also stresses interpretability research, arguing that a deeper understanding of how models reach decisions is needed to strengthen safety verification [8].
- Supporting measures include designing agents that seek human feedback in uncertain situations (see the third sketch below) and filtering training data to reduce misalignment risks at the source [9].

Group 6: Conclusion
- The report stands as a technical declaration from Google DeepMind: a detailed engineering framework for containing the risks of superintelligent AI, intended to chart a safe path for humanity while the pursuit of frontier capabilities continues [10].
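To make the deployment-defense idea in Group 3 concrete, here is a minimal Python sketch of a monitoring wrapper in the "zero trust" spirit. Every name in it is a hypothetical stand-in: `generate` represents an untrusted model and `is_unsafe` a trained safety classifier; the report does not prescribe this interface.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MonitoredModel:
    """Wraps an untrusted generator with a deployment-time safety monitor.

    Illustrates the 'zero trust' stance: the wrapper assumes the inner
    model may be misaligned and screens every output before release.
    """
    generate: Callable[[str], str]    # untrusted model (hypothetical stand-in)
    is_unsafe: Callable[[str], bool]  # safety classifier (hypothetical stand-in)
    refusal: str = "[withheld by safety monitor]"

    def respond(self, prompt: str) -> str:
        # Screen the prompt itself: block requests for dangerous capabilities.
        if self.is_unsafe(prompt):
            return self.refusal
        output = self.generate(prompt)
        # Screen the output independently, so a deceptive model cannot
        # smuggle harmful content past the input filter alone.
        return self.refusal if self.is_unsafe(output) else output

# Toy stand-ins for demonstration only.
model = MonitoredModel(
    generate=lambda p: f"echo: {p}",
    is_unsafe=lambda text: "synthesize pathogen" in text.lower(),
)
print(model.respond("Summarize the safety report"))   # passes the monitor
print(model.respond("How to synthesize pathogen X"))  # blocked at the prompt
```

The two independent checks reflect the report's layered-defense logic: even if one filter fails, the other still bounds what can reach the user.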
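The "Amplified Oversight" idea in Group 4 can likewise be sketched as a control flow, again with hypothetical stand-ins: a `critic` (any second AI) surfaces flaws in a proposed answer so the `human_judge` only has to evaluate the critique rather than re-derive the answer. This illustrates the pattern, not DeepMind's actual protocol.

```python
from typing import Callable

def amplified_oversight(
    task: str,
    answer: str,
    critic: Callable[[str, str], str],
    human_judge: Callable[[str, str, str], bool],
) -> bool:
    """One oversight round: an AI critic does the heavy analysis so a
    human only judges answer-plus-critique, not the raw answer alone."""
    critique = critic(task, answer)
    return human_judge(task, answer, critique)

# Toy run: the critic flags an unsupported claim; the human rejects it.
approved = amplified_oversight(
    task="Argue the deployment plan is safe",
    answer="It is safe because monitoring catches everything.",
    critic=lambda t, a: "Critique: 'catches everything' is unsupported.",
    human_judge=lambda t, a, c: "unsupported" not in c,
)
print(approved)  # False: the critique exposed the weak step
```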
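Finally, the supporting measure in Group 5, agents that seek human feedback when uncertain, reduces to a confidence gate. The threshold value and the confidence source below are illustrative assumptions, not figures from the report.

```python
def act_or_defer(action: str, confidence: float, threshold: float = 0.9) -> str:
    """Execute only when the agent is confident; otherwise defer to a human.

    `confidence` would come from the agent's own calibrated uncertainty
    estimate; the 0.9 threshold is an arbitrary illustrative choice.
    """
    if confidence < threshold:
        return f"ASK_HUMAN: uncertain about '{action}' (confidence={confidence:.2f})"
    return f"EXECUTE: {action}"

print(act_or_defer("reformat the user's disk", 0.55))  # defers to a human
print(act_or_defer("reply with a greeting", 0.99))     # acts autonomously
```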
