NeurIPS 2025 | DePass: Unified Feature Attribution via Single-Forward-Pass Decomposition
机器之心· 2025-12-01 04:08
Core Viewpoint
- The article discusses the introduction of a new unified feature attribution framework called DePass, which aims to enhance the interpretability of large language models (LLMs) by providing precise attribution of model outputs to internal computations [3][11].

Group 1: Introduction of DePass
- DePass is a novel framework developed by a research team from Tsinghua University and Shanghai AI Lab, designed to address the challenges of existing attribution methods that are often computationally expensive and lack a unified analysis framework [3][6].
- The framework allows for the decomposition of hidden states in the forward pass into additive components, enabling precise attribution of model behavior without modifying the model structure [7][11].

Group 2: Implementation Details
- In the Attention module, DePass freezes attention scores and applies linear transformations to the hidden states, allowing for accurate distribution of information flow [8].
- For the MLP module, it treats the neurons as a key-value store, effectively partitioning the contributions of different components to the same token [9].

Group 3: Experimental Validation
- DePass has been validated through various experiments, demonstrating its effectiveness in token-level, model-component-level, and subspace-level attribution tasks [11][13].
- In token-level experiments, removing the most critical tokens identified by DePass significantly decreased model output probabilities, indicating its ability to capture essential evidence driving predictions [11][14].

Group 4: Comparison with Existing Methods
- Existing attribution methods, such as noise ablation and gradient-based methods, face challenges in providing fine-grained explanations and often incur high computational costs [12].
- DePass outperforms traditional importance metrics in identifying significant components, showing higher sensitivity and completeness in its attribution results [15].
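
To make the Attention-module description in Group 2 concrete, the following is a minimal sketch of why freezing attention scores makes the decomposition exact. It is not the authors' implementation: the toy dimensions, the weight values, the single unmasked head, and helper names such as `split_by_token` and `propagate` are all assumptions chosen for illustration. Once the attention weights are computed from the full hidden state and held fixed, the rest of the layer is linear, so per-token components pushed through it separately sum back to the ordinary output.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(x, w_q, w_k):
    """Attention weights from the FULL hidden state; these are then frozen."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    return [softmax([s / scale for s in row]) for row in scores]

def split_by_token(x):
    """Hypothetical initial decomposition: component k keeps only token k's row."""
    return [
        [list(row) if i == k else [0.0] * len(row) for i, row in enumerate(x)]
        for k in range(len(x))
    ]

def propagate(state, attn, w_v, w_o):
    """Linear part of attention with fixed weights: attn @ (state @ W_V @ W_O)."""
    return matmul(attn, matmul(matmul(state, w_v), w_o))

# Toy hidden states: 3 tokens, model width 2 (illustrative values only).
x = [[1.0, 0.5], [-0.3, 2.0], [0.7, -1.2]]
w_q = [[0.2, -0.1], [0.4, 0.3]]
w_k = [[0.5, 0.1], [-0.2, 0.6]]
w_v = [[1.0, 0.2], [0.0, -0.5]]
w_o = [[0.3, 0.7], [-0.4, 0.1]]

attn = attention_weights(x, w_q, w_k)           # computed once, then frozen
full_out = propagate(x, attn, w_v, w_o)         # ordinary attention output
parts = [propagate(c, attn, w_v, w_o) for c in split_by_token(x)]

# Sum the per-token components element-wise and compare with the full output.
recombined = [[sum(vals) for vals in zip(*rows)] for rows in zip(*parts)]
max_err = max(
    abs(f - r) for f_row, r_row in zip(full_out, recombined) for f, r in zip(f_row, r_row)
)
print("max reconstruction error:", max_err)
```

Each entry of `parts` is the share of the attention output that traces back to one input token. The sketch leaves out the residual connection, normalization, causal masking, multiple heads, and the MLP step, which the article describes only at a high level.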
Group 5: Applications and Future Potential
- DePass can track the contributions of specific input tokens to particular semantic subspaces, enhancing the model's controllability and interpretability [13][19].
- The framework is expected to serve as a general-purpose tool in mechanistic interpretability research, facilitating exploration across various tasks and models [23].
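
The subspace-level tracking mentioned in Group 5 can be illustrated with a short sketch. It assumes that a final hidden state has already been decomposed into additive per-token components, and that the subspace of interest is given by an orthonormal basis. The token labels, vectors, and basis below are invented for illustration and do not come from the article. Because projection is linear, the per-token coordinates in the subspace add up to the coordinates of the full hidden state.

```python
import math

def project(vec, basis):
    """Coordinates of vec in the subspace spanned by orthonormal basis rows."""
    return [sum(v * b for v, b in zip(vec, row)) for row in basis]

# Hypothetical additive components of one hidden state, keyed by input token.
components = {
    "tok_a": [0.9, 0.1, 0.3],
    "tok_b": [0.05, -0.02, 0.1],
    "tok_c": [0.2, 0.4, -0.1],
}

# Hypothetical 2-D "semantic" subspace inside a 3-D hidden space.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# The full hidden state is the sum of its components.
hidden = [sum(vals) for vals in zip(*components.values())]

# Project each component, then check that the projections are additive.
per_token = {tok: project(vec, basis) for tok, vec in components.items()}
summed = [sum(coords) for coords in zip(*per_token.values())]
full = project(hidden, basis)

# Rank tokens by the norm of their contribution inside the subspace.
scores = {tok: math.sqrt(sum(c * c for c in coords)) for tok, coords in per_token.items()}
top_token = max(scores, key=scores.get)
print(top_token, scores[top_token])
```

Ranking by projection norm is one possible scoring choice; a signed projection onto a single direction would work equally well when the subspace is one-dimensional.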