# Mechanistic Interpretability
This session covers a cutting-edge topic – mechanistic interpretability – which aims to identify the computational mechanisms within the transformer architecture that support performance on various tasks. The slides for the session can be found here.
## Additional materials
If you want to dig a bit deeper, here are some (optional!) supplementary readings. Further papers discussed in the lecture are listed in the slides.
- Merullo et al. (2024) Language Models Implement Simple Word2Vec-style Vector Arithmetic
- Elhage et al. (2021) A Mathematical Framework for Transformer Circuits
- Vig et al. (2020) Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias
- Heimersheim & Nanda (2024) How to use and interpret activation patching (a minimal code sketch of the technique follows this list)
- Chan et al. (2022) Causal Scrubbing: a method for rigorously testing interpretability hypotheses
- Merullo et al. (2024) Circuit component reuse across tasks in transformer language models
- Yu et al. (2023) Characterizing Mechanisms for Factual Recall in Language Models
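To make the activation-patching reading above more concrete, here is a minimal sketch of the technique, assuming the TransformerLens library (`transformer_lens`) and GPT-2 small. The prompt pair, the layer sweep, and the choice to patch the residual stream at the final position are illustrative assumptions, not the exact setup of any of the papers listed.

```python
# A minimal activation-patching sketch (assumes `pip install transformer_lens`).
# We cache activations from a "clean" run, then overwrite one activation in a
# "corrupted" run to see how much of the clean behaviour is restored.
import transformer_lens.utils as utils
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Clean vs. corrupted prompts (illustrative, IOI-style); they must tokenize
# to the same length so that positions line up when patching.
clean = "When John and Mary went to the store, John gave a drink to"
corrupt = "When John and Mary went to the store, Mary gave a drink to"
clean_tokens = model.to_tokens(clean)
corrupt_tokens = model.to_tokens(corrupt)
assert clean_tokens.shape == corrupt_tokens.shape

# Cache all activations from the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")

def logit_diff(logits):
    # Clean-answer logit (" Mary") minus distractor (" John") at the last position.
    return (logits[0, -1, mary] - logits[0, -1, john]).item()

def patch_resid(resid, hook, pos):
    # Overwrite the residual stream at one position with the clean activation.
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

# Sweep over layers, patching the residual stream at the final token position.
pos = clean_tokens.shape[1] - 1
for layer in range(model.cfg.n_layers):
    hook_name = utils.get_act_name("resid_pre", layer)
    logits = model.run_with_hooks(
        corrupt_tokens,
        fwd_hooks=[(hook_name, lambda resid, hook: patch_resid(resid, hook, pos))],
    )
    print(f"layer {layer:2d}: logit diff = {logit_diff(logits):+.3f}")
```

A layer where patching restores a large positive logit difference is evidence that the information distinguishing the two prompts is carried in the residual stream at that point – exactly the kind of localization claim that methods like causal scrubbing are designed to stress-test.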