# Mechanistic Interpretability
This session covers a cutting-edge topic – mechanistic interpretability – which aims to identify the computational mechanisms within the transformer architecture that support performance on various tasks. The slides for the session can be found here.
## Additional materials
If you want to dig a bit deeper, here are some (optional!) supplementary readings. Further papers discussed in the lecture are listed in the slides.
- Merullo et al. (2024) Language Models Implement Simple Word2Vec-style Vector Arithmetic
- Elhage et al. (2021) A Mathematical Framework for Transformer Circuits
- Vig et al. (2020) Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias
- Heimersheim & Nanda (2024) How to use and interpret activation patching (a minimal code sketch of the technique follows this list)
- Chan et al. (2022) Causal Scrubbing: a method for rigorously testing interpretability hypotheses
- Merullo et al. (2024) Circuit component reuse across tasks in transformer language models
- Yu et al. (2023) Characterizing Mechanisms for Factual Recall in Language Models
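To make the activation-patching reading above more concrete, here is a minimal sketch of the technique, assuming the TransformerLens library (`transformer_lens`) and GPT-2 small. The prompt pair, the layer sweep, and the choice to patch the residual stream at the final position are illustrative assumptions, not the exact setup of any of the papers listed.

```python
# A minimal activation-patching sketch (assumes `pip install transformer_lens`).
# We cache activations from a "clean" run, then overwrite one activation in a
# "corrupted" run to see how much of the clean behaviour is restored.
import transformer_lens.utils as utils
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Clean vs. corrupted prompts (illustrative, IOI-style); they must tokenize
# to the same length so that positions line up when patching.
clean = "When John and Mary went to the store, John gave a drink to"
corrupt = "When John and Mary went to the store, Mary gave a drink to"
clean_tokens = model.to_tokens(clean)
corrupt_tokens = model.to_tokens(corrupt)
assert clean_tokens.shape == corrupt_tokens.shape

# Cache all activations from the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")

def logit_diff(logits):
    # Clean-answer logit (" Mary") minus distractor (" John") at the last position.
    return (logits[0, -1, mary] - logits[0, -1, john]).item()

def patch_resid(resid, hook, pos):
    # Overwrite the residual stream at one position with the clean activation.
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

# Sweep over layers, patching the residual stream at the final token position.
pos = clean_tokens.shape[1] - 1
for layer in range(model.cfg.n_layers):
    hook_name = utils.get_act_name("resid_pre", layer)
    logits = model.run_with_hooks(
        corrupt_tokens,
        fwd_hooks=[(hook_name, lambda resid, hook: patch_resid(resid, hook, pos))],
    )
    print(f"layer {layer:2d}: logit diff = {logit_diff(logits):+.3f}")
```

A layer where patching restores a large positive logit difference is evidence that the information distinguishing the two prompts is carried in the residual stream at that point – exactly the kind of localization claim that methods like causal scrubbing are designed to stress-test.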