Mechanistic Interpretability

The lecture slides, which cover mechanistic interpretability methods for language models (logit lens, residual stream, activation patching, circuit analysis, and sparse autoencoders), can be found here.

Additional materials

If you want to dig a bit deeper, here are (optional!) supplementary readings. More papers discussed in the lecture are provided in the slides.