@gsarti on Hugging Face: "🔍 Today's pick in Interpretability & Analysis of LMs: by @aadityasingh T.…"

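🔍 Today's pick in Interpretability & Analysis of LMs: What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation by @aadityasingh, T. Moskovitz, F. Hill, S. C. Y. Chan, A. M. Saxe (@gatsbyunit)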

This work proposes a new methodology inspired by optogenetics (dubbed "clamping") that performs targeted ablations during training in order to estimate the causal effect of specific interventions on mechanism formation.

The authors use this approach to study the formation of induction heads by training a 2-layer attention-only transformer to label examples using information from the context.
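
To make the idea of intervening throughout training concrete, here is a deliberately tiny sketch. It is not the authors' implementation: the "heads" are single scalars rather than attention heads, and `train` and `clamps` are hypothetical names chosen for illustration. One head's output is held at a fixed value for the entire run, and the remaining head adapts to fit the target on its own.

```python
def train(n_heads=2, clamps=None, steps=200, lr=0.1, target=1.0):
    """Fit the sum of scalar 'head' outputs to a target by gradient descent.

    clamps maps a head index to a fixed output that is held for the whole
    training run (a toy stand-in for clamping an activation).
    Returns (weights, outputs) after training.
    """
    clamps = clamps or {}
    # Unequal initial values so the heads are distinguishable.
    weights = [0.1 * (i + 1) for i in range(n_heads)]
    for _ in range(steps):
        # Forward pass: clamped heads emit their fixed value instead of their weight.
        outputs = [clamps.get(i, w) for i, w in enumerate(weights)]
        error = sum(outputs) - target
        for i in range(n_heads):
            if i not in clamps:
                # Gradient of squared error w.r.t. an unclamped head's weight.
                weights[i] -= lr * 2 * error
            # Clamped heads receive no update: their output ignores the weight.
    outputs = [clamps.get(i, w) for i, w in enumerate(weights)]
    return weights, outputs


if __name__ == "__main__":
    print(train())                   # both heads share the work
    print(train(clamps={0: 0.0}))    # head 0 ablated for the whole run
```

Comparing the two runs shows the kind of question the intervention is meant to answer: what the rest of the model learns when one component is fixed from the start, as opposed to being ablated only after training.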

Notable findings:

- The effects of induction heads are additive and redundant, with weaker heads compensating well when a strong induction head is ablated.
- Competition between induction heads might emerge as a product of optimization pressure to converge faster, but it is not strictly necessary as all heads eventually learn to solve the task.
- Previous token heads (PTH) influence induction heads in a many-to-many fashion, with any PTH eliciting above-chance predictions from a subsequent induction head.
- Three subcircuits for induction are identified, respectively mixing token-label information (1 + 2), matching the previous occurrence of the current class in the context (3qk + 4), and copying the label of the matched class (3v + 5).
- A larger number of classes and labels slows the formation of induction heads: more classes delay the matching mechanism, and more labels delay the copying mechanism. This may have implications when selecting a vocabulary size for LLMs: larger vocabularies lead to an increased compression ratio and longer contexts, but they might make copying more challenging by delaying the formation of induction heads.
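
The three subcircuits listed above can be sketched on a toy few-shot labeling sequence. This is a minimal illustration under strong assumptions, not the paper's model: exact token equality stands in for learned query-key matching, `sharpness` is a made-up attention scale, and the function names are hypothetical.

```python
import math


def previous_token_mix(tokens):
    """Mixing: each position also stores the token right before it,
    so a label position 'knows' which example it belongs to."""
    return [(tok, tokens[i - 1] if i > 0 else None)
            for i, tok in enumerate(tokens)]


def induction_predict(tokens, sharpness=10.0):
    """tokens: [x1, l1, x2, l2, ..., query]. Returns {token: probability}."""
    *context, query = tokens
    mixed = previous_token_mix(context)

    # Matching: attend to positions whose stored previous token equals the query.
    scores = [sharpness if prev == query else 0.0 for _, prev in mixed]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)

    # Copying: output the attention-weighted mixture of attended tokens.
    probs = {}
    for w, (tok, _) in zip(weights, mixed):
        probs[tok] = probs.get(tok, 0.0) + w / total
    return probs


if __name__ == "__main__":
    sequence = ["cat", "A", "dog", "B", "bird", "A", "dog"]
    prediction = induction_predict(sequence)
    print(max(prediction, key=prediction.get))
```

In this sketch, adding more distinct examples gives the matching step more positions to discriminate between, while adding more distinct labels widens the set of outputs the copying step must be able to produce, which loosely mirrors the split between classes and labels described in the last finding.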

💻 Code: https://github.com/aadityasingh/icl-dynamics

📄 Paper: What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation (2404.07129)

🔍 All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9
