<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://peter-fields.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://peter-fields.github.io/" rel="alternate" type="text/html" /><updated>2026-04-09T18:51:31-05:00</updated><id>https://peter-fields.github.io/feed.xml</id><title type="html">Peter Fields</title><subtitle>Notes on machine learning, statistical physics, and other technical matters.</subtitle><author><name>Peter Fields</name></author><entry><title type="html">Attention Diagnostics: Testing KL and Susceptibility on the IOI Circuit</title><link href="https://peter-fields.github.io/attention-diagnostics/" rel="alternate" type="text/html" title="Attention Diagnostics: Testing KL and Susceptibility on the IOI Circuit" /><published>2026-02-24T00:00:00-06:00</published><updated>2026-02-24T00:00:00-06:00</updated><id>https://peter-fields.github.io/attention-diagnostics</id><content type="html" xml:base="https://peter-fields.github.io/attention-diagnostics/"><![CDATA[<!--
=== POST PLAN (revised Feb 24) ===

FIGURES & FLOW:

Fig 1: One Name Mover head (e.g. L9H9), non-IOI (gray) vs IOI (red) in (KL, χ) plane.
   Arrow from inactive mean → active mean. This is the hook — directly tests
   the Post 1 prediction. "The circuit fires, and we can see it in these diagnostics."

Fig 2: Big grid of ALL 23 circuit heads, same format. Organized by role.
   Shows fingerprints — heads with same role cluster in same region.
   Selection heads shift right (more selective). Structural heads don't move.
   S-Inhibition mixed. Role information lives in both position AND shift direction.

Fig 3: Grid of ~10 non-circuit heads for comparison.
   Pick heads that span the range: a few boring ones (near-zero ΔKL, as expected),
   plus one or two with surprisingly large shifts (L9H4, L10H3) to hint that
   not all non-circuit heads are inert. Contrast with Fig 2.

Report: Circuit vs non-circuit statistical test.
   |ΔKL| p=0.0002, |Δχ| p<0.0001. The manipulation is minimal (one name repeats
   or doesn't), yet circuit heads clearly feel it.

Fig 4: KL vs χ scatter for all 144 heads on IOI prompts.
   corr = 0.34. They measure different things. Don't belabor — just show it.
   LEAVE OUT the ΔKL-Δχ correlation (r=0.70). Looks less clean in the scatter
   than the number suggests — too many heads where one changes but the other
   doesn't. Too nuanced for this post.

Fig 5 (teaser): One correlation analysis plot.
   Maybe dendrogram-ordered C_diff matrix with circuit heads marked.
   Or the C_IOI circuit-only 23×23 block. Just enough to say "there's more
   structure here" and point to a future post.

KEY POINT TO MAKE: These diagnostics only need forward passes — no gradients,
no activation patching, no model modification. They could work on models where
per-component causal intervention is prohibitively expensive. That's the "so what"
beyond validating a known circuit.

LIMITATIONS (be honest):
   - One circuit, one model. No generalization claim.
   - How could this be useful when the circuit isn't known a priori?
   - Layer depth confound: corr(layer, KL) ≈ 0.38.
   - KL is blind to target identity — captures *how selectively* a head attends,
     not *what* it attends to. A head that redirects attention without changing
     selectivity is invisible to ΔKL. Same for χ.

=== NOTEBOOK PLAN (separate curated notebook) ===

1. Define KL and χ from Post 1's theory
   - Brief recap, link back to blog
   - KL(π̂ ∥ u) measures selectivity. χ = Var_π(log π)/(log n)² measures susceptibility.

2. Setup: GPT-2-small, IOI circuit, 50/50 matched prompts
   - 144 heads. IOI vs non-IOI (third-name-C design). Exactly 15 tokens both sets.
   - Head labels from Wang et al. 2022.

3. Circuit heads respond more than non-circuit heads
   - |ΔKL|: circuit vs non-circuit p=0.0002. |Δχ|: p<0.0001.

4. Selection heads become more selective; structural heads don't care
   - ΔKL positive for NMs/Backup NMs/Neg NMs. ΔKL ≈ 0 for structural.

5. KL and χ measure different things but respond together
   - corr(KL, χ) = 0.34. corr(ΔKL, Δχ) = 0.70.

6. The (KL, χ) fingerprint figure
   - Role clustering in absolute position.

7. Limitations
   - One circuit, one model. Layer depth confound (corr ≈ 0.38).
   - KL blind to target identity (shape not content).
   - Polysemantic interpretation of χ not demonstrated.

Cut from notebook: ⟨z⟩ excess, criticality framing, stat mech analogy.
v1→v2→v3 prompt iteration: brief collapsible section, not main narrative.
-->

<p><a href="/why-softmax/">Post 1</a> derived two per-head diagnostics from the structure of the softmax operator: <strong>KL selectivity</strong> \(\hat\rho_{\text{eff}} = \text{KL}(\hat\pi \| u)\) measures how sharply a head focuses its attention, and <strong>susceptibility</strong> \(\chi = \text{Var}_{\hat\pi}(\log\hat\pi)/(\log n)^2\) measures how sensitive that sharpness is to small changes in the query-key scores. Both are computed from a single forward pass — no gradients, no activation patching.</p>

<p>Here I test a concrete prediction: <em>circuit heads should show larger shifts in these diagnostics between activating and non-activating prompts than non-circuit heads.</em> The testbed is GPT-2-small’s Indirect Object Identification (IOI) circuit,<sup id="fnref:wang2022" role="doc-noteref"><a href="#fn:wang2022" class="footnote" rel="footnote">1</a></sup> whose 23 heads and functional roles are well characterized.</p>

<p>Jupyter notebook with analysis <a href="https://github.com/peter-fields/peter-fields.github.io/blob/main/notebooks/post2_attention-diagnostics/final/attention_diagnostics_peter.ipynb">here</a>.</p>

<hr />

<h2 id="recap">Recap</h2>

<p>A <strong>prompt</strong> is a token sequence \(x = (x_1, \ldots, x_n)\). GPT-2-small processes it layer by layer, maintaining a <strong>residual stream</strong> \(h_i^{(l)} \in \mathbb{R}^{d_{\textrm{model}}}\) for each position — a contextualized representation that accumulates the contributions of all attention heads and MLPs up to layer \(l\).</p>

<p>Each of the 144 heads (12 layers × 12 heads) projects the residual stream into queries and keys, computes scores \(z_{ij} = q_i \cdot k_j / \sqrt{d_k}\), and applies softmax to produce an <strong>attention distribution</strong> over source positions:</p>

\[\hat{\pi}_j = \mathrm{softmax}_j(z_{ij}).\]

<p>This \(\hat{\pi}\) depends on both \(x\) (through the queries and keys) and the head’s learned parameters. From it we read off the two diagnostics introduced in <a href="/why-softmax/">Post 1</a>:</p>

\[\hat{\rho}_{\textrm{eff}} = {\text{KL}(\hat{\pi} \| u)}, \qquad \chi = \frac{\text{Var}_{\hat{\pi}}(\log \hat{\pi})}{(\log n)^2},
\label{eq:diagnostics}\]

<p>where \(u = (1/n, \ldots, 1/n)\) is the uniform distribution. Note: Post 1 used the notation \(\partial\hat{\rho}\) for the temperature susceptibility; here we write \(\chi = \partial\hat{\rho}/(\log n)^2\). No backward pass needed — \(\hat{\pi}\) is already computed in the forward pass.</p>
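
<p>As a minimal sketch of the computation (numpy only; <code>head_diagnostics</code> is an illustrative name, and <code>pi</code> stands for one head's attention row, however you extract it):</p>

```python
import numpy as np

def head_diagnostics(pi, eps=1e-12):
    """Compute (KL selectivity, susceptibility chi) for one attention
    distribution pi over n source positions; forward-pass quantities only."""
    pi = np.asarray(pi, dtype=float)
    n = pi.size
    log_pi = np.log(pi + eps)        # eps guards exact zeros
    mean_log = np.sum(pi * log_pi)   # E_pi[log pi] = -H(pi)
    rho_eff = np.log(n) + mean_log   # KL(pi || u) = log n - H(pi)
    chi = np.sum(pi * (log_pi - mean_log) ** 2) / np.log(n) ** 2
    return rho_eff, chi
```

<p>For the uniform distribution both diagnostics vanish; a sharply peaked distribution pushes \(\hat\rho_{\textrm{eff}}\) toward its maximum of \(\log n\).</p>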

<hr />

<h2 id="experimental-setup">Experimental setup</h2>

<p>GPT-2 small is known to have a circuit that performs indirect object identification.<sup id="fnref:wang2022:1" role="doc-noteref"><a href="#fn:wang2022" class="footnote" rel="footnote">1</a></sup> Let’s say we have a prompt of 15 tokens that reads:</p>

\[x = \textrm{When Alice and Bob went to the store,} \textbf{ Bob } \textrm{gave a drink to ___}.
\label{eq:good_prompt}\]

<p>The correct next token is Alice (the indirect object of the second clause). Wang et al. showed that several heads divide up the work of predicting it: one detects that a name has appeared twice, another suppresses the repeated name, another copies the name that appeared only once to the output, and so forth.</p>

<p>For our experiments, we shall generate 50 IOI prompts of length \(n=15\) of exactly the same format as \eqref{eq:good_prompt} along with 50 non-IOI prompts that have a similar format but no repeating name, e.g.</p>

\[x = \textrm{After Mary and John sat down for dinner,} \textbf{ Sarah } \textrm{gave a gift to ___}\]

<p>which should not “activate” the circuit, and this should be reflected in a shift in the values of our diagnostics, \eqref{eq:diagnostics}.</p>
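
<p>A minimal sketch of the prompt construction (the name pool and template below are illustrative, not the exact set used in the notebook, which additionally checks that every prompt tokenizes to exactly 15 GPT-2 tokens):</p>

```python
import random

# Illustrative name pool and template; the real experiment also verifies
# that every prompt tokenizes to exactly n = 15 GPT-2 tokens.
NAMES = ["Alice", "Bob", "Mary", "John", "Sarah", "Tom", "Anna", "Paul"]
TEMPLATE = "When {A} and {B} went to the store, {C} gave a drink to"

def make_prompt(ioi, rng):
    """IOI prompt: the subject C repeats one of {A, B}.
    Non-IOI prompt: C is a third, unrepeated name."""
    a, b, c = rng.sample(NAMES, 3)
    return TEMPLATE.format(A=a, B=b, C=b if ioi else c)

rng = random.Random(0)
ioi_prompts = [make_prompt(True, rng) for _ in range(50)]
non_ioi_prompts = [make_prompt(False, rng) for _ in range(50)]
```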

<p>All experiment details, code, and plots can be found in this <a href="https://github.com/peter-fields/peter-fields.github.io/blob/main/notebooks/post2_attention-diagnostics/final/attention_diagnostics_peter.ipynb">notebook</a>.</p>

<hr />

<h2 id="a-single-circuit-head">A single circuit head</h2>

<p>Layer 9, head 9 (L9H9) is a Name Mover head, responsible for copying the correct indirect object name to the output. Below we see a noticeable difference between IOI and non-IOI prompts in the \((\rho, \chi)\)-plane (dropping the hats and the \(\textrm{eff}\) subscript for ease of notation).</p>

<figure class=""><img src="/assets/images/posts/attention-diagnostics/fig1_L9H9_name_mover.png" alt="L9H9 Name Mover — IOI vs non-IOI in (KL, χ) plane" /><figcaption>
      <strong>Figure 1.</strong> L9H9 (Name Mover) in the \((\rho,\,\chi)\) plane. Red: IOI prompts (circuit active). Gray: non-IOI prompts (circuit inactive). Stars mark condition means.

    </figcaption></figure>

<p>It looks like the diagnostics can see the head activate! And the shift is in the predicted direction: activating the circuit pushes the head toward higher KL selectivity, consistent with the prediction from Post 1.</p>

<p>Let’s check all 23 circuit heads.</p>

<hr />

<h2 id="all-23-circuit-heads">All 23 circuit heads</h2>

<figure class=""><img src="/assets/images/posts/attention-diagnostics/fig2_all_circuit_heads.png" alt="All 23 IOI circuit heads — active vs inactive" /><figcaption>
      <strong>Figure 2.</strong> All 23 IOI circuit heads organized by role. Each panel shows per-prompt \((\rho,\,\chi)\) for IOI (colored) and non-IOI (gray). Arrow: non-IOI mean → IOI mean.

    </figcaption></figure>

<p>The inactive-to-active manipulation produces signal in our diagnostics, but the signal is role dependent. Name Mover, Backup Name Mover, Negative Name Mover, and S-Inhibition heads all show clear shifts; these heads only activate when the repeated-name pattern appears. Induction, Duplicate Token, and Previous Token heads exhibit no shift; these are structural heads that track patterns and positions among tokens, building up representations in earlier layers that the selection heads later consume. Their jobs do not change between prompt types.</p>

<p>It seems that some, though not all, circuit types share similar fingerprints in the \((\rho, \chi)\)-plane, suggesting an interesting future direction of research: can candidate circuit heads be identified via these fingerprints?</p>

<hr />

<h2 id="non-circuit-heads">Non-circuit heads</h2>

<p>For comparison, non-circuit heads tend not to show a shift in \((\rho,\chi)\). Defining the mean KL shift</p>

\[\Delta \mathrm{KL} = \langle \rho \rangle_{\textrm{IOI prompts}} - \langle \rho \rangle_{\textrm{non-IOI prompts}}
\label{eq:delta_kl}\]

<p>we show below the non-circuit heads with the four largest values of \(|\Delta \mathrm{KL}|\) and the twelve smallest. A few non-circuit heads do seem to activate on IOI prompts. Their precise role is unclear, but their attention patterns appear sensitive to syntactic features these prompts share, not to the IOI computation specifically.</p>

<figure class=""><img src="/assets/images/posts/attention-diagnostics/fig3_non_circuit_heads.png" alt="Selected non-circuit heads for comparison" /><figcaption>
      <strong>Figure 3.</strong> Selected non-circuit heads. Orange: top-4 by |\(\Delta \) KL|. Blue: 12 inert heads with near-zero shift. Most non-circuit heads are indifferent to whether the IOI circuit fires.

    </figcaption></figure>
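
<p>Given per-prompt KL values stacked into \((n_\text{heads}, n_\text{prompts})\) arrays (variable names assumed), the shift of Eq. \eqref{eq:delta_kl} and the head selection above can be sketched as:</p>

```python
import numpy as np

def delta_kl(rho_ioi, rho_non):
    """Per-head mean KL shift between conditions, Delta KL.
    rho_ioi, rho_non: arrays of shape (n_heads, n_prompts)."""
    return rho_ioi.mean(axis=1) - rho_non.mean(axis=1)

def pick_comparison_heads(dkl, circuit_mask, n_top=4, n_bottom=12):
    """Among non-circuit heads, pick the largest-|dkl| and smallest-|dkl|
    heads, mirroring the Figure 3 selection."""
    nc = np.flatnonzero(~circuit_mask)          # non-circuit head indices
    order = nc[np.argsort(np.abs(dkl[nc]))]     # sorted by |Delta KL|
    return order[-n_top:], order[:n_bottom]
```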

<p>The existence of these outlier non-circuit heads means the separation is not perfect — but does it hold statistically, across all 144 heads?</p>

<hr />

<h2 id="statistical-test-do-the-diagnostics-separate-circuit-from-non-circuit">Statistical test: do the diagnostics separate circuit from non-circuit?</h2>

<p>Can the shift from non-IOI to IOI prompts in the \((\rho,\chi)\) plane distinguish circuit from non-circuit heads? To test this, we rank all heads by the mean shift in KL (\(\rho\)), as measured by Eq. \eqref{eq:delta_kl}, and similarly by the mean shift in \(\chi\). Does this ordering rank a randomly chosen circuit head above a randomly chosen non-circuit head more often than chance would predict?</p>

<p>This is exactly what the Mann-Whitney U test measures. As the distributions below show, the separation is statistically significant: \(p=0.0002\) and \(p&lt;0.0001\) for the \(|\Delta \mathrm{KL}|\) and \(|\Delta \chi|\) shifts, respectively.</p>

<figure class=""><img src="/assets/images/posts/attention-diagnostics/fig4_delta_distributions.png" alt="|ΔKL| and |Δχ| distributions for circuit vs non-circuit heads" /><figcaption>
      <strong>Figure 4.</strong> Distributions of \(|\Delta\text{KL}|\) (left) and \(|\Delta\chi|\) (right) for circuit (red) vs non-circuit (gray) heads. Mann-Whitney U: \(p=0.00020\) and \(p&lt;0.0001\) respectively.

    </figcaption></figure>
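
<p>A numpy-only sketch of the test (the notebook presumably uses <code>scipy.stats.mannwhitneyu</code>; this version uses midranks and a normal approximation, without a tie correction, for the one-sided p-value):</p>

```python
import numpy as np
from math import erf, sqrt

def midranks(a):
    """Ranks 1..n with ties assigned their average rank."""
    a = np.asarray(a, dtype=float)
    order = np.argsort(a, kind="stable")
    s = a[order]
    ranks = np.empty(len(a))
    i = 0
    while i < len(a):
        j = i
        while j + 1 < len(a) and s[j + 1] == s[i]:
            j += 1                               # extend over a tie block
        ranks[order[i:j + 1]] = (i + j) / 2 + 1  # average rank of the block
        i = j + 1
    return ranks

def mann_whitney_greater(x, y):
    """U statistic and normal-approximation p-value for the one-sided
    alternative 'x is stochastically larger than y'."""
    nx, ny = len(x), len(y)
    ranks = midranks(np.concatenate([x, y]))
    u = ranks[:nx].sum() - nx * (nx + 1) / 2
    mu = nx * ny / 2
    sigma = sqrt(nx * ny * (nx + ny + 1) / 12)   # no tie correction
    z = (u - mu) / sigma
    return u, 0.5 * (1 - erf(z / sqrt(2)))       # P(Z > z)
```

<p>Feeding it the per-head \(|\Delta \mathrm{KL}|\) values, split into circuit and non-circuit groups, gives the kind of test quoted above.</p>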

<hr />

<h2 id="cross-head-correlations-a-teaser">Cross-head correlations: a teaser</h2>

<p>The diagnostics so far treat heads independently. But heads in the same functional role don’t just shift in isolation — they tend to move together. Computing the 144×144 Pearson correlation matrix of KL selectivity across prompts reveals structure aligned with the circuit.</p>

<figure class=""><img src="/assets/images/posts/attention-diagnostics/fig5_corr_matrices.png" alt="Cross-head KL correlation matrices C_IOI, C_nonIOI, C_diff" /><figcaption>
      <strong>Figure 5.</strong> Cross-head KL correlation matrices. \(C_\text{IOI}\) and \(C_\text{non-IOI}\) are 144×144 Pearson correlation matrices computed across 50 prompts each; \(C_\text{diff} = C_\text{IOI} - C_\text{non-IOI}\) isolates task-specific coupling. Brown margin ticks mark the 23 known circuit heads.

    </figcaption></figure>

<p>The idea is that \(C_\text{IOI}\) mixes correlations due to circuit activation with the typical, task-independent interactions among heads. Subtracting \(C_\text{non-IOI}\) removes this baseline, isolating the task-specific coupling structure in \(C_\text{diff}\). The figure above shows a promising sparsification of the correlation structure in \(C_\text{diff}\), another interesting direction for possible circuit discovery.</p>
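
<p>The construction itself is a few lines (again assuming \((n_\text{heads}, n_\text{prompts})\) arrays of per-prompt KL values):</p>

```python
import numpy as np

def coupling_matrices(rho_ioi, rho_non):
    """Cross-head Pearson correlations of KL selectivity across prompts.
    Returns (C_IOI, C_nonIOI, C_diff), each n_heads x n_heads."""
    c_ioi = np.corrcoef(rho_ioi)   # rows = heads, columns = prompts
    c_non = np.corrcoef(rho_non)
    return c_ioi, c_non, c_ioi - c_non
```

<p>For a head pair whose coupling is task-independent, the two correlations roughly cancel and \(C_\text{diff}\) sits near zero; task-specific coupling survives the subtraction.</p>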

<p>Below we scatter every \((C_\text{IOI},\, C_\text{non-IOI})\) value for all non-circuit (NC)–circuit head pairs (blue) and all NC–NC pairs (red). \(C_\text{diff}\) measures the displacement from the identity line.</p>

<figure class=""><img src="/assets/images/posts/attention-diagnostics/fig6_nc_circuit_scatter.png" alt="NC-circuit coupling scatter" /><figcaption>
      <strong>Figure 6.</strong> Every (NC head, other head) pair plotted as \((C_\text{IOI},\, C_\text{non-IOI})\). Points on the diagonal have \(C_\text{diff}=0\). The gray band is \(\pm 5\sigma\) of the empirical \(C_\text{diff}\) distribution. Blue: NC head paired with a circuit head; red: two NC heads. Labeled: the three NC heads with the largest \(\lvert C_\text{diff}\rvert\) coupling to a circuit head.

    </figcaption></figure>

<p>There is one attention head in layer 8 (L8H1) that shows a dramatic shift in its correlation to a name-mover head when the circuit is activated. In fact, it’s 5 standard deviations away from mean no-shift behavior (\(C_\text{diff}=0\)). Could it be another circuit head missed by the analysis in Wang et al.? Five standard deviations is good enough for declaring the discovery of a new particle in high energy physics, so it seems good enough for me!</p>

<p>(Just kidding). In all seriousness, it would be worth checking for any mechanistic relation between this and other high \(C_\text{diff}\) heads via typical ablation/patching techniques.</p>

<hr />

<h2 id="limitations-implications-and-next-steps">Limitations, implications, and next steps</h2>

<p>Our diagnostics observe only the aggregate behavior of an attention head across prompts; they do not reveal the mechanism of the underlying computation on a per-prompt basis. We also expect attention to grow more selective (higher KL) in deeper layers regardless of circuit membership (in our data, \(\mathrm{corr}(\text{layer}, \mathrm{KL}) \approx 0.38\)), so layer depth is a confound that must be accounted for. Nevertheless, the shared fingerprints in the \((\rho,\chi)\)-plane for similar circuit roles and the correlation analysis each show promise for unsupervised circuit discovery.</p>

<p>Natural next steps include validating the diagnostics on other known circuits; after all, we have tested only one circuit in one model on one data set. Checking whether L8H1 is in fact a circuit element initially missed by Wang et al. would be interesting in its own right.</p>

<p>Furthermore, our correlation analysis does not distinguish direct coupling from transitive correlation: heads A and B may correlate strongly not because of any direct interaction, but because both couple to a third element C. In analyses of neural populations and amino-acid sequences this is typically handled with Direct Coupling Analysis (DCA), which fits a maximum-entropy model to the correlation matrix and has had meaningful success at uncovering real underlying interactions among neurons and amino acids. Since we show different prompts to our LLM and track the responses for stable structure, our setting is closely analogous to the analysis of Hoshal et al. in their neuro-theory article, “Stimulus-invariant aspects of the retinal code drive discriminability of natural scenes.”<sup id="fnref:hoshal2024" role="doc-noteref"><a href="#fn:hoshal2024" class="footnote" rel="footnote">2</a></sup></p>

<p>If correlation graphs can be turned into candidate circuit graphs, this would narrow the search to a few candidate heads for a given circuit computation. That could be a promising route to scaling up (and speeding up) circuit discovery: correlation analysis to find candidates, followed by causal interventions to determine mechanism.</p>

<p>The significant \(\Delta\chi\) is also perhaps surprising — Post 1 predicted \(\chi\) should be low in both conditions, so we’d expect \(\Delta\chi \approx 0\). We’ll have to think a little bit more in a future post about what \(\chi\) is (or is not) actually capturing. (We note that Kim (2026)<sup id="fnref:kim2026" role="doc-noteref"><a href="#fn:kim2026" class="footnote" rel="footnote">3</a></sup> applies a related fluctuation-dissipation susceptibility to GPT-2 <em>training</em> dynamics, using it to detect grokking as a phase transition — a complementary direction to the inference-time head characterization pursued here.)</p>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:wang2022" role="doc-endnote">
      <p>Wang et al. (2022). <a href="https://arxiv.org/abs/2211.00593">Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small</a>. <a href="#fnref:wang2022" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:wang2022:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:hoshal2024" role="doc-endnote">
      <p>Hoshal et al. (2024). <a href="https://www.pnas.org/doi/10.1073/pnas.2313676121">Stimulus-invariant aspects of the retinal code drive discriminability of natural scenes</a>. <em>PNAS</em> 121(52):e2313676121. <a href="#fnref:hoshal2024" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:kim2026" role="doc-endnote">
      <p>Kim, J. (2026). <a href="https://arxiv.org/abs/2602.08216">“Thermodynamic Isomorphism of Transformers,”</a> arXiv:2602.08216. <a href="#fnref:kim2026" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Peter Fields</name></author><category term="mechanistic-interpretability" /><category term="attention" /><category term="transformers" /><category term="IOI-circuit" /><category term="diagnostics" /><summary type="html"><![CDATA[The previous post introduced KL selectivity and susceptibility χ as per-head diagnostics derivable from attention weights alone. Here I test them on GPT-2-small's IOI circuit: can two scalar statistics, computed from a single forward pass, distinguish the 23 known circuit heads from the other 121? It seems so!]]></summary></entry><entry><title type="html">Why Softmax? A Hypothesis Testing Perspective on Attention Weights</title><link href="https://peter-fields.github.io/why-softmax/" rel="alternate" type="text/html" title="Why Softmax? A Hypothesis Testing Perspective on Attention Weights" /><published>2026-02-17T00:00:00-06:00</published><updated>2026-02-17T00:00:00-06:00</updated><id>https://peter-fields.github.io/why-softmax</id><content type="html" xml:base="https://peter-fields.github.io/why-softmax/"><![CDATA[<p>Softmax is ubiquitous in transformers, yet its role in attention can feel more heuristic than inevitable (at least to me). In this post, I try to make it feel more natural and show how this interpretation suggests useful diagnostics for the often circuit-like behavior of attention heads.</p>

<!--more-->

<h2 id="introduction-the-attention-mechanism">Introduction: the attention mechanism</h2>

<p>Consider a stream of tokens to be embedded:</p>

\[x=(x_1, x_2, \ldots, x_i, \ldots, x_T).\]

<p>After embedding (and potentially many passes through MLPs and attention heads) we have the contextualized tokens</p>

\[h_i \in \mathbb R^{d_{\text{model}}}.\]

<p>The attention mechanism updates this <em>residual stream</em> (as it is also called) by computing three quantities from learned parameters \(W_K, W_Q\) and \( W_V \).</p>

<p>Given the most recent embedded token in the stream, \(h_t\), and all tokens before it \( \{ h_i :i &lt; t \} \), these three quantities are the keys, query, and values—and are defined as</p>

\[k_i=W_Kh_i\]

\[q_t=W_Qh_t\]

\[v_i=W_Vh_i\]

<p>where \(q,k \in \mathbb{R}^{d_k}\) and typically \( d_k&lt;d_{\text{model}} \).</p>

<p>The update to the residual stream at position \( t \) is calculated as<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">1</a></sup></p>

\[h_t^{\text{(new)}} =\mathcal{O}_t+ h_t^{\text{(old)}}\]

<p>with</p>

\[\label{eq:O_t}
\mathcal{O}_t=\sum_i\pi_{i,t} v_i\]

\[\pi_{i,t} = \frac{e^{\beta k_i\cdot q_t}}{\sum_j e^{\beta k_j\cdot q_t}},\]

<p>where we identify \( \pi_i \) with the \(\mathrm{softmax}\) function:</p>

\[\pi_{i,t}=\mathrm{softmax}(\beta k_i\cdot q_t),\]

<p>and we have introduced the scalar \( \beta\) for later use.</p>

<p>This lends itself to the following interpretation: for any given token at position \( t \), the query vector, \(q_t\), defines what \(h_t\) is “looking for” from previous tokens, and the keys, \( k_i \), determine which of the previous tokens get “advertised”.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">2</a></sup></p>

<p>The query-key pairs define the distribution \( \pi_i \) over which the values, \( v_i \), are averaged. We can see that this distribution determines what values the attention head should “focus” on.</p>

<p>This post explores the question: <strong>why softmax and not something else?</strong></p>

<p>I emphasize that this is just the way I like to think about it… not <em>the</em> way it should be understood.</p>

<h2 id="softmax-as-hypothesis-testing">Softmax as hypothesis testing</h2>

<p>For notational simplicity we fix a destination position and drop the index \(t\). We define the query-key score for a given key as</p>

\[\label{eq:z}
z_i=k_i\cdot q.\]

<p>We let \(n\) denote the number of scores.</p>

<p>Leaving \(z_i\) alone for the moment, let us imagine that we had no good reason to prefer one index over another when calculating \( \mathcal{O}_t\) from Eq. \eqref{eq:O_t}. The only distribution invariant under permutation of the indices (which is the symmetry that reflects our ignorance) is the uniform distribution, which we denote by \(u\).</p>

<p>Of course, we do have reason to prefer some indices over others in our distribution \( \pi \), namely the scores \( z_i \). We have two competing objectives: create a distribution that maximizes the expected score, \( \sum_i \pi_i z_i \) (thus properly weighting the evidence afforded us), but also do not overcommit to any particular index’s score beyond what we believe is justifiable given our prior ignorance.</p>

<p>In hypothesis testing, the Kullback-Leibler (KL) divergence is a natural measure of distinguishability from a null hypothesis. The number of samples required to determine that said null hypothesis is false is proportional to \( \frac{1}{\mathrm{KL}(\pi\|u)}\)<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">3</a></sup>. Loosely speaking, if we are given some “budget” \( \rho \), and if the KL exceeds it, then we may say that we have enough evidence to reject the null (uniform) hypothesis. This defines our notion of overcommitment. Our competing objectives are thus defined by constructing \( \pi_i \) such that the average score, \( \langle z_i\rangle\), is maximal (our commitment to our evidence is maximal), while remaining within our commitment “budget” defined by \(\rho \) and the KL-divergence. This defines the constrained optimization problem</p>

\[\max_{\pi \in \Delta}
\sum_i \pi_i z_i \quad \text{s.t.} \quad \mathrm{KL}(\pi \| u) \leq \rho,\]

<p>where \(\Delta =\{ \pi \in \mathbb{R}^n : \pi_i \geq 0,\; \sum_i \pi_i = 1 \}\) is the probability simplex.</p>

<p>Rather than solve this directly, we can relax the hard constraint into a penalty, yielding the equivalent unconstrained problem</p>

\[\max_{\pi \in \Delta}
\sum_i \pi_i z_i - \frac{1}{\beta} \mathrm{KL}(\pi \| u),\]

<p>where \( \frac{1}{\beta} \) controls the trade-off between maximizing expected score and staying close to uniform. Each value of \( \beta \) corresponds to a particular budget \( \rho \): large \( \beta \) (loose budget) allows sharper distributions, while small \( \beta \) (tight budget) keeps \( \pi \) near uniform. Introducing a Lagrange multiplier for normalization and taking first-order conditions, one finds that the solution is</p>

\[\pi_i^\star \propto e^{\beta z_i},\]

<p>which recovers softmax, defining the attention weights used in transformers.</p>
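
<p>This is easy to sanity-check numerically: the softmax distribution should attain a higher penalized objective than any other point on the simplex. The sketch below compares it against random Dirichlet draws:</p>

```python
import numpy as np

def objective(pi, z, beta):
    """Expected score minus (1/beta) * KL(pi || u)."""
    n = len(pi)
    kl = np.sum(pi * np.log(pi * n))   # KL(pi || u) for strictly positive pi
    return pi @ z - kl / beta

def softmax(z, beta=1.0):
    w = np.exp(beta * (z - z.max()))   # shift scores for numerical stability
    return w / w.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=8)
beta = 2.0
pi_star = softmax(z, beta)
best = objective(pi_star, z, beta)
# No random competitor on the simplex should beat the softmax solution.
assert all(objective(rng.dirichlet(np.ones(8)), z, beta) <= best + 1e-9
           for _ in range(1000))
```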

<h2 id="interpretation">Interpretation</h2>

<p>I should reiterate that this is merely my <em>interpretation</em> of the softmax function. In modern commercial transformer architectures the above optimization problems are not explicitly written into the training objective and play no role at inference time.</p>

<p>That being said, the very fact that the softmax function is used in each attention head lends credence to the preceding interpretation. Each trained head in real, deployed transformers could be interpreted as instantiating some solution to a commitment-to-evidence-versus-ignorance optimization problem. Of course, the training objective does not explicitly enforce this constrained problem; the point is that the resulting functional form admits this interpretation.</p>

<p>The parameters \(\beta\) and \(\rho\) are not to be found in any real-world transformer such as Claude or ChatGPT, but each model does have its own learned weight matrices \(\hat W_K, \hat W_Q\) and \( \hat W_V \). So, for a given residual stream \( \{h_i\} \) for some context \(\{x_i\}\), nothing stops us from interrogating an attention head by examining the quantity</p>

\[\hat \rho_{\text{eff}, t}:=\mathrm{KL}(\hat \pi_t \|u )
\label{eq:rho_eff}\]

<p>for</p>

\[\hat\pi_{t,i}=\mathrm{softmax}(h_i^{\top}\hat W_K^\top\hat W_Qh_t).\]

<p>\(\hat \rho_{\text{eff},t}\) is an interesting quantity—we can think of it as measuring the “commitment to evidence” in a given attention head for given learned parameters and a given context. This last point is worth repeating: <em>it is a context dependent quantity</em>.</p>

<p>\(\hat \rho_{\text{eff},t}\) will be large when the evidence to focus on certain past tokens while building \(\hat \pi_t\) is large. It will be small when the evidence is “flimsy.” This is not necessarily bad, however; one can imagine certain heads operate well by considering evidence from many tokens, instead of only a few (to be very hand-wavy, think of a head that considers general themes and the tone of a context, instead of particular grammatical rules or other minutiae).</p>

<p>This quantity, \(\hat \rho_{\text{eff},t}\), is therefore a proxy for how selectively information is routed through a given head.</p>

<p>Equation \eqref{eq:rho_eff} can also be written as</p>

\[\hat \rho_{\text{eff},t} = \log n - H(\hat \pi_t)\]

<p>where we see that \(\hat \rho_{\text{eff},t}\) depends on the length of the context window. If \(\hat \rho_{\text{eff},t}\) grows logarithmically with \(n\), then \(H(\hat \pi_t)\) must remain \(O(1)\), meaning the head keeps focusing on a fixed number of tokens no matter how long the context becomes. Such a head can be seen as robust.</p>

<h2 id="further-implications-circuits--interpretability">Further implications: circuits &amp; interpretability</h2>

<p>Recall that in our derivation, the parameter \(\beta\) controlled the trade-off between evidence and ignorance. Though it does not appear explicitly in a trained transformer, we can still ask: how sensitive is \(\hat \rho_{\text{eff}}\) to perturbations in an artificial temperature parameter \(\beta\), evaluated at \(\beta=1\) (which recovers the actual attention weights)?</p>

<p>If we define the quantity</p>

\[\partial \hat \rho := \partial _\beta \hat \rho_{\mathrm{eff}}\Big |_{\beta=1} =\mathrm{Var}_{\hat \pi}(z),
\label{eq:d_beta}\]

<p>(where we have dropped \(t \) for simplicity of notation). Standard exponential-family identities show that this quantity is a susceptibility to perturbations in temperature, just as in statistical mechanics<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>. Tracking \(\partial \hat \rho \) across different contexts allows one to further characterize a particular head.</p>
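
<p>The identity is easy to verify numerically: a central finite difference of \(\hat \rho_{\mathrm{eff}}\) in \(\beta\), evaluated at \(\beta=1\), should match \(\mathrm{Var}_{\hat\pi}(z)\):</p>

```python
import numpy as np

def softmax(z, beta=1.0):
    w = np.exp(beta * (z - z.max()))
    return w / w.sum()

def rho_eff(z, beta=1.0):
    """KL(pi_beta || u) = log n - H(pi_beta)."""
    pi = softmax(z, beta)
    return np.log(len(z)) + np.sum(pi * np.log(pi))

rng = np.random.default_rng(1)
z = rng.normal(size=10)

# Central finite difference of rho_eff in beta, at beta = 1.
eps = 1e-5
numeric = (rho_eff(z, 1 + eps) - rho_eff(z, 1 - eps)) / (2 * eps)

# Exponential-family identity: the derivative equals Var_pi(z) at beta = 1.
pi = softmax(z)
analytic = np.sum(pi * (z - pi @ z) ** 2)
```

<p>The two agree to within finite-difference error.</p>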

<p>This is particularly true when considering work in interpretability and circuits in transformer architectures<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup><sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>. Both Eqs. \eqref{eq:rho_eff} and \eqref{eq:d_beta} would be interesting to track over many different contexts.</p>

<p>Think of a circuit that is activated in particular contexts, such as identifying which noun in a sentence is the indirect object. One can imagine each context string for that head mapping to a certain point in the \( (\hat \rho, \partial \hat \rho) \) plane. When the circuit is activated, the contexts would cluster towards high \( \hat \rho \) and low \( \partial \hat \rho \) (certain and stable). When not activated it would show low \( \hat \rho \) and low \( \partial \hat \rho \) (no preference for any past tokens and stable).</p>

<p>The next post shall explore the behaviors of these quantities in the indirect object identification (IOI) circuit in GPT-2.</p>

<h2 id="references-and-footnotes">References and Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:2" role="doc-endnote">
      <p>In practice there are quite a few more bells and whistles when considering multiple attention heads, LayerNorm, etc., but we shall skip over those for simplicity. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>The mapping from literal token \(x_i\) to embedded token \(h_i\) is not one to one—as one goes through more attention/MLP layers the information between positions can become more and more mixed. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:1" role="doc-endnote">
      <p>Cover, T. M., &amp; Thomas, J. A. (2006). <em>Elements of Information Theory</em> (2nd ed.), Section 11.8 (Chernoff-Stein Lemma). Wiley-Interscience. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>See, e.g., Kardar, M. (2007). <em>Statistical Physics of Particles</em>, Ch. 4. Cambridge University Press. In the canonical ensemble, the derivative of a thermodynamic average with respect to temperature yields the variance of the conjugate quantity (the fluctuation-dissipation relation). Kim (2026) applies the same relation to transformer <em>training</em> dynamics, using an analogous susceptibility to detect grokking as a phase transition; here we apply it to inference-time head characterization. See Kim, J. (2026). “Thermodynamic Isomorphism of Transformers,” arXiv:2602.08216. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p>Voita, E., Talbot, D., Moiseev, F., Sennrich, R., &amp; Titov, I. (2019). “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned.” <em>Proceedings of ACL</em>, 5797–5808. Identifies specialized vs. redundant heads via a confidence metric (average max attention weight). <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p>Zhai, S., Likhomanenko, T., Littwin, E., Busbridge, D., Ramapuram, J., Zhang, Y., Gu, J., &amp; Susskind, J. M. (2023). “Stabilizing Transformer Training by Preventing Attention Entropy Collapse.” <em>Proceedings of ICML</em>. Tracks attention entropy during training and identifies pathological entropy collapse. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Peter Fields</name></author><category term="attention" /><category term="softmax" /><category term="hypothesis testing" /><category term="KL divergence" /><category term="machine learning" /><category term="deep learning" /><summary type="html"><![CDATA[Softmax is ubiquitous in transformers, yet its role in attention can feel more heuristic than inevitable. In this post, I try to make it feel more natural and show how this interpretation suggests useful diagnostics for the often circuit-like behavior of attention heads.]]></summary></entry></feed>