<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://peter-fields.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://peter-fields.github.io/" rel="alternate" type="text/html" /><updated>2026-04-09T18:51:31-05:00</updated><id>https://peter-fields.github.io/feed.xml</id><title type="html">Peter Fields</title><subtitle>Notes on machine learning, statistical physics, and other technical matters.</subtitle><author><name>Peter Fields</name></author><entry><title type="html">Attention Diagnostics: Testing KL and Susceptibility on the IOI Circuit</title><link href="https://peter-fields.github.io/attention-diagnostics/" rel="alternate" type="text/html" title="Attention Diagnostics: Testing KL and Susceptibility on the IOI Circuit" /><published>2026-02-24T00:00:00-06:00</published><updated>2026-02-24T00:00:00-06:00</updated><id>https://peter-fields.github.io/attention-diagnostics</id><content type="html" xml:base="https://peter-fields.github.io/attention-diagnostics/"><![CDATA[<!--
=== POST PLAN (revised Feb 24) ===

FIGURES & FLOW:

Fig 1: One Name Mover head (e.g. L9H9), non-IOI (gray) vs IOI (red) in (KL, χ) plane.
   Arrow from inactive mean → active mean. This is the hook — directly tests
   the Post 1 prediction. "The circuit fires, and we can see it in these diagnostics."

Fig 2: Big grid of ALL 23 circuit heads, same format. Organized by role.
   Shows fingerprints — heads with same role cluster in same region.
   Selection heads shift right (more selective). Structural heads don't move.
   S-Inhibition mixed. Role information lives in both position AND shift direction.

Fig 3: Grid of ~10 non-circuit heads for comparison.
   Pick heads that span the range: a few boring ones (near-zero ΔKL, as expected),
   plus one or two with surprisingly large shifts (L9H4, L10H3) to hint that
   not all non-circuit heads are inert. Contrast with Fig 2.

Report: Circuit vs non-circuit statistical test.
   |ΔKL| p=0.0002, |Δχ| p<0.0001. The manipulation is minimal (one name repeats
   or doesn't), yet circuit heads clearly feel it.

Fig 4: KL vs χ scatter for all 144 heads on IOI prompts.
   corr = 0.34. They measure different things. Don't belabor — just show it.
   LEAVE OUT the ΔKL-Δχ correlation (r=0.70). Looks less clean in the scatter
   than the number suggests — too many heads where one changes but the other
   doesn't. Too nuanced for this post.

Fig 5 (teaser): One correlation analysis plot.
   Maybe dendrogram-ordered C_diff matrix with circuit heads marked.
   Or the C_IOI circuit-only 23×23 block. Just enough to say "there's more
   structure here" and point to a future post.

KEY POINT TO MAKE: These diagnostics only need forward passes — no gradients,
no activation patching, no model modification. They could work on models where
per-component causal intervention is prohibitively expensive. That's the "so what"
beyond validating a known circuit.

LIMITATIONS (be honest):
   - One circuit, one model. No generalization claim.
   - How could this be useful when the circuit isn't known a priori?
   - Layer depth confound: corr(layer, KL) ≈ 0.38.
   - KL is blind to target identity — captures *how selectively* a head attends,
     not *what* it attends to. A head that redirects attention without changing
     selectivity is invisible to ΔKL. Same for χ.

=== NOTEBOOK PLAN (separate curated notebook) ===

1. Define KL and χ from Post 1's theory
   - Brief recap, link back to blog
   - KL(π̂ ∥ u) measures selectivity. χ = Var_π(log π)/(log n)² measures susceptibility.

2. Setup: GPT-2-small, IOI circuit, 50/50 matched prompts
   - 144 heads. IOI vs non-IOI (third-name-C design). Exactly 15 tokens both sets.
   - Head labels from Wang et al. 2022.

3. Circuit heads respond more than non-circuit heads
   - |ΔKL|: circuit vs non-circuit p=0.0002. |Δχ|: p<0.0001.

4. Selection heads become more selective; structural heads don't care
   - ΔKL positive for NMs/Backup NMs/Neg NMs. ΔKL ≈ 0 for structural.

5. KL and χ measure different things but respond together
   - corr(KL, χ) = 0.34. corr(ΔKL, Δχ) = 0.70.

6. The (KL, χ) fingerprint figure
   - Role clustering in absolute position.

7. Limitations
   - One circuit, one model. Layer depth confound (corr ≈ 0.38).
   - KL blind to target identity (shape not content).
   - Polysemantic interpretation of χ not demonstrated.

Cut from notebook: ⟨z⟩ excess, criticality framing, stat mech analogy.
v1→v2→v3 prompt iteration: brief collapsible section, not main narrative.
-->

<p><a href="/why-softmax/">Post 1</a> derived two per-head diagnostics from the structure of the softmax operator: <strong>KL selectivity</strong> \(\hat\rho_{\text{eff}} = \text{KL}(\hat\pi \| u)\) measures how sharply a head focuses its attention, and <strong>susceptibility</strong> \(\chi = \text{Var}_{\hat\pi}(\log\hat\pi)/(\log n)^2\) measures how sensitive that sharpness is to small changes in the query-key scores. Both are computed from a single forward pass — no gradients, no activation patching.</p>

<p>Here I test a concrete prediction: <em>circuit heads should show larger shifts in these diagnostics between activating and non-activating prompts than non-circuit heads.</em> The testbed is GPT-2-small’s Indirect Object Identification (IOI) circuit,<sup id="fnref:wang2022" role="doc-noteref"><a href="#fn:wang2022" class="footnote" rel="footnote">1</a></sup> whose 23 heads and functional roles are well characterized.</p>

<p>Jupyter notebook with analysis <a href="https://github.com/peter-fields/peter-fields.github.io/blob/main/notebooks/post2_attention-diagnostics/final/attention_diagnostics_peter.ipynb">here</a>.</p>

<hr />

<h2 id="recap">Recap</h2>

<p>A <strong>prompt</strong> is a token sequence \(x = (x_1, \ldots, x_n)\). GPT-2-small processes it layer by layer, maintaining a <strong>residual stream</strong> \(h_i^{(l)} \in \mathbb{R}^{d_{\textrm{model}}}\) for each position — a contextualized representation that accumulates the contributions of all attention heads and MLPs up to layer \(l\).</p>

<p>Each of the 144 heads (12 layers × 12 heads) projects the residual stream into queries and keys, computes scores \(z_{ij} = q_i \cdot k_j / \sqrt{d_k}\), and applies softmax to produce an <strong>attention distribution</strong> over source positions:</p>

\[\hat{\pi}_j = \mathrm{softmax}_j(z_{ij}).\]

<p>This \(\hat{\pi}\) depends on both \(x\) (through the queries and keys) and the head’s learned parameters. From it we read off the two diagnostics introduced in <a href="/why-softmax/">Post 1</a>:</p>

\[\hat{\rho}_{\textrm{eff}} = {\text{KL}(\hat{\pi} \| u)}, \qquad \chi = \frac{\text{Var}_{\hat{\pi}}(\log \hat{\pi})}{(\log n)^2},
\label{eq:diagnostics}\]

<p>where \(u = (1/n, \ldots, 1/n)\) is the uniform distribution. Note: Post 1 used the notation \(\partial\hat{\rho}\) for the temperature susceptibility; here we write \(\chi = \partial\hat{\rho}/(\log n)^2\). No backward pass needed — \(\hat{\pi}\) is already computed in the forward pass.</p>
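
<p>As a minimal sketch of the computation (numpy only; <code>head_diagnostics</code> is an illustrative name, and <code>pi</code> stands for one head's attention row, however you extract it):</p>

```python
import numpy as np

def head_diagnostics(pi, eps=1e-12):
    """Compute (KL selectivity, susceptibility chi) for one attention
    distribution pi over n source positions; forward-pass quantities only."""
    pi = np.asarray(pi, dtype=float)
    n = pi.size
    log_pi = np.log(pi + eps)        # eps guards exact zeros
    mean_log = np.sum(pi * log_pi)   # E_pi[log pi] = -H(pi)
    rho_eff = np.log(n) + mean_log   # KL(pi || u) = log n - H(pi)
    chi = np.sum(pi * (log_pi - mean_log) ** 2) / np.log(n) ** 2
    return rho_eff, chi
```

<p>For the uniform distribution both diagnostics vanish; a sharply peaked distribution pushes \(\hat\rho_{\textrm{eff}}\) toward its maximum of \(\log n\).</p>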

<hr />

<h2 id="experimental-setup">Experimental setup</h2>

<p>GPT-2 small is known to have a circuit that performs indirect object identification.<sup id="fnref:wang2022:1" role="doc-noteref"><a href="#fn:wang2022" class="footnote" rel="footnote">1</a></sup> Let’s say we have a prompt of 15 tokens that reads:</p>

\[x = \textrm{When Alice and Bob went to the store,} \textbf{ Bob } \textrm{gave a drink to ___}.
\label{eq:good_prompt}\]

<p>The correct next token is Alice (the indirect object of the second clause). Wang et al. showed that several heads divide up the work of predicting it: one detects that a name has appeared twice, another suppresses the repeated name, another copies the name that appeared only once to the output, and so forth.</p>

<p>For our experiments, we shall generate 50 IOI prompts of length \(n=15\) of exactly the same format as \eqref{eq:good_prompt} along with 50 non-IOI prompts that have a similar format but no repeating name, e.g.</p>

\[x = \textrm{After Mary and John sat down for dinner,} \textbf{ Sarah } \textrm{gave a gift to ___}\]

<p>which should not “activate” the circuit, and this should be reflected in a shift in the values of our diagnostics, \eqref{eq:diagnostics}.</p>
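
<p>A minimal sketch of the prompt construction (the name pool and template below are illustrative, not the exact set used in the notebook, which additionally checks that every prompt tokenizes to exactly 15 GPT-2 tokens):</p>

```python
import random

# Illustrative name pool and template; the real experiment also verifies
# that every prompt tokenizes to exactly n = 15 GPT-2 tokens.
NAMES = ["Alice", "Bob", "Mary", "John", "Sarah", "Tom", "Anna", "Paul"]
TEMPLATE = "When {A} and {B} went to the store, {C} gave a drink to"

def make_prompt(ioi, rng):
    """IOI prompt: the subject C repeats one of {A, B}.
    Non-IOI prompt: C is a third, unrepeated name."""
    a, b, c = rng.sample(NAMES, 3)
    return TEMPLATE.format(A=a, B=b, C=b if ioi else c)

rng = random.Random(0)
ioi_prompts = [make_prompt(True, rng) for _ in range(50)]
non_ioi_prompts = [make_prompt(False, rng) for _ in range(50)]
```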

<p>All experiment details, code, and plots can be found in this <a href="https://github.com/peter-fields/peter-fields.github.io/blob/main/notebooks/post2_attention-diagnostics/final/attention_diagnostics_peter.ipynb">notebook</a>.</p>

<hr />

<h2 id="a-single-circuit-head">A single circuit head</h2>

<p>Layer 9, head 9 (L9H9) is a Name Mover head, responsible for copying the correct indirect object name to the output. Below we see a noticeable difference between IOI and non-IOI prompts in the \((\rho, \chi)\)-plane (dropping the hats and the \(\textrm{eff}\) subscript for ease of notation).</p>

<figure class=""><img src="/assets/images/posts/attention-diagnostics/fig1_L9H9_name_mover.png" alt="L9H9 Name Mover — IOI vs non-IOI in (KL, χ) plane" /><figcaption>
      <strong>Figure 1.</strong> L9H9 (Name Mover) in the \((\rho,\,\chi)\) plane. Red: IOI prompts (circuit active). Gray: non-IOI prompts (circuit inactive). Stars mark condition means.

    </figcaption></figure>

<p>It looks like the diagnostics can see the head activate! And the shift is in the predicted direction: activating the circuit pushes the head toward higher KL selectivity, consistent with the prediction from Post 1.</p>

<p>Let’s check all 23 circuit heads.</p>

<hr />

<h2 id="all-23-circuit-heads">All 23 circuit heads</h2>

<figure class=""><img src="/assets/images/posts/attention-diagnostics/fig2_all_circuit_heads.png" alt="All 23 IOI circuit heads — active vs inactive" /><figcaption>
      <strong>Figure 2.</strong> All 23 IOI circuit heads organized by role. Each panel shows per-prompt \((\rho,\,\chi)\) for IOI (colored) and non-IOI (gray). Arrow: non-IOI mean → IOI mean.

    </figcaption></figure>

<p>The inactive-to-active manipulation produces signal in our diagnostics, but the signal is role dependent. Name Mover, Backup Name Mover, Negative Name Mover, and S-Inhibition heads all show clear shifts; these heads only activate when the repeated-name pattern appears. Induction, Duplicate Token, and Previous Token heads exhibit no shift; these are structural heads that track patterns and positions among tokens, building up representations in earlier layers that the selection heads later consume. Their jobs do not change between prompt types.</p>

<p>It seems that some, though not all, circuit types share similar fingerprints in the \((\rho, \chi)\)-plane, suggesting an interesting future direction of research: can candidate circuit heads be identified via these fingerprints?</p>

<hr />

<h2 id="non-circuit-heads">Non-circuit heads</h2>

<p>For comparison, non-circuit heads tend not to show a shift in \((\rho,\chi)\). Defining the mean KL shift</p>

\[\Delta \mathrm{KL} = \langle \rho \rangle_{\textrm{IOI prompts}} - \langle \rho \rangle_{\textrm{non-IOI prompts}}
\label{eq:delta_kl}\]

<p>we show below the non-circuit heads with the four largest values of \(|\Delta \mathrm{KL}|\) and the twelve smallest. A few non-circuit heads do seem to activate on IOI prompts. Their precise role is unclear, but their attention patterns appear sensitive to syntactic features these prompts share, not to the IOI computation specifically.</p>

<figure class=""><img src="/assets/images/posts/attention-diagnostics/fig3_non_circuit_heads.png" alt="Selected non-circuit heads for comparison" /><figcaption>
      <strong>Figure 3.</strong> Selected non-circuit heads. Orange: top-4 by |\(\Delta \) KL|. Blue: 12 inert heads with near-zero shift. Most non-circuit heads are indifferent to whether the IOI circuit fires.

    </figcaption></figure>
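
<p>Given per-prompt KL values stacked into \((n_\text{heads}, n_\text{prompts})\) arrays (variable names assumed), the shift of Eq. \eqref{eq:delta_kl} and the head selection above can be sketched as:</p>

```python
import numpy as np

def delta_kl(rho_ioi, rho_non):
    """Per-head mean KL shift between conditions, Delta KL.
    rho_ioi, rho_non: arrays of shape (n_heads, n_prompts)."""
    return rho_ioi.mean(axis=1) - rho_non.mean(axis=1)

def pick_comparison_heads(dkl, circuit_mask, n_top=4, n_bottom=12):
    """Among non-circuit heads, pick the largest-|dkl| and smallest-|dkl|
    heads, mirroring the Figure 3 selection."""
    nc = np.flatnonzero(~circuit_mask)          # non-circuit head indices
    order = nc[np.argsort(np.abs(dkl[nc]))]     # sorted by |Delta KL|
    return order[-n_top:], order[:n_bottom]
```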

<p>The existence of these outlier non-circuit heads means the separation is not perfect — but does it hold statistically, across all 144 heads?</p>

<hr />

<h2 id="statistical-test-do-the-diagnostics-separate-circuit-from-non-circuit">Statistical test: do the diagnostics separate circuit from non-circuit?</h2>

<p>Can the shift from non-IOI to IOI prompts in the \((\rho,\chi)\) plane distinguish circuit from non-circuit heads? To test this, we rank all heads by the mean shift in KL (\(\rho\)), as measured by Eq. \eqref{eq:delta_kl}, and similarly by the mean shift in \(\chi\). Does this ordering rank a randomly chosen circuit head above a randomly chosen non-circuit head more often than chance would predict?</p>

<p>This is exactly what the Mann-Whitney U test measures. As the distributions below show, the separation is statistically significant: \(p=0.0002\) and \(p&lt;0.0001\) for the \(|\Delta \mathrm{KL}|\) and \(|\Delta \chi|\) shifts, respectively.</p>

<figure class=""><img src="/assets/images/posts/attention-diagnostics/fig4_delta_distributions.png" alt="|ΔKL| and |Δχ| distributions for circuit vs non-circuit heads" /><figcaption>
      <strong>Figure 4.</strong> Distributions of \(|\Delta\text{KL}|\) (left) and \(|\Delta\chi|\) (right) for circuit (red) vs non-circuit (gray) heads. Mann-Whitney U: \(p=0.00020\) and \(p&lt;0.0001\) respectively.

    </figcaption></figure>
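
<p>A numpy-only sketch of the test (the notebook presumably uses <code>scipy.stats.mannwhitneyu</code>; this version uses midranks and a normal approximation, without a tie correction, for the one-sided p-value):</p>

```python
import numpy as np
from math import erf, sqrt

def midranks(a):
    """Ranks 1..n with ties assigned their average rank."""
    a = np.asarray(a, dtype=float)
    order = np.argsort(a, kind="stable")
    s = a[order]
    ranks = np.empty(len(a))
    i = 0
    while i < len(a):
        j = i
        while j + 1 < len(a) and s[j + 1] == s[i]:
            j += 1                               # extend over a tie block
        ranks[order[i:j + 1]] = (i + j) / 2 + 1  # average rank of the block
        i = j + 1
    return ranks

def mann_whitney_greater(x, y):
    """U statistic and normal-approximation p-value for the one-sided
    alternative 'x is stochastically larger than y'."""
    nx, ny = len(x), len(y)
    ranks = midranks(np.concatenate([x, y]))
    u = ranks[:nx].sum() - nx * (nx + 1) / 2
    mu = nx * ny / 2
    sigma = sqrt(nx * ny * (nx + ny + 1) / 12)   # no tie correction
    z = (u - mu) / sigma
    return u, 0.5 * (1 - erf(z / sqrt(2)))       # P(Z > z)
```

<p>Feeding it the per-head \(|\Delta \mathrm{KL}|\) values, split into circuit and non-circuit groups, gives the kind of test quoted above.</p>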

<hr />

<h2 id="cross-head-correlations-a-teaser">Cross-head correlations: a teaser</h2>

<p>The diagnostics so far treat heads independently. But heads in the same functional role don’t just shift in isolation — they tend to move together. Computing the 144×144 Pearson correlation matrix of KL selectivity across prompts reveals structure aligned with the circuit.</p>

<figure class=""><img src="/assets/images/posts/attention-diagnostics/fig5_corr_matrices.png" alt="Cross-head KL correlation matrices C_IOI, C_nonIOI, C_diff" /><figcaption>
      <strong>Figure 5.</strong> Cross-head KL correlation matrices. \(C_\text{IOI}\) and \(C_\text{non-IOI}\) are 144×144 Pearson correlation matrices computed across 50 prompts each; \(C_\text{diff} = C_\text{IOI} - C_\text{non-IOI}\) isolates task-specific coupling. Brown margin ticks mark the 23 known circuit heads.

    </figcaption></figure>

<p>The idea is that \(C_\text{IOI}\) mixes correlations due to circuit activation with the typical, task-independent interactions among heads. Subtracting \(C_\text{non-IOI}\) removes this baseline, isolating the task-specific coupling structure in \(C_\text{diff}\). The figure above shows a promising sparsification of the correlation structure in \(C_\text{diff}\), another interesting direction for possible circuit discovery.</p>
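
<p>The construction itself is a few lines (again assuming \((n_\text{heads}, n_\text{prompts})\) arrays of per-prompt KL values):</p>

```python
import numpy as np

def coupling_matrices(rho_ioi, rho_non):
    """Cross-head Pearson correlations of KL selectivity across prompts.
    Returns (C_IOI, C_nonIOI, C_diff), each n_heads x n_heads."""
    c_ioi = np.corrcoef(rho_ioi)   # rows = heads, columns = prompts
    c_non = np.corrcoef(rho_non)
    return c_ioi, c_non, c_ioi - c_non
```

<p>For a head pair whose coupling is task-independent, the two correlations roughly cancel and \(C_\text{diff}\) sits near zero; task-specific coupling survives the subtraction.</p>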

<p>Below we scatter every \((C_\text{IOI},\, C_\text{non-IOI})\) value for all non-circuit (NC)–circuit head pairs (blue) and all NC–NC pairs (red). \(C_\text{diff}\) measures the displacement from the identity line.</p>

<figure class=""><img src="/assets/images/posts/attention-diagnostics/fig6_nc_circuit_scatter.png" alt="NC-circuit coupling scatter" /><figcaption>
      <strong>Figure 6.</strong> Every (NC head, other head) pair plotted as \((C_\text{IOI},\, C_\text{non-IOI})\). Points on the diagonal have \(C_\text{diff}=0\). The gray band is \(\pm 5\sigma\) of the empirical \(C_\text{diff}\) distribution. Blue: NC head paired with a circuit head; red: two NC heads. Labeled: the three NC heads with the largest \(\lvert C_\text{diff}\rvert\) coupling to a circuit head.

    </figcaption></figure>

<p>There is one attention head in layer 8 (L8H1) that shows a dramatic shift in its correlation to a name-mover head when the circuit is activated. In fact, it’s 5 standard deviations away from mean no-shift behavior (\(C_\text{diff}=0\)). Could it be another circuit head missed by the analysis in Wang et al.? Five standard deviations is good enough for declaring the discovery of a new particle in high energy physics, so it seems good enough for me!</p>

<p>(Just kidding). In all seriousness, it would be worth checking for any mechanistic relation between this and other high \(C_\text{diff}\) heads via typical ablation/patching techniques.</p>

<hr />

<h2 id="limitations-implications-and-next-steps">Limitations, implications, and next steps</h2>

<p>Our diagnostics observe only the aggregate behavior of an attention head across prompts; they do not reveal the mechanism of the underlying computation on a per-prompt basis. We also expect attention to grow more selective (higher KL) in deeper layers regardless of circuit membership (in our data, \(\mathrm{corr}(\text{layer}, \mathrm{KL}) \approx 0.38\)), so layer depth is a confound that must be accounted for. Nevertheless, the shared fingerprints in the \((\rho,\chi)\)-plane for similar circuit roles and the correlation analysis each show promise for unsupervised circuit discovery.</p>

<p>Natural next steps include validating the diagnostics on other known circuits; after all, we have tested only one circuit in one model on one data set. Checking whether L8H1 is in fact a circuit element initially missed by Wang et al. would be interesting in its own right.</p>

<p>Furthermore, our correlation analysis does not distinguish direct coupling from transitive correlation: heads A and B may correlate strongly not because of any direct interaction, but because both couple to a third element C. In analyses of neural populations and amino-acid sequences this is typically handled with Direct Coupling Analysis (DCA), which fits a maximum-entropy model to the correlation matrix and has had meaningful success at uncovering real underlying interactions among neurons and amino acids. Since we show different prompts to our LLM and track the responses for stable structure, our setting is closely analogous to the analysis of Hoshal et al. in their neuro-theory article, “Stimulus-invariant aspects of the retinal code drive discriminability of natural scenes.”<sup id="fnref:hoshal2024" role="doc-noteref"><a href="#fn:hoshal2024" class="footnote" rel="footnote">2</a></sup></p>

<p>If correlation graphs can be turned into candidate circuit graphs, this would narrow the search to a few candidate heads for a given circuit computation. That could be a promising route to scaling up (and speeding up) circuit discovery: correlation analysis to find candidates, followed by causal interventions to determine mechanism.</p>

<p>The significant \(\Delta\chi\) is also perhaps surprising — Post 1 predicted \(\chi\) should be low in both conditions, so we’d expect \(\Delta\chi \approx 0\). We’ll have to think a little bit more in a future post about what \(\chi\) is (or is not) actually capturing. (We note that Kim (2026)<sup id="fnref:kim2026" role="doc-noteref"><a href="#fn:kim2026" class="footnote" rel="footnote">3</a></sup> applies a related fluctuation-dissipation susceptibility to GPT-2 <em>training</em> dynamics, using it to detect grokking as a phase transition — a complementary direction to the inference-time head characterization pursued here.)</p>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:wang2022" role="doc-endnote">
      <p>Wang et al. (2022). <a href="https://arxiv.org/abs/2211.00593">Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small</a>. <a href="#fnref:wang2022" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:wang2022:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:hoshal2024" role="doc-endnote">
      <p>Hoshal et al. (2024). <a href="https://www.pnas.org/doi/10.1073/pnas.2313676121">Stimulus-invariant aspects of the retinal code drive discriminability of natural scenes</a>. <em>PNAS</em> 121(52):e2313676121. <a href="#fnref:hoshal2024" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:kim2026" role="doc-endnote">
      <p>Kim, J. (2026). <a href="https://arxiv.org/abs/2602.08216">“Thermodynamic Isomorphism of Transformers,”</a> arXiv:2602.08216. <a href="#fnref:kim2026" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Peter Fields</name></author><category term="mechanistic-interpretability" /><category term="attention" /><category term="transformers" /><category term="IOI-circuit" /><category term="diagnostics" /><summary type="html"><![CDATA[The previous post introduced KL selectivity and susceptibility χ as per-head diagnostics derivable from attention weights alone. Here I test them on GPT-2-small's IOI circuit: can two scalar statistics, computed from a single forward pass, distinguish the 23 known circuit heads from the other 121? It seems so!]]></summary></entry><entry><title type="html">Why Softmax? A Hypothesis Testing Perspective on Attention Weights</title><link href="https://peter-fields.github.io/why-softmax/" rel="alternate" type="text/html" title="Why Softmax? A Hypothesis Testing Perspective on Attention Weights" /><published>2026-02-17T00:00:00-06:00</published><updated>2026-02-17T00:00:00-06:00</updated><id>https://peter-fields.github.io/why-softmax</id><content type="html" xml:base="https://peter-fields.github.io/why-softmax/"><![CDATA[<p>Softmax is ubiquitous in transformers, yet its role in attention can feel more heuristic than inevitable (at least to me). In this post, I try to make it feel more natural and show how this interpretation suggests useful diagnostics for the often circuit-like behavior of attention heads.</p>

<!--more-->

<h2 id="introduction-the-attention-mechanism">Introduction: the attention mechanism</h2>

<p>Consider a stream of tokens to be embedded:</p>

\[x=(x_1, x_2, \ldots, x_i, \ldots, x_T).\]

<p>After embedding (and potentially many passes through MLPs and attention heads) we have the contextualized tokens</p>

\[h_i \in \mathbb R^{d_{\text{model}}}.\]

<p>The attention mechanism updates this <em>residual stream</em> (as it is also called) by computing three quantities from learned parameters \(W_K, W_Q\) and \( W_V \).</p>

<p>Given the most recent embedded token in the stream, \(h_t\), and all tokens before it \( \{ h_i :i &lt; t \} \), these three quantities are the keys, query, and values—and are defined as</p>

\[k_i=W_Kh_i\]

\[q_t=W_Qh_t\]

\[v_i=W_Vh_i\]

<p>where \(q,k \in \mathbb{R}^{d_k}\) and typically \( d_k&lt;d_{\text{model}} \).</p>

<p>The update to the residual stream at position \( t \) is calculated as<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">1</a></sup></p>

\[h_t^{\text{(new)}} =\mathcal{O}_t+ h_t^{\text{(old)}}\]

<p>with</p>

\[\label{eq:O_t}
\mathcal{O}_t=\sum_i\pi_{i,t} v_i\]

\[\pi_{i,t} = \frac{e^{\beta k_i\cdot q_t}}{\sum_j e^{\beta k_j\cdot q_t}},\]

<p>where we identify \( \pi_i \) with the \(\mathrm{softmax}\) function:</p>

\[\pi_{i,t}=\mathrm{softmax}(\beta k_i\cdot q_t),\]

<p>and we have introduced the scalar \( \beta\) for later use.</p>

<p>This lends itself to the following interpretation: for any given token at position \( t \), the query vector, \(q_t\), defines what \(h_t\) is “looking for” from previous tokens, and the keys, \( k_i \), determine which of the previous tokens get “advertised”.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">2</a></sup></p>

<p>The query-key pairs define the distribution \( \pi_i \) over which the values, \( v_i \), are averaged. We can see that this distribution determines what values the attention head should “focus” on.</p>

<p>This post explores the question: <strong>why softmax and not something else?</strong></p>

<p>I emphasize that this is just the way I like to think about it… not <em>the</em> way it should be understood.</p>

<h2 id="softmax-as-hypothesis-testing">Softmax as hypothesis testing</h2>

<p>For notational simplicity we fix a destination position and drop the index \(t\). We define the query-key score for a given key as</p>

\[\label{eq:z}
z_i=k_i\cdot q.\]

<p>We let \(n\) denote the number of scores.</p>

<p>Leaving \(z_i\) alone for the moment, let us imagine that we had no good reason to prefer one index over another when calculating \( \mathcal{O}_t\) from Eq. \eqref{eq:O_t}. The only distribution invariant under permutation of the indices (which is the symmetry that reflects our ignorance) is the uniform distribution, which we denote by \(u\).</p>

<p>Of course, we do have reason to prefer some indices over others in our distribution \( \pi \), namely the scores \( z_i \). We have two competing objectives: create a distribution that maximizes the expected score, \( \sum_i \pi_i z_i \) (thus properly weighting the evidence afforded us), but also do not overcommit to any particular index’s score beyond what we believe is justifiable given our prior ignorance.</p>

<p>In hypothesis testing, the Kullback-Leibler (KL) divergence is a natural measure of distinguishability from a null hypothesis. The number of samples required to determine that said null hypothesis is false is proportional to \( \frac{1}{\mathrm{KL}(\pi\|u)}\)<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">3</a></sup>. Loosely speaking, if we are given some “budget” \( \rho \), and if the KL exceeds it, then we may say that we have enough evidence to reject the null (uniform) hypothesis. This defines our notion of overcommitment. Our competing objectives are thus defined by constructing \( \pi_i \) such that the average score, \( \langle z_i\rangle\), is maximal (our commitment to our evidence is maximal), while remaining within our commitment “budget” defined by \(\rho \) and the KL-divergence. This defines the constrained optimization problem</p>

\[\max_{\pi \in \Delta}
\sum_i \pi_i z_i \quad \text{s.t.} \quad \mathrm{KL}(\pi \| u) \leq \rho,\]

<p>where \(\Delta =\{ \pi \in \mathbb{R}^n : \pi_i \geq 0,\; \sum_i \pi_i = 1 \}\) is the probability simplex.</p>

<p>Rather than solve this directly, we can relax the hard constraint into a penalty, yielding the equivalent unconstrained problem</p>

\[\max_{\pi \in \Delta}
\sum_i \pi_i z_i - \frac{1}{\beta} \mathrm{KL}(\pi \| u),\]

<p>where \( \frac{1}{\beta} \) controls the trade-off between maximizing expected score and staying close to uniform. Each value of \( \beta \) corresponds to a particular budget \( \rho \): large \( \beta \) (loose budget) allows sharper distributions, while small \( \beta \) (tight budget) keeps \( \pi \) near uniform. Introducing a Lagrange multiplier for normalization and taking first-order conditions, one finds that the solution is</p>

\[\pi_i^\star \propto e^{\beta z_i},\]

<p>which recovers softmax, defining the attention weights used in transformers.</p>
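
<p>This is easy to sanity-check numerically: the softmax distribution should attain a higher penalized objective than any other point on the simplex. The sketch below compares it against random Dirichlet draws:</p>

```python
import numpy as np

def objective(pi, z, beta):
    """Expected score minus (1/beta) * KL(pi || u)."""
    n = len(pi)
    kl = np.sum(pi * np.log(pi * n))   # KL(pi || u) for strictly positive pi
    return pi @ z - kl / beta

def softmax(z, beta=1.0):
    w = np.exp(beta * (z - z.max()))   # shift scores for numerical stability
    return w / w.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=8)
beta = 2.0
pi_star = softmax(z, beta)
best = objective(pi_star, z, beta)
# No random competitor on the simplex should beat the softmax solution.
assert all(objective(rng.dirichlet(np.ones(8)), z, beta) <= best + 1e-9
           for _ in range(1000))
```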

<h2 id="interpretation">Interpretation</h2>

<p>I should reiterate that this is merely my <em>interpretation</em> of the softmax function. In modern commercial transformer architectures the above optimization problems are not explicitly written into the training objective and play no role at inference time.</p>

<p>That being said, the very fact that the softmax function is used in each attention head lends credence to the preceding interpretation. Each trained head in real, deployed transformers could be interpreted as instantiating some solution to a commitment-to-evidence-versus-ignorance optimization problem. Of course, the training objective does not explicitly enforce this constrained problem; the point is that the resulting functional form admits this interpretation.</p>

<p>The parameters \(\beta\) and \(\rho\) are not to be found in any real-world transformer such as Claude or ChatGPT, but each model does have its own learned weight matrices \(\hat W_K, \hat W_Q\) and \( \hat W_V \). So, for a given residual stream \( \{h_i\} \) for some context \(\{x_i\}\), nothing stops us from interrogating an attention head by examining the quantity</p>

\[\hat \rho_{\text{eff}, t}:=\mathrm{KL}(\hat \pi_t \|u )
\label{eq:rho_eff}\]

<p>for</p>

\[\hat\pi_{t,i}=\mathrm{softmax}(h_i^{\top}\hat W_K^\top\hat W_Qh_t).\]

<p>\(\hat \rho_{\text{eff},t}\) is an interesting quantity—we can think of it as measuring the “commitment to evidence” in a given attention head for given learned parameters and a given context. This last point is worth repeating: <em>it is a context dependent quantity</em>.</p>

<p>\(\hat \rho_{\text{eff},t}\) will be large when the evidence to focus on certain past tokens while building \(\hat \pi_t\) is large. It will be small when the evidence is “flimsy.” This is not necessarily bad, however; one can imagine certain heads operate well by considering evidence from many tokens, instead of only a few (to be very hand-wavy, think of a head that considers general themes and the tone of a context, instead of particular grammatical rules or other minutiae).</p>

<p>This quantity, \(\hat \rho_{\text{eff},t}\), is therefore a proxy for how selectively information is routed through a given head.</p>

<p>Equation \eqref{eq:rho_eff} can also be written as</p>

\[\hat \rho_{\text{eff},t} = \log n - H(\hat \pi_t)\]

<p>where we see that \(\hat \rho_{\text{eff},t}\) depends on the length of the context window. If \(\hat \rho_{\text{eff},t}\) grows logarithmically with \(n\), then \(H(\hat \pi_t)\) must remain \(O(1)\), meaning the head keeps focusing on a fixed number of tokens no matter how long the context becomes. Such a head can be seen as robust.</p>

<h2 id="further-implications-circuits--interpretability">Further implications: circuits &amp; interpretability</h2>

<p>Recall that in our derivation, the parameter \(\beta\) controlled the trade-off between evidence and ignorance. Though it does not appear explicitly in a trained transformer, we can still ask: how sensitive is \(\hat \rho_{\text{eff}}\) to perturbations in an artificial temperature parameter \(\beta\), evaluated at \(\beta=1\) (which recovers the actual attention weights)?</p>

<p>If we define the quantity</p>

\[\partial \hat \rho := \partial _\beta \hat \rho_{\mathrm{eff}}\Big |_{\beta=1} =\mathrm{Var}_{\hat \pi}(z),
\label{eq:d_beta}\]

<p>(where we have dropped \(t \) for simplicity of notation). Standard exponential-family identities show that this quantity is a susceptibility to perturbations in temperature, just as in statistical mechanics<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>. Tracking \(\partial \hat \rho \) across different contexts allows one to further characterize a particular head.</p>
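
<p>The identity is easy to verify numerically: a central finite difference of \(\hat \rho_{\mathrm{eff}}\) in \(\beta\), evaluated at \(\beta=1\), should match \(\mathrm{Var}_{\hat\pi}(z)\):</p>

```python
import numpy as np

def softmax(z, beta=1.0):
    w = np.exp(beta * (z - z.max()))
    return w / w.sum()

def rho_eff(z, beta=1.0):
    """KL(pi_beta || u) = log n - H(pi_beta)."""
    pi = softmax(z, beta)
    return np.log(len(z)) + np.sum(pi * np.log(pi))

rng = np.random.default_rng(1)
z = rng.normal(size=10)

# Central finite difference of rho_eff in beta, at beta = 1.
eps = 1e-5
numeric = (rho_eff(z, 1 + eps) - rho_eff(z, 1 - eps)) / (2 * eps)

# Exponential-family identity: the derivative equals Var_pi(z) at beta = 1.
pi = softmax(z)
analytic = np.sum(pi * (z - pi @ z) ** 2)
```

<p>The two agree to within finite-difference error.</p>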

<p>This is particularly true when considering work in interpretability and circuits in transformer architectures<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup><sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>. Both Eqs. \eqref{eq:rho_eff} and \eqref{eq:d_beta} would be interesting to track over many different contexts.</p>

<p>Think of a circuit that is activated in particular contexts, such as identifying which noun in a sentence is the indirect object. One can imagine each context string for that head mapping to a certain point in the \( (\hat \rho, \partial \hat \rho) \) plane. When the circuit is activated, the contexts would cluster towards high \( \hat \rho \) and low \( \partial \hat \rho \) (certain and stable). When not activated it would show low \( \hat \rho \) and low \( \partial \hat \rho \) (no preference for any past tokens and stable).</p>

<p>The next post shall explore the behaviors of these quantities in the indirect object identification (IOI) circuit in GPT-2.</p>

<h2 id="references-and-footnotes">References and Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:2" role="doc-endnote">
      <p>In practice there are quite a few more bells and whistles when considering multiple attention heads, LayerNorm, etc., but we shall skip over those for simplicity. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>The mapping from literal token \(x_i\) to embedded token \(h_i\) is not one to one—as one goes through more attention/MLP layers the information between positions can become more and more mixed. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:1" role="doc-endnote">
      <p>Cover, T. M., &amp; Thomas, J. A. (2006). <em>Elements of Information Theory</em> (2nd ed.), Section 11.8 (Chernoff-Stein Lemma). Wiley-Interscience. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>See, e.g., Kardar, M. (2007). <em>Statistical Physics of Particles</em>, Ch. 4. Cambridge University Press. In the canonical ensemble, the derivative of a thermodynamic average with respect to temperature yields the variance of the conjugate quantity (the fluctuation-dissipation relation). Kim (2026) applies the same relation to transformer <em>training</em> dynamics, using an analogous susceptibility to detect grokking as a phase transition; here we apply it to inference-time head characterization. See Kim, J. (2026). “Thermodynamic Isomorphism of Transformers,” arXiv:2602.08216. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p>Voita, E., Talbot, D., Moiseev, F., Sennrich, R., &amp; Titov, I. (2019). “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned.” <em>Proceedings of ACL</em>, 5797–5808. Identifies specialized vs. redundant heads via a confidence metric (average max attention weight). <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p>Zhai, S., Likhomanenko, T., Littwin, E., Busbridge, D., Ramapuram, J., Zhang, Y., Gu, J., &amp; Susskind, J. M. (2023). “Stabilizing Transformer Training by Preventing Attention Entropy Collapse.” <em>Proceedings of ICML</em>. Tracks attention entropy during training and identifies pathological entropy collapse. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Peter Fields</name></author><category term="attention" /><category term="softmax" /><category term="hypothesis testing" /><category term="KL divergence" /><category term="machine learning" /><category term="deep learning" /><summary type="html"><![CDATA[Softmax is ubiquitous in transformers, yet its role in attention can feel more heuristic than inevitable. In this post, I try to make it feel more natural and show how this interpretation suggests useful diagnostics for the often circuit-like behavior of attention heads.]]></summary></entry></feed>