Attention Diagnostics: Testing KL and Susceptibility on the IOI Circuit
The previous post introduced KL selectivity and susceptibility χ as per-head diagnostics derivable from attention weights alone. Here I test them on GPT-2-small’s IOI circuit: can two scalar statistics, computed from a single forward pass, distinguish the 23 known circuit heads from the remaining 121 of the model’s 144 attention heads? It seems so!