When do trained networks genuinely compute in superposition?

Project summary

Networks store more features than they have dimensions, a phenomenon known as superposition (Elhage et al., 2022). The stronger and more important claim is that a single MLP can also compute over many of those features while they stay in superposition, which would mean much of a model's real work is hidden inside overlapping directions. Theory says this is possible and even cheap: Hänni et al. (2024) prove that one ReLU MLP can compute all pairwise ANDs of m sparse boolean features using on the order of m^(2/3) neurons. But the empirical side is unsettled. When that task is actually trained, the learned network is dense and does not match the efficient theoretical construction (Newgas, 2025). A separate line of work argues that the standard "compressed computation" demonstration is really compressed storage plus a linear readout rather than genuine per-feature nonlinear computation, and gives a noise-sensitivity test to catch the trivial solution. This project builds a clean toy model that genuinely computes in superposition, and maps when trained networks reach it and when they fall back on something simpler. The output is a paper.
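To make the task concrete, here is a minimal sketch of the data for the pairwise-AND problem, under my own assumptions. The function names `sample_features` and `pairwise_and_targets`, and the values of `m`, `p`, and the batch size, are illustrative choices and are not taken from Hänni et al. (2024). The sketch only builds inputs and targets; embedding the features into fewer dimensions than m is left out.

```python
import numpy as np


def sample_features(batch, m, p, rng):
    """Sample sparse boolean features: each of the m features is on with probability p."""
    return (rng.random((batch, m)) < p).astype(np.float32)


def pairwise_and_targets(x):
    """Return all m-choose-2 pairwise ANDs; for 0/1 inputs, AND equals the product."""
    m = x.shape[1]
    i, j = np.triu_indices(m, k=1)  # index pairs (i, j) with i < j
    return x[:, i] * x[:, j]


# Illustrative values only (assumptions, not the paper's settings).
rng = np.random.default_rng(0)
x = sample_features(batch=4, m=6, p=0.2, rng=rng)
y = pairwise_and_targets(x)  # shape (4, 15), since 6 choose 2 = 15
```

The number of targets grows quadratically in m, which is why computing all of them with far fewer neurons than targets requires the neurons to be shared across pairs.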

What are this project's goals? How will you achieve them?

The project has two goals: a clean, well-understood toy model that genuinely computes in superposition, and a clear picture of the regime where trained networks achieve it versus where they only do compressed storage.

Plan:

Reproduce the baselines: representation superposition (Elhage et al., 2022) and the Universal-AND computation-in-superposition task (Hänni et al., 2024).
Reproduce the trained-versus-theory gap: confirm that a trained Universal-AND network is dense and does not match the sparse construction (Newgas, 2025), then apply the noise-sensitivity test from the compressed-computation critique to check where the trained solution is and is not genuine computation in superposition.
Build a clean setup that forces genuine computation into the MLP. In a small transformer the attention softmax is itself nonlinear and can absorb the computation, so I will constrain attention (linear or no-softmax attention, a frozen attention pattern, or single-token multi-feature inputs) so the MLP cannot offload the work. This comes directly from my own earlier experiments, where an XOR task failed to isolate the MLP for exactly this reason.
Map the regime and study error correction: sweep sparsity, the feature-to-neuron ratio, and task complexity to find where genuine computation in superposition appears, confirm the solution passes the disqualification tests (MLP non-linearity load-bearing, not solved by a linear readout or by attention), and characterise how the network suppresses the interference between features, which theory says any computation in superposition requires (Hänni et al., 2024).
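One of the disqualification tests above, "not solved by a linear readout", can be sketched as a baseline comparison. This is a minimal illustration under my own assumptions, not the noise-sensitivity test from the compressed-computation critique. The helper names and the values of m, p, and the sample count are hypothetical. The ReLU reference uses one neuron per pair, so it is a dense solution and not one in superposition; it is only there to show what a load-bearing nonlinearity achieves that the best affine map cannot.

```python
import numpy as np


def pairwise_and(x):
    """All pairwise ANDs of 0/1 features, computed as products."""
    i, j = np.triu_indices(x.shape[1], k=1)
    return x[:, i] * x[:, j]


def linear_readout_mse(x, y):
    """MSE of the best affine map from inputs to targets, fit by least squares."""
    xb = np.hstack([x, np.ones((x.shape[0], 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(xb, y, rcond=None)
    return float(np.mean((xb @ w - y) ** 2))


def relu_and_mse(x, y):
    """MSE of the dense reference solution AND(a, b) = ReLU(a + b - 1)."""
    i, j = np.triu_indices(x.shape[1], k=1)
    pred = np.maximum(x[:, i] + x[:, j] - 1.0, 0.0)
    return float(np.mean((pred - y) ** 2))


# Illustrative values only (assumptions): 8 features, each on with probability 0.3.
rng = np.random.default_rng(0)
x = (rng.random((2000, 8)) < 0.3).astype(np.float64)
y = pairwise_and(x)

linear_mse = linear_readout_mse(x, y)  # strictly positive: AND is not affine
relu_mse = relu_and_mse(x, y)          # exactly zero on 0/1 inputs
```

In the actual experiments the same comparison would be run on the trained network's MLP input: if an affine readout from that input matches the trained network's loss, the MLP nonlinearity is not doing the work.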

Deliverables in three months: an open-source repository, extending my existing one, with the toy models and analysis; a short paper for a mechanistic interpretability workshop or arXiv on when trained toy networks genuinely compute in superposition versus compressed computation, and what the constrained setup buys; and a write-up of the main results. A clean negative result is still useful, and I will report it honestly.

How will this funding be used?

Stipend, three months at 1,200 USD per month: 3,600 USD.
Compute, small GPU rental for the sweeps: 500 USD.
Paper and workshop costs, a virtual registration fee if the work is accepted to a workshop: 300 USD.
Total (funding goal): 4,400 USD.

The models are tiny, so compute is a small line. Posting to arXiv is free, I would present virtually so there is no travel cost, and major venues offer fee waivers and financial assistance that may remove the registration line.

Who is on your team? What's your track record on similar projects?

This is a solo project.

I hold an MSc in AI for Science from AIMS South Africa and the University of Cape Town, where I was a Google DeepMind Scholar, with a thesis on neural reasoning for the ARC-AGI benchmark. I recently completed a work trial and reached the interview stage for the Pivotal Research Fellowship in its mechanistic interpretability stream, where my work was on toy-model experiments on superposition and memorisation. I also reached the personal interview stage with LASR Labs.

My public notes and first experiments on this exact topic are in my repository: github.com/AhMedDa1/mech-interp-journey. It includes a reading and framing project on computation in superposition, and a set of small experiments, with a writeup, on where memorisation lives in a tiny transformer and where factual answers emerge across layers in GPT-2 small. Those experiments are where I found that a memorisation task is mostly linear, that an XOR upgrade still does not isolate the MLP because attention's softmax is also nonlinear, and that you have to constrain attention to study MLP-level computation cleanly. That finding is the starting point for this project.

What are the most likely causes and outcomes if this project fails?

The risks are about the toy model. Constraining attention could make the task either trivial or hard to train, so finding the regime where the MLP genuinely has to compute in superposition, and actually learns to, may take iteration. The interference and error-correction structure may be harder to read cleanly in a toy model than I hope, or may depend on choices like the activation function and depth in ways that complicate the story. And toy-model results do not automatically transfer to real models, so the contribution is a controlled testbed and a clear demonstration, not a claim about frontier systems. In the worst case I get a clean negative, for example that even with constrained attention the network avoids genuine computation in superposition, which is still informative for the field.

How much money have you raised in the last 12 months, and from where?

About 100 USD in compute credits from Modal. I have not raised any cash funding for this or related research in the last 12 months.