Do interpretability tools actually work? NLA and SAE audits

Project summary

Two of my projects test the same underlying question from different angles: when an interpretability method makes a claim about what is happening inside a model, is the claim actually true in a useful sense?

NLAttack is an evaluation harness for Natural Language Autoencoders (NLAs), an architecture that compresses model activations into natural language text and then reconstructs the activations from that text. The claim behind NLAs is that the text bottleneck preserves enough information to be readable and useful as a monitoring layer. NLAttack tests that claim directly. Across 8 concepts, the NLA I built and evaluated scored an EmergenceIndex of 0.601 (decodability 1.00, stability 0.88), meaning the text bottleneck captures the concept faithfully but is not fully stable across paraphrases. That is a real, falsifiable result, not a marketing claim.
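To illustrate the kind of measurement involved, here is a minimal Python sketch of a decodability score and a paraphrase-stability score. The function names, the concept labels, the toy data, and both definitions are assumptions made for this example. They are not NLAttack's actual formulas, and the sketch does not attempt to reproduce how EmergenceIndex combines its components.

```python
from statistics import mean


def decodability(detected_on_original: dict[str, bool]) -> float:
    """Fraction of concepts recovered from the bottleneck text on the original prompt.

    Hypothetical definition for illustration only.
    """
    return mean(1.0 if hit else 0.0 for hit in detected_on_original.values())


def paraphrase_stability(detected_on_paraphrases: dict[str, list[bool]]) -> float:
    """Mean, over concepts, of the fraction of paraphrases where the concept is still recovered.

    Hypothetical definition for illustration only.
    """
    per_concept = [
        sum(hits) / len(hits)
        for hits in detected_on_paraphrases.values()
        if hits  # skip concepts that have no paraphrases
    ]
    return mean(per_concept)


# Made-up toy data, not results from any real evaluation.
original = {"concept_a": True, "concept_b": False}
paraphrases = {
    "concept_a": [True, True, False, True],
    "concept_b": [True, False, True, False],
}

dec = decodability(original)
stab = paraphrase_stability(paraphrases)
print(f"decodability={dec:.3f} stability={stab:.3f}")
```

Separating the two scores matters because a bottleneck can recover a concept on the original prompt and still lose it under rewording, which is the gap the stability number is meant to expose.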

Secret Agenda, my AAAI 2026 paper, tests a parallel claim from the other side. Sparse autoencoders (SAEs) produce auto-labeled features, and labs increasingly treat those labels as ground truth about what a feature detects. I tested whether SAE features auto-labeled for deception actually activate during strategic lying, and they mostly do not. That is evidence that an SAE feature label does not reliably predict the behavior it claims to represent.
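The shape of such a label audit can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure used in Secret Agenda: the feature ids, the activation values, the `min_gap` threshold, and the pass criterion (mean activation on deceptive transcripts exceeding mean activation on honest ones by a margin) are all hypothetical.

```python
from statistics import mean


def label_tracks_behavior(honest_acts, deceptive_acts, min_gap=0.1):
    """True if a feature fires more on deceptive text than on honest text by at least min_gap.

    The criterion and the default threshold are assumptions for this sketch.
    """
    return mean(deceptive_acts) - mean(honest_acts) >= min_gap


def audit_labeled_features(features, min_gap=0.1):
    """Return (pass rate, passing feature ids) for features auto-labeled as deception-related.

    features maps a feature id to (honest activations, deceptive activations).
    """
    passing = [
        fid
        for fid, (honest, deceptive) in features.items()
        if label_tracks_behavior(honest, deceptive, min_gap)
    ]
    return len(passing) / len(features), passing


# Made-up activations for three hypothetical deception-labeled features.
toy_features = {
    "f1": ([0.0, 0.1], [0.0, 0.1]),  # no difference between conditions
    "f2": ([0.0, 0.0], [0.5, 0.7]),  # fires mainly on deceptive transcripts
    "f3": ([0.2, 0.2], [0.1, 0.1]),  # fires more on honest transcripts
}

rate, passing_ids = audit_labeled_features(toy_features)
print(f"{rate:.2f} of labeled features track the behavior: {passing_ids}")
```

A low pass rate in a check of this kind is what "the label does not predict the behavior" means operationally: the label describes the feature's top-activating examples, not its response to the behavior being monitored.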

Put together, these are two audits of the same failure mode: interpretability methods that produce plausible-sounding outputs (readable NLA text, confidently labeled SAE features) without any guarantee that the output tracks the underlying computation. Both projects need the same kind of adversarial evaluation work, and I am set up to do both at once.

What are this project's goals? How will you achieve them?

The plan is six months of full-time work across both projects, with four goals. First, scale NLAttack's evaluation coverage from 8 concepts to 50+ and extend it to additional NLA architectures. Second, design and run an adversarial attack suite against NLA text bottlenecks, including concept-masking prompts, distributional shift, and steering vector interference. Third, extend the SAE feature-label audit methodology from deception to additional safety-relevant feature categories. Fourth, publish both the NLAttack methodology paper and an extended SAE audit report.

I am the right person for this because I built the NLA that NLAttack evaluates and, as lead author of Secret Agenda, developed its SAE audit methodology. Understanding failure modes from the builder's side is what makes the adversarial tests meaningful rather than superficial.

How will this funding be used?

$110,500 for 6 months, split across both projects since they share methodology and infrastructure. About 90% ($99,000) is my stipend for 6 months of full-time work. It is benchmarked against the Seattle ML Research Engineer median of $221K/yr (Glassdoor 2025) and already accounts for self-employment tax. The remaining 10% ($11,500) covers compute and API costs: scaling NLA evaluations to additional families and models, Neuronpedia API calls for the SAE audit, and GPU hours for adversarial testing.

Who is on your team? What's your track record on similar projects?

Solo project. I built and published nla-gemma-4-e2b (a Gemma-4-2B NLA with full eval results on HuggingFace) and NLAttack itself. I am also the lead author of Secret Agenda (AAAI 2026, accepted), which established the SAE feature-label auditing methodology this project extends. Separately, I built BioRefusalAudit, a format-sensitivity study of biomedical refusal behavior across frontier models. GitHub: github.com/SolshineCode. HuggingFace: huggingface.co/Solshine.

What are the most likely causes and outcomes if this project fails?

The most likely failure mode is running out of runway before scaling either evaluation to enough breadth for a strong publication. I have about 6 weeks of personal runway, so without funding I would take a full-time job and both projects would stall at their current scope: NLAttack at 8 concepts and the SAE audit limited to deception features. A partial failure looks like completing one project's scaling but not the other. In that case I would prioritize finishing the NLA adversarial attack suite first, since that result is closer to complete.

How much money have you raised in the last 12 months, and from where?

Zero raised so far. I have active applications for this same work with the Survival and Flourishing Fund (via Copyleft Cultivars, 501(c)(3) fiscal sponsor, submitted 2026-06-30) and the Long-Term Future Fund (submitted 2026-07-01), both awaiting decision.