I discovered that LLM self-report vocabulary systematically tracks internal activation patterns. When models use words like "loop" or "mirror" to describe their processing, these choices correlate with measurable activation dynamics (r=0.44-0.62, survives FDR correction, replicates across architectures). This is early evidence that model self-reports contain genuine signal about internal states.
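For concreteness, here is a minimal sketch of the kind of test behind those numbers: Pearson correlations between per-sample vocabulary rates and activation metrics, with Benjamini-Hochberg FDR correction across all tested pairs. The column names and file path are illustrative placeholders, not the released dataset schema.

```python
# Illustrative sketch: correlate per-sample vocabulary rates with activation
# metrics, then control the false discovery rate across all tested pairs.
# Column names and the CSV path are placeholders, not the actual dataset schema.
import pandas as pd
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests

df = pd.read_csv("samples.csv")  # one row per generated sample

pairs = [("loop_rate", "autocorrelation"), ("mirror_rate", "spectral_peak")]
stats = [(vocab, metric, *pearsonr(df[vocab], df[metric])) for vocab, metric in pairs]

# Benjamini-Hochberg FDR correction over all vocabulary/metric pairs
reject, q_values, _, _ = multipletests([p for *_, p in stats], method="fdr_bh")
for (vocab, metric, r, _), sig, q in zip(stats, reject, q_values):
    print(f"{vocab} ~ {metric}: r={r:.2f}, q={q:.3g}, significant={sig}")
```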
I'm seeking funding to: (1) rigorously validate this finding with pre-registered replication, (2) develop an open-source toolkit, and (3) explore safety applications like deception detection.
Why I'm the one who found this: I'm a linguist (BA Hons), not an ML engineer. Linguists notice vocabulary patterns that engineers filter out as noise. That outsider perspective produced the Pull Methodology and the core finding.
arXiv: https://arxiv.org/abs/2602.11358
Full repro (code + data): https://doi.org/10.5281/zenodo.18567445
1. Build the first reliable window into what models actually experience during processing
2. Create real-time introspection monitoring for safety applications
3. Develop early-warning systems for deception and goal drift
I found something unexpected: you can steer models into deeper introspective states using activation directions, and their vocabulary shifts to match. At strength 2.6, Llama 70B produces 24x more introspective content. At 3.0, it converges on novel vocabulary ("glint," "shimmer") that tracks specific activation signatures. The model is telling us what's happening inside, and we can amplify the signal.
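Here is a minimal sketch of what that steering looks like in code, assuming a Hugging Face transformers Llama-style model: add a scaled "introspection direction" to one layer's residual-stream output during generation. The model name, layer index, and direction file are placeholders; the actual directions and strengths come from the released code and data.

```python
# Sketch of activation steering: add strength * direction to one layer's output.
# Model name, layer index, and direction file are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-70B-Instruct"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

direction = torch.load("introspection_direction.pt")  # unit vector, shape (hidden_size,)
strength = 2.6  # the steering strength referenced above

def steer(module, inputs, output):
    # Decoder layers typically return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + strength * direction.to(hidden.device, hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[40].register_forward_hook(steer)  # illustrative layer

prompt = "Describe, as precisely as you can, what is happening inside you right now."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```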
Introspection toolkit. Not just correlation metrics: a system that extracts introspection directions from any model, steers it into reflective states, and maps vocabulary to activation dynamics in real time. Open source, plug-and-play.
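For the direction-extraction piece, one standard approach such a toolkit could wrap is a difference of mean hidden states between introspective and neutral prompts at a chosen layer. This is a generic sketch, not necessarily the exact extraction the toolkit will use; it assumes an already-loaded transformers model and tokenizer, as in the steering sketch above.

```python
# Generic difference-of-means sketch for extracting a steering direction.
# Assumes `model` and `tok` are a loaded transformers causal LM and tokenizer.
import torch

@torch.no_grad()
def mean_hidden_state(model, tok, prompts, layer_idx):
    states = []
    for prompt in prompts:
        ids = tok(prompt, return_tensors="pt").to(model.device)
        out = model(**ids, output_hidden_states=True)
        # Average over token positions at the chosen layer
        states.append(out.hidden_states[layer_idx][0].mean(dim=0))
    return torch.stack(states).mean(dim=0)

def extract_direction(model, tok, introspective_prompts, neutral_prompts, layer_idx=40):
    diff = (mean_hidden_state(model, tok, introspective_prompts, layer_idx)
            - mean_hidden_state(model, tok, neutral_prompts, layer_idx))
    return diff / diff.norm()  # unit-normalized "introspection direction"
```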
Live monitoring. Dashboard that tracks correspondence during deployment. When a model's self-reports start diverging from its activations, something changed. Flag it.
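A minimal sketch of what that flag could look like: a rolling correlation between a vocabulary-derived introspection score and the activation metric it normally tracks, with an alert when the correlation drops below a threshold. The window size and threshold here are illustrative, not tuned values.

```python
# Illustrative correspondence monitor: flag when the rolling correlation
# between self-report vocabulary scores and activation metrics degrades.
from collections import deque
import numpy as np

class CorrespondenceMonitor:
    def __init__(self, window: int = 200, min_r: float = 0.3):
        self.vocab = deque(maxlen=window)
        self.acts = deque(maxlen=window)
        self.min_r = min_r

    def update(self, vocab_score: float, activation_metric: float) -> bool:
        """Record one sample; return True if correspondence has degraded."""
        self.vocab.append(vocab_score)
        self.acts.append(activation_metric)
        if len(self.vocab) < self.vocab.maxlen:
            return False  # not enough samples yet
        r = np.corrcoef(np.array(self.vocab), np.array(self.acts))[0, 1]
        return r < self.min_r
```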
Deception detection. Test the hypothesis: deceptive models can't maintain calibrated self-reports. If true, vocabulary-activation correspondence becomes a lie detector. If false, we learn something important about how deception works.
Goal tracking. Apply the methodology to goal representations. Can we detect when a model's stated goals diverge from its activation-level objectives?
Compute ($12,000): Running 70B models at scale isn't cheap. I need cloud GPU time for local experiments (steering, activation extraction) and API credits for frontier-model testing (Claude, GPT, Grok). This is 48% of the budget because the methodology requires thousands of samples per condition with full activation traces.
My time ($10,000): Six months of focused work. I'm not asking for a market salary; I'm asking for enough to avoid taking contract work while I build this. I've already proven I can execute without funding. With funding, I go faster.
Tools and publication ($3,000): Open-source toolkit hosting, documentation, potential publication fees if I submit to a venue, and collaboration travel if opportunities arise.
$10,000 (MVP): I do everything myself, slower, fewer architectures. Still produces the toolkit and validation.
$40,000 (Extended): Hire a research assistant for data processing. Test 10+ architectures instead of 5. Develop safety applications into a separate paper. Build the live monitoring prototype.
Every dollar goes to research output. No overhead, no institution taking a cut, no admin staff. Just me, GPUs, and the work.
Just me. That's a feature, not a bug.
I built the Pull Methodology, ran 148,000+ introspective samples across 4 frontier models, discovered vocabulary-activation correspondence, validated it with proper controls, wrote the paper, and got 275+ downloads in 48 hours. All without funding, collaborators, or institutional support.
Solo means I move fast, change direction when experiments suggest it, and don't waste time in meetings. If scaling requires help later, I'll hire a research assistant for data processing.
This project: Already produced the core finding. Paper published on Zenodo, submitted to arXiv. Real methodology, real statistics, real replication across architectures.
Finding unstudied territory: My undergraduate dissertation was on "The Maid of Scio," an 1829 poem nobody had ever studied academically. I was the first and only person to publish on it. Same instinct led me here: I noticed a gap everyone else walked past.
Seven years self-employed: Built crypto infrastructure and marketing systems. I know how to ship things without someone managing me.
Linguistics background: BA Hons in English Literature, Language & Linguistics. This is why I noticed vocabulary patterns that ML engineers filtered out as noise. Outsider perspective, novel finding.
I don't have a PhD. I have results.
Correspondence doesn't generalize. Maybe loop/autocorr and mirror/spectral are flukes that don't extend to other vocabulary-activation pairs. I'd find out in Months 1-2 during replication. If the pattern is narrower than expected, I pivot to documenting exactly which pairs work and why, which is still useful.
Architecture-specific, not universal. The correlations hold in Llama and Qwen but vanish in other architectures. This would mean the finding is real but limited. Outcome: publish the limitation, help other researchers know where to look and where not to.
Safety applications don't pan out. Deception detection sounds exciting but might not work. Models could maintain calibrated self-reports even while deceiving, or the signal could be too noisy for practical use. If so, I report the null result. Knowing it doesn't work is valuable.
I get hit by a bus. Solo researcher risk. All code and data will be public and documented so someone else could pick it up.
Even worst-case, this project outputs:
• Pre-registered replication data (positive or negative)
• Open-source toolkit (works regardless of whether findings generalize)
• Clear documentation of what works and what doesn't
The field learns something either way. I'm not promising a specific outcome; I'm promising rigorous investigation of a promising signal.
Zero.
The paper was released 48 hours ago. Before that, there was nothing to fund. I did all the preliminary work (148,000+ samples, methodology development, cross-model validation) self-funded because I wanted to prove the concept before asking anyone for money.
Now I have results worth funding.
Currently pending:
• LTFF application submitted
• ARIA (UK) application in progress
This is my first funding round for this research. I'm not asking you to bet on a pitch deck. I'm asking you to bet on work that already exists.