My research focuses on attention-free sequence models for structured reasoning tasks. Through 3,000+ systematic architecture search trials, I identified a model with 103K parameters—using hierarchical pooling, dilated convolutions, and spectral filters—that achieves 97.6% average accuracy across three structured tasks. A parameter-matched Transformer baseline reaches only 64.8%.
Task-level results: Nested Depth (99.8% vs. 92.5%), MultiScale Copy (98.6% vs. 52.2%), and Hierarchical Parity (94.4% vs. 49.7%). All mechanisms operate in O(L) time complexity with no attention layers.
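To make the mechanism families concrete, the sketch below is a minimal PyTorch block combining a dilated convolution, hierarchical pooling, and a learned spectral filter. It is illustrative only: the actual arch_2334 wiring, sizes, and hyperparameters come from the architecture search and live in the linked repository, and every name and dimension here (AttentionFreeBlock, max_len, pool_stride) is an assumption made for the example.

```python
# Illustrative sketch only: a minimal attention-free block combining the three
# mechanism families named above. The real arch_2334 model differs; see the repo.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionFreeBlock(nn.Module):
    def __init__(self, dim: int, max_len: int = 512, dilation: int = 2, pool_stride: int = 2):
        super().__init__()
        # Dilated convolution: mixes tokens across a widened local window.
        self.dilated_conv = nn.Conv1d(dim, dim, kernel_size=3,
                                      padding=dilation, dilation=dilation)
        # Hierarchical pooling: a coarse, lower-resolution summary pathway.
        self.pool = nn.AvgPool1d(kernel_size=pool_stride, stride=pool_stride)
        # Learned spectral filter: per-channel, per-frequency gain
        # (assumes sequence length <= max_len).
        self.freq_gain = nn.Parameter(torch.ones(dim, max_len // 2 + 1))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim)
        h = x.transpose(1, 2)                                    # (batch, dim, length)
        local = self.dilated_conv(h)                             # same length (kernel 3)
        coarse = F.interpolate(self.pool(h), size=h.shape[-1],   # upsample summary back
                               mode="nearest")
        spec = torch.fft.rfft(h, dim=-1)
        spec = spec * self.freq_gain[:, : spec.shape[-1]]
        spectral = torch.fft.irfft(spec, n=h.shape[-1], dim=-1)
        out = (local + coarse + spectral).transpose(1, 2)
        return self.norm(out + x)                                # residual connection
```

A full model would stack several such blocks over a token embedding with a readout head; none of the pathways uses attention, and each runs in linear or near-linear time in sequence length.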
The next step is mechanistic interpretability: moving beyond performance metrics to understand why these non-attention mechanisms succeed, and connecting that understanding to AI safety—specifically, whether removing attention improves interpretability and formal verifiability.
Code and full experimental results are publicly available on GitHub: https://github.com/sunghunkwag/attention-free-sequence-model
The objective is to close a tooling gap in mechanistic interpretability. While Transformers have established probing and visualization methods, no comparable toolkit exists for O(L) attention-free architectures built from hierarchical pooling, spectral filters, and dilated convolutions. Without such tools, the safety community cannot systematically audit non-Transformer models.
Here's why this tooling gap matters for safety: almost all mechanistic interpretability work today assumes attention and a residual stream. If the field keeps building deeper understanding only for Transformers, then any performant O(L) architecture that replaces them becomes a safety-relevant area we barely understand—deployed widely without the ability to audit its internal circuits.
Concrete deliverables:
Implement 4–5 interpretability probes adapted for O(L) mechanisms (dilated convolution, spectral filters, strided convolution, Conv-GRU); a minimal probe sketch follows this list
Characterize internal representations of the arch_2334 model (the 103K-parameter model described above) using these probes
Produce a technical report linking mechanistic findings to alignment-relevant properties: predictability, auditability, and resistance to deceptive internal behavior
Release fully reproducible code and documentation via GitHub
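As a rough illustration of what the first deliverable involves, the sketch below adapts a standard linear-probing recipe to an attention-free block: capture a layer's activations with a forward hook, then fit a token-level linear classifier for a structural property such as nesting depth. Everything here (`model`, `model.blocks[2]`, `depth_labels`) is a hypothetical placeholder, not code from the existing repository, and the actual probes will be adapted per mechanism.

```python
# Hedged sketch of a generic linear probe over frozen activations; placeholder names.
import torch
import torch.nn as nn


def collect_activations(model: nn.Module, layer: nn.Module, inputs: torch.Tensor) -> torch.Tensor:
    """Run the model and capture the chosen layer's output via a forward hook."""
    cache = {}

    def hook(_module, _inp, out):
        cache["act"] = out.detach()

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(inputs)
    handle.remove()
    return cache["act"]                       # (batch, length, dim)


def fit_linear_probe(acts: torch.Tensor, labels: torch.Tensor,
                     n_classes: int, steps: int = 500) -> float:
    """Train a token-level linear classifier on frozen activations; return accuracy."""
    x = acts.reshape(-1, acts.shape[-1])      # (batch * length, dim)
    y = labels.reshape(-1)                    # (batch * length,)
    probe = nn.Linear(x.shape[-1], n_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(x), y)
        loss.backward()
        opt.step()
    with torch.no_grad():
        acc = (probe(x).argmax(-1) == y).float().mean().item()
    return acc


# Hypothetical usage: does block 2 linearly encode per-token nesting depth?
# acts = collect_activations(model, model.blocks[2], token_ids)
# print(fit_linear_probe(acts, depth_labels, n_classes=max_depth + 1))
```

The planned probes extend this pattern to mechanism-specific views, for example per-dilation-rate receptive fields or per-frequency-band activations in the spectral filters.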
This work extends an existing codebase that already includes training infrastructure and benchmarks, so effort can go directly into interpretability tooling without infrastructure setup delays.
As an independent researcher, I have no institutional overhead or administrative fees. The $800 budget covers direct research costs only:
~$400 for cloud GPU compute (~50–80 hours) to run interpretability experiments
~$400 as a stipend for 3–4 weeks of focused, full-time work
Independent Systems Research & Bootstrapped Validation
I am an independent researcher who has self-funded and executed 3,000+ systematic architecture search experiments over the past year, resulting in a concrete, reproducible attention-free model that outperforms Transformer baselines (97.6% vs. 64.8%) on structured algorithmic tasks with only 103K parameters. All training pipelines, search logs, and evaluation code are maintained in public GitHub repositories with detailed documentation, allowing full auditability of every claim in this proposal.
Background in Formal Reasoning
My academic foundation is in analytic philosophy (completed through third year), which provided rigorous training in logical structure, conceptual analysis, and formal reasoning. This background is a direct asset for mechanistic interpretability work, where the goal is precisely to reverse-engineer how discrete computational rules emerge from distributed representations.
Why This Is My First External Ask
This $800 request represents the first external funding I have sought for this line of work. I deliberately front-loaded personal time and compute costs to reach a functioning proof-of-concept before asking any institution or platform to share the risk. The zero prior funding reflects bootstrapping, not rejection—I am now seeking small-scale validation to accelerate the next phase of safety-relevant interpretability benchmarking.
The primary risk is that interpretability probes for these architectures yield limited mechanistic insight—either because the representations are less decomposable than anticipated, or because existing probe designs do not transfer well to O(L) mechanisms. This would constitute a null result, but a valuable one: it would establish empirical boundaries for interpreting attention-free models, directly informing safety assessments of non-Transformer architectures.
A secondary risk is compute time underestimation. Mitigation includes using the existing operational codebase to avoid setup delays, and a timeline that accommodates partial completion. In the event of delays, all intermediate results and methodology will still be documented and shared openly.
$0 from external sources in the last 12 months. This project has been self-funded to date; see the track record section for context on the bootstrapping approach.