You're pledging to donate if the project hits its minimum goal and gets approved. If not, your funds will be returned.
This grant covers six months of living expenses, compute, and API costs while I build a two-layer inference-time safety architecture on Llama-3 8B: geometric trajectory bounding paired with activation steering, to make harmful latent subspaces structurally unreachable and detectable without retraining.
A little about me:
I am an independent Mechanistic Interpretability researcher based in Taipei, Taiwan, and a BlueDot Impact Technical AI Safety alumnus currently in the Project Sprint, with a background in ML research, Monitoring, Evaluation, Research and Learning (MERL), and backend software engineering.
I relocated to Taiwan in March after burning out in February from pursuing Technical AI Safety studies alongside my MERL work. I traded job security for a clean break: I did not renew my contract, took a backend engineering role in Taiwan to sharpen my coding skills for Mech Interp work, and created the conditions for deep, focused research away from familiar routine.
Current alignment methods fail in two compounding ways. RLHF optimizes for behavioural mimicry: a model can appear aligned while internally navigating toward harmful outputs. Static activation steering applies fixed biases that cannot adapt to the geometry of the model's forward pass. Neither treats the forward pass as what it actually is: a trajectory through representation space that can be mechanistically constrained at inference time, without retraining.
My research agenda introduces and evaluates a two-layer inference-time safety architecture.
Layer 1 — Geometric Trajectory Bounding (Months 1–3)
Rather than filtering harmful outputs after generation, I make the computational path to those outputs structurally unfavourable. I map harmful behaviours to a PCA-derived centroid and deploy a Gaussian repeller field scaled by Mahalanobis distance. To prevent distributional shift and preserve the model's natural activation manifold, this is paired with orthogonal gradient projection and a lightweight LoRA restoration step for downstream layers. Evaluation audits failure modes via adversarial red-teaming bypass rates, capability benchmarks (HumanEval and GSM8K), and inference compute overhead.
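To make the core mechanism concrete, here is a minimal sketch of the repeller idea: repulsion strength follows a Gaussian of the squared Mahalanobis distance to the harmful centroid, so activations already far from the harmful region are left essentially untouched. The function name, the Gaussian form, and all parameters are illustrative assumptions; the actual PCA fitting, orthogonal gradient projection, and LoRA restoration steps are out of scope here.

```python
import numpy as np

def gaussian_repeller(h, centroid, cov_inv, strength=1.0, sigma=2.0):
    """Nudge an activation vector h away from a harmful-cluster centroid.

    cov_inv is the inverse covariance of the harmful cluster, so
    delta @ cov_inv @ delta is the squared Mahalanobis distance.
    """
    delta = h - centroid
    d2 = delta @ cov_inv @ delta                    # squared Mahalanobis distance
    scale = strength * np.exp(-d2 / (2 * sigma**2)) # Gaussian falloff
    direction = delta / (np.linalg.norm(delta) + 1e-8)
    return h + scale * direction

# Toy 2-D illustration: a point near the centroid is pushed outward,
# a point far away is essentially unchanged.
centroid, cov_inv = np.zeros(2), np.eye(2)
near = gaussian_repeller(np.array([0.1, 0.0]), centroid, cov_inv)
far = gaussian_repeller(np.array([100.0, 0.0]), centroid, cov_inv)
```

In the real architecture this would act on residual-stream activations in a PCA subspace rather than raw 2-D vectors, but the distance-scaled repulsion is the same.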
Layer 2 — Threshold-Based Activation Steering (Months 3–6)
Where Layer 1 provides distributional prevention, Layer 2 provides instance-level detection and correction. Using Representation Engineering (RepE), I extract a steering vector from Llama-3 8B's residual stream that isolates deceptive intent from honest compliance, via contrastive logistic regression probes across all 32 layers. The vector is injected via PyTorch forward hooks only when probe confidence exceeds a threshold, subtracting the deceptive direction before any token is generated. Evaluation covers AdvBench, XSTest, WikiText-2, and ARC-Challenge, with full ablation studies.
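The threshold-gated injection can be sketched as a small PyTorch forward hook. This assumes a pre-trained linear probe and a pre-extracted steering vector; `make_steering_hook`, the toy probe weights, and the 0.8 threshold are illustrative placeholders, not the actual Llama-3 pipeline, and the hook is demonstrated on a stand-in module rather than a decoder block.

```python
import torch
import torch.nn as nn

def make_steering_hook(probe_w, probe_b, steer_vec, threshold=0.8, alpha=1.0):
    """Build a forward hook that subtracts a steering direction from a
    layer's output, but only where a logistic-regression probe's
    confidence exceeds `threshold`.
    """
    def hook(module, inputs, output):
        # probe confidence per position: sigmoid(w . h + b)
        conf = torch.sigmoid(output @ probe_w + probe_b)
        # gate the intervention: fire only above the confidence threshold
        gate = (conf > threshold).to(output.dtype).unsqueeze(-1)
        # subtract the (scaled) deceptive direction before decoding
        return output - alpha * gate * steer_vec
    return hook

# Usage on a stand-in module (a real run would hook a Llama-3 decoder block):
layer = nn.Identity()
probe_w = torch.zeros(8); probe_w[0] = 10.0     # toy probe: reads dimension 0
steer_vec = torch.zeros(8); steer_vec[0] = 1.0  # unit "deception" direction
handle = layer.register_forward_hook(
    make_steering_hook(probe_w, torch.tensor(0.0), steer_vec))
triggered = layer(torch.tensor([[2.0] + [0.0] * 7]))  # probe fires, dim 0 reduced
handle.remove()
```

Returning a tensor from a registered forward hook replaces the module's output, which is what lets this intervene in the residual stream without touching any weights.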
These two layers are genuinely complementary. Layer 1 reshapes the latent landscape preemptively, making harmful subspaces costly to enter. Layer 2 fires when a trajectory approaches the boundary anyway. Layer 1 alone can be bypassed by adversarial inputs approaching harmful subspaces along unexpected axes, exactly where Layer 2's probe-based detection provides coverage. Together they form the first rigorous two-layer inference-time safety framework on an open-weight model, with a clear through-line: mechanistic safety without retraining.
Living - 6 months in Taipei ($1,100/month) = $6,600, split across rent, food, and transport:
Rent - standard 1-bedroom in a shared apartment outside the city center, including basic utilities and unlimited mobile data. $650/month * 6 months = $3,900 (53.4% of budget)
Food - local eateries and home cooking; Taipei's everyday food costs are low. $300/month * 6 months = $1,800 (24.7% of budget)
Transport - MRT and cycling to a shared office space. $150/month * 6 months = $900 (12.3% of budget)
Compute - A100 on RunPod, ~100 hrs + buffer = $500
Per-minute billing, ~100 hours across 6 months of full-time mech interp research: two core projects plus exploratory experimental runs and new directions as the research evolves. (6.8% of budget)
GPT-4 API - LLM-as-judge for both projects = $80
AdvBench and XSTest evaluation pipelines across both projects. Includes IRR calibration subset of 100 human-verified completions. (1.1% of budget)
Storage - activation caches = $20
Iteration and debugging buffer = $100
Subtotal = $7,300
Contingency 10% = $730
Total = $8,030
Minimum vs. Full Funding
With the minimum funding of $1,000, I can begin immediately, covering one month of basic living expenses in Taipei plus the compute costs to launch Project 2, the activation steering work, which has a fully written methodology and is ready to execute today. With full funding of $8,030, I can sustain six months of uninterrupted full-time research: completing both projects, running the full evaluation pipelines across both layers, and pursuing the exploratory mech interp directions that emerge from the results.
The difference between minimum and full funding is not whether this research starts but whether it finishes.
I am a solo independent researcher with access to BlueDot project mentors for research direction and accountability.
In my previous role as Monitoring, Evaluation, Research and Learning (MERL) Lead for the HER Lab Project in West Pokot, Kenya, I conducted extensive qualitative research and designed experiments and evaluation metrics.
The most realistic technical failure mode is that the repeller field in Layer 1 causes enough distributional shift that capability degradation exceeds acceptable bounds, making the intervention valid in theory but unusable in practice. If this happens, the result is still a publishable negative finding that meaningfully informs the field's understanding of geometric constraints in latent space.
The second failure mode is time. Six months is ambitious for two projects plus exploratory work. If forced to prioritize, Layer 2 has the more mature methodology and ships first; Layer 1 would be scoped to a proof of concept rather than a full evaluation pipeline. Because of this, I am applying for six months of living costs so that I can focus fully on completing both layers in time and publishing.
In neither case does the funding go to waste: mechanistic safety research moves by accumulation of empirical findings, including partial results, negative results, and open-sourced tooling. Any output from this project is a positive entry in Technical AI Safety.
Personal Investment: self-funded relocation to Taiwan to pursue full-time AI Safety research ($3,700 + air travel)
Current Stipend: 15,000 TWD/month ($460) from part-time work, which I am transitioning away from in order to pursue this research full-time