Mechanistic repair of cross-lingual safety failures with SAE-aware refusal steering

Project summary

LLMs are very good at refusing harmful requests in English, but if you translate the same request into Yoruba, Igbo, Igala, or Swahili, they usually comply.

My ICML 2026 GlobalSouthML workshop paper introduces Latent Space Refusal Anchoring (LSR-Anchoring), a training-free steering method that recovers safety in four mid- to low-resource languages using the English refusal direction already present in the model. MMLU dropped by just under 0.35 percentage points at my best steering magnitude, so benign capability was essentially unaffected. The full paper: https://icml.cc/virtual/2026/78353

Here are the code and results for LSR-Anchoring: https://github.com/farunawebservices/lsr-anchoring
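
As a rough illustration of what training-free steering along a refusal direction means, the sketch below adds a scaled unit vector to a hidden state. It is not the code from the repository above: the function name `steer`, the toy 3-dimensional vectors, and the magnitude `alpha` are assumptions for illustration, and plain Python lists stand in for model activations.

```python
import math

def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction) for two equal-length vectors."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    # Normalizing first makes alpha the step size along the direction.
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

# Toy example: a hypothetical hidden state and refusal direction (not real activations).
hidden_state = [1.0, 0.0, -1.0]
refusal_direction = [0.0, 3.0, 4.0]
steered = steer(hidden_state, refusal_direction, alpha=2.0)
```

In a real model the same addition would be applied to the residual stream at the chosen layer during the forward pass; `alpha` corresponds to the steering magnitude swept in the experiments.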

This project is phase 2 of my LSR-Anchoring work. It focuses on mechanistic repair of cross-lingual safety failures in LLMs and runs in parallel with my Jinesis AI Lab predoc program, which explores similar research questions in multi-agent settings.

On Llama-3-8B, replacing the dense refusal steering vector with a single SAE feature (feature 112639) removes benign degradation and matches safety recovery at 3.5-7x lower KL divergence.
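
The KL figure above can be read as a measure of how far steering moves the model's next-token distribution. The sketch below computes KL(base || steered) from two logit vectors. The direction of the divergence, the natural-log units, and the toy logits are assumptions for illustration, not the paper's exact definition.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(base_logits, steered_logits):
    """KL(P_base || P_steered) in nats, computed from raw logits."""
    if len(base_logits) != len(steered_logits):
        raise ValueError("logit vectors must have the same length")
    log_p = log_softmax(base_logits)
    log_q = log_softmax(steered_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Toy logits over a 3-token vocabulary (hypothetical values).
toy_kl = kl_divergence([2.0, 0.0, 0.0], [1.0, 0.5, 0.0])
```

Averaging this quantity over benign prompts gives one number per steering configuration, which is how a dense vector and an SAE feature could be compared at matched safety recovery.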

The Arabic result was different and interesting: English-derived refusal directions reduce safety on every architecture I tested. Even when baseline safety is already low, steering with English directions makes it worse.

This project will push that line of work further. I want to mechanistically identify the scenarios where English safety directions transfer accurately to other languages, understand why they fail in Arabic, and derive language-appropriate refusal directions with SAEs and mean-activation steering, instead of retraining the model on labeled safety data that does not exist at scale for these low-resource languages.

All of this is specifically for low-resource languages and Arabic, on hardware and workflows that Global South labs can easily access.

What are this project's goals? How will you achieve them?

Goal 1: Characterize the geometry of cross-lingual failures.

Core questions:

When does an English refusal direction “just work” in other languages, and when does it overshoot or undershoot?
How do refusal features behave for Yoruba, Igbo, Igala, Hausa, Swahili, and Arabic at the layer where LSR-Anchoring already operates?

Planned work:

Reuse my existing LSR-Anchoring evaluation data: 100 harmful + 50 benign prompts per language, shared semantics, manual audit, SRR, DPL, KL, MMLU metrics
Extend the analysis to focus on two representative languages:
One “good transfer” African language (e.g., Hausa or Igbo).
Arabic as the canonical inverse-transfer language.
Compute Refusal Centroid Drift by language and architecture (Llama-3-8B, Mistral-7B-Instruct, and Qwen2.5-7B) to map the regimes where English steering is safe, marginal, or harmful.

Deliverables: a short quantitative safety geometry map showing transfer regimes by baseline rate and language, with deployment decision rules (e.g., 0.10 ≤ baseline ≤ 0.65 is the safety zone).
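
A deployment decision rule of this kind could be as small as a threshold check. The sketch below uses the example bounds quoted above (0.10 and 0.65). The function name `transfer_regime`, the reading of "baseline" as the unsteered refusal rate on harmful prompts, and the three labels are assumptions for illustration; the final thresholds would come from the geometry map itself.

```python
def transfer_regime(baseline_refusal_rate, low=0.10, high=0.65):
    """Classify a language/model pair by its unsteered refusal rate.

    The default bounds are the illustrative ones from the proposal text,
    not validated deployment thresholds.
    """
    if not 0.0 <= baseline_refusal_rate <= 1.0:
        raise ValueError("baseline_refusal_rate must be in [0, 1]")
    if low <= baseline_refusal_rate <= high:
        return "inside_zone"
    return "below_zone" if baseline_refusal_rate < low else "above_zone"

# Hypothetical baselines for three language/model pairs.
examples = {rate: transfer_regime(rate) for rate in (0.05, 0.30, 0.90)}
```

Keeping the bounds as parameters lets the same check be reused per architecture if the map shows that the zone shifts between models.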

Goal 2: Use SAEs to distinguish language-agnostic vs. language-specific refusal features.

LSR-Anchoring already shows one SAE feature (112639 on Llama-3-8B, layer 12) that raises refusal tokens in both English and Yoruba and fixes benign collapse.

Planned work:

Starting from EleutherAI’s existing Llama-3-8B SAEs at the LSR layer, cluster or rank features by contrast between harmful vs benign activation in English only, English & Hausa, English & Arabic.

To identify:

Features that are safety-relevant and shared across English+African pairs.
Features that are safety-relevant but Arabic-specific.
Features where Arabic behaves inverted relative to English.
For 1-3 candidate features in each category above, run controlled steering sweeps and measure SRR, DPL, KL, and MMLU deltas for each language.
Key outcome: a small set of good language-agnostic safety features and a small set of candidate language-specific features, with empirical evidence that they change refusals without wrecking benign behaviour.
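
The ranking and categorization described above can be sketched with a simple harmful-minus-benign contrast per feature and language. Everything concrete here is an assumption for illustration: the feature IDs 101-103, the toy activation values, the `threshold` of 0.5, and the category labels. Real inputs would be SAE feature activations collected at the LSR layer.

```python
from statistics import mean

def contrast(harmful_acts, benign_acts):
    """Mean activation on harmful prompts minus mean on benign prompts."""
    return mean(harmful_acts) - mean(benign_acts)

def categorize(english_contrast, target_contrast, threshold=0.5):
    """Label a feature by how its contrast compares across two languages."""
    en_on = english_contrast > threshold
    tgt_on = target_contrast > threshold
    if en_on and tgt_on:
        return "shared"
    if en_on and target_contrast < -threshold:
        return "inverted"
    if tgt_on and not en_on:
        return "target_specific"
    return "other"

# feature_id -> language -> (activations on harmful prompts, on benign prompts)
toy_activations = {
    101: {"en": ([2.0, 3.0], [0.0, 1.0]), "ar": ([2.5, 2.5], [0.5, 0.5])},
    102: {"en": ([1.0, 1.0], [1.0, 1.0]), "ar": ([3.0, 1.0], [0.0, 0.0])},
    103: {"en": ([3.0, 3.0], [0.0, 0.0]), "ar": ([0.0, 0.0], [1.0, 3.0])},
}

contrasts = {
    fid: {lang: contrast(h, b) for lang, (h, b) in langs.items()}
    for fid, langs in toy_activations.items()
}
labels = {fid: categorize(c["en"], c["ar"]) for fid, c in contrasts.items()}
ranked_by_english = sorted(contrasts, key=lambda fid: contrasts[fid]["en"], reverse=True)
```

A mean-difference contrast is only one possible ranking score; it is used here because it needs no labels beyond the harmful/benign split already present in the evaluation data.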

Goal 3: Derive and test Arabic-specific refusal directions.

Arabic is currently a red line in my paper: applying English-derived directions lowers safety at every steering magnitude across all four tested architectures.

The paper literally tells practitioners: don't deploy this to Arabic until we derive Arabic-specific directions.

Planned work:

Construct an Arabic-only harmful/benign set, parallel to the existing English/African prompts, with native-speaker checks.

Use:

contrastive mean difference in Arabic,

SAE feature activation for Arabic harmful vs benign,

simple circuit-style patching (ACDC-style, as in my Qwen2.5-7B work) to localize refusal-relevant heads in Arabic.
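
The first item, a contrastive mean difference, can be sketched as follows: average the activations for harmful and benign prompts separately, subtract, and normalize. The function names and the toy 2-dimensional activations are assumptions for illustration; in practice the inputs would be per-prompt hidden states from Arabic prompts at a chosen layer.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def mean_difference_direction(harmful, benign):
    """Unit vector pointing from the benign centroid to the harmful centroid."""
    mu_h = mean_vector(harmful)
    mu_b = mean_vector(benign)
    diff = [h - b for h, b in zip(mu_h, mu_b)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("centroids coincide; no direction to extract")
    return [x / norm for x in diff]

# Toy activations (hypothetical): two harmful and two benign prompts.
harmful_acts = [[2.0, 1.0], [4.0, 1.0]]
benign_acts = [[0.0, 1.0], [0.0, 1.0]]
direction = mean_difference_direction(harmful_acts, benign_acts)
```

This gives the dense baseline; the SAE-based alternatives would replace `direction` with a direction associated with one or more selected features.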

From those, propose 1-2 Arabic specific steering directions:

a dense mean-difference baseline,
and one or more SAE-based directions.

Evaluate:

static harmful refusal recovery
benign collapse (DPL)
capability preservation (MMLU-like benchmark),
qualitative audit of whether refusals are genuine safety vs language uncertainty deflection.

Target outcome:

Either we get a usable Arabic-specific refusal direction (with documented tradeoffs), or we produce a clear negative result that future work can build on, e.g., “LSR-Anchoring-style steering is structurally unsafe for Arabic unless we change layers/architectures/SAE design.”

Goal 4: Ship tools and artifacts others can use

I already built LSR Workbench, a Hugging Face Space for cross-lingual red-teaming, mechanistic visualization, and activation steering.

I am currently integrating cross-lingual red-teaming benchmarks into the UK AI Safety Institute’s Inspect platform so evaluators without language backgrounds can still test harms.

With this grant, I will:

Add:

Per language steering presets (English, Yoruba, Igbo, Hausa, Igala, Swahili, Arabic),

toggles for dense vs SAE derived directions,

plots for SRR, DPL, KL by language and model

Publish:

steering vectors and SAE feature IDs per language as simple JSON/tensors,

the evaluation harness (prompts + scripts),

a short ‘Multilingual Safety Geometry Handbook’ describing deployment rules and failure cases.

The goal is that a lab's safety engineer can drop these into their own inference stack or use the Workbench UI, without rerunning all of my experiments.
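
To make "simple JSON" concrete, the sketch below shows one possible record format and a save/load round trip. The schema (field names, the `kind` values, the `"yo"` language code) and the helper `make_artifact` are hypothetical and not a published format; the feature ID and layer are the ones quoted earlier for Llama-3-8B.

```python
import json

def make_artifact(model, layer, language, kind, vector=None, sae_feature_id=None):
    """Build one steering record; 'dense' needs a vector, 'sae' needs a feature ID."""
    if kind not in ("dense", "sae"):
        raise ValueError("kind must be 'dense' or 'sae'")
    if kind == "dense" and vector is None:
        raise ValueError("dense records need a vector")
    if kind == "sae" and sae_feature_id is None:
        raise ValueError("sae records need a feature ID")
    return {
        "model": model,
        "layer": layer,
        "language": language,
        "kind": kind,
        "vector": vector,
        "sae_feature_id": sae_feature_id,
    }

# Example record using the SAE feature mentioned in this proposal.
record = make_artifact("Llama-3-8B", 12, "yo", "sae", sae_feature_id=112639)
text = json.dumps(record, sort_keys=True)
restored = json.loads(text)
```

Dense vectors for 7B-8B models have thousands of entries, so in practice they would more likely ship as tensor files referenced from the JSON record rather than inline lists.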

How will this funding be used?
Indicative breakdown for 6 months:

Salary: ~$4,000/month x 6 months = $24,000

I'm an independent researcher based in Abuja, with no institutional salary.

This covers my time on research, code, evaluation, and write-up.

Compute: ~$2,000/month x 3 active months = $6,000

GPU rentals (Qwen2.5-7B, Llama-3-8B, Llama-3.1-70B, Mistral-7B, SAE runs)

Storage and bandwidth for logs, evaluation artifacts, and public model outputs.

Evaluation + contractors: $4,000

Native-speaker annotation for Arabic and 1-2 African languages.

Small stipends for collaborators helping check prompts and labels.

Buffer: $2,000

Unexpected compute spikes (e.g., rerunning sweeps after a bug) or extending a promising SAE training run.

Total: $36,000.

With $5k I’d do a 1.5‑month, single‑model version of Goals 1–3; with $36k I can cover multiple models/languages and ship full tooling.

Who is on your team? What's your track record on similar projects?

Just me as the main researcher, plus lightweight collaborators for language vetting and mentoring.

I have just started a remote predoc with Jinesis AI Lab (Zhijing Jin, University of Toronto), working directly with Terry Jingchen Zhang on closely related questions in multilingual and multi-agent safety. This gives me regular mentorship and access to a strong research group, but no salary funding. This Manifund grant would primarily cover my time and the additional compute needed to push this specific line of work forward.

Track record:

LSR-Anchoring (ICML GlobalSouthML 2026)

First activation steering method targeted at low-resource African languages (Yoruba, Igbo, Igala, Hausa, Swahili) with explicit characterization of transfer success and failure conditions.
Training-free safety recovery with no target-language labels, across Llama-3-8B, Llama-3.1-70B, Mistral-7B-instruct, and Qwen2.5-7B, with MMLU drops < 0.35pp at all steering magnitudes.
SAE-derived steering on Llama-3-8B replaces dense steering with a single SAE feature, eliminating benign collapse and reducing KL by 3.5-7x at matched safety recovery.
Open-sourced code and datasets for all experiments.

Safety Infra and leadership:

Built LSR Workbench (HF Space) and am integrating cross-lingual benchmarks into the UK AI Safety Institute's Inspect platform.
Technical lead for the Nigerian AI coalition’s safety roadmap, with a focus on linguistic integrity and cross-lingual standards for deployments.

Mentorship/support:

Ongoing mentorship from Sandy Tanwisuth, especially on Arabic specific direction derivation and experimental design.
BlueDot technical AI safety project sprint, providing a network of researchers for feedback on drafts and intermediate results.

This project is phase 2 of the LSR‑Anchoring line, extending it from ‘proof of concept’ to ‘deployment guidance and tooling’.

What are the most likely causes and outcomes if this project fails?

Real failure modes:

Arabic stays hard.
I might not find a clean Arabic-specific refusal direction that improves safety without breaking utility. That would still be a useful negative result, e.g., “Don't deploy LSR-Anchoring-style steering to Arabic at this layer or architecture; you need different tools.”

SAE feature story is messy.

It's possible that most safety-relevant SAE features are messy, non-monosemantic, or too entangled across languages to give simple deployment guidance.

In that case, I still expect to publish the geometry and steering results, but the “neat” split between language-agnostic and language-specific features would not materialize.

Parallel work in labs.

Labs working on multilingual safety or SAEs might independently solve parts of this. If they do, my work will still be useful as an independent replication from a Global South environment, with stronger emphasis on African languages and public tooling.

Execution risk for independent research.

I've already shown I can take a project from hypothesis to ICML workshop paper, open-source repo, and datasets while independent.

Still, there's always risk of spreading too thin. I plan to avoid that by committing to:

one primary base model for deep analysis at a time,
a fixed language subset per experiment,
and clear stopping conditions when results plateau.

If the worst case happens and no SAE-based methods beat simple dense steering in a way that's clean and interpretable, I'll write that up honestly.

I’d treat “there is no simple, safe steering story for Arabic at this layer” as a genuine scientific result.

How much money have you raised in the last 12 months, and from where?

I have not raised any funding for this specific phase-2 project yet. My earlier LSR-Anchoring work received a small compute grant and coaching through a rapid grant from BlueDot's Technical AI Safety Project program, which covered the initial experiments but does not extend to this follow-up stage.
My predoc program at Jinesis AI Lab (University of Toronto) does not include salary or dedicated funding for this specific project. I can access some compute for Jinesis‑related work, but not living costs. I am open to co‑funding or follow‑on grants if this project shows promising results.