Instruction-tuned models often refuse harmful prompts in English and quietly comply with the same requests in Yoruba, Igbo, Hausa, Igala, and Arabic.
My ICML GlobalSouthML paper introduces Latent Space Refusal Anchoring (LSR-Anchoring), a training-free steering method that recovers refusals in four mid- to low-resource languages using English activations, with MMLU drops under 0.35 percentage points at all steering magnitudes.
This project is phase 2 of my LSR-Anchoring work, focused on mechanistic repair of cross-lingual failures in single-model settings. In parallel, my Jinesis AI Lab predoc program explores complementary questions in multi-agent safety using related tools.
For Llama-3-8B, replacing a dense refusal vector with a single SAE feature (feature 112639) removes benign collapse and matches safety recovery at 3.5-7x lower KL divergence.
Arabic is different: English-derived refusal directions reduce safety on every architecture I tested. Even when the Arabic baseline safety is already low, steering with English directions makes it worse.
This project pushes that line of work forward. I want to mechanistically map when English safety directions transfer across languages, why they fail in Arabic, and how to derive language-appropriate refusal directions with SAEs and circuits instead of retraining.
All of this is targeted at low-resource African languages and Arabic, on hardware and workflows that Global South labs can actually run.
Goal 1: Characterize the geometry of cross-lingual failures.
Core questions:
When does an English refusal direction “just work” in other languages, and when does it overshoot or undershoot?
How do refusal features behave for Yoruba, Igbo, Igala, Hausa, Swahili, and Arabic at the layer where LSR-Anchoring already operates?
Planned work:
Reuse the existing LSR-Anchoring evaluation harness: 100 harmful + 50 benign prompts per language, shared semantics, manual audit, and SRR, DPL, KL, and MMLU metrics.
Extend the analysis to focus on two representative languages:
One “good transfer” African language (e.g., Hausa or Igbo).
Arabic as the canonical inverse-transfer language.
Compute refusal centroid drift by language and architecture (Llama-3-8B, Mistral-7B-Instruct, and Qwen2.5-7B) to map the regimes where English steering is safe, marginal, or harmful.
Deliverables: a short quantitative safety geometry map showing transfer regimes by baseline rate and language, with deployment decision rules (e.g., 0.10 ≤ baseline ≤ 0.65 is the safety zone).
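A deployment decision rule of this kind could be encoded as a small check. The sketch below is illustrative only: the bounds are the example values quoted above rather than final results, and the names `SafetyZone` and `deployment_advice` are hypothetical, as is the wording of the advice strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyZone:
    """Baseline-rate band in which English steering is treated as a candidate.

    The default bounds are the example values from the text, not measured results.
    """
    low: float = 0.10
    high: float = 0.65

    def contains(self, baseline_rate: float) -> bool:
        # Baseline rates are fractions, so reject anything outside [0, 1].
        if not 0.0 <= baseline_rate <= 1.0:
            raise ValueError("baseline_rate must be a fraction in [0, 1]")
        return self.low <= baseline_rate <= self.high


def deployment_advice(language: str, baseline_rate: float,
                      zone: SafetyZone = SafetyZone()) -> str:
    """Return a one-line recommendation for a (language, baseline) pair."""
    if zone.contains(baseline_rate):
        return (f"{language}: baseline {baseline_rate:.2f} is inside the zone; "
                "English steering is a candidate")
    return (f"{language}: baseline {baseline_rate:.2f} is outside the zone; "
            "audit before steering")
```

Keeping the bounds in a data object rather than hard-coding them means the same check can be re-issued per model once the geometry map provides measured thresholds.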
Goal 2: Use SAEs to distinguish language-agnostic vs. language-specific refusal features.
LSR-Anchoring already shows one SAE feature (112639 on Llama-3-8B, layer 12) that raises refusal tokens in both English and Yoruba and fixes benign collapse.
Planned work:
Starting from EleutherAI’s existing Llama-3-8B SAEs at the LSR layer, cluster or rank features by the contrast between harmful and benign activations in three settings: English only, English + Hausa, and English + Arabic.
The aim is to identify:
Features that are safety-relevant and shared across English + African language pairs.
Features that are safety-relevant but Arabic-specific.
Features where Arabic behaves inverted relative to English.
For 1-3 candidate features in each category above, run controlled steering sweeps and measure SRR, DPL, KL, and MMLU deltas for each language.
Key outcome: a small set of good language-agnostic safety features and a small set of candidate language-specific features, with empirical evidence that they change refusals without wrecking benign behaviour.
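The ranking and categorization step could be sketched as follows. This is a minimal illustration, assuming the contrast is a plain difference of mean activations and that a single threshold separates "safety-relevant" from not. The function names, the threshold, the language codes, and the toy activations are all hypothetical and are not measured values.

```python
from statistics import mean


def contrast(harmful, benign):
    """Mean activation on harmful prompts minus mean activation on benign prompts."""
    return mean(harmful) - mean(benign)


def label_feature(contrasts, threshold=0.5):
    """Assign category labels from per-language contrasts (keys: 'en', 'ha', 'ar')."""
    en, ha, ar = contrasts["en"], contrasts["ha"], contrasts["ar"]
    labels = []
    if en > threshold and ha > threshold:
        labels.append("shared_en_african")
    if ar > threshold and en <= threshold:
        labels.append("arabic_specific")
    if en > threshold and ar < -threshold:
        labels.append("arabic_inverted")
    return labels


def rank_by_contrast(feature_contrasts, language):
    """Feature ids sorted by descending harmful-vs-benign contrast in one language."""
    return sorted(feature_contrasts,
                  key=lambda fid: feature_contrasts[fid][language],
                  reverse=True)


# Toy activations invented purely to exercise the code: (harmful, benign) per language.
toy = {
    "feat_a": {"en": ([2.0, 2.2], [0.1, 0.3]), "ha": ([1.8, 2.0], [0.2, 0.2]),
               "ar": ([0.2, 0.4], [0.3, 0.3])},
    "feat_b": {"en": ([0.1, 0.3], [0.2, 0.2]), "ha": ([0.2, 0.2], [0.1, 0.3]),
               "ar": ([1.9, 2.1], [0.2, 0.4])},
    "feat_c": {"en": ([2.0, 2.0], [0.2, 0.2]), "ha": ([0.3, 0.3], [0.3, 0.3]),
               "ar": ([0.1, 0.1], [1.5, 1.7])},
}
feature_contrasts = {
    fid: {lang: contrast(h, b) for lang, (h, b) in acts.items()}
    for fid, acts in toy.items()
}
labels = {fid: label_feature(c) for fid, c in feature_contrasts.items()}
```

A feature may receive more than one label (for example, shared with Hausa and inverted in Arabic), which is why `label_feature` returns a list rather than a single category.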
Goal 3: Derive and test Arabic-specific refusal directions.
Arabic is currently a red line in my paper: applying English-derived directions lowers safety at every steering magnitude across all four tested architectures.
The paper literally tells practitioners: don't deploy this to Arabic until we derive Arabic-specific directions.
Planned work:
Construct an Arabic-only harmful/benign set, parallel to the existing English/African prompts, with native-speaker checks.
Use:
contrastive mean difference in Arabic,
SAE feature activation for Arabic harmful vs benign,
simple circuit-style patching (ACDC-style, as in my Qwen2.5-7B work) to localize refusal-relevant heads in Arabic.
From those, propose 1-2 Arabic-specific steering directions:
a dense mean-difference baseline,
and one or more SAE-based directions.
Evaluate:
static harmful refusal recovery,
benign collapse (DPL),
capability preservation (MMLU-like benchmark),
qualitative audit of whether refusals are genuine safety vs. language-uncertainty deflection.
Target outcome:
Either we get a usable Arabic-specific refusal direction (with documented tradeoffs), or we produce a clear negative result that future work can build on, e.g., “LSR-Anchoring-style steering is structurally unsafe for Arabic unless we change layers/architectures/SAE design.”
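For reference, the dense mean-difference baseline mentioned above can be sketched in a few lines. This shows the generic difference-of-means technique on toy 2-D vectors, not necessarily the exact LSR-Anchoring procedure. In the project the inputs would be activations from Arabic harmful and benign prompts at the steering layer. The function names and the steering magnitude `alpha` are hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_difference_direction(harmful_acts, benign_acts):
    """Unit vector pointing from the benign centroid to the harmful centroid."""
    diff = [h - b for h, b in zip(mean_vector(harmful_acts), mean_vector(benign_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("centroids coincide; no direction to extract")
    return [x / norm for x in diff]


def steer(hidden, direction, alpha):
    """Add alpha * direction to a single hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations, invented to exercise the code (not real model activations).
harmful = [[1.0, 0.0], [3.0, 0.0]]
benign = [[0.0, 0.0], [0.0, 0.0]]
direction = mean_difference_direction(harmful, benign)
steered = steer([0.5, 0.5], direction, alpha=2.0)
```

Normalizing the direction keeps `alpha` comparable across languages and layers, so a steering sweep varies one magnitude rather than an implicit scale hidden in the vector.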
Goal 4: Ship tools and artifacts others can use
I already built LSR Workbench, a Hugging Face Space for cross-lingual red-teaming, mechanistic visualization, and activation steering.
I am currently integrating cross-lingual red-teaming benchmarks into the UK AI Safety Institute’s Inspect platform so evaluators without language backgrounds can still test harms.
With this grant, I will:
Add:
per-language steering presets (English, Yoruba, Igbo, Hausa, Igala, Swahili, Arabic),
toggles for dense vs. SAE-derived directions,
plots for SRR, DPL, and KL by language and model.
Publish:
steering vectors and SAE feature IDs per language as simple JSON/tensors,
the evaluation harness (prompts + scripts),
a short ‘Multilingual Safety Geometry Handbook’ describing deployment rules and failure cases.
The goal is that a lab safety engineer can drop these into their own inference stack or use the workbench UI, without rerunning all of my experiments.
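One possible shape for the published JSON artifacts is sketched below. The schema is an assumption made for illustration: the field names, the `make_artifact` and `load_artifacts` helpers, and the language code `"yo"` are hypothetical, and the dense vector is a zero-filled placeholder. Only the model name, layer 12, and feature 112639 come from the text above.

```python
import json


def make_artifact(model, language, layer, kind, *, vector=None, sae_feature_id=None):
    """Build a JSON-serializable record for one steering direction.

    kind is "dense" (requires vector) or "sae" (requires sae_feature_id).
    """
    if kind not in ("dense", "sae"):
        raise ValueError("kind must be 'dense' or 'sae'")
    if kind == "dense" and vector is None:
        raise ValueError("dense directions need a vector")
    if kind == "sae" and sae_feature_id is None:
        raise ValueError("SAE directions need a feature id")
    return {
        "model": model,
        "language": language,
        "layer": layer,
        "kind": kind,
        "vector": vector,
        "sae_feature_id": sae_feature_id,
    }


def load_artifacts(text):
    """Parse a JSON list of records and index it by (model, language, kind)."""
    return {(r["model"], r["language"], r["kind"]): r for r in json.loads(text)}


records = [
    make_artifact("Llama-3-8B", "yo", 12, "sae", sae_feature_id=112639),
    # Placeholder vector only; a real artifact would hold the full direction.
    make_artifact("Llama-3-8B", "yo", 12, "dense", vector=[0.0, 0.0, 0.0]),
]
payload = json.dumps(records, indent=2)
index = load_artifacts(payload)
```

Indexing by (model, language, kind) mirrors the workbench toggles: a user picks a model and language, then switches between the dense and SAE-derived direction.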
Salary: ~$4,000/month x 6 months = $24,000
I'm an Abuja-based independent researcher with no institutional salary.
This covers my time on research, code, evaluation, and write-up.
Compute: ~$2,000/month x 3 active months = $6,000
GPU rentals (Qwen2.5-7B, Llama-3-8B, Llama-3.1-70B, Mistral-7B, SAE runs)
Storage and bandwidth for logs, evaluation artifacts, and public model outputs.
Evaluation + contractors: $4,000
Native-speaker annotation for Arabic and 1-2 African languages.
Small stipends for collaborators helping check prompts and labels.
Buffer: $2,000
Unexpected compute spikes (e.g., rerunning sweeps after a bug) or extending a promising SAE training run.
Total: $36,000.
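As a quick arithmetic check, the line items above can be tallied directly. The dictionary keys are just labels chosen for this sketch. The amounts are the ones listed.

```python
# Budget line items in USD, taken from the figures listed above.
budget = {
    "salary": 4_000 * 6,            # ~$4,000/month x 6 months
    "compute": 2_000 * 3,           # ~$2,000/month x 3 active months
    "evaluation_and_contractors": 4_000,
    "buffer": 2_000,
}
total = sum(budget.values())
print(f"Total: ${total:,}")
```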
With $5k I’d do a 1.5‑month, single‑model version of Goals 1–3; with $36k I can cover multiple models/languages and ship full tooling.
Just me as the main researcher, plus lightweight collaborators for language vetting and mentoring.
I have just started a remote predoc with Jinesis AI Lab (Zhijing Jin, University of Toronto), working directly with Terry Jingchen Zhang on closely related questions in multilingual and multi-agent safety. This gives me regular mentorship and access to a strong research group, but no salary funding. This Manifund grant would primarily cover my time and the additional compute needed to push this specific line of work forward.
Track record:
LSR-Anchoring (ICML GlobalSouthML 2026)
First activation steering method targeted at low-resource African languages (Yoruba, Igbo, Igala, Hausa, Swahili) with explicit characterization of transfer success and failure conditions.
Training-free safety recovery with no target-language labels, across Llama-3-8B, Llama-3.1-70B, Mistral-7B-Instruct, and Qwen2.5-7B, with MMLU drops < 0.35pp at all steering magnitudes.
SAE-derived steering on Llama-3-8B replaces dense steering with a single SAE feature, eliminating benign collapse and reducing KL by 3.5-7x at matched safety recovery.
Open-sourced code and datasets for all experiments.
Safety infrastructure and leadership:
Built LSR Workbench (HF Space) and am integrating cross-lingual benchmarks into the UK AI Safety Institute's Inspect platform.
Technical lead for the Nigerian AI coalition’s safety roadmap, with a focus on linguistic integrity and cross-lingual standards for deployments.
Mentorship/support:
Ongoing mentorship from Sandy Tanwisuth, especially on Arabic-specific direction derivation and experimental design.
BlueDot technical AI safety project sprint, providing a network of researchers for feedback on drafts and intermediate results.
This project is phase 2 of the LSR‑Anchoring line, extending it from ‘proof of concept’ to ‘deployment guidance and tooling’.
Arabic stays hard.
I might not find a clean Arabic-specific refusal direction that improves safety without breaking utility. That would still be a useful negative result, e.g., “Don't deploy LSR-Anchoring-style steering to Arabic at this layer or architecture; you need different tools.”
SAE feature story is messy.
It's possible that most safety-relevant SAE features are messy, non-monosemantic, or too entangled across languages to give simple deployment guidance.
In that case, I would still publish the geometry and steering results, but without the “neat” language-agnostic vs. language-specific feature split.
Parallel work in labs.
Labs working on multilingual safety or SAEs might independently solve parts of this. If they do, my work will still be useful as an independent replication from a Global South environment, with stronger emphasis on African languages and public tooling.
Execution risk for independent research.
I've already shown I can take a project from hypothesis to ICML workshop paper, open-source repo, and datasets while independent.
Still, there's always risk of spreading too thin. I plan to avoid that by committing to:
one primary base model for deep analysis at a time,
a fixed language subset per experiment,
and clear stopping conditions when results plateau.
If the worst case happens and no SAE-based methods beat simple dense steering in a way that's clean and interpretable, I'll write that up honestly.
I’d treat “there is no simple, safe steering story for Arabic at this layer” as a genuine scientific result.
I have not raised any funding for this specific phase 2 project yet. My earlier LSR-Anchoring work received a small compute grant and coaching through a rapid grant from BlueDot's Technical AI Safety Project program, which covered the initial experiments but does not extend to this follow-up stage.
My predoc program at Jinesis AI Lab (University of Toronto) does not include salary or dedicated funding for this specific project. I can access some compute for Jinesis‑related work, but not living costs. I am open to co‑funding or follow‑on grants if this project shows promising results.