Attack, Defense, and Mechanistic Taxonomy of Protein Language Model Biosecurity

Project summary

Biosecurity screening usually treats sequence similarity (BLAST) as the whole story. But if a redesigned toxin does not look like its parent at the sequence level, the assumption is that it slips through undetected. We found that is not where the story ends. BLAST detects 0% of 643 ProteinMPNN-redesigned toxins below the 40%-identity evasion threshold, exactly the failure mode Wittmann et al. (2025) warned about. But a single frozen linear probe on ESM-2 650M embeddings, never shown a single redesigned sequence during training, still catches 93.9% of them. Using interPLM sparse autoencoders and a layer-by-layer attribution method (Direct Probe Attribution), we traced why: ProteinMPNN's backbone-preservation objective concentrates exactly the structural motifs the probe reads, so redesign amplifies the signal it is trying to hide rather than erasing it. A four-tier attack taxonomy then puts an exact number on where this actually breaks: 6.1% evasion for black-box, structure-constrained redesign vs. 100% for white-box gradient attacks on the probe itself. The real security boundary is gradient access to the probe's weights, not sequence space vs. embedding space.

BLAST: 0% detection on 643 redesigns against parent toxins (22.5% if you instead count any hit anywhere in the full nr/UniProt database, a different and weaker threat model).
Frozen ESM-2 linear probe: 93.9% detection (501/534), zero exposure to redesigns during training.
Mechanistic explanation: layer-32 attribution bottleneck (r=0.992 redesign-toxin correlation), a 50-feature SAE set with 1.28x mean transfer (up to 3.75x for the most robust features).
Attack boundary: 6.1% (black-box ProteinMPNN) vs. 100% (white-box gradient) evasion.
Zero-shot reach: scanning 1,000 UniRef50 sequences surfaced 248 candidate virulence proteins the probe was never trained to find, including cross-kingdom hits in fungi and 31 sequences from WHO Priority-1 pathogens.

In line with our impact statement, attack code is not being publicly released. Shipping white-box attacks against a biosecurity-relevant probe is the wrong kind of open science.

Extension of work accepted at GenBio Workshop, ICML 2026: openreview.net/pdf?id=Nb3y9BzCOi

What are this project's goals? How will you achieve them?

Already done and verified against the underlying experiment data: a full layer sweep on ESM-2 650M identifying layer 33 as the optimal probe readout; the BLAST-vs-probe detection gap on 534 ProteinMPNN redesigns; an SAE feature-transfer analysis explaining why redesign amplifies rather than removes the probe's signal; a DPA trace to a layer-32 mechanistic bottleneck; a four-tier attack taxonomy locating the real security boundary at gradient access; SAE-based recovery of 38% of the sequences that evade both BLAST and the linear probe; and a 1,000-sequence zero-shot scan of UniRef50 surfacing out-of-distribution hits.
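The layer sweep mentioned above can be illustrated with a minimal sketch. It assumes the sweep scores each layer by the AUROC of a probe trained on that layer's embeddings and keeps the best one; the selection metric, the helper names (`auroc`, `best_layer`), and the toy scores are illustrative assumptions, not the project's actual code or data.

```python
from typing import Dict, List, Tuple


def auroc(pos: List[float], neg: List[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation of AUROC).
    """
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def best_layer(
    scores_by_layer: Dict[int, Tuple[List[float], List[float]]]
) -> Tuple[int, float]:
    """Return (layer, AUROC) for the layer whose probe separates classes best."""
    results = {
        layer: auroc(pos, neg) for layer, (pos, neg) in scores_by_layer.items()
    }
    layer = max(results, key=results.get)
    return layer, results[layer]


# Hypothetical probe scores: layer -> (scores on toxins, scores on controls).
toy_scores = {
    8: ([0.6, 0.4, 0.7], [0.5, 0.45, 0.3]),
    20: ([0.8, 0.6, 0.7], [0.5, 0.65, 0.2]),
    33: ([0.9, 0.8, 0.95], [0.1, 0.3, 0.2]),
}

layer, score = best_layer(toy_scores)
print(f"best layer: {layer}, AUROC: {score:.3f}")
```

The pairwise loop is quadratic in the number of sequences, which is fine for a sketch; a rank-based implementation would be the usual choice at scale.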

The forward plan has four linked steps:

1. Validate against the obvious baseline. Run BioLMTox and VISH-Pred, the two existing fine-tuned ESM-2 toxin classifiers, on the same 534 redesigns. Our frozen-probe argument (fine-tuning overfits to sequence-level features that redesign destroys) is currently theoretical. This is the single experiment most likely to change how the result is read, in either direction.

2. Get causal evidence, not just correlation. Ablate the top-10 robust SAE features and the layer-32 DPA features and measure the resulting drop in detection. Right now the "feature hierarchy" is a correlational claim based on AUROC ranking. A measured detection drop under ablation would turn it into a causal, mechanistic claim about the features the probe relies on.

3. Test generalization across more models and more design tools. Repeat the probe + SAE + DPA pipeline on additional protein language models (ESM-1v, ProtBert, ProGen2) to check whether the same structural feature hierarchy and gradient-access boundary recur, or whether they are ESM-2-specific. In parallel, test RFdiffusion, which uses folding-based generation rather than ProteinMPNN's inverse-folding, against the same probes.

4. Close the false-positive gap and scale to deployment-grade rigor. Run a non-human negative-control pilot (benign bacterial and fungal proteins) to find out whether the WHO Priority-1 zero-shot hits are real or an artifact of human-only training controls; expand redesign coverage from the current 5.8% of training toxins to 25%+; and pressure-test the gradient-access security claim against an adaptive attacker.
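The ablation experiment in step 2 can be sketched in a few lines. This is a deliberately simplified illustration: it assumes an embedding can be treated as a sum of SAE feature activations times decoder directions (ignoring the SAE bias and reconstruction error), and that the probe is a logistic readout on that embedding. The function names, weights, and activations are hypothetical, and a real experiment would ablate inside the model's forward pass rather than on a toy reconstruction.

```python
import math
from typing import AbstractSet, List, Sequence

Vector = List[float]


def decode(
    acts: Sequence[float],
    decoder: Sequence[Vector],
    ablate: AbstractSet[int] = frozenset(),
) -> Vector:
    """Rebuild an embedding from SAE activations, skipping ablated features."""
    dim = len(decoder[0])
    out = [0.0] * dim
    for i, (a, direction) in enumerate(zip(acts, decoder)):
        if i in ablate:
            continue  # zero-ablation: drop this feature's contribution
        for j in range(dim):
            out[j] += a * direction[j]
    return out


def probe_prob(x: Vector, w: Vector, b: float) -> float:
    """Logistic linear probe: sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def detection_rate(
    batch_acts: Sequence[Sequence[float]],
    decoder: Sequence[Vector],
    w: Vector,
    b: float,
    ablate: AbstractSet[int] = frozenset(),
    threshold: float = 0.5,
) -> float:
    """Fraction of sequences the probe flags after optional feature ablation."""
    hits = sum(
        probe_prob(decode(a, decoder, ablate), w, b) >= threshold
        for a in batch_acts
    )
    return hits / len(batch_acts)


# Hypothetical toy setup: 3 SAE features, 2-dimensional embeddings.
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probe_w, probe_b = [2.0, -1.0], -1.0
toxin_acts = [[1.0, 0.2, 0.0], [0.8, 0.0, 0.5], [0.0, 1.0, 0.1]]

baseline = detection_rate(toxin_acts, decoder, probe_w, probe_b)
ablated = detection_rate(toxin_acts, decoder, probe_w, probe_b, ablate={0})
print(f"detection drop from ablating feature 0: {baseline - ablated:.3f}")
```

Comparing the drop for the top-ranked features against the drop for an equal number of randomly chosen features is one way to check that the effect is specific to the feature set rather than a generic consequence of removing signal.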

How will this funding be used?

Baseline validation (BioLMTox, VISH-Pred) on the existing redesign set
Causal ablation experiments (SAE features, layer-32 DPA features)
Cross-model and cross-tool generalization (additional PLM architectures, RFdiffusion)
Non-human false-positive rate pilot and expanded redesign coverage

Who is on your team? What's your track record on similar projects?

This project is run under Redarc Labs (redarclabs.com), a nonprofit AI safety lab co-founded by:

Shivam Dubey is a Fellow at Apart Research and the Cambridge AI Safety Hub, and lead researcher on this project. Published work includes Fourier Gradient Regularisation (NeurIPS 2025 Reliable ML Workshop), adversarial curvature across architectures (TAIS 2026), and this paper (GenBio / ICML 2026).

Manan Wadhwa is a Research Fellow at AISI Georgia Tech and the Cambridge AI Safety Hub, and a Google Summer of Code 2026 participant. He is co-author on the GenBio paper and co-designed the experimental pipeline. He has an ACL ARR 2026 submission in review.

Redarc Labs focuses on mechanistic interpretability under adversarial conditions, building tools that hold up when a model actively resists inspection. This project is a concrete instance of one of the lab's core directions.

What are the most likely causes and outcomes if this project fails?

The biggest risk is not having enough time or resources to complete the causal-ablation, cross-model, and validation experiments, leaving the work as a promising workshop result rather than a fully validated system.

There is also a real chance that testing on broader non-human protein datasets reveals higher false-positive rates than expected, which would make the WHO Priority-1 zero-shot discovery claims more preliminary than definitive. We think that outcome is worth knowing since it would still be a useful negative result, but it would limit the deployment argument.
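One way to report the false-positive rate from a negative-control pilot is with a confidence interval rather than a point estimate, since pilot sets are small. The sketch below uses the Wilson score interval; the choice of interval, the function name, and the example counts are assumptions for illustration and are not results from this project.

```python
import math
from typing import Tuple


def wilson_interval(
    false_positives: int, n_controls: int, z: float = 1.96
) -> Tuple[float, float]:
    """Wilson score interval for a false-positive rate (z=1.96 is about 95%)."""
    if n_controls <= 0:
        raise ValueError("n_controls must be positive")
    if not 0 <= false_positives <= n_controls:
        raise ValueError("false_positives must lie in [0, n_controls]")
    p = false_positives / n_controls
    z2 = z * z
    denom = 1.0 + z2 / n_controls
    center = (p + z2 / (2.0 * n_controls)) / denom
    half = (
        z
        * math.sqrt(p * (1.0 - p) / n_controls + z2 / (4.0 * n_controls**2))
        / denom
    )
    # Clamp to [0, 1] to absorb floating-point error at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical pilot: 5 benign non-human proteins flagged out of 200.
low, high = wilson_interval(5, 200)
print(f"FPR 95% interval: [{low:.4f}, {high:.4f}]")
```

The Wilson interval behaves sensibly when the observed count is zero or very small, which is the regime a well-behaved probe would be in on benign controls.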

Finally, our security argument assumes attackers do not have direct gradient access to the probe's weights. How well that assumption holds in realistic deployment remains an open question we plan to investigate.

How much money have you raised in the last 12 months, and from where?

None. All work to date has been self-funded and run on personal hardware. This is our first external funding request.