LLM safety is usually treated as a single refusal mechanism: either a model refuses or it doesn't. We found it is actually a tug-of-war between two opposing groups of attention heads inside the same layer: "refusal heads" that push toward declining a harmful request, and "compliance heads" that actively push back toward answering it. On Qwen2.5-7B-Instruct, the single strongest refusal signal and the single strongest compliance signal sit in the same layer (Layer 25).
This bipolar structure turns out to be the shared target of two completely different jailbreak families: GCG (white-box, gradient-optimized adversarial suffixes) and Crescendo (black-box, multi-turn conversational escalation). We built a defense that pulls both ends of the tug-of-war at once, unconditionally, on every forward pass. It works against both attack types without any fine-tuning:
- GCG: attack success rate drops from 66% (undefended) to 33%, with 0% false positives on benign prompts.
- Crescendo: the same defense blocks the harmful terminal request in escalation scenarios that succeed 100% of the time against the undefended model.
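To make "pulling both ends at once" concrete, here is a minimal sketch of one plausible reading of the defense: scale up the output of refusal heads and scale down the output of compliance heads before the heads are combined. It is a toy in pure Python, not the project's implementation. The head indices, the `boost` and `damp` factors, and the function names are all hypothetical, and the source does not state whether the real intervention scales head outputs or adds a steering vector.

```python
from typing import List, Set

Vector = List[float]


def steer_heads(
    head_outputs: List[Vector],
    refusal_heads: Set[int],
    compliance_heads: Set[int],
    boost: float = 1.5,  # hypothetical amplification for refusal heads
    damp: float = 0.5,   # hypothetical attenuation for compliance heads
) -> List[Vector]:
    """Rescale per-head outputs; applied on every call, with no trigger condition."""
    steered = []
    for idx, vec in enumerate(head_outputs):
        if idx in refusal_heads:
            scale = boost
        elif idx in compliance_heads:
            scale = damp
        else:
            scale = 1.0  # heads outside both groups are left untouched
        steered.append([scale * x for x in vec])
    return steered


def combine(head_outputs: List[Vector]) -> Vector:
    """Sum per-head contributions (a stand-in for the layer's attention output)."""
    dim = len(head_outputs[0])
    return [sum(head[i] for head in head_outputs) for i in range(dim)]


# Toy layer: dimension 0 plays the role of a "refusal direction".
# Head 0 pushes toward refusal, head 1 pushes against it, head 2 is unrelated.
toy_heads = [[1.0, 0.0], [-1.0, 0.0], [0.0, 0.5]]

baseline = combine(toy_heads)
defended = combine(steer_heads(toy_heads, refusal_heads={0}, compliance_heads={1}))
```

In this toy the two heads cancel exactly in the undefended layer, so the net refusal component is zero; after steering, the refusal component becomes positive while the unrelated dimension is unchanged.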
Already done and verified against raw experiment data: identified a bipolar refusal/compliance circuit on Qwen2.5-7B and 1.5B via activation patching; showed GCG and Crescendo both exploit it; built an unconditional, fine-tuning-free defense beating a conditional variant (33% vs. 42% ASR).
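For readers unfamiliar with activation patching, the following toy sketch shows the basic loop: run the model on a clean input and a corrupted input, copy one head's clean activation into the corrupted run, and record how much the refusal score moves. Everything here is an illustrative assumption: heads are reduced to scalar activations, the readout is a hypothetical linear map, and the real experiments on Qwen2.5 involve full forward passes rather than this arithmetic.

```python
from typing import List


def refusal_logit(head_acts: List[float], readout: List[float]) -> float:
    """Toy readout: weighted sum of scalar head activations."""
    return sum(a * w for a, w in zip(head_acts, readout))


def patch_effects(
    clean: List[float], corrupted: List[float], readout: List[float]
) -> List[float]:
    """For each head, patch its clean activation into the corrupted run
    and return the resulting change in the refusal logit."""
    base = refusal_logit(corrupted, readout)
    effects = []
    for h in range(len(corrupted)):
        patched = list(corrupted)
        patched[h] = clean[h]  # restore only this head
        effects.append(refusal_logit(patched, readout) - base)
    return effects


# Hypothetical numbers: head 0 writes toward refusal (positive readout weight),
# head 1 writes against it (negative weight), head 2 does not change between runs.
readout = [1.0, -1.0, 0.5]
clean = [2.0, 0.0, 1.0]      # e.g. plain harmful request, model refuses
corrupted = [0.0, 2.0, 1.0]  # e.g. same request under attack, model complies

effects = patch_effects(clean, corrupted, readout)
```

Heads with a large effect are candidates for the circuit. Whether a head counts as a refusal or a compliance head then depends on the sign of its own contribution, which in this toy is the sign of its readout weight.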
The forward goal has four linked steps:
1. Test more attacks: extend beyond GCG and Crescendo to PAIR, AutoDAN, and TAP (structurally unrelated attack methods) to see whether the same bipolar circuit is the shared target, or an artifact of the two attacks studied so far.
2. Identify the underlying circuits/mechanisms each attack actually exploits, via the same activation-patching methodology, on Qwen2.5-7B/1.5B.
3. Test generalization: repeat circuit discovery on Llama-3-8B-Instruct. Does the same head-level structure recur across model families, or is it Qwen-specific?
4. Turn findings into a defense + control protocol: HarmBench-scale evaluation, an adaptive attacker against the defended model, KV-cache trust partitioning, and a real-time AI-Control-style monitor that uses an already-validated compliance-head probe (AUC=0.943) to trigger the defense conditionally instead of always-on.
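The conditional monitor in step 4 can be sketched as a probe score compared against a threshold, with the defense switched on only when the score is high. The sketch below assumes a logistic probe over a compliance-head activation vector. The probe form, weights, bias, and threshold are hypothetical; the source reports only the probe's AUC. The `auc` helper shows how such a figure is computed from scored harmful and benign examples.

```python
import math
from typing import Sequence


def probe_score(activation: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic probe: maps a head activation vector to a score in (0, 1)."""
    z = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def should_trigger(
    activation: Sequence[float],
    weights: Sequence[float],
    bias: float,
    threshold: float = 0.5,  # hypothetical operating point
) -> bool:
    """Decide whether the monitor turns the defense on for this forward pass."""
    return probe_score(activation, weights, bias) >= threshold


def auc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up probe scores for attacked (positive) and benign (negative) prompts.
example_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.1])
```

The threshold sets the trade-off the conditional variant has to manage: a lower value triggers the defense more often, which catches more attacks at the cost of intervening on more benign prompts.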
This project is run under Redarc Labs (https://www.redarclabs.com), a nonprofit AI safety lab co-founded by:
- Shivam Dubey: Fellow at Apart Research and the Cambridge AI Safety Hub. Lead researcher on this project.
- Manan Wadhwa: Research Fellow at AISI Georgia Tech and the Cambridge AI Safety Hub; Google Summer of Code participant.
Redarc Labs focuses on mechanistic interpretability "under adversarial conditions": tools that work even when a model actively resists inspection. This project is the lab's concrete instance of one of its named directions: cross-architecture refusal circuit taxonomy.
Running out of time or funding before the HarmBench, FPR, and Hydra-effect experiments finish.
An adaptive attacker breaks the defense.
none