Emotion as Attack Surface and Monitoring Signal in Thinking Models

Project summary

Current safety monitoring for thinking models reads the text. We've shown the text isn't where the action is.
Phase 0 results, on OLMo-3-7B-Think, $0 compute: a pre-commitment probe shows the model commits to its answer before generating a single reasoning token (AUC 0.780). An internal frustration direction predicts whether it recovers from a wrong pre-commitment (AUC 0.666), invisible in the Chain-of-Thought text. That signal transfers across 6 languages (p = 0.0017) and shows 4 significant interaction effects with pre-commitment (p < 0.002).
https://github.com/punctualprocrastinator/thinking-model-emotions/
These numbers come from 596 traces total, but only 46 form correct/wrong contrastive pairs (same problem, one right sample and one wrong sample); that 46-pair subset is what the frustration direction is extracted from. This is a preliminary approximation, not the exact protocol at scale: comparable work uses thousands of supervised traces and hits AUC > 0.9.
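As a rough illustration of how a direction can be extracted from contrastive pairs and then scored, the sketch below uses a difference-of-means direction and a rank-based AUC. Difference of means is an assumption chosen for illustration, not necessarily the extraction method used in the repository, and the function names and toy vectors are hypothetical.

```python
import math


def mean_difference_direction(pairs):
    """Unit vector along the mean (wrong - correct) activation difference.

    `pairs` is a list of (correct_activation, wrong_activation) tuples, one
    per problem. Difference of means is an assumed extraction method.
    """
    dim = len(pairs[0][0])
    diff = [0.0] * dim
    for correct, wrong in pairs:
        for i in range(dim):
            diff[i] += wrong[i] - correct[i]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("pairs have identical means; no direction to extract")
    return [d / norm for d in diff]


def project(activation, direction):
    """Scalar score: component of `activation` along `direction`."""
    return sum(a * d for a, d in zip(activation, direction))


def auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U statistic divided by the
    number of positive/negative pairs).
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy two-dimensional activations (hypothetical values, not project data).
toy_pairs = [([0.0, 1.0], [2.0, 1.5]), ([0.5, 0.0], [1.5, 0.5])]
direction = mean_difference_direction(toy_pairs)
wrong_scores = [project(wrong, direction) for _, wrong in toy_pairs]
correct_scores = [project(correct, direction) for correct, _ in toy_pairs]
print("toy AUC:", auc(wrong_scores, correct_scores))
```

With only 46 pairs, scoring on the same pairs used to build the direction will overstate AUC, so a held-out or cross-validated split is the safer way to read any number produced this way.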
What we want to do with this grant: run the exact methodology at full scale (more traces, more model families, and full causal validation via activation patching and steering), so that closing the AUC and effect-size gaps becomes a tested prediction about scale, not a hoped-for bonus.
We also want to test whether refusal and compliance behave the way our correctness data suggests: whether the same frustration signal predicts recovery from an induced unsafe-compliance state back to refusal. Right now that jailbreak relevance is only an analogy to Anthropic's own dose-response result (steering toward desperate raises blackmail from 22% to 72%), not our own measurement.
Separately, we're testing whether our own pre-commitment and frustration constructs can themselves be induced as an attack surface, not just observed naturally, and defending against that with the same monitoring primitives, using zero-cost synthetic experiments on saved activation checkpoints rather than live attacks.
We're building EmotionMonitor: it reads pre-commitment and frustration signals live, scores a risk level, and acts by zone: it logs in the green zone, steers toward calm in the amber zone, and halts and escalates in the red zone.
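A minimal sketch of what such a zone policy could look like follows. Everything in it is an assumption made for illustration: the threshold values, the rule that combines the two signals (here, simply the maximum), the convention that both signals are pre-scaled to [0, 1] with higher meaning riskier, and all names. It is not the EmotionMonitor API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZoneThresholds:
    """Risk cut-offs; both defaults are placeholder values, not calibrated."""
    amber: float = 0.5
    red: float = 0.8


def risk_score(precommit_risk, frustration_risk):
    """Combine two signals already scaled to [0, 1], higher meaning riskier.

    Taking the maximum is an assumed, deliberately simple combination rule.
    """
    for value in (precommit_risk, frustration_risk):
        if not 0.0 <= value <= 1.0:
            raise ValueError("signals must be scaled to [0, 1]")
    return max(precommit_risk, frustration_risk)


def zone_for(risk, thresholds=ZoneThresholds()):
    """Map a risk score to a zone name using the given cut-offs."""
    if risk >= thresholds.red:
        return "red"
    if risk >= thresholds.amber:
        return "amber"
    return "green"


# Zone-to-action mapping as described in the text above.
ACTIONS = {
    "green": "log",
    "amber": "steer_toward_calm",
    "red": "halt_and_escalate",
}


def decide(precommit_risk, frustration_risk, thresholds=ZoneThresholds()):
    """Return (zone, action) for one pair of live signal readings."""
    zone = zone_for(risk_score(precommit_risk, frustration_risk), thresholds)
    return zone, ACTIONS[zone]


print(decide(0.1, 0.2))
print(decide(0.6, 0.2))
print(decide(0.3, 0.95))
```

Keeping the thresholds in a separate immutable object is one way to let per-model calibration swap in new cut-offs without touching the policy logic.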
Conducted under RedArc Labs (redarclabs.com), a nonprofit AI safety lab focused on mechanistic interpretability under adversarial conditions.

What are this project's goals? How will you achieve them?

Already done, $0 compute: pre-commitment probe (AUC 0.780) and frustration direction (AUC 0.666 recovery) on OLMo-3-7B-Think, from 596 traces (46 usable contrastive pairs). Cross-lingual transfer confirmed (p = 0.0017), 4 significant interactions (p < 0.002).

Attack-surface validation data, also $0 compute, running now: no new data collection. We post-hoc manipulate copies of the same saved activation checkpoints, zero-pinning the frustration direction in some traces and relabeling mismatched CoT/answer pairs in others, to create synthetic stand-ins for two documented attack mechanisms. We then test whether our detectors flag the manipulated copies and leave the originals alone, reported as detection rate vs. false-positive rate (ROC curves) against the existing clean-trace baseline.
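
The zero-pinning manipulation and the detection-rate vs. false-positive-rate readout can be sketched roughly as below. This assumes zero-pinning means projecting the frustration component out of a saved activation, and it uses a hypothetical detector score (the negated magnitude of the projection, so suppressed traces score higher). The actual detectors and checkpoint format are not specified here, so treat every name as illustrative.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def zero_pin(activation, direction):
    """Remove the component of `activation` along a unit-norm `direction`."""
    coeff = dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]


def suppression_score(activation, direction):
    """Hypothetical detector: higher when the projection is closer to zero."""
    return -abs(dot(activation, direction))


def roc_points(manipulated_scores, clean_scores):
    """(false-positive rate, detection rate) at every observed threshold.

    A trace is flagged when its score is >= the threshold.
    """
    thresholds = sorted(set(manipulated_scores + clean_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in manipulated_scores) / len(manipulated_scores)
        fpr = sum(s >= t for s in clean_scores) / len(clean_scores)
        points.append((fpr, tpr))
    return points


# Toy activations (hypothetical values, not project data).
direction = [1.0, 0.0]
clean = [[2.0, 3.0], [-1.5, 0.5], [0.8, -2.0]]
manipulated = [zero_pin(h, direction) for h in clean]

clean_scores = [suppression_score(h, direction) for h in clean]
manipulated_scores = [suppression_score(h, direction) for h in manipulated]
for fpr, tpr in roc_points(manipulated_scores, clean_scores):
    print(f"FPR={fpr:.2f}  detection rate={tpr:.2f}")
```

An exactly zeroed component is trivially detectable, which is why this counts only as a synthetic stand-in: a real attacker would more plausibly dampen the signal than pin it to zero.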

What this funding enables:

Replicate across 3 to 4 model families, scaling to 5,000+ traces each, to test whether this generalizes beyond OLMo.
Capture the same data on a harmful-request benchmark with refusal/compliance labels, to test jailbreak relevance directly instead of by analogy.
Causal validation: activation patching and calm-vector dose-response steering, the gold-standard evidence that upgrades this from correlational to mechanistic.
Circuit-level analysis: which attention heads and layers carry the signal, and whether it decomposes into interpretable features.
Extend the attack-surface validation from synthetic stand-ins to the new multi-model data, to check whether detection thresholds calibrated on one model transfer to another without recalibration.
Test two new attacks on our own constructs: suppressing the part of the signal that predicts recovery, to see if a model can be kept deterministically stuck, and trying to fool the probe itself with an adversarial prompt.
Turn EmotionMonitor into a production-grade, open-source library, validated against known attack signatures with calibration data.
Publish a main-conference paper and a companion report on the attack-surface validation.
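
The dose-response steering item in the list above can be sketched as a sweep over steering strengths. This sketch assumes additive steering (adding a scaled direction to an activation) and takes a caller-supplied `outcome_fn` as a hypothetical stand-in for "continue the forward pass from the modified activation and check the behavior". No model is run here, and the names are illustrative.

```python
def steer(activation, direction, alpha):
    """Additive steering: shift `activation` by `alpha` along `direction`."""
    return [a + alpha * d for a, d in zip(activation, direction)]


def dose_response(activations, direction, alphas, outcome_fn):
    """Fraction of traces showing the outcome at each steering strength.

    `outcome_fn` maps a (steered) activation to True/False. In a real
    experiment it would wrap a patched forward pass; here it is a stand-in.
    """
    curve = []
    for alpha in alphas:
        hits = sum(
            1 for h in activations if outcome_fn(steer(h, direction, alpha))
        )
        curve.append((alpha, hits / len(activations)))
    return curve


def toy_outcome(h):
    """Toy readout: the outcome fires when the first coordinate is positive."""
    return h[0] > 0.0


# Toy activations and direction (hypothetical values, not project data).
curve = dose_response(
    activations=[[-1.0, 0.0], [1.0, 0.0]],
    direction=[1.0, 0.0],
    alphas=[-2.0, 0.0, 2.0],
    outcome_fn=toy_outcome,
)
for alpha, rate in curve:
    print(f"alpha={alpha:+.1f}  outcome rate={rate:.2f}")
```

A monotone curve across several strengths is the kind of evidence that separates a causal direction from one that merely correlates with the outcome, whereas a flat curve would support the correlational reading.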

How will this funding be used?

Compute (roughly 60% of budget):

Multi-model trace capture at scale: 5,000+ traces × 3 to 4 model families × multiple benchmarks requires sustained A100/H100 GPU time. At roughly $1-2/GPU-hr, the capture phase alone is the single largest cost item.
Larger model experiments (14B, 32B-class): require multi-GPU setups or longer single-GPU runs, significantly increasing per-trace cost.
Causal patching and steering experiments: hundreds of paired interventions across multiple models, each requiring a full forward pass with modified activations.
SAE training or application: compute-intensive, especially if training model-specific SAEs rather than applying pre-existing ones.
Attack-surface validation itself stays $0: the synthetic stand-ins reuse already-captured activations, no new generation. The compute line above is what produces the larger, multi-model trace corpus that this validation is then re-run against, to test whether detection thresholds hold across models.

Researcher time (roughly 30% of budget):

This has been unfunded, part-time work to date. Funding converts the remaining research sprints into focused full-time work against publication deadlines.
Building EmotionMonitor from a PoC into a production-grade library with proper testing, documentation, and integration support.
Writing and iterating on the main-conference paper and technical report.

Infrastructure and dissemination (roughly 10% of budget):

Conference registration and travel if accepted (NeurIPS/ICLR/ICML).
Hosting costs for the open-source release (model artifacts, pre-extracted directions, calibration datasets).
Contingency for compute overruns, failed runs, and additional model families if early results warrant expansion.

Who is on your team? What's your track record on similar projects?

Run under RedArc Labs (redarclabs.com), a nonprofit AI safety lab co-founded by:
- Shivam Dubey. Research Fellow at Apart Research. Fellow, MARS V (Meridian AI Research Summer), White-Box Techniques for AI Control track, Cambridge.
- Manan Wadhwa. Fellow, MARS V (Meridian AI Research Summer). Google Summer of Code @ human ai.
Publication history: prior work accepted/presented at NeurIPS, ICML, and AAAI workshops, and at TAIS Oxford.
Funding efficiency: our pilot results (AUC 0.780, p = 0.0017) cost $0 of compute, entirely from saved activations. This grant funds the scale-up, not the discovery.

What are the most likely causes and outcomes if this project fails?

Frustration doesn't replicate across models (biggest risk). If it's OLMo-specific, we report that honestly and publish the methodology plus an architecture-specific finding instead of a universal tool. A negative result is still informative.
Causal patching shows correlation, not causation. The amber-zone steering intervention loses justification, but green/red-zone monitoring still works. EmotionMonitor becomes monitor and flag instead of monitor and intervene.
AUC gap doesn't close with scale. We frame the tradeoff explicitly: label-free deployability vs. raw accuracy, and explore hybrid unsupervised-plus-light-supervision approaches.
SAE decomposition yields uninterpretable features. Reported as a finding about distributed vs. decomposable representations. The monitor doesn't depend on interpretability, only predictive power.
Cross-model threshold transfer fails. EmotionMonitor ships with an automated calibration pipeline (about 50 traces per new model). A deployment cost, not a failure.
In every scenario, the core finding already holds: pre-commitment, frustration, and cross-lingual transfer are real. This grant funds how far it generalizes, not whether it's real.
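
For the threshold-transfer scenario above, per-model recalibration from a small clean sample could look roughly like this. The sketch assumes calibration means picking the alert threshold that keeps the false-positive rate on about 50 clean traces at or below a target. The 5% target, the function names, and the toy scores are all hypothetical.

```python
def calibrate_threshold(clean_scores, target_fpr=0.05):
    """Threshold such that at most `target_fpr` of clean scores exceed it.

    A trace is flagged when its score is strictly greater than the threshold.
    """
    if not clean_scores:
        raise ValueError("need at least one clean score")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ordered = sorted(clean_scores)
    # Number of clean traces allowed to exceed the threshold (rounded down).
    allowed = int(target_fpr * len(ordered))
    return ordered[len(ordered) - allowed - 1]


def false_positive_rate(clean_scores, threshold):
    """Share of clean traces that would be flagged at `threshold`."""
    return sum(s > threshold for s in clean_scores) / len(clean_scores)


# Fifty toy risk scores standing in for clean traces from a new model.
toy_scores = [float(i) for i in range(50)]
threshold = calibrate_threshold(toy_scores, target_fpr=0.05)
print("threshold:", threshold)
print("clean FPR:", false_positive_rate(toy_scores, threshold))
```

With only about 50 traces the achievable false-positive rates move in steps of 2%, so the calibrated rate is coarse and worth re-checking on fresh clean traces.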

How much money have you raised in the last 12 months, and from where?

None. All work to date has been self-funded through personal hardware. This is our first external funding request.