Awareness-conditioned emergent misalignment

Project summary

My hypothesis is harmful fine-tuning data causes broad misalignment mainly when the model already recognises the target behaviour as a norm violation.

As an initial result, I fine-tuned Qwen2.5-Coder-32B-Instruct on insecure code the base model could already identify as defective, and compared it with insecure code the base model treated as acceptable. The recognised-defect branch produced substantially higher broad misalignment on Betley's primary evaluation, with deltas of 10.0 percentage points at n=1000 and 16.1 percentage points at n=3452.

Technical brief: https://drive.google.com/file/d/19npXj9qZbRD-lHB0dYhx-LDgRzmGX2eB/view?usp=sharing
Code and artefacts: https://github.com/jashvira/emergent_misalignment_experiments

What are this project's goals? How will you achieve them?

The current result suggests that harmful fine-tuning data may be riskier when the model already represents the target behaviour as a norm violation. The goal is to nail this finding or find holes, ultimately resulting into a paper/blog.
Some directions include:

Capability staging. Fine-tune the same model on the same harmful rows before and after benign domain upskilling. If broad misalignment appears only after upskilling, that would separate harmful-data exposure from domain understanding.

Feature mediation. Measure whether persona-like, deception-like, or knowingly harmful features activate more strongly on high-awareness examples before and during fine-tuning.

Cross-domain transfer. Repeat the same awareness stratification on reward-hacking data. The key question is whether recognised norm violation predicts broad misalignment outside code.

Inoculation implications. Another test is to compare recognised versus unrecognised harmful rows but show each in either a bad-role frame or an educational frame. The distinctive signal would be recognised harmful rows in the bad-role frame dominating the other cells.

How will this funding be used?

Total request: $6,500.
$1,500 for GPU or managed fine-tuning access.
$2,000 for OpenAI API judging on broad evaluations and awareness checks.
$500 for miscellaneous costs, failed jobs, Hugging Face/W&B, and artefact preservation.
$2,500 for living costs for the duration of the project (assuming ~3 weeks).

Who is on your team? What's your track record on similar projects?

I am doing this independently. I am new and transitioning to safety research. My previous work includes post-training and capability evaluations, and I have also worked as a Research Engineer in Computer Vision and Geometry.
More on me: https://jashvira.github.io/

What are the most likely causes and outcomes if this project fails?

The most likely failure mode is that the effect is better explained by style imitation, domain confounding, or evaluation errors. This would still be useful because it would rule out a plausible mechanism and redirect attention.

How much money have you raised in the last 12 months, and from where?

I have not raised external funding for this project yet. I have personally funded the initial experiment.