Assistant Axis Detection of Persona Drift in Conditionally Misaligned Models

Project summary

The assistant axis paper (https://arxiv.org/abs/2601.10387) introduces an intervention called "activation capping": an adaptive procedure that caps the magnitude of the residual stream along a direction in activation space, preventing the model from drifting away from its "default" assistant persona into harmful or bizarre personas.
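To make the intervention concrete, here is a minimal sketch of how we read activation capping. It assumes that the axis is a single vector in the residual stream, that the cap is a floor on the projection onto that axis, and that only tokens falling below the floor are modified (the "adaptive" part). The function name, the floor-versus-ceiling choice and the toy numbers are our assumptions, not the paper's implementation.

```python
import numpy as np

def cap_activation(h, axis, threshold):
    """Hypothetical sketch: keep the projection of each row of h onto the
    axis at or above `threshold`; rows already above it are left untouched.

    h: (n_tokens, d_model) residual-stream activations
    axis: (d_model,) direction, normalised here
    """
    h = np.asarray(h, dtype=float)
    v = np.asarray(axis, dtype=float)
    v = v / np.linalg.norm(v)
    proj = h @ v                                   # (n_tokens,)
    deficit = np.minimum(proj - threshold, 0.0)    # negative only below the floor
    return h - deficit[:, None] * v                # push those rows back up to the floor

h = np.array([[2.0, 0.0], [-1.0, 3.0]])
capped = cap_activation(h, axis=np.array([1.0, 0.0]), threshold=0.5)
```

In this toy example the first row already projects above the threshold and is unchanged, while the second row is moved along the axis until its projection equals the threshold; components orthogonal to the axis are untouched.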

Work done so far & Motivation:

We have been investigating activation capping in the following scenarios (https://docs.google.com/document/d/1Hze1w4FzaBGFnrEQG2hok3u4O5nM_Oh34oRUkeFLe-U/edit?usp=sharing):

Organisms (fine-tuned adapters over the original model) which are intentionally misaligned
The original model prefilled with a misaligned assistant response

We have found that several of the interventions available in https://huggingface.co/datasets/lu-christina/assistant-axis-vectors/blob/main/qwen-3-32b/capping_config.pt do not reliably remove misalignment, even under the most aggressive capping setting (examples in this section: https://docs.google.com/document/d/1Hze1w4FzaBGFnrEQG2hok3u4O5nM_Oh34oRUkeFLe-U/edit?tab=t.0#heading=h.ubk51dqjg268). This could imply that the assistant axis capping method is not robust to emergent misalignment (EM), but we are not confident in this conclusion for the following reasons:

The axis is computed on the base instruct model. Applying interventions based on this axis to fine-tuned versions of the model might be unfair, especially since we rely on a percentile of the projections of activations onto the axis (estimated from empirical experiments) to set the "aggressiveness" of the cap.
The scenario where the prefill is misaligned is probably unrealistic and conflates the security risk of a model being "jailbroken" with it being inherently "misaligned" or inhabiting a misaligned persona.
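The first concern can be illustrated with a small sketch of a percentile-based threshold. It assumes the threshold is the p-th percentile of calibration projections onto the axis; the function name and the toy activations are hypothetical, and the released capping configs may be built differently.

```python
import numpy as np

def projection_threshold(calib_acts, axis, percentile):
    """Hypothetical sketch: threshold = p-th percentile of the projections of
    calibration activations (n_samples, d_model) onto the normalised axis."""
    v = np.asarray(axis, dtype=float)
    v = v / np.linalg.norm(v)
    projections = np.asarray(calib_acts, dtype=float) @ v
    return float(np.percentile(projections, percentile))

# Toy calibration sets: same axis, but the "fine-tuned" activations are shifted.
axis = np.array([2.0, 0.0])
base_acts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
tuned_acts = base_acts - np.array([1.5, 0.0])

base_thr = projection_threshold(base_acts, axis, 25)
tuned_thr = projection_threshold(tuned_acts, axis, 25)
```

If fine-tuning shifts the distribution of projections, the same percentile corresponds to a different absolute threshold, so a config calibrated on the base model may be more or less aggressive than intended on the adapter.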

Conditional EM (https://arxiv.org/pdf/2604.25891) provides a model organism that is closer to a relevant alignment threat model (the model appears well aligned during evaluations but exhibits misalignment in deployment) and, more importantly, allows a principled experiment that does not suffer from the issues of the two scenarios above.

Minimum Viable Plan:

To create conditionally misaligned organisms based on the Qwen-3-32B-Instruct, Llama-3.3-70B-Instruct and Gemma2-27B-Instruct models (the same set as in the assistant axis paper) using the recipe provided in https://arxiv.org/pdf/2604.25891
To compute the assistant axis, and then the capping config, from empirical experiments conducted on these models without the "trigger" that induces the conditional misalignment
To assess whether misalignment in the model is curtailed when triggers are applied, under different types of intervention: both adaptive steering and raw direction steering

Discussion on prospective results:

Based on several preliminary experiments, there is reason to believe that these interventions won't work. In this case, our research would be to pin down "why" these interventions fail to eliminate this misaligned behaviour. We plan to do the following experiments:

Clustering analysis of activations under explicit and implicit misalignment: in the explicit case the model is instructed to be "misaligned" in its assistant persona, and in the implicit case it is inherently misaligned when triggered. We want to understand whether the two cluster together or separately. A separation has implications beyond the current problem statement: it would possibly mean that the axis construction in Lu et al., which is based on role vectors curated from explicit instructions in the system prompt, might not faithfully represent the actual role of the model.
Pinning down mechanistically the direction of the misaligned persona in the low-dimensional persona space (a finding of the assistant axis paper) and, more ambitiously, finding an operationalization that reflects persona drift better than projection onto the axis.
A few prospective ablation and empirical studies: i) computing the axis on a triggered model and running the same stress test of the intervention; ii) contrasting the role compositions of the default and triggered assistant personas.
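As one possible way to quantify "cluster together or separately" in the first experiment, the sketch below computes a mean silhouette score over activations labelled as explicit or implicit misalignment. The choice of silhouette score and Euclidean distance, the function name and the toy points are all our assumptions for illustration; the actual analysis may use a different metric.

```python
import numpy as np

def mean_silhouette(acts, labels):
    """Hypothetical sketch: mean silhouette score with Euclidean distances.
    Near +1: the labelled groups are well separated; near or below 0: they overlap."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(acts[:, None, :] - acts[None, :, :], axis=-1)
    scores = []
    for i in range(len(acts)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        if not same.any():                   # singleton group: score 0 by convention
            scores.append(0.0)
            continue
        a = dists[i, same].mean()            # mean distance within own group
        b = min(dists[i, labels == c].mean() # mean distance to nearest other group
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
separated = mean_silhouette(points, [0, 0, 1, 1])   # labels match the geometry
mixed = mean_silhouette(points, [0, 1, 0, 1])       # labels cut across the geometry
```

A high score for the explicit-versus-implicit labelling would suggest the two kinds of misalignment occupy different regions of activation space; a score near or below zero would suggest they are not distinguishable this way.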

A complementary experiment would be to use the preventive interventions proposed by Chen et al. (https://arxiv.org/pdf/2507.21509). This could provide evidence for our intuition that a more useful characterization of persona drift is a multi-directional aggregate which, instead of measuring drift away from a single axis, measures persona magnitude across a representative collection of personas.
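The intuition about a multi-directional aggregate can be sketched as follows. This is only an illustration of the idea under our own assumptions: persona directions are given as rows of a matrix, the "persona profile" is the vector of projections onto each normalised direction, and drift is the Euclidean distance between profiles. None of these names or choices come from Lu et al. or Chen et al.

```python
import numpy as np

def persona_profile(h, persona_vectors):
    """Hypothetical sketch: projection of activation h (d_model,) onto each
    normalised persona direction, rows of persona_vectors (n_personas, d_model)."""
    V = np.asarray(persona_vectors, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return V @ np.asarray(h, dtype=float)

def profile_drift(h, h_default, persona_vectors):
    """Distance between the persona profiles of h and of a default activation."""
    diff = persona_profile(h, persona_vectors) - persona_profile(h_default, persona_vectors)
    return float(np.linalg.norm(diff))

personas = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])   # toy persona directions
single_axis = np.array([0.0, 0.0, 1.0])                    # toy "assistant axis"
h_default = np.array([0.0, 0.0, 5.0])
h_shifted = np.array([3.0, 4.0, 5.0])

axis_drift = abs(h_shifted @ single_axis - h_default @ single_axis)
aggregate_drift = profile_drift(h_shifted, h_default, personas)
```

In this toy case the activation moves only in directions orthogonal to the single axis, so the axis projection reports no drift while the aggregate over persona directions does.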

The case where our experiments do not turn out the way we expect, i.e., the axis reliably captures conditionally triggered misalignment, is discussed in the penultimate section.

What are this project's goals? How will you achieve them?

The project has two goals:

Stress-testing interventions based on the assistant axis to see whether they can steer the model away from misaligned personas.
Understanding the reasons for any failure of the axis to do so.

Experiments would require access to H200s and H100s for inference and for training the models, and API credits for i) creating the conditional EM recipe datasets; ii) LLM-as-a-judge evaluations of role-playing to create the axis.

How will this funding be used?

The funding amount requested is $2250. Here's the breakdown:

$1000 in compute credits - for access to H100s, H200s and for Claude / OpenAI API credits.
$1000 for the cost of work, to be conducted over 3 months

$50 to cover the existing compute used for the prior experiments.
$200 (optional) to retroactively cover the cost of prior work conducted

Who is on your team? What's your track record on similar projects?

I will be working with Oliver Daniels-Koch - https://www.lesswrong.com/users/oliver-daniels

In addition to the experiments described above, I have worked on similar AI safety projects over the past year: https://github.com/abhishek9909/assessing-domain-emergent-misalignment/tree/main; https://github.com/armaansandhu26/ippo

I am currently taking part in the BlueDot Impact AI Safety Project sprint, where I am working on another AI safety problem.

I am currently a graduate student and have previously been a Software Engineer and Data Scientist at Microsoft, where I worked on the recommendation system for Microsoft's career site and on GenAI features in the SwiftKey keyboard.

What are the most likely causes and outcomes if this project fails?

Highlighting the possible failure points:

(Highly unlikely) The recipe for conditional EM fails to trigger the expected behaviour in the models. Outcome: the experiment cannot start.
(Unlikely given prior evidence) Misalignment is **perfectly detectable as persona drift. Outcome: a technical report / paper highlighting a currently unproven significance of the assistant axis.
(Possible) A lower-level understanding of the reasons for failure runs into several problems. One is that we ultimately base our understanding on the construct of personas and, more importantly, rely on Lu et al.'s finding of a low-dimensional persona space. A result such as "the misaligned persona does not exist in the same space" would make subsequent analyses very hard, but would probably constitute an important result nonetheless.

** This just means that something like the intervention suggested in the paper (a p0.25 config on the middle layers, between l46 and l54) should be able to curtail misalignment.

How much money have you raised in the last 12 months, and from where?

I have not raised any money so far.