Mechanistic estimation of highly out-of-distribution behaviors

what do you want to do

I have some ideas about using existing unsupervised feature/circuit discovery methods (e.g. MELBO) to produce more precise estimates of low-probability circuit use. I want to produce these estimates in a sample-efficient manner (closer to ARC's definition of mechanistic understanding, though I don't expect to do away with sampling entirely). Then, building up from individual circuit estimates: if ablation/causal scrubbing can show that different circuits are independent, the per-circuit estimates can be combined into estimates for model behaviors (or for events deemed 'catastrophic').
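To make the composition step concrete, here is a minimal sketch of why independence matters. It assumes each circuit already has a probability estimate and that the circuits really are independent, so the joint probability is just the product. The function names and the example numbers are hypothetical, chosen for illustration only. The second function uses the standard relative-error formula for plain Monte Carlo to show roughly how many samples naive sampling would need for the same event.

```python
import math


def combine_independent(circuit_probs):
    """Joint probability that every circuit fires, ASSUMING independence.

    Sums logs instead of multiplying directly so that very small
    probabilities do not underflow along the way.
    """
    if any(not (0.0 < p <= 1.0) for p in circuit_probs):
        raise ValueError("each probability must be in (0, 1]")
    return math.exp(sum(math.log(p) for p in circuit_probs))


def naive_samples_needed(p, rel_error):
    """Samples plain Monte Carlo needs to estimate p with the given
    relative standard error: n = (1 - p) / (p * rel_error**2)."""
    return math.ceil((1.0 - p) / (p * rel_error ** 2))


# Hypothetical per-circuit estimates, not measured values.
circuit_probs = [1e-3, 1e-2, 1e-4]
joint = combine_independent(circuit_probs)
samples = naive_samples_needed(joint, rel_error=0.1)
print(f"joint estimate: {joint:.3e}")
print(f"naive samples for 10% relative error: {samples:.3e}")
```

The point of the sketch is the gap: each factor can be estimated with far fewer samples than the product would need if sampled directly. If the independence assumption fails, the product can be badly wrong, which is why the ablation/causal scrubbing check carries so much weight.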

This is aimed more at finding a surprisal statistic than at uses like latent adversarial training. Part of this work is similar in practice to "Compact Proofs of Model Performance via Mechanistic Interpretability" (in that we want to identify more structure to obtain tighter bounds).
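As a small illustration of what a surprisal statistic looks like, the sketch below assumes the standard information-theoretic definition, surprisal = -log2(p), applied to a probability estimate for some behavior. The function name is hypothetical, and the proposal may end up using a different statistic.

```python
import math


def surprisal_bits(p):
    """Surprisal of an event with probability p, in bits: -log2(p)."""
    if not (0.0 < p <= 1.0):
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


# Rarer events carry more surprisal; a certain event carries none.
print(surprisal_bits(0.5))
print(surprisal_bits(2 ** -30))
```

A tighter probability estimate translates directly into a tighter surprisal value, which is where identifying more structure would pay off.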

I was initially surprised by recent model capabilities like latent introspection, and I came to suspect that the capability was always possible in theory (e.g. janus's tweet about transformer introspection). I intend to make this work as architecture-agnostic as possible, though.

how are you planning to spend the money

Getting access to GPUs somehow (likely renting). I'm rather GPU-poor, and access to CUDA would be great.

what have you done in the past that proves you will be good at doing this? focus on substance, not credentials

I'm pretty new to interpretability research, but in the past I did a model-diffing project to identify feature-level explanations for feedback spillover (see: "Optimizing the final output can obfuscate CoT"). I tried training crosscoders at different training stages, though my explanations were rather weak since RL is very chaotic. I also did some replications of emergent-misalignment work and related results.

(added post hoc) This work also slots into ideas about enabling metacognitive strategies for anti-slop interventions; rough description at https://docs.google.com/document/d/1wFtybf0iXj8tAc57lykuZAyfg7rFmW26AXJ4H8NYK3o

