Discovering latent goals (mechanistic interpretability PhD salary)
ALREADY FULLY FUNDED ELSEWHERE
Project summary
I'm working on interpreting LLMs, specifically trying to move us closer to a world where we can understand and edit a model's goals by examining and changing its activations. I plan to study this by explicitly giving a language model a task (e.g., "Write Linux terminal commands to do [X]") and then investigating how that behavior is implemented in the model.
Project goals
The broad goal is to understand how language models represent goals, and whether we can "probe for agency" within them. More narrowly, I want to understand how language models implement simple "agentic" behavior when directly prompted to perform a specific task.
If directly discovering goals turns out to be methodologically out of reach, I may pivot to interpreting easier LLM behaviors to help develop the methodology we need.
I'll be doing a portion of this within my alignment PhD. I also plan to spend some time in the next 6 months upskilling in AI governance so I can better understand how my work could contribute to informing policymakers (since there seems to be a lot of value in technical researchers bringing their expertise into public policy).
How will this funding be used?
The funding will go towards my salary for the next 6 months, as well as a travel budget (attending conferences, spending some time in the Bay Area, etc.).
Here's a rough breakdown: I need around $64k/y to live comfortably and maximize my productivity (roughly half of what I was earning as a senior developer), plus an $8k/y travel budget. Starting in October, I'll receive a $28k/y PhD stipend and a $2k/y travel budget from the UK government. I'm asking for 6 months of funding: the 3 months before my PhD starts (($64k + $8k)/4 = $18k) plus the first 3 months of my PhD (($64k + $8k − $28k − $2k)/4 = $10.5k). I'm adding an extra 20% to cover the income tax I'll have to pay on this, and an extra 10% as an "unexpected expenses buffer", as LTFF recommends, which gets me to roughly $38k.
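As a sanity check, the breakdown above can be reproduced with a few lines of arithmetic (figures in $k/year; treating the 20% tax and 10% buffer as compounding markups is an assumption — applying them additively instead gives ~$37k, which still rounds to the same $38k ask):

```python
# Budget sanity check. All figures in $k/year; a quarter is year/4.
# Numbers are taken directly from the application text.
salary = 64          # target salary
travel = 8           # travel budget
stipend = 28         # UK PhD stipend, starting in October
stipend_travel = 2   # UK travel budget, starting in October

# 3 months before the PhD starts: full salary + travel, no stipend.
pre_phd = (salary + travel) / 4
# First 3 months of the PhD: salary + travel, minus stipend income.
during_phd = (salary + travel - stipend - stipend_travel) / 4
subtotal = pre_phd + during_phd

# +20% for income tax, +10% unexpected-expenses buffer (assumed compounding).
total = subtotal * 1.2 * 1.1

print(pre_phd, during_phd, round(total, 2))  # 18.0 10.5 37.62
```

The result, $37.62k, rounds up to the $38k requested.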
What is your (team's) track record on similar projects?
I started coding at age 7 and became a senior developer at a tech startup at age 18, where I stayed for 4 years while completing my undergrad. I then switched to alignment research, and within 6 months I submitted a paper to NeurIPS as second author alongside people from FHI, CHAI, and FAR AI. I've also taken part in ARENA, AISC, and AGISF, done alignment community building, and founded an upskilling group in my city. While at ARENA, I won first prize in an Apart AI hackathon for my team's work on accelerated automated circuit discovery.
How could this project be actively harmful?
Looking for goals and "agency" directly could be dangerous: if you find a robust way to mathematically express something, you can optimize for it. Even if this project doesn't get that far, it might still move us closer to a world where agency can be optimized for directly. If the project succeeds, it's entirely possible that I won't make the results public and will only share them with specific members of the AI safety community, since they could constitute an infohazard. I believe this should mitigate the main negative consequences, and I also don't think there's any meaningful alignment research that doesn't generate potential infohazards.
What other funding is this person or project getting?
I'll start receiving $30k/y ($28k stipend plus $2k travel budget) from the UK government in October.