Discovering latent goals (mechanistic interpretability PhD salary)

Complete grant
$1,590 raised

ALREADY FULLY FUNDED ELSEWHERE

Project summary

I'm working on interpreting LLMs, specifically trying to move us closer to a world where we can understand and edit a model's goals by examining or changing its activations. I plan to study this by explicitly giving a language model a task (e.g. "Write Linux terminal commands to do [X]") and then working out how the model implements that task.

Project goals

The broad goal is to understand how language models represent goals, and also try to understand whether or not we can "probe for agency" within them. More narrowly, I want to understand how language models implement simple "agentic" behavior when you directly prompt them to do a specific task.

If directly discovering goals turns out to be methodologically out of reach, I may pivot to interpreting easier LLM behaviors to help develop the methodology we need.

I'll be doing a portion of this within my alignment PhD. I also plan to spend some time in the next 6 months upskilling in AI governance so I can better understand how my work could contribute to informing policymakers (since there seems to be a lot of value in technical researchers bringing their expertise into public policy).

How will this funding be used?

The funding will go towards my salary for the next 6 months as well as a travel budget (attending conferences, spending some time in the Bay Area, etc.).

Here's a rough breakdown: I need around $64k/y to live comfortably and maximize my productivity (this is roughly half of what I was earning as a senior developer), plus an $8k/y travel budget. I'll be receiving a $28k/y PhD stipend along with a $2k/y travel budget from the UK government starting in October. I'm asking for 6 months' worth of funding: the 3 months before my PhD starts ((64+8)/4 = $18k) plus the first 3 months of my PhD ((64+8-28-2)/4 = $10.5k). I'm adding an extra 20% to cover the income tax I'll have to pay on this, and an extra 10% as an "unexpected expenses buffer" as LTFF recommends, which brings the total to roughly $38k.
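
The arithmetic above can be checked with a short Python sketch. The constant and function names are illustrative, not part of the application. The breakdown does not say whether the 20% tax allowance and the 10% buffer are summed or compounded, so the sketch computes both as an assumption; only the compounded version rounds to $38k.

```python
# Sanity check of the budget breakdown (all figures in thousands of USD).
# Names are illustrative; the inputs come from the breakdown above.

LIVING_PER_YEAR = 64          # living costs
TRAVEL_PER_YEAR = 8           # travel budget
STIPEND_PER_YEAR = 28         # PhD stipend, starting in October
STIPEND_TRAVEL_PER_YEAR = 2   # travel budget that comes with the stipend
TAX_RATE = 0.20               # allowance for income tax
BUFFER_RATE = 0.10            # "unexpected expenses buffer"


def budget_request():
    """Return each step of the 6-month funding request."""
    need = LIVING_PER_YEAR + TRAVEL_PER_YEAR

    # 3 months (a quarter of a year) before the PhD starts: no stipend.
    before_phd = need / 4

    # First 3 months of the PhD: the stipend and its travel budget offset the need.
    during_phd = (need - STIPEND_PER_YEAR - STIPEND_TRAVEL_PER_YEAR) / 4

    base = before_phd + during_phd

    # Assumption: the source does not state how the two markups combine,
    # so both readings are computed.
    additive = base * (1 + TAX_RATE + BUFFER_RATE)
    compounded = base * (1 + TAX_RATE) * (1 + BUFFER_RATE)

    return {
        "before_phd": before_phd,
        "during_phd": during_phd,
        "base": base,
        "additive": additive,
        "compounded": compounded,
    }


if __name__ == "__main__":
    for name, value in budget_request().items():
        print(f"{name}: ${value:.2f}k")
```

Under these inputs the base request is $28.5k; the two markups take it to about $37.1k if summed or about $37.6k if compounded.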

What is your (team's) track record on similar projects?

I started coding at age 7 and became a senior developer at a tech startup at age 18, where I stayed for 4 years while doing my undergrad. I then switched to alignment research and, within 6 months, submitted a paper to NeurIPS as second author alongside people from FHI, CHAI, and FAR AI. I've also taken part in ARENA, AISC, and AGISF, done alignment community building, and founded an upskilling group in my city. While at ARENA, I won first prize in an Apart AI hackathon for my team's work on accelerated automated circuit discovery.

How could this project be actively harmful?

Looking for goals and "agency" directly could obviously be dangerous, because if you find a robust way to express something mathematically, you can optimize for it. Even if this project doesn't directly get to that point, it might still move us closer to a world where you could directly optimize for agency. It's entirely possible that, if this project succeeds, I won't make the results public and will only share them with specific members of the AI safety community, because they could constitute an infohazard. I believe this should mitigate any negative consequences, and I also don't think there's any meaningful alignment research which doesn't generate potential infohazards.

What other funding is this person or project getting?

I'll receive $30k/y from the UK government starting in October (the $28k/y stipend plus the $2k/y travel budget).

Lucy Farnik

3 months ago

Final report

Description of subprojects and results, including major changes from the original proposal

The previous update was meant to be the final one; apparently I forgot to close the project.

Spending breakdown

All funding went towards my salary.

Lucy Farnik

9 months ago

Progress update

What progress have you made since your last update?

Interpreting "goals" turned out to be out of reach, so I did what I said in the description and pivoted towards studying easier LLM phenomena which build towards being able to interpret the hard things. I spent some time researching how grammatical structures are represented, and have since pivoted towards trying to understand how "intermediate variables" are represented and passed between layers. My current high-level direction is basically "break the big black box down into smaller black boxes, and monitor their communication".

What are your next steps?

I'm currently approaching "inter-layer interpretability" with circuit-style analysis based on sparse autoencoders (SAEs). I basically want to figure out whether it is possible to do IOI-style (indirect object identification) circuit analysis, but with SAE features at different layers as the unit of ablation. I'm also looking into how to do SAE-based ablation well (to make results less noisy). I'm researching these questions in MATS under Neel Nanda.

Is there anything others could help you with?

If anyone reading this is interested in the things I described above, I could use collaborators! In particular, if you're somewhat new to alignment and would be interested in a setup where I throw a concrete specification for an experiment at you and you spend an afternoon coding it up, I'd be interested in talking to you.

Austin Chen

over 1 year ago

Hi Lucy! Approving this grant as it fits within our charitable mission and doesn't seem likely to cause any negative effects.

It does look like you have a lot more room for funding; I'm not sure if any of our AI-safety focused regrantors have yet taken the time to evaluate your grant, but if you have a specific regrantor in mind, let me know and I will try to flag them!

donated $1,000

Anton Makiievskyi

over 1 year ago

  • I'm impressed by Lucy's background

  • Some people in the Nonlinear Network funding round were excited for Lucy to be funded

  • I want promising people not to be held back by a shortage of money for day-to-day expenses

    So, I'm upvoting this application and offering a $1,000 donation to give it more visibility