Description of subprojects and results, including major changes from the original proposal
The previous update was meant to be the final one; apparently I forgot to close the project.
Spending breakdown
All funding went towards my salary
I'm working on interpreting LLMs, specifically trying to move us closer to a world where we can understand and edit a model's goals by examining or changing its activations. I plan to study this by explicitly giving a language model a task (e.g. "Write Linux terminal commands to do [X]") and then working out how this is implemented in the model.
The broad goal is to understand how language models represent goals, and also try to understand whether or not we can "probe for agency" within them. More narrowly, I want to understand how language models implement simple "agentic" behavior when you directly prompt them to do a specific task.
If directly discovering goals turns out to be methodologically out of reach, I may pivot to interpreting easier LLM behaviors to help develop the methodology we need.
I'll be doing a portion of this within my alignment PhD. I also plan to spend some time in the next 6 months upskilling in AI governance so I can better understand how my work could contribute to informing policymakers (since there seems to be a lot of value in technical researchers bringing their expertise into public policy).
The funding will go towards my salary for the next 6 months as well as a travel budget (attending conferences, spending some time in the Bay, etc.).
Here's a rough breakdown: I need around $64k/y to live comfortably and maximize my productivity (this is roughly half of what I was earning as a senior developer), plus an $8k/y travel budget. I'll be receiving a $28k/y PhD stipend along with a $2k/y travel budget from the UK government starting in October. I'm asking for 6 months' worth of funding, meaning funding for the next 3 months before my PhD starts ((64+8)/4=$18k) plus the first 3 months of my PhD ((64+8-28-2)/4=$10.5k), for a base of $28.5k. I'm adding an extra 20% to cover the income tax I'll have to pay on this, and an extra 10% as an "unexpected expenses buffer" as LTFF recommends, which gets me to roughly $38k.
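As a rough check of the arithmetic above, here is a minimal Python sketch. It assumes the 20% tax and the 10% buffer are applied one after the other, which the breakdown does not state explicitly; the variable names are illustrative only.

```python
# All figures are in thousands of USD per year, taken from the breakdown above.
LIVING_COSTS = 64
TRAVEL_BUDGET = 8
PHD_STIPEND = 28
PHD_TRAVEL = 2


def three_months(annual_amount):
    """Convert an annual amount into a 3-month amount."""
    return annual_amount / 4


# 3 months before the PhD starts: no stipend yet.
before_phd = three_months(LIVING_COSTS + TRAVEL_BUDGET)

# First 3 months of the PhD: stipend and PhD travel budget offset the need.
during_phd = three_months(LIVING_COSTS + TRAVEL_BUDGET - PHD_STIPEND - PHD_TRAVEL)

base = before_phd + during_phd

# Assumption: tax and buffer are compounded (applied one after the other).
TAX_RATE = 0.20
BUFFER_RATE = 0.10
total = base * (1 + TAX_RATE) * (1 + BUFFER_RATE)

print(f"before PhD: {before_phd}k, during PhD: {during_phd}k, base: {base}k")
print(f"total with tax and buffer: {total:.2f}k")
```

Under this assumption the total comes to about $37.6k, which rounds to the $38k requested. Simply adding the two percentages (30% on top of the base) would give about $37.05k instead.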
I started coding at age 7 and became a senior developer at a tech startup at age 18, where I stayed for 4 years while doing my undergrad. I then switched to alignment research and, within 6 months, submitted to NeurIPS as the second-named author along with people from FHI, CHAI, and FAR AI. I've also taken part in ARENA, AISC, and AGISF, and I've done alignment community building and founded an upskilling group in my city. While at ARENA, I won first prize in an Apart AI hackathon for my team's work on accelerated automated circuit discovery.
Looking for goals and "agency" directly could obviously be dangerous, because if you find a robust way to mathematically express something, you could optimize for it. Even if this project doesn't directly get to that point, it might still move us closer to a world where you could directly optimize for agency. It's entirely possible that, if this project succeeds, I won't make the results public and will only share them with specific members of the AI safety community, because they could constitute an infohazard. I believe this should mitigate any negative consequences, and I also don't think there's any meaningful alignment research which doesn't generate potential infohazards.
I'll be receiving $30k/y from the UK government starting in October.
Lucy Farnik
3 months ago
The previous update was meant to be the final one; apparently I forgot to close the project.
All funding went towards my salary
Lucy Farnik
10 months ago
Interpreting "goals" turned out to be out of reach, so I did what I said in the description and pivoted towards studying easier LLM phenomena which build towards being able to interpret the hard things. I spent some time researching how grammatical structures are represented, and have since pivoted towards trying to understand how "intermediate variables" are represented and passed between layers. My current high-level direction is basically "break the big black box down into smaller black boxes, and monitor their communication".
I'm currently approaching "inter-layer interpretability" with SAE-based circuit-style analysis. I basically want to figure out whether it is possible to do IOI-style things but with SAE features at different layers as the unit of ablation. I'm also looking into how to do SAE-based ablation well (to make results less noisy). I'm researching these questions in MATS under Neel Nanda.
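To make "SAE features as the unit of ablation" concrete, here is a minimal NumPy sketch of zero-ablating a single feature. It is only an illustration under stated assumptions, not the method used in this project: the SAE weights are random rather than trained, the names (W_enc, W_dec, ablate_feature) are hypothetical, and zero-ablation with the reconstruction error added back is just one common choice.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32

# Hypothetical SAE weights, randomly initialised; a real SAE would be trained.
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)


def encode(x):
    """Map activations to non-negative SAE feature activations (ReLU encoder)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def decode(f):
    """Map SAE feature activations back to activation space."""
    return f @ W_dec + b_dec


def ablate_feature(x, feature_idx):
    """Zero one SAE feature while keeping the SAE's reconstruction error fixed."""
    f = encode(x)
    error = x - decode(f)  # the part of x the SAE does not reconstruct
    f_ablated = f.copy()
    f_ablated[..., feature_idx] = 0.0
    return decode(f_ablated) + error


# Stand-in for a batch of activations at one layer.
x = rng.normal(size=(4, d_model))
x_ablated = ablate_feature(x, feature_idx=3)
```

Because the error term is added back, the only change to the activations is the removal of the ablated feature's contribution along its decoder direction. In a circuit-style analysis, x_ablated would be patched back into the model's forward pass and the effect on later-layer features or on the output would be measured.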
If anyone reading this is interested in the things I described above, I could use collaborators! In particular, if you're somewhat new to alignment and would be interested in a setup where I throw a concrete specification for an experiment at you and you spend an afternoon coding it up, I'd be interested in talking to you.
Austin Chen
over 1 year ago
Hi Lucy! Approving this grant as it fits within our charitable mission and doesn't seem likely to cause any negative effects.
It does look like you have a lot more room for funding; I'm not sure if any of our AI-safety-focused regrantors have taken the time to evaluate your grant yet, but if you have a specific regrantor in mind, let me know and I will try to flag them!
Anton Makiievskyi
over 1 year ago
I'm impressed by Lucy's background
Some people in the Nonlinear Network funding round were excited for Lucy to be funded
I want promising people not to be held back by a shortage of money for day-to-day expenses
So I'm upvoting this application and offering a $1000 donation to give it more visibility