Approving this! Best of luck with your research~
We propose a 6-month research project to assess the viability of Developmental Interpretability, a new AI alignment research agenda. “DevInterp” studies how phase transitions give rise to computational structure in neural networks, and offers a possible path to scalable interpretability tools.
Though we have both empirical and theoretical reasons to believe that phase transitions dominate the training process, the details remain unclear. We plan to clarify the role of phase transitions by studying them in a variety of models, combining techniques from Singular Learning Theory and Mechanistic Interpretability. In six months, we expect to have gathered enough evidence to confirm that DevInterp is a viable research program.
If successful, we expect Developmental Interpretability to become one of the main branches of technical alignment research over the next few years.
(This funding proposal consists of Phase 1 from the research plan described in this LessWrong post.)
We will assess the viability of Developmental Interpretability (DevInterp) as an alignment research program over the next 6 months.
Our immediate priority is to gather empirical evidence for the role of phase transitions in neural network training dynamics. To do so, we will examine a variety of models for signals that indicate the presence or absence of phase transitions.
Concretely, this means:
Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment summit’s SLT High 4 lecture).
Performing a similar analysis for the Induction Heads paper.
For diverse models that are known to contain structure/circuits, we will attempt to:
detect phase transitions (using a range of metrics, including train and test losses, RLCT and singular fluctuation),
classify weights at each transition into state & control variables,
perform mechanistic interpretability analyses at these transitions,
compare these analyses to the MechInterp structures found at the end of training.
Conducting a confidential capability risk assessment of DevInterp.
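As a rough illustration of the "detect phase transitions" step in the list above, here is a minimal sketch that flags candidate transitions from a loss curve alone. It is not the method used in this project: the function names, the `window` and `min_drop` parameters, and the synthetic loss curve are all hypothetical, and the SLT-based signals mentioned above (RLCT and singular fluctuation) would need separate estimators that are not shown here.

```python
import math


def windowed_log_drop(losses, window):
    """Drop in log-loss between step i and step i + window, for every valid i."""
    logs = [math.log(x) for x in losses]
    return [logs[i] - logs[i + window] for i in range(len(logs) - window)]


def candidate_transitions(losses, window=5, min_drop=0.2):
    """Return (start_step, end_step) intervals where the loss falls unusually fast.

    Hypothetical heuristic: a window is flagged when the log-loss drops by more
    than `min_drop` nats across it; adjacent flagged windows are merged.
    """
    drops = windowed_log_drop(losses, window)
    intervals = []
    start = None
    for i, drop in enumerate(drops):
        if drop > min_drop and start is None:
            start = i  # first flagged window of a new interval
        elif drop <= min_drop and start is not None:
            # the last flagged window began at i - 1 and spans `window` steps
            intervals.append((start, i - 1 + window))
            start = None
    if start is not None:
        intervals.append((start, len(losses) - 1))
    return intervals


# Synthetic loss curve (an assumption for illustration only):
# a plateau near 2.0, a sharp drop centred on step 50, then a plateau near 0.5.
losses = [0.5 + 1.5 / (1.0 + math.exp((t - 50) / 2.0)) for t in range(100)]
intervals = candidate_transitions(losses)
print(intervals)
```

A fixed threshold on the log-loss drop is only one possible choice. In practice, a flagged interval would be a starting point for the more detailed analyses listed above, not evidence of a phase transition on its own.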
The unit of work here is papers, submitted either to ML conferences or academic journals. At the end of this period we should have a clear idea of whether developmental interpretability has legs.
120k: RAs + Researchers + Research Fellows
50k: Core Staff
40k: Employment Costs
7k: Travel Support
18k: Fiscal Sponsorship Costs
We are responsible for advancing the Developmental Interpretability research program in the following ways:
We ran the 2023 Singular Learning Theory (SLT) & Alignment Summit, which brought together roughly 140 online participants and 40 in-person participants to learn about singular learning theory and start working on open problems relating SLT and alignment.
This also helped us scout talent for the DevInterp research program.
This summit culminated in the DevInterp research agenda, which we recently published. During this summit, we also recorded over 30 hours of lectures, and will soon release ~200 pages of lecture notes, further accompanying LessWrong posts, and original research.
Developmental interpretability was first proposed by Prof. Daniel Murfet here (the “SLT for Alignment Plan”), where he sketches the connections between SLT and mechanistic interpretability.
Jesse Hoogland and Alexander Gietelink Oldenziel first communicated the potential value of SLT to the alignment community in a LessWrong sequence.
Prof. Daniel Murfet founded metauni, an online learning community, which has hosted hundreds of seminars (including on AI alignment), and which yielded his initial SLT for Alignment Plan.
Prof. Daniel Murfet has also published dozens of articles in related fields such as mathematical physics, algebraic geometry, theory of computation, and machine learning, some of which have laid the groundwork for our current research agenda.
We expect DevInterp to yield concrete new techniques for AI alignment within the next two years. This agenda and its impact would not exist without us.
Like other forms of interpretability, DevInterp could inadvertently accelerate capabilities.
We’re not about to share the particular ways we think this could occur, but we will say that we think the risk is quite low for the next year. Longer term, we’re not as confident, which is why we’ve chosen to include a fellowship for assessing capability risks in this proposal.
No other funding has been secured yet. We have submitted somewhat similar (though broader in scope) grant requests to Lightspeed and SFF.
about 1 month ago
I think there is merit in developing a breadth of interpretability approaches. If Singular Learning Theory ends up having merit, there is a wealth of knowledge from physics and chemistry that directly applies.
The bang for your buck with researchers in the project is really good compared to most.
I am excited about bringing a talented researcher in Daniel Murfet into the field.
I am excited about this project being a launching ground for a potential future for-profit or nonprofit if this goes well.
I think Jesse is great at explaining his ideas and communicating his reasoning. I think Stan is very smart and worth supporting.
I don’t know about the merits of SLT and I’m not equipped to judge.
I don’t know the direction they will take things if they are successful, and it doesn’t seem to have been thought through yet.
I need to save some money for 2 more grants I would like to make, but I want to show significant support for this project and get them across the $125k threshold.
I might potentially hire Stan as a consultant for my business. Funding this probably works against me but I felt this was important to disclose.
about 2 months ago
I think understanding the inductive biases of modern machine learning processes is extremely important, both for accurately assessing dangers and for discovering good interventions. Most of my uncertainty over the future is currently tied up in uncertainty regarding the inductive biases of machine learning processes (see here for a good explanation of why: https://www.alignmentforum.org/posts/A9NxPTwbw6r6Awuwt/how-likely-is-deceptive-alignment).
On that front, I think Singular Learning Theory is a real contender for a theory that has a chance of effectively explaining and predicting the mechanisms behind machine learning inductive biases. Furthermore, I'm familiar with the work of many of the people involved in this project and I believe them to be quite capable at tackling this problem.
Though I have high hopes for Singular Learning Theory, my modal outcome is that it's mostly just wrong and doesn't explain machine learning inductive biases that well. Inductive biases are very complex and most theories like this in the past have failed. Though I think this is a better bet here than most, I don't think I expect it to succeed.
I've committed $100k at this time, which I think is a reasonable amount for the team to, hopefully in combination with funding from other sources, get started spinning up on this project.
Jesse was a mentee of mine in the SERI MATS program.
about 2 months ago
I’ll match donations on this project 10% up to $200k.
(I’ve been considering some Dominant Assurance Contract scheme because this project looks good and I know that people are interested in funding it, but with straight crowdfunding like we have on Manifund right now people are incentivized to play funding chicken, whereas DACs make contributing below the minimum funding a dominant strategy. On the other hand a DAC isn’t very powerful here because it would only incentivize funding up to the minimum bar, and the minimum is so low. I’d still be down to do that if you (@jesse_hoogland) were willing to raise the minimum funding bar to, say, $150k, but this risks you not getting the funding at all.)