Scoping Developmental Interpretability
Project summary
We propose a 6-month research project to assess the viability of Developmental Interpretability, a new AI alignment research agenda. “DevInterp” studies how phase transitions give rise to computational structure in neural networks, and offers a possible path to scalable interpretability tools.
Though we have both empirical and theoretical reasons to believe that phase transitions dominate the training process, the details remain unclear. We plan to clarify the role of phase transitions by studying them in a variety of models, combining techniques from Singular Learning Theory and Mechanistic Interpretability. In six months, we expect to have gathered enough evidence to confirm that DevInterp is a viable research program.
If successful, we expect Developmental Interpretability to become one of the main branches of technical alignment research over the next few years.
(This funding proposal consists of Phase 1 from the research plan described in this LessWrong post.)
Project goals
We will assess the viability of Developmental Interpretability (DevInterp) as an alignment research program over the next 6 months.
Our immediate priority is to gather empirical evidence for the role of phase transitions in neural network training dynamics. To do so, we will examine a variety of models for signals that indicate the presence or absence of phase transitions.
Concretely, this means:
Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment summit’s SLT High 4 lecture).
Performing a similar analysis for the models studied in the Induction Heads paper.
For diverse models that are known to contain structure/circuits, we will attempt to:
detect phase transitions (using a range of metrics, including train and test losses, the real log canonical threshold (RLCT), and singular fluctuation),
classify weights at each transition into state & control variables,
perform mechanistic interpretability analyses at these transitions,
compare these analyses to the MechInterp structures found at the end of training.
Conducting a confidential capability risk assessment of DevInterp.
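As a rough illustration of the detection step in the list above, here is a minimal sketch of flagging candidate transitions from a training-loss curve alone. Everything in it is an assumption made for illustration: the function name `candidate_transitions`, the smoothing `window`, the threshold `factor`, and the synthetic loss curve are hypothetical and do not come from our existing code. It covers only the loss-based signal. Estimating the RLCT or singular fluctuation is a separate, harder problem that this sketch does not attempt.

```python
import numpy as np


def candidate_transitions(losses, window=5, factor=5.0):
    """Return approximate step indices where the smoothed log-loss drops unusually fast.

    Heuristic only: a sharp loss drop is a hint of a phase transition, not proof.
    `window` and `factor` are illustrative choices, not tuned values.
    """
    log_loss = np.log(np.asarray(losses, dtype=float))
    # Moving average to suppress step-to-step noise.
    kernel = np.ones(window) / window
    smooth = np.convolve(log_loss, kernel, mode="valid")
    # Positive where the loss is decreasing.
    rate = -np.diff(smooth)
    # Typical rate of change over the whole run (small epsilon avoids a zero threshold).
    baseline = np.median(np.abs(rate)) + 1e-12
    flagged = np.flatnonzero(rate > factor * baseline)
    # Shift back so indices refer (approximately) to original training steps.
    return flagged + window // 2


# Synthetic curve (hypothetical data): slow exponential decay plus one sharp drop near step 100.
steps = np.arange(200)
sigmoid = 1.0 / (1.0 + np.exp((steps - 100) / 2.0))
losses = np.exp(-0.001 * steps) * (0.1 + 0.9 * sigmoid)

flagged = candidate_transitions(losses)
print(flagged)
```

On real training runs, a flag from this kind of heuristic would only mark a region worth examining with the other metrics and with mechanistic interpretability analyses.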
The unit of work here is papers, submitted either to ML conferences or academic journals. At the end of this period we should have a clear idea of whether developmental interpretability has legs.
How will this funding be used?
120k: RAs + Researchers + Research Fellows
50k: Core Staff
40k: Employment Costs
10k: Compute
7k: Travel Support
18k: Fiscal Sponsorship Costs
25k: Buffer
What is your team's track record on similar projects?
(The core team currently consists of Jesse Hoogland, Alexander Gietelink Oldenziel, Prof. Daniel Murfet and Stan van Wingerden. We have a shortlist of external researchers to hire.)
We are responsible for advancing the Developmental Interpretability research program in the following ways:
We ran the 2023 Singular Learning Theory (SLT) & Alignment Summit, which brought together roughly 140 online participants and 40 in-person participants to learn about singular learning theory and start working on open problems relating SLT and alignment.
This also helped us scout talent for the DevInterp research program.
This summit culminated in the DevInterp research agenda, which we recently published. During this summit, we also recorded over 30 hours of lectures, and will soon release ~200 pages of lecture notes, further accompanying LessWrong posts, and original research.
Developmental interpretability was first proposed by Prof. Daniel Murfet here (the “SLT for Alignment Plan”), where he sketches the connections between SLT and mechanistic interpretability.
Jesse Hoogland and Alexander Gietelink Oldenziel first communicated the potential value of SLT to the alignment community in a LessWrong sequence.
Prof. Daniel Murfet founded metauni, an online learning community, which has hosted hundreds of seminars (including on AI alignment), and which yielded his initial SLT for Alignment Plan.
Prof. Daniel Murfet has also published dozens of articles in related fields such as mathematical physics, algebraic geometry, theory of computation, and machine learning, some of which have laid the groundwork for our current research agenda.
We expect DevInterp to yield concrete new techniques for AI alignment within the next two years. This agenda and its impact would not exist without us.
How could this project be actively harmful?
Like other forms of interpretability, DevInterp could inadvertently accelerate capabilities.
We’re not about to share the particular ways we think this could occur, but we will say that we think the risk is quite low for the next year. Longer term, we’re not as confident, which is why we’ve chosen to include a fellowship for assessing capability risks in this proposal.
What other funding is this project getting?
No other funding has been secured yet. We have submitted somewhat similar (though broader in scope) grant requests to Lightspeed and SFF.