Scoping Developmental Interpretability

Jesse Hoogland

Complete Grant

$144,650 raised

Project summary

We propose a 6-month research project to assess the viability of Developmental Interpretability, a new AI alignment research agenda. “DevInterp” studies how phase transitions give rise to computational structure in neural networks, and offers a possible path to scalable interpretability tools.

Though we have both empirical and theoretical reasons to believe that phase transitions dominate the training process, the details remain unclear. We plan to clarify the role of phase transitions by studying them in a variety of models, combining techniques from Singular Learning Theory and Mechanistic Interpretability. In six months, we expect to have gathered enough evidence to confirm that DevInterp is a viable research program.

If successful, we expect Developmental Interpretability to become one of the main branches of technical alignment research over the next few years.

(This funding proposal consists of Phase 1 from the research plan described in this LessWrong post.)

Project goals

We will assess the viability of Developmental Interpretability (DevInterp) as an alignment research program over the next 6 months.

Our immediate priority is to gather empirical evidence for the role of phase transitions in neural network training dynamics. To do so, we will examine a variety of models for signals that indicate the presence or absence of phase transitions.

Concretely, this means:

Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment summit’s SLT High 4 lecture).
Performing a similar analysis for the Induction Heads paper.
For diverse models that are known to contain structure/circuits, we will attempt to:
- detect phase transitions (using a range of metrics, including train and test losses, RLCT and singular fluctuation),
- classify weights at each transition into state & control variables,
- perform mechanistic interpretability analyses at these transitions,
- compare these analyses to MechInterp structures found at the end of training.
Conducting a confidential capability risk assessment of DevInterp.
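
As a rough illustration of the "detect phase transitions" step in the list above, one simple heuristic is to flag abrupt drops in a smoothed training-loss curve. This is only a sketch under stated assumptions, not the method used in this project: the function names, the window size, and the drop threshold are all hypothetical, and the RLCT and singular-fluctuation metrics mentioned above are not covered.

```python
from statistics import mean


def moving_average(values, window):
    """Trailing moving average; uses a shorter window at the start."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        out.append(mean(values[start:i + 1]))
    return out


def candidate_transitions(losses, window=5, drop_ratio=0.2):
    """Return step indices where the smoothed loss has fallen by more than
    `drop_ratio` (as a fraction) relative to its value `window` steps earlier.

    Both `window` and `drop_ratio` are illustrative defaults (assumptions),
    not values taken from the proposal.
    """
    smooth = moving_average(losses, window)
    flagged = []
    for t in range(window, len(smooth)):
        before = smooth[t - window]
        if before > 0 and (before - smooth[t]) / before > drop_ratio:
            flagged.append(t)
    return flagged


# Synthetic loss curve (made up for illustration): a plateau at 1.0 for
# 20 steps, followed by a plateau at 0.4.
losses = [1.0] * 20 + [0.4] * 20
steps = candidate_transitions(losses)
print(steps)  # a contiguous run of steps around the drop at step 20
```

A loss-only heuristic like this can only suggest candidate steps. Sudden loss drops can have other causes, which is presumably why the list above calls for a range of metrics.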

The unit of work here is papers, submitted either to ML conferences or academic journals. At the end of this period we should have a clear idea of whether developmental interpretability has legs.

How will this funding be used?

120k: RAs + Researchers + Research Fellows
50k: Core Staff
40k: Employment Costs
10k: Compute
7k: Travel Support
18k: Fiscal Sponsorship Costs
25k: Buffer

What is your team's track record on similar projects?

(The core team currently consists of Jesse Hoogland, Alexander Gietelink Oldenziel, Prof. Daniel Murfet and Stan van Wingerden. We have a shortlist of external researchers to hire.)

We are responsible for advancing the Developmental Interpretability research program in the following ways:

We ran the 2023 Singular Learning Theory (SLT) & Alignment Summit, which brought together roughly 140 online participants and 40 in-person participants to learn about singular learning theory and start working on open problems relating SLT and alignment.
- This also helped us scout talent for the DevInterp research program.
This summit culminated in the DevInterp research agenda, which we recently published. During this summit, we also recorded over 30 hours of lectures, and will soon release ~200 pages of lecture notes, further accompanying LessWrong posts, and original research.
Developmental interpretability was first proposed by Prof. Daniel Murfet here (the “SLT for Alignment Plan”), where he sketches the connections between SLT and mechanistic interpretability.
Jesse Hoogland and Alexander Gietelink Oldenziel first communicated the potential value of SLT to the alignment community in a LessWrong sequence.
Prof. Daniel Murfet founded metauni, an online learning community, which has hosted hundreds of seminars (including on AI alignment), and which yielded his initial SLT for Alignment Plan.
Prof. Daniel Murfet has also published dozens of articles in related fields such as mathematical physics, algebraic geometry, theory of computation, and machine learning, some of which have laid the groundwork for our current research agenda.

We expect DevInterp to yield concrete new techniques for AI alignment within the next two years. This agenda and its impact would not exist without us.

How could this project be actively harmful?

Like other forms of interpretability, DevInterp could inadvertently accelerate capabilities.

We’re not about to share the particular ways we think this could occur, but we will say that we think the risk is quite low for the next year. Longer term, we’re not as confident, which is why we’ve chosen to include a fellowship for assessing capability risks in this proposal.

What other funding is this project getting?

No other funding has been secured yet. We have submitted somewhat similar (though broader in scope) grant requests to Lightspeed and SFF.

Jesse Hoogland

7 months ago

Final report

Let me copy the earlier progress update we shared (which was meant to close the project):

We've posted a detailed update on LessWrong.

In short:

We consider this project a major success: SLT & DevInterp's main predictions have been validated in a number of different settings. We are now confident that these research directions are useful for understanding deep learning systems.
Our priority is now to make direct contact with alignment: it's not enough for this research to help with understanding NNs; we need to move the needle on alignment. In our update, we sketch three major directions of research that we expect to make a difference.

In more detail, with respect to the concrete points above:

~~Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment summit’s SLT High 4 lecture).~~ See Chen et al. (2023).
~~Performing a similar analysis for the Induction Heads paper.~~ See Hoogland et al. (2024).
For diverse models that are known to contain structure/circuits, we will attempt to:
- ~~detect phase transitions (using a range of metrics, including train and test losses, RLCT and singular fluctuation),~~
- classify weights at each transition into state & control variables,
- perform mechanistic interpretability analyses at these transitions,
- compare these analyses to MechInterp structures found at the end of training.

Classifying transitions into state & control variables remains to be done in the next few months. We have performed some mechanistic/structural analysis, and more of this kind of analysis is currently underway.

Jesse Hoogland

about 1 year ago

Progress update

We've posted a detailed update on LessWrong.

In short:

We consider this project a major success: SLT & DevInterp's main predictions have been validated in a number of different settings. We are now confident that these research directions are useful for understanding deep learning systems.
Our priority is now to make direct contact with alignment: it's not enough for this research to help with understanding NNs; we need to move the needle on alignment. In our update, we sketch three major directions of research that we expect to make a difference.

Jesse Hoogland

about 1 year ago

In more detail, with respect to the concrete points above:

~~Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment summit’s SLT High 4 lecture).~~ See Chen et al. (2023).
~~Performing a similar analysis for the Induction Heads paper.~~ See Hoogland et al. (2024).
For diverse models that are known to contain structure/circuits, we will attempt to:
- ~~detect phase transitions (using a range of metrics, including train and test losses, RLCT and singular fluctuation),~~
- classify weights at each transition into state & control variables,
- perform mechanistic interpretability analyses at these transitions,
- compare these analyses to MechInterp structures found at the end of training.

donated $10,000

Marcus Abramovitch

over 1 year ago

If I am not mistaken, Jesse + team have now received an additional $500k from SFF to expand. This takes further Manifund grants off the table for me for now, but I will check in with Jesse/Stan sometime in the new year to see where they are at.

donated $20,000

Ryan Kidd

over 1 year ago

Main points in favor of this grant

Developmental interpretability seems like a potentially promising and relatively underexplored research direction for exploring neural network generalization and inductive biases. Hopefully, this research can complement low-level or probe-based approaches for neural network interpretability and eventually help predict, explain, and steer dangerous AI capabilities such as learned optimization and deceptive alignment.
Jesse made a strong, positive impression on me as a scholar in the SERI MATS Winter 2022-23 Cohort; his research was impressive and he engaged well with criticism and other scholars' diverse research projects. His mentor, Evan Hubinger, endorsed his research at the time and obviously continues to do so, as indicated by his recent regrant. While Jesse is relatively young to steer a research team, he has strong endorsements and support from Dan Murfet, David Krueger, Evan Hubinger, and other researchers, and has displayed impressive entrepreneurship in launching Timaeus and organizing the SLT summits.
I recently met Dan Murfet at EAGxAustralia 2023 and was impressed by his research presentation skills, engagement with AI safety, and determination to build the first dedicated academic AI safety lab in Australia. Dan seems like a great research lead for the University of Melbourne lab, where much of this research will be based.
Australia has produced many top ML and AI safety researchers, but has so far lacked a dedicated AI safety organization to leverage local talent. I believe that we need more AI safety hubs, especially in academic institutions, and I see Timaeus (although remote) and the University of Melbourne as strong contenders.
Developmental interpretability seems like an ideal research vehicle to leverage underutilized physics and mathematics talent for AI safety. Jesse is a former physicist and Dan is a mathematician who previously specialized in algebraic geometry. In my experience as Co-Director of MATS, I have realized that many former physicists and mathematicians are deeply interested in AI safety, but lack a transitionary route to adapt their skills to the challenge.
Other funders (e.g., Open Phil, SFF) seem more reluctant (or at least slower) to fund this project than Manifund or Lightspeed, and Jesse/Dan told me that they would need more funds within a week if they were going to hire another RA. I believe that this $20k is a high-expected-value investment in reducing the stress associated with founding a potentially promising new AI safety organization and will allow Jesse/Dan to produce more exploratory research early to ascertain the value of SLT for AI safety.

Donor's main reservations

I have read several of Jesse's and Dan's posts about SLT and Dev Interp and watched several of their talks, but still feel that I don't entirely grasp the research direction. I could spend further time on this, but I feel more than confident enough to recommend $20k.
Jesse is relatively young to run a research organization and Dan is relatively new to AI safety research; however, they seem more than capable for my level of risk tolerance with $20k, even with my current $50k pot.
The University of Melbourne may not be an ideal (or supportive) home for this research team; however, Timaeus already plans to be somewhat remote and several fiscal sponsors (e.g., Rethink Priorities Special Projects, BERI, Ashgro) would likely be willing to support their researchers.

Process for deciding amount

I chose to donate $20k because Jesse said that a single paper would cost $40k (roughly 1 RA-year) and my budget is limited. I encourage further regrantors to join me and fund another half-paper!

Conflicts of interest

Jesse was a scholar in the program I co-lead, but I do not believe that this constitutes a conflict of interest.

Austin Chen

over 1 year ago

Approving this! Best of luck with your research~

donated $10,000

Marcus Abramovitch

over 1 year ago

Main points in favor of this grant

I think there is merit in developing a breadth of interpretability approaches. If Singular Learning Theory ends up having merit, there is a wealth of knowledge from physics and chemistry that directly applies.
The bang for your buck with researchers in the project is really good compared to most.
I am excited about bringing a talented researcher in Daniel Murfet into the field.
I am excited about this project being a launching ground for a potential future for-profit or non-profit if this goes well.
I think Jesse is great at explaining his ideas and communicating his reasoning. I think Stan is very smart and worth supporting.

Donor's main reservations

I don’t know about the merits of SLT and I’m not equipped to judge.
I don’t know the direction they will take things if they are successful, and it doesn’t seem to have been thought through yet.

Process for deciding amount

I need to save some money for 2 more grants I would like to make, but I want to show significant support for this project and get them across the $125k threshold.

Conflicts of interest

I might potentially hire Stan as a consultant for my business. Funding this probably works against me but I felt this was important to disclose.

donated $100,000

Evan Hubinger

over 1 year ago

Main points in favor of this grant

I think understanding the inductive biases of modern machine learning processes is extremely important both for accurately assessing dangers and for discovering good interventions. Most of my uncertainty over the future is currently tied up in uncertainty regarding the inductive biases of machine learning processes (see here for a good explanation of why: https://www.alignmentforum.org/posts/A9NxPTwbw6r6Awuwt/how-likely-is-deceptive-alignment).

On that front, I think Singular Learning Theory is a real contender for a theory that has a chance of effectively explaining and predicting the mechanisms behind machine learning inductive biases. Furthermore, I'm familiar with the work of many of the people involved in this project and I believe them to be quite capable at tackling this problem.

Donor's main reservations

Though I have high hopes for Singular Learning Theory, my modal outcome is that it's mostly just wrong and doesn't explain machine learning inductive biases that well. Inductive biases are very complex and most theories like this in the past have failed. Though I think this is a better bet here than most, I don't think I expect it to succeed.

Process for deciding amount

I've committed $100k at this time, which I think is a reasonable amount for the team to, hopefully in combination with funding from other sources, get started spinning up on this project.

Conflicts of interest

Jesse was a mentee of mine in the SERI MATS program.

donated $13,150

Rachel Weinberg

over 1 year ago

I’ll match donations on this project 10% up to $200k.

(I’ve been considering some Dominant Assurance Contract scheme because this project looks good and I know that people are interested in funding it, but with straight crowdfunding like we have on Manifund right now people are incentivized to play funding chicken, whereas DACs make contributing below the minimum funding a dominant strategy. On the other hand a DAC isn’t very powerful here because it would only incentivize funding up to the minimum bar, and the minimum is so low. I’d still be down to do that if you (@jesse_hoogland) were willing to raise the minimum funding bar to, say, $150k, but this risks you not getting the funding at all.)

Jesse Hoogland

over 1 year ago

Hey Rachel, thanks for the suggestion! We decided to wait a little longer to think about this, and it seems no longer necessary.

Miguelito De Guzman

over 1 year ago

I am also conducting research on phase transitions with GPT2-xl, and I believe there is a need for further research on this mechanism. I fully support this application!