Projects

Next Steps in Developmental Interpretability

Scoping Developmental Interpretability

Comments

Next Steps in Developmental Interpretability

Jesse Hoogland

18 days ago

Progress update

What progress have you made since your last update?

See our recent update, "Timaeus in 2024," for a high-level overview of our research progress in 2024.
Because this Manifund proposal was not fully funded and because progress in separate research projects opened up new research possibilities, we decided to direct our attention immediately to the last of the projects we describe in this proposal: understanding-based evals, under the heading of "Singular Psychometrics."
We've been working on this project in partnership with the UK AISI and are on track to finish it by the end of March 2025. As described in the update linked above, we have overcome the engineering obstacles to scaling LLC estimation to models with billions of parameters. This removes the primary hurdle to completing the project.

What are your next steps?

We're currently working on the final stage of the singular psychometrics project. Our hope for this project is to use SLT-derived metrics to differentiate how different models achieve the same level of performance. Can we distinguish a model that has memorized a given benchmark from one that truly generalizes on that benchmark using the local learning coefficient?

Is there anything others could help you with?

Not currently. We're looking forward to sharing the final update in April.

Scoping Developmental Interpretability

Jesse Hoogland

6 months ago

Final report

Let me copy the earlier progress update we shared (which was meant to close the project):

We've posted a detailed update on LessWrong.

In short:

We consider this project a major success: SLT & DevInterp's main predictions have been validated in a number of different settings. We are now confident that these research directions are useful for understanding deep learning systems.
Our priority is now to make direct contact with alignment: it's not enough for this research to help with understanding NNs; we need to move the needle on alignment. In our update, we sketch three major directions of research that we expect to make a difference.

In more detail, with respect to the concrete points above.

~~Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment summit’s SLT High 4 lecture).~~ See Chen et al. (2023).
~~Performing a similar analysis for the Induction Heads paper.~~ See Hoogland et al. (2024).
For diverse models that are known to contain structure/circuits, we will attempt to:
- ~~detect phase transitions (using a range of metrics, including train and test losses, RLCT and singular fluctuation),~~
- classify weights at each transition into state & control variables,
- perform mechanistic interpretability analyses at these transitions,
- compare these analyses to MechInterp structures found at the end of training.

Classifying transitions into state & control variables remains to be done in the next few months. We have performed some mechanistic/structural analysis, and more of this kind of analysis is currently underway.

Scoping Developmental Interpretability

Jesse Hoogland

about 1 year ago

In more detail, with respect to the concrete points above.

~~Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment summit’s SLT High 4 lecture).~~ See Chen et al. (2023).
~~Performing a similar analysis for the Induction Heads paper.~~ See Hoogland et al. (2024).
For diverse models that are known to contain structure/circuits, we will attempt to:
- ~~detect phase transitions (using a range of metrics, including train and test losses, RLCT and singular fluctuation),~~
- classify weights at each transition into state & control variables,
- perform mechanistic interpretability analyses at these transitions,
- compare these analyses to MechInterp structures found at the end of training.

Scoping Developmental Interpretability

Jesse Hoogland

about 1 year ago

Progress update

We've posted a detailed update on LessWrong.

In short:

We consider this project a major success: SLT & DevInterp's main predictions have been validated in a number of different settings. We are now confident that these research directions are useful for understanding deep learning systems.
Our priority is now to make direct contact with alignment: it's not enough for this research to help with understanding NNs; we need to move the needle on alignment. In our update, we sketch three major directions of research that we expect to make a difference.

Scoping Developmental Interpretability

Jesse Hoogland

over 1 year ago

Hey Rachel, thanks for the suggestion! We decided to wait a little longer to think about this, and it no longer seems necessary.

Transactions

For	Date	Type	Amount
Next Steps in Developmental Interpretability	11 days ago	project donation	+20
Next Steps in Developmental Interpretability	about 1 month ago	project donation	+200
Manifund Bank	3 months ago	withdraw	80460
Next Steps in Developmental Interpretability	6 months ago	project donation	+30000
Next Steps in Developmental Interpretability	7 months ago	project donation	+10
Next Steps in Developmental Interpretability	7 months ago	project donation	+200
Next Steps in Developmental Interpretability	7 months ago	project donation	+250
Next Steps in Developmental Interpretability	7 months ago	project donation	+50000
Manifund Bank	over 1 year ago	withdraw	144650
Scoping Developmental Interpretability	over 1 year ago	project donation	+3000
Scoping Developmental Interpretability	over 1 year ago	project donation	+20000
Scoping Developmental Interpretability	over 1 year ago	project donation	+10000
Scoping Developmental Interpretability	over 1 year ago	project donation	+45
Scoping Developmental Interpretability	over 1 year ago	project donation	+1000
Scoping Developmental Interpretability	over 1 year ago	project donation	+455
Scoping Developmental Interpretability	over 1 year ago	project donation	+10150
Scoping Developmental Interpretability	over 1 year ago	project donation	+100000