Jesse Hoogland

@jesse_hoogland

AI safety researcher

https://jessehoogland.com
$80,460 total balance
$0 charity balance
$80,460 cash balance

$0 in pending offers

About Me

RA @ the Krueger AI Safety Lab // SERI MATS 3 // Independent research

Projects

Comments

Jesse Hoogland

3 months ago

Final report

Let me copy the earlier progress update we shared (which was meant to close the project):

We've posted a detailed update on LessWrong.

In short:

  • We consider this project a major success: SLT & DevInterp's main predictions have been validated in a number of different settings. We are now confident that these research directions are useful for understanding deep learning systems.

  • Our priority is now to make direct contact with alignment: it's not enough for this research to help with understanding NNs; we need to move the needle on alignment. In our update, we sketch three major directions of research that we expect to make a difference.

In more detail, with respect to the concrete points above:

  • Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment summit’s SLT High 4 lecture). See Chen et al. (2023).

  • Performing a similar analysis for the Induction Heads paper. See Hoogland et al. (2024).

  • For diverse models that are known to contain structure/circuits, we will attempt to:

    • detect phase transitions (using a range of metrics, including train and test losses, RLCT and singular fluctuation),

    • classify weights at each transition into state & control variables,

    • perform mechanistic interpretability analyses at these transitions,

    • compare these analyses to MechInterp structures found at the end of training.

Classifying transitions into state & control variables remains to be done in the next few months. We have performed some mechanistic/structural analysis, and more of this kind of analysis is currently underway.

Jesse Hoogland

9 months ago

In more detail, with respect to the concrete points above:

  • Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment summit’s SLT High 4 lecture). See Chen et al. (2023).

  • Performing a similar analysis for the Induction Heads paper. See Hoogland et al. (2024).

  • For diverse models that are known to contain structure/circuits, we will attempt to:

    • detect phase transitions (using a range of metrics, including train and test losses, RLCT and singular fluctuation),

    • classify weights at each transition into state & control variables,

    • perform mechanistic interpretability analyses at these transitions,

    • compare these analyses to MechInterp structures found at the end of training.

Classifying transitions into state & control variables remains to be done in the next few months. We have performed some mechanistic/structural analysis, and more of this kind of analysis is currently underway.

Jesse Hoogland

9 months ago

Progress update

We've posted a detailed update on LessWrong.

In short:

  • We consider this project a major success: SLT & DevInterp's main predictions have been validated in a number of different settings. We are now confident that these research directions are useful for understanding deep learning systems.

  • Our priority is now to make direct contact with alignment: it's not enough for this research to help with understanding NNs; we need to move the needle on alignment. In our update, we sketch three major directions of research that we expect to make a difference.

Jesse Hoogland

over 1 year ago

Hey Rachel, thanks for the suggestion! We decided to wait a little longer to think about this, and it no longer seems necessary.

Transactions

For                                           Date              Type              Amount
Next Steps in Developmental Interpretability  3 months ago      project donation  +30000
Next Steps in Developmental Interpretability  3 months ago      project donation  +10
Next Steps in Developmental Interpretability  3 months ago      project donation  +200
Next Steps in Developmental Interpretability  3 months ago      project donation  +250
Next Steps in Developmental Interpretability  3 months ago      project donation  +50000
Manifund Bank                                 about 1 year ago  withdraw          144650
Scoping Developmental Interpretability        about 1 year ago  project donation  +3000
Scoping Developmental Interpretability        about 1 year ago  project donation  +20000
Scoping Developmental Interpretability        over 1 year ago   project donation  +10000
Scoping Developmental Interpretability        over 1 year ago   project donation  +45
Scoping Developmental Interpretability        over 1 year ago   project donation  +1000
Scoping Developmental Interpretability        over 1 year ago   project donation  +455
Scoping Developmental Interpretability        over 1 year ago   project donation  +10150
Scoping Developmental Interpretability        over 1 year ago   project donation  +100000