Jesse Hoogland

@jesse_hoogland

AI safety researcher

https://jessehoogland.com
$220 total balance
$0 charity balance
$220 cash balance

$0 in pending offers

About Me

Executive director at Timaeus

Projects

Comments

Jesse Hoogland

18 days ago

Progress update

What progress have you made since your last update?

  • See our recent update, "Timaeus in 2024," for a high-level overview of our research progress in 2024.

  • Because this Manifund proposal was not fully funded and because progress in separate research projects opened up new research possibilities, we decided to direct our attention immediately to the last of the projects we describe in this proposal: understanding-based evals, under the heading of "Singular Psychometrics."

  • We've been working on this project in partnership with the UK AISI and are on track to finish it by the end of March 2025. As described in the update linked above, we have overcome the engineering obstacles to scaling LLC estimation to models with billions of parameters. This removes the primary hurdle to seeing this project to completion.

What are your next steps?

  • We're currently working on the final stage of the singular psychometrics project. Our hope for this project is to use SLT-derived metrics to distinguish how different models achieve the same level of performance. Can we use the local learning coefficient to tell a model that has memorized a given benchmark apart from one that truly generalizes on that benchmark?

Is there anything others could help you with?

  • Not currently. We're looking forward to sharing the final update in April.

Jesse Hoogland

6 months ago

Final report

Let me copy the earlier progress update we shared (which was meant to close the project):

We've posted a detailed update on LessWrong.

In short:

  • We consider this project a major success: SLT & DevInterp's main predictions have been validated in a number of different settings. We are now confident that these research directions are useful for understanding deep learning systems.

  • Our priority is now to make direct contact with alignment: it's not enough for this research to help with understanding NNs; we need to move the needle on alignment. In our update, we sketch three major directions of research that we expect to make a difference.

In more detail, with respect to the concrete points above:

  • Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment summit’s SLT High 4 lecture). See Chen et al. (2023).

  • Performing a similar analysis for the Induction Heads paper. See Hoogland et al. (2024).

  • For diverse models that are known to contain structure/circuits, we will attempt to:

    • detect phase transitions (using a range of metrics, including train and test losses, RLCT and singular fluctuation),

    • classify weights at each transition into state & control variables,

    • perform mechanistic interpretability analyses at these transitions,

    • compare these analyses to MechInterp structures found at the end of training.

Classifying transitions into state & control variables remains to be done in the next few months. We have performed some mechanistic/structural analysis, and more of this kind of analysis is currently underway.

Jesse Hoogland

about 1 year ago

In more detail, with respect to the concrete points above:

  • Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment summit’s SLT High 4 lecture). See Chen et al. (2023).

  • Performing a similar analysis for the Induction Heads paper. See Hoogland et al. (2024).

  • For diverse models that are known to contain structure/circuits, we will attempt to:

    • detect phase transitions (using a range of metrics, including train and test losses, RLCT and singular fluctuation),

    • classify weights at each transition into state & control variables,

    • perform mechanistic interpretability analyses at these transitions,

    • compare these analyses to MechInterp structures found at the end of training.

Classifying transitions into state & control variables remains to be done in the next few months. We have performed some mechanistic/structural analysis, and more of this kind of analysis is currently underway.

Jesse Hoogland

about 1 year ago

Progress update

We've posted a detailed update on LessWrong.

In short:

  • We consider this project a major success: SLT & DevInterp's main predictions have been validated in a number of different settings. We are now confident that these research directions are useful for understanding deep learning systems.

  • Our priority is now to make direct contact with alignment: it's not enough for this research to help with understanding NNs; we need to move the needle on alignment. In our update, we sketch three major directions of research that we expect to make a difference.

Jesse Hoogland

over 1 year ago

Hey Rachel, thanks for the suggestion! We decided to wait a little longer to think about this, and it seems no longer necessary.

Transactions

| For | Date | Type | Amount |
| --- | --- | --- | --- |
| Next Steps in Developmental Interpretability | 11 days ago | project donation | +20 |
| Next Steps in Developmental Interpretability | about 1 month ago | project donation | +200 |
| Manifund Bank | 3 months ago | withdraw | 80460 |
| Next Steps in Developmental Interpretability | 6 months ago | project donation | +30000 |
| Next Steps in Developmental Interpretability | 7 months ago | project donation | +10 |
| Next Steps in Developmental Interpretability | 7 months ago | project donation | +200 |
| Next Steps in Developmental Interpretability | 7 months ago | project donation | +250 |
| Next Steps in Developmental Interpretability | 7 months ago | project donation | +50000 |
| Manifund Bank | over 1 year ago | withdraw | 144650 |
| Scoping Developmental Interpretability | over 1 year ago | project donation | +3000 |
| Scoping Developmental Interpretability | over 1 year ago | project donation | +20000 |
| Scoping Developmental Interpretability | over 1 year ago | project donation | +10000 |
| Scoping Developmental Interpretability | over 1 year ago | project donation | +45 |
| Scoping Developmental Interpretability | over 1 year ago | project donation | +1000 |
| Scoping Developmental Interpretability | over 1 year ago | project donation | +455 |
| Scoping Developmental Interpretability | over 1 year ago | project donation | +10150 |
| Scoping Developmental Interpretability | over 1 year ago | project donation | +100000 |