robertzk avatar
Robert Krzyzanowski

@robertzk

Independent AI Safety Researcher

$0total balance
$0charity balance
$0cash balance

$0 in pending offers

About Me

I am an independent AI safety researcher currently focused on mechanistic interpretability and training process transparency.

Projects

Comments

robertzk avatar

Robert Krzyzanowski

10 months ago

Progress update

What progress have you made since your last update?

I have most recently been focused on research scaling sparse autoencoders to attention layers, which has been submitted as a research paper to NeurIPS and accepted as a Spotlight presentation at the ICML Mechanistic Interpretability workshop.

As an update to Scaling Training Process Transparency, I am working with my summer mentee Gavin Ratcliffe and co-advisor Sara Price to supervise a project that combines developmental interpretability with sleeper agents. This project is a natural extension of Training Process Transparency to larger models that are considered model organisms of deception, and will help answer some questions around the mechanism and formation of the deception trigger in sleeper agents, as well as the corresponding defection behavior.

As part of this project, we intend to use the funds allocated for Scaling Training Process Transparency to address any relevant compute expenses. Morally, the work on this project is equivalent to that of Scaling Training Process Transparency, in that it takes the shape of analyzing mechanism formation throughout the training (or in this case, fine-tuning) process for a model sufficiently large to exhibit interesting behaviors (in this case, deception triggers and deception backdoor behavior).

What are your next steps?

We will be releasing our results on developmental interpretability with sleeper agents as the natural extension to Scaling Training Process Transparency as soon as we have results, which we expect to achieve later this summer '24.

Is there anything others could help you with?

Yes. We would appreciate anyone who is interested in mechanistic interpretability, developmental interpretability and/or model organisms of deception to review our project and validate that it conforms to the stated purposes of this grant. We would also appreciate any feedback and ideas from interested advisors as we are in active iteration on this project.

robertzk avatar

Robert Krzyzanowski

over 1 year ago

I have identified an engineering bottleneck in scaling this approach that I am currently working through. I am going to provisionally accept the project, but may return the funds if I am unable to address this bottleneck within the project timeline.

Transactions

ForDateTypeAmount
Manifund Bank3 months agowithdraw2000
Unprompted Unfaithful Chain of Thought Dataset Project3 months agoproject donation+2000
Manifund Bank9 months agowithdraw5150
Scaling Training Process Transparencyover 1 year agoproject donation+5000
Scaling Training Process Transparencyover 1 year agoproject donation+150