Train great open-source sparse autoencoders
Project summary
tl;dr: determine the best currently-available training setup for SAEs and disseminate this knowledge. Train SAEs for steadily larger models (starting with GPT-2-small for MATS scholars) and then scale up as budget and time allow.
Project proposal doc with more details: https://docs.google.com/document/d/15X28EEHo7pM2CYkfZqk05A0MZi4ImvTSSVaC9wtFLyI/edit?usp=sharing
What are this project's goals and how will you achieve them?
Determine good hyperparameters for sparse autoencoders on realistic LLMs by running a comprehensive architecture and hyperparameter comparison (a minimal illustrative sketch of the baseline setup appears after this list).
Use this knowledge to train a suite of high-quality SAEs for GPT-2-small, then scale up further as resources allow, targeting ~1B and ~8B models in sequence.
Disseminate knowledge on SAE training through a technical report.
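For context, the sketch below shows the baseline setup that most current SAE training builds on: a single-hidden-layer autoencoder trained to reconstruct LLM activations, with an L1 penalty encouraging sparse feature activations. It is an illustrative minimal example rather than the project's actual training code; the class and function names, the expansion factor of 8, the l1_coeff value, and the learning rate are placeholder choices of exactly the kind this comparison would sweep.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstruct LLM activations through a wide, sparse hidden layer."""

    def __init__(self, d_model: int, expansion_factor: int = 8):
        super().__init__()
        d_hidden = d_model * expansion_factor  # dictionary size: a key hyperparameter
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))  # sparse feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    """Reconstruction error plus an L1 penalty on feature activations.

    l1_coeff trades reconstruction fidelity against sparsity; it is one of the
    hyperparameters a comparison like this would sweep.
    """
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity


if __name__ == "__main__":
    d_model = 768  # e.g. GPT-2-small residual stream width
    sae = SparseAutoencoder(d_model)
    optimizer = torch.optim.Adam(sae.parameters(), lr=3e-4)

    # Stand-in for a batch of cached LLM activations.
    activations = torch.randn(4096, d_model)
    reconstruction, features = sae(activations)
    loss = sae_loss(activations, reconstruction, features)
    loss.backward()
    optimizer.step()
```

Architectural variants (e.g. gated or TopK SAEs) replace the ReLU encoder and L1 penalty with other sparsity mechanisms; benchmarking such variants against a baseline like this is the kind of architecture comparison the first goal describes.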
How will this funding be used?
Compute!
Who is on your team and what's your track record on similar projects?
Lead: Tom McGrath - former DeepMind interpretability researcher.
Collaborator: Joseph Bloom - owner of SAELens and contributor to Neuronpedia.
What are the most likely causes and outcomes if this project fails? (premortem)
Failure to replicate the results obtained by major labs, leading to low SAE performance.
What other funding are you or your project getting?
Tom McGrath: none, currently self-funding
My collaborators Joseph Bloom and Johnny Lin are funded to work on Neuronpedia.