
Independent research to improve SAEs (4-6 months)

Active grant
$55,025 raised
$66,000 funding goal

Note on edits: After discussion with a regranter, this grant has been re-titled and completely rewritten to accurately reflect my change in research focus. At the time of this change there were no donations, so this seems okay to do.

Project summary

Sparse autoencoders (SAEs) are an exciting new tool in the interpretability toolbox: they are unlocking new modes of analysis and becoming a core component that much interpretability research builds upon. I think it is unlikely that they are the final evolution of dictionary learning techniques for interpretability, and that faster improvements to this fundamental technology will benefit the field as a whole.
As such, I'm experimenting with alternative architectures and tweaks to the training setup to try to find available improvements sooner rather than later.
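For readers less familiar with the setup: a standard SAE is a single-layer overcomplete autoencoder trained on a model's activations with an L1 sparsity penalty. A minimal sketch in PyTorch (illustrative names and initialization, not this project's code):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal standard SAE: an overcomplete dictionary (d_dict >> d_model)
    with a ReLU encoder and a linear decoder."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.W_enc = nn.Parameter(0.01 * torch.randn(d_model, d_dict))
        self.b_enc = nn.Parameter(torch.zeros(d_dict))
        self.W_dec = nn.Parameter(0.01 * torch.randn(d_dict, d_model))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse feature activations
        x_hat = f @ self.W_dec + self.b_dec                         # reconstruction
        return x_hat, f

def sae_loss(x, x_hat, f, sparsity_coeff: float = 1e-3):
    # Reconstruction MSE plus an L1 penalty that pushes features toward zero.
    return (x - x_hat).pow(2).sum(-1).mean() + sparsity_coeff * f.abs().sum(-1).mean()
```

Most of the interventions discussed below modify some piece of this recipe: the encoder nonlinearity, the sparsity penalty, the decoder structure, or the training loop around it.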

What are this project's goals and how will you achieve them?

The goal is to find alternative architectures and training techniques for sparse autoencoders that improve on the current state of the art.


My strategy involves making an upfront investment in infrastructure so that building and evaluating new components and architectures is faster, easier, and less error-prone, letting me iterate quickly and test many possible configurations cheaply later on. I'm aiming to lower the cost of trying each idea enough that it becomes worth exploring high-risk, high-reward configurations that would otherwise be too speculative to be worth trying.

I already have a working first version of this framework, which I will return to and improve upon, and I have a collection of

  • established interventions proposed by research labs and the research community I'd like to test directly and also build upon

    • eg, all the Anthropic and DeepMind updates, Gated SAEs, my own ProLU SAEs, Sqrt(L1) SAEs, etc

  • experimental interventions and alternate architectures

    • eg, deep encoders, hierarchical SAEs, alternative gate training methods for Gated SAEs, a resampling method I previously got good results from, and others

which I aim to implement as interchangeable components in the framework and begin evaluating (a sketch of what that looks like follows below).
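To give a flavor of what "interchangeable components" means here (a rough sketch under assumed names, not the actual framework's API): each intervention registers as a named component, so a new configuration is a config entry rather than a fork of the training code.

```python
import torch

# Hypothetical registries -- the real framework may differ. The point is that
# each intervention slots into the same training loop as a named component.
ACTIVATIONS = {
    "relu": torch.relu,
    # "prolu": ...  (a thresholded activation like ProLU would register here)
}

PENALTIES = {
    "l1": lambda f: f.abs().sum(-1).mean(),
    # Sqrt(L1): weaker shrinkage on large activations than plain L1.
    # The small epsilon avoids an infinite gradient at exactly zero.
    "sqrt_l1": lambda f: (f.abs() + 1e-8).sqrt().sum(-1).mean(),
}

def build_loss(cfg: dict):
    """Assemble the training loss from named components."""
    penalty = PENALTIES[cfg["penalty"]]
    def loss_fn(x, x_hat, f):
        recon = (x - x_hat).pow(2).sum(-1).mean()
        return recon + cfg["sparsity_coeff"] * penalty(f)
    return loss_fn

# Example: swapping in the sqrt(L1) penalty is a one-line config change.
loss_fn = build_loss({"penalty": "sqrt_l1", "sparsity_coeff": 3e-4})
```

This is what makes cheap iteration possible: comparing, say, plain L1 against Sqrt(L1) is a config change rather than a new codebase.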

If one or more speculative changes produce sufficiently large Pareto improvements, I will evaluate whether the resulting features are interpretable. If so, I will take the best method(s), refine them further where possible, investigate and evaluate them more thoroughly, and then write up the method and results to share the technique and findings with other researchers.
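Concretely, "Pareto improvement" here refers to the standard L0-vs-reconstruction trade-off: each trained SAE gives one point, sweeping the sparsity coefficient traces out a frontier, and a variant wins if its frontier dominates the baseline's (lower reconstruction error at every matched L0). A sketch of the per-SAE measurement, assuming the `(x_hat, f)` interface from the earlier sketch:

```python
import torch

@torch.no_grad()
def eval_point(sae, acts: torch.Tensor):
    """One point on the Pareto plot: mean L0 (features active per token)
    and reconstruction MSE, for a single trained SAE."""
    x_hat, f = sae(acts)
    l0 = (f > 0).float().sum(-1).mean().item()
    mse = (acts - x_hat).pow(2).sum(-1).mean().item()
    return l0, mse
```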

How will this funding be used?

Salary and compute budget:

  • Salary: $9k/month

  • Compute: $2k/month

($11k/month total, so the $66k goal covers 6 months.)

Who is on your team and what's your track record on similar projects?

The team is just me. I have been working on this topic for a few months and have produced this work on ProLU SAEs, which resulted in a large Pareto improvement. I also have an unpublished resampling method that produced a smaller improvement.

Other qualifications: a computer science degree and a high comfort level working in PyTorch.

What are the most likely causes and outcomes if this project fails? (premortem)

It's possible that none of the techniques I try will make enough of a difference to be worth their costs in added complexity (which is plausible given the low-probability-of-success, high-value-if-successful nature of the things I'm trying). If that happens, I expect to continue iterating on other techniques, as I don't expect to run out of things to try.

If none of the techniques that significantly Pareto-improve L0 and reconstruction quality result in interpretable SAE features, that would be unfortunate, as they would not be useful as tools, though it may be an interesting result that gives some insight into the problem space.

What other funding are you or your project getting?

No other funding.


Austin Chen

8 months ago

Approving this grant -- thanks to @NeelNanda for the very in-depth writeup!

> $9K/month seems not crazy salary for someone living in SF, but I'd happily follow default rates for independent researchers if anyone has compiled them

Yeah - I do think ~$100k/y is the going rate for junior independent AIS research. I also think this is kind of low; I expect most people in this category to be easily able to get, eg, an entry-level L3 role at Google, at ~$190k/y total compensation.

I would also love a levels.fyi, or failing that at least an EA Forum post surveying what researcher salaries have been based on varying amounts of expertise.


Chris Lakin

8 months ago

> I would also love a levels.fyi, or failing that at least an EA Forum post surveying what researcher salaries have been based on varying amounts of expertise.

wait yeah can we get someone to do that

donated $55,000

Neel Nanda

8 months ago

Main points in favor of this grant

I think that SAEs are a big deal in interpretability, with lots of valuable interp work that can be unlocked with good SAEs. Developing, understanding and using SAEs is the major focus of both Anthropic's mech interp team and my team (Google DeepMind mech interp). I feel like SAE training is currently very janky and pre-paradigmatic and I would love to see progress here.

Why grant to Glen? I was particularly impressed by the ProLU work. Though it was, unfortunately, highly similar to my team's Gated SAE work, which lowers its actual impact, I think ProLU was a good and principled idea that correctly identified a flaw in SAE training and empirically showed a significant improvement. Further, I think Glen broadly did the right things to demonstrate the improvement: he did the legwork of training a bunch of SAEs on a range of models, layers and sites (though he was bottlenecked on compute, I think) and carefully compared Pareto frontiers. This makes me more optimistic that if Glen finds an important improvement, he'll present enough evidence for me to believe him! I thought the write-up was pretty rough, but it was quite rushed, so that's not a major consideration.

We had a call, and I thought Glen was thinking about things sensibly. In particular, he had a strong emphasis on iterating fast, building the infra to try out many ideas quickly, and doubling down on any idea that meets a moderately high quality bar. I think this is a great way to do this kind of research. Another good sign is that Glen said ProLU felt less interesting to him than some of his other ideas, but had better empirical results, so was higher priority and he doubled down on it - being willing to be pragmatic like this and prioritise results makes this kind of research go much better!

Donor's main reservations

Even with a grant, this kind of research is much easier to do inside a lab, where you have a lot of compute and more engineering expertise. There are people in labs working on this, eg Anthropic has a several-person sub-team on the science and scaling of SAEs. But there are many problems to work on and ultimately not many researchers working on them, and Glen seems to have many interesting ideas, so I'm not too concerned about this. There is a risk of duplicate work, eg ProLU and Gated SAEs, but I don't think that's a strong enough consideration to sink the grant.

I'm generally pretty wary of people doing independent research, especially junior researchers, with concerns specifically around lacking structure, accountability, motivation, feedback/mentorship, and stability. Glen says he hasn't been experiencing any issues with executive function, which is great! I've encouraged him to look for collaborators, and ideally a mentor, which would make me feel much better about the grant. It doesn't sound like independent research is his long-term plan, which makes me feel better about this.

Glen doesn't have much of a research track record, making it hard to be confident in this going well. But he seems promising, and I think it's good to give promising, inexperienced researchers a chance to prove themselves.

I have some concerns that this grant could result in a bunch of half-baked research threads with no public write-up or clear conclusions. But Glen seems pretty motivated to make that not happen, and I think he also has a strong incentive to produce something legible and cool, eg to help with future grant/job applications.

Process for deciding amount

I'm honestly pretty confused about how to think about grant amounts here. $9K/month seems not crazy salary for someone living in SF, but I'd happily follow default rates for independent researchers if anyone has compiled them! $2K/month for compute seems enough to make it not a bottleneck without being too big a fraction of the grant. I'm funding this up to 5 months to balance between wanting Glen to have runway and a chance to prove himself, and wanting to see results before I recommend a larger/longer grant. If other grantmakers are excited about Glen's work I'd be happy to see them donating more though.

Conflicts of interest

Glen did my MATS training program about 6 months ago. I do a lot of SAE research, and expect to benefit from better knowledge of SAE training, but in the same way that the whole community will!