I'm quite excited to see this project completed. I see SAEs as like a microscope we're in the process of perfecting. Right now, we can't perfectly tell which of the images we're seeing through the microscope are scratches on the lens (artifacts of the methods) as opposed to real underlying structure. I see this project as straightforwardly looking into one of those scratches.
Matthew and Hardik have developed a lot while I've been mentoring them and I expect will do a good job on this project.
Feel free to reply here if you have any questions!
@josephbloom
Cofounder and Science Lead at Decode Research / Neuronpedia
https://www.linkedin.com/in/joseph-bloom1/
$0 in pending offers
I recently co-founded an AI Safety Research Infrastructure Organisation (501c3, non-profit) which aims to accelerate progress in mechanistic interpretability, in the service of mitigating risks from future AI systems.
Our projects include:
Neuronpedia: An open platform for interpretability research. This site hosts sparse autoencoders trained by independent researchers as well as OpenAI and DeepMind.
SAE Lens: A library for training and analysing sparse autoencoders.
Prior to my current role, I was an independently funded AI Alignment Research Engineer focussing on mechanistic interpretability in reinforcement learning. I was previously a maintainer of TransformerLens, a popular open source package for mechanistic interpretability of transformers.
Prior to working in AI Alignment, I studied computational biology, and worked for 2 years as a data scientist in a proteomics startup.
Comments
joseph bloom
about 1 month ago
joseph bloom
3 months ago
Description of subprojects and results, including major changes from the original proposal
During my time as an independent researcher working on interpretability of decision transformers, I published several articles on decision transformer interpretability on LessWrong (https://www.lesswrong.com/users/joseph-bloom) and developed a Python library for training / analysing decision transformers (https://github.com/jbloomAus/DecisionTransformerInterpretability).
At the end of 2023, I pivoted to working on Sparse Autoencoders, completed the MATS program under Neel Nanda, and wrote the library SAE Lens, around which I am now running a non-profit AI Safety Research Infrastructure organisation called Decode Research.
Why pivot to SAEs: SAEs represent a massive jump forward in our interpretability methods but created many engineering challenges. While decision transformer work could still be very fruitful, I'm sure that my contributions in the SAE space have been of higher value.
Sparse Autoencoders that I have trained have been used in numerous investigations, and Decode Research recently partnered with DeepMind on Gemma Scope, which has given the public access to high-quality sparse autoencoders on models capable of exhibiting interesting behaviors.
More detail: Manifund funding covered my activities from October 2023 - April 2024, meaning these funds supported me for the following publications / output:
Features and Adversaries in Memory DT: In this project I identified latent representations of goals in a grid world decision transformer, used this to identify adversarial inputs and equivalent latent interventions. While this work was not super popular on LessWrong (I think it was niche / kind of long), the results were quite interesting and non-trivial.
Linear Encoding of Character Level Information in GPT-J token embeddings: This collaboration with Matthew Watkins was the first (to my knowledge) to show character information is linearly probeable in the embeddings of language models. Graphemic information is an incredibly good case study for investigation in mechanistic interpretability which I am currently following up on with a team of 4 LASR Scholars (results coming soon).
Open Source Sparse Autoencoder for all residual stream layers of GPT2-Small: The SAEs posted here were the first set posted this comprehensively and with feature dashboards. They have been used for numerous follow-up investigations, including not just LessWrong posts but also papers from labs like the Tegmark Lab at MIT.
Understanding SAE Features with the Logit Lens: This project demonstrated a number of interesting findings about SAE features and proposed novel statistical methods for cheaply characterising them. While I haven't yet had time to extend this work, I will likely be doing so sometime in the next year and anticipate the techniques will complement automatic interpretability of SAE Features nicely.
Announcing Neuronpedia: A platform for accelerating research in Sparse Autoencoders: Prior to founding Decode Research, I assisted Johnny in sourcing funding for Neuronpedia specifically (for 1 year) and we "re-launched" Neuronpedia as a platform for SAEs (rather than MLP neurons), where I was planning to advise Johnny part time while doing my other research. We eventually decided to form an organisation around Neuronpedia / SAE Lens.
SAE Lens: A library for Training and Analysing Sparse Autoencoders: The backbone of Decode Research and a popular open source library in its own right, SAE Lens enables researchers to easily download and analyse a large library of SAEs. The library is still a bit rough around the edges (as it must balance performance, a large variety of supported methods / architectures, and accessibility) but I expect we will continue to improve it in the future.
Spending breakdown
I spent funds according to the budget provided with only minor variations due to stochasticity.
joseph bloom
7 months ago
Super excited about this project. Tom and I have already done a lot of good work, and the collaboration between Johnny, Tom and me has a huge amount of synergy! I'd encourage people to add further funds to help Tom reach his goal as (I'm very biased but) I think the resulting SAEs will be super useful to a bunch of researchers, and in the process we'll create useful knowledge that accelerates the progress underpinning future AI safety outcomes.
joseph bloom
10 months ago
I just wanted to share my recent post: https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream.
In this post I release a set of Sparse AutoEncoders for GPT2-Small and code for training sparse autoencoders, and discuss the tacit knowledge I've accumulated while learning to train them. Despite a bug with one of the improvements (since fixed, and which didn't affect results), these SAEs constitute both the highest quality SAEs published publicly since the Anthropic paper on the same topic and the best documented (e.g., sharing loss curves + code + dashboards).
SAEs are super exciting, but we've got a lot of work to do in order to understand whether they are capturing all the information we want them to, whether they are systematically misrepresenting any information, and whether the information is useful.
joseph bloom
10 months ago
Note: I've tried to make this update relatively accessible, but I'm very happy to give more technical details or clarification in the comments or to have a short call with anyone interested in chatting.
What progress have you made since your last update?
TLDR: Over the last four months, I studied trajectory models using mechanistic interpretability techniques, then shifted focus to Sparse Autoencoders (a significant advancement in the field). In this period, I published a post demonstrating goal representation manipulation in one such model and co-authored another, applying learned principles to language models, notably altering their spelling behavior. This work, though insightful, progressed slower than anticipated (which I attribute to several reasons) and may be redundant in some ways due to other recent progress. After consultation with other researchers, as well as Marcus/Dylan, I've redirected my efforts towards Sparse AutoEncoders / Language Models (which I discuss in detail in Next Steps).
The main goal of the grant is to "help predict, detect and/or prevent AI misalignment" via developing a mechanistic understanding of offline-RL models (a model organism of sorts for Language models like GPT3). I think of mechanistic interpretability as a natural science of neural network internals and of this research as attempting to contribute to our understanding of the natural phenomena that underpin alignment-relevant properties (e.g., goal representations).
Therefore, I measure progress in the grant via improvements in methods, theories and techniques that enable us to understand neural network internals. We can decompose this into two components:
Algorithm Identification: the process of identifying the algorithms and intermediate structures that mediate the mapping of neural network inputs to neural network outputs. This is the well-known circuit-finding agenda (e.g., discussed here).
Ontology Identification: learning how a neural network thinks about the world (i.e., mapping internal variables in the model's computation to variables in its environment).
In the first three months of this grant, Jay Bailey and I progressed towards this goal in the gridworld context. In October, we published "Features and Adversaries in MemoryDT", where we identified and manipulated internal representations of the gridworld in a trajectory model. Before this work, we had several negative results associated with applications of circuit-finding techniques, which were complicated by some interesting reasons (like my intuitions about superposition/capacity derived from prior work in the field being somewhat flawed) and less exciting reasons (mapping circuits is challenging due to distributed processing). The picture painted by our results and work on sparse autoencoders clarifies why we had extensive superposition despite lots of capacity.
I then worked on a collaboration with Matthew Watkins, extending some of my insights to language models. We published "Linear encoding of character-level information in GPT-J token embeddings". Spelling is interesting because it constitutes a task where humans have direct insight into the underlying structure in reality (in this case, in the characters that make up words). However, this is hidden from language models due to details about how we input text. There are several academic publications about the surprising phenomenon that LLMs know which characters are in words. We could identify and edit linear representations of character information in tokens. This work had some surprising results. In particular, when we delete letters in their token representations, the model predicts subsequent letters from the same word (proportional to their distance to the front of the word). This demonstrates that even if you identify concepts in a model, knowing what will happen when you manipulate them may be another significant challenge.
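As a rough illustration of what "deleting" a linearly encoded attribute can look like: one common approach is to project the learned probe direction out of the embedding. This is only a sketch under assumptions. The update does not state exactly how the edit was performed in the post, and `embedding`, `probe_direction` and `project_out` are hypothetical names with toy values chosen for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(embedding, direction):
    """Remove the component of `embedding` that lies along `direction`.

    Assumption: the attribute (e.g. "contains the letter M") is encoded
    along a single linear direction found by a probe.
    """
    norm = math.sqrt(dot(direction, direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coeff = dot(embedding, unit)  # how strongly the attribute is expressed
    return [e - coeff * u for e, u in zip(embedding, unit)]


# Toy 3-dimensional example (hypothetical values, not real GPT-J embeddings).
embedding = [0.5, -1.0, 2.0]
probe_direction = [1.0, 0.0, 1.0]
edited = project_out(embedding, probe_direction)
print(edited)
```

After the edit, the embedding has no component along the probe direction, while everything orthogonal to it is untouched. The surprising result described above is about what the model does downstream of such an edit, which this sketch does not capture.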
Neither of these posts was particularly popular on LessWrong, which is reasonable given that there has been considerable progress and publications in the field in the last four months. Feedback from some researchers was positive, but suggested the slow progression was evidence that a pivot might be needed. I don't feel that I can claim we moved the needle on alignment or mechanistic interpretability much with this work and this is somewhat due to the project betting heavily on mechanistic interpretability being harder than it has turned out to be in language models. Nevertheless, the results in these posts tie in nicely to various phenomena that the research community has begun to understand better and I feel I developed significantly via working on this project. Lastly, I should mention that I am doing research building directly on my codebase / the posts and it seems plausible that for unanticipated reasons both the code / insights may become valuable in the future.
What are your next steps?
TLDR: I've pivoted to training Sparse Autoencoders (SAEs) on small language models to assess how they solve the ontology identification problem, which is a prerequisite for reasoning well about goals/agency within neural networks. I've built my own SAE training library and followed up on preliminary experimental results under the supervision of Neel Nanda as part of the MATS program.
Sparse AutoEncoders are a fascinating new technique that advances our understanding of model internals by an incredible amount. This technique enumerates many concepts and identifies which are inferred at run time by a model at a specific position in the network internals. The incredible result is that the concepts, called "features," are often incredibly human interpretable (e.g., the concept of words that start with the letter "M" or phrases/words associated with Northern England / Scotland). As a computational biologist, I think saying SAEs are to neural networks as DNA sequencing is to cell biology is pretty accurate.
For this reason, I reached out to Dylan / Marcus (the two significant funders) to check whether it would be ok if I pivoted to working on this new technique in the language model context (as well as checking with Neel Nanda, who supported the shift). They gave me the go-ahead, so that's been my direction for the last two months.
To support this research, I built on a few open-source libraries to make my own SAE training library, which I've used to train sparse autoencoders on various models, focusing especially on GPT2-small. Under Neel's supervision at MATS, I've been exploring a few directions that I think address the critical alignment-relevant questions about SAEs:
Are Sparse AutoEncoders capturing all of the information that we want them to? One way to think about this is that if you sequence DNA, print it, and then stitch it back into an organism, then the organism shouldn't die. Molecular biologists do small versions of this all the time. The way we train sparse autoencoders very much suggests that we should get a similar property (we can replace the internals with our reconstruction), but in practice, that reconstruction does hurt the model performance. I've got some preliminary results showing we can better represent more information with more concepts concurrently without having those concepts become uninterpretable. Still, there's more work to measure all the variables we care about here, such as how errors propagate through the model.
Do Sparse AutoEncoders systematically misrepresent any information? To make sure sparse autoencoders come up with features that are interpretable to us, we enumerate many concepts and try to make sure that we don't have too many features appear at the same time. However, it's unclear that a biased process will find the "true" underlying concepts. Since AI alignment will likely require we are very good at estimating the true "ontology" of the model, I'm very interested in trying to find ways of measuring the distance between the "true" ontology and what we are finding along axes that aren't just how well we recover the model performance. I've explored some experiments that may get at this via studying QK circuits, which we may follow up on.
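To make the two questions above more concrete, here is a minimal, dependency-free sketch of a textbook sparse autoencoder: a ReLU encoder into many features, a linear decoder back to the activation, and a loss that trades reconstruction error against an L1 sparsity penalty. It is an assumption that this standard form is close to what the training library uses. The sizes, random weights, `l1_coeff` value and the function names are all hypothetical and chosen only for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Return (features, reconstruction) for one activation vector x."""
    pre_acts = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre_acts]  # ReLU keeps features non-negative
    recon = [a + b for a, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon


def sae_loss(x, features, recon, l1_coeff):
    """Reconstruction error plus a penalty on total feature activation."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1, mse, l1


# Toy sizes: 4-dimensional activations, 8 candidate features (overcomplete).
rng = random.Random(0)
d_model, n_features = 4, 8
w_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(n_features)]
w_dec = [[rng.gauss(0, 0.5) for _ in range(n_features)] for _ in range(d_model)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model

x = [rng.gauss(0, 1) for _ in range(d_model)]
features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
total, mse, l1 = sae_loss(x, features, recon, l1_coeff=1e-3)
active = sum(f > 0 for f in features)
print(f"active features: {active}/{n_features}, mse: {mse:.4f}, l1: {l1:.4f}")
```

The `mse` term corresponds to the first question (how much information the reconstruction loses), and the L1 term is the pressure that keeps few features active at once, which is the potential source of bias raised in the second question. Measuring how reconstruction errors propagate through the rest of the model requires splicing `recon` back into a real network, which this sketch does not attempt.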
Regarding practical details, I'll likely settle on a specific direction shortly and pursue that as part of MATS. I'll write this up in a research plan as part of MATS and share it here.
Neel expects his mentees to publish their work in academic articles, so I will likely be close to doing that by the time the Manifund grant period ends. Since I've received a LightSpeed grant with another six months of funding, I anticipate being able to continue this research for most of this year, by which time I expect to have results that justify further funding.
Is there anything others could help you with?
Whilst I think I'm mostly okay for funding/everything else (not accepting MATS funding or flight reimbursement), it is undoubtedly the case that Sparse Autoencoders are incredibly compute-hungry. So access to a cloud computing cluster, or knowing that there are enough funds if I need to run some big experiments, would be good.
As an estimate, it can cost $3 / hour and take 12 hours to train one SAE on GPT2-small, and we might want to train 12 of these, which would add up to about $400. This is a lower bound, as varying hyperparameters, working on larger models, and analysing features post-hoc will all increase the compute expenditure. It seems plausible that the previous 10k budget per year is an underestimate by 2 - 5x.
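The arithmetic behind that estimate can be written out as a short sketch. The rate, duration and SAE count are the figures quoted above; the function and variable names are hypothetical, and the calculation assumes runs are billed at a flat hourly rate with no overhead.

```python
def sae_training_cost(rate_usd_per_hour, hours_per_sae, num_saes):
    """Total rental cost for training `num_saes` SAEs at a flat hourly rate."""
    return rate_usd_per_hour * hours_per_sae * num_saes


# Figures from the estimate: $3/hour, 12 hours per SAE, 12 SAEs.
base_cost = sae_training_cost(3.0, 12.0, 12)

# The earlier compute budget and the suggested 2-5x correction.
previous_budget = 10_000
scaled_low, scaled_high = 2 * previous_budget, 5 * previous_budget

print(f"one sweep of 12 SAEs: ${base_cost:.0f}")
print(f"revised yearly compute range: ${scaled_low:,} - ${scaled_high:,}")
```

A single sweep comes to $432, which matches the "about $400" figure, and a 2 - 5x correction on a 10k budget implies roughly $20,000 to $50,000 per year.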
Since I don't want the stress of being handed a lot of money to spend on computing, I mildly prefer access to compute clusters (or a line of credit to be used only for computing or something). This isn't essential/urgent yet as I still have some uncertainty over whether the research is significantly accelerated by training many SAEs or whether it will be essential to work with larger models.
joseph bloom
about 1 year ago
I wouldn't usually comment on other people's projects but I've been mentioned in the proposal and @Austin's response. Furthermore, I recently published some research which relates to many of the main themes in Chris's post (world models, steering vectors, superposition).
It's not obvious to me that more posts like these will lead to more good work being done. I don't think we are bottlenecked on ambitious, optimistic people and this post is redundant with others in terms of convincing people to be excited about these research outcomes.
I'd be keen on seeing more results of the kind discussed in the post but my prior on paying people to promote that work on LW being optimal funds use is low.
joseph bloom
over 1 year ago
I don't think it's likely I will be hired by DeepMind, as I interviewed for a role recently and they decided not to proceed. I was also told that, had I joined the team, I would likely have been working on language models.
joseph bloom
over 1 year ago
A few points on this topic:
Jay Bailey, a former senior software/devops engineer and SERI-MATS scholar, has been funded to work on this agenda and has begun helping me out. I'm also discussing collaborations with other people from more of a maths / conceptual alignment background, which I hope will be useful.
I agree mentorship is useful and plan to make an effort to find a mentor, although I've also been regularly discussing parts of my work with alignment researchers. At least one well-respected alignment researcher told me that it's plausible that this kind of work is teaching me more than I'd learn at an org, but I know Neel disagrees.
I'm likely to co-work part time in a London AI safety office if one exists in the future.
I think I'm approaching my research with somewhat of a scout mindset here. It seems plausible that independent research for some people is Pareto optimal for the community across output from potential mentees/mentors. I am also considering an experiment where I do a small collaboration with an organisation, which may provide evidence in the other direction. If it were true that this was productive and alleviated a mentorship bottleneck, then finding that out might be valuable/inform future funding strategies.
Transactions
For | Date | Type | Amount |
---|---|---|---|
Manifund Bank | over 1 year ago | withdraw | 51400 |
Joseph Bloom - Independent AI Safety Research | over 1 year ago | project donation | +250 |
Joseph Bloom - Independent AI Safety Research | over 1 year ago | project donation | +25000 |
Joseph Bloom - Independent AI Safety Research | over 1 year ago | project donation | +25000 |
Joseph Bloom - Independent AI Safety Research | over 1 year ago | project donation | +790 |
Joseph Bloom - Independent AI Safety Research | over 1 year ago | project donation | +10 |
Joseph Bloom - Independent AI Safety Research | over 1 year ago | project donation | +200 |
Joseph Bloom - Independent AI Safety Research | over 1 year ago | project donation | +100 |
Joseph Bloom - Independent AI Safety Research | over 1 year ago | project donation | +50 |