
Joseph Bloom - Independent AI Safety Research

Complete grant
$51,400 raised

Project summary

I would like to continue studying offline-RL agents using mechanistic interpretability in order to understand goals and agency. I believe the derived insights may help predict, detect and/or prevent AI misalignment. 


Key Activities:

  • Research mechanistic interpretability (MI) of trajectory transformer models.

  • Build/maintain open-source tooling (e.g., the TransformerLens package).

  • Mentor, support and advise other researchers/engineers.  

  • Possibly: Start a 501(c)(3) foundation modeled on the Farama Foundation to accelerate alignment tools/infrastructure.

Key reasons:

  • Research: My work is likely to produce insights into alignment-relevant questions, including foundational MI, goal representations, and validating new MI techniques.

  • Tooling/Open Source: Open-source packages that enable better MI will lead to faster innovation and adoption.

  • Community: I’d continue to help others develop technical skills, prioritize research directions, apply for funding, and contribute to open-source projects.

Project goals

Concretely, I’d like to broaden my current research scope from Decision Transformers to offline-RL transformers more generally. This will involve training models and then using current or novel MI approaches to look for goal representations and to understand the mechanisms by which next-token predictors “simulate” agents. 


Conceptually, I’d like to:

  • Better understand transformers/prosaic AI in general.

  • Reduce my confusion about things like why GPT-4 isn’t more agentic, or to what extent you could say it has goals. 


I expect the impact of this work to be that I will publish toy models, tools, analyses and experimental results which improve the state of public knowledge around agency and goals in transformer models.

How will this funding be used?

  • Salary: $50k

  • Taxes: $40k

  • Travel/conferences: $5k

  • Computing budget: $10k

  • Work requirements: $5k

Total: $110k

Another $140k would go towards:

  1. Starting a foundation to organize better tools for independent researchers working on alignment.

  2. Hiring a research intern.

How could this project be actively harmful?

A) This person could show enough promise that they are headhunted by capabilities labs.
B) Open source tooling for mech interp could be used for bad purposes?

What other funding is this person or project getting?

I am confident that I should be doing AI alignment work given my skill set and so will seek funding from other sources. I have no current applications with other funders. I am interviewing for a role as a Research Engineer at DeepMind.


joseph bloom

3 months ago

Final report

Description of subprojects and results, including major changes from the original proposal

During my time as an independent researcher working on interpretability of decision transformers, I published several articles on decision transformer interpretability on LessWrong (https://www.lesswrong.com/users/joseph-bloom) and developed a Python library for training / analysing decision transformers (https://github.com/jbloomAus/DecisionTransformerInterpretability).

At the end of 2023, I pivoted to working on Sparse Autoencoders, completed the MATS program under Neel Nanda, and wrote the library SAE Lens, around which I am now running a non-profit AI Safety Research Infrastructure organisation called Decode Research.

Why pivot to SAEs: SAEs represent a massive jump forward in our interpretability methods, but they created many engineering challenges. While decision transformer work could still be very fruitful, I'm sure that my contributions in the SAE space have been of higher value.

Sparse autoencoders which I have trained have been used in numerous investigations, and Decode Research recently partnered with DeepMind on the Gemma Scope, which has given the public access to high-quality sparse autoencoders on models capable of exhibiting interesting behaviors.

More detail: Manifund funding covered my activities from October 2023 – April 2024, meaning these funds supported the following publications / outputs:

  • Features and Adversaries in Memory DT: In this project I identified latent representations of goals in a grid world decision transformer, used this to identify adversarial inputs and equivalent latent interventions. While this work was not super popular on LessWrong (I think it was niche / kind of long), the results were quite interesting and non-trivial.

  • Linear Encoding of Character Level Information in GPT-J token embeddings: This collaboration with Matthew Watkins was the first (to my knowledge) to show that character information is linearly probeable in the embeddings of language models. Graphemic information is an incredibly good case study for mechanistic interpretability, which I am currently following up on with a team of 4 LASR Scholars (results coming soon).

  • Open Source Sparse Autoencoders for all residual stream layers of GPT2-Small: The SAEs posted here were the first set to be published this comprehensively (i.e., with feature dashboards). They have been used for numerous follow-up investigations, including not just LessWrong posts but also papers from labs like the Tegmark lab at MIT.

  • Understanding SAE Features with the Logit Lens: This project demonstrated a number of interesting findings about SAE features and proposed novel statistical methods for cheaply characterising them. While I haven't yet had time to extend this work, I will likely be doing so sometime in the next year and anticipate the techniques will complement automatic interpretability of SAE Features nicely.

  • Announcing Neuronpedia: A platform for accelerating research in Sparse Autoencoders: Prior to founding Decode Research, I assisted Johnny in sourcing funding for Neuronpedia specifically (for 1 year), and we "re-launched" Neuronpedia as a platform for SAEs (rather than MLP neurons), with the plan that I would advise Johnny part time while doing my other research. We eventually decided to form an organisation around Neuronpedia / SAE Lens.

  • SAE Lens: A library for Training and Analysing Sparse Autoencoders: The backbone of Decode Research and a popular open source library in its own right, SAE Lens enables researchers to easily download and analyse a large library of SAEs. The library is still a bit rough around the edges (as it must balance performance, a large variety of supported methods / architectures, and accessibility) but I expect we will continue to improve it in the future.

Spending breakdown

I spent funds according to the budget provided with only minor variations due to stochasticity.


joseph bloom

10 months ago

I just wanted to share my recent post: https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream.

In this post I release a set of sparse autoencoders for GPT2-Small, along with code for training sparse autoencoders, and discuss the tacit knowledge I've accumulated while learning to train them. Despite a bug in one of the improvements (since fixed, and which didn't affect results), these SAEs constitute both the highest quality SAEs published publicly since the Anthropic paper on the same topic and the best documented (e.g., sharing loss curves, code, and dashboards).

SAEs are super exciting, but we've got a lot of work to do in order to understand whether they capture all the information we want them to, whether they systematically misrepresent any information, and whether the information they capture is useful.


joseph bloom

10 months ago

Progress update

Note: I’ve tried to make this update relatively accessible, but I’m very happy to give more technical details or clarification in the comments or to have a short call with anyone interested in chatting.

What progress have you made since your last update?

TLDR: Over the last four months, I studied trajectory models using mechanistic interpretability techniques, then shifted focus to Sparse Autoencoders (a significant advancement in the field). In this period, I published a post demonstrating goal representation manipulation in one such model and co-authored another applying learned principles to language models, notably altering their spelling behavior. This work, though insightful, progressed slower than anticipated (for several reasons I discuss below) and may be redundant in some ways due to other recent progress. After consultation with other researchers, as well as Marcus/Dylan, I've redirected my efforts towards Sparse Autoencoders / language models (which I discuss in detail in Next Steps). 

The main goal of the grant is to “help predict, detect and/or prevent AI misalignment“ via developing a mechanistic understanding of offline-RL models (a model organism of sorts for language models like GPT-3). I think of mechanistic interpretability as a natural science of neural network internals, and of this research as attempting to contribute to our understanding of the natural phenomena that underpin alignment-relevant properties (e.g., goal representations). 

Therefore, I measure progress in the grant via improvements in methods, theories and techniques that enable us to understand neural network internals. We can decompose this into two components:

  • Algorithm Identification: the process of identifying algorithms and intermediate structures that mediate the mapping of neural network inputs to neural network outputs. This is the well-known circuit-finding agenda (e.g., discussed here).

  • Ontology Identification: learning how a neural network thinks about the world (i.e., mapping internal variables in the model's computation to variables in its environment). 

In the first three months of this grant, Jay Bailey and I progressed towards this goal in the gridworld context. In October, we published “Features and Adversaries in MemoryDT”, where we identified and manipulated internal representations of the gridworld in a trajectory model. Before this work, we had several negative results from applying circuit-finding techniques, which were complicated by some interesting factors (like my intuitions about superposition/capacity, derived from prior work in the field, being somewhat flawed) and some less exciting ones (mapping circuits is challenging due to distributed processing). The picture painted by our results and by work on sparse autoencoders clarifies why we had extensive superposition despite lots of capacity. 
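For readers unfamiliar with this kind of latent intervention, here is a rough sketch of the general technique (nudging activations along a direction found by probing) using a TransformerLens hook. It is illustrative only: GPT-2 small, the hook point, and the random direction are placeholders standing in for the actual MemoryDT model and its probed goal representation.

```python
# Hedged sketch of a latent intervention: push the residual stream along a
# hypothesised "goal direction" via a TransformerLens hook. Everything below
# is a placeholder for the real MemoryDT setup.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")     # stand-in for the trajectory model
goal_direction = torch.randn(model.cfg.d_model)       # in practice, a direction found by probing
goal_direction = goal_direction / goal_direction.norm()

def add_goal_direction(resid, hook, scale: float = 5.0):
    # Steer the model by adding the direction to every position's activations.
    return resid + scale * goal_direction

tokens = model.to_tokens("The agent turns towards the")
steered_logits = model.run_with_hooks(
    tokens, fwd_hooks=[("blocks.6.hook_resid_post", add_goal_direction)]
)
```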

I then worked on a collaboration with Matthew Watkins, extending some of my insights to language models. We published “Linear encoding of character-level information in GPT-J token embeddings”. Spelling is interesting because it constitutes a task where humans have direct insight into the underlying structure in reality (in this case, the characters that make up words); however, this structure is hidden from language models due to details of how we input text. There are several academic publications about the surprising phenomenon that LLMs know which characters are in words. We were able to identify and edit linear representations of character information in tokens. This work had some surprising results: in particular, when we delete letters from their token representations, the model still predicts subsequent letters from the same word (proportional to their distance from the front of the word). This demonstrates that even if you identify concepts in a model, knowing what will happen when you manipulate them may be another significant challenge.
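As a concrete picture of what “linearly probeable” means here, the sketch below trains a simple logistic-regression probe on token embeddings. It is not the original experiment: the published work used GPT-J and more careful token handling, whereas this substitutes GPT-2 small and a single letter to keep the example lightweight.

```python
# Rough sketch of a character-level linear probe on token embeddings.
# GPT-2 small and the letter "e" are stand-ins, so results will differ
# from the GPT-J work described above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
embeddings = model.wte.weight.detach().numpy()        # (vocab_size, d_model)

# Label each token by whether its decoded string contains the letter "e".
labels = np.array([
    int("e" in tokenizer.decode([i]).lower()) for i in range(embeddings.shape[0])
])

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=2000).fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```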

Neither of these posts was particularly popular on LessWrong, which is reasonable given the considerable progress and volume of publications in the field over the last four months. Feedback from some researchers was positive, but suggested the slow progression was evidence that a pivot might be needed. I don’t feel that I can claim we moved the needle on alignment or mechanistic interpretability much with this work, and this is somewhat due to the project betting heavily on mechanistic interpretability being harder than it has turned out to be in language models. Nevertheless, the results in these posts tie in nicely to various phenomena that the research community has begun to understand better, and I feel I developed significantly by working on this project. Lastly, I should mention that I am doing research building directly on my codebase / the posts, and it seems plausible that, for unanticipated reasons, both the code and the insights may become valuable in the future. 

What are your next steps?

TLDR: I’ve pivoted to training Sparse Autoencoders (SAEs) on small language models to assess how they solve the ontology identification problem, which is a prerequisite for reasoning well about goals/agency within neural networks. I’ve built my own SAE training library and followed up on preliminary experimental results under the supervision of Neel Nanda as part of the MATS program. 

Sparse Autoencoders are a fascinating new technique that advances our understanding of model internals by an incredible amount. The technique enumerates many concepts and identifies which of them a model infers at run time at a specific position in the network internals. The remarkable result is that the concepts, called “features,” are often highly human-interpretable (e.g., the concept of words that start with the letter “M”, or phrases/words associated with Northern England / Scotland). As a computational biologist, I think saying SAEs are to neural networks as DNA sequencing is to cell biology is pretty accurate. 
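For readers who want the mechanics made concrete, here is a minimal sketch of a sparse autoencoder of the kind described above, assuming the standard ReLU encoder/decoder trained with an L1 sparsity penalty; the dimensions and sparsity coefficient are illustrative placeholders rather than the exact configuration used in my work.

```python
# Illustrative sketch only: a ReLU sparse autoencoder trained to reconstruct
# residual stream activations with an L1 sparsity penalty. Dimensions and the
# sparsity coefficient are placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activations -> feature activations
        self.decoder = nn.Linear(d_features, d_model)   # feature activations -> reconstruction

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative "features"
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder(d_model=768, d_features=768 * 16)  # e.g. GPT-2 small residual stream
activations = torch.randn(32, 768)            # stand-in for real residual stream activations
reconstruction, features = sae(activations)
loss = ((reconstruction - activations) ** 2).mean() + 1e-3 * features.abs().mean()
loss.backward()
```

In this framing, the decoder's columns are the candidate “feature directions”, and a feature's activations across a dataset are what feed interpretability dashboards.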

For this reason, I reached out to Dylan / Marcus (the two significant funders) to check whether it would be ok if I pivoted to working on this new technique in the language model context (as well as checking with Neel Nanda, who supported the shift). They gave me the go-ahead, so that’s been my direction for the last two months. 

To support this research, I built on a few open-source libraries to make my own SAE training library, which I’ve used to train sparse autoencoders on various models, focusing especially on GPT-2 small. Under Neel’s supervision at MATS, I’ve been exploring a few directions that I think address the critical alignment-relevant questions about SAEs:

  1. Are Sparse Autoencoders capturing all of the information that we want them to? One way to think about this is that if you sequence DNA, print it, and then stitch it back into an organism, the organism shouldn’t die. Molecular biologists do small versions of this all the time. The way we train sparse autoencoders very much suggests that we should get a similar property (we should be able to replace the internals with our reconstruction), but in practice that reconstruction does hurt model performance (a minimal splice-in sketch follows this list). I’ve got some preliminary results showing we can represent more information, with more concepts active concurrently, without having those concepts become uninterpretable. Still, there’s more work to do to measure all the variables we care about here, such as how errors propagate through the model. 

  2. Do Sparse Autoencoders systematically misrepresent any information? To make sure sparse autoencoders come up with features that are interpretable to us, we enumerate many concepts and try to make sure that not too many features are active at the same time. However, it’s unclear that such a biased process will find the “true” underlying concepts. Since AI alignment will likely require that we are very good at estimating the true “ontology” of the model, I’m very interested in finding ways of measuring the distance between the “true” ontology and what we are finding, along axes that aren’t just how well we recover the model’s performance. I’ve explored some experiments that may get at this via studying QK circuits, which we may follow up on. 
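As promised above, here is a hedged sketch of the splice-in test from point 1: replace one layer's residual stream with an SAE reconstruction and compare the language-modelling loss. The model, hook point, and untrained encoder/decoder below are placeholders, so the gap shown would be far larger than with a properly trained SAE.

```python
# Hedged sketch of the splice-in test: swap the residual stream at one hook
# point for an SAE's reconstruction of it and compare losses. The encoder and
# decoder here are untrained placeholders.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
d_model, d_features = model.cfg.d_model, model.cfg.d_model * 16
encoder = torch.nn.Linear(d_model, d_features)  # placeholder, untrained SAE encoder
decoder = torch.nn.Linear(d_features, d_model)  # placeholder, untrained SAE decoder

def splice_in_reconstruction(resid, hook):
    # Replace the true activations with the SAE's reconstruction of them.
    return decoder(torch.relu(encoder(resid)))

tokens = model.to_tokens("When I visited Edinburgh, the weather was surprisingly")
clean_loss = model(tokens, return_type="loss").item()
spliced_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[("blocks.8.hook_resid_pre", splice_in_reconstruction)],
).item()
print(f"clean loss: {clean_loss:.3f}, spliced loss: {spliced_loss:.3f}")
```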

Regarding practical details, I’ll likely settle on a specific direction shortly and pursue that as part of MATS. I’ll write this up in a research plan and share it here. 

Neel expects his mentees to publish their work in academic articles, so I will likely be close to doing that by the time the Manifund grant period ends. Since I’ve received a LightSpeed grant with another six months of funding, I anticipate being able to continue this research for most of this year, by which time I expect to have results that justify further funding.

Is there anything others could help you with?

Whilst I think I’m mostly okay for funding/everything else (I'm not accepting MATS funding or flight reimbursement), it is undoubtedly the case that Sparse Autoencoders are incredibly compute-hungry. So access to a cloud computing cluster, or knowing that there are enough funds to run some big experiments if I need to, would be good. 

As an estimate, it can cost $3/hour and take 12 hours to train one SAE on GPT-2 small, and we might want to train 12 of these, which would come to roughly $430. This is a lower bound, as varying hyperparameters, working on larger models, and analysing features post-hoc will all increase the compute expenditure. It seems plausible that the previous $10k/year computing budget is an underestimate by 2–5x. 

Since I don’t want the stress of being handed a lot of money to spend on computing, I mildly prefer access to compute clusters (or a line of credit to be used only for computing or something). This isn’t essential/urgent yet as I still have some uncertainty over whether the research is significantly accelerated by training many SAEs or whether it will be essential to work with larger models. 


Austin Chen

10 months ago

@josephbloom Thanks for posting this update! Your grant was one of the very first grants made through Manifund's regranter program, and I'm quite happy to see your follow ups. I especially appreciate you staying in touch with Marcus and Dylan to give them a sense of how their grants are being used as well as your next research steps.

re: compute funding, I imagine you've already seen Superalignment Fast Grants; it seems like a good fit for your ask and I'd highly encourage you to apply (Leopold, who I believe is running the program, is also a Manifund regrantor!)

donated $25,000

Marcus Abramovitch

11 months ago

Want to update that Joseph has been crushing it. He's made good research progress, updates Dylan and me every month, and receives good feedback. Super happy with this grant.

I have also let Joseph know that if he is in need of more funding, he should contact me and I will make sure it happens.

10/10


joseph bloom

11 months ago

@MarcusAbramovitch Thanks Marcus!


Miguelito De Guzman

over 1 year ago

I am one of the ARENA 2.0 online participants, and I can say that in my interactions with Joseph he was very insightful. I believe he is competent enough to deliver in the alignment space.

donated $1,000

Anton Makiievskyi

over 1 year ago

@josephbloom, would you stop this project if you got hired by DeepMind, or are you expecting to continue it as part of the job?


joseph bloom

over 1 year ago

I don't think it's likely I will be hired by DeepMind, as I interviewed for a role recently and they decided not to proceed. I was also told to expect that, if I had joined the team, I would likely have been working on language models.

donated $25,000

Marcus Abramovitch

over 1 year ago

Main points in favor of this grant

  1. Neel Nanda's top choice in the Nonlinear Network. Neel says many people want to hire him.

  2. Joseph is an official maintainer of TransformerLens (the top package for mech interp).

  3. Teaches at the ARENA program.

  4. Two really good posts on Decision Transformer Interpretability and Mech Interp Analysis of GridWorld Agent-Simulator.

  5. Work was listed in Anthropic's May 2023 update.

  6. Working on trajectory transformers is a natural progression from decision transformers.

Donor's main reservations

I wonder if it would be better for him to be hired by another alignment team instead; since he is fairly young, he might get better mentorship working with others.

Process for deciding amount

This just should be fully funded, at least to $110,000. $25,000 (but ideally $50,000) would put him at ease for 6 months, by which time he expects to have enough output to justify further funding. I'd give more but I have a limited budget. This is already half of my budget, but I feel quite strongly about this.

Conflicts of interest

Nothing to disclose.



Rachel Weinberg

over 1 year ago

At first glance when trying to foster skepticism I had the same thought as you: that teams and mentorship make people more productive, so this grant could be a push in the wrong direction. On the other hand, he's been unusually successful so far as an independent researcher. If he's particularly well-suited to working independently, which most people struggle with, that's a kind of comparative advantage it might make sense to lean into since mentorship and spots on established teams are in short supply.

donated $25,000

Marcus Abramovitch

over 1 year ago

I think with his track record so far and his endorsements, he's earned the right to go in the direction he thinks is best. Maybe it'd be better to have an org that "houses" a bunch of people who just want to work by themselves, where the org formally employs them, helps them raise funds for their projects, and maybe has some communal resources. But I don't think I'd prefer to fund that org over funding someone who is just going to do good direct work.


joseph bloom

over 1 year ago

A few points on this topic:

  • Jay Bailey, a former senior software/devops engineer and SERI-MATS scholar, has been funded to work on this agenda and has begun helping me out. I'm also discussing collaborations with other people from more of a maths / conceptual alignment background, which I hope will be useful.

  • I agree mentorship is useful and plan to make an effort to find a mentor, although I've also been regularly discussing parts of my work with alignment researchers. At least one well-respected alignment researcher told me that it's plausible that this kind of work is teaching me more than I'd learn at an org, but I know Neel disagrees.

  • I'm likely to co-work part time in a London AI safety office if one exists in the future.

I think I'm approaching my research with somewhat of a scout mindset here. It seems plausible that, for some people, independent research is Pareto optimal for the community across the output of potential mentees/mentors. I am also considering an experiment where I do a small collaboration with an organisation, which may provide evidence in the other direction. If it turned out that this was productive and alleviated a mentorship bottleneck, then finding that out might be valuable and inform future funding strategies.