
Neuronpedia - Open Interpretability Platform

Active Grant
$2,500 raised
$250,000 funding goal

Project summary

Neuronpedia (neuronpedia.org) accelerates interpretability researchers who use Sparse Autoencoders (SAEs) by acting as a comprehensive reference and fast-feedback analysis platform. I am collaborating with Joseph Bloom (SERI-MATS) to plan, build, and test the most important features. This is a significant pivot from the previous "crowdsourcing via gamification" project.

What are this project's goals and how will you achieve them?

The problem we’re solving is: the amount and complexity of data involved in understanding neural networks is growing rapidly due to unsupervised methods like Sparse Autoencoders and Automated Interpretability. This will inevitably raise the bar for contributing to Mechanistic Interpretability for researchers without access to great infrastructure / engineers. We’ve already made it easier for some researchers to work with Sparse Autoencoders trained on GPT2-SMALL, and we want to accelerate more researchers working on more models / projects.

The way we solve this problem is: distributed infrastructure which supports sharing and analysis of common datasets. We’re inspired by the vast amounts of data shared between biomedical researchers in databases containing data about genes, proteins, and diseases. But this is still a young field, so we don’t know exactly what kinds of data / resources will be most useful to collaborate on and/or share. Therefore, we think a good compromise between scaling fast (around existing techniques) and waiting to see what comes next is to keep an agile mentality while working with researchers to solve the problems currently blocking them when they try to make sense of neural network internals.

The particular solutions we’re excited to expand include:

  • Hosting SAEs trained by organisations like OpenAI and researchers like Joseph Bloom. Sparse Autoencoders provide an unprecedented ability to decompose model internals into meaningful components. We’re going to accelerate research by sharing model transparency as widely as possible with the research community.

  • Generation of feature dashboards. Data visualization is incredibly useful for summarising data and enabling researchers to build intuition. However, generating and storing feature dashboards at scale is a challenging engineering feat which many researchers / labs will be able to sidestep by using Neuronpedia.

  • Generation and scoring of explanations for Sparse Autoencoder features. Scaling interpretability research may require significant automation, which has already been integrated into Neuronpedia. Benchmarking, improving, and speeding up our automatic interpretability features could further accelerate our understanding of model internals. See an example of an automatic explanation for a feature here.

  • Interactive interfaces which enable:

    • Mechanistic Interpretability via SAE features: Since we’re already supporting live inference and calculation of feature activations, it’s plausible that we could support a limited degree of live experimentation, such as sampling with features pinned (steering with features) or feature ablations (see the sketch after this list). Other features we could build might include search over prompt templates designed to find features involved in common algorithms.

    • Red-Teaming of SAE Features: We already provide the ability to type in text on feature dashboards, which enables a weak form of red-teaming. However, we could expand the features here, such as enabling users to provide a hypothesis which GPT-4 uses to generate text on which a feature may or may not fire.

    • Exploration of SAE Quality and Evaluations: SAEs are a new technique where mileage may vary depending on a variety of factors. It may be useful to develop features that create transparency around SAE quality and compare results between SAEs. 
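
To make the live inference and steering ideas above concrete, here is a minimal sketch (not Neuronpedia’s actual implementation) of the two underlying operations: encoding a residual-stream activation into SAE feature space, and sampling with one feature’s decoder direction pinned into the residual stream. It assumes TransformerLens as the model runner; the SAE weights, layer choice, feature id, and steering strength below are placeholders.

```python
import torch
from transformer_lens import HookedTransformer

# Load GPT2-SMALL, the model Neuronpedia currently hosts data for.
model = HookedTransformer.from_pretrained("gpt2")

# Placeholder SAE parameters; a real SAE's trained weights would be loaded instead.
d_model, n_features = model.cfg.d_model, 24576
W_enc = torch.randn(d_model, n_features) * 0.01
b_enc = torch.zeros(n_features)
W_dec = torch.randn(n_features, d_model) * 0.01

def feature_activations(resid):
    # resid: [batch, pos, d_model] -> SAE feature activations [batch, pos, n_features]
    return torch.relu(resid @ W_enc + b_enc)

# 1) Live inference: feature activations for a prompt at layer 6's residual stream.
prompt = "The quick brown fox jumps over the lazy dog"
_, cache = model.run_with_cache(prompt)
acts = feature_activations(cache["resid_pre", 6])  # [1, pos, n_features]
print(acts.max(dim=1).values.topk(5))              # most strongly firing features

# 2) Steering: pin a feature by adding its decoder direction into the residual
#    stream during generation. Feature id and strength are arbitrary here.
FEATURE_ID, STRENGTH = 123, 8.0

def steering_hook(resid, hook):
    return resid + STRENGTH * W_dec[FEATURE_ID]

with model.hooks(fwd_hooks=[("blocks.6.hook_resid_pre", steering_hook)]):
    print(model.generate(prompt, max_new_tokens=20))
```

Hosting this loop centrally is what lets a feature dashboard offer "type text, see activations" and, potentially, live steering, without each researcher standing up their own inference stack.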

Specific Milestones we think are worth working towards include:

  1. By July 1st, release a technical note sharing Neuronpedia feature details and three different case studies on how Neuronpedia has accelerated research into neural network internals. This technical note will focus on basic feature exploration, red-teaming, and search-related features.

  2. By October 1st, release a second technical note sharing Neuronpedia feature details and three different case studies on how Neuronpedia has accelerated research into neural network internals. This technical note could focus on more advanced features developed in consultation with the research community. For example, we might build features that enable comparison of features across layers to understand how representations change with depth, or comparison of features across models to understand the universality of features found between models (a rough sketch of such a comparison follows this list).
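
As a rough illustration of the cross-layer / cross-model comparison mentioned in milestone 2, one simple approach is to compare SAE decoder directions by cosine similarity and treat highly similar directions as candidate "matched" features. This is only a sketch of one possible method, not a committed design; the W_dec_l6 / W_dec_l7 names are hypothetical placeholders.

```python
import torch

def match_features(W_dec_a, W_dec_b, top_k=1):
    # W_dec_a, W_dec_b: [n_features, d_model] decoder matrices from two SAEs
    # (e.g. two layers of GPT2-SMALL, or two models with the same hidden size).
    a = torch.nn.functional.normalize(W_dec_a, dim=-1)
    b = torch.nn.functional.normalize(W_dec_b, dim=-1)
    sims = a @ b.T                    # cosine similarity matrix [n_a, n_b]
    return sims.topk(top_k, dim=-1)   # best-match value + index per feature in A

# Hypothetical usage: what fraction of layer-6 features have a near-twin in layer 7?
W_dec_l6, W_dec_l7 = torch.randn(100, 768), torch.randn(100, 768)  # placeholders
best = match_features(W_dec_l6, W_dec_l7)
print((best.values > 0.9).float().mean())
```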

How will this funding be used?

$250k funds Neuronpedia for 1 year, from Feb 2024 to Feb 2025.

Breakdown:

  • $90k is state, federal, and local taxes (source for California) - possibly higher as a 1099 contractor.

  • $3-5k/mo for inference servers (currently CPU-only), databases, etc.

  • $3k/mo for contractors, potentially an intern, and other help

  • Remaining $5-$7k/mo is living costs (rent, insurance, utilities, etc).

This is closer to the lower bound of keeping Neuronpedia running. Less means there's no cushion for adding significant amounts of data/features to Neuronpedia - and we want to offer free, open, and frictionless hosting for all interpretability researchers. Example: we are currently hosting data for GPT2-SMALL, which is tiny and can more or less be run on CPU. We would love to study GPT2-XL, but that requires a GPU with a decent amount of memory, which gets expensive fast.

More ($500k) would be ideal. I'm incredibly time-starved right now and am doing every role it takes to run Neuronpedia end-to-end - 100% of the coding, UI/UX/design, ops, fixing bugs, PMing tasks, keeping up with requests, comms, finding grants, writing, testing, etc. I currently work 70+ hours a week and spend my other time thinking about Neuronpedia. With more funding I'd delegate many tasks to contractors to build features faster and in parallel.

Who is on your team and what's your track record on similar projects?

  • I'm an ex-Apple engineer who founded a few apps, and last year went full time into interpretability, building Neuronpedia. The Neuronpedia infrastructure I've built and the knowledge I've gained during that time are helping me move much faster on the new research pivot. Previously, my most popular app got over 1,000,000 organic downloads, and my writing has been featured in the Washington Post, Forbes, FastCompany, etc. I have code contributions to OpenAI's public repository for Automated Interpretability - new features, some fixes (another), etc.

  • My collaborator is Joseph Bloom (MATS), who recently published residual directions for GPT2-SMALL. Joseph has 2+ years of experience at a SaaS platform doing data processing and live exploration of biomedical data, and is currently an independently funded mechanistic interpretability researcher. We collaborated to upload his directions to Neuronpedia, where all 294k directions can be browsed, filtered, and searched. Joseph is providing feedback on how Neuronpedia can help and accelerate interp research (dogfooding, PM-ing/prioritization of features, connecting with the community of interp researchers), as well as training/uploading new models.

What are the most likely causes and outcomes of this project's failure? (premortem)

  • Not enough feedback (I super-need people to quickly and bluntly tell me "hey that idea/feature doesn't work or isn't useful, scrap it" and also ideally "work on this idea/feature instead")

  • Not iterating useful features quickly enough

  • Not enough resources to sustain the service (hosting, inference, etc.) with good uptime, or to scale to larger models

  • Bad UX and/or bugs: Researchers have difficulty using it and give up on it because it's too confusing

What other funding are you or your project getting?

  • All previous funding has been exhausted as of Feb 9th, 2024 - this is now being funded by me personally. Previous short-term grants include $2,500 from Manifund - details available upon request.

  • Applied to OpenAI's Superalignment Fast Grants and the LTFF a few days ago. No status yet.

Johnny Lin

7 months ago

Progress update

What progress have you made since your last update?

Neuronpedia has been funded by LTFF. We are now uploading and supporting SAEs from various interp/alignment researchers and orgs.

What are your next steps?

Continue to support new SAEs and add more features to Neuronpedia to accelerate interpretability research!

Johnny Lin

10 months ago

Progress update

What progress have you made since your last update?

  • ~20,000 explanations

  • ~10,000 verifications of explanations ("digs")

  • 2 new game modes

  • Explorer/searcher/browser for neurons and "directions" (roughly, combinations of neurons)

What are your next steps?

  • Continue to build out new features, fix bugs, etc.

Is there anything others could help you with?

  • Would love feature requests from interpretability researchers - how do I make this more useful for you?

  • Warm intros to folks interested in supporting Neuronpedia

Johnny Lin

9 months ago

@hijohnnylin Another update from two weeks ago - Neuronpedia collaborated with Joseph Bloom (MATS) to upload his Sparse Autoencoders, which resulted in ~295,000 dashboards. Neuronpedia is also now fully pivoting to being a tool for accelerating the work of interpretability researchers, sunsetting the previous gamification goals.

Joel Becker

over 1 year ago

Hi Johnny! Many congratulations on being approved for a grant from EV. Could I ask how that might change your ask for funding here?

Johnny Lin

over 1 year ago

Hi Joel!

The simple answer to your question is that the ask for funding would be the existing amount minus $25,000. I've made a feature request for Manifund to allow requesters to add grants received outside of the platform and add details about them.

The longer answer is that I am more excited about this project than anything I've ever worked on, and would love to work on it full time for as long as possible. In the short term and possibly even after public release (currently only posted on LessWrong forums as experimental beta), I can likely handle the workload, but I'm starting to think that it could be a good idea to have more than one person work on this.

I'd love to set up a call if you're interested in learning more about where this can go. Someone I spoke to yesterday said "AI is, in a way, the greatest crossword puzzle of all time". I can't think of anything more meaningful than building something that redirects the energy of millions of humans into increasing AI safety/alignment (even if they don't realize they're doing it).

Thanks,
Johnny

donated $2,500
Austin Chen

over 1 year ago

Hi Johnny, thanks for submitting your project! I've decided to fund this project with $2500 of my own regrantor budget to start, as a retroactive grant. The reasons I am excited for this project:

  • Foremost, Neuronpedia is just a really well-developed website; web apps are one of the areas I'm most confident in my evaluation. Neuronpedia is polished, with delightful animations and a pretty good UX for expressing a complicated idea.

  • I like that Johnny went ahead and built a fully functional demo before asking for funding. My $2500 is intended to be a retroactive grant, though note this is still much less than the market cost of 3-4 weeks of software engineering at the quality of Neuronpedia, which I'd ballpark at $10k-$20k.

  • Johnny looks to be a fantastic technologist with a long track record of shipping useful apps; I'd love it if Johnny specifically and others like him worked on software projects with the goal of helping AI go well.

  • The idea itself is intriguing. I don't have a strong sense of whether the game is fun enough to go viral on its own (my very rough guess is that some onboarding simplifications and virality improvements are needed), and an even weaker sense of whether this will ultimately be useful for technical AI safety. (I'd love it if one of our TAIS regrantors would chime in on this front!)

Johnny Lin

over 1 year ago

Hey Austin - Thanks so much for the kind words and regrant. I'm extremely grateful for the support.

I totally agree that onboarding was and still is quite clunky - it's a bit simpler now, but I'm still working on an onboarding that's actually interactive instead of just a guide. Unfortunately I'm also making big tweaks to the game itself, so I'm not spending too much time refining the tutorial each time, since the game is changing quickly. Would love to have a chat with you some time about virality improvements, especially given Manifold's success - this is a super important topic; at the very least I could run the current ideas by you.

Anton Makiievskyi

over 1 year ago

I wanted to play more of the game, it seemed engaging =) Please sell me a subscription

I can't see the usefulness of neuron naming to AI Safety, though, to be honest. Can't the network generate the explanation itself? Otherwise: how do you score the explanation suggested by the user?

Johnny Lin

over 1 year ago

hi anton - great questions! also lol @ subscription - if only!

i'll preface by saying that i'm by no means an expert in mechanistic interpretability, and I apologize for not including more detailed justification on the grant application or website - if you've been doing this awhile, you probably know more than me, and your question of "why try to understand neurons?" is probably best answered by someone who has an academic background in this.

Re: Usefulness of neuron naming

People aren't currently using GPT2-SMALL as their daily chatbot, but the things we learn from smaller models can ideally be applied to larger models, and the idea is that eventually we'd have new campaigns to help identify neurons in larger models. Purely for example's sake, maybe we're able to identify a neuron (or group of neurons) for violent actions - in that case we might try to update the model to avoid/reduce its impact. Of course this can turn into a potential trolley problem quickly (maybe changing that part affects some other part negatively) - but having this data is likely better than not having it.
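
For illustration only (and not a claim about how Neuronpedia would implement this), "reducing a neuron's impact" can be prototyped by zero-ablating it with a forward hook and comparing the model's outputs. The sketch below assumes TransformerLens as the runner; the layer and neuron index are arbitrary placeholders.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Hypothetical "identified" neuron: MLP neuron 1337 in layer 5.
LAYER, NEURON = 5, 1337

def zero_ablate(mlp_post, hook):
    # mlp_post: [batch, pos, d_mlp]; silence one neuron at every position.
    mlp_post[:, :, NEURON] = 0.0
    return mlp_post

prompt = "Some text whose behavior we want to compare before and after ablation"
baseline_logits = model(prompt)
with model.hooks(fwd_hooks=[(f"blocks.{LAYER}.mlp.hook_post", zero_ablate)]):
    ablated_logits = model(prompt)

# How much did the model's output move?
print((baseline_logits - ablated_logits).abs().max())
```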

Aside from the actual explanations themselves, data around how a user finds a good explanation can also be useful - what activations are users looking at? Which neurons tend to be easily explained, and which don't? Etc.

There is a greater question of the usefulness of looking at individual neurons vs other "units", as highlighted in the 2nd premortem. You're correct that Neuronpedia will eventually likely need to adapt to analyzing units beyond single neurons. This is high priority on the TODO.

Re: Can't the network generate the explanation itself?

Yes, that's exactly how the existing explanations are generated. It basically uses GPT-4 to guess what the neurons in GPT2-SMALL are related to. Please see this paper from OpenAI: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

The issue is that these explanations aren't great, and that's why Neuronpedia solicits human help to solve these neuron puzzles.
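
Roughly, the automated pipeline shows GPT-4 a neuron's top-activating text along with per-token activations and asks for a short explanation. The sketch below is only illustrative of that idea - the prompt format and helper are made up here, and the real prompts live in the OpenAI work linked above.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY in the environment

def explain_neuron(top_examples):
    # top_examples: list of (text, per_token_activations) pairs for the neuron's
    # strongest-activating snippets. The prompt format is purely illustrative.
    lines = []
    for text, acts in top_examples:
        lines.append(" ".join(f"{tok}({a:.1f})" for tok, a in zip(text.split(), acts)))
    prompt = (
        "Each line shows tokens with this neuron's activation in parentheses:\n"
        + "\n".join(lines)
        + "\nIn one short phrase, what does this neuron appear to detect?"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Hypothetical usage:
# explain_neuron([("the cat sat on the mat", [0.1, 7.9, 0.2, 0.0, 0.1, 6.5])])
```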

Re: how do you score the explanation suggested by the user?

The scoring uses a simulator from Automated Interpretability based on the top known activating text and its activations. You can see how it works here: https://github.com/openai/automated-interpretability/tree/main

One of the things the game currently does not do (that I would like to do given more resources) is to re-score all explanations when a new high-activation text is found. This would mean higher-quality (more accurate) scores. Also, larger models (even GPT2-XL) require expensive GPUs to perform activation text testing.
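
For readers unfamiliar with simulation-based scoring: the simulator (an LLM) predicts per-token activations from the explanation alone, and the score reflects how well those predictions track the real activations. Here's a toy sketch of just that final comparison step (the actual implementation is in the repository linked above):

```python
import numpy as np

def score_explanation(simulated_acts, true_acts):
    # Correlation-based score: how well do per-token activations predicted by an
    # LLM "simulator" (from the explanation alone) track the real activations?
    sim = np.asarray(simulated_acts, dtype=float)
    true = np.asarray(true_acts, dtype=float)
    if sim.std() == 0 or true.std() == 0:
        return 0.0
    return float(np.corrcoef(sim, true)[0, 1])

# Toy usage: the simulator predicts firing on the last two tokens, matching reality.
print(score_explanation([0, 0, 8, 9], [0.1, 0.0, 7.5, 9.2]))  # close to 1.0
```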

again, i'm no expert in this - i'm fairly new to AI, but I want to build useful things. let me know if you have further questions and i'll try my best to answer!