Johnny Lin

@hijohnnylin

trying to build useful things

https://johnnylin.co

$0 total balance

$0 charity balance

$0 cash balance

$0 in pending offers

About Me

accelerating interpretability research @ neuronpedia.org.

Projects

Neuronpedia - Open Interpretability Platform

Comments

Neuronpedia - Open Interpretability Platform

Johnny Lin

12 months ago

Progress update

What progress have you made since your last update?

Neuronpedia has been funded by LTFF. We are now uploading and supporting SAEs from various interp/alignment researchers and orgs.

What are your next steps?

Continue to support new SAEs and add more features to Neuronpedia to accelerate interpretability research!

Neuronpedia - Open Interpretability Platform

Johnny Lin

about 1 year ago

@hijohnnylin Another update from two weeks ago - Neuronpedia collaborated with Joseph Bloom (MATS) to upload his Sparse Autoencoders, which resulted in ~295,000 dashboards. Neuronpedia is also now pivoting fully to being a tool for accelerating the work of interpretability researchers, and is sunsetting its previous gamification goals.

Neuronpedia - Open Interpretability Platform

Johnny Lin

about 1 year ago

Progress update

What progress have you made since your last update?

New post demos some research-focused features in Neuronpedia: https://www.lesswrong.com/posts/QwgYmpnMxBZnmGCsw/exploring-openai-s-latent-directions-for-gpt2-small-tests

~20,000 explanations
~10,000 verifications of explanations ("digs")
2 new game modes
Explorer/searcher/browser for neurons and "directions" (roughly, combinations of neurons)

What are your next steps?

Continue to build out new features, fix bugs, etc.

Is there anything others could help you with?

Would love feature requests from interpretability researchers - how do I make this more useful for you?
Warm intros to folks interested in supporting Neuronpedia

Neuronpedia - Open Interpretability Platform

Johnny Lin

over 1 year ago

Hi Joel!

The simple answer to your question is that the ask for funding would be the existing ask minus $25000. I've made a feature request for Manifund to allow requestors to add grants received outside of the platform and add details about them.

The longer answer is that I am more excited about this project than anything I've ever worked on, and would love to work on it full time for as long as possible. In the short term and possibly even after public release (currently only posted on LessWrong forums as experimental beta), I can likely handle the workload, but I'm starting to think that it could be a good idea to have more than one person work on this.

I'd love to set up a call if you're interested in learning more about where this can go. Someone I spoke to yesterday said "AI is, in a way, the greatest crossword puzzle of all time". I can't think of anything more meaningful than building something that redirects the energy of millions of humans into increasing AI safety/alignment (even if they don't realize they're doing it).

Thanks,
Johnny

Neuronpedia - Open Interpretability Platform

Johnny Lin

over 1 year ago

Hey Austin - Thanks so much for the kind words and regrant. I'm extremely grateful for the support.

I totally agree that onboarding was and still is quite clunky - it's a bit simpler now but I'm still working on an onboarding that's actually interactive instead of just a guide. Unfortunately I'm also making big tweaks to the game itself, so I'm not spending too much time on refining the tutorial each time since the game is changing quickly. Would love to have a chat with you some time about virality improvements esp given Manifold's success - this is a super important topic, at the very least could run the current ideas by you.

Neuronpedia - Open Interpretability Platform

Johnny Lin

over 1 year ago

hi anton - great questions! also lol @ subscription - if only!

i'll preface by saying that i'm by no means an expert in mechanistic interpretability, and I apologize for not including more detailed justification on the grant application or website - if you've been doing this awhile, you probably know more than me, and your question of "why try to understand neurons?" is probably best answered by someone who has an academic background in this.

Re: Usefulness of neuron naming

People aren't currently using GPT2-SMALL as their daily chatbot, but the things we learn from smaller models can ideally be applied to larger models, and the idea is that eventually we'd have new campaigns to help identify neurons in larger models. Purely for example's sake, maybe we're able to identify a neuron (or group of neurons) for violent actions - in that case we might try to update the model to avoid/reduce its impact. Of course this can turn into a potential trolley problem quickly (maybe changing that part affects some other part negatively) - but having this data is likely better than not having it.

Aside from the actual explanations themselves, data around how a user finds a good explanation can also be useful - what activations are users looking at? Which neurons tend to be easily explained and others not? Etc.

There is a greater question of the usefulness of looking at individual neurons, vs other "units", as highlighted in the 2nd premortem. You're correct that neuronpedia will likely eventually need to adapt to analyzing more than single neurons. This is high priority on the TODO.

Re: Can't the network generate the explanation itself?

Yes, that's exactly how the existing explanations are generated: GPT4 is used to guess what the neurons in GPT2-SMALL are related to. Please see this paper from OpenAI: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

The issue is that these explanations aren't great, and that's why Neuronpedia solicits human help to solve these neuron puzzles.

Re: how do you score the explanation suggested by the user?

The scoring uses a simulator from Automated Interpretability based on the top known activating text and its activations. You can see how it works here: https://github.com/openai/automated-interpretability/tree/main
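
As a rough illustration of the idea (not the actual Automated Interpretability code): a simulator predicts per-token activations from the explanation, and the score measures how well those predictions track the real activations on the top activating text. The sketch below assumes the score is a plain Pearson correlation; the function name `correlation_score` and the sample numbers are hypothetical.

```python
import math


def correlation_score(real, simulated):
    """Pearson correlation between real and simulated per-token activations.

    Hypothetical helper for illustration only; assumes the explanation score
    is the correlation between the two activation sequences.
    """
    if len(real) != len(simulated) or len(real) < 2:
        raise ValueError("need two equal-length sequences with at least 2 tokens")
    n = len(real)
    mean_real = sum(real) / n
    mean_sim = sum(simulated) / n
    cov = sum((r - mean_real) * (s - mean_sim) for r, s in zip(real, simulated))
    var_real = sum((r - mean_real) ** 2 for r in real)
    var_sim = sum((s - mean_sim) ** 2 for s in simulated)
    if var_real == 0 or var_sim == 0:
        # A constant sequence carries no signal, so treat it as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_real * var_sim)


# Made-up activations for a five-token snippet of text.
real_activations = [0.0, 0.1, 2.3, 0.0, 1.8]
simulated_activations = [0.0, 0.0, 2.0, 0.2, 1.5]
score = correlation_score(real_activations, simulated_activations)
```

A score near 1 would mean the explanation predicts where the neuron fires; a score near 0 would mean it does not.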

One of the things the game currently does not do (that I would like to do given more resources) is to re-score all explanations when a new high-activation text is found. This would mean higher quality (more accurate) scores. Also, larger models (even GPT2-XL) require expensive GPUs to perform activation text testing.

again, i'm no expert in this - i'm fairly new to AI, but I want to build useful things. let me know if you have further questions and i'll try my best to answer!

Transactions

For	Date	Type	Amount
Manifund Bank	over 1 year ago	withdraw	2499
Manifund Bank	over 1 year ago	withdraw	1
Neuronpedia - Open Interpretability Platform	over 1 year ago	project donation	+2500