The simple answer to your question is that the funding ask would be the existing amount minus $25,000. I've made a feature request for Manifund to allow requesters to add grants received outside of the platform, with details about them.
The longer answer is that I am more excited about this project than anything I've ever worked on, and would love to work on it full time for as long as possible. In the short term, and possibly even after public release (it's currently only posted on the LessWrong forum as an experimental beta), I can likely handle the workload, but I'm starting to think it could be a good idea to have more than one person working on this.
I'd love to set up a call if you're interested in learning more about where this can go. Someone I spoke to yesterday said "AI is, in a way, the greatest crossword puzzle of all time". I can't think of anything more meaningful than building something that redirects the energy of millions of humans into increasing AI safety/alignment (even if they don't realize they're doing it).
Hey Austin - Thanks so much for the kind words and regrant. I'm extremely grateful for the support.
I totally agree that onboarding was, and still is, quite clunky. It's a bit simpler now, but I'm still working on an onboarding that's actually interactive instead of just a guide. Since I'm also making big tweaks to the game itself, I'm not spending too much time refining the tutorial while the game is changing quickly. Would love to chat with you some time about virality improvements, especially given Manifold's success - it's a super important topic, and at the very least I could run the current ideas by you.
hi anton - great questions! also lol @ subscription - if only!
i'll preface by saying that i'm by no means an expert in mechanistic interpretability, and I apologize for not including more detailed justification in the grant application or on the website. If you've been doing this a while, you probably know more than me, and your question of "why try to understand neurons?" is probably best answered by someone with an academic background in this.
Re: Usefulness of neuron naming
People aren't currently using GPT2-SMALL as their daily chatbot, but the things we learn from smaller models can ideally be applied to larger models, and the idea is that eventually we'd have new campaigns to help identify neurons in larger models. Purely for example's sake, maybe we're able to identify a neuron (or group of neurons) for violent actions; in that case we might try to update the model to avoid or reduce its impact. Of course this can quickly turn into a potential trolley problem (changing that part might negatively affect some other part), but having this data is likely better than not having it.
Aside from the actual explanations themselves, data around how a user finds a good explanation can also be useful - what activations are users looking at? Which neurons tend to be easily explained and others not? Etc.
There is a greater question of the usefulness of looking at individual neurons vs. other "units", as highlighted in the 2nd premortem. You're correct that Neuronpedia will likely eventually need to adapt to analyzing units other than single neurons. This is high priority on the TODO.
Re: Can't the network generate the explanation itself?
Yes, that's exactly how the existing explanations are generated. It basically uses GPT-4 to guess what the neurons in GPT2-SMALL are related to. Please see this paper from OpenAI: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
The issue is that these explanations aren't great, and that's why Neuronpedia solicits human help to solve these neuron puzzles.
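To make this concrete, here's a minimal, hypothetical sketch of the first half of that pipeline: gather a neuron's top-activating tokens and build a prompt asking a larger model to guess the concept. The function name, prompt wording, and data are all illustrative, not Neuronpedia's or OpenAI's actual code; the real system sends a prompt like this to GPT-4 and takes the reply as the candidate explanation.

```python
# Illustrative sketch (hypothetical names): build an explanation-request prompt
# from a neuron's top-activating (token, activation) pairs.

def build_explanation_prompt(neuron_id, activation_records):
    """activation_records: list of (token, activation) pairs from top snippets."""
    lines = [
        f"Neuron {neuron_id} fires on the tokens below (higher = stronger).",
        "Suggest a short explanation of what this neuron detects.",
        "",
    ]
    # List tokens strongest-first so the pattern is easy for the model to spot.
    for token, act in sorted(activation_records, key=lambda r: -r[1]):
        lines.append(f"  {token!r}\tactivation={act:.2f}")
    return "\n".join(lines)

# Toy data: a neuron that seems to fire on animal words.
records = [("cat", 8.1), ("dog", 7.4), ("the", 0.3), ("hamster", 6.9)]
prompt = build_explanation_prompt(123, records)
print(prompt)
```

The reply from the explainer model (say, "pet animals") would then become the explanation that gets scored against real activations.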
Re: how do you score the explanation suggested by the user?
The scoring uses a simulator from OpenAI's Automated Interpretability work, based on the top known activating text and its activations. You can see how it works here: https://github.com/openai/automated-interpretability/tree/main
One of the things the game currently does not do (that I would like to do given more resources) is re-score all explanations when new high-activation text is found. This would mean higher-quality (more accurate) scores. Also, larger models (even GPT2-XL) require expensive GPUs to perform activation text testing.
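For anyone curious what "scoring" means mechanically: in the automated-interpretability approach, a simulator model reads the candidate explanation and predicts how strongly the neuron should fire on each token, and the score measures how well those simulated activations track the real ones (correlation-based). The sketch below stands in the simulator's output with hard-coded numbers; only the correlation step is real computation.

```python
# Sketch of correlation-style scoring: compare real neuron activations against
# activations a simulator predicted from the explanation. The simulated values
# here are a hand-written stand-in; the real simulator is an LLM.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

def score_explanation(true_activations, simulated_activations):
    """Score in [-1, 1]; near 1 means the explanation predicts firing well."""
    return pearson(true_activations, simulated_activations)

# Real activations on five tokens, vs. what a simulator might predict from the
# explanation "fires on animal words":
true_acts = [8.1, 0.2, 7.4, 0.1, 6.9]
simulated = [7.0, 0.5, 6.5, 0.0, 7.2]
score = score_explanation(true_acts, simulated)
```

This also shows why re-scoring on newly found high-activation text matters: adding more (token, activation) pairs changes the correlation, so a score computed from a small early sample can be misleading.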
again, i'm no expert in this - i'm fairly new to AI, but I want to build useful things. let me know if you have further questions and i'll try my best to answer!