
Mapping neuroscience and mechanistic interpretability

Active grant
$5,950 raised
$9,600 funding goal

Project summary

A survey paper targeting the interpretability / AI safety / ML community, presenting tools, techniques, and insights from neuroscience and cognitive science for understanding neural network internals.

Abstract:
Deep learning systems increasingly permeate applications impacting society, catalyzing calls to understand the processes underlying their outputs. Yet, as these systems are scaled up to many billions of parameters, interpretability becomes all the more challenging. Is it impossible to interpret such a complex system? Neuroscientists and cognitive scientists have accumulated decades of experience analyzing a particularly complex system: the brain and its computations. While biological neural systems are not equivalent to artificial ones, we argue that there is immense potential for the deep learning community to look to the practices of neuroscience and cognitive science in efforts to interpret large-scale AI systems, and conversely, for researchers studying biological neural systems to sharpen their tools of analysis by drawing on developments in deep learning. In this work, we lay out four grand challenges for interpretability: benchmarks, superposition, scalability of analysis techniques, and inherent complexity. We review existing work in neuroscience, cognitive science, and machine learning interpretability, and suggest that mixed selectivity, population coding, robustness, and modularity are four promising fields that may inspire methods to tackle these interpretability challenges, laying the groundwork for more trustworthy deployment of deep learning systems and taking steps to better understand how complex neural systems, both biological and artificial, work in the first place.
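
To make the superposition challenge named in the abstract concrete, here is a minimal, self-contained numpy sketch (illustrative only, not code from the paper): a layer with d dimensions stores n > d sparse features along random, nearly orthogonal directions, and a simple linear readout still recovers which features were active.

```python
# Toy illustration of superposition: more features than neurons.
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 256                       # 64 "neurons", 256 features
W = rng.normal(size=(n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # unit feature directions

# Sparse feature vector: only a few features are active at once.
x = np.zeros(n)
active = rng.choice(n, size=5, replace=False)
x[active] = 1.0

h = x @ W          # compress n features into a d-dimensional activation
x_hat = h @ W.T    # linear readout score for each feature

# Active features score near 1, inactive ones near 0, because random
# unit vectors in high dimensions have small pairwise dot products.
print("mean score, active:  ", x_hat[active].mean())
print("mean score, inactive:", np.delete(x_hat, active).mean())
```

The recovery works only while feature activations stay sparse; the interference between non-orthogonal directions is exactly what interpretability methods must untangle at scale.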

What are this project's goals and how will you achieve them?

1st stage: A preprint on arXiv around Christmas

2nd stage: A publication in Transactions on Machine Learning Research

3rd stage: A Nature publication (not yet certain)

We've already gathered a team to work on it and we've set up the framing for the project.

The next step is mainly writing. In particular, this interdisciplinary project will involve many discussions between the neuroscience and machine learning communities.

How will this funding be used?

  • 4-8 weeks of salary at $600 per week.

Who is on your team and what's your track record on similar projects?

Interpretability:

  • Stephen Casper

  • Wes Gurnee

ML:

  • Katie Collins

  • Adrian Weller

  • George Ogden

Neuroscience and cognitive science:

  • Jascha Achterberg (anchor author)

  • Yinzhu Yang 

  • Dan Akarca 

  • Linda Tang

  • Rebeca Ianov 

  • Kevin Nejad 

  • Grace Lindsay (anchor author)

  • Anna Ivanova (co-first-author)

  • Ilia Sucholutsky 

  • Aruna Sankaranarayanan

  • Kai Sandbrink

What are the most likely causes and outcomes if this project fails? (premortem)

  • The mapping between the two fields may prove more difficult than expected.

    • The outcome would be lower confidence in the final write-up, and we may not publish the Nature paper in the end.

    • The countermeasure would be, at a minimum, to bring top experts from both fields into conversation.

  • As a first author of this paper, I don't have in-depth expertise in either interpretability or neuroscience.

    • The mapping could end up superficial.

    • The countermeasure is to learn as much as possible through conversations with experts, and potentially to conduct experiments as early as possible.

  • It's possible that we will find that the mapping between the two fields holds at the conceptual/algorithmic level rather than at the implementation level (or even the representational level), and that future researchers cannot simply borrow a technique from neuroscience and hope it magically works for neural networks.

    • We would not necessarily regard this as a failure; it's more a matter of setting realistic expectations.

    • At the very least, we can point out where we think neuroscience can be misleading for the science of deep learning, as part of the mapping.

    • "Some inspiration from neuroscience" is still very helpful, even if it is not direct.

What other funding are you or your project getting?

None for now.


Zhonghao He

3 months ago

Progress update

What progress have you made since your last update?

What are your next steps?

  • I'll work on empirical projects informed by the findings in this paper (send me an email if you want to collaborate!)

  • I'll give talks to interp & neuroscience groups on this topic (let me know if you want me to give a talk in your lab!)

Is there anything others could help you with?

  • I'll need to raise new funding for further empirical work.

  • I'll need funding for academic trips.


donated $500

Noa Nabeshima

8 months ago

I'm worried that this is a difficult task for 4-8 weeks without in-depth experience in either field, although already having people from both fields who intend to help eases this concern. Although I don't have much neuroscience experience, I believe there could be rich transfer from neuroscience to interpretability that hasn't happened yet, and it seems surprisingly underexplored (although Trenton Bricken is/was a neuroscientist). It seems like a small financial ask for the expected work and a good project to fund overall.

donated $2,400

Renan Araujo

12 months ago

Please see Joel's comment below for more context (our outreach to recommended researchers for an 'Aurora Scholarship' program).

I decided to donate $2.4k because that was my original funding target for this project, and it looks like it can productively accommodate more funding than the original value, considering Zhonghao's and Neel's comments. I'm not donating more (I'd potentially go up to $3-4k based on my intuitive reading of the comments below) because I expect to support other participants of the 'Aurora Scholarship'.

donated $1,200

Austin Chen

12 months ago

Approving this project now that it's hit its minimum funding bar. I wasn't aware that Renan and Joel had previously solicited Zhonghao (or set up the Aurora Scholarship, for that matter); both are awesome to hear.

donated $1,200

Austin Chen

12 months ago

(I've also doubled the max funding goal from $4.8k to $9.6k, per Joel's request)

donated $1,750

Joel Becker

12 months ago

I've made a $1.75k offer to this project. @RenanAraujo and I asked Zhonghao to put this project up, having agreed to fund him for our "Aurora Scholarship" program. (This was supposed to be $4.8k shared between us, but @Austin pipped us to the first $1.2k! I would advocate raising the maximum grant amount if possible @Austin; Zhonghao noted that he would be working on this project for more hours than we had pre-committed to fund, and @NeelNanda's comment strengthens the case for potentially funding the remaining hours.)

Main points in favor of this grant

Renan and I put out a call to an invite-only scholarship program, the "Aurora Scholarship," to 9 individuals recommended by a source we trust. We were aiming to support people who are nationals of or have lived in China with a $2,400-$4,800 scholarship for a research project in a topic related to technical AI safety or AI governance. The project should last for approximately 4-8 weeks (i.e. we aim to offer $600/week at 20h/week).

Our hope is that scholars might use the experience and signaling value of these projects to counterfactually advance to the next stage in their chosen career pipeline (e.g., PhD acceptance, think-tank placement), and that this program will strengthen the Chinese AI safety community. The program is loosely inspired by this CAIS program (though note that we are not affiliated with CAIS, and this does not mean CAIS endorses our program), especially in that each scholar is required to find their own supervisor in order to join.

Zhonghao was one of our excellent applicants.

Donor's main reservations

I think the project itself is very low downside.

Targeting the Aurora Scholarship invitees in the way we have carries greater possible downside. We think this risk is fairly low, and we have taken steps to lower it further (e.g., by not including junior researchers currently located in China).

Process for deciding amount

As above.

Conflicts of interest

None.


Neel Nanda

12 months ago

This seems pretty worth funding to me - it's a cheap grant, and I think this would be a cool paper to exist! I don't have a background in neuroscience or cognitive science, and I expect there are some techniques there worth my knowing about that would be useful for my work, but that much of it is irrelevant. I'd love for a paper surveying and summarising the most relevant ideas to exist! I've mentored Wes Gurnee and I trust his judgement/ability to represent the mech interp side, and expect Stephen Casper to also give good takes here. I don't know the rest of the organisers, but Wes vouches for their overall competence. I'd fund this myself if I had a regranting budget.

(I think a Nature publication is very ambitious, and would advise against bothering, but think an arXiv preprint is more than sufficient to make this worthwhile)


Zhonghao He

12 months ago

Thanks, Neel!

Both Wes & Cas have been very helpful.

We will mostly focus on an arXiv preprint and defer the decision on Nature to a later stage.

donated $1,200

Austin Chen

12 months ago

@NeelNanda Thanks for weighing in; agreed that the asking amount is very low. I've funded to half of the min funding bar based on your endorsement.