
Mapping neuroscience and mechanistic interpretability

ActiveGrant
$5,950 raised
$9,600 funding goal

Project summary

A survey paper targeting the interpretability / AI safety / ML community presenting neuroscience & cognitive tools, techniques, and insights for understanding neural network internals.

Abstract:
Deep learning systems increasingly permeate applications impacting society, catalyzing calls to understand the processes underlying their outputs. Yet, as these systems are scaled up to many billions of parameters, interpretability becomes all the more challenging. Is it impossible to interpret such a complex system? Neuroscientists and cognitive scientists have accumulated decades of experience analyzing a particularly complex system: the brain and its computations. While biological neural systems are not equivalent to artificial ones, we argue that there is immense potential for the deep learning community to look to the practices of neuroscience and cognitive science in efforts to interpret large-scale AI systems; and conversely, for researchers studying biological neural systems to sharpen their tools of analysis by drawing on developments in deep learning. In this work, we lay out four grand challenges for interpretability – benchmarks, superposition, scalability of analysis techniques, and inherent complexity. We review existing work in neuroscience, cognitive science, and machine learning interpretability, and suggest that mixed selectivity, population coding, robustness, and modularity are four promising fields that may inspire methods to tackle these interpretability challenges – laying the groundwork for more trustworthy deployment of deep learning systems and taking steps to better understand how complex neural systems, both biological and artificial, work in the first place.
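
As a concrete, purely illustrative example of the kind of transfer the abstract describes, the sketch below applies a population-level analysis familiar from systems neuroscience (dimensionality reduction over a whole population of units) to simulated hidden-layer activations. The data and setup are hypothetical stand-ins, not taken from the paper, and the paper itself proposes the mapping at a conceptual level rather than prescribing this exact procedure.

```python
# Illustrative sketch only: a neuroscience-style "population coding" analysis
# applied to toy artificial-network activations. All data here are synthetic.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for hidden-layer activations: 1,000 "stimuli" (inputs)
# by 512 "neurons" (hidden units), generated from a low-dimensional latent
# structure so that each unit mixes several underlying factors.
latents = rng.normal(size=(1000, 3))      # 3 underlying factors
mixing = rng.normal(size=(3, 512))        # each unit mixes the factors
activations = latents @ mixing + 0.1 * rng.normal(size=(1000, 512))

# Population-level analysis (as in systems neuroscience): instead of asking
# what each unit codes for, examine the geometry of the whole population.
pca = PCA(n_components=10)
projected = pca.fit_transform(activations)

print("Variance explained by top components:",
      np.round(pca.explained_variance_ratio_[:5], 3))
# A few components explaining most of the variance suggests a low-dimensional
# population code, even though individual units show mixed selectivity.
```

The same kind of population-geometry question can, in principle, be asked of real transformer activations; this toy version is only meant to make the "borrow a neuroscience analysis tool" idea tangible.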

What are this project's goals and how will you achieve them?

1st stage: A preprint on arXiv around Christmas

2nd stage: A publication in Transactions on Machine Learning Research (TMLR)

3rd stage: A Nature publication (not certain at this point)

We've already gathered a team to work on it and we've set up the framing for the project.

The next step is mainly writing. In particular, this interdisciplinary project will involve many discussions between the neuroscience and machine learning communities.

How will this funding be used?

  • 4 to 8 weeks of salary at $600 per week.

Who is on your team and what's your track record on similar projects?

Interpretability:

  • Stephen Casper

  • Wes Gurnee

ML:

  • Katie Collins

  • Adrian Weller

  • George Ogden

Neuroscience and cognitive science:

  • Jascha Achterberg (anchor author)

  • Yinzhu Yang 

  • Dan Akarca 

  • Linda Tang

  • Rebeca Ianov 

  • Kevin Nejad 

  • Grace Lindsay (anchor author)

  • Anna Ivanova  (co-first-author) 

  • Ilia Sucholutsky 

  • Aruna Sankaranarayanan

  • Kai Sandbrink

What are the most likely causes and outcomes if this project fails? (premortem)

  • The mapping between the two fields may be more difficult than expected.

    • The outcome would be that confidence in the final write-up is lower than expected, and we may not publish the Nature paper in the end.

    • The countermeasure is, at minimum, to bring top experts from both fields into the conversation.

  • As one of the first authors of this paper, I don't have in-depth expertise in either interpretability or neuroscience.

    • The resulting mapping could end up superficial.

    • The countermeasure is to learn as much as possible through conversations with experts and, potentially, to conduct experiments as early as possible.

  • It's possible that the mapping between the two fields holds at the conceptual/algorithmic level rather than at the implementation level (or even the representational level), so future researchers may not be able to directly borrow a technique from neuroscience and expect it to work out of the box for neural networks.

    • We would not necessarily regard this as a failure; it is more a matter of setting realistic expectations.

    • At minimum, we can also point out where we think neuroscience may be misleading for the science of deep learning, as part of the mapping.

    • "Some inspiration from neuroscience" is also very helpful, although it may not be direct.

What other funding are you or your project getting?

None for now.
