Automatic circuit discovery on sparse autoencoded features

Active grant
$25,000 raised
$25,000 funding goal
Fully funded and not currently accepting donations.

Project summary

Recent work has developed methods to automatically discover "circuits" implementing basic functionality in neural networks. These circuits are described in terms of computations involving neurons. However, concurrent work has argued that neurons are often in "superposition", representing multiple features at the same time, and that sparse autoencoders can extract these features from neuron activations. This raises the question: would circuits be more discoverable and easier to understand if represented at the feature level rather than the neuron level? This project intends to explore automatic circuit discovery over features.
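As a rough, hypothetical sketch of the kind of sparse autoencoder referenced above (not this project's actual setup): features are an overcomplete, non-negative encoding of neuron activations, trained to reconstruct those activations under an L1 sparsity penalty. The dimensions and `l1_coeff` below are illustrative.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over neuron activations (illustrative)."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Overcomplete dictionary: typically d_features >> d_model
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # Non-negative feature activations; the L1 penalty below makes them sparse
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(activations, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the features
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity
```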

What are this project's goals and how will they be achieved?

At a high level, the project will seek to determine whether circuits over features are easier to understand and more readily discoverable by automatic means than circuits over neurons (the baseline method). Concretely, preliminary results have found feature-level circuits for subject-verb agreement. Can intends to extend this to a simple ELK-style problem: identifying a concept the model represents from a labeled dataset of positive and negative examples of that concept, while disambiguating it from a highly correlated concept the model also represents, based on understanding how the two concepts are computed.
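For concreteness, here is a hypothetical sketch of how attribution patching (the automatic method from the Attribution Patching Outperforms Automated Circuit Discovery preprint mentioned below) might be applied over SAE features, reusing the SparseAutoencoder sketch above. The function name `feature_attribution` and the helper `metric` (any scalar task measure, e.g. a logit difference) are assumptions, and the real method involves choices (which layers to patch, how to handle SAE reconstruction error) omitted here.

```python
import torch

def feature_attribution(sae, clean_acts, corrupt_acts, metric):
    """Estimate each SAE feature's causal effect on `metric` with a single
    backward pass, instead of patching features one at a time (illustrative)."""
    _, clean_feats = sae(clean_acts)
    _, corrupt_feats = sae(corrupt_acts)

    # Make the corrupted features a leaf tensor so gradients accumulate on them
    corrupt_feats = corrupt_feats.detach().requires_grad_(True)
    clean_feats = clean_feats.detach()

    # Evaluate the scalar task metric on the reconstruction from corrupted features
    value = metric(sae.decoder(corrupt_feats))
    value.backward()

    # First-order estimate of the effect of patching feature i from corrupt to clean:
    # (clean_i - corrupt_i) * d(metric)/d(feature_i)
    scores = (clean_feats - corrupt_feats.detach()) * corrupt_feats.grad
    return scores  # rank features by |score| to nominate circuit components
```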

How will this funding be used?

A research stipend for Can Rager. Can may pay for other expenses, such as office space and compute, out of this stipend.

Who is on the team and what's their track record on similar projects?

Can Rager (Google Scholar) has experience with automatic circuit discovery as a co-author of the Attribution Patching Outperforms Automated Circuit Discovery preprint.

He will collaborate with Sam Marks, a postdoc in the Bau Lab who specializes in interpretability, and with Prof. David Bau.

What are the most likely causes and outcomes if this project fails? (premortem)

The research idea is high-risk, high-reward: circuits over features may prove difficult to find and to understand.

Additionally, the entire field of mechanistic interpretability is taking a high-risk bet that neural networks can be reverse-engineered at scale. Although automatic methods such as the one this project investigates help with scalability, many challenges remain. Even if the project succeeds by its own lights, if future work cannot develop the method further, it may turn out to be a research dead end.

There is an execution risk, as Can is relatively new to research. This is mitigated by the collaboration with Sam and David; however, David may be time-constrained (as a professor with many PhD students), and Sam's PhD was in math, so he may have limited hands-on engineering experience.
