Cadenza Labs: AI Safety research group working on own interpretability agenda
Project summary
The goal of our group is to do research which contributes to solving AI alignment. Broadly, we aim to work on whatever technical alignment projects have the highest expected value. Our current best ideas for research directions to pursue are in interpretability. Our research agenda has three pillars:
Understanding the natural units in terms of which a neural net is performing its computation.
Developing unsupervised methods for finding features in activation space that correspond to important concepts (e.g., truth, as in Discovering Latent Knowledge (DLK))
Understanding the fundamentals of LM cognition in more general terms.
The minimal funding goal would allow us to complete some current drafts and tide us over until we secure more permanent funding. The mainline funding goal would support our existing team for 6 months and allow us to either hire another person or better support our current interns. We tentatively plan to be co-located at Fixed Point, Prague, until December 2023, followed by the London Initiative for Safe AI (LISA) office.
What are this project's goals and how will you achieve them?
We aim to do AI safety research, specifically in interpretability, since we consider it a tractable and high-impact area of study. The area of interpretability we find most promising to work on is the development of automatic techniques for probing and modifying human-interpretable concepts in neural networks.
We created a theoretical framework and proposed several experiments to evaluate it, furthering our understanding of neural network (NN) internals. Starting from the original Contrast Consistent Search (CCS) code, and together with collaborators from EleutherAI, we reimplemented the method as a library. We aimed to make the codebase cleaner, more modular and extensible, production-ready (multi-GPU support), and generally easier to use (e.g., Hugging Face integration).
The library is open-source and available to other researchers. In the long run, if our research leads to promising results, the library could become the basis for a tool to evaluate a model's internal beliefs: for example, when a model declares a statement "true" at its output, checking whether the model's internal representation of that statement actually corresponds to a true statement, i.e., that the model is not being deceptive.
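To make the underlying method concrete, here is a minimal sketch of the CCS objective from the original Discovering Latent Knowledge work that our library builds on. This is an illustrative reimplementation, not the library's actual API; all function and variable names are our own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss(theta, b, x_pos, x_neg):
    """CCS objective for a linear probe over a batch of contrast pairs.

    x_pos, x_neg: activations for the "statement is true" / "statement is
    false" variants of each prompt, each of shape (batch, hidden_dim).
    """
    p_pos = sigmoid(x_pos @ theta + b)  # probe's probability on the positive variant
    p_neg = sigmoid(x_neg @ theta + b)  # probe's probability on the negated variant
    consistency = (p_pos + p_neg - 1.0) ** 2    # P(true) + P(false) should sum to 1
    confidence = np.minimum(p_pos, p_neg) ** 2  # discourage the degenerate p = 0.5 answer
    return float(np.mean(consistency + confidence))
```

The probe parameters are then trained, without labels, to minimize this loss over contrast pairs; the learned direction is a candidate "truth" feature.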
The aforementioned research area is our main focus right now and corresponds to pillar 2 of our research agenda. However, we are also working on the following two areas:
Pillar 1: To successfully find high-level features, it’s useful to understand how to read off features from activation vectors / how activation vectors are made up of features. This motivates the second main focus of our research in interpretability: working toward a better understanding of features in models. (For more on this, see section 1 of our research agenda)
Pillar 3: When thinking about features in models, it helps to keep track of whether the questions we are asking even make sense in some reasonable high-level picture of how (language) models work. This, as well as the questions involved being independently interesting, has led us to think more broadly about language model cognition. (For more on this, see section 3 of our research agenda)
Cadenza Labs consists of four SERI-MATS alumni. Our previous grant from LTFF ran until the end of September 2023. We're seeking funding to continue our work and expand our team. Depending on the amount we receive, this funding would support us for either three or six months.
In terms of specific research output, our plans are the following:
A foundational question in this area of research is whether we are measuring what we think we are measuring. We have been investigating this since April 2023. The first step is to ensure that the measuring method (in our case, CCS) is robust and consistent, so our primary focus for the next months is developing technical improvements to CCS; we estimate completing this around mid-December. These improvements focus on detecting what a model believes and relate to section 2.1 of our agenda. We have some idea of what could improve the method, and why, as we have already run the following experiments:
Experiments on prompt invariance:
Testing whether prompt invariance improves performance over vanilla CCS, especially on autoregressive models
Experiments on ensembling methods (layer ensembling and probe-per-prompt (PPP) ensembling)
We have also started experiments to understand how pseudo-label directions differing across templates lowers the accuracy of the CCS probe when we do not normalize template-wise.
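As an illustration of what template-wise normalization means here, the sketch below standardizes activations within each prompt template so that between-template differences (such as differing pseudo-label directions) do not dominate the true/false contrast. The function name and grouping scheme are our own, for illustration only.

```python
import numpy as np

def normalize_templatewise(acts, template_ids):
    """Subtract each template's mean and divide by its std, so that
    systematic differences between prompt templates do not swamp the
    contrast between true and false statements.

    acts: (n_examples, hidden_dim) activations.
    template_ids: (n_examples,) integer id of each example's template.
    """
    out = np.empty_like(acts, dtype=float)
    for t in np.unique(template_ids):
        mask = template_ids == t
        mu = acts[mask].mean(axis=0)
        sigma = acts[mask].std(axis=0) + 1e-8  # avoid division by zero
        out[mask] = (acts[mask] - mu) / sigma
    return out
```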
Completing the above experiments would provide sufficient results for a paper. We aim to have it ready as a preprint on arXiv by the end of the year and then publish it at a conference in 2024 (e.g., submit it to NeurIPS).
We also have a side project related to RL interpretability with a group of students in the SPAR program and an RL engineer. The focus of this project is to see if a method like CCS works on different models and with different loss functions, and is able to find things like search or goals in agents. Our preliminary experiments will aim to find the value of a state in a policy network. The theoretical framework is described in detail in our post, and the RL interpretability experiment is in the section “An agent's estimate of the value of a state, from inside the policy network”.
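The preliminary value-of-a-state experiment can be sketched as a supervised baseline: ridge-regressing value estimates from policy-network activations, with the open question being whether an unsupervised CCS-like objective recovers a similar direction. The names, the closed-form ridge solution, and the use of Monte Carlo returns as targets are our illustrative assumptions, not the project's actual code.

```python
import numpy as np

def fit_value_probe(activations, values, l2=1e-3):
    """Fit a linear probe mapping policy-network activations to state values.

    activations: (n_states, hidden_dim) hidden activations of the policy net.
    values: (n_states,) value targets (e.g. Monte Carlo returns).
    Closed-form ridge regression with a bias term.
    """
    X = np.hstack([activations, np.ones((len(activations), 1))])  # bias column
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ values)
    return w

def probe_value(w, activation):
    """Predicted value of a single state from its activation vector."""
    return float(np.append(activation, 1.0) @ w)
```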
In parallel, we aim to complete our existing draft posts/papers:
A post that aims to build/survey some mathematical theory to understand features in neural nets better, connecting recent work by Anthropic and Redwood to the mathematical literature on frames.
This relates to section 1 of our agenda: understanding the natural units in terms of which a neural net is performing its computation.
A literature survey on techniques for reconstructing features from activation data.
After completing the projects outlined above, we'll move on to the other parts of our research agenda, prioritizing research directions based on the results we obtain.
In short, we plan to:
Continue developing and supporting the CCS library
Run experiments on different normalization methods and, by the end of the year, have a preprint published on arXiv
Mentor a few students and interns who will work on projects developed from our research agenda (e.g. RL Interpretability project)
Write posts on LessWrong (e.g., finalize our draft post on "Tight Frames, Efficient Packings" and a literature survey on techniques for reconstructing features from activation data)
Publish at least one paper at an academic conference (e.g. NeurIPS).
Continue our progress in answering the question “Are we measuring what we think we’re measuring?”
Our experience so far shows that engineering and experiments take longer than planned. We'd consider the project a success if we achieve the specific research outputs outlined above.
How will this funding be used?
The “Minimum amount” will provide salaries for 4 people for 3 months.
The “Funding goal” will provide salaries, office space, and compute, for 5 people for 6 months.
In both cases we’ve also added a buffer for the fiscal sponsorship.
Cost breakdown:
Minimum amount: 100% to salaries for 3 months (taxes, benefits, etc., would be paid by us)
Funding goal:
Salary: 89.94%
Office Space: 9.47%
Compute: 0.59%
Who is on your team and what's your track record on similar projects?
You can find relevant information about our team and our work on our website:
Team Profiles: https://cadenzalabs.org/about/
Research Agenda: https://cadenzalabs.org/research/
Georgios, Kaarel, and Walter started their collaboration during Prague Fall Season 2023, and then participated in SERI MATS 3.0 with this project. Jonathan joined the team at the start of SERI MATS 3.1.
The following are the main things we accomplished in the last six months (Apr–Sep 2023), funded by our previous grant. For the first half of the grant period we were also attending SERI MATS 3.1 in London.
For the first two months of the grant, our focus was a paper written in collaboration with researchers at EleutherAI.
The draft paper introduced an alternative method to CCS for eliciting knowledge directly from the activations of language models in a purely unsupervised way. The results will hopefully be published soon.
In parallel, part of the team worked on understanding features in NNs. We currently have two draft papers in the pipeline:
A literature survey on techniques for linearly decomposing activation vectors into features
This also included giving a talk on the topic at the SERI MATS office and posting a shorter version of the talk online
A post that aims to build/survey some mathematical theory to better understand features in neural nets, connecting recent work by Anthropic and Redwood to the mathematical literature on frames.
Access to draft available by request.
Over the past months we have also been working on the elk library with others (https://github.com/Cadenza-Labs/elk):
Improved accuracy: implemented layer ensembling, a feature to train a probe per prompt template for ensembling, and Platt scaling for CCS as a way of resolving sign ambiguity; added plotting/visualization features; fixed bugs; and translated and added prompt templates in multiple languages (see the commit history for more: https://github.com/EleutherAI/elk/commits/main)
Since the focus of the original elk library has moved toward methods with supervised probes, we created a fork in September 2023 to continue working on methods using unsupervised probes, which we believe have more potential for solving the scalable oversight problem.
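The Platt-scaling step mentioned above can be sketched as follows (our illustration, not the library's actual implementation): fitting a logistic map sigmoid(a·s + b) to a small labeled set calibrates the probe's scores s, and a negative fitted slope a reveals that the unsupervised probe's sign was flipped.

```python
import numpy as np

def platt_scale(scores, labels, lr=0.1, steps=2000):
    """Fit sigmoid(a*s + b) to binary labels by gradient descent on the
    logistic loss. A negative fitted `a` means the unsupervised probe
    assigned high scores to the *false* class, which resolves the sign
    ambiguity inherent to unsupervised probing methods like CCS.
    """
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        grad = p - labels  # d(logistic loss)/d(logit), averaged below
        a -= lr * np.mean(grad * scores)
        b -= lr * np.mean(grad)
    return a, b
```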
Most recently, our research scientist (Kaarel) has also been in a discussion with Dmitry Vaintrob on grokking, memorization, and generalization.
As described in the previous section, we have already run the following experiments:
Experiments on prompt invariance:
Testing whether prompt invariance improves performance over vanilla CCS, especially on autoregressive models
Experiments on ensembling methods (layer ensembling and probe-per-prompt (PPP) ensembling)
We also had the following side projects:
Created a YouTube channel to disseminate parts of our research and clarify specific concepts. Note that we haven't yet updated the name in some videos; they still use our old name (NotodAI Research).
Our research lead (Kaarel) organized an intermediate/advanced linear algebra retreat in Estonia, with mostly collaborators/friends in the AI safety sphere in attendance.
Our research lead was invited to and attended a few researcher retreats:
ACS retreat in Prague (April)
Hierarchical agency retreat (organized by ACS) in Prague (June)
The Singular Learning Theory & Alignment Summit, in Berkeley (June)
Our team member Kay mentored at the ARENA program, which included a capstone project related to our research.
As last year, we are again participating in the Supervised Program for Alignment Research (SPAR) seminar organized by the Berkeley AI Student Initiative.
Our team member Georgios presented our research and an introduction to AI safety at the Heksagon Symposium on AI and at the Data Science Summer School 2023 in Goettingen, Germany, in September 2023.
What are the most likely causes and outcomes if this project fails? (premortem)
It could be that interpretability is just doomed in general: there might be conceptual reasons why we ought to expect circuits to just be too big and messy even in their simplest faithful representation, or why neural net representations will always just be too alien to make sense of, or why essentially any method of activation ablation/editing would lobotomize any coherent cognition, or why we should expect interpretability progress to be too slow to contribute meaningfully to lowering p(doom). Though if that’s the case, we’d hopefully make some progress on understanding if and why that’s the case while working on our research, which might be helpful for understanding crucial aspects of the alignment problem better.
It could be that the more concrete interpretability bets that in part motivate and underlie our research turn out to be wrong: for instance, it could turn out that thinking of features as directions in activation space is fundamentally misguided, or that decomposing activations into features is not part of the right frame of analysis, or that there is no reason for language models to connect their ontology up to human-language inputs so as to even have a chance to compute their actual belief about what a human-language sentence intends to point to. Though again, if that’s the case, we’d plausibly still make some progress insofar as we rule out certain a priori reasonable hypotheses and plausibly understand other useful things about interpretability/alignment/cognition along the way.
Lack of concrete feedback mechanisms: We haven't received substantial feedback on our research agenda yet.
Other "common" failure modes of research projects (implementing the experiments takes much longer than expected, unexpected team-coherence issues, etc.). Although we've been generating a considerable amount of work internally, aside from our initial two blog posts we've mostly only published drafts so far.
What other funding are you or your project getting?
We were funded by LTFF until the end of September 2023 and submitted a new application at the end of October. We don’t know when we’ll hear from them.
We're looking into alternatives to EA funders (e.g., EU Horizon); however, even if successful, these would probably take around a year to be processed and for the funds to reach us.