
Next Steps in Developmental Interpretability

Active Grant
$80,460 raised
$670,000 funding goal

Project Summary

This project builds upon our successful Manifund 2023-supported research that validated Developmental Interpretability (DevInterp) as an approach to understanding the internal structure of neural networks.

We (Timaeus) are seeking funding to continue this research and extend our runway from 6 months to ~1 year. In particular, this funding would go to a series of projects (described below) that apply our existing research to address immediate, real-world safety concerns. This has the ultimate goal of leading to novel understanding-based evals.

Developmental Interpretability & Singular Learning Theory

We initially set out to establish the viability of Developmental Interpretability, an application of Singular Learning Theory (SLT) to interpreting the formation of structure in neural networks. This succeeded. Together with our collaborators,

  • We demonstrated that SLT can inform the development of new measurements, such as local learning coefficient (LLC) estimation. This has now been successfully applied to models with hundreds of millions of parameters [Lau et al. 2023; Furman & Lau 2024].

  • We confirmed that phase transitions are real and our tools can detect them, in simple models like the toy model of superposition and deep linear networks [Chen et al. 2023; Furman & Lau 2024].

  • We showed that hidden stage-wise development is common in larger, more realistic models. While larger models do not undergo the same kind of “sharp” transitions as smaller models, the same theory applies to developmental stages. We showed that these stages can be revealed using the LLC [Hoogland et al. 2024]. In some stages the loss changes only slightly while the LLC decreases substantially, reflecting qualitative changes in model computation.

The upshot is that the LLC is a scalable and theoretically well-justified measure of model complexity, which can be used to study the development of structure in neural networks. We are now confident that DevInterp is a viable research agenda and expect that LLC estimation will become widely used in AI safety.
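
For readers less familiar with SLT, the role of the LLC can be compressed into the asymptotic free energy formula below, which we refer to repeatedly in this proposal. This is a sketch in our own notation, eliding the precise conditions; see Watanabe's work and Lau et al. 2023 for exact statements.

```latex
% Bayesian free energy of a neighborhood W of a trained parameter w*,
% given n samples, empirical loss L_n, and prior \varphi:
F_n \;=\; -\log \int_W e^{-n L_n(w)}\,\varphi(w)\,dw
    \;=\; n\,L_n(w^*) \;+\; \lambda(w^*)\,\log n \;+\; O(\log\log n)
```

Here λ(w*) is the local learning coefficient. In regular models λ equals half the number of parameters; in singular models, including neural networks, it can be far smaller, and at a given loss a smaller λ corresponds to a solution preferred by the Bayesian posterior (roughly, a "simpler" one). This is the quantity our estimators target.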

Near-Term Applications

We will continue to work towards understanding the fundamental units of computation in neural networks and how they develop (e.g. “circuits”), but we are also excited to take the understanding and tools that we have developed so far and apply them immediately to practical problems in AI safety. It is these practical applications that we are requesting funding for.

We believe the LLC is now ready to be applied to models with billions of parameters, to radically improve our ability to reason qualitatively and quantitatively about questions like:

  • How deep is safety training, really?

  • How does safety fine-tuning / RLHF change model computation?

  • How does additional fine-tuning affect safety training?

The answers to these questions hinge on generalization: how do pre-trained models generalize from the (relatively) small number of examples in the safety fine-tuning / RLHF training data? When more, or different, examples are provided, how does that generalization change?

These are questions that SLT and DevInterp are well-positioned to answer: first, because SLT is a theory of generalization and the LLC is how we quantify the effect of loss landscape geometry on generalization; and second, because studying how models change over training (DevInterp) is closely related to studying how they change over fine-tuning.

We are excited about working on these questions for three reasons:

  • They play to the comparative advantages of our research agenda.

  • They prepare us for studying sharp left turns which we view as a central problem in AI alignment.

  • They are on the critical path to building understanding-based evals, a core goal for Timaeus.

Project(s) Description

The near-term applications of our research to AI safety are based on the local learning coefficient (LLC).

The LLC is a measure of how large a perturbation in weights is required to meaningfully change model behavior. In forthcoming work [Wang et al.] described below, we show that restricted versions of the LLC can quantify how large a perturbation in specific weights is required to meaningfully change model behavior on a particular dataset (e.g. to measure how much a particular attention head is specialized to code, or how easily a given layer can recover examples of illicit behaviors that have been “trained out” by safety fine-tuning). These restricted LLCs are useful in studying how components of the model differentiate and mature over training, and in tracking which parts of the model have information about which kinds of datasets.
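
To make this concrete, below is a minimal sketch of the kind of SGLD-based estimator introduced in Lau et al. 2023, extended to the restricted setting. Everything here is illustrative: the function and parameter names are ours (not an existing API), the hyperparameter values are placeholders, and a real implementation would average over multiple chains and batches.

```python
import itertools
import math
import torch


def estimate_restricted_llc(model, loss_fn, data_loader, n_samples,
                            restrict_to=None, num_steps=2000, num_burnin=500,
                            step_size=1e-5, gamma=100.0, beta=None, device="cpu"):
    """Sketch of an SGLD-based estimator of the (restricted) local learning coefficient.

    Estimates  llc_hat = n * beta * (E_posterior[L_n(w)] - L_n(w*)),
    sampling from a tempered posterior localized around the trained weights w*.
    `restrict_to` (an iterable of parameter names) gives a weight-restricted LLC;
    the choice of `data_loader` gives a data-restricted LLC.
    """
    model = model.to(device)
    beta = beta if beta is not None else 1.0 / math.log(n_samples)

    params = dict(model.named_parameters())
    names = list(params) if restrict_to is None else list(restrict_to)
    w_star = {name: params[name].detach().clone() for name in names}  # center of localization

    batches = itertools.cycle(data_loader)  # fine for a sketch; beware memory on huge datasets

    def batch_loss():
        x, y = next(batches)
        return loss_fn(model(x.to(device)), y.to(device))

    with torch.no_grad():
        loss_star = batch_loss().item()  # proxy for L_n(w*); average several batches in practice

    running, kept = 0.0, 0
    for step in range(num_steps):
        loss = batch_loss()
        grads = torch.autograd.grad(loss, [params[name] for name in names])
        with torch.no_grad():
            for name, g in zip(names, grads):
                p = params[name]
                drift = beta * n_samples * g + gamma * (p - w_star[name])  # tempered, localized posterior
                noise = math.sqrt(step_size) * torch.randn_like(p)
                p.add_(-0.5 * step_size * drift + noise)                   # SGLD step
        if step >= num_burnin:
            running += loss.item()
            kept += 1

    # Restore the trained weights before returning (we perturbed them in place).
    with torch.no_grad():
        for name in names:
            params[name].copy_(w_star[name])

    return n_samples * beta * (running / max(kept, 1) - loss_star)
```

A weight-restricted LLC for, say, a single attention head corresponds to passing only that head's parameter names in `restrict_to`; a data-restricted LLC corresponds to pointing `data_loader` at the dataset of interest (e.g. code, or examples of the behavior targeted by safety fine-tuning).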

We emphasize that in our experience the LLC and its restricted variants often reveal information that is not obviously accessible through the loss or other metrics alone. This empirical fact, combined with the LLC's fundamental status in the mathematical theory of Bayesian statistics, justifies our focus on it as a tool of choice when trying to understand models.

The particular projects we have in mind are:

  • Project: Simplicity biases and deceptive alignment. In their recent “sleeper agents” paper Hubinger et al. show that adversarial training fails to remove a backdoor behavior trained into an LLM, and instead causes it to learn a more specific backdoor. They point out that this is highly relevant to safety training, since “once a model develops a harmful or unintended behavior, training on examples where the model exhibits the harmful behavior might serve only to hide the behavior rather than remove it entirely”. They conjecture an explanation in terms of a bias of SGD towards simpler modifications. This phenomenon and the conjectured bias can be understood in terms of the free energy formula within SLT, and we will investigate the use of restricted LLCs for studying this phenomenon.

  • Project: Backdoor detection with data-restricted LLCs. If a model is backdoored, it may “compute differently” on inputs from slightly different distributions, in a way that is unanticipated. Different input distributions determine different population loss landscape geometries, and the data-restricted LLC reflects these changes; consequently, we expect the data-restricted LLC to be sensitive to changes in model computation resulting from small changes in the input distribution. We will investigate how this can be used to detect backdoors (see the short sketch following this list).

  • Project: Understanding algorithm change in many-shot jailbreaking. As long-context LLMs become available, the study of many-shot in-context learning has become increasingly critical to AI safety. As a follow-up to our ongoing work using SLT and DevInterp to study the fine-tuning process (described below) we will study many-shot jailbreaking. In the forthcoming Carroll et al., we study how the algorithms learned by transformers change over training, according to the free energy formula in SLT, and we have ideas on how to adapt this to study how the mode of computation varies over long contexts.

  • Project: A minimal understanding-based eval. By scaling data- and weight-restricted LLCs to large open-weight models, and combining this with the above projects that use these tools to reason about the effects of safety fine-tuning and adversarial training, we will produce an eval that goes beyond behavioral evaluation and demonstrate its application to simple examples of engineered deceptive alignment and backdoors.
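
As a brief illustration of the backdoor-detection idea in the second project above, here is a hypothetical usage of the `estimate_restricted_llc` sketch from earlier in this proposal; the datasets and the decision rule are placeholders, not a finished method.

```python
def data_restricted_llc_shift(model, loss_fn, clean_loader, shifted_loader,
                              n_clean, n_shifted):
    """Difference in data-restricted LLCs between a clean input distribution and a
    slightly shifted one suspected of containing a trigger. A large shift is the
    signal we propose to study; what counts as "large" must be calibrated empirically.
    Assumes the estimate_restricted_llc sketch defined earlier in this proposal."""
    llc_clean = estimate_restricted_llc(model, loss_fn, clean_loader, n_clean)
    llc_shifted = estimate_restricted_llc(model, loss_fn, shifted_loader, n_shifted)
    return llc_shifted - llc_clean
```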

A subset of these projects involve a close collaboration and partnership with the Gradient Institute. The Gradient Institute is an independent nonprofit research institute based in Sydney, Australia, which over the past five years has become a trusted advisor to corporations and government in Australia on AI policy development. This partnership combines our technical expertise with Gradient's policy experience, ensuring that our research findings have a clear path to inform and influence AI policy and regulation.

Research Details

To provide more context on our research, two projects that will be completed this month:

  • G. Wang, J. Hoogland, S. van Wingerden, Z. Furman, D. Murfet “Differentiation and Specialization in Language Models via the Restricted Local Learning Coefficient” introduces the weight and data-restricted LLCs and shows that (a) attention heads in a 3M parameter transformer differentiate over training in ways that are tracked by the weight-restricted LLC, (b) some induction heads are partly specialized to code, and this is reflected in the data-restricted LLC on code-related tasks, (c) all attention heads follow the same pattern in which their data-restricted LLCs first increase then decrease, which appears similar to the critical periods studied by Achille-Rovere-Soatto.

  • L. Carroll, J. Hoogland, D. Murfet “Retreat from Ridge: Studying Algorithm Choice in Transformers using Essential Dynamics” studies the retreat-from-ridge phenomenon following Raventós et al. and resolves the mystery of apparent non-Bayesianism there, by showing that over training on an in-context linear regression problem there is a tradeoff between in-context ridge regression (a simple but high-error solution) and another solution more specific to the dataset (more complex but lower error). This gives a striking example of the “accuracy vs. simplicity” tradeoff made quantitative by the free energy formula in SLT (sketched below).
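
The tradeoff in the second paper can be made explicit with the free energy formula given earlier. Schematically, in our notation, compare a simple solution A (in-context ridge regression) against a more dataset-specific solution B:

```latex
% Approximate localized free energies of the two solutions:
F_n(A) \approx n L_A + \lambda_A \log n, \qquad
F_n(B) \approx n L_B + \lambda_B \log n,
% with L_A > L_B (A is less accurate) and \lambda_A < \lambda_B (A is simpler).
% The posterior prefers A exactly when
n\,(L_A - L_B) \;<\; (\lambda_B - \lambda_A)\,\log n.
```

Since the left-hand side grows linearly in n and the right-hand side only logarithmically, the simpler solution is preferred at small n and the more accurate one at large n, which is the quantitative form of the accuracy vs. simplicity tradeoff mentioned above.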

A project recently started with the Gradient Institute, which we expect to develop in parallel with the work on deceptive alignment:

  • Project: Measuring reversibility of safety fine-tuning: We see ways to use restricted LLC estimation to quantify how “deep”/reversible safety fine-tuning is in open-weights models at the scale of 7B parameters, in a similar vein to the recent work of Peng et al. on the “safety landscape”.

The success of the new proposed projects will depend on improvements in the underlying technology of LLC estimation, which Timaeus will continue to push forward. We expect several of these projects to involve open-weights models on the scale of billions of parameters. At present we have extensive experience running LLC estimation on models with millions of parameters and more limited experience with models at the scale of hundreds of millions of parameters (GPT2-small). We do not foresee fundamental obstacles in scaling another order of magnitude, but at each scale we need to invest researcher time in developing best practices for hyperparameter selection, baseline sanity checks, etc.
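
To give a sense of what these best practices involve, the following is an illustrative list of the knobs and checks we mean. The values are placeholders rather than recommendations, and the names refer to the estimator sketch earlier in this proposal.

```python
# Knobs that typically need re-tuning at each model scale (placeholder values):
llc_settings = dict(
    step_size=1e-5,   # SGLD step size: too large and chains diverge, too small and they fail to mix
    gamma=100.0,      # strength of localization around the trained weights w*
    beta=None,        # inverse temperature; 1/log(n) by default in the sketch above
    num_steps=2000,   # draws per chain
    num_burnin=500,
    num_chains=4,     # run several chains and compare estimates
)

# Typical sanity checks before trusting an estimate:
#  - llc_hat is positive and stable across chains and across nearby step sizes
#  - the sampled loss trace fluctuates around a plateau rather than drifting upward (a diverging chain)
#  - on small models with theoretically known learning coefficients (e.g. deep linear
#    networks, as in Furman & Lau 2024), the estimator recovers the known value
```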

How will this funding be used?

This funding will be used to support the research projects listed above, mainly by paying for researchers and compute. 

We are seeking $670k in additional funding to complete the outlined parts of our research agenda, to be spent over the course of 8 months:

  • $100,000 to complete partially funded projects.

  • $410,000 to complete unfunded projects.

  • $160,000 for accelerating our research timelines (2 additional RAs and increased compute).

Funding Breakdown

  • $355k: RAs + Researchers (3.4 FTE, 8 months)

  • $30k: Travel Support & Research Retreats

  • $145k: Core Team + Support Staff (1.8 FTE, 8 months)

  • $55k: Compute

  • $85k: Fiscal Sponsorship Costs & Buffer

Timeline

By default, we are targeting publication in the major conferences and journals, subject to case-by-case considerations of capability risk. Our timeline for submissions (the actual conferences take place later) is:

  • ICML (January/February 2025, partially funded)

    • Project: Measuring reversibility of safety fine-tuning

    • Project: Understanding algorithm change in many-shot jailbreaking

  • NeurIPS (May 2025, not yet funded)

    • Project: Simplicity biases and deceptive alignment

    • Project: Backdoor detection with data restricted LLCs

  • ICLR (September 2025, aiming for completion in August 2025, not yet funded) 

    • Project: A minimal understanding-based eval

Additional Considerations

  • Potential Harm: While our research aims to improve AI safety, it could potentially be misused to accelerate capabilities development. We are committed to conducting risk assessments on a project-by-project basis and working with the community to mitigate any potential harm.

  • Other Funding: Since October 2023, Timaeus has secured $800k in funding from Manifund, SFF, LTFF, and AISTOF, providing runway until January 1st 2025. The Gradient Institute is jointly funding the projects listed under our partnership. We are actively seeking additional funding to support our ongoing and future research, including at SFF and Foresight.  

Track Record

Timaeus is currently a team of 6.5 FTE, with a strong track record in advancing DevInterp and SLT:

  • Research: We have successfully proposed and validated a new research direction within AI safety through a combination of careful empirical and theoretical work, on an accelerated time frame. See the references linked at the end.

  • Expertise: We are working in close collaboration with Daniel Murfet’s group at the University of Melbourne and with other top researchers at several universities and AI labs. Daniel Murfet is responsible for originating the developmental interpretability agenda and is senior author on many of the papers we are writing. 

  • Outreach: We have actively engaged with the AI alignment community through conferences, workshops, talks, mentoring (via MATS, Athena, ERA), and other forms of outreach. This has led to widespread interest and endorsements by many prominent researchers in the field, including John Wentworth, Daniel Filan, and Vanessa Kosoy.

We are committed to making breakthrough progress on AI safety, and with your support we believe we will have a significant and lasting impact.

References

  • [Hoogland et al. 2024] The Developmental Landscape of In-Context Learning by Jesse Hoogland (Timaeus), George Wang (Timaeus), Matthew Farrugia-Roberts (then independent, now at Timaeus / Krueger Lab), Liam Carroll (then independent, now Timaeus / Gradient), Susan Wei (University of Melbourne), Daniel Murfet (University of Melbourne). [Arxiv; Distillation]

  • [Lau et al. 2023] Quantifying Degeneracy in Singular Models via the Learning Coefficient by Edmund Lau (University of Melbourne), George Wang, Daniel Murfet, Susan Wei. [Arxiv; Distillation]

  • [Furman & Lau 2024] Estimating the Local Learning Coefficient at Scale by Zach Furman (Timaeus) and Edmund Lau. [Arxiv]

  • [Chen et al. 2023] Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition by Zhongtian Chen (University of Melbourne), Edmund Lau, Jake Mendel (then independent, now at Apollo Research), Susan Wei, Daniel Murfet. [Arxiv; Distillation]


Austin Chen

4 months ago

Approving this grant! Timaeus was one of the projects our regrantors were most excited for last year, and I'm happy to continue supporting Jesse and the rest of the team on their work.

I also appreciate Adam's CoI disclaimer here and state for the record that we're happy to proceed in such cases. Manifund does not generally require that our regrantors or donors be unconflicted when they make grant recommendations, so long as they make such conflicts clear for our consideration; for example, we would very likely support Adam should he want to make a regrant to his own org (FAR AI) on a project that he deems important.

donated $50,000

Adam Gleave

4 months ago

I've generally been impressed by how well Timaeus have executed. They've in short order assembled a strong team who are collaborating & working well together, producing substantial research and outreach outputs. They have a distinctive research vision, and I think they deserve some credit for popularizing the study of how networks evolve throughout training from an interpretability perspective, with e.g. EleutherAI's interpretability team now pursuing their own "developmental interpretability" flavored research.

I have not had time to conduct an in-depth evaluation of Timaeus's research, so would not want people to defer to me on this point. In particular, I don't currently have a good enough understanding of Singular Learning Theory to judge the tractability of this approach. I do have some background skepticism of this category of approaches (having seen a number of theoretically nice approximations, like the neural tangent kernel, not actually have that much explanatory power) that gives me a higher burden of evidence to buy into this approach; however, if it is a dead end, I would expect Timaeus to eventually recognize that & be able to pivot to other valuable directions.

donated $50,000

Adam Gleave

4 months ago

@AdamGleave Full disclosure: I have two small COIs here:

  1. I'm an informal advisor of Timaeus, which consists of me having some advising meetings with team members on a ~once/month basis, and I'm e.g. listed on their website. However, I do not have any hard power: explicitly I am not a board member, and no one at Timaeus reports to me.

  2. Jesse, the Timaeus founder, works from my organization's co-working space, FAR.Labs. If Timaeus expanded and made in-person hires in the Bay, it is likely they would also work from FAR.Labs, so FAR.AI might get a small amount of revenue from Timaeus's expansion.