I am an independent researcher working on a robust theory of world-models, with an eye towards swiftly translating that theory into practical applications. My current main research aim is to formalize natural abstractions/subsystems, then use these formalisms as the foundation for paradigm-changing interpretability tools.
Research agenda:
I believe that World-Model Interpretability Is All We Need. If we develop tools for automatically interpreting learned world-models, we would be able to do things such as:
Putting robust formal constraints on what AIs are able to do/think.
Easily editing AIs’ knowledge bases, to delete harmful/dangerous knowledge.
Retargeting the search of AGI-level systems.
Overall, treating and editing AIs as normal software products, rather than as incomprehensible trillion-parameter matrices.
The Natural Abstractions Hypothesis gives cause to believe this problem is tractable. The NAH postulates that all efficient embedded agents converge towards modeling the world roughly the same way. They decompose it into the same concepts, because the right decompositions are “in the territory”. (See Appendix A for a basic case in favour of the NAH.)
I’m currently approaching the problem from a very different angle from John Wentworth’s. Rather than starting from information/probability theory, I aim to understand these “territory-level decompositions” by building up from fundamental physics.
(See the "Research Proposal" section below for more details.)
Taking a different approach is deliberate, for redundancy: if other researchers' strategies run into a dead end, my own may still work, and vice versa.
In addition, pursuing many independent research directions in parallel should allow us, as a field, to more easily determine in which direction the truth lies:
If several unconnected, seemingly different research projects independently converge towards similar results, that's strong evidence that they've discovered some objective truth!
And in a deeply theoretical field, without fast empirical feedback loops, the ability to use such tricks to orient toward the truth is invaluable.
Overall vision:
I stand by the “least forgiving” take on AI Alignment. I believe that alignment is very hard, requires exacting precision, and needs to be solved right on the first try.
Despite this, I am optimistic regarding solving it. I believe that once a few crucial conceptual confusions (mainly around abstractions) are cleared up, the rest is a relatively mundane, if large-scale, software-engineering problem.
My choice of research project is strategically motivated. I’m working on what I believe to be the most concrete research problem sufficient for AGI Alignment. I aim to carefully tread the line between AGI-incomplete atheoretical tinkering and needlessly abstract theory.
I expect my research, if successful, to be highly useful for current systems as well as any speculative future ones. If the “least forgiving” take is wrong, the theory would be easily re-applicable for aligning LLMs, rather than ending up as an unusable philosophical artefact.
Track record:
I’ve been engaging in independent grant-funded alignment research for several years.
I’ve won top prizes in the ELK contest and the AI Alignment Awards.
I’ve received positive feedback on my work from senior alignment researchers, including on its most speculative directions. (My primary reference is John Wentworth.)
See my LW profile if you want more examples of my writing.
Funding:
I’m aiming for $6,000/month in personal salary, plus $1,500/month for consulting domain experts and running compute-intensive experiments.
3 months is the “minimal time unit” in which I expect to produce meaningful progress.
~2 years is the project’s expected runtime. Past that point, I expect to reach the applications stage.
The funding goal, therefore, ranges from $22,500 to $180,000.
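For transparency, the arithmetic behind those endpoints:

$$(\$6{,}000 + \$1{,}500) \times 3 \text{ months} = \$22{,}500, \qquad (\$6{,}000 + \$1{,}500) \times 24 \text{ months} = \$180{,}000.$$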
Research proposal:
Suppose that the natural abstractions hypothesis, or something like it, is true. Aside from the basic case for it (Appendix A), we also have solid empirical evidence: the Universality Hypothesis in interpretability, which states that the features neural networks learn converge across different models and training runs.
So, suppose there is a “correct” way to model reality. There are rules for decomposing the world into sparsely interacting subsystems so broadly useful that all efficient agents, no matter their goals, converge towards using them. The “correct” world-model depends only on the chunk of the world being modeled, not on the one doing the modeling.
That essentially implies a “ground truth” to the way the universe decomposes into sparsely interacting subsystems. The decompositions are “in the territory”.
If so, that should be reflected in the territory: in the laws of physics. There must be some fundamental dynamic that causes this tendency to decompose. Indeed, the opposite is conceivable: the universe could have been arranged in an “irreducibly complex” way, such that it cannot be modeled/predicted partially, without keeping its entire state in mind. In such a universe, embedded agency would not have been possible. What makes our universe different, then?
My current hypothesis is that it’s caused by the Principle of Stationary Action. In short, I believe the PoSA essentially "nudged" the laws of physics to maximize the extent to which the universe is split into sparsely interacting subsystems. In a certain mathematical sense, subsystemization "furthers the PoSA's objective" of action-variance minimization. See Appendix B for details.
More precisely, I believe the quantum-mechanical dynamics responsible for the PoSA also create a similar “subsystemization principle”, for roughly the same reason they cause the PoSA: the more “subsystems” a trajectory has, the more constructive interference it experiences, so the higher its amplitude becomes, and therefore the higher the probability of our observing it. We can then adopt the Many-Worlds Interpretation lens, and extend this into the claim that a given Everett branch experiences constructive interference in proportion to how many subsystems it features. But I’ve yet to formalize this properly. (Note that taking the MWI literally isn’t really necessary here: it’s just a useful framework for intuitive thinking.)
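As a rough illustration of the mechanism being invoked, here is the standard path-integral picture (textbook material; the extension to subsystem-counting is my conjecture above, not established physics). The amplitude to go from one configuration to another sums phases over all trajectories:

$$K(x_f, t_f;\, x_i, t_i) = \int \mathcal{D}[\gamma]\; e^{\,iS[\gamma]/\hbar}.$$

Near a stationary path ($\delta S[\gamma^*] = 0$), neighboring trajectories contribute nearly identical phases and add constructively; elsewhere, the phase varies rapidly on the scale of $\hbar$ and contributions cancel. My conjecture amounts to the claim that the size of the family of phase-aligned variations a trajectory admits grows with the number of sparsely interacting subsystems it contains.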
Here, incidentally, I can point to an instance of several independent research directions converging on the same picture. Namely:
Singular Learning Theory suggests that the effective dimensionality of the loss landscape collapses when ML models are trained on certain kinds of data.
A recent result from Simplex suggests something similar: that the "correct" structure of an agent updating on evidence is self-similar, and therefore compactly specifiable.
My approach hypothesizes a "ground-truth" complement to this: if a certain physical system contains robust subsystems, they effectively collapse the dimensionality of the configuration space of that system. (See some very rough sketches in Appendix B, section 4, plus the toy illustration right after this list.)
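A minimal toy sketch of that dimensionality-collapse claim (entirely my illustration, not from Appendix B; it leans on the standard fact that each independent conserved quantity confines trajectories to a lower-dimensional invariant set): two decoupled oscillators each conserve their own energy, while a cross-coupling destroys the per-subsystem invariants and leaves only the total energy conserved.

```python
import numpy as np

def simulate(coupling, steps=20_000, dt=0.001, seed=0):
    """Two unit-frequency oscillators; `coupling` adds a cross-spring force."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=2)  # positions
    v = rng.normal(size=2)  # velocities
    e1, e2 = [], []
    for _ in range(steps):
        a = -x - coupling * x[::-1]  # own spring force + cross-spring force
        v = v + a * dt               # symplectic Euler: update velocity first
        x = x + v * dt
        e1.append(0.5 * (v[0] ** 2 + x[0] ** 2))  # subsystem 1's "own" energy
        e2.append(0.5 * (v[1] ** 2 + x[1] ** 2))  # subsystem 2's "own" energy
    return np.std(e1), np.std(e2)

for c in (0.0, 0.5):
    s1, s2 = simulate(c)
    print(f"coupling={c}: std(E1)={s1:.4f}  std(E2)={s2:.4f}")
```

With coupling = 0, each subsystem’s energy is (numerically) constant: two extra invariants, two fewer accessible dimensions. With coupling = 0.5, energy sloshes between the oscillators, and the per-subsystem invariants dissolve into the single total-energy invariant.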
Working out the technical details of this hypothesis should yield the “type signature” of abstractions/subsystems, much as Noether’s theorem gives us the “type signature” of conserved quantities.
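For reference, the Noether “type signature” in its simplest form (standard classical mechanics): if a Lagrangian $L(q, \dot q)$ is invariant, to first order in $\epsilon$, under the transformation $q \to q + \epsilon\, K(q)$, then

$$Q = \sum_i \frac{\partial L}{\partial \dot q_i}\, K_i(q)$$

is conserved along solutions. Symmetry in, conserved charge of a fixed functional form out. The hoped-for analogue: real-world data in, abstractions of a fixed functional form out.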
That “signature” will likely take the form of a specific relationship between raw real-world data and their abstracted forms: some insight into the way real-world data are distributed. We would be able to employ this insight, in combination with known structure-learning algorithms, to build tools for automatically finding all abstractions that exist in a specific part of reality (as represented by some dataset). Compare the Pragmascope Idea.
As proofs of concept that something like this is possible, consider the Noether Networks paper, and the way we can derive the Boltzmann distribution using the formalisms of Lagrangians and Hamiltonians. They show that it’s very much possible to translate results from classical mechanics into machine-learning and structure-learning tools.
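For concreteness, here is the standard maximum-entropy route to the Boltzmann distribution, with the Hamiltonian supplying the energy function (my paraphrase of the textbook derivation, not necessarily the exact argument linked above). Maximize the entropy $-\int p \ln p$ subject to normalization and a fixed mean energy $\int p\,H = E$; setting the variation of

$$-\int p \ln p \;-\; \lambda\left(\int p - 1\right) \;-\; \beta\left(\int p\,H - E\right)$$

to zero gives

$$p(x) = \frac{e^{-\beta H(x)}}{Z}, \qquad Z = \int e^{-\beta H(x)}\, dx.$$

A purely mechanical object, the Hamiltonian, directly fixes the shape of a probability distribution; that is exactly the kind of physics-to-statistics translation this proposal needs.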
That, in turn, should yield dramatic progress in interpretability, from two complementary directions:
We would know what patterns we’re looking for in ML models: how abstractions are likely to be represented and formatted, and what the features we want to edit look like.
We could use these algorithms directly: point them at the parameters of ML models, and iteratively decompose them into natural subsystems that can be understood separately. (Finding such natural decompositions manually is a very hard problem: see here, and the superposition problem generally. A toy sketch of the direct approach follows this list.)
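To gesture at what “pointing the algorithms at the parameters” might look like in its very simplest form, here is a hedged numpy sketch (entirely my illustration; a real tool would operate on the abstraction “type signature” above rather than raw weight magnitudes): treat a weight matrix as an interaction graph between units, then recover a weakly coupled two-block split by spectral partitioning.

```python
import numpy as np

def interaction_matrix(weights):
    """Symmetrized interaction strength between units.
    (Illustrative choice: absolute weight magnitudes.)"""
    a = np.abs(weights)
    return a + a.T

def spectral_bipartition(affinity):
    """Split units into two weakly interacting blocks using the Fiedler vector
    (eigenvector of the graph Laplacian's second-smallest eigenvalue)."""
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    return eigvecs[:, 1] >= 0               # sign pattern = block membership

# Demo: a weight matrix with two planted blocks plus weak cross-talk.
rng = np.random.default_rng(0)
n = 8
w = 0.05 * rng.normal(size=(2 * n, 2 * n))  # weak background interactions
w[:n, :n] += rng.normal(size=(n, n))        # strong intra-block weights, block 1
w[n:, n:] += rng.normal(size=(n, n))        # strong intra-block weights, block 2
blocks = spectral_bipartition(interaction_matrix(w))
print("recovered block assignment:", blocks.astype(int))
```

Recursing on each recovered block would give the “iterative decomposition” from the bullet above. The hard open problem, per the superposition point, is that real features need not align with individual neurons, which is exactly where the hoped-for abstraction type signature would have to do the work.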
And notably, by the nature of being the “correct” algorithms, the results of this research would be generalizable to any type of future AI agent and any future AI architecture.
That said, I’m not betting everything on this one hypothesis. I also engage in general research related to agency foundations, value formation, interpretability theory, and selection theorems. Should the fundamental-physics approach fail, I would quickly pivot to another avenue, likely finding a way to re-use the insights and skills accumulated on the failed approach.