Collaboration to develop a DAG formalism to express instrumentality

The full project proposal document is available here. I have copy-pasted the relevant passages below. This application is for myself only. Sahil and Sophie Libkind (Topos) are submitting separate applications.

Project summary

This work grew out of, and forms part of, the high actuation spaces project for AI Safety Camp 2024. One of the project's main aims is to treat the telic aspects of intelligence (think: intentional stance) as first-class and commensurate with its causal aspects when seeking to understand (future) intelligences.

We claim that for extreme optimizers with coherent values and heavy self-modification abilities, the abstract structure of their commitments and the values motivating them is likely to be more robust than the abstract causal structure we attribute to them. Such coherence is usually considered apt only for sufficiently advanced intelligent systems; yet, even for present-day LLMs, a not unreasonable case can be made for taking the intentional stance. As part of this, we seek funding to cover our living expenses whilst we investigate a possible mathematical formalism for articulating the structure of telos in an intelligent system, as well as funds for contracting with the Topos Institute.

The purported formalism:

We refer to the hypothetical formalism of interest as a telic DAG; it resembles a causal DAG (directed acyclic graph) but instead of arrows denoting cause-effect relationships ("x causes y"), it features arrows expressing instrumentality (“x is instrumental for y”), and the nodes are commitments, or values, or goals. A major aim is to use this formalism to explore notions of telic factorings over a telic DAG analogous to Markov blankets, where—rather than the usual forms of factoring adopted under a causal/physical stance—the resultant factors (hereafter, co-factors, for reasons explained below) are related via a shared objective that is telically independent from the objectives of the rest of the system.

In the accompanying diagram, the solid arrows denote instrumentality (e.g., “X is instrumental for Y1”). Z is a terminal goal/objective of the co-factor denoted by the dotted oval, and all nodes within the co-factor are strongly coupled via their instrumentality for Z. The unmarked nodes denote a small segment of the remainder of the system, which tends towards some other objective. A fuller explanation and the links to causal DAGs are provided in Appendix 1.
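To make the description above concrete, here is a minimal Python sketch of a toy telic DAG. It is illustrative only: the formalism itself does not yet exist, the nodes Y2, A, B and W are invented for the example, and treating a co-factor as "the terminal goal plus everything with a directed path to it" (with disjointness standing in for telic independence) is a simplifying assumption rather than the intended definition.

```python
from graphlib import TopologicalSorter

# Hypothetical telic DAG: an edge (a, b) reads "a is instrumental for b".
# Only X, Y1 and Z come from the text; Y2, A, B and W are invented here.
EDGES = [
    ("X", "Y1"),
    ("X", "Y2"),
    ("Y1", "Z"),
    ("Y2", "Z"),
    ("A", "B"),  # unmarked nodes serving some other objective W
    ("B", "W"),
]


def parents_of(edges):
    """Map each node to the set of nodes directly instrumental for it."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
        parents.setdefault(src, set())
    return parents


def cofactor(edges, terminal):
    """Return the terminal goal plus every node with a directed path to it."""
    parents = parents_of(edges)
    seen, stack = {terminal}, [terminal]
    while stack:
        node = stack.pop()
        for parent in parents[node] - seen:
            seen.add(parent)
            stack.append(parent)
    return seen


# Raises graphlib.CycleError if the instrumentality relation has a cycle.
order = list(TopologicalSorter(parents_of(EDGES)).static_order())

z_cofactor = cofactor(EDGES, "Z")
w_cofactor = cofactor(EDGES, "W")
print(sorted(z_cofactor))
print("shares no nodes with W's co-factor:", z_cofactor.isdisjoint(w_cofactor))
```

In this toy graph the two co-factors share no nodes, which is the crudest possible picture of one part of a system tending towards an objective of its own.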

We refer to this divergence as “telic decoupling” and it is our view that alignment issues can be usefully formalized in such a language, with the task of alignment falling out as preventing telic decoupling. Regardless of whether this scales all the way up to full alignment, we claim that the formalism will complement mechanistic interpretability efforts and will be of use across several threat models, most notably deep deception and sharp left turn models; it may also offer insights into open questions in value formation, corrigibility, LLM steering (especially for potentially self-modifying LLMs), and embedded agency.

What are this project's goals and how will you achieve them?

If the following reads as terse, please consult the appendices, Appendix 3 in particular. (For more on the main formalism of “telic DAGs”, see Appendix 1. For some connections to the broader agenda of “high actuation spaces” that this work is a part of, see Appendix 2; for a fuller idea of how this work might be impactful, see Appendix 3.)

Hopes/Expected Impact

Better formal language for highly adaptive entities. We aim to create better mathematical language and notation for highly adaptive entities that are not bound to a specific substrate and continuously self-modify in actualization of their goals and commitments.
- Avoid imposing causal methodology on telic objects. Current frames and languages carry an implicit causal attitude for lack of a suitable mathematics for dealing with teleology. We hope that some grounding of this in formalisms will allow people to embrace the intentional stance without a dangling IOU of explanation.
- Orient to deep commitment that does not require force/substrate-level security mechanisms. To counter deep deception, we hope to articulate and illuminate what it means to be deeply telically coupled (aligned) in a way that does not require adversarial security mechanisms or constructions that rely entirely on the mechanism as a single point of failure.
Validate the high-actuation perspective on science. The high actuation spaces project intends to bring scientific reasoning to systems and entities that resist being pinned down as stable, modular, and decomposable forms in our usual frames of reference. Things that are better defined by where they’re heading (such as intelligent entities) than their changing, adaptive structures will hopefully be amenable to telic reasoning. The plan outlined in this document can validate the relevance of the high actuation spaces context.
Scoping out potential blind spots in current practices. Questions surrounding AI governance and regulation are based upon the causal-heavy technical understanding currently available. We believe that there exists a class of threat models lying beyond the reach of this current understanding, and these models should be properly scoped out and incorporated into risk assessment procedures and regulatory considerations. Giving rigor (however incomplete or inapplicable) to the intuitions of telic decoupling is one way to make these arguments more compelling to governance researchers and policy-makers.

Outcomes/Deliverables

Mathematical theorems and proofs relevant for better notation. This might be in the form of a telic calculus over telic networks [see Appendix 1], or string diagrams (e.g., co-Markov categories). Early investigations also suggest “co-probability theory” as a possibility.
Example applications. We intend to reformulate a lot of relevant examples (both new and old) in this language in an elegant and mathematically fit fashion. See Appendix 1 for details on the main one we consider here.
Post/paper. Depending on our progress, we intend to publish our findings on the Alignment Forum and/or submit to a relevant journal.
Important informal statements that remain to be formalized. Since this is conceptual research (though intended to be quite applied), we expect to have collated a number of open questions that the larger community can be invited to attack.

Assessment

Mathematical elegance and workability. Since this work is meant to guide engineering (in the vein of interpretability), our north star will be a formalism that supports both clear thinking and eventual application.
Peer review. Both agent foundations and prosaic alignment viewpoints are relevant to assess the value of such tools.
Calibration for intuitively expressed concepts and examples. The number and quality of current examples of intentional systems, whose conceptual/design flaws are illuminated and coherence verified using the formalism (for example, the telic decoupling that we focus on in this stage of research), will serve as an indicator of the utility of that formalism.
Early applicability to existing systems. For current ML systems, we will check how much is gained by making sense of them via a formal intentional stance.

Timeline

This research will be conceptual and mathematical, and the lack of precedent makes its timeline and deliverables difficult to assess at this early stage. We suggest the following as a rough timeline of our work and our expected outputs.

At minimum, we seek funding for a period of four months. We assume that we are able to contract the Topos Institute.

Note that the following estimates are total hours, not ordered in sequence of actual study.

Survey and upskilling period

We plan to start by reviewing the related literature for causal theory and category theory, with a focus on Fong’s thesis. We believe a strong understanding of this work and the relevant works that led to it represents a good starting point for our research.

50 hours to be spent researching Bayesian networks and causal calculus, possibly using Pearl’s Primer or the more comprehensive Causality.
80 hours spent researching category theory through Fong’s symmetric monoidal categories and the related literature.
40 hours on filling miscellaneous theoretical and mathematical gaps.

Assessing support for dualizing

We want to be able to dualize from the causal DAG framework to produce telic DAGs, and at present there is no clear method for this. This is where we intend to start consulting with Topos Institute more formally. This is also where the process may become iterative and more non-linear. As a rough idea:

30 hours assessing findings in collaboration with Topos to clarify the claims and get an idea of the viability of dualizing Fong’s category-theoretic causal theory.

At this point I can imagine the research process branching as follows:

Branch 1: We can dualize the initial framework, in which case the research proceeds and we start to sketch out what a telic DAG might look like.
Branch 2: We cannot dualize the framework, in which case we will most likely work iteratively on one or both of the following (depending on advice/results):
- 80 hours trying to adapt or modify the existing work to support dualizing.
- 80 hours seeing if alternative category-theoretic descriptions of causal DAGs can be made to support dualizing. (One possible starting point here is the correspondence between DAGs and preorders in the category of sets and functions.)
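As a small illustration of the DAG/preorder correspondence mentioned above, the Python sketch below takes one standard reading of it: the reachability relation of a DAG (its reflexive-transitive closure) is a preorder, and acyclicity additionally makes it antisymmetric. The three-node graph is hypothetical, and this reading is an assumption about which correspondence is meant, not a statement of the approach we would end up using.

```python
from itertools import product

# Hypothetical three-node DAG; an edge (a, b) is a directed arrow a -> b.
NODES = ["a", "b", "c"]
EDGES = {("a", "b"), ("b", "c")}


def reachability(nodes, edges):
    """Reflexive-transitive closure of the edge relation (Warshall-style)."""
    leq = {(n, n) for n in nodes} | set(edges)
    # product() varies its first component slowest, so k is the outer loop,
    # as Warshall's algorithm requires.
    for k, i, j in product(nodes, repeat=3):
        if (i, k) in leq and (k, j) in leq:
            leq.add((i, j))
    return leq


leq = reachability(NODES, EDGES)

# Preorder axioms: reflexivity and transitivity.
reflexive = all((n, n) in leq for n in NODES)
transitive = all(
    (x, z) in leq for (x, y) in leq for (y2, z) in leq if y == y2
)
# Acyclicity of the original graph shows up as antisymmetry.
antisymmetric = all(x == y for (x, y) in leq if (y, x) in leq)

print(sorted(leq))
print(reflexive, transitive, antisymmetric)
```

Reversing every pair in `leq` gives the opposite order, which is the most naive sense in which such a structure can be "dualized"; whether anything like that carries over to the category-theoretic setting is exactly what this branch would investigate.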

Exploring telic DAGs and telic decoupling

By this point I anticipate that we’d have conducted a thorough-enough investigation to consider reporting any positive or negative findings. Assuming we can obtain a dual, the remainder of the research may be this:

50 hours expanding/exploring the formalism obtained
100 hours trying to develop interventions (perhaps from the duals of the interventions on a causal DAG) that we can use to factor the telic DAG and express the intuitions of telic decoupling mentioned above. Similar to the above, this may branch; either:
- The interventions fall out naturally from the mathematics, or are easy to design; or
- The interventions have to be manually obtained as duals from their causal DAG counterparts; or
- The structure does not support interventions and some further iterative process is used.

Assuming interventions are found and telic decoupling can be expressed, we would ideally like to do the following:

50 hours specifically trying to obtain something that could in principle be tested, though we are currently unclear on what this would look like.

Writing up findings

20 hours writing up findings into a report to be published/posted online.

Working on the basis of 35-hour weeks, and leaving approximately 2–3 weeks slack time, we estimate that four months will be necessary to perform the work detailed here.
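For transparency, the tally behind this estimate can be reproduced with the short Python sketch below. The hour figures are copied from the timeline; the labels are shorthand, and the assumption that the two 80-hour Branch 2 items are the only branch-dependent hours is a simplification made for the calculation.

```python
# Hour estimates copied from the timeline above; labels are shorthand.
HOURS = {
    "Bayesian networks and causal calculus": 50,
    "category theory": 80,
    "miscellaneous gaps": 40,
    "assessing viability with Topos": 30,
    "expanding the formalism": 50,
    "developing interventions": 100,
    "seeking something testable": 50,
    "writing up": 20,
}
BRANCH_2_OPTION = 80  # each Branch 2 item is estimated at 80 hours
HOURS_PER_WEEK = 35


def weeks(total_hours):
    """Convert a total number of hours into 35-hour working weeks."""
    return total_hours / HOURS_PER_WEEK


base = sum(HOURS.values())
scenarios = {
    "Branch 1 (dualizing works directly)": base,
    "Branch 2, one option": base + BRANCH_2_OPTION,
    "Branch 2, both options": base + 2 * BRANCH_2_OPTION,
}
for label, total in scenarios.items():
    print(f"{label}: {total} hours = {weeks(total):.1f} weeks before slack")
```

Under these assumptions the totals come to 420, 500 and 580 hours, i.e. 12.0, 14.3 and 16.6 working weeks respectively, before the 2–3 weeks of slack are added.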

How will this funding be used?

Personal stipend + GBP 1000 buffer.

Personal stipend based upon the living wage for my location (London) (https://www.livingwage.org.uk/what-real-living-wage), calculated for a 35-hour week.

Buffer includes money for resources/tool subscriptions/workspace hire/travel (potential trip to CEEALAR to work in-person with Sahil).

Who is on your team and what's your track record on similar projects?

Mentors/Collaborators: Sahil (ex-MIRI, AISC Research Lead), Topos (most likely Sophie Libkind, perhaps David Spivak and others, depending on funding)

I have been working on the high actuation spaces project for five months now, and Sahil is keen to continue working with me and to provide mentorship as I upskill and research the relevant topics. I have an MSci in Physics and Philosophy (first-class) and have conducted two lengthy projects as part of my degree.

What are the most likely causes and outcomes if this project fails? (premortem)

Causes

Misconstrued mathematical suggestivity. This project relies fairly heavily on suggestive dualities between causal and telic viewpoints. It is possible that the mathematical work is a lot more conceptually involved and will take longer. However, we are prepared to be flexible with our commitments to candidate answers (i.e., telic DAGs) as we clarify the question (i.e., formulating a notion of telic connectedness).

Outcomes

If the formalism cannot be obtained at all, it would be a strong negative result worth publishing for future reference. I would also personally have upskilled in causal theory and category theory under mentorship; I have interests in conceptual/mathematical boundaries research and davidad's Safeguarding AI program, so this experience would constitute some useful career capital. We also will have clarified telic DAGs conceptually and may have laid some groundwork for another attempt at formalization via another approach.

What other funding are you or your project getting?

None at the moment, though we have submitted similar applications to LTFF and CLR.