I am working on a concept in agent foundations that I call “intermediate divergence”. An abstract hypothesis and two interpretations are given below.
Consider a consequentialist agent with big-picture strategic awareness pursuing a terminal goal that is a perfect proxy for some underlying value. If every action taken during the pursuit of the goal has a positive probability of diverging from the underlying value, then the probability that at least one divergence occurs approaches one as the number of actions grows without bound over the intermediate period, that is, until the goal is abandoned, achieved, or modified. If factors of divergence, including but not limited to competing terminal goals, conflicting instrumental goals, and incompetence, exist and carry non-negligible weight, then the probability of divergence becomes significant after only a small number of actions, with the number of actions required for significance inversely related to the weight of those factors. Divergence will occur at varying degrees of severity and over varying durations.
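As an illustrative sketch only (not part of the hypothesis itself), suppose each action independently diverges with some fixed probability p; this independence assumption is mine, introduced to make the compounding behavior concrete. Under it, both the limiting claim and the inverse relationship between p and the number of actions needed for probabilistic significance can be computed directly:

```python
import math

def divergence_probability(p: float, n: int) -> float:
    """P(at least one divergent action among n actions), assuming each
    action independently diverges with probability p."""
    return 1.0 - (1.0 - p) ** n

def actions_until_significant(p: float, alpha: float = 0.5) -> int:
    """Smallest n such that the divergence probability reaches alpha."""
    return math.ceil(math.log(1.0 - alpha) / math.log(1.0 - p))

# Even a tiny per-action divergence probability compounds quickly,
# and the threshold n shrinks as p (the "weight" of divergence factors) grows:
for p in (0.001, 0.01, 0.1):
    n = actions_until_significant(p)
    print(f"p={p}: {n} actions until P(divergence) >= 0.5 "
          f"(P after {n} actions = {divergence_probability(p, n):.3f})")
```

Because (1 - p)^n tends to zero for any fixed p > 0, the probability of at least one divergence tends to one over an unbounded intermediate period; correlated or adversarially chosen actions would change the rate but not this limiting behavior.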
If this proves true, then there are at least two possible topical interpretations:
1. AI Alignment Failure Case (Short-Term): This phenomenon may constitute an alignment failure when the divergent actions occur over a sufficient duration of time and produce outcomes of sufficient severity, including but not limited to loss of control leading to some kind of existential risk. Even if the terminal goal perfectly encodes human flourishing, this may still occur. Critically, this risk is not necessarily permanent. The concern is therefore not strictly about irreversibility but about the duration and severity of the divergence, and whether such an intermediate state is acceptable under any reasonable interpretation of alignment.
2. AI Alignment Success Case (Long-Term): If the goal is achieved and the underlying value is satisfiable in ways that are compatible with such an intermediate state, then there exists a region of outcome space in which the agent causes existential risk in the short term while preserving the underlying value in the long term. If artificial superintelligence does arrive soon, this interpretation identifies a possible *positive outcome corridor* under conditions where current alignment techniques prove insufficient to prevent existential risk yet goal steering proves sufficiently effective in relation to a class of challenging long-term value-theoretic problems. This interpretation is not an endorsement of intermediate extinction, but a conditional characterization of outcome space.
Primarily, I plan to expand, formalize, and test the intermediate divergence hypothesis and publish the results, and then develop and test the two topical interpretations presented. Secondary goals include empirical testing against contemporary AI models, identifying future research directions, and community-building around the work if it proves fruitful. The target timeline is at most six months of full-time work, subject to change.
The funding will be used primarily for living expenses and services that increase research output. If the research proves valuable and gains traction, additional funds may go toward compute, hardware, conference attendance, publishing fees, and community-building.
This is a solo project. I have a background in software development and several years of sustained engagement with alignment research, effective altruism, and rationality. This is my first formal research project in the field.
The most likely causes of failure are shaky empirical or theoretical foundations or a lack of wider interest in the work. More specifically, possible negative outcomes are that the hypothesis does not hold, that it is internally inconsistent, or that the work rests on unjustified optimism that current alignment techniques will work.
None received. There is one pending application to the Survival and Flourishing Fund 2026 grant program, submitted April 22, 2026.