I am working on a concept in agent foundations that I call “pursuit misalignment.” Below, I present an abstract overview and two urgently relevant conclusions for AI alignment.
Consider a consequentialist agent with big-picture strategic awareness pursuing a goal that is perfectly aligned with a value. Pursuit misalignment can be defined as action-level misalignment during goal pursuit, the interval after a goal is adopted but before it is abandoned or achieved. If each action carries a positive probability of being misaligned, then the probability that at least one misaligned action has occurred accumulates with every action taken. Pursuit misalignment factors may include capability limitations, competing or conflicting goals or shards, and other unknowns, and they are likely to exist unless the agent is specifically designed to prevent them. The number of actions required before this cumulative probability becomes significant is roughly inversely proportional to the per-action probability assigned to these factors: if that probability is non-negligible, the threshold is crossed after relatively few actions (see the sketch below). The resulting misalignment may vary in both duration and severity.
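As a minimal sketch of the accumulation claim, assume each action is an independent Bernoulli trial with a fixed per-action misalignment probability p. Both the independence assumption and the example values of p below are illustrative simplifications of mine, not part of the model itself. Under these assumptions, the probability that at least one misaligned action has occurred after n actions is 1 - (1 - p)^n, and the number of actions needed to cross a given threshold scales roughly as 1/p for small p.

```python
import math

def cumulative_misalignment(p: float, n: int) -> float:
    """P(at least one misaligned action in n actions) = 1 - (1 - p)^n,
    assuming independent actions with fixed per-action probability p."""
    return 1.0 - (1.0 - p) ** n

def actions_to_threshold(p: float, threshold: float = 0.5) -> int:
    """Smallest n at which the cumulative probability reaches `threshold`."""
    return math.ceil(math.log(1.0 - threshold) / math.log(1.0 - p))

# Illustrative per-action probabilities, not empirical estimates.
for p in (0.001, 0.01, 0.05):
    print(f"p = {p}: 50% cumulative probability after "
          f"{actions_to_threshold(p)} actions")
# p = 0.001 -> 693 actions; p = 0.01 -> 69; p = 0.05 -> 14,
# consistent with the rough 1/p scaling noted above.
```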
If this model holds, two urgently relevant conclusions for AI alignment follow.
1. Failure Case: This phenomenon may constitute an alignment failure if the misalignment persists for a sufficient duration at a sufficient severity, up to and including a loss of control that leads to existential risk. This can happen even if the goal is ultimately achieved and is perfectly aligned with human flourishing. Because the risk is not necessarily permanent, the concern is not strictly about irreversibility but about duration and severity, and about whether such pursuit misalignment is acceptable under any reasonable interpretation of alignment.
2. Success Case: This phenomenon may constitute an alignment success: if the goal is achieved and the underlying value is satisfiable in ways compatible with such pursuit misalignment, then there exists a region of outcome space in which the agent causes existential risk in the short term while preserving the value in the long term. If artificial superintelligence or transformative AI arrives soon, there may exist achievable long-term axiological heuristics under conditions where current prosaic alignment techniques fail to prevent existential risk yet succeed at preserving the value, despite challenging long-term alignment subproblems. This conclusion is not an endorsement of existential risk but a conditional characterization of outcome space.
My primary plan is to develop, formalize, and test the pursuit misalignment hypothesis, publish the results, and then build on the two conclusions presented above. Secondary goals may include empirical testing against contemporary AI models, identifying future research directions, and community-building around the work if it proves valuable. The target timeline is three months of full-time work, subject to change.
The funding will primarily cover living expenses and services that increase research output. If the work gains traction, remaining funds may go toward compute, hardware, conference attendance, publishing fees, and community-building.
This is a solo project. I have a background in software development and years of engagement with AI safety, effective altruism, and rationality. This is my first formal research project in the field.
The most likely causes of failure are shaky empirical or theoretical ground, or a lack of wider interest in the work. Concretely, possible negative outcomes include the hypothesis failing to hold, proving internally inconsistent, or the success-case conclusion resting on unjustified optimism that current alignment techniques can preserve the value.
None. One application to the Survival and Flourishing Fund 2026 is pending, submitted before the end of April 22, 2026 (UTC-7).