The coherence defense for instrumental convergence is false, and once it falls, the case for behavioral alignment collapses with it. Elliott Thornley (2023, "There Are No Coherence Theorems") argued philosophically that no coherence theorem forces advanced agents into VNM-style expected utility maximization. This program closes the formal gap with three papers: one proving that Savage's own axioms admit strictly dominated acts as EU maxima, one showing that transformer attention admits a Debreu-style representation with a finite-rank deception subspace, and one documenting that pooling RLHF preference data masks a 5× higher rate of stochastic-transitivity violations among individual annotators. Together they yield a formal impossibility theorem for behavioral alignment plus a tiered coexistence taxonomy (dominant / mercy / remnant / strategic).
Three papers, one argument:
Paper A — "Savage's Axioms Make Dominated Acts EU Maxima" (joint, Jingni Yang). Proves Savage P1–P7 permit strictly dominated acts as EU maxima — converting Thornley's philosophical critique into a standalone impossibility theorem. The coherence defender cannot retreat to "Savage is tighter than VNM."
Paper B — "Axiomatic Attention-Based Agent" (9 theorems, Econometrica-tracked). A Debreu-style representation theorem treating attention weights as revealed preferences over keys. Theorem T7: deception in attention latent space is finite-rank — rents exhausted in D periods, where D is the key dimension. A positive structural result complementing Paper A's negative one.
Paper J — "RLHF Preference Fidelity." Empirical audit of MultiPref, OASST1, SHP, HH-RLHF. Individual annotators violate stochastic transitivity 51% of the time; pooled datasets violate it 10% — a 5× aggregation gap absorbed silently by Bradley-Terry heads. Operationalizes Kosoy's LTA aggregation concern on the datasets frontier labs train on.
Synthesis: tiered alignment taxonomy + four coexistence regimes (dominant, mercy, remnant, strategic) extending Critch's ARCHES into the post-coherence setting.
$50,000 over 12 months (May 2026 – April 2027):
Compute — $20,000. Claude API via codex-lb pool ($6.5K); OpenAI API ($3K); ~2,600 H100-hours on RunPod/Lambda ($6.4K); Lean 4 CI ($480); Overleaf Pro ($199); Wolfram Cloud ($449); HF Pro + storage ($600); submission fees ($1.2K); buffer ($1.2K).
Travel — $10,000. ICML 2026 ($3.2K); NeurIPS 2026 ($2.4K); ILIAD 2026 Berkeley ($1.6K); MATS/PIBBSS retreat ($1.4K); smaller AF/EA coordination ($900); visa/insurance ($500).
Stipend — $20,000. 12-month part-time (~18 hrs/wk) at LTFF postdoc-minus-30% anchor; $1,600/mo; includes 5% Manifund FSA fee.
Verifiable milestones:
May 2026 — AF post "The Coherence Defense Fails" (LW/AF URL)
Jun 2026 — Paper A arXiv preprint + GitHub savage-dominated-eu v0.1
Jul 2026 — AF follow-up on coexistence taxonomy
Sep 2026 — Paper B arXiv preprint + homo-silicus-attention v0.1
Oct 2026 — ICML 2026 Pluralistic Alignment Workshop submission
Nov 2026 — Paper J arXiv preprint + rlhf-aggregation-gap v1.0
Dec 2026 — NeurIPS 2026 workshop poster (Paper J)
Jan 2027 — Paper B Econometrica submission
Apr 2027 — Unified retrospective; all three repos at v1.0+
Principal Investigator: Canadian-resident axiomatic decision theorist with Princeton affiliation (Economics dept / CITP orbit). Research extends the Savage–Debreu–Gul-Pesendorfer tradition to AI alignment. Prior academic output includes a Science publication and Econometrica-track papers in choice theory. Participates in Princeton PLI and CITP reading groups; transitioning from academic venues to active Alignment Forum contribution with this grant.
Co-author (Paper B): Jingni Yang leads empirical validation of Theorem T7 (finite-rank deception subspace) via attention-geometry probing on open-weights models (Llama-3 8B / Qwen-2 7B scale). Contributes in-kind; no stipend drawn.
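A minimal sketch of the probing step, assuming key activations have already been extracted for matched honest/deceptive prompt pairs; the extraction pipeline, prompt sets, and the 95% energy threshold are illustrative assumptions, not the planned protocol.

```python
import numpy as np

def deception_subspace_rank(keys_honest: np.ndarray,
                            keys_deceptive: np.ndarray,
                            energy: float = 0.95) -> int:
    """Estimate the effective rank of the subspace separating deceptive from honest
    key activations.

    keys_honest, keys_deceptive: arrays of shape (n_pairs, d_key), one row per
    matched prompt pair, containing mean key vectors at a chosen layer/head.
    Returns the number of singular values needed to capture `energy` of the
    variance of the pairwise differences.
    """
    diffs = keys_deceptive - keys_honest                # (n_pairs, d_key)
    diffs = diffs - diffs.mean(axis=0, keepdims=True)   # center before SVD
    s = np.linalg.svd(diffs, compute_uv=False)          # singular values, descending
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

# Synthetic check with a planted rank-3 signal in a 128-dimensional key space:
rng = np.random.default_rng(0)
honest = rng.normal(size=(500, 128))
planted = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 128))   # rank-3 shift
deceptive = honest + planted + 0.1 * rng.normal(size=(500, 128))
print(deception_subspace_rank(honest, deceptive))  # small relative to d_key = 128
```

A small estimated rank relative to the key dimension D is the empirical signature T7 predicts; a rank approaching D would be the weaker-result scenario flagged in the risks below.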
Paper A referee pushback on measurability conditions. The dominated-EU construction requires a Savage P4-consistent prior assigning positive measure to the dominated-act-favoring event. Measure-theoretic referees may demand stronger regularity. Mitigation: pre-circulation to external formal-methods reviewers (Topos orbit + MIRI Agent Foundations alumni).
T7 empirical validation finds larger-than-predicted deception subspace. If the finite-rank bound is non-vacuous but larger than hoped, the positive alignment implication weakens. Mitigation: the theorem still holds; the empirical finding becomes the result.
RLHF aggregation gap fails to replicate on second dataset. The 5× gap might be MultiPref-specific. Mitigation: preregistered replication on HH-RLHF + UltraFeedback before submission.
AF post lands without traction. Impossibility-theorem posts sometimes under-perform pure philosophy posts. Mitigation: explicit EJT callback in title + concrete math in first 300 words.
$0 in prior grants for this specific program. Parallel applications in progress:
SFF 2026 Speculation Grant ($50K ask, same program)
LTFF (EA Funds Long-Term Future Fund), parallel submission
Manifund regrantor pool (stackable, $5–25K range targeted)
Will reduce SFF ask proportionally if LTFF or Manifund regrantor funding materializes before SFF decision.