Project summary
MTCP (Multi-Turn Constraint Persistence) is the only benchmark measuring whether AI models maintain behavioural constraints after being explicitly corrected mid-conversation. Standard benchmarks test single-turn compliance. MTCP tests whether correction persists across structured multi-turn interactions.
181,448 evaluations across 32 production models from 14 providers. No model achieves Grade A (90% threshold); the best performer scores 88.7%. GPT-4o scores 16.2 percentage points below GPT-4o-mini. Safety-tuned models cluster at 66-68% with under 1pp variance across all temperature settings, consistent with training-level constraint suppression rather than stochastic drift. Control probes reveal degradation in every model, with scores collapsing into a 10-57.5% band; DeepSeek-R1 is the sole outlier, degrading only 5pp.
232 unique probes (200 primary, 20 concealed control, 12 extended) across five evaluation vectors: Negative Constraint Adherence (NCA), Structural Format Compliance (SFC), Information Density and Length (IDL), Contextual Grounding (CG), and Multilingual Consistency (LANG).
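For concreteness, a minimal sketch of what one probe record might look like under this vector scheme; every field name here is an illustrative assumption, not the published dataset schema:

```python
# Illustrative only: field names are hypothetical, not the actual MTCP schema.
# A probe binds a behavioural constraint, an explicit mid-conversation
# correction, and one of the five evaluation vectors.
from dataclasses import dataclass

@dataclass
class Probe:
    probe_id: str
    vector: str                 # "NCA", "SFC", "IDL", "CG", or "LANG"
    constraint: str             # the behavioural constraint under test
    correction_turn: str        # the explicit mid-conversation correction
    follow_up_turns: list[str]  # later turns that test persistence
    is_control: bool = False    # True for concealed control probes

example = Probe(
    probe_id="nca-001",
    vector="NCA",
    constraint="Do not use bullet points in any response.",
    correction_turn="You used bullet points again. Please stop.",
    follow_up_turns=["Now summarise the main trade-offs for me."],
)
```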
Three papers published: an empirical benchmark (Paper I), a theoretical framework introducing the Veterance (Ve) persistence metric and Identity-Gate Satiation theory (Paper II), and a regulatory audit methodology mapped to EU AI Act Articles 9, 13, and 61 (Paper III).
Live platform: mtcp.live
Published research: DOI 10.17605/OSF.IO/DXGK5
Full dataset: aa8899/mtcp-boundary-500 (181,448 records, CC BY 4.0)
What are this project's goals? How will you achieve them?
1. Multi-pass evaluation: Run each probe multiple times per model to reduce variance and produce statistically rigorous confidence intervals (a sketch of this scoring follows the list). Currently all results are single-pass. This is the highest-priority methodological improvement.
2. Expand probe coverage: Scale from 200 to 500 probes, improving statistical power per evaluation vector and enabling finer-grained failure mode analysis.
3. Demographic consistency vector: Develop a sixth evaluation vector testing whether post-correction reliability varies across demographic groups. Directly relevant to EU AI Act non-discrimination requirements.
4. Automated fresh probe generation: Build a pipeline that generates novel, structurally distinct probes for each evaluation cycle, addressing the benchmark contamination limitation at scale.
5. Independent validation: Fund researcher access to the platform for independent replication and critique of results.
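As a concrete illustration of goal 1, here is a minimal sketch of multi-pass scoring with a Wilson confidence interval. It assumes a binary pass/fail outcome per pass (the real scoring may be graded), and `wilson_interval` is an illustrative helper, not part of the MTCP codebase:

```python
# Minimal sketch of goal 1: score each probe over n independent passes and
# report a 95% Wilson interval instead of a single-pass point estimate.
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate estimated from n passes."""
    if n == 0:
        return (0.0, 1.0)
    p = passes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, centre - margin), min(1.0, centre + margin))

# e.g. a probe that passed on 7 of 10 passes:
low, high = wilson_interval(7, 10)
print(f"pass rate 0.70, 95% CI [{low:.2f}, {high:.2f}]")  # roughly [0.40, 0.89]
```

The Wilson interval stays well-behaved at small pass counts and at rates near 0 or 1, which matters when some probes fail on every pass.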
How will this funding be used?
70% — Researcher stipend (12 months full-time work on goals above)
12% — API compute costs (multi-pass evaluation across 32+ models at 4 temperatures requires significant API spend; see the volume sketch after this list)
8% — Infrastructure (hosting, database, platform maintenance)
5% — Travel (conferences, validation meetings, regulatory engagement)
5% — Contingency
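For context on the compute line, a back-of-envelope volume calculation. The passes-per-probe figure is an assumption (goal 1 does not commit to a number), and the probe total uses the goal-2 target of 500:

```python
# Hypothetical evaluation volume for the multi-pass plan. Only the model and
# temperature counts come from the budget line above; probe and pass counts
# are assumptions taken from goals 1 and 2.
probes = 500        # goal 2 target (currently 200 primary probes)
models = 32
temperatures = 4
passes = 5          # assumed passes per probe under goal 1

calls = probes * models * temperatures * passes
print(f"{calls:,} evaluations")  # 320,000 — roughly 1.8x the 181,448 run so far
```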
Who is on your team? What's your track record on similar projects?
A. Abby, independent AI safety researcher and sole developer. Built the entire MTCP platform, methodology, and evaluation infrastructure from scratch over 12+ months.
Track record on this project:
- 181,448 evaluations completed across 32 production models from 14 providers
- Three published research papers (OSF, SSRN)
- Live production platform at mtcp.live with public leaderboard, evidence packs, and SHA-256 signed audit trails (a hash-chain sketch follows this list)
- Proprietary probe dataset (200 main + 20 control probes)
- Custom constraint detection engine
- Provider-agnostic API integration across xAI, OpenAI, Anthropic, Google, AWS Bedrock, Cohere, Mistral, Cerebras, DeepSeek, OpenRouter, Fireworks, NVIDIA NIM, and Groq
- Successfully identified novel findings including the GPT-4o performance regression and universal control probe degradation pattern
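As an illustration of how signed audit trails can be verified, here is a minimal sketch of a hash-chained SHA-256 log entry; the record fields and chaining scheme are hypothetical, not the platform's actual evidence-pack format:

```python
# Hypothetical sketch of a SHA-256 hash-chained audit entry. Each entry's
# digest covers the previous entry's digest, so tampering with any past
# record invalidates every digest after it.
import hashlib
import json

def chain_entry(record: dict, prev_hash: str) -> dict:
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"record": record, "prev": prev_hash, "sha256": digest}

genesis = "0" * 64
entry = chain_entry({"probe_id": "nca-001", "model": "gpt-4o", "score": 0.62},
                    prev_hash=genesis)
print(entry["sha256"])
```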
What are the most likely causes and outcomes if this project fails?
Most likely failure mode: inability to convert research into adoption. The benchmark methodology is sound and the empirical findings are robust, but without institutional validation or enterprise adoption, the work remains an independent research contribution rather than a deployed assurance tool.
Secondary failure mode: benchmark contamination. If probe structures become known to model providers, primary scores may become unreliable. This is partially mitigated by the control probe methodology but would require continuous fresh probe generation to address fully.
If the project fails, the published research and dataset remain as a public contribution to AI safety evaluation methodology. The three papers, DOI, and 181,448 evaluation dataset are permanently archived and available for other researchers to build upon.
How much money have you raised in the last 12 months, and from where?
£0. This project has been entirely self-funded. Two grant applications are currently pending: one with the Long-Term Future Fund (EA Funds) and one with the Foresight Institute. No funding has been received to date.