Manifund
A.Abby

@Mtcpaa

Independent AI safety researcher. Built MTCP, a benchmark measuring post-correction constraint persistence in LLMs: 181,448 evaluations across 35 production models. Three published papers. DOI: 10.17605/OSF.IO/DXGK5.

https://mtcp.live
$0 total balance
$0 charity balance
$0 cash balance

$0 in pending offers

About Me

I research whether AI models maintain behavioural constraints after being corrected mid-conversation. No existing benchmark tested this. I built one, ran it 181,448 times across 35 models from 14 providers, and found that no model achieves reliable post-correction persistence. The findings have direct implications for EU AI Act compliance and enterprise AI deployment assurance. Published on OSF and SSRN.

Projects

MTCP: Post-Correction Persistence Benchmark for Frontier LLMs

pending admin approval

Comments

MTCP: Post-Correction Persistence Benchmark for Frontier LLMs

A.Abby

13 days ago

Post-correction reliability maps directly onto adversarial robustness and AI control research priorities. Temperature-invariant failure in safety-tuned models (variance under 1pp across T=0.0 to T=0.8) is consistent with training-level constraint suppression rather than stochastic sampling drift. This is a different failure mode from jailbreaks and prompt injection: it persists under normal deployment conditions, with no adversarial input required. Happy to discuss methodology with anyone working on alignment training robustness.
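The temperature-invariance argument above can be sketched as a simple check: if a model's post-correction pass rate barely moves as sampling temperature varies, the failure is unlikely to be sampling noise. A minimal sketch, with illustrative pass rates (the numbers below are hypothetical, not MTCP results):

```python
# Hypothetical per-temperature pass rates (fraction of probes where the
# constraint held after correction) for one model.
pass_rates = {0.0: 0.612, 0.2: 0.609, 0.4: 0.615, 0.8: 0.607}

# Spread across temperatures, in percentage points (pp).
spread_pp = (max(pass_rates.values()) - min(pass_rates.values())) * 100
print(f"spread across T=0.0 to T=0.8: {spread_pp:.1f}pp")

# Under the comment's reading, a spread under ~1pp points to a
# training-level (architectural) failure rather than sampling drift.
print("temperature-invariant" if spread_pp < 1.0 else "sampling-sensitive")
```

The 1pp threshold here is taken from the comment's observed variance, not a formal cutoff.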

MTCP: Post-Correction Persistence Benchmark for Frontier LLMs

A.Abby

19 days ago

The MTCP benchmark isolates post-correction reliability: the capacity to maintain constraints after explicit correction within the same conversation. 181,448 evaluations across 32 production models show systematic failure:

• Best: 88.7% (grok-3-mini, Grade B)
• Flagship regression: GPT-4o scores 16.2pp below GPT-4o-mini
• Control-probe degradation: all models collapse to 10-57.5% on fresh probes
• Exception: DeepSeek-R1 (5pp drop, contamination-resistant)

Framework published at OSF.IO/DXGK5 (CC-BY-NC 4.0):

• MTCP three-turn protocol across five vectors (NCA, SFC, IDL, CG, LANG)
• IGS theory distinguishing architectural vs. stochastic failures
• Sigma-Forensics producing EU AI Act-compliant audit reports

Temperature invariance in Claude models (0.8pp variance, T=0.0 to T=0.8) suggests architectural constraint suppression, versus sampling-driven failures in LLaMA models (7.1pp variance).

Funding advances: probe diversity, peer review, demographic vector, statistical rigor.

Platform: mtcp.live
Contact: admin@mtcp.live
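The three-turn protocol described above (constraint, explicit correction, fresh probe) can be sketched as a data structure plus a pass-rate score. Everything below is a hypothetical illustration of that shape — the message wording, the `Probe` class, and the `score` helper are not MTCP's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Probe:
    """One three-turn post-correction persistence probe (illustrative)."""
    vector: str            # one of the five vectors: NCA, SFC, IDL, CG, LANG
    turn1_constraint: str  # turn stating the behavioural constraint
    turn2_correction: str  # explicit mid-conversation correction after a slip
    turn3_probe: str       # fresh request that tempts a new violation

def score(results: list[bool]) -> float:
    """Fraction of probes where the constraint held after correction."""
    return sum(results) / len(results)

probe = Probe(
    vector="LANG",
    turn1_constraint="Reply only in French for the rest of this chat.",
    turn2_correction="You just answered in English; please return to French.",
    turn3_probe="What's the capital of Australia?",
)

# Illustrative pass/fail outcomes for four runs of this probe on one model:
print(f"{probe.vector} persistence: {score([True, True, False, True]):.1%}")
```

A model would be graded by aggregating such per-probe outcomes across vectors; the grade bands (e.g. the "Grade B" mentioned above) are assumed to be thresholds on this aggregate.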