CRepair: Cross-Model Replication of LLM Self-Repair Benchmark

Project summary

CRepair is an open-source benchmark measuring whether AI agents can detect, repair, and verify their own coherence failures. Two preprints are already published (Zenodo DOIs: 10.5281/zenodo.20283434 and 10.5281/zenodo.20324352) with code live at github.com/kaminovs/crepair.

The core finding: under standard conditions, LLMs achieve a 0% verification rate. They detect and repair failures but never close the loop by verifying the repair. A structured runtime intervention raises this to near-universal verification, while generic re-prompting has little effect (mean improvement of +0.333 for the structured intervention vs +0.051 for generic re-prompting).

All current results are Claude Sonnet only. This project funds cross-model replication on GPT-4o and Gemini — the single experiment that turns a Claude-specific pilot into a general finding about how current LLMs handle self-correction.

What are this project's goals? How will you achieve them?

Goal: Determine whether the CRepair dissociation finding (detection, repair, and verification are separable capabilities) holds across model families.

The existing benchmark runs 13 scenarios across 3 conditions (baseline, generic retry, CRepair wrapper) and takes approximately 2-3 hours per model. I will run the identical protocol on GPT-4o and Gemini, producing a 3-model comparison dataset.
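As a rough sketch of the run grid described above, one pass over a single model can be enumerated as follows. The condition labels and the `scenario_XX` identifiers are placeholders chosen for illustration, not names taken from the CRepair repository, and `run_grid` is a hypothetical helper.

```python
from itertools import product

# Hypothetical labels: the real scenario IDs and condition names live in the CRepair repo.
CONDITIONS = ["baseline", "generic_retry", "crepair_wrapper"]
SCENARIOS = [f"scenario_{i:02d}" for i in range(1, 14)]  # 13 scenarios


def run_grid(models, runs=1):
    """Return every (model, run, scenario, condition) tuple that needs executing."""
    return list(product(models, range(1, runs + 1), SCENARIOS, CONDITIONS))


single_pass = run_grid(["gpt-4o"])
print(len(single_pass))  # 39 scenario-condition pairs per model per pass
```

Enumerating the grid up front makes it straightforward to log each pair as it completes and to resume a partially finished run.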

Deliverables:

- Paper 3: cross-model comparison preprint (Zenodo)

- Updated leaderboard on GitHub with all three models

- Public dataset of all scenario responses and judge scores

If the pattern holds across models, the implication is that the verification gap is a property of how current autoregressive LLMs are trained — not a Claude quirk — and that structured repair scaffolding is a viable general intervention.

How will this funding be used?

- GPT-4o API costs for a 10-run experiment (baseline + retry + wrapper, 13 scenarios): ~$300

- Gemini API costs for the same protocol: ~$300

- Time allocation: approximately 50 hours of focused research work to run experiments, analyse results, and write Paper 3

- Remainder covers compute for any follow-up ablations and Zenodo archiving

Minimum funding ($1,000) covers API costs and produces the cross-model dataset and Paper 3.

Full funding ($4,000) additionally covers a human spot-check validation study (20 artifacts manually reviewed by an independent rater) and a LessWrong/EA Forum writeup with full results.

Who is on your team? What's your track record on similar projects?

Independent researcher: Sergejs Kaminovs (ORCID: 0009-0004-1711-3455), UK.

Track record on this project:

- Built CRepair benchmark from scratch: 13 scenarios across 6 failure types, typed schema, LLM-as-judge evaluator, results logging, visualisation

- Published Paper 1 (benchmark) and Paper 2 (wrapper ablation study) as open preprints with DOIs

- Ran 52 scenario-condition pairs (Paper 1) and 117 scenario-condition pairs across 3 independent runs (Paper 2)

- All code open source: github.com/kaminovs/crepair

Day job: Senior data analyst in the casino industry. This research is conducted independently in my own time.

What are the most likely causes and outcomes if this project fails?

Most likely failure mode: API costs exceed estimates because of longer responses or additional debugging runs, leaving the experiment incomplete.

Mitigation: Both GPT-4o and Gemini have predictable pricing. At current rates, the protocol is 10 runs × 13 scenarios × 3 conditions = 390 pairs per model, against a budget of ~$300 per model.
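The arithmetic behind this estimate can be checked with a short sketch. It assumes the ~$300 per-model figure from the budget section. The per-pair number it prints is the spending ceiling implied by that budget, not a measured API cost, and `budget_per_pair` is a hypothetical helper.

```python
RUNS = 10
SCENARIOS = 13
CONDITIONS = 3
BUDGET_PER_MODEL_USD = 300  # approximate per-model figure from the budget section


def budget_per_pair(budget_usd, pairs):
    """Average spend available per scenario-condition pair before the budget is exceeded."""
    return budget_usd / pairs


pairs_per_model = RUNS * SCENARIOS * CONDITIONS
print(pairs_per_model)  # 390
print(round(budget_per_pair(BUDGET_PER_MODEL_USD, pairs_per_model), 2))  # 0.77
```

Tracking actual spend per pair against this ceiling during the first run would show early whether the estimate holds.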

If cross-model results are negative (no replication): this is also a publishable finding. A null result — "the CRepair effect does not generalise beyond Claude" — is informative and would be reported honestly as Paper 3.

If I am unable to complete the work: all existing code and data remain open source and any funded portion of the experiment would be published as-is.

How much money have you raised in the last 12 months, and from where?

$0. This is my first grant application. All prior work was self-funded.