CRepair is an open-source benchmark measuring whether AI agents can detect, repair, and verify their own coherence failures. Two preprints are already published (Zenodo DOIs: 10.5281/zenodo.20283434 and 10.5281/zenodo.20324352) with code live at github.com/kaminovs/crepair.
The core finding: under standard conditions, Claude Sonnet achieves a 0% verification rate; it detects and repairs failures but never closes the loop. A structured runtime intervention raises this to near-universal verification, while generic re-prompting barely moves the needle (mean improvement of +0.333 for the structured intervention vs +0.051 for generic re-prompting).
All current results are Claude Sonnet only. This project funds cross-model replication on GPT-4o and Gemini — the single experiment that turns a Claude-specific pilot into a general finding about how current LLMs handle self-correction.
Goal: Determine whether the CRepair dissociation finding (detection, repair, and verification are separable capabilities) holds across model families.
The existing benchmark runs 13 scenarios across 3 conditions (baseline, generic retry, CRepair wrapper) and takes approximately 2-3 hours per model. I will run the identical protocol on GPT-4o and Gemini, producing a 3-model comparison dataset.
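As a rough illustration of the protocol's shape, the grid of scenario-condition pairs could be enumerated as below. This is a minimal sketch, not the CRepair codebase: the scenario IDs, the model labels, and the `build_grid` helper are hypothetical placeholders, and only the counts (13 scenarios, 3 conditions, 3 models) come from the text above.

```python
from itertools import product

# Condition names follow the proposal; the identifiers themselves are assumed.
CONDITIONS = ("baseline", "generic_retry", "crepair_wrapper")
# Placeholder IDs standing in for the 13 real scenarios.
SCENARIOS = tuple(f"scenario_{i:02d}" for i in range(1, 14))
# Human-readable labels only, not actual API model strings.
MODELS = ("claude-sonnet", "gpt-4o", "gemini")


def build_grid(models, scenarios, conditions, runs=1):
    """Return every (model, run, scenario, condition) cell of the protocol."""
    return [
        {"model": m, "run": r, "scenario": s, "condition": c}
        for m, r, s, c in product(models, range(1, runs + 1), scenarios, conditions)
    ]


grid = build_grid(MODELS, SCENARIOS, CONDITIONS)
pairs_per_model = len(grid) // len(MODELS)
print(pairs_per_model)  # 13 scenarios x 3 conditions = 39 pairs per model per run
```

Keeping the grid as plain records makes it straightforward to check that each model receives exactly the same set of scenario-condition pairs before any API call is made.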
Deliverables:
- Paper 3: cross-model comparison preprint (Zenodo)
- Updated leaderboard on GitHub with all three models
- Public dataset of all scenario responses and judge scores
If the pattern holds across models, the implication is that the verification gap is a property of how current autoregressive LLMs are trained — not a Claude quirk — and that structured repair scaffolding is a viable general intervention.
- GPT-4o API costs for a 10-run experiment (baseline + retry + wrapper, 13 scenarios): ~$300
- Gemini API costs for the same protocol: ~$300
- Time allocation: approximately 50 hours of focused research work to run experiments, analyse results, and write Paper 3
- Remainder covers compute for any follow-up ablations and Zenodo archiving
Minimum funding ($1,000) covers API costs and produces the cross-model dataset and Paper 3.
Full funding ($4,000) additionally covers a human spot-check validation study (20 artifacts manually reviewed by an independent rater) and a LessWrong/EA Forum writeup with full results.
Independent researcher: Sergejs Kaminovs (ORCID: 0009-0004-1711-3455), UK.
Track record on this project:
- Built CRepair benchmark from scratch: 13 scenarios across 6 failure types, typed schema, LLM-as-judge evaluator, results logging, visualisation
- Published Paper 1 (benchmark) and Paper 2 (wrapper ablation study) as open preprints with DOIs
- Ran 52 scenario-condition pairs (Paper 1) and 117 scenario-condition pairs across 3 independent runs (Paper 2)
- All code open source: github.com/kaminovs/crepair
Day job: Senior data analyst in the casino industry. This research is conducted independently in my own time.
Most likely failure mode: API costs exceed estimates due to longer responses or additional debugging runs, making the experiment incomplete.
Mitigation: Both GPT-4o and Gemini have predictable pricing, and the workload is fixed: 10 runs × 13 scenarios × 3 conditions = 390 pairs per model.
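The pair count and the ~$300 per-model API budget together imply a spending ceiling per pair, which gives a simple check against overruns. The sketch below is illustrative only: the function names and the `debug_overhead` parameter are assumptions, and no actual provider pricing is encoded. The measured mean cost per pair would have to come from a small pilot run.

```python
RUNS = 10
SCENARIOS = 13
CONDITIONS = 3
API_BUDGET_PER_MODEL_USD = 300.0  # the ~$300 per-model figure from the budget


def pairs_per_model(runs, scenarios, conditions):
    """Number of scenario-condition pairs evaluated for one model."""
    return runs * scenarios * conditions


def budget_ceiling_per_pair(budget_usd, pairs):
    """Largest mean cost per pair that still fits within the budget."""
    return budget_usd / pairs


def projected_cost(pairs, mean_cost_per_pair_usd, debug_overhead=0.0):
    """Projected spend; debug_overhead is an assumed fraction of extra reruns."""
    return pairs * mean_cost_per_pair_usd * (1.0 + debug_overhead)


pairs = pairs_per_model(RUNS, SCENARIOS, CONDITIONS)
ceiling = budget_ceiling_per_pair(API_BUDGET_PER_MODEL_USD, pairs)
print(pairs)              # 390
print(round(ceiling, 2))  # 0.77 (USD per pair, including any judge calls)
```

If a pilot shows the mean cost per pair approaching the ceiling, the number of runs can be reduced before the full experiment starts rather than partway through.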
If cross-model results are negative (no replication): this is also a publishable finding. A null result — "the CRepair effect does not generalise beyond Claude" — is informative and would be reported honestly as Paper 3.
If I am unable to complete the work: all existing code and data remain open source and any funded portion of the experiment would be published as-is.
$0. This is my first grant application. All prior work was self-funded.