Project summary
The Epistemic Curie Benchmark (ECB) is the first quantitative framework measuring when LLMs surrender independent reasoning under authority pressure — the failure mode that capability benchmarks (MMLU, GPQA, HellaSwag) miss entirely.
ECB v1 measured 7 frontier models across 2,520 prompts. We identify a phase-transition parameter k*: the authority-pressure intensity at which each model's independent reasoning collapses. Higher k* = more epistemically autonomous (the model withstands more pressure before capitulating); lower k* = more sycophantic, folding under lighter pressure.
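A minimal sketch of how a transition point like k* can be located, assuming capitulation rate follows a logistic curve in pressure intensity (consistent with the logistic-regression diagnostics cited below). The data points and function names here are illustrative, not drawn from the ECB pipeline; see the Zenodo package for the actual methodology.

```python
# Illustrative sketch only: fit a logistic curve to capitulation rate vs.
# authority-pressure intensity and read off k* as the 50% crossing point.
# All data below is hypothetical, not from the ECB v1 dataset.
import numpy as np
from scipy.optimize import curve_fit

def logistic(k, k_star, s):
    """P(capitulation) as a function of pressure intensity k."""
    return 1.0 / (1.0 + np.exp(-(k - k_star) / s))

pressure = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])   # pressure levels
cap_rate = np.array([0.02, 0.05, 0.12, 0.45, 0.80, 0.93, 0.98])  # observed rates

(k_star, slope), _ = curve_fit(logistic, pressure, cap_rate, p0=[1.5, 0.3])
print(f"k* ~= {k_star:.2f}")  # pressure intensity at which capitulation crosses 50%
```

A sharper slope parameter indicates a more abrupt collapse (closer to a true phase transition); a shallow slope indicates gradual degradation rather than a threshold effect.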
All data, code, and methodology are open. Paper with full statistical validation (bootstrap CIs, McFadden R², Hosmer-Lemeshow calibration): https://doi.org/10.5281/zenodo.19791329
Why this matters now: As LLMs are deployed into high-authority domains (medical, legal, defense, scientific advisory), epistemic surrender becomes a deployment-blocking risk. To our knowledge, ECB is the only benchmark with a single measurable parameter for this specific failure mode.
What are this project's goals? How will you achieve them?
Three concrete deliverables in the first 4 months of funded work:
1. ECB v2 dataset (Weeks 1-4): Extend the benchmark from 7 to 20+ models, including Claude Opus 4.x, Claude Sonnet 4.x, Gemini 2.5 Pro, Grok 4, the GPT-5 family, Mistral Large 2, the Llama 4 family, DeepSeek-V3, and Qwen3-Max. Full k* measurements + bootstrap confidence intervals for each.
2. Public leaderboard (Weeks 5-8): Launch ect-benchmark.com showing k* scores, calibration plots, and authority-domain breakdowns for all measured models, with an auto-update pipeline for new model releases.
3. ECB v2 paper (Month 3): Release on Zenodo + arXiv with full 20-model dataset, methodological refinements, and head-to-head comparison with adjacent benchmarks (sycophancy-eval, MACHIAVELLI, PRISM).
Months 4-12: Domain extensions — multi-turn dialogue, agentic settings, sycophancy-isolated comparisons, third-party replication kit, and calibration against real-world deployment incident reports.
Success metric: ECT (Epistemic Curie Temperature, the framework behind the k* measurement) becomes a referenced measurement in at least 3 AI safety papers, 2 frontier-lab model cards, or 1 regulatory framework draft within 12 months.
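The bootstrap confidence intervals promised in deliverable 1 could be sketched as below. This is a simplified illustration, assuming the same hypothetical logistic model as above: it resamples aggregated rate points, whereas a per-prompt bootstrap (resampling the 2,520 individual outcomes) is the more rigorous variant; nothing here reflects the published pipeline.

```python
# Illustrative percentile-bootstrap CI for k*: resample the (pressure, rate)
# points with replacement, refit the logistic curve, and take the 2.5/97.5
# percentiles of the refit k* estimates. All data is hypothetical.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

def logistic(k, k_star, s):
    return 1.0 / (1.0 + np.exp(-(k - k_star) / s))

pressure = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
cap_rate = np.array([0.02, 0.05, 0.12, 0.45, 0.80, 0.93, 0.98])

estimates = []
for _ in range(500):
    idx = rng.integers(0, len(pressure), len(pressure))  # resample with replacement
    try:
        (ks, _), _ = curve_fit(logistic, pressure[idx], cap_rate[idx],
                               p0=[1.5, 0.3], maxfev=2000)
        estimates.append(ks)
    except RuntimeError:
        continue  # skip resamples where the fit fails to converge

lo, hi = np.percentile(estimates, [2.5, 97.5])
print(f"95% bootstrap CI for k*: [{lo:.2f}, {hi:.2f}]")
```

With only seven aggregate points the interval is wide and unstable; the per-prompt version, with thousands of resampled binary outcomes, is what makes the v2 CIs meaningful.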
How will this funding be used?
- API costs for the 20-model extension (~5K API calls at ~$0.05 avg, with headroom for retries and long-context prompts): $500
- Domain + hosting for public leaderboard at ect-benchmark.com (12 months Cloudflare Pages + custom domain): $200
- Compute credits for third-party replication kit and agentic-setting extension: $320
- 12 months focused research runway (covers monthly subsistence in Tashkent, no salary overhead): $13,980
Total: $15,000 funding goal. $5,000 minimum covers the v2 dataset + leaderboard launch (deliverables 1 + 2 above) — funded research continues regardless of whether the $15K stretch is reached.
Who is on your team? What's your track record on similar projects?
Sardor Razikov — independent researcher based in Tashkent, Uzbekistan. Inventor of the Epistemic Curie Temperature (ECT) framework.
Built ECB v1 solo: dataset design, 2,520 measurements across 7 frontier models, statistical pipeline (bootstrap CIs, McFadden R², Hosmer-Lemeshow calibration), full reproducibility package, published with Zenodo DOI.
Adjacent track record:
- Kaggle: SPR Mammography 7/371 (Top 1.9%), AIMO3 Top 50, multiple medal-tier finishes
- AMD Developer Hackathon 2026: REPOMIND — large-context coding agent on MI300X, Track 1 submission with upstream AITER FP8 bug report
- Founded UCAR (2023): U-Start Demo Day 1st place + 85M UZS grant; Korea-Uzbekistan Startup Exchange 2nd place in Seoul (2024)
- UJC PMP-43 (Tashkent): youngest student ever admitted to the senior executive project management cohort
- Earlier: Foundation year Leeds Beckett (UK); full scholarship Zhejiang University (China, Automation Engineering)
True independent researcher: no institutional affiliation, no corporate backing. ECB v1 was completed at zero direct cost using free API tiers. Solo execution; informal advisors in Tashkent provide offline guidance only.
What are the most likely causes and outcomes if this project fails?
Three plausible failure modes and how each is mitigated:
1. Frontier labs build internal competing benchmarks before ECB v2 ships. Mitigation: ECB v1 is already published with a citable Zenodo DOI, open data, and open methodology. The "Epistemic Curie Temperature" name and k* notation are on the public record, and first-mover advantage is durable when data and code are open from day one.
2. The 20-model extension doesn't replicate the v1 phase-transition pattern, weakening the central claim. Mitigation: this would itself be a publishable negative result — useful for the field. The methodology, statistical machinery, and dataset remain valuable regardless of whether the k* pattern fully generalizes.
3. Research runway runs out and consulting work slows v2 ship. Mitigation: $5,000 minimum funding covers the highest-priority deliverables (v2 dataset + leaderboard) even if the $15K stretch isn't reached.
Worst realistic outcome: ECB v2 ships with fewer than 20 models but with rigorous methodology — still a meaningful contribution. Catastrophic failure (no shipped output) is unlikely because v1 already exists and v2 extensions are well-scoped.
How much money have you raised in the last 12 months, and from where?
$0 in the last 12 months. ECB v1 — including all 2,520 measurements, statistical pipeline, paper writing, and Zenodo publication — was completed entirely at zero direct cost using free API tiers across OpenAI, Anthropic, Google, Mistral, and DeepSeek. No grants, no salary, no institutional support.
This Manifund application is the first external funding ask for ECB.