Behavioral Audits for Epistemic Reliability

Most AI evaluations check whether outputs are correct. This project checks whether systems only sound correct.

LLMs can generate outputs that are structurally coherent, internally consistent, and highly credible without being grounded in actual system state or real knowledge boundaries. This is not ordinary hallucination. It is a harder-to-detect failure: a model performing transparency without achieving it.

I call this the EC–EpC gap: the difference between Execution Credibility, meaning how well a model completes an audit task, and Epistemological Credibility, meaning how accurately it characterizes what it actually knows. A high EC score means the output looks trustworthy. A low EpC score means it is not. Standard single-axis evaluation cannot see this gap by construction.

This project is already running. I have completed a pilot study across ten systems and five model families using constrained-access evaluation: no API batch access, no institutional compute, and all sessions manually designed and recorded. The pilot produced an 8-class Behavioral Response Classification taxonomy, EC–EpC gap observations across ten deployed systems, and cross-model injection differentials showing that the primary risk under structured prompting is Theory Inflation rather than direct hallucination. In other words, models often construct false epistemic contexts rather than simply fabricate facts.

The primary function of this funded phase is to convert single-researcher EpC estimates into a multi-rater validated measurement instrument. This is the specific gap between pilot evidence and publication-standard evidence.

Why this matters

Agentic AI systems increasingly rely on self-reports to make decisions: which tools to select, when to escalate, how to characterize their own uncertainty, and when to hand off to another component or human overseer. If those self-reports are epistemically unreliable while appearing credible, the oversight signal fails silently.

This failure mode is invisible to capability benchmarks, which measure task completion rather than self-knowledge accuracy. It is also hard to address with interpretability approaches that require internal access unavailable in many real-world deployments. This project is designed for constrained-access audit conditions, where evaluators can observe model behavior but cannot inspect hidden states, proprietary telemetry, or vendor internals.

What I will do

Over six months, I will:

1. Pre-register a task battery and EC/EpC rubric on OSF.

2. Finalize operational definitions for EC, EpC, BRC classes, structured illusion, and theory inflation.

3. Recruit at least three external raters for blinded EpC scoring, targeting κ≥0.70.

4. Run repeated-session validation across systems, including N≥5 rounds per system and at least three session conditions.

5. Build a 40–60 task benchmark covering structured self-audit, boundary disclosure, epistemic reliability, and theory-inflation cases.

6. Produce an EC/EpC scoring rubric with inter-rater reliability documentation.

7. Release a bounded public benchmark package: JSONL task set, rubric, annotated cases, and Zenodo update.

8. Publish a primary research output or public technical report, with target venue selected after validation.

Track record

This is not work that begins after funding. The core methodology is already established.

Public outputs:

- Preprint: https://doi.org/10.5281/zenodo.19879788

- Dataset: https://doi.org/10.5281/zenodo.19881753

- GitHub: https://github.com/sangziwang91-design/model-behavior-observatory

Pilot record:

- 40+ experimental sessions

- ten deployed systems

- five model families

- one applied behavioral audit for a production AI system delivered to an operator

- roughly estimated 450+ cumulative hours across experimentation, analysis, documentation, and system organization

Counterfactual

Without this funding, the project remains a citable pilot: useful as a baseline, but not strong enough to support population-level claims, third-party replication, or publication-standard measurement validity.

With this funding, the pilot becomes a usable evaluation tool: a benchmark others can run, a rubric others can apply, and a failure-mode taxonomy with inter-rater reliability documentation.

Minimum funding vs full funding

Minimum funding: $5,000. At this level, I can complete a reduced validation sprint: finalize the task battery, recruit at least one external rater, run a smaller replication set, and publish a limited public benchmark summary.

Full funding goal: $24,000. At this level, I can complete the full six-month validation phase: N≥50 validation, at least three external raters, inter-rater reliability documentation, bilingual matched subsets, agentic-chain extension, public benchmark package, and updated Zenodo release.

Budget

Full goal: $24,000 over six months.

- Living expenses / reduced clinical hours: $18,000

- API access and compute: $2,400

- Inter-rater evaluator payments: $2,000

- Research materials and tools: $600

- Conference / community feedback: $1,000

The pilot was completed with zero external funding. The requested amount enables scale and public release, not a change of approach.

Bounded disclosure

Task design principles, scoring dimensions, and validation procedures can be publicly described. Exact weighting schemes, trigger thresholds, and intervention heuristics are withheld to preserve audit integrity across future applications. This is a methodological protection, not a limitation of the work.

Behavioral Audits for Epistemic Reliability

Offer to donate

Most AI evaluations check whether outputs are correct. This project checks whether systems only sound correct.