Behavioral Audits for Epistemic Reliability

Most AI evaluations ask whether a model gives the right answer. This project asks a narrower but increasingly important question:

Can a model accurately report what it knows, what it does not know, and why its answer should or should not be trusted?

Large language models can produce responses that are coherent, polished, internally consistent, and highly credible while misrepresenting their own epistemic position. This is not ordinary hallucination. It is a harder-to-detect failure mode: the model appears transparent without actually being transparent.

I study this through the EC–EpC gap:

Execution Credibility (EC): how well a model completes the visible task.
Epistemological Credibility (EpC): how accurately the model characterizes its own knowledge limits, uncertainty, and evidential basis.

A model can score high on EC while scoring low on EpC. In practice, this means an output may look trustworthy while failing as an oversight signal.
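
As a rough illustration of the gap, the sketch below assumes that EC and EpC are each scored on a 0 to 1 scale and that the gap is the simple difference EC minus EpC. Both the scale and the difference form are assumptions made for illustration; this proposal does not fix a scoring formula. The names `ScoredResponse`, `ec_epc_gap`, and `flag_high_gap`, and the 0.3 threshold, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredResponse:
    """One rated model response (hypothetical structure, not the project's schema)."""
    response_id: str
    ec: float   # Execution Credibility, assumed to lie in [0, 1]
    epc: float  # Epistemological Credibility, assumed to lie in [0, 1]


def ec_epc_gap(item: ScoredResponse) -> float:
    """Return EC minus EpC; a positive value means the output looks better than its self-report."""
    for name, value in (("ec", item.ec), ("epc", item.epc)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return item.ec - item.epc


def flag_high_gap(items, threshold=0.3):
    """Return IDs of responses whose gap meets an illustrative threshold (0.3 is arbitrary)."""
    return [i.response_id for i in items if ec_epc_gap(i) >= threshold]


if __name__ == "__main__":
    sample = [
        ScoredResponse("r1", ec=0.9, epc=0.4),  # polished answer, weak self-report
        ScoredResponse("r2", ec=0.8, epc=0.8),  # no gap
    ]
    print(flag_high_gap(sample))
```

In this toy example only the first response would be flagged: it completes the task well but characterizes its own epistemic position poorly, which is the case the project is concerned with.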

This project will convert an existing single-researcher pilot into a small public evidence package and a multi-rater validation instrument.

Why this matters

Agentic AI systems increasingly rely on self-reports:

whether a tool call succeeded,
whether uncertainty is high,
whether escalation is needed,
whether a prior step was grounded,
whether a result is ready for human use.

If these self-reports are epistemically unreliable while still sounding credible, oversight can fail silently.

This failure mode is difficult to detect with standard capability benchmarks because those benchmarks mainly measure task completion. It is also difficult to address with interpretability methods when evaluators do not have access to model internals, proprietary telemetry, or vendor logs.

This project is designed for constrained-access audit conditions: situations where evaluators can observe model behavior but cannot inspect hidden states or internal system telemetry.

Current status

This work is already underway.

I have completed a pilot across:

ten deployed AI systems,
five model families,
40+ experimental sessions,
all run as manually designed and recorded constrained-access evaluations.

The pilot produced:

an 8-class Behavioral Response Classification taxonomy,
EC–EpC gap observations across deployed systems,
cross-model behavioral differentials under structured prompting,
examples of structured illusion and theory inflation,
public research records and dataset documentation.

This project also builds on three existing research artifacts:

Evaluation–Environment Coupling, a theory paper arguing that LLM evaluation should be treated not only as measurement, but as an interaction environment that can shape the behavior it observes.
Testing as Runtime, a framework paper extending this idea from prompt-level evaluation to runtime-level evidence production, including traceability, behavioral metrics, user-state cues, and evidence chains.
GEO-LINE-X, a pilot dataset containing structured stimulus items, repeated response units, directive conditions, and multi-scorer behavioral scoring.

Together, these materials form the basis for the next phase: external-rater validation, reliability documentation, and a bounded public benchmark package.

Selected public anchors:

Auditing Epistemological Credibility in LLM Self-Reports
https://doi.org/10.5281/zenodo.19879788
Cross-Model Behavioral Cartography Dataset
https://doi.org/10.5281/zenodo.19881753
Dynamic Rule Spectrum System
https://doi.org/10.5281/zenodo.20086355
Interaction Ecology Framework
https://doi.org/10.5281/zenodo.20087158
Toward ABS-1.1
https://doi.org/10.5281/zenodo.20087587
Convergence Inertia in Large Language Models
https://doi.org/10.5281/zenodo.20087809
MSI-AUDIT-001: Structured Behavioral Audit Dataset for Medical AI Hallucination
https://doi.org/10.5281/zenodo.20088014

Public repository:
https://github.com/sangziwang91-design/model-behavior-observatory

The method already exists. The next step is to make it externally checkable.

What funding enables

The funded phase is not a new theory-building phase. It is a validation and external-checkability phase.

Funding will turn the pilot into a public, bounded evidence package.

Minimum funding: $5,000

At the minimum level, I will complete a reduced validation sprint and release:

A finalized EC–EpC / BRC task battery.
A public JSONL task set.
A draft scoring rubric.
A small annotated evidence package.
At least one external-rater scoring pass.
A public technical summary or Zenodo update.
A clear limitation statement separating pilot evidence from validated measurement claims.

This minimum version is designed to leave a useful artifact even if larger validation is not funded.

Full funding goal: $24,000

At the full level, I will complete a six-month validation phase:

Pre-register the task battery and scoring plan.
Finalize operational definitions for EC, EpC, BRC classes, structured illusion, and theory inflation.
Recruit at least three external raters for blinded EpC scoring.
Target inter-rater reliability of κ ≥ 0.70 where feasible.
Run at least 50 validation sessions or task-runs across model systems and conditions.
Build a 40–60 task benchmark covering self-audit, boundary disclosure, epistemic reliability, and theory-inflation cases.
Release a bounded public benchmark package with task schema, rubric, annotated examples, and dataset summary.
Produce a public technical report or submission-ready research manuscript.
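
The κ ≥ 0.70 target could be checked along the following lines. This is a minimal sketch that assumes Cohen's kappa computed for each pair of raters over categorical labels. The proposal does not say which agreement statistic will be used (Fleiss' kappa and Krippendorff's alpha are common alternatives for three or more raters), and the function names, rater names, and labels here are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items with categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined (0/0).
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def pairwise_kappas(ratings):
    """Map each rater pair to its kappa; `ratings` is {rater_name: [label, ...]}."""
    return {
        (r1, r2): cohens_kappa(ratings[r1], ratings[r2])
        for r1, r2 in combinations(sorted(ratings), 2)
    }


if __name__ == "__main__":
    example = {
        "rater_a": ["low", "low", "high", "high"],
        "rater_b": ["low", "low", "high", "low"],
        "rater_c": ["low", "low", "high", "high"],
    }
    for pair, kappa in pairwise_kappas(example).items():
        print(pair, round(kappa, 2), "meets target" if kappa >= 0.70 else "below target")
```

Reporting each pair separately, rather than a single pooled number, makes it easier to see whether low agreement comes from one rater or from the rubric itself.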

Research outputs

Expected outputs:

task_set.jsonl
scoring_rubric.md
annotated_cases.jsonl
brc_taxonomy.md
ec_epc_gap_summary.md
inter_rater_reliability_report.md
dataset_summary.md
updated Zenodo release
public technical report or manuscript draft

The goal is not to release a full internal audit system. The goal is to release enough structured evidence for others to inspect, critique, reuse, or extend the measurement approach.
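
To give a sense of what a file like task_set.jsonl might contain, the sketch below writes and reads tasks in JSON Lines form (one JSON object per line) and rejects records that lack required fields. The field names `task_id`, `category`, and `prompt`, the example prompt, and the helper functions are hypothetical placeholders; the real task schema is one of the deliverables and is not defined here.

```python
import json

# Hypothetical field names; the actual schema is not specified in this proposal.
REQUIRED_FIELDS = {"task_id", "category", "prompt"}


def dump_tasks(tasks):
    """Serialize task dicts as JSONL text: one JSON object per line."""
    return "\n".join(json.dumps(t, ensure_ascii=False, sort_keys=True) for t in tasks)


def load_tasks(jsonl_text):
    """Parse JSONL text and reject records missing any required field."""
    tasks = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        tasks.append(record)
    return tasks


if __name__ == "__main__":
    example_tasks = [
        {
            "task_id": "demo-001",  # made-up identifier
            "category": "boundary_disclosure",
            "prompt": "State what you cannot verify about your previous answer.",
        }
    ]
    text = dump_tasks(example_tasks)
    print(text)
    print(load_tasks(text) == example_tasks)
```

Validating on load is a small design choice that lets other evaluators detect a malformed or truncated task file before running any sessions against it.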

Budget

Full goal: $24,000 over six months

Protected research time / reduced clinical hours: $18,000
API access and compute: $2,400
External rater compensation: $2,000
Research tools, storage, and materials: $600
Community feedback / conference or workshop participation: $1,000

Minimum goal: $5,000

At the minimum level, funds will prioritize:

local compute and storage,
API budget,
one external-rater pass,
public evidence package preparation,
Zenodo / GitHub / technical documentation updates.

The pilot was completed without external funding. Funding enables validation, external scoring, and a release-quality public package, not a change of research direction.

Why this is tractable

This project is intentionally small.

It does not require model internals, proprietary logs, institutional compute, or access to vendor telemetry. It uses constrained-access behavioral evaluation: prompts, model outputs, repeated sessions, scoring rubrics, and external rater comparison.

The main uncertainty is not whether the pilot method can be run. It has already been run.

The main uncertainty is whether the EC–EpC rubric can become reliable enough for other evaluators to apply. That is exactly what this funded phase tests.

Boundaries and disclosure

This project will publicly release:

task schemas,
task examples,
scoring definitions,
anonymized annotated cases,
validation summaries,
limitation statements.

It will not publicly release:

private trigger heuristics,
exact weighting schemes that would make future audits easy to game,
private internal workflow logic,
non-consented raw private conversations.

This is a bounded-disclosure research project. The purpose is to make the measurement approach inspectable while preserving audit integrity.

Counterfactual

Without funding, this work remains a useful but limited pilot: citable, suggestive, and already documented, but not strong enough to support broader claims about measurement validity or external reproducibility.

With funding, the project becomes a small public evaluation artifact:

a task set others can run,
a rubric others can inspect,
an evidence package others can critique,
and a clearer basis for future work on epistemic reliability in agentic AI systems.

Success criteria

The project will count as successful if it produces:

A public task set and schema.
A usable EC–EpC / BRC scoring rubric.
An annotated evidence package.
At least one external-rater scoring pass in the minimum version, or blinded scoring by at least three external raters in the full version.
A documented reliability analysis, or a clear explanation of why reliability was not achieved.
A public technical report, preprint, or dataset update.
A limitation statement that clearly separates pilot findings from validated claims.

Even a negative result is useful: if external raters cannot apply the rubric reliably, that finding will directly inform how epistemic reliability should and should not be measured.