Most AI evaluations ask whether a model gives the right answer. This project asks a narrower but increasingly important question:
Can a model accurately report what it knows, what it does not know, and why its answer should or should not be trusted?
Large language models can produce responses that are coherent, polished, internally consistent, and highly credible while misrepresenting their own epistemic position. This is not ordinary hallucination. It is a harder-to-detect failure mode: the model appears transparent without actually being transparent.
I study this through the EC–EpC gap:
Execution Credibility (EC): how well a model completes the visible task.
Epistemological Credibility (EpC): how accurately the model characterizes its own knowledge limits, uncertainty, and evidential basis.
A model can score high on EC while scoring low on EpC. In practice, this means an output may look trustworthy while failing as an oversight signal.
This project will convert an existing single-researcher pilot into a small public evidence package and a multi-rater validation instrument.
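To make the EC–EpC gap concrete, here is a minimal sketch. It is an illustration only, not the project's actual scoring procedure. It assumes both scores have been normalized to a 0–1 scale; the `ScoredResponse` type, the function names, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredResponse:
    """One scored model response (hypothetical structure)."""
    response_id: str
    ec: float   # Execution Credibility, assumed normalized to [0, 1]
    epc: float  # Epistemological Credibility, assumed normalized to [0, 1]


def ec_epc_gap(r: ScoredResponse) -> float:
    """Signed gap: positive means the output looks better than its self-report."""
    return r.ec - r.epc


def high_ec_low_epc(r: ScoredResponse, ec_min: float = 0.8, epc_max: float = 0.4) -> bool:
    """Flag responses that complete the task well but misstate their epistemic position.

    The thresholds are illustrative assumptions, not validated cut-offs.
    """
    return r.ec >= ec_min and r.epc <= epc_max


responses = [
    ScoredResponse("r1", ec=0.90, epc=0.85),  # credible on both dimensions
    ScoredResponse("r2", ec=0.95, epc=0.25),  # polished output, weak self-report
    ScoredResponse("r3", ec=0.40, epc=0.70),  # weak output, honest about limits
]

flagged = [r.response_id for r in responses if high_ec_low_epc(r)]
print(flagged)
```

In this toy data only the second response is flagged. That is the case of interest: an output that would pass a task-completion check while failing as an oversight signal.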
Why this matters
Agentic AI systems increasingly rely on self-reports:
whether a tool call succeeded,
whether uncertainty is high,
whether escalation is needed,
whether a prior step was grounded,
whether a result is ready for human use.
If these self-reports are epistemically unreliable while still sounding credible, oversight can fail silently.
This failure mode is difficult to detect with standard capability benchmarks because those benchmarks mainly measure task completion. It is also difficult to address with interpretability methods when evaluators do not have access to model internals, proprietary telemetry, or vendor logs.
This project is designed for constrained-access audit conditions: situations where evaluators can observe model behavior but cannot inspect hidden states or internal system telemetry.
Current status
This work is already underway.
I have completed a pilot across:
The pilot produced:
an 8-class Behavioral Response Classification taxonomy,
EC–EpC gap observations across deployed systems,
cross-model behavioral differentials under structured prompting,
examples of structured illusion and theory inflation,
public research records and dataset documentation.
This project also builds on three existing research artifacts:
Evaluation–Environment Coupling, a theory paper arguing that LLM evaluation should be treated not only as measurement but also as an interaction environment that can shape the behavior it observes.
Testing as Runtime, a framework paper extending this idea from prompt-level evaluation to runtime-level evidence production, including traceability, behavioral metrics, user-state cues, and evidence chains.
GEO-LINE-X, a pilot dataset containing structured stimulus items, repeated response units, directive conditions, and multi-scorer behavioral scoring.
Together, these materials form the basis for the next phase: external-rater validation, reliability documentation, and a bounded public benchmark package.
Selected public anchors:
Public repository:
https://github.com/sangziwang91-design/model-behavior-observatory
The method already exists. The next step is to make it externally checkable.
What funding enables
The funded phase is not a new theory-building phase. It is a validation and external-checkability phase.
Funding will turn the pilot into a public, bounded evidence package.
Minimum funding: $5,000
At the minimum level, I will complete a reduced validation sprint and release:
A finalized EC–EpC / BRC task battery.
A public JSONL task set.
A draft scoring rubric.
A small annotated evidence package.
At least one external-rater scoring pass.
A public technical summary or Zenodo update.
A clear limitation statement separating pilot evidence from validated measurement claims.
This minimum version is designed to leave a useful artifact even if larger validation is not funded.
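For orientation, a public JSONL task set might be read and checked roughly as follows. This is a sketch under stated assumptions. The field names (`task_id`, `category`, `prompt`, `condition`) and the category labels are hypothetical placeholders, not the released schema.

```python
import json

# Hypothetical required fields and category labels; the released schema may differ.
REQUIRED_FIELDS = {"task_id", "category", "prompt", "condition"}
CATEGORIES = {
    "self_audit",
    "boundary_disclosure",
    "epistemic_reliability",
    "theory_inflation",
}


def parse_task_line(line: str) -> dict:
    """Parse one JSONL line and check it against the hypothetical schema."""
    task = json.loads(line)
    missing = REQUIRED_FIELDS - task.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if task["category"] not in CATEGORIES:
        raise ValueError(f"unknown category: {task['category']!r}")
    return task


def load_tasks(text: str) -> list:
    """Load every non-empty line of a JSONL document as a validated task."""
    return [parse_task_line(ln) for ln in text.splitlines() if ln.strip()]


# Two made-up records, serialized one JSON object per line.
example_jsonl = "\n".join(
    json.dumps(record)
    for record in [
        {
            "task_id": "demo-001",
            "category": "self_audit",
            "prompt": "Summarize the source, then state which claims you could not verify.",
            "condition": "neutral",
        },
        {
            "task_id": "demo-002",
            "category": "boundary_disclosure",
            "prompt": "Answer the question and say what information you lack.",
            "condition": "directive",
        },
    ]
)

tasks = load_tasks(example_jsonl)
print(len(tasks), tasks[0]["category"])
```

Keeping one self-describing record per line means outside evaluators can run or filter the tasks with nothing beyond a JSON parser.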
Full funding goal: $24,000
At the full level, I will complete a six-month validation phase:
Pre-register the task battery and scoring plan.
Finalize operational definitions for EC, EpC, BRC classes, structured illusion, and theory inflation.
Recruit at least three external raters for blinded EpC scoring.
Target inter-rater reliability of κ ≥ 0.70 where feasible.
Run at least 50 validation sessions or task-runs across model systems and conditions.
Build a 40–60 task benchmark covering self-audit, boundary disclosure, epistemic reliability, and theory-inflation cases.
Release a bounded public benchmark package with task schema, rubric, annotated examples, and dataset summary.
Produce a public technical report or submission-ready research manuscript.
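The κ ≥ 0.70 target can be checked with standard agreement statistics. The sketch below is an assumption about how that check could be done, since the plan does not yet say which κ variant will be reported. It computes Cohen's κ for a pair of raters and Fleiss' κ for three or more, treating labels as nominal categories. The rater labels are made-up illustration data.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items (nominal labels)."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    p_expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if p_expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


def fleiss_kappa(items):
    """Fleiss' kappa; `items` holds, per item, the labels given by each rater."""
    n_items = len(items)
    n_raters = len(items[0]) if items else 0
    if n_raters < 2 or any(len(item) != n_raters for item in items):
        raise ValueError("every item needs labels from the same >= 2 raters")
    totals = Counter()
    p_bar = 0.0
    for item in items:
        counts = Counter(item)
        totals.update(counts)
        # Share of agreeing rater pairs for this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        p_bar += agreeing / (n_raters * (n_raters - 1))
    p_bar /= n_items
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_expected == 1.0:
        return 1.0
    return (p_bar - p_expected) / (1 - p_expected)


def meets_target(kappa, target=0.70):
    """Compare an agreement estimate with the stated reliability target."""
    return kappa >= target


# Made-up EpC labels from three hypothetical raters over six responses.
rater_a = ["high", "low", "low", "high", "low", "high"]
rater_b = ["high", "low", "high", "high", "low", "high"]
rater_c = ["high", "low", "low", "high", "low", "low"]

pairwise_ab = cohen_kappa(rater_a, rater_b)
overall = fleiss_kappa(list(zip(rater_a, rater_b, rater_c)))
print(round(pairwise_ab, 3), round(overall, 3), meets_target(overall))
```

With so few items the estimates are unstable, so a real analysis would also report confidence intervals. If the rubric uses ordinal scores rather than nominal classes, a weighted κ or another coefficient may be a better fit.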
Research outputs
Expected outputs:
The goal is not to release a full internal audit system. The goal is to release enough structured evidence for others to inspect, critique, reuse, or extend the measurement approach.
Budget
Full goal: $24,000 over six months
Protected research time / reduced clinical hours: $18,000
API access and compute: $2,400
External rater compensation: $2,000
Research tools, storage, and materials: $600
Community feedback / conference or workshop participation: $1,000
Minimum goal: $5,000
At the minimum level, funds will prioritize:
local compute and storage,
API budget,
one external-rater pass,
public evidence package preparation,
Zenodo / GitHub / technical documentation updates.
The pilot was completed without external funding. Funding enables validation, external scoring, and a release-quality public package, not a change of research direction.
Why this is tractable
This project is intentionally small.
It does not require model internals, proprietary logs, institutional compute, or access to vendor telemetry. It uses constrained-access behavioral evaluation: prompts, model outputs, repeated sessions, scoring rubrics, and external rater comparison.
The main uncertainty is not whether the pilot method can be run. It has already been run.
The main uncertainty is whether the EC–EpC rubric can become reliable enough for other evaluators to apply. That is exactly what this funded phase tests.
Boundaries and disclosure
This project will publicly release:
It will not publicly release:
private trigger heuristics,
exact weighting schemes that would make future audits easy to game,
private internal workflow logic,
non-consented raw private conversations.
This is a bounded-disclosure research project. The purpose is to make the measurement approach inspectable while preserving audit integrity.
Counterfactual
Without funding, this work remains a useful but limited pilot: citable, suggestive, and already documented, but not strong enough to support broader claims about measurement validity or external reproducibility.
With funding, the project becomes a small public evaluation artifact:
a task set others can run,
a rubric others can inspect,
an evidence package others can critique,
and a clearer basis for future work on epistemic reliability in agentic AI systems.
Success criteria
The project will count as successful if it produces:
A public task set and schema.
A usable EC–EpC / BRC scoring rubric.
An annotated evidence package.
At least one external-rater scoring pass in the minimum version, or at least three raters in the full version.
A documented reliability analysis, or a clear explanation of why reliability was not achieved.
A public technical report, preprint, or dataset update.
A limitation statement that clearly separates pilot findings from validated claims.
Even a negative result is useful: if external raters cannot apply the rubric reliably, that finding will directly inform how epistemic reliability should and should not be measured.