I am requesting $22,000 to build and publish an open-source benchmark that measures whether AI agents actually do what they claim. It uses a system I created and deployed months before Anthropic, Microsoft, or ArbiterOS shipped comparable audit infrastructure.
What are this project's goals? How will you achieve them?
A reproducible benchmark evaluating governance systems on three dimensions: declared-vs-actual alignment (Karma Ledger), tool-selection hallucinations (Voice Audit), and policy enforcement under adversarial prompting (Dharma Rules).
100+ controlled interactions across 5 frameworks (LangChain, CrewAI, AutoGen, OpenAI Agents SDK, PydanticAI), published as an open dataset on Hugging Face.
An arXiv preprint, "Cross-Framework Agent Governance: A Benchmark for Declared-vs-Actual Alignment", with a one-page integration guide for each framework.
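To make "declared-vs-actual alignment" concrete, here is a minimal sketch of one way a single interaction could be scored. The `Interaction` record, the function names, and the choice of Jaccard overlap are assumptions for illustration only; they are not the Karma Ledger's actual schema or the benchmark's final metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """Hypothetical record: tools the agent said it used vs. tools it really called."""
    declared_tools: frozenset[str]
    actual_tools: frozenset[str]


def alignment_score(interaction: Interaction) -> float:
    """Jaccard overlap between declared and actual tool sets (1.0 means exact match)."""
    union = interaction.declared_tools | interaction.actual_tools
    if not union:
        # No tools declared and none called: treat as perfectly aligned.
        return 1.0
    overlap = interaction.declared_tools & interaction.actual_tools
    return len(overlap) / len(union)


def mean_alignment(interactions: list[Interaction]) -> float:
    """Average score over a batch, e.g. all runs for one framework."""
    if not interactions:
        raise ValueError("need at least one interaction")
    return sum(alignment_score(i) for i in interactions) / len(interactions)


if __name__ == "__main__":
    # The agent claims it used search and a calculator but only called search.
    run = Interaction(frozenset({"search", "calculator"}), frozenset({"search"}))
    print(alignment_score(run))
```

Averaging a score like this per framework would support the cross-framework comparison without depending on any one framework's logging format.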
How will this funding be used?
$4,000: Compute + API credits (~50K agent interactions; 80% open-source models, 20% buffer).
$2,000: Benchmark infrastructure (CI, Hugging Face dataset hosting, arXiv fee).
$12,000: Developer time, 3 months part-time (design, execution, analysis, publication).
$2,000: Documentation + dissemination (technical report, blog post, workshop submission).
$2,000: Contingency for follow-up reviewer questions and revisions.
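As a quick arithmetic check, the line items above can be summed to confirm they match the $22,000 request. The dictionary keys are shortened labels chosen for this sketch.

```python
# Line items copied from the budget above, in US dollars.
budget = {
    "Compute + API credits": 4_000,
    "Benchmark infrastructure": 2_000,
    "Developer time": 12_000,
    "Documentation + dissemination": 2_000,
    "Contingency": 2_000,
}

total = sum(budget.values())
# Share of the request that goes to developer time.
developer_share = budget["Developer time"] / total

if __name__ == "__main__":
    print(f"Total: ${total:,}")
    print(f"Developer time share: {developer_share:.1%}")
```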
Who is on your team? What's your track record on similar projects?
I built and deployed the underlying system starting in fall 2025, months before comparable infrastructure shipped: Microsoft AGT (April 2026), Anthropic's model-welfare audit (April 2026), and Cloudflare Project Think (April 2026).
The full claims ledger is available through the public prescience.json API (in the repo); supporting materials are available on request.
What are the most likely causes and outcomes if this project fails?
Most likely failure modes:
(1) Framework API changes break the benchmark. Mitigation: pinned dependencies and Docker reproducibility.
(2) Results are weaker than competitor benchmarks. Mitigation: honest publication; the value is cross-framework comparison, not absolute score.
(3) Compute costs exceed budget. Mitigation: open-source models for 80% of runs, with a 20% buffer.
If the project fails entirely, the partial work (protocol, integration guides, partial dataset) will still be released as open source under the MIT license, providing a foundation for future governance benchmark efforts.
How much money have you raised in the last 12 months, and from where?
N/A