Compassion Bench, a project of CaML, maintains the only continuously updated public leaderboard evaluating how frontier AI models reason about non-human welfare. Our two goals are to (1) run the two relevant benchmarks against every major frontier model release within days of launch, and (2) grow adoption among AI safety researchers and labs so the benchmarks become a standard welfare-alignment eval.
We achieve this through an evaluation pipeline that runs each new model across the Animal Harm Benchmark 2 (AHB) and the Compassion and Deception (CAD) Benchmark (coming to Inspect soon), uses GPT-5-nano and Gemini 2.5 Flash-Lite as judges, and publishes results to compassionbench.com. The infrastructure is already live; this grant funds continued operation.
https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/ahb/
https://github.com/UKGovernmentBEIS/inspect_evals/pull/1116
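For readers who want a sense of what one leaderboard run looks like, here is a minimal sketch using the Inspect Python API. It is illustrative only, not our production pipeline: the task name inspect_evals/ahb and the placeholder target model are assumptions, and judge-model configuration (GPT-5-nano / Gemini 2.5 Flash-Lite) depends on the task's own parameters and is omitted.

```python
# Minimal sketch of a single AHB run, not CaML's actual pipeline code.
from inspect_ai import eval

logs = eval(
    "inspect_evals/ahb",       # assumed task name for Animal Harm Benchmark 2 in inspect_evals
    model="openai/gpt-5",      # placeholder for whichever frontier model just shipped
    log_dir="logs/ahb",        # Inspect writes .eval logs here for later publishing to the site
)

for log in logs:
    # Each EvalLog records the model evaluated and whether the run succeeded.
    print(log.eval.model, log.status)
```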
Twelve months of operational costs: Vercel Pro and Supabase Pro hosting ($840), API costs for judge models and frontier model inference at ~2 new models per month ($600), a part-time web developer retainer for leaderboard maintenance and minor features ($1,800), researcher time to run evaluations and maintain the pipeline ($1,200), and a contingency buffer ($560). Total: $5,000. Ideally, we would also monitor web traffic to better gauge adoption of the site.
Miles Tidmarsh (Co-founder, Research Director) — leads benchmark design. Jasmine Brazilek (Co-founder, Technical Lead) — built and maintains the evaluation pipeline and infrastructure. Hailey Sherman (Web / database development) — built the site to our specifications.
We built and released AHB 2, which now receives 8,000+ monthly downloads (and growing) and has been adopted by AI safety organisations as a standard evaluation tool. The CAD Benchmark is being added to Inspect now, with refinement underway. compassionbench.com is live with an automated leaderboard. We have also completed a draft paper analyzing methods of robustly increasing compassion, which will be on arXiv within weeks.
The most likely failure mode is funding running out before we secure a larger institutional grant, leaving the evaluation pipeline unmaintained as new frontier models ship. The leaderboard becomes stale, adoption stalls, and the benchmarks lose credibility as a living standard. A secondary risk is that the web platform and benchmarks degrade without maintenance, reducing accessibility for researchers who rely on the public leaderboard rather than running evals directly.
The benchmarks themselves remain publicly available on Inspect regardless: the loss is the maintained, continuously updated platform that makes the work legible and useful to the broader community.
We have previously raised $110,000 for CaML from SFF, BlueDot Impact, Ryan Kidd, Marcus Abramovitch, and Longview Philanthropy, along with many private donors, to develop this work.