Project summary
Compassion Bench is a project of CaML that maintains the only continuously updated public leaderboard evaluating how frontier AI models reason about non-human welfare. Our two goals are: (1) to run the two relevant benchmarks against every major frontier model release within days of launch, and (2) to grow adoption among AI safety researchers and labs as a standard welfare alignment eval.
What are this project's goals? How will you achieve them?
We achieve this through an evaluation pipeline that runs each new model across the Animal Harm Benchmark 2 (AHB 2) and the Compassion and Deception (CAD) Benchmark (coming to Inspect soon), using GPT-5-nano and Gemini 2.5 Flash-Lite as judges, and publishes the results to compassionbench.com. The infrastructure is already live; this grant funds continued operation.
https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/ahb/
https://github.com/UKGovernmentBEIS/inspect_evals/pull/1116
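For illustration, the sketch below shows the shape of one such evaluation run using Inspect's Python API. It is not our production pipeline: it assumes the AHB task is registered as "inspect_evals/ahb" (per the Inspect Evals links above), uses a placeholder model name for the newly released model, and omits judge-model configuration, which is task-specific.

```python
# Illustrative sketch only (not our production pipeline): one AHB run via
# the Inspect Python API, assuming inspect_ai and inspect_evals are installed
# and the task is registered as "inspect_evals/ahb" (see links above).
from inspect_ai import eval

logs = eval(
    "inspect_evals/ahb",    # AHB task, resolved from the Inspect registry
    model="openai/gpt-5",   # placeholder: the newly released model under test
)

# eval() returns a list of EvalLog objects; their aggregate scores are what
# gets published to the leaderboard database.
results = logs[0].results
if results is not None:
    for score in results.scores:
        print(score.name, score.metrics)
```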
How will this funding be used?
Twelve months of operational costs: Vercel Pro and Supabase Pro hosting ($840), API costs for judge models and frontier-model inference at roughly two new models per month ($600), a part-time web developer retainer for leaderboard maintenance and minor features ($1,800), researcher time to run evaluations and maintain the pipeline ($1,200), and a contingency buffer ($560). Total: $5,000. Ideally, we would also monitor web traffic to better gauge adoption of the site.
Who is on your team? What's your track record on similar projects?
Miles Tidmarsh (Co-founder, Research Director) — leads benchmark design. Jasmine Brazilek (Co-founder, Technical Lead) — built and maintains the evaluation pipeline and infrastructure. Hailey Sherman (Web/Database Developer) — built the leaderboard site to our specifications.
We built and released AHB 2, which now receives 8,000+ monthly downloads (growing month over month) and has been adopted by AI safety organisations as a standard evaluation tool. The CAD Benchmark is being added to Inspect now, with refinement underway. compassionbench.com is live with an automated leaderboard. We have also completed a draft paper analyzing methods for robustly increasing compassion, which will be on arXiv within weeks.
What are the most likely causes and outcomes if this project fails?
The most likely failure mode is funding running out before we secure a larger institutional grant, leaving the evaluation pipeline unmaintained as new frontier models ship. The leaderboard becomes stale, adoption stalls, and the benchmarks lose credibility as a living standard. A secondary risk is that the web platform and benchmarks degrade without maintenance, reducing accessibility for researchers who rely on the public leaderboard rather than running the evals directly.
The benchmarks themselves remain publicly available on Inspect regardless: the loss is the maintained, continuously updated platform that makes the work legible and useful to the broader community.
How much money have you raised in the last 12 months, and from where?
We have previously raised $110,000 for CaML from SFF, BlueDot Impact, Ryan Kidd, Marcus Abramovitch, and Longview Philanthropy, along with many private donors, to develop this work.