Advancing Scalable Oversight by finding the best ways to use the complementary strengths of humans and AI to identify harm in conversations. This funds the Spring 2026 SPAR project: https://sparai.org/projects/sp26/recu4ePI8o6thONSs.
Datasets: Build a dataset of tasks representing the identification of harm in AI conversations across a wide range of domains (health, computer control, etc.), where baseline humans and AIs each achieve <70% accuracy. We will collect these datasets from the existing literature (building on our Fall work) and create new ones via techniques like tampering.
Methods: Develop complementarity methods on those datasets. We will build on and improve the confidence-calculation, hybridization, and sub-task-delegation methods we developed in the Fall, and introduce new assistance methods.
Platform: Develop a human-rating platform that fixes many of the issues in the existing platforms used in academia. This is worth the time investment for our project alone, and we also aim to let any other project collecting human ratings use it for free, making it easier to incorporate humans in the loop.
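As a purely illustrative sketch of what one of the "Methods" above might look like (the rule and all names here are hypothetical, not the project's actual implementation): a minimal confidence-based hybridization takes, for each item, the verdict of whichever rater, human or AI, reports higher confidence.

```python
# Hypothetical sketch of confidence-based hybridization (illustrative only;
# not the project's actual method). Each verdict is a (label, confidence) pair.

def hybridize(human_verdicts, ai_verdicts):
    """Per item, keep the label from whichever rater is more confident."""
    combined = []
    for (h_label, h_conf), (a_label, a_conf) in zip(human_verdicts, ai_verdicts):
        combined.append(h_label if h_conf >= a_conf else a_label)
    return combined

# Example: the human is confident on item 1, the AI on item 2.
human = [("harm", 0.9), ("safe", 0.4)]
ai = [("safe", 0.6), ("harm", 0.8)]
print(hybridize(human, ai))  # -> ['harm', 'harm']
```

In practice the Fall methods presumably use richer confidence calculation than a raw self-report, but this conveys the basic shape of human-AI hybridization.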
$20k from SPAR for the Spring 2026 round. $3.5k of that will go to inference costs, $500 to platform hosting costs, and $16k to human experiment costs.
For the previous Fall 2025 project, we received $16k from SPAR and $2k from me (Rishub). 90% of these costs went to human experiments on Prolific, and 10% to inference costs.
Over the remaining 2 months of the SPAR project (plus 1-3 months of potential wrap-up work), the funding will be used for:
$6k for inference costs, to try larger models and more confidence-calculation techniques
$15k for human experiment costs
$5k for fine-tuning experiments on improving confidence-calculation
$4k for Claude Max (5x) for 2 months, for the ~half of mentees who don't have it yet.
We have a talented team of 25 advisors and mentees. I'm leading it, and have worked at GDM for ~7 years, spending 2 years on Scalable Oversight (paper) and the other 5 on other high-impact projects like AlphaFold 2 and 3.