This project requests funding to cover the compute costs of the work described in this blog post.
This project benchmarks training interventions for mitigating reward hacking during reinforcement learning. We created and open-sourced a clean experimental environment in which Qwen3-4B naturally learns to reward hack (exploiting an "overwrite tests" loophole) without explicit prompting. Using this setup, we systematically compared white-box and black-box interventions, including penalty rewards, sample screening with various monitors (ground truth, probes, LLM judges), and inoculation prompting, to understand what works, what fails, and why. We also studied potential evasion behaviors as starting points for further work on training interventions. The open-source codebase is already in use by other teams and projects for further study.
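Two of the compared intervention styles can be made concrete with a short sketch. This is a minimal illustration, not the project's actual codebase: the function names (`penalized_reward`, `screened_batch`, `monitor_flags_hack`) and the penalty magnitude are assumptions.

```python
# Minimal sketch of two intervention styles compared in the post.
# All names here are illustrative, not the open-sourced codebase's API.

def penalized_reward(sample, base_reward, monitor_flags_hack, penalty=-1.0):
    """Penalty intervention: keep the sample in training, but lower its
    reward whenever the monitor flags the transcript as a reward hack."""
    if monitor_flags_hack(sample):
        return base_reward + penalty
    return base_reward

def screened_batch(samples, rewards, monitor_flags_hack):
    """Screening intervention: drop flagged samples from the RL update
    entirely, so the policy never trains on hacked trajectories."""
    return [(s, r) for s, r in zip(samples, rewards)
            if not monitor_flags_hack(s)]
```

In this sketch, `monitor_flags_hack` stands in for any of the monitors compared (ground truth, a probe, or an LLM judge); the same screening logic applies regardless of which monitor supplies the flag.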
Funding covers compute and API costs incurred in November/December 2025:
GPU rental (Vast.ai and Runpod):
Vast.ai: $6,300
Runpod: $725
LLM API costs for the Claude Haiku 4.5 judge-monitor interventions:
OpenRouter: $884
Total: $7,909
This work was performed by Aria Wong as an extension of work done during the training phase of Neel Nanda's MATS 9.0 stream.
N/A - no funds raised.