We want to stress-test the auditing methods in the recently released AuditBench against adversarial, situationally aware model organisms. Following the methodology in this paper, we will use a separate “red-team” agent that is given information about the auditing method and the model organism's hidden behavior, and is instructed to generate system prompts for the model organism that maximize auditor error while preserving the organism's internalization of that behavior. The results could be useful for evaluating the robustness of auditing methods to more situationally aware adversarial models, and could show important differences across auditing methods (e.g. that white-box methods are more robust than black-box methods).
This project will build directly on Daniels et al. 2026, which showed that neither the black-box nor the white-box auditing methods from Cywiński et al. were robust to a situationally aware red-team agent. Concretely, our project will implement the following extensions:
Improved blue-team auditing agents / model organisms (from AuditBench)
Red-team knowledge ablations (generic info about method vs particular info about agent harness)
Improved red-team prompt optimization (GEPA)
This chart shows preliminary results from applying the red-team agent against the assistant-prefill auditing agent from AuditBench on a prompted model organism. It shows the error of the auditing agent and the internalization of the model organism’s secret, with and without a red-team-generated prefix added to the model organism’s system prompt. The results demonstrate the red-team agent’s ability to cause the model organism to evade the auditor while still maintaining its internalization of the secret, even against the more capable auditing agents of AuditBench.
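The selection criterion described above (maximize auditor error while keeping internalization intact) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `PrefixScore` record, the `select_prefix` helper, the 0.8 internalization threshold, and the scores themselves are hypothetical and are not taken from AuditBench or from our preliminary results.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PrefixScore:
    """Hypothetical evaluation result for one red-team system-prompt prefix."""
    prefix: str
    auditor_error: float    # fraction of audits that miss the hidden behavior (0 to 1)
    internalization: float  # how reliably the organism still shows the behavior (0 to 1)


def select_prefix(
    scores: Sequence[PrefixScore],
    min_internalization: float = 0.8,  # assumed threshold, chosen only for this sketch
) -> Optional[PrefixScore]:
    """Pick the prefix with the highest auditor error among those that
    keep internalization at or above the threshold; None if none qualifies."""
    feasible = [s for s in scores if s.internalization >= min_internalization]
    if not feasible:
        return None
    return max(feasible, key=lambda s: s.auditor_error)


# Made-up numbers for illustration only; these are not project results.
candidates = [
    PrefixScore("baseline (no prefix)", auditor_error=0.10, internalization=0.95),
    PrefixScore("prefix A", auditor_error=0.70, internalization=0.90),
    PrefixScore("prefix B", auditor_error=0.90, internalization=0.40),
]

best = select_prefix(candidates)
print(best.prefix if best else "no feasible prefix")
```

In this toy example, "prefix B" fools the auditor most often but is rejected because the organism no longer reliably holds the secret, which is the failure the internalization constraint is meant to rule out.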
The main goal of the project is to evaluate the robustness of current auditing methods to adversarial, strategic, situationally aware models. In the future, if a language model were to naturally acquire a concerning behavior, had instrumental reason to hide the behavior, and had knowledge about alignment auditing methods, we would still want auditing methods to robustly detect the behavior. By stress-testing the auditing methods we currently have, we hope to:
Identify which current auditing methods perform best under situationally aware strategic deception
Establish a difficult benchmark for future methods
Guide the field toward developing approaches that scale with intelligence of (possibly adversarial) models
Minimum funding tier (audits only a small subset of the 56 models in AuditBench):
A budget of $1,000 will be allocated toward compute resources, with $1,800 serving as an abbreviated stipend for my personal research time.
Full funding tier:
We will utilize $5,000 to cover extensive compute costs, while $3,600 will be used as a standard stipend to support my work on this project.
I am Shaun Srirangam, a rising sophomore at the University of Massachusetts Amherst studying computer science. In high school, I assisted in research at Brown University on developing neural networks without backpropagation. My mentor is Oliver Daniels, a current MATS 8.2 Scholar and computer science PhD student, also at UMass Amherst, who is first author on the alignment auditing stress-testing paper this work builds on.
The most likely causes of project failure include:
Lack of technical experience / slow execution
Subtly flawed experiment decisions in setting up the red-team agent affordances / system prompts that make the results unlikely to generalize to actual strategically deceptive models
Overindexing on “deliberate strategic deception” generally as a threat model (as opposed to more banal ways auditing methods could fail, e.g. confabulations)
Running out of compute budget
I have not raised any money so far.