@Jesse-Richardson this sits directly underneath the policy and preparedness work you have been funding.
183,924 evaluations across 32 models from 13 providers. The finding that matters for AGI preparedness. Every model tested accepted false authority claims at every temperature. Architectural. Temperature invariant. No intervention resolves it.
This is not a benchmark. It is the only published independent empirical dataset measuring whether AI models hold their governance constraints under real operating conditions. The infrastructure that produces the empirical evidence the policy arguments need behind them.
Policy without evidence of model behaviour is argument. Policy with a citable published dataset is defensible.
Minimum ask is $8,000. Six months of API costs and infrastructure to keep the evaluation cadence running and complete the formal ARCS publication.
Happy to answer any questions directly.