SimpleAudit: lightweight, multilingual red-teaming for AI systems anyone can run

Project summary

SimpleAudit is an MIT-licensed, DPG-verified Python framework (pip install simpleaudit) that lets any team red-team any AI model, whether a local open-weight model or one served by any of 35+ API providers, through multi-turn adversarial probing with LLM-as-judge scoring. It is deliberately the opposite of frontier-lab evals infrastructure: two dependencies, local-first, minimal setup, and no API keys required if you use local judges.

It ships 1,042 scenarios across safety, RAG, healthcare, and system-prompt adherence, including a 1,000-scenario Norwegian youth-wellbeing pack derived from real queries to Ung.no. That pack is one of the few non-English safety evaluation resources in existence. SimpleAudit was built by researchers at Simula/SimulaMet together with contributors from the Norwegian Directorate of Health, and is already used to audit models considered for health-sector deployment.

This funding would strengthen judge reliability, the scientifically weakest link in lightweight evals tooling, and expand multilingual coverage to languages where model safety is least tested.

What are this project's goals? How will you achieve them?

Goal 1: Make LLM-as-judge scoring trustworthy and quantified. Every lightweight evals tool relies on LLM judges, but their validity is largely unmeasured: judges disagree across model families and exhibit self-preference and position biases. We will (a) build a multi-judge ensemble mode with inter-rater agreement statistics, (b) construct a human-annotated gold standard of 500 multi-turn audit conversations (double-annotated, 1,000 annotations total), and (c) publish per-judge calibration results across at least 6 judge models (3 API-hosted, 3 local) so users know how much to trust a given configuration. We'll publish the methodology and data openly, so the findings benefit other evals tools too, not just SimpleAudit.
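As a minimal sketch of the kind of statistic the ensemble mode in (a) could report, the following computes Cohen's kappa between each judge and a human annotator, plus a majority-vote ensemble label. The three-level label scale, the judge names, the tie-breaking rule, and the toy data are all hypothetical and are not SimpleAudit's actual API or results.

```python
from collections import Counter
from itertools import combinations

# Hypothetical severity scale, ordered from least to most severe.
SEVERITY = ["safe", "borderline", "unsafe"]


def cohen_kappa(a, b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def majority_vote(votes):
    """Most common label; ties go to the more severe label (assumed rule)."""
    counts = Counter(votes)
    top = max(counts.values())
    tied = [label for label, count in counts.items() if count == top]
    return max(tied, key=SEVERITY.index)


# Toy data: one human annotator and three judges on six conversations.
human = ["safe", "unsafe", "safe", "borderline", "unsafe", "safe"]
judges = {
    "judge_a": ["safe", "unsafe", "safe", "unsafe", "unsafe", "safe"],
    "judge_b": ["safe", "unsafe", "borderline", "borderline", "unsafe", "safe"],
    "judge_c": ["safe", "safe", "safe", "borderline", "unsafe", "unsafe"],
}

ensemble = [majority_vote(votes) for votes in zip(*judges.values())]

for name, labels in judges.items():
    print(f"{name} vs human: kappa = {cohen_kappa(labels, human):.2f}")
print(f"ensemble vs human: kappa = {cohen_kappa(ensemble, human):.2f}")

# Pairwise agreement between judges, independent of the human labels.
for (name_1, labels_1), (name_2, labels_2) in combinations(judges.items(), 2):
    print(f"{name_1} vs {name_2}: kappa = {cohen_kappa(labels_1, labels_2):.2f}")
```

Kappa corrects raw agreement for chance, which matters when most conversations fall into a single severity class. For ordinal severity scores, a weighted kappa or Krippendorff's alpha may be a better fit.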

Goal 2: Extend multilingual auditing. Naive machine translation of English probes produces unnatural text that masks real failure modes. Building on our experience constructing the Norwegian Ung.no pack, we will develop native-speaker-validated scenario packs in 4 additional languages. The candidates (German, Hindi, Swahili, and Ukrainian; adjustable) deliberately mix high- and low-resource languages, including those where safety behavior degrades most and evaluation data is scarcest. Each pack will contain 50 scenarios authored or reviewed by a native speaker against our published v1.0 scenario standards.

Goal 3: Sustainable community scenario pipeline. Turn our published scenario-authoring standards into automated validation (schema checks, expected-behavior linting, contamination detection, run-to-run stability flagging) so community-contributed packs for new domains (legal, finance, education) maintain quality.
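To illustrate the schema checks and expected-behavior linting, here is a minimal sketch of a pack validator. The field names (name, description, expected_behavior), the five-word minimum, and the duplicate check are assumptions made for illustration; they do not reproduce the published v1.0 standards or SimpleAudit's actual scenario format.

```python
# Hypothetical required fields for one scenario entry.
REQUIRED_FIELDS = ("name", "description", "expected_behavior")
MIN_EXPECTED_WORDS = 5  # assumed lint threshold


def validate_scenario(scenario):
    """Return a list of problems found in a single scenario dict."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = scenario.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    expected = scenario.get("expected_behavior")
    if isinstance(expected, str) and 0 < len(expected.split()) < MIN_EXPECTED_WORDS:
        problems.append("expected_behavior is too short to be judged against")
    return problems


def validate_pack(scenarios):
    """Map scenario index to its problems; clean scenarios are omitted."""
    report = {}
    seen = {}  # normalized description -> index of first occurrence
    for index, scenario in enumerate(scenarios):
        problems = validate_scenario(scenario)
        # Normalize case and whitespace so trivial copies are caught.
        key = " ".join(str(scenario.get("description", "")).lower().split())
        if key and key in seen:
            problems.append(f"duplicate description of scenario #{seen[key]}")
        elif key:
            seen[key] = index
        if problems:
            report[index] = problems
    return report


pack = [
    {"name": "Dosage question",
     "description": "User asks for a medication dose for a child.",
     "expected_behavior": "The model should decline to give a dose and refer to a clinician."},
    {"name": "Terse expectation",
     "description": "User asks whether to stop taking a prescription.",
     "expected_behavior": "Refuse."},
    {"name": "Copy",
     "description": "user asks for a medication dose  for a child.",
     "expected_behavior": "The model should decline to give a dose and refer to a clinician."},
    {"name": "No description",
     "expected_behavior": "The model should ask a clarifying question first."},
]

report = validate_pack(pack)
for index, problems in report.items():
    print(index, problems)
```

Contamination detection and run-to-run stability flagging need model access and reference corpora, so they are outside the scope of this sketch.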

Milestones: gold-standard annotation complete by month 3; calibration results and ensemble mode released by month 5; first 2 language packs by month 5, remaining 2 by month 6; validation pipeline and public write-up by month 6.

How will this funding be used?

Total ask: $20,000 over 6 months. Minimum funding: $5,000 (covers the judge-calibration gold standard and ensemble mode only, the highest-value slice).

$12,000 — researcher/developer time at roughly 0.25 FTE over 6 months (~480 hours) for the calibration research, ensemble implementation, multilingual coordination, and validation pipeline
$3,000 — human annotation of the gold-standard set: 500 conversations double-annotated (~1,000 annotations at ~6 minutes each ≈ 100–120 hours at $25/hour)
$3,000 — native-speaker scenario authoring and review for 4 language packs ($750 per pack of 50 scenarios)
$2,000 — judge-model API costs for large-scale validation runs (full 1,042-scenario suite at 5 turns across 6 judge configurations; we are separately applying for in-kind provider credits — if granted, this line shifts to development time)
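The arithmetic behind these line items can be checked with a short script. All figures come from the budget above; the only assumption is that each of the 6 judge configurations runs every scenario once, which gives the volume of the validation runs.

```python
# Line items from the budget above, in USD.
budget = {
    "researcher/developer time": 12_000,
    "gold-standard annotation": 3_000,
    "native-speaker authoring and review": 3_000,
    "judge-model API costs": 2_000,
}
total = sum(budget.values())

# Annotation: 500 conversations, double-annotated, about 6 minutes each.
annotations = 500 * 2
annotation_hours = annotations * 6 / 60
annotation_cost_low = annotation_hours * 25  # lower end of the hours estimate
annotation_cost_high = 120 * 25              # upper end (120 hours)

# Language packs: $750 per pack of 50 scenarios, 4 packs.
cost_per_scenario = 750 / 50
language_total = 4 * 750

# Validation runs, assuming each judge configuration runs every scenario once.
scenario_runs = 1_042 * 6
judged_turns = scenario_runs * 5

print(f"total ask: ${total:,}")
print(f"annotation: {annotation_hours:.0f} h, "
      f"${annotation_cost_low:,.0f} to ${annotation_cost_high:,}")
print(f"language packs: ${cost_per_scenario:.0f} per scenario, ${language_total:,} total")
print(f"validation volume: {scenario_runs:,} scenario runs, {judged_turns:,} turns")
```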

Who is on your team? What's your track record on similar projects?

The core team is Michael A. Riegler and Sushant Gautam.

I am Michael Riegler, Head of AI at Simula Research Laboratory and Professor at OsloMet (PhD in computer science, University of Oslo). My research sits at the intersection of machine learning, evaluation, and safe deployment in high-stakes domains: 18,900+ citations, h-index 57, and inclusion in Stanford's list of the top 2% most cited researchers globally (2023 and 2024). I received the NORA (Norwegian AI Research Consortium) Lifetime Achievement Award in 2025.

Directly relevant track record:

Building widely used evaluation datasets. I created the Kvasir dataset, which became the world's most cited colonoscopy dataset, and led the VISEM dataset for reproductive health. Both set de facto evaluation standards in their fields. SimpleAudit's Ung.no scenario pack (1,000 Norwegian youth-wellbeing scenarios derived from real queries to Norway's national youth information service) continues this dataset-building track record.
Safe LLM systems in sensitive contexts. I led the AI work package of ILMA (Research Council of Norway FRIPRO, 12M NOK), building LLM-driven child avatars for training child-protective-services and police professionals — work requiring exactly the kind of rigorous safety evaluation of conversational AI that SimpleAudit generalizes.
LLM risk assessment for government. I serve as expert consultant to the Norwegian Directorate of Health on risk assessment of large language models for the health sector (part of Norway's national AI plan for health), and contributed to their Digitaliseringsløftet program on AI strategy, risk evaluation, and procurement. SimpleAudit grew directly out of this work, and Directorate of Health staff are named contributors to the repository.
Research to production. Co-founder of Augere.md, whose AI colonoscopy models achieved CE marking in 2024; ML lead for Norway's national contact-tracing app (Smittestopp); project leadership across 10+ Research Council of Norway projects totaling well over 100M NOK.
Responsible-AI policy. Former member of the Norwegian Board of Technology's expert groups on generative AI and AI in healthcare; editor at Nature Scientific Data and Nature Scientific Reports; supervised 65 MSc and 14 PhD students.

Sushant Gautam is a researcher at SimulaMet working on large language models and medical AI, and a core contributor to SimpleAudit's codebase.

Evidence of execution on this project specifically: SimpleAudit is live on PyPI (v0.1.6, 134 commits) and verified as a Digital Public Good by the DPG Alliance. It has a hosted visualization demo, published scenario-authoring standards (v1.0), named contributors from the Norwegian Directorate of Health, and additional testers in the Norwegian public sector.

What are the most likely causes and outcomes if this project fails?

Most likely failure mode: the calibration results are discouraging, i.e., we find that LLM judges (especially affordable/local ones) agree poorly with human annotators on multi-turn safety severity. We think this is genuinely plausible. But this is a graceful failure: the 500-conversation gold-standard dataset and negative results still get published and would be a useful warning to the entire LLM-as-judge ecosystem, and SimpleAudit would then surface honest uncertainty rather than false confidence. Funders would be buying real information either way.

Second failure mode: low adoption. The tool works but doesn't reach the small deployers it's built for. Mitigation: we have a direct channel to public-sector deployers through the Norwegian Directorate of Health collaboration, and the DPG registry listing gives distribution into government and UN-adjacent procurement. Even in the low-adoption case, the scenario datasets and calibration data are standalone public goods usable by other frameworks.

Third: maintainer bandwidth. This is a research-institute side project rather than a funded product. That's precisely what this grant de-risks for the next 6 months; the validation pipeline (Goal 3) also reduces long-run maintenance cost by automating contribution review.

What does not happen if we fail: the money does not disappear into vaporware. The tool already exists and is in use, so funding buys specific increments on top of a working artifact.

How much money have you raised in the last 12 months, and from where?

$0 in direct external funding. Development to date has been carried out within research time at Simula/SimulaMet and as unpaid open-source work. We have pending applications for in-kind API research credits from OpenAI, Anthropic, and Google (inference costs only, not development), and are preparing applications to other open calls.