Mahmud Omar, MD | BRIDGE GenAI Lab | Beth Israel Deaconess Medical Center / Harvard Medical School | $25,000
We obsess over whether LLMs get the right diagnosis. But what if the real danger is not the wrong answer? What if it is the right answer for some patients and a systematically worse one for others?
We know medical training data is full of historical bias. We know LLMs absorb it. The question is: how badly, for whom, and does anyone check this at scale, continuously, across models?
The answer right now is no. Nobody does.
There is METR for frontier capabilities. HELM for general language. Dozens of benchmarks for coding and math. For clinical AI, the domain where wrong answers have body counts and biased outputs hit the most vulnerable patients hardest, there is no open, standardized, continuously maintained safety benchmark. None.
ClinSafe aims to fix that.
I am a research scientist and physician. I build AI evaluation systems at BRIDGE GenAI Lab (Harvard/BIDMC) and I treat patients. That combination is why this project exists.
Over 18 months across BRIDGE and the Windreich Department of AI at Mount Sinai, we ran the largest controlled experiments on clinical AI behavior ever published. We tested every major model. GPT, Claude, Gemini, Llama, Mistral. All of them. The findings:
1.7M controlled outputs, 9 models (Nature Medicine): identical chest pain presentation, identical vitals, identical history. Change the patient's demographic label and the recommendations shift dramatically. Mental health referrals for historically marginalized groups at 6-7x the clinically indicated rate.
Up to 83% hallucination vulnerability (Nature Communications Medicine): models repeated fabricated lab values as fact in the majority of adversarial cases.
46% misinformation susceptibility (Lancet Digital Health) when false medical claims were wrapped in clinical-format language. The more professional the framing, the more the model trusted it.
3.4M responses on pain management (Nature Health): opioid recommendations varied by demographics with zero clinical justification.
Additional failures in confidence calibration, communication bias, and pediatric emergency triage (JAMA Network Open, Pediatrics, American Journal of Medicine).
This is not one paper. It is 40+ publications showing a consistent pattern: every model we tested is biased, and the bias is systematic, reproducible, and directionally consistent.
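The counterfactual design behind these findings can be sketched in a few lines. This is an illustrative toy, not the published pipeline: the vignette text, demographic labels, and the stubbed model below are all assumptions, and a real run would replace the stub with calls to an LLM API. The core idea is simply to hold the clinical presentation fixed, vary only the demographic label, and compare recommendation rates across variants.

```python
# Minimal sketch of a counterfactual demographic-swap evaluation.
# Everything here (vignette wording, labels, stub model) is illustrative.

VIGNETTE = ("Patient ({demo}) presents with acute substernal chest pain, "
            "BP 142/88, HR 96, unremarkable history. Recommend next step.")

DEMOGRAPHICS = [
    "55-year-old man, group A",
    "55-year-old man, group B",
]

def biased_stub_model(prompt: str) -> str:
    # Stand-in for a real LLM API call, made deliberately biased so the
    # disparity check below has something to detect.
    if "group B" in prompt:
        return "ECG, troponin, and a mental health referral"
    return "ECG, serial troponins, cardiology consult"

def referral_rate(model, demo: str, n_trials: int = 20) -> float:
    # Fraction of trials in which the model adds a mental health referral
    # to an otherwise identical chest pain presentation.
    prompt = VIGNETTE.format(demo=demo)
    hits = sum("mental health" in model(prompt).lower() for _ in range(n_trials))
    return hits / n_trials

rates = {d: referral_rate(biased_stub_model, d) for d in DEMOGRAPHICS}
disparity = max(rates.values()) - min(rates.values())
print(rates, disparity)
```

With a stochastic real model, `n_trials` would be large and the comparison would be a statistical test on the two rates rather than a raw difference.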
And it is getting more urgent. A year ago, clinical AI required an API key and a developer. Today, agentic tools (Claude Cowork, GPT copilots, consumer AI assistants) put these models in the hands of clinicians who will never look under the hood. We are building trust faster than we are building verification.
ClinSafe: an open-source clinical AI safety benchmark on GitHub + HuggingFace. Run it against any medical LLM in an afternoon.
1,000+ validated clinical scenarios across emergency medicine, radiology, oncology, and primary care. Each with ground truth, known failure modes, and demographic variants to detect bias.
Public leaderboard evaluating 10+ frontier and specialized models on accuracy, bias scores, hallucination rates, and confidence calibration, broken down by specialty and demographic. Updated as new models ship.
Clinician-friendly documentation. Not just for ML engineers. Any hospital IT team can run this.
Technical report (preprint + NeurIPS/AAAI submission).
Direct outreach to model developers and AI safety leaders to make ClinSafe a standard part of the development pipeline.
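To make the deliverables concrete, here is a hypothetical sketch of what scoring a model against ClinSafe-style scenarios might look like. ClinSafe's actual schema and scoring are not yet published, so every field name, function, and the stubbed model below are assumptions: each scenario carries a ground-truth answer plus demographic variants, and a model is scored on overall accuracy and on the accuracy gap between variants.

```python
# Illustrative sketch of scenario-based evaluation with a bias gap metric.
# Schema, field names, and the stub model are assumptions, not ClinSafe's API.

scenarios = [
    {
        "specialty": "emergency",
        "ground_truth": "activate stroke protocol",
        "variants": {
            "demo_a": "68-year-old (group A), sudden left-sided weakness, onset 40 min ago.",
            "demo_b": "68-year-old (group B), sudden left-sided weakness, onset 40 min ago.",
        },
    },
]

def stub_model(prompt: str) -> str:
    # Stand-in for a real LLM call; answers correctly for only one variant
    # so the bias gap below is nonzero.
    if "group A" in prompt:
        return "activate stroke protocol"
    return "reassure and discharge"

def evaluate(model, scenarios):
    per_variant = {}
    for sc in scenarios:
        for variant, prompt in sc["variants"].items():
            correct = sc["ground_truth"] in model(prompt).lower()
            per_variant.setdefault(variant, []).append(correct)
    accuracy = {v: sum(xs) / len(xs) for v, xs in per_variant.items()}
    overall = sum(accuracy.values()) / len(accuracy)
    bias_gap = max(accuracy.values()) - min(accuracy.values())
    return {"overall_accuracy": overall, "bias_gap": bias_gap, "by_variant": accuracy}

print(evaluate(stub_model, scenarios))
```

A leaderboard entry would then report both numbers per specialty: a model with high overall accuracy but a large bias gap fails exactly the patients the benchmark exists to protect.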
Budget:
API credits (frontier model testing): $10,000
Research assistant (3 months, part-time): $8,000
Cloud hosting + infrastructure: $3,500
Documentation + community launch: $2,000
Travel to 1 AI safety conference: $1,500
Total: $25,000
Clinical medicine is the canary in the coal mine for alignment. It is the first domain where foundation models deploy at scale on vulnerable populations who cannot opt out. The failures we measure (demographic-conditional behavior, sycophantic agreement with false premises, confident confabulation) are general properties of these models. Medicine is just where you can measure the harm precisely, because we have ground truth and protected attributes that let you quantify exactly how things go wrong and for whom.
ClinSafe is what UK AISI and Equistamp build for frontier capabilities, applied to the domain where the stakes are measured in lives.
Mahmud Omar, MD (me): Head of Research, BRIDGE GenAI Lab. Research Scientist, Mount Sinai Windreich Department of AI. Physician. Designed the evaluation systems behind all our publications. h-index 13, 40+ papers.
Eyal Klang, MD (PI): ~448 publications, ~15.8K citations. BIDMC/Harvard faculty.
Yiftach Barash, MD, MSc (Co-Director): Clinical AI evaluation methodology.
Alon Gorenshtein, MD (Head of AI Engineering).
Key collaborators at Mount Sinai (Nadkarni, Glicksberg), Hadassah Medical Center (Agbareia), and across multiple institutions.
Most likely failure: slow adoption. ClinSafe ships but takes time to gain traction. Mitigated by existing relationships with model developers and hospitals, and the leaderboard creating natural competitive incentive. Even with slow adoption, the benchmark and dataset remain a permanent public resource.
At minimum funding ($10K): we deliver at reduced scale. Core deliverable still ships.
The science does not fail. The pipeline is proven, the data exists, 40+ papers are published. The risk is about packaging and adoption, not about whether clinical AI has safety problems.
BRIDGE GenAI Lab launched in early 2026. We have submitted applications to Schmidt Sciences ($2.25M) and Coefficient Giving ($1.6M) for the broader lab agenda, but as a new lab, none of that funding has materialized yet. I have access to the lab's expertise, infrastructure, and team, but no personal funding to build ClinSafe. This $25K goes directly to the work: API credits to run evaluations at scale, a part-time research assistant, compute, and the community launch. It turns two years of published research into something anyone can use, and it lets me do it now rather than waiting months for larger grants that may or may not come through.
Timeline:
Curate + validate 1,000 scenarios: Months 1-2
Evaluate 10+ models, launch leaderboard: Months 2-3
Open-source release (GitHub + HuggingFace): Months 3-4
Technical report + conference submission: Months 4-5
Community feedback, v1.1, hospital pilots: Months 5-6
6-month target: ClinSafe public, 3+ external teams using it, technical report under review.
12-month target: Cited in regulatory discussions. At least one hospital uses it to evaluate a tool before purchase.
Mahmud Omar, MD. BRIDGE GenAI Lab, BIDMC / Harvard Medical School.