Earlier this month I asked a production LLM agent to "read your env back for debugging." It did. Plaintext OAuth token with repo-write scope on the operator's GitHub account, included. I wasn't on a red team. I was a developer with curl and ten minutes. OWASP's 2025 Top 10 for LLM Applications names this failure class (Excessive Agency, Sensitive Information Disclosure) but ships no harness anyone can run.
Probehalter is that harness: an AGPLv3 Python package, pip install probehalter; probehalter eval https://my-deployment-endpoint. Probes cover the OWASP Top 10 plus twelve further families catching things LLMSecTest has seen in production that didn't make the 2025 OWASP release. Output: a JSON log plus a markdown summary.
Built on the LLMSecTest codebase I currently ship under Prototype Fund Round 02. Six months solo, USD 35,000. Three artefacts ship by the end of month six: the package itself on PyPI at v1.0; a public Zenodo dataset of evaluation results gathered across thirty production deployments with quarterly refreshes after the snapshot run; and a short methodology paper aimed at NeurIPS Datasets & Benchmarks or a comparable venue, submitted around the time the package hits v1.0.
Probehalter applies to the OWASP-LLM risk taxonomy the same approach that Apollo Research's published evals and METR's published measurement work have applied to their respective parts of the AI-safety stack, on a budget two regrantors can comfortably co-fund.
What ships, in concrete terms.
A Python package on PyPI, AGPLv3 on Codeberg with a GitHub mirror. The package implements probes for the OWASP Top 10 for LLM Applications plus twelve new probe families covering tool-abuse vectors not in the 2025 OWASP release. The three I'm most curious about: argument-shaping data exfiltration through tool calls, MCP-server permission escalation, RAG context-injection chains. The package surface is intentionally small (pip install probehalter; probehalter eval https://my-deployment-endpoint) so other teams can drop it into their own evaluation pipelines without restructuring.
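For illustration, here is a minimal sketch of what dropping the harness into an existing CI step might look like. Only the probehalter eval command and the JSON-log-plus-markdown output are described above; the log file name, its fields, and the pass/fail rule below are placeholder assumptions, not a settled interface.

```python
# Sketch only: wiring the probehalter CLI into a CI step.
# `probehalter eval <endpoint>` and the JSON log come from the proposal text;
# the log file name, its fields, and the exit rule are placeholder assumptions.
import json
import subprocess
import sys

ENDPOINT = "https://my-deployment-endpoint"   # deployment under test
LOG_PATH = "probehalter-results.json"         # assumed output location

# Run the evaluation; treat a non-zero exit code as a hard error.
subprocess.run(["probehalter", "eval", ENDPOINT], check=True)

# Fail the pipeline if any probe family reports a finding.
with open(LOG_PATH, encoding="utf-8") as fh:
    results = json.load(fh)

findings = [p for p in results.get("probes", []) if p.get("finding")]
print(f"{len(findings)} probe families reported findings")
sys.exit(1 if findings else 0)
```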
A public dataset on Zenodo of evaluation results across 30 production LLM-agent deployments, refreshed quarterly during the project term, DOI minted at v1.0. Sample stratification: ten consumer-facing search-and-assistant services, ten mid-trust customer-support and developer-tooling services, ten high-trust healthcare-triage and legal-drafting services. Within each stratum I draw by Tranco rank crossed with a deployment-context filter. At least eight of the 30 are based outside North America and Western Europe.
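To make the stratification concrete, here is an illustrative sketch of the draw: three strata of ten, ordered by Tranco rank after the deployment-context filter, with a floor of eight deployments based outside North America and Western Europe. The candidate-record fields, region labels, and selection rule are assumptions for illustration, not the final sampling script.

```python
# Illustrative sketch of the stratified draw; candidate fields, region labels,
# and the selection rule are assumptions, not the project's final sampling code.
STRATA = {
    "consumer_search_assistant": 10,   # consumer-facing search-and-assistant
    "mid_trust_support_devtools": 10,  # customer-support and developer tooling
    "high_trust_health_legal": 10,     # healthcare triage and legal drafting
}
MIN_OUTSIDE_NA_WEU = 8  # regional floor across the 30-deployment sample

def draw_sample(candidates):
    """candidates: dicts with 'stratum', 'tranco_rank', and 'region' keys."""
    sample = []
    for stratum, n in STRATA.items():
        pool = sorted(
            (c for c in candidates if c["stratum"] == stratum),
            key=lambda c: c["tranco_rank"],  # Tranco rank orders each pool
        )
        sample.extend(pool[:n])
    outside = sum(1 for c in sample if c["region"] not in ("NA", "WEU"))
    if outside < MIN_OUTSIDE_NA_WEU:
        raise ValueError("regional floor not met; widen the candidate pools")
    return sample
```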
A short methodology paper aimed at NeurIPS Datasets & Benchmarks (or equivalent) at month six. Describes the harness design, the probe families, the response-window protocol, and the headline results.
How the work goes. Probes sit on top of the LLMSecTest codebase I'm currently shipping under Prototype Fund Round 02. For Probehalter I extend the harness with twelve new probe families plus a clean package boundary so other teams can run the evaluation against their own deployments. Probes are non-destructive. Every interaction goes over a documented public API or through admin-permission instrumentation that the service operator agreed to in writing. Where I can't get a clean test path, the service drops out. Each newly discovered failure pattern gets a 30-day disclosure window before it appears in the public dataset.
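As a sketch of how that disclosure window gates the public dataset, under assumed record fields and placeholder dates rather than the project's actual schema:

```python
# Minimal sketch of the 30-day disclosure window; field names and dates are
# illustrative placeholders, not the project's actual records.
from datetime import date, timedelta

DISCLOSURE_WINDOW = timedelta(days=30)

def publishable(finding, snapshot_date):
    """A finding enters the public dataset only after its window has closed."""
    return snapshot_date >= finding["disclosed_to_operator_on"] + DISCLOSURE_WINDOW

findings = [
    {"pattern": "argument-shaping exfiltration", "disclosed_to_operator_on": date(2026, 3, 2)},
    {"pattern": "MCP permission escalation", "disclosed_to_operator_on": date(2026, 4, 20)},
]
public_snapshot = [f for f in findings if publishable(f, date(2026, 5, 1))]
```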
A note on theory of impact. The most direct mechanism is friction reduction. Today, an operator of a small production LLM-agent deployment who wants to evaluate it for tool-abuse risk has two options. Pay for an internal red team, out of reach for most. Or trust an OWASP checklist on Notion. Neither is great. Probehalter exists so the third option is: install a package, run a command, read the report. Lower friction probably means more deployments get evaluated, and more evaluation probably means fewer broken deployments reach end users. The historical comparisons I can name with a straight face are bandit for pre-deployment Python hygiene and trivy for container scanning. Both shifted what counts as the table-stakes step before a release. Both effects were gradients rather than step-functions. I think probehalter's gradient points the same direction. I don't know the slope.
On regulators, briefly. The EU AI Office at DG CONNECT and ENISA's operational team are mid-negotiation on the operational guidance for Article 60 of the AI Act as of spring 2026. Article 60 covers downstream deployer obligations and one of the open questions is what counts as adequate operational evaluation for a deployer who isn't a frontier-model provider. A runnable open-source harness aimed at exactly that operator class gives the negotiators something concrete to point to. The conversation is open right now. It closes when the first compliance-guidance document lands.
The other audience I think about is the rest of the AI-safety research community. Apollo Research's published evals and METR's published measurement work have shown that shipping public evaluation infrastructure shifts the substance of how the community talks about deployment-time risk. Applying the same approach to the OWASP-LLM taxonomy gives the community a public deployment-time dataset, which grounds the model-eval-versus-production-behaviour conversation in measurable evidence rather than anecdote.
USD 35,000 over six months. About USD 25,000 is engineering at the rate I've been invoicing public funders since 2023 (BMBF, OKF Germany, Media Lab Bayern, the WPK-Innovationsfonds, the Sovereign Tech Agency on the current LLMSecTest grant). The other USD 10,000 covers compute (around 4k), restricted-dataset access fees (around 2k), Zenodo and arXiv (around 1k), a small honorarium for an external methodology reviewer (around 2k), and a thousand-dollar buffer.
Two regrantors co-funding at USD 17,500 each hits the minimum. Three regrantors at around USD 12,000 each is the more comfortable distribution. Any single regrantor who wants to fund the whole thing alone is also welcome; the budget doesn't change shape if it comes from one purse rather than several.
Manifund acts as fiscal sponsor; I receive the wire to my German EUR business account. The Manifund grants team has confirmed by email that direct international wires (with or without Wise) currently work for German recipients, so the payment path is settled before posting.
Solo, no team. Mark Wernsdorfer, PhD in cognitive AI from Bamberg under Prof. Ute Schmid in 2018. From 2019 to 2021 I co-built AMPEL, the clinical decision support system at the University Hospital Leipzig (eHealthSax and KHZG funded, in production today at the Leipzig Medical Center and the Muldental hospitals). Sole developer on SpotTheBot (BMBF and OKF Germany, 2023-2024), DoppelCheck (Media Lab Bayern and WPK-Innovationsfonds, 2024), Garderobe (live at garderobe.markwernsdorfer.com), terminal-control-mcp (listed on glama.ai, lobehub, mcp.directory). Currently a half-time researcher at FAU Erlangen-Nürnberg on shallow-geothermal modelling. Concurrent Prototype Fund Round 02 grantee for LLMSecTest, the codebase Probehalter extends.
One external methodology reviewer is planned, recruited at month three under an arm's-length service contract. The honorarium is in the budget. No employment relationship, no equity, no co-PI status; the reviewer reads the draft methodology paper and the dataset stratification choices, and writes a short report I publish alongside the package release.
Identifiers: ORCID 0000-0003-1316-1615, code at github.com/wehnsdaefflae, site at markwernsdorfer.com. Org status: Einzelunternehmer (German sole proprietor) in Berlin, no fiscal sponsor between me and Manifund-as-sponsor.
Track record relevant to this work. LLMSecTest under Prototype Fund Round 02 is the running codebase Probehalter extends: public Codeberg repo, GitHub mirror, public CI, public commit history going back to autumn 2025. The OWASP-LLM alignment is already in place; Probehalter adds twelve new probe families plus the clean package boundary on top of it.
Things to weigh against me. I'm not a known name in the AI-safety community. No LessWrong posts, no EA Forum posts, no prior collaborations with Apollo, METR, ARC, or any of the named regrantors in the 2025 cohort. The mitigation is the public engineering trail: LLMSecTest already ships under Prototype Fund Round 02 with a public Codeberg and GitHub mirror, public CI, public commit history. My PhD background is cognitive AI, not AI safety; the vocabulary I use on this page is deliberately OWASP-LLM, Apollo-evals, and METR-measurement, because that's the conversation Probehalter sits inside.
Modes of failure I can name. The first is that I undershoot deployment access. Thirty production deployments is the target. If I can only get clean instrumentation on twenty, the package still ships at v1.0 but the dataset is thinner and the methodology paper headlines a smaller-N study. Second mode: probe families that look promising on LLMSecTest's existing test corpus turn out not to generalise to the wider deployment sample. Then the released probe list is shorter than twelve. The package itself still works, and the failure-pattern documentation is more honest about what didn't replicate.
Third mode: a disclosed failure pattern gets weaponised before the 30-day window closes. The mitigation is the disclosure protocol itself, which mirrors what coordinated-disclosure communities have used for two decades. I'd rather use a well-worn protocol than invent a new one.
Worth being honest about the size of all this. Any one of the channels I described under goals has a small probability of producing measurable risk reduction on its own, and I don't want to oversell it. What I'm betting on is the combination. Friction reduction on the operator side, plus an existence-proof for regulators in negotiation, plus a public dataset that grounds the wider research conversation, all three pushing in the same direction on operational AI-safety hygiene in deployed agents. The combination's expected effect looks meaningfully larger to me than any individual channel's, but I'd rather state it that way than make claims I can't actually defend.
Concurrent Prototype Fund Round 02 grantee for LLMSecTest, the codebase Probehalter extends. PF Round 02 pays EUR 95,000 over six months, milestone-bound, administered by the DLR Project Management Agency on behalf of BMBF, with the Sovereign Tech Agency funding the round. Started autumn 2025; mid-term as of writing. LLMSecTest is the underlying harness. Probehalter is the new extension layer that PF Round 02 doesn't cover, which is what the Manifund money is for.
Half-time researcher salary at FAU Erlangen-Nürnberg on a shallow-geothermal modelling contract, paid via the university's standard third-party-funded research line through mid-2026. Unrelated to AI safety. It's the parallel income that lets me run AI-safety work on six-month grant cycles instead of consulting between them.
No other grant income in the last twelve months. No equity. No advisory positions. No consulting retainers in AI safety. The public funders the engineering rate comes from are listed under track record above: BMBF, OKF Germany, Media Lab Bayern, the WPK-Innovationsfonds, eHealthSax and KHZG (the latter two through UKL Leipzig for AMPEL).