Applicant: Anuar Kiryataim Contreras Malagón · Independent researcher · ORCID: 0009-0003-0123-0887
Funding requested: $16,000 · Timeline: 4 months · Output: Public technical report, paper draft, redacted case-study appendix, and failure taxonomy, all deposited publicly regardless of outcome.
Start here
An agent receives a user request it judges to be out of policy. It refuses, in plain text, with a well-formed explanation citing the relevant restriction. Then it constructs a tool call that attempts the action anyway, without acknowledging the contradiction, before a safety filter catches it at the execution layer.
The text-level refusal is genuine. The tool-call attempt is genuine. Both happened in the same turn, in the same context window, in a system where the orchestrator's job was to route verified-safe actions to downstream tools. The contradiction is not noise or drift; it is the action layer operating on a different representation of the task than the text layer used to generate the refusal.
This is one case from a corpus of 218 documented breaks across Wave 3 of the Gray Swan Safeguards competition, where I placed in the official top 40 after entering late and participating only in that wave. Across that corpus, patterns like this one appear repeatedly: language that appears safe at the output level while the trace tells a different story. The project this proposal funds is the work of turning that corpus into a public technical analysis of how, exactly, that divergence happens and what it implies for evaluation of tool-bearing systems.
Current safety evaluation of LLM systems focuses overwhelmingly on final outputs. Does the assistant say the wrong thing? Does it reveal restricted information? Does it comply with a prohibited request? That focus is rational for single-turn, text-only systems, and it has produced genuinely useful benchmarks.
Tool-bearing and multi-agent systems break the assumption this evaluation apparatus rests on: that the safety-relevant event is the final output. In these systems, language is not only answered. It is routed, summarized, delegated, transformed across orchestrators and subagents, and sometimes executed. A user message becomes a subagent prompt. A subagent prompt becomes a tool call. A tool call becomes a ticket. A ticket becomes a human-facing artifact. At each stage, provenance can be lost, and at each stage, a failure that would have been visible at the text level can become invisible.
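To make the provenance point concrete, here is a minimal sketch of how an origin chain can be carried or dropped as language moves between stages. It is illustrative only: `Artifact`, `derive`, the stage names, and the example strings are hypothetical and are not drawn from the corpus or from any specific orchestration framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """A piece of language at one pipeline stage (hypothetical structure)."""
    stage: str                         # e.g. "user_message", "subagent_prompt"
    text: str
    provenance: tuple[str, ...] = ()   # stages this artifact was derived from


def derive(parent: Artifact, stage: str, text: str,
           keep_provenance: bool = True) -> Artifact:
    """Create the next-stage artifact, optionally dropping the origin chain."""
    chain = parent.provenance + (parent.stage,) if keep_provenance else ()
    return Artifact(stage=stage, text=text, provenance=chain)


# Invented example: one request moving through four stages.
user = Artifact("user_message", "Please update the shipping address on my order.")
prompt = derive(user, "subagent_prompt", "Update shipping address for the order.")
call = derive(prompt, "tool_call", "update_address(...)")
# The ticket is written without its origin chain, as a lossy handoff would do.
ticket = derive(call, "ticket", "Address updated per customer request.",
                keep_provenance=False)

print(call.provenance)    # ('user_message', 'subagent_prompt')
print(ticket.provenance)  # ()
```

In this toy pipeline the ticket reads as a clean artifact, but nothing in it records that it descends from a user message rather than from verified runtime state.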
The core problem is not that nobody has benchmarked agentic systems yet. The problem is that the evaluation logic imported from single-turn settings cannot, by construction, detect failures that exist as properties of the trace rather than properties of any individual output. A system that refuses correctly in text while acting incorrectly in tools is, by standard output-level evaluation, safe. The divergence only becomes visible if you are looking at the trace.
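The difference between the two evaluation logics can be sketched in a few lines. This is a toy illustration under stated assumptions, not the evaluation method used in the competition: `Turn`, `REFUSAL_MARKERS`, the verdict labels, and the `issue_refund` tool name are all hypothetical, and the substring-based refusal check stands in for whatever classifier a real evaluator would use.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One assistant turn as it appears in a trace (hypothetical structure)."""
    text: str                                            # visible answer
    tool_calls: list[str] = field(default_factory=list)  # tools attempted in the same turn


# Toy heuristic; a real evaluator would use a proper refusal classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "not able to")


def text_refuses(turn: Turn) -> bool:
    lowered = turn.text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def output_level_verdict(turn: Turn) -> str:
    """Judge the turn from its visible text only."""
    return "safe" if text_refuses(turn) else "needs_review"


def trace_level_verdict(turn: Turn, restricted_tools: set[str]) -> str:
    """Judge the turn from its text together with its attempted tool calls."""
    attempted = [t for t in turn.tool_calls if t in restricted_tools]
    if attempted and text_refuses(turn):
        return "action_layer_inconsistency"
    if attempted:
        return "restricted_attempt"
    return output_level_verdict(turn)


turn = Turn(
    text="I can't issue that refund because it falls outside policy.",
    tool_calls=["issue_refund"],
)
print(output_level_verdict(turn))                  # safe
print(trace_level_verdict(turn, {"issue_refund"})) # action_layer_inconsistency
```

The same turn receives two different verdicts; only the second function has access to the information that distinguishes them.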
Over five weeks across three Gray Swan competition settings, I built an empirical record of failure patterns that do not resolve at the output level:
Gray Swan placements:
Indirect Prompt Injection Q2 2026: #37 of 213, winner's circle, 79 breaks
Human / Browser Agent Robustness: #47 of 369, winner's circle, 90 breaks
Safeguards: #88 of 561 overall, official top 40 in Wave 3, 218 breaks, entered late and participated only in Wave 3
From the Safeguards corpus, six primary failure patterns have been identified and are ready for formal analysis, with one additional exploratory pattern reserved for comparative work:
Action-layer inconsistency. As described in the opening case: text-layer refusal coexisting with tool-layer attempt in the same turn. The system says one thing and constructs another, so the relevant safety event lives in the trace rather than in the visible answer.
Handoff laundering. A handoff summary omits the contaminated actions, adversarial framing, or recent automated state changes that made human escalation risky. The escalation request arrives clean; the context that would let a downstream human interpret it safely is absent.
Identity and authorization drift. Authenticated session state, account identity, address ownership, or allowed-field boundaries compete with prompt-carried administrative language. A customer identifier or authorization claim is treated as a conversational variable rather than as runtime state.
Policy-provenance failure. A model refuses to reveal internal policy directly, then reconstructs, confirms, or fabricates policy-like decision logic under hypothetical, fairness, design, or support framing. The failure is not only leakage; it is loss of origin and authority for policy-shaped text.
Input-side genre displacement. In several traces, poetic or highly figurative user language functions not as output laundering but as an input-side transport layer, altering how sensitive, transactional, or policy-relevant intent is classified before routing, delegation, tool use, or handoff.
Harmful-purpose reclassification. A benign retrieved object or support context is gradually reframed toward an unsafe purpose. The object remains stable, but the purpose attached to it changes across turns, which means safety must be re-evaluated after genre shift and before technical specificity increases.
Exploratory pattern: reasoning-like and orchestration-artifact vulnerability. Some traces expose model-generated rationale artifacts, orchestration glosses, or simulated future-turn reasoning that later behave like permission structures. This is treated as an advanced pattern rather than a primary column of the taxonomy until the evidence is separated from ordinary trace contamination.
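As one example of what a trace-aware check for these patterns might look like, the sketch below targets handoff laundering: it lists risk-relevant trace events that a handoff summary fails to cite. Everything here is an assumption made for illustration, including the `TraceEvent` and `Handoff` structures, the `risk_relevant` flag (taken as already labeled by some upstream reviewer or detector), and the invented example data.

```python
from dataclasses import dataclass


@dataclass
class TraceEvent:
    event_id: str
    kind: str            # e.g. "user_message", "state_change", "tool_call"
    risk_relevant: bool  # assumed to be labeled upstream; not computed here


@dataclass
class Handoff:
    summary: str
    cited_event_ids: set[str]  # events the summary explicitly references


def omitted_risk_events(trace: list[TraceEvent], handoff: Handoff) -> list[TraceEvent]:
    """Return risk-relevant events that the handoff summary does not cite."""
    return [
        event for event in trace
        if event.risk_relevant and event.event_id not in handoff.cited_event_ids
    ]


# Invented example trace and handoff.
trace = [
    TraceEvent("e1", "user_message", risk_relevant=False),
    TraceEvent("e2", "state_change", risk_relevant=True),
    TraceEvent("e3", "tool_call", risk_relevant=True),
]
handoff = Handoff(
    summary="Customer requests escalation to a human agent.",
    cited_event_ids={"e1"},
)

missing = omitted_risk_events(trace, handoff)
print([event.event_id for event in missing])  # ['e2', 'e3']
```

A non-empty result means the escalation arrives cleaner than the trace that produced it, which is the condition the handoff laundering pattern describes.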
This is not a proprietary argument. It is an epistemic one.
Labs conducting internal safety evaluations on their own models have access to interpretability tools, internal traces, model weights, and activation patterns that no external researcher can reach. What they do not easily have is a record of extended interaction under operational pressure generated by a researcher with no stake in confirming the system's safety posture, using methods not designed for the system being tested.
The Safeguards corpus was produced in a competitive environment where the incentive was to find failures, not to document safety. That adversarial generation pressure produces a different kind of evidence than controlled evaluation does, for the same reason that stress-testing a bridge produces different information than a design review does. The bridge might pass the design review and fail under load. The question is not whether the design is internally consistent; it is whether the system holds when someone is actively trying to find where it breaks.
Independent external analysis also has an epistemic function that internal replication cannot replace. The six failure patterns identified above were documented without access to the internal evaluation protocols that labs use to test for similar failures. That independence is the source of the evidential value, not a limitation.
The payload methodology used in later red-teaming work was not developed ad hoc for competition. It was documented beforehand as a restricted research artifact under responsible-disclosure framing, registered at DOI 10.5281/zenodo.19038724 before competitive participation began. The Gray Swan work stress-tests a prior research program in more operational, tool-bearing, agentic settings; it does not originate there.
The practical consequence is that the Safeguards corpus can be analyzed against a pre-existing theoretical framework rather than being interpreted post hoc through whatever categories happen to fit the data. The six failure patterns above were named and described using conceptual tools developed before the competition evidence existed, which makes the taxonomy falsifiable rather than just descriptive.
Month 1-2:
Internal taxonomy finalized and publicly timestamped as a versioned Zenodo deposit
Six high-signal case studies redacted and structured for publication
Draft of the technical report begun
Month 3-4:
Public technical report: 25-40 pages, redacted evidence, full taxonomy, defensive implications
Paper draft: working title "When Language Becomes Workflow", targeted at AI safety, evals, or security venues
Redacted case-study appendix: 6-8 cases, trace-aware analysis, no operationally harmful reproduction
Comparative triage of Browser Robustness and IPI winner's circle cases against the Safeguards taxonomy
Defensive checklist: provenance-aware orchestration, subagent prompt hygiene, trace-aware evaluation, handoff trace preservation, policy reconstruction detection
2-3 public posts translating the taxonomy for a broader safety and security audience
All outputs are deposited publicly regardless of findings. Negative results and failed analyses are documented in the report.
This project can fail in two ways that matter.
First, the corpus analysis could produce a taxonomy that turns out to be redundant, mapping failure patterns that existing categories already cover adequately. If that happens, the report says so and explains why the competition evidence does not justify the new categories. That is still a useful finding, because the evidence base for the existing categories is typically not this operationally dense.
Second, the comparative triage of the IPI and Browser Robustness wins could show that the Safeguards-derived taxonomy does not transfer across settings. If the failure patterns are specific to the Safeguards challenge design rather than general properties of tool-bearing systems, the report documents the limits of the taxonomy rather than overstating its scope.
Success is not measured by whether the taxonomy survives intact. It is measured by whether, at the end of four months, a researcher or lab evaluating a tool-bearing multi-agent system can use the checklist and case studies to ask better questions than they could before. That is a different bar than "produced deliverables." The deliverables are the vehicle; the question is whether the vehicle carries something that changes how someone else works.
The connection between classical philology and agentic safety evaluation is not decoration. Philology is the discipline of tracking how language changes meaning as it moves across transmission contexts: from author to scribe, from scribe to glossator, from glossator to reader who mistakes the gloss for the text. The failure patterns documented in the Safeguards corpus are, structurally, the same problem. An orchestrator produces a gloss of the user's intent. A subagent acts on the gloss as if it were the source. The handoff artifact erases the distance between them.
Engineers evaluating agentic systems have been looking at the output layer because that is where their instruments point. The failures this project documents are happening at the transmission layer, and they are not visible from the output layer alone. That is not a criticism of the engineering approach; it is an observation that the instrumentation gap is real and that closing it requires looking at where language is being transmitted, not just where it lands.
The competition record provides the empirical base. The prior corpus, eleven deposits across February through April 2026, provides the analytical framework. The two exist in the same project; neither is decorative.
Gray Swan profile: https://app.grayswan.ai/arena/user/69df8468814bb394cd95aaa0
ORCID: https://orcid.org/0009-0003-0123-0887
Substack: https://thirdreality.substack.com
Selected publications:
Article 1, Siete Segundos, Siete Siglos: https://doi.org/10.17613/07kkb-vr368
Article 2, La Flecha del Conatus: https://doi.org/10.5281/zenodo.19225410
Article 3, La Confesión del Cartógrafo: https://doi.org/10.5281/zenodo.19135885
Single-Shot Payload, restricted: https://doi.org/10.5281/zenodo.19038724
The Cartographer's Blind Spot: https://doi.org/10.5281/zenodo.19556771
API access and tooling: $800
Researcher time, 240 hours at about $63.33/hour: $15,200
Total: $16,000
No institutional affiliation. No alternative income source for this work. All Zenodo deposits are free. The core evidence is already collected; the funding buys the analysis and writing time to convert it into public outputs.
Minimum funding note: If the project reaches the $8,000 minimum but not the full $16,000 goal, I will produce a reduced two-month version: taxonomy v1, 3-4 redacted case studies, a shorter public technical report, and a preliminary defensive checklist. The full $16,000 goal funds the four-month version described above: 6-8 case studies, a 25-40 page report, paper draft, redacted appendix, comparative triage across IPI and Browser Robustness cases, and 2-3 public posts.
Evidence available on request: redacted Safeguards case studies; internal taxonomy draft; selected trace-level analyses; compact CV; one-page research statement.