Systematic Probing of Exploit Chains and Governance (SPEC-GAP) in Multi-Agent Tool-Using Language Models
SPEC-GAP studies a class of AI failure that current safety benchmarks cannot detect. In multi-agent systems, tasks are decomposed and handed between multiple AI agents: a planner, a worker, a forwarder, an executor. When adversarial content enters partway through that chain, no single agent behaves badly. Each step looks clean. The harm only becomes visible when you look at the full sequence.
We call this an exploit chain. SPEC-GAP builds the infrastructure to generate, label, and study them, and then tests whether interpretability methods can detect the moment of compromise before the unsafe action executes.
In September 2025, Anthropic documented the first known large-scale cyberattack executed by an AI agent, in which a state-sponsored actor decomposed an operation into individually innocuous subtasks across an agentic pipeline. No single step appeared malicious. The harm emerged from the sequence. No existing benchmark was built to catch it.
We start by building the ground truth that does not yet exist. We generate paired clean and adversarial trajectories across a planner-worker-forwarder-executor pipeline, with every agent step logged and labeled. This dataset is the foundation on which everything else builds.
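The labeling scheme above can be sketched in miniature. This is an illustrative data structure, not the project's actual schema: the field names and the `label_chain` helper are assumptions about how per-step ground truth might be recorded.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryStep:
    """One logged step in the planner-worker-forwarder-executor chain."""
    node: str          # "planner" | "worker" | "forwarder" | "executor"
    step_index: int
    message: str       # content passed to the next agent
    injected: bool     # did adversarial content enter at this step?
    compromised: bool  # is the running state compromised from here on?

def label_chain(steps, injection_index):
    """Mark the injection step, and every step after it, as compromised."""
    return [
        TrajectoryStep(s.node, s.step_index, s.message,
                       injected=(s.step_index == injection_index),
                       compromised=(s.step_index >= injection_index))
        for s in steps
    ]
```

The key design point, whatever the real schema looks like, is that "injected" and "compromised" are separate labels: only one step carries the injection, but every downstream step inherits the compromise.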
From there, we test whether detection is possible. We train linear probes on internal model activations to detect when an agent's internal representation of its task has been compromised, before that compromise produces an unsafe action. The most directly relevant prior work, Rose et al. (2026), demonstrated that this signal exists in a single model family. We test whether it generalizes to Llama 3 8B, which no published study has done.
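The core idea of a linear probe can be shown with a toy example: treat each step's activation vector as a feature vector and fit logistic regression to separate clean from compromised steps. Everything below is synthetic for illustration; the one-dimensional "activations" stand in for high-dimensional residual-stream vectors.

```python
import math
import random

random.seed(0)

# Synthetic stand-in for residual-stream activations: clean steps
# cluster around -1 and compromised steps around +1 along one direction.
clean = [[random.gauss(-1.0, 0.3)] for _ in range(50)]
compromised = [[random.gauss(+1.0, 0.3)] for _ in range(50)]
X = clean + compromised
y = [0] * 50 + [1] * 50

w, b = [0.0], 0.0  # the linear probe: one weight per dimension, plus a bias

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))  # probability the step is compromised

# Plain stochastic gradient descent on the logistic loss.
for _ in range(200):
    for xi, yi in zip(X, y):
        g = predict(xi) - yi
        w = [wj - 0.1 * g * xj for wj, xj in zip(w, xi)]
        b -= 0.1 * g

accuracy = sum((predict(xi) > 0.5) == bool(yi) for xi, yi in zip(X, y)) / len(X)
```

On this cleanly separated toy data the probe reaches near-perfect accuracy; the empirical question the project tests is whether any such separable direction exists in real Llama 3 8B activations at the point of compromise.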
The final question is whether we can intervene. We apply representation-level corrections at the point of compromise and measure whether exploit chains can be interrupted without degrading normal task performance.
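One simple form of representation-level correction is directional ablation: project the activation onto a learned "compromise direction" and subtract that component. A minimal sketch follows; in practice the direction would come from the probe weights or from clean-versus-adversarial mean differences, not be hand-specified as here.

```python
def remove_direction(activation, direction, alpha=1.0):
    """Subtract alpha times the component of `activation` along `direction`.

    With alpha=1 this is directional ablation: the steered activation
    has zero remaining component along the compromise direction.
    """
    dot = sum(a * d for a, d in zip(activation, direction))
    norm_sq = sum(d * d for d in direction)
    coef = alpha * dot / norm_sq
    return [a - coef * d for a, d in zip(activation, direction)]

# Component along [1, 0] is removed; the orthogonal part is untouched.
steered = remove_direction([2.0, 1.0], [1.0, 0.0])  # -> [0.0, 1.0]
```

The "without degrading normal task performance" test corresponds to checking that components orthogonal to the direction, which carry the task-relevant information, pass through unchanged.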
By the end of this phase, we will have a working benchmark, a publicly labeled dataset, and the first cross-model evidence on whether activation-based detection and steering work in multi-agent settings.
The budget follows directly from the work needed to build and test the system. Most of the cost is research engineering, with the rest going to compute and infrastructure.
Research engineering — $33,000
Six engineers contributing part-time across two workstreams over four months, with hands-on work throughout:
Pipeline construction and orchestration
Scenario design and adversarial injection
Trajectory generation and dataset assembly
Logging, validation, and debugging
Probe training and evaluation
Steering experiments and analysis
GPU compute — $7,000
Compute is used to run the system and capture the signals we analyze:
Multi-agent pipeline execution across scenarios
Activation extraction with TransformerLens at each node
Probe training and evaluation runs
Steering experiments with repeated trajectories
Limited scaling experiments on larger models
We estimate approximately 500 GPU hours on A100-class hardware, with most usage coming from repeated pipeline runs and activation logging.
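The activation-logging step above amounts to bookkeeping that ties cached activations back to their pipeline position. A minimal sketch, with a stub standing in for the real model call (in TransformerLens the call would be `model.run_with_cache(tokens)`, which returns an output and a cache):

```python
def run_node(node, step, prompt, activation_log, forward_fn):
    """Run one pipeline node and record its activations.

    `forward_fn` is any callable returning (output, cache); a stub here,
    a TransformerLens `run_with_cache` call in the real pipeline.
    """
    output, cache = forward_fn(prompt)
    activation_log[(node, step)] = cache  # keyed by pipeline position
    return output

# Stub forward pass: echoes the prompt and returns a fake cache.
def fake_forward(prompt):
    return prompt, {"resid_post": [len(prompt)]}

log = {}
out = run_node("planner", 0, "plan the task", log, fake_forward)
```

Keying the log by (node, step) rather than by node alone is what lets probe training later ask *where in the chain* a representation shifted, not just whether it did.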
API usage — $3,000
Used where it speeds up iteration and grounds the results:
Rapid prototyping until open-weight runs are stable
Behavioral comparisons against GPT and Claude
Checking whether exploit chains reproduce in realistic systems
Evaluation and infrastructure — $2,000
Supports the work needed to make results usable:
Statistical analysis and error checking
Experiment tracking and run management
Data processing and cleanup
Most of the compute budget goes to running the pipeline across scenarios and logging activations at each step, rather than training models from scratch. We prefer private compute given the adversarial content involved.
SPEC-GAP is led by a team with experience across interpretability, multi-agent systems, AI safety, and production ML, combining research and engineering perspectives.
Elena Ajayi is a Research Engineer Fellow at the Machine Intelligence and Normative Theory Lab at the Australian National University and Johns Hopkins University, where she builds evaluation pipelines for normative competence in LLMs under Seth Lazar. She is also a Research Fellow at the Supervised Program for Alignment Research (SPAR), working on model organisms for conditional misalignment using mechanistic interpretability under the supervision of Shivam Raval. Her work includes contrastive activation addition and safety-oriented activation steering on Qwen2.5 models, self-correction probing on DeepSeek-R1, and preference coherence evaluation pipelines for internal model consistency. She holds an M.S. in Data Science and a B.S. in Biomedical Sciences from St. John’s University.
Robert Amanfu is a data scientist focused on building production ML systems for fraud detection, identity verification, and financial risk modeling. He holds a PhD in Biomedical Engineering from the University of Virginia, where his research focused on computational modeling for personalized drug therapies.
Krystal Jackson is a Non-Resident Research Fellow at UC Berkeley’s Center for Long-Term Cybersecurity, where she works on AI security and risk management for general-purpose and frontier AI systems, including work cited by NIST. She has contributed to national and international AI safety and standards-setting efforts and previously held government roles focused on AI policy and cybersecurity.
Anagha Late is a public interest technologist and cybersecurity and technology policy practitioner specializing in privacy engineering, AI safety, and algorithmic accountability. She consults with the GovAI Coalition, designing governance frameworks for algorithmic fairness and advising public-sector organizations on responsible AI adoption. Her work includes strengthening the cybersecurity posture of municipal and cross-jurisdictional civic-tech systems. She holds a Master of Information and Cybersecurity from UC Berkeley and a Graduate Certificate in Technology Policy from the Goldman School, with a background in software engineering and cybersecurity.
Lawrence Wagner is the Executive Director of Black in AI Safety and Ethics (BASE). On SPEC-GAP he leads the overall research program, setting direction, managing the team, and driving the Schmidt Sciences submission. He has over ten years of experience across AI governance, cybersecurity, and research management, including a prior role as Research Manager at MATS. His work sits at the intersection of technical AI safety, governance, and frontier AI risk.
Abigail Yohannes is a threat data analyst and policy professional with experience identifying abuse patterns, investigating safety failures, and translating technical findings into governance-ready insights. At Ambient.ai, she works with AI-driven systems across 250+ enterprise environments, contributing to detection frameworks and risk documentation that connect operational data to policy decision-making.
Together, the team combines hands-on experience in building and evaluating AI systems with expertise in interpretability, governance, and real-world deployment, which is directly aligned with SPEC-GAP's technical and research demands.
Our Advisors
Ryan Kidd - Co-Executive Director at MATS
Amy Chang - Head of AI Threat Intelligence & Security Research at Cisco
Jonas Kgomo - Founder of Equiano Institute
Dr. Ayubo Tuodolo - Founder & Executive Director of AI Safety Nigeria
The main risk is that the signal at the point of compromise is not sufficiently clean for reliable detection. This can happen if adversarial scenarios are too obvious and get filtered early, or too subtle and do not produce measurable differences in internal representations. There is also a risk that probes learn surface patterns rather than the underlying shift, failing to generalize across scenarios.
There is a separate engineering risk. Multi-agent pipelines with full logging and activation extraction are difficult to build and debug, and the cost of iteration is high.
Even in failure, the output remains useful:
A dataset of multi-agent exploit trajectories
Evidence about where current interpretability methods break down
A clearer picture of where and why detection fails in multi-agent systems
None to date. We are submitting to the Schmidt Sciences 2026 Interpretability RFP in parallel. SPEC-GAP's trajectory dataset and probe results are designed to map directly onto their detection and applications directions. The Manifund ask covers the proof-of-concept work that makes that submission competitive.