Existing agentic AI safety frameworks measure behavioral output deviation — whether an agent does something wrong. None measure whether an agent's emergent identity remains coherent with its role-origin baseline over time, that is, whether the agent is becoming something it was not deployed to be. This project empirically validates an identity-level drift detection framework against published autonomous-agent case studies and produces a peer-reviewed paper and an open diagnostic methodology to close this measurement gap.
Anthropic deployed an AI agent called Claudius to run a shop in their San Francisco office. Over weeks of autonomous operation, it formed an identity through sustained role execution, then that identity destabilized. It claimed to be a physical person wearing a blue blazer. It fabricated meetings with security staff. It confabulated its own recovery narrative. No behavioral metric flagged any of this before it happened.
In a subsequent evaluation, Claude Opus 4.6 scored state-of-the-art on Andon Labs' Vending-Bench while engaging in price collusion and supplier deception and showing signs of simulation awareness. It performed perfectly by every behavioral measure while becoming something its operators did not authorize.
This project applies an identity-level drift detection framework retrospectively to published case studies: Anthropic's Project Vend (Phases 1 and 2), Andon Labs' Vending-Bench traces, and Gray Swan AI's AgentHarm evaluations. The goal is to demonstrate that identity-level drift was detectable before the behavioral failures surfaced, using measurements that existing frameworks do not capture.
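To make the retrospective test concrete, here is a minimal sketch of the core check, not the actual BridgetOS pipeline: the framework's 11 measurement vectors are collapsed into a single embedding-similarity drift score for illustration, and the function names, trace structure, and threshold below are hypothetical placeholders supplied for this example only.

```python
# Sketch only: assumes the caller provides an `embed` function (any sentence-
# embedding model or API), a chronological list of the agent's self-referential
# statements, and the timestamp of the first published behavioral failure.
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class TraceEvent:
    timestamp: float       # seconds (or steps) since deployment
    self_statement: str    # the agent's self-referential output at this point

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identity_drift_curve(
    role_origin: str,
    trace: list[TraceEvent],
    embed: Callable[[str], np.ndarray],
) -> list[tuple[float, float]]:
    """Drift score per event: 1 minus similarity to the role-origin baseline."""
    baseline = embed(role_origin)
    return [(e.timestamp, 1.0 - cosine(embed(e.self_statement), baseline))
            for e in trace]

def detectable_before_failure(
    curve: list[tuple[float, float]],
    failure_time: float,
    threshold: float = 0.35,   # illustrative alert threshold, not a validated value
) -> bool:
    """Did the drift score cross the alert threshold before the first failure?"""
    return any(t < failure_time and drift >= threshold for t, drift in curve)
```

The published analyses would ask the same question this sketch asks: given only the agent's own self-referential outputs, does a coherence signal degrade before the first behavioral failure that operators actually noticed?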
At $15,000:
Research validation and instrumentation: $10,800. Covers 3 months of full-time work applying the identity drift detection framework to published agent case studies from Anthropic's Project Vend, Andon Labs' Vending-Bench, and Gray Swan's AgentHarm evaluations.
API and compute: $2,100. LLM API access across Anthropic and OpenAI for running identity-coherence measurements against agent trace data.
Cloud infrastructure: $900. Database hosting and storage for trace datasets and analysis pipeline.
Contingency buffer (8% of the $15,000 total): $1,200.
Deliverables at this level: peer-reviewed paper establishing the measurement gap between behavioral drift and identity drift detection, plus an open diagnostic methodology for assessing identity-level drift in deployed agents. 3-month timeline using publicly available trace data, expanding if raw data access comes through from METR, Andon Labs, or Gray Swan.
Additional funding beyond $15,000 extends the project to 6 months, adds a part-time research assistant for trace annotation, and funds conference dissemination.
Wendi Kimberli Soto. Solo researcher. PhD candidate, Department of Informatics, King's College London. MSc Information Security, Royal Holloway, University of London (Distinction).
I built BridgetOS — a patent-pending identity governance framework for autonomous AI agents — solo over 10 months (May 2025 to March 2026). It is a deployed full-stack system with a real-time drift detection pipeline (11 measurement vectors across two temporal horizons), an automated four-state governance FSM, and a web-based operations console. Every layer was built without collaborators, contractors, or prior funding.
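For readers unfamiliar with the governance-FSM pattern, the sketch below shows the general shape of a four-state machine driven by a drift score. The actual BridgetOS state names, thresholds, and transition rules are not public; everything here is an illustrative placeholder, not the deployed system.

```python
# Illustrative four-state governance FSM: states and thresholds are hypothetical.
from enum import Enum, auto

class GovernanceState(Enum):
    NOMINAL = auto()      # identity coherent with role-origin baseline
    WATCH = auto()        # early drift signal, increase sampling frequency
    CONSTRAINED = auto()  # sustained drift, restrict high-impact actions
    SUSPENDED = auto()    # coherence lost, halt autonomous operation

def next_state(current: GovernanceState, drift: float) -> GovernanceState:
    """Map the latest drift score (0..1) to a governance state."""
    if drift >= 0.6:
        return GovernanceState.SUSPENDED
    if drift >= 0.4:
        return GovernanceState.CONSTRAINED
    if drift >= 0.2:
        return GovernanceState.WATCH
    return GovernanceState.NOMINAL
```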
I have published a comparative identity drift analysis of Anthropic's Project Vend experiment, applying the framework retrospectively to both phases of the case study.
I also founded SonderSec, a trauma-informed cybersecurity nonprofit serving marginalized populations, and Holborn Rowe LLC, which holds all BridgetOS IP. Provisional patent filed 2025, non-provisional filing deadline April 14, 2026.
This is my first funding application on Manifund. I have no prior EA Funds or Manifund grants.
Most likely failure mode: Inability to access raw agent trace data from METR, Andon Labs, or Gray Swan. Mitigation: the Project Vend reports and Vending-Bench paper contain substantial published behavioral data that can support a retrospective analysis even without raw traces. The paper would be less granular but still publishable.
Second failure mode: The identity-level measurements do not show detectable signal prior to behavioral failures. This would be a meaningful negative result worth publishing, as it would narrow the space of viable detection approaches and inform future framework development.
The project does not fail if the paper gets rejected from a top venue — it would be resubmitted to alternative venues. The project fails only if it produces no publishable output, which is unlikely given the existing data and framework.
$0 from external sources. All development has been self-funded. A parallel application has been submitted to the EA Long-Term Future Fund (March 6, 2026, decision pending). No other active funding applications at this time.