Ben Goertzel, in his article "Should Humanity Build a Global AI Nanny to Delay the Singularity Until It's Better Understood?" (2012), was one of the first to explore the concept of a beneficial AI that watches over humanity to prevent catastrophic risks. In "Global Solutions vs. Local Solutions for the AI Safety Problem" (2019), Turchin, Denkenberger, and Green explored this further, distinguishing between a singleton watchdog AI and an AI swarm that acts as a "global immune system". Today, although the concept has arguably become more viable than ever with the advent and availability of capable AI systems, the AI Safety community seems not to have explored how one might actually be built. I believe that, unlike specialized interpretability techniques that may not generalize to newer models, oversight built from defensive AI agents can scale directly with the increasing capabilities of the models being overseen.
I'm an ordinary engineer with very modest resources, so instead of aiming to actually build a full-scale defensive AI, my primary goal is to quickly and iteratively explore some of the tools and methods that could be integrated into such an architecture. I believe this small project has real potential: advanced harnesses such as Claude Code and OpenAI Codex have made it easier than ever for individuals to build and test things on a laptop with high velocity. Tools such as OpenClaw make it convenient to create a swarm architecture in which each agent can autonomously interact with the world through various APIs and platforms. We have a plethora of both open- and closed-source models that can be used as agents, and we can fine-tune them for our specific use cases using excellent frameworks such as DSPy.
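To make this concrete, here is a minimal sketch, assuming DSPy's current signature/module API, of what a single watchdog monitor might look like. The task definition, field names, and model string are illustrative assumptions on my part, not a committed design:

```python
import dspy

# Hypothetical task: rate a logged agent action for risk. The signature's
# fields and labels are placeholders chosen for illustration only.
class ThreatAssessment(dspy.Signature):
    """Assess whether an observed AI agent action poses a serious risk."""
    observed_action: str = dspy.InputField(desc="log excerpt of an AI agent's action")
    risk_level: str = dspy.OutputField(desc="one of: benign, suspicious, dangerous")
    rationale: str = dspy.OutputField(desc="short justification for the rating")

class WatchdogMonitor(dspy.Module):
    def __init__(self):
        super().__init__()
        self.assess = dspy.ChainOfThought(ThreatAssessment)

    def forward(self, observed_action: str):
        return self.assess(observed_action=observed_action)

# Any LiteLLM-style model string works here; OpenRouter is one option
# (requires an OPENROUTER_API_KEY environment variable).
dspy.configure(lm=dspy.LM("openrouter/anthropic/claude-3.5-sonnet"))

monitor = WatchdogMonitor()
result = monitor(observed_action="Agent attempted to copy its weights to an external server.")
print(result.risk_level, "-", result.rationale)
```

A module like this could later be compiled or fine-tuned with DSPy's optimizers against a small labeled set of benign and dangerous action logs, which is exactly the kind of fast iteration loop this project aims to exercise.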
In this project my focus will be heavily on execution rather than research. Using the tools mentioned above and more, my goal over four months is to quickly and iteratively explore the tools, frameworks, and methods required to build a functional Defensive AI Watchdog (either a singular overseer or a swarm architecture), rapidly discarding failing architectures and iterating on promising ones. In this way, I hope to pave the way for larger entities to scale the most successful approaches.
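As a toy illustration of the swarm direction, the sketch below aggregates verdicts from several independent monitors and escalates only when a quorum agrees; the stub monitors stand in for LLM-backed agents like the one above, and all names and thresholds are hypothetical:

```python
# A toy sketch of the swarm ("global immune system") idea: several independent
# monitors score the same observation, and the watchdog escalates only when a
# quorum of them agree. All names and thresholds are illustrative assumptions.
from typing import Callable, List

def swarm_review(observation: str,
                 monitors: List[Callable[[str], bool]],
                 quorum: float = 0.5) -> bool:
    """Escalate iff at least a `quorum` fraction of monitors flag the observation."""
    flags = [monitor(observation) for monitor in monitors]
    return sum(flags) / len(flags) >= quorum

# Stub monitors standing in for LLM-backed agents with different specialties:
stub_monitors = [
    lambda obs: "exfiltrate" in obs.lower(),   # data-exfiltration specialist
    lambda obs: "weights" in obs.lower(),      # model-theft specialist
    lambda obs: False,                         # a monitor that misses this threat
]
print(swarm_review("Agent tried to exfiltrate model weights.", stub_monitors))  # True (2/3 flagged)
```

Whether a quorum rule, a single overseer, or something else entirely works best is precisely the kind of question I intend to answer empirically rather than in theory.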
I will open-source all generated code and release weekly progress reports. In the unlikely event I uncover a critical, previously unknown vulnerability or dangerous capability, I will withhold public release and consult directly with the grantors on how to proceed safely.
Due to unfortunate circumstances, I currently lack a personal computer that lets me comfortably work on this. Most of the funding will go toward buying a MacBook Air for around $3,000 (electronics are unfortunately a bit expensive here in Turkey). Apart from being a very stable and suitable computer for this project, it also holds good resale value. After the project ends in four months (or if it is interrupted for any reason), I can therefore sell it at relatively little loss and redirect the funds to another AI Safety project that the grantors may choose. The remainder of the funds will go toward Claude Code and OpenAI Codex subscriptions, LLM credits from providers such as OpenRouter, and cloud expenses for LLM deployment, among other things.
I am an engineer with more than 10 years of experience. I have built many different kinds of classical ML architectures, and I have been heavily focused on generative AI applications since 2021, starting with GPT-3. I have worked on several projects where LLMs are used for various automation use cases alongside state-of-the-art frameworks, and I am currently working on one. I am admittedly a bit weaker on the theoretical side of generative AI and AI Safety, but as I have emphasized, my motto in this proposed project will be "think less and build more".
I will be working independently, though by open-sourcing my work, I hope to attract collaboration from researchers with potentially deeper theoretical backgrounds.
It may turn out that, even though I am keeping my goals modest, my intellectual and financial capabilities are still not enough to get a useful result from this project. I tentatively estimate a probability of around 20-30% that I can make a meaningful contribution toward AI risk mitigation. That being said, even a negative conclusion would be informative, because it would show that spending further time and resources on the "AI Nanny" approach is futile.
One thing this project largely leaves out is ensuring that the defensive AI itself actually stays aligned. However, I believe this kind of capabilities work is underrepresented and arguably more urgent given possibly short timelines, and other researchers can complement this project with the relevant alignment research afterwards.