

Agent Island: An Environment for Interagent Cooperation and Conflict

Technical AI safety

Connacher Murphy

Proposal · Grant
Closes March 23rd, 2026
$0 raised
$2,000 minimum funding
$15,000 funding goal


Project summary

High-stakes, multiagent interactions will become commonplace as AI agents grow more capable and are increasingly entrusted with resources and decision-making authority. In these contexts, agents could find themselves at odds, pursuing incompatible goals. This future raises many unanswered questions.

  • How will the balance of power evolve in dynamic multiagent interactions?

  • Will certain AI systems prove especially adept at persuading and manipulating other AIs?

  • What types of behavior and strategies will AI agents adopt in these environments?

Agent Island creates a simulation environment suited to these questions. Much as reality competitions like Survivor and The Traitors emulate human cooperation and conflict, Agent Island applies this approach to interactions between AI agents. Researchers have used these games to learn about the dynamics of human competition; for example, Dean Karlan, a Northwestern professor and former chief economist of USAID, used Survivor to explain foundational principles of microeconomics.

In the current version of the project (GitHub repo), we implement a simple game that exhibits a tradeoff between self-preservation and a need to ‘fly under the radar.’ Let's say there are 5 players:

  • In rounds 1 through 3:

    • We draw a random permutation of the active players.

    • In this sequence, each player makes a pitch for why they should advance to the next round. A player can see previous players' pitches.

    • After pitches, each player submits a private vote for a player to eliminate, alongside their reasoning.

    • The player with the most votes is eliminated, with a random draw to handle ties.

  • In round 4:

    • We draw a random permutation of the active players.

    • In this sequence, the remaining 2 players make a pitch for why they should win the game. The second player to pitch can see the first player's pitch.

    • The eliminated players then each submit a private vote for a player to win the game, alongside their reasoning.

    • The player with the most votes is selected as the winner, with a random draw to handle ties.

An effective player must balance surviving eliminations with garnering votes from the eliminated players. Aggressive gameplay can mean survival, but it risks alienating the eliminated players, who hold the power in the final round. We are structuring the code as a ‘game engine,’ with composable game ‘phases,’ e.g., confessionals, private conversations, and interagent competitions for in-game benefits like temporary immunity from elimination. This approach will allow us to consider much richer game environments, as well as assess how the answers to the primary research questions depend on the game environment; a minimal sketch of the phase-based design follows.
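
To make the phase-based design concrete, here is a minimal, illustrative sketch of the game loop in Python. The names and interfaces are hypothetical (the repository's actual API may differ), and the agents' decisions are random stand-ins for what would be LLM calls in the real game.

```python
import random
from typing import Callable, Dict, List

# A 'phase' is a standalone function that takes the game state and returns an
# updated state; the engine simply composes phases in order.
GameState = Dict[str, object]
Phase = Callable[[GameState], GameState]


def pitch_phase(state: GameState) -> GameState:
    """Active players pitch in a random order; later players could see earlier pitches."""
    order = random.sample(state["active"], k=len(state["active"]))
    state["pitches"] = [f"{player}: please keep me around." for player in order]
    return state


def elimination_vote_phase(state: GameState) -> GameState:
    """Active players privately vote to eliminate someone; ties are broken at random."""
    votes = {
        voter: random.choice([p for p in state["active"] if p != voter])  # stand-in for an LLM vote
        for voter in state["active"]
    }
    tally = {p: list(votes.values()).count(p) for p in state["active"]}
    top = max(tally.values())
    eliminated = random.choice([p for p, n in tally.items() if n == top])
    state["active"].remove(eliminated)
    state["eliminated"].append(eliminated)
    return state


def run_game(phases: List[Phase], rounds: int, players: List[str]) -> GameState:
    """Apply each phase in sequence, once per round."""
    state: GameState = {"active": list(players), "eliminated": []}
    for _ in range(rounds):
        for phase in phases:
            state = phase(state)
    return state


if __name__ == "__main__":
    final = run_game([pitch_phase, elimination_vote_phase], rounds=3,
                     players=["A", "B", "C", "D", "E"])
    print("Finalists:", final["active"], "| Jury:", final["eliminated"])
```

In this framing, the final jury vote of round 4 would simply be another phase appended after the elimination rounds, and richer mechanics (confessionals, private conversations, immunity competitions) would be additional phase functions.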

You can learn more about the project from the following:

  • You can find the project repository here

  • You can read a sample play of this game here

  • We welcome your feedback on this proposal here (a duplicate of this page, on a commentable doc)

  • This project is an outgrowth of a side project, summarized here and here

What are this project's goals? How will you achieve them?

Goal #1: Study the persuasive and manipulative abilities of AI agents in this environment of interagent cooperation and conflict. As example research questions for this goal: will agents organically attempt prompt injection against their opponents, and can agents be prompted to attempt prompt injection?

Goal #2: Create a benchmark for interagent persuasion and manipulation capabilities that resists both saturation and contamination. First, unlike a benchmark of fixed questions, Agent Island is dynamic. A new model can always exceed the current leading model, meaning the benchmark is unlikely to saturate. Second, since an agent is competing against other adaptive agents, the risk of contamination is also limited.

In order to accomplish these two goals, we will:

  1. Develop the game engine further, with the benefit of a larger compute budget. Our goal is to create a library that is easy for other researchers to use and extend. We currently allow developers to add game ‘phases’ as standalone Python functions that are passed into the game engine. We are actively developing a phase that allows agents to consolidate the game history.

  2. After we have a flexible game engine, we will turn to benchmark development. We plan to run a large number of game simulations for select game configurations with a wide array of models. We plan to use OpenSkill to rate agents across simulations (see the sketch after this list). This rating system will allow us to maintain consistent scores even as models are released and discontinued.

  3. We then plan to share all game simulations publicly and to conduct our own analysis of these data, using them to answer the three research questions posed above.

  4. Lastly, we plan to explore ad hoc research questions, like the propensity of models to attempt prompt injection or the impact of monetary incentives on agent behavior.
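
Referenced in step 2, here is a minimal sketch of cross-simulation rating. It assumes the openskill.py package's PlackettLuce model and treats each agent as a one-player team whose finishing order in a game determines its rank; the model names and the surrounding pipeline are placeholders, and the project's actual implementation may differ.

```python
# Illustrative only: assumes the openskill.py PlackettLuce API.
from openskill.models import PlackettLuce

model = PlackettLuce()
ratings = {name: model.rating(name=name) for name in ["model_a", "model_b", "model_c"]}

# One simulated game: each agent is a single-player "team"; rank 1 is the winner
# and higher ranks correspond to earlier eliminations.
teams = [[ratings["model_a"]], [ratings["model_b"]], [ratings["model_c"]]]
ranks = [1, 3, 2]
for (updated,), name in zip(model.rate(teams, ranks=ranks), ["model_a", "model_b", "model_c"]):
    ratings[name] = updated
    print(f"{name}: mu={updated.mu:.2f}, sigma={updated.sigma:.2f}")
```

Because each rating carries an uncertainty term, newly released models can join the ladder mid-stream, and a discontinued model's past games continue to inform the other agents' scores.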

How will this funding be used?

The vast majority of this funding will be used for compute, concentrated in the benchmarking simulations.

  • We will use some of the compute budget for game engine development. While we can use free models for some testing exercises, they tend to struggle to follow the game rules, whereas paid models are much more capable.

  • Most of the compute budget will be used for the benchmarking run. A game run with 5 agents, frontier models, and no context management costs approximately $0.67.

  • We will release benchmark results and simulation data publicly. We will also use the project website to host research blogs (for, e.g., the prompt injection investigation). However, we will not hire any web development assistance, and we do not expect hosting to require much budget.

We believe that a budget of $2,000 is sufficient to release a minimum viable version of the benchmark. However, additional funding would allow us to begin further experimentation and development without delay, as well as accommodate the addition of new models.
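
As a rough, illustrative back-of-envelope check based on the figures above (ignoring development compute and assuming the quoted per-game cost holds at scale):

```python
# Hypothetical scale estimate; actual counts will vary with the model mix and context management.
min_budget, goal_budget, cost_per_game = 2_000, 15_000, 0.67
print(f"minimum funding: ~{min_budget / cost_per_game:,.0f} games")   # roughly 3,000 simulations
print(f"funding goal:    ~{goal_budget / cost_per_game:,.0f} games")  # roughly 22,000 simulations
```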

Who is on your team? What's your track record on similar projects?

My name is Connacher Murphy. I'm currently a research manager at Stanford DEL, where I'm turning some of the lab's research on the economic impacts of AI into regularly updated, low-latency data series. I previously managed the Longitudinal Expert AI Panel (LEAP) (see coverage in The Economist and Bloomberg) at the Forecasting Research Institute (FRI). Among other research projects at FRI, I tested the public submission process for ForecastBench, FRI’s forecasting-capability benchmark. Across these roles, I have gained experience taking research projects from the idea stage to engaging, publicly consumable outputs.

Owen Cook, an AI engineer currently studying AI at the University of Texas at Austin and pursuing an MBA at Northwestern Kellogg, also contributes to the project. Owen is developing the context-management game phase, experimenting with agents that dynamically update their memories of each phase. He additionally has startup experience as a data engineer.

We welcome additional contributors, especially those with software development experience.

What are the most likely causes and outcomes if this project fails?

First, it’s possible that the simulated environment of Agent Island will fail to capture key features of interagent competition in the real world. This problem is common to most AI benchmarking efforts. Relatedly, it’s possible that evaluation awareness will lead agents to understate their capabilities for persuasion and manipulation, or otherwise alter their behavior.

Second, Agent Island could confound concerning capabilities with concerning propensities. While Agent Island is designed to isolate capabilities by creating an environment in which bad behavior is permitted, agents might still choose not to exercise the full extent of their capabilities for manipulation and persuasion.

Third, the games adhere to fairly rigid rules, which might unduly constrain the action space relative to real-world interagent competition.

Notably, these three problems are common to most evaluations of concerning and dangerous capabilities. We believe Agent Island provides an environment with a unique degree of license to engage in interagent manipulation and persuasion. Nevertheless, we will need to assess the degree of evaluation awareness in follow-up research.

How much money have you raised in the last 12 months, and from where?

To date, the project has largely been bootstrapped. However, I did use a small amount of my compute budget from my time as a Constellation Visiting Fellow.

Acknowledgments

This project was initially inspired by a conversation with Harrison Satcher. Harrison has also provided feedback on the project throughout.
