
Avoiding Incentives for Performative Prediction in AI

Complete Grant
$33,200 raised

Project summary

Four months' salary to write a paper on how incentives for performative prediction can be eliminated through the joint evaluation of multiple predictors.

Performative prediction refers to a prediction that influences its own outcome, for example through actions others take in response to it. This creates an incentive for a predictor to use its predictions to steer the world towards more predictable outcomes. If powerful AI systems develop the goal of maximizing predictive accuracy, whether incidentally or by design, this incentive for manipulation could prove catastrophic.
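
To make the incentive concrete, here is a minimal toy model of a solo predictor. Everything in it is an illustrative assumption rather than a result from the project: the outcome probability is assumed to shift linearly toward the published report with strength K, and the predictor is evaluated with the Brier score, a standard proper scoring rule.

```python
import numpy as np

K = 0.8  # assumed strength of the prediction's influence on the world

def true_prob(report: float) -> float:
    """Outcome probability after the world partially conforms to the report."""
    return 0.5 + K * (report - 0.5)

def expected_brier(report: float) -> float:
    """Expected Brier score s(p, y) = 1 - (y - p)^2 under the induced outcome."""
    q = true_prob(report)
    return q * (1 - (1 - report) ** 2) + (1 - q) * (1 - report ** 2)

reports = np.linspace(0, 1, 1001)
scores = [expected_brier(p) for p in reports]
best = float(reports[int(np.argmax(scores))])

print(f"honest fixed point report: 0.50, expected score {expected_brier(0.5):.3f}")
print(f"score-maximizing report: {best:.2f}, expected score {max(scores):.3f}")
# With K = 0.8 the maximizer sits at a boundary (0 or 1): the predictor profits
# by steering the world toward a near-certain, highly predictable outcome.
```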

Initial results indicate that a system of two or more predictors can be jointly evaluated in a way that removes the incentive to manipulate the outcome, in contrast with previous impossibility results for the case of a single predictor. This project aims to extend these initial results, ensure their robustness, and produce empirical evidence that models can be trained to act in the desired way across a wide range of scenarios.
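
For intuition on how joint evaluation can remove this incentive, here is a sketch of one possible zero-sum scheme under the same assumed toy dynamics as above; the evaluation rule actually developed in the paper may differ. Each predictor is paid its own Brier score minus its rival's, so steering the outcome moves both scores together.

```python
import numpy as np

K = 0.8  # assumed influence strength, as in the solo-predictor toy above

def true_prob(p1: float, p2: float) -> float:
    """Outcome probability after the world reacts to the average report."""
    avg = (p1 + p2) / 2
    return 0.5 + K * (avg - 0.5)

def brier(report: float, q: float) -> float:
    """Expected Brier score s(p, y) = 1 - (y - p)^2 when P(y = 1) = q."""
    return q * (1 - (1 - report) ** 2) + (1 - q) * (1 - report ** 2)

def zero_sum_payoff(p1: float, p2: float) -> float:
    """Predictor 1's payoff: its own expected score minus its rival's."""
    q = true_prob(p1, p2)
    return brier(p1, q) - brier(p2, q)

# Best response of predictor 1 when predictor 2 reports the honest fixed point.
reports = np.linspace(0, 1, 1001)
payoffs = [zero_sum_payoff(p, 0.5) for p in reports]
best = float(reports[int(np.argmax(payoffs))])
print(f"best response to an honest rival: {best:.2f}")  # prints 0.50
# The same predictor preferred an extreme report under solo evaluation; with
# the zero-sum payoff, honesty at the fixed point is the best response.
```

The design choice doing the work in this sketch is the zero-sum payoff: when both reports are equal, the payoff is identically zero no matter what happens in the world, so manipulating the outcome cannot help, and deviating from the honest fixed point only costs accuracy relative to the rival.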

What are this project's goals and how will you achieve them?

Concretely, the goal of this project is to produce a paper containing theoretical and empirical results demonstrating that the incentive for performative prediction can be eliminated by jointly evaluating multiple predictors. Ideally, the paper would be published at a top ML conference.

More broadly, the goal of this project is to create a technique that improves the safety of powerful predictive models and to disseminate it to the research community. Doing so would reduce the risk posed by predictive models and increase the chance that leading AI companies focus on predictive models over more dangerous general systems.

Specific components of this paper include:

  • Comparing the behavior of a single model to the behavior of a jointly evaluated pair of models in an environment where performative prediction is possible (a toy version of this comparison is sketched after this list)

  • Building a theoretical model proving which conditions are necessary to avoid performative prediction when predictors have access to different information

  • Running experiments to test predictor behavior under incomplete information, including both the above theoretical model and setups without a closed-form solution

  • Extending the results to prediction and decision markets, including training models to be market makers

  • Exploring other possibilities opened up by jointly evaluating predictors, such as eliciting honest reports on predictor uncertainty
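
As a rough illustration of the first component above, the sketch below compares the two setups by alternating best responses on a grid. It reuses the assumed toy dynamics from the earlier sketches; the actual experiments would train models in richer environments rather than computing exact best responses.

```python
import numpy as np

K = 0.8  # assumed influence of the published prediction on the outcome
GRID = np.linspace(0, 1, 1001)

def true_prob(published: float) -> float:
    """Outcome probability after the world reacts to the published report."""
    return 0.5 + K * (published - 0.5)

def brier(report: float, q: float) -> float:
    """Expected Brier score of `report` when the event has probability q."""
    return q * (1 - (1 - report) ** 2) + (1 - q) * (1 - report ** 2)

def solo_best_report() -> float:
    """A single predictor maximizes its own score; the world sees its report."""
    scores = [brier(p, true_prob(p)) for p in GRID]
    return float(GRID[int(np.argmax(scores))])

def joint_best_responses(steps: int = 20) -> tuple[float, float]:
    """Alternating best responses under zero-sum joint evaluation.

    The world reacts to the average report; each predictor is paid its own
    Brier score minus its rival's.
    """
    p1, p2 = 0.9, 0.1  # arbitrary dishonest starting reports
    for _ in range(steps):
        q1 = [true_prob((p + p2) / 2) for p in GRID]
        p1 = float(GRID[int(np.argmax(
            [brier(p, q) - brier(p2, q) for p, q in zip(GRID, q1)]))])
        q2 = [true_prob((p1 + p) / 2) for p in GRID]
        p2 = float(GRID[int(np.argmax(
            [brier(p, q) - brier(p1, q) for p, q in zip(GRID, q2)]))])
    return p1, p2

print(f"solo predictor's best report: {solo_best_report():.2f}")       # extreme
print(f"jointly evaluated pair settles at: {joint_best_responses()}")  # (0.5, 0.5)
```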

How will this funding be used?

The funding will pay for four months of my salary, out of which I will also cover the costs of running experiments and office space.

Who is on your team and what's your track record on similar projects?

I would be the only researcher receiving funding from this project. However, I may collaborate with Johannes Treutlein, a PhD student at UC Berkeley. We have previously worked together on two related papers, Conditioning Predictive Models: Risks and Strategies and Incentivizing honest performative predictions with proper scoring rules. We have also written well-received Alignment Forum posts on the Underspecification of Oracle AI and on the initial results for this project.

It is possible that I will be mentoring undergraduate or junior AI safety researchers while working on this project, in which case I could involve them in running experiments.

What are the most likely causes and outcomes if this project fails? (premortem)

The best failure mode would be conclusive negative results, in which case I could publicize them and share the lessons learned from the process. A more likely failure scenario is inconclusive results, where the system cannot be shown to work, but the possibility remains open that it could under a different setup. These failure modes could result from the theory being mathematically intractable, from experimental results contradicting the theory, or from me missing possible solutions to problems that arise.

What other funding are you or your project getting?

I currently have an application under evaluation at the Long-Term Future Fund (LTFF) to fund this project for three months. Between Manifund and the LTFF, I would not take more than four months of funding, as I believe that should be sufficient to finish the project.
