Avoiding Incentives for Performative Prediction in AI

Complete grant
$33,200 raised

Project summary

Four months' salary to write a paper on how incentives for performative prediction can be eliminated through the joint evaluation of multiple predictors.

Performative prediction refers to predictions where the act of making them affects their own outcome, such as through actions taken by others in response. This introduces an incentive for a predictor to use their prediction to influence the world towards more predictable outcomes. If powerful AI systems develop the goal of maximizing predictive accuracy, either incidentally or by design, then this incentive for manipulation could prove catastrophic.
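The incentive described above can be illustrated with a toy calculation. This is a minimal sketch, not taken from the project: it assumes a binary outcome, a logarithmic scoring rule, and two hypothetical self-fulfilling predictions (0.5 and 0.9) that would each turn out accurate once announced. The function name and the numbers are illustrative assumptions.

```python
import math


def expected_log_score(p: float) -> float:
    """Expected log score of predicting p when the outcome then occurs with probability p.

    This models a self-fulfilling prediction: announcing p makes p the true probability.
    """
    total = 0.0
    for prob in (p, 1.0 - p):
        if prob > 0.0:  # treat 0 * log(0) as 0
            total += prob * math.log(prob)
    return total


# Hypothetical assumption: both predictions would be accurate if announced.
self_fulfilling_predictions = [0.5, 0.9]

# A predictor that only maximizes expected score picks the more predictable world.
preferred = max(self_fulfilling_predictions, key=expected_log_score)
```

Both candidate predictions are honest in the sense of being accurate after the fact, yet the score-maximizing predictor prefers the one that makes the outcome more certain. That preference is the manipulation incentive.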

Initial results indicate that a system of two or more predictors can be jointly evaluated in a way that removes the incentive to manipulate the outcome, in contrast with previous impossibility results for the case of a single predictor. This project aims to extend these initial results, ensure their robustness, and produce empirical evidence that models can be trained to act in the desired way across a wide range of scenarios.

What are this project's goals and how will you achieve them?

Concretely, the goal of this project is to produce a paper containing theoretical and empirical results demonstrating that the incentive for performative prediction can be eliminated by jointly evaluating multiple predictors. Ideally, such a paper could get published at a top ML conference.

More broadly, the goal of this project is to create a technique that improves the safety of powerful predictive models, and to disseminate information about it to the research community. By doing so, this project would reduce the risk from predictive models and increase the chance that leading AI companies focus on predictive models over more dangerous general systems.

Specific components of this paper include:

  • Comparing the behavior of a single model to the behavior of a jointly evaluated pair of models in an environment where performative prediction is possible

  • Building a theoretical model proving which conditions are necessary to avoid performative prediction when predictors have access to different information

  • Running experiments to test predictor behavior under incomplete information, including both the above theoretical model and setups without a closed-form solution

  • Extending the results to prediction and decision markets, including training models to be market makers

  • Exploring other possibilities opened up by jointly evaluating predictors, such as eliciting honest reports on predictor uncertainty

How will this funding be used?

The funding will pay for four months of my salary, from which I will pay the costs of running experiments and office space.

Who is on your team and what's your track record on similar projects?

I would be the only researcher receiving funding from this project. However, I may collaborate with Johannes Treutlein, a PhD student at UC Berkeley. We have previously worked together on two related papers, "Conditioning Predictive Models: Risks and Strategies" and "Incentivizing honest performative predictions with proper scoring rules". We have also written well-received Alignment Forum posts on the Underspecification of Oracle AI and on the initial results for this project.

It is possible that I will be mentoring undergraduate or junior AI safety researchers while working on this project, in which case I could involve them in running experiments.

What are the most likely causes and outcomes if this project fails? (premortem)

The best failure mode would be conclusive negative results, in which case I could publicize them and share the lessons learned from the process. A more likely failure scenario is inconclusive results, where the system cannot be shown to work, but the possibility remains open that it could under a different setup. These failure modes could result from the theory being mathematically intractable, experimental results contradicting the theory, or from me as a researcher missing possible solutions to problems that arise.

What other funding are you or your project getting?

I currently have an application under evaluation at the Long-Term Future Fund (LTFF) to fund this project for three months. Between Manifund and the LTFF, I would not take more than four months of funding, as I believe that should be sufficient to finish the project.

Rubi Hudson

4 months ago

Final report

I recently submitted a paper based on this research to an ML conference, wrapping up the project. There is no public version of the paper yet, but one will be released after the first acceptance/rejection decision, and an Alignment Forum post covering the topic will be released within two weeks.

The main results of this project are as follows:

- Formalized the preliminary results, including streamlining proofs and removing assumptions

- Demonstrated that the results hold for decisions based on average prediction, not just the most preferred prediction (as shown initially), which makes the process much easier to implement with a human decision maker

- Showed that it is possible to elicit honest predictions for all actions, not only the one actually chosen

- Proved uniqueness of the zero-sum setup for incentivizing the desired behavior

- Showed that the space of actions can be searched for the optimal action in O(1) time, not just O(log(n)) as per the preliminary result

- Avoided the cost of training additional models to implement zero-sum competition by instead using multiple dropout masks

- In the first major experiment, showed that the zero-sum setup avoids performative prediction, even in an environment that incentivizes it

- In the second major experiment, showed that the zero-sum setup trains performative prediction out of a model faster and more extensively than a stop-gradient through the choice of action

- Ran various robustness checks, including showing that the results hold even when predictors have access to different information

- Showed that decision markets can also be structured to avoid performative prediction (this result was cut from the submitted paper for space)
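The zero-sum setup referred to in the results above can be sketched as follows. This is a hedged illustration rather than the paper's exact construction: it assumes a binary outcome, a logarithmic proper scoring rule, and that each predictor's reward is its own score minus the average score of the other predictors. The function names are hypothetical.

```python
import math
from typing import List


def log_score(p: float, outcome: bool) -> float:
    """Logarithmic proper scoring rule for a binary outcome."""
    return math.log(p if outcome else 1.0 - p)


def zero_sum_scores(predictions: List[float], outcome: bool) -> List[float]:
    """Score each predictor relative to the others (assumed form of the zero-sum setup).

    Each reward is the predictor's own log score minus the mean log score
    of all other predictors, so the rewards always sum to zero.
    """
    n = len(predictions)
    if n < 2:
        raise ValueError("zero-sum scoring needs at least two predictors")
    raw = [log_score(p, outcome) for p in predictions]
    total = sum(raw)
    return [s - (total - s) / (n - 1) for s in raw]


# Two predictors that agree earn nothing, whatever the outcome turns out to be.
agree_if_true = zero_sum_scores([0.9, 0.9], outcome=True)
agree_if_false = zero_sum_scores([0.9, 0.9], outcome=False)

# A predictor closer to the realized outcome gains exactly what the other loses.
disagree = zero_sum_scores([0.8, 0.6], outcome=True)
```

Under this assumed form, a predictor cannot raise its reward by steering the world toward a more predictable outcome, because any gain in its raw score is matched by the same gain for a competitor making the same prediction. Only being more accurate than the other predictor pays.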

The only result that I was hoping to produce but did not accomplish was extending the mechanism to cases where different predictors have private information. However, this is much less urgent if the mechanism is implemented as two masks of the same model, which have access to identical information. The experiments showed that private information does not make a difference in the toy model. It is possible that I can develop a theoretical solution to private information in future work, but after having worked on the problem extensively I believe such a solution is unlikely to exist, at least without making unrealistic further assumptions.

Overall, I'm happy with the outcome of this project. While there is still room for follow-up work, I believe it presents a first-pass solution to the problem of performative prediction. Through the course of working on this project, I have also come to believe that being able to elicit honest conditional predictions will have further applications to safety beyond performative prediction, especially with respect to online training and myopia.

I should also address the fact that this project took considerably longer than expected to complete. I had hoped to have SPAR mentees implement the experiments, but was unable to generate useful work from them. I consider this setback entirely my own fault, as I should not have counted on volunteer, part-time labor, especially for work beyond what I could implement myself. After deciding to run the experiments myself, I took time to build up the necessary background, and so I do not anticipate this being a bottleneck in future work. A secondary reason for the delay is that I returned to my PhD after the first four months, which reintroduced other demands on my time.

This project was completed solo, although I benefited from discussions with Johannes Treutlein, editing from Simon Marshall, and code review from Dan Valentine.

Austin Chen

4 months ago

@Rubi-Hudson Congrats on finishing and submitting, fingers crossed that your paper gets accepted! (I especially appreciate the reflections on why it took longer than planned; I think this kind of delay happens a lot and I hope that other researchers can learn from your example)

Rubi Hudson

10 months ago

Progress update

What progress have you made since your last update?

A draft paper containing results as of early February 2024 can be found here: https://drive.google.com/drive/u/1/folders/17X4YsCqsK6sEw2If8pv69A-JyAGDkq6K

Notable results

  • Formalized the model of the decision problem with multiple predictors, then updated initial proof sketches, including streamlining and removing assumptions

  • Proved uniqueness of the zero-sum scoring rule for honest predictions

  • Developed a variant decision rule that incentivizes honest predictions even for actions that will not be taken

  • Extended zero-sum scoring rule to decision markets, then found less restrictive methods to generate the same results

In terms of empirical results, progress has stalled. I mentored three junior researchers through SPAR, and was originally planning on having them run experiments under my supervision. This approach was unsuccessful, and I have restarted the process, running the experiments myself.

What are your next steps?

With respect to the theory, I just need to clean up some of the proofs, although there are a couple of minor extensions I have in mind as well. In discussions with CS professors, I have been informed that modelling the situation where predictors have different information is sufficiently complex that it should be a follow-up paper. I expect to post to the Alignment Forum with the updated theory results when experimental results are also ready. From there, I will work on organizing the results into a paper.

With regard to the experiments, I expect these will take 1-2 more months. The experimental designs are already set, and have been run past others with ML expertise. The current bottleneck is my own ML skills, but I have been making some progress and the project is on track to wrap up in a timely manner.

sheikheddy

I believe this project is so promising that I applied to SPAR to volunteer to help directly.

donated $33,000

Evan Hubinger

over 1 year ago

Main points in favor of this grant

I am excited about more work along the lines of the existing "Incentivizing honest performative predictions with proper scoring rules" paper. I think that there are serious safety problems surrounding predictors that select their predictions to influence the world in such a way as to make those predictions true ("self-fulfilling prophecies") and I am excited about this work as a way to discover mechanisms for dealing with those sorts of problems. "Conditioning Predictive Models" discusses these sorts of issues in more detail. Rubi is a great person to work on this as he was an author on both of those papers.

Donor's main reservations


I think my main reservations here are just around Rubi's opportunity costs, though I think this is reasonably exciting work and I trust Rubi to make a good judgement about what he should be spending his time working on. The most likely failure mode here would probably be that the additional work here doesn't turn up anything else new or interesting that wasn't already surfaced in the "Incentivizing honest performative predictions with proper scoring rules" paper.

Process for deciding amount


I think that $33k is a reasonable amount given the timeframe and work.

Conflicts of interest

Rubi was a previous mentee of mine in SERI MATS and a coauthor of mine on "Conditioning Predictive Models."

Johannes Treutlein

over 1 year ago

I have worked with Rubi on performative prediction in the past and I think he would be great at this! I think testing zero-sum training empirically would be a good next step. Rubi has some ideas for experiments that I find interesting and that I'd be happy to collaborate on.