David Rhys Bernard

3 months ago

1. How much money have you spent so far? Have you gotten more funding from other sources? Do you need more funding?

I spent $1100 out of the $3000 I received on paying experienced forecasters. I haven't received any additional funding since I received the money from Manifund (although I already had funding from the EA Long-Term Future Fund for the other parts of the research project). I do not need more funding.

2. How is the project going? (a few paragraphs)

I completed the research project and wrote up my results here (https://davidrhysbernard.files.wordpress.com/2023/08/forecasting_drb_230825.pdf). This is included as a chapter of my PhD dissertation which I will defend in two weeks!

As I was already running the study with academics and laypeople, the parts of the paper that should be counted for credit in the impact evaluation are the ones which involve analysis of expert forecasters. In particular, these are sections 4.1.1 and 4.1.2, including Table 1 Panel B, Table 2, Figure 3, and Figure 4. You may have to read the intro or sections 2.2 and 3 to get sufficient context on the study to understand the results.

In section 4.1.1 (Forecaster type) I show that my sample of forecasters performs better at forecasting short- and long-run treatment effects than academics (recruited from the Social Science Prediction Platform) and laypeople (recruited from Prolific). In particular, regardless of the accuracy metric used, both academics and forecasters are statistically significantly better than laypeople. Under my preferred log score accuracy metric (which uses the full distribution given), forecasters are better than academics. However, under a negative absolute error metric (which uses only the central point of the distribution), there is no significant difference between forecasters and academics. This suggests that the forecasters are better at forecasting a range of likely treatment effects, but no better at specifying the most likely effect within that range.
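To illustrate the distinction between the two metrics, here is a minimal sketch (not from the paper; the forecast distributions and realised effect are made-up numbers, and forecasts are assumed normal for simplicity). The log score evaluates the forecast density at the realised value, so it rewards well-sized uncertainty, while negative absolute error looks only at the central estimate:

```python
import math

def normal_logpdf(x, mu, sigma):
    # Log density of a Normal(mu, sigma) distribution evaluated at x
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_score(mu, sigma, outcome):
    # Higher is better: rewards placing density on the realised value,
    # so it depends on the whole stated distribution
    return normal_logpdf(outcome, mu, sigma)

def neg_abs_error(mu, outcome):
    # Higher is better (max 0): depends only on the central point
    return -abs(mu - outcome)

# Two hypothetical forecasters share the central estimate 0.10 but state
# different uncertainty; the realised treatment effect is 0.12.
outcome = 0.12
diffuse = log_score(0.10, 0.20, outcome)  # overly wide distribution
sharp = log_score(0.10, 0.05, outcome)    # tighter distribution, same centre

# The log score rewards the tighter forecaster; negative absolute error
# cannot distinguish them, since both share the same central estimate.
assert sharp > diffuse
```

A forecaster can therefore beat another on log score while tying on absolute error, which is exactly the pattern described above for forecasters versus academics.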

In section 4.1.2 (Calibration) I show that although all groups are poorly calibrated and overconfident, the forecaster group is better calibrated than the other two. I show this with a calibration curve in Figure 3 and a comparison of log scores and stated confidence levels in Figure 4. Better calibration seems to be a key part of why forecasters achieve higher accuracy in this context.

3. How well has your project gone compared to where you expected it to be at this point? (Score from 1-10, 10 = Better than expected)

I'd give the project a 6/10. I failed to reach my target of 30 forecasters and was significantly overconfident about how likely I was to reach that number. I underestimated how busy and in-demand most forecasters are, and overestimated how interested they would be in my project. I heard that $50 per hour was within the range of expected compensation for at least one forecaster, but it turned out this was not sufficient for many others. In the future, I'd plan to pay superforecasters at least $100 per hour of their time and give them a longer window over which to make forecasts.

Despite this limited sample size, I still ended up well-powered enough to find meaningful differences between academics and forecasters. Collecting forecasts of causal treatment effects from randomised controlled trials from experienced forecasters is already a novel contribution, since almost all previous forecasting research has been on state forecasts rather than causal effects. Being able to show that forecasters outperform academics in this new context, and that this outperformance depends on the accuracy metric used, are both useful contributions.

4. Are there any remaining ways you need help, besides more funding?

I've completed the project, so I do not need any more help immediately. I'm presenting the results at the Forecasting in the Social Sciences workshop at UC Berkeley in October. Depending on the feedback I get there, I will decide whether or not to proceed with publication. As I have now left academia (and started at Rethink Priorities), publishing this is not a top priority for me, but if someone were interested in further improving the data analysis and writing, and submitting the paper as a co-author, I'd definitely be open to the possibility and keen to chat.

5. Any other thoughts or feedback?

The Manifund process was very smooth and easy. I want to express my gratitude to all the people who bought shares in this project, and the Manifund and ACX team for setting this up.

David Rhys Bernard

9 months ago

Hi Domenic. If I recall correctly, one of them said that amount was the lower bound of what they'd expect, but I didn't systematically ask the people I spoke to.

David Rhys Bernard

9 months ago

Hi Austin, thanks for the questions!

Yep, I am running the experiment with academics and domain experts at the moment. I started with them for two reasons. Firstly, academics are currently seen as the experts on these sorts of topics and are the ones who provide policy advice, so their priors are the ones that matter more in an action-relevant sense. Of course, whether this should be the case is up for debate, and I hope to provide some evidence here. Secondly, and more practically, academic economists tend to care more about what other academic economists think than about uncredentialed superforecasters, so to improve the paper's chances in economics journals, I made academics my initial focus.

I've spoken to a few superforecasters already and they said they would be happy to participate if I could compensate them appropriately, so if I use them and their network, I'm 75% sure I'd be able to get 30 superforecasters conditional on receiving funding, and 10% sure if not. From chatting with the folks at the Social Science Prediction Platform, their view is that 15-20 domain-expert forecasters tend to be sufficient for getting a reasonable forecast of an average treatment effect. However, the additional analysis of comparing different types of forecasters requires a larger sample, so I would worry about being underpowered with much fewer than 30.
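The worry about being underpowered with fewer than 30 can be checked with a quick Monte Carlo sketch. To be clear, the effect size, standard deviation, and test below are illustrative assumptions of mine, not numbers from the study; the sketch just shows how power for a two-group comparison of mean accuracy scores falls as the per-group sample shrinks:

```python
import random
import statistics

def simulated_power(n_per_group, effect_size, sd=1.0, sims=2000, seed=0):
    # Monte Carlo power for detecting a difference in mean accuracy scores
    # between two forecaster groups, using a normal-approximation z-test
    # on the difference in means at a two-sided alpha of 0.05.
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        a = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        b = [rng.gauss(effect_size, sd) for _ in range(n_per_group)]
        se = (statistics.variance(a) / n_per_group
              + statistics.variance(b) / n_per_group) ** 0.5
        z = (statistics.mean(b) - statistics.mean(a)) / se
        hits += abs(z) > 1.96
    return hits / sims

# Hypothetical effect of 0.8 standard deviations between groups:
power_30 = simulated_power(30, effect_size=0.8)
power_15 = simulated_power(15, effect_size=0.8)
assert power_30 > power_15  # halving the sample costs substantial power
```

Under these assumed numbers, 30 per group gives comfortably higher power than 15, which is the intuition behind the sample-size target above.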