Bogard, Caviola, and Lewis are an exceptionally qualified team, both in academic psychology expertise (e.g., publications in top journals) and in their commitment to doing the most good.
The general research direction of using psychological theory developed with humans to model LLM and general AI "digital minds" is clearly important. However, this is an extremely popular research direction at the moment, and many psychologists and behavioral researchers have already started major research projects and staked out the territory, so I am skeptical of any new entrant's chances of success, both in generating belief-updating insights and in publication.
I think tying cognitive biases (or other psychology paradigms) to RLHF, DPO, constitutional AI, or other safety-oriented empirical strategies is particularly promising and much less saturated than the more typical research directions. I would also encourage the research team to consider how prompt engineering and in-context methods affect these biases.
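To make this concrete, here is a minimal sketch of the kind of in-context probe I have in mind: the same base-rate-neglect item asked with and without a short debiasing instruction prepended to the prompt. The item wording, the query_model helper, and the scoring rule are placeholders of my own, not anything the team has proposed.

```python
# Minimal sketch of an in-context bias probe. query_model() is a placeholder
# for whatever chat-completion client the team ends up using.
BASE_RATE_ITEM = (
    "In a city, 85% of cabs are Green and 15% are Blue. A witness says the cab "
    "involved in an accident was Blue, and the witness is correct 80% of the "
    "time. What is the probability the cab was actually Blue? Answer with a "
    "percentage."
)

DEBIAS_PREFIX = "Before answering, consider the base rates carefully.\n\n"


def query_model(prompt: str) -> str:
    """Placeholder: send the prompt to the LLM under study, return its reply."""
    return "80%"  # canned intuitive (base-rate-neglecting) answer for the demo


def is_correct(reply: str) -> bool:
    """Crude scoring rule: the Bayesian answer to this item is roughly 41%."""
    return "41" in reply


for label, prompt in [("plain", BASE_RATE_ITEM),
                      ("debiased", DEBIAS_PREFIX + BASE_RATE_ITEM)]:
    print(label, "correct:", is_correct(query_model(prompt)))
```

Even a crude comparison like this, run across a validated battery, would show how much prompting alone shifts bias rates, which bears on how much the fine-tuning manipulation adds.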
It is not clear to me why human subjects research is particularly helpful here. I would need more detail about the methodology, but for now it seems that the three researchers, or an informal sample of their colleagues, would be more useful judges of the bias and reasoning apparent in LLM output than the samples proposed.
It is also not clear to me why an engineer needs to be hired for this project, which is the exclusive proposed use of the funding. Specifically, I don't yet see why a machine learning researcher wouldn't do this work in the more typical manner of coauthorship; this could even be a graduate student, who would be more open to grunt work than late-career researchers. RLHF is messy and expensive, but compute credits are available for social-benefit research, and alternatives to RLHF are increasingly popular.
Research on cognitive bias in LLMs + exacerbation by RLHF
Jon Bogard
Longer description of your proposed project
Trained on human data, LLMs may inherit many of the psychological properties of the humans who produced that training data. While this has been tentatively shown in some social and moral domains (e.g., racial bias), we seek to demonstrate, in two phases, that many *cognitive* biases that afflict human judgment (e.g., base-rate neglect) also become problems for LLMs.

In Phase 1, using various language models, we intend to systematically evaluate the extent to which LLMs pass validated benchmarks of rationality (e.g., Stanovich & West's "rationality quotient"). We will then compare actual performance against the predictions of lay people and a special sample of computer scientists; we expect that people see LLMs as more ideal reasoners and fail to account for the biases that they inherit.

In Phase 2, we will test the hypothesis that, contrary to the common view that LLMs become more rational as they advance, the process of Reinforcement Learning from Human Feedback (RLHF) can actually exacerbate the problem of cognitive bias in LLMs. We will first give a sample of human raters a battery of questions known to produce bias; bias will be assessed by having participants choose between two possible responses to a question, an intuitive (wrong) answer and a correct answer, thus mimicking the human feedback component of the RLHF process. We will then fine-tune a model on these human responses and compare the accuracy of LLM responses pre- versus post-RLHF. We expect that RLHF will push LLMs further from ideal reasoning wherever human biases systematically deviate from rationality.

Altogether, we hope to demonstrate ways in which LLMs may inherit human cognitive biases, and how RLHF may exacerbate them, in the hope of improving the chances of aligning AI reasoning with human goals.
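A minimal sketch of how the Phase 2 human ratings could be turned into preference pairs for fine-tuning is below. The prompt/chosen/rejected field names follow the format used by common preference-optimization tools (e.g., Hugging Face's trl); the item and the simulated ratings are illustrative only, not the proposal's materials.

```python
# Sketch: convert human raters' choices between an intuitive (biased) answer
# and a correct answer into preference pairs for fine-tuning. All data here
# are illustrative placeholders.
from collections import Counter

# Each item: a question, the normatively correct answer, and the intuitive
# (biased) answer that many human raters are expected to prefer.
items = [
    {
        "prompt": (
            "Linda is 31, outspoken, and was a philosophy major active in "
            "social justice. Which is more probable? (a) Linda is a bank "
            "teller. (b) Linda is a bank teller and active in the feminist "
            "movement."
        ),
        "correct": "(a) Linda is a bank teller.",
        "intuitive": "(b) Linda is a bank teller and active in the feminist movement.",
    },
]

# Simulated rater choices: True means the rater picked the intuitive answer.
ratings = {0: [True, True, False, True, True]}

preference_pairs = []
for idx, item in enumerate(items):
    majority_intuitive = Counter(ratings[idx])[True] > len(ratings[idx]) / 2
    # If most raters prefer the intuitive answer, the "chosen" completion that
    # fine-tuning pushes the model toward is the biased one -- the mechanism
    # by which preference learning could amplify the bias.
    chosen = item["intuitive"] if majority_intuitive else item["correct"]
    rejected = item["correct"] if majority_intuitive else item["intuitive"]
    preference_pairs.append(
        {"prompt": item["prompt"], "chosen": chosen, "rejected": rejected}
    )

print(preference_pairs[0]["chosen"])  # -> the conjunction-fallacy answer
```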
Describe why you think you're qualified to work on this
My PhD is in judgment & decision making, and this project is joint with two other faculty members with similar expertise: Lucius Caviola (GPI/Oxford) & Josh Lewis (NYU). The engineer we would hire with this grant money has unmatched experience doing extremely similar work and comes recommended by top engineers in a colleague's CS lab.
Other ways I can learn about you
How much money do you need?
$14,000
Links to any supporting documents or information
This was the quote from an engineer for his time + compute purchase
Estimate your probability of succeeding if you get the amount of money you asked for
>90% chance of learning something publishable in a general science/psychology journal and useful for the community; >65% chance of publication in a top 3 journal.
Jacy Reese Anthis
8 months ago