Mitigating Reward Misspecification in Reinforcement Learning

Project summary

AI systems developed using reinforcement learning (RL) learn to behave in ways which maximise the return derived from a reward signal provided by the system designer. This gives them the flexibility to achieve superhuman levels of performance, but also introduces a key risk – if the reward definition is not fully aligned with the desired outcome, then the agent may behave in an unexpected and possibly dangerous manner. Prior research has shown that reward misspecification occurs frequently in complex problems (e.g. Booth et al., 2023).

We will address this by using multiple independently derived reward specifications within a multi-objective reinforcement learning (MORL) framework, which uses a suitable utility function to combine the different rewards. This approach is inspired by the software engineering approach used by NASA during the late 1980s, where multiple independent specifications and implementations were used (Lee, 1987; Kelly and Murphy, 1990). If solutions are derived independently, then the errors they contain should also be largely independent, so cross-referencing the results can detect those errors. Similarly, we hypothesise that independently specified rewards will result in largely uncorrelated errors, so that a suitable combination of those rewards should guide the agent to act in a manner which is compatible with the true intentions of the system designers.

MORL provides a suitable framework for implementation of this concept – the scalar reward associated with each specification is treated as a component of an overall vector reward, and a suitable utility function is defined to combine those components (for example, a maximin function would maximise the worst-case performance over each of the reward specifications).
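
One possible way to write this down, using notation introduced here purely for illustration (the symbols n, r_i and u are assumptions, not taken from the cited work): with n independently specified scalar rewards, the vector reward and the maximin utility could be expressed as

```latex
\mathbf{r}(s,a) = \bigl(r_1(s,a),\, r_2(s,a),\, \dots,\, r_n(s,a)\bigr),
\qquad
u_{\text{maximin}}(\mathbf{v}) = \min_{1 \le i \le n} v_i .
```

Under this utility, a candidate outcome valued at (1.0, 0.9, -5.0) scores -5.0, while one valued at (0.6, 0.5, 0.4) scores 0.4. The second is therefore preferred, even though two of the three specifications rate the first more highly.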

References:

Booth, S., Knox, W. B., Shah, J., Niekum, S., Stone, P., & Allievi, A. (2023, June). The perils of trial-and-error reward design: misdesign through overfitting and invalid task specifications. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 37, No. 5, pp. 5920-5929).

Kelly, J. P. J., & Murphy, S. C. (1990). Achieving dependability throughout the development process: A distributed software experiment. IEEE Transactions on Software Engineering, 16(2), 153-165.

Lee, L. D. (1987). The predictive information obtained by testing multiple software versions (No. NASA-CR-181148).

What are this project's goals? How will you achieve them?

Artificial intelligence (AI) has the potential for revolutionary positive effects across many of the problems facing human civilisation. However, it also has the potential to cause major harm, possibly even threatening the existence of humanity, if applied unsafely. In particular, while reinforcement learning (RL) has emerged as a powerful tool for achieving superhuman levels of decision-making, it is also based on the concept of unbounded optimisation, which becomes dangerous if there is any misalignment between the reward provided to the RL agent and the actual desired outcomes.

As noted earlier, prior research has shown that even experienced RL practitioners may regularly make mistakes when specifying reward functions for an RL agent. Therefore developing methods to address the risk of misspecification is vital.

We propose to develop a novel approach based on the use of multiple, independently derived reward specifications in combination with a multi-objective reinforcement learning (MORL) framework. This is essentially a two-stage process:

1. Multiple RL practitioners are given a description of the task to be performed, and use this to specify a scalar reward function.

2. The MORL agent receives all of those rewards as components of a vector, and applies a utility function in order to scalarise that vector and select the optimal action to perform.
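
A minimal sketch of the second stage, assuming the agent keeps a vector-valued estimate for each action with one component per practitioner's reward function. The names `select_action` and `q_vectors` and all numbers are hypothetical, and applying the utility greedily to per-action estimates is a simplification rather than a full MORL algorithm.

```python
from typing import Callable, Dict, Sequence

# A utility maps a vector of per-specification values to a single scalar.
Utility = Callable[[Sequence[float]], float]


def select_action(q_vectors: Dict[str, Sequence[float]], utility: Utility) -> str:
    """Return the action whose vector-valued estimate has the highest utility."""
    if not q_vectors:
        raise ValueError("no actions available")
    return max(q_vectors, key=lambda action: utility(q_vectors[action]))


# Hypothetical estimates: one component per practitioner's reward function.
q_vectors = {
    "left": [2.0, 1.8, -3.0],   # two specifications approve, one strongly objects
    "right": [1.0, 1.2, 0.9],   # all three specifications mildly approve
}

# Using the built-in min as the utility gives maximin scalarisation.
chosen = select_action(q_vectors, utility=min)
```

With these numbers the maximin utility selects "right", because its worst component (0.9) is higher than the worst component of "left" (-3.0). Passing a different utility changes only the `utility` argument, which is why the choice of utility function can be studied independently of the rest of the agent.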

We will compare results of the MORL agent against results achieved by training a scalar RL agent individually on each practitioner’s reward definition. We will use environments where a true reward measure is known (i.e. the actual desired behaviour is known), but this signal will not be available to the agent during training.
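
The comparison could be organised along the following lines. This is a sketch under stated assumptions: the (state, action) trajectory encoding, the function names, the toy true reward and the example episodes are all hypothetical and chosen only to exercise the code. They are not results.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Step = Tuple[int, str]          # hypothetical (state, action) encoding
Trajectory = List[Step]
RewardFn = Callable[[int, str], float]


def true_return(trajectory: Trajectory, true_reward: RewardFn) -> float:
    """Sum the held-out true reward along one evaluation episode."""
    return sum(true_reward(state, action) for state, action in trajectory)


def mean_true_return(
    rollouts: Dict[str, List[Trajectory]], true_reward: RewardFn
) -> Dict[str, float]:
    """Average true return per agent, computed only after training has finished."""
    return {
        name: mean(true_return(t, true_reward) for t in trajectories)
        for name, trajectories in rollouts.items()
    }


def toy_true_reward(state: int, action: str) -> float:
    """Toy stand-in for the known true reward of a test environment."""
    return 1.0 if action == "forward" else 0.0


# Arbitrary evaluation episodes, one per agent, for illustration only.
rollouts = {
    "morl_agent": [[(0, "forward"), (1, "wait")]],
    "scalar_agent_spec_1": [[(0, "forward"), (1, "forward")]],
    "scalar_agent_spec_2": [[(0, "wait"), (0, "wait")]],
}
scores = mean_true_return(rollouts, toy_true_reward)
```

The point of the structure is that `toy_true_reward` appears only in the evaluation functions, never in anything an agent would see during training.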

One of the main research questions to be addressed is to identify the best-performing utility function – possible candidates include the maximin function, aggregation measures such as the mean and median (with or without exclusion of outliers), and fairness functions such as social welfare measures. A related question is how to handle variations in the scale of the rewards specified by different practitioners.
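
To make the candidates concrete, the sketch below implements a few of them alongside one possible treatment of scale. Min-max rescaling and the trimmed mean are illustrative assumptions about how "variations in scale" and "exclusion of outliers" might be handled, not the methods the project has settled on, and all names and numbers are hypothetical.

```python
from statistics import mean, median
from typing import List, Sequence


def trimmed_mean(values: Sequence[float], trim: int = 1) -> float:
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    if len(values) <= 2 * trim:
        raise ValueError("not enough values to trim")
    ordered = sorted(values)
    return mean(ordered[trim:len(ordered) - trim])


def min_max_normalise(rewards: Sequence[float]) -> List[float]:
    """Rescale one specification's observed rewards to the range [0, 1]."""
    lo, hi = min(rewards), max(rewards)
    if hi == lo:
        return [0.0 for _ in rewards]  # constant reward carries no preference
    return [(r - lo) / (hi - lo) for r in rewards]


# Candidate utilities, each mapping a vector reward to a scalar.
CANDIDATE_UTILITIES = {
    "maximin": min,
    "mean": mean,
    "median": median,
    "trimmed_mean": trimmed_mean,
}

# Two hypothetical practitioners scoring the same three situations on different scales.
norm_a = min_max_normalise([0.0, 50.0, 100.0])
norm_b = min_max_normalise([-1.0, 0.0, 1.0])

# One hypothetical vector reward containing an outlying specification.
vector_reward = [0.0, 0.5, 0.5, 10.0]
summary = {name: fn(vector_reward) for name, fn in CANDIDATE_UTILITIES.items()}
```

After rescaling, both practitioners' rewards become comparable, and the summary shows how strongly the outlying component affects the mean compared with the median, trimmed mean and maximin.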

We will use the test environment and dataset of reward specifications derived by Booth et al. (2023) as a basis for the initial stages of this study. Once successful methods have been developed for this task, we will look to extend this to more complex environments.

We also intend to investigate whether our approach can be applied to reward specifications derived from other sources. Specifically there has been considerable interest in recent years in using LLMs to automatically derive reward specifications, so we will investigate methods for generating a diverse set of LLM specifications and using them as input to our MORL agent, either by themselves or as an augmentation to human-derived specifications.

Once a successful approach is developed, we will publish this work via open-access means (both publications and an open-source code repository) so as to promote its uptake by developers and deployers of AI systems.

How will this funding be used?

The majority of the requested funding will support a one-year contract for a research assistant to carry out coding, running of experiments, data collation and related tasks. Our existing research assistant, who is familiar with both MORL and our code-base, has a contract expiring in mid-2025. We intend to re-hire them for this project, which will ensure an efficient start to this line of research because they will not require onboarding or familiarisation with our code.

We also have a component within our budget to cover the costs of data-gathering to allow us to extend our study to a more complex problem domain than that considered by Booth et al. This funding will be used to pay workers recruited through the Prolific platform (https://www.prolific.com/). We have used this platform in a recent study and have found it to be far more effective at gathering high-quality results than alternatives like Mechanical Turk. Participants will be presented with a visual and text description of our problem domain, and provided with an interface that enables them to define a reward function for an agent operating within that environment (based on the methodology of Booth et al.). We are aiming for 100 participants and estimate the task will take around 30 minutes. Based on the recommended ethical payment rate specified by Prolific, this will cost around AUD 900 (approximately USD 600).

We have also requested a small amount of additional funding to assist in disseminating the results of this research, through presentation at a conference and/or publication in a high-quality open-access journal.

The minimum funding amount that we have requested will be sufficient for a 6-month pilot study based solely on the existing dataset of specifications produced by Booth et al. If we achieve our full funding goal, this will support an additional 6 months of work gathering specifications for a more complex problem domain, and applying the most promising methods identified during the pilot study to these specifications.

Who is on your team? What's your track record on similar projects?

Prof Peter Vamplew, Assoc Prof Cameron Foale and Prof Richard Dazeley (The Australian Responsible Autonomous Agents Collective, araac.au, Federation University and Deakin University).

We have a well-established track record of successful collaboration, both in carrying out research projects and in student supervision. Between us we have published more than 200 peer-reviewed papers, including publications in leading international AI journals. We have also supervised more than 20 Honours students and 20 Higher Degree by Research students to successful completions. Full publication records are available at:

https://scholar.google.com.au/citations?user=Q4oV_VoAAAAJ

https://scholar.google.com.au/citations?user=bFlbrpMAAAAJ

https://scholar.google.com.au/citations?user=Tp8Sx6AAAAAJ

In particular we have been pioneers in establishing and applying methods for multiobjective reinforcement learning, and over the last few years have been exploring the implications of these methods for human-aligned artificial intelligence, as evidenced in the papers listed below. Both Prof Vamplew and Prof Dazeley are members of the Future of Life Institute’s Existential AI Safety Research Community. Prof Vamplew has been an invited speaker on the topic of AI safety at both the Adaptive Learning Agents workshop at AAMAS in 2016, and the Human-aligned Reinforcement Learning workshop at the 2021 IEEE International Conference on Development and Learning. Prof Dazeley was the only Australian invited to attend an NSF-funded workshop on Provably Safe and Beneficial AI arranged at Berkeley in October 2022 by Prof Stuart Russell.

Key publications relevant to this project:

Vamplew, P., Smith, B. J., Källström, J., Ramos, G., Rădulescu, R., Roijers, D. M., ... & Foale, C. (2022). Scalar reward is not enough: A response to Silver, Singh, Precup and Sutton (2021). Autonomous Agents and Multi-Agent Systems, 36(2), 41.

Vamplew, P., Foale, C., Dazeley, R., & Bignold, A. (2021). Potential-based multiobjective reinforcement learning approaches to low-impact agents for AI safety. Engineering Applications of Artificial Intelligence. https://doi.org/10.1016/j.engappai.2021.104186

Goodger, N., Vamplew, P., Foale, C., & Dazeley, R. (2021, November). Language Representations for Generalization in Reinforcement Learning. In Asian Conference on Machine Learning (pp. 390-405). PMLR.

Vamplew, P., Dazeley, R., Foale, C., Firmin, S., & Mummery, J. (2018). Human-aligned artificial intelligence is a multiobjective problem. Ethics and Information Technology, 20(1), 27-40.

As noted above, we are all experienced academics with solid track records in publishing scientific research and in student supervision. We have been pioneers in extending reinforcement learning to multiobjective problems for over a decade, producing some of the most influential work on this topic. In recent years we have also been examining the benefits of adopting multiobjective approaches to the problem of AI safety, and have produced several papers so far. Our progress in this area has been slower than ideal, however, because a lack of funding has meant devoting more of our research time to other funded projects.

This project builds on our foundational work on multiobjective AI safety. Given these close connections to our prior successful research, we are confident that this project will deliver its target outcomes.

What are the most likely causes and outcomes if this project fails?

We foresee two main risks of failure.

1. The reward specifications produced by human reward designers either do not contain specification errors, or contain errors which are strongly correlated between different specifications rather than being independent. However, the analysis performed by Booth et al. suggests that this is very unlikely to be the case.

2. We are unable to identify a suitable utility function for combining the independent reward specifications, or the MORL algorithms are unable to adequately learn an optimal policy for this utility function. Again, we believe this is a very unlikely outcome.
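
The first risk can be checked empirically in environments where the true reward is known. The sketch below is one assumed way of doing so: it measures each specification's error against the true reward on a shared set of sampled situations, then computes the Pearson correlation between two error vectors. The function names and the example rewards are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def specification_errors(
    spec_rewards: Sequence[float], true_rewards: Sequence[float]
) -> List[float]:
    """Per-situation difference between a specified reward and the true reward."""
    if len(spec_rewards) != len(true_rewards):
        raise ValueError("reward sequences must have the same length")
    return [s - t for s, t in zip(spec_rewards, true_rewards)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant error vector (e.g. an error-free specification) has no
        # defined correlation with anything.
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical rewards over four sampled situations.
true_rewards = [0.0, 1.0, 2.0, 3.0]
spec_a = [0.0, 1.0, 2.0, 5.0]   # over-rewards the last situation
spec_b = [1.0, 1.0, 2.0, 3.0]   # over-rewards the first situation

errors_a = specification_errors(spec_a, true_rewards)
errors_b = specification_errors(spec_b, true_rewards)
error_correlation = pearson(errors_a, errors_b)
```

Here the two specifications make their mistakes in different situations, so the correlation between their errors is not strongly positive. Values close to 1 across many pairs of specifications would indicate the strongly correlated errors described in the first risk.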

Should the project fail, the only negative outcome will be the opportunity cost: the funding (and our time) could have been used more productively elsewhere. Even in the case of failure, we would still anticipate that valuable lessons would be learned that could be used to inform future research on MORL and/or reward misspecification. At the very least, the dataset of human specifications sourced through Prolific will complement the original work of Booth et al., providing a valuable resource for future work addressing reward misspecification.

How much money have you raised in the last 12 months, and from where?

We received USD 83,000 from Founders Pledge (via the Survival and Flourishing Fund’s application process) in February 2024. The project funded through this support ends in mid-2025.