
Wise AI: Fine-Tune an Open Source Model

Grant complete
$5,550 raised

This grant is for the Meaning Alignment Institute. A detailed proposal and project plan can be found here.

Project summary

We believe we need AI that’s not just intelligent, but morally astute. We work towards models which can do superhuman moral reasoning, where such reasoning can be checked or evaluated by humans, or by lesser models through scalable supervision. Together with OpenAI, we've taken a step towards this. We used theories of moral learning to gather data about convergent values from humans.

This grant covers our next step: generating synthetic data according to these theories, fine-tuning a model with it, and qualitatively evaluating the model with crowd workers.

This 4-6 month project will result in an open-sourced wise model, a wisdom alignment dataset, and an academic paper. We hope this can spark a race in the alignment community towards wise AI.

What are this project's goals and how will you achieve them?

Traditionally, AI alignment has been defined as alignment to “operator intent”. With more powerful models deployed in social contexts, this definition is becoming unworkable:

“Aligned with operator intent” means aligned with our current societal incentive structure, which will cause problems. AI systems aligned with operator intent replacing humans in key decision-making pipelines is analogous to introducing intelligent and obedient “sociopaths”, with no regard for the values and social norms that currently prevent misaligned incentives from destroying us. There are many examples of humans disobeying orders (e.g., orders to launch nuclear missiles, or to execute profitable but unethical business moves) that illustrate this point.

Therefore, alignment includes the broader question of “what to align towards”. This broader notion of alignment has been defined as alignment to both operator intent and human values.

So far, the work done to define human values has been very vague, resorting to equating human values with moral judgements or revealed preferences, and disregarding the contextuality of values (a constitution cannot cover the many cases in which an LLM will find itself - this is why our legal system has case law and precedent, not just constitutions).

We’re writing a paper in which we argue that a good alignment target for human values should be:

  • Robust to manipulation.

  • Fine-grained with regard to contexts.

  • Generalizable to new situations.

  • Auditable & interpretable for humans.

  • Scalable, such that more elicited data yields a better model.

  • Legitimate, such that participants and users of the resulting model agree it is operating on a fair selection of human values.

The goal of this project is to pave the way for a values alignment approach – informed by a theory of moral learning fleshed out by philosophers like Charles Taylor, Ruth Chang, and others – that we believe will meet these criteria.

Based on RAG experiments, we expect interacting with a model trained on a moral graph to feel more like interacting with an agent that has a sense of the moral situation it is in – instead of one that provides static bullet-point lists, or refuses requests that fail to meet the overly broad HHH criteria.
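To make the moral-graph idea concrete, here is a minimal toy sketch. The structure and names below (a `MoralGraph` class, per-context "wiser than" edges, a `wisest_values` query) are illustrative assumptions, not the project's actual data format: the idea is that nodes are values, edges record that participants judged one value wiser than another in a given context, and the values retrieved for a context can condition a model's response instead of a static, context-free principle list.

```python
# Toy sketch of a "moral graph" (structure and names are assumptions,
# not the project's actual format). Nodes are values; an edge records
# that participants judged one value wiser than another in a context.
from collections import defaultdict


class MoralGraph:
    def __init__(self):
        # context -> list of (from_value, to_value) "wiser than" edges
        self.edges = defaultdict(list)

    def add_edge(self, context, from_value, to_value):
        """Record that to_value was judged wiser than from_value in context."""
        self.edges[context].append((from_value, to_value))

    def wisest_values(self, context):
        """Return endorsed values in this context that nothing supersedes."""
        edges = self.edges[context]
        superseded = {f for f, _ in edges}
        endorsed = {t for _, t in edges}
        return sorted(endorsed - superseded)


graph = MoralGraph()
ctx = "user asks about relapse"
graph.add_edge(ctx, "strict rule-following", "holistic support")
graph.add_edge(ctx, "holistic support", "meeting the person where they are")

# Retrieve context-relevant values to condition a response on,
# rather than applying one fixed list of principles everywhere.
values = graph.wisest_values(ctx)
print(values)  # → ['meeting the person where they are']
```

The point of the sketch is the contextuality: querying a different context returns different (or no) values, which is what a single constitution-style list cannot express.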

How will this funding be used?

This project will take roughly 4-6 months and result in an open-sourced wisdom alignment dataset, a fine-tuned model, and an academic paper.

The budget will be used for fine-tuning compute, inference compute, crowd workers (eval), and salary for one full-time AI researcher, one full-time project lead, and one part-time AI engineer.

Who is on your team and what's your track record on similar projects?

Joe Edelman (MIT, Dartmouth, co-founder of CHT), Oliver Klingefjord (AI Objectives), and Ivan Vendrov (Anthropic, advisor) did the prior work with OpenAI leading to this grant.

Ryan Lowe (OpenAI, InstructGPT) will be advising.

What are the most likely causes and outcomes if this project fails?

It could be the case that a moral graph needs to be bigger than we anticipate in order to meaningfully improve upon existing approaches. We will most likely be able to validate our hypothesis on a more narrowly defined context if so, but this might be less convincing to alignment researchers.

What other funding are you or your project getting?

The final $25k is already committed from another source.


Oliver Klingefjord

5 months ago

Final report

Funding for this project was completed from other sources a while back. For updates on our progress, please follow us at https://meaningalignment.substack.com/.


donated $50

Dan Girshovich

10 months ago

All of my interactions with this team have impressed me.

donated $5,000

Ryan Lowe

10 months ago

I've been working on AI alignment for a while. I spent many months last year thinking about how we might align AI to human values, and finding the best teams already working on this. After this search, Joe, Oliver, and their team's work was at the top of the list.

Theirs is among the few proposals in the world that are both deeply grounded in relevant work (philosophy and social choice theory, among others) and provide an actually inspiring vision for how we build technology that reshapes society towards flourishing.

donated $500

Ivan Vendrov

10 months ago

  1. I know & trust the team.

  2. Most philosophically rigorous approach to values alignment I've seen; excited to see what we can learn from it!