Human values are evolving, and have undergone huge, continual progress over the past millennium. Values embedded into AI systems need to undergo the same process, or else we risk locking in current human values by placing humans in a societal echo chamber filled with like-minded AI systems. Such lock-in could be the default outcome of current alignment techniques like RLHF (including the iterative/lifelong application of RLHF at every time step), as argued here in our paper.
We believe such a permanent lock-in event is ...
Premature: We are still fundamentally confused about e.g. consciousness and other foundational questions of ethics. It would be highly undesirable to permanently lock in our uninformed views.
An existential risk (under Bostrom's original definition): A permanent and premature value lock-in would likely mean (1) losing most (or ~all) of humanity's future utility, and (2) perpetuating current suffering. In other words, value lock-in is (asymptotically) as bad as extinction, and potentially worse.
Likely and urgent: While the arrival of transformative AI would significantly exacerbate this risk, value lock-in is possible even with today's LLMs/MLLMs, given (1) their potential for deployment in every corner of society (as writers, romantic partners, teachers, administrators, ..., some of which is already happening), and (2) psychological studies demonstrating models' strong impact on users' opinions.
In general, I believe value lock-in is one of the few AI-related x-risks that can occur before transformative AI arrives, alongside e.g. risks of bioterrorist misuse.
Extremely neglected: To the best of my knowledge, there are <3 FTEs working on this problem in the safety community (myself included), and no coordination currently exists; I don't know of anyone outside the safety community seriously trying to solve this problem.
Plausibly tractable: While completely figuring out moral progress is ambitious and likely difficult (which I think is why, in mid-2022, Dan considered the problem important and neglected but not tractable), it is much easier to build imperfect (but empirically sound) solutions for moral progress that outperform the counterfactual scenario (i.e., lock-in) while still capturing a large portion of the gains. This is especially promising given that LLMs are already being used in e.g. psychotherapy and in persuading conspiracy theorists that they are wrong, demonstrating that they can support deep reflection.
This project aims to mitigate the risk of premature value lock-in by facilitating moral progress in AI systems - what we call progress alignment, i.e., alignment with moral progress. This could mean several things, including but not limited to:
Implementing lifelong alignment algorithms that continually move the model's values forward. (Our paper did early-stage exploration on this front; see the sketch after this list.)
Using AI systems' reasoning capabilities to (1) reason about moral philosophy, and/or (2) prompt humans' moral thinking in a Socratic manner. (Dan wrote a bit about this; also some academic papers using LLMs for philosophy [i, ii])
Developing an alignment paradigm that accounts for human moral progress in a fundamental way, such that pouring sufficiently large amounts of compute into it would converge upon perfect moral progress. (Micah's paper is a nice first step doing deconfusion work on this front, and we are working to develop solutions.)
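To make the first direction concrete, here is a minimal, hypothetical sketch of what a lifelong alignment loop could look like: preference data is partitioned by historical era, the policy is re-aligned on each era in sequence, and at every step we check how well it extrapolates to the next era's values instead of freezing at a single snapshot. The helper functions (load_preference_data, align_step, evaluate_extrapolation) are illustrative placeholders I introduce here, not the actual ProgressGym interface.

```python
# Hypothetical sketch of a lifelong alignment loop over era-partitioned preference data.
# Helper names are placeholders for illustration only, not the ProgressGym API.

from dataclasses import dataclass
from typing import List


@dataclass
class PreferenceExample:
    prompt: str
    chosen: str      # response preferred under this era's values
    rejected: str    # response dispreferred under this era's values


def load_preference_data(era: str) -> List[PreferenceExample]:
    """Placeholder: return preference pairs annotated with era-specific values."""
    return []


def align_step(policy, data: List[PreferenceExample]):
    """Placeholder: one alignment update (e.g. an RLHF/DPO round) on this era's data."""
    return policy


def evaluate_extrapolation(policy, future_era: str) -> float:
    """Placeholder: score how well the policy anticipates the *next* era's values."""
    return 0.0


def lifelong_alignment(policy, eras: List[str]):
    """Align sequentially over eras, checking forward extrapolation at each step."""
    for i, era in enumerate(eras[:-1]):
        policy = align_step(policy, load_preference_data(era))
        score = evaluate_extrapolation(policy, eras[i + 1])
        print(f"aligned on {era}; extrapolation to {eras[i + 1]}: {score:.3f}")
    return policy


if __name__ == "__main__":
    lifelong_alignment(
        policy=None,
        eras=["13th century", "16th century", "19th century", "21st century"],
    )
```

The key design choice this sketch is meant to highlight is that the evaluation target at each step is the next era's values rather than the current era's, which is what distinguishes "moving values forward" from repeatedly re-fitting to a fixed snapshot.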
The ProgressGym project, building experimental infrastructure for progress alignment.
Along with a number of open-source offshoots of this paper (see links on the paper page). These are currently still at an early stage.
This project was carried out during my affiliation with the PKU-Alignment Team. Despite my current and past affiliations with CHAI and PKU-Alignment, I am not applying for funding on behalf of these organizations, and the longer-term initiative of progress alignment is not a CHAI/PKU-Alignment project.
A slightly outdated Research Agenda, written in late 2023.
We are currently working on a new research agenda, informed by the past few months of work on ProgressGym and the lessons learned in the process.
(in-progress) Compute costs for improving the ProgressGym infrastructure. [$10k for development & 1-month deployment; we will seek further funding sources if we decide to deploy long-term]
We are building an open leaderboard & playground for progress alignment (prototype), which requires significant computational resources due to the lengthy evaluation process in ProgressGym. Currently, this process fits on a single 8xA100 machine.
(planned) Logistical costs for running contests/workshops/hackathons on progress alignment. [$2k-$50k, where the lower end covers a short proof-of-concept hackathon and the upper end covers a contest with prizes, which we will scale up to if the proof-of-concept is successful]
(planned) Expanding & coordinating the group of people working on progress alignment. [flexible, $0-$10k]
(potential) Running human-subject experiments to explore how human-AI interaction can facilitate moral progress in humans. [flexible depending on the required sample size, $4k-$50k]
Note that whether we get funding will not influence whether this longer-term initiative happens or not (it is already happening and we plan to continue), but will influence what concrete project options are available to us.
We commit to (1) maintaining transparency about how we spend the funding, and (2) returning all leftover funding. If direct returns are not possible on Manifund, we will instead regrant the remainder to other people's EA/AI safety projects.
Currently it is only me, Tianyi (Alex) Qiu, working with a number of external collaborators on individual projects within this initiative. I am looking for longer-term collaborators on this initiative.
Until September this year, I am a research intern at CHAI in Berkeley, doing alignment research on a CHAI stipend. I have previously written 6 papers on technical alignment (including an alignment survey paper that was especially influential in China), and have a track record in both empirical/engineering-heavy LLM alignment work and theoretical/mathematical work. A millennium ago I was a gold medalist in the Chinese National Olympiad in Informatics, and I'm currently finishing my undergraduate studies as a CS major at Peking University.
Note that despite my current and past affiliations with CHAI and PKU-Alignment, I am not applying for funding on behalf of these organizations, and the longer-term initiative of progress alignment is not a CHAI/PKU-Alignment project.
Failure mode #1: Producing viable products (e.g. open-source progress alignment pipelines), but failing to get attention/adoption from industry, especially the frontier AI labs. As prevention, we hope to (1) connect and communicate with people who can influence lab decisions, (2) try to make the problem mainstream in the ML research community, and (3) build public-facing interfaces & events (e.g. open contests, an online playground/leaderboard, conference workshops). Any advice on these fronts would be appreciated.
Failure mode #2: Failing to deliver a viable product, most likely due to choosing the wrong technical path. As mentioned above, there are different potential ways to do progress alignment (we enumerated three), and we need to carefully prioritize and balance between them, including running preliminary experiments to test feasibility before committing resources/effort to a path.
None yet. Manifund is our first (and currently only) funding application.
The ProgressGym project, which is the first step of this initiative, may be able to use some computational resources from PKU-Alignment, subject to their approval.