Hey Alexander, Iām approving this project since research like this definitely falls within Manifundās scope. However, I have to lower the amount from $280K to $190K, since thatās all that remains from the total pot allocated to regrantors. Weāre really sorry for giving false expectations about how much money youād be getting.
Removing Hazardous Knowledge from AIs
Alexander Pan
Project summary
This project aims to remove hazardous CBRN and cyber knowledge from AI models. To do this, researchers will develop CBRN and cyber knowledge evaluations which measure precursors for dangerous behavior but which are not info-hazardous. Then, the researchers develop unlearning techniques to remove this knowledge. Finally, the research will be communicated to NIST to inform their dual-use foundation model standards.
What are this project's goals and how they be achieved?
This projectās goal is to remove hazardous CBRN and cyber knowledge from AI models. This project includes:
Developing datasets of CBRN and cyber knowledge which contain precursors for dangerous behavior but which themselves are not info hazardous. (E.g., knowledge of reverse genetics itself isnāt hazardous but is required to do more hazardous things.)
Developing unlearning techniques to remove this precursor knowledge.
To develop the dataset, researchers will be working with a large group of academics, consultants, and companies, including cybersecurity researchers from Carnegie Mellon University and biosecurity experts from MIT (e.g., Kevin Esveltās lab).
To develop the unlearning techniques, researchers will experiment with many different methods. Methods need to 1) remove the relevant precursors to hazardous knowledge and 2) preserve general domain knowledge which is not hazardous.
How will this funding be used?
The funding will be used to pay consultants and contractors for dataset collection.
Who is on the team and what's their track record on similar projects?
Alex Pan, one of the research leads, is a PhD student at UC Berkeley in Jacob Steinhardtās lab (https://aypan17.github.io/). He has published two first-authored papers at top-tier ML conferences (ICML and ICLR) on reward misspecification and measuring the safety and ethical behavior of LLM agents.
The other research lead is Nat Li, who is a 3rd year undergraduate at UC Berkeley and has co-authored two ML papers previously (1, 2). I will also help advise this project and have a long track record of empirical AI safety research (3).
What are the most likely causes and outcomes if this project fails? (premortem)
One of the main risks is whether the consultants and contractors can be directed. If the consultants produce general bio/chem knowledge instead of specifically precursors to hazardous capabilities, the resulting dataset wonāt be useful to unlearn.
What other funding is this person or project getting?
None.
Dan Hendrycks
11 months ago
Main points in favor of this grant
Removing hazardous capabilities from models would greatly help reduce AI x-risk from malicious use and unilateral actors. Alex is a researcher with a strong track record who is interested in AI safety and has done previous AI safety research. The timing is right; NIST has been tasked by the recent EO to help develop standards and regulations on ādual-use foundation models.ā Research now has a much higher likelihood of helping shape regulation.
Donor's main reservations
This is a relatively complex project with many moving parts. Itās crucial that the project is executed well on a relatively short timeline.
Process for deciding amount
It was estimated by the researchers that this was the total amount needed for the dataset. I have reviewed the budget and approved.
Conflicts of interest
I will be helping advise this project.