
WhiteBox Research’s AI Interpretability Fellowship

Active Grant
$2,005 raised
$136,240 funding goal

Project Summary

WhiteBox Research’s long-term goal is to help build a thriving AI safety community in Southeast Asia. To build towards this, our plan until April 2025 is to run two cohorts of our free AI Interpretability Fellowship in Manila to produce junior interpretability/safety researchers. The fellowship happens over five months and is part-time and in-person. WhiteBox was founded in August 2023 and is currently composed of Clark Urzo, Brian Tan, and Kyle Reynoso. We received funding for our first cohort via the Long-Term Future Fund and Manifund.

We finished the two-month training phase for our first cohort in May. 10 out of 13 participants are currently in the three-month guided research phase, which runs from June 1 to August 31.

We are aiming to raise $136,240 to fund us from September 2024 to May 2025, which would let us run another cohort of our fellowship (e.g., from October 2024 to March 2025).

What are this project's goals and how will you achieve them?

Through our fellowship, we aim to help early-career individuals learn and master the fundamentals of mechanistic interpretability research. This can be their stepping stone to becoming a stellar AI interpretability or AI safety researcher.

Here’s why we’re focusing on MechInterp and Southeast Asia:

  1. Southeast Asia (SEA) has a large population (~700M) and high geopolitical relevance. There are barely any other AI safety upskilling programs in the region, so a lot of talent here remains untapped and unserved.

  2. We believe Manila is an attractive place to build an AI safety community within SEA due to its low cost of living, English-speaking population, lax visa requirements, and active EA community (EA Philippines).

  3. We believe MechInterp is an important part of the portfolio of AI safety agendas, and its tight feedback loops make it a uniquely accessible way to expose people to AI safety work rapidly.

Our fellowship is divided into two phases: the Trials Phase (training) and the Proving Ground Phase (guided research). Both phases require a commitment of 10-15 hours per week, mainly consisting of weekly 8-hour in-person Saturday sessions. We currently offer an honorarium of PHP 10,000 (~$178) to each person who completes the fellowship.

Those who succeed in the Trials will move on to the Proving Ground Phase, which lasts three months. Most of them will be matched with an external mentor from the international AI safety community to guide them in doing research. Most fellows will aim to solve or make substantial progress on problems from Neel Nanda’s list of 200 Concrete Open Problems in MechInterp, which we use as our main early indicator of their ability to make useful contributions to the field later on.

Here’s the theory of change for our AI Interpretability Fellowship, which is what we plan to focus on for the next year, along with our assumptions and evidence for the ToC.

Below are our goals and other key metrics we aim to achieve from June 2024 to May 2025. To generate these numbers, we combined our individual forecasts using linear opinion pooling with equal weights (n = 3), and we put our combined forecast in brackets before each goal. Moreover, all our forecasts below are conditional on us receiving full funding for our Cohort 2 costs.
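
To make the pooling method concrete, below is a minimal sketch of equal-weight linear opinion pooling in Python. The three individual probabilities in it are hypothetical placeholders, not our team’s actual forecasts; only the pooling method (an unweighted arithmetic mean) reflects our process.

```python
# Minimal sketch of linear opinion pooling with equal weights (n = 3).
# The individual forecasts below are hypothetical placeholders.

def pool_forecasts(probabilities: list[float]) -> float:
    """Equal-weight linear opinion pool: the arithmetic mean of the forecasts."""
    return sum(probabilities) / len(probabilities)

individual_forecasts = [0.50, 0.70, 0.90]  # hypothetical per-person probabilities
print(round(pool_forecasts(individual_forecasts), 2))  # 0.7
```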

Goals for June-December 2024 (mainly via Cohort 1):

1. [54%] Have our fellows solve or make substantial progress on at least four concrete open MechInterp problems (aka COPs) in total by the end of September 2024.

By ‘substantial progress’, we mean having a paper, post, or demonstration with a level of rigor equal to or exceeding that of the “We Found An Neuron in GPT-2” post. (Our probability is only 54% because only six of our fellows are working on COPs, and they may not all solve one in time. These six are all working solo, while the others will work on non-COP projects.) We made a Manifold market for this goal here.

2. [75%] Get at least one fellow to be counterfactually accepted by the end of 2024 into a >=0.5 FTE AI safety research fellowship (e.g., the MATS Winter 2024-25 program, Apart Lab, or SPAR), role, or grant.

We understand that some fellowships are more valuable and harder to get into than others (such as MATS), so ideally we would get one fellow accepted into a more prestigious fellowship like MATS by the end of 2024. Given that this may be difficult, though, we will count any AI safety fellowship (e.g., Apart Lab and SPAR) that focuses explicitly on doing research but does not necessarily require publication of said research. (We are also requiring a >=0.5 FTE commitment for this metric, as we only want to count those who are putting a significant amount of time into an AI safety research opportunity.) Here’s this goal’s Manifold link.

July 24 update: We originally put our forecast as 52% for this goal, but have since increased this to 75% given that five of our fellows were invited to join Apart Lab (likely in October/November), and we think at least one of them would work on it for at least 0.5 FTE.

3. [79%] Have at least four fellows spending at least 10 hrs/week working on alignment-oriented upskilling and/or interpretability projects by the end of 2024.

This includes projects which may be self-funded or otherwise not supported by a grant. (Manifold link here.)

For the next part of our funding period, we estimated each probability conditional on us having achieved its corresponding prior goal above.

Goals from January-May 2025 (regardless of whether the fellows were in Cohort 1 or 2):

1. [55%] Have our fellows solve or make substantial progress on at least eight more concrete open MechInterp problems between January and May 2025. (Again, part of why this is not higher is that some fellows may choose to work on problems outside the COPs list.)

2. [65%] Get at least two more fellows to be counterfactually accepted by May 2025 into a >=0.5 FTE AI safety research fellowship (e.g., the MATS Winter 2024-25 program, Apart Lab, or SPAR), role, or grant.

Our forecast here is lower than our 75% estimate for this metric in 2024 mainly because we are doubling the goal from one additional fellow accepted to two. We generally expect to get better at helping our fellows with their external applications, but we also think the difficulty of getting accepted into existing AI safety fellowships will increase slightly.

3. [89%] Have at least four additional fellows spending at least 10 hrs/week working on alignment-oriented upskilling and/or interpretability projects by May 2025.

Also, here are other key metrics we aim to achieve from now until May 2025:

1. Have at least 7 people complete Cohort 1’s Proving Ground phase

2. Have at least 12 people complete Cohort 2’s Trials phase

3. Have at least 10 people complete Cohort 2’s Proving Ground phase

4. Have an average “likelihood to recommend” score of at least 8.5/10 for both fellowship rounds

Should our fellowship and fundraising be successful, we’d then aim to hold a fellowship round starting June 2025 that would be open to anyone in Southeast Asia. We could fly people into Manila for a 2-4 week full-time bootcamp. Moreover, given the current high demand for evals work, we may also expand into helping some fellows upskill in evals next year. (These plans are outside the scope of this funding request, though.)

How will this funding be used?

Our preferred amount is a total of $136,240 USD to fund us until May 2025. This would fund us at 3.75 FTE for nine months starting September 2024.

Our “lower amount” ask is $86,400 USD. This would fund us at 2.75 FTE (0.25 FTE above our current FTE count) for eight months (September 2024 to April 2025).

We're also open to accepting any amount, and we can adjust our plans based on how much funding we get.

Who is on your team and what's your track record on similar projects?

We raised $72,490 from the Long-Term Future Fund and Manifund in late 2023. As mentioned, we recently finished the Trials Phase for our first cohort: the Trials ran from February 24 to May 11, and the Proving Ground runs from June 1 to August 31.

Overall, we think our Trials Phase went well, but we’ve learned a lot about how to improve it for our next cohort. Below are some key outcomes and metrics of our fellowship so far.

Four of our fellows, advised by Kyle Reynoso (our team member), won 3rd place in Apart Research’s AI Security Evaluation Hackathon held on May 25-27. They won $300 with their evals project, “Say No to Mass Destruction: Benchmarking Refusals to Answer Dangerous Questions”.

Moreover, Kyle Reynoso, together with Ivan Enclonar (one of our fellows) and Lexley Villasis, won 1st place in Apart Research’s AI and Democracy Hackathon held on May 3-6, 2024. They won $1,000 with their interpretability-related project, “Beyond Refusal: Scrubbing Hazards from Open-Source Models”, which builds on this work.

We think these two placements show early traction in our fellows’ ability to do AI interpretability and safety research. Moreover, all five of our fellows who placed in these Apart hackathons have been invited to join the Apart Lab to do research (likely after our Proving Ground).

Here are some other key outcomes/metrics from our fellowship so far:

  1. We received 53 applicants and accepted 13 participants into our fellowship, surpassing our goal of 50 applicants and 12 participants. Our participants’ ages range from 18 to 26, with their average age being 21.

  2. Participants are highly likely to recommend our Trials Phase to a potential future participant — the average likelihood is 9.5/10 and our NPS is +89 (from 10/13 respondents). Six of our participants gave very positive testimonials about our fellowship, and you can read about some of their backgrounds and testimonials here.

  3. In our two-day hackathon toward the end of the Trials Phase, our participants formed groups, learned how to use TransformerLens, and tried their hands at doing MechInterp research. The winning group’s project replicated the “We Found An Neuron” post and used the same methods to find which structures are responsible for predicting “has” or “have” (a minimal sketch of this kind of logit-difference probe appears after this list). Their report, entitled “We has an neuron”, can be found here.

  4. The average participant’s self-reported interest in becoming an AI interpretability or safety researcher went from “neutral” before the Trials Phase started to “interested (it’s one of my top 3 career plans)” after it.

  5. We secured three AI safety researchers as external mentors for our Proving Ground phase: Maheep Chaudhary, Clement Neo, and Jonathan Ng.

  6. In the Proving Ground, we have six fellows who are each working on a COP, two fellows who are interning for Apart Research’s Catastrophic Cybersecurity Evals Project, and a pair of fellows working on interpreting and mitigating sleeper agents using SAEs.
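
For readers unfamiliar with this kind of work, below is a minimal sketch of the sort of logit-difference probe a project like the one in item 3 above might start from. It assumes the TransformerLens library and GPT-2 small; the prompts are illustrative, and this is not our fellows’ actual code.

```python
# Minimal sketch (not our fellows' actual code): measure how strongly GPT-2
# prefers " has" vs. " have" after singular vs. plural subjects, a starting
# point for locating which model components drive that prediction.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

has_id = model.to_single_token(" has")
have_id = model.to_single_token(" have")

for prompt in ["The dog", "The dogs"]:  # illustrative prompts
    logits = model(model.to_tokens(prompt))  # [batch, position, vocab]
    final_logits = logits[0, -1]
    logit_diff = (final_logits[has_id] - final_logits[have_id]).item()
    print(f"{prompt!r}: logit(' has') - logit(' have') = {logit_diff:.2f}")
```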

Our team is composed of:

Clark Urzo - Co-Founder and Strategy Director

Brian Tan - Co-Founder and Operations Director

Kyle Reynoso - Programs Associate

We’re advised by Callum McDougall and Lee Sharkey.

You can read this separate document for more data on our fellowship so far and our team.

What are the most likely causes and outcomes if this project fails? (premortem)

  1. We don’t get enough talented, motivated people to join our 2nd cohort.

    1. Mitigation: Since WhiteBox’s success depends heavily on who we can reach and accept, we will continue to spend a good proportion of our time and effort on outreach and marketing for Cohort 2. We will give more talks, hold more salons, and have our current fellows help us spread the word about our fellowship. Our first cohort included an IOI winner and a couple of international math competition winners, and we aim to continue drawing from highly selected pools (e.g., math/informatics olympiad participants, top Philippine universities, and students/young professionals with ML experience). We’ve also already met a few promising people interested in applying for our 2nd cohort.

  2. Our fellowship does not upskill our best fellows well enough for them to get into a further program, role, or grant in AI safety within 3-4 months after the program.

    1. Mitigation: We are making key changes to our Trials challenges to improve them and to automate their checking so fellows get faster feedback. We will also get external mentors again for the Proving Ground phase, and we may funnel some fellows into SPAR. Clark and Kyle are also putting considerable time into upskilling in MechInterp so they can better guide our fellows. We will also help some of our Cohort 1 and 2 fellows continue upskilling and applying for opportunities even after our program. We’re now renting an office until May 2025, and we can host fellows from Cohort 1 or 2 there.

  3. The people we train and/or the MechInterp research we help produce contribute significantly more to AI capabilities than to AI safety. We think this downside risk is small because:

    1. We strongly believe that a community centered on open discourse will achieve healthier epistemics in the long run than one where beliefs are imposed top-down. We trust that the kind of person who would do well as an alignment researcher will eventually be persuaded by the sheer strength and veracity of safety arguments, as long as we’re earnest and patient in addressing their concerns.

    2. That said, we will not offer support to people who would wish to work in a role/project we deem harmful, such as explicit capabilities research or capabilities roles at top labs.

What other funding are you or your project getting?

We have applied for funding from the Long-Term Future Fund, and we may also apply to this Open Philanthropy RFP. We also applied for funding from Meta Charity Funders and the Nonlinear Network, but we were rejected by MCF and have not received interest from the Nonlinear Network.

If we aren’t able to raise our lower ask to run our second cohort, we’ll reconsider our plans and may run WhiteBox on a smaller scale.

If you'd like to get in touch with us privately, you can email Brian at brian@whiteboxresearch.org. Thank you!

donated $10

Sasha Cooper

3 months ago

My partner and I made notes on all of the projects in the EACC initiative, and thought this was a good one amid some really strong competition. It wasn't top tier for either of us, but we wanted to give a token of support - there were so many projects we would have liked to support on here that I hope you take this as strong emotional support, even if it might not help much materially. Our quick and dirty notes:

They: Seems well thought out as a proposal, and I'm a fan of this sort of research

He: Field of technical AI safety I'm most positive about. SEA = good value for money. Would like more real-world experience from the team, but the project track record looks good. Proposal v well presented, too (I appreciate the credences!)

<3


Brian Tan

3 months ago

@Arepo Thank you for the donation and rationale Sasha! We really appreciate it.


Brian Tan

3 months ago

And thank you also to your partner!


Brian Tan

3 months ago

Thank you so much to everyone who has directed funding to WhiteBox so far! @tfburns @isaacdunn @sanyero @Llentati @TonyGao @zmavli @madonnadane @kej.

We still haven’t received funding for Cohort 2, so this funding is much needed and appreciated. We will likely receive a decision from the Long-Term Future Fund within the next two weeks, so we hope that goes well. Our fellowship’s first cohort is about to end in two weeks, and we hope to share more good outcomes over the next few months!

donated $300

Kyle Reynoso

3 months ago

I'm donating through EA Community Choice because my roles with WhiteBox Research, first as a Teaching Assistant and now as a Programs Associate, have counterfactually led me to invest a considerable amount of time in AI technical safety research.

donated $150

Tom Burns

3 months ago

There are a lot of reasons why we should increase funding for these kinds of projects in places like Southeast Asia, including to improve the diversity of approaches and talent, but also because smaller amounts can have a larger impact.