
WhiteBox Research: Training Exclusively for Mechanistic Interpretability

Active Grant
$12,420 raised
$73,460 funding goal

Project Summary

We want to identify and train a highly selected cohort to focus exclusively on mechanistic interpretability (MechInterp) research, with the explicit aim of making substantial progress on Neel Nanda's 200 Concrete Open Problems in Interpretability. We believe MechInterp is an important part of the portfolio of AI safety agendas, and its tight feedback loops make it a uniquely accessible way to approach alignment. This effort will be carried out through a non-profit initiative known as WhiteBox Research.

We are requesting funding for 1.9 FTE (0.8 for Clark, 0.7 for Brian, 0.4 for Kriz) for 9 months. Roughly 70% of our time will be focused on finding & training people in MechInterp, 20% on upskilling ourselves, and 10% on anything else that helps us achieve our long-term goal: to foster a thriving alignment hub in Southeast Asia that can diversify and complement the work being done in London and the Bay Area.

Clark Urzo is a part-time independent alignment researcher and a SERI MATS Winter 2022 virtual participant under John Wentworth, and he will be leading the training program.

What are this project's goals and how will you achieve them?

The main objective of this project is to create a reproducible process of training people to become mechanistic interpretability researchers in particular—as opposed to alignment researchers in general—thereby potentially gaining a comparative advantage in both quality and cost-effectiveness over larger but less focused programs. 

Over the 9-month grant period, we plan to operationalize this goal through a series of milestones:

  1. Our training program, which is largely based on a flipped-classroom, mastery-learning model and is described in more detail here, aims to have 10-15 participants and will run for 8 weeks. The program will be held in or near a top university in Metro Manila, where we're all based.

  2. The Alignment Jam runs its Interpretability Hackathons roughly every three months. Our next major goal is to have our cohort win at least 3rd place in one of them, or, if the Jam does not keep to its schedule, in another equally credible benchmark or competition (possibly organized by us).

  3. Once we have done well in a research hackathon, our next major milestone is for the strongest members of our cohort to produce public-facing work (subject to an internal infohazard policy we're still developing) that earns endorsements from other researchers in the field. In particular, 3-5 high-quality posts on a joint research blog, similar to Anthropic's Transformer Circuits Thread, would be a valuable test of the group's ability to conduct useful research and/or distillations for the community.

  4. Lastly, we will conduct a series of research sprints to systematically attempt the problems in Neel Nanda's list of 200 Concrete Open Problems in MechInterp. These problems are not necessarily of the utmost importance to interpretability in general, but because they are highly neglected (as seen in this spreadsheet), they serve as a useful measure of our group's ability to produce research that other people can actually build on. Concretely, we are aiming first for an in-depth write-up of a decisive solution to at least one open problem, and then for a regular cadence of such posts in the months following the program.

By the end of the grant period, we want to produce at least three people (not including us) who can do serious MechInterp research (e.g., become a SERI MATS research fellow in 2024 or reach a certain bar of research quality). 

We believe our idiosyncratic approach is worth trying for two main reasons:

  1. Similar initiatives targeting early-career people, such as BlueDot Impact’s AI Safety Fundamentals course and EffiSciences’ ML4Good program, offer a broader curriculum that covers many alignment topics. We think that an entire research group focusing on a narrow area and building deep expertise in it is an underexplored strategy, with Redwood Research’s REMIX program being the only example in interpretability we are aware of.

  2. Metro Manila (and by extension Southeast Asia) is an attractive place to do upskilling and research owing to its drastically lower cost of living (e.g., a fully-furnished 1 bedroom condominium unit in the city costs $250-400/mo to rent), English-speaking population, and lax visa requirements. It can therefore serve as an alternative location for alignment researchers who want to work comfortably for far longer with the same amount of funding, as well as attract alignment researchers who would otherwise be unable to afford moving to London/Berkeley given the current funding climate.

How will this funding be used?

Our preferred amount is a total of $73,460 USD for 9 months of funding at 1.9 FTE. This will fund Clark Urzo at 0.8 FTE, Brian Tan at 0.7 FTE, and Kriz Tahimic at 0.4 FTE, along with our operational expenses.

Our minimum amount to do this project is $34,700 for 6 months of funding at 1.8 FTE. However, we're open to accepting any amount and can adjust our plans based on how much funding we receive.

Who is on your team and what's your track record on similar projects?

Our team is composed of Brian Tan, Clark Urzo, and Kriz Tahimic.

Callum McDougall, co-founder of ARENA, has agreed to be an adviser of ours.

How the work will be split

Our plan is to have Clark and Kriz split the work on the curriculum design and teaching/mentoring for the training program, while Brian will focus on the less-technical aspects of the program (e.g., marketing, operations, and project management). We will also likely tap the help of someone in EA Philippines or a local EA student chapter to help out with some operations tasks (like event logistics).

Clark (LinkedIn):

  1. I participated in the virtual workshops of the 2022 Winter Cohort of SERI MATS under John Wentworth (though these were handled primarily by Joe Collman). I was also a facilitator in the 2023 AGI Safety Fundamentals course, and I am currently a participant in PIBBSS' Key Phenomena in AI Risk reading group, led by Tushant Jha.

  2. Also in 2022, I received a small grant from the FTX Regranting Program via Olivia Jimenez and Akash Wasil to pivot to technical alignment research. Previously, I worked briefly as a machine learning engineer optimizing video compression for a startup in California called Active Theory Inc.

  3. Aside from doing research, I also have extensive entrepreneurial experience. In 2015, I co-founded Veer, one of the first virtual reality companies in the Philippines, producing brand activations across 20+ cities for major local companies such as SM Cyberzone and Jack & Jill. Our primary product was a virtual crew training program certified by the Civil Aviation Authority of the Philippines (CAAP). I was also a key organizer of XR Philippines (previously VR Philippines), handling strategy, managing hackathons with several dozen teams, and running targeted promotions that landed us multiple interviews on national news and radio broadcasts.

  4. I also won a grant from Pioneer.app in 2018. Pioneer is a program run by Daniel Gross and funded by Stripe and Marc Andreessen. During this time, I was featured in the online business magazine e27.

  5. Owing to a lifelong interest in rationality, I have spent over 2,000 hours reading rationalist material written by people like Eliezer Yudkowsky, Julia Galef, Gwern Branwen, and others. I also briefly co-ran a writing group for prolific writers in the r/slatestarcodex subreddit with Alexey Guzey in 2018, and I will likely do the Epistea Residency program in Prague this September concurrently with this project.

Brian (LinkedIn):

  1. I co-founded EA Philippines in 2018 and was on a CEA Community Building Grant to work full-time at EA Philippines in 2021. EA Philippines is now one of the largest EA groups in an LMIC.

  2. At least 12 people in EA Philippines have a strong interest in AI safety/risks, including Clark, Kriz, and Ayrton San Joaquin (former Teaching Assistant at CAIS). I've had a minor but helpful role in five people's AIS journeys.

  3. For my AIS knowledge, I've gone through most of Holden Karnofsky's Most Important Century (MIC) series and his follow-up series on its implications. I've also gone through most resources up to week 3 of BlueDot's AISF alignment curriculum and am slowly working through Neel Nanda's MechInterp guide (starting with the MechInterp prerequisites).

  4. I have worked at CEA as a group support contractor since Dec 2021, supporting EA groups. (I'm looking to transition from my role to focus on AIS, e.g., by working on this project.) Before working at CEA, I was a UI/UX designer for 1.5 years.

Kriz (LinkedIn):

  1. I'm a 4th-year CompSci student with a full scholarship. I co-organize EA Taft (in DLSU) and was accepted into CEA's Organizer Support Program. My working thesis tries to mitigate superposition via L1 regularization and adversarial training, and is inspired by Anthropic's Toy Models of Superposition paper (see the illustrative sketch after this list). I'm also currently receiving coaching from Effective Thesis under Conor Spence.

  2. My journey into EA and AI safety includes finishing the Intro EA Fellowship, In-Depth EA Program, AGISF - EA Cambridge, and Existential Risk Workshop - GCP, as well as attending EAG Washington & EAGxSingapore. Currently, I'm following Neel Nanda's guide, "Concrete Steps to Get Started in Transformer Mechanistic Interpretability." I finished the section "A Barebones Guide to Mechanistic Interpretability Prerequisites" and am now proceeding with Andrej Karpathy’s micrograd tutorial.

  3. I've given a talk on AGI x-risk at an EA Philippines event, facilitated the AI Alignment, MLSS, and In-Depth reading groups in EA PH, and had 1:1s with people on AI safety. This resulted in 10+ people actively looking to volunteer in AIS field-building, with 3 taking significant steps, including one who plans to pursue a technical AIS-focused PhD.
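To give readers a concrete picture of the kind of setup Kriz's thesis refers to, here is a minimal sketch, assuming a PyTorch toy-model setup similar to Anthropic's Toy Models of Superposition paper. It is illustrative only and not the actual thesis code; the model, hyperparameters, and the specific choice of an L1 penalty on hidden activations are assumptions made for this example.

```python
# Purely illustrative sketch (not code from the proposal): a tiny ReLU autoencoder
# in the spirit of Anthropic's "Toy Models of Superposition", with an L1 penalty on
# the hidden activations as one possible way to discourage superposition.
# All names and hyperparameters below are assumptions made for this example.
import torch
import torch.nn as nn

N_FEATURES, N_HIDDEN = 20, 5      # more features than hidden dims -> pressure toward superposition
SPARSITY, L1_COEFF = 0.05, 1e-3   # probability a feature is active; strength of the L1 penalty

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.W = nn.Parameter(torch.randn(N_FEATURES, N_HIDDEN) * 0.1)
        self.b = nn.Parameter(torch.zeros(N_FEATURES))

    def forward(self, x):
        h = x @ self.W                              # encode into the narrow hidden space
        x_hat = torch.relu(h @ self.W.T + self.b)   # tied-weight decode, as in the toy-models setup
        return x_hat, h

def sample_batch(batch_size=1024):
    # Sparse synthetic features: each feature is active with probability SPARSITY.
    mask = (torch.rand(batch_size, N_FEATURES) < SPARSITY).float()
    return mask * torch.rand(batch_size, N_FEATURES)

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(5000):
    x = sample_batch()
    x_hat, h = model(x)
    recon = ((x - x_hat) ** 2).mean()    # reconstruction loss
    l1 = h.abs().mean()                  # L1 penalty on hidden activations
    loss = recon + L1_COEFF * l1
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final reconstruction loss:", recon.item())
```

The trade-off such a thesis would study shows up directly here: with more features than hidden dimensions, the model is pushed toward superposition, and the L1 term trades some reconstruction accuracy for sparser hidden activations.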

References

Clark’s:

  1. Chris Leong - Founder of AI Safety Australia and New Zealand

  2. Joe Collman - Technical Lead, SERI MATS

  3. Elmerei Cuevas - Executive Director of EA Philippines

Brian’s:

  1. Amarins Veringa - (my manager), Post-Uni Groups Strategy Lead at CEA

  2. Dewi Erwan - Co-Founder of BlueDot Impact

  3. Nastassja Quijano - Co-Founder of EA Philippines

Kriz’s:

  1. Elmerei Cuevas - Executive Director of EA Philippines

  2. Conor Spence - Coach at Effective Thesis

  3. Wrenata Sproat - Global Challenges Project

What are the most likely causes and outcomes if this project fails? (premortem)

  1. We don't get at least 10 reasonably talented people to join our training program, or no more than five people complete it.

    1. Mitigation: Since the cornerstone of this project's success is the initial seed of people we choose, we will spend a good proportion of our effort and time on outreach. We will filter for motivation first and ability second, drawing from a pool that is already highly selected (e.g., IMO participants and students from the top STEM magnet schools in the country).

  2. Our training is not good enough for them to produce high-quality research (e.g., to win an Alignment Jam).

    1. Mitigation: We (especially Clark and Kriz) will put considerable time (including time outside the FTE covered by this grant) into upskilling in MechInterp. Clark and Kriz will produce projects to post online (particularly on LessWrong) and will also attempt to place in a preceding Alignment Jam themselves. We'll also seek advice from other established MechInterp/AIS researchers, including our adviser Callum McDougall, co-founder of ARENA.

  3. The people we train and/or the MechInterp research we help produce contribute significantly more to AI capabilities than to AI safety. We think this downside risk is small because:

    1. We strongly believe that a community centered on open discourse will achieve healthier epistemics in the long run than a community where beliefs are imposed top-down. We trust that the kind of person who would do well as an alignment researcher will eventually be persuaded by the sheer strength and veracity of safety arguments, as long as we're earnest and patient in addressing their concerns.

    2. That said, we will not offer support to people who wish to work in a non-safety-related role at one of the top AI labs in the world, where we think most of the downside risk is concentrated, or to those who want to do explicit capabilities research.

    3. We will also enforce an infohazard policy and disqualify repeat offenders from the program.

What other funding are you or your project getting?

We will also apply for funding for this project from Meta Charity Funders and the Long-Term Future Fund, likely by the end of August. We have not received funding for this project so far. If you'd like to get in touch with us privately, you can email Brian at work.briantan@gmail.com.
