Approving this project! I also especially appreciated that Kriz set up a prediction market on whether they would get to their higher bar of ~$37k.
We want to identify and train a highly selected cohort to focus exclusively on mechanistic interpretability (MechInterp) research, with the explicit aim of making substantial progress on Neel Nanda's 200 Concrete Open Problems in Interpretability. We believe MechInterp is an important part of the portfolio of AI safety agendas, and its tight feedback loops make it a uniquely accessible way to approach alignment. This effort will be carried out through a non-profit initiative known as WhiteBox Research.
We are requesting funding for 1.9 FTE (0.8 for Clark, 0.7 for Brian, 0.4 for Kriz) for 9 months. Roughly 70% of our time will be focused on finding & training people in MechInterp, 20% on upskilling ourselves, and 10% on anything else that helps us achieve our long-term goal: to foster a thriving alignment hub in Southeast Asia that can diversify and complement the work being done in London and the Bay Area.
Clark Urzo is a part-time independent alignment researcher and a SERI MATS Winter 2022 virtual participant under John Wentworth, and he will be leading the training program.
The main objective of this project is to create a reproducible process of training people to become mechanistic interpretability researchers in particular—as opposed to alignment researchers in general—thereby potentially gaining a comparative advantage in both quality and cost-effectiveness over larger but less focused programs.
Over the 9-month grant period, we plan to operationalize this goal through a series of milestones:
Our training program, which is largely based on a flipped-classroom, mastery-learning model and is described in more detail here, aims to have 10-15 participants and is expected to run for 5-8 weeks. The program will be held in or near a top university in Metro Manila, where we’re all based.
The Alignment Jam runs its Interpretability Hackathons roughly every three months. Our next major goal is for our cohort to win at least 3rd place in one or, if the Jam does not keep to its schedule, in another equally credible benchmark or competition (possibly organized by us).
Once we can do well in a research hackathon, our next major milestone is for the best in our cohort to produce public-facing work (subject to an internal infohazard policy we’re still working on) that earns endorsements from other researchers in the field. In particular, 3-5 high-quality posts on a joint research blog, similar to Anthropic’s Transformer Circuits Thread, would be a valuable test of the group’s ability to conduct useful research and/or distillations for the community.
Lastly, we will conduct a series of research sprints to systematically attempt the problems in Neel Nanda’s list of 200 Concrete Open Problems in MechInterp. The problems are not necessarily of utmost importance to interpretability in general, but because they are highly neglected (as seen in this spreadsheet), they can serve as a useful measure of our group’s ability to produce research that other people can actually build on. Concretely, we are aiming first for an in-depth write-up of a decisive solution to at least one open problem, and then for a regular cadence of such posts in the months following the program.
By the end of the grant period, we want to produce at least three people (not including us) who can do serious MechInterp research (e.g., become a SERI MATS research fellow in 2024 or reach a certain bar of research quality).
We believe our idiosyncratic approach is worth trying for two main reasons:
Similar initiatives targeting early-career people, such as BlueDot Impact’s AI Safety Fundamentals course and EffiSciences’ ML4Good program, offer a broader curriculum that covers many alignment topics. We think that an entire research group focusing on a narrow area and building deep expertise in it is an underexplored strategy, with Redwood Research’s REMIX program being the only example in interpretability we are aware of.
Metro Manila (and by extension Southeast Asia) is an attractive place to do upskilling and research owing to its drastically lower cost of living (e.g., a fully furnished one-bedroom condominium unit in the city rents for $250-400/mo), English-speaking population, and lax visa requirements. It can therefore serve as an alternative location for alignment researchers who want to work comfortably for far longer on the same amount of funding, as well as attract alignment researchers who would otherwise be unable to afford moving to London/Berkeley given the current funding climate.
Our preferred amount is a total of $73,460 USD for 9 months of funding. Here’s a breakdown of what that’s for:
Clark Urzo (0.8 FTE): $28,800
Brian Tan (0.7 FTE): $25,200
Kriz Tahimic (0.4 FTE): $9,216
Operational Expenses (e.g., venues and food for events/hackathons, hiring a 0.2 FTE short-term ops intern/contractor, cash prizes, etc.) – $10,244
Our minimum amount to do this project is 6 months of funding for 1.8 FTE at $34,700. However, we're open to accepting any amount and can adjust our plans based on how much funding we get.
A more detailed version of our budget can be found here.
Our plan is to have Clark and Kriz split the work on the curriculum design and teaching/mentoring for the training program, while Brian will focus on the less-technical aspects of the program (e.g., marketing, operations, and project management). We will also likely tap the help of someone in EA Philippines or a local EA student chapter to help out with some operations tasks (like event logistics).
I participated in the virtual workshops of the 2022 Winter Cohort of SERI MATS under John Wentworth (though handled primarily by Joe Collman). I was also a facilitator in the 2023 AGI Safety Fundamentals course, and I am currently a participant in PIBBSS’ Key Phenomena in AI Risk reading group led by Tushant Jha.
Also in 2022, I received a small grant from the FTX Regranting Program via Olivia Jimenez and Akash Wasil to pivot to technical alignment research. Previously, I worked briefly as a machine learning engineer optimizing video compression for a startup in California called Active Theory Inc.
Aside from doing research, I also have extensive entrepreneurial experience. In 2015, I co-founded Veer, one of the first virtual reality companies in the Philippines, producing brand activations for major local companies, such as SM Cyberzone and Jack & Jill, across 20+ cities. Our primary product was a virtual crew training program certified by the Civil Aviation Authority of the Philippines (CAAP). I was also a key organizer of XR Philippines (previously VR Philippines), handling strategy, managing hackathons with several dozen teams, and doing targeted promotions that landed us multiple interviews on national news and radio broadcasts.
Owing to a lifelong interest in rationality, I have spent over 2,000 hours reading rationalist material written by people like Eliezer Yudkowsky, Julia Galef, Gwern Branwen, and so on. I also briefly co-ran a writing group for prolific writers on the r/slatestarcodex subreddit with Alexey Guzey in 2018, and I will likely do the Epistea Residency program in Prague this September simultaneously with this project.
I co-founded EA Philippines in 2018 and was on a CEA Community Building Grant to work full-time at EA Philippines in 2021. EA Philippines is now one of the largest EA groups in an LMIC.
At least 12 people in EA Philippines have a strong interest in AI safety/risks, including Clark, Kriz, and Ayrton San Joaquin (former Teaching Assistant at CAIS). I’ve had a minor but helpful role in five people’s AIS journey.
For my AIS knowledge, I’ve consumed most of Holden Karnofsky’s Most Important Century (MIC) series and its implications series. I’ve also consumed most resources up to week 3 of BlueDot’s AISF alignment curriculum and am slowly working through Neel Nanda’s MechInterp guide (starting with the MechInterp prerequisites).
I have worked at CEA as a group support contractor since Dec 2021, supporting EA groups. (I’m looking to transition from my role to focus on AIS, e.g., via working on this project.) Before working at CEA, I was a UI/UX designer for 1.5 years.
I'm a 4th-year CompSci student with a full scholarship. I co-organize EA Taft (in DLSU) and was accepted into CEA's Organizer Support Program. My working thesis tries to mitigate superposition via L1 regularization & Adversarial Training and is inspired by Anthropic's Toy Model of Superposition paper. Also, I'm currently receiving coaching from Effective Thesis under Conor Spence.
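As a rough illustration of the kind of setup this thesis direction builds on (a sketch in the spirit of Anthropic's Toy Models of Superposition, not the actual thesis code; all hyperparameters and names here are illustrative assumptions), here is a minimal NumPy toy model: an autoencoder trained on sparse synthetic features, with an L1 penalty on its hidden activations to push against superposition. The adversarial-training component is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: more features than hidden dims creates pressure toward superposition.
n_features, d_hidden, batch = 6, 3, 256
l1_coeff, lr, steps = 1e-3, 0.1, 3000

def sample_batch():
    # Sparse synthetic features: each feature is active with probability 0.1.
    mask = rng.random((batch, n_features)) < 0.1
    return mask * rng.random((batch, n_features))

W_enc = rng.normal(0.0, 0.1, (n_features, d_hidden))
W_dec = rng.normal(0.0, 0.1, (d_hidden, n_features))
b_dec = np.zeros(n_features)

losses = []
for _ in range(steps):
    x = sample_batch()
    h = x @ W_enc                         # hidden activations
    pre = h @ W_dec + b_dec
    x_hat = np.maximum(pre, 0.0)          # ReLU reconstruction
    err = x_hat - x
    # Loss: squared error summed over features, averaged over the batch,
    # plus an L1 penalty on hidden activations to encourage sparse features.
    losses.append(((err ** 2).sum(1)).mean() + l1_coeff * np.abs(h).sum(1).mean())

    # Manual gradients: MSE through the ReLU, plus the L1 term on h.
    g_out = 2.0 * err * (pre > 0) / batch
    g_h = g_out @ W_dec.T + l1_coeff * np.sign(h) / batch
    W_dec -= lr * (h.T @ g_out)
    b_dec -= lr * g_out.sum(axis=0)
    W_enc -= lr * (x.T @ g_h)
```

The L1 term penalizes the magnitude of hidden activations, nudging the model toward sparser, more interpretable features rather than dense superposed representations.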
My journey into EA and AI safety includes finishing the Intro EA Fellowship, In-Depth EA Program, AGISF - EA Cambridge, and Existential Risk Workshop - GCP, as well as attending EAG Washington & EAGxSingapore. Currently, I'm following Neel Nanda's guide, "Concrete Steps to Get Started in Transformer Mechanistic Interpretability." I finished the section "A Barebones Guide to Mechanistic Interpretability Prerequisites" and am now proceeding with Andrej Karpathy’s micrograd tutorial.
I've given a talk on AGI x-risk at an EA Philippines event, I've facilitated AI Alignment, MLSS, and In-depth Reading Groups in EA PH, and have had 1:1’s with people on AI safety. This resulted in 10+ people actively looking to volunteer in AIS field-building, with 3 taking significant steps, including one who plans to pursue a technical AIS-focused PhD.
Chris Leong - Founder of AI Safety Australia and New Zealand
Joe Collman - Technical Lead, SERI MATS
Elmer Cuevas - Executive Director of EA Philippines
Amarins Veringa - (my manager), Post-Uni Groups Strategy Lead at CEA
Dewi Erwan - Co-Founder of BlueDot Impact
Nastassja Quijano - Co-Founder of EA Philippines
Elmerei Cuevas - Executive Director of EA Philippines
Conor Spence - Coach at Effective Thesis
Wrenata Sproat - Global Challenges Project
We don’t get at least 10 reasonably talented people to join our training program, or fewer than five people complete it.
Mitigation: Since the cornerstone of this project’s success is the initial seed of people we choose, we will spend a good proportion of our time and effort on outreach. We will filter for motivation first and ability second (drawing from a pool that is already highly selected, e.g., IMO participants and the top STEM magnet schools in the country).
Our training is not good enough for them to produce high-quality research (e.g., to win an Alignment Jam)
Mitigation: We (especially Clark and Kriz) will put considerable time (including outside the FTE of this grant) into upskilling in MechInterp. Clark and Kriz will produce projects to be posted online (particularly on LessWrong) and also attempt to place in a preceding Alignment Jam themselves. We’ll also seek advice from other established MechInterp/AIS researchers.
The people we train and/or the MechInterp research we help produce contribute to AI capabilities significantly more than AI safety. We think that this downside risk is small because:
We strongly believe that a community centered on open discourse will achieve healthier epistemics in the long run than a community where beliefs are forced top-down. We trust that the kind of person who would do well as an alignment researcher would be persuaded by the sheer strength and veracity of safety arguments eventually, as long as we’re earnest and patient about their concerns.
That said, we will not offer support to people who wish to work in a non-safety-related role at one of the top AI labs in the world, where we think most of the downside risk is concentrated, or to those who want to do explicit capabilities research.
We will also enforce an infohazard policy and disqualify repeat offenders from the program.
We will also apply for funding for this project from Meta Charity Funders and the Long-Term Future Fund, likely by the end of August. We have not received funding for this project so far. If you'd like to get in touch with us privately, you can email Brian at firstname.lastname@example.org.
TLDR: I decided to regrant $12k to this project. I’m excited about an organized AI safety training program in an under-exposed, important region (Southeast Asia). I think the core team seems promising and worth the investment, despite their juniority. I think getting experienced mentors will be the main challenge (among others), but the team is aware of the relevant failure modes and taking the steps necessary to mitigate them. I’d be excited about others donating at least $23k more to this project to make their MVP possible.
Why I’m excited about this project
I’m keen to see new programs happen outside the main hub as a way to widen the surface area of opportunities for talented folks to engage with AI safety. Southeast Asia is one of the regions I’m most excited about due to its large population and geopolitical relevance.
The core team seems organized and promising. They’re quite junior, but seem worth the investment as a way of skilling up by doing. This seems relevant to me especially considering there aren’t other groups trying to fill this gap, as far as I can tell – this project can plausibly allow them to become the experienced folks that would guide others in the future.
To mitigate their juniority, they’re picking a research agenda with a track record of getting people interested in AI safety research and developing skills useful for AIS-relevant work. They’re also explicitly inserting their project into a broader pipeline and establishing sensible metrics of success (e.g., participants winning Alignment Jams, getting into SERI MATS).
I had a call with Brian months ago and another with the whole team today. They gave me some more details about how they’re planning to skill themselves up and mitigate some of the concerns I mention here and Gaurav mentioned above. This made me even more confident about this grant.
Challenges and concerns
I’m concerned about their ability to provide high-quality mentorship to program participants considering their juniority and potential limitations around getting senior mentors involved
They haven’t heard back from some relevant people yet (e.g., Neel Nanda), and haven’t run similar programs in the past.
I’d be keen for someone with experience in this kind of program to share their expectations about how this will go.
However, as I mentioned above, this seems like a positive bet in expectation to me.
Creating talent that will end up doing capabilities research
Mech interp is quite dual-use and being the first program in the region to skill people up in this might end up hyping capabilities rather than AIS.
However, I think a) they’re sufficiently aware of this failure mode, and b) AI is sufficiently mainstream I expect this failure mode not to have a big counterfactual downside.
I worry they won’t find sufficiently talented people, and that investing in talent around existing hubs might be more cost-effective.
I think this applies to all projects aimed at field-building outside hubs. However, I think we (as a community) haven’t invested enough in experimenting with programs like these yet – so the information value by itself seems worth it, in case there are low-hanging fruits available, considering a lot of talented people can’t effectively go to the main hubs due to e.g., visa limitations. I’ve become more excited about this due to my experience doing talent search via Condor Camp (Brazil) over the last 1.5 years.
As junior folks without a strong track record, I worry they might not be skilled enough to run an entire org/project by themselves. Maybe they won’t be able to follow through on their plans.
I think they individually have enough experience in their fields to make me confident about betting on them. In particular, in my interactions with Brian, he’s seemed quite organized and competent, and I’ve appreciated his work at CEA and in setting up EA Philippines. I know less about the other two team members, but at first glance they seem to have complementary skill sets and experience.
25 days ago
Hi Renan, we really appreciate your decision to give us a regrant! Thanks also for sharing your thoughts about our project. We're taking your challenges/concerns into account, and we're already coming up with concrete plans to mitigate them.
about 1 month ago
*This was written very quickly, and I may not agree with what I'm saying later on!
Here are some questions and thoughts - I can't commit to funding at the moment, but I would like to share my thoughts.
Having spent roughly 1-1.5 years community building and observing Brian being quite active on the EA Groups Slack and through email communications, I'm left with the impression that Brian is quite agentic. I hold a high prior that the plans in this proposal will be executed if funded.
I also hold some confidence that establishing another hub might be beneficial, although I'm not entirely sure how to reconcile this with the idea that those interested in working on alignment might derive more value from visiting Berkeley than going to a new hub.
A few concerns do arise, however. The proposal mentions research sprints to solve the COPs, and while this approach seems suitable for less time-intensive tasks, I question its overall efficacy (45% sure this is true). I believe that rushing things or working on them quickly might not be the most conducive to learning.
Regarding the statement 'Due to being highly neglected,' I'm under the impression (60% sure) that interpretability is slightly saturated at the moment, contrary to the assertion that it's heavily neglected.
My final concern is about mentorship. It appears that only one person on the team has formal mentorship or experience in MI. This is concerning, particularly if you're planning on onboarding 10-15 people, as having one person mentoring them all is going to be challenging. More mentorship (and more experienced mentorship) might be necessary to identify and correct problems early and prevent suboptimal strategies from being implemented.
30 days ago
Hi Gaurav, thanks for weighing in on our project! Here are our thoughts on what you said, written mainly by Clark:
We agree there’s value in visiting Berkeley if people had the means, but we think it’s important there be more alignment hubs in various regions. We think that a good number of potential AIS researchers in Southeast Asia would find it costly and/or hard to visit or move to Berkeley (especially in the current funding landscape), as compared to visiting or working in Manila / SE Asia.
On research sprints to solve COPs: there are nuances to speed. Optimizing for paper-writing speed, for example, doesn't make sense, nor would treating the problems as Leetcode puzzles you can grind. The kind of speed we're optimizing for is closer to rate of exploration: how can we reduce our key uncertainties in a topic as quickly as possible? Can we discover all the mistakes and dead-ends ASAP to crystallize the topic's boundaries rapidly? Can we factor the open question into two dozen subquestions, each clearly doable in one sitting, and if so, how many of them can we do in a given timeframe? The crucial point is this: moving around produces information. We want to ruminate on questions in the middle of coding them up, develop the habit of thinking through problems in the space of a Jupyter notebook, and shrink this loop until it becomes second nature. We have also emailed Neel Nanda and Joseph Bloom about our project and aim to get their advice, so we won't veer too far off course while still learning to walk on our own.
On mentorship, we expect to do well enough in the training phase, but we likely need more mentorship in the research phase. That's why we're going to get a research adviser. During the research phase, the students will (mostly) get advice from Clark and Kriz, while we take advice from a research adviser. The goal is eventually to train ourselves and/or get enough people on our team so that we can confidently do the advising ourselves. This is also why we're adopting the flipped classroom model: we'll only have to produce/curate the learning materials once, and then just focus on getting them to do exercises. We're quite confident this is doable as Clark has taught classes of more than 40 people before.
Let us know if you have more thoughts or questions!