Apollo Research: Scale up interpretability & behavioral model evals research

Technical AI safety

🥑

Apollo Research

Complete Grant

$339,409 raised

Project summary

Apollo Research is a technical AI safety organization. Our areas of focus are:

Research: Conduct foundational research in interpretability and behavioral model evaluations.
Auditing: Audit real-world models for deceptive alignment.
Policy: Support policymakers with our technical expertise where needed.

We are primarily concerned about deceptive AI alignment, as we believe it holds the highest potential for catastrophic consequences. A sufficiently advanced deceptive AI could fool all behavioral tests, effectively bypassing misalignment detection mechanisms. Furthermore, deceptive alignment has not been a major focus area of empirical research until very recently and is still neglected. Therefore, our research aims to build evaluation suites that include both behavioral and interpretability evaluations for deception, and other potentially dangerous capabilities such as situational awareness, goal-directedness, and long-term planning. You can find more information in our announcement post or on our website.

We currently have multiple projects underway (highlighted below). We’re writing up various conceptual clarifications about auditing and deceptive alignment that we intend to publish in late July or early August. We intend to show results from our technical agendas (i.e. interpretability and evals) by the end of September.

We’re happy to answer any questions in detail.

Project goals

Our early research experiments have yielded promising results, and our progress is currently constrained by the number of people executing experiments, not by a shortage of research ideas. As our research agenda is heavily empirical and conceptually straightforward, we are confident in our ability to easily integrate additional technical staff.

Conceptual

We’re currently writing multiple conceptual posts about deceptive alignment, auditing, and model evaluations. For example, in late July and early August we’ll publish a general high-level framework for AI auditing, a theory of change for AI auditing, a detailed definition of deceptive alignment, and a post about why models would develop strategic deception.

Interpretability

New agenda: We’re currently developing a new mechanistic interpretability agenda. Early results on smaller models have been promising and we therefore want to scale experiments to more and larger models.
Design approaches for benchmarking and measuring interpretability techniques.

Behavioral evals

Design experiments and evaluations to measure deception in AI models, covering both its component concepts and deception holistically.
Build a large and robust suite of model evaluations to test models for deception and other potentially dangerous capabilities such as situational awareness, goal-directedness, and long-term planning.
Apply evaluations on models in major AI development labs.
Develop high-level methodologies and frameworks for the nascent field of model evaluations.

Public Benefit: Our work is designed to benefit the public at large by detecting and mitigating potentially dangerous capabilities of large AI models. We may publish our research findings if deemed safe, though we will always weigh the benefits and risks of publishing to a broader audience.

How will this funding be used?

Increasing technical staff headcount to scale up our AI safety research.

Although we have some funds designated for new hires, additional funding would allow us to meet our hiring targets. Consequently, we are seeking funding to hire 3 additional research engineers / scientists.

The annual cost for a new member of our technical staff, whether a research scientist or engineer, is approximately $150k - $250k. This figure includes a London-based salary, benefits, visa sponsorship, compute, and other employee overheads.

Research Engineer/Scientist evals team: $150k - $250k
Research Engineer/Scientist interpretability team: $150k - $250k
Research Engineer/Scientist evals/interpretability: $150k - $250k
Total Funding: ~$600K
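
As a rough check on the total above, the three per-role ranges can be summed directly. This is only a sketch, not Apollo's own budgeting method: the role labels and variable names are illustrative, and taking the midpoint of each range as the point estimate is an assumption.

```python
# Per-role annual cost ranges in USD, taken from the list above.
roles = {
    "evals": (150_000, 250_000),
    "interpretability": (150_000, 250_000),
    "evals/interpretability": (150_000, 250_000),
}

# Sum the lower and upper bounds across all three roles.
low_total = sum(low for low, _ in roles.values())
high_total = sum(high for _, high in roles.values())

# Assumption: the midpoint of the combined range is the point estimate.
midpoint_total = (low_total + high_total) // 2

print(f"${low_total:,} - ${high_total:,}, midpoint ${midpoint_total:,}")
```

Under that midpoint assumption, the three hires come to roughly $600k per year, within a range of $450k to $750k.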

We aim to onboard additional research staff by the end of Q3 or early Q4 2023.

Apollo Research is currently fiscally sponsored by Rethink Priorities, which is a registered 501(c)(3) non-profit organization. Our plan is to transition into a public benefit corporation within a year. We believe this strategy will allow us to grow faster and diversify our funding pool. As part of this strategy, we are considering securing private investments from highly aligned investors in case we think model evaluations could be a profitable business model and we don't have sufficient philanthropic funding.

What is your (team's) track record on similar projects?

Our team currently consists of six members of technical staff, all of whom have relevant AI alignment research experience in interpretability and/or evals. Moreover, our Chief Operating Officer brings over 15 years of experience in growing start-up organizations. His expertise spans operations and business strategy. He has expanded teams to over 75 members, scaled operational systems across various departments, and implemented successful business strategies.

We’re happy to share a more detailed list of accomplishments of our staff in private.

How could this project be actively harmful?

There is a low but non-zero chance that we produce insights that could lead to capability gains. There may also be accidents from our research since we evaluate dangerous capabilities.

However, we are aware of these risks and take them very seriously. We have a detailed security policy (public soon) to address them, e.g. by not sharing results that could have dangerous consequences.

What other funding is this person or project getting?

For various reasons, we can’t disclose a detailed list of funders in public. In short, we have multiple institutional and private funders from the wider AI safety space. We may be able to disclose more details in private.

🥑

Apollo Research

10 days ago

Final report

Description of subprojects and results, including major changes from the original proposal

Closing this project after forgetting about it for a long time. You can see our last public overview of our work here: https://www.apolloresearch.ai/blog/apollo-18-month-update/

🥨

6 months ago

My donation is small, but my support is genuine. To build public awareness, I believe the most powerful approach would be a fictional film or TV series that makes the risks feel real and personal. Storytelling can reach hearts and minds in a way few other methods can. Wishing you all the best in advancing this important cause.

🥑

Apollo Research

over 1 year ago

Progress update

We wrote a 1-year update which includes the projects we used the Manifund funding for: https://www.apolloresearch.ai/blog/the-first-year-of-apollo-research

Thanks again for the support!

donated $125,000

Evan Hubinger

over 2 years ago

Main points in favor of this grant

I am quite excited about deception evaluations (https://www.lesswrong.com/posts/Km9sHjHTsBdbgwKyi/monitoring-for-deceptive-alignment), transparency and interpretability (https://www.lesswrong.com/posts/nbq2bWLcYmSGup9aF/a-transparency-and-interpretability-tech-tree), and especially the combination of the two (https://www.lesswrong.com/posts/uqAdqrvxqGqeBHjTP/towards-understanding-based-safety-evaluations). If I were crafting my ideal agenda for a new alignment org, it would be pretty close to what Apollo has settled on. Additionally, I mentored Marius, who's one of the co-founders, and I have confidence that he understands what needs to be done for the agenda they're tackling and has the competence to give it a real attempt. I've also met Lee and feel similarly about him.

Donor's main reservations

My main reservations are:

It's plausible that Apollo is scaling too quickly. I don't know exactly how many people they've hired so far or plan to hire, but I do think that they should be careful not to overextend themselves and expand too rapidly. I do want Apollo to be well-funded, but I am somewhat wary of that resulting in them expanding their headcount too quickly.
As Apollo is a small lab, it might be quite difficult for them to get access to state-of-the-art models, which I think would be likely to slow down their agenda substantially. I'd be worried especially if Apollo was trading off against people going to work directly on safety at large labs (OAI, Anthropic, GDM) where large model access is more available. Though this would also be mitigated substantially if Apollo was able to find a way to work with labs to get approval to use their models for research purposes externally, and I do not know if that will happen or not.

Process for deciding amount

I decided on my $100k amount in conjunction with Tristan Hume, so that together we would be granting $300k. Both of us were excited about Apollo, but Tristan was relatively more excited about Apollo compared to other grants, so he decided to go in for the larger amount. I think $300k is a reasonable amount for Apollo to be able to spin up initial operations, ideally in conjunction with support from other funders as well.

Conflicts of interest

Marius was a mentee of mine in the SERI MATS program.

donated $200,000

Tristan Hume

over 2 years ago

I'm very excited about Apollo based on a combination of the track record of its founding employees and the research agenda they've articulated.

[Marius](https://www.alignmentforum.org/posts/KzwB4ovzrZ8DYWgpw/more-findings-on-memorization-and-double-descent) and [Lee](https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition) have published work that's [significantly contributed to Anthropic's work on dictionary learning](https://transformer-circuits.pub/2023/may-update/index.html). I've also met both Marius and Lee and have confidence in them to do a good job with Apollo.

Additionally, I'm very much a fan of alignment and dangerous capability evals as an area of research and think there's lots of room for more people to work on them.

In terms of cost-effectiveness I like these research areas because they're ones I think are very tractable to approach from outside a major lab in a helpful way, while not taking large amounts of compute. I also think Apollo existing in London will allow them to hire underutilized talent that would have trouble getting a U.S. visa.

donated $9,999

Marcus Abramovitch

over 2 years ago

I think Apollo is going to be very hard to beat for a few reasons and I would have written up the grant if Marius didn't.

They are a very very talented team. I think several people involved could be working at top AI labs.
They are focused on a tractable and "most important problem" of deception in AI systems.
They seem well positioned to be a potential centre for an AI safety org in Europe.
They could grow to absorb a lot more funding, effectively. A key barrier to EA funds right now is productively spending lots of money. Seems possible that they could absorb >20M/year in 4 years' time.
I think the people here are very "alignment-pilled".
I think I am a good judge of talent/character and Marius passes the bar.
I have a preference that they become a non-profit vs for-profit company that sells auditing and red-teaming. I think funding in the first few years will be pivotal for this.

Reservation:

They seem quite expensive on a per person basis compared to other projects Manifund has been funding. That said, there is going to be a lot of "bang per person" here. Marius explained to me that they are competing with large AI labs and that their salaries are already a significant cut. I would much rather see someone working within Apollo than doing independent research. I think we should be quite wary of funding independent researchers until orgs like Apollo are fully funded, with rare exceptions.
I still don't get why SFF and OP can't just fully fund them. My best guess here is that they are terrified of seeding another "AI accelerator" or wanting to save their cash for other things and thus allow others to donate in their place.
I am slightly worried that there isn't a good way to coordinate across the community on "how much they should receive". From the inside, Apollo wants to raise as much as possible. I don't think that is optimal at a movement level and this leads to them spending a lot of time fundraising so they can have more money. There probably is a funding level that Apollo shouldn't exceed in its first year though I don't know what that number is.

I would bet that if we reviewed Apollo 1.5-2 years down the line, it would have outperformed a majority of grants on a per dollar basis (very hard to operationalize this though).

🥑

Apollo Research

over 2 years ago

Thanks for the compliments. We quickly want to contextualize some of the reservations.
1. We understand that we're more expensive on a per-person basis than independent researchers. However, on top of salaries, our costs also include compute, pensions, insurance, travel costs, visas, office, etc. which are usually not factored into the applications for independent researchers. Furthermore, as discussed in the response to Austin Chen's comment, we're currently significantly cheaper than other AI safety labs. For example, we calculate with 200k/year total cost while many technical positions in Bay Area AI safety orgs have >200k/year starter salaries (not yet including pensions, insurance, compute, etc).
2. Given that we don't train any models, it would require a lot of effort for us to pivot to becoming an AI accelerator. Also, as you can see in our sharing and security policy (https://www.apolloresearch.ai/blog/security), we're very aware of the potential to accidentally accelerate AI research and thus engage in various security measures such as differential publishing.
3. Obviously, our spending target is not infinite. The reason why we are asking for more money at the moment is that we have many obvious ways to use it effectively right now--primarily by hiring more researchers and engineers. As suggested in another comment, a short runway makes it both harder to plan and may make it impossible to hire very senior staff, despite them being interested in working for us.

Anton Makiievskyi

over 2 years ago

My biggest concern is that scaling the org too fast could be detrimental, and I have no idea how to judge it besides deferring to people with similar experiences.

If scaling Apollo's output is, in fact, limited only by cash - this seems like an amazing grant. At least as good as funding most of the independent folks.

🥑

Apollo Research

over 2 years ago

Some people have been concerned with us scaling too fast and we think this is good advice in general. However, there are a couple of considerations that make us think that the risk of growing too slowly is bigger than the risk of growing too fast for our particular situation.
1. We have an actionable interpretability and evals agenda. In both cases, we're only limited by the number of experiments we can run. If we were bottlenecked by ideas we wouldn't try to scale.
2. We already form a coherent team. Many people in the team have worked together before and it currently feels like our team dynamic is very good. Otherwise, we would focus more on building cohesion first.
3. We have the management capacity. Multiple people in the team have supervised others before. This really doesn't feel like a problem right now.
4. The talent is amazing. There are some really great people in the AI safety space. They are motivated, capable, and nice to work with. It's really not that hard to onboard such candidates.
5. Time is ticking. AI alignment isn't gonna solve itself and there will be lots of evals organizations popping up that don't care about catastrophic risk. Making sure that there are multiple leading organizations that care about these risks seems important.
6. We can always stop scaling for a while. If we realize after the first hiring round that we're scaling too fast, we can always postpone the next round and focus on consolidation. On the other hand, if we realize that we need more people it's hard to hire fast. Most rounds take 3 months from job ad to onboarding.

Austin Chen

over 2 years ago

Thanks for posting this application! I've heard almost universal praise for Apollo, with multiple regrantors expressing strong enthusiasm. I think it's overdetermined that we'll end up funding this, and it's a bit of a question of "how much?"

I'm going to play devil's advocate for a bit here, listing reasons I could imagine our regrantors deciding not to fund this to the full ask:

I expect Apollo to have received a lot of funding already and to soon receive further funding from other sources, given widespread enthusiasm and competent fundraising operations. In particular, I would expect Lightspeed/SFF to fund them as well. (@apollo, I'd love to know if you could publicly list at least the total amount raised to date, and any donors who'd be open to being listed; we're big believers in financial transparency at Manifold/Manifund)
The comparative advantage of Manifund regranting (among the wider EA funding ecosystem) might lie in smaller dollar grants, to individuals and newly funded orgs. Perhaps regrantors should aim to be the "first check in" or "pre-seed funding" for many projects?
I don't know if Apollo can productively spend all that money; it can be hard to find good people to hire, harder yet to manage them all well? (Though this is a heuristic from tech startup land, I'm less sure if it's true for research labs).

🥑

Apollo Research

over 2 years ago

Thanks for the high praise, we will try to live up to that expectation. These are totally reasonable questions and we're happy to answer them in more detail.
1. That mostly depends on your definition of "a lot". We have received enough funding to work for ~one year with the current team but we don't have enough funding to hire the number of people we think we can productively onboard. We have checked with our biggest funders and are allowed to disclose the amounts (which is why we didn't respond immediately). We have received ~$1.5M from OpenPhil and ~$500k from SFF speculation grants (the round has not closed yet, so we don't know how much we'll receive in total). We have a small number of other funders that add up to <$200k in addition to the two big institutional funders.
While this is a substantial amount of money, running an AI safety research lab is unfortunately quite expensive with salaries and compute being the biggest cost points. For comparison, Redwood started with $10M (and a $10M top-up after one year), and another alignment non-profit (anonymous on OpenPhil's website but not hard to deduce) started with $14.5M. Both of these organizations aim to pay competitively with FAANG salaries which are ~$200-500k for ML talent before equity.
2. We mostly agree. We hope to be fully funded by larger institutional funders or aligned investors in the near future. However, right now, we could easily onboard multiple new people but just don't have the funds to do so. Therefore, Manifund funding would directly make a big difference for us. Furthermore, the fact that people are enthusiastic about Apollo Research has meant multiple funders expected someone else to have already covered our entire funding gap or wanted to give less on the expectation that others would give more. Lastly, we're specifically asking for 3 positions because these have not been covered by other funders and we want to fill them as soon as possible.
3. We estimate that we could productively spend $5-7M in the first year, $10-20M in the second, and even more later. Furthermore, having more than 10 months of runway makes planning much easier and allows us to hire people with more experience who are often less willing to join an organization with a short runway. We don't intend to compete with Anthropic, OpenAI and DeepMind on salaries but it is a drawback if our salaries are 2-10x lower (which they currently are). For reference, a SWE at Anthropic London earns GBP 250k-420k before equity. Therefore, we would like to raise salaries to retain and attract talent in the long run (maybe so that they are only 1.5-5x lower).
Regarding talent: We are well connected in the AI safety scene and think that there is a lot of talent around that is currently looking for jobs but unable to find any. We think it is really bad for the ecosystem and for AI alignment that this talent is unable to fully contribute or has to jump from independent grant to independent grant without stability or ability to plan (this also heavily selects against people with families who are often more experienced). Our agenda is fairly straightforward and we have the management capacity to onboard new people, so we would really like to hire more people fairly soon. We think it is likely that we will be 15-30 people by next June if we have the funds to do so.

We're happy to answer more questions or clarify any of the above if needed!

Rachel Weinberg

over 2 years ago

By the way @apollo I edited your profile to include your name and a username that's not a long uuid—there was a bug in the sign-in flow at the time that you made your account so it may not have prompted you to do that. Feel free to change those, I just thought it looked bad to have them blank.

🥑

Apollo Research

over 2 years ago

Thank you!

Rachel Weinberg

over 2 years ago

@apollo btw you need to sign the grant agreement before the money goes to your account and you can withdraw. You can access it from the very top of the project page.