
Apollo Research: Scale up interpretability & behavioral model evals research

ActiveGrant
$329,409 raised
$600,000 funding goal

Project summary

Apollo Research is a technical AI safety organization. Our areas of focus are: 

  1. Research: Conduct foundational research in interpretability and behavioral model evaluations.

  2. Auditing: Audit real-world models for deceptive alignment.

  3. Policy: Support policymakers with our technical expertise where needed.


We are primarily concerned about deceptive AI alignment, as we believe it holds the highest potential for catastrophic consequences. A sufficiently advanced deceptive AI could fool all behavioral tests, effectively bypassing misalignment detection mechanisms. Furthermore, deceptive alignment has not been a major focus of empirical research until very recently and remains neglected. Our research therefore aims to build evaluation suites that include both behavioral and interpretability evaluations for deception and for other potentially dangerous capabilities such as situational awareness, goal-directedness, and long-term planning. You can find more information in our announcement post or on our website.

We currently have multiple projects underway (highlighted below). We’re writing up various conceptual clarifications about auditing and deceptive alignment that we intend to publish in late July or early August. We intend to show results from our technical agendas (i.e., interpretability and evals) by the end of September.


We’re happy to answer any questions in detail.

Project goals

Our early research experiments have yielded promising results, and our progress is currently constrained by the number of people executing experiments, not by a shortage of research ideas. As our research agenda is heavily empirical and conceptually straightforward, we are confident in our ability to easily integrate additional technical staff.

Conceptual

  • We’re currently writing multiple conceptual posts about deceptive alignment, auditing, and model evaluations. For example, in late July and early August we’ll publish a general high-level framework for AI auditing, a theory of change for AI auditing, a detailed definition of deceptive alignment, and a post on why models would develop strategic deception.

Interpretability 

  • New agenda: We’re currently developing a new mechanistic interpretability agenda. Early results on smaller models have been promising and we therefore want to scale experiments to more and larger models.

  • Design approaches for benchmarking and measuring interpretability techniques.

Behavioral evals

  • Design experiments and evaluations to measure deception in AI models, covering both its component concepts and deception as a whole.

  • Build a large and robust suite of model evaluations to test models for deception and other potentially dangerous capabilities such as situational awareness, goal-directedness, and long-term planning.

  • Apply our evaluations to models at major AI development labs.

  • Develop high-level methodologies and frameworks for the nascent field of model evaluations.


Public Benefit: Our work is designed to benefit the public at large by detecting and mitigating potentially dangerous capabilities of large AI models. We may publish our research findings if deemed safe, though we will always weigh the benefits and risks of publishing to a broader audience.

How will this funding be used?

Increasing technical staff headcount to scale up our AI safety research.

Although we have some funds designated for new hires, additional funding would allow us to meet our hiring targets. Consequently, we are seeking funding to hire 3 additional research engineers / scientists.

The annual cost for a new member of our technical staff, whether a research scientist or engineer, is approximately $150k - $250k. This figure includes a London-based salary, benefits, visa sponsorship, compute, and other employee overheads.

  • Research Engineer/Scientist, evals team: $150k - $250k

  • Research Engineer/Scientist, interpretability team: $150k - $250k

  • Research Engineer/Scientist, evals/interpretability: $150k - $250k

  • Total Funding: ~$600K (three hires at a midpoint of ~$200k each)

We aim to onboard additional research staff by the end of Q3 or early Q4 2023.


Apollo Research is currently fiscally sponsored by Rethink Priorities, a registered 501(c)(3) non-profit organization. Our plan is to transition to a public benefit corporation within a year; we believe this strategy will allow us to grow faster and diversify our funding pool. As part of this strategy, we are considering securing private investment from highly aligned investors in case we think model evaluations could be a profitable business model and we don't have sufficient philanthropic funding.

What is your (team's) track record on similar projects?

Our team currently consists of six members of technical staff, all of whom have relevant AI alignment research experience in interpretability and/or evals. Moreover, our Chief Operating Officer brings over 15 years of experience growing start-up organizations, with expertise spanning operations and business strategy. He has grown teams to over 75 members, scaled operational systems across various departments, and implemented successful business strategies.


We’re happy to share a more detailed list of accomplishments of our staff in private.

How could this project be actively harmful?

There is a low but non-zero chance that we produce insights that could lead to capability gains. There may also be accidents arising from our research, since we evaluate dangerous capabilities.

However, we are aware of these risks and take them very seriously. We have a detailed security policy (to be made public soon) to address them, e.g. by not sharing results that could have dangerous consequences.

What other funding is this person or project getting?

For various reasons, we can’t disclose a detailed list of funders in public. In short, we have multiple institutional and private funders from the wider AI safety space. We may be able to disclose more details in private.
