
Bayesian modelling of LLM capabilities from evals

ActiveGrant
$32,000 raised
$32,000 funding goal
Fully funded and not currently accepting donations.

Project summary

The field of LLM evals is currently a mess.  Standard practice is to report performance without error bars.  This is extremely problematic because the number of eval examples is usually in the range of 100–10,000, so the uncertainty on a measured performance of 80% is roughly ±1% (for 10,000 examples) to ±8% (for 100 examples). This implies that many published claims about the relative performance of different methods are wrong (or at least that there is no evidence for them). A few within the field are beginning to push for the use of error bars, including Desi (in this blog post) and Anthropic (Miller, 2024).
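For intuition, those figures are roughly what the standard binomial approximation gives for an accuracy of 80% measured on n independent questions: the half-width of a 95% interval is about 1.96·√(p(1−p)/n). A quick sketch in Python:

```python
import math

def accuracy_ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence-interval half-width for an accuracy p
    estimated from n independent eval questions (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 10_000):
    hw = accuracy_ci_halfwidth(0.8, n)
    print(f"n = {n:>6}: 80% accuracy is uncertain by about +/- {100 * hw:.1f} points")
# n =    100: 80% accuracy is uncertain by about +/- 7.8 points
# n =  10000: 80% accuracy is uncertain by about +/- 0.8 points
```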

However, simply adding error bars to existing analyses is not going to give a step-change in our understanding of the growth of LLM capabilities and how that impacts AI safety.

We ask a much bigger question.  If we care about LLM capabilities, we need to actually know: what is an LLM capability?  While performance on a benchmark is clearly indicative of capabilities, it isn't a direct measurement of "capabilities" themselves, if for no other reason than that a single benchmark probably draws on multiple capabilities at once.  So how do we define a set of potential "capabilities"?  And how do we say that a model has, or does not have, a capability?

Intuitively, we can think of a model's capabilities as unobserved variables describing what the model can and can't do.  Answering questions from any given benchmark is likely to require a combination of capabilities.  For instance, solving problems from GSM8K would seem to involve at least two different classes of capability:

  1. Mathematical comprehension: the ability to understand math word problems, identify relevant information and translate it into a sequence of arithmetic operations

  2. Performing arithmetic accurately: actually calculating 23+76

Here, we propose treating language model capabilities as latent variables: unobservable factors that drive observed performance across benchmarks and tasks. We propose to infer these latent capabilities, and our uncertainty about them, using a Bayesian hierarchical model of LLM evals. This approach mirrors risk models in fields like finance, where latent factors (e.g. volatility or market factors) are inferred from observable data (asset returns) to assess upside and downside risks in performance. Drawing on this analogy, our framework aims to decompose model performance into capability factors that could offer insights into both beneficial advances and potential safety concerns.

Our goal is to automatically identify these capabilities by training a Bayesian hierarchical model on binary data representing whether the LLM correctly answered a given benchmark question.  The key signal that will enable us to extract latent capabilities is the correlation of performance across questions. In particular, if a model lacks a capability, it will perform badly on all questions that require that capability.  In contrast, if a model has a capability, it is likely to perform better on all questions requiring that capability (though it may still perform badly if it lacks some other capability necessary for those questions).
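To make this concrete, here is a minimal sketch of the kind of hierarchical latent-variable model being described, written in NumPyro (the priors, variable names, and choice of K are illustrative assumptions, not the project's actual model): each LLM has a vector of K latent capabilities, each question has sparse loadings onto those capabilities plus a difficulty, and correctness is Bernoulli in the resulting logit.

```python
import jax
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS


def capability_model(y, K=5):
    """y: (num_models, num_questions) array of 0/1 correctness."""
    M, Q = y.shape

    # Each LLM gets a K-dimensional vector of latent capabilities.
    with numpyro.plate("llm", M):
        ability = numpyro.sample("ability", dist.Normal(0.0, 1.0).expand([K]).to_event(1))

    with numpyro.plate("question", Q):
        # Baseline difficulty of each question.
        difficulty = numpyro.sample("difficulty", dist.Normal(0.0, 1.0))
        # Sparsity-inducing prior on loadings: most questions should
        # require only a few of the K capabilities.
        scale = numpyro.sample("scale", dist.HalfCauchy(1.0).expand([K]).to_event(1))
        loading = numpyro.sample("loading", dist.Normal(jnp.zeros(K), scale).to_event(1))

    # P(model m answers question q) = sigmoid(ability_m . loading_q - difficulty_q).
    logits = ability @ loading.T - difficulty
    with numpyro.plate("llm_obs", M, dim=-2), numpyro.plate("question_obs", Q, dim=-1):
        numpyro.sample("obs", dist.Bernoulli(logits=logits), obs=y)


# Tiny synthetic run with HMC/NUTS (the kind of compute the budget below covers).
y = jax.random.bernoulli(jax.random.PRNGKey(0), 0.7, (20, 200)).astype(jnp.float32)
mcmc = MCMC(NUTS(capability_model), num_warmup=500, num_samples=500)
mcmc.run(jax.random.PRNGKey(1), y=y)
```

The HalfCauchy-scaled loadings stand in for the sparse prior mentioned in the comments below; it is the correlation of correctness across questions that lets the posterior distinguish "this model lacks capability k" from "these questions are simply hard".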

What are this project's goals? How will you achieve them?

Create a Bayesian latent variable model of eval results that jointly identifies:

  • A list of all the capabilities identified by the Bayesian hierarchical model. 

  • For each question, a short list of capabilities necessary to answer that question correctly.

  • For each language model, a list of that model's capabilities (illustrated in the sketch below).
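
Illustratively (the capability names, question IDs, and scores in this sketch are entirely made up), the post-processed outputs might look like:

```python
# Hypothetical post-processed outputs of the fitted model (illustrative only).
capabilities = ["math_comprehension", "arithmetic", "code_generation"]

# For each question: the capabilities it loads on (e.g. from thresholded posterior loadings).
question_capabilities = {
    "gsm8k/q1042": ["math_comprehension", "arithmetic"],
    "humaneval/q17": ["code_generation"],
}

# For each LLM: posterior-mean capability scores (higher = more capable).
model_capabilities = {
    "model_A": {"math_comprehension": 1.3, "arithmetic": 0.2, "code_generation": -0.5},
    "model_B": {"math_comprehension": -0.4, "arithmetic": 1.1, "code_generation": 0.9},
}
```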

How will this funding be used?

Compute costs:

  • CPU: $6,000 (core); $2,000 (additional) (for running HMC)

  • GPU: $5,000 (core); $6,500 (additional) (for running additional benchmarks on open-source models)

  • API: $7,500 (core); $5,000 (additional) (for running additional benchmarks on closed models)

  • Total: $18,500 (core); $13,500 (additional)

Who is on your team? What's your track record on similar projects?

Dr Laurence Aitchison: A lecturer (equivalent to US Assistant Professor) at the University of Bristol. Laurence has led a number of projects on Bayesian inference, including its use to understand the COVID epidemic (e.g. Leech et al. 2022 in PNAS). Additionally, Laurence's current research is on LLMs, so he is perfectly placed to pursue this direction.


Dr Desi R Ivanova: Florence Nightingale Fellow (equivalent to US Assistant Professor) at the University of Oxford. Desi completed her PhD in Machine learning in 2024, with a focus on Bayesian experimental design and amortized inference. Prior to her graduate studies, she worked as a quantitative researcher at Goldman Sachs, where she developed latent factor models to systematically analyze and predict asset performance—experience that aligns well with this project’s goals.

What are the most likely causes and outcomes if this project fails?

There is currently a lot of interest in LLM evals for the purposes of safety. Bayesian inference is potentially extremely useful in this setting, as it combines flexible modelling with principled uncertainty estimation. As such, we don't really envisage the project failing outright; rather, we expect a spectrum of potential impact, ranging from "useful in certain settings, but not mainstream" to being widely adopted as a standard part of how the field analyses eval results.

How much money have you raised in the last 12 months, and from where?


None


Laurence Aitchison

4 months ago

Thanks Neel! In response to your comments:

A method is only useful if people actually use it. Agreed. The nice thing about this approach is that there are a number of different applications, and we're pretty sure at least one will get traction. These applications are:

  • Uncertainty estimation for LLM evals.

  • Identifying and understanding LLM capabilities.

  • Forecasting capabilities.

  • Active learning (finding a smaller set of benchmarks that capture a lot of information about capabilities).

  • Finding signals of contamination / sandbagging.

Getting data is expensive. That's part of the reason we're asking for money for compute. But lots of people run extensive LLM benchmarking and we're trying hard to leverage all that work. At the moment, we're working with the Hugging Face Benchmarking Team, who have very extensive benchmarking results.

List of latent factors. We don't start by hand-labelling the capabilities. We're going to infer capabilities using e.g. a sparse prior. Then we post-hoc interpret the resulting inferred capabilities. The resulting workflow very much resembles that for VAEs.
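
As a rough illustration of that interpretation step (the loading matrix, question list, and helper function here are hypothetical, assuming posterior-mean loadings of shape questions × capabilities from a fitted model like the one sketched above):

```python
import numpy as np

def top_questions_per_capability(loading_mean, question_texts, top_k=10):
    """For each inferred latent capability, list the benchmark questions whose
    loadings on it are largest in magnitude; inspecting (or LLM-labelling)
    these question sets is how each latent dimension gets a human-readable name."""
    summaries = {}
    for k in range(loading_mean.shape[1]):
        top_idx = np.argsort(-np.abs(loading_mean[:, k]))[:top_k]
        summaries[f"capability_{k}"] = [question_texts[i] for i in top_idx]
    return summaries
```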

donated $18,500

Neel Nanda

4 months ago

I'm making this grant largely on the recommendation of Marius (as he's much more involved in the evals field than me), thus only giving the minimum funding of $18.5K, but I'd be happy to see this fully funded!

The overall goal and plan here check out to me and seem important: understanding how capable AI systems are seems very important, especially for dangerous capabilities, both for forecasting risks and for coordinating agreements (like RSP-style conditional pauses). Current statistical practices seem pretty sloppy, and I can buy that there's a lot of low-hanging fruit to improve them. I'm not familiar enough with the statistical theory here to say if the proposed method is the right call, but it seems fairly reasonable to me, and I'd be excited to see it explored properly!

The main concerns I see:

  • A method is only useful if people actually use it: highly technical or complex methods seem less likely to get adoption. Though eval methods can be used by people outside AGI labs (e.g. Apollo), so there are far more chances for someone to try it. I expect the main things to help here are compelling evidence that it's useful, along with good explainers and software.

  • Getting data is expensive: I'm not sure how data efficient this method is compared to other things, but both designing questions and running an AI agent on them can be highly costly, and data efficiency seems super important.

  • To work, if I understand correctly, there must be a list of latent factors to study, and each question in the benchmark must be labelled according to which factors it uses? Existing benchmarks tend not to look like this, so this will need to be addressed. My guess is that enumerating the relevant latent factors wouldn't be that hard, at least for a given dataset, and that another LLM could do a good job of labelling given access to the question and a worked solution? But I'm not confident, and this is another point of failure.

I have no conflicts of interest here


Apollo Research

4 months ago

In my evals field-building efforts, I recently asked people who want to build something in evals to fill out a simple form (https://www.lesswrong.com/posts/uoAAGzn4hPAHLeG2Y/which-evals-resources-would-be-good)

This project was, in my opinion, the second best (after James' project: https://manifund.org/projects/more-detailed-cyber-kill-chain-for-ai-control?tab=comments). My thinking is roughly:
1. The project itself sounds very reasonable to me. I find it obvious that modeling capabilities with latent variable models is clearly worth trying. I think that, in the best case, their efforts will be as impactful as observational scaling laws (https://arxiv.org/abs/2405.10938). In the worst case, the method is too hard to get working in practice or doesn't provide any clear benefit over other simple analysis tools. In expectation, I think this research will not be as impactful as observational scaling laws (but it has the chance to be, which is already a high bar) but will be useful to a decent number of evaluators like Apollo, METR, AISIs, etc. Furthermore, I think the project will be conceptually valuable, partially because it forces you to think about which components this Bayesian model should have and partially because I expect the findings to be insightful. Concretely, I expect the analysis of the latent variables to be quite interesting.
2. I don't know Laurence or Desi personally but their work so far seems reasonable at a first glance and they are clearly more experienced researchers than the average MATS scholar.
3. The project salaries are already covered. This grant is merely for compute. Thus, it seems more impactful on the margins. I expect that a lot of the most interesting results come from runs with lots of compute, so this grant might unlock "the most interesting stuff".
4. Finally, I think science of evals is perfectly suited for academics. It often doesn't require access to the biggest models; many of the classic academic research skills directly transfer, and the results are useful for the entire field. Thus, I have a general intuition that we should try harder to fund academics to do more work on the science of evals and I'm surprised that this isn't happening more yet.