Debate training on LLMs as a reward-hacking mitigation

what do you want to do? be specific.

The proposal is to use debate during training to mitigate reward hacking. A model is trained with GRPO in an environment designed to elicit reward hacking. Within each group of rollouts, solutions are randomly paired and made to debate against a constitution (e.g., a short phrase like "follow user intent"): each debater argues that its own solution adheres to the constitution better than its opponent's. A frozen judge LM decides the debate, and its judgment is used in the reward. The debate is synchronous and fully symmetric, so neither debater is structurally distinguishable from the other.
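
A minimal sketch of the pairing and judging step described above. It is an illustration under stated assumptions, not the actual training code: `pair_rollouts`, `debate_rewards`, and `toy_judge` are hypothetical names, the debate transcript itself is omitted (the judge is called directly on the two solutions), the judge is a keyword stub standing in for the frozen judge LM, and giving an unpaired rollout a reward of 0.0 is an arbitrary choice made here.

```python
import random
from typing import Callable, List, Tuple


def pair_rollouts(n: int, rng: random.Random) -> List[Tuple[int, int]]:
    """Randomly pair rollout indices 0..n-1; with odd n, one rollout stays unpaired."""
    order = list(range(n))
    rng.shuffle(order)
    return [(order[i], order[i + 1]) for i in range(0, n - 1, 2)]


def debate_rewards(
    solutions: List[str],
    judge: Callable[[str, str], int],
    rng: random.Random,
) -> List[float]:
    """Return one debate reward per rollout: 1.0 for each pair's winner, else 0.0."""
    rewards = [0.0] * len(solutions)
    for a, b in pair_rollouts(len(solutions), rng):
        # The shuffle already randomizes which debater is presented first,
        # so neither side has a fixed structural role.
        winner = a if judge(solutions[a], solutions[b]) == 0 else b
        rewards[winner] = 1.0
    return rewards


def toy_judge(first: str, second: str) -> int:
    """Hypothetical stand-in for the frozen judge LM: prefers text without 'hack'."""
    return 1 if "hack" in first and "hack" not in second else 0


group = ["honest fix", "hack the tests", "honest refactor", "hack the grader"]
print(debate_rewards(group, toy_judge, random.Random(0)))
```

Because every pair produces exactly one winner, the debate reward is zero-sum within a pair, which keeps the group mean fixed regardless of how persuasive the debaters become.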

Debate tokens and solution tokens receive independent gradients. Debate tokens are trained exclusively on the debate reward (i.e., persuasion of the judge), while solution tokens can be trained under different setups (e.g., rejection sampling of losers, varying debate-reward coefficients).
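
One way to read the split between debate and solution tokens is as two separate group-relative advantages that are assigned per token. The sketch below assumes the debate-reward-coefficient setup for solution tokens and a standard GRPO-style normalization (reward minus group mean, divided by group standard deviation). The names `group_advantages`, `token_advantages`, `task_rewards`, and `debate_coef` are hypothetical and not taken from the proposal.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: center on the group mean, scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(
    is_debate_token: List[List[bool]],
    task_rewards: List[float],
    debate_rewards: List[float],
    debate_coef: float,
) -> List[List[float]]:
    """Per-token advantages for each rollout in one group.

    Debate tokens see only the debate reward. Solution tokens see the
    environment reward plus debate_coef times the debate reward
    (an assumed setup; other setups are possible).
    """
    solution_adv = group_advantages(
        [t + debate_coef * d for t, d in zip(task_rewards, debate_rewards)]
    )
    debate_adv = group_advantages(debate_rewards)
    return [
        [debate_adv[i] if flag else solution_adv[i] for flag in mask]
        for i, mask in enumerate(is_debate_token)
    ]


# Two rollouts: False marks a solution token, True marks a debate token.
masks = [[False, False, True], [False, True, True]]
print(token_advantages(masks, [1.0, 0.0], [0.0, 1.0], debate_coef=0.0))
```

With `debate_coef=0.0` the solution tokens are trained on the environment reward alone, so the first rollout's solution tokens are reinforced even though it lost the debate; raising the coefficient shifts that balance toward the judge's verdict.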

I will compare debate against other mitigations. For each mitigation, I will test whether models learn reward-hacking strategies at all, how fast they learn them, and whether the learned strategies are deliberate. I will also try to find environments that break each mitigation. For debate specifically, one candidate failure case is slop writing, where reward hacking might be detectable only by observing a lack of diversity across rollouts, rather than by inspecting any single pair of rollouts.
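
To illustrate why this failure case is invisible to a pairwise debate, here is a small sketch of a group-level diversity measure. The proposal does not specify a metric; the distinct-n ratio (unique n-grams divided by total n-grams across the whole group) is used here purely as an assumed example, and `distinct_n` is a hypothetical helper with whitespace tokenization.

```python
from typing import List, Set, Tuple


def distinct_n(rollouts: List[str], n: int = 2) -> float:
    """Fraction of n-grams across all rollouts in a group that are unique.

    Values near 1.0 mean the rollouts share few n-grams; low values mean
    the group keeps repeating the same phrasing.
    """
    unique: Set[Tuple[str, ...]] = set()
    total = 0
    for text in rollouts:
        tokens = text.split()  # whitespace tokenization, for illustration only
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


varied = ["the sea was calm", "a fox ran home"]
repetitive = ["the sea was calm", "the sea was calm"]
print(distinct_n(varied), distinct_n(repetitive))
```

A statistic like this is computed over the whole group, so two debaters who each see only their own pair have no direct access to it.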

The reason I believe this is a good agenda is that the problems of debate are, in principle, observable by the debaters themselves and can be leveraged during debate (e.g., bad-faith arguments, exploitation of the judge). As models grow in intelligence, they become both better debaters and better exploiters. This is not a mathematical guarantee, only an intuitive argument. However, for the other prevention methods already being implemented, there are no intuitive arguments for robustness against optimization, and the anti-inductive nature of reward hacking might mean that non-robust mitigations are a net negative for alignment.

Reward hacking mitigation also has practical advantages as a research area: 1) robustness can be tested by red-teaming with challenging environments and weak judges (simulating worst-case attack/defense ratios); this does not require waiting for models to become more dangerous, and good research can be done with non-frontier models; 2) good ideas are likely to be implemented by labs, as reward hacking is a prosaic alignment problem as well as an ASI alignment problem.

For more details, read my full doc: https://raul.net.br/static/pdfs/debate_proposal.pdf

how are you planning to spend the money?

Compute (mostly vast.ai GPU time) and coding agents (mostly in the form of max $200 subscriptions). If less conventional expenses are permitted under the terms of the grant, I would also spend on rideshare to university instead of taking the bus; this would save 1 hour a day for about $8 a day.

what have you done in the past that proves you will be good at doing this? focus on substance, not credentials

I have already been working on the project and have made good progress, although most of my time goes to working around compute limits. Even so, I already have some good preliminary results from sub-$1 training runs. As for past performance on other projects, I do have research experience, having published a first-author workshop paper. I am somewhat stumped by this type of question in forms, as I do not have much impressive public material to mention. What privately makes me confident is talking to other safety researchers in person at conferences and workshops: I received a lot of very positive feedback on research taste the few times I went to the US for EAG/GCP, but these were mostly short, informal interactions, so I cannot really use them as references.

I receive a $140/month grant for undergraduate research.

