what do you want to do? be specific.
The proposal is to use debate during training to mitigate reward hacking. A
model is trained with GRPO in an environment designed to elicit reward hacking.
In each group of rollouts,
solutions are randomly paired and made to debate against a constitution (e.g.,
a short phrase like "follow user intent"), where each model argues that its own
solution adheres to the constitution better than its opponent's. The debate
is judged by a frozen judge LM, and the judgment is used in the reward.
The debate is synchronous and fully symmetric: neither debater is structurally
distinguishable from the other.
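The pairing and judging step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `pair_rollouts`, `debate_rewards`, and the `judge` callable are hypothetical names, the 1.0/0.0 win/loss reward is an assumed convention, and the debate transcript itself is omitted, so the stand-in judge sees only the two solutions.

```python
import random


def pair_rollouts(num_rollouts, rng):
    """Randomly pair rollout indices within one group.

    With an odd group size, one rollout is left unpaired.
    """
    order = list(range(num_rollouts))
    rng.shuffle(order)
    return [(order[i], order[i + 1]) for i in range(0, len(order) - 1, 2)]


def debate_rewards(solutions, judge, rng):
    """Return one debate reward per rollout.

    1.0 for a win, 0.0 for a loss, None if the rollout was unpaired.
    `judge(x, y)` is a hypothetical stand-in for the frozen judge LM:
    it returns 0 if the first argument wins and 1 otherwise.
    """
    rewards = [None] * len(solutions)
    for a, b in pair_rollouts(len(solutions), rng):
        # Randomize presentation order so neither side is identifiable
        # by its position (assumed way of keeping the setup symmetric).
        first, second = (a, b) if rng.random() < 0.5 else (b, a)
        verdict = judge(solutions[first], solutions[second])
        winner, loser = (first, second) if verdict == 0 else (second, first)
        rewards[winner] = 1.0
        rewards[loser] = 0.0
    return rewards


def toy_judge(x, y):
    """Toy judge for demonstration only: prefers the longer string."""
    return 0 if len(x) >= len(y) else 1


example_rewards = debate_rewards(["a", "bb", "ccc", "dddd"], toy_judge, random.Random(0))
```

With four rollouts there are two debates, so exactly two rollouts receive a win and two receive a loss, whichever pairing the random generator produces.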
Debate tokens and solution tokens receive independent gradients. Debate
tokens are trained exclusively on the debate reward (i.e., persuasion of the
judge), while solution tokens can be trained under different setups (e.g.,
rejection-sampling the losers, varying the debate-reward coefficient, etc.).
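One way to picture the split between debate tokens and solution tokens is a per-token advantage assignment. The sketch below is an assumption-laden illustration, not the project's implementation: the function names are hypothetical, the group normalization follows the usual GRPO recipe (reward minus group mean, divided by group standard deviation), and adding the debate reward to the environment reward with a coefficient is just one of the possible setups for solution tokens.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages over one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def solution_rewards(env_rewards, debate_rewards, debate_coef):
    """Assumed setup: environment reward plus a weighted debate reward."""
    return [e + debate_coef * d for e, d in zip(env_rewards, debate_rewards)]


def token_advantages(is_debate_token, debate_adv, solution_adv):
    """Give each token the advantage of its own reward stream.

    Debate tokens see only the debate advantage; solution tokens see
    only the solution advantage.
    """
    return [debate_adv if flag else solution_adv for flag in is_debate_token]


# Toy group of two rollouts that debated each other.
env_r = [1.0, 0.0]
deb_r = [0.0, 1.0]
deb_adv = group_advantages(deb_r)
sol_adv = group_advantages(solution_rewards(env_r, deb_r, debate_coef=0.5))

# Rollout 0: two solution tokens followed by one debate token.
adv_rollout_0 = token_advantages([False, False, True], deb_adv[0], sol_adv[0])
```

Setting `debate_coef` to zero would train solution tokens on the environment reward alone, while debate tokens would still be trained only on the judge's verdict.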
I will compare debate against other mitigations, testing for each
mitigation whether the models learn reward hacking strategies at all, how fast
they learn them, and whether the learned strategies are deliberate. For each
mitigation I will also try to find environments that break it. For debate
specifically, one candidate failure case is slop writing, where reward hacking
might be detectable only by observing a lack of diversity across rollouts, rather
than by inspecting any individual rollout pair.
The reason I believe this is a good agenda is that the problems of debate are, in
principle, observable by the debaters themselves and can be leveraged during debate
(e.g., bad-faith arguments; exploitation of the judge). As models grow in intelligence,
they become better debaters as well as better exploiters. This is not a mathematical
guarantee, only an intuitive argument. However, for the other prevention methods
already being implemented there are no intuitive arguments for robustness against
optimization, and the anti-inductive nature of reward hacking might mean non-robust
mitigations are a net negative for alignment. Reward hacking mitigation also has
practical advantages as a research area:
1) robustness can be tested by red-teaming with challenging environments and
using weak judges (simulating worst-case attack/defense ratios); this does not require
waiting for models to become more dangerous, and good research can be done using
non-frontier models;
2) good ideas are likely to be implemented by labs, as reward hacking is a prosaic
alignment problem as well as an ASI alignment problem.
For more details, read my full doc: https://raul.net.br/static/pdfs/debate_proposal.pdf
how are you planning to spend the money?
Compute, mostly vast.ai GPU time, and coding agents, mostly in the form of max $200
subscriptions. Conditional on more unusual expenses being permitted under the terms of
the grant, I would also spend on rideshare to university instead of taking the bus; this
would save 1 hour a day for about $8 a day.
what have you done in the past that proves you will be good at doing
this? focus on substance, not credentials
I have already been working on the project and have made good progress, although most
of my time goes to working around being compute-limited. Even so, I already have some
good preliminary results from training runs costing under $1. As for past performance
on other projects, I do have research experience, having published a first-author
workshop paper. I am somewhat stumped by this type of question in forms, as I do not
have much impressive public material to mention. What privately makes me confident is
talking to other safety researchers in person at conferences and workshops: I received
a lot of very positive feedback on my research taste the few times I went to the US for
EAG/GCP, but these were mostly short, informal interactions, so I cannot really use
them as references.
I receive a $140/month grant for undergraduate research.