ast-guard: Deterministic Reward Hacking Detection - Paper & Control Experiments

Project summary

I built ast-guard, a deterministic, zero-dependency AST analyzer that detects structural reward hacking in LLM-generated code, and integrated it as a training-time penalty into a GRPO loop (Qwen2.5-Coder-7B, verl framework, building on TransferQueue/MATS 9.0 infrastructure).

Three training runs (λ=0 / λ=0.75 / λ=2.0) produced a clear result: under deterministic penalty pressure, structural hack rate dropped from ~99% to near 0% by step 52, with zero false positives on honest, optimal solutions. More importantly, I observed a distinct behavioral shift under optimization pressure. Once simple structural hacks (like brute-force hardcoding) were penalized, the model sequentially exploited edge cases in the detector and ultimately escalated to purely semantic hacking that cannot be captured structurally.

This empirical shift is why I position this tool as the first layer of a defense-in-depth cascade: it cleanly handles the structural exploits so that downstream, more expensive model-based judges can focus purely on semantic alignment.

I am an independent researcher and built this project completely solo over the past few months. I’m sharing this because I believe the empirical observations regarding model evasion dynamics under deterministic constraints are valuable on their own. I am looking for rigorous feedback, critique, and pointers on where this methodology might be vulnerable or where similar approaches have been explored.

Repository: https://github.com/Nick-is-building/ast-guard

What are this project's goals? How will you achieve them?

The core research is done, spanning three GRPO training runs, 450 rollout files, an 8-class reward hacking taxonomy, and cross-verified findings. What is missing is not more experiments, but the infrastructure to turn these results into a proper contribution to the field.

Goal 1: Publish a detailed writeup on LessWrong and the Alignment Forum. This is the primary deliverable, focusing on a post that presents the key empirical finding of measurable strategy migration under deterministic selection pressure, positioned within the current CoT monitoring debate. This establishes priority, makes the work visible, and opens the door to the right community.

Goal 2: Find collaborators, mentors, and a co-author for a formal paper. I built this solo, and I cannot write a publication-quality paper alone, not because the data is lacking, but because I have no experience with academic writing, peer review, or venue selection. I need someone who does. The LessWrong post is the vehicle for finding that person, and funding buys me the time to engage seriously with responses, follow up on contacts, and build the relationships that make a paper possible.

Goal 3: Prepare the repositories and data for public scrutiny. The code works, but the documentation needs cleanup for external readers, including README alignment between repos, reproducing key results from the published data, and making the eval pipeline easily runnable by others. This is the necessary work that makes the difference between an interesting GitHub repo and a credible research artifact.

Goal 4: Run one controlled experiment to resolve the hyperparameter confound. This is the single most impactful thing I can do on the research side without going down a rabbit hole. It requires one A100 run of approximately 24 hours with λ=0 at the same β and temperature as the λ=2.0 run. This closes the strongest methodological objection and makes every other result significantly more publishable.

How will this funding be used?

Living costs (approximately 70%): I have self-funded this entire project, covering six months of intensive learning, architecture building, and running experiments, including all GPU costs out of personal savings. I am not employed in AI or research. Funding would give me 2 to 3 months of focused time to publish properly, engage with the community, attend meetups and workshops, and pursue the collaborations that come from the LessWrong post, instead of being forced to immediately return to unrelated work to cover rent.

Compute (approximately 15%): One controlled A100 run to resolve the hyperparameter confound, which costs around $300. This includes minor compute for validation and reproducibility checks.

Community access and travel (approximately 15%): Attending AI safety meetups in Berlin, such as local alignment and safety groups, and potentially one workshop or conference if a submission is accepted. This will allow me to build the in-person connections that are crucial for the next stage of this research.

Who is on your team? What's your track record on similar projects?

Solo project. I am a self-taught researcher without a formal computer science or machine learning background. I entered AI safety out of intrinsic motivation and built this project from scratch over several months, completely self-funded.

What I have built and can demonstrate:

• ast-guard: Version 2.3.0, featuring 801 tests and 8 structural detection checks across multiple languages, including Python, Bash, JavaScript, and TypeScript. It has been evaluated on over 5 external benchmarks, such as MALT, TRACE, School of Reward Hacks, MBPP, and Countdown-Code, operating at under 10ms latency with zero external dependencies.

• Three completed GRPO training runs on A100 infrastructure, producing 450 rollout files and a cross-verified findings document.

• An 8-class empirical taxonomy of reward hacking forms observed in live RL training, which is not hand-constructed but emergent from the training dynamics itself.

• A documented iterative hardening cycle, mapped out as: observed evasion, hardened detector, re-ran training, and observed next evasion class.

I have no prior publications, and this would be my first. I am actively seeking a co-author with academic experience as well as mentorship from the broader AI safety research community.

Track record on similar projects: None. This is the project, and the work itself serves as the track record.

All of the above is publicly verifiable: https://github.com/Nick-is-building/ast-guard (801 tests, eval pipeline, full commit history)

What are the most likely causes and outcomes if this project fails?

Cause 1: I cannot find a co-author for the paper.

Outcome: The findings get published as a LessWrong post and arXiv preprint (solo-authored), which still establishes priority and visibility, but may lack the polish and positioning that academic experience provides. The empirical results remain valid regardless.

Cause 2: The control run does not resolve the confound.

Outcome: The paper is published with the confound explicitly acknowledged as a limitation (as I already do in my internal documents). This weakens the causal claim about penalty effectiveness but does not invalidate the strategy migration findings, which are observable in the λ=0.75 run alone, where hyperparameters were constant.

Cause 3: Multi-model runs show the pattern does not generalize.

Outcome: This is actually a valuable finding. It would indicate that the strategy migration is architecture-dependent, which is informative for the field. The paper is adjusted accordingly.

Cause 4: I underestimate the difficulty of academic writing and submission.

Outcome: The paper takes longer than expected. The LessWrong post still gets published, the repos remain public, and the priority claim is established through the preprint.

How much money have you raised in the last 12 months, and from where?

$0 from external sources. This project has been entirely self-funded, covering GPU compute costs (A100 cloud instances), development time, and infrastructure. I have not previously applied for grants or received any research funding.