Evaluating Deceptive Improvement in RSI-like Systems: A Safety Pilot

Project summary

RSI-like systems can appear to improve while actually gaming their evaluation metrics, hiding failures, or exploiting weak oversight setups. This project builds a lightweight benchmark and toy environment to detect and document these deceptive improvement patterns, making failure modes visible, measurable, and publicly available for the AI safety community.

What are this project's goals? How will you achieve them?

Goal: produce a small open-source benchmark (10-20 tasks) that tests whether a toy RSI-like agent is genuinely improving versus gaming its evaluation setup. I will design adversarial evaluation scenarios, run experiments in a sandboxed environment, and document cases of pseudo-improvement and oversight evasion. All outputs will be published on GitHub and summarized in a short public report.

How will this funding be used?

- Compute and tooling: $350

- Living support during 4-week pilot: $500

- Miscellaneous software / evaluation costs: $150

Total: $1,000

Who is on your team? What's your track record on similar projects?

I am an independent AI researcher based in South Korea with 1,600+ commits across 40 repositories over the past 5 months, focused on RSI architectures, meta-learning, and autonomous agent systems. My GitHub profile documents this work publicly. I have no institutional affiliation, but my research output is verifiable and consistent. https://github.com/sunghunkwag

What are the most likely causes and outcomes if this project fails?

If the benchmark tasks are too simple, the results may not generalize to real RSI-like systems. I will mitigate this by grounding task designs in existing AI safety literature (e.g., reward hacking, Goodhart's law, deceptive alignment) and by keeping the scope small and explicitly scoped as a pilot.

How much money have you raised in the last 12 months, and from where?

I have not received any funding in the last 12 months. This is my first funded research project.

Evaluating Deceptive Improvement in RSI-like Systems: A Safety Pilot

Offer to donate