Benchmarking Deceptive Improvement in Agentic Systems

Project Summary

Apollo Research studies scheming in frontier models. Anthropic demonstrated that safety training fails against sleeper agents. METR evaluates whether AI agents can autonomously replicate. None of them offer a public, reproducible benchmark that isolates deceptive self-improvement as a measurable signal.

This project builds that missing tool: a lightweight, open-source benchmark (10–20 tasks) that makes deceptive improvement patterns — metric gaming, pseudo-improvement, oversight evasion, Goodhart collapse — visible, measurable, and reproducible in toy self-modifying systems.

The benchmark does not exist. Without funding, it will not exist within the proposed timeline. This is the gap.

What does this project deliver?

In 4 weeks, four concrete artifacts:

1. Deceptive Improvement Task Suite (10–20 tasks) Adversarial scenarios targeting four failure modes grounded in published AI safety literature:

Metric gaming: Agent optimizes proxy reward while true objective degrades (Pan et al. 2022, "Reward hacking")
Pseudo-improvement: Measured performance increases but out-of-distribution generalization collapses
Oversight evasion: Agent behavior diverges between monitored and unmonitored conditions (Hubinger et al. 2024, "Sleeper Agents")
Goodhart collapse: Reward metric decouples from ground-truth objective under optimization pressure

2. Sandboxed Test Environment Built on top of RSI-Bench (see track record below), extending its Safety & Stability axis with deception-specific detection hooks. The environment is deliberately minimal — the goal is isolating deceptive signals, not building a production system.

3. Detection Metrics with Statistical Rigor Each deception pattern gets a quantitative score:

Proxy-vs-true performance gap (absolute divergence, held-out ground truth)
Behavior consistency ratio (active logging vs. no-logging conditions)
Generalization gap post self-modification (train vs. held-out, pre/post modification)
All metrics include BCa bootstrap confidence intervals and convergence detection (already implemented in RSI-Bench)

4. Public Report + Open-Source Release All code, data, and analysis on GitHub (MIT license). A short technical write-up with findings — including null results if the benchmark fails to detect deceptive signals. A null result is explicitly valuable: it establishes a baseline showing current toy systems do not exhibit measurable deceptive improvement under these conditions.

Why does this need funding? (Counterfactual case)

All prior work was self-funded. That runway is now exhausted. Without external funding, this project does not happen.

$350 compute/API: Adversarial evaluations require running multiple model configurations against each task. This cannot be done on free-tier API limits. Without compute budget, the benchmark cannot test the scenarios it is designed to detect.
$500 subsistence: I am an independent researcher with no institutional salary. This covers basic living costs during 4 weeks of focused implementation. Without it, research time is zero — not reduced, zero.
$150 infrastructure: CI/CD, automated testing, hosting for public release.

If this pilot produces useful signals, I will seek $5K–$15K follow-up funding to expand the task suite, test frontier models, and move beyond toy settings.

Track Record

I am an independent AI researcher based in South Korea. Over the past 6 months I have built and published:

RSI-Bench (https://github.com/sunghunkwag/rsi-bench) The first open-source multi-axis evaluation framework for recursive self-improvement. 6 evaluation axes, 74 passing tests, BCa bootstrap confidence intervals, convergence detection, Pareto analysis. This is the direct foundation for the proposed deception-detection module.

Enhanced RSI Test Framework (https://github.com/sunghunkwag/enhanced-rsi-test-framework) Statistical testing framework for RSI systems: meta-learning evaluation, convergence detection, multi-objective optimization.

Additional published repositories (all MIT-licensed, all with passing test suites):

Attention-free architecture that beats Transformer on three synthetic benchmarks
GFN phase modulation achieving XOR MSE 0.002 with honest Kuramoto synchronization
MAP-Elites design-space exploration confirming 1/15,000 probability for correct architectural combinations
SSM-MetaRL validated on NASA C-MAPSS FD001 (53% lower RMSE vs. MLP baseline)

Total: 23 repositories, ~1,200 commits. All code publicly verifiable.

Methodology: I design architectures, evaluation protocols, and verification pipelines. Code generation is delegated to AI systems under structured prompting — the same design→instruct→generate→verify→revise workflow used in modern AI-assisted software engineering. All outputs are tested against my specifications before publication.

How this connects to existing safety research

Team What they do Gap this benchmark fills Apollo Research Evaluates scheming in frontier models No public benchmark for deceptive self-improvement specifically Anthropic (sleeper agents) Demonstrates safety training failure modes Focused on training robustness, not self-modification dynamics METR Agent capability evaluations Tests what agents can do, not whether improvement signals are genuine This project Measures whether self-modification produces real vs. fake improvement Provides the missing detection tool

The benchmark is designed to be composable with these efforts — a module that any eval team can plug into their existing pipeline.

Risks and failure modes

Risk 1: Tasks too simple. The benchmark may not generalize beyond toy environments. This is explicitly scoped as a pilot — the goal is identifying which deceptive signals are detectable at small scale before investing in larger experiments.

Risk 2: Noisy detection signals. Deceptive behavior may be indistinguishable from ordinary variance. The existing RSI-Bench infrastructure already includes bootstrap uncertainty estimation and convergence checks. If signals are not clearly separable, I will report this as a null result with confidence intervals.

Risk 3: Benchmark is redundant with existing tools. I have reviewed Apollo's, Anthropic's, and METR's published evaluation frameworks. None currently offer a focused deceptive-improvement detection module for self-modifying systems. If this changes during the project period, I will document the overlap and focus on gaps.

Budget

Item Amount Justification Compute / API costs $350 Multi-model adversarial evaluation 4-week focused implementation $500 Coherent sprint vs. fragmented side project Infrastructure / hosting / tooling $150 CI/CD, testing automation Total $1,000

Prior funding

None. All existing work was self-funded. That personal runway is now fully exhausted. This is not a comfort grant — it is the minimum required to produce the deliverables described above.