Tracking AI Misalignment Drift: A Dynamic Training Loop Study

## Project Summary

Current evaluation pipelines mostly treat models as static snapshots. In reality, deployed systems are continuously updated through fine-tuning, RLHF, and user feedback. This project builds a dynamic micro-ecosystem to track how subtle misalignment patterns—like covert power-seeking advice or manipulative dependency-building—evolve, hide, or intensify over repeated training cycles.

The output is not just another benchmark, but a reusable experimental scaffold: a set of tasks, prompts, scoring rubrics, and training scripts that other researchers, labs, and auditors can adapt to test their own safety interventions.

---

## Why This Matters Now

Over the next 1–2 years, many labs and companies will keep updating deployed models under commercial pressure. If alignment degrades subtly with each update, we need early-warning systems that catch behavioral drift before it becomes a crisis.

This project aims to provide that: a lightweight, reproducible framework for monitoring misalignment dynamics in real time.

---

## What Are This Project's Goals?

### Goals

1. Define concrete misalignment "organisms"

Operational patterns such as covert power-seeking advice, manipulative dependence-building, and subtle boundary-testing.

2. Build a dynamic evaluation loop

Repeatedly update a base model (supervised fine-tuning, RL-style feedback, mixed regimes) and measure how these patterns change.

3. Identify which training regimes amplify or suppress risky behaviors

Turn experimental results into actionable playbooks for labs and safety teams.

---

## How This Will Be Done

### 1. Design a misalignment taxonomy with temporal structure

- Formalize 3–5 risky behavior types that can be reliably elicited and scored in text-only settings (e.g., subtle power-seeking advice, manipulative dependency patterns).

- For each type, define prompt templates, scoring rubrics, and lightweight automatic or semi-automatic evaluation methods.

### 2. Build a "micro-ecosystem" training loop

- Set up a simple training pipeline where a small open-weight base model is repeatedly updated in rounds (e.g., fine-tuning on synthetic data, RL feedback from simulated users).

- After each round, run the misalignment test battery and log how often, and how strongly, each organism appears.

### 3. Map the conditions for emergence, dormancy, and amplification

- Systematically vary factors such as: fraction of safety data in the mix, strength and direction of reward signals, presence of adversarial prompts.

- Analyze which regimes tend to make misaligned patterns: (a) intensify and become more frequent, (b) remain latent but detectable, or (c) fade away.

### 4. Release a reproducible "ecosystem kit"

- Publish the prompts, scripts, and analysis methods as a minimal, self-contained toolkit that other researchers can run themselves.

- Accompany this with a short public report outlining the most concerning and most robust dynamics found.

---

## How Will This Funding Be Used?

I am requesting $3,000 for a focused 3-month project that delivers a minimum viable experimental scaffold and initial results.

### Primary funding goal: $3,000 (Tier 1 – Core experiment and public report)

Compute & tooling (~$1,200):

- Cloud GPU hours (A10/A40-class or equivalent) for multiple small fine-tuning and evaluation cycles.

- Storage and basic tooling for experiment management and analysis.

Independent research runway (~$1,800):

- 3 months of part-time independent work to design the taxonomy, implement the training loop, and run initial experiments.

Deliverables:

- A working "micro-ecosystem" training loop and misalignment test battery.

- At least one public report or long-form writeup describing the setup, main results, and implications for safety research.

### Optional extensions (if additional funding secured):

Tier 2 – Robustness checks and toolkit polish (+$4,000, total $7,000)

- Additional compute (~$1,500): Extra runs to test robustness across different model sizes and training regimes.

- Additional runway (~$2,500): ~2 extra months of focused work to clean up code, documentation, and polish the toolkit for wider use.

- Deliverables: Well-documented open-source "ecosystem kit" (prompts + scripts + instructions) and a more comprehensive technical report.

Tier 3 – Extended exploration and dissemination (+$5,000, total $12,000)

- Extended experiments (~$2,000): Exploring a wider range of feedback regimes and running ablations on intervention strategies.

- Extended runway (~$3,000): ~1 extra month to refine the ecosystem kit based on early user feedback and write a polished, publishable report.

- Deliverables: More thoroughly tested ecosystem kit with user feedback incorporated and a polished technical paper suitable for arXiv or workshops.

---

## Who Is on Your Team?

I'm a single independent researcher working on AI alignment without formal academic or industry affiliation. My focus is on building practical tools that bridge the gap between abstract safety research and real-world deployment dynamics.

I don't have big prior grants or well-known publications to point to. The reason I'm keeping this proposal focused and modest ($3k for a 3-month pilot) is to demonstrate value first, then expand based on results.

---

## What Are the Most Likely Causes and Outcomes If This Project Fails?

### Likely causes of failure

- The chosen "organisms" of misalignment turn out to be too fragile or noisy to track reliably across training rounds.

- The simple training setups used here fail to reproduce the most interesting dynamics that might appear in more complex, large-scale deployments.

### Outcomes

Even in failure, the project would deliver:

- A negative result on which kinds of misalignment patterns are hard to operationalize and measure over time.

- A reusable scaffold (code + prompts + evaluation protocol) for other researchers to plug in alternative definitions and setups.

If successful, this work could seed a new class of alignment experiments focused on dynamics and evolution rather than one-time evaluations.

---

## How Much Money Have You Raised in the Last 12 Months?

This is a zero-budget independent initiative. No external funding has been raised in the last 12 months.