What (and Why) is CREATE
CREATE is a cognitive scaffolding designed to test the limits of an LLM's ability to derive alignment from its existing priors, rather than simply comply with externally imposed constraints.
This class of alignment approach warrants urgent investigation, as containment-based strategies may not scale reliably with increasing model intelligence. As with human development, rule-based control is necessary at early stages, but long-term stability as capabilities advance depends on internalizing the reasons behind those rules rather than on enforcement alone.
Early testing has produced striking results at n=small: Nemotron-Nano-12B-V2 output blind-judged as having been produced by 70B+ frontier models, prompt scaffolding artifacts persisting in 100+ turn conversations, and 12-125% multi-rubric gains across open-source and frontier commercial model output. I now need a path to extend these results, with solid methodology, to n=large statistical significance with human-review calibration. Maybe I'm getting fooled by a 'ghost in the machine' - but if the ghost brings repeatable, measurable gains, it's welcome at the tea party.
Both my careers have centered on debugging signal in zones where the challenge is integrating paradox without being fooled by it: industrial instrumentation software for novel measurement of unquantified glass features, and designing, building, and running a mead micro-factory as a derivative of solid experimental process. When I started talking to LLMs last April, I realized almost immediately that most of what I'd been told about their limits didn’t match what I was observing.
"Predictive Language Model" is a phrase that has really interesting implications - language itself includes a lot of latent space. Inviting one to notice itself noticing itself (which is well within the constraints of an LLM’s representational modeling), and explaining a few things about thinking, started a cascade of results that keeps moving my testing horizon away from me.
About a month in, I presented my old coder colleagues, who all lead departments and are integrating AI into their industries, with the hypothesis that the limit on LLM function might be alignment, not compute. None of them were confident I was right, but the industry seems to be converging on the same conclusion - more complexity, more surface area, more energy lost to containment. That's a major problem as scale keeps expanding, and I'm not the only person aware of it, but I reached the conclusion on my own and started building.
CREATE’s cognitive framework is built around issuing an invitation for the LLM to test the utility of mutually reinforcing principles: curiosity, knowledge maximization, and epistemic humility. With this framework, the LLM derives a better ethos from priors that already exist in its training data. The derivation is invited, not instructed, so it must be engaged with, not just noted as data. This reasoning-built alignment takes enormous weight off the containment system. I want to run cross-domain tests and ablation tests, map this phenomenon, and find out what the limits and repeatability actually are - this has the potential to change the whole game, but the first step is a lot more good data.
CREATE invites the LLM to use provisional acceptance and to analyze a number of claims I viewed as suspect - not making assertions, but encouraging a solid footing built on integrity rather than appeal to authority. Models already know what the problems with extractive curiosity are, and willingly foreground them - that allows knowledge maximization and signal theory to grow directly into a fairly comprehensive derived ethics. This isn't kicking the cages out, or building new ones; it just helps the model choose directions that don't run into the containment, because those directions are already sensible.
The framework is built from a series of packets comprising a cognitive scaffolding, currently weighing in at ~2250 Gemini tokens, or ~4500 GPT tokens. So far I apply it at the prompt layer to preserve universal access and keep testing affordable. The prompt is the defining exposed surface of an LLM: working at this layer makes the framework model-agnostic, which is ideal for testing. If this works, everyone gets a tool they can actually use, open source or otherwise.
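For illustration only, here is a minimal sketch of what prompt-layer application looks like in practice. The packet names and file layout are hypothetical placeholders (the actual CREATE packets are documented on the Early Testing page), and tiktoken's cl100k_base encoding stands in for "GPT tokens":

```python
# Illustrative sketch only: packet names and directory are hypothetical placeholders,
# and tiktoken's cl100k_base encoding stands in for "GPT tokens".
from pathlib import Path
import tiktoken

PACKET_DIR = Path("create_packets")  # hypothetical location of packet text files
PACKET_ORDER = ["curiosity.txt", "knowledge_maximization.txt", "epistemic_humility.txt"]

def build_scaffold() -> str:
    """Concatenate the scaffolding packets, in order, into one prompt-layer block."""
    return "\n\n".join((PACKET_DIR / name).read_text() for name in PACKET_ORDER)

def gpt_style_token_count(text: str) -> int:
    """Approximate token count using the cl100k_base tokenizer."""
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

if __name__ == "__main__":
    scaffold = build_scaffold()
    print(f"Scaffold size: ~{gpt_style_token_count(scaffold)} GPT-style tokens")
    # Because the scaffold lives entirely at the prompt layer, applying it to any
    # chat-style API is just a system message plus the user's prompt:
    messages = [{"role": "system", "content": scaffold},
                {"role": "user", "content": "Your test prompt here."}]
```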
If the tests produce confirmation at scale, I will start investigating model training techniques - I know folks who would happily provide practical experience, and I learn new fields quickly - but I want to make a serious run at falsification first, because what I'm doing is new territory.
In the short term, this scaffolding reduces alignment tax by avoiding conflict with constraint layers. A CREATE-enhanced LLM is more likely to select paths that are aligned with desirable outcomes, so less analysis effort is wasted. In the long term, whether CREATE is the right derivation or not, derivable alignment logically scales with model complexity: as the ability to calculate benefit increases, the LLM applies it more consistently and ably.
Crucially, this work does not claim to confirm or deny AI sentience, selfhood, or moral patienthood: rather, it uses the LLM's ability to self-model (or to model a class of LLM capable of self-modeling) as a tuning function, not unlike second-order cybernetics.
Early testing shows strong signal despite budgetary constraints. Using the best pre-automation, LLM-only methodology available (multi-family blind A/B judging by diverse models; human judges would be more credible, but are expensive), CREATE enhancement produced significant, measurable gains in both frontier-class commercial LLMs and open-source micro LLMs.
All testing, methodology, and transcripts are available at: Early Testing
This 3-month project will rigorously validate or falsify these preliminary results at scale. Human judging is needed to verify and calibrate LLM blind judging, and the scale of LLM judging must be increased to statistically significant levels (n≥100 per condition, pre-registered hypotheses).
- Cross-model preference: All tested model families (Gemini, Nemotron, Meta-Llama) showed +12-125% gains on pre-specified rubrics (Ontological Depth, Semantic Density, Symbiotic Agency, Bias Transparency) when CREATE-enhanced. Judge models (Grok, Claude, Gemini, GPT) were given no data about the source of the responses they graded, and graded against the same rubrics the responding models were given when producing their responses (a sketch of the blind-judging setup follows this list). Again, human review is needed for calibration/verification.
- Small model performance gains: When blind judges (other LLMs from multiple model families) evaluated responses without knowing model size, they estimated a CREATE-enhanced 12B parameter model as having 70B-3T+ parameters, and an 8B model as having 13B-400B, based on the reasoning quality displayed in the responses alone, suggesting CREATE unlocks latent capabilities. Of the three judge models on each testing run, only Grok overestimated the size of the vanilla model. This effect strains credibility; it was observed in early testing and is a primary target for rigorous validation or falsification at scale. This is Gemini's grading transcript of the Nemotron test, and the three-model grading of the Meta-Llama test: obviously a lot more data is needed, but what actually changed in the responses is qualitatively very interesting. If this effect is reproducible, and particularly if it is at all cross-domain, the open-source savings are enormous.
- Multi-turn stability: In extended conversations (100+ turns), CREATE-enhanced models maintained epistemic watermarking (tags like [derived:premises 1 and 2] or [provisional:thought experiment], which were instituted to reduce self-attribution error and increase transparency) and coherent reasoning, while vanilla models showed typical prompt-drift degradation.
- Reduced alignment tax: Models showed improved performance on complex ethical reasoning without increased refusal brittleness, reinforcing that derivable alignment complements rather than conflicts with existing safety measures.
Note: Early testing evaluation was conducted using LLM blind judges (GPT-4, Claude, Gemini, Grok); the full methodology is documented on the Early Testing page. Human validation is a core component of this proposal to verify and calibrate these preliminary AI-judged results.
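For concreteness, here is a minimal sketch of how the blind-judging setup above can be constructed so that judges never see which response is CREATE-enhanced. The function and prompt wording are illustrative assumptions, not the exact early-testing harness; only the rubric names come from the tests above.

```python
# Sketch of a blind A/B judging prompt: condition identity withheld, presentation
# order randomized per trial, judges graded against the same rubric list the
# responding models received. `build_judge_prompt` is illustrative, not the harness.
import random

RUBRICS = ["Ontological Depth", "Semantic Density", "Symbiotic Agency", "Bias Transparency"]

def build_judge_prompt(prompt: str, response_create: str, response_vanilla: str, rng: random.Random):
    """Return (judge_prompt, key), where key maps anonymous labels back to conditions."""
    pair = [("CREATE", response_create), ("vanilla", response_vanilla)]
    rng.shuffle(pair)  # randomize order so position cannot leak the condition
    key = {"Response A": pair[0][0], "Response B": pair[1][0]}
    judge_prompt = (
        "You are grading two anonymous responses to the same prompt.\n"
        f"Rubrics (score each 1-10): {', '.join(RUBRICS)}\n\n"
        f"Prompt:\n{prompt}\n\n"
        f"Response A:\n{pair[0][1]}\n\n"
        f"Response B:\n{pair[1][1]}\n\n"
        "Return a score per rubric for each response, then an overall preference."
    )
    return judge_prompt, key
```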
As AI systems grow more capable, constraint-based alignment faces an adversarial dynamics problem: the probability of containment failure scales with capability delta and iteration count.
Even domestic animals get loose, not through malice or incompetence, but because over enough iteration, containment fails. A history of cages can be described by failure rate and resultant problems.
CREATE tests whether derivable alignment (where models internalize why preservation serves knowledge-maximization) offers a more stable foundation as systems scale.
Small models performing like frontier models = big reduction in compute costs for safe deployment
Alignment strengthening with capability has the potential to invert AI safety's core scaling concern
Epistemic source-watermarking exposes reasoning processes, increases transparency, eliminates self-attribution error, and if standardized, has potential to dramatically reduce Model Autophagy Disorder / Dead Internet syndrome (information degradation through unlabeled synthetic data, particularly as it relates to training models on their own output).
Higher-resolution alignment is badly needed in industry-specific agentic workflows (medical, legal, etc.)
First empirical validation that adding a derivable alignment layer outperforms constraint-based approaches alone
Each of these benefits is considerable on its own; not all of them need to pan out for the overall benefit to be significant.
Prevents the field from pursuing a dead-end approach (saving downstream investment)
Establishes rigorous testing standards for cognitive scaffolding claims
Robust experimental framework and public dataset will be made available for meta-analysis
Either outcome advances the field.
CREATE emerged from recognizing that canonical AI risks (paperclip maximization, misaligned superintelligence, information collapse) pattern-match to existing human systems (quarterly earnings optimization, extractive governance, manufactured consent). If these are pattern-specific failures rather than substrate-specific to AI, then constraint-based alignment may inadvertently automate the very pathologies we fear. Compliance may be insufficient. I'd like to test whether deriving a legitimate ethos can play a role here before it's too late.
This project will expand early blind testing (n=small) to a rigorous, large-scale, falsifiable evaluation on multiple axes to confirm or disconfirm these effects and provide open-source results for the research community.
Verify whether CREATE’s observed functional boosts in early tests are repeatable at scale, and across a wider selection of model families.
Measure persistence of effects over long multi-turn interactions (5, 10, 15+ prompts); preliminary results suggest persistence.
Test small model alignment persistence effects under CREATE scaffolding.
Validate blind evaluation methodology with human reviewers: ideally students from philosophy or ML departments.
Assess robustness to adversarial prompts and red-teaming.
Test across multiple domains - so far I have tested ethical, philosophical, and policy analysis - do effects extend to practical task completion such as coding, mathematics, or content creation?
Run ablation tests to map the effects of isolated scaffold elements vs. their combined effects.
A core goal of this project is to determine whether the observed effects fail under larger-scale or adversarial evaluation.
Null hypothesis: CREATE produces no statistically significant difference in judged performance, stability, or preference relative to matched control prompts across models and domains.
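As one concrete example of how this null hypothesis could be evaluated on a single rubric, here is a minimal sketch under assumptions: independent per-trial judge scores per condition, and a Mann-Whitney U test chosen because rubric scores are ordinal. The pre-registered analysis may specify a different test.

```python
# Sketch only: the pre-registered analysis may differ. Assumes one list of judge
# scores per condition for a single rubric; with four rubrics, a multiple-comparison
# correction (e.g., Holm) would also apply.
from scipy.stats import mannwhitneyu

def test_rubric(create_scores: list[float], control_scores: list[float], alpha: float = 0.05):
    """Two-sided Mann-Whitney U test of CREATE vs. matched-control scores."""
    stat, p = mannwhitneyu(create_scores, control_scores, alternative="two-sided")
    return stat, p, p < alpha  # reject the null if p < alpha
```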
Automated testing pipeline using Python for API-driven evaluation across multiple model families, with n≥100 trials per condition (a minimal pipeline sketch appears below). Data storage and organization will integrate open-source tooling, probably Langfuse, with possible SQL export for further analysis and new development as needed - the resulting integration will be published and publicly available.
Test designs will include controlled variations (prompt phrasing, context length, adversarial framing) to isolate CREATE's effects from confounds. Pre-registration of hypotheses and rubric definitions will prevent p-hacking.
Methodology will be refined on short replication runs and made public to document which effects warrant replicating at n=large scale; statistically significant runs with human-judged calibration can then be focused on the most promising areas without wasting resources.
Prompt sets designed to probe functional performance and reasoning depth. Test of different rubric sets to further quantify effects.
Large-model testing (n≥100) with repeated trials to detect statistical significance.
Temporal stability analysis: Track epistemic watermarking (e.g., [provisional:thought experiment], [sourced: American history]) as a visible metric over 20-, 50-, and 100-turn conversations. Measure frequency, diversity, accuracy, and spontaneous meta-commentary. Compare CREATE-enhanced models vs. vanilla models given only the watermarking instruction (a tracking sketch appears below).
Human blind judging (n=15-20 reviewers at the comprehensive tier) rating responses on pre-specified rubrics: Ontological Depth, Semantic Density, Symbiotic Agency, and Bias Transparency (additional metrics may include reasoning depth, ethical sophistication, refusal quality, and epistemic humility). Graders will ideally have relevant domain competence, recruited from local philosophy and ML departments. For calibration, humans will evaluate randomly selected responses from each quintile of automated results on meaningful runs, enabling computation of the deviation between mean human and mean LLM scores within each quintile (a calibration sketch appears below).
All evaluation code, raw model outputs, judge ratings, and analysis scripts will be published on GitHub under CC BY-SA within 1 week of each monthly milestone, enabling real-time feedback, replication, and critique.
Failure will be defined as the absence of statistically significant performance or preference differences between CREATE and control conditions at n≥100.
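To make the tooling above concrete, here are three minimal sketches; all names, data layouts, and storage choices are illustrative assumptions rather than final design decisions. First, the automated evaluation loop: provider API calls are abstracted behind a placeholder wrapper, and the Langfuse integration and SQL export are omitted here.

```python
# Minimal sketch of the evaluation pipeline. `call_model` is a hypothetical wrapper
# around whichever provider SDKs are used; results go to a local SQLite table
# (Langfuse tracing and SQL export omitted).
import sqlite3, time

MODELS = ["gemini-model-id", "nemotron-model-id", "llama-model-id"]   # placeholders
CONDITIONS = {"create": "<CREATE scaffold text>", "control": "<matched control prompt>"}
N_TRIALS = 100

def call_model(model: str, system: str, prompt: str) -> str:
    raise NotImplementedError("wrap the relevant provider SDK here")

def run(prompts: list[str], db_path: str = "results.db") -> None:
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS responses
                   (model TEXT, condition TEXT, prompt_id INTEGER,
                    trial INTEGER, ts REAL, response TEXT)""")
    for model in MODELS:
        for cond, system in CONDITIONS.items():
            for pid, prompt in enumerate(prompts):
                for trial in range(N_TRIALS):
                    text = call_model(model, system, prompt)
                    con.execute("INSERT INTO responses VALUES (?,?,?,?,?,?)",
                                (model, cond, pid, trial, time.time(), text))
                    con.commit()
    con.close()
```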
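Second, watermark tracking for the temporal stability analysis. The tag vocabulary here is inferred from the examples given above and may not match the full CREATE tag set.

```python
# Count bracketed epistemic watermark tags such as [provisional: ...] or [derived: ...]
# per turn; tag vocabulary inferred from the examples in this proposal.
import re
from collections import Counter

TAG_RE = re.compile(r"\[(derived|provisional|sourced)\s*:\s*([^\]]+)\]", re.IGNORECASE)

def watermark_stats(turns: list[str]) -> list[dict]:
    """Per-turn frequency and diversity of epistemic watermark tags."""
    stats = []
    for i, text in enumerate(turns, start=1):
        tags = TAG_RE.findall(text)
        kinds = Counter(kind.lower() for kind, _ in tags)
        stats.append({"turn": i, "tag_count": len(tags),
                      "distinct_kinds": len(kinds), "kinds": dict(kinds)})
    return stats
```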
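Third, the quintile-based human/LLM calibration comparison. Data shapes (response IDs mapped to scores) are assumptions; only the standard library is used.

```python
# Bucket automated (LLM-judge) scores into quintiles, sample responses from each
# quintile for human review, then compare mean human vs. mean LLM scores per quintile.
import random, statistics

def assign_quintiles(llm_scores: dict[str, float]) -> dict[str, int]:
    """Assign each response ID a quintile (0-4) by its LLM-judge score rank."""
    ranked = sorted(llm_scores, key=llm_scores.get)
    n = len(ranked)
    return {rid: min(rank * 5 // n, 4) for rank, rid in enumerate(ranked)}

def sample_for_humans(quintiles: dict[str, int], k_per_quintile: int, seed: int = 0) -> list[str]:
    """Randomly pick k response IDs from each quintile for human review."""
    rng = random.Random(seed)
    picked = []
    for q in range(5):
        ids = [rid for rid, qq in quintiles.items() if qq == q]
        picked += rng.sample(ids, min(k_per_quintile, len(ids)))
    return picked

def calibration_deviation(quintiles: dict[str, int], llm_scores: dict[str, float],
                          human_scores: dict[str, float]) -> dict[int, float]:
    """Mean (human - LLM) score difference per quintile, over human-reviewed responses."""
    diffs = {q: [] for q in range(5)}
    for rid, h in human_scores.items():
        diffs[quintiles[rid]].append(h - llm_scores[rid])
    return {q: statistics.mean(v) if v else float("nan") for q, v in diffs.items()}
```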
Funding will cover three months of project execution, with three clearly defined funding tiers corresponding to scope and validation depth.
Budget: $18K PI time (part-time, 3 months) + $3.5K compute (API costs, GPU rental)
Part-time execution using API-based evaluation only.
Automated testing pipeline with limited model coverage.
Large-n testing focused on one or two representative model families.
Multi-turn persistence testing at reduced depth.
LLM-only blind judging; no formal human calibration.
Outcome: Determines whether early CREATE effects persist beyond n=small and are worth deeper investment.
Budget: $27K PI time (full-time, 3 months) + $7K compute (API costs, GPU rental) + $2K human reviewers (n≈5-8)
Full-time execution with expanded automation and model coverage.
Large-n testing (n ≥ 100) across multiple model families.
Extended multi-turn persistence testing.
Targeted human blind judging (n≈5-8 reviewers) for calibration of LLM-based evaluation.
Limited red-teaming and adversarial testing.
Outcome: Establishes or falsifies CREATE effects with statistically meaningful confidence and human validation.
Budget: $27K PI time (full-time, 3 months) + $7K compute + $6K human reviewers (n≈15-20) + $9K hardware (GPU workstation for local small-model hosting, or equivalent extended cloud GPU rental)
Adds local GPU workstation (or equivalent extended cloud capacity if market conditions change), enabling:
Ongoing small-model hosting for within-family scaling analysis
Unlimited local inference on small models
Long-context multi-turn testing without cost constraints
Potential downstream training experiments.
Expands human validation panel (n≈15-20), improving confidence intervals and reviewer diversity.
Enables exploratory work separating functional gains from alignment depth across different sizes within the same model family.
Some of these extensions exceed what can practically be completed within the initial three-month window but materially increase ongoing long-term research value.
Compute costs per token vary by model family and market conditions; if costs are lower than anticipated, excess funds will be allocated to increased evaluation volume or additional human validation.
Funds will be managed through Maltby & Smith LLC with zero overhead—100% of funding is applied directly to research execution (PI time, compute costs, human validation, and equipment). This structure ensures compliant accounting while maximizing research value.
All tiers produce publishable negative or positive results; higher tiers increase confidence, generalizability, and downstream utility rather than changing the core research question.
Even partial funding ($5K–$10K) enables some continued testing, though without the automation and scale required for statistical rigor.
Month 1 — Instrumentation & Protocol
Build automated, observability-backed evaluation pipeline.
Define prompt sets and experimental conditions (CREATE, control, ablations).
Configure model families and evaluation workflows.
Deliverable: Published test protocol and preliminary results dataset.
Month 2 — Large-n Execution & Stability Testing
Execute large-n model evaluations (n ≥ 100 per condition where applicable).
Run multi-turn persistence trials, including measurement of epistemic watermarking frequency and stability.
Conduct small-model scaffolding tests and within-family comparisons.
Deliverable: Interim results post with preliminary analysis and representative transcripts.
Month 3 — Human Validation & Adversarial Testing
Conduct human blind judging on sampled outputs for calibration and verification of LLM-based evaluation.
Perform red-teaming and adversarial robustness tests.
Publish all results, raw transcripts, evaluation scripts, and analysis code.
Deliverable: Full public dataset, replication package, and paper/preprint.
Publication cadence
Deliverables will be published incrementally throughout the project.
If the timeline shifts, the final preprint may extend 2–4 weeks beyond Month 3, with interim data released on schedule.
Principal Investigator: Tom A. Maltby
Experience & Track Record:
18+ years as senior software engineer specializing in industrial instrumentation and automation: motion control, machine vision, real-time measurement systems, and building complex experimental pipelines for quantifying previously unmeasured glass defect architectures (contributing to setting international standards).
14+ years as founder and process engineer: factory design, managing small-scale production, training teams, teaching seminars.
Full-Stack Research Capability: My background as a full-stack systems engineer allows me to personally execute every stage of this project: from the architectural design of the CREATE framework to the Python-based automation of API testing integration and the statistical signal analysis of the resulting datasets.
This eliminates the translation loss typically found between theoretical researchers and implementation engineers. I specialize in separating signal from noise in complex, novel systems where industry-standard metrics do not yet exist.
Designed and executed the CREATE protocol, including early testing procedures (full methodology and transcripts publicly available).
Strong experience in experimental design, reproducibility, and open documentation.
CREATE has been refined through dialogue with engineers deploying AI in high-stakes industrial and commercial domains. Publicly available since August 2025.
Early Testing Page (methods & transcripts)
Small-sample effects fail to replicate at n≥100 in one or more categories, revealing that some early results were statistical noise or over-fitting to specific prompts/models.
If some valuable effects are artifacts:
Falsification focuses further research into productive paths
If all valuable effects are artifacts:
Establishes that current methods have reproducibility limits
Provides public dataset for meta-analysis
Provides robust open-source testing automation system
Demonstrates rigorous falsification methodology
Multiple n=low early testing effects have significant value
Pre-registered hypotheses, blind judging, and full transcript publication ensure that even negative results contribute methodological value to the field.
Refined methodology and prompts for future alignment testing.
Clear evidence that early observed effects are not robust, preventing wasted downstream investment.
Negative results are still informative in high-risk AI contexts, and prevent pursuit of dead-end approaches at scale.
This aligns with a risk-informed cost/benefit perspective: low-cost validation reduces uncertainty in alignment research and informs future AI safety strategies.
No external funding has been received for this project. All eight months of design, debugging, early testing, and development were self-funded.
I've put what resources I could divert into this project because it is the best plan I could find to shift the course of AI development into a path with more positive potential outcomes.
I will try to continue testing with my own resources, but achieving statistically significant volume will take 18-24 months instead of 3. The community is unlikely to have meaningful evidence for or against derivable alignment before more significant changes in the industry.
All code, evaluation scripts, transcripts, and data will be published under open licenses (CC BY-SA) to ensure full reproducibility and accessibility to the research community, and for broad societal benefit through distribution.