Diamond Engine — Runtime Attention Perturbation for Non-Deterministic Language Model Inference

Funding request submitted to Manifund
Author: Furkan Elmas, 2026

Summary

Diamond Engine is a runtime attention-layer intervention for transformer language models. It does not modify model weights. Instead, it hooks the attention computation at inference time and applies a set of input-driven transformations before the result is fed back into the forward pass: a structurally fixed perturbation of the attention matrix, gated by a set of interacting signals derived from the model's own internal state and by a random draw.

Through extensive hands-on testing across hundreds of generations, the author has directly observed the model producing sentence constructions, cross-domain conceptual pairings, and lexical choices it does not produce under standard decoding; on direct inspection, the author judges this output original. This proposal asks for funding to scale that testing, harden the implementation, and extend it across model architectures.

What the engine actually does

Diamond Engine replaces the standard scaled dot-product attention function in a Hugging Face transformers model (currently running on Qwen 2.5 14B Instruct, quantized, on a single consumer GPU) with a custom attention function registered through ALL_ATTENTION_FUNCTIONS. Nothing about the base model is retrained or fine-tuned; the intervention is entirely at the level of the forward pass.

1. The core perturbation ("w = w")

At each layer, before the softmax normalization, the raw attention score matrix w is combined with a reversed version of itself along the key axis:

w_rev = flip(w, along key axis)
w'    = (w + α · w_rev) / final_scale

α is computed per layer from a bell-shaped weighting over layer depth, further modulated by a global scalar (alpha_base) that is adjusted over the course of a conversation by a feedback controller watching output health. final_scale is the output of the composite gating system described below. For a context of n key positions, the score at key position j receives α times the score at the mirrored position n - 1 - j (0-indexed), so each token's attention distribution is systematically biased toward also attending to the mirror-image position in its own context window. This is a structural, non-random intervention distinct from anything in a standard attention implementation, applied at every generation step.
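The perturbation above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the engine's code: the Gaussian form of the bell curve, the `width` parameter, the example values, and the function names `bell_alpha` and `perturb_scores` are all assumptions, and causal masking is ignored for brevity.

```python
import math

def bell_alpha(layer, num_layers, alpha_base=0.1, width=0.25):
    """Assumed Gaussian bell over layer depth, peaking at the middle of the stack."""
    depth = layer / max(num_layers - 1, 1)  # 0.0 at the first layer, 1.0 at the last
    return alpha_base * math.exp(-((depth - 0.5) ** 2) / (2 * width ** 2))

def perturb_scores(w, alpha, final_scale=1.0):
    """w' = (w + alpha * flip(w, key axis)) / final_scale for a [query][key] score matrix."""
    return [
        [(score + alpha * mirrored) / final_scale
         for score, mirrored in zip(row, reversed(row))]
        for row in w
    ]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical pre-softmax scores: 2 query positions x 3 key positions.
w = [[2.0, 0.0, -1.0],
     [0.5, 1.5, 0.0]]
alpha = bell_alpha(layer=6, num_layers=13)       # middle layer, so alpha == alpha_base
w_prime = perturb_scores(w, alpha)
attention = [softmax(row) for row in w_prime]    # rows still sum to 1 after softmax
```

Because the flip happens before the softmax, each row of the result is still a valid probability distribution; only its shape changes.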

2. Gating: when and how much to perturb

Perturbation is not applied uniformly. A per-layer, per-token decision (should_inject) determines whether the perturbation fires, based on: a layer-position probability curve (heavier in middle layers), local attention head divergence, a rolling entropy/oscillation signal from recent generation steps, and a generation-position-dependent random draw. When injection fires, its strength is set by a composite scalar (w_scale) built from a normalized nonlinear "master equation" comparing aggregate attention-entropy and attention-magnitude signals, a reactive correction term, a layer-depth damping term, a harmonic term, and a "self-return" term measuring the asymmetry of a rotating pair of weighting factors over the course of generation. This is a hand-engineered control system with multiple interacting feedback loops, refined through direct observation of its output across many sessions.
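A minimal sketch of the injection decision follows. The proposal names the inputs to should_inject but not how they are combined, so the combination rule here (threshold vetoes followed by a random draw against a layer-position curve) is an assumption, as are all thresholds, the curve shape, and the helper names `attention_entropy` and `layer_probability`. The w_scale "master equation" is not sketched because its form is not given.

```python
import math
import random

def attention_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def layer_probability(layer, num_layers, peak=0.8, width=0.2):
    """Hypothetical curve: injection is most likely in the middle layers."""
    depth = layer / max(num_layers - 1, 1)
    return peak * math.exp(-((depth - 0.5) ** 2) / (2 * width ** 2))

def should_inject(layer, num_layers, head_divergence, entropy_signal, step, rng,
                  divergence_floor=0.05, entropy_ceiling=3.0, warmup_steps=4):
    """Assumed rule: apply threshold vetoes first, then a random draw."""
    if step < warmup_steps:                  # assumed generation-position dependence
        return False
    if head_divergence < divergence_floor:   # assumed minimum head divergence
        return False
    if entropy_signal > entropy_ceiling:     # assumed cap on the rolling entropy signal
        return False
    return rng.random() < layer_probability(layer, num_layers)

rng = random.Random(0)  # seeded so a run can be replayed
decisions = [
    should_inject(layer, 13, head_divergence=0.2, entropy_signal=1.0, step=10, rng=rng)
    for layer in range(13)
]
```

Passing the random generator in explicitly keeps the draw reproducible, which matters when comparing telemetry across sessions.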

3. Two-pass generation ("Id" / "Ego")

Each user turn is generated twice:

Pass 1 ("Id"): generated with the perturbation system fully active, deliberately permitted to range into unconventional territory.
Pass 2 ("Ego"): generated with the perturbation system disabled, conditioned on the first pass and conversation history, resolving it into a coherent response. An overlap check determines whether this rewrite step triggers.
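The two-pass flow can be sketched as below. The proposal does not say what the overlap check compares, so this sketch assumes a word-level Jaccard overlap between the conversation context and the Pass 1 draft, with the rewrite triggering when overlap falls below a threshold. The names `jaccard_overlap`, `two_pass_reply`, `generate_id`, `generate_ego`, and the `min_overlap` value are hypothetical.

```python
def jaccard_overlap(a, b):
    """Word-level Jaccard similarity between two strings, in [0, 1]."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def two_pass_reply(context, generate_id, generate_ego, min_overlap=0.2):
    """Run Pass 1 ("Id"); run Pass 2 ("Ego") only if the draft drifted too far.

    generate_id(context) -> str          : perturbation active (hypothetical callable)
    generate_ego(context, draft) -> str  : perturbation disabled (hypothetical callable)
    """
    draft = generate_id(context)
    if jaccard_overlap(context, draft) >= min_overlap:
        return draft                      # draft stays close enough: keep it as-is
    return generate_ego(context, draft)   # otherwise resolve it in a second pass

# Stub generators stand in for real model calls.
reply = two_pass_reply(
    "the cat sat",
    generate_id=lambda ctx: "quantum lattice hymn",
    generate_ego=lambda ctx, draft: f"rewritten: {draft}",
)
```

Taking the generators as arguments keeps the control flow testable without loading a model.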

The result, observed directly and repeatedly by the author across extended use, is a model that reliably produces original phrasing and cross-domain conceptual connections while remaining fluent and grammatically sound.

Development history

The current implementation is the result of multiple rounds of iteration and debugging: earlier versions had a gating signal that saturated to a near-constant value regardless of input, a phase-space curvature metric that was mathematically guaranteed to read zero after a few dozen tokens of context, and an attention perturbation that only fired during prompt encoding rather than during generation. All of these were identified through direct inspection of the system's internal telemetry and corrected. The system now produces output that is measurably and observably different, session over session, from the same model's baseline behavior.

Milestones

1. Scale up testing. Move from single-session manual testing to structured logging across a much larger volume of generations, to build a public record of the range of outputs the engine produces.

2. Cross-architecture port. Port the attention hook to at least one additional architecture family (Llama 3 or Mistral) to establish the mechanism beyond Qwen.

3. Ablation of the gating system. Determine the individual contribution of each sub-term (master equation, reactive term, harmonic term, self-return term) to the observed effect.

4. Write-up. A public technical report documenting the mechanism, the development history including bugs found and fixed, and the range of output observed.
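For the ablation milestone, the set of runs can be enumerated mechanically. The sketch below is one possible design, not a committed plan: it lists every on/off combination of the four named sub-terms, plus the smaller leave-one-out set. The flag names and both helper functions are hypothetical.

```python
from itertools import product

# Sub-term names taken from the gating description; the flag spelling is assumed.
SUB_TERMS = ("master_equation", "reactive", "harmonic", "self_return")

def ablation_configs(terms=SUB_TERMS):
    """Yield every on/off combination of the gating sub-terms (2**len(terms) runs)."""
    for flags in product((True, False), repeat=len(terms)):
        yield dict(zip(terms, flags))

def leave_one_out(terms=SUB_TERMS):
    """Cheaper alternative: disable exactly one sub-term per run."""
    return [{term: term != dropped for term in terms} for dropped in terms]

full_grid = list(ablation_configs())   # 16 configurations for four sub-terms
minimal_grid = leave_one_out()         # 4 configurations
```

The full grid exposes interactions between sub-terms; the leave-one-out set is a cheaper first pass when compute is limited.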

Academic & Open-Source Validation

The data architecture, tracking frameworks, and telemetry logs supporting this research are archived and publicly indexed for review:

Repository: github.com/capterr/diamond-llm
Academic DOI: 10.5281/zenodo.20368487

Funding Request & Budget Breakdown (Target: $50,000)

To scale the Diamond Engine from an empirical single-GPU prototype into a production-ready, model-agnostic inference framework over a 6-month timeline, I am requesting a total grant of $50,000.

Because the engine leaves the base model's weights untouched, none of this budget is spent on training. It is allocated to large-scale compute for telemetry collection, mathematical optimization, and testing across state-of-the-art open architectures (up to 70B+ parameters).

Dedicated Enterprise Compute Infrastructure ($20,000): Funding dedicated, multi-node cloud compute clusters (A100/H100 instances via decentralized or on-demand providers) for continuous batch-generation loops. This will allow stress-testing the w_rev perturbation on unquantized Llama 3 (70B/405B) and Mistral Large models under high-concurrency settings, compiling hundreds of thousands of token-trace generation logs for public analysis.
Full-Time R&D and Architectural Engineering ($22,000): Sustaining 6 months of dedicated, full-time research and development. This includes mathematically decoupling the dynamic gating logic from local tokenizers, writing optimized custom CUDA kernels to minimize the inference-time latency overhead introduced by the ALL_ATTENTION_FUNCTIONS hook, and running exhaustive ablation matrices on every sub-term of the gating system (master equation, harmonic, reactive, and self-return signals).
Red-Teaming, Evaluation Frameworks & Public Report ($8,000): Developing automated benchmark pipelines to formally measure how far the generated outputs diverge, lexically and semantically, from those of standard decoding algorithms. This track will culminate in a comprehensive technical whitepaper, written to peer-review standards, documenting the engineering history, fixed architectural edge cases, and verifiable results.
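One simple lexical measure that such a benchmark pipeline could start from is the share of n-grams in a perturbed output that never appear in baseline outputs for the same prompt. This is an illustrative sketch, not the planned evaluation suite: the metric choice, whitespace tokenization, and the names `ngrams` and `novel_ngram_rate` are assumptions, and it captures surface novelty only, not semantic drift.

```python
def ngrams(text, n=2):
    """Set of word n-grams in a string (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_rate(candidate, baselines, n=2):
    """Fraction of the candidate's n-grams that appear in none of the baselines."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    seen = set().union(*(ngrams(b, n) for b in baselines))
    return len(cand - seen) / len(cand)

# Hypothetical outputs for one prompt: one perturbed sample vs. two baseline samples.
rate = novel_ngram_rate(
    "the river argues with its own reflection",
    ["the river flows to the sea", "the river runs with cold water"],
)
```

A rate near 0 means the perturbed output reuses baseline phrasing; a rate near 1 means almost none of its word pairs occur in the baseline samples.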
