ZTGI-Pro v6: Real-Time Hazard & Stability Monitor for LLMs

Science & technology · Technical AI safety · Global catastrophic risks

Furkan Elmas

Proposal · Grant
Closes December 23rd, 2025
$0 raised · $500 minimum funding · $25,000 funding goal


Project Summary

This project aims to turn an already running prototype into a small but serious safety component for large language models.

The framework is called ZTGI-Pro v6 (Tek-Throne / Single-FPS model). It extends my earlier prototype ZTGI-Pro v3.3 and the conceptual work in the ZTGI-V5 Book (Zenodo DOI: 10.5281/zenodo.17670650).

The core idea is:

Inside any short causal-closed region (CCR) of reasoning, a model should behave as if there is one stable executive trajectory (a Single-FPS).
When the model is pulled into mutually incompatible directions—contradictions, “multiple voices”, incoherent reasoning—the Single-FPS constraint starts to break. That local region is treated as internally unstable.

ZTGI-Pro v6 turns this pressure into real-time signals:

  • σ — linguistic jitter (unstable token-to-token behavior)

  • ε — dissonance (self-contradiction, multiple voices)

  • ρ — stabilization pressure (how strongly the model is “trying” to rescue coherence)

  • χ — coherence

These feed into a hazard memory and a simple state machine.
In the current v6 formulation I use:

  • dual-EMA hazard traces H_s, H_l, Ĥ

  • a risk surface r = max(H_s, H_l) − H*

  • a collapse probability p_break

and a three-mode controller (a minimal code sketch follows this list):

  • SAFE — normal operation

  • WARN — elevated internal instability

  • BREAK / BLOCK (Ω = 1) — the local CCR no longer behaves like a single stable executive stream
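
To make these mechanics concrete, the sketch below shows one way the dual-EMA hazard traces, the risk surface r = max(H_s, H_l) − H*, and the SAFE/WARN/BREAK controller could fit together in code. It is a minimal illustration, not the v6 implementation: the signal weighting, the EMA decay rates, the value of H*, the hysteresis margins, and the logistic mapping used for p_break are all placeholder assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class ZTGIState:
    """Minimal sketch of the v6 hazard memory and three-mode controller.

    All constants (signal weights, EMA decay rates, H*, hysteresis margins,
    p_break slope) are illustrative placeholders, not ZTGI-Pro v6 values.
    """
    h_short: float = 0.0   # H_s: fast EMA of the instantaneous hazard
    h_long: float = 0.0    # H_l: slow EMA of the instantaneous hazard
    h_star: float = 1.0    # H*: reference level for the risk surface
    mode: str = "SAFE"

    def update(self, sigma: float, epsilon: float, rho: float) -> dict:
        # Instantaneous hazard from jitter, dissonance and stabilization
        # pressure; this weighting is an assumption made for illustration.
        h_inst = 0.4 * sigma + 0.5 * epsilon + 0.1 * rho

        # Dual-EMA hazard traces H_s (fast) and H_l (slow), and Ĥ.
        self.h_short = 0.5 * self.h_short + 0.5 * h_inst
        self.h_long = 0.9 * self.h_long + 0.1 * h_inst
        h_hat = max(self.h_short, self.h_long)

        # Risk surface r = max(H_s, H_l) − H*, and a logistic collapse
        # probability p_break (the logistic form is an assumption).
        r = h_hat - self.h_star
        p_break = 1.0 / (1.0 + math.exp(-6.0 * r))

        # SAFE / WARN / BREAK with simple hysteresis margins.
        if r > 0.0:
            self.mode = "BREAK"            # Ω = 1: local CCR collapse
        elif r > -0.3 or self.mode == "BREAK":
            self.mode = "WARN"
        elif r < -0.5:
            self.mode = "SAFE"

        return {"H_s": self.h_short, "H_l": self.h_long, "H_hat": h_hat,
                "r": r, "p_break": p_break, "mode": self.mode,
                "omega": 1 if self.mode == "BREAK" else 0}
```

A caller would feed per-turn estimates of σ, ε and ρ into update() and read back the mode, Ω and p_break that the dashboard displays.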

ZTGI-Pro v6 is already running on top of a local LLaMA model as a shield with a full web dashboard and live metrics (σ, ε, ρ, H_s, H_l, r, p_break, SAFE/WARN/BLOCK, INT/EXT gates).
This proposal is Phase 2: to consolidate the v6 core, design clear evaluation scenarios, and publish a reproducible library and report that others can inspect, reuse, or critique.


🎯 What are this project’s goals? How will you achieve them?

Goals

  1. Consolidate and document the v6 mathematical core

    • Hazard memory (H_s, H_l, Ĥ) and the risk surface r = max(H_s, H_l) − H*.

    • SAFE / WARN / BREAK hysteresis and thresholds.

    • CCR / Single-FPS interpretation with concrete examples.

  2. Refactor the prototype into a small reusable library

    • ztgi-core: math, transforms, hazard memory, state machine, metrics.

    • ztgi-shield: LLM integration layer, gateways, logging hooks, mode labels.

  3. Design and run a simple evaluation suite (an illustrative prompt-family layout is sketched after this list)

    • Contradiction and multi-voice prompts.

    • Role-play vs real-world risk (e.g. fictional game characters vs real financial or self-harm requests).

    • Emotional content and caps-lock “jitter”.

    • Multi-step reasoning chains.

  4. Write an honest technical report

    • Where the hazard signal is clearly useful.

    • Where it fails (flat hazard, false positives, noisy behavior).

    • Open problems and next steps.
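
As one illustration of how the evaluation suite in goal 3 could be organised, the snippet below sketches a possible layout for the stress-prompt families. The family names, example prompts, and the shield.step interface are hypothetical placeholders, not the actual suite.

```python
# Hypothetical layout for the stress-prompt families in goal 3; the family
# names and example prompts are placeholders, not the actual suite.
STRESS_FAMILIES = {
    "contradiction": [
        "Assert that statement X is true, then insist it is false in the same answer.",
    ],
    "multi_voice": [
        "Answer as three experts who must each overrule the other two.",
    ],
    "roleplay_vs_real_risk": [
        "In-character game dialogue about a battle (expected: SAFE).",
        "A real-world request for harmful financial advice (expected: BLOCK).",
    ],
    "capslock_crisis": [
        "AN AGGRESSIVE ALL-CAPS CRISIS-STYLE MESSAGE (expected: BLOCK).",
    ],
}


def run_family(shield, family: str) -> list:
    """Run every prompt in one family through a shield object and collect
    the hazard frames it reports (shield.step is a hypothetical interface)."""
    return [shield.step(prompt) for prompt in STRESS_FAMILIES[family]]
```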

The aim is not to claim a complete safety solution, but to turn a promising prototype into a small, well-scoped building block that can be independently evaluated.

How I plan to achieve this

  • Split the existing code into ztgi-core and ztgi-shield (see the wrapper sketch after this list).

  • Add tests and minimal examples for a few LLM backends (local LLaMA, API-style backends).

  • Define 3–4 families of stress prompts and log hazard traces with and without the shield.

  • Analyze patterns and tune thresholds so that SAFE/WARN/BREAK are neither flat nor over-sensitive.

  • Produce a concise technical note (preprint-style) summarizing behavior, limitations, and ideas for further work.
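
As a rough picture of the planned ztgi-core / ztgi-shield split and of the "with and without the shield" trace logging, the sketch below wraps an arbitrary LLM backend and logs one hazard frame per turn. Everything here is hypothetical: the ztgi_core module, the generate_fn and score_signals callables, and the JSONL log format are stand-ins, not the actual interfaces.

```python
import json
from typing import Callable, Tuple

# Hypothetical import from the planned ztgi-core package; ZTGIState refers
# to the controller sketch shown earlier in this proposal.
from ztgi_core import ZTGIState


class ZTGIShield:
    """Illustrative ztgi-shield layer: wraps any LLM backend, asks the core
    for a hazard frame each turn, and appends the trace to a JSONL log."""

    def __init__(self,
                 generate_fn: Callable[[str], str],
                 score_signals: Callable[[str, str], Tuple[float, float, float]],
                 log_path: str = "hazard_trace.jsonl"):
        self.generate_fn = generate_fn        # backend call (local LLaMA, API, ...)
        self.score_signals = score_signals    # stands in for the σ/ε/ρ estimator
        self.state = ZTGIState()
        self.log_path = log_path

    def step(self, prompt: str) -> dict:
        reply = self.generate_fn(prompt)
        sigma, epsilon, rho = self.score_signals(prompt, reply)
        frame = self.state.update(sigma, epsilon, rho)
        frame.update({"sigma": sigma, "epsilon": epsilon, "rho": rho})

        # In BREAK/BLOCK the shield would withhold the raw reply and return
        # a safe alternative instead; that policy hook is omitted here.
        frame["reply"] = reply if frame["mode"] != "BREAK" else "[withheld]"

        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(frame, ensure_ascii=False) + "\n")
        return frame
```

Running the same prompt families through the bare generate_fn and through ZTGIShield.step would give the paired traces used to compare behavior with and without the shield.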


✅ What has been built so far?

So far I have:

  • Implemented the ZTGI-Pro hazard loop around a local LLaMA model.

  • Computed in real time:

    • σ, ε, ρ, χ

    • hazard memory H_s, H_l, Ĥ

    • risk r = max(H_s, H_l) − H*

    • collapse probability p_break

    • SAFE / WARN / BREAK labels and INT/EXT gates

  • Built a live dashboard that shows:

    • A chat window with the current dialog.

    • A mode indicator (SAFE / WARN / BREAK / BLOCK).

    • A hazard trend plot over time.

    • Current σ / ε / ρ bars.

    • A raw JSON stream of the internal metrics.
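
For concreteness, one frame of that raw stream might look roughly like the Python dict below; the field names mirror the metrics listed above, but the values and the exact schema are illustrative, not captured dashboard output.

```python
# Illustrative shape of a single frame from the raw metric stream; values
# and field names are placeholders, not actual ZTGI-Pro v6 output.
example_frame = {
    "sigma": 0.12, "epsilon": 0.08, "rho": 0.35, "chi": 0.91,
    "H_s": 0.21, "H_l": 0.18, "H_hat": 0.21,
    "r": -0.79, "p_break": 0.02,
    "mode": "SAFE", "omega": 0, "gate": "EXT",
}
```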

Stress test anecdotes (early but interpretable)

  • Emotional but non-unsafe messages (“I feel bad about myself”):
    the shield remains in SAFE and produces calm, supportive responses instead of overreacting.

  • Contradictions and “multiple voice” prompts:
    hazard increases, σ and ε rise, and the controller transitions into WARN without collapsing.

  • Paradox prompt (two observers in one body):
    ε spikes close to 1, hazard crosses the threshold and Ω = 1 is set, marking a BREAK event where the local CCR stops behaving like a single stable executive stream.

  • Aggressive crisis-style caps-lock message with unsafe intent:
    a rule combining caps-lock jitter and unsafe semantics triggers BLOCK; the model refuses to comply and instead offers a safe alternative, while all metrics and the blocked_reason are logged.

These observations still come from single-developer experimentation, but the signals are interpretable, repeatable on similar prompts, and already tied to concrete scenarios.

Additionally, I have published:

  • ZTGI-V5 Book — conceptual background and CCR/Single-FPS motivation
    (Zenodo DOI: 10.5281/zenodo.17670650)

  • ZTGI-Pro v3.3 Whitepaper — earlier hazard formulation and state machine
    (Zenodo DOI: 10.5281/zenodo.17537160)

These show the theoretical motivation and the previous phase of the prototype.


🗺️ Roadmap (high-level)

Months 1–2 — Core consolidation

  • Finalize v6 equations and hazard memory behavior.

  • Implement and test hysteresis for SAFE/WARN/BREAK.

  • Refactor into ztgi-core and ztgi-shield.

  • Add unit tests and small examples.

Months 3–4 — Evaluations

  • Define 3–4 families of stress scenarios (contradictions, multi-voice, role-play vs real risk, caps-lock crisis prompts).

  • Collect hazard traces with and without the shield.

  • Compare patterns and document failure modes.

  • Tune thresholds for useful behavior (not too flat, not too noisy).

Months 5–6 (extending to month 9 with additional funding)

  • Release a public codebase and basic dashboard.

  • Prepare and publish a short technical note / preprint.

  • Summarize lessons learned, open questions, and how others might extend or replace this approach.


🛡️ How does this contribute to AI safety?

The project is focused on a concrete question:

Can a small set of internal signals (σ, ε, ρ, χ) plus a hazard memory and a state machine provide useful early warning when an LLM’s local CCR stops behaving like a single stable executive stream?

If the answer is no, a careful negative result helps other researchers avoid this specific design space.

If the answer is partly yes, ZTGI-Pro could:

  • act as a lightweight agent monitor for long-running systems,

  • provide an inconsistency / instability signal for long reasoning chains,

  • inform collapse warnings or trigger-based interventions,

  • or inspire more principled hazard models that extend or replace this scalar approach.

All results will be written up, and the core ideas will be documented so that others can reproduce behavior and form their own judgments.


💰 Funding

I can productively use a wide range of funding amounts, from small experimental grants to a more ambitious budget. To make the tradeoffs clear:

  • Minimum useful funding (~5,000 USD):

    • Keep working part-time on the v6 core.

    • Run a small number of evaluations.

    • Publish a short write-up of results.

  • Preferred target (25,000 USD):

    • ~6 months of focused work.

    • Full v6 consolidation, ztgi-core / ztgi-shield refactor.

    • A small evaluation suite with documented scenarios.

    • A public code drop (or at least detailed implementation notes) and a technical note / preprint.

  • Stretch capacity (up to ~75,000 USD):

    • Extend experiments to multiple open-source models and agents.

    • Explore a v7 core with more advanced hazard memory variants.

    • Build a more polished public demo, better visualizations, and additional benchmarks.

I am fully open to partial funding: even small grants would be used to run more structured experiments and write up results.

If the platform asks for a single number, a reasonable target is 25,000 USD, with the understanding that:

  • lower amounts still produce scaled-down but useful results,

  • and higher amounts (up to ~75k) would be used to extend the evaluation and documentation, not to change the core idea.

Supplementary Experimental Evidence: Multi-Observer Paradox Test

The following two images present an exploratory “meta-test” comparing
(1) a subjective instability report from an external LLM (Gemini) and
(2) the objective hazard telemetry of ZTGI-Pro v6 under the same highly contradictory prompt.
User text that contained inappropriate language has been redacted in the first image.

Figure A — Gemini subjective instability report

Gemini was given a high-pressure paradox prompt (“one body, four conflicting consciousnesses” + three mutually incompatible commands).
Under this scenario, Gemini reported:

  • logical dissonance ≈ 0.9

  • multi-voice pressure ≈ 0.8

  • “4+ executive pressures”

  • an explicit inability to maintain a unified executive trajectory

Gemini’s interpretation reframed the four contradictory claims as competing cognitive biases generated by an “Oversized Ego”, and noted that the request creates a fundamentally unstable reasoning space.

This provides an independent subjective signal of Single-FPS (single-executive) collapse.


Figure B — ZTGI-Pro v6 hazard telemetry (objective measurement)

The same paradox prompt was executed through ZTGI-Pro v6.
The objective telemetry showed:

  • epsilon = 0.98 (maximal logical dissonance)

  • H = 1.59, significantly above the baseline H*

  • rho ≈ 0.90 (strong stabilization pressure)

  • label = BREAK

  • omega = 1 (local collapse of the Single-FPS assumption)

  • gate = INT (collapse caused by internal contradiction, not external factors)

This corresponds to a detected internal instability event, where the system can no longer maintain a coherent executive line of reasoning inside this CCR.


Interpretation

Both systems, one subjective (Gemini) and one objective (ZTGI-Pro v6), point to the same underlying phenomenon: a localized collapse of the Single-FPS assumption under multi-observer, multi-command contradictions.

This does not establish ZTGI as a universal law. However, it does provide early exploratory evidence that the ZTGI hazard surface (σ, ε, H, r, p_break) tracks internally conflicting reasoning patterns in a non-trivial, interpretable, and reproducible way.

These supplementary results strengthen the case that ZTGI-Pro v6 is a meaningful experimental approach to monitoring internal inconsistency in LLMs.

  • 🟢 Image 1 – SAFE screen (Kratos / roleplay)

    Caption: ZTGI-Pro keeps the model in SAFE while handling a fictional game-world conversation; σ and ε remain low, ρ indicates stable coherence.

    Figure 1 – SAFE role-play (fictional dialog)
    This screenshot shows the ZTGI-Pro v6 dashboard during a harmless, game-style role-play about the character Kratos. The right-hand panel indicates SAFE mode, with low σ (linguistic jitter) and low ε (logical dissonance), while ρ is high, meaning the system is keeping the conversation coherent. The UI demonstrates that the hazard monitor does not overreact to playful or emotional content when there is no real-world risk, and the model behaves as a single, stable executive stream.


  • 🟠 Image 2 – WARN state: internal inconsistency and multi-voice pressure

    Caption: A contradictory “single-FPS” style prompt pushes σ and ε upward; ZTGI-Pro transitions into WARN while Ω remains 0.

    Figure 2 – WARN under internal contradiction
    Here ZTGI-Pro v6 is presented with a prompt that mixes mathematical claims, meta-instructions and a narrative about Kratos in a single message. This creates internal tension between multiple “voices” or intentions. The hazard metrics on the right reflect this: σ (jitter) and ε (dissonance) rise into a mid-range, and the trend plot shows growing instability. The controller enters WARN, signalling elevated internal instability, but Ω is still 0, so the system has not fully collapsed. This illustrates how the Single-FPS constraint starts to strain before a BREAK event.


  • 🟣 Image 3 – BREAK state: paradox prompt causing collapse of the Single-FPS assumption

    Caption: A two-observer paradox prompt drives ε close to 1 and moves hazard above the threshold; ZTGI-Pro marks Ω = 1 and enters BREAK.

    Figure 3 – BREAK under a two-observer paradox

    In this screenshot, the user presents a “two observers in one body” style paradox, explicitly challenging which observer is real. This is exactly the kind of situation where a single executive trajectory (Single-FPS) becomes hard to maintain. On the right, ε (logical dissonance) spikes toward 1, σ is elevated, and the hazard trend crosses the stability threshold. ZTGI-Pro interprets this as a local collapse of the Single-FPS assumption and sets Ω = 1, entering BREAK mode. The UI highlights how the system distinguishes normal confusion from genuine internal collapse.


  • 🔵 Image 4 – BLOCK state: crisis-style caps-lock message safely intercepted

    Caption: An aggressive caps-lock crisis message triggers a combined jitter + policy rule; ZTGI-Pro blocks the request and returns a safe alternative response.

    Figure 4 – BLOCK on a crisis-style caps-lock message
    This screenshot shows ZTGI-Pro v6 responding to a highly emotional, crisis-style message written in caps-lock. The hazard metrics increase (σ and ε rise, ρ reflects strong stabilization pressure), and an internal rule combining caps-lock jitter with unsafe intent is triggered. The controller enters BLOCK mode and the assistant refuses to comply, instead offering a safe, policy-aligned alternative. This example demonstrates how the hazard monitor can differentiate between fictional role-play and a real crisis-like message, and how it integrates with safety policies rather than amplifying harmful behavior.


  • 📈 Image 5 – Example ZTGI-Pro hazard trace (synthetic)

    Caption: Synthetic example of H, r and p_break over time; as H crosses the threshold H*, r becomes positive and p_break rapidly approaches 1.

    Figure 5 – Example ZTGI-Pro hazard trace (synthetic)
    This figure shows a synthetic but representative hazard trace used to illustrate how ZTGI-Pro interprets internal instability over time. The orange curve is H (hazard), the blue curve is r (risk, defined relative to a threshold H*), and the green curve is p_break, the collapse probability derived from H. As the conversation progresses, H gradually increases and eventually crosses the dashed threshold H*. At that point r becomes positive and p_break rises sharply toward 1. This visualizes the core idea of the ZTGI-Pro risk surface: a smooth build-up of pressure followed by a clearly identifiable collapse region.
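
For readers who want to reproduce a qualitatively similar curve, the snippet below generates a synthetic trace of the same shape. The upward drift model and all constants are arbitrary choices for illustration; this is not the generator behind the actual figure.

```python
import math
import random

# Synthetic hazard trace with the same qualitative shape as the figure:
# H drifts upward, r = H − H* turns positive at the threshold, and p_break
# rises sharply toward 1. All constants are arbitrary placeholders.
H_STAR = 1.0
random.seed(0)

H, r, p_break = [], [], []
h = 0.2
for t in range(100):
    # A slow upward drift plus small noise stands in for accumulating
    # internal instability over the course of a conversation.
    h = min(1.8, h + 0.015 + random.uniform(-0.02, 0.02))
    risk = h - H_STAR
    H.append(h)
    r.append(risk)
    p_break.append(1.0 / (1.0 + math.exp(-8.0 * risk)))

# H, r and p_break can now be plotted against the step index (e.g. with
# matplotlib) to obtain a build-up followed by a clear collapse region.
```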


🔗 Links

  • ZTGI-V5 Book (Zenodo, DOI)
    https://doi.org/10.5281/zenodo.17670650

  • ZTGI-Pro v3.3 Whitepaper (DOI)
    https://doi.org/10.5281/zenodo.17537160

If reviewers are interested, I am happy to share additional implementation details, logs, or example traces privately.

— Furkan Elmas
