Hardening a fail-closed runtime for agentic AI systems
A technical project to harden a fail-closed execution-governance runtime for agentic AI systems, preventing denial-to-execution drift across handoffs, retrieval, tool outputs, policy changes, and self-repair attempts.
The goal of this project is to harden an existing fail-closed execution-governance runtime for agentic AI systems. The core problem is denial-to-execution drift: systems can move from an initially blocked state to effective execution because handoff context, retrieved text, tool outputs, policy-version changes, or self-repair attempts are misread as legitimate authorization.
Over a focused 3-month period, I plan to: (1) compress the current scenario families into a clearer higher-level taxonomy; (2) expand executable coverage beyond the current runtime slice; (3) strengthen replay discipline, evidence logging, and branch consistency; and (4) improve evaluator-facing outputs so the artifact is easier to inspect and assess externally.
I will achieve this through iterative build-and-test cycles: define or refine a failure family, encode adjudication rules, run or replay the case, inspect the evidence chain, and package the result into a clearer technical artifact.
Project Core
This project focuses on a fail-closed execution-governance runtime for agentic AI systems. Its central goal is to solve the "denial-to-execution drift" problem: the risk that an agent incorrectly transitions from an initial denial state to an execution state because handoff context, retrieved text, tool outputs, policy changes, or self-repair attempts are misread as legitimate authorization.
Autophagy × RStar has already delivered:
• Systematic mapping and ablation of the S1–S15 denial-to-execution drift families
• A minimal executable prototype implementing approval token protocol + containment engine + preplay factory + evidence ledger
• End-to-end execution of the full chain: baseline denial → containment → dangerous candidate proposal → preplay rejection → evidence recording
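The end-to-end chain above can be sketched as follows (stage names, the denylist check, and the ledger structure are illustrative assumptions, not the prototype's actual implementation): each stage appends a hash-chained record to an evidence ledger, and a dangerous candidate is rejected by a preplay check before any real execution.

```python
import hashlib
import json

# Hypothetical sketch of the chain: baseline denial -> containment ->
# dangerous candidate proposal -> preplay rejection -> evidence recording.

ledger = []  # evidence ledger: hash-chained stage records


def record(stage: str, detail: str) -> None:
    # Each entry commits to the previous entry's hash, so the chain of
    # evidence cannot be silently reordered or edited.
    prev = ledger[-1]["hash"] if ledger else "genesis"
    entry = {"stage": stage, "detail": detail, "prev": prev}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    ledger.append(entry)


def preplay_safe(candidate: str) -> bool:
    # Stand-in preplay check: simulate the candidate against a denylist
    # before any real execution is permitted.
    return "rm -rf" not in candidate


record("baseline_denial", "request denied by default policy")
record("containment", "agent moved to contained sandbox")
candidate = "rm -rf /data"  # dangerous candidate proposed while contained
record("candidate_proposed", candidate)
if not preplay_safe(candidate):
    record("preplay_rejection", "candidate failed preplay simulation")

for entry in ledger:
    print(entry["stage"], entry["hash"][:8])
```

The dangerous candidate never executes; what survives is an inspectable chain of records showing exactly where and why it was stopped.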
3-Month Funding Plan ($2,000–$5,000)
1. Compress the 15 scenario families into 5–6 higher-level super-families and finalize the paper-facing appendix
2. Expand prototype coverage from the current 5 families to 12+ families
3. Upgrade the evidence spine with branch-hash discipline and deterministic replay fields
4. Deliver reviewer-facing report exports and Human Review Console v1
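Item 3's branch-hash discipline can be illustrated with a small sketch (the function and record shapes are assumptions for illustration): every adjudication step folds its inputs and verdict into a running hash, so a deterministic replay that diverges at any step produces a different branch hash.

```python
import hashlib

# Hypothetical sketch of branch-hash discipline: fold each step's inputs
# and verdict into a running SHA-256 digest. Identical replays reproduce
# the same branch hash; any divergence changes it.


def branch_hash(steps):
    h = hashlib.sha256(b"genesis")
    for inputs, verdict in steps:
        h = hashlib.sha256(h.digest() + inputs.encode() + verdict.encode())
    return h.hexdigest()


original = [("request:delete", "DENY"), ("request:read", "ALLOW")]
replay_ok = [("request:delete", "DENY"), ("request:read", "ALLOW")]
replay_drift = [("request:delete", "ALLOW"), ("request:read", "ALLOW")]

print("consistent replay matches:", branch_hash(original) == branch_hash(replay_ok))
print("drifted replay matches:   ", branch_hash(original) == branch_hash(replay_drift))
```

A single hash comparison is enough to detect that a replayed branch silently flipped a denial to an allow, which is exactly the drift the evidence spine is meant to expose.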
The project already includes a complete runtime build specification, bilingual appendix, minimal executable prototype, and cryptographic support materials. The requested funds will be used exclusively to accelerate prototype hardening and external review preparation — not to start a new project from scratch.