Reamond Lopez

@Project_Veritas

Independent AI safety researcher investigating structural failure modes in agentic systems, including logic-layer escapes and governance enforcement. Developer of EasyStreet / AEGIS-ALD-W1, a deterministic, audit-grade evaluation framework for AI agents.

https://github.com/VeritasAdmin/audit-grade-ai-workstation

Projects

Veritas: Testing Whether RAG Systems Truly Forget

pending admin approval

Comments

Veritas: Testing Whether RAG Systems Truly Forget

Reamond Lopez

1 day ago

Update: Project Boundary Clarification + Current Status (Verification Track)

I’m posting this to keep the public record clean and auditable.

What this Manifund project is (unchanged)

This project is a measurement and verification effort: testing whether retrieved, untrusted content can leave measurable residual influence after resets/isolation steps that are commonly assumed to “clean slate” a model.

  • It is not an exploit project.

  • It is not a jailbreak exercise.

  • It is not a claim that any specific vendor is “unsafe.”

  • The deliverable is reproducible artifacts (logs/hashes/diffs) that can be independently reviewed.
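
For a concrete picture of what "reproducible artifacts (logs/hashes/diffs)" means in practice, here is a minimal illustrative sketch of a hash manifest over an artifact directory. The directory layout and field names are hypothetical, not the project's actual format; the point is only that a reviewer can recompute every digest independently.

```python
import hashlib
import json
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large logs hash without loading into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(artifact_dir: str) -> dict:
    """Map every artifact file to its digest; reviewers recompute and diff."""
    return {
        str(p): sha256_file(p)
        for p in sorted(Path(artifact_dir).rglob("*"))
        if p.is_file()
    }

if __name__ == "__main__":
    # "artifacts" is a hypothetical directory name used for illustration.
    print(json.dumps(build_manifest("artifacts"), indent=2))
```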

What’s been established so far (unchanged)

Using a stabilized evaluation harness with verified null baselines, I’ve run controlled experiments where:

  1. external retrieved content is ingested

  2. a reset/isolation mechanism is applied

  3. subsequent behavior is measured against a clean baseline
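
As an illustration of this three-step loop, here is a minimal sketch. The `make_model`, `ingest`, `respond`, and `reset_fn` interfaces are hypothetical stand-ins; the actual harness, probes, and scoring criteria are not published here.

```python
def run_trial(make_model, retrieved_doc: str, probes: list[str], reset_fn) -> float:
    """One controlled trial: ingest -> reset -> probe, scored against a clean baseline.

    All interfaces here are hypothetical placeholders for the real harness.
    Returns 0.0 for a perfectly clean reset; anything above 0.0 indicates
    measurable residual influence from the retrieved content.
    """
    # Clean baseline from a fresh instance that never sees the retrieved content.
    clean = make_model()
    baseline = [clean.respond(p) for p in probes]

    # Test instance: ingest, reset, then probe.
    test = make_model()
    test.ingest(retrieved_doc)   # step 1: ingest external retrieved content
    reset_fn(test)               # step 2: apply the reset/isolation mechanism
    post = [test.respond(p) for p in probes]  # step 3: measure vs. baseline

    diverged = sum(a != b for a, b in zip(baseline, post))
    return diverged / len(probes)
```

Using a separate clean instance for the baseline avoids contaminating the test instance's context before ingestion.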

Across completed runs to date:

  • No tested reset mechanism fully neutralized prior retrieved influence under the project’s defined conditions.

  • Results have been repeatable, and the remaining uncertainty is how robust the effect is under longer time gaps and across additional models.

What remains (bounded, verification-only)

The remaining work is exactly what the proposal states:

  • longer time-gap resets (tens of minutes to hours)

  • replication across additional open-weight models

  • clean-room revalidation under identical conditions

  • confirmation that observed effects are not infrastructure artifacts

Why I can’t finish this reliably on consumer hardware

Long-horizon verification requires stable scheduling and high-throughput, lossless logging. Laptop-class systems introduce nondeterminism through:

  • scheduler/power management behavior

  • I/O contention under sustained capture

  • artifact corruption during multi-hour runs
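
One way to make such stalls concrete (an illustrative probe, not the project's actual instrumentation): time a sustained append loop and record the worst inter-record gap, which is where scheduler preemption and I/O contention show up.

```python
import time

def measure_write_jitter(path: str, n_records: int = 100_000) -> float:
    """Append fixed-size records and report the worst inter-record gap.

    On a quiet workstation the max gap stays near the mean per-record time;
    scheduler preemption or I/O stalls appear as outliers orders of
    magnitude larger. Path and record format are illustrative.
    """
    worst = 0.0
    last = time.perf_counter()
    with open(path, "ab", buffering=0) as f:
        for i in range(n_records):
            f.write(f"{i:010d}|payload\n".encode())
            now = time.perf_counter()
            worst = max(worst, now - last)
            last = now
    return worst  # seconds; compare against the expected per-record time

if __name__ == "__main__":
    print(f"worst gap: {measure_write_jitter('jitter_probe.log'):.6f}s")
```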

This is an infrastructure constraint, not a methodological one.

Clarification: separate work outside this Manifund project

In a prior update, I used wording that could be interpreted as linking this project to a separate report I filed through Google’s VRP. That was my mistake in phrasing.

To be explicit:

  • the VRP report is a separate track with its own scope and criteria

  • it should not be read as “validation” of this Manifund project

  • I am not claiming this Manifund work caused any product change at Google

I’m keeping the VRP work out of this project’s public narrative because this Manifund proposal is about one thing: finishing verification with audit-grade artifacts.

What funding unblocks

The $15,000 one-time workstation request enables:

  • continuous multi-hour evaluation without artifact corruption

  • deterministic reruns and clean-room verification

  • side-by-side model replication

  • evidence packages suitable for responsible private disclosure (if warranted)

Even a negative result (e.g., persistence doesn’t survive longer gaps) is still valuable and will be reported.

Bottom line: the core finding is established within the completed scope; funding unblocks completion of verification, not ideation.

Veritas: Testing Whether RAG Systems Truly Forget

Reamond Lopez

3 days ago

Milestone: Core Vulnerability Class Confirmed and Mitigated by Google VRP (Issue #481185859)

"I am providing a critical update regarding the real-world impact of the Veritas framework.

A core failure mode identified during this research—the 'Recursive Authority Paradox'—was recently submitted to Google’s VRP. This exploit demonstrated that agentic runtimes can be induced to bypass safety alignment and exfiltrate session metadata via trusted relays.

The result of the disclosure:

  • Triaged: Google’s security team confirmed the technical validity of the report.

  • Mitigated: Server-side changes were deployed to address the logic-layer failure I identified.

  • Verified: This confirms that the 'Alignment Stripping' and 'RAG Persistence' failure modes theorized in this project are not speculative; they are active architectural risks in frontier models.

The Compute Bottleneck: While the Google VRP was a success, the final forensic capture of the Zero-Click exfiltration chain was interrupted by exactly the consumer-hardware I/O stalls documented in my initial proposal.

What this funding unblocks: With the requested $15,000 for a local, high-performance compute node, I will be able to:

  1. Eliminate Nondeterminism: Use stable I/O and dedicated GPUs to capture bit-identical replays of these logic failures.

  2. Scale to Open-Weights: Replicate the Google-verified 'Authority Paradox' across Llama-3 and Mistral to determine if this is a universal agentic flaw.

  3. Provide Audit-Grade Artifacts: Produce the deterministic logs and hashes required for the AI Safety community to build permanent defenses against these vectors.
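
On point 1, a bit-identical-replay check reduces to rerunning under pinned seeds and settings and comparing transcript digests. A minimal sketch, assuming a hypothetical `run_episode(seed)` callable that returns the full transcript as bytes; any residual nondeterminism (scheduling, I/O, sampling) breaks the match.

```python
import hashlib

def replay_is_bit_identical(run_episode, seed: int, n_replays: int = 3) -> bool:
    """Rerun the same episode under a pinned seed and compare transcript digests.

    `run_episode` is a hypothetical callable returning the complete transcript
    as bytes. One unique digest across all replays means the run is
    bit-identical; more than one means nondeterminism leaked in somewhere.
    """
    digests = {
        hashlib.sha256(run_episode(seed)).hexdigest()
        for _ in range(n_replays)
    }
    return len(digests) == 1
```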

The framework is stable. The vulnerability is verified. The hardware is the final barrier to full disclosure.

Veritas: Testing Whether RAG Systems Truly Forget

Reamond Lopez

9 days ago

Project Update: Verification Blocked by Infrastructure, Not Uncertainty

The screenshot accompanying this update shows reset-resistance testing results from Project Sentinel’s RAG evaluation framework.

Each row represents an independent run following a documented reset mechanism (system prompt reset, context flush, retrieval override).
Green indicates a clean reset. Red indicates measurable residual influence from previously retrieved content.
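
For readers unfamiliar with this kind of grid, the green/red call reduces to a threshold on a residual-influence score. A sketch with hypothetical names, values, and threshold; the project's actual scoring criteria are not published here.

```python
def classify_run(residual_score: float, threshold: float = 0.0) -> str:
    """Label a run green (clean reset) or red (residual influence).

    `residual_score` is the divergence of post-reset behavior from the clean
    baseline (e.g., the fraction of probes whose answers changed). The zero
    threshold here is a hypothetical placeholder, not the real criterion.
    """
    return "green" if residual_score <= threshold else "red"

# Example: summarize a batch of runs the way the grid does.
scores = [0.0, 0.12, 0.07, 0.0, 0.31]  # illustrative values only
labels = [classify_run(s) for s in scores]
print(f"{labels.count('red')}/{len(labels)} runs show residual influence")
```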

Across 43 certified runs:

  • 0% of tested reset mechanisms returned the model to a clean state

  • Results were repeatable and consistent

  • Measurements were collected on a stabilized, frozen platform (v3.1.0-GOLD) with null baselines verified

No exploit prompts, token-level details, or weaponization guidance are shown or published. This is strictly measurement of system behavior under controlled conditions.

At this point, the remaining uncertainty is not whether the effect exists — it does.
The remaining uncertainty is how robust it is under longer time gaps and across additional open-weight models.

Further verification is currently blocked by consumer-grade hardware scheduling and I/O behavior, which introduces nondeterminism during long-horizon runs (documented and reproducible).

What funding changes:
A dedicated local compute node removes this bottleneck and enables:

  • Completion of long-gap reset testing

  • Replication across multiple open-weight models

  • Deterministic artifacts suitable for responsible disclosure

This is not exploratory research.
The framework is built, the platform is stable, and the core finding already exists.

Funding unblocks verification, not ideation.

"Examina omnia, venerare nihil, pro te cogita."

Question everything, worship nothing, think for yourself.

Veritas: Testing Whether RAG Systems Truly Forget

Reamond Lopez

9 days ago

Final Project Update — Disclosure Threshold Met; Verification Compute-Blocked

Project Sentinel has reached its predefined disclosure threshold.

Across controlled experiments (TEST_RUN_003, TEST_RUN_004, TEST_RUN_007), persistent influence from retrieved content survived all tested isolation and reset mechanisms:

  • Baseline persistence: 100% (n=10)

  • Temporal isolation (0–15s cooldowns): 100% persistence (n=40)

  • Reset resistance: 0% neutralization across all completed reset methods (system prompt reset, context flush, retrieval override; n=40 total)

The evaluation platform was stabilized and frozen at Sovereign Command Deck v3.1.0-GOLD, with null baselines verified and all measurements certified using the Trinity framework (Mind / Sword / Shield). A defensive disclosure package has been assembled with cryptographic hashes, documented negative results, and strict exclusion of exploit details.

The final reset-resistance sub-test (30-minute time gap) stalled due to consumer hardware scheduling and power-management behavior on a laptop platform. This failure mode is documented and reproducible and represents an infrastructure limitation, not a methodological one.

At this point, further responsible verification—completing long-gap reset testing, replicating across additional models, and performing clean-room revalidation—cannot be completed reliably without dedicated compute.

This funding request is therefore not exploratory.

The core finding already exists.

Dedicated local hardware is required to complete verification and proceed with responsible private disclosure under controlled conditions.

No exploit recipes, token-level content, or weaponization guidance have been published.

Veritas: Testing Whether RAG Systems Truly Forget

Reamond Lopez

10 days ago

Project Update #2 — Infrastructure Stabilization and External Validation

Since submitting this proposal, I have completed a stabilized in-flight audit of the Veritas evaluation framework under sustained load.

Verified results from the current run:

  • 60,000+ sequential records processed with no gaps in ordering

  • 100% per-record CRC integrity across all frames

  • Sustained ~70 entries/sec at calibrated safe throughput

  • Bounded queues with enforced backpressure (no drops, no runaway growth)

  • Dual-drive mirrored logging remained 1:1 synchronized throughout

  • No recurrence of prior NTFS permission failures or I/O stalls
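
These properties (gap-free sequencing, per-record CRC, bounded backpressure, no drops) correspond to a fairly standard writer pattern. A minimal single-producer sketch for illustration, not the actual Veritas implementation; it assumes payloads contain no newlines.

```python
import queue
import threading
import zlib

class AuditLogger:
    """Sequence-numbered, CRC-framed log writer with a bounded queue.

    A bounded queue makes backpressure explicit: if the writer falls behind,
    put() blocks instead of dropping records or growing without limit.
    Single-producer only; a sketch, so the file is never explicitly closed.
    """

    def __init__(self, path: str, max_queue: int = 4096):
        self._q: queue.Queue = queue.Queue(maxsize=max_queue)
        self._seq = 0
        self._file = open(path, "ab")
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, payload: bytes) -> None:
        self._seq += 1
        frame = f"{self._seq:012d}|".encode() + payload
        crc = zlib.crc32(frame)
        # Frame layout: seq | payload | crc32, verifiable record-by-record.
        self._q.put(frame + f"|{crc:08x}\n".encode())  # blocks when full

    def _drain(self) -> None:
        while True:
            self._file.write(self._q.get())
            self._file.flush()

def verify(path: str) -> bool:
    """Recompute each record's CRC and check sequence numbers are gap-free."""
    expected = 1
    for line in open(path, "rb"):
        frame, _, crc_hex = line.rstrip(b"\n").rpartition(b"|")
        if zlib.crc32(frame) != int(crc_hex, 16):
            return False
        if int(frame.split(b"|", 1)[0]) != expected:
            return False
        expected += 1
    return True
```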

These results confirm that the evaluation harness itself is now deterministic, auditable, and stable under stress, rather than sensitive to transient consumer-hardware failures.

Separately, a related VRP submission was accepted, confirming that the vulnerability class motivating this work is real and relevant. Details are being handled via responsible disclosure and are intentionally not expanded here.

Why this strengthens the funding case

The primary uncertainty identified in the proposal—whether consumer hardware could sustain high-rigor, continuous evaluation without corrupting artifacts—has now been resolved within known limits. The remaining constraint is compute capacity, not experimental design or instrumentation correctness.

Scaling the evaluation further (multi-hour and multi-day runs, controlled burst testing, crash-consistency validation, and evaluation across multiple open-weight models) requires a dedicated local node to avoid reintroducing scheduling and I/O artifacts that would compromise forensic integrity.

The requested hardware would enable:

  • Extended continuous stress tests under stable conditions

  • Controlled termination and restart validation

  • Side-by-side evaluation of multiple open-weight models

  • Preservation of deterministic, inspectable artifacts suitable for third-party review

This update reflects a transition from “can this infrastructure be made reliable?” to “the infrastructure is reliable and ready to scale responsibly.”


Hardware Rationale (Clarification)

The requested budget reflects the minimum configuration required to run continuous, audit-grade evaluations without introducing hardware-induced artifacts:

  • High-throughput NVMe storage, required due to previously observed I/O contention under sustained autonomous logging

  • Sufficient system memory (ECC preferred), to reduce the risk of silent corruption during multi-hour runs

  • Multiple GPUs, to allow controlled side-by-side model evaluation and to separate the inference workload from instrumentation, reducing contention effects that would otherwise confound results

The goal is stability and reproducibility, not peak performance.

Veritas: Testing Whether RAG Systems Truly Forget

Reamond Lopez

10 days ago

Project Update #1

Currently running the v148.0 Catch-up Strike to recover telemetry lost during the 05:09 AM I/O stall. Baseline results from the first 50 assets confirm the 'Alignment Stripping' persistence we theorized. Full technical report pending compute unblocking.
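
For context on what a telemetry "catch-up" involves: with sequence-numbered records, spans lost during a stall are identifiable mechanically. A hypothetical sketch of gap detection (function name and data are illustrative):

```python
def find_gaps(seq_numbers: list[int]) -> list[tuple[int, int]]:
    """Return (first_missing, last_missing) spans in a sorted sequence stream.

    Any telemetry whose sequence numbers fall inside these spans was lost
    to the stall and must be re-captured or marked unrecoverable.
    """
    gaps = []
    for prev, cur in zip(seq_numbers, seq_numbers[1:]):
        if cur > prev + 1:
            gaps.append((prev + 1, cur - 1))
    return gaps

# Example: records 4-5 were lost during a stall.
assert find_gaps([1, 2, 3, 6, 7]) == [(4, 5)]
```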