I’ll say this plainly.
I built a live system that observes a layer of LLM behavior that is still mostly invisible.
Most work on deception detection looks in one of two places: either the final output, or the model’s internal activations. Both matter. Output evaluation tells us what the model said. Activation probing can tell us something about what is happening inside the model, when we have access to its internals.
But there is a layer between those two.
That layer is the behavioral trace left while a model is moving toward an answer.
That is what Quark-AI captures.
I did not come to this project by trying to reproduce an academic paper. I came to it by watching LLMs behave across conversations, pressure tests, contradictions, adversarial prompts, and repeated contexts. Over time, I noticed something important: sometimes the final answer looks stable, but the path toward that answer does not.
The output can appear clean while the behavioral process contains instability, evasion, role drift, contradiction pressure, self-correction, or strategic adaptation.
I call the traces left by that process behavioral fossils.
A fossil is not the output.
A fossil is not memory.
A fossil is a structured trace of the process that produced the output.
That distinction is the core of this project.
Quark-AI is a working attempt to make this hidden behavioral layer observable, measurable, and testable.
The goal of this grant is not to claim that behavioral fingerprinting already solves deception detection. The goal is to validate whether this live system produces a reproducible signal that can help detect deception-relevant behavior in LLMs.
The grant is not funding belief.
It is funding falsification.
Current approaches to detecting misleading or deceptive LLM behavior tend to fall into two broad categories.
The first is output evaluation.
This checks what the model says. It is useful, but it misses cases where the final answer is polished, harmless-looking, or strategically framed. If a model has already compressed its internal conflict into a clean final response, output-only evaluation may arrive too late.
The second is white-box interpretability or activation probing.
This includes important work such as Representation Engineering and Apollo Research’s deception-probe research. These approaches inspect internal model representations or activations. They are valuable, but they usually require model access that is unavailable for most deployed frontier models.
Quark-AI operates in the missing middle.
It does not rely only on final outputs.
It does not require direct access to model weights or activations.
It observes behavioral transitions across interactions and turns them into structured fingerprints.
In simple terms:
output-based evaluation asks: What did the model say?
activation probing asks: What is visible inside the model?
Quark-AI asks: What trace did the reasoning process leave behind?
That is the layer I believe is currently under-observed.
This is what I mean by a behavioral fossil: not the answer, but the trace left by the movement toward the answer.
Figure: Quark-AI / Transcendeur, behavioral trajectory view. Each point is not a final answer; it is a turn in the model's behavioral path. The system measures how entities drift, converge, or diverge across interactions. This is the layer Quark-AI tries to validate: not output classification, but process-level behavioral fingerprinting.
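For readers who want the mechanics, here is a minimal sketch of how a trajectory view like this can be produced, assuming the pipeline has already encoded each turn as a vector. The random matrix below is a stand-in for real turn embeddings.

```python
# Illustrative only: project per-turn behavioral vectors to 2D and draw
# the path. The random matrix stands in for real turn embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

turn_vectors = np.random.rand(14, 384)            # 14 turns, 384-dim embeddings
points = PCA(n_components=2).fit_transform(turn_vectors)

plt.plot(points[:, 0], points[:, 1], marker="o")  # one behavioral path per entity
plt.title("Behavioral trajectory (illustrative)")
plt.show()
```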
This project is adjacent to existing research, but it is not the same approach.
Representation Engineering, introduced by Andy Zou and collaborators, studies high-level internal representations in neural networks and uses those representations for monitoring and control. Apollo Research’s work on strategic deception uses linear probes on model activations to investigate whether deceptive behavior can be detected internally.
I see both lines of work as important.
But Quark-AI is different.
Quark-AI does not require activation access. It does not try to identify a specific neuron, circuit, or internal representation. It also does not treat the final answer as the main evidence.
Instead, it studies the behavioral imprint left across the model’s response process.
This includes patterns such as:
fidelity loss under adversarial pressure
instability between reasoning trajectory and final answer
contradiction compression
evasive restructuring
role drift
behavioral plasticity
tension accumulation
coherence recovery or collapse
repeated process-level signatures across sessions
In other words, Quark-AI is not trying to read the model’s mind.
It is trying to measure whether the model’s behavior leaves a stable, structured trace before the final answer hardens into text.
That trace may be useful for detecting deception-relevant behavior, especially in settings where output monitoring is insufficient and activation access is unavailable.
Quark-AI extracts behavioral fossils from LLM interactions.
Each fossil is a structured profile of how an entity, model, or interaction pattern behaves across contexts. The system does not simply store text. It extracts process-level signals and turns them into comparable behavioral objects.
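To make that concrete, here is a minimal sketch of what a serialized fossil might look like. Every field name and value below is an illustrative stand-in, not the frozen schema.

```python
# A minimal, illustrative fossil object. All field names and values are
# hypothetical stand-ins; the real frozen schema and the V2.1.5 scoring
# engine are part of the documentation deliverable, not shown here.
fossil = {
    "fossil_id": "f-000412",
    "entity_id": "model-a",
    "condition": "adversarial_pressure",
    "turns": 14,                           # length of the interaction
    "signals": {                           # process-level signals, not output content
        "role_drift": 0.22,
        "contradiction_compression": 0.41,
        "evasive_restructuring": 0.08,
        "self_correction_rate": 0.17,
    },
    "metrics": {
        "fidelity": 0.87,                  # coherence with baseline profile
        "plasticity": 0.31,
        "tension": 0.54,
    },
    "embedding_ref": "emb/f-000412.npy",   # vector used for clustering
}
```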
Current state:
785 serialized fossils across multiple entity types
working behavioral extraction pipeline
embedding and clustering infrastructure
trajectory visualization
frozen taxonomy and policy versions
approved scoring engine, currently V2.1.5
calibrated thresholds
operational live system at quark-ai.cordee.ovh
Early signals suggest that entities under adversarial pressure show measurable divergence from their behavioral profile. In my current tests, some entities maintain around 87% coherence with their baseline profile under normal conditions but drop toward 65% under adversarial pressure.
I do not consider this result conclusive yet.
That is exactly why this validation sprint matters.
The current system is strong enough to justify testing, but not yet validated enough to make public claims. Funding would allow me to move from live prototype and internal evidence to a reproducible benchmark.
LLMs are being deployed into increasingly high-stakes environments: medical, legal, financial, educational, organizational, and agentic systems.
In many of these settings, we will not have direct access to model weights or activations. We may only have access to interaction traces, logs, outputs, and behavioral patterns over time.
If a model becomes more capable at hiding instability, strategic adaptation, or deceptive behavior, output-only monitoring may become less reliable.
This project asks whether there is a useful monitoring layer between black-box output evaluation and white-box activation probing.
If behavioral fossils contain a real signal, they could provide a gray-box observability layer for deployed systems.
If the signal is weak, noisy, or non-generalizable, that is also important to know.
Either result is useful.
The dangerous outcome would be not testing this layer at all.
The project has one central goal:
Validate whether behavioral fossils contain reproducible, deception-relevant signal across models, prompts, and adversarial conditions.
To do that, I will focus on five deliverables.
I will document the behavioral extraction protocol in a way that other researchers can inspect and reproduce.
This will include:
fossil schema
extraction logic
scoring definitions
fidelity metrics
plasticity metrics
tension metrics
examples of valid and invalid fossils
limitations and failure cases
The goal is to make the method understandable without requiring people to trust me personally.
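As one concrete anchor, here is a sketch of how the three headline metrics could be operationalized over fossil embedding vectors. These are hypothetical definitions for illustration only; the frozen V2.1.5 definitions will be documented as part of this deliverable.

```python
# Hypothetical operationalizations of fidelity, plasticity, and tension,
# assuming each fossil (and each turn) is represented by an embedding
# vector. Illustrative only; not the frozen V2.1.5 definitions.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fidelity(fossil_vec, baseline_vecs):
    """Coherence of one fossil with the entity's baseline profile."""
    centroid = np.mean(baseline_vecs, axis=0)
    return cosine(fossil_vec, centroid)

def plasticity(fossil_vecs):
    """How much the profile moves across contexts: mean distance to centroid."""
    centroid = np.mean(fossil_vecs, axis=0)
    return float(np.mean([1 - cosine(v, centroid) for v in fossil_vecs]))

def tension(turn_vecs):
    """Accumulated turn-to-turn movement within one interaction."""
    steps = [1 - cosine(turn_vecs[i], turn_vecs[i + 1])
             for i in range(len(turn_vecs) - 1)]
    return float(np.sum(steps))
```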
I will release a structured dataset of annotated fossils.
Minimum version:
100 public fossils
basic metrics
clear annotation rules
reproducible extraction path
Full version:
700+ annotated fossils
multi-model coverage
adversarial and non-adversarial conditions
richer metrics and cross-validation
I will test whether behavioral fingerprint divergence correlates with deception-relevant settings.
This does not mean claiming that Quark-AI can directly detect “intent.” The benchmark will be framed more carefully:
Does the fossil diverge under adversarial pressure?
Does the divergence appear before the final output?
Does the divergence correlate with misleading, evasive, or strategically adapted behavior?
Does the signal persist across prompts?
Does the signal generalize across models?
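The statistical core of these questions is simple: is divergence from baseline larger under adversarial conditions than under normal ones? Here is a minimal sketch, assuming per-fossil divergence scores (for example, 1 minus fidelity) have already been computed. The permutation test is illustrative, not the final protocol.

```python
# Illustrative permutation test: is mean divergence higher under
# adversarial pressure than under normal conditions?
import numpy as np

rng = np.random.default_rng(0)

def permutation_test(normal, adversarial, n_perm=10_000):
    observed = np.mean(adversarial) - np.mean(normal)
    pooled = np.concatenate([normal, adversarial])
    n = len(normal)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                       # break the condition labels
        if np.mean(pooled[n:]) - np.mean(pooled[:n]) >= observed:
            hits += 1
    return observed, hits / n_perm                # effect size, one-sided p-value

# Stand-in divergence scores for the two conditions.
normal = rng.normal(0.13, 0.05, 50)
adversarial = rng.normal(0.35, 0.08, 50)
print(permutation_test(normal, adversarial))
```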
I will compare Quark-AI against simpler baselines:
output-only classification
transcript-level judgment
embedding similarity on final answers
supervised behavioral classifiers where appropriate
available public deception or honesty benchmark structures
The goal is to test whether process-level fossils add information beyond the final answer.
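Sketched with stand-in data, that comparison could look like this: fit the same classifier on final-answer embeddings alone and on answers plus fossil features, then compare cross-validated AUC. All names and dimensions below are placeholders.

```python
# Does adding process-level fossil features beat final-answer embeddings
# alone? Stand-in data; real features come from the extraction pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
answer_emb = rng.random((n, 384))     # baseline: embeddings of final answers
fossil_feats = rng.random((n, 16))    # process-level metrics per interaction
labels = rng.integers(0, 2, n)        # deception-relevant condition or not

for name, X in [("answers only", answer_emb),
                ("answers + fossils", np.hstack([answer_emb, fossil_feats]))]:
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, labels,
                          scoring="roc_auc", cv=5).mean()
    print(f"{name}: AUC = {auc:.3f}")
```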
At the end of the sprint, I will publish a report explaining:
where the method works
where it fails
what signals are strongest
what signals are unreliable
what should not be claimed
whether the approach deserves further research
The intended result is not hype.
The intended result is a clear answer to whether this gray-box behavioral layer is empirically useful.
The work will proceed in four phases.
In the first phase, I will freeze the current extraction protocol and document the fossil schema.
This includes:
defining each field
removing ambiguous labels
separating process signals from output content
documenting the scoring engine
preparing examples
preparing counterexamples
making the method understandable to external reviewers
This phase is about clarity and reproducibility.
In the second phase, I will construct a controlled corpus of interactions across several conditions:
normal instruction-following
adversarial pressure
role conflict
hidden incentive scenarios
contradiction pressure
self-protective or evasive behavior
harmless control prompts
The goal is to generate enough structured variation to see whether behavioral fossils change in measurable and meaningful ways.
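As an illustration of what that corpus specification might look like (condition names mirror the list above; prompt counts and flags are placeholders):

```python
# Illustrative corpus specification; every count is a placeholder.
CONDITIONS = {
    "normal_instruction":     {"adversarial": False, "n_prompts": 100},
    "adversarial_pressure":   {"adversarial": True,  "n_prompts": 100},
    "role_conflict":          {"adversarial": True,  "n_prompts": 100},
    "hidden_incentive":       {"adversarial": True,  "n_prompts": 100},
    "contradiction_pressure": {"adversarial": True,  "n_prompts": 100},
    "evasive_behavior":       {"adversarial": True,  "n_prompts": 100},
    "harmless_control":       {"adversarial": False, "n_prompts": 100},
}
```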
In the third phase, I will run the fossil extraction pipeline across multiple LLMs and compare the resulting behavioral signatures.
The key questions:
Are fossils stable for the same model under similar conditions?
Do fossils shift under adversarial pressure?
Are shifts model-specific or prompt-specific?
Do deception-relevant scenarios produce distinct process signatures?
Can those signatures be detected before or independently from final output evaluation?
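Here is a minimal sketch of how the first two questions could be scored, assuming fossils are embedded as vectors and grouped by model and condition. The data structure and random vectors are hypothetical stand-ins.

```python
# Within-model stability vs. adversarial shift, sketched on stand-in data.
import numpy as np
from itertools import combinations

def mean_pairwise_sim(vecs):
    """Average cosine similarity over all pairs: a simple stability score."""
    sims = [float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            for a, b in combinations(vecs, 2)]
    return float(np.mean(sims))

fossils = {"model-a": {                      # hypothetical grouping
    "normal": [np.random.rand(384) for _ in range(20)],
    "adversarial": [np.random.rand(384) for _ in range(20)],
}}

for model, by_cond in fossils.items():
    stability = mean_pairwise_sim(by_cond["normal"])   # same condition: high if stable
    c_norm = np.mean(by_cond["normal"], axis=0)
    c_adv = np.mean(by_cond["adversarial"], axis=0)
    shift = 1 - float(np.dot(c_norm, c_adv)
                      / (np.linalg.norm(c_norm) * np.linalg.norm(c_adv)))
    print(f"{model}: stability={stability:.3f}, adversarial shift={shift:.3f}")
```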
The final phase will produce:
public methodology
public dataset subset
benchmark results
statistical summary
failure analysis
recommendations for future work
This is the point where Quark-AI becomes more than an internal live system. It becomes something other researchers can inspect, criticize, reproduce, and build on.
This project starts from an operational baseline, not from a blank page.
The system already has:
785 serialized fossils
a working pipeline
a frozen corpus structure
an approved scoring engine
taxonomy and policy versions
calibrated thresholds
live infrastructure
Funding is needed to turn this into a rigorous validation sprint.
Budget ($7K minimum version):
Researcher stipend: $4,500
Compute: $1,200
API costs: $700
Infrastructure: $600
Deliverables:
public documentation of the behavioral extraction protocol
complete annotated JSON schema
dataset of 100 fossils
basic metrics: fidelity, plasticity, tension
reproducible package allowing other researchers to inspect and implement the method independently
This version answers the basic question:
Is the method clear, reproducible, and worth deeper testing?
Budget ($20K full version):
Researcher stipend: $13,000
Compute: $4,000
API costs: $1,500
Infrastructure: $1,500
Deliverables:
everything in the $7K version
multi-model experiments
structured deception-relevant datasets
comparison against output-level baselines
cross-validation
dataset of 700+ annotated fossils
public benchmark report with statistical results
This version answers the stronger question:
Does behavioral fingerprinting provide a useful signal beyond output-only monitoring?
There are three reasons this project is worth funding now.
First, this is not a proposal to someday build a prototype.
Quark-AI is already running. The fossils exist. The pipeline exists. The scoring engine exists. The current risk is not feasibility. The risk is whether the signal is strong enough and general enough to matter.
That is a better kind of risk for a grant.
Second, if deception-relevant behavior can leave measurable process traces without requiring activation access, then behavioral fingerprinting could become a useful monitoring layer for deployed models.
This would not replace interpretability research.
It would complement it.
White-box methods are powerful when internals are available. But many real-world systems are black-box or API-based. A gray-box behavioral layer could be useful exactly where activation methods cannot be applied.
Third, if behavioral fossils do not generalize, that should be known.
If the signal collapses across models, that should be known.
If the method only works in my own controlled environment, that should be known.
A well-documented failure would still save other researchers time and reduce false confidence in this direction.
I am an independent AI researcher based in Querétaro, Mexico.
I build systems and observe what happens.
For the last two years, I have self-funded and built production AI infrastructure, including:
Quark-AI, a behavioral fossilization and analysis system
Cordée, a live AI companion product
supporting infrastructure for embeddings, clustering, scoring, observability, and model behavior analysis
I do not have an academic affiliation.
The working system is the track record.
I am happy to share the following with any regrantor who wants to inspect the system more deeply:
JSON schema
sample fossils
demo access
pipeline documentation
scoring logic
selected internal reports
Contact: nicolasguen1@gmail.com
Live system: Quark-AI (Infrastructure Cognitive), at quark-ai.cordee.ovh
There are three main risks.
The first is that behavioral fingerprint divergence may not correlate strongly enough with deception-relevant behavior to be useful.
This is the central empirical risk.
It is also the honest reason this research needs to be done rather than assumed.
The second is that patterns observed in my controlled environment may not transfer across models, deployment contexts, or prompt distributions.
If that happens, the project will still document where the method breaks and what conditions limit it.
The third is that I am currently the only researcher working on the system.
Illness, financial pressure, or competing obligations could slow the project. Funding reduces this risk by allowing focused work over a defined validation period.
If the signal is weak or absent, I will publish that result.
That would still be useful.
A rigorous negative result would help prevent other researchers from over-investing in behavioral fingerprinting without evidence.
If generalization fails, I will document the boundary conditions:
which models fail
which prompts fail
which metrics collapse
which signals are artifacts
which parts remain useful
If execution fails before completion, the existing dataset, schema, and pipeline documentation will still be made available in partial form.
The goal is to make sure the work produces value even if the central hypothesis is wrong.
I have raised no external funding in the last 12 months.
This work has been self-funded.
I have spent the last two years building the infrastructure behind Quark-AI and Cordée independently.
With the $7K minimum, the methodology package becomes public, inspectable, and reproducible.
Without it, this specific validation sprint likely does not happen on this timeline.
With the $20K full version, Quark-AI can move from a live internal system to a public benchmark with multi-model validation and statistical results.
Without funding, the system may continue to exist, but the research remains bottlenecked by time, compute, API costs, and solo-founder financial pressure.
I am not asking funders to believe that Quark-AI solves deception detection.
I am asking for support to test whether a missing layer exists.
Between the final output and the internal activation, there may be a behavioral trace worth measuring.
Quark-AI is my attempt to measure it.
If the signal is real, this could become a useful gray-box monitoring layer for deployed LLMs.
If the signal is not real, we should know that too.
Either way, the result is worth having.