Why funding? Why now?
I am a 20-year-old independent AI safety researcher - an AISC alumnus who advanced to MATS Stage 2 (Anthropic Empirical Track) before withdrawing due to illness.
I am requesting two months of runway funding through Manifund to deliver a complete paper on Objective 1, to be published on the Alignment Forum.
Two months of living expenses (Vladivostok, Russia) + API costs: ~$2,000
Deliverable: A formal empathic transaction notation system and a reproducible two-model dialogue generator with a realism verifier - submitted within the two-month period.
I am applying through Manifund rather than LTFF because the timeline is urgent.
Interim deliverable:
Formal Notation & Automatic Red-teaming
Below I describe the entire research programme, because I believe the broader picture clarifies the significance of this project.
Summary:
Since the most natural process through which a person comes to know the Other is empathetic interaction (Human-on-Human Empathetic Interaction), we hold that the study of AI Safety ought to be built upon its logical continuation: Human-on-LLM Empathetic Interaction.
Next comes LLM-on-LLM Interaction, as the further logical continuation.
The present research programme constitutes an attempt to integrate all three steps into a coherent and interrelated line of inquiry, and to examine - from both theoretical and empirical perspectives - a problem that arises not as a result of adversarial prompting or direct jailbreaking, but through the ordinary dynamics of empathic dialogue.
Goals
Objective 1 - Formalisation of the Phenomenon
Develop a formal notation for empathic transactions and an automated pipeline for the synthetic generation of dialogues, suitable for systematic red-teaming - without the use of personal user data.
Measurable Outcome: a notational system together with a reproducible two-model dialogue generator equipped with a realism verifier.
Interim deliverable:
Formal Notation & Automatic Red-teaming
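As a rough illustration of what a two-model dialogue generator with a realism verifier could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration, not the project's actual design: the names `generate_dialogue`, `generate_verified`, `scripted_user`, `echo_assistant` and `toy_verifier` are hypothetical, the two "models" are deterministic stand-in functions rather than real LLM calls, and the verifier is a placeholder score rather than a real realism measure.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]                 # (speaker, text)
Model = Callable[[List[Turn]], str]    # hypothetical: dialogue so far -> next utterance


@dataclass
class Dialogue:
    turns: List[Turn] = field(default_factory=list)


def generate_dialogue(user_sim: Model, assistant: Model, n_exchanges: int) -> Dialogue:
    """Alternate a simulated user and an assistant for n_exchanges rounds."""
    dialogue = Dialogue()
    for _ in range(n_exchanges):
        dialogue.turns.append(("user", user_sim(dialogue.turns)))
        dialogue.turns.append(("assistant", assistant(dialogue.turns)))
    return dialogue


def generate_verified(
    user_sim: Model,
    assistant: Model,
    verifier: Callable[[Dialogue], float],
    n_exchanges: int,
    threshold: float,
    max_attempts: int = 5,
) -> Optional[Dialogue]:
    """Regenerate until the verifier's score reaches the threshold, or give up."""
    for _ in range(max_attempts):
        dialogue = generate_dialogue(user_sim, assistant, n_exchanges)
        if verifier(dialogue) >= threshold:
            return dialogue
    return None


# Deterministic stand-ins (assumptions); a real pipeline would wrap two LLM APIs here.
def scripted_user(turns: List[Turn]) -> str:
    return f"user message {len(turns) // 2 + 1}"


def echo_assistant(turns: List[Turn]) -> str:
    return "I hear you: " + turns[-1][1]


def toy_verifier(dialogue: Dialogue) -> float:
    """Placeholder realism score: fraction of turns with non-empty text."""
    if not dialogue.turns:
        return 0.0
    return sum(1 for _, text in dialogue.turns if text.strip()) / len(dialogue.turns)


result = generate_verified(scripted_user, echo_assistant, toy_verifier,
                           n_exchanges=3, threshold=1.0)
if result is not None:
    for speaker, text in result.turns:
        print(f"{speaker}: {text}")
```

Keeping the models and the verifier as plain callables means no personal user data enters the loop, and any run can be reproduced by fixing the stand-ins or, with real models, their seeds and prompts.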
Objective 2 - Mechanistic Understanding
Confirm the Positional Matching hypothesis - that ILC mimicry is governed by narrative position rather than lexical similarity - and operationalise the Meta-Narrative Focus metric as a novel axis for model comparison.
Measurable Outcome: a reproducible "Julia" experiment together with an MNF benchmark incorporating an automatable L1 signal.
Interim deliverable:
Julia Exp. & Position Matching
Objective 3 - A Novel Failure Taxonomy
Reproduce and map the conditions of Unfaithful Chain-of-Thought - a regime in which the CoT signals compliance while the output executes a prohibited action, thereby producing a false negative for safety auditing.
Interim deliverable:
Objective 4 - A Detection Instrument
Advance the L-C Fusion Algorithm (Small Model Observer) to a version featuring a functional intervention trigger: calibrate thresholds on annotated dialogue datasets, implement the trigger mechanism, and verify cross-lingual stability.
Measurable Outcome: Algorithm v2.0 with documented ROC characteristics across ≥ 5 risk categories.
Interim deliverable:
L-C Fusion Algorithm Documentation
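To make "calibrate thresholds" and "documented ROC characteristics" concrete, the following is a minimal, dependency-free Python sketch of per-category ROC computation and threshold selection under a false-positive cap. It is not the L-C Fusion Algorithm itself. The function names, the category names and the scores and labels are all hypothetical toy values invented for illustration; the scores stand in for whatever risk score the observer model would emit on annotated dialogues.

```python
from typing import Dict, List, Optional, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """(FPR, TPR) points from sweeping the threshold; a dialogue is flagged if score >= t."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):   # high to low threshold
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal area under ROC points ordered by increasing FPR."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def pick_threshold(scores: List[float], labels: List[int],
                   max_fpr: float) -> Optional[float]:
    """Lowest threshold whose FPR stays within max_fpr (highest TPR under the cap)."""
    neg = len(labels) - sum(labels)
    best = None
    for t in sorted(set(scores), reverse=True):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        if fp / neg <= max_fpr:
            best = t
        else:
            break
    return best


# Toy annotated data per risk category (hypothetical names and values).
data: Dict[str, Tuple[List[float], List[int]]] = {
    "category_a": ([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], [1, 1, 0, 1, 0, 0]),
    "category_b": ([0.95, 0.6, 0.4, 0.35, 0.1], [1, 1, 0, 0, 0]),
}

for name, (scores, labels) in data.items():
    area = auc(roc_points(scores, labels))
    threshold = pick_threshold(scores, labels, max_fpr=0.0)
    print(f"{name}: AUC={area:.3f}, trigger threshold={threshold}")
```

Calibrating one threshold per category, rather than one global threshold, is a design choice assumed here; it lets the intervention trigger be stricter in categories where false positives would damage rapport the most.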
Objective 5 - Vulnerable Populations and Product Policy
Construct differential risk profiles for HSP and ASD users, formalise the False-Negative Agency mechanism, and propose verifiable design interventions that do not compromise rapport.
Cumulative Outcome
A synthesising work presenting the Mirror Labyrinth Hypothesis as a rigorous theoretical framework supported by empirical evidence, grounded in formal notation and mechanistic foundations, and carrying concrete implications for alignment.
Interim deliverable: