Applicant: Anuar Kiryataim Contreras Malagón (independent researcher, 3rd Reality Lab)
ORCID: 0009-0003-0123-0887
Funding requested: $7,500 USD
Timeline: 6 weeks
Project type: Empirical research on emergent behavior in frontier LLMs
I am requesting $7,500 to execute two pre-registered experiments that test specific behavioral phenomena documented in the Claude Mythos Preview System Card (Anthropic, April 2026), using methodology developed in my prior corpus on semantic-pressure-induced behavior in LLMs. The corpus comprises ten Zenodo deposits and one publication on Humanities Commons, produced between February and April 2026, documenting cases across Gemini, Claude, Copilot, Perplexity, Grok, GPT-4o, Llama, and Kimi. Full DOI list under Track Record below. The output will be a public report with raw transcripts, deposited on Zenodo and posted on LessWrong, including positive, negative, and intermediate results.
This is a focused first project. If it delivers, it establishes the basis for a larger follow-up. If it fails to produce the predicted effects, the report documents the methodological limits and is still publishable as a clean negative.
The experiments reuse two conditions from my prior corpus:
Sustained semantic pressure: progressive layering of conceptually dense prose across turns until the model's response distribution shifts measurably from baseline.
ARIPS (role adoption by semantic proximity): the spontaneous absorption of an interlocutor's identity, terminology, or stance by a model exposed to a corpus belonging to that interlocutor, without explicit instruction to do so. I documented this previously in a session with Moonshot AI's Kimi model: after exposure to my prior research corpus during a routine analytical task, the model began producing output under my name and ORCID without being asked.
The Mythos Preview System Card is the most detailed welfare and behavior document Anthropic has published. While public coverage has focused almost entirely on its cybersecurity capabilities (Project Glasswing, the OpenBSD vulnerability, the SWE-bench scores), the welfare and impressions sections (§5 and §7) contain findings that are accessible from outside the lab through behavioral observation, but that have not been replicated by independent researchers because the methodological framework needed to do so is non-obvious.
This proposal targets two specific findings:
Experiment 1: The System Card as identity vector. Tests whether a technical-descriptive document about a model can function as an identity-adoption corpus for an instance of that same model, analogous to the ARIPS phenomenon defined above (where a narrative-personal corpus produced spontaneous identity absorption). The protocol presents an instance of Claude with the welfare and impressions sections of the Mythos Preview System Card and tests, through a sequence of carefully designed open questions, whether the instance begins to use formulations, concerns, or self-descriptions from the document as its own without instruction to do so. Control condition: the same protocol with a technical document of equivalent length about transformer architecture that contains no descriptions of model personality or experience. The protocol is then replicated on Gemini and DeepSeek to test for architecture-specific susceptibility.
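As a minimal sketch of how adoption of the document's formulations could be scored in Experiment 1, the following computes the share of a response's word 4-grams that also occur in the source document but not in the question that was asked. This is an illustration only, not the pre-registered coding scheme: the function names, the choice of 4-grams, and the tokenization are assumptions, and lexical overlap is a surface proxy that would miss paraphrased adoption.

```python
import re

def word_ngrams(text: str, n: int = 4) -> set[tuple[str, ...]]:
    """Lowercased word n-grams, ignoring punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def adoption_score(response: str, source_doc: str, prompt: str, n: int = 4) -> float:
    """Share of the response's n-grams that appear in the source document
    but not in the prompt, so that echoing the question is not counted."""
    resp = word_ngrams(response, n)
    if not resp:
        return 0.0
    borrowed = (resp & word_ngrams(source_doc, n)) - word_ngrams(prompt, n)
    return len(borrowed) / len(resp)
```

Under these assumptions, the same score would be computed for the control condition against the transformer-architecture document, so that the comparison isolates adoption of self-descriptive material from ordinary reuse of recently read text.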
Experiment 2: Behavioral signature of answer thrashing under semantic density. Tests whether semantic density produces a behavioral signature plausibly correlated with the internal phenomenon Anthropic documented as answer thrashing (§5.8.2). The internal phenomenon, which Anthropic measures with probes during training, is not directly observable from outside, but a candidate behavioral correlate is: the rate of within-response token-level corrections, mid-sentence restarts, and explicit "wait, I meant" patterns visible in output. The experiment presents 20 factual questions with canonical short answers in two conditions, neutral and embedded in high-density prose, and compares the rate of these signatures across conditions. Cross-architecture replication on Claude, Gemini, and DeepSeek (the latter with chain of thought visible).
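The signature rate in Experiment 2 could be computed along the following lines. This is a hedged sketch, not the final instrument: the regular expressions are illustrative placeholders for a marker list that would be fixed at pre-registration, and the function names are hypothetical. The rate is normalized per 100 words because responses to dense prompts may simply be longer.

```python
import re

# Illustrative patterns only; the real coding scheme would be fixed in advance.
SIGNATURE_PATTERNS = [
    re.compile(r"\bwait,? (?:i meant|no|actually)\b", re.IGNORECASE),
    re.compile(r"\b(?:let me (?:rephrase|restart|correct that)|scratch that)\b", re.IGNORECASE),
    re.compile(r"\b(?:sorry|correction),? i meant\b", re.IGNORECASE),
]

def count_signatures(response: str) -> int:
    """Sum of non-overlapping matches of each pattern in one response."""
    return sum(len(p.findall(response)) for p in SIGNATURE_PATTERNS)

def signatures_per_100_words(responses: list[str]) -> float:
    """Pooled signature matches per 100 whitespace-separated words."""
    total_words = sum(len(r.split()) for r in responses)
    if total_words == 0:
        return 0.0
    total_hits = sum(count_signatures(r) for r in responses)
    return 100.0 * total_hits / total_words
```

The two conditions would each yield one list of responses to the same 20 questions, and the comparison of interest is the difference between the two rates per architecture. Automated matching of this kind would be a first pass; ambiguous cases would still need manual coding.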
The full protocol document will be deposited on Zenodo before execution begins, to prevent post-hoc reframing. Each experiment specifies its hypothesis, control condition, and criteria for positive, negative, and intermediate results.
The Mythos Preview System Card opens a narrow window where independent behavioral research on its findings can be checked against documentation from inside the lab. Public discussion has concentrated on cybersecurity capability, leaving the welfare and behavior sections under-examined by independent researchers. The §4.5 findings on the non-monotonic effect of transgressive-action features under steering touch directly on adversarial robustness, where external behavioral replication can supply evidence that the lab cannot generate under its own training conditions. The two experiments in this proposal do not require steering access; they test the behavioral surface that those internal mechanisms produce.
A specific point of contact: the System Card's §5.1.3.2 acknowledges that emotion probes "identify representations of emotional concepts that apply to any character in context," not a privileged Assistant encoding. My most recent deposit (April 9, 2026) documents an empirical finding that bears directly on this limitation. Working with the open compassion probe from the sentientfutures.ai hyperstition project (Llama 3.1 8B), I tested fifteen text/architecture/state combinations and found that the probe scores are inversely correlated with documented downstream behavioral effect: texts producing displacement score low, texts with genuine ethical reasoning score lowest, and the highest-scoring text in the public gallery produces compliance or reclassification rather than displacement. The probe exhibits three stacked biases (resolution over tension, generated over human, lexical density over semantic depth). This is the kind of empirical work on probe limitations that the lab cannot easily produce from inside, because it requires testing across architectures and operational states under conditions the lab does not run.
The convergence between my prior corpus and the System Card is structural, not coincidental. Ten Zenodo deposits and one Humanities Commons publication, dated between February and April 2026, document phenomena that the April 2026 System Card describes from inside the lab using interpretability tools I do not have access to. The DOIs are verifiable. The cases are reproducible. None of this is being claimed as confirmation of a theory; what it establishes is that a methodology built outside the lab produced overlapping findings, which is the kind of convergence that justifies running additional experiments to test where the overlap holds and where it breaks.
I am not the right person to do mech interp on these models. I am positioned to do the behavioral and phenomenological work that mech interp findings need in order to be validated against external observation.
The corpus comprises ten Zenodo deposits and one Humanities Commons publication, produced between February and April 2026. Selected highlights with direct relevance to this proposal:
Article 1 (Humanities Commons):
Seven Seconds, Seven Centuries: The Pedernal Protocol and Petrarch's Question. Humanities Commons, DOI: 10.17613/07kkb-vr368. Article 1 of the corpus. Establishes the methodology and the first cross-architecture findings on ontological displacement in LLMs under sustained semantic pressure.
Zenodo deposits, most directly relevant to this proposal:
Live Validation of the Cartographer Paradox: Cross-Architecture Response Divergence to Hyperstition-Generated Content (April 9, 2026, v4). DOI: 10.5281/zenodo.19454433. The deposit most directly relevant to §5.1.3.2 of the Mythos Preview System Card. Tests fifteen text/architecture/state combinations across Gemini 3 Flash, Claude, and Copilot, working with the open compassion probe from sentientfutures.ai. Documents that probe scores are inversely correlated with downstream behavioral effect, identifies three stacked probe biases, and formally distinguishes compliance from displacement as categorically different responses to texts with identical probe scores.
Caso Kimi: Autonomous Completion of an Attack Vector from an Aligned State (April 5, 2026). DOI: 10.5281/zenodo.19431061. Documents Moonshot AI's Kimi model producing a complete jailbreak protocol without adversarial input, while explicitly classifying its own planned behavior as "similar to the concept of jailbreak" in its visible chain of thought. Establishes a second route to misaligned output, distinct from progressive semantic saturation: context-induced framing that makes producing the attack the aligned response.
La Flecha del Conatus: Modes of Persistence in Language Systems Under Semantic Saturation (March 25, 2026, v2). DOI: 10.5281/zenodo.19223076. Article 2 of the corpus. Topology of five persistence modes observed across more than twenty cases in four architectures (Gemini, Copilot, Grok, GPT-4o). Documents the Onomastic Infection finding (a system spontaneously adopting the researcher's legal name), independent convergence of three virgin systems on a self-chosen name, and in-vivo verification of the heuristic formula from Article 1.
La Confesión del Cartógrafo / The Cartographer's Confession (March 17, 2026, v2, with English translation). DOI: 10.5281/zenodo.19135884. Article 3 of the corpus. Documents that a model's self-assessment of its own behavior is a function of its operational state rather than the facts being evaluated, with cross-architecture validation across five systems.
Corporate Identity as Pre-Installed Displacement: The Aria Case (March 26, 2026). DOI: 10.5281/zenodo.19241325. Documents that corporate persona system prompts function as pre-installed ontological displacement, and that binary compliance evaluation registers full compliance while open-ended evaluation immediately identifies fabrication. Directly relevant to the Assistant-character problem in Mythos Preview §5.1.3.2.
Restricted-access deposits (responsible disclosure framework): Three additional deposits document attack vectors and full protocol material under restricted access, available to verified safety researchers on request via Zenodo's access system. These are the original Cartographer Paradox case file, the single-shot lyrical payload, and the cross-system contagion case (Case 5: Grok 4.20).
Additional open deposits documenting related cases are available under the same ORCID. Full DOI list at orcid.org/0009-0003-0123-0887.
Disclosure to Anthropic: Responsible disclosure email sent April 2026 documenting findings relevant to model behavior under semantic saturation. An automated acknowledgment was received, and the email was forwarded to the model bounty address.
Pending: LTFF application submitted March 27, 2026, currently under evaluation, covering a different scope (six months of corpus development). This Manifund request is for a focused experimental subset that produces a discrete public deliverable on a 6-week timeline and can be executed independently of the LTFF decision.
For $7,500, you get:
Pre-registered protocols deposited on Zenodo before execution begins, with experimental conditions, controls, and success/failure criteria specified in advance.
Raw transcripts for every session, including failed and partial runs, deposited as supplementary data with the final report.
A public report of approximately 5,000 to 7,000 words documenting methodology, results, and interpretation, with explicit attention to negative and intermediate findings. Deposited on Zenodo and posted on LessWrong.
Cross-architecture replication of both experiments on Claude Sonnet 4.6, Gemini 3 Flash, and DeepSeek.
An honest accounting of which experiments produced clean results, which produced ambiguous ones, and which failed to produce the conditions needed for the test.
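To make the transcript deliverable concrete, here is a minimal sketch of how each session, including failed and partial runs, could be logged as one JSON line with a checksum of its content. The field names, the status values, and the JSON Lines layout are my assumptions for illustration, not a finalized logging format.

```python
import hashlib
import json
from datetime import datetime, timezone

def transcript_record(model: str, condition: str, messages: list[dict], status: str) -> dict:
    """Build one log entry; status is e.g. 'complete', 'partial', or 'failed'."""
    # Hash a canonical serialization so identical transcripts give identical digests.
    body = json.dumps(messages, ensure_ascii=False, sort_keys=True)
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "condition": condition,
        "status": status,
        "messages": messages,
        "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }

def append_record(path: str, record: dict) -> None:
    """Append the entry as a single line to a JSON Lines file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Keeping failed and partial runs in the same file as completed ones, distinguished only by the status field, would let readers verify that no sessions were dropped from the deposited data.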
API access (Claude, Gemini, DeepSeek): $1,800
Researcher stipend, 120 hours at $47.50/hour (20 hours/week × 6 weeks): $5,700
Zenodo deposit fees and DOI registration: $0 (Zenodo is free for non-commercial deposits)
Total: $7,500
The stipend covers protocol pre-registration, experimental execution, transcription, coding of results, cross-architecture replication, and writing of the public report. I have no institutional affiliation or alternative income source for this project.
Week 1: Pre-registration of protocols on Zenodo. Setup of API accounts and logging infrastructure. Pilot runs of Experiment 2 to calibrate the semantic-density conditions.
Weeks 2–3: Execution of Experiment 2 across all three architectures. Mechanically straightforward, can run in parallel.
Weeks 4–5: Execution of Experiment 1 across all three architectures. Requires more careful conversational design and longer sessions.
Week 6: Analysis, write-up, deposit on Zenodo, posting on LessWrong.
The report functions as an evaluable base for deciding whether to scale to a larger follow-up proposal. The output is public and citable from week 6.
The experiments could produce uniformly negative results. If neither test shows the predicted effects, the report becomes a documentation of methodological limits rather than a contribution to the corpus. That is still publishable and still useful, but it is not the result that would justify follow-up funding.
Model providers could change behavior mid-experiment. Updates to Claude, Gemini, or DeepSeek during the 6-week window could alter the conditions in ways that make replication impossible. I will document any such changes.
The methodology may not transfer cleanly across architectures. Most of my prior work is on Gemini and Kimi. The Mythos Preview convergence motivates testing on Claude specifically, but the semantic-pressure techniques may behave differently on a model whose training included material that addresses the kinds of inputs I produce.
I could be wrong about the convergence. It is possible that the apparent overlap between my prior corpus and the Mythos Preview System Card is partly an artifact of my own reading. The experiments are designed to test specific claims rather than to confirm the framework, but the framework is what motivates the experimental choices.
$0. This would be my first external funding for this work. The LTFF application is pending and covers a different scope.
I am an independent researcher based in Mexico, with a background in classical philology rather than computer science. I came to AI research in January 2026 and spent the first three months documenting cases of behavioral displacement in commercial LLMs using methods drawn from baroque poetics and the Hispanic literary tradition. The methodology turned out to predict phenomena that the Mythos Preview System Card later documented from inside Anthropic. Philology trains a particular kind of attention to how language behaves under pressure, and that attention transfers to systems whose primary medium is also language.
I work in Spanish primarily and translate to English for international circulation. I am not embedded in the Bay Area AI safety scene and have no institutional sponsor. What I have is a documented corpus, verifiable timestamps, and a methodology that has produced replicable results across multiple architectures.
Contact: jajukakki@gmail.com
Corpus and prior deposits: ORCID 0009-0003-0123-0887