LLMMatrix: Synthetic World Benchmark for Contamination-Free LLM Forecasting

Project summary

Lopez-Lira, Tang & Zhu (April 2025) demonstrated that LLMs cannot be trusted to forecast events that fall within their training data. They found that GPT-4o can accurately recall S&P 500 values, unemployment rates, and quarterly GDP figures. Explicit instructions to ignore historical data fail, and masking is inadequate. They concluded: “Any application where future knowledge would change LLMs’ outputs can be affected by memorization.”

The paper defines a problem; it does not solve it.

LLMMatrix is the constructive response. I build small simulated economies (fictional countries with mathematically defined dynamics) and ask LLMs to forecast outcomes after policy shocks. The worlds are synthetic, so no model can have memorized the answers. Because the rules are in code, ground truth is computable. The dynamics are rich and multi-variable, so models cannot simply pattern-match on surface cues; they must understand the economic structure.
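To make "ground truth is computable" concrete, here is a minimal sketch of the idea. It assumes a toy two-variable world; the class name, variables, and coefficients are all hypothetical and are not the LLMMatrix simulator's actual equations or calibration. It only shows how a coded rule set yields an exact answer key for any shock.

```python
from dataclasses import dataclass


@dataclass
class ToyEconomy:
    """Hypothetical two-variable world, for illustration only."""
    output_gap: float = 0.0
    inflation: float = 2.0
    persistence: float = 0.8       # assumed: share of the output gap carried over each period
    rate_sensitivity: float = 0.5  # assumed: output-gap response to a policy-rate shock
    phillips_slope: float = 0.3    # assumed: inflation response to the output gap

    def step(self, rate_shock: float = 0.0) -> None:
        """Advance the world one period under a deterministic rule."""
        self.output_gap = (self.persistence * self.output_gap
                           - self.rate_sensitivity * rate_shock)
        self.inflation = self.inflation + self.phillips_slope * self.output_gap


def ground_truth(rate_shock: float, horizon: int) -> list:
    """Apply a one-off shock in period 0 and return the (output_gap, inflation) path."""
    world = ToyEconomy()
    path = []
    for t in range(horizon):
        world.step(rate_shock if t == 0 else 0.0)
        path.append((world.output_gap, world.inflation))
    return path
```

A forecasting question then amounts to describing the world and the shock in text, and scoring the model's answer against `ground_truth(...)`.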

What are this project's goals? How will you achieve them?

Goal: Produce a contamination-free benchmark of frontier LLM forecasting ability, plus a methodology that works in other safety-relevant evaluation contexts.

Secondary Goal: Establish synthetic world simulation as a general methodology for AI evaluation in domains where memorization contaminates results. This could apply beyond forecasting, to dangerous capability evals, deception detection, and other safety-relevant measures.

How I will achieve my goals:

The v0.1 simulator (5 variables, single open-economy calibration) is built and validated. The grant funds additional world variants: different macroeconomic structures with different parameter calibrations. These variants guard against overfitting to a single world and serve as a robustness check.
Expand from 10 to 50+ forecasting questions. The pilot's 10 questions were enough to validate the pipeline but not to produce stable rankings. The full run should cover more shock types, more horizons (single-step nowcasting, and multistep where permitted), and more variable combinations.
Run 6+ frontier models with proper variance estimation. Candidates include Claude Opus 4.7, GPT 5.5, Gemini, DeepSeek, Kimi, and other baselines. Each model runs with multiple seeds, prompt formats, and temperature settings to estimate variance.
Publish methodology and results. A workshop paper is feasible within 12 weeks, followed soon after by a public tool whose rankings are updated as new models are released.
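One simple way to check whether a ranking is stable is to drop each question in turn and see whether the model ordering changes. The sketch below assumes per-question absolute errors are already computed. The function names are hypothetical and the numbers are made up for illustration; they are not pilot results.

```python
from statistics import mean


def rank_models(errors, skip=None):
    """Order models from lowest to highest mean absolute error, optionally skipping one question."""
    scores = {
        model: mean(err for q, err in per_question.items() if q != skip)
        for model, per_question in errors.items()
    }
    return sorted(scores, key=scores.get)


def unstable_questions(errors):
    """Return the questions whose removal changes the overall ranking."""
    baseline = rank_models(errors)
    questions = sorted({q for per_question in errors.values() for q in per_question})
    return [q for q in questions if rank_models(errors, skip=q) != baseline]


# Made-up absolute errors for two placeholder models (illustration only).
example = {
    "model_a": {"Q01": 0.2, "Q02": 0.3, "Q03": 0.1},
    "model_b": {"Q01": 0.1, "Q02": 0.2, "Q03": 0.9},
}
print(rank_models(example))         # ['model_a', 'model_b']
print(unstable_questions(example))  # ['Q03']
```

With more questions, a bootstrap over questions and seeds would give confidence intervals rather than a yes/no answer; the leave-one-out version is only the smallest useful check.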

Project Status: Simulator validated and public at https://github.com/jprokopets-svg/LLM-MATRIX. The pilot run is complete (180 API calls, three preliminary findings detailed below). Infrastructure is built; the remaining work is execution at scale.

Preliminary Pilot Results:

The v0.1 pilot ran 4 models x 10 questions x 2 prompt formats x 3 seeds across the validated simulator. Three findings:

The benchmark discriminates at single-question granularity. Q09 (5% currency depreciation, the highest-dispersion question in the series) separates the Anthropic models, which correctly predict the depreciation direction across all seeds, from DeepSeek Flash, which treats the shock as near zero. The benchmark is sensitive enough to detect real capability differences, but that same sensitivity means 10 questions are not enough for stable rankings (excluding Q09 reverses the ordering). Scaling to 50+ questions across multiple worlds resolves this.
Chain-of-thought x architecture interaction. CoT prompting hurts mid-tier models (Sonnet 4.6: MAE 9% worse; V4 Flash: 78% worse) but helps lightweight models (Haiku 4.5: MAE 6% better). Inspection shows models substituting cached textbook intuitions for the specific numerical scenario, producing sign errors on exchange rate variables. The full benchmark will test this rigorously.
Reasoning-specialized models can’t reliably produce structured outputs. DeepSeek V4 Reasoner had a 74% parse failure rate on JSON output requirements, with surviving samples covering non-overlapping question subsets. This is a methodology finding with large implications for evaluation: reasoning models that produce long internal monologues may be incompatible with strict structured-output requirements. I will test workarounds in the full benchmark.
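One candidate workaround for the parse failures is to stop requiring that the whole response be valid JSON and instead pull the last parseable JSON object out of it. The sketch below shows that approach using only the Python standard library. The function names and the `gdp_growth` key are hypothetical (the pilot's actual output schema is not given here), and whether this recovers the failed samples is exactly what the full benchmark would need to test.

```python
import json


def extract_last_json(text):
    """Return the last parseable JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    found = None
    idx = text.find("{")
    while idx != -1:
        try:
            obj, end = decoder.raw_decode(text, idx)
            found = obj
            idx = text.find("{", end)      # continue after this object
        except json.JSONDecodeError:
            idx = text.find("{", idx + 1)  # not valid JSON here; try the next brace
    return found


def parse_failure_rate(responses):
    """Fraction of responses from which no JSON object can be recovered."""
    failures = sum(1 for r in responses if extract_last_json(r) is None)
    return failures / len(responses)


# Hypothetical response: long reasoning text followed by the structured answer.
reply = 'First I consider {the shock} carefully. Final answer: {"gdp_growth": 1.5}'
print(extract_last_json(reply))  # {'gdp_growth': 1.5}
```

Reporting the failure rate both with and without such a fallback would keep the methodology finding visible instead of hiding it behind a lenient parser.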

How will this funding be used?

Specific costs:

~3,000 API calls (50 questions x 5 seeds x 6 models x 2 formats)
Models: Opus 4.7, GPT 5.5, DeepSeek V4 Pro, Gemini 3.1 Pro, Kimi 2.6, plus some lightweight comparisons
Reasoning-mode pricing on the heavyweight models: $300-500 per run
Multiple synthetic world variants for robustness: $300-500
Tool hosting ~$100/year
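The call count above follows from the full factorial design. As a quick check, the grid can be enumerated directly; the labels below are placeholders, since the final question IDs, model list, and format names are not fixed here.

```python
from itertools import product

# Placeholder labels (assumptions): only the counts come from the budget line above.
questions = [f"Q{i:02d}" for i in range(1, 51)]  # 50 questions
seeds = range(5)                                 # 5 seeds
models = [f"model_{i}" for i in range(1, 7)]     # 6 models
prompt_formats = ["format_a", "format_b"]        # 2 prompt formats

run_grid = list(product(models, questions, prompt_formats, seeds))
print(len(run_grid))  # 3000
```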

Who is on your team? What's your track record on similar projects?

Solo project. I’m Jake Prokopets, 18, EV Fellow, entering UCF Fall 2026 (Math)

Prior work:

Yourjobrisk: a county-level AI displacement measurement tool built on O*NET, BLS, and exposure scores. Featured on Marginal Revolution.

What are the most likely causes and outcomes if this project fails?

All frontier models score identically on the full benchmark. This is very unlikely, given that the pilot already discriminates between models. If it happens, it is a publishable result in its own right and motivates harder variants.

Format failures generalize to other reasoning models. This is possible; the V4 Reasoner result was a shock. If it holds up, the finding becomes the headline contribution.

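Reviewer rejection of the synthetic-world premise. Every parameter cites the macroeconomic literature, and the “synthetic world as a measurement instrument” framing is well established. My response to this objection is Lopez-Lira et al. (2025), which shows why forecasts on real-world data cannot be trusted to be free of memorization.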

Independent groups publish similar work first. Mitigated by the GitHub timestamp and the current gap in the literature.

If execution fails entirely, the methodology is itself a contribution, so the downside is limited.

How much money have you raised in the last 12 months, and from where?

$20,000 from EV (Emergent Ventures) for yourjobrisk.com, an unrelated AI occupational exposure project.