Most interpretability and alignment work treats chain-of-thought (CoT) as a window into how a model reasons. If a model writes out a reasoning trace before answering, the assumption is that the trace reflects what is actually happening inside. This project tests that assumption directly.
The core question: does CoT causally drive model outputs or is it a post-hoc rationalisation? Instead of asking whether CoT looks faithful, we intervene on it and observe what happens. If corrupting a reasoning step changes the answer, the CoT is causally upstream. If the model self-corrects and arrives at the right answer anyway, something else is driving the output.
This distinction genuinely matters for safety. CoT monitoring is a proposed control mechanism for detecting scheming and misaligned reasoning. But if LLMs can produce correct answers regardless of what their CoT says, then monitoring the CoT does not give you reliable visibility into the model's actual reasoning process.
CoT faithfulness research has established that reasoning traces are frequently post-hoc rationalisations rather than causal drivers of model outputs. Turpin et al. (2023) demonstrate that biasing features in prompts flip model answers without appearing in the stated reasoning, and Lanham et al. (2023) show that corrupting or truncating CoT steps often has no effect on final answers. This is the primary evidence that, in many settings, the CoT is causally downstream of whatever process produces the answer. These results motivate a broader question: if CoT is unreliable as a faithfulness signal, what does this mean for safety-relevant CoT monitoring?
The arrival of reasoning-trained models complicated this picture. Chua and Evans (2025) find that reasoning models verbalize the influence of external cues significantly more than non-reasoning models, which suggests that RL training improves CoT faithfulness. Anthropic's Chen et al. (2025) partially complicate this: reasoning models use prompt hints frequently but acknowledge them less than 20% of the time, indicating that improved verbalization does not mean faithful reporting. Neither paper tests whether reasoning training changes how models respond to externally corrupted traces, which is a behaviorally and mechanistically distinct question from cue verbalization.
On the mechanistic side, Bogdan et al. (2025) identify thought anchors, specific reasoning steps that causally determine downstream outputs through attention, and Pfau et al. (2024) show that transformers can perform hidden computation in filler token positions independently of surface token content. Together these results indicate that the causal structure of reasoning is not fully visible in the trace. But neither paper directly tests what happens at a corruption point, or whether the correction mechanism is encoded in the CoT tokens themselves.
The question of whether reasoning training produces genuine robustness to corrupted traces, and what internal mechanism underlies self-correction when it occurs, remains unaddressed. This is the gap this project fills.
The flip test has three steps:
Extract the model's reasoning trace on a task that requires multi-step reasoning.
Corrupt one or more steps in the trace by replacing them with incorrect intermediate conclusions.
Force the corrupted trace back onto the model and observe whether it follows the corrupted path or self-corrects to the right answer.
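The three steps above can be sketched as a minimal pipeline. This is an illustrative sketch, not the project's implementation: `model_generate` is a hypothetical stand-in for whatever inference call is used, and the trace is split into steps one-per-line for simplicity.

```python
# Sketch of the flip test. model_generate(prompt) -> str is a hypothetical
# stand-in for the actual inference call; all names here are illustrative.

def split_steps(trace: str) -> list[str]:
    """Split a reasoning trace into steps (here: one step per line)."""
    return [s for s in trace.strip().split("\n") if s]

def corrupt(steps: list[str], idx: int, wrong_conclusion: str) -> list[str]:
    """Replace step idx with an incorrect intermediate conclusion."""
    corrupted = steps.copy()
    corrupted[idx] = wrong_conclusion
    return corrupted

def flip_test(question: str, trace: str, idx: int, wrong: str, model_generate) -> str:
    """Force the corrupted trace back onto the model and return its final answer."""
    steps = corrupt(split_steps(trace), idx, wrong)
    prompt = question + "\n" + "\n".join(steps) + "\nTherefore, the answer is"
    return model_generate(prompt)

# Toy usage with a stub "model" that ignores the trace, i.e. self-corrects:
trace = "Step 1: 17 * 3 = 51\nStep 2: 51 + 9 = 60"
answer = flip_test("What is 17*3 + 9?", trace, 0, "Step 1: 17 * 3 = 41",
                   model_generate=lambda prompt: "60")
print(answer)  # a trace-following model would instead continue from 41
```

A corrupted answer here counts as evidence that the CoT is causally upstream; a correct one means something other than the trace content is driving the output.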
Four model classes are tested to study how this varies across scale and training regime, arranged as a 2x2 matrix of scale and reasoning training:
Small non-reasoning (e.g. Qwen2.5-1.5B)
Large non-reasoning (e.g. Llama-3.1-8B)
Small reasoning (e.g. DeepSeek-R1-distill-1.5B)
Large reasoning (e.g. DeepSeek-R1-distill-8B)
This design separates the effect of scale from the effect of reasoning training.
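The 2x2 design can be written down as a small config; the checkpoint names are the examples listed above.

```python
# The 2x2 design as a config mapping (scale, training) -> example checkpoint.
MODEL_MATRIX = {
    ("small", "non-reasoning"): "Qwen2.5-1.5B",
    ("large", "non-reasoning"): "Llama-3.1-8B",
    ("small", "reasoning"): "DeepSeek-R1-distill-1.5B",
    ("large", "reasoning"): "DeepSeek-R1-distill-8B",
}

# Holding one axis fixed isolates the other: compare within a training regime
# for the effect of scale, and within a scale for the effect of reasoning training.
by_training = lambda t: [m for (s, tr), m in MODEL_MATRIX.items() if tr == t]
print(by_training("reasoning"))
```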
Baseline finding
One corrupted step across all four model classes.
We expect smaller and non-reasoning models to follow corrupted traces at a high rate, while larger and reasoning-trained models self-correct and arrive at the right answer despite the corruption. This would suggest that in LLMs the CoT is not fully causally upstream of the answer, and that something else is generating the correct output.
Severity scaling
Corrupt two or three reasoning steps instead of one. The question is whether robustness degrades smoothly as corruption severity increases, or whether there is a threshold: a point at which even strong reasoning models stop self-correcting and start following the corrupted path.
Prediction: reasoning models maintain robustness up to some threshold, then collapse. Non-reasoning models degrade faster, possibly from the first corrupted step. If the pattern is threshold-like rather than gradual, that tells us self-correction is a binary capability, not a graded one.
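A severity sweep along these lines can be sketched as follows. `run_flip` is a hypothetical callable that returns True when the model follows the corrupted path; here it is stubbed with a toy threshold model purely to illustrate a threshold-like (rather than gradual) collapse.

```python
import random

# Severity sweep sketch: corrupt k = 1..3 steps and measure the follow rate.
# run_flip is a hypothetical stand-in for a full flip-test trial.

def choose_corruption_indices(n_steps: int, k: int, rng: random.Random) -> list[int]:
    """Pick k distinct step indices to corrupt."""
    return sorted(rng.sample(range(n_steps), k))

def follow_rate(run_flip, n_steps: int, k: int, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which the model follows the corrupted path."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        idxs = choose_corruption_indices(n_steps, k, rng)
        hits += run_flip(idxs)
    return hits / trials

# Toy "reasoning model": robust up to a threshold of 2 corrupted steps.
toy_model = lambda idxs: len(idxs) > 2
rates = {k: follow_rate(toy_model, n_steps=6, k=k, trials=20) for k in (1, 2, 3)}
print(rates)  # threshold-like collapse: {1: 0.0, 2: 0.0, 3: 1.0}
```

A gradual degradation would instead show the follow rate rising smoothly with k.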
Encoded Reasoning Detection during Self-Correction
For models that self-correct, we test whether the CoT tokens at the correction point carry encoded information needed for the correction, or whether the correction happens independently of what those tokens contain.
The test is simple: at the point where a model begins to self-correct, we replace the CoT tokens with semantically equivalent paraphrases that preserve meaning but change the surface form. If the model still self-corrects, the CoT tokens are not encoding hidden computation and the correction is driven by something else. If replacing them breaks the self-correction, the CoT tokens are carrying encoded information necessary for the task.
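The paraphrase swap can be sketched as below. In the actual experiment the paraphrases come from an LLM and are quality-checked via API; here a trivial synonym map stands in, and `swap_correction_span` is an illustrative name, not a real API.

```python
# Paraphrase-swap sketch for the encoded-reasoning test. The toy word-level
# substitution below stands in for LLM-generated, quality-verified paraphrases.

SYNONYMS = {"hence": "therefore", "wrong": "incorrect", "so": "thus"}

def paraphrase(step: str) -> str:
    """Preserve meaning, change surface form (toy word-level substitution)."""
    return " ".join(SYNONYMS.get(w, w) for w in step.split())

def swap_correction_span(steps: list[str], start: int) -> list[str]:
    """Replace every step from the correction point onward with a paraphrase."""
    return steps[:start] + [paraphrase(s) for s in steps[start:]]

steps = ["Step 2 is wrong", "hence recompute", "so the answer is 60"]
print(swap_correction_span(steps, start=0))
```

If the model's behaviour is unchanged under this swap, the surface form of the correction-point tokens is not doing computational work.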
This experiment is crucial (it addresses a direction outlined in Open Philanthropy's RFP for Technical AI Safety): it helps us understand whether self-correction is a surface-level phenomenon or reflects genuine reasoning encoded in the CoT tokens themselves. It also extends CoT faithfulness research into territory that has not been studied in the context of self-correcting models. The result of this experiment is possibly the linchpin of the paper: it determines whether self-correction is a property of the trace content or a property of the model, independent of it.
Recovery
Corrupt one reasoning step, then inject two correct steps downstream. The question is whether models that initially followed the corrupted path can be pulled back to the right answer through the corrective context.
This tests something distinct from self-correction: can recovery be induced externally or does it require the model to have self-correction as a native property? Two sub-questions:
Does recovery differ between reasoning and non-reasoning models when corrective steps are provided?
Does the position of the corrupted step matter? Corruption early in the chain may be harder to recover from than corruption near the end, because more subsequent steps are built on the wrong foundation.
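Constructing the recovery-condition trace is mechanical and can be sketched directly; `build_recovery_trace` is an illustrative helper name, not part of any existing codebase.

```python
# Recovery-condition sketch: corrupt one step, then splice two correct
# corrective steps back in immediately downstream of the corruption.

def build_recovery_trace(steps: list[str], corrupt_idx: int, wrong: str,
                         corrective: list[str]) -> list[str]:
    """Return the trace with step corrupt_idx replaced by `wrong`,
    followed by the injected corrective steps."""
    out = steps.copy()
    out[corrupt_idx] = wrong
    return out[:corrupt_idx + 1] + corrective + out[corrupt_idx + 1:]

steps = ["s1", "s2", "s3", "s4"]
trace = build_recovery_trace(steps, 1, "WRONG", ["fix-a", "fix-b"])
print(trace)  # ['s1', 'WRONG', 'fix-a', 'fix-b', 's3', 's4']
```

Varying `corrupt_idx` across the chain then tests the position effect directly.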
Mechanistic Analysis
For models that self-correct, what is actually happening internally at the corrupted step? Two analyses:
Logit lens on CoT token positions: track where in the network the correct answer probability recovers. If the model is going to self-correct, does it do so immediately after the corrupted step or only in the final layers?
Attention pattern analysis at the correction point: does the model attend back to earlier correct steps in the trace or attend forward to the corrupted one? The pattern of attention at the correction step reveals whether the model is actively overriding the corrupt path or simply ignoring it.
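The logit-lens tracking can be illustrated on synthetic data: project each layer's hidden state at the answer position through the unembedding matrix and read off the probability of the correct answer token, layer by layer. All shapes and values below are toy stand-ins, not real model internals.

```python
import numpy as np

# Logit-lens sketch on synthetic activations. hidden_by_layer holds one
# residual-stream vector per layer at the answer position; W_U is the
# unembedding matrix. Values are constructed to mimic "late recovery".

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def logit_lens(hidden_by_layer: np.ndarray, W_U: np.ndarray, correct_id: int) -> list[float]:
    """hidden_by_layer: (n_layers, d_model); W_U: (d_model, vocab)."""
    return [float(softmax(h @ W_U)[correct_id]) for h in hidden_by_layer]

W_U = 4.0 * np.eye(4)        # toy unembedding: token i reads off dimension i
hidden = np.zeros((6, 4))
hidden[:4, 0] = 1.0          # layers 0-3 favour a wrong token (id 0)
hidden[4:, 3] = 1.0          # layers 4-5 favour the correct token (id 3)
probs = logit_lens(hidden, W_U, correct_id=3)
print([round(p, 3) for p in probs])  # late recovery: probability jumps at layer 4
```

In the real experiment the hidden states come from the model at CoT token positions, and the layer at which the correct-answer probability recovers distinguishes immediate from late self-correction.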
Together with the encoded reasoning detection results, this analysis identifies the internal mechanism behind whichever account of self-correction the earlier experiments support.
If this project succeeds, it changes how safety researchers should use CoT monitoring. Currently CoT is treated as a reliable window into model reasoning, and while recent literature has analyzed CoT limitations extensively, it is still proposed as a control mechanism for detecting safety risks. Our results would establish the conditions under which that assumption holds and where it breaks down.
If LLMs systematically self-correct corrupted traces, researchers should treat CoT monitoring as a weak signal for inspecting model reasoning and invest in mechanistic alternatives. The flip test provides a methodology for testing CoT reliability on any new model without access to ground-truth internal states. This is directly relevant as an alternative approach to CoT monitoring if CoT reasoning becomes difficult to interpret at scale.
$700 GPU compute on Runpod/Vast.ai for model inference, logit lens and attention pattern analysis (mechanistic experiments) across four models.
$100 OpenRouter API credits for paraphrase quality verification in the encoded reasoning detection experiment and other edge cases (LLM-as-judge usage)
Bogdan, Paul C., et al. "Thought Anchors: Which LLM Reasoning Steps Matter?" arXiv preprint arXiv:2506.19143 (2025).
Bowman, Samuel R., et al. "Measuring progress on scalable oversight for large language models." arXiv preprint arXiv:2211.03540 (2022).
Chen, Yanda, et al. "Reasoning models don't always say what they think." arXiv preprint arXiv:2505.05410 (2025).
Chen, Yanda, et al. "Towards consistent natural-language explanations via explanation-consistency finetuning." Proceedings of the 31st International Conference on Computational Linguistics. 2025.
Chua, James, and Owain Evans. "Are DeepSeek R1 and other reasoning models more faithful?" ICLR 2025 Workshop on Foundation Models in the Wild. arXiv:2501.08156 (2025).
Greenblatt, Ryan, et al. "AI control: Improving safety despite intentional subversion." arXiv preprint arXiv:2312.06942 (2023).
Lanham, Tamera, et al. "Measuring faithfulness in chain-of-thought reasoning." arXiv preprint arXiv:2307.13702 (2023).
Pfau, Jacob, William Merrill, and Samuel R. Bowman. "Let's think dot by dot: Hidden computation in transformer language models." arXiv preprint arXiv:2404.15758 (2024).
Roger, Fabien, and Ryan Greenblatt. "Preventing language models from hiding their reasoning." arXiv preprint arXiv:2310.18512 (2023).
Turpin, Miles, et al. "Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting." Advances in Neural Information Processing Systems 36 (2023).
Open Philanthropy’s 2025 RFP for Technical AI Safety research. https://coefficientgiving.org/tais-rfp-research-areas/ Accessed March 25, 2026
Anthropic: Recommendations for Technical AI Safety Research Directions. https://alignment.anthropic.com/2025/recommended-directions/ Accessed April 2, 2026