Approving this grant! I appreciate that the grantees and Neel called out the potential conflict of interest; Manifund is generally happy to have regrantors fund researchers who they know or have mentored, so long as the regrantor does not gain significant private benefits through this grant.
Mechanistic Interpretability research for unfaithful chain-of-thought (1 month)
Project summary
Iván and Jett have been accepted to MATS under Arthur's mentorship. We are very excited about our project on unfaithful chain-of-thought and would love to start right away. Since all three of us are available now, we are seeking one month of funding to start the project before the MATS research phase.
What are this project's goals? How will you achieve them?
Chain-of-thought (CoT) is a prompting approach that improves the reasoning abilities of LLMs by directing them to verbalize their reasoning in a step-by-step fashion. However, recent work has shown that the CoT produced by models is sometimes unfaithful: it doesn't accurately represent the internal reasons for a model's prediction (Turpin et al., 2023, Lanham et al., 2023). Other recent work has also shown how to decrease this unfaithfulness (Radhakrishnan et al., 2024, Chua et al., 2024). But how common is this behavior? Moreover, why does it happen (nostalgebraist, 2024)? We think using mechanistic interpretability to answer these questions is promising, precisely because we are still fundamentally confused about unfaithful CoT.
During the MATS training program, Iván and Jett performed patching and probing experiments to understand and detect unfaithful CoT, using diverse Yes/No questions with biased contexts (few-shot prompts (FSPs) whose answers are all Yes or all No) on the Llama 3.1 8B model. Our key findings were that:
Existing CoT unfaithfulness datasets (Turpin et al., 2023) are very repetitive and contain patterns that models pick up on, so the resulting unfaithful CoT is often just a slight variation of the questions in the FSP.
The information from the biased context flows through different tokens, and different layers are more or less relevant, depending on the question and the specific CoT. We believe this is why simple logistic regression and mean-difference probes trained on fixed token positions failed to detect unfaithful CoT in our setup.
Attention probes can be used to detect unfaithful CoTs! These probes use an attention mechanism to select which tokens to probe at. In our experiments, a probe on layer 15 achieved 81% test accuracy (a minimal sketch of this probe architecture follows this list).
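For concreteness, here is a minimal sketch of the kind of attention probe described above, written in PyTorch. It operates on cached residual-stream activations (e.g. from layer 15 of Llama 3.1 8B, where d_model = 4096); the specific architecture, pooling scheme, and training setup shown here are illustrative assumptions, not the exact probe we trained.

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    """A probe that learns where to look: a query vector scores each token
    position, softmax attention pools the activations, and a linear head
    classifies the pooled vector (e.g. faithful vs. unfaithful CoT)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, 1)       # per-token relevance score
        self.classifier = nn.Linear(d_model, 1)  # binary faithfulness logit

    def forward(self, acts: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # acts: [batch, seq, d_model] cached residual-stream activations
        # mask: [batch, seq], 1 for real tokens, 0 for padding
        scores = self.query(acts).squeeze(-1)            # [batch, seq]
        scores = scores.masked_fill(mask == 0, -1e9)     # ignore padding
        attn = torch.softmax(scores, dim=-1)             # [batch, seq]
        pooled = (attn.unsqueeze(-1) * acts).sum(dim=1)  # [batch, d_model]
        return self.classifier(pooled).squeeze(-1)       # [batch] logits

# Hypothetical training step on cached activations:
# probe = AttentionProbe(d_model=4096)
# loss = nn.BCEWithLogitsLoss()(probe(acts, mask), labels.float())
```

The design point is that the probe learns its own attention weights over token positions, which is exactly the flexibility that fixed-position logistic regression and mean-difference probes lack.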
During the following month, we plan to:
[Week 1] Perform experiments to evaluate what the attention probes are picking up on. This will include analyzing cases of correct and incorrect predictions and comparing performance on CoTs with and without few-shot prompts.
[Week 2] Further analysis of the factors driving probe predictions, including training probes on partial sequences, plus steering experiments to measure the causal effect of the learned directions (see the steering sketch after this plan).
[Week 3] Write up the findings up to this point (e.g. as a LessWrong post).
[Week 4] Drill down into findings about models being on "auto-pilot" and unfaithfully producing answers. We hypothesize that models operate in two different modes: "auto-pilot" and "reasoning". When on "auto-pilot", the CoT does not matter much for the model's response; when "reasoning", the CoT is causally relevant to the final answer. Although this discrepancy in behavior has been observed before (Ye et al., 2022, Madaan and Yazdanbakhsh, 2022, Wang et al., 2023, Shi et al., 2023, Gao, 2023, Lanham et al., 2023), there is not yet a mechanistic explanation of how or why it happens. Our patching experiments already show mechanistic evidence that different tokens matter for different questions, but we would like to distill this finding even further.
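To make the planned steering experiments more concrete, below is a minimal sketch of one way a learned direction could be added to the residual stream using TransformerLens hooks. The model name, layer, steering scale, prompt, and direction here are placeholders, and this is only an illustration of the kind of intervention we have in mind, not a description of our final setup.

```python
import torch
from transformer_lens import HookedTransformer

# Placeholder assumptions: model name, layer, scale, and a stand-in for a
# learned probe direction (e.g. unit-normalised classifier weights).
model = HookedTransformer.from_pretrained("meta-llama/Llama-3.1-8B")
layer, scale = 15, 4.0
direction = torch.randn(model.cfg.d_model)
direction = direction / direction.norm()

def steer(resid, hook):
    # resid: [batch, seq, d_model]; add the direction at every position
    return resid + scale * direction.to(resid.device, resid.dtype)

prompt = "Question: ... Let's think step by step."  # placeholder prompt
tokens = model.to_tokens(prompt)
steered_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(f"blocks.{layer}.hook_resid_post", steer)],
)
# Comparing steered vs. unsteered final-answer logits gives a rough estimate
# of the causal effect of the learned direction on the model's answer.
```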
We also plan to perform the following future work if there is extra time during this month, or otherwise during the MATS research phase:
Expand the experiments to other models beyond Llama 3.1 8B, such as Gemma 2 9B and 27B. This will in turn enable us to analyze whether there are SAE latents that are interpretable and close to the learned probe directions (see the sketch after this list); this is not currently possible for Llama 3.1 8B since there are no good SAEs publicly available for it. We are deferring this to future work because our understanding of unfaithful CoT on Llama 3.1 8B may still change.
Gather further evidence that existing CoT unfaithfulness datasets (Turpin et al., 2023) often make the model produce biased answers purely by "induction", and are therefore not that interesting for analyzing deception.
Run the learned probes on random chat datasets to see if they surface ~deceptive behavior in general.
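As a rough illustration of the planned SAE comparison, the snippet below computes cosine similarities between a learned probe direction and the decoder directions of an SAE. The tensors and dimensions are placeholders (we assume the probe direction and the SAE decoder matrix have already been loaded), since the exact SAE release for Gemma 2 has not been chosen yet.

```python
import torch

# Placeholders: a learned probe direction [d_model] and an SAE decoder
# matrix W_dec [n_latents, d_model], both loaded elsewhere.
d_model, n_latents = 3584, 16384   # e.g. Gemma 2 9B width; illustrative only
probe_dir = torch.randn(d_model)
W_dec = torch.randn(n_latents, d_model)

# Cosine similarity between the probe direction and every SAE latent direction.
sims = torch.nn.functional.cosine_similarity(
    W_dec, probe_dir.unsqueeze(0), dim=-1
)                                   # [n_latents]
top_sims, top_idx = sims.topk(10)
for latent, sim in zip(top_idx.tolist(), top_sims.tolist()):
    print(f"latent {latent}: cosine similarity {sim:.3f}")
# High-similarity latents would then be inspected (e.g. via max-activating
# examples) to check whether they are interpretable.
```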
How will this funding be used?
We will use this funding for stipends ($5K each for Iván and Jett) and compute ($500 each for Iván and Jett). This is less than what MATS pays for the 10-week research phase ($12K USD, which also covers housing, food, and compute).
Who is on your team? What's your track record on similar projects?
This research will be performed by Iván Arcuschin Moreno and Jett Janiak, with mentorship from Arthur Conmy. Iván recently finished MATS under Adrià Garriga-Alonso’s mentorship, leading to a NeurIPS publication. Jett recently finished LASR under Stefan Heimersheim’s mentorship, leading to a paper presented at the SciForDL NeurIPS workshop. Arthur is a Research Engineer on Google DeepMind's Interpretability team, where he has worked on and published many interpretability projects and has mentored researchers on several occasions (Gould et al., 2023, Syed et al., 2023, Kissane et al., 2024 [co-supervisor], Kharlapenko et al., 2024 [co-supervisor], Farrell et al., 2024, Chalnev et al., 2024).
What are the most likely causes and outcomes if this project fails?
This project can fail if we pursue a research direction that is incorrect, either because our current results obtained during the training program are flawed or because we misinterpret them. Iván and Jett have been trying to mitigate this by red-teaming each other's work, and continuously sharing results with Arthur for feedback. This project can also fail if Iván, Jett, and Arthur do not work well together. We feel this has already been mitigated since Iván and Jett have been successfully working together for three weeks (even before the training program's official start), complementing each other's strengths, and they have been mentored by Arthur during the training program without issues.
How much money have you raised in the last 12 months, and from where?
Both Iván and Jett received a $4.8K stipend from MATS for Neel Nanda's one-month training program. MATS will fund Iván and Jett in the winter for the research phase. Iván has received a grant from LTFF for the extension of a previous iteration of the MATS program, from April to September 2024. Jett has received stipends for leading an AI Safety Camp project and for participating in LASR. Arthur has not received funds in the last 12 months.
Potential conflict of interest with Neel Nanda
Both Iván and Jett took part in Neel Nanda's training program for MATS. They were TA'd by Arthur, but also got occasional feedback from Neel, so there is a potential conflict of interest if he ends up funding this proposal. Nevertheless, this proposal was written without Neel's input.
Neel Nanda
I'm fully funding this. I think that understanding, detecting and potentially mitigating chain of thought unfaithfulness is a very important problem, especially with the rise of o1 models. I think the approach taken here is reasonable. I think Arthur is fairly good at supervising projects, and that under him Jett and Ivan have a decent shot of making progress, and that enabling this to start a month earlier is clearly a good idea. I know Arthur well, and got a moderate amount of data on Jett and Ivan's ability to execute on this project in my MATS training program.
Conflicts of interest: Jett and Ivan were in my MATS training program (but no longer are, and I would not be involved in this project going forwards). I manage Arthur and he works on my interpretability team at GDM (but he is doing MATS in a purely personal capacity, and if anything I'm incentivised to reduce his amount of extra-curricular distractions :P ). Overall I don't feel particularly concerned about conflicts of interest here.