I propose two research projects focused on interpretability at scale and memory management in transformer models. I will continue working with the team of collaborators and mentors from the Alignment Research Engineering Accelerator (ARENA). Ultimately, I aim to work at an alignment organization to have a direct impact on AI x-risk mitigation and to learn from mentors. In preparation, these independent research projects will allow me to learn from experienced researchers while directly contributing to AI safety research.
Extending Automated Circuit Discovery ("ACDC")
The ACDC algorithm identifies circuits responsible for particular model behaviors with little human intervention. However, the existing algorithm is infeasible to run on large language models (LLMs). We speed up ACDC using a linear approximation, potentially enabling its application to LLMs.
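The core idea of a linear approximation can be illustrated with a toy sketch (this is not the ACDC++ implementation; `metric` is a hypothetical stand-in for the task metric as a function of a single activation). Exact patching re-runs the model once per candidate edge; a first-order approximation instead estimates each patching effect from a gradient at the clean activation.

```python
# Toy sketch of a first-order (linear) approximation to activation patching.
# ACDC measures an edge's importance by patching its activation and
# re-running the model; the linear approximation estimates that effect
# from a gradient instead, avoiding one forward pass per edge.
# (Illustrative only: `metric` stands in for the task metric.)

def metric(a):
    # Hypothetical task metric as a function of one edge's activation.
    return a ** 3 - 2 * a

def grad_metric(a, eps=1e-6):
    # Finite-difference gradient (a real implementation would use autograd).
    return (metric(a + eps) - metric(a - eps)) / (2 * eps)

clean, corrupt = 1.0, 1.1

# Exact patching effect: re-run the "model" with the corrupted activation.
exact = metric(corrupt) - metric(clean)

# Linear approximation: one gradient at the clean activation, no re-run.
approx = grad_metric(clean) * (corrupt - clean)

print(f"exact={exact:.4f}  approx={approx:.4f}")
```

For small perturbations the two agree closely, which is what makes pruning many edges with a single backward pass plausible.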
Memory management in a 4-layer transformer model
We investigate "cleanup behavior", in which attention heads or MLPs remove internal information after it has been used for further computation. We provide concrete evidence for cleanup behavior in attention head superposition. A deeper understanding of memory management in LLMs is a relevant step towards inner alignment. (I am happy to share our current research proposal; feel free to reach out via email.)
Our main goal is the publication of our two projects: (1) extending automated circuit discovery ("ACDC") and (2) memory management in transformer models (specifically, the investigation of "cleanup behavior").
1. Extending automated circuit discovery ("ACDC")
During ARENA, we showed that our method "ACDC++" is a valid approximation of the existing ACDC algorithm. This GitHub repository contains our current results. Our next goals towards publication are:
Generalize the approximation algorithm to path patching
Verify our method on a range of known circuits in small transformer models.
Apply our method to large models, which are infeasible to analyze with the existing ACDC algorithm.
Provide an open-source tool for our circuit identification method. We aim to add the functionality of our current repository to TransformerLens, a popular repository for mechanistic interpretability research. This code will enable fellow researchers to apply ACDC at scale.
Path to impact: In order to mitigate AI x-risk, we need to have a set of evaluations in place when large-scale models like GPT-5 are released. “ACDC++” potentially enables the evaluation of large language models by identifying the circuits responsible for specific model behaviors.
2. Memory management in transformer models
We show concrete evidence for memory management. Specifically, we investigate “cleanup behavior”, in which attention heads and MLPs clear internal information during a forward pass. Our next goals towards publication are:
Examine how specific information is used before it is removed
Characterize the role of attention head 0.2 as an additional positional embedding
Check the validity of existing interpretability methods (Direct Logit Attribution, Path Patching) when cleanup behavior is present
Provide a screening tool for identifying cleanup behavior in transformer models. So far, we have focused on one specific model, gelu-4L. Our current repository can be found here. At the end of our project, we aim to publish a notebook enabling fellow researchers to screen custom transformer models for cleanup behavior.
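As a rough illustration of what such a screen could look for (a hypothetical sketch with made-up vectors and component names, not our actual tool): a later component whose write to the residual stream points opposite to an earlier component's write is a candidate for cleanup.

```python
# Toy sketch of screening for "cleanup behavior": a later component
# (head or MLP) writing to the residual stream in the direction opposite
# to an earlier component's write, i.e. erasing that information.
# (Illustrative 3-d vectors; a real screen would decompose the actual
# residual stream of a model such as gelu-4L.)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return dot(u, u) ** 0.5

def cleanup_score(writer_out, later_out):
    # Cosine similarity between an earlier component's output and a
    # later component's output; values near -1 suggest cleanup.
    return dot(writer_out, later_out) / (norm(writer_out) * norm(later_out))

head_0_2 = [1.0, 2.0, 0.0]   # earlier write to the residual stream
mlp_3 = [-0.9, -1.8, 0.1]    # later component's write

score = cleanup_score(head_0_2, mlp_3)
print(f"cleanup score: {score:.3f}")  # strongly negative -> candidate cleanup
```

A notebook version of such a screen would compute these scores for all pairs of earlier and later components across a dataset of prompts and flag consistently negative pairs.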
Path to impact: Understanding a model’s internal information flow may help us identify learned heuristics and algorithms. Deepening our understanding of model-internal reasoning is a relevant step towards inner alignment.
Risk assessment of research
The risk of advancing capabilities with my research is comparable to that of any other direction in mechanistic interpretability. Given the relative amount of research being done in AI interpretability versus capabilities, the chance of shortening timelines through my independent work is minimal. The results of my independent research will be made publicly available.
My main track record in technical alignment research is my work at AI Safety Camp and in the ARENA program. In addition, I have gained relevant experience in ML engineering and broader scientific research.
Alignment Research Engineering Accelerator (ARENA)
I participated in ARENA, a fellowship program for research engineers. Over the course of six weeks, we covered an extensive curriculum on theoretical foundations for alignment and modern ML. During the capstone project, we developed the research agenda for memory management in transformers mentioned above. Through this program, I was able to build a network of collaborators (Yeu-Tong Lau, James Dao, Chris Mathwin, Aaquib Syed, Rusheb Shah, Lucy Farnik) and mentors (Joseph Bloom, Jett Janiak, Callum McDougall, Arthur Conmy). The program was funded by Open Philanthropy.
AI Safety Camp
Our research lead Alex Spies describes my role like this: “Can has a thorough understanding of central topics in deep learning, such as optimisation, data visualization and various Neural Network architectures. Additionally, he quickly grasps unfamiliar concepts and is able to contextualize their relevance with respect to our research, which already proved to be of great value during the early stages of our project - during which time he independently reviewed relevant literature and helped to form a cohesive research agenda. In addition to his research capabilities, Can is also a valuable asset to our engineering team, as he has continually improved our codebase.” The project goals section above provides further description of my role at AI Safety Camp.
In late 2022, I wrote an article on neuromorphic hardware (NMH), supervised by Dr. Kyle Webster, who currently works for Alvea. In this piece, we discuss whether NMH increases the risk of uncontrollable advanced AI. The uncertain trajectory of this technology made it particularly challenging to formulate a concise risk assessment.
In summer 2022, I published my bachelor's thesis in physics, in which I trained neural networks to evaluate quantum measurements. To this end, I designed an autoencoder architecture to find a succinct representation of qubit systems from experimental data. I published parts of my code on GitHub.
In 2017, I examined the starving random walk model with Dr. Sid Redner at the Santa Fe Institute, and we published our results in the IOP Journal of Statistical Mechanics.
During an internship at ImmoScout in 2020, I trained multiple machine learning models to identify families among users in the customer database. A pretrained version of Google’s BERT model achieved by far the best results, while being trained on a fraction of the full dataset.
In the following summer, 2021, I volunteered to build an interpretability tool for DeepOpinion, a startup offering a platform for automated text and document analysis using large language models. After surveying state-of-the-art interpretability methods, I implemented the Integrated Gradients method in DeepOpinion’s internal software stack.
Total funding amount
Total request: 22,000 USD
Total duration: 6 months
Salary: 600 – 700 USD per week (including living costs: rent, food, etc.)
Health insurance: 130 – 160 USD per week (minimal mandatory fees given salary above)
Income tax: 182 – 215 USD per week (25% of (salary + insurance) under the assumption of continued employment after the period of this grant)
Buffer: 60 – 70 USD per week (10% of salary)
Total: 972 – 1145 USD per week
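As a sanity check, the weekly ranges above can be recomputed from the stated percentages (low and high ends of each line item):

```python
# Recompute the weekly budget ranges from the stated percentages.
salary = (600, 700)
insurance = (130, 160)

# Income tax: 25% of (salary + insurance).
tax = tuple(round(0.25 * (s + i)) for s, i in zip(salary, insurance))

# Buffer: 10% of salary.
buffer = tuple(round(0.10 * s) for s in salary)

low = salary[0] + insurance[0] + tax[0] + buffer[0]
high = salary[1] + insurance[1] + tax[1] + buffer[1]
print(f"tax={tax}, buffer={buffer}, total={low}-{high} USD per week")
```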
Alternatives to funding
Independent of your grantmaking decision, I will pursue a research engineering role at an alignment organization in the long term. Receiving your grant would boost my search.
Use for additional funding
With additional funding, I would compensate collaborators to work on my independent research projects.