Model Interpretability on modFDTGPT2-XL, a partially-aligned model

Project summary

In this proposal, I aim to deeply understand modFDTGPT2-XL model, a variant of the GPT2-XL model fine-tuned using a corrigibility dataset. I seek to understand how Archetypal Transfer Learning (ATL), a fine-tuning process allows the model to adhere to specific directives, like initiating a shutdown protocol, while also generalizing to diverse tasks. To achieve this, a side-by-side comparison of the token activations in the modFDTGPT2-XL and standard GPT2-XL models. Simultaneously, I will employ interpretability techniques to shed light on the internal decision-making processes of these models.

Potentially, performing a sequential analysis to observe the behavior of the model at various stages of its training can provide insights into the evolution of the model from a generalized state to specific rule-following behaviors. Also, to ensure a comprehensive understanding of the impact of the corrigibility dataset, an in-depth analysis of the dataset will be conducted prior to the model analysis. By applying these methods, I aim to provide the mechanics of and how ATL is able to effectively transfer aligned values to GPT2-xl and ultimately contributing to the theoretical aspects of understanding the alignment problem.

Project goals

This project has the following objectives:

Develop a comprehensive theoretical framework encapsulating the intricacies of integrating sophisticated control mechanisms into AI systems. This framework should provide clear guidelines for instilling complex behaviours and ethical norms in artificial agents.
Develop techniques and tools for model interpretability, thus improving our understanding of AI systems. I aim to create tools designed specifically for dissecting and understanding the internal mechanisms of AI models.

High level approach to achieve the project goals

Phase 1 / Months 1 & 2 - Establish evidence of neural connections by interpreting activations in both modFDTGPT2xl and the standard model. Preliminary work have been done already, you can find it here. Extract the theoretical framework behind the low and high activations that connects to high corrigibility. Please Note: this is currently in-progress, see progress spreadsheets analysis 1 & 2.
Phase 2 / Month 3 - Iterate from the results of phase 1 by building an improved dataset and use it to build a 80 to 95% shutdown capable model from the Seek supervision from senior researchers and compare the results to modFDTGPT2xl and the standard model.

Phase 3 / Months 4, 5 & 6 - Establish neural connections by interpreting activations of version 2 of modFDGPT2xl. Results from comparisons to version 1 and standard model should be established. This phase aims to either reinforce the theoretical framework developed in Phase 1 or establish a better one.

Monthly progress write-ups will be provided as a minimum requirement. I will share any significant findings, including interpretability tools, datasets, and models, through platforms like LessWrong and EA forum.

What is your track record on similar projects?

I have posted numerous works on alignment, many of which are new to the field. Here are some of them:

Exploring Functional Decision Theory (FDT) and a modified version: This work delves into MIRI's proposal on a new theory for agent decision making and discusses its relevance for alignment.
A Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL): These deep dives into my understanding of alignment outline the agenda I am currently working on.
Less activations can result in high corrigibility?: This work provides a preliminary evidences of modFDTGPT2xl's ability to execute a complex instruction and generate shutdown activations by mentioning the phrase "activate oath."

Some tools I've created for downloading neural activations and creating token vocab libraries are linked.

I have all my code here in my github account. I'm also uploading the alignment language/neural models here.

Before working on alignment research, I was a certified public accountant for 15 years, blogger and website developer (front end and backend), martial artist (Don Jitsu Ryu, Blue Belt) and 3-time marathon finisher and able to do a sub-4.

How will this funding be used?

All of the funds will cover for living expenses and other costs - eg. planning to get matlab for visualization and MS office tools for data processing. A possibility of hiring a junior researcher in India is in the horizon too, a CS student has already expressed interest in the project.

How could this project be actively harmful?

I would be willing to explain this part if needed be to any grantor/funder/regrantor. I deleted what I wrote here because of safety concerns.

What other funding is this person or project getting?

I'm self funding this project temporarily.