Activation vector steering with BCI

Project summary

Recent work (https://tinyurl.com/avgpt2xl) has shown that language models can be “steered” (towards text completions which resemble humans in differing mental states) by simply adding vectors to the model’s neural activations. Other recent work (e.g. https://tinyurl.com/latentlin) has shown that latent representations of different models can be bridged by a simple linear mapping. In this experiment our hypothesis is that (some aspects of) human brain states can be bridged to the latent representations of language models by simple mappings. This could contribute to prosaic AI alignment: (1) generative models could be steered to exhibit the specific brain states of specific people, to better represent their attitudes and opinions; (2) reward models could be trained to reproduce humanlike brain states during evaluation, making them more generalizable out-of-distribution; (3) scientific understanding of analogies between LLM behavior patterns and human behavior patterns could be improved.
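
As a rough illustration of what a "simple mapping" could look like, the sketch below fits a ridge-regression map from brain-derived feature vectors to model activation vectors, then adds a mapped vector to a stand-in hidden state. Everything here is an assumption for illustration only: the data is synthetic, the dimensions are arbitrary, and the names fit_ridge_map and steer are hypothetical. It is not the project's actual pipeline, which has yet to be designed.

```python
import numpy as np

def fit_ridge_map(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha*I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def steer(hidden, brain_features, W, coeff=1.0):
    """Add the mapped brain-state vector to every position of a hidden-state array."""
    steering_vector = brain_features @ W  # shape: (d_model,)
    return hidden + coeff * steering_vector

# Synthetic stand-ins (assumption): rows of X play the role of per-trial fMRI
# features, rows of Y the role of model activations for the same stimuli.
rng = np.random.default_rng(0)
n_trials, n_brain_features, d_model = 200, 16, 8
W_true = rng.normal(size=(n_brain_features, d_model))
X = rng.normal(size=(n_trials, n_brain_features))
Y = X @ W_true + 0.01 * rng.normal(size=(n_trials, d_model))

# Fit on the first 150 trials, evaluate on the remaining 50.
W = fit_ridge_map(X[:150], Y[:150], alpha=1e-3)
pred = X[150:] @ W
residual = ((Y[150:] - pred) ** 2).sum()
total = ((Y[150:] - Y[150:].mean(axis=0)) ** 2).sum()
held_out_r2 = 1.0 - residual / total

# Steering: shift a (seq_len, d_model) hidden state by one trial's mapped vector.
hidden = np.zeros((5, d_model))
steered = steer(hidden, X[150], W, coeff=2.0)
print(f"held-out R^2: {held_out_r2:.4f}")
```

On real recordings the held-out fit would be the quantity of interest: a mapping that generalizes to unseen trials would support the hypothesis, while a near-zero held-out score would count against it.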

What are this project's goals and how will they be achieved?

Some of the specific steps:

Design the fMRI data-collection protocol
Implement the data-collection protocol (in particular, the display and keyboard elements)
Recruit human subjects
Connect with a suitable fMRI center and get the experiment approved (IRB process)
Administer the human-subject data-collection
Design the ML experiments (fMRI feature extraction pipeline, particular architecture modifications, loss function, validation metrics)
Implement the ML experiments (the dataset may be large enough to require cloud resources)
Write the technical report/paper

Impact:

Advancing the science of direct and meaningful connections between human minds and prosaic AI
This is one potential pathway toward more generalizable AI value alignment: ultimately modeling the process by which humans make value judgments more causally and mechanistically, as opposed to merely its behavioral statistical features on a finite training distribution

How will this funding be used?

Salary

108000$ Salaries: 6 months for 1 researcher (10k/month) + 3 months for 1 ML engineer (16k/month)
- This will include one researcher + one ML engineer
900$ fMRI ops contractor (30h * 30$/h)
900$ Participant compensation (25 participants, 1h each, 30$/h)
50000$ tax for the salaries (assumed ~45% total overhead regardless of specific tax optimizations)

Equipment

4800$ Compute costs (A100 GPU * 6 months)
16500$ 25h of fMRI time (at 660$ per hour). We think we’d need 20-25h as a lower bound; the more hours we can get, the better.
50$ rubber-based “Virtually Indestructible Keyboard” for MRI-compatibility, only available used
2000$ MRI-compatible screens for use inside the machine and/or travel to an fMRI facility with this installation already available
3000$ Research laptop for use onsite at recordings

One-off Misc

15600$ Office costs (1400$/person/month at FAR Labs, 6 months, 2 people)
1776$ Proportional visa costs for 1 researcher for this time period

20% buffer

Total: $244k
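
The total can be reproduced from the line items above. The short sketch below sums the amounts as listed (it does not recompute them from per-unit rates) and applies the 20% buffer; the dictionary keys are informal labels chosen for illustration.

```python
# Line-item amounts in USD, copied from the budget above.
salary = {
    "researcher_and_ml_engineer": 108000,
    "fmri_ops_contractor": 900,
    "participant_compensation": 900,
    "salary_tax_overhead": 50000,
}
equipment = {
    "compute": 4800,
    "fmri_time": 16500,
    "mri_compatible_keyboard": 50,
    "mri_compatible_screens_or_travel": 2000,
    "research_laptop": 3000,
}
misc = {
    "office_costs": 15600,
    "visa_costs": 1776,
}

subtotal = sum(salary.values()) + sum(equipment.values()) + sum(misc.values())
buffer_rate = 0.20
total = subtotal * (1 + buffer_rate)

print(f"Subtotal: ${subtotal:,}")                   # Subtotal: $203,526
print(f"Total with 20% buffer: ${round(total):,}")  # Total with 20% buffer: $244,231
```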

Who is on the team and what's their track record on similar projects?

David “davidad” Dalrymple:

Suggested this experiment before seeing the original activation-engineering results
Coauthor of Physical Principles for Scalable Neural Recording (with Ed Boyden, George Church, Konrad Kording, Adam Marblestone, et al.)
Advisor to this Nature Methods paper on 3D neuroimaging (in Acknowledgments): https://www.nature.com/articles/nmeth.2964
Advisor to Brain Preservation Foundation https://www.brainpreservation.org/team/david-dalrymple/
Studied systems neuroscience in the Biophysics PhD program at Harvard
Main claim to fame: youngest MIT graduate student (obtained master’s at age 16)
Author of An Open Agency Architecture for Safe Transformative AI (see also this subsequent exposition).
- That is a completely different approach, relying on formal verification for safety rather than prosaic alignment; nonetheless, davidad believes there are some prosaic directions (such as this one) that deserve more attention and effort.

Lisa Thiergart:

Co-author on the original activation engineering paper (soon also to be on arXiv) https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector
Co-author on work adding a vector to steer a maze-solving agent https://www.lesswrong.com/posts/gRp6FAWcQiCWkouN5/maze-solving-agents-add-a-top-right-vector-make-the-agent-go
SERI MATS scholar
Previous experience: https://www.linkedin.com/in/lisathiergart/
- neurotech / alignment relevant experience:
  - 6 months on Team Shard mentored by Alex Turner, various mechanistic interpretability projects including maze and natural abstraction
  - 4 months working as Research Scientist for BCI startup
  - 3 months upskilling at Entrepreneur First focused on Alignment and Neurotech domain exploration
  - Ran a workshop on neurotech for alignment, affiliated with Foresight
  - 8 months CORE Robotics lab - specialist project on BCI control of robotics, experience with EEG recording, experiment execution with participants and experimental design. IRB certified.

What are the most likely causes and outcomes if this project fails? (premortem)

The most obvious cause of failure is that AIs don't make value judgements the way humans do, in which case this would be a waste of time. It still seems well worth trying, though.

What other funding is this person or project getting?

Probably some from Foresight, since they are applying and we are in discussions with them. They don’t want to spend much time actively seeking grants, since it is very time-consuming.