Attention-Guided-RL for Human-Like LMs

šŸÆ

Scott Viteri

ActiveGrant
$3,000 raised
$30,000 funding goal

Project summary

Current base language models (LMs) have no control over the order in which they receive training data. This pushes them toward a fairly inhuman kind of cognition, similar to what a human baby might develop if it were raised on a diet of random YouTube videos with no ability to influence the video order.

This project is about training the LM to pick its own training order, with the hope that this will lead to a more human-like cognition. Prior work uses heuristics such as entropy on sub-distributions of data to pick future samples. We propose a technique that is closer to the core mechanism of a transformer -- we can let the model's attention mechanism guide which training sample it receives, and fine-tune that attention mechanism using an RL algorithm, such as expert iteration with KL regularization, PPO, or GRPO.
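To give a rough feel for the expert-iteration-with-KL-regularization variant, one round might look like the sketch below. This is illustrative only, not the project's actual implementation: `sample_ordering`, `reward_fn`, and `policy.log_prob` are hypothetical helpers that roll out an ordering of training samples, score it, and give the log-probability of the selection actions.

```python
import torch

def expert_iteration_round(policy, ref_policy, optimizer, sample_ordering,
                           reward_fn, num_samples=16, top_k=4, kl_coef=0.1):
    """One hedged sketch of expert iteration with KL regularization.
    sample_ordering(policy) -> an ordering of key/value pairs (a trajectory);
    reward_fn(ordering)     -> scalar reward (e.g., negative average surprise);
    policy.log_prob(ordering) -> log-probability of the selection actions.
    All three are hypothetical helpers, not part of any library."""
    # 1. Roll out candidate orderings from the current policy.
    orderings = [sample_ordering(policy) for _ in range(num_samples)]
    rewards = [reward_fn(o) for o in orderings]

    # 2. Keep the highest-reward orderings as the "expert" set.
    experts = [o for _, o in sorted(zip(rewards, orderings),
                                    key=lambda x: x[0], reverse=True)[:top_k]]

    # 3. Imitate the expert orderings while penalizing drift from a frozen reference model.
    for ordering in experts:
        logp = policy.log_prob(ordering)
        with torch.no_grad():
            ref_logp = ref_policy.log_prob(ordering)
        loss = -logp + kl_coef * (logp - ref_logp)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```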

Suppose we randomly sample a selection of consecutive sentences from a Wikipedia article -- let's think of consecutive sentence pairs as key-value pairs in a database. The idea is that we will place these pairs in the LM's context window in an order that it chooses, such that it gets low average surprise on the values given the previous keys and values.
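Concretely, the "low average surprise" objective could be scored roughly as follows, assuming a Hugging Face-style causal LM and tokenizer; this is an illustrative sketch, not the repository's implementation.

```python
import torch

def average_surprise(model, tokenizer, ordered_pairs, device="cuda"):
    """Mean negative log-likelihood of each value sentence, conditioned on all
    previously placed key/value sentences plus the current key. Lower is better,
    so an RL reward could be its negation. Illustrative sketch only."""
    context_ids = torch.empty(1, 0, dtype=torch.long, device=device)
    nll_per_value = []
    for key, value in ordered_pairs:
        key_ids = tokenizer(key, return_tensors="pt",
                            add_special_tokens=False).input_ids.to(device)
        value_ids = tokenizer(value, return_tensors="pt",
                              add_special_tokens=False).input_ids.to(device)
        context_ids = torch.cat([context_ids, key_ids], dim=1)
        input_ids = torch.cat([context_ids, value_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        value_len = value_ids.shape[1]
        # Logits at position t predict token t + 1, so shift by one.
        value_logits = logits[:, -value_len - 1:-1, :]
        log_probs = torch.log_softmax(value_logits, dim=-1)
        token_nll = -log_probs.gather(-1, value_ids.unsqueeze(-1)).squeeze(-1)
        nll_per_value.append(token_nll.mean())
        context_ids = input_ids  # the value joins the context for later pairs
    return torch.stack(nll_per_value).mean()
```

In the RL setting, the negation of this quantity would serve as the reward for whichever ordering the model chose.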

The LM can select a particular key by using its own attention mechanism. We sample a query sentence from the LM and, along the way, extract its query activation vectors, averaged over the token positions of that sentence. We embed the key sentences similarly, by passing them through the model and averaging their key activation vectors. We then compute the similarity between the query and key vectors and select a value sentence accordingly, mirroring the attention algorithm.
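One illustrative way to implement that selection step, assuming a Llama-style Hugging Face model (so `model.model.layers[i].self_attn.q_proj`/`k_proj` exist); the layer choice, single-head restriction, and averaging details are assumptions for the sketch, not the project's exact extraction procedure.

```python
import torch

def select_next_key(model, tokenizer, query_sentence, key_sentences,
                    layer=16, device="cuda"):
    """Mirror the attention algorithm to pick the next key sentence: average the
    hidden states of the sampled query sentence and of each candidate key sentence
    over token positions, project them through one layer's q/k projections, and
    sample a key in proportion to the softmaxed scaled dot-product scores."""
    def mean_hidden(text):
        ids = tokenizer(text, return_tensors="pt",
                        add_special_tokens=False).input_ids.to(device)
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        # hidden_states[layer] is the residual stream entering decoder layer `layer`.
        return out.hidden_states[layer].mean(dim=1).squeeze(0)

    attn = model.model.layers[layer].self_attn  # Llama-style module layout assumed
    head_dim = attn.q_proj.out_features // model.config.num_attention_heads
    # Restrict to the first head so query and key vectors share a dimension (GQA-safe).
    q_vec = attn.q_proj(mean_hidden(query_sentence))[:head_dim]
    k_vecs = torch.stack([attn.k_proj(mean_hidden(k))[:head_dim]
                          for k in key_sentences])

    scores = k_vecs @ q_vec / head_dim ** 0.5  # scaled dot product, as in attention
    probs = torch.softmax(scores, dim=0)
    idx = torch.multinomial(probs, num_samples=1).item()
    return idx, probs
```

In training, the log-probability of the sampled index is what would feed into the RL objective sketched above.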

See this document (https://github.com/scottviteri/AttentionGuidedRL/blob/main/design_doc.md) for details.

Connection to AI Alignment

I find it useful to categorize opinions about AI by whether one considers the AI to be more like a tool, an alien, or a person. If you consider the AI a tool and think of it as an extension of human will, like a calculator, you will be opposed to alignment (Yann LeCun). If you still consider it a tool, but more like a nuclear weapon, then your favorite alignment techniques will be RLHF and centralization of power (Leopold Aschenbrenner). If you think of the AI as fundamentally alien, with a mind randomly sampled from the space of possible minds, then you should also believe that its values are randomly sampled from the space of possible values (roughly the orthogonality thesis), and your favorite alignment techniques will be formal verification and good old-fashioned AI (Eliezer Yudkowsky). If you think of the AI as a person, then you should expect that if its childhood is within human variation, its values will also be within human variation. The nice thing about this notion of AI alignment is that it does not require significant philosophical advances before we can align to the CEV (https://www.lesswrong.com/w/coherent-extrapolated-volition); rather, it suggests the proper alignment techniques are analogous to those employed by caring parents. If you hold this view, then your favorite alignment technique will be raising the AI in a Discord server full of people and AIs who treat the AI nicely (https://manifund.org//projects/act-i-exploring-emergent-behavior-from-multi-ai-multi-human-interaction?tab=comments#ba6b737d-575f-4c59-b3ec-735be7aeb2b5).

At the moment, the pre-conditions of this last view do not necessarily obtain, since our current LMs have somewhat alien training algorithms (hence the shoggoth meme). The purpose of this project is to push in that direction.

What are this project's goals? How will you achieve them?

The initial goal is to learn training sequences that are better than random; the secondary goal is for the trainee model to choose a better training sequence than a smarter language model would choose without information about the trainee. We hope to demonstrate improvements over parameter-efficient fine-tuning techniques on small datasets.

How will this funding be used?

This funding will be used exclusively for cloud compute. In the past, when training LMs with RL, I have found that the model needs to be relatively smart (e.g., Llama 3.1 8B) to sample the action space efficiently enough to make progress (see https://arxiv.org/abs/2503.01307). As a result, even with memory-efficient training procedures I have needed to use A100s or H100s, which currently cost about $2 and $3 per hour on RunPod, respectively. My previous project in this vein ended up costing ~$30K over the course of experimentation, hyperparameter tuning, and ablation studies.

Who is on your team? What's your track record on similar projects?

The team is myself and my advisor, Clark Barrett. My previous project was Markovian Transformers for Informative Language Modeling (https://arxiv.org/abs/2404.18988), which used similar RL techniques toward the goal of faithful chain-of-thought, placing me in a good technical position to implement this project. In ICLR reviews, one reviewer gave a score of 10/10, suggesting that the proposed method had the potential to become an "industry-defining standard" (https://openreview.net/forum?id=s5N7p5UjgR&noteId=GjEOgp8264).

What are the most likely causes and outcomes if this project fails?

If this project fails, it will most likely be because of training instability in the RL over long trajectories. It could also be the case that the best curriculum is not much better than a naive ordering of the data. Lastly, it could turn out that the smallest LMs that can make use of this sort of RL are too computationally demanding for a research project.

How much money have you raised in the last 12 months, and from where?

I have not raised any money in the last 12 months.

Austin Chen donated $3,000

1 day ago

I don't know Scott at all, but looking through his website, he's previously received OpenPhil funding and won an ELK prize: external credentials that his research may be worthwhile. I also like the way he thinks, both here and on the Act I project -- it seems good to me to treat AIs with the consideration one might show a child or friend. (Indeed, one of my longer term visions for Mox includes something Act I-y, by having models be participants in the space alongside humans). For these reasons, I'm happy to support Scott with a small initial grant!

šŸÆ

Scott Viteri

about 17 hours ago

Thank you @Austin! I would be interested to hear how creating an Act I-like space with Mox goes for you -- I think the design space is wide and underexplored.