Summary
Over the next year I propose to study the development and determination of values in RL & supervised learning agents, and to expand the experimental methods & theory of singular learning theory (a theory of supervised learning) to the reinforcement learning case.
All arguments for why we should expect AI to result in an existential risk rely on AIs having values which are different from ours. If we could build a well-grounded empirical & mathematical theory of the development of values during training, we could create a training story in which we could have high confidence that training results in an inner-aligned AI. I also find it likely that reinforcement learning (as a significant component of training AIs) makes a comeback in some fashion, and such a world is much more worrying than one in which we simply continue with our almost entirely supervised learning training regime.
However, previous work in this area is not only sparse, but either solely theoretical or solely empirical, with few attempts or plans to bridge the gap. Such a bridge is nevertheless necessary to achieve the goals in the previous paragraph with confidence.
I think I am personally suited to tackle this problem: I have already been working on this project for the past 6 months, I have prior experience in ML research, and I have extensive knowledge of a wide variety of areas of applied math.
I also believe that, given my limited requests for resources, I'll be able to make claims which apply to a wide variety of RL setups, as it has generally been the case in ML that the difference between scales is only that: scale. Combined with a strong theoretical component, this will let me say when my conclusions hold, and when they don't.
Theory of Impact
There are four formal mediums through which I will communicate my research and produce legible outputs:
Conference paper submissions
Conference workshop submissions
Talks to research groups
LessWrong & Alignment Forum posts
There are four primary mechanisms by which I expect to have a positive impact, to which each of the four formal mediums contributes:
Improving others' research into similar areas, both on the theory of RL side, and the empirical phenomena of RL side
Improving policy, helping to produce empirically & theoretically grounded safety standards which policy makers or auditors can enforce on people training highly capable RL agents.
Improving AGI labs' abilities to train aligned AIs using RL with empirically & theoretically grounded mechanisms
Improving my own ability to achieve my research goals
And finally, there are five primary audiences which each of these mediums targets to some extent, and through which the expected impacts will be realized:
ML researchers & practitioners
Policy makers & advocates
Leaders of research groups
Leaders of AGI labs
Potential collaborators, mentors, or funders
Conference papers target a general ML research & development audience, which usually prioritizes empirical results, and unique applications. They also target policymakers to some extent, given the clout of the medium. They are expected to be highly polished, and for the most part present a "complete" story of the phenomena under consideration.
In the short term, I expect these will mostly just improve others' research into similar areas, and improve my ability to achieve research goals via feedback from reviewers & others and via communicating my work to collaborators, mentors, and funders. In the longer term, conference papers will be a key formal communication route between my research results and key decision makers, including policy makers, advocates, and leaders of AGI labs.
When I complete a project with non-trivial results, and proposed mechanisms which I feel I have enough theoretical and empirical understanding of to be highly confident in, I expect to submit to an ML conference.
I expect each such conference-level submission to encompass a confident analysis of a particular dynamic in the development of an agent in an RL environment with variable or fixed settings of reward structure, architecture used, RL algorithm used, and other structures. Two possible examples drawn from my current Craftax environment:
The development of environmental memory in Craftax using MLPs with variable recurrent dimensions trained with PPO and the default reward structure (a minimal architecture sketch in this style follows the list).
An analysis of suddenly increasing & slowly decreasing circuit ensemble diversity in Craftax using recurrent MLPs trained with PPO and reward structure varied as part of the analysis.
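As a concrete illustration of what "MLPs with variable recurrent dimensions" could look like in practice, here is a minimal sketch of a recurrent MLP policy core in JAX, with the recurrent dimension exposed as the swept variable. The layer sizes, parameter names, and action count are illustrative assumptions rather than the exact Craftax architecture, and the PPO training loop is omitted entirely.

```python
# Minimal sketch of a recurrent MLP policy core (illustrative assumptions only;
# not the exact Craftax architecture, and PPO training is omitted).
import jax
import jax.numpy as jnp

def init_params(rng, obs_dim, recurrent_dim, num_actions, hidden=64):
    """Initialize a small MLP encoder, a vanilla recurrent cell, and a policy head."""
    k1, k2, k3, k4 = jax.random.split(rng, 4)
    scale = 0.1
    return {
        "enc": scale * jax.random.normal(k1, (obs_dim, hidden)),
        "w_in": scale * jax.random.normal(k2, (hidden, recurrent_dim)),
        "w_rec": scale * jax.random.normal(k3, (recurrent_dim, recurrent_dim)),
        "head": scale * jax.random.normal(k4, (recurrent_dim, num_actions)),
    }

def policy_step(params, h, obs):
    """One timestep: encode the observation, update the recurrent state, emit action logits."""
    x = jnp.tanh(obs @ params["enc"])
    h = jnp.tanh(x @ params["w_in"] + h @ params["w_rec"])  # environmental memory lives here
    return h, h @ params["head"]

# Sweep the recurrent dimension, the variable of interest in the first example above.
rng = jax.random.PRNGKey(0)
for recurrent_dim in (4, 16, 64):
    params = init_params(rng, obs_dim=32, recurrent_dim=recurrent_dim, num_actions=17)
    h, logits = policy_step(params, jnp.zeros(recurrent_dim), jnp.ones(32))
    print(recurrent_dim, logits.shape)
```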
Such papers will not only argue for the existence of the relevant dynamic in the RL environment, but present & test theoretical models of how the relevant dynamic occurs. Using the above circuit diversity example, perhaps we can propose an evolutionary model in terms of the equations describing evolutionary diversity in complex systems theory, or perhaps a fast diversification mechanism which occurs when new reward sources are found in the environment plus a slow optimization mechanism which occurs when no new reward sources have been found, describing the dynamics in terms of some action minimization principle.
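To make the evolutionary framing slightly more concrete, one purely illustrative form such a model could take is the standard replicator-mutator equation, with ensemble diversity measured as the entropy of the circuit frequency distribution. Interpreting x_i as the frequency of a given circuit in the policy and f_i as its reward-derived fitness is an assumption I would need to validate against experiments, not an established result.

```latex
% Illustrative sketch only: x_i = frequency of circuit (ensemble member) i,
% f_i = its reward-derived fitness, q_{ji} = probability that updates to
% circuit j produce a variant of type i.
\begin{align}
  \dot{x}_i &= \sum_j x_j f_j q_{ji} - \bar{f}\, x_i ,
  \qquad \bar{f} = \sum_j x_j f_j , \\
  D(t) &= -\sum_i x_i(t) \log x_i(t)
  \qquad \text{(circuit-ensemble diversity as entropy)} .
\end{align}
```

In this picture, the sudden increase in diversity would correspond to a transient jump in the mutation-like term when a new reward source is found, and the slow decrease to selection under a roughly fixed fitness landscape.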
In order to foster others working in similar areas, such papers will include the GitHub repositories I used & worked on, along with documentation and open problems.
Conferences I expect to submit to include the following:
AAAI (AAAI Conference on Artificial Intelligence)
ICLR (International Conference on Learning Representations)
AAMAS (International Conference on Autonomous Agents and Multiagent Systems)
IJCAI (International Joint Conference on Artificial Intelligence)
ICML (International Conference on Machine Learning)
ECAI (European Conference on Artificial Intelligence)
NeurIPS (Conference on Neural Information Processing Systems)
Conference workshop submissions communicate to a more targeted selection of an ML audience with a particular interest in the subject at hand. Usually they are focused more on preliminary results, or ideas, or partial projects not yet ready for a full conference paper.
Both in the short and long term I expect these to be a large component of communicating with potential collaborators and with researchers interested in doing similar work. This can be expected to impact other researchers' work by getting them to work on similar subjects or to incorporate my findings in their own work, and to improve my own research via feedback from researchers with context in specialized areas. Due to the nature of conference workshops, the feedback received here is extra valuable, as I would not yet be done working on the project in question.
I expect to be submitting to workshops targeting AI safety & alignment, high dimensional learning dynamics, mechanistic interpretability, inductive biases (/interpretability through time), and reinforcement learning.
Talks to research groups communicate to an even more specialized audience, and they have many of the same effects as conference workshops, but at a smaller scale. They are focused much more on getting the audience excited to learn more, and therefore are often paired with technical documentation such as a paper or other writeup.
Early on, as with conference workshops & papers, they are targeted more towards getting feedback, collaborators, and mentors, affecting my own research.
In the medium term, they are targeted more towards communicating exciting results, and getting people to adjust their own research to take advantage of those results.
And in the longer term, I expect to shift my communication more towards decision-makers than researchers, like policy makers & advocates, and leaders of AGI labs, and try to influence governmental or AGI lab policies & standards.
I'll probably mostly be giving talks to the research groups mentioned in the neglectedness section, like the Causal Incentives Working Group, the UnSearch group, FAR AI, or Timaeus.
Finally, LessWrong and Alignment Forum posts are similar to blog posts. They will advertise & present distillations of the other mediums, and also outline the more conceptual progress I've made during research, or very small experiments with neat results. Many of my potential funders likely read LessWrong, as do many of my potential collaborators and current & future leaders of research labs, so this medium affects both my own research and others' research priorities.
Goals
This project has long term goals, which I don't expect to be finished even within the year, medium term goals, which I do expect to be finished within 1-2 years, and short term goals, which I expect with confidence to be finished within 6 months. Outputs of all goal levels can be assumed to include the standard communication routes: LessWrong & Alignment Forum sequences, ML conference paper submissions, ML workshop submissions, presentations, or videos. For concrete dates see the timeline below. In general, I aim to submit a paper or workshop submission to each of the major ML conferences.
Each short/medium/long-term goal includes both experimental, and theoretical components.
In the short term I expect to have significant knowledge about the empirical development of values in the Craftax environment, along with simplified & stylized toy models which explain various motifs of that development. This understanding will include both a behavioral and mechanistic account of how the model changes, and why.
The medium term is about scaling. We answer questions such as: Under exactly what circumstances do the patterns identified appear? When radically changing the environment, reward structure, RL algorithm, or architecture, what new patterns emerge? What patterns stay the same? Do we need to revise our stylized toy models?
On the theoretical front, we unify various such robust stylized models, and contextualize them within broader mathematical disciplines such as singular learning theory, computational mechanics, or complex systems theory.
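As one example of the kind of quantity such a contextualization would target: in the supervised setting, singular learning theory's asymptotic expansion of the Bayesian free energy is governed by the learning coefficient λ (the RLCT), and whether some analogue of λ tracks stage transitions in RL training is exactly the sort of question I have in mind. The expression below is the standard supervised-learning result, stated only as a reference point, not something already established for RL.

```latex
% Watanabe's asymptotic free energy expansion (supervised-learning setting):
% n = sample count, L_n(w_0) = empirical loss at the optimal parameter,
% \lambda = learning coefficient (RLCT), m = its multiplicity.
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log \log n + O_p(1)
```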
The long term vision of this project is to get a healthy theory-practice feedback loop regarding the development of values in reinforcement learning agents, which incorporates & extends the most important lessons of singular learning theory, computational mechanics, and other descriptive mathematical accounts of learning.
Importance
A deep, grounded, theoretically & experimentally justified account of the ways in which a reinforcement learning agent generalizes from its training to its deployment distribution seems to me essential to ensuring that ASI alignment goes right in the long term.
Why a theory of values?
Whether you plan on ensuring your AIs always follow instructions, are in some sense corrigible, have at least some measure of pro-sociality, or are entirely value aligned, you are going to need to know what the values of your AI system are, how you can influence them, and how to ensure they're preserved (or changed in only beneficial ways) when you train them (either during pretraining, post-training, or continuously during deployment). Many of the arguments for why we should expect AI to go wrong assume, as a key component, that we don't know how such training will affect the values of our agents.
We need theories to go along with our empirical accounts so we can be confident about whether & in what ways our results will generalize to situations we've never seen before. Machine learning as a field is empirical by nature, but that blind empiricism is why we're in so much danger. Even if it's hard (which I disagree with), this is not a hole that can be dug out of with even more blind empiricism. For this reason I heavily prioritize constructing mathematical models with testable predictions and conditions to model the empirical results I get, unifying these models, and ultimately expressing them in terms of more developed mathematical disciplines which have solid foundations and applications to modern ML alignment.
Why reinforcement learning?
The simplest argument to make here is that reinforcement learning is the easiest way of producing systems which can be thought of as having values or optimizing for goals, in contrast with supervised learning, which produces such systems only after an immense amount of training, or after a small amount of fine-tuning of already very complicated pre-trained models.
A more complicated argument would be that all non-ML real world generally capable adaptive systems are ruled by a general, flexible, and slow outer optimization criterion which punishes incoherency and continuously selects for systems which do well according to that criterion. It's how evolution works, it's how business works, it's how culture works, and it's how brains work. The introduction of such a process which can exploit large & growing piles of compute is, to me, far more worrying, and a very likely deviation from "business-as-usual" scaling of LLMs. And the closest extant object (and therefore likely the object which will be improved upon to get to such a system) is currently deep RL.
Even if such a system appears as the cherry on the metaphorical supervised learning cake, it would still play a large role in determining the values which the entire model pursues. A case in point here is RLHF (or RLAIF). ChatGPT, Claude, and their open source counterparts begin as incoherent base models, and are optimized into the comparatively much more coherent (though still mildly incoherent) chat models we have today. Their values are almost entirely determined by the process used to reward or punish them. In my estimation, however, such systems are too complex to serve as the starting point for a theory of the process as a whole, which is why I'm not studying them directly, but it is also why I hope to scale up the analysis techniques I'm using to the point where they can analyze such systems.
Tractability
There are a number of reasons why someone may find this project intractable. I will categorize the relevant concerns, and my reasons for being optimistic anyway, into roughly two categories: criticisms about whether I can competently execute on the project, and criticisms about whether the project can be accomplished in the first place.
Me
On the experimental side, I've already found preliminary results during MATS 5.0 & 5.1 in the Craftax and procgenAISC maze environments. For example, we are finding a consistent pattern in the Craftax environment whereby the contexts in which agents execute particular heuristics & action sequences (in the terminology of shard theory, the contexts in which certain shards are activated) suddenly grow enormously in size after the first time the agent encounters a rewarding state, then shrink as the agent figures out the particular predecessors to that state.
Concretely speaking, in the Craftax environment the agent is rewarded every time it unlocks a particular achievement. One such achievement is placing down a plant, but the agent is only rewarded for the first plant it places. We observe that after first figuring out how to place down plants, the agent learns that it is a good idea to place down plants all the time (rather than just once), so in subsequent episodes it places many plants, and only later does this behavior decay back down to the optimal single plant.
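To illustrate how lightweight the behavioral side of that measurement is, here is a minimal sketch of the per-checkpoint statistic behind the plant-placing observation. The action id and the shape of the logged rollouts are placeholder assumptions standing in for my actual instrumentation.

```python
# Minimal sketch of the behavioral statistic behind the plant-placing observation.
# Assumes rollouts have already been logged as integer action arrays per episode;
# PLACE_PLANT_ACTION is a placeholder, not the real Craftax action id.
import numpy as np

PLACE_PLANT_ACTION = 8  # assumed id for illustration

def plants_placed_per_episode(episode_actions: list[np.ndarray]) -> np.ndarray:
    """Count how many times the agent takes the place-plant action in each episode."""
    return np.array([(acts == PLACE_PLANT_ACTION).sum() for acts in episode_actions])

def mean_plants_by_checkpoint(rollouts_by_checkpoint: dict[int, list[np.ndarray]]) -> dict[int, float]:
    """Average plant placements per episode at each training checkpoint.

    The expected signature of the effect described above: the mean jumps well above 1
    shortly after the agent first earns the plant achievement, then decays back toward 1.
    """
    return {
        step: float(plants_placed_per_episode(eps).mean())
        for step, eps in sorted(rollouts_by_checkpoint.items())
    }

# Toy usage with fabricated rollouts, just to show the shapes involved:
rng = np.random.default_rng(0)
fake = {step: [rng.integers(0, 17, size=200) for _ in range(16)] for step in (1_000, 10_000, 100_000)}
print(mean_plants_by_checkpoint(fake))
```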
In the previous funding proposal I asked for funding to perform a similar analysis on the procgenAISC maze environment. This analysis was largely completed, but the results were too simple to warrant the standalone NeurIPS publication promised in that proposal, so I pivoted to the more complicated Craftax environment. See more discussion in the previous project's failures section below.
And of course I have done ML research before, in particular co-authoring the GENIES paper.
From the standpoint of my ability to handle the theoretical end of this project, I have studied numerous fields which present models & ways of thinking relevant to the theoretical problems I expect to arise in this project, including the theory behind reinforcement learning, complex systems theory, probability theory, statistics, convex optimization, economics, computational neuroscience, linear algebra, differential equations, physics, and of course singular learning theory. And I'm learning more such subjects every day.
I've also proven myself in the past to be a competent conceptual researcher among those I've worked & talked with, including Alex Turner, John Wentworth, Arun Jose, Nicholas Kees, and others. I have published LessWrong posts including:
Don't design agents which exploit adversarial inputs
My hopes for alignment: Singular learning theory and whole brain emulation
Neuroscience and Alignment
Pessimistic Shard Theory
Collective Identity
Value drift threat models
For more information on me, see my resume, my github, and my LessWrong page
The Project
There may be some reasonable concern about whether or not the products of our RL algorithms even have values in the first place. For example, there is limited evidence about exactly how far into the future our models are doing lookahead. On this subject I note that there do in fact exist very competent models for which it would be very strange if they weren't exhibiting goal-directed behavior in some sense, like no-search AlphaZero (even without the explicit MCTS rollouts, AlphaZero's policy network is still a competent Go and chess player). As a concrete example of an RL system doing lookahead, consider Jenner et al.'s Evidence of Learned Look-Ahead in a Chess-Playing Neural Network, in which they find evidence that Leela Chess Zero (a transformer trained on self-play chess games) does lookahead.
Now, I am not directly studying something as complex or expensive as Leela Chess Zero, but there are likely much simpler systems which do search, and I expect the dynamics we see in sub-search systems to be very informative for systems that use search. This is especially true in Craftax, where (for example) the main stage transitions seem to be caused by the model finding a new source of reward, which should be a common phenomenon in complicated & diverse environments. We need to establish the most basic dynamics before we can move on to understanding more complicated ones, and the history of neural network scaling has shown that we can in fact predict how larger and more advanced systems will behave by studying smaller systems.
Previous project’s failures
In my previous funding proposal I proposed to do developmental interpretability on the values of a maze-solving RL agent in the procgenAISC environment used for Langosco et al.'s goal misgeneralization paper, but there were a few difficulties with finding actually interesting learning dynamics in this environment.
The environment was simple in all the ways it shouldn't be simple, and complex in all the ways it shouldn't be complex, at least for a preliminary project.
The dynamics of the environment were very simple, which led to a learning curve characterized by a very sudden increase in reward and very subtle algorithmic changes during that increase. This made it difficult to propose or mine for hypotheses, especially hypotheses which would say something nontrivial about environments beyond the simple maze environment, or which would even be interesting to modern RL researchers. The overall dynamics just weren't all that interesting.
At the same time, the backend for the environment was complex. The actual library was written in both Python & C++, and for much of the project I was using the interface developed by Team Shard in their mechanistic interpretability project, which had poor documentation and flexibility. Similarly, the architecture I was training was overpowered for what I was trying to do, and its complexity made it difficult to interpret what was going on. And because of the complexity of the code base and model, it took a long time to train, so I couldn't really study how universal the few dynamics I did find were.
I also made a few mistakes during the execution of the project. For example, I was somewhat dogmatic in the methods I was using: I should have pivoted far more than I did when it turned out my first few attempts at applying a method weren't all that informative. I also should have been quicker to transition to a different environment, for example Craftax, the environment I'm working with now.
I think my current project largely corrects the above issues, and I've learned from them as well. Craftax has very unsubtle environmental dynamics: it's very easy to recognize big changes in the model's strategy, for example. Its backend code is very clean, it runs very quickly, and the model itself is very simple. I am also pursuing a less dogmatic approach to research, and am quicker to drop experiments early if they don't seem productive.
More generally, my personal takeaway from the previous project has been about balancing complexity and simplicity. To begin this project, I want everything as simple & easy to understand as possible, with only one piece of complexity which would be an interesting thing to understand. For Craftax this "one piece" of complexity is the special exploratory dynamic caused by the rewards-for-achievements mechanism.
Neglectedness
There are a few groups who are working on related questions, and papers which are relevant for assessing the originality of my plan. I list a few I'm aware of below.
Orseau et al.'s Agents and Devices: A Relative Definition of Agency formalizes & experimentally validates the concept of the intentional stance by defining an "agent" as a system which is best described as optimizing some utility function, and a "device" as a system best described by its input-output mapping, and uses Bayes rule to probabilistically categorize systems into the two buckets.
Everitt et al.'s Causal Incentives Working Group uses causal diagrams to define agency, formalize mechanisms to eliminate proxy alignment & wireheading, and otherwise do other kinds of agent foundations research.
Colognese & Jose's High-level interpretability: detecting an AI's objectives outlines properties of objectives, then uses those properties to derive probe-training mechanisms which empirically identify the objectives of RL agents in toy procgen goal-misgeneralization environments.
Jenner et al.'s Evidence of Learned Look-Ahead in a Chess-Playing Neural Network, as discussed above, provides a concrete existence proof of look-ahead in Leela Chess Zero.
Team Shard's mechanistic interpretability analysis of the policy of a maze solving agent under Alex Turner in MATS 3.0, along with Alex Turner & Quintin Pope's shard theory more generally.
Ivanitsky et al.'s UnSearch group, which aims to decode how transformers perform search, and represent their goals & subgoals.
The original AlphaZero paper studied the propensity of the model to use different chess openings over time, which is not so values focused, but is relevant for understanding the inductive bias of RL systems.
Along similar lines, McGrath et al.'s Acquisition of Chess Knowledge in AlphaZero used linear probes and other simple mechanistic interpretability tools to identify the location of different already-known chess concepts in AlphaZero, and did some comparison between the progression of AlphaZero's strategies and human strategies over time (i.e., compared AlphaZero's recommended moves at different training steps to historical records of human games going back to the 1400s).
FAR AI has done extensive research into the mechanistic reasons why their adversarially trained policies do so well against KataGo, and last I heard Adrià Garriga-Alonso was working on mechanistic interpretability into how RL models do optimization & search in simple environments.
Jaderberg et al. (Google DeepMind)'s Human-level performance in 3D multiplayer games with population-based reinforcement learning among other things, automatically detects & tracks the emergence of different strategies in a multi-agent capture the flag game.
And of course Langosco et al.'s original Goal Misgeneralization in Deep Reinforcement Learning.
This past work, like much in ML and alignment, can largely be categorized into theoretical framings and experiments, with little cross-talk or clear cross-application between the two. The experiments and their results do not clearly provide evidence for or rule out any proposed theories or models, and the theoretical framings & proofs still have a long way to go before they can be verified by any experiments we can perform, or applied to any problem we may have, if they even can or will be developed to such an extent.
My project, on the other hand, develops theory and experiments in conjunction with one another, with theories & models always applicable to the observed phenomena at hand, and experiments chosen so as to confirm or disprove the relevant theories.
The above groups and past work also rarely study the development of the mechanisms or strategies which they find, even though RL is relatively path dependent when stopped short of optimality (which in practice it will be for highly capable systems), so that development matters a great deal.
Timeline
See the below Submission deadlines section for precise info on the conferences I'm keeping my eye on, and the corresponding submission deadline. These can be clustered approximately seasonally, with fall, winter, and spring conference seasons.
Workshop deadlines are by nature less predictable, as individual workshops are allowed to choose their own submission deadlines, so I don't list them here. But they also come in seasons, so the below timeline will come in seasons as well.
I am aiming to have a conference-level or workshop-level analysis submitted to a conference or workshop (as appropriate) approximately once each season, with the AAAI conference, whose deadline is in August, being the target for my current Craftax work.
After this current Craftax work is submitted, during the fall I plan on switching to a simpler environment that seems likely to induce some kind of search or lookahead in models, using simple checks (like those in Jenner et al.'s Evidence of Learned Look-Ahead in a Chess-Playing Neural Network) to verify this, and seeing whether the sorts of dynamics I've identified in Craftax can be induced or predicted in that environment, in order to check whether lookahead complicates the dynamics we've found.
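As a rough sketch of what I mean by "simple checks", the block below trains a linear probe on cached hidden activations to test whether the agent's internal state already encodes an event that only occurs several steps later. This is a loose analogue of the probing side of Jenner et al.'s methodology rather than a reimplementation of it, and the activation arrays and labels are stand-ins for data I would actually collect.

```python
# Rough sketch of a linear lookahead probe (a loose analogue of the probing side of
# Jenner et al., not a reimplementation). Inputs are stand-ins for logged data:
#   activations: (n_states, d) hidden states collected mid-episode
#   future_event: (n_states,) binary label for whether some event occurs k steps later
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def lookahead_probe_accuracy(activations: np.ndarray, future_event: np.ndarray) -> float:
    """Held-out accuracy of a linear probe predicting a future event from current activations.

    Accuracy well above the base rate is (weak) evidence that the representation
    already encodes information about what the agent is going to do or encounter.
    """
    x_tr, x_te, y_tr, y_te = train_test_split(
        activations, future_event, test_size=0.25, random_state=0, stratify=future_event
    )
    probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return float(probe.score(x_te, y_te))

# Toy usage with synthetic data, just to show the interface:
rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 64))
labels = (acts[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)  # fabricated signal
print(lookahead_probe_accuracy(acts, labels))
```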
I'm not certain what I'll do after that, perhaps push for a greater level of realism in the environments I'm using by looking at RLHF (or RLAIF) or toy models thereof, perhaps scale up the models I'm using or use more capable RL algorithms (such as Curious Replay, which has gotten the highest score on the leaderboard for this environment), or some kind of self-play regime, which seems a likely way for RL and LLMs to interact. Perhaps spend a season trying to pin down the theoretical questions better. Perhaps dig deeper into the Craftax environment, and see whether the dynamics change as we scale up to the Crafter environment.
What I choose here depends on conversations & feedback I get from those with a greater level of research experience & taste than I have. Including reviewers, folks at Timaeus, professional connections, and friends.
Submission deadlines
AAAI (AAAI Conference on Artificial Intelligence)
ICLR (International Conference on Learning Representations)
AAMAS (International Conference on Autonomous Agents and Multiagent Systems)
IJCAI (International Joint Conference on Artificial Intelligence)
ICML (International Conference on Machine Learning)
ECAI (European Conference on Artificial Intelligence)
NeurIPS (Conference on Neural Information Processing Systems)
Budget
The figures below are annualized over 1 year. If I am funded at less than the minimum amount for 1 year, I'll work for a proportionally shorter amount of time before applying for more funding. For these purposes, the smallest time-span I would work in is a 3-month chunk, which at the minimum budget comes to $18,983 (one quarter of $75,932).
Minimum
Salary: $60k
Compute: $432/month (based on current Vast AI prices) × 12 months = $5,184
Federal Income Tax: $7,880
California Income Tax: $2,868
Total: $75,932
Median
Salary: $100k
Compute: $432/month (based on current Vast AI prices) × 12 months = $5,184
Office: $350/month × 12 months = $4,200
Federal Income Tax: $21,276
California Income Tax: $8,347
Total: $139,007
Maximum
Salary: $120k
Compute: $432/month (based on current Vast AI prices) × 12 months = $5,184
Office: $350/month × 12 months = $4,200
Hardware, Conferences, Etc: $5k
Federal Income Tax: $28,476
California Income Tax: $11,137
Total: $173,997