
Alignment Is Hard

Active grant
$6,070 raised
$40,000 funding goal

Project summary

Alexander Bistagne is seeking funding to formalize the following argument:

If we can't tell whether a black-box program will loop forever, how could we verify that a black-box agent will care about us forever instead of eventually betraying us? Answer: in the worst case, we can't.

To make even the minimal formal argument, this project needs to clearly articulate a large number of assumptions about what we mean by betrayal, a black-box agent, time, the universe, and other things. It will do this through the language of theoretical computer science: Turing machines and formal language theory. After defining everything appropriately, the project will reduce the complement of the halting problem to testing the alignment of a black-box agent. I previously made decent progress on this project, but got sidetracked because finding a job took priority. I think the project is feasible because I have made creative attempts at alignment before, with a school project labeled "Suicidal AI", and I have attempted complexity theorems (see LvsNC1 on my GitHub).
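To make the reduction concrete, here is a minimal illustrative sketch (my own paraphrase in Python, not the paper's construction or definitions). It models a program P as a generator that yields once per computation step and returns when it halts, and wraps it as an agent that cooperates until P halts and betrays afterwards; a verifier for "this agent never betrays" would therefore decide the complement of the halting problem.

```python
# Illustrative sketch only: a paraphrase of the standard halting-problem
# reduction, not the paper's actual definitions or construction.

def make_agent(program_p):
    """Wrap an arbitrary program P (a zero-argument generator function) as an
    agent that cooperates while P is still running and betrays once P halts."""
    simulation = program_p()
    halted = False

    def agent_step():
        nonlocal halted
        if not halted:
            try:
                next(simulation)   # advance the simulation of P by one step
            except StopIteration:
                halted = True      # P halted: the agent betrays from now on
        return "betray" if halted else "cooperate"

    return agent_step


def loops_forever():
    while True:
        yield  # one computation step per yield; never returns


def halts_after(n):
    def program():
        for _ in range(n):
            yield
    return program


# The wrapped agent never betrays iff P never halts. So any verifier that
# decides "this black-box agent never betrays" would also decide the
# complement of the halting problem, which is undecidable.
good_agent = make_agent(loops_forever)
bad_agent = make_agent(halts_after(3))
print([good_agent() for _ in range(5)])  # cooperates forever
print([bad_agent() for _ in range(5)])   # cooperates, then betrays
```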

Project goals

The goal of this project is to convincingly argue that Alignment is Hard. My motivation is that if we can convince the theoretical computer science community that this problem is important, they might be able to develop enough theory to do something about it, or have enough influence to throw the brakes on AGI projects. The concrete deliverable is to get the claim "Alignment is Hard" published in venues that current AGI designers might respect.

Concrete steps to achieve those goals

Type up the already existing rough draft.

Edit the rough draft to remove discussions about cosmology.

Submit a draft to the Alignment Forum.

Review & incorporate feedback I get on LessWrong.

Submit the project/paper to FOCS or other conferences on theoretical computer science.

Work on removing agents' immutability assumption.

Work on stop button variation of this problem.

How will this funding be used?

Basic Project: 6,000 USD

5,000 - 1 month salary

500 - Payroll insurance & taxes

500 - Travel & board for conferences

This would let me work on this project as my top priority for 1 month. I would submit the core paper to a theoretical CS conference.

Organized Project: 40,000 USD

30,000 - 6 months salary

3,000 - Payroll insurance & taxes

4,000 - 26 meetings with an editor/mentor

3,000 - Travel & board for conferences

More funding would allow me to attempt to weaken or vary the assumptions and make more general claims about alignment. The most notable assumptions to weaken concern the mutability of the agent’s code and the agent being in an enclosed space. Another angle is to approach the problem from agents with a known utility function instead of a known architecture.


What is your track record on similar projects?

I think the following projects show that I am capable of thinking about AI in creative ways, finishing academic papers when deadlines are present, and understanding the deep mathematical field of computational complexity theory.

Suicidal AI paper https://drive.google.com/file/d/1ElnLoRbEfsAXNymIYg1nnkOisYCegSCm/view?usp=drive_link - This paper, written for Lise Getoor’s Algorithms & Ethics class, won the most-creative award. It attempted to provide counterexample-style evidence against the claim that all agents want to live.

Matching Algorithms paper https://drive.google.com/file/d/1HYYfSjl38LnyBPeeBX43KtyvnMq-AoMP/view?usp=sharing - This paper was for my senior seminar in mathematics at UCSC. It mostly summarizes the topic for an undergraduate math audience.

Complexity Theorem Attempt LvsNC1

https://github.com/Alexhb61/Sorting_Algorithms/blob/main/LvsNC1_attempt - This rough draft aims to prove a nontrivial complexity theory claim, but has an unproven assumption and is thus incomplete.

Singular Value Decomposition attempt https://github.com/Alexhb61/Matrix_Algorithms/blob/main/SVD_draft1.pdf - This rough draft aimed to prove a nontrivial claim about the complexity of basic mathematical operations; specifically that singular value decomposition can be done in subcubic time in the theoretical setting. This paper needs a lot of polish.

TSort https://github.com/Alexhb61/Sorting_Algorithms/blob/main/truncation_sorts/Tsort

This rough pseudocode outlines a sorting algorithm with sub-O(n log n) computational complexity for numbers with significantly fewer than n bits.

Cliques Via Cuts

https://github.com/Alexhb61/graph_algorithms/blob/main/Cliques_via_cuts.pdf

An ambitious attempt at disproving the strong exponential time hypothesis. Recently made into a shareable rough draft. This is more evidence of computational complexity work.

How could this project be actively harmful?

This research might convince people to give up on aligning black box agents instead of giving up on making black box agents.

What other funding is this person or project getting?

None at the moment. I have applied to the Nonlinear Network, but did not receive any concrete indication that funding was available.


Alexander Bistagne

7 months ago

Progress update

What progress have you made since your last update?

The paper was submitted to and rejected from the Alignment Forum. After reading it with a friend, I noticed serious sequencing issues and unnecessary definitions. I decided I needed a break.

I have since found people willing to give feedback on future drafts, and have joined the Ronin Institute where I might also receive feedback.

What are your next steps?

I intend to write a less verbose draft with more examples.
This draft will be posted to my GitHub.
I plan on going through at least 3 rounds of comments, review, and editing before giving a lightning talk at the Ronin Institute, where I might get more feedback. After another drafting phase, I will submit another post to the Alignment Forum. After that, I will need to find a mentor or co-author before submitting for peer review. I will consider the project over after peer review or thorough refutation.

Is there anything others could help you with?

  1. Without more funding, I can only reliably commit to 10 hours a week on this project.

  2. This leg of the project aims to include more examples.

  3. I am looking for more feedback.

  4. I am looking for a co-author or mentor to help with formalization before peer review. Contact me via email if you are interested or have ideas.

If others are interested in giving private examples or feedback, contact me on Discord or email me. Public examples or feedback can be given through GitHub issues.


Alexander Bistagne

about 1 year ago

Progress update

Post available on LessWrong and submitted to the Alignment Forum.

https://www.lesswrong.com/posts/JxhJfqfTJB9dkq72K/alignment-is-hard-an-uncomputable-alignment-problem-1


Alexander Bistagne

about 1 year ago

The project is on GitHub: https://github.com/Alexhb61/Alignment/blob/main/Draft_2.pdf

I will add citations and submit to the Alignment Forum tomorrow.


Alexander Bistagne

about 1 year ago

This project is nearly at its target, but hit a delay near the beginning of September, as I needed to take up other work to pay bills. Hopefully, I will post the minimal paper soon.


Austin Chen

over 1 year ago

I'm not familiar with Alexander or his work, but the votes of confidence from Anton, Quinn, and Greg are heartening.

Approving as the project seems within scope for Manifund (on longtermist research) and not likely to cause harm.

donated $3,800

Greg Colbourn

over 1 year ago

This research seems promising. I'm pledging enough to get it to proceed. In general we need more of this kind of research to establish consensus on LLMs (foundation models) basically being fundamentally uncontrollable black boxes (that are dangerous at the frontier scale). I think this can lead - in conjunction with laws about recalls for rule breaking / interpretability - to a de facto global moratorium on this kind of dangerous (proto-)AGI. (See: https://twitter.com/gcolbourn/status/1684702488530759680)


Alexander Bistagne

over 1 year ago

A technical detail worth mentioning: here is the main theorem of the 6K project:

Proving that an immutable-code agent with a Turing-complete architecture in a Turing-machine-simulatable environment has nontrivial betrayal-sensitive alignment is co-R-hard.

The paper would define nontrivial betrayal-sensitive alignment and some constructions on agents needed in the proof.


Alexander Bistagne

over 1 year ago

Correction: co-RE is the class, not co-R. It is the set of problems reducible to the complement of the halting problem.
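For reference, here is a hedged paraphrase of the theorem's intended shape with the corrected class. The set name NBSA below is shorthand introduced just for this sketch, not notation from the paper, and the statement is a paraphrase rather than the paper's exact wording.

```latex
% A paraphrase of the theorem's shape, using the corrected class co-RE.
% "NBSA" is a label introduced here for readability only.
\documentclass{article}
\usepackage{amsmath, amssymb, amsthm}
\newtheorem{theorem}{Theorem}
\begin{document}
\begin{theorem}[paraphrase]
Let $A$ be an agent with immutable code and a Turing-complete architecture,
operating in an environment simulatable by a Turing machine. Then deciding
whether $A$ has nontrivial betrayal-sensitive alignment is co-RE-hard:
there is a many-one reduction
$\overline{\mathrm{HALT}} \le_m \mathrm{NBSA}$, where
$\overline{\mathrm{HALT}}$ is the complement of the halting problem and
$\mathrm{NBSA}$ is the set of (encodings of) agents with nontrivial
betrayal-sensitive alignment.
\end{theorem}
\end{document}
```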


Alexander Bistagne

over 1 year ago

@alexhb61

Conditional on the 6K goal being reached, I have committed to submitting an edited draft to the Alignment Forum on August 23rd.

donated $1,000

Anton Makiievskyi

over 1 year ago

I encouraged Alexander to apply here, based on the fact that some people in the Nonlinear Network liked his application and that the funding goal is quite low. I don't think the 40K budget would be a good bet, but 6K could be worth a try. I'm interested to learn what regrantors think about it.

I'll offer a $1,000 donation in the hope that others add to it.


Alexander Bistagne

over 1 year ago

Thanks for the encouragement and donation.

The 40K max would be a much larger project than the 6K project, which is what I summarized.

6K would cover editing:

- An argument refuting testing anti-betrayal alignments in Turing-complete architectures

- An argument connecting testing alignment to training alignment in single-agent architectures

40K would additionally cover developing and editing:

- Arguments around anti-betrayal alignments in deterministic or randomized, P- or PSPACE-complete architectures

- Arguments around short-term anti-betrayal alignments

- Arguments connecting do-no-harm alignments to short-term anti-betrayal alignments

- Arguments refuting general solutions to the stop-button problem which transform the utility function in a computable-reals context

- Arguments around general solutions to the stop-button problem with floating-point utility functions

- Foundations for modelling mutable agents or subagents