
Alignment Is Hard

Active grant
$6,070 raised
$40,000 funding goal

Project summary

Alexander Bistagne is seeking funding to formalize the following argument:

If we can't tell whether a black-box program will loop forever, how could we verify that a black-box agent will care about us forever instead of eventually betraying us? Answer: in the worst case, we can't.

To make even the minimal formal argument, this project needs to clearly articulate a large number of assumptions about what we mean by betrayal, a black-box agent, time, the universe, and other things. It will do this through the language of theoretical computer science: Turing machines and formal language theory. After defining everything appropriately, the project will reduce the complement of the halting problem to testing the alignment of a black-box agent. I previously made decent progress on this project, but got sidetracked because finding a job took priority. I think the project is feasible because I have made creative attempts at alignment before, with a school project labeled "Suicidal AI", and I have attempted complexity theorems (see LvsNC1 on my GitHub).
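To make the reduction concrete, here is a minimal illustrative sketch (my own paraphrase in Python, not the paper's construction or definitions). It models a program P as a generator that yields once per computation step and returns when it halts, and wraps it as an agent that cooperates until P halts and betrays afterwards; a verifier for "this agent never betrays" would therefore decide the complement of the halting problem.

```python
# Illustrative sketch only: a paraphrase of the standard halting-problem
# reduction, not the paper's actual definitions or construction.

def make_agent(program_p):
    """Wrap an arbitrary program P (a zero-argument generator function) as an
    agent that cooperates while P is still running and betrays once P halts."""
    simulation = program_p()
    halted = False

    def agent_step():
        nonlocal halted
        if not halted:
            try:
                next(simulation)   # advance the simulation of P by one step
            except StopIteration:
                halted = True      # P halted: the agent betrays from now on
        return "betray" if halted else "cooperate"

    return agent_step


def loops_forever():
    while True:
        yield  # one computation step per yield; never returns


def halts_after(n):
    def program():
        for _ in range(n):
            yield
    return program


# The wrapped agent never betrays iff P never halts. So any verifier that
# decides "this black-box agent never betrays" would also decide the
# complement of the halting problem, which is undecidable.
good_agent = make_agent(loops_forever)
bad_agent = make_agent(halts_after(3))
print([good_agent() for _ in range(5)])  # cooperates forever
print([bad_agent() for _ in range(5)])   # cooperates, then betrays
```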

Project goals

The goal of this project is to convincingly argue that Alignment is Hard. My motivation is that if we can convince the theoretical computer science community that this problem is important, they might be able to develop enough theory to do something about it, or have enough influence to throw the brakes on AGI projects. The concrete deliverable is to get the claim "Alignment is Hard" published in venues that current AGI designers might respect.

Concrete steps to achieve those goals

Type up the already existing rough draft.

Edit the rough draft to remove discussions about cosmology.

Submit a draft to the Alignment Forum.

Review & incorporate feedback I get on LessWrong.

Submit the project/paper to FOCS or other conferences on theoretical computer science.

Work on removing agents' immutability assumption.

Work on stop button variation of this problem.

How will this funding be used?

Basic Project: 6,000 USD

5,000 - 1 month salary

500 - Payroll insurance & taxes

500 - Travel & board for conferences

This would let me work on this project as my top priority for 1 month. I would submit the core paper to a theoretical CS conference.

Organized Project: 40,000 USD

30,000 - 6 months salary

3,000 - Payroll insurance & taxes

4,000 - 26 meetings with an editor/mentor

3,000 - Travel & board for conferences

More funding would allow me to attempt to weaken or vary the assumptions and make more general claims about alignment. The most notable assumptions to weaken concern the mutability of the agent’s code and the agent being in an enclosed space. Another angle is to approach the problem from agents with a known utility function instead of a known architecture.


What is your track record on similar projects?

I think the following projects show that I am capable of thinking about AI in creative ways, finishing academic papers when deadlines are present, and understanding the deep mathematical field of computational complexity theory.

Suicidal AI paper https://drive.google.com/file/d/1ElnLoRbEfsAXNymIYg1nnkOisYCegSCm/view?usp=drive_link - This paper, written for Lise Getoor’s Algorithms & Ethics class, won the most-creative award. It attempted to provide counterexample-style evidence against the claim that all agents want to live.

Matching Algorithms paper https://drive.google.com/file/d/1HYYfSjl38LnyBPeeBX43KtyvnMq-AoMP/view?usp=sharing - This paper was for my senior seminar in mathematics at UCSC. It mostly summarizes the topic for an undergraduate math audience.

Complexity Theorem Attempt LvsNC1

https://github.com/Alexhb61/Sorting_Algorithms/blob/main/LvsNC1_attempt - This rough draft aims to prove a nontrivial complexity theory claim, but has an unproven assumption and is thus incomplete.

Singular Value Decomposition attempt https://github.com/Alexhb61/Matrix_Algorithms/blob/main/SVD_draft1.pdf - This rough draft aimed to prove a nontrivial claim about the complexity of basic mathematical operations; specifically that singular value decomposition can be done in subcubic time in the theoretical setting. This paper needs a lot of polish.

TSort https://github.com/Alexhb61/Sorting_Algorithms/blob/main/truncation_sorts/Tsort

This rough pseudocode outlines a sorting algorithm with sub-O(n log n) computational complexity for numbers with significantly fewer than n bits.

Cliques Via Cuts

https://github.com/Alexhb61/graph_algorithms/blob/main/Cliques_via_cuts.pdf

An ambitious attempt at disproving the strong exponential time hypothesis. Recently made into a shareable rough draft. This is more evidence of computational complexity work.

How could this project be actively harmful?

This research might convince people to give up on aligning black box agents instead of giving up on making black box agents.

What other funding is this person or project getting?

None at the moment. I have applied to the Nonlinear Network, but did not receive any concrete indication that funding was available.


Alexander Bistagne

7 months ago

Progress update

What progress have you made since your last update?

The paper was submitted to and rejected from the Alignment Forum. After reading it with a friend, I noticed serious sequencing issues and unnecessary definitions. I decided I needed a break.

I have since found people willing to give feedback on future drafts, and have joined the Ronin Institute where I might also receive feedback.

What are your next steps?

I intend to write a less verbose draft with more examples.
This draft will be posted to my GitHub.
I plan on going through at least 3 rounds of comments, review, and editing before giving a lightning talk at the Ronin Institute, where I might get more feedback. After another drafting phase, I will submit another post to the Alignment Forum. After that, I will need to find a mentor or co-author before submitting for peer review. I will consider the project over after peer review or thorough refutation.

Is there anything others could help you with?

  1. Without more funding, I can only reliably commit to 10 hours a week on this project.

  2. This leg of the project aims to include more examples.

  3. I am looking for more feedback.

  4. I am looking for a co-author or mentor to help with formalization before peer review. Contact me via email if you are interested or have ideas.

If others are interested in giving private examples or feedback, contact me on Discord or email me. Public examples or feedback can be given through GitHub issues.


Alexander Bistagne

about 1 year ago

Progress update

Post available on LessWrong and submitted to the Alignment Forum.

https://www.lesswrong.com/posts/JxhJfqfTJB9dkq72K/alignment-is-hard-an-uncomputable-alignment-problem-1


Alexander Bistagne

about 1 year ago

The project is on GitHub: https://github.com/Alexhb61/Alignment/blob/main/Draft_2.pdf

I will add citations and submit to the Alignment Forum tomorrow.


Alexander Bistagne

about 1 year ago

This project is nearly at its target, but hit a delay near the beginning of September, as I needed to take up other work to pay bills. Hopefully, I will post the minimal paper soon.


Austin Chen

over 1 year ago

I'm not familiar with Alexander or his work, but the votes of confidence from Anton, Quinn, and Greg are heartening.

Approving as the project seems within scope for Manifund (on longtermist research) and not likely to cause harm.

donated $3,800

Greg Colbourn

over 1 year ago

This research seems promising. I'm pledging enough to get it to proceed. In general we need more of this kind of research to establish consensus on LLMs (foundation models) basically being fundamentally uncontrollable black boxes (that are dangerous at the frontier scale). I think this can lead - in conjunction with laws about recalls for rule breaking / interpretability - to a de facto global moratorium on this kind of dangerous (proto-)AGI. (See: https://twitter.com/gcolbourn/status/1684702488530759680)


Alexander Bistagne

over 1 year ago

A technical detail worth mentioning: here is the main theorem of the 6K project:

Proving that an immutable-code agent with a Turing-complete architecture in a Turing-machine-simulatable environment has nontrivial betrayal-sensitive alignment is co-R-hard.

The paper would define nontrivial betrayal-sensitive alignment and some constructions on agents needed in the proof.


Alexander Bistagne

over 1 year ago

Correction: co-RE is the class, not co-R. It is the set of problems reducible to the complement of the halting problem.
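For reference, here is a hedged paraphrase of the theorem's intended shape with the corrected class. The set name NBSA below is shorthand introduced just for this sketch, not notation from the paper, and the statement is a paraphrase rather than the paper's exact wording.

```latex
% A paraphrase of the theorem's shape, using the corrected class co-RE.
% "NBSA" is a label introduced here for readability only.
\documentclass{article}
\usepackage{amsmath, amssymb, amsthm}
\newtheorem{theorem}{Theorem}
\begin{document}
\begin{theorem}[paraphrase]
Let $A$ be an agent with immutable code and a Turing-complete architecture,
operating in an environment simulatable by a Turing machine. Then deciding
whether $A$ has nontrivial betrayal-sensitive alignment is co-RE-hard:
there is a many-one reduction
$\overline{\mathrm{HALT}} \le_m \mathrm{NBSA}$, where
$\overline{\mathrm{HALT}}$ is the complement of the halting problem and
$\mathrm{NBSA}$ is the set of (encodings of) agents with nontrivial
betrayal-sensitive alignment.
\end{theorem}
\end{document}
```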


Alexander Bistagne

over 1 year ago

@alexhb61

Conditional on the 6K goal being reached, I have committed to submitting an edited draft to the Alignment Forum on August 23rd.

donated $1,000

Anton Makiievskyi

over 1 year ago

I encouraged Alexander to apply here, based on the fact that some people in the Nonlinear Network liked his application and that the funding goal is quite low. I don't think the 40K budget would be a good bet, but 6K could be worth a try. I'm interested to learn what regrantors think about it.

I'll offer a $1,000 donation in the hope that others add to it.


Alexander Bistagne

over 1 year ago

Thanks for the encouragement and donation.

The 40K max would be a much larger project than the 6K project, which is what I summarized.

6K would cover editing:

- An argument refuting testing anti-betrayal alignments in Turing-complete architectures

- An argument connecting testing alignment to training alignment in single-agent architectures

40K would additionally cover developing and editing:

- Arguments around anti-betrayal alignments in deterministic or randomized, P- or PSPACE-complete architectures

- Arguments around short-term anti-betrayal alignments

- Arguments connecting do-no-harm alignments to short-term anti-betrayal alignments

- Arguments refuting general solutions to the stop-button problem which transform the utility function in a computable-reals context

- Arguments around general solutions to the stop-button problem with floating-point utility functions

- Foundations for modelling mutable agents or subagents