What would power-seeking, misaligned AGI actually do?

[see original EA Forum post // by Kiel Brennan-Marquez]

Concerns about “super-intelligence” loom large.  The worry is that artificial general intelligence (AGI), once developed, might quickly reach a point at which it (1) becomes aware of its own capabilities and (2) uses them to subjugate or eradicate the human race.  This is often described as the problem of “power-seeking, misaligned AGI” (with “misalignment” referring to divergence between the goals of the humans who create AGI and the endogenous goals of AGI itself). 


 

Experts rate the odds of this result differently, but a common premise unites their views: namely, that if the goals of powerful AGI become “misaligned” from human goals, AGI is likely to wield its power in a manner adverse to human welfare (Carlsmith 2021).  Here, I scrutinize that premise.  Inspired by evolutionary biology and game theory, I explore other ways — apart from subjugation and eradication — that systems comprised of mutually-powerful agents (or groups of agents) tend to equilibrate.  With these patterns in mind, I argue that AGI is highly unlikely to subjugate or eradicate humans.  Rather, the strong likelihood is that AGI, if developed, will do some combination of:


 

•  Cooperating with us;

•  Purposely avoiding us; 

•  Paying us no attention whatsoever. 


 

Of course, it is still possible — and I hardly mean to deny — that AGI could wield power in ways that, for humans, qualify as “catastrophic.”  This could happen either intentionally or incidentally, depending on the goals AGI decided to pursue.  The point is simply that “catastrophic” equilibria, though possible, are far less likely than other equilibria.  Why?  In short, because: 


 

1.  As best we understand the historical evidence, interactions between mutually power-seeking agents with misaligned goals — whether between species or among human sub-groups — quite frequently result in non-catastrophic equilibria; and 

2.  There is no reason, a priori or empirically, to think the default distribution of probabilities would be radically discontinuous in the specific case of humans and AGI.  


 

In what follows, I defend these claims and expound their implications — and I also speculate a bit as to why, despite the low likelihood of catastrophic equilibria resulting from the emergence of power-seeking, misaligned AGI, the concern occupies a (relatively) large and growing share of attention.  My hypothesis is that the prospect of what might be termed “sadistic AGI” — AGI that derives active pleasure or utility from human suffering — is terrifying regardless of probability (as long as it exceeds 0).  In light of this, taking a prophylactic approach to sadistic AGI may be warranted regardless of its likelihood.  The idea of sadistic AGI may be so qualitatively abhorrent — the closest real-world approximation of the devil incarnate — that taking steps to guard against the possibility is worth potentially enormous opportunity costs.  But even so, we still should be clear-eyed about how unlikely, in quantitative terms, the prospect of AGI-induced catastrophe actually is. 


 

*


 

Claim one: interactions between mutually power-seeking agents with misaligned goals very frequently result in non-catastrophic equilibria


 

In practice, systems comprised of mutually-powerful agents with divergent goals tend, overwhelmingly, to equilibrate in one of three (non-catastrophic) ways: 


 

1.  Mutualism

2.  Conflict Avoidance

3.  Indifference 


 

This taxonomy is meant to be an answer, of sorts, to Bostrom’s famous “parable of the sparrows,” which imagines a group of sparrows that endeavor to locate an owl chick and train it as their servant.  Although Bostrom notes, cheekily, that it is “not known” how the parable ends, the implication is that things are unlikely to turn out well for the sparrows (Bostrom 2014).  And the same is certainly possible with regard to AGI and humans.  The analogy — sparrow : human :: owl : AGI — may well hold.  But many other analogies exist; the question is one of relative likelihood. 


 

Mutualism 


 

The first form of non-catastrophic equilibrium is mutualism.  Mutualist relationships produce net-benefits for all groups or species involved, often through intentional or incidental exchange; each group or species provides something of value to the other, leading to multilateral incentives for cooperation (Hale et al. 2020). 


 

Often, mutualism occurs among groups or species that pose no threat to each other.  For instance, we exist in a mutualist relationship with the bacteria that constitute our “gut flora.”  The bacteria help regulate digestion, and we, in turn, provide them (luxury?) accommodation.  But mutualism can also occur among groups or species that do pose natural threats to each other — when the benefits of mutualism simply outweigh the risks of threat.  Here is a good example: zoologists have observed that gelada monkeys (sometimes called gelada baboons) allow Ethiopian wolves to roam freely in the vicinity of their young, though the latter are easy prey.  Why?  The reigning hypothesis is that wolves provide the baboons protection from other predators, and the baboons help the wolves locate rodents — an easier source of food.  In light of this, the wolves are savvy enough to leave the young baboons alone (Holmes 2015).


 

The power differential between wolves and monkeys is relatively small.  But this is not a necessary feature of inter-species mutualism.  It can also transpire in settings where one species is vastly more powerful than the other.  Take, for example, bees and humans.  Bees can be a source of nuisance (and even, in some contexts, a more serious threat), and we certainly have the capacity to eradicate bees if we thought it worth our time and energy.  But we have no incentive to do so.  In fact — now that we understand pollination — we have an active incentive to keep bees alive and flourishing, and even to protect them from other threats, purely as a matter of self-interest. 


 

Given this, consider the following thought-experiment: a “parable of the bees” on similar footing with that of the sparrows.  It’s 10,000 BC, and certain members of the bee race — the Bee Intelligentsia — are worried about the growing capabilities of homo sapiens.  They begin (doing the equivalent of) writing papers that frame the concern as follows: 


 

Once homo sapiens turn their sights to reshaping the external environment, they will be more powerful than we can imagine, and there is a non-negligible chance that they will pursue goals either directly or incidentally adverse to our welfare — maybe catastrophically so.  Accordingly, we should begin expending significant amounts of collective energy future-proofing against subjugation or eradication by homo sapiens.


 

Would the Bee Intelligentsia be “wrong” to think this way?  Not exactly, for the core submission is true; human activity does raise some risk of catastrophe.  The error would lie in over-weighting the risk.  (Which is easy to do, especially when the risk in question is qualitatively terrifying; more on that below.)  But if the Bee Intelligentsia understood pollination — if it were intelligent, let alone super-intelligent — it would be able to appreciate that bees offer humans a benefit that is not trivial to replace; indeed, it might even be able to predict (some version of) the present-day dynamic, namely, that far from undermining bee welfare, humans have an active incentive to enhance it.


 

The same may be true of humans and AGI — with humans in the “bee” position.  Depending on its goals, AGI may well conclude that humans are worth keeping around, or even worth nurturing, for the mutualist benefits they deliver.  In fact, AGI might simply conclude that it’s possible humans will deliver mutualist benefits at some point, and this, alone, may be enough to inspire non-predation — as in the wolf-baboon example — or cultivation — as in the human-bee example — purely as a means of maintaining optionality.  One assumption built into the “super-intelligence” problem, after all, is that AGI will be capable of developing causal theories about the world, presumably to a much greater extent, or at least far more quickly, than humans have.  From this, it likely follows that AGI would have an enormous set of mutualist dynamics to consider (and to consider safeguarding as future possibilities) before electing to regard humans with hostility.  


 

Some versions of AGI “safeguarding future possibilities” would likely resemble domestication; the way AGI would invest in the possibility of humans delivering mutualist benefits would be — in some measure — to “farm” us.  (Researchers already use language like this in the wolf-baboon example.)  That may sound jarring, even borderline dystopian, but it’s not necessarily catastrophic.  In fact, it’s easy to imagine “human domestication” scenarios that enable greater flourishing than we have been able to manage, or plausibly could manage, on our own.  Query, for example, if domestication has been catastrophic for a species like bengal cats.  At their limit, questions like this may be more metaphysical than empirical; they may ride on deep (and likely non-falsifiable) conceptions of what flourishing involves and requires.  But at a practical level, for many species, like bengal cats, it would seem odd to describe domestication as catastrophic.  Domestication has effectively relieved bengal cats of the need to constantly spend energy looking for food.  Whatever drawbacks this has also occasioned (do bengal cats experience ennui?), it seems like a major improvement, and certainly not a catastrophic deprivation, as such. 


 

Conflict Avoidance


 

The second form of non-catastrophic equilibrium is conflict avoidance.  This involves relationships of unilateral or multilateral threat or competition in which one or both groups decide that it’s easier — more utility-maximizing overall — to avoid one another.  For example, jellyfish are a threat to human beings, and human beings are a threat to jellyfish.  But the functional “equilibrium” between the two species is, on the whole, avoidant.  If, circa 10,000 BC, the Jellyfish Intelligentsia voiced concerns analogous to those of the Bee Intelligentsia above, they, too, would have had little actual reason to worry.  Although it’s certainly possible that humans would have pursued (and could still pursue) the subjugation or eradication of jellyfish, the far more likely equilibrium is one in which humans mostly leave jellyfish alone. 


 

Importantly, to qualify as non-catastrophic, an “avoidant” equilibrium need not involve zero conflict.  Rather, the key property is that conflict does not tend to multiply or escalate because, in the median case, the marginal cost of conflict is greater than the marginal cost of avoidance.  Take the jellyfish example above.  Sometimes jellyfish harm humans, and sometimes humans harm jellyfish.  What makes the equilibrium between them avoidance is not a total absence of conflict; it is that humans generally find it less costly to avoid jellyfish (by swimming away, say, or by changing the location of a diving expedition) than to confront them.  We certainly could eradicate — or come very close to eradicating — jellyfish if that were an overriding priority.  But it isn’t.  Our energy is better spent elsewhere.  


 

Similar dynamics also occur at the intra-species level.  Many human subcultures, for example, involve reciprocal threat and even dynamics of mutual predation — think, say, of organized crime, or competition among large companies in the same economic sector.  Yet here, too, avoidance is far more prevalent than subjugation or eradication.  Commodity distribution organizations, licit and illicit alike, do not tend to burn resources trying to destroy one another — at least, not when they can use the same resources to specialize in existing markets, locate new markets, innovate their products, or lower their production costs.  These strategies are almost always less costly and/or more beneficial than their destructive counterparts. 


 

Not across the board, of course; some competitive environments do not lend themselves to avoidant equilibria.  But a great many do, and the explanation is simple.  Destructive strategies often become more costly to keep pursuing the more they have already been pursued; up until the point of completion, the marginal cost of maintaining a destructive strategy tends to increase in supra-linear fashion.  Why?  Because counter-parties tend to respond to destructive strategies adaptively, and often in ways that impose costs in the other direction.  Retaliation and subversion are the two most common examples.  The history of human conflict strongly suggests that small, less powerful — in some cases, much less powerful — groups are capable of inflicting significant harm on larger, more powerful groups.  Particularly when humans (and other intelligent animals) find themselves in desperate circumstances, the combination of survival instinct, tenacity, and ingenuity can result in extraordinarily outsized per capita power.  


 

This is not always true; sometimes, small, less powerful groups are decimated by more-powerful counterparts.  The point, however, is that the possibility of small groups wielding outsized per capita power often suffices to make avoidance a more appealing ex ante strategy.  Anticipating that destruction may be costly to accomplish, powerful groups often opt — as with territorial disputes between criminal and corporate organizations — for some combination of (1) investing in the creation of new surplus and (2) informally splitting up existing surplus without resorting to (catastrophic forms of) conflict (Peltzman et al. 1995). 
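The cost argument above can be made concrete with a toy model. The sketch below assumes, purely for illustration, that the marginal cost of each further step of a destructive strategy grows as a power of the steps already taken, while each step of avoidance costs a flat amount. The functional form, the exponent, and every number are hypothetical; none comes from the essay or from any dataset.

```python
def marginal_cost_of_destruction(steps_taken, base_cost=1.0, exponent=1.5):
    """Hypothetical supra-linear marginal cost of one more destructive step.

    An exponent above 1 stands in for retaliation and subversion by the
    counter-party, which make each additional step costlier than the last.
    """
    return base_cost * (steps_taken + 1) ** exponent


def first_step_where_avoidance_wins(avoidance_cost_per_step, max_steps=1000,
                                    base_cost=1.0, exponent=1.5):
    """Return the first step at which continuing to destroy costs more than
    avoiding, or None if that never happens within max_steps."""
    for step in range(max_steps):
        cost = marginal_cost_of_destruction(step, base_cost, exponent)
        if cost > avoidance_cost_per_step:
            return step
    return None


# Illustrative only: avoidance costs a flat 10 units per step.
switch_step = first_step_where_avoidance_wins(avoidance_cost_per_step=10.0)
print(switch_step)
```

Under these made-up parameters the destructive strategy stops being the cheaper option after only a handful of steps. Raising the flat cost of avoidance pushes the switch point later, which is one way to picture the competitive environments that do not settle into avoidant equilibria.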


 

There is reason, accordingly, to think that even if AGI found itself in potential conflict with humans — e.g., due to competition for the same resources — the most efficient response could be some combination of (1) finding other ways to amass the resource, or (2) claiming a partial share of the resource while keeping ongoing conflict at bay.  Imagine, for instance, if AGI determined that it was important to secure its own sources of energy.  Would the answer be, as some have hypothesized (Carlsmith 2021), to seize control of all electricity-infrastructure?  Possibly; but it’s also possible that AGI would simply devise a better means of collecting energy, or that it would realize its longer-term interests were best served by allowing us to maintain joint access to existing energy sources — if nothing else, for purposes of appeasement and pacification. 


 

Indifference


 

The third form of non-catastrophic equilibrium — surely the most pervasive, considering the sheer multitude of inter-species relationships that exist on earth — is indifference.  Most underwater life, for example, is beyond human concern.  We do not “avoid” plankton, in the sense just conceptualized.  We simply pay them no mind.  If the Plankton Intelligentsia voiced concern on par with the Bee Intelligentsia or the Jellyfish Intelligentsia, they would, once again, not exactly be “wrong.”  But they would also be foolish to attribute much significance to — or organize their productive capacities around — the risk of human-led subjugation or eradication.  


 

The same could easily be true of AGI.  In fact, the plankton analogy — and one could apply the same reasoning to many insects, parasites, and so forth — seems to me like the most plausible template for AGI-human interactions.  If, as the super-intelligence problem hypothesizes, AGI ends up possessing vastly greater capability than humans, it stands to reason that AGI may relate to us in roughly the same way that we relate to other species of vastly lesser capability.  And how is that?  As a general matter, by paying them little or no attention.  This is not true of every such species; bees have already supplied a counter-example.  But the general claim holds.  With respect to most species, most of the time, we have no conscious interactions at all.  


 

Of course, indifference is not always innocuous.  In fact, it can be highly destructive, if the goals of the powerful-but-indifferent group come into collision with the welfare of the less powerful group.  Humans have been “indifferent” — in the sense described above — to many species of rainforest plants and animals, for example, and the latter are considerably worse off for it.  With this category, the important point is that catastrophic results, when they occur, do so incidentally.  Catastrophe is not the goal; it is a collateral consequence (Yudkowsky 2007).  


 

How, then, are humans likely to fare under the “indifference” model?  It would depend entirely on the goals AGI decided to pursue.  Some goals would almost certainly ravage us.  Suppose AGI decided that, in the interest of (say) astrophysical experimentation, one of its overriding objectives was to turn planet earth into a perfect sphere.  In that case, human civilization may be doomed.  But other goals would leave human social order — and human welfare, such as it is — effectively unaltered.  If, for example, AGI decided the best use of its energy was to create and appreciate art of its own devising, or to exhaustively master the game of Go, or anything else along such lines, human civilization would be unlikely to register much, if any, effect.  In fact, we might not even be aware of such happenings — in roughly the same sense that plankton are not aware of human enterprise. 


 

*


 

Claim two: there is no reason, a priori or empirically, to think AGI-human interactions will radically deviate from past patterns of equilibration


 

The goal of the last section was to show that, across a wide range of inter- and intra-species interactions, non-catastrophic equilibria are common.  They are not inevitable.  But they are prevalent — indeed, hyper-prevalent — once we broaden the scope of inter-species interactions.  (For instance, there are vastly more oceanic species with which humans exist in “avoidant” or “indifferent” equilibria than the total number of mammalian species humans have ever interacted with in any meaningful way.)


 

Here, I ask whether there is reason to assume that the distribution of equilibria we have seen historically among organic species would carry over to AGI-human interactions.  In short, I believe the answer is yes, because (1) we have seen no empirical evidence to the contrary, and (2) the most (only?) plausible axis of distinction — the enhanced capability of AGI — does not render catastrophic equilibrium more likely.  Rather, in the abstract, enhanced capability cuts both ways.  To the same extent enhanced capacity increases the risk of potential conflict, it also makes mutualism and avoidance more likely.  And for the same reason that enhanced capability could occasion catastrophic forms of indifference — i.e., the pursuit of goals unrelated to humans that incidentally imperil our welfare — it could also inspire AGI to simply leave us be. 


 

Start with the empirical evidence.  So far, AI systems that have shown signs of adaptive capability seem, uniformly, to fall within existing frameworks of strategic decision-making.  Under current technological conditions, adaptive AI tends to proceed exactly as one might think — by crafting and revising plans in response to the functional goals of the environment in which it is deployed (Carlsmith 2021).  In other words, such systems do what one would expect any agent to do: they grasp the parameters of the problem-space and deploy familiar — and roughly predictable — strategies in response. 


 

But the stronger — and more typical — argument in favor of the “AGI is different” position is not empirical but conceptual.  The idea is that AGI is likely to have goals and capabilities that vary substantially from those of humans and other biological agents, making it difficult, if not incoherent, to compare multilateral interactions of biological agents to possible interactions with AGI.  Multiple justifications underwrite this “variance” claim.  One is that AGI is likely to face a categorically different set of basic needs than biological agents (e.g., not requiring food or sleep); another is that AGI is evolving in a categorically different kind of environment than biological agents did (one marked by little to no competition for resources, exponentially faster time-horizons for adaptation, and so forth) (Garfinkel 2022). 


 

Let’s stipulate the variance claim.  (I’m not sure it’s correct, but it certainly seems plausible.)  Does it follow that we can’t learn anything from historical patterns of interaction among biological agents?  No — the question is which aspects of those patterns are essential, and which are contingent.   What is it, at core, that defines the patterns of mutualism, avoidance, and indifference traced above?  And how, accordingly, are those patterns likely to change in a world where one type of agent — AGI — is vastly more capable, and may be driven by fundamentally different goals, than other agents in the system?  


 

Mutualism is the easiest case to analyze.  Its defining feature — convergent goals that lead to cooperation — readily extrapolates to the human-AGI context.  We may (by hypothesis) not know which AGI goals are likely to converge with human goals, but the possibility certainly exists; and if such convergence transpired, we could expect some form of mutualism, at least medium- to longer-term, to follow. 


 

Avoidance is slightly more complicated.  Here, as with mutualism, we do not need to know the content of AGI’s goals to recognize that they might conflict with human activity, leading AGI to view us as threats or competitors.  The question is how AGI would act on that view.  Would it try to subjugate or eradicate us?  And more specifically, would AGI’s enhanced capability make those catastrophic results more likely than they would be in the context of biological agents?  


 

I think not, for a straightforward reason: enhanced capability is at least as likely to improve strategies of avoidance as it is to facilitate catastrophe.  In other words, the question of capability is distinct from the question of whether, in a given context, it would be more costly (1) to avoid conflict with a threatening agent, or (2) to eliminate the threat.  The two questions are not orthogonal; they do intersect.  But it is the latter question — the relative cost comparison — that will ultimately determine the equilibrium, and enhanced capability is as likely to shift the balance in favor of avoidance as it is to shift the balance the other way.  To be sure, if AGI regards humans as a threatening force, it will (almost certainly) be capable of either avoiding or eliminating the threat.  But the relevant question is which route promises fewer costs.  It is a matter of relative encumbrance, not absolute capability.  And the biological world is rife with examples — of the human-jellyfish flavor — in which agents of vastly greater capability decide, under a relative cost analysis, that avoiding threats is more efficient than attempting to eliminate them.  


 

Consider the following thought-experiment.  Humans spontaneously leap forward a few centuries of technological prowess, such that we now have the capability to instantly kill whole schools of jellyfish using electromagnetic energy.  Would we use this technology to eradicate jellyfish across the board?  Maybe — but it would depend entirely on what other avoidance strategies the technology also enabled.  If the same technology allowed individual human divers to kill specific jellyfish they happened to encounter, that solution (i.e., dealing with individual jellyfish on an ad hoc basis) would likely be preferable to large-scale eradication.  Similarly, if the jellyfish grew to recognize that humans possess the capability to kill them relatively easily, they would presumably start trying to avoid us — an “avoidant equilibrium” in its own right.  Of course, we still might decide to eradicate jellyfish.  The point isn’t that eradication is an impossible, or even utterly implausible, end state.  It’s that enhanced capability is not the determinative variable.  The determinative variable is relative cost in virtue of enhanced capability — a far more contingent property.  
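The distinction between absolute capability and relative cost can be sketched in a few lines. The example below is only an illustration: the baseline costs, the "speedup" factors, and the function names are all hypothetical stand-ins, chosen to show that a capability gain changes the chosen strategy only when it changes the ratio between the two costs.

```python
def costs_after_capability_gain(eradication_cost, avoidance_cost,
                                eradication_speedup, avoidance_speedup):
    """Scale each baseline cost down by how much the new capability helps it."""
    return (eradication_cost / eradication_speedup,
            avoidance_cost / avoidance_speedup)


def cheaper_strategy(eradication_cost, avoidance_cost):
    """Pick whichever strategy costs less; ties go to avoidance."""
    return "avoid" if avoidance_cost <= eradication_cost else "eradicate"


# Hypothetical baseline: eradicating is 100x costlier than avoiding.
BASELINE = {"eradication_cost": 1000.0, "avoidance_cost": 10.0}

# A capability gain that helps both strategies equally leaves the ratio alone.
balanced = cheaper_strategy(*costs_after_capability_gain(
    **BASELINE, eradication_speedup=100.0, avoidance_speedup=100.0))

# A capability gain that only cheapens eradication flips the ratio.
lopsided = cheaper_strategy(*costs_after_capability_gain(
    **BASELINE, eradication_speedup=1000.0, avoidance_speedup=1.0))

print(balanced, lopsided)
```

In the first scenario a hundredfold capability gain leaves avoidance the cheaper option, because both costs fall together. Only in the second, where the gain is lopsided, does eradication become the cheaper route.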


 

Last but not least, indifference.  For some readers, I imagine this category may seem, at least prima facie, like the most worrisome — if we don’t know what AGI will be capable of, or what goals will ultimately guide its behavior, shouldn’t that make us even more concerned about what it might do?  A powerful intuition drives this question.  Namely, to be subject to the whim of a more-powerful actor, even if the actor decides to let us move through the world unhindered, can produce an uncanny sense of dread and alienation.  It can transform what would otherwise be an experience of free agency — I decided to do [X] — into an experience of subordination, of being the non-willed beneficiary of another’s grace.  


 

In fact, there is an entire school of political theory — republicanism — organized around this set of concerns.  It regards “non-domination” as the core principle of legitimate social order, and it’s particularly sensitive to forms of domination that take a Sword of Damocles form: conditions in which people are relatively free, at least moment to moment, but entirely dependent on more powerful actors to produce and sustain the conditions of freedom (Pettit 1997).  Something of this idea may underwrite anxiety about power-seeking, misaligned AGI as well.  Or perhaps it’s even simpler.  One of the “unknown objectives” AGI might pursue (for reasons that, by hypothesis, are unknown) is the intentional infliction of human suffering.  And this possibility, by itself, may be sufficiently disquieting as to command an outsized share of attention.  In other words, the prospect of sadistic AGI — a prospect made sadly more plausible by the way humans have interacted with other animals — may be so terrifying that it has disabled us, collectively, from imagining the vast landscape of innocuous “indifferent” equilibria that exist alongside the nightmarish ones.  


 

In short, although it’s certainly possible that power-seeking, misaligned AGI would pursue unknown goals that directly or incidentally undermine human welfare, it’s also possible that such AGI would pursue unknown goals that have zero, or only de minimis, impact on human welfare.  And once again, as with the dynamics around potential conflict explored above, capability is not the determinative variable.  The determinative variable is the nature of the unknown goals themselves.  Enhanced capability would certainly enable AGI to pursue those goals — whatever they are — more effectively.  But it would not change the essential valence of those goals vis-a-vis human welfare. 


 

*


 

Implications for the likelihood of AGI-induced catastrophe


 

Deriving hard-and-fast numbers from the foregoing analysis is hard — and even if the numbers are roughly accurate (an enormous if!), significant vagueness will persist.  So, in lieu of trying to quantify specific AGI-human scenarios, I’m going to assign subjective probabilities by category — inspired by the three forms of equilibria analyzed above — and then explore how those probabilities, even if only coarsely right, intersect with existing analyses of the super-intelligence problem that do set forth (plausible) numerical ranges.  


 

Here is the upshot.  Treating past inter-species interactions as the most relevant precedent, the risk of power-seeking, misaligned AGI trying to subjugate or eradicate us is dwarfed by the combined likelihood of its (1) cooperating with us, (2) actively avoiding us, or (3) ignoring us.  And once this reality is incorporated into an overall analysis of likelihood — once the low probability of catastrophe resulting from the emergence of power-seeking, misaligned AGI is conjoined with other conditional probabilities regarding the likelihood of such AGI emerging in the first place — the net-result hovers around 1-3%.  


 

This, to be clear, is non-negligible, and it may be worth reorganizing certain aspects of social and political life around guarding against AGI-induced catastrophe.  This may be true as such, or it may be true conditional on a qualitative analysis of what form AGI-induced catastrophe is likely to take (e.g., the analysis might look different for someone who believed sadistic AGI was more likely than AGI whose goals incidentally imperiled human welfare, versus the other way around).  But either way, the bottom-line numerical likelihood remains low.  


 

P(AGI-induced catastrophe | the emergence of power-seeking, misaligned AGI)


 

I propose modeling the probability of AGI-induced catastrophe as follows.  First, we should consider the threshold likelihood that AGI’s most salient goals will be (1) convergent with humanity’s most salient goals, (2) in direct conflict with humanity’s most salient goals, or (3) orthogonal to humanity’s most salient goals.  Second, we should consider the likelihood, within each category, that the resultant equilibrium will be catastrophic to human welfare. 


 

Accordingly — and with the major (and probably needless) caveat that we’re now venturing into the realm of intelligent guesswork — I’m going to benchmark the threshold likelihood of each category as follows:


 

1.  The goals of power-seeking, misaligned AGI will be functionally convergent with humanity’s most salient goals — 5%

2.  The goals of power-seeking, misaligned AGI will directly conflict with humanity’s most salient goals — 25%

3.  The goals of power-seeking, misaligned AGI will run orthogonal to humanity’s most salient goals — 70%


 

I derive these numbers from an approximation of the equivalent distribution in the biological world.  In other words, whenever a vastly more capable new species begins to interact with less-capable species, what is the general likelihood — for any given less-capable species — that the resultant dynamic is one of functional convergence, conflict, or indifference?  


 

Functional convergence seems the least likely — not so unlikely as to be a meaningless outlier, but uncommon nevertheless.  Conflict is more common, but it’s context-bound — and primarily indexed to competition for resources within the same material environment.  Indifference, on the other hand, is far and away the most common dynamic.  Humans and other advanced mammals, for example, are indifferent to almost every single species that lives in the deep ocean or inhabits particularly hot or cold land-climates.  And even within the category of land-dwelling animals that humans tend to interact with, indifference is still far more prevalent — on the numbers — than either of the other two dynamics.   


 

But this is only half the question; the other half is how likely catastrophic equilibria are to follow from each dynamic.  On this front, I’d offer the following estimates:


 

1.  Likelihood of catastrophic equilibria resulting from convergent goals — effectively 0% 

2.  Likelihood of catastrophic equilibria resulting from direct conflict — 25%

3.  Likelihood of catastrophic equilibria resulting from indifference — 2-3%


 

This nets to an overall probability — P(AGI-induced catastrophe | the emergence of power-seeking, misaligned AGI) — of (5% × 0%) + (25% × 25%) + (70% × 3%) = 0% + 6.25% + 2.1% ≈ 8.4%


 

Here, once again, these numbers reflect approximations of observed reality in the biological world.  Convergent goals tend inexorably — as long as the convergence is discernible to all parties — toward mutualism.  Conflict certainly can result in catastrophic results for one side or the other, but it can also result in avoidant equilibria; in some domains, in fact, the latter is much more common (e.g., competition between human sub-groups, such as criminal organizations and large corporate enterprise, which can certainly be destructive, painful, and anti-social, but rarely results in the total eradication of one side or the other).  Given the prevalence of both possibilities, but also the sense in which avoidant equilibria are more common in conflict-riven settings than we might intuitively expect, I thought 25% a reasonable approximation. 
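The two-step estimate above is an application of the law of total probability, and it can be checked with a short script. The sketch below takes the category weights (5%, 25%, 70%) and the per-category catastrophe estimates (effectively 0%, 25%, 2-3%) directly from the lists above; the function and variable names are illustrative choices, not anything the essay itself specifies.

```python
def total_catastrophe_probability(categories):
    """Sum P(category) * P(catastrophe | category) over all categories."""
    weight_sum = sum(p_category for p_category, _ in categories.values())
    if abs(weight_sum - 1.0) > 1e-9:
        raise ValueError(f"category probabilities must sum to 1, got {weight_sum}")
    return sum(p_category * p_catastrophe
               for p_category, p_catastrophe in categories.values())


def estimate(p_catastrophe_given_indifference):
    """Plug the essay's subjective estimates into the total-probability sum."""
    categories = {
        # name: (P(category), P(catastrophe | category))
        "convergent": (0.05, 0.00),
        "conflict": (0.25, 0.25),
        "orthogonal": (0.70, p_catastrophe_given_indifference),
    }
    return total_catastrophe_probability(categories)


# The indifference estimate is given as a range, so compute both ends.
low = estimate(0.02)
high = estimate(0.03)
print(f"{low:.2%} to {high:.2%}")
```

With these inputs the conditional probability lands between roughly 7.7% and 8.4%. Because the conflict term contributes 6.25 percentage points on its own, the result is far more sensitive to the 25% conflict estimates than to the choice between 2% and 3% for indifference.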


 

In many ways, indifference is the most interesting dynamic, both because it strikes me as the most likely and because its results tend to be bimodal.  Indifference tends either to leave less-capable agents totally unhindered (as in possible futures where AGI decides that it would rather spend its time pursuing a combination of analytic and aesthetic goals beyond human comprehension) or to be swiftly catastrophic.  (This, too, may contribute to the experience of dread, described above, that can accompany the prospect of unknown or unknowable AGI goals.)  The bimodal distribution is not an even one, however.  If the history of interactions among biological agents is any guide, it is intensely lopsided toward the first outcome.


 

Most species are beyond the regard of vastly more-powerful species, and totally unaffected by the latter’s activities.  In light of this, absent a reason to think that AGI-human interactions will break from this pattern, the rational default should be (1) that indifference is likely to be the relevant dynamic, and, even more so, (2) that indifference is very unlikely to result in catastrophe.  The default paradigm, in other words, should be humans and plankton, not humans and bees, or humans and other advanced mammals.  And if anything, the human-plankton paradigm only becomes more plausible the more advanced we imagine AGI’s capabilities to be, since it’s the largest disparities in capability that, at least among biological species, have tended to produce the starkest forms of indifference.


 

Overall conditional probability of AGI-induced catastrophe


 

I’m hesitant to attach a specific number to the overall chain of probabilities, not least of all because different commentators have analyzed other steps in the chain differently (CITE).  But if the foregoing is correct, I think it’s safe, at a minimum, to conclude that analyses of the overall conditional probability of AGI-induced catastrophe are significantly inflated insofar as they have failed to incorporate an analysis of: 


 

P(catastrophic equilibration | the emergence of power-seeking, misaligned AGI)


 

My impression, which could be wrong, is that existing analyses are guilty of this error.  They have focused — understandably — on (1) which social, historical, and technological steps are likely to occasion power-seeking, misaligned AGI, and (2) the conditional probability of each step, with the goal of producing an overall projection of socio-historical likelihood.  This analysis is extremely rich and important.  It just seems incomplete.  In addition to analyzing the socio-historical likelihood of power-seeking, misaligned AGI, we also need to ask what AGI would likely do with misaligned power.  The answer is uncertain.  But there’s reason to think it won’t involve much, if any, regard for human trifles.  


 

*


 

Bibliography


 

Nick Bostrom, Superintelligence: Paths, Dangers, Strategies, Oxford University Press (2014)


 

Joseph Carlsmith, Is Power-Seeking AI an Existential Risk? (2021)


 

Ben Garfinkel, Review of ‘Carlsmith, Is Power-Seeking AI an Existential Risk?’ (2022)


 

Kayla Hale et al., Mutualism Increases Diversity, Stability, and Function in Multiplex Networks that Integrate Pollinators Into Food Webs (2020)  


 

Bob Holmes, Monkeys’ Cozy Alliance with Wolves Looks Like Domestication (2015) 


 

Holden Karnofsky, Thoughts on the Singularity Institute (2012)


 

Sam Peltzman et al., The Economics of Organized Crime (1995)


 

Philip Pettit, Republicanism: A Theory of Freedom and Government (1997)


 

Nate Soares, Comments on ‘Is Power-Seeking AI an Existential Risk?’ (2021) 


 

Eliezer Yudkowsky, Artificial Intelligence as a Positive and Negative Factor in Global Risk (2008) 


 

Eliezer Yudkowsky, The Hidden Complexity of Wishes (2007) 


 

Eliezer Yudkowsky, Coherent Extrapolated Volition (2004) 
