Modeling and Interpreting Expert Disagreement About Artificial Superintelligence

Seth D. Baum, Anthony M. Barrett, and Roman V. Yampolskiy
Global Catastrophic Risk Institute, PO Box 40364, Washington, DC 20016, USA
http://gcrinstitute.org, E-mail: seth@gcrinstitute.org

Keywords: artificial superintelligence, expert judgment, risk analysis

Received: August 31, 2017

Artificial superintelligence (ASI) is artificial intelligence (AI) with capabilities that are significantly greater than human capabilities across a wide range of domains. A hallmark of the ASI issue is disagreement among experts. This paper demonstrates and discusses methodological options for modeling and interpreting expert disagreement about the risk of ASI catastrophe. Using a new model called ASI-PATH, the paper models a well-documented recent disagreement between Nick Bostrom and Ben Goertzel, two distinguished ASI experts. Three points of disagreement are considered: (1) the potential for humans to evaluate the values held by an AI, (2) the potential for humans to create an AI with values that humans would consider desirable, and (3) the potential for an AI to create for itself values that humans would consider desirable. An initial quantitative analysis shows that accounting for variation in expert judgment can have a large effect on estimates of the risk of ASI catastrophe. The risk estimates can in turn inform ASI risk management strategies, which the paper demonstrates via an analysis of the strategy of AI confinement. The paper finds the optimal strength of AI confinement to depend on the balance of risk parameters (1) and (2).

Povzetek: A method for modeling and interpreting differences in expert opinions about superintelligence is presented.

1 Introduction

Artificial superintelligence (ASI) is artificial intelligence (AI) with capabilities that are significantly greater than human capabilities across a wide range of domains. If developed, ASI could have impacts that are highly beneficial or catastrophically harmful, depending on its design.

A hallmark of the ASI issue is disagreement among experts. Experts disagree on whether ASI will be built, when it would be built, what designs it would use, and what its likely impacts would be.1 The extent of expert disagreement speaks to the opacity of the underlying ASI issue and the general difficulty of forecasting future technologies. This stands in contrast with other major global issues, such as climate change, for which there is extensive expert agreement on the basic parameters of the issue (Oreskes 2004). Expert consensus does not guarantee that the issue will be addressed—the ongoing struggle to address climate change attests to this—but it does offer direction for decision making.

In the absence of expert agreement, those seeking to gain an understanding of the issue must decide what to believe given the existence of the disagreement. In some cases, it may be possible to look at the nature of the disagreement and pick sides; this occurs if other sides clearly have flawed arguments that are not worth giving any credence to. However, in many cases, multiple sides of a disagreement make plausible arguments; in these cases, the thoughtful observer may wish to form a belief that in some way considers the divergent expert opinions.

1 On expert opinion of ASI, see Baum et al. (2011), Armstrong and Sotala (2012), Armstrong et al. (2014), and Müller and Bostrom (2014).
This paper demonstrates and discusses methodological options for modeling and interpreting expert disagreement about the risk of ASI catastrophe. The paper accomplishes this by using a new ASI risk model called ASI-PATH (Barrett and Baum 2017a; 2017b). Expert disagreement can be modeled as differing estimates of parameters in the risk model. Given a set of differing expert parameter estimates, aggregate risk estimates can be made using weighting functions. Modeling expert disagreement within the context of a risk model is a method that has been used widely across a range of other contexts; to our knowledge this paper marks the first application of this method to ASI.

The paper uses a well-documented recent disagreement between Nick Bostrom and Ben Goertzel as an illustrative example—an example that is also worthy of study in its own right. Bostrom and Goertzel are both longstanding thought leaders about ASI, with lengthy research track records and a shared concern with the societal impacts of ASI. However, in recent publications, Goertzel (2015; 2016) expresses significant disagreement with core arguments made by Bostrom (2014). The Bostrom-Goertzel disagreement is notable because both of them are experts whose arguments about ASI can be expected to merit significant credence from the perspective of an outside observer. Therefore, their disagreement offers a simple but important case study for demonstrating the methodology of modeling and interpreting expert disagreement about ASI.

The paper begins by summarizing the terms of the Bostrom-Goertzel disagreement. The paper then introduces the ASI-PATH model and shows how the Bostrom-Goertzel disagreement can be expressed in terms of ASI-PATH model parameters. The paper then presents model parameter estimates based on the Bostrom-Goertzel disagreement. The parameter estimates are not rigorously justified and instead are intended mainly for illustration and discussion purposes. Finally, the paper applies the risk modeling to a practical problem, that of AI confinement.

2 The Bostrom-Goertzel disagreement

Goertzel (2015; 2016) presents several disagreements with Bostrom (2014). This section focuses on three disagreements of direct relevance to ASI risk.

2.1 Human evaluation of AI values

One disagreement is on the potential for humans to evaluate the values that an AI has. Humans would want to diagnose an AI’s values to ensure that they are something that humans consider desirable (henceforth “human-desirable”). If humans find an AI to have human-undesirable values, they can reprogram the AI or shut it down. As an AI gains in intelligence and power, it will become more capable of realizing its values, thus making it more important that its values are human-desirable.

A core point of disagreement concerns the prospects for evaluating the values of AI that have significant but still subhuman intelligence levels. Bostrom indicates relatively low prospects for success at this evaluation, whereas Goertzel indicates relatively high prospects for success. Bostrom (2014, p.116-119) posits that once an AI reaches a certain point of intelligence, it might adopt an adversarial approach. Bostrom dubs this point the “treacherous turn”:

The treacherous turn: While weak, an AI behaves cooperatively (increasingly so, as it gets smarter).
When the AI gets sufficiently strong–without warning or provocation–it strikes, forms a singleton [i.e., takes over the world], and begins directly to optimize the world according to the criteria implied by its final values. (Bostrom 2014, p.119)

Such an AI would not have durable values, in the sense that it would go from acting in human-desirable ways to acting in human-undesirable ways. A key detail of the treacherous turn theory is that the AI has values that are similar to, but ultimately different from, human-desirable values. As the AI gains intelligence, it goes through a series of stages:

1. At low levels of intelligence, the AI acts in ways that humans consider desirable. At this stage, the differences between the AI’s values and human values are not important because the AI can only complete simple tasks that are human-desirable.

2. At an intermediate level of intelligence, the AI realizes that its values differ from human-desirable values and that if it tried deviating from human-desirable values, humans would reprogram the AI or shut it down. Furthermore, the AI discovers that it can successfully pretend to have human-desirable values until it is more intelligent.

3. At a high level of intelligence, the AI takes control of the world from humanity so that humans cannot reprogram it or shut it down, and then pursues its actual, human-undesirable values.

Goertzel provides a contrasting view, focusing on Step 2. He posits that an AI of intermediate intelligence is unlikely to successfully pretend to have human-desirable values because this would be too difficult for such an AI. Noting that “maintaining a web of lies rapidly gets very complicated” (Goertzel 2016, p.55), Goertzel posits that humans, being smarter and in control, would be able to see through a sub-human-level AI’s “web of lies”. Key to Goertzel’s reasoning is the claim that an AI is likely to exhibit human-undesirable behavior before it (A) learns that such behavior is human-undesirable and (B) learns how to fake human-desirable behavior. Thus, Step 2 is unlikely to occur—instead, it is more likely that an AI would either have actual human-desirable values or be recognized by humans as faulty and then be reprogrammed or shut down. Goertzel does not name his view, so we will call it the sordid stumble:

The sordid stumble: An AI that lacks human-desirable values will behave in a way that reveals its human-undesirable values to humans before it gains the capability to deceive humans into believing that it has human-desirable values.

It should be noted that the distinction between the treacherous turn and the sordid stumble is about the AI itself, which is only one part of the human evaluation of the AI’s values. The other part is the human effort at evaluation. An AI that is unskilled at deceiving humans could still succeed if humans are not trying hard to notice the deception, while a skilled AI could fail if humans are trying hard. Thus, this particular Bostrom-Goertzel debate covers only one part of the AI risk. However, it is still the case that, given a certain amount of human effort at evaluating an AI’s values, Bostrom’s treacherous turn suggests a lower chance of successful evaluation than Goertzel’s sordid stumble.

2.2 Human creation of human-desirable AI values

A second disagreement concerns how difficult it would be for humans to give an AI human-desirable values.
If an AI’s values are human-desirable, then it is not crucial whether humans can evaluate them, because humans would not want to reprogram the AI or shut it down. As the AI gains in intelligence and power, it would simply take more and more human-desirable actions. Bostrom indicates relatively low prospects for success for humans to give AIs human-desirable values, whereas Goertzel indicates relatively high prospects for success. Bostrom (2014) argues that AIs are likely to have human-undesirable final goals because human-desirable goals are more complex:

There is nothing paradoxical about an AI whose sole final goal is to count the grains of sand on Boracay, or to calculate the decimal expansion of pi, or to maximize the total number of paperclips that will exist in its future light cone. In fact, it would be easier to create an AI with simple goals like these than to build one that had a human-like set of values and dispositions (Bostrom 2014, p.107).

The logic of the above passage is that creating an AI with human-desirable values is more difficult and thus less likely to occur. Goertzel (2016), citing Sotala (2015), refers to this as the difficulty thesis:

The difficulty thesis: Getting AIs to care about human values in the right way is really difficult, so even if we take strong precautions and explicitly try to engineer sophisticated beneficial goals, we may still fail (Goertzel 2016, p.60).

Goertzel (2016) discusses a Sotala (2015) argument against the difficulty thesis, which is that while human values are indeed complex and difficult to learn, AIs are increasingly capable of learning complex things. Per this reasoning, giving an AI human-desirable values is still more difficult than, say, programming it to calculate digits of pi, but it may nonetheless be a fairly straightforward task for common AI algorithms. Thus, while it would not be easy for humans to create an AI with human-desirable values, it would not be extraordinarily difficult either. Goertzel (2016), again citing Sotala (2015), refers to this as the weak difficulty thesis:

The weak difficulty thesis: It is harder to correctly learn and internalize human values, than it is to learn most other concepts. This might cause otherwise intelligent AI systems to act in ways that went against our values, if those AI systems had internalized a different set of values than the ones we wanted them to internalize.

A more important consideration than the absolute difficulty of giving an AI human-desirable values is its relative difficulty compared to the difficulty of creating an AI that could take over the world. A larger relative ease of creating an AI with human-desirable values implies a higher probability that AI catastrophe will be avoided for any given level of effort put into avoiding it.

There is reason to believe that the easier task is giving an AI human-desirable values. For comparison, every (or almost every) human being holds human-desirable values. Granted, some humans have more refined values than others, and some engage in violence or other antisocial conduct, but it is rare for someone to have pathological values like an incessant desire to calculate digits of pi. In contrast, none (or almost none) of us is capable of taking over the world. Characters like Alexander the Great and Genghis Khan are the exception, not the rule, and even they could have been assassinated by a single suicidal bodyguard. By the same reasoning, it may be easier for an AI to gain human-desirable values than it is for an AI to take over the world.
This reasoning does not necessarily hold, since AI cognition can differ substantially from human cognition, but it nonetheless suggests that giving an AI human-desirable values may be the easier task.

2.3 AI creation of human-desirable AI values

A third point of discussion concerns the potential for an AI to end up with human-desirable values even though its human creators did not give it such values. If AIs tend to end up with human-desirable values, this reduces the pressure on the human creators of AI to get the AI’s values right. It also increases the overall prospects for a positive AI outcome. To generalize, Bostrom proposes that AIs will tend to maintain stable values, whereas Goertzel proposes that AIs may tend to evolve values that could be more human-desirable.

Bostrom’s (2014) thinking on the matter centers on a concept he calls goal-content integrity:

Goal-content integrity: If an agent retains its present goals into the future, then its present goals will be more likely to be achieved by its future self. This gives the agent a present instrumental reason to prevent alteration of its final goals (Bostrom 2014, p.109-110).

The idea here is that an AI would seek to keep its values intact as one means of realizing its values. At any given moment, an AI has a certain set of values and seeks to act so as to realize these values. One factor it may consider is the extent to which its future self would also seek to realize these values. Bostrom’s argument is that an AI is likely to expect that its future self would realize its present values more if the future self retains the present self’s values, regardless of whether those values are human-desirable.

Goertzel (2016) proposes an alternative perspective that he calls ultimate value convergence:

Ultimate value convergence: Nearly all superintelligent minds will converge to the same universal value system (paraphrased from Goertzel 2016, p.60).

Goertzel further proposes that the universal value system will be “centered around a few key values such as Joy, Growth, and Choice” (Goertzel 2016, p.60). However, the precise details of the universal value system are less important than the possibility that the value system could resemble human-desirable values. This creates a mechanism through which an AI that begins with any arbitrary human-undesirable value system could tend towards human-desirable values.

Goertzel does not insist that the ultimate values would necessarily be human-desirable. To the contrary, he states that “if there are convergent ‘universal’ values, they are likely sufficiently abstract to encompass many specific value systems that would be abhorrent to us according to our modern human values” (Goertzel 2016, p.60). Thus, ultimate value convergence does not guarantee that an AI would end up with human-desirable values. Instead, it increases the probability that an AI would end up with human-desirable values if the AI begins with human-undesirable values. Alternatively, if the AI begins with human-desirable values, then, under the ultimate value convergence theory, the AI could drift to human-undesirable values. Indeed, if the AI begins with human-desirable values, then more favorable results (from humanity’s perspective) would accrue if the AI has goal-content integrity.

3 The ASI-PATH model

The ASI-PATH model was developed to model pathways to ASI catastrophe (Barrett and Baum 2017a).
ASI-PATH is a fault tree model, which means it is a graphical model with nodes that are connected by Boolean logic and point to some failure mode. For ASI-PATH, a failure mode is any event in which ASI causes global catastrophe. Fault tree models like ASI-PATH are used widely in risk analysis across a broad range of domains. A core virtue of fault trees is that, by breaking catastrophe pathways into their constituent parts, they enable more detailed study of how failures can occur and how likely they are to occur. It is often easier to focus on one model node at a time instead of trying to study all potential failure modes simultaneously. Furthermore, the fault tree’s logic structure creates a means of defining and quantifying model parameters and combining them into overall probability estimates. Indeed, the three points of the Bostrom-Goertzel disagreement (human evaluation of AI values, human creation of human-desirable AI values, and AI creation of human-desirable AI values) each map to one of the ASI-PATH parameters shown in Figure 1.

Figure 1: ASI catastrophe fault tree. Adapted from Barrett and Baum (2017a).

In Figure 1, the top node is ASI catastrophe. The left branch covers events that lead to the ASI gaining “decisive strategic advantage”, defined as “a level of technological and other advantages sufficient to enable it [the AI] to achieve complete world domination” (Bostrom, 2014, p. 78). The left branch models scenarios in which an initial “seed” AI undergoes recursive self-improvement and “takes off”, becoming successively more and more intelligent until it becomes an ASI. P1 is the probability that such an AI is possible in the first place. P2 is the probability that a seed AI is created and undergoes recursive self-improvement. P3 is the probability that the AI is not contained from gaining decisive strategic advantage; the containment can occur at any point in the process from seed AI to ASI. Containment is any measure that prevents a seed AI from gaining decisive strategic advantage, either by limiting recursive self-improvement or by preventing ASI from gaining decisive strategic advantage. Containment includes confinement, in which the AI’s ability to affect the rest of the world is restricted (Section 5), and enforcement, in which AI(s) prevent other AI(s) from gaining decisive strategic advantage.2

2 Barrett and Baum (2017a, p. 400) define confinement as “restrictions built into the AI’s hardware or software that limit the AI’s ability to affect the rest of the world so that it does not gain decisive strategic advantage”. This is slightly different than the Yampolskiy (2012) definition used in Section 5. This difference does not affect the overall argument of the present paper.

The right branch of Figure 1 covers events that could lead to the ASI taking actions that are “unsafe”, which is defined as actions that would result in a major global catastrophe. P4 is the probability that humans will fail to make ASI goals safe. P5 is the probability that the ASI will not make its own goals safe. Finally, P6 is the probability that the ASI will not be deterred from acting unsafely by some other agent, potentially another AI.

Because all the logic gates in Figure 1 are “AND”, the probability of ASI catastrophe, P, is simply the product of the six component probabilities:

P = \prod_{n=1}^{6} P_n    (1)

For convenience, we assume {P1, P2, P6} = 1.
These parameters are unrelated to the Bostrom-Goertzel disagreement as discussed in this paper. Instead, we focus on {P3, P4, P5}, for which there is significant disagreement.

P3 relates to the Bostrom-Goertzel disagreement about human evaluation of AI values (Section 2.1). In general, it should be easier to contain an AI earlier in the recursive self-improvement process because at that point it has less intelligence with which it could resist containment. Therefore, one factor in P3 is the potential for human observers to determine early in the process that this particular AI should be contained. The easier it is for humans to evaluate AI values, the earlier in the process they should be able to notice which AIs should be contained, and therefore the more probable it is that containment will succeed. In other words, easier human evaluation of AI values means lower P3.

P4 relates to the Bostrom-Goertzel disagreement about human creation of human-desirable AI values (Section 2.2). Human-desirable values are very likely to be safe in the sense that they would avoid major global catastrophe. While one can imagine the possibility that somehow, deep down inside, humans actually prefer global catastrophe, and thus that an AI with human-desirable values would cause catastrophe, we will omit this possibility. Instead, we assume that an AI with human-desirable values would not cause catastrophe. Therefore, the easier it is for humans to create AIs with human-desirable values, the more probable it is that catastrophe would be avoided. In other words, easier human creation of AI with human-desirable values means lower P4.

P5 relates to the Bostrom-Goertzel disagreement about AI creation of human-desirable AI values (Section 2.3). We assume that the more likely it is that an AI would create human-desirable values for itself, the more probable it is that catastrophe would be avoided. In other words, more likely AI creation of AI with human-desirable values means lower P5.

For each of these three variables, we define two “expert belief” variables corresponding to Bostrom’s and Goertzel’s positions on the corresponding issue:

- P3B is the value of P3 that follows from Bostrom’s position, the treacherous turn.
- P3G is the value of P3 that follows from Goertzel’s position, the sordid stumble.
- P4B is the value of P4 that follows from Bostrom’s position, the difficulty thesis.
- P4G is the value of P4 that follows from Goertzel’s position, the weak difficulty thesis.
- P5B is the value of P5 that follows from Bostrom’s position, goal-content integrity.
- P5G is the value of P5 that follows from Goertzel’s position, ultimate value convergence.

Given estimates for each of the above “expert belief” variables, one can calculate P according to the formula:

P = \prod_{n=1}^{6} (W_{nB} P_{nB} + W_{nG} P_{nG})    (2)

In Equation 2, W is a weighting variable corresponding to how much weight one places on Bostrom’s or Goertzel’s position for a given variable. Thus, for example, W3B is how much weight one places on Bostrom’s position for P3, i.e., how much one believes that an AI would conduct a treacherous turn. For simplicity, we assume WnB + WnG = 1 for n = {3, 4, 5}. This is to assume that for each of {P3, P4, P5}, either Bostrom or Goertzel holds the correct position. This is a significant assumption: it could turn out to be the case that they are both mistaken. The assumption is made largely for analytical and expository convenience. For n = {1, 2, 6}, where we do not model expert disagreement, the corresponding factor in Equation 2 is simply Pn, which we have assumed to equal 1.

This much is easy. The hard part is quantifying each of the P and W variables in Equation 2.
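To make Equations 1 and 2 concrete, the following sketch implements the fault tree aggregation in Python. This is a minimal illustration rather than part of the ASI-PATH software; the function name asi_catastrophe_probability and the dictionary format are our own, and any parameters omitted from the input (here n = 1, 2, 6) are implicitly treated as equal to 1, following the convenience assumption above.

# Minimal sketch of Equations 1 and 2 (hypothetical helper, not part of ASI-PATH).
# Each entry of `estimates` holds one parameter's expert estimates (PB, PG) and the
# weights placed on each expert (WB, WG), with WB + WG assumed to equal 1.
def asi_catastrophe_probability(estimates):
    p = 1.0
    for e in estimates.values():
        weighted = e["WB"] * e["PB"] + e["WG"] * e["PG"]  # weighted estimate of P_n
        p *= weighted  # AND gates: multiply the component probabilities (Equation 1)
    return p  # parameters not listed in `estimates` are implicitly 1

If all weight is placed on a single expert (WB = 1 or WG = 1), the calculation reduces to Equation 1 evaluated at that expert's parameter estimates.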
What follows is an attempt to specify how we would quantify these variables. We estimate the P variables by relating the arguments of Bostrom and Goertzel to the variables and taking into account any additional aspects of the variables. We aim to be faithful to Bostrom’s and Goertzel’s thinking. We estimate the W variables by making our own (tentative) judgments about the strength of Bostrom’s and Goertzel’s arguments as we currently see them. Thus, the P estimations aim to represent Bostrom’s and Goertzel’s thinking and the W estimations represent our own thinking. Later in the paper we also explore the implications of giving both experts’ arguments equal weighting (i.e., WnB = WnG = 0.5 for each n) and of giving full weighting to exclusively one of the two experts.

We make no claims to having the perfect or final estimations of any of these parameters. To the contrary, we have low confidence in our current estimations, in the sense that we expect we would revise our estimations significantly in the face of new evidence and argument. But there is value in having some initial estimations to stimulate thinking on the matter. We thus present our estimations largely for the sake of illustration and discussion. We invite interested readers to make their own.

3.1 P3 and W3: containment fails

The human evaluation of AI values is only one aspect of containment. Other aspects include takeoff speed (faster takeoff means less opportunity to contain AI during recursive self-improvement) and ASI containment (measures to prevent an ASI from gaining decisive strategic advantage). Therefore, the Bostrom-Goertzel disagreement about human evaluation of AI values should only produce a relatively small difference in P3. Bostrom and Goertzel may well disagree on other aspects of P3, but those are beyond the scope of this paper.

Bostrom’s position, the treacherous turn, corresponds to a higher probability of containment failure and thus a higher value of P3 relative to Goertzel’s position, the sordid stumble. We propose a 10 percentage-point difference in P3 between Bostrom and Goertzel, i.e., P3B - P3G = 0.1. The absolute magnitude of P3B and P3G will depend on various case-specific details—for example, a seed AI launched on a powerful computer is more likely to have a fast takeoff and thus less likely to be contained. For simplicity, we will use P3B = 0.6 and P3G = 0.5, while noting that other values are also possible.

Regarding W3B and W3G, our current view is that the sordid stumble is significantly more plausible. We find it relevant that AIs are already capable of learning complex tasks like face recognition, yet such AIs are nowhere near capable of outwitting humans with a web of lies. Additionally, it strikes us as much more likely that an AI would exhibit human-undesirable behavior before it becomes able to deceive humans, and indeed long enough in advance to give humans plenty of time to contain the situation. Therefore, we estimate W3B = 0.1 and W3G = 0.9.

3.2 P4 and W4: humans fail to give AI safe goals

The Bostrom-Goertzel disagreement about human creation of human-desirable AI values is relevant to the challenge of humans giving AI safe goals. Therefore, the disagreement can yield large differences in P4. Bostrom’s position, the difficulty thesis, corresponds to a higher probability of humans failing to give the AI safe goals and thus a higher value of P4 relative to Goertzel’s position, the weak difficulty thesis.
The values of P4B and P4G will depend on various case-specific details, such as how hard humans try to give the AI safe goals. As representative estimates, we propose P4B = 0.9 and P4G = 0.4.

Regarding W4B and W4G, our current view is that the weak difficulty thesis is significantly more plausible. The fact that AIs are already capable of learning complex tasks like face recognition suggests that learning human values is not a massively intractable task. An AI would not please everyone all the time—this is impossible—but it could learn to have broadly human-desirable values and behave in broadly human-desirable ways. However, we still see potential for the complexities of human values to pose AI training challenges that go far beyond what exists for tasks like face recognition. Therefore, we estimate W4B = 0.3 and W4G = 0.7.

3.3 P5 and W5: AI fails to give itself safe goals

The Bostrom-Goertzel disagreement about AI creation of human-desirable AI values is relevant to the challenge of the AI giving itself safe goals. Therefore, the disagreement can yield large differences in P5. Bostrom’s position, goal-content integrity, corresponds to a higher probability of the AI failing to give itself safe goals and thus a higher value of P5 relative to Goertzel’s position, ultimate value convergence. Indeed, an AI with perfect goal-content integrity will never change its goals. For ultimate value convergence, the key factor is the relation between ultimate values and human-desirable values; a weak relation suggests a high probability that the AI will end up with human-undesirable values. Taking these considerations into account, we propose P5B = 0.95 and P5G = 0.5.

Regarding W5B and W5G, our current view is that goal-content integrity is significantly more plausible. While it is easy to imagine that an AI would not have perfect goal-content integrity, due to a range of real-world complications, we nonetheless find it compelling that this would be a general tendency of AIs. In contrast, we see no reason to believe that AIs would all converge towards some universal set of values. To the contrary, we believe that an agent’s values derive mainly from its cognitive architecture and its interaction with its environment; different architectures and interactions could lead to different values. Therefore, we estimate W5B = 0.9 and W5G = 0.1.

4 The probability of ASI catastrophe

Table 1 summarizes the various parameter estimates in Sections 3.1-3.3. Using these estimates, recalling the assumption {P1, P2, P6} = 1, and following Equation 2 gives P = (0.1*0.6 + 0.9*0.5) * (0.3*0.9 + 0.7*0.4) * (0.9*0.95 + 0.1*0.5) ≈ 0.25. In other words, this set of parameter estimates implies an approximately 25% probability of ASI catastrophe. For comparison, giving equal weighting to Bostrom’s and Goertzel’s positions (i.e., setting each WB = WG = 0.5) yields P ≈ 0.26; using only Bostrom’s arguments (i.e., setting each WB = 1) yields P ≈ 0.51; and using only Goertzel’s arguments (i.e., setting each WG = 1) yields P = 0.1.

n    PB      PG     WB     WG
3    0.6     0.5    0.1    0.9
4    0.9     0.4    0.3    0.7
5    0.95    0.5    0.9    0.1

Table 1: Summary of parameter estimates in Sections 3.1-3.3.

Catastrophe probabilities of 0.1 and 0.51 may diverge by a factor of 5, but they are both still extremely high. Even “just” a 0.1 chance of major catastrophe could warrant extensive government regulation and/or other risk management. Thus, however much Bostrom and Goertzel may disagree with each other, they would seem to agree that ASI constitutes a major risk.
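As an arithmetic check, the illustrative asi_catastrophe_probability sketch introduced after Equation 2 can be applied to the Table 1 estimates. The dictionary below simply restates Table 1, and the alternative weightings reproduce the comparison figures quoted above (values rounded to two decimals).

# Table 1 estimates, fed to the hypothetical helper sketched after Equation 2.
table1 = {
    3: {"PB": 0.6,  "PG": 0.5, "WB": 0.1, "WG": 0.9},
    4: {"PB": 0.9,  "PG": 0.4, "WB": 0.3, "WG": 0.7},
    5: {"PB": 0.95, "PG": 0.5, "WB": 0.9, "WG": 0.1},
}
print(round(asi_catastrophe_probability(table1), 2))  # our weights: 0.25

# Equal weighting (0.26), Bostrom only (0.51), and Goertzel only (0.1).
equal = {n: {**e, "WB": 0.5, "WG": 0.5} for n, e in table1.items()}
bostrom_only = {n: {**e, "WB": 1.0, "WG": 0.0} for n, e in table1.items()}
goertzel_only = {n: {**e, "WB": 0.0, "WG": 1.0} for n, e in table1.items()}
print(round(asi_catastrophe_probability(equal), 2),
      round(asi_catastrophe_probability(bostrom_only), 2),
      round(asi_catastrophe_probability(goertzel_only), 2))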
However, an abundance of caveats is required. First, the assumption {P1, P2, P6} = 1 was made without any justification. Any thoughtful estimates of these parameters would almost certainly be lower. Our intuition is that ASI from AI takeoff is likely to be possible, and ASI deterrence seems unlikely to occur, suggesting {P1, P6} ≈ 1, but that the creation of seed AI is by no means guaranteed, suggesting P2 << 1. This implies P ≈ 0.25 is likely an overestimate.

Second, the assumption that the correct position was either Bostrom’s or Goertzel’s was also made without any justification. They could both be wrong, or the correct position could be some amalgam of both of their positions, or an amalgam of both of their positions plus other position(s). Bostrom and Goertzel are both leading thinkers about ASI, but there is no reason to believe that their range of thought necessarily corresponds to the breadth of potential plausible thought. To the contrary, the ASI topic remains sufficiently unexplored that it is likely that many other plausible positions can be formed. Accounting for these other positions could send P to virtually any value in [0, 1].

Third, the estimates in Table 1 were made with little effort, largely for illustration and discussion purposes. Many of these estimates could be significantly off, even by several orders of magnitude. Given the product form of Equation 2, a single very low value of any factor WnB*PnB + WnG*PnG would also make P very low. This further implies that P ≈ 0.25 is likely an overestimate, potentially by several orders of magnitude.

Fourth, the estimates in Table 1 depend on a range of case-specific factors, including what other containment measures are used, how much effort humans put into giving the AI human-desirable values, and what cognitive architecture the AI has. Therefore, different seed AIs self-improving under different conditions would yield different values of P, potentially including much larger and much smaller values.

5 A practical application: AI confinement

A core motivation for analyzing ASI risk is to inform practical decisions aimed at reducing the risk. Risk analysis can help identify which actions would reduce the risk and by how much. Different assessments of the risk—such as from experts’ differing viewpoints—can yield different results in terms of which actions would best reduce the risk. Given the differences observed in the viewpoints of Bostrom and Goertzel about ASI risk, it is possible that different practical recommendations could follow. To illustrate this, we apply the above risk analysis to model the effects of decisions on a proposed ASI risk reduction measure known as AI confinement:

AI confinement: The challenge of restricting an artificially intelligent entity to a confined environment from which it can’t exchange information with the outside environment via legitimate or covert channels if such information exchange was not authorized by the confinement authority (Yampolskiy 2012, p.196).

AI confinement is a type of containment and thus relates directly to the P3 (containment fails) variable in the ASI-PATH model (Figure 1). Stronger confinement makes it less likely that an AI takeoff would result in an ASI gaining decisive strategic advantage. Confinement might be achieved, for example, by disconnecting the AI from the internet and placing it in a Faraday cage. Superficially, strong confinement would seem to reduce ASI risk by reducing P3.
However, strong confinement could increase ASI risk in other ways. In particular, by limiting interactions between the AI and the human population, strong confinement could limit the AI’s capability to learn human-desirable values, thereby increasing P4 (failure of human attempts to make ASI goals safe). For comparison, AIs currently learn to recognize key characteristics of images (e.g., faces) by examining large data sets of images, often guided by human trainers to help the AI correctly identify image features. Similarly, an AI may be able to learn human-desirable values by observing large data sets of human decision-making, human ethical reflection, or other phenomena, and may further improve via the guidance of human trainers. Strong confinement could limit the potential for the AI to learn human-desirable values, thus increasing P4.

Bostrom and Goertzel have expressed divergent views on confinement. Bostrom has favored strong confinement, even proposing a single international ASI project in which “the scientists involved would have to be physically isolated and prevented from communicating with the rest of the world for the duration of the project, except through a single carefully vetted communication channel” (Bostrom 2014, p. 253). Goertzel has explicitly criticized this proposal (Goertzel 2015, p.71-73) and instead argued that an open project would be safer, writing that “The more the AGI system is engaged with human minds and other AGI systems in the course of its self-modification, presumably the less likely it is to veer off in an undesired and unpredictable direction” (Goertzel and Pitt 2012, p.13). Each expert would seem to be emphasizing different factors in ASI risk: P3 for Bostrom and P4 for Goertzel.

The practical question here is how strong to make the confinement for an AI. Answering this question requires resolving the tradeoff between P3 and P4. This in turn requires knowing the size of P3 and P4 as a function of confinement strength. Estimating that function is beyond the scope of this paper. However, as an illustrative consideration, suppose that it is possible to have strong confinement while still giving the AI good access to human-desirable values. For example, perhaps a robust dataset of human decisions, ethical reflections, etc. could be included inside the confinement. In this case, the effect of strong confinement on P4 may be small. Meanwhile, if there is no arrangement that could shrink the effect of confinement on P3 (that is, if confinement would continue to have a large effect on P3), then perhaps strong confinement would be better. This and other practical ASI risk management questions could be pursued in future research.

6 Conclusion

Estimates of the risk of ASI catastrophe can depend heavily on which expert makes the estimate. A neutral observer should consider arguments and estimates from all available experts and any other sources of information. This paper analyzes ASI catastrophe risk using arguments from two experts, Nick Bostrom and Ben Goertzel. Applying their arguments to an ASI risk model, we calculate that their respective ASI risk estimates vary by a factor of five: P ≈ 0.51 for Bostrom and P = 0.1 for Goertzel. Our estimate, combining both experts’ arguments, is P ≈ 0.25. Weighting both experts equally gave a similar result of P ≈ 0.26. These numbers come with many caveats and should be used mainly for illustration and discussion purposes.
More carefully considered estimates could easily be much closer to either 0 or 1. These numbers are interesting, but they are not the only important part, or even the most important part, of this analysis. There is greater insight to be obtained from the details of the analysis than from the ensuing numbers. This is especially the case for this analysis of ASI risk because the numbers are so tentative and the underlying analysis so comparatively rich.

This paper is just an initial attempt to use expert judgment to quantify ASI risk. Future research can and should do the following: examine Bostrom’s and Goertzel’s arguments in greater detail so as to inform the risk model’s parameters; consider arguments and ideas from a wider range of experts; conduct formal expert surveys to elicit expert judgments of risk model parameters; explore different weighting techniques for aggregating across expert judgment, as well as circumstances in which weighted aggregation is inappropriate; conduct sensitivity analysis across spaces of possible parameter values, especially in the context of the evaluation of ASI risk management decision options; and do all of this for a wider range of model parameters, including {P1, P2, P6} as well as more detailed components of {P3, P4, P5}, such as those modeled in Barrett and Baum (2017a; 2017b). Future research can also explore the effect on overall ASI risk when multiple ASI systems are launched: perhaps some would be riskier than others, and it may be important to avoid catastrophe from all of them.

One overarching message of this paper is that more detailed and rigorous analysis of ASI risk can be achieved when the risk is broken into constituent parts and modeled, such as in Figure 1. Each component of ASI risk raises a whole host of interesting and important details that are worthy of scrutiny and debate. Likewise, aggregate risk estimates are better informed and generally more reliable when they are made from detailed models. To be sure, it is possible for models to be too detailed, burdening experts and analysts with excessive minutiae. However, given the simplicity of the risk models at this early stage of ASI risk analysis, we believe that, at this time, more detail is better.

A final point is that the size of ASI risk depends on many case-specific factors that in turn depend on many human actions. This means that the interested human actor has a range of opportunities available for reducing the probability of ASI catastrophe. Risk modeling is an important step towards identifying which opportunities are most effective at reducing the risk. ASI catastrophe is by no means a foregone conclusion. The ultimate outcome may well be in our hands.

7 Acknowledgement

We thank Ben Goertzel, Miles Brundage, Kaj Sotala, Steve Omohundro, Allan Dafoe, Stuart Armstrong, Ryan Carey, Nell Watson, and Matthijs Maas for helpful comments on an earlier draft. Any remaining errors are the authors’ alone. Work for this paper is funded by Future of Life Institute grant 2015-143911. The views in this paper are those of the authors and do not necessarily reflect the views of the Global Catastrophic Risk Institute or the Future of Life Institute.

8 References

[1] Armstrong S, Sotala K (2012). How we’re predicting AI—or failing to. In Romportl J, Ircing P, Zackova E, Polak M, Schuster R (eds), Beyond AI: Artificial Dreams. Pilsen, Czech Republic: University of West Bohemia, pp. 52-75.
[2] Armstrong S, Sotala K, Ó hÉigeartaigh SS (2014).
The errors, insights and lessons of famous AI predictions – and what they mean for the future. Journal of Experimental & Theoretical Artificial Intelligence 26(3), 317-342.
[3] Barrett AM, Baum SD (2017a). A model of pathways to artificial superintelligence catastrophe for risk and decision analysis. Journal of Experimental & Theoretical Artificial Intelligence 29(2), 397-414.
[4] Barrett AM, Baum SD (2017b). Risk analysis and risk management for the artificial superintelligence research and development process. In Callaghan V, Miller J, Yampolskiy R, Armstrong S (eds), The Technological Singularity: Managing the Journey. Berlin: Springer, pp. 127-140.
[5] Baum SD, Goertzel B, Goertzel TG (2011). How long until human-level AI? Results from an expert assessment. Technological Forecasting & Social Change 78(1), 185-195.
[6] Bostrom N (2014). Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press.
[7] Goertzel B (2015). Superintelligence: Fears, promises and potentials. Journal of Evolution and Technology 25(2), 55-87.
[8] Goertzel B (2016). Infusing advanced AGIs with human-like value systems: Two theses. Journal of Evolution and Technology 26(1), 50-72.
[9] Goertzel B, Pitt J (2012). Nine ways to bias open-source AGI toward friendliness. Journal of Evolution and Technology 22(1), 116-131.
[10] Müller VC, Bostrom N (2014). Future progress in artificial intelligence: A survey of expert opinion. In Müller VC (ed), Fundamental Issues of Artificial Intelligence. Berlin: Springer, pp. 555-572.
[11] Oreskes N (2004). The scientific consensus on climate change. Science 306(5702), 1686.
[12] Yampolskiy R (2012). Leakproofing the Singularity: Artificial intelligence confinement problem. Journal of Consciousness Studies 19(1-2), 194-214.