https://doi.org/10.31449/inf.v47i1.4497
Informatica 47 (2023) 83–96

Learning the Structure of Bayesian Networks from Incomplete Data Using a Mixture Model

Issam Salman (1), Jiří Vomlel (2)
(1) Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague, Trojanova 13, 120 01, Prague, CZ
(2) Institute of Information Theory and Automation of the CAS, Pod Vodárenskou věží 4, 182 00, Prague, CZ
E-mail: Issam.Salman@fjfi.cvut.cz, vomlel@utia.cas.cz

Keywords: Bayesian network, belief-noisy-OR, structure learning, incomplete data, EM-mixture

Received: November 8, 2022

In this paper, we provide an approach to learning optimal Bayesian network (BN) structures from incomplete data based on the BIC score function, using a mixture model to handle missing values. We have compared the proposed approach with other methods. Our experiments have been conducted on different models, some of them Belief Noisy-Or (BNO) ones. We have performed experiments using datasets with values missing completely at random, with different missingness rates and data sizes. We have analyzed the significance of differences between the algorithm performance levels using the Wilcoxon test. The new approach typically learns additional edges in the case of Belief Noisy-Or models. We have analyzed this issue using the Chi-square test of independence between the variables in the true models; this analysis reveals that the additional edges can be explained by strong dependence in the generated data. An important property of our new method for learning BNs from incomplete data is that it can learn not only optimal general BNs but also specific Belief Noisy-Or models, which are used in many applications, for example in medicine.

Povzetek (Slovenian abstract): A method is developed for determining an optimal Bayesian network from incomplete data.

1 Introduction

Bayesian networks (BNs) have been used in a variety of applications. The challenge of learning a BN can be categorized into two parts: (1) structural learning, which involves identifying the topology of the BN; and (2) parametric learning, which involves estimating the conditional probabilities for a given network structure. Learning the structure of a BN is by far the more difficult of the two tasks. Most methods, such as [1] and [2], require complete data, while in practical applications we are often confronted with values missing from the dataset; this problem concerns both parts (1) and (2) mentioned above and affects the performance of model learning: with such methods, a record with a missing value has to be omitted from the dataset. An earlier work [3] studied the impact of learning the parameters and the structure of a BN using hard EM and soft EM, with a comprehensive simulation study covering incomplete data.

In this paper, we study the problem of learning the optimal BN structure from incomplete data, adopting a new approach that uses product distribution mixture models to handle missing values; the mixture model is combined with [2] to estimate the missing values and learn the optimal structure. In addition, we show in this paper that our new approach is able to learn the structure of a Belief Noisy-OR (BNO) [4] model from incomplete data.

2 Bayesian network

A Bayesian network encodes a joint probability distribution over a set of random variables U = {X_1, X_2, ..., X_m}. We consider only discrete variables in this work, which is the most common current usage of BNs. The finite set of states of a variable X_i will be denoted by 𝒳_i.
Conditional probability distributions (CPDs) are attached to each variable in the network. Their purpose is to quantify the strength of the relationships depicted in the BN through its structure: these CPDs mathematically describe the behavior of each variable under every possible value assignment of its parents. Since specifying this behavior requires a number of parameters exponential only in the number of parents, and since the number of parents is typically much smaller than the number of variables in the domain, this approach results in exponential savings in space and time.

Formally, a Bayesian network for U is a pair B = ⟨G, Θ⟩. Its first component, G, is a directed acyclic graph whose vertices correspond to the random variables in U and whose edges represent direct dependencies between these variables. The graph G encodes independence assumptions: each variable X_i is independent of its non-descendants given its parents in G. The second component of the pair, Θ, represents the set of parameters that quantify the network. It contains a parameter θ_{x_i|π_{x_i}} = F(x_i | π_{x_i}) for each possible value x_i of X_i and each configuration π_{x_i} of the parent set Π_{X_i}, where Π_{X_i} denotes the set of parents of X_i in G. Accordingly, a Bayesian network B defines a unique joint probability distribution over U given by

\[
F(X_1 = x_1, \ldots, X_m = x_m) \;=\; \prod_{i=1}^{m} F(X_i = x_i \mid \Pi_{X_i} = \pi_{x_i}) \;=\; \prod_{i=1}^{m} \theta_{x_i \mid \pi_{x_i}},
\]

where π_{x_i} is the configuration of the parents Π_{X_i} determined by (x_1, ..., x_m).

2.1 Structure learning of BN

Note that a BN can be viewed from two perspectives: as an effective encoding of an independence relation on the one hand, and as an effective encoding of a high-dimensional probability distribution on the other hand.

One option for learning the structure is to rely on specialists in the field through a conscious and meticulous process of knowledge gathering. This involves training experts in probabilistic graphical modeling, validating expert opinions, and extracting and testing information. This process all too often leads to disagreements among experts and a lack of reliability of the resulting model. Nonetheless, in many fields where data is scarce, this is one of the key approaches to model building.

Another mechanism is the automatic derivation of the model from a data set. It is this machine learning (ML) approach that we follow here (thereby avoiding the very rich field of human knowledge acquisition). For a data set D = {u_1, u_2, ..., u_n}, where u_i is an instantiation of all variables in U, BN structure learning translates to the problem of learning a network structure from D. Suppose D is complete and discrete. Then finding the optimal Bayesian network reduces to finding the optimal structure. The optimal structure can be learned by three approaches coming from the area of ML.

The first is the constraint-based approach to structure learning, which takes advantage of the first perspective and attempts to reconstruct a Bayesian network by analyzing independencies in the data. These algorithms require an infinite amount of data to learn independence with certainty; high-order independence tests can be unreliable unless the sample size is truly huge [5]. The second is the score-based approach, which builds on the second perspective and looks for the Bayesian network that describes the available data best according to a score. The core of this approach is to assign a score value s(G) to each acyclic directed graph G.
The score function defines a total order (up to equivalences) over the structures, in such a way that a structure providing a better description of the data is assigned a higher value. The last approach is a hybrid approach, which mixes the two previous approaches together.

2.2 Score-based learning

Score-based learning is a technique frequently used for determining the optimal structure. In this process, each candidate BN is assigned a score measuring the goodness-of-fit of its structure to the data. The goal of the learning problem is then to find the structure with the optimal score. The score usually measures how well the BN describes the data set D.

Definition 1. Let B = ⟨G, Θ⟩ be a Bayesian network, and let D = {u_1, ..., u_n} be a training set, where each u_i assigns a value to all variables in U. The MDL scoring function of a network B given a training data set D, written MDL(B | D), is given by

\[
\mathrm{MDL}(G \mid D) \;=\; LL(G \mid D) - \frac{\log n}{2}\, |G|,
\]

where |G| is the number of parameters in the network. The first term is the log-likelihood, i.e., it measures the model fit. The second term penalizes the model complexity. The penalty term of MDL is greater than that of most other scoring functions; hence optimal networks under MDL are usually sparser than optimal networks found with other scoring functions. As its name suggests, an optimal network with MDL minimizes the scoring function rather than maximizing it.

The Bayesian information criterion (BIC) [6] is a scoring function whose calculation is equivalent to MDL for Bayesian networks, but it is derived on the basis of the models' asymptotic behavior. Since the score is decomposable, it can be written as a sum of the scores of each variable and its parent set:

\[
\mathrm{BIC}(G \mid D) \;=\; \sum_{i=1}^{m} \mathrm{BIC}(X_i \mid \Pi_{X_i}) \;=\; \sum_{i=1}^{m} \bigl\{ LL(X_i \mid \Pi_{X_i}) - \mathrm{Penalty}(X_i \mid \Pi_{X_i}) \bigr\}.
\]

The aim of score-based algorithms is to optimize this score and return the structure G that maximizes it. As the space of all possible structures is at least exponential in the number of variables m, this presents a number of problems: there are m(m−1)/2 possible undirected edges, hence 2^{m(m−1)/2} possible subsets of these edges, and for each such subset there may be more than one way to orient the edges. One popular choice is hill-climbing [7].

2.3 Structural learning with pruning

Statistical testing is one method of reducing the set of potential DAGs. Another approach to reducing this set is to use constraints provided by experts. Besides that, we can use structural constraints similar to those in [2]. The structural constraints can be applied locally as long as they involve only one node and its parents.

Algorithm 1 represents an approach to learning the optimal structure of a BN using the constraint rules and a decomposable score [8]. The main function of the algorithm is to compute a collection of candidate parent sets for each variable. Then we optimize across this collection by selecting one parent set for each variable, without creating directed cycles, while maximizing the total score. The following lemma can be used to reduce the size of the collections of candidate parent sets.

Lemma 2.1. Let X_i be a variable and Π' be a candidate parent set for X_i. Suppose that BIC(X_i | Π) > BIC(X_i | Π') for some Π ⊂ Π'. Then any DAG G' in which Π' is the parent set of X_i is not BIC optimal, so Π' can be discarded. This statement also holds if the subset is the empty set Π = {}.
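To make the use of the decomposable score and the pruning rule concrete, the following sketch computes the local BIC score of a variable given a candidate parent set from data counts and then applies the pruning rule of Lemma 2.1. It is a minimal illustration under our own assumptions, not the implementation used in the paper; the function names (local_bic, prune_candidates), the pandas-based counting, and the use of observed parent configurations in the penalty are illustrative simplifications.

```python
# Minimal sketch (not the paper's implementation): local BIC score of a
# variable given a candidate parent set, plus the pruning rule of Lemma 2.1.
from itertools import combinations
from math import log
import pandas as pd

def local_bic(data: pd.DataFrame, child: str, parents: tuple) -> float:
    """BIC(X_i | Pi) = LL(X_i | Pi) - (log n / 2) * (#free parameters)."""
    n = len(data)
    r_child = data[child].nunique()
    ll = 0.0
    groups = data.groupby(list(parents)) if parents else [(None, data)]
    q = 0  # number of parent configurations actually observed (simplification)
    for _, block in groups:
        q += 1
        n_block = len(block)
        for c in block[child].value_counts():   # maximum-likelihood log-likelihood
            ll += c * log(c / n_block)
    penalty = (log(n) / 2) * q * (r_child - 1)
    return ll - penalty

def prune_candidates(data, child, variables, max_size=2):
    """Keep only candidate parent sets not ruled out by Lemma 2.1."""
    others = [v for v in variables if v != child]
    scores = {(): local_bic(data, child, ())}
    for size in range(1, max_size + 1):
        for cand in combinations(others, size):
            scores[cand] = local_bic(data, child, cand)
    kept = {}
    for cand, s in scores.items():
        # discard cand if some proper subset already scores strictly better
        subsets = [sub for size in range(len(cand))
                   for sub in combinations(cand, size)]
        if all(scores[sub] <= s for sub in subsets):
            kept[cand] = s
    return kept
```

In A1/A2, the surviving candidate parent sets and their scores would then be handed to an exact solver such as GOBNILP, which selects one parent set per variable while avoiding directed cycles.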
Algorithm 1: Parent set evaluation for the BN structure learning algorithm

Input:
  D : a data set
  m : an integer representing the number of variables in D
Output: accepted sets of parents for each node

Phase 1: initialize the parameters
  g_i = (V, E)   // a DAG containing a node and its candidate parent set
  S_i : the BIC score of g_i
  Q_i : a priority queue of triples (X_i, Π_{X_i}, S_i) ordered by S_i

Phase 2: mscour(X_i, D)   // function returning the baseline (minimal) BIC score
  S_i = the BIC score of g_i where only X_i is included
  return S_i

Phase 3: find the accepted Q_i for X_i
  Q_i is empty
  S* = mscour(X_i, D)
  Π_{X_i} is a parent set for X_i
  add (X_i, Π_{X_i}, S*) to Q_i
  for each X_k, k ∈ {1, ..., m}, k ≠ i, do:
    add X_k to Π_{X_i}
    S_ki = the BIC score of the updated Π_{X_i}
    if S_ki > S*:
      add (X_i, Π_{X_i}, S_ki) to Q_i
      for each X_j, j ∈ {1, ..., m}, i ≠ j ≠ k, do:
        add X_j to Π_{X_i}
        S_ki = the BIC score of the updated Π_{X_i}
        if S_ki > S*:
          add (X_i, Π_{X_i}, S_ki) to Q_i
        else:
          delete X_j from Π_{X_i}
      end for
    else:
      delete X_k from Π_{X_i}
  end for

The DAG g_i = (V, E) in Phase 1 of Algorithm 1 contains the set of nodes V = {X_i} ∪ Π_{X_i} and the set of arcs E = {(X_o, X_i) : X_o ∈ Π_{X_i}}. Algorithm 1 considers all possible parent sets that can lead to an optimal BN. Its implementation is based on [8]. After Phase 3, we find a DAG with the highest BIC given the candidate parent sets of each variable. This is done using the GOBNILP tool [9] (https://www.cs.york.ac.uk/aig/sw/gobnilp/), an efficient algorithm based on integer linear programming. We will refer to this algorithm as A1.

Let us also note that if a dataset is generated from a BN having the empty graph as its structure and this dataset is large enough, then for any nonempty parent set Π_{X_i} ≠ ∅ it holds that BIC(X_i | {}) > BIC(X_i | Π_{X_i}). This implies that the variables are independent and that the penalty for larger parent sets makes the BIC value worse for all nonempty parent sets.

One of the pruning rules stated in the literature says that if a candidate parent set has a better score than another candidate set of which it is a subset, it is safe to disregard the larger candidate set; this follows from the decomposability of the score functions. We have applied this pruning rule, as formalized in Lemma 2.1, in Algorithm 1. The algorithm reduces the collection of accepted parent sets for each node by discarding all parent sets that do not meet the criterion.

3 Incomplete datasets

One of the widespread problems in data mining and machine learning is incomplete data. Values may be missing even from training instances. Nowadays more and more datasets are available, but most of them are incomplete; therefore, machine learning must cope with this problem. Normally, to learn the BN structure using the A1 algorithm [2], we need complete data, such that all instances u_i ∈ D, i ∈ {1, ..., n}, are complete and do not have any missing values. In the case of incomplete data, A1 does not use an instance with a missing value for the BN structure learning.

3.1 Product distribution mixtures to handle incomplete data

Because of incomplete data, many methods in machine learning cannot be applied directly. An easy way to deal with this problem is to complete the data by simply omitting the incomplete vectors or removing the incomplete variables. But this approach has a weakness: we may lose a massive part of the available information. Another alternative is to use estimates to replace the missing values [10] (i.e., to impute the missing values).
However, such estimates tend to be "typical" values, so the natural variability of the data is partially suppressed. For that reason, the product mixture model gives us a better way: it allows us to apply the EM algorithm directly to complete the dataset [11]. We will refer to this approach as EM-Mixture. Considering finite mixtures, we assume that

\[
P(X) \;=\; \sum_{j=1}^{r} w_j\, F(X \mid j), \tag{1}
\]
\[
F(X \mid j) \;=\; \prod_{i=1}^{m} F_i(X_i \mid j), \qquad \sum_{j=1}^{r} w_j = 1, \tag{2}
\]

where w_j > 0 is the probabilistic weight of the j-th mixture component, F_i is the conditional distribution of the variable X_i, i ∈ {1, ..., m}, and r is the number of components. Note that the product form of the components does not imply that the involved variables are independent. In this sense, the mixture model (1) is not restrictive [12]. It is easy to verify that, by increasing the number of components r, we can describe any discrete probability distribution in the form (1). In our experiments, r was selected based on the number of variables in the dataset.

To estimate the mixture parameters, we maximize the log-likelihood function

\[
LL \;=\; \sum_{k=1}^{n} \log P(u^{(k)}),
\]

where n is the number of records in the dataset D and u^{(k)} is the k-th data vector from D. We will use the EM algorithm to maximize the log-likelihood function.

Next, we explain how the learned product mixture model is used to fill in the missing values. Let C = {i_1, i_2, ..., i_k} be a subset of M = {1, 2, ..., m} such that the corresponding sub-vector u_C = (x_{i_1}, x_{i_2}, ..., x_{i_k}) is complete. Then, under the product mixture model, we can compute the marginal probability of u_C as

\[
P_C(u_C) \;=\; \sum_{j=1}^{r} w_j\, F_C(u_C \mid j), \tag{3}
\]
\[
F_C(u_C \mid j) \;=\; \prod_{i \in C} F_i(x_i \mid j). \tag{4}
\]

Let z be the index of a variable unobserved in u, i.e., z ∈ M \ C. Under the product mixture model, we can compute the conditional distribution of the missing value u_z given the complete part u_C with P_C(u_C) > 0 as

\[
P_{z \mid C}(u_z \mid u_C) \;=\; \frac{P_{z,C}(u_z, u_C)}{P_C(u_C)} \;=\; \sum_{j=1}^{r} W_j(u_C)\, F_z(u_z \mid j),
\]

where W_j(u_C) are the conditional component weights

\[
W_j(u_C) \;=\; \frac{w_j\, F_C(u_C \mid j)}{\sum_{l=1}^{r} w_l\, F_C(u_C \mid l)}.
\]

We thus compute the probability distribution P_{z|C}(u_z | u_C) for each missing value of each data vector u ∈ D with a missing value. There are several ways of using this probability distribution to fill in the missing value of X_z in u; in this paper, we select the value u_z maximizing P_{z|C}(u_z | u_C) over all values in 𝒳_z.

The last step of our presentation is the description of how the EM algorithm for learning product mixture models is adapted so that it is applicable to incomplete data. Given a data vector u ∈ D and a variable X_i with index i ∈ {1, 2, ..., m}, let N(u) be the subset of indices of the variables observed in u, and let D(i) ⊂ D be the subset of vectors with an observed value of variable X_i:

\[
N(u) = \{ v \in \{1, 2, \ldots, m\} : u_v \text{ is observed in } u \}, \qquad
D(i) = \{ u \in D : i \in N(u) \}.
\]

In Algorithm 2, we present the modification of the EM algorithm for the product mixture model for incomplete data. For x_v ∈ 𝒳_v, v ∈ {1, 2, ..., m}, and j = 1, ..., r, we use F_v(x_v | j) to denote the conditional probability of observing value x_v of variable X_v given the component j.
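As an illustration of the imputation step just described, the following sketch computes P_{z|C} for one incomplete record from already-estimated mixture parameters and fills in the most probable value. It is a minimal sketch under our own naming assumptions (the arrays `weights` and `comp_probs` are hypothetical names, and values are assumed to be encoded as 0, 1, ...); the paper's actual implementation may differ.

```python
# Minimal sketch: impute the missing entries of one record using a learned
# product mixture model (Eqs. (3)-(4) and the conditional weights W_j).
import numpy as np

def impute_record(record, weights, comp_probs):
    """record: 1-D array with np.nan at missing positions.
    weights: shape (r,) mixture weights w_j.
    comp_probs: list of length m; comp_probs[i] has shape (r, |X_i|)
                with comp_probs[i][j, x] = F_i(x | j)."""
    record = np.array(record, dtype=float)
    observed = [i for i in range(len(record)) if not np.isnan(record[i])]
    missing = [i for i in range(len(record)) if np.isnan(record[i])]

    # F_C(u_C | j) = prod over observed i of F_i(x_i | j), one value per component
    f_c = np.ones_like(weights)
    for i in observed:
        f_c *= comp_probs[i][:, int(record[i])]

    # conditional component weights W_j(u_C)
    w_cond = weights * f_c
    w_cond /= w_cond.sum()

    for z in missing:
        # P_{z|C}(x | u_C) = sum_j W_j(u_C) * F_z(x | j)
        p_z = w_cond @ comp_probs[z]          # shape (|X_z|,)
        record[z] = int(np.argmax(p_z))       # fill in the most probable value
    return record
```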
The initialization of the EM-Mixture algorithm (presented in Algorithm 2) is performed using the partitions obtained from agglomerative hierarchical clustering, as implemented in the function hc of the R package mclust [13]. In our algorithm, the symbol δ(x, y) denotes the standard delta function, equal to one if x = y and zero otherwise.

Algorithm 2: EM-Mixture

Input: D, a data set
Output: a completed data set

Phase 1: initialization
  w_j, for j = 1, ..., r
  F_v(x_v | j), for x_v ∈ 𝒳_v, v ∈ {1, 2, ..., m}, and j = 1, ..., r
  L = −∞

Phase 2: repeat

  E-step:
    q(j | u) = [ w_j ∏_{v ∈ N(u)} F_v(u_v | j) ] / [ Σ_{l=1}^{r} w_l ∏_{v ∈ N(u)} F_v(u_v | l) ],  for u ∈ D, j = 1, ..., r
    w_j = (1 / |D|) Σ_{u ∈ D} q(j | u),  for j = 1, ..., r

  M-step:
    for x_v ∈ 𝒳_v, v ∈ {1, 2, ..., m}, and j = 1, ..., r:
      F_v(x_v | j) = [ Σ_{u ∈ D(v)} δ(x_v, u_v) · q(j | u) ] / [ Σ_{u ∈ D(v)} q(j | u) ]

  L' = Σ_{u ∈ D} log [ Σ_{j=1}^{r} w_j ∏_{v ∈ N(u)} F_v(u_v | j) ]
  Q = L' − L
  L = L'

until Q ≤ ε

The EM algorithm converges monotonically to a local or global maximum or a saddle point of the log-likelihood function L, in the sense that the sequence {L_t}, t = 0, 1, 2, ..., does not decrease. The presence of local maxima makes the starting point of the procedure influential; hence it is selected at random. We use the value ε = 0.005 to terminate the main loop of the algorithm. The sequence of log-likelihood values generated by the E-step and M-step is non-decreasing [11] (i.e., L' ≥ L).

We adapt the BN structure learning algorithm A1 so that it can learn from incomplete data: we use the EM-Mixture algorithm, i.e., Algorithm 2, to make the incomplete data complete in Phase 3. The whole algorithm will be referred to as A2.

3.2 Experiments

The experiments have been repeated ten times on ten different subsets for each MCAR rate and each model, using datasets generated from the true models summarized in Table 8 in Appendix A. We have compared our approach, denoted A2, with three other methods. By A1 we denote BIC-optimal learning from the complete data created by omitting all rows containing a missing value. In [3], the authors proposed the soft and hard EM algorithms to fill in the missing values and learn an optimal BN structure from the completed data by Tabu search [14]; we refer to these as A3 and A4, respectively.

The test scenarios, which include more than 700 incomplete datasets, are summarized in Figure 1. The resulting BNs of the simulations within each scenario are shown in Tables 1, 2, and 3. The decision tree shown in Figure 1 is intended to guide practitioners as to which imputation algorithm appears to perform best, depending on the characteristics of their problem with incomplete data. Each leaf of the decision tree corresponds to a subset of the scenarios that we studied, grouped according to the values of the experimental factors, and recommends the algorithms with the best average Structural Hamming Distance (SHD) [15] between the essential graph of the learned model and the essential graph of the true model. The dominance of the algorithms has been tested using the Wilcoxon test [16]. We say that an algorithm is better than another if it has a lower average SHD and their confidence intervals do not overlap, i.e., the p-value of the Wilcoxon test is lower than 5%. In the results based on the SHD, A2 has scored the best results.
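As a sketch of this comparison procedure (our illustration, not the authors' script), the paired Wilcoxon signed-rank test from SciPy can be applied to the per-dataset SHD values of two algorithms; the SHD arrays below are made-up illustrative numbers, not results from the paper.

```python
# Minimal sketch of the pairwise comparison described above: two algorithms
# are compared on the same incomplete datasets via their SHD values.
import numpy as np
from scipy.stats import wilcoxon

shd_a2 = np.array([2, 1, 3, 2, 2, 1, 4, 2, 3, 1])   # hypothetical SHDs of A2
shd_a4 = np.array([3, 2, 3, 4, 2, 3, 5, 2, 4, 2])   # hypothetical SHDs of A4

stat, p_value = wilcoxon(shd_a2, shd_a4)             # paired, two-sided test
better_on_average = shd_a2.mean() < shd_a4.mean()

if p_value < 0.05 and better_on_average:
    print(f"A2 is significantly better (p = {p_value:.3f})")
else:
    print(f"No significant difference detected (p = {p_value:.3f})")
```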
For the results based on the SHD and the Wilcoxon test, we have observed some important general trends:

– A2 appears to be a good algorithm in all scenarios.
– A2 is significantly better than the other algorithms for model M2 in Leaves B and G.
– A2 is significantly better than the other algorithms for the Child model in Leaf C.
– A2 and A3 are significantly better than A1 and A4 for model M1 in Leaves C, D, P, and K.
– A1 is significantly worse than the other algorithms in all scenarios where the data size is smaller than 5,000.

Figure 2 summarizes the algorithm results over all models, all dataset sizes, and all MCAR rates.

Figure 1: The decision tree for the different test scenarios, branching on the MCAR rate (Groups 1–3) and the data size into the leaves A–M referenced in Tables 1–3.

Table 1: Recommended algorithms by decision-tree leaf, for MCAR rates in [5, 10]% (Group 1).
  Leaf A (size > 5000):      Weather: A1, A2, A3, A4;  M1: A2, A3, A4
  Leaf B (size 3000–5000):   Weather: A2, A3, A4;  M1: A2, A3, A4;  M2: A2;  Child: A2, A3, A4
  Leaf C (size 1500–2500):   Weather: A2, A3, A4;  M1: A2, A3;  M2: A2, A3, A4;  Child: A2
  Leaf D (size < 1000):      Weather: A2, A3, A4;  M1: A2, A3;  M2: A2;  Child: A2, A3

Table 2: Recommended algorithms by decision-tree leaf, for MCAR rates in [15, 25]% (Group 2).
  Leaf E (size > 5000):      Weather: A1, A2, A3, A4;  M1: A2, A3, A4
  Leaf F (size 3000–5000):   Weather: A2, A3, A4;  M1: A2, A3, A4;  M2: A2, A3;  Child: A2, A3
  Leaf G (size 1500–2500):   Weather: A2, A3, A4;  M1: A2, A3, A4;  M2: A2;  Child: A2, A3
  Leaf P (size < 1000):      Weather: A2, A3, A4;  M1: A2, A3;  M2: A2;  Child: A2, A3

Table 3: Recommended algorithms by decision-tree leaf, for MCAR rates in [35, 50]% (Group 3).
  Leaf H (size > 5000):      Weather: A1, A2, A3, A4;  M1: A2, A3, A4
  Leaf G (size 3000–5000):   Weather: A2, A3, A4;  M1: A2, A3
  Leaf K (size 1500–2500):   Weather: A2, A3, A4;  M1: A2, A3
  Leaf M (size < 1000):      Weather: A2, A3, A4;  M1: A2

4 Belief Noisy-Or model

The Belief Noisy-Or (BNO) model is suitable for describing a specific class of uncertain relationships in Bayesian networks [4] that is common in several practical applications of BNs. As an example, let us mention the QMR-DT network [17]. In Figure 3 we present the structure of a CPT F(Y | X_1, ..., X_n) where auxiliary nodes X'_1, ..., X'_n are added to explicitly separate the noisy relations from the logical OR relation. For a CPT with multiple parent variables X_1, ..., X_n, the noisy-or is defined as follows (in the case of one parent variable, we use the probability values specified in Table 4):

\[
F(X'_i = 0 \mid X_i = 0) = 1 - \alpha, \qquad F(X'_i = 1 \mid X_i = 0) = \alpha,
\]
\[
F(X'_i = 0 \mid X_i = 1) = p_i, \qquad F(X'_i = 1 \mid X_i = 1) = 1 - p_i,
\]

where i ∈ {1, ..., n} and p_i ∈ [0, 1] is the parameter defining the probability that the positive value x_i of variable X_i is inhibited; it is referred to as the inhibition probability. The parameter α specifies the possibility of a positive value even if the value of the corresponding parent variable is negative. In most experiments, we will set α = 0. The CPT of F(Y | X'_1, ..., X'_n) represents the deterministic logical OR function, i.e.,

\[
F(Y = 0 \mid X'_1 = x'_1, \ldots, X'_n = x'_n) \;=\;
\begin{cases}
1 & \text{if } x'_1 = 0, \ldots, x'_n = 0, \\
0 & \text{otherwise.}
\end{cases}
\]

Consequently, the CPT of F(Y | X_1, ..., X_n), which represents the noisy-or function, is computed as follows:
\[
F(Y = 0 \mid X_1 = x_1, \ldots, X_n = x_n) \;=\; \prod_{i=1}^{n} F(X'_i = 0 \mid X_i = x_i) \;=\; \prod_{i=1}^{n} p_i^{\,x_i} (1 - \alpha)^{1 - x_i},
\]
\[
F(Y = 1 \mid X_1 = x_1, \ldots, X_n = x_n) \;=\; 1 - \prod_{i=1}^{n} p_i^{\,x_i} (1 - \alpha)^{1 - x_i}.
\]

4.1 Analysis of BNO models

In this section, we analyze the BNO models listed in Table 9 in Appendix A, with α = 0. Tables 5, 6, and 7 show the marginal probability distributions (MPD) of the variables in the BNO models N1, N2, and BN2O, respectively; see Figures 11 and 12. The tables illustrate the decrease of the marginal probability value F(C_i = 0) in the case of a node having more than one parent (see Table 7). This decrease is due to the product form of the noisy-or distribution. On the other hand, the tables also illustrate the increase of that marginal probability with a higher number of predecessors in previous layers; this increase depends on the number of layers above the node and also on the numbers of edges in those layers (see Tables 5 and 6).

Table 4: The CPT F(X'_i | X_i).
  F(X'_i = 0 | X_i = 0) = 1 − α    F(X'_i = 0 | X_i = 1) = 0.2
  F(X'_i = 1 | X_i = 0) = α        F(X'_i = 1 | X_i = 1) = 0.8

Table 5: N1 (Figure 11): marginal probability distributions.
               C1   C2   C3    C4    C5    C6
  F(C_i = 0):  .5   .6   .68   .744  .795  .837
  F(C_i = 1):  .5   .4   .32   .256  .205  .163

Table 6: N2 (Figure 11): marginal probability distributions.
               C1   C2   C3    C4    C5    C6
  F(C_i = 0):  .5   .6   .536  .539  .716  .707
  F(C_i = 1):  .5   .4   .464  .461  .284  .293

Table 7: BN2O (Figure 12): marginal probability distributions.
               C1   C2   C3   C4   C5   C6    C7   C8
  F(C_i = 0):  .5   .5   .5   .5   .5   .129  .36  .36
  F(C_i = 1):  .5   .5   .5   .5   .5   .871  .64  .64

Figure 2: The Structural Hamming Distance to the true models of the models produced by the structure learning algorithms, using data generated from all models summarized in Table 8, averaged over all data sizes and all MCAR rates.

Figure 3: Noisy-or (the CPT structure with auxiliary nodes X'_1, ..., X'_n).

Using the conditional probability distributions of the variables given their parents, we can easily calculate the joint probability distribution F(U) using the BN factorization from Section 2, and conditional probability distributions (CPDs) F(X_A | X_B), where X_A ⊆ U and X_B ⊆ U \ X_A. Recall that a CPD for a particular configuration x_B of the nodes X_B can be computed as (note that all BNs considered in this paper satisfy the condition F(X_B = x_B) > 0 for all x_B):

\[
F(X_A \mid X_B = x_B) \;=\; \frac{F(X_A, X_B = x_B)}{F(X_B = x_B)}. \tag{5}
\]

The Kullback-Leibler distance (KLD) of two conditional probability distributions F(X_A | X_B) and G(X_A | X_B) defined on the same state space is computed as the weighted average KLD of F(X_A | X_B = x_B) and G(X_A | X_B = x_B) over all configurations x_B:

\[
D\bigl(F(X_A \mid X_B) \,\|\, G(X_A \mid X_B)\bigr)
= \sum_{x_B} F(X_B = x_B) \sum_{x_A} F(X_A = x_A \mid X_B = x_B) \,\log \frac{F(X_A = x_A \mid X_B = x_B)}{G(X_A = x_A \mid X_B = x_B)}
= \sum_{x_A, x_B} F(X_A = x_A, X_B = x_B) \,\log \frac{F(X_A = x_A \mid X_B = x_B)}{G(X_A = x_A \mid X_B = x_B)}.
\]

We will use the KLD of conditional probability distributions estimated from the true data to support our arguments when explaining the results.

4.2 Experiments

We have performed experiments on different Belief Noisy-Or (BNO) models with their CPTs defined in Table 4, where α ∈ {0, 0.2} and the CPT of a node with no parents is uniform, i.e., F(X_i = 1 | {}) = 0.5 and F(X_i = 0 | {}) = 0.5. The experiments have been repeated ten times on ten different datasets generated from the BNO models with different MCAR rates, as specified in Table 9 in Appendix A.
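For concreteness, here is a minimal sketch (our illustration, not the authors' code) of how a noisy-OR CPT F(Y | X_1, ..., X_n) can be tabulated from the inhibition probabilities p_i and the parameter α defined above, using the values of Table 4 (p_i = 0.2) as an example:

```python
# Minimal sketch: tabulate the noisy-OR CPT F(Y | X_1, ..., X_n) from the
# inhibition probabilities p_i and the parameter alpha defined above.
from itertools import product

def noisy_or_cpt(p, alpha=0.0):
    """p: list of inhibition probabilities p_i, one per parent X_i.
    Returns a dict mapping each parent configuration (x_1, ..., x_n)
    to the pair (F(Y=0 | x), F(Y=1 | x))."""
    n = len(p)
    cpt = {}
    for x in product((0, 1), repeat=n):
        # F(Y=0 | x) = prod_i  p_i^{x_i} * (1 - alpha)^{1 - x_i}
        y0 = 1.0
        for p_i, x_i in zip(p, x):
            y0 *= p_i if x_i == 1 else (1.0 - alpha)
        cpt[x] = (y0, 1.0 - y0)
    return cpt

# Example with two parents, p_i = 0.2 and alpha = 0:
# F(Y=1 | X_1=1, X_2=1) = 1 - 0.2 * 0.2 = 0.96
print(noisy_or_cpt([0.2, 0.2], alpha=0.0)[(1, 1)])
```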
In all figures, we denote additional edges by blue dashed lines, missing edges by red lines, and edges with a different orientation by orange lines.

4.2.1 Model N1

The true N1 model is shown in Figure 11 in Appendix A. We use this model as an example of a simple model with a chain structure. This model is motivated by applications in, e.g., telecommunications. Let us summarize the results for this model:

– All algorithms learn the true structure when α ≠ 0, for all data sizes and all MCAR rates.
– The algorithms A2, A3, and A4 learn structures different from the true model in some cases with α = 0, an MCAR rate of 15%, and a data size of 1,000. For example, A3 and A4 learn an additional edge C2 → C4, and A2 learns C4 → C6 instead of C5 → C6.
– Using equation (5), we calculate F(C6 | C5) and F(C6 | C4) from the true model N1. We have found that their KLD value (computed using the formula in Section 4.1) is very small, only 0.001. Also, the chi-square test of independence, whose p-value is smaller than 0.0001, reveals that there is a strong dependence between C6 and C4 in addition to the relationship between C6 and C5, which is already explicitly present in the true model. Moreover, the BIC of the learned structure is -2,252.93 while the BIC of the true model on the same dataset is -2,255.64 (we report the BIC value for one of the ten datasets, since the results for the remaining nine are similar). This can be explained by the deterministic conditional distribution F(C6 | C5 = 0) for C5 = 0. For these reasons, we can accept that A2 has learned C4 → C6 instead of C5 → C6.

4.2.2 Model N2

The true N2 model is shown on the right-hand side of Figure 11 in Appendix A. We use this model as an example of a model more complicated than the previous model N1. This model is motivated by applications in, e.g., computer networks. We summarize the results of the experiments performed with this model:

– Figure 4 shows the Structural Hamming Distance (SHD) for all tested MCAR rates and the models with α = 0. We can observe that, as expected, the performance of the algorithms improves with increasing data size.
– We can see that A2 on average has a smaller SHD to the true model than the other algorithms.
– In Figure 5, we compare the models learned from the datasets of size 5,000 with an MCAR rate of 10% using all four algorithms. We can see that A2 and A3 have the same SHD, but they differ in that A3 has a missing edge C4 → C5 while A2 has an additional edge C3 → C6. This additional edge can be explained by observing that there is a chain of nodes C3 → C4 → C6 through which the state 0 is propagated because α = 0. In other words, we calculate F(C3, C6 | C1, C2, C4, C5) and the product F(C3 | C1, C2, C4, C5) · F(C6 | C4, C5) from the true model of Figure 11 using equation (5). The KLD value (computed as explained in Section 4.1) of these two distributions is very small, only 0.02. Also, the chi-square test of independence of C3 and C6 reveals that these variables are dependent (the test's p-value is smaller than 0.0001). The additional edge is also supported by a comparison of the BIC values of the learned structure with and without the additional edge C3 → C6: they are -9,813.67 for the model with the additional edge and -9,880.5 for the true model.
– If α > 0, then no additional edge is learned anymore, no matter what the MCAR rate is. Algorithms A2 and A4 are always able to learn the true structure when the data size exceeds 1,000.
Also, A1 and A3 learn the true structure when the data size is larger than 1,500.

Figure 4: The Structural Hamming Distance of the models produced by the structure learning algorithms to the true model (with α = 0), using data generated from the N2 model (the true model is presented in Figure 11), averaged over ten experiments for different data sizes and for MCAR rates of 5%, 10%, and 15%, respectively.

4.2.3 BN2O models

These models are motivated by health-care applications, for example by the QMR-DT network [17]. We created 60 different BN2O models consisting of two layers with N = 20 nodes in total. They differ in the number of nodes in the first layer, namely L1 ∈ {5, 8, 12, 15}; the number of nodes in the second layer is L2 = 20 − L1. The numbers of edges between the layers are generated randomly with three different options, N/2, 2·N/2, and 4·N/2; each option is repeated five times.

Using these models, we have generated multiple incomplete datasets where the sizes of the datasets are 3,000 and 5,000 and the MCAR rates are 10% and 15%. Figure 8 shows boxplots of the numbers of additional and missing edges learned in all instances for each algorithm, for the dataset size of 3,000 and all MCAR rates. The results show that A2 performs better on average (i.e., its distance to the true model is smaller) than the other algorithms. Next, we discuss in more detail one simpler example of a BN2O model. The structure of this model is shown in Figure 12 in Appendix A.

– Figure 6 shows the SHD of all learned models grouped by MCAR rates for the models with α = 0.
– The models learned from the dataset of size 5,000 with an MCAR rate of 10% and α = 0 using all algorithms are shown in Figure 7. We can see that A2 performs better (i.e., its SHD to the true model is smaller) than the other algorithms.
– Note the additional edge C7 → C8 learned by A2 for most datasets. The argument supporting this additional edge is similar to that for the additional edge in the N2 model. Again, we can see that the KLD of F(C7, C8 | C2, C5) and the product F(C7 | C2, C5) · F(C8 | C2, C5) is very small, only 0.002. Also, the chi-square test of independence of C7 and C8 has a p-value smaller than 0.0001, and there is always only a very small difference between the BIC of the model with the extra edge and that of the true model; for example, the BIC of the model with the extra edge is -7,331.8 while the BIC of the true model is -7,338.5 for one of the ten generated datasets.
– In the experiments with models having α > 0, no additional edge has been learned, and the true model is learned successfully when the data size is 2,500 or larger, for all MCAR rates.

4.2.4 A large BN2O model

We have performed experiments with the model shown in Figure 13 in Appendix A. This model consists of 25 variables: 14 in the first layer and 11 in the second layer. All algorithms required a data size of more than 3,000 to give a good performance. With a data size of 3,000 (and an MCAR rate of 10%) the recorded SHD values of algorithms A1, A2, A3, and A4 are still not very good, namely 14.6, 10.2, 10, and 9.8, respectively.
With data sizes of 5,000 and 7,500 (and an MCAR rate of 10%) the recorded average SHD values of A1, A2, A3, and A4 are already much better, namely 7.2, 4.2, 4.3, and 5.1, respectively. See Figure 9 for the learned models. With a data size of 10,000 we already recover the true models, except for the additional edges in the case of A2, as discussed in Section 4.2.3.

Figure 5: Models learned by A1, A2, A3, and A4, respectively, for most of the ten datasets generated from the true N2 model (presented in Figure 11) with α = 0, an MCAR rate of 10%, and a data size of 5,000.

Figure 6: The Structural Hamming Distance to the true model of the models produced by the structure learning algorithms using data generated from the BN2O model (the true model is presented in Figure 12) with α = 0, averaged over all data sizes, for MCAR rates of 5%, 10%, and 15%, respectively.

Figure 7: Models learned by A1, A2, A3, and A4, respectively, for most of the ten datasets generated from the true BN2O model (presented in Figure 12) with α = 0, an MCAR rate of 10%, and a data size of 5,000.

Figure 8: Results of the structure learning algorithms using data generated from the BN2O model (with α = 0) with a data size of 3,000, averaged over all tested MCAR rates. The plot on the left displays the average number of additional edges and the plot on the right displays the average number of missing edges.

Figure 9: Models learned by A1, A2, A3, and A4, respectively, using data generated from the large BN2O model consisting of 25 variables (with α = 0), an MCAR rate of 10%, and a data size of 7,500 (the true model is presented in Figure 13).

5 Conclusion

In this paper, we have provided an approach to learning the optimal BN structure from incomplete data by adapting the approach of [8]. This adaptation imputes missing values using product mixtures learned by the EM algorithm [11]. We have shown that the sequence of log-likelihood values generated by the E-step and M-step of the EM algorithm is non-decreasing and that the algorithm converges. Lemma 2.1 helps us reduce the collection of candidate parent sets for a variable, which can speed up the learning algorithm.

We have performed experiments on incomplete data generated from different types of BN models to compare the proposed algorithm A2 with other algorithms, namely with A1 [8] and with soft and hard EM [3], referred to as A3 and A4, respectively. In our comparisons, we use the Structural Hamming Distance of the CPDAGs of the learned DAGs to the CPDAGs of the original models. Such comparisons have been undertaken on (a) general Bayesian networks and (b) Belief Noisy-Or (BNO) [4] models with partially deterministic and nondeterministic conditional probability distributions.
The experiments with models of type (b) are motivated by uncertain relationships in Bayesian networks, which are common in practical applications of BNs. We have obtained the following results in detailed simulation studies.

(a) General BN models:

– The A2 algorithm appears to be the best choice among the tested algorithms for learning the structure of BNs from incomplete data, whatever the data size and the MCAR rate are.
– In most scenarios corresponding to different data sizes and MCAR rates, algorithm A2 is significantly better than the other algorithms, and in no scenario is it significantly worse than any other algorithm according to the Wilcoxon test.

(b) BNO models:

– A2 is able to recover all true edges in the tested models, except for the N1 model (shown in Figure 11) at size 1,000 and a missing rate of 15%. The different learned structure of the model N1 is acceptable because the chi-square (X²) test and the Kullback-Leibler distance (KLD) between the related conditional probabilities suggest there is a high degree of dependence between the connected variables.
– A2 has learned an additional edge in the case of the models N2 (shown in Figure 11) and BN2O (shown in Figure 12). The additional edge is acceptable since the X² test and the KLD suggest there is a high degree of dependence between these variables. We have seen that the BIC of the learned structure is almost equal to the BIC of the true model; for example, the BIC of the model learned using A2 (shown in Figure 7) is -7,331.8 and the BIC of the true model is -7,338.5. Similar behavior has been observed in other BNO models.
– A2 is always able to recover all edges, while the other algorithms are not.
– For large BN2O models, all algorithms require data sizes larger than 3,000 to achieve a good performance; e.g., for the BN2O model with 25 variables, A2 needs at least 10,000 data records to learn the correct model (with the exception of additional edges, as discussed in Section 4.2.3).

We have empirically shown that our algorithm A2 behaves better than the other tested algorithms on the studied BNs and in different scenarios. Based on these experiments, we can recommend this algorithm to practitioners who use BNs or BNOs with incomplete data, especially in the medical domain, where BNO models can be used to study the hidden relationships between symptoms and diseases. An interesting topic for future research might be learning the structure of large BN2O networks from incomplete data and optimizing the number of components in the EM-Mixture.

Acknowledgement

This work was supported by the Student Grant CTU SGS20/132/OHK4/2T/14.

References

[1] Nir Friedman, Dan Geiger, and Moises Goldszmidt. Bayesian network classifiers. Machine Learning, 29(2-3):131–163, 1997. URL: https://doi.org/10.1023/a:1007465528199.

[2] Cassio P. de Campos, Mauro Scanagatta, Giorgio Corani, and Marco Zaffalon. Entropy-based pruning for learning Bayesian networks using BIC. Artificial Intelligence, 260:42–50, 2018. URL: https://doi.org/10.1016/j.artint.2018.04.002.

[3] Andrea Ruggieri, Francesco Stranieri, Fabio Stella, and Marco Scutari. Hard and soft EM in Bayesian network learning from incomplete data. Algorithms, 13(12):329, 2020. doi:10.3390/a13120329.

[4] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[5] Nir Friedman and Moises Goldszmidt.
Learning Bayesian networks with local structure. In Learning in Graphical Models, pages 421–459. Springer, 1998. URL: https://doi.org/10.1007/978-94-011-5014-9_15.

[6] Zhifa Liu, Brandon Malone, and Changhe Yuan. Empirical evaluation of scoring functions for Bayesian network model selection. In Proceedings of the Ninth Annual MCBIOS Conference: Dealing with the Omics Data Deluge, Oxford, MS, USA, 2012. BMC Bioinformatics. doi:10.1186/1471-2105-13-S15-S14.

[7] Poh Choo Song, Hui Yee Chong, Hong Choon Ong, and Sing Yan Looi. A model of Bayesian network analysis of the factors affecting student's higher level study decision: The private institution case. Journal of Telecommunication, Electronic and Computer Engineering (JTEC), 8(2):105–109, 2016.

[8] Cassio P. de Campos, Zhi Zeng, and Qiang Ji. Structure learning of Bayesian networks using constraints. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 113–120, New York, NY, USA, 2009. Association for Computing Machinery. doi:10.1145/1553374.1553389.

[9] James Cussens. Bayesian network learning with cutting planes. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 153–160, Arlington, Virginia, USA, 2011. AUAI Press.

[10] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977. URL: https://doi.org/10.1111/j.2517-6161.1977.tb01600.x.

[11] Jiří Grim, Jan Hora, Pavel Boček, Petr Somol, and Pavel Pudil. Statistical model of the 2001 Czech census for interactive presentation. Journal of Official Statistics, 26(4):673–694, 2010.

[12] J. Grim and P. Boček. Statistical model of Prague households for interactive presentation of census data. In SoftStat '95: Advances in Statistical Software 5, Conference on the Scientific Use of Statistical Software, Heidelberg, DE, 1996.

[13] Luca Scrucca, Michael Fop, T. Brendan Murphy, and Adrian E. Raftery. mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1):289–317, 2016.

[14] Fred Glover. Tabu search—Part I. ORSA Journal on Computing, 1(3):190–206, 1989.

[15] Marco Scutari and Jean-Baptiste Denis. Bayesian Networks: with Examples in R. Chapman & Hall, Boca Raton, 2014. URL: https://doi.org/10.1111/biom.12856.

[16] M. Neuhäuser. Wilcoxon–Mann–Whitney test. In International Encyclopedia of Statistical Science. Springer Berlin Heidelberg, 2011.

[17] Michael A. Shwe, Blackford Middleton, David E. Heckerman, Max Henrion, Eric J. Horvitz, Harold P. Lehmann, and Gregory F. Cooper. Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base. Methods of Information in Medicine, 30(4):241–255, 1991. URL: https://doi.org/10.1055/s-0038-1634846.

[18] B. Abramson, J. Brown, Ward Edwards, Allan Murphy, and Robert L. Winkler. Hailfinder: A Bayesian system for forecasting severe weather. International Journal of Forecasting, 12(1):57–71, 1996 (Probability Judgmental Forecasting). URL: https://doi.org/10.1016/0169-2070(95)00664-8.

[19] A. Philip Dawid. Prequential analysis, stochastic complexity and Bayesian inference. Bayesian Statistics, 4:109–125, 1992.
Appendix A. Simulation Scenarios

This appendix provides a complete list of all experiments in the simulation study described in Sections 3.2 and 4.2, organized by their main characteristics in Tables 8 and 9, respectively. The number of components used in each experiment was selected based on the number of variables in the dataset. The true models mentioned in Table 8 are shown in Figure 10; the true models mentioned in Table 9 are shown in Figures 11 and 12.

Table 8: Description of the key factors of all BN experiments in the simulation study (10 replicates for each setting).
  Weather [18]:    MCAR 10%, 25%: sample sizes 100, 500, 1000, 5000, 10000;  MCAR 50%: sample sizes 100, 500, 1000, 5000, 10000, 13000
  Child [19]:      MCAR 10%, 15%, 50%: sample sizes 1000, 2000, 3000, 5000
  M2 (Figure 10):  MCAR 5%, 10%, 15%, 25%: sample sizes 500, 1000, 1500, 2500, 5000
  M1 (Figure 10):  MCAR 10%, 20%, 35%, 50%: sample sizes 500, 1500, 2500, 5000, 10000, 13000

Table 9: Description of the key factors of all Belief Noisy-OR experiments in the simulation study (10 replicates for each setting; true models are presented in Figures 11 and 12).
  BN2O:        MCAR 5%, 10%, 15%: sample sizes 1000, 1500, 2500, 5000
  N1:          MCAR 5%, 10%, 15%: sample sizes 1000, 1500, 2500, 5000
  N2:          MCAR 5%, 10%, 15%: sample sizes 1000, 1500, 2500, 5000
  large BN2O:  MCAR 10%: sample sizes 5000, 7500

Figure 10: The M1 and M2 true models, respectively.

Figure 11: The N1 and N2 true models, respectively. Their marginal probability distributions are summarized in Tables 5 and 6.

Figure 12: The BN2O true model. Its marginal probability distributions are summarized in Table 7.

Figure 13: Example of a large BN2O model with 25 variables (whose learned models are presented in Figure 9).