https://doi.org/10.31449/inf.v47i1.4497
Informatica 47 (2023) 83–96

Learning the Structure of Bayesian Networks from Incomplete Data Using a Mixture Model

Issam Salman (1), Jiří Vomlel (2)
(1) Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague, Trojanova 13, 120 01, Prague, CZ
(2) Institute of Information Theory and Automation of the CAS, Pod Vodárenskou věží 4, 182 00, Prague, CZ
E-mail: Issam.Salman@fjfi.cvut.cz, vomlel@utia.cas.cz

Keywords: Bayesian network, belief-noisy-OR, structure learning, incomplete data, EM-mixture

Received: November 8, 2022

In this paper, we provide an approach to learning optimal Bayesian network (BN) structures from incomplete data based on the BIC score function, using a mixture model to handle missing values. We have compared the proposed approach with other methods. Our experiments have been conducted on different models, some of them Belief Noisy-Or (BNO) ones. We have performed experiments using datasets with values missing completely at random, with different missingness rates and data sizes. We have analyzed the significance of differences between the algorithm performance levels using the Wilcoxon test. The new approach typically learns additional edges in the case of Belief Noisy-Or models. We have analyzed this issue using the Chi-square test of independence between the variables in the true models; this analysis reveals that the additional edges can be explained by strong dependence in the generated data. An important property of our new method for learning BNs from incomplete data is that it can learn not only optimal general BNs but also specific Belief Noisy-Or models, which are used in many applications, for example in medicine.

Povzetek (Slovenian abstract): A method is developed for determining an optimal Bayesian network from incomplete data.

1 Introduction

Bayesian networks (BNs) have been used in a variety of applications. The challenge of learning a BN can be categorized into two parts: (1) structural learning, which involves identifying the topology of the BN; and (2) parametric learning, which involves estimating the conditional probabilities for a given network structure. Learning the structure of a BN is by far the more difficult of the two tasks. Most methods, such as [1] and [2], require complete data, while in practical applications we are often confronted with values missing from the dataset; this problem concerns both parts (1) and (2) mentioned above and affects the performance of model learning: with such methods, a record with a missing value has to be omitted from the dataset. An earlier work [3] studied the impact of learning the parameters and the structure of a BN using hard EM and soft EM, with a comprehensive simulation study covering incomplete data.

In this paper, we study the problem of learning the optimal BN structure from incomplete data, adopting a new approach that uses product distribution mixture models to handle missing values; the mixture model is combined with [2] to estimate the missing values and learn the optimal structure. In addition, we show in this paper that our new approach is able to learn the structure of a Belief Noisy-OR (BNO) [4] model from incomplete data.

2 Bayesian network

A Bayesian network encodes a joint probability distribution over a set of random variables U = {X_1, X_2, ..., X_m}. We consider only discrete variables in this work, which is the most common current usage of BNs. The finite set of states of a variable X_i will be denoted by 𝒳_i.
Conditional probability distributions (CPDs) are attached to each variable in the network. Their purpose is to quantify the strength of the relationships depicted in the BN through its structure: these CPDs mathematically describe the behavior of each variable under every possible value assignment of its parents. Since specifying this behavior requires a number of parameters exponential only in the number of parents, and since the number of parents is typically much smaller than the number of variables in the domain, this approach results in exponential savings in space and time.

Formally, a Bayesian network for U is a pair B = ⟨G, Θ⟩. Its first component, G, is a directed acyclic graph whose vertices correspond to the random variables in U and whose edges represent direct dependencies between these variables. The graph G encodes independence assumptions: each variable X_i is independent of its non-descendants given its parents in G. The second component of the pair, Θ, represents the set of parameters that quantify the network. It contains a parameter θ_{x_i|π_{x_i}} = F(x_i | π_{x_i}) for each possible value x_i of X_i and each configuration π_{x_i} of the parent set Π_{X_i}, where Π_{X_i} denotes the set of parents of X_i in G. Accordingly, a Bayesian network B defines a unique joint probability distribution over U given by

\[
F(X_1 = x_1, \ldots, X_m = x_m) \;=\; \prod_{i=1}^{m} F(X_i = x_i \mid \Pi_{X_i} = \pi_{x_i}) \;=\; \prod_{i=1}^{m} \theta_{x_i \mid \pi_{x_i}},
\]

where π_{x_i} is the configuration of the parents Π_{X_i} determined by (x_1, ..., x_m).

2.1 Structure learning of BN

Note that a BN can be viewed from two perspectives: as an effective encoding of an independence relation on the one hand, and as an effective encoding of a high-dimensional probability distribution on the other hand.

One option for learning the structure is to rely on specialists in the field through a conscious and meticulous process of knowledge gathering. This involves training experts in probabilistic graphical modeling, validating expert opinions, and extracting and testing information. This process all too often leads to disagreements among experts and a lack of reliability of the resulting model. Nonetheless, in many fields where data is scarce, this is one of the key approaches to model building.

Another mechanism is the automatic derivation of the model from a data set. It is this machine learning (ML) approach that we follow here (thereby avoiding the very rich field of human knowledge acquisition). For a data set D = {u_1, u_2, ..., u_n}, where u_i is an instantiation of all variables in U, BN structure learning translates to the problem of learning a network structure from D. Suppose D is complete and discrete. Then finding the optimal Bayesian network reduces to finding the optimal structure. The optimal structure can be learned by three approaches coming from the area of ML.

The first is the constraint-based approach to structure learning, which takes advantage of the first perspective and attempts to reconstruct a Bayesian network by analyzing independencies in the data. These algorithms require an infinite amount of data to learn independence with certainty; high-order independence tests can be unreliable unless the sample size is truly huge [5]. The second is the score-based approach, which builds on the second perspective and looks for the Bayesian network that describes the available data best according to a score. The core of this approach is to assign a score value s(G) to each acyclic directed graph G.
The score function defines a total order (up to equivalences) over the structures, in such a way that a structure providing a better description of the data is assigned a higher value. The last approach is a hybrid approach, which mixes the two previous approaches together.

2.2 Score-based learning

Score-based learning is a technique frequently used for determining the optimal structure. In this process, each candidate BN is assigned a score measuring the goodness-of-fit of its structure to the data. The goal of the learning problem is then to find the structure with the optimal score. The score usually measures how well the BN describes the data set D.

Definition 1. Let B = ⟨G, Θ⟩ be a Bayesian network, and let D = {u_1, ..., u_n} be a training set, where each u_i assigns a value to all variables in U. The MDL scoring function of a network B given a training data set D, written MDL(B | D), is given by

\[
\mathrm{MDL}(G \mid D) \;=\; LL(G \mid D) - \frac{\log n}{2}\, |G|,
\]

where |G| is the number of parameters in the network. The first term is the log-likelihood, i.e., it measures the model fit. The second term penalizes the model complexity. The penalty term of MDL is greater than that of most other scoring functions; hence optimal networks under MDL are usually sparser than optimal networks found with other scoring functions. As its name suggests, an optimal network with MDL minimizes the scoring function rather than maximizing it.

The Bayesian information criterion (BIC) [6] is a scoring function whose calculation is equivalent to MDL for Bayesian networks, but it is derived on the basis of the models' asymptotic behavior. Since the score is decomposable, it can be written as a sum of the scores of each variable and its parent set:

\[
\mathrm{BIC}(G \mid D) \;=\; \sum_{i=1}^{m} \mathrm{BIC}(X_i \mid \Pi_{X_i}) \;=\; \sum_{i=1}^{m} \bigl\{ LL(X_i \mid \Pi_{X_i}) - \mathrm{Penalty}(X_i \mid \Pi_{X_i}) \bigr\}.
\]

The aim of score-based algorithms is to optimize this score and return the structure G that maximizes it. As the space of all possible structures is at least exponential in the number of variables m, this presents a number of problems: there are m(m−1)/2 possible undirected edges, hence 2^{m(m−1)/2} possible subsets of these edges, and for each such subset there may be more than one way to orient the edges. One popular choice is hill-climbing [7].

2.3 Structural learning with pruning

Statistical testing is one method of reducing the set of potential DAGs. Another approach to reducing this set is to use constraints provided by experts. Besides that, we can use structural constraints similar to those in [2]. The structural constraints can be applied locally as long as they involve only one node and its parents.

Algorithm 1 represents an approach to learning the optimal structure of a BN using the constraint rules and a decomposable score [8]. The main function of the algorithm is to compute a collection of candidate parent sets for each variable. Then we optimize across this collection by selecting one parent set for each variable, without creating directed cycles, while maximizing the total score. The following lemma can be used to reduce the size of the collections of candidate parent sets.

Lemma 2.1. Let X_i be a variable and Π' be a candidate parent set for X_i. Suppose that BIC(X_i | Π) > BIC(X_i | Π') for some Π ⊂ Π'. Then any DAG G' in which Π' is the parent set of X_i is not BIC optimal, so Π' can be discarded. This statement also holds if the subset is the empty set Π = {}.
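To make the use of the decomposable score and the pruning rule concrete, the following sketch computes the local BIC score of a variable given a candidate parent set from data counts and then applies the pruning rule of Lemma 2.1. It is a minimal illustration under our own assumptions, not the implementation used in the paper; the function names (local_bic, prune_candidates), the pandas-based counting, and the use of observed parent configurations in the penalty are illustrative simplifications.

```python
# Minimal sketch (not the paper's implementation): local BIC score of a
# variable given a candidate parent set, plus the pruning rule of Lemma 2.1.
from itertools import combinations
from math import log
import pandas as pd

def local_bic(data: pd.DataFrame, child: str, parents: tuple) -> float:
    """BIC(X_i | Pi) = LL(X_i | Pi) - (log n / 2) * (#free parameters)."""
    n = len(data)
    r_child = data[child].nunique()
    ll = 0.0
    groups = data.groupby(list(parents)) if parents else [(None, data)]
    q = 0  # number of parent configurations actually observed (simplification)
    for _, block in groups:
        q += 1
        n_block = len(block)
        for c in block[child].value_counts():   # maximum-likelihood log-likelihood
            ll += c * log(c / n_block)
    penalty = (log(n) / 2) * q * (r_child - 1)
    return ll - penalty

def prune_candidates(data, child, variables, max_size=2):
    """Keep only candidate parent sets not ruled out by Lemma 2.1."""
    others = [v for v in variables if v != child]
    scores = {(): local_bic(data, child, ())}
    for size in range(1, max_size + 1):
        for cand in combinations(others, size):
            scores[cand] = local_bic(data, child, cand)
    kept = {}
    for cand, s in scores.items():
        # discard cand if some proper subset already scores strictly better
        subsets = [sub for size in range(len(cand))
                   for sub in combinations(cand, size)]
        if all(scores[sub] <= s for sub in subsets):
            kept[cand] = s
    return kept
```

In A1/A2, the surviving candidate parent sets and their scores would then be handed to an exact solver such as GOBNILP, which selects one parent set per variable while avoiding directed cycles.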
Algorithm 1: Parent set evaluation for the BN structure learning algorithm

Input:
  D : a data set
  m : an integer representing the number of variables in D
Output: accepted sets of parents for each node

Phase 1: initialize the parameters
  g_i = (V, E)   // a DAG containing a node and its candidate parent set
  S_i : the BIC score of g_i
  Q_i : a priority queue of triples (X_i, Π_{X_i}, S_i) ordered by S_i

Phase 2: mscour(X_i, D)   // function returning the baseline (minimal) BIC score
  S_i = the BIC score of g_i where only X_i is included
  return S_i

Phase 3: find the accepted Q_i for X_i
  Q_i is empty
  S* = mscour(X_i, D)
  Π_{X_i} is a parent set for X_i
  add (X_i, Π_{X_i}, S*) to Q_i
  for each X_k, k ∈ {1, ..., m}, k ≠ i, do:
    add X_k to Π_{X_i}
    S_ki = the BIC score of the updated Π_{X_i}
    if S_ki > S*:
      add (X_i, Π_{X_i}, S_ki) to Q_i
      for each X_j, j ∈ {1, ..., m}, i ≠ j ≠ k, do:
        add X_j to Π_{X_i}
        S_ki = the BIC score of the updated Π_{X_i}
        if S_ki > S*:
          add (X_i, Π_{X_i}, S_ki) to Q_i
        else:
          delete X_j from Π_{X_i}
      end for
    else:
      delete X_k from Π_{X_i}
  end for

The DAG g_i = (V, E) in Phase 1 of Algorithm 1 contains the set of nodes V = {X_i} ∪ Π_{X_i} and the set of arcs E = {(X_o, X_i) : X_o ∈ Π_{X_i}}. Algorithm 1 considers all possible parent sets that can lead to an optimal BN. Its implementation is based on [8]. After Phase 3, we find a DAG with the highest BIC given the candidate parent sets of each variable. This is done using the GOBNILP tool [9] (https://www.cs.york.ac.uk/aig/sw/gobnilp/), an efficient algorithm based on integer linear programming. We will refer to this algorithm as A1.

Let us also note that if a dataset is generated from a BN having the empty graph as its structure and this dataset is large enough, then for any nonempty parent set Π_{X_i} ≠ ∅ it holds that BIC(X_i | {}) > BIC(X_i | Π_{X_i}). This implies that the variables are independent and that the penalty for larger parent sets makes the BIC value worse for all nonempty parent sets.

One of the pruning rules stated in the literature says that if a candidate parent set has a better score than another candidate set of which it is a subset, it is safe to disregard the larger candidate set; this follows from the decomposability of the score functions. We have applied this pruning rule, as formalized in Lemma 2.1, in Algorithm 1. The algorithm reduces the collection of accepted parent sets for each node by discarding all parent sets that do not meet the criterion.

3 Incomplete datasets

One of the widespread problems in data mining and machine learning is incomplete data. Values may be missing even from training instances. Nowadays more and more datasets are available, but most of them are incomplete; therefore, machine learning must cope with this problem. Normally, to learn the BN structure using the A1 algorithm [2], we need complete data, such that all instances u_i ∈ D, i ∈ {1, ..., n}, are complete and do not have any missing values. In the case of incomplete data, A1 does not use an instance with a missing value for the BN structure learning.

3.1 Product distribution mixtures to handle incomplete data

Because of incomplete data, many methods in machine learning cannot be applied directly. An easy way to deal with this problem is to complete the data by simply omitting the incomplete vectors or removing the incomplete variables. But this approach has a weakness: we may lose a massive part of the available information. Another alternative is to use estimates to replace the missing values [10] (i.e., to impute the missing values).
However, such estimates tend to be "typical" values, so the natural variability of the data is partially suppressed. For that reason, the product mixture model gives us a better way: it allows us to apply the EM algorithm directly to complete the dataset [11]. We will refer to this approach as EM-Mixture. Considering finite mixtures, we assume that

\[
P(X) \;=\; \sum_{j=1}^{r} w_j\, F(X \mid j), \tag{1}
\]
\[
F(X \mid j) \;=\; \prod_{i=1}^{m} F_i(X_i \mid j), \qquad \sum_{j=1}^{r} w_j = 1, \tag{2}
\]

where w_j > 0 is the probabilistic weight of the j-th mixture component, F_i is the conditional distribution of the variable X_i, i ∈ {1, ..., m}, and r is the number of components. Note that the product form of the components does not imply that the involved variables are independent. In this sense, the mixture model (1) is not restrictive [12]. It is easy to verify that, by increasing the number of components r, we can describe any discrete probability distribution in the form (1). In our experiments, r was selected based on the number of variables in the dataset.

To estimate the mixture parameters, we maximize the log-likelihood function

\[
LL \;=\; \sum_{k=1}^{n} \log P(u^{(k)}),
\]

where n is the number of records in the dataset D and u^{(k)} is the k-th data vector from D. We will use the EM algorithm to maximize the log-likelihood function.

Next, we explain how the learned product mixture model is used to fill in the missing values. Let C = {i_1, i_2, ..., i_k} be a subset of M = {1, 2, ..., m} such that the corresponding sub-vector u_C = (x_{i_1}, x_{i_2}, ..., x_{i_k}) is complete. Then, under the product mixture model, we can compute the marginal probability of u_C as

\[
P_C(u_C) \;=\; \sum_{j=1}^{r} w_j\, F_C(u_C \mid j), \tag{3}
\]
\[
F_C(u_C \mid j) \;=\; \prod_{i \in C} F_i(x_i \mid j). \tag{4}
\]

Let z be the index of a variable unobserved in u, i.e., z ∈ M \ C. Under the product mixture model, we can compute the conditional distribution of the missing value u_z given the complete part u_C with P_C(u_C) > 0 as

\[
P_{z \mid C}(u_z \mid u_C) \;=\; \frac{P_{z,C}(u_z, u_C)}{P_C(u_C)} \;=\; \sum_{j=1}^{r} W_j(u_C)\, F_z(u_z \mid j),
\]

where W_j(u_C) are the conditional component weights

\[
W_j(u_C) \;=\; \frac{w_j\, F_C(u_C \mid j)}{\sum_{l=1}^{r} w_l\, F_C(u_C \mid l)}.
\]

We thus compute the probability distribution P_{z|C}(u_z | u_C) for each missing value of each data vector u ∈ D with a missing value. There are several ways of using this probability distribution to fill in the missing value of X_z in u; in this paper, we select the value u_z maximizing P_{z|C}(u_z | u_C) over all values in 𝒳_z.

The last step of our presentation is the description of how the EM algorithm for learning product mixture models is adapted so that it is applicable to incomplete data. Given a data vector u ∈ D and a variable X_i with index i ∈ {1, 2, ..., m}, let N(u) be the subset of indices of the variables observed in u, and let D(i) ⊂ D be the subset of vectors with an observed value of variable X_i:

\[
N(u) = \{ v \in \{1, 2, \ldots, m\} : u_v \text{ is observed in } u \}, \qquad
D(i) = \{ u \in D : i \in N(u) \}.
\]

In Algorithm 2, we present the modification of the EM algorithm for the product mixture model for incomplete data. For x_v ∈ 𝒳_v, v ∈ {1, 2, ..., m}, and j = 1, ..., r, we use F_v(x_v | j) to denote the conditional probability of observing value x_v of variable X_v given the component j.
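As an illustration of the imputation step just described, the following sketch computes P_{z|C} for one incomplete record from already-estimated mixture parameters and fills in the most probable value. It is a minimal sketch under our own naming assumptions (the arrays `weights` and `comp_probs` are hypothetical names, and values are assumed to be encoded as 0, 1, ...); the paper's actual implementation may differ.

```python
# Minimal sketch: impute the missing entries of one record using a learned
# product mixture model (Eqs. (3)-(4) and the conditional weights W_j).
import numpy as np

def impute_record(record, weights, comp_probs):
    """record: 1-D array with np.nan at missing positions.
    weights: shape (r,) mixture weights w_j.
    comp_probs: list of length m; comp_probs[i] has shape (r, |X_i|)
                with comp_probs[i][j, x] = F_i(x | j)."""
    record = np.array(record, dtype=float)
    observed = [i for i in range(len(record)) if not np.isnan(record[i])]
    missing = [i for i in range(len(record)) if np.isnan(record[i])]

    # F_C(u_C | j) = prod over observed i of F_i(x_i | j), one value per component
    f_c = np.ones_like(weights)
    for i in observed:
        f_c *= comp_probs[i][:, int(record[i])]

    # conditional component weights W_j(u_C)
    w_cond = weights * f_c
    w_cond /= w_cond.sum()

    for z in missing:
        # P_{z|C}(x | u_C) = sum_j W_j(u_C) * F_z(x | j)
        p_z = w_cond @ comp_probs[z]          # shape (|X_z|,)
        record[z] = int(np.argmax(p_z))       # fill in the most probable value
    return record
```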
The initialization of the EM-Mixture algorithm (presented in Algorithm 2) is performed using the partitions obtained from agglomerative hierarchical clustering, as implemented in the function hc of the R package mclust [13]. In our algorithm, the symbol δ(x, y) denotes the standard delta function, equal to one if x = y and zero otherwise.

Algorithm 2: EM-Mixture

Input: D, a data set
Output: a completed data set

Phase 1: initialization
  w_j, for j = 1, ..., r
  F_v(x_v | j), for x_v ∈ 𝒳_v, v ∈ {1, 2, ..., m}, and j = 1, ..., r
  L = −∞

Phase 2: repeat

  E-step:
    q(j | u) = [ w_j ∏_{v ∈ N(u)} F_v(u_v | j) ] / [ Σ_{l=1}^{r} w_l ∏_{v ∈ N(u)} F_v(u_v | l) ],  for u ∈ D, j = 1, ..., r
    w_j = (1 / |D|) Σ_{u ∈ D} q(j | u),  for j = 1, ..., r

  M-step:
    for x_v ∈ 𝒳_v, v ∈ {1, 2, ..., m}, and j = 1, ..., r:
      F_v(x_v | j) = [ Σ_{u ∈ D(v)} δ(x_v, u_v) · q(j | u) ] / [ Σ_{u ∈ D(v)} q(j | u) ]

  L' = Σ_{u ∈ D} log [ Σ_{j=1}^{r} w_j ∏_{v ∈ N(u)} F_v(u_v | j) ]
  Q = L' − L
  L = L'

until Q ≤ ε

The EM algorithm converges monotonically to a local or global maximum or a saddle point of the log-likelihood function L, in the sense that the sequence {L_t}, t = 0, 1, 2, ..., does not decrease. The presence of local maxima makes the starting point of the procedure influential; hence it is selected at random. We use the value ε = 0.005 to terminate the main loop of the algorithm. The sequence of log-likelihood values generated by the E-step and M-step is non-decreasing [11] (i.e., L' ≥ L).

We adapt the BN structure learning algorithm A1 so that it can learn from incomplete data: we use the EM-Mixture algorithm, i.e., Algorithm 2, to make the incomplete data complete in Phase 3. The whole algorithm will be referred to as A2.

3.2 Experiments

The experiments have been repeated ten times on ten different subsets for each MCAR rate and each model, using datasets generated from the true models summarized in Table 8 in Appendix A. We have compared our approach, denoted A2, with three other methods. By A1 we denote BIC-optimal learning from the complete data created by omitting all rows containing a missing value. In [3], the authors proposed the soft and hard EM algorithms to fill in the missing values and learn an optimal BN structure from the completed data by Tabu search [14]; we refer to these as A3 and A4, respectively.

The test scenarios, which include more than 700 incomplete datasets, are summarized in Figure 1. The resulting BNs of the simulations within each scenario are shown in Tables 1, 2, and 3. The decision tree shown in Figure 1 is intended to guide practitioners as to which imputation algorithm appears to perform best, depending on the characteristics of their problem with incomplete data. Each leaf of the decision tree corresponds to a subset of the scenarios that we studied, grouped according to the values of the experimental factors, and recommends the algorithms with the best average Structural Hamming Distance (SHD) [15] between the essential graph of the learned model and the essential graph of the true model. The dominance of the algorithms has been tested using the Wilcoxon test [16]. We say that an algorithm is better than another if it has a lower average SHD and their confidence intervals do not overlap, i.e., the p-value of the Wilcoxon test is lower than 5%. In the results based on the SHD, A2 has scored the best results.
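As a sketch of this comparison procedure (our illustration, not the authors' script), the paired Wilcoxon signed-rank test from SciPy can be applied to the per-dataset SHD values of two algorithms; the SHD arrays below are made-up illustrative numbers, not results from the paper.

```python
# Minimal sketch of the pairwise comparison described above: two algorithms
# are compared on the same incomplete datasets via their SHD values.
import numpy as np
from scipy.stats import wilcoxon

shd_a2 = np.array([2, 1, 3, 2, 2, 1, 4, 2, 3, 1])   # hypothetical SHDs of A2
shd_a4 = np.array([3, 2, 3, 4, 2, 3, 5, 2, 4, 2])   # hypothetical SHDs of A4

stat, p_value = wilcoxon(shd_a2, shd_a4)             # paired, two-sided test
better_on_average = shd_a2.mean() < shd_a4.mean()

if p_value < 0.05 and better_on_average:
    print(f"A2 is significantly better (p = {p_value:.3f})")
else:
    print(f"No significant difference detected (p = {p_value:.3f})")
```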
For the results based on the SHD and the Wilcoxon test, we have observed some important general trends:

– A2 appears to be a good algorithm in all scenarios.
– A2 is significantly better than the other algorithms for model M2 in Leaves B and G.
– A2 is significantly better than the other algorithms for the Child model in Leaf C.
– A2 and A3 are significantly better than A1 and A4 for model M1 in Leaves C, D, P, and K.
– A1 is significantly worse than the other algorithms in all scenarios where the data size is smaller than 5,000.

Figure 2 summarizes the algorithm results over all models, all dataset sizes, and all MCAR rates.

Figure 1: The decision tree for the different test scenarios, branching on the MCAR rate (Groups 1–3) and the data size into the leaves A–M referenced in Tables 1–3.

Table 1: Recommended algorithms by decision-tree leaf, for MCAR rates in [5, 10]% (Group 1).
  Leaf A (size > 5000):      Weather: A1, A2, A3, A4;  M1: A2, A3, A4
  Leaf B (size 3000–5000):   Weather: A2, A3, A4;  M1: A2, A3, A4;  M2: A2;  Child: A2, A3, A4
  Leaf C (size 1500–2500):   Weather: A2, A3, A4;  M1: A2, A3;  M2: A2, A3, A4;  Child: A2
  Leaf D (size < 1000):      Weather: A2, A3, A4;  M1: A2, A3;  M2: A2;  Child: A2, A3

Table 2: Recommended algorithms by decision-tree leaf, for MCAR rates in [15, 25]% (Group 2).
  Leaf E (size > 5000):      Weather: A1, A2, A3, A4;  M1: A2, A3, A4
  Leaf F (size 3000–5000):   Weather: A2, A3, A4;  M1: A2, A3, A4;  M2: A2, A3;  Child: A2, A3
  Leaf G (size 1500–2500):   Weather: A2, A3, A4;  M1: A2, A3, A4;  M2: A2;  Child: A2, A3
  Leaf P (size < 1000):      Weather: A2, A3, A4;  M1: A2, A3;  M2: A2;  Child: A2, A3

Table 3: Recommended algorithms by decision-tree leaf, for MCAR rates in [35, 50]% (Group 3).
  Leaf H (size > 5000):      Weather: A1, A2, A3, A4;  M1: A2, A3, A4
  Leaf G (size 3000–5000):   Weather: A2, A3, A4;  M1: A2, A3
  Leaf K (size 1500–2500):   Weather: A2, A3, A4;  M1: A2, A3
  Leaf M (size < 1000):      Weather: A2, A3, A4;  M1: A2

4 Belief Noisy-Or model

The Belief Noisy-Or (BNO) model is suitable for describing a specific class of uncertain relationships in Bayesian networks [4] that is common in several practical applications of BNs. As an example, let us mention the QMR-DT network [17]. In Figure 3 we present the structure of a CPT F(Y | X_1, ..., X_n) where auxiliary nodes X'_1, ..., X'_n are added to explicitly separate the noisy relations from the logical OR relation. For a CPT with multiple parent variables X_1, ..., X_n, the noisy-or is defined as follows (in the case of one parent variable, we use the probability values specified in Table 4):

\[
F(X'_i = 0 \mid X_i = 0) = 1 - \alpha, \qquad F(X'_i = 1 \mid X_i = 0) = \alpha,
\]
\[
F(X'_i = 0 \mid X_i = 1) = p_i, \qquad F(X'_i = 1 \mid X_i = 1) = 1 - p_i,
\]

where i ∈ {1, ..., n} and p_i ∈ [0, 1] is the parameter defining the probability that the positive value x_i of variable X_i is inhibited; it is referred to as the inhibition probability. The parameter α specifies the possibility of a positive value even if the value of the corresponding parent variable is negative. In most experiments, we will set α = 0. The CPT of F(Y | X'_1, ..., X'_n) represents the deterministic logical OR function, i.e.,

\[
F(Y = 0 \mid X'_1 = x'_1, \ldots, X'_n = x'_n) \;=\;
\begin{cases}
1 & \text{if } x'_1 = 0, \ldots, x'_n = 0, \\
0 & \text{otherwise.}
\end{cases}
\]

Consequently, the CPT of F(Y | X_1, ..., X_n), which represents the noisy-or function, is computed as follows:
\[
F(Y = 0 \mid X_1 = x_1, \ldots, X_n = x_n) \;=\; \prod_{i=1}^{n} F(X'_i = 0 \mid X_i = x_i) \;=\; \prod_{i=1}^{n} p_i^{\,x_i} (1 - \alpha)^{1 - x_i},
\]
\[
F(Y = 1 \mid X_1 = x_1, \ldots, X_n = x_n) \;=\; 1 - \prod_{i=1}^{n} p_i^{\,x_i} (1 - \alpha)^{1 - x_i}.
\]

4.1 Analysis of BNO models

In this section, we analyze the BNO models listed in Table 9 in Appendix A, with α = 0. Tables 5, 6, and 7 show the marginal probability distributions (MPD) of the variables in the BNO models N1, N2, and BN2O, respectively; see Figures 11 and 12. The tables illustrate the decrease of the marginal probability value F(C_i = 0) in the case of a node having more than one parent (see Table 7). This decrease is due to the product form of the noisy-or distribution. On the other hand, the tables also illustrate the increase of that marginal probability with a higher number of predecessors in previous layers; this increase depends on the number of layers above the node and also on the numbers of edges in those layers (see Tables 5 and 6).

Table 4: The CPT F(X'_i | X_i).
  F(X'_i = 0 | X_i = 0) = 1 − α    F(X'_i = 0 | X_i = 1) = 0.2
  F(X'_i = 1 | X_i = 0) = α        F(X'_i = 1 | X_i = 1) = 0.8

Table 5: N1 (Figure 11): marginal probability distributions.
               C1   C2   C3    C4    C5    C6
  F(C_i = 0):  .5   .6   .68   .744  .795  .837
  F(C_i = 1):  .5   .4   .32   .256  .205  .163

Table 6: N2 (Figure 11): marginal probability distributions.
               C1   C2   C3    C4    C5    C6
  F(C_i = 0):  .5   .6   .536  .539  .716  .707
  F(C_i = 1):  .5   .4   .464  .461  .284  .293

Table 7: BN2O (Figure 12): marginal probability distributions.
               C1   C2   C3   C4   C5   C6    C7   C8
  F(C_i = 0):  .5   .5   .5   .5   .5   .129  .36  .36
  F(C_i = 1):  .5   .5   .5   .5   .5   .871  .64  .64

Figure 2: The Structural Hamming Distance to the true models of the models produced by the structure learning algorithms, using data generated from all models summarized in Table 8, averaged over all data sizes and all MCAR rates.

Figure 3: Noisy-or (the CPT structure with auxiliary nodes X'_1, ..., X'_n).

Using the conditional probability distributions of the variables given their parents, we can easily calculate the joint probability distribution F(U) using the BN factorization from Section 2, and conditional probability distributions (CPDs) F(X_A | X_B), where X_A ⊆ U and X_B ⊆ U \ X_A. Recall that a CPD for a particular configuration x_B of the nodes X_B can be computed as (note that all BNs considered in this paper satisfy the condition F(X_B = x_B) > 0 for all x_B):

\[
F(X_A \mid X_B = x_B) \;=\; \frac{F(X_A, X_B = x_B)}{F(X_B = x_B)}. \tag{5}
\]

The Kullback-Leibler distance (KLD) of two conditional probability distributions F(X_A | X_B) and G(X_A | X_B) defined on the same state space is computed as the weighted average KLD of F(X_A | X_B = x_B) and G(X_A | X_B = x_B) over all configurations x_B:

\[
D\bigl(F(X_A \mid X_B) \,\|\, G(X_A \mid X_B)\bigr)
= \sum_{x_B} F(X_B = x_B) \sum_{x_A} F(X_A = x_A \mid X_B = x_B) \,\log \frac{F(X_A = x_A \mid X_B = x_B)}{G(X_A = x_A \mid X_B = x_B)}
= \sum_{x_A, x_B} F(X_A = x_A, X_B = x_B) \,\log \frac{F(X_A = x_A \mid X_B = x_B)}{G(X_A = x_A \mid X_B = x_B)}.
\]

We will use the KLD of conditional probability distributions estimated from the true data to support our arguments when explaining the results.

4.2 Experiments

We have performed experiments on different Belief Noisy-Or (BNO) models with their CPTs defined in Table 4, where α ∈ {0, 0.2} and the CPT of a node with no parents is uniform, i.e., F(X_i = 1 | {}) = 0.5 and F(X_i = 0 | {}) = 0.5. The experiments have been repeated ten times on ten different datasets generated from the BNO models with different MCAR rates, as specified in Table 9 in Appendix A.
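For concreteness, here is a minimal sketch (our illustration, not the authors' code) of how a noisy-OR CPT F(Y | X_1, ..., X_n) can be tabulated from the inhibition probabilities p_i and the parameter α defined above, using the values of Table 4 (p_i = 0.2) as an example:

```python
# Minimal sketch: tabulate the noisy-OR CPT F(Y | X_1, ..., X_n) from the
# inhibition probabilities p_i and the parameter alpha defined above.
from itertools import product

def noisy_or_cpt(p, alpha=0.0):
    """p: list of inhibition probabilities p_i, one per parent X_i.
    Returns a dict mapping each parent configuration (x_1, ..., x_n)
    to the pair (F(Y=0 | x), F(Y=1 | x))."""
    n = len(p)
    cpt = {}
    for x in product((0, 1), repeat=n):
        # F(Y=0 | x) = prod_i  p_i^{x_i} * (1 - alpha)^{1 - x_i}
        y0 = 1.0
        for p_i, x_i in zip(p, x):
            y0 *= p_i if x_i == 1 else (1.0 - alpha)
        cpt[x] = (y0, 1.0 - y0)
    return cpt

# Example with two parents, p_i = 0.2 and alpha = 0:
# F(Y=1 | X_1=1, X_2=1) = 1 - 0.2 * 0.2 = 0.96
print(noisy_or_cpt([0.2, 0.2], alpha=0.0)[(1, 1)])
```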
In all figures, we denote additional edges by blue dashed lines, missing edges by red lines, and edges with a different orientation by orange lines.

4.2.1 Model N1

The true N1 model is shown in Figure 11 in Appendix A. We use this model as an example of a simple model with a chain structure. This model is motivated by applications in, e.g., telecommunications. Let us summarize the results for this model:

– All algorithms learn the true structure when α ≠ 0, for all data sizes and all MCAR rates.
– The algorithms A2, A3, and A4 learn structures different from the true model in some cases with α = 0, an MCAR rate of 15%, and a data size of 1,000. For example, A3 and A4 learn an additional edge C2 → C4, and A2 learns C4 → C6 instead of C5 → C6.
– Using equation (5), we calculate F(C6 | C5) and F(C6 | C4) from the true model N1. We have found that their KLD value (computed using the formula in Section 4.1) is very small, only 0.001. Also, the chi-square test of independence, whose p-value is smaller than 0.0001, reveals that there is a strong dependence between C6 and C4 in addition to the relationship between C6 and C5, which is already explicitly present in the true model. Moreover, the BIC of the learned structure is -2,252.93 while the BIC of the true model on the same dataset is -2,255.64 (we report the BIC value for one of the ten datasets, since the results for the remaining nine are similar). This can be explained by the deterministic conditional distribution F(C6 | C5 = 0) for C5 = 0. For these reasons, we can accept that A2 has learned C4 → C6 instead of C5 → C6.

4.2.2 Model N2

The true N2 model is shown on the right-hand side of Figure 11 in Appendix A. We use this model as an example of a model more complicated than the previous model N1. This model is motivated by applications in, e.g., computer networks. We summarize the results of the experiments performed with this model:

– Figure 4 shows the Structural Hamming Distance (SHD) for all tested MCAR rates and the models with α = 0. We can observe that, as expected, the performance of the algorithms improves with increasing data size.
– We can see that A2 on average has a smaller SHD to the true model than the other algorithms.
– In Figure 5, we compare the models learned from the datasets of size 5,000 with an MCAR rate of 10% using all four algorithms. We can see that A2 and A3 have the same SHD, but they differ in that A3 has a missing edge C4 → C5 while A2 has an additional edge C3 → C6. This additional edge can be explained by observing that there is a chain of nodes C3 → C4 → C6 through which the state 0 is propagated because α = 0. In other words, we calculate F(C3, C6 | C1, C2, C4, C5) and the product F(C3 | C1, C2, C4, C5) · F(C6 | C4, C5) from the true model of Figure 11 using equation (5). The KLD value (computed as explained in Section 4.1) of these two distributions is very small, only 0.02. Also, the chi-square test of independence of C3 and C6 reveals that these variables are dependent (the test's p-value is smaller than 0.0001). The additional edge is also supported by a comparison of the BIC values of the learned structure with and without the additional edge C3 → C6: they are -9,813.67 for the model with the additional edge and -9,880.5 for the true model.
– If α > 0, then no additional edge is learned anymore, no matter what the MCAR rate is. Algorithms A2 and A4 are always able to learn the true structure when the data size exceeds 1,000.
Also, A1 and A3 learn the true structure when the data size is larger than 1,500.

Figure 4: The Structural Hamming Distance of the models produced by the structure learning algorithms to the true model (with α = 0), using data generated from the N2 model (the true model is presented in Figure 11), averaged over ten experiments for different data sizes and for MCAR rates of 5%, 10%, and 15%, respectively.

4.2.3 BN2O models

These models are motivated by health-care applications, for example by the QMR-DT network [17]. We created 60 different BN2O models consisting of two layers with N = 20 nodes in total. They differ in the number of nodes in the first layer, namely L1 ∈ {5, 8, 12, 15}; the number of nodes in the second layer is L2 = 20 − L1. The numbers of edges between the layers are generated randomly with three different options, N/2, 2·N/2, and 4·N/2; each option is repeated five times.

Using these models, we have generated multiple incomplete datasets where the sizes of the datasets are 3,000 and 5,000 and the MCAR rates are 10% and 15%. Figure 8 shows boxplots of the numbers of additional and missing edges learned in all instances for each algorithm, for the dataset size of 3,000 and all MCAR rates. The results show that A2 performs better on average (i.e., its distance to the true model is smaller) than the other algorithms. Next, we discuss in more detail one simpler example of a BN2O model. The structure of this model is shown in Figure 12 in Appendix A.

– Figure 6 shows the SHD of all learned models grouped by MCAR rates for the models with α = 0.
– The models learned from the dataset of size 5,000 with an MCAR rate of 10% and α = 0 using all algorithms are shown in Figure 7. We can see that A2 performs better (i.e., its SHD to the true model is smaller) than the other algorithms.
– Note the additional edge C7 → C8 learned by A2 for most datasets. The argument supporting this additional edge is similar to that for the additional edge in the N2 model. Again, we can see that the KLD of F(C7, C8 | C2, C5) and the product F(C7 | C2, C5) · F(C8 | C2, C5) is very small, only 0.002. Also, the chi-square test of independence of C7 and C8 has a p-value smaller than 0.0001, and there is always only a very small difference between the BIC of the model with the extra edge and that of the true model; for example, the BIC of the model with the extra edge is -7,331.8 while the BIC of the true model is -7,338.5 for one of the ten generated datasets.
– In the experiments with models having α > 0, no additional edge has been learned, and the true model is learned successfully when the data size is 2,500 or larger, for all MCAR rates.

4.2.4 A large BN2O model

We have performed experiments with the model shown in Figure 13 in Appendix A. This model consists of 25 variables: 14 in the first layer and 11 in the second layer. All algorithms required a data size of more than 3,000 to give a good performance. With a data size of 3,000 (and an MCAR rate of 10%) the recorded SHD values of algorithms A1, A2, A3, and A4 are still not very good, namely 14.6, 10.2, 10, and 9.8, respectively.
With data sizes of 5,000 and 7,500 (and an MCAR rate of 10%) the recorded average SHD values of A1, A2, A3, and A4 are already much better, namely 7.2, 4.2, 4.3, and 5.1, respectively. See Figure 9 for the learned models. With a data size of 10,000 we already recover the true models, except for the additional edges in the case of A2, as discussed in Section 4.2.3.

Figure 5: Models learned by A1, A2, A3, and A4, respectively, for most of the ten datasets generated from the true N2 model (presented in Figure 11) with α = 0, an MCAR rate of 10%, and a data size of 5,000.

Figure 6: The Structural Hamming Distance to the true model of the models produced by the structure learning algorithms using data generated from the BN2O model (the true model is presented in Figure 12) with α = 0, averaged over all data sizes, for MCAR rates of 5%, 10%, and 15%, respectively.

Figure 7: Models learned by A1, A2, A3, and A4, respectively, for most of the ten datasets generated from the true BN2O model (presented in Figure 12) with α = 0, an MCAR rate of 10%, and a data size of 5,000.

Figure 8: Results of the structure learning algorithms using data generated from the BN2O model (with α = 0) with a data size of 3,000, averaged over all tested MCAR rates. The plot on the left displays the average number of additional edges and the plot on the right displays the average number of missing edges.

Figure 9: Models learned by A1, A2, A3, and A4, respectively, using data generated from the large BN2O model consisting of 25 variables (with α = 0), an MCAR rate of 10%, and a data size of 7,500 (the true model is presented in Figure 13).

5 Conclusion

In this paper, we have provided an approach to learning the optimal BN structure from incomplete data by adapting the approach of [8]. This adaptation imputes missing values using product mixtures learned by the EM algorithm [11]. We have shown that the sequence of log-likelihood values generated by the E-step and M-step of the EM algorithm is non-decreasing and that the algorithm converges. Lemma 2.1 helps us reduce the collection of candidate parent sets for a variable, which can speed up the learning algorithm.

We have performed experiments on incomplete data generated from different types of BN models to compare the proposed algorithm A2 with other algorithms, namely with A1 [8] and with soft and hard EM [3], referred to as A3 and A4, respectively. In our comparisons, we use the Structural Hamming Distance of the CPDAGs of the learned DAGs to the CPDAGs of the original models. Such comparisons have been undertaken on (a) general Bayesian networks and (b) Belief Noisy-Or (BNO) [4] models with partially deterministic and nondeterministic conditional probability distributions.
The experiments with models of type (b) are motivated by uncertain relationships in Bayesian networks, which are common in practical applications of BNs. We have obtained the following results in detailed simulation studies.

(a) General BN models:

– The A2 algorithm appears to be the best choice among the tested algorithms for learning the structure of BNs from incomplete data, whatever the data size and the MCAR rate are.
– In most scenarios corresponding to different data sizes and MCAR rates, algorithm A2 is significantly better than the other algorithms, and in no scenario is it significantly worse than any other algorithm according to the Wilcoxon test.

(b) BNO models:

– A2 is able to recover all true edges in the tested models, except for the N1 model (shown in Figure 11) at size 1,000 and a missing rate of 15%. The different learned structure of the model N1 is acceptable because the chi-square (X²) test and the Kullback-Leibler distance (KLD) between the related conditional probabilities suggest there is a high degree of dependence between the connected variables.
– A2 has learned an additional edge in the case of the models N2 (shown in Figure 11) and BN2O (shown in Figure 12). The additional edge is acceptable since the X² test and the KLD suggest there is a high degree of dependence between these variables. We have seen that the BIC of the learned structure is almost equal to the BIC of the true model; for example, the BIC of the model learned using A2 (shown in Figure 7) is -7,331.8 and the BIC of the true model is -7,338.5. Similar behavior has been observed in other BNO models.
– A2 is always able to recover all edges, while the other algorithms are not.
– For large BN2O models, all algorithms require data sizes larger than 3,000 to achieve a good performance; e.g., for the BN2O model with 25 variables, A2 needs at least 10,000 data records to learn the correct model (with the exception of additional edges, as discussed in Section 4.2.3).

We have empirically shown that our algorithm A2 behaves better than the other tested algorithms on the studied BNs and in different scenarios. Based on these experiments, we can recommend this algorithm to practitioners who use BNs or BNOs with incomplete data, especially in the medical domain, where BNO models can be used to study the hidden relationships between symptoms and diseases. An interesting topic for future research might be learning the structure of large BN2O networks from incomplete data and optimizing the number of components in the EM-Mixture.

Acknowledgement

This work was supported by the Student Grant CTU SGS20/132/OHK4/2T/14.

References

[1] Nir Friedman, Dan Geiger, and Moises Goldszmidt. Bayesian network classifiers. Machine Learning, 29(2-3):131–163, 1997. URL: https://doi.org/10.1023/a:1007465528199.

[2] Cassio P. de Campos, Mauro Scanagatta, Giorgio Corani, and Marco Zaffalon. Entropy-based pruning for learning Bayesian networks using BIC. Artificial Intelligence, 260:42–50, 2018. URL: https://doi.org/10.1016/j.artint.2018.04.002.

[3] Andrea Ruggieri, Francesco Stranieri, Fabio Stella, and Marco Scutari. Hard and soft EM in Bayesian network learning from incomplete data. Algorithms, 13(12):329, 2020. doi:10.3390/a13120329.

[4] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[5] Nir Friedman and Moises Goldszmidt.
Learning Bayesian networks with local structure. In Learning in Graphical Models, pages 421–459. Springer, 1998. URL: https://doi.org/10.1007/978-94-011-5014-9_15.

[6] Zhifa Liu, Brandon Malone, and Changhe Yuan. Empirical evaluation of scoring functions for Bayesian network model selection. In Proceedings of the Ninth Annual MCBIOS Conference: Dealing with the Omics Data Deluge, Oxford, MS, USA, 2012. BMC Bioinformatics. doi:10.1186/1471-2105-13-S15-S14.

[7] Poh Choo Song, Hui Yee Chong, Hong Choon Ong, and Sing Yan Looi. A model of Bayesian network analysis of the factors affecting student's higher level study decision: The private institution case. Journal of Telecommunication, Electronic and Computer Engineering (JTEC), 8(2):105–109, 2016.

[8] Cassio P. de Campos, Zhi Zeng, and Qiang Ji. Structure learning of Bayesian networks using constraints. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 113–120, New York, NY, USA, 2009. Association for Computing Machinery. doi:10.1145/1553374.1553389.

[9] James Cussens. Bayesian network learning with cutting planes. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 153–160, Arlington, Virginia, USA, 2011. AUAI Press.

[10] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977. URL: https://doi.org/10.1111/j.2517-6161.1977.tb01600.x.

[11] Jiří Grim, Jan Hora, Pavel Boček, Petr Somol, and Pavel Pudil. Statistical model of the 2001 Czech census for interactive presentation. Journal of Official Statistics, 26(4):673–694, 2010.

[12] J. Grim and P. Boček. Statistical model of Prague households for interactive presentation of census data. In SoftStat '95: Advances in Statistical Software 5, Conference on the Scientific Use of Statistical Software, Heidelberg, DE, 1996.

[13] Luca Scrucca, Michael Fop, T. Brendan Murphy, and Adrian E. Raftery. mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1):289–317, 2016.

[14] Fred Glover. Tabu search—Part I. ORSA Journal on Computing, 1(3):190–206, 1989.

[15] Marco Scutari and Jean-Baptiste Denis. Bayesian Networks: with Examples in R. Chapman & Hall, Boca Raton, 2014. URL: https://doi.org/10.1111/biom.12856.

[16] M. Neuhäuser. Wilcoxon–Mann–Whitney test. In International Encyclopedia of Statistical Science. Springer Berlin Heidelberg, 2011.

[17] Michael A. Shwe, Blackford Middleton, David E. Heckerman, Max Henrion, Eric J. Horvitz, Harold P. Lehmann, and Gregory F. Cooper. Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base. Methods of Information in Medicine, 30(4):241–255, 1991. URL: https://doi.org/10.1055/s-0038-1634846.

[18] B. Abramson, J. Brown, Ward Edwards, Allan Murphy, and Robert L. Winkler. Hailfinder: A Bayesian system for forecasting severe weather. International Journal of Forecasting, 12(1):57–71, 1996 (Probability Judgmental Forecasting). URL: https://doi.org/10.1016/0169-2070(95)00664-8.

[19] A. Philip Dawid. Prequential analysis, stochastic complexity and Bayesian inference. Bayesian Statistics, 4:109–125, 1992.
Appendix A. Simulation Scenarios

This appendix provides a complete list of all experiments in the simulation study described in Sections 3.2 and 4.2, organized by their main characteristics in Tables 8 and 9, respectively. The number of components used in each experiment was selected based on the number of variables in the dataset. The true models mentioned in Table 8 are shown in Figure 10; the true models mentioned in Table 9 are shown in Figures 11 and 12.

Table 8: Description of the key factors of all BN experiments in the simulation study (10 replicates for each setting).
  Weather [18]:    MCAR 10%, 25%: sample sizes 100, 500, 1000, 5000, 10000;  MCAR 50%: sample sizes 100, 500, 1000, 5000, 10000, 13000
  Child [19]:      MCAR 10%, 15%, 50%: sample sizes 1000, 2000, 3000, 5000
  M2 (Figure 10):  MCAR 5%, 10%, 15%, 25%: sample sizes 500, 1000, 1500, 2500, 5000
  M1 (Figure 10):  MCAR 10%, 20%, 35%, 50%: sample sizes 500, 1500, 2500, 5000, 10000, 13000

Table 9: Description of the key factors of all Belief Noisy-OR experiments in the simulation study (10 replicates for each setting; true models are presented in Figures 11 and 12).
  BN2O:        MCAR 5%, 10%, 15%: sample sizes 1000, 1500, 2500, 5000
  N1:          MCAR 5%, 10%, 15%: sample sizes 1000, 1500, 2500, 5000
  N2:          MCAR 5%, 10%, 15%: sample sizes 1000, 1500, 2500, 5000
  large BN2O:  MCAR 10%: sample sizes 5000, 7500

Figure 10: The M1 and M2 true models, respectively.

Figure 11: The N1 and N2 true models, respectively. Their marginal probability distributions are summarized in Tables 5 and 6.

Figure 12: The BN2O true model. Its marginal probability distributions are summarized in Table 7.

Figure 13: Example of a large BN2O model with 25 variables (whose learned models are presented in Figure 9).