Informatica 40 (2016) 225–235 225 Learning Sentiment Dependent Bayesian Network Classifier for Online Product Reviews Sylvester Olubolu Orimaye, Zi Yang Pang and Alvino Mandala Putra Setiawan School of Information Technology, Monash University Malaysia E-mail: sylvester.orimaye@monash.edu Keywords: sentiment classification, sentiment-dependent, Bayesian network, product reviews Received: September 17, 2015 Analyzing sentiments for polarity classification has recently gained attention in the literature with dif- ferent machine learning techniques performing moderately. The challenge is that sentiment-dependent information from multiple sources are not considered often in existing sentiment classification techniques. In this study, we propose a logical approach that maximizes the true sentiment class probabilities of the popular Bayesian Network for a more effective sentiment classification task using the individual word sen- timent scores from SentiWordNet. We emphasize on creating dependency networks with quality variables by using a sentiment-dependent scoring technique that penalizes the existing Bayesian Network scoring functions such as K2, BDeu, Entropy, AIC and MDL. The outcome of this technique is called Sentiment Dependent Bayesian Network. Empirical results on eight product review datasets from different domains suggest that a sentiment-dependent scoring mechanism for Bayesian Network classifier could improve the accuracy of sentiment classification by 2% and achieve up to 86.7% accuracy on specific domains. Povzetek: Razvit je nov Bayesov klasifikator na osnovi mnenjske odvisnosti, uporabljen za ocenjevanje spletnih produktov. 1 Introduction Sentiment Classification (SC) has recently gained a lot of attention in the research community [1, 2, 3]. More re- cently, SC has moved from the commonly used bag-of- words models to bag-of-concepts models [4, 5]. This is due to its increasing demand for the analysis of consumer sentiments on products, topic and news related text from social media such as Twitter1 and online product reviews such as Amazon2. In the same manner, Bayesian Network (BN)[6] also known as Bayesian Belief Network plays a major role in Machine Learning (ML) research for solv- ing classification problems. Over the last decade, learn- ing BNs has become an increasingly active area of ML re- search where the goal is to learn a network structure us- ing dependence or independence information between set of variables [6, 7, 8, 9]. The resulting network is a directed acyclic graph (DAG), with a set of joint probability distri- butions, where each variable of the network is a node in the graph and the arcs between the nodes represent the prob- ability distribution that signifies the level of dependency between the nodes. While it is more common to use other ML algorithms for SC tasks [10, 11], few research papers have proposed BN as a competitive alternative to other popular ML algorithms. Considering the huge size of data available from social me- dia and the level of difficulty attached with analysing sen- 1https://twitter.com/ 2https://amazon.com/ timents from natural language texts, the ability of BN to learn dependencies between words and their correspond- ing sentiment classes, could undoubtedly produce a better classifier for the sentiment classification task. This paper focusses on constructing a BN classifier that uses sentiment information as one important factor for determining depen- dency between network variables. 
BN has been successfully applied to solve different ML problems with its performance outweighing some of the popular ML algorithms. For example, in [12], a full Bayesian Network classifier (FBC) showed statistically significant improvement on state-of-the-art ML algorithms with the 33 UCI datasets. In the case of SC, Naïve Bayes (NB), which is a special case of BN [13], and one of the leading ML algorithms for SC tasks [10], has surprisingly and repeatedly shown improved performance on movie and product reviews despite its conditional independence as- sumption. By comparative study, we show that a Senti- ment Dependent Bayesian Network (SDBN) classifier has improved performance on popular review datasets such as Amazon product reviews due to the ability of the Bayesian Network to construct a network structure of multiple de- pendencies [12]. Constructing a BN classifier requires learning a network structure with set of Conditional Probability Tables (CPTs) [6]. Basically, there are two combined steps involved in the BN construction process. The first is to perform vari- able search on a search space, and the other is to score each variable based on the degree of fitness [14]. The chal- 226 Informatica 40 (2016) 225–235 S.O. Orimaye et al. lenge however, is to ensure that good networks are learned with appropriate parameters using a scoring or fitness func- tion to determine network variables from the given dataset. Thus, much of the research works on BN focus on devel- oping scoring functions for the BN classifier [15]. We ar- gue that such scoring functions rely on many assumptions that make them less effective for SC tasks. For example, K2 algorithm, which is based on Bayesian Scoring func- tion relies on the assumptions of parameter independence and assigning a uniform prior distribution to the param- eters, given the class [9]. We believe these assumptions lead to many false positives in the classification results as sentiment classes are better captured by conditional depen- dency between words, rather than independent word counts [16, 17]. We also suggest that varying prior distribution could be assigned to each variable since each word has a nat- ural prior probability of belonging to a particular senti- ment class, independent of the data. For example, the word “good” is naturally positive and “bad” is naturally nega- tive. Thus, in this work, we propose a sentiment scoring function that leverage sentiment information between vari- ables in the given data. The output of the sentiment scor- ing function is then used to augment existing BN scoring functions for better performance. Our aim is to ensure sen- timent information form part of the fitness criteria for se- lecting network variables from sentiment-oriented datasets such as reviews. The proposed scoring function uses a multi-class ap- proach to compute the conditional mutual information us- ing sentiment-dependent information between local vari- ables in each class of instances. The conditional mutual in- formation for all classes are then penalized using the Mini- mum Description Length (MDL) principle. The local prob- abilities used in computing the conditional mutual infor- mation is computed using the popular Bayesian probability that uses the prior probability of a variable belonging to a natural sentiment class (i.e. independent of the given data by using individual word sentiment score from SentiWord- Net [18]) and the observation of the variable in the selected class of instances and other classes in the dataset (e.g. 
pos- itive and negative). For example, the class probability of each word in a product review is augmented with the po- larity probability of the same word from SentiWordNet. The technique takes into account that the network structure would depend on the following criteria: – The posterior probability from multiple evidences that variables xi and xj have sentiment dependency; – The conditional mutual information between the vari- ables for all sentiment classes; – The dependency threshold computed using the Mini- mum Description Length principle; and – The representation of the network as a full Bayesian Network as proposed in [12]. The importance of the first criterion is that we are able to avoid the independence assumption made by the existing BN scoring functions. We capture local sentiment depen- dency between the variables as a joint probability of evi- dences from each variable and each class in the given data. Also, existing BN scoring functions uses the conditional independence given the data as a whole for determining de- pendencies between variables [15, 9]. Under such approach in SC, two co-occurring words may occur with the same or similar frequencies in different classes. We observed that training BN classifier without penalizing such occurrences or dependencies, could affect the classifier decision to de- cide an appropriate sentiment class. As such, our second criterion captures the conditional mutual information be- tween the variables, while the third criterion ensures that a BN classifier uses quality variables that are above the com- puted threshold. The latter also allow us to enforce strict d-separation policy between the network variables, which formally defines the process of determining independence between variables [19]. Thus, only quality variables are used to form the dependency network for the BN classifier. Finally, we introduce the last criterion as an improvement over a network constructed based on the first three criteria. The full Bayesian Network technique ensures that an in- dependent sentiment dependent network is constructed for each sentiment category (i.e. negative and positive). Section 2 of this paper discusses related work and ad- ditional motivations. In Section 3, we explain the prob- lem background and then present the proposed sentiment- dependent technique in Section 4. Our experiment is de- scribed in Section 5. Finally, Section 6 gives the conclu- sion to our study and some thoughts on future research di- rections. 2 Related work 2.1 Sentiment classification (SC) The most prominent of SC work is perhaps [20] which used supervised machine learning techniques for the po- larity classification of positive and negative sentiments in movie reviews. As a result of that work, different research directions within the field of sentiment analysis and opin- ion mining have been actively pursued [2, 3, 4, 5]. [10] proposed a subjectivity summarization technique, which uses minimum cuts to classify sentiment polarities in movie reviews. The technique identifies and extracts sub- jective portions of review documents using minimum cuts in graphs. The minimum cut approach takes into consider- ation, the pairwise proximity information supplied through graph cuts that partition sentences which are likely to be in the same sentiment class. This approach gave better im- provement from 82.8% to 86.4% on the subjective portion of review documents. 
The approach also gave similar bet- ter improvement when only 60% portion of a product re- view document is used compared to an entire review. In our work, we propose a classification technique that uses Learning Sentiment Dependent Bayesian Network. . . Informatica 40 (2016) 225–235 227 the entire portion of each product review. [21] performed classification of approximately 200K product reviews by using different machine learning algo- rithms. More importantly, the work investigated the signif- icance of higher order n-gram language model (n ≥ 3) in classifying sentiments from product reviews with an F1 of 90%. While Cui’s work was performed on random web- sites for online products with limited product domains, we performed experiments on Amazon product reviews with 8 different product domains. Similarly, [22] performed experiment with higher or- der n-gram features to train a Neural Network model on Amazon and TripAdvisor datasets with the best average er- ror rates of 7.12 and 7.37, respectively. In contrast, our work investigate the performance of a sentiment-dependent Bayesian Network classifier on the Amazon datasets with different product domains. More recently, [23] proposed a concept-level sentiment analysis technique, which uses the knowledge-based sen- tic computing technique that has recently gained attention within the sentiment analysis domain [3, 4, 5]. Because the sentic computing knowledge base sometimes omits vi- tal sentiment discourse, [23] combines low-level linguistic features with the sentic computing technique to train ma- chine learning model for polarity detection. In our work, the only knowledge base employed is SentiWordNet [24], which computes the natural polarity values of words rather than concept. Thus, we captured the dependencies between words by constructing and learning a Bayesian Network for sentiment classification. A detailed review of other recent sentiment classification techniques on different datasets can be found in [1, 2, 3, 4, 5]. 2.2 BN classifiers for sentiment classification [16] and [17] proposed a two-stage Markov Blanket Clas- sifier (MBC) approach to extract sentiments from unstruc- tured text such as movie reviews by using BN. The ap- proach learns conditional dependencies between variables (words) in a network and finds the portion of the network that falls within the Markov Blanket. The Tabu Search al- gorithm [25], is then used to further prune the resulting Markov Blanket network for higher cross-validated accu- racy. Although Markov Blanket has shown to be effective in avoiding over-fitting in BN classifiers [7], the MBC ap- proach finds sentiment dependencies based on the ordinary presence or absence of words in their original sentiment class only. We identify sentiment dependencies by consid- ering multiple sources of evidence. These include multiple sentiment classes in the data and the natural sentiment class of each variable which is independent of its sentiment class in the given data. Similarly, [26] proposed a parallel BN learning algo- rithm using MapReduce for the purpose of capturing senti- ments from unstructured text. The technique experimented on large scale blog data and captures dependencies among words using mutual information or entropy, with the hope of finding a vocabulary that could extract sentiments. The technique differs from [17] by using a three-phase (draft- ing, thickening and thinning) dependency search technique that was proposed in [27]. 
Other than using mutual infor- mation in the drafting phase of the search technique, the work did not capture additional sentiment dependencies us- ing other source of evidence. Again, we do not focus on developing a search algorithm but a scoring technique that considers multiple sentiment- dependent information as part of the existing state-of-the- art scoring functions. 3 Problem background 3.1 Bayesian network (BN) A Bayesian Network N is represented as a graphical dis- tribution of the joint probability between a set of random variables [28]. The network has two components: (1) a DAG G = (Rn,Mr) that represents the structural arrange- ment of a set of variables (nodes) Rn = {x1, ..., xn} and a corresponding set of dependence and independence asser- tions (arcs) Mr between the variables; (2) a set of condi- tional probability distributions P = {pi, ..., pn} between the parent and the child nodes in the graph. In the DAG component, the existence of an directed arc between a pair of variables xi and xj asserts a conditional dependency between the two variables [8]. The directed arc can also be seen to represent causality between one variable and the other [29], that is, variable xy is an ex- istential cause of variable xz , hence xy → xz . The absence of an directed arc between a pair of variables, however, represents a conditional independence, such that, given a subset U of variables from Rn, the degree of information about variable xi does not change by knowing xj , thus I(xi, xj |U). This also implies that p(xi|xj , U) = p(xi|U). The parent(s) of variable xi ∈ Rn is denoted by a set paG(xi) = xj ∈ Rn|xj → Xi ∈Mr, and paG(xi) = ∅ for the root node. The conditional probability distributions of the DAG G is represented by its CPT, which contains a set of numerical parameters for each variable xi ∈ Rn. These numeric pa- rameters are computed as the probability of each variable given the set of parents, p(xi|paG(Xi)). Over the set of variables in Rn, the joint probability for the BN is there- fore obtained as follows: p(x1, ..., xn) = ∏ Xi∈Rn p(xi|paG(xi)) (1) Thus, for a typical classification task, the BN classifier would learn the numerical parameters of a CPT from the DAG structure G, by estimating some statistical informa- tion from the given data. Such information include, mu- tual information (MI) between the variables and chi-square distribution [15]. The former is based on the local score 228 Informatica 40 (2016) 225–235 S.O. Orimaye et al. metrics approach and the latter exhibits conditional inde- pendence tests (CI) approach. For both approaches, dif- ferent search algorithms are used to identify the network structure. The goal is to ascertain, according to one or more search criteria, the best BN that fits the given data by evaluating the weight of the arc between the variables. The criteria for evaluating the fitness of the nodes (vari- ables), and the arcs (parameters) in the BN search algo- rithms, are expressed as fitting or scoring functions within the BN classifier [15]. Our goal is to ensure that those cri- teria include sentiment-dependent information between the variables. We will focus on penalizing existing local score metrics with our sentiment-dependent scoring function for the BN classifiers, hence the SDBN proposed in this paper. 
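To make the factorisation in Equation 1 concrete, the following minimal Python sketch computes the joint probability of a toy network from its CPTs. It is only an illustration: the node names, structure and probability values are invented for the example and are not taken from the review datasets or from the classifier implementation.

```python
# A toy illustration of Equation 1 (names, structure and CPT values are
# invented for the example): the joint probability of a Bayesian Network
# factorises into the product of each node's probability given its parents.

# Hypothetical structure: class -> good, class -> bad, good -> recommend
parents = {
    "class": [],
    "good": ["class"],
    "bad": ["class"],
    "recommend": ["good"],
}

# Conditional probability tables: (node value, tuple of parent values) -> probability
cpt = {
    "class":     {(1, ()): 0.5, (0, ()): 0.5},
    "good":      {(1, (1,)): 0.7, (0, (1,)): 0.3, (1, (0,)): 0.2, (0, (0,)): 0.8},
    "bad":       {(1, (1,)): 0.1, (0, (1,)): 0.9, (1, (0,)): 0.6, (0, (0,)): 0.4},
    "recommend": {(1, (1,)): 0.8, (0, (1,)): 0.2, (1, (0,)): 0.3, (0, (0,)): 0.7},
}

def joint_probability(assignment):
    """Equation 1: p(x1, ..., xn) = prod_i p(xi | pa(xi))."""
    prob = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[node])
        prob *= cpt[node][(value, parent_values)]
    return prob

# p(class=1, good=1, bad=0, recommend=1) = 0.5 * 0.7 * 0.9 * 0.8 = 0.252
print(joint_probability({"class": 1, "good": 1, "bad": 0, "recommend": 1}))
```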
The local score metrics are of particular interest to our sentiment classification task because they exhibit a practi- cal characteristic that ensures the joint probability of the BN is decomposable to the sum (or product) of the indi- vidual probability of each node [28][15]. To the best of our knowledge, very few research papers have considered sentiment-dependent information, as part of the fitness cri- teria for capturing dependency between the variables, espe- cially for product reviews on different domains. 3.2 BN scoring functions We focus on the local score metrics functions, K2, BDeu, Entropy, AIC and MDL [15]. The functions define a fitness score, and a specified search algorithm searches for the best network that maximizes the score. Each of these functions identifies frequencies of occurrence of each variable xi in the data D and a network structure N . Although the per- formance of these scoring functions may vary on different datasets [15], in this paper, we assume that the scores gen- erated by the scoring functions are somehow naïve, thus, we attempt to mitigate its effect on SC tasks. First, we will define the parameters that are common to all the functions. We will then describe each of the functions with their asso- ciated formula and specify their limitations to the SC tasks. Similar to [30], we use ri(1 ≤ i ≤ n) to denote the size or cardinality of xi. pa(xi) represents the parents of xi and the cardinality of the parent set is represented by qi = ∏ xj∈pa(xi) rj . If pa(xi) is empty (i.e. pa(xi) = ∅), then qi = 1. The number of instances in a dataset D, where pa(xi) gets its jth value is represented by Nij(1 ≤ i ≤ n, 1 ≤ j ≤ qi). Similarly, Nijk(1 ≤ i ≤ n, 1 ≤ j ≤ qi, 1 ≤ k ≤ ri) represents the portion of D where pa(xi) gets its jth value and xi gets its kth value such that Nij = ∑ri k=1Nijk. Obviously, N represents the size of D. K2: This metric is a type of Bayesian scoring function proposed by [6]. The function relies on series of assump- tions such as parameter independence and uniform prior probability for the network. We reiterate that instead of in- dependent word counts, the sentiments expressed in a given data are better captured using conditional dependency be- tween words and their related sentiment classes [16]. The K2 metric is defined as follows: Sk2(N,D) = P (N) n∏ i=0 qi∏ j=1 (ri − 1)! (ri − 1 +Nij)! ri∏ k=1 Nijk! (2) BDeu: The metric was proposed by [31] as a general- ization of K2. It resulted from Bayesian Drichlet (BD) and BDe which were proposed by [32]. The BD is based on hyperparameters ηijk and the BDe is a result of BD with additional assumptions. BDeu relies on the sample size η as the single parameter. Since BDeu is a generalization of K2, it carries some of our concerns expressed on K2 earlier. Most importantly, the uniform prior probability assigned to each variable xi ∈ pa(xi) could be replaced by the proba- bility of the variable belonging to a natural sentiment class as stated earlier. We suggest that this is likely to increase the accuracy of the sentiment classifier especially on sparse data distribution. We define the BDeu metric as follows: SBDeu(N,D) = P (N) n∏ i=0 qi∏ j=1 (Γ( ηqi ) Γ(Nij + η qi ) ri∏ k=1 Γ(Nijk + η riqi) Γ( ηriqi ) (3) Note that the function Γ(.) is inherited from BD, and Γ(c) = ∫∞ 0 e−uuc−1du [15]. Entropy: Entropy metric measures the distance between the joint probability distributions of the network [15]. 
This allows dependency information to be identified by computing the mutual information (or entropy) between pairs of variables. Thus, a minimized entropy between a pair of variables denotes a dependency relationship; otherwise, a large entropy implies conditional independence between the variables [12][32]. While the entropy metric has been successful in measuring dependency information for BN classifiers, the local probabilities involved in the metric are largely computed under the conditional independence assumption given the data (i.e. using frequency counts for independent variables). We suggest that a joint probability of multiple sentiment evidences could improve the metric in BN classifiers for SC tasks. The metric is defined as follows:

H(N, D) = -N \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} \frac{N_{ijk}}{N} \log \frac{N_{ijk}}{N_{ij}}    (4)

AIC: The AIC metric adds a non-negative parameter penalization to the entropy metric [15], which could also be improved by multiple sentiment evidences, as in the case of the entropy metric. The metric is specified as follows:

S_{AIC}(N, D) = H(N, D) + K    (5)

where K is the number of parameters, such that K = \sum_{i=1}^{n} (r_i - 1) \cdot q_i.

MDL: The MDL metric is based on the minimum description length principle, which selects a minimum representative portion of the network variables through coding [15]. The best BN is the one that minimizes the sum of the description lengths for the data. The metric is defined as follows:

S_{MDL}(N, D) = H(N, D) + \frac{K}{2} \log N    (6)

The use of MDL has not been investigated for sentiment classification on its own, except for selecting the dependency threshold between variables in a BN. The study in [28] suggests that the mean of the total cross-entropy error is asymptotically proportional to log N / 2N, which is why the entropy metric is penalized in Equation 6.

In this paper, the proposed sentiment-dependent score function is based on the information theory approach. The approach uses the entropy-based conditional mutual information (CMI) technique to measure the dependencies between the variables. The local probabilities for computing the CMI between two variables are derived as a joint probability resulting from multiple evidences of both variables belonging to the same sentiment class. This is achieved by using a multi-class approach that measures the CMI in each sentiment class. The sum of the CMIs over the data is thereafter penalized using the MDL principle, as suggested in [28].

4 Sentiment-dependent BN

As emphasized earlier, our motivation is to include sentiment information as part of the dependency criteria between the network variables. Similar to [12], we construct a multi-class full Bayesian Network and encode the sentiment information within the CMIs that determine the dependencies within the network.

Figure 1: Structural overview of SDBN.

Figure 1 shows the structure of a SDBN. In this figure, C denotes the class node and the parent of all the variable nodes x1, x2 and x3. The edges between variable x1 and variables x2 and x3 represent dependencies between the variables. The dashed directed lines from each sentiment component S show the contribution of the sentiment information to the dependencies.

The proposed SDBN is created from a sentiment-dependent score table (SDST), similar to the conventional CPT, which contains network parameters from the data.
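Before turning to how the SDST is built, the following minimal sketch illustrates, for reference, how the count statistics defined in Section 3.2 yield the entropy, AIC and MDL scores of Equations 4–6, which are the baseline scores that the sentiment-dependent score is designed to penalize. The data layout, helper names and toy example are illustrative assumptions, not the actual implementation.

```python
import math
from collections import Counter

# Illustrative sketch (not the actual implementation) of the entropy-based
# scores in Equations 4-6. "data" is a list of dicts mapping each discrete
# variable to its value; "parents" gives the candidate parent set of each
# variable; "cardinality" gives r_i for each variable.

def entropy_score(data, parents):
    """H(N, D) from Equation 4: -sum_{i,j,k} N_ijk * log(N_ijk / N_ij)."""
    h = 0.0
    for xi, pa in parents.items():
        n_ij, n_ijk = Counter(), Counter()
        for row in data:
            j = tuple(row[p] for p in pa)   # parent configuration j
            n_ij[j] += 1
            n_ijk[(j, row[xi])] += 1
        for (j, _), count in n_ijk.items():
            h -= count * math.log(count / n_ij[j])
    return h

def num_parameters(parents, cardinality):
    """K = sum_i (r_i - 1) * q_i, where q_i is the product of parent cardinalities."""
    k = 0
    for xi, pa in parents.items():
        q_i = 1
        for p in pa:
            q_i *= cardinality[p]
        k += (cardinality[xi] - 1) * q_i
    return k

def aic_score(data, parents, cardinality):
    """Equation 5: entropy plus the parameter count K."""
    return entropy_score(data, parents) + num_parameters(parents, cardinality)

def mdl_score(data, parents, cardinality):
    """Equation 6: entropy plus (K / 2) * log N."""
    K = num_parameters(parents, cardinality)
    return entropy_score(data, parents) + 0.5 * K * math.log(len(data))

# Toy usage with two binary word-presence variables, w2 depending on w1
data = [{"w1": 1, "w2": 1}, {"w1": 1, "w2": 0}, {"w1": 0, "w2": 0}, {"w1": 0, "w2": 0}]
parents = {"w1": [], "w2": ["w1"]}
cardinality = {"w1": 2, "w2": 2}
print(aic_score(data, parents, cardinality), mdl_score(data, parents, cardinality))
```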
Given a dataset D containing two or more sentiment classes, we divide D into c subsets, where D1 ... Dc represent the sentiment classes present in D. Thus, for each subset Dc, we create a SDST from the given data, and at a later stage we use the values in the SDST to learn a full Bayesian classifier.

Creating an appropriate CPT or SDST is challenging, especially when there is a sheer number of variables in the given data [27]. In fact, local search algorithms such as K2, Hill Climbing, TAN, Simulated Annealing, Tabu Search and Genetic Search have been developed to address this challenge [7]. Thus, we do not intend to repeat the sophisticated local search process in our scoring technique. We use a straightforward approach that computes the CMI as the dependency between a pair of variables, given a subset Dc. The resulting score for each pair of variables is stored in the SDST. Equation 7 computes the CMI for a pair of variables. Note that this process is equivalent to the drafting phase proposed in [27] or the Chow and Liu algorithm in [33]. We can therefore focus on computing the local probabilities P(xi) and P(xj) for the CMI. In this work, each local probability encodes the sentiment dependency information as a joint probability of multiple sentiment evidences. We suggest that the joint probability is better than using the ordinary variable presence or a single frequency count.

CMI(x_i, x_j | C) = \sum_{x_i, x_j, c} P(x_i, x_j, c) \log \frac{P(x_i, x_j, c)}{P(x_i | c) P(x_j | c)}    (7)

Alternatively, the CMI in Equation 7 can be computed with Equation 8, which penalizes the default CMI for a pair of variables, where S_\lambda(x_i, x_j) is the sentiment prior computed from the multiple sentiment evidences for each variable x_i and x_j.

CMI(x_i, x_j | C) = \sum_{x_i, x_j, c} P(x_i, x_j, c) \log \frac{P(x_i, x_j, c)}{P(x_i | c) P(x_j | c)} S_\lambda(x_i, x_j)    (8)

4.1 Local probabilities for CMI

In order to compute the local probabilities P(xi) and P(xj), we adopt Bayesian probability [34] to calculate the joint probability from multiple sentiment evidences. Bayesian probability combines a generative model or likelihood p(D|θ) of the dataset with a prior belief p(θ) to infer a posterior distribution p(θ|D), see Equation 9. The idea is to determine a favourable posterior for a particular variable belonging to its observed class, such that the conditional mutual information between two dependent variables xi and xj increases when the posterior for both variables in the same class is large.

p(\theta | D) = \frac{p(D | \theta) \, p(\theta)}{p(D)}    (9)

However, in sentiment-oriented documents such as product reviews, it is very common to observe variables that belong to different classes in one sentiment class. [20] referred to such a scenario as thwarted expectation. For example, a “positive” review document may contain certain “negative” words used to express dissatisfaction about an aspect of a product, despite some level of satisfaction that the product might offer. With this kind of problem, a dependency network that is learned with ordinary frequency counts of each variable (regardless of the sentiment class) would most likely lead to inaccurate sentiment classifiers.

Figure 2 shows a sample BN resulting from a product review dataset after performing attribute selection. In that network, variable “After” has a 1.0 probability of belonging to the negative and positive classes, respectively.
Similarly, variable “not” has a 0.723 probability of belonging to a “positive” class rather than “negative”. Every other variable in the network has split probabilities between both classes. Our aim is to remove such variables from the dependency network, or at least minimize their influence in the network, such that the quality of the network is improved for sentiment classification.

Figure 2: An example Bayesian network from product reviews.

In this work, we compute the posterior information for each variable by considering its prior information and its joint likelihood, or observation, from all the sentiment classes available in the data.

The prior information is computed using the natural sentiment or polarity scores from SentiWordNet [24]. SentiWordNet gives the polarity scores of the corresponding synsets for each English word. However, the polarity scores are often different for each of the synset entries. A synset contains multiple semantic or polarity interpretations of a given word. Each interpretation has three different polarity values. That is, a synset entry (word) has positive, negative, and neutral polarity scores, which vary depending on the semantic interpretation of the word. An example of such words is “great”. Its fourth synset entry in SentiWordNet has 0.25 positive, 0.625 negative, and 0.125 neutral polarity scores, respectively.

In this work, we focus on the “positive” and “negative” sentiments, thus we only consider the positive and negative polarity scores from SentiWordNet. The challenge, however, is to compute an absolute, single polarity score for each word from its multiple synset entries. First, we compute the score for each polarity independently and then select the polarity with the maximum score. The score for the positive or negative polarity over all synset entries of a given word is computed as follows:

score_\phi(w) = \frac{1}{\epsilon} \sum_{i=1}^{\epsilon} E_c(e_i)    (10)

where score_\phi(w) is the score for each polarity of the given word w, \epsilon is the number of synset entries E for the word, c is the polarity or category (i.e. positive or negative), and e_i is each synset entry. Thus, the prior or absolute polarity score for w is computed as follows:

POL_\phi(w) = \arg\max_{c \in C} score_\phi(w)    (11)

where POL_\phi(w) is the maximum polarity score computed with respect to either the positive or the negative category c over all the synset entries.

We compute the likelihood information using a multi-class approach. Given a set of sentiment classes C, the probability of a variable belonging to its “first” observed sentiment class is calculated as a joint probability of independently observing the variable in its first observed sentiment class (i.e. negative and positive) and every other sentiment class, C1 ... Cn. Thus, the likelihood information is computed as follows:

p(x_1, \ldots, x_C | D) = \prod_{c=1}^{C} p(x_c | D)    (12)

where p(x_c | D) is the probability of a variable x belonging to a class c given the data D.

Given the data, our aim is to minimise the effect of the variables which might have appeared in a wrong (false positive) class as a result of the thwarted expectation suggested in [20], thereby biasing the dependency structure. Common examples are negation and objective words such as not and After, as illustrated with Figure 2.
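To summarise how these quantities fit together, the minimal sketch below combines the SentiWordNet prior of Equations 10–11 with the multi-class likelihood of Equation 12 through Bayes' rule (Equation 9). The mocked synset scores, helper names and numbers are illustrative only; a real implementation would query SentiWordNet directly. The worked example that follows applies the same steps to the word “not”.

```python
# Illustrative sketch of the local probability used in the CMI (Equation 7):
# a SentiWordNet-style prior (Equations 10-11) is combined with the
# multi-class likelihood (Equation 12) through Bayes' rule (Equation 9).
# The "synsets" table below merely stands in for a real SentiWordNet lookup,
# and all names and values are illustrative.

synsets = {
    # word -> one (positive, negative) score pair per synset entry
    "not":   [(0.0, 0.625)],
    "great": [(0.75, 0.0), (0.25, 0.625)],
}

def polarity_prior(word):
    """Equations 10-11: average each polarity over the synset entries,
    then keep the larger of the two averages as the absolute prior."""
    entries = synsets.get(word, [])
    if not entries:
        return 0.0  # unknown words (e.g. brand names) stay neutral
    pos = sum(p for p, _ in entries) / len(entries)
    neg = sum(n for _, n in entries) / len(entries)
    return max(pos, neg)

def likelihood(per_class_probs):
    """Equation 12: joint probability of observing the word in every class."""
    result = 1.0
    for p in per_class_probs.values():
        result *= p
    return result

def posterior(word, per_class_probs):
    """Equation 9, reading the evidence p(D) as the sum of the per-class
    observation probabilities, as described in the text."""
    evidence = sum(per_class_probs.values())
    return likelihood(per_class_probs) * polarity_prior(word) / evidence

# A word observed with probability 0.8 in the positive subset and 0.3 in the
# negative subset, with its prior taken from the mocked synset scores above.
print(posterior("great", {"positive": 0.8, "negative": 0.3}))
```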
If the word “not" for example, has a probability of 0.723 in a first ob- served “positive" class and a probability of 0.496 in the other negative class, then its likelihood of actually belong- ing to the “positive" class would be 0.359. Note that each probability is independent in this case as both probabilities do not sum to 1. Learning Sentiment Dependent Bayesian Network. . . Informatica 40 (2016) 225–235 231 In addition, the prior or natural sentiment score (see Equation 11) obtained from SentiWordNet regulates the likelihood further, ensuring that the probability of a vari- able belonging to its first observed class is also conditioned on the natural sentiment class of the word which is indepen- dent of the data. With variable not having a probability of 0.625 negative from SentiWordNet, the posterior Bayesian probability is 0.149. This means the probability of the vari- able belonging to the negative class is higher (i.e. 0.85), and thus, should not be allowed to have strong dependency on a “true positive" variable. We suggest that this tech- nique is more reliable than using the highest probability from both classes at the expense of accuracy (e.g. using only 0.723 and without the prior). Thus, using the Bayesian probability defined in Equation 9, we substitute the likelihood information p(x1, ..., xC |D) to p(D|θ) and the prior information POLφ(w) to p(θ). Note that P (D) is the sum of the two independent prob- abilities used in the likelihood (i.e. 0.723 and 0.496). 4.2 Sentiment dependency score Having computed the local probabilities P (xi) and P (xj) using the Bayesian probability approach, we compute the conditional mutual information as the dependency infor- mation between pair of variables in each class. Thus, we store the dependency information in the sentiment depen- dent score table, SDST. Again, the SDST is similar to the conventional CPT. The obvious difference is that sentiment information have been used to generate SDST. However, since we are using conditional mutual information to com- pute dependencies between variables, certain dependency threshold needs to be met in order to generate a reliable sentiment dependencies between each pair of variables in the SDST. As mentioned earlier, [28] suggested that the mean of the total cross-entropy (mutual information) error is asymptotically proportional to logN2N . Using that MDL principle, we defined the threshold value as follows: Θxi,xj = logNc 2Nc (13) where Θxi,xj is the sentiment dependency threshold be- tween a pair of variables xi and xj , Nc is the size of the data for a particular training class. Note that we gener- ated individual SDST for each sentiment class in our data. In this work, a pair of variables xi and xj have strong sentiment dependency and get stored into the appropriate SDST, if and only if, the conditional mutual information CMI(Xi, Xj |C) > Θxi,xj . Otherwise, we store a zero value to the corresponding slot in the SDST. CMI values greater than zero in the SDST is then used to build the re- sulting sentiment-dependent network structure for the ML task. The process of generating SDST is shown in Algo- rithm 1. Algorithm 4 SDST(D) Input : A set of labelled instances D. Output : A set of Sentiment Dependent Score Tables for all pairs of variables xi and xj . Steps 1: Create a multi-class structure that partitions instances D into subsets of classes Dc. 2: SDST1,...,C = empty. 
3: for each subset Dc in D do 4: Compute the local probabilities P (xi) and P (xj) with sentiment dependent information as in Equa- tion 9. 5: Use the local probabilities to compute CMI for each pair of variables xi and xj using Equation 7. 6: Compute the MDL threshold Θ with Equation 13. 7: if CMI > MDL threshold Θxi,xj then 8: Store the CMI into SDSTc columns xi, xj and xj , xi, respectively. 9: else 10: Store 0 into SDSTc columns xi, xj and xj , xi, respectively. 11: end if 12: end for 13: Return SDST1,...,C for a full Bayesian Network clas- sifier 5 Experiments and results We conducted set of experiments using the proposed SDBN algorithm on 8 different product review domains. We then compared the accuracy with a state-of-the-art sen- timent classification technique. 5.1 Datasets and baselines Our datasets consist of Amazon online product reviews 3 that were manually crawled by [35]. These include Health, Kitchen, Software, Video, Books, DVD, Electronics, and Grocery reviews. Each product domain consists of 1000 positive reviews and 1000 negative reviews, hence each do- main has 2000 balanced set of instances. Note that 60% training and 40% testing sets were used on all domains. As baseline, we implemented the popular sentiment clas- sification technique in [10] as a traditional sentiment classi- fication benchmark on product reviews. The baseline tech- nique produced subjective portions of reviews from our datasets and were used with NB and the ordinary BN clas- sifiers. Baseline-NB denotes the baseline using NB clas- sifier and Baseline-BN represents the baseline with ordi- nary BN classifier on the subjective portions of reviews, re- spectively. We performed a grid search with 10-fold cross- validation for the three algorithms and observed that both SDBN and Baseline-BN gave best accuracies using Sim- pleEstimator with α = 0.5 and K2 search algorithm with 3http://www.cs.jhu.edu/ mdredze/datasets/sentiment/ 232 Informatica 40 (2016) 225–235 S.O. Orimaye et al. the Bayes/K2 scoring function, while NB performed better with the Kernel Estimator. 5.2 Data preparation We implemented our algorithm within the weka.classifiers.bayes package of the WEKA4 data mining framework. The SentiWordNet library5 including the lexicon file were also incorporated into the same WEKA directory. Further, we prepared our datasets according to the WEKA’s ARFF format by using the term frequency-inverse document frequency (TF-IDF) from the positive and negative reviews for each domain. Our implementation also uses the discretization algorithm within WEKA in order to reduce the classification error. This technique also correct the missing values within the training data. 5.3 Attribute selection We evaluated the performance of the SDBN with reduced attribute sets since attribute selection tends to improve BN’s accuracy [16]. Thus, we ranked and reduced the set of attributes for each of our dataset by using the “Attribute- Selection" filter in Weka. Specifically, we used the Info- GainAttributeEval evaluator with the ranker search algo- rithm to select from the top-10 ranked attributes to the top- 1000 ranked attributes for each domain. Our experiment showed better result with the top-ranked 100 attributes. 5.4 Results As emphasised in Table 1, we observed the proposed SDBN to have improved and sometimes comparable per- formance with the baseline classifiers. SDBN recorded bet- ter improvements on the Health and Kitchen domains. 
We also note that the accuracies on the Amazon video reviews seem to be lower than the accuracies reported on the IMDb movie reviews by [10]. We suggest that this is a trade-off in sentiment classification on different datasets and/or domains, as can be observed in our experiments on the different Amazon domains. This could be investigated further in our future work. We believe that an increased dataset size, beyond the limited 1000 Amazon reviews, could further improve the accuracy of the SDBN classifier.

We also performed experiments using the SDBN with the top-10, top-20, top-30, top-50, and top-100 attributes alone, as shown in Table 2. The results show that the performance of the SDBN increased steadily up to the best accuracy, given by the top-100 attributes. We believe this is an indication that the performance of the SDBN could increase with much larger datasets. In addition, we compared the performance of the top-10 to top-100 attributes for SDBN and the two baselines on the top three domains with better accuracy: Health, Kitchen, and Video. Figures 3, 4, 5, and 6 show consistent improvement with increasing attributes for the SDBN compared to the baselines.

4http://www.cs.waikato.ac.nz/ml/weka/
5http://sentiwordnet.isti.cnr.it/download.php

Table 1: Accuracies of SDBN and baseline classifiers on Amazon product reviews.

Dataset      SDBN    Baseline-BN   Baseline-NB
Health       80.2%   78.9%         78.6%
Kitchen      82.3%   80.5%         81.8%
Software     78.7%   78.6%         78.7%
Video        75.4%   75.2%         74.9%
Books        77.7%   77.5%         77.3%
DVD          77.6%   77.6%         77.5%
Electronic   79.9%   79.8%         79.7%
Grocery      86.7%   86.7%         86.5%

Table 2: Accuracies of SDBN on top-10 to top-100 attributes.

Dataset      Top-10   Top-20   Top-30   Top-50   Top-100
Health       67.4%    70.7%    73.4%    76.9%    80.2%
Kitchen      71.4%    74.8%    77.2%    79.0%    82.3%
Software     69.7%    73.4%    75.4%    76.1%    78.7%
Video        63.5%    68.6%    72.3%    75.1%    75.4%
Books        68.4%    71.1%    73.6%    75.3%    77.7%
DVD          68.1%    72.3%    73.8%    75.7%    77.6%
Electronic   67.0%    70.1%    70.5%    75.8%    79.9%
Grocery      69.9%    75.9%    80.0%    83.0%    86.7%

As shown in Table 3, we also performed experiments using SDBN with the other scoring functions described in Section 3.2, on the top-100 attributes, which gave the best accuracy. Our observation is that those scoring functions did not improve the results for SDBN beyond the Bayes/K2 scoring function used in the earlier experiments. This is consistent with the comparative study on BN scoring functions conducted in [15]. Overall, we have observed the SDBN classifier to have reasonable performance, which shows a promising research pathway for using the Bayesian Network as a competitive alternative classifier for sentiment classification tasks.

Figure 3: SDBN vs. baselines for top-10 to top-100 attributes on Health domain.

Figure 4: SDBN vs. baselines for top-10 to top-100 attributes on Kitchen domain.

Table 3: Accuracies of SDBN with different scoring functions on Amazon product reviews.
Dataset      K2      BDeu    Entropy   AIC     MDL
Health       80.2%   76.3%   76.3%     76.1%   76.1%
Kitchen      82.3%   75.3%   73.4%     75.2%   75.2%
Software     78.7%   73.1%   73.1%     73.1%   73.1%
Video        75.4%   73.5%   74.4%     72.8%   73.5%
Books        77.7%   70.9%   70.1%     71.1%   71.1%
DVD          77.6%   71.8%   70.0%     71.7%   71.7%
Electronic   79.9%   77.0%   76.2%     76.2%   77.2%
Grocery      86.7%   79.2%   80.0%     80.2%   79.3%

There are, however, some limitations in the use of SentiWordNet. For example, because many domain-specific or technical terms (e.g. brand names) are used in the product reviews, the sentiment priors of those terms returned zero (i.e. neutral), as they are neither negative nor positive. This might have affected the sentiment dependencies within the network structure.

Figure 5: SDBN vs. baselines for top-10 to top-100 attributes on Video domain.

Figure 6: SDBN vs. baselines for top-10 to top-100 attributes on Books domain.

6 Conclusion

In this study, we have proposed a sentiment-dependent Bayesian network (SDBN) classifier. The proposed SDBN uses a multi-class approach to compute sentiment dependencies between pairs of variables by using a joint probability from different sentiment evidences. Thus, we calculated a sentiment dependency score that penalizes existing BN scoring functions and derived a sentiment dependency network structure using the conditional mutual information between each pair of variables in a dataset. We performed sentiment classification on eight different Amazon product domains with the resulting network structure. Experimental results show that the proposed SDBN has comparable, and in some cases improved, accuracy compared with state-of-the-art sentiment classifiers. In the future, we will experiment with SDBN on the large-scale Amazon SNAP datasets and the Hadoop platform.

Acknowledgement

The authors would like to thank the anonymous reviewers for their constructive comments.

References

[1] H. Tang, S. Tan, and X. Cheng, “A survey on sentiment detection of reviews,” Expert Systems with Applications, vol. 36, no. 7, pp. 10760–10773, 2009.

[2] B. Liu, “Sentiment analysis and opinion mining,” Synthesis Lectures on Human Language Technologies, vol. 5, no. 1, pp. 1–167, 2012.

[3] E. Cambria and A. Hussain, Sentic Computing: Techniques, Tools, and Applications. Springer Science & Business Media, 2012, vol. 2.

[4] E. Cambria, B. Schuller, Y. Xia, and C. Havasi, “New avenues in opinion mining and sentiment analysis,” IEEE Intelligent Systems, no. 2, pp. 15–21, 2013.

[5] E. Cambria, D. Olsher, and D. Rajagopal, “SenticNet 3: a common and common-sense knowledge base for cognition-driven sentiment analysis,” in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014, pp. 1515–1521.

[6] G. F. Cooper and E. Herskovits, “A Bayesian method for the induction of probabilistic networks from data,” Machine Learning, vol. 9, no. 4, pp. 309–347, 1992.

[7] N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian network classifiers,” Machine Learning, vol. 29, pp. 131–163, 1997, 10.1023/A:1007465528199. [Online]. Available: http://dx.doi.org/10.1023/A:1007465528199

[8] J. Cheng and R. Greiner, “Learning Bayesian belief network classifiers: Algorithms and system,” in Advances in Artificial Intelligence. Springer, 2001, pp. 141–151.

[9] X.-W. Chen, G. Anantha, and X.
Lin, “Improving Bayesian network structure learning with mutual information-based node ordering in the K2 algorithm,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 628–640, 2008.

[10] B. Pang and L. Lee, “A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts,” in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. Barcelona, Spain: Association for Computational Linguistics, 2004, p. 271.

[11] E. Boiy and M.-F. Moens, “A machine learning approach to sentiment analysis in multilingual web texts,” Information Retrieval, vol. 12, no. 5, pp. 526–558, 2009.

[12] J. Su and H. Zhang, “Full Bayesian network classifiers,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 897–904.

[13] J. Cheng and R. Greiner, “Comparing Bayesian network classifiers,” in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1999, pp. 101–108.

[14] D. Heckerman, A Tutorial on Learning with Bayesian Networks. Springer, 2008.

[15] L. M. De Campos, “A scoring function for learning Bayesian networks based on mutual information and conditional independence tests,” The Journal of Machine Learning Research, vol. 7, pp. 2149–2187, 2006.

[16] E. Airoldi, X. Bai, and R. Padman, “Markov blankets and meta-heuristics search: Sentiment extraction from unstructured texts,” in Advances in Web Mining and Web Usage Analysis. Springer, 2006, pp. 167–187.

[17] X. Bai, “Predicting consumer sentiments from online text,” Decision Support Systems, vol. 50, no. 4, pp. 732–742, 2011.

[18] A. Esuli, “Automatic generation of lexical resources for opinion mining: models, algorithms and applications,” SIGIR Forum, vol. 42, no. 2, pp. 105–106, 2008.

[19] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[20] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classification using machine learning techniques,” in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10. Association for Computational Linguistics, 2002, pp. 79–86.

[21] H. Cui, V. Mittal, and M. Datar, “Comparative experiments on sentiment classification for online product reviews,” American Association for Artificial Intelligence (AAAI), 2006.

[22] D. Bespalov, B. Bai, Y. Qi, and A. Shokoufandeh, “Sentiment classification based on supervised latent n-gram analysis,” in Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2011, pp. 375–382.

[23] S. Poria, E. Cambria, G. Winterstein, and G.-B. Huang, “Sentic patterns: Dependency-based rules for concept-level sentiment analysis,” Knowledge-Based Systems, vol. 69, pp. 45–63, 2014.

[24] A. Esuli and F. Sebastiani, “SentiWordNet: A publicly available lexical resource for opinion mining,” Proceedings of LREC, 2006.

[25] F. Glover, M. Laguna et al., Tabu Search. Springer, 1997, vol. 22.

[26] W. Chen, L. Zong, W. Huang, G. Ou, Y. Wang, and D. Yang, “An empirical study of massively parallel Bayesian networks learning for sentiment extraction from unstructured text,” in Web Technologies and Applications. Springer, 2011, pp. 424–435.

[27] J. Cheng, D. A. Bell, and W.
Liu, “Learning belief networks from data: An information theory based approach,” in Proceedings of the Sixth International Conference on Information and Knowledge Management. ACM, 1997, pp. 325–331.

[28] N. Friedman and Z. Yakhini, “On the sample complexity of learning Bayesian networks,” in Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1996, pp. 274–282.

[29] C. F. Aliferis, A. Statnikov, I. Tsamardinos, S. Mani, and X. D. Koutsoukos, “Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation,” The Journal of Machine Learning Research, vol. 11, pp. 171–234, 2010.

[30] R. R. Bouckaert, Bayesian Network Classifiers in Weka. Department of Computer Science, University of Waikato, 2004.

[31] W. Buntine, “Theory refinement on Bayesian networks,” in Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1991, pp. 52–60.

[32] D. Heckerman, D. Geiger, and D. M. Chickering, “Learning Bayesian networks: The combination of knowledge and statistical data,” Machine Learning, vol. 20, no. 3, pp. 197–243, 1995.

[33] C. Chow and C. Liu, “Approximating discrete probability distributions with dependence trees,” IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 462–467, 1968.

[34] P. M. Lee, Bayesian Statistics: An Introduction. John Wiley & Sons, 2012.

[35] J. Blitzer, M. Dredze, and F. Pereira, “Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification,” Association for Computational Linguistics (ACL), 2007.