Informatica 40 (2016) 225–235 225 Learning Sentiment Dependent Bayesian Network Classifier for Online Product Reviews Sylvester Olubolu Orimaye, Zi Yang Pang and Alvino Mandala Putra Setiawan School of Information Technology, Monash University Malaysia E-mail: sylvester.orimaye@monash.edu Keywords: sentiment classification, sentiment-dependent, Bayesian network, product reviews Received: September 17, 2015 Analyzing sentiments for polarity classification has recently gained attention in the literature with dif- ferent machine learning techniques performing moderately. The challenge is that sentiment-dependent information from multiple sources are not considered often in existing sentiment classification techniques. In this study, we propose a logical approach that maximizes the true sentiment class probabilities of the popular Bayesian Network for a more effective sentiment classification task using the individual word sen- timent scores from SentiWordNet. We emphasize on creating dependency networks with quality variables by using a sentiment-dependent scoring technique that penalizes the existing Bayesian Network scoring functions such as K2, BDeu, Entropy, AIC and MDL. The outcome of this technique is called Sentiment Dependent Bayesian Network. Empirical results on eight product review datasets from different domains suggest that a sentiment-dependent scoring mechanism for Bayesian Network classifier could improve the accuracy of sentiment classification by 2% and achieve up to 86.7% accuracy on specific domains. Povzetek: Razvit je nov Bayesov klasifikator na osnovi mnenjske odvisnosti, uporabljen za ocenjevanje spletnih produktov. 1 Introduction Sentiment Classification (SC) has recently gained a lot of attention in the research community [1, 2, 3]. More re- cently, SC has moved from the commonly used bag-of- words models to bag-of-concepts models [4, 5]. This is due to its increasing demand for the analysis of consumer sentiments on products, topic and news related text from social media such as Twitter1 and online product reviews such as Amazon2. In the same manner, Bayesian Network (BN)[6] also known as Bayesian Belief Network plays a major role in Machine Learning (ML) research for solv- ing classification problems. Over the last decade, learn- ing BNs has become an increasingly active area of ML re- search where the goal is to learn a network structure us- ing dependence or independence information between set of variables [6, 7, 8, 9]. The resulting network is a directed acyclic graph (DAG), with a set of joint probability distri- butions, where each variable of the network is a node in the graph and the arcs between the nodes represent the prob- ability distribution that signifies the level of dependency between the nodes. While it is more common to use other ML algorithms for SC tasks [10, 11], few research papers have proposed BN as a competitive alternative to other popular ML algorithms. Considering the huge size of data available from social me- dia and the level of difficulty attached with analysing sen- 1https://twitter.com/ 2https://amazon.com/ timents from natural language texts, the ability of BN to learn dependencies between words and their correspond- ing sentiment classes, could undoubtedly produce a better classifier for the sentiment classification task. This paper focusses on constructing a BN classifier that uses sentiment information as one important factor for determining depen- dency between network variables. 
BN has been successfully applied to solve different ML problems with its performance outweighing some of the popular ML algorithms. For example, in [12], a full Bayesian Network classifier (FBC) showed statistically significant improvement on state-of-the-art ML algorithms with the 33 UCI datasets. In the case of SC, Naïve Bayes (NB), which is a special case of BN [13], and one of the leading ML algorithms for SC tasks [10], has surprisingly and repeatedly shown improved performance on movie and product reviews despite its conditional independence as- sumption. By comparative study, we show that a Senti- ment Dependent Bayesian Network (SDBN) classifier has improved performance on popular review datasets such as Amazon product reviews due to the ability of the Bayesian Network to construct a network structure of multiple de- pendencies [12]. Constructing a BN classifier requires learning a network structure with set of Conditional Probability Tables (CPTs) [6]. Basically, there are two combined steps involved in the BN construction process. The first is to perform vari- able search on a search space, and the other is to score each variable based on the degree of fitness [14]. The chal- 226 Informatica 40 (2016) 225–235 S.O. Orimaye et al. lenge however, is to ensure that good networks are learned with appropriate parameters using a scoring or fitness func- tion to determine network variables from the given dataset. Thus, much of the research works on BN focus on devel- oping scoring functions for the BN classifier [15]. We ar- gue that such scoring functions rely on many assumptions that make them less effective for SC tasks. For example, K2 algorithm, which is based on Bayesian Scoring func- tion relies on the assumptions of parameter independence and assigning a uniform prior distribution to the param- eters, given the class [9]. We believe these assumptions lead to many false positives in the classification results as sentiment classes are better captured by conditional depen- dency between words, rather than independent word counts [16, 17]. We also suggest that varying prior distribution could be assigned to each variable since each word has a nat- ural prior probability of belonging to a particular senti- ment class, independent of the data. For example, the word “good” is naturally positive and “bad” is naturally nega- tive. Thus, in this work, we propose a sentiment scoring function that leverage sentiment information between vari- ables in the given data. The output of the sentiment scor- ing function is then used to augment existing BN scoring functions for better performance. Our aim is to ensure sen- timent information form part of the fitness criteria for se- lecting network variables from sentiment-oriented datasets such as reviews. The proposed scoring function uses a multi-class ap- proach to compute the conditional mutual information us- ing sentiment-dependent information between local vari- ables in each class of instances. The conditional mutual in- formation for all classes are then penalized using the Mini- mum Description Length (MDL) principle. The local prob- abilities used in computing the conditional mutual infor- mation is computed using the popular Bayesian probability that uses the prior probability of a variable belonging to a natural sentiment class (i.e. independent of the given data by using individual word sentiment score from SentiWord- Net [18]) and the observation of the variable in the selected class of instances and other classes in the dataset (e.g. 
pos- itive and negative). For example, the class probability of each word in a product review is augmented with the po- larity probability of the same word from SentiWordNet. The technique takes into account that the network structure would depend on the following criteria: – The posterior probability from multiple evidences that variables xi and xj have sentiment dependency; – The conditional mutual information between the vari- ables for all sentiment classes; – The dependency threshold computed using the Mini- mum Description Length principle; and – The representation of the network as a full Bayesian Network as proposed in [12]. The importance of the first criterion is that we are able to avoid the independence assumption made by the existing BN scoring functions. We capture local sentiment depen- dency between the variables as a joint probability of evi- dences from each variable and each class in the given data. Also, existing BN scoring functions uses the conditional independence given the data as a whole for determining de- pendencies between variables [15, 9]. Under such approach in SC, two co-occurring words may occur with the same or similar frequencies in different classes. We observed that training BN classifier without penalizing such occurrences or dependencies, could affect the classifier decision to de- cide an appropriate sentiment class. As such, our second criterion captures the conditional mutual information be- tween the variables, while the third criterion ensures that a BN classifier uses quality variables that are above the com- puted threshold. The latter also allow us to enforce strict d-separation policy between the network variables, which formally defines the process of determining independence between variables [19]. Thus, only quality variables are used to form the dependency network for the BN classifier. Finally, we introduce the last criterion as an improvement over a network constructed based on the first three criteria. The full Bayesian Network technique ensures that an in- dependent sentiment dependent network is constructed for each sentiment category (i.e. negative and positive). Section 2 of this paper discusses related work and ad- ditional motivations. In Section 3, we explain the prob- lem background and then present the proposed sentiment- dependent technique in Section 4. Our experiment is de- scribed in Section 5. Finally, Section 6 gives the conclu- sion to our study and some thoughts on future research di- rections. 2 Related work 2.1 Sentiment classification (SC) The most prominent of SC work is perhaps [20] which used supervised machine learning techniques for the po- larity classification of positive and negative sentiments in movie reviews. As a result of that work, different research directions within the field of sentiment analysis and opin- ion mining have been actively pursued [2, 3, 4, 5]. [10] proposed a subjectivity summarization technique, which uses minimum cuts to classify sentiment polarities in movie reviews. The technique identifies and extracts sub- jective portions of review documents using minimum cuts in graphs. The minimum cut approach takes into consider- ation, the pairwise proximity information supplied through graph cuts that partition sentences which are likely to be in the same sentiment class. This approach gave better im- provement from 82.8% to 86.4% on the subjective portion of review documents. 
The approach also gave similar bet- ter improvement when only 60% portion of a product re- view document is used compared to an entire review. In our work, we propose a classification technique that uses Learning Sentiment Dependent Bayesian Network. . . Informatica 40 (2016) 225–235 227 the entire portion of each product review. [21] performed classification of approximately 200K product reviews by using different machine learning algo- rithms. More importantly, the work investigated the signif- icance of higher order n-gram language model (n ≥ 3) in classifying sentiments from product reviews with an F1 of 90%. While Cui’s work was performed on random web- sites for online products with limited product domains, we performed experiments on Amazon product reviews with 8 different product domains. Similarly, [22] performed experiment with higher or- der n-gram features to train a Neural Network model on Amazon and TripAdvisor datasets with the best average er- ror rates of 7.12 and 7.37, respectively. In contrast, our work investigate the performance of a sentiment-dependent Bayesian Network classifier on the Amazon datasets with different product domains. More recently, [23] proposed a concept-level sentiment analysis technique, which uses the knowledge-based sen- tic computing technique that has recently gained attention within the sentiment analysis domain [3, 4, 5]. Because the sentic computing knowledge base sometimes omits vi- tal sentiment discourse, [23] combines low-level linguistic features with the sentic computing technique to train ma- chine learning model for polarity detection. In our work, the only knowledge base employed is SentiWordNet [24], which computes the natural polarity values of words rather than concept. Thus, we captured the dependencies between words by constructing and learning a Bayesian Network for sentiment classification. A detailed review of other recent sentiment classification techniques on different datasets can be found in [1, 2, 3, 4, 5]. 2.2 BN classifiers for sentiment classification [16] and [17] proposed a two-stage Markov Blanket Clas- sifier (MBC) approach to extract sentiments from unstruc- tured text such as movie reviews by using BN. The ap- proach learns conditional dependencies between variables (words) in a network and finds the portion of the network that falls within the Markov Blanket. The Tabu Search al- gorithm [25], is then used to further prune the resulting Markov Blanket network for higher cross-validated accu- racy. Although Markov Blanket has shown to be effective in avoiding over-fitting in BN classifiers [7], the MBC ap- proach finds sentiment dependencies based on the ordinary presence or absence of words in their original sentiment class only. We identify sentiment dependencies by consid- ering multiple sources of evidence. These include multiple sentiment classes in the data and the natural sentiment class of each variable which is independent of its sentiment class in the given data. Similarly, [26] proposed a parallel BN learning algo- rithm using MapReduce for the purpose of capturing senti- ments from unstructured text. The technique experimented on large scale blog data and captures dependencies among words using mutual information or entropy, with the hope of finding a vocabulary that could extract sentiments. The technique differs from [17] by using a three-phase (draft- ing, thickening and thinning) dependency search technique that was proposed in [27]. 
Other than using mutual infor- mation in the drafting phase of the search technique, the work did not capture additional sentiment dependencies us- ing other source of evidence. Again, we do not focus on developing a search algorithm but a scoring technique that considers multiple sentiment- dependent information as part of the existing state-of-the- art scoring functions. 3 Problem background 3.1 Bayesian network (BN) A Bayesian Network N is represented as a graphical dis- tribution of the joint probability between a set of random variables [28]. The network has two components: (1) a DAG G = (Rn,Mr) that represents the structural arrange- ment of a set of variables (nodes) Rn = {x1, ..., xn} and a corresponding set of dependence and independence asser- tions (arcs) Mr between the variables; (2) a set of condi- tional probability distributions P = {pi, ..., pn} between the parent and the child nodes in the graph. In the DAG component, the existence of an directed arc between a pair of variables xi and xj asserts a conditional dependency between the two variables [8]. The directed arc can also be seen to represent causality between one variable and the other [29], that is, variable xy is an ex- istential cause of variable xz , hence xy → xz . The absence of an directed arc between a pair of variables, however, represents a conditional independence, such that, given a subset U of variables from Rn, the degree of information about variable xi does not change by knowing xj , thus I(xi, xj |U). This also implies that p(xi|xj , U) = p(xi|U). The parent(s) of variable xi ∈ Rn is denoted by a set paG(xi) = xj ∈ Rn|xj → Xi ∈Mr, and paG(xi) = ∅ for the root node. The conditional probability distributions of the DAG G is represented by its CPT, which contains a set of numerical parameters for each variable xi ∈ Rn. These numeric pa- rameters are computed as the probability of each variable given the set of parents, p(xi|paG(Xi)). Over the set of variables in Rn, the joint probability for the BN is there- fore obtained as follows: p(x1, ..., xn) = ∏ Xi∈Rn p(xi|paG(xi)) (1) Thus, for a typical classification task, the BN classifier would learn the numerical parameters of a CPT from the DAG structure G, by estimating some statistical informa- tion from the given data. Such information include, mu- tual information (MI) between the variables and chi-square distribution [15]. The former is based on the local score 228 Informatica 40 (2016) 225–235 S.O. Orimaye et al. metrics approach and the latter exhibits conditional inde- pendence tests (CI) approach. For both approaches, dif- ferent search algorithms are used to identify the network structure. The goal is to ascertain, according to one or more search criteria, the best BN that fits the given data by evaluating the weight of the arc between the variables. The criteria for evaluating the fitness of the nodes (vari- ables), and the arcs (parameters) in the BN search algo- rithms, are expressed as fitting or scoring functions within the BN classifier [15]. Our goal is to ensure that those cri- teria include sentiment-dependent information between the variables. We will focus on penalizing existing local score metrics with our sentiment-dependent scoring function for the BN classifiers, hence the SDBN proposed in this paper. 
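To make the factorisation in Equation 1 concrete, the following minimal Python sketch computes the joint probability of a toy network from its CPTs. It is only an illustration: the node names, structure and probability values are invented for the example and are not taken from the review datasets or from the classifier implementation.

```python
# A toy illustration of Equation 1 (names, structure and CPT values are
# invented for the example): the joint probability of a Bayesian Network
# factorises into the product of each node's probability given its parents.

# Hypothetical structure: class -> good, class -> bad, good -> recommend
parents = {
    "class": [],
    "good": ["class"],
    "bad": ["class"],
    "recommend": ["good"],
}

# Conditional probability tables: (node value, tuple of parent values) -> probability
cpt = {
    "class":     {(1, ()): 0.5, (0, ()): 0.5},
    "good":      {(1, (1,)): 0.7, (0, (1,)): 0.3, (1, (0,)): 0.2, (0, (0,)): 0.8},
    "bad":       {(1, (1,)): 0.1, (0, (1,)): 0.9, (1, (0,)): 0.6, (0, (0,)): 0.4},
    "recommend": {(1, (1,)): 0.8, (0, (1,)): 0.2, (1, (0,)): 0.3, (0, (0,)): 0.7},
}

def joint_probability(assignment):
    """Equation 1: p(x1, ..., xn) = prod_i p(xi | pa(xi))."""
    prob = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[node])
        prob *= cpt[node][(value, parent_values)]
    return prob

# p(class=1, good=1, bad=0, recommend=1) = 0.5 * 0.7 * 0.9 * 0.8 = 0.252
print(joint_probability({"class": 1, "good": 1, "bad": 0, "recommend": 1}))
```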
The local score metrics are of particular interest to our sentiment classification task because they exhibit a practi- cal characteristic that ensures the joint probability of the BN is decomposable to the sum (or product) of the indi- vidual probability of each node [28][15]. To the best of our knowledge, very few research papers have considered sentiment-dependent information, as part of the fitness cri- teria for capturing dependency between the variables, espe- cially for product reviews on different domains. 3.2 BN scoring functions We focus on the local score metrics functions, K2, BDeu, Entropy, AIC and MDL [15]. The functions define a fitness score, and a specified search algorithm searches for the best network that maximizes the score. Each of these functions identifies frequencies of occurrence of each variable xi in the data D and a network structure N . Although the per- formance of these scoring functions may vary on different datasets [15], in this paper, we assume that the scores gen- erated by the scoring functions are somehow naïve, thus, we attempt to mitigate its effect on SC tasks. First, we will define the parameters that are common to all the functions. We will then describe each of the functions with their asso- ciated formula and specify their limitations to the SC tasks. Similar to [30], we use ri(1 ≤ i ≤ n) to denote the size or cardinality of xi. pa(xi) represents the parents of xi and the cardinality of the parent set is represented by qi = ∏ xj∈pa(xi) rj . If pa(xi) is empty (i.e. pa(xi) = ∅), then qi = 1. The number of instances in a dataset D, where pa(xi) gets its jth value is represented by Nij(1 ≤ i ≤ n, 1 ≤ j ≤ qi). Similarly, Nijk(1 ≤ i ≤ n, 1 ≤ j ≤ qi, 1 ≤ k ≤ ri) represents the portion of D where pa(xi) gets its jth value and xi gets its kth value such that Nij = ∑ri k=1Nijk. Obviously, N represents the size of D. K2: This metric is a type of Bayesian scoring function proposed by [6]. The function relies on series of assump- tions such as parameter independence and uniform prior probability for the network. We reiterate that instead of in- dependent word counts, the sentiments expressed in a given data are better captured using conditional dependency be- tween words and their related sentiment classes [16]. The K2 metric is defined as follows: Sk2(N,D) = P (N) n∏ i=0 qi∏ j=1 (ri − 1)! (ri − 1 +Nij)! ri∏ k=1 Nijk! (2) BDeu: The metric was proposed by [31] as a general- ization of K2. It resulted from Bayesian Drichlet (BD) and BDe which were proposed by [32]. The BD is based on hyperparameters ηijk and the BDe is a result of BD with additional assumptions. BDeu relies on the sample size η as the single parameter. Since BDeu is a generalization of K2, it carries some of our concerns expressed on K2 earlier. Most importantly, the uniform prior probability assigned to each variable xi ∈ pa(xi) could be replaced by the proba- bility of the variable belonging to a natural sentiment class as stated earlier. We suggest that this is likely to increase the accuracy of the sentiment classifier especially on sparse data distribution. We define the BDeu metric as follows: SBDeu(N,D) = P (N) n∏ i=0 qi∏ j=1 (Γ( ηqi ) Γ(Nij + η qi ) ri∏ k=1 Γ(Nijk + η riqi) Γ( ηriqi ) (3) Note that the function Γ(.) is inherited from BD, and Γ(c) = ∫∞ 0 e−uuc−1du [15]. Entropy: Entropy metric measures the distance between the joint probability distributions of the network [15]. 
This allows dependency information to be identified by computing the mutual information (or entropy) between pairs of variables. Thus, a minimized entropy between a pair of variables denotes a dependency relationship; otherwise, a large entropy implies conditional independence between the variables [12][32]. While the entropy metric has been successful in measuring dependency information for BN classifiers, the local probabilities involved in the metric are largely computed under the conditional independence assumption given the data (i.e. using frequency counts for independent variables). We suggest that a joint probability of multiple sentiment evidences could improve the metric in BN classifiers for SC tasks. The metric is defined as follows:

H(N, D) = -N \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} \frac{N_{ijk}}{N} \log \frac{N_{ijk}}{N_{ij}}    (4)

AIC: The AIC metric adds a non-negative parameter penalization to the entropy metric [15], which could also be improved by multiple sentiment evidences, as in the case of the entropy metric. The metric is specified as follows:

S_{AIC}(N, D) = H(N, D) + K    (5)

where K is the number of parameters, such that K = \sum_{i=1}^{n} (r_i - 1) \cdot q_i.

MDL: The MDL metric is based on the minimum description length principle, which selects a minimum representative portion of the network variables through coding [15]. The best BN is the one that minimizes the sum of the description lengths for the data. The metric is defined as follows:

S_{MDL}(N, D) = H(N, D) + \frac{K}{2} \log N    (6)

The use of MDL has not been investigated for sentiment classification on its own, except for selecting the dependency threshold between variables in a BN. The study in [28] suggests that the mean of the total cross-entropy error is asymptotically proportional to log N / 2N, which is why the entropy metric is penalized in Equation 6.

In this paper, the proposed sentiment-dependent score function is based on the information theory approach. The approach uses the entropy-based conditional mutual information (CMI) technique to measure the dependencies between the variables. The local probabilities for computing the CMI between two variables are derived as a joint probability resulting from multiple evidences of both variables belonging to the same sentiment class. This is achieved by using a multi-class approach that measures the CMI in each sentiment class. The sum of the CMIs over the data is thereafter penalized using the MDL principle, as suggested in [28].

4 Sentiment-dependent BN

As emphasized earlier, our motivation is to include sentiment information as part of the dependency criteria between the network variables. Similar to [12], we construct a multi-class full Bayesian Network and encode the sentiment information within the CMIs that determine the dependencies within the network.

Figure 1: Structural overview of SDBN.

Figure 1 shows the structure of a SDBN. In this figure, C denotes the class node and the parent of all the variable nodes x1, x2 and x3. The edges between variable x1 and variables x2 and x3 represent dependencies between the variables. The dashed directed lines from each sentiment component S show the contribution of the sentiment information to the dependencies.

The proposed SDBN is created from a sentiment-dependent score table (SDST), similar to the conventional CPT, which contains network parameters from the data.
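Before turning to how the SDST is built, the following minimal sketch illustrates, for reference, how the count statistics defined in Section 3.2 yield the entropy, AIC and MDL scores of Equations 4–6, which are the baseline scores that the sentiment-dependent score is designed to penalize. The data layout, helper names and toy example are illustrative assumptions, not the actual implementation.

```python
import math
from collections import Counter

# Illustrative sketch (not the actual implementation) of the entropy-based
# scores in Equations 4-6. "data" is a list of dicts mapping each discrete
# variable to its value; "parents" gives the candidate parent set of each
# variable; "cardinality" gives r_i for each variable.

def entropy_score(data, parents):
    """H(N, D) from Equation 4: -sum_{i,j,k} N_ijk * log(N_ijk / N_ij)."""
    h = 0.0
    for xi, pa in parents.items():
        n_ij, n_ijk = Counter(), Counter()
        for row in data:
            j = tuple(row[p] for p in pa)   # parent configuration j
            n_ij[j] += 1
            n_ijk[(j, row[xi])] += 1
        for (j, _), count in n_ijk.items():
            h -= count * math.log(count / n_ij[j])
    return h

def num_parameters(parents, cardinality):
    """K = sum_i (r_i - 1) * q_i, where q_i is the product of parent cardinalities."""
    k = 0
    for xi, pa in parents.items():
        q_i = 1
        for p in pa:
            q_i *= cardinality[p]
        k += (cardinality[xi] - 1) * q_i
    return k

def aic_score(data, parents, cardinality):
    """Equation 5: entropy plus the parameter count K."""
    return entropy_score(data, parents) + num_parameters(parents, cardinality)

def mdl_score(data, parents, cardinality):
    """Equation 6: entropy plus (K / 2) * log N."""
    K = num_parameters(parents, cardinality)
    return entropy_score(data, parents) + 0.5 * K * math.log(len(data))

# Toy usage with two binary word-presence variables, w2 depending on w1
data = [{"w1": 1, "w2": 1}, {"w1": 1, "w2": 0}, {"w1": 0, "w2": 0}, {"w1": 0, "w2": 0}]
parents = {"w1": [], "w2": ["w1"]}
cardinality = {"w1": 2, "w2": 2}
print(aic_score(data, parents, cardinality), mdl_score(data, parents, cardinality))
```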
Given a dataset D containing two or more sentiment classes, we divide D into c subsets, where D1 ... Dc represent the sentiment classes present in D. Thus, for each subset Dc, we create a SDST from the given data, and at a later stage we use the values in the SDST to learn a full Bayesian classifier.

Creating an appropriate CPT or SDST is challenging, especially when there is a sheer number of variables in the given data [27]. In fact, local search algorithms such as K2, Hill Climbing, TAN, Simulated Annealing, Tabu Search and Genetic Search have been developed to address this challenge [7]. Thus, we do not intend to repeat the sophisticated local search process in our scoring technique. We use a straightforward approach that computes the CMI as the dependency between a pair of variables, given a subset Dc. The resulting score for each pair of variables is stored in the SDST. Equation 7 computes the CMI for a pair of variables. Note that this process is equivalent to the drafting phase proposed in [27] or the Chow and Liu algorithm in [33]. We can therefore focus on computing the local probabilities P(xi) and P(xj) for the CMI. In this work, each local probability encodes the sentiment dependency information as a joint probability of multiple sentiment evidences. We suggest that the joint probability is better than using the ordinary variable presence or a single frequency count.

CMI(x_i, x_j | C) = \sum_{x_i, x_j, c} P(x_i, x_j, c) \log \frac{P(x_i, x_j, c)}{P(x_i | c) P(x_j | c)}    (7)

Alternatively, the CMI in Equation 7 can be computed with Equation 8, which penalizes the default CMI for a pair of variables, where S_\lambda(x_i, x_j) is the sentiment prior computed from the multiple sentiment evidences for each variable x_i and x_j.

CMI(x_i, x_j | C) = \sum_{x_i, x_j, c} P(x_i, x_j, c) \log \frac{P(x_i, x_j, c)}{P(x_i | c) P(x_j | c)} S_\lambda(x_i, x_j)    (8)

4.1 Local probabilities for CMI

In order to compute the local probabilities P(xi) and P(xj), we adopt Bayesian probability [34] to calculate the joint probability from multiple sentiment evidences. Bayesian probability combines a generative model or likelihood p(D|θ) of the dataset with a prior belief p(θ) to infer a posterior distribution p(θ|D), see Equation 9. The idea is to determine a favourable posterior for a particular variable belonging to its observed class, such that the conditional mutual information between two dependent variables xi and xj increases when the posterior for both variables in the same class is large.

p(\theta | D) = \frac{p(D | \theta) \, p(\theta)}{p(D)}    (9)

However, in sentiment-oriented documents such as product reviews, it is very common to observe variables that belong to different classes in one sentiment class. [20] referred to such a scenario as thwarted expectation. For example, a “positive” review document may contain certain “negative” words used to express dissatisfaction about an aspect of a product, despite some level of satisfaction that the product might offer. With this kind of problem, a dependency network that is learned with ordinary frequency counts of each variable (regardless of the sentiment class) would most likely lead to inaccurate sentiment classifiers.

Figure 2 shows a sample BN resulting from a product review dataset after performing attribute selection. In that network, variable “After” has a 1.0 probability of belonging to the negative and positive classes, respectively.
Similarly, variable “not” has a 0.723 probability of belonging to a “positive” class rather than “negative”. Every other variable in the network has split probabilities between both classes. Our aim is to remove such variables from the dependency network, or at least minimize their influence in the network, such that the quality of the network is improved for sentiment classification.

Figure 2: An example Bayesian network from product reviews.

In this work, we compute the posterior information for each variable by considering its prior information and its joint likelihood, or observation, from all the sentiment classes available in the data.

The prior information is computed using the natural sentiment or polarity scores from SentiWordNet [24]. SentiWordNet gives the polarity scores of the corresponding synsets for each English word. However, the polarity scores are often different for each of the synset entries. A synset contains multiple semantic or polarity interpretations of a given word. Each interpretation has three different polarity values. That is, a synset entry (word) has positive, negative, and neutral polarity scores, which vary depending on the semantic interpretation of the word. An example of such words is “great”. Its fourth synset entry in SentiWordNet has 0.25 positive, 0.625 negative, and 0.125 neutral polarity scores, respectively.

In this work, we focus on the “positive” and “negative” sentiments, thus we only consider the positive and negative polarity scores from SentiWordNet. The challenge, however, is to compute an absolute, single polarity score for each word from its multiple synset entries. First, we compute the score for each polarity independently and then select the polarity with the maximum score. The score for the positive or negative polarity over all synset entries of a given word is computed as follows:

score_\phi(w) = \frac{1}{\epsilon} \sum_{i=1}^{\epsilon} E_c(e_i)    (10)

where score_\phi(w) is the score for each polarity of the given word w, \epsilon is the number of synset entries E for the word, c is the polarity or category (i.e. positive or negative), and e_i is each synset entry. Thus, the prior or absolute polarity score for w is computed as follows:

POL_\phi(w) = \arg\max_{c \in C} score_\phi(w)    (11)

where POL_\phi(w) is the maximum polarity score computed with respect to either the positive or the negative category c over all the synset entries.

We compute the likelihood information using a multi-class approach. Given a set of sentiment classes C, the probability of a variable belonging to its “first” observed sentiment class is calculated as a joint probability of independently observing the variable in its first observed sentiment class (i.e. negative and positive) and every other sentiment class, C1 ... Cn. Thus, the likelihood information is computed as follows:

p(x_1, \ldots, x_C | D) = \prod_{c=1}^{C} p(x_c | D)    (12)

where p(x_c | D) is the probability of a variable x belonging to a class c given the data D.

Given the data, our aim is to minimise the effect of the variables which might have appeared in a wrong (false positive) class as a result of the thwarted expectation suggested in [20], thereby biasing the dependency structure. Common examples are negation and objective words such as not and After, as illustrated with Figure 2.
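To summarise how these quantities fit together, the minimal sketch below combines the SentiWordNet prior of Equations 10–11 with the multi-class likelihood of Equation 12 through Bayes' rule (Equation 9). The mocked synset scores, helper names and numbers are illustrative only; a real implementation would query SentiWordNet directly. The worked example that follows applies the same steps to the word “not”.

```python
# Illustrative sketch of the local probability used in the CMI (Equation 7):
# a SentiWordNet-style prior (Equations 10-11) is combined with the
# multi-class likelihood (Equation 12) through Bayes' rule (Equation 9).
# The "synsets" table below merely stands in for a real SentiWordNet lookup,
# and all names and values are illustrative.

synsets = {
    # word -> one (positive, negative) score pair per synset entry
    "not":   [(0.0, 0.625)],
    "great": [(0.75, 0.0), (0.25, 0.625)],
}

def polarity_prior(word):
    """Equations 10-11: average each polarity over the synset entries,
    then keep the larger of the two averages as the absolute prior."""
    entries = synsets.get(word, [])
    if not entries:
        return 0.0  # unknown words (e.g. brand names) stay neutral
    pos = sum(p for p, _ in entries) / len(entries)
    neg = sum(n for _, n in entries) / len(entries)
    return max(pos, neg)

def likelihood(per_class_probs):
    """Equation 12: joint probability of observing the word in every class."""
    result = 1.0
    for p in per_class_probs.values():
        result *= p
    return result

def posterior(word, per_class_probs):
    """Equation 9, reading the evidence p(D) as the sum of the per-class
    observation probabilities, as described in the text."""
    evidence = sum(per_class_probs.values())
    return likelihood(per_class_probs) * polarity_prior(word) / evidence

# A word observed with probability 0.8 in the positive subset and 0.3 in the
# negative subset, with its prior taken from the mocked synset scores above.
print(posterior("great", {"positive": 0.8, "negative": 0.3}))
```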
If the word “not" for example, has a probability of 0.723 in a first ob- served “positive" class and a probability of 0.496 in the other negative class, then its likelihood of actually belong- ing to the “positive" class would be 0.359. Note that each probability is independent in this case as both probabilities do not sum to 1. Learning Sentiment Dependent Bayesian Network. . . Informatica 40 (2016) 225–235 231 In addition, the prior or natural sentiment score (see Equation 11) obtained from SentiWordNet regulates the likelihood further, ensuring that the probability of a vari- able belonging to its first observed class is also conditioned on the natural sentiment class of the word which is indepen- dent of the data. With variable not having a probability of 0.625 negative from SentiWordNet, the posterior Bayesian probability is 0.149. This means the probability of the vari- able belonging to the negative class is higher (i.e. 0.85), and thus, should not be allowed to have strong dependency on a “true positive" variable. We suggest that this tech- nique is more reliable than using the highest probability from both classes at the expense of accuracy (e.g. using only 0.723 and without the prior). Thus, using the Bayesian probability defined in Equation 9, we substitute the likelihood information p(x1, ..., xC |D) to p(D|θ) and the prior information POLφ(w) to p(θ). Note that P (D) is the sum of the two independent prob- abilities used in the likelihood (i.e. 0.723 and 0.496). 4.2 Sentiment dependency score Having computed the local probabilities P (xi) and P (xj) using the Bayesian probability approach, we compute the conditional mutual information as the dependency infor- mation between pair of variables in each class. Thus, we store the dependency information in the sentiment depen- dent score table, SDST. Again, the SDST is similar to the conventional CPT. The obvious difference is that sentiment information have been used to generate SDST. However, since we are using conditional mutual information to com- pute dependencies between variables, certain dependency threshold needs to be met in order to generate a reliable sentiment dependencies between each pair of variables in the SDST. As mentioned earlier, [28] suggested that the mean of the total cross-entropy (mutual information) error is asymptotically proportional to logN2N . Using that MDL principle, we defined the threshold value as follows: Θxi,xj = logNc 2Nc (13) where Θxi,xj is the sentiment dependency threshold be- tween a pair of variables xi and xj , Nc is the size of the data for a particular training class. Note that we gener- ated individual SDST for each sentiment class in our data. In this work, a pair of variables xi and xj have strong sentiment dependency and get stored into the appropriate SDST, if and only if, the conditional mutual information CMI(Xi, Xj |C) > Θxi,xj . Otherwise, we store a zero value to the corresponding slot in the SDST. CMI values greater than zero in the SDST is then used to build the re- sulting sentiment-dependent network structure for the ML task. The process of generating SDST is shown in Algo- rithm 1. Algorithm 4 SDST(D) Input : A set of labelled instances D. Output : A set of Sentiment Dependent Score Tables for all pairs of variables xi and xj . Steps 1: Create a multi-class structure that partitions instances D into subsets of classes Dc. 2: SDST1,...,C = empty. 
3: for each subset Dc in D do 4: Compute the local probabilities P (xi) and P (xj) with sentiment dependent information as in Equa- tion 9. 5: Use the local probabilities to compute CMI for each pair of variables xi and xj using Equation 7. 6: Compute the MDL threshold Θ with Equation 13. 7: if CMI > MDL threshold Θxi,xj then 8: Store the CMI into SDSTc columns xi, xj and xj , xi, respectively. 9: else 10: Store 0 into SDSTc columns xi, xj and xj , xi, respectively. 11: end if 12: end for 13: Return SDST1,...,C for a full Bayesian Network clas- sifier 5 Experiments and results We conducted set of experiments using the proposed SDBN algorithm on 8 different product review domains. We then compared the accuracy with a state-of-the-art sen- timent classification technique. 5.1 Datasets and baselines Our datasets consist of Amazon online product reviews 3 that were manually crawled by [35]. These include Health, Kitchen, Software, Video, Books, DVD, Electronics, and Grocery reviews. Each product domain consists of 1000 positive reviews and 1000 negative reviews, hence each do- main has 2000 balanced set of instances. Note that 60% training and 40% testing sets were used on all domains. As baseline, we implemented the popular sentiment clas- sification technique in [10] as a traditional sentiment classi- fication benchmark on product reviews. The baseline tech- nique produced subjective portions of reviews from our datasets and were used with NB and the ordinary BN clas- sifiers. Baseline-NB denotes the baseline using NB clas- sifier and Baseline-BN represents the baseline with ordi- nary BN classifier on the subjective portions of reviews, re- spectively. We performed a grid search with 10-fold cross- validation for the three algorithms and observed that both SDBN and Baseline-BN gave best accuracies using Sim- pleEstimator with α = 0.5 and K2 search algorithm with 3http://www.cs.jhu.edu/ mdredze/datasets/sentiment/ 232 Informatica 40 (2016) 225–235 S.O. Orimaye et al. the Bayes/K2 scoring function, while NB performed better with the Kernel Estimator. 5.2 Data preparation We implemented our algorithm within the weka.classifiers.bayes package of the WEKA4 data mining framework. The SentiWordNet library5 including the lexicon file were also incorporated into the same WEKA directory. Further, we prepared our datasets according to the WEKA’s ARFF format by using the term frequency-inverse document frequency (TF-IDF) from the positive and negative reviews for each domain. Our implementation also uses the discretization algorithm within WEKA in order to reduce the classification error. This technique also correct the missing values within the training data. 5.3 Attribute selection We evaluated the performance of the SDBN with reduced attribute sets since attribute selection tends to improve BN’s accuracy [16]. Thus, we ranked and reduced the set of attributes for each of our dataset by using the “Attribute- Selection" filter in Weka. Specifically, we used the Info- GainAttributeEval evaluator with the ranker search algo- rithm to select from the top-10 ranked attributes to the top- 1000 ranked attributes for each domain. Our experiment showed better result with the top-ranked 100 attributes. 5.4 Results As emphasised in Table 1, we observed the proposed SDBN to have improved and sometimes comparable per- formance with the baseline classifiers. SDBN recorded bet- ter improvements on the Health and Kitchen domains. 
We also note that the accuracies on the Amazon video reviews seem to be lower than the accuracies reported on the IMDb movie reviews by [10]. We suggest that this is a trade-off in sentiment classification on different datasets and/or domains, as can be observed in our experiments on the different Amazon domains. This could be investigated further in our future work. We believe that an increased dataset size, beyond the limited 1000 Amazon reviews, could further improve the accuracy of the SDBN classifier.

We also performed experiments using the SDBN with the top-10, top-20, top-30, top-50, and top-100 attributes alone, as shown in Table 2. The results show that the performance of the SDBN increased steadily up to the best accuracy, given by the top-100 attributes. We believe this is an indication that the performance of the SDBN could increase with much larger datasets. In addition, we compared the performance of the top-10 to top-100 attributes for SDBN and the two baselines on the top three domains with better accuracy: Health, Kitchen, and Video. Figures 3, 4, 5, and 6 show consistent improvement with increasing attributes for the SDBN compared to the baselines.

4http://www.cs.waikato.ac.nz/ml/weka/
5http://sentiwordnet.isti.cnr.it/download.php

Table 1: Accuracies of SDBN and baseline classifiers on Amazon product reviews.

Dataset      SDBN    Baseline-BN   Baseline-NB
Health       80.2%   78.9%         78.6%
Kitchen      82.3%   80.5%         81.8%
Software     78.7%   78.6%         78.7%
Video        75.4%   75.2%         74.9%
Books        77.7%   77.5%         77.3%
DVD          77.6%   77.6%         77.5%
Electronic   79.9%   79.8%         79.7%
Grocery      86.7%   86.7%         86.5%

Table 2: Accuracies of SDBN on top-10 to top-100 attributes.

Dataset      Top-10   Top-20   Top-30   Top-50   Top-100
Health       67.4%    70.7%    73.4%    76.9%    80.2%
Kitchen      71.4%    74.8%    77.2%    79.0%    82.3%
Software     69.7%    73.4%    75.4%    76.1%    78.7%
Video        63.5%    68.6%    72.3%    75.1%    75.4%
Books        68.4%    71.1%    73.6%    75.3%    77.7%
DVD          68.1%    72.3%    73.8%    75.7%    77.6%
Electronic   67.0%    70.1%    70.5%    75.8%    79.9%
Grocery      69.9%    75.9%    80.0%    83.0%    86.7%

As shown in Table 3, we also performed experiments using SDBN with the other scoring functions described in Section 3.2, on the top-100 attributes, which gave the best accuracy. Our observation is that those scoring functions did not improve the results for SDBN beyond the Bayes/K2 scoring function used in the earlier experiments. This is consistent with the comparative study on BN scoring functions conducted in [15]. Overall, we have observed the SDBN classifier to have reasonable performance, which shows a promising research pathway for using the Bayesian Network as a competitive alternative classifier for sentiment classification tasks.

Figure 3: SDBN vs. baselines for top-10 to top-100 attributes on Health domain.

Figure 4: SDBN vs. baselines for top-10 to top-100 attributes on Kitchen domain.

Table 3: Accuracies of SDBN with different scoring functions on Amazon product reviews.
Dataset      K2      BDeu    Entropy   AIC     MDL
Health       80.2%   76.3%   76.3%     76.1%   76.1%
Kitchen      82.3%   75.3%   73.4%     75.2%   75.2%
Software     78.7%   73.1%   73.1%     73.1%   73.1%
Video        75.4%   73.5%   74.4%     72.8%   73.5%
Books        77.7%   70.9%   70.1%     71.1%   71.1%
DVD          77.6%   71.8%   70.0%     71.7%   71.7%
Electronic   79.9%   77.0%   76.2%     76.2%   77.2%
Grocery      86.7%   79.2%   80.0%     80.2%   79.3%

There are, however, some limitations in the use of SentiWordNet. For example, because many domain-specific or technical terms (e.g. brand names) are used in the product reviews, the sentiment priors of those terms returned zero (i.e. neutral), as they are neither negative nor positive. This might have affected the sentiment dependencies within the network structure.

Figure 5: SDBN vs. baselines for top-10 to top-100 attributes on Video domain.

Figure 6: SDBN vs. baselines for top-10 to top-100 attributes on Books domain.

6 Conclusion

In this study, we have proposed a sentiment-dependent Bayesian network (SDBN) classifier. The proposed SDBN uses a multi-class approach to compute sentiment dependencies between pairs of variables by using a joint probability from different sentiment evidences. Thus, we calculated a sentiment dependency score that penalizes existing BN scoring functions and derived a sentiment dependency network structure using the conditional mutual information between each pair of variables in a dataset. We performed sentiment classification on eight different Amazon product domains with the resulting network structure. Experimental results show that the proposed SDBN has comparable, and in some cases improved, accuracy compared with state-of-the-art sentiment classifiers. In the future, we will experiment with SDBN on the large-scale Amazon SNAP datasets and the Hadoop platform.

Acknowledgement

The authors would like to thank the anonymous reviewers for their constructive comments.

References

[1] H. Tang, S. Tan, and X. Cheng, “A survey on sentiment detection of reviews,” Expert Systems with Applications, vol. 36, no. 7, pp. 10760–10773, 2009.

[2] B. Liu, “Sentiment analysis and opinion mining,” Synthesis Lectures on Human Language Technologies, vol. 5, no. 1, pp. 1–167, 2012.

[3] E. Cambria and A. Hussain, Sentic Computing: Techniques, Tools, and Applications. Springer Science & Business Media, 2012, vol. 2.

[4] E. Cambria, B. Schuller, Y. Xia, and C. Havasi, “New avenues in opinion mining and sentiment analysis,” IEEE Intelligent Systems, no. 2, pp. 15–21, 2013.

[5] E. Cambria, D. Olsher, and D. Rajagopal, “SenticNet 3: a common and common-sense knowledge base for cognition-driven sentiment analysis,” in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014, pp. 1515–1521.

[6] G. F. Cooper and E. Herskovits, “A Bayesian method for the induction of probabilistic networks from data,” Machine Learning, vol. 9, no. 4, pp. 309–347, 1992.

[7] N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian network classifiers,” Machine Learning, vol. 29, pp. 131–163, 1997, 10.1023/A:1007465528199. [Online]. Available: http://dx.doi.org/10.1023/A:1007465528199

[8] J. Cheng and R. Greiner, “Learning Bayesian belief network classifiers: Algorithms and system,” in Advances in Artificial Intelligence. Springer, 2001, pp. 141–151.

[9] X.-W. Chen, G. Anantha, and X.
Lin, “Improving Bayesian network structure learning with mutual information-based node ordering in the K2 algorithm,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 628–640, 2008.

[10] B. Pang and L. Lee, “A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts,” in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. Barcelona, Spain: Association for Computational Linguistics, 2004, p. 271.

[11] E. Boiy and M.-F. Moens, “A machine learning approach to sentiment analysis in multilingual web texts,” Information Retrieval, vol. 12, no. 5, pp. 526–558, 2009.

[12] J. Su and H. Zhang, “Full Bayesian network classifiers,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 897–904.

[13] J. Cheng and R. Greiner, “Comparing Bayesian network classifiers,” in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1999, pp. 101–108.

[14] D. Heckerman, A Tutorial on Learning with Bayesian Networks. Springer, 2008.

[15] L. M. De Campos, “A scoring function for learning Bayesian networks based on mutual information and conditional independence tests,” The Journal of Machine Learning Research, vol. 7, pp. 2149–2187, 2006.

[16] E. Airoldi, X. Bai, and R. Padman, “Markov blankets and meta-heuristics search: Sentiment extraction from unstructured texts,” in Advances in Web Mining and Web Usage Analysis. Springer, 2006, pp. 167–187.

[17] X. Bai, “Predicting consumer sentiments from online text,” Decision Support Systems, vol. 50, no. 4, pp. 732–742, 2011.

[18] A. Esuli, “Automatic generation of lexical resources for opinion mining: models, algorithms and applications,” SIGIR Forum, vol. 42, no. 2, pp. 105–106, 2008.

[19] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[20] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classification using machine learning techniques,” in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10. Association for Computational Linguistics, 2002, pp. 79–86.

[21] H. Cui, V. Mittal, and M. Datar, “Comparative experiments on sentiment classification for online product reviews,” American Association for Artificial Intelligence (AAAI), 2006.

[22] D. Bespalov, B. Bai, Y. Qi, and A. Shokoufandeh, “Sentiment classification based on supervised latent n-gram analysis,” in Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2011, pp. 375–382.

[23] S. Poria, E. Cambria, G. Winterstein, and G.-B. Huang, “Sentic patterns: Dependency-based rules for concept-level sentiment analysis,” Knowledge-Based Systems, vol. 69, pp. 45–63, 2014.

[24] A. Esuli and F. Sebastiani, “SentiWordNet: A publicly available lexical resource for opinion mining,” Proceedings of LREC, 2006.

[25] F. Glover, M. Laguna et al., Tabu Search. Springer, 1997, vol. 22.

[26] W. Chen, L. Zong, W. Huang, G. Ou, Y. Wang, and D. Yang, “An empirical study of massively parallel Bayesian networks learning for sentiment extraction from unstructured text,” in Web Technologies and Applications. Springer, 2011, pp. 424–435.

[27] J. Cheng, D. A. Bell, and W.
Liu, “Learning belief networks from data: An information theory based approach,” in Proceedings of the Sixth International Conference on Information and Knowledge Management. ACM, 1997, pp. 325–331.

[28] N. Friedman and Z. Yakhini, “On the sample complexity of learning Bayesian networks,” in Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1996, pp. 274–282.

[29] C. F. Aliferis, A. Statnikov, I. Tsamardinos, S. Mani, and X. D. Koutsoukos, “Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation,” The Journal of Machine Learning Research, vol. 11, pp. 171–234, 2010.

[30] R. R. Bouckaert, Bayesian Network Classifiers in Weka. Department of Computer Science, University of Waikato, 2004.

[31] W. Buntine, “Theory refinement on Bayesian networks,” in Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1991, pp. 52–60.

[32] D. Heckerman, D. Geiger, and D. M. Chickering, “Learning Bayesian networks: The combination of knowledge and statistical data,” Machine Learning, vol. 20, no. 3, pp. 197–243, 1995.

[33] C. Chow and C. Liu, “Approximating discrete probability distributions with dependence trees,” IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 462–467, 1968.

[34] P. M. Lee, Bayesian Statistics: An Introduction. John Wiley & Sons, 2012.

[35] J. Blitzer, M. Dredze, and F. Pereira, “Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification,” Association for Computational Linguistics (ACL), 2007.