NAIVE BAYESIAN CLASSIFIER AND CONTINUOUS ATTRIBUTES

Keywords: machine learning, naive Bayesian classifier, continuous attributes

Igor Kononenko
University of Ljubljana, Faculty of Electrical and Computer Engineering
Tržaška 25, 61001 Ljubljana, Slovenia

Abstract

The advantages of the naive Bayesian classifier are fast and incremental learning, robustness with respect to missing data, the inclusion of all available attributes during classification, and the explanation of classifications as sums of information gains. Besides the 'naivety', a further weakness is the inability to deal with continuous attributes unless they are discretized in advance. In this paper three methods for dealing with continuous attributes are proposed. The fuzzy learning method assumes fuzzy bounds of a continuous attribute during learning, the fuzzy classification method assumes fuzzy bounds during classification, and the last method tries to increase the reliability of the probability approximations. The performance was tested on two medical diagnostic problems.

1 Introduction

The basic Bayesian formula for calculating the probability of a class given the values of the attributes describing a given object can be used either for the generation of decision trees (Michie & Al Attar 1990) or can be simplified into the 'naive' Bayesian classifier by assuming the independence of attributes (Kononenko 1989). The advantages of the naive Bayesian classifier are fast and incremental learning, robustness with respect to missing values, the inclusion of all available attributes for classification, and the ability to explain decisions as sums of information gains (Kononenko 1990). Several authors have shown that, despite the 'naivety', the naive Bayesian classifier performs in many real-world problems approximately the same as, or even better than, many well-known inductive learning systems (Bratko & Kononenko 1987, Cestnik 1990, Kononenko 1990). The advantages of induction of decision trees are 'non-naivety', a simple and powerful mechanism for on-line splitting of continuous attributes, and explicit knowledge in the form of if-then rules.

This paper is concerned with the problem of dealing with continuous attributes. The problem is to split the interval of possible values of a continuous attribute into subintervals so as to obtain as much useful information for classification as possible from the given attribute. In algorithms for the induction of decision trees a continuous attribute is binarized on-line during learning: in a given node of the tree a bound is selected that maximizes the information gain of the attribute (Breiman et al. 1984, Paterson & Niblett 1982, Bratko & Kononenko 1987, Quinlan 1986).
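To make this on-line binarization concrete, the following sketch scores every candidate threshold of a continuous attribute by the information gain of the induced binary split and returns the best one. It is an illustration only, not code from any of the cited systems; the function names and data layout are assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, gain) of the split 'value <= threshold' with maximal information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_threshold, best_gain = None, 0.0
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate equal values
        left = [c for _, c in pairs[:k]]
        right = [c for _, c in pairs[k:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best_gain:
            best_threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best_gain = gain
    return best_threshold, best_gain

# Example: best_binary_split([2.3, 4.1, 5.0, 7.2], ['a', 'a', 'b', 'b']) -> (4.55, 1.0)
```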
This approach can be applied to the naive Bayesian classifier only at the highest level, by changing all continuous attributes into binary attributes. Another approach is to define all subintervals off-line before learning, either by a human expert or with an algorithm (e.g. Cestnik 1989). However, all these approaches assume that the optimal split can be obtained using exact bounds of subintervals. This is an unrealistic assumption, as typically a slight difference between two values, one just above and one just below a bound, should not have a drastic effect. In this paper three new methods for dealing with continuous attributes are proposed. Two of them are based on the idea of fuzzy bounds and the third is based on the reliability of the probability estimates. In the next section the naive Bayesian classifier is briefly described. In section 3 the three methods for dealing with continuous attributes are defined, and in section 4 the experiments in two medical diagnostic problems are described. Finally, in section 5 some conclusions are given and further research is proposed.

2 Naive Bayesian Classifier

The classification problem discussed in this paper is the following: given a set of training instances, each described with a set of n attributes and each belonging to exactly one of a certain number of possible classes, learn to classify new, unseen objects. In addition, each attribute A_i has a fixed number NV_i of possible values. Let C represent one of the possible classes and let V_{i,J_i} be a Boolean variable having value 1 if the current instance has the J_i-th value of the i-th attribute and 0 otherwise. The conditional probability of class C given the values of all attributes is given by the following formula, derived from the Bayesian rule (Good 1950) (for brevity, conditions V_{i,J_i} = 1 will be written simply as V_{i,J_i}):

    P(C \mid V_{1,J_1}, \ldots, V_{n,J_n}) = P(C) \prod_{i=1}^{n} Q_i(C, J_i)        (1)

where

    Q_i(C, J_i) = \frac{P(C \mid V_{1,J_1}, \ldots, V_{i,J_i})}{P(C \mid V_{1,J_1}, \ldots, V_{i-1,J_{i-1}})}        (2)

and P(C) is the prior probability of class C. It was shown in (Kononenko 1989) that classification with an ID3-like inductive learning system can be described with (1). From (1) the naive Bayesian classifier, as used by Bratko and Kononenko (1987), is obtained if the independence of the attributes is assumed. Eq. (1) remains unchanged except that the factors Q_i defined with (2) are replaced with Q'_i (we will refer to the changed equation (1) as (1')):

    Q'_i(C, J_i) = \frac{P(V_{i,J_i} \mid C)}{P(V_{i,J_i})} = \frac{P(C \mid V_{i,J_i})}{P(C)}        (3)

The probabilities necessary to calculate (3) are approximated with relative frequencies from the training set. A new object is classified by calculating the probability of each class using equation (1'), and the object is assigned to the class that maximizes the calculated probability.

Cestnik (1990) has shown that the kind of approximation of the probabilities in (3) considerably influences the classification accuracy of the naive Bayesian classifier. Let N(C, V_{i,J_i}) be the number of training instances with the J_i-th value of the i-th attribute and belonging to class C, and let N(V_{i,J_i}) be the number of training instances with the J_i-th value of the i-th attribute. Usually, the probability is approximated with the relative frequency, i.e.

    P(C \mid V_{i,J_i}) = \frac{N(C, V_{i,J_i})}{N(V_{i,J_i})}        (4)

However, if the training set is relatively small, corrections are needed with respect to the assumed initial distribution (Good 1965). Cestnik (1990) used the following formula, stemming from the assumption that the initial distribution of classes is equal to P(C):

    P(C \mid V_{i,J_i}) = \frac{N(C, V_{i,J_i}) + 2 P(C)}{N(V_{i,J_i}) + 2}        (5)

where the probability of class C is calculated using Laplace's law of succession (Good 1950, 1965):

    P(C) = \frac{N(C) + 1}{N + 2}        (6)

Cestnik has shown some nice properties of using approximation (5) in formula (3) and has shown experimentally that the naive Bayesian classifier using approximation (5) performs significantly better than if (4) is used. The same formula was also used by Smyth and Goodman (1990).
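The classification rule and the probability approximations above can be combined in a few lines of code. The sketch below is an illustrative reconstruction, not the author's implementation; the class name, the data layout (each instance as a tuple of discrete attribute values) and the omission of missing-value handling are assumptions made for the example. It classifies with equation (1') using the factors (3), the approximation (5) and the Laplace prior (6).

```python
from collections import defaultdict

class NaiveBayes:
    def fit(self, instances, labels):
        self.n = len(labels)
        self.class_count = defaultdict(int)   # N(C)
        self.joint_count = defaultdict(int)   # N(C, V_{i,J_i})
        self.value_count = defaultdict(int)   # N(V_{i,J_i})
        for x, c in zip(instances, labels):
            self.class_count[c] += 1
            for i, v in enumerate(x):
                self.joint_count[(c, i, v)] += 1
                self.value_count[(i, v)] += 1
        return self

    def prior(self, c):
        # Laplace's law of succession, eq. (6): P(C) = (N(C) + 1) / (N + 2)
        return (self.class_count[c] + 1) / (self.n + 2)

    def cond(self, c, i, v):
        # Cestnik's approximation, eq. (5):
        # P(C | V_{i,J_i}) = (N(C, V_{i,J_i}) + 2 P(C)) / (N(V_{i,J_i}) + 2)
        return (self.joint_count[(c, i, v)] + 2 * self.prior(c)) / (self.value_count[(i, v)] + 2)

    def classify(self, x):
        # Eq. (1'): P(C | V_{1,J_1},...,V_{n,J_n}) = P(C) * prod_i P(C | V_{i,J_i}) / P(C)
        def score(c):
            s = self.prior(c)
            for i, v in enumerate(x):
                s *= self.cond(c, i, v) / self.prior(c)
            return s
        return max(self.class_count, key=score)
```

Note that for an attribute value never seen during training the factor (3) reduces to 1 in this sketch, so such an attribute simply does not influence the decision.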
3 Dealing with continuous attributes

For a continuous attribute the task is to estimate the probabilities of all classes of an object with a given value of that attribute. These probabilities should be approximated with relative frequencies calculated from the distribution of training instances with similar values of the attribute. It is assumed that small variations of the value of the attribute should have only small effects on the probabilities. As opposed to exact bounds, where a slightly different value can have drastic effects on the calculated probabilities, the bounds of the intervals are here assumed to be fuzzy. The three methods described in this section differ in the way the distribution used in the approximation of the probabilities is obtained.

For all three methods a pessimistic set of possible bounds is given in advance, either by a human expert or by a simple algorithm that returns bounds with a uniform distribution of instances over all intervals. The set of bounds is pessimistic in the sense that more bounds are given than are probably needed (e.g. all attributes have 20 possible intervals in advance, which is typically too detailed a split). However, the exact values of these initial bounds are not important and may vary without significant changes in performance.

3.1 Fuzzy learning

Fuzzy learning is performed by calculating the probability distribution for a given interval from all training instances rather than only from the instances whose value of the given continuous attribute lies in that interval. The influence of an instance is assumed to be normally distributed, with mean value equal to the value of the regarded attribute and with a given σ. σ is a parameter of the learning algorithm and is used to control the 'fuzziness' of the bounds. As shown in figure 1, the influence of a given instance with value v of the given continuous attribute on the distribution of interval (b_j .. b_{j+1}) is proportional to the following expression (a short code sketch of this computation is given after Table 1 below):

    P(v, \sigma, j) = \int_{b_j}^{b_{j+1}} \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - v)^2}{2\sigma^2}} \, dx        (7)

The lower bound is proportional to n and to ε². In our case we are interested in the reliability of the approximation of probability (5). Therefore the number of trials n in (10) is equal to N(V_{i,J_i}), i.e. the number of training instances having a value inside the J_i-th interval of attribute A_i. As the prior probability p is unknown, in our experiments the worst case was assumed for the approximation of p on the right-hand side of (10), i.e. p = 0.5.

It remains to determine the value of ε. The influence of interval J_i of the i-th attribute on class C is proportional to the difference between (5) and (6). The greater the influence, the greater the reliability of the approximation (5) should be. Therefore ε should be proportional to the influence. As we regard the influence of an interval for all the classes C_j, j = 1..n, the average difference is used for ε:

    \varepsilon = \sum_{j=1}^{n} P(C_j) \, | P(C_j \mid V_{i,J_i}) - P(C_j) |        (11)

From the above formulas it follows that the interval is unreliable if

    1 - \frac{1}{4 \varepsilon^2 N(V_{i,J_i})}

Table 1: Characteristics of the two medical data sets.

                                     thyroid    rheumatology
    # instances                      884        355
    # classes                        4          6
    # attributes                     15         32
    average # values/attribute       9.1        9.1
    average # missing data           2.7        0.0
    majority class                   56%        66%
    entropy (bit)                    1.59       1.70
    # continuous attributes          7          22
    average # intervals/attribute    17.3       10.0
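The fuzzy-learning contribution of a single training instance, i.e. the integral in equation (7), can be computed with the normal cumulative distribution function. The sketch below is illustrative only; the interval representation and function names are assumptions, and the mass falling outside the outermost bounds is simply ignored here.

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def fuzzy_weights(v, bounds, sigma):
    """Fractional contribution of an instance with value v to each interval
    (b_j, b_{j+1}) defined by the sorted list `bounds`, as in eq. (7)."""
    return [normal_cdf(hi, v, sigma) - normal_cdf(lo, v, sigma)
            for lo, hi in zip(bounds[:-1], bounds[1:])]

# Example: with sigma = 0.3 a value of 0.95 spreads its unit count mostly over
# the first two intervals instead of being counted fully in a single one.
print(fuzzy_weights(0.95, [0.0, 1.0, 2.0, 3.0], 0.3))
```

In fuzzy learning these fractional contributions play the role of the usual 0/1 interval counts, so a value lying close to a bound also contributes to the neighbouring interval.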