Informatica 35 (2011) 197-209

Optimal Decision Tree Based Multi-class Support Vector Machine

Manju Bala and R. K. Agrawal
School of Computer & Systems Sciences, Jawaharlal Nehru University, New Delhi-110 067, India

Keywords: support vector machine, decision tree, class separability, information gain, Gini index, chi-squared, interclass scatter, intraclass scatter

Received: August 5, 2010

In this paper, a decision tree SVM architecture is constructed to solve multi-class problems. To maintain high generalization ability, the optimal structure of the decision tree is determined using statistical measures of class separability. The proposed optimal decision tree SVM (ODT-SVM) takes advantage of both the efficient computation of the decision tree architecture and the high classification accuracy of SVM. A robust non-parametric test is carried out for statistical comparison of the proposed ODT-SVM with other classifiers over multiple data sets. Performance is evaluated in terms of classification accuracy and computation time. The statistical analysis on UCI repository datasets indicates that the ten-fold cross-validation accuracy of our proposed framework is significantly better than that of widely used multi-class classifiers. Experimental results and statistical tests show that the proposed ODT-SVM is significantly better than conventional OvO and OAA in terms of both training and testing time.

Povzetek: The decision-tree method with SVM achieves significantly better results than the original SVM.

1 Introduction

Support Vector Machine (SVM) has proved to be a successful learning machine in the literature, especially for classification. SVM is based on statistical learning theory developed by Vapnik [6, 25]. Since it was originally designed for binary classification [3], it is not easy to extend binary SVM to multi-class problems. Constructing k-class SVMs (k > 2) is an on-going research issue [1, 4]. Two approaches have been suggested in the literature to solve multi-class SVM. One considers all data in one optimization [7]. The other decomposes the multi-class problem into a series of binary SVMs, such as "One-Against-All" (OAA) [25] and "One-versus-One" (OvO) [16]. It has been reported in the literature that both conventional OvO and OAA SVMs suffer from the problem of unclassifiable regions [19, 24]. To resolve the unclassifiable region in conventional OvO, a decision tree OvO SVM formulation was proposed [19]. Takahashi and Abe [24] proposed a class separability measure, i.e. the Euclidean distance between class centers, to construct a decision tree based OAA SVM that overcomes the unclassifiable region. In the literature, a large number of distance measures other than the Euclidean distance have been used to determine class separability, each having its own advantages and disadvantages. A few more realistic and effective statistical measures used in the literature for measuring class separability are information gain, Gini index, chi-square and scatter-matrix-based class separability in kernel-induced space. In this paper, we evaluate the performance, in terms of classification accuracy and computation time, of the proposed OvO ODT-SVM [17] and OAA ODT-SVM [18]. In both models, class separability is determined using statistical measures, i.e. information gain, Gini index, chi-square and scatter-matrix-based class separability in kernel-induced space. A robust non-parametric test is carried out for statistical comparison of the proposed ODT-SVM with other classifiers over multiple data sets. In Section 2 we briefly describe the basics of SVM.
In Section 3 we discuss the decision tree OvO and OAA SVM approaches. Section 4 presents our proposed ODT-SVM framework using four statistical class separability measures. In Section 5, we discuss the theoretical analysis and empirical estimation of the training and testing time of both proposed schemes. The experimental results demonstrate the effectiveness of our ODT-SVMs in comparison to conventional OvO and OAA SVMs. Section 6 includes the conclusion.

2 Support vector machines

Support Vector Machine (SVM) is based on statistical learning theory developed by Vapnik [6, 25]. It classifies data by determining a set of support vectors, which are members of the set of training inputs that outline a hyperplane in feature space [15]. Let {(x_1, y_1), ..., (x_n, y_n)} be a training set with x_i \in R^d and y_i the corresponding target class. The basic problem of training an SVM can be formulated as:

Maximize: L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i^T x_j)
Subject to: \sum_{i=1}^{n} \alpha_i y_i = 0 and \alpha_i \ge 0, i = 1, 2, ..., n    (1)

The computation of dot products between vectors without explicitly mapping to another space is performed by a kernel function K(x_i, x_j). Use of a kernel function [22] enables the curse of dimensionality to be addressed, and the solution implicitly contains the support vectors that provide a description of the significant data for classification. Substituting K(x_i, x_j) for (x_i^T x_j) in eqn. (1) produces a new optimization problem:

Maximize: L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
Subject to: 0 \le \alpha_i \le C, i = 1, ..., n and \sum_{i=1}^{n} \alpha_i y_i = 0    (2)

Solving it for \alpha gives a decision function of the form

f(x) = sign( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b )    (3)

whose decision boundary is a hyperplane in the kernel-induced feature space and translates to a nonlinear boundary in the original input space.

3 Decision tree based SVM

The most common way to build a multi-class SVM is by constructing and combining several binary classifiers [14]. To solve multi-class classification problems, we divide the whole classification problem into a number of binary classification problems. The two representative ensemble schemes are OvO and OAA [21]. Conventional OvO SVM has the problem of unclassifiable regions. To resolve the unclassifiable region for OvO SVM, the Decision Directed Acyclic Graph (DDAG) SVM [19], based on a decision tree of OvO SVMs, was proposed in the literature. The authors showed, with an example three-class problem, the existence of unclassifiable regions which can lead to degradation of the generalization ability of the classifier. In general, the unclassifiable region is visible and the generalization ability of the classifier is not good for a k-class problem where k > 2. In the DDAG OvO scheme [19], the VC dimension, the LOO error estimator and Joachims' xi-alpha LOO measure were used for estimating the generalization ability of the pairwise classifier at each level of the decision tree. During training, at the top node a pair (C_i, C_j) that has the highest generalization ability is selected from an initial list of classes (C_1, ..., C_k). Two lists are then generated by deleting C_i or C_j from the initial list. If the separated classes include plural classes, the same procedure is repeated at the nodes connected to the top node for the two lists, until one class remains in the separated region. This means that after only k - 1 steps just one class remains, which therefore becomes the prediction for the current test sample.
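To make the elimination procedure concrete, the following sketch (ours, not code from the cited papers) trains one binary kernel SVM per class pair with scikit-learn and then classifies a sample DDAG-style, discarding one class per decision so that only k - 1 evaluations are needed. The helper names and the default parameter values (C = 1000, gamma = 2^-11, matching the experimental settings used later in this paper) are illustrative assumptions.

    import numpy as np
    from itertools import combinations
    from sklearn.svm import SVC

    def train_pairwise_svms(X, y, classes, kernel="rbf", C=1000, gamma=2**-11):
        """Train one binary SVM per class pair (c_i, c_j), as in the OvO setting."""
        svms = {}
        for ci, cj in combinations(classes, 2):
            mask = np.isin(y, [ci, cj])
            svms[(ci, cj)] = SVC(kernel=kernel, C=C, gamma=gamma).fit(X[mask], y[mask])
        return svms

    def ddag_predict(x, classes, svms):
        """Eliminate one class per pairwise decision; only k - 1 evaluations per sample."""
        remaining = list(classes)
        while len(remaining) > 1:
            ci, cj = remaining[0], remaining[-1]
            winner = svms[(min(ci, cj), max(ci, cj))].predict(x.reshape(1, -1))[0]
            remaining.remove(cj if winner == ci else ci)   # drop the losing class
        return remaining[0]

A sample is thus routed through a list of pairwise nodes, and the class that survives all eliminations is returned as the prediction.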
Gjorgji et al. [13] proposed a binary tree architecture (SVM-BDT) that uses SVMs for making binary decisions in the nodes, which takes advantage of both the efficient computation of the tree architecture and the high accuracy of SVMs. The hierarchy of binary decision subtasks using SVMs is designed with clustering algorithms. In the proposed SVM-BDT scheme, the classes are divided into two disjoint groups g1 and g2 using the Euclidean distance as the distance measure. The two disjoint groups so obtained are then used to train an SVM classifier in the root node of the decision tree. The classes from the first and second clustering groups are assigned to the left and right subtrees respectively. This process continues recursively until only one class is left in a group, which defines a leaf of the decision tree.

Takahashi and Abe [24] proposed an OAA SVM based decision tree formulation to overcome the problem of unclassifiable regions and thus improve the generalization ability of SVM. They showed, with an example of unclassifiable regions for a three-class problem, how such regions can lead to degradation of the generalization ability of the classifier. In general, the unclassifiable region is visible and the generalization ability of the classifier is not good for a k-class problem where k > 2. In the scheme proposed by Takahashi and Abe [24], during training a hyperplane is determined at the top node that separates one class from the others. If the separated classes include plural classes, a hyperplane separating the classes is determined at the node connected to the top node. This process is repeated until one class remains in the separated region. This resolves the problem of unclassifiable regions that exists in OAA SVM. They proposed different types of decision trees based on a class separability measure, i.e. the Euclidean distance between class centers.

4 Proposed decision tree SVMs framework using statistical measures

The Euclidean distance measure used in the construction of decision trees (i.e. OvO SVM-BDT and the Takahashi and Abe [24] OAA SVM formulation) does not take into account the within-class variability of patterns. Hence, it may not be suitable for measuring the class separability between two different classes of patterns. To obtain a better picture of the overlap of the subspaces occupied by the individual classes, statistical distance measures are employed by the pattern recognition community; these constitute a natural way of measuring class separability.

4.1 Statistical class separability measures

Among statistical measures, information gain (IG) [20] is a measure based on entropy [23], which indicates the degree of disorder of a system. It measures the reduction in the weighted average impurity of the partitions, compared with the impurity of the complete set of samples, when the value of a specific attribute is known. Thus, the value of IG signifies how strongly the whole system is related to an attribute. IG is calculated using:

IG(C|E) = H(C) - H(C|E)    (4)

where IG(C|E) is the information gain of the label C for a given attribute E, H(C) is the system's entropy and H(C|E) is the system's relative entropy when the value of the attribute E is known. The system's entropy indicates its degree of disorder and is given by the following formula:

H(C) = -\sum_{i=1}^{k} p(C_i) \log(p(C_i))    (5)

where p(C_i) is the probability of class C_i. The relative entropy is calculated as

H(C|E) = -\sum_{j=1}^{|E|} p(e_j) \sum_{i=1}^{k} p(C_i|e_j) \log(p(C_i|e_j))    (6)

where p(e_j) is the probability of value j for attribute E, and p(C_i|e_j) is the probability of C_i given e_j.
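A small worked example (ours, not from the original text) may help fix ideas: for a balanced two-class set, H(C) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 bit. If a binary attribute E splits the data into two pure branches, then H(C|E) = 0 and IG(C|E) = 1, its maximum; if each branch retains the original 50/50 mixture, then H(C|E) = 1 and IG(C|E) = 0.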
The optimal binary SVM model is selected on the basis of the maximum value of IG, which signifies greater separability between patterns belonging to two different classes. IG for a given independent binary SVM trained on n_i elements of C_i and n_j elements of C_j can be calculated as

IG(i,j) = H(C_i, C_j) - [p(C_i) H(t_p, f_p) + p(C_j) H(f_n, t_n)]    (7)

where H(C_i, C_j) denotes the entropy of the complete two-class set, computed from the counts n_i and n_j using

H(x, y) = -\frac{x}{x+y} \log\left(\frac{x}{x+y}\right) - \frac{y}{x+y} \log\left(\frac{y}{x+y}\right)    (8)

and

p(C_i) = \frac{n_i}{n_i + n_j}, \quad p(C_j) = \frac{n_j}{n_i + n_j}    (9)

Here t_p, f_p, t_n and f_n denote the number of true positive, false positive, true negative and false negative data points respectively. A higher value of IG signifies less overlap, or more distance, between the two classes of data points. Hence, IG is a natural measure of the class separability of different classes of data points. Similarly, for every independent binary OAA SVM, assume there are two classes of data, C_i and C_{\bar{i}} (the union of the remaining classes), and that the training set D contains n_i elements of C_i and n - n_i elements of C_{\bar{i}}. IG for a given OAA SVM model i can be calculated as

IG(i) = H(C_i, C_{\bar{i}}) - [p(C_i) H(t_p, f_p) + p(C_{\bar{i}}) H(f_n, t_n)]    (10)

The Gini index is another popular measure for feature selection in the field of data mining, proposed by Breiman et al. [5]. It measures the impurity of a given set of training data D and can be calculated as

Gini(D) = 1 - \sum_{i=1}^{2} (p(C_i))^2    (11)

For a binary split, a weighted sum of the impurity of each resulting partition is computed. The reduction in impurity that would be incurred by a particular binary split of the binary SVM between two classes C_i and C_j is calculated as

\Delta Gini(i,j) = Gini(D) - Gini_{ij}(D)    (12)

where

Gini_{ij}(D) = p(C_i) Gini(L) + p(C_j) Gini(R)    (13)

and Gini(L) is the Gini index on the left side of the hyperplane and Gini(R) is that on the right side. The OvO SVM model between the class pair (i_k, j_k) that maximizes the reduction in impurity (i.e. the Gini index) is selected as the splitting node of the decision tree SVM at a particular level. Similarly, for every independent binary OAA model, assume there are two classes of data, C_i and C_{\bar{i}}. The reduction in impurity that would be incurred by a particular binary split i is given by

\Delta Gini(i) = Gini(D) - Gini_i(D)    (14)

where

Gini_i(D) = p(C_i) Gini(L) + p(C_{\bar{i}}) Gini(R)    (15)

Chi-square [9], another criterion used for binary splits in data mining and machine learning, is a statistical test in which the sampling distribution of the test statistic is a chi-square distribution when the null hypothesis is true. We are interested in determining whether a particular decision rule is useful or informative. In this case, the null hypothesis is a random rule that would place t_p + f_p data points, drawn from C_i and C_j independently of their class, in the left branch of the decision tree and the remainder in the right branch. The candidate decision rule differs significantly from the random rule if the observed proportions differ significantly from those given by the random rule. The chi-square statistic \chi^2 is given by

\chi^2 = g(t_p, (t_p + f_p) p(C_i)) + g(f_n, (f_n + t_n) p(C_i)) + g(f_p, (t_p + f_p) p(C_j)) + g(t_n, (f_n + t_n) p(C_j))    (16)

where g(count, expect) = (count - expect)^2 / expect. The higher the value of \chi^2, the less likely it is that the null hypothesis is true. Thus, for a sufficiently high \chi^2, the difference between the expected and observed distributions is statistically significant; one can reject the null hypothesis and consider the candidate rule informative. Hence, the OvO SVM model for the class pair (i_k, j_k), or the OAA SVM model i, that maximizes \chi^2 is selected as the splitting node of ODT-SVM at a particular level.
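To make eqns (7)-(16) concrete, the sketch below (our illustration, not code from the paper) computes the three scores for one candidate binary SVM directly from its confusion counts t_p, f_p, f_n, t_n on the training data; the function names are ours, and the partition weights follow eqns (7), (9) and (13) as stated above.

    import numpy as np

    def entropy2(x, y):
        """H(x, y) of eqn (8): entropy of a two-way count split (0 log 0 = 0)."""
        probs = np.array([x, y], dtype=float) / (x + y)
        probs = probs[probs > 0]
        return float(-(probs * np.log2(probs)).sum())

    def gini2(x, y):
        """Gini impurity of a two-way count split, eqn (11)."""
        total = float(x + y)
        return 1.0 - (x / total) ** 2 - (y / total) ** 2

    def separability_scores(tp, fp, fn, tn):
        """Return (IG, delta_Gini, chi_square) for one candidate split, eqns (7)-(16)."""
        ni, nj = tp + fn, fp + tn                 # class sizes n_i and n_j
        n = ni + nj
        p_i, p_j = ni / n, nj / n                 # eqn (9)
        left, right = tp + fp, fn + tn            # points on each side of the hyperplane

        ig = entropy2(ni, nj) - (p_i * entropy2(tp, fp) + p_j * entropy2(fn, tn))      # eqn (7)
        delta_gini = gini2(ni, nj) - (p_i * gini2(tp, fp) + p_j * gini2(fn, tn))       # eqns (12)-(13)

        g = lambda count, expect: (count - expect) ** 2 / expect
        chi2 = (g(tp, left * p_i) + g(fn, right * p_i) +
                g(fp, left * p_j) + g(tn, right * p_j))                                # eqn (16)
        return ig, delta_gini, chi2

A node of the OvO (or OAA) tree would then be chosen as the pair (or model) whose counts yield the largest score under the chosen measure.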
To measure the class variability of patterns, the ratio of the inter-class and intra-class scatter in the kernel-induced feature space can also be used; this better depicts the physical relationship of the data in input space and thereby provides high generalization ability of a classifier based on a decision tree. The scatter-matrix based measure S of a training set D in the original space [10] is defined as

S = \frac{tr(S_b)}{tr(S_w)}    (17)

where S_b is the between-class scatter matrix and S_w is the within-class scatter matrix, defined as

S_b = (m_i - m_j)(m_i - m_j)^T, \quad m_i = \frac{1}{n_i} \sum_{x \in C_i} x, \quad m_j = \frac{1}{n - n_i} \sum_{x \notin C_i} x    (18)

where m_i and m_j represent the mean vectors of the data from class C_i and class C_j respectively, and

S_w = Q_i + Q_j    (19)

where Q_i and Q_j are given by

Q_i = \frac{1}{n_i} \sum_{x \in C_i} (x - m_i)(x - m_i)^T    (20)

Q_j = \frac{1}{n - n_i} \sum_{x \notin C_i} (x - m_j)(x - m_j)^T    (21)

Using the kernel trick [7], data points from C_i and C_j are implicitly mapped from R^d to a high-dimensional feature space H. Let \Phi(\cdot): R^d \to H denote the mapping and k_\beta(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle denote the kernel function, where \beta is the set of kernel parameters and \langle \cdot, \cdot \rangle is the inner product. K denotes the kernel matrix with \{K\}_{ij} = k_\beta(x_i, x_j). Let K_{A,B} be the kernel matrix computed between the data points of A and B, two subsets of the training sample set D. Let S_b^\Phi and S_w^\Phi denote the between-class and within-class scatter matrices in H respectively, defined as follows:

S_b^\Phi = \sum_{i=1}^{2} (m_i^\Phi - m^\Phi)(m_i^\Phi - m^\Phi)^T    (22)

S_w^\Phi = \sum_{i=1}^{2} \sum_{x \in D_i} (\Phi(x) - m_i^\Phi)(\Phi(x) - m_i^\Phi)^T    (23)

where m_i^\Phi denotes the mean of the training data points from C_i and m^\Phi is the mean of all the training data points in H. Let F be a vector whose elements are all 1; its size is decided by the context. Then

m_i^{\Phi T} m_i^\Phi = n_i^{-2} \cdot F^T K_{D_i,D_i} F    (24)

m^{\Phi T} m^\Phi = n^{-2} \cdot F^T K_{D,D} F    (25)

m_i^{\Phi T} m^\Phi = (n_i n)^{-1} \cdot F^T K_{D_i,D} F    (26)

tr(S_b^\Phi) = \sum_{i=1}^{2} \left( \frac{F^T K_{D_i,D_i} F}{n_i^2} - \frac{2 F^T K_{D_i,D} F}{n_i n} + \frac{F^T K_{D,D} F}{n^2} \right)    (27)

tr(S_w^\Phi) = tr(K_{D,D}) - \sum_{i=1}^{2} \frac{F^T K_{D_i,D_i} F}{n_i}    (28)

The class separability (SC) in the feature space H is then obtained as

SC = \frac{tr(S_b^\Phi)}{tr(S_w^\Phi)}    (29)
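As an illustration (ours, under the formulas above), SC can be computed directly from kernel matrices, since F^T K F is simply the sum of the entries of K. The RBF kernel, parameter values and function names below are assumptions made for the sketch.

    import numpy as np

    def rbf_kernel(A, B, gamma=2**-11):
        """k(x, z) = exp(-gamma * ||x - z||^2) for all pairs of rows of A and B."""
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def kernel_class_separability(X, y, class_a, class_b, kernel=rbf_kernel):
        """SC = tr(S_b^Phi) / tr(S_w^Phi) for two groups, following eqns (24)-(29)."""
        mask = np.isin(y, [class_a, class_b])
        D, labels = X[mask], y[mask]
        parts = [D[labels == class_a], D[labels == class_b]]
        n = len(D)
        K_DD = kernel(D, D)

        tr_sb, tr_sw = 0.0, np.trace(K_DD)
        for Di in parts:
            ni = len(Di)
            K_ii = kernel(Di, Di)
            K_iD = kernel(Di, D)
            tr_sb += K_ii.sum() / ni**2 - 2 * K_iD.sum() / (ni * n) + K_DD.sum() / n**2   # eqn (27)
            tr_sw -= K_ii.sum() / ni                                                      # eqn (28)
        return tr_sb / tr_sw                                                              # eqn (29)

The same routine applies to the OAA case by letting the second group be the union of all remaining classes.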
4.2 Algorithm for construction of ODT-SVM

For the construction of the optimal decision tree SVM model (ODT-SVM), any one of the class separability measures can be used to determine the structure of the decision tree. Here, for illustration, we use information gain in the proposed optimal decision tree algorithm; a code sketch of the OAA variant is given after the two outlines below. The outline of OvO ODT-SVM using the IG class separability measure for a k-class problem is as follows:

1. Generate the initial list {C_1, ..., C_k}.
2. Calculate H(C_i, C_j) using eqn. (8) for i, j = 1, ..., k and j > i.
3. Calculate H(t_p, f_p), H(f_n, t_n), p(C_i) and p(C_j) using eqn. (8) and eqn. (9) respectively.
4. Compute IG(i,j) using eqn. (7).
5. Determine the class pair (C_i, C_j) for which IG(i,j) takes the maximum value over the list. If the data points belong to class C_i then delete C_j from the list, else delete class C_i.
6. If more than one class remains, repeat steps 2-5; otherwise terminate the algorithm.

Similar computational steps are followed for the other three measures to determine the structure of OvO ODT-SVM. Similarly, the outline of OAA ODT-SVM using the IG class separability measure for a k-class problem is as follows:

1. Generate the initial list {C_1, ..., C_k}.
2. Calculate H(C_i, C_{\bar{i}}) using eqn. (8) for i = 1, ..., k.
3. Calculate H(t_p, f_p), H(f_n, t_n), p(C_i) and p(C_{\bar{i}}) using eqn. (8) and eqn. (9) respectively.
4. Compute IG(i) using eqn. (10).
5. Determine the model i for which IG(i) takes the maximum value over the list.
6. If the remaining set of classes is still multi-class, repeat steps 2-5; otherwise terminate the algorithm.

Similar computational steps are followed for the other three measures to determine the structure of OAA ODT-SVM.
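The sketch below (ours, not the authors' code) puts the OAA outline into Python: at each level it trains a one-vs-rest SVM for every remaining class, scores each candidate split with a supplied separability measure (e.g. the IG of eqn (10), computed by the separability_scores helper sketched earlier), keeps the best one as the node and removes its class. The function names and parameter defaults are illustrative assumptions.

    import numpy as np
    from sklearn.svm import SVC

    def build_oaa_odt_svm(X, y, measure, kernel="rbf", C=1000, gamma=2**-11):
        """Build the OAA ODT-SVM node list: at each level keep the one-vs-rest SVM
        whose training-set split maximizes the chosen separability measure."""
        remaining, nodes = sorted(set(y)), []
        while len(remaining) > 1:
            mask = np.isin(y, remaining)
            Xr, yr = X[mask], y[mask]
            best = None
            for ci in remaining:
                target = (yr == ci).astype(int)                 # class ci vs the rest
                svm = SVC(kernel=kernel, C=C, gamma=gamma).fit(Xr, target)
                pred = svm.predict(Xr)
                tp = int(((pred == 1) & (target == 1)).sum())
                fp = int(((pred == 1) & (target == 0)).sum())
                fn = int(((pred == 0) & (target == 1)).sum())
                tn = int(((pred == 0) & (target == 0)).sum())
                score = measure(tp, fp, fn, tn)                 # e.g. IG of eqn (10)
                if best is None or score > best[0]:
                    best = (score, ci, svm)
            _, ci, svm = best
            nodes.append((ci, svm))                             # this node separates ci from the rest
            remaining.remove(ci)
        nodes.append((remaining[0], None))                      # leaf: the last remaining class
        return nodes

    def odt_predict(x, nodes):
        """Walk down the node list; at most k - 1 hyperplane evaluations per sample."""
        for ci, svm in nodes:
            if svm is None or svm.predict(x.reshape(1, -1))[0] == 1:
                return ci

For example, measure could be lambda tp, fp, fn, tn: separability_scores(tp, fp, fn, tn)[0] to select nodes by information gain, or the third component of that tuple for the chi-square criterion. The OvO variant is analogous, with pairwise rather than one-vs-rest nodes.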
4.3 Time complexity

In order to compute the time complexity of the training phase of OvO SVM, we assume without loss of generality that the number of data points in each class is approximately the same, i.e. n/k. To solve a k-class problem using conventional OvO, k(k-1)/2 binary SVM classifiers are built. Assuming the time complexity of building an SVM with n data points and d features is O(n^2 d), it can be shown that the training time of conventional OvO is O(n^2 d). In the worst case, the decision tree generated in SVM-BDT is skewed, when the classes at every level are divided into two groups of very uneven size. Under the assumption that group g1 contains only one class and group g2 contains the remaining classes (or vice versa), the decision tree so generated will be of depth (k-1) for a k-class problem. Hence, the training time of SVM-BDT in the worst case is

T_{SVM-BDT-worst} = n^2 d + \left(\frac{n(k-1)}{k}\right)^2 d + \left(\frac{n(k-2)}{k}\right)^2 d + ... + \left(\frac{2n}{k}\right)^2 d = O(n^2 d k)    (30)

In the best case of the SVM-BDT approach, the classes at every level are divided into two groups of approximately equal size. The decision tree so generated is almost height-balanced, with maximum depth \lceil \log(k) \rceil. The number of nodes in the decision tree at depth i is 2^{i-1}, each containing n/2^{i-1} data points. Hence, the training time for SVM-BDT in the best case is

T_{SVM-BDT-best} = n^2 d + 2\left(\frac{n}{2}\right)^2 d + 4\left(\frac{n}{4}\right)^2 d + ... = n^2 d \left(1 + \frac{1}{2} + \frac{1}{4} + ...\right) = O(n^2 d)    (31)

In general, the structure of the OvO decision tree generated using the statistical measures is almost height-balanced, with maximum depth \lceil \log(k) \rceil. There are 2^{i-1} nodes at the i-th level and each node uses n/2^{i-1} data points. Hence, the training time for OvO ODT-SVM using a statistical measure is

T^{train}_{OvO-ODT-SVM} = \sum_{i=1}^{\lceil \log(k) \rceil} 2^{i-1} \left(\frac{n}{2^{i-1}}\right)^2 d = O(n^2 d)    (32)

During the testing phase of conventional OvO, k(k-1)/2 decision functions have to be evaluated, and the majority voting requires a further k(k-1)/2 operations; hence the testing time T^{test}_{OvO-SVM} per sample is proportional to k(k-1). In the worst case the depth of SVM-BDT is (k-1), which requires k-1 decision function evaluations per sample; in the best case its depth is \lceil \log(k) \rceil, which requires \lceil \log(k) \rceil evaluations per sample. Since the maximum depth of OvO ODT-SVM is \lceil \log(k) \rceil, its testing requires \lceil \log(k) \rceil operations. According to the above analysis it is evident that OvO ODT-SVM will always take less computation time for both training and testing than conventional OvO SVM and SVM-BDT.

For a k-class problem, (k-1) hyperplanes are calculated in the case of OAA ODT-SVM, whereas k SVM models are built in the conventional OAA SVM approach, which is more than the number of SVM models required in the decision tree OAA SVM. Assuming the time complexity of building an SVM model is O(n^2 d), where n is the number of data points and d is the number of attributes, the overall time complexity of training an SVM using the conventional OAA approach is proportional to k n^2 d. In order to compute the training time T^{train}_{OAA-ODT-SVM} required by OAA ODT-SVM, we assume that the number of data points in each class is approximately the same, i.e. n/k. At the first level, the OAA ODT-SVM model takes time proportional to n^2 d, while at the second level the SVM model has (n - n/k) data points. In general, the node at the i-th level has (n - (i-1)n/k) data points. The time required for the decision tree based SVM is therefore

T^{train}_{OAA-ODT-SVM} = n^2 d + \left(n - \frac{n}{k}\right)^2 d + \left(n - \frac{2n}{k}\right)^2 d + ... + \left(\frac{2n}{k}\right)^2 d    (33)

Under the above assumption, the time required for training an OAA ODT-SVM is approximately three times less than that of conventional OAA. In the testing phase, the values of all the hyperplanes need to be determined in the conventional OAA formulation, whereas in OAA ODT-SVM the values of all the hyperplanes need not be computed in general. Hence the testing time is also lower for OAA ODT-SVM than for conventional OAA.

5 Experimental results

To evaluate the performance of our proposed ODT-SVM framework using information gain, Gini index, chi-square and scatter-matrix-based class separability in kernel-induced space, we performed experiments on publicly available UCI [2] benchmark datasets. Table 1 describes the datasets used in our experiments. The yeast dataset actually has ten classes; we merged the data of six classes into one class to make it a five-class problem. The kernel functions used in the experiments are given in Table 2. The performance of the classifiers is evaluated in terms of classification accuracy, training time and testing time on each data set. We applied the Friedman test [11], [12], a non-parametric test for the statistical comparison of multiple classifiers over multiple data sets. For each dataset, we rank the competing algorithms: the one that attains the best performance is ranked 1, the second best is ranked 2, and so forth. A method's mean rank is obtained by averaging its ranks across all datasets. Compared to the mean value, the mean rank is less susceptible to outliers [8]. As recommended by Demsar [8], the Friedman test is effective for comparing multiple algorithms across multiple datasets. It compares the mean ranks of the approaches to decide whether to reject the null hypothesis, which states that all the approaches are equivalent and, therefore, that their mean ranks should be equal. If the Friedman test rejects its null hypothesis, we can proceed with a post hoc test, the Nemenyi test, which is applied to the mean ranks of the competing approaches and indicates which performances differ statistically.

Problem | #train | #test | #classes | #attributes
Iris | 150 | 0 | 3 | 4
Wine | 178 | 0 | 3 | 13
Vehicle | 846 | 0 | 4 | 18
Glass | 214 | 0 | 6 | 9
Segmentation | 210 | 0 | 7 | 19
Ecoli | 336 | 0 | 8 | 7
Satimage | 4435 | 2000 | 6 | 36
NewThyroid | 215 | 0 | 3 | 5
Yeast | 1484 | 0 | 5 | 8
MovementLibra | 360 | 0 | 15 | 90
HeartDiseaseCleveland | 303 | 0 | 5 | 13

Table 1: Description of datasets.

To see the effectiveness of our proposed OvO ODT-SVM and OAA ODT-SVM, we compared our methods with conventional OvO, SVM-BDT and conventional OAA SVMs respectively. We used the five kernel functions of Table 2 with C = 1000 and \gamma \in \{2^{-11}, 2^{-10}, ..., 2^{0}\}. The classification accuracy is determined using ten-fold cross-validation. For a given kernel function and C, we determine the value of \gamma for which the maximum classification accuracy is achieved. Table 3 and Table 4 show the comparison of maximum classification accuracy between conventional OvO and OvO ODT-SVM, and between conventional OAA and OAA ODT-SVM, respectively.

Kernel | Function K(x, x_i) for \gamma > 0
Gaussian | exp(-\gamma ||x - x_i||^2)
Laplace | exp(-\gamma ||x - x_i||)
Cauchy | 1 / (1 + \gamma ||x - x_i||^2)
Hypersecant | 2 / (exp(\gamma ||x - x_i||) + exp(-\gamma ||x - x_i||))
Square sinc | sin^2(\gamma ||x - x_i||) / (\gamma ||x - x_i||)^2

Table 2: Kernel functions.

Table 5 shows the comparison of maximum classification accuracy between both ODT-SVM models, conventional OvO, conventional OAA and two commonly used multi-class classifiers from the literature, i.e. C4.5 and the Multi-Layer Perceptron (MLP). The C4.5 and MLP implementations of the WEKA machine learning environment [26] are used in our experiments with their default parameter values. The best classification accuracy for each dataset is shown in bold. When we apply the Friedman test with 7 algorithms and 11 datasets, F_F is distributed according to the F distribution with 7 - 1 = 6 and (7-1) x (11-1) = 60 degrees of freedom. The critical value of F(6, 60) at the 0.05 significance level is 2.25, and F_F calculated from the mean ranks is 12.31. Since 12.31 > 2.25, we can reject the null hypothesis and infer that there exists a significant difference among the compared classifiers.
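For readers who wish to reproduce this style of analysis, a minimal sketch (ours) using SciPy is shown below. The accuracy matrix is a placeholder, and the Iman-Davenport F_F correction and the Nemenyi critical difference follow the formulas given by Demsar [8]; the critical value q_alpha is taken from the table in that paper for 7 classifiers at the 0.05 level.

    import numpy as np
    from scipy.stats import friedmanchisquare, rankdata

    # rows = datasets, columns = classifiers (placeholder accuracies; one row per dataset)
    acc = np.array([[98.0, 98.0, 98.0, 98.0, 98.0, 96.0, 97.3],
                    [91.2, 90.0, 91.7, 93.6, 94.0, 85.7, 88.1]])
    N, k = acc.shape

    chi2_F, p_value = friedmanchisquare(*acc.T)         # Friedman chi-square over the k classifiers
    F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)     # Iman-Davenport statistic, df = (k-1, (k-1)(N-1))

    mean_ranks = np.mean([rankdata(-row) for row in acc], axis=0)   # rank 1 = best on each dataset
    q_alpha = 2.949                                     # Studentized-range value for k = 7, alpha = 0.05 (Demsar)
    CD = q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))     # Nemenyi critical difference
    print(F_F, mean_ranks, CD)

Two classifiers are then declared significantly different whenever their mean ranks differ by more than CD, which is exactly what the horizontal bars in Figures 1-5 visualize.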
Dataset | Kernel | OvO | ED | SC | IG | Gini | Chi-square
Iris | Gaussian | 98 | 98 | 98 | 98 | 98 | 98
Iris | Laplace | 96 | 96 | 96 | 96 | 96 | 96
Iris | Cauchy | 98 | 98 | 98 | 98 | 98 | 98
Iris | Hypersecant | 98 | 98 | 98 | 98 | 98 | 98
Iris | Square sinc | 95.33 | 94.67 | 97.45 | 98 | 97.99 | 98
Satimage | Gaussian | 89.43 | 88.87 | 92.89 | 89.66 | 89.57 | 91.99
Satimage | Laplace | 91.21 | 90.76 | 92.41 | 91.23 | 91.42 | 91.12
Satimage | Cauchy | 90.46 | 90.61 | 92.78 | 89.34 | 92.98 | 90.89
Satimage | Hypersecant | 89.71 | 90.09 | 93.78 | 91.34 | 91.98 | 93.98
Satimage | Square sinc | 74.51 | 76.43 | 77.87 | 78.86 | 77.8 | 76.76
Wine | Gaussian | 82.58 | 98.98 | 96.32 | 96.65 | 97.98 | 97.52
Wine | Laplace | 82.58 | 92.23 | 92.67 | 95.55 | 93.58 | 91.58
Wine | Cauchy | 82.02 | 82.02 | 82.02 | 82.62 | 82.87 | 82.02
Wine | Hypersecant | 93.26 | 93.36 | 93.26 | 94.26 | 92.13 | 93.26
Wine | Square sinc | 75.28 | 76.89 | 76.97 | 77.97 | 75.28 | 76.97
Vehicle | Gaussian | 76.83 | 79.95 | 84.45 | 84.82 | 85.24 | 85.24
Vehicle | Laplace | 77.42 | 78.3 | 80.61 | 81.61 | 80.74 | 80.24
Vehicle | Cauchy | 76.48 | 82.85 | 84.52 | 84.52 | 84.58 | 85.65
Vehicle | Hypersecant | 83.33 | 81.59 | 84.28 | 84.28 | 84.98 | 84.98
Vehicle | Square sinc | 71.51 | 80.57 | 81.32 | 81.32 | 81.99 | 81.56
Glass | Gaussian | 72.43 | 62.9 | 71.5 | 71.21 | 69.63 | 75.43
Glass | Laplace | 75.70 | 76.17 | 76.64 | 74.64 | 75.7 | 77.57
Glass | Cauchy | 72.90 | 72.21 | 72.43 | 71.03 | 68.92 | 73.36
Glass | Hypersecant | 71.96 | 72.90 | 71.5 | 70.09 | 69.16 | 71.03
Glass | Square sinc | 66.36 | 64.2 | 64.78 | 62.62 | 62.62 | 56.07
Segmentation | Gaussian | 84.76 | 87.85 | 89.24 | 89.29 | 88.87 | 87.43
Segmentation | Laplace | 87.14 | 88.19 | 89.62 | 89.67 | 89.87 | 90.87
Segmentation | Cauchy | 86.19 | 91 | 89.87 | 89.69 | 89.71 | 89.99
Segmentation | Hypersecant | 90 | 89.14 | 90.87 | 89.52 | 92.05 | 91.95
Segmentation | Square sinc | 81.9 | 79.05 | 79.52 | 80.13 | 80.95 | 79.05
Ecoli | Gaussian | 85.42 | 84.98 | 85.42 | 86.01 | 84.79 | 86.14
Ecoli | Laplace | 86.97 | 87.2 | 86.99 | 86.9 | 86.99 | 87.9
Ecoli | Cauchy | 85.42 | 85.42 | 86.98 | 84.82 | 86.52 | 85.71
Ecoli | Hypersecant | 85.42 | 85.42 | 88.45 | 88.82 | 85.42 | 85.74
Ecoli | Square sinc | 85.12 | 84.23 | 87.65 | 85.12 | 87.85 | 86.15
NewThyroid | Gaussian | 97.21 | 97.21 | 97.21 | 100 | 98.21 | 97.21
NewThyroid | Laplace | 96.74 | 96.74 | 100 | 96.89 | 96.89 | 96.74
NewThyroid | Cauchy | 96.74 | 96.84 | 100 | 96.89 | 96.89 | 97.74
NewThyroid | Hypersecant | 97.67 | 100 | 98.67 | 98.89 | 100 | 100
NewThyroid | Square sinc | 94.88 | 94.88 | 96.49 | 96.49 | 96.49 | 96.49
Yeast | Gaussian | 58.43 | 59.92 | 62.40 | 60.32 | 63.95 | 61.92
Yeast | Laplace | 59.56 | 60.31 | 62.43 | 63.32 | 65.17 | 61.86
Yeast | Cauchy | 61.54 | 62.37 | 64.17 | 65.41 | 66.54 | 62.54
Yeast | Hypersecant | 59.45 | 67.76 | 67.65 | 67.54 | 65.56 | 64.76
Yeast | Square sinc | 57.98 | 57.32 | 58.68 | 59.45 | 60.12 | 58.98
MovementLibra | Gaussian | 74.45 | 79.94 | 81.91 | 78.94 | 77.74 | 76.94
MovementLibra | Laplace | 81.73 | 83.77 | 85.77 | 85.87 | 87.97 | 82.77
MovementLibra | Cauchy | 76.45 | 78.13 | 79.89 | 78.54 | 76.89 | 78.89
MovementLibra | Hypersecant | 72.23 | 78.64 | 77.65 | 73.64 | 77.18 | 76.64
MovementLibra | Square sinc | 42.22 | 42.22 | 42.22 | 42.22 | 42.22 | 42.22
HeartDiseaseCleveland | Gaussian | 43.56 | 41.23 | 34.87 | 42.27 | 12.54 | 43.56
HeartDiseaseCleveland | Laplace | 22.11 | 23.43 | 23.91 | 32.31 | 11.88 | 22.11
HeartDiseaseCleveland | Cauchy | 15.23 | 34.38 | 17.47 | 13.45 | 12.21 | 15.18
HeartDiseaseCleveland | Hypersecant | 22.44 | 24.34 | 25.32 | 25.41 | 12.21 | 22.44
HeartDiseaseCleveland | Square sinc | 12.78 | 15.38 | 12.23 | 11.23 | 13.43 | 12.09

Table 3: Classification accuracy of conventional OvO vs OvO ODT-SVMs [%]. ED, SC, IG, Gini and Chi-square denote the OvO ODT-SVM variants, where ED is the Euclidean distance measure and SC the scatter-matrix-based measure.
Dataset | Kernel | OAA | ED | SC | IG | Gini | Chi-square
Iris | Gaussian | 96 | 98 | 98 | 98 | 98 | 98
Iris | Laplace | 96 | 96 | 96 | 96.44 | 96 | 96
Iris | Cauchy | 96 | 98 | 98 | 98 | 98 | 98
Iris | Hypersecant | 98 | 98 | 98 | 98 | 98 | 98
Iris | Square sinc | 96 | 94.67 | 98 | 98 | 98 | 98
Satimage | Gaussian | 89.43 | 88.87 | 92.89 | 89.66 | 89.57 | 91.99
Satimage | Laplace | 89.95 | 90.76 | 92.41 | 91.23 | 91.42 | 91.12
Satimage | Cauchy | 89.46 | 90.61 | 92.78 | 89.34 | 92.98 | 90.89
Satimage | Hypersecant | 89.71 | 90.09 | 93.78 | 91.34 | 91.98 | 93.98
Satimage | Square sinc | 74.51 | 76.43 | 77.87 | 78.86 | 77.8 | 76.76
Wine | Gaussian | 82.01 | 98.98 | 96.32 | 96.65 | 97.98 | 97.52
Wine | Laplace | 82.58 | 92.23 | 92.67 | 95.55 | 93.58 | 91.58
Wine | Cauchy | 82.15 | 82.02 | 82.02 | 82.62 | 82.87 | 82.02
Wine | Hypersecant | 93.82 | 93.36 | 93.26 | 94.26 | 92.13 | 93.26
Wine | Square sinc | 74.72 | 76.89 | 76.97 | 77.97 | 75.28 | 76.97
Vehicle | Gaussian | 84.63 | 79.95 | 84.45 | 84.82 | 85.24 | 85.24
Vehicle | Laplace | 80.61 | 78.3 | 80.61 | 81.61 | 80.74 | 80.24
Vehicle | Cauchy | 84.52 | 82.85 | 84.52 | 84.52 | 84.58 | 85.65
Vehicle | Hypersecant | 84.87 | 81.59 | 84.28 | 84.28 | 84.98 | 84.98
Vehicle | Square sinc | 78.45 | 80.57 | 81.32 | 81.32 | 81.99 | 81.56
Glass | Gaussian | 60.75 | 62.9 | 71.5 | 71.21 | 69.63 | 75.43
Glass | Laplace | 76.17 | 76.17 | 76.64 | 74.64 | 75.7 | 77.57
Glass | Cauchy | 68.69 | 72.21 | 72.43 | 71.03 | 68.92 | 73.36
Glass | Hypersecant | 63.55 | 72.9 | 71.5 | 70.09 | 69.16 | 71.03
Glass | Square sinc | 61.21 | 64.2 | 64.78 | 62.62 | 62.62 | 56.07
Segmentation | Gaussian | 84.36 | 87.85 | 89.24 | 89.29 | 88.87 | 87.43
Segmentation | Laplace | 86.19 | 88.19 | 89.62 | 89.67 | 89.87 | 90.87
Segmentation | Cauchy | 85.71 | 91 | 89.87 | 89.69 | 89.71 | 89.99
Segmentation | Hypersecant | 88.57 | 89.14 | 90.87 | 89.52 | 92.05 | 91.95
Segmentation | Square sinc | 80.48 | 79.05 | 79.52 | 80.13 | 80.95 | 79.05
Ecoli | Gaussian | 84.01 | 84.98 | 85.42 | 86.01 | 84.79 | 86.14
Ecoli | Laplace | 86.90 | 87.2 | 86.99 | 86.9 | 86.99 | 87.9
Ecoli | Cauchy | 86.90 | 85.42 | 86.98 | 84.82 | 86.52 | 85.71
Ecoli | Hypersecant | 82.74 | 85.42 | 88.45 | 88.82 | 85.42 | 85.74
Ecoli | Square sinc | 82.74 | 84.23 | 87.65 | 85.12 | 87.85 | 86.15
NewThyroid | Gaussian | 95.45 | 97.98 | 97.54 | 100 | 98.21 | 97.89
NewThyroid | Laplace | 96.78 | 98.89 | 100 | 98.52 | 96.89 | 98.49
NewThyroid | Cauchy | 97.34 | 97.65 | 98.94 | 97.89 | 100 | 97.96
NewThyroid | Hypersecant | 96.38 | 100 | 98.99 | 98.89 | 98.90 | 98.99
NewThyroid | Square sinc | 93.54 | 95.56 | 96.49 | 96.78 | 98.67 | 97.45
Yeast | Gaussian | 59.65 | 59.92 | 63.90 | 61.54 | 66.65 | 62.32
Yeast | Laplace | 61.26 | 61.43 | 64.77 | 65.54 | 67.65 | 64.23
Yeast | Cauchy | 59.46 | 63.23 | 64.17 | 67.41 | 66.54 | 65.43
Yeast | Hypersecant | 59.99 | 68.72 | 68.65 | 65.54 | 65.56 | 61.34
Yeast | Square sinc | 56.54 | 58.32 | 58.68 | 59.45 | 60.12 | 59.98
MovementLibra | Gaussian | 73.44 | 77.76 | 85.21 | 82.61 | 76.41 | 76.34
MovementLibra | Laplace | 82.48 | 84.43 | 84.67 | 84.42 | 85.54 | 83.75
MovementLibra | Cauchy | 76.14 | 78.93 | 74.71 | 85.54 | 78.65 | 78.89
MovementLibra | Hypersecant | 73.43 | 77.84 | 78.45 | 77.61 | 78.65 | 74.64
MovementLibra | Square sinc | 42.2 | 42.22 | 42.22 | 42.22 | 42.22 | 42.22
HeartDiseaseCleveland | Gaussian | 43.43 | 42.36 | 44.54 | 42.51 | 12.21 | 43.53
HeartDiseaseCleveland | Laplace | 29.13 | 43.03 | 23.34 | 31.43 | 16.78 | 24.12
HeartDiseaseCleveland | Cauchy | 45.73 | 49.38 | 27.47 | 43.45 | 22.35 | 25.38
HeartDiseaseCleveland | Hypersecant | 22.44 | 24.64 | 27.62 | 25.32 | 13.61 | 21.45
HeartDiseaseCleveland | Square sinc | 12.78 | 15.48 | 13.61 | 11.23 | 12.43 | 14.37

Table 4: Classification accuracy of conventional OAA vs OAA ODT-SVMs [%]. ED, SC, IG, Gini and Chi-square denote the OAA ODT-SVM variants.

Dataset | OvO | OAA | SVM-BDT | OvO ODT-SVM | OAA ODT-SVM | C4.5 | MLP
Iris | 98 | 98 | 98 | 98 | 98 | 96 | 97.33
Satimage | 91.21 | 89.95 | 91.65 | 93.61 | 93.98 | 85.7 | 88.12
Wine | 93.26 | 96.45 | 92.63 | 96.76 | 98.98 | 90.96 | 92.51
Vehicle | 83.33 | 84.87 | 82.98 | 84.95 | 85.65 | 71.83 | 81.98
Glass | 75.7 | 76.17 | 72.69 | 76.17 | 77.57 | 67.61 | 63.08
Segmentation | 90 | 90 | 90 | 93.78 | 92.05 | 86.6 | 90.48
Ecoli | 87.97 | 86.9 | 85.78 | 89.98 | 88.82 | 85.08 | 84.34
NewThyroid | 97.67 | 97.34 | 100 | 100 | 100 | 91.59 | 95.33
Yeast | 61.54 | 61.26 | 68.59 | 67.65 | 68.65 | 56.71 | 61.43
MovementLibra | 81.73 | 82.48 | 87.45 | 87.97 | 85.54 | 67.13 | 80.78
HeartDiseaseCleveland | 43.56 | 45.73 | 49.43 | 43.56 | 49.38 | 50.33 | 52.65

Table 5: Comparison of the best average classification accuracy of ODT-SVMs with other multi-class classifiers.

To determine which classifiers are significantly different, we carried out the Nemenyi test, whose results are illustrated in Figure 1.
In this figure, the mean rank of each classifier is marked by a circle. The horizontal bar across each circle indicates the "critical difference". The performance of two methods is significantly different if their corresponding mean ranks differ by at least the critical difference, i.e. if their horizontal bars do not overlap. Figure 1 reveals that OvO ODT-SVM is significantly different from C4.5 and MLP, but not significantly different from conventional OvO, conventional OAA, SVM-BDT and OAA ODT-SVM in terms of classification accuracy. Rather, we can say that the proposed scheme is comparable with all the variants of SVM. Table 6 and Table 7 show the computation time of OvO ODT-SVM for the training and testing phase respectively, for the Gaussian kernel with \gamma = 2^{-11} and C = 1000. Table 6 and Table 7 show that the time required for training and testing of OvO ODT-SVM is significantly less than for the conventional OvO SVM and SVM-BDT approaches. Similarly, Table 8 and Table 9 show that the time required for training and testing of OAA ODT-SVM is significantly less than for conventional OAA SVM. The Friedman test indicates that there exist significant differences among the OvO classifiers in both training and testing time. Figure 2 and Figure 3 illustrate the results of the Nemenyi test, revealing what those differences are among the OvO classifiers in training and testing time respectively. Consistent with our time complexity analysis, all variants of the OvO ODT-SVM scheme other than the Gini measure are the most efficient in terms of training time in comparison to conventional OvO and SVM-BDT. Figure 3 shows that OvO ODT-SVM using Gini is ranked best and is significantly better than conventional OvO and SVM-BDT. The variants of the OvO ODT-SVM scheme are not significantly different from each other in terms of either training or testing time. Similarly, the Friedman test indicates that there exist significant differences among the OAA schemes in both training and testing time. Figure 4 and Figure 5 illustrate the results of the Nemenyi test, revealing what those differences are among the OAA schemes in training and testing time respectively. Again consistent with our time complexity analysis, all variants of the OAA ODT-SVM scheme other than the Gini measure are the most efficient in terms of training time in comparison to conventional OAA. Figure 5 shows that OAA ODT-SVM using IG is ranked best and is significantly better than conventional OAA. As with the OvO schemes, the variants of the OAA ODT-SVM scheme are not significantly different from each other in terms of either training or testing time. Among the four measures employed for determining the structure of the decision tree, none is a clear winner over the others in terms of computation time for training and testing.

Figure 1: The Nemenyi test applied to the mean ranks of classification accuracy of the various classifiers (OvO, OAA, SVM-BDT, OvO ODT-SVM, OAA ODT-SVM, C4.5, MLP).
Dataset | OvO | SVM-BDT | ED | SC | IG | Gini | Chi-square
Iris | 2.84 | 5.18 | 2.15 | 2.13 | 2.19 | 2.17 | 2.79
Satimage | 1065.06 | 2852.82 | 867.71 | 864.77 | 873.48 | 958.32 | 871.47
Wine | 2.15 | 3.53 | 1.96 | 2.08 | 2.11 | 2.13 | 2.05
Vehicle | 189.96 | 508.81 | 154.76 | 154.24 | 155.79 | 170.92 | 155.43
Glass | 12.31 | 16.89 | 8.93 | 8.60 | 9.05 | 10.06 | 8.99
Segmentation | 3.00 | 5.26 | 2.20 | 2.52 | 2.49 | 2.45 | 2.39
Ecoli | 26.48 | 45.90 | 17.28 | 24.09 | 20.07 | 21.97 | 17.55
NewThyroid | 5.31 | 8.54 | 3.93 | 3.13 | 2.99 | 3.23 | 3.58
Yeast | 245.87 | 675.43 | 178.62 | 176.32 | 173.34 | 198.14 | 189.54
MovementLibra | 534.78 | 1432.76 | 433.56 | 437.76 | 446.34 | 476.73 | 436.42
HeartDiseaseCleveland | 8.42 | 13.23 | 4.67 | 6.87 | 5.43 | 6.48 | 6.65

Table 6: Training time of OvO ODT-SVM (ED, SC, IG, Gini, Chi-square variants) vs OvO and SVM-BDT [sec].

Dataset | OvO | SVM-BDT | ED | SC | IG | Gini | Chi-square
Iris | 0.03 | 0.03 | 0.03 | 0.01 | 0.02 | 0.02 | 0.02
Satimage | 10.44 | 18.53 | 9.96 | 9.32 | 9.07 | 7.79 | 8.53
Wine | 0.03 | 0.06 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03
Vehicle | 0.44 | 0.79 | 0.43 | 0.44 | 0.43 | 0.33 | 0.36
Glass | 0.04 | 0.06 | 0.02 | 0.04 | 0.02 | 0.03 | 0.02
Segmentation | 0.05 | 0.07 | 0.04 | 0.05 | 0.04 | 0.05 | 0.04
Ecoli | 0.05 | 0.06 | 0.06 | 0.04 | 0.04 | 0.04 | 0.04
NewThyroid | 0.07 | 0.09 | 0.06 | 0.05 | 0.06 | 0.05 | 0.05
Yeast | 0.98 | 1.79 | 0.86 | 0.89 | 0.86 | 0.69 | 0.78
MovementLibra | 5.89 | 9.89 | 4.96 | 4.89 | 4.43 | 3.86 | 4.54
HeartDiseaseCleveland | 0.09 | 0.10 | 0.05 | 0.05 | 0.04 | 0.05 | 0.06

Table 7: Testing time of OvO ODT-SVM (ED, SC, IG, Gini, Chi-square variants) vs OvO and SVM-BDT [sec].

Figure 2: The Nemenyi test applied to the mean ranks of training time of the OvO schemes (OvO, SVM-BDT, ED, SC, IG, Gini, Chi-square).

Figure 3: The Nemenyi test applied to the mean ranks of testing time of the OvO schemes (OvO, SVM-BDT, ED, SC, IG, Gini, Chi-square).

Dataset | OAA | ED | SC | IG | Gini | Chi-square
Iris | 28.09 | 12.13 | 21.65 | 12.07 | 15.95 | 15.75
Satimage | 3451.82 | 858.09 | 945.91 | 940.30 | 989.37 | 989.43
Wine | 5.20 | 2.95 | 2.53 | 2.51 | 2.52 | 2.52
Vehicle | 615.65 | 153.04 | 168.71 | 167.71 | 176.46 | 453.63
Glass | 135.60 | 22.85 | 27.18 | 27.18 | 69.02 | 21.34
Segmentation | 6.70 | 2.17 | 2.53 | 2.51 | 4.80 | 4.82
Ecoli | 80.68 | 23.34 | 30.02 | 31.02 | 78.42 | 79.80
NewThyroid | 10.34 | 4.54 | 4.23 | 5.67 | 5.13 | 5.67
Yeast | 1278.76 | 306.98 | 367.87 | 306.76 | 359.54 | 567.87
MovementLibra | 1734.41 | 428.87 | 472.87 | 478.24 | 478.91 | 456.69
HeartDiseaseCleveland | 102.34 | 24.76 | 23.43 | 21.23 | 32.23 | 24.54

Table 8: Training time of OAA vs OAA ODT-SVM (ED, SC, IG, Gini, Chi-square variants) [sec].

Dataset | OAA | ED | SC | IG | Gini | Chi-square
Iris | 0.02 | 0.03 | 0.01 | 0.01 | 0.01 | 0.01
Satimage | 13.95 | 7.72 | 9.07 | 7.59 | 9.05 | 7.87
Wine | 0.03 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02
Vehicle | 0.59 | 0.33 | 0.39 | 0.32 | 0.38 | 0.33
Glass | 0.05 | 0.02 | 0.04 | 0.04 | 0.03 | 0.03
Segmentation | 0.05 | 0.04 | 0.03 | 0.03 | 0.04 | 0.04
Ecoli | 0.05 | 0.04 | 0.04 | 0.03 | 0.04 | 0.03
NewThyroid | 0.07 | 0.05 | 0.05 | 0.04 | 0.04 | 0.05
Yeast | 1.28 | 0.76 | 0.89 | 0.66 | 0.78 | 0.69
MovementLibra | 6.89 | 3.78 | 4.57 | 3.87 | 4.56 | 4.08
HeartDiseaseCleveland | 0.05 | 0.04 | 0.03 | 0.04 | 0.02 | 0.03

Table 9: Testing time of OAA vs OAA ODT-SVM (ED, SC, IG, Gini, Chi-square variants) [sec].

Figure 4: The Nemenyi test applied to the mean ranks of training time of the OAA schemes (OAA, ED, SC, IG, Gini, Chi-square).

Figure 5: The Nemenyi test applied to the mean ranks of testing time of the OAA schemes (OAA, ED, SC, IG, Gini, Chi-square).

6 Conclusion

In this paper, we evaluated the performance, in terms of classification accuracy and computation time, of the proposed OvO ODT-SVM and OAA ODT-SVM using the statistical measures, i.e. information gain, Gini index, chi-square and scatter-matrix-based class separability in kernel-induced space. We have also shown theoretically that the computation time of training and testing of both ODT-SVMs using statistical measures is better than that of conventional SVMs. A robust non-parametric test was carried out for the statistical comparison of classifiers over multiple data sets.
The results of the experiments on UCI repository datasets indicate that the accuracy of our proposed framework is significantly better than that of conventional OvO SVM, conventional OAA SVM and two widely used multi-class classifiers, C4.5 and MLP, for most of the datasets. Our experimental results also demonstrate that the computation time of the proposed ODT-SVM formulations is significantly less than that of the conventional SVM and SVM-BDT models. The statistical tests performed over multiple classifiers also show that the performance of the ODT-SVM models is significantly better than that of other popular multi-class classifiers such as C4.5 and MLP. Among the four measures employed for determining the structure of the decision tree, none is a clear winner over the others in terms of computation time for training and testing.

Acknowledgements

We are thankful to the reviewers for their valuable comments, which have led to the improvement of the paper.

References

[1] Allwein, E., Schapire, R., & Singer, Y. (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. In Machine Learning: Proceedings of the Seventeenth International Conference.
[2] Blake, C. L., & Merz, C. J. (1998). UCI Repository of Machine Learning Databases. University of California, Irvine, Dept. of Information and Computer Science. http://www.ics.uci.edu/~mlearn/ML-Repository.htm
[3] Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. 5th Annual ACM Workshop on COLT (pp. 144-152).
[4] Bredensteiner, E., & Bennett, K. (1999). Multicategory classification by support vector machines. Computational Optimization and Applications, 12, 53-79.
[5] Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth.
[6] Cortes, C., & Vapnik, V. N. (1995). Support vector networks. Machine Learning, 20, 273-297.
[7] Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265-292.
[8] Demsar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1-30.
[9] Duda, R. O., & Hart, P. E. (1973). Pattern Classification and Scene Analysis. New York: J. Wiley.
[10] Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern Classification (second ed.). John Wiley & Sons.
[11] Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics, 11, 86-92.
[12] Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32, 675-701.
[13] Gjorgji, M., Dejan, G., & Ivan, C. (2009). A multi-class SVM classifier utilizing binary decision tree. Informatica, 33, 233-241.
[14] Hsu, C. W., & Lin, C. J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2), 415-425.
[15] Kittler, J., & Hojjatoleslami, A. (1998). A weighted combination of classifiers employing shared and distinct representations. IEEE Computer Vision and Pattern Recognition (pp. 924-929).
[16] Kreßel, U. (1999). Pairwise classification and support vector machines. Cambridge, MA: MIT Press.
[17] Bala, M., & Agrawal, R. K. (2010, July). Statistical measures to determine optimal decision tree based one-versus-one SVM. Accepted for publication in Defense Science Journal.
[18] Bala, M., & Agrawal, R. K. (2009). Evaluation of decision tree SVM framework using different statistical measures. International Conference on Advances in Recent Technologies in Communication and Computing (pp. 341-445). Kerala.
[19] Platt, J., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. Advances in Neural Information Processing Systems, 12, 547-553.
[20] Quinlan, J. R. (1987). Induction of decision trees. Machine Learning, 1, 81-106.
[21] Rifkin, R., & Klautau, A. (2004). In defence of one-vs-all classification. Journal of Machine Learning Research, 5, 101-141.
[22] Scholkopf, B., & Smola, A. (2002). Learning with Kernels. Cambridge, MA: MIT Press.
[23] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379-423, 623-656.
[24] Takahashi, F., & Abe, S. (2002). Decision-tree-based multiclass support vector machines. Proceedings of the Ninth International Conference on Neural Information Processing (ICONIP '02), 3, pp. 1418-1422. Singapore.
[25] Vapnik, V. N. (1998). Statistical Learning Theory. New York: John Wiley & Sons.
[26] Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (second ed.). Morgan Kaufmann.