https://doi.org/10.31449/inf.v44i4.3023 Informatica 44 (2020) 405–420

Decision Tree for Classification and Regression: A State-of-the-Art Review

Monalisa Jena and Satchidananda Dehuri
P.G. Department of Information and Communication Technology, Fakir Mohan University, Balasore, Odisha, India
E-mail: bmonalisa.26@gmail.com, satchi.lapa@gmail.com

Overview paper

Keywords: data mining, classification, regression, decision tree, prediction

Received: December 8, 2019

Classification and regression are defined under the umbrella of the prediction task of data mining. Discrete values are predicted using classification techniques, whereas regression techniques are most suitable for predicting continuous values. Analysts from different research areas like data mining, statistics, machine learning, pattern recognition, and big data analytics prefer decision trees over other classifiers because they are simple, effective, and efficient, and their performance is competitive with other classifiers in several cases. In this paper, we extensively review many popularly used state-of-the-art decision tree-based techniques for classification. Additionally, this work also reviews some of the decision tree-based techniques for regression. We present a review of more than forty years of research that has emphasized the application of decision trees in both classification and regression. This review could be a potential resource for all researchers who are keenly interested in applying decision tree-based classification/regression in their research work.

Povzetek (translated from Slovenian): This overview article provides an analysis of diverse decision and regression tree methods and techniques for data mining purposes.

1 Introduction

With the advancement of technologies, the process of data generation and collection is increasing at an exponential rate. Embedded sensors, IoT devices, and ubiquitous devices like scanners, bar code readers, and smartphones generate a huge amount of data, which contributes to the expansion of data size and volume [1] [2] [3]. Intuitively, the valuable hidden knowledge and information in this huge amount of accumulated data could be a potential source to enhance the decision-making capability of the decision-makers of an organization or society [4] [5] [6]. Classification techniques like the decision tree (DT), support vector machine (SVM), and random forest [7] [8] have been proven to be effective models for extracting knowledge that is valid, novel, and potentially useful. In a decision tree, interpretable rules together with the constraints can be extracted by the decision-maker without compromising the performance of the model [9] [10]. A decision tree is an acyclic graphical structure G(V, E), where V = V_1 ∪ V_2 is a finite, non-empty set of nodes; V_1 is the set of leaf nodes containing the class values, and V_2 is the set of intermediate nodes, each corresponding to one of the attributes. Similarly, the set of edges E represents distinct attribute values. The DT is one of the popularly used classifiers because of its intelligible nature that resembles human reasoning [11]. DT induction algorithms are preferred over other learning algorithms due to their flexibility, robustness to noise, low computational cost for model construction, and the ability to handle redundant attributes. They are quite simple and easy for human beings to understand, and their performance is comparable with other methods [12] in certain cases.
Decision trees can handle both classification and regression tasks. In classification, a discrete value is predicted, whereas a continuous value is predicted through regression [13]. Decision trees are also competent in handling unseen samples having multiple class labels [14]. A sample DT is depicted in Figure 1. In this figure, the DT is used to identify the type of contact lenses suitable for an individual having a set of features. It employs the lenses data set, one of the popular datasets collected from the University of California, Irvine (UCI) Machine Learning repository [15].

Figure 1: Decision tree to ascertain the type of contact lenses to be used by a person.

In Figure 1, the internal nodes and class labels are represented in the form of ovals and rectangles, respectively. Four different features, namely tear production rate, age, spectacle prescription, and astigmatic, and three class labels, namely hard, soft, and none, are considered in this example. A path {v_1, v_2, ..., v_n} drawn from v_1 to v_n represents the class prediction for a tuple, where v_1 is the root node, v_2 to v_{n-1} are the intermediate nodes, and v_n is the leaf node of that particular path. For example, for the tuple (age: presbyopic, astigmatic: yes, spectacle prescription: myope, tear production rate: reduced), the class label is "none". In this way, several rules can be extracted from the decision tree, and using those rules, the class label of an unseen sample can be predicted [16]. In Figure 1, the attribute "age" is taken as the root node. The root is selected using several attribute selection measures [17], and the splitting attribute is chosen at a particular node as per a well-defined splitting criterion. For example, the DT in Figure 1 is generated by applying entropy as the attribute selection measure. Hence, age becomes the root node as it is selected as the splitting attribute.
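As an illustration of this rule-extraction view, the short sketch below fits an entropy-based decision tree on a few hand-encoded tuples in the style of the lenses data and prints its root-to-leaf rules. It is only an assumed setup: scikit-learn is not mentioned in the paper, and the encodings and tuples are illustrative rather than the actual UCI lenses file.

```python
# Hedged sketch: fit an entropy-based decision tree on a tiny lenses-style sample
# and print its rules; scikit-learn and the encodings below are assumptions.
from sklearn.tree import DecisionTreeClassifier, export_text

# age: 0=young, 1=pre-presbyopic, 2=presbyopic; prescription: 0=myope, 1=hypermetrope
# astigmatic: 0=no, 1=yes; tear_rate: 0=reduced, 1=normal
X = [[0, 0, 0, 1], [0, 1, 1, 1], [2, 0, 1, 0],
     [2, 1, 0, 1], [1, 0, 0, 0], [1, 1, 1, 1]]
y = ["soft", "hard", "none", "soft", "none", "hard"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "prescription", "astigmatic", "tear_rate"]))
# Each printed root-to-leaf path corresponds to one classification rule.
```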
In the past few decades, a number of classification as well as regression tree algorithms have been proposed by several pioneers. Figure 2 gives an overall idea of the number of research papers published in the domain of DT for classification and regression from 1971 to date.

Figure 2: No. of research papers published over the years in the field of DT for classification and regression [paper sources: SCI, DBLP, and Scopus indexed journals and conferences].

An effort has been made to make an extensive review of the different classification and regression tree techniques, which would be helpful for beginners and enthusiastic researchers in this specific field of research. From Figure 2, it can be observed that over the years, research in this particular field has increased spectacularly because of the efficiency, performance, and effectiveness of DT in several application domains. Many researchers in the literature have presented reviews on classification and regression tree algorithms. Some of the works have missed a few parameters, some have provided just a brief overview, and some are outdated. Even though we intend to give a balanced discussion, some of the remarks certainly reflect the viewpoints of the authors. Lim et al. [18] have compared twenty-two decision tree algorithms based on performance parameters like accuracy and computation speed. Classification accuracy is measured by the mean error rate and the mean rank of error rate. Along with the decision tree algorithms, they have also presented nine statistical and two neural network algorithms. They have experimented on these algorithms using thirty-two datasets, out of which fourteen are from real-life domains, five are from the STATLOG project, two are synthetic, and the rest are from the UCI repository. Among the decision tree algorithms, QUEST with linear splits is found to have the highest accuracy, and logistic regression is the second best among all thirty-three algorithms. Podgorelec et al. [19] have limited their review work to decision trees specific to the field of medicine. They have presented alternatives to a few traditional induction approaches while emphasizing the existing and future applications in medicine. Perlich et al. [20] came up with a large-scale comparison between two famous classification models of that time, tree induction and logistic regression. Based on the class membership probabilities, they estimated classification accuracy and quality of rankings. They observed that logistic regression performed well for smaller training sets, while tree induction methods did so for comparatively larger datasets. Rokach and Maimon [21] have presented an updated survey of the decision tree induction algorithms of that time in a top-down manner. Besides, they suggested a unified algorithmic framework for presenting the decision tree induction algorithms and provided profound descriptions of the various pruning technologies and splitting criteria. They observed that most of the algorithms fit the framework with different stopping criteria and pruning methods. Barros et al. [22] have provided a review which mainly focuses on decision trees and evolutionary algorithms. They have presented a taxonomy that designs the decision tree components using evolutionary algorithms. They have also discussed various applications of evolutionary algorithms to decision tree induction in several domains. Loh [23] has presented a brief review of both classification and regression tree algorithms. In his paper, a brief comparison of the classification tree algorithms C4.5, RPART, QUEST, CRUISE, and GUIDE is presented using prediction accuracy as the performance measure. The author applied these algorithms on a cars dataset for the 1993 model year, and GUIDE appeared to have the highest prediction accuracy. For comparing regression tree models, he collected data from 654 children aged between 3 and 19 and applied those models on these datasets. The GUIDE linear regression tree model was found to have higher prediction accuracy than piecewise constant models. For classification trees, prediction error was measured by misclassification cost, and in the case of regression trees, it was measured by the squared difference between predicted and actual values. Loh [24] again performed a comprehensive review of the classification and regression tree algorithms that have been adopted in the last fifty years. In his paper, he focused on the majority of the algorithms that performed consistently well for a long period and for which software was widely available. The review work also provided the developments and key ideas supporting these algorithms.
He has also presented a comparative analysis of the classification tree models and the partitions given by all the classification tree models using the iris data from the UCI repository. A similar procedure has been followed for regression tree models using baseball data from StatLib. In contrast to others, we have presented a survey of the classification and regression tree algorithms in a technical yet easy-to-understand manner. We have provided an extensive review of DT algorithms that have consistently shown good performance and stood the test of time over the last forty years. We have also discussed the application details of the techniques in various domains under DT for classification as well as regression. This paper could be a potential resource for future researchers and enthusiastic readers to get an overall idea about which algorithm works best in which domain, so that they can choose accordingly as per their requirements. Additionally, we have given a comparative view of the algorithms, which highlights the suitability of each algorithm in the respective domains. It also presents the advantages and disadvantages of each algorithm in several domains.

The rest of the sections are set out as follows. In Section 2, the DT induction algorithm is discussed and the classification tree techniques are explained in detail. Section 3 highlights the application details of the techniques explained in Section 2. A comparative analysis of various classification tree algorithms is presented in Section 4. In Section 5, the DT algorithms used for regression are explained in a simplified manner. Sections 6 and 7 incorporate application details and comparative analysis of the techniques reviewed under DT for regression, respectively. Sections 3-7 will help beginners in deciding which algorithms to choose for their experimental work, as they will get a broad perspective of the different techniques. Finally, in Section 8, the paper is concluded along with future works.

2 DT as a classifier

Classification is a way of fitting objects to the category which best suits their characteristics. Classification is a two-step process in which the first step constructs the classifier by examining a training set containing the attributes and their associated class labels [25]. This step is called the training or learning phase [26] [27]. The second step is known as the classification phase, where the performance of the classifier is measured on the testing dataset. If the performance is found up to the mark, then the learned rules are applied to unknown data tuples to predict their class labels [28]. Classification intends to distinguish the discrete category of a new sample by contemplating a training dataset. Mathematically, the classification process can be presented as a function as follows [29]:

C = f(X, \theta), \quad C \in L    (1)

where X is the feature vector, C is the class label of the new sample, f(\cdot) is the classification function, \theta is the parameter set of the classification function, and L is the set of class labels.
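The following sketch illustrates this two-phase process: a learning phase on training tuples followed by a classification phase on held-out tuples. The dataset and classifier choices are assumptions made purely for illustration; the paper does not prescribe a particular library.

```python
# Illustrative sketch of the two-step classification process: learn on training
# tuples, then measure performance on held-out test tuples (assumed libraries/data).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier().fit(X_train, y_train)        # training / learning phase
y_pred = clf.predict(X_test)                                 # classification phase
print("test accuracy:", accuracy_score(y_test, y_pred))
```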
The main objective of a DT is to represent the maximum possible number of training tuples correctly with good performance [30]. The decision tree is constructed by observing the behavior of the training tuples. This procedure is known as decision tree induction [31]. The attribute values of a tuple whose corresponding class label is unknown are tested against the decision tree. In that way, the path traced from the root node to a leaf is used to obtain several possible intelligent classification rules.

The entire DT induction procedure is explained in Algorithm 1. The algorithm starts with a training set and an empty tree. In step 1, a single node N is generated. If the instances are all of the same class, then a node is appended to the tree containing that class (step 2). Step 3 illustrates the terminating condition: when the attribute list becomes empty, the leaf node of the DT contains the class label whose occurrence is highest. This is called the majority-voting approach [32]. Otherwise, the attribute that splits the dataset into the best partitions is found using attribute selection methods, and the node is labelled with it (step 4). The remaining steps focus on the splitting criterion and the possible subsets that result from partitioning the tuples as per the splitting criterion.

Algorithm 1: Decision Tree Induction Method
Input: dataset S with attribute vector X = {x_1, x_2, ..., x_n}; each tuple in T = {t_1, t_2, ..., t_p} has an associated class label from L = {l_1, l_2, ..., l_m}.
Output: decision tree.
Procedure DT_Induction(S, X):
1. Generate a node N.
2. If every tuple t_i in S belongs to the same class C, then return N labelled with C.
3. If X = ∅, then return N as a leaf labelled with the majority class, i.e., L = max{count(L_i)}, 1 ≤ i ≤ m; otherwise, find the best splitting criterion by applying attribute selection methods.
4. Label node N with the splitting attribute a obtained in step 3.
5. If a is discrete and the tree is not restricted to binary splits, set X = X − {a}.
6. For each distinct outcome i of a: divide the dataset into partitions S_i; if S_i = ∅, connect a leaf labelled with the majority class max{count(L_i)}, 1 ≤ i ≤ m, to N; otherwise, link the node returned by DT_Induction(S_i, X) to N.
7. If a is discrete and the tree type is binary, split at node N in such a manner that one branch contains the tuples satisfying a ∈ S_a and the other contains the remaining tuples.
8. If a is continuous, form two splits at split-point c: split A contains the tuples with a > c, and split B contains the tuples with a ≤ c.
9. If a partition is not pure and further splitting is possible, repeat the procedure on that partition.
10. Return N.

While inducing a decision tree, the splitting criterion is the most important factor to be considered [33]. The splitting criterion helps in choosing the attribute that divides the tuples in the dataset into partitions containing individual classes by making a test at node N. Hence, the split-point or the splitting subsets are determined according to the decision tree induction algorithm [34]. The dataset is partitioned with the aim that each of the partitions should be as pure as possible. If all samples in a partition of the dataset belong to the same class, the partition is said to be pure. For an attribute A having x values a_1, a_2, ..., a_x, if it is discrete-valued, a set of branches is created corresponding to each attribute value. If it is continuous, then the possible splits are of the form a ≤ c for one partition and a > c for the other, where c is the split-point. If the attribute is discrete and only binary trees are to be generated, then the split is of the form a ∈ S_a, where S_a is the splitting subset for attribute A. The scenario is depicted in Figure 3.

Figure 3: Different ways of partitioning tuples based on the splitting criterion: (a) a is discrete-valued, (b) a is continuous-valued, (c) a is discrete and the tree type is binary.
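The sketch below enumerates candidate splits for a single attribute under the three partitioning forms just described (multiway branches on distinct values, thresholds at midpoints, and binary subset splits). It is an illustrative sketch only; the function name and representation are assumptions, not part of any reviewed algorithm.

```python
# Sketch: enumerate candidate splits for one attribute under the three forms above.
from itertools import combinations

def candidate_splits(values, kind, binary_tree=False):
    distinct = sorted(set(values))
    if kind == "continuous":
        # midpoints between adjacent distinct values: a <= c vs. a > c
        return [("threshold", (distinct[i] + distinct[i + 1]) / 2.0)
                for i in range(len(distinct) - 1)]
    if binary_tree:
        # proper non-empty subsets S_a: a in S_a vs. a not in S_a
        # (a subset and its complement describe the same binary split)
        return [("subset", set(s))
                for r in range(1, len(distinct))
                for s in combinations(distinct, r)]
    # one branch per distinct value of a discrete attribute
    return [("multiway", distinct)]

print(candidate_splits([2.3, 1.1, 4.5], "continuous"))
print(candidate_splits(["red", "green", "blue"], "discrete", binary_tree=True))
```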
Several decision tree algorithms have been proposed for the classification task of data mining by many pioneers in the fields of machine learning and data mining. In this paper, we discuss some of the popularly used algorithms and their working patterns.

2.1 THeta automatic interaction detection (THAID)

This is the first published classification tree algorithm, proposed by Messenger and Mandell [35]. It follows the concept of Automatic Interaction Detection (AID); AID is discussed in detail in Section 5.1. THAID uses datasets having categorical variables. The node impurity at each node is measured based on the statistical distribution of the dependent variables over the mean. THAID searches exhaustively over all the attributes X, finds a set S which most reduces the node impurity of its children, and then splits the node on {X ∈ S}. If X is ordered, then S ∈ (−∞, c]. Otherwise, S ⊆ D(X), where D(X) represents the set of possible values of X (the domain of X). This procedure is repeated for the tuples in each child node, and splitting halts when the relative decrease in node impurity becomes less than a predetermined threshold. THAID merges similar categories of the predictors for tree pruning.

2.2 CHi-squared automatic interaction detection (CHAID)

This algorithm is an extension of the AID approach, where the chi-square statistical test is employed to find the best split for each independent variable [36]. It was initially developed for classification and later extended to the task of regression. The algorithm can be applied to samples having categorical variables, ordered variables with missing values, and ordered variables without missing values. CHAID performs better for categorical values in comparison to mixed-mode data values. If the variables are continuous, they are converted to categorical ones before applying the CHAID algorithm. If the sample consists of ordered variables with n distinct values, the chi-square test can be used to select the best suitable split out of the n − 1 possible splits. If it consists of categorical variables and each variable has n categories, it can have n splits. However, the number of splits can be lessened by applying Bonferroni-adjusted significance tests. The significance test for each predictor follows a sequential cross-tabulation approach, whose steps are put forward in Algorithm 2.

Algorithm 2: Sequential Cross-Tabulation Approach
1. Cross-tabulate the n categories of the independent variable with the m categories of the dependent variable.
2. Apply the chi-square test on the cross table and find the pair of categories of the independent variable which are least significantly different.
3. Merge the two categories identified in step 2.
4. Repeat steps 2 and 3 until no non-significant chi-square test result is obtained.
5. Select the attribute whose chi-square result is largest, and split it into k branches, where k ≤ l and l is the number of categories of the independent attribute obtained from the merging process.
6. Repeat step 5 until the stopping criterion is satisfied.

The major advantage of CHAID is that it reduces the computational complexity by reducing the number of categories for each predictor using the merging procedure.
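A rough sketch of this CHAID-style merging step (in the spirit of Algorithm 2) is given below, assuming scipy is available: the pair of predictor categories whose class distributions are least significantly different under a chi-square test is repeatedly merged until every remaining pair differs significantly. The data, threshold, and function names are illustrative assumptions.

```python
# Rough sketch of CHAID-style category merging (Algorithm 2), assuming scipy.
import numpy as np
from scipy.stats import chi2_contingency

def merge_categories(cross_tab, alpha=0.05):
    """cross_tab: dict mapping predictor category -> class-count vector (np.array)."""
    cats = dict(cross_tab)
    while len(cats) > 1:
        best_pair, best_p = None, -1.0
        for a in cats:
            for b in cats:
                if a < b:
                    _, p, _, _ = chi2_contingency(np.vstack([cats[a], cats[b]]))
                    if p > best_p:
                        best_pair, best_p = (a, b), p
        if best_p <= alpha:        # even the most similar pair differs significantly: stop
            break
        a, b = best_pair           # merge the least significantly different pair
        cats[f"{a}+{b}"] = cats.pop(a) + cats.pop(b)
    return cats

print(merge_categories({"young": np.array([8, 2]),
                        "middle": np.array([7, 3]),
                        "old": np.array([2, 9])}))
```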
2.3 Iterative dichotomizer (ID3)

ID3 employs entropy as a measure of node impurity [37] [38]. It uses ordered discrete attributes. The expected information, or entropy, relies on the probability of belongingness (P_i) of any tuple of a dataset D to a particular class. The entropy for n classes in a dataset can be computed as follows [17]:

En(D) = -\sum_{i=1}^{n} P_i \log_2(P_i)    (2)

where P_i = |S_i| / |S|, in which the denominator denotes the number of tuples in D and the numerator is the number of samples belonging to class C_i. In addition, the entropy of the partitions is to be calculated based on the values of an attribute t in the dataset D. For s distinct values {t_1, t_2, t_3, ..., t_s} of attribute t, the entropy of the partition with respect to t is:

En_t(D) = \sum_{i=1}^{s} \frac{|D_i|}{|D|} En(D_i)    (3)

where D is partitioned into s subsets {D_1, D_2, D_3, ..., D_s}, and En(D_i) is the entropy of partition D_i; D_i consists of the tuples in D having outcome t_i of the attribute t. This is required to obtain the exact classification of the instances. The information gain, G(t), is computed as follows:

G(t) = En(D) - En_t(D)    (4)

The attribute with the highest G(t), or equivalently the minimum En_t(D), is chosen as the splitting attribute. Originally, ID3 was proposed considering discrete data only, but it has later been applied to continuous data in several works. Some have considered the midpoint between each pair of adjacent values as a possible split-point, and some have used discretization to convert continuous data to discrete and then applied ID3 on that data. In the case of the midpoint procedure, the possible splits for an attribute t are of the form t ≤ c for one set of tuples and t > c for the other, where c is the split-point between an adjacent pair of attribute values t_i and t_{i+1}. The value of c can be calculated as c = (t_i + t_{i+1}) / 2.

2.4 C4.5

C4.5 is a descendant of ID3, proposed by J. R. Quinlan [39]. The major limitation of ID3 is that it gives preference to attributes having more values and more missing values. In order to overcome this problem, the gain ratio was adopted as the attribute selection measure instead of entropy. For s subsets D_1, D_2, ..., D_s of a dataset D, it uses the split information (SI_t) [17]:

SI_t(D) = -\sum_{i=1}^{s} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}    (5)

The gain ratio (GR) is the ratio of the information gain to SI_t(D):

GR(t) = \frac{G(t)}{SI_t(D)}    (6)

The attribute having the highest GR(t) value is chosen as the splitting attribute. A problem arises when SI_t(D) becomes negligible or tends to zero, as this leads to an unbalanced ratio; hence one constraint needs to be imposed, namely that G(t) should be large enough when the gain ratio is applied.

2.5 Classification and regression trees (CART)

In contrast to ID3 and C4.5, CART generates binary decision trees [40]. It works on both discrete and continuous data. It uses the Gini index (GI) as a measure of node impurity [41]:

GI(D) = 1 - \sum_{i=1}^{n} P_i^2    (7)

where P_i = |S_i| / |S| is the ratio of the number of tuples in the dataset belonging to a particular class to the total number of tuples present in D. For a binary split with respect to an attribute t, GI can be calculated as:

GI_t(D) = \sum_{i=1}^{2} \frac{|D_i|}{|D|} GI(D_i)    (8)

where D_i is the i-th partition of D induced by the split on t. Due to the binary split on attribute t, the reduction in impurity is computed as:

GI_{red}(t) = GI(D) - GI_t(D)    (9)

For each attribute, every feasible binary split is taken into consideration. The subset yielding the maximum GI_{red}(t), i.e., the minimum GI_t(D), is chosen as the splitting subset [42]. For continuous-valued attributes, CART uses the same midpoint procedure as ID3 to find a possible split-point.
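The short sketch below collects Eqs. (2)-(8) as plain numpy functions, purely to illustrate how the attribute selection measures are computed; the toy data and function names are assumptions, not part of the original algorithms.

```python
# Minimal numpy versions of the attribute-selection measures in Eqs. (2)-(8); a sketch.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))                          # Eq. (2)

def split_entropy(attr, labels):
    attr, labels = np.asarray(attr), np.asarray(labels)
    return sum((attr == v).mean() * entropy(labels[attr == v])
               for v in np.unique(attr))                    # Eq. (3)

def info_gain(attr, labels):
    return entropy(labels) - split_entropy(attr, labels)    # Eq. (4), used by ID3

def gain_ratio(attr, labels):
    si = entropy(attr)          # Eq. (5): split information equals the attribute's entropy
    return info_gain(attr, labels) / si if si > 0 else 0.0  # Eq. (6), used by C4.5

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)                             # Eq. (7), used by CART

attr = ["sunny", "sunny", "rain", "rain", "overcast"]
labels = ["no", "no", "yes", "yes", "yes"]
print(info_gain(attr, labels), gain_ratio(attr, labels), gini(labels))
```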
2.6 Fast and accurate classification trees (FACT)

The FACT decision tree algorithm for classification is similar to a recursive Linear Discriminant Analysis (LDA) procedure in which the tree is constructed with linear splits [43]. The number of children at each split is the same as the number of classes. In this algorithm, the predictors are ranked based on Analysis of Variance (ANOVA) F-tests, and the splitting procedure is performed on the selected predictor based on the LDA method [44]. Initially, all the categorical independent variables are transformed into ordered variables using an intermediate binary vector. One of the specialties of this algorithm is its procedure for handling missing values: it estimates the means and modes of the non-missing data values of ordered and categorical predictors, respectively, and substitutes those values in place of the missing values. The size of the tree is determined based on the stopping criteria of the ANOVA test [45]. The major advantage of this algorithm is that it is unbiased in the selection of ordered predictors at each level. However, it is biased towards categorical predictors, as LDA is employed to convert them into ordered ones. This limitation is addressed by the Quick Unbiased Efficient Statistical Tree (QUEST) algorithm, which removes the bias in the splitting of ordered variables.

2.7 Quick unbiased efficient statistical tree (QUEST)

QUEST is an efficient decision tree classifier that addresses the FACT algorithm's limitation of being biased towards the selection of categorical variables. QUEST uses the cross-tabulation approach of chi-squared tests and F-tests to handle categorical and ordered predictors, respectively, so as to give each a fair chance of selection [46]. When a binary split is required at a node with more than two classes, it merges the classes into two superclasses before the significance test is applied. If the variable is ordered, the split-point is chosen by quadratic discriminant analysis or by exhaustive search. If the variable is categorical, the point of splitting is chosen after transforming it into a larger discriminant coordinate. The major advantage of QUEST is that it improves the computational time over CART when variables with many categories exist.
2.8 Classification rules with unbiased interaction selection and estimation (CRUISE)

In CRUISE, each node is split into multiple branches, the number of which depends on the number of class labels associated with the independent variables [47]. It is an extension of QUEST. The variable selection at each level is based on the cross-tabulation approach used in CHAID, where the columns and rows of the cross table contain the predictors and class labels, respectively. Unlike QUEST, it performs significance tests between two independent variables, say X_i and X_j, instead of performing pairwise significance tests between two categories of the X variables [48]. If the significance test between X_i and X_j is found to be the best, X_i is chosen for splitting instead of X_j. The split-point is then identified by the LDA approach after the independent variable goes through a Box-Cox transformation. A major advantage of CRUISE is that it allows linear splits on all the variables and can fit an LDA model at each leaf node. Another advantage is that it is unbiased towards the selection of variables that have more missing values.

2.9 Generalized unbiased interaction detection and estimation (GUIDE)

GUIDE is an improved version of QUEST and CRUISE. It models the decision tree classifier by leveraging the strengths of both algorithms. It also reduces the limitations of CRUISE by minimizing the number of interaction tests among the categorical variables; the amount of computation is drastically reduced as it restricts the frequency of these tests. A multi-level searching technique is employed for splitting at a node when a significant interaction between two variables X_i and X_j is noticed. The first-level split of the node is performed on X_i and the second-level splits on X_j, in order to reduce the amount of impurity. The process is then repeated in the reverse order, i.e., X_j is considered for splitting at the first level and X_i at the second. The ordering whose reduction in impurity is greater is chosen to split the node. One of the advantages of GUIDE is that it can perform bivariate splits on two independent variables at a time, along with univariate splits. A bivariate linear split is preferable over a univariate one if the number of observations at a node is smaller than the number of independent variables.

2.10 Conditional inference tree (CTREE)

CTREE can handle ordered, nominal, continuous, censored, as well as multivariate attributes. It uses a combination of recursive binary partitioning and permutation test theory to select split variables [49]. Based on Bonferroni-adjusted p-values, it derives stopping rules to regulate the tree size instead of applying tree pruning. Like CART, it also uses surrogate splits to deal with missing values, and the number of surrogate splits can be regulated by defining the maximum number of surrogate splits using a function.
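As a rough illustration of permutation-based split selection with Bonferroni adjustment (in the spirit of CTREE, but not its actual test statistics), the sketch below estimates a permutation p-value for the association between each predictor and the class label, and stops splitting when the smallest Bonferroni-adjusted p-value is not significant. All names, the choice of statistic, and the data are assumptions.

```python
# Rough sketch (not CTREE's actual statistics): permutation p-values with a
# Bonferroni correction used as a stopping rule for split-variable selection.
import numpy as np

def perm_pvalue(x, y, n_perm=999, seed=0):
    """Permutation p-value for association between numeric x and binary y,
    using the absolute difference of class means as the statistic."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y)
    observed = abs(x[y == 1].mean() - x[y == 0].mean())
    count = 0
    for _ in range(n_perm):
        yp = rng.permutation(y)
        if abs(x[yp == 1].mean() - x[yp == 0].mean()) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

def select_split_variable(X, y, alpha=0.05):
    """Index of the most significant predictor, or None (stop splitting) if the
    smallest Bonferroni-adjusted p-value is not below alpha."""
    pvals = np.array([perm_pvalue(X[:, j], y) for j in range(X.shape[1])])
    adjusted = np.minimum(pvals * X.shape[1], 1.0)      # Bonferroni adjustment
    j = int(np.argmin(adjusted))
    return j if adjusted[j] < alpha else None

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=120) > 0).astype(int)  # only column 0 is informative
print(select_split_variable(X, y))                          # expected: 0
```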
3 Application details of the techniques reviewed under DT for classification

This section gives a brief illustration of the splitting criterion used, application areas, dataset details, and performance of the different algorithms reviewed under DT for classification. THAID was used in finance and health care for various purposes. CHAID was applied in many application areas like marketing, health care, and coal mining, and its performance is comparable with several algorithms of its time. CHAID was also used in the public vocational rehabilitation program to predict the employment outcomes and acceptance rates of rehabilitation clients with orthopedic disabilities. ID3 has been adopted in many application areas like price prediction in stock markets and medical diagnosis in health care, and it has shown better classification accuracy than neural network and rough set classifiers. The extended version of ID3, i.e., C4.5, has been employed in several sectors, for example in health care for liver disease diagnosis, detection of cancer with the help of micro-array datasets, and tumor classification [50]. It is also used in land cover mapping and change assessment in remote sensing. Its performance is comparable with the k-Nearest Neighbor (kNN), Naive Bayes, and Support Vector Machine (SVM) classifiers. Similarly, CART is used in various fields like intrusion detection, bankruptcy prediction in companies [51], diagnosis of diabetes and prediction of heart disease in health care [52], and landslide hazard assessment, and it has shown better performance than ID3. Likewise, FACT has also been used in many areas like waveform recognition, digit recognition, and normal discrimination. Researchers have employed QUEST in educational institutions for evaluating teachers' performance [33], in health care for predicting the mortality rate due to head injury, and in financial firms for measuring firm performance. Likewise, GUIDE, CRUISE, and CTREE are used in several research areas and have proven efficient and effective for researchers. The details are given in Table 1.

Table 1: Application details of the techniques under DT for Classification
1. THAID — Splitting criterion: sum of squared deviations. Application areas: finance, health care. Datasets: car dataset from the 1970 Survey of Consumer Finances; IRIS dataset from the UCI repository. Remarks: low predictive accuracy; biasedness in variable selection.
2. CHAID — Splitting criterion: chi-squared statistical test. Application areas: marketing modelling, healthcare, coal mining. Datasets: IRIS dataset; breast cancer patients' data; coal mines data from the Coal Industry Promotion Board; Rehabilitation Services Administration (RSA)-911 dataset. Remarks: performance comparable with, and in many cases better than, other algorithms; restricted to categorical variables.
3. ID3 — Splitting criterion: entropy. Application areas: product entry decisions, weather forecasting, medical diagnosis, marketing, stock market trend mining. Datasets: heart disease data from the UCI repository; weather data; buys_computer data. Remarks: predictive accuracy is directly proportional to the size of the training set; better classification accuracy than rough sets and neural networks.
4. C4.5 — Splitting criterion: gain ratio. Application areas: finance, health care, land cover change assessment. Datasets: car dataset from the Journal of Statistics Education Data Archive; IRIS and Liver Disorders datasets from the UCI repository; Landsat 5 (TM) data for 1986 and Landsat 7 (ETM+) data for 2001 located on the satellite path; Leukemia, Colon Tumour, and Diffuse Large B-cell Lymphoma data from the Kent Ridge Bio-Medical Data Set Repository. Remarks: performance comparable with the SVM, k-NN, and Naive Bayes classifiers.
5. CART — Splitting criterion: Gini index. Application areas: medicine and health care, landslide hazard, intrusion detection. Datasets: car dataset; birth dataset; type 2 diabetic outpatient data; survey data of malaria in central Vietnam during 2008; KDD Cup 1999 dataset from the UCI repository; landslide dataset of 137570 samples from Penang Island in Malaysia. Remarks: great flexibility and accuracy, but splitting is biased towards variables having more distinct values.
6. FACT — Splitting criterion: ANOVA and F-test. Application areas: normal discrimination, digit recognition, waveform recognition, spherical distribution problems. Datasets: IRIS dataset; Boston housing dataset. Remarks: classification accuracy and interpretative capability comparable with CART, but FACT runs many times faster.
7. QUEST — Splitting criterion: chi-squared and F-tests. Application areas: financial firms, health care, landslide hazard, coal mining. Datasets: car dataset from the Journal of Statistics Education Data Archive; IRIS dataset; financial data of Turkish firms from FINNET; breast cancer patients' data; coal mines data from the Coal Industry Promotion Board. Remarks: unbiased splits; ranked fourth best overall for linear splits; improved computational time over CART for variables with many categories.
8. CRUISE — Splitting criterion: LDA and contingency-table chi-squared tests. Application areas: biomedicine, education, healthcare. Datasets: IRIS dataset; biomedical data; cylinder bands, credit approval, echo-cardiogram, fish catch, horse colic, hepatitis, heart disease, and auto imports data from the UCI repository; demography data from Rouncefield (1995); head injury data from Hawkins (1997); college data from StatLib. Remarks: accuracy as high as CART and QUEST; fast computation speed; produces more intelligent splits and shorter trees; keeps track of local interactions.
9. GUIDE — Splitting criterion: Bonferroni test, chi-squared test. Application areas: education, sports, healthcare, region prediction. Datasets: IRIS dataset from UCI; cars dataset from the Journal of Statistics Education Data Archive for the 2004 model year. Remarks: performance better than CRUISE and QUEST; unbiased variable selection.
10. CTREE — Splitting criterion: Bonferroni-adjusted p-values (permutation tests). Application areas: healthcare, sports, space physics, mammography, biology. Datasets: breast cancer, credit, heart, hepatitis, ionosphere, sonar, liver, TicTacToe, Titanic, and House Votes 84 data from the UCI repository. Remarks: performance comparable with, and in some cases better than, GUIDE; unbiased; uses permutation tests.

4 Comparative analysis of various classification tree algorithms

In this section, the different classification tree techniques discussed above are compared based on various parameters, as listed in Table 2. The parameters considered for the comparison are the different types of splits (univariate or linear), the maximum number of splits, the way missing-valued attributes are handled, the node models, etc. The CHAID and C4.5 algorithms do not support linear splits; however, most of the algorithms support both linear and univariate splits. The prediction accuracy of THAID is not up to the mark. CHAID favors categorical variables, and it allows multiple splits at a node. CART has great flexibility and accuracy, but its splitting is biased towards variables having more distinct values.
The classification accuracy and interpretative capability of FACT are comparable with CART, and it runs many times faster than CART. CART, CHAID, and QUEST are the most popular techniques used for modeling decision trees for classification. The QUEST algorithm is a little faster compared to CART and CHAID; however, it is not suitable for processing bigger datasets, as it requires a large storage space to store the intermediate results obtained at each level of the tree. QUEST, CRUISE, GUIDE, and CTREE are the more advanced approaches to modeling classification trees. They have been found effective in terms of both time and space complexity, and they provide unbiased splits during the construction of the classification tree. The accuracy of CRUISE is as good as that of CART and QUEST; it has fast computational speed, generates shorter trees and more intelligent splits, and also keeps track of local interactions, which distinguishes it from the algorithms proposed before it [18]. GUIDE has better accuracy than CRUISE and QUEST and offers unbiased variable selection. In contrast to the others, CTREE uses permutation tests; its performance is comparable with, and in some cases better than, GUIDE.

Table 2: Comparison of classification tree algorithms
Author                        Year  Algorithm  Split type  Unbiased split  No. of splits  Missing values method  Interaction test  Node model
R. Messenger & L. Mandell     1972  THAID      U           No              2              –                      Yes               C
G. V. Kass                    1980  CHAID      U           No              >=2            B                      Yes               C
J. R. Quinlan                 1986  ID3        U           No              >=2            –                      No                C
J. R. Quinlan                 1993  C4.5       U           No              >=2            W                      No                C
L. Breiman et al.             1984  CART       U, L        No              2              S                      No                C
W. Y. Loh & N. Vanichsetakul  1988  FACT       U, L        No              >=2            I                      No                C
W. Y. Loh & Y. S. Shih        1997  QUEST      U, L        Yes             2              I                      No                C
H. Kim & W. Y. Loh            2001  CRUISE     U, L        Yes             >=2            I, S                   Yes               C, D
W. Y. Loh                     2002  GUIDE      U, L        Yes             2              M                      Yes               C, K, N
T. Hothorn et al.             2006  CTREE      U, L        Yes             >=2            I, S                   No                C
Description: U – univariate splits, L – linear splits, B – missing-value branch, W – probability weights, S – surrogate splits, I – missing-value imputation, M – missing-value category, C – constant model, D – discriminant model, K – kernel density model, N – nearest-neighbour model. Blank entries (–) indicate no missing-value handling.

5 DT for regression

Regression aims to predict a continuous value for an unseen tuple by studying a training sample of data [29]:

O = f(x, \theta), \quad O \in R    (10)

where x is the new observation, O is the output, f(\cdot) is the regression function, and \theta is the regression function's parameter set. A DT for regression is similar to a classification tree, with the difference that it contains values or piecewise models at the leaves rather than class labels [53]. The values may be the result of any test or the outcome of any operation. Some of the popularly used regression tree algorithms are discussed in this section.

5.1 Automatic interaction detection (AID)

AID is the first regression tree algorithm, introduced by Morgan and Sonquist in the year 1963. The algorithm starts with a large dataset, which is then successively divided into several subgroups by applying binary divisions. At every step, the binary divisions of the groups are defined by one of the independent variables. It uses the sum of squared deviations as a measure of node impurity [54]. For each independent variable, all possible splits are considered. Each binary split divides the whole dataset into two parts, and the one having the least sum of squared deviations is chosen.
The node impurity measure I(d) is computed as follows [24]:

I(d) = \sum_{i=1}^{n} (y_i - \bar{y}_d)^2    (11)

where \bar{y}_d is the sample mean of the dependent variable within partition d. The attribute with the least sum of squared deviations is taken as the splitting attribute. The splitting process continues until very few tuples remain in the dataset or until I(d) becomes less than a predefined value. Deciding this predefined value is a matter of concern, as it might lead to the problem of over-fitting or under-fitting if the number is too large or too small, respectively. Inter-correlation among attributes leads to spurious results, and a biased value may be adopted during the model-building process.

5.2 CART for regression

CART uses the same approach as AID for splitting and for computing the node impurity measure. It solves the over-fitting problem of AID by using a tree pruning procedure. The output of CART is a piecewise constant model. CART uses the surrogate splitting approach to handle datasets with missing values [55]: if splitting needs to be performed on an attribute with missing values, it finds an attribute that is highly correlated with the original attribute and uses it in place of the original one.
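A minimal sketch of this AID/CART-style split search for a single ordered predictor is given below, under the assumption of binary threshold splits at midpoints: every candidate split-point is scored by the total within-partition sum of squared deviations of Eq. (11), and the smallest total wins. Names and data are illustrative, not taken from the original implementations.

```python
# Sketch: AID/CART-style regression split for one ordered predictor (Eq. 11).
import numpy as np

def sse(y):
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_regression_split(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    best_c, best_score = None, np.inf
    for c in (np.unique(x)[:-1] + np.unique(x)[1:]) / 2.0:   # midpoints between values
        score = sse(y[x <= c]) + sse(y[x > c])               # total within-partition SSE
        if score < best_score:
            best_c, best_score = c, score
    return best_c, best_score

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 5.1, 5.0, 4.8]
print(best_regression_split(x, y))   # a split near x = 3.5 minimises the deviations
```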
5.3 Multivariate adaptive regression splines (MARS)

MARS is suitable for handling datasets of higher dimensions. It follows a recursive, divide-and-conquer approach to regression and generates continuous models with continuous derivatives [56]. It splits the range of independent attribute values into n+1 disjoint intervals partitioned by n knots, which results in the construction of functions called spline functions [57]. MARS comprises a series of connected straight-line segments. The general form of the MARS model is defined as [58]:

y = f(x) = z_0 + \sum_{i=1}^{n} z_i B_{k,n}(x_{v(k,i)})    (12)

where y is the output function, n is the number of basis functions, z_0 is a constant value, k is the order of interactions, x_{v(k,i)} is the independent attribute in the k-th interaction of the i-th product, B_{k,n}(x_{v(k,i)}) is the i-th basis function, and z_i is its corresponding coefficient. The basis function can be defined as B_{k,n} = \prod_{i=1}^{k} b_{i,n}. The value of k is one if the model is additive and two for the pairwise interactive model. In the first step, a large number of basis functions are constructed, which overfit the data. The permitted data values are categorical, continuous, and/or ordinal, and they are selected as per the intervals defined. The different variables may interact directly with each other, or some constraints may be imposed on them. In the second phase, a generalized cross-validation technique is applied to the basis functions, and the functions having the least contribution are eliminated. The variables having better cross-validation results are chosen. In this way, an optimal MARS model is selected. MARS successfully handles missing values by employing dummy variables. By using the above-mentioned procedures, MARS also keeps track of complex data structures hidden in high-dimensional datasets.

5.4 GUIDE for regression

GUIDE for regression employs the chi-square test to detect the inter-relationships between the signed residuals and groups of independent variables [59]. It can handle datasets having both discrete and continuous-valued attributes. In GUIDE, two tests are performed: the curvature test and the interaction test. In the curvature test, for each continuous-valued attribute, a 2 x 4 contingency table is created from the dataset, whose rows indicate the signs of the residuals and whose columns stipulate groups. Based on the number of observations in each cell, the p-value is obtained from the chi-square distribution. In the interaction test, to find the interaction between two continuous variables, the sample median of each is computed, and based on the result, the range of each variable is divided into two equal partitions. A 2 x 4 contingency table is generated, whose rows represent residual signs and whose columns denote quadrants. The chi-square distribution and p-value are again computed. If the smallest acquired p-value comes from a curvature test, the corresponding independent variable is chosen as the splitting attribute; if it comes from an interaction test, then one of the interacting variables is chosen as the splitting attribute. The sum of squared errors is computed for each sub-node, and the variable having the least sum of squared errors is selected. In case one of the variables is categorical, the one having the smaller p-value in the curvature test is selected. The major advantage of GUIDE is that its splitting process is unbiased.

5.5 M5

M5 involves the construction of model trees rather than rule-based, recursive binary trees [60]. Model trees are smaller in structure than regression trees and have shown better performance than the latter. They can handle datasets of large dimensions. In contrast to regression trees, which contain values at their terminal nodes, model trees employ linear functions. M5 can handle both discrete and continuous data. Its objective is to build a model that relates the target values of the dependent variable to the values of the other attributes [61]. The construction of model trees follows the divide-and-conquer approach. If the constructed model suffers from over-fitting, tree pruning is applied by substituting a subtree with a leaf. M5 considers the standard deviation as a measure of node impurity. At first, the standard deviation of the dependent variable is computed over the training sample of the dataset (Dt). Based on the outcomes of the test, the splitting process continues until there is no notable variation in the values of the attributes. Every potential test is evaluated by ascertaining the subset of data tuples associated with each of its outcomes. The expected reduction in error can be calculated as [60]:

\Delta err = \sigma(Dt) - \sum_{i=1}^{n} \frac{|Dt_i|}{|Dt|} \sigma(Dt_i)    (13)

where Dt_i denotes the subset of data tuples having the i-th outcome of the potential test, n denotes the number of outcomes of the test, and \sigma(Dt) represents the standard deviation of the training dataset. The test having the maximum \Delta err is chosen as the potential test to predict the target values of unseen data tuples. The test set error (err_t) can be computed as:

err_t = \Delta err \times \frac{m + p}{m - p}    (14)

where m represents the number of training set tuples at a particular node, and p refers to the number of parameters in the regression model of the node.
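The sketch below illustrates M5's standard-deviation-reduction test selection, i.e., Eq. (13), and the pessimistic correction factor of Eq. (14) for a single candidate split of a numeric target. It is a hedged illustration with made-up data, not the original M5 implementation.

```python
# Sketch of M5's standard-deviation reduction (Eq. 13) and the Eq. (14) factor.
import numpy as np

def sdr(y, partitions):
    """Expected error reduction: sd(parent) minus the size-weighted sd of the parts."""
    y = np.asarray(y, float)
    return np.std(y) - sum(len(p) / len(y) * np.std(np.asarray(p, float))
                           for p in partitions)

y = [4.1, 3.9, 4.0, 9.8, 10.1, 9.9]
left, right = y[:3], y[3:]
print("SDR:", sdr(y, [left, right]))          # large reduction -> attractive test

m, p = len(y), 2                              # training tuples at the node, model parameters
print("pessimistic error factor:", (m + p) / (m - p))   # Eq. (14) correction factor
```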
5.6 M5′

M5′ is an extension of M5, designed to address some issues that arose during the construction of M5. It is a k+1 parameter model, comprising k attributes and one constant term w_0. In M5, as the size of the tree becomes smaller, the standard deviation term in \Delta err lessens. Hence, to manage this, a pruning factor \alpha is used in M5′ while computing err_t [62]:

err_t = \Delta err \times \frac{m + \alpha p}{m - p}    (15)

As the \alpha value increases, err_t increases but the size of the tree decreases markedly. Hence, to get a smaller error and better performance, a smaller value of \alpha should be taken; but if preference is given to generating smaller trees, then the \alpha value should be increased a little. In addition, M5′ also successfully handles datasets containing missing values [63]. To address missing values, \Delta err is modified as follows [62]:

\Delta err = \frac{k}{|Dt|} \times \beta(i) \times \left[ \sigma(Dt) - \sum_{j \in \{A,B\}} \frac{|Dt_j|}{|Dt|} \sigma(Dt_j) \right]    (16)

where k refers to the number of tuples without missing values, Dt is the subset of the dataset containing the tuples that are to be split based on a condition, Dt_A and Dt_B are the sets after the partition, and \beta(i) is a correction factor defined as \beta(i) = e^{-(7/2)(a/m)} [62], where m is the number of tuples in the dataset and a is the total number of values of the original enumerated attributes. \beta is used when converting an 'a'-valued enumerated attribute to 'a-1' binary values; for continuous attributes, \beta is taken as 1. Hence, after splitting, all attributes in Dt_A and Dt_B become binary. The attribute with the maximum \Delta err is chosen as the splitting attribute. The prediction accuracy of M5′ is comparable with techniques like Artificial Neural Networks (ANN) and has been found to be better than that of regression trees such as CART [64].

6 Application details of the techniques reviewed under DT for regression

This section gives a short description of the different algorithms reviewed under DT for regression based on a few parameters, as summarized in Table 3. It starts with AID, which has been adopted in several research areas, for example in education for predicting the factors affecting the academic survival of students, in population studies for the adoption of family planning in Koyang [65], and in marketing for exploratory analysis of market data. CART is used in several fields, for example in sports for analyzing the salaries of baseball players [24], in public health for analyzing the causes of morbidity and mortality from specific diseases, and in many other areas for different tasks using several datasets [53]. MARS has been adopted in several research areas, for example in health care on heart attack survival data and in biology for the prediction of species distributions. In some applications like credit scoring, CART and MARS outperform traditional logistic regression, discriminant analysis, SVM, and neural network techniques in terms of predictive accuracy. GUIDE is employed in various sectors like education, sports, and automobiles. MARS and M5 have also been used for groundwater level forecasting, for solar radiation, and in the construction industry for evaluating the mechanical properties of concretes containing coarse recycled concrete aggregates [66], and they have outperformed other algorithms in many scenarios. M5′ is used in various application areas, for example in coastal engineering for the prediction of wave height in Lake Superior [67] and for scour depth prediction [68], and in the construction industry for predicting the modulus of elasticity of recycled concrete [63]. For more details, Table 3 may be referred to.

7 Comparative analysis of regression tree algorithms

A comparative analysis of various regression tree algorithms based on different parameters is presented in Table 4. Some of the parameters considered for comparison are the same as the parameters used for comparing the classification tree algorithms.
However, a few parameters like pruning, variable importance ranking, loss criteria, and ensemble approaches are added for an extensive comparison of the regression tree algorithms. Regardless of its novelty, AID faced some problems and was criticized by several authors [24]. While splitting, it experiences over-fitting as well as under-fitting, and it does not employ tree pruning to reduce the tree size, whereas CART and the others do so to reduce the complexity. The sum of squared deviations about the mean is the node impurity measure adopted in the AID and CART algorithms. MARS employs spline basis functions and incorporates a generalized cross-validation approach that increases the prediction accuracy of the model. GUIDE employs ensemble and bagging techniques, in contrast to the others. M5′ is the best-performing regression model here; it constructs a piecewise model tree by fitting a linear regression model at each leaf node, whereas the GUIDE algorithm fits linear regression models at each node of the constructed tree. For a detailed analysis, Table 4 may be referred to.

Table 3: Application details of the techniques under DT for Regression
1. AID — Splitting criterion: sum of squared deviations. Application areas: education, population studies, market research, operational research, fishing industry, gasoline consumption. Datasets: data from almost 6,000 gas stations of major oil companies in the United States during 1970; the fishing log data from the White Fish Authority, Hull, England; data taken from Ronald Freedman, P. Whelpton, and Arthur Campbell, Family Planning, Sterility and Population Growth (New York, 1959); data from the Nestle Company; educational data from the National Survey of Health and Development. Remarks: biased towards datasets of higher dimensions; experiences over-fitting and under-fitting problems.
2. CART — Splitting criterion: Gini index. Application areas: medicine and health care, landslide hazard. Datasets: baseball salary data from the American Statistical Association Section (StatLib); data from the 1999 Behavioral Risk Factor Surveillance System (BRFSS), conducted annually by U.S. states' Departments of Health in collaboration with the Centers for Disease Control and Prevention. Remarks: great flexibility and accuracy, but splitting is biased towards variables having more distinct values.
3. MARS — Splitting criterion: spline basis functions, generalized cross-validation. Application areas: health care, biology, credit scoring, solar radiation, construction industry. Datasets: heart attack survival data from the Specialized Center of Research on Ischemic Heart Disease at the University of California, San Diego; birds data of many countries and plants data of Switzerland; credit card dataset provided by a local bank in Taipei, Taiwan; solar data obtained from the Adana and Antakya stations, Turkey. Remarks: incorporation of generalized cross-validation increases the prediction accuracy; handles the curse-of-dimensionality problem.
4. GUIDE — Splitting criterion: chi-squared test of interaction. Application areas: sports, automobiles, education. Datasets: baseball salary data from StatLib; car dataset from the Journal of Statistics Education Data Archive for the 2004 model year. Remarks: fast computation speed; unbiased and keeps track of local interactions during split selection.
5. M5 — Splitting criterion: standard deviation. Application areas: medicine and health care, manufacturing, automobiles, hydrology, solar radiation, evapotranspiration. Datasets: CPU performance data; car price data; drug activity data and LHRH data from Arris Pharmaceuticals, San Francisco; data from the discharge measuring station Swarupganj on the river Bhagirathi, India; sediment yield data from the Nagwa watershed in India from 1993 to 2004; solar data obtained from the Adana and Antakya stations, Turkey; climatic data of the Davis station maintained by the California Irrigation Management Information System (CIMIS). Remarks: better accuracy and smaller structure than regression trees.
6. M5′ — Splitting criterion: standard deviation. Application areas: marine and coastal engineering, construction industry, coastal and ocean engineering. Datasets: wind and wave data gathered in Lake Superior from 6 April to 10 November 2000 and 19 April to 6 November 2001; wave run-up data of Van der Meer and Stam (1992). Remarks: prediction accuracy comparable with techniques like Artificial Neural Networks and higher than CART; handles datasets with missing values.

Table 4: Comparison of regression tree algorithms
Author                         Year  Algorithm  Split type  Unbiased split  No. of splits  Pruning  Variable importance ranking  Node models  Missing-value method  Loss criteria  Bagging and ensembles
J. N. Morgan & J. A. Sonquist  1963  AID        U           No              2              No       Yes                           C            –                     V              No
L. Breiman et al.              1984  CART       U, L        No              2              Yes      Yes                           C            S                     V              No
J. H. Friedman                 1991  MARS       L           Yes             >=2            Yes      Yes                           C, M         A                     V              No
W. Y. Loh                      2002  GUIDE      U           Yes             2              Yes      Yes                           C, M, P, R   A                     V, W           Yes
J. R. Quinlan                  1992  M5         U           No              >=2            Yes      No                            C, R         –                     V              No
Y. Wang & I. H. Witten         1996  M5′        U           No              >=2            Yes      No                            C, R         G                     V              No
Description: U – univariate splits, L – linear splits, C – constant model, M – multiple linear model, R – stepwise linear model, P – polynomial model, S – surrogate splits, G – global mean/mode imputation, A – missing-value category, V – least squares, W – least median of squares. Blank entries (–) indicate that the algorithm does not handle datasets with missing values.

8 Conclusions and future work

The popularity of classification and regression trees has been increasing exponentially, as they are easy to understand and implement. In a decision tree, the hidden rules along with the constraints can be extracted from the data and mapped to the nodes and branches of the tree, which makes it convenient to understand. However, the complexity of the model increases with the size of the dataset. To handle this complexity, a wide range of advanced algorithms have been adopted in the field of DT for classification and regression. In this paper, we have presented a list of the datasets and various applications to which these algorithms can be applied. This paper could be a potential resource for researchers in searching for and deciding on the appropriate algorithms suitable for their area of research involving regression and classification tasks. A comparative analysis of numerous algorithms based on various parameters is also presented for both classification and regression tasks. In future, this work can be extended by including all the ensemble approaches and their comparison with the existing ones. We also aim to explore new techniques in the fields of decision tree-based hierarchical multi-label classification, multi-output, and multi-objective regression trees.

References

[1] Xindong Wu, Xingquan Zhu, Gong-Qing Wu, and Wei Ding (2013) Data mining with big data, IEEE Transactions on Knowledge and Data Engineering, 26(1), pp. 97–107. https://doi.org/10.1109/tkde.2013.109
8 Conclusions and future work
The popularity of classification and regression trees has been increasing exponentially, as they are easy to understand and implement. In a decision tree, the hidden rules along with their constraints can be extracted from the data and mapped onto the nodes and branches of the tree, which makes the model convenient to interpret. However, the complexity of the model increases with the size of the dataset, and to handle this complexity a wide range of advanced algorithms has been developed in the field of DT for classification and regression. In this paper, we have presented the datasets and the various applications to which these algorithms have been applied. This paper could be a potential resource for researchers searching for and deciding on the algorithms most suitable for their area of research involving regression and classification tasks. A comparative analysis of numerous algorithms based on various parameters is also presented for both classification and regression tasks. In the future, this work can be extended by including all the ensemble approaches and their comparison with the existing ones. We also aim to explore new techniques in the field of decision tree-based hierarchical multi-label classification, multi-output regression, and multi-objective regression trees.
References
[1] Xindong Wu, Xingquan Zhu, Gong-Qing Wu, and Wei Ding (2013) Data mining with big data, IEEE Transactions on Knowledge and Data Engineering, 26(1), pp. 97–107. https://doi.org/10.1109/tkde.2013.109
[2] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami (1993) Database mining: A performance perspective, IEEE Transactions on Knowledge and Data Engineering, 5(6), pp. 914–925. https://doi.org/10.1109/69.250074
[3] Ranjan Kumar Behera, Santanu Kumar Rath, Sanjay Misra, Robertas Damaševičius, and Rytis Maskeliūnas (2017) Large scale community detection using a small world model, Applied Sciences, 7(11), pp. 1173. https://doi.org/10.3390/app7111173
[4] Satchidananda Dehuri and Ashish Ghosh (2013) Revisiting evolutionary algorithms in feature selection and nonfuzzy/fuzzy rule based classification, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 3(2), pp. 83–108. https://doi.org/10.1002/widm.1087
[5] Leszek Rutkowski (2004) Adaptive probabilistic neural networks for pattern classification in time-varying environment, IEEE Transactions on Neural Networks, 15(4), pp. 811–827. https://doi.org/10.1109/tnn.2004.828757
[6] Ranjan Kumar Behera, Debadatta Naik, Dharavath Ramesh, and Santanu Kumar Rath (2020) MR-IBC: MapReduce-based incremental betweenness centrality in large-scale complex networks, Social Network Analysis and Mining, 10, pp. 1–13. https://doi.org/10.1007/s13278-020-00636-9
[7] Wouter Verbeke, David Martens, Christophe Mues, and Bart Baesens (2011) Building comprehensible customer churn prediction models with advanced rule induction techniques, Expert Systems with Applications, 38(3), pp. 2354–2364. https://doi.org/10.1016/j.eswa.2010.08.023
[8] Charu C Aggarwal (2014) Data classification: algorithms and applications, CRC Press.
[9] Salvador García, Alberto Fernández, and Francisco Herrera (2009) Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems, Applied Soft Computing, 9(4), pp. 1304–1314. https://doi.org/10.1016/j.asoc.2009.04.004
[10] Shih-Wei Lin, Kuo-Ching Ying, Chou-Yuan Lee, and Zne-Jung Lee (2012) An intelligent algorithm with feature selection and decision rules applied to anomaly intrusion detection, Applied Soft Computing, 12(10), pp. 3285–3290. https://doi.org/10.1016/j.asoc.2012.05.004
[11] Sreerama K Murthy (1998) Automatic construction of decision trees from data: A multi-disciplinary survey, Data Mining and Knowledge Discovery, 2(4), pp. 345–389.
[12] Arno De Caigny, Kristof Coussement, and Koen W De Bock (2018) A new hybrid classification algorithm for customer churn prediction based on logistic regression and decision trees, European Journal of Operational Research, 269(2), pp. 760–772. https://doi.org/10.1016/j.ejor.2018.02.009
[13] Usama M Fayyad and Keki B Irani (1992) On the handling of continuous-valued attributes in decision tree generation, Machine Learning, 8(1), pp. 87–102. https://doi.org/10.1007/bf00994007
[14] Dragi Kocev, Celine Vens, Jan Struyf, and Sašo Džeroski (2007) Ensembles of multi-objective decision trees, European Conference on Machine Learning, Springer, pp. 624–631. https://doi.org/10.1007/978-3-540-74958-5_61
[15] Dua Dheeru and Efi Karra Taniskidou (2017) UCI machine learning repository.
[16] Jieyue He, Hae-Jin Hu, Robert Harrison, Phang C Tai, and Yi Pan (2006) Transmembrane segments prediction and understanding using support vector machine and decision tree, Expert Systems with Applications, 30(1), pp. 64–72. https://doi.org/10.1016/j.eswa.2005.09.045
[17] Jiawei Han, Jian Pei, and Micheline Kamber (2011) Data mining: concepts and techniques, Elsevier.
[18] Tjen-Sien Lim, Wei-Yin Loh, and Yu-Shan Shih (2000) A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms, Machine Learning, 40(3), pp. 203–228.
[19] Vili Podgorelec, Peter Kokol, Bruno Stiglic, and Ivan Rozman (2002) Decision trees: an overview and their use in medicine, Journal of Medical Systems, 26(5), pp. 445–463. https://doi.org/10.1023/a:1016409317640
[20] Claudia Perlich, Foster Provost, and Jeffrey S Simonoff (2003) Tree induction vs. logistic regression: A learning-curve analysis, Journal of Machine Learning Research, 4(Jun), pp. 211–255.
[21] Lior Rokach and Oded Maimon (2005) Top-down induction of decision trees classifiers - a survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 35(4), pp. 476–487. https://doi.org/10.1109/tsmcc.2004.843247
[22] Rodrigo Coelho Barros, Márcio Porto Basgalupp, Andre CPLF De Carvalho, and Alex A Freitas (2012) A survey of evolutionary algorithms for decision-tree induction, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(3), pp. 291–312. https://doi.org/10.1109/tsmcc.2011.2157494
[23] Wei-Yin Loh (2011) Classification and regression trees, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), pp. 14–23.
[24] Wei-Yin Loh (2014) Fifty years of classification and regression trees, International Statistical Review, 82(3), pp. 329–348. https://doi.org/10.1111/insr.12016
[25] Shlomo Geva and Joaquin Sitte (1991) Adaptive nearest neighbor pattern classification, IEEE Transactions on Neural Networks, 2(2), pp. 318–322. https://doi.org/10.1109/72.80344
[26] Se June Hong (1997) R-MINI: An iterative approach for generating minimal rules from examples, IEEE Transactions on Knowledge and Data Engineering, 9(5), pp. 709–717. https://doi.org/10.1109/69.634750
[27] Eric WT Ngai, Li Xiu, and Dorothy CK Chau (2009) Application of data mining techniques in customer relationship management: A literature review and classification, Expert Systems with Applications, 36(2), pp. 2592–2602. https://doi.org/10.1016/j.eswa.2008.02.021
[28] J Ross Quinlan (1987) Generating production rules from decision trees, In IJCAI, Citeseer, 87, pp. 304–307.
[29] Ye Ren, Le Zhang, and Ponnuthurai N Suganthan (2016) Ensemble classification and regression - recent developments, applications and future directions, IEEE Computational Intelligence Magazine, 11(1), pp. 41–53. https://doi.org/10.1109/mci.2015.2471235
[30] S Rasoul Safavian and David Landgrebe (1991) A survey of decision tree classifier methodology, IEEE Transactions on Systems, Man, and Cybernetics, 21(3), pp. 660–674. https://doi.org/10.1109/21.97458
[31] Yoon Ho Cho, Jae Kyeong Kim, and Soung Hie Kim (2002) A personalized recommender system based on web usage mining and decision tree induction, Expert Systems with Applications, 23(3), pp. 329–342. https://doi.org/10.1016/s0957-4174(02)00052-0
[32] Michael Kearns and Yishay Mansour (1999) On the boosting ability of top-down decision tree learning algorithms, Journal of Computer and System Sciences, 58(1), pp. 109–128. https://doi.org/10.1006/jcss.1997.1543
[33] Wei-Yin Loh and Yu-Shan Shih (1997) Split selection methods for classification trees, Statistica Sinica, pp. 815–840.
[34] Avrim L Blum and Pat Langley (1997) Selection of relevant features and examples in machine learning, Artificial Intelligence, 97(1-2), pp. 245–271. https://doi.org/10.1016/s0004-3702(97)00063-5
[35] Robert Messenger and Lewis Mandell (1972) A modal search technique for predictive nominal scale multivariate analysis, Journal of the American Statistical Association, 67(340), pp. 768–772. https://doi.org/10.1080/01621459.1972.10481290
[36] Gordon V Kass (1980) An exploratory technique for investigating large quantities of categorical data, Journal of the Royal Statistical Society: Series C (Applied Statistics), 29(2), pp. 119–127. https://doi.org/10.2307/2986296
[37] Leszek Rutkowski, Maciej Jaworski, Lena Pietruczuk, and Piotr Duda (2013) Decision trees for mining data streams based on the Gaussian approximation, IEEE Transactions on Knowledge and Data Engineering, 26(1), pp. 108–119. https://doi.org/10.1109/tkde.2013.34
[38] J Ross Quinlan (1986) Induction of decision trees, Machine Learning, 1(1), pp. 81–106.
[39] Salvatore Ruggieri (2002) Efficient C4.5 [classification algorithm], IEEE Transactions on Knowledge and Data Engineering, 14(2), pp. 438–444.
[40] Leo Breiman (2017) Classification and regression trees, Routledge.
[41] Haidi Rao, Xianzhang Shi, Ahoussou Kouassi Rodrigue, Juanjuan Feng, Yingchun Xia, Mohamed Elhoseny, Xiaohui Yuan, and Lichuan Gu (2019) Feature selection based on artificial bee colony and gradient boosting decision tree, Applied Soft Computing, 74, pp. 634–642. https://doi.org/10.1016/j.asoc.2018.10.036
[42] B Chandra and P Paul Varghese (2009) Fuzzifying Gini index based decision trees, Expert Systems with Applications, 36(4), pp. 8549–8559. https://doi.org/10.1016/j.eswa.2008.10.053
[43] Wei-Yin Loh and Nunta Vanichsetakul (1988) Tree-structured classification via generalized discriminant analysis, Journal of the American Statistical Association, 83(403), pp. 715–725. https://doi.org/10.1080/01621459.1988.10478652
[44] Xiao-Bai Li, James R Sweigart, James TC Teng, Joan M Donohue, Lori A Thombs, and S Michael Wang (2003) Multivariate decision trees using linear discriminants and tabu search, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 33(2), pp. 194–205. https://doi.org/10.1109/tsmca.2002.806499
[45] Richard J Light and Barry H Margolin (1971) An analysis of variance for categorical data, Journal of the American Statistical Association, 66(335), pp. 534–544. https://doi.org/10.1080/01621459.1971.10482297
[46] Dursun Delen, Cemil Kuzey, and Ali Uyar (2013) Measuring firm performance using financial ratios: A decision tree approach, Expert Systems with Applications, 40(10), pp. 3970–3983. https://doi.org/10.1016/j.eswa.2013.01.012
[47] Hyunjoong Kim and Wei-Yin Loh (2001) Classification trees with unbiased multiway splits, Journal of the American Statistical Association, 96(454), pp. 589–604. https://doi.org/10.1198/016214501753168271
[48] João Gama (2004) Functional trees, Machine Learning, 55(3), pp. 219–250.
[49] Torsten Hothorn, Kurt Hornik, and Achim Zeileis (2015) ctree: Conditional inference trees, The Comprehensive R Archive Network, pp. 1–34.
[50] Jianhua Dai and Qing Xu (2013) Attribute selection based on information gain ratio in fuzzy rough set theory with application to tumor classification, Applied Soft Computing, 13(1), pp. 211–221. https://doi.org/10.1016/j.asoc.2012.07.029
[51] Vadlamani Ravi, H Kurniawan, Peter Nwee Kok Thai, and P Ravi Kumar (2008) Soft computing system for bank performance prediction, Applied Soft Computing, 8(1), pp. 305–315. https://doi.org/10.1016/j.asoc.2007.02.001
[52] Vikas Chaurasia and Saurabh Pal (2013) Early prediction of heart diseases using data mining techniques, Caribbean Journal of Science and Technology, 1, pp. 208–217.
[53] Hyunjoong Kim, Wei-Yin Loh, Yu-Shan Shih, and Probal Chaudhuri (2007) Visualizable and interpretable regression models with good prediction power, IIE Transactions, 39(6), pp. 565–579. https://doi.org/10.1080/07408170600897502
[54] Gordon V Kass (1975) Significance testing in automatic interaction detection (AID), Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2), pp. 178–189.
[55] Dan Steinberg and Phillip Colla (2009) CART: classification and regression trees, The Top Ten Algorithms in Data Mining, 9, pp. 179.
[56] Jerome H Friedman (1991) Multivariate adaptive regression splines, The Annals of Statistics, 19(1), pp. 1–67.
[57] Tian-Shyug Lee and I-Fei Chen (2005) A two-stage hybrid credit scoring model using artificial neural networks and multivariate adaptive regression splines, Expert Systems with Applications, 28(4), pp. 743–752. https://doi.org/10.1016/j.eswa.2004.12.031
[58] Mohammad Rezaie-balf, Sujay Raghavendra Naganna, Alireza Ghaemi, and Paresh Chandra Deka (2017) Wavelet coupled MARS and M5 model tree approaches for groundwater level forecasting, Journal of Hydrology, 553, pp. 356–373. https://doi.org/10.1016/j.jhydrol.2017.08.006
[59] Wei-Yin Loh (2002) Regression trees with unbiased variable selection and interaction detection, Statistica Sinica, pp. 361–386.
[60] John R Quinlan et al. (1992) Learning with continuous classes, In 5th Australian Joint Conference on Artificial Intelligence, World Scientific, 92, pp. 343–348.
[61] Behrooz Keshtegar, Cihan Mert, and Ozgur Kisi (2018) Comparison of four heuristic regression techniques in solar radiation modeling: Kriging method vs RSM, MARS and M5 model tree, Renewable and Sustainable Energy Reviews, 81, pp. 330–341. https://doi.org/10.1016/j.rser.2017.07.054
[62] Yong Wang and Ian H Witten (1996) Induction of model trees for predicting continuous classes.
[63] Ali Behnood, Jan Olek, and Michal A Glinicki (2015) Predicting modulus elasticity of recycled aggregate concrete using M5′ model tree algorithm, Construction and Building Materials, 94, pp. 137–147. https://doi.org/10.1016/j.conbuildmat.2015.06.055
[64] Lisham Bonakdar and Amir Etemad-Shahidi (2011) Predicting wave run-up on rubble-mound structures using M5 model tree, Ocean Engineering, 38(1), pp. 111–118. https://doi.org/10.1016/j.oceaneng.2010.09.015
[65] John A Ross and Sook Bang (1966) The AID computer programme, used to predict adoption of family planning in Koyang, Population Studies, 20(1), pp. 61–75. https://doi.org/10.1080/00324728.1966.10406084
[66] Aliakbar Gholampour, Iman Mansouri, Ozgur Kisi, and Togay Ozbakkaloglu (2018) Evaluation of mechanical properties of concretes containing coarse recycled concrete aggregates using multivariate adaptive regression splines (MARS), M5 model tree (M5Tree), and least squares support vector regression (LSSVR) models, Neural Computing and Applications, pp. 1–14. https://doi.org/10.1007/s00521-018-3630-y
[67] A Etemad-Shahidi and Javad Mahjoobi (2009) Comparison between M5′ model tree and neural networks for prediction of significant wave height in Lake Superior, Ocean Engineering, 36(15-16), pp. 1175–1181. https://doi.org/10.1016/j.oceaneng.2009.08.008
[68] Mehrshad Samadi, Ebrahim Jabbari, and H Md Azamathulla (2014) Assessment of M5′ model tree and classification and regression trees for prediction of scour depth below free overfall spillways, Neural Computing and Applications, 24(2), pp. 357–366. https://doi.org/10.1007/s00521-012-1230-9