https://doi.org/10.31449/inf.v43i1.1548    Informatica 43 (2019) 363–371

Machine Learning for Dengue Outbreak Prediction: A Performance Evaluation of Different Prominent Classifiers

Naiyar Iqbal and Mohammad Islam
Department of Computer Science and Information Technology, Maulana Azad National Urdu University, Hyderabad, Telangana, India
Email: naiyariqbal.rs@manuu.edu.in, islamcs1@gmail.com

Keywords: Dengue fever, machine learning, classification, ensemble classifier, clinical symptoms

Received: March 1, 2017

The number of dengue patients is increasing rapidly, and according to World Health Organization (WHO) records dengue has now been reported on every continent. WHO reports show that the number of dengue cases reported each year rose from 0.4 million to 1.3 million over the period 1996 to 2005, and then reached 2.2 to 3.2 million during 2010 to 2015. Consequently, it is essential to have a framework that can quickly and reliably recognize the presence of a dengue outbreak in a large number of samples. In this work, the capability of seven prominent machine learning techniques is assessed for the prediction of dengue outbreak, and the methods are evaluated with eight different performance measures. The LogitBoost ensemble model achieves the highest classification accuracy of 92%, with sensitivity and specificity of 90% and 94% respectively.

Povzetek: Seven machine learning algorithms are analysed on a dengue fever outbreak, and LogitBoost achieved the best results.

1 Introduction

Dengue fever is the best-known arboviral disease, transmitted by female Aedes aegypti mosquitoes in tropical and subtropical regions throughout the world [7]. The Spanish word dengue is derived from dinga. Dengue fever is also known as break-bone fever, break-heart fever, and dandy fever. Dengue fever is caused by four related viruses, known as DEN-1 to DEN-4; a fifth serotype, DEN-5, was newly reported in 2013 [13,3]. Dengue Fever (DF), Dengue Hemorrhagic Fever (DHF), and Dengue Shock Syndrome (DSS) are the broad stages of the disease, from mild to severe respectively [8,16]. According to WHO reports, the number of dengue cases reported each year rose from 0.4 million to 1.3 million over the period 1996 to 2005, and then reached 2.2 to 3.2 million during 2010 to 2015.

Dengue is among the most notable viral diseases affecting human beings. Over 33% of the world's population is at risk, including many urban communities in India. Forecasting dengue outbreaks can therefore protect human lives by alerting people to seek appropriate treatment and care. Forecasting transmissible outbreaks such as dengue is challenging, and several prediction techniques are still in their early stages [10]. An eco-bio-social framework for dengue vector breeding has been proposed in [2]; the researchers studied six different Asian regions and concluded that vector breeding and the abundance of adult Aedes aegypti are determined by a complex interaction of factors. Souza et al. (2007) [19] examined the influence of dengue on liver function and found that liver damage is more frequent in women, so liver tests that measure the level of damage are important.
Machine learning is a state-of-the-art technology that enables machines to perform tasks without being explicitly programmed, improving a performance criterion with the use of example data or previous observations. A machine learning model is used to extract valuable information from a normalized dataset. In this work, the capability of several prominent machine learning techniques is assessed for the prediction of dengue outbreak. For this purpose, seven machine learning algorithms have been used: LogitBoost, Logistic regression, Decision tree, Naive Bayes, Artificial neural network, Sequential minimal optimization, and k-nearest neighbour. Additionally, the ROC curve is used for performance measurement. In Table 4 we compare the accuracy, sensitivity, and specificity of the prominent classifiers with two ensemble models, i.e. Random forest [5] and LogitBoost.

2 Related Work

There are several other works concerned with the prediction of dengue outbreaks. Althouse et al. (2011) [1] applied three models, linear regression (step-down), generalized boosted regression, and negative binomial regression, and two further methods, logistic regression and artificial neural network, were also applied for dengue disease prediction. They performed their experiments for two regions, Singapore and Bangkok. The authors found that the linear model was superior to the other models, and that the support vector machine (SVM) performed better than logistic regression in both regions. The selected linear model achieved correlations of 0.86 and 0.93 between fitted and observed values for the Bangkok and Singapore regions, respectively. Brasier et al. (2012) [3] performed dengue disease prediction using CART and Random forest methods based on symptoms. They carried out 10 trials with 10-fold cross-validation, which showed average accuracies of 84.0% (for DF) and 84.6% (for DHF). Support vector classification was used by Fathima et al. (2012) [6] for the prediction of arboviral dengue; in their analysis, the SVM gave 90.42% accuracy with 47.23% sensitivity and 97.59% specificity. Fathima et al. (2015) [5] carried out experiments on dengue infection prognosis using the Random forest ensemble classifier on clinical parameters and reported 92% accuracy. Ibrahim et al. (2005) [9] studied 252 dengue patients (4 DF and 248 DHF) using an ANN with 9 input neurons and 5 hidden neurons in the MATLAB simulator, and their results showed 90% accuracy. Rachata et al. (2008) [14] applied an ANN using climate parameters such as temperature, rainfall, and relative humidity for dengue outbreak prediction; an accuracy of 85.92% was found in their experiment, and they suggested using another feature selection method such as the hidden Markov model. A Decision tree (C4.5) classifier was applied by Tanner et al. (2008) [21] to 1200 dengue samples (364 dengue positive and 836 dengue negative) described by five clinical parameters; their experiments found 84.7% accuracy with a 15.7% overall error rate, and they claimed the decision tree could be a useful classifier. An additional review of the related literature can be found in [10], which surveys around thirty studies published between 1995 and 2013.

3 Methods & material

Data mining is the process of analyzing and extracting knowledge from large historical databases, keeping in mind that the end goal is the prediction of unknown information about a novel example from observed examples.
The data mining phases are as follows:
▪ Phase 1: Problem identification
▪ Phase 2: Formulation of the hypothesis
▪ Phase 3: Data collection
▪ Phase 4: Data pre-processing (scaling, encoding, feature selection, and outlier detection or removal)
▪ Phase 5: Model estimation
▪ Phase 6: Model interpretation and drawing of conclusions

In this experiment, we use a dengue disease dataset in CSV file format for prediction with the WEKA data mining tool. The dataset consists of 75 samples: 36 samples without dengue disease (negative) and 39 samples with dengue disease (positive) [12,17,20]. The dataset was collected from the test reports of different discharged patients. Data pre-processing was then performed to fill in missing values using the ReplaceMissingValues technique available under the filter option of the WEKA tool. In this experiment, 8 distinct clinical attributes have been taken into account for the prediction of dengue disease (Table 1).

Attribute            Data type   Range
1. Fever             Binary      No/Yes
2. Headache          Binary      No/Yes
3. Body ache         Binary      No/Yes
4. Abdominal pain    Binary      No/Yes
5. Vomiting          Binary      No/Yes
6. Haemoglobin       Numeric     12.0-17.5 (g/dL)
7. WBC               Numeric     4000-11000 (/cumm)
8. Platelet          Numeric     1.5-4.5 (lakh/mm³)
Dengue (class)       Binary      Negative/Positive

Table 1: Clinical attributes for dengue outbreak.

The dataset comprises a total of 75 samples with 8 clinical attributes per sample. Samples without dengue outbreak were treated as the negative class, and samples with dengue outbreak were treated as the positive class for the purpose of analysis. The correlations between the eight attributes were computed separately for the negative and positive samples, as depicted in Figures 1 and 2. Figure 1 shows that, in samples without dengue outbreak (negative), the fever feature is positively correlated with all other parameters except headache and platelet; positive correlations between several other parameters in the negative class can also be noticed in Figure 1. Similarly, Figure 2 shows that, in samples with dengue outbreak (positive), the hemoglobin feature is positively correlated with all other parameters except headache and platelet, and positive correlations between other parameters in the positive class can also be observed in Figure 2.
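As an illustration of this preparation step, the sketch below shows how the same preprocessing and per-class correlation analysis could be reproduced outside WEKA. It is a minimal Python/pandas sketch rather than the authors' actual workflow; the file name dengue.csv and the column spellings are assumptions based on Table 1.

```python
# Minimal sketch (not the authors' WEKA workflow): load the dengue CSV, fill in
# missing values in the spirit of WEKA's ReplaceMissingValues filter (mean for
# numeric attributes, mode for nominal ones), and compute one Pearson correlation
# matrix per class, as visualised in Figures 1 and 2. File name and column names
# are assumptions based on Table 1.
import pandas as pd

df = pd.read_csv("dengue.csv")  # hypothetical path to the 75-sample dataset

numeric_cols = ["Haemoglobin", "WBC", "Platelet"]
binary_cols = ["Fever", "Headache", "Bodyache", "Abdominal pain", "Vomiting"]

# Mean-imputation for numeric attributes, mode-imputation for binary attributes.
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
for col in binary_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Map Yes/No symptoms to 1/0 so they can enter a Pearson correlation matrix.
df[binary_cols] = df[binary_cols].replace({"Yes": 1, "No": 0})

# One correlation matrix per class, mirroring Figure 1 (negative) and Figure 2 (positive).
for label in ["Negative", "Positive"]:
    subset = df[df["Dengue"] == label]
    print(label, "class correlations:")
    print(subset[binary_cols + numeric_cols].corr().round(2))
```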
                 Fever  Headache  Bodyache  Abd.pain  Vomiting  Hemoglobin    WBC  Platelet
Fever             1.00     -0.08      0.20      0.27      0.31        0.23   0.25     -0.26
Headache         -0.08      1.00     -0.17      0.00     -0.29       -0.02   0.09      0.08
Bodyache          0.20     -0.17      1.00     -0.03     -0.34        0.02  -0.08     -0.04
Abdominal pain    0.27      0.00     -0.03      1.00     -0.03       -0.02   0.03     -0.14
Vomiting          0.31     -0.29     -0.34     -0.03      1.00        0.05   0.23     -0.18
Hemoglobin        0.23     -0.02      0.02     -0.02      0.05        1.00   0.43      0.00
WBC               0.25      0.09     -0.08      0.03      0.23        0.43   1.00     -0.23
Platelet         -0.26      0.08     -0.04     -0.14     -0.18        0.00  -0.23      1.00

Figure 1: Linear correlation of negative cases.

                 Fever  Headache  Bodyache  Abd.pain  Vomiting  Hemoglobin    WBC  Platelet
Fever             1.00     -0.18      0.03     -0.12      0.09        0.04   0.06     -0.21
Headache         -0.18      1.00      0.13      0.30     -0.19       -0.05   0.01     -0.13
Bodyache          0.03      0.13      1.00      0.33      0.43        0.10  -0.14      0.04
Abdominal pain   -0.12      0.30      0.33      1.00      0.14        0.31  -0.03      0.31
Vomiting          0.09     -0.19      0.43      0.14      1.00        0.10  -0.08      0.21
Hemoglobin        0.04     -0.05      0.10      0.31      0.10        1.00   0.37     -0.19
WBC               0.06      0.01     -0.14     -0.03     -0.08        0.37   1.00     -0.33
Platelet         -0.21     -0.13      0.04      0.31      0.21       -0.19  -0.33      1.00

Figure 2: Linear correlation of positive cases.

4 Machine learning algorithms

4.1 K-nearest neighbour (kNN)

The k-nearest neighbour classifier is based on instance-based learning, also known as memory-based or lazy learning. In this approach, novel problem instances are matched against previously stored training instances held in memory. It is most fruitful for large datasets with few features, provides an approximation of the target function, and requires little training time. The k-NN method can be applied to both classification and regression. In both cases the input consists of the k nearest training instances in the feature space, and the form of the output depends on whether k-NN is applied for classification or regression [10]. In k-NN classification, the result is a class membership: the class of an object is decided by a majority vote of its neighbours. In k-NN regression, by contrast, the outcome is a property value for the object, computed as the mean of the values of its k nearest neighbours. For continuous-valued target functions, the k-NN model therefore computes the average of the k nearest neighbours, which also makes it robust to noisy data. The distance between neighbours can, however, be dominated by irrelevant features, leading to the curse of dimensionality; to overcome this, the dimensions can be rescaled or the less significant features eliminated.

4.2 Support vector machine (SVM)

The Support Vector Machine, also known as the support vector network and introduced by Vladimir Vapnik, is used for both classification and prediction. The SVM is fundamentally a method for the binary classification problem, although multi-class implementations exist; it maps input vectors to a high-dimensional feature space, where a linear decision surface is constructed with special properties that ensure a high generalization capability of the learning machine [6]. The SVM is founded on statistical learning theory: there are infinitely many hyperplanes separating the two classes, and the SVM searches for the best one, i.e. the one that minimizes the classification error on unseen data. The SVM therefore looks for the hyperplane with the largest margin, the maximum marginal hyperplane (MMH).
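A minimal sketch of this maximum-margin idea is given below, using scikit-learn's SVC with a linear kernel (whose underlying libsvm solver is an SMO-type algorithm, discussed next) on synthetic two-dimensional data rather than the dengue dataset; the recovered W and b correspond to the separating hyperplane equation given below.

```python
# Sketch of the maximum marginal hyperplane (MMH) idea with a linear-kernel SVC;
# the two-dimensional data is synthetic, not the dengue dataset. W and b are the
# parameters of the separating hyperplane W.X + b = 0 discussed in the text.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_neg = rng.normal(loc=0.0, scale=1.0, size=(30, 2))   # toy negative class
X_pos = rng.normal(loc=4.0, scale=1.0, size=(30, 2))   # toy positive class
X = np.vstack([X_neg, X_pos])
y = np.array([0] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

W, b = clf.coef_[0], clf.intercept_[0]   # hyperplane weight vector and bias
print("W =", W, "b =", b)
print("Number of support vectors defining the margin:", len(clf.support_vectors_))
```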
The idea behind the SVM was originally formulated for the restricted situation where the training data can be separated without error, and was later extended to non-separable training data; it has since been widely applied in biology. The SVM is a deterministic approach with good generalization properties, and its strong mathematical foundation allows kernels to be used for learning complex decision boundaries. Sequential minimal optimization (SMO) is a method for solving the quadratic programming problem that arises when training a support vector machine [12,18]. A separating hyperplane H can be written as:

W · X + b = 0

where W is the weight vector, X the input vector, and b the bias.

4.3 Artificial neural network (ANN)

An artificial neural network is a powerful processing system, implemented as an algorithm or as real hardware, that is able to acquire knowledge from experience or observation, represent it through its interconnected units, and make this learned knowledge available for use. The weighted sum of products x_i w_ki (for i = 0 to m) is usually denoted as net_k:

net_k = x_0 w_k0 + Σ_{i=1}^{m} x_i w_ki

Finally, an artificial neuron computes the output y_k as a function of the net_k value:

y_k = f(net_k)

where x and y are the input and output signals respectively, w_ki is the synaptic weight of the i-th input (synapse) of neuron k, and f is the activation function [10].

4.4 Naive Bayes classifier

Bayesian learning refers to methods grounded in probability and statistics. Bayes' theorem gives the probability of an event on the basis of conditions that may be related to the event. The performance of Bayesian classifiers is comparable to that of selected neural network and classification tree classifiers. Every training sample can incrementally increase or decrease the probability that a hypothesis is correct, which means that prior knowledge can be combined with the observed outcomes. Although exact Bayesian inference can be computationally intractable, it provides a standard of optimal decision making, and the Naive Bayes classifier offers a practical approximation. Naive Bayes classifiers are applied to assign each sample of a dataset to the most appropriate class by combining the evidence of its individual attributes [18]. The mathematical statement of Bayes' theorem is:

P(X|Y) = P(X) P(Y|X) / P(Y)

Here X and Y are events; P(X) and P(Y) are the probabilities of X and Y independently of each other; P(X|Y) is the conditional probability of observing X given that Y is true; and P(Y|X) is the probability of observing Y given that X is true.

4.5 Decision tree

The decision tree is a hierarchical prediction approach that maps the observed attributes onto branches and the target values onto leaves. The predictions can be discrete values (a classification decision tree) or continuous values (a regression decision tree). Prominent algorithms such as ID3, C4.5, CART, CHAID, and MARS have been developed for building decision tree prediction models. The J48 decision tree [11], a popular Java implementation of the C4.5 algorithm in the WEKA tool, is applied as one of the classifiers in this research.
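As a rough analogue of the J48/C4.5 model (scikit-learn's tree is not identical to C4.5, which uses gain ratio and its own pruning), the sketch below fits a decision tree with the entropy criterion, so its splits are chosen by an information-gain-style measure like the one formalised next; the data is a synthetic stand-in for the clinical attributes.

```python
# Rough analogue (not identical) of WEKA's J48/C4.5: a decision tree with the
# entropy criterion, so splits are selected by an information-gain-style measure.
# The binary symptom matrix and labels below are synthetic stand-ins.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(75, 5)).astype(float)           # toy symptom matrix
y = (X[:, 0] + X[:, 3] + rng.random(75) > 1.5).astype(int)   # toy dengue label

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=1)
tree.fit(X, y)

# Print the learned rules, analogous to inspecting the J48 tree inside WEKA.
print(export_text(tree, feature_names=["Fever", "Headache", "Bodyache",
                                       "AbdominalPain", "Vomiting"]))
```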
The attribute selection measure based on information gain is defined as:

I(p, n) = - (p / (p + n)) log2(p / (p + n)) - (n / (p + n)) log2(n / (p + n))

The entropy, i.e. the information still required to classify the objects in all sub-trees of attribute A, is calculated as:

E(A) = Σ_{i=1}^{v} ((p_i + n_i) / (p + n)) I(p_i, n_i)

The information gained by branching on A is then:

Gain(A) = I(p, n) - E(A)

where A denotes the attribute, I the information measure, and p and n the numbers of elements of classes P and N respectively.

4.6 Logistic regression classifier

Logistic regression is a regression technique in which the dependent variable is categorical; it is a way of predicting a dichotomous outcome. Logistic regression can be binomial, ordinal, or multinomial; in the multinomial case the outcome can have more than two possible types. Univariate logistic regression was applied for the continuous covariates. Logistic regression gives the odds ratios of interest, but these are not easy to use as a diagnostic device at the bedside because a computer would be required to compute the dengue fever prediction; consequently, the selected logistic regression models were readjusted by substituting continuous attributes with binary counterparts [4].

4.7 LogitBoost: an ensemble classifier

Various applications of the data mining process have demonstrated the validity of the No-Free-Lunch theorem [22]. According to the No-Free-Lunch theorem, a single learning model cannot be the best and most appropriate across the whole domain of applications. Ensemble learning is a promising strategy that combines weak learners into a powerful model with the specific goal of enhancing the prediction model [15]. An ensemble model mixes several prominent models in order to improve the accuracy of the resulting model for better prediction. It is a combination of k learned models (M1, M2, M3, ..., Mk) built with the purpose of creating an improved model M* [10], as shown in Figure 3.

In this research, the LogitBoost algorithm is applied as an ensemble classifier for the prediction of dengue outbreak. LogitBoost follows the boosting approach to ensembling. Boosting is a powerful learning approach that is applied to both classification and regression analysis. Boosting first builds a weak classifier, with the training inputs given starting weights, most often identical. At each iteration the inputs are assigned new weights so as to focus the next learned classifier on the instances that were not classified accurately: at each step of learning, the weights of the input instances that are not accurately handled by the weak learner are increased, and the weights of the instances that are accurately handled are decreased. The final classification model is built from a weighted vote of the weak classifiers produced over the iterations.

Figure 3: Ensemble model architecture [10].

In this comparative analysis, we found that LogitBoost performs better than the other prominent classifiers. The LogitBoost ensemble model achieves the highest classification accuracy of 92%, with sensitivity and specificity of 90% and 94% respectively.

5 Classification performance metrics

In this research, seven supervised machine learning approaches were applied to the classification of dengue disease samples. The performance of the classification techniques was estimated using tenfold cross-validation.
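The sketch below illustrates this tenfold cross-validation protocol with scikit-learn analogues of the seven classifiers; LogitBoost itself is a WEKA meta-classifier, so GradientBoostingClassifier (boosting with a logistic loss) is used here only as a stand-in, MLPClassifier stands in for the ANN, and the feature matrix and labels are random placeholders rather than the actual dengue data.

```python
# Illustration of stratified tenfold cross-validation over seven classifiers.
# These are scikit-learn analogues, not the WEKA implementations used in the
# paper; X and y below are random placeholders with the 36/39 class split.
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((75, 8))                 # placeholder for the 75 x 8 feature matrix
y = np.array([0] * 36 + [1] * 39)       # 36 negative and 39 positive labels

models = {
    "LogitBoost (analogue)": GradientBoostingClassifier(random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "ANN (MLP)": MLPClassifier(max_iter=2000, random_state=0),
    "SMO (linear SVM)": SVC(kernel="linear"),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.2f}")
```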
Eight quality parameters were taken into account for the assessment of the classification models. Samples without dengue outbreak were treated as the negative class, and samples with dengue outbreak were treated as the positive class. The basic terminology of the confusion matrix is as follows:
▪ True Positive (TP): the number of records predicted as positive that do have dengue outbreak.
▪ True Negative (TN): the number of records predicted as negative that do not have dengue outbreak.
▪ False Positive (FP): the number of records predicted as positive that actually do not have dengue outbreak. FP is also known as the Type I error.
▪ False Negative (FN): the number of records predicted as negative that actually do have dengue outbreak. FN is also known as the Type II error.

The quality measures derived from the confusion matrix for binary classification are listed below:
❖ Classification Accuracy: the overall proportion of correctly predicted samples to the total number of samples classified by the model.
CA = (TP + TN) / (total samples)
❖ True Positive Rate: the proportion of samples predicted positive among all actually positive samples.
▪ Also known as Sensitivity or Recall.
TPR = TP / (TP + FN)
❖ False Positive Rate: the proportion of samples predicted positive among all actually negative samples.
FPR = FP / (FP + TN)
❖ True Negative Rate: the proportion of samples predicted negative among all actually negative samples.
▪ Also known as Specificity.
TNR = TN / (TN + FP)
❖ Positive Predictive Value: the proportion of truly positive samples among all samples predicted positive.
▪ Also known as Precision.
PPV = TP / (TP + FP)
❖ Negative Predictive Value: the proportion of truly negative samples among all samples predicted negative.
NPV = TN / (TN + FN)
❖ Rate of Misclassification: the proportion of incorrectly classified samples to the total number of samples; it can also be defined as the proportion of the gross error (Type I and Type II errors) to the total number of samples.
▪ RMC = 1 - CA
▪ Also known as the "Error Rate".
RMC = (Type I error + Type II error) / (total samples)
❖ F1 Score: the harmonic mean of recall and precision.
F1 = 2TP / (2TP + FP + FN)
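As a numerical check of these definitions, the short sketch below recomputes all eight measures from the LogitBoost confusion-matrix counts reported in Table 2 (TP = 35, TN = 34, FP = 2, FN = 4); the rounded values agree with the LogitBoost row of Table 3.

```python
# Recompute the eight quality measures from the LogitBoost confusion matrix
# counts of Table 2; the rounded results match the LogitBoost row of Table 3.
TP, TN, FP, FN = 35, 34, 2, 4
total = TP + TN + FP + FN

metrics = {
    "CA": (TP + TN) / total,
    "TPR (sensitivity)": TP / (TP + FN),
    "FPR": FP / (FP + TN),
    "TNR (specificity)": TN / (TN + FP),
    "PPV (precision)": TP / (TP + FP),
    "NPV": TN / (TN + FN),
    "RMC (error rate)": (FP + FN) / total,
    "F1": 2 * TP / (2 * TP + FP + FN),
}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```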
6 Results and discussion

The performance of dengue outbreak prediction by the seven machine learning algorithms was evaluated on the eight attributes described in the methods and materials section. A total of 75 samples were taken into account, with 36 negative and 39 positive cases of dengue outbreak. The dataset was divided into ten folds; throughout cross-validation each fold was used once for testing while the remaining folds were used for training. The confusion matrix of the prediction results for LogitBoost is tabulated in Table 2, and those of the other classifiers (Logistic regression, Decision tree, Naive Bayes, Artificial neural network, Sequential minimal optimization, and k-nearest neighbour) are shown in Figure 4.

Figure 4 depicts the predictions of these machine learning models. The results show that LogitBoost predicts both the highest number of true positives (records predicted as positive that do have dengue outbreak) and the highest number of true negatives (records predicted as negative that do not have dengue outbreak) (Table 2; Figure 4). The Decision tree confusion matrix shows the second highest number of true positives, while Logistic regression predicts the second highest number of true negatives (Figure 4). Logistic regression has the third highest number of true positives and SMO the third highest number of true negatives (Figure 4). The Naive Bayes and ANN confusion matrices show that both share the fourth highest numbers of true positives and true negatives (Figure 4). SMO has the fifth highest number of true positives and the Decision tree the fifth highest number of true negatives (Figure 4). The k-NN confusion matrix shows the worst performance, with the lowest numbers of true positives and true negatives (Figure 4).

LogitBoost          Predicted Negative   Predicted Positive   Total
Actual Negative     34 (89.47%)          2 (5.40%)            36
Actual Positive     4 (10.53%)           35 (94.59%)          39
Total predicted     38                   37                   75

Table 2: Confusion matrix for the LogitBoost algorithm.

Table 3 reports various classification performance measures, namely classification accuracy, specificity, sensitivity, precision, false positive rate, negative predictive value, rate of misclassification, and F1 score. Table 3 shows that LogitBoost outperformed all other machine learning methods with the highest classification accuracy of 92%, while the second highest classification accuracy, 85%, was achieved by Logistic regression. In addition, LogitBoost obtained the highest sensitivity of 90%, with the Decision tree second at 87%. LogitBoost also achieved the highest specificity of 94% and precision of 95%, which indicates that the LogitBoost ensemble model is the most appropriate for identifying patients with dengue outbreak (the positive class). Table 3 also lists the false positive rate, negative predictive value, rate of misclassification, and F1 score of these machine learning methods. The table clearly shows that LogitBoost has the highest negative predictive value of 89% and beats all other methods on the F1 score with 92%; LogitBoost also achieves the lowest false positive rate (6%) and the lowest rate of misclassification (8%).

6.1 ROC curve for performance evaluation

The Receiver Operating Characteristic (ROC) curve is a widely used graphical representation that estimates the performance of a classification model over all feasible thresholds. The ROC curve is generated by plotting the FPR on the x-axis against the TPR on the y-axis. The ROC curve is impartial to both classes, which is important when the number of instances of the two classes varies during training. The area under the ROC curve should be close to 1 for the best classifier. Figure 5 shows that LogitBoost outperforms all other methods in the prediction of the negative dengue outbreak case, and Figure 6 shows that LogitBoost beats the other methods in the prediction of the positive dengue outbreak case.

Figure 5: ROC for the seven machine learning techniques tested for the negative case.

Figure 6: ROC for the seven machine learning techniques tested for the positive case.
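The sketch below shows how such an ROC curve and its area could be produced with scikit-learn for a single classifier; the GradientBoostingClassifier again stands in for LogitBoost, and the toy data is only a placeholder for the dengue feature matrix and labels, so the reported area will not reproduce the 0.967 value obtained in the paper.

```python
# Sketch of producing ROC points (FPR, TPR) and the area under the ROC curve for
# one classifier; the boosting model is a stand-in for LogitBoost and the data is
# a random placeholder, so the numbers are illustrative only.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((75, 8))
y = np.array([0] * 36 + [1] * 39)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

scores = clf.predict_proba(X_te)[:, 1]          # probability of the positive class
fpr, tpr, _ = roc_curve(y_te, scores)
print("ROC points (FPR, TPR):", list(zip(fpr.round(2), tpr.round(2))))
print("Area under the ROC curve:", round(roc_auc_score(y_te, scores), 3))
```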
7 Limitation and future work

In this experimental work we used 8 clinical parameters and 75 dataset samples (36 dengue negative and 39 dengue positive) and performed the classification task of data mining. We then applied seven prominent algorithms, among which LogitBoost (an ensemble model) performed better than the others. According to the No-Free-Lunch theorem [22], a single learning algorithm cannot be the best and most appropriate across the whole domain of applications. The computing cost and processing time may increase with an ensemble model, but new technologies such as cloud computing services and distributed computing have come into existence that reduce the computing cost and processing time. In the future, one can use larger datasets with more related clinical parameters to improve the model accuracy, as mentioned in the data classification section of [10].

         LogitBoost  Logistic Regression  Decision Tree  Naive Bayes  ANN  SMO (SVM)  kNN
TN       34          32                   29             30           30   31         28
FN       4           7                    5              8            8    10         11
TP       35          32                   34             31           31   29         28
FP       2           4                    7              6            6    5          8

Figure 4: Classification outputs of the machine learning algorithms.

                      CA    Sens.  Spec.  Prec.  FPR   NPV   RMC   F1
LogitBoost            0.92  0.90   0.94   0.95   0.06  0.89  0.08  0.92
Logistic Regression   0.85  0.82   0.89   0.89   0.11  0.82  0.15  0.85
Decision Tree         0.84  0.87   0.81   0.83   0.19  0.85  0.16  0.85
Naive Bayes           0.81  0.79   0.83   0.84   0.17  0.79  0.19  0.82
ANN                   0.81  0.79   0.83   0.84   0.17  0.79  0.19  0.82
SMO                   0.80  0.74   0.86   0.85   0.14  0.76  0.20  0.79
kNN                   0.75  0.72   0.78   0.78   0.22  0.72  0.25  0.75

Table 3: Classification performance metrics of the machine learning algorithms.

Model                        Accuracy (%)  Sensitivity  Specificity  Reference
Support Vector Machine       90.42%        47.23%       97.59%       [6]
Random Forest (ensemble)     92%           94%          92%          [5]
Artificial Neural Network    90%           -            -            [9]
Artificial Neural Network    85.92%        -            -            [14]
Decision Tree (C4.5)         84.7%         78.2%        80.2%        [21]
Alternating Decision Tree    89%           89.2%        47.6%        [11]
LogitBoost (ensemble)        92%           90%          94%          This experiment

Table 4: Comparison of the accuracy of the LogitBoost ensemble model with other published experiments.

8 Conclusion

The number of dengue patients is increasing rapidly, and according to World Health Organisation (WHO) records dengue has now been reported on every continent. Dengue outbreak prediction may save people's lives and can be of valuable help in diagnosis. This work presents a workflow based on machine learning techniques for forecasting the negative or the positive case of dengue outbreak. The prime focus of the research is the prediction of dengue outbreak using the WEKA tool. In this research article, seven prominent machine learning techniques have been applied and eight parameters used for performance evaluation. It is concluded that the LogitBoost ensemble model is the best-performing classifier: it reached a classification accuracy of 92% with sensitivity and specificity of 90% and 94% respectively and an ROC area of 0.967, and it had the lowest error rate. We have compared the accuracy of our analysis with other published results in Table 4. Based on our comparative analysis using the LogitBoost ensemble model, together with the Random forest classifier used by Fathima et al. (2015) [5], we conclude that ensemble models perform better than individual classifiers (Table 4). Furthermore, we wish to enhance the model accuracy in the future with more relevant and sensitive clinical features on a much larger dataset, and we are also interested in developing a web-based tool that helps doctors make more accurate decisions about dengue outbreak.

List of abbreviations

DEN : Dengue
DF : Dengue Fever
DHF : Dengue Haemorrhagic Fever
DSS : Dengue Shock Syndrome
CSV : Comma Separated Values
WBC : White Blood Cell Count
ANN : Artificial Neural Network
SVM : Support Vector Machine
SMO : Sequential Minimal Optimization
ADT : Alternating Decision Tree
NB : Naive Bayes
RF : Random Forest
MNB : Modified Naive Bayes
MFNN : Multilayer Feedforward Neural Network
ROC : Receiver Operating Characteristic

9 References

[1] Althouse, B. M., Ng, Y. Y., & Cummings, D. A. (2011).
Prediction of dengue incidence using search query surveillance. PLoS Neglected Tropical Diseases, 5(8), e1258. https://doi.org/10.1371/journal.pntd.0001258
[2] Arunachalam, N., Tana, S., Espino, F., Kittayapong, P., Abeyewickrem, W., Wai, K. T., ... & Petzold, M. (2010). Eco-bio-social determinants of dengue vector breeding: a multicountry study in urban and periurban Asia. Bulletin of the World Health Organization, 88(3), 173-184. https://doi.org/10.2471/BLT.09.067892
[3] Brasier, A. R., Ju, H., Garcia, J., Spratt, H. M., Victor, S. S., Forshey, B. M., ... & Rocha, C. (2012). A three-component biomarker panel for prediction of dengue hemorrhagic fever. The American Journal of Tropical Medicine and Hygiene, 86(2), 341-348. https://doi.org/10.4269/ajtmh.2012.11-0469
[4] Chadwick, D., Arch, B., Wilder-Smith, A., & Paton, N. (2006). Distinguishing dengue fever from other infections on the basis of simple clinical and laboratory features: application of logistic regression analysis. Journal of Clinical Virology, 35(2), 147-153. https://doi.org/10.1016/j.jcv.2005.06.002
[5] Fathima, A. S., & Manimeglai, D. (2015). Analysis of significant factors for dengue infection prognosis using the random forest classifier. International Journal of Advanced Computer Science and Applications, 6(2). https://doi.org/10.14569/IJACSA.2015.060235
[6] Fathima, A., & Manimegalai, D. (2012). Predictive analysis for the arbovirus-dengue using SVM classification. International Journal of Engineering and Technology, 2(3), 521-527.
[7] Gibbons, R. V., & Vaughn, D. W. (2002). Dengue: an escalating problem. BMJ: British Medical Journal, 324(7353), 1563. https://doi.org/10.1136/bmj.324.7353.1563
[8] Horstick, O., Farrar, J., Lum, L., Martinez, E., San Martin, J. L., Ehrenberg, J., ... & Kroeger, A. (2012). Reviewing the development, evidence base, and application of the revised dengue case classification. Pathogens and Global Health, 106(2), 94-101. https://doi.org/10.1179/2047773212Y.0000000017
[9] Ibrahim, F., Taib, M. N., Abas, W. A. B. W., Guan, C. C., & Sulaiman, S. (2005). A novel dengue fever (DF) and dengue haemorrhagic fever (DHF) analysis using artificial neural network (ANN). Computer Methods and Programs in Biomedicine, 79(3), 273-281. https://doi.org/10.1016/j.cmpb.2005.04.002
[10] Iqbal, N., & Islam, M. (2017). Machine learning for dengue outbreak prediction: An outlook. International Journal of Advanced Research in Computer Science, 8(1), 93-102.
[11] Kumar, M. N. (2013). Alternating decision trees for early diagnosis of dengue fever. arXiv preprint arXiv:1305.7331.
[12] Nandini, V., Sriranjitha, R., & Yazhini, T. P. (2016). Dengue detection and prediction system using data mining with frequency analysis. Computer Science & Information Technology. https://doi.org/10.5121/csit.2016.60906
[13] Online. Available: https://en.wikipedia.org/wiki/Dengue_fever
[14] Rachata, N., Charoenkwan, P., Yooyativong, T., Chamnongthal, K., Lursinsap, C., & Higuchi, K. (2008, October). Automatic prediction system of dengue haemorrhagic-fever outbreak risk by using entropy and artificial neural network. In 2008 International Symposium on Communications and Information Technologies (ISCIT 2008) (pp. 210-214). IEEE. https://doi.org/10.1109/ISCIT.2008.4700184
[15] Raza, K. (2019).
Improving the prediction accuracy of heart disease with ensemble learning and majority voting rule. In U-Healthcare Monitoring Systems (pp. 179-196). Academic Press. https://doi.org/10.1016/B978-0-12-815370-3.00008-6
[16] Santamaria, R., Martinez, E., Kratochwill, S., Soria, C., Tan, L. H., Nunez, A., ... & Castelobranco, I. (2009). Comparison and critical appraisal of dengue clinical guidelines and their use in Asia and Latin America. International Health, 1(2), 133-140. https://doi.org/10.1016/j.inhe.2009.08.006
[17] Shakil, K. A., Anis, S., & Alam, M. (2015). Dengue disease prediction using weka data mining tool. arXiv preprint arXiv:1502.05167.
[18] Shaukat, K., Masood, N., Mehreen, S., & Azmeen, U. (2015). Dengue fever prediction: A data mining problem. Journal of Data Mining in Genomics & Proteomics, 2015. https://doi.org/10.4172/2153-0602.1000181
[19] Souza, L. J. D., Nogueira, R. M. R., Soares, L. C., Soares, C. E. C., Ribas, B. F., Alves, F. P., ... & Pessanha, F. E. B. (2007). The impact of dengue on liver function as evaluated by aminotransferase levels. Brazilian Journal of Infectious Diseases, 11(4), 407-410. https://doi.org/10.1590/S1413-86702007000400007
[20] Stany Leena Princy, S., & Muruganandam, A. (2016). An implementation of dengue fever disease spread using Informatica tool with special reference to Dharmapuri district. International Journal of Innovative Research in Computer and Communication Engineering, 4(9). https://doi.org/10.15680/IJIRCCE.2016.0409031
[21] Tanner, L., Schreiber, M., Low, J. G., Ong, A., Tolfvenstam, T., Lai, Y. L., ... & Simmons, C. P. (2008). Decision tree algorithms predict the diagnosis and outcome of dengue fever in the early phase of illness. PLoS Neglected Tropical Diseases, 2(3), e196. https://doi.org/10.1371/journal.pntd.0000196
[22] Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67-82. https://doi.org/10.1109/4235.585893