Informatica 33 (2009) 245-259

Identifying Learners Robust to Low Quality Data

Andres A. Folleco, Taghi M. Khoshgoftaar, Jason Van Hulse and Amri Napolitano
Florida Atlantic University, Boca Raton, Florida, USA
E-mail: {andres, taghi}@cse.fau.edu, {jvanhulse, amrifau}@gmail.com

Keywords: quality of data, class imbalance, random forest, robust learning

Received: October 12, 2008

Low quality or noisy data, which typically consists of erroneous values for both dependent and independent variables, has been demonstrated to have a significantly negative impact on the classification performance of most learning techniques. The impact on learner performance can be magnified when the class distribution is imbalanced or skewed. Unfortunately, in real-world environments the presence of low quality imbalanced data is a common occurrence. In most scenarios, the actual quality of such datasets is unknown to the data mining practitioner. In this study, we identify learners (from a total of 11 classification algorithms) with robust performance in the presence of low quality imbalanced measurement data. Noise was injected into seven imbalanced software measurement datasets that were initially relatively free of noise. Learners were evaluated using analysis of variance models based on their performance as the level of injected noise, the number of attributes with noise, and the percentage of minority instances containing noise were increased. Four performance metrics suitable for class imbalanced data were used to measure learner performance. Based on our results, we recommend using the random forest ensemble learning technique for building classification models from software measurement data, regardless of the quality and class distribution of the data.

Povzetek: A method for identifying classifiers robust to noisy data is presented.

1 Introduction

Not only are real-world datasets often class imbalanced, but typically their attributes (including the class) can contain erroneous values that may negatively impact learning performance [17, 29, 32]. Consequently, it is not uncommon for empirical software engineering practitioners to construct learners using suboptimal or low quality imbalanced data. Note that in this work, only binary classification problems were considered. (This is an expanded version, by invitation, of the work accepted and presented at the 2008 IEEE International Conference on Information Reuse and Integration, IRI'08 [11].) In software quality classification, class imbalance occurs when the number of fault-prone (fp) modules is significantly outnumbered by the number of not fault-prone (nfp) program modules. No other related work in the software quality prediction domain was found that evaluated the robustness of learning techniques, using four performance metrics, relative to low quality imbalanced measurement data. In this study, simulated noise was injected into both the independent attributes and the class (i.e., labeling errors) of seven class imbalanced software engineering measurement datasets. The substantial and significant scope of our experiments makes this study truly unique in the domain of empirical software engineering. The robustness of 11 distinct algorithms trained using low quality imbalanced real-world measurement data is evaluated by analyzing four metrics related to learner performance.
Noisy data was simulated by injecting artificially induced, domain-realistic noise into the independent attributes and class of seven real-world class imbalanced software measurement datasets, as explained in Section 4.3. Four performance metrics particularly well suited to class imbalanced data were selected to measure the impact of noise on learning performance: the area under the ROC curve (AUC), the area under the Precision-Recall curve (PRC), the Kolmogorov-Smirnov statistic (KS), and the F-measure (FM) (see Section 4.2). The overall impact of noise on each learner was measured as the level of noise, the number of attributes with injected noise, and the percentage of minority instances containing noise increased across all the datasets (see Section 5). Furthermore, we illustrate the impact of the factor interaction between the overall noise level and the percentage of minority instances with noise. Conclusions are presented in Section 6.

2 Related work

Regardless of the perceived soundness of a preferred classification algorithm, learning from low quality class imbalanced data will very likely result in biased and suboptimal performance. Several classification studies have demonstrated that the presence of noisy values (mainly corrupted class labels) in the training dataset will likely impact the predictive accuracy of a learning algorithm [17, 32]. Arguably, the main factors determining data quality include independent attribute noise [27, 32], dependent attribute or class noise [16, 5], and missing or omitted values [20]. The data quality characteristics considered in this study include the presence of noise in both the independent and dependent (class) attributes. In addition, the real-world measurement datasets used in this work are inherently class imbalanced. A software measurement dataset selected for binary classification tasks (we only consider binary classification in this work) is said to be imbalanced if the number of positive (fault-prone or fp) class modules is less than the number of negative (not fault-prone or nfp) class modules. Typically, minority instances make up the positive class, while the negative class is composed of the majority instances. Frequently in real-world scenarios, the practitioner is primarily interested in identifying examples from the minority group. In the context of software engineering, this is frequently seen in mission critical applications, where a premium is placed on identifying fault-prone modules during the test phase. Low quality real-world datasets with imbalanced class distributions pose a great challenge to any data mining and machine learning effort. In fact, researchers contend that data quality and class imbalance can significantly impact the reliability of machine learning models in real-world scenarios, and consequently demand empirical consideration and experimental evaluation [10, 35]. Data with such characteristics can be found in a wide variety of application domains besides software quality classification, including network intrusion detection [19] and fraud detection [13]. In our study, we investigate the robustness of 11 learning algorithms in the presence of low quality class imbalanced real-world software measurement datasets, initially relatively free of noise.
Weiss [29] performed a preliminary study of the effects of class and attribute noise on classification with simulated datasets using the C4.5 learner. However, no real-world datasets or additional learners were used. Weiss and Provost [30] investigated the effect of class distribution and training set size on classification performance. Their results imply that the natural class distribution generates higher overall accuracy. However, the overall accuracy rate is often not considered an appropriate measurement method when dealing with class imbalanced datasets. Furthermore, they also suggested that a more balanced class distribution can result in higher AUC values. Van Hulse et al. [28] conducted an investigation of the impact on classification performance of class imbalanced data injected with simulated class noise. The AUC metric was used to measure learning performance. The independent attributes were not considered for noise injection in their study. Studies have also been conducted to investigate the impact of noise using classification performance-enhancing techniques such as cost sensitive learning [33, 34] with various cost ratios [9]. In general, relatively few studies have evaluated the impact of noise (in the class only) in imbalanced data. To our knowledge, no related works have investigated the impact of class and attribute noise on learners constructed from class skewed measurement datasets using several performance metrics suitable for class imbalanced distributions.

3 Learning algorithms

The open-source, Java-based Weka data mining and machine learning tool [31] was used to implement the learners in this study. The learners were selected because most of them are commonly used in class imbalance scenarios, and several are also used in the software engineering and software quality classification domain. The default parameter values of some of the techniques were changed when their respective classification performances improved substantially.

3.1 Random forest

The random forest (RF) classifier was developed by Breiman [3]. RF is a powerful, relatively new approach to data exploration, data analysis, classification, and predictive modeling. A random forest is a collection of unpruned, CART-like trees [4] following specific rules for tree growing, tree combination, self-testing, and post-processing. Trees are grown using binary partitioning in which each parent node is split into no more than two children. Each tree is grown on a different random subsample of the training data. Randomness is also injected into the tree split selection process: RF selects a relatively small subset of the available attributes at random for each node. In the Weka tool, the default for the numFeatures parameter uses ⌊log2(M) + 1⌋ attributes selected at random for each node in the tree, where M is the original number of independent attributes in the data. This attribute selection significantly speeds up the tree generation process. Once a node is split on the best splitter attribute, the process is repeated entirely on each child node, and a new list of predictive attributes is selected at random for each node. The trees remain unpruned and are grown to their maximum size in order to maximize the chances of including important attributes in the trees. Bootstrapping is a process of random sampling with replacement from the training dataset. By applying bootstrapping during the tree induction process, approximately 37% of the observations in the training dataset are not used in a given tree and form its out-of-bag sample.
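The roughly 37% figure follows from sampling with replacement: the probability that a given instance is never drawn in n draws is (1 - 1/n)^n, which approaches e^(-1) ≈ 0.368. The following minimal sketch (not part of the original study; plain NumPy with an arbitrary dataset size) illustrates this behavior.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000                      # number of training instances (arbitrary for illustration)
indices = np.arange(n)

# Bootstrap sample: draw n instances with replacement, as done for each tree.
bootstrap = rng.choice(indices, size=n, replace=True)

# Out-of-bag instances are those never drawn for this tree.
oob = np.setdiff1d(indices, bootstrap)
print(f"out-of-bag fraction: {len(oob) / n:.3f}")   # roughly 0.368, i.e. about 37%
```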
Another important parameter of the RF algorithm is the number of trees, numTrees, in the ensemble. The default value for numTrees in Weka is 10; however, previous research by our group [15] found 100 to be a more appropriate value, so 100 was used instead. Combining results from multiple models (trees) generally yields better performance than a single model. Combining trees by averaging their votes will only be beneficial if the trees are different from each other. RF induces vastly more between-tree variation by forcing random splits on different predictive attributes. Having a diverse collection of robust trees lowers the overall error rate, avoids over-fitting the training data, imbues a substantial resilience to noisy values, and therefore enhances the performance of the RF [3] ensemble classifier.

3.2 k Nearest Neighbors (Two versions)

K nearest neighbors [1] (kNN), called IBk in Weka, is an instance-based classification technique. The class of a test case is predicted by majority voting of its k nearest neighbors. If only one nearest neighbor is used to predict the class of a test instance, especially in the presence of outliers and/or low quality data, accuracy may suffer. The Euclidean distance is often used as the similarity function to determine the candidate nearest neighbors. A possible disadvantage of IBk is that its computation time depends on the number of nearest neighbors: as the number of nearest neighbors increases, so do the computational resources and time needed. In our experiments, kNN classifiers were built with changes to two parameters: the 'distanceWeighting' parameter was set to 'Weight by 1/distance', and two different classifiers were built using k = 2 and k = 5 neighbors, denoted '2NN' and '5NN', respectively.

3.3 C4.5 Decision Tree (Two versions)

C4.5 [23] is a benchmark decision tree classification algorithm; J48 is Weka's implementation of this algorithm. It is an inductive supervised classifier that uses decision trees to represent the underlying structure of the input data. The algorithm has four major components: the decision tree generator, the production rule generator, the decision tree interpreter, and the production rule interpreter. These modules are used for constructing and evaluating the classification tree models. The algorithm begins with an empty tree, to which decision and leaf nodes are added, starting at the root (top) node. In the next step, using one of the attributes x_j (j = 1, ..., m, where m is the number of attributes), the instances in the root node are split into two (or more) child nodes N_l and N_r. For example, if x_j is a continuous attribute and i = 1, ..., n indexes the instances, we define N_l = {x_i ∈ D | x_ij ≤ t} and N_r = {x_i ∈ D | x_ij > t} for some threshold value t of x_j. C4.5 evaluates each of the attributes x_j to determine the best split at each tree node. The splitting process is recursively applied to each of the resulting child nodes until some stopping criterion is met. After the tree is fully built, C4.5 provides the option to prune sections of the tree to avoid over-fitting. Two different versions of the C4.5 classifier were used in our experiments. The version we call C4D uses the default parameter settings in Weka to build the trees. The version we call C4N disables decision-tree pruning and enables Laplace smoothing.
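As an illustration of the binary threshold split just described, the sketch below searches for the threshold t that best separates the two classes on one continuous attribute. It is not taken from the paper: it uses plain information gain rather than C4.5's gain ratio, made-up data, and function names of our own choosing.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a binary label vector."""
    p = np.mean(y)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def best_threshold(x, y):
    """Threshold t maximizing information gain for N_l = {x_i <= t}, N_r = {x_i > t}."""
    best_t, best_gain = None, -1.0
    parent = entropy(y)
    for t in np.unique(x)[:-1]:            # candidate thresholds between observed values
        left, right = y[x <= t], y[x > t]
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = parent - child
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Toy example: one continuous attribute, binary (nfp = 0 / fp = 1) labels.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0,   0,   0,   1,   1,   1])
print(best_threshold(x, y))                # splits cleanly at t = 3.0 with gain = 1.0
```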
The C4N settings (no pruning, Laplace smoothing) were recommended for speed and performance by Weiss and Provost [30].

3.4 Support Vector Machine

The support vector machine (SVM) classifier, called SMO in Weka, can be used to solve two-class (binary) classification problems [24]. These classifiers find a maximum margin linear hyperplane within the instance space that provides the greatest separation between the two classes. Instances that are closest to the maximum margin linear hyperplane form the support vectors. Once the instances that form the support vectors have been identified, the maximum margin linear hyperplane can be constructed. We consider the following linear hyperplane separating two classes [31]:

    f(a) = b + Σ_i α_i y_i (a(i) · a)    (1)

where the sum is over the support vectors i, y_i is the class of the training instance a(i), and b and α_i are numeric parameters that are adjusted by the classification algorithm. The term a(i) · a represents the dot product of the test instance a with one of the support vectors. Finding the linear hyperplane of Equation 1 is equivalent to solving a constrained quadratic optimization problem. In this study, the SVM classifier had two changes to the default parameters: the complexity constant 'c' was set to 5.0 and 'buildLogisticModels' was set to 'true.' By default, a linear kernel was used.

3.5 Logistic Regression

Logistic regression (LR) is a statistical regression model [14] that can be used for two-class classification problems. Using the training data instances as input, a logistic regression model is created which is used to decide the class membership of the test data instances. The logistic function used for modeling may be defined as follows:

    f(z) = 1 / (1 + e^(-z))

where z is a linear combination of the attribute values of an input instance and e denotes the base of the natural logarithm. The logistic function can take as input any negative or positive value of z, while its output is always in the range of zero to one. The output of the logistic regression classifier expresses the probability of an instance belonging to a certain class. For an instance of the training dataset, the input to the logistic function is:

    z = w_0 + (w_1 × x_1) + (w_2 × x_2) + (w_3 × x_3) + ... + (w_k × x_k)

where w_0 is known as the intercept, w_1, w_2, w_3, ..., w_k are called the regression coefficients or model weights, x_1, x_2, x_3, ..., x_k denote the corresponding instance attribute values from the training set, and k is the total number of attributes considered. Each of the weights describes the impact of the corresponding attribute value on z. Recall that z is used to determine the class membership of the instance. The weights must be adjusted in order to optimize the logistic regression model on the training data. The log-likelihood of the model is used to estimate the goodness of fit, and the weights are chosen to maximize this log-likelihood function. In this study, the Weka default parameter settings were used for this classifier.

3.6 Naive Bayes

Naive Bayes (NB) is a simple and fast algorithm based on the Bayesian rule of conditional probability [12]. NB assumes that attributes are independent of each other within a given class. Even though this assumption may not be realistic for real-world data, NB has been known to perform well despite it.
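To make the independence assumption concrete, the following small sketch (illustrative only, not from the study; toy categorical data, unsmoothed frequency estimates, and hypothetical names) scores a test instance by multiplying per-attribute class-conditional probabilities, exactly the factorization formalized in the equation below.

```python
from collections import Counter

# Toy training data: each row is (x1, x2, class), with class in {"fp", "nfp"}.
train = [("high", "yes", "fp"), ("high", "no", "fp"), ("low", "yes", "nfp"),
         ("low", "no", "nfp"), ("low", "yes", "nfp"), ("high", "yes", "nfp")]

def naive_bayes_score(test, cls):
    """Unnormalized P(cls | test) = P(cls) * prod_j P(x_j | cls)."""
    rows = [r for r in train if r[-1] == cls]
    prior = len(rows) / len(train)
    likelihood = 1.0
    for j, value in enumerate(test):
        counts = Counter(r[j] for r in rows)
        likelihood *= counts[value] / len(rows)   # frequency estimate of P(x_j | cls)
    return prior * likelihood

test = ("high", "yes")
scores = {c: naive_bayes_score(test, c) for c in ("fp", "nfp")}
total = sum(scores.values())
print({c: s / total for c, s in scores.items()})  # normalized posterior estimates
```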
Formally, the algorithm estimates the class probabilities P(fp|x) using Bayes' theorem:

    P(fp|x) = P(fp, x) / P(x) = P(x|fp) P(fp) / P(x) = P(x_1|fp) ... P(x_m|fp) P(fp) / P(x)    (2)

where m is the number of attributes and fp is the fault-prone (minority/positive) class. The independence of attributes is used to factor the class-conditional probability P(x|fp) into P(x_1|fp) ... P(x_m|fp). This transformation allows for the estimation of m one-dimensional distributions P(x_j|fp), j = 1, ..., m, instead of estimating the joint distribution P(x|fp) from the data. In our experiments, the default parameter values were used within the Weka implementation of this algorithm.

3.7 Rule-Based Classifier

RIPPER (Repeated Incremental Pruning to Produce Error Reduction, RIP) is a rule-based classifier, named JRip [6] in Weka. The algorithm was introduced by Cohen [6] as a fast inducer of "If-Then" classification rules. Initially, the algorithm splits the training dataset into two parts: one part is used to induce rules, while the second part is used to validate the induced rules. If a rule's classification accuracy falls below a minimum accuracy threshold, the rule is eliminated from the model. RIP imposes a rule induction ordering: minority class rules are induced first, followed by the majority class. Once all of the instances in the minority class have been covered, a default rule is generated to classify the majority data. This feature reduces the description length of the rule set. The default Weka parameters for this classifier were not changed in our experiments.

3.8 Multilayer Perceptron Networks

A Multilayer Perceptron (MLP) is a network of perceptrons [21]. A perceptron is the simplest neural network, representing a linear hyperplane within instance space; MLPs can be used to solve more complex problems. Every MLP contains an input layer, an output layer, and at least one hidden layer. A hidden layer is an arrangement of neurons that have no direct connections to external sources or environments. MLPs are typically trained as back-propagation neural networks, in which the error observed at the output neurons is propagated back through the network to adjust the weights. The output of each neuron is the thresholded weighted sum of all its inputs from the previous layer. This process is continued iteratively until the error is tolerable or reaches a specific threshold. MLPs map the instances in the input data onto a set of output values using three or more layers of neurons. Activation functions are used to calculate each neuron's output from its input, which is the weighted sum of the outputs from the previous layer. Two MLP parameters were changed from their default values: the 'hiddenLayers' parameter was changed to '3' to define a network with one hidden layer containing three nodes, and the 'validationSetSize' parameter was changed to '10' so that the classifier leaves 10% of the training data aside as a validation set to determine when to stop the iterative training process.
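The following minimal forward-pass sketch shows how such a network maps an instance to an output score through one hidden layer of three nodes, matching the 'hiddenLayers' = 3 setting above. It is illustrative only: random, untrained weights, a sigmoid activation, and an input size chosen to mirror the 13 software metrics used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# One hidden layer with three nodes; 13 inputs stand in for the 13 basic metrics.
n_inputs, n_hidden, n_outputs = 13, 3, 1
W1, b1 = rng.normal(size=(n_hidden, n_inputs)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_outputs, n_hidden)), np.zeros(n_outputs)

def forward(x):
    """Each neuron outputs an activation of the weighted sum of the previous layer."""
    hidden = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ hidden + b2)        # posterior-like score for the fp class

x = rng.normal(size=n_inputs)               # one synthetic software module
print(forward(x))
```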
3.9 Radial Basis Function Networks

Radial basis function networks (RBF) are another type of artificial neural network [21]. They are similar to MLPs except in the method used for processing the data within a single hidden layer. The hidden layer is of sufficiently high dimension and provides a nonlinear transformation from the input space. The output layer in these networks provides a linear transformation from the hidden-unit space to the output space. When using RBF neurons, a category of patterns can be regarded as a Gaussian distribution of points in pattern space. A neuron fires when its input is sufficiently close to activate its Gaussian. Inputs are encoded by computing a measure of how close they are to a receptive field, e.g., the distance between the input vector and the centroid of that neuron. In this study, the only parameter change for RBF was to set the parameter 'numClusters' to 10.

4 Empirical methodology

4.1 Software Measurement Datasets

The JM1, CM1, MW1, PC1, KC1, KC2, and KC3 datasets were obtained from the NASA Metrics Data Program (MDP). Learners were built using 13 basic metrics [16] as independent variables. The dependent variable was a binary module-class label, i.e., fault-prone or not fault-prone. The minority class is represented as the positive (fault-prone) class, while the majority class is represented as the negative (not fault-prone) class. All the instances in the data represent measurements taken from the software modules. The KC1, KC2, and KC3 projects comprise a mission control system and were developed and implemented by different personnel with no overlapping software components. The KC1 system, implemented in C++, is a software component of a large ground system. The KC2 system, implemented in C++, is the science data processing component of a storage management system used for ground processing data. The KC3 system, written in Java, is software developed for the collection, processing, and delivery of satellite meta-data. The PC1 system, implemented in C, is flight control software from an earth orbiting satellite. The JM1 project, implemented in C, is a real-time ground system that uses simulation to generate predictions for space missions. The MW1 project, implemented in C, is the control software of a zero gravity experiment related to combustion. The CM1 project, implemented in C, is a science instrument system used for mission measurements. The module fault data obtained for these software projects indicated the number of faults detected during the corresponding software development cycles. A rule-based noise filter was applied to the CM1, MW1, PC1, KC1, KC2, and KC3 datasets to identify and remove noisy instances [18]. Table 1 provides details about the seven initial datasets and their respective cleansed versions. (JM1 is an exception among the initial datasets, as it is a subset of the much larger original JM dataset [25].) See Van Hulse and Khoshgoftaar [26] for a detailed discussion of the cleansing process for JM1. The 'i/c' sub-headers indicate the initial and cleansed number of instances of a dataset, listed as '#initial / #cleansed'. The row labeled 'P' indicates the number of positive examples in the initial and cleansed datasets. Likewise, the row labeled 'N' indicates the number of negative examples. The 'P + N' row contains the total number of instances in the initial and cleansed datasets, respectively. The '%P' row contains the level of imbalance present in the initial and cleansed datasets. For example, PC1 initially contained 1107 instances, of which 6.9% were positive; after cleansing, 703 instances remained, of which 7.5% were positive.

Table 1: Datasets Positive & Negative Instance Distributions (initial/cleansed, 'i/c')

Instance   JM1 (i/c)    PC1 (i/c)   CM1 (i/c)   MW1 (i/c)   KC1 (i/c)    KC2 (i/c)   KC3 (i/c)   Total (i/c)
P          470/235      76/53       48/39       31/20       325/271      106/82      43/38       1099/738
N          2393/2210    1031/650    457/277     372/291     1782/1093    414/333     415/264     6864/5118
P + N      2863/2445    1107/703    505/316     403/311     2107/1364    520/415     458/302     7963/5856
%P         16.4/9.6     6.9/7.5     9.5/12.3    7.7/6.4     15.4/19.9    20.4/19.8   9.4/12.6    13.8/12.6

Only the cleansed datasets were used in this work because they were subjected to a methodical and carefully designed noise cleansing process developed by Khoshgoftaar et al. [18] (see also Van Hulse [25] for a discussion of the noise cleansing procedure).
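The '%P' imbalance level is simply P / (P + N). A quick check of the cleansed column of Table 1 (a trivial sketch, not part of the study) reproduces the reported percentages:

```python
# P and N counts for the cleansed datasets, taken from Table 1 above.
cleansed = {"JM1": (235, 2210), "PC1": (53, 650), "CM1": (39, 277), "MW1": (20, 291),
            "KC1": (271, 1093), "KC2": (82, 333), "KC3": (38, 264)}

for name, (p, n) in cleansed.items():
    pct_p = 100.0 * p / (p + n)              # level of class imbalance, the %P row
    print(f"{name}: {pct_p:.1f}% positive")  # e.g. PC1 -> 7.5%, matching Table 1
```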
The motivation for using relatively cleansed datasets before injecting noise is to ensure the reliability of the results: adding noise to inherently low quality data can significantly bias and compromise the reliability of any conclusions derived from such data. Table 2 contains the overall classification performance obtained across all eleven classifiers for each cleansed dataset. According to the AUC, PRC, KS, and FM values (described in Section 4.2), the best classification performance was obtained using the largest (and relatively cleanest) dataset, JM1. The second best performance was obtained using KC1 (the second largest), and the worst performance was obtained using MW1 (the most imbalanced and nearly the smallest dataset). The average values shown in the 'Avg' row were calculated across all seven datasets.

Table 2: Classification Performance by Cleansed Dataset (datasets ordered separately by each metric)

Data  AUC     | Data  PRC     | Data  KS      | Data  FM
JM1   0.9987  | JM1   0.9956  | JM1   0.9974  | JM1   0.9972
KC1   0.9977  | KC1   0.9923  | KC1   0.9763  | KC1   0.9739
KC2   0.9922  | KC2   0.9774  | PC1   0.9607  | KC2   0.9471
PC1   0.9915  | PC1   0.9650  | KC2   0.9532  | CM1   0.9350
KC3   0.9865  | CM1   0.9595  | KC3   0.9521  | PC1   0.9301
CM1   0.9837  | KC3   0.9580  | CM1   0.9487  | KC3   0.9244
MW1   0.9767  | MW1   0.9266  | MW1   0.9428  | MW1   0.9040
Avg   0.9897  | Avg   0.9678  | Avg   0.9616  | Avg   0.9445

4.2 Performance Metrics

In a binary decision problem, a learner labels examples as either positive or negative. If very few examples belong to the positive class (as few as 1% or less), a learner could obtain an overall accuracy of 99% by simply classifying all instances as negative. Such a model is useless in a domain like software quality classification because the examples of interest are typically from the positive class. Thus, performance metrics such as accuracy or the misclassification rate are inappropriate for substantially class imbalanced data.

The Receiver Operating Characteristic (ROC) curve [22] graphs the true positive rate on the y-axis versus the false positive rate on the x-axis. The resulting curve illustrates the trade-off between detection and false alarm rates. Often, performance metrics consider only the default decision threshold of 0.5; ROC curves instead illustrate performance across all decision thresholds. The threshold-independent nature of ROC curves makes them well suited for describing the classification performance of models built on class imbalanced data. As a single numeric measure, the area under the ROC curve (AUC) is widely used, providing a general idea of the predictive potential of the classifier. Two classifiers can be evaluated by comparing their AUC values. Provost and Fawcett [22] give an extensive overview of ROC curves and their potential use for optimal classification.

The Precision-Recall (PR) curve [8] provides a different perspective on a classifier's performance on class imbalanced datasets. Precision measures the fraction of instances classified as positive that are truly positive. Recall measures the fraction of positive instances that are correctly labeled. The PR curve graphs recall on the x-axis and precision on the y-axis. A single numeric measure for the PR curve is the area under the PR curve (PRC). Often, a large change in the number of false positives leads to only a small change in the false positive rate used in ROC analysis. PR analysis, on the other hand, compares false positives to true positives rather than to true negatives, capturing the impact of the large number of negative instances on classification performance. However, a classifier that optimizes the area under the ROC curve is not guaranteed to optimize the area under the PR curve [8]. The expressions for precision and recall (which is the same as the true positive rate) are as follows:

    Precision = #TP / (#TP + #FP),    Recall = #TP / (#TP + #FN)    (3)
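The sketch below computes these quantities for a synthetic imbalanced test set. It is illustrative only and uses scikit-learn, not the Weka tool employed in the study; labels and posterior probabilities are randomly generated, and the 0.5 threshold is the default decision threshold mentioned above.

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             precision_score, recall_score)

# Synthetic posterior probabilities for an imbalanced test set (about 10% positive).
rng = np.random.default_rng(1)
y_true = (rng.random(1000) < 0.10).astype(int)
y_prob = np.clip(0.4 * y_true + rng.normal(0.3, 0.15, 1000), 0.0, 1.0)  # noisy scores

y_pred = (y_prob >= 0.5).astype(int)              # default decision threshold of 0.5

print("AUC :", roc_auc_score(y_true, y_prob))               # area under the ROC curve
print("PRC :", average_precision_score(y_true, y_prob))     # area under the PR curve
print("Prec:", precision_score(y_true, y_pred))              # TP / (TP + FP)
print("Rec :", recall_score(y_true, y_pred))                 # TP / (TP + FN)
```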
Like the PR curve, the F-measure (FM) is derived from recall and precision. This metric requires a decision threshold parameter, and the default decision threshold of 0.5 was used in this work. Further, the expression defining the FM metric has a tunable parameter β used to indicate the relative importance of recall and precision; typically, one can alter β to place more emphasis on either recall or precision. In this study, β = 1.

    FM = ((1 + β²) × Recall × Precision) / (β² × Recall + Precision)

The Kolmogorov-Smirnov (KS) test statistic [7, 13] measures the maximum difference between the empirical distribution functions of the posterior probabilities p(x) of instances in the two classes. Let i ∈ {positive, negative} and F_i(t) = P(p(x) ≤ t | i), 0 ≤ t ≤ 1. F_i(t) can be estimated by the proportion of class i instances with posterior probability at most t:

    F̂_i(t) = (#class i instances with posterior probability ≤ t) / (#class i instances)

Therefore, the KS statistic is defined as follows:

    KS = max_t |F_positive(t) - F_negative(t)|

As the separation between the two distribution functions becomes larger, the distinction the classifier makes between the two classes also improves. The maximum possible value of the KS is one (representing perfect separation), and the minimum is zero. The KS statistic is a commonly used metric of classifier performance in the credit scoring application domain [13].

4.3 Noise Injection Procedure

This section describes the noise injection procedure employed in our empirical study. Note that attribute noise was injected into all selected instances in the derived datasets, while class noise was only injected into the training instances, so that the class labels of the test instances were left uncorrupted. To add attribute noise to the datasets, the level of attribute noise (La) and the number of significant attributes injected with noise (Na) were the parameters considered. In the case of class noise, the level of class noise (Lc) and the percentage of instances with injected class noise that were originally from the positive (fault-prone) class (Lm) were the parameters considered. In this study, the noise injection procedure comprised both attribute and class noise, and therefore all four parameters (La, Na, Lc, Lm) were considered.

4.3.1 Class Noise

The class was injected with noise by swapping the respec-