https://doi.org/10.31449/inf.v45i4.3819 Informatica 45 (2021) 517–529

Performance of Malware Detection Classifier Using Genetic Programming in Feature Selection

Heba Al-Harahsheh, Mohammad Al-Shraideh and Saleh Al-Sharaeh
King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
E-mail: Heba.moh.h@gmail.com, mshridah@ju.edu.jo and ssharaeh@ju.edu.jo

Keywords: malware detection, machine learning, feature selection, classifier, genetic programming

Received: September 11, 2021

The term "malware" (malicious software) describes software that damages or compromises computers, servers, or networks. As the number and complexity of malware samples grow rapidly, malware detection systems are needed to identify malware and to examine the behavior of its new features. Because traditional techniques are less efficient at detecting new malware, machine learning techniques are used to detect malware quickly and intelligently and to improve detection performance, as malware and its impact on industry are constantly increasing. In this study, we developed a malware detection model that detects malware with machine learning classifiers after applying a new feature selection technique based on genetic programming. We also compared the performance of all classifiers using recent feature selection techniques. The results show that Random Forest, Random Forest (4), and Random Tree give the best values in all experiments, while Hoeffding Tree and Decision Stump give the lowest F1-score and accuracy in all experiments. The proposed feature selection method, GPMP, performs slightly better than the Filter-based method: accuracy and F1-score are 0.881066 and 0.867546 for GPMP and 0.877624 and 0.862894 for Filter-based, respectively. The experimental results also reveal that GPMP used fewer features than the Filter-based method, which reduces the computation and complexity of the model.

Povzetek: Machine learning methods were analyzed to improve the performance of malware detection.

1 Introduction

Nowadays, the problem of cybersecurity is growing because most electronic devices are connected to the Internet. Cybersecurity affects our daily life and the infrastructure of all fields because of the high connectivity between millions of hosts on the Internet. Malware can modify a target device or application in order to gain full, unauthorized control, and a compromised device can then reach other vulnerable devices to steal data. Malware is the main cause of cyber-attacks. Accordingly, malware detection technology must be developed to improve the legacy detection technology used in industrial security software. According to research by Kaspersky in 2020, the number of new malicious files detected every day increased by 5.2% [1]. Distinguishing between benign and malicious files is therefore one of the most challenging cybersecurity tasks: suspicious files must be detected with high accuracy and with less time and cost. Traditional methods do not provide highly efficient detection because malware spreads very quickly across networks. Accordingly, most researchers try to use machine learning to reach the best detection accuracy and to transfer it into new technologies or tools designed for malware detection and network intrusion [2] [3] [4].
In this paper, we propose a new model that combines a feature selection method based on genetic programming with a set of parallel classifiers, in order to detect malware more accurately at the lowest cost. The model is run with five feature selection methods across ten datasets and fourteen classifiers, and the results are compared to show the best result at the lowest cost.

2 Related studies

Recently, much attention has been given to finding and developing new malware detection methods that improve on existing ones, in order to close the gap created by the growth of malware over time [5]. Malware detection and analysis help the analyst learn the type, category, and target of malware. Malware analysis can be classified into two main categories: static and dynamic analysis. Static analysis is the primary category; it analyzes malware and collects data from a file without executing it. Dynamic analysis is the opposite: it executes the suspicious file in an isolated and controlled environment [6].

Many research papers have been devoted to developing malware detection methods. In [7], for example, a detection system using effective low-dimensional features was proposed. The system used ensemble algorithms to obtain better performance, and it applies the detection technology to a large number of malware samples with faster detection time. Another work [8] studies two categories of classification in one model. Alotaibi proposed the Multi-Level Malware detection using Triad Scale (MLMTS) model, which works in multiple stages: the first two levels perform static analysis and the third level performs dynamic analysis. Linear regression was used as the machine learning technique at the input of each level. In the reported experiments, the MLMTS method increases accuracy and decreases false alarms compared to other recent models.

The study in [9] focuses on an effective and efficient approach for malware detection that uses the behavior of malware families. The authors proposed this methodology because an attacker can modify API call features without changing the overall behavior. They therefore worked in three steps: studying API calls to object operations by analyzing the malware, generating a dependency graph based on the information about these operations, and finally defining a family dependency graph for each malware family. The evaluation results showed that the approach can help anti-virus companies detect malware from zero-day attacks.

A detection system based on multiple anti-virus scanners was proposed in [10] to improve selection performance. The authors examined whether increasing the number of scanners affects detection results and how the scanners can be combined to maximize accuracy. The experiments show that the number of scanners has a small effect on accuracy, and that adding more scanners can lower rather than improve the overall accuracy. Moreover, the final ranking of the scanners depends on their accuracy and provides the best chance to select the best combination of scanners.

The malware detection model in this study uses a specific feature selection method with several classifiers and compares their scores, in order to show the effect of contemporary feature selection on reducing the training-time cost in balanced and imbalanced datasets.
The experimental results were obtained by comparing Precision, Recall, Accuracy, and F1-score for all classifiers, and by comparing the computing time as well. Table 1 provides a summary of the related work in this field of study.

Table 1: Summary of the Related Work.

[6]
- Classifiers / algorithms: Chi-square
- Features: APIs / system calls
- Feature selection method: -
- Result: detection accuracy up to 96.56%
- Objective: proposing a model for recognizing malware and distinguishing it from benign software.
- Limitations: the model struggles with malware that uses evasion techniques, and it was used to detect only 5 classes of malware.

[11]
- Classifiers / algorithms: Evolutionary Algorithm
- Features: malware OpCodes
- Feature selection method: -
- Result: detection accuracy between 85.80% and 87.67% for all datasets
- Objective: using an evolutionary algorithm to generate graphs and compare similar graphs to detect suspicious files; the approach is used to categorize and detect malware.
- Limitations: the study shows that the approach categorizes and detects malware, but it does not show whether it can cover all classes of malware.

[12]
- Classifiers / algorithms: Hidden Markov Model (HMM), Support Vector Machine (SVM), Decision Tree (J48), and Random Forest (RF)
- Features: API calls, operations, and system library usage
- Feature selection method: term frequency-inverse document frequency (TF-IDF) logarithm for feature extraction
- Result: the Random Forest classifier gives the best results, while HMM has the lowest performance
- Objective: evaluating classification approaches in terms of distinctive dynamic features and finding the best dynamic features.
- Limitations: the malware detection approaches were used to obtain both family classification and malware detection.

[7]
- Classifiers / algorithms: AdaBoost, random forest, XGBoost, rotation trees, and extra trees
- Features: 2-gram, 2-gramM, API-DLL, API, and WEM
- Feature selection method: frequency analysis and expert knowledge to select relevant features
- Result: XGBoost reaches the highest rank in AUC-PRC and accuracy
- Objective: developing a novel technique to reduce feature dimensionality.
- Limitations: the study does not report the time needed to extract features by frequency analysis and expert knowledge.

[8]
- Classifiers / algorithms: a model with multi-level linear regression (MLAPAM and MDMLA)
- Features: call sequences, fallouts, and arguments
- Feature selection method: the MLMTS method is used to generate a feature set
- Result: the proposed method (MLMTS) gives the maximal accuracy and minimal false positives compared to other methods
- Objective: building a Multi-Level Malware detection using Triad Scale (MLMTS) model based on a regression coefficient.
- Limitations: the experimental study was performed using one benchmark malware dataset.

[9]
- Classifiers / algorithms: comparing the object operations of the feature dependency graph and the family dependency graph
- Features: API calls
- Feature selection method: -
- Result: the proposed model gives highly efficient and effective results
- Objective: building a malware detection system based on the behavior of the malware family.
- Limitations: justifying the use of the behavior-based features and the graphs is time consuming.

[10]
- Classifiers / algorithms: comparing multi-scanner systems as black boxes
- Features: features extracted from the malware were not considered; only the rates from the scanners were used
- Feature selection method: -
- Result: combining multiple anti-virus scanners achieves high accuracy, and the result is the best combination of scanners
- Objective: proposing three models to achieve the best accuracy of a multi-scanner detection system and minimize the scanning cost.
- Limitations: the internal mechanism is not clear, and more details are needed about the features and classifiers used in all scanners.

[13]
- Classifiers / algorithms: Gradient Boosting Algorithm
- Features: malware OpCodes
- Feature selection method: deep learning-based feature extraction (word2vec)
- Result: detection accuracy up to 96%
- Objective: developing a model to represent malware that mainly uses the malware opcodes.
- Limitations: the work covered a short range of malware classes (8 different classes).

3 Datasets information

This section presents all datasets used in the experiments conducted for this study. Our approach required several datasets in order to study how they affect malware detection performance. All datasets used are available online. We used both balanced and imbalanced datasets from the malware detection domain. Each dataset is labeled into two categories, malicious or benign software, and each has a different number of instances and features. Table 2 lists, for each dataset, the number of features, the number of classes, the number of instances, the characteristics of the data, and whether the class distribution is balanced or imbalanced.

3.1 PE section headers

The "PE Section Headers" dataset is a balanced dataset developed by Angelo Oliveira; its features were extracted from the "PE-section" portion of Cuckoo Sandbox reports for a group of PE malware and PE goodware files. The dataset was created for malware detection and classification purposes [14].

3.2 Malware analysis datasets: top-1000 PE imports

Angelo Oliveira also generated the "Top-1000 PE Imports" dataset, an imbalanced dataset created from the "pe_imports" part of Cuckoo Sandbox reports for a group of PE malware and PE goodware files [15].
3.3 API call sequence

The imbalanced "API Call Sequence" dataset contains API call sequences for 42,797 malware and 1,079 goodware samples, gathered from the extracted "calls" part of Cuckoo Sandbox reports [16].

3.4 Malware detection data

This imbalanced dataset was created by Takbiri in June 2019 as a result of a study on detecting malware using low-level architectural features [17].

3.5 BIG malware dataset from Microsoft

The Microsoft team created a balanced dataset from its Malware Classification Challenge competition, called "BIG 2015" [18].

3.6 ClaMP (Classification of Malware with PE headers)

The ClaMP balanced dataset is built from the header field values of portable executable files, combining malware and benign samples to be used in a detection system [19].

3.7 Malware executable detection

Rumao created a dataset containing a set of features extracted from malware and goodware Windows executable files. It blends two kinds of features of Windows executables, binary hexadecimal system calls and DLL calls, as hybrid features. This imbalanced dataset contains 301 malicious programs and 72 goodware cases [20].

3.8 Windows Malware Detection (REWEMA)

The Windows Malware Detection dataset (REWEMA) is a balanced dataset that contains 3,136 malicious programs and 3,135 benign executable files. Features were extracted by disassembling the executable files and selecting a set of useful file attributes [21].

3.9 Malware classification

The malware classification dataset was uploaded to the Kaggle website by Paul. It is an imbalanced dataset containing features for 75,503 malware and 140,849 goodware samples [22].

3.10 Malware goodware dataset

This dataset was uploaded to Kaggle in February 2021. It is an imbalanced dataset containing feature instances for 50,210 malware and goodware files [23].

Table 2: List of Used Datasets.

Dataset | Alias | # of Features | # of Instances Used | # of Classes | Feature Characteristics | Class Distribution
PE Section Headers | DS1 | 5 | 43293 | 2 | Integer, Float, Text | Balanced
TOP-1000 PE Imports | DS2 | 1001 | 47580 | 2 | Integer, Float, Text | Imbalanced
API Call Sequence | DS3 | 101 | 43876 | 2 | Integer, Float, Text | Imbalanced
Malware Detection Data | DS4 | 16 | 70 | 2 | Integer, Float, Text | Imbalanced
BIG Malware Dataset from Microsoft | DS5 | 69 | 5210 | 2 | Integer, Float, Text | Balanced
ClaMP (Classification of Malware with PE headers) | DS6 | 55 | 5184 | 2 | Integer, Float, Text | Balanced
Malware Executable Detection | DS7 | 531 | 373 | 2 | Integer, Text | Imbalanced
Windows Malware Detection (REWEMA) | DS8 | 631 | 6271 | 2 | Integer, Text | Balanced
Malware Classification | DS9 | 56 | 216352 | 2 | Integer, Text | Imbalanced
Malware Goodware Dataset | DS10 | 27 | 50210 | 2 | Integer, Float, Text | Imbalanced
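To make the dataset descriptions above concrete, the following is a minimal sketch of loading one of these CSV datasets into a feature matrix and label vector. The file name "pe_section_headers.csv" and the label column name "malware" are illustrative assumptions; the actual files from Kaggle and IEEE DataPort use their own column names.

```python
# Minimal sketch of loading one of the datasets described in Section 3.
# The file name and label column are hypothetical placeholders.
import pandas as pd

def load_dataset(path: str, label_col: str = "malware"):
    """Load a CSV dataset and split it into a feature matrix X and label vector y."""
    df = pd.read_csv(path)
    X = df.drop(columns=[label_col])
    y = df[label_col]
    return X, y

if __name__ == "__main__":
    X, y = load_dataset("pe_section_headers.csv")
    # Inspecting the class counts shows whether the dataset is balanced or imbalanced.
    print(X.shape, y.value_counts())
```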
4 Method

4.1 Methodology design

The malware datasets described in Section 3 were collected to test the proposed detection method. All ten datasets are labeled into the two categories of malware and benign software. In addition, the datasets are further divided into two types, balanced and imbalanced, according to the proportion of the malware and benign classes in each dataset. Five feature selection techniques, described below in Section 4.2, were used in this study and passed to fourteen machine learning classifiers in parallel. The objective of this model is to compute detection performance at the lowest cost. In our approach, we divided each of the ten datasets into a training and a test set, with percentages of 70% and 30%, respectively.

In this work, the model is designed and evaluated with the following main steps:
(1) Data cleaning was performed for all datasets, before they were split for training and testing, to fix all problems in the data (missing values, outliers, and discrepancies, among others).
(2) Five feature selection methods were used (Chi-Square, Filter-based, Wrapper-based, GPM, and GPMP).
(3) The number of features selected by each feature selection method was recorded and compared, to test how well each method extracts relevant features and how this is reflected in the overall performance of the detection model.
(4) The SMOTE oversampling technique was applied to the imbalanced datasets.
(5) The new datasets produced after applying the feature selection and SMOTE methods were passed to the classification model (14 classifiers) to obtain predictions.
(6) The performance of the detection model was evaluated using accuracy, F1-score, precision, and recall.
(7) Finally, these evaluation scores were compared across all datasets, all feature selection methods, and all classifiers.

The result of the model focuses on the performance obtained on both balanced and imbalanced datasets. All these steps were performed for the ten datasets (whether balanced or imbalanced) to study whether the proposed approach obtains good performance on datasets with different characteristics.
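The following is a minimal sketch of steps (1)-(7) for a single dataset and a single classifier, assuming scikit-learn and the imbalanced-learn SMOTE implementation. The selected_features argument stands in for whichever feature selection method is being evaluated; the actual study runs all five FS methods and all 14 classifiers, and performs data cleaning before this point.

```python
# A minimal sketch of the evaluation pipeline of Section 4.1 under the stated
# assumptions; Random Forest is used here as one example of the 14 classifiers.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import SMOTE

def evaluate(X: pd.DataFrame, y: pd.Series, selected_features, imbalanced: bool):
    # 70% / 30% train-test split, stratified on the class label.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[selected_features], y, test_size=0.30, stratify=y, random_state=42)
    # Oversample the minority class (shown here on the training split only,
    # a common precaution against leaking synthetic samples into the test set).
    if imbalanced:
        X_tr, y_tr = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    # Step (6): the four evaluation metrics used in the study.
    return {"accuracy": accuracy_score(y_te, pred),
            "precision": precision_score(y_te, pred),
            "recall": recall_score(y_te, pred),
            "f1": f1_score(y_te, pred)}
```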
4.2 Feature selection

In this work, two main preprocessing steps were applied to the datasets before running the feature selection techniques.

4.3 Data cleaning

In this study, we applied data cleaning to all datasets. Data cleaning prepares the raw data for feature selection by dropping outliers, handling missing values, encoding values (text, integer, date, and float, among others), and scaling the data [24].

4.3.1 Using a data augmentation technique

The Synthetic Minority Over-sampling Technique (SMOTE) is one of the well-known augmentation techniques used on imbalanced datasets to address the minority class problem: in an imbalanced dataset, the minority classes have too few instances, which biases model decisions [25]. In this study, we used SMOTE over-sampling to balance the classes in the datasets by adding new synthesized instances of the minority class. We also tested the under-sampling variant, which removes random instances of the majority class until it is balanced against the minority class. However, detection efficacy decreased because some datasets have very few minority-class instances, so under-sampling shrinks the dataset and harms the training and testing phases. Therefore, the main augmentation technique used in this study for all imbalanced datasets is SMOTE over-sampling [26].

4.3.2 Feature selection techniques

In this part of the study, we used five feature selection methods. Three of them are commonly used in machine learning: Chi-Square, Filter-based, and Wrapper-based. The remaining two, Genetic Programming Mean (GPM) and Genetic Programming Mean Plus (GPMP), were developed in this study using a genetic programming (GP) algorithm implemented in the open-source framework HeuristicLab (Heuristic and Evolutionary Algorithms Laboratory) [27]. The GP run assigns a weight to every feature in its internal computations and releases features with relatively close weight values. We added two thresholds to the output of the GP algorithm to find the most important and most relevant features, in order to obtain more accurate predictions. The first threshold, used in GPM, is the mean of all feature weights; only features whose weights are greater than this threshold are selected. In GPMP, we changed the threshold to give the remaining features whose weights fall below the mean a chance: an interim threshold is combined with the original threshold value so that features whose weights are near the original threshold can still be selected.

Equation (1) defines the Chi-Square statistic, where O_i is the observed value and E_i is the expected value. Equation (2) gives the GPM threshold, where W_{F_i} is the weight of feature i and k is the total number of features. Equation (3) is similar to equation (2), but it subtracts the mean of the weights of the z features that fall below the overall mean; this interim threshold increases the chance of features whose weights are below the original threshold.

\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}    (1)

GPM = \frac{1}{k} \sum_{i=1}^{k} W_{F_i}    (2)

GPMP = \frac{1}{k} \sum_{y=1}^{k} W_{F_y} - \frac{1}{z} \sum_{o=1}^{z} W_{F_o}^{low}    (3)

The main difference between these methods, when applied in our approach, is the number of specific features each one selects, which affects the computational cost and the detection performance of the model. Each method evaluates the feature values and compares them to the target value to find the strongest relationship, depending on the method's statistical measure. Table 3 lists the five feature selection methods used in this study and the aliases used in the charts.

Table 3: Five Feature Selection Methods.

# | Feature Selection Method | Alias name
1 | Chi-Square | Chi
2 | Genetic Programming Mean (GPM) | GPM
3 | Genetic Programming Mean Plus (GPMP) | GPMP
4 | Filter-based | Filter
5 | Wrapper-based | Wrapper

We found that each method identifies its own set of features to be used in the detection model.
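A small sketch of how the GPM and GPMP thresholds could be applied is given below, assuming the per-feature weights have already been produced by the GP run in HeuristicLab and exported as an array. It follows one reading of Equations (2) and (3) and is not the authors' exact implementation.

```python
# Sketch of GPM / GPMP feature selection from GP-derived feature weights
# (the weights themselves are assumed to come from the HeuristicLab GP run).
import numpy as np

def gpm_select(weights: np.ndarray) -> np.ndarray:
    """GPM: keep features whose GP weight exceeds the mean of all weights (Eq. 2)."""
    threshold = weights.mean()
    return np.where(weights > threshold)[0]

def gpmp_select(weights: np.ndarray) -> np.ndarray:
    """GPMP: relax the GPM threshold by the mean of the below-mean weights (Eq. 3),
    giving features that fall just under the original threshold a chance."""
    mean_all = weights.mean()
    low = weights[weights < mean_all]
    threshold = mean_all - (low.mean() if low.size else 0.0)
    return np.where(weights > threshold)[0]

# Example: indices of the selected feature columns for a toy weight vector.
weights = np.array([0.9, 0.2, 0.55, 0.48, 0.05])
print(gpm_select(weights), gpmp_select(weights))
```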
The difference in the number of features, and in which features are identified, will certainly be reflected in the final results of the detection model. Table 4 shows the differences in the number of features identified by each method.

Table 4: Number of features used for all feature selection methods (count and percentage of the total features).

Dataset | Total | Chi-Square | GPM | GPMP | Filter-based | Wrapper-based
DS1 | 5 | 3 (60%) | 2 (40%) | 4 (80%) | 3 (60%) | 3 (60%)
DS2 | 1001 | 948 (95%) | 802 (80%) | 829 (83%) | 113 (11%) | 518 (52%)
DS3 | 101 | 100 (99%) | 20 (20%) | 33 (33%) | 99 (98%) | 29 (29%)
DS4 | 16 | 15 (94%) | 7 (44%) | 15 (94%) | 14 (88%) | 15 (94%)
DS5 | 69 | 55 (80%) | 12 (17%) | 20 (29%) | 50 (72%) | 61 (88%)
DS6 | 55 | 43 (78%) | 13 (24%) | 16 (29%) | 29 (53%) | 37 (67%)
DS7 | 531 | 483 (91%) | 70 (13%) | 70 (13%) | 133 (25%) | 201 (38%)
DS8 | 631 | 151 (24%) | 59 (9%) | 48 (8%) | 563 (89%) | 611 (97%)
DS9 | 56 | 54 (96%) | 15 (27%) | 18 (32%) | 34 (61%) | 46 (82%)
DS10 | 27 | 19 (70%) | 7 (26%) | 9 (33%) | 20 (74%) | 25 (93%)
Average percentage | | 79% | 30% | 43% | 63% | 70%

4.4 Evaluation metrics

To evaluate the proposed detection model, we used the common evaluation metrics accuracy, precision, and recall, and we added the F1-score because we tested two types of datasets: balanced datasets can be measured with accuracy, whereas imbalanced datasets also need to be measured with the F1-score. Equations (4) to (7) show how these metrics are calculated [28]. The F1-score combines Precision and Recall, while Accuracy is the proportion of correct predictions among all inputs to the model.

F1\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (4)

Accuracy = \frac{TP + TN}{TP + FP + FN + TN}    (5)

Precision = \frac{TP}{TP + FP}    (6)

Recall = \frac{TP}{TP + FN}    (7)
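For completeness, the following small helper evaluates Equations (4)-(7) directly from the confusion-matrix counts, assuming a binary malware/benign labeling with malware as the positive class.

```python
# Evaluation metrics from confusion-matrix counts (Eqs. 4-7).
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp)                                   # Eq. (6)
    recall = tp / (tp + fn)                                      # Eq. (7)
    f1 = 2 * precision * recall / (precision + recall)           # Eq. (4)
    accuracy = (tp + tn) / (tp + tn + fp + fn)                   # Eq. (5)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Example with toy counts: precision = 0.9, recall ~ 0.857.
print(metrics(tp=90, tn=85, fp=10, fn=15))
```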
5 Experimental results

In this section, we present the results of the experiments that evaluate detection over the ten datasets. Based on all experiments, we evaluated the detection model and summarize the findings of the study in the conclusion section. Table 5 lists the 14 classifiers that were used in the proposed detection model after applying the five feature selection methods to the ten labeled malware datasets. The classifiers were selected according to the literature review on classifier performance, choosing 1) the most common classifiers, 2) the least efficient classifiers, to stress-test our approach, and 3) the most efficient classifiers. This diversity of choices helps us study the proposed detection system. In our study, the model was built using four main steps: pre-processing for data cleaning, applying the augmentation technique to the imbalanced datasets, applying the five feature selection methods, and feeding the data to the model using the 14 classifiers.

Table 5: Classifiers used in the proposed model.

No. | Classifier | Alias name used in charts
1 | AdaBoost.M1 | AdaBM1
2 | AdaBoost.M1 (4) | AdaBM1(4)
3 | AdaBoost | AdaB
4 | CatBoost | CatBoost
5 | Decision Stump | DStump
6 | Hoeffding Tree | HTree
7 | k-Nearest Neighbors | KNN
8 | Naive Bayes | NB
9 | Random Committee | RComm
10 | Random Committee (4) | RComm4
11 | Random Forest | RF
12 | Random Forest (4) | RF4
13 | Random Tree | RT
14 | Support Vector Machines | SVM

The main objectives of this study are: first, to determine whether the newly proposed feature selection methods affect the overall performance of the detection model; second, to determine whether the proposed methods give good detection performance on balanced and imbalanced datasets; and third, to determine which classifiers perform better with the new FS methods and to compare them with other state-of-the-art methods.

Figure 1 shows the total number of features in each dataset compared to the number of features used by each FS method in this study. As Figure 1 shows, in almost all datasets Chi-square and Wrapper-based keep many features, according to their calculations. The proposed methods (GPM and GPMP) select a number of features close to that of Filter-based, and GPM and GPMP used fewer features than Filter-based in seven datasets. Table 4 shows the percentage of features used in the ten datasets: GPM and GPMP have the minimum percentages of features used, with averages of 30% and 43%, respectively. The highest number of features is used by Wrapper-based; in DS8 it keeps 97% of the features, which means that almost all features are kept and used.

Based on the percentages shown in Table 4, and after applying the FS methods with the 14 classifiers, the initial analysis of the results shows that the best F1-score and accuracy were obtained with the features selected by GPMP and Filter-based, with only a small difference in values. This first result shows that the comparison between GPMP and Filter-based deserves further study, while GPM performs worse than these two FS methods. This finding guided us to check whether accuracy and F1-score were affected by these feature-usage percentages. As shown in Figures (3) to (12), the results of the experiments conducted on the ten datasets also led us to study whether these FS methods give the same performance on balanced and imbalanced datasets.

Furthermore, we studied the overall behavior of the performance across all datasets and compared the values obtained on balanced and imbalanced datasets after applying the SMOTE oversampling technique. We noted that, once SMOTE augmentation was applied, the prediction model obtained its best F1-score and accuracy across the 14 classifiers used. SMOTE is a common oversampling technique mainly used to handle imbalanced datasets, but it may require extra training time and can cause over-fitting. In this study, however, the oversampling technique helps the model give better performance, comparable to that obtained on the balanced datasets.

Figure 1: The number of features used in all datasets for each FS method.
Figure 2: Average accuracy and F1-score summary for the ten datasets using 14 classifiers.
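For illustration, the following sketch shows how a subset of the classifiers listed in Table 5 above could be instantiated with scikit-learn equivalents. Hoeffding Tree and Random Committee are WEKA algorithms without direct scikit-learn counterparts, so the mapping below is only approximate and the hyperparameters are assumptions, not the settings used in the study.

```python
# Approximate scikit-learn stand-ins for part of the classifier set in Table 5.
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

classifiers = {
    "RF":     RandomForestClassifier(random_state=42),
    "AdaB":   AdaBoostClassifier(random_state=42),
    "DStump": DecisionTreeClassifier(max_depth=1),                            # a decision stump
    "RT":     DecisionTreeClassifier(splitter="random", random_state=42),     # random-tree-like
    "KNN":    KNeighborsClassifier(),
    "NB":     GaussianNB(),
    "SVM":    SVC(),
}
```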
Figures (3) to (12) illustrate the performance with respect to all of our study objectives. In general, the balanced and imbalanced datasets show similar shapes, with only small differences in detail after the SMOTE technique is applied. This means that the FS methods give good results in all datasets regardless of whether they are balanced or imbalanced.

In the final step of our study, we determined which classifier gives the best detection performance with the five FS methods over the ten datasets (balanced and imbalanced). After applying our approach to the ten datasets, the results were summarized by computing the average F1-score and accuracy over all experiments, as shown in Table 6 and Figure 2. These averages make it possible to rank the classifiers by efficiency. We found that three classifiers held the best ranks on average over all conducted experiments: Random Forest, Random Forest (4), and Random Tree lead in accuracy and F1-score. They are followed by three further classifiers, forming performance group B: AdaBoost, AdaBoost.M1, and KNN. Both Hoeffding Tree and Decision Stump give the lowest F1-score and accuracy in all experiments, and the remaining classifiers fall in the middle of the performance scale. Figure 2 summarizes the average accuracy and F1-score for the ten datasets using the 14 classifiers. The averages over all experiments allow us to conclude that GPMP and Filter-based give the best results overall, with average F1-score values of 0.867546 and 0.862894, respectively. This finding leads us to examine the differences between the FS methods; Figure 1 shows the number of features used in each dataset for each FS method.
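A brief sketch of how per-dataset scores could be averaged and ranked to produce a summary like Table 6 below, assuming the raw results are collected in a long-format pandas DataFrame; the column names and sample values are illustrative, not the study's actual data.

```python
# Averaging per-dataset scores by classifier and FS method, then ranking by mean F1.
import pandas as pd

results = pd.DataFrame({
    "classifier": ["RF", "RF", "KNN", "KNN"],
    "fs_method":  ["GPMP", "Filter", "GPMP", "Filter"],
    "accuracy":   [0.98, 0.97, 0.96, 0.93],
    "f1":         [0.98, 0.97, 0.95, 0.93],
})

# One row per classifier, one column block per FS method (a Table-6-like summary).
summary = (results.groupby(["classifier", "fs_method"])[["accuracy", "f1"]]
                  .mean()
                  .unstack("fs_method"))
# Overall classifier ranking by average F1-score across all FS methods.
ranking = results.groupby("classifier")["f1"].mean().sort_values(ascending=False)
print(summary, ranking, sep="\n")
```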
Table 6: Average accuracy and F1-score for the ten datasets using 14 classifiers and five FS methods.

Classifier | GPM Acc | GPM F1 | Filter Acc | Filter F1 | GPMP Acc | GPMP F1 | Chi Acc | Chi F1 | Wrapper Acc | Wrapper F1 | Avg F1
AdaBoost | 0.897007 | 0.892153 | 0.950888 | 0.950875 | 0.912717 | 0.909771 | 0.913025 | 0.912015 | 0.905135 | 0.910262 | 0.915015
AdaBoost.M1 | 0.877519 | 0.875979 | 0.931936 | 0.931577 | 0.933636 | 0.933926 | 0.897579 | 0.897521 | 0.911156 | 0.910161 | 0.909833
AdaBoost.M1(4) | 0.889907 | 0.887123 | 0.920886 | 0.920435 | 0.917216 | 0.920035 | 0.870939 | 0.868993 | 0.902101 | 0.901588 | 0.899635
CatBoost | 0.844995 | 0.854525 | 0.855918 | 0.855751 | 0.885714 | 0.885961 | 0.898815 | 0.898688 | 0.857265 | 0.857820 | 0.870549
Decision Stump | 0.797667 | 0.790254 | 0.793439 | 0.775139 | 0.819049 | 0.812602 | 0.752604 | 0.732342 | 0.771943 | 0.756560 | 0.773379
Hoeffding Tree | 0.519706 | 0.381524 | 0.587115 | 0.442341 | 0.548830 | 0.396545 | 0.526053 | 0.386072 | 0.623355 | 0.525731 | 0.426442
KNN | 0.904014 | 0.901189 | 0.932180 | 0.932396 | 0.954862 | 0.953708 | 0.932422 | 0.934905 | 0.862516 | 0.852418 | 0.914923
NB | 0.768505 | 0.736326 | 0.712549 | 0.670044 | 0.735392 | 0.705738 | 0.700059 | 0.648922 | 0.789419 | 0.769453 | 0.706096
Random Committee | 0.882569 | 0.880216 | 0.908151 | 0.908101 | 0.906350 | 0.906522 | 0.884793 | 0.884288 | 0.825369 | 0.792877 | 0.874401
Random Committee(4) | 0.877746 | 0.873598 | 0.870887 | 0.871200 | 0.871380 | 0.871454 | 0.887017 | 0.885551 | 0.861022 | 0.858908 | 0.872142
Random Forest | 0.957570 | 0.955435 | 0.976959 | 0.976194 | 0.979496 | 0.979747 | 0.945723 | 0.944170 | 0.959321 | 0.962125 | 0.963534
Random Forest(4) | 0.950536 | 0.947880 | 0.975251 | 0.975478 | 0.980718 | 0.979334 | 0.948085 | 0.944361 | 0.966235 | 0.966801 | 0.962771
Random Tree | 0.948175 | 0.942662 | 0.972579 | 0.972934 | 0.976031 | 0.974188 | 0.939855 | 0.939396 | 0.965424 | 0.962871 | 0.958410
SVM | 0.880227 | 0.872036 | 0.898003 | 0.898045 | 0.913536 | 0.916118 | 0.907813 | 0.907274 | 0.915233 | 0.919567 | 0.902608
Average | 0.856867 | 0.842207 | 0.877624 | 0.862894 | 0.881066 | 0.867546 | 0.857484 | 0.841750 | 0.865392 | 0.853367 |

Figure 3: Accuracy and F1-score for DS1.
Figure 4: Accuracy and F1-score for DS2.
Figure 5: Accuracy and F1-score for DS3.
Figure 6: Accuracy and F1-score for DS4.
Figure 7: Accuracy and F1-score for DS5.

Figure 1 shows that in most of the datasets, GPMP used fewer features than Filter-based. This means that the computation in the model takes less time with GPMP than with Filter-based. Figures (3) to (12) show the F1-score and accuracy for each dataset; the analysis of the values in the figures confirms the results summarized in Table 6. In all figures, Random Forest, Random Forest (4), and Random Tree are at the top of all experiments. The values of AdaBoost.M1 and KNN are approximately similar, while Hoeffding Tree and Decision Stump appear at the bottom in all figures. These findings can be generalized to all datasets, whether balanced or imbalanced, as previously discussed.

To check the effectiveness of our study, we implemented our model on ten datasets to obtain the big picture and to understand why the proposed model is more effective and efficient. It is difficult to compare the results of the proposed model directly with other models because most models use a limited number of malware detection features, and because of other limitations such as using a single dataset for the comparison. This study also covers both balanced and imbalanced datasets and applies the proposed model to them.

Figure 8: Accuracy and F1-score for DS6.
Figure 9: Accuracy and F1-score for DS7.
Figure 10: Accuracy and F1-score for DS8.

Most of the related works measure only accuracy as the performance measurement, but our study measures both accuracy and F1-score because imbalanced datasets are used. Nevertheless, the results of the proposed model can be weighed against other related work by looking at the average F1-score of 0.9635 obtained with Random Forest over the ten datasets, which is a good detection rate. We have proposed a malware detection model using 14 classifier algorithms and five feature selection methods, two of which are newly proposed. Our feature selection methods are compared with other recent methods by applying them to the same datasets and checking the differences in accuracy. We found the proposed method to be very effective at distinguishing between benign and harmful programs.

6 Conclusion

This paper presents a malware detection model that enhances the detection rate by using five feature selection methods on ten malware datasets with 14 classifiers. The study examines whether the proposed detection method gives better detection values for balanced and imbalanced datasets. The experiments show no difference in detection values between balanced and imbalanced datasets once the SMOTE oversampling technique is applied to the imbalanced ones. The results confirm that the proposed GPMP feature selection method attains high detection values in accuracy and F1-score. The overall ranking of the feature selection methods by accuracy and F1-score in these experiments is GPMP, Filter-based, Wrapper-based, and Chi-square, respectively. The results also show that GPMP used fewer features than the other methods, with an average of 43% of the features over the ten datasets, while Filter-based, which competes with GPMP in detection rate, used 63% of the features on average. This shows how Filter-based increases the complexity and computation of the detection model. The average detection rates summarize the performance of the FS methods: GPMP and Filter-based give average F1-score values of 0.867546 and 0.862894, respectively.

The final findings of this study concern the performance ranking of the 14 classifiers averaged over all experiments. Random Forest, Random Forest (4), and Random Tree have the highest accuracy and F1-score, with F1-score values of 0.963534, 0.962771, and 0.958410, respectively. They are followed by AdaBoost, AdaBoost.M1, and KNN, while Hoeffding Tree and Decision Stump give the lowest F1-score and accuracy in all experiments. In future work, we intend to apply the method presented in this model to Android malware detection, in order to study the features of those datasets and the performance of the classifiers.

Figure 11: Accuracy and F1-score for DS9.
Figure 12: Accuracy and F1-score for DS10.

References

[1] "The number of new malicious files detected every day increases by 5.2% to 360,000 in 2020 | Kaspersky." https://www.kaspersky.com/about/press-releases/2020_the-number-of-new-malicious-files-detected-every-day-increases-by-52-to-360000-in-2020 (accessed Jun. 14, 2021).
[2] Y. Jian, X. Dong, and L. Jian, "Detection and recognition of abnormal data caused by network intrusion using deep learning," Inform., vol. 45, no. 3, pp. 441–445, 2021, doi: 10.31449/inf.v45i3.3639.
[3] O. F.Y, A. J.E.T, A. O, H. J. O, O. O, and A. J, "Supervised Machine Learning Algorithms: Classification and Comparison," Int. J. Comput. Trends Technol., vol. 48, no. 3, pp. 128–138, 2017, doi: 10.14445/22312803/ijctt-v48p126.
[4] A. Chaudhuri, "Parallel fuzzy rough support vector machine for data classification in cloud environment," Inform., vol. 39, no. 4, pp. 397–420, 2015.
[5] K. Shaukat, S. Luo, V. Varadharajan, I. A. Hameed, and M. Xu, "A Survey on Machine Learning Techniques for Cyber Security in the Last Decade," IEEE Access, vol. 8, no. 01, pp. 222310–222354, 2020, doi: 10.1109/ACCESS.2020.3041951.
[6] O. Savenko, A. Nicheporuk, I. Hurman, and S. Lysenko, "Dynamic signature-based malware detection technique based on API call tracing," CEUR Workshop Proc., vol. 2393, pp. 633–643, 2019.
[7] S. Euh, H. Lee, D. Kim, and D. Hwang, "Comparative analysis of low-dimensional features and tree-based ensembles for malware detection systems," IEEE Access, vol. 8, pp. 76796–76808, 2020, doi: 10.1109/ACCESS.2020.2986014.
[8] S. S. Alotaibi, "Regression coefficients as triad scale for malware detection," Comput. Electr. Eng., vol. 90, no. December 2019, p. 106886, 2021, doi: 10.1016/j.compeleceng.2020.106886.
[9] B. Cheng et al., "MoG: Behavior-Obfuscation Resistance Malware Detection," Comput. J., vol. 62, no. 12, pp. 1734–1747, 2019, doi: 10.1093/comjnl/bxz033.
[10] M. N. Sakib, C. T. Huang, and Y. D. Lin, "Maximizing accuracy in multi-scanner malware detection systems," Comput. Networks, vol. 169, p. 107027, 2020, doi: 10.1016/j.comnet.2019.107027.
[11] F. Manavi and A. Hamzeh, "A new approach for malware detection based on evolutionary algorithm," GECCO 2019 Companion - Proc. 2019 Genet. Evol. Comput. Conf. Companion, pp. 1619–1624, 2019, doi: 10.1145/3319619.3326811.
[12] A. G. Kakisim, M. Nar, N. Carkaci, and I. Sogukpinar, Analysis and evaluation of dynamic feature-based malware detection methods, vol. 11359 LNCS. Springer International Publishing, 2019.
[13] C. H. Lo, T. C. Liu, I. H. Liu, J. S. Li, C. G. Liu, and C. F. Li, "Malware classification using deep learning methods," Proc. Int. Conf. Artif. Life Robot., vol. 2020, pp. 126–129, 2020, doi: 10.5954/ICAROB.2020.OS4-4.
[14] "Malware Analysis Datasets: PE Section Headers | Kaggle." https://www.kaggle.com/ang3loliveira/malware-analysis-datasets-pe-section-headers (accessed Mar. 07, 2021).
[15] "Malware Analysis Datasets: Top-1000 PE Imports | IEEE DataPort." https://ieee-dataport.org/open-access/malware-analysis-datasets-top-1000-pe-imports (accessed Mar. 07, 2021).
[16] "Malware Analysis Datasets: API Call Sequences | IEEE DataPort." https://ieee-dataport.org/open-access/malware-analysis-datasets-api-call-sequences (accessed Mar. 07, 2021).
[17] "Windows Malware Detection | Kaggle." https://www.kaggle.com/sidneylima/rewema (accessed Mar. 07, 2021).
[18] Microsoft, "Microsoft Malware Classification Challenge (BIG 2015) | Kaggle," 2018. https://www.kaggle.com/c/malware-classification/data (accessed Mar. 07, 2021).
[19] A. Kumar, "ClaMP (Classification of Malware with PE headers)," vol. 1, 2020, doi: 10.17632/XVYV59VWVZ.1.
[20] "Malware Executable Detection | Kaggle." https://www.kaggle.com/piyushrumao/malware-executable-detection (accessed Mar. 07, 2021).
[21] "GitHub - rewema/REWEMA." https://github.com/rewema/REWEMA (accessed Mar. 07, 2021).
[22] "Malware Classification | Kaggle." https://www.kaggle.com/kallolkumarpaul/malware-classification (accessed Mar. 07, 2021).
[23] "Malware Goodware Dataset | Kaggle." https://www.kaggle.com/arbazkhan971/malware-goodware-dataset (accessed Mar. 07, 2021).
[24] N. Iqbal and M. Islam, "Machine learning for dengue outbreak prediction: A performance evaluation of different prominent classifiers," Informatica, vol. 43, no. 3, 2019, doi: 10.31449/inf.v43i3.1548.
[25] S. A. Alsaif and A. Hidri, "Impact of data balancing during training for best predictions," Inform., vol. 45, no. 2, pp. 223–230, 2021, doi: 10.31449/inf.v45i2.3479.
[26] J. L. P. Lima, D. MacEdo, and C. Zanchettin, "Heartbeat Anomaly Detection using Adversarial Oversampling," Proc. Int. Jt. Conf. Neural Networks, vol. 2019-July, no. July, pp. 1–7, 2019, doi: 10.1109/IJCNN.2019.8852242.
[27] A. Elyasaf and M. Sipper, "Software review: The HeuristicLab framework," Genet. Program. Evolvable Mach., vol. 15, no. 2, pp. 215–218, 2014, doi: 10.1007/s10710-014-9214-4.
[28] E. Amer and I. Zelinka, "A dynamic Windows malware detection and prediction method based on contextual understanding of API call sequence," Comput. Secur., vol. 92, 2020, doi: 10.1016/j.cose.2020.101760.