Informatica 41 (2017) 495–506

A Hybrid Approach from Ant Colony Optimization and K-nearest Neighbor for Classifying Datasets Using Selected Features

Enas M. F. El Houby, Nisreen I. R. Yassin and Shaimaa Omran
Systems & Information Department, Engineering Division, National Research Centre, Dokki, Cairo 12311, Egypt
E-mail: enas_mfahmy@yahoo.com, eng_nesrin@hotmail.com, shmomran@gmail.com

Keywords: ant colony optimization, K-nearest neighbor, feature selection, heuristic, pheromone

Received: December 12, 2016

This paper presents an Ant Colony Optimization (ACO) approach for feature selection. The challenge in the feature selection problem is the large search space created by redundant or irrelevant features, which affect classifier performance negatively. The proposed approach aims to minimize the subset of features used in classification while maximizing the classification accuracy. It uses several groups of ants, each of which selects candidate features using a different criterion. The ACO approach evaluates the candidate features of a dataset with a fitness function composed of a heuristic component and a pheromone component: the heuristic information is the Class-Separability (CS) value of the candidate feature, and the pheromone value is updated based on the classification accuracy obtained after adding the candidate feature. K-Nearest Neighbor is used as the classifier. Sequential forward feature selection is applied, so the most highly recommended features are selected one at a time as long as the accuracy improves. The proposed approach is applied to different medical datasets, yielding promising results: the classification accuracy increases with the selected features for the different datasets, and the selected features that achieved the best accuracy for each dataset are reported.

Povzetek: A hybrid optimization approach based on ant colony optimization and the k-nearest neighbors method is described.

1 Introduction

Processing real-life data means dealing with a huge number of features that need to be analyzed, mined, classified, and modeled. Classification is an important process that aims to predict the classes of future data objects. It is an automated process that requires previous knowledge of the datasets in order to construct a class for each group of relevant features. The aim of building a classifier is to find the set of features that gives the best classification accuracy; this accuracy is affected by the relevance of the features to one another [1]. Redundant and irrelevant features worsen the performance of a classifier. This can be avoided by selecting and grouping relevant features only; feature selection thus reduces the training time, minimizes the feature set, and enhances the performance of the classifier [2, 3]. The challenge for a feature selection algorithm is to select a minimum subset of features by eliminating features that are redundant or irrelevant, which may lead the classification process to undesirable results, as well as features that provide no predictive information. This selection should be achieved with no loss of classification accuracy while reducing computation time and cost. Feature selection is an important data preprocessing phase for mining and classifying data. It is the process of selecting the optimum set of features based on a specified criterion in order to construct solid learning models [4-6].
Feature selection algorithms are divided into two categories. The first is the filter approach, an individual feature ranking approach that ranks features based on statistical methods. The second is the wrapper approach, which uses classifiers to select those features with high prediction performance [7, 8]. As exhaustive evaluation of a huge number of feature combinations is not feasible, heuristic search methods are needed for feature selection. Many meta-heuristic approaches have been proposed for this purpose, among them nature-inspired algorithms. Algorithms such as ant colony optimization have been applied to feature selection because no solid heuristic exists for finding the optimal feature subset; the expectation is that the ants discover good feature combinations as they proceed through the search space. Such optimization techniques model feature selection as an optimization problem [1, 9]. Based on this idea, this research applies an Ant Colony Optimization (ACO) approach for feature selection with different novel search criteria, where each group of ants uses a different criterion (such as standard deviation, nearest, or furthest) to discover different good feature combinations; in this work, the nearest and furthest criteria specifically were implemented. The proposed ACO approach aims to find the minimum set of features that provides the best classification accuracy. The objective is thus to minimize both the number of features and the classification error.

The next sections are organized as follows: an overview of previous work related to the subject is presented in section 2; an introduction to ACO in feature selection problems is given in section 3; the proposed model is described in section 4; and the experimental results are presented in section 5, before conclusions and future work are drawn in section 6.

2 Related work

Feature selection and classification have become active research areas, and several researchers have recently investigated various techniques to address them. Different swarm intelligence optimizations have been used for feature selection and classification in the literature. Fong et al. [10] presented a study of feature selection for high-dimensional biomedical datasets. They used three meta-heuristic techniques, the particle swarm optimization (PSO), the wolf search algorithm, and the bat algorithm, integrated with three classification algorithms to advance feature selection and thus lower the classification error rate; the proposed search techniques were applied to two biomedical datasets, Arrhythmia and Micro Mass. Chen et al. [11] proposed a regression-based particle swarm optimization to address the feature selection problem; a regression model decided, from the fitness values of features, whether each feature was selected, and nine datasets from the UCI machine learning repository were used to evaluate the proposed algorithm. Khuat et al. [12] used a directed artificial bee colony algorithm to optimize the parameters of software effort estimation models; the accuracy of the models after parameter optimization improved relative to the original models. Xue et al. [13] introduced two multi-objective algorithms for feature selection based on PSO.
The first applied non-dominated sorting to PSO, and the second applied crowding and mutation to PSO; both were evaluated on benchmark datasets. Khazaee et al. [14] presented a PSO technique to optimize the input feature subset and the parameters of an SVM-based classifier; the technique was applied to three datasets of three types of electrocardiogram beats. Yeh et al. [15] presented a rule-based classifier constructed with simple swarm optimization to perform feature selection on a thyroid gland dataset from the UCI repository. Sivagaminathan et al. [16] introduced a model that used a hybrid of ant colony optimization and artificial neural networks (ANNs) to select feature subsets from medical datasets; the heuristic was calculated as a function of cost, so the feature with the lower cost was considered better and was selected. Jona et al. [17] proposed a hybrid meta-heuristic search for feature selection in digital mammograms, combining Ant Colony Optimization (ACO) with Cuckoo Search (CS), where CS was used to speed up the local search of ACO; a Support Vector Machine (SVM) was used with the ACO to classify the mammograms as normal or abnormal. Asad et al. [18] used an ant colony system to select features from a retinal image dataset and conducted a comparative study of six feature selection heuristics, concluding that the relief heuristic yields better subsets than the other heuristics. Tallon-Ballesteros et al. [19] proposed the Ant System (AS) search technique with two feature subset selection methods, Correlation-based Feature Selection (CFS) and Consistency-based Feature Selection (CNS), and found information gain to be an appropriate heuristic for both. Dadaneh et al. [20] developed unsupervised probabilistic feature selection using ant colony optimization (UPFS); they decreased redundancy using inter-feature information that captures the similarity between features, stored the pheromone value between each pair of features in a matrix used to rank the features, and used SVM, K-nearest neighbor (KNN), and naive Bayes (NB) as classifiers. Wang et al. [21] proposed a system that adjusts the parameters of ACO, including the number of ants, the pheromone evaporation rate, and the exploration probability factor, using different strategies; ACO was combined with fuzzy logic as an adaptive method for parameter control, and ten UCI and StatLog benchmark datasets were used to evaluate the performance of the proposed system. Liu et al. [22] used Bee Colony Optimization (BCO), ACO, and PSO to assess data quality through approaches that detect attribute outliers in datasets, with the same fitness function used for the different search strategies. Chen et al. [23] presented a feature selection algorithm based on ACO that traverses the arcs of a directed graph, with a heuristic that depends on classifier performance and the number of selected features; SVM was used as the classifier and outperformed the other classifiers used for comparison.

The studies introduced in this section cover some of the recent research on feature selection and classification using swarm-inspired algorithms; those in [16-23] used ACO specifically for feature selection.
Different classifiers, such as ANN, SVM, KNN, and NB, have been used; different heuristics, such as a cost function, cuckoo search, information gain, and combinations of classifier performance with the number of selected features, have been applied; different pheromone update schemes have been proposed; the adjustment of ACO parameters has been studied; and other modifications have been introduced to enhance the performance of ACO in feature selection, so as to reach the best possible accuracy with the fewest features. The proposed work builds on these previous findings to enhance the performance of ACO in feature selection. Our model provides a novel idea: different groups of ants search in parallel for different solutions, each group using a different search criterion to reach its own best possible solution, and the global best solution is then obtained across all applied criteria. The feature selection for the different groups uses the same fitness function but different criteria. The fitness function depends on a heuristic information term, represented by class-separability (CS), and a pheromone term that is updated as a function of the classification accuracy. Feature selection follows a sequential forward scheme: the feature with the best-fitted fitness value is selected if it improves the classification accuracy of the selected subset; otherwise, the feature with the next best value is considered. A comparison between the performance of the proposed research and the closest previous works that used the same datasets is provided in the "Experimental results and discussion" section.

3 Ant colony optimization

Artificial swarm intelligence is a population-based meta-heuristic technique inspired by the behavior of living beings that live in groups, colonies, and flocks. These organisms interact with each other and with their surrounding environment in a way that can be observed as a behavior pattern, and this pattern can be modeled computationally to solve various combinatorial problems. The ACO algorithm is a swarm intelligence technique that mimics the behavior of real ants when searching for and grabbing their food. Ants are nearly blind creatures, yet they have the ability to find the shortest path from their nest to a food source. When searching for food, ants initially explore the area surrounding their nest in a random manner. After finding food, an ant carries part of it back to the nest, releasing a chemical substance called pheromone on the ground along the way. The amount of pheromone released, which depends mostly on the quantity and quality of the food, guides the other ants to the location of the food and enables them to find the shortest path to it. Ants that happen to take a shorter path return to the nest sooner, so pheromone accumulates faster on shorter routes than on longer ones; as the pheromone on the longer routes evaporates, almost all ants come to prefer the shorter path. This probabilistic route selection helps the ants find the shortest routes and provides flexibility for solving optimization problems [16, 24, 25].
The ACO technique was introduced in the nineties by Dorigo et al. to solve optimization problems such as the travelling salesman problem (TSP) [26]. A solution begins from a start node (feature); the ant then moves iteratively from one node (feature) to another. The most widely used function in ACO is a probabilistic transition rule, based on the heuristic value \eta and the pheromone value \tau, which iteratively constructs a solution: the ant moves to an unvisited node (feature) i with probability

P_i^k(t) = \frac{[\tau_i(t)]^{\alpha}\,[\eta_i(t)]^{\beta}}{\sum_{j \in N^k} [\tau_j(t)]^{\alpha}\,[\eta_j(t)]^{\beta}}, \quad i \in N^k    (1)

where N^k is the feasible neighborhood of ant k, i.e., the features that ant k has not yet selected and can still choose (it acts as the memory of the ant); \eta_i(t) is the heuristic information of feature i at time t; \tau_i(t) is the pheromone value of feature i at time t; and \alpha and \beta are weights representing the relative impact of the pheromone \tau and the heuristic information \eta, respectively. \alpha and \beta may take real positive values according to the recommendations on parameter setting in [27].

All ants update the pheromone level \tau_i(t) with an increase of pheromone quantities according to the pheromone update equations, which specify how the pheromone value is modified:

\tau_i(t+1) = (1 - \rho)\,\tau_i(t) + \Delta\tau_i(t)    (2)

\Delta\tau_i^k(t) = \begin{cases} Q & \text{if feature } i \text{ is chosen by ant } k \\ 0 & \text{otherwise} \end{cases}    (3)

where \rho is the pheromone evaporation rate of the features (0 < \rho < 1) and \Delta\tau_i^k is the pheromone deposited by the ant k that found the best solution of the current iteration. In the ACO algorithm, ants exchange information with each other through the pheromone values: each ant uses information obtained by other ants to construct better solutions, which improves the quality of the solutions obtained in consecutive iterations. The algorithm can therefore be considered a multi-agent technique [28, 29].
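To make the transition rule and the pheromone update concrete, the following minimal NumPy sketch implements eqs. (1)-(3). It is an illustrative re-implementation, not the authors' code; the arrays `tau` and `eta` hold per-feature pheromone and heuristic values, and the parameter values shown are only examples.

```python
import numpy as np

def transition_probabilities(tau, eta, feasible, alpha, beta):
    """Eq. (1): probability that an ant moves to each feature in its
    feasible neighborhood (the features it has not yet selected)."""
    w = (tau[feasible] ** alpha) * (eta[feasible] ** beta)
    return w / w.sum()

def update_pheromone(tau, chosen, rho, Q):
    """Eqs. (2)-(3): evaporation on all features, then a deposit of Q on
    the features chosen by the ant that found the best solution."""
    tau = (1.0 - rho) * tau
    tau[chosen] += Q
    return tau

# Example: one ant picks its next feature among the unvisited ones.
tau = np.ones(10)                # equal initial pheromone values
eta = np.random.rand(10)         # heuristic values (CS, Section 4.1)
feasible = np.array([2, 5, 7])   # features not yet selected
p = transition_probabilities(tau, eta, feasible, alpha=1.0, beta=0.9)
next_feature = np.random.choice(feasible, p=p)
```

Drawing the next feature with `np.random.choice` reproduces the probabilistic, rather than greedy, character of the rule: features with high pheromone and heuristic values are favored but not guaranteed to be picked.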
4 The proposed ACO model

The proposed model presents an ACO approach for feature selection with two objectives: minimizing the number of features used in classification and increasing the classification accuracy. The proposed algorithm is presented in Pseudo code 1. The algorithm first initializes the following parameters: the number of generations, which is the number of iterations; the number of groups of ants, which is the number of different feature selection criteria; the number of ants per group, which is the number of candidate solutions per criterion (group); the maximum number of features that each ant is allowed to select while seeking the best possible classification accuracy; the pheromone value \tau_i associated with feature f_i and the heuristic value \eta_i of feature f_i, which together form the fitness function terms shown in eq. (1); the user-selected parameters \alpha and \beta, which represent the relative importance of the pheromone and heuristic values, respectively; the user-defined pheromone evaporation (decay) rate \rho, which takes a value between 0 and 1; and the local pheromone update parameter Z, which is defined by the user and takes a value less than \rho.

A set of generations starts after the initialization phase. In each new generation, n groups of ants are formed, where G1, G2, ..., Gn are n different groups, each containing na ants. The first feature of each ant is selected randomly, avoiding redundancy between ants of the same group so as to obtain different candidate solution sets. For each feature selected by the different ants, an initial classification accuracy is obtained for each ant using the KNN algorithm. Starting from equal initial pheromone values, the pheromone of a selected feature is updated locally using the local pheromone update parameter Z:

\tau_i(t+1) = (1 - Z)\,\tau_i(t) + Z    (4)

For each group of ants, the subsequent features are selected using the same fitness function but different criteria: one criterion selects the feature nearest to the previously selected one according to fitness value, another the feature furthest from it. The fitness function is calculated using eq. (1), with a pheromone term \tau and a heuristic term \eta; the heuristic information is the class-separability (CS) value of the feature. At the end of each generation, the pheromone values of the feature subsets that form the best solution of each group are updated globally as a function of the classification accuracy achieved by the selected subset, so as to increase the selection opportunity of these features in future generations:

\tau_i(t+1) = (1 - \rho)\,\tau_i(t) + \rho \cdot acc    (5)

where \rho is the pheromone evaporation rate of the features (0 < \rho < 1) and acc is the classification accuracy.

As mentioned above, the subsequent features are selected with different criteria through sequential forward selection. The next-fitted feature is selected only if it improves the classification accuracy, in which case it is considered positively correlated with the preceding features; if it reduces the accuracy or keeps it constant, it is considered negatively correlated or neutral and is not selected. The search continues until a feature is found that satisfies the group criterion, whether nearest or furthest, and improves the classification accuracy, and this process is repeated for each subsequent feature. The stopping criterion is either obtaining a subset that achieves the best possible accuracy or reaching the maximum allowed number of features. At the end of a generation, the pheromone values of the features that are part of the best solutions are updated, and a new generation starts with the updated pheromone values so as to generate different feature subsets. After all generations, the feature subsets that give the best accuracy over all generations and all ants of each group are obtained, and from them the global best subsets over the different groups. Figure 1 illustrates the full process of feature selection using the proposed ACO model.

Initialize parameters [number of generations, number of groups of ants, number of ants, max number of features, pheromone value (\tau), heuristic value (\eta), pheromone evaporation rate (\rho), local pheromone update parameter (Z), \alpha, \beta]
While generation number not exceeded
    For each group of ants
        For each ant
            ○ Select a distinct start feature randomly
            ○ Calculate the classification accuracy for the initially selected feature
        End for
    End for
    For each ant in each group of ants
        While assigned number of features not exceeded (max number of features) and accuracy can still be improved
            ○ Calculate the fitness value of each selectable feature using eq. (1)
            ○ Each group of ants selects the candidate feature using its own criterion
            ○ Calculate the accuracy of the subset obtained with the newly added feature
            ○ If the calculated accuracy is less than or equal to the previous accuracy, select the consecutive feature sequentially, with the next appropriate fitness value
            ○ If the maximum accuracy is achieved or the max number of features is reached, the final set of features for the current ant is obtained; else update the pheromone value of the selected feature locally by eq. (4)
        End while
    End for
    ○ Find the feature subsets that achieved the best solution of the current generation for the different groups
    ○ Update the pheromone values of the features that are part of the best solutions using eq. (5), to increase their selection opportunity in the next generations
End while
Obtain the global best subsets of features, collected from all the different groups of ants, that achieved the best possible accuracy over all generations

Pseudo code 1: The proposed ant colony algorithm for feature selection.

Figure 1: The full process of feature selection using the proposed ACO approach. (Flowchart: random initial feature selection from the dataset; incremental feature selection by each criterion group, such as the nearest, furthest, and standard deviation criterion groups; classifier evaluation and termination check; and pheromone reinforcement of the features of each group's best solution before the next generation, leading to the best global feature set.)
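A compact Python rendering of the inner selection loop of Pseudo code 1 is sketched below for one ant. It is a simplified illustration under stated assumptions: `evaluate` stands for any subset-accuracy oracle (KNN in this paper), and "nearest"/"furthest" are interpreted here as the smallest/largest gap in fitness value to the last selected feature, which is one plausible reading of the group criteria.

```python
import numpy as np

def fitness(tau, eta, alpha=1.0, beta=0.9):
    """Per-feature fitness, the numerator of eq. (1)."""
    return (tau ** alpha) * (eta ** beta)

def ranked_candidates(fit, subset, criterion):
    """Rank unselected features by the gap between their fitness and the
    fitness of the last selected feature: 'nearest' prefers the smallest
    gap, 'furthest' the largest (assumed interpretation)."""
    gaps = np.abs(fit - fit[subset[-1]])
    order = np.argsort(gaps if criterion == "nearest" else -gaps)
    return [int(f) for f in order if int(f) not in subset]

def grow_subset(start, criterion, fit, evaluate, max_features, tau, Z=0.4):
    """Sequential forward selection for one ant: a candidate is accepted
    only if it improves the accuracy of the current subset."""
    subset, acc = [start], evaluate([start])
    while len(subset) < max_features:
        improved = False
        for f in ranked_candidates(fit, subset, criterion):
            new_acc = evaluate(subset + [f])
            if new_acc > acc:                     # positively correlated
                subset.append(f)
                acc = new_acc
                tau[f] = (1 - Z) * tau[f] + Z     # local update, eq. (4)
                improved = True
                break
        if not improved:                          # no remaining feature helps
            break
    return subset, acc
```

At the end of a generation, eq. (5) would then be applied to the features of each group's best `subset`, raising their pheromone in proportion to the accuracy they achieved.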
4.1 Class-separability

As mentioned previously, the heuristic value is computed using the class-separability approach. Class-separability (CS) [30] is an approach used for feature selection. The CS of feature i is defined as

CS_i = SB_i / SW_i    (6)

where

SB_i = \sum_{k=1}^{K} (\bar{x}_{ik} - \bar{x}_i)^2    (7)

SW_i = \sum_{k=1}^{K} \sum_{j \in C_k} (x_{ij} - \bar{x}_{ik})^2    (8)

\bar{x}_{ik} = \sum_{j \in C_k} x_{ij} / n_k    (9)

\bar{x}_i = \sum_{j=1}^{n} x_{ij} / n    (10)

SB_i is the sum of squares of the between-class distances (the distances between samples of different classes), and SW_i is the sum of squares of the within-class distances (the distances between samples of the same class). The whole dataset contains K classes; C_k refers to class k, which includes n_k samples; x_{ij} is the value of feature i in sample j; \bar{x}_{ik} is the mean value of feature i in class k; n is the total number of samples; and \bar{x}_i is the overall mean of feature i. A CS value is calculated for each feature. A larger CS indicates a larger ratio of the distances between different classes to the distances within one specific class; CS can therefore be used to measure the capability of a feature to separate the different classes.

4.2 K-nearest neighbor

KNN (also known as instance-based learning) is a simple, efficient data mining technique for classifying data points based on their distance to other points in a training dataset. KNN is a lazy learner: the training data are loaded into the model, and when a new instance needs to be classified, the model looks for its k nearest neighbors and takes a vote on where the instance should be classified. For example, if k is 7, the classes of the 7 nearest neighbors are detected. KNN relies on the simple principle that similar instances have similar class labels; distance functions are used to measure the similarity between samples, and KNN calculates the distance between the unknown data point and every known data point [31, 32]. The common distance between two data points X = (x_1, ..., x_n) and Y = (y_1, ..., y_n) is the Minkowski distance:

Dist(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}    (11)

The Euclidean distance, obtained with p = 2, is the simplest and most common distance function for quantitative data.
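The heuristic of eqs. (6)-(10) and the distance of eq. (11) are both straightforward to compute; the following NumPy sketch is an illustrative re-implementation (not the paper's Matlab code) that computes CS for all features at once.

```python
import numpy as np

def class_separability(X, y):
    """Eqs. (6)-(10): CS_i = SB_i / SW_i for every feature i, given a
    data matrix X of shape (n_samples, n_features) and class labels y."""
    overall_mean = X.mean(axis=0)              # eq. (10)
    sb = np.zeros(X.shape[1])                  # between-class SS, eq. (7)
    sw = np.zeros(X.shape[1])                  # within-class SS,  eq. (8)
    for c in np.unique(y):
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)           # eq. (9)
        sb += (class_mean - overall_mean) ** 2
        sw += ((Xc - class_mean) ** 2).sum(axis=0)
    return sb / sw                             # eq. (6)

def minkowski(x, y, p=2):
    """Eq. (11): Minkowski distance; p = 2 gives the Euclidean distance."""
    return float((np.abs(x - y) ** p).sum() ** (1.0 / p))
```

Features with a larger CS separate the classes better and so receive larger heuristic values \eta_i in eq. (1).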
5 Experimental results and discussion

This section presents an empirical performance evaluation of the proposed ACO model. To evaluate the model, the real-world, medium-sized datasets shown in Table 1 are used. The heart disease, breast cancer, and thyroid datasets contain 13, 30, and 21 features and 303, 569, and 7200 samples, respectively; the samples are randomly divided into 257/46, 423/146, and 6000/1200 for training and testing, respectively. Matlab® 2015a on an Intel® Core™ i7 CPU @ 1.6 GHz is used for the implementation.

Table 1: The used datasets.
Data set name | No. of features | No. of samples | No. of classes | Citation
Heart Disease | 13 | 303 | 5 | [33]
Breast cancer (Wisconsin diagnostic) | 30 | 569 | 2 | [34]
Thyroid | 21 | 7200 | 3 | [35]

Extensive experiments were conducted to find the feature subsets selected by ACO that give the highest possible accuracy with the KNN classifier. The parameter values were tuned as shown in Table 2.

Table 2: Parameters of ACO.
Parameter | Value
\alpha | 1
\beta | 0.9
\rho | 0.9
Z | 0.4
No. of generations | 100
Groups of ants (G) | 2
No. of ants per group (na) | 3

Different subset sizes were tried, starting from two features, until either the maximum number of features specified by the user was reached or the best possible accuracy was achieved. In each trial, two groups of ants, each consisting of 3 ants, select the features over 100 iterations (generations). Each group uses a different selection criterion: the first group selects the feature nearest to the previously selected one according to fitness value, and the second group the feature furthest from it. The pheromone of a selected feature is increased by a small value using eq. (4); at the end of each iteration, the pheromone of the features belonging to the best solution of either group is incremented by a significant value computed with eq. (5).

Table 3 illustrates the behavior of the proposed ACO model for feature selection on the breast cancer dataset using the two ant groups, nearest and furthest. For illustration, the table lists sample feature subsets together with the accuracies recorded as the best in different generations. For example, if at the end of a generation the best achieved accuracy was 93.84% for the two groups, reached by the nearest group with the subset {8, 29} and by the furthest group with {24, 4}, then the pheromone values of both subsets are updated for that generation using eq. (5), so that these features have a better chance of being part of the best solution of their groups in the next generation. In many cases the same subset is reached by both groups, either in the same generation or in different generations, which may be due to the restriction to positively correlated features. The nearest and furthest features may also change across generations, because the local changes of the pheromone values of selected features and the global changes for the best features at the end of a generation alter the distances between features. For example, of the two subsets that achieved 94.52% accuracy, {8,29,30} and {8,29,6}, the feature furthest from feature 29 was 30 in the first set and 6 in the second.

Table 3: Samples of selected feature subsets and achieved accuracies of the ACO model through different generations for the breast cancer dataset.
No. of features | Nearest | Best acc. % | Furthest | Best acc. %
2 | {8,29; 24,4} | 93.84 | {24,4; 8,29; 6,16; 23,1} | 93.84
2 | {18,28; 1,21} | 94.52 | {2,23; 18,28; 21,1} | 94.52
2 | {4,23; 26,1} | 95.21 | {4,23} | 95.21
3 | {27,3,23; 8,29,30; 8,29,9; 26,21,3} | 94.52 | {8,29,30; 8,29,6; 20,23,2; 1,23,12; 18,23,2; 16,23,2} | 94.52
3 | {28,18,30; 23,3,14; 21,12,1} | 95.89 | {14,3,23; 3,23,14} | 95.89
3 | {1,28,7} | 97.26 | {7,28,1} | 97.26
4 | {12,21,3,1; 3,23,14,2; 1,21,29,12} | 96.58 | {21,12,1,13; 8,29,19,10; 12,20,27,3} | 96.58
4 | {7,28,1,16} | 97.95 | {16,28,7,1} | 97.95
5 | {28,8,7,1,17; 5,7,28,1,17} | 97.26 | - | 97.26
5 | - | 97.95 | {10,28,7,1,16; 30,28,7,1,16} | 97.95
6 | {18,7,16,20,28,1; 28,16,7,18,10,1} | 97.95 | {19,28,7,1,17,30} | 97.95
7 | {25,7,28,11,17,16,10; 18,29,6,21,1,12,27} | 97.26 | - | -
7 | - | - | {19,28,10,7,17,1,30} | 97.95
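The accuracies in Table 3 come from training KNN on the selected feature columns of the training split and scoring the held-out test split. A minimal sketch of such an evaluation is shown below, using scikit-learn for brevity; this is an assumed stand-in for the paper's Matlab implementation, and since the paper does not state the value of k, k = 7 below simply follows the illustrative example of Section 4.2. Note that the feature indices reported in the paper are 1-based.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def subset_accuracy(X_train, y_train, X_test, y_test, subset, k=7):
    """Test accuracy of KNN restricted to the given feature subset."""
    cols = np.asarray(subset) - 1                   # 1-based -> 0-based
    clf = KNeighborsClassifier(n_neighbors=k, p=2)  # p = 2: Euclidean, eq. (11)
    clf.fit(X_train[:, cols], y_train)
    return accuracy_score(y_test, clf.predict(X_test[:, cols]))

# e.g., the best breast cancer subset reported below:
# acc = subset_accuracy(X_train, y_train, X_test, y_test, [16, 28, 7, 1])
```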
The best global feature subset for the breast cancer dataset, achieving the best possible accuracy of 97.95% with the fewest features (4), is {16, 28, 7, 1}. Increasing the number of features beyond 4 leaves the accuracy constant, as shown in Table 4, so these 4 features are considered the best combination of features for reaching the best possible accuracy. Figure 2 presents the accuracy versus the number of features for the breast cancer dataset. The selected features that achieve the 97.95% accuracy are the following cell nucleus features, which take decimal values:
1 - mean radius (mean of distances from center to points on the perimeter)
7 - mean concavity (severity of concave portions of the contour)
16 - standard error of compactness (perimeter^2/area - 1.0)
28 - worst concave points (number of concave portions of the contour)

Table 4: The feature subsets selected by the proposed ACO model that achieved the best accuracy per subset size for breast cancer.
No. of selected features | Reduced feature subsets | Best accuracy (%)
2 | {4,23; 26,1} | 95.21
3 | {7,28,1} | 97.26
4 | {16,28,7,1} | 97.95
5 | {10,28,7,1,16; 30,28,7,1,16; 15,28,7,1,16; 7,28,30,1,17} | 97.95
6 | {18,7,16,20,28,1; 28,16,7,18,10,1; 19,28,7,1,17,30} | 97.95
7 | {19,28,10,7,17,1,30} | 97.95
Most important features subset: cell nucleus features 1 (mean radius), 7 (mean concavity), 16 (standard error of compactness), 28 (worst concave points)

Figure 2: The accuracy versus number of features for the breast cancer dataset.

Regarding the heart disease dataset, shown in Table 5, although the best possible accuracy of 96.88% is achieved with 4 features, the best accuracy using 2 features, 96.77%, is very close to it, so the set {9, 12} is recommended as the best solution. Increasing the feature set beyond 4 features decreases the accuracy, as shown in Table 5. The best selected features are "number of major vessels colored by fluoroscopy", which takes a value from 0 to 3, and "exercise induced angina", which is Boolean (1 = yes; 0 = no). Figure 3 shows the accuracy versus the number of features for the heart disease dataset.

Table 5: The feature subsets selected by the proposed ACO model that achieved the best accuracy per subset size for heart disease.
No. of selected features | Reduced feature subsets | Best accuracy (%)
2 | {9,12} | 96.77
3 | {6,12,9; 4,12,1} | 93.75
4 | {4,12,1,6} | 96.88
5 | {4,12,6,10,2} | 94.12
6 | {4,6,5,7,1,12} | 93.1
7 | {4,6,5,7,3,8,12; 4,13,2,10,3,8,12} | 84.62
Most important features subset: 12 - ca: number of major vessels (0-3) colored by fluoroscopy; 9 - exang: exercise induced angina (1 = yes; 0 = no)

Figure 3: The accuracy versus number of features for the heart disease dataset.

For the thyroid dataset, shown in Table 6, although the best possible accuracy of 98.5% is achieved with 6 features, the accuracies using 4 and 5 features, 98.25% and 98.33% respectively, are close to it; increasing the feature set to 7 features leaves the accuracy constant. Figure 4 shows the accuracy versus the number of features for the thyroid dataset. The best selected features are TSH (real [0.0 - 0.53]), thyroid surgery (integer [0, 1]), on thyroxine (integer [0, 1]), and FTI (real [0.0020 - 0.642]).

Table 6: The feature subsets selected by the proposed ACO model that achieved the best accuracy per subset size for thyroid.
No. of selected features | Reduced feature subsets | Best accuracy (%)
2 | {3,17} | 97
3 | {21,17,3} | 97.92
4 | {8,17,3,21} | 98.25
5 | {19,17,5,3,8; 8,17,3,21,16; 3,17,6,21,8} | 98.33
6 | {21,17,9,3,16,8} | 98.5
7 | {3,17,12,21,9,16,8; 21,17,5,3,9,8,16} | 98.5
Most important features subset: 17 - TSH; 8 - thyroid surgery; 3 - on thyroxine; 21 - FTI

Figure 4: The accuracy versus number of features for the thyroid dataset.

Table 7 shows the percentage of feature reduction and the achieved accuracy before and after reduction for the different datasets. Figure 5 compares the total number of features with the reduced number selected by the proposed ACO model, and Figure 6 compares the accuracy obtained with the total number of features against the accuracy obtained with the reduced subsets for the three datasets used. The selected features clearly achieve higher accuracy than the complete feature sets, which confirms that noisy and irrelevant features mislead the classifiers.

Table 7: The percentage of feature reduction and the achieved accuracy for the different datasets.
Data set | Training/testing size | No. of features | Reduced subset | % reduction | Accuracy (all features) | Accuracy (reduced features)
Heart disease | 257/46 | 13 | 2 | 84.61% | 82.5% | 96.77%
Breast cancer (Wisconsin diagnostic) | 423/146 | 30 | 4 | 86.66% | 93.15% | 97.95%
Thyroid | 6000/1200 | 21 | 4 | 80.95% | 93.15% | 98.25%

Figure 5: The total number of features compared with the reduced number of features for the different datasets.

Figure 6: The comparison of the accuracy for the reduced number of features and the total number of features.
Table 8 compares the proposed model with previous work. Since the main purpose of feature selection is to achieve the highest accuracy with the fewest features, the comparison is limited to the closest previous works that used the same datasets. The proposed model outperforms the others, with even fewer features, on all datasets, with one exception: for breast cancer, Wang et al. [21] achieved 98.12% accuracy, slightly better than ours (97.95%), but with a much larger number of features, 13.5 on average rather than only 4 with the suggested algorithm.

Table 8: The comparison of the proposed model with the previous work (number of reduced features / accuracy).
Data set | No. of features | Proposed system | Sivagaminathan et al. [16] | Dadaneh et al. [20] | Wang et al. [21] | Chen et al. [23]
Heart Disease | 13 | 2 / 96.77% | - | - | 6.1 / 88.22% on average | 8.08 / 86.67% on average
Breast cancer (Wisconsin diagnostic) | 30 | 4 / 97.95% | 12 / 95.57% | 11 / 91.23% | 13.5 / 98.12% on average | 5.89 / 95.99% on average
Thyroid | 21 | 4 / 98.25% | 14 / 94.5% | - | - | -

To investigate the capability of the proposed model to achieve promising results on large datasets, the SRBCT microarray dataset [36], which contains 2308 features (genes) and 83 samples, was used; the samples are divided into 63/20 for training and testing. The model achieved 100% accuracy with only 4 genes, as shown in Table 9, corresponding to a feature reduction of 99.82%. Applying the proposed model to SRBCT thus shows that it can also select, from a very large number of features, a subset that achieves the highest accuracy.

Table 9: The feature subsets selected by the proposed ACO model that achieved the best accuracy per subset size for SRBCT.
No. of selected features | Reduced feature subsets | Best accuracy (%)
2 | {1613, 1165; 1427, 1479} | 85
3 | {156, 1606, 1601} | 95
4 | {1434, 1775, 2214, 1920} | 100
Most important features subset: 1434 - kinesin family member 3C; 1775 - guanosine monophosphate reductase; 2214 - FYN oncogene related to SRC, FGR, YES; 1920 - lamin B2

Computational complexity

The actual computational cost of an algorithm can be determined by investigating its computational complexity in big-O notation. Meta-heuristic algorithms are simple in terms of complexity and are therefore easy to implement. The ACO algorithm has three inner loops, over the number of groups of ants n, the number of ants per group na, and the number of selected features f, and one outer loop over the iterations t, so the time complexity is O(n · na · f · t). In the experimental studies the inner loops are small (n = 2, na = 3, f = 2-7) and t = 100, so the computational cost is relatively low. The main computational cost lies in five steps of Pseudo code 1: (i) random feature selection, (ii) subset selection using the different criteria, (iii) updating the pheromone values, (iv) calculating the probabilistic transition rule, and (v) checking the termination condition. For example, the estimated time for the heart disease dataset was approximately 15.3 s on average to select 2 features, 35.5 s on average for 3 features, and 48 s on average for 4 features.
6 Conclusion and future work

In this research, an ACO model has been developed to select from datasets the minimal feature subsets that achieve the best possible accuracy. The purpose is to remove the redundant, irrelevant, and noisy features that mislead the classifier. The proposed model uses different groups of ants to select different feature subsets that give the best possible result, each group using a different selection criterion; in this research, the two criteria applied are nearest and furthest. At the end of each generation, each group identifies the feature subsets that achieve the best accuracy for that group, and the pheromone values of these features are increased to give them a higher opportunity of being part of the selected features in the next generation. After all generations, the best feature subsets of each group are selected, and from them the global best solutions over all groups are obtained.

The results showed that selecting the right features and eliminating irrelevant, noisy, and redundant ones increases classifier accuracy. The percentages of feature reduction are 84.61%, 86.66%, and 80.95% for heart disease, breast cancer, and thyroid, respectively, while the accuracy increased from 82.5% to 96.77%, from 93.15% to 97.95%, and from 93.15% to 98.25%, respectively. Across the tested datasets, the model achieved promising results compared with previous work, reaching higher accuracy with fewer features on all of them. Different feature subsets are reached by the different group criteria, which makes it possible to collect different solutions and reach the best global ones. The proposed model also proved its capability to select features from large datasets when applied to the SRBCT microarray dataset. As future work, other criteria can be used to collect different subsets of features, parametric studies can be carried out, and the model can be applied to further datasets.
7 References

[1] S. M. Vieira, J. M. Sousa, and T. A. Runkler (2009). Multi-criteria ant feature selection using fuzzy classifiers. In Swarm Intelligence for Multi-objective Problems in Data Mining, Springer Berlin Heidelberg, 19-36.
[2] I. A. Gheyas and L. S. Smith (2010). Feature subset selection in large dimensionality domains. Pattern Recognition 43(1), 5-13.
[3] A. Unler and A. Murat (2010). A discrete particle swarm optimization method for feature selection in binary classification problems. European Journal of Operational Research 206(3), 528-539.
[4] M. Dash, K. Choi, P. Scheuermann, and H. Liu (2002). Feature selection for clustering - a filter solution. In Proc. of the Second International Conference on Data Mining (ICDM), 115-122.
[5] P. Mitra, C. A. Murthy, and S. K. Pal (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(3), 301-312.
[6] A. Miller (2002). Subset Selection in Regression (2nd ed.). Chapman & Hall/CRC, Boca Raton.
[7] A. L. Blum and P. Langley (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence 97(1-2), 245-271.
[8] L. Talavera (2005). An evaluation of filter and wrapper methods for feature selection in categorical clustering. In International Symposium on Intelligent Data Analysis, Springer Berlin Heidelberg, 440-451.
[9] L. A. M. Pereira, D. Rodrigues, T. N. S. Almeida, C. C. O. Ramos, A. N. Souza, X.-S. Yang, and J. P. Papa (2014). A binary cuckoo search and its application for feature selection. In Cuckoo Search and Firefly Algorithm, Springer International Publishing, 141-154.
[10] S. Fong, S. Deb, X. S. Yang, and J. Li (2014). Feature selection in life science classification: metaheuristic swarm search. IT Professional 16(4), 24-29.
[11] K. H. Chen, L. F. Chen, and C. T. Su (2014). A new particle swarm feature selection method for classification. Journal of Intelligent Information Systems 42(3), 507-530.
[12] T. T. Khuat and M. H. Le (2016). Optimizing parameters of software effort estimation models using directed artificial bee colony algorithm. Informatica 40, 427-436.
[13] B. Xue, M. Zhang, and W. N. Browne (2013). Particle swarm optimization for feature selection in classification: a multi-objective approach. IEEE Transactions on Cybernetics 43(6), 1656-1671.
[14] A. Khazaee (2013). Heart beat classification using particle swarm optimization. International Journal of Intelligent Systems and Applications 5(6), 25.
[15] W. C. Yeh (2012). Novel swarm optimization for mining classification rules on thyroid gland data. Information Sciences 197, 65-76.
[16] R. K. Sivagaminathan and S. Ramakrishnan (2007). A hybrid approach for feature subset selection using neural networks and ant colony optimization. Expert Systems with Applications 33(1), 49-60.
[17] J. B. Jona and N. Nagaveni (2014). Ant-cuckoo colony optimization for feature selection in digital mammogram. Pakistan Journal of Biological Sciences 17(2), 266.
[18] A. Asad, A. T. Azar, N. El-Bendary, and A. E. Hassanien (2014). Ant colony based feature selection heuristics for retinal vessel segmentation. arXiv preprint arXiv:1403.1735.
[19] A. J. Tallon-Ballesteros and J. C. Riquelme (2014). Tackling ant colony optimization meta-heuristic as search method in feature subset selection based on correlation or consistency measures. In International Conference on Intelligent Data Engineering and Automated Learning, Springer, 386-393.
[20] B. Z. Dadaneh, H. Y. Markid, and A. Zakerolhosseini (2016). Unsupervised probabilistic feature selection using ant colony optimization. Expert Systems with Applications 53, 27-42.
[21] G. Wang, H. E. Chu, Y. Zhang, H. Chen, W. Hu, Y. Li, and X. J. Peng (2015). Multiple parameter control for ant colony optimization applied to feature selection problem. Neural Computing and Applications 26(7), 1693-1708.
[22] B. Liu, M. Cai, and J. Yu (2015). Swarm intelligence and its application in abnormal data detection. Informatica 39(1), 63-69.
[23] B. Chen, L. Chen, and Y. Chen (2013). Efficient ant colony optimization for image feature selection. Signal Processing 93(6), 1566-1576.
[24] C. A. Coello, S. Dehuri, and S. Ghosh, eds. (2009). Swarm Intelligence for Multi-objective Problems in Data Mining. Studies in Computational Intelligence 242, Springer.
[25] H. R. Kanan, K. Faez, and S. M. Taheri (2007). Feature selection using ant colony optimization (ACO): a new method and comparative study in the application of face recognition system. In Industrial Conference on Data Mining, Springer Berlin Heidelberg, 63-76.
[26] M. Dorigo (1992). Optimization, Learning and Natural Algorithms. Ph.D. thesis, Politecnico di Milano, Italy.
[27] M. Dorigo, V. Maniezzo, and A. Colorni (1996). Ant system: optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 26(1), 29-41.
[28] S. Tabakhi, P. Moradi, and F. Akhlaghian (2014). An unsupervised feature selection algorithm based on ant colony optimization. Engineering Applications of Artificial Intelligence 32, 112-123.
[29] S. Dehuri, S. Ghosh, and C. A. Coello (2009). An introduction to swarm intelligence for multi-objective problems. In Swarm Intelligence for Multi-objective Problems in Data Mining, Springer Berlin Heidelberg, 1-17.
[30] F. Chu and L. Wang (2005). Applications of support vector machines to cancer classification with microarray data. International Journal of Neural Systems 15(6), 475-484.
[31] C. C. Aggarwal (2015). Data Mining: The Textbook. Springer International Publishing, Switzerland.
[32] M. Kubat (2015). An Introduction to Machine Learning. Springer International Publishing, Switzerland.
[33] http://archive.ics.uci.edu/ml/datasets/Heart+Disease
[34] http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
[35] http://sci2s.ugr.es/keel/category.php?cat=clas
[36] A. Statnikov, C. F. Aliferis, and I. Tsamardinos (2005). Gene Expression Model Selector. Available from: www.gems-system.org (accessed April 2017).