https://doi.org/10.31449/inf.v43i2.1621 Informatica 43 (2019) 187–198

Mutual Information Based Feature Selection for Fingerprint Identification

Ahlem Adjimi and Abdenour Hacine-Gharbi
LMSE laboratory, University of Bordj Bou Arreridj, Elanasser, 34030 Bordj Bou Arreridj, Algeria
E-mail: adjimia@yahoo.fr, hacgharbi@yahoo.fr

Philippe Ravier
PRISME laboratory, University of Orleans, 12 rue de Blois, 45067 Orléans, France
E-mail: philippe.ravier@univ-orleans.fr, +0033238494863

Messaoud Mostefai
LMSE laboratory, University of Bordj Bou Arreridj, Elanasser, 34030 Bordj Bou Arreridj, Algeria
E-mail: mostefaimess@gmail.com

Keywords: fingerprint identification, feature selection, dimensionality reduction, mutual information, local binary patterns, local phase quantization, histogram of gradients, binarized statistical image features

Received: May 3, 2017

In the field of fingerprint identification, local histogram coding is one of the most popular techniques for fingerprint representation, due to its simplicity. This technique concatenates local histograms into a single high-dimensional histogram, which causes two problems. First, computing time and memory requirements become large as databases grow. Second, the recognition rate may be degraded by the curse of dimensionality. To resolve these problems, we propose to reduce the dimensionality of the histograms by keeping only their pertinent bins, using a feature selection approach based on mutual information. For fingerprint feature extraction we use four descriptors: Local Binary Patterns (LBP), Histogram of Gradients (HoG), Local Phase Quantization (LPQ) and Binarized Statistical Image Features (BSIF).
As mutual information based selection methods, we use four strategies: Mutual Information Feature Selection (MIFS), minimum Redundancy and Maximal Relevance (mRMR), Conditional Infomax Feature Extraction (CIFE) and Joint Mutual Information (JMI). We compare results in terms of recognition rate and number of selected features for the investigated descriptors and selection strategies. Our experiments are conducted on the four FVC2002 datasets, which present different image qualities. We show that combining the mRMR or CIFE feature selection methods with HoG features gives the best results. We also show that selecting useful fingerprint features can improve the recognition rate and reduce the computation cost of the system. The feature selection algorithms may reach a 98% reduction in computing time by considering only 20% of the total number of features, while also improving the recognition rate by about 2% by avoiding the curse of dimensionality.

Povzetek (Summary): Different ways of describing and searching in histogram coding for fingerprint identification are analyzed.

1 Introduction

Biometric recognition has gained considerable interest in recent years because of its various applications in the broad field of security. Security can be categorized into data access security (computer and mobile access, USB key, bank cards) and person access security (forensic identification, ID access). Many technological solutions exist that rely on distinctive biometric identifiers (e.g. fingerprints, face, iris or speech), each having its own qualities. However, the most widely used biometric identifiers are fingerprints, due to their uniqueness, persistence, simplicity of acquisition and the availability of electronic acquisition devices [1]. Indeed, fingerprints are unique to each person and remain unchanged throughout the person's life.
Fingerprint recognition systems can be categorized into three main approaches: minutiae-based systems, image-based correlation systems and image-based distance systems [2]. In the first category, the fingerprint image must pass through several preprocessing steps to detect and extract points of interest called minutiae: smoothing, local ridge orientation estimation, binarization, thinning, and minutia detection. The second category directly estimates the similarity between a test and a reference fingerprint pattern by the autocorrelation method. In the third category, global or local features are extracted from the fingerprint image such that the features, also called descriptors, retain most of the pertinent information representing the fingerprint. This kind of fingerprint recognition system is preferred for low quality images, because it is difficult to extract reliable minutiae sets in that case [3]. A distance measure between a test and a reference fingerprint pattern, or any other classifier, is finally used to make a matching decision [3].

Within this last category, many descriptors have been proposed. They can be principally grouped into histogram-based features and linear transformed features. The descriptors of the first group exploit statistical characteristics of the fingerprint by transforming the image into a histogram of fixed length, such as Local Binary Patterns (LBP), the hybrid Gabor filter with Local Binary Patterns (GLBP) method [4], Local Phase Quantization (LPQ) [5], Histogram of Gradients (HoG) [6], Binarized Statistical Image Features (BSIF) [7] or Scale Invariant Feature Transform (SIFT) [8][9].
In the second group, the fingerprint image is transformed into a vector of features such as Discrete Cosine Transform (DCT) features [10], Gabor filter based descriptors [11][12] and Discrete Wavelet Transform (DWT) features [13][14][15][16].

In this work, we focus on the histogram-based fingerprint representation techniques LBP, LPQ, HoG and BSIF. These techniques are widely used for fingerprint recognition due to their simplicity. They are based on the concatenation of local histograms, which leads to a histogram of large dimension (e.g. 1024 features for each fingerprint in the case of LBP). Such a histogram requires long computing time, large memory capacity and a huge training dataset to model the classes. In practice, it has been observed that adding features can degrade the performance of the classifier if the number of samples used to design the classifier is too low relative to the number of features [17][18]. This phenomenon, called the curse of dimensionality, leads to the phenomenon of "peaking" [19]. It is therefore desirable to keep the number of features as small as possible, which also reduces the computational cost of the fingerprint identification task and avoids memory saturation.

Keeping a small number of features is a dimensionality reduction operation, which can be done with two approaches. The first approach is feature transformation, in which the initial feature set is replaced by a new reduced set using a transformation algorithm such as PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis). The second approach is feature selection, which selects the relevant features from the initial feature set [20].
However, a reduced set of features obtained by transformation needs more memory and more computing time in the testing phase than a reduced set obtained by selection algorithms [20], because the former requires all the features to be computed before reduction. So, in the present work, we have considered feature selection algorithms to select the relevant histogram bins for the histogram-based fingerprint representation techniques.

Feature selection methods are themselves divided into two categories, "wrapper" and "filter". In "wrapper" methods, the relevance measure for a feature subset is the training/testing recognition rate of the classifier in use. Consequently, the wrapper selection procedure makes the computational cost increase rapidly, because a new classifier has to be built, with training and testing phases, each time a feature subset is tested. Moreover, the features selected by wrapper methods are adapted to the classifier in use, so their performance depends on the type of classifier. In contrast, "filter" methods evaluate the relevance of a feature subset independently of the classifier, so the selected features can be used with any classifier [20][21]. For all these reasons, we have chosen the "filter" methods, which are preferable for high dimensionality and large datasets for computational reasons. The "filter" methods use a selection criterion typically based on information theory tools such as Mutual Information (MI), which measures the quantity of information that features carry for describing the data.

To our knowledge, only a few works have investigated MI based criteria in the field of biometric identification. In [22], an efficient code selection method for face recognition is presented and compact LBP codes are obtained. The code selection is based on the maximization of mutual information (MMI) between features (LBP codes) and class labels.
This principle is applied for selection through the max-relevance and min-redundancy (mRMR) criterion. The proposed method transforms the face images into LBP histograms, then selects the relevant codes from these histograms by maximizing the mutual information. In that work the authors used the chi-square formula to measure the distance between the histograms of the reference and test templates. In [23], the BSIF features were investigated within a fingerprint recognition system, with preliminary feature selection results on the FVC2002 fingerprint dataset [24]. The experiments showed that increasing the number of extracted sub-images increases the recognition rate, but also leads to higher-dimensional histograms, which in turn degrade the performance of the system in terms of computing time and memory capacity. This motivated the use of an MI feature selection strategy, namely interaction capping (ICAP).

In this work, we extend the fingerprint recognition system proposed in [23] by considering more datasets within the FVC2002 fingerprint database, more descriptor types and several other feature selection strategies, all based on mutual information, to select the relevant bins of the histograms extracted from the fingerprint images. The present study focuses on the robustness of the fingerprint system with respect to various descriptors and noisy datasets. The main aim of this work is to find a combination of a feature selection method with a pertinent descriptor type in a larger context than in study [23].

To that aim, the next section introduces the former developments of [23] and explains the novelty of the present paper by comparison. Section 3 gives a brief review of all the descriptors used in this paper. Section 4 describes the feature selection methods based on mutual information.
In section 5 we present the experimental procedure, and in section 6 we discuss the results obtained using a public fingerprint dataset. Finally, we draw a conclusion in section 7.

2 Related work

In our previous works [23] and [25], a fingerprint recognition system was created following the flowchart of Fig. 1. A sequence of preprocessing steps was applied to the training and testing image datasets before extracting the LBP, LPQ or BSIF features, namely enhancement, alignment, extraction of the region of interest (ROI) around the core point and division of the ROI into sub-regions. This procedure is detailed in [23]. The set of sub-regions is the input for the feature computation. In [25], we used the novel BSIF descriptor [7], compared with the LBP and LPQ descriptors, on fingerprint images. From each sub-region, a BSIF histogram is extracted and the final feature vector is obtained by concatenating all the BSIF histograms extracted from the sub-regions. In [23], an extension of this previous work was presented, in which the relevant bins of the histograms extracted with the BSIF descriptor were selected using the ICAP feature selection method.

The last step of Fig. 1 is the decision making. It is based on the distance between the histograms of the reference fingerprints and that of the tested one. The distance is computed as a chi-square measure, whose formula is given below [22]:

$$\chi^2(R, T) = \sum_{i=1}^{n} \frac{(R_i - T_i)^2}{R_i + T_i} \tag{1}$$

where $R_i$ and $T_i$ are the reference and the tested fingerprint histogram magnitudes respectively, and $n$ is the number of bins.

The recognition system uses the following rule to make a decision: if the best match of a test fingerprint is a fingerprint of the same person, it is declared a correct match; otherwise it is declared a false match.
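The distance of Eq. (1) and the best-match rule can be illustrated with a minimal Python sketch. The function names, the toy histograms and the handling of bins that are empty in both histograms are assumptions made for illustration; they are not taken from the authors' implementation.

```python
def chi_square_distance(r, t):
    """Chi-square distance of Eq. (1) between two histograms with n bins."""
    if len(r) != len(t):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for r_i, t_i in zip(r, t):
        denom = r_i + t_i
        # Assumption: a bin that is empty in both histograms contributes 0
        # (the source does not say how 0/0 terms are handled).
        if denom > 0:
            total += (r_i - t_i) ** 2 / denom
    return total


def identify(test_hist, ref_hists, ref_labels):
    """Return the label of the reference histogram closest to test_hist."""
    distances = [chi_square_distance(ref, test_hist) for ref in ref_hists]
    best = min(range(len(distances)), key=distances.__getitem__)
    return ref_labels[best]


# Toy data: three 4-bin reference histograms with made-up values.
refs = [
    [0.40, 0.30, 0.20, 0.10],
    [0.10, 0.20, 0.30, 0.40],
    [0.25, 0.25, 0.25, 0.25],
]
labels = ["person_A", "person_B", "person_C"]
test = [0.38, 0.32, 0.18, 0.12]

predicted = identify(test, refs, labels)
# The decision rule: a correct match if the best match is the same person.
is_correct_match = predicted == "person_A"
```

With several reference impressions per person, the same rule applies unchanged: the label of the single closest reference histogram is returned.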
The recognition rate is computed as

$$\text{Recognition rate (\%)} = \frac{\text{number of correctly recognized images}}{\text{number of test images}} \times 100 \tag{2}$$

In the current paper, several extensions are proposed with respect to our former work [23]. The purpose is to evaluate the robustness of the system to changes in the datasets, depending on the descriptor type. We thus consider the additional descriptor histogram of gradients (HoG). All the descriptors LBP, LPQ, HoG and BSIF are then evaluated on all the datasets DB1, DB2, DB3 and DB4 of the FVC2002 fingerprint dataset [24]. The DB2 and DB3 datasets were left out of the preliminary study [23], although they are interesting for a robustness study because they are noisy. Moreover, four MI strategies are investigated and compared, instead of only one in [23], and they are applied to the four descriptors instead of BSIF only as in [23]. These novelties are described in the flowchart of Fig. 2. Furthermore, the impact of feature selection on computing time is analyzed, and a detailed performance analysis of the dimensionality reduction procedure is proposed. The parameter values of the fingerprint recognition system depicted in Fig. 2 are given in section 5.2 of the experimental part.

3 A brief review of descriptors LBP, LPQ, HoG and BSIF

In this section we give a brief review of the descriptors LBP, LPQ, HoG and BSIF used in this work for feature extraction.

3.1 LBP (Local Binary Patterns)

This operator was proposed by Ojala et al. [26] for texture analysis. It is characterized by its tolerance to illumination changes, its computational simplicity and its invariance to changes in gray levels. The LBP descriptor works on the eight neighbors of a pixel and uses the gray value of this pixel as a threshold: if a neighbor pixel has a gray value higher than or equal to that of the center pixel, a binary one is assigned to that neighbor, else it gets a binary zero.
The LBP code for the center pixel is then produced by concatenating the eight ones or zeros into a binary number, which is then converted to a decimal number. The LBP code therefore takes a value from 0 to 255. A histogram of 256 bins is built from these values and used for matching.

3.2 LPQ (Local Phase Quantization)

This texture descriptor was originally proposed by Ojansivu and Heikkilä [27]. It is based on the blur invariance property of the Fourier phase spectrum. It has shown good performance in texture recognition even when there is no blur, and it outperforms the Local Binary Pattern operator in texture classification. It uses the local phase information extracted with the 2-D local Fourier transform computed over a (2R+1) by (2R+1) neighborhood at each pixel position of an image of size n by n. For LPQ, only four complex coefficients are retained, corresponding to the 2-D spatial frequencies $v_1 = [a, 0]$, $v_2 = [0, a]$, $v_3 = [a, a]$ and $v_4 = [-a, a]$, where $a = \frac{1}{2R+1}$. The real and imaginary parts of the complex values are stacked in a vector of 8 components for each pixel, which gives a matrix of size 8 by n × n. Then, the coefficients are decorrelated by a whitening operation, assuming a correlation coefficient of 0.95 between adjacent pixel values and a Gaussian distribution of the pixel values. This matrix is then binarized by examining the sign of each element: if the element is positive, a binary 1 is assigned, otherwise a binary 0 is assigned. The last step is the histogram construction, in which each column of 8 elements is converted to a decimal value between 0 and 255. Finally, a 256-dimensional histogram is built from these values and used in classification.

Figure 1: Flowchart of the related work system of fingerprint recognition.

Figure 2: Flowchart of the proposed system. The red characters indicate the elements added for a deeper study of the system (details of the image preprocessing and matching steps can be found in reference [23]).

3.3 HoG (Histogram of Gradients)

The HoG descriptor was first proposed by Dalal and Triggs [28] as an image descriptor used in computer vision and image processing for object detection. The basic idea of this descriptor is that local object appearance and shape can be characterized rather well by the distribution of local intensity gradients. The gradient filter is applied in both the x and y directions of the image. The two resulting images are then converted into gradient magnitude and gradient orientation. Next, they are divided into small spatial regions (cells). Within each cell, the gradient magnitude of each pixel is accumulated in the bin corresponding to its orientation value. The concatenation of these cell histograms gives the HoG histogram. For example, if the number of orientation bins spaced over 0°–180° is 9 (180°/20°) and the image is split into 3x4 cells (12 cells in total), we obtain a HoG histogram with 3x4x9=108 bins. Strictly speaking, the result is not a genuine histogram, since the bins accumulate gradient magnitudes and therefore do not sum to the total number of pixels. A histogram-like vector is finally obtained with sqrt L2-normalization [28].

3.4 BSIF (Binarized Statistical Image Features)

BSIF is a descriptor recently proposed by Kannala and Rahtu [7] for texture classification and face recognition. Its main idea is to learn a set of filters automatically from a small set of natural images, instead of using hand-crafted filters as in the LBP and LPQ descriptors. BSIF is a binary code string whose length is the number of filters. Each bit of the code string is computed by binarizing, with a fixed threshold, the response of the image to one linear filter from the set.
Given an image patch $X$ of size $l \times l$ pixels and the $i$-th linear filter $W_i$ of the same size from the set of learned filters, the response $s_i$ is obtained by

$$s_i = \sum_{u,v} W_i(u, v) X(u, v) = w_i^T x \tag{3}$$

where the vectors $w_i$ and $x$ contain the pixels of $W_i$ and $X$. The binarized feature $b_i$ is obtained by setting $b_i = 1$ if $s_i > 0$ and $b_i = 0$ otherwise [7]. The BSIF descriptor depends on two parameters: the filter window size and the number of bits of the binary code string. The number of bits determines the number of extracted features. If the binary code string has 8 bits, the code takes 256 possible values, which gives a BSIF histogram of 256 bins.

4 Feature selection using Mutual Information

Feature selection is used to identify the useful features and to remove the features that are redundant or irrelevant for the classification task. It therefore requires a measure of feature relevance that quantifies the importance of each feature for this task. In this section we briefly give some basic concepts from information theory that are useful for understanding the four feature selection methods used in this work.

In information theory, MI measures the statistical dependence between two random variables. MI can therefore be used to evaluate the relative utility of each feature for classification. Entropy and mutual information are the two principal concepts involved. Entropy H can be interpreted as a measure of the uncertainty of a random variable. Let X be a discrete random variable with probability distribution p(x).
The entropy of X is defined as [29]:

$$H(X) = -\sum_{x \in X} p(x) \log p(x) \tag{4}$$

[Figure 1 flowchart, text recovered from the figure: training and testing fingerprint images (DB1, DB4); image preprocessing (enhancement, alignment, ROI extraction and division); feature extraction of LBP, LPQ and BSIF and selection of the BSIF descriptor; matching using the chi-square distance formula.]

[Figure 2 flowchart, text recovered from the figure: training and testing fingerprint images (DB1, DB2, DB3, DB4); image preprocessing (enhancement, alignment, ROI extraction and division); feature extraction of LBP, LPQ, HoG and BSIF descriptors; feature selection of LBP, LPQ, HoG and BSIF using the MIFS, mRMR, CIFE and JMI strategies; matching using the chi-square distance formula.]

The mutual information MI between two discrete variables X and Y is defined using their joint probability distribution p(x, y) and their respective marginal probabilities p(x) and p(y) as:

$$MI(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \tag{5}$$

The objective of using MI is to select, from a set F of features, a subset S of relevant features that share the most information with the class variable. Evaluating every candidate subset is intractable, because the number of subsets of k features among n is the binomial coefficient $C_n^k$. This leads to iterative "greedy" algorithms, which either select the relevant features one by one (sequential forward selection) or delete the unneeded features one by one (sequential backward selection). The greedy forward selection procedure with an MI based relevance criterion is generally a good choice of feature selection procedure [30]. The forward "greedy" algorithm based on MI is as follows [31][32]:

1) (Initialization) Set F ← "initial set of n features"; S ← "empty subset".
2) (Calculation of MI) ∀ $f_i \in F$, compute $MI(C; f_i)$.
3) (Choice of the first feature $f_{s_1}$) Find the feature that maximizes $MI(C; f_i)$; set F ← F − {$f_{s_1}$}, S ← {$f_{s_1}$}.
4) (Greedy selection) Repeat until the desired number of features is reached:
   a. (Computation of the joint MI) ∀ $f_i \in F$, compute $MI(C; S, f_i)$.
   b. (Selection of the next feature $f_{s_j}$) Choose the feature $f_i \in F$ that maximizes $MI(C; S, f_i)$ at step j; set F ← F − {$f_{s_j}$}, S ← S ∪ {$f_{s_j}$}.
5) Output the subset S of selected features.

In practice, it is difficult to compute $MI(C; S, f_i)$ when the cardinality of the subset S increases, because it requires the estimation of high-dimensional probability density functions, which cannot be estimated correctly with a limited number of samples [20]. So the majority of the algorithms use measurements based on at most three variables: two features plus the class index. For this reason, many proposed MI based criteria are heuristic [32][33].

As previously stated, "filter" methods are preferred to wrapper ones. These methods are defined by a criterion J, also called relevance index or scoring criterion, which is intended to measure the relevance of a feature or a feature subset for the classification task. The simplest feature scoring criterion is referred to as MIM (Mutual Information Maximization) [21]:

$$J_{mim}(f_i) = MI(C; f_i) \tag{6}$$

The $J_{mim}$ criterion does not take into account the features already selected, which leads to the selection of redundant features (features sharing the same information with the class index C) that must be eliminated. Numerous "filter" criteria that take redundancy into account have been proposed [33][32]. We use four criteria in this work: MIFS, mRMR, CIFE and JMI [21].

4.1 Mutual Information Feature Selection strategy (MIFS)

Proposed by Battiti [31], this criterion is very useful in feature selection problems and classification systems due to its simplicity. MIFS selects the feature that maximizes the information about the class label C, and subtracts a term proportional to the MI between the candidate feature $f_i$ and each already selected feature $f_j$ in order to minimize redundancy:

$$J_{mifs}(f_i) = MI(C; f_i) - \beta \sum_{f_j \in S} MI(f_i; f_j) \tag{7}$$

In this expression, S stands for the set of already selected features.
The parameter β is configurable and determines the degree of redundancy checking within MIFS. It must be set experimentally [21][34]. The performance of MIFS degrades if there are many irrelevant and redundant features, because it penalizes redundancy too much.

4.2 Minimum Redundancy and Maximal Relevance strategy (mRMR)

Proposed by Peng et al. [35], this criterion is equivalent to MIFS with $\beta = \frac{1}{|S|}$, where $|S| = \mathrm{card}(S)$ is the number of already selected features. It finds a balance between the relevance, which is the dependence between the features and the class, and the redundancy of features with respect to the subset of previously selected features. The criterion can be written as:

$$J_{mrmr}(f_i) = MI(C; f_i) - \frac{1}{|S|} \sum_{f_j \in S} MI(f_i; f_j) \tag{8}$$

With the minimum redundancy criterion of the mRMR method, we obtain features that are more representative of the class variable and maximally dissimilar to the already selected ones, so a small number of features effectively covers the same space as a larger number of features.

4.3 Conditional Infomax Feature Extraction strategy (CIFE)

Lin and Tang [36] proposed a criterion, called Conditional Infomax Feature Extraction, in which the joint class-relevant information is maximized by explicitly reducing the class-relevant redundancies among features [33]. Note that this criterion has been proposed by several authors in different ways [20][32][33][37]:

$$J_{cife}(f_i) = MI(C; f_i) - \sum_{f_j \in S} MI(f_i; f_j) + \sum_{f_j \in S} MI(f_i; f_j \mid C) \tag{9}$$

The CIFE criterion is the same as MIFS with β = 1, plus the conditional redundancy term.

4.4 Joint Mutual Information strategy (JMI)

Proposed by Yang and Moody [38], the Joint Mutual Information score is

$$J_{jmi}(f_i) = MI(C; f_i) - \frac{1}{|S|} \sum_{f_j \in S} \left[ MI(f_i; f_j) - MI(f_i; f_j \mid C) \right] \tag{10}$$

The JMI method handles relevance and redundancy by taking the mean value over the selected features, and takes the class label into consideration when calculating MI.
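The four criteria of Eqs. (7) to (10) can be compared side by side in a small Python sketch. It is only an illustration under stated assumptions: features and class labels are already discrete, probabilities are estimated by simple counting (plug-in estimator), MI is computed through the entropy identity MI(X;Y) = H(X) + H(Y) − H(X,Y), and the greedy loop maximizes the criterion J at each step instead of the joint term MI(C; S, f_i). All function names and the toy data are hypothetical; this is not the FEAST toolbox code.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Plug-in joint entropy (in bits) of one or more discrete columns, Eq. (4)."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mi(x, y):
    """MI(X;Y) = H(X) + H(Y) - H(X,Y), equivalent to Eq. (5)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def cond_mi(x, y, c):
    """MI(X;Y|C) = H(X,C) + H(Y,C) - H(X,Y,C) - H(C)."""
    return entropy(x, c) + entropy(y, c) - entropy(x, y, c) - entropy(c)


def score(method, f, selected, c, beta=1.0):
    """Criterion J for candidate column f given the already selected columns."""
    relevance = mi(c, f)
    if not selected:
        return relevance  # first step: plain MIM score, Eq. (6)
    redundancy = sum(mi(f, s) for s in selected)
    if method == "mifs":  # Eq. (7)
        return relevance - beta * redundancy
    if method == "mrmr":  # Eq. (8)
        return relevance - redundancy / len(selected)
    conditional = sum(cond_mi(f, s, c) for s in selected)
    if method == "cife":  # Eq. (9)
        return relevance - redundancy + conditional
    if method == "jmi":  # Eq. (10)
        return relevance - (redundancy - conditional) / len(selected)
    raise ValueError(f"unknown method: {method}")


def greedy_select(features, c, k, method="mrmr", beta=1.0):
    """Forward greedy selection; features maps a name to a discrete column."""
    remaining = dict(features)
    chosen = []
    while remaining and len(chosen) < k:
        selected_cols = [features[name] for name in chosen]
        best = max(
            remaining,
            key=lambda name: score(method, remaining[name], selected_cols, c, beta),
        )
        chosen.append(best)
        del remaining[best]
    return chosen


# Toy data: 4 classes, f2 duplicates f1, f3 carries complementary information.
c = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "f1": [0, 0, 0, 0, 1, 1, 1, 1],
    "f2": [0, 0, 0, 0, 1, 1, 1, 1],
    "f3": [0, 0, 1, 1, 0, 0, 1, 1],
}
selection = {m: greedy_select(features, c, 2, method=m)
             for m in ("mifs", "mrmr", "cife", "jmi")}
```

On this toy data all four criteria skip the duplicated feature f2 after choosing f1, because its redundancy term cancels its relevance, and pick f3 instead. On real histogram bins the criteria generally differ in the order of selection.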
JMI and mRMR are very similar; the difference between them is the conditional redundancy term.

5 Experimental procedure

First, we give a brief description of the public fingerprint dataset FVC2002 [24]. Second, we present the experimental parameters chosen for our fingerprint recognition system. Third, we describe how we select the relevant bins from the LBP, LPQ, HoG and BSIF histograms using Brown's toolbox for feature selection [21].

5.1 Datasets

The experiments were conducted on the FVC2002 fingerprint dataset [24], which is divided into two sets, A and B. Each set is divided into 4 datasets: DB1, DB2, DB3 and DB4. Three different scanners and the SFinGe synthetic generator were used to collect the fingerprints [24]. A total of 120 fingers and 12 impressions per finger (1440 impressions) were collected from 30 volunteers. The top-ten quality fingers were removed from each dataset since they do not constitute an interesting case study [24]. The size of each dataset in the FVC2002 test, however, was established as 110 fingers with 8 impressions per finger (880 impressions), split into set A (100 fingers, evaluation set) and set B (10 fingers, training set). To make set B representative of the whole dataset, the 110 collected fingers were ordered by quality, and the 8 images from every tenth finger were included in set B. The remaining fingers constituted set A. In this work, we have used set A to conduct our experiments [6].

¹ https://www.dropbox.com/s/wregrs3ah0qcfdd/SIfing.rar

Table 1 presents the technologies and the scanners used to collect the FVC2002 datasets and the size of the images in each dataset for each set.

5.2 Fingerprint recognition system

This section describes the experimental parameters chosen for our fingerprint recognition system. The related work in section 2 mentioned the region around the core point of the fingerprint image.
The region, of size 100x100 pixels, is extracted and divided into 4 sub-regions of size 50x50 pixels each. For feature extraction we apply the four descriptors LBP, LPQ, HoG and BSIF to each sub-region.

• For LBP feature extraction, we convert the gray value of each pixel to one of the 256 LBP codes. Next we construct the histogram of LBP codes.
• For LPQ we use a radius equal to 3, so a histogram of 256 bins is extracted.
• For HoG, each sub-region is divided into 3 rows and 3 columns of cells (9 cells in total). The gradient orientation and magnitude of each pixel are calculated. The absolute orientation is divided into 9 equally sized bins, which gives a 9-bin histogram for each of the 9 cells, so a histogram of 81 bins is produced.
• For BSIF we use a filter of size 11x11 and a number of bits equal to 8 to extract a histogram of 256 bins. The learnt filters are provided by [7].

For each region, the histograms of LBP, LPQ, HoG and BSIF are extracted independently and concatenated to construct the final normalized histogram for each descriptor. The LBP, LPQ, HoG and BSIF histograms are extracted using SIfingToolbox¹. For LBP, BSIF and LPQ features, the normalization is carried out by dividing the value of each bin of the histogram by the sum of the values of the bins of this histogram. For HoG features, the normalization is done with sqrt L2-normalization as stated in [28]. Table 2 presents the number of bins in each extracted histogram for the different descriptors.

In this work, the first results are obtained by training the system on 7 images of each person for each dataset. That is, for each dataset we use 700 images for training and the remaining 100 images for testing. In the experiments, 8-fold cross-validation was applied, so the test step was repeated 8 times.
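As an illustration of the LBP settings above, the following Python sketch builds the 1024-bin feature vector from a 100x100 region: four 50x50 sub-regions, one 256-bin LBP histogram per sub-region, concatenation, then division by the sum of the bins. Several details are assumptions, since the text does not fix them: the bit order of the eight neighbors, the skipping of border pixels, and the application of the normalization to the concatenated histogram. The function names and the synthetic input are hypothetical and do not come from SIfingToolbox.

```python
def lbp_code(img, r, c):
    """8-neighbor LBP code of pixel (r, c): a neighbor >= center gives bit 1."""
    center = img[r][c]
    # Clockwise from the top-left neighbor; this bit order is an assumption.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(region):
    """256-bin histogram of LBP codes over the interior pixels of a region."""
    hist = [0] * 256
    for r in range(1, len(region) - 1):
        for c in range(1, len(region[0]) - 1):
            hist[lbp_code(region, r, c)] += 1
    return hist


def split_into_quadrants(roi):
    """Split a region into 4 equal sub-regions (50x50 for a 100x100 ROI)."""
    h, w = len(roi) // 2, len(roi[0]) // 2
    return [[row[c0:c0 + w] for row in roi[r0:r0 + h]]
            for r0 in (0, h) for c0 in (0, w)]


def lbp_descriptor(roi):
    """Concatenate the sub-region histograms and divide by the sum of the bins."""
    bins = []
    for region in split_into_quadrants(roi):
        bins.extend(lbp_histogram(region))
    total = sum(bins)
    return [b / total for b in bins]


# Synthetic 100x100 gray-level pattern (made-up data, not a fingerprint).
roi = [[(3 * r + 7 * c) % 256 for c in range(100)] for r in range(100)]
feature_vector = lbp_descriptor(roi)  # 4 * 256 = 1024 values
```

The same concatenation scheme gives 4 × 81 = 324 values for HoG, which is why its matching stage works on a much shorter vector than the other three descriptors.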
| Dataset | Technology | Scanner | Size of image (pixel × pixel) | Set A | Set B | Resolution |
|---|---|---|---|---|---|---|
| DB1 | Optical | Identix TouchView II | 388×374 | 100 persons with 8 impressions per person (800) | 10 persons with 8 impressions per person (80) | 500 dpi |
| DB2 | Optical | Biometrika FX2000 | 296×560 | 100 persons with 8 impressions per person (800) | 10 persons with 8 impressions per person (80) | 569 dpi |
| DB3 | Capacitive | Precise Biometrics 100 SC | 300×300 | 100 persons with 8 impressions per person (800) | 10 persons with 8 impressions per person (80) | 500 dpi |
| DB4 | Synthetic | SFinGe v2.51 | 288×384 | 100 persons with 8 impressions per person (800) | 10 persons with 8 impressions per person (80) | About 500 dpi |

Table 1: The technologies and scanners used to collect the FVC2002 datasets and the size of images in each dataset.

5.3 Bins selection

Table 2 shows that the number of extracted features is high (a histogram of 1024 bins in the case of BSIF, LBP and LPQ, and of 324 bins in the case of HoG), which makes the response time of the matching stage very long. The dimensionality reduction is achieved by a feature selection stage. To that aim, we have used Brown's Toolbox (FEAST toolbox)², which contains implementations of 13 different feature selection methods based on mutual information. In our case we have used only 4 feature selection methods. Two of them are based on the redundancy (MIFS and mRMR). The other two are based on the conditional redundancy (CIFE and JMI).

In practice, the LBP, LPQ, BSIF and HoG histogram bins are extracted from all the training images, which are also used for feature selection. At this point, each bin is considered as a feature in the feature selection process. This means that each feature is a random variable whose probability distribution can be estimated by building a histogram from many realizations of the variable, each image providing one realization. Building the histogram of a feature requires the range of its magnitudes to be properly discretized. This step is required for a low-bias estimation of the mutual information and entropies used in Brown's Toolbox.
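The discretization step can be sketched as follows in Python. Each histogram bin is treated as one feature column whose values, one per training image, are mapped to a small number of integer levels. The equal-width binning scheme is an assumption (the text does not state which scheme is used), the number of levels m is left as a parameter, and the function names and toy values are hypothetical.

```python
def discretize(values, m):
    """Map real values to integer levels 0..m-1 using m equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; use a single level.
        return [0] * len(values)
    width = (hi - lo) / m
    # The maximum value falls on the upper edge, so clamp it to level m-1.
    return [min(int((v - lo) / width), m - 1) for v in values]


def discretize_features(histograms, m):
    """histograms: one row per training image, one column per histogram bin.

    Returns one discretized column (a list over images) per histogram bin,
    ready for count-based entropy and mutual information estimation.
    """
    n_bins = len(histograms[0])
    return [discretize([row[j] for row in histograms], m)
            for j in range(n_bins)]


# Toy data: three "images", each described by a 2-bin normalized histogram.
histograms = [
    [0.10, 0.90],
    [0.40, 0.60],
    [0.70, 0.30],
]
columns = discretize_features(histograms, m=3)
```

Each returned column plays the role of one discrete random variable, with one realization per training image.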
Let $N$ be the number of images, that is, the number of samples or realizations used to estimate the histogram of each feature. The number $m$ of bins of the histogram of each feature can be obtained by Sturges' formula [39]:

$$m = \log_2(N) + 1 \tag{11}$$

² http://www.cs.man.ac.uk/~gbrown/fstoolbox/

6 Results and discussion

6.1 Impact of the descriptor type on classification performance

In this section, we analyze the performance of the proposed descriptors for the fingerprint recognition task. Performance is measured in terms of recognition rate and computing time for the identification stage. Table 3 shows the recognition rates and the computing times obtained with all extracted features for each descriptor applied to the different datasets. Table 3 (a) shows that the LBP features provide the poorest recognition rates in every dataset except DB2, with a drop of about 10% compared with the other descriptors. The BSIF descriptor gives the best recognition rates except on the DB2 dataset, where it gives the lowest rate. For all the datasets, the HoG and LPQ descriptors give approximately the same results. It is also observed that the DB3 dataset gives the poorest recognition rates. This is because DB3 is the most difficult of the four FVC2002 datasets in terms of image quality [40]. Overall, it can be concluded that the HoG and LPQ descriptors are robust with respect to dataset diversity, because their recognition rates are consistently high compared with the other descriptors. This is confirmed by an average rate over the four datasets of nearly 86.8% for both descriptors. BSIF also reaches an average rate of 86%, but with extreme values: the highest rates on three datasets and the poorest rate on one dataset.

Table 3 (b) shows that the HoG descriptor requires much less computing time than the other descriptors for the identification stage.
This is due to the smaller number of histogram bins required for this method. Moreover, the computing time is rather independent of the tested dataset. In general, we can conclude that HoG features outperform the other features in terms of computational complexity (only 324 features) while reaching one of the highest average recognition rates.

A natural perspective is to deal with larger datasets and/or real-time recognition systems. This requires keeping the number of extracted features as small as possible, which reduces the computational and memory costs of the training and testing stages. For this reason, many feature selection algorithms have been investigated to reduce these computational and memory costs.

6.2 Impact of the feature selection algorithm on classification performance

Fig. 3 shows the results obtained by the four feature selection methods (MIFS, mRMR, CIFE and JMI) on the four datasets DB1, DB2, DB3 and DB4 with all the descriptors. The results obtained with LPQ features are very close to those of HoG and BSIF, as observed in the previous study [23], with LBP again giving the poorest results. All the curves approximately reach a plateau as soon as 20% of the total number of features are selected, for every selection algorithm except MIFS. A first conclusion is that feature dimensionality reduction can be achieved for all the datasets. In many cases, the MIFS algorithm shows an abrupt change at the beginning of the curve. Among the feature selection algorithms, mRMR is slightly better than the others on average over all the datasets. The curse of dimensionality phenomenon can clearly be observed with the DB3 and DB4 datasets in Fig. 3, where higher recognition rates are reached with fewer features than the full set.
Feature extraction method | Number of regions around the core point | Number of histogram bins
LBP | 4 regions of size 50×50 | 256*4 = 1024
LPQ | 4 regions of size 50×50 | 256*4 = 1024
HoG | 4 regions of size 50×50 | 81*4 = 324
BSIF | 4 regions of size 50×50 | 256*4 = 1024
Table 2: Number of histogram bins for each descriptor.

Figure 3: Recognition rates on all the four datasets using HoG, LPQ, LBP and BSIF selected features and using MIFS, mRMR, CIFE and JMI feature selection strategies. (Grid of panels with one column per descriptor (HoG, LPQ, LBP, BSIF) and one row per dataset (DB1 to DB4); horizontal axis: number of features; vertical axis: recognition rate (%); curves: MIFS, mRMR, CIFE, JMI.)

(a) Recognition rate (%)
Descriptor | DB1 | DB2 | DB3 | DB4
HoG | 90.75 | 90.86 | 73.25 | 92.13
LPQ | 90.25 | 91.25 | 74.13 | 91.50
LBP | 80.75 | 84.00 | 65.75 | 81.38
BSIF | 92.25 | 80.75 | 76.37 | 94.50

(b) Computing time (s)
Descriptor | DB1 | DB2 | DB3 | DB4
HoG | 563 | 569 | 554 | 564
LPQ | 10350 | 10493 | 10304 | 10378
LBP | 10161 | 10219 | 10253 | 10256
BSIF | 11381 | 10905 | 10609 | 11257
Table 3: (a) Recognition rate results (%) (b) Computing time results (s) with HoG, LBP, LPQ and BSIF features on the four FVC 2002 datasets.

(a) Reduction rate of computing time (%)
Descriptor | DB1 | DB2 | DB3 | DB4
HoG | 96.44 | 96.48 | 96.39 | 96.45
LPQ | 98.08 | 98.15 | 98.11 | 98.09
LBP | 98.08 | 98.09 | 98.09 | 98.11
BSIF | 98.01 | 97.79 | 97.84 | 97.94

(b) Loss of recognition rate (%)
Descriptor | DB1 | DB2 | DB3 | DB4
HoG | 2.62 | 2.19 | -2.73 | 4.88
LPQ | 4.55 | 5.2 | 4.4 | 3.55
LBP | 0 | 2.98 | 3.04 | -1.99
BSIF | 1.76 | 1.55 | 0.65 | 0.66
Table 4: (a) Reduction Rate (%) of computing Time (b) Loss of Recognition Rate (%) caused by dimensionality reduction.

However, the phenomenon of peaking can be far more significant in some curves without cross-validation. Indeed, the curves of Fig. 3 are the result of cross-validation, which averages 8 recognition rate curves.
This operation may mask outlier curves. As an example, we consider a case without cross-validation with HoG features on DB3, taking the 7th image as the test image and the remaining images as references. As shown in Fig. 4, the CIFE algorithm attains a recognition rate of 74% with only 28 selected HoG features, which is far better than the 66% obtained with all 324 features. Note in addition that such a case corresponds to the practical use of a feature selection algorithm, because the averaging effect of the cross-validation process prevents delivering a common sequence of selected features.

6.3 Impact of feature selection on computing time

In this section, we evaluate the benefit of the selection procedure on the complexity of the system in terms of computing time, and its effect on the recognition rate of the system. For this experiment, we use the JMI feature selection method. Table 4(a) presents the Reduction Rate of the computing Time (RRT), defined as follows:

$$RRT = \frac{T_F - T_S}{T_F} \qquad (12)$$

where $T_F$ is the computing time corresponding to the full number of features and $T_S$ is the computing time corresponding to the number of selected features. Table 4(b) presents the Loss of Recognition Rate (LRR) caused by the dimensionality reduction. This is given by:

$$LRR = \frac{RR_F - RR_S}{RR_F} \qquad (13)$$

where $RR_F$ is the recognition rate corresponding to the full number of features and $RR_S$ is the recognition rate corresponding to the number of selected features. Both quantities are reported as percentages in Table 4. In this experiment, we keep the first selected features, up to 20% of the full number of features.

Figure 5: Number of HoG selected features with alpha = {90%, ..., 99%} on all datasets, using MIFS, mRMR, CIFE and JMI feature selection strategies. (One panel per dataset, DB1 to DB4; horizontal axis: alpha (%); vertical axis: number of features.)
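Equations (12) and (13) can be evaluated with a few lines of code. The sketch below is only an illustration: the function names are hypothetical, the results are scaled by 100 to match the percentages of Table 4, and the input values in the usage lines are made up for the example rather than taken from the experiments.

```python
def reduction_rate_time(t_full, t_selected):
    """RRT of Eq. (12), expressed as a percentage."""
    return 100.0 * (t_full - t_selected) / t_full


def loss_recognition_rate(rr_full, rr_selected):
    """LRR of Eq. (13), expressed as a percentage.

    A negative value means that the selected features give a higher
    recognition rate than the full feature set.
    """
    return 100.0 * (rr_full - rr_selected) / rr_full


# Illustrative (made-up) values: 100 s with all features, 20 s after selection.
print(reduction_rate_time(100.0, 20.0))    # 80.0
# Illustrative (made-up) rates: 80% with all features, 82% after selection.
print(loss_recognition_rate(80.0, 82.0))   # -2.5
```

The sign convention explains the negative entries of a loss table: a negative LRR corresponds to an improvement of the recognition rate after selection.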
Figure 4: The curse of dimensionality phenomenon (peaking) for DB3 dataset with HoG selected features. (Horizontal axis: number of features; vertical axis: recognition rate (%); curves: MIFS, mRMR, CIFE, JMI.)

From Table 4(a), it can be concluded that considering 20% of the BSIF, LBP or LPQ selected features reduces the computation time by about 98% compared to the computation time needed with the full number of features. Table 4(b) indicates that the loss of recognition rate may reach about 5%, while in some cases the recognition rate improves (by 1.99% when selecting 20% of the LBP features with DB4, and by 2.73% when selecting 20% of the HoG features with DB3, as indicated by the negative entries of Table 4(b)).

6.4 Performance analysis of the dimensionality reduction procedure

It is interesting to know to what extent the number of features can be decreased while accepting a small degradation of the recognition rate. For this experiment, we thus determine the minimum number of selected HoG features that yields a recognition rate of at least alpha percent of the rate obtained with all the features, using the formula

$$\mathit{alpha} = \frac{RR_S}{RR_F} \times 100 \qquad (14)$$

where $RR_S$ is the recognition rate corresponding to the selected features and $RR_F$ is the recognition rate obtained with all the features. The alpha parameter can take values from 0% to 100%. Fig. 5 reports the number of HoG selected features corresponding to alpha values in {90%, ..., 99%}.
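The search for the smallest feature subset that satisfies a given alpha can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the recognition-rate curve are hypothetical, the curve is assumed to be sampled in increasing order of the number of selected features, and the threshold is applied with "greater than or equal".

```python
def min_features_for_alpha(curve, rr_full, alpha):
    """Smallest number of selected features reaching alpha % of rr_full.

    curve:   list of (n_features, recognition_rate) pairs, sorted by
             increasing n_features (rates in %).
    rr_full: recognition rate (%) obtained with all the features.
    alpha:   required percentage of rr_full, as in Eq. (14).
    Returns None if no point of the curve reaches the threshold.
    """
    threshold = alpha / 100.0 * rr_full
    for n_features, rate in curve:
        if rate >= threshold:
            return n_features
    return None


# Hypothetical recognition-rate curve (made-up values).
curve = [(10, 70.0), (20, 85.0), (30, 88.5), (40, 90.0)]
print(min_features_for_alpha(curve, rr_full=90.0, alpha=98))   # 30
print(min_features_for_alpha(curve, rr_full=90.0, alpha=100))  # 40
```

Sweeping alpha from 90 to 99 with such a function would produce one point per alpha value of a curve like those reported for each selection method.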
From these results, it can be observed that the three feature selection methods mRMR, CIFE and JMI give very close results, unlike MIFS, which always shows poorer performance except in the case of DB3. It can also be observed that CIFE seems to give better results on the real databases (DB1, DB2 and DB3) than on the synthetic database (DB4). The number of features can be strongly reduced for DB3 with very little concession on the recognition rate (for example, 34 features with CIFE are sufficient with alpha = 98%), the additional gain being very small for smaller alpha values. On the other hand, to keep the same number of features (34) with the other databases, alpha must be lowered to 94% for DB1, 95% for DB2 and less than 90% for DB4 (with mRMR).

Table 5 presents the optimal number of BSIF, HoG, LPQ and LBP features selected by the feature selection methods with alpha = 98%. Table 6 presents the corresponding recognition rates. From Tables 5 and 6, the following points can be highlighted:
- For DB1 and DB3, the combination of HoG features with the CIFE feature selection method gives the best performance, with a reduced number of 66 features in the case of DB1 and 34 features in the case of DB3.
- For DB2 and DB4, the combination of HoG features with the mRMR feature selection method gives the best performance, with a reduced number of 66 features in the case of DB2 and 91 in the case of DB4.
- For DB4, using LBP features with the mRMR feature selection method gives a reduced number of features equal to 48, but with a poor recognition rate compared to HoG and LPQ. The best performance is obtained with 87 BSIF features.

In conclusion, the two feature selection methods mRMR and CIFE yield the smallest number of features in the majority of cases.

7 Conclusion

Histogram-based techniques are widely used for fingerprint image representation.
Generally, the concatenation of the histograms leads to a high-dimensional representation, which degrades the performance of the identification system in terms of complexity (computing time and memory cost) and recognition rate. In this paper, we have studied in depth the problem of dimensionality reduction in a fingerprint identification system, in order to reduce the complexity while possibly improving the recognition rate by avoiding the curse of dimensionality phenomenon. We have presented a fingerprint recognition system based on 4 descriptors: Local Binary Patterns (LBP), Local Phase Quantization (LPQ), Histogram of Gradients (HoG) and Binarized Statistical Image Features (BSIF). For the dimensionality reduction, we used 4 feature selection methods based on mutual information: MIFS, mRMR, CIFE and JMI. The experiments were conducted on the public FVC 2002 fingerprint datasets. The use of several types of features and several datasets makes it possible to validate the feature selection techniques efficiently and to choose the best combination (type of features/feature selection method) for the task of fingerprint identification.

Dataset | BSIF-MIFS | BSIF-mRMR | BSIF-CIFE | BSIF-JMI | HoG-MIFS | HoG-mRMR | HoG-CIFE | HoG-JMI | LPQ-MIFS | LPQ-mRMR | LPQ-CIFE | LPQ-JMI | LBP-MIFS | LBP-mRMR | LBP-CIFE | LBP-JMI
DB1 | 425 | 202 | 176 | 201 | 138 | 107 | 66 | 80 | 261 | 472 | 313 | 448 | 918 | 144 | 220 | 137
DB2 | 274 | 113 | 152 | 194 | 162 | 66 | 94 | 75 | 234 | 303 | 255 | 411 | 953 | 207 | 472 | 222
DB3 | 363 | 121 | 152 | 124 | 202 | 38 | 34 | 35 | 845 | 303 | 290 | 348 | 950 | 260 | 150 | 216
DB4 | 589 | 90 | 297 | 87 | 170 | 91 | 152 | 98 | 653 | 184 | 425 | 248 | 932 | 48 | 197 | 52
Table 5: Number of BSIF, HoG, LPQ and LBP selected features with alpha = 98%. The green values correspond to the minimum number of selected features with a 98% degradation acceptance with respect to the rate obtained with all the features.

Dataset | BSIF-MIFS | BSIF-mRMR | BSIF-CIFE | BSIF-JMI | HoG-MIFS | HoG-mRMR | HoG-CIFE | HoG-JMI | LPQ-MIFS | LPQ-mRMR | LPQ-CIFE | LPQ-JMI | LBP-MIFS | LBP-mRMR | LBP-CIFE | LBP-JMI
DB1 | 90 | 90.37 | 90.10 | 90.37 | 89 | 89 | 89 | 89 | 88.5 | 88.5 | 88.5 | 88.62 | 79.38 | 79.38 | 79.25 | 79.38
DB2 | 79 | 79.12 | 79 | 79.12 | 89.10 | 89.5 | 89.25 | 89.25 | 89.5 | 89.5 | 89.5 | 89.5 | 82.83 | 82.83 | 82.5 | 82.38
DB3 | 74.74 | 74.75 | 74.75 | 74.75 | 71.8 | 72.25 | 72.10 | 72.25 | 72.75 | 72.75 | 72.75 | 72.87 | 64.5 | 64.63 | 64.75 | 64.5
DB4 | 92.5 | 92.5 | 92.6 | 92.5 | 90.5 | 90.37 | 90.37 | 90.30 | 89.75 | 89.75 | 89.87 | 89.75 | 79.75 | 79.75 | 80.25 | 79.75
Table 6: Recognition rates (%) obtained by BSIF, HoG, LPQ and LBP selected features with alpha = 98%. The green numbers are those giving the smallest numbers of selected features.

From all the results, we can conclude that feature selection methods can reduce the number of features whatever the type of features and whatever the dataset, except when MIFS is used with LBP features, a combination that performs poorly. We can also conclude that feature selection techniques can mitigate the curse of dimensionality phenomenon and may improve the recognition rate of the identification system. The combination of HoG features with CIFE or mRMR gives the best performance in terms of recognition rate, robustness and complexity of the system. In terms of complexity, a large reduction of the computation time (98%) is obtained by considering only 20% of the total number of features, without much affecting the recognition rate. In summary, employing feature selection algorithms provides a benefit compared to no selection, since comparable or higher identification performance can be obtained while the computational complexity of the identification stage is reduced. As future work, we plan to investigate other descriptors and biometric modalities.

References
[1] D. Maio, D.
Maltoni, A. K. Jain and S. Prabhakar, "Handbook of fingerprint recognition," Springer, New York, NY, 2003. https://doi.org/10.1007/b97303
[2] K. S. Sunil, "A Review of Image Based Fingerprint Authentication Algorithms," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, no. 6, pp. 553–556, 2013.
[3] Y. Jucheng, "Non-minutiae based fingerprint descriptor," in Biometrics, Nanchang, In Tech, 2012, pp. 80–98. https://doi.org/10.5772/21642
[4] L. Nanni and A. Lumini, "Local Binary Patterns for a hybrid fingerprint matcher," Pattern Recognition, vol. 41, no. 11, pp. 3461–3466, 2008. https://doi.org/10.1016/j.patcog.2008.05.013
[5] S. Brahnam, C. Casanova, L. Nanni and A. Lumini, "A Hybrid Fingerprint Multimatcher," in 16th International Conference on Image Processing, Computer Vision, and Pattern Recognition, Las Vegas, Nevada, USA, pp. 877–882, 2012.
[6] L. Nanni and A. Lumini, "Descriptors for image-based fingerprint matchers," Expert Systems With Applications, vol. 36, no. 10, pp. 12414–12422, 2009. https://doi.org/10.1016/j.eswa.2009.04.041
[7] J. Kannala and E. Rahtu, "BSIF: binarized statistical image features," in 21st International Conference on Pattern Recognition (ICPR 2012), IEEE, Tsukuba, Japan, pp. 1363–1366, 2012.
[8] A. I. Awad and K. Baba, "Evaluation of a fingerprint identification algorithm with SIFT features," in IIAI International Conference on Advanced Applied Informatics, Fukuoka, 2012, pp. 129–132. https://doi.org/10.1109/iiai-aai.2012.34
[9] S. Egawa, A. I. Awad and K. Baba, "Evaluation of acceleration algorithm for biometric identification," in Networked Digital Technologies (NDT 2012), Communications in Computer and Information Science, Springer, Berlin, Heidelberg, vol. 294, pp. 231–242, 2012. https://doi.org/10.1007/978-3-642-30567-2_19
[10] T. Amornraksa and S. Tachaphetpiboon, "Fingerprint recognition using DCT features," Electronics Letters, vol. 42, no. 9, pp. 522–523, 2006. https://doi.org/10.1049/el:20064330
[11] A. K. Jain, S. Prabhakar, L. Hong and S. Pankanti, "Filterbank-based fingerprint matching," IEEE Transactions on Image Processing, vol. 9, no. 5, pp. 846–859, 2000. https://doi.org/10.1109/83.841531
[12] S. Lifeng, Z. Feng and T. Xiaoou, "Improved fingercode for filterbank-based fingerprint matching," in International Conference on Image Processing, vol. 2, no. 2, pp. 895–898, 2003. https://doi.org/10.1109/icip.2003.1246825
[13] R. Kumar, P. Chandra and M. Hanmandlu, "Fingerprint Matching Based on Texture Feature," in Mobile Communication and Power Engineering, Springer-Verlag Berlin, vol. 296, pp. 86–91, 2013. https://doi.org/10.1007/978-3-642-35864-7_12
[14] M. Saha, J. Chaki and R. Parekh, "Fingerprint Recognition using Texture Features," International Journal of Science and Research, vol. 2, no. 12, pp. 2319–7064, 2013.
[15] K. Tewari and R. L. Kalakoti, "Fingerprint Recognition Using Transform Domain Techniques," in International Technological Conference, pp. 136–140, 2014.
[16] M. W. Zin and M. M. Sein, "Texture Feature based Fingerprint Recognition for Low Quality Images," in Micro-NanoMechatronics and Human Science (MHS), International Symposium, IEEE, Nagoya, Japan, pp. 333–338, 2011. https://doi.org/10.1109/mhs.2011.6102204
[17] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[18] A. K. Jain and B. Chandrasekaran, "Dimensionality and Sample Size Considerations in Pattern Recognition Practice," in Handbook of Statistics, Amsterdam, 1982, pp. 835–855. https://doi.org/10.1016/s0169-7161(82)02042-2
[19] A. K. Jain, R. P. Duin and J. Mao, "Statistical Pattern Recognition: A Review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4–37, 2000. https://doi.org/10.1109/34.824819
[20] A. Hacine-Gharbi, M. Deriche, P. Ravier and T. Mohamadi, "A new histogram-based estimation technique of entropy and mutual information using mean squared error minimization," Computers & Electrical Engineering, vol. 39, no. 3, pp. 918–933, 2013. https://doi.org/10.1016/j.compeleceng.2013.02.010
[21] G. Brown, A. Pocock, M. Lujan and M. J. Zhao, "Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection," Journal of Machine Learning Research, vol. 13, pp. 27–66, 2012.
[22] B. Jun, T. Kim and D. Kim, "A compact local binary pattern using maximization of mutual information for face analysis," Pattern Recognition, vol. 44, pp. 532–543, 2011. https://doi.org/10.1016/j.patcog.2010.10.008
[23] A. Adjimi, A. Hacine-Gharbi, P. Ravier and M. Mostefai, "Extraction and selection of binarised statistical image features for fingerprint recognition," Int. J. Biometrics, vol. 9, no. 1, pp. 67–80, 2017. https://doi.org/10.1504/ijbm.2017.10005054
[24] D. Maio, D. Maltoni, R. Cappelli, J. L. Wayman and A. K. Jain, "FVC2002: Second Fingerprint Verification Competition," in 16th International Conference on Pattern Recognition, 2002.
[25] A. Adjimi, A. Hacine-Gharbi and M. Mostefai, "Application of Binarized Statistical Image Features for Fingerprint Recognition," in SIVA 2015, 3rd International Conference on Signal, Image, Vision and their Applications, Guelma, Algeria, 2015.
[26] T. Ojala, M. Pietikainen and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002. https://doi.org/10.1109/tpami.2002.1017623
[27] T. Ojala, M. Pietikainen and D. Harwood, "A comparative study of texture measures with classification based on feature distributions," Pattern Recognition, vol. 29, pp. 51–59, 1996. https://doi.org/10.1016/0031-3203(95)00067-4
[28] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, USA, pp. 886–893, 2005. https://doi.org/10.1109/CVPR.2005.177
[29] T. Cover and J. Thomas, Elements of information theory, 2nd ed., Canada: John Wiley & Sons, 2006. https://doi.org/10.1002/047174882x
[30] D. François, F. Rossi, V. Wertz and M. Verleysen, "Resampling methods for parameter-free and robust feature selection with mutual information," Neurocomputing, vol. 70, pp. 1276–1288, 2007. https://doi.org/10.1016/j.neucom.2006.11.019
[31] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," IEEE Trans. Neural Networks, vol. 5, no. 4, pp. 537–550, 1994. https://doi.org/10.1109/72.298224
[32] A. Hacine-Gharbi, P. Ravier and T. Mohamadi, "Une nouvelle méthode de sélection des paramètres pertinents : application en reconnaissance de la parole," in conférence TAIMA, Hammamet, Tunisie, pp. 399–407, 2009.
[33] G. Brown, "A new perspective for information theoretic feature selection," in International Conference on Artificial Intelligence and Statistics, Florida, USA, pp. 49–56, 2009.
[34] N. Kwak and C. H. Choi, "Input feature selection for classification problems," IEEE Transactions on Neural Networks, vol. 13, no. 1, pp. 143–159, 2002. https://doi.org/10.1109/72.977291
[35] H. Peng, F. Long and C. Ding, "Feature selection based on mutual information: Criteria of max dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005. https://doi.org/10.1109/tpami.2005.159
[36] D. Lin and X. Tang, "Conditional infomax learning: An integrated framework for feature extraction and fusion," in European Conference on Computer Vision, Springer-Verlag Berlin, Graz, Austria, pp. 68–82, 2006. https://doi.org/10.1007/11744023_6
[37] I. Kojadinovic, "Relevance measures for subset variable selection in regression problems based on k-additive mutual information," Comput. Statist. Data Anal., vol. 49, pp. 1205–1227, 2005. https://doi.org/10.1016/j.csda.2004.07.026
[38] H. Yang and J. Moody, "Data Visualization and Feature Selection: New Algorithms for Non Gaussian Data," Advances in Neural Information Processing Systems, MIT Press, pp. 688–695, 1999.
[39] H. Sturges, "The choice of a class-interval," J. Amer. Statist. Assoc., vol. 21, pp. 65–66, 1926. https://doi.org/10.1080/01621459.1926.10502161
[40] Y. Chen, S. C. Dass and A. K. Jain, "Fingerprint Quality Indices for Predicting Authentication Performance," in Audio- and Video-Based Biometric Person Authentication, Springer-Verlag Berlin Heidelberg, Hilton Rye Town, USA, pp. 160–170, 2005. https://doi.org/10.1007/11527923_17