Informatica 41 (2017) 333–347

Improving Visual Vocabularies: a More Discriminative, Representative and Compact Bag of Visual Words

Leonardo Chang, Airel Pérez-Suárez and José Hernández-Palancar
Advanced Technologies Application Center, 7A #21406, Siboney, Playa, Havana, Cuba, C.P. 12220
E-mail: {lchang,asuarez,jpalancar}@cenatav.co.cu

Miguel Arias-Estrada and L. Enrique Sucar
Instituto Nacional de Astrofísica, Óptica y Electrónica, Luis Enrique Erro No. 1, Sta. María Tonantzintla, Puebla, México, C.P. 72840
E-mail: {ariasmo,esucar}@inaoep.mx

Keywords: bag of visual words improvement, bag of features, visual vocabulary, visual codebook, object categorization, object class recognition, object classification

Received: July 21, 2016

In this paper, we introduce three properties and their corresponding quantitative evaluation measures to assess the ability of a visual word to represent and discriminate an object class, in the context of the BoW approach. Also, based on these properties, we propose a methodology for reducing the size of the visual vocabulary, retaining those visual words that best describe an object class. Reducing the vocabulary provides a more reliable and compact image representation. Our proposal does not depend on the quantization method used for building the set of visual words, the feature descriptor or the weighting scheme used, which makes it suitable for any visual vocabulary. Throughout the experiments we show that using only the most discriminative and representative visual words obtained by the proposed methodology improves the classification performance; the best results obtained with our method are statistically superior to those obtained with the entire vocabularies. On the Caltech-101 dataset, the average best results outperformed the baseline by 4.6% and 4.8% in mean classification accuracy using SVM and KNN, respectively. On the Pascal VOC 2006 dataset there was a 1.6% and 4.7% improvement for SVM and KNN, respectively. Furthermore, these accuracy improvements were always obtained with more compact representations. Vocabularies 10 times smaller always obtained better accuracy than the baseline vocabularies on the Caltech-101 dataset, and in 93.75% of the experiments on the Pascal VOC dataset.

Povzetek: With the help of data mining, the paper deals with finding the visual words that discriminate between object classes.

1 Introduction

One of the most widely used approaches for representing images for object categorization is the Bag of Words (BoW) approach [5]. BoW-based methods have obtained remarkable results in recent years, and they even obtained the best results for several classes in the recent PASCAL Visual Object Classes Challenge on object classification [8]. The key idea of BoW approaches is to discretize the entire space of local features (e.g., SIFT [22]) extracted from a training set at interest points or densely sampled in the image. With this aim, clustering is performed over the set of features extracted from a training set in order to identify features that are visually equivalent. Each cluster is interpreted as a visual word, and all clusters form a so-called visual vocabulary. Later, in order to represent an unseen image, each feature extracted from the image is assigned to a visual word of the visual vocabulary; from these assignments, a histogram of occurrences of each visual word in the image is obtained, as illustrated in Figure 1.
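The pipeline just described can be summarized in a few lines of code. The following is only a minimal sketch (not the exact setup used later in the experiments), assuming the local descriptors have already been extracted as NumPy arrays; the function names and the scikit-learn K-means configuration are our own illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans


def build_vocabulary(train_descriptors, n_words=1000, seed=0):
    """Cluster the pooled training descriptors into visual words (vocabulary building)."""
    kmeans = KMeans(n_clusters=n_words, n_init=3, random_state=seed)
    kmeans.fit(train_descriptors)
    return kmeans  # the cluster centers play the role of the visual vocabulary


def bow_histogram(image_descriptors, vocabulary):
    """Assign each local feature to its nearest visual word and count occurrences (tf)."""
    words = vocabulary.predict(image_descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # L1-normalized term-frequency vector
```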
One of the main limitations of the BoW approach is that the visual vocabulary is built using features that belong to both the object and the background. This implies that the noise extracted from the image background is also considered as part of the object class description. Also, in the BoW representation, every visual word is used, regardless of its low representativeness or discriminative power. These elements may limit the quality of further classification processes. In addition, there is no consensus on the optimal way of building the visual vocabulary, i.e., the clustering algorithm to use, the number of clusters (visual words) that best describe the object classes, etc. When dealing with relatively small vocabularies, clustering can be executed several times and the best-performing vocabulary can be selected through a validation phase. However, this becomes intractable for large image collections.

Figure 1: Classical BoW approach overview (steps 1 to 4). First, regions/points of interest are automatically detected and local descriptors over those regions/points are computed (steps 1 and 2). Later, in step 3, the descriptors are quantized into visual words to form the visual vocabulary. Finally, in step 4, the occurrences in the image of each specific word in the vocabulary are counted to construct the BoW feature. In this work, we propose to introduce step (*), ranking and filtering, in order to use only the most discriminative and representative visual words from the visual vocabulary in the BoW representation.

In this paper, we propose three properties to assess the ability of a visual word to represent and discriminate an object class in the context of the BoW approach. We define three measures in order to quantitatively evaluate each of these properties. The visual words that best represent a class, best generalize over intra-class variability and best differentiate between object classes obtain the highest scores for these measures. A methodology for reducing the size of the visual vocabulary based on these properties is also proposed. Our proposal does not depend on the clustering method used to create the visual vocabulary, the descriptor used (e.g., SIFT, SURF, etc.) or the weighting scheme used (e.g., tf, tf-idf, etc.). Therefore, it can be applied to any visual vocabulary to improve its representativeness: it does not build a new visual vocabulary, it rather finds the best visual words of a given visual vocabulary.

Experiments conducted on the Caltech-101 [10] and Pascal VOC 2006 [9] datasets, in a classification task, demonstrate the improvement introduced by the proposed method. Tested with different vocabulary sizes, different interest point extraction and description methods, and different weighting schemas, the classification accuracies achieved using the entire vocabulary were always statistically inferior to those achieved by several of the vocabularies obtained by filtering the baseline vocabulary using our proposed vocabulary size reduction methodology.
Moreover, the best results were obtained with as few as 13.4% and 17.2%, on average, of the baseline visual words for the Caltech-101 and Pascal VOC 2006 datasets, respectively. Compared with a state-of-the-art mutual-information-based method for feature selection, our proposal obtains superior classification accuracy for the highest compression rates and comparable results for the other filtering sizes.

The paper is organized as follows: Section 2 gives an overview of related work on building more discriminative and representative visual vocabularies. Section 3 introduces the proposed properties and measures for evaluating the representativeness and distinctiveness of visual words. The performance of our proposed method on two datasets and a discussion of the obtained results are presented in Section 4. Finally, Section 5 concludes the paper with a summary of our findings and a discussion of future work.

2 Related work

Several methods have been proposed in the literature to overcome the limitations of the BoW approach [25]. These include part generative models and frameworks that use geometric correspondence [30, 23], works that deal with the quantization artifacts introduced while assigning features to visual words [15, 11], and techniques that explore different features and descriptors [24, 12], among many others. In this section, we briefly review some recent methods aimed at building more discriminative and representative visual vocabularies, which are the most closely related to our work.

Kesorn and Poslad [17] presented a framework to improve the quality of visual words by constructing visual words from representative keypoints. Also, domain-specific non-informative visual words are detected using two main characteristics of non-informative visual words: a high document frequency and a small statistical association with all the concepts in the collection. In addition, the vector space model of visual words is restructured with respect to a structural ontology model in order to solve visual synonym and polysemy problems.

Zhang et al. [29] proposed to obtain a visual vocabulary comprised of descriptive visual words and descriptive visual phrases as the visual correspondences to text words and phrases. The authors state that a descriptive visual element can be composed of visual words and their combinations, and that these combinations are effective in representing certain visual objects or scenes. Therefore, they define visual phrases as frequently co-occurring visual word pairs.

Lopez-Sastre et al. [21] presented a method for building a more discriminative visual vocabulary by taking into account the class labels of images. The authors proposed a cluster precision criterion based on class labels in order to obtain class-representative visual words through a Reciprocal Nearest Neighbors clustering algorithm. Also, they introduced an adaptive threshold refinement scheme aimed at increasing vocabulary compactness.

Liu [19] builds a visual vocabulary based on a Gaussian Mixture Model (GMM). After the K-means clusters are obtained, a GMM is used to model the distribution of each cluster, and each GMM is then used as a visual word of the visual vocabulary. Also, a soft-assignment scheme for the bag of words is proposed, based on the soft assignment of image features to each GMM visual word.
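As a rough illustration of the soft-assignment idea in [19], the sketch below fits a single Gaussian mixture with one component per visual word and accumulates posterior probabilities instead of hard counts. This is a simplification of the per-cluster GMMs described in the paper, and all names are ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def soft_bow(image_descriptors, train_descriptors, n_words=500, seed=0):
    # One diagonal Gaussian per visual word, fitted on the pooled training descriptors.
    gmm = GaussianMixture(n_components=n_words, covariance_type="diag", random_state=seed)
    gmm.fit(train_descriptors)
    # Soft assignment: every feature spreads its unit mass over all visual words.
    posteriors = gmm.predict_proba(image_descriptors)  # shape (n_features, n_words)
    hist = posteriors.sum(axis=0)
    return hist / max(hist.sum(), 1.0)
```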
Liu and Shah [20] exploit mutual information maximization techniques to learn a compact set of visual words and to determine the size of the codebook. In their proposal, two codebook entries are merged if they have comparable distributions. In addition, spatio-temporal pyramid matching is used to exploit temporal information in videos.

The most popular visual descriptors are histograms of image measurements. It has been shown that, with histogram features, the Histogram Intersection Kernel (HIK) is more effective than the Euclidean distance in supervised learning tasks. Based on this observation, Wu et al. [28] proposed a histogram-kernel k-means algorithm which uses HIK in an unsupervised manner to improve the generation of visual codebooks.

In [4], in order to use low-level features extracted from images to create higher-level features, Chandra et al. proposed a hierarchical feature learning framework that uses a Naive Bayes clustering algorithm. First, SIFT features over a dense grid are quantized using K-means to obtain the first-level symbol image. Later, features from the current level are clustered using Naive Bayes-based clustering and quantized to obtain the symbol image at the next level. Bag-of-words representations can be computed using the symbol image at any level of the hierarchy.

Jiu et al. [16], motivated by obtaining a visual vocabulary highly correlated with the recognition problem, proposed a supervised method for joint visual vocabulary creation and class learning, which uses the class labels of the training set to learn the visual words. To achieve that, they proposed two different learning algorithms, one based on error backpropagation and the other based on cluster label reassignment.

In [27], the authors propose a hierarchical visual word mergence framework based on graph embedding. Given a predefined large set of visual words, their goal is to hierarchically merge them into a small number of visual words, such that the lower-dimensional image representation obtained from these new words maximally maintains classification performance.

Zhang et al. [31] proposed a supervised Mutual Information (MI) based feature selection method. This algorithm uses the MI between each dimension of the image descriptor and the image class label to compute the importance of that dimension. Finally, keeping the dimensions with the highest importance values, they reduce the size of the image representation. This method achieves higher accuracy and lower computational cost than feature compression methods such as product quantization [14] and BPBC [13].

In our work, similarly to [17, 21, 16], we also use the class labels of images. However, we do not use the class labels to create a new visual vocabulary, but to score the set of visual words according to their distinctiveness and representativeness for each class. It is important to emphasize that our proposal does not depend on the algorithm used for building the set of visual words, the descriptor used, or the weighting scheme used. These characteristics make our approach suitable for any visual vocabulary, since it does not build a new visual vocabulary; it rather finds the best visual words of a given visual vocabulary. In fact, our proposal could directly complement all the methods discussed above by ranking their resulting vocabularies according to the distinctiveness and representativeness of the obtained visual words, although exploring this is out of the scope of this paper.
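Since [31] is later used as the comparison baseline in Section 4.2, it is worth sketching the idea: rank the dimensions of the BoW vector by their mutual information with the class label and keep only the top fraction. The snippet below is an approximation using scikit-learn's mutual_info_classif estimator, which may differ from the exact MI estimator used in [31]; X is an images-by-words matrix and y the class labels (our notation).

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif


def mi_rank_dimensions(X, y, seed=0):
    """Score each BoW dimension (visual word) by its MI with the class label."""
    scores = mutual_info_classif(X, y, random_state=seed)
    return np.argsort(scores)[::-1]  # dimensions ordered from most to least informative


def keep_top_fraction(X, ranked_dims, fraction=0.1):
    """Compress the representation by keeping only the best-ranked dimensions."""
    n_keep = max(1, int(round(fraction * X.shape[1])))
    return X[:, ranked_dims[:n_keep]]
```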
3 Proposed method

Visual vocabularies are commonly comprised of many noisy visual words, due to intra-class variability and the inclusion of features from the background during the vocabulary building process, among other causes. Later, for image representation, every visual word is used, which may lead to an error-prone image representation.

In order to improve image representations, we introduce three properties and their corresponding quantitative evaluations to assess the ability of a visual word to represent and discriminate an object class in the context of the BoW approach. We also propose a methodology, based on these properties, for reducing the size of the visual vocabulary, discarding those visual words that worst describe an object class (i.e., noisy visual words). Reducing the vocabulary in such a manner allows a more reliable and compact image representation.

We would like to emphasize that all the measures proposed in this section are used during the training phase; therefore, we can use all the knowledge about the data that is available during this phase.

3.1 Inter-class representativeness measure

A visual word could be comprised of features from different object classes, representing visual concepts or parts of objects common to those different classes. These common parts or concepts do not necessarily have to be equally represented inside the visual word because, even when similar, object classes should also have attributes that differentiate them. Therefore, we can say that, in order to best represent an object class, a property that a visual word must satisfy is to have a high representativeness of this class. In order to measure the representativeness of a class c_j in visual word k, the measure M1 is proposed:

    M_1(k, c_j) = \frac{f_{k,c_j}}{n_k},    (1)

where f_{k,c_j} is the number of features of class c_j in visual word k and n_k is the total number of features in visual word k.

Figure 2 shows M1 values for two example visual words. In Figure 2(a), the 'blue' class has a very high value of M1 because most of the features in the visual word belong to that class; the opposite holds for the other two classes, which are poorly represented in the visual word. Figure 2(b) shows an example visual word where every class is nearly equally represented; therefore, every class has a similar M1 value.

Figure 2: (best seen in color.) Examples of M1 values for (a) a visual word with a well-defined representative class (one class with a high M1 value, the other two classes with low M1 values) and (b) a visual word without any highly representative class (all three classes have low and very similar M1 values). In the example, n_k = 20 in both cases, with M1 values of 0.85, 0.10 and 0.05 in (a) and 0.35, 0.30 and 0.35 in (b).
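Equation (1) translates directly into code. The sketch below assumes the training features have already been assigned to visual words: word_of_feature[i] is the visual word of feature i and class_of_feature[i] its object class (both NumPy arrays; the names are ours).

```python
import numpy as np


def inter_class_representativeness(word_of_feature, class_of_feature, n_words, classes):
    """M1[k, j] = f_{k,c_j} / n_k: fraction of features in word k coming from class c_j (Eq. 1)."""
    word_of_feature = np.asarray(word_of_feature)
    class_of_feature = np.asarray(class_of_feature)
    n_k = np.bincount(word_of_feature, minlength=n_words).astype(float)
    M1 = np.zeros((n_words, len(classes)))
    for j, c in enumerate(classes):
        f_kc = np.bincount(word_of_feature[class_of_feature == c], minlength=n_words)
        M1[:, j] = f_kc / np.maximum(n_k, 1.0)  # avoid division by zero for empty words
    return M1
```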
3.2 Intra-class representativeness measure

A visual word could be comprised of features from different objects, many of them probably belonging to the same object class. Even when different, object instances from the same class should share several visual concepts. Taking this into account, we can state that a visual word describes a specific object class better when the features of that class inside the visual word are more evenly balanced with respect to the different training objects belonging to that class. Therefore, we can say that, in order to best represent an object class, a property that a visual word must satisfy is to have a high generalization, or intra-class representativeness, over this class. To measure the intra-class representativeness of a visual word k for a given object category c_j, the measure µ is proposed:

    \mu(k, c_j) = \frac{1}{O_{c_j}} \sum_{m=1}^{O_{c_j}} \left| \frac{o_{m,k,c_j}}{f_{k,c_j}} - \frac{1}{O_{c_j}} \right|,    (2)

where O_{c_j} is the number of objects (images) of class c_j in the training set, o_{m,k,c_j} is the number of features extracted from object m of class c_j in visual word k, and f_{k,c_j} is the number of features of class c_j that belong to visual word k. The term 1/O_{c_j} represents the ideal ratio of features of class c_j that guarantees the best balance, i.e., the case where each object of class c_j is equally represented in visual word k.

The measure µ evaluates how much a given class deviates from its ideal value of intra-class variability balance. In order to make this value comparable across classes and visual words, µ can be normalized by its maximum possible value, which is (2 O_{c_j} - 2) / O_{c_j}^2. Taking into account that µ takes its maximum value in the worst case of intra-class representativeness, the measure M2 is defined to take its maximum value in the case of ideal intra-class variability balance and to be normalized by max(µ(k, c_j)):

    M_2(k, c_j) = 1 - \frac{O_{c_j}}{2 (O_{c_j} - 1)} \sum_{m=1}^{O_{c_j}} \left| \frac{o_{m,k,c_j}}{f_{k,c_j}} - \frac{1}{O_{c_j}} \right|.    (3)

Figure 3 shows the values of M2 for two example visual words. In Figure 3(a), the number of features from the different images of the considered class in the visual word is well balanced, i.e., the visual word generalizes well over the intra-class variability of that class; hence, the class obtains a high M2 value. In contrast, in Figure 3(b), only one image of the class is well represented by the visual word. Since the visual word represents a visual characteristic present in only one image, it is not able to represent intra-class variability well; therefore, the class obtains a low M2 value for this visual word.

Figure 3: (best seen in color.) Examples of M2 values for one class in (a) a visual word where there is a good balance between the number of features coming from the different images of that class (high M2 value), and (b) the opposite case, where only one image of the class is predominantly represented in the visual word (low M2 value). In the figure, different fill colors of each feature in the visual word represent features extracted from different object images of the same class.
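The intra-class balance of Equations (2) and (3) can be computed per word and per class from the per-image feature counts. Below is a minimal sketch for a single (word, class) pair, assuming image_of_feature contains, for the class-c_j features that fell into word k, the index of the training image each feature came from (names are ours).

```python
import numpy as np


def intra_class_representativeness(image_of_feature, n_images_in_class):
    """M2 for one visual word and one class (Eq. 3); n_images_in_class is O_{c_j}."""
    image_of_feature = np.asarray(image_of_feature)
    f_kc = len(image_of_feature)
    if f_kc == 0 or n_images_in_class < 2:
        return 0.0
    counts = np.bincount(image_of_feature, minlength=n_images_in_class).astype(float)
    # Sum of |o_{m,k,c_j}/f_{k,c_j} - 1/O_{c_j}| over the class images (Eq. 2 without the 1/O factor).
    deviation = np.abs(counts / f_kc - 1.0 / n_images_in_class).sum()
    # Eq. (3): 1 minus the deviation normalized by its maximum possible value.
    return 1.0 - n_images_in_class / (2.0 * (n_images_in_class - 1)) * deviation
```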
3.3 Inter-class distinctiveness measure

M1 and M2 provide, from different perspectives, a quantitative evaluation of the ability of a visual word to describe a given class. However, we should not build a vocabulary just by selecting those visual words that best represent each object class, because this does not directly imply that the most representative words will be able to differentiate well one class from another, as a visual vocabulary is expected to do. Therefore, we can state that, in order to be used as part of a visual vocabulary, a desirable property of a visual word is that it should have high values of M_1(k, c_j) and M_2(k, c_j) (it represents the object class well), while having low values of M_1(k, \{c_j\}^C) and M_2(k, \{c_j\}^C) (it poorly represents the rest of the classes), i.e., it must have high discriminative power.

In order to quantify the distinctiveness of a visual word for a given class, the measure M3 is proposed. M3 expresses how much the object class that is best represented by visual word k is separated from the other classes in the M1 and M2 rankings. Let Θ_M(K, c_j) be the set of values of a given measure M for the set of visual words K = {k_1, k_2, ..., k_N} and the object class c_j, sorted in descending order of the value of M. Let Φ(k, c_j) be the position of visual word k ∈ K in Θ_M(K, c_j). Let P_k = min_{c_j ∈ C}(Φ(k, c_j)) be the best position of visual word k over the set of all object classes C = {c_1, c_2, ..., c_Q}, and let c_k = arg min_{c_j ∈ C}(Φ(k, c_j)) be the object class where k reaches position P_k. Then, the inter-class distinctiveness (measure M3) of a given visual word k, for a given measure M, is defined as:

    M_3(k, \mathcal{M}) = \frac{1}{(|C| - 1)(|K| - 1)} \sum_{c_j \neq c_k} \left( \Phi(k, c_j) - P_k \right).    (4)

In Figure 4, the M3 measure is calculated for two visual words (k_2 and k_5) in an example with six visual words and three classes. Visual word k_2 is among the top items of the representativeness ranking for every class in the example. Despite this, k_2 has low discriminative power, because describing several classes well makes it harder to differentiate one class from another. In contrast, visual word k_5 is highly discriminative because it describes only one class well.

Figure 4: (best seen in color.) Example of the M3 measure for two visual words of a six-word, three-class vocabulary, with per-class rankings Θ_M(K, c_1) = (k_2, k_1, k_3, k_4, k_6, k_5), Θ_M(K, c_2) = (k_2, k_5, k_4, k_6, k_1, k_3) and Θ_M(K, c_3) = (k_1, k_2, k_4, k_6, k_5, k_3); this yields P_{k_2} = 1 with M_3(k_2, M) = 0.1, and P_{k_5} = 2 with M_3(k_5, M) = 0.7. k_2 has low discriminative power because it represents several classes well, while k_5 has high discriminative power because it describes only one class well.

3.4 On ranking and reducing the size of visual vocabularies

The proposed measures provide a quantitative evaluation of the representativeness and distinctiveness of the visual words in a vocabulary for each class. The visual words that best represent a class, best generalize over intra-class variability and best differentiate between object classes obtain the highest scores for these measures. In this section, we present a methodology for ranking and reducing the size of visual vocabularies, towards more reliable and compact image representations.

Let Θ_{M1}(K) and Θ_{M2}(K) be the rankings of vocabulary K obtained using the measures M_3(K, M_1) and M_3(K, M_2), respectively. Θ_{M1}(K) and Θ_{M2}(K) provide rankings of the vocabulary based on the distinctiveness of visual words according to inter-class and intra-class variability, respectively.

In order to find a consensus Θ(K) between the rankings Θ_{M1}(K) and Θ_{M2}(K), a consensus-based voting method can be used; in our case, we decided to use the Borda Count algorithm [7], although any other could be used as well. The Borda Count algorithm obtains a final ranking from multiple rankings over the same set. Given |K| visual words, a visual word receives |K| points for a first preference, |K| − 1 points for a second preference, |K| − 2 for a third, and so on, for each ranking independently. Later, the individual values for each visual word are added and a final ranking is obtained; a sketch of this step is given below.

From this final ranking, a reduced vocabulary can be obtained by selecting the first N visual words. As pointed out in [20], the size of the vocabulary affects the performance, and there is a vocabulary size which achieves maximal accuracy; this size depends on the dataset, the number of classes and the nature of the data, among other factors. In our experiments, we explore different vocabulary sizes over different datasets, different interest point extraction and description methods, different weighting schemas, and different classifiers.
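The two remaining steps, the inter-class distinctiveness of Equation (4) and the Borda Count consensus, can be sketched as follows, given a score matrix with one row per visual word and one column per class holding either M1 or M2 values; the helper names are ours, and the code is only one possible reading of the text.

```python
import numpy as np


def inter_class_distinctiveness(scores):
    """Eq. (4). scores: (n_words, n_classes) matrix of M1 or M2 values."""
    n_words, n_classes = scores.shape
    order = np.argsort(-scores, axis=0)           # per-class ranking, best score first
    phi = np.empty_like(order)
    for j in range(n_classes):
        phi[order[:, j], j] = np.arange(n_words)  # Phi(k, c_j): 0-based position of word k
    best_pos = phi.min(axis=1)                    # P_k
    diffs = phi - best_pos[:, None]               # zero for c_k itself, so summing all columns matches Eq. (4)
    return diffs.sum(axis=1) / float((n_classes - 1) * (n_words - 1))


def borda_consensus(ranking_m1, ranking_m2):
    """Borda Count over the two rankings (arrays of word indices, best first)."""
    n = len(ranking_m1)
    points = np.zeros(n)
    for ranking in (ranking_m1, ranking_m2):
        points[ranking] += np.arange(n, 0, -1)    # |K| points for 1st place, |K|-1 for 2nd, ...
    return np.argsort(-points)                    # consensus ranking, best first
```

A filtered vocabulary that keeps, say, 10% of the words is then simply the first ⌈0.1·|K|⌉ entries of the consensus ranking, together with the corresponding columns of the BoW vectors.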
4 Experimental evaluation

In this work, we have presented a methodology for improving the BoW-based image representation by using only the most representative and discriminative visual words in the vocabulary. As stated in the previous sections, our proposal does not depend on the algorithm used for building the set of visual words, the descriptor used, or the weighting scheme used. Therefore, the proposed methodology could be applied to improve the accuracy of any of the BoW-based methods reported in the literature.

The main goal of the experiments presented in this section is to quantitatively evaluate the improvement introduced by our proposal to the BoW-based image representation, over two standard datasets commonly used in object categorization. The experiments were focused on: a) assessing the validity of our proposal in a classic BoW-based classification task, b) evaluating the methodology directly with respect to other kinds of feature selection algorithms, and c) measuring the time our methodology spends in filtering the visual vocabulary built for each dataset. All the experiments were run on a single thread of a 3.6 GHz Intel i7 processor with 64 GB of RAM.

The experiments conducted in order to evaluate our proposal were done on two well-known datasets: Caltech-101 [10] and Pascal VOC 2006 [9]. The Caltech-101 dataset [10] consists of 102 object categories. There are about 40 to 800 images per category, and most categories have about 50 images. The Pascal VOC 2006 dataset [9] consists of 10 object categories. In total, there are 5304 images, split into 50% for training/validation and 50% for testing. The distributions of images and objects by class are approximately equal across the training/validation and test sets.

4.1 Assessing the validity in a BoW-based classification task

As mentioned before, the goal of the first experiment is to assess the validity of our proposal. With this aim, we evaluate the accuracy in a classic BoW-based classification task, with and without applying our vocabulary filtering methodology. In the experiments presented here, we use for image representation the BoW schema presented in Figure 1 with the following specifications:

– Interest points are detected and described using two methods: SIFT [22] and SURF [2].

– K-means, with four different K values, is used to build the visual vocabularies; these vocabularies constitute the baseline. For both the Caltech-101 and Pascal VOC 2006 datasets we used K = 10000, 15000, 20000 and 25000.

– Each of the baseline vocabularies is ranked using our proposed visual word ranking methodology.

– Later, nine new vocabularies are obtained by filtering each baseline vocabulary, keeping the 10%, 20%, ..., 90%, respectively, of the most representative and discriminative visual words according to the obtained ranking.

– Two weighting schemas are used for image representation: tf and tf-idf.
– For both datasets, we randomly selected 10 images from every category for building the visual vocabularies. The rest of the images were used as test images, but, as in [18], we limited the number of test images per category to 50.

After that, we tested the obtained visual vocabularies in a classification task, using SVM (with a linear kernel) and KNN (where K is optimized with respect to the leave-one-out error) as classifiers. For each visual vocabulary, test images are represented using this vocabulary and a 10-fold, 10-times cross-validation process is conducted, where nine of the ten partitions are used for training and the remaining one for testing the trained classifier. The mean classification accuracy over the ten iterations is reported.

Figures 5 and 6 show the mean classification accuracy results over the cross-validation process using SVM and KNN, respectively, on the Caltech-101 dataset. Figures 7 and 8 show the same for the Pascal VOC 2006 dataset. In Figures 5 to 8, subfigures (a) and (b) show the results using the SIFT descriptor; results for SURF are shown in subfigures (c) and (d). Results for the two weighting schemas, i.e., tf and tf-idf, are shown in subfigures (a), (c) and (b), (d), respectively. It can be seen that, in both datasets and for every configuration, our proposed methodology allows obtaining reduced vocabularies that outperform the classical BoW approach (baseline).

Table 2 summarizes the results presented in Figures 5 and 6 for the Caltech-101 dataset. The results in Figures 7 and 8 are summarized in Table 3. For every experiment configuration, Tables 2 and 3 show the baseline classification accuracy against the best result obtained by the proposed method with both the SVM and KNN classifiers. The size of the filtered vocabulary for which the best result was obtained is also shown.

4.1.1 Discussion

The experimental results presented in this section validate the claimed contributions of our proposed method. As can be seen in Tables 2 and 3, the best results obtained with our proposed method outperform those obtained with the whole vocabularies. For the experiments conducted on the Caltech-101 dataset, our average best results outperformed the baseline by 4.6% and 4.8% in mean classification accuracy using SVM and KNN, respectively. On the Pascal VOC 2006 dataset, there was a 1.6% and 4.7% improvement for SVM and KNN, respectively. As can be noticed in Figures 5 to 8, there is a clear trend of the performance with respect to the filter size in the two considered datasets: the smaller the filter size, the higher the accuracy.

In order to validate the improvement obtained by the proposed method, the statistical significance of the obtained results was verified. For testing the statistical significance we used the Mann-Whitney test, with a 95% confidence level. A detailed explanation of the Mann-Whitney test, as well as an implementation, can be found in [1]. As a result of this test, it has been verified that the results obtained on both datasets by the proposed method are statistically superior to those obtained by the baseline.
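The significance check described above can be reproduced with SciPy, given two arrays of per-fold accuracies, one for a filtered vocabulary and one for the baseline. This is a minimal sketch, not the authors' exact script (the text relies on the implementation in [1]).

```python
from scipy.stats import mannwhitneyu


def significantly_better(filtered_accuracies, baseline_accuracies, alpha=0.05):
    """One-sided Mann-Whitney U test at 95% confidence: is the filtered vocabulary better?"""
    _, p_value = mannwhitneyu(filtered_accuracies, baseline_accuracies, alternative="greater")
    return p_value < alpha, p_value
```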
In addition, the best results using the filtered vocabularies were obtained with vocabularies several times smaller than the baseline vocabularies, i.e., 6 and 10 times smaller on average using SVM and KNN, respectively, for the Caltech-101 dataset, and 8 and 5 times smaller on average for the Pascal VOC 2006 dataset with SVM and KNN, respectively. Furthermore, vocabularies 10 times smaller always obtained better accuracy than the baseline vocabularies on the Caltech-101 dataset, and in 93.75% of the experiments on the Pascal VOC 2006 dataset. Obtaining smaller vocabularies implies more compact image representations, which has a direct impact on the efficiency of further processing based on these image representations, and on memory usage.

Also, the conducted experiments provide evidence that a large number of visual words in a vocabulary are noisy or weakly discriminative. Discarding these visual words allows for better and more compact image representations.

4.2 Comparison with other kinds of feature selection algorithms

The aim of the second experiment is to compare our proposal with respect to another kind of feature selection algorithm. With this purpose, we compare the accuracy of our vocabulary filtering methodology with respect to the accuracy of the MI-based method proposed in [31], in a classification task; the experiment was done over the Caltech-101 dataset. As mentioned in Section 2, the MI-based method proposed in [31] obtains the best results among the feature selection and compression methods of image representation for object categorization.

Table 1: Computation time (in seconds) of visual vocabulary ranking compared to vocabulary building.

  Dataset                       K      Vocabulary building   Vocabulary ranking       Vocabulary ranking
                                       (K-means) time (s)    (proposed method) (s)    (MI-based method [31]) (s)
  Caltech-101                   10000  4723.452              8.111                    5.754
  (188 248 training features)   15000  6711.089              18.622                   16.825
                                20000  7237.885              33.890                   27.478
                                25000  9024.024              54.338                   49.963
  Pascal VOC 2006               10000  4126.487              7.985                    5.285
  (114 697 training features)   15000  5922.134              18.320                   15.879
                                20000  6441.980              30.259                   28.458
                                25000  8563.692              51.743                   49.132

Figure 5: Mean classification accuracy results for SVM cross-validation on the Caltech-101 dataset (panels: SIFT + tf, SIFT + tf-idf, SURF + tf, SURF + tf-idf; x-axis: filtering size (%), where 100% is the baseline; y-axis: mean accuracy (%); one curve per vocabulary size K = 10000, 15000, 20000, 25000). As can be seen, using the reduced vocabularies always resulted in better classification accuracies.
Figure 6: Mean classification accuracy results for KNN cross-validation on the Caltech-101 dataset (same panel layout as Figure 5). As can be seen, using the reduced vocabularies always resulted in better classification accuracies.

Figure 7: Mean classification accuracy results for SVM cross-validation on the Pascal VOC 2006 dataset (same panel layout as Figure 5). As can be seen, using the reduced vocabularies always resulted in better classification accuracies.

Figure 8: Mean classification accuracy results for KNN cross-validation on the Pascal VOC 2006 dataset (same panel layout as Figure 5). As can be seen, using the reduced vocabularies always resulted in better classification accuracies.

Table 2: Summarized results for the Caltech-101 dataset (baseline vs. best filtered-vocabulary mean accuracy, in %, and the filter size, in % of the baseline vocabulary, at which the best result was obtained).
                                        --------------- SVM ---------------   --------------- KNN ---------------
  Descriptor  Weighting   K             Baseline  Best      Best filter       Baseline  Best      Best filter
              schema                              result    size (%)                    result    size (%)
  SIFT        tf          10000         24.05     26.54     20                3.60      7.86      10
                          15000         22.22     26.66     10                3.06      6.74      10
                          20000         21.22     25.28     20                2.54      5.84      10
                          25000         20.41     25.85     10                2.34      5.30      10
              tf-idf      10000         23.96     27.87     40                3.41      7.28      10
                          15000         24.05     28.55     10                2.91      6.29      10
                          20000         24.45     27.53     20                2.60      6.13      10
                          25000         24.18     27.87     20                2.45      5.59      10
  SURF        tf          10000         24.63     29.81     10                3.53      10.70     10
                          15000         22.43     28.82     10                3.26      8.77      10
                          20000         20.75     28.08     10                2.80      9.54      10
                          25000         19.56     27.66     10                3.17      8.55      10
              tf-idf      10000         26.48     30.74     20                3.42      10.50     10
                          15000         26.39     30.21     20                2.97      8.70      10
                          20000         26.50     29.78     10                2.72      8.69      10
                          25000         26.62     30.47     30                2.57      7.61      10
  Average                               23.62     28.23     16.8              2.96      7.76      10

Table 3: Summarized results for the Pascal VOC 2006 dataset (same layout as Table 2).

                                        --------------- SVM ---------------   --------------- KNN ---------------
  Descriptor  Weighting   K             Baseline  Best      Best filter       Baseline  Best      Best filter
              schema                              result    size (%)                    result    size (%)
  SIFT        tf          10000         34.97     36.11     10                13.16     18.74     10
                          15000         34.37     36.38     10                10.22     17.79     10
                          20000         33.69     36.30     10                9.31      16.89     10
                          25000         33.66     35.84     10                8.31      17.35     10
              tf-idf      10000         37.86     38.27     10                19.25     21.12     10
                          15000         37.27     38.77     10                19.25     20.14     10
                          20000         37.86     38.84     10                19.37     19.72     90
                          25000         36.97     38.48     30                19.31     19.76     70
  SURF        tf          10000         36.36     38.15     10                6.96      17.72     10
                          15000         36.49     37.54     10                6.37      16.17     10
                          20000         36.05     37.82     20                6.11      14.15     10
                          25000         35.39     36.78     20                5.91      14.49     10
              tf-idf      10000         37.77     40.85     10                20.56     22.20     10
                          15000         37.72     39.74     10                20.19     21.81     10
                          20000         37.28     39.21     10                20.39     20.81     60
                          25000         37.63     38.31     10                20.22     21.28     10
  Average                               36.33     37.96     12.5              14.06     18.76     21.88

In the experiments presented here, we use for image representation a BoW-based schema with the following specifications:

– PHOW features (dense multi-scale SIFT descriptors) [3].

– Spatial histograms as image descriptors.

– Elkan K-means [6], with five different K values (K = 256, 512, 1024, 2048 and 4096), is used to build the visual vocabularies; these vocabularies constitute the baseline.

– Each of the baseline vocabularies is ranked using the MI-based method proposed in [31] and using our proposed visual vocabulary ranking methodology.

– Later, nine new vocabularies are obtained by filtering each baseline vocabulary, keeping the 10%, 20%, ..., 90% best-ranked visual words, respectively.

– We randomly selected 15 images from each of the 102 categories of the Caltech-101 dataset in order to build the visual vocabularies. For each category, 15 images were randomly selected as test images.

We tested the obtained visual vocabularies in a classification task, using a homogeneous kernel map to transform a χ2 Support Vector Machine (SVM) into a linear one [26]. The classification accuracy is reported in Figure 9. As can be seen in Figure 9, for each value of K used in the experiment, our proposal obtains the best classification accuracy results for the highest compression rates. For the other filtering sizes, our proposal and the MI-based method attain comparable results.

Figure 9: Comparison of mean classification accuracy results, on the Caltech-101 dataset, between the proposed methodology and the MI-based method proposed in [31] (one panel per vocabulary size, K = 256, 512, 1024, 2048 and 4096; x-axis: filtering size (%), where 100% is the whole vocabulary; y-axis: mean accuracy (%)).
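The χ²-kernel-to-linear-SVM step in this setup can be approximated with an explicit additive χ² feature map in the spirit of [26]; the sketch below uses scikit-learn rather than the authors' original tooling, and X_train/X_test are assumed to be the (non-negative) spatial-histogram image descriptors.

```python
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def chi2_linear_svm(X_train, y_train, X_test):
    """Approximate a chi-squared kernel SVM with an explicit feature map followed by a linear SVM."""
    clf = make_pipeline(AdditiveChi2Sampler(sample_steps=2), LinearSVC(C=1.0))
    clf.fit(X_train, y_train)
    return clf.predict(X_test)
```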
4.3 Computation time of the visual vocabulary ranking

The computation time of the visual vocabulary ranking methodology has also been evaluated. Table 1 shows the time, in seconds, taken by the ranking method for vocabularies of different sizes, for the Caltech-101 and Pascal VOC 2006 datasets. In Table 1, the ranking time is compared with the time needed to build the visual vocabulary. As can be seen in Table 1, the proposed methodology can be used to improve visual vocabularies without requiring much extra computation time.

5 Conclusion and future work

In this work, we devised a methodology for reducing the size of visual vocabularies that allows obtaining more discriminative and representative visual vocabularies for BoW image representation. The vocabulary reduction is based on three properties and their corresponding quantitative measures, which express the inter-class representativeness, the intra-class representativeness and the inter-class distinctiveness of visual words. The experimental results presented in this paper showed that, on average, with only 25% of the ranked vocabulary, classification results that are statistically superior to those of the classical BoW representation using the entire vocabularies can be obtained. Therefore, the proposed method, in addition to providing accuracy improvements, provides a substantial efficiency improvement. Also, compared with a mutual-information-based method, our proposal obtained superior results for the highest compression rates and comparable results for the other filtering sizes.

As future work, we aim to propose a weighting schema that takes advantage of the proposed measures in order to improve image representation. We would also like to explore the use of hierarchical classifiers for dealing with inter-class variability. Finally, we aim to define a measure that helps us automatically choose the filter size.

References

[1] Concepts and applications of inferential statistics, 2013.

[2] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, 110(3):346–359, 2008.

[3] A. Bosch, A. Zisserman, and X. Munoz. Image classification using random forests and ferns. In IEEE 11th International Conference on Computer Vision (ICCV), pages 1–8, 2007.

[4] Siddhartha Chandra, Shailesh Kumar, and C. V. Jawahar. Learning hierarchical bag of words using naive Bayes clustering. In Asian Conference on Computer Vision, pages 382–395, 2012.

[5] Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, pages 1–22, 2004.

[6] Charles Elkan. Using the triangle inequality to accelerate k-means. In Tom Fawcett and Nina Mishra, editors, ICML, pages 147–153. AAAI Press, 2003.

[7] Peter Emerson. The original Borda count and partial voting. Social Choice and Welfare, 40(2):353–358, 2013.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html.

[9] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf.

[10] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59–70, April 2007.

[11] Basura Fernando, Élisa Fromont, Damien Muselet, and Marc Sebban. Supervised learning of Gaussian mixture models for visual vocabulary generation. Pattern Recognition, 45(2):897–907, 2012.

[12] Peter V. Gehler and Sebastian Nowozin. On feature combination for multiclass object classification. In ICCV, pages 221–228. IEEE, 2009.

[13] Y. Gong, S. Kumar, H. A. Rowley, and S. Lazebnik. Learning binary codes for high-dimensional data using bilinear projections. In CVPR, 2013.

[14] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.

[15] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.

[16] Mingyuan Jiu, Christian Wolf, Christophe Garcia, and Atilla Baskurt. Supervised learning and codebook optimization for bag-of-words models. Cognitive Computation, 4:409–419, December 2012.

[17] Kraisak Kesorn and Stefan Poslad. An enhanced bag-of-visual-word vector space model to represent visual content in athletics images. IEEE Transactions on Multimedia, 14(1):211–222, 2012.

[18] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 2169–2178, 2006.

[19] Gang Liu. Improved bags-of-words algorithm for scene recognition. Journal of Computational Information Systems, 6(14):4933–4940, 2010.

[20] Jingen Liu and Mubarak Shah. Learning human actions via information maximization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.

[21] R. J. Lopez-Sastre, T. Tuytelaars, F. J. Acevedo-Rodriguez, and S. Maldonado-Bascon. Towards a more discriminative and semantic visual vocabulary. Computer Vision and Image Understanding, 115(3):415–425, 2011.

[22] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[23] Zhiwu Lu and Horace Ho-Shing Ip. Image categorization with spatial mismatch kernels. In CVPR, pages 397–404. IEEE, 2009.

[24] Jianzhao Qin and Nelson Hon Ching Yung. Feature fusion within local region using localized maximum-margin learning for scene categorization. Pattern Recognition, 45(4):1671–1683, 2012.

[25] Chih-Fong Tsai. Bag-of-words representation in image annotation: A review. ISRN Artificial Intelligence, 2012, 2012.
[26] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3), 2011.

[27] Lei Wang, Lingqiao Liu, and Luping Zhou. A graph-embedding approach to hierarchical visual word mergence. IEEE Transactions on Neural Networks and Learning Systems, 28(2):308–320, 2017.

[28] Jianxin Wu, Wei-Chian Tan, and James M. Rehg. Efficient and effective visual codebook generation using additive kernels. Journal of Machine Learning Research, 12:3097–3118, 2011.

[29] Shiliang Zhang, Qi Tian, Gang Hua, Qingming Huang, and Wen Gao. Generating descriptive visual words and visual phrases for large-scale image applications. IEEE Transactions on Image Processing, 20(9):2664–2677, 2011.

[30] Shiliang Zhang, Qi Tian, Gang Hua, Wengang Zhou, Qingming Huang, Houqiang Li, and Wen Gao. Modeling spatial and semantic cues for large-scale near-duplicated image retrieval. Computer Vision and Image Understanding, 115(3):403–414, 2011.

[31] Y. Zhang, J. Wu, and J. Cai. Compact representation for image classification: To choose or to compress? In CVPR, 2014.