https://doi.org/10.31449/inf.v47i7.4661    Informatica 47 (2023) 1–10

Entropy-Guided Assessment of Image Retrieval Systems: Advancing Grouped Precision as an Evaluation Measure for Relevant Retrievals

Tahar Gherbi 1, Ahmed Zeggari 1*, Zianou Ahmed Seghir 2 and Fella Hachouf 3
1 Math and Computer Sciences Dept., Echahid Cheikh Larbi Tebessi University, Tebessa, Algeria
2 Faculty ST, ICOSI Lab, University of Khenchela, Khenchela, Algeria
3 Automatic and Robotic Lab, Electronic Dept., University of Mentouri Constantine, Constantine, Algeria
E-mail: tahar@univ-tebessa.dz, ahmed.zeggari@univ-tebessa.dz, zianou_ahmed_seghir@yahoo.fr, hachouf.fella@gmail.com
* Corresponding author

Keywords: information retrieval, CBIR, performance evaluation, precision, clustering, information theory, entropy

Received: February 6, 2023

The performance evaluation of Content-Based Image Retrieval (CBIR) systems is a challenging and pressing problem, even for expert human users, given the large number of CBIR systems proposed in the literature and applied to different image databases. The automatic measures widely used to assess CBIR systems, such as precision and recall, are inherited from the general Text Retrieval (TR) domain. This paper proposes a new quantitative measure, adapted to the CBIR-specific property that relevant images tend to be returned in groups, based on the entropy of the returned relevant images. The proposed performance measure is easy to understand and to implement. Its good discriminating power is shown through a comparative study with existing, well-known CBIR evaluation measures.

Povzetek: The paper proposes a new quantitative measure for the evaluation of content-based image retrieval (CBIR) systems, based on the entropy of the returned relevant images.

1 Introduction

The aim of Content-Based Image Retrieval (CBIR) systems [1], [2] is to rank the most similar images in a database for a given user query, based on image content rather than on textual annotations or metadata. In a typical scenario, the CBIR system returns the relevant images from the database in response to the user's query image. Query by image content is an extremely active discipline; a large number of systems in different application areas have been designed over the last twenty years. In [5], the authors report a tremendous growth in publications on this topic, covering many disciplines such as medicine, botany, face recognition, fingerprint identification and place recognition. CBIR systems are based on the automatic extraction of low-level image features, such as color, gray shades and texture, not on manual keyword annotation [3], [4]. The evaluation of CBIR systems relies on benchmarking and performance metrics. The goal of a benchmark is to compare different systems on a common set of test image databases. Exhaustive surveys on this topic can be found, for example, in [19], [21] and [20].

A main problem in the field of CBIR evaluation is the lack of a common performance measure that allows a quantitative and objective comparison of visual retrieval systems. The most widely used measures describe the number and/or the rank of relevant images within a returned list; Müller et al. [8] and van Rijsbergen [16] present good summaries. Several measures dedicated to the evaluation of CBIR systems have been proposed in the last few years.
In [18], the authors proposed a new measure called Mean Normalized Retrieval Order (MNRO), which uses the sigmoid Gompertz function to overcome the weaknesses of Mean Average Precision (MAP) and of the Average Normalized Modified Retrieval Rank (ANMRR) [17].

The density of the returned relevant results is important and consistent with human visual evaluation. The common evaluation measures cannot reflect the grouping property of the returned relevant images. In other words, the interrelation between relevant images is missed, although it matters for a fast exploration of the result by visual inspection. For example, assuming a window size of 10, a system returning 100 images with 10 relevant images in one window is better than a system returning the same results with one relevant image per window. Furthermore, we extend the evaluation scale to achieve a better discriminating power, in which two systems having the same precision value can be evaluated differently.

The rest of the paper is organized as follows. Section 2 provides an overview of the most widely used measures for information retrieval evaluation; Section 3 describes the limitations of the standard measures, especially for image retrieval; Section 4 outlines the proposed entropy-based measure; Section 5 provides the experimental results and discussion; finally, Section 6 draws the conclusions.

2 Measuring information retrieval: quantitative assessment

Quantitative evaluation measures in information retrieval (IR) are designed to fulfill specific criteria, including their correlation with user satisfaction, their ability to discriminate among retrieval results, and their ease of interpretation and implementation. These measures serve as valuable tools in assessing the performance and effectiveness of information retrieval systems.

The most widely used evaluation measures in IR are derived from the fundamental concepts of recall and precision. Recall represents the ability of a retrieval system to retrieve all relevant documents from a given data-set, while precision measures the proportion of retrieved documents that are truly relevant. These measures provide valuable insights into the accuracy and completeness of the retrieval results, enabling researchers and practitioners to assess and compare different systems or algorithms.

However, there are also alternative evaluation measures based on utility theory. These measures, as described in works such as [9, 10, 11], focus on measuring the worth or value of the retrieval output to the user. Utility-based measures take into consideration the utility or benefit that users derive from the retrieved documents, providing a different perspective on the quality of the retrieval system's output.

Utility-based measures are particularly useful in evaluating set-based retrieval output, as observed in tasks like the TREC filtering task [12]. By considering the worth of the retrieved documents to the user, these measures capture the relevance and usefulness of the retrieved set as a whole, rather than treating each document independently.

In a comprehensive evaluation scenario, an effective performance measure should adhere to the following criteria:

– Relevance of retrieved images: the measure should consider the number of relevant images returned by the system. It is essential that the retrieved images are indeed relevant to the user's query.
This criterion en- sures that the system accurately identifies and retrieves the desired content. – Retrieval of Relevant Images: The measure should also take into account the size of the returned list. It is crucial that all relevant images are successfully retrieved by the system. A good performance mea- sure should strive for high recall, aiming to retrieve as many relevant images as possible. – Ranking of Relevant Images: The rank of the relevant images within the returned list is another important factor . The measure should prioritize placing the most relevant images at the top of the list. A higher -ranked position indicates a better performance, as it facilitates quick and ef ficient access to the most relevant content. – Interrelations among Returned Relevant Images: The measure should consider the i nterrelations between the returned relevant images. Ideally , the relevant im- ages should be grouped together rather than scattered throughout the list. This criterion ensures that the re- trieval system provides coherent and meaningful re- sults, enhancing the user ’ s browsing experience. By incorporating these criteria into the performance mea- sure, researchers and practitioners can gain a comprehen- sive understanding of the system’ s ef fectiveness in informa- tion retrieval tasks. It allows for a holistic evaluation, con- sidering relevance, retrieval completeness, ranking quality , and the overall or ganization of the retrieved content. 2.1 Mean average pr ecision Mean A verage Precision (MAP) has been a popular eval- uation metric in the field of T ext Retrieval since its intro- duction in the T ext Retrieval Conferences (TREC) starting from TREC 3 in 1994 [6]. Over the years, it has gained widespread adoption among researchers as a reliable mea- sure for assessing the performance of their retrieval systems [7]. The MAP metric provides a comprehensive assessment by considering both precision and the ordering of relevant documents in the retrieval results. It calculates the average precision for each query and then takes the mean of these average precision values. The formula for MAP is as fol- lows: MAP= 1 R R ∑ i=1 i r i Here, R represents the total number of relevant documents in the entire collection for a specific information query . The termr i denotes the ranking position of thei th relevant doc- ument in the retrieved list. The MAP metric takes into account the ranking position of each relevant document. It assigns higher importance to relevant documents appearing at the top of the retrieved list. The formula calculates the precision at each position and then averages these precision values over all the rel- evant documents, providing a single numerical value that represents the overall performance of the retrieval system. By utilizing MAP , researchers can evaluate the ef fective- ness of their retrieval systems by considering both the ac- curacy of the results (precision) and the completeness of the results (recall). It enables the comparison of dif ferent systems and the measurement of improvements made over time or across dif ferent experiments. 2.2 R-pr ecision The concept ofR -precision provides a valuable insight into the performance of an information retrieval system by fo- cusing on the precision achieved after retrieving a specific number , R , of relevant images for a given query . In other Entropy-Guided Assessment of Image Retrieval… Informatica 47 (2023) 1–10 3 words, R -precision measures the precision of the retrieval results up to a certain rank. 
When R is equal to the total number of relevant images for the query, reaching an R-precision of 1.0 signifies an ideal scenario with perfect relevance ranking and perfect recall. It implies that all the relevant images in the collection have been retrieved within the top R positions, ensuring a complete and accurate representation of the query's intended information.

An R-precision value less than 1.0 indicates that not all the relevant images have been retrieved within the first R positions. This could be due to the presence of irrelevant or less relevant images in higher ranks, affecting the precision achieved. As the R-precision approaches 1.0, it signifies an improvement in the retrieval system's performance, as a larger proportion of the relevant images appear earlier in the retrieved list.

Evaluating the R-precision allows researchers and practitioners to assess the effectiveness and efficiency of their retrieval systems by examining how well the system ranks and retrieves relevant images at different points. It complements other evaluation measures like precision at different ranks, average precision, or mean average precision, providing a more granular understanding of the retrieval system's performance in the early stages of retrieval.

2.3 Precision and recall

The standard metrics used for evaluating the performance of information retrieval systems are precision and recall [15, 16]. Precision measures the proportion of relevant documents retrieved by the system out of all the documents that were returned. It provides an indication of the accuracy and relevance of the retrieval results. A high precision indicates that a large percentage of the retrieved documents are indeed relevant to the user's query.

On the other hand, recall measures the proportion of relevant documents that were retrieved out of all the relevant documents in the collection. It captures the system's ability to retrieve all relevant documents and reflects its completeness. A high recall suggests that a significant portion of the relevant documents has been successfully retrieved.

Precision and recall are complementary metrics that help assess different aspects of retrieval system performance. While precision emphasizes the quality of the retrieved results, recall emphasizes the system's ability to capture all relevant information. The balance between precision and recall depends on the specific requirements and goals of the information retrieval task.

By evaluating precision and recall, researchers and practitioners can gain insights into the effectiveness and efficiency of their information retrieval systems. These metrics allow for comparisons between different retrieval algorithms or system configurations, aiding in the optimization and enhancement of retrieval performance.

P = \frac{r(N)}{N}

In this context, the variable r(N) denotes the count of relevant images retrieved, whereas N represents the size of the retrieved list. Precision is a straightforward evaluation measure that is often favored due to its ease of implementation. However, it does not take into account the specific rank positions of the relevant elements, making it less sensitive to their order in the retrieval results.

Recall is defined as the proportion of retrieved relevant documents out of all the relevant documents present in the database (Rel):

R = \frac{r(N)}{Rel}

Ideally, a retrieval system should aim for high values for both the precision (P) and recall (R) metrics.
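Continuing the same illustration, precision and recall at a cutoff N follow directly from the two formulas above. This is again a sketch of ours, not code from the paper.

```python
# Minimal sketch (our own illustration): precision and recall at a cutoff N,
# from the same 0/1 relevance flags used above.

def precision_at(relevance, n):
    """P = r(N) / N: fraction of the first n returned items that are relevant."""
    returned = relevance[:n]
    return sum(returned) / n if n else 0.0

def recall_at(relevance, n, total_relevant):
    """R = r(N) / Rel: fraction of all relevant items found in the first n returned."""
    returned = relevance[:n]
    return sum(returned) / total_relevant if total_relevant else 0.0

if __name__ == "__main__":
    ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]    # hypothetical returned list
    print(precision_at(ranked, 5))              # 3/5 = 0.6
    print(recall_at(ranked, 5, 4))              # 3/4 = 0.75
```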
Rather than relying on individual measures of precision or recall, it is common to use a joint precision-recall (PR) graph to provide a comprehensive description of the system's performance [3]. The PR graph visually illustrates the trade-off between precision and recall at various thresholds or rankings.

However, one limitation of the PR graph is that its interpretation can be influenced by the number of relevant images associated with a particular query [8]. The shape and characteristics of the PR curve may vary depending on the specific query and the number of relevant images present. This means that comparing PR graphs across different queries or data-sets may not always provide a fair or meaningful comparison.

Despite this drawback, the PR graph remains a valuable tool for evaluating retrieval system performance. It allows researchers and practitioners to analyze the trade-off between precision and recall, make informed decisions about system parameters or algorithms, and understand the system's behavior at different retrieval thresholds. By considering the PR graph alongside other evaluation metrics, researchers can gain deeper insights into the strengths and weaknesses of their retrieval systems.

2.4 Recall-precision graph

A recall-precision graph is a graphical representation that illustrates the trade-off between recall and precision for a given information retrieval system or algorithm. Recall measures the completeness of the results returned by the system. It represents the proportion of relevant documents retrieved out of all the relevant documents in the collection; higher recall indicates that more relevant documents are being retrieved. Precision, on the other hand, measures the accuracy of the retrieved results. It represents the proportion of relevant documents among all the documents retrieved; higher precision indicates that a higher percentage of the retrieved documents are relevant.

In a recall-precision graph, recall is typically plotted on the y-axis, while precision is plotted on the x-axis. The graph shows how the precision changes as the recall increases. The curve on the graph illustrates the relationship between recall and precision, and it can provide insights into the effectiveness of an information retrieval system. Ideally, a retrieval system should achieve high precision and high recall simultaneously. However, in practice, there is often a trade-off between the two measures. The recall-precision graph helps to visualize this trade-off and assists in finding the optimal balance based on the specific retrieval system requirements.

2.5 Entropy-based measures

Entropy-based measures derived from the field of information theory play a significant role in the validation and evaluation of clustering algorithms. These measures provide valuable insights into the quality and effectiveness of clustering results. Among the various entropy-based measures, two popular ones commonly used are Entropy and Purity, proposed by Zhao and Karypis [13], and the V-measure proposed by Rosenberg and Hirschberg [14].

The concept of entropy, borrowed from information theory, provides a quantitative measure of the uncertainty or disorder within a cluster. It assesses how well the cluster's members are distributed across different classes or categories.
Lower entropy indicates a higher degree of purity and cohesion within the cluster, suggesting that the members of the cluster predominantly belong to the same class.

Purity, on the other hand, measures the homogeneity of a cluster in terms of class labels. It evaluates how well the cluster assignments align with the true class labels of the data points. A high purity score signifies that the cluster contains predominantly instances from a single class, indicating a more accurate and reliable clustering result.

The V-measure combines both entropy and purity to provide a balanced evaluation metric for clustering. It captures the trade-off between homogeneity and completeness of a clustering solution. The V-measure is particularly useful when dealing with imbalanced data-sets, where some classes have a significantly larger number of instances than others.

By employing entropy-based measures such as Entropy, Purity, and the V-measure, researchers and practitioners can objectively assess the quality and coherence of clustering results. These measures help in comparing and selecting appropriate clustering algorithms, fine-tuning parameters, and optimizing the clustering process to obtain meaningful and accurate clusters.

Entropy = \sum_{i=1}^{K} \frac{k_i}{N} \left[ -\frac{1}{\log C} \sum_{j=1}^{C} \frac{A_{ij}}{k_i} \log\left(\frac{A_{ij}}{k_i}\right) \right]

Purity = \sum_{i=1}^{K} \frac{1}{N} \max_{j}(A_{ij})

In the given context, the variables can be defined as follows: N represents the total number of data elements, C denotes the number of standard partitions, K signifies the total number of clusters, k_i refers to the size of cluster i, and A_{ij} indicates the count of elements in partition j that are assigned to cluster i.

The calculation of the V-measure involves assessing the homogeneity and completeness of a clustering solution. These evaluations rely on entropy measures such as H(C) and H(K), as well as conditional entropies, including H(C|K) and H(K|C):

H(C) = -\sum_{j=1}^{C} \frac{\sum_{i=1}^{K} A_{ij}}{N} \log \frac{\sum_{i=1}^{K} A_{ij}}{N}

H(K) = -\sum_{i=1}^{K} \frac{\sum_{j=1}^{C} A_{ij}}{N} \log \frac{\sum_{j=1}^{C} A_{ij}}{N}

H(C|K) = -\sum_{i=1}^{K} \sum_{j=1}^{C} \frac{A_{ij}}{N} \log \frac{A_{ij}}{\sum_{j=1}^{C} A_{ij}}

H(K|C) = -\sum_{j=1}^{C} \sum_{i=1}^{K} \frac{A_{ij}}{N} \log \frac{A_{ij}}{\sum_{i=1}^{K} A_{ij}}
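To make these definitions concrete, the sketch below (our own illustration, not code from the paper) evaluates them on a small contingency matrix A, where A[i][j] counts the elements of cluster i that belong to standard partition j. The entropy and purity computations follow the two formulas above; since the paper does not write out the V-measure itself, the sketch uses the homogeneity/completeness form of Rosenberg and Hirschberg [14] with beta = 1, which is an assumption about the intended variant.

```python
import math

# Minimal sketch (our own illustration) of the clustering measures above.
# A[i][j] = number of elements of cluster i that belong to standard partition j.

def entropy_purity(A):
    N = sum(sum(row) for row in A)
    C = len(A[0])                               # number of standard partitions
    entropy, purity = 0.0, 0.0
    for row in A:
        k_i = sum(row)
        if k_i == 0:
            continue
        h = -sum((a / k_i) * math.log(a / k_i) for a in row if a > 0)
        entropy += (k_i / N) * h / math.log(C)  # normalized by log C, as in the formula
        purity += max(row) / N
    return entropy, purity

def v_measure(A):
    # Homogeneity/completeness form of the V-measure [14], beta = 1 (assumption).
    N = sum(sum(row) for row in A)
    K, C = len(A), len(A[0])
    col = [sum(A[i][j] for i in range(K)) for j in range(C)]   # partition sizes
    row = [sum(A[i]) for i in range(K)]                        # cluster sizes
    H_C = -sum((c / N) * math.log(c / N) for c in col if c > 0)
    H_K = -sum((r / N) * math.log(r / N) for r in row if r > 0)
    H_C_K = -sum((A[i][j] / N) * math.log(A[i][j] / row[i])
                 for i in range(K) for j in range(C) if A[i][j] > 0)
    H_K_C = -sum((A[i][j] / N) * math.log(A[i][j] / col[j])
                 for i in range(K) for j in range(C) if A[i][j] > 0)
    h = 1.0 if H_C == 0 else 1 - H_C_K / H_C    # homogeneity
    c = 1.0 if H_K == 0 else 1 - H_K_C / H_K    # completeness
    return 0.0 if h + c == 0 else 2 * h * c / (h + c)

if __name__ == "__main__":
    A = [[5, 1, 0],     # hypothetical 3-cluster / 3-partition contingency table
         [1, 4, 1],
         [0, 0, 6]]
    print(entropy_purity(A))
    print(v_measure(A))
```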
3 Shortcomings of conventional quantitative metrics in evaluation

The standard quantitative metrics fail to consider certain crucial factors that are vital for a comprehensive quantitative evaluation of content-based image retrieval systems. Firstly, they overlook the significance of a high density of relevant results, where relevant images are clustered together within a small or large area of the retrieved window. This characteristic is not adequately captured by existing evaluation metrics, which can be attributed to their origins as general information retrieval (IR) measures.

Secondly, the discriminating power of a quantitative evaluation metric is often overlooked. This raises an important question: if two retrieval results have the same precision value, does it imply that they are similar? In other words, can we evaluate their corresponding systems as identical?

Considering these points from our perspective, it becomes evident that the existing evaluation metrics might not fully address the nuances and complexities of content-based image retrieval. There is a need for more refined metrics that take into account factors such as the clustering of relevant results and the ability to differentiate between retrieval outcomes with similar precision values. By developing and incorporating such metrics, we can improve the accuracy and effectiveness of quantitative evaluations in content-based image retrieval systems.

Table 1 shows how well the properties cited above are respected by the different measures used in this study. In the following subsections, we discuss in detail these points, which must be verified by our proposed CBIR evaluation measure.

Table 1: A summary of the measures used in this study with respect to the properties of number, ranking and grouping of relevant results.

Measure            | Results number | Ranking | Grouping
Precision (P)      | High           | No      | No
MAP                | Medium         | Medium  | No
R-precision R(P)   | No             | High    | No
RBP                | Medium         | Medium  | Medium

3.1 Relevant results density

Verification of the pertinent results in the case of image retrieval is quite different from that of general information retrieval, from which the common evaluation measures are inspired. In the case of textual search results, the verification of the pertinent results in the returned list must be done in a sequential manner, from the first result to the last one [22]. However, the visual verification of pertinent images is by nature very fast, and is guided by the location and the grouping of the relevant images. An evaluation process starts with an inherent transformation of the returned results into a binary list containing relevant and irrelevant items.

Figure 1 displays some user query results (sad and happy emojis) [22]. Even though the first result list is more precise (27.77%) than the second one (precision = 22.22%), the presentation of the returns in the first list is difficult for the user to evaluate and verify. However, when the relevant findings are gathered together, even with a lower precision rate, the results are considerably better for user evaluation. Additionally, it should be noted that, in contrast to the first list, the relevant images in the second list are located near the bottom of the 2D list. The results in this example are binary (either a sad or a happy emoji); in real situations, however, the scenario is far more complex and has more than two potential outcomes.

Figure 1: Example of two returned lists: dispersed results and grouped results.

Another problem is what we call the situation search: asking a system to return only one image from a database containing many relevant images almost always yields full precision. In the next two subsections we study the effect of the relevant images on the evaluation process when two systems have the same precision rate.

3.2 Comparing results having the same full precision

An ideal CBIR system provides perfect image retrieval results, in which each image query returns a list of relevant results with no prior knowledge about its size; the size varies from one input image to another. Therefore, the returned list equals the relevant list of the given query in the database. Let P(R, N) be the precision of a retrieval result, where R represents the number of relevant images and N represents the size of the returned list. We distinguish two evaluation cases regarding the number of relevant images contained in the database:

– The effectiveness of a system when the database contains few relevant images. In this situation of full precision (P = 1), the merit of achieving full precision increases as R decreases, and a minor error in retrieving R does not affect the precision of all systems equally.
Hence, we define the relevant error R_ERR as the minimum precision mistake made by a given system when R is reduced by one:

R\_ERR = \frac{R - 1}{N}

– The effectiveness of a system when the database contains many relevant images. When there is a large number of relevant images in the database, it is much more challenging to achieve full precision with a large returned list than with a small one. Hence, we define the retrieval error N_ERR as the maximum non-zero precision mistake made by a system:

N\_ERR = \frac{1}{N}

The best retrieval situation is a system that returns one and only one exact image. Such a system takes the highest risk, in which precision has a binary value (0 or 1). The next best system is one in which the size of the final returned list is larger; in that case, the risk of obtaining no relevant image is higher than in the case of a smaller returned list. We can define a precision error P_ERR as follows:

P\_ERR = \min(R\_ERR, N\_ERR)

P_ERR represents the minimum of the two errors, as shown in Figure 2, in which the intersection of the minimum errors is depicted.

3.3 Comparing results having the same precision (P < 1)

Two results having the same precision value P_i(R_i, N_i) = P_j(R_j, N_j) can be evaluated differently when R_i ≠ R_j. System i is better than system j when R_i < R_j, because the minimum size of the returned list needed by system i to return the same number of relevant images R_j is N_i + R_j - R_i < N_j. In that case, the new precision becomes:

P'_i = \frac{R_j}{N_i + R_j - R_i} > P_j

4 An entropy-based development for visual retrieval systems

It is important and useful for a user to see the images that they need arranged together in the same part of the returned list. The main idea of the proposed measure is to evaluate retrieval systems according to the degree to which the relevant images are grouped, which is very practical from a visual retrieval perspective.

The proposed Entropy Grouped Relevant images (EGR) measure is initially presented as follows:

EGR = -\sum_{i=1}^{|K|} \sum_{j=1}^{|C|} \frac{A_{ij}}{N \times c_j} \log \frac{A_{ij}}{N \times c_j}

where N represents the number of returned images, R is the number of relevant images in the returned list, K and C are the sets of the detected clusters and the standard partitions respectively, A_{ij} is the number of elements that are members of cluster i and of partition j of the same class, and c_j is the size of partition j. Figure 3 shows an example of returned results together with its clusters (a) and partitions (b).

Figure 3: Example of two returned results and their corresponding clusters and partitions: (a) detected clusters (56 clusters); (b) standard partitions (53 partitions).

5 Experiments

In order to evaluate the proposed measure, we compare it with other precision-based measures, including standard precision P, MAP, R-precision and RBP [23]. The comparison process is built around two tests: a comparison based on a fixed size of the returned list, and a comparison based on different sizes.

5.1 Comparison based on a fixed size of the returned list

In this stage, we conducted a comparative study in a situation where the system returns 12 images as a linear returned list. For example, when R = 10, the possible results in a linear list are: (10), (9,1), (8,2), (7,3), (8,1,1), (6,4), (5,5), (7,2,1), (6,3,1), (6,2,2), (5,3,2), (4,4,2), (4,3,3). They all have the same precision value P = 83.3%. The EGR measure evaluates these results differently according to the spatial density of the arrangement of the relevant items.
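Before turning to the rankings, the following minimal sketch shows one possible reading of the EGR formula from Section 4 for such linear lists. It is our own interpretation, not the authors' code: we assume that the standard partition is the single class of relevant images (so c_j = R), that each maximal run of consecutive relevant images in the returned list is one detected cluster, and that a base-10 logarithm is used. Under these assumptions the sketch reproduces the EGR values reported for the full-precision (R = N) retrievals in Table 4; the exact cluster/partition construction used for the mixed cases in Tables 2 and 3 is not fully specified in the paper.

```python
import math

# Minimal sketch (one reading of the EGR formula, not the authors' code).
# Assumptions: a single standard partition of size R (c_j = R); each maximal
# run of consecutive relevant images is a detected cluster; base-10 logarithm.

def run_lengths(relevance):
    """Lengths of maximal runs of consecutive relevant items (the clusters)."""
    runs, current = [], 0
    for rel in relevance:
        if rel:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def egr(relevance):
    """EGR = -sum_ij A_ij/(N*c_j) * log10(A_ij/(N*c_j)) under the assumptions above."""
    n = len(relevance)                      # N: size of the returned list
    r = sum(relevance)                      # R: relevant images returned (= c_j here)
    score = 0.0
    for a in run_lengths(relevance):        # A_ij: size of cluster i inside the partition
        p = a / (n * r)
        score -= p * math.log10(p)
    return score

if __name__ == "__main__":
    print(egr([1] * 19))                    # R = N = 19 -> ~0.067, as in Table 4
    print(egr([1] * 10 + [0, 0]))           # result (10) with N = 12
    print(egr([1] * 9 + [0, 1, 0]))         # result (9,1) with N = 12
```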
5.1.1 Best ranks on a fixed returned list

As can be seen in Table 2, the results are ordered and ranked according to the different measures used in this comparative study. The top five results are well ranked by the EGR and RBP measures; the five best results are highlighted in bold. Their ranks correspond well to the user ranking and to the real position of these results. The other measures place them only within the first three ranks. The discriminating power of the proposed measure and of the RBP measure appears in the ranking of the five best results over the five best ranks, whereas precision (P), for example, places 52 results in the five best ranks (which corresponds to 55% of all results).

Table 2: Some selected results at the five best ranks according to five evaluation metrics, for a returned list of size N = 12.

Rank | P                 | MAP           | R(P)           | RBP    | EGR
1st  | 12                | 12 - 11 - 10  | 12 - 11 - 10   | 12     | 12
2nd  | 11 - (10,1)       | (10,1)        | (10,1)         | 11     | 11
3rd  | 10 - (9,1)        | (9,1)         | (9,1) - (8,2)  | (10,1) | (10,1)
4th  | 9 - (8,1) - (7,2) | (8,1)         | (8,1) - (7,2)  | 10     | 10
5th  | 8 - (7,1) - (6,2) | (7,1)         | (7,1) - (6,2)  | (9,1)  | (9,1)

5.1.2 Worst ranks on a fixed returned list

The superiority of the proposed entropy-based measure over the other measures in interpreting the worst-results ranking appears in Table 3. The verified worst results appear individually at each rank for the EGR measure, whereas they appear together with other results in the case of the precision measure (P). The other measures (i.e., MAP, R-precision and RBP) cannot place these results in the worst positions.

Table 3: Some chosen results at the five last ranks according to five evaluation metrics, for a returned list of size N = 12.

Rank | P                            | MAP             | R(P)                   | RBP     | EGR
91st | 5 - (4,1) - (3,2) - (3,1,1)  | (1,1,1)         | (1,1,1) - (2,1,1,1,1)  | 5       | (2,1)
92nd | 4 - (3,1) - (2,2) - (2,1,1)  | (2,1,1,1,1,1)   | (2,1,1,1,1,1)          | (1,2,1) | (1,1,1)
93rd | 3 - (2,1) - (1,1,1)          | (1,1,1,1)       | (1,1,1,1)              | (2,1)   | 2
94th | 2 - (1,1)                    | (1,1,1,1,1)     | (1,1,1,1,1)            | (1,1)   | (1,1)
95th | 1                            | (1,1,1,1,1,1)   | (1,1,1,1,1,1)          | 1       | 1

Figure 2: P_ERR according to the minimum error rates; P_ERR is the same as N_ERR except when N = 1.

5.2 Comparison based on different sizes of the returned list

The first comparison is built around the effectiveness of the proposed measure in evaluating systems having returned lists of different sizes whose images are all relevant. As can be seen from Figure 4, the best system is the one where R = N = 1, i.e., target search. The next best systems are ordered according to the largest size of their returned lists. Such results correspond well to P_ERR, depicted in Figure 2. Table 4 summarizes the five evaluation measures when R = N.

Table 4: The best full-precision results arranged by the EGR measure.

Retrievals | EGR   | P (%) | MAP (%) | R(P) (%)
1, 1       | 0     | 100   | 100     | 100
19, 19     | 0.067 | 100   | 100     | 100
18, 18     | 0.069 | 100   | 100     | 100
15, 15     | 0.078 | 100   | 100     | 100
11, 11     | 0.094 | 100   | 100     | 100
5, 5       | 0.139 | 100   | 100     | 100

5.2.1 Some returned images are relevant (R ≤ N)

The first remark from Table 4 is that when the results are ordered according to their EGR values, they correspond better to the human ordering than with the other measures. This is an attempt to compare results even though they have different natures (different sizes and different numbers of relevant images returned). EGR values are very close when their corresponding results are perceptually very close; conversely, they are very different when their corresponding results are clearly distinct.

6 Conclusions

We have proposed a new evaluation measure to assess image retrieval systems. The proposed metric is compatible with and conforms to human visual evaluation.
In addition to the number and the rank of the relevant images in the returned list, the proposed measure can capture and reward the presence of relevant images within a close area of the returned list. Based on the entropy of the grouping of pertinent images, the proposed measure presents a high discriminating power in several retrieval cases that the existing measures evaluate as equivalent. This allows us to use the proposed CBIR evaluator as a scale rather than as an evaluation metric. Further investigations and experiments should be conducted, encompassing diverse situations and scenarios, to establish the proposed metric as a robust and reliable performance measure in the field of image retrieval. Additionally, its applicability in other domains, such as image quality assessment and data clustering, should also be explored.

References

[1] S. Selvakanmani, Ashreetha B, G. Naga Rama Devi, Shubhrojit Misra, Jayavadivel R and Suresh Babu Perli (2022) Deep learning approach to solve image retrieval issues associated with IOT sensors, Measurement: Sensors, vol. 24, 100458, ISSN 2665-9174. https://doi.org/10.1016/j.measen.2022.100458

[2] K. Wangi and A. Makandar (2023) Autoencoder for Image Retrieval System using Deep Learning Technique with Tensorflow and Kears, 2023 IEEE International Conference on Integrated Circuits and Communication Systems (ICICACS), Raichur, India, pp. 1–5. https://doi.org/10.1109/ICICACS57338.2023.10099675

[3] Yong Rui and Thomas S. Huang (1999) Image retrieval: Current techniques, promising directions and open issues, Journal of Visual Communication and Image Representation, vol. 10, pp. 39–62. https://doi.org/10.1006/jvci.1999.0413

[4] Smeulders, Arnold W. M., Worring, Marcel, Santini, Simone, Gupta, Amarnath and Jain, Ramesh (2000) Content-Based Image Retrieval at the End of the Early Years, IEEE Trans. Pattern Anal. Mach. Intell., vol. 22(12), pp. 1349–1380. https://doi.org/10.1109/34.895972

[5] Datta, Ritendra, Joshi, Dhiraj, Li, Jia and Wang, James Z. (2008) Image Retrieval: Ideas, Influences, and Trends of the New Age, ACM Comput. Surv., vol. 40(2), pp. 5:1–5:60. https://doi.org/10.1145/1348246.1348248

[6] Text REtrieval Conference (TREC), http://trec.nist.gov/

[7] Wu, Shengli and McClean, Sally (2006) Information Retrieval Evaluation with Relevance Judgment, Flexible and Efficient Information Handling, Springer Berlin Heidelberg, pp. 86–93. https://doi.org/10.1007/11788911-7

[8] Müller, Henning, Müller, Wolfgang, Squire, David McG., Marchand-Maillet, Stephane and Pun, Thierry (2001) Performance Evaluation in Content-based Image Retrieval: Overview and Proposals, Pattern Recognition Letters, vol. 22(5), pp. 593–601. https://doi.org/10.1016/s0167-8655(00)00118-5

[9] Cooper, William S. (1973) On selecting a measure of retrieval effectiveness, Journal of the American Society for Information Science, vol. 24(2), pp. 87–100. https://doi.org/10.1002/asi.4630240204
[10] Lewis, David D. (1995) Evaluating and Optimizing Autonomous Text Classification Systems, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, Seattle, Washington, USA, pp. 246–254. https://doi.org/10.1145/215206.215366

[11] Buckley, Chris and Voorhees, Ellen M. (2000) Evaluating Evaluation Measure Stability, Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, Athens, Greece, pp. 33–40. https://doi.org/10.1145/345508.345543

[12] Lewis, David D. (1995) The TREC-4 Filtering Track, TREC, National Institute of Standards and Technology (NIST), Special Publication 500-236.

[13] Zhao, Ying (2005) Criterion Functions for Document Clustering, PhD thesis, University of Minnesota, USA.

[14] Rosenberg, Andrew and Hirschberg, Julia (2007) V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 410–420.

[15] Salton, G. (1971) The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall, Inc., Upper Saddle River, NJ, USA. https://doi.org/10.1109/tpc.1972.6591971

[16] van Rijsbergen, C. J. (1979) Information Retrieval, Butterworth. https://doi.org/10.1002/asi.4630300621

[17] MPEG-7 (2000) Subjective evaluation of the MPEG-7 retrieval accuracy measure (ANMRR), ISO/WG11, Doc. M6029.

[18] Chatzichristofis, Savvas A., Iakovidou, Chryssanthi, Boutalis, Yiannis S. and Angelopoulou, Elli (2014) Mean Normalized Retrieval Order (MNRO): a new content-based image retrieval performance measure, Multimedia Tools and Applications, vol. 70(3), pp. 1767–1798. https://doi.org/10.1007/s11042-012-1192-z

[19] Kosch, Harald and Maier, Paul (2010) Content-Based Image Retrieval Systems - Reviewing and Benchmarking, JDIM, vol. 8(1), pp. 54–64.

[20] Müller, Henning, Geissbuhler, Antoine, Marchand-Maillet, Stephane and Clough, Paul (2004) Benchmarking image retrieval applications, Workshop on Visual Information Systems, pp. 334–337.

[21] Martey, Ezekiel Mensah, Lei, Hang, Li, Xiaoyu and Appiah, Obed (2021) Effective Image Representation Using Double Colour Histograms For Content-Based Image Retrieval, Informatica, vol. 45(7).

[22] Gherbi, T., Zeggari, A. and Ahmed Seghir, Z. (2023) A global precision view for information retrieval evaluation adapted to image retrieval systems, ICAECE'2023, Tebessa, AIJR Publisher.

[23] Moffat, A. and Zobel, J. (2008) Rank-biased precision for measurement of retrieval effectiveness, ACM Transactions on Information Systems, vol. 27, pp. 2:1–2:27. https://doi.org/10.1145/1416950.1416952