Image Anal Stereol 2020;39:161-167
doi: 10.5566/ias.2346
Original Research Paper

RIM-ONE DL: A UNIFIED RETINAL IMAGE DATABASE FOR ASSESSING GLAUCOMA USING DEEP LEARNING

FRANCISCO FUMERO1, TINGUARO DIAZ-ALEMAN2, JOSE SIGUT1, SILVIA ALAYON1, RAFAEL ARNAY1 AND DENISSE ANGEL-PEREIRA2

1Department of Computer Engineering and Systems, University of La Laguna, Campus de Anchieta, 38200, Tenerife, Spain
2Servicio de Oftalmología, Hospital Universitario de Canarias, 38320, Tenerife, Spain

e-mail: ffumerob@ull.edu.es, vtdac@hotmail.com, jfsigut@ull.edu.es, salayon@ull.edu.es, rarnay@ull.edu.es, dangelp30@hotmail.com

(Received February 11, 2020; revised October 5, 2020; accepted October 5, 2020)

ABSTRACT

The first version of the Retinal IMage database for Optic Nerve Evaluation (RIM-ONE) was published in 2011. It was followed by two further versions, making it one of the most cited public retinography databases for evaluating glaucoma. Although it was initially intended as a database of reference images for segmenting the optic disc, in recent years we have observed that its use has shifted toward training and testing deep learning models. The recent REFUGE challenge laid out criteria that a set of images of these characteristics must satisfy to serve as a standard reference for validating deep learning methods that rely on such data. This, combined with some confusion and even improper use observed in certain cases of the three published versions, led us to revise and combine them into a new, publicly available version called RIM-ONE DL (RIM-ONE for Deep Learning). This paper describes this set of images, consisting of 313 retinographies from normal subjects and 172 retinographies from patients with glaucoma. All of these images have been assessed by two experts and include a manual segmentation of the disc and cup. The paper also describes an evaluation benchmark using several well-known convolutional neural network models.

Keywords: Convolutional Neural Networks, Deep Learning, Glaucoma Assessment, RIM-ONE.

INTRODUCTION

The term glaucoma refers to a group of pathologies that affect the optic nerve and involve the loss of retinal ganglion cells, frequently associated with an increase in intraocular pressure. It is one of the leading causes of blindness in the world (Tham et al., 2014) and, since the progression of the disease is typically asymptomatic, early detection is difficult. Different imaging techniques for viewing the retina help in making this diagnosis. One of them is retinography, a color photograph of the fundus of the eye. Fig. 1 shows an example of a retinography in which the regions most relevant for diagnosing glaucoma have been highlighted: the optic disc, the cup and the neuroretinal rim. In fact, since this is the most important part of the retinography, it is common to crop around it and discard the rest.

As in other fields, automated learning-based diagnosis, and more specifically the technique known as deep learning, has taken on great significance. The key to the proper functioning of these methods is the availability of a sufficient amount of data with which to train and test the system. In addition, validating these methods requires a reference standard that can be used for comparison. In the case at hand, this involves having public retinography databases that satisfy a series of clearly defined requirements.
In the very recent paper on the Retinal Fundus Glaucoma Challenge (REFUGE) (Orlando et al., 2020), written by researchers from 20 institutions, a very important step is taken in this direction by proposing criteria against which these methods can be compared, in terms of both classifying glaucoma and segmenting the disc and cup:

1. Availability of publicly accessible sets of images, labelled by several experts and sufficient in number to be used in deep learning methods.

2. Clear separation between training and test sets. As noted in (Trucco et al., 2013), a comparison of results may be unreliable without such a separation.

3. Diversity in the set of images, meaning images captured by various devices, involving different patient ethnicities, and taken under different lighting, contrast, noise and other conditions.

4. In addition to a preliminary diagnosis, inclusion of manual reference segmentations of the disc and cup.

5. Provision of an evaluation framework that includes the metrics used and the format for presenting the results.

The first open version of the set of fundus images called Retinal IMage database for Optic Nerve Evaluation (RIM-ONE) was published in 2011 (Fumero et al., 2011). Two other versions followed, in 2014 and 2015. They will be referred to in this paper as RIM-ONE v1, v2 and v3, respectively. Since its publication, it has been cited 179 times, becoming, according to our data, the most cited public set of retinographies for evaluating glaucoma. Although it was initially intended as a database of reference images for segmenting the optic disc, in recent years we have observed that its use has shifted toward training and testing deep learning models. This, combined with some confusion and even improper use observed in certain cases of the three published versions, led us to revise and combine them into a new version called RIM-ONE DL (RIM-ONE for Deep Learning), optimized for a deep learning context in keeping with the specifications explained earlier. Furthermore, for benchmarking purposes, we studied the performance of several very popular convolutional neural network models, widely used for the classification and semantic segmentation of natural images. As a result, we hope to lay the foundations for RIM-ONE DL to become a reference for evaluating glaucoma, as the previous versions were.

Fig. 1: Sample retinography with the most relevant regions for diagnosing glaucoma.

The rest of the paper is structured as follows. The section on “Related Work” describes other sets of public images for evaluating glaucoma. The “RIM-ONE database” section presents the different versions of this imaging database and analyzes the feedback received over its nearly ten years of use. The “Materials and Methods” section describes RIM-ONE DL, underscoring the changes made with respect to previous versions and explaining the neural network models and the metrics used in the benchmark. We conclude with the “Results and Discussion” section.

RELATED WORK

There are not many public sets of fundus images for evaluating glaucoma, whether for classification or for segmenting the optic disc and cup. ORIGA (Zhang et al., 2010) is composed of 168 images from patients with glaucoma and 482 from healthy subjects. The data also include the disc and cup segmentation.
The problem with this database is that, even though it appears to have been public at one point, as far as we can tell it stopped being public quite some time ago. DRISHTI-GS (Sivaswamy et al., 2014) consists of 70 glaucoma images and 31 normal images. It also includes the disc and cup segmentation. DR HAGIS (Holm et al., 2017), HRF (Odstrcilik et al., 2013) and LES-AV (Orlando et al., 2018) are small sets with 39, 45 and 22 images, respectively, with no disc and cup segmentation. The ACRIMA set (Diaz-Pinto et al., 2019) contains 396 images from patients with glaucoma and 309 images from healthy subjects; it does not include segmentation of the disc or cup either. Finally, the recent REFUGE database (Orlando et al., 2020) contains 120 images from patients with glaucoma and 1080 images from healthy subjects, with disc and cup segmentation. It is important to note that, of all the sets mentioned, only REFUGE satisfies the additional requirements of offering images from different cameras, as well as a clear division of the training and test data. There seems to be a clear need, then, to expand the number of public image databases that comply with these requirements.

THE RIM-ONE DATABASE

The images in the three versions of RIM-ONE include healthy and glaucomatous eyes, and were taken in various Spanish hospitals: Hospital Universitario de Canarias (HUC), in Tenerife, Hospital Universitario Miguel Servet (HUMS), in Zaragoza, and Hospital Clínico Universitario San Carlos (HCSC), in Madrid. The repository of retinographies is accessible through the website of the research group at https://medimrg.webs.ull.es.

CRITICAL REVISION OF THE THREE VERSIONS OF RIM-ONE

RIM-ONE v1 was presented in 2011 at the 24th International Symposium on Computer-Based Medical Systems (CBMS) (Fumero et al., 2011). The main goal of this work was to provide an open database of retinographies from 118 healthy subjects and 51 patients with various stages of glaucoma. In addition to a diagnosis, it included a manual segmentation of the optic disc, carried out by five experts in the field. This was the main goal of its publication: to provide a reference for assessing methods to segment the disc. The images were taken at the three hospitals mentioned above using a Nidek AFC-210 non-mydriatic fundus camera with a 21.1-megapixel Canon EOS 5D Mark II body, with a vertical and horizontal field of view of 45°.

RIM-ONE v2 was published in 2014. It contains 255 images from healthy subjects and 200 images from patients with glaucoma, manually segmented by a medical specialist. In this case, the images were taken at the HUC and the Hospital Universitario Miguel Servet using the same camera as in v1. It is important to note that this version was designed as an extension of the first; as a result, some images are duplicated. It also includes some test-retest cases that give rise to images that are practically identical.

RIM-ONE v3 was published in 2015 and contains 85 images of healthy subjects and 74 images of patients with glaucoma. The main difference between this version and the two previous ones is that the images were captured only at the HUC, with a non-mydriatic Kowa WX 3D stereo fundus camera. The images are centered on the ONH using a field angle of 34°, giving a final stereo image with a horizontal field of view of 20° and a vertical field of view of 27°, with a total resolution of 2144 x 1424 pixels (1072 x 1424 pixels per image in the stereo pair).
Having stereo images available made it possible to manually segment not only the optic disc but also the cup. Two specialists carried out this task with the help of the freely distributed DCSeg tool. More details on this version and on the tool are available in (Fumero et al., 2015). It should be noted that some of the subjects whose retinographies are contained in v2 are also present in v3.

As stated in the introduction, a critical review of these three versions is necessary in order to determine how well they comply with the criteria for using them in methods based on deep learning:

1. Although it may be tempting to combine the three versions for use in deep learning problems, as indicated earlier, indiscriminately combining the images could result in their inappropriate use. It should also be noted that there is no consistency between the three versions in terms of the labelling of the images, as this was not always done by the same experts.

2. Since RIM-ONE was not originally designed for deep learning, a clear division between training and test images was never established.

3. The RIM-ONE images were taken in different hospitals with different cameras, but only one camera was used in each version.

4. In this area there is also little consistency between versions, since in the first two only the disc is segmented, while the cup is also segmented in v3. As with the diagnosis, the specialists involved in the manual segmentation were not the same in every case.

5. Although version 1 does propose a criterion for evaluating the quality of the disc segmentation, no details are given on the metrics for evaluating the diagnosis, since that was not the initial goal.

The “Materials and Methods” section details the process used to resolve these problems in RIM-ONE DL.

FEEDBACK ON THE EXPERIENCE WITH RIM-ONE

Following a procedure similar to that presented in (Decencière et al., 2014) for the well-known Messidor database for segmenting the optic disc in diabetic retinography images, in this section we focus on the feedback received over the nearly 10 years that RIM-ONE has been publicly available. To provide a quantitative idea of the impact of its publication, we used as a reference the total number of citations, which is 179. Moreover, Fig. 2 shows the recorded trend in terms of the primary purpose for which RIM-ONE has been used. In its early years, the large majority of uses were centered on segmentation tasks. However, the last two years have seen a very significant increase in the number of publications in which its use was associated with deep learning problems, a use that in 2019 even outpaced the number of works involving segmentation. This reinforces the need for a revised and updated version of RIM-ONE that can satisfy this new trend.

Fig. 2: Number of citations of RIM-ONE. Blue: RIM-ONE for segmentation; red: RIM-ONE for deep learning.

MATERIALS AND METHODS

RIM-ONE DL

In this section, we describe RIM-ONE DL, which results from combining the three previous versions. This new version does away with the duplicate images contained in v1 and v2, and it also eliminates the test-retest images in v2. The images of the same patient in v2 and v3 were also deleted, with only the left image from v3 being retained. The result is a single image per patient and eye.
Moreover, all the images were cropped squarely around the head of the optic nerve using the same proportionality criterion, something that was not done in the previous versions. Table 1 shows the minimum and maximum size of these cropped images per version, along with the hospital where the images were taken. Furthermore, the format of all the images in this new version is PNG, and the file names are prefixed with “r1”, “r2” or “r3” according to the RIM-ONE version they were extracted from.

Table 1: Minimum and maximum size (in pixels) of the cropped images of RIM-ONE DL per version, indicating the hospital where the images were taken.

Version  Hospital     Min. Size  Max. Size
v1       HCSC, HUMS   316        708
v2       HUC, HUMS    274        793
v3       HUC          318        626

As concerns its use in deep learning problems, and as discussed in previous sections, we note the following:

1. The final set of images consists of 313 images from healthy subjects and 172 images from patients with glaucoma. In order to standardize the experts' criterion for classifying glaucoma, two experts reviewed all the images again and re-labelled them after a visual inspection. In the event of a disagreement between them, a third specialist with 20 years of experience was consulted, who made the final decision.

2. A clear division is established between the training and test sets, with two variants. In one, the test set is built randomly, while in the other, the samples taken at the HUC are used for training and the samples taken at the two other hospitals (in Madrid and Zaragoza) are used for testing.

3. This combined version exhibits great diversity in terms of cameras and hospitals.

4. In addition to the ground truth for classification, the set includes the manual segmentation of the disc and cup performed by one of the specialists.

5. The sub-section below details the evaluation framework for classifying the glaucoma disease.

Fig. 3 shows some examples of the images contained in RIM-ONE DL, indicating the hospital they were taken in. This database is publicly available at the following location: https://github.com/miag-ull/rim-one-dl

Fig. 3: Examples of images included in RIM-ONE DL, indicating in which hospital they were taken. (a) Image of a healthy subject taken at Hospital Universitario de Canarias. (b) Glaucoma image taken at Hospital Universitario Miguel Servet. (c) Image of a healthy subject taken at Hospital Clínico Universitario San Carlos.

EVALUATION FRAMEWORK

The proposed evaluation framework contains four main elements: the definition of the training and test sets, the neural network models used, the training and testing strategy employed, and the metrics considered in the evaluation.

As concerns the training and test sets, as noted in the preceding sub-section, two variants are considered. In the first variant, the set of images was divided at random into training and testing images using a 70:30 ratio, respectively. In the second variant, the images taken at the HUC were used for training (195 normal and 116 glaucoma), and the images taken at the two other hospitals were used for testing (118 normal and 56 glaucoma). The only processing applied to the images involved re-scaling their intensity to the 0-1 range and resizing them to 224x224x3.

In terms of the neural network models used, most of the architectures contained in the Keras Deep Learning Framework were tested: Xception, VGG16, VGG19, ResNet50, InceptionV3, InceptionResNetV2, MobileNet, DenseNet121, NASNetMobile and MobileNetV2. In every case, the size of the input layer was set to 224x224x3, and a GlobalAveragePooling2D layer was added to the convolutional base, followed by a fully-connected output layer with two outputs, using SoftMax to distinguish between the Normal and Glaucoma classes.
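For illustration, the following is a minimal sketch of this preprocessing and model setup, assuming a TensorFlow/Keras environment; the function and variable names are ours and not part of the released database or the original experiments.

```python
# Minimal sketch (not the authors' released code) of the setup described
# above: images rescaled to the 0-1 intensity range and resized to
# 224x224x3, then fed to an ImageNet-pretrained Keras backbone topped with
# global average pooling and a two-way softmax layer.
import tensorflow as tf
from tensorflow.keras import layers, models

INPUT_SHAPE = (224, 224, 3)
CLASS_NAMES = ("normal", "glaucoma")

def preprocess(image):
    # Convert to float32 (mapping uint8 intensities to 0-1) and resize.
    image = tf.image.convert_image_dtype(image, tf.float32)
    return tf.image.resize(image, INPUT_SHAPE[:2])

def build_classifier(backbone_name="VGG19"):
    # Any of the tested Keras applications (Xception, VGG16, ResNet50, ...)
    # can be selected by name here.
    backbone_cls = getattr(tf.keras.applications, backbone_name)
    base = backbone_cls(weights="imagenet", include_top=False,
                        input_shape=INPUT_SHAPE)
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(len(CLASS_NAMES), activation="softmax"),
    ])
    return model, base

model, base = build_classifier("VGG19")
```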
The training strategy was the same for both variants of the data sets. We started with the networks pre-trained on ImageNet, using the weight values provided by Keras, and fine-tuned all the layers; Diaz-Pinto et al. (2019) found that this approach yielded the best results. To avoid overfitting, data augmentation consisting of random rotations (-30°, 30°), vertical and horizontal flips, and zooms (0.8, 1.2) was applied. The steps carried out to fine-tune the networks were as follows:

1. Freeze the convolutional base network.

2. Train the part that was added.

3. Unfreeze all the layers in the base network.

4. Jointly train all the layers in the network.

To increase the reliability of the experiments, a 5-fold cross-validation was applied, with a proportion of 80% for the training set and 20% for the validation set in each fold. For each fold, the steps listed were followed in order to determine the most suitable number of epochs for steps 2 and 4. For the training in step 2, a batch size of 32 was used, along with an RMSprop optimizer with a learning rate of 2e-5 and categorical cross-entropy as the loss function. For the training in step 4, the learning rate was set to 1e-5. Once the validation phase was complete, the final model for each network was trained on the whole training data (no folds), following the same four steps as before for the number of epochs that maximized the average validation accuracy across folds in steps 2 and 4. This final model was used to evaluate each network on the test set.
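The following minimal sketch illustrates this two-stage fine-tuning with the augmentation described above, reusing `model` and `base` from the previous sketch; the epoch counts and the training directory name are placeholders, since the actual number of epochs was selected per network by cross-validation.

```python
# Hedged sketch of the two-stage fine-tuning described above (not the
# authors' code). `model` and `base` come from the previous sketch.
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.image import ImageDataGenerator

EPOCHS_HEAD, EPOCHS_ALL = 20, 20  # placeholders; chosen via 5-fold CV in the paper

# Augmentation as described: rotations in (-30, 30), flips, zoom (0.8, 1.2);
# intensities are also rescaled to the 0-1 range here.
datagen = ImageDataGenerator(rotation_range=30,
                             horizontal_flip=True,
                             vertical_flip=True,
                             zoom_range=(0.8, 1.2),
                             rescale=1.0 / 255)

# "train/" is a hypothetical directory with one sub-folder per class.
train_flow = datagen.flow_from_directory("train/", target_size=(224, 224),
                                         batch_size=32,
                                         class_mode="categorical")

# Steps 1-2: freeze the convolutional base and train only the added head.
base.trainable = False
model.compile(optimizer=RMSprop(learning_rate=2e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_flow, epochs=EPOCHS_HEAD)

# Steps 3-4: unfreeze everything and jointly train at a lower learning rate.
base.trainable = True
model.compile(optimizer=RMSprop(learning_rate=1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_flow, epochs=EPOCHS_ALL)
```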
As concerns the metrics used, we follow the outline proposed in (Orlando et al., 2020), in which the area under the ROC curve (AUC) is the reference evaluation measure. This measure was complemented with the sensitivity value (Se = TP/(TP+FN)) at a specificity (Sp = TN/(TN+FP)) of 0.85, where TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives, respectively. This allows the performance of the various networks to be assessed when a low rate of false positives is imposed. The third measure included is accuracy, which is fairly standard in this type of problem, although it is well known that it can be biased on data sets whose classes are not properly balanced.
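These metrics can be computed, for instance, with scikit-learn as in the sketch below, where `y_true` holds the binary ground-truth labels and `y_score` the predicted probability of the Glaucoma class; both names, and the dummy data, are illustrative.

```python
# Hedged sketch of the evaluation metrics described above, using scikit-learn.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

def evaluate(y_true, y_score, target_specificity=0.85):
    auc = roc_auc_score(y_true, y_score)
    # Sensitivity at the operating points where specificity >= 0.85:
    # specificity = 1 - FPR, so keep ROC points with FPR <= 0.15 and take
    # the best sensitivity (TPR) among them.
    fpr, tpr, _ = roc_curve(y_true, y_score)
    mask = (1.0 - fpr) >= target_specificity
    se_at_sp = tpr[mask].max() if mask.any() else 0.0
    acc = accuracy_score(y_true, y_score >= 0.5)
    return auc, se_at_sp, acc

# Usage with dummy data:
y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.9, 0.3, 0.2])
print(evaluate(y_true, y_score))
```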
RESULTS AND DISCUSSION

Tables 2 and 3 and Figs. 4 and 5 show the results of the glaucoma classifications obtained under the conditions described in the preceding section. In the case of the random test sample, the results are highly satisfactory. The VGG19 network model not only provided the highest AUC, but its sensitivity also equalled 1, the highest possible value. The other network model with similar characteristics, VGG16, also yielded good results. Although a direct comparison with the results of the REFUGE challenge is not possible, it is interesting to note that the winning team (Son et al., 2018) attained an AUC of 0.9885 with a sensitivity of 0.9752 on a test sample consisting of 360 images from healthy subjects and 40 images from patients with glaucoma.

In the case of the test sample from the hospitals in Madrid and Zaragoza, there is a significant drop in all the metrics, with the best response again being obtained by the VGG19 and VGG16 networks. This drop could be explained by the fact that the test images had a visual appearance rather different from that of the images used during training. It is important to keep in mind that the images were captured in different hospitals under different circumstances, which seems to have affected the networks. The lack of robustness of this type of system to distortions that can affect the images, such as noise, contrast or lighting, has been analyzed by various authors (Borkar and Karam, 2019). Again, it is difficult to cite any work to compare against in this regard. The closest would be (Diaz-Pinto et al., 2019), in which the Xception network was trained using images from some public databases and tested with different databases whose images were taken under varying conditions. Its results are in keeping with those stemming from our experiments, with the exception that in our case, the experts who performed the reference diagnoses were the same for the training and test groups, unlike in the aforementioned work. This leads us to think that even though this factor could have some influence, the fact that the images were taken under different conditions is likely to be more relevant.

Table 2: Evaluation of the different networks using the random test set.

Network            AUC     Se      Acc.
VGG19              0.9867  1.0000  0.9315
VGG16              0.9834  0.9615  0.9247
Xception           0.9771  0.9808  0.9178
ResNet50           0.9755  0.9808  0.9110
MobileNetV2        0.9738  0.9423  0.9041
DenseNet           0.9726  0.9615  0.9041
MobileNet          0.9712  0.9615  0.9315
InceptionResNetV2  0.9685  0.9808  0.9110
InceptionV3        0.9597  0.9423  0.8904
NASNetMobile       0.9290  0.9231  0.7534

Table 3: Evaluation of the different networks using the test set from Madrid and Zaragoza.

Network            AUC     Se      Acc.
VGG19              0.9272  0.8750  0.8563
VGG16              0.9177  0.8214  0.8506
InceptionV3        0.9015  0.7500  0.8046
Xception           0.8982  0.7500  0.7989
DenseNet           0.8919  0.7143  0.7816
MobileNet          0.8912  0.7500  0.8276
ResNet50           0.8855  0.7321  0.8333
InceptionResNetV2  0.8396  0.6250  0.7644
NASNetMobile       0.7969  0.6071  0.7989
MobileNetV2        0.7765  0.4464  0.5287

Fig. 4: ROC curves for all the networks using the random test set.

Fig. 5: ROC curves for all the networks using the test set from Madrid and Zaragoza.

REFERENCES

Borkar TS, Karam LJ (2019). DeepCorrect: Correcting DNN Models Against Image Distortions. IEEE T Image Process 28:6022-34.

Decencière E, Zhang X, Cazuguel G, Lay B, Cochener B, Trone C, Gain P, Ordonez R, Massin P, Erginay A, Charton B, Klein JC (2014). Feedback on a publicly distributed database: the Messidor database. Image Anal Stereol 33:231-4.

Diaz-Pinto A, Morales S, Naranjo V, Köhler T, Mossi JM, Navea A (2019). CNNs for automatic glaucoma assessment using fundus images: an extensive validation. Biomed Eng Online 18:1-19.

Fumero F, Alayon S, Sanchez JL, Sigut J, Gonzalez-Hernandez M (2011). RIM-ONE: An open retinal image database for optic nerve evaluation. In: 2011 24th International Symposium on Computer-Based Medical Systems (CBMS).

Fumero F, Sigut J, Alayon S, Gonzalez-Hernandez M, Gonzalez de la Rosa M (2015). Interactive tool and database for optic disc and cup segmentation of stereo and monocular retinal fundus images. In: Short communications proceedings. Plzen, Czech Republic: Václav Skala - UNION Agency.

Holm S, Russell G, Nourrit V, McLoughlin N (2017). DR HAGIS - a fundus image database for the automatic extraction of retinal surface vessels from diabetic patients. J Med Imaging (Bellingham) 4:1-11.

Odstrcilik J, Kolar R, Budai A, Hornegger J, Jan J, Gazarek J, Kubena T, Cernosek P, Svoboda O, Angelopoulou E (2013). Retinal vessel segmentation by improved matched filtering: evaluation on a new high-resolution fundus image database. IET Image Process 7:373-83.

Orlando JI, Fu H, Barbosa Breda J, van Keer K, Bathula DR, Diaz-Pinto A, Fang R, Heng PA, Kim J, Lee J, Lee J, Li X, Liu P, Lu S, Murugesan B, Naranjo V, Phaye SSR, Shankaranarayana SM, Sikka A, Son J, van den Hengel A, Wang S, Wu J, Wu Z, Xu G, Xu Y, Yin P, Li F, Zhang X, Xu Y, Bogunović H (2020). REFUGE Challenge: A unified framework for evaluating automated methods for glaucoma assessment from fundus photographs. Med Image Anal 59:101570.

Orlando JI, van Keer K, Barbosa Breda J, Blaschko MB, Blanco P, Bulant CA (2018). Towards a glaucoma risk index based on simulated hemodynamics from fundus images. In: Frangi AF, Schnabel JA, Davatzikos C, Alberola-Lopez C, Fichtinger G, eds., Medical Image Computing and Computer Assisted Intervention - MICCAI 2018, Lecture Notes in Computer Science. Springer International Publishing, 65-73.

Sivaswamy J, Krishnadas S, Datt Joshi G, Jain M, Syed Tabish A (2014). Drishti-GS: Retinal image dataset for optic nerve head (ONH) segmentation. In: 2014 IEEE 11th International Symposium on Biomedical Imaging (ISBI).

Son J, Bae W, Kim S, Park SJ, Jung KH (2018). Classification of Findings with Localized Lesions in Fundoscopic Images Using a Regionally Guided CNN. In: Stoyanov D, Taylor Z, Ciompi F, Xu Y, Martel A, Maier-Hein L, Rajpoot N, van der Laak J, Veta M, McKenna S, Snead D, Trucco E, Garvin MK, Chen XJ, Bogunovic H, eds., Computational Pathology and Ophthalmic Medical Image Analysis, Lecture Notes in Computer Science. Springer International Publishing, 176-84.

Tham YC, Li X, Wong TY, Quigley HA, Aung T, Cheng CY (2014). Global prevalence of glaucoma and projections of glaucoma burden through 2040: a systematic review and meta-analysis. Ophthalmology 121:2081-90.

Trucco E, Ruggeri A, Karnowski T, Giancardo L, Chaum E, Hubschman JP, Al-Diri B, Cheung CY, Wong D, Abràmoff M, Lim G, Kumar D, Burlina P, Bressler NM, Jelinek HF, Meriaudeau F, Quellec G, Macgillivray T, Dhillon B (2013). Validating retinal fundus image analysis algorithms: issues and a proposal. Invest Ophth Vis Sci 54:3546-59.

Zhang Z, Yin FS, Liu J, Wong WK, Tan NM, Lee BH, Cheng J, Wong TY (2010). ORIGA-light: An online retinal fundus image database for glaucoma analysis and research. In: 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology.