ERK'2022, Portorož, 385-389

Robustness of Surface Anomaly Detection Methods to Domain Shift

Matej Dobrevski, Jakob Božič, Danijel Skočaj
Faculty of Computer and Information Science, University of Ljubljana, Slovenia
e-mail: matej.dobrevski@fri.uni-lj.si

Abstract

In many realistic visual inspection scenarios it is to be expected that the distribution of images will change through time due to, e.g., gradual changes in illumination, background, or acquisition settings. In this work we investigate how such a domain shift influences the performance of visual surface anomaly detection methods. We analyse three recent state-of-the-art methods based on unsupervised learning and evaluate their response to domain shifts caused by applying five different transformations to the original dataset. We also investigate the number of training images needed to build a good model, as well as the number of images from the shifted domain that need to be added to the original training set in order to robustly build a good model. The obtained results can be used to address different issues in the field and are useful for both researchers and practitioners working on surface anomaly detection.

1 Introduction

The field of surface anomaly detection is an enabling technology for various industrial and commercial applications. The problem is formulated as the detection of parts of an image (anomalies, defects), depicting the surface of an object, that are inconsistent with the typical (normal) appearance of this surface. Keeping up with the times, the recent state-of-the-art surface anomaly detection methods are all based on CNNs, used in a variety of manners. In the standard deep learning paradigm, such a method would require labeled sets of "normal" and "anomalous" images. This supervised approach often turns out to be problematic: data labeling is costly, anomalies might be rare, and one might not know beforehand all the possible anomalies likely to appear in the future.

The field of unsupervised surface anomaly detection addresses these issues by requiring only "normal" training images, which are used to build some type of model that can discriminate between images that belong to the distribution observed during training and images that are out of distribution, i.e., "anomalous". However, in many applications the distribution of "normal" images is likely to change with time: the environment in which the images are taken may change, the objects of interest themselves might change, or the equipment used for taking the images might change.

Figure 1: We investigate the robustness of anomaly detection methods to domain shift by training a model on a source domain and evaluating it on multiple target (shifted) domains. We also investigate the number of samples needed for building an accurate model, as well as the number of samples needed to adapt the model to the target domain.

In this work we evaluate the robustness of three of the most popular unsupervised surface anomaly detection methods, Gaussian AD [3], PatchCore [4], and CS-Flow [5], to different domain shifts. For this purpose we use the popular MVTec [1] dataset and create five domain-shifted datasets by changing the images in various ways. We evaluate the number of images needed for building a good model, the drop in performance when the domain shifts, as well as the number of images from the shifted domain that need to be included in the training process to reach a performance comparable to the baseline.
The main contributions of the paper are therefore the results of the analysis and the insights derived from them, which we believe to be useful for researchers and practitioners working on the problem of visual surface anomaly detection.

2 Related work

In recent years the field of surface anomaly detection has been dominated by methods based on unsupervised learning of specific CNN models. Various approaches to solving this problem have been proposed. A very popular approach used to be the analysis of the image reconstruction (e.g., as in RIAD [9]), also in combination with a discriminative approach (such as DRAEM [8]). Several methods use pre-trained CNNs for feature extraction and then create models that take the extracted features as input [3, 4, 5] and process them further to model the regularities in the training images. In this paper we analyse three different methods following this paradigm, which have turned out to be among the most successful unsupervised methods for surface anomaly detection.

Gaussian AD uses an ImageNet-pre-trained EfficientNet [6] to extract features from the training set of images and builds a multivariate Gaussian (MVG) model of the distribution of the features. At test time, the features of the input image are extracted and the Mahalanobis distance to the MVG distribution of "normal" images is calculated. If the distance is above a set threshold, the image is categorized as anomalous, without the possibility of localizing the anomaly (a code sketch of this scoring step is given at the end of Section 3's dataset description).

PatchCore uses different types of pre-trained ResNet [2] architectures for feature extraction from all the images in the training set. It then uses a minimax facility location method to extract a core-set of features that represents the distribution of "normal" images. At test time, the features of the input image are compared to this core-set, and if the distance is large enough, the anomalies are detected and localized in the image.

CS-Flow uses an ImageNet-pre-trained EfficientNet [6] architecture through which three scales of the input image are passed to obtain three feature representations of the image. These features are then used by a custom Cross-Scale Normalizing Flow [5] neural architecture to model the distribution of the features of "normal" images. At test time, the features of the image are extracted and passed through the Cross-Scale Flow model, and the outputs directly indicate anomalous regions in the image.

3 Experiment design

Our analysis is centered around the MVTec dataset, containing images of 15 different objects. It is the most common dataset in the field, used in virtually all published works on the topic of unsupervised surface anomaly detection. Five of the objects in this dataset can be seen in the leftmost column of Figure 2.

In order to measure the effects of domain shifts on the performance of the three anomaly detection algorithms, we created five domain-shifted versions of the MVTec dataset. The MVTec-brg dataset was obtained by simply permuting the channels of the original images as RGB → BRG. The MVTec-coljit dataset was generated using the color jittering technique for image augmentation; in our case the same transformation was applied to all images: each image was first transformed into the HSV representation, all channels were multiplied by a constant factor, and the image was then converted back to the RGB representation. MVTec-gray is simply a grayscale version of the original dataset.
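For concreteness, these three pixel-space shifts can be reproduced with a few lines of standard image processing. The following is a minimal sketch, not the exact generation code used for the datasets: the jitter factors are hypothetical (the text only states that each HSV channel was multiplied by a constant), and whether the grayscale images were kept single-channel or replicated to three channels is not stated, so the replication below is an assumption.

```python
# Minimal sketch of the three pixel-space domain shifts.
# The color-jitter factors are illustrative assumptions.
import cv2
import numpy as np

def shift_brg(img_rgb):
    """MVTec-brg: permute the channels RGB -> BRG."""
    return img_rgb[..., [2, 0, 1]]

def shift_coljit(img_rgb, factors=(1.1, 1.2, 0.9)):
    """MVTec-coljit: scale the H, S and V channels by constant factors."""
    hsv = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2HSV).astype(np.float32)
    hsv *= np.asarray(factors, dtype=np.float32)
    # OpenCV stores H in [0, 179] for uint8 images; for simplicity we
    # clip the hue instead of wrapping it around.
    hsv = np.clip(hsv, 0, (179, 255, 255)).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)

def shift_gray(img_rgb):
    """MVTec-gray: grayscale image, replicated to three channels
    (an assumption, so the CNN input shape is preserved)."""
    gray = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2GRAY)
    return np.stack([gray] * 3, axis=-1)
```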
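Similarly, the Gaussian AD scoring step referenced in Section 2 can be summarised in a few lines. This is a minimal sketch, assuming the feature vectors have already been extracted by the pretrained EfficientNet; the exact regularisation details of [3] are omitted.

```python
# Sketch of Gaussian AD scoring: fit a multivariate Gaussian (MVG) to
# the "normal" training features and score test images by Mahalanobis
# distance. train_feats has shape (N, D); feature extraction by the
# pretrained backbone is assumed to have happened already.
import numpy as np

def fit_mvg(train_feats, eps=1e-6):
    mean = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False)
    cov += eps * np.eye(cov.shape[0])  # small ridge as a numerical safeguard
    return mean, np.linalg.inv(cov)

def anomaly_score(feat, mean, cov_inv):
    """Mahalanobis distance of one test feature vector to the MVG."""
    diff = feat - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

# An image is flagged as anomalous when its score exceeds a threshold;
# as noted in Section 2, this image-level test cannot localize the anomaly.
```

In this formulation, the domain-shift experiments below amount to fitting the MVG on source-domain features and scoring features extracted from the shifted domains.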
The remaining two datasets, MVTec-ljs and MVTec-ld, were generated using the WCT2 [7] method for style transfer, where two arbitrarily chosen images (an image of a Ljubljana sunrise and an image of Luka Dončić) were used as the style images. A number of images from these datasets, showing the influence of the applied transformations, are shown in Figure 2.

Figure 2: Examples from the domain-shifted datasets for five object categories. The leftmost column shows the objects as they appear in the original dataset; the other columns depict the domain-shifted images.

In the first experiment we simply train all three methods on the original training set and evaluate the corresponding trained models on the original test set. This is a conventional setting without a domain shift, and the obtained results serve as a baseline for the other experiments. In the second experiment we examine the effect of reducing the size of the training set. For each object, we train a model on 1, 5, 10, 50, 100, 200, and all available training images from the original training set, and evaluate its performance on the complete original test set. For the MVTec dataset the total number of training images per object ranges between 219 and 280, with two exceptions: the "toothbrush" object contains only 60 training images and the "hazelnut" object contains 391 training images.

In the third experiment, all models that were trained on the source domain are evaluated on all shifted domains, and we record the difference in performance in terms of the mean AUC metric. In the fourth and final experiment, we extend the training set consisting of all available original images (from the source domain) with 1, 5, or 10 images from the target domain, in order to gauge the number of target-domain images needed to adapt the trained model simply by adding them to the training set.

All experiments were performed with 5-fold cross-validation, and we report the mean values as well as the standard deviations.

4 Results

All three methods perform relatively well on the original dataset. When trained on the complete training set of the original MVTec dataset and tested on the complete test set of the same dataset, Gaussian AD achieves a mean AUC of 0.97, PatchCore a mean AUC of 0.96, and CS-Flow a mean AUC of 0.98.

Figure 3: Performance of Gaussian AD, PatchCore and CS-Flow on the MVTec and domain-shifted MVTec datasets. The horizontal axis shows how many images from the MVTec domain were used for training the model. The vertical axis shows the mean AUC metric when evaluating on the complete test sets of each domain. Each color represents a different test domain, with red being the MVTec dataset.

Figure 4: Results of adding images from the target domain to the training set of images from the source domain, for each of the three examined methods. Results are shown for adding 1, 5 or 10 images to the complete training set of the original domain.

Certain object categories seem to be consistently easier for all three methods. For example, all three methods achieve a perfect mAUC score of 1.0 on the leather and bottle categories. The most difficult objects for all three methods are the capsule and screw objects. In general, the success on the leather category is probably due to its textural appearance, as all textural categories seem to be relatively easy to model.
The good results on the bottle category, as well as the bad results on the capsule and screw categories, are probably due to the degree of alignment between different samples: bottle is one of the best aligned categories, while screw and capsule are among the least aligned categories. The results for these objects for the CS-Flow method are shown in Figure 5.

Next, we evaluated the effect of the size of the training set. As can be seen in Figure 3 (red line), all three methods benefit from larger training sets; however, on average, Gaussian AD and PatchCore seem to reach a plateau at about 100 samples in the training set, while for CS-Flow it seems that a plateau has not been reached yet. It is also surprising that for some objects a single image is enough for building a good model, for all three methods. On average, with a single training image Gaussian AD reaches a mAUC of 0.80, PatchCore a mAUC of 0.82, and CS-Flow a mAUC of 0.86. It is also notable that as the number of training samples grows, the score becomes less dependent on the particular samples chosen, as the standard deviation is reduced. If we examine the individual categories, we see that the best performing categories across all methods are leather, tile, carpet, and bottle. The first three are textural categories and the last is highly aligned across samples. Some of the worst performing categories are screw, hazelnut, capsule, transistor, and toothbrush, as can be seen in Figure 5. All of these are non-textural and relatively unaligned categories.

In the next experiment we evaluated the performance of the methods when they are trained on the original MVTec and tested on MVTec-brg, MVTec-coljit, MVTec-gray, MVTec-ljs, and MVTec-ld. For Gaussian AD, the smallest drop in performance occurs on the MVTec-coljit and MVTec-gray domains, the score drops further for MVTec-brg, and the method performs worst on the MVTec-ljs and MVTec-ld domains. For PatchCore, we can see that the method is most robust to the MVTec-coljit and MVTec-gray domain shifts; the performance then drops on the MVTec-brg domain, and the method performs worst on the MVTec-ljs and MVTec-ld domains. For CS-Flow, the smallest drop in performance is observed on the MVTec-gray domain, followed by an additional drop on the MVTec-ljs and MVTec-ld domains, while the largest drop in performance occurs on the MVTec-coljit and MVTec-brg domains.

Figure 5: Results for specific objects for the CS-Flow method.

Without any adaptation to the target domains, it seems that overall PatchCore experiences the smallest drop in performance. The reasons for this require additional investigation; however, there are two obvious suspects: the feature extractor (ResNet vs. the EfficientNet used by Gaussian AD and CS-Flow) and the core-set selection strategy.

For Gaussian AD and PatchCore, a plateau is reached at about 50 to 100 samples, and using more samples does not significantly improve the performance on the domain-shifted datasets. For CS-Flow we can see that the performance on the MVTec-gray domain only starts to drop after using more than 200 samples. For the MVTec-ljs and MVTec-ld domains the performance neither changes significantly with the number of training samples nor starts to drop. For the MVTec-coljit and MVTec-brg datasets we can see that the score starts dropping already after 10 training samples of the original domain, which suggests overfitting on the source domain.
CS-Flow is expected to be more prone to overfitting because the model that it builds is more complex than the models of the other two methods.

In the final experiment, we evaluate the effect of adding 1, 5, or 10 images from the domain-shifted dataset to the complete training set of the source domain. The mean results over all categories are depicted in Figure 4. For Gaussian AD we see that adding images from the target domain brings the performance close to the baseline for the MVTec-coljit, MVTec-brg, and MVTec-gray domains, while the improvements on the MVTec-ljs and MVTec-ld domains are smaller. PatchCore performs almost equally well on the MVTec-coljit and MVTec-gray domains even without adding any training images. Its performance on the MVTec-brg domain can come close to the baseline if enough images are added to the training set, while its performance on the MVTec-ljs and MVTec-ld domains does not come close to the baseline. For CS-Flow we can see that the sharp drop in performance on the MVTec-coljit and MVTec-brg domains can be alleviated with a single target-domain image in the training set. Its performance on the MVTec-gray domain is relatively good even without any adaptation, and is further improved by adding images. Again, the worst performance is observed on the MVTec-ljs and MVTec-ld domains. As expected, adding images from the target domain improves the performance of all three methods.

5 Conclusion

In this work we created domain-shifted variants of the popular MVTec dataset for surface anomaly detection, and evaluated the influence of the size of the training set and of the domain shift on the performance of three popular surface anomaly detection algorithms.

From the experiments presented in this work we can draw a few conclusions. Firstly, the MVTec dataset has been "solved" to a great extent, so much so that its continued usage might bring about more methods that are fitted to the dataset, rather than fundamentally novel approaches to the problem. Secondly, we can see that the tested domain shifts do cause a drop in performance, but this drop is largely non-catastrophic. Since this holds true for all three methods, the reason probably lies in the ability of the used feature extractors to "absorb" such changes in the image space. Lastly, we can see that certain domain shifts, like the tested BRG, COLJIT, and GRAY shifts, are easy to adapt to, needing only a few images from the target domain in order to reach a performance comparable to the baseline. In order to correctly interpret all the presented results, further investigation is needed into how the feature representation of an image changes when certain visual properties of the image are changed.

References

[1] Bergmann, P., Fauser, M., Sattlegger, D., Steger, C.: MVTec AD – A comprehensive real-world dataset for unsupervised anomaly detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

[2] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

[3] Rippel, O., Mertens, P., König, E., Merhof, D.: Gaussian anomaly detection by modeling the distribution of normal data in pretrained deep features. IEEE Transactions on Instrumentation and Measurement (2021)

[4] Roth, K., Pemula, L., Zepeda, J., Schölkopf, B., Brox, T., Gehler, P.V.: Towards total recall in industrial anomaly detection.
arXiv (2021), https://arxiv.org/abs/2106.08265

[5] Rudolph, M., Wehrbein, T., Rosenhahn, B., Wandt, B.: Fully convolutional cross-scale-flows for image-based defect detection. In: Winter Conference on Applications of Computer Vision (WACV) (2022)

[6] Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: Proceedings of the 36th International Conference on Machine Learning (2019)

[7] Yoo, J., Uh, Y., Chun, S., Kang, B., Ha, J.W.: Photorealistic style transfer via wavelet transforms. In: International Conference on Computer Vision (ICCV) (2019)

[8] Zavrtanik, V., Kristan, M., Skočaj, D.: DRAEM – A discriminatively trained reconstruction embedding for surface anomaly detection. In: IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

[9] Zavrtanik, V., Kristan, M., Skočaj, D.: Reconstruction by inpainting for visual anomaly detection. Pattern Recognition (2021)