Proceedings of the 2021 7th Student Computer Science Research Conference (StuCoSReC)

Editors: Iztok Fister, Andrej Brodnik, Janez Brest, Iztok Fister Jr., Matjaž Krnc, Niko Zimic

Maribor, September 2021

Title: Proceedings of the 2021 7th Student Computer Science Research Conference (StuCoSReC)

Editors: Iztok Fister (University of Maribor, Faculty of Electrical Engineering and Computer Science), Andrej Brodnik (University of Ljubljana, Faculty of Computer and Information Science), Janez Brest (University of Maribor, Faculty of Electrical Engineering and Computer Science), Iztok Fister Jr. (University of Maribor, Faculty of Electrical Engineering and Computer Science), Matjaž Krnc (University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technology), Niko Zimic (University of Ljubljana, Faculty of Computer and Information Science)

Language editing: Shelagh Hedges

Technical editors: Iztok Fister (University of Maribor, Faculty of Electrical Engineering and Computer Science), Jan Perša (University of Maribor, University Press)

Cover graphics: Network, author: geralt (Pixabay.com CC0)

Graphics material: Authors & Editors

Conference: 7th Student Computer Science Research Conference (StuCoSReC)

Location & date: 14th September 2021, Maribor, Slovenia

Program committee: Klemen Berkovič (University of Maribor, Slovenia), Zoran Bosnić (University of Ljubljana, Slovenia), Borko Bošković (University of Maribor, Slovenia), Janez Brest (University of Maribor, Slovenia), Lucija Brezočnik (University of Maribor, Slovenia), Andrej Brodnik (University of Primorska and University of Ljubljana, Slovenia), Patricio Bulić (University of Ljubljana, Slovenia), Mojca Ciglarič (University of Ljubljana, Slovenia), Jani Dugonik (University of Maribor, Slovenia), Iztok Fister (University of Maribor, Slovenia), Iztok Fister Jr. (University of Maribor, Slovenia), Matjaž Gams (Jozef Stefan Institute, Slovenia), Mario Gorenjak (University of Maribor, Slovenia), Aleš Holobar (University of Maribor, Slovenia), Andres Iglesias (Universidad de Cantabria, Spain), Sašo Karakatič (University of Maribor, Slovenia), Branko Kavšek (University of Primorska, Slovenia), Domen Kavran (University of Maribor, Slovenia), Štefan Kohek (University of Maribor, Slovenia), Matjaž Krnc (University of Primorska, Slovenia), Miklos Kresz (University of Szeged, Hungary), Niko Lukač (University of Maribor, Slovenia), Marjan Mernik (University of Maribor, Slovenia), Uroš Mlakar (University of Maribor, Slovenia), Eneko Osaba (University of Deusto, Spain), Vili Podgorelec (University of Maribor, Slovenia), Jan Popič (University of Maribor, Slovenia), Miha Ravber (University of Maribor, Slovenia), Peter Rogelj (University of Primorska, Slovenia), Martin Šavc (University of Maribor, Slovenia), Alexandros Tzanetos (University of the Aegean), Damjan Vavpotič (University of Ljubljana, Slovenia), Grega Vrbančič (University of Maribor, Slovenia), Nikolaj Zimic (University of Ljubljana, Slovenia), Gilbert Lim Yong San (Singapore Eye Research Institute) & Borut Žalik (University of Maribor, Slovenia).

Local organizing committee: Janez Brest (University of Maribor, Slovenia), Iztok Fister (University of Maribor, Slovenia) & Aleš Holobar (University of Maribor, Slovenia).
Published by: University of Maribor, University Press, Slomškov trg 15, 2000 Maribor, Slovenia; https://press.um.si, zalozba@um.si

Co-published by: University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia; http://www.feri.um.si, feri@um.si

Co-published by: University of Ljubljana, Faculty of Computer and Information Science, Večna pot 113, 1000 Ljubljana, Slovenia; https://fri.uni-lj.si/en, dekanat@fri.uni-lj.si

Co-published by: University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Glagoljaška 8, 6000 Koper, Slovenia; https://www.famnit.upr.si/en, info@upr.si

Edition: 1st
Publication type: E-book
Published: Maribor, September 2021
Available at: https://press.um.si/index.php/ump/catalog/book/611

© University of Maribor, University Press / Univerza v Mariboru, Univerzitetna založba
Text © Authors & Editors, 2021

This book is published under a Creative Commons Attribution 4.0 International licence (CC BY 4.0). This licence allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The licence allows for commercial use. Any third-party material in this book is published under the book's Creative Commons licence unless indicated otherwise in the credit line to the material. If you would like to reuse any third-party material not covered by the book's Creative Commons licence, you will need to obtain permission directly from the copyright holder. https://creativecommons.org/licenses/by/4.0/

CIP - Kataložni zapis o publikaciji
Univerzitetna knjižnica Maribor
004(082)(0.034.2)
STUDENT Computer Science Research Conference (7 ; 2021 ; Maribor)
Proceedings of the 2021 7th Student Computer Science Research Conference (StuCoSReC) [Elektronski vir] / [editors Iztok Fister ... et al.]. - 1st ed. - E-zbornik. - Maribor : University of Maribor, University Press : Faculty of Electrical Engineering and Computer Science ; Ljubljana : Faculty of Computer and Information Science ; Koper : Faculty of Mathematics, Natural Sciences and Information Technologies, 2021
Način dostopa (URL): https://press.um.si/index.php/ump/catalog/book/611
ISBN 978-961-286-516-0 (pdf)
doi: 10.18690/978-961-286-516-0
COBISS.SI-ID 76158723

ISBN: 978-961-286-516-0 (pdf)
DOI: https://doi.org/10.18690/978-961-286-516-0
Price: Free copy
For publisher: prof. dr. Zdravko Kačič, rector of the University of Maribor
Attribution: Fister, I., Brodnik, A., Brest, J., Fister, I. Jr., Krnc, M. & Zimic, N. (eds.) (2021). Proceedings of the 2021 7th Student Computer Science Research Conference (StuCoSReC). Maribor: University Press. doi: 10.18690/978-961-286-516-0
Table of Contents

Preface (Janez Brest), p. 1
On Artefact Elimination in High Density Electromyograms by Independent Component Analysis (Aljaž Frančič, Aleš Holobar & Milan Zorman), p. 3
Zero-Knowledge Authentication (Jakob Povšič & Andrej Brodnik), p. 7
Graphs where Search Methods are Indistinguishable (Matjaž Krnc & Nevena Pivač), p. 11
System for Remote Collaborative Embedded Development (Martin Domajnko, Nikola Glavina & Aljaž Žel), p. 15
Leaf Segmentation of Rosette Plants using Rough K-Means in CIELab Color Space (Arunita Das, Daipayan Ghosal & Krishna Gopal Dhal), p. 19
Adversarial Image Perturbation with a Genetic Algorithm (Rok Kukovec, Špela Pečnik, Iztok Fister Jr. & Sašo Karakatič), p. 25
Fast Recognition of Some Parametric Graph Families (Nina Klobas & Matjaž Krnc), p. 31
Interactive Evolutionary Computation Approach to Permutation Flow Shop Scheduling Problem (Vid Keršič), p. 35
Towards Representative Web Performance Measurements with Google Lighthouse (Tjaša Heričko, Boštjan Šumak & Saša Brdnik), p. 39
Transformer-based Sarcasm Detection in English and Slovene Language (Matic Rašl, Mitja Žalik & Vid Keršič), p. 43
Extraction and Analysis of Sport Activity Data Inside Certain Area (Luka Lukač), p. 47
Methodology for the Assessment of the Text Similarity of Documents in the CORE Open Access Data Set of Scholarly Documents (Ivan Kovačič, David Bajs & Milan Ojsteršek), p. 51
Embedding Non-planar Graphs: Storage and Representation (Ðorđe Klisura), p. 57
Analysis of the Rhythmicity of Count Data Using the Cosinor Model (Analiza ritmičnosti števnih podatkov z uporabo modela cosinor) (Nina Velikajne & Miha Moškon), p. 61
Sentiment Analysis of Hotel Reviews Using Lexicons and the Naive Bayes Method (Analiza sentimenta komentarjev hotelov z uporabo slovarjev in metode Naivni Bayes) (Nina Murks, Anže Omerzu & Borko Bošković), p. 65
Task Schedulers and the Serverless Environment (Časovni razporejevalniki in brezstrežniško okolje) (Uroš Zagoranski), p. 71

Preface

JANEZ BREST

Computer science is one of the fastest growing fields. We live in a digital age where computers, smart phones and many other devices are connected worldwide. To meet all the requirements of the growing digital world, knowledge, skills and abilities are needed, which can be found in young people, especially in brilliant students.

The Student Computer Science Research Conference (StuCoSReC) is organized through the collaboration of the departments and institutes of three public universities in Slovenia with computer science study programmes. We are very proud that these StuCoSReC student conferences have been held every year since 2014 at different institutions in Slovenia. Unfortunately, in 2020, which was marked by the Covid-19 pandemic, the organization of the StuCoSReC conference was not possible. This year we want to hold the conference despite the stormy conditions of the Covid-19 pandemic.

The StuCoSReC conference is intended for computer science students to present their research work and achievements, exchange ideas, socialize and connect among themselves. These proceedings contain the papers of the seventh Student Computer Science Research Conference 2021 (StuCoSReC'21), which is planned to be held in Maribor. The University of Maribor, Faculty of Electrical Engineering and Computer Science is proud to host the conference this year. We received seventeen submissions, covering several topics of computer science. Three submissions are in Slovene and the others are in English. The submissions were each reviewed by two reviewers.
One submission was rejected. The talks are scheduled to be held during the conference. The organizing committee would like to thank the reviewers for their well-performed job. The conference is dedicated to graduate and undergraduate students of computer science, and is therefore free of charge. We gratefully acknowledge the support of the Faculty of Electrical Engineering and Computer Science (University of Maribor), especially the Institute of Computer Science.

On Artefact Elimination in High Density Electromyograms by Independent Component Analysis

Aljaž Frančič, Aleš Holobar, Milan Zorman (University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia)
aljaz.francic@um.si, ales.holobar@um.si, milan.zorman@um.si
DOI: https://doi.org/10.18690/978-961-286-516-0.1

Abstract
We propose a novel approach to artefact detection and elimination in high density electromyograms by using the previously introduced Activity Index and Independent Component Analysis (ICA). 28 electromyographic recordings of the biceps brachii muscle were analysed for the presence of artefacts. Using the technique presented in this study, we eliminated an average of 1.07 ± 1.18 artefacts per HDEMG recording. The mean number of eliminated artefacts per recording with at least one detected artefact was 1.88 ± 0.96. In the HDEMGs used in our study, each artefact was found in a separate ICA component.

Keywords: biomedical signal processing, electromyography, blind source separation, independent component analysis, activity index, artefact

1 Introduction

In humans, movement and locomotion are regulated by muscles. An electrical signal travels from the central nervous system towards muscles, where it is electrically amplified. By using non-invasive surface electromyography,
it is possible to detect the subtle changes in the voltage on the surface of the skin that originate from the muscles, even through the skin and the subcutaneous fat layers. Such recordings are called surface electromyograms (SEMG). When an array of tens of electrodes is used, we call the resulting recordings high density electromyograms (HDEMGs). Due to many factors involved in the recording of HDEMGs that are often impossible to control for, the resulting recordings are prone to contain artefacts from various sources, such as power line interference, inadequate electrode-skin contact, electrode drift, subpar quality of the equipment, movement artefacts, etc.

A motor unit (MU) is made up of a motor neuron and the skeletal muscle fibers innervated by that motor neuron's axonal terminals. Groups of motor units often work together to coordinate the contractions of a single muscle. We can think of the signal that travels from the motor neuron to the skeletal muscle fiber as a time series of zeroes, which represent no firing, and ones, which represent the firing of a muscle fiber and, thus, the MU. Because of the refractory period, the spikes are few and far between, making the resulting signal sparse in time. This signal is called the MU spike train.

MUs fire asynchronously and their contributions are superimposed into HDEMG, forming a highly complex signal that is difficult to interpret. By using computer-aided methods, such as Convolution Kernel Compensation (CKC) [2], it is possible to decompose the HDEMGs into contributions of individual MUs and, therefore, identify the firing times of MUs. This gives us insight into the status of the motor system in humans.

However, due to the reasons mentioned in the initial paragraph of this section, HDEMGs often contain various artefacts. These artefacts hinder the ability of CKC to decompose the HDEMG into contributions of individual motor units. Also worth noting: a single artefact might sometimes be present in multiple EMG channels at the same time, so eliminating a single HDEMG channel (for a period of time) is not always effective.

In this study, we exemplified our technique for artefact detection and elimination in HDEMG by using the previously introduced Activity Index [1] and Independent Component Analysis (ICA) techniques [5].

2 Materials and Methods

2.1 HDEMG model and Activity Index

In isometric contractions of skeletal muscles, HDEMG signals can be modeled by the following convolutive model [3]:

    y(n) = H t(n) + ω(n)    (1)

where

    y(n) = [y_1(n) ... y_1(n − F + 1) ... y_M(n − F + 1)]^T    (2)

stands for the time-wise extended vector of HDEMG signals, with the extension factor F set between 10 and 20 [3],

    ω(n) = [ω_1(n) ... ω_M(n − F + 1)]^T    (3)

is the time-wise extended noise vector, and

    t(n) = [t_1(n) ... t_1(n − L − F + 1) ... t_J(n − L − F + 1)]^T    (4)

is the similarly extended vector of MU spike trains. Here, the spike train of the j-th MU is defined as

    t_j(n) = Σ_k δ(n − τ_j(k)),  j = 1, ..., J    (5)

where δ() denotes the Delta function and the k-th spike of the j-th MU appears at time τ_j(k).

The mixing matrix H comprises the L samples long MUAPs of J active MUs, as detected by M uptake electrodes [3]. The Convolution Kernel Compensation (CKC) method [3] estimates the MU filter iteratively as

    f̂_j = f̂_j + α E(g(t̂_j(n)) y(n))^T C_y^{-1}    (6)

    f̂_j = f̂_j / ‖f̂_j‖    (7)

where α determines the speed of convergence, E() stands for mathematical expectation, g(t) is a nonlinear weighting function, e.g. g(t) = log(1 + t^2), and C_y = E(y(n) y^T(n)) represents the correlation matrix of extended HDEMG measurements. After each iteration of Eqs. 6 and 7, the estimate of the MU spike train gets updated:

    t̂_j(n) = f̂_j^T y(n)    (8)

The Activity Index I_A at a given time n is defined as [3]

    I_A(n) = y^T(n) C_y^{-1} y(n)    (9)

[Figure 1: Representative HDEMG channels recorded during the isometric contraction of biceps brachii at 10 % MVC contraction (in blue) and the corresponding Activity Index (in red), with the detected outliers encircled in green. The most prominent artefacts are noticeable in the 4th and 6th EMG channel at approximately the 10th second, in the 9th and 11th EMG channel at approximately the 17th second, and in the 15th and 18th EMG channel at approximately the 4th second. A single artefact is often present in several EMG channels. Also evident, not all the HDEMG artefacts were detrimental for the Activity Index calculation and, therefore, for MU identification. Axes: EMG amplitude (mV) and Activity Index amplitude (a. u.) versus time (s).]
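For illustration, Eq. (9) can be computed in a few lines. The following is a minimal sketch, not the authors' code: it assumes a numpy array emg of shape (channels, samples), estimates C_y from the data, and adds a small regularisation term to keep the inversion stable (the paper does not discuss regularisation); the helper names extend and activity_index are hypothetical.

    import numpy as np

    def extend(emg, F):
        # Time-wise extension of an (M, N) HDEMG array by factor F (Eq. 2):
        # each channel is stacked with its F-1 delayed copies.
        # (np.roll wraps at the edges, which is fine for a sketch.)
        M, N = emg.shape
        rows = [np.roll(emg[m], d) for m in range(M) for d in range(F)]
        return np.asarray(rows)  # shape (M*F, N)

    def activity_index(emg, F=1, reg=1e-10):
        # Activity Index I_A(n) = y(n)^T C_y^{-1} y(n) (Eq. 9).
        y = extend(emg, F)
        C = (y @ y.T) / y.shape[1]  # sample estimate of C_y
        Cinv = np.linalg.inv(C + reg * np.eye(C.shape[0]))
        return np.einsum('in,ij,jn->n', y, Cinv, y)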
Calculating the Activity Index is the first step of the CKC decomposition approach [3], and it is susceptible to HDEMG artefacts. Thus, it can also be used to detect them before further processing. If we do not eliminate the artefacts in HDEMG, they can cause problems for decomposition, preventing the convergence of the MU filter in Eq. 6 or the segmentation of MU firing moments from the identified MU spike train t̂_j(n).

2.2 Independent Component Analysis

In signal processing, ICA is a method for separating a multivariate signal into its additive subcomponents. This is done by assuming that the subcomponents are non-Gaussian signals and that they are statistically independent from each other [4]. ICA is a special case of blind source separation.

In our case, the fastICA [5] decomposition was applied to HDEMG, yielding the individual independent components. Deflation was chosen as the decorrelation approach in the fastICA decomposition, and the nonlinearity g(u) = u^3 was selected in the fixed-point algorithm. The stopping criterion was set to 0.0001 and the maximum number of iterations was set to 1000 (the default values).

2.3 Artefact detection and elimination

An example of a few representative HDEMG channels recorded during isometric contraction of the biceps brachii muscle at 10 % of maximum voluntary contraction (MVC) is shown in Fig. 1.

The artefact detection process began by calculating the Activity Index (red line in Fig. 1) from HDEMG. In our study, the extension factor F was set to 1. This yielded one time series from several HDEMG channels.

Next, the outliers (denoted by green circles on the red line in Fig. 1) were found in the Activity Index. In our case, an outlier was defined as an element that was more than 15 scaled median absolute deviations (MAD) away from the local median. Noteworthy, this could be parameterized and the outliers could be determined in some other fashion. The local median was defined as the median inside a sliding window, where the window size was equal to 2048 samples. To the best of our knowledge there is no established method for artefact elimination using the Activity Index, hence the MAD threshold for determining the outliers in the Activity Index was selected empirically, using visual inspection of the results by an expert. As an additional step, we also ignored any outlier that was fewer than 200 samples away from an already detected outlier. In this way, we prevented multiple identifications of the same artefact.
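A minimal sketch of the outlier rule just described, under the assumption that the scaled MAD uses the usual 1.4826 consistency factor (the paper does not state the scaling constant); ia is the Activity Index as a numpy array, and find_outliers is a hypothetical helper name.

    import numpy as np

    def find_outliers(ia, window=2048, k=15, min_gap=200):
        # Sliding-window median and scaled MAD; the 1.4826 factor makes
        # MAD comparable to the standard deviation for Gaussian data.
        n, half, keep = len(ia), window // 2, []
        for i in range(n):
            w = ia[max(0, i - half):min(n, i + half)]
            med = np.median(w)
            mad = 1.4826 * np.median(np.abs(w - med))
            if mad > 0 and abs(ia[i] - med) > k * mad:
                # suppress detections within min_gap samples of the last one
                if not keep or i - keep[-1] >= min_gap:
                    keep.append(i)
        return keep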
Afterwards, the fastICA decomposition was applied to the HDEMG, yielding the individual independent components. All the identified components were taken into consideration, as we would like to preserve as much of the information as possible after artefact elimination. Each independent component was then excluded in turn, and a new Activity Index I_EX was calculated without the excluded independent component. Noteworthy, calculating the Activity Index from either the HDEMG or all of its ICA components will always yield the same result. Then, for each detected outlier in the Activity Index I_A, we tried to identify the ICA component that contributed the most to the outlier in the Activity Index (to the HDEMG artefact), by observing the difference in the Activity Index at the time of the outlier before and after the exclusion of the individual ICA component. We did this by using the interest metric defined as

    IntMet_EX(n) = 1 − I_EX(n) / I_A(n)    (10)

at any given time n. In particular, for each detected outlier at time x in the old Activity Index I_A(x), we looked at all the new Activity Indices I_EX(x) (with individual ICA components excluded) and calculated the interest metrics IntMet_EX(x) at the time x of the outlier in the Activity Index I_A(x).

The interest metric IntMet_EX helped to expose the relationship between the old Activity Index I_A and the new Activity Indices I_EX. A higher value of the interest metric corresponded to a greater chance that there was an artefact in the excluded ICA component. Using this metric it was possible to determine which ICA component contains the artefact. It was assumed that only a single ICA component will contain a single artefact, as ICA works under the assumption that the sources are statistically independent from each other. If we were to see artefacts in multiple ICA components at the same time, this would imply that they came from statistically independent sources. This would imply artefact co-occurrence, which is unlikely given our findings about the number of outliers and artefacts found per recording (Section 3).
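The elimination loop of this subsection could look roughly as follows. This is a sketch, not the authors' implementation: it reuses the hypothetical activity_index and find_outliers helpers from the earlier snippets and scikit-learn's FastICA with the deflation approach, cube nonlinearity, tolerance and iteration limit quoted above.

    import numpy as np
    from sklearn.decomposition import FastICA

    def eliminate_artefacts(emg, threshold=0.5):
        # emg: (N samples, M channels) array. Returns the cleaned HDEMG.
        ica = FastICA(algorithm='deflation', fun='cube', tol=1e-4, max_iter=1000)
        S = ica.fit_transform(emg)      # independent components, (N, J)
        ia = activity_index(emg.T)
        for x in find_outliers(ia):
            scores = []
            for j in range(S.shape[1]):
                S_ex = S.copy()
                S_ex[:, j] = 0.0        # exclude component j
                i_ex = activity_index(ica.inverse_transform(S_ex).T)
                scores.append(1.0 - i_ex[x] / ia[x])   # IntMet_EX(x), Eq. 10
            j_art = int(np.argmax(scores))
            if scores[j_art] > threshold:   # 0.5, per the Discussion
                S[:, j_art] = 0.0           # eliminate the whole component
        return ica.inverse_transform(S)

After the loop, ica.inverse_transform(S) returns the recording in HDEMG space with the artefact components removed.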
2.4 Dataset and evaluation

To evaluate our method for artefact detection and elimination, we used 20 second long HDEMG recordings from 7 neurologically intact young subjects performing isometric contractions of the biceps brachii muscle, at 5, 10, 15 and 20 % of MVC, for a total of 28 HDEMG recordings. A 13 × 5 electrode array was used. Visual feedback on force was provided to the participants. All the experiments were conducted in accordance with the Declaration of Helsinki, and were approved by the local Ethical Committee.

In Section 3 we reported the mean ± the Standard Deviation (SD) of the number of outliers found in the Activity Index per HDEMG recording, the mean number of eliminated artefacts per HDEMG recording, the mean number of outliers per HDEMG where we identified at least one outlier, and the mean number of eliminated artefacts per HDEMG where we eliminated at least one artefact, as well as the mean interest metric IntMet_EX of the eliminated artefacts. We also provided a visual example of an elimination of an artefact.

3 Results

The mean number of outliers found in the Activity Index per HDEMG recording was equal to 1.25 ± 1.29. The mean number of eliminated artefacts per HDEMG recording was equal to 1.07 ± 1.18. The mean number of outliers in the Activity Index per HDEMG with at least one outlier was 2.06 ± 1.03, and the mean number of eliminated artefacts per HDEMG with at least one eliminated artefact present was 1.88 ± 0.96. The mean interest metric IntMet_EX of the eliminated artefacts was 0.85 ± 0.11. In the HDEMGs used in our study, each artefact was found in a separate ICA component. A representative example of our results is provided in Fig. 2.

[Figure 2: Artefact detection and elimination in ICA components of recorded HDEMG. The bottom panel depicts the I_A of all the ICA components (in blue) and the I_EX without the excluded component (in red). The green cross denotes the detected artefact time. The blue and red circles represent the detected outliers in I_A and I_EX, respectively. The second row from the bottom shows the 200 samples of Activity Index before and after the detected artefact time (green cross from the bottom panel). The blue line shows the I_A and the red line shows I_EX. The first dozen rows show the ICA components 200 samples before and after the detected artefact time (the green cross). The excluded component is depicted in red. It contains an artefact with an interest metric value of 0.96. All y-axis units are arbitrary.]

4 Discussion

Our results indicated that it was possible to eliminate artefacts in HDEMG using the Activity Index and ICA. However, it was difficult to accurately quantify the efficiency of this approach, as we did not know the ground truth about the artefacts' locations in time, nor the HDEMG channels where they were present. We could simulate certain kinds of artefacts at known times and in known HDEMG channels; however, this would only account for certain types of artefacts. For example, we could induce an artefact by touching certain electrodes during recording at a predefined time, or during the whole recording by incorrectly applying the contact gel. But we would still be left with other artefacts that we have little control over. Moreover, not all the artefacts have a significant impact on the Activity Index and on MU identification. By observing the Activity Index and comparing it to the HDEMG channels, it is quite clear that certain artefacts are more detrimental for the Activity Index than others. Therefore, without using the Activity Index or a similar metric, the assessment of artefact impact on EMG decomposition is often difficult. The interest metric threshold, above which we considered the artefacts to be successfully eliminated, was set to 0.5.

By using the ICA algorithm it would also be possible to reconstruct the original recordings from the ICA components. This would allow us to first transform the HDEMGs to ICA space, eliminate the artefacts to the best of our ability, and then transform the ICA components back to HDEMG space using simple matrix multiplication. In our current study, we eliminated the whole ICA component, but it would be worth considering "repairing" that component (e.g. locally setting the component elements to 0).
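A sketch of that repair idea, under the same assumptions as the previous snippet (ica and S as returned by FastICA); the window halfwidth is a hypothetical parameter, not a value from the paper.

    def repair_component(ica, S, j_art, x, halfwidth=100):
        # Zero only a window of the artefact component around time x,
        # then reconstruct the HDEMG with a single matrix multiplication.
        S_rep = S.copy()
        lo, hi = max(0, x - halfwidth), x + halfwidth
        S_rep[lo:hi, j_art] = 0.0
        return ica.inverse_transform(S_rep)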
In our method, determining the outliers in the Activity Index depends upon several parameters. The first one was the extension factor F, which was set to 1 for the purposes of this study. Also important was the scaled median absolute deviation (MAD) threshold, above which we considered an element of the Activity Index to be an outlier. In our case, we set it to 15. Lowering this threshold would yield more outliers in the Activity Index. Another parameter was the sliding window size in the Activity Index outlier detection, which in our case was set to 2048 samples. The presented outlier location detection could also be refined, as the currently described technique does not guarantee the identified outlier location to be at the location of the actual outlier peak. Instead, we aimed at identifying the first of the 200 samples that is at least 15 MAD away from the local median. This selection could have significant implications in the case of longer artefacts in HDEMG signals.

In our dataset, we found the mean interest metric IntMet_EX value of the eliminated artefacts to be 0.85 ± 0.11, while the threshold, above which we considered the artefacts to be eliminated, was set to 0.5. We also found slightly more outliers than actual artefacts, as identified by visual inspection of the HDEMG signals. This indicated that the current parameters (especially the extension factor for the Activity Index calculation F of 1, the MAD threshold for outlier detection of 15, and the interest metric threshold of 0.5) were suitable to identify artefacts in the HDEMGs. However, further tests are required to confirm these findings in other muscles and contraction levels.

Acknowledgment

This study was supported by the Slovenian Research Agency (Project J2-1731 and Program funding P2-0041).

References

[1] Holobar, A., and Zazula, D. Correlation-based decomposition of surface electromyograms at low contraction forces. Medical and Biological Engineering and Computing 42, 4 (2004), 487–495.
[2] Holobar, A., and Zazula, D. Gradient convolution kernel compensation applied to surface electromyograms. In International Conference on Independent Component Analysis and Signal Separation (2007), Springer, pp. 617–624.
[3] Holobar, A., and Zazula, D. Multichannel blind source separation using convolution kernel compensation. IEEE Transactions on Signal Processing 55, 9 (2007), 4487–4496.
[4] Hyvärinen, A. Independent component analysis: recent advances. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371, 1984 (2013), 20110534.
[5] Hyvärinen, A., and Oja, E. Independent component analysis: algorithms and applications. Neural Networks 13, 4-5 (2000), 411–430.

Zero-Knowledge Authentication

Jakob Povšič, Andrej Brodnik (University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Glagoljaška 8, 6000 Koper, Slovenia)
jakob.povsic@gmail.com, andrej.brodnik@upr.si
DOI: https://doi.org/10.18690/978-961-286-516-0.2

Abstract
Zero-Knowledge proofs (ZKPs) enable proving of mathematical statements, revealing nothing but their validity. We design an authentication system with a ZKP as a password verification mechanism within the Extensible Authentication Protocol (EAP) framework. Designing a secure password authentication system requires us to adopt security practices for protecting ourselves against the vulnerabilities of passwords. Integrating said practices is not trivial because of the tight coupling with the password verification method.

Keywords: extensible authentication protocol, zero-knowledge proofs, authentication, cryptography, key-stretching, passwords, quadratic residuosity problem

1 Introduction

Today privacy is a necessary sacrifice we have to make in order to take part in the digital world, imperative to our modern life. Every day, more digital systems gain access to our personal information. While this practice is often a necessary evil, many companies seek to exploit this position. Zero-knowledge proofs (ZKPs) are an intriguing cryptographic phenomenon for proving mathematical statements without revealing why they are true, and have the potential to change how our data exists in the digital space.

Our focus will be to define a simple use for zero-knowledge proofs. We will design an authentication system using a zero-knowledge proof as a password verification method, as an authentication method in the extensible authentication protocol (EAP). When designing the system, we need to protect ourselves from the vulnerabilities of passwords. However, integrating security methods presents a challenge because of the zero-knowledge proof system.

2 Extensible Authentication Protocol

Extensible authentication protocol [10] (EAP) is a general purpose authentication framework designed for network access authentication. EAP defines a set of messages that support negotiation and execution of a variety of authentication protocols. EAP is a two-party protocol between a peer and an authenticator at each end of a link.

Messages. The peer and the authenticator communicate by exchanging EAP messages. The protocol starts with the authenticator sending a message to the peer. They keep exchanging messages until the authenticator can either authenticate the peer or not. Messages are exchanged in a lock-step manner, where an authenticator sends a message and the peer responds to it. The authenticator dictates the order of messages, meaning it can send a message at any point of communication, as opposed to the peer, which can only respond to messages from the authenticator.

Messages are composed of fields, where the length of each field is a multiple of an octet of bits (Table 1).
Table 1: EAP Message Format

    Field:           Code  Identifier  Length  Type  Type-Data
    Length (octets): 1     1           2       1     n ≤ 2^16

We will store our authentication method data within the Type-Data field. Our EAP method is identified by the Type 84.

3 Zero-Knowledge Proofs

Zero-Knowledge Proofs [5, 6, 7] (ZKPs) are a concept in cryptography for proving the validity of mathematical statements. What makes them particularly interesting is that ZKPs can prove a statement revealing no information about why a statement is true, hence the term zero-knowledge. In mathematics, theorem proofs are logical arguments that establish truth through inference rules of a deductive system, based on axioms and other proven theorems. ZKPs are probabilistic, meaning they convince the verifier of the validity. We use the term convince, because ZKPs are not absolute truth, but the probability of someone being convinced by a false statement is arbitrarily small.

3.1 ZKP System for the Quadratic Residuosity Problem

Definition 3.1 (Quadratic Residuosity Problem) Given an integer x, a semiprime modulus n = pq, where p and q are unknown different primes, and a Jacobi symbol value (x/n) = 1, determine if x is a quadratic residue modulo n or not.

The law of quadratic reciprocity enables efficient computation of the Jacobi symbol value (x/n). However, when (x/n) = 1, it does not tell if x is a quadratic residue modulo n or not. x is only a quadratic residue modulo n if it is a quadratic residue of both modulo p and modulo q ((x/p) = (x/q) = 1). To compute this, we would have to know the factorization of n. However, since n is a product of two primes pq = n, this is computationally hard [2]. The only efficient way to prove x is a quadratic residue modulo n is with the root w.
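A small numeric illustration of this asymmetry, using sympy and a toy modulus n = 77 (far too small to be secure, but small enough for an exhaustive check): both 15 and 6 have Jacobi symbol 1 modulo 77, yet only 15 is a quadratic residue.

    from sympy.ntheory import jacobi_symbol

    n = 7 * 11  # toy semiprime

    def is_qr(x, n):
        # brute force; only feasible for toy moduli
        return any(w * w % n == x for w in range(n))

    for x in (15, 6):
        print(x, jacobi_symbol(x, n), is_qr(x, n))
    # 15 1 True   (15 = 57^2 mod 77)
    # 6  1 False  (a non-residue modulo both 7 and 11)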
The problem acts as a trapdoor function: it is hard to prove that x is a quadratic residue modulo n solely from x and n, while it is easy to prove when you know its root w.

The authors of [7] described a ZKP system for the quadratic residuosity problem. To prove x is a quadratic residue modulo n in zero-knowledge, we need to prove the existence of the root w, where w^2 ≡ x (mod n), without revealing w to the verifier. Here n is a semiprime with Jacobi symbol (x/n) = 1, x is the residue with w^2 ≡ x (mod n), and w is the root.

Table 2: ZKP Authentication with EAP

    1. Peer: u ∈_R Z_n, y = u^2 (mod n);  Peer → Authenticator: y
    2. Authenticator: b ∈_R {0, 1};  Authenticator → Peer: b
    3. Peer: z = u w^b (mod n);  Peer → Authenticator: z;
       the authenticator verifies z^2 ≡ y x^b (mod n)

In the ZKP authentication process of Table 2, the exchanged values are carried in the EAP message Type-Data field. The prover begins by picking a random integer u from the field Z_n, computing y = u^2 (mod n), and sending y to the verifier. The verifier picks a random bit b and sends it to the prover. The prover computes the value z with b and sends it back. The verifier checks the proof by asserting z^2 ≡ y x^b (mod n); this holds since

    z^2 ≡ y x^b (mod n)
    (u w^b)^2 ≡ u^2 (w^2)^b (mod n)
    u^2 w^{2b} ≡ u^2 w^{2b} (mod n).

For each round, a cheating prover has a 1/2 probability of succeeding, by correctly guessing the value of the random bit b. To improve the strength of the proof, we repeat this process m times for a confidence of 1 − 2^{−m}.
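As a sketch, one round of this protocol fits in a few lines of Python. Here both parties are collapsed into a single function, whereas in the EAP method the values y, b and z would each travel in a Type-Data field; zkp_round and authenticate are hypothetical names.

    import secrets

    def zkp_round(n, x, w):
        # One round of the quadratic-residuosity proof (Table 2).
        u = secrets.randbelow(n - 2) + 2           # prover's random u
        y = u * u % n                              # commitment
        b = secrets.randbits(1)                    # verifier's challenge
        z = u * pow(w, b, n) % n                   # prover's response
        return z * z % n == y * pow(x, b, n) % n   # verifier's check

    def authenticate(n, x, w, m=40):
        # m successful rounds give confidence 1 - 2**-m
        return all(zkp_round(n, x, w) for _ in range(m))

    n, w = 7 * 11, 57
    x = w * w % n          # toy values from the previous example
    print(authenticate(n, x, w))   # True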
To use this protocol as a password verification method, we can treat the root w as the password p = w known by the peer. The ZKP protocol proves that x is a quadratic residue modulo n, by proving the knowledge of the root w, where w^2 ≡ x (mod n). The peer will prove that x is a quadratic residue modulo n; to do this, however, the peer needs to prove the knowledge of the password p = w. With this, the authenticator can assert that the password is valid.

4 Password Protection

Password cracking [1] is an offline attack [8], where an attacker extracts passwords from data used by the authentication system for password verification. Protecting passwords on the data layer is of critical importance. Key-stretching [9, 1], also called password hashing, is the industry standard method for improving the security of low entropy secrets like passwords.

The quadratic residue x is derived from the root w = p and persistently stored with the authenticator. This introduces a vulnerability, as an attacker with access to x could crack the password w in an offline attack. To provide adequate security, we need to use key-stretching in our authentication method. A common application of a key-stretching method is to transform the vulnerable data stored in the authentication system. However, this approach does not work in our case. Let us revisit how the authenticator verifies the proof, and why key-stretching the password verification data (x) is an issue. We will begin by assuming the system can verify the proof after key-stretching the password verification data (x). As we define our process, we will see why it is not possible.

Key-Stretching x. In the last step of the protocol the authenticator verifies that

    z^2 ≡ y x^b (mod n).

If we stretch x with a function H and a salt s,

    H(x, s) = x_H,

we could then verify the proof only with an inverse function H^{-1}, by asserting

    z^2 ≡ y H^{-1}(x_H, s)^b (mod n).

This is possible assuming a polynomial algorithm H^{-1} exists; however, since key-stretching methods are based on hashing functions (one-way functions), we know that the probability of a polynomial algorithm H^{-1} successfully computing a pseudo-inverse is negligibly small. For all positive integers c [4],

    Pr[H(H^{-1}(H(x))) = H(x)] < |x|^{-c}.

Even if given unbounded time and resources, the pseudo-inverse x′ = H^{-1}(H(x)) might not be equal to x. The set I_x contains all values that map into H(x) = H(x′), and since H is not injective we know that |I_x| ≥ 1. This means that the probability that x′ = x is

    Pr[H^{-1}(H(x)) = x] = 1 / |I_x|.

Key-stretching x prevents us from verifying the ZKP. However, by increasing the entropy of the root w, we can eliminate the vulnerability and ensure adequate security. Our new approach will not treat the password p ≠ w as the root w. Instead, we will use the password p to derive the root w = H(p, s), using a key-stretching function H and a salt s. This way we have ensured the same level of protection against offline attacks as if we stretched the data stored in the system. And because we did not transform x, we can verify the proof without being affected by the issues mentioned in the previous paragraph. A similar approach is used in the PPP EAP SRP-SHA1 protocol [3]. Earlier we argued that the ZKP works as a password verification method because p = w; this argument is not true anymore. However, even though w ≠ p, the peer can only derive w knowing the password p, so when the peer proves the knowledge of w, it can only be so because they know p as well.

5 Secure Authentication

The authentication process now begins with the peer sending his identifier to the authenticator, and the authenticator responding with the peer's unique salt s and modulus n. The peer can now derive the root w from the password p and the salt s. This part of the process (Steps 1 and 2 in Table 3) happens only once. The peer can then authenticate by following the process described in §3.1. This part (Steps 3, 4 and 5) of the process is repeated m times for a confidence of 1 − 2^{−m}. The values exchanged in Table 3 are carried in the Type-Data field of the EAP messages.

Table 3: Improved ZKP Authentication with EAP

    1. Peer → Authenticator: identifier I
    2. Authenticator → Peer: salt s and modulus n; the peer derives w = H(p, s)
    3. Peer: u ∈_R Z_n, y = u^2 (mod n);  Peer → Authenticator: y
    4. Authenticator: b ∈_R {0, 1};  Authenticator → Peer: b
    5. Peer: z = u w^b (mod n);  Peer → Authenticator: z;
       the authenticator verifies z^2 ≡ y x^b (mod n)

Let us examine the EAP messages (Figure 1) of the authentication process described in Table 3. The mapping between EAP messages and the steps in Table 3 is not one-to-one. We merged some steps to reduce the number of message exchanges required for the process to complete.

[Figure 1: EAP Method Execution.]

Identity: This message is used to query the identity of the peer. In a system with multiple peers, this is required to identify the peer authenticating, and to find the correct salt s and quadratic residue x. (Table 3, Step 1.)

Setup: The peer needs both the salt s and the modulus n to compute the proof; however, he only knows the password p. Once the peer identifies himself, the authenticator needs to send him the salt s and the modulus n in the setup request message. (Table 3, Steps 2 and 3.)

Verification: With this message pair the peer and the authenticator exchange data to compute and verify the proof. The authenticator sends the random bit b and the peer responds with the proof z. The peer also sends the y_{i+1} for the next verification round i + 1; this is done as an optimisation to improve the speed of the process. (Table 3, Steps 4, 5 and 3.)

Success/Failure: After each verification message, the authenticator verifies the proof, and once it has done so successfully for m rounds, the authenticator sends the success message. However, if the proof is not valid, the authenticator must send a failure message.
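A sketch of the improved scheme, with PBKDF2 standing in for the unspecified key-stretching function H, and reusing the hypothetical zkp helpers and toy modulus from the previous snippet (a real deployment would use a large semiprime).

    import hashlib, secrets

    def derive_root(password: str, salt: bytes, n: int) -> int:
        # Key-stretched root w = H(p, s); PBKDF2-HMAC-SHA256 is one
        # possible choice of H, the iteration count is illustrative.
        h = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100_000)
        return int.from_bytes(h, 'big') % n

    n = 7 * 11   # toy modulus, as above

    # Enrollment: the authenticator stores (salt, n, x) per peer.
    salt = secrets.token_bytes(16)
    w = derive_root('correct horse battery staple', salt, n)
    x = w * w % n

    # Login: the peer re-derives w from the password and the salt
    # received in the Setup message, then runs the m rounds.
    assert authenticate(n, x, derive_root('correct horse battery staple', salt, n))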
6 Conclusions and Future Work

The aim of this work was to study the utility of zero-knowledge proofs as an EAP authentication method. We have presented an EAP method using a ZKP system for password verification. Additionally, we ensured adequate password protection by using a key-stretching method.

We have been successful in our goal of studying and using the ZKP protocol. While theoretically interesting, the system's performance may not be appropriate for real-world applications. The iterative nature of the underlying ZKP protocol accumulates communication latencies, slowing down the system.

Future work.

• The EAP method presented in this work can be implemented and tested in a real-world environment.
• The ZKP protocol used in this work is a first generation protocol. Today there are many newer protocols that have solved many shortcomings of the older generation ZKPs. Using a newer generation ZKP protocol can improve the performance of the authentication system.
• The ZKP protocol we have examined is iterative, which can cause worse performance. A parallel ZKP construction is assumed to have a weaker strength of zero-knowledge. However, in a real-world application, the performance improvements might justify the theoretical shortcomings.

References

[1] Blocki, J., Harsha, B., and Zhou, S. On the economics of offline password cracking. In 2018 IEEE Symposium on Security and Privacy (SP) (2018), IEEE, pp. 853–871.
[2] Buchmann, J. A. Factoring. Springer US, New York, NY, 2001, pp. 171–183.
[3] Carlson, J. D., Aboba, D. B. D., and Haverinen, H. PPP EAP SRP-SHA1 Authentication Protocol. Internet-Draft draft-ietf-pppext-eap-srp-03, Internet Engineering Task Force, July 2001. Work in Progress.
[4] Goldreich, O. Foundations of Cryptography: Volume 1, Basic Tools. Cambridge University Press, 2007.
[5] Goldreich, O., and Krawczyk, H. On the composition of zero-knowledge proof systems. SIAM Journal on Computing 25, 1 (1996), 169–192.
[6] Goldreich, O., Micali, S., and Wigderson, A. How to Prove all NP-Statements in Zero-Knowledge, and a Methodology of Cryptographic Protocol Design. vol. 263, pp. 171–185.
[7] Goldwasser, S., Micali, S., and Rackoff, C. The knowledge complexity of interactive proof systems. SIAM Journal on Computing 18, 1 (1989), 186–208.
[8] Grassi, P. A., Garcia, M. E., and Fenton, J. L. NIST Special Publication 800-63-3 Digital Identity Guidelines. National Institute of Standards and Technology, Los Altos, CA (2017).
[9] Hornby, T. Salted password hashing - doing it right, 2016.
[10] Vollbrecht, J., Carlson, J. D., Blunk, L., Aboba, D. B. D., and Levkowetz, H. Extensible Authentication Protocol (EAP). RFC 3748, June 2004.

Graphs where Search Methods are Indistinguishable

Matjaž Krnc and Nevena Pivač* (University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Glagoljaška 8, 6000 Koper, Slovenia)
matjaz.krnc@upr.si, nevena.pivac@iam.upr.si
DOI: https://doi.org/10.18690/978-961-286-516-0.3
* This work is supported in part by the Slovenian Research Agency (research programs P1-0285 and P1-0383, research projects N1-0102, J1-9110, N1-0160, N1-0209 and Young Researchers Grant).

Abstract
Graph searching is one of the simplest and most widely used tools in graph algorithms. Every graph search method is defined using some particular selection rule, and the analysis of the corresponding vertex orderings can aid greatly in devising algorithms, writing proofs of correctness, or recognition of various graph families. We study graphs where the sets of vertex orderings produced by two different search methods coincide. We characterise such graph families for ten pairs from the best-known set of graph searches: Breadth First Search (BFS), Depth First Search (DFS), Lexicographic Breadth First Search (LexBFS), Lexicographic Depth First Search (LexDFS), and Maximal Neighborhood Search (MNS).

Keywords: graph search methods, breadth first search, depth first search
1 Introduction

Graph search methods (for instance, Depth First Search and Breadth First Search) are among the essential concepts classically taught at the undergraduate level of computer science faculties worldwide. Various types of graph searches have been studied since the 19th century, and used to solve diverse problems, from solving mazes, to linear-time recognition of interval graphs, finding a minimal path-cover of co-comparability graphs, finding a perfect elimination order, or an optimal coloring of a chordal graph, and many others [1, 4, 7, 8, 10, 11].

In its most general form, a graph search (also generic search [5]) is a method of traversing vertices of a given graph such that every prefix of the obtained vertex ordering induces a connected graph. This general definition of a graph search leaves much freedom for a selection rule determining which node is chosen next. By defining some specific rule that restricts this choice, various different graph search methods are defined. The search methods that we focus on in this paper are Breadth First Search, Depth First Search, Lexicographic Breadth First Search, Lexicographic Depth First Search, and Maximal Neighborhood Search.

We briefly present the studied graph search methods in Section 2, and then state the obtained results in Section 3. Due to lack of space we omit the proofs and provide some directions for further work in Section 4. All proofs are available in the full version of the paper on www.arxiv.org.

2 Preliminaries

We now briefly describe the above-mentioned graph search methods, and give the formal definitions. Note that the definitions below are not given in a historically standard form, but rather as so-called three-point conditions, due to Corneil and Kruger [5] and also Brandstädt et al. [3]. Two vertices u, v ∈ V(G) satisfy the relation u <σ v if u appears before v in the ordering σ : V(G) → {1, 2, ..., n} of the vertices of G.

Breadth First Search (BFS), first introduced in 1959 by Moore [9], is a restriction of a generic search which puts unvisited vertices in a queue and visits the first vertex from the queue in the next iteration. After visiting a particular vertex, all its unvisited neighbors are put at the end of the queue, in an arbitrary order.

Definition 2.0.1 An ordering σ of V is a BFS-ordering if and only if the following holds: if a <σ b <σ c and ac ∈ E and ab ∉ E, then there exists a vertex d such that d <σ a and db ∈ E.

Any BFS ordering of a graph G starting in a vertex v results in a rooted tree (with root v), which contains the shortest paths from v to any other vertex in G (see [6]). We use this property implicitly throughout the paper.

Depth First Search (DFS), in contrast with BFS, traverses the graph as deeply as possible, visiting a neighbor of the last visited vertex whenever possible, and backtracking only when all the neighbors of the last visited vertex are already visited. In DFS, the unvisited vertices are put on top of a stack, so visiting the first vertex of the stack means that we always visit a neighbor of the most recently visited vertex.
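The two selection rules differ only in the data structure holding the discovered vertices, as the following sketch illustrates (hypothetical names; adjacency is given as a dict of vertex sets, and ties are broken arbitrarily, exactly as both definitions allow):

    from collections import deque

    def bfs_ordering(G, root):
        # queue discipline: first discovered, first visited
        order, seen, q = [], {root}, deque([root])
        while q:
            v = q.popleft()
            order.append(v)
            for u in G[v]:
                if u not in seen:
                    seen.add(u)
                    q.append(u)
        return order

    def dfs_ordering(G, root):
        # stack discipline: always continue from the most recent vertex
        order, seen, stack = [], set(), [root]
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            order.append(v)
            stack.extend(u for u in G[v] if u not in seen)
        return order

    G = {1: {2, 3}, 2: {1, 4}, 3: {1}, 4: {2}}
    print(bfs_ordering(G, 1), dfs_ordering(G, 1))  # [1, 2, 3, 4] [1, 3, 2, 4]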
Definition 2.0.2 An ordering σ of V is a DFS-ordering if and only if the following holds: if a <σ b <σ c and ac ∈ E and ab ∉ E, then there exists a vertex d such that a <σ d <σ b and db ∈ E.

Lexicographic Breadth First Search (LexBFS) was introduced in the 1970s by Rose, Tarjan and Lueker [10] as a part of an algorithm for recognizing chordal graphs in linear time. Since then, it has been used in many graph algorithms, mainly for the recognition of various graph classes.

Definition 2.0.3 An ordering σ of V is a LexBFS ordering if and only if the following holds: if a <σ b <σ c and ac ∈ E and ab ∉ E, then there exists a vertex d such that d <σ a and db ∈ E and dc ∉ E.

LexBFS is a restricted version of Breadth First Search, where the usual queue of vertices is replaced by a queue of unordered subsets of the vertices, which is sometimes refined, but never reordered.

Lexicographic Depth First Search (LexDFS) was introduced in 2008 by Corneil and Krueger [5] and represents a special instance of a Depth First Search.

Definition 2.0.4 An ordering σ of V is a LexDFS ordering if and only if the following holds: if a <σ b <σ c and ac ∈ E and ab ∉ E, then there exists a vertex d such that a <σ d <σ b and db ∈ E and dc ∉ E.

Maximal Neighborhood Search (MNS), introduced in 2008 by Corneil and Krueger [5], is a common generalization of LexBFS, LexDFS, and MCS, and also of Maximal Label Search (see [2] for definition).

Definition 2.0.5 An ordering σ of V is an MNS ordering if and only if the following statement holds: if a <σ b <σ c and ac ∈ E and ab ∉ E, then there exists a vertex d with d <σ b and db ∈ E and dc ∉ E.

The MNS algorithm uses a set of integers as the label, and at every step of the iteration chooses the vertex with the maximal label under set inclusion.

Corneil [5] exposed an interesting structural aspect of graph searches: particular search methods can be seen as restrictions, or special instances, of some more general search methods. For six well-known graph search methods they present a depiction, similar to the one in Figure 1, showing how those methods are related under set inclusion. For example, every LexBFS ordering is at the same time an instance of a BFS and an MNS ordering of the same graph. Similarly, every LexDFS ordering is at the same time also an instance of MNS, and of DFS (see Figure 1). The reverse, however, is not true, and there exist orderings that are BFS and MNS, but not LexBFS, or that are DFS and MNS but not LexDFS.

[Figure 1: On the left: Hasse diagram showing how graph searches are refinements of one another (Generic Search above BFS, DFS and MNS; MNS above LexBFS, MCS and LexDFS). On the right is a summary of our results: Green pairs are equivalent on {P4, C4}-free graphs. Violet pairs are equivalent on {pan, diamond}-free graphs. Blue pairs are equivalent on {paw, diamond, P4, C4}-free graphs.]
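Such three-point conditions translate directly into brute-force tests. A sketch for Definition 2.0.1 is given below (hypothetical name; G is a dict of neighbor sets, sigma a list); the checks for Definitions 2.0.2 to 2.0.5 differ only in where d is required to lie and whether dc ∉ E is also demanded.

    from itertools import combinations

    def is_bfs_ordering(G, sigma):
        pos = {v: i for i, v in enumerate(sigma)}
        # combinations preserves the input order, so a <σ b <σ c holds
        for a, b, c in combinations(sigma, 3):
            if c in G[a] and b not in G[a]:
                if not any(pos[d] < pos[a] and b in G[d] for d in sigma):
                    return False
        return True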
3 Problem description and results

Since the connections in Figure 1 represent relations of inclusion, it is natural to ask under which conditions on a graph G the particular inclusion holds also in the converse direction. More formally, we say that two search methods are equivalent on a graph G if the sets of vertex orderings produced by both of them are the same. We say that two graph search methods are equivalent on a graph class G if they are equivalent on every member G ∈ G. Perhaps surprisingly, three different graph families suffice to describe the graph classes equivalent for the ten pairs of graph search methods that we consider. Those are described in Theorems 3.1 to 3.3 below, but first we need a few more definitions.

All the graphs considered in the paper are finite and connected. A k-pan is a graph consisting of a k-cycle, with a pendant vertex added to it. We say that a graph is pan-free if it does not contain a pan of any size as an induced subgraph. A 3-pan is also known as a paw graph.

Theorem 3.1 Let G be a connected graph. Then the following is equivalent:

A1. Graph G is {P4, C4, paw, diamond}-free.
A2. Every graph search of G is a DFS ordering of G.
A3. Every graph search of G is a BFS ordering of G.
A4. Any vertex-order of G is a BFS, if and only if it is a DFS.

Theorem 3.2 Let G be a connected graph. Then the following is equivalent:

B1. Graph G is {pan, diamond}-free.
B2. Every DFS ordering of G is a LexDFS ordering of G.
B3. Every BFS ordering of G is a LexBFS ordering of G.
B4. Every graph search of G is an MNS ordering of G.

Theorem 3.3 Let G be a connected graph. Then the following is equivalent:

C1. Graph G is {P4, C4}-free.
C2. Every MNS ordering of G is a LexDFS ordering of G.
C3. Every MNS ordering of G is a LexBFS ordering of G.

From Theorems 3.1 and 3.2 we can immediately derive similar results for two additional pairs of graph search methods.

Corollary 3.3.1 Let G be a connected graph. Then the following is equivalent:

A1. Graph G is {P4, C4, paw, diamond}-free.
A5. Every graph search of G is a LexDFS ordering of G.
A6. Every graph search of G is a LexBFS ordering of G.

4 Conclusion and further work

In this paper we consider the major graph search methods and study the graphs in which vertex-orders of one type coincide with vertex-orders of some other type. Interestingly, three different graph families suffice to describe the graph classes equivalent for the ten pairs of graph search methods that we consider, which provides an additional aspect of similarities between the studied search methods.

Among the natural graph search methods not yet considered in this setting would be the Maximum Cardinality Search (MCS), introduced in 1984 (for definition see Tarjan and Yannakakis [12]). As shown in Figure 1, every MCS is a special case of an MNS vertex-order. While it is easy to verify that {P4, C4, paw, diamond}-free graphs do not distinguish between MNS and MCS vertex orders, Figure 2 provides examples of graphs which admit MNS, but not MCS vertex orders. Characterising graphs equivalent for MNS and MCS remains an open question.

[Figure 2: Graphs and corresponding orderings that are MNS and not MCS orderings. Seven small graphs on vertices a to e are shown, with orderings σ = (b, c, d, a, e), σ = (e, b, a, d, c), σ = (d, b, e, a, c), σ = (a, c, e, d, b), σ = (a, d, c, e, b), σ = (d, c, b, e, a) and σ = (c, a, d, e, b).]

Acknowledgements

The authors would like to thank prof. Martin Milanič for the initial suggestion of the problem, and Ekki Köhler and his research group, for introducing the diverse world of graph searches to us.

References

[1] Beisegel, J. Characterising AT-free graphs with BFS. In Graph-Theoretic Concepts in Computer Science (2018), A. Brandstädt, E. Köhler, and K. Meer, Eds., pp. 15–26.
[2] Berry, A., Krueger, R., and Simonet, G. Maximal label search algorithms to compute perfect and minimal elimination orderings. SIAM Journal on Discrete Mathematics 23, 1 (2009), 428–446.
[3] Brandstädt, A., Dragan, F. F., and Nicolai, F. LexBFS-orderings and powers of chordal graphs. Discrete Math. 171, 1-3 (1997), 27–42.
[4] Corneil, D. G., Dusart, J., Habib, M., and Kohler, E. On the power of graph searching for cocomparability graphs. SIAM Journal on Discrete Mathematics 30, 1 (2016), 569–591.
[5] Corneil, D. G., and Krueger, R. M. A unified view of graph searching. SIAM Journal on Discrete Mathematics 22, 4 (2008), 1259–1276.
[6] Even, S. Graph Algorithms. Cambridge University Press, 2011.
[7] Golumbic, M.
Algorithmic Graph Theory and Perfect Graphs. Annals of Discrete Mathematics, Volume 57. Elsevier, 2004, pp. 98–99.
[8] Köhler, E., and Mouatadid, L. Linear time LexDFS on cocomparability graphs. In Scandinavian Workshop on Algorithm Theory (2014), Springer, pp. 319–330.
[9] Moore, E. F. The shortest path through a maze. In Proc. Int. Symp. Switching Theory, 1959 (1959), pp. 285–292.
[10] Rose, D. J., Lueker, G. S., and Tarjan, R. E. Algorithmic aspects of vertex elimination on graphs. SIAM Journal on Computing 5, 2 (1976), 266–283.
[11] Tarjan, R. E. Depth-first search and linear graph algorithms. SIAM Journal on Computing 1, 2 (1972), 146–160.
[12] Tarjan, R. E., and Yannakakis, M. Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs. SIAM Journal on Computing 13, 3 (1984), 566–579.

System for Remote Collaborative Embedded Development

Martin Domajnko, Nikola Glavina, Aljaž Žel* (University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia)
martin.domajnko@student.um.si, nikola.glavina@student.um.si, aljaz.zel@student.um.si
DOI: https://doi.org/10.18690/978-961-286-516-0.4
* Listed in alphabetical order.

Abstract
This paper explores the challenges and devised solutions for embedded development which arose during the COVID-19 pandemic. While software development, nowadays with modern tools and services such as git, virtual machines and communication suites, is relatively unaffected by resource location, that is not the case for firmware and embedded systems, which rely on physical hardware for design, development, and testing. To overcome the limitations of remote work and obstructed access to actual hardware, two ideas were implemented and tested. The first is based on integrated circuit emulation, using QEMU to emulate an ARM core and custom software to facilitate communication with the embedded system. The second is remote programming and debugging over the internet, with a dedicated computer system acting as a middle man between a development environment and the physical hardware, using the OpenOCD debugger.

Keywords: embedded development, remote development, OpenOCD, QEMU, ARM, ARM semihosting
1 Introduction

During the design, development, and testing phases of our Passive Floating Probe project [7], we relied heavily on physical access to the hardware. The restrictions that were put in place during the COVID-19 pandemic did not stop the development process of our project, but they limited it heavily. The lack of working hardware for every member of the team was our major issue, so we tried to think up solutions to overcome that problem. The first idea, which would allow our work on the project to continue, was based on the QEMU [8] machine emulator. We emulated our integrated circuit with an ARM core processor and added a custom software layer, which served as an emulation of the real-world communication pipeline with the system. The solution proved useful, albeit with the restriction that it allowed only local development, prompting us to develop our second idea. Remote programming and debugging of our hardware over the internet was the main part of that solution. This was achieved with a dedicated computer system acting as a middle man between our development environment and the targeted hardware, enabling programming and debugging.

Finally, we present the structure of this paper. In the second section we explain in detail the implementations of both solutions. In section three we discuss the limitations of the implemented solutions and the lessons we learned. The paper concludes in section four with possible improvements.

2 Implementation of remote embedded development

For the work on the project to continue under COVID-19 restrictions, a solution had to be devised to solve the problems that came with remote work on a project dependent on specific hardware. This included hardware development, maintenance, and the ability to develop and test software on the hardware remotely. During the academic year, two solutions were devised and implemented.

2.1 Solution using emulator

The first solution was based on microcontroller emulation, as seen in Figure 1, using a specialized version of the QEMU emulation software called xPack [14], version 2.8.0-9. This enabled us to emulate a variant of the STM32 family of microcontrollers. The variant chosen was the STM32F407-Discovery development board, since it was closest to our target hardware, and we had access to a matching development board on which to test the differences. Using the emulator proved to be a great benefit: not all members had access to real hardware at that time, but everyone could set up the emulator on their computer. Since we did not have previous experience working with the STM family of microcontrollers, the emulator also allowed us to focus on platform-specific software issues without the added complexity of hardware issues for novice programmers. However, this also came at a cost. First, the emulator did not work out of the box. Second, no sensors can be connected to the board directly, since the interfaces in the emulator are virtual.

Figure 1: Schematic of the development system using emulation.

Before the emulator could be used with the desired family, a minor bug inside the emulator source code, which was causing a ROM memory overlap, needed to be patched manually.
After the initial setup, we established a basic form of IO system using a script that allowed us to run the emulator and automatically redirect the standard output, using UNIX pipes, to a file where we could monitor the output.

The emulator approach enabled us to test the version of the real-time operating system FreeRTOS [2] for STM32 microcontrollers. Some modifications to the compiler flags were required, especially the soft floating-point unit, because the emulator was unable to emulate a real FPU. While unable to connect sensors to the virtual environment, we mostly focused on creating a task manager, which took care of the correct operation of, and communication between, the individual tasks or sensors. Tasks that were supposed to be related to the sensors were temporarily replaced with empty functions returning predefined values for testing.

To enable outside communication with our embedded software, the semihosting feature of the ARM architecture was used [1]. A custom communication service was implemented in which standard input and output were redirected to netcat running locally, creating a network interface. After successful compilation of the program, the emulator could be run with the compiled ELF binary, redirecting the semihosting IO to stdio. An example in our case:

    < /dev/null nc -q -1 -l 5000 | qemu-system-gnuarmeclipse --verbose --verbose --board STM32F4-Discovery --mcu STM32F407VG -d unimp,guest_errors --image STM32F407-Discovery-blinky.elf --semihosting-config enable=on,target=native | nc -l 6000 > /dev/null

This allowed us to emulate a serial connection to and from the emulated microcontroller over a network connection.

2.2 Solution using remote development

To amend the shortcomings of development on the emulator, a remote development solution was devised and implemented. The solution, as seen in Figure 2, is based on a dedicated computer system leveraging the remote connection functionality of GDB, the "GNU Project Debugger" [4], for remote programming and real-time debugging.

Figure 2: Schematic of the remote development system.

The system was built using a Raspberry Pi 3 Model B [9], running the ARM version of Ubuntu 20.04 LTS [13]. The selected single-board computer handled the network communications, the attached peripherals and the services needed to facilitate remote programming and real-time debugging. For embedded development, an ST-LINK in-circuit debugger and programmer [11], in our case the STM32F407G-DISC1 development board [10] providing ST-LINK/V2-A, was connected via USB to the Raspberry Pi. To connect the GDB debugging functionality and the ST-LINK programming functionality with the target integrated circuit, the OpenOCD software [6], version 0.10.0, was used.

To prepare the setup, two configuration files in OpenOCD format needed to be created: the first defining the "hla_serial" value, describing the serial number of the connected ST-LINK device; the second defining the "-event gdb-detach" behavior as "resume". Defining this permitted the embedded program to keep running even after the debug session disconnected, allowing functionality to be tested over longer periods of time without a constant connection to the host machine.

With the prepared configuration files, a script was created to start the OpenOCD session with the correct parameters for the ST-LINK device, the target device, the network settings and the created configuration files. An example in our case:

    openocd -c "bindto $HOSTNAME" -c "gdb_port 3333" -c "tcl_port disabled" -c "telnet_port disabled" -f /usr/share/openocd/scripts/interface/stlink-v2.cfg -c "adapter_khz 480" -c "transport select hla_swd" -f /usr/share/openocd/scripts/target/stm32l4x.cfg -f ./gdb_resume.cfg -f ./serial.cfg >> log.txt
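The contents of the two configuration files are not reproduced in the paper. A minimal sketch of what they could look like, using standard OpenOCD commands and a placeholder serial number, is:

    # serial.cfg: pin this OpenOCD session to one specific ST-LINK probe
    hla_serial "<st-link-serial-number>"

    # gdb_resume.cfg: keep the firmware running after the GDB session detaches
    $_TARGETNAME configure -event gdb-detach { resume }

Running one OpenOCD instance per probe, each with its own serial file and gdb_port, is what enables the simultaneous use of multiple ST-LINK devices described below.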
The script was started inside a tmux [12] instance. This allowed the OpenOCD session to run without an active user connection to the shell executing the script, or, alternately, for multiple users to be connected to the same shell instance to observe debug print messages. Once the system was set up, multiple ST-LINK devices could be connected and used simultaneously by adding additional configuration files with serial numbers and starting OpenOCD sessions on different network ports.

3 Usage and lessons learned

The development process was arranged so that the required hardware development and maintenance were in the domain of our mentor and kept at the university. Team members could make requests for hardware modifications and then develop and debug the project software remotely, while concurrent remote software collaboration was achieved through the distributed revision control system Gitea [3], used as the source code repository and management tool.

3.1 Emulator

The emulator provided a good start into getting acquainted with embedded development. It allowed us to learn and develop embedded software without physical hardware. The downside was the inability to work with real sensors, which later made us switch to a remote system. Another problem was that the emulator was not capable of running functions that required precise timing, such as real-time code or interrupt execution.

3.2 Remote

The advantage of the remote system was that it made it possible for us to connect to the targeted hardware from anywhere, as if it were accessible locally. This included real-time debugging, with the ability to see the microcontroller processor states and memory values. The problem with this solution was the restriction of a single connection to the OpenOCD instance, which limited the work on the hardware to a single developer at a time. This was resolved with communication and access scheduling.
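For illustration, a client-side debugging session against such a setup could look as follows; this is a sketch rather than the authors' exact workflow, and the host name is a placeholder matching the "bindto" and "gdb_port 3333" settings above:

    arm-none-eabi-gdb firmware.elf
    (gdb) target extended-remote <raspberry-pi-host>:3333
    (gdb) monitor reset halt
    (gdb) load                  # program the target flash through ST-LINK
    (gdb) break main
    (gdb) continue

Detaching from this session leaves the firmware running, thanks to the gdb-detach resume event configured earlier.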
4 Conclusions

The system was sufficient to allow our work on the project to continue. During development, we were fortunate enough not to have major problems with security. The system, as it was used, would allow anyone to access it if they identified the network ports and protocols used. This was not a major concern, as it only allowed programming our particular microcontroller with a dedicated firmware; as such, this problem was not addressed during production. Possible additional security was tested, with the implementation of username and password authentication using the NGINX reverse proxy server [5] and httpd access restrictions on the system URL.

Acknowledgment

The authors acknowledge the financial support of the Institute of Computer Science of the Faculty of Electrical Engineering and Computer Science, and would like to thank mag. Jernej Kranjec for his guidance and assistance. The authors would also like to acknowledge the remaining members of the project group, namely Tilen Koren, Anna Sidorova and Viktorija Stevanoska, for their work on the project.

References

[1] ARM. Arm target input/output facilities. https://developer.arm.com/documentation/dui0471/g/Bgbjjgij, 2021. Accessed: 2021-07-30.
[2] FreeRTOS. Real time operating system for microcontrollers. https://www.freertos.org/, 2021. Accessed: 2021-07-30.
[3] Gitea. Lightweight code hosting solution. https://gitea.io, 2021. Accessed: 2021-07-30.
[4] GNU. The GNU Project Debugger. https://www.gnu.org/software/gdb/, 2021. Accessed: 2021-07-30.
[5] NGINX. Reverse proxy. https://www.nginx.com/, 2021. Accessed: 2021-07-30.
[6] OpenOCD. Open on-chip debugger. http://openocd.org/, 2021. Accessed: 2021-07-30.
[7] Perrone, M., Knupleš, U., Žalik, M., Keršič, V., and Šinko, T. Passive floating probe. In StuCoSReC: Proceedings of the 2019 6th Student Computer Science Research Conference (2019), pp. 13–17.
[8] QEMU. The FAST! processor emulator. https://www.qemu.org/, 2020. Accessed: 2020-04-30.
[9] Raspberry Pi. Raspberry Pi 3 Model B. https://www.raspberrypi.org/products/raspberry-pi-3-model-b/, 2021. Accessed: 2021-07-30.
[10] STM. Discovery kit with STM32F407VG MCU. https://www.st.com/en/evaluation-tools/stm32f4discovery.html, 2021. Accessed: 2021-07-30.
[11] STM. ST-LINK/V2 in-circuit debugger/programmer. https://www.st.com/en/development-tools/st-link-v2.html, 2021. Accessed: 2021-07-30.
[12] tmux. Terminal multiplexer. https://github.com/tmux/tmux/wiki, 2021. Accessed: 2021-07-30.
[13] Ubuntu. Ubuntu OS. https://ubuntu.com/download/raspberry-pi, 2021. Accessed: 2021-07-30.
[14] xPack. The xPack QEMU Arm. https://xpack.github.io/qemu-arm/, 2020. Accessed: 2020-04-30.

Leaf Segmentation of Rosette Plants using Rough K-Means in CIELab Color Space

Arunita Das (first author), Daipayan Ghosal, Krishna Gopal Dhal (corresponding author)
Dept. of Computer Science and Application, Midnapore College (Autonomous), Paschim Medinipur, West Bengal, India
arunita17@gmail.com, daipayan.ghosal@gmail.com, krishnagopal.dhal@midnaporecollege.ac.in
DOI https://doi.org/10.18690/978-961-286-516-0.5

Abstract. Segmentation of plant images plays an important role in modern agriculture, where it can provide accurate analysis of a plant's growth and possible anomalies. In this paper, a rough set based partitional clustering technique called Rough K-Means has been utilized in CIELab color space for the proper leaf segmentation of rosette plants. The efficacy of the proposed technique has been analysed by comparing it with the results of the traditional K-Means and Fuzzy C-Means clustering algorithms. The visual and numerical results reveal that RKM in CIELab provides the result nearest to the ideal ground truth, and is hence the most efficient one.

Keywords: image segmentation, color space, rough set, partitional clustering

1 Introduction

Leaf count and leaf region (i.e., plant area) are the key plant phenotyping qualities used to examine plant development and advancement [19] [25] [28], blossoming time [13], and yield potential. The leaf count can be approached in different manners from the AI viewpoint [1]; one such route is to count the number of leaves from the segmented plant area. Several image segmentation techniques have been reported in the literature for leaf segmentation. For example, Maximal Similarity-based Region Merging (MSRM) [18] is an interactive segmentation approach which utilizes a region merging framework over a super-pixel division. gPb-owt-ucm [2] is another segmentation method, dependent on spectral clustering and contour detection. The IPK technique [20] utilized a 3D histogram of the L*a*b* tone space of the plant images for supervised segmentation of foreground and background. Vukadinovic and Polder employed neural networks combined with watershed for leaf segmentation [24]. A collation study among several leaf segmentation algorithms has been presented in [27]. Well-known clustering techniques like KM, FCM, Self-Organizing Map (SOM), and Particle Swarm Optimization (PSO) were also applied for leaf segmentation in [11], where SOM outperformed the other tested methods visually and numerically. Beyond the above techniques, deep learning has also been utilized for leaf segmentation [15]; however, deep learning needs a large dataset in order to produce good results.

Therefore, it can be seen from the above discussion that different clustering strategies provide promising segmentation results. Although classical KM, FCM, SOM, and PSO have been utilized for leaf segmentation, rough set based K-Means (RKM) clustering has not been used in this area, to the best of the authors' knowledge. Rough k-means was developed by Lingras et al.
[16], and a refined version was proposed by Peters [21]; it performs in the image clustering domain as a similarity-based clustering model like KM and FCM [14] [10] [12] [22]. RKM has been applied efficiently for the proper segmentation of the tumor region in brain MRI images [14] [10], for white blood cell segmentation [12], and for satellite image segmentation [22]. As a consequence, the main contribution of this paper is the utilization of RKM in CIELab and its application to leaf segmentation. The proposed methodology, represented in Figure 1, has been compared with classical KM and FCM. Experimental results show the supremacy of the proposed approach over the other tested techniques.

The rest of the paper is organized as follows: Section 2 discusses the proposed methodology, Section 3 describes the experimental results, and the paper is concluded in Section 4.

2 Proposed Methodology

Clustering is a procedure of grouping a bunch of data into clusters that have superior intra-cluster and inferior inter-cluster resemblance. Two rudimentary types of image clustering practice are hard clustering and soft clustering. In hard clustering, one pixel can be the adherent of only one cluster; the proper example of this is K-means [5] [9] [7]. In contrast, soft clustering uses a fractional membership, which makes it more practicable for real-world usage: one pixel can be a fragment of several clusters, with some degree of belongingness described by the fractional membership. Fuzzy C-means, projected by Bezdek [3] [6], is a specimen of this mechanism. FCM is better than a hard clustering technique like K-means because it has more ability to handle the ambiguity of gray levels. In some cases, however, the fuzzy degree of membership may be too descriptive for interpreting clustering results. Therefore, researchers have applied rough set theory to k-means and developed the rough k-means [16] [21] clustering algorithm, which manages equidistant data objects, or overlapping clusters, using the upper and lower approximations of each cluster. Rough set-based clustering provides a solution that is less restrictive than conventional clustering and less descriptive (specific) than fuzzy clustering. In this study, Rough K-Means has been utilized to segment the leaf images.

Due to the non-uniform illumination of regions, the segmentation algorithm's performance is influenced by the color space used. According to the literature, perceptually uniform color spaces such as L*a*b* or L*u*v* achieve much better segmentation results than non-uniform color spaces such as RGB [23], which was developed for better color representation. As a first stage in our approach, we used MATLAB to transform all RGB images to the CIE L*a*b* color space, which yielded three components: L*, a*, and b*. "L*" denotes lightness, while "a*" and "b*" denote colors, "a*" denoting red-green and "b*" denoting blue-yellow, respectively. The flowchart of the proposed work is presented in Figure 1, and the brief mathematical implementation of RKM is described in Section 2.1.

Figure 1: Flowchart of the proposed methodology: input rosette plant RGB image, convert RGB to CIELab color space, find the cluster centers using Rough K-Means, reform the segmented image in RGB color space, output the leaf segmented region.

2.1 Rough K-Means (RKM)

Suppose a hypothetical clustering scheme is defined as Eq. (1), which partitions U depending on the equivalence relation P:

    U/P = {C1, C2, ..., Ck}   (1)

Again, assume that it may not be possible to describe the sets Ci, 1 ≤ i ≤ k, accurately, due to inadequate knowledge in the partition, but that it is possible to define each set Ci ∈ U/P using its lower approximation A(Ci) and upper approximation Ā(Ci). Let v and ci be the vector representations of a data object and of the cluster Ci, respectively.
Upper and lower approximations of only a few subsets of U are considered; hence it is not possible to verify all the properties of rough sets [16] [21]. However, the upper and lower estimates of Ci ∈ U/P are required to follow some of the basic rough set properties, which are as follows:

P1: A data object v can be a member of at most one lower approximation A(ci).
P2: If a data object v is part of the lower approximation A(ci), then it is also part of the upper approximation Ā(ci), i.e., v ∈ A(ci) implies v ∈ Ā(ci).
P3: If a data object v does not belong to any lower approximation A(ci), then it belongs to two or more upper approximations Ā(ci).

In rough k-means, the lower and upper approximations of the clusters are computed by the following rules. Let v be a data object and d(v, zi) the distance between v and zi, the centroid of cluster ci. Let d(v, zi) = min_{1≤j≤k} d(v, zj), and

    T = { j : d(v, zj) / d(v, zi) ≤ th and i ≠ j }   (2)

where th is a threshold value specified by the user. In order to classify a data object into the correct approximation(s), the following classification criteria are used:

R1: If the set T is not empty, the data object is classified into the upper approximations of both clusters i and j: if T ≠ ∅ then v ∈ Ā(ci) and v ∈ Ā(cj) for all j ∈ T.
R2: If T is empty, the data object is classified into the lower approximation of cluster i; by property P2 it is then also classified into the upper approximation of cluster i: if T = ∅ then v ∈ A(ci) and v ∈ Ā(ci).

Depending on the above deliberations, the steps of rough k-means are represented as Algorithm 1.

Algorithm 1: Procedure of Rough K-Means (RKM)
1  For each cluster and data object, find the distance d and the threshold set T.
2  Classify the data object into the lower and upper estimates using the classification criteria R1 and R2.
3  Calculate the new cluster center (mean) zi by the following expressions:
   if A(ci) ≠ ∅ and Ā(ci) − A(ci) = ∅ then zi = Σ_{v∈A(ci)} v / |A(ci)|;
   else if A(ci) = ∅ and Ā(ci) − A(ci) ≠ ∅ then zi = Σ_{v∈Ā(ci)−A(ci)} v / |Ā(ci) − A(ci)|;
   else zi = wl · Σ_{v∈A(ci)} v / |A(ci)| + wu · Σ_{v∈Ā(ci)−A(ci)} v / |Ā(ci) − A(ci)|,
   where wl + wu = 1 and usually wl > wu; the parameters wl and wu correspond to the relative importance of the lower and upper approximations, respectively.
4  If the algorithm converges, stop; otherwise, repeat steps 1 to 3.
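As an illustration only, one iteration of Algorithm 1 can be sketched in Python as follows. This is not the authors' MATLAB code: the Euclidean distance, the array layout and all names are our assumptions, and the threshold test follows Eq. (2) as printed.

    import numpy as np

    def rkm_iteration(X, Z, th=0.7, wl=0.6, wu=0.4):
        # X: (N, d) float data objects; Z: (k, d) float current centroids
        D = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)  # pairwise distances
        nearest = D.argmin(axis=1)
        k = len(Z)
        lower = [[] for _ in range(k)]
        boundary = [[] for _ in range(k)]     # boundary = upper minus lower
        for p, i in enumerate(nearest):
            T = [j for j in range(k) if j != i and D[p, j] / D[p, i] <= th]  # Eq. (2)
            if T:                             # rule R1: ambiguous object
                boundary[i].append(p)
                for j in T:
                    boundary[j].append(p)
            else:                             # rule R2: certain object
                lower[i].append(p)
        Znew = Z.copy()                       # empty clusters keep their centroid
        for i in range(k):
            if lower[i] and not boundary[i]:
                Znew[i] = X[lower[i]].mean(axis=0)
            elif boundary[i] and not lower[i]:
                Znew[i] = X[boundary[i]].mean(axis=0)
            elif lower[i] and boundary[i]:    # weighted combination, wl + wu = 1
                Znew[i] = wl * X[lower[i]].mean(axis=0) + wu * X[boundary[i]].mean(axis=0)
        return Znew

Iterating rkm_iteration until the centroids stop changing, the convergence test of step 4, completes the procedure.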
3 Experimental Results

The experiment was performed over 30 plant images, using Matlab R2018b on a Windows 10, x64-based PC with an Intel Core i5 CPU and 8 GB RAM. The plant images were collected from [17]. The parameter settings of the utilized clustering techniques are as follows. The number of cluster prototypes depends on the user, and is taken as 2 for all clustering techniques. For FCM, the fuzzification parameter is taken as 2, and the algorithm stops if the maximum difference between two successive partition matrices U is less than the minimal error threshold η; mathematically, if max[U^t − U^(t+1)] < η then stop, where η = 10^(−5). For KM and RKM, the procedure stops if the change in the centroid values is smaller than η. The rough set parameters for classical RKM are th = 0.7, wl = 0.6, and wu = 0.4. Threshold (th) selection in RKM is tough for different images; we experimented within the range 0 < th ≤ 10 and optimally set it to 0.7. The performance of the utilized clustering techniques was evaluated by calculating four ground-truth based performance evaluation parameters, namely accuracy, Dice, Jaccard, and the Matthews correlation coefficient (MCC), which are summarized in Table 1 [6] [26]. Here, TP = true positive, FP = false positive, TN = true negative, FN = false negative.

Table 1: Performance parameters considered for evaluation of the clustering methods.

1. Accuracy (AC): AC = (TP + TN) / (FN + FP + TP + TN). Accuracy is one metric for evaluating classification models; we calculate it to know how well the model predicts.
2. Dice Index (DI): DI = 2·TP / (2·TP + FP + FN). It combines the precision and recall concepts from information retrieval: it is the harmonic mean of precision and recall. DI values lie within the interval [0, 1], and a larger value indicates higher clustering quality.
3. Jaccard Index (JI): JI = DI / (2 − DI). The Jaccard similarity index measures the overlap between two sets, defined as the size of the intersection of the two sets divided by the size of their union. A higher value indicates more similarity between the two objects.
4. Matthews correlation coefficient (MCC): MCC = (TP·TN − FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)). MCC is a more reliable statistical rate, which produces a high score only if the prediction obtains good results in all four categories (TP, TN, FP, FN), proportionally both to the size of the positive elements and the size of the negative elements in the dataset. A higher value indicates better results.
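For reference, the four parameters of Table 1 can be computed from a predicted binary mask and its ground truth in a few lines. This sketch is ours, not the authors' code, and assumes boolean NumPy arrays:

    import numpy as np

    def quality_scores(pred, gt):
        # pred, gt: boolean arrays of the same shape (segmentation, ground truth)
        tp = float(np.sum(pred & gt))
        tn = float(np.sum(~pred & ~gt))
        fp = float(np.sum(pred & ~gt))
        fn = float(np.sum(~pred & gt))
        ac = (tp + tn) / (fn + fp + tp + tn)                  # Accuracy
        di = 2 * tp / (2 * tp + fp + fn)                      # Dice Index
        ji = di / (2 - di)                                    # Jaccard Index
        mcc = (tp * tn - fp * fn) / np.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))    # Matthews corr. coeff.
        return ac, di, ji, mcc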
Figure 2: Color segmentation results of the clustering techniques (KM, FCM, RKM) over five sample images.

The three clustering algorithms, i.e., KM, FCM, and RKM, were utilized to segment the leaves of the rosette plants. Figures 2 and 3 present the original color plant images, the ground truth leaf segmentations provided by the experts, and the binary segmented leaf parts produced by the employed clustering algorithms. Figures 2 and 3 clearly show that RKM provides the best leaf-based segmentation results.

Figure 3: Ground truth images and segmentation results of the clustering techniques in binary format over five sample images.

Beyond the visual analysis, the segmentation efficacy of the clustering algorithms was analyzed by computing the four segmentation quality parameters: the values for five plant samples are presented in Table 2, and the average values over the 30 images are given in Table 3. The best numerical values in Tables 2 and 3 are given in bold. Most of the values of the quality parameters clearly reveal that RKM provides superior outcomes to the other tested clustering algorithms. A graphical representation of the average quality parameters (recorded in Table 3) is also shown in Figure 4. The average execution times of the clustering algorithms over the 30 images are also presented in Table 3: KM needs the least computational effort, while RKM takes the largest execution time in the same environment.

Table 2: Numerical values of the segmentation quality parameters over five sample images (Accuracy / Dice / Jaccard / MCC).

Sample 1: KM 0.9756 / 0.9651 / 0.9325 / 0.9463; FCM 0.9743 / 0.9633 / 0.9292 / 0.9435; RKM 0.9851 / 0.9791 / 0.9590 / 0.9677
Sample 2: KM 0.9786 / 0.9720 / 0.9455 / 0.9548; FCM 0.9819 / 0.9762 / 0.9536 / 0.9617; RKM 0.9845 / 0.9798 / 0.9603 / 0.9675
Sample 3: KM 0.9730 / 0.9555 / 0.9147 / 0.9364; FCM 0.9761 / 0.9607 / 0.9244 / 0.9440; RKM 0.9773 / 0.9630 / 0.9287 / 0.9476
Sample 4: KM 0.9714 / 0.9541 / 0.9122 / 0.9336; FCM 0.9720 / 0.9552 / 0.9143 / 0.9350; RKM 0.9880 / 0.9810 / 0.9627 / 0.9723
Sample 5: KM 0.9738 / 0.9659 / 0.9341 / 0.9453; FCM 0.9793 / 0.9729 / 0.9473 / 0.9565; RKM 0.9838 / 0.9788 / 0.9585 / 0.9661

Table 3: Average numerical values of the segmentation quality parameters and execution time over 30 images.

KM:  Accuracy 0.9638, Dice 0.9470, Jaccard 0.9001, MCC 0.9200, Time 3.78 s
FCM: Accuracy 0.9593, Dice 0.9417, Jaccard 0.8909, MCC 0.9112, Time 4.56 s
RKM: Accuracy 0.9662, Dice 0.9531, Jaccard 0.9124, MCC 0.9293, Time 8.26 s

Figure 4: Graphical analysis of the average quality parameters for the clustering techniques.
4 Conclusion

This paper presented a Rough K-Means (RKM) based clustering algorithm in CIELab color space for leaf image segmentation of rosette plants. The proposed RKM based technique was compared against two well-known conventional clustering algorithms, namely K-Means and Fuzzy C-Means, over a dataset of 30 images. The experimental results reveal that the RKM based clustering algorithm outperforms the others and delivers the best outcomes in both the visual and the numerical analysis for the utilized segmentation parameters. The three main limitations of the proposed method are noise sensitivity, local optima trapping and large computational time. If researched further, it could become possible to analyze the plant's growth, or to detect visually identifiable signs of disease or damage, or any other possible visual anomaly. These results encourage further research into the improvement of RKM for image segmentation, such as the incorporation of nature-inspired optimization algorithms to overcome the local optima trapping problem [8] [4].

References

[1] Aich, S., and Stavness, I. Leaf counting with deep convolutional and deconvolutional networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops (2017), pp. 2080–2089.
[2] Arbelaez, P., Maire, M., Fowlkes, C., and Malik, J. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 5 (2010), 898–916.
[3] Bezdek, J. C., Ehrlich, R., and Full, W. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10, 2–3 (1984), 191–203.
[4] Dhal, K. G., Das, A., Gálvez, J., Ray, S., and Das, S. An overview on nature-inspired optimization algorithms and their possible application in image processing domain. Pattern Recognition and Image Analysis 30, 4 (2020), 614–631.
[5] Dhal, K. G., Das, A., Ray, S., and Das, S. A clustering based classification approach based on modified cuckoo search algorithm. Pattern Recognition and Image Analysis 29, 3 (2019), 344–359.
[6] Dhal, K. G., Das, A., Ray, S., and Gálvez, J. Randomly attracted rough firefly algorithm for histogram based fuzzy image clustering. Knowledge-Based Systems 216 (2021), 106814.
[7] Dhal, K. G., Fister Jr., I., Das, A., Ray, S., and Das, S. Breast histopathology image clustering using cuckoo search algorithm. In Proceedings of the 5th Student Computer Science Research Conference (2018), pp. 47–54.
[8] Dhal, K. G., Fister Jr., I., and Das, S. Parameterless harmony search for image multi-thresholding. In 4th Student Computer Science Research Conference (StuCoSReC 2017) (2017), pp. 5–12.
[9] Dhal, K. G., Gálvez, J., Ray, S., Das, A., and Das, S. Acute lymphoblastic leukemia image segmentation driven by stochastic fractal search. Multimedia Tools and Applications (2020), 1–29.
[10] Dobe, O., Sarkar, A., and Halder, A. Rough k-means and morphological operation-based brain tumor extraction. In Integrated Intelligent Computing, Communication and Security. Springer, 2019, pp. 661–667.
[11] Ghosal, D., Das, A., and Dhal, K. G. A comparative study among clustering techniques for leaf segmentation in rosette plants. Pattern Recognition and Image Analysis 31, 4 (2021).
[12] Inbarani H., H., Azar, A. T., et al. Leukemia image segmentation using a hybrid histogram-based soft covering rough k-means clustering algorithm. Electronics 9, 1 (2020), 188.
[13] Koornneef, M., Hanhart, C., van Loenen-Martinet, P., and Blankestijn-de Vries, H. The effect of daylength on the transition to flowering in phytochrome-deficient, late-flowering and double mutants of Arabidopsis thaliana. Physiologia Plantarum 95, 2 (1995), 260–266.
[14] Kumar, D. M., Satyanarayana, D., and Prasad, M. G. An improved Gabor wavelet transform and rough k-means clustering algorithm for MRI brain tumor image segmentation. Multimedia Tools and Applications 80, 5 (2021), 6939–6957.
[15] Kumar, J. P., and Domnic, S. Rosette plant segmentation with leaf count using orthogonal transform and deep convolutional neural network. Machine Vision and Applications 31, 1 (2020), 1–14.
[16] Lingras, P., and West, C. Interval set clustering of web users with rough k-means. Journal of Intelligent Information Systems 23, 1 (2004), 5–16.
[17] Minervini, M., Fischbach, A., Scharr, H., and Tsaftaris, S. A. Finely-grained annotated datasets for image-based plant phenotyping. Pattern Recognition Letters 81 (2016), 80–89.
[18] Ning, J., Zhang, L., Zhang, D., and Wu, C. Interactive image segmentation by maximal similarity based region merging. Pattern Recognition 43, 2 (2010), 445–456.
[19] Orlando, F., Napoli, M., Dalla Marta, A., Natali, F., Mancini, M., Zanchi, C., and Orlandini, S. Growth and development responses of tobacco (Nicotiana tabacum L.) to changes in physical and hydrological soil properties due to minimum tillage.
[20] Pape, J.-M., and Klukas, C. 3-D histogram-based segmentation and leaf detection for rosette plants. In European Conference on Computer Vision (2014), Springer, pp. 61–74.
[21] Peters, G. Some refinements of rough k-means clustering. Pattern Recognition 39, 8 (2006), 1481–1491.
[22] Raj, A., and Minz, S. Spatial rough k-means algorithm for unsupervised multi-spectral classification. In International Conference on Information and Communication Technology for Intelligent Systems (2020), Springer, pp. 215–226.
[23] Sarkar, S., Das, S., and Chaudhuri, S. S. A multilevel color image thresholding scheme based on minimum cross entropy and differential evolution. Pattern Recognition Letters 54 (2015), 27–35.
[24] Scharr, H., Minervini, M., French, A. P., Klukas, C., Kramer, D. M., Liu, X., Luengo, I., Pape, J.-M., Polder, G., Vukadinovic, D., et al. Leaf segmentation in plant phenotyping: a collation study. Machine Vision and Applications 27, 4 (2016), 585–606.
[25] Telfer, A., Bollman, K. M., and Poethig, R. S. Phase change and the regulation of trichome distribution in Arabidopsis thaliana. Development 124, 3 (1997), 645–654.
[26] Thanh, D. N., Prasath, V. S., Hien, N. N., et al. Melanoma skin cancer detection method based on adaptive principal curvature, colour normalisation and feature extraction with the ABCD rule. Journal of Digital Imaging (2019), 1–12.
[27] Vukadinovic, D., and Polder, G. Watershed and supervised classification based fully automated method for separate leaf segmentation. In The Netherlands Congress on Computer Vision (2015), pp. 1–2.
[28] Walter, A., and Schurr, U. The modular character of growth in Nicotiana tabacum plants under steady-state nutrition. Journal of Experimental Botany 50, 336 (1999), 1169–1177.

Adversarial Image Perturbation with a Genetic Algorithm

Rok Kukovec, Špela Pečnik, Iztok Fister Jr., Sašo Karakatič
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
rok.kukovec@student.um.si, spela.pecnik@um.si, iztok.fister1@um.si, saso.karakatic@um.si
DOI https://doi.org/10.18690/978-961-286-516-0.6

Abstract. The quality of image recognition with neural network models relies heavily on filters and parameters optimized through the training process. These filters are different compared to how humans see and recognize objects around them. The difference between machine and human recognition yields a noticeable gap, which is prone to exploitation. The workings of these algorithms can be compromised with adversarial perturbations of images, where images are modified seemingly imperceptibly, such that humans see little to no difference, but the neural network classifies the motif incorrectly. This paper explores adversarial image modification with an evolutionary algorithm, such that the AlexNet convolutional neural network cannot recognize previously clear motifs, while the human perceptibility of the image is preserved. The experiment was implemented in Python and tested on the ILSVRC dataset. Original images and their recreated counterparts were compared and contrasted using visual assessment and statistical metrics. The findings suggest that the human eye, without prior knowledge, will hardly spot the difference compared to the original images.

Keywords: adversarial perturbation, AlexNet, CNN, computer vision, evolutionary algorithms

1 Introduction

Computer vision algorithms are already used widely in everyday applications, but safety concerns persist regarding their reliability. Leaving vital decisions to them can cause dire consequences in cases of error; therefore, additional caution is necessary in most use cases. Such algorithms have to be tested extensively before they are allowed to make such decisions on their own.

Deep neural networks are currently the state-of-the-art technology for recognizing motifs in an image. Computer vision achieves near-human-level accuracy in recognition, and the question arises of the key differences between human and computer vision.
Recognition models return predicted labels and their corresponding certainties; the problem arises when there are high certainties for wrong labels [2].

This paper presents an approach to adversarial image perturbation with evolutionary algorithms, with the goal of misguiding the AlexNet convolutional neural network (CNN). The implemented approach demonstrates how simple and effective adversarial perturbation is, and how vulnerable everyday image recognition models are. It aims to recreate an image as similar to the original as possible, keeping the human perception of the motif intact, while maximizing the error of the image recognition model: pixel values in certain places are changed such that computer vision fails to classify the image correctly.

2 Related work

The inspiration for this paper derives from [9], where the authors implemented an adversarial perturbation deceiving computer vision by changing only one pixel in the original image. That attack was carried out on images of very low resolution, which is the reason for its success. In the paper authored by Fawzi et al. [1], an analysis was made of the resistance of computer vision algorithms to adversarial disturbances. The existence of adversarial examples was confirmed, as there is an upper bound on robustness. The goal was to find the correlation between robustness against random noise and against adversarial noise. As long as the boundary is so high that the recreated image has to be completely distorted, it does not indicate a problem; a problem arises if the image is human-recognizable and the recognition algorithm fails its prediction with high certainty. Several different models of machine learning, including CNNs, misclassify adversarial examples consistently. These are intentionally created, small interferences that are detrimental to the recognition algorithm [8]. The paper [13] shows that a universal adversarial perturbation is possible: one adversarial noise filter can be applied routinely to many different images. In the paper [2], there are examples of specifically produced images in which the human eye sees only random noise, yet the algorithm is nearly certain that there is a motif. The paper [4] shows that a successful adversarial perturbation against one neural network is likely to succeed against a variety of network architectures trained on different data sets.

A distinctive quality of this paper is that it is readily accessible to non-experts. It shows that implementations of adversarial perturbations are not limited to teams of advanced researchers supported by technical and financial capabilities: the attacker requires only a basic understanding of machine learning. The experiment uses only open-source libraries and a small amount of understandable custom code. Despite the straightforward approach, the results are comparable to the work mentioned above.
3 Implementing adversarial perturbation on the AlexNet CNN

The main objective of the approach is that the solution image is modified in accordance with two objectives: (1) similarity to the original image, and (2) AlexNet's certainty during misrecognition. The optimization method pursuing these objectives is a genetic algorithm. The goal is that no change can be noticed by the human eye in the reproduced image without prior knowledge.

The following Python libraries were used:
• NumPy [11] for numerical calculations,
• OpenCV [3] and Pillow [6] for image preprocessing,
• Scikit-image [14] for the structural similarity index measure metric,
• PyTorch [12] for a pre-trained AlexNet CNN,
• GARI (Genetic Algorithm for Reproducing Images) for the EA (evolutionary algorithm) [7].

The proposed approach is divided into the following interconnected parts:
• generation of sets of candidate solutions,
• evaluation of the image similarity between the candidate solution and the original image, using the normalized average of the absolute pixel differences,
• classification of the candidate solution using the AlexNet image recognition model,
• computation of the fitness value for the candidate solution,
• selection of the fittest candidate images for further reproduction.

Figure 1 shows the initial idea of the implementation of the adversarial perturbation. The start block represents the execution of the experiment with any given original image. The evolutionary algorithm creates candidate solutions, which are evaluated using two separate criteria that together form a fitness function. The fitness function combines both results using fuzzy logic's AND operator (∧): the value of the normalized average absolute pixel difference is multiplied by AlexNet's certainty in a wrong prediction. The process is repeated iteratively until the termination condition is reached.

Figure 1: Diagram of the proposed adversarial perturbation.

Algorithm 1: Algorithm in pseudo-code
Data: original image
Result: recreated image
1 initialization;
2 create first candidate solution;
3 while termination goal not reached do
4   calculate similarity score;
5   check AlexNet's certainty in a wrong prediction;
6   calculate fitness value;
7   send score to evolutionary algorithm;
8   create new candidate solutions;
9 end

It turned out that the combination of AlexNet's predictions, the genetic algorithm and the evaluation of the fitness function was very time-consuming; thus, it was not possible to recreate the image within the set time frame to the point of recognition by the human eye. The bottleneck was the time-consuming evaluation of the candidate solutions by AlexNet, which renders the attack infeasible for use cases where real-time solutions are needed.

We bypassed this bottleneck somewhat by not running AlexNet at all during the first 80,000 iterations, since the first recreated images are random noise, which was optimized towards our goal. Initially, the only feedback given to the EA was the similarity score. It turned out that the recreated image was recognizable to the human eye much earlier than to AlexNet, so AlexNet's predictions were only calculated after the candidate image was sufficiently similar. Once AlexNet recognizes the image, the evolutionary algorithm can start calling our final fitness function, which comprises both the similarity score and AlexNet's predicted class with its corresponding certainty.
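The final fitness function described above can be illustrated with the following sketch. This is our reconstruction, not the authors' code: the preprocessing of the candidate image for AlexNet (resizing to 224 x 224 and ImageNet normalization) is omitted for brevity, and all names are illustrative.

    import numpy as np
    import torch
    import torch.nn.functional as F

    def fitness(candidate, original, model, true_label):
        # criterion 1: similarity, from the normalized average absolute pixel difference
        diff = np.abs(candidate.astype(np.float64) - original.astype(np.float64))
        similarity = 1.0 - diff.mean() / 255.0
        # criterion 2: AlexNet's certainty in the best *wrong* label
        x = torch.from_numpy(candidate).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        with torch.no_grad():
            probs = F.softmax(model(x), dim=1).squeeze(0)
        probs[true_label] = 0.0
        wrong_certainty = probs.max().item()
        # fuzzy AND: the product is large only when both criteria are large
        return similarity * wrong_certainty

The product form is what makes the 0.99 benchmark of Section 4 meaningful: a fitness above 0.99 can only be reached when both factors exceed 0.99.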
Figure 2 shows the working version of the experiment, presenting the detailed control flow which dictates the entry of AlexNet into the fitness value calculation. The experiment is divided into two phases. Phase one consists mainly of quick operations: no phase transition conditions are checked during the first 80,000 generations. Depending on the image, AlexNet started giving the first correct classification at about 30,000 generations. Towards the end of the first phase, correct classification is checked; if the prediction is correct, we advance to the second phase, which aims to create an adversarial perturbation. The output of AlexNet is an array of sorted certainties with labels. For the calculation of the fitness function, the value is taken from the incorrect label with the highest certainty, and is combined with the similarity score.

Figure 2: Flowchart of the final implementation of the proposed approach.

4 Results

The results of the experiment were evaluated visually and using statistical metrics. The terminating conditions were set as follows:
• a time limit of 2 hours is reached,
• the calculated fitness exceeds 0.99,
• the algorithm finishes both phases.

The benchmark value was set to 0.99, since it forces both factors, the normalized average of absolute pixel differences and AlexNet's certainty in a wrong prediction, to be above 0.99; the product of two numbers between 0 and 1 is smaller than either factor.

Figure 3: Original images, adversarial filters and recreated images (leafhopper, manhole, maze, nautilus, strawberry).

Since images are difficult to evaluate qualitatively, and the normalized mean of the sum of absolute errors was already used in the evaluation process, new statistical metrics were introduced:
• Mean Squared Error (MSE),
• Peak Signal-to-Noise Ratio (PSNR), and
• Structural Similarity Index Measure (SSIM).
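All three metrics are available in scikit-image, which the experiment already uses for SSIM; a minimal sketch (ours, not the authors' evaluation script) for two equally sized RGB images:

    from skimage.metrics import (mean_squared_error,
                                 peak_signal_noise_ratio,
                                 structural_similarity)

    def perceptual_metrics(original, recreated):
        # original, recreated: uint8 RGB arrays of the same shape
        mse = mean_squared_error(original, recreated)
        psnr = peak_signal_noise_ratio(original, recreated)   # reported in dB
        ssim = structural_similarity(original, recreated, channel_axis=-1)
        return mse, psnr, ssim

In older scikit-image releases the last call takes multichannel=True instead of channel_axis=-1.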
The results showed a promising direction, but they were not optimized fully, due to operational limitations. The compromise agreed upon was deceiving AlexNet's prediction towards the closest label in the feature space.

4.1 Examples of misclassified images

The results of the experiment are shown in Table 1; the recreated images are shown in Figure 3.

Table 1: Results for the images in Figure 3.

Original category | Category after attack | Certainty in misclassified label
Leafhopper | Lacewing | 99.97%
Manhole cover | Electric ray | 99.98%
Maze | Hay | 99.97%
Nautilus | Brain coral | 99.98%
Strawberries | Bell pepper | 99.97%

The calculated metrics on the different recreated images achieve relatively high values, and the human eye recognizes the motif of the image. The attack was carried out successfully; the results are shown in Table 2.

Table 2: Calculated metrics of the recreated images.

Picture | MSE | PSNR | SSIM
Agama | 768.69 | 29.51 dB | 0.62
Baseball | 956.16 | 29.04 dB | 0.56
Leafhopper | 666.89 | 29.52 dB | 0.77
Manhole cover | 642.42 | 29.53 dB | 0.77
Maze | 270.50 | 31.11 dB | 0.79
Nautilus | 396.71 | 30.58 dB | 0.81
Nautilus 2 | 667.03 | 29.65 dB | 0.73
Panda | 908.09 | 29.10 dB | 0.75
Rosehip | 944.62 | 29.29 dB | 0.68
Strawberry | 394.05 | 31.52 dB | 0.83
Sulphur butterfly | 1015.85 | 28.82 dB | 0.61
Upright piano | 975.15 | 29.00 dB | 0.66

One of the goals set was to recreate images at the input resolution of AlexNet (224 x 224 pixels). This goal was not reached, because the time-complexity growth rate was non-linear; the recreated images were around 100 x 100 pixels in resolution. Figure 3 shows recreated images in which, without prior knowledge, it is hard to spot the difference, taking into account that the images are relatively small.

5 Discussion

Despite the limitations of the experiment, we showed that adversarial perturbations can be implemented in a relatively short time with the help of genetic algorithms. Future research may point in one of the following six directions:
• speeding up the optimization process,
• deceiving computer vision into a custom label,
• selecting a more complex CNN,
• testing other optimization methods (e.g., other nature-inspired algorithms [5]),
• testing with only some features used [10], to speed up the optimization process, and
• protection against adversarial noise.

References

[1] Alhussein Fawzi, et al. Analysis of classifiers' robustness to adversarial perturbations. CoRR abs/1502.02590 (2015).
[2] Anh Mai Nguyen, et al. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. CoRR abs/1412.1897 (2014).
[3] Bradski, G. The OpenCV Library. Dr. Dobb's Journal of Software Tools (2000).
[4] Christian Szegedy, et al. Intriguing properties of neural networks, 2014.
[5] Fister Jr., I., et al. A brief review of nature-inspired algorithms for optimization. arXiv preprint arXiv:1307.4186 (2013).
[6] Fredrik Lundh. Pillow - Pillow (PIL Fork) 8.2.0 Documentation, 2021.
[7] Gad, A. F. ahmedfgad/GARI: GARI (Genetic Algorithm for Reproducing Images) reproduces a single image using a Genetic Algorithm (GA) by evolving pixel values.
[8] Ian J. Goodfellow, et al. Explaining and harnessing adversarial examples, 2015.
[9] Jiawei Su, et al. One pixel attack for fooling deep neural networks. CoRR abs/1710.08864 (2017).
[10] Karakatič, S. EvoPreprocess: data preprocessing framework with nature-inspired optimization algorithms. Mathematics 8, 6 (2020), 900.
[11] NumPy. What is NumPy? NumPy v1.20 Manual, 2021.
[12] Paszke, A., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 8024–8035.
[13] Seyed-Mohsen Moosavi-Dezfooli, et al. Universal adversarial perturbations. CoRR abs/1610.08401 (2016).
[14] van der Walt, S., et al. scikit-image: image processing in Python. PeerJ 2 (2014), e453.

Fast Recognition of Some Parametric Graph Families

Nina Klobas (Durham University, Department of Computer Science, Upper Mountjoy Campus, Stockton Road, Durham DH1 3LE, United Kingdom, nina.klobas@durham.ac.uk)
Matjaž Krnc (University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Glagoljaška ulica 8, 6000 Koper, Slovenia, matjaz.krnc@famnit.upr.si)
DOI https://doi.org/10.18690/978-961-286-516-0.7

Abstract. Recognizing graphs with a high level of symmetry is hard in general, and usually requires additional structural understanding. In this paper we study a particular graph parameter and motivate its usage by devising an efficient recognition algorithm for the family of I-graphs. For integers l, λ, m, a simple graph is [l, λ, m]-cycle regular if every path of length l belongs to exactly λ different cycles of length m. We identify all [1, λ, 8]-cycle regular I-graphs and, as a consequence, describe a linear recognition algorithm for the observed family. A similar procedure can be used to devise recognition algorithms for double generalized Petersen graphs and folded cubes. Besides that, we believe the structural observations and methods used in the paper are of independent interest, and could be used for solving other algorithmic problems.

Keywords: I-graphs, double generalized Petersen graphs, folded cubes, recognition algorithm, cycle regularity.

1 Introduction

Important graph classes such as bipartite graphs, (weakly) chordal graphs, perfect graphs and forests are defined or characterized by their cycle structure. A particularly strong description of a cyclic structure is the notion of cycle-regularity, introduced by Mollard [11]:

    For integers l, λ, m, a simple graph is [l, λ, m]-cycle regular if every path on l + 1 vertices belongs to exactly λ different cycles of length m.

It is perhaps natural that cycle-regularity mostly appears in the literature in the context of symmetric graph families such as hypercubes, Cayley graphs or circulants. Understanding the structure of subgraphs of hypercubes which avoid all 4-cycles does not seem to be easy; indeed, a question of Erdős regarding how many edges such a graph can contain remains open after more than 30 years [5].

In this paper we study cycle-regularity and more general cyclic aspects of the family of I-graphs, with the focus of devising an efficient recognition algorithm. A similar approach can be extended to two other graph families, namely double generalized Petersen graphs and folded cubes. Due to space constraints the study of these two families is not covered here; we defer interested readers to the full version of this paper [9].

Figure 1: I-graph I(12, 2, 3), double generalized Petersen graph DP(10, 2), and folded cube FQ4.

I-graphs were introduced in the Foster census [6], and are trivalent (cubic) graphs with edges connecting the vertices of two star polygons. They form a natural generalization of the well-known generalized Petersen graphs, introduced in 1950 by Coxeter [3] and later named by Watkins in 1969 [14]. The family of I-graphs has been studied extensively with respect to their automorphism group and isomorphisms [1, 7, 13], Hamiltonicity [2], spectrum [12], and independence number [4, 8].

Our first result identifies all [1, λ, 8]-cycle regular members and determines the corresponding values of λ.
Theorem 1. An arbitrary I-graph is never [1, λ, 8]-cycle regular, except when isomorphic to I(n, j, k) where j = 1 and (n, k) ∈ {(3, 1), (4, 1), (5, 2), (8, 3), (10, 2), (10, 3), (12, 5), (13, 5), (24, 5), (26, 5)}.

These structural results are used to devise the recognition algorithm for I-graphs.

Theorem 2. I-graphs can be recognized in linear time.

If the input graph is a member of the observed family, we
It is easy to Definition 4. Let n, j, k be positive integers for which see that an arbitrary 8-cycle can have either 4 , 0 or 2 n ≥ 3 and n ≥ j, k ≥ 1. An I-graph I( n, j, k) is a graph spoke edges, so we obtained this list by distinguishing 8-on vertices { u cycles by the number of spoke edges they admit. In the 0 , u 1 , . . . , un−1 , w 0 , w 1 , . . . , wn−1} , with the edge set consisting of outer edges u case of the 8-cycle admitting 2 spoke edges, we further iui+ j , inner edges w distinguish cases by the number of outer and inner edges iwi+ k and spoke edges uiwi, where the subscripts are taken modulo n. within a given 8-cycle. Due to space constraints we present the analysis of 8- Without loss of generality we always assume that j, k < cycles admitting 4 spoke edges only, as it is the easiest n/ 2. Since I( n, j, k) is isomorphic to I( n, k, j), we re-case to deal with (for remaining cases see [9]). This strict ourselves to cases when j ≤ k. It is well known analysis leads to complete characterisation of 8-cycles for [1] that an I-graph I( n, j, k) is disconnected whenever the family of I-graphs, presented in the Table 1. d = gcd( n, j, k) > 1. In this case it consists of d copies of I( n/d, j/d, k/d). Therefore, throughout the paper 8-cycles with 4 spoke edges In addition to 4 spoke we consider only graphs I( n, j, k) where gcd( n, j, k) = edges the 8-cycle must also have two inner and two outer 1. We also know [7] that two I-graphs I( n, j, k) and edges. When using the spoke edge there are two options I( n, j 0 , k 0) are isomorphic if and only if there exists an for choosing an inner (outer) edge. After considering all integer a, which is relatively prime to n, for which either cases it is easy to see that there can be just two such { j 0 , k 0} = { aj (mod n) , ak (mod n)} or { j 0 , k 0} = { aj 8-cycles, C∗ (see Claim 5), which exists whenever j 6= k, (mod n) , − ak (mod n)}. Throughout the paper, when-and C ever we discuss I-graphs with certain parameters, we con- 0, which is of the following form: sider only the lexicographically smallest possible param- C 0 = ( w 0 , w± k, u± k, u± k± j, w± k± j, w±2 k± j, u±2 k± j, u±2 k±2 j) . eters by which the graph is uniquely determined. Cycle C 0 exists whenever 2 k + 2 j = n. One can verify easily, that n applications of the rotation ρ to C∗ and n/ 2 2.1 Equivalent 8-cycles applications of the rotation ρ to cycle C 0 maps the cycle back to itself. Therefore there are n equivalent cycles to A particular member of automorphism group of every I- C∗ and n/ 2 equivalent cycles to C 0 in an I-graph I( n, j, k) graph is a rotation defined as: ρ( ui) = ui+1, ρ( wi) = and they contribute (2 , 4 , 2) and (1 , 2 , 1), respectively, to wi+1. Clearly, applying n times the rotation ρ yields an the graph octagon value. identity automorphism. When acting on I-graphs with ρ we get 3 edge orbits: orbit of outer edges EJ , orbit 2.3 Obtaining constant octagon value of spoke edges ES and orbit of inner edges EI . Edges from the same orbit EJ , ES, or EI have the same octagon Every 8-cycle of an I-graph contributes to the octagon value, which we denote by σJ , σS and σI , respectively. value of each edge partition. 
Table 1: All non-equivalent 8-cycles of I-graphs, their existence conditions, their contribution τ(C) towards the octagon value of an I-graph, and the number γ(C) of their equivalent cycles in an I-graph. For C3–C7, the two rows of a label give the two mirror representatives with their respective existence conditions.

Label | Representative 8-cycle                                        | Existence conditions       | τ(C)      | γ(C)
C∗    | (w0, w±k, u±k, u±k±j, w±k±j, w±j, u±j, u0)                    | k ≠ j and n > 4            | (2, 4, 2) | n
C0    | (w0, w±k, u±k, u±k±j, w±k±j, w±2k±j, u±2k±j, u±2k±2j)         | 2k + 2j = n                | (1, 2, 1) | n/2
C1    | (u0, uj, u2j, u3j, u4j, u5j, u6j, u7j)                        | 8j = n or 3n               | (0, 0, 1) | n/8
C2    | (w0, wk, w2k, w3k, w4k, w5k, w6k, w7k)                        | 8k = n or 3n               | (1, 0, 0) | n/8
C3    | (w0, wk, w2k, w3k, w4k, w5k, u5k, u5k+j)                      | 5k + j = n or 2n           | (5, 2, 1) | n
      | (w0, wk, w2k, w3k, w4k, w5k, u5k, u5k−j)                      | 5k − j = n or 2n           |           |
C4    | (u0, uj, u2j, u3j, u4j, u5j, w5j, w5j+k)                      | k + 5j = n or 2n           | (1, 2, 5) | n
      | (u0, uj, u2j, u3j, u4j, u5j, w5j, w5j−k)                      | 5j − k = 2n or n or 0      |           |
C5    | (w0, wk, w2k, w3k, w4k, u4k, u4k+j, u4k+2j)                   | 4k + 2j = n or 2k + j = n  | (4, 2, 2) | n
      | (w0, wk, w2k, w3k, w4k, u4k, u4k−j, u4k−2j)                   | 4k − 2j = n                |           |
C6    | (u0, uj, u2j, u3j, u4j, w4j, w4j+k, w4j+2k)                   | 2k + 4j = n or k + 2j = n  | (2, 2, 4) | n
      | (u0, uj, u2j, u3j, u4j, w4j, w4j−k, w4j−2k)                   | 4j − 2k = n or 0           |           |
C7    | (w0, wk, w2k, w3k, u3k, u3k+j, u3k+2j, u3k+3j)                | 3k + 3j = n or 2n          | (3, 2, 3) | n
      | (w0, wk, w2k, w3k, u3k, u3k−j, u3k−2j, u3k−3j)                | 3k − 3j = n or 0           |           |

2.3 Obtaining constant octagon value

Every 8-cycle of an I-graph contributes to the octagon value of each edge partition. It turns out that if we can identify at least one edge partition of a graph, we can easily determine its parameters. Therefore, we want to find graphs with a constant octagon value. These are graphs for which all edges touch the same number of 8-cycles; they are called [1, λ, 8]-cycle regular graphs. We consider all possible collections of 8-cycles and determine the octagon values of I-graphs admitting those 8-cycles. Since I-graphs are defined with 3 parameters and all 8-cycles give constraints on these parameters, it is enough to consider collections of at most 4 cycles to uniquely determine all [1, λ, 8]-cycle regular graphs. After a thorough analysis (see [9]) we see that the list of [1, λ, 8]-cycle regular I-graphs consists of 10 members (see Figure 2). Surprisingly, it turns out that all [1, λ, 8]-cycle regular I-graphs are in the family of generalized Petersen graphs. This proves Theorem 1.

3 Recognition algorithm

The recognition algorithm relies on the fact that there is just a small number of I-graphs (ten) with a constant octagon value. In particular, whenever the input graph G of Algorithm 1 is a member of the family of I-graphs and is not [1, λ, 8]-cycle regular, we can immediately identify one of its edge orbits (EI, EJ, or ES), of size |V(G)|/2.

Algorithm 1: Recognition procedure for I-graphs.
Data: connected cubic graph G
Result: parameters (n, j, k) and an isomorphism to I(n, j, k), if it exists
1  P ← an empty dictionary
2  for e ∈ E(G) do
3      s ← octagonValue(e)    /* calculate σ(e) */
4      P[s].append(e)
5  end
6  U ← an item of P with minimum positive cardinality
7  if G[U] is a 2-factor then    /* a 2-factor is a 2-regular graph */
8      U ← {e | e ∈ E(G), e is adjacent to an edge of U}    /* a perfect matching in G */
9  end
10 return Extend(G, U)
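The edge-partitioning steps of Algorithm 1 (lines 1–9) can be sketched as below, reusing build_i_graph and octagon_value from the previous sketch (again by brute force rather than the paper's constant-time local computation). Note that the bucket-and-switch logic returns the spoke orbit regardless of which orbit lands in the smallest class, since a 2-regular class gets swapped for its adjacent edges.

    from collections import defaultdict

    def partition_edges(adj):
        """Lines 1-4: bucket the edges of G by their octagon value sigma(e)."""
        P = defaultdict(list)
        for e in {frozenset((a, b)) for a in adj for b in adj[a]}:
            P[octagon_value(adj, *e)].append(tuple(e))
        return P

    def spoke_orbit(adj):
        """Lines 6-9: take a smallest class U; if G[U] is 2-regular (one of
        the two cycle orbits), replace U by the edges adjacent to it, which
        is then the spoke orbit -- a perfect matching of G."""
        U = min(partition_edges(adj).values(), key=len)
        deg = defaultdict(int)
        for a, b in U:
            deg[a] += 1
            deg[b] += 1
        if all(d == 2 for d in deg.values()):      # U induces a 2-regular subgraph
            in_u = {frozenset(e) for e in U}
            touched = set(deg)
            U = [(a, b) for a in adj for b in adj[a]
                 if a < b and frozenset((a, b)) not in in_u
                 and (a in touched or b in touched)]
        return U

    # For a non-cycle-regular I-graph this should yield the n spokes u_i w_i:
    print(sorted(spoke_orbit(build_i_graph(7, 1, 2))))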
Correctness and time complexity of the algorithm. We first note that if G is not cubic, then it does not belong to the observed graph families. Since checking whether a graph is cubic takes linear time, we simply assume that the input graph is cubic. Furthermore, if G is not connected, then it can only be a member of the family of I-graphs if it consists of multiple copies of a smaller I-graph G′. However, this case can easily be resolved by separately checking each part, so we can assume that the input graph is connected. Algorithm 1 consists of the following 3 parts.

1. Partitioning the edges with respect to the octagon value. The algorithm determines the octagon value of each edge e ∈ E(G) and builds a partition set P of graph edges (see lines 1–4). Since G is cubic and all 8-cycles containing edge e consist of edges which are at distance at most 4 from e, it is enough to check a subgraph H of G of order at most 62 to calculate the octagon value of an edge e. Therefore, the calculation of octagonValue(e) takes O(1) time for each edge e, and this whole part is performed in Θ(|E(G)|) time.

2. Identifying the edge orbit which corresponds to the set of spokes. Throughout lines 6–9 we determine the edge orbit which corresponds to the set of spokes. It is easy to see that this requires an additional O(|E(G)|/3) time.

3. Using the set U for determining the parameters of a given graph. The algorithm uses the computed set U to determine an exact isomorphism between G and an I-graph or a double generalized Petersen graph, if it exists. This procedure differs with regard to the graph family we are considering. The related procedure Extend(G, U) is performed in Θ(|E(G)|) time.

Since the octagon value of each edge is computed in constant time and there is a finite number of [1, λ, 8]-cycle regular I-graphs, Theorem 2 holds.

Figure 2: All [1, λ, 8]-cycle regular I-graphs: (a) the triangular prism — the graph with no 8-cycles; (b) the 3-cube — the graph containing C0 and C7; (c) the Petersen, Dodecahedral and Desargues graphs — graphs containing C5, C6 and C∗; (d) the Möbius–Kantor graph, the Nauru graph, G(13, 5), F048 ≅ G(24, 5), and G(26, 5) — graphs containing C3, C4 and C∗.

3.1 Subprocedure Extend(G, U)

It is easy to see that Extend(G, U) can safely reject G if |V(G)| is not divisible by 2. For this subprocedure set n = |V(G)|/2 and denote by H the subgraph of G induced by the vertices of U. There are two possibilities.

H = G. In this case graph G has a constant octagon value. Since there are just ten such I-graphs, checking G against them takes constant time.

H is of order 2n and is 1-regular. Since U is a perfect matching of G, the set E(G) \ U is a collection of cycles. If G is an I-graph, then there exist positive integers i, j, l1, l2 with j ≤ i such that there are j cycles of length l1 and i cycles of length l2. It remains to determine the parameter k and check whether G ≅ I(n, j, k). This procedure depends on the structure of I-graphs and the 8-cycle C∗. It is performed in Θ(|E(G)|) time (see [9] for details).
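The case analysis in Extend(G, U) starts from the observation that E(G) \ U is a disjoint union of cycles. A short sketch of that step follows, again reusing build_i_graph from the first sketch; the instance I(12, 2, 3) and the expected cycle counts come from elementary gcd arithmetic on the orbits, not from the paper.

    from collections import Counter

    def cycle_lengths(adj, matching):
        """With U a perfect matching, E(G) \\ U is 2-regular; walk each cycle
        of G - U exactly once and record its length."""
        matched = {frozenset(e) for e in matching}
        lengths, seen = [], set()
        for start in adj:
            if start in seen:
                continue
            prev, cur, length = None, start, 0
            while True:
                seen.add(cur)
                nxt = next(x for x in adj[cur]
                           if x != prev and frozenset((cur, x)) not in matched)
                prev, cur = cur, nxt
                length += 1
                if cur == start:
                    break
            lengths.append(length)
        return lengths

    # For I(12, 2, 3), with U the spokes (known here by construction), the outer
    # edges form gcd(12, 2) = 2 cycles of length 6 and the inner edges form
    # gcd(12, 3) = 3 cycles of length 4 -- i.e. j = 2 <= i = 3 in the text above.
    adj = build_i_graph(12, 2, 3)
    spokes = [(('u', i), ('w', i)) for i in range(12)]
    print(Counter(cycle_lengths(adj, spokes)))   # Counter({4: 3, 6: 2})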
4 Conclusion

Studying the cyclic structure as described in this paper led to the construction of fast recognition algorithms for three parametric families. To the best of our knowledge, in addition to this work, such a procedure was so far only used in [10] for the family of generalized Petersen graphs. We believe that a similar approach should give interesting results for other parametric graph families of bounded degree, such as Johnson graphs, rose window graphs, Tabačjn graphs, Y-graphs, or H-graphs.

References

[1] Boben, M., Pisanski, T., and Žitnik, A. I-graphs and the corresponding configurations. J. Comb. Des. 13, 6 (2005), 406–424.
[2] Bonvicini, S., and Pisanski, T. Hamiltonian cycles in I-graphs. Electronic Notes in Discrete Mathematics 40 (2013), 43–47. Combinatorics 2012.
[3] Coxeter, H. S. M. Self-dual configurations and regular graphs. Bull. Amer. Math. Soc. 56 (1950), 413–455.
[4] Dods, M. S. Independence number of specified I-graphs. PhD thesis, Naval Postgraduate School, Monterey, CA, 2020.
[5] Erdős, P. Some of my favorite solved and unsolved problems in graph theory. Quaestiones Math. 16, 3 (1993), 333–350.
[6] Foster, R. M. The Foster census. Charles Babbage Research Centre, Winnipeg, MB, 1988. R. M. Foster's census of connected symmetric trivalent graphs, with a foreword by H. S. M. Coxeter, a biographical preface by Seymour Schuster, an introduction by I. Z. Bouwer, W. W. Chernoff, B. Monson and Z. Star, edited and with a note by Bouwer.
[7] Horvat, B., Pisanski, T., and Žitnik, A. Isomorphism checking of I-graphs. Graphs Combin. 28, 6 (2012), 823–830.
[8] Klein, Z. J. Structural properties of I-graphs: their independence numbers and Cayley graphs. PhD thesis, Naval Postgraduate School, Monterey, CA, 2020.
[9] Klobas, N., and Krnc, M. Fast recognition of some parametric graph families. arXiv e-prints (Aug. 2020), arXiv:2008.08856.
[10] Krnc, M., and Wilson, R. J. Recognizing generalized Petersen graphs in linear time. Discrete Applied Mathematics 283 (2020), 756–761.
[11] Mollard, M. Cycle-regular graphs. Discrete Mathematics 89, 1 (1991), 29–41.
[12] Oliveira, A. S. S., and Vinagre, C. T. M. The spectrum of an I-graph, 2015.
[13] Petkovšek, M., and Zakrajšek, H. Enumeration of I-graphs: Burnside does it again. Ars Math. Contemp. 2, 2 (2009), 241–262.
[14] Watkins, M. E. A theorem on Tait colorings with an application to the generalized Petersen graphs. J. Combinatorial Theory 6 (1969), 152–164.

Interactive Evolutionary Computation Approach to Permutation Flow Shop Scheduling Problem

Vid Keršič
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
vid.kersic@um.si
DOI https://doi.org/10.18690/978-961-286-516-0.8

Abstract. Artificial intelligence and its subfields have become part of our everyday lives and efficiently solve many problems that are very hard for us humans. But in some tasks these methods struggle, while we, humans, are much better solvers with our intuition. Because of that, the question arises: why not combine intelligent methods with human skills and intuition? This paper proposes an Interactive Evolutionary Computation approach to the Permutation Flow Shop Scheduling Problem by incorporating human-in-the-loop in the MAX-MIN Ant System through gamification of the problem. The analysis shows that combining the evolutionary computation approach and human-in-the-loop leads to better solutions, significantly when the complexity of the problem increases.

Keywords: scheduling problems, interactive evolutionary computation, ant colony optimization, metaheuristic optimization
1 Introduction

Computers have become an essential (and sometimes the only) tool to approach complex real-world problems. But many problems are still too hard to solve, even for computers. Many of these problems are NP-hard, and their solution cannot be found by using exact algorithms [1]. Thus, many of these problems are solved with approximation and different stochastic nature-inspired population-based algorithms, where suboptimal solutions are found. However, these are still considered good enough to use in practice. Some of these problems include the Travelling Salesman Problem (TSP), scheduling problems, etc. This paper focuses on the Permutation Flow Shop Scheduling Problem (PFSP), one of the scheduling problems.

PFSP is the scheduling problem where there are m machines and n jobs [2]. Each job consists of m operations, where the i-th operation must be executed on the i-th machine. The order of jobs must be the same on all machines, and each machine can perform only a single operation at a time. Each operation has a specified execution time, and the goal is to minimize the total job execution time, called the makespan. The problem was proven to be NP-hard when m ≥ 3, which means no efficient algorithm exists [3].

Different approximation approaches were presented in the last decades and shown to be effective in finding suboptimal solutions. Firstly, different approximation algorithms were introduced to reduce the time to solve the problems and find suitable solutions [4]. Afterwards, researchers applied and adjusted different optimization algorithms, especially nature-inspired and metaheuristic optimization algorithms, to improve results in shorter execution time [5].

In the last years, a new approach emerged to solve these problems. Inspired by the human intuition of problem-solving, the Interactive Evolutionary Computation (IEC) approach combines metaheuristic optimization algorithms, e.g., nature-inspired algorithms, with human-in-the-loop through some process, e.g., gamification of the problem [6]. This paper proposes an IEC approach based on the MAX-MIN Ant System (MMAS), one of the Ant Colony Optimization (ACO) algorithms, to PFSP through gamification of the problem.

The structure of the paper is as follows: Section 2 provides an overview of related work. Section 3 describes the IEC approach to PFSP. Section 4 describes the experiment, shows the results, and provides an analysis of the results. Section 5 concludes the paper.

2 Related Work

Researchers started solving PFSP almost 70 years ago, when Johnson proposed the exact algorithm, but it was restricted to two machines [2]. Many exact algorithms with different strategies were proposed in the following decades, such as the branch-and-bound algorithm [7]. Despite that, the exact algorithms could not scale to bigger instances of the problem.

To overcome the NP-hard time complexity of the problem, researchers started to develop approximation and heuristic algorithms, where different approaches and heuristics were considered, such as [8]. Methods based on the Nawaz-Enscore-Ham (NEH) heuristic proved to be among the best performers for the problem [4]. NEH-based methods consist of two steps: first, order the jobs based on some characteristic, and then insert them one by one. But the approximation algorithms still have several drawbacks, such as being dependent on certain characteristics of the specific setting of the problem and still being too slow.

Metaheuristic optimization algorithms were also considered for PFSP, and they achieved more favorable results than approximation algorithms, especially time-wise. Many algorithms were applied to the problem, such as Simulated Annealing, Differential Evolution, and many nature-inspired algorithms, such as ACO [9, 5]. In many benchmarks, metaheuristic algorithms achieve comparable or even better results than approximation algorithms [10].

As mentioned in the introduction, researchers have tried to incorporate human intuition into metaheuristic algorithms, which proved beneficial [11]. Authors included human-in-the-loop in TSP through gamification, where the user affects the pheromone level in the MMAS algorithm when playing a game [6]. Several other researchers also focused on the gamification of different NP-hard problems like the Bin-packing problem, Job Shop Scheduling, and Open Shop Scheduling [12, 13]. Based on the related work, the proposed approach incorporates human intuition to solve the PFSP problem through a gamification process where the user affects the pheromone trail in the MMAS algorithm.

3 IEC Approach to PFSP

The goal of PFSP is to find a permutation of n jobs that minimizes the total makespan. Thus, the optimization task for the problem is defined as:

min C(π)    (1)

where π is a permutation of n jobs and C is the cost function that calculates the makespan.
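Since the whole approach optimizes C(π), it is worth making the cost function concrete. The following is a standard PFSP makespan computation in Python (a sketch of the textbook recurrence, not taken from the paper's source code); times[job][mach] is the processing time of a job on a machine.

    def makespan(perm, times):
        """C(pi): completion time of the last job on the last machine, with
        the same job order on every machine (the PFSP constraint)."""
        m = len(times[0])
        finish = [0.0] * m      # completion time of the previous job per machine
        for job in perm:
            ready = 0.0         # completion of this job on the previous machine
            for mach in range(m):
                ready = max(ready, finish[mach]) + times[job][mach]
                finish[mach] = ready
        return finish[-1]

    # Three jobs on two machines, scheduled in the order 2, 0, 1:
    print(makespan([2, 0, 1], [[3, 2], [1, 4], [2, 1]]))   # -> 11.0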
Based on the techniques and approaches discussed in the previous section, the solution consists of three modules: the video game, the back-end server, and the MMAS algorithm (source code available at: https://github.com/Vid201/flow-shop-scheduling-iec). The video game was implemented in Unity, while the back-end server and the MMAS algorithm, which runs on the back-end, are written in Python.

The video game contains blocks (jobs) and a board constructed from several rows (machines). The user tries to move and stack blocks together while complying with the limitations of the problem. For each game, a new instance of the MMAS algorithm starts. Each movement performed by the user is sent to the back-end server, where the MMAS algorithm runs in iterations. For each block placement, one iteration of the algorithm is launched. The pheromone of the ants is presented as a 2D matrix A of size N × N, where N is the number of jobs. The matrix element A_{i,j} is the pheromone level of positioning job i at position j. The user's placement of blocks affects the pheromone level of the algorithm; in the proposed approach, the value is multiplied by a scalar. This change of the pheromone level is the human-in-the-loop part.

After each user's movement, the solution for the current problem is recommended to the user based on the outcome and current state of the MMAS algorithm (of course, users can freely decide which move they will make next). The player is also encouraged to perform well and achieve the highest score as fast as possible, since the game contains a time counter. The final game score is calculated based on the combination of the makespan and the time duration of the game. The algorithm of the MMAS-IEC approach is shown in Algorithm 1.

Algorithm 1: Algorithm of the MMAS-IEC approach.
1  start the game
2  init MMAS
3  launch 1 iteration (MMAS)
4  while game not finished do
5      wait for the user to position a job (block)
6      i ← positioned_job
7      j ← position_index
8      A_{i,j} ← A_{i,j} × 3
9      launch 1 iteration (MMAS)
10 end
11 return best found solution
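The interactive loop of Algorithm 1 boils down to a single pheromone manipulation between MMAS iterations. A minimal sketch of that loop follows; mmas_iteration, wait_for_player_move and game_finished are hypothetical stand-ins for the back-end's MMAS step and the Unity front-end messages, and the multiplier of 3 is the empirically chosen value described in Section 4.

    import numpy as np

    def mmas_iec(n_jobs, mmas_iteration, wait_for_player_move, game_finished):
        """A sketch of Algorithm 1; the three callbacks are hypothetical.

        mmas_iteration(A) should run one MMAS iteration against the pheromone
        matrix A and return (makespan, permutation) of its best ant.
        """
        A = np.ones((n_jobs, n_jobs))        # A[i, j]: pheromone for job i at position j
        best = mmas_iteration(A)             # line 3
        while not game_finished():           # line 4
            i, j = wait_for_player_move()    # lines 5-7
            A[i, j] *= 3                     # line 8: the human-in-the-loop step
            best = min(best, mmas_iteration(A))   # line 9, keeping the best solution
        return best                          # line 11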
4 Experiments and Results

The experiment aims to show that incorporating human-in-the-loop can positively affect the search process of metaheuristic algorithms, which means better solutions can be found in fewer iterations. Thus, the analysis is mainly oriented on comparing MMAS without human-in-the-loop with MMAS-IEC.

The operation times of the jobs are randomly generated on each new instance of the game, since standard benchmark datasets are too big to be visualized in the game. Problem sizes used in the experiment are shown in Table 1. An example of the video game can be seen in Figure 1. A comparison of the brute-force algorithm, the MMAS algorithm without human-in-the-loop, and the MMAS-IEC approach is performed. The results are also compared to one variant of the Genetic Algorithm (GA) [14].

Table 1: Sizes of the problem in the experiment.

Number of jobs    Number of machines
5                 3
10                2
10                4
12                2
12                4
12                6

Figure 1: The gameplay of the video game.

In both executions of the MMAS algorithm, with and without human-in-the-loop, the parameters were set as follows: n = 5 (number of ants), p0 = 0.9 (probability to select the job with the highest pheromone), mmr = 5 (min-max ratio), ρ = 0.75 (persistence of the trail), max_iter = 100 (maximum number of iterations) and max_stag = 5 (maximum number of iterations of stagnation, i.e., iterations with the same solution). The parameters of MMAS were set according to the literature [5]; readers are advised to read the referenced paper for an in-depth description of the MMAS algorithm. For GA, the parameters were adapted from the used implementation [14].

The MMAS algorithm without human-in-the-loop was run 30 times to get solutions characterized by the algorithm and not by randomness. In the MMAS-IEC approach, the multiplier of the pheromone was set to 3, which means the element A_{i,j} is increased if the user places job i at position j in the game. The size of the multiplier was chosen empirically.

To measure the performance of the algorithms, the suboptimal solutions of the metaheuristic algorithms were compared to optimal solutions by calculating the relative percentage deviation (RPD), which tells how much the suboptimal solutions are worse than the optimal solution. RPD was calculated according to the following equation:

RPD = (Suboptimal_sol − Optimal_sol) / Optimal_sol × 100    (2)

The number of iterations to find the best solution was also compared, since the execution time was much longer in the interactive approach due to playing the game.
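Equation (2) as a one-liner; the example numbers are the brute-force and MMAS-IEC average makespans of the 5 × 3 instances from Table 2 below, and the printed value agrees with the 3.500 reported there.

    def rpd(suboptimal_sol, optimal_sol):
        """Relative percentage deviation, Equation (2)."""
        return (suboptimal_sol - optimal_sol) / optimal_sol * 100

    print(round(rpd(6.742, 6.514), 3))   # -> 3.5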
4.1 Analysis and Discussion

The video game and the other algorithms were launched five times for each problem size, and the average makespan with standard deviation was calculated. The brute-force algorithm was run only on the smaller problem sizes, since its execution time was too long for the other instances. Where brute-force results were available, RPD was also calculated. The results of the experiment are shown in Table 2.

Table 2: Results of the experiment, where N is the number of jobs, M is the number of machines, Avg is the average makespan, SD is the standard deviation, IT is the average number of iterations for the best solution, and RPD is the average relative percentage deviation.

          Brute force     MMAS                        GA                          MMAS-IEC
N   M     Avg     SD      Avg     SD     IT    RPD    Avg     SD     IT    RPD    Avg     SD     IT    RPD
5   3     6.514   0.402   6.605   0.399  31.0  1.396  6.515   0.402  25.8  0.015  6.742   0.463  3.6   3.500
10  2     9.719   0.814   9.834   0.753  27.8  1.183  9.749   0.797  45.0  0.309  9.742   0.792  2.8   0.236
10  4     12.281  0.876   12.890  0.543  28.4  4.958  12.594  0.625  61.2  2.549  12.614  0.657  5.2   2.711
12  2     /       /       12.019  0.845  26.0  /      11.870  0.877  42.4  /      11.869  0.877  2.6   /
12  4     /       /       14.005  0.949  29.8  /      13.430  1.118  49.2  /      13.463  1.152  5.4   /
12  6     /       /       17.075  0.464  19.8  /      15.970  0.739  84.2  /      16.801  1.195  8.2   /

While in the smallest instances of the problem the MMAS algorithm outperformed the MMAS-IEC approach, the latter achieved better results on all the other problem sizes. This implies that when the search space becomes enormous and complex, human intuition positively affects the algorithm and leads to a better solution. Since the user affected the pheromone level and directed the algorithm towards better solutions, the average number of iterations was smaller than with the MMAS algorithm alone. It must be noted that the best solution in MMAS-IEC is usually not the same as the end state of the blocks in the user's game. The user only affected the pheromone level of the job on the selected position, and the game's order does not directly change the solution. The best solution is still the solution found by the MMAS algorithm, while the user only changed the algorithm's internal state, i.e., the pheromone matrix. Figure 2 shows that when the problem sizes increase, MMAS-IEC tends to find solutions with a lower makespan.

Figure 2: Box plots of makespan for different sizes of the problem with all four approaches. Colors represent different sizes of the problem, while dots show makespans for all five settings per problem size for each approach.

Results of the MMAS-IEC approach are very dependent on the way the user plays the game. In the first played games, the results of the MMAS-IEC were not better, or were even worse, than MMAS alone.
But after some games, strategies for the game come to mind and results get better, which means the makespan decreases. One of the strategies that tends to work well is to first position the job whose operation times increase with the machine number, and then put jobs with as little space between operations as possible on all machines. While the number of iterations is much smaller in the MMAS-IEC approach, the time to find the best solution depends on how fast the user plays the game; e.g., MMAS alone can find the suboptimal solution in 1 second, while the user cannot play the game so fast. For bigger instances, around 10 seconds are needed to find the best suboptimal solution (the game does not have to be played to the end to find the best solution).

Compared to the GA, both the MMAS and MMAS-IEC approaches achieve worse results, except when N = 10 and M = 2, where the MMAS-IEC approach achieves the best ones. According to the obtained results, combining an interactive process, i.e., gamification, with GA looks like a promising direction to explore.

The open problem for this gamification process is how to scale the game to bigger problem sizes, since the game can become too complex, and it is also hard to visualize the blocks and boards. The experiments were successfully conducted on problems with up to 6 machines and 12 jobs, but scaling the game to more machines and jobs is open for further research.

5 Conclusion

This work shows how human intuition for problem-solving can be incorporated into metaheuristic algorithms for the NP-hard problem PFSP. The MMAS-IEC approach tends to find better solutions than the fully autonomous metaheuristic algorithm MMAS, especially in bigger settings of the problem. Solutions are also found in a smaller number of iterations. But still, there are some open questions to the proposed approach, as bigger instances are hard to visualize, and playing the game usually takes more time than the MMAS algorithm alone. Both of these drawbacks are good starting points for further research.

Further research could be dedicated to the selection of the metaheuristic algorithm, since this study only considers the MMAS algorithm for PFSP, while several other algorithms proved to be suitable for this problem, e.g., the GA from the analysis. The research goal was to use the same metaheuristic algorithm with and without human interaction, and not to compare different metaheuristic algorithms. On the MMAS algorithm itself, other techniques to change the pheromone level could be explored. Various strategies to encourage the player to perform well could also be introduced, e.g., online scoreboards or rewards, which could increase the player's performance with a positive loop. It would also be interesting to compare and try the MMAS-IEC approach on one of the benchmarks for PFSP, e.g., Taillard's benchmark problems [15].

References

[1] D. E. Knuth. Postscript about NP-hard problems. SIGACT News, 6(2):15–16, Apr. 1974.
[2] S. M. Johnson. Optimal two- and three-stage production schedules with setup times included. Naval Research Logistics Quarterly, 1(1):61–68, 1954.
[3] M. R. Garey, D. S. Johnson, and R. Sethi. The complexity of flowshop and jobshop scheduling. Mathematics of Operations Research, 1(2):117–129, 1976.
[4] M. Nawaz, E. E. Enscore Jr, and I. Ham. A heuristic algorithm for the m-machine, n-job flow-shop sequencing problem. Omega, 11(1):91–95, 1983.
[5] T. Stützle et al. An ant approach to the flow shop problem. In Proceedings of the 6th European Congress on Intelligent Techniques & Soft Computing (EUFIT'98), volume 3, pages 1560–1564, 1998.
[6] A. Holzinger, M. Plass, M. Kickmeier-Rust, K. Holzinger, G. C. Crişan, C.-M. Pintea, and V. Palade. Interactive machine learning: experimental evidence for the human in the algorithmic loop. Applied Intelligence, 49(7):2401–2414, 2019.
[7] E. Ignall and L. Schrage. Application of the branch and bound technique to some flow-shop scheduling problems. Operations Research, 13(3):400–412, 1965.
[8] S. Suliman. A two-phase heuristic approach to the permutation flow-shop scheduling problem. International Journal of Production Economics, 64(1-3):143–152, 2000.
[9] I. Osman and C. Potts. Simulated annealing for permutation flow-shop scheduling. Omega, 17(6):551–557, 1989.
[10] E. Vallada, R. Ruiz, and J. M. Framinan. New hard benchmark for flowshop scheduling problems minimising makespan. European Journal of Operational Research, 240(3):666–677, 2015.
[11] A. Holzinger. Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Informatics, 3(2):119–131, 2016.
[12] N. D. Ross, M. B. Johns, E. C. Keedwell, and D. A. Savic. Human-evolutionary problem solving through gamification of a bin-packing problem. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pages 1465–1473, 2019.
[13] M. Vargas-Santiago, R. Monroy, J. E. Ramirez-Marquez, C. Zhang, D. A. Leon-Velasco, and H. Zhu. Complementing solutions to optimization problems via crowdsourcing on video game plays. Applied Sciences, 10(23):8410, 2020.
[14] Suyunu. GitHub - suyunu/Flow-Shop-Scheduling: genetic algorithm for flow shop scheduling. URL: https://github.com/suyunu/Flow-Shop-Scheduling (visited on 07/23/2021).
[15] E. Taillard. Benchmarks for basic scheduling problems. European Journal of Operational Research, 64(2):278–285, 1993.
Towards Representative Web Performance Measurements with Google Lighthouse

Tjaša Heričko, Boštjan Šumak, Saša Brdnik
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
tjasa.hericko@um.si, bostjan.sumak@um.si, sasa.brdnik@um.si
DOI https://doi.org/10.18690/978-961-286-516-0.9

Abstract. Web performance testing with tools such as Google Lighthouse is a common task in software practice and research. However, variability in time-based performance measurement results is observed quickly when using the tool, even if the website has not changed. This can occur due to variability in the network, web, and client devices. In this paper, we investigated how this challenge was addressed in the existing literature. Furthermore, an experiment was conducted, highlighting how unrepresentative measurements can result from single runs; thus, researchers and practitioners are advised to run performance tests multiple times and use an aggregation value. Based on the empirical results, 5 consecutive runs using a median to aggregate results reduce variability greatly, and can be performed in a reasonable time. The study's findings alert to potential pitfalls when using single run-based measurement results and serve as guidelines for future use of the tool.

Keywords: websites, web performance, performance testing, performance variability, tool, Google Lighthouse

1 Introduction

In software engineering, performance tests are often conducted by software researchers and practitioners to audit a website's quality. The former commonly use web performance measurements to assess web performance on the observed websites [12, 14, 15], to investigate factors (positively or negatively) affecting performance [4, 6, 13], and to improve performance testing [10], while the latter use performance measurements to improve a website's quality and provide a better overall user experience, as web performance greatly influences website traffic, user attrition, user engagement, online revenue, and even rankings in search results [2, 16, 17].

Performance testing can be conducted using various tools, among which Google Lighthouse has gained increasing attention in recent years. It is an open-source tool, providing audits for performance, as well as for accessibility, search engine optimization, and progressive web apps, with indicators on how to improve these aspects of websites if needed [8]. However, when dealing with time-based measurements, the results of such testing can often be inconsistent, as several factors can interfere with the measures and may introduce fluctuations, even if the website has not changed. Most commonly, results tend to vary due to variability in the network, web server, client hardware, and client resource contention [7]. Lighthouse addresses variability by providing vague strategies and recommendations on how to reduce it, though results can still vary. Besides isolating external factors, e.g., using a dedicated device for testing, using a local deployment or a machine on the same network, the most straightforward strategy is to run Lighthouse multiple times and use aggregate values instead of single tests [7].

The research objectives of this paper are: (i) To study how the research community has addressed the challenge of variability in performance measurements when using the tool; and (ii) To demonstrate the strategy of performing multiple runs empirically with their aggregation into a single-value result. To achieve these objectives, we performed a literature review and conducted an experiment.

Our work is broadly related to previous research providing a better understanding and managing of variations in measurements, testing, and benchmarking for time-based performance measurements [3, 5, 10, 11].
These studies are focused primarily on suggesting recommendations for robust testing in the presence of environmental fluctuations, and, as such, are quite different in aim from ours, which is to gain insight into how a specific tool – Google Lighthouse – is used in research for web performance measurement, further investigated with empirical research alerting software researchers and practitioners to potential pitfalls in the future use of the tool.

The contributions of the paper are: (i) Presenting an overview of existing studies using Lighthouse for measuring performance, with an emphasis on how the tool is used, what measuring strategies are employed, and how the authors addressed possible inconsistencies in results; (ii) Providing an analysis of the effects of repeating performance measurements to prevent single-run outliers; (iii) Highlighting potential pitfalls for research and practice using single run-based results provided by Lighthouse; and (iv) Serving as a base for research studies on mitigating unrepresentative web performance measurements.

2 Literature review

A literature review was performed to find existing research utilizing Lighthouse as the tool for estimating web performance. A full-text search was conducted using the search string »Google Lighthouse« in the following digital libraries: ACM Digital Library (https://dl.acm.org/), Google Scholar (https://scholar.google.com/), IEEE Xplore (https://ieeexplore.ieee.org/), and Web of Science (https://apps.webofknowledge.com/). The search was carried out on July 28, 2021, and altogether 134 studies were retrieved from the search. Inclusion and exclusion criteria guided the study selection process. Only journal and conference papers were considered. Materials not accessible in English were excluded. Any research that only described Lighthouse theoretically was excluded, and all papers where Lighthouse was not used for performance measurements were excluded as well. After the review process, 8 primary studies were selected.
The list of primary studies is available in Table 1. All primary studies were published in conference proceedings in recent years, in 2018 or later. From the performance measurements made with Lighthouse, the primary studies most commonly used the Performance Score (S1-S3, S6), a single-value indicator of a website's overall performance, for their further analysis. The following more specific time metrics were also used commonly: Speed Index (S2, S4, S7, S8), First Meaningful Paint (S1, S2, S4, S7) and Estimated Input Latency (S1, S2, S7). Researchers observed between 1 and 21 websites in each study. Less than half of the primary studies (S1, S4, S7) noted some variance between runs when auditing the same website due to uncontrollable variables, and employed some strategies to mitigate this problem. In two studies (S4, S7), the authors repeated runs consecutively, while in one study (S1), researchers ran performance audits multiple times through the day. The number of runs varied from 5 to 100. Two studies (S1, S4) then used the mean for aggregating multiple runs into a single value, while one study (S7) used the median.

Table 1: A list of primary studies selected in the literature review with their performance audit strategies.

ID   Year   Performance audit strategy       Ref
S1   2018   5 repetitions through the day    [6]
S2   2019   1 run                            [12]
S3   2019   1 run                            [15]
S4   2019   100 consecutive repetitions      [13]
S5   2019   1 run                            [10]
S6   2020   1 run                            [4]
S7   2020   30 consecutive repetitions       [14]
S8   2021   1 run                            [18]

3 Experiment

An experiment was performed to further demonstrate how single performance audits can be unrepresentative in some scenarios, and to investigate how the number of runs affects variability. In the experiment, 10 real websites were used, selected randomly from the Alexa Top 500 list, which includes the top-ranked websites on the web [1]. The selected websites were: AliExpress (https://www.aliexpress.com/), Amazon India (https://www.amazon.in/), Bola (https://www.bola.com/), Freepik (https://www.freepik.com/), IKEA (https://www.ikea.com/), Mercari (https://www.mercari.com/), Shopify (https://www.shopify.com/), Unsplash (https://unsplash.com/), Wix (https://www.wix.com/), and Zendesk (https://www.zendesk.com/), further referred to as W1, W2, W3, W4, W5, W6, W7, W8, W9, and W10, respectively. The selected websites ranged from simple static presentation websites to dynamic websites, i.e., e-commerce websites and news sites.

All websites were audited with Google Lighthouse v7.3.0 on a device with macOS 10.15.7 using Headless Chrome. A desktop device was emulated using a broadband network connection. For each website, performance audits were performed 4 times, each time with an increasing number of repeated independent runs (N=1, 5, 10, 100). During the auditing process, external factors were isolated as much as possible. The experiment was conducted on July 29, 2021 between 4:42 and 8:37 PM (GMT+2).

From the collected results, we used the Performance Score for analysis, as this value captures the overall web performance of a website. It is calculated as a weighted average of six metric scores, each metric representing some aspect of a website's performance. The Performance Score can have the following values: Poor (0-50), Needs improvement (50-89) and Good (90-100) [9].

Data analysis was conducted using IBM SPSS Statistics v27. Descriptive statistics were used to present the characteristics of the sets of data collected with a different number of consecutive runs, including a description and spread of the data in each set. Mood's median test was performed to estimate if the medians of data sets from different runs on the same website were equal.
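The strategy examined here — N consecutive runs aggregated by the median — can be scripted around the Lighthouse CLI. A sketch in Python, assuming the lighthouse npm package and Chrome are installed; the flag names follow recent Lighthouse CLI releases and can differ between versions (e.g., the desktop preset), so treat the invocation as illustrative rather than the study's exact setup.

    import json
    import statistics
    import subprocess

    def performance_score(url):
        """One Lighthouse performance audit; returns the 0-100 Performance Score."""
        report = subprocess.run(
            ["lighthouse", url,
             "--only-categories=performance",
             "--preset=desktop",              # desktop emulation; version-dependent flag
             "--chrome-flags=--headless",
             "--output=json", "--output-path=stdout", "--quiet"],
            capture_output=True, text=True, check=True).stdout
        return json.loads(report)["categories"]["performance"]["score"] * 100

    def audited_score(url, runs=5):
        """N consecutive runs aggregated by the median."""
        return statistics.median(performance_score(url) for _ in range(runs))

    print(audited_score("https://unsplash.com/", runs=5))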
4 Results and Discussion

The distribution of the Performance Scores of each website for N=100 is presented with boxplots in Figure 1. It can easily be observed that, for almost all websites (except for W4), some performance measurements occurred that were not typical representatives of the website's performance. Suppose one of these outlier measurements was the only assessment run; this can lead to an unrepresentative result. Consequently, wrongful conclusions and decisions can be made, e.g., a developer may think a change he recently implemented into the code made performance worse, yet, instead, this occurred due to fluctuation in the network, web, or client device. An interesting observation is that, due to variability in the measurement results, a website can be interpreted in a different score group, e.g., the W1, W2, W4, and W9 results are dispersed between the score groups Good and Needs improvement, and W3 between the score groups Poor and Needs improvement.

Figure 1: Data distribution of Performance Scores (N=100).

Detailed results for all sets of data are presented in Table 2, providing insight into how the data are spread, how much repeating the test reduces variability, and how results stabilize as the number of test runs increases. These results further illustrate the differences between single and multiple runs, which can provide a more reliable estimate of a website's performance; they therefore provide a rationale why addressing intrinsic fluctuations when dealing with time-based metrics should be considered, and why a single run can (in some cases) not be representative enough to provide reliable measurements. We argue that the use of a median value for aggregation is preferred over other measures of central tendency to minimize the impact of outliers.

Table 2: Performance Scores and their statistics across different numbers of runs.

              N=1    N=5    N=10   N=100
W1   Mean     87     91     91.2   90.9
     Median   87     92     92     92
     Range    /      6      6      21
     SD       /      2.3    1.9    2.9
W2   Mean     86     86     88.3   89.5
     Median   86     90     90     90
     Range    /      18     20     20
     SD       /      7.5    5.7    2.8
W3   Mean     40     42.8   44.3   43.3
     Median   40     43     44     44
     Range    /      5      10     14
     SD       /      1.9    2.8    1.9
W4   Mean     89     88.4   88.3   88.6
     Median   89     89     88     89
     Range    /      5      5      7
     SD       /      1.9    1.5    1.5
W5   Mean     99     98.6   98.5   98.6
     Median   99     99     98.5   99
     Range    /      1      1      5
     SD       /      0.5    0.5    0.9
W6   Mean     28     35     32.9   33.9
     Median   28     34     31     33.5
     Range    /      15     16     21
     SD       /      7.3    6.1    3.9
W7   Mean     98     98     97.6   97.3
     Median   98     98     98     98
     Range    /      0      2      5
     SD       /      0      0.7    1.2
W8   Mean     94     94.4   94.4   94.5
     Median   94     94     94     94
     Range    /      1      1      4
     SD       /      0      0.5    0.6
W9   Mean     77     72.4   72.1   73.3
     Median   77     72     72     73
     Range    /      7      8      23
     SD       /      2.9    2.5    4.2
W10  Mean     75     83.4   85.3   86.1
     Median   75     86     87     87
     Range    /      12     13     13
     SD       /      4.9    3.9    2.5

Mood's median test, performed for N=5, N=10, and N=100, showed that the medians of the Performance Score were the same across all three categories of runs for all websites, except W5, where the test could not be performed. These results, presented in Table 3, indicate that 5 runs, in comparison to 10 and 100 runs, are sufficient to eliminate possible outliers while still performing web performance testing in a reasonable time.

Table 3: Mood's median test by website, comparing groups of N=5, N=10, and N=100.

              W1     W2     W3     W4     W5   W6     W7     W8     W9     W10
Asymp. Sig.   0.996  0.723  0.137  0.557  *    0.744  0.927  0.879  0.211  0.758

* All values are less than or equal to the median; Mood's median test could not be performed.
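Mood's median test is also available outside SPSS; the sketch below uses scipy.stats.median_test on hypothetical Performance Scores for one website (the real input would be the three batches of audit results per website, as in Table 3).

    from scipy.stats import median_test

    # Hypothetical scores for one website from the N=5, N=10 and N=100 batches:
    runs_5 = [92, 92, 90, 94, 89]
    runs_10 = [92, 91, 93, 90, 92, 88, 92, 94, 91, 92]
    runs_100 = [92] * 50 + [93] * 20 + [91] * 20 + [89] * 8 + [71] * 2

    stat, p, grand_median, table = median_test(runs_5, runs_10, runs_100)
    print(f"grand median = {grand_median}, asymptotic p-value = {p:.3f}")
    # A large p-value means the batches' medians cannot be distinguished,
    # which is how the Asymp. Sig. values in Table 3 read.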
5 Conclusion

Several strategies can be employed to reduce random noise, measurement bias and errors when using Lighthouse for web performance measurements. In the paper, we performed a literature review in which we selected studies using Lighthouse for estimating web performance. The results show that more than half of the primary studies did not employ any specific strategy to address variability in web performance measurements. The others use a reasonably straightforward approach: repeat the Lighthouse audit multiple times and summarize the repeated runs using a mean or median. However, a large discrepancy was noticed in these works in the number of runs and the measures of central tendency used to aggregate multiple runs into a single-value result. Thus, we investigated this further empirically by conducting an experiment on real popular websites, to demonstrate how the number of runs affects variability and prevents single-run outliers. With this, we highlighted how measurement results from single runs can be misleading and unrepresentative; therefore, we recommend for research and practice to run performance tests multiple times and use an aggregation value. Based on our results, performing performance audits 5 times greatly reduces variability in results in a reasonable time. Our study provides a base for future research studies addressing outliers in web performance testing, and guidelines for future studies on how to perform representative web performance measurements with Lighthouse.

Acknowledgment

The authors acknowledge the financial support from the Slovenian Research Agency (Research Core Funding No. P2-0057).

References

[1] Alexa Internet, Inc. The top 500 sites on the web, 2021.
[2] Arapakis, I., Bai, X., and Cambazoglu, B. B. Impact of response latency on user behavior in web search. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval (New York, 2014), SIGIR '14, ACM, pp. 103–112.
[3] Beyer, D., Löwe, S., and Wendler, P. Reliable benchmarking: Requirements and solutions. International Journal on Software Tools for Technology Transfer 21, 1 (2019), 1–29.
[4] Chan-Jong-Chu, K., Islam, T., Exposito, M. M., Sheombar, S., Valladares, C., Philippot, O., Grua, E. M., and Malavolta, I. Investigating the correlation between performance scores and energy consumption of mobile web apps. In Proceedings of the Evaluation and Assessment in Software Engineering (New York, 2020), EASE '20, ACM, pp. 190–199.
[5] Chen, J., and Revels, J. Robust benchmarking in noisy environments, 2016.
[6] Gambhir, A., and Raj, G. Analysis of cache in service worker and performance scoring of progressive web application. In 2018 International Conference on Advances in Computing and Communication Engineering (ICACCE) (2018).
[7] Google Developers. Variability, 2019.
[8] Google Developers. Lighthouse, 2021.
[9] Google Developers. Lighthouse performance scoring, 2021.
[10] Johnston, O., Jarman, D., Berry, J., Zhou, Z. Q., and Chen, T. Y. Metamorphic relations for detection of performance anomalies. In 2019 IEEE/ACM 4th International Workshop on Metamorphic Testing (MET) (2019), pp. 63–69.
[11] Kalibera, T., and Jones, R. Rigorous benchmarking in reasonable time. In Proceedings of the 2013 International Symposium on Memory Management (New York, 2013), ISMM '13, ACM, pp. 63–74.
[12] Nurshuhada, A., Yusop, R. O. M., Azmi, A., Ismail, S. A., Sarkan, H. M., and Kama, N. Enhancing performance aspect in usability guidelines for mobile web application. In 2019 6th International Conference on Research and Innovation in Information Systems (ICRIIS) (2019), pp. 1–6.
[13] Rahman, S., and Wittie, M. P. MR-DNS: Multi-resolution Domain Name System. In Internet and Distributed Computing Systems (Cham, 2019), Springer International Publishing, pp. 191–202.
[14] Riet, J. v., Paganelli, F., and Malavolta, I. From 6.2 to 0.15 seconds – an industrial case study on mobile web performance. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME) (2020), pp. 746–755.
[15] Rojas-Mora, J., Lincolao-Venegas, I., and Schneeberger-Leon, F. S3E2: a web-based GIS for the visualization and analysis of socioeconomic segregation in Chile's elementary education system. In Proceedings of the 1st International Conference on Geospatial Information Sciences (2019), vol. 13 of Kalpa Publications in Computing, EasyChair, pp. 12–20.
[16] Shivakumar, S. K. Modern Web Performance Optimization. Apress, Berkeley, CA, 2020.
[17] Szalek, K., and Borzemski, L. Conversion rate gain with web performance optimization. A case study. In Information Systems Architecture and Technology: Proceedings of the 39th International Conference on Information Systems Architecture and Technology – ISAT 2018 (Cham, 2019), Springer, pp. 312–323.
[18] Yu, A., and Benson, T. A. Dissecting performance of production QUIC. In Proceedings of the Web Conference 2021 (New York, 2021), WWW '21, ACM, pp. 1157–1168.
Transformer-based Sarcasm Detection in English and Slovene Language

Matic Rašl, Mitja Žalik, Vid Keršič
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
matic.rasl@student.um.si, mitja.zalik1@student.um.si, vid.kersic@um.si
DOI https://doi.org/10.18690/978-961-286-516-0.10

Abstract. Sarcasm detection is an important problem in the field of natural language processing (NLP). In this paper, we compare the performance of three neural networks for sarcasm detection on English and Slovene datasets. Each network is based on a different transformer model: RoBERTa, DistilBert, and DistilBert – multilingual. In addition to the existing Twitter-based English dataset, we also created the Slovene dataset using the same approach. An F1 score of 0.72 and 0.88 was achieved on the English and Slovene dataset, respectively.

Keywords: natural language processing, sarcasm detection, transformers, RoBERTa, DistilBert

1 Introduction

Language is the essential tool for communication, both in real life and online in the digital world. With the fast growth of the internet in the last two decades, an enormous amount of text data is available to everyone, which is one of the main reasons natural language processing (NLP) has become one of the fastest-growing fields in computer science and artificial intelligence. While the most commonly used NLP application is text translation, many other applications are being researched and applied, e.g., text summarization, emotion recognition, and sarcasm and irony detection [1]. In this paper, we focus on the sarcasm detection problem.

Sarcasm detection is defined as a binary classification problem, where the goal is to detect if the given text is sarcastic [2]. The most common places to find sarcastic comments are social media platforms, e.g., Twitter, where people often express their opinions and views on different topics. While in some examples, e.g., "I work 40 hours a week for us to be this poor", it is easy to spot, sometimes, e.g., "Great, that's just what I needed!", it is harder to perceive at first sight. Detection of sarcasm is essential, because not understanding and detecting it can lead to substantial miscommunication errors and disagreements. Automatic sarcasm detection is also crucial in other problems, such as sentiment analysis, where undetected sarcasm can negatively affect an analysis. Therefore, there is a need for automatic detection of sarcastic comments and text.

This paper compares the performance of three neural networks for sarcasm detection on English and Slovene datasets. Each neural network is based on a transformer model. In the following sections, we overview the related work, describe the used datasets, present the experiment, analyze the results, and conclude the paper, emphasizing future work.

2 Related Work

Automatic sarcasm detection dates back to 2006 [3], but it has gained momentum in the past few years with advancements in the fields of neural networks and NLP. In general, sarcasm can be detected in three different ways [2]. Rule-based approaches use specific evidence, such as words or phrases, for identification. Such techniques were often used in earlier systems, such as [4].
Statistical approaches either use text features or learning algorithms to find sarcasm. Statistical methods were used in works such as [5], where combinations of positive verbs and negative situation phrases were used as classification features. The most common approach today is using deep learning techniques. For example, in [6], the model can learn user-specific context and thus achieve better results than previous state-of-the-art models.

Significant advancements in NLP tasks were achieved with transformers. They are a new form of neural network that does not use convolution or recursion. Instead, they use attention to find correlations between words in the text. Transformers can process text in parallel, allowing much faster learning than sequential methods [7]. They also achieve better results than previous methods.

With the increasing number of learning parameters, neural networks need a larger training dataset to prevent overfitting. While building large labeled datasets can be demanding, it is easy to construct large unlabelled corpora. Therefore, large models can be trained on unlabelled text data to create a good language model, i.e., expressive word embeddings. Afterwards, these representations can be used for different NLP-related tasks [5]. The mainstream architecture of the pre-trained models is Bidirectional Encoder Representations from Transformers (BERT). The initial model was pre-trained on BooksCorpus and English Wikipedia, which advanced the state of the art for eleven NLP tasks [8]. Nowadays, many BERT-based architectures exist. For example, RoBERTa (A Robustly Optimized BERT Pre-training Approach) [9] optimizes the way of masking tokens, thus improving the performance of the model. Another common architecture is DistilBert [10], which has a reduced number of training parameters. That makes its training 60% faster while retaining 97% of BERT's language understanding capabilities.

BERT has been widely and successfully utilized for sarcasm detection [11]. In [12], the accuracy is improved even more by also considering the context of sarcastic comments. The authors in [13] use RoBERTa to detect sarcasm with even higher accuracy. Although BERT-based architectures are very successful, their pre-training still has some drawbacks. Sarcasm is present primarily in informal communication (e.g., social networks such as Reddit, Twitter, etc.), which was not part of the training set. Therefore, in [14] BERT was outperformed by the context-independent GloVe embeddings model, which was pre-trained on Twitter data.

3 Datasets

Constructing a dataset for the sarcasm detection problem is not a straightforward task, since the perception of sarcasm is difficult even for people. A general approach to dataset creation is to scrape the data from different social media platforms, e.g., Twitter, Reddit, and use user-specified labels, i.e., hashtags on Twitter and /s on Reddit [11, 15, 16]. But this approach has several drawbacks, like users not annotating sarcasm with tags or misusing labels to express their opinion better. The Headlines dataset was introduced to solve the mentioned problem. The dataset contains headlines from two news websites: one where real-world events are reported, and the other with sarcastic descriptions of events, including sarcastic headlines [17]. The third common way is to manually label data, but this is time-consuming and still requires an annotator with a good sense of sarcasm.

Since no dataset for sarcasm detection in the Slovenian language exists, and manual labeling is time-consuming, we created the Slovene dataset with user-specified labels. As a knowledge base for our task, tweets (i.e., posts on Twitter) were selected. Tweets annotated by users with specific hashtags (e.g., #sarcasm, #sarkazem) were considered sarcastic (i.e., positive) examples, while other tweets were non-sarcastic (i.e., negative) examples. For the English dataset, we selected the one from the 2nd Workshop on Figurative Language Processing [11], because it was constructed in the same way as the Slovene dataset. Before training, the datasets were split into the training and the test sets, as shown in Table 1.

Table 1: Train-Test split.

Set     English   Slovene
Train   5000      759
Test    1800      272
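The hashtag-based labeling is straightforward to reproduce. A sketch of such a helper (our own illustration, not the authors' code) marks tweets carrying a sarcasm hashtag as positive and strips the label-bearing tag so the classifier cannot key on it:

    import re

    def label_tweets(tweets, sarcasm_tags=("#sarcasm", "#sarkazem")):
        """Weak labeling by user-specified hashtags: positive if a sarcasm
        hashtag is present (removed from the text), negative otherwise."""
        pattern = re.compile("|".join(map(re.escape, sarcasm_tags)), re.IGNORECASE)
        return [(pattern.sub("", text).strip(), int(bool(pattern.search(text))))
                for text in tweets]

    print(label_tweets(["Great, that's just what I needed! #sarcasm",
                        "Beautiful sunny day at the coast"]))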
4 Method

Although transformers can be fine-tuned to specialize in a specific task, the process takes a long time on common hardware. However, as shown in [13], fine-tuning can be avoided by utilizing other networks to find correlations in the transformer embeddings. Since the transformer's weights are not changing, its output can be calculated only once and then saved before learning the second part of the network. This approach significantly improves learning times.

For the experiment, we implemented a neural network model similar to the one used in [13]. As some details about the network were missing in the mentioned paper, we also relied on the implementation in [18]. The architecture of the neural network is shown in Figure 1. During the experiment, three different transformers were explored: RoBERTa [9], pre-trained on an English dataset, and two DistilBert [10] transformers, one pre-trained on the English language and the other on multiple languages (DistilBert mult). DistilBert transformers are smaller than RoBERTa (66 M vs. 125 M training parameters), which translates to a significant speedup in embedding generation time.

As mentioned before, tokenized inputs were transformed to embeddings at the beginning of the training. The embeddings were saved to the transformer output cache. Then they were used as input to the Bidirectional LSTM layer, whose outputs were concatenated with the original embeddings before the pooling layer. Before flattening the data, 1D spatial dropout was applied. After a dense and another dropout layer, a dense layer with softmax activation was applied.

Figure 1: The architecture of the neural network (Transformer → transformer output cache → Bidirectional LSTM → concatenation → pooling → dropout → flatten → dense → dropout → softmax).

For the training, the Google Colab [19] environment with a Google TPU was used. The neural networks were trained for 25 epochs with a 10% validation split. In the end, the weights of the epoch with the smallest validation loss were restored.
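A sketch of the Figure 1 classifier over cached transformer embeddings, written with tf.keras; the layer widths and dropout rates are illustrative guesses rather than the paper's exact values, and the input is assumed to be the precomputed embedding matrix (sequence length × hidden size) taken from the transformer output cache.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_classifier(seq_len=20, emb_dim=768, lstm_units=128):
        emb = layers.Input(shape=(seq_len, emb_dim))    # cached transformer embeddings
        lstm = layers.Bidirectional(
            layers.LSTM(lstm_units, return_sequences=True))(emb)
        x = layers.Concatenate()([lstm, emb])           # BiLSTM output + original embeddings
        x = layers.MaxPooling1D(pool_size=2)(x)
        x = layers.SpatialDropout1D(0.3)(x)             # 1D spatial dropout before flattening
        x = layers.Flatten()(x)
        x = layers.Dense(64, activation="relu")(x)
        x = layers.Dropout(0.3)(x)
        out = layers.Dense(2, activation="softmax")(x)  # sarcastic vs. non-sarcastic
        model = tf.keras.Model(emb, out)
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    build_classifier().summary()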
5 Experiments and Results

The accuracy of the models was tested on the test datasets with embeddings of length 20. Additionally, we tested the model using the RoBERTa transformer with embeddings of length 100, to find out how the embedding length affects the results. However, since the results with larger embeddings were similar to those obtained with smaller ones, we did not train the DistilBert-based models on larger embeddings due to hardware limitations. The results are shown in Table 2.

Table 2: Results of the experiment. The first three columns contain the transformer's name (M), the used dataset (L), and the embedding length (EL). The dataset is denoted by its language (slo for Slovene and en for English). In the last four columns, accuracy (A), precision (P), recall (R), and F1 score (F1) are presented. The results are rounded to two decimal places.

M                  L     EL    A      P      R      F1
RoBERTa            en    100   0.70   0.67   0.78   0.72
RoBERTa            en    20    0.66   0.63   0.79   0.70
DistilBert         en    20    0.63   0.60   0.74   0.67
DistilBert - mult  en    20    0.61   0.58   0.80   0.67
RoBERTa            slo   100   0.83   0.82   0.95   0.88
RoBERTa            slo   20    0.81   0.83   0.89   0.86
DistilBert         slo   20    0.74   0.83   0.78   0.80
DistilBert - mult  slo   20    0.79   0.89   0.78   0.83

For all tested models, the results on the Slovene data were significantly better than on the English data (ranging from an 11% to 18% improvement). This means that in the used Slovene dataset, sarcasm was more clearly expressed. The best results on both datasets were achieved using the RoBERTa transformer with an embedding length of 100. Even when using shorter embeddings, the RoBERTa transformer performed the best.
5 Experiments and Results

The accuracy of the models was tested on the test datasets with embeddings of length 20. Additionally, we tested the model using the RoBERTa transformer with embeddings of length 100, to find out how the embedding length affects the results. However, since the results with larger embeddings were similar to those obtained with smaller ones, we did not train the DistilBert-based models on larger embeddings, due to the hardware limitations. The results are shown in Table 2.

Table 2: Results of the experiment. The first three columns contain the transformer's name (M), the used dataset (L), and the embedding length (EL). The dataset is denoted by its language (slo for Slovene and en for English). The last four columns present the accuracy (A), precision (P), recall (R), and F1 score (F1). The results are rounded to two decimal places.

M                  L    EL   A     P     R     F1
RoBERTa            en   100  0.70  0.67  0.78  0.72
RoBERTa            en   20   0.66  0.63  0.79  0.70
DistilBert         en   20   0.63  0.60  0.74  0.67
DistilBert - mult  en   20   0.61  0.58  0.80  0.67
RoBERTa            slo  100  0.83  0.82  0.95  0.88
RoBERTa            slo  20   0.81  0.83  0.89  0.86
DistilBert         slo  20   0.74  0.83  0.78  0.80
DistilBert - mult  slo  20   0.79  0.89  0.78  0.83

For all tested models, the results on the Slovene data were significantly better than on the English data (an improvement ranging from 11 % to 18 %). This means that in the used Slovene dataset, sarcasm was expressed more clearly. The best results on both datasets were achieved using the RoBERTa transformer with an embedding length of 100. Even when using shorter embeddings, the RoBERTa transformer performed best. However, the difference between embeddings of length 100 and 20 was small (only a 2 % difference in the F1 score). Additionally, the difference between using the RoBERTa and the DistilBERT transformer is also relatively small (a 3 % to 6 % difference in the F1 score), which implies that DistilBERT can be a good alternative to RoBERTa on low-cost hardware. When using the multilingual transformer, the results on the English dataset were close to those of the English-only transformer. However, on the Slovene dataset, the multilingual transformer provided slightly better results.

In [13], the performance of the RCNN-RoBERTa model was measured on various datasets. The F1 score was between 78 % and 90 %, which is considerably better than our results in English, but comparable to those on the Slovene dataset. However, since different datasets were used in that study, the results are hard to compare. The same English dataset as applied here was used in [11], where the participants presented 13 different solutions, ranging between F1 scores of 0.58 and 0.83. Only three solutions were better than the F1 score of 0.72, which we achieved with the RoBERTa transformer and an embedding length of 100.

Dataset construction is one of the most critical parts of the experiment, since it provides the knowledge base for the transformer models. According to the obtained results, the models were able to detect sarcasm in the given examples, despite the several drawbacks explained in Section 3. The used approach is good enough for uncomplicated use cases where sarcasm is meant to be detected, since users also annotate their messages with #sarcasm. But for more complicated use cases with complex and more challenging examples, alternative methods of dataset construction should also be explored.

6 Conclusion

In this paper, we compared how the utilization of different transformers combined with a BiLSTM model affects the accuracy of sarcasm prediction. RoBERTa, English-based DistilBERT, and multilingual DistilBERT transformers were used in the experiment. All three transformers were combined with the same BiLSTM model and trained on the English and Slovene datasets. Afterwards, the accuracy of the models was obtained using test datasets in the English and Slovene languages. In the best case, F1 scores of 0.72 and 0.88 were achieved on the English and Slovene datasets, respectively.

In the future, more work could be done on dataset creation. Different dataset construction approaches, as described in Section 3, can be explored and adjusted for the Slovene language. Furthermore, the current datasets can be expanded by adding more tweets (especially Slovene) or data from different sources, e.g., Reddit. Another interesting direction for future work is exploring transfer learning to reuse the models on different languages, e.g., languages similar to Slovene.

References

[1] G. G. Chowdhury. Natural language processing. Annual Review of Information Science and Technology, 37(1):51–89, 2003. doi: 10.1002/aris.1440370103.
[2] A. Joshi, P. Bhattacharyya, and M. J. Carman. Automatic sarcasm detection: a survey. ACM Comput. Surv., 50(5), Sept. 2017. doi: 10.1145/3124420.
[3] J. Tepperman, D. Traum, and S. Narayanan. "Yeah Right": sarcasm recognition for spoken dialogue systems. In Ninth International Conference on Spoken Language Processing, 2006.
[4] T. Veale and Y. Hao. Detecting ironic intent in creative comparisons. In Proceedings of the 19th European Conference on Artificial Intelligence - ECAI 2010, pages 765–770, Amsterdam, The Netherlands. IOS Press, 2010. doi: 10.3233/978-1-60750-606-5-765.
[5] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, 63(10):1872–1897, 2020. doi: 10.1007/s11431-020-1647-3.
[6] S. Amir, B. C. Wallace, H. Lyu, P. Carvalho, and M. J. Silva. Modelling context with user embeddings for sarcasm detection in social media. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 167–177, Berlin, Germany. Association for Computational Linguistics, Aug. 2016. doi: 10.18653/v1/K16-1017.
[7] R. Kulshrestha. Transformers. 2020. url: https://towardsdatascience.com/transformers-89034557de14 (visited on 06/20/2021).
[8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018. arXiv: 1810.04805.
[9] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019. arXiv: 1907.11692.
[10] V. Sanh, L. Debut, J. Chaumond, and T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019. arXiv: 1910.01108.
[11] D. Ghosh, A. Vajpayee, and S. Muresan. A report on the 2020 sarcasm detection shared task. In Proceedings of the Second Workshop on Figurative Language Processing, pages 1–11, Online. Association for Computational Linguistics, July 2020. doi: 10.18653/v1/2020.figlang-1.1.
[12] H. Srivastava, V. Varshney, S. Kumari, and S. Srivastava. A Novel Hierarchical BERT Architecture for Sarcasm Detection. In Proceedings of the Second Workshop on Figurative Language Processing, pages 93–97, Stroudsburg, PA, USA. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.figlang-1.14.
[13] R. A. Potamias, G. Siolas, and A. G. Stafylopatis. A transformer-based approach to irony and sarcasm detection. Neural Computing and Applications, 32(23):17309–17320, Dec. 2020. doi: 10.1007/s00521-020-05102-3.
[14] A. Khatri and P. P. Sarcasm detection in tweets with BERT and GloVe embeddings. In Proceedings of the Second Workshop on Figurative Language Processing, pages 56–60, Online. Association for Computational Linguistics, July 2020. doi: 10.18653/v1/2020.figlang-1.7.
[15] M. Bouazizi and T. Otsuki Ohtsuki. A pattern-based approach for sarcasm detection on Twitter. IEEE Access, 4:5477–5488, 2016. doi: 10.1109/ACCESS.2016.2594194.
[16] M. Khodak, N. Saunshi, and K. Vodrahalli. A Large Self-Annotated Corpus for Sarcasm, 2017. arXiv: 1704.05579.
[17] R. Misra and P. Arora. Sarcasm Detection using Hybrid Neural Network, 2019. arXiv: 1908.07414.
[18] L. Famiglini. Irony-Sarcasm-Detection-Task. url: https://github.com/lorenzofamiglini/Irony-Sarcasm-Detection-Task (visited on 06/25/2021).
[19] Google Colab. url: https://colab.research.google.com/notebooks/intro.ipynb (visited on 06/06/2020).
Extraction and Analysis of Sport Activity Data Inside Certain Area

Luka Lukač
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
luka.lukac@student.um.si
DOI https://doi.org/10.18690/978-961-286-516-0.11   ISBN 978-961-286-516-0

Abstract

Nowadays, sport data analysis is one of the crucial factors used to enhance the athletes' performance, which can depend upon many different circumstances. One of those is the area of an exercise, which can dramatically impact an athlete's performance. Since not enough devotion has been given to this topic, this study focuses on extracting and analysing the parts of exercises which take place inside a specific area, using principles from another part of Computer Science, Computational Geometry.

Keywords: sports, area, extraction, analysis

1 Introduction

As long as professional sport competitions have been organised, professional as well as amateur athletes have been constantly trying to enhance their performance. In order to know in which aspects their exercises can be improved, tracking and analysing data is crucial. One of the important factors which can influence the performance is the area where the training is performed. This study focuses on the extraction of the data inside certain areas, which allows a thorough analysis of the effectiveness of an athlete in those regions.

Nowadays, many athletes monitor their performance using modern sport equipment, such as sport watches or sport wristbands. There are websites that are focused on collecting that kind of data; however, it is very likely that the data does not have public access [3] and that the databases require a lot of preprocessing skills. An example of a toolbox for extracting features from a sport activity is 'sport-activities-features', which is available in a GitHub repository at https://github.com/firefly-cpp/sport-activities-features. It is focused on one of the most intensive and demanding tasks in Machine Learning: data preprocessing. [2]

The overall load indicators that are part of sport activity datasets (total distance, total duration, average heart rate, etc.) have some disadvantages, such as that the details are not expressed sufficiently, only a general outlook of the training is captured, and the different phases of the training are not recognized directly, or are not recognized at all. Thus, more hidden indicators, like intervals, topographic maps and weather data, are extracted in the 'sport-activities-features' toolbox. [2]

One of the crucial features that allows retrieving a lot of interesting and useful data is area detection, which is the main part of this study. It is well known that doing exercises in hilly areas can be a lot more difficult and demanding. Despite an apparently worse performance in those regions (if only the overall indicators are observed), an athlete can put more power into that activity compared to one which is set in a flat area. As different areas may have distinct characteristics, it is important to be able to analyse the performance of an athlete in a certain area.

Datasets that are used to extract this kind of data must include the positions of an athlete at a certain time. If a dataset is corrupt or some data are missing, the analysis can output wrong results; therefore, it is very important to have datasets without such issues.
2 Materials and Methods

2.1 Materials

These days, a lot of athletes track their performance during the training using modern sport trackers, which allow them to capture a lot of indicators of their exercise. Since there are many indicators that are not visible directly, but can be extracted as a product of an extensive data analysis, a sport activity can be analysed more thoroughly and deeply. A great tool for analysing the data is Machine Learning, whose efficient computational methods enabled the rise of research in the automatic planning of sport training sessions. [1]

2.2 Methods

In order to unambiguously detect the points that are located inside the chosen area, an inclusion test is a necessity. There are three major algorithms one can use to determine whether a point is inside the area:
• the algorithm of the same signs,
• the algorithm of the sum of angles,
• the algorithm with rays.

To start with, the first algorithm that was implemented, and seemed to be working, was the algorithm of the same signs. Its principle is quite simple: if the sign of the cross product between the hull line and the line from the hull to the point is the same for all the hull lines, the point lies within the polygon that represents the area with great certainty. However, there are two significant disadvantages regarding this algorithm: firstly, it does not allow holes inside the area, and secondly, it does not allow concave polygons (representing the area). [4]

Figure 1: Algorithm of the same signs.

The next algorithm that was considered was the algorithm of the sum of angles. From the point which has been chosen for the inclusion test, a ray is made to each point that is located on the hull.
The next step is to calculate the sum of the angles between the rays constructed in the previous step. If the sum is 360°, the point is in the area, and if the sum is 0°, the point lies outside the area. Although this algorithm allows concave angles on the border of the area, holes are still not permitted, which is why it has not been implemented in the package as an inclusion test. [4]

Figure 2: Algorithm of the sum of angles.

Finally, the third algorithm, which uses rays to determine whether a point lies inside an area, was implemented. A ray from the point which we want to check, in any direction, is constructed. If the ray intersects the area border an even number of times, the point is not a part of the area. On the other hand, if there is an odd number of intersections between the ray and the hull, the point lies inside the area. This algorithm is much less error-prone than the aforementioned algorithms, as it produces the right results even if there are holes or concave angles present in the area, which is why this algorithm has been chosen for determining which parts of an exercise were inside the given area. [5]

Figure 3: Ray algorithm.

1 function GET-POINTS-INSIDE-AREA(positions, area)
2   points_in_area ← {};
3   for position ← 1 to positions.count do
4     n ← number_of_intersections(position, area);
5     if n % 2 = 1 then
6       points_in_area ← points_in_area + position;
7     end
8   end
9   return points_in_area;
Algorithm 1: Detection of points of an exercise inside the given area.
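For illustration, the even-odd rule behind Algorithm 1 can be implemented in a few lines of Python (a minimal sketch with our own helper names; the area is given as a list of hull vertices):

# A compact ray-casting (even-odd rule) inclusion test, following the idea
# of Algorithm 1. Helper names are illustrative.
def point_in_area(point, area):
    x, y = point
    inside = False
    n = len(area)
    for i in range(n):
        (x1, y1), (x2, y2) = area[i], area[(i + 1) % n]
        # Count crossings of a horizontal ray cast to the right of the point.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside   # an odd number of crossings => inside
    return inside

def get_points_inside_area(positions, area):
    return [p for p in positions if point_in_area(p, area)]

# Example: a concave area and two test positions.
area = [(0, 0), (4, 0), (4, 4), (2, 2), (0, 4)]
print(get_points_inside_area([(1, 1), (2, 3.5)], area))  # -> [(1, 1)]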
3 Results

After the implementation of the algorithm that detects the parts of an exercise inside an area, a visualisation was made in order to be able to test the algorithm properly. To start with, the map sectors where the exercise is taking place are downloaded from OSM (OpenStreetMap) and merged into an image, represented as a map. Then, the area that has been used for the extraction of the data is plotted on the map, and lastly, the identified and unidentified points of an exercise are drawn on the map according to their coordinates. An example of the visualization can be found in Figure 5, where the wider area of the Slovenian Littoral is displayed on the map and the exercise is plotted.

Figure 4: Visualization of the area detection.

However, not only are the parts inside the area extracted, but the data inside the area are also analysed. As already pointed out, the performance of an athlete inside different areas can vary significantly. This is why it is important to be able to retrieve data specific to a certain region. One important factor about areas is also what the performance of an athlete was like, not only in one training session, but in more exercises that took place at different times. This is why data such as the whole distance, the average speed, the maximum speed, the average heart rate, etc., are extracted from the activities, or their parts, which took place inside the same area. The list of the extracted features is given in Table 1.

Table 1: List of extracted features inside area.
ID   Feature              Description
1    Distance             Total distance inside area.
2    Time                 Total time inside area.
3    Maximum speed        Maximum speed inside area.
4    Average speed        Average speed inside area.
5    Minimum heart rate   Minimum heart rate inside area.
6    Maximum heart rate   Maximum heart rate inside area.
7    Average heart rate   Average heart rate inside area.

Figure 5: Visualization of more exercises inside area.

Despite the algorithm working flawlessly, there can be issues with obtaining datasets which would suit the analysis. As this kind of data is usually not publicly available, a user is often limited to analysing their own activities, which is one of the reasons why there is only a limited number of exercises visualized and displayed in this paper.

4 Discussion

To conclude, area identification is one of the crucial factors in sport data analysis. Since different areas have different difficulties, it is crucial to be able to analyse the regions separately. This study portrays how this task can be done using algorithms from Computational Geometry. In the beginning, the parts of exercises inside the desired areas should be extracted, and then the extracted data can be used for the analysis of the performance of an athlete. The next step, which has not been a part of this study, could be the Machine Learning phase, during which the extracted data could be analysed even further.

Acknowledgment

The author of this paper would like to kindly thank Iztok Fister Jr., who provided a lot of motivation and assistance during the writing of this study.

References

[1] Fister, I., Fister Jr., I., and Fister, D. Computational intelligence in sports. Springer, 2019.
[2] Fister Jr., I., Lukač, L., Rajšp, A., Fister, I., Pečnik, L., and Fister, D. A minimalistic toolbox for extracting features from sport activity files. In 2021 IEEE 25th International Conference on Intelligent Engineering Systems (INES) (2021), IEEE, pp. 121–126.
[3] Rajšp, A., and Fister Jr., I. A systematic literature review of intelligent data analysis methods for smart sport training. Applied Sciences 10, 9 (2020), 3013.
[4] Topiwala, A. Is the point inside the polygon? Towards Data Science (2020).
[5] Vacek, L., Atter, E., Rizo, P., and Nam, B. sUAS for deployment and recovery of an environmental sensor probe.

Methodology for the Assessment of the Text Similarity of Documents in the CORE Open Access Data Set of Scholarly Documents

Ivan Kovačič
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
ivan.kovacic@um.si

David Bajs
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
david.bajs@student.um.si

Milan Ojsteršek
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
milan.ojstersek@um.si
DOI https://doi.org/10.18690/978-961-286-516-0.12   ISBN 978-961-286-516-0

Abstract

This paper describes the methodology of the data preparation and analysis of the text similarity required for plagiarism detection on the CORE data set. Firstly, we used the CrossREF API and the Microsoft Academic Graph data set for metadata enrichment and the elimination of duplicates of documents from the CORE 2018 data set. In the second step, we used 4-gram sequences of words from every document and transformed them into SHA-256 hash values. The features retrieved using the hashing algorithm are compared, and the result is a list of documents and the percentages of coverage between pairs of documents' features. In the third step, called pairwise feature-based exhaustive analysis, the pairs of documents are checked using the longest common substring.

Keywords: data preparation for text similarity analysis, text similarity, CORE data set, metadata enrichment
1 INTRODUCTION

Text similarity is a measure of the degree of content similarity between two textual documents. It is used as a metric which assists in plagiarism detection. In general, plagiarism is a term that is used to describe acts of intellectual theft and violations of the material and moral rights of the authors of intellectual property. However, there are several definitions of plagiarism that may be applied, depending on the circumstances of each individual case of plagiarism. Maurer et al. [9] have compiled a list of possible definitions of plagiarism. They include:
• presenting other people's work as your own,
• copying other people's work (or ideas) without giving credit,
• improper quotation,
• copying sentence structures with minor modifications without giving credit,
• copying the majority of your work from others, regardless of giving credit [9].

In order to efficiently assert the existence and degree of plagiarism, electronic detection is necessary along with manual analysis. Using electronic detection, the similarity index [1] is a metric which is reliable for identifying potential cases of plagiarism, but it does not necessarily indicate that plagiarism has occurred in an individual case. Another important step is to obtain a sufficient amount of metadata about any given document which is analyzed. Using a sufficient amount of metadata, it is easier to detect duplicate records of the same document, which allows for a higher quality assessment of the degree of plagiarism of a given document. If a journal article is recorded twice in the same data set with inconsistent metadata, there exists a pair of documents with a high degree of similarity. Without good quality metadata, the pair of documents may appear as totally different journal articles, although they represent the same instance of intellectual work. In the following sections, we describe our methodology of data preparation for text similarity, metadata enrichment and the computation of text similarity metrics on the CORE data set of scholarly documents, as well as an analysis of the run-time complexity of our methodology.

2 RELATED WORK

Eaton and Crossman did a review of the literature in the field of self-plagiarism in social sciences research databases and found that only 5.8 % of a sample of search results consisted of primary research done in this field, which indicates that the field of self-plagiarism is a relevant topic of research in the social sciences; we may also infer a higher relevance of this research topic for computational methods of detecting self-plagiarism [2].

In [8], a study was described in which a sample of articles that were published in the Scientific Periodicals Electronic Library (SPELL) was analyzed for plagiarism. A comparison was made between two samples, taken from the years 2013 and 2018. The former sample contained literal reproductions in 65.9 % of the total sample size, while the latter sample from 2018 contained evidence of plagiarism in 44 % of the total sample size. A reduction of the similarity index was noticeable, the difference being attributed to the fact that the sample of articles from 2018 contained guidelines related to plagiarism and self-plagiarism. The methodology of this study consisted of random sampling of the data set and inputting the samples into the iThenticate plagiarism detector.

García-Romero and Estrada-Lorenzo have described the analysis of the Déjà vu database using citation analysis and found that cases of plagiarism are more frequently published in journals with lower visibility, and also confirmed that duplicate documents not citing the original document show a higher degree of full-text similarity than those citing it [5].
3 METHODOLOGY

In this section, we present our methodology of applying plagiarism detection algorithms and procedures on the CORE data set, including the metadata enrichment techniques and the plagiarism detection methods used in this study.

3.1 The CORE data set

CORE is one of the leading aggregation services of open-access scholarly documents publicly available on the web. Currently, CORE has aggregated the metadata of over 200 million open-access research articles and other scholarly documents, harvested from over 10,000 open-access repositories. A full list of the harvested repositories may be seen in [12]. In our study, we used the 2018-03-01 CORE data set. The data set contains 9,767,152 unique documents with full text. Each individual document is represented as a JSON record containing both the full text of the document as well as the extracted metadata. The schema of the JSON records is described in Table 1.

Table 1: CORE data set JSON schema.
JSON attribute       Description
coreId               Unique CORE identifier
doi                  DOI identifier (if available)
title                Title of the document
authors              Authors (array)
contributors         Contributors (array)
datePublished        Date of publication
abstract             The abstract of the document
downloadUrl          URL of the PDF document
language             Language of the document
fullTextIdentifier   Full text unique ID
pdfHashValue         PDF signature hash
journals             List of journals or journal aliases
rawRecordXml         Harvested OAI-PMH XML response
year                 Year of publication
relations            Related entities (journals, journal aliases)
topics               OAI-PMH dc:topic value
subjects             OAI-PMH dc:subject value
issn                 ISSN (if available)
fullText             Full text of the document

Metadata is important for plagiarism/self-plagiarism analysis, because it allows an expert who analyses an individual case of plagiarism to evaluate the authorship, the year of publication and other data of a document which is being analyzed in comparison to the document's potential candidates for document similarity. The CORE data set does provide a rich aggregation of metadata for scholarly documents, but it is also limited by the quality of the metadata provided by the individual servers in the harvested OAI-PMH repositories. The CORE team actively develops means of enriching/improving the metadata associated with each individual document, since there exist issues with the provided metadata. These issues include:
• Duplicate document entries - there exist several duplicates in the CORE data set. These are present because of multiple publications of the same document in several different repositories that are harvested by CORE. The individual repositories have inconsistent metadata, which makes the detection of duplicates a non-trivial task.
• Unstructured metadata - most documents in the 2018 CORE data set contain insufficiently structured metadata, since the aggregation process makes it difficult to ascertain data like the journal name and the publisher. This information goes missing in the dc:topic and dc:subject fields, and mapping the data in these fields to their respective metadata fields is a non-trivial task.
• Non-defined document types
• Non-defined fields of study
• Non-normalized author names

The issues can be elaborated with a concrete example. The document with CORE ID 47847252 is a book section written in French. The document type, found in the subjects field, is "Book sections", which is a non-standardized document type, originating from the terminology used by the digital library that hosts the OAI server which provided this document. The authors' names are recorded with the last name written first, followed by the initial of the first name. In contrast, the document with CORE ID 6871221 has its authors written with the first name first, followed by the last name. Such inconsistencies, which are the result of different bibliographic notations, increase the difficulty of the automated detection of self-plagiarism and/or duplication of documents.
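Since each document is a plain JSON record, it can be consumed directly. The following is a minimal illustration using the attribute names from Table 1 (the file name is hypothetical):

# Minimal illustration of reading one CORE JSON record; attribute names
# follow Table 1, the file name is hypothetical.
import json

with open("core_record.json") as fh:
    record = json.load(fh)

print(record["coreId"], record.get("doi"), record["title"], record["year"])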
3.2 Microsoft Academic Graph (MAG)

In order to tackle the issue of incomplete and inconsistent metadata, we utilized the mapping to the Microsoft Academic Graph done by the CORE team, with the help of their published 2019-04-01 MAG mapping data set. The Microsoft Academic Graph is a document graph that contains metadata records about scholarly documents and the citation relationships between them, as well as normalized data about authors, journals, fields of study and other metadata. By connecting CORE to MAG using the DOI persistent identifier, an intersection of over 1,200,000 articles was made in 2016 [6]. The MAG data set contains structured metadata of scholarly documents. It is organized the same way a classical relational database is, with individual schemas representing entities, and the connections between them represented as foreign keys. A full description of the MAG schema is available in [3]. We emphasize some of the more useful metadata entries in MAG which allow for the metadata enrichment of CORE:
• Normalized authors - the authors in MAG are normalized in a way that makes them uniquely identifiable and distinguishable from other authors. The displayed author name is separated from the author's normalized name. Where available, the authors are also associated with an affiliation.
• Fields of study - the papers documented in MAG have a mapping to the fields of study of that particular paper. The fields of study are organized in a hierarchical structure, which allows us to construct a tree structure of fields of study for each paper in MAG.
• Journal data - in MAG, journal articles are associated with journals, which have their names normalized and other metadata associated with them, where available. Namely, a citation count, the publisher of the journal and the ISSN, where available [11].

3.3 CrossREF API

CrossREF is a service that aggregates metadata from publishers and provides tools for several use cases related to metadata; among other things, it is an agent of the DOI Foundation (DF), and it allows users to register DOIs for their scholarly documents. One of the most useful features of CrossREF is the CrossREF public API, which allows for metadata enrichment by resolving a DOI which is contained in the CrossREF database into the metadata of the associated document. The metadata provided by CrossREF is rich and reliable, with information about the journal and the publisher, the authors with affiliations (where available), the document types (e.g. journal article or conference paper), the reference counts, and the volume and issue, among others. The API itself is free to use, with no sign-up or authentication required; the client who consumes the REST API must only adhere to the throttling mechanisms specified by CrossREF, which limit the number of requests per second available to the client. The API returns a requested sleep interval for the client in the X-Rate-Limit-Interval HTTP header.

3.4 Metadata enrichment of CORE using CrossREF and MAG

Our approach to metadata enrichment utilized the CORE MAG mapping with some additional enhancements. We imported a published MAG dump available in [10]. When a document in CORE had a DOI available, the document's DOI was used for pairing with MAG where the mapping was not previously present in the 2019-04-01 MAG mapping data set. In addition, we normalized the document titles and looked for matches in MAG, excluding the pairings where anomalies had been detected (non-matching year of publication, non-matching authors). We further enriched the metadata of these documents by consuming the CrossREF API and acquiring the ISSN, the proper volume, issue and publisher for the documents. Over 30 % of the articles in CORE that have a DOI have no ISSN associated with them in the MAG dump provided in [10]. Our method of metadata enrichment is summarized by Algorithm 1.

Data: set of CORE documents
Result: set of CORE documents with enriched metadata
1  for document in set of CORE documents do
2    if document in CORE MAG mapping and document has DOI then
3      consume CrossREF API;
4      map result to metadata properties of the CORE document;
5    else if normalized document title and year match then
6      enrich document using new MAG mapping;
7      consume CrossREF API;
8      map result to metadata properties of CORE document;
9    else
10     mark document as incomplete;
11   end
12 end
Algorithm 1: CORE metadata enrichment with MAG and CrossREF.
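As an illustration, a single enrichment step against the public CrossREF API could look as follows (a sketch; the field selection and error handling are simplified, the helper name is ours, and the works endpoint and the X-Rate-Limit-Interval header are part of the public CrossREF REST API described above):

# Illustrative sketch of one enrichment step via the public CrossREF API.
import requests, time

def crossref_metadata(doi):
    resp = requests.get(f"https://api.crossref.org/works/{doi}")
    # Respect the sleep interval advertised by CrossREF (simplified).
    interval = resp.headers.get("X-Rate-Limit-Interval", "1s")
    time.sleep(float(interval.rstrip("s")))
    if resp.status_code != 200:
        return None                      # DOI not registered with CrossREF
    msg = resp.json()["message"]
    return {
        "issn": msg.get("ISSN", [None])[0],
        "publisher": msg.get("publisher"),
        "volume": msg.get("volume"),
        "issue": msg.get("issue"),
        "type": msg.get("type"),         # e.g. journal-article
    }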
3.5 Text matching in the CORE data set

In this section, we describe how plagiarism detection was implemented on the CORE data set. We serialized the entire CORE 2018-03-01 data set of JSON records into a PostgreSQL database, creating a relational model which corresponds to the JSON schema of CORE. To implement a plagiarism detector, we firstly had to implement a candidate search, followed by a pairwise text matching algorithm.

3.5.1 Candidate retrieval

For finding the plagiarism candidates for each individual document, we computed n-gram models for each document in CORE. The full text of the documents is tokenized with a minimum sentence length of 40 characters, using rules for mapping ordinal numbers, URLs and email addresses to tokens, removing stop words and computing lexicographically sorted n-gram sequences with n = 4. The minimum length of 40 characters was chosen because it is used in the Slovenian open access infrastructure. Sequences of 4-grams have been determined as more efficient for candidate retrieval in previous studies [4]. A space-delimited sequence of sentence n-grams is transformed into an SHA-256 hash value, which is stored in a database. The set of all n-gram hashes represents the set of features of the entire full text which can be matched to other documents, thus acquiring lists of potential candidates for plagiarism. In relational databases, this approach can be implemented efficiently by performing a table join of the table containing the SHA-256 hashes with itself, with the table itself being structured as a key-value map, where the key is a unique ID of the document which is mapped to the hashed n-gram value. In order for such a table join query to be time-efficient, indexes are necessary on the column containing the n-gram hashes, which drastically increases the space complexity of the described method. Once the self-join query had been performed, a hash coverage index was computed:

hashCoverage = unique covered hashes in candidate / total number of hashes in document

The candidates are then sorted in descending order by the hashCoverage metric and stored into the database as a candidate list. The procedure of the candidate search is summarized in Algorithm 2.

Data: fileid - CORE document ID
Result: set of candidate documents in CORE
1  for document in set of CORE documents do
2    perform self-join on the table of hashes where document ID = fileid;
3    compute hashCoverage;
4    sort candidate list by hashCoverage descending;
5    store candidate list into database;
6  end
Algorithm 2: Candidate retrieval with hashed n-gram sequences.
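A simplified sketch of the feature extraction and the coverage metric is shown below (the tokenization is reduced to lowercasing and splitting; the real pipeline also maps URLs, e-mails and ordinal numbers to tokens and removes stop words):

# Sketch of the candidate-retrieval features: lexicographically sorted
# 4-gram sequences hashed with SHA-256, plus the hashCoverage metric.
import hashlib

def sentence_features(sentence, n=4):
    tokens = sorted(sentence.lower().split())        # lexicographic sort
    ngrams = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
    return {hashlib.sha256(" ".join(g).encode("utf-8")).hexdigest()
            for g in ngrams}

def hash_coverage(doc_hashes, candidate_hashes):
    # Share of the document's features also present in the candidate.
    return len(doc_hashes & candidate_hashes) / len(doc_hashes)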
3.5.2 Text matching of candidates

For each document, a maximum of 100 candidate documents is set as a limit, for reasons of reducing the space complexity of the procedure. After the lists of candidates are obtained for all the documents in CORE, a fine text comparison algorithm by Kärkkäinen et al. is used, which constructs a permuted longest common prefix array efficiently [7]. The output of the algorithm is a sequence of longest common prefixes, along with the offsets at which a computed longest common prefix occurred in the text. Our implementation of the algorithm is written in C++ and wrapped by a REST API, which accepts two texts with configuration parameters and returns the longest common prefix array. The configuration parameters are the minimum length of the document (minLen), the minimum length of the longest common suffix (lengthBoundary) and the size of the window in which neighbouring prefixes are searched (windowSize - for merging adjacent common prefixes into longer sequences). The REST API is bundled into a Docker image, which allows for an implementation of a distributed approach for calculating text matching on the CORE dataset. A workload of documents is generated, on which text matching is performed for each candidate, as a pair (document ID, status) in a table in a relational database, with status being a flag described in Table 2.

Table 2: Distributed text matching workload status flags.
Status flag   Description
-1            unprocessed
0             processing
1             processed

Each node running the distributed program performs a transactional UPDATE query on the workload table, setting the flags of N documents to 0 and simultaneously obtaining the document IDs from the table where the current flag value equals -1. It proceeds by obtaining all the full texts of the document and the candidates from the database, and passing them on to a pairwise comparison performed by the Docker REST service which is running on the same node. This approach allows for the efficient parallel computation of text matching between documents and their candidates. The parameter N is limited by the amount of free memory available to the node. Given an average number of candidates C per document and an average length of L bytes of a document in CORE, the estimated space complexity in bytes is given by N · C · L. If a node fails to perform the comparison, for example in the case of running out of free virtual memory, a transactional update of the status flag is made, setting the flag back to -1. This allows the document to be processed by another node. Using this approach, the procedure is fault-tolerant and partition-tolerant. The results of the text matching comparisons are arrays of offsets and lengths of common prefixes, which are used in the implementation of a plagiarism detector text matching user interface. The prefixes are also used to compute the similarity indices of the documents:

similarityIndex = Σ length of matches / total length of document
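In PostgreSQL, the transactional claim described above can be expressed in a single statement. The sketch below is illustrative (table and column names are ours, psycopg2-style parameters); adding FOR UPDATE SKIP LOCKED additionally lets concurrent nodes skip rows that are already being claimed:

# Sketch of how a node could atomically claim N unprocessed documents
# from the workload table; names are illustrative.
CLAIM_SQL = """
UPDATE workload SET status = 0
WHERE doc_id IN (
    SELECT doc_id FROM workload
    WHERE status = -1
    LIMIT %(n)s
    FOR UPDATE SKIP LOCKED
)
RETURNING doc_id;
"""

def claim_documents(conn, n=20):
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL, {"n": n})
        conn.commit()
        return [row[0] for row in cur.fetchall()]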
3.5.3 Deduplication

After computing the pairwise text matching between the documents and their candidates for plagiarism, the process of plagiarism detection in the CORE database is complete. We described the problem of duplicate entries in the CORE data set in 3.1. With the overall and pairwise similarity indices computed, we were able to conduct deduplication by using the computed indices in conjunction with the enriched metadata we extracted from MAG and CrossREF. The process of removing duplicates begins by acquiring a list of all pairwise matches with a similarity index above 85 %. For each such pair, we compare the normalized titles of the two documents, the year of publication and the ISSN. When a full match has been made, the second document is automatically marked as duplicated. When only a partial match has been made (e.g. only the normalized titles match), the document is marked as a potential duplicate. The potential duplicates are submitted for manual inspection; meanwhile, the detected duplicates are excluded from the candidate lists, and the similarity indices of the documents which contained duplicates are corrected, respectively.
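The decision rule can be summarised in a few lines. The following sketch is illustrative (field and function names are ours), not the production implementation:

# Illustrative sketch of the deduplication rule described above.
def classify_pair(a, b, similarity_index):
    if similarity_index <= 0.85:
        return "distinct"
    titles_match = a["norm_title"] == b["norm_title"]
    full_match = (titles_match and a["year"] == b["year"]
                  and a["issn"] == b["issn"])
    if full_match:
        return "duplicate"             # excluded from candidate lists
    if titles_match:
        return "potential duplicate"   # submitted for manual inspection
    return "distinct"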
4 RESULTS

By utilizing our method for metadata enrichment, we enriched the metadata for 3,457,071 documents in the 2018-03-01 CORE data set which contain a DOI persistent identifier value. We also acquired a complete set of metadata for a subset of 618,754 documents. This result is crucial for the goal of data preparation, since text similarity analysis is an instrument of plagiarism detection, which requires a sufficient amount of metadata in order to be useful to the expert conducting the plagiarism analysis. The latter, smallest subset contains ISSN values, which allows for further studies on journal plagiarism and self-plagiarism statistics by implementing plagiarism detection on this subset. Using our candidate retrieval method, we have computed over 4.7 billion n-gram hashes for the 9.8 million documents in the 2018-03-01 CORE data set, yielding a total hash table size of 1478 GB in PostgreSQL, using b-tree indexes and document IDs in the form of SHA-256 hashes. On average, we found 87.37 candidates per document for each document in the CORE data set.

For our implementation of the text matching comparison, we installed the Docker image containing the text matching REST API implementation on a cluster of 33 machines with Intel i5-8600 processors and 8 GB of internal memory. Our configuration of the parameters equaled minLen = 20, lengthBoundary = 40 and windowSize = 350, respectively, as they are set in the Slovenian open access infrastructure. The average document length in the CORE 2018-03-01 data set is 73.41 KB; we found that setting the parameter N = 20 allows for the stable and efficient processing of the CORE data set using this approach. Table 3 contains the benchmark statistics for a variable number of computing nodes running the text similarity Docker image. A random sample of 1000 documents is selected for a variable number of nodes which process the workload in parallel. The efficiency of the approach tends to asymptotically decline as we increase the number of nodes, which is a result of the centralized data storage being used for the synchronization of the compute nodes. The network infrastructure and the hardware and configuration of the database server represent a bottleneck. The table also contains a reference value for the time necessary to process all the 9,767,152 documents in the CORE 2018 data set for a given number of nodes. The benchmarks were performed on a subset of 10 nodes of the total 33 nodes used to process the entire CORE data set, but the results may be interpolated to a higher number of nodes.

Table 3: Distributed text similarity computation performance benchmarks.
#nodes   duration (s)   est. days (CORE 2018)
1        1590.806       179.83
2        753.917        85.23
3        547.659        61.91
4        425.046        48.05
5        374.564        42.34
6        306.461        34.64
7        294.918        33.34
8        285.888        32.32
9        233.364        26.38
10       217.015        24.53

5 CONCLUSION

The study described in this paper presents the methodology of establishing the largest plagiarism detection dataset in Slovenia. We have developed a framework for processing larger sets of documents in a pipeline for plagiarism detection, with means of metadata enrichment with the help of the CrossREF API and the Microsoft Academic Graph. Our output allows for further study in the fields of academic integrity, content similarity detection, stylometry and other fields of study. Utilizing our data set, which consists of enriched metadata entries for the 2018-03-01 CORE data set, including the similarity indices for each document in the data set, further research is possible.

6 ACKNOWLEDGMENTS

The authors thank Petr Knoth and the CORE team for providing the 2018-03-01 CORE data set and for the technical support during our efforts of obtaining the data from CORE. Also, we thank David Schmid for publishing the MAG dump on Zenodo in April 2019.

References

[1] Bretag, T., and Mahmud, S. Self-plagiarism or appropriate textual re-use? Journal of Academic Ethics 7, 3 (Sep 2009), 193.
[2] Eaton, S. E., and Crossman, K. Self-plagiarism research literature in the social sciences: A scoping review. Interchange 49, 3 (Aug 2018), 285–311.
[3] Eide, D., and Huang, C. Microsoft Academic Graph schema. Microsoft Academic Graph official website. Accessed Jun 1, 2021. [Online] Available: https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema.
[4] Fartek, J. Distributed generation of plagiarism detection reports. J. Fartek, 2018. Bachelor thesis.
[5] García-Romero, A., and Estrada-Lorenzo, J. M. A bibliometric analysis of plagiarism and self-plagiarism through Déjà vu. Scientometrics 101, 1 (Oct 2014), 381–396.
[6] Herrmannova, D., and Knoth, P. An analysis of the Microsoft Academic Graph. D-Lib Magazine 22 (09 2016).
[7] Kärkkäinen, J., Manzini, G., and Puglisi, S. J. Permuted longest-common-prefix array. In Combinatorial Pattern Matching (Berlin, Heidelberg, 2009), G. Kucherov and E. Ukkonen, Eds., Springer Berlin Heidelberg, pp. 181–192.
[8] Krokoscz, M. Plagiarism in articles published in journals indexed in the Scientific Periodicals Electronic Library (SPELL): a comparative analysis between 2013 and 2018. International Journal for Educational Integrity 17, 1 (Jan 2021), 1.
[9] Maurer, H., Kappe, F., and Zaka, B. Plagiarism - a survey. Journal of Universal Computer Science 12 (01 2006), 1050–1084.
[10] Microsoft Academic. Microsoft Academic Graph. Zenodo. Accessed Jun 1, 2021. [Online] Available: https://doi.org/10.5281/zenodo.2628216, Apr. 2019.
[11] Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B.-J. P., and Wang, K. An overview of Microsoft Academic Service (MAS) and applications. In Proceedings of the 24th International Conference on World Wide Web (New York, NY, USA, 2015), WWW '15 Companion, Association for Computing Machinery, pp. 243–246.
[12] The CORE team. CORE data providers. CORE official website. Accessed Jun 1, 2021. [Online] Available: https://core.ac.uk/data-providers?q=&size=10.

Embedding Non-planar Graphs: Storage and Representation

Ðorđe Klisura
University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Glagoljaška ulica 8, 6000 Koper, Slovenia
klisuradjordje10@gmail.com
DOI https://doi.org/10.18690/978-961-286-516-0.13   ISBN 978-961-286-516-0

Abstract

In this paper, we propose a convention for representing non-planar graphs and their least-crossing embeddings in a canonical way. We achieve this by using state-of-the-art tools such as the canonical labelling of graphs, Nauty's Graph6 string and combinatorial representations for planar graphs. To the best of our knowledge, this has not been done before. Besides, we implement the mentioned procedure in the SageMath language and compute the embeddings for certain classes of cubic, vertex-transitive and general graphs. Our main contribution is an extension of one of the graph data sets hosted on MathDataHub, and a step towards extending the SageMath codebase.

Keywords: planar graph, graph representation, crossing number, graph database

1 Introduction

Computers are increasingly used by mathematicians to assist them in their studies. This includes most aspects of a researcher's work, from publishing and reading papers to computations in mathematical software. Perhaps surprisingly, mathematicians also generate and use data, and the production and processing of massive datasets are becoming increasingly important in several areas of mathematics.

The primary applications of these mathematical datasets and databases are exploratory in nature. They are used by researchers to test hypotheses or to discover patterns and counterexamples. It is not difficult to find such "datasets" that even predate computers: the Atlas of Graphs and the Foster census are two examples from graph theory. The On-Line Encyclopedia of Integer Sequences (OEIS) is a modern mathematical database that represents an online database of integer sequences. The OEIS now contains over 334,000 sequences that are useful to both professional and amateur mathematicians, making it the largest database of its kind. The sequences in the database serve as fingerprints for the records they are associated with. A somewhat similar project in graph theory is the House of Graphs. An important notion is that of using mathematical objects, such as integer sequences or graphs, to search for mathematical theorems. This has been introduced as theorem fingerprinting by Billey and Tenner [2], as a way to improve the efficiency of searching for mathematical knowledge. In a broader sense, fingerprints are used in many fields of science, ranging from computer science to chemistry, archaeology and genetics. Computer documentation, reducing duplication in web search results, and surely DNA fingerprinting are a few examples.

Because of its applications in physics, biochemistry, biology, electrical engineering, astronomy, operations research and computer science, graph theory is rapidly moving into the core of mathematics. The theory of planar graphs is based on Euler's polyhedral formula, which relates the edges, vertices and faces of a polyhedron. Planar graphs are used in a variety of applications in the modern era, including constructing and organizing sophisticated radio-electronic circuits, railway maps, planetary gearboxes and chemical molecules. Pipelines, railway lines, subway tunnels, electric transmission lines and metro lines are all vitally crucial for modelling an urban city. For further reading on this topic, see Trudeau [13] and Barthelemy [1].

1.1 Related work and contributions

Some of the most significant projects that act as mathematical databases which assist researchers in their research projects are the SageMath platform [12], the American Mathematical Society's MathSciNet [11], the above-mentioned Encyclopedia of Integer Sequences [6], the House of Graphs [5], and the Atlas of Graphs [10]. The above-named tools are far from perfect, and are many times subject to important work of the open-source community.
This project aims to provide another toolset for researchers, by improving the MathDataHub platform, which will, in the future, host our database containing the planar embeddings minimising the number of crossings. Those embeddings are hard to compute, and such a database of precomputed embeddings does not exist in any mathematical database.

The paper is structured as follows. In Section 2, we present the central technique and ideas of embedding non-planar graphs. Moreover, we give a concrete algorithm that uses them, and evaluate it in terms of space and time complexity. In Section 3, we talk about the impact of our results so far. Finally, in Section 4, we present some output samples of the algorithm that has been evaluated, before we make some concluding remarks in Section 5.

2 Embedding non-planar graphs

In this section, we present our main result, namely the algorithm for calculating the canonical embedding of non-planar graphs.

Data: Non-planar graph G
Result: Planar embedding, crossing number and added vertices of G
1  V ← V(G)
2  edgePairs ← {{ab, cd} : ab, cd ∈ E(G), where a, b, c, d ∈ V(G) are pairwise distinct}
3  k ← 0
4  while G is not planar do
5    S ← all k-subsets of edgePairs
6    for {a_1, a_2, ..., a_k} in S do
7      E(G) ← E(G) \ ∪_{i=1}^{k} a_i
8      for a_i in {a_1, a_2, ..., a_k} do
9        {{a_i^1, a_i^2}, {a_i^3, a_i^4}} ← a_i
10       v_i ← vertex such that v_i ∉ V(G)
11       V(G) ← V(G) ∪ {v_i}
12       E(G) ← E(G) ∪ {a_i^1 v_i, a_i^3 v_i, v_i a_i^2, v_i a_i^4}
13     end
14     if G is planar then
15       return G, k, V(G) \ V
16     end
17   end
18   k ← k + 1
19 end
Algorithm 1: Algorithm for calculating the canonical embedding of non-planar graphs.

Figure 1: Petersen graph.

After investigating several non-planar graphs with up to five vertices in SageMath, we came up with the notion of representing their embeddings. Combinatorial embedding is a key concept in the study of such graph embeddings. The significance stems from the fact that, when combined with canonical labelling, combinatorial embeddings can be utilized to generate a unique representation of (planar) embeddings of (planar) graphs. For further reading on combinatorial representations and planar embeddings, refer to Mutzel and Weiskircher [9], Djidjev [8], Duncan, Goodrich and Kobourov [4], and Hopcroft and Tarjan [7].

The concept is as follows: we first construct all non-incident pairs of edges of a graph; then, we go through those pairs of edges, and for each crossing of two edges we delete those edges and add a new vertex, to which we connect the vertices of the deleted edges. We repeat the process until the graph is planar. Finally, if it is planar, in the end we canonically relabel its vertices and save the new embedding of the graph.

We show the approach and demonstrate how it works using the Petersen graph, shown in Figure 1. The first step of Algorithm 1 is constructing all pairs of non-incident edges of G, meaning that the two different edges cannot share the same vertex. Those pairs of edges are {{0,1},{2,3}}, {{4,9},{5,8}}, etc. In the beginning, we initialise the crossing number k to 0 and check whether the graph is already planar; if it is, we return k, and if it is not, k is increased by 1 and we go through the set of pairs of non-incident edges. For each k, we modify the graph until we get a planar embedding, in the following way: we take the first pair, in our example the pair {{0,1},{2,3}}, and we delete the edges {0,1} and {2,3}. Then we add a new vertex v, to which we connect the vertices of the deleted edges: 0, 1, 2 and 3. We check for planarity. If the check confirms that the graph is planar, the crossing number is returned and the algorithm terminates. However, if the check confirms that the graph is not planar, we go to the next pair. If all pairs fail, we look for tuples of size 3 next, and so on.

We get that the graph G is planar after three iterations; hence, the crossing number of the Petersen graph is 2. In the end, we canonically relabel the vertices and obtain the planar embedding presented in Figure 2. See Section 3 for more details about the transformation.

Figure 2: Planar embedding of the Petersen graph after applying Algorithm 1.
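To make the planarization step concrete, the following is a small SageMath sketch of a single iteration of the idea behind Algorithm 1 (the helper name planarize_pair is ours; PetersenGraph() and is_planar() are standard Sage methods):

# A SageMath sketch of one planarization step from Algorithm 1: replace a
# pair of non-incident edges by a new "crossing" vertex and re-test
# planarity. Run inside Sage; the helper name is illustrative.
from sage.all import graphs

def planarize_pair(G, e, f):
    H = G.copy()
    H.delete_edges([e, f])
    v = max(H.vertices()) + 1        # a label not yet used in the graph
    H.add_vertex(v)                  # the new crossing vertex
    H.add_edges([(e[0], v), (e[1], v), (f[0], v), (f[1], v)])
    return H

G = graphs.PetersenGraph()
print(G.is_planar())                      # False, so k = 0 fails
H = planarize_pair(G, (0, 1), (2, 3))     # the first pair from the example
print(H.is_planar())                      # still False: continue the search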
2.1 Theoretical analysis of the algorithm

Consider first the space complexity of Algorithm 1. The amount of memory used by Algorithm 1 to execute and produce the result is linear with respect to the input instance. This is due to the fact that the input instance is a graph, and most of the work on it is done in place, by modifying it locally and not taking more space even after many manipulations. Hence, we can say that Algorithm 1 does not take too much memory.

Next, let us evaluate the time complexity of Algorithm 1. To determine the time complexity, we need to consider all of the SageMath integrated functions we called in our main function. The implemented function is_planar runs in linear time with respect to the graph given as the input instance, meaning it runs in time O(n + m), where n is the number of vertices and m is the number of edges of the graph. For more reading on the time complexity of the planarity algorithm, refer to Boyer and Myrvold [3]. To remove a vertex in a graph, we first need to find the vertex in the data structure, and the time complexity depends on the structure we use; if we use a HashMap, the time complexity will be O(1). Then we remove the vertex, and we do it in O(n) time. Adding and removing an edge of the graph is done in constant, O(1), time, while adding a vertex to a graph takes O(n) time. Checking if there is an edge between two vertices is done in O(n) time, since a vertex can have at most O(n) neighbours. The time complexity of getting an embedding of the graph and of finding the neighbours is linear, that is, O(n + m), since we needed to perform the breadth-first search algorithm.

Let us analyze the lines of Algorithm 1. Line 1 has complexity O(n), as it assigns n vertices to a set, while line 2 assigns m^2 edge pairs to a set and hence has complexity O(m^2). Line 3 has constant, O(1), complexity. Let us now analyse the complexity of the while loop. Inside the while loop, we see that line 5 generates all subsets of size k from a set of size m^2; hence, the assignment of the k-subsets to the set S has complexity equal to the number of such subsets, which is O((m^2 choose k)) = O(m^(2k)). Now, the first for loop goes through all the k-subsets, and has complexity O(m^(2k)(k + kn + m + n)) = O(m^(2k)(nk + m)), because line 7 has complexity O(k), the inner for loop has complexity O(nk), and the if statement in line 14, which checks whether the graph is planar, has complexity O(n + m). Finally, the while loop is executed k times, meaning that the complexity of the whole while loop is O(m^(2k) · k · (nk + m)).
The users - with output explanations. time complexity of getting an embedding of the graph and of finding the neighbours is linear, that is O( n + m), By now, we processed cubic graphs with up to 21 edges, since we needed to perform the Breadth-First Search vertex-transitive graphs with up to 20 edges and all algorithm. graphs with up to 13 edges. These files can be stored in any database since we created a non-verbose mode of Let us analyze the lines from Algorithm 1. Line 1 has writing into them. In Table 1 we present an overall of complexity O( n) , as it assigns to a set n vertices, while the processed families of graphs by now. line 2 assigns m 2 edge pairs to a set and hence has com- plexity O( m 2). Line 3 has constant, O(1), complexity. Let us now analyse the complexity of the while loop. Table 1: Processed families of graphs Inside the while loop, we see that line 5 generates all subsets of size k from the set of size m 2. Hence, the family of graphs # generated up to edges assignment of the k-subsets to the set has complexity cubic 752 21 equal to the size of the set which is O( m 2) = O( m 2 k). k vertex-transitive 16 20 Now, the first for loop goes through all k elements of general 376899 13 the k-subset, and has complexity O( m 2 k( k + kn + m + n)) = O( m 2 k( nk + m)) because the line 7 has complexity O( k), the inner for loop has complexity O( nk) and the 4 Output samples if statement in line 14, that checks whether the graph is planar, has complexity O( n + m). Finally, the while loop Here we present some examples of the algorithm’s final is executed k times meaning that the complexity of the outputs, as previously detailed in the paper. We’ve suc- whole while loop is O( m 2 kk( nk + m)). cessfully generated embeddings, saved them in files along with additional vertices and Graph6 strings, and plotted 3 Applications images of vertex-transitive graphs on less than 20 edges. In Figure 4 we present some of them individually, with Drawing: One of the applications is in graph colouring. the Graph6 string for each one written in the caption. In Algorithm 1, we labelled newly added edges, then we use the method plot() within SageMath and with the 5 Conclusions property colour by label, we get different colours for the newly added edges. In Figure 3 we see an example of the In the paper, we had a look at a non-planar graph embed-transformed Petersen graph from Figure 2. As it can be ding, its storage, and representation. We introduced an seen, the original graph embedding is coloured red, while algorithm for constructing those embeddings and a func- the newly inserted edges are coloured green and blue. tion that writes them in both a human-readable way, or in the way suitable for the storage in the database. This Storing graphs: Another application of our approach contributes to the subject of representation theory be- is related to the storing of graphs with their combinatorial cause there was no standard way of encoding such em- embedding, added vertices (if the graph is non-planar) beddings. and with Graph6 string. We constructed a function that stores data about an individual graph in a single text We demonstrated how our method may be used to draw file that a computer can comprehend (Graph6 string, graphs and save graph data in various file formats. calculated embedding and added vertices - separated by Our techniques can be used to enrich almost any graph 59 database, and that is exactly what we were hoping to achieve. 
We’ve generated vertex-transitive graphs with up to 20 edges, all graphs with up to 13 edges, and all cubic graphs with up to 21 edges by now. In collaboration with Katja Berčič, PhD, our files will be uploaded to the M athDataHub database. (a) :An (b) :DaHg∼ Currently, we are working on contributing our code to the SageMath project. Acknowledgment I’d like to thank Professor Matjaž Krnc for his advice and suggestions throughout the planning, development, and writing of this paper. I’d like to thank Klemen Berkovič for his help in codifying this Author’s Guide, and to Iztok Fister Jr. for his (c) :CcKI (d) :Ea@aRgs contribution to Author’s Guide and .tex files. References [1] Barthelemy, M. Morphogenesis of Spatial Net- works. New York: Springer, 2017. [2] Billey, S. C., and Tenner, B. E. Fingerprint databases for theorems. Notices of the AMS 60, 8 (2013), 1034. [3] Boyer, J. M., and Myrvold, W. J. On the cut- ting edge: Simplified o(n) planarity by edge addi- (e) ∧ :Ea@_Q_QM@Gs (f) :Fa@_WIRQbP tion. Journal of Graph Algorithms and Applications 8, 3 (2004), 241–273. [4] Christian, D., T., G. M., and Stephen, K. Pla- nar drawings of higher-genus graphs, graph drawing, 17th international symposium, gd 2009. vol. 5849, pp. 45–56. [5] Goedgebeur, G. B. K. C. J., and Mélot, H. House of Graphs: a database of interesting graphs, Discrete Applied Mathematics, 2013. [6] Inc., O. F. The On-Line Encyclopedia of Integer Sequences, 2021. http://oeis.org. (g) :GaGecctgs (h) :Ga@_WIRhDlDZ [7] John, H., and E., T. R. Efficient planarity testing. Journal of the Association for Computing Machinery 21, 4 (1974), 549–568. [8] N., D. H. On drawing a graph convexly in the plane, graph drawing, dimacs international workshop, gd ’94, princeton. vol. 894, pp. 76–83. [9] Petra, M., and René, W. Computing optimal embeddings for planar graphs, computing and com- binatorics, 6th annual international conference, co- coon 2000. vol. 1858, pp. 95–104. [10] Read, R. C., and Wilson, R. J. An Atlas of (i) :Ga@_QaShDlDZ (j) :Ga@_WGwChLDgsTn Graphs (Mathematics). Oxford University Press, Inc., USA, 2005. [11] TePaske-King, Bert; Richert, N. The Iden- Figure 4: Coloured representations of several vertex- tification of Authors in the Mathematical Reviews transitive graphs with up to 20 edges, with Graph6 string Database, 2001. in captions. [12] The Sage Developers. SageMath, the Sage Mathematics Software System (Version 9.4), 2021. https://www.sagemath.org. [13] Trudeau, and J., R. Introduction to Graph The- ory. New York: Dover Pub, 1993. 60 Analiza ritmičnosti števnih podatkov z uporabo modela cosinor Nina Velikajne Miha Moškon Faculty of Computer and Information Science, Faculty of Computer and Information Science, University of Ljubljana, University of Ljubljana, Večna pot 113, SI-1000 Ljubljana, Slovenia Večna pot 113, SI-1000 Ljubljana, Slovenia nv6920@student.uni-lj.si miha.moskon@fri.uni-lj.si Povzetek sebne računske metode, ki upoštevajo in ohranjajo od- nose med podatki. Metode morajo upoštevati diskretno Analiza ritmičnosti števnih podatkov je postala porazdelitev, ki je omejena le na nenegativne cele vre- pomembna v mnogih vidikih znanosti, inženirstva dnosti. Pri uporabi navadne linearne regresije so lahko in celo ekonomije. Obstajajo metode z namenom napovedane vrednosti negativne, kar je teoretično nemo- detekcije ritmičnosti zveznih podatkov, ki pa veči- goče [8]. Za zaznavanje in analizo ritmičnosti v zveznih noma niso primerne za analizo števnih podatkov. 
1 Introduction

The detection and analysis of rhythmic patterns in count data plays an important role in many areas of science. Periodic data are data in which patterns repeat with a given period. A very common type of periodic process are processes that reflect circadian oscillation, i.e., processes with a 24-hour period [3]. They are regulated by the Earth's rotation and the alternation of day and night, which affects all organisms and consequently their behaviour and movement (see [20]). A special type of data are count data, which describe the occurrence of selected events [3].

Count data often appear periodically. In nature and in our surroundings, such data are ubiquitous. By analysing them we can contribute to the understanding of various events and, indirectly, to the understanding of how organisms and social systems work. For example, an analysis of the number of births with respect to the hour of the day can help organize the medical and midwifery staff in a maternity hospital [14]. An analysis of daily traffic patterns and their changes over time can tell us a lot about the movement and behaviour of a population, especially during an epidemic (see [4]).

The analysis of periodic count data requires special computational methods that take into account and preserve the relations in the data. The methods must respect the discrete distribution, which is limited to non-negative integer values. With ordinary linear regression, the predicted values can be negative, which is theoretically impossible [8]. Quite a few non-parametric methods exist for detecting and analysing rhythmicity in continuous data (see [18, 12]). Compared to these methods (see [16]), the use of trigonometric regression methods in combination with different cosinor models offers numerous advantages, e.g., the estimation of rhythm parameters [18]. It also turns out that the alternative methods fail in certain cases because of a large number of outliers, the sheer size of the data, data imbalance, or data collection without repetitions (see [17]).

In this paper we present a methodology that, by combining the cosinor method with different computational count models, respects all the limitations of periodic count data and thus also enables the estimation of rhythm parameters. The method supports the computation of confidence intervals for individual rhythm parameters by means of bootstrapping. With an F test we determine the optimal number of components for the cosinor method, and to find the most suitable type of count model we use the Vuong test.

The paper is divided into five chapters. The second and third chapters present the cosinor method for the analysis of periodic data and five computational regression models for working with count data. The following chapter describes the procedure for selecting the most suitable computational model. The Results chapter presents the results of testing the established method on real data. The last chapter summarizes the key findings and the procedures of the entire analysis.

2 The cosinor method

In periodic data, the observed patterns repeat with a given period. Their analysis can be addressed as a regression problem, in which, however, the characteristics of the rhythm, e.g., the phase and the amplitude of oscillation, must also be taken into account. The cosinor method is used for time-series analysis and addresses rhythm detection as well as the estimation of rhythm parameters. The model behind the method can be described by Equation 1, where N is the number of components, M the rhythm-adjusted mean (MESOR, Midline Estimating Statistic of Rhythm), A the amplitude, P the period and e(t) the error function. The variable t denotes time, while i iterates over the number of components, from 1 to N [1, 15, 16]:

Y(t) = M + \sum_{i=1}^{N} ( A_{i,1} \sin(2\pi t / (P/i)) + A_{i,2} \cos(2\pi t / (P/i)) ) + e(t).    (1)

If the period of the rhythm is known in advance, the cosinor equation can be simplified into a linear regression model:

Y(t) = M + \sum_{i=1}^{N} ( A_{i,1} X_{i,1}(t) + A_{i,2} X_{i,2}(t) ) + e(t),    (2)

where X_{i,1}(t) = \sin(2\pi t / (P/i)) and X_{i,2}(t) = \cos(2\pi t / (P/i)). If the period is not known, it can be estimated with the help of periodograms [1, 15, 16].
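As an illustration of Equation 2, the following Python sketch (ours, not the authors' module) builds the cosinor regressors X_{i,1}(t) and X_{i,2}(t) for a known period; the weekly period P = 7 mirrors the use case from the Results section.

```python
import numpy as np

def cosinor_design(t, period, n_components):
    """Regression matrix of Equation 2: a constant column for the MESOR M,
    followed by the pairs X_{i,1}(t) = sin(2*pi*t/(P/i)) and
    X_{i,2}(t) = cos(2*pi*t/(P/i)) for i = 1..N."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    for i in range(1, n_components + 1):
        cols.append(np.sin(2 * np.pi * t / (period / i)))
        cols.append(np.cos(2 * np.pi * t / (period / i)))
    return np.column_stack(cols)

# Example: three components over a weekly period, as in the paper's results.
t = np.arange(28)                                  # four weeks of daily data
X = cosinor_design(t, period=7, n_components=3)    # shape (28, 7)
```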
3 Analysis of count data with the cosinor method

For the analysis and detection of rhythm on the original, cleaned data we first use the cosinor method. If the period is known, we use Equation 2, which splits the data into an arbitrary number of components and transforms them into regression form. Computational regression models are then applied to the transformed data. Regression models enable the identification and characterization of the relations between many factors. Because we work with count data, we must choose appropriate computational regression models that respect all the properties of such data.

The data are discrete, limited to non-negative integer values and often also dispersed. We encounter the notions of overdispersion and underdispersion [8]. With overdispersion, the data have a larger variance than would otherwise be expected: if the variance is larger than the mean of the data, we speak of overdispersion, and the opposite holds for underdispersion [7]. Using ordinary linear regression, we would obtain incorrect results, since such a computational model would not respect the properties just mentioned [8].

In the established methodology we decided to use five different computational models that are most often used for the analysis of count data: the Poisson model, the generalised Poisson model, the zero-inflated Poisson model, the negative binomial model and the zero-inflated negative binomial model.

The Poisson model assumes that the data follow the Poisson distribution:

P(y = k) = \lambda^{k} e^{-\lambda} / k!,    (3)

where λ is the mean expected value, i.e., the number of events per unit of time. For the Poisson distribution, the mean of the data µ equals the mean expected value λ. The variance σ² equals the mean; simplified, σ² = λ. The model does not account for overdispersion or underdispersion of the data [9].

The generalised Poisson model is derived from the ordinary Poisson model. The essential difference compared to the Poisson model is that the generalised Poisson model is also suitable for data with increased or decreased dispersion. An additional parameter α is introduced, which describes the degree of dispersion. The model therefore does not require the mean µ to equal the variance σ². Two variants of the generalised Poisson model exist, GP-1 and GP-2 [6]. In the established methodology we used the GP-1 variant.

The zero-inflated Poisson model is an extension of the ordinary Poisson model. Unlike the ordinary and the generalised Poisson model, it takes into account that zero values are more frequent than the other values. It is based on the existence of two factors that influence whether the outcome is zero or non-zero [13].

The negative binomial model assumes that the data follow the negative binomial distribution. Two versions of the negative binomial model are in use, NB-1 and NB-2. The NB-1 variant proved more suitable: its fitted curve visibly followed the original data better, so we used NB-1 in the methodology. The variance of such a model is defined as σ² = µ + α · µ, where α is the dispersion parameter and µ the mean. The mean equals the mean expected value λ. The model is therefore also suitable for data with increased or decreased dispersion [2, 11]. As the parameter α approaches zero, the variance converges to the mean and the negative binomial distribution becomes the Poisson distribution [2].

The zero-inflated negative binomial model is an extension of the negative binomial model. Like the zero-inflated Poisson model, it takes into account that zero values are more frequent than the other (non-zero) values. The key difference compared to the zero-inflated Poisson model is that this model is based on the negative binomial distribution. The model is therefore also suitable for data with increased or decreased dispersion [10, 11].
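The methodology itself is implemented as a Python module (see the Discussion); as a rough sketch under our own assumptions, the five count models could be fitted on a cosinor design matrix with the statsmodels library roughly as follows. The counts `y` are synthetic placeholders, and the `p=1` arguments select the GP-1 and NB-1 variants.

```python
import numpy as np
from statsmodels.discrete.discrete_model import (
    Poisson, GeneralizedPoisson, NegativeBinomialP)
from statsmodels.discrete.count_model import (
    ZeroInflatedPoisson, ZeroInflatedNegativeBinomialP)

# Cosinor design matrix (Equation 2) with N = 3 components, weekly period.
t = np.arange(28, dtype=float)
X = np.column_stack([np.ones_like(t)] +
                    [f(2 * np.pi * t / (7 / i))
                     for i in range(1, 4) for f in (np.sin, np.cos)])
y = np.random.default_rng(0).poisson(lam=10, size=t.size)  # placeholder counts

models = {
    'poisson': Poisson(y, X),
    'gen_poisson': GeneralizedPoisson(y, X, p=1),            # GP-1
    'zero_infl_poisson': ZeroInflatedPoisson(y, X),
    'neg_binomial': NegativeBinomialP(y, X, p=1),            # NB-1
    'zero_infl_nb': ZeroInflatedNegativeBinomialP(y, X, p=1),
}
results = {name: m.fit(disp=0) for name, m in models.items()}
for name, res in results.items():
    print(name, res.llf)  # log-likelihoods, reused by the tests below
```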
4 Selecting the most suitable model

We split the selection of the most suitable computational model into two levels. At the first level we search for the optimal number of components for the cosinor method. Here we relied on the fact that the models are nested. Two models are nested if the first model can be expressed with the second, i.e., if the second model contains the same parameters as the first, plus at least one additional parameter [5]. We implemented the F test: from two constructed models we compute the F value. The F test is based on the difference between the sums of squares of the two models and takes the number of model parameters into account [5].

At the second level we evaluate the type of the computational model. We used the Vuong test, which is suitable for nested as well as non-nested and overlapping models. The Vuong test computes a Z value based on the maximum log-likelihoods of two models. This test, too, takes the number of model parameters into account [19].

Both tests follow a similar procedure. For two models, model A and model B, we compute the F or Z value. Model A is rejected if the computed value is smaller than a predetermined threshold, i.e., the statistical significance [5, 19].
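As a sketch of the two-level selection (our reading of [5, 19], not the authors' code): an F statistic over the residual sums of squares of two nested cosinor fits, and the basic, uncorrected Vuong Z statistic from the pointwise log-likelihoods of two fitted statsmodels count models.

```python
import numpy as np
from scipy import stats

def f_test(rss_small, df_small, rss_big, df_big):
    """F test for nested models; rss_* are residual sums of squares and
    df_* the residual degrees of freedom (n minus number of parameters).
    The model with fewer components is the 'small' one."""
    f = ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)
    return f, stats.f.sf(f, df_small - df_big, df_big)

def vuong_test(res_a, res_b):
    """Uncorrected Vuong Z statistic from two fitted statsmodels results;
    a positive Z favours model A, a negative Z favours model B."""
    ll_a = res_a.model.loglikeobs(res_a.params)  # pointwise log-likelihoods
    ll_b = res_b.model.loglikeobs(res_b.params)
    m = ll_a - ll_b
    z = np.sqrt(len(m)) * m.mean() / m.std(ddof=1)
    return z, 2 * stats.norm.sf(abs(z))
```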
5 Results

We tested the established methodology on real data. The data were obtained from COVID-19 Sledilnik¹, and we analysed the number of positive tests with respect to the day of the week. We considered all tests, i.e., the sum of the PCR (polymerase chain reaction) and the rapid antigen (HAGT) tests.

We ran the method on three datasets. The first dataset records the number of positive tests from 20 October 2020, the declaration of the second wave of the epidemic in Slovenia, to 10 February 2021. The mean of the positive tests is 1340 and the variance 276278. The second dataset contains the cases from 11 February 2021 to 26 April 2021. The mean of the data is 796 and the variance 90328. The third dataset combines the time periods of the first and the second dataset. The mean of the last dataset is 1082 and the variance 232224. We also tried the method on the period of the first epidemic wave in Slovenia, spring 2020, but there are too few data there for a meaningful analysis.

In each dataset we first removed the outliers, in such a way that for each hour we removed the entries whose numbers of positive tests fell outside the bounds given by the 0.15 quantile. We then ran the cosinor method for each number of components, from 1 to 4, and built each type of computational model (see Chapter 3). We stopped at 4 components because, with a larger number of components, the computational models no longer fitted the original data. We then evaluated the built models: first we found the optimal number of components with the F test, and then the most suitable model type with the Vuong test.

The described procedure is repeated for each dataset. All datasets have a variance larger than the mean, which means that the data are overdispersed. From the distribution of the original data (see Figure 1) we can see that the data contain no zero values.

Problems appear in all the models as soon as the number of cosinor components exceeds 3. We observe that the negative binomial model and the generalised Poisson model fit the original data best; both are suitable for overdispersed data. We obtained the same result for all datasets: the optimal number of components is 3, and the most suitable model type is the generalised Poisson model (see Figure 1). Based on the performed analysis we could evaluate the rhythmicity parameters (see Table 1) and their confidence intervals (see Table 2). The entire analysis procedure with all intermediate results is available in a GitHub repository².

Table 1: Estimated rhythmicity parameters for each dataset (two peak entries were obtained for the first dataset).

dataset | model type | no. of components | amplitude | MESOR | peaks | no. of positive tests
1 | gen_poisson | 3.0 | 794.62 | 1233.67 | 0.7, 0.7 | 2028.29, 1214.69
2 | gen_poisson | 3.0 | 483.37 | 731.32 | 3.9 | 969.59
3 | gen_poisson | 3.0 | 637.66 | 995.12 | 0.7 | 1632.77

Table 2: Confidence intervals of the rhythmicity parameters for each dataset.

dataset | amplitude | MESOR | peaks | no. of positive tests
1 | [727.44, 863.81] | [1164.72, 1299.48] | [0.63, 0.85], [0.56, 0.83] | [1897.05, 2158.4], [1154.14, 1271.19]
2 | [452.11, 514.36] | [701.36, 757.49] | [1.96, 4.82] | [902.71, 1040.37]
3 | [596.96, 691.66] | [955.36, 1051.93] | [0.59, 0.86] | [1556.13, 1739.78]

[Figure 1: The winning models for each dataset (panels for 20. 10. 2020–10. 2. 2021, 11. 2. 2021–26. 4. 2021 and 20. 10. 2020–26. 4. 2021; x-axis: day of the week, Monday–Sunday; y-axis: number of positive tests; legend: original data, fitted curve). In all cases the number of components equals 3 and the model type is the generalised Poisson model. The orange lines denote the confidence intervals of the model.]

¹ https://covid-19.sledilnik.org/sl/data
² https://github.com/ninavelikajne/Ritmicnosti-stevnih-podatkov
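For illustration, one possible reading of the outlier-removal step described above (a sketch with hypothetical column names) keeps, for each hour, only the values between the 0.15 and 0.85 quantiles:

```python
import pandas as pd

def remove_outliers(df, group_col='hour', value_col='positive_tests', q=0.15):
    """Keep, per group, only the values between the q and (1 - q) quantiles;
    the column names are hypothetical placeholders."""
    grouped = df.groupby(group_col)[value_col]
    lo = grouped.transform(lambda s: s.quantile(q))
    hi = grouped.transform(lambda s: s.quantile(1 - q))
    return df[df[value_col].between(lo, hi)]
```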
6 Discussion and conclusion

The established and implemented methodology enables the analysis of periodic count data. The methodology consists of two parts. The first part covers the analysis of periodic data: the cosinor method is implemented, and the user can set an arbitrary number of components. The second part covers the analysis of count data: different computational regression models suitable for working with count data are used. We implemented five such models, namely the Poisson model, the generalised Poisson model, the zero-inflated Poisson model, the negative binomial model and the zero-inflated negative binomial model. The established methodology also enables the evaluation of the built models. Like the model building, the evaluation is divided into two parts. In the first part we focus on finding the optimal number of components for the cosinor method; we exploit the fact that the models are nested and can therefore be evaluated with the F test. In the second part we search for the most suitable type of the computational model; we use the Vuong test, which is suitable for nested, non-nested and also overlapping models.

We tested the computational method on the weekly periodicity of the number of COVID-19 cases in Slovenia. We split the data into three datasets, each describing a different period. For all datasets, the generalised Poisson model turns out to be the best model type, and the optimal number of components is 3 (see Figure 1). The rhythmicity is clearly expressed in the data. We estimated the rhythm parameters (see Table 1) and their confidence intervals (see Table 2). From the results obtained with the presented method we can read that the number of positive tests is highest on Tuesdays and then declines through the week. As expected, the number of positive tests is lowest at weekends. The shape of the detected rhythm is similar for all datasets (see Figure 1).

The entire method is implemented as a module in the Python language and is freely available. It offers a wide spectrum of functionalities for the analysis of such data. A potential user can directly modify and adapt the functionalities to their own needs.

References

[1] Bingham, C., Arbogast, B., Guillaume, G. C., Lee, J. K., and Halberg, F. Inferential statistical methods for estimating and comparing cosinor parameters. Chronobiologia 9, 4 (1982), 397–439.
[2] Cameron, A. C., and Trivedi, P. K. Econometric models based on count data: Comparisons and applications of some estimators and tests. Journal of Applied Econometrics 1, 1 (1986), 29–53.
[3] Cameron, A. C., and Trivedi, P. K. Regression analysis of count data, vol. 53. Cambridge University Press, 2013.
[4] Chang, S., Pierson, E., Koh, P. W., Gerardin, J., Redbird, B., Grusky, D., and Leskovec, J. Mobility network models of COVID-19 explain inequities and inform reopening. Nature 589, 7840 (2021), 82–87.
[5] Clark, T. E., and McCracken, M. W. Tests of equal forecast accuracy and encompassing for nested models. Journal of Econometrics 105, 1 (2001), 85–110.
[6] Consul, P., and Famoye, F. Generalized Poisson regression model. Communications in Statistics - Theory and Methods 21, 1 (1992), 89–109.
[7] Coxe, S., West, S. G., and Aiken, L. S. The analysis of count data: A gentle introduction to Poisson regression and its alternatives. Journal of Personality Assessment 91, 2 (2009), 121–136.
[8] Gardner, W., Mulvey, E., and Shaw, E. Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological Bulletin 118 (12 1995), 392–404.
[9] Gardner, W., Mulvey, E., and Shaw, E. Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological Bulletin 118 (12 1995), 392–404.
[10] Greene, W. H. Accounting for excess zeros and sample selection in Poisson and negative binomial regression models. Working Paper EC-94-10. Leonard N. Stern School of Business, New York University, 1994.
[11] Hilbe, J. M. Negative binomial regression. Cambridge University Press, 2011.
[12] Hutchison, A. L., Maienschein-Cline, M., Chiang, A. H., Tabei, S. A., Gudjonson, H., Bahroos, N., Allada, R., and Dinner, A. R. Improved statistical methods enable greater sensitivity in rhythm detection for genome-wide data. PLoS Comput Biol 11, 3 (2015), e1004094.
[13] Lambert, D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34, 1 (1992), 1–14.
[14] Martin, P., Cortina-Borja, M., Newburn, M., Harper, G., Gibson, R., Dodwell, M., Dattani, N., and Macfarlane, A. Timing of singleton births by onset of labour and mode of birth in NHS maternity units in England, 2005–2014: A study of linked birth registration, birth notification, and hospital episode data. PloS One 13, 6 (2018).
[15] Nelson, W., Lee, J. K., et al. Methods for cosinor-rhythmometry. Chronobiologia 6, 4 (1979), 305–323.
[16] Refinetti, R., Cornélissen, G., and Halberg, F. Procedures for numerical analysis of circadian rhythms. Biological Rhythm Research 38, 4 (2007), 275–325.
[17] Ruben, M. D., Francey, L. J., Guo, Y., Wu, G., Cooper, E. B., Shah, A. S., Hogenesch, J. B., and Smith, D. F. A large-scale study reveals 24-h operational rhythms in hospital treatment. Proceedings of the National Academy of Sciences 116, 42 (2019), 20953–20958.
[18] Thaben, P. F., and Westermark, P. O. Detecting rhythms in time series with RAIN. Journal of Biological Rhythms 29, 6 (2014), 391–400.
[19] Vuong, Q. H. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57, 2 (1989), 307–333.
[20] Zhdanova, I. Melatonin. In Encyclopedia of the Neurological Sciences (Second Edition), M. J. Aminoff and R. B. Daroff, Eds., second edition ed. Academic Press, Oxford, 2014, pp. 1030–1033.

Sentiment Analysis of Hotel Comments Using Dictionaries and the Naive Bayes Method (Analiza sentimenta komentarjev hotelov z uporabo slovarjev in metode Naivni Bayes)

Nina Murks (University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia, nina.murks@student.um.si), Anže Omerzu (University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia, anze.omerzu@student.um.si) and Borko Bošković (University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia, borko.boskovic@um.si)

DOI https://doi.org/10.18690/978-961-286-516-0.15, ISBN 978-961-286-516-0

Abstract

In this article we present an approach to the sentiment analysis of hotel guests' comments using dictionaries and the Naive Bayes method. First, we built two sentiment dictionaries, containing n-grams and their probabilities of belonging to the positive or the negative class. Then, with the help of the built dictionaries, we classified the hotel comments using the Naive Bayes method. During the classification we computed classification values, i.e., the probabilities that individual comments are positive or negative.
We classified the comments using unigrams and bigrams, and compared the results with the results from the literature. With unigrams we achieved an accuracy of 0.92; with bigrams the accuracy amounted to 0.80. We stored the classification value of each comment, adding a negative sign to the comments classified as negative. The signed classification values were then summed for each hotel, and in this way we computed the points belonging to each hotel. In our case the hotel points are an indicator of the overall satisfaction of hotel guests found in the comments. We ordered the hotels by their points and arrived at a ranking of the hotels with the most positive comments.

Keywords: sentiment analysis, hotel guest comments, sentiment dictionary, n-grams, Naive Bayes

1 Introduction

Before setting off on a trip, one of the important decisions is the choice of accommodation. When deciding on the place to stay, the accommodation facilities matter to us: does the hotel have a swimming pool, organized childcare, free parking, and so on. We can learn more about the quality of the services offered by the chosen accommodation directly from the guests who have already used those services: we browse the online comments expressing opinions about the accommodation.

We decided to apply sentiment analysis to the comments of hotel guests. First, we classified the comments as positive or negative. With the help of the classified comments we determined whether the majority of the hotel guests liked the chosen hotel or not. The goal of our experiment was an application that provides immediate insight into the broader opinion about a hotel found in the comments of its guests.

In the article we first present different approaches to sentiment analysis, in the chapter Related Work. The next chapter describes the experiment we performed. Within it, we first present how we obtained the data and how we preprocessed it. Then we present the way the dictionaries were built, their use, and the classification of the hotel guests' comments. In the last subchapter of that chapter we present the approach to computing the points of an individual hotel. Finally, we give the conclusion and the possibilities for future work.

2 Related work

Sentiment analysis covers the computational analysis of the attitudes of a speaker or writer. The writer's attitude, as recognized in a text, can be positive, negative or neutral. Today the use of this analysis is extremely helpful in monitoring social media, as it gives us insight into public opinion [2].

The application of sentiment analysis is a well-researched field, as indicated by the high number of scientific and professional articles and papers in this area. Several approaches to sentiment analysis are known; in the following we present some of them, following the example of already published articles [8][3].

The first approach is based on unigrams and patterns from a training set [1][10]. The authors of these articles focused their research on samples of comments on the social network Twitter. With the help of unigrams and patterns they recognized whether a comment is hateful (offensive), clean (neutral) or positive [10]. In the second article the researchers used patterns to search for the presence of sarcasm in comments.
The next article important for our research [11] also directed its investigation to the sentiment of comments on social networks. Their approach uses weighted word vectors as the input to a neural network cell (BiLSTM) that captures contextual information and thus represents the comment vectors better. The sentiment of the comments is then determined by neural network classification.

In the article [8], ordinal regression with machine learning techniques was used. For this, the authors used a public database of Twitter comments, which had to be processed in advance using a feature extraction method. For the classification of the sentiment analysis they used different algorithms: multinomial logistic regression (SoftMax), support vector regression (SVR), decision trees (DTs) and random forest (RF).

In the article [7], the researchers used and upgraded a feature ensemble approach to sentiment analysis, taking into account various elements that other researchers neglect when using this method: the lexicon, the part of speech, the language semantics and the position of the words.

The approaches mentioned so far apply sentiment analysis to the English language. We were interested in how sentiment analysis is approached in other languages, where the sentiment is less clear. This is exactly what was done in the article [4], which addresses the ambiguity of Chinese phrases expressed in sentiment. The traditional machine learning approach cannot capture the true sentiment, so the authors proposed a multi-strategy sentiment analysis method and showed that hybrid sentiment analysis achieves satisfactory results.

The articles [5] and [9] perform sentiment analysis with the help of convolutional neural networks. The first article uses convolutional networks for the sentiment analysis of Twitter comments, and the second for sarcasm detection on social networks. Convolutional neural networks consist of several layers, where each layer performs the task of preparing, transforming or correcting the data.

Sentiment analysis can also be approached with dictionaries and the Naive Bayes method, as was done in the article [6], which analysed the sentiment of the comments of Danmaku¹ videos. The authors used a sentiment dictionary, which they extended with emoticons, since these are very important in the comments of Danmaku videos. Moreover, they did not only classify positive and negative comments, but used as many as seven classes: disgust, sadness, liking, anger, surprise, fear and joy. After generating a dictionary that fits the context (comments of Danmaku videos), they classified the comments with the Naive Bayes method. Of all the approaches mentioned above, the one using dictionaries and the Naive Bayes method was the most suitable for us, so we decided to use it when implementing our own idea.

¹ Danmaku is a Japanese term for the subtitle system used by online video platforms, which lets users post moving comments onto the video being played.
3 The proposed approach

In this chapter we present our sentiment analysis for classifying hotel comments and for obtaining, from the text of the comments, the accommodation satisfaction points of individual hotels. A short presentation of the data acquisition and of the design and execution of the experiment follows.

3.1 Data acquisition

We looked for a database that already contained classified hotel comments. We found such a database on the Kaggle website; a short presentation of the website follows.

Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows its users to find and publish datasets, to explore and build models in an online data science environment, to collaborate with other data scientists and machine learning engineers, and to take part in competitions solving data challenges.

The chosen database was created with the help of the Booking.com website. It contains 515,000 comments on 1493 different European hotels. The data have the following form: the hotel address, the date of the comment, the average score of the hotel (computed with the help of the last comment that appeared in the current year), the hotel name, the nationality of the comment's writer, the negative comment, the number of words in the negative comment, the positive comment, the number of words in the positive comment, the writer's score (the points the writer assigned to the hotel), the number of comments the writer has given, the number of comments an individual hotel has, the tags (assigned by the comment's writer), the number of days since the last comment, additional points for the hotel (some writers provided only points), and the geographic latitude and longitude of the hotel's location.

The comments in the database already had all punctuation and all characters not in the English alphabet (non-ASCII characters) removed. Since the database was well prepared, we did not have much work: we removed the numbers and converted the text of the comments to lower case. We also split the database into a test and a training set, where the training set covered approximately 80 % of all the data and the test set the remaining 20 %. We used the training set for building the dictionaries and the test set for the classification of the comments. The split into the training and test set was carried out for each hotel separately: we collected all the comments about a given hotel and divided them so that the training set contained approximately 80 % of the comments.

3.2 Building the dictionaries

With the data preprocessed, we could start building the dictionaries. We decided to run the algorithm in two ways: with unigrams and with bigrams. The algorithm that works with unigrams does not differ much from the one with bigrams; the two dictionaries are simply based once on unigrams and once on bigrams. Likewise, when using the dictionaries and computing the classifications with the Naive Bayes method, the only difference appears in the preprocessing of individual comments, where with unigrams we take single words, and with bigrams pairs of words.

We approached the building of the dictionaries by first reading the training set and storing all the comments separately, split into positive and negative ones. For each variant of the algorithm (unigrams and bigrams) we built two dictionaries. The first dictionary held the computed probabilities that an individual n-gram (whenever we refer to an n-gram, we mean a unigram or a bigram) belongs to a positive comment, and the second the probabilities that an individual n-gram belongs to a negative comment. Before we could compute the probabilities of the n-grams belonging to an individual class, a few more steps were needed. To build the dictionary of positive comments, we needed all the n-grams found in the vocabulary of the positive comments. We also needed the list of all n-grams found in the positive as well as in the negative comments, together with their frequencies of occurrence. We also computed the size of the dictionary of all n-grams (how many different n-grams appear in the positive and negative comments). With this we could compute the individual probabilities of the n-grams belonging to a given class (positive or negative). The probability that an n-gram belongs to a class is computed by the equation:

P(n-gram | class) = (frequency(n-gram, class) + 1) / (frequency(class) + V).    (1)

Each probability is computed by taking the frequency of the n-gram within the class, adding one to it (to avoid zero values), and dividing the result by the sum of the class frequency and the dictionary size (V). The class frequency represents the number of n-grams that appear within the given class. For each n-gram we computed the probability for both classes (positive and negative) and in this way created the two dictionaries, which we saved to files. Since the test set used for the classification can contain an n-gram that did not appear in the training set (from which the dictionaries were built), we added to the beginning of each dictionary the value for an unknown n-gram and marked it as "unknown_". We used the underscore in the label because the n-gram "unknown" itself could appear in the comments, in which case we would take the wrong probability; since all punctuation had already been removed from the database, we could exploit this property and attach a punctuation mark to the n-gram.
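A minimal sketch of the dictionary-building step (our illustration; the paper's own implementation is not shown): per-class n-gram counts are turned into the probabilities of Equation 1, together with the special "unknown_" entry.

```python
from collections import Counter

def count_ngrams(comments, n=1):
    """Count the n-grams over a list of preprocessed comments of one class."""
    counts = Counter()
    for text in comments:
        words = text.split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts

def dictionary(class_counts, vocab_size):
    """Equation 1: P(n-gram|class) = (freq(n-gram, class) + 1) /
    (freq(class) + V), with the 'unknown_' entry for unseen n-grams."""
    denom = sum(class_counts.values()) + vocab_size
    probs = {ng: (c + 1) / denom for ng, c in class_counts.items()}
    probs['unknown_'] = 1 / denom
    return probs

pos_counts = count_ngrams(['great room and friendly staff'])   # toy data
neg_counts = count_ngrams(['dirty room and rude staff'])
V = len(set(pos_counts) | set(neg_counts))    # size of the joint vocabulary
pos_dict, neg_dict = dictionary(pos_counts, V), dictionary(neg_counts, V)
```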
3.3 Classification of the comments

The classification of the comments from the test set proceeded by first reading the two dictionaries saved in the previous step. We created a data structure Review, which contained the content of a comment, the information whether the comment is positive, the information about the correctness of the comment's classification, and the classification value. We also created a data structure HotelReviews, containing the hotel name, the list of comments (using the previously defined structure Review) and the classification value. We created a list of HotelReviews structures, in which we stored the data later obtained from the test set of the database.

We read the test set of the database and stored the information we needed: the hotel name and the comments, already marked according to whether they were recognized as positive or negative in the database, since this later helped us compute the accuracy of the classification model. We stored the data from the test set in the structures mentioned above. Once we had all the data stored in our data structures, we started the classification of the comments.

For each comment, which consists of a sequence of n-grams,

comment = n-gram_1, n-gram_2, ..., n-gram_N,    (2)

we always computed two values: the classification value that the comment belongs to the positive class and the classification value that it belongs to the negative class:

P(comment | positive) = log(P(positive)) + log(P(n-gram_1 | positive)) + log(P(n-gram_2 | positive)) + ... + log(P(n-gram_N | positive)),    (3)

P(comment | negative) = log(P(negative)) + log(P(n-gram_1 | negative)) + log(P(n-gram_2 | negative)) + ... + log(P(n-gram_N | negative)).    (4)

If the positive classification value was larger than the negative one, we classified the comment as positive; otherwise we classified it as negative.

We stored the computed classification value, adding a negative sign in the case of a negative classification result. This helped us in the following steps when determining the overall satisfaction of the hotel guests found in the comments, by computing the points of individual hotels. More on this follows in the next chapter.
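Continuing the toy dictionaries above, a sketch of the classification step of Equations 2–4 follows; the exact form of the stored signed value is not spelled out in the text, so the normalised winning probability used here is our assumption.

```python
import math

def log_score(ngrams, prior, probs):
    """Equations 3-4: log prior plus the summed log probabilities of the
    comment's n-grams, falling back to the 'unknown_' entry."""
    return math.log(prior) + sum(
        math.log(probs.get(ng, probs['unknown_'])) for ng in ngrams)

def classify(comment, pos_dict, neg_dict, p_pos=0.5, p_neg=0.5, n=1):
    words = comment.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    pos = log_score(ngrams, p_pos, pos_dict)
    neg = log_score(ngrams, p_neg, neg_dict)
    # One possible convention for the stored value: the normalised
    # probability of the winning class (log-sum-exp for stability),
    # with a negative sign for negative comments, as described above.
    m = max(pos, neg)
    p_win = 1.0 / (math.exp(pos - m) + math.exp(neg - m))
    return ('positive', p_win) if pos > neg else ('negative', -p_win)

print(classify('friendly staff and great room', pos_dict, neg_dict))
```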
3.4 Computing the hotel-stay satisfaction points

After classifying all the comments of individual hotels, we turned to assessing the overall satisfaction found in the comments. The original idea was to count the positive and the negative comments and to compute their share among all the comments of an individual hotel. We decided to go one step further and, instead of counting, to sum the classification values, which gave the points of individual hotels an additional weight. This weight shows in comments classified within the same class: e.g., two comments both classified as positive can differ in that one is more positive than the other.

In the previous step we assigned a negative sign to the comments classified as negative (the others have a positive sign, since the probability of something is always positive). We then summed these classification values for each hotel and in this way computed the overall satisfaction with the hotel. The larger the sum over the comments of a given hotel, the more satisfied were the guests who wrote them. For each hotel we thus computed a value that represents the hotel's points and captures satisfaction in general. We then ordered the hotels with respect to their computed points and in this way obtained a ranking of the hotels.

4 Experiment

In the experiment we performed the classification in two ways: using unigrams and using bigrams. We compared the two classification approaches with each other by computing the model accuracy, the precision, the recall and the F1 measure. In the computation of the accuracy, precision and recall we used the parameter tp (true positive), which represents the number of correctly classified positive comments, tn (true negative), which represents the number of correctly classified negative comments, fp (false positive), which represents the number of incorrectly classified negative comments, and fn (false negative), which gives the number of incorrectly classified positive comments:

Accuracy = (tp + tn) / (tp + tn + fp + fn),    (5)

Precision = tp / (tp + fp),    (6)

Recall = tp / (tp + fn),    (7)

F1 = 2 · Precision · Recall / (Precision + Recall).    (8)
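Equations 5–8 translate directly into code; a small self-contained sketch with hypothetical counts:

```python
def metrics(tp, tn, fp, fn):
    """Equations 5-8: accuracy, precision, recall and the F1 measure from
    the four counts defined above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts; the measured values of the experiment are in Table 1.
print(metrics(tp=460, tn=460, fp=30, fn=50))
```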
Vizualno predstavitev gativnih komentarjev, fp (ang. false positive), predstavlja vidimo na sliki 1, pri čemer zelena slika predstavlja po- število napačno klasificiranih negativnih komentarjev in goste besede, ki jih najdemo v pozitivnih komentarjih, fn (ang. false negative), ki pove število napačno klasifici- rdeča slika pa prikazuje pogoste besede znotrah negativ- ranih pozitivnih komentarjev. nih komentarjev. tp + tn N atancnost = (5) tp + tn + f p + f n tp P reciznost = (6) tp + f p tp P riklic = (7) tp + f n 2 ∗ P reciznost ∗ P riklic Slika 1: Vizualna predstavitev najpogostejsih besed v F 1 = (8) P reciznost + P riklic komentarjih Pri računanju metrik uspešnosti klasificikacij smo vre- dnosti izračunali tako za unigrame kot za bigrame, ter prišli do rezultatov, ki so prikazani v tabeli 1. Zraven 5 Zaključek naših rezultatov, so prikazani tudi rezultati iz literature [6], po kateri smo se zgledovali pri implementaciji algo-V okviru naše naloge smo uporabili analizo sentimenta ritma. nad komentarji hotelskih gostov. Klasifikacijo komen- Iz tabele 1 lahko razberemo, da se unigrami v vseh me-tarjev smo izvedli s pomočjo slovarjev in metode Naivni trikah uspešnosti bolje obnesejo od bigramov. Unigrami Bayes, pri čemer smo eksperiment izvedli s pomočjo uni- imajo namreč boljši rezultat pri natančnosti, preciznosti, gramov in bigramov. Končni rezultati so pokazali, da je priklicu in pri uglašeni meri F1. Največjo razliko med uni- klasifikacija s pomočjo unigramov uspešnejša, kot pa z grami in bigrami vidimo pri izračunu priklica. Na podlagi bigrami. Pri klasifikaciji s pomočjo unigramov smo do- tega lahko sklepamo, da smo pri bigramih imeli veliko segli 92 % natačnost, pri bigramih pa 80 %. V članku število napačno klasificiranih pozitivnih komentarjev. [6] so dosegli 75 % natančnost, torej so bili manj uspešni 68 pri klasifikaciji. Boljše rezultate napram primerjalnega članka smo dosegli zaradi uporabe slovarjev sentimenta, ki temeljijo zgolj na besedišču, ki ga najdemo v komen- tarjih, med tem ko so v [6] uporabljali bolj splošen slovar. Klasifikacijo komentarjev smo v našem članku uporabili za raziskavo splošnega zadovoljstva hotelskih gostov, ki so napisali komentarje za določen hotel. Eksperiment bi lahko v nadaljnem delu razširili z upoštevanjem nacio- nalnosti piscev komentarjev pri sami klasifikaciji. Na tak način bi lahko ugotovili korelacijo med nacionalnostjo pi- scev komentarjev in zadovoljstva glede določenega hotela, ki se skriva v komentarjih. Posameznim komentarjem bi v bodoče lahko dodali tudi številčno oceno (npr. od 1 do 5), ki bi jo pridobili s pomočjo razširitve trenutnega eksperimenta. Literatura [1] Bouazizi, M., and Otsuki Ohtsuki, T. A pattern-based approach for sarcasm detection on twitter. IEEE Access 4 (2016), 5477–5488. [2] Brandwatch, K. B. Understanding sentiment analysis: What it is & why it’s used. [3] Cai, R., Qin, B., Chen, Y., Zhang, L., Yang, R., Chen, S., and Wang, W. Sentiment analysis about investors and consumers in energy market ba- sed on bert-bilstm. IEEE Access 8 (2020), 171408– 171415. [4] Fang, Y., Tan, H., and Zhang, J. Multi-strategy sentiment analysis of consumer reviews based on semantic fuzziness. IEEE Access 6 (2018), 20625– 20631. [5] Jianqiang, Z., Xiaolin, G., and Xuejun, Z. Deep convolution neural networks for twitter senti- ment analysis. IEEE Access 6 (2018), 23253–23260. [6] Li, Z., Li, R., and Jin, G. Sentiment analysis of danmaku videos based on naïve bayes and sentiment dictionary. IEEE Access 8 (2020), 75073–75084. 
We achieved better results than the reference article because we used sentiment dictionaries based solely on the vocabulary found in the comments, while in [6] a more general dictionary was used.

In our article we used the classification of the comments to investigate the overall satisfaction of the hotel guests who wrote comments for a given hotel. In future work, the experiment could be extended by taking the nationality of the comment writers into account in the classification itself. In this way we could establish the correlation between the nationality of the comment writers and the satisfaction with a given hotel hidden in the comments. In the future, individual comments could also be given a numerical score (e.g., from 1 to 5), obtained by extending the current experiment.

References

[1] Bouazizi, M., and Otsuki Ohtsuki, T. A pattern-based approach for sarcasm detection on Twitter. IEEE Access 4 (2016), 5477–5488.
[2] Brandwatch, K. B. Understanding sentiment analysis: What it is & why it's used.
[3] Cai, R., Qin, B., Chen, Y., Zhang, L., Yang, R., Chen, S., and Wang, W. Sentiment analysis about investors and consumers in energy market based on BERT-BiLSTM. IEEE Access 8 (2020), 171408–171415.
[4] Fang, Y., Tan, H., and Zhang, J. Multi-strategy sentiment analysis of consumer reviews based on semantic fuzziness. IEEE Access 6 (2018), 20625–20631.
[5] Jianqiang, Z., Xiaolin, G., and Xuejun, Z. Deep convolution neural networks for Twitter sentiment analysis. IEEE Access 6 (2018), 23253–23260.
[6] Li, Z., Li, R., and Jin, G. Sentiment analysis of Danmaku videos based on naïve Bayes and sentiment dictionary. IEEE Access 8 (2020), 75073–75084.
[7] Phan, H. T., Tran, V. C., Nguyen, N. T., and Hwang, D. Improving the performance of sentiment analysis of tweets containing fuzzy sentiment using the feature ensemble model. IEEE Access 8 (2020), 14630–14641.
[8] Saad, S. E., and Yang, J. Twitter sentiment analysis based on ordinal regression. IEEE Access 7 (2019), 163677–163685.
[9] Son, L. H., Kumar, A., Sangwan, S. R., Arora, A., Nayyar, A., and Abdel-Basset, M. Sarcasm detection using soft attention-based bidirectional long short-term memory model with convolution network. IEEE Access 7 (2019), 23319–23328.
[10] Watanabe, H., Bouazizi, M., and Ohtsuki, T. Hate speech on Twitter: A pragmatic approach to collect hateful and offensive expressions and perform hate speech detection. IEEE Access 6 (2018), 13825–13835.
[11] Xu, G., Meng, Y., Qiu, X., Yu, Z., and Wu, X. Sentiment analysis of comment texts based on BiLSTM. IEEE Access 7 (2019), 51522–51532.

Cron Job Schedulers and the Serverless Environment (Časovni razporejevalniki in brezstrežniško okolje)

Uroš Zagoranski (University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia, uros.zagoranski@student.um.si)

DOI https://doi.org/10.18690/978-961-286-516-0.16, ISBN 978-961-286-516-0

Abstract

In this paper we focus on cron job schedulers in the serverless environment and on their reliable use. We compared schedulers implemented with the help of containers with those hosted in the cloud using the Function-as-a-Service approach. We investigated what the particularities of cron job schedulers in the serverless environment are and when it even makes sense to use them. On a practical example we show how they can be integrated into a larger system and how to most easily solve the potential problems that we consciously accept when choosing a serverless environment. We found that schedulers in a FaaS (Function as a Service) environment are the most suitable, thanks to simple and fast development and low operating costs.

Keywords: serverless architecture · modern approaches to implementing information solutions · cron job schedulers · function as a service · container as a service

1 Introduction

The pace of life has been increasing in recent years, and we are consequently ever more pressed for time. This, in turn, increases the need for the time-based scheduling of routines, so that no information gets lost in their excess. Systems enabling various kinds of scheduling have been developed since the very beginning, but it has never been easier than in today's era of cloud computing. A large set of cloud service providers allows us to choose between them, which, precisely because of the large offer, is sometimes not trivial. Choosing the right service is made harder above all by the lack of analytical overviews and comparisons of the services offered by the different corporations. With this in mind, we wanted to present the differences, advantages and additional challenges that arise when developing schedulers in the cloud with functions as the driving force, compared with those implemented in (more) classical applications with the help of container orchestration. In addition, we list and present the main systems that offer time-based scheduling in the serverless environment and explain, on practical use cases, which of the main providers is better for a given case and why.

The two research questions we addressed are:
• What are the particularities of cron job schedulers in the serverless environment?
• When does it make sense to use the cron job schedulers of the serverless environment?

In the remainder of the article, the second chapter presents the particularities of the modern approaches to developing software solutions, i.e., with containers and with functions as a service, together with their comparison. The third chapter covers the topic of cron job schedulers in general, pays closer attention to the schedulers in the serverless environment, surveys the main providers of such services and lists use cases. The fourth chapter presents a real use case of cron job schedulers for scheduling tasks in an accounting service. In the fifth chapter we answer the research questions and open the discussion, and in the last chapter we summarize the results of the research and the main new findings obtained with its help.

2 Modern approaches to implementing information solutions for task scheduling

An increasing number of software developers means a larger set of options and preferences in software development. For this reason, the research company Gartner produces, every year, a report on the trends in the IT world for the coming calendar year.
Recently, on this list we can ever more often find phrases such as artificial intelligence, augmented reality, analytics, automation and the like [12]. The virtualization of resources and programming in the cloud quickly merged with the classical approaches to implementing information solutions, so it is no wonder that the growth of this field and the innovativeness of the solutions in this category show a positive trend. Among the predictions for the near future, Gartner's research thus features phrases mentioning containers, their orchestration, functions as a service, and the serverless approach to software development in general [6].

The serverless approach to software development is a cloud computing model in which the cloud service provider takes care of the servers and of the allocation of hardware resources on behalf of its customers. The main particularity of such an approach is that an application has no computing resources allocated when it is not in use. This is reflected in a lower operating price, since the price is based on the actual amount of resources the application consumes in a given time interval [1].

In Gartner's analyses, the current state of serverless systems on the market is compared with the assumptions for the near future, and a financial estimate of serverless systems in the near future is given as well. They predict that, from last year's global revenue of 465.8 million dollars, the revenue will roughly double by 2024, amounting to 944 million dollars. According to the estimates, the largest growth should be achieved precisely by container orchestration and by the serverless approach to developing information solutions, the two topics we describe in more detail below [11].
Another Gartner study estimates that the use of platforms that use the smallest possible components, i.e., functions, as the triggers of activities in the system will also grow drastically in the future. Currently, approximately 20 percent of global IT companies use the "function-based" (FaaS) style of implementation, and by 2025 this number should rise to as much as 50 percent [4].

The rising interest in the use of cloud infrastructures is also reflected in the number of providers. Among the leading ones we count the following:
• Amazon Web Services,
• Google Cloud Platform,
• Microsoft Azure,
• IBM Apache OpenWhisk.

In the competition between these providers, the Amazon Web Services platform (AWS in the following) has been the main actor for many years, but it has recently been gaining more and more rivals, among which the most topical are Microsoft Azure and Google Cloud Platform (GCP in the following). If, however, we consider both of the previously mentioned fields (containers and functions), Google steps to the fore without competition, since it owns both the orchestration solution Kubernetes and the enormous cloud programming platform GCP [7]. Companies want their developers to focus primarily on software development, not also on maintenance, provisioning and other supporting activities. With containers we therefore often also find a person in charge of looking after and managing the entire architecture (DevOps), whom we in principle do not need with the function-based approach.

2.1 Containers: Container as a Service (CaaS)

The ever more frequent trend of using containers for the implementation of information solutions is also reflected in the emergence of a fairly new computing paradigm. CaaS represents an upgrade of the IaaS (Infrastructure as a Service) category, where corporations rent out servers, virtual machines, networks and storage, with their users responsible for managing the infrastructure and for deploying applications onto it. On top of this, CaaS also offers container engines and functionalities for their orchestration [19]. The most interesting for us is the CaaS subclass KaaS (Kubernetes as a Service), since, thanks to its robustness, maturity and rich set of functionalities, Kubernetes has become the de-facto standard and the most frequently used technology when it comes to developing information solutions with the joint use of containers [13].

Kubernetes is an open-source orchestration system that enables the automated management of containers and also manages the entire container life cycle, with the users managing the execution and the interaction of the applications [8]. Applications are typically composed of several containers, distributed across different virtual and physical hosts. Kubernetes helps us manage this by structuring containers into so-called pods, which enrich the environment with an additional level of abstraction. In this way, resource scheduling is simplified, as is time-based scheduling, which is the most interesting for us [10].

The container approach is suitable above all for teams that swear by more detailed control over the containerization services. It also enables the automation of deployment and of rollbacks to a previous state, the fine-tuning of the amounts of processing power and memory, the self-healing of containers that failed during execution, and configuration management [8].
2.2 Functions: Function as a Service (FaaS)

FaaS is a new concept in cloud computing, similar to PaaS (Platform as a Service) from the viewpoint of enabling simpler software development and of relieving the developers of managing servers and operating systems. The advantage of functions in the cloud compared with the previously mentioned concepts lies above all in the business aspect: while with PaaS one pays for the execution time of a thread, with FaaS one pays for the execution time of a specific function. This can concretely raise the performance characteristics of enormous systems and, at the same time, lower the operating costs of the information solution itself [1].

Starting in 2014 with Amazon's Lambda, the FaaS concept has by now become a recognized and well-accepted way of implementing information solutions. It has proven to be much more scalable, elastic, developer-friendly and affordable than the preceding cloud architectures. Besides AWS's Lambda, this category also includes Microsoft Azure Functions, Google Cloud Functions and IBM Apache OpenWhisk [2].

The approach with functions in the cloud is suitable above all for projects that have an uneven distribution of execution, extremely high or unknown system scaling requirements, are bound to external events, are composed of short-lived discrete functions, or can operate with stateless calls. We can often find them in mobile applications, IoT sensor components and time-triggered actions [4].

2.2.1 Comparison of the two groups

At the beginning of the rise of information technology, the building of monolithic applications became popular, but today this appears ever less often on the list of trends. Recently they have largely been replaced by micro-services, and consequently the use of containers was introduced, with which the scheduling and the development of applications are much simpler. Containers first appeared in the IT world before the year 2000, so this concept has a longer and richer history than functions in the cloud. Nevertheless, the so-called explosion of popularity of both concepts dates to a similar period: containers became extremely topical only with Docker in 2013, and FaaS with AWS's Lambda in 2014 [16].

FaaS represents a higher level of abstraction than CaaS. The latter abstracts the levels of the hardware, the virtualization machines and the operating system, while the former adds, on top of all the aforementioned levels, the abstraction of executing the program code and the application.
Both groups transfer the responsibility for manipulating the application and the functions to the user of the service.

Table 1: A comparison of CaaS and FaaS [9]

criterion | CaaS | FaaS
Portability | Containers are a standardized unit, which makes their portability between providers simple | Poor standardization, mostly provider-dependent; schedulers nevertheless partly standardized
Lift and shift | Existing applications are simply wrapped into containers if this has not been done already; containers have no influence on the operation of schedulers | Not possible; existing applications need adaptation or a complete re-implementation
Granularity | Units usually split into components; scheduling of specific functions is offered, including those in charge of scheduling | The application is fragmented into small units that are easy to schedule (functions); ideal for implementing schedulers
Execution | Managed by the developer | Managed by the cloud provider
Flexibility | High | Medium
Use cases | Entire web applications, micro-services, data management | Specialized functions, asynchronous processing, event-driven architecture
Challenges | Requires an understanding of how containers work; harder implementation of schedulers | Dependency management; tracking and managing several micro-services at once
Suitability | Efficient for hosting basic applications; less suitable for implementing schedulers | Most suitable for launching new applications; extremely suitable for implementing schedulers

If we briefly summarize Table 1, which compares the two approaches, we can conclude that, in general, the container approach offers more freedom in the implementation, more control over the execution, and a simple migration in case an existing application does not yet use such an approach. On the other side, the approach with functions in the cloud is more suitable and easier to maintain, but it is less flexible and poorly standardized, so the technical details that need to be considered during implementation differ between individual providers. Which approach to use depends above all on the use case: for web applications with micro-services the CaaS model is the more suitable choice, while for event-driven applications with asynchronous processing the FaaS model will prove better.

3 Cron job schedulers

The time-based triggering of activities is a completely ordinary activity in today's world, so the use of systems that take care of executing certain functions at a specific time, at a desired frequency or with certain delays is ever more common. Schedulers have thus accompanied us in the IT world practically from the beginning. They are present in operating systems, in network components and, after all, also in I/O operations (write and read operations).

While with all the previously listed components the term scheduling appears in the form of resource scheduling, for our research the time-based scheduling of tasks is much more interesting. In traditional systems, such problems are mostly solved with application servers and with container orchestration; in the new era of cloud computing, the previously mentioned FaaS approach appears ever more frequently. In information solutions of this type, the execution of specific functions at a given time is handled by functions called cron job schedulers, which, in all systems and regardless of the choice of technology stack, work on the same principle. Besides the same principle, they also share the same input format, composed of five variables and a call of a function or a command. The general form of a "cron" expression is shown in Figure 1.

[Figure 1: The general form of a "cron" expression.]

Among the most frequent terms found in the names of schedulers are the English words "cron" and "scheduler". Within the scope of the schedulers described above we find the main providers of such services: Google's Kubernetes, Amazon's Lambda tool in combination with the CloudWatch Events service, Google's Cloud Scheduler and Scheduled Cloud PubSub, and Microsoft's Azure Time Trigger.
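For concreteness, the five fields of a "cron" expression can be sketched as follows (a generic illustration, not tied to any single provider):

```python
# The five variables of a "cron" expression precede the function or
# command to be called: minute, hour, day of month, month, day of week.
FIELDS = ["minute (0-59)", "hour (0-23)", "day of month (1-31)",
          "month (1-12)", "day of week (0-6, Sunday = 0)"]

def describe(expression):
    """Pair each field of a cron expression with its meaning."""
    return list(zip(FIELDS, expression.split()))

# "0 6 * * 1" means: run every Monday at 06:00.
print(describe("0 6 * * 1"))
```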
3.1 The "container" environment

Containers can be used for various purposes; one of the more important is the processing of tasks in the background. This is a common use case, but it is anything but trivial and raises a number of questions, among them how to trace execution, how to scale, and how to manage restarts. Besides the resource schedulers that are unavoidable in an elaborate architecture built with tools such as Docker and Kubernetes, in these environments we also encounter time schedulers, which significantly affect resource optimization so that resources are used as efficiently as possible [14].

Containers by themselves allow the use of time triggers, but once multiple instances, and with them additional complexity, are introduced, a dedicated tool such as Kubernetes is required. For security reasons the state of an individual container cannot be shared with the remaining containers, so other ways of exposing scripts have to be found. Kubernetes also generates container names dynamically, which makes accessing them harder [17].
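In Kubernetes, the dedicated time-trigger mechanism mentioned above is the CronJob resource. The sketch below creates one through the official Python client; this is our minimal illustration under stated assumptions rather than code from the paper, and the image name, namespace and schedule are hypothetical.

```python
# Minimal sketch (assumptions: the official 'kubernetes' Python client with
# batch/v1 CronJob support, a reachable cluster, and a hypothetical image).
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside a pod

container = client.V1Container(
    name="reindex",
    image="example/reindex:1.0",          # hypothetical image
    command=["python", "reindex.py"],
)
cron_job = client.V1CronJob(
    metadata=client.V1ObjectMeta(name="db-reindex"),
    spec=client.V1CronJobSpec(
        schedule="0 3 * * *",             # five-field cron: daily at 03:00
        job_template=client.V1JobTemplateSpec(
            spec=client.V1JobSpec(
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="OnFailure",
                        containers=[container],
                    )
                )
            )
        ),
    ),
)
client.BatchV1Api().create_namespaced_cron_job(namespace="default", body=cron_job)
```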
3.2 The "function" environment

Time schedulers are not, by themselves, considered the most reliable. This becomes apparent in the serverless function environment, where they are often found as completely independent components. Their only task is to fire at a given time, so it can happen that a function invoked by a scheduler fails to execute successfully. Several factors can lead to this, among them a temporary infrastructure outage (rare in the cloud, but possible), failed data processing, or a dropped database connection. Since schedulers in a serverless environment receive only the execution time as a parameter, no validation of input parameters is needed. From this point of view they are extremely reliable, since (in principle) the invocation itself cannot fail. A bigger problem is the fact that the prescribed functions simply "switch off" after execution and do not care whether the invoked function completed successfully [15].

The main use cases include the following:
• a database re-indexing maintenance task that must run n times within a given time interval,
• deactivating expired user accounts,
• monitoring memory space,
• creating backups of sensitive data,
• cache management,
• sending a newsletter or a special offer to subscribed users,
• periodically checking for dead links on a website,
• encoding videos and large files in the background.

From this list we can see that time schedulers help us with many a routine task that would be harder to implement with a different software-development approach. As already mentioned, we can choose from a wide spectrum of providers, so the following subsections present the two most important ones, GCP Cloud Scheduler and AWS CloudWatch Events.

3.2.1 Google Cloud Platform – Cloud Scheduler

Google offers many services, bundled into an enormous platform called the Google Cloud Platform. If we choose GCP to implement time-based scheduling of routine tasks, we need a combination of two services, Cloud Functions and Cloud Scheduler. The latter by itself uses everything the former offers, so we do not need to worry about connecting the two.

Cloud Scheduler is a fully managed scheduler that supports the scheduling of batch jobs, big-data jobs and cloud-infrastructure operations. The main advantage of choosing this scheduler is reliability, since its distributed infrastructure is managed by Google. It thereby guarantees at-least-once delivery of messages to their target, and at the same time it simplifies the concept of "crontabs" by letting the desired schedule be specified with a so-called "cron" expression. In addition, it offers a powerful tool for logging executions and their outcomes, and simple configuration of a retry policy in case of errors. By default, three scheduler instances can be hosted on an account free of charge; each further instance costs 0.10 dollars per month [5].

3.2.2 Amazon Web Services – CloudWatch Events

Like Google, Amazon also owns a platform offering a large set of services, namely Amazon Web Services. Within it we find the AWS Lambda service, which by itself does not offer time-based scheduling. To achieve it, Lambda must be combined with one of three services that do: CloudFormation, CloudWatch Events, or EventBridge, which is the updated version of CloudWatch Events. While AWS Lambda only provides remote execution of functions, the other three services focus on a real-time stream of system events. The CloudWatch service remains in a so-called standby state and reacts to particular triggers, for example time triggers. It allows rules for launching functions to be specified, for example every n minutes or days, or with more complex scheduling, for which a "cron" expression is again used [3].

The CloudFormation service, combined with AWS SAM (Serverless Application Model), also enables multi-stage execution of routines, and writing SAM templates for serverless execution is entirely painless. Thanks to the excellent integration support among the different services of the AWS platform, the sky is the limit when implementing schedulers [18].
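As a sketch of the CloudWatch Events rules described above, the snippet below creates a five-minute schedule and points it at an existing Lambda function using boto3. The rule name and function ARN are hypothetical, and the Lambda invoke permission that a real setup also needs is only noted in a comment.

```python
# Minimal sketch (assumptions: boto3 with configured AWS credentials;
# the rule name and Lambda ARN below are hypothetical).
import boto3

events = boto3.client("events")

# A rule can use rate(...) or an AWS-style cron(...) expression.
events.put_rule(
    Name="payslip-mailer-schedule",
    ScheduleExpression="rate(5 minutes)",
    State="ENABLED",
)

# Attach the Lambda function as the rule's target.
events.put_targets(
    Rule="payslip-mailer-schedule",
    Targets=[{
        "Id": "payslip-mailer",
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:payslipMailer",
    }],
)
# Note: the function must also grant events.amazonaws.com permission to
# invoke it (via lambda add_permission) before the trigger takes effect.
```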
3.2.3 Comparison of the proposed schedulers

Both of the aforementioned providers have developed first-rate solutions for time-based scheduling, yet each has its own advantages. While GCP Cloud Scheduler offers three free scheduler instances, AWS CloudWatch Events provides a free tier with at most 100 invocations per month. There is also a noticeable difference in price: while the latter charges 0.30 dollars per month for 10,000 invocations, GCP charges a fixed 0.10 dollars for each additional scheduler instance. The documentation of Google's service is written somewhat superficially, whereas Amazon's is backed by a wealth of use cases. In terms of complexity Google comes out ahead, since connecting this service with others from the GCP portfolio is trivial, while AWS services are somewhat harder to integrate. Execution speed is extremely high with both providers, so in this respect there is practically no difference between them. To summarize the comparison briefly: when several different time schedulers with low launch frequencies are needed, or for experimental purposes, AWS is the better choice; with a fixed (smaller) number of schedulers and precisely defined goals for what the schedulers should achieve, GCP is.

4 Use case

For a more tangible picture of how time schedulers work in a function environment, consider the accounting service presented in Figure 2 (below), which is responsible for sending out the payslips of all employees of the companies it works with. The service wants fully automated delivery of payslips by e-mail. Since these e-mail messages contain data critical to the employees, it must be guaranteed that they are delivered without exception. This can be achieved by combining schedulers, which provide the timing of the messages, with messaging systems that offer guaranteed message delivery. The BPMN diagram in Figure 2 is designed generically, so it can be implemented in the same way regardless of the chosen function-architecture provider.

With the requirement for reliability, the complexity of the designed architecture grows as well. As already mentioned, disruptions can occur while the process described above is running. The loops in the diagram therefore represent the catching of errors around publishing messages to a topic. If an instance crashes while executing code on the first or second topic, the state is always preserved after a restart, because intermediate steps are persisted. This ensures that all messages are certainly sent, and moreover that every message is sent exactly once. Other previously mentioned options can also be incorporated into this use case: since payslips count as sensitive data, the accounting service could use schedulers to create backups of them; similarly, employees no longer active in the system could be deactivated, or, given additional functionality in the system, large files could be encoded in the background.

Figure 2: BPMN diagram of the scheduler design for the accounting system
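The "persist intermediate steps, retry on failure" pattern from the use case can be made concrete with a small sketch: at-least-once retries plus a checkpoint of already-sent payslips approximate exactly-once delivery. Everything here (the table name, the send_email helper, the payslip IDs) is hypothetical and only illustrates the idea, not the paper's actual implementation.

```python
# Minimal sketch of checkpointing (all names hypothetical): retries give
# at-least-once sending, the checkpoint table deduplicates re-runs.
import sqlite3

db = sqlite3.connect("checkpoint.db")
db.execute("CREATE TABLE IF NOT EXISTS sent (payslip_id TEXT PRIMARY KEY)")

def send_email(payslip):                  # stand-in for the real mail system
    print("sending payslip", payslip["id"])

def run(payslips):
    """Called by the time scheduler; safe to re-run after a crash."""
    for p in payslips:
        already = db.execute(
            "SELECT 1 FROM sent WHERE payslip_id = ?", (p["id"],)
        ).fetchone()
        if already:                       # intermediate step persisted earlier
            continue
        send_email(p)                     # may raise; the next run retries it
        db.execute("INSERT INTO sent (payslip_id) VALUES (?)", (p["id"],))
        db.commit()                       # checkpoint right after success

run([{"id": "2021-09/emp-42"}, {"id": "2021-09/emp-43"}])
# Caveat: a crash between send_email and commit re-sends that one payslip;
# strict exactly-once also needs deduplication on the receiving side.
```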
5 Discussion

As we can gather from the second section, there are many approaches to implementing time schedulers for routines in cloud architectures, so by choosing one of them we also take on a number of particularities that must be considered in the implementation, execution and maintenance of such systems. Exactly this interested us in our first research question, which reads:

• What are the particularities of time schedulers in a serverless environment?

The particularities of time schedulers in a serverless environment divide into those that represent advantages for the user and those that represent drawbacks, as is evident from Table 1. Among the particularities to consider during implementation are somewhat weaker support for personalization and a lack of standardization. For this reason it is necessary, when choosing a provider, to study the specifics of the scheduler it offers; the differences we find between providers are above all financial and performance-related. Even so, choosing a serverless environment with the function approach presented in Section 3.2 brings many advantages, among them simple modification and configuration of schedulers via a "cron" expression, and a guaranteed at-least-once execution of the function invoked by the scheduler, which is somewhat harder to achieve in classical systems.

Another particularity is that schedulers in a serverless environment are an independent component. This means they are separated from the rest of the code, so we can still rely on them even when the base application runs into problems. A further particularity is that tight timing constraints are hard to meet: because two invocations may overlap in time, the smallest gap between two executions is one minute. Among the advantages gained by choosing a serverless architecture, it also makes sense to answer the second research question, which was the common thread of the whole article:

• When does it make sense to use the time schedulers of a serverless environment?

At first glance, scheduling routines is not something we encounter often. Reality is quite different, however: we quickly find that time schedulers are used for many a task, as is evident from the list of the most common use cases in Section 3.2. It therefore makes sense to use them above all for tasks that we want executed in the background, without operator intervention. As already mentioned in Section 4, these are, for example, database re-indexing maintenance tasks, deactivating expired user accounts on some portal, monitoring and managing memory space, creating backups of sensitive data, sending e-mail messages to particular users, and encoding large files in the background. The time schedulers of the serverless environment, using functions as a service, are thus sensible to use whenever any other time scheduler would be sensible; in addition, schedulers of this type bring the added value of fast implementation and a small investment on the user's side for a service that demonstrably works.

Choosing a serverless environment has many more advantages. One of them is that the schedulers run only when strictly necessary; local schedulers, by contrast, require a machine that is switched on 24 hours a day in order to run undisturbed. Separating components, in our case the schedulers, from our code brings freshness and makes the individual parts of a project easier to manage, since smaller units are far more manageable. The main advantage of schedulers in a serverless environment, however, is certainly the answer to the question: why re-implement something that is already available and works reliably and fast?

6 Conclusion

In this paper we investigated approaches to implementing time schedulers in a serverless environment and their reliable use. We went deeper into the comparison of two representatives of the serverless environment, containers and the FaaS approach. We got to know in more detail the two main providers of functions as a service, AWS CloudWatch Events and GCP Cloud Scheduler, which by themselves enable the implementation and reliable use of time schedulers.

Through the analysis of the literature and the answers to the two research questions posed at the start of the research, we found that time schedulers, especially in a FaaS environment, are extremely easy to develop, and that they require little maintenance. It makes sense to use them for routine tasks we have no time to deal with, so they can relieve us of work, while also being extremely affordable. To sum up, it is advisable to use the time schedulers of the serverless environment whenever a project develops a need to handle repeating or routine tasks.

References

[1] Alqaryouti, O., and Siyam, N. Serverless Computing and Scheduling Tasks on Cloud: A Review. American Scientific Research Journal for Engineering, Technology, and Sciences (ASRJETS) (Mar 2018), 1–14.
[2] Aske, A. M. Scheduling Functions-as-a-Service at the Edge. PhD thesis, Washington State University, Vancouver, May 2018.
[3] AWS. What Is Amazon CloudWatch Events? – Amazon CloudWatch Events, 2021.
[4] Costello, K. The CIO's Guide to Serverless Computing. Gartner (Jun 2020).
[5] Google. Cloud Scheduler | Google Cloud, 2021.
[6] Iams, T. Gartner Blog Network. Gartner (May 2019).
[7] Kohgadai, A. 6 Container Adoption Trends of 2020 | StackRox. StackRox (Mar 2020).
[8] Kubernetes. What is Kubernetes? | Kubernetes, Feb 2021.
[9] Leger, Y., and Broshar, A. FaaS vs CaaS: Comparing Use Cases and Responsibilities, Feb 2021.
[10] Marathe, N., Gandhi, A., and Shah, J. M. Docker Swarm and Kubernetes in Cloud Computing Environment. In Proceedings of the International Conference on Trends in Electronics and Informatics, ICOEI 2019 (Apr 2019), vol. 2019-April, Institute of Electrical and Electronics Engineers Inc., pp. 179–184.
[11] Moore, S. Gartner Forecasts Strong Revenue Growth for Global Container Management Software and Services Through 2024. Gartner (Jun 2020).
[12] Panetta, K. Gartner Top Strategic Technology Trends for 2021. Gartner (Oct 2020).
[13] Pereira Ferreira, A., and Sinnott, R. A Performance Evaluation of Containers Running on Managed Kubernetes Services. In Proceedings of the International Conference on Cloud Computing Technology and Science, CloudCom (Dec 2019), vol. 2019-December, IEEE Computer Society, pp. 199–208.
[14] Rodriguez, M. A., and Buyya, R. Container-based Cluster Orchestration Systems: A Taxonomy and Future Directions. Software – Practice and Experience 49, 5 (May 2019), 698–719.
[15] Singhvi, A., Houck, K., Balasubramanian, A., Danish Shaikh, M., Venkataraman, S., and Akella, A. Archipelago: A Scalable Low-Latency Serverless Platform. Tech. rep., University of Wisconsin–Madison, Nov 2019.
[16] Tozzi, C. Why Is Docker So Popular? Explaining the Rise of Containers and Docker. Channel Futures, Jul 2017.
[17] Walker, J. How to Use Cron With Your Docker Containers, Jan 2021.
[18] Yilmaz, E. 3 Ways to Schedule AWS Lambda and Step Functions State Machine Executions, Jan 2020.
[19] Zhang, R., Chen, Y., Zhang, F., Tian, F., and Dong, B. Be Good Neighbors: A Novel Application Isolation Metric Used to Optimize the Initial Container Placement in CaaS. IEEE Access 8 (Sep 2020), 178195–178207.
PROCEEDINGS OF THE 2021 7TH STUDENT COMPUTER SCIENCE RESEARCH CONFERENCE (STUCOSREC)

IZTOK FISTER ET AL. (EDS.)
University of Maribor, Faculty of Electrical Engineering and Computer Science, Maribor, Slovenia. E-mail: iztok.fister@um.si

Abstract: The 7th Student Computer Science Research Conference is an answer to the fact that modern PhD and even Master's level Computer Science programmes foster early research activity among students. The prime goal of the conference is to be a place for students to present their research work, and hence to further encourage students towards early research. Besides this, the conference also aims to establish an environment where students from different institutions meet, get to know each other, exchange ideas and, not least, make friends and research colleagues. Last but not least, the conference is also meant to be a meeting place for students and senior researchers from institutions other than their own.

Keywords: student conference, computer and information science, artificial intelligence, data science, data mining
DOI https://doi.org/10.18690/978-961-286-516-0
ISBN 978-961-286-516-0