Proceedings of the 2021 7th Student Computer Science Research Conference (StuCoSReC)

Editors: Iztok Fister, Andrej Brodnik, Janez Brest, Iztok Fister Jr., Matjaž Krnc, Niko Zimic

Maribor, September 2021

Title: Proceedings of the 2021 7th Student Computer Science Research Conference (StuCoSReC)

Editors: Iztok Fister (University of Maribor, Faculty of Electrical Engineering and Computer Science), Andrej Brodnik (University of Ljubljana, Faculty of Computer and Information Science), Janez Brest (University of Maribor, Faculty of Electrical Engineering and Computer Science), Iztok Fister Jr. (University of Maribor, Faculty of Electrical Engineering and Computer Science), Matjaž Krnc (University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technology), Niko Zimic (University of Ljubljana, Faculty of Computer and Information Science)

Language editing: Shelagh Hedges

Technical editors: Iztok Fister (University of Maribor, Faculty of Electrical Engineering and Computer Science), Jan Perša (University of Maribor, University Press)

Cover graphics: Network, author: geralt (Pixabay.com CC0)

Graphics material: Authors & Editors

Conference: 7th Student Computer Science Research Conference (StuCoSReC)

Location & date: 14th September 2021, Maribor, Slovenia

Program committee: Klemen Berkovič (University of Maribor, Slovenia), Zoran Bosnić (University of Ljubljana, Slovenia), Borko Bošković (University of Maribor, Slovenia), Janez Brest (University of Maribor, Slovenia), Lucija Brezočnik (University of Maribor, Slovenia), Andrej Brodnik (University of Primorska and University of Ljubljana, Slovenia), Patricio Bulić (University of Ljubljana, Slovenia), Mojca Ciglarič (University of Ljubljana, Slovenia), Jani Dugonik (University of Maribor, Slovenia), Iztok Fister (University of Maribor, Slovenia), Iztok Fister Jr. (University of Maribor, Slovenia), Matjaž Gams (Jozef Stefan Institute, Slovenia), Mario Gorenjak (University of Maribor, Slovenia), Aleš Holobar (University of Maribor, Slovenia), Andres Iglesias (Universidad de Cantabria, Spain), Sašo Karakatič (University of Maribor, Slovenia), Branko Kavšek (University of Primorska, Slovenia), Domen Kavran (University of Maribor, Slovenia), Štefan Kohek (University of Maribor, Slovenia), Matjaž Krnc (University of Primorska, Slovenia), Miklos Kresz (University of Szeged, Hungary), Niko Lukač (University of Maribor, Slovenia), Marjan Mernik (University of Maribor, Slovenia), Uroš Mlakar (University of Maribor, Slovenia), Eneko Osaba (University of Deusto, Spain), Vili Podgorelec (University of Maribor, Slovenia), Jan Popič (University of Maribor, Slovenia), Miha Ravber (University of Maribor, Slovenia), Peter Rogelj (University of Primorska, Slovenia), Martin Šavc (University of Maribor, Slovenia), Alexandros Tzanetos (University of the Aegean), Damjan Vavpotič (University of Ljubljana, Slovenia), Grega Vrbančič (University of Maribor, Slovenia), Nikolaj Zimic (University of Ljubljana, Slovenia), Gilbert Lim Yong San (Singapore Eye Research Institute) & Borut Žalik (University of Maribor, Slovenia).

Local organizing committee: Janez Brest (University of Maribor, Slovenia), Iztok Fister (University of Maribor, Slovenia) & Aleš Holobar (University of Maribor, Slovenia).
Published by: University of Maribor, University Press, Slomškov trg 15, 2000 Maribor, Slovenia; https://press.um.si, zalozba@um.si

Co-published by: University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia; http://www.feri.um.si, feri@um.si

Co-published by: University of Ljubljana, Faculty of Computer and Information Science, Večna pot 113, 1000 Ljubljana, Slovenia; https://fri.uni-lj.si/en, dekanat@fri.uni-lj.si

Co-published by: University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Glagoljaška 8, 6000 Koper, Slovenia; https://www.famnit.upr.si/en, info@upr.si

Edition: 1st
Publication type: E-book
Published: Maribor, September 2021
Available at: https://press.um.si/index.php/ump/catalog/book/611

© University of Maribor, University Press / Univerza v Mariboru, Univerzitetna založba
Text © Authors & Editors, 2021

This book is published under a Creative Commons Attribution 4.0 International licence (CC BY 4.0). This licence allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The licence allows for commercial use. Any third-party material in this book is published under the book's Creative Commons licence unless indicated otherwise in the credit line to the material. If you would like to reuse any third-party material not covered by the book's Creative Commons licence, you will need to obtain permission directly from the copyright holder. https://creativecommons.org/licenses/by/4.0/

CIP - Kataložni zapis o publikaciji
Univerzitetna knjižnica Maribor
004(082)(0.034.2)
STUDENT Computer Science Research Conference (7 ; 2021 ; Maribor)
Proceedings of the 2021 7th Student Computer Science Research Conference (StuCoSReC) [Elektronski vir] / [editors Iztok Fister ... et al.]. - 1st ed. - E-zbornik. - Maribor : University of Maribor, University Press : Faculty of Electrical Engineering and Computer Science ; Ljubljana : Faculty of Computer and Information Science ; Koper : Faculty of Mathematics, Natural Sciences and Information Technologies, 2021
Način dostopa (URL): https://press.um.si/index.php/ump/catalog/book/611
ISBN 978-961-286-516-0 (pdf)
doi: 10.18690/978-961-286-516-0
COBISS.SI-ID 76158723

ISBN: 978-961-286-516-0 (pdf)
DOI: https://doi.org/10.18690/978-961-286-516-0
Price: Free copy
For publisher: prof. dr. Zdravko Kačič, rector of the University of Maribor
Attribution: Fister, I., Brodnik, A., Brest, J., Fister, I. Jr., Krnc, M. & Zimic, N. (eds.) (2021). Proceedings of the 2021 7th Student Computer Science Research Conference (StuCoSReC). Maribor: University Press. doi: 10.18690/978-961-286-516-0
Table of Contents

Preface (Janez Brest), p. 1
On Artefact Elimination in High Density Electromyograms by Independent Component Analysis (Aljaž Frančič, Aleš Holobar & Milan Zorman), p. 3
Zero-Knowledge Authentication (Jakob Povšič & Andrej Brodnik), p. 7
Graphs where Search Methods are Indistinguishable (Matjaž Krnc & Nevena Pivač), p. 11
System for Remote Collaborative Embedded Development (Martin Domajnko, Nikola Glavina & Aljaž Žel), p. 15
Leaf Segmentation of Rosette Plants using Rough K-Means in CIELab Color Space (Arunita Das, Daipayan Ghosal & Krishna Gopal Dhal), p. 19
Adversarial Image Perturbation with a Genetic Algorithm (Rok Kukovec, Špela Pečnik, Iztok Fister Jr. & Sašo Karakatič), p. 25
Fast Recognition of Some Parametric Graph Families (Nina Klobas & Matjaž Krnc), p. 31
Interactive Evolutionary Computation Approach to Permutation Flow Shop Scheduling Problem (Vid Keršič), p. 35
Towards Representative Web Performance Measurements with Google Lighthouse (Tjaša Heričko, Boštjan Šumak & Saša Brdnik), p. 39
Transformer-based Sarcasm Detection in English and Slovene Language (Matic Rašl, Mitja Žalik & Vid Keršič), p. 43
Extraction and Analysis of Sport Activity Data Inside Certain Area (Luka Lukač), p. 47
Methodology for the Assessment of the Text Similarity of Documents in the CORE Open Access Data Set of Scholarly Documents (Ivan Kovačič, David Bajs & Milan Ojsteršek), p. 51
Embedding Non-planar Graphs: Storage and Representation (Ðorđe Klisura), p. 57
Analysis of the Rhythmicity of Count Data Using the Cosinor Model (Analiza ritmičnosti števnih podatkov z uporabo modela cosinor) (Nina Velikajne & Miha Moškon), p. 61
Sentiment Analysis of Hotel Reviews Using Lexicons and the Naive Bayes Method (Analiza sentimenta komentarjev hotelov z uporabo slovarjev in metode Naivni Bayes) (Nina Murks, Anže Omerzu & Borko Bošković), p. 65
Task Schedulers and the Serverless Environment (Časovni razporejevalniki in brezstrežniško okolje) (Uroš Zagoranski), p. 71

Preface

JANEZ BREST

Computer science is one of the fastest growing fields. We live in a digital age where computers, smart phones and many other devices are connected worldwide. To meet all the requirements of the growing digital world, knowledge, skills and abilities are needed, which can be found in young people, especially in brilliant students.

The Student Computer Science Research Conference (StuCoSReC) is organized through the collaboration of the departments and institutes of three public universities in Slovenia with computer science study programmes. We are very proud that these StuCoSReC student conferences have been held every year since 2014 at different institutions in Slovenia. Unfortunately, in 2020, which was marked by the Covid-19 pandemic, the organization of the StuCoSReC conference was not possible. This year we want to hold the conference despite the stormy conditions of the Covid-19 pandemic.

The StuCoSReC conference is intended for computer science students to present their research work and achievements, exchange ideas, socialize and connect among themselves. These proceedings contain the papers of the seventh Student Computer Science Research Conference 2021 (StuCoSReC'21), which is planned to be held in Maribor. The University of Maribor, Faculty of Electrical Engineering and Computer Science is proud to host the conference this year. We received seventeen submissions, covering several topics of computer science. Three submissions are in Slovene and the others are in English. The submissions were each reviewed by two reviewers.
One submission was rejected. The talks are scheduled to be held during the conference. The organizing committee would like to thank the reviewers for their well-performed job. The conference is dedicated to graduate and undergraduate students of computer science, and is therefore free of charge. We gratefully acknowledge the support of the Faculty of Electrical Engineering and Computer Science (University of Maribor), especially the Institute of Computer Science.

On Artefact Elimination in High Density Electromyograms by Independent Component Analysis

Aljaž Frančič, Aleš Holobar, Milan Zorman (University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia)
aljaz.francic@um.si, ales.holobar@um.si, milan.zorman@um.si
DOI: https://doi.org/10.18690/978-961-286-516-0.1

Abstract
We propose a novel approach to artefact detection and elimination in high density electromyograms by using the previously introduced Activity Index and Independent Component Analysis (ICA). 28 electromyographic recordings of the biceps brachii muscle were analysed for the presence of artefacts. Using the technique presented in this study, we eliminated an average of 1.07 ± 1.18 artefacts per HDEMG recording. The mean number of eliminated artefacts per recording with at least one detected artefact was 1.88 ± 0.96. In the HDEMGs used in our study, each artefact was found in a separate ICA component.

Keywords: biomedical signal processing, electromyography, blind source separation, independent component analysis, activity index, artefact

1 Introduction

In humans, movement and locomotion are regulated by muscles. An electrical signal travels from the central nervous system towards muscles, where it is electrically amplified. By using non-invasive surface electromyography,
it is possible to detect the subtle changes in the voltage on the surface of the skin that originate from the muscles, even through the skin and the subcutaneous fat layers. Such recordings are called surface electromyograms (SEMG). When an array of tens of electrodes is used, we call the resulting recordings high density electromyograms (HDEMGs). Due to many factors involved in the recording of HDEMGs that are often impossible to control for, the resulting recordings are prone to contain artefacts from various sources, such as power line interference, inadequate electrode-skin contact, electrode drift, subpar quality of the equipment, movement artefacts, etc.

A motor unit (MU) is made up of a motor neuron and the skeletal muscle fibers innervated by that motor neuron's axonal terminals. Groups of motor units often work together to coordinate the contractions of a single muscle. We can think of the signal that travels from the motor neuron to the skeletal muscle fiber as a time series of zeroes, which represent no firing, and ones, which represent the firing of a muscle fiber and, thus, the MU. Because of the refractory period, the spikes are few and far between, making the resulting signal sparse in time. This signal is called the MU spike train.

MUs fire asynchronously and their contributions are superimposed into HDEMG, forming a highly complex signal that is difficult to interpret. By using computer-aided methods, such as Convolution Kernel Compensation (CKC) [2], it is possible to decompose the HDEMGs into contributions of individual MUs and, therefore, identify the firing times of MUs. This gives us insight into the status of the motor system in humans.

However, due to the reasons mentioned in the initial paragraph of this section, HDEMGs often contain various artefacts. These artefacts hinder the ability of CKC to decompose the HDEMG into contributions of individual motor units. Also worth noting: a single artefact might sometimes be present in multiple EMG channels at the same time, so eliminating a single HDEMG channel (for a period of time) is not always effective.

In this study, we exemplified our technique for artefact detection and elimination in HDEMG by using the previously introduced Activity Index [1] and Independent Component Analysis (ICA) techniques [5].

2 Materials and Methods

2.1 HDEMG model and Activity Index

In isometric contractions of skeletal muscles, HDEMG signals can be modeled by the following convolutive model [3]:

    y(n) = H t(n) + ω(n)    (1)

where

    y(n) = [y_1(n) ... y_1(n − F + 1) ... y_M(n − F + 1)]^T    (2)

stands for the time-wise extended vector of HDEMG signals, with the extension factor F set between 10 and 20 [3],

    ω(n) = [ω_1(n) ... ω_M(n − F + 1)]^T    (3)

is the time-wise extended noise vector, and

    t(n) = [t_1(n) ... t_1(n − L − F + 1) ... t_J(n − L − F + 1)]^T    (4)

is the similarly extended vector of MU spike trains. Here, the spike train of the j-th MU is defined as

    t_j(n) = Σ_k δ(n − τ_j(k)),  j = 1, ..., J    (5)

where δ() denotes the Delta function and the k-th spike of the j-th MU appears at time τ_j(k).

The mixing matrix H comprises the L samples long MUAPs of J active MUs, as detected by M uptake electrodes [3]. The Convolution Kernel Compensation (CKC) method [3] estimates the MU filter iteratively as

    f̂_j = f̂_j + α E(g(t̂_j(n)) y(n))^T C_y^{-1}    (6)

    f̂_j = f̂_j / ‖f̂_j‖    (7)

where α determines the speed of convergence, E() stands for mathematical expectation, g(t) is a nonlinear weighting function, e.g. g(t) = log(1 + t^2), and C_y = E(y(n) y^T(n)) represents the correlation matrix of extended HDEMG measurements. After each iteration of Eqs. 6 and 7, the estimate of the MU spike train gets updated:

    t̂_j(n) = f̂_j^T y(n)    (8)

The Activity Index I_A at a given time n is defined as [3]

    I_A(n) = y^T(n) C_y^{-1} y(n)    (9)

[Figure 1: Representative HDEMG channels recorded during the isometric contraction of biceps brachii at 10 % MVC contraction (in blue) and the corresponding Activity Index (in red), with the detected outliers encircled in green. The most prominent artefacts are noticeable in the 4th and 6th EMG channel at approximately the 10th second, in the 9th and 11th EMG channel at approximately the 17th second, and in the 15th and 18th EMG channel at approximately the 4th second. A single artefact is often present in several EMG channels. Also evident, not all the HDEMG artefacts were detrimental for the Activity Index calculation and, therefore, for MU identification. Axes: EMG amplitude (mV) and Activity Index amplitude (a. u.) versus time (s).]
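For illustration, Eq. (9) can be computed in a few lines. The following is a minimal sketch, not the authors' code: it assumes a numpy array emg of shape (channels, samples), estimates C_y from the data, and adds a small regularisation term to keep the inversion stable (the paper does not discuss regularisation); the helper names extend and activity_index are hypothetical.

    import numpy as np

    def extend(emg, F):
        # Time-wise extension of an (M, N) HDEMG array by factor F (Eq. 2):
        # each channel is stacked with its F-1 delayed copies.
        # (np.roll wraps at the edges, which is fine for a sketch.)
        M, N = emg.shape
        rows = [np.roll(emg[m], d) for m in range(M) for d in range(F)]
        return np.asarray(rows)  # shape (M*F, N)

    def activity_index(emg, F=1, reg=1e-10):
        # Activity Index I_A(n) = y(n)^T C_y^{-1} y(n) (Eq. 9).
        y = extend(emg, F)
        C = (y @ y.T) / y.shape[1]  # sample estimate of C_y
        Cinv = np.linalg.inv(C + reg * np.eye(C.shape[0]))
        return np.einsum('in,ij,jn->n', y, Cinv, y)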
Calculating the Activity Index is the first step of the CKC decomposition approach [3], and it is susceptible to HDEMG artefacts. Thus, it can also be used to detect them before further processing. If we do not eliminate the artefacts in HDEMG, they can cause problems for decomposition, preventing the convergence of the MU filter in Eq. 6 or the segmentation of MU firing moments from the identified MU spike train t̂_j(n).

2.2 Independent Component Analysis

In signal processing, ICA is a method for separating a multivariate signal into its additive subcomponents. This is done by assuming that the subcomponents are non-Gaussian signals and that they are statistically independent from each other [4]. ICA is a special case of blind source separation.

In our case, the fastICA [5] decomposition was applied to HDEMG, yielding the individual independent components. Deflation was chosen as the decorrelation approach in the fastICA decomposition, and the nonlinearity g(u) = u^3 was selected in the fixed-point algorithm. The stopping criterion was set to 0.0001 and the maximum number of iterations was set to 1000 (the default values).

2.3 Artefact detection and elimination

An example of a few representative HDEMG channels recorded during isometric contraction of the biceps brachii muscle at 10 % of maximum voluntary contraction (MVC) is shown in Fig. 1.

The artefact detection process began by calculating the Activity Index (red line in Fig. 1) from HDEMG. In our study, the extension factor F was set to 1. This yielded one time series from several HDEMG channels.

Next, the outliers (denoted by green circles on the red line in Fig. 1) were found in the Activity Index. In our case, an outlier was defined as an element that was more than 15 scaled median absolute deviations (MAD) away from the local median. Noteworthy, this could be parameterized and the outliers could be determined in some other fashion. The local median was defined as the median inside a sliding window, where the window size was equal to 2048 samples. To the best of our knowledge there is no established method for artefact elimination using the Activity Index, hence the MAD threshold for determining the outliers in the Activity Index was selected empirically, using visual inspection of the results by an expert. As an additional step, we also ignored any outlier that was fewer than 200 samples away from an already detected outlier. In this way, we prevented multiple identifications of the same artefact.
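A minimal sketch of the outlier rule just described, under the assumption that the scaled MAD uses the usual 1.4826 consistency factor (the paper does not state the scaling constant); ia is the Activity Index as a numpy array, and find_outliers is a hypothetical helper name.

    import numpy as np

    def find_outliers(ia, window=2048, k=15, min_gap=200):
        # Sliding-window median and scaled MAD; the 1.4826 factor makes
        # MAD comparable to the standard deviation for Gaussian data.
        n, half, keep = len(ia), window // 2, []
        for i in range(n):
            w = ia[max(0, i - half):min(n, i + half)]
            med = np.median(w)
            mad = 1.4826 * np.median(np.abs(w - med))
            if mad > 0 and abs(ia[i] - med) > k * mad:
                # suppress detections within min_gap samples of the last one
                if not keep or i - keep[-1] >= min_gap:
                    keep.append(i)
        return keep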
Afterwards, the fastICA decomposition was applied to the HDEMG, yielding the individual independent components. All the identified components were taken into consideration, as we would like to preserve as much of the information as possible after artefact elimination. Each independent component was then excluded in turn, and a new Activity Index I_EX was calculated without the excluded independent component. Noteworthy, calculating the Activity Index from either the HDEMG or all of its ICA components will always yield the same result. Then, for each detected outlier in the Activity Index I_A, we tried to identify the ICA component that contributed the most to the outlier in the Activity Index (to the HDEMG artefact), by observing the difference in the Activity Index at the time of the outlier before and after the exclusion of the individual ICA component. We did this by using the interest metric defined as

    IntMet_EX(n) = 1 − I_EX(n) / I_A(n)    (10)

at any given time n. In particular, for each detected outlier at time x in the old Activity Index I_A(x), we looked at all the new Activity Indices I_EX(x) (with individual ICA components excluded) and calculated the interest metrics IntMet_EX(x) at the time x of the outlier in the Activity Index I_A(x).

The interest metric IntMet_EX helped to expose the relationship between the old Activity Index I_A and the new Activity Indices I_EX. A higher value of the interest metric corresponded to a greater chance that there was an artefact in the excluded ICA component. Using this metric it was possible to determine which ICA component contains the artefact. It was assumed that only a single ICA component will contain a single artefact, as ICA works under the assumption that the sources are statistically independent from each other. If we were to see artefacts in multiple ICA components at the same time, this would imply that they came from statistically independent sources. This would imply artefact co-occurrence, which is unlikely given our findings about the number of outliers and artefacts found per recording (Section 3).
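The elimination loop of this subsection could look roughly as follows. This is a sketch, not the authors' implementation: it reuses the hypothetical activity_index and find_outliers helpers from the earlier snippets and scikit-learn's FastICA with the deflation approach, cube nonlinearity, tolerance and iteration limit quoted above.

    import numpy as np
    from sklearn.decomposition import FastICA

    def eliminate_artefacts(emg, threshold=0.5):
        # emg: (N samples, M channels) array. Returns the cleaned HDEMG.
        ica = FastICA(algorithm='deflation', fun='cube', tol=1e-4, max_iter=1000)
        S = ica.fit_transform(emg)      # independent components, (N, J)
        ia = activity_index(emg.T)
        for x in find_outliers(ia):
            scores = []
            for j in range(S.shape[1]):
                S_ex = S.copy()
                S_ex[:, j] = 0.0        # exclude component j
                i_ex = activity_index(ica.inverse_transform(S_ex).T)
                scores.append(1.0 - i_ex[x] / ia[x])   # IntMet_EX(x), Eq. 10
            j_art = int(np.argmax(scores))
            if scores[j_art] > threshold:   # 0.5, per the Discussion
                S[:, j_art] = 0.0           # eliminate the whole component
        return ica.inverse_transform(S)

After the loop, ica.inverse_transform(S) returns the recording in HDEMG space with the artefact components removed.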
2.4 Dataset and evaluation

To evaluate our method for artefact detection and elimination, we used 20 second long HDEMG recordings from 7 neurologically intact young subjects performing isometric contractions of the biceps brachii muscle, at 5, 10, 15 and 20 % of MVC, for a total of 28 HDEMG recordings. A 13 × 5 electrode array was used. Visual feedback on force was provided to the participants. All the experiments were conducted in accordance with the Declaration of Helsinki, and were approved by the local Ethical Committee.

In Section 3 we reported the mean ± the Standard Deviation (SD) of the number of outliers found in the Activity Index per HDEMG recording, the mean number of eliminated artefacts per HDEMG recording, the mean number of outliers per HDEMG where we identified at least one outlier, and the mean number of eliminated artefacts per HDEMG where we eliminated at least one artefact, as well as the mean interest metric IntMet_EX of the eliminated artefacts. We also provided a visual example of an elimination of an artefact.

3 Results

The mean number of outliers found in the Activity Index per HDEMG recording was equal to 1.25 ± 1.29. The mean number of eliminated artefacts per HDEMG recording was equal to 1.07 ± 1.18. The mean number of outliers in the Activity Index per HDEMG with at least one outlier was 2.06 ± 1.03, and the mean number of eliminated artefacts per HDEMG with at least one eliminated artefact present was 1.88 ± 0.96. The mean interest metric IntMet_EX of the eliminated artefacts was 0.85 ± 0.11. In the HDEMGs used in our study, each artefact was found in a separate ICA component. A representative example of our results is provided in Fig. 2.

[Figure 2: Artefact detection and elimination in ICA components of recorded HDEMG. The bottom panel depicts the I_A of all the ICA components (in blue) and the I_EX without the excluded component (in red). The green cross denotes the detected artefact time. The blue and red circles represent the detected outliers in I_A and I_EX, respectively. The second row from the bottom shows the 200 samples of Activity Index before and after the detected artefact time (green cross from the bottom panel). The blue line shows the I_A and the red line shows I_EX. The first dozen rows show the ICA components 200 samples before and after the detected artefact time (the green cross). The excluded component is depicted in red. It contains an artefact with an interest metric value of 0.96. All y-axis units are arbitrary.]

4 Discussion

Our results indicated that it was possible to eliminate artefacts in HDEMG using the Activity Index and ICA. However, it was difficult to accurately quantify the efficiency of this approach, as we did not know the ground truth about the artefacts' locations in time, nor the HDEMG channels where they were present. We could simulate certain kinds of artefacts at known times and in known HDEMG channels; however, this would only account for certain types of artefacts. For example, we could induce an artefact by touching certain electrodes during recording at a predefined time, or during the whole recording by incorrectly applying the contact gel. But we would still be left with other artefacts that we have little control over. Moreover, not all the artefacts have a significant impact on the Activity Index and on MU identification. By observing the Activity Index and comparing it to the HDEMG channels, it is quite clear that certain artefacts are more detrimental for the Activity Index than others. Therefore, without using the Activity Index or a similar metric, the assessment of artefact impact on EMG decomposition is often difficult. The interest metric threshold, above which we considered the artefacts to be successfully eliminated, was set to 0.5.

By using the ICA algorithm it would also be possible to reconstruct the original recordings from the ICA components. This would allow us to first transform the HDEMGs to ICA space, eliminate the artefacts to the best of our ability, and then transform the ICA components back to HDEMG space using simple matrix multiplication. In our current study, we eliminated the whole ICA component, but it would be worth considering "repairing" that component (e.g. locally setting the component elements to 0).
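A sketch of that repair idea, under the same assumptions as the previous snippet (ica and S as returned by FastICA); the window halfwidth is a hypothetical parameter, not a value from the paper.

    def repair_component(ica, S, j_art, x, halfwidth=100):
        # Zero only a window of the artefact component around time x,
        # then reconstruct the HDEMG with a single matrix multiplication.
        S_rep = S.copy()
        lo, hi = max(0, x - halfwidth), x + halfwidth
        S_rep[lo:hi, j_art] = 0.0
        return ica.inverse_transform(S_rep)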
In our method, determining the outliers in the Activity Index depends upon several parameters. The first one was the extension factor F, which was set to 1 for the purposes of this study. Also important was the scaled median absolute deviation (MAD) threshold, above which we considered an element of the Activity Index to be an outlier. In our case, we set it to 15. Lowering this threshold would yield more outliers in the Activity Index. Another parameter was the sliding window size in the Activity Index outlier detection, which in our case was set to 2048 samples. The presented outlier location detection could also be refined, as the currently described technique does not guarantee the identified outlier location to be at the location of the actual outlier peak. Instead, we aimed at identifying the first of the 200 samples that is at least 15 MAD away from the local median. This selection could have significant implications in the case of longer artefacts in HDEMG signals.

In our dataset, we found the mean interest metric IntMet_EX value of the eliminated artefacts to be 0.85 ± 0.11, while the threshold, above which we considered the artefacts to be eliminated, was set to 0.5. We also found slightly more outliers than actual artefacts, as identified by visual inspection of the HDEMG signals. This indicated that the current parameters (especially the extension factor for the Activity Index calculation F of 1, the MAD threshold for outlier detection of 15, and the interest metric threshold of 0.5) were suitable to identify artefacts in the HDEMGs. However, further tests are required to confirm these findings in other muscles and contraction levels.

Acknowledgment

This study was supported by the Slovenian Research Agency (Project J2-1731 and Program funding P2-0041).

References

[1] Holobar, A., and Zazula, D. Correlation-based decomposition of surface electromyograms at low contraction forces. Medical and Biological Engineering and Computing 42, 4 (2004), 487–495.
[2] Holobar, A., and Zazula, D. Gradient convolution kernel compensation applied to surface electromyograms. In International Conference on Independent Component Analysis and Signal Separation (2007), Springer, pp. 617–624.
[3] Holobar, A., and Zazula, D. Multichannel blind source separation using convolution kernel compensation. IEEE Transactions on Signal Processing 55, 9 (2007), 4487–4496.
[4] Hyvärinen, A. Independent component analysis: recent advances. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371, 1984 (2013), 20110534.
[5] Hyvärinen, A., and Oja, E. Independent component analysis: algorithms and applications. Neural Networks 13, 4-5 (2000), 411–430.

Zero-Knowledge Authentication

Jakob Povšič, Andrej Brodnik (University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Glagoljaška 8, 6000 Koper, Slovenia)
jakob.povsic@gmail.com, andrej.brodnik@upr.si
DOI: https://doi.org/10.18690/978-961-286-516-0.2

Abstract
Zero-Knowledge proofs (ZKPs) enable proving of mathematical statements, revealing nothing but their validity. We design an authentication system with a ZKP as a password verification mechanism within the Extensible Authentication Protocol (EAP) framework. Designing a secure password authentication system requires us to adopt security practices for protecting ourselves against the vulnerabilities of passwords. Integrating said practices is not trivial because of the tight coupling with the password verification method.

Keywords: extensible authentication protocol, zero-knowledge proofs, authentication, cryptography, key-stretching, passwords, quadratic residuosity problem

1 Introduction

Today privacy is a necessary sacrifice we have to make in order to take part in the digital world, imperative to our modern life. Every day, more digital systems gain access to our personal information. While this practice is often a necessary evil, many companies seek to exploit this position. Zero-knowledge proofs (ZKPs) are an intriguing cryptographic phenomenon for proving mathematical statements without revealing why they are true, and have the potential to change how our data exists in the digital space.

Our focus will be to define a simple use for zero-knowledge proofs. We will design an authentication system using a zero-knowledge proof as a password verification method, as an authentication method in the extensible authentication protocol (EAP). When designing the system, we need to protect ourselves from the vulnerabilities of passwords. However, integrating security methods presents a challenge because of the zero-knowledge proof system.

2 Extensible Authentication Protocol

Extensible authentication protocol [10] (EAP) is a general purpose authentication framework designed for network access authentication. EAP defines a set of messages that support negotiation and execution of a variety of authentication protocols. EAP is a two-party protocol between a peer and an authenticator at each end of a link.

Messages. The peer and the authenticator communicate by exchanging EAP messages. The protocol starts with the authenticator sending a message to the peer. They keep exchanging messages until the authenticator can either authenticate the peer or not. Messages are exchanged in a lock-step manner, where an authenticator sends a message and the peer responds to it. The authenticator dictates the order of messages, meaning it can send a message at any point of communication, as opposed to the peer, which can only respond to messages from the authenticator.

Messages are composed of fields, where the length of each field is a multiple of an octet of bits (Table 1).
Table 1: EAP Message Format

    Field:           Code  Identifier  Length  Type  Type-Data
    Length (octets): 1     1           2       1     n ≤ 2^16

We will store our authentication method data within the Type-Data field. Our EAP method is identified by the Type 84.

3 Zero-Knowledge Proofs

Zero-Knowledge Proofs [5, 6, 7] (ZKPs) are a concept in cryptography for proving the validity of mathematical statements. What makes them particularly interesting is that ZKPs can prove a statement revealing no information about why a statement is true, hence the term zero-knowledge. In mathematics, theorem proofs are logical arguments that establish truth through inference rules of a deductive system, based on axioms and other proven theorems. ZKPs are probabilistic, meaning they convince the verifier of the validity. We use the term convince, because ZKPs are not absolute truth, but the probability of someone being convinced by a false statement is arbitrarily small.

3.1 ZKP System for the Quadratic Residuosity Problem

Definition 3.1 (Quadratic Residuosity Problem) Given an integer x, a semiprime modulus n = pq, where p and q are unknown different primes, and a Jacobi symbol value (x/n) = 1, determine if x is a quadratic residue modulo n or not.

The law of quadratic reciprocity enables efficient computation of the Jacobi symbol value (x/n). However, when (x/n) = 1, it does not tell if x is a quadratic residue modulo n or not. x is only a quadratic residue modulo n if it is a quadratic residue of both modulo p and modulo q ((x/p) = (x/q) = 1). To compute this, we would have to know the factorization of n. However, since n is a product of two primes pq = n, this is computationally hard [2]. The only efficient way to prove x is a quadratic residue modulo n is with the root w.
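A small numeric illustration of this asymmetry, using sympy and a toy modulus n = 77 (far too small to be secure, but small enough for an exhaustive check): both 15 and 6 have Jacobi symbol 1 modulo 77, yet only 15 is a quadratic residue.

    from sympy.ntheory import jacobi_symbol

    n = 7 * 11  # toy semiprime

    def is_qr(x, n):
        # brute force; only feasible for toy moduli
        return any(w * w % n == x for w in range(n))

    for x in (15, 6):
        print(x, jacobi_symbol(x, n), is_qr(x, n))
    # 15 1 True   (15 = 57^2 mod 77)
    # 6  1 False  (a non-residue modulo both 7 and 11)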
The problem acts as a trapdoor function: it is hard to prove that x is a quadratic residue modulo n solely from x and n, while it is easy to prove when you know its root w.

The authors of [7] described a ZKP system for the quadratic residuosity problem. To prove x is a quadratic residue modulo n in zero-knowledge, we need to prove the existence of the root w, where w^2 ≡ x (mod n), without revealing w to the verifier. Here n is a semiprime with Jacobi symbol (x/n) = 1, x is the residue with w^2 ≡ x (mod n), and w is the root.

Table 2: ZKP Authentication with EAP

    1. Peer: u ∈_R Z_n, y = u^2 (mod n);  Peer → Authenticator: y
    2. Authenticator: b ∈_R {0, 1};  Authenticator → Peer: b
    3. Peer: z = u w^b (mod n);  Peer → Authenticator: z;
       the authenticator verifies z^2 ≡ y x^b (mod n)

In the ZKP authentication process of Table 2, the exchanged values are carried in the EAP message Type-Data field. The prover begins by picking a random integer u from the field Z_n, computing y = u^2 (mod n), and sending y to the verifier. The verifier picks a random bit b and sends it to the prover. The prover computes the value z with b and sends it back. The verifier checks the proof by asserting z^2 ≡ y x^b (mod n); this holds since

    z^2 ≡ y x^b (mod n)
    (u w^b)^2 ≡ u^2 (w^2)^b (mod n)
    u^2 w^{2b} ≡ u^2 w^{2b} (mod n).

For each round, a cheating prover has a 1/2 probability of succeeding, by correctly guessing the value of the random bit b. To improve the strength of the proof, we repeat this process m times for a confidence of 1 − 2^{−m}.
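As a sketch, one round of this protocol fits in a few lines of Python. Here both parties are collapsed into a single function, whereas in the EAP method the values y, b and z would each travel in a Type-Data field; zkp_round and authenticate are hypothetical names.

    import secrets

    def zkp_round(n, x, w):
        # One round of the quadratic-residuosity proof (Table 2).
        u = secrets.randbelow(n - 2) + 2           # prover's random u
        y = u * u % n                              # commitment
        b = secrets.randbits(1)                    # verifier's challenge
        z = u * pow(w, b, n) % n                   # prover's response
        return z * z % n == y * pow(x, b, n) % n   # verifier's check

    def authenticate(n, x, w, m=40):
        # m successful rounds give confidence 1 - 2**-m
        return all(zkp_round(n, x, w) for _ in range(m))

    n, w = 7 * 11, 57
    x = w * w % n          # toy values from the previous example
    print(authenticate(n, x, w))   # True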
To use this protocol as a password verification method, we can treat the root w as the password p = w known by the peer. The ZKP protocol proves that x is a quadratic residue modulo n, by proving the knowledge of the root w, where w^2 ≡ x (mod n). The peer will prove that x is a quadratic residue modulo n; to do this, however, the peer needs to prove the knowledge of the password p = w. With this, the authenticator can assert that the password is valid.

4 Password Protection

Password cracking [1] is an offline attack [8], where an attacker extracts passwords from data used by the authentication system for password verification. Protecting passwords on the data layer is of critical importance. Key-stretching [9, 1], also called password hashing, is the industry standard method for improving the security of low entropy secrets like passwords.

The quadratic residue x is derived from the root w = p and persistently stored with the authenticator. This introduces a vulnerability, as an attacker with access to x could crack the password w in an offline attack. To provide adequate security, we need to use key-stretching in our authentication method. A common application of a key-stretching method is to transform the vulnerable data stored in the authentication system. However, this approach does not work in our case. Let us revisit how the authenticator verifies the proof, and why key-stretching the password verification data (x) is an issue. We will begin by assuming the system can verify the proof after key-stretching the password verification data (x). As we define our process, we will see why it is not possible.

Key-Stretching x. In the last step of the protocol the authenticator verifies that

    z^2 ≡ y x^b (mod n).

If we stretch x with a function H and a salt s,

    H(x, s) = x_H,

we could then verify the proof only with an inverse function H^{-1}, by asserting

    z^2 ≡ y H^{-1}(x_H, s)^b (mod n).

This is possible assuming a polynomial algorithm H^{-1} exists; however, since key-stretching methods are based on hashing functions (one-way functions), we know that the probability of a polynomial algorithm H^{-1} successfully computing a pseudo-inverse is negligibly small. For all positive integers c [4],

    Pr[H(H^{-1}(H(x))) = H(x)] < |x|^{-c}.

Even if given unbounded time and resources, the pseudo-inverse x′ = H^{-1}(H(x)) might not be equal to x. The set I_x contains all values that map into H(x) = H(x′), and since H is not injective we know that |I_x| ≥ 1. This means that the probability that x′ = x is

    Pr[H^{-1}(H(x)) = x] = 1 / |I_x|.

Key-stretching x prevents us from verifying the ZKP. However, by increasing the entropy of the root w, we can eliminate the vulnerability and ensure adequate security. Our new approach will not treat the password p ≠ w as the root w. Instead, we will use the password p to derive the root w = H(p, s), using a key-stretching function H and a salt s. This way we have ensured the same level of protection against offline attacks as if we stretched the data stored in the system. And because we did not transform x, we can verify the proof without being affected by the issues mentioned in the previous paragraph. A similar approach is used in the PPP EAP SRP-SHA1 protocol [3]. Earlier we argued that the ZKP works as a password verification method because p = w; this argument is not true anymore. However, even though w ≠ p, the peer can only derive w knowing the password p, so when the peer proves the knowledge of w, it can only be so because they know p as well.

5 Secure Authentication

The authentication process now begins with the peer sending his identifier to the authenticator, and the authenticator responding with the peer's unique salt s and modulus n. The peer can now derive the root w from the password p and the salt s. This part of the process (Steps 1 and 2 in Table 3) happens only once. The peer can then authenticate by following the process described in §3.1. This part (Steps 3, 4 and 5) of the process is repeated m times for a confidence of 1 − 2^{−m}. The values exchanged in Table 3 are carried in the Type-Data field of the EAP messages.

Table 3: Improved ZKP Authentication with EAP

    1. Peer → Authenticator: identifier I
    2. Authenticator → Peer: salt s and modulus n; the peer derives w = H(p, s)
    3. Peer: u ∈_R Z_n, y = u^2 (mod n);  Peer → Authenticator: y
    4. Authenticator: b ∈_R {0, 1};  Authenticator → Peer: b
    5. Peer: z = u w^b (mod n);  Peer → Authenticator: z;
       the authenticator verifies z^2 ≡ y x^b (mod n)

Let us examine the EAP messages (Figure 1) of the authentication process described in Table 3. The mapping between EAP messages and the steps in Table 3 is not one-to-one. We merged some steps to reduce the number of message exchanges required for the process to complete.

[Figure 1: EAP Method Execution.]

Identity: This message is used to query the identity of the peer. In a system with multiple peers, this is required to identify the peer authenticating, and to find the correct salt s and quadratic residue x. (Table 3, Step 1.)

Setup: The peer needs both the salt s and the modulus n to compute the proof; however, he only knows the password p. Once the peer identifies himself, the authenticator needs to send him the salt s and the modulus n in the setup request message. (Table 3, Steps 2 and 3.)

Verification: With this message pair the peer and the authenticator exchange data to compute and verify the proof. The authenticator sends the random bit b and the peer responds with the proof z. The peer also sends the y_{i+1} for the next verification round i + 1; this is done as an optimisation to improve the speed of the process. (Table 3, Steps 4, 5 and 3.)

Success/Failure: After each verification message, the authenticator verifies the proof, and once it has done so successfully for m rounds, the authenticator sends the success message. However, if the proof is not valid, the authenticator must send a failure message.
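A sketch of the improved scheme, with PBKDF2 standing in for the unspecified key-stretching function H, and reusing the hypothetical zkp helpers and toy modulus from the previous snippet (a real deployment would use a large semiprime).

    import hashlib, secrets

    def derive_root(password: str, salt: bytes, n: int) -> int:
        # Key-stretched root w = H(p, s); PBKDF2-HMAC-SHA256 is one
        # possible choice of H, the iteration count is illustrative.
        h = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100_000)
        return int.from_bytes(h, 'big') % n

    n = 7 * 11   # toy modulus, as above

    # Enrollment: the authenticator stores (salt, n, x) per peer.
    salt = secrets.token_bytes(16)
    w = derive_root('correct horse battery staple', salt, n)
    x = w * w % n

    # Login: the peer re-derives w from the password and the salt
    # received in the Setup message, then runs the m rounds.
    assert authenticate(n, x, derive_root('correct horse battery staple', salt, n))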
6 Conclusions and Future Work

The aim of this work was to study the utility of zero-knowledge proofs as an EAP authentication method. We have presented an EAP method using a ZKP system for password verification. Additionally, we ensured adequate password protection by using a key-stretching method.

We have been successful in our goal of studying and using the ZKP protocol. While theoretically interesting, the system's performance may not be appropriate for real-world applications. The iterative nature of the underlying ZKP protocol accumulates communication latencies, slowing down the system.

Future work.

• The EAP method presented in this work can be implemented and tested in a real-world environment.
• The ZKP protocol used in this work is a first generation protocol. Today there are many newer protocols that have solved many shortcomings of the older generation ZKPs. Using a newer generation ZKP protocol can improve the performance of the authentication system.
• The ZKP protocol we have examined is iterative, which can cause worse performance. A parallel ZKP construction is assumed to have a weaker strength of zero-knowledge. However, in a real-world application, the performance improvements might justify the theoretical shortcomings.

References

[1] Blocki, J., Harsha, B., and Zhou, S. On the economics of offline password cracking. In 2018 IEEE Symposium on Security and Privacy (SP) (2018), IEEE, pp. 853–871.
[2] Buchmann, J. A. Factoring. Springer US, New York, NY, 2001, pp. 171–183.
[3] Carlson, J. D., Aboba, D. B. D., and Haverinen, H. PPP EAP SRP-SHA1 Authentication Protocol. Internet-Draft draft-ietf-pppext-eap-srp-03, Internet Engineering Task Force, July 2001. Work in Progress.
[4] Goldreich, O. Foundations of Cryptography: Volume 1, Basic Tools. Cambridge University Press, 2007.
[5] Goldreich, O., and Krawczyk, H. On the composition of zero-knowledge proof systems. SIAM Journal on Computing 25, 1 (1996), 169–192.
[6] Goldreich, O., Micali, S., and Wigderson, A. How to Prove all NP-Statements in Zero-Knowledge, and a Methodology of Cryptographic Protocol Design. vol. 263, pp. 171–185.
[7] Goldwasser, S., Micali, S., and Rackoff, C. The knowledge complexity of interactive proof systems. SIAM Journal on Computing 18, 1 (1989), 186–208.
[8] Grassi, P. A., Garcia, M. E., and Fenton, J. L. NIST Special Publication 800-63-3 Digital Identity Guidelines. National Institute of Standards and Technology, Los Altos, CA (2017).
[9] Hornby, T. Salted password hashing - doing it right, 2016.
[10] Vollbrecht, J., Carlson, J. D., Blunk, L., Aboba, D. B. D., and Levkowetz, H. Extensible Authentication Protocol (EAP). RFC 3748, June 2004.

Graphs where Search Methods are Indistinguishable

Matjaž Krnc and Nevena Pivač* (University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Glagoljaška 8, 6000 Koper, Slovenia)
matjaz.krnc@upr.si, nevena.pivac@iam.upr.si
DOI: https://doi.org/10.18690/978-961-286-516-0.3
* This work is supported in part by the Slovenian Research Agency (research programs P1-0285 and P1-0383, research projects N1-0102, J1-9110, N1-0160, N1-0209 and Young Researchers Grant).

Abstract
Graph searching is one of the simplest and most widely used tools in graph algorithms. Every graph search method is defined using some particular selection rule, and the analysis of the corresponding vertex orderings can aid greatly in devising algorithms, writing proofs of correctness, or recognition of various graph families. We study graphs where the sets of vertex orderings produced by two different search methods coincide. We characterise such graph families for ten pairs from the best-known set of graph searches: Breadth First Search (BFS), Depth First Search (DFS), Lexicographic Breadth First Search (LexBFS), Lexicographic Depth First Search (LexDFS), and Maximal Neighborhood Search (MNS).

Keywords: graph search methods, breadth first search, depth first search
1 Introduction

Graph search methods (for instance, Depth First Search and Breadth First Search) are among the essential concepts classically taught at the undergraduate level of computer science faculties worldwide. Various types of graph searches have been studied since the 19th century, and used to solve diverse problems, from solving mazes, to linear-time recognition of interval graphs, finding a minimal path-cover of co-comparability graphs, finding a perfect elimination order, or an optimal coloring of a chordal graph, and many others [1, 4, 7, 8, 10, 11].

In its most general form, a graph search (also generic search [5]) is a method of traversing vertices of a given graph such that every prefix of the obtained vertex ordering induces a connected graph. This general definition of a graph search leaves much freedom for a selection rule determining which node is chosen next. By defining some specific rule that restricts this choice, various different graph search methods are defined. The search methods that we focus on in this paper are Breadth First Search, Depth First Search, Lexicographic Breadth First Search, Lexicographic Depth First Search, and Maximal Neighborhood Search.

We briefly present the studied graph search methods in Section 2, and then state the obtained results in Section 3. Due to lack of space we omit the proofs and provide some directions for further work in Section 4. All proofs are available in the full version of the paper on www.arxiv.org.

2 Preliminaries

We now briefly describe the above-mentioned graph search methods, and give the formal definitions. Note that the definitions below are not given in a historically standard form, but rather as so-called three-point conditions, due to Corneil and Kruger [5] and also Brandstädt et al. [3]. Two vertices u, v ∈ V(G) satisfy the relation u <σ v if u appears before v in the ordering σ : V(G) → {1, 2, ..., n} of the vertices of G.

Breadth First Search (BFS), first introduced in 1959 by Moore [9], is a restriction of a generic search which puts unvisited vertices in a queue and visits the first vertex from the queue in the next iteration. After visiting a particular vertex, all its unvisited neighbors are put at the end of the queue, in an arbitrary order.

Definition 2.0.1 An ordering σ of V is a BFS-ordering if and only if the following holds: if a <σ b <σ c and ac ∈ E and ab ∉ E, then there exists a vertex d such that d <σ a and db ∈ E.

Any BFS ordering of a graph G starting in a vertex v results in a rooted tree (with root v), which contains the shortest paths from v to any other vertex in G (see [6]). We use this property implicitly throughout the paper.

Depth First Search (DFS), in contrast with BFS, traverses the graph as deeply as possible, visiting a neighbor of the last visited vertex whenever possible, and backtracking only when all the neighbors of the last visited vertex are already visited. In DFS, the unvisited vertices are put on top of a stack, so visiting the first vertex of the stack means that we always visit a neighbor of the most recently visited vertex.
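The two selection rules differ only in the data structure holding the discovered vertices, as the following sketch illustrates (hypothetical names; adjacency is given as a dict of vertex sets, and ties are broken arbitrarily, exactly as both definitions allow):

    from collections import deque

    def bfs_ordering(G, root):
        # queue discipline: first discovered, first visited
        order, seen, q = [], {root}, deque([root])
        while q:
            v = q.popleft()
            order.append(v)
            for u in G[v]:
                if u not in seen:
                    seen.add(u)
                    q.append(u)
        return order

    def dfs_ordering(G, root):
        # stack discipline: always continue from the most recent vertex
        order, seen, stack = [], set(), [root]
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            order.append(v)
            stack.extend(u for u in G[v] if u not in seen)
        return order

    G = {1: {2, 3}, 2: {1, 4}, 3: {1}, 4: {2}}
    print(bfs_ordering(G, 1), dfs_ordering(G, 1))  # [1, 2, 3, 4] [1, 3, 2, 4]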
Definition 2.0.2 An ordering σ of V is a DFS-ordering if and only if the following holds: if a <σ b <σ c and ac ∈ E and ab ∉ E, then there exists a vertex d such that a <σ d <σ b and db ∈ E.

Lexicographic Breadth First Search (LexBFS) was introduced in the 1970s by Rose, Tarjan and Lueker [10] as a part of an algorithm for recognizing chordal graphs in linear time. Since then, it has been used in many graph algorithms, mainly for the recognition of various graph classes.

Definition 2.0.3 An ordering σ of V is a LexBFS ordering if and only if the following holds: if a <σ b <σ c and ac ∈ E and ab ∉ E, then there exists a vertex d such that d <σ a and db ∈ E and dc ∉ E.

LexBFS is a restricted version of Breadth First Search, where the usual queue of vertices is replaced by a queue of unordered subsets of the vertices, which is sometimes refined, but never reordered.

Lexicographic Depth First Search (LexDFS) was introduced in 2008 by Corneil and Krueger [5] and represents a special instance of a Depth First Search.

Definition 2.0.4 An ordering σ of V is a LexDFS ordering if and only if the following holds: if a <σ b <σ c and ac ∈ E and ab ∉ E, then there exists a vertex d such that a <σ d <σ b and db ∈ E and dc ∉ E.

Maximal Neighborhood Search (MNS), introduced in 2008 by Corneil and Krueger [5], is a common generalization of LexBFS, LexDFS, and MCS, and also of Maximal Label Search (see [2] for definition).

Definition 2.0.5 An ordering σ of V is an MNS ordering if and only if the following statement holds: if a <σ b <σ c and ac ∈ E and ab ∉ E, then there exists a vertex d with d <σ b and db ∈ E and dc ∉ E.

The MNS algorithm uses a set of integers as the label, and at every step of the iteration chooses the vertex with the maximal label under set inclusion.

Corneil [5] exposed an interesting structural aspect of graph searches: particular search methods can be seen as restrictions, or special instances, of some more general search methods. For six well-known graph search methods they present a depiction, similar to the one in Figure 1, showing how those methods are related under set inclusion. For example, every LexBFS ordering is at the same time an instance of a BFS and an MNS ordering of the same graph. Similarly, every LexDFS ordering is at the same time also an instance of MNS, and of DFS (see Figure 1). The reverse, however, is not true, and there exist orderings that are BFS and MNS, but not LexBFS, or that are DFS and MNS but not LexDFS.

[Figure 1: On the left: Hasse diagram showing how graph searches are refinements of one another (Generic Search above BFS, DFS and MNS; MNS above LexBFS, MCS and LexDFS). On the right is a summary of our results: Green pairs are equivalent on {P4, C4}-free graphs. Violet pairs are equivalent on {pan, diamond}-free graphs. Blue pairs are equivalent on {paw, diamond, P4, C4}-free graphs.]
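Such three-point conditions translate directly into brute-force tests. A sketch for Definition 2.0.1 is given below (hypothetical name; G is a dict of neighbor sets, sigma a list); the checks for Definitions 2.0.2 to 2.0.5 differ only in where d is required to lie and whether dc ∉ E is also demanded.

    from itertools import combinations

    def is_bfs_ordering(G, sigma):
        pos = {v: i for i, v in enumerate(sigma)}
        # combinations preserves the input order, so a <σ b <σ c holds
        for a, b, c in combinations(sigma, 3):
            if c in G[a] and b not in G[a]:
                if not any(pos[d] < pos[a] and b in G[d] for d in sigma):
                    return False
        return True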
3 Problem description and results

Since the connections in Figure 1 represent relations of inclusion, it is natural to ask under which conditions on a graph G the particular inclusion holds also in the converse direction. More formally, we say that two search methods are equivalent on a graph G if the sets of vertex orderings produced by both of them are the same. We say that two graph search methods are equivalent on a graph class G if they are equivalent on every member G ∈ G. Perhaps surprisingly, three different graph families suffice to describe the graph classes equivalent for the ten pairs of graph search methods that we consider. Those are described in Theorems 3.1 to 3.3 below, but first we need a few more definitions.

All the graphs considered in the paper are finite and connected. A k-pan is a graph consisting of a k-cycle, with a pendant vertex added to it. We say that a graph is pan-free if it does not contain a pan of any size as an induced subgraph. A 3-pan is also known as a paw graph.

Theorem 3.1 Let G be a connected graph. Then the following is equivalent:

A1. Graph G is {P4, C4, paw, diamond}-free.
A2. Every graph search of G is a DFS ordering of G.
A3. Every graph search of G is a BFS ordering of G.
A4. Any vertex-order of G is a BFS, if and only if it is a DFS.

Theorem 3.2 Let G be a connected graph. Then the following is equivalent:

B1. Graph G is {pan, diamond}-free.
B2. Every DFS ordering of G is a LexDFS ordering of G.
B3. Every BFS ordering of G is a LexBFS ordering of G.
B4. Every graph search of G is an MNS ordering of G.

Theorem 3.3 Let G be a connected graph. Then the following is equivalent:

C1. Graph G is {P4, C4}-free.
C2. Every MNS ordering of G is a LexDFS ordering of G.
C3. Every MNS ordering of G is a LexBFS ordering of G.

From Theorems 3.1 and 3.2 we can immediately derive similar results for two additional pairs of graph search methods.

Corollary 3.3.1 Let G be a connected graph. Then the following is equivalent:

A1. Graph G is {P4, C4, paw, diamond}-free.
A5. Every graph search of G is a LexDFS ordering of G.
A6. Every graph search of G is a LexBFS ordering of G.

4 Conclusion and further work

In this paper we consider the major graph search methods and study the graphs in which vertex-orders of one type coincide with vertex-orders of some other type. Interestingly, three different graph families suffice to describe the graph classes equivalent for the ten pairs of graph search methods that we consider, which provides an additional aspect of similarities between the studied search methods.

Among the natural graph search methods not yet considered in this setting would be the Maximum Cardinality Search (MCS), introduced in 1984 (for definition see Tarjan and Yannakakis [12]). As shown in Figure 1, every MCS is a special case of an MNS vertex-order. While it is easy to verify that {P4, C4, paw, diamond}-free graphs do not distinguish between MNS and MCS vertex orders, Figure 2 provides examples of graphs which admit MNS, but not MCS vertex orders. Characterising graphs equivalent for MNS and MCS remains an open question.

[Figure 2: Graphs and corresponding orderings that are MNS and not MCS orderings. Seven small graphs on vertices a to e are shown, with orderings σ = (b, c, d, a, e), σ = (e, b, a, d, c), σ = (d, b, e, a, c), σ = (a, c, e, d, b), σ = (a, d, c, e, b), σ = (d, c, b, e, a) and σ = (c, a, d, e, b).]

Acknowledgements

The authors would like to thank prof. Martin Milanič for the initial suggestion of the problem, and Ekki Köhler and his research group, for introducing the diverse world of graph searches to us.

References

[1] Beisegel, J. Characterising AT-free graphs with BFS. In Graph-Theoretic Concepts in Computer Science (2018), A. Brandstädt, E. Köhler, and K. Meer, Eds., pp. 15–26.
[2] Berry, A., Krueger, R., and Simonet, G. Maximal label search algorithms to compute perfect and minimal elimination orderings. SIAM Journal on Discrete Mathematics 23, 1 (2009), 428–446.
[3] Brandstädt, A., Dragan, F. F., and Nicolai, F. LexBFS-orderings and powers of chordal graphs. Discrete Math. 171, 1-3 (1997), 27–42.
[4] Corneil, D. G., Dusart, J., Habib, M., and Kohler, E. On the power of graph searching for cocomparability graphs. SIAM Journal on Discrete Mathematics 30, 1 (2016), 569–591.
[5] Corneil, D. G., and Krueger, R. M. A unified view of graph searching. SIAM Journal on Discrete Mathematics 22, 4 (2008), 1259–1276.
[6] Even, S. Graph Algorithms. Cambridge University Press, 2011.
[7] Golumbic, M.
Algorithmic Graph Theory and Perfect Graphs. Annals of Discrete Mathematics, Volume 57. Elsevier, 2004, pp. 98–99.
[8] Köhler, E., and Mouatadid, L. Linear time LexDFS on cocomparability graphs. In Scandinavian Workshop on Algorithm Theory (2014), Springer, pp. 319–330.
[9] Moore, E. F. The shortest path through a maze. In Proc. Int. Symp. Switching Theory, 1959 (1959), pp. 285–292.
[10] Rose, D. J., Lueker, G. S., and Tarjan, R. E. Algorithmic aspects of vertex elimination on graphs. SIAM Journal on Computing 5, 2 (1976), 266–283.
[11] Tarjan, R. E. Depth-first search and linear graph algorithms. SIAM Journal on Computing 1, 2 (1972), 146–160.
[12] Tarjan, R. E., and Yannakakis, M. Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs. SIAM Journal on Computing 13, 3 (1984), 566–579.

System for Remote Collaborative Embedded Development

Martin Domajnko, Nikola Glavina, Aljaž Žel* (University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia)
martin.domajnko@student.um.si, nikola.glavina@student.um.si, aljaz.zel@student.um.si
DOI: https://doi.org/10.18690/978-961-286-516-0.4
* Listed in alphabetical order.

Abstract
This paper explores the challenges and devised solutions for embedded development which arose during the COVID-19 pandemic. While software development, nowadays with modern tools and services such as git, virtual machines and communication suites, is relatively unaffected by resource location, that is not the case for firmware and embedded systems, which rely on physical hardware for design, development, and testing. To overcome the limitations of remote work and obstructed access to actual hardware, two ideas were implemented and tested. The first is based on integrated circuit emulation, using QEMU to emulate an ARM core and custom software to facilitate communication with the embedded system. The second is remote programming and debugging over the internet, with a dedicated computer system acting as a middle man between a development environment and the physical hardware, using the OpenOCD debugger.

Keywords: embedded development, remote development, OpenOCD, QEMU, ARM, ARM semihosting
1 Introduction

During the design, development, and testing phases of our Passive Floating Probe project [7], we relied heavily on physical access to the hardware. The restrictions that were put in place during the COVID-19 pandemic did not stop the development process of our project, but they limited it heavily. The lack of working hardware for every member of the team was our major issue, so we tried to think up solutions to overcome that problem. The first idea, which would allow our work on the project to continue, was based on the QEMU [8] machine emulator. We emulated our integrated circuit with an ARM core processor and added a custom software layer, which served as an emulation of the real-world communication pipeline with the system. The solution proved useful, albeit with the restriction that it allowed only local development, prompting us to develop our second idea. Remote programming and debugging of our hardware over the internet was the main part of that solution. This was achieved with a dedicated computer system acting as a middle man between our development environment and the targeted hardware, enabling programming and debugging.

Finally, we present the structure of this paper. In the second section we explain in detail the implementations of both solutions. In section three we discuss the limitations of the implemented solutions and the lessons we learned. The paper concludes in section four with possible improvements.

2 Implementation of remote embedded development

For the work on the project to continue under COVID-19 restrictions, a solution had to be devised to solve the problems that came with remote work on a project dependent on specific hardware. This included hardware development, maintenance, and the ability to develop and test software on the hardware remotely. During the academic year, two solutions were devised and implemented.

2.1 Solution using emulator

The first solution was based on microcontroller emulation, as seen in Figure 1, using a specialized version of the QEMU emulation software called xPack [14], version 2.8.0-9. This enabled us to emulate a variant of the STM32 family of microcontrollers. The variant chosen was the STM32F407-Discovery development board, since it was closest to our target hardware, and we had access to a matching development board on which to test the differences. Using the emulator proved to be a great benefit: not all members had access to real hardware at that time, but everyone could set up the emulator on their computer. Since we did not have previous experience working with the STM family of microcontrollers, the emulator also allowed us to focus on platform-specific software issues without the added complexity of hardware issues for novice programmers. However, this also came at a cost. First, the emulator did not work out of the box. Second, no sensors can be connected to the board directly, since the interfaces in the emulator are virtual.

Figure 1: Schematic of the development system using emulation.

Before the emulator could be used with the desired family, a minor bug inside the emulator source code, which was causing a ROM memory overlap, needed to be patched manually.
After the initial setup, we established a basic form of IO system using a script that allowed us to run the emulator and automatically redirect the standard output, using UNIX pipes, to a file where we could monitor the output.

The emulator approach enabled us to test the version of the real-time operating system FreeRTOS [2] for STM32 microcontrollers. Some modifications to the compiler flags were required, especially the soft floating-point unit, because the emulator was unable to emulate a real FPU. While unable to connect sensors to the virtual environment, we mostly focused on creating a task manager, which took care of the correct operation of, and communication between, the individual tasks or sensors. Tasks that were supposed to be related to the sensors were temporarily replaced with empty functions returning predefined values for testing.

To enable outside communication with our embedded software, the semihosting feature of the ARM architecture was used [1]. A custom communication service was implemented in which standard input and output were redirected to netcat running locally, creating a network interface. After successful compilation of the program, the emulator could be run with the compiled ELF binary, redirecting the semihosting IO to stdio. An example in our case:

    < /dev/null nc -q -1 -l 5000 | qemu-system-gnuarmeclipse --verbose --verbose --board STM32F4-Discovery --mcu STM32F407VG -d unimp,guest_errors --image STM32F407-Discovery-blinky.elf --semihosting-config enable=on,target=native | nc -l 6000 > /dev/null

This allowed us to emulate a serial connection to and from the emulated microcontroller over a network connection.

2.2 Solution using remote development

To amend the shortcomings of development on the emulator, a remote development solution was devised and implemented. The solution, as seen in Figure 2, is based on a dedicated computer system leveraging the remote connection functionality of GDB, the "GNU Project Debugger" [4], for remote programming and real-time debugging.

Figure 2: Schematic of the remote development system.

The system was built using a Raspberry Pi 3 Model B [9], running the ARM version of Ubuntu 20.04 LTS [13]. The selected single-board computer handled the network communications, the attached peripherals and the services needed to facilitate remote programming and real-time debugging. For embedded development, an ST-LINK in-circuit debugger and programmer [11], in our case the STM32F407G-DISC1 development board [10] providing ST-LINK/V2-A, was connected via USB to the Raspberry Pi. To connect the GDB debugging functionality and the ST-LINK programming functionality with the target integrated circuit, the OpenOCD software [6], version 0.10.0, was used.

To prepare the setup, two configuration files in OpenOCD format needed to be created: the first defining the "hla_serial" value, describing the serial number of the connected ST-LINK device; the second defining the "-event gdb-detach" behavior as "resume". Defining this permitted the embedded program to keep running even after the debug session disconnected, allowing functionality to be tested over longer periods of time without a constant connection to the host machine.

With the prepared configuration files, a script was created to start the OpenOCD session with the correct parameters for the ST-LINK device, the target device, the network settings and the created configuration files. An example in our case:

    openocd -c "bindto $HOSTNAME" -c "gdb_port 3333" -c "tcl_port disabled" -c "telnet_port disabled" -f /usr/share/openocd/scripts/interface/stlink-v2.cfg -c "adapter_khz 480" -c "transport select hla_swd" -f /usr/share/openocd/scripts/target/stm32l4x.cfg -f ./gdb_resume.cfg -f ./serial.cfg >> log.txt
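The contents of the two configuration files are not reproduced in the paper. A minimal sketch of what they could look like, using standard OpenOCD commands and a placeholder serial number, is:

    # serial.cfg: pin this OpenOCD session to one specific ST-LINK probe
    hla_serial "<st-link-serial-number>"

    # gdb_resume.cfg: keep the firmware running after the GDB session detaches
    $_TARGETNAME configure -event gdb-detach { resume }

Running one OpenOCD instance per probe, each with its own serial file and gdb_port, is what enables the simultaneous use of multiple ST-LINK devices described below.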
The script was started inside a tmux [12] instance. This allowed the OpenOCD session to run without an active user connection to the shell executing the script, or, alternately, for multiple users to be connected to the same shell instance to observe debug print messages. Once the system was set up, multiple ST-LINK devices could be connected and used simultaneously by adding additional configuration files with serial numbers and starting OpenOCD sessions on different network ports.

3 Usage and lessons learned

The development process was arranged so that the required hardware development and maintenance were in the domain of our mentor and kept at the university. Team members could make requests for hardware modifications and then develop and debug the project software remotely, while concurrent remote software collaboration was achieved through the distributed revision control system Gitea [3], used as the source code repository and management tool.

3.1 Emulator

The emulator provided a good start into getting acquainted with embedded development. It allowed us to learn and develop embedded software without physical hardware. The downside was the inability to work with real sensors, which later made us switch to a remote system. Another problem was that the emulator was not capable of running functions that required precise timing, such as real-time code or interrupt execution.

3.2 Remote

The advantage of the remote system was that it made it possible for us to connect to the targeted hardware from anywhere, as if it were accessible locally. This included real-time debugging, with the ability to see the microcontroller processor states and memory values. The problem with this solution was the restriction of a single connection to the OpenOCD instance, which limited the work on the hardware to a single developer at a time. This was resolved with communication and access scheduling.
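For illustration, a client-side debugging session against such a setup could look as follows; this is a sketch rather than the authors' exact workflow, and the host name is a placeholder matching the "bindto" and "gdb_port 3333" settings above:

    arm-none-eabi-gdb firmware.elf
    (gdb) target extended-remote <raspberry-pi-host>:3333
    (gdb) monitor reset halt
    (gdb) load                  # program the target flash through ST-LINK
    (gdb) break main
    (gdb) continue

Detaching from this session leaves the firmware running, thanks to the gdb-detach resume event configured earlier.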
4 Conclusions

The system was sufficient to allow our work on the project to continue. During development, we were fortunate enough not to have major problems with security. The system, as it was used, would allow anyone to access it if they identified the network ports and protocols used. This was not a major concern, as it only allowed programming our particular microcontroller with a dedicated firmware; as such, this problem was not addressed during production. Possible additional security was tested, with the implementation of username and password authentication using the NGINX reverse proxy server [5] and httpd access restrictions on the system URL.

Acknowledgment

The authors acknowledge the financial support of the Institute of Computer Science of the Faculty of Electrical Engineering and Computer Science, and would like to thank mag. Jernej Kranjec for his guidance and assistance. The authors would also like to acknowledge the remaining members of the project group, namely Tilen Koren, Anna Sidorova and Viktorija Stevanoska, for their work on the project.

References

[1] ARM. Arm target input/output facilities. https://developer.arm.com/documentation/dui0471/g/Bgbjjgij, 2021. Accessed: 2021-07-30.
[2] FreeRTOS. Real time operating system for microcontrollers. https://www.freertos.org/, 2021. Accessed: 2021-07-30.
[3] Gitea. Lightweight code hosting solution. https://gitea.io, 2021. Accessed: 2021-07-30.
[4] GNU. The GNU Project Debugger. https://www.gnu.org/software/gdb/, 2021. Accessed: 2021-07-30.
[5] NGINX. Reverse proxy. https://www.nginx.com/, 2021. Accessed: 2021-07-30.
[6] OpenOCD. Open on-chip debugger. http://openocd.org/, 2021. Accessed: 2021-07-30.
[7] Perrone, M., Knupleš, U., Žalik, M., Keršič, V., and Šinko, T. Passive floating probe. In StuCoSReC: Proceedings of the 2019 6th Student Computer Science Research Conference (2019), pp. 13–17.
[8] QEMU. The FAST! processor emulator. https://www.qemu.org/, 2020. Accessed: 2020-04-30.
[9] Raspberry Pi. Raspberry Pi 3 Model B. https://www.raspberrypi.org/products/raspberry-pi-3-model-b/, 2021. Accessed: 2021-07-30.
[10] STM. Discovery kit with STM32F407VG MCU. https://www.st.com/en/evaluation-tools/stm32f4discovery.html, 2021. Accessed: 2021-07-30.
[11] STM. ST-LINK/V2 in-circuit debugger/programmer. https://www.st.com/en/development-tools/st-link-v2.html, 2021. Accessed: 2021-07-30.
[12] tmux. Terminal multiplexer. https://github.com/tmux/tmux/wiki, 2021. Accessed: 2021-07-30.
[13] Ubuntu. Ubuntu OS. https://ubuntu.com/download/raspberry-pi, 2021. Accessed: 2021-07-30.
[14] xPack. The xPack QEMU Arm. https://xpack.github.io/qemu-arm/, 2020. Accessed: 2020-04-30.

Leaf Segmentation of Rosette Plants using Rough K-Means in CIELab Color Space

Arunita Das (first author), Daipayan Ghosal, Krishna Gopal Dhal (corresponding author)
Dept. of Computer Science and Application, Midnapore College (Autonomous), Paschim Medinipur, West Bengal, India
arunita17@gmail.com, daipayan.ghosal@gmail.com, krishnagopal.dhal@midnaporecollege.ac.in
DOI https://doi.org/10.18690/978-961-286-516-0.5

Abstract. Segmentation of plant images plays an important role in modern agriculture, where it can provide accurate analysis of a plant's growth and possible anomalies. In this paper, a rough set based partitional clustering technique called Rough K-Means has been utilized in CIELab color space for the proper leaf segmentation of rosette plants. The efficacy of the proposed technique has been analysed by comparing it with the results of the traditional K-Means and Fuzzy C-Means clustering algorithms. The visual and numerical results reveal that RKM in CIELab provides the result nearest to the ideal ground truth, and is hence the most efficient one.

Keywords: image segmentation, color space, rough set, partitional clustering

1 Introduction

Leaf count and leaf region (i.e., plant area) are the key plant phenotyping qualities used to examine plant development and advancement [19] [25] [28], blossoming time [13], and yield potential. The leaf count can be approached in different manners from the AI viewpoint [1]; one such route is to count the number of leaves from the segmented plant area. Several image segmentation techniques have been reported in the literature for leaf segmentation. For example, Maximal Similarity-based Region Merging (MSRM) [18] is an interactive segmentation approach which utilizes a region merging framework over a super-pixel division. gPb-owt-ucm [2] is another segmentation method, dependent on spectral clustering and contour detection. The IPK technique [20] utilized a 3D histogram of the L*a*b* tone space of the plant images for supervised segmentation of foreground and background. Vukadinovic and Polder employed neural networks combined with watershed for leaf segmentation [24]. A collation study among several leaf segmentation algorithms has been presented in [27]. Well-known clustering techniques like KM, FCM, Self-Organizing Map (SOM), and Particle Swarm Optimization (PSO) were also applied for leaf segmentation in [11], where SOM outperformed the other tested methods visually and numerically. Beyond the above techniques, deep learning has also been utilized for leaf segmentation [15]; however, deep learning needs a large dataset in order to produce good results.

Therefore, it can be seen from the above discussion that different clustering strategies provide promising segmentation results. Although classical KM, FCM, SOM, and PSO have been utilized for leaf segmentation, rough set based K-Means (RKM) clustering has not been used in this area, to the best of the authors' knowledge. Rough k-means was developed by Lingras et al.
[16], and a refined version was proposed by Peters [21]; it performs in the image clustering domain as a similarity-based clustering model like KM and FCM [14] [10] [12] [22]. RKM has been applied efficiently for the proper segmentation of the tumor region in brain MRI images [14] [10], for white blood cell segmentation [12], and for satellite image segmentation [22]. As a consequence, the main contribution of this paper is the utilization of RKM in CIELab and its application to leaf segmentation. The proposed methodology, represented in Figure 1, has been compared with classical KM and FCM. Experimental results show the supremacy of the proposed approach over the other tested techniques.

The rest of the paper is organized as follows: Section 2 discusses the proposed methodology, Section 3 describes the experimental results, and the paper is concluded in Section 4.

2 Proposed Methodology

Clustering is a procedure of grouping a bunch of data into clusters that have superior intra-cluster and inferior inter-cluster resemblance. Two rudimentary types of image clustering practice are hard clustering and soft clustering. In hard clustering, one pixel can be the adherent of only one cluster; the proper example of this is K-means [5] [9] [7]. In contrast, soft clustering uses a fractional membership, which makes it more practicable for real-world usage: one pixel can be a fragment of several clusters, with some degree of belongingness described by the fractional membership. Fuzzy C-means, projected by Bezdek [3] [6], is a specimen of this mechanism. FCM is better than a hard clustering technique like K-means because it has more ability to handle the ambiguity of gray levels. In some cases, however, the fuzzy degree of membership may be too descriptive for interpreting clustering results. Therefore, researchers have applied rough set theory to k-means and developed the rough k-means [16] [21] clustering algorithm, which manages equidistant data objects, or overlapping clusters, using the upper and lower approximations of each cluster. Rough set-based clustering provides a solution that is less restrictive than conventional clustering and less descriptive (specific) than fuzzy clustering. In this study, Rough K-Means has been utilized to segment the leaf images.

Due to the non-uniform illumination of regions, the segmentation algorithm's performance is influenced by the color space used. According to the literature, perceptually uniform color spaces such as L*a*b* or L*u*v* achieve much better segmentation results than non-uniform color spaces such as RGB [23], which was developed for better color representation. As a first stage in our approach, we used MATLAB to transform all RGB images to the CIE L*a*b* color space, which yielded three components: L*, a*, and b*. "L*" denotes lightness, while "a*" and "b*" denote colors, "a*" denoting red-green and "b*" denoting blue-yellow, respectively. The flowchart of the proposed work is presented in Figure 1, and the brief mathematical implementation of RKM is described in Section 2.1.

Figure 1: Flowchart of the proposed methodology: input rosette plant RGB image, convert RGB to CIELab color space, find the cluster centers using Rough K-Means, reform the segmented image in RGB color space, output the leaf segmented region.

2.1 Rough K-Means (RKM)

Suppose a hypothetical clustering scheme is defined as Eq. (1), which partitions U depending on the equivalence relation P:

    U/P = {C1, C2, ..., Ck}   (1)

Again, assume that it may not be possible to describe the sets Ci, 1 ≤ i ≤ k, accurately, due to inadequate knowledge in the partition, but that it is possible to define each set Ci ∈ U/P using its lower approximation A(Ci) and upper approximation Ā(Ci). Let v and ci be the vector representations of a data object and of the cluster Ci, respectively.
Upper and lower approximations of only a few subsets of U are considered; hence it is not possible to verify all the properties of rough sets [16] [21]. However, the upper and lower estimates of Ci ∈ U/P are required to follow some of the basic rough set properties, which are as follows:

P1: A data object v can be a member of at most one lower approximation A(ci).
P2: If a data object v is part of the lower approximation A(ci), then it is also part of the upper approximation Ā(ci), i.e., v ∈ A(ci) implies v ∈ Ā(ci).
P3: If a data object v does not belong to any lower approximation A(ci), then it belongs to two or more upper approximations Ā(ci).

In rough k-means, the lower and upper approximations of the clusters are computed by the following rules. Let v be a data object and d(v, zi) the distance between v and zi, the centroid of cluster ci. Let d(v, zi) = min_{1≤j≤k} d(v, zj), and

    T = { j : d(v, zj) / d(v, zi) ≤ th and i ≠ j }   (2)

where th is a threshold value specified by the user. In order to classify a data object into the correct approximation(s), the following classification criteria are used:

R1: If the set T is not empty, the data object is classified into the upper approximations of both clusters i and j: if T ≠ ∅ then v ∈ Ā(ci) and v ∈ Ā(cj) for all j ∈ T.
R2: If T is empty, the data object is classified into the lower approximation of cluster i; by property P2 it is then also classified into the upper approximation of cluster i: if T = ∅ then v ∈ A(ci) and v ∈ Ā(ci).

Depending on the above deliberations, the steps of rough k-means are represented as Algorithm 1.

Algorithm 1: Procedure of Rough K-Means (RKM)
1  For each cluster and data object, find the distance d and the threshold set T.
2  Classify the data object into the lower and upper estimates using the classification criteria R1 and R2.
3  Calculate the new cluster center (mean) zi by the following expressions:
   if A(ci) ≠ ∅ and Ā(ci) − A(ci) = ∅ then zi = Σ_{v∈A(ci)} v / |A(ci)|;
   else if A(ci) = ∅ and Ā(ci) − A(ci) ≠ ∅ then zi = Σ_{v∈Ā(ci)−A(ci)} v / |Ā(ci) − A(ci)|;
   else zi = wl · Σ_{v∈A(ci)} v / |A(ci)| + wu · Σ_{v∈Ā(ci)−A(ci)} v / |Ā(ci) − A(ci)|,
   where wl + wu = 1 and usually wl > wu; the parameters wl and wu correspond to the relative importance of the lower and upper approximations, respectively.
4  If the algorithm converges, stop; otherwise, repeat steps 1 to 3.
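As an illustration only, one iteration of Algorithm 1 can be sketched in Python as follows. This is not the authors' MATLAB code: the Euclidean distance, the array layout and all names are our assumptions, and the threshold test follows Eq. (2) as printed.

    import numpy as np

    def rkm_iteration(X, Z, th=0.7, wl=0.6, wu=0.4):
        # X: (N, d) float data objects; Z: (k, d) float current centroids
        D = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)  # pairwise distances
        nearest = D.argmin(axis=1)
        k = len(Z)
        lower = [[] for _ in range(k)]
        boundary = [[] for _ in range(k)]     # boundary = upper minus lower
        for p, i in enumerate(nearest):
            T = [j for j in range(k) if j != i and D[p, j] / D[p, i] <= th]  # Eq. (2)
            if T:                             # rule R1: ambiguous object
                boundary[i].append(p)
                for j in T:
                    boundary[j].append(p)
            else:                             # rule R2: certain object
                lower[i].append(p)
        Znew = Z.copy()                       # empty clusters keep their centroid
        for i in range(k):
            if lower[i] and not boundary[i]:
                Znew[i] = X[lower[i]].mean(axis=0)
            elif boundary[i] and not lower[i]:
                Znew[i] = X[boundary[i]].mean(axis=0)
            elif lower[i] and boundary[i]:    # weighted combination, wl + wu = 1
                Znew[i] = wl * X[lower[i]].mean(axis=0) + wu * X[boundary[i]].mean(axis=0)
        return Znew

Iterating rkm_iteration until the centroids stop changing, the convergence test of step 4, completes the procedure.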
3 Experimental Results

The experiment was performed over 30 plant images, using Matlab R2018b on a Windows 10, x64-based PC with an Intel Core i5 CPU and 8 GB RAM. The plant images were collected from [17]. The parameter settings of the utilized clustering techniques are as follows. The number of cluster prototypes depends on the user, and is taken as 2 for all clustering techniques. For FCM, the fuzzification parameter is taken as 2, and the algorithm stops if the maximum difference between two successive partition matrices U is less than the minimal error threshold η; mathematically, if max[U^t − U^(t+1)] < η then stop, where η = 10^(−5). For KM and RKM, the procedure stops if the change in the centroid values is smaller than η. The rough set parameters for classical RKM are th = 0.7, wl = 0.6, and wu = 0.4. Threshold (th) selection in RKM is tough for different images; we experimented within the range 0 < th ≤ 10 and optimally set it to 0.7. The performance of the utilized clustering techniques was evaluated by calculating four ground-truth based performance evaluation parameters, namely accuracy, Dice, Jaccard, and the Matthews correlation coefficient (MCC), which are summarized in Table 1 [6] [26]. Here, TP = true positive, FP = false positive, TN = true negative, FN = false negative.

Table 1: Performance parameters considered for evaluation of the clustering methods.

1. Accuracy (AC): AC = (TP + TN) / (FN + FP + TP + TN). Accuracy is one metric for evaluating classification models; we calculate it to know how well the model predicts.
2. Dice Index (DI): DI = 2·TP / (2·TP + FP + FN). It combines the precision and recall concepts from information retrieval: it is the harmonic mean of precision and recall. DI values lie within the interval [0, 1], and a larger value indicates higher clustering quality.
3. Jaccard Index (JI): JI = DI / (2 − DI). The Jaccard similarity index measures the overlap between two sets, defined as the size of the intersection of the two sets divided by the size of their union. A higher value indicates more similarity between the two objects.
4. Matthews correlation coefficient (MCC): MCC = (TP·TN − FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)). MCC is a more reliable statistical rate, which produces a high score only if the prediction obtains good results in all four categories (TP, TN, FP, FN), proportionally both to the size of the positive elements and the size of the negative elements in the dataset. A higher value indicates better results.
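For reference, the four parameters of Table 1 can be computed from a predicted binary mask and its ground truth in a few lines. This sketch is ours, not the authors' code, and assumes boolean NumPy arrays:

    import numpy as np

    def quality_scores(pred, gt):
        # pred, gt: boolean arrays of the same shape (segmentation, ground truth)
        tp = float(np.sum(pred & gt))
        tn = float(np.sum(~pred & ~gt))
        fp = float(np.sum(pred & ~gt))
        fn = float(np.sum(~pred & gt))
        ac = (tp + tn) / (fn + fp + tp + tn)                  # Accuracy
        di = 2 * tp / (2 * tp + fp + fn)                      # Dice Index
        ji = di / (2 - di)                                    # Jaccard Index
        mcc = (tp * tn - fp * fn) / np.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))    # Matthews corr. coeff.
        return ac, di, ji, mcc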
Figure 2: Color segmentation results of the clustering techniques (KM, FCM, RKM) over five sample images.

The three clustering algorithms, i.e., KM, FCM, and RKM, were utilized to segment the leaves of the rosette plants. Figures 2 and 3 present the original color plant images, the ground truth leaf segmentations provided by the experts, and the binary segmented leaf parts produced by the employed clustering algorithms. Figures 2 and 3 clearly show that RKM provides the best leaf-based segmentation results.

Figure 3: Ground truth images and segmentation results of the clustering techniques in binary format over five sample images.

Beyond the visual analysis, the segmentation efficacy of the clustering algorithms was analyzed by computing the four segmentation quality parameters: the values for five plant samples are presented in Table 2, and the average values over the 30 images are given in Table 3. The best numerical values in Tables 2 and 3 are given in bold. Most of the values of the quality parameters clearly reveal that RKM provides superior outcomes to the other tested clustering algorithms. A graphical representation of the average quality parameters (recorded in Table 3) is also shown in Figure 4. The average execution times of the clustering algorithms over the 30 images are also presented in Table 3: KM needs the least computational effort, while RKM takes the largest execution time in the same environment.

Table 2: Numerical values of the segmentation quality parameters over five sample images (Accuracy / Dice / Jaccard / MCC).

Sample 1: KM 0.9756 / 0.9651 / 0.9325 / 0.9463; FCM 0.9743 / 0.9633 / 0.9292 / 0.9435; RKM 0.9851 / 0.9791 / 0.9590 / 0.9677
Sample 2: KM 0.9786 / 0.9720 / 0.9455 / 0.9548; FCM 0.9819 / 0.9762 / 0.9536 / 0.9617; RKM 0.9845 / 0.9798 / 0.9603 / 0.9675
Sample 3: KM 0.9730 / 0.9555 / 0.9147 / 0.9364; FCM 0.9761 / 0.9607 / 0.9244 / 0.9440; RKM 0.9773 / 0.9630 / 0.9287 / 0.9476
Sample 4: KM 0.9714 / 0.9541 / 0.9122 / 0.9336; FCM 0.9720 / 0.9552 / 0.9143 / 0.9350; RKM 0.9880 / 0.9810 / 0.9627 / 0.9723
Sample 5: KM 0.9738 / 0.9659 / 0.9341 / 0.9453; FCM 0.9793 / 0.9729 / 0.9473 / 0.9565; RKM 0.9838 / 0.9788 / 0.9585 / 0.9661

Table 3: Average numerical values of the segmentation quality parameters and execution time over 30 images.

KM:  Accuracy 0.9638, Dice 0.9470, Jaccard 0.9001, MCC 0.9200, Time 3.78 s
FCM: Accuracy 0.9593, Dice 0.9417, Jaccard 0.8909, MCC 0.9112, Time 4.56 s
RKM: Accuracy 0.9662, Dice 0.9531, Jaccard 0.9124, MCC 0.9293, Time 8.26 s

Figure 4: Graphical analysis of the average quality parameters for the clustering techniques.
4 Conclusion

This paper presented a Rough K-Means (RKM) based clustering algorithm in CIELab color space for leaf image segmentation of rosette plants. The proposed RKM based technique was compared against two well-known conventional clustering algorithms, namely K-Means and Fuzzy C-Means, over a dataset of 30 images. The experimental results reveal that the RKM based clustering algorithm outperforms the others and delivers the best outcomes in both the visual and the numerical analysis for the utilized segmentation parameters. The three main limitations of the proposed method are noise sensitivity, local optima trapping and large computational time. If researched further, it could become possible to analyze the plant's growth, or to detect visually identifiable signs of disease or damage, or any other possible visual anomaly. These results encourage further research into the improvement of RKM for image segmentation, such as the incorporation of nature-inspired optimization algorithms to overcome the local optima trapping problem [8] [4].

References

[1] Aich, S., and Stavness, I. Leaf counting with deep convolutional and deconvolutional networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops (2017), pp. 2080–2089.
[2] Arbelaez, P., Maire, M., Fowlkes, C., and Malik, J. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 5 (2010), 898–916.
[3] Bezdek, J. C., Ehrlich, R., and Full, W. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10, 2–3 (1984), 191–203.
[4] Dhal, K. G., Das, A., Gálvez, J., Ray, S., and Das, S. An overview on nature-inspired optimization algorithms and their possible application in image processing domain. Pattern Recognition and Image Analysis 30, 4 (2020), 614–631.
[5] Dhal, K. G., Das, A., Ray, S., and Das, S. A clustering based classification approach based on modified cuckoo search algorithm. Pattern Recognition and Image Analysis 29, 3 (2019), 344–359.
[6] Dhal, K. G., Das, A., Ray, S., and Gálvez, J. Randomly attracted rough firefly algorithm for histogram based fuzzy image clustering. Knowledge-Based Systems 216 (2021), 106814.
[7] Dhal, K. G., Fister Jr., I., Das, A., Ray, S., and Das, S. Breast histopathology image clustering using cuckoo search algorithm. In Proceedings of the 5th Student Computer Science Research Conference (2018), pp. 47–54.
[8] Dhal, K. G., Fister Jr., I., and Das, S. Parameterless harmony search for image multi-thresholding. In 4th Student Computer Science Research Conference (StuCoSReC 2017) (2017), pp. 5–12.
[9] Dhal, K. G., Gálvez, J., Ray, S., Das, A., and Das, S. Acute lymphoblastic leukemia image segmentation driven by stochastic fractal search. Multimedia Tools and Applications (2020), 1–29.
[10] Dobe, O., Sarkar, A., and Halder, A. Rough k-means and morphological operation-based brain tumor extraction. In Integrated Intelligent Computing, Communication and Security. Springer, 2019, pp. 661–667.
[11] Ghosal, D., Das, A., and Dhal, K. G. A comparative study among clustering techniques for leaf segmentation in rosette plants. Pattern Recognition and Image Analysis 31, 4 (2021).
[12] Inbarani H., H., Azar, A. T., et al. Leukemia image segmentation using a hybrid histogram-based soft covering rough k-means clustering algorithm. Electronics 9, 1 (2020), 188.
[13] Koornneef, M., Hanhart, C., van Loenen-Martinet, P., and Blankestijn-de Vries, H. The effect of daylength on the transition to flowering in phytochrome-deficient, late-flowering and double mutants of Arabidopsis thaliana. Physiologia Plantarum 95, 2 (1995), 260–266.
[14] Kumar, D. M., Satyanarayana, D., and Prasad, M. G. An improved Gabor wavelet transform and rough k-means clustering algorithm for MRI brain tumor image segmentation. Multimedia Tools and Applications 80, 5 (2021), 6939–6957.
[15] Kumar, J. P., and Domnic, S. Rosette plant segmentation with leaf count using orthogonal transform and deep convolutional neural network. Machine Vision and Applications 31, 1 (2020), 1–14.
[16] Lingras, P., and West, C. Interval set clustering of web users with rough k-means. Journal of Intelligent Information Systems 23, 1 (2004), 5–16.
[17] Minervini, M., Fischbach, A., Scharr, H., and Tsaftaris, S. A. Finely-grained annotated datasets for image-based plant phenotyping. Pattern Recognition Letters 81 (2016), 80–89.
[18] Ning, J., Zhang, L., Zhang, D., and Wu, C. Interactive image segmentation by maximal similarity based region merging. Pattern Recognition 43, 2 (2010), 445–456.
[19] Orlando, F., Napoli, M., Dalla Marta, A., Natali, F., Mancini, M., Zanchi, C., and Orlandini, S. Growth and development responses of tobacco (Nicotiana tabacum L.) to changes in physical and hydrological soil properties due to minimum tillage.
[20] Pape, J.-M., and Klukas, C. 3-D histogram-based segmentation and leaf detection for rosette plants. In European Conference on Computer Vision (2014), Springer, pp. 61–74.
[21] Peters, G. Some refinements of rough k-means clustering. Pattern Recognition 39, 8 (2006), 1481–1491.
[22] Raj, A., and Minz, S. Spatial rough k-means algorithm for unsupervised multi-spectral classification. In International Conference on Information and Communication Technology for Intelligent Systems (2020), Springer, pp. 215–226.
[23] Sarkar, S., Das, S., and Chaudhuri, S. S. A multilevel color image thresholding scheme based on minimum cross entropy and differential evolution. Pattern Recognition Letters 54 (2015), 27–35.
[24] Scharr, H., Minervini, M., French, A. P., Klukas, C., Kramer, D. M., Liu, X., Luengo, I., Pape, J.-M., Polder, G., Vukadinovic, D., et al. Leaf segmentation in plant phenotyping: a collation study. Machine Vision and Applications 27, 4 (2016), 585–606.
[25] Telfer, A., Bollman, K. M., and Poethig, R. S. Phase change and the regulation of trichome distribution in Arabidopsis thaliana. Development 124, 3 (1997), 645–654.
[26] Thanh, D. N., Prasath, V. S., Hien, N. N., et al. Melanoma skin cancer detection method based on adaptive principal curvature, colour normalisation and feature extraction with the ABCD rule. Journal of Digital Imaging (2019), 1–12.
[27] Vukadinovic, D., and Polder, G. Watershed and supervised classification based fully automated method for separate leaf segmentation. In The Netherlands Congress on Computer Vision (2015), pp. 1–2.
[28] Walter, A., and Schurr, U. The modular character of growth in Nicotiana tabacum plants under steady-state nutrition. Journal of Experimental Botany 50, 336 (1999), 1169–1177.

Adversarial Image Perturbation with a Genetic Algorithm

Rok Kukovec, Špela Pečnik, Iztok Fister Jr., Sašo Karakatič
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
rok.kukovec@student.um.si, spela.pecnik@um.si, iztok.fister1@um.si, saso.karakatic@um.si
DOI https://doi.org/10.18690/978-961-286-516-0.6

Abstract. The quality of image recognition with neural network models relies heavily on filters and parameters optimized through the training process. These filters are different compared to how humans see and recognize objects around them. The difference between machine and human recognition yields a noticeable gap, which is prone to exploitation. The workings of these algorithms can be compromised with adversarial perturbations of images, where images are modified seemingly imperceptibly, such that humans see little to no difference, but the neural network classifies the motif incorrectly. This paper explores adversarial image modification with an evolutionary algorithm, such that the AlexNet convolutional neural network cannot recognize previously clear motifs, while the human perceptibility of the image is preserved. The experiment was implemented in Python and tested on the ILSVRC dataset. Original images and their recreated counterparts were compared and contrasted using visual assessment and statistical metrics. The findings suggest that the human eye, without prior knowledge, will hardly spot the difference compared to the original images.

Keywords: adversarial perturbation, AlexNet, CNN, computer vision, evolutionary algorithms

1 Introduction

Computer vision algorithms are already used widely in everyday applications, but safety concerns persist regarding their reliability. Leaving vital decisions to them can cause dire consequences in cases of error; therefore, additional caution is necessary in most use cases. Such algorithms have to be tested extensively before they are allowed to make such decisions on their own.

Deep neural networks are currently the state-of-the-art technology for recognizing motifs in an image. Computer vision achieves near-human-level accuracy in recognition, and the question arises of the key differences between human and computer vision.
Recognition models return predicted labels and their corresponding certainties; the problem arises when there are high certainties for wrong labels [2].

This paper presents an approach to adversarial image perturbation with evolutionary algorithms, with the goal of misguiding the AlexNet convolutional neural network (CNN). The implemented approach demonstrates how simple and effective adversarial perturbation is, and how vulnerable everyday image recognition models are. It aims to recreate an image as similar to the original as possible, keeping the human perception of the motif intact, while maximizing the error of the image recognition model: pixel values in certain places are changed such that computer vision fails to classify the image correctly.

2 Related work

The inspiration for this paper derives from [9], where the authors implemented an adversarial perturbation deceiving computer vision by changing only one pixel in the original image. That attack was carried out on images of very low resolution, which is the reason for its success. In the paper authored by Fawzi et al. [1], an analysis was made of the resistance of computer vision algorithms to adversarial disturbances. The existence of adversarial examples was confirmed, as there is an upper bound on robustness. The goal was to find the correlation between robustness against random noise and against adversarial noise. As long as the boundary is so high that the recreated image has to be completely distorted, it does not indicate a problem; a problem arises if the image is human-recognizable and the recognition algorithm fails its prediction with high certainty. Several different models of machine learning, including CNNs, misclassify adversarial examples consistently. These are intentionally created, small interferences that are detrimental to the recognition algorithm [8]. The paper [13] shows that a universal adversarial perturbation is possible: one adversarial noise filter can be applied routinely to many different images. In the paper [2], there are examples of specifically produced images in which the human eye sees only random noise, yet the algorithm is nearly certain that there is a motif. The paper [4] shows that a successful adversarial perturbation against one neural network is likely to succeed against a variety of network architectures trained on different data sets.

A distinctive quality of this paper is that it is readily accessible to non-experts. It shows that implementations of adversarial perturbations are not limited to teams of advanced researchers supported by technical and financial capabilities: the attacker requires only a basic understanding of machine learning. The experiment uses only open-source libraries and a small amount of understandable custom code. Despite the straightforward approach, the results are comparable to the work mentioned above.
3 Implementing adversarial perturbation on the AlexNet CNN

The main objective of the approach is that the solution image is modified in accordance with two objectives: (1) similarity to the original image, and (2) AlexNet's certainty during misrecognition. The optimization method pursuing these objectives is a genetic algorithm. The goal is that no change can be noticed by the human eye in the reproduced image without prior knowledge.

The following Python libraries were used:
• NumPy [11] for numerical calculations,
• OpenCV [3] and Pillow [6] for image preprocessing,
• Scikit-image [14] for the structural similarity index measure metric,
• PyTorch [12] for a pre-trained AlexNet CNN,
• GARI (Genetic Algorithm for Reproducing Images) for the EA (evolutionary algorithm) [7].

The proposed approach is divided into the following interconnected parts:
• generation of sets of candidate solutions,
• evaluation of the image similarity between the candidate solution and the original image, using the normalized average of the absolute pixel differences,
• classification of the candidate solution using the AlexNet image recognition model,
• computation of the fitness value for the candidate solution,
• selection of the fittest candidate images for further reproduction.

Figure 1 shows the initial idea of the implementation of the adversarial perturbation. The start block represents the execution of the experiment with any given original image. The evolutionary algorithm creates candidate solutions, which are evaluated using two separate criteria that together form a fitness function. The fitness function combines both results using fuzzy logic's AND operator (∧): the value of the normalized average absolute pixel difference is multiplied by AlexNet's certainty in a wrong prediction. The process is repeated iteratively until the termination condition is reached.

Figure 1: Diagram of the proposed adversarial perturbation.

Algorithm 1: Algorithm in pseudo-code
Data: original image
Result: recreated image
1 initialization;
2 create first candidate solution;
3 while termination goal not reached do
4   calculate similarity score;
5   check AlexNet's certainty in a wrong prediction;
6   calculate fitness value;
7   send score to evolutionary algorithm;
8   create new candidate solutions;
9 end

It turned out that the combination of AlexNet's predictions, the genetic algorithm and the evaluation of the fitness function was very time-consuming; thus, it was not possible to recreate the image within the set time frame to the point of recognition by the human eye. The bottleneck was the time-consuming evaluation of the candidate solutions by AlexNet, which renders the attack infeasible for use cases where real-time solutions are needed.

We bypassed this bottleneck somewhat by not running AlexNet at all during the first 80,000 iterations, since the first recreated images are random noise, which was optimized towards our goal. Initially, the only feedback given to the EA was the similarity score. It turned out that the recreated image was recognizable to the human eye much earlier than to AlexNet, so AlexNet's predictions were only calculated after the candidate image was sufficiently similar. Once AlexNet recognizes the image, the evolutionary algorithm can start calling our final fitness function, which comprises both the similarity score and AlexNet's predicted class with its corresponding certainty.
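The final fitness function described above can be illustrated with the following sketch. This is our reconstruction, not the authors' code: the preprocessing of the candidate image for AlexNet (resizing to 224 x 224 and ImageNet normalization) is omitted for brevity, and all names are illustrative.

    import numpy as np
    import torch
    import torch.nn.functional as F

    def fitness(candidate, original, model, true_label):
        # criterion 1: similarity, from the normalized average absolute pixel difference
        diff = np.abs(candidate.astype(np.float64) - original.astype(np.float64))
        similarity = 1.0 - diff.mean() / 255.0
        # criterion 2: AlexNet's certainty in the best *wrong* label
        x = torch.from_numpy(candidate).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        with torch.no_grad():
            probs = F.softmax(model(x), dim=1).squeeze(0)
        probs[true_label] = 0.0
        wrong_certainty = probs.max().item()
        # fuzzy AND: the product is large only when both criteria are large
        return similarity * wrong_certainty

The product form is what makes the 0.99 benchmark of Section 4 meaningful: a fitness above 0.99 can only be reached when both factors exceed 0.99.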
Figure 2 shows the working version of the experiment, presenting the detailed control flow which dictates the entry of AlexNet into the fitness value calculation. The experiment is divided into two phases. Phase one consists mainly of quick operations: no phase transition conditions are checked during the first 80,000 generations. Depending on the image, AlexNet started giving the first correct classification at about 30,000 generations. Towards the end of the first phase, correct classification is checked; if the prediction is correct, we advance to the second phase, which aims to create an adversarial perturbation. The output of AlexNet is an array of sorted certainties with labels. For the calculation of the fitness function, the value is taken from the incorrect label with the highest certainty, and is combined with the similarity score.

Figure 2: Flowchart of the final implementation of the proposed approach.

4 Results

The results of the experiment were evaluated visually and using statistical metrics. The terminating conditions were set as follows:
• a time limit of 2 hours is reached,
• the calculated fitness exceeds 0.99,
• the algorithm finishes both phases.

The benchmark value was set to 0.99, since it forces both factors, the normalized average of absolute pixel differences and AlexNet's certainty in a wrong prediction, to be above 0.99; the product of two numbers between 0 and 1 is smaller than either factor.

Figure 3: Original images, adversarial filters and recreated images (leafhopper, manhole, maze, nautilus, strawberry).

Since images are difficult to evaluate qualitatively, and the normalized mean of the sum of absolute errors was already used in the evaluation process, new statistical metrics were introduced:
• Mean Squared Error (MSE),
• Peak Signal-to-Noise Ratio (PSNR), and
• Structural Similarity Index Measure (SSIM).
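All three metrics are available in scikit-image, which the experiment already uses for SSIM; a minimal sketch (ours, not the authors' evaluation script) for two equally sized RGB images:

    from skimage.metrics import (mean_squared_error,
                                 peak_signal_noise_ratio,
                                 structural_similarity)

    def perceptual_metrics(original, recreated):
        # original, recreated: uint8 RGB arrays of the same shape
        mse = mean_squared_error(original, recreated)
        psnr = peak_signal_noise_ratio(original, recreated)   # reported in dB
        ssim = structural_similarity(original, recreated, channel_axis=-1)
        return mse, psnr, ssim

In older scikit-image releases the last call takes multichannel=True instead of channel_axis=-1.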
The results showed a promising direction, but they were not optimized fully, due to operational limitations. The compromise agreed upon was deceiving AlexNet's prediction towards the closest label in the feature space.

4.1 Examples of misclassified images

The results of the experiment are shown in Table 1; the recreated images are shown in Figure 3.

Table 1: Results for the images in Figure 3.

Original category | Category after attack | Certainty in misclassified label
Leafhopper | Lacewing | 99.97%
Manhole cover | Electric ray | 99.98%
Maze | Hay | 99.97%
Nautilus | Brain coral | 99.98%
Strawberries | Bell pepper | 99.97%

The calculated metrics on the different recreated images achieve relatively high values, and the human eye recognizes the motif of the image. The attack was carried out successfully; the results are shown in Table 2.

Table 2: Calculated metrics of the recreated images.

Picture | MSE | PSNR | SSIM
Agama | 768.69 | 29.51 dB | 0.62
Baseball | 956.16 | 29.04 dB | 0.56
Leafhopper | 666.89 | 29.52 dB | 0.77
Manhole cover | 642.42 | 29.53 dB | 0.77
Maze | 270.50 | 31.11 dB | 0.79
Nautilus | 396.71 | 30.58 dB | 0.81
Nautilus 2 | 667.03 | 29.65 dB | 0.73
Panda | 908.09 | 29.10 dB | 0.75
Rosehip | 944.62 | 29.29 dB | 0.68
Strawberry | 394.05 | 31.52 dB | 0.83
Sulphur butterfly | 1015.85 | 28.82 dB | 0.61
Upright piano | 975.15 | 29.00 dB | 0.66

One of the goals set was to recreate images at the input resolution of AlexNet (224 x 224 pixels). This goal was not reached, because the time-complexity growth rate was non-linear; the recreated images were around 100 x 100 pixels in resolution. Figure 3 shows recreated images in which, without prior knowledge, it is hard to spot the difference, taking into account that the images are relatively small.

5 Discussion

Despite the limitations of the experiment, we showed that adversarial perturbations can be implemented in a relatively short time with the help of genetic algorithms. Future research may point in one of the following six directions:
• speeding up the optimization process,
• deceiving computer vision into a custom label,
• selecting a more complex CNN,
• testing other optimization methods (e.g., other nature-inspired algorithms [5]),
• testing with only some features used [10], to speed up the optimization process, and
• protection against adversarial noise.

References

[1] Alhussein Fawzi, et al. Analysis of classifiers' robustness to adversarial perturbations. CoRR abs/1502.02590 (2015).
[2] Anh Mai Nguyen, et al. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. CoRR abs/1412.1897 (2014).
[3] Bradski, G. The OpenCV Library. Dr. Dobb's Journal of Software Tools (2000).
[4] Christian Szegedy, et al. Intriguing properties of neural networks, 2014.
[5] Fister Jr., I., et al. A brief review of nature-inspired algorithms for optimization. arXiv preprint arXiv:1307.4186 (2013).
[6] Fredrik Lundh. Pillow - Pillow (PIL Fork) 8.2.0 Documentation, 2021.
[7] Gad, A. F. ahmedfgad/GARI: GARI (Genetic Algorithm for Reproducing Images) reproduces a single image using a Genetic Algorithm (GA) by evolving pixel values.
[8] Ian J. Goodfellow, et al. Explaining and harnessing adversarial examples, 2015.
[9] Jiawei Su, et al. One pixel attack for fooling deep neural networks. CoRR abs/1710.08864 (2017).
[10] Karakatič, S. EvoPreprocess: data preprocessing framework with nature-inspired optimization algorithms. Mathematics 8, 6 (2020), 900.
[11] NumPy. What is NumPy? NumPy v1.20 Manual, 2021.
[12] Paszke, A., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 8024–8035.
[13] Seyed-Mohsen Moosavi-Dezfooli, et al. Universal adversarial perturbations. CoRR abs/1610.08401 (2016).
[14] van der Walt, S., et al. scikit-image: image processing in Python. PeerJ 2 (2014), e453.

Fast Recognition of Some Parametric Graph Families

Nina Klobas (Durham University, Department of Computer Science, Upper Mountjoy Campus, Stockton Road, Durham DH1 3LE, United Kingdom, nina.klobas@durham.ac.uk)
Matjaž Krnc (University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Glagoljaška ulica 8, 6000 Koper, Slovenia, matjaz.krnc@famnit.upr.si)
DOI https://doi.org/10.18690/978-961-286-516-0.7

Abstract. Recognizing graphs with a high level of symmetry is hard in general, and usually requires additional structural understanding. In this paper we study a particular graph parameter and motivate its usage by devising an efficient recognition algorithm for the family of I-graphs. For integers l, λ, m, a simple graph is [l, λ, m]-cycle regular if every path of length l belongs to exactly λ different cycles of length m. We identify all [1, λ, 8]-cycle regular I-graphs and, as a consequence, describe a linear recognition algorithm for the observed family. A similar procedure can be used to devise recognition algorithms for double generalized Petersen graphs and folded cubes. Besides that, we believe the structural observations and methods used in the paper are of independent interest, and could be used for solving other algorithmic problems.

Keywords: I-graphs, double generalized Petersen graphs, folded cubes, recognition algorithm, cycle regularity.

1 Introduction

Important graph classes such as bipartite graphs, (weakly) chordal graphs, perfect graphs and forests are defined or characterized by their cycle structure. A particularly strong description of a cyclic structure is the notion of cycle-regularity, introduced by Mollard [11]:

    For integers l, λ, m, a simple graph is [l, λ, m]-cycle regular if every path on l + 1 vertices belongs to exactly λ different cycles of length m.

It is perhaps natural that cycle-regularity mostly appears in the literature in the context of symmetric graph families such as hypercubes, Cayley graphs or circulants. Understanding the structure of subgraphs of hypercubes which avoid all 4-cycles does not seem to be easy; indeed, a question of Erdős regarding how many edges such a graph can contain remains open after more than 30 years [5].

In this paper we study cycle-regularity and more general cyclic aspects of the family of I-graphs, with the focus of devising an efficient recognition algorithm. A similar approach can be extended to two other graph families, namely double generalized Petersen graphs and folded cubes. Due to space constraints the study of these two families is not covered here; we defer interested readers to the full version of this paper [9].

Figure 1: I-graph I(12, 2, 3), double generalized Petersen graph DP(10, 2), and folded cube FQ4.

I-graphs were introduced in the Foster census [6], and are trivalent (cubic) graphs with edges connecting the vertices of two star polygons. They form a natural generalization of the well-known generalized Petersen graphs, introduced in 1950 by Coxeter [3] and later named by Watkins in 1969 [14]. The family of I-graphs has been studied extensively with respect to their automorphism group and isomorphisms [1, 7, 13], Hamiltonicity [2], spectrum [12], and independence number [4, 8].

Our first result identifies all [1, λ, 8]-cycle regular members and determines the corresponding values of λ.
Theorem 1. An arbitrary I-graph is never [1, λ, 8]-cycle regular, except when isomorphic to I(n, j, k) where j = 1 and (n, k) ∈ {(3, 1), (4, 1), (5, 2), (8, 3), (10, 2), (10, 3), (12, 5), (13, 5), (24, 5), (26, 5)}.

These structural results are used to devise the recognition algorithm for I-graphs.

Theorem 2. I-graphs can be recognized in linear time.

If the input graph is a member of the observed family, we
It is easy to Definition 4. Let n, j, k be positive integers for which see that an arbitrary 8-cycle can have either 4 , 0 or 2 n ≥ 3 and n ≥ j, k ≥ 1. An I-graph I( n, j, k) is a graph spoke edges, so we obtained this list by distinguishing 8-on vertices { u cycles by the number of spoke edges they admit. In the 0 , u 1 , . . . , un−1 , w 0 , w 1 , . . . , wn−1} , with the edge set consisting of outer edges u case of the 8-cycle admitting 2 spoke edges, we further iui+ j , inner edges w distinguish cases by the number of outer and inner edges iwi+ k and spoke edges uiwi, where the subscripts are taken modulo n. within a given 8-cycle. Due to space constraints we present the analysis of 8- Without loss of generality we always assume that j, k < cycles admitting 4 spoke edges only, as it is the easiest n/ 2. Since I( n, j, k) is isomorphic to I( n, k, j), we re-case to deal with (for remaining cases see [9]). This strict ourselves to cases when j ≤ k. It is well known analysis leads to complete characterisation of 8-cycles for [1] that an I-graph I( n, j, k) is disconnected whenever the family of I-graphs, presented in the Table 1. d = gcd( n, j, k) > 1. In this case it consists of d copies of I( n/d, j/d, k/d). Therefore, throughout the paper 8-cycles with 4 spoke edges In addition to 4 spoke we consider only graphs I( n, j, k) where gcd( n, j, k) = edges the 8-cycle must also have two inner and two outer 1. We also know [7] that two I-graphs I( n, j, k) and edges. When using the spoke edge there are two options I( n, j 0 , k 0) are isomorphic if and only if there exists an for choosing an inner (outer) edge. After considering all integer a, which is relatively prime to n, for which either cases it is easy to see that there can be just two such { j 0 , k 0} = { aj (mod n) , ak (mod n)} or { j 0 , k 0} = { aj 8-cycles, C∗ (see Claim 5), which exists whenever j 6= k, (mod n) , − ak (mod n)}. Throughout the paper, when-and C ever we discuss I-graphs with certain parameters, we con- 0, which is of the following form: sider only the lexicographically smallest possible param- C 0 = ( w 0 , w± k, u± k, u± k± j, w± k± j, w±2 k± j, u±2 k± j, u±2 k±2 j) . eters by which the graph is uniquely determined. Cycle C 0 exists whenever 2 k + 2 j = n. One can verify easily, that n applications of the rotation ρ to C∗ and n/ 2 2.1 Equivalent 8-cycles applications of the rotation ρ to cycle C 0 maps the cycle back to itself. Therefore there are n equivalent cycles to A particular member of automorphism group of every I- C∗ and n/ 2 equivalent cycles to C 0 in an I-graph I( n, j, k) graph is a rotation defined as: ρ( ui) = ui+1, ρ( wi) = and they contribute (2 , 4 , 2) and (1 , 2 , 1), respectively, to wi+1. Clearly, applying n times the rotation ρ yields an the graph octagon value. identity automorphism. When acting on I-graphs with ρ we get 3 edge orbits: orbit of outer edges EJ , orbit 2.3 Obtaining constant octagon value of spoke edges ES and orbit of inner edges EI . Edges from the same orbit EJ , ES, or EI have the same octagon Every 8-cycle of an I-graph contributes to the octagon value, which we denote by σJ , σS and σI , respectively. value of each edge partition. 
Table 1: All non-equivalent 8-cycles of I-graphs, their existence conditions, their contribution τ(C) towards the octagon value of an I-graph, and the number γ(C) of their equivalent cycles in an I-graph. For C3–C7, the two rows of a label give the two mirror representatives with their respective existence conditions.

Label | Representative 8-cycle                                        | Existence conditions       | τ(C)      | γ(C)
C∗    | (w0, w±k, u±k, u±k±j, w±k±j, w±j, u±j, u0)                    | k ≠ j and n > 4            | (2, 4, 2) | n
C0    | (w0, w±k, u±k, u±k±j, w±k±j, w±2k±j, u±2k±j, u±2k±2j)         | 2k + 2j = n                | (1, 2, 1) | n/2
C1    | (u0, uj, u2j, u3j, u4j, u5j, u6j, u7j)                        | 8j = n or 3n               | (0, 0, 1) | n/8
C2    | (w0, wk, w2k, w3k, w4k, w5k, w6k, w7k)                        | 8k = n or 3n               | (1, 0, 0) | n/8
C3    | (w0, wk, w2k, w3k, w4k, w5k, u5k, u5k+j)                      | 5k + j = n or 2n           | (5, 2, 1) | n
      | (w0, wk, w2k, w3k, w4k, w5k, u5k, u5k−j)                      | 5k − j = n or 2n           |           |
C4    | (u0, uj, u2j, u3j, u4j, u5j, w5j, w5j+k)                      | k + 5j = n or 2n           | (1, 2, 5) | n
      | (u0, uj, u2j, u3j, u4j, u5j, w5j, w5j−k)                      | 5j − k = 2n or n or 0      |           |
C5    | (w0, wk, w2k, w3k, w4k, u4k, u4k+j, u4k+2j)                   | 4k + 2j = n or 2k + j = n  | (4, 2, 2) | n
      | (w0, wk, w2k, w3k, w4k, u4k, u4k−j, u4k−2j)                   | 4k − 2j = n                |           |
C6    | (u0, uj, u2j, u3j, u4j, w4j, w4j+k, w4j+2k)                   | 2k + 4j = n or k + 2j = n  | (2, 2, 4) | n
      | (u0, uj, u2j, u3j, u4j, w4j, w4j−k, w4j−2k)                   | 4j − 2k = n or 0           |           |
C7    | (w0, wk, w2k, w3k, u3k, u3k+j, u3k+2j, u3k+3j)                | 3k + 3j = n or 2n          | (3, 2, 3) | n
      | (w0, wk, w2k, w3k, u3k, u3k−j, u3k−2j, u3k−3j)                | 3k − 3j = n or 0           |           |

2.3 Obtaining constant octagon value

Every 8-cycle of an I-graph contributes to the octagon value of each edge partition. It turns out that if we can identify at least one edge partition of a graph, we can easily determine its parameters. Therefore, we want to find graphs with a constant octagon value. These are graphs for which all edges touch the same number of 8-cycles; they are called [1, λ, 8]-cycle regular graphs. We consider all possible collections of 8-cycles and determine the octagon values of I-graphs admitting those 8-cycles. Since I-graphs are defined with 3 parameters and all 8-cycles give constraints on these parameters, it is enough to consider collections of at most 4 cycles to uniquely determine all [1, λ, 8]-cycle regular graphs. After a thorough analysis (see [9]) we see that the list of [1, λ, 8]-cycle regular I-graphs consists of 10 members (see Figure 2). Surprisingly, it turns out that all [1, λ, 8]-cycle regular I-graphs are in the family of generalized Petersen graphs. This proves Theorem 1.

3 Recognition algorithm

The recognition algorithm relies on the fact that there is just a small number of I-graphs (ten) with a constant octagon value. In particular, whenever the input graph G of Algorithm 1 is a member of the family of I-graphs and is not [1, λ, 8]-cycle regular, we can immediately identify one of its edge orbits (EI, EJ, or ES), of size |V(G)|/2.

Algorithm 1: Recognition procedure for I-graphs.
Data: connected cubic graph G
Result: parameters (n, j, k) and an isomorphism to I(n, j, k), if it exists
1  P ← an empty dictionary
2  for e ∈ E(G) do
3      s ← octagonValue(e)    /* calculate σ(e) */
4      P[s].append(e)
5  end
6  U ← an item of P with minimum positive cardinality
7  if G[U] is a 2-factor then    /* a 2-factor is a 2-regular graph */
8      U ← {e | e ∈ E(G), e is adjacent to an edge of U}    /* a perfect matching in G */
9  end
10 return Extend(G, U)
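The edge-partitioning steps of Algorithm 1 (lines 1–9) can be sketched as below, reusing build_i_graph and octagon_value from the previous sketch (again by brute force rather than the paper's constant-time local computation). Note that the bucket-and-switch logic returns the spoke orbit regardless of which orbit lands in the smallest class, since a 2-regular class gets swapped for its adjacent edges.

    from collections import defaultdict

    def partition_edges(adj):
        """Lines 1-4: bucket the edges of G by their octagon value sigma(e)."""
        P = defaultdict(list)
        for e in {frozenset((a, b)) for a in adj for b in adj[a]}:
            P[octagon_value(adj, *e)].append(tuple(e))
        return P

    def spoke_orbit(adj):
        """Lines 6-9: take a smallest class U; if G[U] is 2-regular (one of
        the two cycle orbits), replace U by the edges adjacent to it, which
        is then the spoke orbit -- a perfect matching of G."""
        U = min(partition_edges(adj).values(), key=len)
        deg = defaultdict(int)
        for a, b in U:
            deg[a] += 1
            deg[b] += 1
        if all(d == 2 for d in deg.values()):      # U induces a 2-regular subgraph
            in_u = {frozenset(e) for e in U}
            touched = set(deg)
            U = [(a, b) for a in adj for b in adj[a]
                 if a < b and frozenset((a, b)) not in in_u
                 and (a in touched or b in touched)]
        return U

    # For a non-cycle-regular I-graph this should yield the n spokes u_i w_i:
    print(sorted(spoke_orbit(build_i_graph(7, 1, 2))))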
Correctness and time complexity of the algorithm. We first note that if G is not cubic, then it does not belong to the observed graph families. Since checking whether a graph is cubic takes linear time, we simply assume that the input graph is cubic. Furthermore, if G is not connected, then it can only be a member of the family of I-graphs if it consists of multiple copies of a smaller I-graph G′. However, this case can easily be resolved by separately checking each part, so we can assume that the input graph is connected. Algorithm 1 consists of the following 3 parts.

1. Partitioning the edges with respect to the octagon value. The algorithm determines the octagon value of each edge e ∈ E(G) and builds a partition set P of graph edges (see lines 1–4). Since G is cubic and all 8-cycles containing edge e consist of edges which are at distance at most 4 from e, it is enough to check a subgraph H of G of order at most 62 to calculate the octagon value of an edge e. Therefore, the calculation of octagonValue(e) takes O(1) time for each edge e, and this whole part is performed in Θ(|E(G)|) time.

2. Identifying the edge orbit which corresponds to the set of spokes. Throughout lines 6–9 we determine the edge orbit which corresponds to the set of spokes. It is easy to see that this requires an additional O(|E(G)|/3) time.

3. Using the set U for determining the parameters of a given graph. The algorithm uses the computed set U to determine an exact isomorphism between G and an I-graph or a double generalized Petersen graph, if it exists. This procedure differs with regard to the graph family we are considering. The related procedure Extend(G, U) is performed in Θ(|E(G)|) time.

Since the octagon value of each edge is computed in constant time and there is a finite number of [1, λ, 8]-cycle regular I-graphs, Theorem 2 holds.

Figure 2: All [1, λ, 8]-cycle regular I-graphs: (a) the triangular prism — the graph with no 8-cycles; (b) the 3-cube — the graph containing C0 and C7; (c) the Petersen, Dodecahedral and Desargues graphs — graphs containing C5, C6 and C∗; (d) the Möbius–Kantor graph, the Nauru graph, G(13, 5), F048 ≅ G(24, 5), and G(26, 5) — graphs containing C3, C4 and C∗.

3.1 Subprocedure Extend(G, U)

It is easy to see that Extend(G, U) can safely reject G if |V(G)| is not divisible by 2. For this subprocedure set n = |V(G)|/2 and denote by H the subgraph of G induced by the vertices of U. There are two possibilities.

H = G. In this case graph G has a constant octagon value. Since there are just ten such I-graphs, checking G against them takes constant time.

H is of order 2n and is 1-regular. Since U is a perfect matching of G, the set E(G) \ U is a collection of cycles. If G is an I-graph, then there exist positive integers i, j, l1, l2 with j ≤ i such that there are j cycles of length l1 and i cycles of length l2. It remains to determine the parameter k and check whether G ≅ I(n, j, k). This procedure depends on the structure of I-graphs and the 8-cycle C∗. It is performed in Θ(|E(G)|) time (see [9] for details).
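The case analysis in Extend(G, U) starts from the observation that E(G) \ U is a disjoint union of cycles. A short sketch of that step follows, again reusing build_i_graph from the first sketch; the instance I(12, 2, 3) and the expected cycle counts come from elementary gcd arithmetic on the orbits, not from the paper.

    from collections import Counter

    def cycle_lengths(adj, matching):
        """With U a perfect matching, E(G) \\ U is 2-regular; walk each cycle
        of G - U exactly once and record its length."""
        matched = {frozenset(e) for e in matching}
        lengths, seen = [], set()
        for start in adj:
            if start in seen:
                continue
            prev, cur, length = None, start, 0
            while True:
                seen.add(cur)
                nxt = next(x for x in adj[cur]
                           if x != prev and frozenset((cur, x)) not in matched)
                prev, cur = cur, nxt
                length += 1
                if cur == start:
                    break
            lengths.append(length)
        return lengths

    # For I(12, 2, 3), with U the spokes (known here by construction), the outer
    # edges form gcd(12, 2) = 2 cycles of length 6 and the inner edges form
    # gcd(12, 3) = 3 cycles of length 4 -- i.e. j = 2 <= i = 3 in the text above.
    adj = build_i_graph(12, 2, 3)
    spokes = [(('u', i), ('w', i)) for i in range(12)]
    print(Counter(cycle_lengths(adj, spokes)))   # Counter({4: 3, 6: 2})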
4 Conclusion

Studying the cyclic structure as described in this paper led to the construction of fast recognition algorithms for three parametric families. To the best of our knowledge, in addition to this work, such a procedure was so far only used in [10] for the family of generalized Petersen graphs. We believe that a similar approach should give interesting results for other parametric graph families of bounded degree, such as Johnson graphs, rose window graphs, Tabačjn graphs, Y-graphs, or H-graphs.

References

[1] Boben, M., Pisanski, T., and Žitnik, A. I-graphs and the corresponding configurations. J. Comb. Des. 13, 6 (2005), 406–424.
[2] Bonvicini, S., and Pisanski, T. Hamiltonian cycles in I-graphs. Electronic Notes in Discrete Mathematics 40 (2013), 43–47. Combinatorics 2012.
[3] Coxeter, H. S. M. Self-dual configurations and regular graphs. Bull. Amer. Math. Soc. 56 (1950), 413–455.
[4] Dods, M. S. Independence number of specified I-graphs. PhD thesis, Naval Postgraduate School, Monterey, CA, 2020.
[5] Erdős, P. Some of my favorite solved and unsolved problems in graph theory. Quaestiones Math. 16, 3 (1993), 333–350.
[6] Foster, R. M. The Foster census. Charles Babbage Research Centre, Winnipeg, MB, 1988. R. M. Foster's census of connected symmetric trivalent graphs, with a foreword by H. S. M. Coxeter, a biographical preface by Seymour Schuster, an introduction by I. Z. Bouwer, W. W. Chernoff, B. Monson and Z. Star, edited and with a note by Bouwer.
[7] Horvat, B., Pisanski, T., and Žitnik, A. Isomorphism checking of I-graphs. Graphs Combin. 28, 6 (2012), 823–830.
[8] Klein, Z. J. Structural properties of I-graphs: their independence numbers and Cayley graphs. PhD thesis, Naval Postgraduate School, Monterey, CA, 2020.
[9] Klobas, N., and Krnc, M. Fast recognition of some parametric graph families. arXiv e-prints (Aug. 2020), arXiv:2008.08856.
[10] Krnc, M., and Wilson, R. J. Recognizing generalized Petersen graphs in linear time. Discrete Applied Mathematics 283 (2020), 756–761.
[11] Mollard, M. Cycle-regular graphs. Discrete Mathematics 89, 1 (1991), 29–41.
[12] Oliveira, A. S. S., and Vinagre, C. T. M. The spectrum of an I-graph, 2015.
[13] Petkovšek, M., and Zakrajšek, H. Enumeration of I-graphs: Burnside does it again. Ars Math. Contemp. 2, 2 (2009), 241–262.
[14] Watkins, M. E. A theorem on Tait colorings with an application to the generalized Petersen graphs. J. Combinatorial Theory 6 (1969), 152–164.

Interactive Evolutionary Computation Approach to Permutation Flow Shop Scheduling Problem

Vid Keršič
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
vid.kersic@um.si
DOI https://doi.org/10.18690/978-961-286-516-0.8

Abstract. Artificial intelligence and its subfields have become part of our everyday lives and efficiently solve many problems that are very hard for us humans. But in some tasks these methods struggle, while we, humans, are much better solvers with our intuition. Because of that, the question arises: why not combine intelligent methods with human skills and intuition? This paper proposes an Interactive Evolutionary Computation approach to the Permutation Flow Shop Scheduling Problem by incorporating human-in-the-loop in the MAX-MIN Ant System through gamification of the problem. The analysis shows that combining the evolutionary computation approach and human-in-the-loop leads to better solutions, significantly when the complexity of the problem increases.

Keywords: scheduling problems, interactive evolutionary computation, ant colony optimization, metaheuristic optimization
1 Introduction

Computers have become an essential (and sometimes the only) tool to approach complex real-world problems. But many problems are still too hard to solve, even for computers. Many of these problems are NP-hard, and their solution cannot be found by using exact algorithms [1]. Thus, many of these problems are solved with approximation and different stochastic nature-inspired population-based algorithms, where suboptimal solutions are found. However, these are still considered good enough to use in practice. Some of these problems include the Travelling Salesman Problem (TSP), scheduling problems, etc. This paper focuses on the Permutation Flow Shop Scheduling Problem (PFSP), one of the scheduling problems.

PFSP is the scheduling problem where there are m machines and n jobs [2]. Each job consists of m operations, where the i-th operation must be executed on the i-th machine. The order of jobs must be the same on all machines, and each machine can perform only a single operation at a time. Each operation has a specified execution time, and the goal is to minimize the total job execution time, called the makespan. The problem was proven to be NP-hard when m ≥ 3, which means no efficient algorithm exists [3].

Different approximation approaches were presented in the last decades and shown to be effective in finding suboptimal solutions. Firstly, different approximation algorithms were introduced to reduce the time to solve the problems and find suitable solutions [4]. Afterwards, researchers applied and adjusted different optimization algorithms, especially nature-inspired and metaheuristic optimization algorithms, to improve results in shorter execution time [5].

In the last years, a new approach emerged to solve these problems. Inspired by the human intuition of problem-solving, the Interactive Evolutionary Computation (IEC) approach combines metaheuristic optimization algorithms, e.g., nature-inspired algorithms, with human-in-the-loop through some process, e.g., gamification of the problem [6]. This paper proposes an IEC approach based on the MAX-MIN Ant System (MMAS), one of the Ant Colony Optimization (ACO) algorithms, to PFSP through gamification of the problem.

The structure of the paper is as follows: Section 2 provides an overview of related work. Section 3 describes the IEC approach to PFSP. Section 4 describes the experiment, shows the results, and provides an analysis of the results. Section 5 concludes the paper.

2 Related Work

Researchers started solving PFSP almost 70 years ago, when Johnson proposed the exact algorithm, but it was restricted to two machines [2]. Many exact algorithms with different strategies were proposed in the following decades, such as the branch-and-bound algorithm [7]. Despite that, the exact algorithms could not scale to bigger instances of the problem.

To overcome the NP-hard time complexity of the problem, researchers started to develop approximation and heuristic algorithms, where different approaches and heuristics were considered, such as [8]. Methods based on the Nawaz-Enscore-Ham (NEH) heuristic proved to be among the best performers for the problem [4]. NEH-based methods consist of two steps: first, order the jobs based on some characteristic, and then insert them one by one. But the approximation algorithms still have several drawbacks, such as being dependent on certain characteristics of the specific setting of the problem and still being too slow.

Metaheuristic optimization algorithms were also considered for PFSP, and they achieved more favorable results than approximation algorithms, especially time-wise. Many algorithms were applied to the problem, such as Simulated Annealing, Differential Evolution, and many nature-inspired algorithms, such as ACO [9, 5]. In many benchmarks, metaheuristic algorithms achieve comparable or even better results than approximation algorithms [10].

As mentioned in the introduction, researchers have tried to incorporate human intuition into metaheuristic algorithms, which proved beneficial [11]. Authors included human-in-the-loop in TSP through gamification, where the user affects the pheromone level in the MMAS algorithm when playing a game [6]. Several other researchers also focused on the gamification of different NP-hard problems like the Bin-packing problem, Job Shop Scheduling, and Open Shop Scheduling [12, 13]. Based on the related work, the proposed approach incorporates human intuition to solve the PFSP problem through a gamification process where the user affects the pheromone trail in the MMAS algorithm.

3 IEC Approach to PFSP

The goal of PFSP is to find a permutation of n jobs that minimizes the total makespan. Thus, the optimization task for the problem is defined as:

min C(π)    (1)

where π is a permutation of n jobs and C is the cost function that calculates the makespan.
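Since the whole approach optimizes C(π), it is worth making the cost function concrete. The following is a standard PFSP makespan computation in Python (a sketch of the textbook recurrence, not taken from the paper's source code); times[job][mach] is the processing time of a job on a machine.

    def makespan(perm, times):
        """C(pi): completion time of the last job on the last machine, with
        the same job order on every machine (the PFSP constraint)."""
        m = len(times[0])
        finish = [0.0] * m      # completion time of the previous job per machine
        for job in perm:
            ready = 0.0         # completion of this job on the previous machine
            for mach in range(m):
                ready = max(ready, finish[mach]) + times[job][mach]
                finish[mach] = ready
        return finish[-1]

    # Three jobs on two machines, scheduled in the order 2, 0, 1:
    print(makespan([2, 0, 1], [[3, 2], [1, 4], [2, 1]]))   # -> 11.0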
Based on the techniques and approaches discussed in the previous section, the solution consists of three modules: the video game, the back-end server, and the MMAS algorithm (source code available at: https://github.com/Vid201/flow-shop-scheduling-iec). The video game was implemented in Unity, while the back-end server and the MMAS algorithm, which runs on the back-end, are written in Python.

The video game contains blocks (jobs) and a board constructed from several rows (machines). The user tries to move and stack blocks together while complying with the limitations of the problem. For each game, a new instance of the MMAS algorithm starts. Each movement performed by the user is sent to the back-end server, where the MMAS algorithm runs in iterations. For each block placement, one iteration of the algorithm is launched. The pheromone of the ants is presented as a 2D matrix A of size N × N, where N is the number of jobs. The matrix element A_{i,j} is the pheromone level of positioning job i at position j. The user's placement of blocks affects the pheromone level of the algorithm; in the proposed approach, the value is multiplied by a scalar. This change of the pheromone level is the human-in-the-loop part.

After each user's movement, the solution for the current problem is recommended to the user based on the outcome and current state of the MMAS algorithm (of course, users can freely decide which move they will make next). The player is also encouraged to perform well and achieve the highest score as fast as possible, since the game contains a time counter. The final game score is calculated based on the combination of the makespan and the time duration of the game. The algorithm of the MMAS-IEC approach is shown in Algorithm 1.

Algorithm 1: Algorithm of the MMAS-IEC approach.
1  start the game
2  init MMAS
3  launch 1 iteration (MMAS)
4  while game not finished do
5      wait for the user to position a job (block)
6      i ← positioned_job
7      j ← position_index
8      A_{i,j} ← A_{i,j} × 3
9      launch 1 iteration (MMAS)
10 end
11 return best found solution
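The interactive loop of Algorithm 1 boils down to a single pheromone manipulation between MMAS iterations. A minimal sketch of that loop follows; mmas_iteration, wait_for_player_move and game_finished are hypothetical stand-ins for the back-end's MMAS step and the Unity front-end messages, and the multiplier of 3 is the empirically chosen value described in Section 4.

    import numpy as np

    def mmas_iec(n_jobs, mmas_iteration, wait_for_player_move, game_finished):
        """A sketch of Algorithm 1; the three callbacks are hypothetical.

        mmas_iteration(A) should run one MMAS iteration against the pheromone
        matrix A and return (makespan, permutation) of its best ant.
        """
        A = np.ones((n_jobs, n_jobs))        # A[i, j]: pheromone for job i at position j
        best = mmas_iteration(A)             # line 3
        while not game_finished():           # line 4
            i, j = wait_for_player_move()    # lines 5-7
            A[i, j] *= 3                     # line 8: the human-in-the-loop step
            best = min(best, mmas_iteration(A))   # line 9, keeping the best solution
        return best                          # line 11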
4 Experiments and Results

The experiment aims to show that incorporating human-in-the-loop can positively affect the search process of metaheuristic algorithms, which means better solutions can be found in fewer iterations. Thus, the analysis is mainly oriented on comparing MMAS without human-in-the-loop with MMAS-IEC.

The operation times of the jobs are randomly generated on each new instance of the game, since standard benchmark datasets are too big to be visualized in the game. Problem sizes used in the experiment are shown in Table 1. An example of the video game can be seen in Figure 1. A comparison of the brute-force algorithm, the MMAS algorithm without human-in-the-loop, and the MMAS-IEC approach is performed. The results are also compared to one variant of the Genetic Algorithm (GA) [14].

Table 1: Sizes of the problem in the experiment.

Number of jobs    Number of machines
5                 3
10                2
10                4
12                2
12                4
12                6

Figure 1: The gameplay of the video game.

In both executions of the MMAS algorithm, with and without human-in-the-loop, the parameters were set as follows: n = 5 (number of ants), p0 = 0.9 (probability to select the job with the highest pheromone), mmr = 5 (min-max ratio), ρ = 0.75 (persistence of the trail), max_iter = 100 (maximum number of iterations) and max_stag = 5 (maximum number of iterations of stagnation, i.e., iterations with the same solution). The parameters of MMAS were set according to the literature [5]; readers are advised to read the referenced paper for an in-depth description of the MMAS algorithm. For GA, the parameters were adapted from the used implementation [14].

The MMAS algorithm without human-in-the-loop was run 30 times to get solutions characterized by the algorithm and not by randomness. In the MMAS-IEC approach, the multiplier of the pheromone was set to 3, which means the element A_{i,j} is increased if the user places job i at position j in the game. The size of the multiplier was chosen empirically.

To measure the performance of the algorithms, the suboptimal solutions of the metaheuristic algorithms were compared to optimal solutions by calculating the relative percentage deviation (RPD), which tells how much the suboptimal solutions are worse than the optimal solution. RPD was calculated according to the following equation:

RPD = (Suboptimal_sol − Optimal_sol) / Optimal_sol × 100    (2)

The number of iterations to find the best solution was also compared, since the execution time was much longer in the interactive approach due to playing the game.
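Equation (2) as a one-liner; the example numbers are the brute-force and MMAS-IEC average makespans of the 5 × 3 instances from Table 2 below, and the printed value agrees with the 3.500 reported there.

    def rpd(suboptimal_sol, optimal_sol):
        """Relative percentage deviation, Equation (2)."""
        return (suboptimal_sol - optimal_sol) / optimal_sol * 100

    print(round(rpd(6.742, 6.514), 3))   # -> 3.5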
4.1 Analysis and Discussion

The video game and the other algorithms were launched five times for each problem size, and the average makespan with standard deviation was calculated. The brute-force algorithm was run only on the smaller problem sizes, since its execution time was too long for the other instances. Where brute-force results were available, RPD was also calculated. The results of the experiment are shown in Table 2.

Table 2: Results of the experiment, where N is the number of jobs, M is the number of machines, Avg is the average makespan, SD is the standard deviation, IT is the average number of iterations for the best solution, and RPD is the average relative percentage deviation.

          Brute force     MMAS                        GA                          MMAS-IEC
N   M     Avg     SD      Avg     SD     IT    RPD    Avg     SD     IT    RPD    Avg     SD     IT    RPD
5   3     6.514   0.402   6.605   0.399  31.0  1.396  6.515   0.402  25.8  0.015  6.742   0.463  3.6   3.500
10  2     9.719   0.814   9.834   0.753  27.8  1.183  9.749   0.797  45.0  0.309  9.742   0.792  2.8   0.236
10  4     12.281  0.876   12.890  0.543  28.4  4.958  12.594  0.625  61.2  2.549  12.614  0.657  5.2   2.711
12  2     /       /       12.019  0.845  26.0  /      11.870  0.877  42.4  /      11.869  0.877  2.6   /
12  4     /       /       14.005  0.949  29.8  /      13.430  1.118  49.2  /      13.463  1.152  5.4   /
12  6     /       /       17.075  0.464  19.8  /      15.970  0.739  84.2  /      16.801  1.195  8.2   /

While in the smallest instances of the problem the MMAS algorithm outperformed the MMAS-IEC approach, the latter achieved better results on all the other problem sizes. This implies that when the search space becomes enormous and complex, human intuition positively affects the algorithm and leads to a better solution. Since the user affected the pheromone level and directed the algorithm towards better solutions, the average number of iterations was smaller than with the MMAS algorithm alone. It must be noted that the best solution in MMAS-IEC is usually not the same as the end state of the blocks in the user's game. The user only affected the pheromone level of the job on the selected position, and the game's order does not directly change the solution. The best solution is still the solution found by the MMAS algorithm, while the user only changed the algorithm's internal state, i.e., the pheromone matrix. Figure 2 shows that when the problem sizes increase, MMAS-IEC tends to find solutions with a lower makespan.

Figure 2: Box plots of makespan for different sizes of the problem with all four approaches. Colors represent different sizes of the problem, while dots show makespans for all five settings per problem size for each approach.

Results of the MMAS-IEC approach are very dependent on the way the user plays the game. In the first played games, the results of the MMAS-IEC were not better, or were even worse, than MMAS alone.
But after some games, strategies for the game come to mind and results get better, which means the makespan decreases. One of the strategies that tends to work well is to first position the job whose operation times increase with the machine number, and then put jobs with as little space between operations as possible on all machines. While the number of iterations is much smaller in the MMAS-IEC approach, the time to find the best solution depends on how fast the user plays the game; e.g., MMAS alone can find the suboptimal solution in 1 second, while the user cannot play the game so fast. For bigger instances, around 10 seconds are needed to find the best suboptimal solution (the game does not have to be played to the end to find the best solution).

Compared to the GA, both the MMAS and MMAS-IEC approaches achieve worse results, except when N = 10 and M = 2, where the MMAS-IEC approach achieves the best ones. According to the obtained results, combining an interactive process, i.e., gamification, with GA looks like a promising direction to explore.

The open problem for this gamification process is how to scale the game to bigger problem sizes, since the game can become too complex, and it is also hard to visualize the blocks and boards. The experiments were successfully conducted on problems with up to 6 machines and 12 jobs, but scaling the game to more machines and jobs is open for further research.

5 Conclusion

This work shows how human intuition for problem-solving can be incorporated into metaheuristic algorithms for the NP-hard problem PFSP. The MMAS-IEC approach tends to find better solutions than the fully autonomous metaheuristic algorithm MMAS, especially in bigger settings of the problem. Solutions are also found in a smaller number of iterations. But still, there are some open questions to the proposed approach, as bigger instances are hard to visualize, and playing the game usually takes more time than the MMAS algorithm alone. Both of these drawbacks are good starting points for further research.

Further research could be dedicated to the selection of the metaheuristic algorithm, since this study only considers the MMAS algorithm for PFSP, while several other algorithms proved to be suitable for this problem, e.g., the GA from the analysis. The research goal was to use the same metaheuristic algorithm with and without human interaction, and not to compare different metaheuristic algorithms. On the MMAS algorithm itself, other techniques to change the pheromone level could be explored. Various strategies to encourage the player to perform well could also be introduced, e.g., online scoreboards or rewards, which could increase the player's performance with a positive loop. It would also be interesting to compare and try the MMAS-IEC approach on one of the benchmarks for PFSP, e.g., Taillard's benchmark problems [15].

References

[1] D. E. Knuth. Postscript about NP-hard problems. SIGACT News, 6(2):15–16, Apr. 1974.
[2] S. M. Johnson. Optimal two- and three-stage production schedules with setup times included. Naval Research Logistics Quarterly, 1(1):61–68, 1954.
[3] M. R. Garey, D. S. Johnson, and R. Sethi. The complexity of flowshop and jobshop scheduling. Mathematics of Operations Research, 1(2):117–129, 1976.
[4] M. Nawaz, E. E. Enscore Jr, and I. Ham. A heuristic algorithm for the m-machine, n-job flow-shop sequencing problem. Omega, 11(1):91–95, 1983.
[5] T. Stützle et al. An ant approach to the flow shop problem. In Proceedings of the 6th European Congress on Intelligent Techniques & Soft Computing (EUFIT'98), volume 3, pages 1560–1564, 1998.
[6] A. Holzinger, M. Plass, M. Kickmeier-Rust, K. Holzinger, G. C. Crişan, C.-M. Pintea, and V. Palade. Interactive machine learning: experimental evidence for the human in the algorithmic loop. Applied Intelligence, 49(7):2401–2414, 2019.
[7] E. Ignall and L. Schrage. Application of the branch and bound technique to some flow-shop scheduling problems. Operations Research, 13(3):400–412, 1965.
[8] S. Suliman. A two-phase heuristic approach to the permutation flow-shop scheduling problem. International Journal of Production Economics, 64(1-3):143–152, 2000.
[9] I. Osman and C. Potts. Simulated annealing for permutation flow-shop scheduling. Omega, 17(6):551–557, 1989.
[10] E. Vallada, R. Ruiz, and J. M. Framinan. New hard benchmark for flowshop scheduling problems minimising makespan. European Journal of Operational Research, 240(3):666–677, 2015.
[11] A. Holzinger. Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Informatics, 3(2):119–131, 2016.
[12] N. D. Ross, M. B. Johns, E. C. Keedwell, and D. A. Savic. Human-evolutionary problem solving through gamification of a bin-packing problem. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pages 1465–1473, 2019.
[13] M. Vargas-Santiago, R. Monroy, J. E. Ramirez-Marquez, C. Zhang, D. A. Leon-Velasco, and H. Zhu. Complementing solutions to optimization problems via crowdsourcing on video game plays. Applied Sciences, 10(23):8410, 2020.
[14] Suyunu. GitHub - suyunu/Flow-Shop-Scheduling: genetic algorithm for flow shop scheduling. URL: https://github.com/suyunu/Flow-Shop-Scheduling (visited on 07/23/2021).
[15] E. Taillard. Benchmarks for basic scheduling problems. European Journal of Operational Research, 64(2):278–285, 1993.
Towards Representative Web Performance Measurements with Google Lighthouse

Tjaša Heričko, Boštjan Šumak, Saša Brdnik
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
tjasa.hericko@um.si, bostjan.sumak@um.si, sasa.brdnik@um.si
DOI https://doi.org/10.18690/978-961-286-516-0.9

Abstract. Web performance testing with tools such as Google Lighthouse is a common task in software practice and research. However, variability in time-based performance measurement results is observed quickly when using the tool, even if the website has not changed. This can occur due to variability in the network, web, and client devices. In this paper, we investigated how this challenge was addressed in the existing literature. Furthermore, an experiment was conducted, highlighting how unrepresentative measurements can result from single runs; thus, researchers and practitioners are advised to run performance tests multiple times and use an aggregation value. Based on the empirical results, 5 consecutive runs using a median to aggregate results reduce variability greatly, and can be performed in a reasonable time. The study's findings alert to potential pitfalls when using single run-based measurement results and serve as guidelines for future use of the tool.

Keywords: websites, web performance, performance testing, performance variability, tool, Google Lighthouse

1 Introduction

In software engineering, performance tests are often conducted by software researchers and practitioners to audit a website's quality. The former commonly use web performance measurements to assess web performance on the observed websites [12, 14, 15], to investigate factors (positively or negatively) affecting performance [4, 6, 13], and to improve performance testing [10], while the latter use performance measurements to improve a website's quality and provide a better overall user experience, as web performance greatly influences website traffic, user attrition, user engagement, online revenue, and even rankings in search results [2, 16, 17].

Performance testing can be conducted using various tools, among which Google Lighthouse has gained increasing attention in recent years. It is an open-source tool, providing audits for performance, as well as for accessibility, search engine optimization, and progressive web apps, with indicators on how to improve these aspects of websites if needed [8]. However, when dealing with time-based measurements, the results of such testing can often be inconsistent, as several factors can interfere with the measures and may introduce fluctuations, even if the website has not changed. Most commonly, results tend to vary due to variability in the network, web server, client hardware, and client resource contention [7]. Lighthouse addresses variability by providing vague strategies and recommendations on how to reduce it, though results can still vary. Besides isolating external factors, e.g., using a dedicated device for testing, using a local deployment or a machine on the same network, the most straightforward strategy is to run Lighthouse multiple times and use aggregate values instead of single tests [7].

The research objectives of this paper are: (i) To study how the research community has addressed the challenge of variability in performance measurements when using the tool; and (ii) To demonstrate the strategy of performing multiple runs empirically with their aggregation into a single-value result. To achieve these objectives, we performed a literature review and conducted an experiment.

Our work is broadly related to previous research providing a better understanding and managing of variations in measurements, testing, and benchmarking for time-based performance measurements [3, 5, 10, 11].
These studies are focused primarily on suggesting recommendations for robust testing in the presence of environmental fluctuations, and, as such, are quite different in aim from ours, which is to gain insight into how a specific tool – Google Lighthouse – is used in research for web performance measurement, further investigated with empirical research alerting software researchers and practitioners to potential pitfalls in the future use of the tool.

The contributions of the paper are: (i) Presenting an overview of existing studies using Lighthouse for measuring performance, with an emphasis on how the tool is used, what measuring strategies are employed, and how the authors addressed possible inconsistencies in results; (ii) Providing an analysis of the effects of repeating performance measurements to prevent single-run outliers; (iii) Highlighting potential pitfalls for research and practice using single run-based results provided by Lighthouse; and (iv) Serving as a base for research studies on mitigating unrepresentative web performance measurements.

2 Literature review

A literature review was performed to find existing research utilizing Lighthouse as the tool for estimating web performance. A full-text search was conducted using the search string »Google Lighthouse« in the following digital libraries: ACM Digital Library (https://dl.acm.org/), Google Scholar (https://scholar.google.com/), IEEE Xplore (https://ieeexplore.ieee.org/), and Web of Science (https://apps.webofknowledge.com/). The search was carried out on July 28, 2021, and altogether 134 studies were retrieved from the search. Inclusion and exclusion criteria guided the study selection process. Only journal and conference papers were considered. Materials not accessible in English were excluded. Any research that only described Lighthouse theoretically was excluded, and all papers where Lighthouse was not used for performance measurements were excluded as well. After the review process, 8 primary studies were selected.
The list of primary studies is available in Table 1. All primary studies were published in conference proceedings in recent years, in 2018 or later. From the performance measurements made with Lighthouse, the primary studies most commonly used the Performance Score (S1-S3, S6), a single-value indicator of a website's overall performance, for their further analysis. The following more specific time metrics were also used commonly: Speed Index (S2, S4, S7, S8), First Meaningful Paint (S1, S2, S4, S7) and Estimated Input Latency (S1, S2, S7). Researchers observed between 1 and 21 websites in each study. Less than half of the primary studies (S1, S4, S7) noted some variance between runs when auditing the same website due to uncontrollable variables, and employed some strategies to mitigate this problem. In two studies (S4, S7), the authors repeated runs consecutively, while in one study (S1), researchers ran performance audits multiple times through the day. The number of runs varied from 5 to 100. Two studies (S1, S4) then used the mean for aggregating multiple runs into a single value, while one study (S7) used the median.

Table 1: A list of primary studies selected in the literature review with their performance audit strategies.

ID   Year   Performance audit strategy       Ref
S1   2018   5 repetitions through the day    [6]
S2   2019   1 run                            [12]
S3   2019   1 run                            [15]
S4   2019   100 consecutive repetitions      [13]
S5   2019   1 run                            [10]
S6   2020   1 run                            [4]
S7   2020   30 consecutive repetitions       [14]
S8   2021   1 run                            [18]

3 Experiment

An experiment was performed to further demonstrate how single performance audits can be unrepresentative in some scenarios, and to investigate how the number of runs affects variability. In the experiment, 10 real websites were used, selected randomly from the Alexa Top 500 list, which includes the top-ranked websites on the web [1]. The selected websites were: AliExpress (https://www.aliexpress.com/), Amazon India (https://www.amazon.in/), Bola (https://www.bola.com/), Freepik (https://www.freepik.com/), IKEA (https://www.ikea.com/), Mercari (https://www.mercari.com/), Shopify (https://www.shopify.com/), Unsplash (https://unsplash.com/), Wix (https://www.wix.com/), and Zendesk (https://www.zendesk.com/), further referred to as W1, W2, W3, W4, W5, W6, W7, W8, W9, and W10, respectively. The selected websites ranged from simple static presentation websites to dynamic websites, i.e., e-commerce websites and news sites.

All websites were audited with Google Lighthouse v7.3.0 on a device with macOS 10.15.7 using Headless Chrome. A desktop device was emulated using a broadband network connection. For each website, performance audits were performed 4 times, each time with an increasing number of repeated independent runs (N=1, 5, 10, 100). During the auditing process, external factors were isolated as much as possible. The experiment was conducted on July 29, 2021 between 4:42 and 8:37 PM (GMT+2).

From the collected results, we used the Performance Score for analysis, as this value captures the overall web performance of a website. It is calculated as a weighted average of six metric scores, each metric representing some aspect of a website's performance. The Performance Score can have the following values: Poor (0-50), Needs improvement (50-89) and Good (90-100) [9].

Data analysis was conducted using IBM SPSS Statistics v27. Descriptive statistics were used to present the characteristics of the sets of data collected with a different number of consecutive runs, including a description and spread of the data in each set. Mood's median test was performed to estimate if the medians of data sets from different runs on the same website were equal.
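The strategy examined here — N consecutive runs aggregated by the median — can be scripted around the Lighthouse CLI. A sketch in Python, assuming the lighthouse npm package and Chrome are installed; the flag names follow recent Lighthouse CLI releases and can differ between versions (e.g., the desktop preset), so treat the invocation as illustrative rather than the study's exact setup.

    import json
    import statistics
    import subprocess

    def performance_score(url):
        """One Lighthouse performance audit; returns the 0-100 Performance Score."""
        report = subprocess.run(
            ["lighthouse", url,
             "--only-categories=performance",
             "--preset=desktop",              # desktop emulation; version-dependent flag
             "--chrome-flags=--headless",
             "--output=json", "--output-path=stdout", "--quiet"],
            capture_output=True, text=True, check=True).stdout
        return json.loads(report)["categories"]["performance"]["score"] * 100

    def audited_score(url, runs=5):
        """N consecutive runs aggregated by the median."""
        return statistics.median(performance_score(url) for _ in range(runs))

    print(audited_score("https://unsplash.com/", runs=5))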
4 Results and Discussion

The distribution of the Performance Scores of each website for N=100 is presented with boxplots in Figure 1. It can easily be observed that, for almost all websites (except for W4), some performance measurements occurred that were not typical representatives of the website's performance. Suppose one of these outlier measurements was the only assessment run; this can lead to an unrepresentative result. Consequently, wrongful conclusions and decisions can be made, e.g., a developer may think a change he recently implemented into the code made performance worse, yet, instead, this occurred due to fluctuation in the network, web, or client device. An interesting observation is that, due to variability in the measurement results, a website can be interpreted in a different score group, e.g., the W1, W2, W4, and W9 results are dispersed between the score groups Good and Needs improvement, and W3 between the score groups Poor and Needs improvement.

Figure 1: Data distribution of Performance Scores (N=100).

Detailed results for all sets of data are presented in Table 2, providing insight into how the data are spread, how much repeating the test reduces variability, and how results stabilize as the number of test runs increases. These results further illustrate the differences between single and multiple runs, which can provide a more reliable estimate of a website's performance; they therefore provide a rationale why addressing intrinsic fluctuations when dealing with time-based metrics should be considered, and why a single run can (in some cases) not be representative enough to provide reliable measurements. We argue that the use of a median value for aggregation is preferred over other measures of central tendency to minimize the impact of outliers.

Table 2: Performance Scores and their statistics across different numbers of runs.

              N=1    N=5    N=10   N=100
W1   Mean     87     91     91.2   90.9
     Median   87     92     92     92
     Range    /      6      6      21
     SD       /      2.3    1.9    2.9
W2   Mean     86     86     88.3   89.5
     Median   86     90     90     90
     Range    /      18     20     20
     SD       /      7.5    5.7    2.8
W3   Mean     40     42.8   44.3   43.3
     Median   40     43     44     44
     Range    /      5      10     14
     SD       /      1.9    2.8    1.9
W4   Mean     89     88.4   88.3   88.6
     Median   89     89     88     89
     Range    /      5      5      7
     SD       /      1.9    1.5    1.5
W5   Mean     99     98.6   98.5   98.6
     Median   99     99     98.5   99
     Range    /      1      1      5
     SD       /      0.5    0.5    0.9
W6   Mean     28     35     32.9   33.9
     Median   28     34     31     33.5
     Range    /      15     16     21
     SD       /      7.3    6.1    3.9
W7   Mean     98     98     97.6   97.3
     Median   98     98     98     98
     Range    /      0      2      5
     SD       /      0      0.7    1.2
W8   Mean     94     94.4   94.4   94.5
     Median   94     94     94     94
     Range    /      1      1      4
     SD       /      0      0.5    0.6
W9   Mean     77     72.4   72.1   73.3
     Median   77     72     72     73
     Range    /      7      8      23
     SD       /      2.9    2.5    4.2
W10  Mean     75     83.4   85.3   86.1
     Median   75     86     87     87
     Range    /      12     13     13
     SD       /      4.9    3.9    2.5

Mood's median test, performed for N=5, N=10, and N=100, showed that the medians of the Performance Score were the same across all three categories of runs for all websites, except W5, where the test could not be performed. These results, presented in Table 3, indicate that 5 runs, in comparison to 10 and 100 runs, are sufficient to eliminate possible outliers while still performing web performance testing in a reasonable time.

Table 3: Mood's median test by website, comparing groups of N=5, N=10, and N=100.

              W1     W2     W3     W4     W5   W6     W7     W8     W9     W10
Asymp. Sig.   0.996  0.723  0.137  0.557  *    0.744  0.927  0.879  0.211  0.758

* All values are less than or equal to the median; Mood's median test could not be performed.
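Mood's median test is also available outside SPSS; the sketch below uses scipy.stats.median_test on hypothetical Performance Scores for one website (the real input would be the three batches of audit results per website, as in Table 3).

    from scipy.stats import median_test

    # Hypothetical scores for one website from the N=5, N=10 and N=100 batches:
    runs_5 = [92, 92, 90, 94, 89]
    runs_10 = [92, 91, 93, 90, 92, 88, 92, 94, 91, 92]
    runs_100 = [92] * 50 + [93] * 20 + [91] * 20 + [89] * 8 + [71] * 2

    stat, p, grand_median, table = median_test(runs_5, runs_10, runs_100)
    print(f"grand median = {grand_median}, asymptotic p-value = {p:.3f}")
    # A large p-value means the batches' medians cannot be distinguished,
    # which is how the Asymp. Sig. values in Table 3 read.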
5 Conclusion

Several strategies can be employed to reduce random noise, measurement bias and errors when using Lighthouse for web performance measurements. In the paper, we performed a literature review in which we selected studies using Lighthouse for estimating web performance. The results show that more than half of the primary studies did not employ any specific strategy to address variability in web performance measurements. The others use a reasonably straightforward approach: repeat the Lighthouse audit multiple times and summarize the repeated runs using a mean or median. However, a large discrepancy was noticed in these works in the number of runs and the measures of central tendency used to aggregate multiple runs into a single-value result. Thus, we investigated this further empirically by conducting an experiment on real popular websites, to demonstrate how the number of runs affects variability and prevents single-run outliers. With this, we highlighted how measurement results from single runs can be misleading and unrepresentative; therefore, we recommend for research and practice to run performance tests multiple times and use an aggregation value. Based on our results, performing performance audits 5 times greatly reduces variability in results in a reasonable time. Our study provides a base for future research studies addressing outliers in web performance testing, and guidelines for future studies on how to perform representative web performance measurements with Lighthouse.

Acknowledgment

The authors acknowledge the financial support from the Slovenian Research Agency (Research Core Funding No. P2-0057).

References

[1] Alexa Internet, Inc. The top 500 sites on the web, 2021.
[2] Arapakis, I., Bai, X., and Cambazoglu, B. B. Impact of response latency on user behavior in web search. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval (New York, 2014), SIGIR '14, ACM, pp. 103–112.
[3] Beyer, D., Löwe, S., and Wendler, P. Reliable benchmarking: Requirements and solutions. International Journal on Software Tools for Technology Transfer 21, 1 (2019), 1–29.
[4] Chan-Jong-Chu, K., Islam, T., Exposito, M. M., Sheombar, S., Valladares, C., Philippot, O., Grua, E. M., and Malavolta, I. Investigating the correlation between performance scores and energy consumption of mobile web apps. In Proceedings of the Evaluation and Assessment in Software Engineering (New York, 2020), EASE '20, ACM, pp. 190–199.
[5] Chen, J., and Revels, J. Robust benchmarking in noisy environments, 2016.
[6] Gambhir, A., and Raj, G. Analysis of cache in service worker and performance scoring of progressive web application. In 2018 International Conference on Advances in Computing and Communication Engineering (ICACCE) (2018).
[7] Google Developers. Variability, 2019.
[8] Google Developers. Lighthouse, 2021.
[9] Google Developers. Lighthouse performance scoring, 2021.
[10] Johnston, O., Jarman, D., Berry, J., Zhou, Z. Q., and Chen, T. Y. Metamorphic relations for detection of performance anomalies. In 2019 IEEE/ACM 4th International Workshop on Metamorphic Testing (MET) (2019), pp. 63–69.
[11] Kalibera, T., and Jones, R. Rigorous benchmarking in reasonable time. In Proceedings of the 2013 International Symposium on Memory Management (New York, 2013), ISMM '13, ACM, pp. 63–74.
[12] Nurshuhada, A., Yusop, R. O. M., Azmi, A., Ismail, S. A., Sarkan, H. M., and Kama, N. Enhancing performance aspect in usability guidelines for mobile web application. In 2019 6th International Conference on Research and Innovation in Information Systems (ICRIIS) (2019), pp. 1–6.
[13] Rahman, S., and Wittie, M. P. MR-DNS: Multi-resolution Domain Name System. In Internet and Distributed Computing Systems (Cham, 2019), Springer International Publishing, pp. 191–202.
[14] Riet, J. v., Paganelli, F., and Malavolta, I. From 6.2 to 0.15 seconds – an industrial case study on mobile web performance. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME) (2020), pp. 746–755.
[15] Rojas-Mora, J., Lincolao-Venegas, I., and Schneeberger-Leon, F. S3E2: a web-based GIS for the visualization and analysis of socioeconomic segregation in Chile's elementary education system. In Proceedings of the 1st International Conference on Geospatial Information Sciences (2019), vol. 13 of Kalpa Publications in Computing, EasyChair, pp. 12–20.
[16] Shivakumar, S. K. Modern Web Performance Optimization. Apress, Berkeley, CA, 2020.
[17] Szalek, K., and Borzemski, L. Conversion rate gain with web performance optimization. A case study. In Information Systems Architecture and Technology: Proceedings of the 39th International Conference on Information Systems Architecture and Technology – ISAT 2018 (Cham, 2019), Springer, pp. 312–323.
[18] Yu, A., and Benson, T. A. Dissecting performance of production QUIC. In Proceedings of the Web Conference 2021 (New York, 2021), WWW '21, ACM, pp. 1157–1168.
Transformer-based Sarcasm Detection in English and Slovene Language

Matic Rašl, Mitja Žalik, Vid Keršič
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
matic.rasl@student.um.si, mitja.zalik1@student.um.si, vid.kersic@um.si
DOI https://doi.org/10.18690/978-961-286-516-0.10

Abstract. Sarcasm detection is an important problem in the field of natural language processing (NLP). In this paper, we compare the performance of three neural networks for sarcasm detection on English and Slovene datasets. Each network is based on a different transformer model: RoBERTa, DistilBert, and DistilBert – multilingual. In addition to the existing Twitter-based English dataset, we also created the Slovene dataset using the same approach. An F1 score of 0.72 and 0.88 was achieved on the English and Slovene dataset, respectively.

Keywords: natural language processing, sarcasm detection, transformers, RoBERTa, DistilBert

1 Introduction

Language is the essential tool for communication, both in real life and online in the digital world. With the fast growth of the internet in the last two decades, an enormous amount of text data is available to everyone, which is one of the main reasons natural language processing (NLP) has become one of the fastest-growing fields in computer science and artificial intelligence. While the most commonly used NLP application is text translation, many other applications are being researched and applied, e.g., text summarization, emotion recognition, and sarcasm and irony detection [1]. In this paper, we focus on the sarcasm detection problem.

Sarcasm detection is defined as a binary classification problem, where the goal is to detect if the given text is sarcastic [2]. The most common places to find sarcastic comments are social media platforms, e.g., Twitter, where people often express their opinions and views on different topics. While in some examples, e.g., "I work 40 hours a week for us to be this poor", it is easy to spot, sometimes, e.g., "Great, that's just what I needed!", it is harder to perceive at first sight. Detection of sarcasm is essential, because not understanding and detecting it can lead to substantial miscommunication errors and disagreements. Automatic sarcasm detection is also crucial in other problems, such as sentiment analysis, where undetected sarcasm can negatively affect an analysis. Therefore, there is a need for automatic detection of sarcastic comments and text.

This paper compares the performance of three neural networks for sarcasm detection on English and Slovene datasets. Each neural network is based on a transformer model. In the following sections, we overview the related work, describe the used datasets, present the experiment, analyze the results, and conclude the paper, emphasizing future work.

2 Related Work

Automatic sarcasm detection dates back to 2006 [3], but it has gained momentum in the past few years with advancements in the fields of neural networks and NLP. In general, sarcasm can be detected in three different ways [2]. Rule-based approaches use specific evidence, such as words or phrases, for identification. Such techniques were often used in earlier systems, such as [4].
Statistical approaches either use text features or learning algorithms to find sarcasm. Statistical methods were used in works such as [5], where combinations of positive verbs and negative situation phrases were used as classification features. The most common approach today is using deep learning techniques. For example, in [6], the model can learn user-specific context and thus achieve better results than previous state-of-the-art models.

Significant advancements in NLP tasks were achieved with transformers. They are a new form of neural network that does not use convolution or recursion. Instead, they use attention to find correlations between words in the text. Transformers can process text in parallel, allowing much faster learning than sequential methods [7]. They also achieve better results than previous methods.

With the increasing number of learning parameters, neural networks need a larger training dataset to prevent overfitting. While building large labeled datasets can be demanding, it is easy to construct large unlabelled corpora. Therefore, large models can be trained on unlabelled text data to create a good language model, i.e., expressive word embeddings. Afterwards, these representations can be used for different NLP-related tasks [5]. The mainstream architecture of the pre-trained models is Bidirectional Encoder Representations from Transformers (BERT). The initial model was pre-trained on BooksCorpus and English Wikipedia, which advanced the state of the art for eleven NLP tasks [8]. Nowadays, many BERT-based architectures exist. For example, RoBERTa (A Robustly Optimized BERT Pre-training Approach) [9] optimizes the way of masking tokens, thus improving the performance of the model. Another common architecture is DistilBert [10], which has a reduced number of training parameters. That makes its training 60% faster while retaining 97% of BERT's language understanding capabilities.

BERT has been widely and successfully utilized for sarcasm detection [11]. In [12], the accuracy is improved even more by also considering the context of sarcastic comments. The authors in [13] use RoBERTa to detect sarcasm with even higher accuracy. Although BERT-based architectures are very successful, their pre-training still has some drawbacks. Sarcasm is present primarily in informal communication (e.g., social networks such as Reddit, Twitter, etc.), which was not part of the training set. Therefore, in [14] BERT was outperformed by the context-independent GloVe embeddings model, which was pre-trained on Twitter data.

3 Datasets

Constructing a dataset for the sarcasm detection problem is not a straightforward task, since the perception of sarcasm is difficult even for people. A general approach to dataset creation is to scrape the data from different social media platforms, e.g., Twitter, Reddit, and use user-specified labels, i.e., hashtags on Twitter and /s on Reddit [11, 15, 16]. But this approach has several drawbacks, like users not annotating sarcasm with tags or misusing labels to express their opinion better. The Headlines dataset was introduced to solve the mentioned problem. The dataset contains headlines from two news websites: one where real-world events are reported, and the other with sarcastic descriptions of events, including sarcastic headlines [17]. The third common way is to manually label data, but this is time-consuming and still requires an annotator with a good sense of sarcasm.

Since no dataset for sarcasm detection in the Slovenian language exists, and manual labeling is time-consuming, we created the Slovene dataset with user-specified labels. As a knowledge base for our task, tweets (i.e., posts on Twitter) were selected. Tweets annotated by users with specific hashtags (e.g., #sarcasm, #sarkazem) were considered sarcastic (i.e., positive) examples, while other tweets were non-sarcastic (i.e., negative) examples. For the English dataset, we selected the one from the 2nd Workshop on Figurative Language Processing [11], because it was constructed in the same way as the Slovene dataset. Before training, the datasets were split into the training and the test sets, as shown in Table 1.

Table 1: Train-Test split.

Set     English   Slovene
Train   5000      759
Test    1800      272
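The hashtag-based labeling is straightforward to reproduce. A sketch of such a helper (our own illustration, not the authors' code) marks tweets carrying a sarcasm hashtag as positive and strips the label-bearing tag so the classifier cannot key on it:

    import re

    def label_tweets(tweets, sarcasm_tags=("#sarcasm", "#sarkazem")):
        """Weak labeling by user-specified hashtags: positive if a sarcasm
        hashtag is present (removed from the text), negative otherwise."""
        pattern = re.compile("|".join(map(re.escape, sarcasm_tags)), re.IGNORECASE)
        return [(pattern.sub("", text).strip(), int(bool(pattern.search(text))))
                for text in tweets]

    print(label_tweets(["Great, that's just what I needed! #sarcasm",
                        "Beautiful sunny day at the coast"]))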
4 Method

Although transformers can be fine-tuned to specialize in a specific task, the process takes a long time on common hardware. However, as shown in [13], fine-tuning can be avoided by utilizing other networks to find correlations in the transformer embeddings. Since the transformer's weights are not changing, its output can be calculated only once and then saved before learning the second part of the network. This approach significantly improves learning times.

For the experiment, we implemented a neural network model similar to the one used in [13]. As some details about the network were missing in the mentioned paper, we also relied on the implementation in [18]. The architecture of the neural network is shown in Figure 1. During the experiment, three different transformers were explored: RoBERTa [9], pre-trained on an English dataset, and two DistilBert [10] transformers, one pre-trained on the English language and the other on multiple languages (DistilBert mult). DistilBert transformers are smaller than RoBERTa (66 M vs. 125 M training parameters), which translates to a significant speedup in embedding generation time.

As mentioned before, tokenized inputs were transformed to embeddings at the beginning of the training. The embeddings were saved to the transformer output cache. Then they were used as input to the Bidirectional LSTM layer, whose outputs were concatenated with the original embeddings before the pooling layer. Before flattening the data, 1D spatial dropout was applied. After a dense and another dropout layer, a dense layer with softmax activation was applied.

Figure 1: The architecture of the neural network (Transformer → transformer output cache → Bidirectional LSTM → concatenation → pooling → dropout → flatten → dense → dropout → softmax).

For the training, the Google Colab [19] environment with a Google TPU was used. The neural networks were trained for 25 epochs with a 10% validation split. In the end, the weights of the epoch with the smallest validation loss were restored.
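A sketch of the Figure 1 classifier over cached transformer embeddings, written with tf.keras; the layer widths and dropout rates are illustrative guesses rather than the paper's exact values, and the input is assumed to be the precomputed embedding matrix (sequence length × hidden size) taken from the transformer output cache.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_classifier(seq_len=20, emb_dim=768, lstm_units=128):
        emb = layers.Input(shape=(seq_len, emb_dim))    # cached transformer embeddings
        lstm = layers.Bidirectional(
            layers.LSTM(lstm_units, return_sequences=True))(emb)
        x = layers.Concatenate()([lstm, emb])           # BiLSTM output + original embeddings
        x = layers.MaxPooling1D(pool_size=2)(x)
        x = layers.SpatialDropout1D(0.3)(x)             # 1D spatial dropout before flattening
        x = layers.Flatten()(x)
        x = layers.Dense(64, activation="relu")(x)
        x = layers.Dropout(0.3)(x)
        out = layers.Dense(2, activation="softmax")(x)  # sarcastic vs. non-sarcastic
        model = tf.keras.Model(emb, out)
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    build_classifier().summary()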
5 Experiments and Results

The accuracy of the models was tested on the test datasets with embeddings of length 20. Additionally, we tested the model using the RoBERTa transformer with embeddings of length 100, to find out how the embedding length affects the results. However, since the results with larger embeddings were similar to those obtained with smaller ones, we did not train the DistilBert-based models on larger embeddings due to hardware limitations. The results are shown in Table 2.

Table 2: Results of the experiment. The first three columns contain the transformer's name (M), the used dataset (L), and the embedding length (EL). The dataset is denoted by its language (slo for Slovene and en for English). In the last four columns, accuracy (A), precision (P), recall (R), and F1 score (F1) are presented. The results are rounded to two decimal places.

M                  L     EL    A      P      R      F1
RoBERTa            en    100   0.70   0.67   0.78   0.72
RoBERTa            en    20    0.66   0.63   0.79   0.70
DistilBert         en    20    0.63   0.60   0.74   0.67
DistilBert - mult  en    20    0.61   0.58   0.80   0.67
RoBERTa            slo   100   0.83   0.82   0.95   0.88
RoBERTa            slo   20    0.81   0.83   0.89   0.86
DistilBert         slo   20    0.74   0.83   0.78   0.80
DistilBert - mult  slo   20    0.79   0.89   0.78   0.83

For all tested models, the results on the Slovene data were significantly better than on the English data (ranging from an 11% to 18% improvement). This means that in the used Slovene dataset, sarcasm was more clearly expressed. The best results on both datasets were achieved using the RoBERTa transformer with an embedding length of 100. Even when using shorter embeddings, the RoBERTa transformer performed the best.
5 Experiments and Results

The accuracy of the models was tested on the test datasets with embeddings of length 20. Additionally, we tested the model using the RoBERTa transformer with embeddings of length 100, to find out how the embedding length affects the results. However, since the results with larger embeddings were similar to those obtained with smaller ones, we did not train the DistilBert-based models on larger embeddings, due to the hardware limitations. The results are shown in Table 2.

Table 2: Results of the experiment. The first three columns contain the transformer's name (M), the used dataset (L), and the embedding length (EL). The dataset is denoted by its language (slo for Slovene and en for English). The last four columns present the accuracy (A), precision (P), recall (R), and F1 score (F1). The results are rounded to two decimal places.

M                  L    EL   A     P     R     F1
RoBERTa            en   100  0.70  0.67  0.78  0.72
RoBERTa            en   20   0.66  0.63  0.79  0.70
DistilBert         en   20   0.63  0.60  0.74  0.67
DistilBert - mult  en   20   0.61  0.58  0.80  0.67
RoBERTa            slo  100  0.83  0.82  0.95  0.88
RoBERTa            slo  20   0.81  0.83  0.89  0.86
DistilBert         slo  20   0.74  0.83  0.78  0.80
DistilBert - mult  slo  20   0.79  0.89  0.78  0.83

For all tested models, the results on the Slovene data were significantly better than on the English data (an improvement ranging from 11 % to 18 %). This means that in the used Slovene dataset, sarcasm was expressed more clearly. The best results on both datasets were achieved using the RoBERTa transformer with an embedding length of 100. Even when using shorter embeddings, the RoBERTa transformer performed best. However, the difference between embeddings of length 100 and 20 was small (only a 2 % difference in the F1 score). Additionally, the difference between using the RoBERTa and the DistilBERT transformer is also relatively small (a 3 % to 6 % difference in the F1 score), which implies that DistilBERT can be a good alternative to RoBERTa on low-cost hardware. When using the multilingual transformer, the results on the English dataset were close to those of the English-only transformer. However, on the Slovene dataset, the multilingual transformer provided slightly better results.

In [13], the performance of the RCNN-RoBERTa model was measured on various datasets. The F1 score was between 78 % and 90 %, which is considerably better than our results in English, but comparable to those on the Slovene dataset. However, since different datasets were used in that study, the results are hard to compare. The same English dataset as applied here was used in [11], where the participants presented 13 different solutions, ranging between F1 scores of 0.58 and 0.83. Only three solutions were better than the F1 score of 0.72, which we achieved with the RoBERTa transformer and an embedding length of 100.

Dataset construction is one of the most critical parts of the experiment, since it provides the knowledge base for the transformer models. According to the obtained results, the models were able to detect sarcasm in the given examples, despite the several drawbacks explained in Section 3. The used approach is good enough for uncomplicated use cases where sarcasm is meant to be detected, since users also annotate their messages with #sarcasm. But for more complicated use cases with complex and more challenging examples, alternative methods of dataset construction should also be explored.

6 Conclusion

In this paper, we compared how the utilization of different transformers combined with a BiLSTM model affects the accuracy of sarcasm prediction. RoBERTa, English-based DistilBERT, and multilingual DistilBERT transformers were used in the experiment. All three transformers were combined with the same BiLSTM model and trained on the English and Slovene datasets. Afterwards, the accuracy of the models was obtained using test datasets in the English and Slovene languages. In the best case, F1 scores of 0.72 and 0.88 were achieved on the English and Slovene datasets, respectively.

In the future, more work could be done on dataset creation. Different dataset construction approaches, as described in Section 3, can be explored and adjusted for the Slovene language. Furthermore, the current datasets can be expanded by adding more tweets (especially Slovene) or data from different sources, e.g., Reddit. Another interesting direction for future work is exploring transfer learning to reuse the models on different languages, e.g., languages similar to Slovene.

References

[1] G. G. Chowdhury. Natural language processing. Annual Review of Information Science and Technology, 37(1):51–89, 2003. doi: 10.1002/aris.1440370103.
[2] A. Joshi, P. Bhattacharyya, and M. J. Carman. Automatic sarcasm detection: a survey. ACM Comput. Surv., 50(5), Sept. 2017. doi: 10.1145/3124420.
[3] J. Tepperman, D. Traum, and S. Narayanan. "Yeah Right": sarcasm recognition for spoken dialogue systems. In Ninth International Conference on Spoken Language Processing, 2006.
[4] T. Veale and Y. Hao. Detecting ironic intent in creative comparisons. In Proceedings of the 19th European Conference on Artificial Intelligence - ECAI 2010, pages 765–770, Amsterdam, The Netherlands. IOS Press, 2010. doi: 10.3233/978-1-60750-606-5-765.
[5] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, 63(10):1872–1897, 2020. doi: 10.1007/s11431-020-1647-3.
[6] S. Amir, B. C. Wallace, H. Lyu, P. Carvalho, and M. J. Silva. Modelling context with user embeddings for sarcasm detection in social media. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 167–177, Berlin, Germany. Association for Computational Linguistics, Aug. 2016. doi: 10.18653/v1/K16-1017.
[7] R. Kulshrestha. Transformers. 2020. url: https://towardsdatascience.com/transformers-89034557de14 (visited on 06/20/2021).
[8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018. arXiv: 1810.04805.
[9] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019. arXiv: 1907.11692.
[10] V. Sanh, L. Debut, J. Chaumond, and T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019. arXiv: 1910.01108.
[11] D. Ghosh, A. Vajpayee, and S. Muresan. A report on the 2020 sarcasm detection shared task. In Proceedings of the Second Workshop on Figurative Language Processing, pages 1–11, Online. Association for Computational Linguistics, July 2020. doi: 10.18653/v1/2020.figlang-1.1.
[12] H. Srivastava, V. Varshney, S. Kumari, and S. Srivastava. A Novel Hierarchical BERT Architecture for Sarcasm Detection. In Proceedings of the Second Workshop on Figurative Language Processing, pages 93–97, Stroudsburg, PA, USA. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.figlang-1.14.
[13] R. A. Potamias, G. Siolas, and A. G. Stafylopatis. A transformer-based approach to irony and sarcasm detection. Neural Computing and Applications, 32(23):17309–17320, Dec. 2020. doi: 10.1007/s00521-020-05102-3.
[14] A. Khatri and P. P. Sarcasm detection in tweets with BERT and GloVe embeddings. In Proceedings of the Second Workshop on Figurative Language Processing, pages 56–60, Online. Association for Computational Linguistics, July 2020. doi: 10.18653/v1/2020.figlang-1.7.
[15] M. Bouazizi and T. Otsuki Ohtsuki. A pattern-based approach for sarcasm detection on Twitter. IEEE Access, 4:5477–5488, 2016. doi: 10.1109/ACCESS.2016.2594194.
[16] M. Khodak, N. Saunshi, and K. Vodrahalli. A Large Self-Annotated Corpus for Sarcasm, 2017. arXiv: 1704.05579.
[17] R. Misra and P. Arora. Sarcasm Detection using Hybrid Neural Network, 2019. arXiv: 1908.07414.
[18] L. Famiglini. Irony-Sarcasm-Detection-Task. url: https://github.com/lorenzofamiglini/Irony-Sarcasm-Detection-Task (visited on 06/25/2021).
[19] Google Colab. url: https://colab.research.google.com/notebooks/intro.ipynb (visited on 06/06/2020).
Extraction and Analysis of Sport Activity Data Inside Certain Area

Luka Lukač
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
luka.lukac@student.um.si
DOI https://doi.org/10.18690/978-961-286-516-0.11   ISBN 978-961-286-516-0

Abstract

Nowadays, sport data analysis is one of the crucial factors used to enhance the athletes' performance, which can depend upon many different circumstances. One of those is the area of an exercise, which can dramatically impact an athlete's performance. Since not enough devotion has been given to this topic, this study focuses on extracting and analysing the parts of exercises which take place inside a specific area, using principles from another part of Computer Science, Computational Geometry.

Keywords: sports, area, extraction, analysis

1 Introduction

As long as professional sport competitions have been organised, professional as well as amateur athletes have been constantly trying to enhance their performance. In order to know in which aspects their exercises can be improved, tracking and analysing data is crucial. One of the important factors which can influence the performance is the area where the training is performed. This study focuses on the extraction of the data inside certain areas, which allows a thorough analysis of the effectiveness of an athlete in those regions.

Nowadays, many athletes monitor their performance using modern sport equipment, such as sport watches or sport wristbands. There are websites that are focused on collecting that kind of data; however, it is very likely that the data does not have public access [3] and that the databases require a lot of preprocessing skills. An example of a toolbox for extracting features from a sport activity is 'sport-activities-features', which is available in a GitHub repository at https://github.com/firefly-cpp/sport-activities-features. It is focused on one of the most intensive and demanding tasks in Machine Learning: data preprocessing. [2]

The overall load indicators that are part of sport activity datasets (total distance, total duration, average heart rate, etc.) have some disadvantages, such as that the details are not expressed sufficiently, only a general outlook of the training is captured, and the different phases of the training are not recognized directly, or are not recognized at all. Thus, more hidden indicators, like intervals, topographic maps and weather data, are extracted in the 'sport-activities-features' toolbox. [2]

One of the crucial features that allows retrieving a lot of interesting and useful data is area detection, which is the main part of this study. It is well known that doing exercises in hilly areas can be a lot more difficult and demanding. Despite an apparently worse performance in those regions (if only the overall indicators are observed), an athlete can put more power into that activity compared to one which is set in a flat area. As different areas may have distinct characteristics, it is important to be able to analyse the performance of an athlete in a certain area.

Datasets that are used to extract this kind of data must include the positions of an athlete at a certain time. If a dataset is corrupt or some data are missing, the analysis can output wrong results; therefore, it is very important to have datasets without such issues.
2 Materials and Methods

2.1 Materials

These days, a lot of athletes track their performance during the training using modern sport trackers, which allow them to capture a lot of indicators of their exercise. Since there are many indicators that are not visible directly, but can be extracted as a product of an extensive data analysis, a sport activity can be analysed more thoroughly and deeply. A great tool for analysing the data is Machine Learning, whose efficient computational methods enabled the rise of research in the automatic planning of sport training sessions. [1]

2.2 Methods

In order to unambiguously detect the points that are located inside the chosen area, an inclusion test is a necessity. There are three major algorithms one can use to determine whether a point is inside the area:
• the algorithm of the same signs,
• the algorithm of the sum of angles,
• the algorithm with rays.

To start with, the first algorithm that was implemented, and seemed to be working, was the algorithm of the same signs. Its principle is quite simple: if the sign of the cross product between the hull line and the line from the hull to the point is the same for all the hull lines, the point lies within the polygon that represents the area with great certainty. However, there are two significant disadvantages regarding this algorithm: firstly, it does not allow holes inside the area, and secondly, it does not allow concave polygons (representing the area). [4]

Figure 1: Algorithm of the same signs.

The next algorithm that was considered was the algorithm of the sum of angles. From the point which has been chosen for the inclusion test, a ray is made to each point that is located on the hull.
The next step is to calculate the sum of the angles between the rays constructed in the previous step. If the sum is 360°, the point is in the area, and if the sum is 0°, the point lies outside the area. Although this algorithm allows concave angles on the border of the area, holes are still not permitted, which is why it has not been implemented in the package as an inclusion test. [4]

Figure 2: Algorithm of the sum of angles.

Finally, the third algorithm, which uses rays to determine whether a point lies inside an area, was implemented. A ray from the point which we want to check, in any direction, is constructed. If the ray intersects the area border an even number of times, the point is not a part of the area. On the other hand, if there is an odd number of intersections between the ray and the hull, the point lies inside the area. This algorithm is much less error-prone than the aforementioned algorithms, as it produces the right results even if there are holes or concave angles present in the area, which is why this algorithm has been chosen for determining which parts of an exercise were inside the given area. [5]

Figure 3: Ray algorithm.

1 function GET-POINTS-INSIDE-AREA(positions, area)
2   points_in_area ← {};
3   for position ← 1 to positions.count do
4     n ← number_of_intersections(position, area);
5     if n % 2 = 1 then
6       points_in_area ← points_in_area + position;
7     end
8   end
9   return points_in_area;
Algorithm 1: Detection of points of an exercise inside the given area.
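For illustration, the even-odd rule behind Algorithm 1 can be implemented in a few lines of Python (a minimal sketch with our own helper names; the area is given as a list of hull vertices):

# A compact ray-casting (even-odd rule) inclusion test, following the idea
# of Algorithm 1. Helper names are illustrative.
def point_in_area(point, area):
    x, y = point
    inside = False
    n = len(area)
    for i in range(n):
        (x1, y1), (x2, y2) = area[i], area[(i + 1) % n]
        # Count crossings of a horizontal ray cast to the right of the point.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside   # an odd number of crossings => inside
    return inside

def get_points_inside_area(positions, area):
    return [p for p in positions if point_in_area(p, area)]

# Example: a concave area and two test positions.
area = [(0, 0), (4, 0), (4, 4), (2, 2), (0, 4)]
print(get_points_inside_area([(1, 1), (2, 3.5)], area))  # -> [(1, 1)]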
3 Results

After the implementation of the algorithm that detects the parts of an exercise inside an area, a visualisation was made in order to be able to test the algorithm properly. To start with, the map sectors where the exercise is taking place are downloaded from OSM (OpenStreetMap) and merged into an image, represented as a map. Then, the area that has been used for the extraction of the data is plotted on the map, and lastly, the identified and unidentified points of an exercise are drawn on the map according to their coordinates. An example of the visualization can be found in Figure 5, where the wider area of the Slovenian Littoral is displayed on the map and the exercise is plotted.

Figure 4: Visualization of the area detection.

However, not only are the parts inside the area extracted, but the data inside the area are also analysed. As already pointed out, the performance of an athlete inside different areas can vary significantly. This is why it is important to be able to retrieve data specific to a certain region. One important factor about areas is also what the performance of an athlete was like, not only in one training session, but in more exercises that took place at different times. This is why data such as the whole distance, the average speed, the maximum speed, the average heart rate, etc., are extracted from the activities, or their parts, which took place inside the same area. The list of the extracted features is given in Table 1.

Table 1: List of extracted features inside area.
ID   Feature              Description
1    Distance             Total distance inside area.
2    Time                 Total time inside area.
3    Maximum speed        Maximum speed inside area.
4    Average speed        Average speed inside area.
5    Minimum heart rate   Minimum heart rate inside area.
6    Maximum heart rate   Maximum heart rate inside area.
7    Average heart rate   Average heart rate inside area.

Figure 5: Visualization of more exercises inside area.

Despite the algorithm working flawlessly, there can be issues with obtaining datasets which would suit the analysis. As this kind of data is usually not publicly available, a user is often limited to analysing their own activities, which is one of the reasons why there is only a limited number of exercises visualized and displayed in this paper.

4 Discussion

To conclude, area identification is one of the crucial factors in sport data analysis. Since different areas have different difficulties, it is crucial to be able to analyse the regions separately. This study portrays how this task can be done using algorithms from Computational Geometry. In the beginning, the parts of exercises inside the desired areas should be extracted, and then the extracted data can be used for the analysis of the performance of an athlete. The next step, which has not been a part of this study, could be the Machine Learning phase, during which the extracted data could be analysed even further.

Acknowledgment

The author of this paper would like to kindly thank Iztok Fister Jr., who provided a lot of motivation and assistance during the writing of this study.

References

[1] Fister, I., Fister Jr., I., and Fister, D. Computational intelligence in sports. Springer, 2019.
[2] Fister Jr., I., Lukač, L., Rajšp, A., Fister, I., Pečnik, L., and Fister, D. A minimalistic toolbox for extracting features from sport activity files. In 2021 IEEE 25th International Conference on Intelligent Engineering Systems (INES) (2021), IEEE, pp. 121–126.
[3] Rajšp, A., and Fister Jr., I. A systematic literature review of intelligent data analysis methods for smart sport training. Applied Sciences 10, 9 (2020), 3013.
[4] Topiwala, A. Is the point inside the polygon? Towards Data Science (2020).
[5] Vacek, L., Atter, E., Rizo, P., and Nam, B. sUAS for deployment and recovery of an environmental sensor probe.

Methodology for the Assessment of the Text Similarity of Documents in the CORE Open Access Data Set of Scholarly Documents

Ivan Kovačič
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
ivan.kovacic@um.si

David Bajs
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
david.bajs@student.um.si

Milan Ojsteršek
University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia
milan.ojstersek@um.si
DOI https://doi.org/10.18690/978-961-286-516-0.12   ISBN 978-961-286-516-0

Abstract

This paper describes the methodology of the data preparation and analysis of the text similarity required for plagiarism detection on the CORE data set. Firstly, we used the CrossREF API and the Microsoft Academic Graph data set for metadata enrichment and the elimination of duplicates of documents from the CORE 2018 data set. In the second step, we used 4-gram sequences of words from every document and transformed them into SHA-256 hash values. The features retrieved using the hashing algorithm are compared, and the result is a list of documents and the percentages of coverage between pairs of documents' features. In the third step, called pairwise feature-based exhaustive analysis, the pairs of documents are checked using the longest common substring.

Keywords: data preparation for text similarity analysis, text similarity, CORE data set, metadata enrichment
1 INTRODUCTION

Text similarity is a measure of the degree of content similarity between two textual documents. It is used as a metric which assists in plagiarism detection. In general, plagiarism is a term that is used to describe acts of intellectual theft and violations of the material and moral rights of the authors of intellectual property. However, there are several definitions of plagiarism that may be applied, depending on the circumstances of each individual case of plagiarism. Maurer et al. [9] have compiled a list of possible definitions of plagiarism. They include:
• presenting other people's work as your own,
• copying other people's work (or ideas) without giving credit,
• improper quotation,
• copying sentence structures with minor modifications without giving credit,
• copying the majority of your work from others, regardless of giving credit [9].

In order to efficiently assert the existence and degree of plagiarism, electronic detection is necessary along with manual analysis. Using electronic detection, the similarity index [1] is a metric which is reliable for identifying potential cases of plagiarism, but it does not necessarily indicate that plagiarism has occurred in an individual case. Another important step is to obtain a sufficient amount of metadata about any given document which is analyzed. Using a sufficient amount of metadata, it is easier to detect duplicate records of the same document, which allows for a higher quality assessment of the degree of plagiarism of a given document. If a journal article is recorded twice in the same data set with inconsistent metadata, there exists a pair of documents with a high degree of similarity. Without good quality metadata, the pair of documents may appear as totally different journal articles, although they represent the same instance of intellectual work. In the following sections, we describe our methodology of data preparation for text similarity, metadata enrichment and the computation of text similarity metrics on the CORE data set of scholarly documents, as well as an analysis of the run-time complexity of our methodology.

2 RELATED WORK

Eaton and Crossman did a review of the literature in the field of self-plagiarism in social sciences research databases and found that only 5.8 % of a sample of search results consisted of primary research done in this field, which indicates that the field of self-plagiarism is a relevant topic of research in the social sciences; we may also infer a higher relevance of this research topic for computational methods of detecting self-plagiarism [2].

In [8], a study was described in which a sample of articles that were published in the Scientific Periodicals Electronic Library (SPELL) was analyzed for plagiarism. A comparison was made between two samples, taken from the years 2013 and 2018. The former sample contained literal reproductions in 65.9 % of the total sample size, while the latter sample from 2018 contained evidence of plagiarism in 44 % of the total sample size. A reduction of the similarity index was noticeable, the difference being attributed to the fact that the sample of articles from 2018 contained guidelines related to plagiarism and self-plagiarism. The methodology of this study consisted of random sampling of the data set and inputting the samples into the iThenticate plagiarism detector.

García-Romero and Estrada-Lorenzo have described the analysis of the Déjà vu database using citation analysis and found that cases of plagiarism are more frequently published in journals with lower visibility, and also confirmed that duplicate documents not citing the original document show a higher degree of full-text similarity than those citing it [5].
3 METHODOLOGY

In this section, we present our methodology of applying plagiarism detection algorithms and procedures on the CORE data set, including the metadata enrichment techniques and the plagiarism detection methods used in this study.

3.1 The CORE data set

CORE is one of the leading aggregation services of open-access scholarly documents publicly available on the web. Currently, CORE has aggregated the metadata of over 200 million open-access research articles and other scholarly documents, harvested from over 10,000 open-access repositories. A full list of the harvested repositories may be seen in [12]. In our study, we used the 2018-03-01 CORE data set. The data set contains 9,767,152 unique documents with full text. Each individual document is represented as a JSON record containing both the full text of the document as well as the extracted metadata. The schema of the JSON records is described in Table 1.

Table 1: CORE data set JSON schema.
JSON attribute       Description
coreId               Unique CORE identifier
doi                  DOI identifier (if available)
title                Title of the document
authors              Authors (array)
contributors         Contributors (array)
datePublished        Date of publication
abstract             The abstract of the document
downloadUrl          URL of the PDF document
language             Language of the document
fullTextIdentifier   Full text unique ID
pdfHashValue         PDF signature hash
journals             List of journals or journal aliases
rawRecordXml         Harvested OAI-PMH XML response
year                 Year of publication
relations            Related entities (journals, journal aliases)
topics               OAI-PMH dc:topic value
subjects             OAI-PMH dc:subject value
issn                 ISSN (if available)
fullText             Full text of the document

Metadata is important for plagiarism/self-plagiarism analysis, because it allows an expert who analyses an individual case of plagiarism to evaluate the authorship, the year of publication and other data of a document which is being analyzed in comparison to the document's potential candidates for document similarity. The CORE data set does provide a rich aggregation of metadata for scholarly documents, but it is also limited by the quality of the metadata provided by the individual servers in the harvested OAI-PMH repositories. The CORE team actively develops means of enriching/improving the metadata associated with each individual document, since there exist issues with the provided metadata. These issues include:
• Duplicate document entries - there exist several duplicates in the CORE data set. These are present because of multiple publications of the same document in several different repositories that are harvested by CORE. The individual repositories have inconsistent metadata, which makes the detection of duplicates a non-trivial task.
• Unstructured metadata - most documents in the 2018 CORE data set contain insufficiently structured metadata, since the aggregation process makes it difficult to ascertain data like the journal name and the publisher. This information goes missing in the dc:topic and dc:subject fields, and mapping the data in these fields to their respective metadata fields is a non-trivial task.
• Non-defined document types
• Non-defined fields of study
• Non-normalized author names

The issues can be elaborated with a concrete example. The document with CORE ID 47847252 is a book section written in French. The document type, found in the subjects field, is "Book sections", which is a non-standardized document type, originating from the terminology used by the digital library that hosts the OAI server which provided this document. The authors' names are recorded with the last name written first, followed by the initial of the first name. In contrast, the document with CORE ID 6871221 has its authors written with the first name first, followed by the last name. Such inconsistencies, which are the result of different bibliographic notations, increase the difficulty of the automated detection of self-plagiarism and/or duplication of documents.
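Since each document is a plain JSON record, it can be consumed directly. The following is a minimal illustration using the attribute names from Table 1 (the file name is hypothetical):

# Minimal illustration of reading one CORE JSON record; attribute names
# follow Table 1, the file name is hypothetical.
import json

with open("core_record.json") as fh:
    record = json.load(fh)

print(record["coreId"], record.get("doi"), record["title"], record["year"])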
3.2 Microsoft Academic Graph (MAG)

In order to tackle the issue of incomplete and inconsistent metadata, we utilized the mapping to the Microsoft Academic Graph done by the CORE team, with the help of their published 2019-04-01 MAG mapping data set. The Microsoft Academic Graph is a document graph that contains metadata records about scholarly documents and the citation relationships between them, as well as normalized data about authors, journals, fields of study and other metadata. By connecting CORE to MAG using the DOI persistent identifier, an intersection of over 1,200,000 articles was made in 2016 [6]. The MAG data set contains structured metadata of scholarly documents. It is organized the same way a classical relational database is, with individual schemas representing entities, and the connections between them represented as foreign keys. A full description of the MAG schema is available in [3]. We emphasize some of the more useful metadata entries in MAG which allow for the metadata enrichment of CORE:
• Normalized authors - the authors in MAG are normalized in a way that makes them uniquely identifiable and distinguishable from other authors. The displayed author name is separated from the author's normalized name. Where available, the authors are also associated with an affiliation.
• Fields of study - the papers documented in MAG have a mapping to the fields of study of that particular paper. The fields of study are organized in a hierarchical structure, which allows us to construct a tree structure of fields of study for each paper in MAG.
• Journal data - in MAG, journal articles are associated with journals, which have their names normalized and other metadata associated with them, where available. Namely, a citation count, the publisher of the journal and the ISSN, where available [11].

3.3 CrossREF API

CrossREF is a service that aggregates metadata from publishers and provides tools for several use cases related to metadata; among other things, it is an agent of the DOI Foundation (DF), and it allows users to register DOIs for their scholarly documents. One of the most useful features of CrossREF is the CrossREF public API, which allows for metadata enrichment by resolving a DOI which is contained in the CrossREF database into the metadata of the associated document. The metadata provided by CrossREF is rich and reliable, with information about the journal and the publisher, the authors with affiliations (where available), the document types (e.g. journal article or conference paper), the reference counts, and the volume and issue, among others. The API itself is free to use, with no sign-up or authentication required; the client who consumes the REST API must only adhere to the throttling mechanisms specified by CrossREF, which limit the number of requests per second available to the client. The API returns a requested sleep interval for the client in the X-Rate-Limit-Interval HTTP header.

3.4 Metadata enrichment of CORE using CrossREF and MAG

Our approach to metadata enrichment utilized the CORE MAG mapping with some additional enhancements. We imported a published MAG dump available in [10]. When a document in CORE had a DOI available, the document's DOI was used for pairing with MAG where the mapping was not previously present in the 2019-04-01 MAG mapping data set. In addition, we normalized the document titles and looked for matches in MAG, excluding the pairings where anomalies had been detected (non-matching year of publication, non-matching authors). We further enriched the metadata of these documents by consuming the CrossREF API and acquiring the ISSN, the proper volume, issue and publisher for the documents. Over 30 % of the articles in CORE that have a DOI have no ISSN associated with them in the MAG dump provided in [10]. Our method of metadata enrichment is summarized by Algorithm 1.

Data: set of CORE documents
Result: set of CORE documents with enriched metadata
1  for document in set of CORE documents do
2    if document in CORE MAG mapping and document has DOI then
3      consume CrossREF API;
4      map result to metadata properties of the CORE document;
5    else if normalized document title and year match then
6      enrich document using new MAG mapping;
7      consume CrossREF API;
8      map result to metadata properties of CORE document;
9    else
10     mark document as incomplete;
11   end
12 end
Algorithm 1: CORE metadata enrichment with MAG and CrossREF.
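As an illustration, a single enrichment step against the public CrossREF API could look as follows (a sketch; the field selection and error handling are simplified, the helper name is ours, and the works endpoint and the X-Rate-Limit-Interval header are part of the public CrossREF REST API described above):

# Illustrative sketch of one enrichment step via the public CrossREF API.
import requests, time

def crossref_metadata(doi):
    resp = requests.get(f"https://api.crossref.org/works/{doi}")
    # Respect the sleep interval advertised by CrossREF (simplified).
    interval = resp.headers.get("X-Rate-Limit-Interval", "1s")
    time.sleep(float(interval.rstrip("s")))
    if resp.status_code != 200:
        return None                      # DOI not registered with CrossREF
    msg = resp.json()["message"]
    return {
        "issn": msg.get("ISSN", [None])[0],
        "publisher": msg.get("publisher"),
        "volume": msg.get("volume"),
        "issue": msg.get("issue"),
        "type": msg.get("type"),         # e.g. journal-article
    }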
3.5 Text matching in the CORE data set

In this section, we describe how plagiarism detection was implemented on the CORE data set. We serialized the entire CORE 2018-03-01 data set of JSON records into a PostgreSQL database, creating a relational model which corresponds to the JSON schema of CORE. To implement a plagiarism detector, we firstly had to implement a candidate search, followed by a pairwise text matching algorithm.

3.5.1 Candidate retrieval

For finding the plagiarism candidates for each individual document, we computed n-gram models for each document in CORE. The full text of the documents is tokenized with a minimum sentence length of 40 characters, using rules for mapping ordinal numbers, URLs and email addresses to tokens, removing stop words and computing lexicographically sorted n-gram sequences with n = 4. The minimum length of 40 characters was chosen because it is used in the Slovenian open access infrastructure. Sequences of 4-grams have been determined as more efficient for candidate retrieval in previous studies [4]. A space-delimited sequence of sentence n-grams is transformed into an SHA-256 hash value, which is stored in a database. The set of all n-gram hashes represents the set of features of the entire full text which can be matched to other documents, thus acquiring lists of potential candidates for plagiarism. In relational databases, this approach can be implemented efficiently by performing a table join of the table containing the SHA-256 hashes with itself, with the table itself being structured as a key-value map, where the key is a unique ID of the document which is mapped to the hashed n-gram value. In order for such a table join query to be time-efficient, indexes are necessary on the column containing the n-gram hashes, which drastically increases the space complexity of the described method. Once the self-join query had been performed, a hash coverage index was computed:

hashCoverage = unique covered hashes in candidate / total number of hashes in document

The candidates are then sorted in descending order by the hashCoverage metric and stored into the database as a candidate list. The procedure of the candidate search is summarized in Algorithm 2.

Data: fileid - CORE document ID
Result: set of candidate documents in CORE
1  for document in set of CORE documents do
2    perform self-join on the table of hashes where document ID = fileid;
3    compute hashCoverage;
4    sort candidate list by hashCoverage descending;
5    store candidate list into database;
6  end
Algorithm 2: Candidate retrieval with hashed n-gram sequences.
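A simplified sketch of the feature extraction and the coverage metric is shown below (the tokenization is reduced to lowercasing and splitting; the real pipeline also maps URLs, e-mails and ordinal numbers to tokens and removes stop words):

# Sketch of the candidate-retrieval features: lexicographically sorted
# 4-gram sequences hashed with SHA-256, plus the hashCoverage metric.
import hashlib

def sentence_features(sentence, n=4):
    tokens = sorted(sentence.lower().split())        # lexicographic sort
    ngrams = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
    return {hashlib.sha256(" ".join(g).encode("utf-8")).hexdigest()
            for g in ngrams}

def hash_coverage(doc_hashes, candidate_hashes):
    # Share of the document's features also present in the candidate.
    return len(doc_hashes & candidate_hashes) / len(doc_hashes)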
3.5.2 Text matching of candidates

For each document, a maximum of 100 candidate documents is set as a limit, for reasons of reducing the space complexity of the procedure. After the lists of candidates are obtained for all the documents in CORE, a fine text comparison algorithm by Kärkkäinen et al. is used, which constructs a permuted longest common prefix array efficiently [7]. The output of the algorithm is a sequence of longest common prefixes, along with the offsets at which a computed longest common prefix occurred in the text. Our implementation of the algorithm is written in C++ and wrapped by a REST API, which accepts two texts with configuration parameters and returns the longest common prefix array. The configuration parameters are the minimum length of the document (minLen), the minimum length of the longest common suffix (lengthBoundary) and the size of the window in which neighbouring prefixes are searched (windowSize - for merging adjacent common prefixes into longer sequences). The REST API is bundled into a Docker image, which allows for an implementation of a distributed approach for calculating text matching on the CORE dataset. A workload of documents is generated, on which text matching is performed for each candidate, as a pair (document ID, status) in a table in a relational database, with status being a flag described in Table 2.

Table 2: Distributed text matching workload status flags.
Status flag   Description
-1            unprocessed
0             processing
1             processed

Each node running the distributed program performs a transactional UPDATE query on the workload table, setting the flags of N documents to 0 and simultaneously obtaining the document IDs from the table where the current flag value equals -1. It proceeds by obtaining all the full texts of the document and the candidates from the database, and passing them on to a pairwise comparison performed by the Docker REST service which is running on the same node. This approach allows for the efficient parallel computation of text matching between documents and their candidates. The parameter N is limited by the amount of free memory available to the node. Given an average number of candidates C per document and an average length of L bytes of a document in CORE, the estimated space complexity in bytes is given by N · C · L. If a node fails to perform the comparison, for example in the case of running out of free virtual memory, a transactional update of the status flag is made, setting the flag back to -1. This allows the document to be processed by another node. Using this approach, the procedure is fault-tolerant and partition-tolerant. The results of the text matching comparisons are arrays of offsets and lengths of common prefixes, which are used in the implementation of a plagiarism detector text matching user interface. The prefixes are also used to compute the similarity indices of the documents:

similarityIndex = Σ length of matches / total length of document
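In PostgreSQL, the transactional claim described above can be expressed in a single statement. The sketch below is illustrative (table and column names are ours, psycopg2-style parameters); adding FOR UPDATE SKIP LOCKED additionally lets concurrent nodes skip rows that are already being claimed:

# Sketch of how a node could atomically claim N unprocessed documents
# from the workload table; names are illustrative.
CLAIM_SQL = """
UPDATE workload SET status = 0
WHERE doc_id IN (
    SELECT doc_id FROM workload
    WHERE status = -1
    LIMIT %(n)s
    FOR UPDATE SKIP LOCKED
)
RETURNING doc_id;
"""

def claim_documents(conn, n=20):
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL, {"n": n})
        conn.commit()
        return [row[0] for row in cur.fetchall()]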
3.5.3 Deduplication

After computing the pairwise text matching between the documents and their candidates for plagiarism, the process of plagiarism detection in the CORE database is complete. We described the problem of duplicate entries in the CORE data set in 3.1. With the overall and pairwise similarity indices computed, we were able to conduct deduplication by using the computed indices in conjunction with the enriched metadata we extracted from MAG and CrossREF. The process of removing duplicates begins by acquiring a list of all pairwise matches with a similarity index above 85 %. For each such pair, we compare the normalized titles of the two documents, the year of publication and the ISSN. When a full match has been made, the second document is automatically marked as duplicated. When only a partial match has been made (e.g. only the normalized titles match), the document is marked as a potential duplicate. The potential duplicates are submitted for manual inspection; meanwhile, the detected duplicates are excluded from the candidate lists, and the similarity indices of the documents which contained duplicates are corrected, respectively.
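The decision rule can be summarised in a few lines. The following sketch is illustrative (field and function names are ours), not the production implementation:

# Illustrative sketch of the deduplication rule described above.
def classify_pair(a, b, similarity_index):
    if similarity_index <= 0.85:
        return "distinct"
    titles_match = a["norm_title"] == b["norm_title"]
    full_match = (titles_match and a["year"] == b["year"]
                  and a["issn"] == b["issn"])
    if full_match:
        return "duplicate"             # excluded from candidate lists
    if titles_match:
        return "potential duplicate"   # submitted for manual inspection
    return "distinct"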
4 RESULTS

By utilizing our method for metadata enrichment, we enriched the metadata for 3,457,071 documents in the 2018-03-01 CORE data set which contain a DOI persistent identifier value. We also acquired a complete set of metadata for a subset of 618,754 documents. This result is crucial for the goal of data preparation, since text similarity analysis is an instrument of plagiarism detection, which requires a sufficient amount of metadata in order to be useful to the expert conducting the plagiarism analysis. The latter, smallest subset contains ISSN values, which allows for further studies on journal plagiarism and self-plagiarism statistics by implementing plagiarism detection on this subset. Using our candidate retrieval method, we have computed over 4.7 billion n-gram hashes for the 9.8 million documents in the 2018-03-01 CORE data set, yielding a total hash table size of 1478 GB in PostgreSQL, using b-tree indexes and document IDs in the form of SHA-256 hashes. On average, we found 87.37 candidates per document for each document in the CORE data set.

For our implementation of the text matching comparison, we installed the Docker image containing the text matching REST API implementation on a cluster of 33 machines with Intel i5-8600 processors and 8 GB of internal memory. Our configuration of the parameters equaled minLen = 20, lengthBoundary = 40 and windowSize = 350, respectively, as they are set in the Slovenian open access infrastructure. The average document length in the CORE 2018-03-01 data set is 73.41 KB; we found that setting the parameter N = 20 allows for the stable and efficient processing of the CORE data set using this approach. Table 3 contains the benchmark statistics for a variable number of computing nodes running the text similarity Docker image. A random sample of 1000 documents is selected for a variable number of nodes which process the workload in parallel. The efficiency of the approach tends to asymptotically decline as we increase the number of nodes, which is a result of the centralized data storage being used for the synchronization of the compute nodes. The network infrastructure and the hardware and configuration of the database server represent a bottleneck. The table also contains a reference value for the time necessary to process all the 9,767,152 documents in the CORE 2018 data set for a given number of nodes. The benchmarks were performed on a subset of 10 nodes of the total 33 nodes used to process the entire CORE data set, but the results may be interpolated to a higher number of nodes.

Table 3: Distributed text similarity computation performance benchmarks.
#nodes   duration (s)   est. days (CORE 2018)
1        1590.806       179.83
2        753.917        85.23
3        547.659        61.91
4        425.046        48.05
5        374.564        42.34
6        306.461        34.64
7        294.918        33.34
8        285.888        32.32
9        233.364        26.38
10       217.015        24.53

5 CONCLUSION

The study described in this paper presents the methodology of establishing the largest plagiarism detection dataset in Slovenia. We have developed a framework for processing larger sets of documents in a pipeline for plagiarism detection, with means of metadata enrichment with the help of the CrossREF API and the Microsoft Academic Graph. Our output allows for further study in the fields of academic integrity, content similarity detection, stylometry and other fields of study. Utilizing our data set, which consists of enriched metadata entries for the 2018-03-01 CORE data set, including the similarity indices for each document in the data set, further research is possible.

6 ACKNOWLEDGMENTS

The authors thank Petr Knoth and the CORE team for providing the 2018-03-01 CORE data set and for the technical support during our efforts of obtaining the data from CORE. Also, we thank David Schmid for publishing the MAG dump on Zenodo in April 2019.

References

[1] Bretag, T., and Mahmud, S. Self-plagiarism or appropriate textual re-use? Journal of Academic Ethics 7, 3 (Sep 2009), 193.
[2] Eaton, S. E., and Crossman, K. Self-plagiarism research literature in the social sciences: A scoping review. Interchange 49, 3 (Aug 2018), 285–311.
[3] Eide, D., and Huang, C. Microsoft Academic Graph schema. Microsoft Academic Graph official website. Accessed Jun 1, 2021. [Online] Available: https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema.
[4] Fartek, J. Distributed generation of plagiarism detection reports. J. Fartek, 2018. Bachelor thesis.
[5] García-Romero, A., and Estrada-Lorenzo, J. M. A bibliometric analysis of plagiarism and self-plagiarism through Déjà vu. Scientometrics 101, 1 (Oct 2014), 381–396.
[6] Herrmannova, D., and Knoth, P. An analysis of the Microsoft Academic Graph. D-Lib Magazine 22 (09 2016).
[7] Kärkkäinen, J., Manzini, G., and Puglisi, S. J. Permuted longest-common-prefix array. In Combinatorial Pattern Matching (Berlin, Heidelberg, 2009), G. Kucherov and E. Ukkonen, Eds., Springer Berlin Heidelberg, pp. 181–192.
[8] Krokoscz, M. Plagiarism in articles published in journals indexed in the Scientific Periodicals Electronic Library (SPELL): a comparative analysis between 2013 and 2018. International Journal for Educational Integrity 17, 1 (Jan 2021), 1.
[9] Maurer, H., Kappe, F., and Zaka, B. Plagiarism - a survey. Journal of Universal Computer Science 12 (01 2006), 1050–1084.
[10] Microsoft Academic. Microsoft Academic Graph. Zenodo. Accessed Jun 1, 2021. [Online] Available: https://doi.org/10.5281/zenodo.2628216, Apr. 2019.
[11] Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B.-J. P., and Wang, K. An overview of Microsoft Academic Service (MAS) and applications. In Proceedings of the 24th International Conference on World Wide Web (New York, NY, USA, 2015), WWW '15 Companion, Association for Computing Machinery, pp. 243–246.
[12] The CORE team. CORE data providers. CORE official website. Accessed Jun 1, 2021. [Online] Available: https://core.ac.uk/data-providers?q=&size=10.

Embedding Non-planar Graphs: Storage and Representation

Ðorđe Klisura
University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Glagoljaška ulica 8, 6000 Koper, Slovenia
klisuradjordje10@gmail.com
DOI https://doi.org/10.18690/978-961-286-516-0.13   ISBN 978-961-286-516-0

Abstract

In this paper, we propose a convention for representing non-planar graphs and their least-crossing embeddings in a canonical way. We achieve this by using state-of-the-art tools such as the canonical labelling of graphs, Nauty's Graph6 string and combinatorial representations for planar graphs. To the best of our knowledge, this has not been done before. Besides, we implement the mentioned procedure in the SageMath language and compute the embeddings for certain classes of cubic, vertex-transitive and general graphs. Our main contribution is an extension of one of the graph data sets hosted on MathDataHub, and a step towards extending the SageMath codebase.

Keywords: planar graph, graph representation, crossing number, graph database

1 Introduction

Computers are increasingly used by mathematicians to assist them in their studies. This includes most aspects of a researcher's work, from publishing and reading papers to computations in mathematical software. Perhaps surprisingly, mathematicians also generate and use data, and the production and processing of massive datasets are becoming increasingly important in several areas of mathematics.

The primary applications of these mathematical datasets and databases are exploratory in nature. They are used by researchers to test hypotheses or to discover patterns and counterexamples. It is not difficult to find such "datasets" that even predate computers: the Atlas of Graphs and the Foster census are two examples from graph theory. The On-Line Encyclopedia of Integer Sequences (OEIS) is a modern mathematical database that represents an online database of integer sequences. The OEIS now contains over 334,000 sequences that are useful to both professional and amateur mathematicians, making it the largest database of its kind. The sequences in the database serve as fingerprints for the records they are associated with. A somewhat similar project in graph theory is the House of Graphs. An important notion is that of using mathematical objects, such as integer sequences or graphs, to search for mathematical theorems. This has been introduced as theorem fingerprinting by Billey and Tenner [2], as a way to improve the efficiency of searching for mathematical knowledge. In a broader sense, fingerprints are used in many fields of science, ranging from computer science to chemistry, archaeology and genetics. Computer documentation, reducing duplication in web search results, and surely DNA fingerprinting are a few examples.

Because of its applications in physics, biochemistry, biology, electrical engineering, astronomy, operations research and computer science, graph theory is rapidly moving into the core of mathematics. The theory of planar graphs is based on Euler's polyhedral formula, which relates the edges, vertices and faces of a polyhedron. Planar graphs are used in a variety of applications in the modern era, including constructing and organizing sophisticated radio-electronic circuits, railway maps, planetary gearboxes and chemical molecules. Pipelines, railway lines, subway tunnels, electric transmission lines and metro lines are all vitally crucial for modelling an urban city. For further reading on this topic, see Trudeau [13] and Barthelemy [1].

1.1 Related work and contributions

Some of the most significant projects that act as mathematical databases which assist researchers in their research projects are the SageMath platform [12], the American Mathematical Society's MathSciNet [11], the above-mentioned Encyclopedia of Integer Sequences [6], the House of Graphs [5], and the Atlas of Graphs [10]. The above-named tools are far from perfect, and are many times subject to important work of the open-source community.
This project aims to provide another toolset for researchers, by improving the MathDataHub platform, which will, in the future, host our database containing the planar embeddings minimising the number of crossings. Those embeddings are hard to compute, and such a database of precomputed embeddings does not exist in any mathematical database.

The paper is structured as follows. In Section 2, we present the central technique and ideas of embedding non-planar graphs. Moreover, we give a concrete algorithm that uses them, and evaluate it in terms of space and time complexity. In Section 3, we talk about the impact of our results so far. Finally, in Section 4, we present some output samples of the algorithm that has been evaluated, before we make some concluding remarks in Section 5.

2 Embedding non-planar graphs

In this section, we present our main result, namely the algorithm for calculating the canonical embedding of non-planar graphs.

Data: Non-planar graph G
Result: Planar embedding, crossing number and added vertices of G
1  V ← V(G)
2  edgePairs ← {{ab, cd} : ab, cd ∈ E(G), where a, b, c, d ∈ V(G) are pairwise distinct}
3  k ← 0
4  while G is not planar do
5    S ← all k-subsets of edgePairs
6    for {a_1, a_2, ..., a_k} in S do
7      E(G) ← E(G) \ ∪_{i=1}^{k} a_i
8      for a_i in {a_1, a_2, ..., a_k} do
9        {{a_i^1, a_i^2}, {a_i^3, a_i^4}} ← a_i
10       v_i ← vertex such that v_i ∉ V(G)
11       V(G) ← V(G) ∪ {v_i}
12       E(G) ← E(G) ∪ {a_i^1 v_i, a_i^3 v_i, v_i a_i^2, v_i a_i^4}
13     end
14     if G is planar then
15       return G, k, V(G) \ V
16     end
17   end
18   k ← k + 1
19 end
Algorithm 1: Algorithm for calculating the canonical embedding of non-planar graphs.

Figure 1: Petersen graph.

After investigating several non-planar graphs with up to five vertices in SageMath, we came up with the notion of representing their embeddings. Combinatorial embedding is a key concept in the study of such graph embeddings. The significance stems from the fact that, when combined with canonical labelling, combinatorial embeddings can be utilized to generate a unique representation of (planar) embeddings of (planar) graphs. For further reading on combinatorial representations and planar embeddings, refer to Mutzel and Weiskircher [9], Djidjev [8], Duncan, Goodrich and Kobourov [4], and Hopcroft and Tarjan [7].

The concept is as follows: we first construct all non-incident pairs of edges of a graph; then, we go through those pairs of edges, and for each crossing of two edges we delete those edges and add a new vertex, to which we connect the vertices of the deleted edges. We repeat the process until the graph is planar. Finally, if it is planar, in the end we canonically relabel its vertices and save the new embedding of the graph.

We show the approach and demonstrate how it works using the Petersen graph, shown in Figure 1. The first step of Algorithm 1 is constructing all pairs of non-incident edges of G, meaning that the two different edges cannot share the same vertex. Those pairs of edges are {{0,1},{2,3}}, {{4,9},{5,8}}, etc. In the beginning, we initialise the crossing number k to 0 and check whether the graph is already planar; if it is, we return k, and if it is not, k is increased by 1 and we go through the set of pairs of non-incident edges. For each k, we modify the graph until we get a planar embedding, in the following way: we take the first pair, in our example the pair {{0,1},{2,3}}, and we delete the edges {0,1} and {2,3}. Then we add a new vertex v, to which we connect the vertices of the deleted edges: 0, 1, 2 and 3. We check for planarity. If the check confirms that the graph is planar, the crossing number is returned and the algorithm terminates. However, if the check confirms that the graph is not planar, we go to the next pair. If all pairs fail, we look for tuples of size 3 next, and so on.

We get that the graph G is planar after three iterations; hence, the crossing number of the Petersen graph is 2. In the end, we canonically relabel the vertices and obtain the planar embedding presented in Figure 2. See Section 3 for more details about the transformation.

Figure 2: Planar embedding of the Petersen graph after applying Algorithm 1.
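To make the planarization step concrete, the following is a small SageMath sketch of a single iteration of the idea behind Algorithm 1 (the helper name planarize_pair is ours; PetersenGraph() and is_planar() are standard Sage methods):

# A SageMath sketch of one planarization step from Algorithm 1: replace a
# pair of non-incident edges by a new "crossing" vertex and re-test
# planarity. Run inside Sage; the helper name is illustrative.
from sage.all import graphs

def planarize_pair(G, e, f):
    H = G.copy()
    H.delete_edges([e, f])
    v = max(H.vertices()) + 1        # a label not yet used in the graph
    H.add_vertex(v)                  # the new crossing vertex
    H.add_edges([(e[0], v), (e[1], v), (f[0], v), (f[1], v)])
    return H

G = graphs.PetersenGraph()
print(G.is_planar())                      # False, so k = 0 fails
H = planarize_pair(G, (0, 1), (2, 3))     # the first pair from the example
print(H.is_planar())                      # still False: continue the search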
2.1 Theoretical analysis of the algorithm

Consider first the space complexity of Algorithm 1. The amount of memory used by Algorithm 1 to execute and produce the result is linear with respect to the input instance. This is due to the fact that the input instance is a graph, and most of the work on it is done in place, by modifying it locally and not taking more space even after many manipulations. Hence, we can say that Algorithm 1 does not take too much memory.

Next, let us evaluate the time complexity of Algorithm 1. To determine the time complexity, we need to consider all of the SageMath integrated functions we called in our main function. The implemented function is_planar runs in linear time with respect to the graph given as the input instance, meaning it runs in time O(n + m), where n is the number of vertices and m is the number of edges of the graph. For more reading on the time complexity of the planarity algorithm, refer to Boyer and Myrvold [3]. To remove a vertex in a graph, we first need to find the vertex in the data structure, and the time complexity depends on the structure we use; if we use a HashMap, the time complexity will be O(1). Then we remove the vertex, and we do it in O(n) time. Adding and removing an edge of the graph is done in constant, O(1), time, while adding a vertex to a graph takes O(n) time. Checking if there is an edge between two vertices is done in O(n) time, since a vertex can have at most O(n) neighbours. The time complexity of getting an embedding of the graph and of finding the neighbours is linear, that is, O(n + m), since we needed to perform the breadth-first search algorithm.

Let us analyze the lines of Algorithm 1. Line 1 has complexity O(n), as it assigns n vertices to a set, while line 2 assigns m^2 edge pairs to a set and hence has complexity O(m^2). Line 3 has constant, O(1), complexity. Let us now analyse the complexity of the while loop. Inside the while loop, we see that line 5 generates all subsets of size k from a set of size m^2; hence, the assignment of the k-subsets to the set S has complexity equal to the number of such subsets, which is O((m^2 choose k)) = O(m^(2k)). Now, the first for loop goes through all the k-subsets, and has complexity O(m^(2k)(k + kn + m + n)) = O(m^(2k)(nk + m)), because line 7 has complexity O(k), the inner for loop has complexity O(nk), and the if statement in line 14, which checks whether the graph is planar, has complexity O(n + m). Finally, the while loop is executed k times, meaning that the complexity of the whole while loop is O(m^(2k) · k · (nk + m)).
The users - with output explanations. time complexity of getting an embedding of the graph and of finding the neighbours is linear, that is O( n + m), By now, we processed cubic graphs with up to 21 edges, since we needed to perform the Breadth-First Search vertex-transitive graphs with up to 20 edges and all algorithm. graphs with up to 13 edges. These files can be stored in any database since we created a non-verbose mode of Let us analyze the lines from Algorithm 1. Line 1 has writing into them. In Table 1 we present an overall of complexity O( n) , as it assigns to a set n vertices, while the processed families of graphs by now. line 2 assigns m 2 edge pairs to a set and hence has com- plexity O( m 2). Line 3 has constant, O(1), complexity. Let us now analyse the complexity of the while loop. Table 1: Processed families of graphs Inside the while loop, we see that line 5 generates all subsets of size k from the set of size m 2. Hence, the family of graphs # generated up to edges assignment of the k-subsets to the set has complexity cubic 752 21 equal to the size of the set which is O( m 2) = O( m 2 k). k vertex-transitive 16 20 Now, the first for loop goes through all k elements of general 376899 13 the k-subset, and has complexity O( m 2 k( k + kn + m + n)) = O( m 2 k( nk + m)) because the line 7 has complexity O( k), the inner for loop has complexity O( nk) and the 4 Output samples if statement in line 14, that checks whether the graph is planar, has complexity O( n + m). Finally, the while loop Here we present some examples of the algorithm’s final is executed k times meaning that the complexity of the outputs, as previously detailed in the paper. We’ve suc- whole while loop is O( m 2 kk( nk + m)). cessfully generated embeddings, saved them in files along with additional vertices and Graph6 strings, and plotted 3 Applications images of vertex-transitive graphs on less than 20 edges. In Figure 4 we present some of them individually, with Drawing: One of the applications is in graph colouring. the Graph6 string for each one written in the caption. In Algorithm 1, we labelled newly added edges, then we use the method plot() within SageMath and with the 5 Conclusions property colour by label, we get different colours for the newly added edges. In Figure 3 we see an example of the In the paper, we had a look at a non-planar graph embed-transformed Petersen graph from Figure 2. As it can be ding, its storage, and representation. We introduced an seen, the original graph embedding is coloured red, while algorithm for constructing those embeddings and a func- the newly inserted edges are coloured green and blue. tion that writes them in both a human-readable way, or in the way suitable for the storage in the database. This Storing graphs: Another application of our approach contributes to the subject of representation theory be- is related to the storing of graphs with their combinatorial cause there was no standard way of encoding such em- embedding, added vertices (if the graph is non-planar) beddings. and with Graph6 string. We constructed a function that stores data about an individual graph in a single text We demonstrated how our method may be used to draw file that a computer can comprehend (Graph6 string, graphs and save graph data in various file formats. calculated embedding and added vertices - separated by Our techniques can be used to enrich almost any graph 59 database, and that is exactly what we were hoping to achieve. 
We’ve generated vertex-transitive graphs with up to 20 edges, all graphs with up to 13 edges, and all cubic graphs with up to 21 edges by now. In collaboration with Katja Berčič, PhD, our files will be uploaded to the M athDataHub database. (a) :An (b) :DaHg∼ Currently, we are working on contributing our code to the SageMath project. Acknowledgment I’d like to thank Professor Matjaž Krnc for his advice and suggestions throughout the planning, development, and writing of this paper. I’d like to thank Klemen Berkovič for his help in codifying this Author’s Guide, and to Iztok Fister Jr. for his (c) :CcKI (d) :Ea@aRgs contribution to Author’s Guide and .tex files. References [1] Barthelemy, M. Morphogenesis of Spatial Net- works. New York: Springer, 2017. [2] Billey, S. C., and Tenner, B. E. Fingerprint databases for theorems. Notices of the AMS 60, 8 (2013), 1034. [3] Boyer, J. M., and Myrvold, W. J. On the cut- ting edge: Simplified o(n) planarity by edge addi- (e) ∧ :Ea@_Q_QM@Gs (f) :Fa@_WIRQbP tion. Journal of Graph Algorithms and Applications 8, 3 (2004), 241–273. [4] Christian, D., T., G. M., and Stephen, K. Pla- nar drawings of higher-genus graphs, graph drawing, 17th international symposium, gd 2009. vol. 5849, pp. 45–56. [5] Goedgebeur, G. B. K. C. J., and Mélot, H. House of Graphs: a database of interesting graphs, Discrete Applied Mathematics, 2013. [6] Inc., O. F. The On-Line Encyclopedia of Integer Sequences, 2021. http://oeis.org. (g) :GaGecctgs (h) :Ga@_WIRhDlDZ [7] John, H., and E., T. R. Efficient planarity testing. Journal of the Association for Computing Machinery 21, 4 (1974), 549–568. [8] N., D. H. On drawing a graph convexly in the plane, graph drawing, dimacs international workshop, gd ’94, princeton. vol. 894, pp. 76–83. [9] Petra, M., and René, W. Computing optimal embeddings for planar graphs, computing and com- binatorics, 6th annual international conference, co- coon 2000. vol. 1858, pp. 95–104. [10] Read, R. C., and Wilson, R. J. An Atlas of (i) :Ga@_QaShDlDZ (j) :Ga@_WGwChLDgsTn Graphs (Mathematics). Oxford University Press, Inc., USA, 2005. [11] TePaske-King, Bert; Richert, N. The Iden- Figure 4: Coloured representations of several vertex- tification of Authors in the Mathematical Reviews transitive graphs with up to 20 edges, with Graph6 string Database, 2001. in captions. [12] The Sage Developers. SageMath, the Sage Mathematics Software System (Version 9.4), 2021. https://www.sagemath.org. [13] Trudeau, and J., R. Introduction to Graph The- ory. New York: Dover Pub, 1993. 60 Analiza ritmičnosti števnih podatkov z uporabo modela cosinor Nina Velikajne Miha Moškon Faculty of Computer and Information Science, Faculty of Computer and Information Science, University of Ljubljana, University of Ljubljana, Večna pot 113, SI-1000 Ljubljana, Slovenia Večna pot 113, SI-1000 Ljubljana, Slovenia nv6920@student.uni-lj.si miha.moskon@fri.uni-lj.si Povzetek sebne računske metode, ki upoštevajo in ohranjajo od- nose med podatki. Metode morajo upoštevati diskretno Analiza ritmičnosti števnih podatkov je postala porazdelitev, ki je omejena le na nenegativne cele vre- pomembna v mnogih vidikih znanosti, inženirstva dnosti. Pri uporabi navadne linearne regresije so lahko in celo ekonomije. Obstajajo metode z namenom napovedane vrednosti negativne, kar je teoretično nemo- detekcije ritmičnosti zveznih podatkov, ki pa veči- goče [8]. Za zaznavanje in analizo ritmičnosti v zveznih noma niso primerne za analizo števnih podatkov. 
1 Introduction

The detection and analysis of rhythmic patterns in count data plays an important role in many areas of science. Periodic data are data in which patterns repeat with a given period. A very common type of periodic process are processes that reflect circadian oscillation, i.e., processes with a 24-hour period [3]. They are regulated by the Earth's rotation and the alternation of day and night, which affects all organisms and consequently their behaviour and movement (see [20]). A special type of data are count data, which describe the occurrence of selected events [3].

Count data often appear periodically. In nature and in our surroundings, such data are ubiquitous. By analysing them we can contribute to the understanding of various events and, indirectly, to the understanding of how organisms and social systems work. For example, an analysis of the number of births with respect to the hour of the day can help organize the medical and midwifery staff in a maternity hospital [14]. An analysis of daily traffic patterns and their changes over time can tell us a lot about the movement and behaviour of a population, especially during an epidemic (see [4]).

The analysis of periodic count data requires special computational methods that take into account and preserve the relations in the data. The methods must respect the discrete distribution, which is limited to non-negative integer values. With ordinary linear regression, the predicted values can be negative, which is theoretically impossible [8]. Quite a few non-parametric methods exist for detecting and analysing rhythmicity in continuous data (see [18, 12]). Compared to these methods (see [16]), the use of trigonometric regression methods in combination with different cosinor models offers numerous advantages, e.g., the estimation of rhythm parameters [18]. It also turns out that the alternative methods fail in certain cases because of a large number of outliers, the sheer size of the data, data imbalance, or data collection without repetitions (see [17]).

In this paper we present a methodology that, by combining the cosinor method with different computational count models, respects all the limitations of periodic count data and thus also enables the estimation of rhythm parameters. The method supports the computation of confidence intervals for individual rhythm parameters by means of bootstrapping. With an F test we determine the optimal number of components for the cosinor method, and to find the most suitable type of count model we use the Vuong test.

The paper is divided into five chapters. The second and third chapters present the cosinor method for the analysis of periodic data and five computational regression models for working with count data. The following chapter describes the procedure for selecting the most suitable computational model. The Results chapter presents the results of testing the established method on real data. The last chapter summarizes the key findings and the procedures of the entire analysis.

2 The cosinor method

In periodic data, the observed patterns repeat with a given period. Their analysis can be addressed as a regression problem, in which, however, the characteristics of the rhythm, e.g., the phase and the amplitude of oscillation, must also be taken into account. The cosinor method is used for time-series analysis and addresses rhythm detection as well as the estimation of rhythm parameters. The model behind the method can be described by Equation 1, where N is the number of components, M the rhythm-adjusted mean (MESOR, Midline Estimating Statistic of Rhythm), A the amplitude, P the period and e(t) the error function. The variable t denotes time, while i iterates over the number of components, from 1 to N [1, 15, 16]:

Y(t) = M + \sum_{i=1}^{N} ( A_{i,1} \sin(2\pi t / (P/i)) + A_{i,2} \cos(2\pi t / (P/i)) ) + e(t).    (1)

If the period of the rhythm is known in advance, the cosinor equation can be simplified into a linear regression model:

Y(t) = M + \sum_{i=1}^{N} ( A_{i,1} X_{i,1}(t) + A_{i,2} X_{i,2}(t) ) + e(t),    (2)

where X_{i,1}(t) = \sin(2\pi t / (P/i)) and X_{i,2}(t) = \cos(2\pi t / (P/i)). If the period is not known, it can be estimated with the help of periodograms [1, 15, 16].
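As an illustration of Equation 2, the following Python sketch (ours, not the authors' module) builds the cosinor regressors X_{i,1}(t) and X_{i,2}(t) for a known period; the weekly period P = 7 mirrors the use case from the Results section.

```python
import numpy as np

def cosinor_design(t, period, n_components):
    """Regression matrix of Equation 2: a constant column for the MESOR M,
    followed by the pairs X_{i,1}(t) = sin(2*pi*t/(P/i)) and
    X_{i,2}(t) = cos(2*pi*t/(P/i)) for i = 1..N."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    for i in range(1, n_components + 1):
        cols.append(np.sin(2 * np.pi * t / (period / i)))
        cols.append(np.cos(2 * np.pi * t / (period / i)))
    return np.column_stack(cols)

# Example: three components over a weekly period, as in the paper's results.
t = np.arange(28)                                  # four weeks of daily data
X = cosinor_design(t, period=7, n_components=3)    # shape (28, 7)
```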
3 Analysis of count data with the cosinor method

For the analysis and detection of rhythm on the original, cleaned data we first use the cosinor method. If the period is known, we use Equation 2, which splits the data into an arbitrary number of components and transforms them into regression form. Computational regression models are then applied to the transformed data. Regression models enable the identification and characterization of the relations between many factors. Because we work with count data, we must choose appropriate computational regression models that respect all the properties of such data.

The data are discrete, limited to non-negative integer values and often also dispersed. We encounter the notions of overdispersion and underdispersion [8]. With overdispersion, the data have a larger variance than would otherwise be expected: if the variance is larger than the mean of the data, we speak of overdispersion, and the opposite holds for underdispersion [7]. Using ordinary linear regression, we would obtain incorrect results, since such a computational model would not respect the properties just mentioned [8].

In the established methodology we decided to use five different computational models that are most often used for the analysis of count data: the Poisson model, the generalised Poisson model, the zero-inflated Poisson model, the negative binomial model and the zero-inflated negative binomial model.

The Poisson model assumes that the data follow the Poisson distribution:

P(y = k) = \lambda^{k} e^{-\lambda} / k!,    (3)

where λ is the mean expected value, i.e., the number of events per unit of time. For the Poisson distribution, the mean of the data µ equals the mean expected value λ. The variance σ² equals the mean; simplified, σ² = λ. The model does not account for overdispersion or underdispersion of the data [9].

The generalised Poisson model is derived from the ordinary Poisson model. The essential difference compared to the Poisson model is that the generalised Poisson model is also suitable for data with increased or decreased dispersion. An additional parameter α is introduced, which describes the degree of dispersion. The model therefore does not require the mean µ to equal the variance σ². Two variants of the generalised Poisson model exist, GP-1 and GP-2 [6]. In the established methodology we used the GP-1 variant.

The zero-inflated Poisson model is an extension of the ordinary Poisson model. Unlike the ordinary and the generalised Poisson model, it takes into account that zero values are more frequent than the other values. It is based on the existence of two factors that influence whether the outcome is zero or non-zero [13].

The negative binomial model assumes that the data follow the negative binomial distribution. Two versions of the negative binomial model are in use, NB-1 and NB-2. The NB-1 variant proved more suitable: its fitted curve visibly followed the original data better, so we used NB-1 in the methodology. The variance of such a model is defined as σ² = µ + α · µ, where α is the dispersion parameter and µ the mean. The mean equals the mean expected value λ. The model is therefore also suitable for data with increased or decreased dispersion [2, 11]. As the parameter α approaches zero, the variance converges to the mean and the negative binomial distribution becomes the Poisson distribution [2].

The zero-inflated negative binomial model is an extension of the negative binomial model. Like the zero-inflated Poisson model, it takes into account that zero values are more frequent than the other (non-zero) values. The key difference compared to the zero-inflated Poisson model is that this model is based on the negative binomial distribution. The model is therefore also suitable for data with increased or decreased dispersion [10, 11].
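The methodology itself is implemented as a Python module (see the Discussion); as a rough sketch under our own assumptions, the five count models could be fitted on a cosinor design matrix with the statsmodels library roughly as follows. The counts `y` are synthetic placeholders, and the `p=1` arguments select the GP-1 and NB-1 variants.

```python
import numpy as np
from statsmodels.discrete.discrete_model import (
    Poisson, GeneralizedPoisson, NegativeBinomialP)
from statsmodels.discrete.count_model import (
    ZeroInflatedPoisson, ZeroInflatedNegativeBinomialP)

# Cosinor design matrix (Equation 2) with N = 3 components, weekly period.
t = np.arange(28, dtype=float)
X = np.column_stack([np.ones_like(t)] +
                    [f(2 * np.pi * t / (7 / i))
                     for i in range(1, 4) for f in (np.sin, np.cos)])
y = np.random.default_rng(0).poisson(lam=10, size=t.size)  # placeholder counts

models = {
    'poisson': Poisson(y, X),
    'gen_poisson': GeneralizedPoisson(y, X, p=1),            # GP-1
    'zero_infl_poisson': ZeroInflatedPoisson(y, X),
    'neg_binomial': NegativeBinomialP(y, X, p=1),            # NB-1
    'zero_infl_nb': ZeroInflatedNegativeBinomialP(y, X, p=1),
}
results = {name: m.fit(disp=0) for name, m in models.items()}
for name, res in results.items():
    print(name, res.llf)  # log-likelihoods, reused by the tests below
```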
4 Selecting the most suitable model

We split the selection of the most suitable computational model into two levels. At the first level we search for the optimal number of components for the cosinor method. Here we relied on the fact that the models are nested. Two models are nested if the first model can be expressed with the second, i.e., if the second model contains the same parameters as the first, plus at least one additional parameter [5]. We implemented the F test: from two constructed models we compute the F value. The F test is based on the difference between the sums of squares of the two models and takes the number of model parameters into account [5].

At the second level we evaluate the type of the computational model. We used the Vuong test, which is suitable for nested as well as non-nested and overlapping models. The Vuong test computes a Z value based on the maximum log-likelihoods of two models. This test, too, takes the number of model parameters into account [19].

Both tests follow a similar procedure. For two models, model A and model B, we compute the F or Z value. Model A is rejected if the computed value is smaller than a predetermined threshold, i.e., the statistical significance [5, 19].
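As a sketch of the two-level selection (our reading of [5, 19], not the authors' code): an F statistic over the residual sums of squares of two nested cosinor fits, and the basic, uncorrected Vuong Z statistic from the pointwise log-likelihoods of two fitted statsmodels count models.

```python
import numpy as np
from scipy import stats

def f_test(rss_small, df_small, rss_big, df_big):
    """F test for nested models; rss_* are residual sums of squares and
    df_* the residual degrees of freedom (n minus number of parameters).
    The model with fewer components is the 'small' one."""
    f = ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)
    return f, stats.f.sf(f, df_small - df_big, df_big)

def vuong_test(res_a, res_b):
    """Uncorrected Vuong Z statistic from two fitted statsmodels results;
    a positive Z favours model A, a negative Z favours model B."""
    ll_a = res_a.model.loglikeobs(res_a.params)  # pointwise log-likelihoods
    ll_b = res_b.model.loglikeobs(res_b.params)
    m = ll_a - ll_b
    z = np.sqrt(len(m)) * m.mean() / m.std(ddof=1)
    return z, 2 * stats.norm.sf(abs(z))
```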
5 Results

We tested the established methodology on real data. The data were obtained from COVID-19 Sledilnik¹, and we analysed the number of positive tests with respect to the day of the week. We considered all tests, i.e., the sum of the PCR (polymerase chain reaction) and the rapid antigen (HAGT) tests.

We ran the method on three datasets. The first dataset records the number of positive tests from 20 October 2020, the declaration of the second wave of the epidemic in Slovenia, to 10 February 2021. The mean of the positive tests is 1340 and the variance 276278. The second dataset contains the cases from 11 February 2021 to 26 April 2021. The mean of the data is 796 and the variance 90328. The third dataset combines the time periods of the first and the second dataset. The mean of the last dataset is 1082 and the variance 232224. We also tried the method on the period of the first epidemic wave in Slovenia, spring 2020, but there are too few data there for a meaningful analysis.

In each dataset we first removed the outliers, in such a way that for each hour we removed the entries whose numbers of positive tests fell outside the bounds given by the 0.15 quantile. We then ran the cosinor method for each number of components, from 1 to 4, and built each type of computational model (see Chapter 3). We stopped at 4 components because, with a larger number of components, the computational models no longer fitted the original data. We then evaluated the built models: first we found the optimal number of components with the F test, and then the most suitable model type with the Vuong test.

The described procedure is repeated for each dataset. All datasets have a variance larger than the mean, which means that the data are overdispersed. From the distribution of the original data (see Figure 1) we can see that the data contain no zero values.

Problems appear in all the models as soon as the number of cosinor components exceeds 3. We observe that the negative binomial model and the generalised Poisson model fit the original data best; both are suitable for overdispersed data. We obtained the same result for all datasets: the optimal number of components is 3, and the most suitable model type is the generalised Poisson model (see Figure 1). Based on the performed analysis we could evaluate the rhythmicity parameters (see Table 1) and their confidence intervals (see Table 2). The entire analysis procedure with all intermediate results is available in a GitHub repository².

Table 1: Estimated rhythmicity parameters for each dataset (two peak entries were obtained for the first dataset).

dataset | model type | no. of components | amplitude | MESOR | peaks | no. of positive tests
1 | gen_poisson | 3.0 | 794.62 | 1233.67 | 0.7, 0.7 | 2028.29, 1214.69
2 | gen_poisson | 3.0 | 483.37 | 731.32 | 3.9 | 969.59
3 | gen_poisson | 3.0 | 637.66 | 995.12 | 0.7 | 1632.77

Table 2: Confidence intervals of the rhythmicity parameters for each dataset.

dataset | amplitude | MESOR | peaks | no. of positive tests
1 | [727.44, 863.81] | [1164.72, 1299.48] | [0.63, 0.85], [0.56, 0.83] | [1897.05, 2158.4], [1154.14, 1271.19]
2 | [452.11, 514.36] | [701.36, 757.49] | [1.96, 4.82] | [902.71, 1040.37]
3 | [596.96, 691.66] | [955.36, 1051.93] | [0.59, 0.86] | [1556.13, 1739.78]

[Figure 1: The winning models for each dataset (panels for 20. 10. 2020–10. 2. 2021, 11. 2. 2021–26. 4. 2021 and 20. 10. 2020–26. 4. 2021; x-axis: day of the week, Monday–Sunday; y-axis: number of positive tests; legend: original data, fitted curve). In all cases the number of components equals 3 and the model type is the generalised Poisson model. The orange lines denote the confidence intervals of the model.]

¹ https://covid-19.sledilnik.org/sl/data
² https://github.com/ninavelikajne/Ritmicnosti-stevnih-podatkov
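For illustration, one possible reading of the outlier-removal step described above (a sketch with hypothetical column names) keeps, for each hour, only the values between the 0.15 and 0.85 quantiles:

```python
import pandas as pd

def remove_outliers(df, group_col='hour', value_col='positive_tests', q=0.15):
    """Keep, per group, only the values between the q and (1 - q) quantiles;
    the column names are hypothetical placeholders."""
    grouped = df.groupby(group_col)[value_col]
    lo = grouped.transform(lambda s: s.quantile(q))
    hi = grouped.transform(lambda s: s.quantile(1 - q))
    return df[df[value_col].between(lo, hi)]
```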
6 Discussion and conclusion

The established and implemented methodology enables the analysis of periodic count data. The methodology consists of two parts. The first part covers the analysis of periodic data: the cosinor method is implemented, and the user can set an arbitrary number of components. The second part covers the analysis of count data: different computational regression models suitable for working with count data are used. We implemented five such models, namely the Poisson model, the generalised Poisson model, the zero-inflated Poisson model, the negative binomial model and the zero-inflated negative binomial model. The established methodology also enables the evaluation of the built models. Like the model building, the evaluation is divided into two parts. In the first part we focus on finding the optimal number of components for the cosinor method; we exploit the fact that the models are nested and can therefore be evaluated with the F test. In the second part we search for the most suitable type of the computational model; we use the Vuong test, which is suitable for nested, non-nested and also overlapping models.

We tested the computational method on the weekly periodicity of the number of COVID-19 cases in Slovenia. We split the data into three datasets, each describing a different period. For all datasets, the generalised Poisson model turns out to be the best model type, and the optimal number of components is 3 (see Figure 1). The rhythmicity is clearly expressed in the data. We estimated the rhythm parameters (see Table 1) and their confidence intervals (see Table 2). From the results obtained with the presented method we can read that the number of positive tests is highest on Tuesdays and then declines through the week. As expected, the number of positive tests is lowest at weekends. The shape of the detected rhythm is similar for all datasets (see Figure 1).

The entire method is implemented as a module in the Python language and is freely available. It offers a wide spectrum of functionalities for the analysis of such data. A potential user can directly modify and adapt the functionalities to their own needs.

References

[1] Bingham, C., Arbogast, B., Guillaume, G. C., Lee, J. K., and Halberg, F. Inferential statistical methods for estimating and comparing cosinor parameters. Chronobiologia 9, 4 (1982), 397–439.
[2] Cameron, A. C., and Trivedi, P. K. Econometric models based on count data: Comparisons and applications of some estimators and tests. Journal of Applied Econometrics 1, 1 (1986), 29–53.
[3] Cameron, A. C., and Trivedi, P. K. Regression analysis of count data, vol. 53. Cambridge University Press, 2013.
[4] Chang, S., Pierson, E., Koh, P. W., Gerardin, J., Redbird, B., Grusky, D., and Leskovec, J. Mobility network models of COVID-19 explain inequities and inform reopening. Nature 589, 7840 (2021), 82–87.
[5] Clark, T. E., and McCracken, M. W. Tests of equal forecast accuracy and encompassing for nested models. Journal of Econometrics 105, 1 (2001), 85–110.
[6] Consul, P., and Famoye, F. Generalized Poisson regression model. Communications in Statistics - Theory and Methods 21, 1 (1992), 89–109.
[7] Coxe, S., West, S. G., and Aiken, L. S. The analysis of count data: A gentle introduction to Poisson regression and its alternatives. Journal of Personality Assessment 91, 2 (2009), 121–136.
[8] Gardner, W., Mulvey, E., and Shaw, E. Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological Bulletin 118 (12 1995), 392–404.
[9] Gardner, W., Mulvey, E., and Shaw, E. Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological Bulletin 118 (12 1995), 392–404.
[10] Greene, W. H. Accounting for excess zeros and sample selection in Poisson and negative binomial regression models. Working Paper EC-94-10. Leonard N. Stern School of Business, New York University, 1994.
[11] Hilbe, J. M. Negative binomial regression. Cambridge University Press, 2011.
[12] Hutchison, A. L., Maienschein-Cline, M., Chiang, A. H., Tabei, S. A., Gudjonson, H., Bahroos, N., Allada, R., and Dinner, A. R. Improved statistical methods enable greater sensitivity in rhythm detection for genome-wide data. PLoS Comput Biol 11, 3 (2015), e1004094.
[13] Lambert, D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34, 1 (1992), 1–14.
[14] Martin, P., Cortina-Borja, M., Newburn, M., Harper, G., Gibson, R., Dodwell, M., Dattani, N., and Macfarlane, A. Timing of singleton births by onset of labour and mode of birth in NHS maternity units in England, 2005–2014: A study of linked birth registration, birth notification, and hospital episode data. PloS One 13, 6 (2018).
[15] Nelson, W., Lee, J. K., et al. Methods for cosinor-rhythmometry. Chronobiologia 6, 4 (1979), 305–323.
[16] Refinetti, R., Cornélissen, G., and Halberg, F. Procedures for numerical analysis of circadian rhythms. Biological Rhythm Research 38, 4 (2007), 275–325.
[17] Ruben, M. D., Francey, L. J., Guo, Y., Wu, G., Cooper, E. B., Shah, A. S., Hogenesch, J. B., and Smith, D. F. A large-scale study reveals 24-h operational rhythms in hospital treatment. Proceedings of the National Academy of Sciences 116, 42 (2019), 20953–20958.
[18] Thaben, P. F., and Westermark, P. O. Detecting rhythms in time series with RAIN. Journal of Biological Rhythms 29, 6 (2014), 391–400.
[19] Vuong, Q. H. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57, 2 (1989), 307–333.
[20] Zhdanova, I. Melatonin. In Encyclopedia of the Neurological Sciences (Second Edition), M. J. Aminoff and R. B. Daroff, Eds., second edition ed. Academic Press, Oxford, 2014, pp. 1030–1033.

Sentiment Analysis of Hotel Comments Using Dictionaries and the Naive Bayes Method (Analiza sentimenta komentarjev hotelov z uporabo slovarjev in metode Naivni Bayes)

Nina Murks (University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia, nina.murks@student.um.si), Anže Omerzu (University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia, anze.omerzu@student.um.si) and Borko Bošković (University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia, borko.boskovic@um.si)

DOI https://doi.org/10.18690/978-961-286-516-0.15, ISBN 978-961-286-516-0

Abstract

In this article we present an approach to the sentiment analysis of hotel guests' comments using dictionaries and the Naive Bayes method. First, we built two sentiment dictionaries, containing n-grams and their probabilities of belonging to the positive or the negative class. Then, with the help of the built dictionaries, we classified the hotel comments using the Naive Bayes method. During the classification we computed classification values, i.e., the probabilities that individual comments are positive or negative.
We classified the comments using unigrams and bigrams, and compared the results with the results from the literature. With unigrams we achieved an accuracy of 0.92; with bigrams the accuracy amounted to 0.80. We stored the classification value of each comment, adding a negative sign to the comments classified as negative. The signed classification values were then summed for each hotel, and in this way we computed the points belonging to each hotel. In our case the hotel points are an indicator of the overall satisfaction of hotel guests found in the comments. We ordered the hotels by their points and arrived at a ranking of the hotels with the most positive comments.

Keywords: sentiment analysis, hotel guest comments, sentiment dictionary, n-grams, Naive Bayes

1 Introduction

Before setting off on a trip, one of the important decisions is the choice of accommodation. When deciding on the place to stay, the accommodation facilities matter to us: does the hotel have a swimming pool, organized childcare, free parking, and so on. We can learn more about the quality of the services offered by the chosen accommodation directly from the guests who have already used those services: we browse the online comments expressing opinions about the accommodation.

We decided to apply sentiment analysis to the comments of hotel guests. First, we classified the comments as positive or negative. With the help of the classified comments we determined whether the majority of the hotel guests liked the chosen hotel or not. The goal of our experiment was an application that provides immediate insight into the broader opinion about a hotel found in the comments of its guests.

In the article we first present different approaches to sentiment analysis, in the chapter Related Work. The next chapter describes the experiment we performed. Within it, we first present how we obtained the data and how we preprocessed it. Then we present the way the dictionaries were built, their use, and the classification of the hotel guests' comments. In the last subchapter of that chapter we present the approach to computing the points of an individual hotel. Finally, we give the conclusion and the possibilities for future work.

2 Related work

Sentiment analysis covers the computational analysis of the attitudes of a speaker or writer. The writer's attitude, as recognized in a text, can be positive, negative or neutral. Today the use of this analysis is extremely helpful in monitoring social media, as it gives us insight into public opinion [2].

The application of sentiment analysis is a well-researched field, as indicated by the high number of scientific and professional articles and papers in this area. Several approaches to sentiment analysis are known; in the following we present some of them, following the example of already published articles [8][3].

The first approach is based on unigrams and patterns from a training set [1][10]. The authors of these articles focused their research on samples of comments on the social network Twitter. With the help of unigrams and patterns they recognized whether a comment is hateful (offensive), clean (neutral) or positive [10]. In the second article the researchers used patterns to search for the presence of sarcasm in comments.
The next article important for our research [11] also directed its investigation to the sentiment of comments on social networks. Their approach uses weighted word vectors as the input to a neural network cell (BiLSTM) that captures contextual information and thus represents the comment vectors better. The sentiment of the comments is then determined by neural network classification.

In the article [8], ordinal regression with machine learning techniques was used. For this, the authors used a public database of Twitter comments, which had to be processed in advance using a feature extraction method. For the classification of the sentiment analysis they used different algorithms: multinomial logistic regression (SoftMax), support vector regression (SVR), decision trees (DTs) and random forest (RF).

In the article [7], the researchers used and upgraded a feature ensemble approach to sentiment analysis, taking into account various elements that other researchers neglect when using this method: the lexicon, the part of speech, the language semantics and the position of the words.

The approaches mentioned so far apply sentiment analysis to the English language. We were interested in how sentiment analysis is approached in other languages, where the sentiment is less clear. This is exactly what was done in the article [4], which addresses the ambiguity of Chinese phrases expressed in sentiment. The traditional machine learning approach cannot capture the true sentiment, so the authors proposed a multi-strategy sentiment analysis method and showed that hybrid sentiment analysis achieves satisfactory results.

The articles [5] and [9] perform sentiment analysis with the help of convolutional neural networks. The first article uses convolutional networks for the sentiment analysis of Twitter comments, and the second for sarcasm detection on social networks. Convolutional neural networks consist of several layers, where each layer performs the task of preparing, transforming or correcting the data.

Sentiment analysis can also be approached with dictionaries and the Naive Bayes method, as was done in the article [6], which analysed the sentiment of the comments of Danmaku¹ videos. The authors used a sentiment dictionary, which they extended with emoticons, since these are very important in the comments of Danmaku videos. Moreover, they did not only classify positive and negative comments, but used as many as seven classes: disgust, sadness, liking, anger, surprise, fear and joy. After generating a dictionary that fits the context (comments of Danmaku videos), they classified the comments with the Naive Bayes method. Of all the approaches mentioned above, the one using dictionaries and the Naive Bayes method was the most suitable for us, so we decided to use it when implementing our own idea.

¹ Danmaku is a Japanese term for the subtitle system used by online video platforms, which lets users post moving comments onto the video being played.
3 The proposed approach

In this chapter we present our sentiment analysis for classifying hotel comments and for obtaining, from the text of the comments, the accommodation satisfaction points of individual hotels. A short presentation of the data acquisition and of the design and execution of the experiment follows.

3.1 Data acquisition

We looked for a database that already contained classified hotel comments. We found such a database on the Kaggle website; a short presentation of the website follows.

Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows its users to find and publish datasets, to explore and build models in an online data science environment, to collaborate with other data scientists and machine learning engineers, and to take part in competitions solving data challenges.

The chosen database was created with the help of the Booking.com website. It contains 515,000 comments on 1493 different European hotels. The data have the following form: the hotel address, the date of the comment, the average score of the hotel (computed with the help of the last comment that appeared in the current year), the hotel name, the nationality of the comment's writer, the negative comment, the number of words in the negative comment, the positive comment, the number of words in the positive comment, the writer's score (the points the writer assigned to the hotel), the number of comments the writer has given, the number of comments an individual hotel has, the tags (assigned by the comment's writer), the number of days since the last comment, additional points for the hotel (some writers provided only points), and the geographic latitude and longitude of the hotel's location.

The comments in the database already had all punctuation and all characters not in the English alphabet (non-ASCII characters) removed. Since the database was well prepared, we did not have much work: we removed the numbers and converted the text of the comments to lower case. We also split the database into a test and a training set, where the training set covered approximately 80 % of all the data and the test set the remaining 20 %. We used the training set for building the dictionaries and the test set for the classification of the comments. The split into the training and test set was carried out for each hotel separately: we collected all the comments about a given hotel and divided them so that the training set contained approximately 80 % of the comments.

3.2 Building the dictionaries

With the data preprocessed, we could start building the dictionaries. We decided to run the algorithm in two ways: with unigrams and with bigrams. The algorithm that works with unigrams does not differ much from the one with bigrams; the two dictionaries are simply based once on unigrams and once on bigrams. Likewise, when using the dictionaries and computing the classifications with the Naive Bayes method, the only difference appears in the preprocessing of individual comments, where with unigrams we take single words, and with bigrams pairs of words.

We approached the building of the dictionaries by first reading the training set and storing all the comments separately, split into positive and negative ones. For each variant of the algorithm (unigrams and bigrams) we built two dictionaries. The first dictionary held the computed probabilities that an individual n-gram (whenever we refer to an n-gram, we mean a unigram or a bigram) belongs to a positive comment, and the second the probabilities that an individual n-gram belongs to a negative comment. Before we could compute the probabilities of the n-grams belonging to an individual class, a few more steps were needed. To build the dictionary of positive comments, we needed all the n-grams found in the vocabulary of the positive comments. We also needed the list of all n-grams found in the positive as well as in the negative comments, together with their frequencies of occurrence. We also computed the size of the dictionary of all n-grams (how many different n-grams appear in the positive and negative comments). With this we could compute the individual probabilities of the n-grams belonging to a given class (positive or negative). The probability that an n-gram belongs to a class is computed by the equation:

P(n-gram | class) = (frequency(n-gram, class) + 1) / (frequency(class) + V).    (1)

Each probability is computed by taking the frequency of the n-gram within the class, adding one to it (to avoid zero values), and dividing the result by the sum of the class frequency and the dictionary size (V). The class frequency represents the number of n-grams that appear within the given class. For each n-gram we computed the probability for both classes (positive and negative) and in this way created the two dictionaries, which we saved to files. Since the test set used for the classification can contain an n-gram that did not appear in the training set (from which the dictionaries were built), we added to the beginning of each dictionary the value for an unknown n-gram and marked it as "unknown_". We used the underscore in the label because the n-gram "unknown" itself could appear in the comments, in which case we would take the wrong probability; since all punctuation had already been removed from the database, we could exploit this property and attach a punctuation mark to the n-gram.
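A minimal sketch of the dictionary-building step (our illustration; the paper's own implementation is not shown): per-class n-gram counts are turned into the probabilities of Equation 1, together with the special "unknown_" entry.

```python
from collections import Counter

def count_ngrams(comments, n=1):
    """Count the n-grams over a list of preprocessed comments of one class."""
    counts = Counter()
    for text in comments:
        words = text.split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts

def dictionary(class_counts, vocab_size):
    """Equation 1: P(n-gram|class) = (freq(n-gram, class) + 1) /
    (freq(class) + V), with the 'unknown_' entry for unseen n-grams."""
    denom = sum(class_counts.values()) + vocab_size
    probs = {ng: (c + 1) / denom for ng, c in class_counts.items()}
    probs['unknown_'] = 1 / denom
    return probs

pos_counts = count_ngrams(['great room and friendly staff'])   # toy data
neg_counts = count_ngrams(['dirty room and rude staff'])
V = len(set(pos_counts) | set(neg_counts))    # size of the joint vocabulary
pos_dict, neg_dict = dictionary(pos_counts, V), dictionary(neg_counts, V)
```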
3.3 Classification of the comments

The classification of the comments from the test set proceeded by first reading the two dictionaries saved in the previous step. We created a data structure Review, which contained the content of a comment, the information whether the comment is positive, the information about the correctness of the comment's classification, and the classification value. We also created a data structure HotelReviews, containing the hotel name, the list of comments (using the previously defined structure Review) and the classification value. We created a list of HotelReviews structures, in which we stored the data later obtained from the test set of the database.

We read the test set of the database and stored the information we needed: the hotel name and the comments, already marked according to whether they were recognized as positive or negative in the database, since this later helped us compute the accuracy of the classification model. We stored the data from the test set in the structures mentioned above. Once we had all the data stored in our data structures, we started the classification of the comments.

For each comment, which consists of a sequence of n-grams,

comment = n-gram_1, n-gram_2, ..., n-gram_N,    (2)

we always computed two values: the classification value that the comment belongs to the positive class and the classification value that it belongs to the negative class:

P(comment | positive) = log(P(positive)) + log(P(n-gram_1 | positive)) + log(P(n-gram_2 | positive)) + ... + log(P(n-gram_N | positive)),    (3)

P(comment | negative) = log(P(negative)) + log(P(n-gram_1 | negative)) + log(P(n-gram_2 | negative)) + ... + log(P(n-gram_N | negative)).    (4)

If the positive classification value was larger than the negative one, we classified the comment as positive; otherwise we classified it as negative.

We stored the computed classification value, adding a negative sign in the case of a negative classification result. This helped us in the following steps when determining the overall satisfaction of the hotel guests found in the comments, by computing the points of individual hotels. More on this follows in the next chapter.
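Continuing the toy dictionaries above, a sketch of the classification step of Equations 2–4 follows; the exact form of the stored signed value is not spelled out in the text, so the normalised winning probability used here is our assumption.

```python
import math

def log_score(ngrams, prior, probs):
    """Equations 3-4: log prior plus the summed log probabilities of the
    comment's n-grams, falling back to the 'unknown_' entry."""
    return math.log(prior) + sum(
        math.log(probs.get(ng, probs['unknown_'])) for ng in ngrams)

def classify(comment, pos_dict, neg_dict, p_pos=0.5, p_neg=0.5, n=1):
    words = comment.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    pos = log_score(ngrams, p_pos, pos_dict)
    neg = log_score(ngrams, p_neg, neg_dict)
    # One possible convention for the stored value: the normalised
    # probability of the winning class (log-sum-exp for stability),
    # with a negative sign for negative comments, as described above.
    m = max(pos, neg)
    p_win = 1.0 / (math.exp(pos - m) + math.exp(neg - m))
    return ('positive', p_win) if pos > neg else ('negative', -p_win)

print(classify('friendly staff and great room', pos_dict, neg_dict))
```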
3.4 Computing the hotel-stay satisfaction points

After classifying all the comments of individual hotels, we turned to assessing the overall satisfaction found in the comments. The original idea was to count the positive and the negative comments and to compute their share among all the comments of an individual hotel. We decided to go one step further and, instead of counting, to sum the classification values, which gave the points of individual hotels an additional weight. This weight shows in comments classified within the same class: e.g., two comments both classified as positive can differ in that one is more positive than the other.

In the previous step we assigned a negative sign to the comments classified as negative (the others have a positive sign, since the probability of something is always positive). We then summed these classification values for each hotel and in this way computed the overall satisfaction with the hotel. The larger the sum over the comments of a given hotel, the more satisfied were the guests who wrote them. For each hotel we thus computed a value that represents the hotel's points and captures satisfaction in general. We then ordered the hotels with respect to their computed points and in this way obtained a ranking of the hotels.

4 Experiment

In the experiment we performed the classification in two ways: using unigrams and using bigrams. We compared the two classification approaches with each other by computing the model accuracy, the precision, the recall and the F1 measure. In the computation of the accuracy, precision and recall we used the parameter tp (true positive), which represents the number of correctly classified positive comments, tn (true negative), which represents the number of correctly classified negative comments, fp (false positive), which represents the number of incorrectly classified negative comments, and fn (false negative), which gives the number of incorrectly classified positive comments:

Accuracy = (tp + tn) / (tp + tn + fp + fn),    (5)

Precision = tp / (tp + fp),    (6)

Recall = tp / (tp + fn),    (7)

F1 = 2 · Precision · Recall / (Precision + Recall).    (8)
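Equations 5–8 translate directly into code; a small self-contained sketch with hypothetical counts:

```python
def metrics(tp, tn, fp, fn):
    """Equations 5-8: accuracy, precision, recall and the F1 measure from
    the four counts defined above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts; the measured values of the experiment are in Table 1.
print(metrics(tp=460, tn=460, fp=30, fn=50))
```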
Vizualno predstavitev gativnih komentarjev, fp (ang. false positive), predstavlja vidimo na sliki 1, pri čemer zelena slika predstavlja po- število napačno klasificiranih negativnih komentarjev in goste besede, ki jih najdemo v pozitivnih komentarjih, fn (ang. false negative), ki pove število napačno klasifici- rdeča slika pa prikazuje pogoste besede znotrah negativ- ranih pozitivnih komentarjev. nih komentarjev. tp + tn N atancnost = (5) tp + tn + f p + f n tp P reciznost = (6) tp + f p tp P riklic = (7) tp + f n 2 ∗ P reciznost ∗ P riklic Slika 1: Vizualna predstavitev najpogostejsih besed v F 1 = (8) P reciznost + P riklic komentarjih Pri računanju metrik uspešnosti klasificikacij smo vre- dnosti izračunali tako za unigrame kot za bigrame, ter prišli do rezultatov, ki so prikazani v tabeli 1. Zraven 5 Zaključek naših rezultatov, so prikazani tudi rezultati iz literature [6], po kateri smo se zgledovali pri implementaciji algo-V okviru naše naloge smo uporabili analizo sentimenta ritma. nad komentarji hotelskih gostov. Klasifikacijo komen- Iz tabele 1 lahko razberemo, da se unigrami v vseh me-tarjev smo izvedli s pomočjo slovarjev in metode Naivni trikah uspešnosti bolje obnesejo od bigramov. Unigrami Bayes, pri čemer smo eksperiment izvedli s pomočjo uni- imajo namreč boljši rezultat pri natančnosti, preciznosti, gramov in bigramov. Končni rezultati so pokazali, da je priklicu in pri uglašeni meri F1. Največjo razliko med uni- klasifikacija s pomočjo unigramov uspešnejša, kot pa z grami in bigrami vidimo pri izračunu priklica. Na podlagi bigrami. Pri klasifikaciji s pomočjo unigramov smo do- tega lahko sklepamo, da smo pri bigramih imeli veliko segli 92 % natačnost, pri bigramih pa 80 %. V članku število napačno klasificiranih pozitivnih komentarjev. [6] so dosegli 75 % natančnost, torej so bili manj uspešni 68 pri klasifikaciji. Boljše rezultate napram primerjalnega članka smo dosegli zaradi uporabe slovarjev sentimenta, ki temeljijo zgolj na besedišču, ki ga najdemo v komen- tarjih, med tem ko so v [6] uporabljali bolj splošen slovar. Klasifikacijo komentarjev smo v našem članku uporabili za raziskavo splošnega zadovoljstva hotelskih gostov, ki so napisali komentarje za določen hotel. Eksperiment bi lahko v nadaljnem delu razširili z upoštevanjem nacio- nalnosti piscev komentarjev pri sami klasifikaciji. Na tak način bi lahko ugotovili korelacijo med nacionalnostjo pi- scev komentarjev in zadovoljstva glede določenega hotela, ki se skriva v komentarjih. Posameznim komentarjem bi v bodoče lahko dodali tudi številčno oceno (npr. od 1 do 5), ki bi jo pridobili s pomočjo razširitve trenutnega eksperimenta. Literatura [1] Bouazizi, M., and Otsuki Ohtsuki, T. A pattern-based approach for sarcasm detection on twitter. IEEE Access 4 (2016), 5477–5488. [2] Brandwatch, K. B. Understanding sentiment analysis: What it is & why it’s used. [3] Cai, R., Qin, B., Chen, Y., Zhang, L., Yang, R., Chen, S., and Wang, W. Sentiment analysis about investors and consumers in energy market ba- sed on bert-bilstm. IEEE Access 8 (2020), 171408– 171415. [4] Fang, Y., Tan, H., and Zhang, J. Multi-strategy sentiment analysis of consumer reviews based on semantic fuzziness. IEEE Access 6 (2018), 20625– 20631. [5] Jianqiang, Z., Xiaolin, G., and Xuejun, Z. Deep convolution neural networks for twitter senti- ment analysis. IEEE Access 6 (2018), 23253–23260. [6] Li, Z., Li, R., and Jin, G. Sentiment analysis of danmaku videos based on naïve bayes and sentiment dictionary. IEEE Access 8 (2020), 75073–75084. 
We achieved better results than the reference article because we used sentiment dictionaries based solely on the vocabulary found in the comments, while in [6] a more general dictionary was used.

In our article we used the classification of the comments to investigate the overall satisfaction of the hotel guests who wrote comments for a given hotel. In future work, the experiment could be extended by taking the nationality of the comment writers into account in the classification itself. In this way we could establish the correlation between the nationality of the comment writers and the satisfaction with a given hotel hidden in the comments. In the future, individual comments could also be given a numerical score (e.g., from 1 to 5), obtained by extending the current experiment.

References

[1] Bouazizi, M., and Otsuki Ohtsuki, T. A pattern-based approach for sarcasm detection on Twitter. IEEE Access 4 (2016), 5477–5488.
[2] Brandwatch, K. B. Understanding sentiment analysis: What it is & why it's used.
[3] Cai, R., Qin, B., Chen, Y., Zhang, L., Yang, R., Chen, S., and Wang, W. Sentiment analysis about investors and consumers in energy market based on BERT-BiLSTM. IEEE Access 8 (2020), 171408–171415.
[4] Fang, Y., Tan, H., and Zhang, J. Multi-strategy sentiment analysis of consumer reviews based on semantic fuzziness. IEEE Access 6 (2018), 20625–20631.
[5] Jianqiang, Z., Xiaolin, G., and Xuejun, Z. Deep convolution neural networks for Twitter sentiment analysis. IEEE Access 6 (2018), 23253–23260.
[6] Li, Z., Li, R., and Jin, G. Sentiment analysis of Danmaku videos based on naïve Bayes and sentiment dictionary. IEEE Access 8 (2020), 75073–75084.
[7] Phan, H. T., Tran, V. C., Nguyen, N. T., and Hwang, D. Improving the performance of sentiment analysis of tweets containing fuzzy sentiment using the feature ensemble model. IEEE Access 8 (2020), 14630–14641.
[8] Saad, S. E., and Yang, J. Twitter sentiment analysis based on ordinal regression. IEEE Access 7 (2019), 163677–163685.
[9] Son, L. H., Kumar, A., Sangwan, S. R., Arora, A., Nayyar, A., and Abdel-Basset, M. Sarcasm detection using soft attention-based bidirectional long short-term memory model with convolution network. IEEE Access 7 (2019), 23319–23328.
[10] Watanabe, H., Bouazizi, M., and Ohtsuki, T. Hate speech on Twitter: A pragmatic approach to collect hateful and offensive expressions and perform hate speech detection. IEEE Access 6 (2018), 13825–13835.
[11] Xu, G., Meng, Y., Qiu, X., Yu, Z., and Wu, X. Sentiment analysis of comment texts based on BiLSTM. IEEE Access 7 (2019), 51522–51532.

Cron Job Schedulers and the Serverless Environment (Časovni razporejevalniki in brezstrežniško okolje)

Uroš Zagoranski (University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška cesta 46, 2000 Maribor, Slovenia, uros.zagoranski@student.um.si)

DOI https://doi.org/10.18690/978-961-286-516-0.16, ISBN 978-961-286-516-0

Abstract

In this paper we focus on cron job schedulers in the serverless environment and on their reliable use. We compared schedulers implemented with the help of containers with those hosted in the cloud using the Function-as-a-Service approach. We investigated what the particularities of cron job schedulers in the serverless environment are and when it even makes sense to use them. On a practical example we show how they can be integrated into a larger system and how to most easily solve the potential problems that we consciously accept when choosing a serverless environment. We found that schedulers in a FaaS (Function as a Service) environment are the most suitable, thanks to simple and fast development and low operating costs.

Keywords: serverless architecture · modern approaches to implementing information solutions · cron job schedulers · function as a service · container as a service

1 Introduction

The pace of life has been increasing in recent years, and we are consequently ever more pressed for time. This, in turn, increases the need for the time-based scheduling of routines, so that no information gets lost in their excess. Systems enabling various kinds of scheduling have been developed since the very beginning, but it has never been easier than in today's era of cloud computing. A large set of cloud service providers allows us to choose between them, which, precisely because of the large offer, is sometimes not trivial. Choosing the right service is made harder above all by the lack of analytical overviews and comparisons of the services offered by the different corporations. With this in mind, we wanted to present the differences, advantages and additional challenges that arise when developing schedulers in the cloud with functions as the driving force, compared with those implemented in (more) classical applications with the help of container orchestration. In addition, we list and present the main systems that offer time-based scheduling in the serverless environment and explain, on practical use cases, which of the main providers is better for a given case and why.

The two research questions we addressed are:
• What are the particularities of cron job schedulers in the serverless environment?
• When does it make sense to use the cron job schedulers of the serverless environment?

In the remainder of the article, the second chapter presents the particularities of the modern approaches to developing software solutions, i.e., with containers and with functions as a service, together with their comparison. The third chapter covers the topic of cron job schedulers in general, pays closer attention to the schedulers in the serverless environment, surveys the main providers of such services and lists use cases. The fourth chapter presents a real use case of cron job schedulers for scheduling tasks in an accounting service. In the fifth chapter we answer the research questions and open the discussion, and in the last chapter we summarize the results of the research and the main new findings obtained with its help.

2 Modern approaches to implementing information solutions for task scheduling

An increasing number of software developers means a larger set of options and preferences in software development. For this reason, the research company Gartner produces, every year, a report on the trends in the IT world for the coming calendar year.
Recently, on this list we can ever more often find phrases such as artificial intelligence, augmented reality, analytics, automation and the like [12]. The virtualization of resources and programming in the cloud quickly merged with the classical approaches to implementing information solutions, so it is no wonder that the growth of this field and the innovativeness of the solutions in this category show a positive trend. Among the predictions for the near future, Gartner's research thus features phrases mentioning containers, their orchestration, functions as a service, and the serverless approach to software development in general [6].

The serverless approach to software development is a cloud computing model in which the cloud service provider takes care of the servers and of the allocation of hardware resources on behalf of its customers. The main particularity of such an approach is that an application has no computing resources allocated when it is not in use. This is reflected in a lower operating price, since the price is based on the actual amount of resources the application consumes in a given time interval [1].

In Gartner's analyses, the current state of serverless systems on the market is compared with the assumptions for the near future, and a financial estimate of serverless systems in the near future is given as well. They predict that, from last year's global revenue of 465.8 million dollars, the revenue will roughly double by 2024, amounting to 944 million dollars. According to the estimates, the largest growth should be achieved precisely by container orchestration and by the serverless approach to developing information solutions, the two topics we describe in more detail below [11].
Another Gartner study estimates that the use of platforms that use the smallest possible components, i.e., functions, as the triggers of activities in the system will also grow drastically in the future. Currently, approximately 20 percent of global IT companies use the "function-based" (FaaS) style of implementation, and by 2025 this number should rise to as much as 50 percent [4].

The rising interest in the use of cloud infrastructures is also reflected in the number of providers. Among the leading ones we count the following:
• Amazon Web Services,
• Google Cloud Platform,
• Microsoft Azure,
• IBM Apache OpenWhisk.

In the competition between these providers, the Amazon Web Services platform (AWS in the following) has been the main actor for many years, but it has recently been gaining more and more rivals, among which the most topical are Microsoft Azure and Google Cloud Platform (GCP in the following). If, however, we consider both of the previously mentioned fields (containers and functions), Google steps to the fore without competition, since it owns both the orchestration solution Kubernetes and the enormous cloud programming platform GCP [7]. Companies want their developers to focus primarily on software development, not also on maintenance, provisioning and other supporting activities. With containers we therefore often also find a person in charge of looking after and managing the entire architecture (DevOps), whom we in principle do not need with the function-based approach.

2.1 Containers: Container as a Service (CaaS)

The ever more frequent trend of using containers for the implementation of information solutions is also reflected in the emergence of a fairly new computing paradigm. CaaS represents an upgrade of the IaaS (Infrastructure as a Service) category, where corporations rent out servers, virtual machines, networks and storage, with their users responsible for managing the infrastructure and for deploying applications onto it. On top of this, CaaS also offers container engines and functionalities for their orchestration [19]. The most interesting for us is the CaaS subclass KaaS (Kubernetes as a Service), since, thanks to its robustness, maturity and rich set of functionalities, Kubernetes has become the de-facto standard and the most frequently used technology when it comes to developing information solutions with the joint use of containers [13].

Kubernetes is an open-source orchestration system that enables the automated management of containers and also manages the entire container life cycle, with the users managing the execution and the interaction of the applications [8]. Applications are typically composed of several containers, distributed across different virtual and physical hosts. Kubernetes helps us manage this by structuring containers into so-called pods, which enrich the environment with an additional level of abstraction. In this way, resource scheduling is simplified, as is time-based scheduling, which is the most interesting for us [10].

The container approach is suitable above all for teams that swear by more detailed control over the containerization services. It also enables the automation of deployment and of rollbacks to a previous state, the fine-tuning of the amounts of processing power and memory, the self-healing of containers that failed during execution, and configuration management [8].
2.2 Functions: Function as a Service (FaaS)

FaaS is a new concept in cloud computing, similar to PaaS (Platform as a Service) from the viewpoint of enabling simpler software development and of relieving the developers of managing servers and operating systems. The advantage of functions in the cloud compared with the previously mentioned concepts lies above all in the business aspect: while with PaaS one pays for the execution time of a thread, with FaaS one pays for the execution time of a specific function. This can concretely raise the performance characteristics of enormous systems and, at the same time, lower the operating costs of the information solution itself [1].

Starting in 2014 with Amazon's Lambda, the FaaS concept has by now become a recognized and well-accepted way of implementing information solutions. It has proven to be much more scalable, elastic, developer-friendly and affordable than the preceding cloud architectures. Besides AWS's Lambda, this category also includes Microsoft Azure Functions, Google Cloud Functions and IBM Apache OpenWhisk [2].

The approach with functions in the cloud is suitable above all for projects that have an uneven distribution of execution, extremely high or unknown system scaling requirements, are bound to external events, are composed of short-lived discrete functions, or can operate with stateless calls. We can often find them in mobile applications, IoT sensor components and time-triggered actions [4].

2.2.1 Comparison of the two groups

At the beginning of the rise of information technology, the building of monolithic applications became popular, but today this appears ever less often on the list of trends. Recently they have largely been replaced by micro-services, and consequently the use of containers was introduced, with which the scheduling and the development of applications are much simpler. Containers first appeared in the IT world before the year 2000, so this concept has a longer and richer history than functions in the cloud. Nevertheless, the so-called explosion of popularity of both concepts dates to a similar period: containers became extremely topical only with Docker in 2013, and FaaS with AWS's Lambda in 2014 [16].

FaaS represents a higher level of abstraction than CaaS. The latter abstracts the levels of the hardware, the virtualization machines and the operating system, while the former adds, on top of all the aforementioned levels, the abstraction of executing the program code and the application.
Both groups transfer the responsibility for manipulating the application and the functions to the user of the service.

Table 1: A comparison of CaaS and FaaS [9]

criterion | CaaS | FaaS
Portability | Containers are a standardized unit, which makes their portability between providers simple | Poor standardization, mostly provider-dependent; schedulers nevertheless partly standardized
Lift and shift | Existing applications are simply wrapped into containers if this has not been done already; containers have no influence on the operation of schedulers | Not possible; existing applications need adaptation or a complete re-implementation
Granularity | Units usually split into components; scheduling of specific functions is offered, including those in charge of scheduling | The application is fragmented into small units that are easy to schedule (functions); ideal for implementing schedulers
Execution | Managed by the developer | Managed by the cloud provider
Flexibility | High | Medium
Use cases | Entire web applications, micro-services, data management | Specialized functions, asynchronous processing, event-driven architecture
Challenges | Requires an understanding of how containers work; harder implementation of schedulers | Dependency management; tracking and managing several micro-services at once
Suitability | Efficient for hosting basic applications; less suitable for implementing schedulers | Most suitable for launching new applications; extremely suitable for implementing schedulers

If we briefly summarize Table 1, which compares the two approaches, we can conclude that, in general, the container approach offers more freedom in the implementation, more control over the execution, and a simple migration in case an existing application does not yet use such an approach. On the other side, the approach with functions in the cloud is more suitable and easier to maintain, but it is less flexible and poorly standardized, so the technical details that need to be considered during implementation differ between individual providers. Which approach to use depends above all on the use case: for web applications with micro-services the CaaS model is the more suitable choice, while for event-driven applications with asynchronous processing the FaaS model will prove better.

3 Cron job schedulers

The time-based triggering of activities is a completely ordinary activity in today's world, so the use of systems that take care of executing certain functions at a specific time, at a desired frequency or with certain delays is ever more common. Schedulers have thus accompanied us in the IT world practically from the beginning. They are present in operating systems, in network components and, after all, also in I/O operations (write and read operations).

While with all the previously listed components the term scheduling appears in the form of resource scheduling, for our research the time-based scheduling of tasks is much more interesting. In traditional systems, such problems are mostly solved with application servers and with container orchestration; in the new era of cloud computing, the previously mentioned FaaS approach appears ever more frequently. In information solutions of this type, the execution of specific functions at a given time is handled by functions called cron job schedulers, which, in all systems and regardless of the choice of technology stack, work on the same principle. Besides the same principle, they also share the same input format, composed of five variables and a call of a function or a command. The general form of a "cron" expression is shown in Figure 1.

[Figure 1: The general form of a "cron" expression.]

Among the most frequent terms found in the names of schedulers are the English words "cron" and "scheduler". Within the scope of the schedulers described above we find the main providers of such services: Google's Kubernetes, Amazon's Lambda tool in combination with the CloudWatch Events service, Google's Cloud Scheduler and Scheduled Cloud PubSub, and Microsoft's Azure Time Trigger.
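For concreteness, the five fields of a "cron" expression can be sketched as follows (a generic illustration, not tied to any single provider):

```python
# The five variables of a "cron" expression precede the function or
# command to be called: minute, hour, day of month, month, day of week.
FIELDS = ["minute (0-59)", "hour (0-23)", "day of month (1-31)",
          "month (1-12)", "day of week (0-6, Sunday = 0)"]

def describe(expression):
    """Pair each field of a cron expression with its meaning."""
    return list(zip(FIELDS, expression.split()))

# "0 6 * * 1" means: run every Monday at 06:00.
print(describe("0 6 * * 1"))
```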
3.1 The "container" environment

Containers can be used for various purposes; one of the more important is the processing of tasks in the background. This is a common use case, but it is anything but trivial and raises a number of questions, among them how to trace execution, how to scale, and how to manage restarts. Besides the resource schedulers that are unavoidable in an elaborate architecture built with tools such as Docker and Kubernetes, in these environments we also encounter time schedulers, which significantly affect resource optimization so that resources are used as efficiently as possible [14].

Containers by themselves allow the use of time triggers, but once multiple instances, and with them additional complexity, are introduced, a dedicated tool such as Kubernetes is required. For security reasons the state of an individual container cannot be shared with the remaining containers, so other ways of exposing scripts have to be found. Kubernetes also generates container names dynamically, which makes accessing them harder [17].
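In Kubernetes, the dedicated time-trigger mechanism mentioned above is the CronJob resource. The sketch below creates one through the official Python client; this is our minimal illustration under stated assumptions rather than code from the paper, and the image name, namespace and schedule are hypothetical.

```python
# Minimal sketch (assumptions: the official 'kubernetes' Python client with
# batch/v1 CronJob support, a reachable cluster, and a hypothetical image).
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside a pod

container = client.V1Container(
    name="reindex",
    image="example/reindex:1.0",          # hypothetical image
    command=["python", "reindex.py"],
)
cron_job = client.V1CronJob(
    metadata=client.V1ObjectMeta(name="db-reindex"),
    spec=client.V1CronJobSpec(
        schedule="0 3 * * *",             # five-field cron: daily at 03:00
        job_template=client.V1JobTemplateSpec(
            spec=client.V1JobSpec(
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="OnFailure",
                        containers=[container],
                    )
                )
            )
        ),
    ),
)
client.BatchV1Api().create_namespaced_cron_job(namespace="default", body=cron_job)
```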
3.2 The "function" environment

Time schedulers are not, by themselves, considered the most reliable. This becomes apparent in the serverless function environment, where they are often found as completely independent components. Their only task is to fire at a given time, so it can happen that a function invoked by a scheduler fails to execute successfully. Several factors can lead to this, among them a temporary infrastructure outage (rare in the cloud, but possible), failed data processing, or a dropped database connection. Since schedulers in a serverless environment receive only the execution time as a parameter, no validation of input parameters is needed. From this point of view they are extremely reliable, since (in principle) the invocation itself cannot fail. A bigger problem is the fact that the prescribed functions simply "switch off" after execution and do not care whether the invoked function completed successfully [15].

The main use cases include the following:
• a database re-indexing maintenance task that must run n times within a given time interval,
• deactivating expired user accounts,
• monitoring memory space,
• creating backups of sensitive data,
• cache management,
• sending a newsletter or a special offer to subscribed users,
• periodically checking for dead links on a website,
• encoding videos and large files in the background.

From this list we can see that time schedulers help us with many a routine task that would be harder to implement with a different software-development approach. As already mentioned, we can choose from a wide spectrum of providers, so the following subsections present the two most important ones, GCP Cloud Scheduler and AWS CloudWatch Events.

3.2.1 Google Cloud Platform – Cloud Scheduler

Google offers many services, bundled into an enormous platform called the Google Cloud Platform. If we choose GCP to implement time-based scheduling of routine tasks, we need a combination of two services, Cloud Functions and Cloud Scheduler. The latter by itself uses everything the former offers, so we do not need to worry about connecting the two.

Cloud Scheduler is a fully managed scheduler that supports the scheduling of batch jobs, big-data jobs and cloud-infrastructure operations. The main advantage of choosing this scheduler is reliability, since its distributed infrastructure is managed by Google. It thereby guarantees at-least-once delivery of messages to their target, and at the same time it simplifies the concept of "crontabs" by letting the desired schedule be specified with a so-called "cron" expression. In addition, it offers a powerful tool for logging executions and their outcomes, and simple configuration of a retry policy in case of errors. By default, three scheduler instances can be hosted on an account free of charge; each further instance costs 0.10 dollars per month [5].

3.2.2 Amazon Web Services – CloudWatch Events

Like Google, Amazon also owns a platform offering a large set of services, namely Amazon Web Services. Within it we find the AWS Lambda service, which by itself does not offer time-based scheduling. To achieve it, Lambda must be combined with one of three services that do: CloudFormation, CloudWatch Events, or EventBridge, which is the updated version of CloudWatch Events. While AWS Lambda only provides remote execution of functions, the other three services focus on a real-time stream of system events. The CloudWatch service remains in a so-called standby state and reacts to particular triggers, for example time triggers. It allows rules for launching functions to be specified, for example every n minutes or days, or with more complex scheduling, for which a "cron" expression is again used [3].

The CloudFormation service, combined with AWS SAM (Serverless Application Model), also enables multi-stage execution of routines, and writing SAM templates for serverless execution is entirely painless. Thanks to the excellent integration support among the different services of the AWS platform, the sky is the limit when implementing schedulers [18].
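As a sketch of the CloudWatch Events rules described above, the snippet below creates a five-minute schedule and points it at an existing Lambda function using boto3. The rule name and function ARN are hypothetical, and the Lambda invoke permission that a real setup also needs is only noted in a comment.

```python
# Minimal sketch (assumptions: boto3 with configured AWS credentials;
# the rule name and Lambda ARN below are hypothetical).
import boto3

events = boto3.client("events")

# A rule can use rate(...) or an AWS-style cron(...) expression.
events.put_rule(
    Name="payslip-mailer-schedule",
    ScheduleExpression="rate(5 minutes)",
    State="ENABLED",
)

# Attach the Lambda function as the rule's target.
events.put_targets(
    Rule="payslip-mailer-schedule",
    Targets=[{
        "Id": "payslip-mailer",
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:payslipMailer",
    }],
)
# Note: the function must also grant events.amazonaws.com permission to
# invoke it (via lambda add_permission) before the trigger takes effect.
```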
3.2.3 Comparison of the proposed schedulers

Both of the aforementioned providers have developed first-rate solutions for time-based scheduling, yet each has its own advantages. While GCP Cloud Scheduler offers three free scheduler instances, AWS CloudWatch Events provides a free tier with at most 100 invocations per month. There is also a noticeable difference in price: while the latter charges 0.30 dollars per month for 10,000 invocations, GCP charges a fixed 0.10 dollars for each additional scheduler instance. The documentation of Google's service is written somewhat superficially, whereas Amazon's is backed by a wealth of use cases. In terms of complexity Google comes out ahead, since connecting this service with others from the GCP portfolio is trivial, while AWS services are somewhat harder to integrate. Execution speed is extremely high with both providers, so in this respect there is practically no difference between them. To summarize the comparison briefly: when several different time schedulers with low launch frequencies are needed, or for experimental purposes, AWS is the better choice; with a fixed (smaller) number of schedulers and precisely defined goals for what the schedulers should achieve, GCP is.

4 Use case

For a more tangible picture of how time schedulers work in a function environment, consider the accounting service presented in Figure 2 (below), which is responsible for sending out the payslips of all employees of the companies it works with. The service wants fully automated delivery of payslips by e-mail. Since these e-mail messages contain data critical to the employees, it must be guaranteed that they are delivered without exception. This can be achieved by combining schedulers, which provide the timing of the messages, with messaging systems that offer guaranteed message delivery. The BPMN diagram in Figure 2 is designed generically, so it can be implemented in the same way regardless of the chosen function-architecture provider.

With the requirement for reliability, the complexity of the designed architecture grows as well. As already mentioned, disruptions can occur while the process described above is running. The loops in the diagram therefore represent the catching of errors around publishing messages to a topic. If an instance crashes while executing code on the first or second topic, the state is always preserved after a restart, because intermediate steps are persisted. This ensures that all messages are certainly sent, and moreover that every message is sent exactly once. Other previously mentioned options can also be incorporated into this use case: since payslips count as sensitive data, the accounting service could use schedulers to create backups of them; similarly, employees no longer active in the system could be deactivated, or, given additional functionality in the system, large files could be encoded in the background.

Figure 2: BPMN diagram of the scheduler design for the accounting system
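The "persist intermediate steps, retry on failure" pattern from the use case can be made concrete with a small sketch: at-least-once retries plus a checkpoint of already-sent payslips approximate exactly-once delivery. Everything here (the table name, the send_email helper, the payslip IDs) is hypothetical and only illustrates the idea, not the paper's actual implementation.

```python
# Minimal sketch of checkpointing (all names hypothetical): retries give
# at-least-once sending, the checkpoint table deduplicates re-runs.
import sqlite3

db = sqlite3.connect("checkpoint.db")
db.execute("CREATE TABLE IF NOT EXISTS sent (payslip_id TEXT PRIMARY KEY)")

def send_email(payslip):                  # stand-in for the real mail system
    print("sending payslip", payslip["id"])

def run(payslips):
    """Called by the time scheduler; safe to re-run after a crash."""
    for p in payslips:
        already = db.execute(
            "SELECT 1 FROM sent WHERE payslip_id = ?", (p["id"],)
        ).fetchone()
        if already:                       # intermediate step persisted earlier
            continue
        send_email(p)                     # may raise; the next run retries it
        db.execute("INSERT INTO sent (payslip_id) VALUES (?)", (p["id"],))
        db.commit()                       # checkpoint right after success

run([{"id": "2021-09/emp-42"}, {"id": "2021-09/emp-43"}])
# Caveat: a crash between send_email and commit re-sends that one payslip;
# strict exactly-once also needs deduplication on the receiving side.
```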
5 Discussion

As we can gather from the second section, there are many approaches to implementing time schedulers for routines in cloud architectures, so by choosing one of them we also take on a number of particularities that must be considered in the implementation, execution and maintenance of such systems. Exactly this interested us in our first research question, which reads:

• What are the particularities of time schedulers in a serverless environment?

The particularities of time schedulers in a serverless environment divide into those that represent advantages for the user and those that represent drawbacks, as is evident from Table 1. Among the particularities to consider during implementation are somewhat weaker support for personalization and a lack of standardization. For this reason it is necessary, when choosing a provider, to study the specifics of the scheduler it offers; the differences we find between providers are above all financial and performance-related. Even so, choosing a serverless environment with the function approach presented in Section 3.2 brings many advantages, among them simple modification and configuration of schedulers via a "cron" expression, and a guaranteed at-least-once execution of the function invoked by the scheduler, which is somewhat harder to achieve in classical systems.

Another particularity is that schedulers in a serverless environment are an independent component. This means they are separated from the rest of the code, so we can still rely on them even when the base application runs into problems. A further particularity is that tight timing constraints are hard to meet: because two invocations may overlap in time, the smallest gap between two executions is one minute. Among the advantages gained by choosing a serverless architecture, it also makes sense to answer the second research question, which was the common thread of the whole article:

• When does it make sense to use the time schedulers of a serverless environment?

At first glance, scheduling routines is not something we encounter often. Reality is quite different, however: we quickly find that time schedulers are used for many a task, as is evident from the list of the most common use cases in Section 3.2. It therefore makes sense to use them above all for tasks that we want executed in the background, without operator intervention. As already mentioned in Section 4, these are, for example, database re-indexing maintenance tasks, deactivating expired user accounts on some portal, monitoring and managing memory space, creating backups of sensitive data, sending e-mail messages to particular users, and encoding large files in the background. The time schedulers of the serverless environment, using functions as a service, are thus sensible to use whenever any other time scheduler would be sensible; in addition, schedulers of this type bring the added value of fast implementation and a small investment on the user's side for a service that demonstrably works.

Choosing a serverless environment has many more advantages. One of them is that the schedulers run only when strictly necessary; local schedulers, by contrast, require a machine that is switched on 24 hours a day in order to run undisturbed. Separating components, in our case the schedulers, from our code brings freshness and makes the individual parts of a project easier to manage, since smaller units are far more manageable. The main advantage of schedulers in a serverless environment, however, is certainly the answer to the question: why re-implement something that is already available and works reliably and fast?

6 Conclusion

In this paper we investigated approaches to implementing time schedulers in a serverless environment and their reliable use. We went deeper into the comparison of two representatives of the serverless environment, containers and the FaaS approach. We got to know in more detail the two main providers of functions as a service, AWS CloudWatch Events and GCP Cloud Scheduler, which by themselves enable the implementation and reliable use of time schedulers.

Through the analysis of the literature and the answers to the two research questions posed at the start of the research, we found that time schedulers, especially in a FaaS environment, are extremely easy to develop, and that they require little maintenance. It makes sense to use them for routine tasks we have no time to deal with, so they can relieve us of work, while also being extremely affordable. To sum up, it is advisable to use the time schedulers of the serverless environment whenever a project develops a need to handle repeating or routine tasks.

References

[1] Alqaryouti, O., and Siyam, N. Serverless Computing and Scheduling Tasks on Cloud: A Review. American Scientific Research Journal for Engineering, Technology, and Sciences (ASRJETS) (Mar 2018), 1–14.
[2] Aske, A. M. Scheduling Functions-as-a-Service at the Edge. PhD thesis, Washington State University, Vancouver, May 2018.
[3] AWS. What Is Amazon CloudWatch Events? – Amazon CloudWatch Events, 2021.
[4] Costello, K. The CIO's Guide to Serverless Computing. Gartner (Jun 2020).
[5] Google. Cloud Scheduler | Google Cloud, 2021.
[6] Iams, T. Gartner Blog Network. Gartner (May 2019).
[7] Kohgadai, A. 6 Container Adoption Trends of 2020 | StackRox. StackRox (Mar 2020).
[8] Kubernetes. What is Kubernetes? | Kubernetes, Feb 2021.
[9] Leger, Y., and Broshar, A. FaaS vs CaaS: Comparing Use Cases and Responsibilities, Feb 2021.
[10] Marathe, N., Gandhi, A., and Shah, J. M. Docker Swarm and Kubernetes in Cloud Computing Environment. In Proceedings of the International Conference on Trends in Electronics and Informatics, ICOEI 2019 (Apr 2019), vol. 2019-April, Institute of Electrical and Electronics Engineers Inc., pp. 179–184.
[11] Moore, S. Gartner Forecasts Strong Revenue Growth for Global Container Management Software and Services Through 2024. Gartner (Jun 2020).
[12] Panetta, K. Gartner Top Strategic Technology Trends for 2021. Gartner (Oct 2020).
[13] Pereira Ferreira, A., and Sinnott, R. A Performance Evaluation of Containers Running on Managed Kubernetes Services. In Proceedings of the International Conference on Cloud Computing Technology and Science, CloudCom (Dec 2019), vol. 2019-December, IEEE Computer Society, pp. 199–208.
[14] Rodriguez, M. A., and Buyya, R. Container-based Cluster Orchestration Systems: A Taxonomy and Future Directions. Software – Practice and Experience 49, 5 (May 2019), 698–719.
[15] Singhvi, A., Houck, K., Balasubramanian, A., Danish Shaikh, M., Venkataraman, S., and Akella, A. Archipelago: A Scalable Low-Latency Serverless Platform. Tech. rep., University of Wisconsin–Madison, Nov 2019.
[16] Tozzi, C. Why Is Docker So Popular? Explaining the Rise of Containers and Docker. Channel Futures, Jul 2017.
[17] Walker, J. How to Use Cron With Your Docker Containers, Jan 2021.
[18] Yilmaz, E. 3 Ways to Schedule AWS Lambda and Step Functions State Machine Executions, Jan 2020.
[19] Zhang, R., Chen, Y., Zhang, F., Tian, F., and Dong, B. Be Good Neighbors: A Novel Application Isolation Metric Used to Optimize the Initial Container Placement in CaaS. IEEE Access 8 (Sep 2020), 178195–178207.
PROCEEDINGS OF THE 2021 7TH STUDENT COMPUTER SCIENCE RESEARCH CONFERENCE (STUCOSREC)

IZTOK FISTER ET AL. (EDS.)
University of Maribor, Faculty of Electrical Engineering and Computer Science, Maribor, Slovenia. E-mail: iztok.fister@um.si

Abstract: The 7th Student Computer Science Research Conference is an answer to the fact that modern PhD and even Master's level Computer Science programmes foster early research activity among students. The prime goal of the conference is to be a place for students to present their research work, and hence to further encourage students towards early research. Besides this, the conference also aims to establish an environment where students from different institutions meet, get to know each other, exchange ideas and, not least, make friends and research colleagues. Last but not least, the conference is also meant to be a meeting place for students and senior researchers from institutions other than their own.

Keywords: student conference, computer and information science, artificial intelligence, data science, data mining
DOI https://doi.org/10.18690/978-961-286-516-0
ISBN 978-961-286-516-0