Zbornik 25. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2022
Zvezek C

Proceedings of the 25th International Multiconference INFORMATION SOCIETY – IS 2022
Volume C

Odkrivanje znanja in podatkovna skladišča - SiKDD
Data Mining and Data Warehouses - SiKDD

Urednika / Editors: Dunja Mladenić, Marko Grobelnik

http://is.ijs.si

10. oktober 2022 / 10 October 2022
Ljubljana, Slovenija

Urednika:
Dunja Mladenić, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana

Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič

Dostop do e-publikacije:
http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2022

Informacijska družba
ISSN 2630-371X

Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 127444483
ISBN 978-961-264-243-3 (PDF)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2022

Petindvajseta multikonferenca Informacijska družba je preživela probleme zaradi korone. Zahvala za skoraj normalno delovanje konference gre predvsem tistim predsednikom konferenc, ki so kljub prvi pandemiji modernega sveta pogumno obdržali visok strokovni nivo. Pandemija v letih 2020 do danes skoraj v ničemer ni omejila neverjetne rasti IKT-ja, informacijske družbe, umetne inteligence in znanosti nasploh, ampak nasprotno – rast znanja, računalništva in umetne inteligence se nadaljuje z že kar običajno nesluteno hitrostjo.
Po drugi strani se nadaljuje razpadanje družbenih vrednot ter tragična vojna v Ukrajini, ki lahko pljuskne v Evropo. Se pa zavedanje večine ljudi, da je potrebno podpreti stroko, krepi. Konec koncev je v 2022 v veljavo stopil nov raziskovalni zakon, ki bo izboljšal razmere, predvsem pa leto za letom povečeval sredstva za znanost.

Letos smo v multikonferenco povezali enajst odličnih neodvisnih konferenc, med njimi »Legende računalništva«, s katero postavljamo nov mehanizem promocije informacijske družbe. IS 2022 zajema okoli 200 predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic ter 400 obiskovalcev. Prireditev so spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica (http://www.informatica.si/), ki se ponaša s 46-letno tradicijo odlične znanstvene revije.

Multikonferenco Informacijska družba 2022 sestavljajo naslednje samostojne konference:
• Slovenska konferenca o umetni inteligenci
• Izkopavanje znanja in podatkovna skladišča
• Demografske in družinske analize
• Kognitivna znanost
• Kognitonika
• Legende računalništva
• Vseprisotne zdravstvene storitve in pametni senzorji
• Mednarodna konferenca o prenosu tehnologij
• Vzgoja in izobraževanje v informacijski družbi
• Študentska konferenca o računalniškem raziskovanju
• Matcos 2022

Soorganizatorji in podporniki konference so različne raziskovalne institucije in združenja, med njimi ACM Slovenija, SLAIS, DKZ in druga slovenska nacionalna akademija, Inženirska akademija Slovenije (IAS). V imenu organizatorjev konference se zahvaljujemo združenjem in institucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi. Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju.

S podelitvijo nagrad, še posebej z nagrado Michie-Turing, se avtonomna stroka s področja opredeli do najbolj izstopajočih dosežkov.
Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe je prejel prof. dr. Jadran Lenarčič. Priznanje za dosežek leta pripada ekipi NIJZ za portal zVEM. »Informacijsko limono« za najmanj primerno informacijsko potezo je prejela cenzura na socialnih omrežjih, »informacijsko jagodo« kot najboljšo potezo pa nova elektronska osebna izkaznica. Čestitke nagrajencem!

Mojca Ciglarič, predsednik programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD - INFORMATION SOCIETY 2022

The 25th Information Society Multiconference (http://is.ijs.si) survived the COVID-19 problems. The multiconference survived thanks to the conference chairs who bravely decided to continue with their conferences despite the first pandemic of the modern era. The COVID-19 pandemic from 2020 until now did not slow the growth of ICT, the information society, artificial intelligence and science overall; quite the contrary – the progress of computers, knowledge and artificial intelligence has continued at a fascinating rate.

However, the decay of societal norms seems to continue slowly but surely, along with the tragic war in Ukraine. On the other hand, the awareness of the majority that science and development are the only prospect for a prosperous future is growing substantially. In 2020, a new law regulating Slovenian research was accepted, promoting an increase of funding year by year.

The multiconference is running parallel sessions with 200 presentations of scientific papers at twelve conferences, many round tables, workshops and award ceremonies, and 400 attendees. Among the conferences, "Legends of computing" introduces the "Hall of Fame" concept for computer science and informatics. Selected papers will be published in the Informatica journal, with its 46-year tradition of excellent research publishing.
The Information Society 2022 Multiconference consists of the following conferences:
• Slovenian Conference on Artificial Intelligence
• Data Mining and Data Warehouses
• Cognitive Science
• Demographic and family analyses
• Cognitonics
• Legends of computing
• Pervasive health and smart sensing
• International technology transfer conference
• Education in information society
• Student computer science research conference 2022
• Matcos 2022

The multiconference is co-organized and supported by several major research institutions and societies, among them ACM Slovenia (the Slovenian chapter of the ACM), SLAIS, DKZ and the second national academy, the Slovenian Engineering Academy. In the name of the conference organizers, we thank all the societies and institutions, and particularly all the participants for their valuable contributions and their interest in this event, as well as the reviewers for their thorough reviews.

The award for life-long outstanding contributions is presented in memory of Donald Michie and Alan Turing. The Michie-Turing award was given to Prof. Dr. Jadran Lenarčič for his life-long outstanding contribution to the development and promotion of the information society in our country. In addition, the yearly recognition for current achievements was awarded to NIJZ for the zVEM platform. The information lemon goes to the censorship on social networks. The information strawberry for the best information service last year went to the electronic identity card. Congratulations!
Mojca Ciglarič, Programme Committee Chair
Matjaž Gams, Organizing Committee Chair

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee:
Vladimir Bajic, South Africa; Heiner Benking, Germany; Se Woo Cheon, South Korea; Howie Firth, UK; Olga Fomichova, Russia; Vladimir Fomichov, Russia; Vesna Hljuz Dobric, Croatia; Alfred Inselberg, Israel; Jay Liebowitz, USA; Huan Liu, Singapore; Henz Martin, Germany; Marcin Paprzycki, USA; Claude Sammut, Australia; Jiri Wiedermann, Czech Republic; Xindong Wu, USA; Yiming Ye, USA; Ning Zhong, USA; Wray Buntine, Australia; Bezalel Gavish, USA; Gal A. Kaminka, Israel; Mike Bain, Australia; Michela Milano, Italy; Derong Liu, Chicago, USA; Toby Walsh, Australia; Sergio Campos-Cordobes, Spain; Shabnam Farahmand, Finland; Sergio Crovella, Italy

Organizing Committee:
Matjaž Gams, chair; Mitja Luštrek; Lana Zemljak; Vesna Koricki; Mitja Lasič; Blaž Mahnič

Programme Committee:
Mojca Ciglarič, chair; Bojan Orel; Franc Solina; Viljan Mahnič; Cene Bavec; Tomaž Kalin; Jozsef Györkös; Tadej Bajd; Jaroslav Berce; Mojca Bernik; Marko Bohanec; Ivan Bratko; Andrej Brodnik; Dušan Caf; Saša Divjak; Tomaž Erjavec; Bogdan Filipič; Andrej Gams; Matjaž Gams; Mitja Luštrek; Marko Grobelnik; Nikola Guid; Marjan Heričko; Borka Jerman Blažič Džonova; Gorazd Kandus; Urban Kordeš; Marjan Krisper; Andrej Kuščer; Jadran Lenarčič; Borut Likar; Janez Malačič; Olga Markič; Dunja Mladenič; Franc Novak; Vladislav Rajkovič; Grega Repovš; Ivan Rozman; Niko Schlamberger; Stanko Strmčnik; Jurij Šilc; Jurij Tasič; Denis Trček; Andrej Ule; Boštjan Vilfan; Baldomir Zajc; Blaž Zupan; Boris Žemva; Leon Žlajpah; Niko Zimic; Rok Piltaver; Toma Strle; Tine Kolenik; Franci Pivec; Uroš Rajkovič; Borut Batagelj; Tomaž Ogrin; Aleš Ude; Bojan Blažica; Matjaž Kljun; Robert Blatnik; Erik Dovgan; Špela Stres; Anton Gradišek

KAZALO / TABLE OF CONTENTS

Odkrivanje znanja in podatkovna skladišča - SiKDD / Data Mining and Data Warehouses - SiKDD
PREDGOVOR / FOREWORD
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES
Emotion Recognition in Text using Graph Similarity Criteria / Komarova Nadezhda, Novalija Inna, Grobelnik Marko
SLOmet – Slovenian Commonsense Description / Mladenić Grobelnik Adrian, Novak Erik, Grobelnik Marko, Mladenić Dunja
Measuring the Similarity of Song Artists using Topic Modelling / Calcina Erik, Novak Erik
Exploring the Impact of Lexical and Grammatical Features on Automatic Genre Identification / Kuzman Taja, Ljubešić Nikola
Stylistic features in clustering news reporting: News articles on BREXIT / Sittar Abdul, Webber Jason, Mladenić Dunja
Automatically Generating Text from Film Material – A Comparison of Three Models / Korenič Tratnik Sebastian, Novak Erik
The Russian invasion of Ukraine through the lens of ex-Yugoslavian Twitter / Evkoski Bojan, Mozetič Igor, Kralj Novak Petra, Ljubešić Nikola
Visualization of consensus mechanisms in PoS based blockchain protocols / Baldouski Daniil, Tošić Aleksandar
Using Machine Learning for Anti Money Laundering / Kržmanc Gregor, Koprivec Filip, Škrjanc Maja
Forecasting Sensor Values in Waste-To-Fuel Plants: a Case Study / Brecelj Bor, Šircelj Beno, Rožanec Jože Martin, Fortuna Blaž, Mladenić Dunja
Machine Beats Machine: Machine Learning Models to Defend Against Adversarial Attacks / Rožanec Jože Martin, Papamartzivanos Dimitrios, Veliou Entso, Anastasiou Theodora, Keizer Jelle, Fortuna Blaž, Mladenić Dunja
Addressing climate change preparedness from a smart water perspective / Gucek Alenka, Pita Costa Joao, Massri M.Besher, Santos Costa João, Rossi Maurizio, Casals del Busto Ignacio, Mocanu Iulian
SciKit Learn vs Dask vs Apache Spark Benchmarking on the EMINST Dataset / Zevnik Filip, Fortuna Carolina, Mušić Din, Cerar Gregor
An Efficient Implementation of Hubness-Aware Weighting Using Cython / Buza Krisztian
Semantic Similarity of Parliamentary Speech using BERT Language Models & fastText Word Embeddings / Meden Katja
Indeks avtorjev / Author index

PREDGOVOR

Tehnologije, ki se ukvarjajo s podatki, so v devetdesetih letih močno napredovale. Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami, prišlo je do standardizacije procesov, povpraševalnih jezikov itd. Ko shranjevanje podatkov ni bilo več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskemu procesiranju, ampak tudi analitskim vpogledom v podatke – pojavilo se je t. i. skladiščenje podatkov (data warehousing), ki je postalo standarden del informacijskih sistemov v podjetjih. Paradigma OLAP (On-Line Analytical Processing) zahteva od uporabnika, da še vedno sam postavlja sistemu vprašanja, dobiva nanje odgovore ter na vizualen način preverja in išče izstopajoče situacije. Ker seveda to ni vedno mogoče, se je pojavila potreba po avtomatski analizi podatkov oz.
z drugimi besedami po tem, da sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj, zajetih v podatkih. Slovenska KDD konferenca pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve.

FOREWORD

Data-driven technologies have progressed significantly since the mid-90s. The first phase focused mainly on storing and efficiently accessing the data; it resulted in the development of industry tools for managing large databases, related standards, supporting query languages, etc. After this initial period, when data storage was no longer a primary problem, development progressed towards analytical functionality for extracting added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data. At this point, data warehousing with On-Line Analytical Processing entered as a usual part of a company's information system portfolio, requiring the user to pose well-defined questions about aggregated views of the data. Data Mining is a technology developed after the year 2000, offering automatic data analysis that tries to obtain new discoveries from the existing data and gives the user new insights into the data. In this respect, the Slovenian KDD conference (SiKDD) covers a broad area including Statistical Data Analysis; Data, Text and Multimedia Mining; Semantic Technologies; Link Detection and Link Analysis; Social Network Analysis; and Data Warehouses.

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Janez Brank, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Jožef Stefan Institute, Ljubljana
Jakob Jelenčič, Jožef Stefan Institute, Ljubljana
Branko Kavšek, University of Primorska, Koper
Besher M.
Massri, Jožef Stefan Institute, Ljubljana
Dunja Mladenić, Jožef Stefan Institute, Ljubljana
Erik Novak, Jožef Stefan Institute, Ljubljana
Inna Novalija, Jožef Stefan Institute, Ljubljana
Jože Rožanec, Qlector, Ljubljana
Abdul Sitar, Jožef Stefan Institute, Ljubljana
Luka Stopar, Sportradar, Ljubljana
Swati Swati, Jožef Stefan Institute, Ljubljana

Emotion Recognition in Text using Graph Similarity Criteria

Nadezhda Komarova, Inna Novalija, Marko Grobelnik
Jožef Stefan Institute
Jamova cesta 39, Ljubljana, Slovenia
nadezhdakomarova7@gmail.com

ABSTRACT

In this paper, a method of classifying text into several emotion categories employing different measures of similarity of two graphs is proposed. The emotions utilized are happiness, sadness, fear, surprise, anger and disgust, with the latter two joined into one category. The method is based on representing a text as a graph of 𝑛-grams; the results presented in the paper are obtained using the value of 5 for 𝑛: the 𝑛-grams were sequences of 5 characters. The graph representation of the text was constructed by observing which 𝑛-grams occur close together in the text; additionally, the frequencies of their connections were utilized to assign edge weights. To classify the text, the graph was compared with several emotion category graphs based on different graph similarity criteria. These criteria relate to common vertices, edges, and the maximum common subgraphs. The evaluation of the model on the test data set shows that utilizing the construction of the maximum common subgraph to obtain the graph similarity measure results in more accurate predictions. Additionally, employing the number of common edges as a graph similarity criterion yielded more accurate results compared to employing the number of common vertices to measure the similarity between the two graphs.

KEYWORDS

emotion recognition, text classification, machine learning, graphs, graph similarity

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia
© 2022 Copyright held by the owner/author(s).

1 INTRODUCTION

Emotion recognition is a problem that can be connected to different fields such as natural language processing, computer vision, deep learning, etc. [4] In this paper, the focus is on the task of recognizing emotions in texts.

In the literature, several approaches have been introduced that target this problem. Some of them employ word embedding vectors for emotion detection and recognition from text. The embedding vectors grasp the information related to semantics and syntax; however, a limitation of such approaches is that they do not capture the emotional relationship that exists between words. Some methods attempting to alleviate this issue include building a neural network architecture adopting pre-trained word representations. [3] Some text classification approaches employ 𝑛-grams to construct the text representation, e.g., to deal with the task of language identification. [9]

In this paper, the approach to emotion recognition employs 𝑛-grams to obtain a graph representation of text. The text is viewed as a sequence of characters that is divided into 𝑛-grams, i.e., shorter overlapping sequences of characters, as presented in Figure 1.

In Section 2, it is further explained how the graph of 𝑛-grams is constructed for a given text and how an emotion label is assigned to the text based on the similarity with the emotion category graphs. Afterwards, in Section 3, the method is compared with related approaches. In Section 4, an overview of results is focused on differences between the performance of the model when different graph similarity criteria are used. It is followed by the discussion of the model's limitations in Section 5.

2 PROPOSED METHOD

2.1 Constructing the Graph of 𝑛-grams

The method used in the paper to obtain the text representation in the form of the graph of 𝑛-grams is the following.
• The given text was separated into 𝑛-grams of characters. Different values of 𝑛 have been tested; the results in Section 4 use 𝑛 = 5. The 𝑛-grams into which the given text was split were overlapping.
• The 𝑛-grams obtained in this way were utilized to represent the labels of vertices of the graph.
• The edges of the graph were created in the following manner. The ends of edges were the vertices that corresponded to 𝑛-grams that occurred close to each other in the text, e.g., an edge connects the first 𝑛-gram at the beginning of the text with the second 𝑛-gram (these two 𝑛-grams would overlap with each other), as seen in Figure 1. Different values have been tested for the maximal distance between two vertices allowed for these two vertices to still be connected with an edge; the results in Section 4 use the value of 7.
• Performance of the model with both the directed and the undirected graphs has been tested.

Figure 1: Constructing the edges between the 5-grams that occur close to each other

In Figure 2, it is depicted how the edges are constructed between the vertices labelled with 𝑛-grams. For the clarity of representation, each 𝑛-gram is shown connected to 3 other 𝑛-grams instead of 7. It is important to note that if the same 𝑛-gram occurred in the text more than once, there was still only one vertex with this 𝑛-gram as a label: the connections of the 𝑛-gram have been aggregated at a single vertex.

Additionally, the graph constructed is weighted. The weights
of the edges are obtained utilizing the frequencies of connections of 𝑛-grams in the given text. In other words, the edge weights are initialized to 0; then, when constructing the graph of 𝑛-grams for a text, every time a certain edge would be added, the weight of the edge is instead increased by 1. Afterwards, the edge weights are normalized to be in the range (0, 1); hence, the edge weights are more comparable among the graphs of 𝑛-grams for different texts.

Figure 2: Constructing the edges between the 5-grams in the text fragment "oh how funny"

2.2 Constructing the Emotion Category Graphs

The core of the method is the construction of the graph of 𝑛-grams as described in Section 2.1. In the data set used to tune the model, there were shorter texts labelled with one of the following 5 emotions: happy, sad, surprised, fearful, or angry-disgusted. Overall, there were 1207 sentences included in the data set; out of this, the model was trained using 1086 sentences (to construct the emotion category graphs) and evaluated on 121 sentences (the split proportion is 90 : 10).

The process of obtaining the emotion category graphs is presented below.
(1) The data set was split into 5 parts, each containing only the texts labelled with the same emotion.
(2) Then, the texts in each part of the data set were used to obtain 5 graphs corresponding to each emotion.
(a) This process can be viewed as constructing, for each text labelled with a certain emotion, the graph of 𝑛-grams as explained in Section 2.1.
(b) Afterwards, these graphs are merged separately for different emotions to obtain 5 larger graphs of 𝑛-grams; during the merging process, the edges are aggregated in such a way that there are not any two vertices in the emotion category graph sharing the same label (the character 𝑛-gram to which they correspond).

2.3 Assigning an Emotion to a Given Text

Utilizing the 5 emotion category graphs corresponding to different emotions, it is determined for a given text to which emotion the text most likely corresponds. For that, the pairwise similarity measures of the graph of the given text and of the 5 emotion category graphs are employed. In other words, it is tested to which of the 5 graphs the graph of the given text is most similar, and the corresponding emotion is assigned to the given text.

Several similarity criteria of the two graphs have been explored.
(1) The number of vertices common to both graphs: the vertices are considered common if they share the same label (the 𝑛-gram they represent) in both graphs.
(2) The number of edges common to both graphs: an edge is considered common if the same vertices (vertices with the same labels) are the endpoints of the edge in both graphs and the edge weights are the same.
(3) The number of vertices in the maximum common subgraph (MCS) of the two graphs. Finding the maximum common subgraph is equivalent to finding a graph with the maximum number of vertices such that it is a subgraph of each of the two graphs. [8]
(4) The number of edges in the maximum common subgraph (MCS) of the two graphs.
(5) 𝑧 = 𝑚(𝑚 − 1)/2 − 𝑒, where 𝑚 denotes the number of vertices in the maximum common subgraph of the two graphs, and 𝑒 denotes the number of edges in the maximum common subgraph.

3 RELATED WORK

In the literature describing related approaches to text classification and emotion recognition, deep learning models are often utilized to obtain high-quality predictions. [7]

Apart from the approaches that employ word embedding vectors [6], there are also methods that connect neural networks and graphs. Such approaches may be similar to the method described in this paper, since the graph representation of text may be obtained in a similar way based on the semantic connections between words. One example of this kind of model is the graph neural network that is enhanced by utilizing BERT to obtain semantic features. [11]

The crucial part of the method in this paper is the graph similarity criterion that is used when comparing the graph of the given text with different emotion category graphs. In a similar way as the construction of the maximum common subgraph is used in this method, it can be employed in combination with probabilistic classifiers. [10] The approach in this paper, on the other hand, does not employ probabilistic classifiers such as Bayes classification or Support Vector Machines. [2] Instead, the emotion for which the similarity measure between the corresponding emotion category graph and the graph of the given text is maximised is assigned to the text.

Additionally, it is important to note that it is possible to incorporate alternative graph similarity criteria, e.g., related to subgraph matching, edit distance, belief propagation, etc. [5]

4 RESULTS

4.1 Experimental Setup

The data set used to train and evaluate the model was the one distributed by Cecilia Ovesdotter Alm. [1] It included sentences each labelled with one of the following emotions: happiness, sadness, fear, surprise, anger, and disgust. The latter two emotions were joined into one category.

During the evaluation stage, a corresponding emotion was predicted for each sentence; e.g., the text "then the servant was greatly frightened and said it may perhaps be only a cat or a dog" was labelled fearful, while the text "he looked very jovial did little work and had the more holidays" was recognized to be related to the emotion of happiness.

The value of 𝑛 that appeared to yield the best results, and was also used to obtain the results in Tables 1 and 2, was 5. Furthermore, each 5-gram (except those at the end of the text) is connected to 7 5-grams further in the text.

Table 1: Results of text classification using directed graphs

Similarity criterion    Accuracy  Precision  Recall  F1
Common vertices         0.488     0.506      0.332   0.323
Common edges            0.537     0.683      0.408   0.432
z                       0.372     0.074      0.200   0.108
Vertices in the MCS     0.570     0.622      0.426   0.446
Edges in the MCS        0.579     0.625      0.454   0.478

Table 2: Results of text classification using undirected graphs

Similarity criterion    Accuracy  Precision  Recall  F1
Common vertices         0.488     0.506      0.332   0.323
Common edges            0.554     0.669      0.429   0.460
z                       0.372     0.074      0.200   0.108
Vertices in the MCS     0.545     0.527      0.399   0.406
Edges in the MCS        0.570     0.581      0.439   0.453

Table 3: Confusion matrix: directed graph, number of edges in the MCS as the similarity criterion

Actual/pred.   Happy  Fearful  Surpr.  Sad  Angry-Disg.
Happy          43     1        0       0    1
Fearful        7      6        1       3    0
Surprised      6      1        2       1    1
Sad            12     1        0       12   1
Angry-Disg.    11     2        0       2    7

Table 4: Confusion matrix: undirected graph, number of edges in the MCS as the similarity criterion

Actual/pred.   Happy  Fearful  Surpr.  Sad  Angry-Disg.
Happy          42     1        0       1    1
Fearful        8      6        1       2    0
Surprised      6      1        1       1    2
Sad            11     1        0       13   1
Angry-Disg.    11     2        0       2    7

Table 5: Confusion matrix: directed graph, number of common edges as the similarity criterion

Actual/pred.   Happy  Fearful  Surpr.  Sad  Angry-Disg.
Happy          42     1        0       2    0
Fearful        10     4        0       3    0
Surprised      6      0        2       3    0
Sad            13     0        1       12   0
Angry-Disg.    16     1        0       0    5

In Tables 1 and 2, the "common edges" criterion means that two edges from both graphs are considered common if they have the same weight and the same endpoints.
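The graph construction of Section 2.1 and this edge-commonality check can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `ngram_graph` and `common_edges` are hypothetical, and dividing by the maximum frequency is one plausible reading of the normalization into (0, 1).

```python
from collections import Counter

def ngram_graph(text, n=5, window=7):
    """Weighted directed graph of character n-grams (Section 2.1 sketch).

    Vertices are the distinct n-grams; an edge links two n-grams whose
    start positions are at most `window` apart, weighted by the frequency
    of that pair. Weights are then scaled by the maximum frequency so
    they fall into (0, 1]; this normalization choice is an assumption.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    weights = Counter()
    for i, gram in enumerate(grams):
        for j in range(i + 1, min(i + window + 1, len(grams))):
            weights[(gram, grams[j])] += 1  # repeats aggregate at one edge
    top = max(weights.values(), default=1)
    return {edge: w / top for edge, w in weights.items()}

def common_edges(g1, g2):
    """Criterion (2): count edges with identical endpoints and weight."""
    return sum(1 for edge, w in g1.items() if g2.get(edge) == w)
```

On the fragment used in Figure 2, `ngram_graph("oh how funny")` yields a graph over its eight distinct 5-grams, and `common_edges` of that graph with itself simply counts all of its edges.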
Additionally, in Table 1, 𝑧 denotes the difference between the actual number of edges in the maximum common subgraph and the number of edges in the complete graph with 𝑚 vertices, where 𝑚 is the number of vertices in the maximum common subgraph.

In the trials that yielded the results in Table 1, the edges were directed, and in the trials that yielded the results in Table 2, the edges were undirected.

Table 6: Confusion matrix: undirected graph, number of common edges as the similarity criterion

Actual/pred.   Happy  Fearful  Surpr.  Sad  Angry-Disg.
Happy          41     1        0       2    1
Fearful        11     4        0       2    0
Surprised      6      0        2       3    0
Sad            12     0        1       13   0
Angry-Disg.    14     1        0       0    7

4.2 Analysis

From the results in Tables 1 and 2, it may be noticed that the highest accuracy on the test data set was achieved when the number of edges in the maximum common subgraph was used as the similarity measure. In Table 1, the second highest accuracy was achieved when the number of vertices in the maximum common subgraph was utilized. From this, it may be observed that the construction of the maximum common subgraph reflects the similarity better in certain cases; a possible reason may be that deeper semantic relationships can be captured this way, since connections between multiple 𝑛-grams are considered at the same time.

In Tables 3 and 4, the confusion matrices are presented for the trials when the number of edges in the maximum common subgraph was used as the criterion of graph similarity. From Tables 1 and 2, it is evident that this similarity criterion corresponded to the highest accuracy of predictions for both undirected and directed graphs. However, the accuracy corresponding to this similarity criterion is higher when the graphs are directed (0.579 compared to 0.570).

Furthermore, the accuracy corresponding to the similarity criterion being the number of common edges (considering both the endpoints and the weight of the edge) is higher by 0.017 when the graphs are undirected than when the graphs are directed (0.554 compared to 0.537). When the graphs utilized are undirected, the model might be more flexible regarding the exact order of the words that occur together. In Tables 5 and 6, confusion matrices are presented for the trials when the number of edges common to both graphs, considering the endpoints and the weights of the edges, was used as the criterion of graph similarity.

5 DISCUSSION

A strength of the approach presented in this paper is the ability to capture the context of the given words on different levels; this is related to the process of constructing the edges of the graph by connecting 𝑛-grams that occur together in the text. Additionally, the breadth of the contextual frame considered may be varied by altering the number of 𝑛-grams with which a certain 𝑛-gram is connected when constructing the edges.

However, overall, the accuracy values noted in Tables 1 and 2 were not very high, possibly indicating that the training data set was not large enough. Moreover, the data set did not include texts corresponding to different emotions in even proportions, resulting in an imbalance which could also have had a detrimental influence on the quality of predictions. The confusion matrices (Tables 3, 4, 5, and 6) indicate, e.g., that texts were often falsely assigned the emotion of happiness, since it was the most abundant class in the data set.

One of the limitations of the design of the model described is that, although it may be reasonable to expect that training the model (obtaining the emotion category graphs) on a larger corpus of texts is needed to obtain more accurate predictions on the test data set, this may bring a significant rise in computational complexity, since the category graphs would possess significantly larger numbers of vertices and edges. This is especially important if the maximum common subgraphs are constructed when obtaining a similarity measure, since for each text in the test data set, a maximum common subgraph would have to be constructed several times: between the graph of 𝑛-grams for a given text and each emotion category graph (5 such graphs in this case).

A possible solution to the problem of having too large category graphs might be reducing the length of 𝑛-grams, i.e., using smaller values of 𝑛, and hence reducing the number of vertices in the graph. Also, reducing the number of 𝑛-grams with which a certain 𝑛-gram is connected when constructing the edges of the graph may be investigated as a possible solution. However, if this value is too low, too much contextual information may be lost; therefore, it appears necessary that for each value of 𝑛, the optimal number of 𝑛-grams with which a certain 𝑛-gram is connected is determined experimentally.

6 CONCLUSION

In this paper, a model that utilizes graph similarity criteria to classify a given text into one of the emotion categories is described. The core of the method is to construct a graph of 𝑛-grams for a given text and to compare this graph to each of the emotion category graphs. The text is classified into the emotion category whose graph yielded the highest similarity value when compared to the graph of the given text.

To conclude, future work on the task of emotion recognition related to the proposed method may, on the one hand, be focused on employing alternative graph similarity measures in addition to those described in this paper, e.g., those connected to deriving the edit distance or to belief propagation. [5] Furthermore, clustering algorithms may be used to obtain the patterns characteristic of the emotion categories and further employ them for the emotion recognition task. To this end, both vertex clustering algorithms as well as the clustering of graphs as objects might be utilized. Additionally, a graph neural network architecture may be built along with incorporating the graphs of 𝑛-grams as the input for the network.

7 ACKNOWLEDGEMENTS

This work was supported by the Slovenian Research Agency under the project J2-1736 Causalify and the European Union through the Odeuropa EU H2020 project under grant agreement No 101004469.

REFERENCES

[1] Alm, E. C. O. Affect in text and speech, 2008.
[2] Bahritidinov, B., and Sanchez, E. Probabilistic classifiers and statistical dependency: The case for grade prediction. pp. 394–403.
[3] Batbaatar, E., Li, M., and Ryu, K. H. Semantic-emotion neural network for emotion recognition from text. IEEE Access 7 (2019), 111866–111878.
[4] Guo, J. Deep learning approach to text analysis for human emotion detection from big data. Journal of Intelligent Systems 31, 1 (2022), 113–126.
[5] Koutra, D., Ramdas, A., Parikh, A., and Xiang, J. Algorithms for graph similarity and subgraph matching, 2011.
[6] Li, S., and Gong, B. Word embedding and text classification based on deep learning methods. MATEC Web of Conferences 336 (2021), 06022.
[7] Prasanna, P., and Rao, D. Text classification using artificial neural networks. International Journal of Engineering and Technology (UAE) 7 (2018), 603–606.
[8] Quer, S., Marcelli, A., and Squillero, G. The maximum common subgraph problem: A parallel and multi-engine approach. Computation 8, 2 (2020), 48.
[9] Tromp, E., and Pechenizkiy, M. Graph-based n-gram language identification on short texts. Proceedings of Benelearn 2011 (2011), 27–34.
[10] Violos, J., Tserpes, K., Varlamis, I., and Varvarigou, T. Text classification using the n-gram graph representation model over high frequency data streams. Frontiers in Applied Mathematics and Statistics 4 (2018).
[11] Yang, Y., and Cui, X. Bert-enhanced text graph neural network for classification. Entropy (Basel) 23 (2021).
From the results of the trials noted in Tables 1 and 2, it may be concluded that among the graph similarity criteria described, that number of edges in the maximum common subgraph resulted in the highest quality of predictions. Furthermore, it may also be noted that employing the number of edges common to both graphs resulted in higher prediction accuracy than using the number of common vertices (0.537 and 0.488 accuracy for the directed graphs). This may appear to be intuitively reasonable as using edges may seem to incorporate more contextual information. Addition- ally, it may be important to investigate the effect of the difference between the size of the graph of 𝑛-gram for the given text and the size of the emotion category graph on the probability that the same connections between the two 𝑛-grams are found in both graphs. Moreover, it may be more probable that the same vertices 8 SLOmet – Slovenian Commonsense Description Adrian Mladenic Erik Novak Dunja Mladenic Marko Grobelnik Grobelnik Department for Artificial Department for Artificial Department for Artificial Intelligence, Intelligence, Intelligence, Department for Artificial Jozef Stefan Institute, Jozef Stefan Institute, Jozef Stefan Institute Intelligence, Jozef Stefan International Ljubljana Slovenia Ljubljana Slovenia Jozef Stefan Institute Postrgraduate School dunja.mladenic@ijs.si marko.grobelnik@ijs.si Ljubljana Slovenia Ljubljana Slovenia adrian.m.grobelnik@ijs.si erik.novak@ijs.si ABSTRACT English, we anticipate a noticeable drop in performance across all metrics for the Slovenian language models. This paper presents Slovenian commonsense description models The main contributions of this paper are (1) the comparison based on the COMET framework for English. Inspired by of the performance of commonsense description models using MultiCOMETs approach to multilingual commonsense description, we finetune two Slovenian GPT-2 language models. 
different Slovenian language models and the English model, (2) a Experimental evaluation based on several performance metrics comprehensive evaluation using a variety of performance metrics. shows comparable performance to the original COMET GPT-2 An additional contribution (3) is the Slovene ATOMIC-2020 model for English. dataset acquired by machine translation from the original English dataset [6]. KEYWORDS The rest of this paper is organized as follows: Section 2 deep learning, commonsense reasoning, multilingual natural provides the data description. Section 3 describes the problem and language processing, slovenian language model, gpt-2 the experimental setting. Section 4 exhibits our evaluation results. The paper concludes with discussion and directions for future work 1 Introduction in Section 5. Recent research [1] into commonsense representation and reasoning in the field of natural language understanding has 2 Data Description demonstrated promising results for automatic commonsense To train the Slovenian commonsense description models, we use generation. Given a simple sentence or common entity, such data from the ATOMIC-2020 dataset, as proposed in the COMET technology can generate plausible commonsense descriptions framework for English. The ATOMIC-2020 dataset consists of relating to it. However, further testing on complex sentences, English sentences and entities, labelled by up to 23 commonsense uncommon entities, or by increasing the quantity of requested relation types describing their semantics. commonsense descriptions usually gives nonsensical results. Following the recent success on the automatic generation of commonsense descriptions proposed in COMET-ATOMIC 2020 [1], we focus on extending the COMET framework to the Slovenian language. We investigate the impact of different Slovenian language models on the overall performance of commonsense description generation. 
In our previous research [2], we expanded on an existing approach for automatic knowledge base construction in English [3] to work on different languages. We utilized the original ATOMIC dataset [4]. This was performed by finetuning the original English GPT model from COMET 2019 on automatically translated Slovenian data and evaluated based on exact overlap for the generated commonsense descriptions. Evaluations were performed on a small subset of 100 sentences. In this work we use the updated ATOMIC-2020 dataset [1] and finetune two Slovenian GPT-2 language models. We evaluate the models’ performance using several performance metrics including BLEU, CIDEr, METEOR and ROUGE-L. The evaluation is performed on several thousand sentences and entities; we Figure 1 Close-up of “Event-Centered” descriptor values investigate how the predicted commonsense descriptions’ predicted for an example Slovene sentence “PersonX is sad” performance relates to the language model used. Furthermore, (“OsebaX je žalostna” in Slovenian) given the complexity of the Slovenian language compared to 9 We refer to them as descriptors, 9 of which are identical to METEOR — Metric for Evaluation of Translation with those used in our previous research [2]. The 23 descriptors are Explicit Ordering is a metric initially used for evaluating machine organized into 3 categories: “Physical-Entity”, “Event-Centered”, translation input. The metric is based on the harmonic mean of and “Social-Interaction”. The “Physical-Entity” descriptors capture unigram precision and recall with other features such as stemming knowledge about the usage, location, content, and other properties and synonymy matching. [10] of objects. The “Event-Centered” descriptors include IsAfter, Causes and other descriptors describing events. 
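To make the data format concrete, a descriptor-labelled example of the kind described above can be represented as a mapping from descriptors to open-text values and flattened into (head, relation, tail) training triples before finetuning. A minimal sketch; the dictionary contents and the flattening convention are illustrative assumptions, not verbatim dataset rows:

```python
# Sketch: an ATOMIC-2020-style labelled example, flattened into
# (head, relation, tail) triples. Values below are illustrative.

def flatten(head, annotations):
    """Turn {descriptor: [values]} into (head, descriptor, value) triples."""
    return [
        (head, descriptor, value)
        for descriptor, values in annotations.items()
        for value in values
    ]

example = {
    "xWant": ["catch the rabbit", "cook the rabbit for dinner"],
    "IsBefore": ["PersonX pets the rabbit"],  # illustrative value
}

pairs = flatten("PersonX chases the rabbit", example)
print(len(pairs))  # 3
print(pairs[0])    # ('PersonX chases the rabbit', 'xWant', 'catch the rabbit')
```

Each triple then becomes one input/output training example for the language model, with the head and relation as the prompt and the tail as the target.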
The “Social-Interaction” descriptors include xIntent, xNeed and oReact, distinguishing between causes and effects in social settings. An example of a part of a labelled sentence is shown in Figure 1.

Sentences and entities were manually labelled by human workers on Amazon Turk; they were assigned open-text values for the 23 commonsense descriptors, reflecting the workers’ subjective commonsense knowledge. For instance, when workers were given the sentence “PersonX chases the rabbit” and asked to label it for the “xWant” descriptor, one wrote “catch the rabbit” and another wrote “cook the rabbit for dinner”. A more detailed explanation can be found in the ATOMIC-2020 paper. There are 1.33 million (possibly repeating) descriptor values. The distribution of data across the descriptors is depicted in [1].

To finetune our Slovenian language models, we automatically translated the sentences, entities, and descriptor values of the ATOMIC-2020 dataset from English to Slovenian. The translation was done using DeepL’s Translate API [7]. We found that while the majority of inspected translations were of good quality, there were also incorrect translations due to word disambiguation problems. Nevertheless, we conclude that the dataset is of good enough quality to be used for our experiments. The translated dataset is publicly available [6].

3 Problem Description and Experimental Setting
The addressed problem is predicting the most likely values for each descriptor in the Slovene-translated ATOMIC-2020 dataset, given a Slovenian input sentence or entity. We take inspiration from the approach proposed in MultiCOMET [2].

To compare the performance of the models, we utilize a variety of performance metrics described below. Each performance metric is a value between 0 and 1 indicating the quality of a generated commonsense descriptor value; values closer to 1 represent higher-quality descriptions.

BLEU — Bilingual Evaluation Understudy was first used to evaluate the quality of machine-translated text by examining the overlap of candidate text n-grams with the reference text. BLEU-1 only uses 1-grams in the evaluation, while BLEU-4 only considers 4-grams. [8]

CIDEr — Consensus-based Image Description Evaluation was originally used to measure image description quality. It first transforms all n-grams to their root form, then calculates the average cosine similarity between the candidate and reference TF-IDF vectors. [9]

METEOR — Metric for Evaluation of Translation with Explicit ORdering is a metric initially used for evaluating machine translation output. The metric is based on the harmonic mean of unigram precision and recall, with further features such as stemming and synonymy matching. [10]

ROUGE-L — Recall-Oriented Understudy for Gisting Evaluation is a metric used for evaluating machine-produced summaries or translations against a set of human-produced references. The score is calculated using Longest Common Subsequence based statistics, which involve finding the longest subsequence common to all sequences in a set. [11]

Comparison of the Slovene commonsense models was performed by finetuning two state-of-the-art Slovene GPT-2 language models: macedonizer/sl-gpt2 [12] and gpt-janez [13]. As a reference model, we used the original COMET-2020 GPT2-XL English language model [1]. Moving forward, we will refer to our Slovenian finetuned models as “COMET sl-gpt2” and “COMET gpt-janez”.

4 Experimental Results
We performed a train, test, and development split on the ATOMIC-2020 dataset identical to the split used in COMET-2020. Our evaluation split consisted of over 150,000 descriptor values with their corresponding sentences and entities.

We finetuned our Slovene commonsense models on our training set consisting of over 1 million descriptor values. Both models were trained for 3 epochs under the same parameters: the maximum input length was set to 50, the maximum output length was set to 80, and the training was performed using a train batch size of 64. The model updates were performed using the weighted Adam optimizer [14] with the starting learning rate set to 10⁻⁵. The experiment’s implementation can be found in our GitHub repository [5].

Table 1: Comparison of the two Slovene commonsense models with the English model at the bottom.

Model            | Language | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR | ROUGE-L
COMET sl-gpt2    | Slovene  | 0.297  | 0.150  | 0.086  | 0.058  | 0.487 | 0.207  | 0.383
COMET gpt-janez  | Slovene  | 0.324  | 0.174  | 0.108  | 0.076  | 0.508 | 0.225  | 0.397
COMET (GPT2-XL)  | English  | 0.407  | 0.248  | 0.171  | 0.124  | 0.653 | 0.292  | 0.485

Experimental results show performance comparable to the original COMET-2020 English model. While both Slovene models were comparable to the English model across all metrics, “COMET gpt-janez” consistently outperformed “COMET sl-gpt2”, achieving a METEOR score of 0.225 compared to 0.207. The performance gap was smallest for BLEU-4, as all models struggled to produce generations whose 4-grams overlapped with those in the reference set. The gap in performance between the Slovene and English models could be attributed to multiple factors: the English model from COMET-2020 is larger and was trained for longer on more capable hardware, and the machine translation done to acquire our dataset can be erroneous at times.

To illustrate the performance of the models, we investigate their generated descriptor values on the same inputs. Table 2 shows a side-by-side example comparison of the descriptor values generated by our three models, given the same input sentence in their respective language. Table 3 compares the models on an example entity.

For the example sentence “Marko went to the shop”, the descriptor “oWant” indicates what the others want as a result of the event. “COMET gpt-janez” generates a valid output “None” but fails to provide alternatives. The other two models agree on the most likely descriptor value being “None” (“nič” in Slovenian) and provide plausible alternatives. The “IsBefore” descriptor relates to possible events following the input event. In our case, “COMET gpt-janez” gives the most plausible output of “Buys something”. The other two models provide still plausible outputs, including “Is in the pet store” and “PersonX buys a new car”.

Table 2: Illustrative example comparing the output of the three models on the same input sentence across two descriptors.

Marko je šel v trgovino (Marko went to the shop)
Descriptor | COMET sl-gpt2                   | COMET gpt-janez | COMET (GPT2-XL)
oWant      | Nič                             | Nič             | None
           | Se zahvaliti osebiX             | Nič             | To give him a receipt
           | se zahvaliti                    | Nič             | To give him a discount
IsBefore   | Zaslužiti denar                 | Kupiti nekaj    | PersonX buys a new car
           | V trgovino za hišne ljubljenčke | Kupiti nekaj    | PersonX takes the car back home
           | V trgovino z živili             | Kupiti nekaj    | PersonX buys a new one

Table 3: Illustrative example comparing the output of the three models on the same input entity across two descriptors.

Avto (car)
Descriptor  | COMET sl-gpt2     | COMET gpt-janez     | COMET (GPT2-XL)
ObjectUse   | Vožnja do trgovine | Priti do hiše      | Drive to the store
            | Vožnja do hiše     | Priti do hiše      | Get to the store
            | Vožnja do cilja    | Priti do hiše      | Drive to the restaurant
HasProperty | Noro               | Najden v avtomobilu | Found in parking lot
            | Vrata              | Najden v avtomobilu | Found on road
            | Pohištvo           | Najden v avtomobilu | Found in car dealership

In our example sentence and entity, COMET gpt-janez returns the same output when different commonsense descriptors are requested. We have observed this for all input sentences and entities thus far. We presume such results are due to the trained parameters of the original gpt-janez model, as macedonizer/sl-gpt2 was finetuned using the same workflow and returns different descriptor values. While unsure of the exact cause, we reason it could be due to an insufficient vocabulary or an unoptimized choice of parameters during training.
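The ROUGE-L scores reported in Table 1 rest on the longest-common-subsequence statistic described in Section 3. A minimal single-reference sketch (simplified: no stemming, one reference, balanced F-measure; the Slovene tokens are illustrative):

```python
# Sketch: single-reference ROUGE-L from LCS statistics.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """F-measure of LCS-based precision and recall over whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# LCS = 2, precision = 2/4, recall = 2/2 -> F ≈ 0.667
score = rouge_l("kupiti nekaj v trgovini", "kupiti nekaj")
print(round(score, 3))  # 0.667
```

Production implementations additionally handle multiple references and weight recall more heavily than precision; this sketch only shows the core LCS computation.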
For our example entity “car”, the descriptor “ObjectUse” describes possible usages of that entity. Table 3 shows that all models are capable of generating plausible descriptor values for such common entities. Nevertheless, the descriptor “HasProperty” proves challenging for the Slovenian models, suggesting a car is “crazy” and is “found in the car”. The English model, on the other hand, gives reasonable outputs such as “Found in parking lot”.

Figure 2 Close-up of “Social-Interaction” descriptor values predicted for an example Slovene sentence “John is very important” (“Janez je zelo pomemben” in Slovenian)

Figures 1, 2 and 3 show the outputs generated by “COMET sl-gpt2” for three different inputs. Figure 2 visualizes the output for the sentence “John is very important”. Outputs include “PersonX is then accomplished, happy, proud” and “As a result, others want to thank PersonX”. We can see that for many descriptors the highest-ranked output is “None” (“nič” in Slovenian), indicating that no commonsense inference can be made.

Figure 3 Close-up of “Physical-Entity” descriptor values predicted for an example Slovene entity “banana”

Figure 3 exhibits the output for the entity “banana”: the model claims the banana can be used to prepare food, is located in a building or shop, desires to be eaten for dinner, and does not desire to be frozen. On the other hand, the model claims the banana is made up of clothes and is capable of going to a restaurant. This is likely due to the overall significantly lower number of Physical-Entity descriptor values provided in the ATOMIC-2020 dataset.

In Figure 1 we can see the “Event-Centered” descriptors for the sentence “PersonX is sad”. Top descriptor values are again “None”, but the model also claims it is more difficult for PersonX to be sad if PersonX has no money.

5 Discussion
This paper applied an existing approach to multilingual commonsense description to the Slovene language. To implement our approach, we machine translated the ATOMIC-2020 dataset to Slovene and finetuned two Slovene commonsense models. We compared our models to the original English commonsense model from COMET-2020 and achieved comparable experimental results across multiple performance metrics. Among others, our models achieved a 0.487 CIDEr score, a 0.383 ROUGE-L score, and a BLEU-1 score of 0.297.

Through examination of individual examples, we observed that while “COMET gpt-janez” has the highest performance scores for the Slovene language, it fails to provide alternative descriptor values. “COMET sl-gpt2” provides multiple values for the same descriptor, but on average has lower performance. It is important to emphasize that the models were trained on subjective commonsense knowledge provided by individual humans. For example, workers labelled the sentence “PersonX digs holes” with the descriptor values “PersonX plants a garden” and “PersonX places fence posts in the holes” for the “IsBefore” descriptor. While both labels are plausible in some context, they are not necessarily true.

Possible directions for future work include evaluating the models’ performance for individual descriptors, as there are drastic differences in the quantity of training data and the lengths of values across them. After achieving results comparable to the original English commonsense model COMET-2020 GPT2-XL, we intend to finetune and evaluate models for other languages.

ACKNOWLEDGMENTS
The research described in this paper was supported by the Slovenian research agency under the project J2-1736 Causalify, the RSDO project funded by the Development of Slovene in a Digital Environment project, and the Humane AI Net European Union’s Horizon 2020 project under grant agreement No 952026.

REFERENCES
[1] Hwang, J.D., Bhagavatula, C., Le Bras, R., Da, J., Sakaguchi, K., Bosselut, A., & Choi, Y. (2021). COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs. AAAI.
[2] Mladenic Grobelnik, A., Mladenić, D., & Grobelnik, M. (2020). MultiCOMET – Multilingual Commonsense Description. In Proc. SiKDD 2020, Ljubljana, Slovenia (pp. 37–40).
[3] Bosselut, A., Rashkin, H., Sap, M., Malaviya, C., Celikyilmaz, A., & Choi, Y. (2019). COMET: Commonsense Transformers for Automatic Knowledge Graph Construction.
[4] Sap, M., Le Bras, R., Allaway, E., Bhagavatula, C., Lourie, N., Rashkin, H., Roof, B., Smith, N. A., & Choi, Y. (2019). ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning.
[5] SLOmet-ATOMIC 2020 GitHub: https://github.com/eriknovak/RSDO-SLOmet-atomic-2020#slomet-atomic-2020-on-symbolic-and-neural-commonsense-knowledge-graphs-in-slovenian-language Accessed 30.08.2022
[6] ATOMIC-2020 Slovene Machine Translated Data: https://www.dropbox.com/sh/gs8iqcwpwkaqkuf/AAAmnCqG89JOz_umtq42MMxxa?dl=0 Accessed 30.08.2022
[7] DeepL Translate API: https://www.deepl.com/pro-api Accessed 30.08.2022
[8] Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation.
[9] Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4566–4575).
[10] Lavie, A., & Denkowski, M. (2009). The METEOR metric for automatic evaluation of Machine Translation. Machine Translation, 23, 105–115.
[11] Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out.
[12] Documentation page for “macedonizer/sl-gpt2” on HuggingFace: https://huggingface.co/macedonizer/sl-gpt2 Accessed 1.09.2022
[13] gpt-janez supporting project: RSDO: https://www.cjvt.si/rsdo/en/project/ Accessed 30.08.2022
[14] Loshchilov, I., & Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations.


Measuring the Similarity of Song Artists using Topic Modelling

Erik Calcina, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
Erik Novak, Jožef Stefan International Postgraduate School and Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT
On music streaming platforms, a recommendation system is necessary to provide users with songs similar to what they already listen to, and also to recommend new artists they might be interested in. In this paper, we present a method for finding similarities between artists that uses topic modelling. We have evaluated the method using a data set of music artists and their lyrics. The results show that the method finds similar artists, but is also dependent on the quality of the generated topic clusters.

KEYWORDS
song lyrics, topic modelling, clustering, sentence embeddings, language models

1 INTRODUCTION
Nowadays, there are plenty of music platforms to choose from and listen to music on. There, new artists appear every day and many different songs are published. If we take into account all that have been created, we get a large selection of songs, which can increase the difficulty of finding suitable songs or artists to listen to.

To find suitable artists or songs, different aspects can be considered. One such aspect is the topic of the song; a song topic can be interpreted as the main subject of the song, for example an emotion, an event, a message, or something else. When searching for suitable artists, one could decide to search for artists who have songs on similar topics.

In this paper, we propose a topic modelling-based approach for measuring the similarity of music artists based only on their song lyrics. The approach uses language models to generate song embeddings, which are used to create the topic clusters. These topic clusters are then analyzed to find similar artists. The experiment was performed on a data set of songs corresponding to fourteen (14) music artists. While the experiment shows that similar artists can be detected using the approach, there is still room for improving its performance. The main contribution of this paper is a novel approach for detecting similar music artists using topic modelling.

The remainder of the paper is structured as follows: Section 2 contains an overview of the related work on using topic modelling on song data sets. Next, we present the methodology in Section 3 and describe the experiment setting in Section 4. The experiment results are found in Section 5, followed by a discussion in Section 6. Finally, we conclude the paper and provide ideas for future work in Section 7.

2 RELATED WORK
Related works to our topic modelling approach use Latent Dirichlet Allocation (LDA) [1]. One work uses a topic modelling technique for sentiment classification, classifying between happy and sad songs, using topics generated with LDA and the Hierarchical Dirichlet Process [12]; from a data set consisting of 150 lyrics, they were able to retrieve the sub-division into the two defined sentiment classes [3]. Another work used LDA and Pachinko allocation [7] on a large data set to assess the quality of the generated topics by applying a supervised topic modelling approach [8]. In our paper, we use topic modelling to generate a set of topic clusters used to calculate the similarity between artists.

3 METHODOLOGY
In this section, we present the methodology used in this paper. We present the topic modelling approach used to generate the topic clusters, followed by a description of how the topic clusters are used to measure the similarity between the artists.

3.1 Topic Modeling
To create the topic clusters we use BERTopic [5], a method which uses document embeddings with clustering algorithms to create topic clusters. While BERTopic is described in a separate work, we present a brief description of its workflow. The workflow is also presented in Figure 1.
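The shape of this workflow (embed the documents, group them, describe each group) can be sketched in code. The sketch below substitutes toy stand-ins for every stage: hashed bag-of-words counts instead of sentence-transformer embeddings, and a greedy cosine-threshold grouping instead of UMAP plus HDBSCAN. It illustrates only the structure of the pipeline, not the actual algorithms used in the paper:

```python
# Toy sketch of the embed -> cluster pipeline (stand-ins only:
# bag-of-words instead of sentence embeddings, greedy threshold
# grouping instead of UMAP + HDBSCAN).
from collections import Counter
from math import sqrt

def embed(doc):
    """Stand-in embedding: a bag-of-words term-count vector."""
    return Counter(doc.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.5):
    """Greedily assign each document to the first cluster whose seed
    document is similar enough; otherwise open a new cluster."""
    clusters = []  # each cluster is a list of document indices
    vecs = [embed(d) for d in docs]
    for i, v in enumerate(vecs):
        for members in clusters:
            if cosine(v, vecs[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

songs = [
    "love you love you baby",
    "baby i love you",
    "highway to the night highway ride",
]
print(cluster(songs))  # [[0, 1], [2]]
```

In the real pipeline, the density-based HDBSCAN step can also leave a document unassigned (the outlier cluster mentioned in Section 5), which this greedy stand-in does not model.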
Figure 1: The BERTopic methodology workflow. The highlighted part is used in our approach. The image has been designed using resources from Flaticon.com.

Document Embeddings. Document vector representations are generated using a sentence-transformer [11] model. The model creates a semantic representation of the documents, which allows measuring their semantic similarity. The available models support the creation of both monolingual and multilingual vectors. Since the embeddings will be used as the input of a clustering algorithm, dimensionality reduction is performed to improve the clustering results. The dimensionality reduction algorithm used is UMAP [10].

Document Clustering. Once the document embeddings are prepared, they are input into a clustering algorithm to create the topic clusters. The algorithm used is HDBSCAN [9], an optimized extension of the DBSCAN [4] algorithm. The chosen algorithm creates clusters based on the density of the document embedding space, which allows a document to not be assigned to any cluster if it is not similar to any of the neighbouring documents.

Topic Word Description. Once the topic clusters are created, a topic word description is generated using the documents’ text. For each cluster, the TF-IDF score is calculated for each word found in any of the cluster’s documents; the scores are called cluster TF-IDF (c-TF-IDF). The words with the highest c-TF-IDF scores are then chosen as the topic word description. Furthermore, maximal marginal relevance (MMR) is performed to diversify the selected words by measuring both a word’s relevance to the documents and its similarity to the other selected words. Note that the topic word descriptions were used only for the preliminary analysis of our work, not for measuring artist similarity.

3.2 Artists’ Similarity using Topic Clusters
Once the topic clusters are created, the similarity between artists can be measured. First, for each topic we count the songs that correspond to a particular artist. This gives us the number of songs an artist has in a particular topic. To ensure that the presence is strong enough, we decided to remove an artist from a topic if the number of their associated songs is below some threshold. The threshold is set to five (5) in order to ensure that the songs were not assigned to a cluster by coincidence. Afterwards, for each pair of artists we calculate their similarity using the following equation:

    sim(𝑎, 𝑏) = |𝐴 ∩ 𝐵| / |𝐴|,    (1)

where 𝐴 is the set of topics of artist 𝑎, and 𝐵 is the set of topics of artist 𝑏.

4 EXPERIMENT
We now present the experiment setting. First, we introduce the data set used and its pre-processing steps. Next, we describe the implementation details.

4.1 Dataset
To test our approach, we use a dataset with raw lyrics data [2]. The dataset consists of 218,210 rows containing the following attributes:
• Song name. The name of the song.
• Release year. The year when the song was released.
• Song artist. The name of the artist.
• Artist genre. The genre of the song.
• Song lyrics. The lyrics text of the song.
The attributes used in our analysis are song name, artist, and lyrics.

Data Processing. For our experiment we took fourteen (14) artists of various degrees of similarity. This reduces the data set to 4,470 rows, which is 2.05% of the whole data set. After reviewing the lyrics, we realized that the data set has many song variations by the same artist, which can be seen as duplicates. To find and remove the duplicates, we created TF-IDF representations for the songs and calculated the cosine similarity with all other songs of the same artist; if the similarity was greater than 50%, the song was labeled as a duplicate and removed from the data set. This resulted in a smaller data set containing 3,455 song lyrics. The final data set statistics used for our experiments are shown in Table 1.

Table 1: The experiment data set statistics. For each artist we denote the music genre of the artist (genre), the number of their songs in the data set (songs), and the average number of words in the song’s lyrics (avg. length).

Artist          | genre   | songs | avg. length
black-sabbath   | Rock    | 160   | 184
bon-jovi        | Rock    | 320   | 266
dio             | Rock    | 127   | 203
aerosmith       | Rock    | 208   | 226
ac-dc           | Rock    | 171   | 193
coldplay        | Rock    | 138   | 174
50-cent         | Hip-Hop | 318   | 502
2pac            | Hip-Hop | 259   | 648
eminem          | Hip-Hop | 369   | 640
black-eyed-peas | Hip-Hop | 119   | 463
celine-dion     | Pop     | 182   | 230
britney-spears  | Pop     | 225   | 313
frank-sinatra   | Jazz    | 356   | 133
ella-fitzgerald | Jazz    | 503   | 156
Together        | -       | 3,455 | 319

4.2 Implementation details
In this section, we present the details of how the approach is developed.

Language model. The method uses the pre-trained Sentence Transformer model, more precisely the all-mpnet-base-v2 model¹, available via HuggingFace’s transformers library [13]. It can take up to 384 tokens as one input, which is more than the average number of words in our data set, and returns a 768-dimensional dense vector. The vectors have been shown to be appropriate for tasks such as clustering and semantic search.

¹ https://huggingface.co/sentence-transformers/all-mpnet-base-v2

Dimensionality reduction. To perform dimensionality reduction, we set the UMAP parameters as follows: first, the number of neighboring sample points used when making the manifold approximation is set to five (5), to make the algorithm use the local proximity of the documents. Second, we set the dimensionality of the embeddings to one (1). These values were selected using hyper-parameter tuning.

Clustering algorithm. In the HDBSCAN algorithm, the minimum number of documents in a cluster is set to five (5).

5 RESULTS
In this section, we present the experiment results. We analyze the topic clusters, followed by a description of the findings on artist similarity.

Topic Cluster Analysis. The experiment generated 215 topic clusters, out of which only 107 have at least one artist with more than five (5) songs in them. The cluster containing songs that are deemed outliers is not included in the analysis. The statistics of the topic clustering are shown in Table 2. Evidently, artists with a larger number of songs are spread over more topic clusters than those with fewer songs.

Table 2: Topic clustering results. For each artist we show the number of different topics the artist is associated with (topics), and the average number of their songs in the associated topics (avg. songs).

Artist          | topics | avg. songs
black-sabbath   | 6      | 5
bon-jovi        | 10     | 6
dio             | 4      | 7
aerosmith       | 9      | 6
ac-dc           | 7      | 5
coldplay        | 2      | 5
50-cent         | 17     | 9
2pac            | 13     | 9
eminem          | 18     | 9
black-eyed-peas | 3      | 12
celine-dion     | 8      | 6
britney-spears  | 12     | 6
frank-sinatra   | 16     | 8
ella-fitzgerald | 28     | 8

Artists’ Similarity Analysis. The artists’ similarity is shown in Figures 2 and 3, which show the heatmaps of the absolute and relative co-occurrence of artists in topic clusters, respectively.

Figure 2: The absolute co-occurrence of artists in topic clusters.

By looking at the rows of Figure 2, we see the number of common topics with other artists. For example, taking 50-cent with his 17 topics, we see that he shares five (5) of them with 2pac, one (1) with black-eyed-peas, one (1) with ac-dc, and six (6) with eminem. From this we conclude that 50-cent, 2pac and eminem have more topics in common than the rest of the artists. In other words, 50-cent is more similar to 2pac and eminem than to the rest of the artists.

Figure 3: The relative co-occurrence of artists in topic clusters. Artists with a smaller number of topics can result in higher similarity with other artists.

Figure 3 shows the similarities calculated using Equation 1. The similarities become more visible, but at the same time can also be misleading: artists with a smaller number of topics can end up with a higher similarity to artists with a larger number of topics. For example, Coldplay has two (2) topics, one of which is shared with Bon Jovi.

Language Model Limitations. The chosen language model all-mpnet-base-v2 supports a maximum sequence length of 384 tokens, which is a downside of this model for our experiment. Although the average number of words in the song lyrics is below the input limit, some artists have songs that are longer than that. However, songs have repeating sections, e.g. the chorus, which is most likely inside the first 384 words. Therefore, the language model may not create a representation of the whole song’s lyrics, but it might capture the majority of the content because of the song’s repeated text.
Despite the fact that only one topic is in common, it is unlikely they have a similarity of 50%. Clustering Algorithm Selection. The clustering algorithm HDB- SCAN can create a cluster consisting of examples, which do not 6 DISCUSSION fall into any of the topic clusters. It is convenient when instead of In this section we discuss the advantages and disadvantages of forcing songs into clusters, it labels them as outliers. The down- the proposed methodology, and its possible improvements. side is when the majority of songs are labeled as outliers. To 15 Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia Erik Calcina and Erik Novak avoid this, other clustering algorithms that assign a cluster to [8] Alen Lukic. A comparison of topic modeling approaches every document can be used, for example K-means clustering [6]. for a comprehensive corpus of song lyrics. Tech. rep. Tech report, Language Technologies Institute, School of Com- 6.1 Topic Cluster Discussion puter Science . . ., 2015. Some artists with a small number of songs have a lower number [9] Leland McInnes and John Healy. “Accelerated Hierarchical of topics assigned, which is a problem for finding similarities. Density Based Clustering”. In: 2017 IEEE International Con- On the other side artists with higher number of songs tend to ference on Data Mining Workshops (ICDMW). 2017, pp. 33– have more topics. Additionally, to avoid taking into account small 42. doi: 10.1109/ICDMW.2017.12. number of artist co-occurrances, which can be a product of data [10] Leland McInnes, John Healy, and James Melville. UMAP: noise, a filter threshold can be considered to remove them from Uniform Manifold Approximation and Projection for Dimen- the final analysis. sion Reduction. 2018. doi: 10.48550/ARXIV.1802.03426. url: https://arxiv.org/abs/1802.03426. 7 CONCLUSION [11] Nils Reimers and Iryna Gurevych. “Sentence-BERT: Sen- tence Embeddings using Siamese BERT-Networks”. 
In: In this paper we present a way to measure similarity between Proceedings of the 2019 Conference on Empirical Methods music artists using topic modeling. We cluster lyrics and compare in Natural Language Processing. Association for Computa- artists based on the generated topic clusters. The results have tional Linguistics, Nov. 2019. url: https://arxiv.org/abs/ shown that the approach finds similar artists. However, it is 1908.10084. heavily dependent on the number and quality of the topic clusters. [12] Chong Wang, John Paisley, and David Blei. “Online varia- In the future, we intend to apply the methodology on a larger tional inference for the hierarchical Dirichlet process”. In: data set of song lyrics and artists. In addition, we intend to use Proceedings of the fourteenth international conference on all of the topic cluster information (including topic word descrip- artificial intelligence and statistics. JMLR Workshop and tions) in order to improve the methodology’s performance. Conference Proceedings. 2011, pp. 752–760. ACKNOWLEDGMENTS [13] Thomas Wolf et al. “Transformers: State-of-the-Art Natu- ral Language Processing”. In: Proceedings of the 2020 Con- This work was supported by the Slovenian Research Agency and ference on Empirical Methods in Natural Language Pro- the Slovene AI observatory under proposal no. V2-2146. cessing: System Demonstrations. Online: Association for REFERENCES Computational Linguistics, Oct. 2020, pp. 38–45. doi: 10. 18653/v1/2020.emnlp- demos.6. url: https://aclanthology. [1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. “La- org/2020.emnlp- demos.6. tent dirichlet allocation”. In: J. Mach. Learn. Res. 3 (2003), pp. 993–1022. issn: 1532-4435. doi: http://dx.doi.org/10. 1162 / jmlr. 2003 . 3 . 4 - 5 . 993. url: http : / / portal . acm . org / citation.cfm?id=944937. [2] Connor Brennan, Sayan Paul, Hitesh Yalamanchili, Justin Yum. Classifying Song Genres Using Raw Lyric Data with Deep Learning. Accessed August 30, 2022. 
https://github. com/hiteshyalamanchili/SongGenreClassification. 2018. [3] Maibam Debina Devi and Navanath Saharia. “Exploiting Topic Modelling to Classify Sentiment from Lyrics”. In: Machine Learning, Image Processing, Network Security and Data Sciences. Ed. by Arup Bhattacharjee et al. Singapore: Springer Singapore, 2020, pp. 411–423. isbn: 978-981-15- 6318-8. [4] Martin Ester et al. “A density-based algorithm for discov- ering clusters in large spatial databases with noise”. In: AAAI Press, 1996, pp. 226–231. [5] Maarten Grootendorst. “BERTopic: Neural topic modeling with a class-based TF-IDF procedure”. In: arXiv preprint arXiv:2203.05794 (2022). [6] Xin Jin and Jiawei Han. “K-Means Clustering”. In: Ency- clopedia of Machine Learning. Ed. by Claude Sammut and Geoffrey I. Webb. Boston, MA: Springer US, 2010, pp. 563– 564. isbn: 978-0-387-30164-8. doi: 10 . 1007 / 978 - 0 - 387 - 30164 - 8 _ 425. url: https : / / doi . org / 10 . 1007 / 978 - 0 - 387 - 30164- 8_425. [7] Wei Li and Andrew McCallum. “Pachinko allocation: DAG- structured mixture models of topic correlations”. In: ICML ’06: Proceedings of the 23rd international conference on Ma- chine learning. New York, NY, USA: ACM, 2006, pp. 577– 584. isbn: 1595933832. doi: 10.1145/1143844.1143917. url: http://portal.acm.org/citation.cfm?id=1143917. 16 Exploring the Impact of Lexical and Grammatical Features on Automatic Genre Identification Taja Kuzman Nikola Ljubešić taja.kuzman@ijs.si nikola.ljubesic@ijs.si Jožef Stefan Institute and Jožef Stefan International Jožef Stefan Institute Postgraduate School Jamova cesta 39 Jamova cesta 39 Ljubljana, Slovenia Ljubljana, Slovenia ABSTRACT As learning on lexical features can introduce bias towards topic, Laippala et al. 
(2021) recently experimented with combining lexical with grammatical features, which are represented as part-of-speech tags, conveying information on the word type (e.g., noun, verb). This was shown to yield better results than using solely lexical features, and to provide more stable models, i.e., models that are able to generalize beyond the training data. Furthermore, their analysis revealed that the importance of feature sets varies between genre categories, and that while some are most efficiently identified when learning on lexical features, others benefit more from grammatical representations.

However, these experiments were in the past mostly performed on English datasets. This article is the first to analyse the impact of various feature sets on automatic genre identification applied to Slovene data. This research was made possible by the recent development of the first Slovene dataset manually annotated with genre, as well as the creation of state-of-the-art language processing tools for Slovene.

This study analyses the impact of several types of linguistic features on the task of automatic web genre identification applied to Slovene data. To this end, text classification experiments with the fastText models were performed on 6 feature sets: the original lexical representation, preprocessed text, lemmas, part-of-speech tags, morphosyntactic descriptors, and syntactic dependencies, produced with the CLASSLA pipeline for language processing. Contrary to previous work, our results reveal that a grammatical feature set can be more beneficial than lexical representations for this task, as syntactic dependencies were found to be the most informative for genre identification. Furthermore, it is shown that this approach can provide insight into variation between genres.

To compare textual representations, additional feature sets were created from a selection of texts annotated with genre, presented in Section 2, by using common preprocessing methods and language processing (see Section 3). Thus, in this paper, 6 textual representations are compared: 1) the original, running text, which we consider as our baseline, 2) preprocessed text, i.e. lowercase text without punctuation, digits and stopwords, 3) lemmas, i.e. base dictionary forms of words, 4) part-of-speech (PoS) tags, i.e. main syntactic word types (e.g., noun, verb), 5) morphosyntactic descriptors (MSD), i.e. extended PoS tags which include information on morphosyntactic features (e.g., number, case), and 6) syntactic dependencies, i.e. types of dependency relations between words (e.g. subject, object). The feature sets are compared based on their impact on the performance of the fastText models on the automatic text classification task.

KEYWORDS

language processing, linguistic features, automatic genre identification, web genres, Slovene

1 INTRODUCTION

Automatic genre identification (AGI) is a text classification task where the focus is on genres as text categories that are defined based on the conventional function and/or the form of the texts. In text classification tasks, texts are generally given to the machine learning models in the form of words or characters that are then further transformed into numeric vectors by using bag-of-words representations, or word embeddings created by training deep neural networks on the surface text. However, the recent development of tools for linguistic processing for numerous languages, including Slovene, allows transformation of the original running text into various other sets of features, to which further transformation into numeric representations can be applied.
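Such a transformation of annotated text into alternative feature sets can be sketched in a few lines. The snippet below is an illustration, not the authors' code: it derives the representations from tokens that are assumed to be already annotated with lemma, PoS, MSD and dependency layers (in the experiments this annotation comes from the CLASSLA pipeline), and the Token values and the stopword list are invented for the example.

```python
# Illustrative sketch: building textual representations from pre-annotated
# tokens. The Token annotations and STOPWORDS below are invented stand-ins.
from dataclasses import dataclass

@dataclass
class Token:
    form: str     # surface form from the running text
    lemma: str    # base dictionary form
    pos: str      # main part-of-speech tag (e.g. NOUN, VERB)
    msd: str      # extended morphosyntactic descriptor
    deprel: str   # syntactic dependency relation (e.g. nsubj, root)

STOPWORDS = {"v", "se", "bo"}  # stand-in for a real stopword list

def representations(tokens):
    """Return each feature set as a whitespace-joined string."""
    return {
        "baseline": " ".join(t.form for t in tokens),
        # lowercase, drop digits/punctuation (non-alphabetic) and stopwords
        "preprocessed": " ".join(
            t.form.lower() for t in tokens
            if t.form.isalpha() and t.form.lower() not in STOPWORDS
        ),
        "lemmas": " ".join(t.lemma for t in tokens),
        "pos": " ".join(t.pos for t in tokens),
        "msd": " ".join(t.msd for t in tokens),
        "dependencies": " ".join(t.deprel for t in tokens),
    }

# A two-token fragment with invented annotations.
tokens = [
    Token("Prvi", "prvi", "ADJ", "Mlomsn", "amod"),
    Token("tek", "tek", "NOUN", "Ncmsn", "root"),
]
reps = representations(tokens)
print(reps["pos"])           # ADJ NOUN
print(reps["dependencies"])  # amod root
```

In the experiments described below, each such representation is then written out one document per line, prefixed with its genre label, which is the input format the fastText classifier expects.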
By learning on these linguistic sets, we get insight into the importance of features that cannot be analysed separately when given the running text, i.e., word meaning, the function of a word, and its relation to other words.

When previous work compared the importance of various textual feature sets for the performance of models in automatic genre identification, lexical features, i.e., word or character n-grams, mainly provided the best results ([6], [7]). However, it was noted that by learning on lexical features, the models could learn to classify texts based on the topic instead of on genre characteristics, and would not be able to generalize beyond the dataset.

The results of the experiments, presented in Section 4, give insights into the role of linguistic feature sets in this task and the differences in performance between genre categories.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia
© 2022 Copyright held by the owner/author(s).

2 DATASET

For performing experiments in automatic genre identification, the Slovene Web genre identification corpus GINCO 1.0 [2] was used. The dataset consists of the "suitable" subset, annotated with genre, and the "not suitable" subset that comprises texts which can be deemed as noise in web corpora, e.g., texts without full sentences, very short texts, machine translation etc. In this research, only the "suitable" subset, containing 1002 texts, was used.

The GINCO schema consists of 24 genre labels. However, previous experiments, performed with the fastText model on the entire dataset, showed that the model is not potent enough to differentiate between a large number of labels that are mostly represented by less than 100 texts, reaching micro and macro F1 scores of 0.352 and 0.217 respectively (see [3]). Therefore, to be able to infer any meaningful conclusions, this article focuses only on the most frequent genre labels, created by merging some labels. Instances of less frequent labels that could not be merged, namely Instruction, Legal/Regulation, Recipe, Announcement, Correspondence, Call, Interview, Prose, Lyrical, Drama/Script, FAQ, and the labels Other and List of Summaries/Excerpts, which can be considered as noise, were not used. To focus only on the instances that are representative of their genre labels, texts that were manually annotated as hard to identify (parameter hard) were not used in the experiments. Furthermore, paragraphs that were deemed to be noise in the text, e.g., cookie consent text, and were marked by the annotators with the keep parameter set to False, were left out of the final texts.

Thus, the final set of labels, used in the experiments and shown in Table 1, consists of 5 genre categories: Information/Explanation, News, Opinion/Argumentation, Promotion and Forum. As shown in the Table, the dataset is imbalanced, with News and Promotion being the most frequent classes, consisting of almost 200 instances each, while Forum is the least represented class, consisting of about 50 texts. The subset, consisting of 688 texts in total, followed the original stratified split of 60:20:20, encoded in the GINCO 1.0 dataset; the models were trained on the training set and tested on the test set, while the dev split was used for evaluating the hyperparameter optimisation.

Table 1: The original GINCO categories (left) included in the reduced set, and the reduced set of labels (right), used in the experiments, with the total number of texts (later divided between the train, dev and test splits) in parentheses.

GINCO                       Reduced Set
News/Reporting              News (198)
Opinionated News
Information/Explanation     Information/Explanation (127)
Research Article
Opinion/Argumentation       Opinion/Argumentation (124)
Review
Promotion                   Promotion (191)
Promotion of a Product
Promotion of Services
Invitation
Forum                       Forum (48)

3 FEATURE ENGINEERING

Feature engineering is the process of identifying the features that are most useful for a specific task, with the goal of improving the performance of a machine learning model. In text classification experiments, basic preprocessing methods are often used to reduce the number of unique lexical features (words or characters) without losing much information, which can provide better results. To test whether preprocessing the text improves the results for this task, the first additional feature set was created by preprocessing the running text as extracted from the GINCO dataset. Preprocessing consisted of the following steps: converting text to lowercase, and removing digits, punctuation and function words known as stopwords, e.g., conjunctions, prepositions etc.

In addition to this, various linguistic representations were created by applying linguistic processing to the texts, and replacing words with the corresponding lemmas or grammatical tags. The language processing was performed with the CLASSLA pipeline [5]. The following text representations were produced: a lexical feature set, consisting of lemmas, and three grammatical feature sets: part-of-speech (PoS) tags, morphosyntactic descriptors (MSD), and syntactic dependencies. The realisation of the created feature sets is illustrated on an example sentence in Table 2.

Table 2: An example of the feature sets used in the experiments.

Feature Set               Example
Baseline - Running Text   V Laškem se bo v nedeljo, 21.4.2013 odvijal prvi dobrodelni tek Veselih nogic.
Preprocessed Baseline     laškem nedeljo odvijal dobrodelni tek veselih nogic
Lemmas                    v Laško se biti v nedelja , 21.4.2013 odvijati prvi dobrodelen tek vesel nogica .
PoS                       ADP PROPN PRON AUX ADP NOUN PUNCT NUM VERB ADJ ADJ NOUN ADJ NOUN PUNCT
MSD                       Sl Npnsl Px——y Va-f3s-n Sa Ncfsa Z Mdc Vmpp-sm Mlomsn Agpmsny Ncmsn Agpfpg Ncfpg Z
Dependencies              case nmod expl aux case obl punct nummod root amod amod nsubj amod nmod punct

4 MACHINE LEARNING EXPERIMENTS

4.1 Experimental Setup

The experiments were performed with the linear fastText [1] model, which enables text classification and word embedding generation. The model is a shallow neural network with one hidden layer, where the word embeddings are created and averaged into a text representation which is fed into a linear classifier. The model takes as input a text file where each line contains a separate text instance, consisting of a label and the corresponding document. Thus, for each feature set, appropriate train, test and dev files were created, and the model was trained on each representation separately (the code for data preparation and machine learning experiments is published at https://github.com/TajaKuzman/Text-Representations-in-FastText). To observe the dispersion of results, five runs of training were performed for each feature set. To measure the model's performance on the instance and the label level, the micro and macro F1 scores were used as evaluation metrics.

The hyperparameter search was performed by training the model on the training split of the baseline text and evaluating it on the dev split. The automatic hyperparameter optimisation provided by the fastText model did not yield satisfying results, as three runs of automatic hyperparameter optimisation produced very different results in terms of the proposed optimal hyperparameter values, and yielded micro F1 0.479 ± 0.02 and macro F1 0.382 ± 0.06. Therefore, we continued searching for optimal hyperparameters by manually changing one hyperparameter at a time and conducting classification experiments. The optimal number of epochs turned out to be 350, the learning rate was set to 0.7, and the number of words in n-grams to 1. For the other hyperparameters, the default values were used. Manual hyperparameter search proved to be considerably more effective than automatic optimisation, as it yielded average micro and macro F1 scores of 0.625 ± 0.004 and 0.618 ± 0.003 respectively, which is on average 0.15 points better micro F1 and 0.24 points better macro F1 compared to the results of automatic optimisation.

To analyse whether our choice of technology is the most appropriate one, we compared the performance of the fastText model, which uses the hyperparameters mentioned above, with the performance of various non-neural classifiers commonly used in text classification tasks: a dummy majority classifier, which predicts the most frequent class for every instance, a support vector machine (SVM), a decision tree classifier, a logistic regression classifier, a random forest classifier, and a Naive Bayes classifier. We used the default parameters for these classifiers. The models are compared based on their performance on the baseline text, which was transformed into the TF-IDF representation where necessary. As shown in Table 3, fastText outperforms all other classifiers, with a noticeable difference especially in the macro F1 scores, reaching 17 points higher scores than the next best classifier, the Naive Bayes classifier.

Table 3: Micro and macro F1 scores obtained by various classifiers, trained and tested on the baseline text.

Classifier                 Micro F1   Macro F1
Dummy Classifier           0.24       0.08
Support Vector Machine     0.49       0.33
Decision Tree              0.34       0.35
Logistic Regression        0.52       0.38
Random Forest classifier   0.51       0.41
Naive Bayes classifier     0.54       0.42
FastText                   0.56       0.59

4.2 Results of Learning on Various Linguistic Features

To explore the role of various textual representations in the automatic genre identification of Slovene web texts, we conducted text classification experiments with the fastText models on 6 feature sets:

• three lexical sets: a) baseline text, i.e., the original running text, b) preprocessed baseline text, i.e., baseline text converted to lowercase and without punctuation, digits and function words, c) lemmas, i.e., words reduced to their base dictionary forms;
• three grammatical sets: a) part-of-speech (PoS) tags, i.e., main word types, b) morphosyntactic descriptors (MSD), i.e., extended PoS tags, c) syntactic dependencies, i.e., types of words defined by their relation to other words.

First, by comparing the baseline representation and the preprocessed representation, we aimed to determine whether common preprocessing methods can improve the results in the AGI task. As shown in Table 4, the results reveal that applying preprocessing methods improves the performance, especially on the micro F1 level. Analysis of the F1 scores obtained for each label in Figure 1 reveals that preprocessing especially improves the identification of Promotion and News. These two labels are the most frequent genre classes in the dataset, which explains the larger improvement of the micro F1 scores. If we compare the baseline text and the preprocessed text to the third lexical set, i.e., lemmas, the results show that by using lowercase words reduced to their dictionary base form, the performance is further improved, although only slightly, as can be seen in Table 4.

Table 4: Average micro and macro F1 scores obtained from five runs of training and testing on each representation separately.

Representation          Micro F1       Macro F1
Baseline Text           0.560 ± 0.00   0.589 ± 0.00
Preprocessed Baseline   0.596 ± 0.00   0.597 ± 0.00
Lemmas                  0.597 ± 0.01   0.601 ± 0.00
PoS                     0.540 ± 0.01   0.547 ± 0.01
MSD                     0.563 ± 0.01   0.536 ± 0.02
Dependencies            0.610 ± 0.00   0.639 ± 0.00

Secondly, we compared the various lexical and grammatical feature sets obtained with language processing tools. In previous work, which analysed English genre datasets, lexical features yielded better results than grammatical feature sets ([4], [6], [7]). Our results revealed that this conclusion holds also for Slovene when training on part-of-speech tags. A similar conclusion can be made for the extended part-of-speech tags (MSD), which only slightly improve the micro F1 scores compared to the baseline, while there is a decrease in the macro F1 scores (see Table 4). However, the third grammatical feature set, consisting of tags for syntactic dependencies, which was not used in previous work, significantly outperformed the baseline text and all other feature sets. As shown in Figure 1, the improvement is especially noticeable for the categories Forum, Opinion/Argumentation and News. By learning on the dependencies instead of on lexical features, the model learns from the structure of the sentences in the text, i.e., the syntax, instead of from word meanings that can be more related to topic than to genre, which could be the reason why this representation turned out to be the most beneficial for the task.

Figure 1: The impact of various linguistic features on the F1 scores of genre labels (Information/Explanation, Promotion, News, Forum and Opinion/Argumentation).

As in previous work (see [4]), the experiments have revealed a dependence between the text representation and the performance on specific genre labels, which is illustrated in Figure 1. The results show that Promotion and Information/Explanation can be most successfully identified when learning purely on the meaning of the words, i.e., on lemmas. In contrast to that, for identifying News, grammatical representations are more useful than lexical ones. Similarly, Opinion/Argumentation benefits more from grammatical feature sets than from lexical representations, except in the case of the MSD tags, which significantly decreased the results for this class, yielding F1 scores below 0.3. Interestingly, although Forum is the least frequent label, its features seem to be the easiest to identify in the majority of representations. This genre benefits the most from learning on syntactic dependency tags, which yielded F1 scores of almost 0.9.

5 CONCLUSIONS

In this paper, we have investigated the dependence of automatic genre classification on the lexical and grammatical representation of text. Our experiments, performed on three lexical and three grammatical feature sets, revealed that the choice of textual representation impacts the results of automatic genre identification. Similarly to previous work, it was revealed that part-of-speech features give worse results than lexical features. However, a grammatical feature set consisting of syntactic dependencies, which had not been studied in previous work, proved to be the most beneficial for the automatic genre identification task. Furthermore, the experiments revealed variation between genres regarding the impact of the feature sets on the F1 scores of each label. While some genres, such as Promotion, benefit more from learning on lexical features, others, such as Opinion/Argumentation, benefit more from grammatical representations.

However, it should be noted that this study has been limited to the 5 most frequent genre labels, as previous experiments showed that the fastText model is not potent enough to identify the other categories, represented by a small number of instances ([3]). Thus, the results of these experiments give insight into which linguistic features are the most important for differentiating between the five most frequent genres, not for identifying the 24 original labels that encompass all the genre variation found on the web, and include noise. This is why we plan to continue genre annotation campaigns to enlarge the Slovene genre dataset, which would allow extending the analysis to all genre labels. In addition to this, as we are interested in cross-lingual genre identification, in the future we plan to analyse the importance of linguistic feature sets on the Croatian and English genre datasets, to analyse whether the characteristics of genre labels are language independent.

The fastText model proved to be useful for the analysis of the impact of linguistic features on the AGI task; however, previous work on automatic genre identification using the GINCO dataset revealed that if the aim of the research is to create the best-performing classifier, and not to analyse the impact of representations on the performance, the Transformer-based pre-trained language models are much more suitable for the task ([3]). This was also confirmed by our experiments on the running text, where the base-sized XLM-RoBERTa model reached micro and macro F1 scores of 0.816 and 0.813, which is 22–26 points more than the fastText model. Based on the findings from this paper, one of the reasons why the Transformer models perform better could also be that the Transformer text representations incorporate information on syntax as well. In the future, we plan to investigate this further, adapting the classifier heads so that the syntactic information has a larger impact on the classification than the lexical parts of the representation.

ACKNOWLEDGMENTS

This work has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author's view. The Agency is not responsible for any use that may be made of the information it contains. This work was also funded by the Slovenian Research Agency within the Slovenian-Flemish bilateral basic research project "Linguistic landscape of hate speech on social media" (N06-0099 and FWO-G070619N, 2019–2023) and the research programme "Language resources and technologies for Slovene" (P6-0411).

REFERENCES

[1] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
[2] Taja Kuzman, Mojca Brglez, Peter Rupnik, and Nikola Ljubešić. 2021. Slovene web genre identification corpus GINCO 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1467.
[3] Taja Kuzman, Peter Rupnik, and Nikola Ljubešić. 2022. The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild. In Proceedings of the Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 1584–1594. https://aclanthology.org/2022.lrec-1.170.
[4] Veronika Laippala, Jesse Egbert, Douglas Biber, and Aki-Juhani Kyröläinen. 2021. Exploring the role of lexis and grammar for the stable identification of register in an unrestricted corpus of web documents. Language Resources and Evaluation, 1–32.
[5] Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing. Association for Computational Linguistics, Florence, Italy, (August 2019), 29–34. doi: 10.18653/v1/W19-3704. https://www.aclweb.org/anthology/W19-3704.
[6] Dimitrios Pritsos and Efstathios Stamatatos. 2018. Open set evaluation of web genre identification. Language Resources and Evaluation, 52, 4, 949–968.
[7] Serge Sharoff, Zhili Wu, and Katja Markert. 2010. The Web Library of Babel: evaluating genre collections. In LREC. Citeseer.

Stylistic features in clustering news reporting: News articles on BREXIT

Abdul Sittar (abdul.sittar@ijs.si), Jožef Stefan Institute and Jožef Stefan Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia
Jason Webber (jason.webber@bl.uk), British Library, London, United Kingdom
Dunja Mladenić (dunja.mladenic@ijs.si), Jožef Stefan Institute and Jožef Stefan Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT

We present a comparison of typical bag-of-words features with stylistic features. We group the news articles published from three different regions of the UK, namely London, Wales, and Scotland. Hierarchical clustering is performed using typical bag-of-words and stylistic features. We present the performance of 25 stylistic features and compare them with the bag-of-words features. Our results show that bag-of-words features are better suited for clustering news reporting at the regional level, whereas stylistic features are better suited for clustering news reporting at the level of news publishers/newspapers.

Table 1: List of all the stylistic features that are used for clustering.

No.  Feature                                         No.  Feature
1.   Percentage of Question Sentences                2.   Average Sentence Length
3.   Percentage of Short Sentences                   4.   Average Word Length
5.   Percentage of Long Sentences                    6.   Percentage of Semicolons
7.   Percentage of Words with Six and More Letters   8.   Percentage of Punctuation Marks
9.   Percentage of Words with Two and Three Letters  10.  Percentage of Pronouns
11.  Percentage of Coordinating Conjunctions         12.  Percentage of Prepositions
13.  Percentage of Commas                            14.  Percentage of Adverbs
15.  Percentage of Articles                          16.  Percentage of Capitals
17.  Percentage of Words with One Syllable           18.  Percentage of Colons
19.  Percentage of Nouns                             20.  Percentage of Determiners
21.  Percentage of Verbs                             22.  Percentage of Digits
23.  Percentage of Adjectives                        24.  Percentage of Full Stops
25.  Percentage of Interjections

KEYWORDS

news reporting, topic modeling, stylistic features, clustering

1 INTRODUCTION

The role of content is an essential research topic in news spreading. Media economics scholars have especially shown interest in a variety of content forms, since content analysis plays a vital role in individual consumer decisions and in political and economic interactions [6]. The content basically refers to the type of language that is used in the news. It is used to convey meaning, and it can impact social and psychological constructs such as social relationships, emotions, and social hierarchy [8]. The everyday act of reading the news is such a big area in which small differences in reporting may shape how events are perceived, and ultimately judged and remembered [5].

features from the raw features, including low-level features, high-level features, and semantic features [16].

The news coverage registers the occurrence of specific events promptly and reflects the different opinions of stakeholders [4]. We take Brexit as an event to be researched on the topic of news reporting differences across the different regions of the UK. On 23 June 2016, the British electorate voted to leave the EU. This event has already been studied following different aspects, such as the fundamental characteristics of the voting population, drivers of the vote, political and social patterns, and possible failures in communication [2, 9].

News reporting across different regions requires methods to
In this paper, we explore how different find reporting differences. [7] characterize the relationship be-stylistic features help in clustering news articles related to Brexit tween the volume of online opioid news reporting and measures than bag-of-words (BOW). differences across different geographic and socio-economic lev- Following are the main scientific contributions of this paper: els. Scholars across disciplines have explored the institutional, (1) We present a comparison of clustering (using two different organizational, and individual influences that study the quality textual features: bag-of-words and stylistic features) for and quantity of coverage [3]. news reporting about Brexit in three different regions Features that could classify news reporting across different (London, Scotland, and Wales) of the UK. regions can be adapted to classify the news. A detailed analysis of (2) We show in our experiments that the bag-of-words are textual features is performed by [1] where they derived multiple better to be used while clustering news reporting at the features for creating clusters of news articles along with their regional level whereas stylistic features are better to be comments. These features include terms in the title, terms in used while clustering news reporting at the level of news the first sentence, terms in the entire article, etc. Multi-view publishers/newspapers. clustering on multi-model data can provide common semantics to improve learning effectiveness. It exploits different levels of 2 RELATED WORK Permission to make digital or hard copies of part or all of this work for personal In this section, we review the related literature about topic mod-or classroom use is granted without fee provided that copies are not made or elling, and different types of textual features. distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. 
Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2022, 10 October 2022, Ljubljana, Slovenia
© 2022 Copyright held by the owner/author(s).

2.1 Topic Modelling
Topic modelling is used to infer topics from a collection of text documents. Some techniques use only frequent words, whereas some use pooling to generate relevant topics and maintain coherence between topics [14]. Topics are typically represented by a set of keywords. Examples of such algorithms are Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA). Clustering-based topic modelling is another solution.

2.2 Stylistic Features
News reporting differences can be reflected through one's speech, writing, images, etc. [10, 12]. Language-independent features have been used for different NLP tasks such as plagiarism detection and author diarization. These features treat the text of documents as a sequence of tokens (i.e. sentences, paragraphs, documents), from which various types of statistics can be drawn in any language [13]. Stylistic features represent the writing style of a document and have been used in the past for understanding authors' writing styles [10]. We use them to explore the clustering of news articles based on their reporting differences across different regions. Table 1 shows the list of 25 stylistic features used for our proposed clustering of news articles.

2.3 Bag-of-words
A bag-of-words model is a way of extracting features from text. It is basically a representation of text that describes the occurrence of words within a document: it first identifies a vocabulary of known words and then measures their presence. Topic modelling is typically based on the bag-of-words (BOW) representation. The essential idea of the topic model is that a document can be represented by a mixture of latent topics, and each topic is a distribution over words [11].

3 DATA COLLECTION
We collected news articles reporting on Brexit in the English language from the UK Web Archive (UKWA). The dataset consists of 5061 news articles after pre-processing. Due to the unavailability of news articles from other regions of the UK, we selected only the regions (London, Scotland, and Wales) which have a sufficient amount of news articles. Table 2 presents the number of news articles published from the different regions and by the different news publishers.

Table 2: Total number of news articles about Brexit published in three different regions (London, Scotland, and Wales).

Region: London (total 4248)
bankofengland.co.uk 8; bbc.com 2209; dailymail.co.uk 768; Independent.co.uk 191; inews.co.uk 52; metro.co.uk 1; neweconomics.org 1; rspb.org.uk 8; theguardian.com 1167; theneweuropean.co.uk 1; thesun.co.uk 235; cityam.com 3; conservativewomen.uk 1; dailypost.co.uk 1; ft.com 2; mirror.co.uk 9; raeng.org.uk 1; standard.co.uk 20

Region: Scotland (total 533)
news.stv.tv 533

Region: Wales (total 280)
gov.wales 3; nation.wales 122; Walesonline.co.uk 156

4 METHODOLOGY
The presented research focuses on clustering news articles. To this end, we experiment with clustering using combinations of different features and observe their performance. Our methodology consists of four steps and compares the performance of stylistic features and bag-of-words in clustering news articles, as shown in Figure 1.

In the first step, we select Brexit under topics and themes on the UK Web Archive¹. After crawling the list of news articles, we extracted the metadata of the news publishers from the Wikipedia infobox. The metadata extraction process is explained in our previous work [15]. In this process, we extracted the headquarters of the news publishers. Due to the unavailability of news articles from other regions of the UK, we selected only the regions (London, Scotland, and Wales) which have a sufficient amount of news articles.

In the second step, we parse the HTML web pages and extract the body text.

Figure 1: Methodology for clustering regional news using bag-of-words and stylistic features. (The flowchart runs from UKWA Brexit news articles for London, Scotland, and Wales through meta-data extraction, preprocessing, stylistic and bag-of-words feature extraction, and LSA, to hierarchical clustering and BCubed evaluation.)

Since the third step requires pre-processing for bag-of-words, we convert the text to lowercase and remove the stop words and punctuation marks. In the third step, for the stylistic features, we extract the features listed in Table 1 for all three regions and perform LSA (Latent Semantic Analysis). Similarly, for the bag-of-words, we use the pre-processed text and perform LSA. We also perform LSA on the combination of both types of features. 100 latent dimensions are used for LSA, a commonly recommended setting. We perform LSA and hierarchical clustering using the Python libraries SciPy and scikit-learn, and use the weighted distance between clusters.

¹https://www.webarchive.org.uk/en/ukwa/collection/910
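This bag-of-words → LSA → weighted hierarchical clustering pipeline can be sketched in a few lines; a minimal, self-contained toy example (the four-document corpus is illustrative, and 2 latent dimensions are used instead of the paper's 100 so that the example runs on its own):

```python
# Sketch of the pipeline: bag-of-words -> LSA -> weighted hierarchical
# clustering, using the same libraries as the paper (scikit-learn, SciPy).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from scipy.cluster.hierarchy import linkage, fcluster

docs = [
    "brexit vote leave eu",        # toy "politics" documents
    "brexit referendum eu vote",
    "rugby match wales cardiff",   # toy "sport" documents
    "rugby wales match score",
]
bow = CountVectorizer().fit_transform(docs)         # document-term matrix
lsa = TruncatedSVD(n_components=2, random_state=0)  # LSA via truncated SVD
X = lsa.fit_transform(bow)                          # dense latent representation

Z = linkage(X, method="weighted")                   # weighted linkage, as in the paper
labels = fcluster(Z, t=2, criterion="maxclust")     # cut the dendrogram into k=2 clusters
print(labels)
```

Cutting the dendrogram with `criterion="maxclust"` plays the role of choosing the number of clusters k, which the experiments below vary from 2 to 20.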
After performing the LSA, we apply hierarchical clustering and use two different types of evaluation measures, namely BCubed F1 and Silhouette scores.

5 EXPERIMENTAL EVALUATION
We have performed experimental evaluations using intrinsic (Silhouette) and extrinsic (BCubed-F) evaluation measures. Intrinsic metrics measure the goodness of a clustering in itself, whereas extrinsic metrics evaluate clustering performance against ground truth. For the extrinsic evaluation, we consider clusters generated by k-means clustering using typical bag-of-words as ground truth clusters. The value of k in k-means clustering ranges from 2 to 20. K-means identifies k centroids and then allocates every data point to the nearest cluster while keeping the centroids as small as possible. We cannot set the value of k to 1, as there would then be no other cluster to which to allocate the nearest data point. Silhouette is used to measure cohesion. It ranges from -1 to 1: a value of 1 means clusters are well apart from each other and clearly distinguished; 0 means clusters are indifferent, i.e. the distance between clusters is not significant; -1 means points are assigned to the wrong clusters. The BCubed F-measure defines precision as point precision, namely how many points in the same cluster belong to its class; similarly, point recall represents how many points from its class appear in its cluster.

• Silhouette Score:
S(i) = (b(i) - a(i)) / max(a(i), b(i))
where S(i) is the silhouette coefficient of the data point i, a(i) is the average distance between i and all the other data points in the cluster to which i belongs, and b(i) is the average distance from i to all clusters to which i does not belong.
• BCubed Precision and Recall:
Correctness(i, j) = 1 if L(i) = L(j) and C(i) = C(j), and 0 otherwise
BCubed Precision = (1/N) Σ_{i=1..N} Σ_{j ∈ C(i)} Correctness(i, j) / |C(i)|
BCubed Recall = (1/N) Σ_{i=1..N} Σ_{j ∈ L(i)} Correctness(i, j) / |L(i)|
where |C(i)| and |L(i)| denote the sizes of the sets C(i) and L(i), respectively, and L(i) and C(i) denote the class and the cluster of a point i.

• BCubed-F Score:
F = (2 × BCubed Precision × BCubed Recall) / (BCubed Precision + BCubed Recall)

6 RESULTS AND ANALYSIS
Figure 2 shows three line graphs. Each graph shows Silhouette scores across a different number of clusters (from 2 to 20) for one of the three regions of the UK: Scotland, Wales, and London, respectively. Blue and red lines represent bag-of-words (BOW) and stylistic features. We can see that in all three graphs the silhouette score of the stylistic features is significantly higher for all three regions, except at one point for Scotland. This means that cohesion is higher and the distance between the clusters is more significant using stylistic features than with BOW, which is mostly close to 0. This suggests that stylistic features are better at partitioning news articles into clusters than BOW.

Figure 2: The line graphs represent average silhouette scores across a different number of clusters. The blue line represents the score generated using bag-of-words and the red line the score generated using stylistic features. The three line graphs are generated for the three different regions Scotland, Wales, and London, respectively.

However, it is insufficient to conclude at this stage that stylistic features are better for capturing news reporting differences, because the clusters resulting from internal partitioning need not coincide with clusters based on news reporting differences. We therefore consider each region (London, Scotland, and Wales) as a ground truth cluster of the news articles published in that region. Table 3 shows the BCubed-F scores obtained when these ground truth clusters were matched with the ones created using bag-of-words, stylistic features, and a combination of both types of features. Similarly, we consider each newspaper/news publisher shown in Table 2 as a ground truth cluster of the news articles published by that newspaper/news publisher. Table 4 shows the BCubed-F scores obtained when these ground truth clusters were matched with the ones created using bag-of-words, stylistic features, and a combination of both types of features.

Table 3: The group of news articles published from three different regions of the UK is considered as ground truth clusters and the BCubed-F score is calculated using three types of features: bag-of-words, stylistic features, and a combination of both.

No. Features — BCubed-F Score
1. Bag-of-words — 0.75
2. Bag-of-words and stylistic features — 0.51
3. Stylistic features — 0.54

Table 4: The group of news articles published by 22 different news publishers of the UK is considered as ground truth clusters and the BCubed-F score is calculated using three types of features: bag-of-words, stylistic features, and a combination of both.

No. Features — BCubed-F Score
1. Bag-of-words — 0.53
2. Bag-of-words and stylistic features — 0.57
3. Stylistic features — 0.66

The scores using bag-of-words with regions as ground truth clusters are significantly higher (0.75) than those of stylistic features (0.54) and of the combination of all features (0.51). The scores using stylistic features with newspapers/news publishers as ground truth clusters are significantly higher (0.66) than those of bag-of-words (0.53) and of the combination of all features (0.57). The higher bag-of-words scores in regional news reporting suggest that bag-of-words is the better choice for clustering or classification at the regional level, because the newspapers/news publishers within a certain region report in different styles. Similarly, when it comes to classifying or clustering news reporting across different newspapers/news publishers, stylistic features are more useful, because each newspaper/news publisher follows its own reporting style.

7 CONCLUSIONS
In this paper, we have presented a comparison of different features, observing their performance in clustering news articles. The goal of this work was to investigate the performance of stylistic features and typical bag-of-words. The data consists of news articles about a popular event, Brexit, collected from the UKWA. These news articles belong to three different regions of the UK: Scotland, London, and Wales. Our experimental results suggest that bag-of-words features are better suited to clustering news reporting at the regional level, whereas stylistic features are better suited to clustering news reporting at the level of news publishers/newspapers.

ACKNOWLEDGMENTS
The research described in this paper was supported by the Slovenian research agency under the project J2-1736 Causalify and by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 812997.

REFERENCES
[1] Ahmet Aker, Monica Paramita, Emina Kurtic, Adam Funk, Emma Barker, Mark Hepple, and Rob Gaizauskas. 2016. Automatic label generation for news comment clusters. In Proceedings of the 9th International Natural Language Generation Conference. Association for Computational Linguistics, 61–69.
[2] Sascha O Becker, Thiemo Fetzer, and Dennis Novy. 2017. Who voted for Brexit? A comprehensive district-level analysis. Economic Policy, 32, 92, 601–650.
[3] Danielle K Brown and Summer Harlow. 2019. Protests, media coverage, and a hierarchy of social struggle. The International Journal of Press/Politics, 24, 4, 508–530.
[4] Honglin Chen, Xia Huang, and Zhiyong Li. 2022. A content analysis of Chinese news coverage on COVID-19 and tourism. Current Issues in Tourism, 25, 2, 198–205.
[5] Elizabeth W Dunn, Moriah Moore, and Brian A Nosek. 2005. The war of the words: how linguistic differences in reporting shape perceptions of terrorism. Analyses of Social Issues and Public Policy, 5, 1, 67–86.
[6] Frederick G Fico, Stephen Lacy, and Daniel Riffe. 2008. A content analysis guide for media economics scholars. Journal of Media Economics, 21, 2, 114–130.
[7] Yulin Hswen, Amanda Zhang, Clark Freifeld, John S Brownstein, et al. 2020. Evaluation of volume of news reporting and opioid-related deaths in the United States: comparative analysis study of geographic and socioeconomic differences. Journal of Medical Internet Research, 22, 7, e17693.
[8] Qihao Ji, Arthur A Raney, Sophie H Janicke-Bowles, Katherine R Dale, Mary Beth Oliver, Abigail Reed, Jonmichael Seibert, and Arthur A Raney. 2019. Spreading the good news: analyzing socially shared inspirational news content. Journalism & Mass Communication Quarterly, 96, 3, 872–893.
[9] Moya Jones. 2017. Wales and the Brexit vote. Revue Française de Civilisation Britannique / French Journal of British Studies, 22, XXII-2.
[10] Ifrah Pervaz, Iqra Ameer, Abdul Sittar, and Rao Muhammad Adeel Nawab. 2015. Identification of author personality traits using stylistic features: notebook for PAN at CLEF 2015. In CLEF (Working Notes). Citeseer, 1–7.
[11] Zengchang Qin, Yonghui Cong, and Tao Wan. 2016. Topic modeling of Chinese language beyond a bag-of-words. Computer Speech & Language, 40, 60–78.
[12] Abdul Sittar and Iqra Ameer. 2018. Multi-lingual author profiling using stylistic features. In FIRE (Working Notes), 240–246.
[13] Abdul Sittar, Hafiz Rizwan Iqbal, and Rao Muhammad Adeel Nawab. 2016. Author diarization using cluster-distance approach. In CLEF (Working Notes). Citeseer, 1000–1007.
[14] Abdul Sittar and Dunja Mladenic. 2021. How are the economic conditions and political alignment of a newspaper reflected in the events they report on? In Central European Conference on Information and Intelligent Systems.
Faculty of Organization and Informatics Varazdin, 201–208.
[15] Abdul Sittar, Dunja Mladenić, and Marko Grobelnik. 2022. Analysis of information cascading and propagation barriers across distinctive news events. Journal of Intelligent Information Systems, 58, 1, 119–152.
[16] Jie Xu, Huayi Tang, Yazhou Ren, Liang Peng, Xiaofeng Zhu, and Lifang He. 2022. Multi-level feature learning for contrastive multi-view clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16051–16060.

Automatically Generating Text from Film Material – A Comparison of Three Models

Sebastian Korenič Tratnik, Jožef Stefan International Postgraduate School, Faculty of Computer and Information Science, Večna pot 113, Ljubljana, Slovenia
Erik Novak, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT
The paper focuses on audio analysis and text generation using film material as an example. The proposed approach uses three different models (Wav2Vec2, HuBERT, S2T) to process the sound from different audio-visual units. A comparative analysis shows the strengths of the different models and the factors of the different materials that determine the quality of text generation for functional film annotation applications.

KEYWORDS
Text generation, automated transcription, cinema, film, video

1 INTRODUCTION
Applications like automatic text captions for video materials have become more and more popular and are extensively used on different media, spanning computers, televisions, smartphones and other technologies that enable audio-visual consumption. However, even though these applications have to an extent already become a staple of our everyday lives, their performance often varies and still has not reached optimal functionality. There are many challenges in generating text from audio-visual materials. These span from the structure and quality of the material, the type or category of sound, and the age of the recordings, to the models on which such transcription is based. The main goal of this paper is to provide a practical demonstration of a few basic models for automatic annotation. The aim is to take into account the currently most common procedures for such an endeavour and figure out how to minimize the loss of the models so as to generate text from film or video more successfully.

The rest of this paper is organized in the following way. Section 2 provides a description of the problem in the context of contemporary consumption of audio-visual materials via the most popular information and communication technologies. Section 3 delineates the methodology used and describes the approach taken to tackle the problem in a concrete demonstration. Section 4 presents the models being used and describes our implementation of them, specifying the dynamics of the obtained results. A conclusion is reached in Section 5, where the paper offers a discussion on the outcome and possible directions for future work.

2 PROBLEM DESCRIPTION
In recent years, audio-visual data has become as influential as, if not more influential than, traditional text-based information. With this, the task of extracting information from the former and transforming it into the latter is becoming useful for different purposes [1, 2]. One example is that text annotations enable better comprehension in cases of bad sound quality, or even allow the material to be understood in situations where sound consumption is impossible. Another is the possible speed-up of the video that annotations provide, due to their ability to keep the content integral in a clear graphic form. The consumption process can be made more time efficient, with textual information compensating for the distortions of audio-visual quality that can be brought about by manipulating the playing options. Furthermore, in a general sense, combining audio-visual material with text can solve many problems at different levels of film or video production. This can span from the preparatory phases of pre-production, such as writing the script, to the post-production phases, where one needs good orientation over a vast quantity of material. Proper text generation can facilitate easier orientation in such work and allows for a more efficient organization of the media materials.

In this paper, we will focus on those components that contribute to the quality of proper automated text generation as a prerequisite of such developmental strategies. The main contributions of this paper are: (1) an analysis of the factors that influence automatic transcription of film or video material, (2) implementation and comparison of a few different models for sound annotation, and (3) reflection on how this process can be used for more complex tasks.

3 METHODOLOGY
The problem we are solving is to take a piece of audio-visual material, convert it into a form that a model for automatic text generation can take as input, and then generate text output that matches the sound recording of the input in an optimal way. An optimal result should provide a close correspondence to the utterances in the film material and eventually identify different types and categories of sound such as dialogue, noise, music, etc. We will do an analysis of
the factors that influence the quality of automatically generated transcriptions in the following steps: 1) a comparison of different models for generating text from audio files, 2) an analysis of how the quality of transcriptions differs in relation to noise in the background (silence, music, dialogues), 3) an evaluation of how the clarity of speech influences the quality of transcriptions, and 4) an assessment of the extent to which it is more difficult to generate quality transcriptions from older audio recordings (films). Reflecting on the results of our procedure, we will consider how to improve the quality in cases where the quality of transcriptions is bad. Aside from quality, we will measure the time demands of the models, that is, how much time the models need to generate transcriptions from the audio recording.

The following models were used:

1) Wav2Vec2 [4] is a framework for self-supervised representation learning from raw audio that was made open-source by Facebook. It was the first automatic speech recognition model included in Transformers, one of the central libraries of Natural Language Processing. Figure 1 shows the model's architecture.

Figure 1. Wav2Vec2 learns speech units from multiple languages using cross-lingual training [4].

The model starts by processing the raw waveform with a multilayer convolutional neural network. This yields latent audio representations of 25 ms that are fed into a quantizer and a transformer. From an inventory of learned units, the quantizer chooses appropriate ones, while half of the representations are masked before being used. The transformer then adds information from the whole of the audio sequence, and the output is used to solve a contrastive task in which the model identifies the correct quantized speech units for the masked positions.

2) HuBERT [3] (Hidden-Unit BERT) is an approach to self-supervised speech representation that uses masking in a similar way and in addition adds an offline clustering step that provides aligned target labels for a prediction loss. This prediction loss is applied over the masked regions, which leads the model to learn a combined language and acoustic model over the continuous inputs. By focusing on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels, HuBERT can either match or improve on the Wav2Vec2 model. Figure 2 shows the model's architecture.

Figure 2. HuBERT predicts hidden cluster assignments for masked frames (y2, y3, y4 in the figure) generated by one or more iterations of k-means clustering [7].

3) S2T [5] (Speech2Text) is a transformer-based encoder-decoder (seq2seq) model that uses a convolutional downsampler to reduce the length of the audio inputs by more than one half before they are fed into the encoder. It generates the transcripts autoregressively and is trained with a standard autoregressive cross-entropy loss.

4 EXPERIMENT SETTING

4.1 Evaluation metric
We have used WER (Word Error Rate) as the metric of model performance; it computes the error rate from the counts of substitutions, deletions and insertions relative to the correct words. The original text was used for each of the models and each film example, with punctuation removed.

4.2 Data set
The dataset was formed from clips of different films. The films used were classics of world cinema (The Godfather, 2001: A Space Odyssey, Star Wars, Frankenstein, Fight Club, Paris, Texas, Scent of a Woman, Tomorrow and Tomorrow and Tomorrow). 14 clips with lengths spanning from 5 to 30 seconds were used, with the lengthier ones incorporating different sound contents (speech, shouting, whispering, etc.). The first step was to prepare the audio in a format that the models are able to read, so the clips were changed from mp4 to wav.
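This mp4 → wav step can also be scripted rather than done through an online converter; a minimal sketch assuming ffmpeg is installed on the system (file names are illustrative; 16 kHz mono is the sampling format that speech models such as Wav2Vec2 commonly expect):

```python
# Build and run an ffmpeg command that extracts the audio track of an
# mp4 clip as 16 kHz mono WAV, the input format expected by the ASR models.
import subprocess

def to_wav_cmd(src, dst, rate=16000):
    """Return the ffmpeg invocation for mp4 -> wav extraction."""
    # -ar sets the sampling rate, -ac 1 downmixes to mono, -y overwrites dst.
    return ["ffmpeg", "-y", "-i", src, "-ar", str(rate), "-ac", "1", dst]

def convert(src, dst):
    subprocess.run(to_wav_cmd(src, dst), check=True)  # requires ffmpeg on PATH

print(to_wav_cmd("clip.mp4", "clip.wav"))
```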
An online converter, cloudconvert (https://cloudconvert.com), was used, as the clips were fairly short and the results could be added to the Kaggle dataset directly from the browser.

Figure 3: A superposition of waveform graphs of all the examples.

4.3 Implementation details

Programming was done on Kaggle, where the code was written in Python; after the experiments were set up, the GPU was activated for faster computation. The general process for each of the models is the following. First, an encoder takes the raw data and feeds it into the model. In our demonstration, tokenizers were used at the start, but as the S2T tokenizer was not equipped to ingest the audio, it had to be changed to a processor. To retain consistency, the same step was applied to the other two models as well. Once the data is in the model, the model predicts particular syllables for each sound with certain probabilities and then, in an additional step, selects those with the highest probability based on the context of the semantic whole of the sentence. In the final step, the decoder (again the tokenizers / processors) takes the output of the model and transforms it into text.

5 EXPERIMENT RESULTS

The ground rules for our project were that each model had a particular function that took sound as input and produced text as output, with the text extracted separately for each audio clip. Subsequently, the models were compared according to the accuracy of the results under different criteria and a variety of scenarios (noise, music, number of characters, tempo of speech, etc.). We will illustrate the obtained results via a concrete example: a clip with relatively clear sound from the film A Few Good Men (1992), a digitized version of a well-preserved celluloid film. The sound is clear and the dialogue takes place in a courtroom, practically in complete silence of the surroundings, with the speech changing from a normal tone to screaming. The clip is 22 seconds long and its waveform is shown in Figure 4. The original text is as follows:

A: Did you order the Code Red?!
B: You don't have to answer that question!
C: I'll answer the question. You want answers?
A: I think I'm entitled!
C: You want answers!?
A: I want the truth!
C: You can't handle the truth! Son, we live in a world that has walls, and those walls have to be guarded by men with guns. Who's gonna do it? You? You, Lieutenant Weinberg?

Figure 4: A scene from A Few Good Men (1992), a still and waveform graph from the used sequence.

The produced transcriptions are as follows:

Wav2Vec2:
YOU WAR THE CORA YOU DON'T HAVE TO ANSWER THE QUESTION I'LL ANSWER THE QUESTION YOU WANT ANSWERS I THINK I'M ENTITLE YOU WANT ANT A AT THE TRUE YOU CAN'T HANDLE THE TRUTH SON WE LIVE IN A WORLD THAT HAS WALLS AND THOSE WALLS HAVE TO BE GUARDED BY MEN WITH GUNS WHO'S GON TO DO IT YOU YOU LIEUTENANT WINEBERG

HuBERT:
OMARTER TE CORET YOU DON'T HAVE TO ANSWER THAT QUESTION I'LL ANSWER THE QUESTION YOU WANT ANSWERS I THINK I'M ENTITLED YOU WANT ANSWERRTHE TRUTH YOU CAN'T HANDLE THE TRUTH SON WE LIVE IN A WORLD THAT HAS WALLS AND THOSE WALLS HAVE TO BE GUARDED BY MEN WITH GUNS WHO'S GOING TO DO IT YOU YOU LIEUTENANT WINBURG

S2T:
DEAR LORD THE CORRET YOU DON'T HAVE THE ANSWER THAT QUESTION I'LL ANSWER THE QUESTION YOU WANT ANSWERS BUT THEY CAN'T ENTITLE YOU ONE AND THE TRUTH YOU CAN'T HANDLE THE TRUTH SOME WE LIVE IN A WORLD THAT HAS WALLS AND THOSE WALLS HAVE TO BE GUARDED BY MEN WITH GUNS WHOSE TENANT DO IT YOU LIEUTENANT WINEBURG THOSE HAVE TO BE GUARDED BY MEN WITH GUNS WHOSE CANNON DO IT YOU YOU LIEUTENANT WINEBURG YOU LIEUTENANT WINEBURG

The lower the WER number, the better the results. The models did not show a noticeable variation in speed, while the quality of their performance varied due to different factors. HuBERT gave overall the best results from the point of view of readability. According to the rate of correspondence between input audio and output text, HuBERT gave a comparably better transcription rate than Wav2Vec2 for videos with poor audio quality, i.e. those from older or damaged films, while Wav2Vec2 performed better in the case of background music, but had a tendency to add too many insertions. S2T had a tendency to produce mistakes, seen in numbers peaking over 1.0. The overall results are given in Table 1.

Table 1: The WER scores for each model. The bold values represent the best performances on the given clip. The best performing model is HuBERT.

Clip      Wav2Vec2   HuBERT   S2T
1         69%        53%      91%
2         100%       0%       100%
3         100%       95%      95%
4         27%        30%      36%
5         17%        17%      17%
6         39%        18%      43%
7         28%        28%      64%
8         70%        46%      55%
9         50%        25%      100%
10        57%        37%      73%
11        62%        38%      51%
12        100%       95%      100%
13        60%        33%      73%
14        9%         4%       9%
Average   56%        37%      65%

It is important to note that the average given does not reflect the overall accuracy alone, but is the sum of different factors. The models can be good at transcribing particular words, but can add or drop extra words in the process and therefore make the overall text less comprehensible. An important factor is the way the original text used for comparison is written: omitting punctuation and writing the words properly, even if they are mispronounced, will improve the results. Finally, it is crucial that all the texts are in caps lock, or the comparison will not work and will produce misleading results. The WER usually expresses the result as a metric between 0 and 1; however, when the annotation results are extremely unsuccessful, the higher extreme may surpass this limit. In our case, values up to 1.6 were reached; in the chart, however, they were capped at 1.0 for purposes of clarity.

As the used example shows, it is mostly the clarity of speech that determines how the models perform. As the models were pre-trained and were not trained on the specific data used, they were in general surprisingly efficient. The discrepancies between different treatments of the same audio are visible, but in general, as long as the dialogue was clear, the results were comparable. Music seemed to cause bigger problems for the models than background noise, while additional speech in the background proved most problematic. Emotional influences on speech did not prove problematic, and even affective utterances were transcribed comparably to neutral speech if the sound data was of high quality.

6 DISCUSSION AND FURTHER WORK

As a general principle, when taking clips from films, the main factor that can negatively influence the quality of the generated text is background noise. As one would expect, the models work best when nothing is in the background and worst when people are talking in the background. Ideally, to improve the quality one would train the models on the specific material, using a similar type of material and accordingly performing a pre-classification according to the main categories of sound analysis (i.e. monologue, dialogue, background noise, music, echo, normal speech, loud speech, shouting, whispering, etc.), especially when using older or less well preserved material, which drastically differs in its sound data from newer or better preserved works.

In our research we expanded on and adapted existing work on automated text generation models, providing an analysis of the factors that determine the quality of such results from film material. As an example, we applied our approach to different film material, ranging in the quality and age of the clips and in the structure of the sound data.

A useful strategy for the future, from the perspective of film practice, would be to find ways to link transcriptions with a script. A precondition for such an endeavour would be to implement an algorithm for recognizing the person speaking and identifying the source with descriptions ("person A is speaking, then person B, then person A has a long monologue, person C answers", etc.). Another important task would be identifying sounds of different categories and providing fitting audio-signs (the sound of squeaking steps, the playing of music, etc.). From these steps one could eventually, at least to some extent, automatically generate scripts for films or find ways to develop tools for easier text-based classification of audio-visual material.

CONCLUSIONS

In this paper we explored ways to generate text from the audio information presented in film and video material. We used three different models to evaluate various film units: Wav2Vec2, HuBERT, and S2T. We found that the HuBERT model achieved the best results, while the remaining two methods performed similarly.

ACKNOWLEDGMENTS

The research described in this paper was supported by the Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, in the class Textual/Multimedia Mining and Semantic Technologies held by dr. Dunja Mladenić, under the mentorship of Erik Novak. We also thank Besher Massri, Aljoša Rakita and Martin Abram for additional feedback.

REFERENCES

1 A. Ramani, A. Rao, V. Vidya and V. B. Prasad. Automatic Subtitle Generation for Videos. 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), 2020, pp. 132-135, doi: 10.1109/ICACCS48705.2020.9074180.
2 Rustam Shadiev, Yueh-Min Huang. Facilitating cross-cultural understanding with learning activities supported by speech-to-text recognition and computer-aided translation. Computers & Education, Volume 98, 2016, pp. 130-141.
3 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. arXiv:2106.07447v1 [cs.CL], submitted on 14 Jun 2021.
4 Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv:2006.11477v3, submitted on 20 Jun 2020 (v1), last revised 22 Oct 2020 (v3).
5 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. fairseq S2T: Fast Speech-to-Text Modeling with fairseq. arXiv:2010.05171v1, submitted on 11 Oct 2020.
6 Wav2vec 2.0: Learning the structure of speech from raw audio. https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio, 24 Sep 2020, accessed 9 Jan 2022.
7 Hsu, Wei-Ning, et al. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021): 3451-3460.

The Russian invasion of Ukraine through the lens of ex-Yugoslavian Twitter

Bojan Evkoski (bojan.evkoski@ijs.si), Jozef Stefan Institute and Jozef Stefan Postgraduate School, Ljubljana, Slovenia
Igor Mozetič (igor.mozetic@ijs.si), Jozef Stefan Institute, Ljubljana, Slovenia
Petra Kralj Novak (petra.kralj.novak@ijs.si), Central European University, Vienna, Austria, and Jozef Stefan Institute, Ljubljana, Slovenia
Nikola Ljubešić (nikola.ljubesic@ijs.si), Jozef Stefan Institute, and Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia

Figure 1: Pre-invasion (left) and invasion (right) ex-Yugoslavian retweet networks. Node colors represent communities.
Labeled arrows point to the main communities, with labels inferred from the community users. The in-network labels represent the names of the most retweeted accounts.

ABSTRACT

The Russian invasion of Ukraine marks a dramatic change in international relations globally, as well as in specific, already unstable, regions. The geographical area of interest in this paper is a part of ex-Yugoslavia where the BCMS (Bosnian, Croatian, Montenegrin, Serbian) languages are spoken, official varieties of a pluricentric Serbo-Croatian macro-language [4]. We analyze 12 weeks of Twitter activities in this region, six weeks before the invasion, and six weeks after the start of the invasion. We form retweet networks and detect retweet communities, which closely correspond to groups of like-minded Twitter users. The communities are distinctly divided across countries and political orientations. Some communities detected after the start of the Russian invasion also show a clear pro-Ukrainian or pro-Russian stance. Such analyses of social media help in understanding the role and effect of this conflict at the regional level.

KEYWORDS

social network analysis, community detection, Twitter

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia. © 2022 Copyright held by the owner/author(s).

1 INTRODUCTION

The Russian invasion of Ukraine brings about dramatic changes to the world. Analysing the structure and content of the communication on social media, such as Twitter, can give more insight into the causes, developments and consequences of this conflict. The geographical area of interest in our research is a part of ex-Yugoslavia where the BCMS (Bosnian, Croatian, Montenegrin, Serbian) languages are spoken, official varieties of the pluricentric Serbo-Croatian macro-language. This area is strongly politically divided by the diverging influences of NATO (Croatia, Montenegro, North Macedonia, the Bosniak and Croatian entities in Bosnia and Herzegovina) and Russia (Serbia, the Serbian entity in Bosnia and Herzegovina). While Croatia has been a full EU member since 2013, Montenegro, North Macedonia and Serbia are EU candidate members, while Bosnia and Herzegovina is a potential candidate. Regarding military alliances, NATO members are Croatia (since 2009), Montenegro (since 2017) and North Macedonia (since 2020), while Serbia does not aspire to join NATO, primarily due to a complex Serbia-NATO relationship caused by the NATO intervention in Yugoslavia in 1999.

To shed light on the impact of the Russian invasion on this brittle and complex geographical and political area, we use social network analysis on the available Twitter data, six weeks before and six weeks into the invasion. We discover a complex landscape of ideology-specific and country-specific communities (see Figure 1), and analyse the transition into evident pro-Ukraine and pro-Russia leanings. We also present a method to measure the similarity of the communities before and during the invasion by analyzing URL and hashtag usage. As the communities show very divergent properties, we echo concerns about the heavy polarization and possible destabilization of this area of the Balkans.

2 RESULTS

The data analysed in this study were collected with the TweetCat tool [3], focused on harvesting tweets in less frequent languages. TweetCat continuously searches for new users tweeting in the language of interest by querying the Twitter Search API for the most frequent and unique words in that language. Every user identified as tweeting in the language of interest is continuously collected from that point onward. This data collection procedure has been running for the BCMS set of languages since 2017. During the 12 weeks of our focus, we collected 1.2M tweets and 3.8M retweets from 45,336 users. A rough estimate of the per-country production of tweets via URL usage from country-specific top-level domains (upper part of Table 1) shows Twitter to be much more popular in Serbia and Montenegro than in Croatia or Bosnia and Herzegovina. This has to be taken into account when analysing the communities of the underlying tweetosphere.

We created pre-invasion and invasion retweet networks (users as nodes, retweets as edges) from the collected data. We applied community detection (Ensemble Louvain [1]) to the two networks and analysed the community properties and user transitions [2]. We identified and named the large communities (more than 100 users) through a careful analysis of their most influential users and hashtag/URL usage. Figure 2 depicts the user transitions between the two networks, while Table 1 shows the general statistics of each community. We discovered the following peculiarities:

• The BCMS tweetosphere is dominated by Serbian (RS) users and content.
• The political communities are more active than the non-political ones.
• The RS populist coalition community (led by the Serbian president Aleksandar Vučić) forms a very strong echo chamber, with less than 2% of all users, yet more than 25% of tweets and retweets, and more than 95% of intra-community retweets.
• The RS populist coalition and the left-wing opposition remain neutral on the invasion topic.
• The RS right-wing opposition and the Bosnian Serbs show a clear pro-Russia stance.
• The Croatian, Bosnian and Montenegrin communities show a clear pro-Ukraine stance.

Figure 2: A Sankey diagram showing the transitions of users from the pre-invasion network communities (left) to the invasion network communities (right). Rectangle height is proportional to the community sizes. Percentages near the pre-invasion communities show the portion of users found in the corresponding invasion communities. Percentages on the right-hand side of the invasion communities show the portion of users not previously present in the large communities of the pre-invasion network. Gray rectangles depict the communities tightly related to politics, with yellow and red denoting the detected pro-Ukraine and pro-Russia leaning communities, respectively.

Country                        Population     URLs
Serbia (RS)                    7.2M (47.3%)   106K (44.2%)
Croatia (HR)                   3.9M (25.6%)   19.6K (8.1%)
Bosnia and Herzegovina (BA)    3.5M (23.0%)   14.9K (6.2%)
Montenegro (ME)                620K (4.1%)    24.7K (10.2%)
Total                          15.2M          242K

Pre-invasion communities    Users          Tweets         Retweets       Intra-com. RTs
RS tweetosphere part 1      13K (29.0%)    125K (24.9%)   300K (18.9%)   80.3%
RS tweetosphere part 2      2.5K (5.6%)    35.8K (7.1%)   63.2K (4.0%)   62.3%
RS sports                   1.6K (3.6%)    12.6K (2.5%)   25.6K (1.6%)   53.8%
ME tweetosphere             1.7K (3.8%)    22.7K (4.5%)   44.6K (2.8%)   74.5%
BA + HR + ME tweetosphere   5.6K (12.4%)   37.8K (7.5%)   59K (3.7%)     75.3%
Macedonian tweetosphere     200 (0.4%)     721 (0.1%)     771 (0.1%)     77.7%
International tweetosphere  934 (2.0%)     8.5K (1.7%)    11.5K (0.7%)   62.3%
RS populist coalition       2.0K (4.8%)    52.4K (10.4%)  396K (24.9%)   98.7%
RS left-wing opposition     9.3K (20.6%)   105K (20.9%)   408K (25.5%)   80.5%
RS right-wing opposition    7.6K (16.8%)   87.8K (17.4%)  247K (15.5%)   72.1%
Bosnian Serbs               139 (0.3%)     2.2K (0.4%)    3.8K (0.2%)    83.1%
Total                       45.3K          502.9K         1590K

Invasion communities                     Users           Tweets          Retweets        Intra-com. RTs
RS tweetosphere part 1                   16.9K (29.5%)   160K (22.4%)    387K (16.8%)    71.1%
RS tweetosphere part 2                   4.5K (7.7%)     57.3K (8.1%)    118K (5.1%)     58.1%
BA + HR + ME tweetosphere (pro-Ukraine)  12.4K (21.7%)   76.1K (10.6%)   235K (10.2%)    64.7%
RS right-wing opposition (pro-Russia)    11.1K (19.4%)   129K (17.9%)    508K (22.1%)    65.1%
RS populist coalition                    1.8K (3.1%)     208K (29.1%)    450K (19.5%)    95.6%
RS left-wing opposition                  9.8K (17.2%)    191K (26.7%)    590K (25.6%)    72.6%
Bosnian Serbs (pro-Russia)               356 (0.6%)      5.4K (0.7%)     7.1K (0.3%)     62.3%
Total                                    57.4K (+26.7%)  717K (+42.8%)   2302K (+44.8%)

Table 1: The first part shows the general population of each BCMS country and their respective tweet URL shares (.rs, .hr, .ba and .me). The second part shows the pre-invasion network communities with the number of users, tweets, retweets and intra-community retweets. The third part shows the same statistics for the invasion network communities. Grey rows depict political communities, while yellow and red show the pro-Ukraine and pro-Russia communities, respectively.

In order to compare the pre-invasion and invasion communities in terms of content and political leanings, our next goal was to compare the pool of hashtags used and URLs shared by the community users. To this end, we developed a simple community similarity method. First, we preprocessed the URLs by manually filtering out the ones coming from social media sources like Twitter, Facebook, Youtube etc., as well as URL shorteners. With this, we created a subset in which more than 99% of the URLs were news media, making it ideal for media polarization analysis. Once we extracted the domains of the URLs, we created sorted lists of the top 50 URL domains and the top 50 hashtags for each community, sorted by usage counts. Finally, to calculate the similarities between communities, we used the Rank-biased overlap (RBO) measure for indefinite rankings [5].

We found that the matchings between the pre-invasion and invasion communities based on highest-user-overlap transitions are also visible through the URL and hashtag similarities (see Figure 3). In fact, for each pre-invasion community, its respective highest-user-overlap invasion community is also the highest RBO pair for both URLs and hashtags. In other words, there is a strong positive correlation between the user transition percentages (Figure 2) and the RBO scores. E.g., 68% of the users from the pre-invasion "RS populist coalition" community transition into the "RS populist coalition" community in the invasion network. Meanwhile, the URL RBO of this pair is 0.64 and the hashtag RBO is 0.43, both being the highest combination for the pre-invasion "RS populist coalition" community, clearly matching it with its invasion transition-based counterpart. This shows that our simple similarity method based on URLs and hashtags can even help in better matching communities in the task of community evolution [6].

Figure 3: Domain and hashtag community similarities. A heatmap showing the similarities between the pre-invasion and invasion network communities based on the top 50 URLs (left) and hashtags (right). Similarities are calculated using the Rank-biased overlap (RBO) measure for indefinite rankings [5].

3 CONCLUSION

In this work, we investigated the Russian invasion of Ukraine through the lens of Twitter in the ex-Yugoslavian region where Bosnian, Croatian, Montenegrin and Serbian are spoken. We analyzed 12 weeks of Twitter activities in this region, six weeks before the invasion, and six weeks after the start of the invasion. For each period, we created retweet networks and detected retweet communities. We followed the transition of users from the pre-invasion to the invasion period and analyzed these groups of like-minded Twitter users, discovering that they are distinctly divided across countries and political orientations. For the invasion network, we were also able to detect communities which show a clear pro-Ukrainian or pro-Russian stance.

Another contribution is a simple method for comparing retweet network communities based on the content of the tweets. The method showed a strong correlation with the most prominent user transitions we had discovered earlier.

A continuation of this work is to expand it into multidisciplinary research, with the aim of meticulously analyzing the polarized content between the communities in collaboration with domain experts who are knowledgeable in ex-Yugoslavian politics. Beyond obtaining interesting insights, we also aim to explore two frequent issues in using social media for societal analyses: (1) the uptake bias of specific social networks across countries and communities, and (2) the entanglement of the main event with other large-scale events.

ACKNOWLEDGMENTS

The authors acknowledge the financial support of the Slovenian Research Agency (research core funding no. P2-103 and no. P6-0411).

REFERENCES

[1] B. Evkoski, I. Mozetič, P. Kralj Novak. Community evolution with Ensemble Louvain. In 10th Intl. Conf. on Complex Networks and their Applications, Book of Abstracts, pp. 58–60, Madrid, Spain, 2021.
[2] B. Evkoski, I. Mozetič, N. Ljubešić, and P. Kralj Novak. Community evolution in retweet networks. PLoS ONE, 16(9):e0256175, 2021. Non-anonymized version available at https://arxiv.org/abs/2105.06214.
[3] N. Ljubešić, D. Fišer, and T. Erjavec. TweetCaT: a tool for building Twitter corpora of smaller languages. In Proc. 9th Intl. Conf. on Language Resources and Evaluation, pp. 2279–2283, ELRA, Reykjavik, Iceland, 2014.
[4] N. Ljubešić, M. Miličević Petrović, and T. Samardžić. Borders and boundaries in Bosnian, Croatian, Montenegrin and Serbian: Twitter data to the rescue. Journal of Linguistic Geography 6:2, pp. 100–124, DOI 10.1017/jlg.2018.9, Cambridge University Press, 2018.
[5] W. Webber, A. Moffat, and J. Zobel. A similarity measure for indefinite rankings. ACM Trans. Information Systems 28(4):20, 2010.
[6] G. Rossetti and R. Cazabet. Community discovery in dynamic networks: a survey. ACM Computing Surveys (CSUR) 51.2 (2018): 1–37.

Visualization of consensus mechanisms in PoS based blockchain protocols

Daniil Baldouski (d.baldovskiy@mail.ru), University of Primorska, Koper, Slovenia
Aleksandar Tošić (aleksandar.tosic@upr.si), University of Primorska, Koper, Slovenia, and InnoRenew CoE, Izola, Slovenia

ABSTRACT

In the past decade, decentralized systems have been gaining increasing attention. Much of the attention arguably comes from the financial and sociological acceptance and adoption of blockchain technology. One of the frontiers has been the design of new consensus protocols, topology optimisation in these peer-to-peer (P2P) networks, and gossip protocol design. Analogous to agent-based systems, transitioning from design to implementation is a difficult task. This is due to the inherent nature of such systems, where nodes or actors within the system only have a local view of the system, with very few guarantees on the availability of data. Additionally, such systems often offer no guarantees of system-wide time synchronisation. This research offers insight into the importance of visualisation techniques in the implementation phase of vote-based consensus algorithms and P2P overlay network topology. We present our custom visualisations, and note their usefulness in debugging and identifying potential issues in decentralized networks. Our use case is an implementation of a blockchain protocol.

KEYWORDS

Grafana, visualisation, consensus mechanism, blockchain protocols, P2P, overlay network

1 INTRODUCTION

Distributed systems are notoriously difficult to inspect and their problems difficult to identify. The difficulty stems from the fact that predominant issues can be stochastic and difficult to reproduce, and from the inability to easily observe, compare, and test multiple programs running on separate machines at the same time. Another important aspect of distributed systems is that they inherently make heavy use of the network. The use of various network protocols imposes additional complexity, which increases the search space when identifying bugs. In recent years, distributed systems have been gaining more attention both in academia and in the private sector. This increasing interest can be largely attributed to the rapid development of distributed ledger technology and blockchain. Many new consensus mechanisms, blockchain protocols, network protocols, and improvements in gossip protocols have been proposed, and many of them are transitioning from a theoretical framework to a practical implementation. Public distributed ledgers (distributed ledger technology, or DLT) and blockchains secure their consensus mechanisms and provide spam resistance through the use of tokens representing value. The use of digital value within the protocol enables the protocol to enforce a level of security through economic incentives and game-theoretical aspects that make most attack vectors economically infeasible or impractical for the attacker. A good example of this is the Proof of Stake (PoS) consensus mechanism, where nodes secure the decentralized protocol by being required to stake and lock up a considerable amount of value, which can be deducted (usually referred to as slashing) by the protocol in case the node misbehaves. The economic aspect of public blockchains poses a very high security risk. With such strong economic incentives to identify and exploit potential bugs and system faults, it is of utmost importance for developers to thoroughly test and examine potential problems. However, the aforementioned difficulties in debugging distributed and decentralized protocols require developers to be equipped with tools that support their efforts.

In this study, we review state-of-the-art approaches to testing and debugging voting-based consensus mechanisms and decentralized networks. We develop a visualisation specifically designed for researchers and developers to test such networks and compare real-time observed data with the expected data. We conclude that visualisation techniques can be complementary to traditional log-based debugging and testing techniques. Moreover, we provide our tools as open-source software, as plugins for the popular visualisation platform Grafana. Both tools make no assumptions about the data storage implementation; the plugins can be configured via the Grafana plugin configuration interface to fit the specifics of the protocol implementation. We validate our tools by applying them to a custom developed blockchain, and then explain how successful they turned out to be in identifying anomalies and bugs in the protocols.

2 THE ROLE OF VISUALIZATIONS IN DEBUGGING COMPLEX DISTRIBUTED SYSTEMS

Distributed and decentralized systems are difficult to debug, as developers are working on a third layer: there are code-level bugs on L1, issues with concurrency on L2 (the individual run-time), and finally a third dimension of potential bugs arising from the message exchange between nodes. In general, it is often hard to capture the state of a distributed system, as debuggers cannot be attached to all nodes' run-times. Additionally, it is often difficult to reproduce errors when they are inherently stochastic.
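One common way to make such stochastic behaviour inspectable after the fact is to ship each node's telemetry as timestamped records into a time-series database. As an illustrative sketch only (the measurement, tag, and field names below are hypothetical, not taken from any particular protocol), a single observation can be serialized into InfluxDB's line protocol, `measurement,tags fields timestamp`:

```python
def to_line_protocol(measurement: str, tags: dict, fields: dict, ts_ns: int) -> str:
    """Serialize one telemetry point into InfluxDB line protocol:
    measurement,tag=val,... field=val,... timestamp(ns)."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))

    def fmt(v):
        # Strings are quoted, booleans are literal, integers carry an 'i'
        # suffix, and floats pass through unchanged.
        if isinstance(v, str):
            return f'"{v}"'
        if isinstance(v, bool):
            return "true" if v else "false"
        if isinstance(v, int):
            return f"{v}i"
        return repr(v)

    field_part = ",".join(f"{k}={fmt(v)}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {ts_ns}"

# Example: one consensus-round observation from a (hypothetical) node.
point = to_line_protocol(
    "consensus", {"node": "n07", "role": "validator"},
    {"round": 42, "votes_seen": 17}, 1665395200000000000)
```

Because every point carries its own nanosecond timestamp, records from many nodes can be merged and replayed as one time series, which is what makes the dashboard-style inspection discussed below possible.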
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or We consider several methods, such as Logging, Remote debugging, distributed for profit or commercial advantage and that copies bear this notice and Simulations and Visualisations. the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner /author(s). • Logging is the most common debugging method for all Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia three layers. However, in distributed systems it is impor- © 2022 Copyright held by the owner/author(s). tant to aggregate logs, and analyze them as a time series. 34 Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia Daniil Baldouski and Aleksandar Tošić Additionally, aggregating distributed logs assumes the sys- distributed systems, while our tools are created specifically tem has some method of clock synchronization protocol. for monitoring PoS voting based consensus mechanisms Log collection has been proven to be effective in detecting and underlying network topology of the distributed sys- performance issues for systems such as Hadoop [12] and tem. Darkstar [13]. The aggregation can be done with specific tools for log collection such as InfluxDB [8], Logstash [10], 3 RESEARCH OBJECTIVES etc. Aggregated logs then can be viewed in a form of a The main goal of this research is to build visualisation tools that dashboard using tools like Grafana (see Figure 1). offer more insight into a running distributed system using the time series log collection data. The targeted system is a custom proof of stake based blockchain. Such tools should visualize if nodes contributing to the consensus learned about their correct roles, and if they perform their roles accordingly. 
In the consensus algorithms this is done by sending messages, so the tools should visualise messages exchanged between nodes. In the structured P2P networks information spreads using gossip protocols and network topology changes every time slot. Our tools should visualize such changes in the network topology by drawing nodes and their cluster representatives, while at the same time indicating the consensus roles for each node. Figure 1: Part of the Grafana dashboard used by developers In our implementation time series data comes from InfluxDB, to gain insight into a running PoS based blockchain net- but we want our tools to have no assumption on the data storage work. implementation and there are other popular databases, such as kdb+ and Prometheus, that work well with time series data. Be- • Remote debugging is a technique where a locally running cause of that we choose Grafana as a platform for visualizations, debugger is connected to a remote node in the distributed which supports all of the aforementioned databases and many system. This allows developers to use the same features more at the time of writing. as if they were debugging locally. However, it is difficult In this work we implement two Grafana plugins built to vi- to determine which remote node should be debugged. Ad- sualize PoS based blockchains, and decentralized network topol- ditionally, in case of Byzantine behaviour due to network ogy. Our tools are designed with generality in mind, and are faults connecting the debugger could fail. hence applicable to other PoS voting based blockchains and other • Distributed deterministic simulation and replay is a tech- distributed ledger implementations. We evaluate our tools by nique that attempts to address the issues of reproducibility applying it to the custom developed blockchain and note their in distributed systems. 
Tools like Friday [5] and liblog [6] can be used to record the specific state of the network and analyze it later. The technique suggests implementing an additional layer that abstracts the underlying hardware and the network interfaces to allow for an exact replay of all the state changes and messages exchanged between nodes. Tools such as FoundationDB, or even custom systems, are built on containerisation software.
• Visualisation and time series analysis attempts at capturing the state of the system and all of its nodes by visualising the collected logs. Tools like Prometheus [11] and Grafana [2] are used extensively. Tools like Theia [4] and Artemis [3] are designed for monitoring and analyzing performance problems in distributed systems and support built-in visualization tools for data exploration. However, such tools provide aggregated, log-based summaries of the distributed system and are not capable of observing underlying low-level network properties, e.g. monitoring network communication, especially in real time while the system is running.

4 GRAFANA PLUGINS FOR VISUALISING VOTE BASED CONSENSUS MECHANISMS AND P2P OVERLAY NETWORKS
We have developed two plugins that extend the functionality of Grafana. Figure 2 outlines the architecture used in production: a server running a database instance (preferably a time series database, i.e. InfluxDB) and the Grafana platform. Depending on the underlying blockchain implementation, nodes can insert their telemetry directly into the database or, if possible, have an archive node gather telemetry from the nodes and report it. In this example, a cluster was used to run multiple nodes. A coordinating node is responsible for maintaining an overlay network and serving the nodes within the overlay with DHCP, DNS, and routing. Nodes are packed within Docker containers and submitted to the coordinator, which uses built-in load balancing to distribute them to other cluster nodes.
ShiViz [1], on the other hand, displays distributed system executions as an interactive time-space diagram. With this tool, all the necessary events and interactions can be viewed in an orderly manner and inspected in detail. ShiViz visualization is based on logical ordering, meaning that, unlike our tools, it is not capable of running in real time together with the considered distributed network. ShiViz also works with aggregated logs about various types of events of the distributed system and, unlike our tools, does not support direct database connections. ShiViz is generalized and works with all kinds of distributed systems, while our tools are created specifically for monitoring PoS voting based consensus mechanisms and the underlying network topology of the distributed system.

The telemetry inserted is timestamped to create a time series stream of data that is consumed by Grafana. Figure 1 shows a small part of the dashboard created within Grafana using the built-in plugins for typical visualisations. These visualisations are time series of a running blockchain showing telemetry reported by the nodes. However, rendering telemetry from hundreds of nodes as factors is hardly informative.

Both plugins were developed as React components, using the well-known D3.js JavaScript library for animations; the life cycle of the plugins is managed by Grafana.

Visualization of consensus mechanisms in PoS based blockchain protocols — Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia

The consensus plugin shows whether nodes contributing to the consensus learned about their correct roles and perform them accordingly. In order to have a scalable visualisation, nodes are placed around a circle and scaled according to the size of the network. Roles are visualized with a color map. Each slot, nodes change their roles and execute the protocol accordingly.
To visualise the execution, the plugin visualises messages exchanged between nodes in the form of animated lines flying from an origin node to the destination node. The animations are time-synchronous, and transfer times and latencies are taken into account. Additionally, every message is logged with a type, indicating the sub-protocol within which it was created. As an example, messages sent from committee members to the block producer are attestations for the current block. The animated lines are coloured according to the message type. The thickness of the animated lines indicates the size of the payload transferred between nodes. Figure 4 shows the consensus plugin running live, visualising a test network of 30 nodes. The green coloured node indicates the block producer role for the current slot, nodes coloured violet are part of the committee, and blue nodes are validators.

Figure 2: System architecture (Docker Swarm master node and cluster nodes, InfluxDB, Grafana, web server, and telemetry over the P2P overlay network).

4.1 Network Plugin
P2P networks propagate information using gossip protocols. There are many variations in the general and implementation specifics, but in general this family of protocols aims at gossiping the fact that new information is available in the network. Should a node hear about the gossip and require the information, it will contact neighbouring nodes asking for the data. In general, gossip protocols make no assumptions about the topology of the overlay network. However, with structured networks, the information exchange can be made much more efficient. The observed blockchain implementation utilized a semi-structured network topology for propagating consensus-based information. This is made possible by using a seed string shared between nodes that is used for pseudo-random role election every block. Using the seed, nodes self-elect into roles without the need to communicate. However, when performing roles, committee members must attest to the candidate block produced by the block producer.
The seeded random is therefore also used to cluster the network using a k-means algorithm. The clustering is again performed by each node locally. The shared seed guarantees that nodes will produce the same topology, which is then used to efficiently propagate attestations to the block producer. The network topology hence changes every slot. The plugin aims to visualize the changes in the network topology by drawing nodes and their cluster representatives. Additionally, the consensus roles for each node are indicated with the vertex color. Figure 3 shows the network plugin rendering a test network of 30 nodes in real time. The node in the center coloured green is the elected block producer for the current slot, nodes surrounded by the red stroke are cluster representatives, and the rest of the nodes are coloured based on their role in the current slot.

Figure 3: Network topology plugin visualising a test network of 30 nodes in real time.

Figure 4: Consensus plugin (with legend) visualising a test network of 30 nodes in real time.

4.3 Generality
In order to use the above plugins, users have to provide certain data to the Grafana dashboard, which can be done through the Grafana GUI. For the plugins to work, all of the data should follow a specific naming policy. For example, for the Consensus plugin there is one necessary query to visualize data about the nodes of the network. It can be provided using SQL or the Grafana GUI:

SELECT "slot", "node", "duty" FROM "" WHERE $timeFilter

Both plugins can be customized from the Grafana options menu. For example, users can add new roles, and name and color them. Figure 5 shows the consensus plugin options menu, where users can additionally turn on or off the display of messages, node or container labels, and so on. For both plugins, users have to manually provide the slot time of the network in the plugin options menu.
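The seed-driven, communication-free role election described in Section 4.1 can be illustrated with a small sketch. This is our own illustration, not the authors' implementation; the `elect_roles` helper and its parameters are hypothetical. The key property it demonstrates is that every node derives an identical assignment from the shared seed.

```python
import random

# Illustrative sketch (not the authors' code): nodes self-elect into consensus
# roles from a shared per-slot seed. Because every node runs the same seeded
# PRNG over the same sorted node list, all nodes derive identical assignments
# without exchanging any messages.
def elect_roles(node_ids, seed, committee_size=2):
    rng = random.Random(seed)          # same seed -> same shuffle on every node
    order = sorted(node_ids)
    rng.shuffle(order)
    roles = {order[0]: "block_producer"}
    for n in order[1:1 + committee_size]:
        roles[n] = "committee"
    for n in order[1 + committee_size:]:
        roles[n] = "validator"
    return roles

nodes = [f"node{i}" for i in range(6)]
a = elect_roles(nodes, seed="slot-42")
b = elect_roles(nodes, seed="slot-42")  # a second node computing the same slot
print(a == b)                           # True: the election is deterministic
```

The same idea extends to the seeded k-means clustering: as long as the randomness is derived from the shared seed, each node can compute the cluster topology locally and arrive at the same result.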
4.2 Consensus Plugin
The aim of visualising the consensus mechanism is to quickly evaluate if nodes contributing to the consensus learned about their correct roles.

Figure 5: Consensus plugin options menu.

By using our tools we can visualize other protocols. For example, with the consensus plugin we can visualize the famous Paxos algorithm, first introduced in [7] by Leslie Lamport. For that, we should provide the plugin with the Nodes and Messages queries. For the Nodes query, the parameters slot, node and duty should be provided, which represent the slot number, the node id and the role of the node, respectively. From the point of view of nodes and slots, for this visualization Paxos works in the same way as the PoS based consensus example we mentioned before. For the duty parameter, nodes can have one of three roles: proposer, acceptor or learner. That is why, in the options menu of the plugin, we should create three roles and name them according to the names from the data table.
We should specify the slot time (in seconds) in the plugin options menu, and at this point we can set the Grafana dashboard refresh time and see the results, since all the necessary conditions are fulfilled. But in order to gain more information from the plugin, we should add the Messages query. For the data we should have the following parameters: id, source, target and endpoint, which represent the message id, the node id that sends the message, the node id that receives the message, and the type of the message, respectively. As additional information we can specify the parameters delay (in seconds) and size of the message. If we know the expected number of nodes for some role, we can put it in the plugin options menu to see this information in the plugin legend. In a similar way we should be able to visualize other consensus protocols, for example 2PC or Raft [9].
Source code for both plugins is open source, licensed under the MIT license and available on GitLab, where users can find the installation procedure of the plugins:
• Network plugin - https://gitlab.com/rentalker/topology-visualization-plugin,
• Consensus plugin - https://gitlab.com/rentalker/consensus-visualization-plugin.

5 CONCLUSION
We developed two Grafana plugins for visualising PoS based blockchains and the underlying overlay network topology. The plugins were used to identify critical bugs and faults in the protocol. With the help of visualisations, we were able to detect two problems when running test-nets.
• Network congestion: for every slot, validators must report their statistics to the block producer. Prompt delivery is desired but not critical. However, as the network grew in size, reporting statistics to a single node (the block producer) became increasingly latent, as all nodes attempted to propagate messages in tandem and, even more importantly, the network topology required a lot of routing for messages to arrive at the block producer. The network plugin helped us identify the problem by looking at the topology.
• State synchronisation: at random, nodes failed to perform their roles. This resulted in missing votes even on small test-nets, and sometimes a chain halt where no blocks were produced for the slot. We observed that the likelihood of this happening grows in correlation with network size. However, it was infeasible to debug the state of all nodes in a large network. Visualising the state of nodes at a given slot, we observed that states were not always synchronized and hence some nodes did not learn about their consensus role.
We conclude that visualisation is an important tool in the design and implementation of decentralized and distributed systems. The methods serve a complementary role to existing debugging methods and are very powerful at observing unexpected behaviour of the system as a whole. Visualisation techniques are specifically important in detecting stochastic faults that are non-trivial to reproduce. Our tools are open-source and available for researchers and engineers to use. They are suitable for testing any kind of voting-based consensus protocol with little effort.
For future work we would like to further develop our tools to accommodate other consensus protocols and help developers visualize and debug other types of issues related to distributed systems. We would also like to explore other types of visualizations and other existing tools that can help developers as well. Since Grafana is rapidly evolving, our plugins can be updated and new technologies can be integrated with our tools to improve their performance.

6 ACKNOWLEDGMENTS
The authors gratefully acknowledge the European Commission for funding the InnoRenew CoE project (H2020 Grant Agreement #739574) and the Republic of Slovenia (Investment funding of the Republic of Slovenia and the European Union from the European Regional Development Fund), as well as the Slovenian Research Agency (ARRS) for supporting the project number J2-2504 (C).

REFERENCES
[1] Beschastnikh, I., Wang, P., Brun, Y., and Ernst, M. D. Debugging distributed systems. Commun. ACM 59, 8 (Jul 2016), 32–37.
[2] Chakraborty, M., and Kundan, A. P. Grafana. In Monitoring Cloud-Native Applications. Springer, 2021, pp. 187–240.
[3] Creţu-Ciocârlie, G. F., Budiu, M., and Goldszmidt, M. Hunting for problems with Artemis. In Proceedings of the First USENIX Conference on Analysis of System Logs (USA, 2008), WASL'08, USENIX Association, p. 2.
[4] Garduno, E., Kavulya, S. P., Tan, J., Gandhi, R., and Narasimhan, P. Theia: Visual signatures for problem diagnosis in large Hadoop clusters. In Proceedings of the 26th International Conference on Large Installation System Administration: Strategies, Tools, and Techniques (USA, 2012), LISA'12, USENIX Association, pp. 33–42.
[5] Geels, D., Altekar, G., Maniatis, P., Roscoe, T., and Stoica, I. Friday: Global comprehension for distributed replay. Vol. 7.
[6] Geels, D., Altekar, G., Shenker, S., and Stoica, I. Replay debugging for distributed applications. In 2006 USENIX Annual Technical Conference (USENIX ATC 06) (Boston, MA, May 2006), USENIX Association.
[7] Lamport, L. The part-time parliament. ACM Transactions on Computer Systems 16, 2 (May 1998), 133–169. Also appeared as SRC Research Report 49. ACM SIGOPS Hall of Fame Award in 2012.
[8] Naqvi, S. N. Z., Yfantidou, S., and Zimányi, E. Time series databases and InfluxDB. Studienarbeit, Université Libre de Bruxelles 12 (2017).
[9] Ongaro, D., and Ousterhout, J. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX Annual Technical Conference (USA, 2014), USENIX ATC'14, USENIX Association, pp. 305–320.
[10] Sanjappa, S., and Ahmed, M. Analysis of logs by using Logstash. In Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications (Singapore, 2017), S. C. Satapathy, V. Bhateja, S. K. Udgata, and P. K. Pattnaik, Eds., Springer Singapore, pp. 579–585.
[11] Turnbull, J. Monitoring with Prometheus. Turnbull Press, 2018.
[12] Xu, W., Huang, L., Fox, A., Patterson, D., and Jordan, M. Online system problem detection by mining patterns of console logs. In 2009 Ninth IEEE International Conference on Data Mining (2009), pp. 588–597.
[13] Xu, W., Huang, L., Fox, A., Patterson, D., and Jordan, M. I. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (New York, NY, USA, 2009), SOSP '09, Association for Computing Machinery, pp. 117–132.
Using Machine Learning for Anti Money Laundering

Gregor Kržmanc (gregor.krzmanc@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Filip Koprivec (filip.koprivec@ijs.si), Jožef Stefan Institute and IMFM, Ljubljana, Slovenia
Maja Škrjanc (maja.skrjanc@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

Figure 1: Example transaction network visualization

ABSTRACT
Here we present early results of a network component for anomaly detection in an attributed heterogeneous financial network. Utilizing both externally provided features and generated topological features, we train different models for a simple link prediction task. We then evaluate the models using initial dataset corruption. We show that gradient boosting and multi-layer perceptron generally have the best anomaly detection performance, despite graph neural network models initially showing better results in the link prediction task.

KEYWORDS
Anti Money Laundering (AML), machine learning, networks, link prediction

1 INTRODUCTION
Observing complex real-world graphs, be it a social, financial, biochemical, or physics-related network, is an interesting task. Given a time-evolving network and rich information about the nodes and edges, can we assume that there are some regular dynamics in the network?
Fraud and financial crime are important issues of our time. According to the United Nations Office on Drugs and Crime, an estimated 2–5% of the world GDP is laundered each year. To keep pace with evolving trends, the European Union has decided to strengthen its anti money laundering and terrorist financing regulatory framework and expects the same from financial institutions and supervisory authorities.
Given a pseudonymized dataset of financial transactions, can we use machine learning to detect interesting, perhaps novel, patterns that should be inspected manually? In this paper, we try to answer this question.

2 RELATED WORK
Both supervised [7, 6, 12] and unsupervised or self-supervised [2, 14] learning approaches have been proposed to deal with the task of detecting money laundering. Due to the lack of labelled data and the closed nature of financial data and, therefore, the lack of standardised datasets, approach evaluation can be difficult. Despite that, cryptocurrency datasets such as [13] have been published, explored, and labelled to some extent. Usually, synthetic oversampling or other sampling strategies need to be employed in cases where labelled entities are used for evaluation [12, 13].

3 DATA
In this study, we use a snapshot of the transaction data processed through the international payment system Target2-Slovenija [11]. The dataset spans from November 2007 to December 2017, containing around 8 million financial transactions. No live data was used when performing this research - only archived datasets were used.
For some nodes, the data about the sending or receiving party is additionally linked to data from the Slovenian Business Register (ePRS) [1] and the Slovenian Transaction Account Registry (eRTR) [3] in order to provide additional context about each transaction.
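The data representation described next (Section 4) models each transaction as a directed edge between typed accounts. A minimal sketch of that construction, under our own assumptions — the `build_graph` helper, its field names, and the toy records are illustrative, not from the paper:

```python
from collections import defaultdict

# Toy sketch (not the authors' code): group directed transaction edges by the
# (source type, destination type) pair. Node types follow the paper's
# convention: 's' = company, 'p' = natural person, 'o' = other account.
def build_graph(transactions, node_types):
    """transactions: iterable of (src, dst, amount); node_types: {account: type}."""
    edges = defaultdict(list)  # (src_type, dst_type) -> list of attributed edges
    for src, dst, amount in transactions:
        etype = (node_types[src], node_types[dst])
        edges[etype].append((src, dst, {"amount": amount}))
    return edges

txs = [("A", "B", 100.0), ("B", "C", 50.0), ("A", "C", 75.0)]
types = {"A": "s", "B": "p", "C": "o"}
g = build_graph(txs, types)
print(sorted(g))  # edge types present: [('p', 'o'), ('s', 'o'), ('s', 'p')]
```

Grouping edges by type like this mirrors the paper's setup, where a separate traditional model is later trained per edge type.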
Due to the sensitive nature of the data, all personal and confidential data about individuals and legal entities provided to JSI is pseudonymized.

Table 1: The structural features used for the link prediction task.
  feature              level       definition
  degree               node-level  deg(A) = |N(A)|
  PageRank [9]         node-level  PR(A) = (1 - d)/N + d * sum_{J in N_in(A)} PR(J)/|N_out(J)|,  d = 0.85
  Jaccard coefficient  edge-level  J(A, B) = |N(A) ∩ N(B)| / |N(A) ∪ N(B)|
  Adamic-Adar index    edge-level  A(x, y) = sum_{u in N(x) ∩ N(y)} 1/log|N(u)|
N(·) represents the set of neighbours of the given node. N_in and N_out represent the sets of nodes from which there is an edge to the given node (in), or to which there is an edge from the given node (out). |·| represents the cardinality of the given set.

Figure 2: Degree distribution by node type.

4 DATA REPRESENTATION AS A HETEROGENEOUS GRAPH
There are large differences in the availability of data across different entities performing the transactions. In order to fully utilize all available features, we model the network as a heterogeneous temporal graph. Here, we treat the snapshot of the transaction graph from t0 to t1, G = G(t0, t1), as a heterogeneous graph consisting of 3 discrete node types representing each entity's legal status. The types of accounts are those belonging to companies (node type s), natural persons (node type p), and all other accounts (node type o). Each transaction is represented as a directed edge from its source account to its destination account.

4.1 Network statistics
Due to different legislative bases for different types of entities, inherent differences regarding data availability are expected. Naturally, it is also expected that different categories usually act differently in a network - for example, companies usually transact more than individuals. While the degree distribution (Figure 2) closely resembles the power law, significant differences in distributions between different node types can be observed, which can be attributed to varying amounts of data available for our specific data source across account profiles. It can be seen from Figure 2 that companies (node type s) perform most of the transactions.

5 ANOMALY DETECTION PROBLEM DEFINITION
We corrupt the original graph by rewiring a total of p = 1% randomly picked edges of each edge type. Let f : V × V → [0, 1] be a binary link prediction classifier that is trained to predict the probability that a directed edge between the two given nodes exists. We define the anomaly score of edge (i, j) ∈ E as

  φ(i, j) = 1 − f(i, j)    (1)

The intuition behind equation (1) is that links that are typical to the model will have a smaller anomaly score than links for which the model predicts they would not exist (and are, thus, anomalous).

6 RESULTS
We train several models for the downstream task of link prediction and then use the predictions for anomaly detection.

6.1 Experiment details
The traditional (non-GNN) machine learning approaches are trained to predict whether the given edge exists or not. For each edge, the feature vector fed into the model is constructed by concatenating source node features, destination node features, and edge features. For traditional models, a model for each edge type is constructed separately, while the graph neural network-based models are the same across all edge types.
The GNN (graph neural network) models are constructed of 2 layers of GraphSAGE aggregations [8, 5] using parametric ReLU activations and embedding dimensions of 128 for the first and 64 for the second layer. As messages are passed in the direction of edges, we construct another model to facilitate information diffusion both ways. We do this by adding edges of opposite directionality to existing edges and marking them as a separate edge type.
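The edge-level structural features of Table 1 are straightforward to compute from a neighbour map. A small illustrative sketch (our own code, not the authors'), using an undirected toy graph:

```python
import math

# Illustrative sketch (not the authors' code) of two edge-level features from
# Table 1, computed over a dict mapping each node to its set of neighbours.
def jaccard(nbrs, a, b):
    # |N(a) ∩ N(b)| / |N(a) ∪ N(b)|
    na, nb = nbrs[a], nbrs[b]
    return len(na & nb) / len(na | nb)

def adamic_adar(nbrs, a, b):
    # Sum of 1/log|N(u)| over common neighbours u (each |N(u)| must exceed 1).
    return sum(1.0 / math.log(len(nbrs[u])) for u in nbrs[a] & nbrs[b])

nbrs = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
}
print(jaccard(nbrs, "A", "C"))  # common {B, D}, union {A, B, C, D} -> 0.5
```

The node-level degree is simply `len(nbrs[v])`; PageRank additionally needs the in/out neighbour split and an iterative (or linear-algebra) solver, which libraries such as NetworkX provide out of the box.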
We still, however, only train for the downstream link prediction objective on the existing (non-transposed) edges. We mark this approach as GNN+.
The traditional ML models used are gradient boosting (GradBoost), decision tree (DecTree), multi-layer perceptron (MLP) and logistic regression (LogReg). The hidden layer sizes of the MLP are 20 and 10, using ReLU activation in all layers except the last one, where softmax activation is used. Different combinations of reasonable hidden layer sizes were tested (32+16, 64+32, 256+128, 128+128, 20+10) and the best one was selected. The training of MLP models was performed with a batch size of 200.

4.2 Feature generation
Categorical features are one-hot encoded. Rare categories with < 2% incidence are marked as other. Additionally, node features encoding the role of a node in the network (Table 1) are generated. The node-level features for each node are computed on the whole network as well as on the subgraph induced by the node's own type.

6.2 Link prediction
Traditional ML models for link prediction map concatenated source and destination node features and edge features to the probability that a link between such nodes exists. The models are implemented using scikit-learn [10] and are trained and evaluated using 5-fold cross-validation.
As a preprocessing step, each feature is scaled individually using a standard scaler such that it has a mean of 0 and a standard deviation of 1 across the training set.
When training and evaluating each model, an approximately equal number of positive and negative links is given to the classifier. The provided edge features, such as transaction amount, are sampled randomly for negative edges.
Additionally, we train a 2-layer graph neural network (GNN) for link prediction. The GNN model is trained jointly for all edge types using weighted binary cross-entropy loss. The model has ReLU activations in all layers except the last one, where it has softmax activation. The hidden layer sizes are 64 and 32. The graph neural network is implemented using PyTorch Geometric [4].
We use a random link split for link prediction and not a temporal one, as our end goal is not to predict future links, but rather to learn what kinds of transactions are typical in the given network.
Table 2 shows the aggregated link prediction results. Bold results highlight the best performance across observed methods.

  edge  non-GNN      no str. f.   GNN          GNN+
  ss    0.19 ± 0.02  0.16 ± 0.02  0.01 ± 0.00  0.01 ± 0.00
  oo    0.11 ± 0.02  0.02 ± 0.01  0.05 ± 0.02  0.03 ± 0.02
  so    0.11 ± 0.02  0.06 ± 0.01  0.01 ± 0.01  0.01 ± 0.01
  os    0.14 ± 0.02  0.06 ± 0.01  0.01 ± 0.00  0.01 ± 0.01
  sp    0.08 ± 0.04  0.02 ± 0.02  0.02 ± 0.01  0.02 ± 0.02
  ps    0.05 ± 0.02  0.05 ± 0.02  0.01 ± 0.01  0.01 ± 0.01
  po    0.07 ± 0.04  0.07 ± 0.05  0.02 ± 0.02  0.01 ± 0.02
  op    0.18 ± 0.04  0.02 ± 0.01  0.02 ± 0.01  0.03 ± 0.02
Table 3: Anomaly detection performance comparison in F1 score (mean ± standard deviation). Best non-GNN score, as well as best non-GNN score without using any structural features, are reported next to the GNN results. Bold results highlight the best performance across observed methods.

To summarize precision and recall in a single metric, the F1 score is used:

  F1 = 2 / (precision^(-1) + recall^(-1))    (2)

A naive classifier that assigns the same positive score (recall 1) to each edge has an F1 score of ≈ 0.02. However, the underrepresented edge types typically have higher variance in F1 score and performance insignificantly different from the naive baseline, as seen from Table 3. The same goes for the GNN-based models. See Appendix A for more detailed non-GNN model results.

7 DISCUSSION AND FUTURE WORK
We have constructed and evaluated a self-supervised approach to anomaly detection in financial networks.
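The evaluation pipeline of Sections 5 and 6.3 — score each edge with φ(i, j) = 1 − f(i, j), flag the highest-scoring fraction, and summarize precision and recall with the F1 score of equation (2) — can be sketched as follows. This is a toy illustration with hypothetical classifier outputs, not the authors' pipeline:

```python
# Sketch (not the authors' implementation): anomaly score phi = 1 - f(i, j),
# flag the top fraction of edges, and compute F1 against the rewired edges.
def f1_score(precision, recall):
    # Harmonic mean of precision and recall, as in equation (2).
    return 2.0 / (1.0 / precision + 1.0 / recall)

def flag_top(scores, frac=0.02):
    """scores: {edge: anomaly score}; returns the top `frac` edges as a set."""
    k = max(1, int(len(scores) * frac))
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

# Toy link-prediction outputs f(i, j); ("c", "d") looks least plausible.
f_out = {("a", "b"): 0.99, ("b", "c"): 0.95, ("c", "d"): 0.10, ("d", "a"): 0.90}
phi = {e: 1.0 - p for e, p in f_out.items()}   # anomaly scores
flagged = flag_top(phi, frac=0.25)             # here: top 1 of 4 edges
corrupted = {("c", "d")}                       # ground-truth rewired edge
tp = len(flagged & corrupted)
precision, recall = tp / len(flagged), tp / len(corrupted)
print(f1_score(precision, recall))
```

In the paper's setting, `frac` would be 0.02 (the top 2% of edges flagged) and `corrupted` the 1% of rewired edges.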
Due to the lack of labelled data, this is in most cases the most straightforward approach to tackling the problem with machine learning. There are significant differences in performance across different edge types. Using this approach yields almost comparable results with both raw features and structural features when evaluated on company-to-company transactions only. This may be explained by companies in our dataset having the most insightful features of all node types, such as the broader sector and also a more precise company industry type classification.
The GNN does slightly improve link prediction performance in some cases. See Appendix A for more detailed non-GNN method results. The data here is computed across multiple year-long time windows.

  edge  non-GNN      no str. f.   GNN          GNN+
  ss    0.92 ± 0.01  0.89 ± 0.01  0.92 ± 0.02  0.94 ± 0.01
  oo    0.80 ± 0.02  0.57 ± 0.01  0.79 ± 0.02  0.53 ± 0.04
  so    0.83 ± 0.01  0.75 ± 0.01  0.88 ± 0.02  0.74 ± 0.04
  os    0.76 ± 0.01  0.64 ± 0.01  0.81 ± 0.01  0.83 ± 0.02
  sp    0.85 ± 0.02  0.69 ± 0.03  0.78 ± 0.05  0.73 ± 0.02
  ps    0.74 ± 0.02  0.67 ± 0.01  0.87 ± 0.02  0.75 ± 0.04
  po    0.78 ± 0.02  0.66 ± 0.01  0.84 ± 0.04  0.54 ± 0.08
  op    0.89 ± 0.01  0.53 ± 0.01  0.78 ± 0.05  0.50 ± 0.05
  all   0.84 ± 0.01  0.72 ± 0.01  0.86 ± 0.02  0.89 ± 0.01
Table 2: Link prediction performance comparison measured in area under the receiver operating characteristic curve (AUC) (mean ± standard deviation). Edge types are marked with two letters, representing the source and destination node type in this order. Best non-GNN score, as well as best non-GNN score without using any structural features, are reported next to the GNN results.

6.3 Anomaly detection
For comparison between different methods, the 2% of edges with the highest anomaly scores are flagged as positive. Precision and recall are calculated by using the corrupted 1% of edges as true positives. To summarize precision and recall in a single metric, the F1 score (2) is calculated and reported.

This paper has mainly focused on the use of unsupervised learning for anomaly detection. In the future, we plan to extend our work to supervised and semi-supervised learning approaches to try to utilize the few labelled data points. The following machine learning strategies (or a combination of them) could be tested:
• Active learning. A human-assisted active learning approach is a natural way to incorporate domain knowledge into the decision-making process.
• Synthetic oversampling. Due to the small number of positive examples, we could sample new examples that are similar to them and assign them positive labels.
• Model pretraining and few-shot learning. Update model parameters with a self-supervised pretraining strategy first, and then optimize further on the few labeled data points.

ACKNOWLEDGMENTS
The research leading to the results presented in this paper has received funding from the European Union's funded Project INFINITECH under grant agreement no. 856632. The financial transaction data used in the presented research was collected and pseudonymized by the Bank of Slovenia. The Bank of Slovenia collaborates with JSI and the Infinitech project in order to research possible efficient and compliant banking system supervision techniques. We thank Klaudija Jurkošek Seitl for her input on the style of this paper.

REFERENCES
[1] 2022. AJPES - ePRS. (September 2022). https://www.ajpes.si/prs/.
[2] Claudio Alexandre and João Balsa. 2016. Client Profiling for an Anti-Money Laundering System. https://arxiv.org/abs/1510.00878.
[3] 2022. eRTR. (September 2022). https://www.ajpes.si/eRTR/JavniDel/Iskanje.aspx.
[4] Matthias Fey and Jan E. Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
[5] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. arXiv:1706.02216. http://arxiv.org/abs/1706.02216.
[6] Mikel Joaristi, Edoardo Serra, and Francesca Spezzano. 2019. Detecting suspicious entities in Offshore Leaks networks. Social Network Analysis and Mining 9, 1, 1–15. Springer Vienna. https://doi.org/10.1007/s13278-019-0607-5.
[7] Martin Jullum, Anders Løland, Ragnar Bang Huseby, Geir Ånonsen, and Johannes Lorentzen. 2020. Detecting money laundering transactions with machine learning. Journal of Money Laundering Control 23, 1. doi: 10.1108/JMLC-07-2019-0055. https://www.emerald.com/insight/1368-5201.htm.
[8] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. arXiv:1609.02907. http://arxiv.org/abs/1609.02907.
[9] Larry Page, Sergey Brin, R. Motwani, and T. Winograd. 1998. The PageRank Citation Ranking: Bringing Order to the Web.
[10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
[11] 2022. TARGET2 in TARGET2-Slovenija. (September 2022). https://www.bsi.si/placila-in-infrastruktura/placilni-sistemi/target2-in-target2-slovenija.
[12] Dominik Wagner. 2019. Latent representations of transaction network graphs in continuous vector spaces as features for money laundering detection. Gesellschaft für Informatik.
[13] Mark Weber, Giacomo Domeniconi, Jie Chen, Daniel Karl I. Weidele, Claudio Bellei, Tom Robinson, and Charles E. Leiserson. 2019. Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics. Technical report.
[14] Jiaxuan You, Tianyu Du, Fan-yun Sun, and Jure Leskovec. 2021. Graph Learning in Financial Networks. (September 2021). https://snap.stanford.edu/graphlearning-workshop/slides/stanford_graph_learning_Finance.pdf.

A DETAILED RESULTS
A.1 Link prediction (AUC)
  edge  DecTree      GradBoost    LogReg       MLP
  ss    0.87 ± 0.01  0.90 ± 0.01  0.79 ± 0.01  0.92 ± 0.01
  oo    0.80 ± 0.01  0.80 ± 0.02  0.51 ± 0.01  0.74 ± 0.01
  so    0.82 ± 0.01  0.83 ± 0.01  0.65 ± 0.01  0.82 ± 0.01
  os    0.75 ± 0.01  0.76 ± 0.01  0.58 ± 0.02  0.73 ± 0.01
  sp    0.81 ± 0.02  0.85 ± 0.02  0.55 ± 0.02  0.83 ± 0.02
  ps    0.70 ± 0.02  0.74 ± 0.02  0.54 ± 0.02  0.69 ± 0.01
  po    0.72 ± 0.02  0.78 ± 0.02  0.54 ± 0.02  0.67 ± 0.01
  op    0.85 ± 0.01  0.89 ± 0.01  0.51 ± 0.03  0.87 ± 0.01
  all   0.81 ± 0.01  0.84 ± 0.01  0.66 ± 0.02  0.82 ± 0.01

A.2 Anomaly detection (F1 score)
  edge  DecTree      GradBoost    LogReg       MLP
  ss    0.12 ± 0.01  0.13 ± 0.02  0.04 ± 0.01  0.19 ± 0.02
  oo    0.07 ± 0.01  0.11 ± 0.02  0.01 ± 0.01  0.10 ± 0.02
  so    0.08 ± 0.01  0.10 ± 0.02  0.04 ± 0.01  0.11 ± 0.02
  os    0.06 ± 0.01  0.12 ± 0.02  0.04 ± 0.01  0.14 ± 0.02
  sp    0.06 ± 0.01  0.07 ± 0.04  0.02 ± 0.02  0.08 ± 0.04
  ps    0.04 ± 0.01  0.05 ± 0.02  0.01 ± 0.01  0.05 ± 0.02
  po    0.04 ± 0.01  0.07 ± 0.04  0.02 ± 0.03  0.04 ± 0.03
  op    0.09 ± 0.01  0.14 ± 0.04  0.01 ± 0.01  0.18 ± 0.04

Forecasting Sensor Values in Waste-To-Fuel Plants: a Case Study

Bor Brecelj∗ (bor.brecelj@gmail.com), University of Ljubljana, Faculty of Mathematics and Physics, Ljubljana, Slovenia
Beno Šircelj∗ (beno.sircelj@ijs.si), Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Jože M. Rožanec (joze.rozanec@ijs.si), Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Blaž Fortuna, Qlector d.o.o.
Dunja Mladenić
blaz.fortuna@qlector.com, dunja.mladenic@ijs.si

ABSTRACT
In this research, we develop machine learning models to predict future sensor readings of a waste-to-fuel plant, which would enable proactive control of the plant's operations. We developed models that predict sensor readings for 30 and 60 minutes into the future. The models were trained using historical data, and predictions were made based on sensor readings taken at a specific time. We compare three types of models: (a) a naïve prediction that considers only the last predicted value, (b) neural networks that make predictions based on past sensor data (we consider different time window sizes for making a prediction), and (c) a gradient boosted tree regressor created with a set of features that we developed. We developed and tested our models on a real-world use case at a waste-to-fuel plant in Canada. We found that approach (c) provided the best results, while approach (b) provided mixed results and was not able to outperform the naïve baseline consistently.

CCS CONCEPTS
• Computing methodologies → Machine learning; • Applied computing;

KEYWORDS
Smart Manufacturing, Machine Learning, Feature Engineering

ACM Reference Format:
Bor Brecelj, Beno Šircelj, Jože M. Rožanec, Blaž Fortuna, and Dunja Mladenić. 2022. Forecasting Sensor Values in Waste-To-Fuel Plants: a Case Study. In Ljubljana '22: Slovenian KDD Conference on Data Mining and Data Warehouses, October, 2022, Ljubljana, Slovenia. ACM, New York, NY, USA, 4 pages.

1 INTRODUCTION
There is a wide range of applications of ML (machine learning). One of them is the modeling and control of chemical processes, such as the production of biodiesel. Introducing machine learning to such processes can improve quality and yield and help engineers predict anomalies to control the factory better.

We modeled the JEMS waste-to-fuel plant, which produces high-quality diesel from organic waste. The plant has numerous sensors that measure temperature and pressure, among other variables. It is operated by experts who must control the process. Since the chemical process is complex and, therefore, difficult to control, we built forecasting models that can predict future sensor readings based on historical data and the current state of the plant.

The model will be used to give plant operators additional information about the future state of the plant, which will allow them to make an informed decision about changing the plant's parameters and, therefore, adjust the process before it is too late.

2 RELATED WORK
The use of organic waste in energy conversion technologies is an active area of research aimed at reducing dependence on fossil fuels, optimizing production costs, improving waste management, and controlling emissions. Biochemical, physiochemical, and thermochemical processes produce different biofuels, such as bio-methanation, bio-hydrogen, biodiesel, ethanol, syngas, and coal-like fuels, which are studied by Stephen et al. [8]. Work is also being done on optimization, such as catalyst selection, reactor design, pyrolysis temperature, and other important factors [5].

Many ML methods have been developed to address waste management and proper processing for biofuel production, focusing on energy demand and supply prediction [3]. Aghbashlo et al. [2] provided a systematic review of various applications of ML technology with a focus on ANN (Artificial Neural Network) in biodiesel research. They provided an overview of the use of ML in modeling, optimization, monitoring, and process control. Models that predict the conditions of the biofuel production process that have the highest yield were created by Kusumo et al. [6] and Abdelbasset et al. [1]. The models used in these studies were kernel-based extreme learning machines, ANN, and various ensemble models.
∗Both authors contributed equally to this research.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SiKDD '22, October, 2022, Ljubljana, Slovenia
© 2022 Copyright held by the owner/author(s).

3 USE CASE
The JEMS waste-to-fuel plant produces synthetic diesel (SynDi) from any hydrocarbon-based waste, such as wood, biomass, paper, waste fuels and oils, plastics, textiles, rubber, and agricultural residues. The plant uses a chemical-catalytic de-polymerization process, the advantage of which is that the temperature is too low to produce carcinogenic gasses. It operates continuously and produces about 150 liters of fuel per hour. Although it uses the latest software available and allows remote control, there is no anomaly detection, prediction, or optimization. As a result, there is a great need for better understanding, optimization, and decision-making, given data availability. The company plans to sell and install over 1,500 SynDi systems over the next ten years. In practice, this means many SynDi plants in different locations worldwide.

There are three main chambers in the pipeline, which are named B100, B200, and B300. The plant can be conceptually split into four stages:
(1) Feedstock inspecting and feeding;
(2) Drying and mixing (chamber B100);
(3) Processing (chamber B200);
(4) Distilling (chamber B300).
Since there are no sensors in the feedstock inspecting and feeding stage, we focused on the later stages, each of which takes place in one of the main chambers.

In the drying and mixing stage (B100), the starting material is mixed with process oil, lime, and catalyst and is heated. During mixing, the material is broken down into smaller particles, and the water is evaporated. The primary chemical reaction occurs in the processing stage (B200). The material is fed to a turbine, and the reaction product evaporates through the diesel distillation column. If the diesel obtained is not of sufficient quality, it is redistilled in the second distillation stage (B300).

Currently, the plants are operated with highly skilled personnel and high costs for personnel training. Implementing automation, remote control, optimization, and interconnection among the plants would greatly facilitate their operation. Therefore, the main challenge to be solved by integrating AI is the self-control of the chemical process and the plant itself by minimizing the human resources required to operate the plants. Furthermore, operating many SynDi plants also means a significant challenge for ensuring remote control for troubleshooting, maintenance, and repair. AI integration aims to minimize the workforce required to operate the plants, minimize the resulting downtime due to human interaction, enable self-control and predictive maintenance of the SynDi plants, and achieve less downtime and higher production efficiency.

In modeling the waste-to-fuel processes, we decided to model each chamber separately. No model was developed for chamber B300 because it was not active during the period for which we obtained the data. As described above, a second distillation of the fuel is performed in chamber B300 only if the fuel in chamber B200 is not pure enough.

4 METHODOLOGY
4.1 Data analysis
The sensor measurements are from the experimental JEMS plant, which is located in Canada. The data consists of 154 sensors from January 2016 to January 2017. The measurements are taken at one-minute intervals and mostly measure temperature or pressure, but there are also sensors for motor current and valve position, among others. Since the data is from the prototype version of the waste-to-fuel plant, it contains many missing values. Our data set contained an average of 61,607 data points per sensor. We discarded all sensors with fewer than 6,000 data points and kept only those that corresponded to chambers B100 and B200, giving us data from 39 sensors.

Analysis of the dataset we received revealed that many values were missing. In particular, we noted that there were day-long intervals with a tiny number of measurements. We also noticed that specific sensor values remained constant at low temperatures, a condition best described by the waste-to-fuel plant's inactivity. We, therefore, decided to remove such values. Because there were many ten-minute gaps, we decided to resample the data at fifteen-minute intervals, taking the last value of each interval and assuming that conditions had not changed in the short time since the last measurement, a reasonable assumption for sensor values. The resulting data set contained an average of 7,884 data points per sensor.

We divided the dataset into a train and a test dataset, split on October 31st 2016. The resulting train set included a total of 11,000 samples, and the test set included 3,000 samples.

4.2 Model training
In this research, we compare models that we develop using two different approaches. We first tried the neural network approach, in which the model makes predictions based only on sensor readings from the last five hours. Since the model did not perform better than the baseline, we began the second approach, developing features to describe the time series and capture its patterns. We used linear regression and a gradient-boosted tree regressor. All the developed models were compared with the last-value model, which we used as a benchmark.

4.2.1 Neural network approach. We used the model developed for forecasting Tüpras' sensor values. Tüpras is an oil refinery, which is very similar to the JEMS use case. The model was used to forecast sensor values in different units of LPG production. Some of Tüpras' units are distillation columns, similar to JEMS' chamber B200. The model takes only past sensor values as input and predicts values for the future together with the prediction interval. More specifically, it predicts the 10th, 50th and 90th percentile, which is the case in all our models that give a prediction interval.

Figure 1: Architecture of the neural network model, which gives the prediction interval.

Figure 1 shows the architecture of the neural network. The model is a feedforward neural network with two layers. First, there is a linear layer with ReLU activation. The second layer has a separate linear layer for each quantile. The hidden dimension of the model is calculated from the number of features and the number of targets using the formula ⌊n_features / 2⌋ + n_targets.

During training, we used the quantile loss function, which is defined as
max{q · (y_true − y_pred), (1 − q) · (y_pred − y_true)},
where q is the observed quantile (in our case, it can be 0.1, 0.5 or 0.9), y_true is the true target value and y_pred is the corresponding quantile of the prediction. In the case of q = 0.5, the loss is equal to the mean absolute error divided by two. When calculating the loss of the 10th percentile (q = 0.1), a prediction that is greater than the true value is heavily penalized, while a prediction that is lower than the true value has a smaller loss and is therefore encouraged.

The model is implemented in the PyTorch library [7]. Since sensors measure different quantities, the values have to be scaled before learning. Here we used the Min-Max scaler from the scikit-learn library, scaling all values between zero and one.

4.2.2 Feature engineering. The neural network model described above did not outperform the benchmark model. As a result, we decided to try another approach, where we developed features that better describe past sensor values and capture their patterns. One of the problems of the neural network model was that it had too many features. We decided to build a separate model for each sensor to tackle this problem. Each model uses only features calculated from the values of the sensor being predicted.

With the help of plant operators, we decided to consider at most five hours of data before the prediction point to issue a forecast. Since the latest data is usually more important in determining future sensor values, we created features on seven different time windows: 30, 45, 75, 120, 180, 240, and 300 minutes. For each time window, we computed the following features:
• average sensor value,
• fraction of peaks in the window,
• percentage change between the first and last value in the time window,
• slope (coefficient of the least squares line through the points in the window),
• simple prediction (extension of the least squares line to the future),
• slope ratio (slope on the smaller window divided by the slope on the bigger window).
Besides the features mentioned above, which depend on the window size, we also included features that were calculated only on the biggest time window (300 minutes):
• last value,
• maximal value,
• last value relative to the maximal value.
The features above attempt to capture different time series characteristics:
• trend: described by percentage change and slope;
• growth pattern: described by the fraction of peaks, which indicates whether the growth is steady or has ups-and-downs. Furthermore, the slope indicates how aggressive such growth is;
• expected value: an approximation of the expected value is given through the average, last value, maximal value, and simple prediction.

Using the developed features, we trained a linear regression model and a gradient boosted tree regressor from the CatBoost library [4]. We used root mean squared error (RMSE) for the loss function.

5 RESULTS AND ANALYSIS
We built models for the main chambers B100 and B200 with two forecasting horizons (30 and 60 minutes). Tables 1 and 2 show mean squared error (MSE) and mean absolute error (MAE) on chambers B100 and B200, respectively. There are three different neural network models (NN), which differ in the size of the window from which they get the data.

Table 1: MSE and MAE on the test set of models when predicting for chamber B100.

                    horizon = 30 min      horizon = 60 min
                    MSE       MAE         MSE       MAE
last-value model    21.0533   1.4320      50.6636   2.5128
NN, window = 5h     21.7525   1.6512      47.0545   2.5413
NN, window = 3h     19.7441   1.6109      45.3450   2.4127
NN, window = 2h     18.9717   1.6023      46.5047   2.5357
Linear regression   19.4264   1.4634      49.2268   2.5145
Catboost            16.9030   1.4478      38.3066   2.3164

Table 2: MSE and MAE on the test set of models when predicting for chamber B200.

                    horizon = 30 min      horizon = 60 min
                    MSE       MAE         MSE       MAE
last-value model    52.3380   2.0577      124.9735  3.3768
NN, window = 5h     69.4678   3.8227      129.0330  4.9927
NN, window = 3h     57.9902   3.3601      121.1315  4.7431
NN, window = 2h     55.8769   3.1797      117.4154  4.7146
Linear regression   55.0218   3.2293      115.7457  4.5888
Catboost            49.3329   2.5305      109.5303  3.9745

From Tables 1 and 2 we can see that the five-hour window's neural network performed worse than the benchmark. The main reason for such poor results was too many features for the amount of data that we have. More precisely, the neural network model uses the values of all sensors in the chamber we are predicting. This means that there are six hundred features, resulting in more than two hundred thousand trainable parameters for the model of chamber B200. We also have to consider that the neural network predicts future sensor values and prediction intervals. Therefore, there are too many features and target values for the amount of data that we have.

We included results of two more neural network models with three-hour and two-hour windows, since a reduced window size results in a smaller number of features and trainable parameters. For example, the neural network model with a two-hour time window for chamber B200 had two hundred and forty features and almost fifty thousand trainable parameters. Neural network models with smaller window sizes performed better, which confirms that we had too many features.

The features that we developed using the second approach were used with two models, linear regression and the Catboost model. Comparing those two models, the Catboost model performed better because it can capture more than just linear relationships between the features and the target. The Catboost model also outperformed the neural networks, where one of the main differences is that the neural network uses all sensors from the chamber while the Catboost model uses only values of the sensor which is being predicted. This results in forty-five features for the model that predicts one sensor, which solves the problem of too many features. In addition, the Catboost model produced better results than the benchmark when comparing the mean squared error (MSE). During the training, we used RMSE as a loss function, meaning that RMSE was minimized and, therefore, also MSE.

The tables show that although most models outperform the benchmark regarding MSE, almost all of them do not surpass the benchmark when considering MAE. When measuring MSE, predictions with strong spikes where such spikes do not take place are penalized more. Therefore, models with a competitive MSE can be assumed to rarely predict spikes where none take place. This is a key feature for our use case, given that we are interested in understanding whether an irregularity will take place or not. Therefore, the models give valuable information even though the average prediction is not entirely accurate. However, there is no problem with models not being able to predict significant changes resulting from a manual change in plant setpoint parameters, which our data does not capture. Overall, we consider the best model to be the Catboost model, given that in all cases it outperformed the rest of the models when considering MSE, and it also achieved the best MAE when predicting chamber B100 with a time horizon of 60 minutes.

6 CONCLUSION
We compared a set of models to predict sensor values for a waste-to-fuel plant: a neural network, linear regression, a gradient-boosted tree regressor, and the last-value model. The last-value model was used as a benchmark. We developed three neural network models which differed in time window size. The neural network models were built based on the hypothesis that a simple neural network and raw sensor readings as features are enough to model the process. The results showed that this is not the case because the process is too complicated for the amount of data that we obtained. Lastly, we used feature engineering to develop features that better describe the time series. The features were used for learning linear regression and the gradient boosted tree regressor, where the latter produced the best results in our case.

ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency and the European Union's Horizon 2020 program project FACTLOG under grant agreement number H2020-869951.
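The quantile loss defined in Section 4.2.1 can be written directly as a pinball loss; a minimal NumPy sketch (the arrays are illustrative, not plant data), which also checks the identity noted in the text that for q = 0.5 the loss equals half the mean absolute error:

```python
import numpy as np

def quantile_loss(y_true, y_pred, q):
    """Pinball loss: mean of max{q*(y_true - y_pred), (1 - q)*(y_pred - y_true)}."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y_true = np.array([10.0, 12.0, 11.0])
y_pred = np.array([11.0, 11.0, 11.0])

# For q = 0.5 the pinball loss equals half the mean absolute error.
mae = np.mean(np.abs(y_true - y_pred))
assert np.isclose(quantile_loss(y_true, y_pred, 0.5), mae / 2)

# For q = 0.1, over-prediction (y_pred > y_true) costs 0.9 per unit of error,
# while under-prediction costs only 0.1, matching the asymmetry described above.
print(quantile_loss(y_true, y_pred, 0.1))
```

Training the three quantile heads with this loss at q = 0.1, 0.5, and 0.9 yields the prediction interval together with the median forecast.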
Figure 2: True value and prediction of the Catboost model for a temperature sensor in chamber B100.

Figure 3: True value and prediction with a confidence interval of the neural network model with a two-hour window for a temperature sensor in chamber B100.

Figure 2 shows the Catboost model prediction on the test set together with the true values of the temperature sensor in chamber B100. The neural network model's prediction of the same sensor is presented in Figure 3. Since the neural network model also outputs a prediction interval, it is shown in the abovementioned figure. From the plots, we can see that both models can closely predict future sensor values. In the case of the neural network model, the actual value is mainly inside the predicted confidence interval, except when there is a significant change in the sensor value.

REFERENCES
[1] Walid Kamal Abdelbasset, Safaa M Elkholi, Maria Jade Catalan Opulencia, Tazeddinova Diana, Chia-Hung Su, May Alashwal, Mohammed Zwawi, Mohammed Algarni, Anas Abdelrahman, and Hoang Chinh Nguyen. 2022. Development of multiple machine-learning computational techniques for optimization of heterogenous catalytic biodiesel production from waste vegetable oil. Arabian Journal of Chemistry 15, 6 (2022), 103843.
[2] Mortaza Aghbashlo, Wanxi Peng, Meisam Tabatabaei, Soteris A Kalogirou, Salman Soltanian, Homa Hosseinzadeh-Bandbafha, Omid Mahian, and Su Shiung Lam. 2021. Machine learning technology in biodiesel research: A review. Progress in Energy and Combustion Science 85 (2021), 100904.
[3] Hemal Chowdhury, Tamal Chowdhury, Pranta Barua, Salman Rahman, Nazia Hossain, and Anish Khan. 2021. Biofuel production from food waste biomass and application of machine learning for process management. 96–117. https://doi.org/10.1016/B978-0-12-823139-5.00004-6
[4] Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin. 2018. CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363 (2018).
[5] Bidhya Kunwar, HN Cheng, Sriram R Chandrashekaran, and Brajendra K Sharma. 2016. Plastics to fuel: a review. Renewable and Sustainable Energy Reviews 54 (2016), 421–428.
[6] F Kusumo, AS Silitonga, HH Masjuki, Hwai Chyuan Ong, J Siswantoro, and TMI Mahlia. 2017. Optimization of transesterification process for Ceiba pentandra oil: A comparative study between kernel-based extreme learning machine and artificial neural networks. Energy 134 (2017), 24–34.
[7] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[8] Jilu Lizy Stephen and Balasubramanian Periyasamy. 2018. Innovative developments in biofuels production from organic waste materials: a review. Fuel 214 (2018), 623–633.

Machine Beats Machine: Machine Learning Models to Defend Against Adversarial Attacks

Jože M. Rožanec∗ (Jožef Stefan International Postgraduate School, Ljubljana, Slovenia), joze.rozanec@ijs.si
Dimitrios Papamartzivanos (Ubitech Ltd, Chalandri, Athens, Greece), dpapamartz@ubitech.eu
Entso Veliou (Department of Informatics and Computer Engineering, University of West Attica, Athens, Greece), eveliou@uniwa.gr
Theodora Anastasiou (Ubitech Ltd)
Jelle Keizer (Philips Consumer Lifestyle BV)
Blaž Fortuna (Qlector d.o.o.)
Chalandri, Athens, Greece, tanastasiou@ubitech.eu
Drachten, The Netherlands, jelle.keizer@philips.com
Ljubljana, Slovenia, blaz.fortuna@qlector.com
Dunja Mladenić (Jožef Stefan Institute, Ljubljana, Slovenia), dunja.mladenic@ijs.si

ABSTRACT
We propose using a two-layered deployment of machine learning models to prevent adversarial attacks. The first layer determines whether the data was tampered with, while the second layer solves a domain-specific problem. We explore three sets of features and three dataset variations to train machine learning models. Our results show clustering algorithms achieved promising results. In particular, we consider the best results were obtained by applying the DBSCAN algorithm to the structural similarity index measure computed between the images and a white reference image.

CCS CONCEPTS
• Information systems → Data mining; • Computing methodologies → Computer vision problems; • Applied computing;

1 INTRODUCTION
Artificial Intelligence (AI) solutions have penetrated the Industry 4.0 domain by revolutionizing rigid production lines, enabling innovative functionalities like mass customization, predictive maintenance, zero defect manufacturing, and digital twins. However, AI-fuelled manufacturing floors involve many interactions between AI systems and other legacy Information and Communications Technology (ICT) systems, generating a new territory for malevolent actors to conquer. Hence, the threat landscape of Industry 4.0 is expanded unpredictably if we also consider the emergence of adversary tactics and techniques against AI systems and the constantly increasing number of reports of Machine Learning (ML) systems abuses based on real-world observations. In this context, Adversarial Machine Learning (AML) has become a significant concern in adopting AI technologies for critical applications, and it has already been identified as a barrier in multiple application domains.
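The two-layered deployment described in the abstract can be sketched as a simple gate in front of the task model; the detector, classifier, and threshold below are illustrative stand-ins, not the models trained in this paper:

```python
import numpy as np

# Hypothetical two-layer deployment: layer 1 flags suspected adversarial
# inputs, and only inputs that pass the gate reach the layer-2 task model.

def two_layer_predict(images, tamper_detector, task_model, reject_label=-1):
    """Return task predictions for clean images and reject_label for flagged ones."""
    flags = tamper_detector(images)            # True -> suspected adversarial input
    out = np.full(len(images), reject_label)
    clean = ~flags
    if clean.any():
        out[clean] = task_model(images[clean])
    return out

# Toy stand-ins: flag images whose mean intensity deviates strongly from an
# assumed reference level of 0.5 (a real detector would be a trained model).
detector = lambda xs: np.abs(xs.mean(axis=(1, 2)) - 0.5) > 0.3
classifier = lambda xs: (xs.mean(axis=(1, 2)) > 0.5).astype(int)

batch = np.stack([np.full((4, 4), 0.4), np.full((4, 4), 0.95)])
# The first image passes the gate and is classified; the second is rejected.
print(two_layer_predict(batch, detector, classifier))
```

The design keeps the domain model untouched: hardening against adversarial inputs is isolated in the first layer, which can be retrained or swapped independently.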
KEYWORDS
Cybersecurity, Adversarial Attacks, Machine Learning, Automated Visual Inspection

ACM Reference Format:
Jože M. Rožanec, Dimitrios Papamartzivanos, Entso Veliou, Theodora Anastasiou, Jelle Keizer, Blaž Fortuna, and Dunja Mladenić. 2021. Machine Beats Machine: Machine Learning Models to Defend Against Adversarial Attacks. In Ljubljana '22: Slovenian KDD Conference on Data Mining and Data Warehouses, October, 2022, Ljubljana, Slovenia. ACM, New York, NY, USA, 4 pages.

AML is a class of data manipulation techniques that cause changes in the behavior of AI algorithms while usually going unnoticed by humans. Misclassification of suspicious objects in airport control systems [7], abuse of autonomous vehicle navigation systems [11], tricking of healthcare image analysis systems into classifying a benign tumor as malignant [15], and abnormal robotic navigation control [23] are only a few examples of AI model compromise that advocate the need for the investigation and development of robust defense solutions.

Recently, the evident challenges posed by AML have attracted the attention of the research community, Industry 4.0, and the manufacturing domains [20], as possible security issues on AI systems can pose a threat to system reliability, productivity, and safety [2]. In this reality, defenders should not be just passive spectators, as there is a pressing need for robustifying AI systems to hold against the perils of adversarial attacks. New methods are needed to safeguard AI systems and sanitize the ML data pipelines
from the potential injection of adversarial data samples due to poisoning and evasion attacks.

We developed a machine learning model to address the abovementioned challenges, detecting whether the incoming images are adversarially altered. This enables a two-layered deployment of machine learning models that can be used to prevent adversarial attacks (see Fig. 1): (a) a first layer with models determining whether the data was tampered with, and (b) a second layer that operates with regular machine learning models developed to solve particular domain-specific problems. We demonstrate our approach in a real-world use case from Philips Consumer Lifestyle BV. This paper explores a diverse set of features and machine learning models to detect whether the images have been tampered with for malicious purposes.

Figure 1: Two-layered deployment of machine learning models can be used to prevent adversarial attacks.

This paper is organized as follows. Section 2 outlines the current state of the art and related works, Section 3 describes the use case, and Section 4 provides a detailed description of the methodology and experiments. Finally, Section 5 outlines the results obtained, while Section 6 concludes and describes future work.

2 RELATED WORK
AML attacks are considered a severe threat to AI systems and, as a result, the research community seeks new robust defensive methods. Image classifiers, those analyzed in this work, are the focal point of the vast majority of the AML literature, as they have been proved prone to noise perturbations. According to the literature, prominent solutions focus on denoising the image classifiers, training the target model with adversarial examples (known as adversarial training), or applying standalone defense algorithms.

Yan et al. [21] proposed a new adversarial attack called the Observation-based Zero-mean Attack, and they evaluated the robustness of various deep image denoisers. They followed an adversarial training strategy and effectively removed various synthetic and adversarial noises from data. In [17], pre-processing data defenses for image denoising are evaluated, highlighting the advantages of such approaches that do not require the retraining of the classifiers, which is a computationally intense task in computer vision.

However, the robustness of adversarial training via data augmentation and distillation is advocated by the majority of the works in the domain. Specifically, Bortsova et al. [3] have focused on adversarial black-box settings, assuming that the attacker does not have full access to the target model as a more realistic scenario. They tuned their testbed to ensure minimal visual perceptibility of the attacks. The applied adversarial training dramatically decreased the performance of the designed attack. Hashemi and Mozaffari [8] trained CNNs with perturbed samples manipulated by various transformations and contaminated by different noises to foster robustness using adversarial training.

On top of the above, several standalone solutions have been proposed. The CARAMEL system in [13] offered a set of detection techniques to combat security risks in automotive systems with embedded camera sensors. Hybrid approaches and more general alternatives intrinsically improve the robustness of AI models. A defensive distillation mechanism against evasion attacks is proposed in [16], being able to reduce the effectiveness of adversarial sample creation from 95% to less than 0.5% on a studied DNN. Subset Scanning was presented in [19] to give DNNs the ability to recognize out-of-distribution samples.

3 USE CASE
The Philips factory in Drachten, the Netherlands, is an advanced factory for mass manufacturing consumer goods (e.g., shavers, OneBlade, baby bottles, and soothers). Current production lines are often tailored for the mass production of one product or product series in the most efficient way. However, the manufacturing landscape is changing: due to global shortages, manufacturing assets and components are becoming scarcer [1], and a shift in market demand requires the production of smaller batches more often. To adhere to these changes, production flexibility, re-use of assets, and a reduction of reconfiguration times are becoming more critical for the cost-efficient production of consumer goods. One of the topics currently investigated within Philips is quickly setting up automated quality inspections to make reconfiguring quality control systems faster and easier. Next to working on the technical challenges of doing this, safety and cyber-security topics are explored, aiming to implement AI-enabled automated quality systems with state-of-the-art defenses, the latter of which is the focus point discussed in this paper.

The dataset used contains images of the decorative part of a Philips shaver. This product is mass-produced and important for the visual appearance of the shavers. Next to that, the part is very close to or in direct contact with the user's skin, where any deviations in its quality could impact shaver performance or even shaver safety. The dataset contains 1,194 images classified into two classes: (a) attacked with the Projected Gradient Descent attack [5], and (b) not attacked.

4 METHODOLOGY
We framed adversarial attack detection as a classification problem. We experimented with three kinds of features: (a) image embeddings (obtained from the Average Pooling Layer of a pre-trained ResNet-18 model [9]), (b) histograms reflecting grayscale pixel frequencies (with pixel values extending between zero and 255), and (c) the structural similarity index measure (SSIM) computed against a white image. While the embeddings provide information about the image as a whole, we considered that the histograms and the SSIM metric could be useful given the apparent difference between the original and perturbed images. Furthermore, we computed the features across three different datasets (see Fig. 2 for sample images): (a) the original set of images, (b) images cropped considering an image slice extending from top to bottom (coordinates (160, 0, 200, 369); we name this dataset "Cropped (v1)"), and (c) images cropped

Figure 2: Three sets of images: (a) indicates the original image, while (b) indicates the images attacked with the Projected Gradient Descent attack. The subsets I, II, and III indicate (I) the whole image, (II) cropped image (v1 (considering coordi-

well, it would be useful to generalize the approach toward detecting new cyberattacks where no labeled data exists yet. We consider such a characteristic to be fundamental to production environments.

For the models resulting from the three abovementioned datasets, we measured the estimated number of clusters, the estimated number of noise points, homogeneity (whether the clusters contain only samples belonging to a single class), completeness (whether all the data points that are members of a given class are elements of the same cluster), V-measure (the harmonic mean between homogeneity and completeness), the adjusted Rand index (similarity between clusterings obtained by the proposed and random models), and the Silhouette Coefficient (which estimates the separation distance between the resulting clusters). We ran the DBSCAN algorithm measuring the distance between samples with the Euclidean distance, considering the maximum distance between two samples for one to be considered as in the neighborhood of the other to be 0.3.
Furthermore, we consid- nates (160, 0, 200, 369))), and cropped image (v2 - (considering ered that at least ten samples in a neighborhood were required for coordinates (160, 50, 200, 319))). a point to be considered as a core point. 5 RESULTS AND ANALYSIS considering a slice of the central part of the image (coordinates (160, 50, 200, 319) - - we name this dataset set "Cropped (v2)"). By com-Model Catboost KMeans Logistic regression paring the original image dataset against those obtained by slicing Original image 0.0167 1.0000 0.0228 the central part, we sought to understand if the models’ predictive Embeddings Cropped (v1) 0.0014 1.0000 0.0003 Cropped (v2) 0.0181 1.0000 0.0213 power increased by looking at a specific area of the image rather Original image 0.0152 1.0000 0.0184 than the whole. SSIM Cropped (v1) 0.0008 1.0000 0.0004 Cropped (v2) 0.0179 1.0000 0.0195 We first trained three machine learning models: Catboost [18] Original image 0.0016 1.0000 0.0030 with Focal Loss [14] (trained over 150 iterations, and considering a Histograms Cropped (v1) 0.0003 1.0000 0.0011 tree depth of ten, while evaluating the performance during training Cropped (v2) 0.0018 1.0000 0.0031 with the logloss metric), Logistic Regression (the dataset was scaled between zero and one, considering the train set, and transformed to Table 1: Results obtained across classification experiments. ensure zero mean and unit variance), and KMeans (the dataset was We measure models’ performance with Eq. 1. Best results are transformed to ensure zero mean and unit variance, and the model bolded, second-best are italicized. initiated with random initialization and seeking to generate two clusters). We evaluated our experiments with a ten-fold stratified We present the results obtained in our classification experiments cross-validation ([12, 22]), using one fold for testing and the rest in Table 1. We found the KMeans models achieved perfect discrimi-of the folds to train the model. 
Furthermore, to avoid overfitting, nation in all cases, while the second-best model was the Logistic we performed a feature selection using the mutual information regression, which had second-best results in all but two cases. Nev- to evaluate the most relevant ones and select the top K features, √ ertheless, the Logistic regression and the Catboost models achieved with 𝐾 = 𝑁 , considering 𝑁 to be equal to the number of data a low discriminative power, almost unable to distinguish between instances in the train set [10]. Finally, we measured our models’ tampered and non-tampered images. Regarding the features, we performance with a custom metric (𝐷𝑃 ) that summarizes 𝐴𝑈 𝐶 𝑅𝑂𝐶 found that the best average performance was obtained when train- the discriminative power as computed from the area under the ing the models on the Cropped (v2) dataset, followed by those receiver operating characteristic curve (AUC ROC, see [4]) (see trained on the whole images. Eq. 1). The metric ranges from zero (no discriminative power) to When running the DBSCAN algorithm (see results in Table 2), one (perfect discriminative power) and it preserves the AUC ROC we found the best results were obtained considering the SSIM mea- desirable properties of being threshold independent and invariant sure. Furthermore, using the SSIM issued excellent results in all to a priori class probabilities. cases. The best ones were obtained considering the Cropped (v1) dataset, while the second-best was achieved with the Cropped (v2) dataset. Using the SSIM only, the DBSCAN algorithm was able to 𝐷 𝑃 = 2 · |(0.5 − 𝐴𝑈 𝐶𝑅𝑂𝐶)| (1) 𝐴𝑈 𝐶 𝑂𝐶 𝑅 correctly group the instances into two groups and misclassified at most a single instance. However, the performance achieved either Based on the good results obtained in the clustering setting, we with embeddings or histograms was not satisfactory. 
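For concreteness, two evaluation pieces described above, the Eq. (1) metric and the DBSCAN configuration (Euclidean distance, neighborhood radius 0.3, at least ten samples per core point), can be sketched with scikit-learn. This is a minimal illustration only; the toy features below are our own stand-ins, not the shaver-image data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import roc_auc_score, homogeneity_completeness_v_measure

def discriminative_power(y_true, y_score):
    """Eq. (1): DP = 2 * |0.5 - AUC ROC|.

    Ranges from 0 (no discriminative power) to 1 (perfect
    discrimination), and stays threshold independent because it is
    derived from the AUC ROC."""
    return 2.0 * abs(0.5 - roc_auc_score(y_true, y_score))

# DBSCAN as configured in Section 4: Euclidean distance, maximum
# neighborhood distance eps = 0.3, at least ten samples per core point.
clusterer = DBSCAN(eps=0.3, min_samples=10, metric="euclidean")

# Toy stand-in for the image features: two tight, well-separated groups.
features = np.array([[0.0, 0.01 * i] for i in range(15)]
                    + [[5.0, 5.0 + 0.01 * i] for i in range(15)])
labels = clusterer.fit_predict(features)  # finds two clusters, no noise
true_classes = np.array([0] * 15 + [1] * 15)
h, c, v = homogeneity_completeness_v_measure(true_classes, labels)
```

On this toy data the clustering is perfect, so homogeneity, completeness and V-measure are all 1.0; a scorer that separates the classes perfectly gets DP = 1, while a random scorer gets DP ≈ 0.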
When consid- decided to conduct additional experiments, running the DBSCAN ering histogram features, the DBSCAN algorithm was not able to algorithm [6] over all existing data. The advantage of such an algo-discriminate between instances, creating a single cluster. On the rithm is that it can estimate the clusters with no prior information other hand, when considering embeddings, DBSCAN created three regarding the number of expected clusters. Therefore, if working clusters that issued a bad performance, considering most of the 48 SiKDD ’22, October, 2022, Ljubljana, Slovenia Rožanec et al. Embeddings SSIM Histograms Original image Cropped (v1) Cropped (v2) Original image Cropped (v1) Cropped (v2) Original image Cropped (v1) Cropped (v2) Number of clusters 3 1 1 2 2 2 1 1 1 Number of noise points 1010 794 887 1 0 1 621 603 606 Homogeneity 0.1770 0.4550 0.3170 1.0000 1.0000 1.0000 0.8550 0.9290 0.9150 Completeness 0.2090 0.4940 0.3860 0.9910 1.0000 0.9910 0.8560 0.9290 0.9150 V-measure 0.1920 0.4740 0.3480 0.9960 1.0000 0.9960 0.8550 0.9290 0.9150 Adjusted Rand index 0.0710 0.4350 0.2540 0.9980 1.0000 0.9980 0.9020 0.9600 0.9500 Silhouette coefficient 0.0750 0.4310 0.2660 0.8980 0.9590 0.9070 0.8330 0.8970 0.8800 Table 2: Results obtained across clustering experiments. Best ones are bolded, second-best are italicized. points to be noisy. We, therefore, conclude that the only promising Processing (ICIP). IEEE, 1241–1245. results were those obtained considering the SSIM. Nevertheless, we [6] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise.. In consider further research is required to understand whether this kdd, Vol. 96. 226–231. kind of feature can be useful across a wide range of attacks and [7] Dan-Ioan Gota, Adela Puscasiu, Alexandra Fanca, Honoriu Valean, and Liviu in the real-world. SSIM provides metadata describing the images. Miclea. 2020. 
Threat objects detection in airport using machine learning. In 2020 21th International Carpathian Control Conference (ICCC). IEEE, 1–6. Given high-quality attacks aim to reduce the visual footprint on the [8] Atiyeh Hashemi and Saeed Mozaffari. 2021. CNN adversarial attack mitigation images, it remains an open question to which extent can the SSIM using perturbed samples training. Multim. Tools Appl. 80 (2021), 22077–22095. [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual capture weak footprints and therefore enable similar discriminative learning for image recognition. In Proceedings of the IEEE conference on computer capabilities on machine learning models. vision and pattern recognition. 770–778. [10] Jianping Hua, Zixiang Xiong, James Lowey, Edward Suh, and Edward R Dougherty. 2005. Optimal number of features as a function of sample size 6 CONCLUSION for various classification rules. Bioinformatics 21, 8 (2005), 1509–1515. [11] A. Kloukiniotis, A. Papandreou, A. Lalos, P. Kapsalas, D.-V. Nguyen, and K. In this work, we explored multiple sets of features and machine Moustakas. 2022. Countering adversarial attacks on autonomous vehicles using learning models to determine whether an image has been tampered denoising techniques: A Review. IEEE Open Journal of Intelligent Transportation with for the purpose of an adversarial attack. While the difference Systems (2022). Publisher: IEEE. [12] Max Kuhn, Kjell Johnson, et al. 2013. Applied predictive modeling. Vol. 26. between attacked and non-attacked images is evident to the human Springer. eye, it is not to the machine learning algorithms. We found that [13] Christos Kyrkou, Andreas Papachristodoulou, Andreas Kloukiniotis, Andreas the Catboost and Logistic regression models could almost not dis-Papandreou, Aris Lalos, Konstantinos Moustakas, and Theocharis Theocharides. 2020. Towards artificial-intelligence-based cybersecurity for robustifying auto-criminate between both cases. 
On the other hand, the clustering mated driving systems against camera sensor attacks. In 2020 IEEE Computer algorithms (KMeans and DBSCAN) had a stronger performance. Society Annual Symposium on VLSI (ISVLSI). IEEE, 476–481. [14] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. While the KMeans models did so perfectly, regardless of the fea- Focal loss for dense object detection. In Proceedings of the IEEE international tures, the DBSCAN model only performed well using the SSIM. conference on computer vision. 2980–2988. We consider the strength of such a model the fact that no a pri- [15] Xingjun Ma, Yuhao Niu, Lin Gu, Yisen Wang, Yitian Zhao, James Bailey, and Feng Lu. 2021. Understanding adversarial attacks on deep learning based medical ori information regarding the classes is required, therefore saving image analysis systems. Pattern Recognition 110 (2021), 107332. the annotation effort and providing greater flexibility towards fu- [16] Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram ture adversarial attacks. Our future research will focus on testing a Swami. 2015. Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks. CoRR abs/1511.04508 (2015). arXiv:1511.04508 http: wider range of cyberattacks while ensuring the attack will not be //arxiv.org/abs/1511.04508 discernable to the human eye. [17] Marek Pawlicki and Ryszard S. Choraś. 2021. Preprocessing Pipelines including Block-Matching Convolutional Neural Network for Image Denoising to Robustify Deep Reidentification against Evasion Attacks. Entropy 23, 10 (2021), 1304. ACKNOWLEDGMENTS Publisher: MDPI. [18] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Doro-This work was supported by the Slovenian Research Agency and gush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical the European Union’s Horizon 2020 program project STAR under features. 
Advances in neural information processing systems 31 (2018). grant agreement number H2020-956573. [19] Skyler Speakman, Srihari Sridharan, Sekou Remy, Komminist Weldemariam, and Edward McFowland. 2018. Subset scanning over neural network activations. arXiv preprint arXiv:1810.08676 (2018). REFERENCES [20] Entso Veliou, Dimitrios Papamartzivanos, Sofia Anna Menesidou, Panagiotis Gouvas, and Thanassis Giannetsos. 2021. Artificial Intelligence and Secure Manu- [1] [n.d.]. European Economic Forecast. Autumn 2021. https://economy-finance. facturing: Filling Gaps in Making Industrial Environments Safer. Now Publishers. ec.europa.eu/publications/european-economic-forecast-autumn-2021_en. Ac-30–51 pages. https://doi.org/10.1561/9781680838770.ch2 cessed: 2022-08-05. [21] Hanshu Yan, Jingfeng Zhang, Jiashi Feng, Masashi Sugiyama, and Vincent YF [2] Adrien Bécue, Isabel Praça, and João Gama. 2021. Artificial intelligence, cyber-Tan. 2022. Towards Adversarially Robust Deep Image Denoising. arXiv preprint threats and Industry 4.0: Challenges and opportunities. Artificial Intelligence arXiv:2201.04397 (2022). Review 54, 5 (2021), 3849–3886. [22] Xinchuan Zeng and Tony R Martinez. 2000. Distribution-balanced stratified [3] Gerda Bortsova, Cristina González-Gonzalo, Suzanne C. Wetstein, Florian Du-cross-validation for accuracy estimation. Journal of Experimental & Theoretical bost, Ioannis Katramados, Laurens Hogeweg, Bart Liefers, Bram van Ginneken, Artificial Intelligence 12, 1 (2000), 1–12. Josien PW Pluim, and Mitko Veta. 2021. Adversarial attack vulnerability of [23] Fangyi Zhang, Jürgen Leitner, Michael Milford, Ben Upcroft, and Peter Corke. medical image analysis systems: Unexplored factors. Medical Image Analysis 73 2015. Towards vision-based deep reinforcement learning for robotic motion (2021), 102141. Publisher: Elsevier. control. arXiv preprint arXiv:1511.03791 (2015). [4] Andrew P. Bradley. 1997. 
Addressing climate change preparedness from a smart water perspective

Alenka Guček*, Joao Pita Costa** ***, M. Besher Massri* ** *******, João Santos Costa*, Maurizio Rossi****, Ignacio Casals del Busto*****, Iulian Mocanu******
* Institute Jozef Stefan, Slovenia; ** IRCAI, Slovenia; *** Quintelligence, Slovenia; **** Ville de Carouge, Switzerland; ***** Aguas de Alicante, Spain; ****** Apa Braila, Romania; ******* Jozef Stefan International Postgraduate School
SIKDD’22, October 2022, Ljubljana, Slovenia

ABSTRACT

Observing the world on a global scale can help us better understand the role of water and water resource management utilities in a climate change context that engages us all. The usage of machine learning algorithms on open data measurements and statistical indicators can help us understand the behavioral changes in seasons and better prepare. These are complemented by powerful text mining algorithms that mine worldwide news, social media, published research and patented innovation towards best practices from success stories. In this paper, we propose a data-driven global observatory that puts together the different perspectives of media, science, statistics and sensing over heterogeneous data sources and text mining algorithms. We also discuss the implementation of this global observatory in the context of epidemic intelligence, monitoring the impact of climate change, and the value of this global solution in local contexts and priorities.

CCS CONCEPTS

• Real-time systems • Data management systems • Life and medical science

KEYWORDS

Climate Change Preparedness, Data-driven Decision-making, Water Resource Management, Smart Water, Observatory, Water Digital Twin, Deep Learning, Text Mining, Interactive Data Visualization

1 Introduction

In the present decade, Climate Change has become positioned as one of the world's priorities, a global problem with great socio-economic impact. It has been in the focus of European and worldwide strategies, rapidly changing priorities towards sustainability and environmental efficiency, transversely to most domains of action. The European Commission’s Green Deal [5] is a good example of this, aiming for a climate-neutral Europe in 2050, and boosting the economy through green technology over a new framework to understand and position water resource management in the context of the challenges of tomorrow [1]. In the context of the NAIADES project [3] we repurpose and customize the NAIADES Water Observatory, adding a measurements dimension to its text mining capabilities to allow for forecasts on, e.g., water level and temperature, to complete the perspective on the impact of climate change for the preparedness both of water management utilities and of users in, e.g., smart agriculture. This will improve the climate change preparedness of water resource management facilities and local authorities in a global context, in particular in European regions where water scarcity or extreme weather events are predicted. The water-related climate change topics that we are already addressing include, e.g., water reuse, wastewater management, saline intrusion and groundwater contamination.

In this paper we discuss our contribution to this cause through the NAIADES Water Observatory (accessible at naiades.ijs.si) [12], focusing on water-related aspects and allowing the user to explore a combination of perspectives offered over layers of information sourced from statistics, historical measurements, multilingual news and social media to published science, weather models and indicators. It is also being used in the context of extreme weather events to analyze worldwide trends and best practices in water topics like, e.g., floods, landslides, and contamination [9], building business intelligence from the available open data in combination with data streams [11].

The NAIADES Water Observatory is not only contributing to the improvement of European sustainability in water-related activities and business intelligence, but it is also providing an active role to local actors in improving, together with municipalities and water resource management utilities, the efficient use of resources [13]. This local perspective is especially important for providing information at the local granularity, which enables communities or municipalities to build solutions that are relevant for their specific cases.

Figure 1: Long-term forecast of 10 years (average per year) built on 20 years of data to understand the behavior of air temperature, water levels and temperature and the consequent changes within seasons.

Figure 2: The weather across seasons over the past 20 years, distinguished by seasons, exhibiting high temperature periods earlier in the year.

2 Understanding behaviors from data
In the era of Big Data, where technologies and sensors are every day cheaper and more efficient, a wide range of useful measurements is available and can be used to forecast weather and water resource behaviors and to identify environmental trends with local granularity.

With the motivation to grasp a realistic perspective on the impact of climate change in the region of Carouge, Switzerland, we obtained 20 years of water level and water temperature data (sourced from the Meteoswiss Data Portal IDAWEB), and we were able to build a 10-year forecast that allows us to see a signal of the global trend.

For this aim, we have developed a Long Short-Term Memory (LSTM) neural network, which is a type of Recurrent Neural Network, widely used for predicting sequential data. In order to optimize the performance and accuracy of the LSTM, we used some results from Differential Geometry and Chaos Theory such as Takens’ Embedding Theorem, Shannon Entropy, Conditional Shannon Entropy, Markov Chains, etc. This theoretical support was key for obtaining the optimal number of timesteps [4] and for producing a long-term forecast aiming to observe the weather behavior across the historical data collected, and a perspective on the future seasons based on the derived prediction, represented by the three parameters - temperature, humidity and rainfall - or the water levels in rivers, lakes and basins in the area determined by the geolocation provided by the NAIADES use cases.

The time series of historical data in Figure 1 indicates that the air temperature yearly averages are already increasing, and this increase is predicted also for the next 10 years. Comparing our model with the Meteoswiss model for the area, the differences were minimal. To emphasize the changes throughout the year, we added a per-year visualization (Figure 2), where one can compare the seasonal trends for the local weather and water parameters.

To further explore the relations of multivariate timeseries data, we have developed the State analysis tool [14]. With this technology we automatically abstract data as states of a Markov chain and the transitions between them. This allows for the ingestion of large datasets, and due to hierarchical clustering the data can be observed on several levels. This tool works especially well for observing long-term behavior and exposing recurrent patterns. In the context of climate change preparedness, the aim was to better understand the reality of the seasons as defined by the weather parameters as well as the water level and temperature over the past 20 years. Depicted in Figure 3 are the transitions between seven states we can already depict in the municipality of Carouge, Switzerland and the surrounding area. Five of those states correspond to a passage between Spring-Summer and Summer-Autumn, and to Summer itself, characterized by the states indicating a high water temperature. With the impact of climate change in redefining seasons, this tool can help to plan ahead, having in mind the granularity of the data that can be customized to predefined geographic regions where relevant water resources are located.

Figure 3: The analysis of the impact of climate change on water levels and temperature across seasons using Markov chains.

3 Enrichment with local indicators

Water is fundamental to all human activity and ecosystem health, and is a topic of rising awareness in the context of climate change. Water resource management is central to those concerns, with industry accounting for over 19% of global water withdrawal, and agricultural supply chains responsible for 70% of water stress [10].
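The "optimal number of timesteps" used for the LSTM forecasting in Section 2 amounts to a delay embedding: the series is cut into fixed-length windows of past values, each paired with the value to predict. A minimal sketch of that windowing step, with the series values and window length invented purely for illustration:

```python
import numpy as np

def delay_embed(series, n_timesteps):
    """Cut a 1-D series into (window of past values, next value) pairs,
    the supervised form a sequence model such as an LSTM is trained on.
    n_timesteps is the window length, i.e. the number of past steps."""
    X = np.stack([series[i:i + n_timesteps]
                  for i in range(len(series) - n_timesteps)])
    y = series[n_timesteps:]
    return X, y

# Invented daily temperature readings, purely for illustration.
temps = np.array([11.0, 12.5, 13.1, 12.9, 14.2, 15.0, 14.8])
X, y = delay_embed(temps, n_timesteps=3)
# Each row of X holds three past values; y holds the value to predict.
```

Choosing `n_timesteps` well is exactly where embedding-theoretic tools such as Takens' theorem and conditional entropy come in; too short a window hides the dynamics, too long a window adds noise and cost.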
In 2015 the UN established "clean water and sanitation for all" as one of the 17 Sustainable Development Goals, aiming for eight targets to be achieved by 2030 [2].

To exploit the functionality for customization at the level of local regional providers, news monitoring and the exploration of scientific research can be customized to observed problems, e.g., groundwater contamination. Moreover, the ingestion of local indicators can also be customized. These agencies (e.g. Aguas de Alicante) are collecting data on their water resource management services to improve customer satisfaction and optimize their systems, aiming for a smart water [6] approach to the optimization of resources and means, often deploying intelligent systems close to the idea of a water digital twin [7].

Together with the municipality of Carouge, Switzerland, and with the water management utilities of Alicante, Spain, and Braila, Romania, we have collected open data from national data portals and environmental agencies with a regional granularity, to be able to assess the comparative progress of regions through the visual data representation of indicators (see Figure 4). Through this interactive data visualization we can investigate the progress on a variety of topics (with three simultaneous parameters represented over a bubble chart) that are highly relevant to the analysis of climate change, including water availability, reused and treated water, or water usage by populations and industry. With the appropriate combination of variables in comparison, the user can identify the most efficient regions over the country.

Figure 4: The comparison of indicators in the Spanish regions across time.

To better understand the comparative progress of each region on the selected water-related topics, we also enable the representation of the time-series curves (see Figure 5) to identify transitions, peaks and other behaviors (per parameter in analysis) that are otherwise not seen in the bubble chart animation.

Figure 5: The curves comparing regional indicators on water topics (e.g., reused water in Spain).

4 Knowledge extracted from news, social media and scientific research

The NAIADES Water Observatory also allows for a news monitoring perspective with global and local coverage on topics like, e.g., water scarcity and water quality. It is particularly relevant in the surrounding regions of the water resource management agencies, but also at a worldwide level, recurring to its multilingual capacity to access success stories and best practices from similar scenarios happening worldwide. This is based on the Event Registry news engine [8], which collects over 300 thousand news articles daily in over 60 languages. In the past 3 months we were able to capture almost 33 thousand articles relating both to water and to the climate crisis, 1500 of them happening in Spain and relating to concepts such as, e.g., drought, wildfire, heat wave, irrigation and extreme weather.

This global system is also capturing the filtered Twitter feed on 10% of the signal, to identify posts related to heat wave and drought (see Figure 6).

Figure 6: The combined perspective of multilingual news, social media and scientific research on water scarcity and extreme weather, aiming to identify best practices and success stories.

The scientific research on climate change topics can bring an important complement in this context, providing success stories and best practices that can be extracted from the textual data and explored with complex data visualization technology, allowing the user to run powerful Lucene-based queries over the articles' metadata and to relate that research across time, suggesting related topics (see Figure 7). These data analytics technologies are able to analyze multiple time-series simultaneously, providing interactive exploration tools to understand trends in climate change research and the water topics related to it.

Figure 7: The trends over time that relate to the topic Climate Change in the scientific literature.

5 Conclusions and further work

Adapting to climate change is an important topic for water management services, since their work is quintessential for the well-being of people. Understanding the seasonality changes and forecasting the availability of resources at the local levels is therefore crucial to enable relevant adaptation at the correct granularity.

Although the predictions are in accordance with IPCC's and Meteoswiss forecasting, this preliminary work needs to be extended by ingesting several other data variables and compared to the existing widely used models to bring more accurate insight, especially for the weather data, but also for the water-relevant resources.

ACKNOWLEDGMENTS

We thank the support of the European Commission on the H2020 NAIADES project (GA nr. 820985).

REFERENCES

[1] A. Akhmouch, C. Delphine and P. G. Delphine Clavreul. Introducing the OECD principles on water governance. Water International, 43: 5–12, 2018.
[2] V. Blazhevska. United Nations launches framework to speed up progress on water and sanitation goal. United Nations Sustainable Development, 2020.
[3] CORDIS, "NAIADES Project". [Online]. Available: https://cordis.europa.eu/project/id/820985 [Accessed 1 9 2020].
[4] Costa J., Kenda K., Pita Costa J. (2021). Entropy for Time Series Forecasting. In: Slovenian Data Mining and Data Warehouses conference (SiKDD 2021).
[5] European Commission, "European Green Deal," 2019. [Online]. Available: https://ec.europa.eu/info/strategy/priorities-2019-2024/european-green-deal_en. [Accessed 1 9 2020].
[6] C. Sun, V. Puig, G. Cembrano. (2020). Real-Time Control of Urban Water Cycle under Cyber-Physical Systems Framework. Water: 12, 406.
[7] Di Nardo et al. (2018). On-line Measuring Sensors for Smart Water Network Monitoring. EPiC Series in Engineering. 3: 572–581.
[8] G. Leban, B. Fortuna, J. Brank and M. Grobelnik, "Event registry: learning about world events from news," Proceedings of the 23rd International Conference on World Wide Web, pp. 107–110, 2014.
[9] M. Mikoš, N. Bezak, J. Pita Costa, M. Besher Massri, M. Jermol, M. Grobelnik. Natural-hazard-related web observatories as a sustainable development tool, in Progress in Landslide Research and Technology, Springer, Vol. 1, No. 1, 2022.
[10] Our World in Data (2022). Water Use Stress. https://ourworldindata.org/water-use-stress. [Accessed 1 8 2022]
[11] J. Pita Costa (2022). Business intelligence built from open data. WaterWorld Magazine. [Online]. Available: https://www.waterworld.com/water-utility-management/smart-water-utility/article/14234325/2203wwint [Accessed 1 8 2022]
[12] J. Pita Costa (2021). Observing water-related events to support decision-making. Smart Water Magazine. [Online]. Available: https://smartwatermagazine.com/news/naiades-project/observing-water-related-events-support-decision-making [Accessed 1 8 2022]
[13] J. Pita Costa, I. Casals del Busto, A. Guček, et al (2022). Building A Water Observatory From Open Data. Proceedings of the IWA 2022.
[14] L. Stopar, P. Škraba, M. Grobelnik, and D. Mladenić (2018). StreamStory: Exploring Multivariate Time Series on Multiple Scales. IEEE Transactions on Visualization and Computer Graphics 25, 4: 1788–1802.
SciKit Learn vs Dask vs Apache Spark Benchmarking on the EMNIST Dataset

Filip Zevnik, Din Music, Carolina Fortuna, Gregor Cerar
Department of Communication Systems, Jozef Stefan Institute, Ljubljana, Slovenia
zevnikfilip@gmail.com

Abstract—As datasets for machine learning tasks can become very large, more consideration has to be given to memory and computing resource usage. As a result, several libraries for parallel processing that improve RAM utilization and speed up computations by parallelizing ML jobs have emerged. While SciKit Learn is the typical go-to library for practitioners, Dask is a parallel computing library that can be used with SciKit, and Apache Spark is an analytics engine for large-scale data processing that includes some machine learning techniques. In this paper, we benchmark the three solutions for developing ML pipelines with respect to data loading and merging and subsequently for training and predicting on the extended MNIST (EMNIST) dataset under Linux and Windows OS. Our results show that Linux is the better option for all of the benchmarks. For low amounts of data plain SciKit Learn is the best option for machine learning, but for more samples we would choose Apache Spark. On the other hand, when it comes to dataframe manipulation, Dask beats a normal pandas import and merge.

Index Terms—Apache Spark, Dask, machine learning, Pandas, import

I. INTRODUCTION

As datasets for machine learning tasks can become very large, more consideration has to be given to memory and computing resource usage. As a result, several libraries for parallel processing that improve RAM utilization and speed up computations by parallelizing ML jobs have emerged. While SciKit Learn [1] is the typical go-to library for practitioners, Dask [2] is a parallel computing library that can be used with SciKit to improve memory and CPU utilization. Dask improves memory utilization by not immediately loading all the data, but only pointing to it; only part of the data is loaded on a per-need basis. It also enables using all available cores on a system to train a model. Apache Spark is an analytics engine written in Java and Scala for processing large-scale data that incorporates some machine learning techniques and is tightly integrated with the Spark architecture.

While there are other libraries [3] that enable parallelization of ML, when it comes to distributed computing tools for tabular datasets, Spark and Dask are the most popular choices today. Even though Spark is an older, more stable solution, Dask is part of the vibrant Python ecosystem, and both technologies excel at parallelization. While the two solutions have already been benchmarked on big data pipelines [4] and on various image processing and learning scenarios [5]–[7], the work in [7] is the closest to this one; however, they focused on evaluating the tradeoffs in parallelizing feature extraction and clustering, while this work focuses on evaluating data loading and merging and subsequent classification.

In this paper, we benchmark the three solutions for developing ML pipelines with respect to data loading and merging and subsequently for training and predicting on the extended MNIST (EMNIST) dataset under Linux and Windows OS. Our results show that Linux is the better option for all of the benchmarks. For low amounts of data plain SciKit Learn is the best option for machine learning, but for more samples we would choose Apache Spark. On the other hand, when it comes to dataframe manipulation, Spark is behind Dask, and Dask beats a normal pandas import and merge. The contribution of this paper is the benchmarking of three ML libraries across various data sizes and two operating systems on two parts of the ML model development pipeline.

The remainder of the paper is structured as follows. Section II discusses related work. Section III presents the methodology used in the benchmarking. Section IV evaluates the comparison. Finally, Section V presents our conclusions.

This work was funded by the Slovenian Research Agency ARRS under program P-0016.

II. RELATED WORK

Chintapalli et al. (2016) [8] compared the streaming platforms Flink, Storm and Spark. The paper focuses on real-world streaming scenarios using ads and ad campaigns. Each streaming platform was used to build a pipeline that identifies relevant events, which were sourced from Kafka. In addition, Redis was used for storing windowed counts of relevant events per campaign. The test system contained 40 nodes, where each node contained 2 CPUs with 8 cores and 24 GB of RAM. All nodes were interconnected using a gigabit Ethernet connection. In the experiment, Kafka produced events at a set rate, with a 30-minute interval between batches. The results showed that Flink and Storm were almost equal in terms of event latency, while Spark turned out to be the slowest of the three.

Dugré et al. (2019) [4] compared Dask and Spark on neuroimaging big data pipelines. As neuroimaging requires a large number of images to be processed, Spark and Dask were at the time of writing the best suited Big Data engines. The paper compares the technologies with three different pipelines: the first is an incrementation, the second a histogram, and the final one a BIDS app example (a map-reduce style application). All comparisons were done on the BigBrain and CoRR datasets, with sizes of 81 GB and 39 GB respectively. The authors concluded that all platforms perform very similarly and that increasing the number of worker nodes is not always the optimal solution due to transfer times and overall overhead. While all platforms yielded similar results, Spark is claimed to be the fastest of the three.

Nguyen et al. (2019) [6] evaluated SciDB, Myria, Spark, Dask and TensorFlow to figure out which system is best suited for image processing. Similarly to [4], the authors compared the systems using different pipelines. For the comparison, the authors used 2 datasets, both over 100 GB in size. The comparison revealed that Dask and Spark are comparable in performance as well as ease of use.

Mehta et al. (2016) [5] presented a satellite data processing pipeline. The pipeline consists of two steps, a feature extraction step and a clustering step. The baseline pipeline used the Caffe deep learning library and SciKit. The improved pipeline used Keras along with Spark and Dask for multi-node computation. They found that while Spark was the fastest in terms of computational time required per task, Dask used almost half the memory compared to Spark due to recalculation of the intermediate values. SciKit Learn was not able to complete the task and was excluded from the final comparison. It was concluded that Spark is the best performer, while Dask is the easiest to use.

Cheng et al. (2019) [7] presented a comparison of RADICAL-Pilot, Dask and Spark for image processing. All three systems were tested using watershed and blob detector algorithms. Each test was split into two parts: a weak scaling part, where the amount of data to be processed was increased alongside the number of nodes, and a strong scaling part, where the amount of data stayed the same and the number of nodes increased. The evaluation showed that Dask outperformed Spark on weak scaling, while Spark excelled in the strong scaling part.

III. METHODOLOGY

To benchmark the three solutions, namely SciKit Learn, Dask and Spark, we single out two parts of the end-to-end model development process depicted in Figure 1. We first time the data importing and merging process, referred to as Benchmark 1 in the figure, followed by model training and evaluation, denoted by Benchmark 2. While the time required to train the model is usually the most important metric, because it takes up most of the computation time, importing and merging the input data cannot be ignored. As described in Algorithm 1, for Benchmark 1 the training data was imported and then merged. For SciKit Learn, dataframes were used all along and no parallelization was used, while for Dask and Spark parallelization was turned on.

Fig. 1. Workflow of the machine learning test example used for benchmarking.

Algorithm 1: Import and merge benchmarking process.
  Require: data a and data b
  Enable parallelization
  Merge the DataFrames
  Convert data to a pandas DataFrame

Algorithm 2: Train/fit and evaluate benchmarking process.
  Enable parallelization
  Import and setup data
  train = [80% of the samples], test = [20% of the samples]
  Define ML algorithm
  Fit the data
  Predict the samples
  Evaluate - F1

As described in Algorithm 2, for Benchmark 2 in Figure 1, an example of machine learning with a decision tree classifier depicts the workflow of the machine learning test. First, parallelization is enabled for Dask and Spark, and immediately after that the data is imported and modified to fit the test scenario. Next, the decision tree classifier is trained using various training data sizes, dividing the data set into a training subset and a test subset. The training subset represents 80% of the original dataset, and the test subset uses the remaining data, representing 20% of the original dataset. Each task is run with 5 different sample sizes, ranging from 50k to 250k samples, with a step of 50k samples. Finally, an execution report with the computation times of each task is generated.
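Benchmark 2 (Algorithm 2 above) can be sketched with scikit-learn. The 80/20 split, decision-tree classifier and F1 evaluation follow the paper; the function name, the synthetic data standing in for the EMNIST CSV, and the sizes used here are illustrative only, not taken from the paper's scripts.

```python
# Sketch of Benchmark 2 (Algorithm 2): 80/20 split, decision tree, F1.
# Synthetic data stands in for the EMNIST CSV; names/sizes are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def run_benchmark2(X, y, test_size=0.2, seed=0):
    """Train/fit and evaluate, returning the macro F1 score."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X_tr, y_tr)          # "Fit the data"
    y_pred = clf.predict(X_te)   # "Predict the samples"
    return f1_score(y_te, y_pred, average="macro")  # "Evaluate - F1"

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((1000, 784))          # 28x28 images flattened, as in EMNIST
    y = rng.integers(0, 10, size=1000)   # digit labels 0-9
    print(run_benchmark2(X, y))
```

In the paper's setup the same logic is timed three times: directly on pandas data for SciKit Learn, and with parallelization enabled for the Dask and Spark variants.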
To realize these benchmarks (scripts are available at https://github.com/sensorlab/parMLBenchmarks), we used the extended MNIST (EMNIST) dataset (https://www.kaggle.com/crawford/emnist, accessed 30.07.2022). The dataset contains approximately 250k samples of handwritten digits, resulting in a total size of 516 MB. All images have exactly the same size, 28 by 28 pixels, and each pixel has a value ranging from 0 to 255. The dataset is represented in the CSV (Comma Separated Values) format, with the first column being the label and the remaining 784 columns representing the pixels. For the benchmarks, different data set sizes, ranging from 50k to 250k samples with a 50k step, were generated. In addition, each data set size was tested on Dask and Spark with 1, 2 and 4 workers. The programs used to test computation time on Windows and Linux therefore have the same complexity. All tests were performed on equivalent Windows and Linux virtual machines running on a 6 CPU core machine with 10 GB of RAM.

IV. RESULTS

In this section we provide the results of the benchmarks collected using the methodology described in Section III.

A. Import and merge

First, we present in Figure 2 the import and merge times for 100k samples on Linux without parallelization across the three platforms. In the first bar, it can be seen that importing (i.e. loading the data into memory) takes most of the time with Pandas. Merging (i.e. concatenation) is relatively negligible, while computation is not relevant in this case, as after merging Pandas already returns the desired data structure. The total import and merge time is slightly above 4 s.

Fig. 2. Benchmark results of import and merge times at 100k samples: raw data to Pandas.

From the second bar, it can be seen that importing and merging is negligible with Dask, as Dask doesn't load anything into memory at these steps; rather, it only prepares recipes that will be executed during the most time-consuming compute phase. During compute, Dask turns a lazy collection into its in-memory equivalent; in our case, the Dask dataframe turns into a Pandas dataframe. Overall, on a single node Dask is comparable to Pandas, with a total import and merge time slightly below 4 s.

Finally, from the last bar, it can be seen that the Spark import and merge are very fast and efficient, taking below 2 s. However, transforming the internal data structure of Spark into pandas (i.e. the compute phase in this case) is very time consuming. We added this step so that the final outcome is consistent with the other two (i.e. a Pandas data structure); however, in an end-to-end ML pipeline the ML algorithm would be trained directly on Spark's internal data structure.

Figure 3 shows how the import and merge times fare as a function of worker nodes for Dask across Linux and Windows. As expected, the import/merge times show a decreasing tendency as the number of worker nodes increases.

Fig. 3. Benchmark results on two operating systems, Dask with import and merge on 250k samples.

When testing Spark on the import and merge benchmark, both Windows and Linux ran out of memory with two and four workers. Swap memory could be used to overcome this shortcoming; however, the resulting comparison would not be fair, because the Dask benchmarks didn't need the swap memory.

B. Machine learning

Figure 4 shows the comparison of computation time between Dask, Spark and SciKit on the Windows operating system for different dataset sizes. Each column in the figure represents the average computation time of 5 test runs. The results show that Dask and Spark are almost equivalent when the input dataset size is around 150k samples. Dask performs better on smaller datasets, while Spark's performance is best on larger datasets. Interestingly, SciKit outperforms both Dask and Spark on all dataset sizes, although it is not able to parallelize tasks. This is most likely because of the transfer times between nodes and the overall overhead of Dask and Spark. Since the datasets fit completely into the computer's memory, SciKit has no problems computing them, while Dask and Spark only cause unnecessary overhead. However, Dask and Spark are meant for large clusters with hundreds or even thousands of nodes, while SciKit is meant for computations on a single computer.

Fig. 4. Computational time for different dataset sizes on Windows operating system.

Figure 5 shows the results of the same experiment performed on the Linux operating system. Compared to Figure 4, the results are very similar, with the only difference that on Linux Dask performs better than Spark even when the input data set contains 150k samples.

Fig. 5. Computational time for different dataset sizes on Linux operating system.

Table I shows the F1 scores. The F1 score is the harmonic mean of precision and recall. Precision gives information on how many of the samples predicted as positive are correct; recall gives information on how many of all positive samples the model managed to find.

The machine learning benchmark also measured the time to cast all columns into smaller data types. Dask has a dedicated function to cast all of the columns of a Dask dataframe at once, whereas with Spark each column has to be cast one by one. The Dask casting was faster (0.06 s) than Spark's (7.2 s).

V. CONCLUSIONS

In this paper we benchmarked two parallel computing technologies, Dask and Apache Spark, against each other and against the single-node SciKit Learn. The benchmarks were computed on the EMNIST dataset for various subsets from 50k to 250k samples, on different operating systems and with various degrees of parallelization. The results show a slight advantage of running the training pipeline on Linux rather than on Windows. Dask is superior in dataframe manipulation, while Apache Spark has superior end-to-end processing performance on larger datasets, with comparable final F1 scores.

ACKNOWLEDGMENTS

This work was funded in part by the Slovenian Research Agency under the grant P2-0016.

REFERENCES

[1] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[2] M. Rocklin, "Dask: Parallel computation with blocked algorithms and task scheduling," in Proceedings of the 14th Python in Science Conference, vol. 130, p. 136, Citeseer, 2015.
[3] S. Celis and D. R. Musicant, "Weka-parallel: machine learning in parallel," in Carleton College, CS TR, Citeseer, 2002.
[4] M. Dugré, V. Hayot-Sasson, and T. Glatard, "A performance comparison of Dask and Apache Spark for data-intensive neuroimaging pipelines," in 2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS), pp. 40–49, 2019.
[5] P. Mehta, S. Dorkenwald, D. Zhao, T. Kaftan, A. Cheung, M. Balazinska, A. Rokem, A. Connolly, J. Vanderplas, and Y. AlSayyad, "Comparative evaluation of big-data systems on scientific image analytics workloads," vol. 10, pp. 1226–1237, VLDB Endowment, Aug. 2017.
[6] M. H. Nguyen, J. Li, D. Crawl, J. Block, and I. Altintas, "Scaling deep learning-based analysis of high-resolution satellite imagery with distributed processing," in 2019 IEEE International Conference on Big Data (Big Data), pp. 5437–5443, 2019.
[7] William Cheng, Ioannis Paraskevakos, et al., "Image processing using task parallel and data parallel frameworks," pp. 1–7, 2019.
[8] S. Chintapalli, D. Dagit, B. Evans, R. Farivar, T. Graves, M. Holderbaugh, Z. Liu, K. Nusbaum, K. Patil, and B. J. Peng, "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," 2016.
TABLE I
TABLE OF F1 SCORES FOR WINDOWS BENCHMARKS FOR VARIOUS SAMPLE SIZES (SIMILAR FOR LINUX).

Number of samples (x1000) |  50  | 100  | 150  | 200  | 250
Spark                     | 0.71 | 0.73 | 0.73 | 0.71 | 0.71
Dask                      | 0.71 | 0.72 | 0.73 | 0.71 | 0.70
Scikit                    | 0.70 | 0.71 | 0.70 | 0.71 | 0.73

An Efficient Implementation of Hubness-Aware Weighting Using Cython

Krisztian Buza
buza@biointelligence.hu
BioIntelligence Group, Department of Mathematics-Informatics
Sapientia Hungarian University of Transylvania
Targu Mures, Romania

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia
© 2022 Copyright held by the owner/author(s).

ABSTRACT
Hubness-aware classifiers are recent variants of k-nearest neighbor. When training hubness-aware classifiers, the computationally most expensive step is the calculation of hubness scores. We show that this step can be sped up by an order of magnitude or even more if it is implemented in Cython instead of Python, while the accuracy is the same in both cases.

KEYWORDS
nearest neighbor, hubs, cython

1 INTRODUCTION
Nearest neighbor classifiers are simple, intuitive and popular, and there are theoretical results about their accuracy and error bounds [6]. However, nearest neighbors are affected by bad hubs. An instance is called a bad hub if it appears surprisingly frequently as nearest neighbor of other instances, but its class label is different from the labels of those other instances. Bad hubs were shown to be responsible for a surprisingly large fraction of the total classification error [10].

In order to reduce the detrimental effect of bad hubs, hubness-aware classifiers have been introduced, such as Hubness-Weighted k-Nearest Neighbor (HWKNN) [9], Naive Hubness Bayesian Nearest Neighbor (NHBNN) [14] and Hubness-based Fuzzy Nearest Neighbor (HFNN) [16]. Hubness has also been studied in the context of collaborative filtering [8], regression [3], clustering [15], and instance selection and feature selection [13]. Recently, hubness-aware ensembles have been proposed [17] and used for the classification of breast cancer subtypes [12]. Other prominent applications of hubness-aware methods include music recommendation [7], time series classification [11], drug-target prediction [4] and classification of gene expression data [2]. Last, but not least, we mention that even neural networks may benefit from hubness-aware weighting [5].

Hubness-aware classifiers may be implemented in various programming languages; one of the most prominent implementations is probably the Java-based HubMiner library (https://github.com/datapoet/hubminer).

In case of the aforementioned hubness-aware classifiers, the computationally most expensive step of the training is to determine the hubness scores of training instances, i.e., how frequently they appear as (bad) nearest neighbors of other instances. In this paper, we address this issue by a Cython-based implementation. Cython [1] aims to combine the advantages of Python (rapid prototyping and clarity thanks to concise code) with the efficiency of C. In particular, we implement the computation of hubness scores in Cython. Compared with a standard implementation in Python, we observed up to 25 times speedup on the Spambase dataset (https://archive.ics.uci.edu/ml/datasets/spambase) from the UCI repository (and the speedup is likely to be even more in case of larger datasets).

2 BACKGROUND: HUBNESS-AWARE WEIGHTING
We say that an instance x' is a bad neighbor of another instance x if (i) x' is one of the k-nearest neighbors of x and (ii) their class labels are different. In case of hubness-aware weighting [9], we first determine how frequently each instance x appears as a bad neighbor of other instances; this count is denoted BN_k(x). Subsequently, the normalized bad hubness score h_b(x) of each instance x is calculated as follows:

  h_b(x) = ( BN_k(x) − μ(BN_k) ) / σ(BN_k)    (1)

where μ(BN_k) and σ(BN_k) denote the mean and standard deviation of the BN_k(x) values over all instances of the training data. HWKNN performs weighted k-nearest neighbor classification, where the weight of each training instance is w(x) = e^(−h_b(x)). For a detailed illustration of HWKNN we refer to [13].

3 CYTHON-BASED IMPLEMENTATION OF HUBNESS CALCULATIONS
Python code is usually run by an interpreter, which makes the execution relatively slow. Much of the inefficiency originates from dynamic typing: for example, the actual semantics of the '+' symbol depends on the types of the operands. It may stand for addition of numbers, concatenation of strings or lists, element-wise addition of arrays, etc. Which of the operations to perform will be determined by the interpreter at execution time.

The core idea of Cython (https://cython.org/) is to annotate variables according to their types and to compile the resulting code into C, which will further be compiled into binary code for efficient execution. In case of computationally expensive functions, this may result in several orders of magnitude speedup. At the same time, functions implemented in Cython can be called from Python code just like Python functions.
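The quantity being accelerated — BN_k(x), and from it the scores and weights of Eq. (1) — can be written as a short plain-Python/NumPy reference. This is an illustrative naive O(n^2) sketch (the function names and toy data are ours, not from the paper's repository); it is exactly the kind of tight loop that the Cython version speeds up by adding static type annotations and compiling to C.

```python
import numpy as np

def bad_neighbor_counts(X, y, k):
    """Naive O(n^2) computation of BN_k(x): how often each instance
    appears among the k nearest neighbors of an instance with a
    different class label. This loop is the candidate for Cython."""
    n = len(X)
    bn = np.zeros(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)  # distances from instance i
        d[i] = np.inf                         # exclude the instance itself
        for j in np.argsort(d)[:k]:           # its k nearest neighbors
            if y[j] != y[i]:
                bn[j] += 1                    # j is a bad neighbor of i
    return bn

def hubness_weights(bn):
    """Normalized bad hubness scores h_b(x) (Eq. 1) and the HWKNN
    instance weights w(x) = exp(-h_b(x))."""
    h_b = (bn - bn.mean()) / bn.std()
    return h_b, np.exp(-h_b)

# Toy data (illustrative): two classes on a line.
X = np.array([[0.0], [0.2], [0.3], [5.0]])
y = np.array([0, 0, 1, 1])
bn = bad_neighbor_counts(X, y, k=1)
h_b, w = hubness_weights(bn)   # frequent bad neighbors get weights < 1
```

A Cython port of `bad_neighbor_counts` would declare `i`, `j`, `n` and the arrays with C types (e.g. `cdef int`, typed memoryviews), which removes the interpreter dispatch from the inner loop without changing the returned BN_k(x) values.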
We implemented the calculation of hubness scores both in Python and Cython, and made the code available in our github repository: https://github.com/kr7/cython.

We evaluated both implementations on the Spambase dataset from the UCI repository. The dataset contains 4601 instances and 57 features (without the class label). Each instance corresponds to an e-mail, and for each e-mail the same features were extracted. The associated classification task is to decide whether the e-mail is spam or not.

We used 100 instances as test data and 4500 instances as training data. We ran the experiments in Google Colab (https://colab.research.google.com). We used k = 10 nearest neighbors both for the calculation of hubness scores and for the final classification. According to our observations, the Cython-based calculation of hubness scores was more than 20 times faster than the standard implementation in Python. Both versions produced exactly the same BN_k(x) scores. As the weight of an instance x only depends on its BN_k(x) score, both versions produce the same predictions. Therefore the accuracy (0.94) is equal in both cases.

We repeated the experiments using only 1000, 2000 and 3000 instances as training data. As Fig. 1 shows, the Cython-based implementation was consistently faster than the implementation in Python; note that a logarithmic scale is used on the vertical axis. The difference showed an increasing trend when more training data was used: whereas in case of 1000 training instances the Cython-based implementation was only about 12 times faster than the Python-based implementation, in case of 4500 training instances the speedup factor was approximately 25. This may be attributed to the non-linear complexity of hubness score calculations. Assuming a naive implementation, determination of the nearest neighbors of an instance is linear in the size of the training data. However, in order to calculate the hubness scores, the nearest neighbors of all the training instances have to be determined. Thus the resulting overall complexity is quadratic.

Figure 1: Runtime (in seconds, vertical axis) of hubness score calculation in case of Python-based (dashed line with 'x') and Cython-based (solid line with bullets) implementations for various numbers of instances (horizontal axis).

We note that, both in case of Cython and Python, indexing techniques may be used to speed up the determination of the nearest neighbors. However, we omitted indexing in our implementation for simplicity.

4 DISCUSSION
In order to calculate distances efficiently, we used the pairwise distances from scikit-learn in our experiment. However, in case of large datasets, it may be necessary to calculate distances on the fly, as the distance matrix may be too large to be stored in RAM. In such cases, it may be worth considering implementing the distance calculations in Cython as well. In our previous work, we observed that the calculation of the dynamic time warping distance was several orders of magnitude faster when we implemented it in Cython instead of Python.

In case of very large datasets, straightforward calculation of hubness scores may be infeasible due to its quadratic complexity, even if the calculations are implemented in Cython. In such cases, the aforementioned indexing techniques and/or the calculation of approximate hubness scores (e.g. using a random subset of the data) may be necessary.

As future work, we plan an exhaustive evaluation of both implementations with respect to various datasets with different sizes and numbers of features.

ACKNOWLEDGEMENT
The author thanks the Reviewers for their insightful comments and suggestions.

REFERENCES
[1] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. 2010. Cython: The best of both worlds. Computing in Science & Engineering 13, 2 (2010), 31–39.
[2] Krisztian Buza. 2016. Classification of gene expression data: A hubness-aware semi-supervised approach. Computer Methods and Programs in Biomedicine 127 (2016), 105–113.
[3] Krisztian Buza, Alexandros Nanopoulos, and Gábor Nagy. 2015. Nearest neighbor regression in the presence of bad hubs. Knowledge-Based Systems 86 (2015), 250–260.
[4] Krisztian Buza and Ladislav Peška. 2017. Drug–target interaction prediction with Bipartite Local Models and hubness-aware regression. Neurocomputing 260 (2017), 284–293.
[5] Krisztian Buza and Noémi Ágnes Varga. 2016. ParkinsoNET: estimation of UPDRS score using hubness-aware feedforward neural networks. Applied Artificial Intelligence 30, 6 (2016), 541–555.
[6] Luc Devroye, László Györfi, and Gábor Lugosi. 2013. A Probabilistic Theory of Pattern Recognition. Vol. 31. Springer Science & Business Media.
[7] Arthur Flexer, Monika Dörfler, Jan Schlüter, and Thomas Grill. 2018. Hubness as a case of technical algorithmic bias in music recommendation. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 1062–1069.
[8] Peter Knees, Dominik Schnitzer, and Arthur Flexer. 2014. Improving neighborhood-based collaborative filtering by reducing hubness. In Proceedings of International Conference on Multimedia Retrieval. 161–168.
[9] Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović. 2009. Nearest neighbors in high-dimensional data: The emergence and influence of hubs. In Proceedings of the 26th Annual International Conference on Machine Learning. 865–872.
[10] Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović. 2010. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research 11 (2010), 2487–2531.
[11] Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović. 2010. Time-series classification in many intrinsic dimensions. In Proceedings of the 2010 SIAM International Conference on Data Mining. SIAM, 677–688.
[12] S. Raja Sree and A. Kunthavai. 2022. Hubness weighted SVM ensemble for prediction of breast cancer subtypes. Technology and Health Care 30, 3 (2022), 565–578.
[13] Nenad Tomašev, Krisztian Buza, Kristóf Marussy, and Piroska B. Kis. 2015. Hubness-aware classification, instance selection and feature construction: Survey and extensions to time-series. In Feature Selection for Data and Pattern Recognition. Springer, 231–262.
[14] Nenad Tomašev, Miloš Radovanović, Dunja Mladenić, and Mirjana Ivanović. 2011. A probabilistic approach to nearest-neighbor classification: Naive hubness Bayesian kNN. In Proc. 20th ACM Int. Conf. on Information and Knowledge Management (CIKM). 2173–2176.
[15] Nenad Tomašev, Miloš Radovanović, Dunja Mladenić, and Mirjana Ivanović. 2013. The role of hubness in clustering high-dimensional data. IEEE Transactions on Knowledge and Data Engineering 26, 3 (2013), 739–751.
[16] Nenad Tomašev, Miloš Radovanović, Dunja Mladenić, and Mirjana Ivanović. 2014. Hubness-based fuzzy measures for high-dimensional k-nearest neighbor classification. International Journal of Machine Learning and Cybernetics 5, 3 (2014), 445–458.
[17] Qin Wu, Yaping Lin, Tuanfei Zhu, and Yue Zhang. 2020. HIBoost: A hubness-aware ensemble learning algorithm for high-dimensional imbalanced data classification. Journal of Intelligent & Fuzzy Systems 39, 1 (2020), 133–144.
60 Semantic Similarity of Parliamentary Speech using BERT Language Models & fastText Word Embeddings Katja Meden Department of Knowledge Technologies E8, Jožef Stefan Institute katja.meden@ijs.si ABSTRACT We measured sentence similarity with four BERT-based language models (Language agnostic BERT Sentence Encoder - The main objective of this paper is to present the work done on LaBSE model [7], Sentence-LaBSE [8], Sentence-BERT [14], comparing the two methods for measuring semantic similarity of multilingual BERT – mBERT [1]) and compared the scores of parliamentary speech between coalition and opposition regarding most similar and least similar sentences. the adoption of the first COVID-19 epidemic response package. To facilitate the intended scope of our initial research, i.e., We first measured sentence similarity using four BERT-based researching similarity of full-text parliamentary speech, we used language models (Language agnostic BERT Sentence Encoder - fastText [5] and presented results using descriptive analysis to LaBSE model, Sentence-LaBSE, Sentence-BERT, multilingual gain additional insight into the characteristics of coalition and BERT - mBERT) and compared the results amongst them. Using opposition parliamentary speech. Lastly, we highlighted some of the word embedding method, fastText, we then measured the the advantages and disadvantages of each method for measuring semantic similarity of full-text parliamentary speech and semantic similarity of parliamentary speech. presented the results using descriptive analysis. Lastly, we The paper is structured as follows: Section 2 contains an compared the usage of both methods and highlighted some of the overview of the related work on word embeddings and language advantages and disadvantages of each method for measuring the models. Section 3 presents the methodology and we describe the semantic similarity of parliamentary speech. experiment setting in Section 4. 
The experiment results are found in Section 5. Finally, we conclude the paper and provide ideas KEYWORDS for future work in Section 6. parliamentary speech, semantic similarity, sentence similarity, BERT language models, fastText 2 RELATED WORK Two blocks of texts are considered similar if they contain the 1 INTRODUCTION same words or characters. Techniques like Bag of Words (BoW), “National parliamentary data is a verified communication Term Frequency - Inverse Document Frequency (TF-IDF) can be channel between the elected political representatives and society used to represent text as real value vectors to aid calculation of members in any democracy. It needs to be made accessible and Semantic Textual Similarity (STS) [3]. STS is defined as the comprehensive - especially in times of a global crisis.” [13] In measure of semantic equivalence between two blocks of text and parliamentary discourse, politicians expound their beliefs and usually give a ranking or percentage of similarity between texts, ideas through argumentation and to persuade the audience, they rather than a binary decision as similar or not similar [3]. Word highlight some aspect of an issue. If we are to understand the role embeddings are one of the methods developed to aid in of parliamentary discourse practices, we need to explore the measuring semantic similarity. They provide vector recurring linguistic patterns and rhetorical strategies used by representations of words where vectors retain the underlying MPs that help to reveal their ideological commitments, hidden linguistic relationship between the similarities of the words. agendas, and argumentation tactics [11]. One of the ways to Word embeddings consist of two types: static and contextualized study the aforementioned linguistic patterns can be done by word embeddings. 
With static word embeddings, words always have the same representation regardless of the context in which they occur, while with contextualized word embeddings the representation depends on the context in which the word occurs, meaning that the same word in different contexts can have different representations.

FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers [5]. It is a representative of the static word embedding technique, where a vector representation is associated with each character n-gram; words are represented as the sum of these representations [2]. The fundamental problem of static word embeddings is that they generate the same embedding for the same word in different contexts, failing to capture polysemy [4].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2022, 10-14 October 2022, Ljubljana, Slovenia
© 2022 Copyright held by the owner/author(s).

Language models are contextualized word representations that aim at capturing word semantics in different contexts to address the issue of polysemy [4]. BERT, or Bidirectional Encoder Representations from Transformers, is a language model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers [6]. BERT word representations are therefore contextual.

The aim of this paper is to present the work done on comparing two methods for measuring the semantic similarity of parliamentary speech between coalition and opposition regarding the adoption of the first COVID-19 epidemic response package.

We used the same settings for the second part of the experiment (comparing sentence similarity with the four BERT-based models) with one difference.
Since all BERT-based models support a maximum input length (max_length) of 512 tokens, we decided to filter out sentences that refer explicitly to the response package (the keyword for selection being zakon). To facilitate the visualisations and balance our dataset, we randomly chose 20 sentences for each group (coalition/opposition).

3 METHODOLOGY

3.1 Dataset

The dataset contains 230 documents (speeches) from Extraordinary Session 33 from the corpora of Slovenian parliamentary debates (ParlaMint-SI) [9], covering 2014 to mid-2020, linguistically annotated and represented in the CoNLL-U format (which includes POS, lemma and NER tags). We chose an extraordinary session in a time of crisis for two reasons: firstly, regular sessions deal with multiple problems (such as MP questions), which makes a comparison between speeches difficult. Secondly, we chose only one specific theme (the adoption of the first epidemic response package), which helped the initial analysis and comparison of documents.

3.3 Experiment settings

As mentioned, BERT-based models have restrictions on the maximum length of input documents. For most, this is 512 tokens, and in the case of Sentence-BERT the restriction is even more severe (128 word tokens). Most speeches in the dataset are longer than the maximum length; this limitation did not allow us to conduct the semantic similarity measurement on full parliamentary speech. The first part of the experiment therefore focuses on sentence similarity. Of the previously described BERT-based models, three are fine-tuned for sentence similarity tasks: Sentence-LaBSE [7], LaBSE [8] and Sentence-BERT [14], while mBERT [1] is a standard BERT model. For easier comparison, we used mean pooling and cosine distance to measure the similarity.
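The mean pooling plus cosine step can be sketched with toy arrays standing in for real model outputs (with the actual BERT-based models, the token embeddings would come from a checkpoint's last hidden states; the numbers below are invented for illustration):

```python
import numpy as np

def mean_pool(token_embs, attention_mask):
    """Average token embeddings into one sentence vector, ignoring padding.

    token_embs: (seq_len, dim) array, e.g. a model's last hidden states;
    attention_mask: (seq_len,) array of 0/1 flags marking real tokens.
    """
    mask = attention_mask[:, None].astype(float)
    return (token_embs * mask).sum(axis=0) / mask.sum()

def cosine(u, v):
    """Cosine similarity between two dense sentence vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two toy "sentences": 4 token slots, 3-dimensional embeddings, zero padding.
a = mean_pool(np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 0.], [0., 0., 0.]]),
              np.array([1, 1, 0, 0]))
b = mean_pool(np.array([[1., 1., 0.], [0., 0., 0.], [0., 0., 0.], [0., 0., 0.]]),
              np.array([1, 0, 0, 0]))
print(cosine(a, b))  # pooled vectors point the same way, so similarity is 1.0
```

Masking before averaging matters: without it, padding positions would drag every sentence vector toward zero and distort the similarity scores.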
3.2 Data analysis and pre-processing

For the initial data analysis, we used the Orange data mining tool [12], which helped us with data understanding and the initial dataset pre-processing.

For full-speech measuring with fastText, we removed the speeches by the Chairperson to avoid adding noise to the dataset in the form of procedural speech that would make measuring semantic similarities almost impossible. We also removed Slovene stopwords and manually added a list of four additional stopwords: hvala, danes, lepa and beseda, which excluded the very common phrase Hvala za besedo (Eng. Thank you for the floor) and its variations. Some of the documents were missing the party_status labels (values: coalition and opposition); the 17 documents with missing values were removed from the dataset. The pre-processing gave us a total of 97 documents, presented in Table 1.

To achieve the intended scope of our initial research (researching the semantic similarity of parliamentary speech), we used the fastText-based Orange widget Document Embedding (using the mean as the aggregation method) to embed our documents and calculated cosine similarity to compare coalition and opposition parliamentary speech. With these two experiments, we can compare measuring semantic similarity with language models to the word embedding method (fastText). This comparison would be better with the Longformer language model (which can take around 1000+ word tokens as maximum input), as we could then compare methods for measuring the semantic similarity of full-text documents (speeches); however, as of the time of writing, Longformer [10] does not yet support the Slovene language.

4 RESULTS

4.1 Results of the sentence similarity measure with BERT-based models
As stated previously, we used four different BERT-based models to measure the semantic similarity of 40 sentences (20 sentences for each group, coalition and opposition) and visualized the results using heat maps (an example is shown in Figure 1). Initially, we selected well-known BERT-based models optimized for Slovene (the trilingual model CroSloEngual BERT and the monolingual model SloBERTa), but these did not produce reliable results: as shown in Table 2, CroSloEngual [15] and SloBERTa [16] produce extremely high similarity scores since, as we later discovered, they were not fine-tuned for the sentence similarity task.

Table 2: Similarity scores of language models for most similar and least similar sentences

Model           Most similar   Least similar
Sentence-LaBSE  0.6184         0.1235
LaBSE           0.7610         0.3649
mBERT           0.8930         0.5377
Sentence-BERT   0.6677         -0.0792
CroSloEngual    0.9931         0.9480
SloBERTa        0.9867         0.8899

Looking at the distribution of the speeches in the session, almost one third of the speeches belong to the coalition. Both the coalition and the opposition consist of four political parties: LMŠ, Levica, SAB and SD are part of the opposition, all of mostly left and centre-left political orientation. Similarly, the coalition consists of the DeSUS, NSi, SDS and SMC political parties, all mostly right-wing and centre-right parties. (Technically, the opposition consists of 5 political parties, but SNS (Slovenska nacionalna stranka) does not have any speeches in the dataset.)

Table 1: Preprocessed dataset

Sample       Number of documents   Total
Coalition    30 (30.93%)           97
Opposition   67 (69.07%)

4.2 Results of the document similarity with fastText

For the second part of our experiment, we used fastText word embeddings and measured cosine distance to get the semantic similarity score of our documents.
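The mean aggregation behind the fastText-based document embedding can be illustrated as follows. The word vectors here are invented toy values, not real fastText vectors, and the paper's pipeline used Orange's Document Embedding widget rather than custom code:

```python
import numpy as np

# Toy stand-ins for pretrained word vectors; in the actual pipeline these
# come from a fastText model.
word_vecs = {
    "zakon":     np.array([0.9, 0.1, 0.0]),
    "paket":     np.array([0.8, 0.2, 0.1]),
    "ukrepi":    np.array([0.7, 0.3, 0.2]),
    "pokojnine": np.array([0.1, 0.9, 0.3]),
}

def doc_embedding(tokens):
    """Mean-aggregate word vectors into one document vector (OOV words skipped)."""
    vs = [word_vecs[t] for t in tokens if t in word_vecs]
    return np.mean(vs, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

d1 = doc_embedding(["zakon", "paket"])    # law-focused "speech"
d2 = doc_embedding(["zakon", "ukrepi"])   # another law-focused "speech"
d3 = doc_embedding(["pokojnine"])         # off-topic "speech"
print(cosine(d1, d2), cosine(d1, d3))     # on-topic pair scores higher
```

Because the document vector is just a mean, this method has no input-length limit, which is exactly why it could be applied to full speeches where the BERT-based models could not.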
Figure 2 shows the visualized results comparing speeches between coalition and opposition speakers:

Figure 2: Document similarity with fastText, visualized using MDS

Documents (or speeches) are connected closely together; this can be attributed mostly to the fact that they all address the same issue, the adoption of the first epidemic response package. The most similar speeches were made by members of the political parties SDS (coalition) and SD (opposition), followed closely by SMC and Levica. All of these speeches are long and focus on the topic of the session, the proposed law (most speeches include keywords such as "zakon" (law), "zakonski paket" (law package), "amandma" (amendment) and "ukrepi" (measures)).

Figure 1: Example of a heat map using the Sentence-LaBSE model

When comparing the models, it is no surprise that Sentence-LaBSE and Sentence-BERT show very similar results (see Table 2), as they come from the same family of models and thus have a similar architecture (and are both fine-tuned for this specific task). What is interesting is that Sentence-BERT is the only model that produced a negative score for the least similar sentence (a similarity score of -0.0792), while the mBERT model showed the highest similarity scores (outside of CroSloEngual and SloBERTa). Some of the highest-scored sentences showed that speakers from different party statuses tend to use similar language patterns, for example:

Coalition: "Ob hitrem sprejemanju zakona je potrebno zagotoviti, da ne bodo spregledane posamezne ranljive skupine posameznikov."
(Eng. "With the rapid adoption of the law, it is necessary to ensure that individual vulnerable groups of individuals are not overlooked.")

Opposition: "Še enkrat, ostaja še cela vrsta ranljivih skupin v zakonu, ki je nenaslovljena."
(Eng. "Once again, there is a whole range of vulnerable groups in the law that remain unaddressed.")

Outlier detection analysis showed 8 speeches (7 made by the opposition, 1 by the coalition), which are all very short and focus solely on parliamentary procedures. We also observed some trends in the usage of words derived from the word "korona": "koronakriza", "koronazakon", "antikoronazakon", "koronaobveznica", "koronapomoči", "protikoronapaket", etc. (used mostly by the opposition).
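The MDS layouts in Figures 2-4 place documents so that map distances mirror embedding distances. A compact sketch of classical (Torgerson) MDS over a precomputed distance matrix is shown below; note that Orange's MDS uses a stress-based variant, so this is an approximation of the idea, not the exact tool:

```python
import numpy as np

def classical_mds(dist, k=2):
    """Classical (Torgerson) MDS: place n points in k dimensions so their
    pairwise Euclidean distances approximate the input distance matrix."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J          # double-centred Gram matrix
    w, v = np.linalg.eigh(B)                # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]           # keep the top-k components
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Toy check: four "documents" whose true distances come from points on a line.
coords = np.array([0.0, 1.0, 2.0, 3.0])
dist = np.abs(coords[:, None] - coords[None, :])
emb = classical_mds(dist, k=2)  # recovers the line up to rotation/reflection
```

For document similarity one would first convert cosine similarities to distances (e.g. 1 minus similarity) and feed that matrix in; clusters of similar speeches then appear as nearby points, as in the figures.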
In Figure 3, we compared speeches between the members of the opposition. The visualization shows a cluster of similar speeches. Members of Levica seemed to be the most vocal during the session (having more than 50% of all opposition speeches), while also having several similar speeches, with the central sub-topic being the proposed amendments to the law and their financial consequences. The least similar speech was made by Violeta Tomić, a member of Levica, in regard to the date the epidemic was declared.

Figure 3: Document similarity with fastText (opposition)

In Figure 4, we compared speeches between the members of the coalition: the speeches are less connected, with the most similar divided among SDS members, closely connected to the SMC, NSi and DeSUS members. The common sub-topic of all the speeches is the financial crisis as a direct result of the epidemic. Two of the most distant speeches belong to a member of DeSUS (Franc Jurša). Both speeches are among the shortest in the dataset, focusing on the topics of pensions and the registration of a parliamentary group, and are thus not explicitly connected to the central topic of the discourse.

Figure 4: Document similarity with fastText (coalition)

5 CONCLUSIONS

In this paper, we compared language models and word embeddings as methods for measuring the semantic similarity of parliamentary speech. In the initial stages, it turned out that there are not many models that support Slovene as an input language. Those that were made explicitly with Slovene in mind (such as SloBERTa and CroSloEngual BERT) were not fine-tuned for semantic similarity/sentence similarity tasks and thus do not produce accurate results. The limitation on the maximum length of input text that most BERT-based models have is probably one of the biggest disadvantages of language models for semantic similarity measures (this is being alleviated by new emerging language models, such as Longformer, that allow over 1000 tokens as the maximum input length). For the sentence similarity task, language models from the Sentence-BERT family show the highest accuracy and are easier to use than standard BERT models (such as mBERT). Even though BERT contextualizes word embeddings (and might therefore produce better results), fastText solved the problem of input-text length and, combined with the Orange data mining tool, allowed us to explore similarities between speeches as we originally intended. From the document similarity analysis, we saw that most speeches were relatively connected (similar) to one another. Speeches amongst the members of the opposition were more similar compared to the speeches amongst coalition members. There were a few outlier speeches in both the opposition and the coalition; they were all shorter speeches and less related to the original topic of the discourse. For future work, some limitations of this research should first be addressed (e.g., comparing language models to word embedding techniques on a full-text basis) and the experiments repeated with fine-tuned SloBERTa and CroSloEngual models on the full ParlaMint-SI corpora.

REFERENCES

[1] BERT multilingual base model (cased): https://huggingface.co/bert-base-multilingual-cased
[2] Bojanowski, Piotr, Grave, Edouard, Joulin, Armand and Mikolov, Tomas. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146. DOI: https://doi.org/10.1162/tacl_a_00051
[3] Chandrasekaran, Dhivya, and Vijay Mago. 2021. Evolution of Semantic Similarity - A Survey. ACM Computing Surveys, 1-37.
[4] David S. Batista. 2018. Language Models and Contextualised Word Embeddings. https://www.davidsbatista.net/blog/2018/12/06/Word_Embeddings/
[5] FastText - Library for efficient text classification and representation learning. https://fasttext.cc/
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[7] Language-agnostic BERT Sentence Encoder (LaBSE) (Sentence-Transformers): https://huggingface.co/sentence-transformers/LaBSE
[8] Language-agnostic BERT Sentence Encoder (LaBSE): https://huggingface.co/setu4993/LaBSE
[9] Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1. 2021. http://hdl.handle.net/11356/1431
[10] Longformer: https://huggingface.co/docs/transformers/model_doc/longformer
[11] Naderi, Nona, and Graeme Hirst. 2015. Argumentation mining in parliamentary discourse. In Principles and Practice of Multi-Agent Systems, 16-25. https://cmna.csc.liv.ac.uk/CMNA15/paper%209.pdf
[12] Orange: Data Mining Tool for visual programming. https://orangedatamining.com/
[13] ParlaMint: Towards Comparable Parliamentary Corpora. 2020. https://www.clarin.eu/content/parlamint-towards-comparable-parliamentary-corpora
[14] Sentence-BERT (sentence-transformers/distiluse-base-multilingual-cased-v2): https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2
[15] Ulčar, Matej and Robnik-Šikonja, Marko. 2020. CroSloEngual BERT 1.1. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1330
[16] Ulčar, Matej and Robnik-Šikonja, Marko. 2021. Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1397

Indeks avtorjev / Author index

Anastasiou Theodora: 46
Baldouski Daniil: 34
Brecelj Bor: 42
Buza Krisztian: 58
Calcina Erik: 13
Casals del Busto Ignacio: 50
Cerar Gregor: 54
Evkoski Bojan: 30
Fortuna Blaž: 42, 46
Fortuna Carolina: 54
Grobelnik Marko: 5, 9
Gucek Alenka: 50
Keizer Jelle: 46
Komarova Nadezhda: 5
Koprivec Filip: 38
Korenič Tratnik Sebastian: 26
Kralj Novak Petra: 30
Kržmanc Gregor: 38
Kuzman Taja: 17
Ljubešić Nikola: 17, 30
Massri M. Besher: 50
Meden Katja: 61
Mladenić Dunja: 9, 21, 42, 46
Mladenić Grobelnik Adrian: 9
Mocanu Iulian: 50
Mozetič Igor: 30
Mušić Din: 54
Novak Erik: 9, 13, 26
Novalija Inna: 5
Papamartzivanos Dimitrios: 46
Pita Costa Joao: 50
Rossi Maurizio: 50
Rožanec Jože Martin: 42, 46
Santos Costa João: 50
Šircelj Beno: 42
Sittar Abdul: 21
Škrjanc Maja: 38
Tošić Aleksandar: 34
Veliou Entso: 46
Webber Jason: 21
Zevnik Filip: 54