Zbornik 25. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2022
Zvezek C

Proceedings of the 25th International Multiconference INFORMATION SOCIETY – IS 2022
Volume C

Odkrivanje znanja in podatkovna skladišča - SiKDD
Data Mining and Data Warehouses - SiKDD

Urednika / Editors: Dunja Mladenić, Marko Grobelnik

http://is.ijs.si

10. oktober 2022 / 10 October 2022
Ljubljana, Slovenija

Urednika:
Dunja Mladenić, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana

Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič

Dostop do e-publikacije:
http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2022

Informacijska družba
ISSN 2630-371X

Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 127444483
ISBN 978-961-264-243-3 (PDF)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2022

Petindvajseta multikonferenca Informacijska družba je preživela probleme zaradi korone. Zahvala za skoraj normalno delovanje konference gre predvsem tistim predsednikom konferenc, ki so kljub prvi pandemiji modernega sveta pogumno obdržali visok strokovni nivo. Pandemija v letih 2020 do danes skoraj v ničemer ni omejila neverjetne rasti IKT-ja, informacijske družbe, umetne inteligence in znanosti nasploh, ampak nasprotno – rast znanja, računalništva in umetne inteligence se nadaljuje z že kar običajno nesluteno hitrostjo.
Po drugi strani se nadaljuje razpadanje družbenih vrednot ter tragična vojna v Ukrajini, ki lahko pljuskne v Evropo. Se pa zavedanje večine ljudi, da je potrebno podpreti stroko, krepi. Konec koncev je v 2022 v veljavo stopil nov raziskovalni zakon, ki bo izboljšal razmere, predvsem pa leto za letom povečeval sredstva za znanost.

Letos smo v multikonferenco povezali enajst odličnih neodvisnih konferenc, med njimi »Legende računalništva«, s katero postavljamo nov mehanizem promocije informacijske družbe. IS 2022 zajema okoli 200 predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic ter 400 obiskovalcev. Prireditev so spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica (http://www.informatica.si/), ki se ponaša s 46-letno tradicijo odlične znanstvene revije.

Multikonferenco Informacijska družba 2022 sestavljajo naslednje samostojne konference:
• Slovenska konferenca o umetni inteligenci
• Izkopavanje znanja in podatkovna skladišča
• Demografske in družinske analize
• Kognitivna znanost
• Kognitonika
• Legende računalništva
• Vseprisotne zdravstvene storitve in pametni senzorji
• Mednarodna konferenca o prenosu tehnologij
• Vzgoja in izobraževanje v informacijski družbi
• Študentska konferenca o računalniškem raziskovanju
• Matcos 2022

Soorganizatorji in podporniki konference so različne raziskovalne institucije in združenja, med njimi ACM Slovenija, SLAIS, DKZ in druga slovenska nacionalna akademija, Inženirska akademija Slovenije (IAS). V imenu organizatorjev konference se zahvaljujemo združenjem in institucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi. Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju.

S podelitvijo nagrad, še posebej z nagrado Michie-Turing, se avtonomna stroka s področja opredeli do najbolj izstopajočih dosežkov.
Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe je prejel prof. dr. Jadran Lenarčič. Priznanje za dosežek leta pripada ekipi NIJZ za portal zVEM. »Informacijsko limono« za najmanj primerno informacijsko potezo je prejela cenzura na socialnih omrežjih, »informacijsko jagodo« kot najboljšo potezo pa nova elektronska osebna izkaznica. Čestitke nagrajencem!

Mojca Ciglarič, predsednik programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD - INFORMATION SOCIETY 2022

The 25th Information Society Multiconference (http://is.ijs.si) survived the COVID-19 problems. The multiconference survived thanks to the conference chairs who bravely decided to continue with their conferences despite the first pandemic of the modern era. The COVID-19 pandemic from 2020 until now did not slow the growth of ICT, the information society, artificial intelligence and science overall; quite the contrary – the progress of computers, knowledge and artificial intelligence has continued at a fascinating rate.

However, the decay of societal norms seems to continue slowly but surely, along with the tragic war in Ukraine. On the other hand, the awareness of the majority that science and development are the only prospect for a prosperous future is growing substantially. In 2020, a new law regulating Slovenian research was accepted, promoting an increase of funding year by year.

The multiconference is running parallel sessions with 200 presentations of scientific papers at twelve conferences, many round tables, workshops and award ceremonies, and 400 attendees. Among the conferences, "Legends of computing" introduces the "Hall of Fame" concept for computer science and informatics. Selected papers will be published in the Informatica journal, with its 46-year tradition of excellent research publishing.
The Information Society 2022 Multiconference consists of the following conferences:
• Slovenian Conference on Artificial Intelligence
• Data Mining and Data Warehouses
• Cognitive Science
• Demographic and family analyses
• Cognitonics
• Legends of computing
• Pervasive health and smart sensing
• International technology transfer conference
• Education in information society
• Student computer science research conference 2022
• Matcos 2022

The multiconference is co-organized and supported by several major research institutions and societies, among them ACM Slovenia (the Slovenian chapter of the ACM), SLAIS, DKZ and the second national academy, the Slovenian Engineering Academy. In the name of the conference organizers, we thank all the societies and institutions, and particularly all the participants for their valuable contributions and their interest in this event, as well as the reviewers for their thorough reviews.

The award for life-long outstanding contributions is presented in memory of Donald Michie and Alan Turing. The Michie-Turing award was given to Prof. Dr. Jadran Lenarčič for his life-long outstanding contribution to the development and promotion of the information society in our country. In addition, the yearly recognition for current achievements was awarded to NIJZ for the zVEM platform. The information lemon goes to the censorship on social networks. The information strawberry for the best information service last year went to the electronic identity card. Congratulations!
Mojca Ciglarič, Programme Committee Chair
Matjaž Gams, Organizing Committee Chair

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee:
Vladimir Bajic, South Africa; Heiner Benking, Germany; Se Woo Cheon, South Korea; Howie Firth, UK; Olga Fomichova, Russia; Vladimir Fomichov, Russia; Vesna Hljuz Dobric, Croatia; Alfred Inselberg, Israel; Jay Liebowitz, USA; Huan Liu, Singapore; Henz Martin, Germany; Marcin Paprzycki, USA; Claude Sammut, Australia; Jiri Wiedermann, Czech Republic; Xindong Wu, USA; Yiming Ye, USA; Ning Zhong, USA; Wray Buntine, Australia; Bezalel Gavish, USA; Gal A. Kaminka, Israel; Mike Bain, Australia; Michela Milano, Italy; Derong Liu, Chicago, USA; Toby Walsh, Australia; Sergio Campos-Cordobes, Spain; Shabnam Farahmand, Finland; Sergio Crovella, Italy

Organizing Committee:
Matjaž Gams, chair; Mitja Luštrek; Lana Zemljak; Vesna Koricki; Mitja Lasič; Blaž Mahnič

Programme Committee:
Mojca Ciglarič, chair; Bojan Orel; Franc Solina; Viljan Mahnič; Cene Bavec; Tomaž Kalin; Jozsef Györkös; Tadej Bajd; Jaroslav Berce; Mojca Bernik; Marko Bohanec; Ivan Bratko; Andrej Brodnik; Dušan Caf; Saša Divjak; Tomaž Erjavec; Bogdan Filipič; Andrej Gams; Matjaž Gams; Mitja Luštrek; Marko Grobelnik; Nikola Guid; Marjan Heričko; Borka Jerman Blažič Džonova; Gorazd Kandus; Urban Kordeš; Marjan Krisper; Andrej Kuščer; Jadran Lenarčič; Borut Likar; Janez Malačič; Olga Markič; Dunja Mladenič; Franc Novak; Vladislav Rajkovič; Grega Repovš; Ivan Rozman; Niko Schlamberger; Stanko Strmčnik; Jurij Šilc; Jurij Tasič; Denis Trček; Andrej Ule; Boštjan Vilfan; Baldomir Zajc; Blaž Zupan; Boris Žemva; Leon Žlajpah; Niko Zimic; Rok Piltaver; Toma Strle; Tine Kolenik; Franci Pivec; Uroš Rajkovič; Borut Batagelj; Tomaž Ogrin; Aleš Ude; Bojan Blažica; Matjaž Kljun; Robert Blatnik; Erik Dovgan; Špela Stres; Anton Gradišek

KAZALO / TABLE OF CONTENTS

Odkrivanje znanja in podatkovna skladišča - SiKDD / Data Mining and Data Warehouses - SiKDD
PREDGOVOR / FOREWORD
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES
Emotion Recognition in Text using Graph Similarity Criteria / Komarova Nadezhda, Novalija Inna, Grobelnik Marko
SLOmet – Slovenian Commonsense Description / Mladenić Grobelnik Adrian, Novak Erik, Grobelnik Marko, Mladenić Dunja
Measuring the Similarity of Song Artists using Topic Modelling / Calcina Erik, Novak Erik
Exploring the Impact of Lexical and Grammatical Features on Automatic Genre Identification / Kuzman Taja, Ljubešić Nikola
Stylistic features in clustering news reporting: News articles on BREXIT / Sittar Abdul, Webber Jason, Mladenić Dunja
Automatically Generating Text from Film Material – A Comparison of Three Models / Korenič Tratnik Sebastian, Novak Erik
The Russian invasion of Ukraine through the lens of ex-Yugoslavian Twitter / Evkoski Bojan, Mozetič Igor, Kralj Novak Petra, Ljubešić Nikola
Visualization of consensus mechanisms in PoS based blockchain protocols / Baldouski Daniil, Tošić Aleksandar
Using Machine Learning for Anti Money Laundering / Kržmanc Gregor, Koprivec Filip, Škrjanc Maja
Forecasting Sensor Values in Waste-To-Fuel Plants: a Case Study / Brecelj Bor, Šircelj Beno, Rožanec Jože Martin, Fortuna Blaž, Mladenić Dunja
Machine Beats Machine: Machine Learning Models to Defend Against Adversarial Attacks / Rožanec Jože Martin, Papamartzivanos Dimitrios, Veliou Entso, Anastasiou Theodora, Keizer Jelle, Fortuna Blaž, Mladenić Dunja
Addressing climate change preparedness from a smart water perspective / Gucek Alenka, Pita Costa Joao, Massri M.Besher, Santos Costa João, Rossi Maurizio, Casals del Busto Ignacio, Mocanu Iulian
SciKit Learn vs Dask vs Apache Spark Benchmarking on the EMINST Dataset / Zevnik Filip, Fortuna Carolina, Mušić Din, Cerar Gregor
An Efficient Implementation of Hubness-Aware Weighting Using Cython / Buza Krisztian
Semantic Similarity of Parliamentary Speech using BERT Language Models & fastText Word Embeddings / Meden Katja
Indeks avtorjev / Author index

PREDGOVOR

Tehnologije, ki se ukvarjajo s podatki, so v devetdesetih letih močno napredovale. Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami, prišlo je do standardizacije procesov, povpraševalnih jezikov itd. Ko shranjevanje podatkov ni bilo več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskemu procesiranju, ampak tudi analitskim vpogledom v podatke – pojavilo se je t. i. skladiščenje podatkov (data warehousing), ki je postalo standarden del informacijskih sistemov v podjetjih. Paradigma OLAP (On-Line Analytical Processing) zahteva od uporabnika, da še vedno sam postavlja sistemu vprašanja, dobiva nanje odgovore ter na vizualen način preverja in išče izstopajoče situacije. Ker seveda to ni vedno mogoče, se je pojavila potreba po avtomatski analizi podatkov oz.
z drugimi besedami po tem, da sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj, zajetih v podatkih. Slovenska KDD konferenca pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve.

FOREWORD

Data-driven technologies have progressed significantly since the mid-90s. The first phase focused mainly on storing and efficiently accessing the data; it resulted in the development of industry tools for managing large databases, related standards, supporting query languages, etc. After this initial period, when data storage was no longer a primary problem, development progressed towards analytical functionality for extracting added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data. At this point, data warehousing with On-Line Analytical Processing entered as a usual part of a company's information system portfolio, requiring the user to pose well-defined questions about aggregated views of the data. Data Mining is a technology developed after the year 2000, offering automatic data analysis that tries to obtain new discoveries from the existing data and gives the user new insights into the data. In this respect, the Slovenian KDD conference (SiKDD) covers a broad area including Statistical Data Analysis; Data, Text and Multimedia Mining; Semantic Technologies; Link Detection and Link Analysis; Social Network Analysis; and Data Warehouses.

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Janez Brank, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Jožef Stefan Institute, Ljubljana
Jakob Jelenčič, Jožef Stefan Institute, Ljubljana
Branko Kavšek, University of Primorska, Koper
Besher M.
Massri, Jožef Stefan Institute, Ljubljana
Dunja Mladenić, Jožef Stefan Institute, Ljubljana
Erik Novak, Jožef Stefan Institute, Ljubljana
Inna Novalija, Jožef Stefan Institute, Ljubljana
Jože Rožanec, Qlector, Ljubljana
Abdul Sitar, Jožef Stefan Institute, Ljubljana
Luka Stopar, Sportradar, Ljubljana
Swati Swati, Jožef Stefan Institute, Ljubljana

Emotion Recognition in Text using Graph Similarity Criteria

Nadezhda Komarova, Inna Novalija, Marko Grobelnik
Jožef Stefan Institute
Jamova cesta 39, Ljubljana, Slovenia
nadezhdakomarova7@gmail.com

ABSTRACT

In this paper, a method of classifying text into several emotion categories employing different measures of similarity of two graphs is proposed. The emotions utilized are happiness, sadness, fear, surprise, anger and disgust, with the latter two joined into one category. The method is based on representing a text as a graph of 𝑛-grams; the results presented in the paper are obtained using the value of 5 for 𝑛: the 𝑛-grams were sequences of 5 characters. The graph representation of the text was constructed by observing which 𝑛-grams occur close together in the text; additionally, the frequencies of their connections were utilized to assign edge weights. To classify the text, the graph was compared with several emotion category graphs based on different graph similarity criteria. These criteria relate to common vertices, edges, and the maximum common subgraphs. The evaluation of the model on the test data set shows that utilizing the construction of the maximum common subgraph to obtain the graph similarity measure results in more accurate predictions. Additionally, employing the number of common edges as a graph similarity criterion yielded more accurate results compared to employing the number of common vertices to measure the similarity between the two graphs.

KEYWORDS

emotion recognition, text classification, machine learning, graphs, graph similarity

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia
© 2022 Copyright held by the owner/author(s).

1 INTRODUCTION

Emotion recognition is a problem that can be connected to different fields such as natural language processing, computer vision, deep learning, etc. [4] In this paper, the focus is on the task of recognizing emotions in texts.

In the literature, several approaches have been introduced that target this problem. Some of them employ word embedding vectors for emotion detection and recognition from text. The embedding vectors grasp the information related to semantics and syntax; however, a limitation of such approaches is that they do not capture the emotional relationship that exists between words. Some methods attempting to alleviate this issue include building a neural network architecture adopting pre-trained word representations. [3] Some text classification approaches employ 𝑛-grams to construct the text representation, e.g., to deal with the task of language identification. [9]

In this paper, the approach to emotion recognition employs 𝑛-grams to obtain a graph representation of text. The text is viewed as a sequence of characters that is divided into 𝑛-grams, i.e., shorter overlapping sequences of characters, as presented in Figure 1.

In Section 2, it is further explained how the graph of 𝑛-grams is constructed for a given text and how an emotion label is assigned to the text based on the similarity with the emotion category graphs. Afterwards, in Section 3, the method is compared with related approaches. In Section 4, an overview of results is focused on differences between the performance of the model when different graph similarity criteria are used. It is followed by the discussion of the model's limitations in Section 5.

2 PROPOSED METHOD

2.1 Constructing the Graph of 𝑛-grams

The method used in the paper to obtain the text representation in the form of the graph of 𝑛-grams is the following.
• The given text was separated into 𝑛-grams of characters. Different values of 𝑛 have been tested; the results in Section 4 use 𝑛 = 5. The 𝑛-grams into which the given text was split were overlapping.
• The 𝑛-grams obtained in this way were utilized to represent the labels of vertices of the graph.
• The edges of the graph were created in the following manner. The ends of edges were the vertices that corresponded to 𝑛-grams that occurred close to each other in the text, e.g., an edge connects the first 𝑛-gram at the beginning of the text with the second 𝑛-gram (these two 𝑛-grams would overlap with each other), as seen in Figure 1. Different values have been tested for the maximal distance between two vertices allowed for these two vertices to still be connected with an edge; the results in Section 4 use the value of 7.
• Performance of the model with both the directed and the undirected graphs has been tested.

Figure 1: Constructing the edges between the 5-grams that occur close to each other

In Figure 2, it is depicted how the edges are constructed between the vertices labelled with 𝑛-grams. For the clarity of representation, each 𝑛-gram is shown connected to 3 other 𝑛-grams instead of 7. It is important to note that if the same 𝑛-gram occurred in the text more than once, there was still only one vertex with this 𝑛-gram as a label: the connections of the 𝑛-gram have been aggregated at a single vertex.

Additionally, the graph constructed is weighted. The weights
of the edges are obtained utilizing the frequencies of connections of 𝑛-grams in the given text. In other words, the edge weights are initialized to 0; then, when constructing the graph of 𝑛-grams for a text, every time a certain edge would be added, the weight of the edge is instead increased by 1. Afterwards, the edge weights are normalized to be in the range (0, 1); hence, the edge weights are more comparable among the graphs of 𝑛-grams for different texts.

Figure 2: Constructing the edges between the 5-grams in the text fragment "oh how funny"

2.2 Constructing the Emotion Category Graphs

The core of the method is the construction of the graph of 𝑛-grams as described in Section 2.1. In the data set used to tune the model, there were shorter texts labelled with one of the following 5 emotions: happy, sad, surprised, fearful, or angry-disgusted. Overall, there were 1207 sentences included in the data set; out of this, the model was trained using 1086 sentences (to construct the emotion category graphs) and evaluated on 121 sentences (the split proportion is 90 : 10).

The process of obtaining the emotion category graphs is presented below.
(1) The data set was split into 5 parts, each containing only the texts labelled with the same emotion.
(2) Then, the texts in each part of the data set were used to obtain 5 graphs corresponding to each emotion.
(a) This process can be viewed as constructing, for each text labelled with a certain emotion, the graph of 𝑛-grams as explained in Section 2.1.
(b) Afterwards, these graphs are merged separately for different emotions to obtain 5 larger graphs of 𝑛-grams; during the merging process, the edges are aggregated in such a way that there are not any two vertices in the emotion category graph sharing the same label (the character 𝑛-gram to which they correspond).

2.3 Assigning an Emotion to a Given Text

Utilizing the 5 emotion category graphs corresponding to different emotions, it is determined for a given text to which emotion the text most likely corresponds. For that, the pairwise similarity measures of the graph of the given text and of the 5 emotion category graphs are employed. In other words, it is tested to which of the 5 graphs the graph of the given text is most similar, and the corresponding emotion is assigned to the given text.

Several similarity criteria of the two graphs have been explored.
(1) The number of vertices common to both graphs: the vertices are considered common if they share the same label (the 𝑛-gram they represent) in both graphs.
(2) The number of edges common to both graphs: an edge is considered common if the same vertices (vertices with the same labels) are the endpoints of the edge in both graphs and the edge weights are the same.
(3) The number of vertices in the maximum common subgraph (MCS) of the two graphs. Finding the maximum common subgraph is equivalent to finding a graph with the maximum number of vertices such that it is a subgraph of each of the two graphs. [8]
(4) The number of edges in the maximum common subgraph (MCS) of the two graphs.
(5) 𝑧 = 𝑚(𝑚 − 1)/2 − 𝑒, where 𝑚 denotes the number of vertices in the maximum common subgraph of the two graphs, and 𝑒 denotes the number of edges in the maximum common subgraph.

3 RELATED WORK

In the literature describing related approaches to text classification and emotion recognition, deep learning models are often utilized to obtain high-quality predictions. [7]

Apart from the approaches that employ word embedding vectors [6], there are also methods that connect neural networks and graphs. Such approaches may be similar to the method described in this paper, since the graph representation of text may be obtained in a similar way based on the semantic connections between words. One example of this kind of model is the graph neural network that is enhanced by utilizing BERT to obtain semantic features. [11]

The crucial part of the method in this paper is the graph similarity criterion that is used when comparing the graph of the given text with different emotion category graphs. In a similar way as the construction of the maximum common subgraph is used in this method, it can be employed in combination with probabilistic classifiers. [10] The approach in this paper, on the other hand, does not employ probabilistic classifiers such as Bayes classification or Support Vector Machines. [2] Instead, the emotion for which the similarity measure between the corresponding emotion category graph and the graph of the given text is maximised is assigned to the text.

Additionally, it is important to note that it is possible to incorporate alternative graph similarity criteria, e.g., related to subgraph matching, edit distance, belief propagation, etc. [5]

4 RESULTS

4.1 Experimental Setup

The data set used to train and evaluate the model was the one distributed by Cecilia Ovesdotter Alm. [1] It included sentences each labelled with one of the following emotions: happiness, sadness, fear, surprise, anger, and disgust. The latter two emotions were joined into one category.

During the evaluation stage, a corresponding emotion was predicted for each sentence; e.g., the text "then the servant was greatly frightened and said it may perhaps be only a cat or a dog" was labelled fearful, while the text "he looked very jovial did little work and had the more holidays" was recognized to be related to the emotion of happiness.

The value of 𝑛 that appeared to yield the best results, and was also used to obtain the results in Tables 1 and 2, was 5. Furthermore, each 5-gram (except those at the end of the text) is connected to 7 5-grams further in the text.

Table 1: Results of text classification using directed graphs

Similarity criterion    Accuracy  Precision  Recall  F1
Common vertices         0.488     0.506      0.332   0.323
Common edges            0.537     0.683      0.408   0.432
z                       0.372     0.074      0.200   0.108
Vertices in the MCS     0.570     0.622      0.426   0.446
Edges in the MCS        0.579     0.625      0.454   0.478

Table 2: Results of text classification using undirected graphs

Similarity criterion    Accuracy  Precision  Recall  F1
Common vertices         0.488     0.506      0.332   0.323
Common edges            0.554     0.669      0.429   0.460
z                       0.372     0.074      0.200   0.108
Vertices in the MCS     0.545     0.527      0.399   0.406
Edges in the MCS        0.570     0.581      0.439   0.453

Table 3: Confusion matrix: directed graph, number of edges in the MCS as the similarity criterion

Actual/pred.   Happy  Fearful  Surpr.  Sad  Angry-Disg.
Happy          43     1        0       0    1
Fearful        7      6        1       3    0
Surprised      6      1        2       1    1
Sad            12     1        0       12   1
Angry-Disg.    11     2        0       2    7

Table 4: Confusion matrix: undirected graph, number of edges in the MCS as the similarity criterion

Actual/pred.   Happy  Fearful  Surpr.  Sad  Angry-Disg.
Happy          42     1        0       1    1
Fearful        8      6        1       2    0
Surprised      6      1        1       1    2
Sad            11     1        0       13   1
Angry-Disg.    11     2        0       2    7

Table 5: Confusion matrix: directed graph, number of common edges as the similarity criterion

Actual/pred.   Happy  Fearful  Surpr.  Sad  Angry-Disg.
Happy          42     1        0       2    0
Fearful        10     4        0       3    0
Surprised      6      0        2       3    0
Sad            13     0        1       12   0
Angry-Disg.    16     1        0       0    5

In Tables 1 and 2, the "common edges" criterion means that two edges from both graphs are considered common if they have the same weight and the same endpoints.
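The graph construction of Section 2.1 and this edge-commonality check can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `ngram_graph` and `common_edges` are hypothetical, and dividing by the maximum frequency is one plausible reading of the normalization into (0, 1).

```python
from collections import Counter

def ngram_graph(text, n=5, window=7):
    """Weighted directed graph of character n-grams (Section 2.1 sketch).

    Vertices are the distinct n-grams; an edge links two n-grams whose
    start positions are at most `window` apart, weighted by the frequency
    of that pair. Weights are then scaled by the maximum frequency so
    they fall into (0, 1]; this normalization choice is an assumption.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    weights = Counter()
    for i, gram in enumerate(grams):
        for j in range(i + 1, min(i + window + 1, len(grams))):
            weights[(gram, grams[j])] += 1  # repeats aggregate at one edge
    top = max(weights.values(), default=1)
    return {edge: w / top for edge, w in weights.items()}

def common_edges(g1, g2):
    """Criterion (2): count edges with identical endpoints and weight."""
    return sum(1 for edge, w in g1.items() if g2.get(edge) == w)
```

On the fragment used in Figure 2, `ngram_graph("oh how funny")` yields a graph over its eight distinct 5-grams, and `common_edges` of that graph with itself simply counts all of its edges.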
Additionally, in Table 1, 𝑧 denotes the difference between the actual number of edges in the maximum common subgraph and the number of edges in the complete graph with 𝑚 vertices, where 𝑚 is the number of vertices in the maximum common subgraph.

In the trials that yielded the results in Table 1, the edges were directed, and in the trials that yielded the results in Table 2, the edges were undirected.

Table 6: Confusion matrix: undirected graph, number of common edges as the similarity criterion

Actual/pred.   Happy  Fearful  Surpr.  Sad  Angry-Disg.
Happy          41     1        0       2    1
Fearful        11     4        0       2    0
Surprised      6      0        2       3    0
Sad            12     0        1       13   0
Angry-Disg.    14     1        0       0    7

4.2 Analysis

From the results in Tables 1 and 2, it may be noticed that the highest accuracy on the test data set was achieved when the number of edges in the maximum common subgraph was used as the similarity measure. In Table 1, the second highest accuracy was achieved when the number of vertices in the maximum common subgraph was utilized. From this, it may be observed that the construction of the maximum common subgraph reflects the similarity better in certain cases; a possible reason may be that deeper semantic relationships can be captured this way, since connections between multiple 𝑛-grams are considered at the same time.

In Tables 3 and 4, the confusion matrices are presented for the trials when the number of edges in the maximum common subgraph was used as the criterion of graph similarity. From Tables 1 and 2, it is evident that this similarity criterion corresponded to the highest accuracy of predictions for both undirected and directed graphs. However, the accuracy corresponding to this similarity criterion is higher when the graphs are directed (0.579 compared to 0.570).

Furthermore, the accuracy corresponding to the similarity criterion being the number of common edges (considering both the endpoints and the weight of the edge) is higher by 0.017 when the graphs are undirected than when the graphs are directed (0.554 compared to 0.537). When the graphs utilized are undirected, the model might be more flexible regarding the exact order of the words that occur together. In Tables 5 and 6, confusion matrices are presented for the trials when the number of edges common to both graphs, considering the endpoints and the weights of the edges, was used as the criterion of graph similarity.

5 DISCUSSION

A strength of the approach presented in this paper is the ability to capture the context of the given words on different levels; this is related to the process of constructing the edges of the graph by connecting 𝑛-grams that occur together in the text. Additionally, the breadth of the contextual frame considered may be varied by altering the number of 𝑛-grams with which a certain 𝑛-gram is connected when constructing the edges.

However, overall, the accuracy values noted in Tables 1 and 2 were not very high, possibly indicating that the training data set was not large enough. Moreover, the data set did not include texts corresponding to different emotions in even proportions, resulting in an imbalance which could also have had a detrimental influence on the quality of predictions. The confusion matrices (Tables 3, 4, 5, and 6) indicate, e.g., that texts were often falsely assigned the emotion of happiness, since it was the most abundant class in the data set.

One of the limitations of the design of the model described is that, although it may be reasonable to expect that training the model (obtaining the emotion category graphs) on a larger corpus of texts is needed to obtain more accurate predictions on the test data set, this may bring a significant rise in computational complexity, since the category graphs would possess significantly larger numbers of vertices and edges. This is especially important if the maximum common subgraphs are constructed when obtaining a similarity measure, since for each text in the test data set, a maximum common subgraph would have to be constructed several times: between the graph of 𝑛-grams for a given text and each emotion category graph (5 such graphs in this case).

A possible solution to the problem of having too large category graphs might be reducing the length of 𝑛-grams, i.e., using smaller values of 𝑛, and hence reducing the number of vertices in the graph. Also, reducing the number of 𝑛-grams with which a certain 𝑛-gram is connected when constructing the edges of the graph may be investigated as a possible solution. However, if this value is too low, too much contextual information may be lost; therefore, it appears necessary that for each value of 𝑛, the optimal number of 𝑛-grams with which a certain 𝑛-gram is connected is determined experimentally.

6 CONCLUSION

In this paper, a model that utilizes graph similarity criteria to classify a given text into one of the emotion categories is described. The core of the method is to construct a graph of 𝑛-grams for a given text and to compare this graph to each of the emotion category graphs. The text is classified into the emotion category whose graph yielded the highest similarity value when compared to the graph of the given text.

To conclude, future work on the task of emotion recognition related to the proposed method may, on the one hand, be focused on employing alternative graph similarity measures in addition to those described in this paper, e.g., those connected to deriving the edit distance or to belief propagation. [5] Furthermore, clustering algorithms may be used to obtain the patterns characteristic of the emotion categories and further employ them for the emotion recognition task. To this end, both vertex clustering algorithms as well as the clustering of graphs as objects might be utilized. Additionally, a graph neural network architecture may be built along with incorporating the graphs of 𝑛-grams as the input for the network.

7 ACKNOWLEDGEMENTS

This work was supported by the Slovenian Research Agency under the project J2-1736 Causalify and the European Union through the Odeuropa EU H2020 project under grant agreement No 101004469.

REFERENCES

[1] Alm, E. C. O. Affect in text and speech, 2008.
[2] Bahritidinov, B., and Sanchez, E. Probabilistic classifiers and statistical dependency: The case for grade prediction. pp. 394–403.
[3] Batbaatar, E., Li, M., and Ryu, K. H. Semantic-emotion neural network for emotion recognition from text. IEEE Access 7 (2019), 111866–111878.
[4] Guo, J. Deep learning approach to text analysis for human emotion detection from big data. Journal of Intelligent Systems 31, 1 (2022), 113–126.
[5] Koutra, D., Ramdas, A., Parikh, A., and Xiang, J. Algorithms for graph similarity and subgraph matching, 2011.
[6] Li, S., and Gong, B. Word embedding and text classification based on deep learning methods. MATEC Web of Conferences 336 (2021), 06022.
[7] Prasanna, P., and Rao, D. Text classification using artificial neural networks. International Journal of Engineering and Technology (UAE) 7 (2018), 603–606.
[8] Quer, S., Marcelli, A., and Squillero, G. The maximum common subgraph problem: A parallel and multi-engine approach. Computation 8, 2 (2020), 48.
[9] Tromp, E., and Pechenizkiy, M. Graph-based n-gram language identification on short texts. Proceedings of Benelearn 2011 (2011), 27–34.
[10] Violos, J., Tserpes, K., Varlamis, I., and Varvarigou, T. Text classification using the n-gram graph representation model over high frequency data streams. Frontiers in Applied Mathematics and Statistics 4 (2018).
[11] Yang, Y., and Cui, X. Bert-enhanced text graph neural network for classification. Entropy (Basel) 23 (2021).
From the results of the trials noted in Tables 1 and 2, it may be concluded that among the graph similarity criteria described, that number of edges in the maximum common subgraph resulted in the highest quality of predictions. Furthermore, it may also be noted that employing the number of edges common to both graphs resulted in higher prediction accuracy than using the number of common vertices (0.537 and 0.488 accuracy for the directed graphs). This may appear to be intuitively reasonable as using edges may seem to incorporate more contextual information. Addition- ally, it may be important to investigate the effect of the difference between the size of the graph of 𝑛-gram for the given text and the size of the emotion category graph on the probability that the same connections between the two 𝑛-grams are found in both graphs. Moreover, it may be more probable that the same vertices 8 SLOmet – Slovenian Commonsense Description Adrian Mladenic Erik Novak Dunja Mladenic Marko Grobelnik Grobelnik Department for Artificial Department for Artificial Department for Artificial Intelligence, Intelligence, Intelligence, Department for Artificial Jozef Stefan Institute, Jozef Stefan Institute, Jozef Stefan Institute Intelligence, Jozef Stefan International Ljubljana Slovenia Ljubljana Slovenia Jozef Stefan Institute Postrgraduate School dunja.mladenic@ijs.si marko.grobelnik@ijs.si Ljubljana Slovenia Ljubljana Slovenia adrian.m.grobelnik@ijs.si erik.novak@ijs.si ABSTRACT English, we anticipate a noticeable drop in performance across all metrics for the Slovenian language models. This paper presents Slovenian commonsense description models The main contributions of this paper are (1) the comparison based on the COMET framework for English. Inspired by of the performance of commonsense description models using MultiCOMETs approach to multilingual commonsense description, we finetune two Slovenian GPT-2 language models. 
different Slovenian language models and the English model, (2) a Experimental evaluation based on several performance metrics comprehensive evaluation using a variety of performance metrics. shows comparable performance to the original COMET GPT-2 An additional contribution (3) is the Slovene ATOMIC-2020 model for English. dataset acquired by machine translation from the original English dataset [6]. KEYWORDS The rest of this paper is organized as follows: Section 2 deep learning, commonsense reasoning, multilingual natural provides the data description. Section 3 describes the problem and language processing, slovenian language model, gpt-2 the experimental setting. Section 4 exhibits our evaluation results. The paper concludes with discussion and directions for future work 1 Introduction in Section 5. Recent research [1] into commonsense representation and reasoning in the field of natural language understanding has 2 Data Description demonstrated promising results for automatic commonsense To train the Slovenian commonsense description models, we use generation. Given a simple sentence or common entity, such data from the ATOMIC-2020 dataset, as proposed in the COMET technology can generate plausible commonsense descriptions framework for English. The ATOMIC-2020 dataset consists of relating to it. However, further testing on complex sentences, English sentences and entities, labelled by up to 23 commonsense uncommon entities, or by increasing the quantity of requested relation types describing their semantics. commonsense descriptions usually gives nonsensical results. Following the recent success on the automatic generation of commonsense descriptions proposed in COMET-ATOMIC 2020 [1], we focus on extending the COMET framework to the Slovenian language. We investigate the impact of different Slovenian language models on the overall performance of commonsense description generation. 
In our previous research [2], we expanded on an existing approach for automatic knowledge base construction in English [3] to work on different languages. We utilized the original ATOMIC dataset [4]. This was performed by finetuning the original English GPT model from COMET 2019 on automatically translated Slovenian data and evaluated based on exact overlap for the generated commonsense descriptions. Evaluations were performed on a small subset of 100 sentences. In this work we use the updated ATOMIC-2020 dataset [1] and finetune two Slovenian GPT-2 language models. We evaluate the models’ performance using several performance metrics including BLEU, CIDEr, METEOR and ROUGE-L. The evaluation is performed on several thousand sentences and entities; we Figure 1 Close-up of “Event-Centered” descriptor values investigate how the predicted commonsense descriptions’ predicted for an example Slovene sentence “PersonX is sad” performance relates to the language model used. Furthermore, (“OsebaX je žalostna” in Slovenian) given the complexity of the Slovenian language compared to 9 We refer to them as descriptors, 9 of which are identical to METEOR — Metric for Evaluation of Translation with those used in our previous research [2]. The 23 descriptors are Explicit Ordering is a metric initially used for evaluating machine organized into 3 categories: “Physical-Entity”, “Event-Centered”, translation input. The metric is based on the harmonic mean of and “Social-Interaction”. The “Physical-Entity” descriptors capture unigram precision and recall with other features such as stemming knowledge about the usage, location, content, and other properties and synonymy matching. [10] of objects. The “Event-Centered” descriptors include IsAfter, Causes and other descriptors describing events. 
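To make the data format concrete, a descriptor-labelled example of the kind described above can be represented as a mapping from descriptors to open-text values and flattened into (head, relation, tail) training triples before finetuning. A minimal sketch; the dictionary contents and the flattening convention are illustrative assumptions, not verbatim dataset rows:

```python
# Sketch: an ATOMIC-2020-style labelled example, flattened into
# (head, relation, tail) triples. Values below are illustrative.

def flatten(head, annotations):
    """Turn {descriptor: [values]} into (head, descriptor, value) triples."""
    return [
        (head, descriptor, value)
        for descriptor, values in annotations.items()
        for value in values
    ]

example = {
    "xWant": ["catch the rabbit", "cook the rabbit for dinner"],
    "IsBefore": ["PersonX pets the rabbit"],  # illustrative value
}

pairs = flatten("PersonX chases the rabbit", example)
print(len(pairs))  # 3
print(pairs[0])    # ('PersonX chases the rabbit', 'xWant', 'catch the rabbit')
```

Each triple then becomes one input/output training example for the language model, with the head and relation as the prompt and the tail as the target.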
The “Social-Interaction” descriptors include xIntent, xNeed and oReact, distinguishing between causes and effects in social settings. An example of a part of a labelled sentence is shown in Figure 1.

Sentences and entities were manually labelled by human workers on Amazon Turk; they were assigned open-text values for the 23 commonsense descriptors, reflecting the workers’ subjective commonsense knowledge. For instance, when workers were given the sentence “PersonX chases the rabbit” and asked to label it for the “xWant” descriptor, one wrote “catch the rabbit” and another wrote “cook the rabbit for dinner”. A more detailed explanation can be found in the ATOMIC-2020 paper. There are 1.33 million (possibly repeating) descriptor values. The distribution of data across the descriptors is depicted in [1].

To finetune our Slovenian language models, we automatically translated the sentences, entities, and descriptor values of the ATOMIC-2020 dataset from English to Slovenian. The translation was done using DeepL’s Translate API [7]. We found that while the majority of inspected translations were of good quality, there were also incorrect translations due to word disambiguation problems. Nevertheless, we conclude that the dataset is of good enough quality to be used for our experiments. The translated dataset is publicly available [6].

3 Problem Description and Experimental Setting
The addressed problem is predicting the most likely values for each descriptor in the Slovene-translated ATOMIC-2020 dataset, given a Slovenian input sentence or entity. We take inspiration from the approach proposed in MultiCOMET [2].

To compare the performance of the models, we utilize a variety of performance metrics described below. Each performance metric is a value between 0 and 1 indicating the quality of a generated commonsense descriptor value; values closer to 1 represent higher-quality descriptions.

BLEU — Bilingual Evaluation Understudy was first used to evaluate the quality of machine-translated text by examining the overlap of candidate text n-grams with the reference text. BLEU-1 only uses 1-grams in the evaluation, while BLEU-4 only considers 4-grams. [8]

CIDEr — Consensus-based Image Description Evaluation was originally used to measure image description quality. It first transforms all n-grams to their root form, then calculates the average cosine similarity between the candidate and reference TF-IDF vectors. [9]

METEOR — Metric for Evaluation of Translation with Explicit ORdering is a metric initially used for evaluating machine translation output. The metric is based on the harmonic mean of unigram precision and recall, with further features such as stemming and synonymy matching. [10]

ROUGE-L — Recall-Oriented Understudy for Gisting Evaluation is a metric used for evaluating machine-produced summaries or translations against a set of human-produced references. The score is calculated using Longest Common Subsequence based statistics, which involve finding the longest subsequence common to all sequences in a set. [11]

Comparison of the Slovene commonsense models was performed by finetuning two state-of-the-art Slovene GPT-2 language models: macedonizer/sl-gpt2 [12] and gpt-janez [13]. As a reference model, we used the original COMET-2020 GPT2-XL English language model [1]. Moving forward, we will refer to our Slovenian finetuned models as “COMET sl-gpt2” and “COMET gpt-janez”.

4 Experimental Results
We performed a train, test, and development split on the ATOMIC-2020 dataset identical to the split used in COMET-2020. Our evaluation split consisted of over 150,000 descriptor values with their corresponding sentences and entities.

We finetuned our Slovene commonsense models on our training set consisting of over 1 million descriptor values. Both models were trained for 3 epochs under the same parameters: the maximum input length was set to 50, the maximum output length was set to 80, and the training was performed using a train batch size of 64. The model updates were performed using the weighted Adam optimizer [14] with the starting learning rate set to 10⁻⁵. The experiment’s implementation can be found in our GitHub repository [5].

Table 1: Comparison of the two Slovene commonsense models with the English model at the bottom.

Model            | Language | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR | ROUGE-L
COMET sl-gpt2    | Slovene  | 0.297  | 0.150  | 0.086  | 0.058  | 0.487 | 0.207  | 0.383
COMET gpt-janez  | Slovene  | 0.324  | 0.174  | 0.108  | 0.076  | 0.508 | 0.225  | 0.397
COMET (GPT2-XL)  | English  | 0.407  | 0.248  | 0.171  | 0.124  | 0.653 | 0.292  | 0.485

Experimental results show performance comparable to the original COMET-2020 English model. While both Slovene models were comparable to the English model across all metrics, “COMET gpt-janez” consistently outperformed “COMET sl-gpt2”, achieving a METEOR score of 0.225 compared to 0.207. The performance gap was smallest for BLEU-4, as all models struggled to produce generations whose 4-grams overlapped with those in the reference set. The gap in performance between the Slovene and English models could be attributed to multiple factors: the English model from COMET-2020 is larger and was trained for longer on more capable hardware, and the machine translation done to acquire our dataset can be erroneous at times.

To illustrate the performance of the models, we investigate their generated descriptor values on the same inputs. Table 2 shows a side-by-side example comparison of the descriptor values generated by our three models, given the same input sentence in their respective language. Table 3 compares the models on an example entity.

For the example sentence “Marko went to the shop”, the descriptor “oWant” indicates what the others want as a result of the event. “COMET gpt-janez” generates a valid output “None” but fails to provide alternatives. The other two models agree on the most likely descriptor value being “None” (“nič” in Slovenian) and provide plausible alternatives. The “IsBefore” descriptor relates to possible events following the input event. In our case, “COMET gpt-janez” gives the most plausible output of “Buys something”. The other two models provide still plausible outputs, including “Is in the pet store” and “PersonX buys a new car”.

Table 2: Illustrative example comparing the output of the three models on the same input sentence across two descriptors.

Marko je šel v trgovino (Marko went to the shop)
Descriptor | COMET sl-gpt2                   | COMET gpt-janez | COMET (GPT2-XL)
oWant      | Nič                             | Nič             | None
           | Se zahvaliti osebiX             | Nič             | To give him a receipt
           | se zahvaliti                    | Nič             | To give him a discount
IsBefore   | Zaslužiti denar                 | Kupiti nekaj    | PersonX buys a new car
           | V trgovino za hišne ljubljenčke | Kupiti nekaj    | PersonX takes the car back home
           | V trgovino z živili             | Kupiti nekaj    | PersonX buys a new one

Table 3: Illustrative example comparing the output of the three models on the same input entity across two descriptors.

Avto (car)
Descriptor  | COMET sl-gpt2     | COMET gpt-janez     | COMET (GPT2-XL)
ObjectUse   | Vožnja do trgovine | Priti do hiše      | Drive to the store
            | Vožnja do hiše     | Priti do hiše      | Get to the store
            | Vožnja do cilja    | Priti do hiše      | Drive to the restaurant
HasProperty | Noro               | Najden v avtomobilu | Found in parking lot
            | Vrata              | Najden v avtomobilu | Found on road
            | Pohištvo           | Najden v avtomobilu | Found in car dealership

In our example sentence and entity, COMET gpt-janez returns the same output when different commonsense descriptors are requested. We have observed this for all input sentences and entities thus far. We presume such results are due to the trained parameters of the original gpt-janez model, as macedonizer/sl-gpt2 was finetuned using the same workflow and returns different descriptor values. While unsure of the exact cause, we reason it could be due to an insufficient vocabulary or an unoptimized choice of parameters during training.
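The ROUGE-L scores reported in Table 1 rest on the longest-common-subsequence statistic described in Section 3. A minimal single-reference sketch (simplified: no stemming, one reference, balanced F-measure; the Slovene tokens are illustrative):

```python
# Sketch: single-reference ROUGE-L from LCS statistics.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """F-measure of LCS-based precision and recall over whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# LCS = 2, precision = 2/4, recall = 2/2 -> F ≈ 0.667
score = rouge_l("kupiti nekaj v trgovini", "kupiti nekaj")
print(round(score, 3))  # 0.667
```

Production implementations additionally handle multiple references and weight recall more heavily than precision; this sketch only shows the core LCS computation.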
For our example entity “car”, the descriptor “ObjectUse” describes possible usages of that entity. Table 3 shows that all models are capable of generating plausible descriptor values for such common entities. Nevertheless, the descriptor “HasProperty” proves challenging for the Slovenian models, suggesting a car is “crazy” and is “found in the car”. The English model, on the other hand, gives reasonable outputs such as “Found in parking lot”.

Figure 2 Close-up of “Social-Interaction” descriptor values predicted for an example Slovene sentence “John is very important” (“Janez je zelo pomemben” in Slovenian)

Figures 1, 2 and 3 show the outputs generated by “COMET sl-gpt2” for three different inputs. Figure 2 visualizes the output for the sentence “John is very important”. Outputs include “PersonX is then accomplished, happy, proud” and “As a result, others want to thank PersonX”. We can see that for many descriptors the highest-ranked output is “None” (“nič” in Slovenian), indicating that no commonsense inference can be made.

Figure 3 Close-up of “Physical-Entity” descriptor values predicted for an example Slovene entity “banana”

Figure 3 exhibits the output for the entity “banana”: the model claims the banana can be used to prepare food, is located in a building or shop, desires to be eaten for dinner, and does not desire to be frozen. On the other hand, the model claims the banana is made up of clothes and is capable of going to a restaurant. This is likely due to the overall significantly lower number of Physical-Entity descriptor values provided in the ATOMIC-2020 dataset.

In Figure 1 we can see the “Event-Centered” descriptors for the sentence “PersonX is sad”. Top descriptor values are again “None”, but the model also claims it is more difficult for PersonX to be sad if PersonX has no money.

5 Discussion
This paper applied an existing approach to multilingual commonsense description to the Slovene language. To implement our approach, we machine translated the ATOMIC-2020 dataset to Slovene and finetuned two Slovene commonsense models. We compared our models to the original English commonsense model from COMET-2020 and achieved comparable experimental results across multiple performance metrics. Among others, our models achieved a 0.487 CIDEr score, a 0.383 ROUGE-L score, and a BLEU-1 score of 0.297.

Through examination of individual examples, we observed that while “COMET gpt-janez” has the highest performance scores for the Slovene language, it fails to provide alternative descriptor values. “COMET sl-gpt2” provides multiple values for the same descriptor, but on average has lower performance. It is important to emphasize that the models were trained on subjective commonsense knowledge provided by individual humans. For example, workers labelled the sentence “PersonX digs holes” with the descriptor values “PersonX plants a garden” and “PersonX places fence posts in the holes” for the “IsBefore” descriptor. While both labels are plausible in some context, they are not necessarily true.

Possible directions for future work include evaluating the models’ performance for individual descriptors, as there are drastic differences in the quantity of training data and the lengths of values across them. After achieving results comparable to the original English commonsense model COMET-2020 GPT2-XL, we intend to finetune and evaluate models for other languages.

ACKNOWLEDGMENTS
The research described in this paper was supported by the Slovenian research agency under the project J2-1736 Causalify, the RSDO project funded by the Development of Slovene in a Digital Environment project, and the Humane AI Net European Union’s Horizon 2020 project under grant agreement No 952026.

REFERENCES
[1] Hwang, J.D., Bhagavatula, C., Le Bras, R., Da, J., Sakaguchi, K., Bosselut, A., & Choi, Y. (2021). COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs. AAAI.
[2] Mladenic Grobelnik, A., Mladenić, D., & Grobelnik, M. (2020). MultiCOMET – Multilingual Commonsense Description. In Proc. SiKDD 2020, Ljubljana, Slovenia (pp. 37–40).
[3] Bosselut, A., Rashkin, H., Sap, M., Malaviya, C., Celikyilmaz, A., & Choi, Y. (2019). COMET: Commonsense Transformers for Automatic Knowledge Graph Construction.
[4] Sap, M., Le Bras, R., Allaway, E., Bhagavatula, C., Lourie, N., Rashkin, H., Roof, B., Smith, N. A., & Choi, Y. (2019). ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning.
[5] SLOmet-ATOMIC 2020 GitHub: https://github.com/eriknovak/RSDO-SLOmet-atomic-2020#slomet-atomic-2020-on-symbolic-and-neural-commonsense-knowledge-graphs-in-slovenian-language Accessed 30.08.2022
[6] ATOMIC-2020 Slovene Machine Translated Data: https://www.dropbox.com/sh/gs8iqcwpwkaqkuf/AAAmnCqG89JOz_umtq42MMxxa?dl=0 Accessed 30.08.2022
[7] DeepL Translate API: https://www.deepl.com/pro-api Accessed 30.08.2022
[8] Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation.
[9] Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4566–4575).
[10] Lavie, A., & Denkowski, M. (2009). The METEOR metric for automatic evaluation of Machine Translation. Machine Translation, 23, 105–115.
[11] Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out.
[12] Documentation page for “macedonizer/sl-gpt2” on HuggingFace: https://huggingface.co/macedonizer/sl-gpt2 Accessed 1.09.2022
[13] gpt-janez supporting project: RSDO: https://www.cjvt.si/rsdo/en/project/ Accessed 30.08.2022
[14] Loshchilov, I., & Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations.


Measuring the Similarity of Song Artists using Topic Modelling

Erik Calcina, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
Erik Novak, Jožef Stefan International Postgraduate School and Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT
On music streaming platforms, a recommendation system is necessary to provide users with songs similar to what they already listen to, and also to recommend new artists they might be interested in. In this paper, we present a method for finding similarities between artists that uses topic modelling. We have evaluated the method using a data set of music artists and their lyrics. The results show that the method finds similar artists, but is also dependent on the quality of the generated topic clusters.

KEYWORDS
song lyrics, topic modelling, clustering, sentence embeddings, language models

1 INTRODUCTION
Nowadays, there are plenty of music platforms to choose from and listen to music on. There, new artists appear every day and many different songs are published. If we take into account all that have been created, we get a large selection of songs, which can increase the difficulty of finding suitable songs or artists to listen to.

To find suitable artists or songs, different aspects can be considered. One such aspect is the topic of the song; a song topic can be interpreted as the main subject of the song, for example an emotion, an event, a message, or something else. When searching for suitable artists, one could decide to search for artists who have songs on similar topics.

In this paper, we propose a topic modelling-based approach for measuring the similarity of music artists based only on their song lyrics. The approach uses language models to generate song embeddings, which are used to create the topic clusters. These topic clusters are then analyzed to find similar artists. The experiment was performed on a data set of songs corresponding to fourteen (14) music artists. While the experiment shows that similar artists can be detected using the approach, there is still room for improving its performance. The main contribution of this paper is a novel approach for detecting similar music artists using topic modelling.

The remainder of the paper is structured as follows: Section 2 contains an overview of the related work on using topic modelling on song data sets. Next, we present the methodology in Section 3 and describe the experiment setting in Section 4. The experiment results are found in Section 5, followed by a discussion in Section 6. Finally, we conclude the paper and provide ideas for future work in Section 7.

2 RELATED WORK
Related works to our topic modelling approach use Latent Dirichlet Allocation (LDA) [1]. One work uses a topic modelling technique for sentiment classification, classifying between happy and sad songs, using topics generated with LDA and the Hierarchical Dirichlet Process [12]; from a data set consisting of 150 lyrics, they were able to retrieve the sub-division into the two defined sentiment classes [3]. Another work used LDA and Pachinko allocation [7] on a large data set to assess the quality of the generated topics by applying a supervised topic modelling approach [8]. In our paper, we use topic modelling to generate a set of topic clusters used to calculate the similarity between artists.

3 METHODOLOGY
In this section, we present the methodology used in this paper. We present the topic modelling approach used to generate the topic clusters, followed by a description of how the topic clusters are used to measure the similarity between the artists.

3.1 Topic Modeling
To create the topic clusters we use BERTopic [5], a method which uses document embeddings with clustering algorithms to create topic clusters. While BERTopic is described in a separate work, we present a brief description of its workflow. The workflow is also presented in Figure 1.
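The shape of this workflow (embed the documents, group them, describe each group) can be sketched in code. The sketch below substitutes toy stand-ins for every stage: hashed bag-of-words counts instead of sentence-transformer embeddings, and a greedy cosine-threshold grouping instead of UMAP plus HDBSCAN. It illustrates only the structure of the pipeline, not the actual algorithms used in the paper:

```python
# Toy sketch of the embed -> cluster pipeline (stand-ins only:
# bag-of-words instead of sentence embeddings, greedy threshold
# grouping instead of UMAP + HDBSCAN).
from collections import Counter
from math import sqrt

def embed(doc):
    """Stand-in embedding: a bag-of-words term-count vector."""
    return Counter(doc.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.5):
    """Greedily assign each document to the first cluster whose seed
    document is similar enough; otherwise open a new cluster."""
    clusters = []  # each cluster is a list of document indices
    vecs = [embed(d) for d in docs]
    for i, v in enumerate(vecs):
        for members in clusters:
            if cosine(v, vecs[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

songs = [
    "love you love you baby",
    "baby i love you",
    "highway to the night highway ride",
]
print(cluster(songs))  # [[0, 1], [2]]
```

In the real pipeline, the density-based HDBSCAN step can also leave a document unassigned (the outlier cluster mentioned in Section 5), which this greedy stand-in does not model.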
Figure 1: The BERTopic methodology workflow. The highlighted part is used in our approach. The image has been designed using resources from Flaticon.com.

Document Embeddings. Document vector representations are generated using a sentence-transformer [11] model. The model creates a semantic representation of the documents, which allows measuring their semantic similarity. The available models support the creation of both monolingual and multilingual vectors. Since the embeddings will be used as the input of a clustering algorithm, dimensionality reduction is performed to improve the clustering results. The dimensionality reduction algorithm used is UMAP [10].

Document Clustering. Once the document embeddings are prepared, they are input into a clustering algorithm to create the topic clusters. The algorithm used is HDBSCAN [9], an optimized extension of the DBSCAN [4] algorithm. The chosen algorithm creates clusters based on the density of the document embedding space, which allows a document to not be assigned to any cluster if it is not similar to any of the neighbouring documents.

Topic Word Description. Once the topic clusters are created, a topic word description is generated using the documents’ text. For each cluster, the TF-IDF score is calculated for each word found in any of the cluster’s documents; the scores are called cluster TF-IDF (c-TF-IDF). The words with the highest c-TF-IDF scores are then chosen as the topic word description. Furthermore, maximal marginal relevance (MMR) is performed to diversify the selected words by measuring both a word’s relevance to the documents and its similarity to the other selected words. Note that the topic word descriptions were used only for the preliminary analysis of our work, not for measuring artist similarity.

3.2 Artists’ Similarity using Topic Clusters
Once the topic clusters are created, the similarity between artists can be measured. First, for each topic we count the songs that correspond to a particular artist. This gives us the number of songs an artist has in a particular topic. To ensure that the presence is strong enough, we decided to remove an artist from a topic if the number of their associated songs is below some threshold. The threshold is set to five (5) in order to ensure that the songs were not assigned to a cluster by coincidence. Afterwards, for each pair of artists we calculate their similarity using the following equation:

    sim(𝑎, 𝑏) = |𝐴 ∩ 𝐵| / |𝐴|,    (1)

where 𝐴 is the set of topics of artist 𝑎, and 𝐵 is the set of topics of artist 𝑏.

4 EXPERIMENT
We now present the experiment setting. First, we introduce the data set used and its pre-processing steps. Next, we describe the implementation details.

4.1 Dataset
To test our approach, we use a dataset with raw lyrics data [2]. The dataset consists of 218,210 rows containing the following attributes:
• Song name. The name of the song.
• Release year. The year when the song was released.
• Song artist. The name of the artist.
• Artist genre. The genre of the song.
• Song lyrics. The lyrics text of the song.
The attributes used in our analysis are song name, artist, and lyrics.

Data Processing. For our experiment we took fourteen (14) artists of various degrees of similarity. This reduces the data set to 4,470 rows, which is 2.05% of the whole data set. After reviewing the lyrics, we realized that the data set has many song variations by the same artist, which can be seen as duplicates. To find and remove the duplicates, we created TF-IDF representations for the songs and calculated the cosine similarity with all other songs of the same artist; if the similarity was greater than 50%, the song was labeled as a duplicate and removed from the data set. This resulted in a smaller data set containing 3,455 song lyrics. The final data set statistics used for our experiments are shown in Table 1.

Table 1: The experiment data set statistics. For each artist we denote the music genre of the artist (genre), the number of their songs in the data set (songs), and the average number of words in the song’s lyrics (avg. length).

Artist          | genre   | songs | avg. length
black-sabbath   | Rock    | 160   | 184
bon-jovi        | Rock    | 320   | 266
dio             | Rock    | 127   | 203
aerosmith       | Rock    | 208   | 226
ac-dc           | Rock    | 171   | 193
coldplay        | Rock    | 138   | 174
50-cent         | Hip-Hop | 318   | 502
2pac            | Hip-Hop | 259   | 648
eminem          | Hip-Hop | 369   | 640
black-eyed-peas | Hip-Hop | 119   | 463
celine-dion     | Pop     | 182   | 230
britney-spears  | Pop     | 225   | 313
frank-sinatra   | Jazz    | 356   | 133
ella-fitzgerald | Jazz    | 503   | 156
Together        | -       | 3,455 | 319

4.2 Implementation details
In this section, we present the details of how the approach is developed.

Language model. The method uses the pre-trained Sentence Transformer model, more precisely the all-mpnet-base-v2 model¹, available via HuggingFace’s transformers library [13]. It can take up to 384 tokens as one input, which is more than the average number of words in our data set, and returns a 768-dimensional dense vector. The vectors have been shown to be appropriate for tasks such as clustering and semantic search.

¹ https://huggingface.co/sentence-transformers/all-mpnet-base-v2

Dimensionality reduction. To perform dimensionality reduction, we set the UMAP parameters as follows: first, the number of neighboring sample points used when making the manifold approximation is set to five (5), to make the algorithm use the local proximity of the documents. Second, we set the dimensionality of the embeddings to one (1). These values were selected using hyper-parameter tuning.

Clustering algorithm. In the HDBSCAN algorithm, the minimum number of documents in a cluster is set to five (5).

5 RESULTS
In this section, we present the experiment results. We analyze the topic clusters, followed by a description of the findings on artist similarity.

Topic Cluster Analysis. The experiment generated 215 topic clusters, out of which only 107 have at least one artist with more than five (5) songs in them. The cluster containing songs that are deemed outliers is not included in the analysis. The statistics of the topic clustering are shown in Table 2. Evidently, artists with a larger number of songs are spread over more topic clusters than those with fewer songs.

Table 2: Topic clustering results. For each artist we show the number of different topics the artist is associated with (topics), and the average number of their songs in the associated topics (avg. songs).

Artist          | topics | avg. songs
black-sabbath   | 6      | 5
bon-jovi        | 10     | 6
dio             | 4      | 7
aerosmith       | 9      | 6
ac-dc           | 7      | 5
coldplay        | 2      | 5
50-cent         | 17     | 9
2pac            | 13     | 9
eminem          | 18     | 9
black-eyed-peas | 3      | 12
celine-dion     | 8      | 6
britney-spears  | 12     | 6
frank-sinatra   | 16     | 8
ella-fitzgerald | 28     | 8

Artists’ Similarity Analysis. The artists’ similarity is shown in Figures 2 and 3, which show the heatmaps of the absolute and relative co-occurrence of artists in topic clusters, respectively.

Figure 2: The absolute co-occurrence of artists in topic clusters.

By looking at the rows of Figure 2, we see the number of common topics with other artists. For example, taking 50-cent with his 17 topics, we see that he shares five (5) of them with 2pac, one (1) with black-eyed-peas, one (1) with ac-dc, and six (6) with eminem. From this we conclude that 50-cent, 2pac and eminem have more topics in common than the rest of the artists. In other words, 50-cent is more similar to 2pac and eminem than to the rest of the artists.

Figure 3: The relative co-occurrence of artists in topic clusters. Artists with a smaller number of topics can result in higher similarity with other artists.

Figure 3 shows the similarities calculated using Equation 1. The similarities become more visible, but at the same time can also be misleading: artists with a smaller number of topics can end up with a higher similarity to artists with a larger number of topics. For example, Coldplay has two (2) topics, one of which is shared with Bon Jovi.

Language Model Limitations. The chosen language model all-mpnet-base-v2 supports a maximum sequence length of 384 tokens, which is a downside of this model for our experiment. Although the average number of words in the song lyrics is below the input limit, some artists have songs that are longer than that. However, songs have repeating sections, e.g. the chorus, which is most likely inside the first 384 words. Therefore, the language model may not create a representation of the whole song’s lyrics, but it might capture the majority of the content because of the song’s repeated text.
Despite the fact that only one topic is in common, it is unlikely they have a similarity of 50%. Clustering Algorithm Selection. The clustering algorithm HDB- SCAN can create a cluster consisting of examples, which do not 6 DISCUSSION fall into any of the topic clusters. It is convenient when instead of In this section we discuss the advantages and disadvantages of forcing songs into clusters, it labels them as outliers. The down- the proposed methodology, and its possible improvements. side is when the majority of songs are labeled as outliers. To 15 Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia Erik Calcina and Erik Novak avoid this, other clustering algorithms that assign a cluster to [8] Alen Lukic. A comparison of topic modeling approaches every document can be used, for example K-means clustering [6]. for a comprehensive corpus of song lyrics. Tech. rep. Tech report, Language Technologies Institute, School of Com- 6.1 Topic Cluster Discussion puter Science . . ., 2015. Some artists with a small number of songs have a lower number [9] Leland McInnes and John Healy. “Accelerated Hierarchical of topics assigned, which is a problem for finding similarities. Density Based Clustering”. In: 2017 IEEE International Con- On the other side artists with higher number of songs tend to ference on Data Mining Workshops (ICDMW). 2017, pp. 33– have more topics. Additionally, to avoid taking into account small 42. doi: 10.1109/ICDMW.2017.12. number of artist co-occurrances, which can be a product of data [10] Leland McInnes, John Healy, and James Melville. UMAP: noise, a filter threshold can be considered to remove them from Uniform Manifold Approximation and Projection for Dimen- the final analysis. sion Reduction. 2018. doi: 10.48550/ARXIV.1802.03426. url: https://arxiv.org/abs/1802.03426. 7 CONCLUSION [11] Nils Reimers and Iryna Gurevych. “Sentence-BERT: Sen- tence Embeddings using Siamese BERT-Networks”. 
In: In this paper we present a way to measure similarity between Proceedings of the 2019 Conference on Empirical Methods music artists using topic modeling. We cluster lyrics and compare in Natural Language Processing. Association for Computa- artists based on the generated topic clusters. The results have tional Linguistics, Nov. 2019. url: https://arxiv.org/abs/ shown that the approach finds similar artists. However, it is 1908.10084. heavily dependent on the number and quality of the topic clusters. [12] Chong Wang, John Paisley, and David Blei. “Online varia- In the future, we intend to apply the methodology on a larger tional inference for the hierarchical Dirichlet process”. In: data set of song lyrics and artists. In addition, we intend to use Proceedings of the fourteenth international conference on all of the topic cluster information (including topic word descrip- artificial intelligence and statistics. JMLR Workshop and tions) in order to improve the methodology’s performance. Conference Proceedings. 2011, pp. 752–760. ACKNOWLEDGMENTS [13] Thomas Wolf et al. “Transformers: State-of-the-Art Natu- ral Language Processing”. In: Proceedings of the 2020 Con- This work was supported by the Slovenian Research Agency and ference on Empirical Methods in Natural Language Pro- the Slovene AI observatory under proposal no. V2-2146. cessing: System Demonstrations. Online: Association for REFERENCES Computational Linguistics, Oct. 2020, pp. 38–45. doi: 10. 18653/v1/2020.emnlp- demos.6. url: https://aclanthology. [1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. “La- org/2020.emnlp- demos.6. tent dirichlet allocation”. In: J. Mach. Learn. Res. 3 (2003), pp. 993–1022. issn: 1532-4435. doi: http://dx.doi.org/10. 1162 / jmlr. 2003 . 3 . 4 - 5 . 993. url: http : / / portal . acm . org / citation.cfm?id=944937. [2] Connor Brennan, Sayan Paul, Hitesh Yalamanchili, Justin Yum. Classifying Song Genres Using Raw Lyric Data with Deep Learning. Accessed August 30, 2022. 
https://github. com/hiteshyalamanchili/SongGenreClassification. 2018. [3] Maibam Debina Devi and Navanath Saharia. “Exploiting Topic Modelling to Classify Sentiment from Lyrics”. In: Machine Learning, Image Processing, Network Security and Data Sciences. Ed. by Arup Bhattacharjee et al. Singapore: Springer Singapore, 2020, pp. 411–423. isbn: 978-981-15- 6318-8. [4] Martin Ester et al. “A density-based algorithm for discov- ering clusters in large spatial databases with noise”. In: AAAI Press, 1996, pp. 226–231. [5] Maarten Grootendorst. “BERTopic: Neural topic modeling with a class-based TF-IDF procedure”. In: arXiv preprint arXiv:2203.05794 (2022). [6] Xin Jin and Jiawei Han. “K-Means Clustering”. In: Ency- clopedia of Machine Learning. Ed. by Claude Sammut and Geoffrey I. Webb. Boston, MA: Springer US, 2010, pp. 563– 564. isbn: 978-0-387-30164-8. doi: 10 . 1007 / 978 - 0 - 387 - 30164 - 8 _ 425. url: https : / / doi . org / 10 . 1007 / 978 - 0 - 387 - 30164- 8_425. [7] Wei Li and Andrew McCallum. “Pachinko allocation: DAG- structured mixture models of topic correlations”. In: ICML ’06: Proceedings of the 23rd international conference on Ma- chine learning. New York, NY, USA: ACM, 2006, pp. 577– 584. isbn: 1595933832. doi: 10.1145/1143844.1143917. url: http://portal.acm.org/citation.cfm?id=1143917. 16 Exploring the Impact of Lexical and Grammatical Features on Automatic Genre Identification Taja Kuzman Nikola Ljubešić taja.kuzman@ijs.si nikola.ljubesic@ijs.si Jožef Stefan Institute and Jožef Stefan International Jožef Stefan Institute Postgraduate School Jamova cesta 39 Jamova cesta 39 Ljubljana, Slovenia Ljubljana, Slovenia ABSTRACT As learning on lexical features can introduce bias towards topic, Laippala et al. 
(2021) recently experimented with combining lexical with grammatical features, which are represented as part-of-speech tags, conveying information on the word type (e.g., noun, verb). This was shown to yield better results than using solely lexical features, and to provide more stable models, i.e., models that are able to generalize beyond the training data. Furthermore, their analysis revealed that the importance of feature sets varies between genre categories, and that while some are most efficiently identified when learning on lexical features, others benefit more from grammatical representations.

However, these experiments were in the past mostly performed on English datasets. This article is the first to analyse the impact of various feature sets on automatic genre identification applied to Slovene data. This research was made possible by the recent development of the first Slovene dataset manually annotated with genre, as well as the creation of state-of-the-art language processing tools for Slovene.

This study analyses the impact of several types of linguistic features on the task of automatic web genre identification applied to Slovene data. To this end, text classification experiments with the fastText models were performed on 6 feature sets: the original lexical representation, preprocessed text, lemmas, part-of-speech tags, morphosyntactic descriptors, and syntactic dependencies, produced with the CLASSLA pipeline for language processing. Contrary to previous work, our results reveal that a grammatical feature set can be more beneficial than lexical representations for this task, as syntactic dependencies were found to be the most informative for genre identification. Furthermore, it is shown that this approach can provide insight into variation between genres.

To compare textual representations, additional feature sets were created from a selection of texts annotated with genre, presented in Section 2, by using common preprocessing methods and language processing (see Section 3). Thus, in this paper, 6 textual representations are compared: 1) the original, running text, which we consider as our baseline, 2) preprocessed text, i.e. lowercase text without punctuation, digits and stopwords, 3) lemmas, i.e. base dictionary forms of words, 4) part-of-speech (PoS) tags, i.e. main syntactic word types (e.g., noun, verb), 5) morphosyntactic descriptors (MSD), i.e. extended PoS tags which include information on morphosyntactic features (e.g., number, case), and 6) syntactic dependencies, i.e. types of dependency relations between words (e.g. subject, object). The feature sets are compared based on their impact on the performance of the fastText models on the automatic text classification task.

KEYWORDS

language processing, linguistic features, automatic genre identification, web genres, Slovene

1 INTRODUCTION

Automatic genre identification (AGI) is a text classification task where the focus is on genres as text categories that are defined based on the conventional function and/or the form of the texts. In text classification tasks, texts are generally given to the machine learning models in the form of words or characters that are then further transformed into numeric vectors by using bag-of-words representations, or word embeddings created by training deep neural networks on the surface text. However, the recent development of tools for linguistic processing for numerous languages, including Slovene, allows transformation of the original running text into various other sets of features, to which further transformation into numeric representations can be applied.
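Such a transformation of annotated text into alternative feature sets can be sketched in a few lines. The snippet below is an illustration, not the authors' code: it derives the representations from tokens that are assumed to be already annotated with lemma, PoS, MSD and dependency layers (in the experiments this annotation comes from the CLASSLA pipeline), and the Token values and the stopword list are invented for the example.

```python
# Illustrative sketch: building textual representations from pre-annotated
# tokens. The Token annotations and STOPWORDS below are invented stand-ins.
from dataclasses import dataclass

@dataclass
class Token:
    form: str     # surface form from the running text
    lemma: str    # base dictionary form
    pos: str      # main part-of-speech tag (e.g. NOUN, VERB)
    msd: str      # extended morphosyntactic descriptor
    deprel: str   # syntactic dependency relation (e.g. nsubj, root)

STOPWORDS = {"v", "se", "bo"}  # stand-in for a real stopword list

def representations(tokens):
    """Return each feature set as a whitespace-joined string."""
    return {
        "baseline": " ".join(t.form for t in tokens),
        # lowercase, drop digits/punctuation (non-alphabetic) and stopwords
        "preprocessed": " ".join(
            t.form.lower() for t in tokens
            if t.form.isalpha() and t.form.lower() not in STOPWORDS
        ),
        "lemmas": " ".join(t.lemma for t in tokens),
        "pos": " ".join(t.pos for t in tokens),
        "msd": " ".join(t.msd for t in tokens),
        "dependencies": " ".join(t.deprel for t in tokens),
    }

# A two-token fragment with invented annotations.
tokens = [
    Token("Prvi", "prvi", "ADJ", "Mlomsn", "amod"),
    Token("tek", "tek", "NOUN", "Ncmsn", "root"),
]
reps = representations(tokens)
print(reps["pos"])           # ADJ NOUN
print(reps["dependencies"])  # amod root
```

In the experiments described below, each such representation is then written out one document per line, prefixed with its genre label, which is the input format the fastText classifier expects.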
By learning on these linguistic sets, we get insight into the importance of features that cannot be analysed separately when given the running text, i.e., word meaning, the function of a word, and its relation to other words.

When previous work compared the importance of various textual feature sets for the performance of models in automatic genre identification, lexical features, i.e., word or character n-grams, mainly provided the best results ([6], [7]). However, it was noted that by learning on lexical features, the models could learn to classify texts based on the topic instead of on genre characteristics, and would not be able to generalize beyond the dataset.

The results of the experiments, presented in Section 4, give insights into the role of linguistic feature sets in this task and the differences in performance between genre categories.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia
© 2022 Copyright held by the owner/author(s).

2 DATASET

For performing experiments in automatic genre identification, the Slovene Web genre identification corpus GINCO 1.0 [2] was used. The dataset consists of the "suitable" subset, annotated with genre, and the "not suitable" subset that comprises texts which can be deemed as noise in web corpora, e.g., texts without full sentences, very short texts, machine translation etc. In this research, only the "suitable" subset, containing 1002 texts, was used.

The GINCO schema consists of 24 genre labels. However, previous experiments, performed with the fastText model on the entire dataset, showed that the model is not potent enough to differentiate between a large number of labels that are mostly represented by less than 100 texts, reaching micro and macro F1 scores of 0.352 and 0.217 respectively (see [3]). Therefore, to be able to infer any meaningful conclusions, this article focuses only on the most frequent genre labels, created by merging some labels. Instances of less frequent labels that could not be merged, namely Instruction, Legal/Regulation, Recipe, Announcement, Correspondence, Call, Interview, Prose, Lyrical, Drama/Script, FAQ, and the labels Other and List of Summaries/Excerpts, which can be considered as noise, were not used. To focus only on the instances that are representative of their genre labels, texts that were manually annotated as hard to identify (parameter hard) were not used in the experiments. Furthermore, paragraphs that were deemed to be noise in the text, e.g., cookie consent text, and were marked by the annotators with the keep parameter set to False, were left out of the final texts.

Thus, the final set of labels, used in the experiments and shown in Table 1, consists of 5 genre categories: Information/Explanation, News, Opinion/Argumentation, Promotion and Forum. As shown in the Table, the dataset is imbalanced, with News and Promotion being the most frequent classes, consisting of almost 200 instances each, while Forum is the least represented class, consisting of about 50 texts. The subset, consisting of 688 texts in total, followed the original stratified split of 60:20:20, encoded in the GINCO 1.0 dataset; the models were trained on the training set and tested on the test set, while the dev split was used for evaluating the hyperparameter optimisation.

Table 1: The original GINCO categories (left) included in the reduced set, and the reduced set of labels (right), used in the experiments, with the total number of texts (later divided between the train, dev and test splits) in parentheses.

GINCO                       Reduced Set
News/Reporting              News (198)
Opinionated News
Information/Explanation     Information/Explanation (127)
Research Article
Opinion/Argumentation       Opinion/Argumentation (124)
Review
Promotion                   Promotion (191)
Promotion of a Product
Promotion of Services
Invitation
Forum                       Forum (48)

3 FEATURE ENGINEERING

Feature engineering is the process of identifying the features that are most useful for a specific task, with the goal of improving the performance of a machine learning model. In text classification experiments, basic preprocessing methods are often used to reduce the number of unique lexical features (words or characters) without losing much information, which can provide better results. To test whether preprocessing the text improves the results for this task, the first additional feature set was created by preprocessing the running text as extracted from the GINCO dataset. Preprocessing consisted of the following steps: converting text to lowercase, and removing digits, punctuation and function words known as stopwords, e.g., conjunctions, prepositions etc.

In addition to this, various linguistic representations were created by applying linguistic processing to the texts, and replacing words with the corresponding lemmas or grammatical tags. The language processing was performed with the CLASSLA pipeline [5]. The following text representations were produced: a lexical feature set, consisting of lemmas, and three grammatical feature sets: part-of-speech (PoS) tags, morphosyntactic descriptors (MSD), and syntactic dependencies. The realisation of the created feature sets is illustrated on an example sentence in Table 2.

Table 2: An example of the feature sets used in the experiments.

Feature Set               Example
Baseline - Running Text   V Laškem se bo v nedeljo, 21.4.2013 odvijal prvi dobrodelni tek Veselih nogic.
Preprocessed Baseline     laškem nedeljo odvijal dobrodelni tek veselih nogic
Lemmas                    v Laško se biti v nedelja , 21.4.2013 odvijati prvi dobrodelen tek vesel nogica .
PoS                       ADP PROPN PRON AUX ADP NOUN PUNCT NUM VERB ADJ ADJ NOUN ADJ NOUN PUNCT
MSD                       Sl Npnsl Px——y Va-f3s-n Sa Ncfsa Z Mdc Vmpp-sm Mlomsn Agpmsny Ncmsn Agpfpg Ncfpg Z
Dependencies              case nmod expl aux case obl punct nummod root amod amod nsubj amod nmod punct

4 MACHINE LEARNING EXPERIMENTS

4.1 Experimental Setup

The experiments were performed with the linear fastText [1] model, which enables text classification and word embedding generation. The model is a shallow neural network with one hidden layer, where the word embeddings are created and averaged into a text representation which is fed into a linear classifier. The model takes as input a text file where each line contains a separate text instance, consisting of a label and the corresponding document. Thus, for each feature set, appropriate train, test and dev files were created, and the model was trained on each representation separately (the code for data preparation and machine learning experiments is published at https://github.com/TajaKuzman/Text-Representations-in-FastText). To observe the dispersion of results, five runs of training were performed for each feature set. To measure the model's performance on the instance and the label level, the micro and macro F1 scores were used as evaluation metrics.

The hyperparameter search was performed by training the model on the training split of the baseline text and evaluating it on the dev split. The automatic hyperparameter optimisation provided by the fastText model did not yield satisfying results, as three runs of automatic hyperparameter optimisation produced very different results in terms of the proposed optimal hyperparameter values, and yielded micro F1 0.479 ± 0.02 and macro F1 0.382 ± 0.06. Therefore, we continued searching for optimal hyperparameters by manually changing one hyperparameter at a time and conducting classification experiments. The optimal number of epochs turned out to be 350, the learning rate was set to 0.7, and the number of words in n-grams to 1. For the other hyperparameters, the default values were used. Manual hyperparameter search proved to be considerably more effective than automatic optimisation, as it yielded average micro and macro F1 scores of 0.625 ± 0.004 and 0.618 ± 0.003 respectively, which is on average 0.15 points better micro F1 and 0.24 points better macro F1 compared to the results of automatic optimisation.

To analyse whether our choice of technology is the most appropriate one, we compared the performance of the fastText model, which uses the hyperparameters mentioned above, with the performance of various non-neural classifiers commonly used in text classification tasks: a dummy majority classifier, which predicts the most frequent class for every instance, a support vector machine (SVM), a decision tree classifier, a logistic regression classifier, a random forest classifier, and a Naive Bayes classifier. We used the default parameters for these classifiers. The models are compared based on their performance on the baseline text, which was transformed into the TF-IDF representation where necessary. As shown in Table 3, fastText outperforms all other classifiers, with a noticeable difference especially in the macro F1 scores, reaching 17 points higher scores than the next best classifier, the Naive Bayes classifier.

Table 3: Micro and macro F1 scores obtained by various classifiers, trained and tested on the baseline text.

Classifier                 Micro F1   Macro F1
Dummy Classifier           0.24       0.08
Support Vector Machine     0.49       0.33
Decision Tree              0.34       0.35
Logistic Regression        0.52       0.38
Random Forest classifier   0.51       0.41
Naive Bayes classifier     0.54       0.42
FastText                   0.56       0.59

4.2 Results of Learning on Various Linguistic Features

To explore the role of various textual representations in the automatic genre identification of Slovene web texts, we conducted text classification experiments with the fastText models on 6 feature sets:

• three lexical sets: a) baseline text, i.e., the original running text, b) preprocessed baseline text, i.e., baseline text converted to lowercase and without punctuation, digits and function words, c) lemmas, i.e., words reduced to their base dictionary forms;
• three grammatical sets: a) part-of-speech (PoS) tags, i.e., main word types, b) morphosyntactic descriptors (MSD), i.e., extended PoS tags, c) syntactic dependencies, i.e., types of words defined by their relation to other words.

First, by comparing the baseline representation and the preprocessed representation, we aimed to determine whether common preprocessing methods can improve the results in the AGI task. As shown in Table 4, the results reveal that applying preprocessing methods improves the performance, especially on the micro F1 level. Analysis of the F1 scores obtained for each label in Figure 1 reveals that preprocessing especially improves the identification of Promotion and News. These two labels are the most frequent genre classes in the dataset, which explains the larger improvement of the micro F1 scores. If we compare the baseline text and the preprocessed text to the third lexical set, i.e., lemmas, the results show that by using lowercase words reduced to their dictionary base form, the performance is further improved, although only slightly, as can be seen in Table 4.

Table 4: Average micro and macro F1 scores obtained from five runs of training and testing on each representation separately.

Representation          Micro F1       Macro F1
Baseline Text           0.560 ± 0.00   0.589 ± 0.00
Preprocessed Baseline   0.596 ± 0.00   0.597 ± 0.00
Lemmas                  0.597 ± 0.01   0.601 ± 0.00
PoS                     0.540 ± 0.01   0.547 ± 0.01
MSD                     0.563 ± 0.01   0.536 ± 0.02
Dependencies            0.610 ± 0.00   0.639 ± 0.00

Secondly, we compared the various lexical and grammatical feature sets obtained with language processing tools. In previous work, which analysed English genre datasets, lexical features yielded better results than grammatical feature sets ([4], [6], [7]). Our results revealed that this conclusion holds also for Slovene when training on part-of-speech tags. A similar conclusion can be made for the extended part-of-speech tags (MSD), which only slightly improve the micro F1 scores compared to the baseline, while there is a decrease in the macro F1 scores (see Table 4). However, the third grammatical feature set, consisting of tags for syntactic dependencies, which was not used in previous work, significantly outperformed the baseline text and all other feature sets. As shown in Figure 1, the improvement is especially noticeable for the categories Forum, Opinion/Argumentation and News. By learning on the dependencies instead of on lexical features, the model learns from the structure of the sentences in the text, i.e., the syntax, instead of from word meanings that can be more related to topic than to genre, which could be the reason why this representation turned out to be the most beneficial for the task.

Figure 1: The impact of various linguistic features on the F1 scores of genre labels (Information/Explanation, Promotion, News, Forum and Opinion/Argumentation).

As in previous work (see [4]), the experiments have revealed a dependence between the text representation and the performance on specific genre labels, which is illustrated in Figure 1. The results show that Promotion and Information/Explanation can be most successfully identified when learning purely on the meaning of the words, i.e., on lemmas. In contrast to that, for identifying News, grammatical representations are more useful than lexical ones. Similarly, Opinion/Argumentation benefits more from grammatical feature sets than from lexical representations, except in the case of the MSD tags, which significantly decreased the results for this class, yielding F1 scores below 0.3. Interestingly, although Forum is the least frequent label, its features seem to be the easiest to identify in the majority of representations. This genre benefits the most from learning on syntactic dependency tags, which yielded F1 scores of almost 0.9.

5 CONCLUSIONS

In this paper, we have investigated the dependence of automatic genre classification on the lexical and grammatical representation of text. Our experiments, performed on three lexical and three grammatical feature sets, revealed that the choice of textual representation impacts the results of automatic genre identification. Similarly to previous work, it was revealed that part-of-speech features give worse results than lexical features. However, a grammatical feature set consisting of syntactic dependencies, which had not been studied in previous work, proved to be the most beneficial for the automatic genre identification task. Furthermore, the experiments revealed variation between genres regarding the impact of the feature sets on the F1 scores of each label. While some genres, such as Promotion, benefit more from learning on lexical features, others, such as Opinion/Argumentation, benefit more from grammatical representations.

However, it should be noted that this study has been limited to the 5 most frequent genre labels, as previous experiments showed that the fastText model is not potent enough to identify the other categories, represented by a small number of instances ([3]). Thus, the results of these experiments give insight into which linguistic features are the most important for differentiating between the five most frequent genres, not for identifying the 24 original labels that encompass all the genre variation found on the web, and include noise. This is why we plan to continue genre annotation campaigns to enlarge the Slovene genre dataset, which would allow extending the analysis to all genre labels. In addition to this, as we are interested in cross-lingual genre identification, in the future we plan to analyse the importance of linguistic feature sets on the Croatian and English genre datasets, to analyse whether the characteristics of genre labels are language independent.

The fastText model proved to be useful for the analysis of the impact of linguistic features on the AGI task; however, previous work on automatic genre identification using the GINCO dataset revealed that if the aim of the research is to create the best-performing classifier, and not to analyse the impact of representations on the performance, the Transformer-based pre-trained language models are much more suitable for the task ([3]). This was also confirmed by our experiments on the running text, where the base-sized XLM-RoBERTa model reached micro and macro F1 scores of 0.816 and 0.813, which is 22–26 points more than the fastText model. Based on the findings from this paper, one of the reasons why the Transformer models perform better could also be that the Transformer text representations incorporate information on syntax as well. In the future, we plan to investigate this further, adapting the classifier heads so that the syntactic information has a larger impact on the classification than the lexical parts of the representation.

ACKNOWLEDGMENTS

This work has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author's view. The Agency is not responsible for any use that may be made of the information it contains. This work was also funded by the Slovenian Research Agency within the Slovenian-Flemish bilateral basic research project "Linguistic landscape of hate speech on social media" (N06-0099 and FWO-G070619N, 2019–2023) and the research programme "Language resources and technologies for Slovene" (P6-0411).

REFERENCES

[1] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
[2] Taja Kuzman, Mojca Brglez, Peter Rupnik, and Nikola Ljubešić. 2021. Slovene web genre identification corpus GINCO 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1467.
[3] Taja Kuzman, Peter Rupnik, and Nikola Ljubešić. 2022. The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild. In Proceedings of the Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 1584–1594. https://aclanthology.org/2022.lrec-1.170.
[4] Veronika Laippala, Jesse Egbert, Douglas Biber, and Aki-Juhani Kyröläinen. 2021. Exploring the role of lexis and grammar for the stable identification of register in an unrestricted corpus of web documents. Language Resources and Evaluation, 1–32.
[5] Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing. Association for Computational Linguistics, Florence, Italy, (August 2019), 29–34. doi: 10.18653/v1/W19-3704. https://www.aclweb.org/anthology/W19-3704.
[6] Dimitrios Pritsos and Efstathios Stamatatos. 2018. Open set evaluation of web genre identification. Language Resources and Evaluation, 52, 4, 949–968.
[7] Serge Sharoff, Zhili Wu, and Katja Markert. 2010. The Web Library of Babel: evaluating genre collections. In LREC. Citeseer.

Stylistic features in clustering news reporting: News articles on BREXIT

Abdul Sittar (abdul.sittar@ijs.si), Jožef Stefan Institute and Jožef Stefan Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia
Jason Webber (jason.webber@bl.uk), British Library, London, United Kingdom
Dunja Mladenić (dunja.mladenic@ijs.si), Jožef Stefan Institute and Jožef Stefan Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT

We present a comparison of typical bag-of-words features with stylistic features. We group the news articles published from three different regions of the UK, namely London, Wales, and Scotland. Hierarchical clustering is performed using typical bag-of-words and stylistic features. We present the performance of 25 stylistic features and compare them with the bag-of-words features. Our results show that bag-of-words features are better suited for clustering news reporting at the regional level, whereas stylistic features are better suited for clustering news reporting at the level of news publishers/newspapers.

Table 1: List of all the stylistic features that are used for clustering.

No.  Feature                                         No.  Feature
1.   Percentage of Question Sentences                2.   Average Sentence Length
3.   Percentage of Short Sentences                   4.   Average Word Length
5.   Percentage of Long Sentences                    6.   Percentage of Semicolons
7.   Percentage of Words with Six and More Letters   8.   Percentage of Punctuation Marks
9.   Percentage of Words with Two and Three Letters  10.  Percentage of Pronouns
11.  Percentage of Coordinating Conjunctions         12.  Percentage of Prepositions
13.  Percentage of Commas                            14.  Percentage of Adverbs
15.  Percentage of Articles                          16.  Percentage of Capitals
17.  Percentage of Words with One Syllable           18.  Percentage of Colons
19.  Percentage of Nouns                             20.  Percentage of Determiners
21.  Percentage of Verbs                             22.  Percentage of Digits
23.  Percentage of Adjectives                        24.  Percentage of Full Stops
25.  Percentage of Interjections

KEYWORDS

news reporting, topic modeling, stylistic features, clustering

1 INTRODUCTION

The role of content is an essential research topic in news spreading. Media economics scholars have especially shown interest in a variety of content forms, since content analysis plays a vital role in individual consumer decisions and in political and economic interactions [6]. The content basically refers to the type of language that is used in the news. It is used to convey meaning, and it can impact social and psychological constructs such as social relationships, emotions, and social hierarchy [8]. The everyday act of reading the news is such a big area in which small differences in reporting may shape how events are perceived, and ultimately judged and remembered [5].

features from the raw features, including low-level features, high-level features, and semantic features [16].

The news coverage registers the occurrence of specific events promptly and reflects the different opinions of stakeholders [4]. We take Brexit as an event to be researched on the topic of news reporting differences across the different regions of the UK. On 23 June 2016, the British electorate voted to leave the EU. This event has already been studied following different aspects, such as the fundamental characteristics of the voting population, drivers of the vote, political and social patterns, and possible failures in communication [2, 9].

News reporting across different regions requires methods to
In this paper, we explore how different find reporting differences. [7] characterize the relationship be-stylistic features help in clustering news articles related to Brexit tween the volume of online opioid news reporting and measures than bag-of-words (BOW). differences across different geographic and socio-economic lev- Following are the main scientific contributions of this paper: els. Scholars across disciplines have explored the institutional, (1) We present a comparison of clustering (using two different organizational, and individual influences that study the quality textual features: bag-of-words and stylistic features) for and quantity of coverage [3]. news reporting about Brexit in three different regions Features that could classify news reporting across different (London, Scotland, and Wales) of the UK. regions can be adapted to classify the news. A detailed analysis of (2) We show in our experiments that the bag-of-words are textual features is performed by [1] where they derived multiple better to be used while clustering news reporting at the features for creating clusters of news articles along with their regional level whereas stylistic features are better to be comments. These features include terms in the title, terms in used while clustering news reporting at the level of news the first sentence, terms in the entire article, etc. Multi-view publishers/newspapers. clustering on multi-model data can provide common semantics to improve learning effectiveness. It exploits different levels of 2 RELATED WORK Permission to make digital or hard copies of part or all of this work for personal In this section, we review the related literature about topic mod-or classroom use is granted without fee provided that copies are not made or elling, and different types of textual features. distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. 
Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2022, 10 October 2022, Ljubljana, Slovenia
© 2022 Copyright held by the owner/author(s).

2.1 Topic Modelling
Topic modelling is used to infer topics from a collection of text documents. Some techniques use only frequent words, whereas some use pooling to generate relevant topics and maintain coherence between topics [14]. Topics are typically represented by a set of keywords. Examples of such algorithms are Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA). Clustering-based topic modelling is another solution.

2.2 Stylistic Features
News reporting differences can be reflected through one's speech, writing, images, etc. [10, 12]. Language-independent features have been used for different NLP tasks such as plagiarism detection and author diarization. These features treat the text of documents as a sequence of tokens (i.e. sentences, paragraphs, documents), from which various types of statistics can be drawn in any language [13]. Stylistic features represent the writing style of a document and have been used in the past for understanding authors' writing styles [10]. We use them to explore the clustering of news articles based on their reporting differences across different regions. Table 1 shows the list of 25 stylistic features used for our proposed clustering of news articles.

2.3 Bag-of-words
A bag-of-words model is a way of extracting features from text. It is basically a representation of text that describes the occurrence of words within a document: it first identifies a vocabulary of known words and then measures their presence. Topic modelling is typically based on the bag-of-words (BOW) representation. The essential idea of the topic model is that a document can be represented by a mixture of latent topics, and each topic is a distribution over words [11].

3 DATA COLLECTION
We collected news articles reporting on Brexit in the English language from the UK Web Archive (UKWA). The dataset consists of 5061 news articles after pre-processing. Due to the unavailability of news articles from other regions of the UK, we selected only the regions (London, Scotland, and Wales) which have a sufficient amount of news articles. Table 2 presents the number of news articles published from the different regions and by the different news publishers.

Table 2: Total number of news articles about Brexit published in three different regions (London, Scotland, and Wales).

Region: London (total 4248)
bankofengland.co.uk 8; bbc.com 2209; dailymail.co.uk 768; Independent.co.uk 191; inews.co.uk 52; metro.co.uk 1; neweconomics.org 1; rspb.org.uk 8; theguardian.com 1167; theneweuropean.co.uk 1; thesun.co.uk 235; cityam.com 3; conservativewomen.uk 1; dailypost.co.uk 1; ft.com 2; mirror.co.uk 9; raeng.org.uk 1; standard.co.uk 20

Region: Scotland (total 533)
news.stv.tv 533

Region: Wales (total 280)
gov.wales 3; nation.wales 122; Walesonline.co.uk 156

4 METHODOLOGY
The presented research focuses on clustering news articles. To this end, we experiment with clustering using combinations of different features and observe their performance. Our methodology consists of four steps and compares the performance of stylistic features and bag-of-words in clustering news articles, as shown in Figure 1.

In the first step, we select Brexit under topics and themes on the UK Web Archive¹. After crawling the list of news articles, we extracted the metadata of the news publishers from the Wikipedia infobox. The metadata extraction process is explained in our previous work [15]. In this process, we extracted the headquarters of the news publishers. Due to the unavailability of news articles from other regions of the UK, we selected only the regions (London, Scotland, and Wales) which have a sufficient amount of news articles.

In the second step, we parse the HTML web pages and extract the body text.

Figure 1: Methodology for clustering regional news using bag-of-words and stylistic features. (The flowchart runs from UKWA Brexit news articles for London, Scotland, and Wales through meta-data extraction, preprocessing, stylistic and bag-of-words feature extraction, and LSA, to hierarchical clustering and BCubed evaluation.)

Since the third step requires pre-processing for bag-of-words, we convert the text to lowercase and remove the stop words and punctuation marks. In the third step, for the stylistic features, we extract the features listed in Table 1 for all three regions and perform LSA (Latent Semantic Analysis). Similarly, for the bag-of-words, we use the pre-processed text and perform LSA. We also perform LSA on the combination of both types of features. 100 latent dimensions are used for LSA, a commonly recommended setting. We perform LSA and hierarchical clustering using the Python libraries SciPy and scikit-learn, and use the weighted distance between clusters.

¹https://www.webarchive.org.uk/en/ukwa/collection/910
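This bag-of-words → LSA → weighted hierarchical clustering pipeline can be sketched in a few lines; a minimal, self-contained toy example (the four-document corpus is illustrative, and 2 latent dimensions are used instead of the paper's 100 so that the example runs on its own):

```python
# Sketch of the pipeline: bag-of-words -> LSA -> weighted hierarchical
# clustering, using the same libraries as the paper (scikit-learn, SciPy).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from scipy.cluster.hierarchy import linkage, fcluster

docs = [
    "brexit vote leave eu",        # toy "politics" documents
    "brexit referendum eu vote",
    "rugby match wales cardiff",   # toy "sport" documents
    "rugby wales match score",
]
bow = CountVectorizer().fit_transform(docs)         # document-term matrix
lsa = TruncatedSVD(n_components=2, random_state=0)  # LSA via truncated SVD
X = lsa.fit_transform(bow)                          # dense latent representation

Z = linkage(X, method="weighted")                   # weighted linkage, as in the paper
labels = fcluster(Z, t=2, criterion="maxclust")     # cut the dendrogram into k=2 clusters
print(labels)
```

Cutting the dendrogram with `criterion="maxclust"` plays the role of choosing the number of clusters k, which the experiments below vary from 2 to 20.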
After performing the LSA, we apply hierarchical clustering and use two different types of evaluation measures, namely BCubed F1 and Silhouette scores.

5 EXPERIMENTAL EVALUATION
We have performed experimental evaluations using intrinsic (Silhouette) and extrinsic (BCubed-F) evaluation measures. Intrinsic metrics measure the goodness of a clustering in itself, whereas extrinsic metrics evaluate clustering performance against ground truth. For the extrinsic evaluation, we consider clusters generated by k-means clustering using typical bag-of-words as ground truth clusters. The value of k in k-means clustering ranges from 2 to 20. K-means identifies k centroids and then allocates every data point to the nearest cluster while keeping the centroids as small as possible. We cannot set the value of k to 1, as there would then be no other cluster to which to allocate the nearest data point. Silhouette is used to measure cohesion. It ranges from -1 to 1: a value of 1 means clusters are well apart from each other and clearly distinguished; 0 means clusters are indifferent, i.e. the distance between clusters is not significant; -1 means points are assigned to the wrong clusters. The BCubed F-measure defines precision as point precision, namely how many points in the same cluster belong to its class; similarly, point recall represents how many points from its class appear in its cluster.

• Silhouette Score:
S(i) = (b(i) - a(i)) / max(a(i), b(i))
where S(i) is the silhouette coefficient of the data point i, a(i) is the average distance between i and all the other data points in the cluster to which i belongs, and b(i) is the average distance from i to all clusters to which i does not belong.
• BCubed Precision and Recall:
Correctness(i, j) = 1 if L(i) = L(j) and C(i) = C(j), and 0 otherwise
BCubed Precision = (1/N) Σ_{i=1..N} Σ_{j ∈ C(i)} Correctness(i, j) / |C(i)|
BCubed Recall = (1/N) Σ_{i=1..N} Σ_{j ∈ L(i)} Correctness(i, j) / |L(i)|
where |C(i)| and |L(i)| denote the sizes of the sets C(i) and L(i), respectively, and L(i) and C(i) denote the class and the cluster of a point i.

• BCubed-F Score:
F = (2 × BCubed Precision × BCubed Recall) / (BCubed Precision + BCubed Recall)

6 RESULTS AND ANALYSIS
Figure 2 shows three line graphs. Each graph shows Silhouette scores across a different number of clusters (from 2 to 20) for one of the three regions of the UK: Scotland, Wales, and London, respectively. Blue and red lines represent bag-of-words (BOW) and stylistic features. We can see that in all three graphs the silhouette score of the stylistic features is significantly higher for all three regions, except at one point for Scotland. This means that cohesion is higher and the distance between the clusters is more significant using stylistic features than with BOW, which is mostly close to 0. This suggests that stylistic features are better at partitioning news articles into clusters than BOW.

Figure 2: The line graphs represent average silhouette scores across a different number of clusters. The blue line represents the score generated using bag-of-words and the red line the score generated using stylistic features. The three line graphs are generated for the three different regions Scotland, Wales, and London, respectively.

However, it is insufficient to conclude at this stage that stylistic features are better for capturing news reporting differences, because the clusters resulting from internal partitioning need not coincide with clusters based on news reporting differences. We therefore consider each region (London, Scotland, and Wales) as a ground truth cluster of the news articles published in that region. Table 3 shows the BCubed-F scores obtained when these ground truth clusters were matched with the ones created using bag-of-words, stylistic features, and a combination of both types of features. Similarly, we consider each newspaper/news publisher shown in Table 2 as a ground truth cluster of the news articles published by that newspaper/news publisher. Table 4 shows the BCubed-F scores obtained when these ground truth clusters were matched with the ones created using bag-of-words, stylistic features, and a combination of both types of features.

Table 3: The group of news articles published from three different regions of the UK is considered as ground truth clusters and the BCubed-F score is calculated using three types of features: bag-of-words, stylistic features, and a combination of both.

No. Features — BCubed-F Score
1. Bag-of-words — 0.75
2. Bag-of-words and stylistic features — 0.51
3. Stylistic features — 0.54

Table 4: The group of news articles published by 22 different news publishers of the UK is considered as ground truth clusters and the BCubed-F score is calculated using three types of features: bag-of-words, stylistic features, and a combination of both.

No. Features — BCubed-F Score
1. Bag-of-words — 0.53
2. Bag-of-words and stylistic features — 0.57
3. Stylistic features — 0.66

The scores using bag-of-words with regions as ground truth clusters are significantly higher (0.75) than those of stylistic features (0.54) and of the combination of all features (0.51). The scores using stylistic features with newspapers/news publishers as ground truth clusters are significantly higher (0.66) than those of bag-of-words (0.53) and of the combination of all features (0.57). The higher bag-of-words scores in regional news reporting suggest that bag-of-words is the better choice for clustering or classification at the regional level, because the newspapers/news publishers within a certain region report in different styles. Similarly, when it comes to classifying or clustering news reporting across different newspapers/news publishers, stylistic features are more useful, because each newspaper/news publisher follows its own reporting style.

7 CONCLUSIONS
In this paper, we have presented a comparison of different features, observing their performance in clustering news articles. The goal of this work was to investigate the performance of stylistic features and typical bag-of-words. The data consists of news articles about a popular event, Brexit, collected from the UKWA. These news articles belong to three different regions of the UK: Scotland, London, and Wales. Our experimental results suggest that bag-of-words features are better suited to clustering news reporting at the regional level, whereas stylistic features are better suited to clustering news reporting at the level of news publishers/newspapers.

ACKNOWLEDGMENTS
The research described in this paper was supported by the Slovenian research agency under the project J2-1736 Causalify and by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 812997.

REFERENCES
[1] Ahmet Aker, Monica Paramita, Emina Kurtic, Adam Funk, Emma Barker, Mark Hepple, and Rob Gaizauskas. 2016. Automatic label generation for news comment clusters. In Proceedings of the 9th International Natural Language Generation Conference. Association for Computational Linguistics, 61–69.
[2] Sascha O Becker, Thiemo Fetzer, and Dennis Novy. 2017. Who voted for Brexit? A comprehensive district-level analysis. Economic Policy, 32, 92, 601–650.
[3] Danielle K Brown and Summer Harlow. 2019. Protests, media coverage, and a hierarchy of social struggle. The International Journal of Press/Politics, 24, 4, 508–530.
[4] Honglin Chen, Xia Huang, and Zhiyong Li. 2022. A content analysis of Chinese news coverage on COVID-19 and tourism. Current Issues in Tourism, 25, 2, 198–205.
[5] Elizabeth W Dunn, Moriah Moore, and Brian A Nosek. 2005. The war of the words: how linguistic differences in reporting shape perceptions of terrorism. Analyses of Social Issues and Public Policy, 5, 1, 67–86.
[6] Frederick G Fico, Stephen Lacy, and Daniel Riffe. 2008. A content analysis guide for media economics scholars. Journal of Media Economics, 21, 2, 114–130.
[7] Yulin Hswen, Amanda Zhang, Clark Freifeld, John S Brownstein, et al. 2020. Evaluation of volume of news reporting and opioid-related deaths in the United States: comparative analysis study of geographic and socioeconomic differences. Journal of Medical Internet Research, 22, 7, e17693.
[8] Qihao Ji, Arthur A Raney, Sophie H Janicke-Bowles, Katherine R Dale, Mary Beth Oliver, Abigail Reed, Jonmichael Seibert, and Arthur A Raney. 2019. Spreading the good news: analyzing socially shared inspirational news content. Journalism & Mass Communication Quarterly, 96, 3, 872–893.
[9] Moya Jones. 2017. Wales and the Brexit vote. Revue Française de Civilisation Britannique / French Journal of British Studies, 22, XXII-2.
[10] Ifrah Pervaz, Iqra Ameer, Abdul Sittar, and Rao Muhammad Adeel Nawab. 2015. Identification of author personality traits using stylistic features: notebook for PAN at CLEF 2015. In CLEF (Working Notes). Citeseer, 1–7.
[11] Zengchang Qin, Yonghui Cong, and Tao Wan. 2016. Topic modeling of Chinese language beyond a bag-of-words. Computer Speech & Language, 40, 60–78.
[12] Abdul Sittar and Iqra Ameer. 2018. Multi-lingual author profiling using stylistic features. In FIRE (Working Notes), 240–246.
[13] Abdul Sittar, Hafiz Rizwan Iqbal, and Rao Muhammad Adeel Nawab. 2016. Author diarization using cluster-distance approach. In CLEF (Working Notes). Citeseer, 1000–1007.
[14] Abdul Sittar and Dunja Mladenic. 2021. How are the economic conditions and political alignment of a newspaper reflected in the events they report on? In Central European Conference on Information and Intelligent Systems.
Faculty of Organization and Informatics Varazdin, 201–208.
[15] Abdul Sittar, Dunja Mladenić, and Marko Grobelnik. 2022. Analysis of information cascading and propagation barriers across distinctive news events. Journal of Intelligent Information Systems, 58, 1, 119–152.
[16] Jie Xu, Huayi Tang, Yazhou Ren, Liang Peng, Xiaofeng Zhu, and Lifang He. 2022. Multi-level feature learning for contrastive multi-view clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16051–16060.

Automatically Generating Text from Film Material – A Comparison of Three Models

Sebastian Korenič Tratnik, Jožef Stefan International Postgraduate School, Faculty of Computer and Information Science, Večna pot 113, Ljubljana, Slovenia
Erik Novak, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT
The paper focuses on audio analysis and text generation using film material as an example. The proposed approach uses three different models (Wav2Vec2, HuBERT, S2T) to process the sound from different audio-visual units. A comparative analysis shows the strengths of the different models and the factors of the different materials that determine the quality of text generation for functional film annotation applications.

KEYWORDS
Text generation, automated transcription, cinema, film, video

1 INTRODUCTION
Applications like automatic text captions for video materials have become more and more popular and are extensively used on different media, spanning computers, televisions, smartphones and other technologies that enable audio-visual consumption. However, even though these applications have to an extent already become a staple of our everyday lives, their performance often varies and still has not reached optimal functionality. There are many challenges in generating text from audio-visual materials. These span from the structure and quality of the material, the type or category of sound, and the age of the recordings, to the models on which such transcription is based. The main goal of this paper is to provide a practical demonstration of a few basic models for automatic annotation. The aim is to take into account the currently most common procedures for such an endeavour and figure out how to minimize the loss of the models so as to generate text from film or video more successfully.

The rest of this paper is organized in the following way. Section 2 provides a description of the problem in the context of contemporary consumption of audio-visual materials via the most popular information and communication technologies. Section 3 delineates the methodology used and describes the approach taken to tackle the problem in a concrete demonstration. Section 4 presents the models being used and describes our implementation of them, specifying the dynamics of the obtained results. A conclusion is reached in Section 5, where the paper offers a discussion on the outcome and possible directions for future work.

2 PROBLEM DESCRIPTION
In recent years, audio-visual data has become as influential as, if not more influential than, traditional text-based information. With this, the task of extracting information from the former and transforming it into the latter is becoming useful for different purposes [1, 2]. One example is that text annotations enable better comprehension in cases of bad sound quality, or even allow the material to be understood in situations where sound consumption is impossible. Another is the possible speed-up of the video that annotations provide, due to their ability to keep the content integral in a clear graphic form. The consumption process can be made more time efficient, with textual information compensating for the distortions of audio-visual quality that can be brought about by manipulating the playing options. Furthermore, in a general sense, combining audio-visual material with text can solve many problems at different levels of film or video production. This can span from the preparatory phases of pre-production, such as writing the script, to the post-production phases, where one needs good orientation over a vast quantity of material. Proper text generation can facilitate easier orientation in such work and allows for a more efficient organization of the media materials.

In this paper, we will focus on those components that contribute to the quality of proper automated text generation as a prerequisite of such developmental strategies. The main contributions of this paper are: (1) an analysis of the factors that influence automatic transcription of film or video material, (2) implementation and comparison of a few different models for sound annotation, and (3) reflection on how this process can be used for more complex tasks.

3 METHODOLOGY
The problem we are solving is to take a piece of audio-visual material, convert it into a form that a model for automatic text generation can take as input, and then generate text output that matches the sound recording of the input in an optimal way. An optimal result should provide a close correspondence to the utterances in the film material and eventually identify different types and categories of sound such as dialogue, noise, music, etc. We will do an analysis of
the factors that influence the quality of automatically generated transcriptions in the following steps: 1) a comparison of different models for generating text from audio files, 2) an analysis of how the quality of transcriptions differs in relation to noise in the background (silence, music, dialogues), 3) an evaluation of how the clarity of speech influences the quality of transcriptions, and 4) an assessment of the extent to which it is more difficult to generate quality transcriptions from older audio recordings (films). Reflecting on the results of our procedure, we will consider how to improve the quality in cases where the quality of transcriptions is bad. Aside from quality, we will measure the time demands of the models, that is, how much time the models need to generate transcriptions from the audio recording.

The following models were used:

1) Wav2Vec2 [4] is a framework for self-supervised representation learning from raw audio that was made open-source by Facebook. It was the first automatic speech recognition model included in Transformers, one of the central libraries of Natural Language Processing. Figure 1 shows the model's architecture.

Figure 1. Wav2Vec2 learns speech units from multiple languages using cross-lingual training [4].

The model starts by processing the raw waveform with a multilayer convolutional neural network. This yields latent audio representations of 25 ms that are fed into a quantizer and a transformer. From an inventory of learned units, the quantizer chooses appropriate ones, while half of the representations are masked before being used. The transformer then adds information from the whole of the audio sequence, and the output is used to solve a contrastive task in which the model identifies the correct quantized speech units for the masked positions.

2) HuBERT [3] (Hidden-Unit BERT) is an approach to self-supervised speech representation that uses masking in a similar way and in addition adds an offline clustering step that provides aligned target labels for a prediction loss. This prediction loss is applied over the masked regions, which leads the model to learn a combined language and acoustic model over the continuous inputs. By focusing on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels, HuBERT can either match or improve on the Wav2Vec2 model. Figure 2 shows the model's architecture.

Figure 2. HuBERT predicts hidden cluster assignments for masked frames (y2, y3, y4 in the figure) generated by one or more iterations of k-means clustering [7].

3) S2T [5] (Speech2Text) is a transformer-based encoder-decoder (seq2seq) model that uses a convolutional downsampler to reduce the length of the audio inputs by more than one half before they are fed into the encoder. It generates the transcripts autoregressively and is trained with a standard autoregressive cross-entropy loss.

4 EXPERIMENT SETTING

4.1 Evaluation metric
We have used WER (Word Error Rate) as the metric of model performance; it computes the error rate from the counts of substitutions, deletions and insertions relative to the correct words. The original text was used for each of the models and each film example, with punctuation removed.

4.2 Data set
The dataset was formed from clips of different films. The films used were classics of world cinema (The Godfather, 2001: A Space Odyssey, Star Wars, Frankenstein, Fight Club, Paris, Texas, Scent of a Woman, Tomorrow and Tomorrow and Tomorrow). 14 clips with lengths spanning from 5 to 30 seconds were used, with the lengthier ones incorporating different sound contents (speech, shouting, whispering, etc.). The first step was to prepare the audio in a format that the models are able to read, so the clips were changed from mp4 to wav.
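This mp4 → wav step can also be scripted rather than done through an online converter; a minimal sketch assuming ffmpeg is installed on the system (file names are illustrative; 16 kHz mono is the sampling format that speech models such as Wav2Vec2 commonly expect):

```python
# Build and run an ffmpeg command that extracts the audio track of an
# mp4 clip as 16 kHz mono WAV, the input format expected by the ASR models.
import subprocess

def to_wav_cmd(src, dst, rate=16000):
    """Return the ffmpeg invocation for mp4 -> wav extraction."""
    # -ar sets the sampling rate, -ac 1 downmixes to mono, -y overwrites dst.
    return ["ffmpeg", "-y", "-i", src, "-ar", str(rate), "-ac", "1", dst]

def convert(src, dst):
    subprocess.run(to_wav_cmd(src, dst), check=True)  # requires ffmpeg on PATH

print(to_wav_cmd("clip.mp4", "clip.wav"))
```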
An online converter, cloudconvert (https://cloudconvert.com), was used, as the clips were fairly short and the results could be added to the Kaggle dataset directly from the browser.

Figure 3: A superposition of waveform graphs of all the examples.

4.3 Implementation details

Programming was done on Kaggle, where the code was written in Python; after the experiments were set up, the GPU was activated for faster computation. The general process for each of the models is the following. First, an encoder takes the raw data and feeds it into the model. In our demonstration, tokenizers were used at the start, but as the S2T tokenizer was not equipped to ingest the audio, it had to be changed to a processor. To retain consistency, the same step was applied to the other two models as well. Once the data is in the model, the model predicts particular syllables for each sound with certain probabilities and then, in an additional step, selects those with the highest probability based on the context of the semantic whole of the sentence. In the final step, the decoder (again the tokenizers / processors) takes the output of the model and transforms it into text.

5 EXPERIMENT RESULTS

The ground rules for our project were that each model had a particular function that took sound as input and produced text as output, with the text extracted separately for each audio clip. Subsequently, the models were compared according to the accuracy of the results under different criteria and a variety of scenarios (noise, music, number of characters, tempo of speech, etc.). We will illustrate the obtained results via a concrete example: a clip with relatively clear sound from the film A Few Good Men (1992), a digitized version of a well-preserved celluloid film. The sound is clear and the dialogue takes place in a courtroom, practically in complete silence of the surroundings, with the speech changing from a normal tone to screaming. The clip is 22 seconds long and its waveform is shown in Figure 4. The original text is as follows:

A: Did you order the Code Red?!
B: You don't have to answer that question!
C: I'll answer the question. You want answers?
A: I think I'm entitled!
C: You want answers!?
A: I want the truth!
C: You can't handle the truth! Son, we live in a world that has walls, and those walls have to be guarded by men with guns. Who's gonna do it? You? You, Lieutenant Weinberg?

Figure 4: A scene from A Few Good Men (1992), a still and waveform graph from the used sequence.

The produced transcriptions are as follows:

Wav2Vec2:
YOU WAR THE CORA YOU DON'T HAVE TO ANSWER THE QUESTION I'LL ANSWER THE QUESTION YOU WANT ANSWERS I THINK I'M ENTITLE YOU WANT ANT A AT THE TRUE YOU CAN'T HANDLE THE TRUTH SON WE LIVE IN A WORLD THAT HAS WALLS AND THOSE WALLS HAVE TO BE GUARDED BY MEN WITH GUNS WHO'S GON TO DO IT YOU YOU LIEUTENANT WINEBERG

HuBERT:
OMARTER TE CORET YOU DON'T HAVE TO ANSWER THAT QUESTION I'LL ANSWER THE QUESTION YOU WANT ANSWERS I THINK I'M ENTITLED YOU WANT ANSWERRTHE TRUTH YOU CAN'T HANDLE THE TRUTH SON WE LIVE IN A WORLD THAT HAS WALLS AND THOSE WALLS HAVE TO BE GUARDED BY MEN WITH GUNS WHO'S GOING TO DO IT YOU YOU LIEUTENANT WINBURG

S2T:
DEAR LORD THE CORRET YOU DON'T HAVE THE ANSWER THAT QUESTION I'LL ANSWER THE QUESTION YOU WANT ANSWERS BUT THEY CAN'T ENTITLE YOU ONE AND THE TRUTH YOU CAN'T HANDLE THE TRUTH SOME WE LIVE IN A WORLD THAT HAS WALLS AND THOSE WALLS HAVE TO BE GUARDED BY MEN WITH GUNS WHOSE TENANT DO IT YOU LIEUTENANT WINEBURG THOSE HAVE TO BE GUARDED BY MEN WITH GUNS WHOSE CANNON DO IT YOU YOU LIEUTENANT WINEBURG YOU LIEUTENANT WINEBURG

The lower the WER number, the better the results. The models did not show a noticeable variation in speed, while the quality of their performance varied due to different factors. HuBERT gave overall the best results from the point of view of readability. According to the rate of correspondence between input audio and output text, HuBERT gave a comparably better transcription rate than Wav2Vec2 for videos with poor audio quality, i.e. those from older or damaged films, while Wav2Vec2 performed better in the case of background music, but had a tendency to add too many insertions. S2T had a tendency to produce mistakes, seen in numbers peaking over 1.0. The overall results are given in Table 1.

Table 1: The WER scores for each model. The bold values represent the best performances on the given clip. The best performing model is HuBERT.

Clip      Wav2Vec2   HuBERT   S2T
1         69%        53%      91%
2         100%       0%       100%
3         100%       95%      95%
4         27%        30%      36%
5         17%        17%      17%
6         39%        18%      43%
7         28%        28%      64%
8         70%        46%      55%
9         50%        25%      100%
10        57%        37%      73%
11        62%        38%      51%
12        100%       95%      100%
13        60%        33%      73%
14        9%         4%       9%
Average   56%        37%      65%

It is important to note that the average given does not reflect the overall accuracy alone, but is the sum of different factors. The models can be good at transcribing particular words, but can add or drop extra words in the process and therefore make the overall text less comprehensible. An important factor is the way the original text used for comparison is written: omitting punctuation and writing the words properly, even if they are mispronounced, will improve the results. Finally, it is crucial that all the texts are in caps lock, or the comparison will not work and will produce misleading results. The WER usually expresses the result as a metric between 0 and 1; however, when the annotation results are extremely unsuccessful, the higher extreme may surpass this limit. In our case, values up to 1.6 were reached; in the chart, however, they were capped at 1.0 for purposes of clarity.

As the used example shows, it is mostly the clarity of speech that determines how the models perform. As the models were pre-trained and were not trained on the specific data used, they were in general surprisingly efficient. The discrepancies between different treatments of the same audio are visible, but in general, as long as the dialogue was clear, the results were comparable. Music seemed to cause bigger problems for the models than background noise, while additional speech in the background proved most problematic. Emotional influences on speech did not prove problematic, and even affective utterances were transcribed comparably to neutral speech if the sound data was of high quality.

6 DISCUSSION AND FURTHER WORK

As a general principle, when taking clips from films, the main factor that can negatively influence the quality of the generated text is background noise. As one would expect, the models work best when nothing is in the background and worst when people are talking in the background. Ideally, to improve the quality one would train the models on the specific material, using a similar type of material and accordingly performing a pre-classification according to the main categories of sound analysis (i.e. monologue, dialogue, background noise, music, echo, normal speech, loud speech, shouting, whispering, etc.), especially when using older or less well preserved material, which drastically differs in its sound data from newer or better preserved works.

In our research we expanded on and adapted existing work on automated text generation models, providing an analysis of the factors that determine the quality of such results from film material. As an example, we applied our approach to different film material, ranging in the quality and age of the clips and in the structure of the sound data.

A useful strategy for the future, from the perspective of film practice, would be to find ways to link transcriptions with a script. A precondition for such an endeavour would be to implement an algorithm for recognizing the person speaking and identifying the source with descriptions ("person A is speaking, then person B, then person A has a long monologue, person C answers", etc.). Another important task would be identifying sounds of different categories and providing fitting audio-signs (the sound of squeaking steps, the playing of music, etc.). From these steps one could eventually, at least to some extent, automatically generate scripts for films or find ways to develop tools for easier text-based classification of audio-visual material.

CONCLUSIONS

In this paper we explored ways to generate text from the audio information presented in film and video material. We used three different models to evaluate various film units: Wav2Vec2, HuBERT, and S2T. We found that the HuBERT model achieved the best results, while the remaining two methods performed similarly.

ACKNOWLEDGMENTS

The research described in this paper was supported by the Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, in the class Textual/Multimedia Mining and Semantic Technologies held by dr. Dunja Mladenić, under the mentorship of Erik Novak. We also thank Besher Massri, Aljoša Rakita and Martin Abram for additional feedback.

REFERENCES

1 A. Ramani, A. Rao, V. Vidya and V. B. Prasad. Automatic Subtitle Generation for Videos. 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), 2020, pp. 132-135, doi: 10.1109/ICACCS48705.2020.9074180.
2 Rustam Shadiev, Yueh-Min Huang. Facilitating cross-cultural understanding with learning activities supported by speech-to-text recognition and computer-aided translation. Computers & Education, Volume 98, 2016, pp. 130-141.
3 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. arXiv:2106.07447v1 [cs.CL], submitted on 14 Jun 2021.
4 Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv:2006.11477v3, submitted on 20 Jun 2020 (v1), last revised 22 Oct 2020 (v3).
5 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. fairseq S2T: Fast Speech-to-Text Modeling with fairseq. arXiv:2010.05171v1, submitted on 11 Oct 2020.
6 Wav2vec 2.0: Learning the structure of speech from raw audio. https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio, 24 Sep 2020, accessed 9 Jan 2022.
7 Hsu, Wei-Ning, et al. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021): 3451-3460.

The Russian invasion of Ukraine through the lens of ex-Yugoslavian Twitter

Bojan Evkoski (bojan.evkoski@ijs.si), Jozef Stefan Institute and Jozef Stefan Postgraduate School, Ljubljana, Slovenia
Igor Mozetič (igor.mozetic@ijs.si), Jozef Stefan Institute, Ljubljana, Slovenia
Petra Kralj Novak (petra.kralj.novak@ijs.si), Central European University, Vienna, Austria, and Jozef Stefan Institute, Ljubljana, Slovenia
Nikola Ljubešić (nikola.ljubesic@ijs.si), Jozef Stefan Institute, and Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia

Figure 1: Pre-invasion (left) and invasion (right) ex-Yugoslavian retweet networks. Node colors represent communities.
Labeled arrows point to the main communities, with labels inferred from the community users. The in-network labels represent the names of the most retweeted accounts.

ABSTRACT

The Russian invasion of Ukraine marks a dramatic change in international relations globally, as well as in specific, already unstable, regions. The geographical area of interest in this paper is a part of ex-Yugoslavia where the BCMS (Bosnian, Croatian, Montenegrin, Serbian) languages are spoken, official varieties of a pluricentric Serbo-Croatian macro-language [4]. We analyze 12 weeks of Twitter activities in this region, six weeks before the invasion, and six weeks after the start of the invasion. We form retweet networks and detect retweet communities, which closely correspond to groups of like-minded Twitter users. The communities are distinctly divided across countries and political orientations. Some communities detected after the start of the Russian invasion also show a clear pro-Ukrainian or pro-Russian stance. Such analyses of social media help in understanding the role and effect of this conflict at the regional level.

KEYWORDS

social network analysis, community detection, Twitter

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia. © 2022 Copyright held by the owner/author(s).

1 INTRODUCTION

The Russian invasion of Ukraine brings about dramatic changes to the world. Analysing the structure and content of the communication on social media, such as Twitter, can give more insight into the causes, developments and consequences of this conflict. The geographical area of interest in our research is a part of ex-Yugoslavia where the BCMS (Bosnian, Croatian, Montenegrin, Serbian) languages are spoken, official varieties of the pluricentric Serbo-Croatian macro-language. This area is strongly politically divided by the diverging influences of NATO (Croatia, Montenegro, North Macedonia, the Bosniak and Croatian entities in Bosnia and Herzegovina) and Russia (Serbia, the Serbian entity in Bosnia and Herzegovina). While Croatia has been a full EU member since 2013, Montenegro, North Macedonia and Serbia are EU candidate members, while Bosnia and Herzegovina is a potential candidate. Regarding military alliances, NATO members are Croatia (since 2009), Montenegro (since 2017) and North Macedonia (since 2020), while Serbia does not aspire to join NATO, primarily due to a complex Serbia-NATO relationship caused by the NATO intervention in Yugoslavia in 1999.

To shed light on the impact of the Russian invasion on this brittle and complex geographical and political area, we use social network analysis on the available Twitter data, six weeks before and six weeks into the invasion. We discover a complex landscape of ideology-specific and country-specific communities (see Figure 1), and analyse the transition into evident pro-Ukraine and pro-Russia leanings. We also present a method to measure the similarity of the communities before and during the invasion by analyzing URL and hashtag usage. As the communities show very divergent properties, we echo concerns about the heavy polarization and possible destabilization of this area of the Balkans.

2 RESULTS

The data analysed in this study were collected with the TweetCat tool [3], focused on harvesting tweets in less frequent languages. TweetCat continuously searches for new users tweeting in the language of interest by querying the Twitter Search API for the most frequent and unique words in that language. Every user identified as tweeting in the language of interest is continuously collected from that point onward. This data collection procedure has been running for the BCMS set of languages since 2017. During the 12 weeks of our focus, we collected 1.2M tweets and 3.8M retweets from 45,336 users. A rough estimate of the per-country production of tweets via URL usage from country-specific top-level domains (upper part of Table 1) shows Twitter to be much more popular in Serbia and Montenegro than in Croatia or Bosnia and Herzegovina. This has to be taken into account when analysing the communities of the underlying tweetosphere.

We created pre-invasion and invasion retweet networks (users as nodes, retweets as edges) from the collected data. We applied community detection (Ensemble Louvain [1]) to the two networks and analysed the community properties and user transitions [2]. We identified and named the large communities (more than 100 users) through a careful analysis of their most influential users and hashtag/URL usage. Figure 2 depicts the user transitions between the two networks, while Table 1 shows the general statistics of each community. We discovered the following peculiarities:

• The BCMS tweetosphere is dominated by Serbian (RS) users and content.
• The political communities are more active than the non-political ones.
• The RS populist coalition community (led by the Serbian president Aleksandar Vučić) forms a very strong echo chamber, with less than 2% of all users, yet more than 25% of tweets and retweets, and more than 95% of intra-community retweets.
• The RS populist coalition and the left-wing opposition remain neutral on the invasion topic.
• The RS right-wing opposition and the Bosnian Serbs show a clear pro-Russia stance.
• The Croatian, Bosnian and Montenegrin communities show a clear pro-Ukraine stance.

Figure 2: A Sankey diagram showing the transitions of users from the pre-invasion network communities (left) to the invasion network communities (right). Rectangle height is proportional to the community sizes. Percentages near the pre-invasion communities show the portion of users found in the corresponding invasion communities. Percentages on the right-hand side of the invasion communities show the portion of users not previously present in the large communities of the pre-invasion network. Gray rectangles depict the communities tightly related to politics, with yellow and red denoting the detected pro-Ukraine and pro-Russia leaning communities, respectively.

Country                        Population     URLs
Serbia (RS)                    7.2M (47.3%)   106K (44.2%)
Croatia (HR)                   3.9M (25.6%)   19.6K (8.1%)
Bosnia and Herzegovina (BA)    3.5M (23.0%)   14.9K (6.2%)
Montenegro (ME)                620K (4.1%)    24.7K (10.2%)
Total                          15.2M          242K

Pre-invasion communities    Users          Tweets         Retweets       Intra-com. RTs
RS tweetosphere part 1      13K (29.0%)    125K (24.9%)   300K (18.9%)   80.3%
RS tweetosphere part 2      2.5K (5.6%)    35.8K (7.1%)   63.2K (4.0%)   62.3%
RS sports                   1.6K (3.6%)    12.6K (2.5%)   25.6K (1.6%)   53.8%
ME tweetosphere             1.7K (3.8%)    22.7K (4.5%)   44.6K (2.8%)   74.5%
BA + HR + ME tweetosphere   5.6K (12.4%)   37.8K (7.5%)   59K (3.7%)     75.3%
Macedonian tweetosphere     200 (0.4%)     721 (0.1%)     771 (0.1%)     77.7%
International tweetosphere  934 (2.0%)     8.5K (1.7%)    11.5K (0.7%)   62.3%
RS populist coalition       2.0K (4.8%)    52.4K (10.4%)  396K (24.9%)   98.7%
RS left-wing opposition     9.3K (20.6%)   105K (20.9%)   408K (25.5%)   80.5%
RS right-wing opposition    7.6K (16.8%)   87.8K (17.4%)  247K (15.5%)   72.1%
Bosnian Serbs               139 (0.3%)     2.2K (0.4%)    3.8K (0.2%)    83.1%
Total                       45.3K          502.9K         1590K

Invasion communities                     Users           Tweets          Retweets        Intra-com. RTs
RS tweetosphere part 1                   16.9K (29.5%)   160K (22.4%)    387K (16.8%)    71.1%
RS tweetosphere part 2                   4.5K (7.7%)     57.3K (8.1%)    118K (5.1%)     58.1%
BA + HR + ME tweetosphere (pro-Ukraine)  12.4K (21.7%)   76.1K (10.6%)   235K (10.2%)    64.7%
RS right-wing opposition (pro-Russia)    11.1K (19.4%)   129K (17.9%)    508K (22.1%)    65.1%
RS populist coalition                    1.8K (3.1%)     208K (29.1%)    450K (19.5%)    95.6%
RS left-wing opposition                  9.8K (17.2%)    191K (26.7%)    590K (25.6%)    72.6%
Bosnian Serbs (pro-Russia)               356 (0.6%)      5.4K (0.7%)     7.1K (0.3%)     62.3%
Total                                    57.4K (+26.7%)  717K (+42.8%)   2302K (+44.8%)

Table 1: The first part shows the general population of each BCMS country and their respective tweet URL shares (.rs, .hr, .ba and .me). The second part shows the pre-invasion network communities with the number of users, tweets, retweets and intra-community retweets. The third part shows the same statistics for the invasion network communities. Grey rows depict political communities, while yellow and red show the pro-Ukraine and pro-Russia communities, respectively.

In order to compare the pre-invasion and invasion communities in terms of content and political leanings, our next goal was to compare the pool of hashtags used and URLs shared by the community users. To this end, we developed a simple community similarity method. First, we preprocessed the URLs by manually filtering out the ones coming from social media sources like Twitter, Facebook, Youtube etc., as well as URL shorteners. With this, we created a subset in which more than 99% of the URLs were news media, making it ideal for media polarization analysis. Once we extracted the domains of the URLs, we created sorted lists of the top 50 URL domains and the top 50 hashtags for each community, sorted by usage counts. Finally, to calculate the similarities between communities, we used the Rank-biased overlap (RBO) measure for indefinite rankings [5].

We found that the matchings between the pre-invasion and invasion communities based on highest-user-overlap transitions are also visible through the URL and hashtag similarities (see Figure 3). In fact, for each pre-invasion community, its respective highest-user-overlap invasion community is also the highest RBO pair for both URLs and hashtags. In other words, there is a strong positive correlation between the user transition percentages (Figure 2) and the RBO scores. E.g., 68% of the users from the pre-invasion "RS populist coalition" community transition into the "RS populist coalition" community in the invasion network. Meanwhile, the URL RBO of this pair is 0.64 and the hashtag RBO is 0.43, both being the highest combination for the pre-invasion "RS populist coalition" community, clearly matching it with its invasion transition-based counterpart. This shows that our simple similarity method based on URLs and hashtags can even help in better matching communities in the task of community evolution [6].

Figure 3: Domain and hashtag community similarities. A heatmap showing the similarities between the pre-invasion and invasion network communities based on the top 50 URLs (left) and hashtags (right). Similarities are calculated using the Rank-biased overlap (RBO) measure for indefinite rankings [5].

3 CONCLUSION

In this work, we investigated the Russian invasion of Ukraine through the lens of Twitter in the ex-Yugoslavian region where Bosnian, Croatian, Montenegrin and Serbian are spoken. We analyzed 12 weeks of Twitter activities in this region, six weeks before the invasion, and six weeks after the start of the invasion. For each period, we created retweet networks and detected retweet communities. We followed the transition of users from the pre-invasion to the invasion period and analyzed these groups of like-minded Twitter users, discovering that they are distinctly divided across countries and political orientations. For the invasion network, we were also able to detect communities which show a clear pro-Ukrainian or pro-Russian stance.

Another contribution is a simple method for comparing retweet network communities based on the content of the tweets. The method showed a strong correlation with the most prominent user transitions we had discovered earlier.

A continuation of this work is to expand it into multidisciplinary research, with the aim of meticulously analyzing the polarized content between the communities in collaboration with domain experts who are knowledgeable in ex-Yugoslavian politics. Beyond obtaining interesting insights, we also aim to explore two frequent issues in using social media for societal analyses: (1) the uptake bias of specific social networks across countries and communities, and (2) the entanglement of the main event with other large-scale events.

ACKNOWLEDGMENTS

The authors acknowledge the financial support of the Slovenian Research Agency (research core funding no. P2-103 and no. P6-0411).

REFERENCES

[1] B. Evkoski, I. Mozetič, P. Kralj Novak. Community evolution with Ensemble Louvain. In 10th Intl. Conf. on Complex Networks and their Applications, Book of Abstracts, pp. 58–60, Madrid, Spain, 2021.
[2] B. Evkoski, I. Mozetič, N. Ljubešić, and P. Kralj Novak. Community evolution in retweet networks. PLoS ONE, 16(9):e0256175, 2021. Non-anonymized version available at https://arxiv.org/abs/2105.06214.
[3] N. Ljubešić, D. Fišer, and T. Erjavec. TweetCaT: a tool for building Twitter corpora of smaller languages. In Proc. 9th Intl. Conf. on Language Resources and Evaluation, pp. 2279–2283, ELRA, Reykjavik, Iceland, 2014.
[4] N. Ljubešić, M. Miličević Petrović, and T. Samardžić. Borders and boundaries in Bosnian, Croatian, Montenegrin and Serbian: Twitter data to the rescue. Journal of Linguistic Geography 6:2, pp. 100–124, DOI 10.1017/jlg.2018.9, Cambridge University Press, 2018.
[5] W. Webber, A. Moffat, and J. Zobel. A similarity measure for indefinite rankings. ACM Trans. Information Systems 28(4):20, 2010.
[6] G. Rossetti and R. Cazabet. Community discovery in dynamic networks: a survey. ACM Computing Surveys (CSUR) 51.2 (2018): 1–37.

Visualization of consensus mechanisms in PoS based blockchain protocols

Daniil Baldouski (d.baldovskiy@mail.ru), University of Primorska, Koper, Slovenia
Aleksandar Tošić (aleksandar.tosic@upr.si), University of Primorska, Koper, Slovenia, and InnoRenew CoE, Izola, Slovenia

ABSTRACT

In the past decade, decentralized systems have been gaining increasing attention. Much of the attention arguably comes from the financial and sociological acceptance and adoption of blockchain technology. One of the frontiers has been the design of new consensus protocols, topology optimisation in these peer-to-peer (P2P) networks, and gossip protocol design. Analogous to agent-based systems, transitioning from design to implementation is a difficult task. This is due to the inherent nature of such systems, where nodes or actors within the system only have a local view of the system, with very few guarantees on the availability of data. Additionally, such systems often offer no guarantees of system-wide time synchronisation. This research offers insight into the importance of visualisation techniques in the implementation phase of vote-based consensus algorithms and P2P overlay network topology. We present our custom visualisations, and note their usefulness in debugging and identifying potential issues in decentralized networks. Our use case is an implementation of a blockchain protocol.

KEYWORDS

Grafana, visualisation, consensus mechanism, blockchain protocols, P2P, overlay network

1 INTRODUCTION

Distributed systems are notoriously difficult to inspect and their problems difficult to identify. The difficulty stems from the fact that predominant issues can be stochastic and difficult to reproduce, and from the inability to easily observe, compare, and test multiple programs running on separate machines at the same time. Another important aspect of distributed systems is that they inherently make heavy use of the network. The use of various network protocols imposes additional complexity, which increases the search space when identifying bugs. In recent years, distributed systems have been gaining more attention both in academia and in the private sector. This increasing interest can be largely attributed to the rapid development of distributed ledger technology and blockchain. Many new consensus mechanisms, blockchain protocols, network protocols, and improvements in gossip protocols have been proposed, and many of them are transitioning from a theoretical framework to a practical implementation. Public distributed ledgers (distributed ledger technology, or DLT) and blockchains secure their consensus mechanisms and provide spam resistance through the use of tokens representing value. The use of digital value within the protocol enables the protocol to enforce a level of security through economic incentives and game-theoretical aspects that make most attack vectors economically infeasible or impractical for the attacker. A good example of this is the Proof of Stake (PoS) consensus mechanism, where nodes secure the decentralized protocol by being required to stake and lock up a considerable amount of value, which can be deducted (usually referred to as slashing) by the protocol in case the node misbehaves. The economic aspect of public blockchains poses a very high security risk. With such strong economic incentives to identify and exploit potential bugs and system faults, it is of utmost importance for developers to thoroughly test and examine potential problems. However, the aforementioned difficulties in debugging distributed and decentralized protocols require developers to be equipped with tools that support their efforts.

In this study, we review state-of-the-art approaches to testing and debugging voting-based consensus mechanisms and decentralized networks. We develop a visualisation specifically designed for researchers and developers to test such networks and compare real-time observed data with the expected data. We conclude that visualisation techniques can be complementary to traditional log-based debugging and testing techniques. Moreover, we provide our tools as open-source software, as plugins for the popular visualisation platform Grafana. Both tools make no assumptions about the data storage implementation; the plugins can be configured via the Grafana plugin configuration interface to fit the specifics of the protocol implementation. We validate our tools by applying them to a custom developed blockchain, and then explain how successful they turned out to be in identifying anomalies and bugs in the protocols.

2 THE ROLE OF VISUALIZATIONS IN DEBUGGING COMPLEX DISTRIBUTED SYSTEMS

Distributed and decentralized systems are difficult to debug, as developers are working on a third layer: there are code-level bugs on L1, issues with concurrency on L2 (the individual run-time), and finally a third dimension of potential bugs arising from the message exchange between nodes. In general, it is often hard to capture the state of a distributed system, as debuggers cannot be attached to all nodes' run-times. Additionally, it is often difficult to reproduce errors when they are inherently stochastic.
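One common way to make such stochastic behaviour inspectable after the fact is to ship each node's telemetry as timestamped records into a time-series database. As an illustrative sketch only (the measurement, tag, and field names below are hypothetical, not taken from any particular protocol), a single observation can be serialized into InfluxDB's line protocol, `measurement,tags fields timestamp`:

```python
def to_line_protocol(measurement: str, tags: dict, fields: dict, ts_ns: int) -> str:
    """Serialize one telemetry point into InfluxDB line protocol:
    measurement,tag=val,... field=val,... timestamp(ns)."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))

    def fmt(v):
        # Strings are quoted, booleans are literal, integers carry an 'i'
        # suffix, and floats pass through unchanged.
        if isinstance(v, str):
            return f'"{v}"'
        if isinstance(v, bool):
            return "true" if v else "false"
        if isinstance(v, int):
            return f"{v}i"
        return repr(v)

    field_part = ",".join(f"{k}={fmt(v)}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {ts_ns}"

# Example: one consensus-round observation from a (hypothetical) node.
point = to_line_protocol(
    "consensus", {"node": "n07", "role": "validator"},
    {"round": 42, "votes_seen": 17}, 1665395200000000000)
```

Because every point carries its own nanosecond timestamp, records from many nodes can be merged and replayed as one time series, which is what makes the dashboard-style inspection discussed below possible.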
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or We consider several methods, such as Logging, Remote debugging, distributed for profit or commercial advantage and that copies bear this notice and Simulations and Visualisations. the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner /author(s). • Logging is the most common debugging method for all Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia three layers. However, in distributed systems it is impor- © 2022 Copyright held by the owner/author(s). tant to aggregate logs, and analyze them as a time series. 34 Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia Daniil Baldouski and Aleksandar Tošić Additionally, aggregating distributed logs assumes the sys- distributed systems, while our tools are created specifically tem has some method of clock synchronization protocol. for monitoring PoS voting based consensus mechanisms Log collection has been proven to be effective in detecting and underlying network topology of the distributed sys- performance issues for systems such as Hadoop [12] and tem. Darkstar [13]. The aggregation can be done with specific tools for log collection such as InfluxDB [8], Logstash [10], 3 RESEARCH OBJECTIVES etc. Aggregated logs then can be viewed in a form of a The main goal of this research is to build visualisation tools that dashboard using tools like Grafana (see Figure 1). offer more insight into a running distributed system using the time series log collection data. The targeted system is a custom proof of stake based blockchain. Such tools should visualize if nodes contributing to the consensus learned about their correct roles, and if they perform their roles accordingly. 
In the consensus algorithms this is done by sending messages, so the tools should visualise messages exchanged between nodes. In the structured P2P networks information spreads using gossip protocols and network topology changes every time slot. Our tools should visualize such changes in the network topology by drawing nodes and their cluster representatives, while at the same time indicating the consensus roles for each node. Figure 1: Part of the Grafana dashboard used by developers In our implementation time series data comes from InfluxDB, to gain insight into a running PoS based blockchain net- but we want our tools to have no assumption on the data storage work. implementation and there are other popular databases, such as kdb+ and Prometheus, that work well with time series data. Be- • Remote debugging is a technique where a locally running cause of that we choose Grafana as a platform for visualizations, debugger is connected to a remote node in the distributed which supports all of the aforementioned databases and many system. This allows developers to use the same features more at the time of writing. as if they were debugging locally. However, it is difficult In this work we implement two Grafana plugins built to vi- to determine which remote node should be debugged. Ad- sualize PoS based blockchains, and decentralized network topol- ditionally, in case of Byzantine behaviour due to network ogy. Our tools are designed with generality in mind, and are faults connecting the debugger could fail. hence applicable to other PoS voting based blockchains and other • Distributed deterministic simulation and replay is a tech- distributed ledger implementations. We evaluate our tools by nique that attempts to address the issues of reproducibility applying it to the custom developed blockchain and note their in distributed systems. 
Tools like Friday [5] and liblog [6] can be used to record the specific state of the network and analyze it later. The technique suggests implementing an additional layer that abstracts the underlying hardware and the network interfaces to allow for an exact replay of all the state changes and messages exchanged between nodes. Tools such as FoundationDB, or even custom systems, are built on containerisation software.
• Visualisation and time series analysis attempts at capturing the state of the system and all of its nodes by visualising the collected logs. Tools like Prometheus [11] and Grafana [2] are used extensively. Tools like Theia [4] and Artemis [3] are designed for monitoring and analyzing performance problems in distributed systems and support built-in visualization tools for data exploration. However, such tools provide aggregated, log-based summaries of the distributed system and are not capable of observing underlying low-level network properties, e.g. monitoring network communication, especially in real time while the system is running.

4 GRAFANA PLUGINS FOR VISUALISING VOTE BASED CONSENSUS MECHANISMS AND P2P OVERLAY NETWORKS
We have developed two plugins that extend the functionality of Grafana. Figure 2 outlines the architecture used in production: a server running a database instance (preferably a time series database, i.e. InfluxDB) and the Grafana platform. Depending on the underlying blockchain implementation, nodes can insert their telemetry directly into the database or, if possible, have an archive node gather telemetry from the nodes and report it. In this example, a cluster was used to run multiple nodes. A coordinating node is responsible for maintaining an overlay network and serving the nodes within the overlay with DHCP, DNS, and routing. Nodes are packed within Docker containers and submitted to the coordinator, which uses built-in load balancing to distribute them to other cluster nodes.
ShiViz [1], on the other hand, displays distributed system executions as an interactive time-space diagram. With this tool, all the necessary events and interactions can be viewed in an orderly manner and inspected in detail. ShiViz visualization is based on logical ordering, meaning that, unlike our tools, it is not capable of running in real time together with the considered distributed network. ShiViz also works with aggregated logs about various types of events of the distributed system and, unlike our tools, does not support direct database connections. ShiViz is generalized and works with all kinds of distributed systems, while our tools are created specifically for monitoring PoS voting based consensus mechanisms and the underlying network topology of the distributed system.

The telemetry inserted is timestamped to create a time series stream of data that is consumed by Grafana. Figure 1 shows a small part of the dashboard created within Grafana using the built-in plugins for typical visualisations. These visualisations are time series of a running blockchain showing telemetry reported by the nodes. However, rendering telemetry from hundreds of nodes as factors is hardly informative.

Both plugins were developed as React components, using the well-known D3.js JavaScript library for animations; the life cycle of the plugins is managed by Grafana.

Visualization of consensus mechanisms in PoS based blockchain protocols — Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia

The consensus plugin shows whether nodes contributing to the consensus learned about their correct roles and perform them accordingly. In order to have a scalable visualisation, nodes are placed around a circle and scaled according to the size of the network. Roles are visualized with a color map. Each slot, nodes change their roles and execute the protocol accordingly.
To visualise the execution, the plugin visualises messages exchanged between nodes in the form of animated lines flying from an origin node to the destination node. The animations are time-synchronous, and transfer times and latencies are taken into account. Additionally, every message is logged with a type, indicating the sub-protocol within which it was created. As an example, messages sent from committee members to the block producer are attestations for the current block. The animated lines are coloured according to the message type. The thickness of the animated lines indicates the size of the payload transferred between nodes. Figure 4 shows the consensus plugin running live, visualising a test network of 30 nodes. The green coloured node indicates the block producer role for the current slot, nodes coloured violet are part of the committee, and blue nodes are validators.

Figure 2: System architecture (Docker Swarm master node and cluster nodes, InfluxDB, Grafana, web server, and telemetry over the P2P overlay network).

4.1 Network Plugin
P2P networks propagate information using gossip protocols. There are many variations in the general and implementation specifics, but in general this family of protocols aims at gossiping the fact that new information is available in the network. Should a node hear about the gossip and require the information, it will contact neighbouring nodes asking for the data. In general, gossip protocols make no assumptions about the topology of the overlay network. However, with structured networks, the information exchange can be made much more efficient. The observed blockchain implementation utilized a semi-structured network topology for propagating consensus-based information. This is made possible by using a seed string shared between nodes that is used for pseudo-random role election every block. Using the seed, nodes self-elect into roles without the need to communicate. However, when performing roles, committee members must attest to the candidate block produced by the block producer.
The seeded random is therefore also used to cluster the network using a k-means algorithm. The clustering is again performed by each node locally. The shared seed guarantees that nodes will produce the same topology, which is then used to efficiently propagate attestations to the block producer. The network topology hence changes every slot. The plugin aims to visualize the changes in the network topology by drawing nodes and their cluster representatives. Additionally, the consensus roles for each node are indicated with the vertex color. Figure 3 shows the network plugin rendering a test network of 30 nodes in real time. The node in the center coloured green is the elected block producer for the current slot, nodes surrounded by the red stroke are cluster representatives, and the rest of the nodes are coloured based on their role in the current slot.

Figure 3: Network topology plugin visualising a test network of 30 nodes in real time.

Figure 4: Consensus plugin (with legend) visualising a test network of 30 nodes in real time.

4.3 Generality
In order to use the above plugins, users have to provide certain data to the Grafana dashboard, which can be done through the Grafana GUI. For the plugins to work, all of the data should follow a specific naming policy. For example, for the Consensus plugin there is one necessary query to visualize data about the nodes of the network. It can be provided using SQL or the Grafana GUI:

SELECT "slot", "node", "duty" FROM "" WHERE $timeFilter

Both plugins can be customized from the Grafana options menu. For example, users can add new roles, and name and color them. Figure 5 shows the consensus plugin options menu, where users can additionally turn on or off the display of messages, node or container labels, and so on. For both plugins, users have to manually provide the slot time of the network in the plugin options menu.
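The seed-driven, communication-free role election described in Section 4.1 can be illustrated with a small sketch. This is our own illustration, not the authors' implementation; the `elect_roles` helper and its parameters are hypothetical. The key property it demonstrates is that every node derives an identical assignment from the shared seed.

```python
import random

# Illustrative sketch (not the authors' code): nodes self-elect into consensus
# roles from a shared per-slot seed. Because every node runs the same seeded
# PRNG over the same sorted node list, all nodes derive identical assignments
# without exchanging any messages.
def elect_roles(node_ids, seed, committee_size=2):
    rng = random.Random(seed)          # same seed -> same shuffle on every node
    order = sorted(node_ids)
    rng.shuffle(order)
    roles = {order[0]: "block_producer"}
    for n in order[1:1 + committee_size]:
        roles[n] = "committee"
    for n in order[1 + committee_size:]:
        roles[n] = "validator"
    return roles

nodes = [f"node{i}" for i in range(6)]
a = elect_roles(nodes, seed="slot-42")
b = elect_roles(nodes, seed="slot-42")  # a second node computing the same slot
print(a == b)                           # True: the election is deterministic
```

The same idea extends to the seeded k-means clustering: as long as the randomness is derived from the shared seed, each node can compute the cluster topology locally and arrive at the same result.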
4.2 Consensus Plugin
The aim of visualising the consensus mechanism is to quickly evaluate if nodes contributing to the consensus learned about their correct roles.

Figure 5: Consensus plugin options menu.

By using our tools we can visualize other protocols. For example, with the consensus plugin we can visualize the famous Paxos algorithm, first introduced in [7] by Leslie Lamport. For that, we should provide the plugin with the Nodes and Messages queries. For the Nodes query, the parameters slot, node and duty should be provided, which represent the slot number, the node id and the role of the node, respectively. From the point of view of nodes and slots, for this visualization Paxos works in the same way as the PoS based consensus example we mentioned before. For the duty parameter, nodes can have one of three roles: proposer, acceptor or learner. That is why, in the options menu of the plugin, we should create three roles and name them according to the names from the data table.
We should specify the slot time (in seconds) in the plugin options menu, and at this point we can set the Grafana dashboard refresh time and see the results, since all the necessary conditions are fulfilled. But in order to gain more information from the plugin, we should add the Messages query. For the data we should have the following parameters: id, source, target and endpoint, which represent the message id, the node id that sends the message, the node id that receives the message, and the type of the message, respectively. As additional information we can specify the parameters delay (in seconds) and size of the message. If we know the expected number of nodes for some role, we can put it in the plugin options menu to see this information in the plugin legend. In a similar way we should be able to visualize other consensus protocols, for example 2PC or Raft [9].
Source code for both plugins is open source, licensed under the MIT license and available on GitLab, where users can find the installation procedure of the plugins:
• Network plugin - https://gitlab.com/rentalker/topology-visualization-plugin,
• Consensus plugin - https://gitlab.com/rentalker/consensus-visualization-plugin.

5 CONCLUSION
We developed two Grafana plugins for visualising PoS based blockchains and the underlying overlay network topology. The plugins were used to identify critical bugs and faults in the protocol. With the help of visualisations, we were able to detect two problems when running test-nets.
• Network congestion: for every slot, validators must report their statistics to the block producer. Prompt delivery is desired but not critical. However, as the network grew in size, reporting statistics to a single node (the block producer) became increasingly latent, as all nodes attempted to propagate messages in tandem and, even more importantly, the network topology required a lot of routing for messages to arrive at the block producer. The network plugin helped us identify the problem by looking at the topology.
• State synchronisation: at random, nodes failed to perform their roles. This resulted in missing votes even on small test-nets, and sometimes a chain halt where no blocks were produced for the slot. We observed that the likelihood of this happening grows in correlation with network size. However, it was infeasible to debug the state of all nodes in a large network. Visualising the state of nodes at a given slot, we observed that states were not always synchronized and hence some nodes did not learn about their consensus role.
We conclude that visualisation is an important tool in the design and implementation of decentralized and distributed systems. The methods serve a complementary role to existing debugging methods and are very powerful at observing unexpected behaviour of the system as a whole. Visualisation techniques are specifically important in detecting stochastic faults that are non-trivial to reproduce. Our tools are open-source and available for researchers and engineers to use. They are suitable for testing any kind of voting-based consensus protocol with little effort.
For future work we would like to further develop our tools to accommodate other consensus protocols and help developers visualize and debug other types of issues related to distributed systems. We would also like to explore other types of visualizations and other existing tools that can help developers as well. Since Grafana is rapidly evolving, our plugins can be updated and new technologies can be integrated with our tools to improve their performance.

6 ACKNOWLEDGMENTS
The authors gratefully acknowledge the European Commission for funding the InnoRenew CoE project (H2020 Grant Agreement #739574) and the Republic of Slovenia (Investment funding of the Republic of Slovenia and the European Union from the European Regional Development Fund), as well as the Slovenian Research Agency (ARRS) for supporting the project number J2-2504 (C).

REFERENCES
[1] Beschastnikh, I., Wang, P., Brun, Y., and Ernst, M. D. Debugging distributed systems. Commun. ACM 59, 8 (Jul 2016), 32–37.
[2] Chakraborty, M., and Kundan, A. P. Grafana. In Monitoring Cloud-Native Applications. Springer, 2021, pp. 187–240.
[3] Creţu-Ciocârlie, G. F., Budiu, M., and Goldszmidt, M. Hunting for problems with Artemis. In Proceedings of the First USENIX Conference on Analysis of System Logs (USA, 2008), WASL'08, USENIX Association, p. 2.
[4] Garduno, E., Kavulya, S. P., Tan, J., Gandhi, R., and Narasimhan, P. Theia: Visual signatures for problem diagnosis in large Hadoop clusters. In Proceedings of the 26th International Conference on Large Installation System Administration: Strategies, Tools, and Techniques (USA, 2012), LISA'12, USENIX Association, pp. 33–42.
[5] Geels, D., Altekar, G., Maniatis, P., Roscoe, T., and Stoica, I. Friday: Global comprehension for distributed replay. Vol. 7.
[6] Geels, D., Altekar, G., Shenker, S., and Stoica, I. Replay debugging for distributed applications. In 2006 USENIX Annual Technical Conference (USENIX ATC 06) (Boston, MA, May 2006), USENIX Association.
[7] Lamport, L. The part-time parliament. ACM Transactions on Computer Systems 16, 2 (May 1998), 133–169. Also appeared as SRC Research Report 49. ACM SIGOPS Hall of Fame Award in 2012.
[8] Naqvi, S. N. Z., Yfantidou, S., and Zimányi, E. Time series databases and InfluxDB. Studienarbeit, Université Libre de Bruxelles 12 (2017).
[9] Ongaro, D., and Ousterhout, J. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX Annual Technical Conference (USA, 2014), USENIX ATC'14, USENIX Association, pp. 305–320.
[10] Sanjappa, S., and Ahmed, M. Analysis of logs by using Logstash. In Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications (Singapore, 2017), S. C. Satapathy, V. Bhateja, S. K. Udgata, and P. K. Pattnaik, Eds., Springer Singapore, pp. 579–585.
[11] Turnbull, J. Monitoring with Prometheus. Turnbull Press, 2018.
[12] Xu, W., Huang, L., Fox, A., Patterson, D., and Jordan, M. Online system problem detection by mining patterns of console logs. In 2009 Ninth IEEE International Conference on Data Mining (2009), pp. 588–597.
[13] Xu, W., Huang, L., Fox, A., Patterson, D., and Jordan, M. I. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (New York, NY, USA, 2009), SOSP '09, Association for Computing Machinery, pp. 117–132.
Using Machine Learning for Anti Money Laundering

Gregor Kržmanc (gregor.krzmanc@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Filip Koprivec (filip.koprivec@ijs.si), Jožef Stefan Institute and IMFM, Ljubljana, Slovenia
Maja Škrjanc (maja.skrjanc@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

Figure 1: Example transaction network visualization

ABSTRACT
Here we present early results of a network component for anomaly detection in an attributed heterogeneous financial network. Utilizing both externally provided features and generated topological features, we train different models for a simple link prediction task. We then evaluate the models using initial dataset corruption. We show that gradient boosting and multi-layer perceptron generally have the best anomaly detection performance, despite graph neural network models initially showing better results in the link prediction task.

KEYWORDS
Anti Money Laundering (AML), machine learning, networks, link prediction

1 INTRODUCTION
Observing complex real-world graphs, be it a social, financial, biochemical, or physics-related network, is an interesting task. Given a time-evolving network and rich information about the nodes and edges, can we assume that there are some regular dynamics in the network?
Fraud and financial crime are important issues of our time. According to the United Nations Office on Drugs and Crime, an estimated 2–5% of the world GDP is laundered each year. To keep pace with evolving trends, the European Union has decided to strengthen its anti money laundering and terrorist financing regulatory framework and expects the same from financial institutions and supervisory authorities.
Given a pseudonymized dataset of financial transactions, can we use machine learning to detect interesting, perhaps novel, patterns that should be inspected manually? In this paper, we try to answer this question.

2 RELATED WORK
Both supervised [7, 6, 12] and unsupervised or self-supervised [2, 14] learning approaches have been proposed to deal with the task of detecting money laundering. Due to the lack of labelled data and the closed nature of financial data and, therefore, the lack of standardised datasets, approach evaluation can be difficult. Despite that, cryptocurrency datasets such as [13] have been published, explored, and labelled to some extent. Usually, synthetic oversampling or other sampling strategies need to be employed in cases where labelled entities are used for evaluation [12, 13].

3 DATA
In this study, we use a snapshot of the transaction data processed through the international payment system Target2-Slovenija [11]. The dataset spans from November 2007 to December 2017, containing around 8 million financial transactions. No live data was used when performing this research - only archived datasets were used.
For some nodes, the data about the sending or receiving party is additionally linked to data from the Slovenian Business Register (ePRS) [1] and the Slovenian Transaction Account Registry (eRTR) [3] in order to provide additional context about each transaction.
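The data representation described next (Section 4) models each transaction as a directed edge between typed accounts. A minimal sketch of that construction, under our own assumptions — the `build_graph` helper, its field names, and the toy records are illustrative, not from the paper:

```python
from collections import defaultdict

# Toy sketch (not the authors' code): group directed transaction edges by the
# (source type, destination type) pair. Node types follow the paper's
# convention: 's' = company, 'p' = natural person, 'o' = other account.
def build_graph(transactions, node_types):
    """transactions: iterable of (src, dst, amount); node_types: {account: type}."""
    edges = defaultdict(list)  # (src_type, dst_type) -> list of attributed edges
    for src, dst, amount in transactions:
        etype = (node_types[src], node_types[dst])
        edges[etype].append((src, dst, {"amount": amount}))
    return edges

txs = [("A", "B", 100.0), ("B", "C", 50.0), ("A", "C", 75.0)]
types = {"A": "s", "B": "p", "C": "o"}
g = build_graph(txs, types)
print(sorted(g))  # edge types present: [('p', 'o'), ('s', 'o'), ('s', 'p')]
```

Grouping edges by type like this mirrors the paper's setup, where a separate traditional model is later trained per edge type.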
Due to the sensitive nature of the data, all personal and confidential data about individuals and legal entities provided to JSI is pseudonymized.

Table 1: The structural features used for the link prediction task.
  feature              level       definition
  degree               node-level  deg(A) = |N(A)|
  PageRank [9]         node-level  PR(A) = (1 - d)/N + d * sum_{J in N_in(A)} PR(J)/|N_out(J)|,  d = 0.85
  Jaccard coefficient  edge-level  J(A, B) = |N(A) ∩ N(B)| / |N(A) ∪ N(B)|
  Adamic-Adar index    edge-level  A(x, y) = sum_{u in N(x) ∩ N(y)} 1/log|N(u)|
N(·) represents the set of neighbours of the given node. N_in and N_out represent the sets of nodes from which there is an edge to the given node (in), or to which there is an edge from the given node (out). |·| represents the cardinality of the given set.

Figure 2: Degree distribution by node type.

4 DATA REPRESENTATION AS A HETEROGENEOUS GRAPH
There are large differences in the availability of data across different entities performing the transactions. In order to fully utilize all available features, we model the network as a heterogeneous temporal graph. Here, we treat the snapshot of the transaction graph from t0 to t1, G = G(t0, t1), as a heterogeneous graph consisting of 3 discrete node types representing each entity's legal status. The types of accounts are those belonging to companies (node type s), natural persons (node type p), and all other accounts (node type o). Each transaction is represented as a directed edge from its source account to its destination account.

4.1 Network statistics
Due to different legislative bases for different types of entities, inherent differences regarding data availability are expected. Naturally, it is also expected that different categories usually act differently in a network - for example, companies usually transact more than individuals. While the degree distribution (Figure 2) closely resembles the power law, significant differences in distributions between different node types can be observed, which can be attributed to varying amounts of data available for our specific data source across account profiles. It can be seen from Figure 2 that companies (node type s) perform most of the transactions.

5 ANOMALY DETECTION PROBLEM DEFINITION
We corrupt the original graph by rewiring a total of p = 1% randomly picked edges of each edge type. Let f : V × V → [0, 1] be a binary link prediction classifier that is trained to predict the probability that a directed edge between the two given nodes exists. We define the anomaly score of edge (i, j) ∈ E as

  φ(i, j) = 1 − f(i, j)    (1)

The intuition behind equation (1) is that links that are typical to the model will have a smaller anomaly score than links for which the model predicts they would not exist (and are, thus, anomalous).

6 RESULTS
We train several models for the downstream task of link prediction and then use the predictions for anomaly detection.

6.1 Experiment details
The traditional (non-GNN) machine learning approaches are trained to predict whether the given edge exists or not. For each edge, the feature vector fed into the model is constructed by concatenating source node features, destination node features, and edge features. For traditional models, a model for each edge type is constructed separately, while the graph neural network-based models are the same across all edge types.
The GNN (graph neural network) models are constructed of 2 layers of GraphSAGE aggregations [8, 5] using parametric ReLU activations and embedding dimensions of 128 for the first and 64 for the second layer. As messages are passed in the direction of edges, we construct another model to facilitate information diffusion both ways. We do this by adding edges of opposite directionality to existing edges and marking them as a separate edge type.
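The edge-level structural features of Table 1 are straightforward to compute from a neighbour map. A small illustrative sketch (our own code, not the authors'), using an undirected toy graph:

```python
import math

# Illustrative sketch (not the authors' code) of two edge-level features from
# Table 1, computed over a dict mapping each node to its set of neighbours.
def jaccard(nbrs, a, b):
    # |N(a) ∩ N(b)| / |N(a) ∪ N(b)|
    na, nb = nbrs[a], nbrs[b]
    return len(na & nb) / len(na | nb)

def adamic_adar(nbrs, a, b):
    # Sum of 1/log|N(u)| over common neighbours u (each |N(u)| must exceed 1).
    return sum(1.0 / math.log(len(nbrs[u])) for u in nbrs[a] & nbrs[b])

nbrs = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
}
print(jaccard(nbrs, "A", "C"))  # common {B, D}, union {A, B, C, D} -> 0.5
```

The node-level degree is simply `len(nbrs[v])`; PageRank additionally needs the in/out neighbour split and an iterative (or linear-algebra) solver, which libraries such as NetworkX provide out of the box.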
We still, however, only train for the downstream link prediction objective on the existing (non-transposed) edges. We mark this approach as GNN+.
The traditional ML models used are gradient boosting (GradBoost), decision tree (DecTree), multi-layer perceptron (MLP) and logistic regression (LogReg). The hidden layer sizes of the MLP are 20 and 10, using ReLU activation in all layers except the last one, where softmax activation is used. Different combinations of reasonable hidden layer sizes were tested (32+16, 64+32, 256+128, 128+128, 20+10) and the best one was selected. The training of MLP models was performed with a batch size of 200.

4.2 Feature generation
Categorical features are one-hot encoded. Rare categories with < 2% incidence are marked as other. Additionally, node features encoding the role of a node in the network (Table 1) are generated. The node-level features for each node are computed on the whole network as well as on the subgraph induced by the node's own type.

6.2 Link prediction
Traditional ML models for link prediction map concatenated source and destination node features and edge features to the probability that a link between such nodes exists. The models are implemented using scikit-learn [10] and are trained and evaluated using 5-fold cross-validation.
As a preprocessing step, each feature is scaled individually using a standard scaler such that it has a mean of 0 and a standard deviation of 1 across the training set.
When training and evaluating each model, an approximately equal number of positive and negative links is given to the classifier. The provided edge features, such as transaction amount, are sampled randomly for negative edges.
Additionally, we train a 2-layer graph neural network (GNN) for link prediction. The GNN model is trained jointly for all edge types using weighted binary cross-entropy loss. The model has ReLU activations in all layers except the last one, where it has softmax activation. The hidden layer sizes are 64 and 32. The graph neural network is implemented using PyTorch Geometric [4].
We use a random link split for link prediction and not a temporal one, as our end goal is not to predict future links, but rather to learn what kinds of transactions are typical in the given network.
Table 2 shows the aggregated link prediction results. Bold results highlight the best performance across observed methods.

  edge  non-GNN      no str. f.   GNN          GNN+
  ss    0.19 ± 0.02  0.16 ± 0.02  0.01 ± 0.00  0.01 ± 0.00
  oo    0.11 ± 0.02  0.02 ± 0.01  0.05 ± 0.02  0.03 ± 0.02
  so    0.11 ± 0.02  0.06 ± 0.01  0.01 ± 0.01  0.01 ± 0.01
  os    0.14 ± 0.02  0.06 ± 0.01  0.01 ± 0.00  0.01 ± 0.01
  sp    0.08 ± 0.04  0.02 ± 0.02  0.02 ± 0.01  0.02 ± 0.02
  ps    0.05 ± 0.02  0.05 ± 0.02  0.01 ± 0.01  0.01 ± 0.01
  po    0.07 ± 0.04  0.07 ± 0.05  0.02 ± 0.02  0.01 ± 0.02
  op    0.18 ± 0.04  0.02 ± 0.01  0.02 ± 0.01  0.03 ± 0.02
Table 3: Anomaly detection performance comparison in F1 score (mean ± standard deviation). Best non-GNN score, as well as best non-GNN score without using any structural features, are reported next to the GNN results. Bold results highlight the best performance across observed methods.

To summarize precision and recall in a single metric, the F1 score is used:

  F1 = 2 / (precision^(-1) + recall^(-1))    (2)

A naive classifier that assigns the same positive score (recall 1) to each edge has an F1 score of ≈ 0.02. However, the underrepresented edge types typically have higher variance in F1 score and performance insignificantly different from the naive baseline, as seen from Table 3. The same goes for the GNN-based models. See Appendix A for more detailed non-GNN model results.

7 DISCUSSION AND FUTURE WORK
We have constructed and evaluated a self-supervised approach to anomaly detection in financial networks.
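The evaluation pipeline of Sections 5 and 6.3 — score each edge with φ(i, j) = 1 − f(i, j), flag the highest-scoring fraction, and summarize precision and recall with the F1 score of equation (2) — can be sketched as follows. This is a toy illustration with hypothetical classifier outputs, not the authors' pipeline:

```python
# Sketch (not the authors' implementation): anomaly score phi = 1 - f(i, j),
# flag the top fraction of edges, and compute F1 against the rewired edges.
def f1_score(precision, recall):
    # Harmonic mean of precision and recall, as in equation (2).
    return 2.0 / (1.0 / precision + 1.0 / recall)

def flag_top(scores, frac=0.02):
    """scores: {edge: anomaly score}; returns the top `frac` edges as a set."""
    k = max(1, int(len(scores) * frac))
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

# Toy link-prediction outputs f(i, j); ("c", "d") looks least plausible.
f_out = {("a", "b"): 0.99, ("b", "c"): 0.95, ("c", "d"): 0.10, ("d", "a"): 0.90}
phi = {e: 1.0 - p for e, p in f_out.items()}   # anomaly scores
flagged = flag_top(phi, frac=0.25)             # here: top 1 of 4 edges
corrupted = {("c", "d")}                       # ground-truth rewired edge
tp = len(flagged & corrupted)
precision, recall = tp / len(flagged), tp / len(corrupted)
print(f1_score(precision, recall))
```

In the paper's setting, `frac` would be 0.02 (the top 2% of edges flagged) and `corrupted` the 1% of rewired edges.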
Due to the lack of labelled data, this is in most cases the most straightforward approach to tackling the problem with machine learning. There are significant differences in performance across different edge types. Using this approach yields almost comparable results with both raw features and structural features when evaluated on company-to-company transactions only. This may be explained by companies in our dataset having the most insightful features of all node types, such as the broader sector and also a more precise company industry type classification.
The GNN does slightly improve link prediction performance in some cases. See Appendix A for more detailed non-GNN method results. The data here is computed across multiple year-long time windows.

  edge  non-GNN      no str. f.   GNN          GNN+
  ss    0.92 ± 0.01  0.89 ± 0.01  0.92 ± 0.02  0.94 ± 0.01
  oo    0.80 ± 0.02  0.57 ± 0.01  0.79 ± 0.02  0.53 ± 0.04
  so    0.83 ± 0.01  0.75 ± 0.01  0.88 ± 0.02  0.74 ± 0.04
  os    0.76 ± 0.01  0.64 ± 0.01  0.81 ± 0.01  0.83 ± 0.02
  sp    0.85 ± 0.02  0.69 ± 0.03  0.78 ± 0.05  0.73 ± 0.02
  ps    0.74 ± 0.02  0.67 ± 0.01  0.87 ± 0.02  0.75 ± 0.04
  po    0.78 ± 0.02  0.66 ± 0.01  0.84 ± 0.04  0.54 ± 0.08
  op    0.89 ± 0.01  0.53 ± 0.01  0.78 ± 0.05  0.50 ± 0.05
  all   0.84 ± 0.01  0.72 ± 0.01  0.86 ± 0.02  0.89 ± 0.01
Table 2: Link prediction performance comparison measured in area under the receiver operating characteristic curve (AUC) (mean ± standard deviation). Edge types are marked with two letters, representing the source and destination node type in this order. Best non-GNN score, as well as best non-GNN score without using any structural features, are reported next to the GNN results.

6.3 Anomaly detection
For comparison between different methods, the 2% of edges with the highest anomaly scores are flagged as positive. Precision and recall are calculated by using the corrupted 1% of edges as true positives. To summarize precision and recall in a single metric, the F1 score (2) is calculated and reported.

This paper has mainly focused on the use of unsupervised learning for anomaly detection. In the future, we plan to extend our work to supervised and semi-supervised learning approaches to try to utilize the few labelled data points. The following machine learning strategies (or a combination of them) could be tested:
• Active learning. A human-assisted active learning approach is a natural way to incorporate domain knowledge into the decision-making process.
• Synthetic oversampling. Due to the small number of positive examples, we could sample new examples that are similar to them and assign them positive labels.
• Model pretraining and few-shot learning. Update model parameters with a self-supervised pretraining strategy first, and then optimize further on the few labeled data points.

ACKNOWLEDGMENTS
The research leading to the results presented in this paper has received funding from the European Union's funded Project INFINITECH under grant agreement no. 856632. The financial transaction data used in the presented research was collected and pseudonymized by the Bank of Slovenia. The Bank of Slovenia collaborates with JSI and the Infinitech project in order to research possible efficient and compliant banking system supervision techniques. We thank Klaudija Jurkošek Seitl for her input on the style of this paper.

REFERENCES
[1] 2022. AJPES - ePRS. (September 2022). https://www.ajpes.si/prs/.
[2] Claudio Alexandre and João Balsa. 2016. Client Profiling for an Anti-Money Laundering System. https://arxiv.org/abs/1510.00878.
[3] 2022. eRTR. (September 2022). https://www.ajpes.si/eRTR/JavniDel/Iskanje.aspx.
[4] Matthias Fey and Jan E. Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
[5] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. arXiv:1706.02216. http://arxiv.org/abs/1706.02216.
[6] Mikel Joaristi, Edoardo Serra, and Francesca Spezzano. 2019. Detecting suspicious entities in Offshore Leaks networks. Social Network Analysis and Mining 9, 1, 1–15. Springer Vienna. https://doi.org/10.1007/s13278-019-0607-5.
[7] Martin Jullum, Anders Løland, Ragnar Bang Huseby, Geir Ånonsen, and Johannes Lorentzen. 2020. Detecting money laundering transactions with machine learning. Journal of Money Laundering Control 23, 1. doi: 10.1108/JMLC-07-2019-0055. https://www.emerald.com/insight/1368-5201.htm.
[8] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. arXiv:1609.02907. http://arxiv.org/abs/1609.02907.
[9] Larry Page, Sergey Brin, R. Motwani, and T. Winograd. 1998. The PageRank Citation Ranking: Bringing Order to the Web.
[10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
[11] 2022. TARGET2 in TARGET2-Slovenija. (September 2022). https://www.bsi.si/placila-in-infrastruktura/placilni-sistemi/target2-in-target2-slovenija.
[12] Dominik Wagner. 2019. Latent representations of transaction network graphs in continuous vector spaces as features for money laundering detection. Gesellschaft für Informatik.
[13] Mark Weber, Giacomo Domeniconi, Jie Chen, Daniel Karl I. Weidele, Claudio Bellei, Tom Robinson, and Charles E. Leiserson. 2019. Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics. Technical report.
[14] Jiaxuan You, Tianyu Du, Fan-yun Sun, and Jure Leskovec. 2021. Graph Learning in Financial Networks. (September 2021). https://snap.stanford.edu/graphlearning-workshop/slides/stanford_graph_learning_Finance.pdf.

A DETAILED RESULTS
A.1 Link prediction (AUC)
  edge  DecTree      GradBoost    LogReg       MLP
  ss    0.87 ± 0.01  0.90 ± 0.01  0.79 ± 0.01  0.92 ± 0.01
  oo    0.80 ± 0.01  0.80 ± 0.02  0.51 ± 0.01  0.74 ± 0.01
  so    0.82 ± 0.01  0.83 ± 0.01  0.65 ± 0.01  0.82 ± 0.01
  os    0.75 ± 0.01  0.76 ± 0.01  0.58 ± 0.02  0.73 ± 0.01
  sp    0.81 ± 0.02  0.85 ± 0.02  0.55 ± 0.02  0.83 ± 0.02
  ps    0.70 ± 0.02  0.74 ± 0.02  0.54 ± 0.02  0.69 ± 0.01
  po    0.72 ± 0.02  0.78 ± 0.02  0.54 ± 0.02  0.67 ± 0.01
  op    0.85 ± 0.01  0.89 ± 0.01  0.51 ± 0.03  0.87 ± 0.01
  all   0.81 ± 0.01  0.84 ± 0.01  0.66 ± 0.02  0.82 ± 0.01

A.2 Anomaly detection (F1 score)
  edge  DecTree      GradBoost    LogReg       MLP
  ss    0.12 ± 0.01  0.13 ± 0.02  0.04 ± 0.01  0.19 ± 0.02
  oo    0.07 ± 0.01  0.11 ± 0.02  0.01 ± 0.01  0.10 ± 0.02
  so    0.08 ± 0.01  0.10 ± 0.02  0.04 ± 0.01  0.11 ± 0.02
  os    0.06 ± 0.01  0.12 ± 0.02  0.04 ± 0.01  0.14 ± 0.02
  sp    0.06 ± 0.01  0.07 ± 0.04  0.02 ± 0.02  0.08 ± 0.04
  ps    0.04 ± 0.01  0.05 ± 0.02  0.01 ± 0.01  0.05 ± 0.02
  po    0.04 ± 0.01  0.07 ± 0.04  0.02 ± 0.03  0.04 ± 0.03
  op    0.09 ± 0.01  0.14 ± 0.04  0.01 ± 0.01  0.18 ± 0.04

Forecasting Sensor Values in Waste-To-Fuel Plants: a Case Study

Bor Brecelj∗ (bor.brecelj@gmail.com), University of Ljubljana, Faculty of Mathematics and Physics, Ljubljana, Slovenia
Beno Šircelj∗ (beno.sircelj@ijs.si), Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Jože M. Rožanec (joze.rozanec@ijs.si), Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Blaž Fortuna, Qlector d.o.o.
Dunja Mladenić
blaz.fortuna@qlector.com, dunja.mladenic@ijs.si

ABSTRACT
In this research, we develop machine learning models to predict future sensor readings of a waste-to-fuel plant, which would enable proactive control of the plant's operations. We developed models that predict sensor readings for 30 and 60 minutes into the future. The models were trained using historical data, and predictions were made based on sensor readings taken at a specific time. We compare three types of models: (a) a naïve prediction that considers only the last predicted value, (b) neural networks that make predictions based on past sensor data (we consider different time window sizes for making a prediction), and (c) a gradient boosted tree regressor created with a set of features that we developed. We developed and tested our models on a real-world use case at a waste-to-fuel plant in Canada. We found that approach (c) provided the best results, while approach (b) provided mixed results and was not able to outperform the naïve baseline consistently.

CCS CONCEPTS
• Computing methodologies → Machine learning; • Applied computing;

KEYWORDS
Smart Manufacturing, Machine Learning, Feature Engineering

ACM Reference Format:
Bor Brecelj, Beno Šircelj, Jože M. Rožanec, Blaž Fortuna, and Dunja Mladenić. 2022. Forecasting Sensor Values in Waste-To-Fuel Plants: a Case Study. In Ljubljana '22: Slovenian KDD Conference on Data Mining and Data Warehouses, October, 2022, Ljubljana, Slovenia. ACM, New York, NY, USA, 4 pages.

1 INTRODUCTION
There is a wide range of applications of ML (machine learning). One of them is the modeling and control of chemical processes, such as the production of biodiesel. Introducing machine learning to such processes can improve quality and yield and help engineers predict anomalies to control the factory better.

We modeled the JEMS waste-to-fuel plant, which produces high-quality diesel from organic waste. The plant has numerous sensors that measure temperature and pressure, among other variables. It is operated by experts who must control the process. Since the chemical process is complex and, therefore, difficult to control, we built forecasting models that can predict future sensor readings based on historical data and the current state of the plant.

The model will be used to give plant operators additional information about the future state of the plant, which will allow them to make an informed decision about changing the plant's parameters and, therefore, adjust the process before it is too late.

2 RELATED WORK
The use of organic waste in energy conversion technologies is an active area of research aimed at reducing dependence on fossil fuels, optimizing production costs, improving waste management, and controlling emissions. Biochemical, physiochemical, and thermochemical processes produce different biofuels, such as bio-methanation, bio-hydrogen, biodiesel, ethanol, syngas, and coal-like fuels, which are studied by Stephen et al. [8]. Work is also being done on optimization, such as catalyst selection, reactor design, pyrolysis temperature, and other important factors [5].

Many ML methods have been developed to address waste management and proper processing for biofuel production, focusing on energy demand and supply prediction [3]. Aghbashlo et al. [2] provided a systematic review of various applications of ML technology with a focus on ANN (Artificial Neural Network) in biodiesel research. They provided an overview of the use of ML in modeling, optimization, monitoring, and process control. Models that predict the conditions of the biofuel production process that have the highest yield were created by Kusumo et al. [6] and Abdelbasset et al. [1]. The models used in these studies were kernel-based extreme learning machines, ANN, and various ensemble models.
∗Both authors contributed equally to this research.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SiKDD '22, October, 2022, Ljubljana, Slovenia
© 2022 Copyright held by the owner/author(s).

3 USE CASE
The JEMS waste-to-fuel plant produces synthetic diesel (SynDi) from any hydrocarbon-based waste, such as wood, biomass, paper, waste fuels and oils, plastics, textiles, rubber, and agricultural residues. The plant uses a chemical-catalytic de-polymerization process, the advantage of which is that the temperature is too low to produce carcinogenic gasses. It operates continuously and produces about 150 liters of fuel per hour. Although it uses the latest software available and allows remote control, there is no anomaly detection, prediction, or optimization. As a result, there is a great need for better understanding, optimization, and decision-making, given data availability. The company plans to sell and install over 1,500 SynDi systems over the next ten years. In practice, this means many SynDi plants in different locations worldwide.

There are three main chambers in the pipeline, which are named B100, B200, and B300. The plant can be conceptually split into four stages:
(1) Feedstock inspecting and feeding;
(2) Drying and mixing (chamber B100);
(3) Processing (chamber B200);
(4) Distilling (chamber B300).
Since there are no sensors in the feedstock inspecting and feeding stage, we focused on the later stages, each of which takes place in one of the main chambers.

In the drying and mixing stage (B100), the starting material is mixed with process oil, lime, and catalyst and is heated. During mixing, the material is broken down into smaller particles, and the water is evaporated. The primary chemical reaction occurs in the processing stage (B200). The material is fed to a turbine, and the reaction product evaporates through the diesel distillation column. If the diesel obtained is not of sufficient quality, it is redistilled in the second distillation stage (B300).

Currently, the plants are operated with highly skilled personnel and high costs for personnel training. Implementing automation, remote control, optimization, and interconnection among the plants would greatly facilitate their operation. Therefore, the main challenge to be solved by integrating AI is the self-control of the chemical process and the plant itself by minimizing the human resources required to operate the plants. Furthermore, operating many SynDi plants also means a significant challenge for ensuring remote control for troubleshooting, maintenance, and repair. AI integration aims to minimize the workforce required to operate the plants, minimize the resulting downtime due to human interaction, enable self-control and predictive maintenance of the SynDi plants, and achieve less downtime and higher production efficiency.

In modeling the waste-to-fuel processes, we decided to model each chamber separately. No model was developed for chamber B300 because it was not active during the period for which we obtained the data. As described above, a second distillation of the fuel is performed in chamber B300 only if the fuel in chamber B200 is not pure enough.

4 METHODOLOGY
4.1 Data analysis
The sensor measurements are from the experimental JEMS plant, which is located in Canada. The data consists of 154 sensors from January 2016 to January 2017. The measurements are taken at one-minute intervals and mostly measure temperature or pressure, but there are also sensors for motor current and valve position, among others. Since the data is from the prototype version of the waste-to-fuel plant, it contains many missing values. Our data set contained an average of 61,607 data points per sensor. We discarded all sensors with fewer than 6,000 data points and kept only those that corresponded to chambers B100 and B200, giving us data from 39 sensors.

Analysis of the dataset we received revealed that many values were missing. In particular, we noted that there were day-long intervals with a tiny number of measurements. We also noticed that specific sensor values remained constant at low temperatures, a condition best described by the waste-to-fuel plant's inactivity. We, therefore, decided to remove such values. Because there were many ten-minute gaps, we decided to resample the data at fifteen-minute intervals, taking the last value of each interval and assuming that conditions had not changed in the short time since the last measurement, a reasonable assumption for sensor values. The resulting data set contained an average of 7,884 data points per sensor.

We divided the dataset into a train and a test dataset, split on October 31st 2016. The resulting train set included a total of 11,000 samples, and the test set included 3,000 samples.

4.2 Model training
In this research, we compare models that we develop using two different approaches. We first tried the neural network approach, in which the model makes predictions based only on sensor readings from the last five hours. Since the model did not perform better than the baseline, we began the second approach, developing features to describe the time series and capture its patterns. We used linear regression and a gradient-boosted tree regressor. All the developed models were compared with the last-value model, which we used as a benchmark.

4.2.1 Neural network approach. We used the model developed for forecasting Tüpras' sensor values. Tüpras is an oil refinery, which is very similar to the JEMS use case. The model was used to forecast sensor values in different units of LPG production. Some of Tüpras' units are distillation columns, similar to JEMS' chamber B200. The model takes only past sensor values as input and predicts values for the future together with the prediction interval. More specifically, it predicts the 10th, 50th and 90th percentile, which is the case in all our models that give a prediction interval.

Figure 1: Architecture of the neural network model, which gives the prediction interval.

Figure 1 shows the architecture of the neural network. The model is a feedforward neural network with two layers. First, there is a linear layer with ReLU activation. The second layer has a separate linear layer for each quantile. The hidden dimension of the model is calculated from the number of features and the number of targets using the formula ⌊n_features / 2⌋ + n_targets.

During training, we used the quantile loss function, which is defined as
max{q · (y_true − y_pred), (1 − q) · (y_pred − y_true)},
where q is the observed quantile (in our case, it can be 0.1, 0.5 or 0.9), y_true is the true target value and y_pred is the corresponding quantile of the prediction. In the case of q = 0.5, the loss is equal to the mean absolute error divided by two. When calculating the loss of the 10th percentile (q = 0.1), a prediction that is greater than the true value is heavily penalized, while a prediction that is lower than the true value has a smaller loss and is therefore encouraged.

The model is implemented in the PyTorch library [7]. Since sensors measure different quantities, the values have to be scaled before learning. Here we used the Min-Max scaler from the scikit-learn library, scaling all values between zero and one.

4.2.2 Feature engineering. The neural network model described above did not outperform the benchmark model. As a result, we decided to try another approach, where we developed features that better describe past sensor values and capture their patterns. One of the problems of the neural network model was that it had too many features. We decided to build a separate model for each sensor to tackle this problem. Each model uses only features calculated from the values of the sensor being predicted.

With the help of plant operators, we decided to consider at most five hours of data before the prediction point to issue a forecast. Since the latest data is usually more important in determining future sensor values, we created features on seven different time windows: 30, 45, 75, 120, 180, 240, and 300 minutes. For each time window, we computed the following features:
• average sensor value,
• fraction of peaks in the window,
• percentage change between the first and last value in the time window,
• slope (coefficient of the least squares line through the points in the window),
• simple prediction (extension of the least squares line to the future),
• slope ratio (slope on the smaller window divided by the slope on the bigger window).
Besides the features mentioned above, which depend on the window size, we also included features that were calculated only on the biggest time window (300 minutes):
• last value,
• maximal value,
• last value relative to the maximal value.
The features above attempt to capture different time series characteristics:
• trend: described by percentage change and slope;
• growth pattern: described by the fraction of peaks, which indicates whether the growth is steady or has ups-and-downs. Furthermore, the slope indicates how aggressive such growth is;
• expected value: an approximation of the expected value is given through the average, last value, maximal value, and simple prediction.

Using the developed features, we trained a linear regression model and a gradient boosted tree regressor from the CatBoost library [4]. We used root mean squared error (RMSE) for the loss function.

5 RESULTS AND ANALYSIS
We built models for the main chambers B100 and B200 with two forecasting horizons (30 and 60 minutes). Tables 1 and 2 show mean squared error (MSE) and mean absolute error (MAE) on chambers B100 and B200, respectively. There are three different neural network models (NN), which differ in the size of the window from which they get the data.

Table 1: MSE and MAE on the test set of models when predicting for chamber B100.

                    horizon = 30 min      horizon = 60 min
                    MSE       MAE         MSE       MAE
last-value model    21.0533   1.4320      50.6636   2.5128
NN, window = 5h     21.7525   1.6512      47.0545   2.5413
NN, window = 3h     19.7441   1.6109      45.3450   2.4127
NN, window = 2h     18.9717   1.6023      46.5047   2.5357
Linear regression   19.4264   1.4634      49.2268   2.5145
Catboost            16.9030   1.4478      38.3066   2.3164

Table 2: MSE and MAE on the test set of models when predicting for chamber B200.

                    horizon = 30 min      horizon = 60 min
                    MSE       MAE         MSE       MAE
last-value model    52.3380   2.0577      124.9735  3.3768
NN, window = 5h     69.4678   3.8227      129.0330  4.9927
NN, window = 3h     57.9902   3.3601      121.1315  4.7431
NN, window = 2h     55.8769   3.1797      117.4154  4.7146
Linear regression   55.0218   3.2293      115.7457  4.5888
Catboost            49.3329   2.5305      109.5303  3.9745

From Tables 1 and 2 we can see that the five-hour window's neural network performed worse than the benchmark. The main reason for such poor results was too many features for the amount of data that we have. More precisely, the neural network model uses the values of all sensors in the chamber we are predicting. This means that there are six hundred features, resulting in more than two hundred thousand trainable parameters for the model of chamber B200. We also have to consider that the neural network predicts future sensor values and prediction intervals. Therefore, there are too many features and target values for the amount of data that we have.

We included results of two more neural network models with three-hour and two-hour windows, since a reduced window size results in a smaller number of features and trainable parameters. For example, the neural network model with a two-hour time window for chamber B200 had two hundred and forty features and almost fifty thousand trainable parameters. Neural network models with smaller window sizes performed better, which confirms that we had too many features.

The features that we developed using the second approach were used with two models, linear regression and the Catboost model. Comparing those two models, the Catboost model performed better because it can capture more than just linear relationships between the features and the target. The Catboost model also outperformed the neural networks, where one of the main differences is that the neural network uses all sensors from the chamber while the Catboost model uses only values of the sensor which is being predicted. This results in forty-five features for the model that predicts one sensor, which solves the problem of too many features. In addition, the Catboost model produced better results than the benchmark when comparing the mean squared error (MSE). During the training, we used RMSE as a loss function, meaning that RMSE was minimized and, therefore, also MSE.

The tables show that although most models outperform the benchmark regarding MSE, almost all of them do not surpass the benchmark when considering MAE. When measuring MSE, predictions with strong spikes where such spikes do not take place are penalized more. Therefore, models with a competitive MSE can be assumed to rarely predict spikes where none take place. This is a key feature for our use case, given that we are interested in understanding whether an irregularity will take place or not. Therefore, the models give valuable information even though the average prediction is not entirely accurate. However, there is no problem with models not being able to predict significant changes resulting from a manual change in plant setpoint parameters, which our data does not capture. Overall, we consider the best model to be the Catboost model, given that in all cases it outperformed the rest of the models when considering MSE, and it also achieved the best MAE when predicting chamber B100 with a time horizon of 60 minutes.

6 CONCLUSION
We compared a set of models to predict sensor values for a waste-to-fuel plant: a neural network, linear regression, a gradient-boosted tree regressor, and the last-value model. The last-value model was used as a benchmark. We developed three neural network models which differed in time window size. The neural network models were built based on the hypothesis that a simple neural network and raw sensor readings as features are enough to model the process. The results showed that this is not the case because the process is too complicated for the amount of data that we obtained. Lastly, we used feature engineering to develop features that better describe the time series. The features were used for learning linear regression and the gradient boosted tree regressor, where the latter produced the best results in our case.

ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency and the European Union's Horizon 2020 program project FACTLOG under grant agreement number H2020-869951.
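The quantile loss defined in Section 4.2.1 can be written directly as a pinball loss; a minimal NumPy sketch (the arrays are illustrative, not plant data), which also checks the identity noted in the text that for q = 0.5 the loss equals half the mean absolute error:

```python
import numpy as np

def quantile_loss(y_true, y_pred, q):
    """Pinball loss: mean of max{q*(y_true - y_pred), (1 - q)*(y_pred - y_true)}."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y_true = np.array([10.0, 12.0, 11.0])
y_pred = np.array([11.0, 11.0, 11.0])

# For q = 0.5 the pinball loss equals half the mean absolute error.
mae = np.mean(np.abs(y_true - y_pred))
assert np.isclose(quantile_loss(y_true, y_pred, 0.5), mae / 2)

# For q = 0.1, over-prediction (y_pred > y_true) costs 0.9 per unit of error,
# while under-prediction costs only 0.1, matching the asymmetry described above.
print(quantile_loss(y_true, y_pred, 0.1))
```

Training the three quantile heads with this loss at q = 0.1, 0.5, and 0.9 yields the prediction interval together with the median forecast.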
Figure 2: True value and prediction of the Catboost model for a temperature sensor in chamber B100.

Figure 3: True value and prediction with a confidence interval of the neural network model with a two-hour window for a temperature sensor in chamber B100.

Figure 2 shows the Catboost model prediction on the test set together with the true values of the temperature sensor in chamber B100. The neural network model's prediction of the same sensor is presented in Figure 3. Since the neural network model also outputs a prediction interval, it is shown in the abovementioned figure. From the plots, we can see that both models can closely predict future sensor values. In the case of the neural network model, the actual value is mainly inside the predicted confidence interval, except when there is a significant change in the sensor value.

REFERENCES
[1] Walid Kamal Abdelbasset, Safaa M Elkholi, Maria Jade Catalan Opulencia, Tazeddinova Diana, Chia-Hung Su, May Alashwal, Mohammed Zwawi, Mohammed Algarni, Anas Abdelrahman, and Hoang Chinh Nguyen. 2022. Development of multiple machine-learning computational techniques for optimization of heterogenous catalytic biodiesel production from waste vegetable oil. Arabian Journal of Chemistry 15, 6 (2022), 103843.
[2] Mortaza Aghbashlo, Wanxi Peng, Meisam Tabatabaei, Soteris A Kalogirou, Salman Soltanian, Homa Hosseinzadeh-Bandbafha, Omid Mahian, and Su Shiung Lam. 2021. Machine learning technology in biodiesel research: A review. Progress in Energy and Combustion Science 85 (2021), 100904.
[3] Hemal Chowdhury, Tamal Chowdhury, Pranta Barua, Salman Rahman, Nazia Hossain, and Anish Khan. 2021. Biofuel production from food waste biomass and application of machine learning for process management. 96–117. https://doi.org/10.1016/B978-0-12-823139-5.00004-6
[4] Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin. 2018. CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363 (2018).
[5] Bidhya Kunwar, HN Cheng, Sriram R Chandrashekaran, and Brajendra K Sharma. 2016. Plastics to fuel: a review. Renewable and Sustainable Energy Reviews 54 (2016), 421–428.
[6] F Kusumo, AS Silitonga, HH Masjuki, Hwai Chyuan Ong, J Siswantoro, and TMI Mahlia. 2017. Optimization of transesterification process for Ceiba pentandra oil: A comparative study between kernel-based extreme learning machine and artificial neural networks. Energy 134 (2017), 24–34.
[7] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[8] Jilu Lizy Stephen and Balasubramanian Periyasamy. 2018. Innovative developments in biofuels production from organic waste materials: a review. Fuel 214 (2018), 623–633.

Machine Beats Machine: Machine Learning Models to Defend Against Adversarial Attacks

Jože M. Rožanec∗ (Jožef Stefan International Postgraduate School, Ljubljana, Slovenia), joze.rozanec@ijs.si
Dimitrios Papamartzivanos (Ubitech Ltd, Chalandri, Athens, Greece), dpapamartz@ubitech.eu
Entso Veliou (Department of Informatics and Computer Engineering, University of West Attica, Athens, Greece), eveliou@uniwa.gr
Theodora Anastasiou (Ubitech Ltd)
Jelle Keizer (Philips Consumer Lifestyle BV)
Blaž Fortuna (Qlector d.o.o.)
Chalandri, Athens, Greece, tanastasiou@ubitech.eu
Drachten, The Netherlands, jelle.keizer@philips.com
Ljubljana, Slovenia, blaz.fortuna@qlector.com
Dunja Mladenić (Jožef Stefan Institute, Ljubljana, Slovenia), dunja.mladenic@ijs.si

ABSTRACT
We propose using a two-layered deployment of machine learning models to prevent adversarial attacks. The first layer determines whether the data was tampered with, while the second layer solves a domain-specific problem. We explore three sets of features and three dataset variations to train machine learning models. Our results show clustering algorithms achieved promising results. In particular, we consider the best results were obtained by applying the DBSCAN algorithm to the structural similarity index measure computed between the images and a white reference image.

CCS CONCEPTS
• Information systems → Data mining; • Computing methodologies → Computer vision problems; • Applied computing;

1 INTRODUCTION
Artificial Intelligence (AI) solutions have penetrated the Industry 4.0 domain by revolutionizing rigid production lines, enabling innovative functionalities like mass customization, predictive maintenance, zero defect manufacturing, and digital twins. However, AI-fuelled manufacturing floors involve many interactions between AI systems and other legacy Information and Communications Technology (ICT) systems, generating a new territory for malevolent actors to conquer. Hence, the threat landscape of Industry 4.0 is expanded unpredictably if we also consider the emergence of adversary tactics and techniques against AI systems and the constantly increasing number of reports of Machine Learning (ML) systems abuses based on real-world observations. In this context, Adversarial Machine Learning (AML) has become a significant concern in adopting AI technologies for critical applications, and it has already been identified as a barrier in multiple application domains.
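The two-layered deployment described in the abstract can be sketched as a simple gate in front of the task model; the detector, classifier, and threshold below are illustrative stand-ins, not the models trained in this paper:

```python
import numpy as np

# Hypothetical two-layer deployment: layer 1 flags suspected adversarial
# inputs, and only inputs that pass the gate reach the layer-2 task model.

def two_layer_predict(images, tamper_detector, task_model, reject_label=-1):
    """Return task predictions for clean images and reject_label for flagged ones."""
    flags = tamper_detector(images)            # True -> suspected adversarial input
    out = np.full(len(images), reject_label)
    clean = ~flags
    if clean.any():
        out[clean] = task_model(images[clean])
    return out

# Toy stand-ins: flag images whose mean intensity deviates strongly from an
# assumed reference level of 0.5 (a real detector would be a trained model).
detector = lambda xs: np.abs(xs.mean(axis=(1, 2)) - 0.5) > 0.3
classifier = lambda xs: (xs.mean(axis=(1, 2)) > 0.5).astype(int)

batch = np.stack([np.full((4, 4), 0.4), np.full((4, 4), 0.95)])
# The first image passes the gate and is classified; the second is rejected.
print(two_layer_predict(batch, detector, classifier))
```

The design keeps the domain model untouched: hardening against adversarial inputs is isolated in the first layer, which can be retrained or swapped independently.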
KEYWORDS
Cybersecurity, Adversarial Attacks, Machine Learning, Automated Visual Inspection

ACM Reference Format:
Jože M. Rožanec, Dimitrios Papamartzivanos, Entso Veliou, Theodora Anastasiou, Jelle Keizer, Blaž Fortuna, and Dunja Mladenić. 2021. Machine Beats Machine: Machine Learning Models to Defend Against Adversarial Attacks. In Ljubljana '22: Slovenian KDD Conference on Data Mining and Data Warehouses, October, 2022, Ljubljana, Slovenia. ACM, New York, NY, USA, 4 pages.

AML is a class of data manipulation techniques that cause changes in the behavior of AI algorithms while usually going unnoticed by humans. Misclassification of suspicious objects in airport control systems [7], abuse of autonomous vehicle navigation systems [11], tricking of healthcare image analysis systems into classifying a benign tumor as malignant [15], and abnormal robotic navigation control [23] are only a few examples of AI model compromise that advocate the need for the investigation and development of robust defense solutions.

Recently, the evident challenges posed by AML have attracted the attention of the research community, Industry 4.0, and the manufacturing domains [20], as possible security issues on AI systems can pose a threat to system reliability, productivity, and safety [2]. In this reality, defenders should not be just passive spectators, as there is a pressing need for robustifying AI systems to hold against the perils of adversarial attacks. New methods are needed to safeguard AI systems and sanitize the ML data pipelines
from the potential injection of adversarial data samples due to poisoning and evasion attacks.

We developed a machine learning model to address the abovementioned challenges, detecting whether the incoming images are adversarially altered. This enables a two-layered deployment of machine learning models that can be used to prevent adversarial attacks (see Fig. 1): (a) a first layer with models determining whether the data was tampered with, and (b) a second layer that operates with regular machine learning models developed to solve particular domain-specific problems. We demonstrate our approach in a real-world use case from Philips Consumer Lifestyle BV. This paper explores a diverse set of features and machine learning models to detect whether the images have been tampered with for malicious purposes.

Figure 1: Two-layered deployment of machine learning models can be used to prevent adversarial attacks.

This paper is organized as follows. Section 2 outlines the current state of the art and related works, Section 3 describes the use case, and Section 4 provides a detailed description of the methodology and experiments. Finally, Section 5 outlines the results obtained, while Section 6 concludes and describes future work.

2 RELATED WORK
AML attacks are considered a severe threat to AI systems and, as a result, the research community seeks new robust defensive methods. Image classifiers, those analyzed in this work, are the focal point of the vast majority of the AML literature, as they have been proved prone to noise perturbations. According to the literature, prominent solutions focus on denoising the image classifiers, training the target model with adversarial examples (known as adversarial training), or applying standalone defense algorithms.

Yan et al. [21] proposed a new adversarial attack called the Observation-based Zero-mean Attack, and they evaluated the robustness of various deep image denoisers. They followed an adversarial training strategy and effectively removed various synthetic and adversarial noises from data. In [17], pre-processing data defenses for image denoising are evaluated, highlighting the advantages of such approaches that do not require the retraining of the classifiers, which is a computationally intense task in computer vision.

However, the robustness of adversarial training via data augmentation and distillation is advocated by the majority of the works in the domain. Specifically, Bortsova et al. [3] have focused on adversarial black-box settings, assuming that the attacker does not have full access to the target model as a more realistic scenario. They tuned their testbed to ensure minimal visual perceptibility of the attacks. The applied adversarial training dramatically decreased the performance of the designed attack. Hashemi and Mozaffari [8] trained CNNs with perturbed samples manipulated by various transformations and contaminated by different noises to foster robustness using adversarial training.

On top of the above, several standalone solutions have been proposed. The CARAMEL system in [13] offered a set of detection techniques to combat security risks in automotive systems with embedded camera sensors. Hybrid approaches and more general alternatives intrinsically improve the robustness of AI models. A defensive distillation mechanism against evasion attacks is proposed in [16], being able to reduce the effectiveness of adversarial sample creation from 95% to less than 0.5% on a studied DNN. Subset Scanning was presented in [19] to give DNNs the ability to recognize out-of-distribution samples.

3 USE CASE
The Philips factory in Drachten, the Netherlands, is an advanced factory for mass manufacturing consumer goods (e.g., shavers, OneBlade, baby bottles, and soothers). Current production lines are often tailored for the mass production of one product or product series in the most efficient way. However, the manufacturing landscape is changing: due to global shortages, manufacturing assets and components are becoming scarcer [1], and a shift in market demand requires the production of smaller batches more often. To adhere to these changes, production flexibility, re-use of assets, and a reduction of reconfiguration times are becoming more critical for the cost-efficient production of consumer goods. One of the topics currently investigated within Philips is quickly setting up automated quality inspections to make reconfiguring quality control systems faster and easier. Next to working on the technical challenges of doing this, safety and cyber-security topics are explored, aiming to implement AI-enabled automated quality systems with state-of-the-art defenses, the latter of which is the focus point discussed in this paper.

The dataset used contains images of the decorative part of a Philips shaver. This product is mass-produced and important for the visual appearance of the shavers. Next to that, the part is very close to or in direct contact with the user's skin, where any deviations in its quality could impact shaver performance or even shaver safety. The dataset contains 1,194 images classified into two classes: (a) attacked with the Projected Gradient Descent attack [5], and (b) not attacked.

4 METHODOLOGY
We framed adversarial attack detection as a classification problem. We experimented with three kinds of features: (a) image embeddings (obtained from the Average Pooling Layer of a pre-trained ResNet-18 model [9]), (b) histograms reflecting grayscale pixel frequencies (with pixel values extending between zero and 255), and (c) the structural similarity index measure (SSIM) computed against a white image. While the embeddings provide information about the image as a whole, we considered that the histograms and the SSIM metric could be useful given the apparent difference between the original and perturbed images. Furthermore, we computed the features across three different datasets (see Fig. 2 for sample images): (a) the original set of images, (b) images cropped considering an image slice extending from top to bottom (coordinates (160, 0, 200, 369); we name this dataset "Cropped (v1)"), and (c) images cropped

Figure 2: Three sets of images: (a) indicates the original image, while (b) indicates the images attacked with the Projected Gradient Descent attack. The subsets I, II, and III indicate (I) the whole image, (II) cropped image (v1 (considering coordi-

well, it would be useful to generalize the approach toward detecting new cyberattacks where no labeled data exists yet. We consider such a characteristic to be fundamental to production environments.

For the models resulting from the three abovementioned datasets, we measured the estimated number of clusters, the estimated number of noise points, homogeneity (whether the clusters contain only samples belonging to a single class), completeness (whether all the data points that are members of a given class are elements of the same cluster), V-measure (the harmonic mean between homogeneity and completeness), the adjusted Rand index (similarity between clusterings obtained by the proposed and random models), and the Silhouette Coefficient (which estimates the separation distance between the resulting clusters). We ran the DBSCAN algorithm measuring the distance between samples with the Euclidean distance, considering the maximum distance between two samples for one to be considered as in the neighborhood of the other to be 0.3.
Furthermore, we consid- nates (160, 0, 200, 369))), and cropped image (v2 - (considering ered that at least ten samples in a neighborhood were required for coordinates (160, 50, 200, 319))). a point to be considered as a core point. 5 RESULTS AND ANALYSIS considering a slice of the central part of the image (coordinates (160, 50, 200, 319) - - we name this dataset set "Cropped (v2)"). By com-Model Catboost KMeans Logistic regression paring the original image dataset against those obtained by slicing Original image 0.0167 1.0000 0.0228 the central part, we sought to understand if the models’ predictive Embeddings Cropped (v1) 0.0014 1.0000 0.0003 Cropped (v2) 0.0181 1.0000 0.0213 power increased by looking at a specific area of the image rather Original image 0.0152 1.0000 0.0184 than the whole. SSIM Cropped (v1) 0.0008 1.0000 0.0004 Cropped (v2) 0.0179 1.0000 0.0195 We first trained three machine learning models: Catboost [18] Original image 0.0016 1.0000 0.0030 with Focal Loss [14] (trained over 150 iterations, and considering a Histograms Cropped (v1) 0.0003 1.0000 0.0011 tree depth of ten, while evaluating the performance during training Cropped (v2) 0.0018 1.0000 0.0031 with the logloss metric), Logistic Regression (the dataset was scaled between zero and one, considering the train set, and transformed to Table 1: Results obtained across classification experiments. ensure zero mean and unit variance), and KMeans (the dataset was We measure models’ performance with Eq. 1. Best results are transformed to ensure zero mean and unit variance, and the model bolded, second-best are italicized. initiated with random initialization and seeking to generate two clusters). We evaluated our experiments with a ten-fold stratified We present the results obtained in our classification experiments cross-validation ([12, 22]), using one fold for testing and the rest in Table 1. We found the KMeans models achieved perfect discrimi-of the folds to train the model. 
Furthermore, to avoid overfitting, nation in all cases, while the second-best model was the Logistic we performed a feature selection using the mutual information regression, which had second-best results in all but two cases. Nev- to evaluate the most relevant ones and select the top K features, √ ertheless, the Logistic regression and the Catboost models achieved with 𝐾 = 𝑁 , considering 𝑁 to be equal to the number of data a low discriminative power, almost unable to distinguish between instances in the train set [10]. Finally, we measured our models’ tampered and non-tampered images. Regarding the features, we performance with a custom metric (𝐷𝑃 ) that summarizes 𝐴𝑈 𝐶 𝑅𝑂𝐶 found that the best average performance was obtained when train- the discriminative power as computed from the area under the ing the models on the Cropped (v2) dataset, followed by those receiver operating characteristic curve (AUC ROC, see [4]) (see trained on the whole images. Eq. 1). The metric ranges from zero (no discriminative power) to When running the DBSCAN algorithm (see results in Table 2), one (perfect discriminative power) and it preserves the AUC ROC we found the best results were obtained considering the SSIM mea- desirable properties of being threshold independent and invariant sure. Furthermore, using the SSIM issued excellent results in all to a priori class probabilities. cases. The best ones were obtained considering the Cropped (v1) dataset, while the second-best was achieved with the Cropped (v2) dataset. Using the SSIM only, the DBSCAN algorithm was able to 𝐷 𝑃 = 2 · |(0.5 − 𝐴𝑈 𝐶𝑅𝑂𝐶)| (1) 𝐴𝑈 𝐶 𝑂𝐶 𝑅 correctly group the instances into two groups and misclassified at most a single instance. However, the performance achieved either Based on the good results obtained in the clustering setting, we with embeddings or histograms was not satisfactory. 
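For concreteness, two evaluation pieces described above, the Eq. (1) metric and the DBSCAN configuration (Euclidean distance, neighborhood radius 0.3, at least ten samples per core point), can be sketched with scikit-learn. This is a minimal illustration only; the toy features below are our own stand-ins, not the shaver-image data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import roc_auc_score, homogeneity_completeness_v_measure

def discriminative_power(y_true, y_score):
    """Eq. (1): DP = 2 * |0.5 - AUC ROC|.

    Ranges from 0 (no discriminative power) to 1 (perfect
    discrimination), and stays threshold independent because it is
    derived from the AUC ROC."""
    return 2.0 * abs(0.5 - roc_auc_score(y_true, y_score))

# DBSCAN as configured in Section 4: Euclidean distance, maximum
# neighborhood distance eps = 0.3, at least ten samples per core point.
clusterer = DBSCAN(eps=0.3, min_samples=10, metric="euclidean")

# Toy stand-in for the image features: two tight, well-separated groups.
features = np.array([[0.0, 0.01 * i] for i in range(15)]
                    + [[5.0, 5.0 + 0.01 * i] for i in range(15)])
labels = clusterer.fit_predict(features)  # finds two clusters, no noise
true_classes = np.array([0] * 15 + [1] * 15)
h, c, v = homogeneity_completeness_v_measure(true_classes, labels)
```

On this toy data the clustering is perfect, so homogeneity, completeness and V-measure are all 1.0; a scorer that separates the classes perfectly gets DP = 1, while a random scorer gets DP ≈ 0.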
When consid- decided to conduct additional experiments, running the DBSCAN ering histogram features, the DBSCAN algorithm was not able to algorithm [6] over all existing data. The advantage of such an algo-discriminate between instances, creating a single cluster. On the rithm is that it can estimate the clusters with no prior information other hand, when considering embeddings, DBSCAN created three regarding the number of expected clusters. Therefore, if working clusters that issued a bad performance, considering most of the 48 SiKDD ’22, October, 2022, Ljubljana, Slovenia Rožanec et al. Embeddings SSIM Histograms Original image Cropped (v1) Cropped (v2) Original image Cropped (v1) Cropped (v2) Original image Cropped (v1) Cropped (v2) Number of clusters 3 1 1 2 2 2 1 1 1 Number of noise points 1010 794 887 1 0 1 621 603 606 Homogeneity 0.1770 0.4550 0.3170 1.0000 1.0000 1.0000 0.8550 0.9290 0.9150 Completeness 0.2090 0.4940 0.3860 0.9910 1.0000 0.9910 0.8560 0.9290 0.9150 V-measure 0.1920 0.4740 0.3480 0.9960 1.0000 0.9960 0.8550 0.9290 0.9150 Adjusted Rand index 0.0710 0.4350 0.2540 0.9980 1.0000 0.9980 0.9020 0.9600 0.9500 Silhouette coefficient 0.0750 0.4310 0.2660 0.8980 0.9590 0.9070 0.8330 0.8970 0.8800 Table 2: Results obtained across clustering experiments. Best ones are bolded, second-best are italicized. points to be noisy. We, therefore, conclude that the only promising Processing (ICIP). IEEE, 1241–1245. results were those obtained considering the SSIM. Nevertheless, we [6] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise.. In consider further research is required to understand whether this kdd, Vol. 96. 226–231. kind of feature can be useful across a wide range of attacks and [7] Dan-Ioan Gota, Adela Puscasiu, Alexandra Fanca, Honoriu Valean, and Liviu in the real-world. SSIM provides metadata describing the images. Miclea. 2020. 
Threat objects detection in airport using machine learning. In 2020 21th International Carpathian Control Conference (ICCC). IEEE, 1–6. Given high-quality attacks aim to reduce the visual footprint on the [8] Atiyeh Hashemi and Saeed Mozaffari. 2021. CNN adversarial attack mitigation images, it remains an open question to which extent can the SSIM using perturbed samples training. Multim. Tools Appl. 80 (2021), 22077–22095. [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual capture weak footprints and therefore enable similar discriminative learning for image recognition. In Proceedings of the IEEE conference on computer capabilities on machine learning models. vision and pattern recognition. 770–778. [10] Jianping Hua, Zixiang Xiong, James Lowey, Edward Suh, and Edward R Dougherty. 2005. Optimal number of features as a function of sample size 6 CONCLUSION for various classification rules. Bioinformatics 21, 8 (2005), 1509–1515. [11] A. Kloukiniotis, A. Papandreou, A. Lalos, P. Kapsalas, D.-V. Nguyen, and K. In this work, we explored multiple sets of features and machine Moustakas. 2022. Countering adversarial attacks on autonomous vehicles using learning models to determine whether an image has been tampered denoising techniques: A Review. IEEE Open Journal of Intelligent Transportation with for the purpose of an adversarial attack. While the difference Systems (2022). Publisher: IEEE. [12] Max Kuhn, Kjell Johnson, et al. 2013. Applied predictive modeling. Vol. 26. between attacked and non-attacked images is evident to the human Springer. eye, it is not to the machine learning algorithms. We found that [13] Christos Kyrkou, Andreas Papachristodoulou, Andreas Kloukiniotis, Andreas the Catboost and Logistic regression models could almost not dis-Papandreou, Aris Lalos, Konstantinos Moustakas, and Theocharis Theocharides. 2020. Towards artificial-intelligence-based cybersecurity for robustifying auto-criminate between both cases. 
On the other hand, the clustering mated driving systems against camera sensor attacks. In 2020 IEEE Computer algorithms (KMeans and DBSCAN) had a stronger performance. Society Annual Symposium on VLSI (ISVLSI). IEEE, 476–481. [14] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. While the KMeans models did so perfectly, regardless of the fea- Focal loss for dense object detection. In Proceedings of the IEEE international tures, the DBSCAN model only performed well using the SSIM. conference on computer vision. 2980–2988. We consider the strength of such a model the fact that no a pri- [15] Xingjun Ma, Yuhao Niu, Lin Gu, Yisen Wang, Yitian Zhao, James Bailey, and Feng Lu. 2021. Understanding adversarial attacks on deep learning based medical ori information regarding the classes is required, therefore saving image analysis systems. Pattern Recognition 110 (2021), 107332. the annotation effort and providing greater flexibility towards fu- [16] Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram ture adversarial attacks. Our future research will focus on testing a Swami. 2015. Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks. CoRR abs/1511.04508 (2015). arXiv:1511.04508 http: wider range of cyberattacks while ensuring the attack will not be //arxiv.org/abs/1511.04508 discernable to the human eye. [17] Marek Pawlicki and Ryszard S. Choraś. 2021. Preprocessing Pipelines including Block-Matching Convolutional Neural Network for Image Denoising to Robustify Deep Reidentification against Evasion Attacks. Entropy 23, 10 (2021), 1304. ACKNOWLEDGMENTS Publisher: MDPI. [18] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Doro-This work was supported by the Slovenian Research Agency and gush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical the European Union’s Horizon 2020 program project STAR under features. 
Advances in neural information processing systems 31 (2018). grant agreement number H2020-956573. [19] Skyler Speakman, Srihari Sridharan, Sekou Remy, Komminist Weldemariam, and Edward McFowland. 2018. Subset scanning over neural network activations. arXiv preprint arXiv:1810.08676 (2018). REFERENCES [20] Entso Veliou, Dimitrios Papamartzivanos, Sofia Anna Menesidou, Panagiotis Gouvas, and Thanassis Giannetsos. 2021. Artificial Intelligence and Secure Manu- [1] [n.d.]. European Economic Forecast. Autumn 2021. https://economy-finance. facturing: Filling Gaps in Making Industrial Environments Safer. Now Publishers. ec.europa.eu/publications/european-economic-forecast-autumn-2021_en. Ac-30–51 pages. https://doi.org/10.1561/9781680838770.ch2 cessed: 2022-08-05. [21] Hanshu Yan, Jingfeng Zhang, Jiashi Feng, Masashi Sugiyama, and Vincent YF [2] Adrien Bécue, Isabel Praça, and João Gama. 2021. Artificial intelligence, cyber-Tan. 2022. Towards Adversarially Robust Deep Image Denoising. arXiv preprint threats and Industry 4.0: Challenges and opportunities. Artificial Intelligence arXiv:2201.04397 (2022). Review 54, 5 (2021), 3849–3886. [22] Xinchuan Zeng and Tony R Martinez. 2000. Distribution-balanced stratified [3] Gerda Bortsova, Cristina González-Gonzalo, Suzanne C. Wetstein, Florian Du-cross-validation for accuracy estimation. Journal of Experimental & Theoretical bost, Ioannis Katramados, Laurens Hogeweg, Bart Liefers, Bram van Ginneken, Artificial Intelligence 12, 1 (2000), 1–12. Josien PW Pluim, and Mitko Veta. 2021. Adversarial attack vulnerability of [23] Fangyi Zhang, Jürgen Leitner, Michael Milford, Ben Upcroft, and Peter Corke. medical image analysis systems: Unexplored factors. Medical Image Analysis 73 2015. Towards vision-based deep reinforcement learning for robotic motion (2021), 102141. Publisher: Elsevier. control. arXiv preprint arXiv:1511.03791 (2015). [4] Andrew P. Bradley. 1997. 
Addressing climate change preparedness from a smart water perspective

Alenka Guček*, Joao Pita Costa** ***, M. Besher Massri* ** *******, João Santos Costa*, Maurizio Rossi****, Ignacio Casals del Busto*****, Iulian Mocanu******
* Institute Jozef Stefan, Slovenia; ** IRCAI, Slovenia; *** Quintelligence, Slovenia; **** Ville de Carouge, Switzerland; ***** Aguas de Alicante, Spain; ****** Apa Braila, Romania; ******* Jozef Stefan International Postgraduate School
SIKDD’22, October 2022, Ljubljana, Slovenia

ABSTRACT

Observing the world on a global scale can help us better understand the role of water and water resource management utilities in a climate change context that engages us all. The usage of machine learning algorithms on open data measurements and statistical indicators can help us understand the behavioral changes in seasons and better prepare. These are complemented by powerful text mining algorithms that mine worldwide news, social media, published research and patented innovation towards best practices from success stories. In this paper, we propose a data-driven global observatory that puts together the different perspectives of media, science, statistics and sensing over heterogeneous data sources and text mining algorithms. We also discuss the implementation of this global observatory in the context of epidemic intelligence, monitoring the impact of climate change, and the value of this global solution in local contexts and priorities.

CCS CONCEPTS

• Real-time systems • Data management systems • Life and medical science

KEYWORDS

Climate Change Preparedness, Data-driven Decision-making, Water Resource Management, Smart Water, Observatory, Water Digital Twin, Deep Learning, Text Mining, Interactive Data Visualization

1 Introduction

In the present decade, Climate Change has become positioned as one of the world's priorities, a global problem with great socio-economic impact. It has been in the focus of European and worldwide strategies, rapidly changing priorities towards sustainability and environmental efficiency, transversely to most domains of action. The European Commission’s Green Deal [5] is a good example of this, aiming for a climate-neutral Europe in 2050, and boosting the economy through green technology over a new framework to understand and position water resource management in the context of the challenges of tomorrow [1]. In the context of the NAIADES project [3] we repurpose and customize the NAIADES Water Observatory, adding a measurements dimension to its text mining capabilities to allow for forecasts on, e.g., water level and temperature, to complete the perspective on the impact of climate change for the preparedness both of water management utilities and of users in, e.g., smart agriculture. This will improve the climate change preparedness of water resource management facilities and local authorities in a global context, in particular in European regions where water scarcity or extreme weather events are predicted. The water-related climate change topics that we are already addressing include, e.g., water reuse, wastewater management, saline intrusion and groundwater contamination.

In this paper we discuss our contribution to this cause through the NAIADES Water Observatory (accessible at naiades.ijs.si) [12], focusing on water-related aspects and allowing the user to explore a combination of perspectives offered over layers of information sourced from statistics, historical measurements, multilingual news and social media to published science, weather models and indicators. It is also being used in the context of extreme weather events to analyze worldwide trends and best practices in water topics like, e.g., floods, landslides, and contamination [9], building business intelligence from the available open data in combination with data streams [11].

The NAIADES Water Observatory is not only contributing to the improvement of European sustainability in water-related activities and business intelligence, but it is also providing an active role to local actors in improving, together with municipalities and water resource management utilities, the efficient use of resources [13]. This local perspective is especially important for providing information at the local granularity, which enables communities or municipalities to build solutions that are relevant for their specific cases.

Figure 1: Long-term forecast of 10 years (average per year) built on 20 years of data to understand the behavior of air temperature, water levels and temperature and the consequent changes within seasons.

Figure 2: The weather across seasons over the past 20 years, distinguished by seasons, exhibiting high temperature periods earlier in the year.

2 Understanding behaviors from data
In the era of Big Data, where technologies and sensors are every day cheaper and more efficient, a wide range of useful measurements is available and can be used to forecast weather and water resource behaviors and to identify environmental trends with local granularity.

With the motivation to grasp a realistic perspective on the impact of climate change in the region of Carouge, Switzerland, we obtained 20 years of water level and water temperature data (sourced from the Meteoswiss Data Portal IDAWEB), and we were able to build a 10-year forecast that allows us to see a signal of the global trend.

For this aim, we have developed a Long Short-Term Memory (LSTM) neural network, which is a type of Recurrent Neural Network, widely used for predicting sequential data. In order to optimize the performance and accuracy of the LSTM, we used some results from Differential Geometry and Chaos Theory such as Takens’ Embedding Theorem, Shannon Entropy, Conditional Shannon Entropy, Markov Chains, etc. This theoretical support was key for obtaining the optimal number of timesteps [4] and for producing a long-term forecast aiming to observe the weather behavior across the historical data collected, and a perspective on the future seasons based on the derived prediction, represented by the three parameters - temperature, humidity and rainfall - or the water levels in rivers, lakes and basins in the area determined by the geolocation provided by the NAIADES use cases.

The time series of historical data in Figure 1 indicates that the air temperature yearly averages are already increasing, and this increase is predicted also for the next 10 years. Comparing our model with the Meteoswiss model for the area, the differences were minimal. To emphasize the changes throughout the year, we added a per-year visualization (Figure 2), where one can compare the seasonal trends for the local weather and water parameters.

To further explore the relations of multivariate timeseries data, we have developed the State analysis tool [14]. With this technology we automatically abstract data as states of a Markov chain and the transitions between them. This allows for the ingestion of large datasets, and due to hierarchical clustering the data can be observed on several levels. This tool works especially well for observing long-term behavior and exposing recurrent patterns. In the context of climate change preparedness, the aim was to better understand the reality of the seasons as defined by the weather parameters as well as the water level and temperature over the past 20 years. Depicted in Figure 3 are the transitions between seven states we can already depict in the municipality of Carouge, Switzerland and the surrounding area. Five of those states correspond to a passage between Spring-Summer and Summer-Autumn, and to Summer itself, characterized by the states indicating a high water temperature. With the impact of climate change in redefining seasons, this tool can help to plan ahead, having in mind the granularity of the data that can be customized to predefined geographic regions where relevant water resources are located.

Figure 3: The analysis of the impact of climate change on water levels and temperature across seasons using Markov chains.

3 Enrichment with local indicators

Water is fundamental to all human activity and ecosystem health, and is a topic of rising awareness in the context of climate change. Water resource management is central to those concerns, with industry accounting for over 19% of global water withdrawal, and agricultural supply chains responsible for 70% of water stress [10].
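The "optimal number of timesteps" used for the LSTM forecasting in Section 2 amounts to a delay embedding: the series is cut into fixed-length windows of past values, each paired with the value to predict. A minimal sketch of that windowing step, with the series values and window length invented purely for illustration:

```python
import numpy as np

def delay_embed(series, n_timesteps):
    """Cut a 1-D series into (window of past values, next value) pairs,
    the supervised form a sequence model such as an LSTM is trained on.
    n_timesteps is the window length, i.e. the number of past steps."""
    X = np.stack([series[i:i + n_timesteps]
                  for i in range(len(series) - n_timesteps)])
    y = series[n_timesteps:]
    return X, y

# Invented daily temperature readings, purely for illustration.
temps = np.array([11.0, 12.5, 13.1, 12.9, 14.2, 15.0, 14.8])
X, y = delay_embed(temps, n_timesteps=3)
# Each row of X holds three past values; y holds the value to predict.
```

Choosing `n_timesteps` well is exactly where embedding-theoretic tools such as Takens' theorem and conditional entropy come in; too short a window hides the dynamics, too long a window adds noise and cost.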
In 2015 the UN established "clean water and sanitation for all" as one of the 17 Sustainable Development Goals, aiming for eight targets to be achieved by 2030 [2].

To exploit the functionality for customization at the level of local regional providers, news monitoring and the exploration of scientific research can be customized to observed problems, e.g., groundwater contamination. Moreover, the ingestion of local indicators can also be customized. These agencies (e.g. Aguas de Alicante) are collecting data on their water resource management services to improve customer satisfaction and optimize their systems, aiming for a smart water [6] approach to the optimization of resources and means, often deploying intelligent systems close to the idea of a water digital twin [7].

Together with the municipality of Carouge, Switzerland, and with the water management utilities of Alicante, Spain, and Braila, Romania, we have collected open data from national data portals and environmental agencies with a regional granularity, to be able to assess the comparative progress of regions through the visual data representation of indicators (see Figure 4). Through this interactive data visualization we can investigate the progress on a variety of topics (with three simultaneous parameters represented over a bubble chart) that are highly relevant to the analysis of climate change, including water availability, reused and treated water, or water usage by populations and industry. With the appropriate combination of variables in comparison, the user can identify the most efficient regions over the country.

Figure 4: The comparison of indicators in the Spanish regions across time.

To better understand the comparative progress of each region on the selected water-related topics, we also enable the representation of the time-series curves (see Figure 5) to identify transitions, peaks and other behaviors (per parameter in analysis) that are otherwise not seen in the bubble chart animation.

Figure 5: The curves comparing regional indicators on water topics (e.g., reused water in Spain).

4 Knowledge extracted from news, social media and scientific research

The NAIADES Water Observatory also allows for a news monitoring perspective with global and local coverage on topics like, e.g., water scarcity and water quality. It is particularly relevant in the surrounding regions of the water resource management agencies, but also at a worldwide level, recurring to its multilingual capacity to access success stories and best practices from similar scenarios happening worldwide. This is based on the Event Registry news engine [8], which collects over 300 thousand news articles daily in over 60 languages. In the past 3 months we were able to capture almost 33 thousand articles relating both to water and to the climate crisis, 1500 of them happening in Spain and relating to concepts such as, e.g., drought, wildfire, heat wave, irrigation and extreme weather.

This global system is also capturing the filtered Twitter feed on 10% of the signal, to identify posts related to heat wave and drought (see Figure 6).

Figure 6: The combined perspective of multilingual news, social media and scientific research on water scarcity and extreme weather, aiming to identify best practices and success stories.

The scientific research on climate change topics can bring an important complement in this context, providing success stories and best practices that can be extracted from the textual data and explored with complex data visualization technology, allowing the user to run powerful Lucene-based queries over the articles' metadata and to relate that research across time, suggesting related topics (see Figure 7). These data analytics technologies are able to analyze multiple time-series simultaneously, providing interactive exploration tools to understand trends in climate change research and the water topics related to it.

Figure 7: The trends over time that relate to the topic Climate Change in the scientific literature.

5 Conclusions and further work

Adapting to climate change is an important topic for water management services, since their work is quintessential for the well-being of people. Understanding the seasonality changes and forecasting the availability of resources at the local levels is therefore crucial to enable relevant adaptation at the correct granularity.

Although the predictions are in accordance with IPCC's and Meteoswiss forecasting, this preliminary work needs to be extended by ingesting several other data variables and compared to the existing widely used models to bring more accurate insight, especially for the weather data, but also for the water-relevant resources.

ACKNOWLEDGMENTS

We thank the support of the European Commission on the H2020 NAIADES project (GA nr. 820985).

REFERENCES

[1] A. Akhmouch, C. Delphine and P. G. Delphine Clavreul. Introducing the OECD principles on water governance. Water International, 43: 5–12, 2018.
[2] V. Blazhevska. United Nations launches framework to speed up progress on water and sanitation goal. United Nations Sustainable Development, 2020.
[3] CORDIS, "NAIADES Project". [Online]. Available: https://cordis.europa.eu/project/id/820985 [Accessed 1 9 2020].
[4] Costa J., Kenda K., Pita Costa J. (2021). Entropy for Time Series Forecasting. In: Slovenian Data Mining and Data Warehouses conference (SiKDD 2021).
[5] European Commission, "European Green Deal," 2019. [Online]. Available: https://ec.europa.eu/info/strategy/priorities-2019-2024/european-green-deal_en. [Accessed 1 9 2020].
[6] C. Sun, V. Puig, G. Cembrano. (2020). Real-Time Control of Urban Water Cycle under Cyber-Physical Systems Framework. Water: 12, 406.
[7] Di Nardo et al. (2018). On-line Measuring Sensors for Smart Water Network Monitoring. EPiC Series in Engineering. 3: 572–581.
[8] G. Leban, B. Fortuna, J. Brank and M. Grobelnik, "Event registry: learning about world events from news," Proceedings of the 23rd International Conference on World Wide Web, pp. 107–110, 2014.
[9] M. Mikoš, N. Bezak, J. Pita Costa, M. Besher Massri, M. Jermol, M. Grobelnik. Natural-hazard-related web observatories as a sustainable development tool, in Progress in Landslide Research and Technology, Springer, Vol. 1, No. 1, 2022.
[10] Our World in Data (2022). Water Use Stress. https://ourworldindata.org/water-use-stress. [Accessed 1 8 2022]
[11] J. Pita Costa (2022). Business intelligence built from open data. WaterWorld Magazine. [Online]. Available: https://www.waterworld.com/water-utility-management/smart-water-utility/article/14234325/2203wwint [Accessed 1 8 2022]
[12] J. Pita Costa (2021). Observing water-related events to support decision-making. Smart Water Magazine. [Online]. Available: https://smartwatermagazine.com/news/naiades-project/observing-water-related-events-support-decision-making [Accessed 1 8 2022]
[13] J. Pita Costa, I. Casals del Busto, A. Guček, et al (2022). Building A Water Observatory From Open Data. Proceedings of the IWA 2022.
[14] L. Stopar, P. Škraba, M. Grobelnik, and D. Mladenić (2018). StreamStory: Exploring Multivariate Time Series on Multiple Scales. IEEE Transactions on Visualization and Computer Graphics 25, 4: 1788–1802.
SciKit Learn vs Dask vs Apache Spark Benchmarking on the EMNIST Dataset

Filip Zevnik, Din Music, Carolina Fortuna, Gregor Cerar
Department of Communication Systems, Jozef Stefan Institute, Ljubljana, Slovenia
zevnikfilip@gmail.com

Abstract—As datasets for machine learning tasks can become very large, more consideration has to be given to memory and computing resource usage. As a result, several libraries for parallel processing that improve RAM utilization and speed up computations by parallelizing ML jobs have emerged. While SciKit Learn is the typical go-to library for practitioners, Dask is a parallel computing library that can be used with SciKit, and Apache Spark is an analytics engine for large-scale data processing that includes some machine learning techniques. In this paper, we benchmark the three solutions for developing ML pipelines with respect to data loading and merging and subsequently for training and predicting on the extended MNIST (EMNIST) dataset under Linux and Windows OS. Our results show that Linux is the better option for all of the benchmarks. For low amounts of data plain SciKit Learn is the best option for machine learning, but for more samples we would choose Apache Spark. On the other hand, when it comes to dataframe manipulation, Dask beats a normal pandas import and merge.

Index Terms—Apache Spark, Dask, machine learning, Pandas, import

I. INTRODUCTION

As datasets for machine learning tasks can become very large, more consideration has to be given to memory and computing resource usage. As a result, several libraries for parallel processing that improve RAM utilization and speed up computations by parallelizing ML jobs have emerged. While SciKit Learn [1] is the typical go-to library for practitioners, Dask [2] is a parallel computing library that can be used with SciKit to improve memory and CPU utilization. Dask improves memory utilization by not immediately loading all the data, but only pointing to it; only part of the data is loaded on a per-need basis. It also enables using all available cores on a system to train a model. Apache Spark is an analytics engine written in Java and Scala for processing large-scale data that incorporates some machine learning techniques and is tightly integrated with the Spark architecture.

While there are other libraries [3] that enable parallelization of ML, when it comes to distributed computing tools for tabular datasets, Spark and Dask are the most popular choices today. Even though Spark is an older, more stable solution, Dask is part of the vibrant Python ecosystem, and both technologies excel at parallelization. While the two solutions have already been benchmarked on big data pipelines [4] and on various image processing and learning scenarios [5]–[7], the work in [7] is the closest to this one; however, they focused on evaluating the tradeoffs in parallelizing feature extraction and clustering, while this work focuses on evaluating data loading and merging and subsequent classification.

In this paper, we benchmark the three solutions for developing ML pipelines with respect to data loading and merging and subsequently for training and predicting on the extended MNIST (EMNIST) dataset under Linux and Windows OS. Our results show that Linux is the better option for all of the benchmarks. For low amounts of data plain SciKit Learn is the best option for machine learning, but for more samples we would choose Apache Spark. On the other hand, when it comes to dataframe manipulation, Spark is behind Dask, and Dask beats a normal pandas import and merge. The contribution of this paper is the benchmarking of three ML libraries across various data sizes and two operating systems on two parts of the ML model development pipeline.

The remainder of the paper is structured as follows. Section II discusses related work. Section III presents the methodology used in the benchmarking. Section IV evaluates the comparison. Finally, Section V presents our conclusions.

This work was funded by the Slovenian Research Agency ARRS under program P-0016.

II. RELATED WORK

Chintapalli et al. (2016) [8] compared the streaming platforms Flink, Storm and Spark. The paper focuses on real-world streaming scenarios using ads and ad campaigns. Each streaming platform was used to build a pipeline that identifies relevant events, which were sourced from Kafka. In addition, Redis was used for storing windowed counts of relevant events per campaign. The test system contained 40 nodes, where each node contained 2 CPUs with 8 cores and 24 GB of RAM. All nodes were interconnected using a gigabit Ethernet connection. In the experiment, Kafka produced events at a set rate, with a 30-minute interval between batches. The results showed that Flink and Storm were almost equal in terms of event latency, while Spark turned out to be the slowest of the three.

Dugré et al. (2019) [4] compared Dask and Spark on neuroimaging big data pipelines. As neuroimaging requires a large number of images to be processed, Spark and Dask were at the time of writing the best suited Big Data engines. The paper compares the technologies with three different pipelines: the first is an incrementation, the second a histogram, and the final one a BIDS app example (a map-reduce style application). All comparisons were done on the BigBrain and CoRR datasets, with sizes of 81 GB and 39 GB respectively. The authors concluded that all platforms perform very similarly and that increasing the number of worker nodes is not always the optimal solution due to transfer times and overall overhead. While all platforms yielded similar results, Spark is claimed to be the fastest of the three.

Nguyen et al. (2019) [6] evaluated SciDB, Myria, Spark, Dask and TensorFlow to figure out which system is best suited for image processing. Similarly to [4], the authors compared the systems using different pipelines. For the comparison, the authors used 2 datasets, both over 100 GB in size. The comparison revealed that Dask and Spark are comparable in performance as well as ease of use.

Mehta et al. (2016) [5] presented a satellite data processing pipeline. The pipeline consists of two steps, a feature extraction step and a clustering step. The baseline pipeline used the Caffe deep learning library and SciKit. The improved pipeline used Keras along with Spark and Dask for multi-node computation. They found that while Spark was the fastest in terms of computational time required per task, Dask used almost half the memory compared to Spark due to recalculation of the intermediate values. SciKit Learn was not able to complete the task and was excluded from the final comparison. It was concluded that Spark is the best performer, while Dask is the easiest to use.

Cheng et al. (2019) [7] presented a comparison of RADICAL-Pilot, Dask and Spark for image processing. All three systems were tested using watershed and blob detector algorithms. Each test was split into two parts: a weak scaling part, where the amount of data to be processed was increased alongside the number of nodes, and a strong scaling part, where the amount of data stayed the same and the number of nodes increased. The evaluation showed that Dask outperformed Spark on weak scaling, while Spark excelled in the strong scaling part.

III. METHODOLOGY

To benchmark the three solutions, namely SciKit Learn, Dask and Spark, we single out two parts of the end-to-end model development process depicted in Figure 1. We first time the data importing and merging process, referred to as Benchmark 1 in the figure, followed by model training and evaluation, denoted by Benchmark 2. While the time required to train the model is usually the most important metric, because it takes up most of the computation time, importing and merging the input data cannot be ignored. As described in Algorithm 1, for Benchmark 1 the training data was imported and then merged. For SciKit Learn, dataframes were used all along and no parallelization was used, while for Dask and Spark parallelization was turned on.

Fig. 1. Workflow of the machine learning test example used for benchmarking.

Algorithm 1: Import and merge benchmarking process.
  Require: data a and data b
  Enable parallelization
  Merge the DataFrames
  Convert data to a pandas DataFrame

Algorithm 2: Train/fit and evaluate benchmarking process.
  Enable parallelization
  Import and setup data
  train = [80% of the samples], test = [20% of the samples]
  Define ML algorithm
  Fit the data
  Predict the samples
  Evaluate - F1

As described in Algorithm 2, for Benchmark 2 in Figure 1, an example of machine learning with a decision tree classifier depicts the workflow of the machine learning test. First, parallelization is enabled for Dask and Spark, and immediately after that the data is imported and modified to fit the test scenario. Next, the decision tree classifier is trained using various training data sizes, dividing the data set into a training subset and a test subset. The training subset represents 80% of the original dataset, and the test subset uses the remaining data, representing 20% of the original dataset. Each task is run with 5 different sample sizes, ranging from 50k to 250k samples, with a step of 50k samples. Finally, an execution report with the computation times of each task is generated.
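Benchmark 2 (Algorithm 2 above) can be sketched with scikit-learn. The 80/20 split, decision-tree classifier and F1 evaluation follow the paper; the function name, the synthetic data standing in for the EMNIST CSV, and the sizes used here are illustrative only, not taken from the paper's scripts.

```python
# Sketch of Benchmark 2 (Algorithm 2): 80/20 split, decision tree, F1.
# Synthetic data stands in for the EMNIST CSV; names/sizes are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def run_benchmark2(X, y, test_size=0.2, seed=0):
    """Train/fit and evaluate, returning the macro F1 score."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X_tr, y_tr)          # "Fit the data"
    y_pred = clf.predict(X_te)   # "Predict the samples"
    return f1_score(y_te, y_pred, average="macro")  # "Evaluate - F1"

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((1000, 784))          # 28x28 images flattened, as in EMNIST
    y = rng.integers(0, 10, size=1000)   # digit labels 0-9
    print(run_benchmark2(X, y))
```

In the paper's setup the same logic is timed three times: directly on pandas data for SciKit Learn, and with parallelization enabled for the Dask and Spark variants.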
To realize these benchmarks (scripts are available at https://github.com/sensorlab/parMLBenchmarks), we used the extended MNIST (EMNIST) dataset (https://www.kaggle.com/crawford/emnist, accessed 30.07.2022). The dataset contains approximately 250k samples of handwritten digits, resulting in a total size of 516 MB. All images have exactly the same size, 28 by 28 pixels, and each pixel has a value ranging from 0 to 255. The dataset is represented in the CSV (Comma Separated Values) format, with the first column being the label and the remaining 784 columns representing the pixels. For the benchmarks, different data set sizes, ranging from 50k to 250k samples with a 50k step, were generated. In addition, each data set size was tested on Dask and Spark with 1, 2 and 4 workers. The programs used to test computation time on Windows and Linux therefore have the same complexity. All tests were performed on equivalent Windows and Linux virtual machines running on a 6 CPU core machine with 10 GB of RAM.

IV. RESULTS

In this section we provide the results of the benchmarks collected using the methodology described in Section III.

A. Import and merge

First, we present in Figure 2 the import and merge times for 100k samples on Linux without parallelization across the three platforms. In the first bar, it can be seen that importing (i.e. loading the data into memory) takes most of the time with Pandas. Merging (i.e. concatenation) is relatively negligible, while computation is not relevant in this case, as after merging Pandas already returns the desired data structure. The total import and merge time is slightly above 4 s.

Fig. 2. Benchmark results of import and merge times at 100k samples: raw data to Pandas.

From the second bar, it can be seen that importing and merging is negligible with Dask, as Dask doesn't load anything into memory at these steps; rather, it only prepares recipes that will be executed during the most time-consuming compute phase. During compute, Dask turns a lazy collection into its in-memory equivalent; in our case, the Dask dataframe turns into a Pandas dataframe. Overall, on a single node Dask is comparable to Pandas, with a total import and merge time slightly below 4 s.

Finally, from the last bar, it can be seen that the Spark import and merge are very fast and efficient, taking below 2 s. However, transforming the internal data structure of Spark into pandas (i.e. the compute phase in this case) is very time consuming. We added this step so that the final outcome is consistent with the other two (i.e. a Pandas data structure); however, in an end-to-end ML pipeline the ML algorithm would be trained directly on Spark's internal data structure.

Figure 3 shows how the import and merge times fare as a function of worker nodes for Dask across Linux and Windows. As expected, the import/merge times show a decreasing tendency as the number of worker nodes increases.

Fig. 3. Benchmark results on two operating systems, Dask with import and merge on 250k samples.

When testing Spark on the import and merge benchmark, both Windows and Linux ran out of memory with two and four workers. Swap memory could be used to overcome this shortcoming; however, the resulting comparison would not be fair, because the Dask benchmarks didn't need the swap memory.

B. Machine learning

Figure 4 shows the comparison of computation time between Dask, Spark and SciKit on the Windows operating system for different dataset sizes. Each column in the figure represents the average computation time of 5 test runs. The results show that Dask and Spark are almost equivalent when the input dataset size is around 150k samples. Dask performs better on smaller datasets, while Spark's performance is best on larger datasets. Interestingly, SciKit outperforms both Dask and Spark on all dataset sizes, although it is not able to parallelize tasks. This is most likely because of the transfer times between nodes and the overall overhead of Dask and Spark. Since the datasets fit completely into the computer's memory, SciKit has no problems computing them, while Dask and Spark only cause unnecessary overhead. However, Dask and Spark are meant for large clusters with hundreds or even thousands of nodes, while SciKit is meant for computations on a single computer.

Fig. 4. Computational time for different dataset sizes on Windows operating system.

Figure 5 shows the results of the same experiment performed on the Linux operating system. Compared to Figure 4, the results are very similar, with the only difference that on Linux Dask performs better than Spark even when the input data set contains 150k samples.

Fig. 5. Computational time for different dataset sizes on Linux operating system.

Table I shows the F1 scores. The F1 score is the harmonic mean of precision and recall. Precision gives information on how many of the samples predicted as positive are correct; recall gives information on how many of all positive samples the model managed to find.

The machine learning benchmark also measured the time to cast all columns into smaller data types. Dask has a dedicated function to cast all of the columns of a Dask dataframe at once, whereas with Spark each column has to be cast one by one. The Dask casting was faster (0.06 s) than Spark's (7.2 s).

V. CONCLUSIONS

In this paper we benchmarked two parallel computing technologies, Dask and Apache Spark, against each other and against the single-node SciKit Learn. The benchmarks were computed on the EMNIST dataset for various subsets from 50k to 250k samples, on different operating systems and with various degrees of parallelization. The results show a slight advantage of running the training pipeline on Linux rather than on Windows. Dask is superior in dataframe manipulation, while Apache Spark has superior end-to-end processing performance on larger datasets, with comparable final F1 scores.

ACKNOWLEDGMENTS

This work was funded in part by the Slovenian Research Agency under the grant P2-0016.

REFERENCES

[1] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[2] M. Rocklin, "Dask: Parallel computation with blocked algorithms and task scheduling," in Proceedings of the 14th Python in Science Conference, vol. 130, p. 136, Citeseer, 2015.
[3] S. Celis and D. R. Musicant, "Weka-parallel: machine learning in parallel," in Carleton College, CS TR, Citeseer, 2002.
[4] M. Dugré, V. Hayot-Sasson, and T. Glatard, "A performance comparison of Dask and Apache Spark for data-intensive neuroimaging pipelines," in 2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS), pp. 40–49, 2019.
[5] P. Mehta, S. Dorkenwald, D. Zhao, T. Kaftan, A. Cheung, M. Balazinska, A. Rokem, A. Connolly, J. Vanderplas, and Y. AlSayyad, "Comparative evaluation of big-data systems on scientific image analytics workloads," vol. 10, pp. 1226–1237, VLDB Endowment, Aug. 2017.
[6] M. H. Nguyen, J. Li, D. Crawl, J. Block, and I. Altintas, "Scaling deep learning-based analysis of high-resolution satellite imagery with distributed processing," in 2019 IEEE International Conference on Big Data (Big Data), pp. 5437–5443, 2019.
[7] William Cheng, Ioannis Paraskevakos, et al., "Image processing using task parallel and data parallel frameworks," pp. 1–7, 2019.
[8] S. Chintapalli, D. Dagit, B. Evans, R. Farivar, T. Graves, M. Holderbaugh, Z. Liu, K. Nusbaum, K. Patil, and B. J. Peng, "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," 2016.
TABLE I
TABLE OF F1 SCORES FOR WINDOWS BENCHMARKS FOR VARIOUS SAMPLE SIZES (SIMILAR FOR LINUX).

Number of samples (x1000) |  50  | 100  | 150  | 200  | 250
Spark                     | 0.71 | 0.73 | 0.73 | 0.71 | 0.71
Dask                      | 0.71 | 0.72 | 0.73 | 0.71 | 0.70
Scikit                    | 0.70 | 0.71 | 0.70 | 0.71 | 0.73

An Efficient Implementation of Hubness-Aware Weighting Using Cython

Krisztian Buza
buza@biointelligence.hu
BioIntelligence Group, Department of Mathematics-Informatics
Sapientia Hungarian University of Transylvania
Targu Mures, Romania

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2022, 10–14 October 2022, Ljubljana, Slovenia
© 2022 Copyright held by the owner/author(s).

ABSTRACT
Hubness-aware classifiers are recent variants of k-nearest neighbor. When training hubness-aware classifiers, the computationally most expensive step is the calculation of hubness scores. We show that this step can be sped up by an order of magnitude or even more if it is implemented in Cython instead of Python, while the accuracy is the same in both cases.

KEYWORDS
nearest neighbor, hubs, cython

1 INTRODUCTION
Nearest neighbor classifiers are simple, intuitive and popular, and there are theoretical results about their accuracy and error bounds [6]. However, nearest neighbors are affected by bad hubs. An instance is called a bad hub if it appears surprisingly frequently as nearest neighbor of other instances, but its class label is different from the labels of those other instances. Bad hubs were shown to be responsible for a surprisingly large fraction of the total classification error [10].

In order to reduce the detrimental effect of bad hubs, hubness-aware classifiers have been introduced, such as Hubness-Weighted k-Nearest Neighbor (HWKNN) [9], Naive Hubness Bayesian Nearest Neighbor (NHBNN) [14] and Hubness-based Fuzzy Nearest Neighbor (HFNN) [16]. Hubness has also been studied in the context of collaborative filtering [8], regression [3], clustering [15], and instance selection and feature selection [13]. Recently, hubness-aware ensembles have been proposed [17] and used for the classification of breast cancer subtypes [12]. Other prominent applications of hubness-aware methods include music recommendation [7], time series classification [11], drug-target prediction [4] and classification of gene expression data [2]. Last, but not least, we mention that even neural networks may benefit from hubness-aware weighting [5].

Hubness-aware classifiers may be implemented in various programming languages; one of the most prominent implementations is probably the Java-based HubMiner library (https://github.com/datapoet/hubminer).

In case of the aforementioned hubness-aware classifiers, the computationally most expensive step of the training is to determine the hubness scores of training instances, i.e., how frequently they appear as (bad) nearest neighbors of other instances. In this paper, we address this issue by a Cython-based implementation. Cython [1] aims to combine the advantages of Python (rapid prototyping and clarity thanks to concise code) with the efficiency of C. In particular, we implement the computation of hubness scores in Cython. Compared with a standard implementation in Python, we observed up to 25 times speedup on the Spambase dataset (https://archive.ics.uci.edu/ml/datasets/spambase) from the UCI repository (and the speedup is likely to be even more in case of larger datasets).

2 BACKGROUND: HUBNESS-AWARE WEIGHTING
We say that an instance x' is a bad neighbor of another instance x if (i) x' is one of the k-nearest neighbors of x and (ii) their class labels are different. In case of hubness-aware weighting [9], we first determine how frequently each instance x appears as a bad neighbor of other instances; this count is denoted BN_k(x). Subsequently, the normalized bad hubness score h_b(x) of each instance x is calculated as follows:

  h_b(x) = ( BN_k(x) − μ(BN_k) ) / σ(BN_k)    (1)

where μ(BN_k) and σ(BN_k) denote the mean and standard deviation of the BN_k(x) values over all instances of the training data. HWKNN performs weighted k-nearest neighbor classification, where the weight of each training instance is w(x) = e^(−h_b(x)). For a detailed illustration of HWKNN we refer to [13].

3 CYTHON-BASED IMPLEMENTATION OF HUBNESS CALCULATIONS
Python code is usually run by an interpreter, which makes the execution relatively slow. Much of the inefficiency originates from dynamic typing: for example, the actual semantics of the '+' symbol depends on the types of the operands. It may stand for addition of numbers, concatenation of strings or lists, element-wise addition of arrays, etc. Which of the operations to perform will be determined by the interpreter at execution time.

The core idea of Cython (https://cython.org/) is to annotate variables according to their types and to compile the resulting code into C, which will further be compiled into binary code for efficient execution. In case of computationally expensive functions, this may result in several orders of magnitude speedup. At the same time, functions implemented in Cython can be called from Python code just like Python functions.
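The quantity being accelerated — BN_k(x), and from it the scores and weights of Eq. (1) — can be written as a short plain-Python/NumPy reference. This is an illustrative naive O(n^2) sketch (the function names and toy data are ours, not from the paper's repository); it is exactly the kind of tight loop that the Cython version speeds up by adding static type annotations and compiling to C.

```python
import numpy as np

def bad_neighbor_counts(X, y, k):
    """Naive O(n^2) computation of BN_k(x): how often each instance
    appears among the k nearest neighbors of an instance with a
    different class label. This loop is the candidate for Cython."""
    n = len(X)
    bn = np.zeros(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)  # distances from instance i
        d[i] = np.inf                         # exclude the instance itself
        for j in np.argsort(d)[:k]:           # its k nearest neighbors
            if y[j] != y[i]:
                bn[j] += 1                    # j is a bad neighbor of i
    return bn

def hubness_weights(bn):
    """Normalized bad hubness scores h_b(x) (Eq. 1) and the HWKNN
    instance weights w(x) = exp(-h_b(x))."""
    h_b = (bn - bn.mean()) / bn.std()
    return h_b, np.exp(-h_b)

# Toy data (illustrative): two classes on a line.
X = np.array([[0.0], [0.2], [0.3], [5.0]])
y = np.array([0, 0, 1, 1])
bn = bad_neighbor_counts(X, y, k=1)
h_b, w = hubness_weights(bn)   # frequent bad neighbors get weights < 1
```

A Cython port of `bad_neighbor_counts` would declare `i`, `j`, `n` and the arrays with C types (e.g. `cdef int`, typed memoryviews), which removes the interpreter dispatch from the inner loop without changing the returned BN_k(x) values.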
We implemented the calculation of hubness scores both in Python and Cython, and made the code available in our github repository: https://github.com/kr7/cython.

We evaluated both implementations on the Spambase dataset from the UCI repository. The dataset contains 4601 instances and 57 features (without the class label). Each instance corresponds to an e-mail, and for each e-mail the same features were extracted. The associated classification task is to decide whether the e-mail is spam or not.

We used 100 instances as test data and 4500 instances as training data. We ran the experiments in Google Colab (https://colab.research.google.com). We used k = 10 nearest neighbors both for the calculation of hubness scores and for the final classification. According to our observations, the Cython-based calculation of hubness scores was more than 20 times faster than the standard implementation in Python. Both versions produced exactly the same BN_k(x) scores. As the weight of an instance x only depends on its BN_k(x) score, both versions produce the same predictions. Therefore the accuracy (0.94) is equal in both cases.

We repeated the experiments using only 1000, 2000 and 3000 instances as training data. As Fig. 1 shows, the Cython-based implementation was consistently faster than the implementation in Python; note that a logarithmic scale is used on the vertical axis. The difference showed an increasing trend when more training data was used: whereas in case of 1000 training instances the Cython-based implementation was only about 12 times faster than the Python-based implementation, in case of 4500 training instances the speedup factor was approximately 25. This may be attributed to the non-linear complexity of hubness score calculations. Assuming a naive implementation, determination of the nearest neighbors of an instance is linear in the size of the training data. However, in order to calculate the hubness scores, the nearest neighbors of all the training instances have to be determined. Thus the resulting overall complexity is quadratic.

Figure 1: Runtime (in seconds, vertical axis) of hubness score calculation in case of Python-based (dashed line with 'x') and Cython-based (solid line with bullets) implementations for various numbers of instances (horizontal axis).

We note that, both in case of Cython and Python, indexing techniques may be used to speed up the determination of the nearest neighbors. However, we omitted indexing in our implementation for simplicity.

4 DISCUSSION
In order to calculate distances efficiently, we used the pairwise distances from scikit-learn in our experiment. However, in case of large datasets, it may be necessary to calculate distances on the fly, as the distance matrix may be too large to be stored in RAM. In such cases, it may be worth considering implementing the distance calculations in Cython as well. In our previous work, we observed that the calculation of the dynamic time warping distance was several orders of magnitude faster when we implemented it in Cython instead of Python.

In case of very large datasets, straightforward calculation of hubness scores may be infeasible due to its quadratic complexity, even if the calculations are implemented in Cython. In such cases, the aforementioned indexing techniques and/or the calculation of approximate hubness scores (e.g. using a random subset of the data) may be necessary.

As future work, we plan an exhaustive evaluation of both implementations with respect to various datasets with different sizes and numbers of features.

ACKNOWLEDGEMENT
The author thanks the Reviewers for their insightful comments and suggestions.

REFERENCES
[1] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. 2010. Cython: The best of both worlds. Computing in Science & Engineering 13, 2 (2010), 31–39.
[2] Krisztian Buza. 2016. Classification of gene expression data: A hubness-aware semi-supervised approach. Computer Methods and Programs in Biomedicine 127 (2016), 105–113.
[3] Krisztian Buza, Alexandros Nanopoulos, and Gábor Nagy. 2015. Nearest neighbor regression in the presence of bad hubs. Knowledge-Based Systems 86 (2015), 250–260.
[4] Krisztian Buza and Ladislav Peška. 2017. Drug–target interaction prediction with Bipartite Local Models and hubness-aware regression. Neurocomputing 260 (2017), 284–293.
[5] Krisztian Buza and Noémi Ágnes Varga. 2016. ParkinsoNET: estimation of UPDRS score using hubness-aware feedforward neural networks. Applied Artificial Intelligence 30, 6 (2016), 541–555.
[6] Luc Devroye, László Györfi, and Gábor Lugosi. 2013. A Probabilistic Theory of Pattern Recognition. Vol. 31. Springer Science & Business Media.
[7] Arthur Flexer, Monika Dörfler, Jan Schlüter, and Thomas Grill. 2018. Hubness as a case of technical algorithmic bias in music recommendation. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 1062–1069.
[8] Peter Knees, Dominik Schnitzer, and Arthur Flexer. 2014. Improving neighborhood-based collaborative filtering by reducing hubness. In Proceedings of International Conference on Multimedia Retrieval. 161–168.
[9] Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović. 2009. Nearest neighbors in high-dimensional data: The emergence and influence of hubs. In Proceedings of the 26th Annual International Conference on Machine Learning. 865–872.
[10] Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović. 2010. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research 11 (2010), 2487–2531.
[11] Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović. 2010. Time-series classification in many intrinsic dimensions. In Proceedings of the 2010 SIAM International Conference on Data Mining. SIAM, 677–688.
[12] S. Raja Sree and A. Kunthavai. 2022. Hubness weighted SVM ensemble for prediction of breast cancer subtypes. Technology and Health Care 30, 3 (2022), 565–578.
[13] Nenad Tomašev, Krisztian Buza, Kristóf Marussy, and Piroska B. Kis. 2015. Hubness-aware classification, instance selection and feature construction: Survey and extensions to time-series. In Feature Selection for Data and Pattern Recognition. Springer, 231–262.
[14] Nenad Tomašev, Miloš Radovanović, Dunja Mladenić, and Mirjana Ivanović. 2011. A probabilistic approach to nearest-neighbor classification: Naive hubness Bayesian kNN. In Proc. 20th ACM Int. Conf. on Information and Knowledge Management (CIKM). 2173–2176.
[15] Nenad Tomašev, Miloš Radovanović, Dunja Mladenić, and Mirjana Ivanović. 2013. The role of hubness in clustering high-dimensional data. IEEE Transactions on Knowledge and Data Engineering 26, 3 (2013), 739–751.
[16] Nenad Tomašev, Miloš Radovanović, Dunja Mladenić, and Mirjana Ivanović. 2014. Hubness-based fuzzy measures for high-dimensional k-nearest neighbor classification. International Journal of Machine Learning and Cybernetics 5, 3 (2014), 445–458.
[17] Qin Wu, Yaping Lin, Tuanfei Zhu, and Yue Zhang. 2020. HIBoost: A hubness-aware ensemble learning algorithm for high-dimensional imbalanced data classification. Journal of Intelligent & Fuzzy Systems 39, 1 (2020), 133–144.
60 Semantic Similarity of Parliamentary Speech using BERT Language Models & fastText Word Embeddings Katja Meden Department of Knowledge Technologies E8, Jožef Stefan Institute katja.meden@ijs.si ABSTRACT We measured sentence similarity with four BERT-based language models (Language agnostic BERT Sentence Encoder - The main objective of this paper is to present the work done on LaBSE model [7], Sentence-LaBSE [8], Sentence-BERT [14], comparing the two methods for measuring semantic similarity of multilingual BERT – mBERT [1]) and compared the scores of parliamentary speech between coalition and opposition regarding most similar and least similar sentences. the adoption of the first COVID-19 epidemic response package. To facilitate the intended scope of our initial research, i.e., We first measured sentence similarity using four BERT-based researching similarity of full-text parliamentary speech, we used language models (Language agnostic BERT Sentence Encoder - fastText [5] and presented results using descriptive analysis to LaBSE model, Sentence-LaBSE, Sentence-BERT, multilingual gain additional insight into the characteristics of coalition and BERT - mBERT) and compared the results amongst them. Using opposition parliamentary speech. Lastly, we highlighted some of the word embedding method, fastText, we then measured the the advantages and disadvantages of each method for measuring semantic similarity of full-text parliamentary speech and semantic similarity of parliamentary speech. presented the results using descriptive analysis. Lastly, we The paper is structured as follows: Section 2 contains an compared the usage of both methods and highlighted some of the overview of the related work on word embeddings and language advantages and disadvantages of each method for measuring the models. Section 3 presents the methodology and we describe the semantic similarity of parliamentary speech. experiment setting in Section 4. 
The experiment results are found in Section 5. Finally, we conclude the paper and provide ideas KEYWORDS for future work in Section 6. parliamentary speech, semantic similarity, sentence similarity, BERT language models, fastText 2 RELATED WORK Two blocks of texts are considered similar if they contain the 1 INTRODUCTION same words or characters. Techniques like Bag of Words (BoW), “National parliamentary data is a verified communication Term Frequency - Inverse Document Frequency (TF-IDF) can be channel between the elected political representatives and society used to represent text as real value vectors to aid calculation of members in any democracy. It needs to be made accessible and Semantic Textual Similarity (STS) [3]. STS is defined as the comprehensive - especially in times of a global crisis.” [13] In measure of semantic equivalence between two blocks of text and parliamentary discourse, politicians expound their beliefs and usually give a ranking or percentage of similarity between texts, ideas through argumentation and to persuade the audience, they rather than a binary decision as similar or not similar [3]. Word highlight some aspect of an issue. If we are to understand the role embeddings are one of the methods developed to aid in of parliamentary discourse practices, we need to explore the measuring semantic similarity. They provide vector recurring linguistic patterns and rhetorical strategies used by representations of words where vectors retain the underlying MPs that help to reveal their ideological commitments, hidden linguistic relationship between the similarities of the words. agendas, and argumentation tactics [11]. One of the ways to Word embeddings consist of two types: static and contextualized study the aforementioned linguistic patterns can be done by word embeddings. 
With static word embeddings, words always have the same representation regardless of the context in which they occur, while with contextualized word embeddings the representation depends on the context in which the word occurs, meaning that the same word in different contexts can have different representations.

FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers [5]. It is a representative of the static word embedding technique, where a vector representation is associated with each character n-gram; words are represented as the sum of these representations [2]. The fundamental problem of static word embeddings is that they generate the same embedding for the same word in different contexts, failing to capture polysemy [4].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2022, 10-14 October 2022, Ljubljana, Slovenia
© 2022 Copyright held by the owner/author(s).

Language models are contextualized word representations that aim at capturing word semantics in different contexts to address the issue of polysemy [4]. BERT, or Bidirectional Encoder Representations from Transformers, is a language model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers [6]. BERT word representations are therefore contextual.

The aim of this paper is to present the work done on comparing two methods for measuring the semantic similarity of parliamentary speech between coalition and opposition regarding the adoption of the first COVID-19 epidemic response package.

We used the same settings for the second part of the experiment (comparing sentence similarity with the four BERT-based models) with one difference.
Since all BERT-based models support a maximum input length (max_length) of 512 tokens, we decided to filter out sentences that refer explicitly to the response package (the keyword for selection being zakon). To facilitate the visualisations and balance our dataset, we randomly chose 20 sentences for each group (coalition/opposition).

3 METHODOLOGY

3.1 Dataset

The dataset contains 230 documents (speeches) from Extraordinary Session 33 from the corpora of Slovenian parliamentary debates (ParlaMint-SI) [9], covering 2014 to mid-2020, linguistically annotated and represented in the CoNLL-U format (which includes POS, lemma and NER tags). We chose an extraordinary session in a time of crisis for two reasons: firstly, regular sessions deal with multiple problems (such as MP questions), which makes a comparison between speeches difficult. Secondly, we chose only one specific theme (the adoption of the first epidemic response package), which helped the initial analysis and comparison of documents.

3.3 Experiment settings

As mentioned, BERT-based models have restrictions on the maximum length of input documents. For most, this is 512 tokens, and in the case of Sentence-BERT the restriction is even more severe (128 word tokens). Most speeches in the dataset are longer than the maximum length; this limitation did not allow us to conduct the semantic similarity measurement on full parliamentary speech. The first part of the experiment therefore focuses on sentence similarity. Of the previously described BERT-based models, three are fine-tuned for sentence similarity tasks: Sentence-LaBSE [7], LaBSE [8] and Sentence-BERT [14], while mBERT [1] is a standard BERT model. For easier comparison, we used mean pooling and cosine distance to measure the similarity.
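The mean pooling plus cosine step can be sketched with toy arrays standing in for real model outputs (with the actual BERT-based models, the token embeddings would come from a checkpoint's last hidden states; the numbers below are invented for illustration):

```python
import numpy as np

def mean_pool(token_embs, attention_mask):
    """Average token embeddings into one sentence vector, ignoring padding.

    token_embs: (seq_len, dim) array, e.g. a model's last hidden states;
    attention_mask: (seq_len,) array of 0/1 flags marking real tokens.
    """
    mask = attention_mask[:, None].astype(float)
    return (token_embs * mask).sum(axis=0) / mask.sum()

def cosine(u, v):
    """Cosine similarity between two dense sentence vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two toy "sentences": 4 token slots, 3-dimensional embeddings, zero padding.
a = mean_pool(np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 0.], [0., 0., 0.]]),
              np.array([1, 1, 0, 0]))
b = mean_pool(np.array([[1., 1., 0.], [0., 0., 0.], [0., 0., 0.], [0., 0., 0.]]),
              np.array([1, 0, 0, 0]))
print(cosine(a, b))  # pooled vectors point the same way, so similarity is 1.0
```

Masking before averaging matters: without it, padding positions would drag every sentence vector toward zero and distort the similarity scores.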
3.2 Data analysis and pre-processing

For the initial data analysis, we used the Orange data mining tool [12], which helped us with data understanding and the initial dataset pre-processing.

For full-speech measuring with fastText, we removed the speeches by the Chairperson to avoid adding noise to the dataset in the form of procedural speech that would make measuring semantic similarities almost impossible. We also removed Slovene stopwords and manually added a list of four additional stopwords: hvala, danes, lepa and beseda, which excluded the very common phrase Hvala za besedo (Eng. Thank you for the floor) and its variations. Some of the documents were missing the party_status labels (values: coalition and opposition); the 17 documents with missing values were removed from the dataset. The pre-processing gave us a total of 97 documents, presented in Table 1.

To achieve the intended scope of our initial research (researching the semantic similarity of parliamentary speech), we used the fastText-based Orange widget Document Embedding (using the mean as the aggregation method) to embed our documents and calculated cosine similarity to compare coalition and opposition parliamentary speech. With these two experiments, we can compare measuring semantic similarity with language models to the word embedding method (fastText). This comparison would be better with the Longformer language model (which can take around 1000+ word tokens as maximum input), as we could then compare methods for measuring the semantic similarity of full-text documents (speeches); however, as of the time of writing, Longformer [10] does not yet support the Slovene language.

4 RESULTS

4.1 Results of the sentence similarity measure with BERT-based models
As stated previously, we used four different BERT-based models to measure the semantic similarity of 40 sentences (20 sentences for each group, coalition and opposition) and visualized the results using heat maps (an example is shown in Figure 1). Initially, we selected well-known BERT-based models optimized for Slovene (the trilingual model CroSloEngual BERT and the monolingual model SloBERTa), but these did not produce reliable results: as shown in Table 2, CroSloEngual [15] and SloBERTa [16] produce extremely high similarity scores since, as we later discovered, they were not fine-tuned for the sentence similarity task.

Table 2: Similarity scores of language models for most similar and least similar sentences

Model           Most similar   Least similar
Sentence-LaBSE  0.6184         0.1235
LaBSE           0.7610         0.3649
mBERT           0.8930         0.5377
Sentence-BERT   0.6677         -0.0792
CroSloEngual    0.9931         0.9480
SloBERTa        0.9867         0.8899

Looking at the distribution of the speeches in the session, almost one third of the speeches belong to the coalition. Both the coalition and the opposition consist of four political parties: LMŠ, Levica, SAB and SD are part of the opposition, all of mostly left and centre-left political orientation. Similarly, the coalition consists of the DeSUS, NSi, SDS and SMC political parties, all mostly right-wing and centre-right parties. (Technically, the opposition consists of 5 political parties, but SNS (Slovenska nacionalna stranka) does not have any speeches in the dataset.)

Table 1: Preprocessed dataset

Sample       Number of documents   Total
Coalition    30 (30.93%)           97
Opposition   67 (69.07%)

4.2 Results of the document similarity with fastText

For the second part of our experiment, we used fastText word embeddings and measured cosine distance to get the semantic similarity score of our documents.
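The mean aggregation behind the fastText-based document embedding can be illustrated as follows. The word vectors here are invented toy values, not real fastText vectors, and the paper's pipeline used Orange's Document Embedding widget rather than custom code:

```python
import numpy as np

# Toy stand-ins for pretrained word vectors; in the actual pipeline these
# come from a fastText model.
word_vecs = {
    "zakon":     np.array([0.9, 0.1, 0.0]),
    "paket":     np.array([0.8, 0.2, 0.1]),
    "ukrepi":    np.array([0.7, 0.3, 0.2]),
    "pokojnine": np.array([0.1, 0.9, 0.3]),
}

def doc_embedding(tokens):
    """Mean-aggregate word vectors into one document vector (OOV words skipped)."""
    vs = [word_vecs[t] for t in tokens if t in word_vecs]
    return np.mean(vs, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

d1 = doc_embedding(["zakon", "paket"])    # law-focused "speech"
d2 = doc_embedding(["zakon", "ukrepi"])   # another law-focused "speech"
d3 = doc_embedding(["pokojnine"])         # off-topic "speech"
print(cosine(d1, d2), cosine(d1, d3))     # on-topic pair scores higher
```

Because the document vector is just a mean, this method has no input-length limit, which is exactly why it could be applied to full speeches where the BERT-based models could not.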
Figure 2 shows the visualized results comparing speeches between coalition and opposition speakers:

Figure 2: Document similarity with fastText, visualized using MDS

Documents (or speeches) are connected closely together; this can be attributed mostly to the fact that they all address the same issue, the adoption of the first epidemic response package. The most similar speeches were made by members of the political parties SDS (coalition) and SD (opposition), followed closely by SMC and Levica. All of these speeches are long and focus on the topic of the session, the proposed law (most speeches include keywords such as "zakon" (law), "zakonski paket" (law package), "amandma" (amendment) and "ukrepi" (measures)).

Figure 1: Example of a heat map using the Sentence-LaBSE model

When comparing the models, it is no surprise that Sentence-LaBSE and Sentence-BERT show very similar results (see Table 2), as they come from the same family of models and thus have a similar architecture (and are both fine-tuned for this specific task). What is interesting is that Sentence-BERT is the only model that produced a negative score for the least similar sentence (a similarity score of -0.0792), while the mBERT model showed the highest similarity scores (outside of CroSloEngual and SloBERTa). Some of the highest-scored sentences showed that speakers from different party statuses tend to use similar language patterns, for example:

Coalition: "Ob hitrem sprejemanju zakona je potrebno zagotoviti, da ne bodo spregledane posamezne ranljive skupine posameznikov."
(Eng. "With the rapid adoption of the law, it is necessary to ensure that individual vulnerable groups of individuals are not overlooked.")

Opposition: "Še enkrat, ostaja še cela vrsta ranljivih skupin v zakonu, ki je nenaslovljena."
(Eng. "Once again, there is a whole range of vulnerable groups in the law that remain unaddressed.")

Outlier detection analysis showed 8 speeches (7 made by the opposition, 1 by the coalition), which are all very short and focus solely on parliamentary procedures. We also observed some trends in the usage of words derived from the word "korona": "koronakriza", "koronazakon", "antikoronazakon", "koronaobveznica", "koronapomoči", "protikoronapaket", etc. (used mostly by the opposition).
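The MDS layouts in Figures 2-4 place documents so that map distances mirror embedding distances. A compact sketch of classical (Torgerson) MDS over a precomputed distance matrix is shown below; note that Orange's MDS uses a stress-based variant, so this is an approximation of the idea, not the exact tool:

```python
import numpy as np

def classical_mds(dist, k=2):
    """Classical (Torgerson) MDS: place n points in k dimensions so their
    pairwise Euclidean distances approximate the input distance matrix."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J          # double-centred Gram matrix
    w, v = np.linalg.eigh(B)                # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]           # keep the top-k components
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Toy check: four "documents" whose true distances come from points on a line.
coords = np.array([0.0, 1.0, 2.0, 3.0])
dist = np.abs(coords[:, None] - coords[None, :])
emb = classical_mds(dist, k=2)  # recovers the line up to rotation/reflection
```

For document similarity one would first convert cosine similarities to distances (e.g. 1 minus similarity) and feed that matrix in; clusters of similar speeches then appear as nearby points, as in the figures.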
In Figure 3, we compared speeches between the members of the opposition. The visualization shows a cluster of similar speeches. Members of Levica seemed to be the most vocal during the session (having more than 50% of all opposition speeches), while also having several similar speeches, with the central sub-topic being the proposed amendments to the law and their financial consequences. The least similar speech was made by Violeta Tomić, a member of Levica, in regard to the date the epidemic was declared.

Figure 3: Document similarity with fastText (opposition)

In Figure 4, we compared speeches between the members of the coalition: the speeches are less connected, with the most similar divided among SDS members, closely connected to the SMC, NSi and DeSUS members. The common sub-topic of all the speeches is the financial crisis as a direct result of the epidemic. Two of the most distant speeches belong to a member of DeSUS (Franc Jurša). Both speeches are among the shortest in the dataset, focusing on the topics of pensions and the registration of a parliamentary group, and are thus not explicitly connected to the central topic of the discourse.

Figure 4: Document similarity with fastText (coalition)

5 CONCLUSIONS

In this paper, we compared language models and word embeddings as methods for measuring the semantic similarity of parliamentary speech. In the initial stages, it turned out that there are not many models that support Slovene as an input language. Those that were made explicitly with Slovene in mind (such as SloBERTa and CroSloEngual BERT) were not fine-tuned for semantic similarity/sentence similarity tasks and thus do not produce accurate results. The limitation on the maximum length of input text that most BERT-based models have is probably one of the biggest disadvantages of language models for semantic similarity measures (this is being alleviated by new emerging language models, such as Longformer, that allow over 1000 tokens as the maximum input length). For the sentence similarity task, language models from the Sentence-BERT family show the highest accuracy and are easier to use than standard BERT models (such as mBERT). Even though BERT contextualizes word embeddings (and might therefore produce better results), fastText solved the problem of input-text length and, combined with the Orange data mining tool, allowed us to explore similarities between speeches as we originally intended. From the document similarity analysis, we saw that most speeches were relatively connected (similar) to one another. Speeches amongst the members of the opposition were more similar compared to the speeches amongst coalition members. There were a few outlier speeches in both the opposition and the coalition; they were all shorter speeches and less related to the original topic of the discourse. For future work, some limitations of this research should first be addressed (e.g., comparing language models to word embedding techniques on a full-text basis) and the experiments repeated with fine-tuned SloBERTa and CroSloEngual models on the full ParlaMint-SI corpora.

REFERENCES

[1] BERT multilingual base model (cased): https://huggingface.co/bert-base-multilingual-cased
[2] Bojanowski, Piotr, Grave, Edouard, Joulin, Armand and Mikolov, Tomas. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146. DOI: https://doi.org/10.1162/tacl_a_00051
[3] Chandrasekaran, Dhivya, and Vijay Mago. 2021. Evolution of Semantic Similarity - A Survey. ACM Computing Surveys, 1-37.
[4] David S. Batista. 2018. Language Models and Contextualised Word Embeddings. https://www.davidsbatista.net/blog/2018/12/06/Word_Embeddings/
[5] FastText - Library for efficient text classification and representation learning. https://fasttext.cc/
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[7] Language-agnostic BERT Sentence Encoder (LaBSE) (Sentence-Transformers): https://huggingface.co/sentence-transformers/LaBSE
[8] Language-agnostic BERT Sentence Encoder (LaBSE): https://huggingface.co/setu4993/LaBSE
[9] Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1. 2021. http://hdl.handle.net/11356/1431
[10] Longformer: https://huggingface.co/docs/transformers/model_doc/longformer
[11] Naderi, Nona, and Graeme Hirst. 2015. Argumentation mining in parliamentary discourse. In Principles and Practice of Multi-Agent Systems, 16-25. https://cmna.csc.liv.ac.uk/CMNA15/paper%209.pdf
[12] Orange: Data Mining Tool for visual programming. https://orangedatamining.com/
[13] ParlaMint: Towards Comparable Parliamentary Corpora. 2020. https://www.clarin.eu/content/parlamint-towards-comparable-parliamentary-corpora
[14] Sentence-BERT (sentence-transformers/distiluse-base-multilingual-cased-v2): https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2
[15] Ulčar, Matej and Robnik-Šikonja, Marko. 2020. CroSloEngual BERT 1.1. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1330
[16] Ulčar, Matej and Robnik-Šikonja, Marko. 2021. Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1397

Indeks avtorjev / Author index

Anastasiou Theodora: 46
Baldouski Daniil: 34
Brecelj Bor: 42
Buza Krisztian: 58
Calcina Erik: 13
Casals del Busto Ignacio: 50
Cerar Gregor: 54
Evkoski Bojan: 30
Fortuna Blaž: 42, 46
Fortuna Carolina: 54
Grobelnik Marko: 5, 9
Gucek Alenka: 50
Keizer Jelle: 46
Komarova Nadezhda: 5
Koprivec Filip: 38
Korenič Tratnik Sebastian: 26
Kralj Novak Petra: 30
Kržmanc Gregor: 38
Kuzman Taja: 17
Ljubešić Nikola: 17, 30
Massri M. Besher: 50
Meden Katja: 61
Mladenić Dunja: 9, 21, 42, 46
Mladenić Grobelnik Adrian: 9
Mocanu Iulian: 50
Mozetič Igor: 30
Mušić Din: 54
Novak Erik: 9, 13, 26
Novalija Inna: 5
Papamartzivanos Dimitrios: 46
Pita Costa Joao: 50
Rossi Maurizio: 50
Rožanec Jože Martin: 42, 46
Santos Costa João: 50
Šircelj Beno: 42
Sittar Abdul: 21
Škrjanc Maja: 38
Tošić Aleksandar: 34
Veliou Entso: 46
Webber Jason: 21
Zevnik Filip: 54