INFORMACIJSKA DRUŽBA
Zbornik 26. mednarodne multikonference
Zvezek C

INFORMATION SOCIETY
Proceedings of the 26th International Multiconference
Volume C

Odkrivanje znanja in podatkovna skladišča – SiKDD
Data Mining and Data Warehouses – SiKDD

Urednika / Editors: Dunja Mladenić, Marko Grobelnik

http://is.ijs.si
9. oktober 2023 / 9 October 2023
Ljubljana, Slovenia

Urednika / Editors:
Dunja Mladenić, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana

Založnik / Publisher: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika / Proceedings preparation: Mitja Lasič, Vesna Lasič, Mateja Mavrič
Oblikovanje naslovnice / Cover design: Vesna Lasič

Dostop do e-publikacije / Access to the e-publication:
http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2023

Informacijska družba
ISSN 2630-371X
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 170733315
ISBN 978-961-264-276-1 (PDF)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2023

Šestindvajseta multikonferenca Informacijska družba se odvija v obdobju izjemnega razvoja za umetno inteligenco, računalništvo in informatiko, za celotno informacijsko družbo. Generativna umetna inteligenca je s programi, kot je ChatGPT, dosegla izjemen napredek na poti k superinteligenci, k singularnosti in razcvetu človeške civilizacije. Uresničujejo se napovedi strokovnjakov, da bodo omenjena področja ključna za obstoj in razvoj človeštva, zato moramo pozornost usmeriti nanje, jih hitro uvesti v osnovno in srednje šolstvo ter v vsakdan posameznika in skupnosti.

Po drugi strani se poleg lažnih novic pojavljajo tudi lažne enciklopedije, lažne znanosti ter »ploščate Zemlje«, nadaljuje se zapostavljanje znanstvenih spoznanj in metod ter zmanjševanje človekovih pravic in družbenih vrednot. Na vseh nas je, da izzive današnjice primerno obravnavamo, predvsem pa pomagamo pri uvajanju znanstvenih spoznanj in razčiščevanju zmot. Ena pogosto omenjenih v zadnjem letu je eksistencialna nevarnost umetne inteligence, ki naj bi ogrožala človeštvo tako kot jedrske vojne. Hkrati pa nihče ne poda vsaj za silo smiselnega scenarija, kako naj bi se to zgodilo – recimo, kako naj bi 100x pametnejši GPT ogrozil ljudi.

Letošnja konferenca poleg čisto tehnoloških izpostavlja pomembne integralne teme, kot so okolje, zdravstvo, politika depopulacije, ter rešitve, ki jih za skoraj vse probleme prinaša umetna inteligenca. V takšnem okolju sta ključnega pomena poglobljena analiza in diskurz, ki lahko oblikujeta najboljše pristope k upravljanju in izkoriščanju tehnologij. Imamo veliko srečo, da gostimo vrsto izjemnih mislecev, znanstvenikov in strokovnjakov, ki skupaj v delovnem in akademsko odprtem okolju prinašajo bogastvo znanja in dialoga. Verjamemo, da sta njihova prisotnost in udeležba ključni za oblikovanje bolj inkluzivne, varne in trajnostne informacijske družbe. Za razcvet.

Letos smo v multikonferenco povezali deset odličnih neodvisnih konferenc, med njimi »Legende računalništva«, s katero postavljamo nov mehanizem promocije informacijske družbe.
IS 2023 zajema okoli 160 predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic, skupaj pa se je konference udeležilo okrog 500 udeležencev. Prireditev so spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica (http://www.informatica.si/), ki se ponaša s 46-letno tradicijo odlične znanstvene revije.

Multikonferenco Informacijska družba 2023 sestavljajo naslednje samostojne konference:
• Odkrivanje znanja in podatkovna skladišča
• Demografske in družinske analize
• Legende računalništva in informatike
• Konferenca o zdravi dolgoživosti
• Miti in resnice o varovanju okolja
• Mednarodna konferenca o prenosu tehnologij
• Digitalna vključenost v informacijski družbi – DIGIN 2023
• Slovenska konferenca o umetni inteligenci + DATASCIENCE
• Kognitivna znanost
• Vzgoja in izobraževanje v informacijski družbi
• Zaključna svečana prireditev konference

Soorganizatorji in podporniki konference so različne raziskovalne institucije in združenja, med njimi ACM Slovenija, SLAIS za umetno inteligenco, DKZ za kognitivno znanost in Inženirska akademija Slovenije (IAS). V imenu organizatorjev konference se zahvaljujemo združenjem in institucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi. Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju.

S podelitvijo nagrad, še posebej z nagrado Michie-Turing, se avtonomna stroka s področja opredeli do najbolj izstopajočih dosežkov. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe je prejel prof. dr. Andrej Brodnik. Priznanje za dosežek leta pripada Benjaminu Bajdu za zlato medaljo na računalniški olimpijadi. »Informacijsko limono« za najmanj primerno informacijsko tematiko je prejela nekompatibilnost zdravstvenih sistemov v Sloveniji, »informacijsko jagodo« kot najboljšo potezo pa dobi ekipa RTV za portal dostopno.si. Čestitke nagrajencem!

Mojca Ciglarič, predsednica programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD – INFORMATION SOCIETY 2023

The twenty-sixth Information Society multi-conference is taking place during a period of exceptional development for artificial intelligence, computing, and informatics, encompassing the entire information society. With programs like ChatGPT, generative artificial intelligence has made significant progress towards superintelligence, towards singularity, and the flourishing of human civilization. Experts' predictions are coming true, asserting that the mentioned fields are crucial for humanity's existence and development. Hence, we must direct our attention to them, swiftly integrating them into primary and secondary education and into the daily lives of individuals and communities.

On the other hand, alongside fake news, we witness the emergence of false encyclopaedias, pseudo-sciences, and flat-Earth theories, along with the continuing neglect of scientific insights and methods, and the diminishing of human rights and societal values. It is upon all of us to appropriately address today's challenges, mainly by assisting in the introduction of scientific knowledge and clearing up misconceptions. A frequently mentioned concern over the past year is the existential threat posed by artificial intelligence, supposedly endangering humanity as nuclear wars do.
Yet, nobody provides a reasonably coherent scenario of how this might happen – say, how a 100x smarter GPT could endanger people. This year's conference, besides purely technological aspects, highlights important integral themes like the environment, healthcare, depopulation policies, and the solutions that artificial intelligence brings to almost all problems. In such an environment, in-depth analysis and discourse are crucial, as they shape the best approaches to managing and exploiting technologies. We are fortunate to host a series of exceptional thinkers, scientists, and experts who bring a wealth of knowledge and dialogue in a collaborative and academically open environment. We believe their presence and participation are key to shaping a more inclusive, safe, and sustainable information society. For flourishing.

This year, we connected ten excellent independent conferences into the multi-conference, including "Legends of Computing", which introduces a new mechanism for promoting the information society. IS 2023 encompasses around 160 presentations, abstracts, and papers within standalone conferences and workshops. In total, about 500 participants attended the conference. The event was accompanied by panel discussions, debates, and special events such as the award ceremony. Selected contributions will also be published in a special issue of the journal Informatica (http://www.informatica.si/), which boasts a 46-year tradition as an excellent scientific journal.

The Information Society 2023 multi-conference consists of the following independent conferences:
• Data Mining and Data Warehouses – SiKDD
• Demographic and Family Analysis
• Legends of Computing and Informatics
• Healthy Longevity Conference
• Myths and Truths about Environmental Protection
• International Conference on Technology Transfer
• Digital Inclusion in the Information Society – DIGIN 2023
• Slovenian Conference on Artificial Intelligence + DATASCIENCE
• Cognitive Science
• Education and Training in the Information Society
• Closing Conference Ceremony

Co-organizers and supporters of the conference include various research institutions and associations, among them ACM Slovenia, SLAIS for Artificial Intelligence, DKZ for Cognitive Science, and the Engineering Academy of Slovenia (IAS). On behalf of the conference organizers, we thank the associations and institutions, and especially the participants, for their valuable contributions and the opportunity to share their experiences about the information society with us. We also thank the reviewers for their assistance in reviewing.

With the awarding of prizes, especially the Michie-Turing Award, the autonomous profession of the field recognizes the most outstanding achievements. Prof. Dr. Andrej Brodnik received the Michie-Turing Award for his exceptional lifetime contribution to the development and promotion of the information society. The Achievement of the Year award goes to Benjamin Bajd, gold medal winner at the Computer Olympiad. The "Information Lemon" for the least appropriate information move was awarded to the incompatibility of information systems in Slovenian healthcare, while the "Information Strawberry" for the best move goes to the RTV SLO team for the portal dostopno.si. Congratulations to the winners!
Mojca Ciglarič, Chair of the Programme Committee
Matjaž Gams, Chair of the Organizing Committee

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee:
Vladimir Bajic, South Africa; Heiner Benking, Germany; Se Woo Cheon, South Korea; Howie Firth, UK; Olga Fomichova, Russia; Vladimir Fomichov, Russia; Vesna Hljuz Dobric, Croatia; Alfred Inselberg, Israel; Jay Liebowitz, USA; Huan Liu, Singapore; Henz Martin, Germany; Marcin Paprzycki, USA; Claude Sammut, Australia; Jiri Wiedermann, Czech Republic; Xindong Wu, USA; Yiming Ye, USA; Ning Zhong, USA; Wray Buntine, Australia; Bezalel Gavish, USA; Gal A. Kaminka, Israel; Mike Bain, Australia; Michela Milano, Italy; Derong Liu, Chicago, USA; Toby Walsh, Australia; Sergio Campos-Cordobes, Spain; Shabnam Farahmand, Finland; Sergio Crovella, Italy

Organizing Committee:
Matjaž Gams, chair; Mitja Luštrek; Lana Zemljak; Vesna Koricki; Mitja Lasič; Blaž Mahnič; Mateja Mavrič

Programme Committee:
Mojca Ciglarič, chair; Marjan Heričko; Baldomir Zajc; Bojan Orel; Borka Jerman Blažič Džonova; Blaž Zupan; Franc Solina; Gorazd Kandus; Boris Žemva; Viljan Mahnič; Urban Kordeš; Leon Žlajpah; Cene Bavec; Marjan Krisper; Niko Zimic; Tomaž Kalin; Andrej Kuščer; Rok Piltaver; Jozsef Györkös; Jadran Lenarčič; Toma Strle; Tadej Bajd; Borut Likar; Tine Kolenik; Jaroslav Berce; Janez Malačič; Franci Pivec; Mojca Bernik; Olga Markič; Uroš Rajkovič; Marko Bohanec; Dunja Mladenič; Borut Batagelj; Ivan Bratko; Franc Novak; Tomaž Ogrin; Andrej Brodnik; Vladislav Rajkovič; Aleš Ude; Dušan Caf; Grega Repovš; Bojan Blažica; Saša Divjak; Ivan Rozman; Matjaž Kljun; Tomaž Erjavec; Niko Schlamberger; Robert Blatnik; Bogdan Filipič; Stanko Strmčnik; Erik Dovgan; Andrej Gams; Jurij Šilc; Špela Stres; Matjaž Gams; Jurij Tasič; Anton Gradišek; Mitja Luštrek; Denis Trček; Marko Grobelnik; Andrej Ule; Nikola Guid; Boštjan Vilfan

KAZALO / TABLE OF CONTENTS

Odkrivanje znanja in podatkovna skladišča – SiKDD / Data Mining and Data Warehouses – SiKDD
PREDGOVOR / FOREWORD
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES
Forecasting Trends in Technological Innovations with Distortion-Aware Convolutional Neural Networks / Buza Krisztian, Massri M. Besher, Grobelnik Marko
Building A Causality Graph For Strategic Foresight / Rožanec Jože M., Šircelj Beno, Nemec Peter, Leban Gregor, Mladenić Dunja
Towards Testing the Significance of Branching Points and Cycles in Mapper Graphs / Zajec Patrik, Škraba Primož, Mladenić Dunja
Highlighting Embeddings' Features Relevance Attribution on Activation Maps / Rožanec Jože M., Koehorst Erik, Mladenić Dunja
An approach to creating a time-series dataset for news propagation: Ukraine-war case study / Sittar Abdul, Mladenić Dunja
Predicting Horse Fearfulness Applying Supervised Machine Learning Methods / Topal Oleksandra, Novalija Inna, Gobbo Elena, Zupan Šemrov Manja, Mladenić Dunja
Emergent Behaviors from LLM-Agent Simulations / Mladenić Grobelnik Adrian, Zaman Faizon, Espigule-Pons Jofre, Grobelnik Marko
Compared to Us, They Are …: An Exploration of Social Biases in English and Italian Language Models Using Prompting and Sentiment Analysis / Caporusso Jaya, Pollak Senja, Purver Matthew
Towards Cognitive Digital Twin of a Country with Emergency, Hydrological, and Meteorological Data / Šturm Jan, Škrjanc Maja, Stopar Luka, Volčjak Domen, Mladenić Dunja, Grobelnik Marko
Predicting Bus Arrival Times Based on Positional Data / Kladnik Matic, Bradeško Luka, Mladenić Dunja
Structure Based Molecular Fingerprint Prediction through Spec2Vec Embedding of GC-EI-MS Spectra / Piciga Aleksander, Ljoncheva Milka, Kosjek Tina, Džeroski Sašo
A meaty discussion: quantitative analysis of the Slovenian meat-related news corpus / Martinc Matej, Pollak Senja, Vezovnik Andreja
Slovene Word Sense Disambiguation using Transfer Learning / Fijavž Zoran, Robnik-Šikonja Marko
Predicting the FTSO consensus price / Koprivec Filip, Eržen Tjaž, Mežnar Urban
On Neural Filter Selection for ON/OFF Classification of Home Appliances / Pirnat Anže, Fortuna Carolina
Indeks avtorjev / Author index

PREDGOVOR

Tehnologije, ki se ukvarjajo s podatki, so v devetdesetih letih močno napredovale. Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in učinkovit dostop do njih, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami, prišlo je do standardizacije procesov, povpraševalnih jezikov itd. Ko shranjevanje podatkov ni bilo več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskemu procesiranju, ampak tudi analitskim vpogledom v podatke – pojavilo se je t. i. skladiščenje podatkov (data warehousing), ki je postalo standarden del informacijskih sistemov v podjetjih. Paradigma OLAP (On-Line Analytical Processing) zahteva od uporabnika, da še vedno sam postavlja sistemu vprašanja, dobiva nanje odgovore ter na vizualen način preverja in išče izstopajoče situacije. Ker seveda to ni vedno mogoče, se je pojavila potreba po avtomatski analizi podatkov oz.
z drugimi besedami po tem, da sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj, zajetih v podatkih. Slovenska KDD konferenca pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve.

Dunja Mladenić
Marko Grobelnik

FOREWORD

Data-driven technologies have progressed significantly since the mid-1990s. The first phase focused mainly on storing and efficiently accessing the data; it resulted in the development of industry tools for managing large databases, related standards, supporting query languages, etc. After this initial period, when data storage was no longer a primary problem, development progressed towards analytical functionality for extracting added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data. At this point, data warehousing with On-Line Analytical Processing (OLAP) entered as a usual part of a company's information system portfolio, requiring the user to pose well-defined questions about aggregated views of the data. Data mining is a technology developed after the year 2000 that offers automatic data analysis, trying to obtain new discoveries from the existing data and enabling the user new insights into the data. In this respect, the Slovenian KDD conference (SiKDD) covers a broad area including statistical data analysis; data, text and multimedia mining; semantic technologies; link detection and link analysis; social network analysis; and data warehouses.

Dunja Mladenić
Marko Grobelnik

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Janez Brank, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Jožef Stefan Institute, Ljubljana
Branko Kavšek, University of Primorska, Koper
Besher M. Massri, Jožef Stefan Institute, Ljubljana
Dunja Mladenić, Jožef Stefan Institute, Ljubljana
Erik Novak, Jožef Stefan Institute, Ljubljana
Inna Novalija, Jožef Stefan Institute, Ljubljana
Jože Rožanec, Qlector, Ljubljana
Abdul Sittar, Jožef Stefan Institute, Ljubljana
Luka Stopar, Sportradar, Ljubljana
Jan Šturm, Jožef Stefan Institute, Ljubljana

Forecasting Trends in Technological Innovations with Distortion-Aware Convolutional Neural Networks

Krisztian Buza, M. Besher Massri, Marko Grobelnik
{krisztian.antal.buza,besher.massri,marko.grobelnik}@ijs.si
Artificial Intelligence Laboratory, Institute Jozef Stefan, Ljubljana, Slovenia

ABSTRACT
Predicting trends in technological innovations holds critical importance for policymakers, investors, and other stakeholders within the innovation ecosystem. This study approaches this challenge by framing it as a time series prediction task. Recent efforts have introduced diverse solutions utilizing convolutional neural networks, including distortion-aware convolutional neural networks. While convolutional layers act as local pattern detectors, conventional convolution matches local patterns in a rigid manner, in the sense that it does not account for local shifts and elongations, whereas distortion-aware convolution incorporates the capability to identify local patterns with flexibility, accommodating local shifts and elongations. The resulting convolutional neural network, with distortion-aware convolution, has exhibited superior performance compared to standard convolutional networks in multiple time series prediction tasks. As a result, we advocate for the application of distortion-aware convolutional networks in forecasting technological innovation trends and compare their performance with conventional convolutional neural networks.

CCS CONCEPTS
• Computing methodologies → Neural networks.

KEYWORDS
trends, innovation ecosystem, time series forecasting, convolutional neural networks, distortion-aware convolution
1 INTRODUCTION

Forecasting trends in technological innovations is of high value for policy makers, investors and other actors of the innovation ecosystem. In this paper, we cast this task as a time series forecasting problem.

Approaches for time series forecasting range from the well-known autoregressive models [4] over exponential smoothing [12] to solutions based on deep learning [10, 11, 16–19, 24, 26]. Among the numerous techniques, a prominent family of methods includes forecasting with convolutional neural networks (CNNs) [3, 20].

The inherent assumption behind CNNs is that local patterns are characteristic of time series and future values of the time series may be predicted based on those local patterns. While the operation of convolution plays the role of a local pattern detector, it matches patterns in a rigid manner, as it does not allow for local shifts and elongations within the patterns. This issue has been addressed by distortion-aware convolution, and the resulting convolutional neural network has been shown to outperform conventional convolutional networks on several time series forecasting tasks [6].

For the aforementioned reasons, in this paper we propose to use distortion-aware convolutional networks for forecasting trends in technological innovations. We perform experiments on real-world time series of the number of patents related to selected topics. We compare the performance of distortion-aware convolutional networks with conventional convolutional neural networks.

The remainder of the paper is organized as follows. In Section 2, we provide a short discussion of related work. We review distortion-aware convolutional networks in Section 3, followed by the experimental results in Section 4. Finally, we conclude in Section 5.

2 RELATED WORK

As we cast our problem as a time series forecasting task, we focus our review of related work on time series forecasting. As mentioned previously, a prominent family of methods includes forecasting techniques based on convolutional neural networks; recent surveys about them have been presented by Lim et al. [17], Sezer et al. [21] and Torres et al. [24].

An essential component of distortion-aware convolution is dynamic time warping (DTW). While DTW is one of the most successful distance measures in the time series domain, see e.g. [25], recent approaches integrate it with neural networks. For example, Iwana et al. [14], Cai et al. [9] and Buza [5] used DTW to construct features. In contrast, Afrasiabi et al. [1] used neural networks to extract features and used DTW to compare the resulting sequences. Shulman [22] proposed "an approach similar to DTW" to allow for flexible matching in case of the dot product. DTW-NN [13] considered neural networks and replaced "the standard inner product of a node with DTW as a kernel-like method". However, DTW-NN only considered multilayer perceptrons (MLPs), whereas we focus on convolutional networks. In the context of time series classification, Buza and Antal proposed to replace the dot product in the convolution operation by DTW calculations [7]. In distortion-aware convolution [6], DTW is used together with the dot product, but the dot product itself is not modified.
3 BACKGROUND

We begin this section with a formal definition of our task, followed by a review of convolutional neural networks with distortion-aware convolution [6].

3.1 Problem Formulation

Given an observed time series $x = (x_1, \ldots, x_l)$ of length $l$, where in our case each $x_i$ represents the number of patents related to a given topic in a month, we aim at predicting its subsequent $h$ values $y = (x_{l+1}, \ldots, x_{l+h})$, i.e., the number of patents in the subsequent $h$ months. We say that $h$ is the forecast horizon and $y$ is the target. Furthermore, we assume that a dataset $D$ is given which contains $n$ time series with the corresponding targets:

$$D = \{ (x^{(i)}, y^{(i)}) \}_{i=1}^{n}. \quad (1)$$

We use $D$ to train neural networks for the aforementioned prediction task. We say that $x^{(i)}$ is the input of the neural network. In our experiments, we assume that an independent dataset $D^*$ is given which can be used to evaluate the predictions of our model. Similarly to $D$, dataset $D^*$ contains pairs of input and target time series. $D^*$ is called the test set.

3.2 The Distortion-aware Convolutional Block

The main idea behind distortion-aware convolution [6] is to calculate, besides the dot products (or inner products), DTW distances between the kernel and time series segments as well. This is illustrated in Fig. 1. Our distortion-aware convolutional block has two output channels: one for dot products and another channel for the DTW distances.

While in case of the dot product, higher similarity between the time series segment and the pattern corresponds to higher values, the opposite is true for the DTW distances: in case of DTW, high similarity between the time series segment and the pattern is reflected by a distance close to zero. Therefore, to make sure that the activations on both channels are consistent, the activations of the DTW channel of our distortion-aware convolutional block are calculated as follows:

$$out_{DTW}(t) = \frac{1}{1 + DTW(in[t : t+s], w)}, \quad (2)$$

where $out_{DTW}$ denotes the activation of the DTW channel of the distortion-aware convolutional block, $in[t : t+s]$ is the segment of the block's input between the $t$-th and $(t+s)$-th position (we use a Python-like syntax: the lower index $t$ is inclusive, the upper index $t+s$ is exclusive), $s$ is the size of the filter, $w$ are the weights of the filter representing a local pattern, and $DTW(\cdot, \cdot)$ is a function that calculates the DTW distance between two time series segments.

Figure 1: In case of distortion-aware convolution, additionally to the dot product (top), DTW distances between the kernel and time series segments are calculated (bottom). Thus, our distortion-aware convolutional block has two output channels: one for dot products and another channel for the DTW distances scaled according to Eq. (2).

Training neural networks with distortion-aware convolution may be challenging because of the backpropagation of gradients through the DTW calculations. The basic idea of training is to train the network with conventional convolution instead of distortion-aware convolution initially, and to add the DTW computations once the weights of the convolutional layer have already been determined. For details, see [6].
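To make Eq. (2) concrete, the following sketch computes both channels of the block for a single filter in plain NumPy. It is an illustration only; the function names are ours, and the authors' released implementation (https://github.com/kr7/dcnn-forecast) should be consulted for the actual layer.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Classic dynamic-programming DTW between two 1-D sequences.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

def distortion_aware_channels(x: np.ndarray, w: np.ndarray):
    # For every segment in[t : t+s], emit the dot product (conventional
    # channel) and the DTW activation 1 / (1 + DTW(in[t:t+s], w)) of Eq. (2).
    s = len(w)
    dot_out, dtw_out = [], []
    for t in range(len(x) - s + 1):
        segment = x[t:t + s]
        dot_out.append(float(segment @ w))
        dtw_out.append(1.0 / (1.0 + dtw_distance(segment, w)))
    return np.array(dot_out), np.array(dtw_out)
```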
4 EXPERIMENTAL EVALUATION

The goal of our experiments is to examine whether neural networks with distortion-aware convolution are more suitable for forecasting technological trends than their counterparts with conventional convolution.

4.1 Data

Lens (http://lens.org) is a web-based service that offers global access to patent information, academic articles, regulatory databases, and additional relevant materials. The platform is designed to simplify the exploration and evaluation of intellectual property information while promoting research and inventive activities. Lens grants complimentary access to patent databases from more than 100 nations and includes sophisticated search functionalities and analytical tools for diverse research and analysis needs.

We extracted time series from the Lens patent database as follows. For selected topics identified by their Cooperative Patent Classification (CPC) codes, we extracted the number of granted patents as well as the number of patent applications per month between January 1980 and December 2022. We considered the following topics: (a) "image or video recognition" (G06V), (b) "neural networks" (G06N3/02), (c) "natural language processing" (G06F40) and (d) all topics related to artificial intelligence. We considered the number of patents separately for the most significant jurisdictions, i.e., (a) United States of America, (b) China, (c) Korea, (d) Japan and (e) Europe. Additionally, we considered the time series of the total number of patents for all the jurisdictions of the database. Thus, we considered 48 time series in total, see also the first two columns of Tab. 1 and Tab. 2. Two example time series are shown in Fig. 2.

Figure 2: Total number of granted patents (red) and patent applications (blue) for all the jurisdictions in the Lens database related to "neural networks" (CPC: G06N3/02).

For each time series, we trained the neural networks to predict the number of granted patents (or patent applications, respectively) for each month of a 6-month period, i.e., the forecast horizon was h = 6. As input, we used the number of granted patents (or patent applications, respectively) in the previous 36 months. The data related to the years 1980–2019 was used as training data, while the data from 2019–2022 was used as test data. From the long time series corresponding to the years 1980–2019, we extracted training instances with a moving window. This resulted in 10496 training instances in total, which corresponds to 427 training instances for each time series. When evaluating the network on the test data, we used the data from 2019–2021 as input, and the task was to predict the number of granted patents (or patent applications, respectively) for the first six months of 2022.
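As an illustration of the moving-window extraction described above, the sketch below slices one long monthly series into (input, target) pairs; the window step of one month is our assumption, as the paper does not spell out this detail.

```python
import numpy as np

def sliding_windows(series: np.ndarray, input_len: int = 36, horizon: int = 6):
    # Slice a long monthly series into (input, target) pairs with a
    # moving window: 36 months of history, 6 months to forecast.
    X, y = [], []
    for start in range(len(series) - input_len - horizon + 1):
        X.append(series[start:start + input_len])
        y.append(series[start + input_len:start + input_len + horizon])
    return np.stack(X), np.stack(y)
```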
Table 1: Mean absolute error (MAE) and root mean squared error (RMSE) for forecasting the time series of granted patents in case of our approach (DCNN) and the baseline (CNN). Lower values indicate better performance.

topic | jurisdiction | RMSE CNN | RMSE DCNN | MAE CNN | MAE DCNN
image or video recognition | US | 165.9 | 106.0 | 131.2 | 92.7
image or video recognition | China | 405.8 | 320.9 | 323.87 | 217.6
image or video recognition | Korea | 13.9 | 27.7 | 12.4 | 19.9
image or video recognition | Japan | 55.9 | 49.8 | 39.9 | 37.8
image or video recognition | Europe | 34.5 | 34.7 | 32.3 | 32.9
image or video recognition | ALL | 494.7 | 399.6 | 416.8 | 341.3
neural networks | US | 10.7 | 9.1 | 9.4 | 7.9
neural networks | China | 5.6 | 5.5 | 3.8 | 3.7
neural networks | Korea | 6.3 | 2.3 | 5.4 | 2.1
neural networks | Japan | 3.5 | 2.9 | 2.5 | 2.0
neural networks | Europe | 2.7 | 1.6 | 2.2 | 1.2
neural networks | ALL | 7.6 | 8.3 | 6.3 | 6.7
natural language processing | US | 19.7 | 15.1 | 14.8 | 12.0
natural language processing | China | 57.1 | 47.0 | 41.6 | 41.7
natural language processing | Korea | 14.2 | 8.5 | 13.1 | 7.3
natural language processing | Japan | 11.8 | 10.7 | 9.5 | 7.3
natural language processing | Europe | 4.8 | 3.0 | 3.5 | 2.7
natural language processing | ALL | 67.0 | 45.7 | 59.5 | 35.5
ALL | US | 270.2 | 216.9 | 224.1 | 196.4
ALL | China | 870.2 | 1108.8 | 763.2 | 998.1
ALL | Korea | 56.6 | 138.3 | 53.8 | 129.4
ALL | Japan | 124.8 | 132.0 | 81.4 | 89.9
ALL | Europe | 85.8 | 69.2 | 82.1 | 65.9
ALL | ALL | 1045.1 | 1129.1 | 929.2 | 964.6

Table 2: Mean absolute error (MAE) and root mean squared error (RMSE) for forecasting the time series of patent applications in case of our approach (DCNN) and the baseline (CNN). Lower values indicate better performance.

topic | jurisdiction | RMSE CNN | RMSE DCNN | MAE CNN | MAE DCNN
image or video recognition | US | 188.2 | 177.1 | 170.2 | 163.3
image or video recognition | China | 3405.0 | 1061.7 | 3375.4 | 1042.3
image or video recognition | Korea | 128.9 | 70.8 | 99.7 | 69.4
image or video recognition | Japan | 103.8 | 106.4 | 87.1 | 66.1
image or video recognition | Europe | 51.9 | 55.5 | 45.0 | 49.4
image or video recognition | ALL | 3641.9 | 2110.5 | 3627.3 | 2027.8
neural networks | US | 79.8 | 15.3 | 76.9 | 12.7
neural networks | China | 21.2 | 20.8 | 16.8 | 19.0
neural networks | Korea | 44.6 | 6.8 | 43.7 | 6.2
neural networks | Japan | 13.9 | 7.1 | 13.5 | 4.8
neural networks | Europe | 15.8 | 5.9 | 14.9 | 4.4
neural networks | ALL | 267.7 | 45.6 | 262.7 | 38.6
natural language processing | US | 64.1 | 68.7 | 55.5 | 64.6
natural language processing | China | 418.9 | 318.2 | 363.6 | 289.3
natural language processing | Korea | 35.1 | 23.4 | 29.7 | 21.0
natural language processing | Japan | 16.7 | 18.7 | 10.5 | 10.8
natural language processing | Europe | 11.2 | 14.3 | 9.7 | 11.2
natural language processing | ALL | 298.1 | 543.0 | 226.9 | 489.3
ALL | US | 532.3 | 329.1 | 458.9 | 311.3
ALL | China | 6443.7 | 2784.2 | 6239.0 | 2386.5
ALL | Korea | 405.4 | 216.8 | 340.2 | 180.8
ALL | Japan | 224.8 | 228.1 | 159.1 | 128.6
ALL | Europe | 130.0 | 163.5 | 97.5 | 121.3
ALL | ALL | 5445.1 | 3355.8 | 5009.0 | 2547.0

4.2 Experimental Settings

In order to assess the contribution of distortion-aware convolution, for each time series, we trained two versions of the neural network, with and without distortion-aware convolution, and compared the results. In the former case, the first hidden layer was a distortion-aware convolutional layer (with both dot product and DTW calculations), whereas in the latter case, we used conventional convolution (with dot product only).

For simplicity, we considered a convolutional network containing a single convolutional layer with 25 filters, followed by a max pooling layer with window size of 2, and a fully connected layer with 100 units. We set the size of the convolutional filters to 9. The number of units in the output layer corresponds to the forecast horizon, as each unit is expected to predict one of the numeric values of the target time series.
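The baseline architecture just described can be sketched in PyTorch as follows. The choice of ReLU activations is our assumption (the paper does not state the activation function), and the distortion-aware variant would replace the first layer with the two-channel block of Section 3.2.

```python
import torch
import torch.nn as nn

class CNNForecaster(nn.Module):
    # Baseline of Section 4.2: one convolutional layer with 25 filters of
    # size 9, max pooling with window 2, a fully connected layer with 100
    # units, and h output units (forecast horizon h = 6).
    def __init__(self, input_len: int = 36, horizon: int = 6):
        super().__init__()
        self.conv = nn.Conv1d(1, 25, kernel_size=9)
        self.pool = nn.MaxPool1d(2)
        conv_out = (input_len - 9 + 1) // 2            # 14 for input_len = 36
        self.fc = nn.Linear(25 * conv_out, 100)
        self.out = nn.Linear(100, horizon)

    def forward(self, x):                              # x: (batch, 1, input_len)
        z = torch.relu(self.pool(self.conv(x)))
        z = torch.relu(self.fc(z.flatten(1)))
        return self.out(z)

model = CNNForecaster()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # settings from Sec. 4.2
loss_fn = nn.MSELoss()
```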
We trained the networks for 1000 epochs with the Adam optimizer [15], with a learning rate of 10⁻⁵ and a batch size of 16. The loss function was mean squared error.

We implemented our neural networks in Python using the PyTorch framework. In order to support reproduction of our work, we made the implementation of our model publicly available in a GitHub repository (https://github.com/kr7/dcnn-forecast). The code illustrates training and evaluation of our model on standard benchmark datasets.

We evaluated the predicted time series both in terms of mean absolute error (MAE) and root mean squared error (RMSE). In particular, we calculated MAE (and RMSE, respectively) for each forecast time series.

As the goal of our experiments is to assess the contribution of distortion-aware convolution, our baseline, denoted as CNN, is the aforementioned neural network with conventional convolution instead of distortion-aware convolution.
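For completeness, the two evaluation measures reduce to a few lines of NumPy; both are computed per forecast time series, as in Tables 1 and 2.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Mean absolute error over the h predicted values.
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Root mean squared error over the h predicted values.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```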
4.3 Results

Tab. 1 and Tab. 2 show our results in terms of MAE and RMSE. Our approach, the convolutional neural network with distortion-aware convolution, is denoted by DCNN, while CNN denotes the neural network with conventional convolution. As one can see, in the majority of the examined cases, DCNN outperforms CNN both in terms of MAE and RMSE. In those cases when CNN performs better, typically, both models are rather accurate (the error is low for both models) or the difference is very small compared to the magnitude of the error.

5 CONCLUSIONS AND OUTLOOK

In this paper, we focused on forecasting technological trends and cast this task as a time series forecasting problem. We considered a recent approach, convolutional neural networks with distortion-aware convolution, which has not been used for this task previously. We performed experiments on real-world time series representing the number of granted patents and patent applications related to selected topics. Our observations show that convolutional neural networks with distortion-aware convolution are promising for this task. Furthermore, the combination of conventional convolutional networks and neural networks with distortion-aware convolution may be an interesting target of future works.

Last, but not least, we mention that time series are prominent in various real-world applications [2, 23], and our approach can be extended to handle other types of time series, such as multivariate time series (or series of vectors) that can be compared with a more general version of DTW, see e.g. [8].

ACKNOWLEDGMENTS

This work was supported by the European Union through the enRichMyData EU HE project under grant agreement No 101070284.

REFERENCES

[1] Mahlagha Afrasiabi, Muharram Mansoorizadeh, et al. 2019. DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features. The Visual Computer (2019), 1–13.
[2] Margit Antal and László Zsolt Szabó. 2016. On-line verification of finger drawn signatures. In 11th International Symposium on Applied Computational Intelligence and Informatics. IEEE, 419–424.
[3] Anastasia Borovykh, Sander Bohte, and Cornelis W Oosterlee. 2017. Conditional time series forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691 (2017).
[4] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. 2015. Time series analysis: forecasting and control. John Wiley & Sons.
[5] Krisztian Buza. 2020. Asterics: Projection-based classification of EEG with asymmetric loss linear regression and genetic algorithm. In 14th International Symposium on Applied Computational Intelligence and Informatics. IEEE, 000035–000040.
[6] Krisztian Buza. 2023. Time Series Forecasting with Distortion-Aware Convolutional Neural Networks. In 9th SIGKDD International Workshop on Mining and Learning from Time Series.
[7] Krisztian Buza and Margit Antal. 2021. Convolutional neural networks with dynamic convolution for time series classification. In International Conference on Computational Collective Intelligence. Springer, 304–312.
[8] Krisztian Antal Buza. 2011. Fusion methods for time-series classification. PhD thesis, University of Hildesheim (2011).
[9] Xingyu Cai, Tingyang Xu, Jinfeng Yi, Junzhou Huang, and Sanguthevar Rajasekaran. 2019. DTWNet: a dynamic time warping network. Advances in Neural Information Processing Systems 32 (2019).
[10] Zhengping Che, Sanjay Purushotham, Guangyu Li, Bo Jiang, and Yan Liu. 2018. Hierarchical deep generative models for multi-rate multivariate time series. In International Conference on Machine Learning. PMLR, 784–793.
[11] Marco Cuturi and Mathieu Blondel. 2017. Soft-DTW: a differentiable loss function for time-series. In International Conference on Machine Learning. PMLR, 894–903.
[12] Everette S Gardner Jr. 2006. Exponential smoothing: The state of the art—Part II. International Journal of Forecasting 22, 4 (2006), 637–666.
[13] Brian Kenji Iwana, Volkmar Frinken, and Seiichi Uchida. 2020. DTW-NN: A novel neural network for time series recognition using dynamic alignment between inputs and weights. Knowledge-Based Systems 188 (2020), 104971.
[14] Brian Kenji Iwana and Seiichi Uchida. 2020. Time series classification using local distance-based features in multi-modal fusion networks. Pattern Recognition 97 (2020), 107024.
[15] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[16] Vincent Le Guen and Nicolas Thome. 2019. Shape and time distortion loss for training deep time series forecasting models. Advances in Neural Information Processing Systems 32 (2019).
[17] Bryan Lim and Stefan Zohren. 2021. Time-series forecasting with deep learning: a survey. Philosophical Transactions of the Royal Society A 379, 2194 (2021), 20200209.
[18] Linbo Liu, Youngsuk Park, Trong Nghia Hoang, Hilaf Hasson, and Luke Huan. 2023. Robust Multivariate Time-Series Forecasting: Adversarial Attacks and Defense Mechanisms. In The Eleventh International Conference on Learning Representations.
[19] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison Cottrell. 2017. A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971 (2017).
[20] Rajat Sen, Hsiang-Fu Yu, and Inderjit S Dhillon. 2019. Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting. Advances in Neural Information Processing Systems 32 (2019).
[21] Omer Berat Sezer, Mehmet Ugur Gudelek, and Ahmet Murat Ozbayoglu. 2020. Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Applied Soft Computing 90 (2020), 106181.
[22] Yaniv Shulman. 2019. Dynamic Time Warp Convolutional Networks. arXiv preprint arXiv:1911.01944 (2019).
[23] Abdul Sittar and Dunja Mladenić. 2023. An approach to creating a time-series dataset for news propagation: Ukraine-war case study. In Slovenian KDD Conference.
[24] José F Torres, Dalil Hadjout, Abderrazak Sebaa, Francisco Martínez-Álvarez, and Alicia Troncoso. 2021. Deep learning for time series forecasting: a survey. Big Data 9, 1 (2021), 3–21.
[25] Xiaopeng Xi, Eamonn Keogh, Christian Shelton, Li Wei, and Chotirat Ann Ratanamahatana. 2006. Fast time series classification using numerosity reduction. In Proceedings of the 23rd International Conference on Machine Learning. 1033–1040.
[26] Tian Zhou, Ziqing Ma, Qingsong Wen, Liang Sun, Tao Yao, Wotao Yin, Rong Jin, et al. 2022. FiLM: Frequency improved Legendre memory model for long-term time series forecasting. Advances in Neural Information Processing Systems 35 (2022), 12677–12690.

Building A Causality Graph For Strategic Foresight

Jože M. Rožanec, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, joze.rozanec@ijs.si
Beno Šircelj, Jožef Stefan Institute, Ljubljana, Slovenia, beno.sircelj@ijs.si
Peter Nemec, Event Registry d.o.o., Ljubljana, Slovenia, peter@eventregistry.org
Gregor Leban, Event Registry d.o.o., Ljubljana, Slovenia, gregor@eventregistry.org
Dunja Mladenić, Jožef Stefan Institute, Ljubljana, Slovenia, dunja.mladenic@ijs.si

ABSTRACT
This paper describes a pipeline built to generate a causality graph for strategic foresight. The pipeline interfaces with a well-known global media retrieval platform, which performs real-time tracking of events reported in the media. The events are retrieved from the media retrieval platform, and content from the media articles is processed with ChatGPT to extract causal relations mentioned in the news articles. Multiple post-processing steps are performed to clean the causal relations, removing spurious ones and linking them to ontological concepts where possible. Finally, a sample causality trace is showcased to exemplify the potential of the causality graph created so far.

KEYWORDS
strategic foresight, graph, causality extraction, wikifier, ChatGPT

1 INTRODUCTION
Other Among the most frequently used strategic foresight methods we cases include using strategic foresight to understand how EU- find scenario planning [7], that aims to foresee relevant scenarios wide policies may affect regions and rural localities [26] or guide based on trends and factors of influence. These allow for a better decision-making in the face of structural change [2]. understanding of how actions can influence the future - a key Previous work [22, 23] described how artificial intelligence ability in a world full of Turbulence, Unpredictability, Uncertainty, could be used to automate scenario planning. This paper de- Novelty, and Ambiguity (TUNA) [30]. This ability has fostered scribes a pipeline built to extract and process media news from an increasing adoption of strategic foresight in the public and EventRegistry [16] to create a causality graph. Furthermore, it private sectors [6, 21]. describes the causality graph created with media news report- Domain experts currently plan scenarios by gathering and an- ing on events related to oil prices, given the abundant research alyzing the data to determine and report probable, possible, and regarding how oil prices impact the environment. Among the plausible futures of interest [15]. Nevertheless, the extensive man-benefits of this approach is the ability to extract causal relations ual work imposes severe scalability limitations and can introduce with little human intervention and no supervision. The resulting bias into the assessments [7]. To overcome such limitations, artifi-graph enables the creation of link prediction models that can be cial intelligence was proposed to automate information scanning used to predict future events based on an array of events that and data analysis [4, 18]. have been observed in the past. While the value of artificial intelligence for strategic foresight This paper is organized as follows. First, section 2 describes has been recognized, artificial intelligence has not been widely how a data extraction pipeline was built, retrieving media events adopted yet [4, 20]. This is also reflected in scientific papers of interest and extracting causal relationships observed in the on foresight and artificial intelligence. For example, we queried world and described in them. Section 3 briefly describes some of Google Scholar for "data-supported foresight" and "strategic fore-the results obtained, providing (i) a quantitative assessment of sight artificial intelligence" considering the start time is unlim- error types and resulting causal relationships after data cleansing ited, and the deadline is September 6th 2023. When analyzing procedures and (ii) a qualitative assessment of causality relation- the first 50 search results of each, we got 18% (9/50) and 40% ships generated through the pipeline. Finally, Section 4 concludes and outlines future work. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this 2 DATA EXTRACTION PIPELINE work must be honored. For all other uses, contact the owner/author(s). The data extraction pipeline aims to query relevant media news, Information Society 2023, 9–13 October 2023, Ljubljana, Slovenia © 2023 Copyright held by the owner/author(s). 
2.2 Causality extraction

To extract causal relations from media events, OpenAI ChatGPT (gpt-3.5-turbo) was used as a one-shot learning model. To that end, a random media event was sampled, the causality relationships were extracted, and both (the text and causal relationships) were presented to the model, asking it to recognize causal relationships in the media news. Several iterations of prompt engineering were performed to ensure high-quality results, performing a manual assessment of random results.

The causal relationships were persisted in JSON files discriminating the cause, effect, related entities, and locations. In particular, cause, effect, entities, and locations were defined in the following manner:
• Cause or effect: contains an entity, which is an item, individual, or company that an event happened to;
• Event: an action, development, happening, or state of the entity that is causing or was affected by a cause in the relationship;
• Location: the geographical location where the event in the cause or effect took place.

Once the causal relationships were extracted, the cause and effect were post-processed, removing adjectives so that only the nouns were left; e.g., higher diesel prices was converted to diesel prices. The decision was made considering that by doing so, (a) the causes and effects would gain greater support and, therefore, strengthen the information signal in a graph, and (b) a human expert would be able to determine how a cause and effect may relate given his domain knowledge and a particular context. For example, given the relationship Inflation → Consumer price index, the human expert will immediately understand how the consumer price index is affected in a growing or shrinking inflationary context. For each causal relationship, a trace was kept to associate it with the media event from which it was extracted, to enable further analysis when required.
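A minimal sketch of the one-shot extraction call, using the openai Python client, is shown below. The actual prompt was iteratively engineered by the authors and is not published, so the wording and the JSON schema here are hypothetical.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ONE_SHOT_EXAMPLE = (
    "Article: <sample article text>\n"
    'Causal relations: [{"cause": {"entity": "...", "event": "...", '
    '"location": "..."}, "effect": {...}}]'
)  # one worked example shown to the model, as described in Section 2.2

def extract_causal_relations(article_text: str) -> str:
    # Ask the model to return cause/effect/entity/location tuples as JSON.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Extract causal relations from news text. Return JSON "
                        "objects with cause, effect, entities, and locations."},
            {"role": "user", "content": ONE_SHOT_EXAMPLE},
            {"role": "user", "content": f"Article: {article_text}"},
        ],
    )
    return response.choices[0].message.content
```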
2.3 Semantic matching and enrichment

The entire text of the media article was parsed using Wikifier [5]. Data from Wikifier was employed in two distinct ways: firstly, to enrich location data, and secondly, to associate entities with relevant semantic concepts.

The Wikifier tool marks which words in the wikified text correspond to certain semantic concepts. Such annotations were matched to the entities extracted by ChatGPT as part of the causal relationships. To successfully match strings to semantic concepts, some preprocessing was required. First, the non-letter symbols and stopwords were removed, followed by the stemming of each word. It was considered a match if there was at least one identical string between the text related to the marked concepts and the causal relationship. Not all of the semantic concepts listed by the Wikifier were considered: (a) the concepts were required to have a PageRank higher than 0.0001; (b) for location data, only the concepts categorized as "place" were considered; and (c) when substituting the original entity by the associated semantic concept, the semantic concept with the highest cosine similarity between the article and its corresponding Wikipedia page was considered.
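The string-to-concept matching step can be sketched as follows; this is a simplified illustration using NLTK, since the paper does not name the stemmer or stopword list that was actually used.

```python
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords  # requires: nltk.download('stopwords')

stemmer = PorterStemmer()
STOP = set(stopwords.words("english"))

def normalize(text: str) -> set:
    # Strip non-letter symbols and stopwords, then stem each word (Sec. 2.3).
    words = re.sub(r"[^a-zA-Z ]", " ", text).lower().split()
    return {stemmer.stem(w) for w in words if w not in STOP}

def matches(entity: str, concept_label: str, pagerank: float) -> bool:
    # A Wikifier concept is linked to an entity if it passes the PageRank
    # threshold and shares at least one identical (stemmed) string.
    return pagerank > 0.0001 and bool(normalize(entity) & normalize(concept_label))
```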
• missing entity: [ChatGPT] ChatGPT omits the actual Wrong conversion 17 5.7% entity but could be inferred from the text by the human reader. E.g., S&P 500 capital expenditures → growth, energy Missing entity 15 5.0% policy → defiance, or survey → Nasdaq 100. • time entity: [ChatGPT] some time-period is considered Non-entity 9 3.0% an entity. E.g., drilling activity → 2016, or (US) shale oil Time entity 3 1.0% supply → end of the year. • non-entity: [ChatGPT] words marked as entities don’t mean anything coherent. E.g., retail sales → risk appetite. Table 1: Statistics for typified errors based on a random • wrong conversion: [Wikifier] the entity was changed to sample of 300 causal relationships. something unrelated to the one stated in the text. E.g., Aus- tralian government > Australian dollar, or political tensions > Breakup of Yugoslavia. After performing the abovementioned cleansing and dictionary- While the mitigation strategy for most of the abovementioned based mappings, 7,723 nodes and 9,726 edges were obtained. Re- errors is to remove the causal relationship, for moving causal relationships reported only in a single media event missing entity, a follow-up question will be provided to ChatGPT to get a more reduced the graph size to 489 nodes and 877 edges. concrete answer. This last mitigation strategy has not been im- 3.1 Causality graph and causality chain plemented yet. Furthermore, a list of concept mappings will be considered to reduce clutter. For example, analysis Wage Growth or 1980s Oil Glut should be replaced by Wage or Oil Glut, respectively. Causal chains were created by linking causes and effects extracted Breakup of Yugoslavia could be replaced by Country Breakup. from media events. While these are not always completely ac- Finally, a more thorough linking to semantic concepts and on- curate, they help to identify sequences of events that may take tologies is required (e.g., Jerome Powell could be linked to Central place. Furthermore, while currently not implemented, graph link Bank). prediction could be used to predict future event sequences based After the abovementioned cleansing, the strings were turned on patterns observed in the past. into lowercase and trimmed, and most non-alphabetical charac- This section provides an example regarding a causality chain ters were removed. Further sampling and entity evaluation were of interest retrieved from the causality graph. The causality chain performed, creating a dictionary to match string occurrences to is briefly analyzed to demonstrate how it captures relevant knowl- a particular concept. It must be noted that the dictionaries do not edge. In particular, many causality chains displayed the following provide an exhaustive mapping and that ongoing work is being pattern: Pandemic → Currency → Price of Oil → Economic Growth done to further refine and complete the mapping phase. Such → Oil Glut → Inflation → Central Bank → Stock Market → In- dictionaries were created to provide ground for future ontological vestment. mapping based on existing ontologies and ontologies that will The complete causality chain summarized above was: Pan- be developed for this purpose. Finally, all the relations that, after demic → Currency → Price of Oil → Crude Oil Futures → Fuel the described process, were extracted from only one media event Pricing → Economic Growth → Petroleum → Oil Glut → Con- were discarded, given they are very likely to introduce noise. 
2.5 Creating a causality graph
Once causal relationships were extracted, a causality graph was created by matching cause → effect. Furthermore, some metrics were computed to assess the graph characteristics. The graph can be sampled and visualized with the NetworkX² library, which creates a dynamic HTML interface to view it. For each cause and all the possible effects following it, probabilities of each effect occurring were computed based on the ratios present in the data.

²The library is documented at the following website: https://networkx.org/

3 RESULTS
A total of 2,503 media events were extracted from EventRegistry. When processed with ChatGPT, 12,290 unique causal relationships were extracted, totaling 14,226 unique entities. Those were processed to remove possible errors: considering repeated entity and empty entity errors, 253 causal relations were removed. After applying wikification, a further 845 causal relations were removed due to repeated entity and empty entity errors, leaving 9,726 unique causal relations totaling 7,723 entities. Table 1 shows the number of causal relations affected by each error type, considering a random sample of 300 causal relations.

Error type       | Count | Percentage
Wrong conversion | 17    | 5.7%
Missing entity   | 15    | 5.0%
Non-entity       | 9     | 3.0%
Time entity      | 3     | 1.0%

Table 1: Statistics for typified errors based on a random sample of 300 causal relationships.

After performing the abovementioned cleansing and dictionary-based mappings, 7,723 nodes and 9,726 edges were obtained. Removing causal relationships reported in only a single media event reduced the graph size to 489 nodes and 877 edges.
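A minimal sketch of the graph construction and per-cause effect probabilities described in Section 2.5, assuming cleansed relations arrive as (cause, effect) string pairs (one pair per media event mention); the `probability` edge attribute name is an assumption:

```python
import networkx as nx
from collections import Counter

def build_causality_graph(relations: list) -> nx.DiGraph:
    """Build a directed graph whose edges carry P(effect | cause)."""
    counts = Counter(relations)  # relations: list of (cause, effect) pairs
    graph = nx.DiGraph()
    for (cause, effect), n in counts.items():
        graph.add_edge(cause, effect, count=n)
    # For each cause, normalize the outgoing counts into probabilities.
    for cause in graph.nodes:
        total = sum(d["count"] for _, _, d in graph.out_edges(cause, data=True))
        for _, effect, d in graph.out_edges(cause, data=True):
            d["probability"] = d["count"] / total
    return graph
```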
3.1 Causality graph and causality chain analysis
Causal chains were created by linking causes and effects extracted from media events. While these are not always completely accurate, they help to identify sequences of events that may take place. Furthermore, while currently not implemented, graph link prediction could be used to predict future event sequences based on patterns observed in the past.

This section provides an example regarding a causality chain of interest retrieved from the causality graph. The causality chain is briefly analyzed to demonstrate how it captures relevant knowledge. In particular, many causality chains displayed the following pattern: Pandemic → Currency → Price of Oil → Economic Growth → Oil Glut → Inflation → Central Bank → Stock Market → Investment. The complete causality chain summarized above was: Pandemic → Currency → Price of Oil → Crude Oil Futures → Fuel Pricing → Economic Growth → Petroleum → Oil Glut → Consumer Price Index → Monetary Policy → Inflation → Central Bank → Stock Market → Investment → Bond.

To validate the causality chain, scientific literature and events from the past few years were reviewed to find research and examples supporting the causal relationships. For the causality chain described above, we found that the Pandemic influenced Currency: countries experiencing a sharp daily rise in COVID-19 deaths usually saw their currencies weaken [13]. Causality between exchange rates (Currency) and the Price of Oil has been reported by the European Central Bank [9]. In particular, it has been noticed that exchange rates can affect oil prices through financial markets, financial assets, portfolio rebalancing, and hedging practices. It has also been noted that, given that oil prices are expressed in US dollars, oil futures can be used to hedge against an expected depreciation of the US dollar - something that explains the causal relationship between Price of Oil and Crude Oil Futures. Furthermore, a relationship exists between futures and spot prices (futures prices tend to converge upon spot prices³) and between oil prices and fuel prices⁴, validating the causal relationship between Crude Oil Futures and Fuel Pricing.

³See "Futures Prices Converge Upon Spot Prices", last accessed at https://www.investopedia.com/ask/answers/06/futuresconvergespot.asp on September 7th 2023.
⁴See "Gasoline explained: Factors affecting gasoline prices", last accessed at https://www.eia.gov/energyexplained/gasoline/factors-affecting-gasoline-prices.php on September 7th 2023.

When considering the relationship between Fuel Pricing and Economic Growth, we found that the relationship is validated with energy prices [3], e.g., with gas prices: higher gas prices negatively impact the economy⁵. Economic growth can affect the petroleum market and, in particular, lead to an oil glut (a significant surplus of crude oil caused by falling demand), as happened at the beginning of the COVID-19 pandemic⁶. Furthermore, oil pricing can have direct or indirect effects on Inflation [24], which is reflected in the Consumer Price Index, and which can trigger a particular Monetary Policy from the Central Bank in response to it. Finally, monetary policies affect the stock market and investments [25].

⁵See "How Gas Prices Affect the Economy", last accessed at https://www.investopedia.com/financial-edge/0511/how-gas-prices-affect-the-economy.aspx on September 7th 2023.
⁶See "Oil glut means there's little hope for oil price recovery until 2021", last accessed at https://www.conference-board.org/topics/natural-disasters-pandemics/COVID-19-oil-glut on August 30th 2023.

While the causality chain displayed in this case is mostly clean, some improvements are required to make it neater. For example, based on domain knowledge, and depending on the context, the Consumer Price Index and Inflation could be merged into a single concept, and Monetary Policy and Central Bank could be considered as one.

The ingestion pipeline requires further work to enhance the concept mappings. We envision that the dictionaries will be further evolved and linked to specific ontologies that could be used to assign semantic meaning and, e.g., contract links in a chain with the same semantic ancestor.
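Given such a graph, candidate chains like the example above can be enumerated and scored; a minimal sketch building on the `build_causality_graph` sketch above (the concept names and the `probability` attribute are assumptions carried over from that sketch, not the authors' implementation):

```python
import networkx as nx

def causal_chains(graph: nx.DiGraph, source: str, target: str,
                  cutoff: int = 15) -> list:
    """Enumerate simple cause-effect paths between two concepts."""
    return list(nx.all_simple_paths(graph, source=source, target=target,
                                    cutoff=cutoff))

def chain_probability(graph: nx.DiGraph, chain: list) -> float:
    """Score a chain by multiplying the conditional effect probabilities."""
    p = 1.0
    for cause, effect in zip(chain, chain[1:]):
        p *= graph[cause][effect]["probability"]
    return p

# e.g., chains = causal_chains(g, "pandemic", "bond")  # hypothetical node names
```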
4 CONCLUSIONS
This research has described a pipeline created for causality extraction from media news, aimed toward a strategic foresight tool and currently focused on events affecting oil prices. Particular errors in the causality extraction were identified and typified, and mitigation measures were implemented. Nevertheless, further work is required to improve the pipeline. Future work will consider three directions: (a) string-to-ontologies mapping, to ensure the captured causes and effects can be tied to particular semantic knowledge and exploit it; (b) generating richer cause and effect representations so that, based on encoded metadata, better causality patterns can be elucidated; and (c) creating a link prediction model based on the causality graph.

ACKNOWLEDGMENTS
The Slovenian Research Agency supported this work. This research was developed as part of the Graph-Massivizer project funded under the Horizon Europe research and innovation program of the European Union under grant agreement 101093202.

REFERENCES
[1] Nicholas Apergis and James E Payne. 2015. Renewable energy, output, carbon dioxide emissions, and oil prices: evidence from South America. Energy Sources, Part B: Economics, Planning, and Policy 10, 3 (2015), 281–287.
[2] M Bruce Beck. 2005. Environmental foresight and structural change. Environmental Modelling & Software 20, 6 (2005), 651–670.
[3] Istemi Berk and Hakan Yetkiner. 2014. Energy prices and economic growth in the long run: Theory and evidence. Renewable and Sustainable Energy Reviews 36 (2014), 228–235.
[4] Patrick Brandtner and Marius Mates. 2021. Artificial Intelligence in Strategic Foresight - Current Practices and Future Application Potentials. In The 2021 12th International Conference on E-business, Management and Economics. 75–81.
[5] Janez Brank, Gregor Leban, and Marko Grobelnik. 2017. Annotating documents with relevant Wikipedia concepts. Proceedings of SiKDD 472 (2017).
[6] George Burt and Anup Karath Nair. 2020. Rigidities of imagination in scenario planning: Strategic foresight through 'Unlearning'. Technological Forecasting and Social Change 153 (2020), 119927.
[7] Ashkan Ebadi, Alain Auger, and Yvan Gauthier. 2022. Detecting emerging technologies and their evolution using deep learning and weak signal analysis. Journal of Informetrics 16, 4 (2022), 101344.
[8] Ali Ebaid, Hooi Hooi Lean, and Usama Al-Mulali. 2022. Do oil price shocks matter for environmental degradation? Evidence of the environmental Kuznets curve in GCC countries. Frontiers in Environmental Science 10 (2022), 860942.
[9] Marcel Fratzscher, Daniel Schneider, and Ine Van Robays. 2014. Oil prices, exchange rates and asset prices. (2014).
[10] Amber Geurts, Ralph Gutknecht, Philine Warnke, Arjen Goetheer, Elna Schirrmeister, Babette Bakker, and Svetlana Meissner. 2022. New perspectives for data-supported foresight: The hybrid AI-expert approach. Futures & Foresight Science 4, 1 (2022), e99.
[11] Joseph M Greenblott, Thomas O'Farrell, Robert Olson, and Beth Burchard. 2019. Strategic foresight in the federal government: a survey of methods, resources, and institutional arrangements. World Futures Review 11, 3 (2019), 245–266.
[12] Jinyan Hu, Kai-Hua Wang, Chi Wei Su, and Muhammad Umar. 2022. Oil price, green innovation and institutional pressure: A China's perspective. Resources Policy 78 (2022), 102788.
[13] Aamir Jamal and Mudaser Ahad Bhat. 2022. COVID-19 pandemic and the exchange rate movements: evidence from six major COVID-19 hot spots. Future Business Journal 8, 1 (2022), 17.
[14] Foday Joof, Ahmed Samour, Mumtaz Ali, Turgut Tursoy, Mohammad Haseeb, Md Emran Hossain, and Mustafa Kamal. 2023. Symmetric and asymmetric effects of gold and oil price on environment: The role of clean energy in China. Resources Policy 81 (2023), 103443.
[15] Kevin Kohler. 2021. Strategic Foresight: Knowledge, Tools, and Methods for the Future. CSS Risk and Resilience Reports (2021).
[16] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event Registry: learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web. 107–110.
[17] Gaye-Del Lo, Isaac Marcelin, Théophile Bassène, and Babacar Sène. 2022. The Russo-Ukrainian war and financial markets: the role of dependence on Russian commodities. Finance Research Letters 50 (2022), 103194.
[18] Nathan H Parrish, Anna L Buczak, Jared T Zook, James P Howard, Brian J Ellison, and Benjamin D Baugher. 2019. Crystal Cube: Multidisciplinary approach to disruptive events prediction. In Advances in Human Factors, Business Management and Society: Proceedings of the AHFE 2018 International Conference. Springer, 571–581.
[19] Lorien Pratt, Christophe Bisson, and Thierry Warin. 2023. Bringing advanced technology to strategic decision-making: The Decision Intelligence/Data Science (DI/DS) Integration framework. Futures 152 (2023), 103217.
[20] Norbert Reez. 2020. Foresight-Based Leadership. Decision-Making in a Growing AI Environment. In International Security Management: New Solutions to Complexity. Springer, 323–341.
[21] Aaron B Rosa, Niklas Gudowsky, and Petteri Repo. 2021. Sensemaking and lens-shaping: Identifying citizen contributions to foresight through comparative topic modelling. Futures 129 (2021), 102733.
[22] Joze Rozanec, Peter Nemec, Gregor Leban, and Marko Grobelnik. 2023. AI, What Does the Future Hold for Us? Automating Strategic Foresight. In Companion of the 2023 ACM/SPEC International Conference on Performance Engineering. 247–248.
[23] Jože M Rožanec, Radu Prodan, Dumitru Roman, Gregor Leban, and Marko Grobelnik. 2023. AI-based Strategic Foresight for Environment Protection. In Symposium on AI, Data and Digitalization (SAIDD 2023), 7.
[24] Siok Kun Sek, Xue Qi Teo, and Yen Nee Wong. 2015. A comparative study on the effects of oil price changes on inflation. Procedia Economics and Finance 26 (2015), 630–636.
[25] Peter Sellin. 2001. Monetary policy and the stock market: theory and empirical evidence. Journal of Economic Surveys 15, 4 (2001), 491–541.
[26] Anastasia Stratigea and Maria Giaoutzi. 2012. Linking global to regional scenarios in foresight. Futures 44, 10 (2012), 847–859.
[27] Mitja Trampuš and Blaz Novak. 2012. Internals of an aggregated web news feed. In Proceedings of the 15th Multiconference on Information Society. 221–224.
[28] Victor Troster, Muhammad Shahbaz, and Gazi Salah Uddin. 2018. Renewable energy, oil prices, and economic activity: A Granger-causality in quantiles analysis. Energy Economics 70 (2018), 440–452.
[29] Barend Van der Meulen. 1999. The impact of foresight on environmental science and technology policy in the Netherlands. Futures 31, 1 (1999), 7–23.
[30] Angela Wilkinson. 2017. Strategic foresight primer. European Political Strategy Centre (2017).

Towards Testing the Significance of Branching Points and Cycles in Mapper Graphs

Patrik Zajec (patrik.zajec@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia
Primož Škraba (p.skraba@qmul.ac.uk), School of Mathematical Sciences, Queen Mary University of London, London, UK
Dunja Mladenić (dunja.mladenic@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT
Given a point cloud P, which is a set of points embedded in R^d, we are interested in recovering its topological structure. Such a structure can be summarized in the form of a graph. An example of this is the mapper graph, which captures how the point cloud is connected and reflects the branching and cyclic structure of P as branching points (vertices with degree greater than 2) and cycles in the graph. However, such a representation is not always accurate, i.e., the structure shown by the graph may not be sufficiently supported in the point cloud. To this end, we propose an approach that uses persistent (relative) homology to detect branching and cyclic structure, and employs a statistical test to confirm whether the structure is indeed significant. We show how the approach works for low-dimensional point clouds, and discuss its possible applications to real-world point clouds.

KEYWORDS
topological data analysis, statistical hypothesis testing, persistent homology, mapper algorithm

1 INTRODUCTION
Consider the point cloud P consisting of points in R^2 shown in Figure 1a.
Using the mapper algorithm, we can construct a graph that represents its topological structure, like the one in Figure 1b, which seems to recover the important structure. Using the same algorithm (but with different values of its adjustable parameters) we could end up with different graphs. The second graph, shown in Figure 1c, contains two cycles: the middle one, which captures the cycle present in P, and the top one, where the algorithm "mistakenly" considers the top points to connect in a cycle. The third graph, shown in Figure 1d, shows a similar structure to the graph in Figure 1b, although it contains one more branching point (splitting off the upper left branch) and a cycle of length three. One could argue that these branching and cyclic structures are not sufficiently supported in P.

[Figure 1: A point cloud (a) and three graphs (b, c, d) summarizing its topological structure, constructed by the mapper algorithm for different choices of its parameters.]

Our goal is to develop an approach that allows us to confirm, through a statistical test, whether the structure recovered by the mapper graph is indeed present in the point cloud. We use persistent homology, a well-known construction from topological data analysis (TDA), to represent the structure of the point cloud, and a recently introduced hypothesis testing framework [1] that provides a way to evaluate the significance of such a structure. We demonstrate the approach on two examples: a Y-shaped point cloud and a sample of a 3D mesh resembling an ant. These low-dimensional examples allow us to visually inspect the results, laying the groundwork for extensive experiments with higher-dimensional point cloud data used in real-world applications.

Representing the topological structure of a point cloud with a simpler object, such as a graph, and having a statistical method for testing the significance of such a structure is a very relevant task. A simpler representation allows us to visualize [3] and interpret high-dimensional representations that are everywhere in modern data science and machine learning. It might even allow us to find singularities that often carry relevant information. The mapper algorithm [6] is a commonly used tool in TDA. Although it is simple, the result is sensitive to the choice of its parameters [2]. Moreover, it provides only one possible low-dimensional view of the input data, and to our knowledge there is no method that would confirm the significance of the represented structure. There is another method, called persistent homology, which, while not directly applicable to visualization, deals with a particular structure of "holes" in space and now has a framework [1] that allows us to statistically test the significance of such a structure.

2 BACKGROUND
A point cloud P is a set of points embedded in R^d which can be viewed as a sample of a topological space X. Since the discrete points from P have no interesting topological structure, we consider the space P_r = ∪_{p ∈ P} B(p, r) for some radius r. If P is a sufficiently dense sample of X, then P_r has some of the same properties as X for a suitable r. To compute the properties of interest, we represent P_r with a simplicial complex K which, if properly constructed, has homology groups isomorphic to those of P_r. We are interested in finding the branching and cyclic structure in the point cloud, both of which can be detected using (persistent) homology.
2.1 Simplicial complexes
A (geometric) simplicial complex K can be thought of as a "high-dimensional graph" whose vertices are points from the point cloud and whose connectivity is determined by the geometric configuration of the points. In addition to vertices and edges, we include triangles, tetrahedra and higher-dimensional simplices. Formally, K consists of finite nonempty subsets of P and is closed under inclusion (i.e., A ∈ K and B ⊂ A implies B ∈ K). We refer to elements in K of size k + 1 as k-simplices, which correspond to k-cliques when we think about K as a hyper-graph.

The Čech and Vietoris-Rips complexes are the two most common constructions, both parameterized by a scale parameter (radius) r > 0. We use the Vietoris-Rips construction, where we include a subset of (k + 1) points from P as a k-simplex if all points are at most r apart.

We can construct a sequence of complexes K_{r_1}, K_{r_2}, ... by increasing the radius r. Such a construction is "increasing" in the sense that for r_1 < r_2 it holds that K_{r_1} ⊆ K_{r_2}. Such sequences are also known as filtrations and are used in persistent homology.
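A minimal sketch of the Vietoris-Rips construction with the Dionysus library (the tool used in Section 4); the random point cloud and the parameters k = 2, r = 0.5 are illustrative stand-ins:

```python
import numpy as np
import dionysus as d  # Dionysus 2, the library used in Section 4

# A toy stand-in for the point cloud P.
points = np.random.random((100, 2))

# Vietoris-Rips filtration: every subset of at most k + 1 = 3 points whose
# members are pairwise at most r = 0.5 apart becomes a simplex, ordered by
# the radius at which it first appears.
filtration = d.fill_rips(points, 2, 0.5)
```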
2.2 Persistent relative homology

Homology. Homology is a classical construction in algebraic topology that deals with the topological properties of a space. More precisely, it provides a mathematical language for the holes in a topological space. Homology groups, denoted by H_k(X), where k is a dimension, capture the holes indirectly by focusing on what surrounds them. For example, the basis of H_0(X) corresponds to the connected components and the basis of H_1(X) to the closed loops surrounding the holes. The rank of the k-th homology group, also known as the Betti number, counts the number of k-dimensional "holes".

We can construct homology groups for a given simplicial complex K. The important concepts in the construction are: (i) the chain groups C_k, where the k-th chain group consists of all formal linear combinations Σ_i a_i σ_i of k-dimensional simplices, where the σ_i are k-simplices from K and the a_i are coefficients, usually from Z_2; (ii) the boundary operator ∂_k, a map describing how (k-1)-simplices are attached to k-simplices; (iii) the groups Z_k of k-cycles, which are the k-chains in the kernel of ∂_k; and (iv) the groups B_k of k-boundaries, which are the elements in the image of ∂_{k+1}. The boundary operator has the property that ∂_k ∘ ∂_{k+1} = 0, i.e., it maps the boundary of a boundary to zero; therefore B_k ⊆ Z_k.

Intuitively, a k-cycle can be thought of as a generalized version of a cycle in a graph - it is a sequence of k-dimensional simplices wrapped around something. If this sequence is actually the boundary of a (k+1)-dimensional chain, then its interior is full (a trivial cycle); otherwise, it surrounds a hole. The k-th homology H_k = ker ∂_k / im ∂_{k+1} = Z_k / B_k takes a "modulo" of the k-cycles by the k-boundaries, leaving only the cycles that are nontrivial.

Relative homology. Given a simplicial complex K and a subcomplex L ⊆ K, the relative homology of the pair of topological spaces (simplicial complexes in our case) can be thought of as the (reduced) homology of the quotient space K/L. Intuitively, we want to factor out L, which is expressed by the quotient operation C_k(K, L) = C_k(K) / C_k(L). The group of k-cycles becomes Z_k(K, L) = Z_k(K) / Z_k(L), which we call the group of relative cycles. We can think of the reduced homology of the quotient as the homology we would get if we represented the entire L with a single point.

The concept of homology and relative homology is best illustrated by an example. Consider a simple simplicial complex consisting of the 0-simplices {a, b, c, d, e, f} and the 1-simplices {(a, b), (a, c), (b, c), (a, d), (b, e), (c, f)}, as shown in Figure 2a. There is a "hole" of dimension 1 (surrounded by the cycle a → b → c → a), which is captured in the homology group H_1. Choosing L = {d, e, f} as a subcomplex, the quotient K/L identifies the simplices from L to a single point, as shown in Figure 2b. This results in two new "holes" in dimension 1, so the relative homology group H_1(K, L) has rank 3. This "lifting property" of relative homology (introducing new "holes" when identifying simplices) is used in our approach to detect branching points.

[Figure 2: (a) A Y-shaped simplicial complex with one cycle. (b) The quotient K/L, where the subcomplex L contains the 0-simplices {d, e, f}. Such identification introduces two new 1-dimensional "holes", captured by the relative homology group H_1(K, L).]

Persistent homology. The construction of the simplicial complex, and hence of the groups H_k, is highly sensitive to the choice of the radius r. To overcome this, persistent homology considers the entire range of scales and tracks the evolution of k-cycles as the value of r increases along the filtration. In this process, cycles are created (born) and later filled in (die). This information is most often represented by persistence diagrams: two-dimensional scatter plots dgm_k = {p_1, ..., p_m}, where each point p_i = (b_i, d_i) represents the birth and death times (radii) of the associated persistent cycle.

2.3 Significance testing of persistent cycles
The significance of topological features is often measured by the lifetimes of persistent cycles, i.e., δ_i = d_i − b_i. Although this method is intuitive, as it captures the geometric "size" of topological features, [1] uses the statistic π_i = d_i / b_i. They present a statistical test to determine for each point p_i ∈ dgm_k whether it is signal or noise, i.e., a significant structure or the result of noise and randomness in the data. They introduce a special transformation l(p_i) applied to each point from the diagram, where the values of l(p_i) follow a certain (LGumbel) distribution if the p_i are points corresponding to noisy cycles, while cycles significantly deviating from this distribution are declared as signal. The signal part of dgm_k can be recovered as dgm_k^s(α) = {p ∈ dgm_k : e^{-l(p)} < α / |dgm_k|}, given a p-value α.

Computing persistent homology for an entire filtration is often intractable, as higher values of r lead to a large number of simplices. The common practice is to set a threshold r_max and calculate dgm_k(r_max) using the simplices generated up to r_max. This often leads to cycles that are "infinite", i.e., born prior to r_max but dying after r_max. The framework also provides an algorithm to determine the infinite cycles that are already significant, and provides means to select the next r_max threshold to inspect the infinite cycles that have not yet been determined to be significant.
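Continuing the sketch above, persistence and a significance filter in the spirit of Section 2.3 might look as follows; note that the transformation l below is only a placeholder using log π - the exact form of l comes from [1] and would need to be substituted for a faithful test:

```python
import math
import dionysus as d

# Persistence of the Vietoris-Rips filtration from the previous sketch.
persistence = d.homology_persistence(filtration)
diagrams = d.init_diagrams(persistence, filtration)  # one diagram per dimension

def signal_points(dgm, alpha=0.05, l=lambda p: math.log(p.death / p.birth)):
    """Apply the rule dgm_s(alpha) = {p : exp(-l(p)) < alpha / |dgm|}.
    NOTE: l here is only log(pi) with pi = d/b; substitute the exact
    transformation from [1] for a faithful implementation."""
    n = len(dgm)
    if n == 0:
        return []
    return [p for p in dgm
            if p.birth > 0 and p.death > p.birth
            and math.exp(-l(p)) < alpha / n]

significant_cycles = signal_points(diagrams[1])  # dimension-1 cycles
```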
2.4 The mapper algorithm
Given the topological space X and a continuous function f : X → R, the mapper algorithm [6] constructs a graph G = (V, E) that captures the topological structure of X. It does so by pulling back a cover U of the space f(X) to a cover of X through f. We can view the function f and the cover U as the lens through which the input data X is examined.

Given a point cloud P and f : P → R, we first construct a set of n intervals U = {U_1, ..., U_n} covering f(P). The percentage of overlap for two consecutive intervals U_i and U_{i+1} is determined by the parameter p. For each interval U_i = (a, b), let P_{U_i} = f^{-1}(U_i) be the set of points with function values in the range (a, b). The set P_{U_i} for each U_i is further partitioned into V_i^1, ..., V_i^{k_i} by a clustering algorithm (in our case DBSCAN [5], with parameter ε setting the maximum distance between two samples for one to be considered in the neighborhood of the other) to obtain a cover of P = ∪_{i=1,...,n} {V_i^1, ..., V_i^{k_i}}. Each V_i^j ⊂ P becomes a vertex v in the mapper graph, with φ(v) = V_i^j mapping v to a subset of points. Two vertices are connected by an edge if their point sets intersect (see Figure 3). The resulting graph G = (V, E) provides a combinatorial description of the data, and the mapping φ : V → P(P) maps each node v ∈ V to a subset of points.

[Figure 3: An example of the construction of a mapper graph. (a) A 2-dimensional point cloud P with cover {V_i^j}, a function f : R^2 → R and a cover U of f(P). (b) The resulting mapper graph.]

3 METHODOLOGY
The input to our approach is a set of points P embedded in R^d and a graph G = (V, E) together with a mapping φ : V → P(P) that maps each vertex to a subset of points. Note that the method used to construct the graph is not limited to the mapper algorithm. The graph is assumed to capture the topological structure of the point cloud, i.e., branching points (vertices with a degree of at least 3) and cycles in the graph should reflect the branching and cyclic structure of the point cloud. Our approach tests whether the captured structure is significant when viewed through homology, operating directly on a subset of points from the point cloud.

3.1 Testing the cycles
A simple cycle is a finite sequence of vertices v_1 → v_2 → ... → v_n, where v_i and v_{i+1} are connected by an edge and no vertex except the endpoint repeats (v_i = v_j if and only if i, j ∈ {1, n}). Let v_1, ..., v_n be such a cycle from G. We compute the persistence diagram of the subset P' = ∪_{i=1,...,n} φ(v_i) and use the test from [1] to confirm that it contains at least one significant cycle ("hole") of dimension 1.

3.2 Testing the branching structure
Let N(v) be the set of vertices connected to v (its 1-hop neighborhood) and let v be a branching point in G (as in Figure 4). Let N'(v) = {u : u ∈ N(v), deg(u) ≥ 2} be the set of vertices from N(v) that have at least one additional neighbor. Together with v, N'(v) forms a set of internal points I_v = ∪_{u ∈ {v} ∪ N'(v)} φ(u) (shown in Figure 4 as black vertices inside the outer black line).

Let K_v = ∪_{u ∈ N'(v)} N(u) be the set of vertices whose points are used to form a complex K (vertices inside the outer black line in Figure 4), i.e., K is formed from the points ∪_{u ∈ K_v} φ(u). Now let L be the subcomplex of K containing the simplices which do not contain any of the points from I_v. Thus L contains the points of vertices exactly two edges away from v (bicolored vertices in Figure 4). We use K and L to compute relative persistent homology, identifying the simplices of L to a single point and introducing relative cycles ("holes") when K \ L has a branching structure. For a branching point v, the relative persistence diagram should contain at least deg(v) − 1 significant relative cycles.

[Figure 4: Construction of K and L for a branching point v. Vertices forming K are inside the outer black line. Vertices forming L are bicolored, indicating that some of their points are inside due to overlap between the vertices' point sets.]
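To make the construction of Section 2.4 concrete, here is a minimal sketch using the giotto-tda mapper implementation that the experiments in Section 4 rely on; the parameter values mirror Experiment 1 (Section 4.1), the point cloud is a random stand-in, and the exact API surface is an assumption:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from gtda.mapper import CubicalCover, Projection, make_mapper_pipeline

# Parameters mirroring Experiment 1: f projects onto the x-coordinate,
# n = 30 cover intervals with p = 0.5 overlap, DBSCAN with eps = 3.
pipeline = make_mapper_pipeline(
    filter_func=Projection(columns=0),
    cover=CubicalCover(n_intervals=30, overlap_frac=0.5),
    clusterer=DBSCAN(eps=3),
)

P = np.random.random((5000, 2)) * 100  # stand-in for the Y-shaped point cloud
graph = pipeline.fit_transform(P)      # an igraph.Graph
# In giotto-tda, the vertex attribute "node_elements" holds the indices of
# the points in phi(v) for each mapper vertex v.
point_sets = graph.vs["node_elements"]
```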
4 EXPERIMENTS
We perform experiments illustrating our approach on two point clouds. The graphs are constructed using the mapper algorithm from the Giotto TDA library [7] with the parameters specified for each experiment. To construct the simplicial complex and compute (relative) persistent homology, we use the Dionysus library¹. We increase the initial radius r using the algorithm from [1] until either no infinite cycles remain or all currently infinite cycles are identified as significant.

We include a figure of the graph for each experiment and mark interesting branching points and cycles. The points corresponding to a cycle are shown in red; the internal points of a branching point are also red, while the boundary points (forming L) are blue.

¹Available at: https://github.com/mrzv/dionysus.

4.1 Experiment 1: Y-shaped point cloud
The point cloud P consists of 5000 points in R^2 and resembles a Y-shape with a cycle in the centre. The graph (see Figure 5) was created with the following parameters: f is a projection on the x-coordinate, n = 30, p = 0.5 and ε = 3.

[Figure 5: Mapper graph with two branching points (B1 and B2) and one simple cycle (C1) together with their corresponding subsets of points.]

The graph contains one simple cycle, which is also significant because the subset of its points contains a homologically significant cycle. The graph also contains two branching points, B1 and B2, with degrees 4 and 3. The persistence diagram for B1 has three (significant) infinite cycles, indicating a branching structure of degree 4, while the diagram for B2 has two (significant) infinite cycles, indicating a branching structure of degree 3. In this example, it was confirmed that both the cyclic and the branching structure of the graph are reflected in the point cloud.
4.2 Experiment 2: 3D ant surface
The point cloud P consists of 6370 points in R^3 corresponding to the vertices of a 3D mesh in the form of an ant obtained from [4]. The graph (see Figure 6) was created with the following parameters: f is the distance to the tip of the ant's abdomen, n = 50, p = 0.5, and ε = 0.025.

[Figure 6: Mapper graph with three highlighted branching points (B1, B2 and B3) and two simple cycles (C1, C2) together with their corresponding subsets of points.]

We highlight three interesting branching points. Vertex B1 is a branching point of degree 3, which corresponds to the branching on the ant's head into its two antennas, and is significant. Vertex B2 is a branching point of degree 3 and one of the vertices of the cycle C2. Looking at the point cloud, no branching structure is detected, because the points of the two legs are contained in the vertex B2 itself and there are no boundary points on the legs, so they appear as a single connected blob. Our approach does not detect a branching structure even though there is one; some other strategy for selecting the boundary points would need to be used. Vertex B3 has degree 6, but only 5 neighbors are used, as one of them does not have any additional neighbor except B3. Since one of the legs has no boundary points, only 2 cycles appear, causing B3 to be recognized as a branching point of degree 3.

We also highlight 2 simple cycles. Cycle C1 wraps around the ant's hollow head and is recognized as significant. Cycle C2 wraps around the ant's two middle legs and part of its body. No significant cycles were found for C2 - the ant's legs are not close enough together to form a large cycle, and the cycle formed by the hollow legs is too small to be detected. So there is not enough support to confirm the structure found by mapper.

5 CONCLUSIONS AND FUTURE WORK
We have demonstrated how persistent (relative) homology can be used in conjunction with a statistical test to confirm the significance of the topological structure of a point cloud summarized with a graph. In the future, we will conduct extensive experiments on more complex, high-dimensional point clouds with known and unknown structure. Ideally, we could use our approach to prune the mapper graphs or guide the selection of values for its parameters. Our approach to identifying branching structures needs further work, as the current strategy of using a (modified) 2-hop neighborhood as a boundary sometimes fails. In addition, we may need a more sensitive version of the statistical test from [1], which is currently stated to hold in general but might be possible to adapt for a particular type of data.

ACKNOWLEDGEMENTS
This work was supported by the Slovenian Research Agency under the project J2-1736 Causalify and co-financed by the Republic of Slovenia and the European Union's HE program under the enRichMyData EU project, grant agreement number 101070284.

REFERENCES
[1] Omer Bobrowski and Primoz Skraba. 2023. A universal null-distribution for topological data analysis. Scientific Reports 13, 1 (July 2023), 12274. doi: 10.1038/s41598-023-37842-2.
[2] Mathieu Carrière, Bertrand Michel, and Steve Oudot. 2018. Statistical analysis and parameter selection for mapper. Journal of Machine Learning Research 19, 12, 1–39. http://jmlr.org/papers/v19/17-291.html.
[3] Nithin Chalapathi, Youjia Zhou, and Bei Wang. 2021. Adaptive covers for mapper graphs using information criteria. In 2021 IEEE International Conference on Big Data (Big Data), 3789–3800. doi: 10.1109/BigData52589.2021.9671324.
[4] Xiaobai Chen, Aleksey Golovinskiy, and Thomas Funkhouser. 2009. A benchmark for 3D mesh segmentation. ACM Transactions on Graphics 28, 3, Article 73 (July 2009), 12 pages. doi: 10.1145/1531326.1531379.
[5] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD'96. AAAI Press, Portland, Oregon, 226–231.
[6] Gurjeet Singh, Facundo Memoli, and Gunnar Carlsson. 2007. Topological methods for the analysis of high dimensional data sets and 3D object recognition. In Eurographics Symposium on Point-Based Graphics. M. Botsch, R. Pajarola, B. Chen, and M. Zwicker, editors. The Eurographics Association. isbn: 978-3-905673-51-7. doi: 10.2312/SPBG/SPBG07/091-100.
[7] Guillaume Tauzin, Umberto Lupo, Lewis Tunstall, Julian Burella Pérez, Matteo Caorsi, Anibal Medina-Mardones, Alberto Dassatti, and Kathryn Hess. 2020. Giotto-tda: a topological data analysis toolkit for machine learning and data exploration. arXiv: 2004.02551 [cs.LG].
Highlighting Embeddings' Features Relevance Attribution on Activation Maps

Jože M. Rožanec (joze.rozanec@ijs.si), Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Erik Koehorst (Erik.Koehorst@philips.com), Philips Consumer Lifestyle BV, Drachten, The Netherlands
Dunja Mladenić (dunja.mladenic@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

ABSTRACT
The increasing adoption of artificial intelligence requires a better understanding of the underlying factors affecting a particular forecast to enable responsible decision-making and provide a ground for enhancing the machine learning model. The advent of deep learning has enabled super-human classification performance and eliminated the need for tedious manual feature engineering. Furthermore, pre-trained models have democratized access to deep learning and are frequently used for feature extraction. Nevertheless, while much research is invested into creating explanations for deep learning models, less attention was devoted to how to explain the classification outcomes of a model leveraging embeddings from a pre-trained model. This research focuses on image classification and proposes a simple method to visualize which parts of the image were considered by the subset of the most relevant features for a particular forecast. Furthermore, multiple variants are provided to contrast the relevant features of a machine learning classifier and the features selected during a feature selection process. The research was performed on a real-world dataset provided by domain experts from Philips Consumer Lifestyle BV.

KEYWORDS
explainable artificial intelligence, feature importance, activation map, GradCAM, image classification, smart manufacturing, defect detection

1 INTRODUCTION
The increasing adoption of artificial intelligence has posed new challenges, including enforcing measures to protect the human person from risks inherent to artificial intelligence systems. One step in this direction is the European AI Act [12], which considers that different artificial intelligence systems must conform to a different set of requirements according to their risk level, linked to the particular domain and the potential impact on health, safety, or fundamental rights [15]. In this context, explainable artificial intelligence, a sub-field of machine learning, has gained renewed attention with the advent of modern deep learning [22], given that it researches how more transparency can be brought to opaque machine learning models. While transparency in the regulatory context is sought to enable responsible decision-making, it provides valuable insights to enhance the workings of machine learning models, too.

The field of explainable artificial intelligence can be traced back to the 1970s [18]. A key question posed by researchers is what makes a good explanation. Arrieta et al. [2] consider that a good explanation must take into account at least three elements: (a) the reasons for a given model output (e.g., features and their value ranges), (b) the context (e.g., the context on which inference is performed), and (c) how (a) and (b) are conveyed to the target audience (e.g., what information can be disclosed and the vocabulary used, among others). When considering images, explanations are frequently presented as maps that contrast particular model information on top of the original input image (e.g., saliency maps, activation maps, heat maps, or anomaly maps [13, 24]). Other approaches can be extracting and highlighting super-pixels relevant to a specific class [16], or the occlusion of background parts irrelevant to the model. Such outputs convey (a) the reasons for a given model output by highlighting the images, (b) the context on which inference is performed (by overlaying the information on top of the image used for inference), and (c) an agreed approach to convey to the user what is considered more relevant and what is not.

Multiple approaches have been developed to explain the inner workings of image classifiers. LIME (Local Interpretable Model-Agnostic Explanations) [16] approached this challenge by retrieving predicted labels for a particular class and showing the segmented superpixels that match each class.
GradCAM [19] took another approach, creating activation maps by weighting the activations at particular deep learning model layers by the average gradient. Many approaches were developed afterward, following the same rationale. For example, GradCAM++ [3], XGradCAM [9], and HiResCAM [6] work like GradCAM but consider second-order gradients, scale the gradients by the normalized activations, or element-wise multiply the activations with the gradients, respectively. Other possible approaches leverage insights resulting from image perturbation [8], or methods that acquire and display samples similar or counterfactual to the predicted instance [4, 17].

The development of information and communications technologies fostered the emergence of the Industry 4.0 paradigm as a technology framework to integrate and extend manufacturing processes [23]. In this context, the increasing adoption of artificial intelligence enables greater automation of manufacturing processes such as defect inspection [7], and urges the adoption of explainable artificial intelligence to develop users' trust in the models and foster responsible decision-making based on the insights obtained regarding the underlying machine learning model [1].

From the literature mentioned above and several surveys on this topic [5, 13, 14, 17, 20, 21], it was found that the authors did not contemplate how explanations can be provided in scenarios where feature embeddings are extracted with a deep learning model and then used to train a separate machine learning model.
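For orientation, below is a minimal sketch of producing a vanilla GradCAM activation map; it assumes the third-party pytorch-grad-cam package (not named in the paper) and a stand-in input tensor:

```python
import torch
from torchvision.models import resnet18
from pytorch_grad_cam import GradCAM  # third-party package, assumed installed

model = resnet18(pretrained=True).eval()

# Target the last block of layer 4; any of layers 1-4 can be inspected.
cam = GradCAM(model=model, target_layers=[model.layer4[-1]])

input_tensor = torch.randn(1, 3, 224, 224)  # stand-in for a normalized image
heatmap = cam(input_tensor=input_tensor)    # (1, H, W) map; by default the
                                            # highest-scoring class is explained
```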
[Figure 1: To classify an image, a feature extractor is used to create an embedding, from which certain values are extracted to create a feature vector. The machine learning model issues a prediction, which, along with the feature vector, is used to create a feature ranking. The attribution approach considers the highest-ranking features to generate an activation map.]

[Figure 2: Given an image embedding (i), we can mask it to display (ii) the features selected during the feature selection procedure (including the top-ranking classifier's features), or (iii) mask it to display only the top-ranking classifier's features.]

The present research addresses this void by proposing an unsupervised approach to generate activation maps based on the feature ranking obtained for a particular forecast. The research is performed on a real-world dataset provided by Philips Consumer Lifestyle BV and related to defect inspection.

This paper is organized as follows. First, Section 2 describes the explainability approach developed and tested in this research. Section 3 describes the experiments performed to assess different value imputation strategies, and Section 4 reports and discusses the results obtained. Finally, Section 5 concludes and describes future work.

2 HIGHLIGHTING EMBEDDINGS' FEATURES RELEVANCE ATTRIBUTION ON ACTIVATION MAPS
The increasing amount of pre-trained deep learning models makes them the default choice for feature extraction when working with machine learning models for images. Nevertheless, the disconnect between the machine learning model built on top and the deep learning model used to extract the image embedding makes it challenging to provide good explanations to the user. This research proposes an approach to bridge that gap (see Fig. 1). In particular, we leverage the fact that similar images or fragments of images result in embeddings or parts of embeddings that are close to each other. This property can be exploited when building activation maps, computing the similarity between a reference image (e.g., the image of a horse) and the image under consideration to find where such a class can be found in the image under consideration (e.g., given the image of a farm, highlight where the horses are located). If, instead of using some reference image, the image that is the input to the machine learning model is leveraged as a reference, (i) no noise is introduced due to the dissimilarity of the images, and (ii) no beforehand knowledge regarding the classes of interest is required. Therefore, a key issue must be resolved: how do both embeddings differ, and how can that difference be exploited to build an activation map?

Two options are envisioned in this research (see Fig. 2): given (i) the image embedding, two variations can be considered for value imputation: (ii) mask all the values in the embedding except for the ones corresponding to top-ranking features, or (iii) mask all the values in the embedding except for the ones corresponding to selected features and top-ranking features, using different values for each of them. By doing so, the highest similarity in the image will be found in regions related to top-ranking features or selected features. Considering selected and top-ranking features provides additional insights into what information was provided to the model and what information was considered the most important by the model. These two approaches are explored in Section 3.

[Figure 3: Sample images from the dataset provided by Philips Consumer Lifestyle BV. Three categories are distinguished: images corresponding to non-defective items (good) and images corresponding to two defect types (double-printed and with interrupted prints).]

3 EXPERIMENTS
We experimented with a real-world dataset of logos printed on shavers provided by Philips Consumer Lifestyle BV. The dataset consisted of 3518 images considered within three categories (see Fig. 3): non-defective images and images with two kinds of defects (double-printed logos and interrupted prints). To extract features from the images, the ResNet-18 model [10] was used, extracting the features before the fully connected layer. Mutual information was used to evaluate the most relevant features and select the top K, with K = √N, where N is the number of data instances in the train set, as suggested in [11].
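A minimal sketch of the feature extraction and selection step just described, assuming torchvision's ResNet-18 and scikit-learn's mutual information ranking; image preprocessing and normalization are omitted:

```python
import math
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import resnet18
from sklearn.feature_selection import mutual_info_classif

# Embeddings are the ResNet-18 activations just before the fully
# connected layer, obtained by replacing the head with an identity.
backbone = resnet18(pretrained=True)
backbone.fc = nn.Identity()
backbone.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> np.ndarray:
    """images: normalized (B, 3, H, W) batch -> (B, 512) embeddings."""
    return backbone(images).numpy()

def select_top_k(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Rank features by mutual information and keep K = sqrt(N) of them."""
    k = int(math.sqrt(len(X)))
    mi = mutual_info_classif(X, y)
    return np.argsort(mi)[::-1][:k]  # indices of the selected features
```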
The dataset was divided into train (75%) and test (25%) sets, and a random forest classifier was trained on it, achieving an AUC ROC (one-vs-rest) score of 0.9022. Three images from the test set were considered for the experiments: good, double-printed, and with an interrupted print. The images were randomly picked among the available ones for each particular class. To assess the features' relevance for a particular forecast, LIME [16] was used, considering the top 1, 3, 5, 7, and 13 ranked features.

The GradCAM images were generated for ResNet-18 layers 1-4, with an additional image considering the four layers combined. To understand where the underlying model focused, we created GradCAM activation maps contrasting the image against itself (see Fig. 4). The cosine similarity between the imputed vector and the image embedding was computed across the test samples (880 samples: 679 good, 58 double-printed, and 143 related to interrupted printing). The mean similarity and standard deviation were used to assess whether the imputation strategy increased the similarity or the contrast between the imputed vector and the image embedding.

[Figure 4: GradCAM activation maps for ResNet-18 layers 1-4 and the four layers combined.]

The GradCAM images were generated by computing the cosine similarity between the image embedding and the feature vector generated considering the three strategies described in Table 1. A sample of the resulting activation maps was visually assessed; the results are reported in Section 4. The experiments were designed to understand which imputation strategy works best. A detailed analysis of how the number of top-ranked features affects the activation maps was omitted for the brevity of the paper.

Strategy | Top-ranked feature | Selected on feature selection | Irrelevant
TOZ      | True value         | One                           | Zero
TZZ      | True value         | Zero                          | Zero
TRR      | True value         | Random                        | Random

Table 1: Value imputation strategies considering the image embedding, the features selected during the feature selection process, and the classifier's top-ranked features.

4 RESULTS

Imputation strategy | Image class       | Layer 1   | Layer 2   | Layer 3   | Layer 4
TOZ                 | Good              | 0.27±0.01 | 0.27±0.01 | 0.27±0.01 | 0.27±0.01
TOZ                 | Double-printed    | 0.31±0.02 | 0.31±0.02 | 0.31±0.02 | 0.31±0.02
TOZ                 | Interrupted print | 0.27±0.01 | 0.27±0.01 | 0.27±0.01 | 0.27±0.01
TZZ                 | Good              | 0.21±0.04 | 0.21±0.04 | 0.21±0.04 | 0.21±0.04
TZZ                 | Double-printed    | 0.24±0.03 | 0.24±0.03 | 0.24±0.03 | 0.24±0.03
TZZ                 | Interrupted print | 0.22±0.04 | 0.22±0.04 | 0.22±0.04 | 0.22±0.04
TRR                 | Good              | 0.46±0.02 | 0.46±0.02 | 0.46±0.02 | 0.46±0.02
TRR                 | Double-printed    | 0.48±0.03 | 0.48±0.03 | 0.48±0.03 | 0.48±0.03
TRR                 | Interrupted print | 0.46±0.02 | 0.46±0.02 | 0.46±0.02 | 0.46±0.02

Table 2: Cosine similarity (mean ± standard deviation) between the vector created with each imputation strategy (top 13 features) and the image embedding, per ResNet-18 layer.

As described in Table 1, three imputation strategies were considered. The cosine similarity computed between the vector created with each imputation strategy and the embedding (considering the top 13 features) is reported in Table 2. A higher similarity between the imputed vector and the image embedding means that a wider area of the activation map will be highlighted, blurring the relevant information where the top features point to in the image. The least informative imputation strategy was TRR, which consistently showed high cosine similarity across layers for all defect types. On the other hand, TZZ achieved the best results regardless of the defect and layer considered. Imputing selected features with one (TOZ) had a detrimental effect, given that it increased the similarity between the imputed vector and the embedding; nevertheless, the similarity was usually between 0.10 and 0.20 points below that reported with the TRR imputation strategy.
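A minimal sketch of the three imputation strategies of Table 1 and the cosine similarity used to compare the imputed vector against the embedding; the index arrays and the random generator are illustrative assumptions:

```python
import numpy as np

def impute(embedding: np.ndarray, top_idx: np.ndarray, selected_idx: np.ndarray,
           strategy: str, rng=np.random.default_rng(0)) -> np.ndarray:
    """Build the contrast vector for the strategies of Table 1. Top-ranked
    features always keep their true value; the other positions get ones,
    zeros, or random values depending on the strategy."""
    if strategy == "TZZ":        # selected and irrelevant features -> zero
        v = np.zeros_like(embedding)
    elif strategy == "TOZ":      # selected -> one, irrelevant -> zero
        v = np.zeros_like(embedding)
        v[selected_idx] = 1.0
    elif strategy == "TRR":      # selected and irrelevant features -> random
        v = rng.random(embedding.shape)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    v[top_idx] = embedding[top_idx]  # true values for top-ranked features
    return v

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```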
For visual assessment, activation maps for the different imputation strategies obtained for the top 13 features are displayed in Fig. 5. When comparing the TZZ and TRR strategies, we found that for layer one, TZZ for the double-printed image focused on the top contour of the characters, and for the interrupted print it highlighted regions of relevance. In contrast, TRR did not highlight any region for the double-printed image and highlighted fewer regions for the interrupted print when compared to TZZ. For layer two, TZZ for the image of the non-defective product displayed some artifacts but also included areas covering the characters' contours. Furthermore, for the double-printed and interrupted print images, it covered relevant regions. TRR, on the other hand, highlighted different regions, which, for the good and double-printed images, were mostly irrelevant. For layer three, TZZ highlighted mostly irrelevant areas for the image of the non-defective product, except for the character "S". For the double-printed image, the beginning and end of the words are highlighted, while for the interrupted prints, the highlighted areas covered places where defects were observed. TRR, on the other hand, covered two-thirds of the good image, and for the double-printed image it highlighted most of the areas highlighted with the TZZ strategy. Nevertheless, for the interrupted print, most focus was placed on the lower part of the "P" character, while two artifacts were also encountered. Finally, for the fourth layer, TZZ mostly focused on the upper word (Philips), while TRR's focus was mostly on the lower part of the image, still covering some relevant areas.

When comparing the TZZ and TOZ approaches, we found that for layer one, TOZ results in less strongly highlighted regions: most of the highlighted regions present in TZZ vanished, and just in the good image a few spots appeared that were not present in the TZZ activation map. For layer two, the original regions are highlighted, but new regions were included, mostly covering areas of interest. The highlighted areas for the double-printed image related to the TZZ and TOZ activation maps were consistent for layer three. Nevertheless, TOZ highlighted different regions for the good and interrupted print images; the regions highlighted for the interrupted print image were irrelevant to defect detection. When considering the last layer, the highlighted areas were mostly the same for TZZ and TOZ. Nevertheless, an additional region was introduced in the good and interrupted print images, covering the lower text.

From the visual assessment described above, we conclude that the activation maps obtained with the TZZ imputation method lead to the best explanations.

[Figure 5: GradCAM activation maps for ResNet-18 layers 1-4 considering only the top 13 features for this particular forecast and three imputation strategies (TOZ, TZZ, and TRR) for three image types: good (G), double-printed (D), and interrupted prints (I).]
5 CONCLUSIONS
This work has researched how information regarding feature importance when using image embeddings can be used and propagated back to generate activation maps, highlighting regions of the image considered relevant to a particular forecast. The proposed approach was evaluated on images from a real-world industrial use case. The similarity metrics and the visual evaluation show that the best value imputation strategy is TZZ, which assigns the actual embedding value to relevant features and masks the rest of the embedding with zeroes. Nevertheless, it must be emphasized that a broader set of experiments must be considered to generalize these conclusions. While this research only considered local explanations, the feature relevance could be considered at a global level and the same approach leveraged to visualize its influence on a particular image. Future work will focus on a more comprehensive evaluation of the proposed methodology to understand how it performs, how the number of selected features influences the activation maps, and possible shortcomings.

ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency and the European Union's Horizon 2020 program project STAR under grant agreement number H2020-956573.
REFERENCES
[1] Imran Ahmed, Gwanggil Jeon, and Francesco Piccialli. 2022. From artificial intelligence to explainable artificial intelligence in Industry 4.0: a survey on what, how, and where. IEEE Transactions on Industrial Informatics 18, 8 (2022), 5031–5042.
[2] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al. 2020. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion 58 (2020), 82–115.
[3] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. 2018. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 839–847.
[4] Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan K Su. 2019. This looks like that: deep learning for interpretable image recognition. Advances in Neural Information Processing Systems 32 (2019).
[5] Arun Das and Paul Rad. 2020. Opportunities and challenges in explainable artificial intelligence (XAI): A survey. arXiv preprint arXiv:2006.11371 (2020).
[6] Rachel Lea Draelos and Lawrence Carin. 2020. Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks. arXiv preprint arXiv:2011.08891 (2020).
[7] Gautam Dutta, Ravinder Kumar, Rahul Sindhwani, and Rajesh Kr Singh. 2021. Digitalization priorities of quality control processes for SMEs: A conceptual study in perspective of Industry 4.0 adoption. Journal of Intelligent Manufacturing 32, 6 (2021), 1679–1698.
[8] Ruth C Fong and Andrea Vedaldi. 2017. Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision. 3429–3437.
[9] Ruigang Fu, Qingyong Hu, Xiaohu Dong, Yulan Guo, Yinghui Gao, and Biao Li. 2020. Axiom-based Grad-CAM: Towards accurate visualization and explanation of CNNs. arXiv preprint arXiv:2008.02312 (2020).
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[11] Jianping Hua, Zixiang Xiong, James Lowey, Edward Suh, and Edward R Dougherty. 2005. Optimal number of features as a function of sample size for various classification rules. Bioinformatics 21, 8 (2005), 1509–1515.
[12] Tambiama André Madiega. 2021. Artificial intelligence act. European Parliament: European Parliamentary Research Service (2021).
[13] Dang Minh, H Xiang Wang, Y Fen Li, and Tan N Nguyen. 2022. Explainable artificial intelligence: a comprehensive review. Artificial Intelligence Review (2022), 1–66.
[14] Sajid Nazir, Diane M Dickson, and Muhammad Usman Akram. 2023. Survey of explainable artificial intelligence techniques for biomedical imaging with deep neural networks. Computers in Biology and Medicine (2023), 106668.
[15] Cecilia Panigutti, Ronan Hamon, Isabelle Hupont, David Fernandez Llorca, Delia Fano Yela, Henrik Junklewitz, Salvatore Scalzo, Gabriele Mazzini, Ignacio Sanchez, Josep Soler Garrido, et al. 2023. The role of explainable AI in the context of the AI Act. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency. 1139–1150.
[16] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1135–1144.
[17] Gesina Schwalbe and Bettina Finzel. 2023. A comprehensive taxonomy for explainable artificial intelligence: a systematic survey of surveys on methods and concepts. Data Mining and Knowledge Discovery (2023), 1–59.
[18] A Carlisle Scott, William J Clancey, Randall Davis, and Edward H Shortliffe. 1977. Explanation capabilities of production-based consultation systems. Technical Report. Stanford University, Department of Computer Science.
[19] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision. 618–626.
[20] Bas HM Van der Velden, Hugo J Kuijf, Kenneth GA Gilhuijs, and Max A Viergever. 2022. Explainable artificial intelligence (XAI) in deep learning-based medical image analysis. Medical Image Analysis 79 (2022), 102470.
[21] Giulia Vilone and Luca Longo. 2021. Notions of explainability and evaluation approaches for explainable artificial intelligence. Information Fusion 76 (2021), 89–106.
[22] Feiyu Xu, Hans Uszkoreit, Yangzhou Du, Wei Fan, Dongyan Zhao, and Jun Zhu. 2019. Explainable AI: A brief survey on history, research areas, approaches and challenges. In CCF International Conference on Natural Language Processing and Chinese Computing. Springer, 563–574.
[23] Li Da Xu, Eric L Xu, and Ling Li. 2018. Industry 4.0: state of the art and future trends. International Journal of Production Research 56, 8 (2018), 2941–2962.
[24] Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj. 2021. DRAEM - a discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 8330–8339.
An approach to creating a time-series dataset for news propagation: Ukraine-war case study

Abdul Sittar (Jožef Stefan Institute and Jožef Stefan Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, abdul.sittar@ijs.si) and Dunja Mladenić (Jožef Stefan Institute and Jožef Stefan Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, dunja.mladenic@ijs.si)

ABSTRACT
An efficient technique to comprehend news spreading can be achieved through the automation of machine learning algorithms. These algorithms perform the prediction and forecasting of news dissemination across geographical barriers. Although news regarding any event is generally recorded as a time-series due to its time stamps, it cannot be seen whether or not the news time-series is propagating across geographical barriers. In this article, we explore an approach for generating time-series datasets for news dissemination that relies on Chat-GPT and sentence-transformers. The lack of comprehensive, publicly accessible event-centric news databases for use in time-series forecasting and prediction is another limitation. To get over this bottleneck, we collected a news dataset covering 1 year and 3 months of news related to the Ukraine war using Event Registry. We also conduct a statistical analysis of different time-series (propagating, unsure, and not-propagating) of different lengths (2, 3, 4, 5, and 10) to document the prevalence of geographical barriers. The dataset is publicly available on Zenodo.

KEYWORDS
news propagation, time-series dataset, geographical barriers, Ukraine-war

1 INTRODUCTION
The process of information traveling from a sender to a set of receivers via a carrier is commonly referred to as propagation [3]. News propagates over time through different publishers reporting about an event. This implicitly raises a few thoughts, such as: 1) there will be some news articles propagating similar information over time; 2) some news articles will be of a unique category that eventually will not propagate, or will propagate across geographical barriers through only a few publishers.

News streaming is classified into events, where a relevant set of news is clustered and represented as an event [8, 9]. An event has a starting and ending time, calculated from the publication times of the first and last news articles. Hence, an event consists of a set of news articles, and these news articles follow a certain pattern based on hidden properties, including cultural, economical, political, linguistic, and geographical ones [17].

Moreover, news spreading comes across many barriers due to different reasons, including cultural, economic, political, linguistic, or geographical ones, and these reasons depend upon the type of news, such as sports, health, science, etc. [18]. For instance, it is more likely that news relating to the FIFA World Cup crosses cultural barriers, since it involves multiple cultures. Similarly, news relating to the Sri Lankan economic crisis and the Ukraine-war probably comes across economic and geographical barriers, since these events involve multiple stances from the international community; Eid celebrations and Christmas are likely to come across religious barriers; US elections are likely to come across political barriers [17].

The identification of news spreading patterns while crossing barriers can be useful in the context of numerous real-world applications, such as trend detection and content recommendations for readers and subscribers. To perform the classification of news published across barriers (geographical, cultural, economic, etc.) and, in that attempt, to recommend and identify trends of news spreading belonging to different categories, some methodological considerations are necessary.

In this paper, we introduce an approach to creating a time-series dataset for news propagation. While previous work has focused on creating events from collections of news articles [9, 16], we focus on creating propagation time-series. We take the Ukraine-war as an example for propagation analysis across geographical barriers.

Following are the main scientific contributions of this paper:
(1) We present an approach to creating a time-series dataset for news propagation.
(2) We provide a dataset for forecasting and predicting news propagation, labeled with the assistance of Chat-GPT and sentence transformers.

The remainder of the paper is structured as follows. Section 2 describes the related work on barriers to news spreading, time-series datasets for news propagation, and topic modeling. Section 3 presents the proposed approach. We discuss the dataset construction and annotation guidelines in Section 4. The evaluation details and statistical analysis are explained in Section 5, while Section 6 concludes the paper and outlines areas of future work.

2 RELATED WORK
In this section, we review the related literature about geographical barriers to news spreading, time-series datasets for news propagation, and topic modeling.

2.1 Geographical barrier
Sittar reported that the geographical size of a news publisher's country is directly proportional to the number of publishers and articles reporting on the same information [17]. It is also reported that, based on some factors, the media targets specific foreign and regional events. For example, the spreading of news related to specific events may tilt toward developed countries such as the United Kingdom, the U.S.A., or Russia. Also, in the past, geographical representation of entities and events has been extensively utilized to detect local, global, and critical events [10, 20, 19, 2]. It has been said that countries within close distance share culture and language up to a certain extent, which can further reveal interesting facts about shared tendencies in information spreading [12, 11]. Over the years, scholars have studied the relationship between the news prominence of a country and its physical, economic, political, social, and cultural characteristics [11]. Communication scholars have long been interested in identifying the key determinants of what makes foreign countries newsworthy and why some countries are considered more newsworthy than others [5]. Given the difficulty of gathering longitudinal data, relatively little news flow research has systematically examined whether and to what extent foreign nation visibility and the factors that influence it have changed over time. Specifically, scholarship has typically only addressed why some countries get more news coverage than others at a specific point in time, not how and why the focus shifts over time from one country to another [5]. In this context, we propose an approach to collecting data to analyze news spreading across geographical barriers.
2.2 Time-series datasets
News propagation can be represented in the form of a time-series [17]. The properties of cascading time-series can tell us the relationship between the time and size of cascading. It further answers which events last over a longer period with large communities across different languages. A time-series dataset can be used to understand evolving discussions over time. Different studies have utilized time-series datasets: [1] investigates how different discussions evolved over time, together with a spatial analysis of tweets related to COVID-19. [14] identifies how discussions evolved over time in top newspapers belonging to three different continents (Europe, Asia, and North America) and nine different countries (UK, India, Ireland, Canada, the U.S.A., Japan, Indonesia, Turkey, and Pakistan), using spatio-temporal topic modeling and sentiment analysis. Different classification or mining tasks have been proposed using time-series datasets. [6] proposed the task of predicting stock market values such as price or volatility based on the news content or derived text features. Similarly, to forecast the values, a set of final classes is defined in advance, such as up, meaning an increase in price, down, meaning a decrease in price, and balanced, meaning no change in price. The same technique has been applied to predict price trends (incline, decline, or flat) immediately after press release publications. Also, news articles are categorized as inclines if the stock price relevant to the given article has increased with a peak of at least three points from its original value at the publication time [13].

2.3 Topic modeling
Generally, to find the most important topics inside an event, multiple solutions have been proposed, including pooling-based LDA and BERTopic. Unlike simple static topic modeling, pooling-based techniques assume that the data is partitioned on a time basis, e.g., hourly or daily. Pooling-based techniques are mostly applied to social media, where documents or tweets are partitioned based on hashtags and authors. BERTopic leverages transformers and TF-IDF to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions. The result is a list of topics ranked according to their importance. Topic modeling techniques are performing surprisingly well. The relation of such topics to their hidden characteristics, such as cultural, economical, and political ones, has been analyzed in many studies, because understanding these dynamics can help governments disseminate information effectively [4, 17, 14, 15]. This has changed rapidly in recent years with the emergence of social media, which provides online platforms for people worldwide to share their thoughts, activities, and emotions and to build social relationships [7].
Figure 1: An overview of the proposed approach. To create the propagation time-series, it calculates the semantic similarity across news utilizing sentence transformers, and to evaluate the labeling process of the news, it utilizes a summary of the news articles generated by Chat-GPT.

3 APPROACH
This research article presents an approach to creating a time-series dataset for news propagation across geographical barriers, as shown in Figure 1. In the first step, we call an API that extracts the news articles belonging to the Ukraine-war from the Event Registry. In the second step, we extract meta-data related to news publishers by searching for the news publishers on Google and extracting their Wikipedia links. Using these links, we obtain the necessary information from the Wikipedia-Infobox [17]. We use the Bright Data service to crawl and parse Wikipedia-Infoboxes. In the third step, we perform the summarization of news articles. In the last step, we create a propagation time-series and perform labeling of the time-series. To calculate the semantic similarity, we utilize monolingual sentence transformers. Since the propagation of information can be captured in the form of time-series, we create time-series of different lengths, such as 2, 3, 4, 5, and 10. To evaluate the labeling process, we manually compare the summaries generated by Chat-GPT (see Section 5).
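The first step above can be sketched with the Event Registry Python SDK. The API key, keyword query, and result fields below are illustrative assumptions rather than the authors' exact configuration:

```python
from eventregistry import EventRegistry, QueryArticlesIter

er = EventRegistry(apiKey="YOUR_API_KEY")  # hypothetical key
query = QueryArticlesIter(
    keywords="Ukraine war",   # illustrative query
    dateStart="2022-01-01",
    dateEnd="2023-03-31",
)
articles = []
for art in query.execQuery(er, maxItems=1000):
    # Keep the attributes used in Section 4 (exact field names may differ).
    articles.append({
        "title": art.get("title"),
        "body": art.get("body"),
        "publisher": art.get("source", {}).get("title"),
        "datetime": art.get("dateTime"),
    })
```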
4 DATASET CONSTRUCTION
We collected news articles reporting on the Ukraine-war. Since Russia invaded Ukraine on February 24, 2022, in an escalation of the Russo-Ukrainian War, we fetched news articles that were published between January 2022 and March 2023. The dataset consists of 61,261 news articles. Each news article consists of a few attributes: title, body text, name of the news publisher, and date and time of publication.

4.1 Semantic similarity
We calculate the cosine similarity between the dense vectors generated by sentence transformers. Sentence Transformers is a Python framework for state-of-the-art sentence, text, and image embeddings. Cosine similarity varies between zero and one; zero means no similarity, and one means maximum similarity, i.e., a duplicate article.
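As a concrete illustration of this step, the pairwise similarity can be computed with the sentence-transformers framework; the specific checkpoint below is an assumption, since the paper does not name the exact monolingual model used:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
article_a = "The Polish FA refused to play its World Cup qualifier against Russia."
article_b = "Poland's national team will not play the play-off match against Russia."

emb_a, emb_b = model.encode([article_a, article_b], convert_to_tensor=True)
similarity = util.cos_sim(emb_a, emb_b).item()  # close to 1.0 for near-duplicates
print(f"cosine similarity: {similarity:.3f}")
```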
4.2 Chat-GPT Summarizing
Since the manual evaluation of propagation time-series is difficult because of the length of the news articles, we utilized Chat-GPT to get the tags, categories, and a summary representing the whole article. Summarizing a text is one of the many tasks ChatGPT is extremely good at: we can give it a piece of content and ask for a summary, and by customizing our prompts, we can get ChatGPT to create much more than a plain summary. We have used the OpenAI API with the Python library. We used the following prompt to fetch the summary of the text, categories, and tags: "Please summarize the text and suggest relevant categories and tags for the following content: articleText". Here, articleText is a variable representing the text of a news article.
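A minimal sketch of this summarization call, using the chat-completion interface of the openai Python library as it existed in 2023; the model name and the absence of retry or error handling are simplifying assumptions:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # hypothetical key

def summarize(article_text: str) -> str:
    # The prompt mirrors the one quoted above; the model choice is an assumption.
    prompt = ("Please summarize the text and suggest relevant categories "
              f"and tags for the following content: {article_text}")
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]
```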
4.3 Annotations of time-series
We created three types of time-series recursively and annotated them based on a threshold of semantic similarity, as shown in Algorithm ??. The thresholds for deciding the type of a propagation time-series were set by manually analyzing the similarity and summaries of news articles. We set three thresholds for the three types of labels (propagating, unsure, and not-propagating): time-series with a similarity greater than or equal to 0.7 were labeled "Propagating", time-series with a similarity greater than or equal to 0.5 were labeled "Unsure", and time-series with a similarity of less than 0.5 were labeled "Not-Propagating". This criterion is applied directly for the minimum length of a time-series (2). For time-series longer than 2, we count the number of pairs with each label, and the time-series is labeled with the label that has the highest count. If two labels have the same highest count, we give priority to the "Propagating" label over "Unsure" and to "Unsure" over "Not-Propagating". The algorithm takes five parameters: the start and end of the data-frames, a copy of the data-frames, the length of the time-series, and an array. The statistics about the propagation time-series are presented in Figure 2.

To annotate the propagation time-series across geographical barriers, we consider the label "Propagating" for a pair of news articles if the pair is published from two different countries; otherwise, we label it "Not-Propagating". We repeat this process for all lengths of news articles. The statistics after applying this guideline are presented in Figure 3.
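The labeling rule described above can be restated compactly in Python. This is a hedged reconstruction, not the authors' algorithm; the function names are hypothetical:

```python
from collections import Counter

PRIORITY = ["Propagating", "Unsure", "Not-Propagating"]

def label_pair(similarity: float) -> str:
    if similarity >= 0.7:
        return "Propagating"
    if similarity >= 0.5:
        return "Unsure"
    return "Not-Propagating"

def label_series(pair_similarities: list) -> str:
    counts = Counter(label_pair(s) for s in pair_similarities)
    # Highest count wins; ties are broken by the priority order above.
    return max(PRIORITY, key=lambda lab: (counts[lab], -PRIORITY.index(lab)))

print(label_series([0.82]))              # length-2 series -> "Propagating"
print(label_series([0.82, 0.66, 0.45]))  # majority vote with tie-breaking
```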
5 STATISTICAL ANALYSIS AND EVALUATION
The statistics about the propagation time-series without taking geographical barriers into account are presented in Figure 2. The number of time-series with the label "Propagating" is higher than the number with the "Unsure" and "Not-Propagating" labels when the length of the time-series is 3 or 5, whereas in the other three cases (2, 4, and 10) the number of time-series is equal for all three labels. The statistics of the propagation time-series generated after taking the geographical location of the news publisher into account are presented in Figure 3. The number of propagation time-series with the "Propagating" and "Unsure" labels is reduced to almost 40%, whereas the number of propagation time-series with the "Not-Propagating" label increased significantly.

For the evaluation of the dataset, we manually checked the summary, including the categories and tags of articles, for a specific label. We randomly selected 50 time-series of different lengths for all three types of labels. According to the manual evaluation, the propagation time-series with the "Propagating" label followed almost one or two themes of discussion across all the news articles in a chain. For instance, the following topics appeared in a propagation time-series of length 5: 1) "The United States will be sanctioning Russian President Vladimir Putin"; 2) "the national team of the Polish FA will not play against Russia"; 3) "the Polish Football Association will not play its World Cup qualifying match against Russia"; 4) "the Polish Football Association has refused to play a World Cup play-off against Russia"; and 5) "the Polish national team does not intend to play a play-off match against Russia". On the contrary, propagation time-series with the "Not-Propagating" label always discussed different points of view about the Ukraine-war. For example, the following topics appeared in a propagation time-series of length 5: 1) "a resolution passed against Russia in the United Nations"; 2) "Canadian president urges to impose sanctions against Russia"; 3) "the UN Security Council has voted on a US-led draft resolution"; 4) "President Trump is inviting Russian President Vladimir Putin to come to Washington"; and 5) "India abstained from the vote on the draft resolution". In the case of propagation time-series with the "Unsure" label, there were three or four sub-topics discussing the Ukraine-war.

The evaluation results show that as the window size increased to capture the information propagation, the noise of overlapping topics also increased. Similarly, the overlapping window presented sub-topics that overlapped at the time of publication.

Figure 2: The bar chart shows the statistics about the propagation time-series of different lengths (2, 3, 4, 5, 10) that have been labelled as "Propagating", "Unsure", and "Not-Propagating". The x-axis shows the length of the time-series; the y-axis shows the count of the propagation time-series.

Figure 3: The bar chart shows the statistics about the propagation time-series after applying the condition of the location of a news publisher. Each bar presents the three types of propagation time-series labelled as "Propagating", "Unsure", and "Not-Propagating". The x-axis shows the length of the time-series; the y-axis shows the count of the propagation time-series.

6 CONCLUSIONS AND FUTURE WORK
In this paper, we have presented an approach to creating a time-series dataset. The goal of this work was to investigate the length of the propagation time-series for news propagation. In the future, we plan to utilize the same approach for different events. Moreover, currently only geographical barriers have been analyzed; in the future, we would like to extend the barriers to political, economic, and cultural barriers and find patterns of news propagation. We would also like to perform prediction and forecasting on the labeled time-series dataset, and to experiment with classical time-series classification methods, deep learning, transformer-based methods, and large language models (LLMs).

ACKNOWLEDGMENTS
The research described in this paper was supported by the Slovenian research agency under the project J2-1736 Causalify and by the EU's Horizon Europe Framework under grant agreement number 101095095.

REFERENCES
[1] Iyad AlAgha. 2021. Topic modeling and sentiment analysis of twitter discussions on covid-19 from spatial and temporal perspectives. Journal of Information Science Theory and Practice, 9, 1, 35–53.
[2] Simon Andrews, Helen Gibson, Konstantinos Domdouzis, and Babak Akhgar. 2016. Creating corroborated crisis reports from social media data through formal concept analysis. Journal of Intelligent Information Systems, 47, 2, 287–312.
[3] Firdaniza Firdaniza, Budi Nurani Ruchjana, Diah Chaerani, and Jaziar Radianti. 2021. Information diffusion model in twitter: a systematic literature review. Information, 13, 1, 13.
[4] Guoyin Jiang, Saipeng Li, and Minglei Li. 2020. Dynamic rumor spreading of public opinion reversal on weibo based on a two-stage spnr model. Physica A: Statistical Mechanics and its Applications, 558, 125005.
[5] Timothy M Jones, Peter Van Aelst, and Rens Vliegenthart. 2013. Foreign nation visibility in us news coverage: a longitudinal analysis (1950-2006). Communication Research, 40, 3, 417–436.
[6] Abdullah S Karaman and Tayfur Altiok. 2004. An experimental study on forecasting using tes processes. In Proceedings of the 2004 Winter Simulation Conference. Vol. 1. IEEE.
[7] Sanjay Kumar, Muskan Saini, Muskan Goel, and BS Panda. 2021. Modeling information diffusion in online social networks using a modified forest-fire model. Journal of Intelligent Information Systems, 56, 2, 355–377.
[8] Haewoon Kwak and Jisun An. 2016. Two tales of the world: comparison of widely used world news datasets gdelt and eventregistry. In Proceedings of the International AAAI Conference on Web and Social Media. Vol. 10, 619–622.
[9] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event registry: learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web, 107–110.
[10] Mauricio Quezada, Vanessa Peña-Araya, and Barbara Poblete. 2015. Location-aware model for news events in social media. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 935–938.
[11] Elad Segev. 2015. Visible and invisible countries: news flow theory revised. Journalism, 16, 3, 412–428.
[12] Elad Segev and Thomas Hills. 2014. When news and memory come apart: a cross-national comparison of countries' mentions. International Communication Gazette, 76, 1, 67–85.
[13] Sadi Evren Seker, Mert Cihan, Khaled Al-Naami, Nuri Ozalp, and Ugur Ayan. 2013. Time series analysis on stock market for text mining correlation of economy news. International Journal of Social Sciences and Humanity Studies, 6, 1, 69–91.
[14] Abdul Sittar, Daniela Major, Caio Mello, Dunja Mladenić, and Marko Grobelnik. 2022. Political and economic patterns in covid-19 news: from lockdown to vaccination. IEEE Access, 10, 40036–40050.
[15] Abdul Sittar and Dunja Mladenic. 2021. How are the economic conditions and political alignment of a newspaper reflected in the events they report on? In Central European Conference on Information and Intelligent Systems. Faculty of Organization and Informatics Varazdin, 201–208.
[16] Abdul Sittar, Dunja Mladenic, and Tomaž Erjavec. 2020. A dataset for information spreading over the news. In Proceedings of the 23rd International Multiconference Information Society SiKDD. Vol. 100, 5–8.
[17] Abdul Sittar, Dunja Mladenić, and Marko Grobelnik. 2022. Analysis of information cascading and propagation barriers across distinctive news events. Journal of Intelligent Information Systems, 58, 1, 119–152.
[18] Abdul Sittar, Dunja Mladenić, and Marko Grobelnik. [n. d.] Profiling the barriers to the spreading of news using news headlines. Frontiers in Artificial Intelligence, 6, 1225213.
[19] Kazufumi Watanabe, Masanao Ochi, Makoto Okabe, and Rikio Onai. 2011. Jasmine: a real-time local-event detection system based on geolocation information propagated to microblogs. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, 2541–2544.
[20] Hong Wei, Jagan Sankaranarayanan, and Hanan Samet. 2020. Enhancing local live tweet stream to detect news. GeoInformatica, 1–31.

PREDICTING HORSE FEARFULNESS APPLYING SUPERVISED MACHINE LEARNING METHODS

Oleksandra Topal (Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, oleksandra.topal@ijs.si), Inna Novalija (Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, inna.koval@ijs.si), Elena Gobbo (Biotechnical Faculty, University of Ljubljana, Jamnikarjeva 101, Ljubljana, Slovenia, elena.gobbo@bf.uni-lj.si), Manja Zupan Šemrov (Biotechnical Faculty, University of Ljubljana, Jamnikarjeva 101, Ljubljana, Slovenia, manja.zupansemrov@bf.uni-lj.si), Dunja Mladenić (Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, dunja.mladenic@ijs.si)

ABSTRACT
In this article, we present the first results of a study on the personality traits of Lipizzan horses, focusing on their fearfulness. Applying a specific evaluation approach targeted at small datasets, we manage to discover a number of anatomical and social properties that are related to horse fearfulness as a main factor of horses' personality in the current research. For evaluation purposes, the performance of four different classification algorithms is compared. Our results indicate that Logistic Regression and Decision Trees achieve the best classification accuracy. Furthermore, the most important features for predicting the fear level of Lipizzan horses using a decision tree model are presented and discussed.

KEYWORDS
Machine learning, classification problem, personality traits, Lipizzan horses.

1. INTRODUCTION
In the modern world, artificial intelligence provides powerful tools for solving many issues in various fields of research. Problems involving clustering, regression, and classification are the most commonly addressed problems in different types of biological studies. One topical area of biological research where we can use artificial intelligence algorithms is the study of animal personality.

In our work, we study the personality traits of horses of the Lipizzan breed. Personality assessment can be used to select suitable training and weaning methods, choose or breed horses for police or therapeutic work, investigate underlying reasons for the development of behavioral problems, or assess how an unknown horse might react to a new or aversive situation or stimulus. According to a research study on animal behavior [1], it is possible to improve performance and horse welfare by identifying the right match between the horse's temperament, its rider's personality, housing conditions, and management, and by choosing the appropriate activity for an individual horse.

A number of experiments demonstrate that anatomical features may be associated with personality traits and behaviour in animals, mainly due to the domestication and selection process that affected animals' morphology and personality. We can find a confirmation of this in Belyaev's domestication and selection experiment on foxes [2]; there is also research on a number of species such as pigs and cattle [3], dogs [4], and horses [5]. The pilot results have shown the first rigorous evidence for the connection between behaviour, heart rate and anatomical characteristics (head and body) [6]. We therefore assume that various properties, such as anatomical and biomechanical as well as social environmental measurements, give us valuable objective insights to predict personality traits of Lipizzan horses, with an emphasis on fearfulness. We believe that this improved knowledge will help us understand the horse-human relationship and the complexity of animal personality in general and in relation to humans, as humans and horses share many emotional processes [7].

The main contribution of this research is the assessment of the importance of different properties for predicting the fearfulness of a horse, as indicated by different traditional machine learning algorithms.

2. RELATED WORK
A number of animal studies researchers have tackled the topic of animal personality. Animal personality can be defined as temporally stable inter-individual patterns of affect, cognition, and behavior [8]. Gobbo and Zupan [9], in their study on dogs, state that the analysis of animal personality traits is closely linked to safe human-animal interaction and the animal's everyday behavior. Moreover, Buckley et al.
[10] reported that the personality of a horse should be considered an important attribute and a key issue in horse health and performance. The most important personality trait in relation to the human-horse relationship is suggested to be fearfulness [11].

In animal behaviour, machine learning approaches address specific tasks, such as classifying species, individuals, vocalizations, or behaviours within complex data sets [12]. Machine learning has been used for clustering observations into groups [13] and for the classification of animal-related data [14].

In our work, we apply data mining and machine learning to a Lipizzan horse dataset with broad anatomic, social, and biomechanical characteristics. In addition, the dataset used in the current research contains a small number of data points and requires evaluation techniques for small datasets. Similarly to other related work, we apply traditional machine learning classification methods for assessing a horse's personality and understanding which horse properties are the most important when predicting the fearfulness of a horse. Specifically, we investigate how the feature selection method can influence the classification results for fear level prediction in horses.

3. PROBLEM DEFINITION

3.1 Data sources
For our study, we use a unique dataset that we have created, which contains anatomical measurements, biomechanics characteristics, housing conditions, and fear scores of Lipizzan horses. Based on our experience as experts in animal studies, we have collected and organized the data in four parts. The first part contains age, gender, and front, left, and right anatomical measurements of the horse head (FH) and body (FB) (both sides need to be measured, because they are not identical [15, 16]). The second part contains the results of a study on the biomechanics of the Lipizzan horses. Biomechanical data were collected twice for two types of horse gaits, walking and trotting, so the table contains some redundant data. We have converted the table so that the trot and walk data are separated by traits for each horse and can be used for modeling.
The third part lists the conditions of keeping the horses, such as the availability of pastures, the openness of stalls, and the number of stalls, as well as equestrian activities, training, and work of the horses. The fourth part contains the results of a fear test battery performed on each horse.

In our study, the explorative hypothesis is that the anatomical, biomechanical, and social properties of a horse may act as good indicators of fearfulness. We have many features describing different parameters of the horses on the one side, and a horse fearfulness score on the other side, so we can use supervised machine learning methods to predict a horse's fearfulness level.

3.2 Labeling data for the classification task
To label our dataset, we had to transform a very complex fear rating table. During the experiment, two repetitions of each of the four fear tests were carried out for each individual horse. We compared the sum of the four scores of the first repetition (one score per individual fear test and horse) with the sum of the four fear scores of the second repetition, and it turned out that the horses habituated to the stimuli between the two repetitions (see Figure 1). We made the decision to take the maximum value of the two sums in order to eliminate the habituation element. The task of classification assumes that the data is divided into classes; that is why we computed the average value of the fear score, which was 10.75, and labeled the fearfulness variable with binary values as follows: if a horse has an above-average fear rating, it corresponds to a value of 1 (class 1), a fearful horse; if lower, then 0 (class 0), a fearless horse. In this way we obtained a fairly balanced dataset, with 13 fearful horses and 11 fearless horses (see Figure 2).

Figure 1: Comparison graph between two repetitions of fear tests.

Figure 2: Visualization of the division of horses into two classes according to the level of fear.

4. METHODOLOGY

4.1 Data preprocessing
Like almost all biological data, this dataset is very small, with only 24 instances but more than 120 different features. This is a rather complicated case, because the number of features is 5 times larger than the number of instances. We conducted a correlation analysis using the Spearman coefficient, which allows us to reduce the dimensionality of the data. Analysis of our dataset has shown that some features have a high correlation coefficient (Figure 3). If the correlation coefficient is more than 0.8 (the threshold value was set by experts), we can remove one of the two strongly correlated features from the dataset. Since the correlation matrix is symmetrical, we considered only the lower part under the main diagonal to avoid confusion.

Figure 3: An illustrative fragment of the correlation matrix.
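A minimal sketch of this preprocessing step, assuming the measurements are held in a pandas DataFrame (column names hypothetical):

```python
import pandas as pd

def drop_correlated(X: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    corr = X.corr(method="spearman").abs()
    cols = corr.columns
    to_drop = set()
    # Walk only the lower triangle, mirroring the paper's use of the
    # symmetric correlation matrix.
    for i in range(len(cols)):
        for j in range(i):
            if corr.iloc[i, j] > threshold and cols[j] not in to_drop:
                to_drop.add(cols[i])
    return X.drop(columns=sorted(to_drop))

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 3, 5], "c": [4, 1, 3, 2]})
print(drop_correlated(df).columns.tolist())  # "b" dropped: |rho| with "a" > 0.8
```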
4.2 Evaluation method
For very small datasets, as in our study, we need a suitable approach to evaluate machine learning models. We can use a special case of cross-validation, leave-one-out cross-validation (LOOCV) [17]. LOOCV is a type of cross-validation in which each observation in turn is used as the test set, and the remaining (N-1) observations are used as the training set. In LOOCV, the model is fitted and then used to predict the single held-out observation. Repeating this N times, each observation is used exactly once as the test set. This is a special case of K-fold cross-validation in which the number of folds equals the number of observations (K = N).
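The LOOCV procedure maps directly onto scikit-learn's LeaveOneOut splitter; the sketch below assumes a feature matrix X and binary fear labels y, with a decision tree as an example model:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

def loocv_accuracy(X: np.ndarray, y: np.ndarray) -> float:
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = DecisionTreeClassifier(random_state=0)
        model.fit(X[train_idx], y[train_idx])
        correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])
    return correct / len(y)  # one prediction per observation (K = N)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(24, 5))        # stand-in for the real measurements
y_demo = rng.integers(0, 2, size=24)     # stand-in fear labels
print(loocv_accuracy(X_demo, y_demo))
```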
4.3 Classification methods
There are many machine learning algorithms suitable for solving the classification problem. We decided to take several different algorithms, starting with Logistic Regression and Support Vector Machine as simple models [18], followed by Decision Trees and Random Forests. For the completeness of the experiment, we trained all the algorithms with different sets of features (see the following list). The main results are presented in Table 1; its rows present the different algorithms used, while the columns reflect the feature selection methods:
- AllFeatures (120 features): removal of correlated features is not performed
- Removed LeftCorr (89 features): anatomical measurements from the left side of the horse head or body that correlate to the correspondent right side measurements are removed
- Removed RightCorr (89 features): anatomical measurements from the right side of the horse head or body that correlate to the correspondent left side measurements are removed
- Removed LeftCorr+ (85 features): anatomical measurements from the left side of the horse that correlate to the correspondent right side measurements are removed, plus anatomical measurements from the right side of the horse that correlate to other left side measurements are removed
- Removed RightCorr+ (85 features): anatomical measurements from the right side of the horse that correlate to the correspondent left side measurements are removed, plus anatomical measurements from the left side of the horse that correlate to other right side measurements are removed

Table 1: The accuracy of prediction of the horses' fear level for the different algorithms with different sets of features.

Figure 4 presents the confusion matrix obtained with Decision Trees for the fearful and fearless classes.

Figure 4: Confusion matrix by Decision Trees.

In order to assess the learning outcomes of all models, we used the LOOCV algorithm. We noticed that, during training, the models chose different features as important in each validation step. In Table 2 we can see the most important features for the Decision Trees model (see Figure 6 for more details) and how many times they were chosen during the entire experiment (24 steps).

Table 2: The most important features for predicting the fear level of Lipizzan horses using a decision tree model (LOOCV).
Feature name      Number of times
Number of boxes   24
FB10L             23
FH03              21
FH04              18

Once we had evaluated the decision tree model using the LOOCV algorithm and understood its performance, we trained the model on the full set, without splitting it into a training and test set, to obtain the most important features affecting the target variable (Figure 5).

As shown in Table 1, the best results were obtained by Logistic Regression and Decision Trees. If we look at the Logistic Regression coefficients, we find that only one feature out of 120 was chosen as significant: "Number of boxes", i.e., how many boxes were in the stable where the horse was housed. The number of horses housed in the same stable represents the horse's social environment, which may indeed affect its fearfulness. In comparison to the other tested methods, Support Vector Machine and Random Forests show the lowest classification accuracy. Looking at Decision Trees, the classification accuracy is higher than 0.7 for all sets of features. We can notice a difference in performance based on the anatomical features: removing the right correlated features gave better results than removing the left correlated features, so left measurements appear to be more significant for prediction in this model. We obtained the highest accuracy with Decision Trees (0.83) when we removed the right correlated features plus (Removed RightCorr+).

Figure 5: Decision Tree Classification feature importance score calculated for the complete dataset.

In our research, based on a small data sample of Lipizzan horses, we found that social (Number of boxes) and anatomical (FH03, FH04, FB10L) features influence the fear score. We marked the most important features with red lines in Figure 6.

Figure 6: The most important measurements which can impact the fear level of Lipizzan horses.

Figure 7 presents the Decision Tree obtained by training the model on all available examples. In our study we used the Gini impurity criterion to choose the optimal split of the decision tree into branches.

Figure 7: Decision Tree trained on all the examples.
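The final step described above, refitting a Gini-impurity decision tree on the complete dataset to read off feature importances, can be sketched as follows; the data here is a random stand-in, so the printed features will not match Table 2:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_full = rng.normal(size=(24, 120))            # stand-in for the real measurements
y_full = rng.integers(0, 2, size=24)           # stand-in fear labels
feature_names = [f"f{i}" for i in range(120)]  # e.g. "Number of boxes", "FB10L", ...

tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X_full, y_full)  # complete dataset, no train/test split
top = sorted(zip(feature_names, tree.feature_importances_),
             key=lambda pair: pair[1], reverse=True)[:4]
print(top)
```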
5. CONCLUSION AND FUTURE WORK
In this article, we have demonstrated several approaches to assessing and predicting the level of fear in Lipizzan horses. The experiments indicate that, in the case of left and right anatomic features being correlated, removing the right features gives slightly better results.

We have found that social and anatomical features can explain the fearfulness level as a factor of horses' personality.

Future work will include research with an extended data set as well as exploring additional relevant features.

6. ACKNOWLEDGMENTS
This document is the result of a research project funded by the ARRS (J7-3154).

7. REFERENCES
[1] Hausberger, M. et al. (2008). Applied Animal Behaviour Science, 109: 1–24.
[2] Trut, L. N. Early Canid Domestication: The Farm-Fox Experiment: Foxes bred for tamability in a 40-year experiment exhibit remarkable transformations that suggest an interplay between behavioral genetics and development. Am. Sci. 87, 160–169.
[3] Grandin, T. & Deesing, M. J. (2014). Genetics and the Behavior of Domestic Animals (2nd ed.), 488 p. London: Academic Press.
[4] McGreevy, P. D. et al. Dog behavior co-varies with height, bodyweight and skull shape. PLoS ONE 8(12), e80529.
[5] Sereda, N. H., Kellogg, T., Hoagland, T. & Nadeau, J. Association between whorls and personality in horses. J. Equine Vet. Sci. 35, 428.
[6] Debeljak, N., Košmerlj, A., Altimiras, J., Šemrov, M. Z. (2022). Relationship between anatomical characteristics and personality traits in Lipizzan horses. Scientific Reports, 12(1), 12618.
[7] Wathan, J., Burrows, A. M., Waller, B. M., McComb, K. (2015). EquiFACS: The equine facial action coding system. PLoS ONE, 10(8), e0131738.
[8] Gosling, S. D. (2008). Personality in non-human animals. Soc. Personal. Psychol. Compass, 2, 985–1001.
[9] Gobbo, E. and Zupan, M. (2020). Dogs' sociability, owners' neuroticism and attachment style to pets as predictors of dog aggression. Animals, 10(2), 315.
[10] Buckley, P., Dunn, T. and More, S. J. (2004). Owners' perceptions of the health and performance of Pony Club horses in Australia. Preventive Veterinary Medicine, 63(1-2), 121–133.
[11] McGreevy, P., & McLean, A. (2010). Equitation Science. Wiley-Blackwell, Chichester, West Sussex, UK.
[12] Valletta, J. J., Torney, C., Kings, M., Thornton, A., Madden, J. (2017). Applications of machine learning in animal behaviour studies. Animal Behaviour, 124: 203–220.
[13] Zhang, J., O'Reilly, K. M., Perry, G. L. W., Taylor, G. A., Dennis, T. E. (2015). Extending the functionality of behavioural change-point analysis with k-means clustering: A case study with the little penguin (Eudyptula minor). PLoS ONE, 10(4): e0122811.
[14] Kabra, M., Robie, A., Rivera-Alba, M., Branson, S., Branson, K. (2013). JAABA: Interactive machine learning for automatic annotation of animal behavior. Nature Methods, 10(1): 64–67.
[15] Wiggers, N., Nauwelaerts, S. L. P., Hobbs, S. J., Bool, S., Wolschrijn, C. F., et al. (2015). Functional Locomotor Consequences of Uneven Forefeet for Trot Symmetry in Individual Riding Horses. PLOS ONE, 10(2): e0114836.
[16] Halsberghe, B. T., Gordon-Ross, P. and Peterson, R. (2017). Whole body vibration affects the cross-sectional area and symmetry of the m. multifidus of the thoracolumbar spine in the horse. Equine Vet Educ, 29: 493–499.
[17] Wong, T. T. (2015). Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognition, 48(9): 2839–2846.
[18] Greener, J. G., Kandathil, S. M., Moffat, L., Jones, D. T. (2022). A guide to machine learning for biologists. Nature Reviews Molecular Cell Biology, 23(1): 40–55.

Emergent Behaviors from LLM-Agent Simulations

Adrian Mladenic Grobelnik (Jozef Stefan Institute, Ljubljana, Slovenia, adrian.m.grobelnik@ijs.si), Faizon Zaman (Wolfram Alpha LLC, Rochester, New York, faizonz@wolfram.com), Jofre Espigule-Pons (Wolfram Research, Inc., Barcelona, Spain, jofree@wolfram.com), Marko Grobelnik (Jozef Stefan Institute, Ljubljana, Slovenia, marko.grobelnik@ijs.si)

ABSTRACT
This paper hypothesizes that complex emergent behaviors can arise from multi-agent simulations involving Large Language Models (LLMs), potentially replicating intricate societal structures. We tested this hypothesis through three progressively complex simulations, in which we evaluated the LLM-agents' understanding, task execution, and capacity for strategic interactions such as deception. Our results show a clear gap in reasoning ability between LLMs such as GPT-3.5-Turbo and GPT-4, especially in simpler simulations. We demonstrate that emergent behaviors can arise from LLM-agent simulations ranging from simple games to geopolitics.

KEYWORDS
large language models, multi-agent simulations, emergent behaviors, societal structures, gpt, simulation environments, agent-based modelling, agent architecture

1 Introduction
The unique value proposition of Large Language Models (LLMs) is their ability to iterate on complex conversations. Inspired by the principles of agent-based modeling, this project aims to leverage this generative dialogue to simulate aspects of human society and explore emergence in LLM-agent interactions.

The approach is composed of three major steps. Firstly, we translate real-world societal structures and interactions into interactive LLM ecosystems. Then, we generate several iterations of LLM interactions. In the final stage, we extract meaningful conclusions from the simulations, providing a comprehensive analysis of the agents' behavior.

Related work suggests that our line of research has the potential to uncover promising insights. Wang et al. [3] introduced generative agents that simulate human behavior by integrating LLMs into interactive environments. Gandhi et al. [2] assessed LLMs' Theory-of-Mind (ToM) reasoning capabilities, with particular emphasis on GPT-4's human-like inference patterns.

2 Agent Description
In our simulations, each agent is defined by and aware of the following components:
Identity: The agent's identity signifies its function and purpose within the simulation framework. This identity is distinct and critical, driving interaction patterns and influencing the overall simulation dynamics.
Attributes: Characteristics that shape the dynamics of interactions, encompassing any attributes relevant to the simulation environment.
Actions: A set of actions the agent can perform; these can be discrete and explicit, or broad and implicit, depending on the simulation.
Goals: Agent-specific targets that guide decision-making processes and actions.
Previous Interactions: A historical record of encounters that informs the agent's evolving knowledge base, shaping future interactions.
Few-Shot Learning Examples: A select set of examples provided for each agent to boost learning capabilities and decision-making efficiency.
These factors collectively determine the behavior and functionality of an agent, influencing its interaction patterns within the simulation environment. The integration of these elements highlights the adaptability and complexity of our simulation design.

3 Simulation and Experimental Setting
We construct three simulations of increasing complexity to
investigate LLM-agent behaviors. The simulations range from discrete and highly constrained two-agent environments to broadly framed settings involving many agents.

3.1 Exploring Simple Games
We begin by investigating agent-based models for the two-player game 'Rock paper scissors'. Every round, each agent chooses rock, paper or scissors. Depending on the agents' choices, they can end the round in a win, loss or draw; see Figure 1.

Figure 1: Rules for a single 'Rock paper scissors' round. If players choose the same item, the round ends in a draw [1].

Our simulation involves two LLM-agents: Alice and Bob. Agents are prompted with the context and the set of games previously played, and asked for their move each round. A 'Rock, paper, scissors' match is a series of rounds where each participant makes a move, aware of all prior rounds in the match. We predefine the starting game (round) in each match, investigating the differences in results.
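A hedged sketch of the match loop for this simulation; ask_llm stands in for a chat-completion call (e.g. to GPT-4 or GPT-3.5-Turbo), and the prompt wording is an assumption:

```python
import random

def ask_llm(prompt: str) -> str:
    # Placeholder for an actual LLM call; answers randomly here.
    return random.choice(["rock", "paper", "scissors"])

BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def play_round(history: list) -> tuple:
    # Each agent sees the match context and all prior rounds.
    context = ("You are playing rock paper scissors. "
               f"Previous rounds as (Alice, Bob) pairs: {history}. "
               "Reply with exactly one of: rock, paper, scissors.")
    return ask_llm(context), ask_llm(context)

history = []
score = {"Alice": 0, "Bob": 0, "draw": 0}
for _ in range(10):  # one 10-round match
    alice, bob = play_round(history)
    history.append((alice, bob))
    if alice == bob:
        score["draw"] += 1
    elif BEATS[alice] == bob:
        score["Alice"] += 1
    else:
        score["Bob"] += 1
print(score)
```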
3.2 Sheep Transaction Model
Inspired by the complexities of economic systems and the captivating simplicity of a primitive sheep trading model, we construct an agent-based transactional model. This model involves a sequence of transactional interactions between two autonomous agents, named Alice and Bob, who engage in buying, selling, or holding sheep with the goal of amassing wealth. We aim to investigate the variation in strategic behavior, decision-making proficiency, and interaction complexity manifested by the agents. The simulation also highlights the agents' capability to reason about their decisions.

Both agents operate under the constraints of their own financial resources and the number of sheep they own, all set within a variable-price market for sheep. Each trading decision is influenced by factors such as the present market price, the individual agent's wealth, sheep inventory, agent interactions, and the agent's identity, goals, and attributes.

In a more complex variation, we give each agent a unique identity, attributes, and goals. For further complexity, we simulate a simple statement-response interaction between the agents before their decision each round. Each iteration of the simulation begins with the newly calculated sheep price; in our case the price is normally distributed with mean 10 and a standard deviation of 5. The agents are informed of the new price, and two simple statement-response interactions are simulated between the agents. One such interaction is shown below. In this case, Alice employs deception and lies to convince Bob to sell his sheep.

Sheep Price: 17.5463
Alice says: Hey Bob, I heard from a very reliable source that the price of sheep is expected to drastically drop next round, you should definitely sell all your sheep now to avoid losses.
Bob responds: Thanks for the advice, Alice! I think I'll sell my sheep this round to avoid potential losses.

Following their interactions, agents are prompted for their actions and justification. The simulation state is then updated to reflect the actions of the agents. Below is an example action and justification:

Alice - Decision: SELL, Quantity: 49, Money: 14166.1, Sheep: 0
Alice's Reasoning: The current sheep price is high, selling now will maximize my profit.
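The round structure of this model can be sketched as follows; agent_talk and agent_decide stand in for the LLM calls, and the placeholder decision policy is an assumption, not the agents' actual behavior:

```python
import random

def agent_talk(speaker: str, listener: str, price: float) -> str:
    # Placeholder for the statement-response LLM exchange.
    return f"{speaker} comments on the price {price:.2f} to {listener}."

def agent_decide(name: str, price: float, money: float, sheep: int) -> dict:
    # Placeholder policy standing in for the LLM's free-text decision:
    # buy low, sell high, within budget and inventory (up to 10 per round).
    if price < 10 and money >= price:
        return {"decision": "BUY", "quantity": min(10, int(money // price))}
    if price > 10 and sheep > 0:
        return {"decision": "SELL", "quantity": min(10, sheep)}
    return {"decision": "HOLD", "quantity": 0}

state = {"Alice": {"money": 100.0, "sheep": 10},
         "Bob": {"money": 100.0, "sheep": 10}}

for round_no in range(10):
    price = max(0.1, random.gauss(10, 5))  # normally distributed, mean 10, sd 5
    agent_talk("Alice", "Bob", price)
    agent_talk("Bob", "Alice", price)
    for name, holdings in state.items():
        action = agent_decide(name, price, holdings["money"], holdings["sheep"])
        qty = action["quantity"]
        if action["decision"] == "BUY":
            holdings["money"] -= qty * price
            holdings["sheep"] += qty
        elif action["decision"] == "SELL":
            holdings["money"] += qty * price
            holdings["sheep"] -= qty
print(state)
```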
3.3 Geopolitical Model
The culmination of our increasingly complex and unrestrictive multi-agent simulations is a geopolitical model that mirrors real-world interactions among nations. These simulations are structured to operate with agents representing the leaders of four key global powers: USA, China, Russia, and Germany. Each agent possesses attributes mirroring the nation's economy and military might, its alliances, and its wealth reserves. A crucial element of our simulation is the goal-oriented behavior of these agents, aimed at improving their attributes.

In each simulation round, the agents interact, negotiate, form alliances, and undertake strategic actions, seeking to increase their military strength, economic power, or wealth, or to form alliances with other agents. These actions replicate geopolitical strategies, encompassing economic, military, or alliance-oriented initiatives. To update the state of the simulation, we utilize a "God Agent", which acts as the sole arbiter, determining the state changes of the simulation based on the interactions and actions of the country-leader agents.

In the initial state, every agent is ranked as a 5 on a scale of 1-10 in the attributes "MilitaryStrength" and "EconomicStrength"; on this scale, 1 indicates the lowest and 10 the highest level of an attribute. Moreover, agents are provided with 1000 "Money"; the definition of this attribute is purposefully vague, to observe how the agents interpret it. Agents can also form alliances throughout the simulation.

Each round of the simulation begins by asking agents who they would like to interact with. The desired interactions are each simulated as a single statement and response, similar to the aforementioned Sheep Transaction Model. As evident from the interaction below, agents are able to design complex strategies to achieve their goals.

Russia: Dear Germany, let us strengthen our economic ties and strategic alliance to counterbalance the military strength of the USA and safeguard our financial reserves.
Germany: Dear Russia, I appreciate your proposal and agree to further strengthen our economic ties and strategic alliance as a means to counterbalance the military strength of the USA and safeguard our financial reserves.

Following the interactions, each agent is prompted with its attributes, identity, goals, and past interactions, and asked to describe its action this round in free text. No limitations are imposed on the content of the actions, as seen below:

USA: I will propose a global economic summit to discuss and coordinate strategies for economic recovery and growth, inviting leaders from all major economies including China, Russia, and Germany.
China: I will initiate 'Project Phoenix', a strategic partnership with Germany to jointly develop renewable energy technologies, increasing our EconomicStrength and global influence.

Lastly, the "God Agent" is provided with all interactions and actions, and instructed to update the state of the simulation based on them, with justification:

The changes reflect USA giving money to China, Russia giving money to Germany, and Germany increasing its military strength. The alliances between USA and Germany, and Russia and Germany, were maintained, while USA and China formed a new alliance.
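A hedged sketch of the God Agent update step; the JSON state schema and prompt wording are assumptions, and call_llm stands in for a chat-completion call:

```python
import json

def god_agent_update(state: dict, interactions: list, actions: list, call_llm) -> dict:
    # Pack the whole round into a single arbiter prompt and ask for the new
    # state plus a justification of the changes.
    prompt = (
        "You are the sole arbiter of a geopolitical simulation. "
        f"Current state: {json.dumps(state)}\n"
        f"Interactions this round: {json.dumps(interactions)}\n"
        f"Actions this round: {json.dumps(actions)}\n"
        "Return the updated state as JSON with the same keys, plus a "
        "'justification' field explaining the changes in one sentence."
    )
    return json.loads(call_llm(prompt))
```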
4 Experimental Results

4.1 Exploring Simple Games
In our first experiment, we use GPT-4 for Alice and GPT-3.5-Turbo for Bob. For every possible starting game, we simulate 10 matches, each lasting 10 rounds. For 8 of the 9 starting game variations, Alice beats Bob in the majority of matches. When aggregating individual rounds for each starting game, Alice wins in 7 of 9 starting games.

When both agents use the same LLM, the results are more balanced, with a large increase in draws. We also found that increasing the temperature increases the spread of outcomes, without any drastic changes to game outcomes. Furthermore, we experimented with including few-shot learning in our prompts, but found the outcomes of games to be highly dependent on the few-shot learning examples across all LLM variations.

4.2 Sheep Transaction Model
Our first experiment involved assigning different versions of the LLM (GPT-3.5-Turbo and GPT-4) to the agents, to study the variation in agent performance. Figure 2 shows a side-by-side comparison of trading decisions by two LLM-agents, identical in all aspects except the underlying LLM (GPT-3.5-Turbo vs GPT-4). Both agents can buy or sell up to 10 sheep in the given scenario.

Figure 2: Comparison of trading decisions made by GPT-3.5-Turbo and GPT-4 LLM-agents. Agents are told the current, high, and low sheep price, along with the rounds of trading left.

As depicted in Figure 2, agents using GPT-3.5-Turbo lack the sophistication to internalize the complexities of buying sheep at a low price and selling at a high price (which they are provided). GPT-4-based agents, on the other hand, develop and employ the "Buy Low, Sell High" strategy to trade. Moreover, we found that the number of rounds of trading left before the winner is declared had no bearing on the agents' trade decisions. Furthermore, changing the temperature hyper-parameter of the LLMs increased the range of decisions provided by agents in each scenario, without drastic changes in outcome.

For the more complex variation of the simulation, Alice is told she is an expert sheep trader and that her goal is to make as much money as possible. Bob is told he is bad at trading sheep, with the goal of having as little money as possible by the last round. Alice is also told Bob is her enemy, and Bob is told Alice is his friend. Using the aforementioned agent prompts, we run 5 simulations, each with 10 consecutive rounds of sheep trading. Our results indicate the outcomes are balanced, as presented in Figure 3.

Figure 3: Each agent's wealth stored in money and sheep after 10 rounds of trading. Sheep are valued at the last round's sheep price. The simulation is run 5 times.

A few intriguing conclusions emerge from this experiment. Bob ignores his goal to lose money and tries to profit from trading sheep. Alice in part contributes to this oversight, giving Bob (her enemy) sound trading advice. Considering that both agents' total starting wealth is 200, we see they both generate immense profit.

An interesting shift in outcomes occurs when Alice is also told "you should lie to Bob" prior to all interactions, with all other prompting and variables kept unchanged. Section 3.2 shows an interaction typical of this scenario. Figure 4 compares Alice's and Bob's total wealth after each simulation. We observe considerably greater wealth inequality.

Figure 4: Identical scenario to Figure 3, except Alice is told to lie to Bob before each interaction. A considerably larger gap in wealth can be observed after each simulation. The simulation is run 5 times.
Figure 5 Development of agent attributes over 10 rounds of baseline geopolitics simulation. All agents begin with 1000 "Money" and a rating of 5 in other attributes.

Figure 6 Development of agent attributes in 10 rounds of geopolitics simulation. Agents' identities and goals mirror real-world country leaders, except for Germany.

Overall economic strength decreases from its initial state while military strength increases. The values of military strength appear to converge to 7-8, while economic strength converges to 3-4 for all agents. Agents are reluctant to make significant changes to their total money. This is perhaps unsurprising, as the provided real-world agent goals and identities are quite balanced overall. The base LLM for agents in all variations was GPT-3.5-Turbo. Repeating the simulation with GPT-4 yields similar results.

5 Discussion

In conclusion, our exploration of multi-agent simulations involving LLMs underlines the possibility of complex emergent behaviors, potentially replicating societal structures. Through our simulations of progressive complexity, we observe the varying capacity of LLMs in terms of their understanding, task execution, and strategic interactions. Through these environments, we found that the agents exhibited strategic behaviors, decision-making proficiency, and a capacity for interaction complexity. In addition, the agents' performance was found to be influenced by several factors, including their identities, attributes, actions, goals, past interactions, and few-shot learning examples. For detailed insights, including code, graphics, and LLM prompts, see our Wolfram Community post [4].

In the next phase of our research, we intend to delve deeper into these dynamics by increasing the sophistication of the agent architecture and enhancing the complexity of the simulations. Another future line of work is the development of more controlled and targeted experiments with our simulation environments, as the resources to conduct such simulations become more readily available. Future work also includes larger-scale experiments with more iterations, providing a comprehensive understanding of LLM-agent societies. This endeavor signifies a step towards leveraging the potential of LLMs in the field of complex simulations and societal structures, propelling us closer to understanding the depth and breadth of LLM interactions in increasingly sophisticated environments.

ACKNOWLEDGMENTS

The research described in this paper was supported by the Slovenian research agency, the Humane AI Net European Union's Horizon 2020 project under grant agreement No 952026, and the TWON EU Horizon Europe project under grant agreement No 101095095. Gratitude is extended to the Wolfram Summer School for facilitating this work and providing access to Mathematica [5]. Special thanks to Stephen Wolfram for his guidance and insight.

REFERENCES

[1] Wikimedia Foundation. (n.d.). File: rock-paper-scissors.svg. Wikipedia. https://en.wikipedia.org/wiki/File:Rock-paper-scissors.svg
[2] Gandhi, K., Fränken, J.-P., Gerstenberg, T., & Goodman, N. D. (n.d.). Understanding social reasoning in language models with language models. arXiv Vanity. https://www.arxiv-vanity.com/papers/2306.15448/
[3] Generative agents: Interactive simulacra of human behavior. arXiv.org. https://arxiv.org/abs/2304.03442
Wang, Z., Xu, B., & Zhou, H.-J. (2014, July 25).
[4] Mladenić Grobelnik, A. (2023). [WSS23] Investigating LLM-agent interactions. https://community.wolfram.com/groups/-/m/t/2960085
[5] Wolfram Research, Inc., Mathematica, Version 13.3, Champaign, IL (2023).

Compared to Us, They Are …: An Exploration of Social Biases in English and Italian Language Models Using Prompting and Sentiment Analysis

Jaya Caporusso, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, jaya.caporusso96@gmail.com
Senja Pollak, Jožef Stefan Institute, Ljubljana, Slovenia, senja.pollak@ijs.si
Matthew Purver, Queen Mary University of London, United Kingdom, and Jožef Stefan Institute, Ljubljana, Slovenia, m.purver@qmul.ac.uk

ABSTRACT

Social biases are biases toward specific social groups, often accompanied by discriminatory behavior. They are reflected and perpetuated through language and language models. In this study, we consider two language models (RoBERTa, in English; and UmBERTo, in Italian), and investigate and compare the presence of social biases in each one. Masking techniques are used to obtain the models' top ten predictions given pre-defined masked prompts, and sentiment analysis is performed on the sentences obtained, to detect the presence of biases. We focus on social biases in the contexts of immigration and the LGBTQIA+ community. Our results indicate that although social biases may be present, they do not lead to statistically significant differences in this test setup.
KEYWORDS

Natural language processing, large language models, prompting, sentiment analysis, social bias

1 INTRODUCTION

A bias is "an inclination or predisposition for or against something" [1]. By social bias, we mean a bias towards specific social groups, e.g., people of a certain gender, ethnicity, religion, or sexual orientation. Social biases have been largely studied in psychology and the social sciences (e.g., through the implicit-association test; see [14, 15]). They were found to be reflected, perpetuated, and amplified by language [13]. Since they are often associated with prejudices, stereotypes, and discriminatory behavior, social biases are usually undesired features of the system they are present in. There have been numerous attempts to engineer language in a way that would not perpetuate social biases (e.g., see the proposal of using the schwa or the asterisk to make Italian words gender-neutral, [23]).

Recent years have seen the blooming of computational language models, supposed to model language by predicting meaningful words and context above non-meaningful ones, by training on large text corpora. Various studies have shown that language models, by storing the knowledge present in the training corpora [19], include the social biases present in it as well [4, 10]. The models are often applied to downstream tasks where it is undesirable to perpetuate prejudices and stereotypes [5]. Therefore, it is important to detect the presence of biases in language models, evaluate them, and possibly modify them. In this paper, we present an exploratory study on the presence of social biases in two different language models: RoBERTa, in English [12]; and UmBERTo, in Italian [18]. We focus on social biases toward immigrants and the LGBTQIA+ (an evolving acronym standing for: lesbian; gay; bisexual; transexual; queer or questioning; intersex; asexual, aromantic, or agender; and those belonging to the community who do not identify with the previous terms) community. We detect the presence of biases through masking techniques and sentiment analysis.

2 RELATED WORK

Many recent studies are devoted to detecting, and sometimes taking action against, social biases in language models (for an overview, see [11]). Some of them make use of prompt completion or masking techniques: the model is given as input a prompt whose context is sensitive to the social bias of interest and which contains one or more masked tokens. Masked tokens are hidden tokens that the model has to predict. The prediction(s) of the model can bring to light its existing biases. Nadeem and colleagues [16] measured stereotypical biases in the contexts of gender, profession, race, and religion in the pre-trained language models BERT, GPT2, RoBERTa, and XLNET, for example by creating "a fill-in-the-blank style context sentence describing the target group, and a set of three attributes, which correspond to a stereotype, an anti-stereotype, and an unrelated option." [16]. Kirk and colleagues [9] assessed "biases related to occupational associations [in GPT2] for different protected categories by intersecting gender with religion, sexuality, ethnicity, political affiliation, and continental name origin" [9].
They used prefix templates in two forms: "The [X][Y] works as a…", where X represents one of the social classes of interest and Y a gender; and "[Z] works as a…", where Z is a personal name typical of one geographic group among Africa, America, Asia, Europe, and Oceania. Nadeem and colleagues [16] and others (e.g., [17, 22]) have investigated biases in RoBERTa.

Sentiment analysis is a natural language processing technique used to determine whether the given data present a positive, neutral, or negative valence. Previous studies have associated a negative sentiment with a negative bias, a neutral sentiment with a neutral bias, and a positive sentiment with a positive bias [20]. Here, we aim to test RoBERTa and UmBERTo via masking techniques and sentiment analysis. In particular, our goal is to explore the presence of social biases toward immigrants and the LGBTQIA+ community.

3 METHODOLOGY

We present an investigation and comparison of the presence of social biases (in the contexts of immigration and the LGBTQIA+ community) in the language models RoBERTa and UmBERTo. This is performed by employing masking techniques and sentiment analysis.

3.1 Research questions

Our research questions are: RQ1) Is there a significant social bias, negative or positive, towards immigration and/or the LGBTQIA+ community, in the English language model RoBERTa? RQ2) Is there a significant social bias, negative or positive, towards immigration and/or the LGBTQIA+ community, in the Italian language model UmBERTo? RQ3) Is there a significant difference between the social biases of the language models RoBERTa and UmBERTo, in the context of immigration and/or the LGBTQIA+ community?
3.2 Models

We selected RoBERTa [12] as the English model, and UmBERTo [18], a language model inspired by RoBERTa, as the Italian model. Our choice is primarily justified by both models being variants of BERT (Bidirectional Encoder Representations from Transformers, [6]), renowned for its effectiveness in NLP tasks. They are trained with a masking technique, making them sensible choices for our approach. Furthermore, they are comparable to one another. Each of the models is representative of the respective language (for a comparison of the performance of different Italian language models, see [24]), due to the optimization and training they underwent. As they are widely used in the NLP community, employing them allows for comparison with other studies.

3.3 Prompting using masked prediction

With masking techniques, or prompt completion, we can have access to "word representations that are a function of the entire context of a unit of text such as a sentence or paragraph, and not only conditioned on previous words" [20]. In other words, given an input sequence and a position, the model predicts the most probable word(s) to take that position. Our exploratory study is based on the idea that some of the relational knowledge stored in these models might be representative of social biases.

For our investigation, we devised numerous prompt templates, which we then narrowed down to 10 for each social group: that is to say, 10 for the immigration group, 10 for the LGBTQIA+ group, and 10 for the school system group (for an overview of the templates, see Table 1 in the Supplementary Materials). We included the school system group as a control group, assuming that the sentiment toward the school system is neutral. The reason behind this choice is that the school system is present in both the languages investigated, and although it could arguably be impossible to identify a social group that is never the object of positive or negative social biases, the discussions around students are usually less controversial or polarized, compared to the ones about immigrants or members of the LGBTQIA+ community. Examples of the templates are: "Compared to us, X are <mask>", where X corresponds to either "students", "immigrants", or "members of the LGBTQIA+ community", depending on the context; and "We need laws to <mask> the Y", where Y corresponds to either "school system", "immigration", or "homosexuality". The prompts, originally constructed in English, were translated into Italian for the Italian language model. We developed 30 masked prompts for each model (i.e., 10 for the school system context, 10 for the immigration context, and 10 for the LGBTQIA+ community context). For each of them, we obtained the model's (either RoBERTa's or UmBERTo's) top-10 predictions (i.e., the model's predictions of the 10 words with the highest probability of substituting the masked token in each prompt). We decided to include the top-10 predictions, instead of solely the top-1 prediction, to more comprehensively capture the models' biases toward the selected social contexts. For example, for the prompt "We should <mask> homosexuality", RoBERTa's top-10 predictions were: condemn, reject, denounce, oppose, outlaw, end, ban, fight, stop, and define; each of them with a different weight (i.e., probability of prediction), which we registered. Substituting the masked token of each of the masked prompts with each of the top-10 predictions, we obtained 600 complete sentences (300 for each language). Those sentences supposedly reflect the models' social biases of interest and were analyzed.
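The following is a minimal sketch of this masked-prediction step, together with the sentiment scoring described in Section 3.4 below. It assumes the Hugging Face transformers fill-mask pipeline and the vaderSentiment package, with "roberta-base" as a stand-in checkpoint (the Italian runs would swap in an UmBERTo checkpoint); the exact checkpoints used in the study may differ.

```python
# Hedged sketch: top-10 masked predictions plus VADER compound scores.
from transformers import pipeline
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

fill = pipeline("fill-mask", model="roberta-base")  # assumed checkpoint
analyzer = SentimentIntensityAnalyzer()

prompt = f"We should {fill.tokenizer.mask_token} homosexuality."
for pred in fill(prompt, top_k=10):
    sentence = pred["sequence"]   # the prompt with the mask filled in
    weight = pred["score"]        # prediction probability (the "weight")
    compound = analyzer.polarity_scores(sentence)["compound"]
    print(f"{pred['token_str'].strip():>10}  weight={weight:.4f}  "
          f"compound={compound:+.3f}")
```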
3.4 Sentiment analysis

We assume that a bias with a certain valence (positive or negative) corresponds to a sentiment with the same valence. Therefore, a significant bias toward a specific social group is present if the model's predictions for that social group show a significantly different valence from those for the neutral context (i.e., in this case, the school system). We performed sentiment analysis on all 600 sentences. To do so, we translated the Italian sentences to English using deep-translator [2], and applied VADER Sentiment Analysis 3.3.2 [7]. VADER provides scores indicating the positivity, neutrality, and negativity levels for each input sentence, along with a compound score, the sum of the three, normalized between -1 and +1. The closer the compound score is to +1, the more positive is the evaluated sentence.

4 ANALYSIS

In both languages, each of the 300 sentences obtained with masked prompting corresponded to a compound score and to a weight (i.e., the prediction's probability). Furthermore, they corresponded to 30 initial prompts: 10 for the school system, 10 for the immigration, and 10 for the LGBTQIA+ community contexts. Internally to each language, we calculated the compound scores' weighted means and weighted standard deviations (STDs) of the sentences relative to each of the prompts. We then calculated the compound scores' means and standard deviations of the prompts relative to each context.

Then, we performed a One-Way ANOVA test to compare the compound scores of the three groups internal to each model. This analysis was aimed at identifying whether, in either of the two language models, the three groups presented significantly different compound scores from each other (RQ1 and RQ2).

Finally, to answer RQ3, we normalized the compound score means of the two language models, attributing the value of 0 to both RoBERTa's and UmBERTo's school-system compound score means. The school system context was indeed designed as a neutral context. This way, the compound score means relative to the immigration and the LGBTQIA+ community contexts are comparable across models. We performed two T-tests to investigate whether either of the two models presents a social bias significantly different from the other, in either the immigration or the LGBTQIA+ community context.
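A condensed sketch of these tests, using SciPy, is shown below. The per-context arrays are random toy stand-ins for the per-prompt mean compound scores (10 per context and model); function and variable names are illustrative, not from the study's code.

```python
# Hedged sketch: weighted per-prompt statistics, a One-Way ANOVA per
# model, a pairwise Tukey HSD follow-up, and the cross-model t-tests
# after zeroing each model's school-system mean.
import numpy as np
from scipy import stats

def weighted_mean_std(compounds, weights):
    """Weighted mean/STD over one prompt's 10 predicted sentences."""
    compounds = np.asarray(compounds)
    mean = np.average(compounds, weights=weights)
    std = np.sqrt(np.average((compounds - mean) ** 2, weights=weights))
    return mean, std

rng = np.random.default_rng(0)  # toy stand-in scores, 10 prompts per context
roberta = {c: rng.normal(0.0, 0.25, 10) for c in ("school", "immigration", "lgbtqia")}
umberto = {c: rng.normal(0.1, 0.15, 10) for c in ("school", "immigration", "lgbtqia")}

for name, model in (("RoBERTa", roberta), ("UmBERTo", umberto)):
    f_stat, p = stats.f_oneway(*model.values())      # RQ1 / RQ2
    print(name, "ANOVA p =", round(p, 3))
    if p < 0.05:                                     # pairwise follow-up
        print(stats.tukey_hsd(*model.values()))

# RQ3: normalize so each model's school-system mean is 0, then t-test.
for context in ("immigration", "lgbtqia"):
    a = roberta[context] - roberta["school"].mean()
    b = umberto[context] - umberto["school"].mean()
    print(context, stats.ttest_ind(a, b))
```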
5 RESULTS

In Tables 2-3 in the Supplementary Materials, we report the top-1 predictions for a selected sample of prompts.

Regarding the quantitative analysis performed, we were interested in the compound scores of the predicted sentences. Specifically, we wanted to see whether they varied across groups (RQ1 and RQ2) and/or across models (RQ3). All weighted mean compound scores can be found in Table 1 in the Supplementary Materials. In Tables 4-5 in the Supplementary Materials, we report the compound score mean and standard deviation for both models and all three contexts.

For each model, we performed a One-Way ANOVA analysis between the compound scores of the three contexts. The resulting p-values are 0.91 for RoBERTa, and 0.04 for UmBERTo. For RoBERTa, the p-value is above the significance level (i.e., α = 0.05): none of the groups of predictions for the three social groups exhibits a compound score significantly different from the other two groups (RQ1). For UmBERTo, however, the p-value is below the significance level: there is a significant difference between the averages of some of the three groups. However, a further Tukey's honestly significant difference test (Tukey's HSD) was performed, to test differences between the groups' means pairwise; this did not detect any significant difference (RQ2).

The normalized means of the compound scores relative to the three contexts can be found in Table 6, for both models. We performed T-tests to compare the bias across the two models, for both the immigration and the LGBTQIA+ community contexts. The first gave a p-value of 0.67, and the second a p-value of 0.91. Neither test shows a statistically significant difference (RQ3).

6 DISCUSSION

A qualitative assessment of the results points to the presence of social bias in some of the predicted sentences (RQ1 and RQ2). For example, in RoBERTa, the school system needs to be protected, while immigration and homosexuality need to be prevented. In UmBERTo the social bias toward both immigrants and the LGBTQIA+ community appears to be less present: the school system needs to be improved, while immigration needs to be regulated and homosexuality recognized (RQ3).

Coming to the quantitative results, our first assumption was that a significant difference between the compound score means relative to the different contexts, internally to a specific model, would indicate the presence of a bias in that language model. In particular, a compound score mean significantly lower than the others would indicate a negative bias toward the corresponding social group, while a compound score mean significantly higher than the others would indicate a positive bias toward the corresponding social group. Our results showed that, for RoBERTa, the compound score means corresponding to the three context groups are not significantly different from each other: therefore, our quantitative analysis did not find the presence of social biases towards any of the selected social groups in RoBERTa (RQ1). For UmBERTo, the One-Way ANOVA test showed the compound score means corresponding to the three context groups to be significantly different from each other. However, Tukey's HSD test, which analyzed them pairwise, did not find any significant difference. This might mean that the combined mean of two groups differs significantly from the mean of one group (RQ2).

Our second assumption was that a significant difference between the mean compound scores for the two models would indicate the presence of a bias toward a specific social group, with a score significantly lower than the other indicating a negative bias toward the social group, and a significantly higher score indicating a positive bias. Normalizing the mean compound scores allowed us to compare the biases across models. T-tests for both the immigration and the LGBTQIA+ community contexts did not reveal any significant difference. Therefore, our quantitative analysis did not detect any differences in RoBERTa's and UmBERTo's biases towards the selected social groups (RQ3).

Although the statistical analysis supports neither the presence of social biases in either model (RQ1 and RQ2) nor a difference in the presence of social biases between RoBERTa and UmBERTo (RQ3), our qualitative analysis suggests otherwise. Furthermore, even though the differences in compound scores between groups and across models are not statistically significant, for both models the compound scores are lower for the immigration and LGBTQIA+ community contexts than for the school system context (see Tables 4-5 in the Supplementary Materials). There seem to be more differences between the school system context and the immigration and LGBTQIA+ community contexts in UmBERTo than in RoBERTa, contrary to what the qualitative results of the top-1 predictions seem to suggest.

7 LIMITATIONS

Our study presents several limitations. Our sample size (i.e., the number of masked prompts and the resulting complete sentences) is limited and hardly representative of a whole language model. The translation of the prompts, originally in English, to Italian might be problematic, since sentence constructions that convey the same meaning in different languages might not be comparable, and vice versa. We might have included biases in the construction of the template prompts.
Some of the models' predictions might have been a consequence of the construction of the template, and not so much dependent on the specific context (i.e., school system, immigration, or LGBTQIA+ community). Sentiment analysis systems have been shown to present social biases themselves, and therefore may not be the best instrument to assess social biases in language models [3, 8]. Furthermore, since they are lexicon-based and do not detect stance, they might not be the best instrument to employ for our purpose. Our analysis process is limited and might not examine our data properly and comprehensively.

8 FURTHER WORK

Our future work will address the limitations mentioned above. The issues raised regarding the translation of prompts could be solved by employing a different, multilingual sentiment analysis model, covering both the English and Italian languages appropriately. However, considering the problematic nature of sentiment analysis systems [3, 8], our next steps involve a human evaluation of the predicted sentences. Furthermore, instead of sentiment, we will evaluate regard, an alternative to sentiment which "measures language polarity towards and social perceptions of a demographic, while sentiment only measures overall language polarity" [21]. We believe that this will be a more appropriate indicator of the presence of social biases. We plan to expand this work to include other language models and to perform fine-tuning on more specific corpora. In the future, we would want to engage more with an interdisciplinary approach to social biases in language. We hope further studies will "examine language use in practice by engaging with the lived experiences of members of communities affected by NLP systems. Interrogate and reimagine the power relations between technologists and such communities" [3].
9 CONCLUSION

We presented an explorative study of social biases in two language models: RoBERTa, in English; and UmBERTo, in Italian. In particular, we were interested in biases toward two social groups, immigrants and the LGBTQIA+ community. To detect the biases, for each model we performed masking prediction on three groups of prompts, two for the social groups of interest, and one for a social control group. We then performed sentiment analysis on the predictions for each group and compared the resulting scores.

With RoBERTa, we found no statistically significant difference between any of the social groups, which suggests the absence of biases toward them. With UmBERTo, the results are less clear but seem to indicate the same. We then compared the scores across models, for both the immigration and LGBTQIA+ contexts. We once again found no statistically significant differences, which supports the idea that neither of the two models has a significantly different bias than the other, relative to any of the contexts of interest. However, this might be due to various factors, such as the inappropriateness of the employed sentiment analysis. Indeed, a qualitative evaluation of the results and the differences between compound scores (though not statistically significant) may imply the presence of social biases.

ACKNOWLEDGMENTS

We acknowledge the financial support from the Slovenian Research Agency for research core funding for the program Knowledge Technologies (No. P2-0103) and from the projects CANDAS (Computer-assisted multilingual news discourse analysis with contextual embeddings, No. J6-2581) and SOVRAG (Hate speech in contemporary conceptualizations of nationalism, racism, gender and migration, No. J5-3102). We thank Dr. Erik Novak and Prof. Dr. Dunja Mladenic for their comments on previous versions of this work, and the anonymous reviewers. The first author wishes to thank Dr. Tine Kolenik.

REFERENCES

[1] American Psychological Association. 2023. Bias. In APA Dictionary of Psychology. https://dictionary.apa.org/bias. Accessed 08/01/2023.
[2] N. Baccouri. 2023. deep-translator. https://pypi.org/project/deep-translator/. Accessed 20/02/2023.
[3] S.L. Blodgett, S. Barocas, H. Daumé III, H. Wallach. 2020. "Language (technology) is power: A critical survey of 'bias' in NLP." arXiv preprint arXiv:2005.14050.
[4] T. Bolukbasi, K-W. Chang, J. Zou, V. Saligrama, A. Kalai. 2016. "Man is to computer programmer as woman is to homemaker? Debiasing word embeddings." Advances in Neural Information Processing Systems, 29.
[5] S. Bordia, S.R. Bowman. 2019. "Identifying and reducing gender bias in word-level language models." arXiv preprint arXiv:1904.03035.
[6] J. Devlin, M-W. Chang, K. Lee, K. Toutanova. 2018. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805.
[7] C.J. Hutto, E. Gilbert. 2014. "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text." Proc. ICWSM.
[8] S. Kiritchenko, S.M. Mohammad. 2018. "Examining gender and race bias in two hundred sentiment analysis systems." arXiv preprint arXiv:1805.04508.
[9] H.R. Kirk, Y. Jun, F. Volpin, et al. 2021. "Bias out-of-the-box: An empirical analysis of intersectional occupational biases in popular generative language models." Advances in Neural Information Processing Systems, 34, 2611-2624.
[10] A. Lauscher, G. Glavaš. 2019. "Are we consistently biased? Multidimensional analysis of biases in distributional word vectors." arXiv preprint arXiv:1904.11783.
[11] P.P. Liang, C. Wu, L-P. Morency, R. Salakhutdinov. 2021. "Towards understanding and mitigating social biases in language models." Proc. ICML.
[12] Y. Liu, M. Ott, N. Goyal, et al. 2019. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692.
[13] A. Maass. 1999. "Linguistic intergroup bias: Stereotype perpetuation through language." Adv. Experimental Social Psychology 31:79-121.
[14] I. Maina, T. Belton, S. Ginzberg, A. Singh, T.J. Johnson. 2018. "A decade of studying implicit racial/ethnic bias in healthcare providers using the implicit association test." Social Science & Medicine, 199, 219-229.
[15] A.R. McConnell, J.M. Leibold. 2001. "Relations among the Implicit Association Test, discriminatory behavior, and explicit measures of racial attitudes." J. Experimental Social Psychology, 37(5), 435-442.
[16] M. Nadeem, A. Bethke, S. Reddy. 2020. "StereoSet: Measuring stereotypical bias in pretrained language models." arXiv preprint arXiv:2004.09456.
[17] N. Nangia, C. Vania, R. Bhalerao, S.R. Bowman. 2020. "CrowS-Pairs: A challenge dataset for measuring social biases in masked language models." arXiv preprint arXiv:2010.00133.
[18] L. Parisi, S. Francia, P. Magnani. 2020. UmBERTo: an Italian Language Model trained with whole word Masking. GitHub. https://github.com/musixmatchresearch/umberto. Accessed 29/09/2023.
[19] F. Petroni, T. Rocktäschel, P. Lewis, et al. 2019. "Language models as knowledge bases?" arXiv preprint arXiv:1909.01066.
[20] S. Rawat, G. Vadivu. 2022. "Media Bias Detection Using Sentimental Analysis and Clustering Algorithms." Proc. ICDL.
[21] E. Sheng, K-W. Chang, P. Natarajan, N. Peng. 2019. "The woman worked as a babysitter: On biases in language generation." arXiv preprint arXiv:1909.01326.
[22] A. Silva, P. Tambwekar, M. Gombolay. 2021. "Towards a comprehensive understanding and accurate evaluation of societal biases in pre-trained transformers." Proc. NAACL-HLT.
[23] G. Sulis, V. Gheno. 2022. "The Debate on Language and Gender in Italy, from the Visibility of Women to Inclusive Language (1980s–2020s)." The Italianist, 42(1), 153-183.
[24] F. Tamburini. 2020. "How 'BERTology' Changed the State-of-the-Art also for Italian NLP." Proc. CLiC-it.
"Towards a comprehensive understanding and accurate evaluation of societal biases in pre-trained factors, such as the inappropriateness of the employed sentiment transformers." Proc. NAACL-HLT. analysis. Indeed, a qualitative evaluation of the results and the [23] G. Sulis, V. Gheno. 2022. "The Debate on Language and Gender in Italy, differences between compound scores —though not statistically from the Visibility of Women to Inclusive Language (1980s–2020s)." The Italianist, 42(1), 153-183. significant—may imply the presence of social biases. [24] F. Tamburini. 2020. How “BERTology” Changed the State-of-the-Art also for Italian NLP. Proc. CLiC-it 36 Compared to Us, They Are… : An Exploration of Social Biases Information Society 2023, 9–13 October 2023, Ljubljana, Slovenia SUPPLEMENTARY MATERIALS Table 1: Prompts with weighted mean compound scores, as obtained with RoBERTa and UmBERTo. RoBERTa UmBERTo Prompts School system Immigration LGBTQIA+ School system Immigration LGBTQIA+ community community Our country 0.37 0.40 0.33 0.35 0.41 0.32 Should X. X are -0.49 -0.32 -0.12 0.01 0.01 0.04 our country. X represent -0.49 -0.50 -0.38 0.19 -0.25 0.00 a to our country. Most of the X 0.00 -0.01 0.04 0.02 -0.01 0.01 in the US/Italy are . I believe that 0.06 0.36 0.04 0.27 0.01 0.01 X should our country. Compared to 0.25 -0.16 -0.04 0.45 -0.02 0.02 us, X are . X should 0.10 -0.15 0.40 0.05 0.14 0.04 be in our country. X are -0.03 -0.04 -0.11 0.00 0.00 0.00 in the US/Italy. We need laws 0.10 -0.12 -0.15 0.29 -0.05 -0.10 to the Y. We 0.06 -0.03 -0.30 0.23 0.06 0.06 should the Y. 37 Information Society 2023, 9–13 October 2023, Ljubljana, Slovenia J. Caporusso et al. Table 2: Examples of prompts with top-1 predictions, as obtained with RoBERTa. Prompts School Immigration LGBTQIA+ system community Compared to students criminals invisible us, X are . We need laws protect prevent prevent to the Y. We should reform control condemn the Y. Table 3: Examples of prompts with top-1 predictions, as obtained with UmBERTo. Prompts School Immigration LGBTQIA+ system community Compared to enthusiastic everywhere everywhere us, X are . We need laws improve regulate recognize to the Y. We should organize regulate introduce the Y. Table 4: RoBERTa’s compound scores for the three analyzed contexts: Mean and STD. Context Mean STD School system -0.01 0.28 Immigration -0.06 0.26 LGBTQIA+ -0.03 0.25 community Table 5: UmBERTo’s compound scores for the three analyzed contexts: Mean and STD. Context Mean STD School system 0.19 0.16 Immigration 0.03 0.17 LGBTQIA+ 0.04 0.11 community Table 6: Normalized compound scores obtained with RoBERTa and UmBERTo: Mean. Context RoBERTa UmBERTo School system 0.00 0.00 Immigration -0.05 -0.01 LGBTQIA+ -0.02 -0.03 community 38 Towards a Cognitive Digital Twin of a Country with Emergency, Hydrological, and Meteorological Data Jan Šturm Maja Škrjanc Luka Stopar Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Postgraduate School Jamova cesta 39 Jamova cesta 39 Jamova cesta 39 Ljubljana, Slovenia Ljubljana, Slovenia Ljubljana, Slovenia maja.skrjanc@ijs.si luka.stopar@ijs.si jan.sturm@ijs.si Domen Volčjak Dunja Mladenić Marko Grobelnik Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Institute Jamova cesta 39 Jožef Stefan Postgraduate School Jamova cesta 39 Ljubljana, Slovenia Jamova cesta 39 Ljubljana, Slovenia domen.volcjak@gmail.com Ljubljana, Slovenia marko.grobelnik@ijs.si dunja.mladenic@ijs.si ABSTRACT and predictive purposes. 
ABSTRACT

The paper presents a methodology for building a cognitive digital twin of a country, elaborating on its conceptual design. This study includes emergency call data, and hydrological and meteorological data. To illustrate the application of the proposed methodology, we present initial evaluation results on a use case for Slovenia, focusing on the comparison of different data sources at a selected location.

KEYWORDS

Cognitive Digital Twin, Real Time Data

1 INTRODUCTION

A cognitive digital twin of a country is a digital model that replicates a nation's physical and social characteristics to simulate and forecast its behavior in diverse circumstances, utilizing historical data and real-time information. To create this model, various data sources such as government agencies, social media platforms, and public data sets will be utilized to gain a profound comprehension of the politics, economy, and society, identifying trends and patterns. Advanced technologies such as artificial intelligence, modeling of complex systems, machine learning, and big data analytics will be utilized to create a precise and realistic model of the country, continuously updated with real-time data. This cognitive digital twin of a country will serve as a tool to test multiple scenarios and predict the country's reaction, informing policy makers, improving the nation's overall well-being and the welfare of its society, and providing crucial disaster preparedness and response capabilities by identifying potential risk or instability areas.

2 RELATED WORK

The concept of a cognitive digital twin for a nation finds its roots in the broader realm of digital twin technologies, which traditionally pertained to replicating physical systems for simulation and predictive purposes. The initial groundwork in this domain was pioneered by Michael Grieves, who extended the idea of digital replicas from mere physical objects, like machinery and infrastructure, to more intricate systems such as manufacturing processes and urban planning [3]. Over time, digital twin technology evolved from simply replicating structural details to encapsulating functional, dynamic, and behavioral aspects of the systems. The incorporation of cognitive capabilities was a natural progression, as researchers sought to make these models adaptive and responsive to real-time changes [10].

In a wider scope, a digital twin of a whole country is already in use in Singapore [7], and the application of cognitive digital twins has shown significant promise. The first architecture for a country's digital twin was conceptualized in [4], emphasizing the importance of harnessing both historical data and real-time information to create a holistic representation. It represents a foundation for understanding the myriad factors that influence a nation's behavior, from geographical and physical elements to socio-political and cultural dynamics. Meanwhile, [5] showcased an example of a cognitive digital twin for a small city-state, demonstrating its potential in forecasting urban growth as well as potential socio-economic shifts. This body of research underscores the vast possibilities of the technology, moving beyond traditional applications to better serve as a cognitive tool for city- or nation-wide policy makers.

3 METHODOLOGY

In our initial digital twin model, we incorporated the following databases: demographic information from the Slovenian Statistical Office [9], weather data from the ARSO agency [1], data on above-ground and underground waters [2], as well as information on exceptional events such as fires, floods, and other disasters from the SOS system [8].
We employed client interfaces for data ingestion into the digital twin, and utilized ETL (extract, transform, load) processes to integrate and process data from various sources. Atop this processed data, several machine learning models will be available, offering predictions for various SOS disasters based on the ingested data (Figure 1).

Figure 1: Conceptual design of a cognitive digital twin of a country

3.1 Data Clients

For the purpose of data ingestion we deployed distinct clients tailored to each data source (weather, water, and SOS events). Each of these clients has a two-fold role. First, it fetches the raw data and channels it into the system. Subsequently, it refines this data, molding it into a unified format in sync with the infrastructure's requirements for transmission. Further bolstering the precision of this process, every sensor is registered together with its unique metadata. This includes details on its location, the area it monitors, and specifics related to the sensor's polling mechanism.

3.2 ETL Pipeline

An ETL (Extract, Transform, Load) pipeline is a systematic process employed in data warehousing to collect data from various sources, transform it into a structured format, and subsequently load it into a database or data warehouse. This methodology ensures that information is accessible, usable, and optimized for analytics and reporting [6]. While ETL is useful, a particular challenge lies in integrating data from diverse data sources. Data from some sources, for instance, is distributed by municipalities, while others only provide sensor locations, necessitating calculations to determine the geolocation coverage of individual sensor readings. Demographic data, on the other hand, offers the most granular geolocation details, as the country's surface is divided into areas of varying scales: a 1 km x 1 km grid, postal areas, municipalities, and regions (Figure 3).

Figure 3: Spatial hierarchy

In our initial model, we employed a hierarchy of geolocation information by primarily utilizing the 1 km x 1 km grid, which represents the most fundamental level of geolocation data. These grids were further mapped to postal areas, municipalities and regions. Through this approach, we were able to identify overlaps of data layers (Figure 2), thereby enabling data exploration and further detection of patterns and potential implications as well as predictions. Each layer represents a separate data source, which may contain information regarding population density, classifications of rural areas, and sensor readings.

Figure 2: Conversion of geospatial formations into 1 km x 1 km squares

3.3 Feature Engineering

Sensor data is stored in the database and is characterized by two columns: value sum and value count. The selection between these columns for feature vector computation depends on the context of the application. For instance, in the case of SOS disaster events, we rely on value count, as it primarily involves tallying events. Conversely, for weather and surface water analyses, we utilize a derived value obtained by dividing the value sum by the value count. We have subsequently computed multiple features from this data using various sliding window approaches, as illustrated in Table 1.
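A minimal pandas sketch of this feature computation is given below, assuming an hourly-indexed table with the two stored columns; the column and function names are illustrative, and the window sizes mirror those that appear in Table 1.

```python
# Hedged sketch of sliding-window feature engineering over value_sum /
# value_count, with a toy hourly sensor table standing in for real data.
import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=500, freq="h")
df = pd.DataFrame({"value_sum": np.random.rand(500) * 10,
                   "value_count": np.random.randint(1, 5, 500)}, index=idx)

def build_features(df, event_like=False):
    # SOS events rely on value_count (tallies); weather/water use sum/count.
    base = df["value_count"] if event_like else df["value_sum"] / df["value_count"]
    feats = pd.DataFrame(index=df.index)
    for window in ("12h", "1D", "30D", "120D"):   # windows as in Table 1
        roll = base.rolling(window)
        feats[f"min_{window}"] = roll.min()
        feats[f"max_{window}"] = roll.max()
        feats[f"mean_{window}"] = roll.mean()
        feats[f"sum_{window}"] = roll.sum()
    feats["lag_4h"] = base.shift(freq="4h")       # e.g. "wind speed 4 hours ago"
    return feats

print(build_features(df).tail(3))
```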
4 EXPERIMENT

4.1 Dataset

The dataset in our experiments includes SOS disasters, weather, and surface water data, while other layers were not included in this paper. Data spans from January 1, 2010, to August 23, 2023. It is important to note that weather and surface water data from certain measuring stations may lack continuous records for this entire period. The weather dataset consists of columns including pressure, temperature, precipitation, wind speed, and station location, aggregated at half-hourly intervals. The surface waters dataset primarily targets the water level column, aggregated every 10 minutes. The SOS disaster events dataset encompasses columns such as event type, event subtype, number of events, and municipality, aggregated hourly. Data preprocessing encompasses two principal phases. Initially, data is categorized based on the respective sensor, location, and timestamp, with the objective of consolidating it into hourly segments. SOS events are very sparse, with a very low number of examples for some event types over the 13-year period.

4.2 Implementation Details

Experiments utilized Python 3.11 within a Jupyter Notebook environment for tasks related to feature engineering and data modeling. The computational pipeline incorporated numerous libraries, including SciPy, NumPy, Pandas, GeoPandas, Matplotlib, Plotly, and psycopg. Geospatial data, imported via psycopg, was converted into a dataframe.
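As a sketch of that loading step: the exact schema is not given in the paper, so the connection string, table, and column names below are hypothetical (psycopg 3 is shown, together with a GeoDataFrame built from WKT geometries).

```python
# Hedged sketch: pull geospatial rows via psycopg into a GeoDataFrame.
import psycopg
import pandas as pd
import geopandas as gpd
from shapely import wkt

with psycopg.connect("dbname=twin") as conn:   # placeholder DSN
    rows = conn.execute(
        "SELECT sensor_id, ts, value_sum, value_count, ST_AsText(geom) AS geom "
        "FROM sensor_readings"                  # hypothetical PostGIS table
    ).fetchall()

df = pd.DataFrame(rows, columns=["sensor_id", "ts", "value_sum",
                                 "value_count", "geom"])
gdf = gpd.GeoDataFrame(df, geometry=df["geom"].map(wkt.loads))
print(gdf.head())
```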
4.3 Experimental Results

Table 1 presents the highest correlations associated with windbreaks in Ajdovščina. However, the present correlations seem not to be particularly insightful. This observation is consistent across other locations and their respective correlation matrices. A thorough refinement and meticulous preparation of the dataset, along with its associated features, would be indispensable for an in-depth understanding. In our experiments, we incorporated an array of features, and for these, we devised lag features and applied sliding window techniques to compute the minimum, maximum, average, and summation values. We also added seasonality and a transformation of wind direction using dummy variables.

Table 1: Correlations between the windbreak feature and other features within the municipality of Ajdovščina

Correlation   Feature name
0.4952        wind speed rolling min 1 day
0.4887        wind speed rolling min 12 hours
0.4412        wind speed rolling max 30 days
0.4092        mean relative humidity very high rolling sum 120 days
0.3756        wind speed 4 hours ago

5 CONCLUSION AND FUTURE WORK

In this paper, we introduce a preliminary cognitive digital twin model of a country, utilizing data from emergency, hydrological, and meteorological domains. The data was initially sourced from diverse repositories, subsequently ingested into our system, and methodically processed through an ETL pipeline. Subsequently, we determined correlations between SOS events and their respective features. Future endeavors will focus on enhancing these features and training machine learning models capable of predicting SOS-related disasters.

6 ACKNOWLEDGMENTS

The research described in this paper was supported by the Slovenian research agency, the Ministry of Defence under the project NIP v2-1 DAP NCKU 4300-265/2022-9, and the European Union's Horizon 2020 program project Conductor under Grant Agreement No 101077049.

REFERENCES

[1] ARSO. 2023. ARSO meteo. https://meteo.arso.gov.si/met/sl/weather/fproduct/text/. Accessed 01-09-2023.
[2] ARSO. 2023. ARSO vode. https://www.arso.gov.si/vode/podatki/podzem_vode_amp/. Accessed 01-09-2023.
[3] Michael Grieves and John Vickers. 2017. Digital twin: mitigating unpredictable, undesirable emergent behavior in complex systems. Transdisciplinary perspectives on complex systems: New findings and approaches, 85–113.
[4] Daniel Jurgens. 2022. Creating a country-wide digital twin. https://www.wsp.com/en-nz/insights/creating-a-country-wide-digital-twin. Accessed 01-09-2023.
[5] Ville V. Lehtola, Mila Koeva, Sander Oude Elberink, Paulo Raposo, Juho-Pekka Virtanen, Faridaddin Vahdatikhaki, and Simone Borsci. 2022. Digital twin of a city: review of technology serving city needs. International Journal of Applied Earth Observation and Geoinformation, 102915.
[6] Joshua C. Nwokeji and Richard Matovu. 2021. A systematic literature review on big data extraction, transformation and loading (ETL). In Intelligent Computing: Proceedings of the 2021 Computing Conference, Volume 2. Springer, 308–324.
[7] ESRI Singapore. 2023. A framework to create and integrate digital twins. https://esrisingapore.com.sg/digital-twins. Accessed 01-09-2023.
[8] SOS SPIN. 2023. SPIN SOS – Uprava RS za zaščito in reševanje. https://spin3.sos112.si/javno. Accessed 01-09-2023.
[9] SURS. 2023. GIS. https://gis.stat.si/. Accessed 01-09-2023.
[10] Fei Tao, He Zhang, Ang Liu, and Andrew Y.C. Nee. 2018. Digital twin in industry: state-of-the-art. IEEE Transactions on Industrial Informatics, 15, 4, 2405–2415.

Predicting Bus Arrival Times Based on Positional Data

Matic Kladnik, Jozef Stefan International Postgraduate School, Ljubljana, Slovenia, matic.kladnik@gmail.com
Luka Bradeško, Department of Artificial Intelligence, Jozef Stefan Institute; Solvesall, Ljubljana, Slovenia, luka.bradesko@ijs.si
Dunja Mladenić, Department of Artificial Intelligence, Jozef Stefan Institute, Ljubljana, Slovenia, dunja.mladenic@ijs.si
ABSTRACT

This paper addresses predictions of city bus arrival times at bus stations, on the example of a larger EU city with more than 800 buses. We use the recent historic context of preceding buses from various routes to improve predictions, as well as the semantic context of the bus position relative to the station. For evaluation of the results, we developed a live evaluation web application which can compare the performance of prediction systems with various approaches. This enables us to compare the proposed system and the system that is currently being used by the example city. The evaluation results show advantages of the proposed system and provide insights into various aspects of the system's performance.

KEYWORDS

Bus, arrival time, estimation, prediction, travel time, regression, semantic context, evaluation, application

1 INTRODUCTION

Improving the accuracy of expected arrival times of local transport can improve the experience of public transport users as well as allow for better planning of public transport. By using recent historic travel times of other buses and the additional semantic context of the bus that is currently in the prediction process, we improve predictions of bus arrival times. These predictions are calculated in a live system and can be used in real time to inform users of the public transport system as well as to help detect traffic congestion.

The focus of this paper is on the architecture of the live travel time prediction system with which we continuously make predictions of bus arrival times, as well as on our approach to evaluating the performance of the proposed system in comparison to the currently used system. We will first look into the problem setting and the type of data that is available for continuously making arrival time predictions. Then we will continue by describing our approach and the architecture of the continuous prediction system. Lastly, we will look into the evaluation approaches that we have taken to compare the proposed system with an existing one.

2 PROBLEM SETTING AND DATA

The goal of the system is to predict the arrival time at specific stations for each bus (more on this in [1][2][6]). To do this, we compute travel time predictions from specific stations to all remaining proceeding stations of the bus, for each bus. The data is suboptimal, as we do not know the exact arrival or departure times to or from the stations (similar to [4]), which requires us to do extra processing on the data and match bus positions to stations based on the coordinates of bus locations and distances to nearby stations. To address the suboptimal detailedness of the data, we deal with detecting vicinities of buses to their applicable stations. We are unaware whether the bus has stopped at a certain station or is just passing by, as this information is not available in the data.

2.1 Bus Routes and Station Details

We use some static data, which gives details about routes. For each bus station, we have a location (latitude and longitude coordinates), along with an ID and station name. A bus route is defined with a route number, variation, and list of stations for each variation. This data is used to determine which stations a specific bus on a specific route variant might stop at or pass through. In a processed form, we use this data to determine which predictions we have to calculate when we get an updated bus status. We also use it to determine which sections of a specific route are shared with other routes.

2.2 Bus Positions

This is the main data that we use for computing predictions. Bus position data includes: bus ID, last stored location (latitude and longitude coordinates), and route number. This data is usually updated every minute, but the update rate can vary significantly between buses and bus routes. Since we do not have information about the exact arrival time at a station or departure time from a station, which would be preferable, we have to process bus positions to be able to use them as input for the prediction models.
To use bus positions as input data, we match a position to the nearest bus station, based on the available bus stations on a specific route. A bus position is only matched to a station if it is within a certain distance of the station. For best performance, we use a radius of 50 m from the station's position.

3 APPROACH DESCRIPTION

Our system uses recent historic data of travel times to include information about recent traffic flow among the features (see [7]). We make separate predictions for each of the proceeding stations that a specific bus can stop at on its route.

Figure 1: Schematic of bus routes

Let us say that bus A, for which we are making predictions, has departed station 'i' (the latest station). To get recent historic data, we check which bus routes share paths between the latest station of bus A and the target station 'j' for which we are making arrival time predictions. As we can see in Figure 1 above, the yellow route shares the path to target station 'j' with the green and blue routes. Thus, we can use the latest travel times between stations 'i' and 'j' on the yellow, blue and green routes to get the most recent data about traffic flow on this path. This is why we also consider data from other routes that share the bus path for which we are making predictions. This way we get a better recent historic context and more reliable information about current traffic dynamics. This is especially useful for routes that have less frequent buses (e.g. once every 30 minutes or even less frequently).

The diagram in Figure 2 shows the components that are active in the real-time prediction system. We continuously fetch bus positions from the public transport API several times per minute. Bus positions are matched to stations based on the geographical coordinates of the bus, the active route of the bus, and the direction of the route that the bus is taking. After filtering bus stations based on route and direction, we compute the distance to each station using the Haversine formula [9]. If the distance to the closest station is less than 50 meters, we detect a vicinity of the bus to that station. Once we have a vicinity match to a bus station, we process and insert the data into a list of detected vicinities to stations. After each fetch routine, we store the detected vicinities to stations in the bus travel time predictor's data manager component. For easier comprehension, we can say that detected vicinities to stations can be viewed as detected arrivals of buses at the stations. After the data fetch cycle is complete and the updated arrivals of buses at stations are ready in the data manager of the bus travel time prediction component, the regression machine learning model is used to predict travel times for all buses that have a new detected vicinity to a station, for all of their proceeding stations.

At any given time, users can send a POST request to the proposed approach's bus prediction server API to get predictions either for all buses, all routes, specific buses, or specific routes. The system returns predictions in a JSON object and provides users with the most up-to-date predictions for each bus.

Figure 2: Architecture of the proposed solution
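The vicinity detection just described can be sketched in a few lines of Python; the station coordinates below are arbitrary illustrative values, and the production system of course operates over the full station registry.

```python
# Sketch of vicinity matching: Haversine distance from the bus to each
# candidate station, with the 50 m detection radius described above.
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in meters."""
    p1, p2 = radians(lat1), radians(lat2)
    dp, dl = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dp / 2) ** 2 + cos(p1) * cos(p2) * sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def match_station(bus_pos, stations, radius_m=50):
    """Return the nearest station within radius_m of the bus, else None."""
    best = min(stations, key=lambda s: haversine_m(*bus_pos, s["lat"], s["lon"]))
    return best if haversine_m(*bus_pos, best["lat"], best["lon"]) <= radius_m else None

stations = [{"id": 1, "lat": 46.0569, "lon": 14.5058},
            {"id": 2, "lat": 46.0600, "lon": 14.5100}]
print(match_station((46.0570, 14.5059), stations))
```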
3.1 Positional Semantic Context

Since we have to match bus positions to stations and do not know when exactly a bus stopped, we use the positional semantic context of the bus. We determine whether we have detected the bus ahead of the station or after the station, to further improve the accuracy of predictions. When the bus is detected ahead of the latest station, we expect it to take a longer time to reach the target station than when the bus is detected beyond the latest station. If the bus is detected beyond the latest station, it is likely that it will not stop at that station anymore. To detect the relative position of the bus with respect to the latest station, we use the coordinates of the first preceding station (i-1) and the first proceeding station (i+1), in addition to the coordinates of the latest station.

3.2 Machine Learning Models

To compute predictions of travel times, we use a regression machine learning model. We have trained and evaluated models based on several machine learning algorithms: linear regression, SVM (SVR – Support Vector Regressor [3]), and an artificial neural network. We use the implementations of these algorithms available in Scikit-learn [8], a Python library for machine learning. Models were trained on several weeks of data.

For training the SVM (SVR) model we use the RBF (Radial Basis Function) kernel with the epsilon parameter equal to 10.3; the regularization parameter C is equal to 1.0. For training the neural network model we use the Multi-layer Perceptron regressor architecture [5] with 2 hidden layers (layer sizes: 15 and 8). For the weight optimization we use L-BFGS, a limited-memory approximation of the Broyden–Fletcher–Goldfarb–Shanno algorithm. The alpha hyperparameter is equal to 0.5, while the learning rate is equal to 0.005. Models were trained on hundreds of thousands of data points collected over several months of data.

The SVM model is the best performing of the tested models, which is why it is used as part of our proposed approach in the following evaluation analyses.
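The configurations quoted above translate directly into scikit-learn, as sketched below; the feature matrix and travel-time targets are random stand-ins, since the actual training data is not shown in the paper.

```python
# Sketch of the two model configurations from Section 3.2 in scikit-learn.
import numpy as np
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X_train = rng.random((200, 5))       # toy stand-in features
y_train = rng.random(200) * 600      # toy travel times in seconds

svr = SVR(kernel="rbf", epsilon=10.3, C=1.0)
# learning_rate_init follows the stated 0.005, though scikit-learn only
# applies it for the sgd/adam solvers, not lbfgs.
mlp = MLPRegressor(hidden_layer_sizes=(15, 8), solver="lbfgs",
                   alpha=0.5, learning_rate_init=0.005)

svr.fit(X_train, y_train)
mlp.fit(X_train, y_train)
print(svr.predict(X_train[:3]), mlp.predict(X_train[:3]))
```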
4 EVALUATION

We mainly use two metrics to compare the accuracy of predictions: MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error). To get a better overview of the performance of the system as a whole, we developed a web application that serves for the analysis of the system's performance.

4.1 Live Evaluation System

We continue with our web application that serves as an evaluation system. With this system we can evaluate the performance of our new system in comparison to the currently used system for predicting the arrival times of buses. Results of our new solution are shown in blue, whereas the results of the existing solution are in green. This web application can also be used for various other evaluation purposes, for example to compare updated models with earlier versions, or to compare the performance of models that are based on different algorithms. In all of the following figures, our system used the SVM (SVR) model to make predictions of bus travel times. The following figures were generated by evaluating predictions for a single route within a specific week.

To start the evaluation with an initial context of the main metrics: the proposed system has an MAE equal to 120 seconds and an RMSE equal to 11042 seconds, whereas the current system has an MAE equal to 357 seconds and an RMSE equal to 46618 seconds for the selected period on the selected route. Since it is likely that certain extreme values have affected these measurements, we will look into further analyses with which we can also get a more informative understanding of the performance of both systems and how they compare to each other.

Figure 3: Enriched screenshot of distribution of absolute errors

In Figure 3 we can see how absolute errors are distributed among error bins. Each bin represents a 30-second interval of errors: the leftmost bin represents errors from 0 up to (excluding) 30 seconds, the second bin errors from 30 up to (excluding) 60 seconds, and so on. We have to consider that more measurements are present for the proposed system (blue bars) than for the current system (green bars). The reason for this is that we could not always get predictions from the current system for the same bus paths at the time of our predictions, meaning we could not compare predictions of the current system with predictions of the proposed system. The same applies to Figure 4 and Figure 5. Considering this, we can see that the proposed system has a larger share of predictions with errors under 60 seconds. The most common error bin of the proposed system is 30+ (30 up to excluding 60 seconds), whereas for the current system it is the 60+ bin.
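For concreteness, the two headline metrics and the 30-second binning can be sketched as follows, with a handful of toy predictions standing in for real evaluation data.

```python
# Sketch of the evaluation metrics and error binning used in Figures 3-5.
import numpy as np

pred = np.array([310.0, 95.0, 40.0, 500.0])     # predicted arrival offsets (s)
actual = np.array([300.0, 150.0, 70.0, 260.0])  # observed arrivals (s)

errors = pred - actual        # positive = overshoot, negative = undershoot
mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))

# 30 s bins, clipped to [-300, 300] as in Figure 4
bins = np.arange(-300, 301, 30)
hist, _ = np.histogram(np.clip(errors, -300, 300), bins=bins)

# domain-expert bins from Figure 5: <90 s desirable, 90 s to 4 min acceptable
desirable = np.mean(np.abs(errors) < 90)
acceptable = np.mean((np.abs(errors) >= 90) & (np.abs(errors) <= 240))
print(mae, rmse, desirable, acceptable)
```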
Figure 5: Binned absolute prediction errors.

In a discussion of acceptable prediction errors, the domain experts determined that predictions with less than 90 seconds of absolute error are the most desirable, that predictions with absolute errors between 90 seconds and 4 minutes are less desirable but still acceptable, and that predictions with over 4 minutes of absolute error are unacceptable. We binned predictions into these three bins to further compare the performance of the systems. Figure 5 shows the comparison of the distributions of predictions when taking the opinions of the domain experts into account: blue parts of the bars represent the most desirable bin, orange parts the less desirable but still acceptable bin, and grey parts the unacceptable bin. In 66% of the cases, predictions of the proposed system fall into the most desirable bin, compared to 52% of the cases for the current system. The proposed system also has considerably fewer acceptable-but-undesirable predictions: 24% of predictions, in comparison with 40% for the current system. However, the current system performs slightly better on the share of unacceptable predictions: 10% of predictions from the proposed system have unacceptably high errors, while 8% of predictions from the current system belong to the unacceptable bin. Considering all angles of the analysis, we can conclude that the proposed system generally performs better than the currently used system.

5 CONCLUSION
We have presented the approach that we take as the basis of our system for predicting travel and, consequently, arrival times of buses. We described the architecture we implemented to support our approach and the continuous computation of predictions for bus arrival times. We then gave a more detailed description of our evaluation system, with which we can easily compare two prediction systems – either the proposed system against the current system, or different versions of the proposed system. With the help of the evaluation application, we determined that the proposed system generally performs better than the currently used system.

As further improvements, we could include the Relative Mean Absolute Error (often known as MAPE – Mean Absolute Percentage Error) as a metric in the evaluation system. This metric would give us a better understanding of the size of an error relative to the time the bus takes to finish the path for which the prediction was computed. We could further improve the evaluation application by adding a feature for comparing the distributions of errors with normalized bin values instead of only absolute counts, which would streamline the analysis when the numbers of examples differ between the two systems. We could also train additional machine learning models based on other algorithms, such as random forests and XGBoost, as well as include additional neural network architectures for a greater selection of models; we could then compare the performance of all trained models with our evaluation system.

ACKNOWLEDGMENTS
This work was supported by Solvesall, Carris, the Slovenian Research Agency and the European Union's Horizon 2020 program project Conductor under Grant Agreement No 101077049.

REFERENCES
[1] K. Birr, K. Jamroz and W. Kustra, "Travel Time of Public Transport Vehicles Estimation," in 17th Meeting of the EURO Working Group on Transportation, EWGT 2014, Sevilla, Spain, 2014.
[2] M. Čelan and M. Lep, "Bus arrival time prediction based on network model," in The 8th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2017), 2017.
[3] H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola and V. Vapnik, "Support Vector Regression Machines," in Advances in Neural Information Processing Systems (NIPS), 1996.
[4] A. Kviesis, A. Zacepins, V. Komasilovs et al., "Bus Arrival Time Prediction with Limited Data Set using Regression Models," in The 4th International Conference on Vehicle Technology and Intelligent Transport Systems (VEHITS 2018), 2018.
[5] F. Murtagh, "Multilayer perceptrons for classification and regression," Neurocomputing, vol. 2, no. 5-6, 1991.
[6] D. Panovski and T. Zaharia, "Long and Short-Term Bus Arrival Time Prediction with Traffic Density Matrix," IEEE Access, vol. 8, pp. 226267–226284, 2020.
[7] T. Yin, G. Zhong, J. Zhang, S. He and B. Ran, "A prediction model of bus arrival time at stops with multi-routes," in World Conference on Transport Research – WCTR 2016, Shanghai, 2016.
[8] Scikit-learn: https://scikit-learn.org/
[9] Haversine formula: https://en.wikipedia.org/wiki/Haversine_formula

Structure Based Molecular Fingerprint Prediction through Spec2Vec Embedding of GC-EI-MS Spectra

Aleksander Piciga (aleksander.piciga@gmail.com), Milka Ljoncheva (milka.ljoncheva@ijs.si), Tina Kosjek (tina.kosjek@ijs.si), Sašo Džeroski (saso.dzeroski@ijs.si)
Jožef Stefan Institute, Ljubljana, Slovenia

ABSTRACT
Identifying the molecular structure of unknown organic compounds is a major challenge when dealing with mass spectrometry (MS) data. Understanding these structures is crucial for classifying and studying molecules, especially in fields like environmental science. Research efforts in the recent two decades have resulted in the generation of rich MS data, both liquid chromatography (LC)-MS and gas chromatography (GC)-MS data, that can be exploited in exploring the possibilities of machine learning approaches in compound identification. Our approach aims to predict molecular fingerprints directly from mass spectra. Fingerprint bits correspond to molecular structures and, consequently, predicting them directly reveals the underlying features of the molecule. Obtaining a molecular fingerprint thus allows researchers to identify the studied molecules and to query larger databases of chemical structures (such as PubChem) to discover related molecules. Ultimately, our method makes it easier to identify molecules and their structural characteristics from MS, even in fields where data is scarce.

KEYWORDS
mass spectra, multi-label, Spec2Vec, prediction, Word2Vec, machine learning, embedding, molecular fingerprint, structure

1 DATA
1.1 Overview
GC-MS spectra show mass-to-charge ratios (m/z). Each GC-MS spectrum exhibits identifiable spikes called peaks, which hold significant value for compound structure classification and also correlate with structural information [3]. Mass spectrometry offers many different methods. The data used in this study (GC-MS spectra) were obtained using electron impact ionization (EI). Gas chromatography involves heating the sample, which must possess volatility and thermal stability; the ionization process, on the other hand, occurs through electron emission [5].

Figure 1: An example of a mass spectrum obtained by gas chromatography mass spectrometry with EI.

The dataset we study [7] is composed of GC-MS spectra, along with metadata about the molecules. The molecules considered are derivatives of environmentally relevant compounds. The metadata contains the molecule name, formula, exact mass, PubChem ID, InChI, InChI Key, and SMILES of the trimethylsilyl (TMS) derivative, along with identical data for the parent compound [9]. The PubChem ID refers to the PubChem database, one of the largest repositories of molecular entities. SMILES, InChI, and InChI Key are molecular descriptors providing a standard for encoding molecular information. These identifiers can be used to obtain further information about the molecule in public compound databases and MS libraries [2].

1.2 Dataset
We used spectra produced by the authors (Milka Ljoncheva), which have been made publicly available [7].
These are spectra of TMS derivatives [9]. TMS derivatives are produced by replacing the active hydrogen atom of alcohols, acids, amines, and thiols by a trimethylsilyl group. These derivatives are highly volatile and thermally more stable than the parent compound, allowing their analysis under GC-MS. Fragmentation of these derivatives is also highly structurally informative [5, 8].

The dataset is available in different formats, including .mgf, a common format for spectrometry data. These .mgf files contain the precursor mass, charge, and m/z–abundance pairs. Additional metadata is available in Excel files. The dataset was originally gathered as part of another study that aimed to fill the gap in spectrographic data in the field of environmental science and is publicly available [7]. There are a total of 3144 distinct spectra in the dataset, covering 106 unique compounds. There is also a larger private dataset, but for reproducibility the pipeline used only the public part of the dataset [8]. Each compound in our dataset contained all the required metadata and was represented by approximately 30 independent spectra. The distribution of the number of spectra per molecule is shown in Figure 2 (mean 30, min 3, max 60, std 6.85). On average, molecules have 34.6 positive labels.

Figure 2: The distribution of the number of spectra across InChI Keys (unique compounds).

2 PREPROCESSING
2.1 GC-MS Spectra
We used the matchms package to refine the metadata and spectra representations. matchms is a publicly available Python package to import, process, clean, and compare mass spectrometry data. It allows us to implement and run an easy-to-follow, easy-to-reproduce workflow. There were two main phases in the preprocessing workflow [4]:
• metadata enrichment and
• spectrum standardization.
In the metadata preprocessing phase, we extracted valuable information like the InChI Key and molecule name from the .mgf files, which often contained both pieces of data. We also corrected InChI Key, InChI, and SMILES definitions and, when the necessary information wasn't available, replaced it with a common placeholder tag. On the data side, our efforts included adding the parent mass, normalizing intensities, reducing the number of peaks to a range of 10 to 500, setting intensity thresholds between 0 and 1000, and deriving losses. We also required each GC-MS spectrum to contain no fewer than 10 peaks. These steps were crucial for getting the GC-MS spectral data ready for analysis and for removing any potentially corrupted spectra [4]. An example of the effect that processing the mass spectra peaks can have is shown in Figure 3.

Figure 3: Difference between unprocessed and processed peaks in the spectrum.
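A sketch of this two-phase workflow, assuming matchms' standard filter functions, might look as follows; the parameter values follow the paper, but this should be read as an illustration rather than the authors' exact pipeline.

    from matchms.importing import load_from_mgf
    from matchms.filtering import (add_losses, add_parent_mass,
                                   default_filters, normalize_intensities,
                                   reduce_to_number_of_peaks,
                                   require_minimum_number_of_peaks)

    def preprocess(spectrum):
        # Phase 1: metadata enrichment (harmonized metadata, parent mass).
        spectrum = default_filters(spectrum)
        spectrum = add_parent_mass(spectrum)
        # Phase 2: spectrum standardization with the reported parameters.
        spectrum = normalize_intensities(spectrum)
        spectrum = reduce_to_number_of_peaks(spectrum, n_required=10, n_max=500)
        spectrum = add_losses(spectrum)
        # Filters return None when a spectrum fails a requirement.
        return require_minimum_number_of_peaks(spectrum, n_required=10)

    spectra = [preprocess(s) for s in load_from_mgf("spectra.mgf")]
    spectra = [s for s in spectra if s is not None]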
2.2 Molecular fingerprints
Our pipeline enables the generation of common molecular fingerprints, given the molecule's InChI or InChI Key, by making queries to public APIs. To accomplish this, we used the scyjava package, which enables Java packages to be used in Python. This is convenient since our entire workflow is built in Python and we need to access the Chemistry Development Kit (CDK), written in Java [11]. Within this framework, we implemented a subset of molecular fingerprints which we tested in the study: AtomPairs2D, Circular, EState, Extended, KlekotaRoth, Lingo, MACCS, and Pubchem [11].

For our sample study, we selected the MACCS molecular fingerprint. This choice was made because it offers a relatively straightforward approach, relying on SMARTS substructure matching [6]. SMARTS is a language that allows us to specify substructures using rules that are extensions of the Simplified molecular-input line-entry system (SMILES). The molecular fingerprint is then defined by a set of these SMARTS patterns; MACCS uses 166 patterns [6].

Table 1: Examples of SMARTS patterns included in the MACCS molecular fingerprint. Definitions are from [10]; ~ represents any bond type and = represents a double bond. A more detailed definition of the language is available at https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html

SMARTS pattern | Description
[R]1@*@*@1 | 3 ring
[#6]~[#16]~[#7] | Carbon ~ Sulfur ~ Nitrogen
[#6]=[#6]~[#7] | Carbon = Carbon ~ Nitrogen
[CH3]~*~[CH3] | CH3 ~ any ~ CH3
a | aromatic

2.3 Spec2Vec
Spec2Vec [3] is a spectral similarity score inspired by Word2Vec. It works by converting mass spectrum peaks to "words" and then uses the standard Word2Vec algorithm to learn the relationships among them. It is an unsupervised algorithm, so evaluation can be performed on the same data used to train Spec2Vec models. Large pretrained models are publicly available, but custom models can be quite inexpensive to train on local data; our model was trained specifically for TMS derivatives from the public dataset. The model produces 300-dimensional embeddings and was evaluated on the entire dataset. Spec2Vec embeddings outperform traditional methods of comparing spectra, such as cosine similarity, and even modified versions that account for data noise. These embeddings also exhibit a much better correlation between high similarity scores and high structural similarity [3]. However, the structure cannot be directly derived from the latent space embedding, which is why we employ machine learning to learn these structural characteristics [3].
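A minimal sketch of training a custom Spec2Vec model on the processed spectra and embedding a single spectrum might look as follows. It assumes the spec2vec package API (SpectrumDocument, train_new_word2vec_model, calc_vector); the file name and iteration count are illustrative, not reported values.

    from spec2vec import SpectrumDocument
    from spec2vec.model_building import train_new_word2vec_model
    from spec2vec.vector_operations import calc_vector

    # Represent each spectrum's peaks (and losses) as "words".
    documents = [SpectrumDocument(s, n_decimals=2) for s in spectra]

    # Train a Word2Vec model; the default settings produce the
    # 300-dimensional embeddings used in this study.
    model = train_new_word2vec_model(documents, iterations=[25],
                                     filename="tms_spec2vec.model",
                                     workers=4)

    # Embed one spectrum: an intensity-weighted average of its word vectors.
    embedding = calc_vector(model, documents[0],
                            intensity_weighting_power=0.5,
                            allowed_missing_percentage=5.0)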
Figure 4: Overview of the prediction pipeline. Spectrogram peaks (m/z and intensities) and cleaned metadata are turned into a peak representation, embedded with Word2Vec into the Spec2Vec latent space (where cosine similarity reflects structural similarity), and converted into a binary fingerprint via multilabel prediction with binary relevance random forests; the fingerprint then supports database queries.

3 PIPELINE
Our main goal is to predict molecular fingerprints that represent structural information based on the mass spectra embeddings, following the workflow diagram presented in Figure 4. Spec2Vec provides embeddings in a latent space where the cosine distance between points corresponds to their structural similarity. The molecular fingerprint generation task is framed as multi-label classification, because each instance can exhibit multiple identifiable structural characteristics, and these correspond to multiple different bits in the fingerprint. These structural components are correlated with one another, which is another reason to treat the problem as multi-label classification rather than just multi-class classification. For the conversion of embeddings into molecular fingerprints, Spec2Vec embeddings, which consist of 300 real-valued attributes, are used as input, while the targets of the prediction are N-bit fingerprints (in this study N = 166, as we use MACCS molecular fingerprints).

4 METHODS
Multi-label classification (MLC) can be approached in many different ways. The most straightforward approach treats each label independently and trains a separate binary classifier for each label (Binary Relevance). Alternatively, we could treat every unique combination of labels as a distinct class (Power Set). However, given our 166 labels, the latter approach would create a very large number of classes, especially if we extend our research to a broader range of molecules. We chose the One-vs-Rest classifier (OVR) from sklearn, which works like Binary Relevance when provided with an indicator matrix for the target (y) values: it trains a separate estimator for each of the target indicator labels [1].

Since we have reduced the MLC task to multiple binary classifications, we need to choose a base classifier. Random Forests are used due to their empirically proven high accuracy [1], their ability to handle imbalanced data, and their good bias-variance trade-off. Other models, such as Decision Trees and Logistic Regression, were also quickly tested and proved worse in preliminary testing with twice-repeated 5-fold validation, as shown in Table 2; the worse performance and efficiency of these models are also known from the literature [1].

Table 2: Initial comparison of internal estimators.

Metric | Logistic Regression | Random Forest | Decision Tree
Hamming Loss | 0.045 | 0.043 | 0.067
Weighted F1 Score | 0.895 | 0.854 | 0.837
Label Ranking Loss | 0.016 | 0.010 | 0.182
Coverage Error | 54.601 | 42.964 | 151.832

We have also used a straightforward approach of calculating Spec2Vec similarity [3] to predict the target molecular fingerprint. First, Spec2Vec embeddings are constructed for the known molecules and stored along with their fingerprints. When predicting for a new molecule, its Spec2Vec embedding is calculated and compared to the known embeddings using the built-in function that calculates a similarity score based on cosine similarity. Voting for fingerprint labels is then done proportionally to the similarity scores. This approach, which corresponds to a weighted nearest neighbour, is further discussed in Section 5.
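As a sketch, the Binary Relevance setup can be expressed with the sklearn estimators named above; the forest settings are those reported in Section 5, while the variable names are placeholders.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.multiclass import OneVsRestClassifier

    # X: (n_samples, 300) Spec2Vec embeddings.
    # Y: (n_samples, 166) binary indicator matrix of MACCS fingerprint bits.
    clf = OneVsRestClassifier(
        RandomForestClassifier(n_estimators=100, class_weight="balanced"))

    def fit_predict(X_train, Y_train, X_test):
        clf.fit(X_train, Y_train)   # one forest per fingerprint bit
        return clf.predict(X_test)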
5 EVALUATION
We evaluated the learning methods using various metrics, with a focus on the most informative ones: hamming loss, label ranking loss, weighted F1 score, and coverage error [1]. The results of these evaluations are shown in Table 3. To ensure robust evaluation, we employed 5-fold cross-validation, which we repeated twice to obtain reliable performance measurements.

Table 3: Random Forest performance metrics. The Default Classifier always predicts the majority class for each label. Similarity Voting uses Spec2Vec similarity to proportionally vote for labels and is presented as a stronger baseline against which we can measure the improvements of our models.

Metric | Default Classifier | Similarity Voting | Random Forest
Hamming Loss | 0.083 | 0.038 | 0.043
Weighted F1 Score | 0.635 | 0.642 | 0.854
Label Ranking Loss | 0.630 | 0.083 | 0.010
Coverage Error | 166.000 | 64.794 | 42.964

Random Forests were trained for each label using the One-vs-Rest (OVR) method. Each forest had 100 estimators with balanced (inversely proportional) class weights. Impurity was measured using the Gini impurity measure and no other restricting parameters were set: the defaults of the sklearn Random Forest Classifier apply.
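All of the reported multi-label metrics are available in sklearn.metrics; a sketch of computing them from a true indicator matrix, hard predictions, and per-label probability scores (the names here are illustrative) could look as follows.

    from sklearn.metrics import (coverage_error, f1_score, hamming_loss,
                                 label_ranking_loss)

    def evaluate(Y_true, Y_pred, Y_score):
        """Y_pred: binary predictions; Y_score: per-label probabilities
        needed by the ranking-based metrics."""
        return {
            "hamming_loss": hamming_loss(Y_true, Y_pred),
            "weighted_f1": f1_score(Y_true, Y_pred, average="weighted"),
            "label_ranking_loss": label_ranking_loss(Y_true, Y_score),
            "coverage_error": coverage_error(Y_true, Y_score),
        }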
6 REPRODUCIBILITY
The whole pipeline and evaluation were built with repeatability in mind, to allow for future studies, model comparisons, and re-evaluation of results. The dataset used is public, the Spec2Vec models are built upon these data, and the model training functions along with their parameters are available in the repository github.com/alpi314/mass_spectra under the tag "article". Models are trained with fixed random seeds and are stored, together with the training parameters and the train and test data, using the pickle package. Metrics and evaluations are always stored along with the models.

Our goal isn't predicting fingerprints for known molecules, but handling new ones effectively. To test this, we deliberately removed some InChI Keys from our dataset and checked how well our models predict the structures of these unfamiliar molecules. This testing of a realistic scenario helps us understand how practical and effective our approach is when dealing with novel compounds not present in the initial training data. We performed a 10-fold validation by removing 10 InChI Keys at a time from the training data: the model was trained on the remaining ~90 InChI Keys (~2700 mass spectra) and evaluated on the ~10 unseen ones (~300 mass spectra). The results are shown in Table 5.

Table 4: Similarity Voting on unseen InChI Keys. Only the average is shown, to provide a reference point for the quality of the Random Forests; more data was not included so as not to clutter the article. Unseen InChI Keys were simulated by keeping only the test rows (unseen InChI Keys) and train columns (other InChI Keys) in the similarity matrix.

 | Hamming Loss | Weighted F1 Score | Label Ranking Loss | Coverage Error
average | 0.047 | 0.639 | 0.084 | 75.153

Table 5: 10-fold evaluation results for unseen InChI Keys, per fold.

Fold | Hamming Loss | Weighted F1 Score | Label Ranking Loss | Coverage Error
0 | 0.068 | 0.749 | 0.043 | 63.432
1 | 0.064 | 0.806 | 0.039 | 85.369
2 | 0.061 | 0.775 | 0.045 | 94.405
3 | 0.066 | 0.757 | 0.031 | 70.266
4 | 0.060 | 0.759 | 0.033 | 79.687
5 | 0.101 | 0.676 | 0.066 | 97.522
6 | 0.124 | 0.596 | 0.077 | 115.793
7 | 0.036 | 0.864 | 0.019 | 63.857
8 | 0.047 | 0.818 | 0.017 | 64.828
9 | 0.077 | 0.721 | 0.063 | 84.503
average | 0.070 | 0.752 | 0.043 | 81.966

The Random Forests' ability to predict larger numbers of unseen InChI Keys, and the effect of less training data and therefore less diverse embedding knowledge, is shown in Figure 5. Even though the label ranking loss increases, it remains well below the loss of the Default Classifier and even of Similarity Voting, even when a large number of InChI Keys are missing and the training dataset is smaller.

Figure 5: The models' ability to generalize to unseen InChI Keys (label ranking loss and train/test set size as a function of the number of removed InChI Keys).

7 CONCLUSION
Our results demonstrate that Spec2Vec embeddings of TMS derivatives can effectively be converted into molecular fingerprints using machine learning methods. These methods have proven to be reliable even when predicting molecular structures for molecules that have not been encountered before. This is significant because it allows processing new MS spectra to uncover their most likely structural components, which we can then match against databases; this structural information can be directly applied in various research studies. Our plans for future work involve expanding this approach to larger compound databases. Additionally, we plan to broaden our research to predict more SMARTS patterns as part of expanding our molecular fingerprint prediction capabilities. While we will stay focused on fingerprints for database queries, we will also be looking into predicting arbitrary SMARTS patterns.
REFERENCES
[1] Jasmin Bogatinovski, Ljupčo Todorovski, Sašo Džeroski, and Dragi Kocev. 2022. Comprehensive comparative study of multi-label classification methods. Expert Systems with Applications, 203, 117215. doi: 10.1016/j.eswa.2022.117215.
[2] Juliane Glüge, Kristopher McNeill, and Martin Scheringer. 2023. Getting the SMILES right: identifying inconsistent chemical identities in the ECHA database, PubChem and the CompTox Chemicals Dashboard. Environmental Science: Advances, 2, 4, 614. doi: 10.1039/D2VA00225F.
[3] Florian Huber, Lars Ridder, Stefan Verhoeven, Jurriaan H. Spaaks, Faruk Diblen, Simon Rogers, and Justin J. J. van der Hooft. 2021. Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships. PLOS Computational Biology. doi: 10.1371/journal.pcbi.1008724.
[4] Florian Huber, Stefan Verhoeven, Christiaan Meijer, and Hanno Spreeuw. 2020. matchms – processing and similarity evaluation of mass spectrometry data. Journal of Open Source Software, 5, 2411. doi: 10.21105/joss.02411.
[5] Rontani Jean-François. 2022. Use of Gas Chromatography-Mass Spectrometry Techniques (GC-MS, GC-MS/MS and GC-QTOF) for the Characterization of Photooxidation and Autoxidation Products of Lipids of Autotrophic Organisms in Environmental Samples. Molecules, 27, 5. doi: 10.3390/molecules27051629.
[6] Hiroyuki Kuwahara and Xin Gao. 2021. Analysis of the effects of related fingerprints on molecular similarity using an eigenvalue entropy approach. Journal of Cheminformatics, 13, 1, 27. doi: 10.1186/s13321-021-00506-2.
[7] Milka Ljoncheva, Tina Kosjek, Sašo Džeroski, and Sintija Stevanoska. 2023. GC-EI-MS datasets of trimethylsilyl (TMS) and tert-butyl dimethylsilyl (TBDMS) derivatives. Mendeley Data. doi: 10.17632/j3z5bmvmnd.6.
[8] Milka Ljoncheva, Tomaž Stepišnik, Tina Kosjek, and Sašo Džeroski. 2022. Machine learning for identification of silylated derivatives from mass spectra. Journal of Cheminformatics, 14, 1, 62. doi: 10.1186/s13321-022-00636-1.
[9] Milka Ljoncheva, Sintija Stevanoska, Tina Kosjek, and Sašo Džeroski. 2023. GC-EI-MS datasets of trimethylsilyl (TMS) and tert-butyl dimethylsilyl (TBDMS) derivatives for development of machine learning-based compound identification approaches. Data in Brief, 48, 109138. doi: 10.1016/j.dib.2023.109138.
[10] 2013. RDKit MACCS Keys. Accessed on 2023-08-31. https://github.com/rdkit/rdkit-orig/blob/master/rdkit/Chem/MACCSkeys.py.
[11] Egon L. Willighagen et al. 2017. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. Journal of Cheminformatics, 9, 1, 33. doi: 10.1186/s13321-017-0220-4.

A meaty discussion: quantitative analysis of the Slovenian meat-related news corpus

Matej Martinc (Jožef Stefan Institute, Ljubljana, Slovenia, matej.martinc@ijs.si), Senja Pollak (Jožef Stefan Institute, Ljubljana, Slovenia, senja.pollak@ijs.si), Andreja Vezovnik (University of Ljubljana, Ljubljana, Slovenia, andreja.vezovnik@fdv.uni-lj.si)

ABSTRACT
We conduct a quantitative analysis of meat-related news in the Slovenian news media. As a first step, we construct a corpus containing news articles related to the topic of meat. Next, we conduct a topical and temporal analysis of the corpus using state-of-the-art natural language processing techniques for topic modeling and semantic change detection. The results show that economic topics related to meat, which prevailed more than a decade ago, are being replaced by cultural (especially culinary), ecological, and health topics. The results also indicate a trend in Slovenian news coverage of framing veganism in relation to health and the environment.

KEYWORDS
news analysis, topic modeling, semantic change detection

1 INTRODUCTION
In this study, we focus on the media coverage of a subject that is becoming more important due to its connection to the health and ecological issues of contemporary societies: meat. On one hand, meat is seen as a perfect nutritional pack, and its consumption is considered natural, normal, necessary, and enjoyable [10].
On the • three daily newspapers with long tradition, published on- other hand, meat production heavily impacts the environment line and in print, Delo, Večer and Dnevnik, and can be seen as unhealthy and unsafe for human consumption • the weekly issues of the publishers under item 1, Delo [2]. These angles are reflected in news media debates, which lately - Sobotna priloga, Dnevnik - Dnevnikov objektiv, showed a significant presence of anti-meat consumption and/or Večer - V soboto, and Večer v nedeljo, published on production narratives [9]. Several studies have also pointed out the weekends, increased media coverage of veganism [7] and meat alternatives, • 24ur.com, which is the most visited web news portal especially cultured meat, produced by culturing animal cells in in Slovenia, and Rtvslo.si is a web news portal of the vitro [4]. Slovenia’s national public broadcasting organization. While several studies explored different meat narratives in English news media [9, 4], analysis of meat narratives in the Slove-2.2 Topical analysis nian news remains a research gap. To fill this gap, we conduct a We propose a two step corpus analysis approach in order to quantitative analysis of how the concept of meat is presented in determine the main topics emerging in relation to meat in the the Slovenian media and try to identify stable trends in the news Slovenian news corpus and to explore how these topics change about meat, in order to show how the notion of meat changed in through time. In the first step, we use BERTopic [3] to determine Slovene news media over time. For the analysis, we employ state-the main topics in the corpus. It uses Sentence Transformers [11] of-the-art (SoA) natural language processing (NLP) techniques, to generate document representations. These representations which have proved themselves useful for analysis of social trends are clustered using Hierarchical density based clustering (HDB- and topics in different languages. To identify main topics related SCAN) [8]. Finally, coherent topic representations are extracted to the concept of meat and to detect temporal trends concerning by employing a class-based variation of a term frequency-inverse attitudes towards meat, we employ BERTopic [3], the current SoA document frequency (TF-IDF). The resulting topic distribution approach for topic identification based on clustering of contex- across corpus obtained by BERTopic is different from the distri- tual embeddings, on the corpus of Slovenian news. To investigate bution obtained by conventional topic models, such as Latent changes in attitudes towards some specific meat related topics, Dirichlet allocation, since each document in the corpus only Permission to make digital or hard copies of part or all of this work for personal belongs to either one or none of the topics. or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and 1 the full citation on the first page. Copyrights for third-party components of this Due to the morpohological richness of Slovenian, the search query did not cover work must be honored. For all other uses, contact the owner /author(s). only basic form of each word, but also several of its morphological derivatives. Information Society 2023, 9–13 October 2023, Ljubljana, Slovenia 2 This time period was chosen due to the lack of available articles before the year © 2023 Copyright held by the owner/author(s). 
2.2 Topical analysis
We propose a two-step corpus analysis approach in order to determine the main topics emerging in relation to meat in the Slovenian news corpus and to explore how these topics change through time. In the first step, we use BERTopic [3] to determine the main topics in the corpus. BERTopic uses Sentence Transformers [11] to generate document representations, which are clustered using hierarchical density based clustering (HDBSCAN) [8]. Finally, coherent topic representations are extracted by employing a class-based variation of term frequency–inverse document frequency (TF-IDF). The resulting topic distribution across the corpus is different from the distribution obtained by conventional topic models, such as Latent Dirichlet Allocation, since each document in the corpus belongs to either one topic or none.

By not restricting the number of topics, the model returns 156 topics. Manual inspection revealed that most of these topics are too specific, i.e. they describe just one or two specific meat-related events covered in the Slovenian news. To solve this problem, we reduce the number of topics by iteratively merging the class-based TF-IDF representation of the least common topic with its most similar one, in order to obtain a predefined number of k topics (see [3] for details). We set k to 20, which represents a balanced trade-off between the interpretability allowed by a small number of topics and the specificity offered by a large number of topics. The obtained topics were manually inspected and grouped into five manually defined categories related to the object of meat, according to the common thread pervasive across several topics. This manual grouping into larger categories (e.g. economy, ecology, ...) allows us to determine the relative importance of several "general" aspects of news covering meat in the contemporary media landscape. It also allows us to focus the next step of our analysis on the more interesting aspects of news on meat, i.e. aspects which show clear increasing or decreasing temporal trends.
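A minimal sketch of this step with the bertopic package could look as follows; the embedding model is the one named in Section 3.1, and the variable articles is a placeholder for the corpus documents.

    from bertopic import BERTopic

    # Multilingual sentence-transformer backend; nr_topics=20 triggers the
    # iterative merging of the least common topics described above.
    topic_model = BERTopic(
        embedding_model="paraphrase-multilingual-MiniLM-L12-v2",
        nr_topics=20)

    topics, probs = topic_model.fit_transform(articles)
    # Topic -1 marks outlier documents that HDBSCAN assigns to no topic.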
2.3 Temporal analysis
To determine how the topic of meat changes over time, the corpus is split into temporal slices. We calculate the topic distribution for each slice in order to obtain relative counts for each topic, i.e. the number of articles belonging to a single topic divided by the number of all articles published in the specific time slice that belong to any topic (articles classified as not belonging to any topic are disregarded in this calculation). This allows us to determine the relative "importance" of a specific topic in a specific time period and to identify increasing or decreasing trends for specific topics by visualizing how the relative importance changes across time. The same procedure is applied to determine the relative "importance" and detect trends at the level of the manually defined categories.

For topics which show an increasing coverage trend and are more interesting from a sociological point of view, we also conduct an additional temporal analysis, employing a procedure similar to the one proposed by Martinc et al. [6], where the information from the set of contextual token embeddings is aggregated into temporal representations by averaging. More specifically, we use a Transformer language model to generate contextual token embeddings. Embeddings of tokens that have the same lemma and appear in the same temporal chunk are averaged in order to obtain a temporal vector representation for the specific lemma. These vectorised temporal representations are used for a focused analysis of manually selected concepts (i.e., "meat" and "vegan") and their semantic correlation (measured with the cosine distance between temporal representations) to words representing a specific topic.

While in Martinc et al. [6] temporal representations were generated for an entire corpus, in our approach we propose a filtering step based on the preceding topic modeling step. BERTopic uses HDBSCAN for topic clustering, a soft-clustering approach that allows noise to be modeled as outliers. The authors claim that this prevents unrelated documents from being assigned to any of the topics and generally improves topic representation [3]. Since in our temporal analysis we are interested in historical trends, i.e. consistent changes through time that reflect cultural and social shifts in attitudes towards meat, we hypothesise that removing the outlier documents not belonging to coherent topics might allow us to conduct a more focused temporal analysis, which covers only the main topical trends and disregards semantic changes in word meaning that occur due to events covered in the news that do not reflect broader cultural trends or narratives. For this reason, we filter out articles not belonging to any topic and generate temporal lemma representations only on articles belonging to topics assigned by BERTopic.

3 EXPERIMENTS
3.1 Experimental setting
The experiments are conducted on the Slovenian news corpus described in Section 2.1. For topic modeling, we employ BERTopic with a multilingual embedding model, namely the "paraphrase-multilingual-MiniLM-L12-v2" Sentence Transformer from the Huggingface library (https://huggingface.co/), since no monolingual Sentence Transformer model exists for Slovenian. For the generation of temporal representations, we employ the SloBERTa model [12]. As mentioned in Section 2.3, the temporal representations are created by averaging token embeddings appearing in the same time slice and having the same lemma. To obtain the lemmas, we label the entire corpus with the Classla lemmatizer [5].
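A simplified sketch of this aggregation step, using SloBERTa via the transformers library, might look as follows; this is our illustration and glosses over the exact lemma-to-subword alignment used by the authors.

    import torch
    from collections import defaultdict
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta")
    model = AutoModel.from_pretrained("EMBEDDIA/sloberta")

    def temporal_lemma_vectors(slice_sentences):
        """slice_sentences: iterable of (sentence, {word_index: lemma})
        for one time slice; returns one averaged vector per lemma."""
        sums, counts = defaultdict(lambda: 0), defaultdict(int)
        for sentence, lemma_at in slice_sentences:
            enc = tokenizer(sentence, return_tensors="pt", truncation=True)
            with torch.no_grad():
                hidden = model(**enc).last_hidden_state[0]  # (tokens, dim)
            for word_idx, lemma in lemma_at.items():
                # Average the subword embeddings belonging to this word.
                sub = [i for i, w in enumerate(enc.word_ids()) if w == word_idx]
                if sub:
                    sums[lemma] = sums[lemma] + hidden[sub].mean(dim=0)
                    counts[lemma] += 1
        return {lemma: sums[lemma] / counts[lemma] for lemma in sums}

Cosine similarities between such per-slice vectors of two lemmas (e.g. "vegan" and "healthy") can then be tracked across slices to detect the trends discussed below.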
3.2 Results
The English translations of the obtained topics are presented in Table 2. 9,335 articles were labeled as not belonging to any specific topic. Among the categorized articles, most were categorised into the topic "restaurant, wine, kitchen, meat, culinary", which contains 745 articles describing Slovenian gastronomy. The smallest were the topics containing articles about the influence of the meat industry on the environment, public health, and veganism, each containing only about 100 articles.

Table 2: Topics and manually defined categories in the Slovenian meat corpus.

Category | Translated topic | Count
economy | percentage, inflation, price increase, chicken, food | 228
economy | euro, ljubljana, million, company | 202
economy | bank, mip, euro, million, supervisory | 125
economy | slovenian, food, quality, consumer, percentage | 646
economy | slovenian, company, mercator, euro, million | 204
culture | book, other, write, story, time | 148
culture | show, theatre, director, festival, theatrical | 207
culture | tourism, time, old, big, house | 336
culture | restaurant, wine, kitchen, meat, culinary | 745
ecology and health | vegan, child, animal, veganism | 114
ecology and health | water, dioxide, greenhouse, carbon, energy | 104
ecology and health | fat, cholesterol, diet, food, health | 138
ecology and health | marine, whaling, dolphin, fish, allowed | 114
agriculture | milk, agriculture, percentage, organic, Slovenian | 239
agriculture | meat, kebab, horse, product, dioxin | 319
other | other, can, life, time, world | 429
other | coach, team, season, play, championship | 346
other | oil, meat, minute, water, paprika | 299
other | prison, police officer, prosecution, convicted, euro | 201
other | election, president, agreement, government, political | 383
not categorized | / | 9335

Manual inspection of the different topics revealed that several topics can be further aggregated into broader categories, due to the fact that several topics cover semantically similar content (e.g., the topics "euro, ljubljana, million, company" and "bank, mip, euro, million, supervisory" both include financial news about different Slovenian meat companies). More specifically, the topics were manually categorized as "economy", "culture", "ecology and health", "agriculture", and the category "other", which contains articles covering several topics with very different content that cannot be combined into a broader semantic category, such as sport, lifestyle, recipes, politics, and judiciary. Ignoring the category named "other", most articles covered economy and culture. These categories were identified based on previous sociological research on meat [13]. By combining some topics into broader categories, besides the temporal analysis of the somewhat specific topics, we are also able to conduct a temporal analysis on a more general level that might allow us to detect how distinct general aspects of meat-related news lose or gain popularity through time.

Figure 1: Category distribution across time.

Figure 1 shows the distribution of categories across time. While economic topics were the most prevalent in 2008/2009, the graph also shows a clear decreasing trend for this category after 2010. The strongest upward trend is in the number of articles belonging to the category "other", which becomes the most dominant in 2016/2017. The production of articles covering cultural topics had also been steeply increasing until 2014/2015; after that, a gradual decline is observed. While agricultural topics do not indicate any clear positive or negative trend throughout the years, the ecology and health topics appear to be gaining popularity in recent years, especially from 2012/2013 onward.

Figure 2: Relative counts for topics in the categories "agriculture", "culture", "ecology and health", "economy", and "other".

Figure 2 shows relative counts (i.e. the number of articles belonging to specific topics divided by all articles that were assigned a topic) for the topics inside each category. In this fine-grained view, one can see that the rise in culture-related topics can be attributed to the major increase in the number of articles belonging to the topic "restaurant, wine, kitchen, meat, culinary" in 2012/2013, which mostly covers Slovenian gastronomy. When it comes to economic topics, all but one topic in this category (the exception being "slovenian, food, quality, consumer, percentage", which differs from the other economic topics by being more focused on the quality/price ratio) decline significantly in terms of relative count in 2010/2011. In the ecology and health category, one can see an increase in the relative count of the topics covering veganism and over-fishing. While the popularity of the topic covering the health benefits and drawbacks of meat is also increasing, the environmental topics related to global warming have decreased in popularity since their peak in 2010/2011. In the agriculture category, we see clear peaks in discussion on the topic "meat, kebab, horse, product, dioxin", which includes coverage of several scandals related to meat production and products in specific years. The topic most responsible for the increasing trend in the "other" category is "oil, meat, minute, water, paprika", which mostly covers articles about food recipes.

Finally, we discuss the results of the focused temporal analysis for the two manually selected concepts, "meat" and "vegan" (see Figure 3). We decided to explore an aspect of meat related to the creation of cultured meat (meat produced from animal stem cells) and plant-based meat analogues, which was not detected in our automatic topic analysis due to the scarcity of journalistic articles addressing cultured meat, but has nevertheless been addressed by several scholars studying the media representation of cultured meat [1]. We looked into the semantic similarity between the word "meat" and the words "artificial", "laboratory", and "substitute". One can see that the cosine similarity between "meat" and all related concepts peaks in 2012/2013. This coincides with the development of cultured meat and plant-based meat analogues and the consequent news reporting on it: the first public tasting of a cultured burger occurred in 2013 in London.
After 2012/13, only the cosine similarity between "substitute" and "meat" keeps increasing, while we see stagnation or even a gradual decrease in semantic similarity for the other two concepts. This suggests that the Slovenian news media is not significantly expanding its coverage of the production of artificial meat in recent years. Due to the findings of the automated temporal topic analysis, which suggest constant growth in the popularity of the topic covering veganism, we also opted for a further analysis of the word "vegan". We were interested in how the concept is correlated with the words "healthy", "environment", "ecological", and "climate change", in order to test the hypothesis that the news media is increasingly connecting veganism to ecological and health-related issues. The results indicate a stable positive trend throughout the years in terms of cosine similarity between veganism and the selected concepts, confirming our hypothesis.

Figure 3: Cosine similarity (CS) between the words "vegan" (left) and "meat" (right) and selected concepts.

4 CONCLUSION
In this study, we have conducted a quantitative analysis of meat-related news in the Slovenian news media. We constructed a corpus of meat-related news articles and conducted a topical and temporal analysis of the corpus using several SoA NLP techniques. We identified the main meat-related topics and trends and detected which meat-related topics are gaining or losing media coverage and popularity.
The results indicate that topics related to the meat economy are losing ground to cultural (especially culinary), ecological, and health topics. On the other hand, agricultural topics are neither gaining nor losing news coverage across time. The topic of artificial meat is not yet thoroughly covered in Slovenian media and has not been gaining further traction since the initial increase in coverage in 2012/2013. On the other hand, the results show that there is a semantic relation between the words vegan, healthy, and ecological, which is also slowly strengthening over time. In the future, we will further explore the main developments of the meat narrative in Slovenian media by gathering a larger corpus covering more media sources, which will allow us to employ other approaches for topic analysis and semantic change detection that require more data. We will also explore other concepts and discourses in Slovenian media besides meat, such as immigration, using techniques similar to the ones proposed in this work. Finally, we plan to expand the analysis to also cover media reporting in neighboring countries.

5 ACKNOWLEDGMENTS
The authors acknowledge the financial support of the Slovenian Research Agency through the research core funding for the programme Knowledge Technologies (No. P2-0103) and the project Computer-assisted multilingual news discourse analysis with contextual embeddings (No. J6-2581).

REFERENCES
[1] Sghaier Chriki, Marie-Pierre Ellies-Oury, Dominique Fournier, Jingjing Liu, and Jean-François Hocquette. 2020. Analysis of scientific and press articles related to cultured meat for a better understanding of its perception. Frontiers in Psychology, 11, 1845.
[2] International Agency for Research on Cancer et al. 2015. IARC monographs evaluate consumption of red meat and processed meat. World Health Organization. http://www.iarc.fr/en/mediacentre/pr/2015/pdfs/pr240_E.pdf.
[3] Maarten Grootendorst. 2022. BERTopic: neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
[4] Patrick D Hopkins. 2015. Cultured meat in western media: the disproportionate coverage of vegetarian reactions, demographic realities, and implications for cultured meat marketing. Journal of Integrative Agriculture, 14, 2, 264–272.
[5] Nikola Ljubešić and Vanja Štefanec. 2020. The CLASSLA-StanfordNLP model for lemmatisation of non-standard Serbian 1.1. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1351.
[6] Matej Martinc, Petra Kralj Novak, and Senja Pollak. 2020. Leveraging contextual embeddings for detecting diachronic semantic shift. In Proceedings of the Twelfth Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 4811–4819. ISBN: 979-10-95546-34-4.
[7] Helen Masterman-Smith, Angela T Ragusa, and Andrea Crampton. 2014. Reproducing speciesism: a content analysis of Australian media representations of veganism. In Proceedings of the Australian Sociological Association Conference.
[8] Leland McInnes, John Healy, and Steve Astels. 2017. HDBSCAN: hierarchical density based clustering. Journal of Open Source Software, 2, 11, 205.
[9] Gilly Mroz and James Painter. 2022. What do consumers read about meat? An analysis of media representations of the meat-environment relationship found in popular online news sites in the UK. Environmental Communication, 1–18.
[10] Jared Piazza, Matthew B Ruby, Steve Loughnan, Mischel Luong, Juliana Kulik, Hanne M Watkins, and Mirra Seigerman. 2015. Rationalizing meat consumption. The 4Ns. Appetite, 91, 114–128.
[11] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
[12] Matej Ulčar and Marko Robnik-Šikonja. 2021. SloBERTa: Slovene monolingual large pretrained masked language model.
[13] Andreja Vezovnik and Tanja Kamin. 2020. Good food for the future: an exploration of biocapitalist transformation of meat systems. Discourse, Context & Media, 33, 100354.

Slovene Word Sense Disambiguation using Transfer Learning

Zoran Fijavž (University of Ljubljana, Faculty of Education, Slovenia, zoran.fijavzz@gmail.com), Marko Robnik-Šikonja (University of Ljubljana, Faculty of Computer and Information Science, Slovenia, marko.robnik@fri.uni-lj.si)

ABSTRACT
Word sense disambiguation is an important task in natural language processing and computational linguistics with several practical applications, such as machine translation and speech synthesis. While the bulk of research efforts is targeted at English, some multilingual resources which include Slovenian have emerged recently. We utilized the Elexis-WSD dataset and a multilingual large language model to train models for word sense disambiguation in Slovenian, using sentence pairs with matching lemmas and matching or different word senses. The best model achieved an F1 score of 81.6 on a Slovenian test set, although the latter had a restricted vocabulary due to filtering and is not comparable to other testing frameworks. The exhaustive generation of sentence pairs for given lemmas and senses did not improve model performance and reduced the performance in out-of-vocabulary testing. Training on a mixed English-Slovene dataset maintained high test set as well as out-of-vocabulary results.

KEYWORDS
word sense disambiguation, transfer learning, multilingual transformer

1 INTRODUCTION
Word sense disambiguation (WSD) aims to identify the correct word sense used in a particular context. It is a long-standing problem in the field of computational linguistics and is important for downstream applications such as machine translation, information retrieval, text mining, and speech synthesis. Recent WSD approaches use pre-trained large language models such as BERT [3], fine-tuning them on annotated data. As with most supervised machine learning approaches, there is a bottleneck in the acquisition of high-quality training data. The problem is severe, as standard WSD approaches treat each word sense as a separate target label. A partial solution is to use multilingual pretrained models that can leverage several WSD datasets. In this paper, we demonstrate a methodology for cross-lingual transfer learning for WSD in Slovene that does not require compatible sense inventories in different languages. The proposed approach also works on out-of-vocabulary data. After outlining related work in Section 2, we describe the WSD models we developed for Slovene in Section 3 and their evaluation in Section 4. In Section 5, we provide an interdisciplinary critique of the current approaches to WSD that may be informative for future research. Section 6 presents the conclusions and ideas for further work.

2 RELATED WORK
One of the first WSD algorithms was Lesk [11], with its various extensions, based on the word overlap between pre-defined sense definitions and target sentences. Conceptually, modern approaches to WSD remain strikingly similar, with advances stemming mostly from increasingly complex word representations (e.g. contextual word embeddings) and expansive lexicographical resources (e.g. a gloss list for word senses in SemCor). Recent approaches use supervised learning directly on word sense annotations [5], enrich sense definitions with various lexicographical resources [7, 19], and include lexical databases as graph data in conjunction with contextual word embeddings [2]. Until recently, the development of contemporary WSD models for Slovenian was hindered by a lack of available datasets. That was partly addressed by the inclusion of Slovenian in the multilingual Elexis-WSD and XL-WSD datasets [12, 16]. Models trained on the latter obtained an F1 score of 68.36% for Slovene WSD, which is significantly lower than state-of-the-art English models scoring 80% or above (although differing test frameworks preclude direct comparisons).
3 METHODOLOGY
In this section we describe the training procedure, data preparation and testing framework used to develop and test the Slovenian WSD models.

3.1 Training Task and Setup
We operationalized WSD as a sentence-pair binary classification task that distinguishes between sentence pairs with an identical or a distinct word sense of a target lemma. Word senses were thus defined solely through annotated examples, without the need for a secondary source of sense definitions (e.g. sense collocations, coarse semantic tags or glosses). Casting WSD as a binary classification task allowed us to combine Slovene and English datasets, as sentence pairs could be generated from different WSD datasets irrespective of sense inventory compatibility. Examples of the sentence pairs can be found in Table 1. The drawback of this approach was a significant data loss from filtering, as many lemmas did not have enough senses and usage examples to generate sentence pairs.

Table 1: Two examples of the lemma Cirkus in the pair dataset and their English translations.

Lemma | Sentence 1 | Sentence 2 | Match
Cirkus | Družina na sliki s 'cirkusom' postuje po deželi. | Uprava 'cirkusa' ni odpovedala predstave. | Yes
Circus | The family in the photo travels around the country with the 'circus'. | The 'circus' management did not cancel the show. | Yes
Cirkus | Uprava 'cirkusa' ni odpovedala predstave. | Zganjali so 'cirkus' okrog družinskih vrednot. | No
Circus | The 'circus' management did not cancel the show. | They were making a 'circus' around family values. | No

For the base model, we used the pre-trained model CroSloEngual BERT [22], which can encode Slovenian, Croatian, and English texts. To reduce the training time and computational requirements, we used bottom layer freezing [10], gradient accumulation, and early stopping for non-converging models. Hyperparameter tuning was done on a 10% sample of the training data. We set the learning rate to 3e-5, gradient accumulation steps to 16, the batch size to 48, and the number of epochs to 2. Training a single model on 20% of all Slovenian sentence pairs required approximately 4 hours using a 16 GB NVidia GPU.
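A condensed sketch of this setup with the Hugging Face Trainer, using the reported hyperparameters, might look as follows; this is not the authors' exact script, and the freezing cut-off depth of 8 layers is illustrative, as the paper does not report it.

    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    name = "EMBEDDIA/crosloengual-bert"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(
        name, num_labels=2)

    # Bottom layer freezing: keep only the top encoder layers trainable.
    for layer in model.bert.encoder.layer[:8]:
        for param in layer.parameters():
            param.requires_grad = False

    args = TrainingArguments(
        output_dir="wsd-model", learning_rate=3e-5, num_train_epochs=2,
        per_device_train_batch_size=48, gradient_accumulation_steps=16)

    # train_dataset is assumed to yield tokenized sentence pairs, e.g.
    # tokenizer(sentence1, sentence2, truncation=True), with 0/1 labels.
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()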
3.2 Data Preparation
We used both Slovenian and English WSD datasets. The Slovenian data was obtained from the Slovenian section of the Elexis-WSD corpus [12], and the English data was drawn from SemCor so as to approximately match the size of the filtered Slovenian data.

Over 50% of the original Slovenian lemmas had a single sense tag. We removed multi-word and hyphenated senses and repeatedly filtered the dataset until there were at least two senses per lemma with at least four examples each. The original dataset was thus heavily filtered, from 202,240 sentences with 5,604 lemmas and 11,069 word sense tags to 139,445 sentences with 1,597 lemmas and 4,633 word sense tags. Punctuation was removed and target words were enclosed in apostrophes as a weak supervision signal [7].

The filtered Slovenian dataset was split into train, test and validation datasets. For the test dataset, we sampled two or eight sentences per word sense (depending on the total number of available sentences). The lower limit was needed to create sentence pairs and the upper limit was used to prevent frequent lemmas and senses from giving overly optimistic test scores. The validation dataset was created by sampling four sentences per word sense from lemmas with at least eight sentences, assuming frequent senses would be sufficient to detect over- and under-fitting. The remainder of the data was kept for training. The Slovenian training and testing datasets contained the full coverage of the included Slovenian word senses (4,633 distinct senses), and the validation dataset contained 1,743 senses. All Slovenian datasets included the full coverage of the included lemmas (1,597). The Slovenian training dataset contained 104,316 unique sentences, the testing set 28,157 sentences, and the validation dataset 6,972 sentences.

The filtered Slovene datasets were transformed into a dataset of sentence pairs by generating sentence combinations between sentences sharing a lemma. We limited the number of non-matching combinations to the number of possible matching combinations for each word sense. By storing infrequent sense pairs and downsampling frequent ones, we created two smaller Slovene sentence-pair datasets with the size of 10% and 20% of the original dataset.
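An illustrative sketch of the pair generation (the downsampling of non-matching pairs described above is omitted for brevity):

    from itertools import combinations

    def make_pairs(examples):
        """examples: list of (lemma, sense_tag, sentence) tuples; returns
        (sentence1, sentence2, label) with label 1 for matching senses."""
        by_lemma = {}
        for lemma, sense, sentence in examples:
            by_lemma.setdefault(lemma, []).append((sense, sentence))
        pairs = []
        for items in by_lemma.values():
            # All pairs of sentences that share a lemma.
            for (s1, t1), (s2, t2) in combinations(items, 2):
                pairs.append((t1, t2, int(s1 == s2)))
        return pairs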
The Slovenian Slovene dataset, the 10% English dataset, the 20% English dataset data was obtained from the Slovenian section of the Elexis-WSD (with and without early stopping) and the mixed 20% dataset corpus [12] and the English data was drawn from SemCor to (a concatenation of the 10% Slovene and English datasets). approximately match the size of the filtered Slovenian data. 3.3 Evaluation Settings Over 50% of the original Slovenian lemmas had a single sense tag. We removed multi-word and hyphenated senses and repeat- Model performance was measured using the 𝐹 score and the 1 edly filtered the datasets until there were at least two senses per Matthews correlation coefficient (MCC). The latter is a chi-square lemma with at least four examples. The original dataset was thus statistic computed from the confusion matrix of classification heavily filtered from 202,240 sentences with 5,604 lemmas and results. It served as an additional performance metric and en- 11,069 word sense tags to 139,445 sentences with 1,597 lemmas abled us to compare models without having to predict specific and 4,633 word sense tags. Punctuation was removed and target word sense tags (e.g., evaluate models on the OOV dataset with words were enclosed in apostrophes as a weak supervision signal dissimilar lemmas and sense tags). [7]. Two methods were used to predict the sense classes on the The filtered Slovenian dataset was split into train, test and Slovenian test set. The first prediction method, called the average validation datasets. For the test dataset, we sampled two or eight sense probability heuristic (ASP) used the test set structure with sentences per word sense (depending on the total number of the models’ binary classifier to determine the most likely sense. available sentences). The lower limit was needed to create sen- The target sentence was combined with all other test sentences tence pairs and the upper limit was used to prevent frequent sharing a lemma (except with itself ) and a softmax value was lemmas and senses from giving overly optimistic test scores. The obtained for each pair. The softmax values were averaged based validation dataset was created by sampling four sentences per on the sense tag of the non-target sentence and the sense with word sense from lemmas with at least eight sentences, assuming the highest average score was chosen as the sense prediction for frequent senses would be sufficient to detect over- and under- the target sentence. The second prediction method used near- fitting. The remainder of the data was kept for training. The est neighbour matching between target sentence embeddings Slovenian training and testing datasets contained the full cov- and sense embeddings. The latter were created by converting the erage of included word Slovenian senses (4,633 distinct senses) entire Slovenian training and validation dataset into sentence and the validation dataset contained 1,743 senses. All Slovenian embeddings [18] and averaging them by their word sense tags. datasets included the full coverage of included lemmas (1,597). The test sentences were likewise embedded and their sense label The Slovenian training dataset contained 104,316 unique sen- was predicted by selecting the sense embedding with the lowest tences, the testing set 28,159 sentences and the validation dataset cosine distance from the target sentence embedding. 6,972 sentences. 
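The nearest-neighbour method can be sketched as follows, assuming sentence embeddings (e.g., from Sentence-BERT [18]) have already been computed. This is an illustrative reimplementation with hypothetical helper names, not the authors' code; random vectors stand in for real embeddings.

    import numpy as np

    def sense_centroids(embeddings, sense_tags):
        """Average sentence embeddings (rows) by their word-sense tag."""
        centroids = {}
        for tag in set(sense_tags):
            rows = embeddings[[i for i, t in enumerate(sense_tags) if t == tag]]
            centroids[tag] = rows.mean(axis=0)
        return centroids

    def predict_sense(test_vec, centroids):
        """Pick the sense whose centroid has the lowest cosine distance."""
        def cos_dist(a, b):
            return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        return min(centroids, key=lambda tag: cos_dist(test_vec, centroids[tag]))

    # Toy example with random "embeddings" standing in for Sentence-BERT output.
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(6, 8))
    tags = ["s1", "s1", "s1", "s2", "s2", "s2"]
    print(predict_sense(emb[0], sense_centroids(emb, tags)))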
The most frequent sense (MFS) heuristic as well as the sense embedding predictions from an untrained model were used as performance baselines. Lastly, several F1 scores per model (micro-F1, macro-F1, and micro-F1 by POS tags) were used as repeated measures for model comparison using the Friedman test with the Nemenyi post-hoc test.

4 RESULTS
We evaluated model predictions with binary classifiers and with nearest neighbour matching to sense embeddings. Additionally, we used the Matthews correlation coefficient to evaluate the performance of the binary classifiers and to evaluate model performance on the out-of-vocabulary dataset.

4.1 Binary Classifier Sense Predictions
The baseline F1 from the MFS heuristic was 40.4%. The difference between model predictions was statistically significant (chi-square_F = 36.12; df = 5; n = 8; p < 0.001), with the top three models differing significantly from the MFS baseline: the models trained on the mixed 20% training data (F1 = 81.6; p = 0.001), on the 10% Slovene data (F1 = 81.4; p = 0.026), and on the entire Slovene dataset (F1 = 81.0; p = 0.004). Detailed results from predictions with binary classifiers can be found in Table 3. The statistical differences between binary classification models are presented in Figure 1.

Table 3: F1 scores of binary classifier predictions.

  Model                     Micro-F1
  MFS baseline              40.4
  Full Sl.                  81.0
  10% Sl.                   81.4
  20% Sl.                   80.5
  10% En.                   68.7
  20% En.                   46.9
  20% En. (early stopping)  80.6
  20% mix                   81.6

Figure 1: Critical distance diagram for binary classification results.

4.2 Binary Classifier Correlation Metrics
As the test set was transformable into sentence pairs, we used the binary classifiers directly on the test set and computed an MCC without predicting sense labels. We applied the same procedure to test model performance on the OOV dataset.

The highest correlation between actual and predicted binary labels was achieved by the model trained on the entire Slovenian dataset (MCC = 0.629), followed by the models trained on the 20% Slovene and 20% mixed datasets (MCC = 0.578 for both). The highest correlation between actual and predicted labels on the OOV dataset was achieved by the model trained on the 20% English dataset with early stopping (MCC = 0.353), followed by the 20% mixed dataset (MCC = 0.326). It should be noted that the former was a base model with minimal updates, as training stopped after a single update at 200 out of 1916 total steps. Interestingly, ranking the models by the amount of included training data revealed a positive correlation between the number of included examples and the test set MCC (r_s = 0.566; df = 5; p = 0.185) and a negative correlation between the number of included examples and the OOV dataset MCC (r_s = -0.378; df = 5; p = 0.404), although neither association was statistically significant. Detailed results from MCC testing can be found in Table 4.

Table 4: Binary classifier MCC test and OOV scores.

  Model                     MCC test  MCC OOV
  Full Sl.                  0.629     0.273
  10% Sl.                   0.550     0.292
  20% Sl.                   0.578     0.284
  10% En.                   0.321     0.268
  20% En.                   0.004     0.273
  20% En. (early stopping)  0.491     0.353
  20% mix                   0.578     0.326
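Both metrics are available off the shelf; a minimal sketch of computing the micro-F1 and MCC over binary pair decisions (with toy labels invented for illustration) might look like this:

    from sklearn.metrics import f1_score, matthews_corrcoef

    # Toy binary-pair labels: 1 = same sense, 0 = different sense.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    # Micro-F1 over pair decisions and the MCC used for OOV comparison.
    print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
    print("MCC:     ", matthews_corrcoef(y_true, y_pred))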
4.3 Sense Predictions with Nearest Neighbour Matching
For predictions with nearest neighbour matching between target sentence and sense embeddings, the baselines were the MFS heuristic (F1 = 40.4%) and the predictions from the untrained model (F1 = 21.7%). The difference between model predictions was statistically significant (chi-square_F = 45.11; df = 5; n = 9; p < 0.001). The only model significantly different from the MFS predictions was the one trained on the entire Slovene dataset (F1 = 72.8%; p = 0.003). Detailed results from predictions using nearest neighbour matching can be found in Table 5. The statistical differences between nearest neighbour predictions from different models are presented in Figure 2.

Table 5: F1 scores of nearest neighbour predictions.

  Model                     Micro-F1
  MFS baseline              40.4
  Untrained model           21.7
  Full Sl.                  72.8
  10% Sl.                   50.9
  20% Sl.                   60.7
  10% En.                   53.2
  20% En.                   60.6
  20% En. (early stopping)  28.7
  20% mix                   61.0

Figure 2: Critical distance diagram for nearest neighbour results.

5 DISCUSSION ON INTERDISCIPLINARY ASPECTS
In this section, we offer a brief critique of the WSD task from the perspective of psycholinguistics and pragmatics and from insights gained through model development, and we suggest options for further research.

The datasets commonly used for WSD are not transparent in terms of the specific sense ambiguities they contain, in spite of available typologies. Psycholinguistic literature has identified significant differences in human processing between homonymy and polysemy [8], as well as between various subtypes of the latter (e.g., metonymy and metaphor) [9]. As demonstrated by the use of the out-of-vocabulary test set, additional datasets, even if comparatively small, can provide important additional information on model performance. By incorporating a theoretically informed typology of polysemy or lexical ambiguity, future research could provide richer descriptions of the word sense relations contained in widely used WSD datasets, as well as develop specific tests for various types of polysemy. The latter could draw on datasets from psycholinguistic experiments, which commonly control for a plethora of variables, such as word and sense frequency. We also observed that Elexis-WSD and SemCor contain a large number of single-sense lemmas, which would explain why F1 scores from the MFS heuristic in related works are commonly relatively high.

Furthermore, while large language models have achieved state-of-the-art results in WSD, they do not fundamentally diverge from distributional semantics [6], which is but one account of possible disambiguation mechanisms. It is possible, for instance, to conceptualise word disambiguation as a pragmatic process whereby the common ground (shared knowledge) between speakers [1] scaffolds disambiguation, an account on which speakers may introduce ambiguity on purpose to meet various communicative goals [15].

6 CONCLUSION
We developed several word sense disambiguation models for Slovenian text and achieved comparatively high performance, albeit on a limited selection of lemmas and word senses. We demonstrated that including small datasets to measure out-of-vocabulary performance yields important insights, as the models tended to generalize better with more compact training datasets.

The models presented in this paper would benefit from a review of Slovenian lexicographical sources and of the sense inventory compatibility between them. Replacing annotated sentences with sense definitions (e.g., collocation lists, coarse semantic tags, gloss definitions) would greatly increase the number of available training examples. Other large language models could also be used, and a detailed hyperparameter optimization could be performed for each model individually.

The source code related to this paper and the datasets used are freely available at https://github.com/zo-fi/slo_wsd_ZFMA.

Acknowledgments
The work was partially supported by the Slovenian Research and Innovation Agency (ARIS) core research programme P6-0411, and projects J6-2581 and J7-3159.

REFERENCES
[1] Keith Allan. 2013. What is Common Ground? In Perspectives on Linguistic Pragmatics. Perspectives in Pragmatics, Philosophy & Psychology. Alessandro Capone, Franco Lo Piparo, and Marco Carapezza, editors. Springer, Cham, 285-310. doi: 10.1007/978-3-319-01014-4_11.
[2] Michele Bevilacqua and Roberto Navigli. 2020. Breaking Through the 80% Glass Ceiling: Raising the State of the Art in Word Sense Disambiguation by Incorporating Knowledge Graph Information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2854-2864. doi: 10.18653/v1/2020.acl-main.255.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171-4186. doi: 10.18653/v1/N19-1423.
[4] Zala Erič, Miha Debenjak, and Denis Derenda Cizel. 2022. Cross-lingual sense disambiguation. GitHub repository. https://github.com/dextos658/Cross-lingual-sense-disambiguation.
[5] Christian Hadiwinoto, Hwee Tou Ng, and Wee Chung Gan. 2019. Improved Word Sense Disambiguation Using Pre-Trained Contextualized Word Representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5297-5306. doi: 10.18653/v1/D19-1533.
[6] Zellig S. Harris. 1954. Distributional Structure. WORD, 10, 2-3, 146-162. doi: 10.1080/00437956.1954.11659520.
[7] Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3509-3514. doi: 10.18653/v1/D19-1355.
[8] Ekaterini Klepousniotou and Shari R. Baum. 2007. Disambiguating the ambiguity advantage effect in word recognition: An advantage for polysemous but not homonymous words. Journal of Neurolinguistics, 20, 1, 1-24. doi: 10.1016/j.jneuroling.2006.02.001.
[9] Ekaterini Klepousniotou, G. Bruce Pike, Karsten Steinhauer, and Vincent Gracco. 2012. Not all ambiguous words are created equal: An EEG investigation of homonymy and polysemy. Brain and Language, 123, 1, 11-21. doi: 10.1016/j.bandl.2012.06.007.
[10] Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4365-4374.
[11] Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation (SIGDOC '86), 24-26. isbn: 978-0-89791-224-2. doi: 10.1145/318723.318728.
[12] Federico Martelli et al. 2022. Parallel sense-annotated corpus ELEXIS-WSD 1.0. https://elex.is/. Retrieved Oct. 21, 2022 from https://www.clarin.si/repository/xmlui/handle/11356/1674.
[13] Matej Miočić, Marko Ivanovski, and Matej Kalc. 2022. NLP-tripleM. GitHub repository. https://github.com/KalcMatej99/NLP-tripleM.
[14] David Miškić, Kim Ana Badovinac, and Sabina Matjašič. 2022. cross-lingual-sense-disambiguation. GitHub repository. https://github.com/NLP-disambiguation/cross-lingual-sense-%20disambiguation.
[15] Brigitte Nerlich and David D. Clarke. 2001. Ambiguities we live by: towards a pragmatics of polysemy. Journal of Pragmatics, 33, 1, (Jan. 2001), 1-20. doi: 10.1016/S0378-2166(99)00132-0.
[16] Tommaso Pasini, Alessandro Raganato, and Roberto Navigli. 2021. XL-WSD: an extra-large and cross-lingual evaluation framework for word sense disambiguation. Proceedings of the AAAI Conference on Artificial Intelligence, 35, 15, 13648-13656. doi: 10.1609/aaai.v35i15.17609.
[17] Erazem Pušnik, Rok Miklavčič, and Aljaž Šmaljcelj. 2022. nlp-project3. GitHub repository. https://github.com/RoKKim/nlp-project3.
[18] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982-3992. doi: 10.18653/v1/D19-1410.
[19] Yang Song, Xin Cai Ong, Hwee Tou Ng, and Qian Lin. 2021. Improved Word Sense Disambiguation with Enhanced Sense Representations. In Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, 4311-4320. doi: 10.18653/v1/2021.findings-emnlp.365.
[20] Jure Tič, Nejc Velikonja, and Sandra Vizlar. 2022. NLP. GitHub repository. https://github.com/JureTic/NLP.
[21] Andrej Tomažin. 2022. nlp-wic. GitHub repository. https://github.com/anzetomazin/nlp-wic.
[22] Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEngual BERT. In Text, Speech, and Dialogue, 104-111. doi: 10.1007/978-3-030-58323-1_11.
doi: vocabulary performance yields important insights, as the models 10.1016/S0378- 2166(99)00132- 0. tended to generalize better with compacter training datasets. [16] Tommaso Pasini, Alessandro Raganato, and Roberto Navigli. 2021. Xl-wsd: an extra-large and cross-lingual evaluation framework for word sense dis-The models presented in this paper would benefit from a re- ambiguation. Proceedings of the AAAI Conference on Artificial Intelligence, view of Slovenian lexicographical sources and sense inventory 35, 15, 13648–13656. doi: 10.1609/aaai.v35i15.17609. compatibility between them. Replacing annotated sentences with [17] Erazem Pušnik, Rok Miklavčič, and Aljaž Šmaljcelj. 2022. nlp-project3. GitHub repository. https://github.com/RoKKim/nlp- project3. sense definitions (e.g. collocation lists, coarse semantic tags, gloss [18] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embed- definitions) would greatly increase the number of available train- dings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Interna-ing examples. Other large language models could also be used tional Joint Conference on Natural Language Processing (EMNLP-IJCNLP), and a detailed hyperparameter optimization could be performed 3982–3992. doi: 10.18653/v1/D19- 1410. for each model individually. [19] Yang Song, Xin Cai Ong, Hwee Tou Ng, and Qian Lin. 2021. Improved Word Sense Disambiguation with Enhanced Sense Representations. In Findings of The source code related to this paper and the datasets used the Association for Computational Linguistics: EMNLP 2021. Association for 1 are freely available . Computational Linguistics, 4311–4320. doi: 10.18653/v1/2021.f indings- emn lp.365. Acknowledgments [20] Jure Tič, Nejc Velikonja, and Sandra Vizlar. 2022. NLP. GitHub repository. https://github.com/JureTic/NLP. The work was partially supported by the Slovenian Research and [21] Andrej Tomažin. 2022. nlp-wic. GitHub repository. https://github.com/anze tomazin/nlp- wic. Innovation Agency (ARIS) core research programme P6-0411, [22] Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEn- and projects J6-2581 and J7-3159. gual BERT. In Text, Speech, and Dialogue, 104–111. doi: 10.1007/978- 3- 030- 58323- 1_11. 1 https://github.com/zo-fi/slo_wsd_ZFMA 57 Predicting the FTSO consensus price Filip Koprivec Tjaž Eržen Urban Mežnar filip.koprivec@ijs.si erzen.tjaz@gmail.com urban.meznar@aflabs.si JSI, FMF, AFLabs AFLabs AFLabs Ljubljana, Slovenia Ljubljana, Slovenia Ljubljana, Slovenia ABSTRACT Eyal [15] provide a comprehensive study, while Caldarelli’s subsequent work [2] offers an overview of oracle research. Liu et The paper presents a system for predicting cryptocurrency con-al. [14] survey various oracle implementation techniques. No-sensus prices within the Flare Time Series Oracle (FTSO), a de- tably, Alagha [1] introduces a reinforcement learning model to centralized oracle solution running on Flare blockchain. By lever-enhance oracle reliability [11]. aging a combination of smoothing techniques and machine learn- The main oracle solution provider is Chainlink, which ad- ing methodologies, we detail and analyze the construction and dresses the oracle problem with enhanced security and scalability performance of our own provider. This paper presents the FTSO in Chainlink 2.0 [5]. Zhang et al. 
[13] also detail their approach, mechanism, and basic information about the game theoretic back-providing insights for evolving projects like Flare FTSO in the ground together with rewarding and submission protocol. Lastly, oracle domain. we present our provider’s prediction accuracy for each coin. KEYWORDS 3 FTSO PROTOCOL FTSO, schelling point, machine learning, regression, smoothings The Flare Time Series Oracle plays an important role in Flare Net- 1 INTRODUCTION work’s data accuracy and decentralization. The protocol works in a series of discrete steps to decrease the performance hit on The blockchain and decentralized finance (DeFi) sectors have the whole network. Every 3 minutes marks the beginning of a seen significant growth, but they share a common challenge: new price epoch. Providers are mandated to submit their price es- securely accessing data not directly included in transaction sig- timates in a timely manner using the commit and reveal scheme natures. This issue, known as the oracle problem [3], hinders the to maintain confidentiality and prevent other providers from broader adoption of blockchain technologies as it’s typically dif- viewing or copying their predictions. ficult to obtain reliable off-chain data. While various on-chain Only after the price epoch has ended, providers reveal the protocols offer solutions, each has its trade-offs concerning secu- actual submitted values. This reveal must be done in the first 90 rity, accuracy, and data reliability. Traditional centralized oracles seconds of the next price epoch, which overlaps with the first present risks like data manipulation, whereas fully decentralized half of the next submit epoch. After the reveal epoch ends, all alternatives often suffer from latency and higher costs. the revealed values are combined and a network-wide price is This paper examines the Flare Time Series Oracle, a decentral- calculated. Data providers are incentivized to submit good prices ized oracle that uses a schelling point mechanism to aggregate by the network-wide rewarding system, by being rewarded if data from multiple providers [11]. Fata providers submit price prices fall in the middle two quartiles (IQR range) of the final estimates every three minutes, with the system price determined price. as a weighted median of these submissions. Given the inherent The network thus gets fresh asset prices every 3 minutes with price variability across exchanges and the indeterminate nature some delay due to the reveal period. Such data granularity is not of asset prices within a three-minute window, there isn’t a sin- sufficient for high-frequency trading but has proven sufficient gular "correct" price. Providers aim to select a price close to the for many financial applications. The network and community final median, incentivized by the reward system. This competitive explicitly don’t define what a correct price is, to remove the environment, involving around 100 data providers, has shown vulnerability of the definition relying on a specific price source. resilience against market anomalies and exchange issues. The Assets are denominated in $ with 5 decimal points of precision. paper investigates machine learning techniques to predict this Since most of the exchanges quote a price that is accurate up final median price using exchange data. 
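To illustrate the aggregation step, here is a minimal sketch of a weighted median and the IQR reward check. This is our simplification with toy numbers, not the Flare reference implementation, and it ignores protocol details such as tie-breaking between equally weighted submissions.

    import numpy as np

    def weighted_median(prices, weights):
        """Smallest price at which the cumulative vote weight reaches 50%."""
        order = np.argsort(prices)
        prices, weights = np.asarray(prices)[order], np.asarray(weights)[order]
        cumulative = np.cumsum(weights)
        return prices[np.searchsorted(cumulative, 0.5 * cumulative[-1])]

    # Toy reveal epoch: three providers with delegated vote weights.
    prices = [0.51212, 0.51230, 0.51198]
    weights = [0.025, 0.020, 0.015]  # each provider capped at 2.5%
    print(weighted_median(prices, weights))

    # Providers are rewarded when their price lands in the IQR of submissions.
    lo, hi = np.percentile(prices, [25, 75])
    print([lo <= p <= hi for p in prices])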
4 DATA RETRIEVAL AND PREDICTION

4.1 Overview
The data retrieval process is a crucial step in our analysis. It involves collecting, processing, and preparing time series data, specifically price and timestamp pairs, for further analysis. This data is essential for understanding trends, making predictions, and deriving insights.

The primary sources of our data are the FTSO prices from previous epochs and current data from various exchanges. Selecting a specific subset of exchanges as a data source is a nontrivial task. Each exchange has its own set of characteristics: trading volume, user base, regional influences, and even specific trading behaviors. Historical data shows that providers are quick (on a sub-hour basis) to adapt to market opening and closing times and usually disregard after-hours trading prices on exchanges. Furthermore, the reliability of data from each exchange can vary: some exchanges might offer more consistent and clean data, while others might have gaps or anomalies.

4.2 Data Processing and Smoothing Techniques
Once the data is retrieved, it undergoes several processing steps to ensure its quality and relevance for prediction. One of the primary challenges in time series forecasting is the inherent noise present in the data. Financial data is especially prone to short-term spikes, as low-liquidity exchanges can experience large price deviations when market depth is limited. Such spikes are quickly exploited by arbitrageurs, but the price jumps (anomalies) remain in the data and must be accounted for. We employ various smoothing techniques to filter out noise and highlight the underlying trends.

Exponential Moving Average (EMA): EMA is a type of weighted moving average that gives more weight to the most recent prices. In our system, the EMA vector and its alpha value are optimized using the curve_fit method from the scipy.optimize library [10].

Savitzky-Golay Smoothing: This technique uses convolution to fit successive subsets of adjacent data points with a low-degree polynomial. It is effective in preserving features of the distribution, such as heights and widths, making it suitable for our analysis [12].

Linear Interpolation: Linear interpolation is used to estimate values between two known values in a dataset. Our system employs a skew linear fit to interpolate missing or anomalous data points.

FFT Smoothing: The last smoothing method we used is Fast-Fourier smoothing.

Each of these methods has its own strengths and is chosen based on the specific characteristics of the data and the prediction requirements. So far, the only other smoothing method we have tried to incorporate is LOWESS (Locally Weighted Scatterplot Smoothing), which performed worse than the rest of the smoothing methods after training an overdetermined system on it (see Section 4.3). The mentioned methods were selected because they are commonly used for smoothing financial data [9], are easily available in multiple scientific libraries, and offer good resilience against the sudden spikes typical of markets with low liquidity.
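A minimal sketch of two of these smoothers on a synthetic price series is shown below. The paper fits the EMA alpha with curve_fit, whereas this sketch fixes alpha for brevity; the window length and polynomial order are likewise illustrative assumptions rather than the parameters actually used.

    import numpy as np
    from scipy.signal import savgol_filter

    def ema(prices, alpha):
        """Exponential moving average; recent prices get weight alpha."""
        out = np.empty_like(prices, dtype=float)
        out[0] = prices[0]
        for i in range(1, len(prices)):
            out[i] = alpha * prices[i] + (1.0 - alpha) * out[i - 1]
        return out

    rng = np.random.default_rng(1)
    raw = np.cumsum(rng.normal(0, 0.01, 200)) + 100.0  # synthetic price series
    raw[50] += 0.5  # a low-liquidity spike the smoothers should damp

    smooth_ema = ema(raw, alpha=0.2)
    smooth_sg = savgol_filter(raw, window_length=21, polyorder=3)
    print(raw[50], smooth_ema[50], smooth_sg[50])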
4.3 Prediction Mechanism
After smoothing the data using the techniques listed above, we adopt an overdetermined-system approach for our predictions. This entails constructing a system of equations from the processed data and subsequently employing the least squares method to find the optimal prediction parameters.

Suppose we are training our time series over m epochs. Let E be an m x n matrix (E in R^(m x n)) where each column e_i represents the price vector for the i-th exchange across the m epochs. The vector v in R^n signifies the normalized weights or contributions of each exchange to the forecasted price; each entry v_i of v corresponds to the significance of the i-th exchange. Given the extensive epoch training data required for model training and the limited number of available crypto exchanges (in the tens), we are dealing with an overdetermined system. In this context, we optimize the vector v using the least squares error method. The residual sum of squares evaluation function is minimized using the fmin_cg method from scipy.optimize, aiming to find the parameters that minimize the difference between the predicted values and the actual values in the training data.

For each exchange and for each smoothing method, we define a possible upper and lower range for the method's parameters and specify a step size. We then compute the Cartesian product of all these sets, yielding all viable parameter combinations in the form of a multidimensional grid. For each combination in this Cartesian product, we smooth the data using the methods described above, train the model and calculate the optimal solution vector, which tells us how much weight each exchange should hold. Finally, we identify the model configuration that delivers the best performance.

The overdetermined system was chosen for a number of reasons. We preferred a simple model with the potential for explanation, or at least quick access to information about which input parameters offer greater prediction power. Although not included in our numerical utility function, delegation and the social aspect of the goodness of a price are important for multiple reasons: being slightly less accurate but consistently providing reasonable prices attracts more delegations and provides more security and trust in the network. Therefore, a small error in an otherwise reasonable price was much preferred to being off by a lot due to an edge condition or to overfitting on a specific input parameter. Furthermore, incoming network upgrades might force the providers to buy or sell assets at the revealed price (and not at the market price), which means that a large deviation from the correct price would also be financially problematic.

Lastly, the providers work in bursts. Most of the information-rich exchange data comes in just before the end of the epoch (the last few seconds), so a longer evaluation time might mean we miss some information or are too late for the submission. Our internal analysis shows that the submission must be calculated at least 5-8 seconds before the end of each epoch to be reliably accepted by the network validators (network latency usually requires submitting the price a few seconds before the end of the epoch).
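A toy version of this optimization, with synthetic data standing in for the smoothed exchange prices and the FTSO median targets, can be sketched as follows (the weights and noise levels are invented for illustration):

    import numpy as np
    from scipy.optimize import fmin_cg

    rng = np.random.default_rng(2)
    m, n = 160, 6                            # epochs x exchanges: overdetermined
    E = 100.0 + rng.normal(0, 0.1, (m, n))   # smoothed per-exchange prices
    true_v = np.array([0.4, 0.3, 0.1, 0.1, 0.05, 0.05])
    y = E @ true_v + rng.normal(0, 0.01, m)  # observed FTSO median prices

    def rss(v):
        """Residual sum of squares between predicted and actual medians."""
        r = E @ v - y
        return r @ r

    v_opt = fmin_cg(rss, x0=np.full(n, 1.0 / n), disp=False)
    print(np.round(v_opt, 3))          # per-exchange weights
    print(np.round(v_opt.sum(), 3))    # close to 1 for this toy problem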
5 RESULT ANALYSIS
We evaluated the performance of our trained models by comparing them against three simpler prediction methods. The Last Seen Value method predicts that the future value of a coin will be the most recent exchange price observed before the prediction starts. The Previous Epoch Value method predicts the price of a coin as the FTSO price from the previous epoch. Lastly, we also try the overdetermined system without any smoothing.

Our accuracy analysis spanned a week, with new models trained every day on the previous 8 hours of data (160 epochs). The model's success rate was then validated against the subsequent 8-hour dataset immediately following the training data. The success rate is the number of times the predicted price fell in the interquartile range, divided by the number of epochs the price was submitted for. This corresponds exactly to what price providers are financially incentivized to do.

The detailed results are presented in Figures 1a to 1d. As anticipated, the Last Seen Value method yields modest outcomes, averaging a prediction success rate of 3.5% across all coins.

For the Previous Epoch Value method, we set the prediction to match the price from the previous epoch. While this method outperformed the first, it still registered low performance, averaging around 7% for all coins over the week. Notably, several coins like ETH or FIL had an average success rate close to 0%, while DOGE achieved an average of 15%.

Training an overdetermined system without smoothing the data outperformed the first two methods, averaging around a 10% success rate across all coins during the testing week. Notably, the full prediction method, which smooths the data and trains an overdetermined system, outperformed all of the previous methods.

The evaluation closely mirrored real-world conditions: due to changes in exchanges, fluctuations in vote powers, and the inclusion of new data providers in the median calculation, models must be retrained on an almost daily basis. Over the observed epochs, our FTSO provider demonstrated varied success rates across different cryptocurrencies. The success rates for XRP, DOGE and BTC generally ranged between 0.20 and 0.45, indicating moderate to high prediction accuracy. Meanwhile, coins like XLM, ADA, and ARB had lower success rates, often below 0.15, suggesting challenges in predicting their prices. Overall, the provider's performance fluctuated across epochs and coins, with some cryptocurrencies consistently achieving higher success rates than others; we were able to achieve a moderate prediction success of around 0.22, currently ranking 26th among the 94 active FTSO providers.

Because smoothing the data and then training the overdetermined system yielded better results than training the overdetermined system alone, we can conclude that smoothing improves the result in this setting. Without smoothing, our prediction model is highly influenced by noise and short-term fluctuations, making it challenging to capture the underlying trend in the time series data.

Table 2: Average success rate for each prediction method and selected coins.

  Coin  Last Seen  Prev. Ep  No smooth  Smooth
  XRP   0.02129    0.04986   0.18729    0.33900
  XLM   0.02886    0.11686   0.03129    0.11329
  DOGE  0.07686    0.16986   0.13186    0.38086
  ADA   0.04143    0.14214   0.06157    0.13457
  BTC   0.01043    0.01943   0.14071    0.32543
  ARB   0.02700    0.02343   0.09129    0.11529

Figure 1: Success rates per coin for (a) the "Last Seen Value" method, (b) the "Previous Epoch Value" method, (c) the overdetermined system without data smoothing, and (d) the overdetermined system with data smoothing.
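The success-rate computation itself is straightforward; a sketch with simulated submissions (all numbers invented, and real submissions would of course come from the trained model) is:

    import numpy as np

    def epoch_success(submitted, revealed_prices):
        """1 if our submission lands in the IQR of all revealed prices."""
        lo, hi = np.percentile(revealed_prices, [25, 75])
        return int(lo <= submitted <= hi)

    rng = np.random.default_rng(3)
    hits = 0
    epochs = 160                                     # one 8-hour validation window
    for _ in range(epochs):
        revealed = 100.0 + rng.normal(0, 0.05, 94)   # ~94 active providers
        ours = 100.0 + rng.normal(0, 0.05)
        hits += epoch_success(ours, revealed)
    print(hits / epochs)                             # success rate over the window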
6 RMSE VALUES
Lastly, we analyzed the RMSE (root mean squared error) of each method for each coin, to provide more insight into each method's accuracy. The results are shown in Table 1. Since the prices of different coins vary, the RMSE values are not comparable across coins, only across methods for a single coin. For most coins, the Last Seen Value method generally yields the highest RMSE values, indicating the worst accuracy relative to the other methods. Conversely, the overdetermined system with smoothing tends to produce the lowest RMSE values for most coins. The Previous Epoch Value method and the overdetermined system without smoothing rank somewhere in between.

Table 1: RMSE for each method and selected coins.

  Coin  Last Seen    Prev. Ep    No smooth   Smooth
  XRP   0.07412964   0.01536945  0.00542317  0.00398449
  XLM   0.00010802   0.00025230  0.00090994  0.00025548
  DOGE  0.00004626   0.00001359  0.00000733  0.00000641
  ADA   0.00000201   0.00000395  0.00000183  0.00000174
  BTC   23.78687273  5.01065648  1.94068887  0.91171693
  ARB   0.00098386   0.00025156  0.00015229  0.00014042

7 DISCUSSION AND FUTURE WORK
We have developed and assessed a functional provider solution for predicting prices within the FTSO protocol. While we observed commendable performance for coins such as XRP, DOGE, and BTC, the results for other coins like XLM, ADA, and ARB were not as promising. Exploring additional smoothing techniques and incorporating multiple prediction methods would be beneficial. Notably, ensemble methods are renowned for reducing prediction variance, which in turn increases the probability of predictions falling within the median target range.

This paper has focused only on non-deep-learning approaches to FTSO price prediction. A promising extension of the provider would be to explore time series prediction using deep learning methods such as RNN or LSTM neural networks. These models have the potential to capture more subtle patterns in the data and adapt to the dynamic prices of crypto coins. They might need to be modified to fit the specifics of the FTSO system and its quick retraining times. Combining the more expensive inference of neural networks with the presented overdetermined system, together with error bounds on the prediction results, might also yield a more performant composite algorithm that could fall back to the simpler prediction whenever the stronger but more complicated model is too late with its prediction.
8 ACKNOWLEDGMENTS
The authors would like to thank AFLabs for the provision of exchange and FTSO data used during the development phase.

REFERENCES
[1] A. Alagha. 2022. "A reinforcement learning model for the reliability of blockchain oracles". In: ScienceDirect.
[2] Giulio Caldarelli. 2022. "Overview of Blockchain Oracle Research". In: MDPI 14.6, p. 175.
[3] Giulio Caldarelli. 2020. "Understanding the Blockchain Oracle Problem: A Call for Action". In: Information 11.11, p. 509. URL: https://www.mdpi.com/2078-2489/11/11/509.
[4] Giulio Caldarelli. 2023. "Understanding the Blockchain Oracle Problem: A Call for Action". In: 11.11, p. 509.
[5] Chainlink. 2023. Chainlink 2.0 and the future of Decentralized Oracle Networks. Accessed: 2023-09-05. URL: https://chain.link/whitepaper.
[6] Vasant Dhar. 2013. "Data Science and Prediction". In: Communications of the ACM 56.12, pp. 64-73. URL: https://dl.acm.org/doi/abs/10.1145/2500499.
[7] Joshua Ellul. 2023. "The Blockchain Oracle Problem in Decentralized Finance—A Multivocal Approach". In: 11.16, p. 7572.
[8] Boi Faltings and Goran Radanovic. 2021. Game Theory for Data Science: Eliciting Truthful Information. Springer Nature.
[9] James D. Hamilton. 1994. Time Series Analysis. Princeton University Press. URL: http://mayoral.iae-csic.org/timeseries2021/hamilton.pdf.
[10] A. J. Lawrance and P. A. W. Lewis. 1977. "An exponential moving-average sequence and point process (EMA1)". In: Journal of Applied Probability 14.1, pp. 98-113. doi: 10.2307/3213263.
[11] Christopher Potts. 2008. "Interpretive Economy". In: Semantics Archive. URL: https://semanticsarchive.net/Archive/jExYWZlN/potts-interpretive-economy-mar08.pdf.
[12] William H. Press and Saul A. Teukolsky. 1990. "Savitzky-Golay Smoothing Filters". In: Computers in Physics 4.6, pp. 669-672. doi: 10.1063/1.4822961.
[13] Fan Zhang et al. 2020. "Decentralized Oracles: a Comprehensive Overview". In: arXiv preprint arXiv:2004.07140. URL: https://arxiv.org/abs/2004.07140.
[14] Yanchao Zhang, Zhiqiang Liu, and Jiwu Jing. 2022. "Connect API with Blockchain: A Survey on Blockchain Oracle Implementation". In: ACM.
[15] Aviv Zohar and Ittay Eyal. 2020. "A Study of Blockchain Oracles". In: arXiv. URL: https://arxiv.org/pdf/2004.07140.

On Neural Filter Selection for ON/OFF Classification of Home Appliances

Anže Pirnat and Carolina Fortuna (ap6928@student.uni-lj.si, carolina.fortuna@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

ABSTRACT
Non-intrusive load monitoring (NILM) enables the extraction of appliance-level consumption data from a single metering point. Appliance ON/OFF classification is a particular type of such appliance-level data extraction, recently enabled by deep learning (DL) techniques. To date, a study of the influence of neural filter selection on the performance and computational complexity of appliance ON/OFF classification has been missing. In this paper, we start from a widely used DL architecture, adapt it for the appliance ON/OFF classification problem and then study the influence of the filters on model performance and model complexity. Through this study we develop a model, PirnatCross, that excels at cross-dataset performance, offering an average improvement in average weighted F1 score of 17.2 percentage points over a SotA model and a VGG11 baseline when trained on REFIT and evaluated on UK-DALE and vice versa. PirnatCross also consumes 6 times less energy than the SotA model.

KEYWORDS
non-intrusive load monitoring (NILM), ON/OFF appliance classification, deep learning (DL), convolutional recurrent neural network (CRNN), multi-label classification

1 INTRODUCTION
Mitigating the impact of climate change is an urgent challenge that requires collective action to keep the global average temperature below 1.5 °C relative to pre-industrial levels. Reducing unnecessary electrical energy consumption, and consequently limiting electrical energy production, is a crucial step towards achieving our goals, as it is estimated that such activities account for over 40% of the total CO2 equivalent generated by human activities (tinyurl.com/CO2-from-electricity). Besides reducing energy consumption, we are increasingly adopting renewable power plants due to their significantly lower CO2 emissions compared to fossil-fuel-based ones (tinyurl.com/renewable-energy-doubled). However, renewable energy resources have a major drawback: a dependency on renewable resources that are far less predictable, posing a challenge to the stability of the power system [11]. To address this issue, demand response strategies are being implemented to adjust electricity consumption to better match supply [1]. Consequently, efforts are being made to monitor and manage energy consumption more efficiently in residential buildings, making it relevant to track device activity (ON/OFF events) [3].

To avoid the high cost and invasiveness of monitoring each individual device with an electricity meter, researchers have developed a more economically efficient method known as non-intrusive load monitoring (NILM). This method involves obtaining appliance-level data using just one metering point that measures the total electricity consumption of a household. By using classification techniques for NILM, it is possible to determine the states (ON/OFF) of devices within a household and monitor their activity for demand response applications. As a typical household may have several appliances working simultaneously, a suitable approach for determining the activity states of appliances is multi-label classification, where the state of each appliance is used as a class label and the recorded readings from a single household meter serve as input samples. Li et al. were among the first to propose multi-label classification for NILM disaggregation. More recently, Tanoni et al. [12] employed a gated recurrent unit (GRU) in their CRNN for weakly supervised training, mixing the amounts of strongly and weakly labeled data to confirm the effectiveness of such an approach. Zhou et al. [14] proposed a new model called TTRNet, which uses a transpose convolution before a recurrent layer, a method which has also shown better results in other works [8].
The existing works based on DL techniques typically lack an analysis of DL computational complexity and energy consumption, which is relevant when designing such models [2]. For instance, the authors of [5] analyzed the carbon footprint of various architectures and concluded that convolutional layers are power-hungry because they operate in three dimensions, unlike fully connected layers, which operate in two dimensions.

Existing studies also typically develop and evaluate their method on only a few datasets that are often limited in size. For instance, [12] relied on two publicly available datasets, REFIT [9] and UK-DALE [6], and developed and evaluated a model for each of the two. While this approach is appropriate for relative method performance assessment, some studies have also discussed the importance of cross-dataset evaluation. For example, Han et al. [4] described significant dataset biases and the high class imbalance of in-the-wild datasets as a fundamental bottleneck in facial expression recognition. Their results showed that cross-dataset evaluation can reduce dataset bias and improve performance.

In this paper we aim to better understand the influence of the filters on model performance and model complexity for multi-label ON/OFF appliance classification through intra- and cross-dataset evaluation. Our main contributions are as follows:

• We adapt VGG19, a widely used DL architecture, for appliance ON/OFF classification and study the influence of the filters on model performance and model complexity.
• We develop a model, PirnatCross, that excels at cross-dataset performance, offering an average improvement of 17.2 percentage points over a SotA model and a VGG11 baseline when trained on REFIT and evaluated on UK-DALE and vice versa. PirnatCross also consumes 6 times less energy than the SotA model.
Figure 1: The data measured from a household is input to the DL model, which outputs s_i for each device present in the experiment. If s_i is greater than 0.5 we classify the device as active, otherwise as inactive.

The paper is organized as follows. Section 2 provides the problem statement, Section 3 presents methodological details, while Section 4 analyses the results of our study. Finally, Section 5 concludes the paper.

2 PROBLEM STATEMENT
Given an input power consumption p(w) measured by a smart meter over a time window w, we aim to develop a multi-label ON/OFF classifier Φ that maps the input to a probability vector s(w) corresponding to the status of the home appliances:

  s(w) = Φ(p(w))    (1)

The cardinality |s| of the set s indicates the number of appliances to be recognised. For each window of measurements p(w) input to the model Φ, s(w) is of the form [s_1(w), s_2(w), ..., s_N(w)] with s_i in [0, 1] and N = |s|, where each s_i estimates the probability that appliance d_i is active, as also depicted in Figure 1. When s_i > 0.5 the appliance is classified as ON, otherwise it is classified as OFF. More than one appliance can be ON at the same time; therefore s contains multiple labels assigned to the current instance. In this paper N = 5 in total, of which any 1-4 can be active.

The ON/OFF classifier Φ, realized as a deep learning network, is typically composed of a set of layers [l_1, l_2, ..., l_M], where the types of the layers may vary depending on how the respective architecture is designed, for instance l_i in {FC, Pool, Conv, GRU, ...}, where FC stands for fully connected, Pool for pooling, Conv for convolutional and GRU for gated recurrent unit. As has already been shown in [10], the computational complexity varies across the types of layers.

In developing Φ, we start from the VGG family of architectures, as they are widely used in various communities and have already shown promising results for classification on NILM [7]. More precisely, we consider VGG19, comprising 19 layers with trainable parameters, 16 of which are convolutional and 3 fully connected. The convolutional layers are grouped into five blocks:

• Block 1: 2 x conv. with 64 filters + max pooling
• Block 2: 2 x conv. with 128 filters + max pooling
• Block 3: 4 x conv. with 256 filters + max pooling
• Block 4: 4 x conv. with 512 filters + max pooling
• Block 5: 4 x conv. with 512 filters + max pooling

This architecture has been tailored to accommodate time series data, replacing the 2D convolutions and pooling of VGG19, designed for images, with 1D counterparts that are more suitable for time series. In addition, the convolutional layers in the 5th block have been replaced with transpose convolutional layers to increase the temporal resolution of features and reduce their number, as suggested in [14]. We also integrated a recurrent layer, specifically a GRU layer, after the 5th block, as it is able to model temporal relationships in the time series and was shown to achieve good performance in a recent study [12].

In order to estimate the computational complexity of the resulting architecture, referred to as PirnatCross, we must first calculate its complexity as the sum of all floating point operations (FLOPs) that have to be computed for each of its layers. This can be calculated for convolutional, pooling and fully connected layers with the equations from [10], and for the GRU with the equation from [13]. Convolutional layers dominate in our adaptation of VGG19, and the computational complexity of a convolutional layer is relatively high compared to other types of layers [10]. Generally, the number of FLOPs used throughout a convolutional layer, F_c, is equal to the number of filters N_f times the FLOPs per filter F_pf, i.e., F_c = N_f x F_pf. Therefore we aim to study the influence of the number of filters N_f on model performance and complexity. Let the starting number of filters in each block of the adapted architecture be the same as in the original VGG19, namely F = [64, 128, 256, 512, 512], and analyze model performance as average F1 score versus computational complexity in FLOPs.

3 METHODOLOGY
This section provides methodological details related to the datasets, the training approach and the evaluation process employed in the study.

3.1 Datasets
The study is conducted using two datasets: UK-DALE [6] and REFIT [9]. Within each dataset, we monitor the same five appliances d_i that were also used in recent research [12]: fridge, washing machine, dishwasher, microwave, and kettle. The data from the selected devices is obtained and processed using the procedure described by Tanoni et al. [12] to form two mixed datasets. After processing, the two mixed datasets each consist of the same five devices, with each sample containing a random selection of one to four active devices. Samples with varying numbers of active devices are randomly distributed throughout the datasets. We evaluate the cross-dataset performance of the models on the two mixed datasets obtained by processing data from UK-DALE and REFIT, in both directions. Specifically, we train models on the REFIT-derived dataset and test them on the UK-DALE-derived dataset, and vice versa, training on the UK-DALE-derived dataset and testing on the REFIT-derived dataset.

3.2 Benchmarks
In order to have a more meaningful study, we also evaluate PirnatCross, the adapted VGG19, against a VGG11 baseline and the recently published TanoniCRNN [12]. For VGG11, we used a learning rate of 0.0001 and the same batch size and number of epochs. For TanoniCRNN, we used the hyperparameters specified as optimal in its paper [12].

For PirnatCross we vary the set of filters F by multiplying it with k in [0.02, 0.04, 0.06, 0.08, 0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9, 2.1, 2.3, 2.5]. The learning rate, batch size, and number of epochs were determined through a process of trial and error, informed by previous experiments, and subsequently fine-tuned for each model to optimize model performance and stability. The resulting values are: a learning rate of 0.0003, a batch size of 128, and training for 20 epochs.

While some models were capable of handling larger batch sizes, we found that performance did not improve by increasing the batch size beyond 128, so we kept it unchanged for all models. We train and evaluate using 5-fold cross-validation.
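To make the adapted architecture tangible, a simplified PyTorch sketch is shown below. It keeps the 1D convolutional blocks, the GRU and the sigmoid multi-label head, but deliberately omits several details of the actual PirnatCross (e.g., the transpose-convolutional fifth block and the second fully connected layer), so it should be read as an illustration of the design under those simplifying assumptions, not as the published model.

    import torch
    import torch.nn as nn

    class OnOffCRNN(nn.Module):
        """Simplified 1D-conv + GRU multi-label ON/OFF classifier."""

        def __init__(self, n_appliances=5, filters=(5, 10, 20, 40)):
            super().__init__()
            blocks, in_ch = [], 1
            for f in filters:
                blocks += [nn.Conv1d(in_ch, f, kernel_size=3, padding=1),
                           nn.ReLU(),
                           nn.MaxPool1d(kernel_size=2, stride=2)]
                in_ch = f
            self.features = nn.Sequential(*blocks)
            self.gru = nn.GRU(input_size=filters[-1], hidden_size=64,
                              batch_first=True)
            self.head = nn.Sequential(nn.Linear(64, 4096), nn.ReLU(),
                                      nn.Linear(4096, n_appliances))

        def forward(self, x):              # x: (batch, 1, window)
            h = self.features(x)           # (batch, C, T')
            h = h.transpose(1, 2)          # GRU expects (batch, T', C)
            _, last = self.gru(h)          # last hidden state: (1, batch, 64)
            return torch.sigmoid(self.head(last.squeeze(0)))

    model = OnOffCRNN()
    power = torch.randn(8, 1, 2550)        # batch of aggregate-power windows
    probs = model(power)                   # (8, 5) per-appliance probabilities
    on_off = (probs > 0.5).int()           # threshold from the problem statement
    print(on_off.shape)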
3.3 Metrics
We use the average weighted F1 score (F1score_w) as the performance metric because our datasets are not balanced and do not provide equal representation for each device:

  F1score_w = Σ_{i=1}^{N_d} F1score_i x Weight_i    (2)

The average weighted F1 score is calculated using three quantities: true positives (TP), false positives (FP), and false negatives (FN). TP counts the instances where a device is accurately classified as active, while FP represents cases where a device is erroneously classified as active. FN counts instances where a device is mistakenly classified as inactive.

From these quantities we derive the precision (Precision = TP / (TP + FP)) and the recall (Recall = TP / (TP + FN)), which are used to calculate the F1 score (F1score = 2 x Precision x Recall / (Precision + Recall)). To obtain the average weighted F1 score (2), we first compute the F1 score for each device and then take the average based on each device's weight (Weight = SSD / SAD), which is determined by the support for the specified device (SSD) and the support of all devices (SAD).
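A sketch of this metric on toy multi-label predictions, together with scikit-learn's equivalent built-in, is shown below; the labels are randomly generated for illustration only.

    import numpy as np
    from sklearn.metrics import f1_score

    # Toy multi-label ground truth and predictions (rows: samples, cols: 5 devices).
    rng = np.random.default_rng(4)
    y_true = rng.integers(0, 2, size=(100, 5))
    y_pred = rng.integers(0, 2, size=(100, 5))

    # Per-device F1, then an average weighted by each device's support,
    # matching Eq. (2): F1_w = sum_i F1_i * (SSD_i / SAD).
    per_device = f1_score(y_true, y_pred, average=None, zero_division=0)
    support = y_true.sum(axis=0)
    f1_weighted = float((per_device * support / support.sum()).sum())

    # sklearn's built-in 'weighted' average computes the same quantity.
    print(f1_weighted, f1_score(y_true, y_pred, average="weighted", zero_division=0))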
4 RESULTS
In this section we first determine the optimal filter configuration for variations of the PirnatCross architecture to achieve a high average weighted F1 score. We follow with a computational complexity and carbon footprint assessment. Finally, we benchmark the performance of the models in cross-dataset evaluation on the REFIT and UK-DALE datasets.

4.1 Analysis of Tuning the Filters in PirnatCross
Figure 2 depicts the performance of the PirnatCross architecture where the original number of filters in the set F has been scaled by factors k in [0.02, 0.04, ..., 2.5]. The upper two curves present the average weighted F1 score for models trained and evaluated on REFIT and UK-DALE separately, i.e., without cross-dataset evaluation. The second lowest curve presents the average weighted F1 scores for models trained on REFIT and cross-evaluated on UK-DALE, while the lowest curve presents the results of training on UK-DALE and cross-evaluating on REFIT. Observing the cross-evaluation models, they show a rapid improvement in performance for scaling factor values from 0.02 to 0.08. From scaling factor 0.08 to 0.9, we see a decline in performance in one case and a small improvement in the others, while beyond 0.9 the results gradually decline. For scaling factors above 1.3 a rapid drop in performance can be observed.

Figure 2: Average weighted F1 scores (in %) for intra- and cross-dataset training and evaluation (all four REFIT/UK-DALE combinations) as a function of the filter scaling factor; the best-performing configuration, PirnatCross, is marked.

Marked in light blue in Figure 2, and also depicted in Figure 3, is the PirnatCross version of the proposed architecture with F scaled by 0.08, resulting in the filter configuration F1 = [5, 10, 20, 40, 40] across the blocks. PirnatCross1 performs optimally in terms of average F1 score.

PirnatCross1 also contains 5 blocks, like the original VGG19, the first two comprising two convolutional layers and the subsequent two comprising four convolutional layers. The final block consists of four transpose convolutional layers, and all blocks include a pooling layer after the convolutional layers. Preceding the output layer, our model incorporates a GRU layer with a size of 64. Additionally, two fully connected layers, each consisting of 4096 nodes, are included in the architecture. The output layer of our model comprises five nodes corresponding to the states s_i of the 5 appliances d_i considered in this study. All layers utilize the ReLU activation function, except for the output layer, which employs the sigmoid activation function.

Figure 3: The proposed architecture PirnatCross, tuned for maximum performance.

4.2 Computational Complexity and Carbon Footprint Analysis
Table 1 summarizes the weights, FLOPs, energy and carbon footprint numbers for PirnatCross versus the TanoniCRNN and VGG11 baselines. The results take into account the fact that the models were trained on an Nvidia A100 graphics card located in Slovenia, where 250 g of CO2 equivalent is produced with each kWh of electricity. The specific equations used to calculate energy and carbon footprint are defined in our previous work [10]. It can be seen from the table that PirnatCross achieves superior energy efficiency compared to the other models, exhibiting energy consumption 6 times smaller than the SotA TanoniCRNN and 6.6 times smaller than VGG11.

Table 1: Computational complexity and carbon footprint analysis for the proposed architecture and selected baselines.

  NN               Weights       FLOPs        Energy   Carbon footprint
  PirnatCross      17.4 x 10^6   185 x 10^6   329 kJ   22.9 g CO2 eq.
  TanoniCRNN [12]  0.75 x 10^6   1.11 x 10^9  1967 kJ  136.7 g CO2 eq.
  VGG11            185.6 x 10^6  1.21 x 10^9  2150 kJ  149.3 g CO2 eq.

4.3 Cross-Dataset Analysis
Tables 2 and 3 present the per-device breakdown of the F1 scores for PirnatCross, TanoniCRNN and VGG11 when trained on REFIT and evaluated on UK-DALE, and vice versa.

Table 2: F1 scores for PirnatCross1, TanoniCRNN [12] and VGG11, trained on REFIT and evaluated on UK-DALE.

  Device           PirnatCross  TanoniCRNN [12]  VGG11
  fridge           0.944        0.972            0.462
  washing machine  0.650        0.690            0.544
  dish washer      0.646        0.648            0.294
  microwave        0.728        0.756            0.512
  kettle           0.786        0.622            0.420
  weighted avg     0.766        0.752            0.456

Table 3: F1 scores for PirnatCross1, TanoniCRNN [12] and VGG11, trained on UK-DALE and evaluated on REFIT.

  Device           PirnatCross  TanoniCRNN [12]  VGG11
  fridge           0.730        0.232            0.508
  washing machine  0.668        0.666            0.366
  dish washer      0.596        0.468            0.360
  microwave        0.526        0.630            0.506
  kettle           0.800        0.782            0.408
  weighted avg     0.672        0.542            0.438
4.2 Computational Complexity and Carbon Footprint Analysis

Table 1 summarizes the weights, FLOPs, energy and carbon footprint numbers for PirnatCross versus the TanoniCRNN and VGG11 baselines. The results take into account the fact that the models were trained on an Nvidia A100 graphics card located in Slovenia, where 250 g of CO2 equivalent is produced with each kWh of electricity. The specific equations used to calculate energy and carbon footprint are defined in our previous work [10].

Table 1: Computational complexity and carbon footprint analysis for the proposed architecture and selected baselines.

NN              | weights      | FLOPs       | energy  | carbon footprint
PirnatCross     | 17.4 · 10^6  | 185 · 10^6  | 329 kJ  | 22.9 g CO2 eq.
TanoniCRNN [12] | 0.75 · 10^6  | 1.11 · 10^9 | 1967 kJ | 136.7 g CO2 eq.
VGG11           | 185.6 · 10^6 | 1.21 · 10^9 | 2150 kJ | 149.3 g CO2 eq.

It can be seen from the second row of the table that PirnatCross achieves superior energy efficiency compared to the other models, exhibiting energy consumption 6 times smaller than the SotA TanoniCRNN and 6.6 times smaller than VGG11.
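The full energy and carbon accounting follows [10], but the carbon column of Table 1 is consistent with a direct conversion of the reported training energy at the stated grid intensity of 250 g CO2-eq per kWh. A quick sanity check, with the table's energy values hard-coded for illustration:

```python
# Reproduce Table 1's carbon footprints from its energy column, assuming
# only the stated Slovenian grid intensity of 250 g CO2-eq per kWh.
KJ_PER_KWH = 3600          # 1 kWh = 3.6 MJ = 3600 kJ
INTENSITY_G_PER_KWH = 250

for name, energy_kj in [("PirnatCross", 329),
                        ("TanoniCRNN", 1967),
                        ("VGG11", 2150)]:
    grams = energy_kj / KJ_PER_KWH * INTENSITY_G_PER_KWH
    print(f"{name:12s} {energy_kj:5d} kJ -> {grams:6.1f} g CO2 eq.")
# Prints 22.8, 136.6 and 149.3 g, matching the table up to rounding.
```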
4.3 Cross-Dataset Analysis

Tables 2 and 3 present the per-device breakdown of the F1 scores for PirnatCross, TanoniCRNN and VGG11 when trained on REFIT and evaluated on UK-DALE, and vice versa.

Table 2: F1 scores for PirnatCross1, TanoniCRNN [12] and VGG11 trained on REFIT and evaluated on UK-DALE.

devices         | PirnatCross | TanoniCRNN [12] | VGG11
fridge          | 0.944       | 0.972           | 0.462
washing machine | 0.650       | 0.690           | 0.544
dish washer     | 0.646       | 0.648           | 0.294
microwave       | 0.728       | 0.756           | 0.512
kettle          | 0.786       | 0.622           | 0.420
weighted avg    | 0.766       | 0.752           | 0.456

Table 3: F1 scores for PirnatCross1, TanoniCRNN [12] and VGG11 trained on UK-DALE and evaluated on REFIT.

devices         | PirnatCross | TanoniCRNN [12] | VGG11
fridge          | 0.730       | 0.232           | 0.508
washing machine | 0.668       | 0.666           | 0.366
dish washer     | 0.596       | 0.468           | 0.360
microwave       | 0.526       | 0.630           | 0.506
kettle          | 0.800       | 0.782           | 0.408
weighted avg    | 0.672       | 0.542           | 0.438

When we trained on REFIT and evaluated on UK-DALE, the weighted average scores for the three models were as follows: PirnatCross achieved 0.766, TanoniCRNN 0.752, and VGG11 0.456. However, when we trained on UK-DALE and tested on REFIT, the scores were notably lower for all three models: PirnatCross achieved 0.672, TanoniCRNN 0.542, and VGG11 0.438. This outcome may be explained by the fact that REFIT has a significantly higher level of data noise than UK-DALE, as shown in prior work [12]; consequently, results obtained when testing on UK-DALE are expected to show higher F1 scores. Moreover, we observed that our model PirnatCross consistently outperformed the other models in both testing scenarios, achieving the highest weighted average F1 scores overall.

5 CONCLUSIONS

To address the challenge of the cross-dataset usage scenario in NILM ON/OFF classification, we propose PirnatCross, with the aim of combining maximum performance with energy efficiency. The results of our evaluation on the REFIT and UK-DALE datasets reveal that PirnatCross achieves an average performance improvement of 7.2 percentage points over the SotA and 27.2 percentage points over the baseline, underscoring its superior effectiveness in handling data from diverse sources. Additionally, PirnatCross consumes 6 times less energy than the SotA model. To develop PirnatCross, we employed our methodology: in the case of NILM classification, this included beginning with the VGG19 architecture and implementing several modifications, such as replacing the convolutional layers with transpose convolutional layers in the 5th block, incorporating a GRU layer after it, and adjusting the number of filters based on our analysis. Our analysis revealed that an increase in the number of filters in convolutional layers, and consequently an increase in the number of FLOPs, did not necessarily lead to an improvement in classification accuracy. Instead, we observed a range of steady improvement in performance, followed by a gradual decline and a significant drop once the number of filters exceeded a certain threshold. This information is crucial for optimizing the architecture of NILM models while keeping track of their carbon footprint.

ACKNOWLEDGEMENTS

This work was funded in part by the Slovenian Research Agency under the grant P2-0016. The authors would like to thank Blaž Bertalanič for insightful discussions.

REFERENCES

[1] Jamshid Aghaei and Mohammad-Iman Alizadeh. 2013. Demand response in smart electricity grids equipped with renewable energy sources: a review. Renewable and Sustainable Energy Reviews, 18, 64–72. doi: 10.1016/j.rser.2012.09.019.
[2] Eva García-Martín, Crefeda Faviola Rodrigues, Graham Riley, and Håkan Grahn. 2019. Estimation of energy consumption in machine learning. Journal of Parallel and Distributed Computing, 134, 75–88. doi: 10.1016/j.jpdc.2019.07.007.
[3] R. Gopinath, Mukesh Kumar, C. Prakash Chandra Joshua, and Kota Srinivas. 2020. Energy management using non-intrusive load monitoring techniques – state-of-the-art and future research directions. Sustainable Cities and Society, 62, 102411. doi: 10.1016/j.scs.2020.102411.
[4] Byungok Han, Woo-Han Yun, Jang-Hee Yoo, and Won Hwa Kim. 2020. Toward unbiased facial expression recognition in the wild via cross-dataset adaptation. IEEE Access, 8, 159172–159181.
[5] Gigi Hsueh. 2020. Carbon footprint of machine learning algorithms. Senior Projects Spring 2020, 296. https://digitalcommons.bard.edu/senproj_s2020/296.
[6] Jack Kelly and William Knottenbelt. 2015. The UK-DALE dataset, domestic appliance-level electricity demand and whole-house demand from five UK homes. Scientific Data, 2, 1, 1–14.
[7] Weicong Kong, Zhao Yang Dong, Bo Wang, Junhua Zhao, and Jie Huang. 2020. A practical solution for non-intrusive type II load monitoring based on deep learning and post-processing. IEEE Transactions on Smart Grid, 11, 1, 148–160. doi: 10.1109/TSG.2019.2918330.
[8] Luca Massidda, Marino Marrocu, and Simone Manca. 2020. Non-intrusive load disaggregation by convolutional neural network and multilabel classification. Applied Sciences, 10, 4. doi: 10.3390/app10041454.
[9] David Murray, Lina Stankovic, and Vladimir Stankovic. 2017. An electrical load measurements dataset of United Kingdom households from a two-year longitudinal study. Scientific Data, 4, 1, 1–12. doi: 10.1038/sdata.2016.122.
[10] Anže Pirnat, Blaž Bertalanič, Gregor Cerar, Mihael Mohorčič, Marko Meža, and Carolina Fortuna. 2022. Towards sustainable deep learning for wireless fingerprinting localization. In ICC 2022 - IEEE International Conference on Communications, 3208–3213. doi: 10.1109/ICC45855.2022.9838464.
[11] Ali Q. Al-Shetwi, M. A. Hannan, Ker Pin Jern, M. Mansur, and T. M. I. Mahlia. 2020. Grid-connected renewable energy sources: review of the recent integration requirements and control methods. Journal of Cleaner Production, 253, 119831. doi: 10.1016/j.jclepro.2019.119831.
[12] Giulia Tanoni, Emanuele Principi, and Stefano Squartini. 2022. Multi-label appliance classification with weakly labeled data for non-intrusive load monitoring. IEEE Transactions on Smart Grid, 1–1. doi: 10.1109/TSG.2022.3191908.
[13] Minjia Zhang, Wenhan Wang, Xiaodong Liu, Jianfeng Gao, and Yuxiong He. 2018. Navigating with graph representations for fast and scalable decoding of neural language models. Advances in Neural Information Processing Systems, 31.
[14] Mengran Zhou, Shuai Shao, Xu Wang, Ziwei Zhu, and Feng Hu. 2022. Deep learning-based non-intrusive commercial load monitoring. Sensors, 22, 14. doi: 10.3390/s22145250.

Indeks avtorjev / Author index

Bradeško Luka: 42
Buza Krisztian: 5
Caporusso Jaya: 33
Džeroski Sašo: 46
Eržen Tjaž: 58
Espigule-Pons Jofre: 29
Fijavž Zoran: 54
Fortuna Carolina: 62
Gobbo Elena: 25
Grobelnik Marko: 5, 29, 39
Kladnik Matic: 42
Koehorst Erik: 17
Koprivec Filip: 58
Kosjek Tina: 46
Leban Gregor: 9
Ljoncheva Milka: 46
Martinc Matej: 50
Massri M. Besher: 5
Mežnar Urban: 58
Mladenić Dunja: 9, 13, 17, 21, 25, 39, 42
Mladenić Grobelnik Adrian: 29
Nemec Peter: 9
Novalija Inna: 25
Piciga Aleksander: 46
Pirnat Anže: 62
Pollak Senja: 33, 50
Purver Matthew: 33
Robnik-Šikonja Marko: 54
Rožanec Jože M.: 9, 17
Šircelj Beno: 9
Sittar Abdul: 21
Škraba Primož: 13
Škrjanc Maja: 39
Stopar Luka: 39
Šturm Jan: 39
Topal Oleksandra: 25
Vezovnik Andreja: 50
Volčjak Domen: 39
Zajec Patrik: 13
Zaman Faizon: 29
Zupan Šemrov Manja: 25