INFORMACIJSKA DRUŽBA
Zbornik 26. mednarodne multikonference
Zvezek C

INFORMATION SOCIETY
Proceedings of the 26th International Multiconference
Volume C

Odkrivanje znanja in podatkovna skladišča – SiKDD
Data Mining and Data Warehouses – SiKDD

Urednika / Editors: Dunja Mladenić, Marko Grobelnik

http://is.ijs.si
9. oktober 2023 / 9 October 2023
Ljubljana, Slovenia

Urednika / Editors:
Dunja Mladenić, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana

Založnik / Publisher: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika / Proceedings preparation: Mitja Lasič, Vesna Lasič, Mateja Mavrič
Oblikovanje naslovnice / Cover design: Vesna Lasič

Dostop do e-publikacije / Access to the e-publication:
http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2023

Informacijska družba
ISSN 2630-371X
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 170733315
ISBN 978-961-264-276-1 (PDF)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2023

Šestindvajseta multikonferenca Informacijska družba se odvija v obdobju izjemnega razvoja za umetno inteligenco, računalništvo in informatiko, za celotno informacijsko družbo. Generativna umetna inteligenca je s programi, kot je ChatGPT, dosegla izjemen napredek na poti k superinteligenci, k singularnosti in razcvetu človeške civilizacije. Uresničujejo se napovedi strokovnjakov, da bodo omenjena področja ključna za obstoj in razvoj človeštva, zato moramo pozornost usmeriti nanje, jih hitro uvesti v osnovno in srednje šolstvo ter v vsakdan posameznika in skupnosti.

Po drugi strani se poleg lažnih novic pojavljajo tudi lažne enciklopedije, lažne znanosti ter »ploščate Zemlje«, nadaljuje se zapostavljanje znanstvenih spoznanj in metod ter zmanjševanje človekovih pravic in družbenih vrednot. Na vseh nas je, da izzive današnjice primerno obravnavamo, predvsem pa pomagamo pri uvajanju znanstvenih spoznanj in razčiščevanju zmot. Ena pogosto omenjenih v zadnjem letu je eksistencialna nevarnost umetne inteligence, ki naj bi ogrožala človeštvo tako kot jedrske vojne. Hkrati pa nihče ne poda vsaj za silo smiselnega scenarija, kako naj bi se to zgodilo – recimo, kako naj bi 100x pametnejši GPT ogrozil ljudi.

Letošnja konferenca poleg čisto tehnoloških izpostavlja pomembne integralne teme, kot so okolje, zdravstvo, politika depopulacije, ter rešitve, ki jih za skoraj vse probleme prinaša umetna inteligenca. V takšnem okolju sta ključnega pomena poglobljena analiza in diskurz, ki lahko oblikujeta najboljše pristope k upravljanju in izkoriščanju tehnologij. Imamo veliko srečo, da gostimo vrsto izjemnih mislecev, znanstvenikov in strokovnjakov, ki skupaj v delovnem in akademsko odprtem okolju prinašajo bogastvo znanja in dialoga. Verjamemo, da sta njihova prisotnost in udeležba ključni za oblikovanje bolj inkluzivne, varne in trajnostne informacijske družbe. Za razcvet.

Letos smo v multikonferenco povezali deset odličnih neodvisnih konferenc, med njimi »Legende računalništva«, s katero postavljamo nov mehanizem promocije informacijske družbe.
IS 2023 zajema okoli 160 predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic, skupaj pa se je konference udeležilo okrog 500 udeležencev. Prireditev so spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica (http://www.informatica.si/), ki se ponaša s 46-letno tradicijo odlične znanstvene revije.

Multikonferenco Informacijska družba 2023 sestavljajo naslednje samostojne konference:
• Odkrivanje znanja in podatkovna skladišča
• Demografske in družinske analize
• Legende računalništva in informatike
• Konferenca o zdravi dolgoživosti
• Miti in resnice o varovanju okolja
• Mednarodna konferenca o prenosu tehnologij
• Digitalna vključenost v informacijski družbi – DIGIN 2023
• Slovenska konferenca o umetni inteligenci + DATASCIENCE
• Kognitivna znanost
• Vzgoja in izobraževanje v informacijski družbi
• Zaključna svečana prireditev konference

Soorganizatorji in podporniki konference so različne raziskovalne institucije in združenja, med njimi ACM Slovenija, SLAIS za umetno inteligenco, DKZ za kognitivno znanost in Inženirska akademija Slovenije (IAS). V imenu organizatorjev konference se zahvaljujemo združenjem in institucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi. Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju.

S podelitvijo nagrad, še posebej z nagrado Michie-Turing, se avtonomna stroka s področja opredeli do najbolj izstopajočih dosežkov. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe je prejel prof. dr. Andrej Brodnik. Priznanje za dosežek leta pripada Benjaminu Bajdu za zlato medaljo na računalniški olimpijadi. »Informacijsko limono« za najmanj primerno informacijsko tematiko je prejela nekompatibilnost zdravstvenih sistemov v Sloveniji, »informacijsko jagodo« kot najboljšo potezo pa dobi ekipa RTV za portal dostopno.si. Čestitke nagrajencem!

Mojca Ciglarič, predsednica programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD – INFORMATION SOCIETY 2023

The twenty-sixth Information Society multi-conference is taking place during a period of exceptional development for artificial intelligence, computing, and informatics, encompassing the entire information society. With programs like ChatGPT, generative artificial intelligence has made significant progress towards superintelligence, towards singularity, and the flourishing of human civilization. Experts' predictions are coming true, asserting that the mentioned fields are crucial for humanity's existence and development. Hence, we must direct our attention to them, swiftly integrating them into primary and secondary education and into the daily lives of individuals and communities.

On the other hand, alongside fake news, we witness the emergence of false encyclopaedias, pseudo-sciences, and flat-Earth theories, along with the continuing neglect of scientific insights and methods, and the diminishing of human rights and societal values. It is upon all of us to appropriately address today's challenges, mainly by assisting in the introduction of scientific knowledge and clearing up misconceptions. A frequently mentioned concern over the past year is the existential threat posed by artificial intelligence, supposedly endangering humanity as nuclear wars do.
Yet, nobody provides a reasonably coherent scenario of how this might happen – say, how a 100x smarter GPT could endanger people. This year's conference, besides purely technological aspects, highlights important integral themes like the environment, healthcare, depopulation policies, and the solutions that artificial intelligence brings to almost all problems. In such an environment, in-depth analysis and discourse are crucial, as they shape the best approaches to managing and exploiting technologies. We are fortunate to host a series of exceptional thinkers, scientists, and experts who bring a wealth of knowledge and dialogue in a collaborative and academically open environment. We believe their presence and participation are key to shaping a more inclusive, safe, and sustainable information society. For flourishing.

This year, we connected ten excellent independent conferences into the multi-conference, including "Legends of Computing", which introduces a new mechanism for promoting the information society. IS 2023 encompasses around 160 presentations, abstracts, and papers within standalone conferences and workshops. In total, about 500 participants attended the conference. The event was accompanied by panel discussions, debates, and special events such as the award ceremony. Selected contributions will also be published in a special issue of the journal Informatica (http://www.informatica.si/), which boasts a 46-year tradition as an excellent scientific journal.

The Information Society 2023 multi-conference consists of the following independent conferences:
• Data Mining and Data Warehouses – SiKDD
• Demographic and Family Analysis
• Legends of Computing and Informatics
• Healthy Longevity Conference
• Myths and Truths about Environmental Protection
• International Conference on Technology Transfer
• Digital Inclusion in the Information Society – DIGIN 2023
• Slovenian Conference on Artificial Intelligence + DATASCIENCE
• Cognitive Science
• Education and Training in the Information Society
• Closing Conference Ceremony

Co-organizers and supporters of the conference include various research institutions and associations, among them ACM Slovenia, SLAIS for Artificial Intelligence, DKZ for Cognitive Science, and the Engineering Academy of Slovenia (IAS). On behalf of the conference organizers, we thank the associations and institutions, and especially the participants, for their valuable contributions and the opportunity to share their experiences about the information society with us. We also thank the reviewers for their assistance in reviewing.

With the awarding of prizes, especially the Michie-Turing Award, the autonomous profession of the field recognizes the most outstanding achievements. Prof. Dr. Andrej Brodnik received the Michie-Turing Award for his exceptional lifetime contribution to the development and promotion of the information society. The Achievement of the Year award goes to Benjamin Bajd, gold medal winner at the Computer Olympiad. The "Information Lemon" for the least appropriate information move was awarded to the incompatibility of information systems in Slovenian healthcare, while the "Information Strawberry" for the best move goes to the RTV SLO team for the portal dostopno.si. Congratulations to the winners!
Mojca Ciglarič, Chair of the Programme Committee
Matjaž Gams, Chair of the Organizing Committee

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee:
Vladimir Bajic, South Africa; Heiner Benking, Germany; Se Woo Cheon, South Korea; Howie Firth, UK; Olga Fomichova, Russia; Vladimir Fomichov, Russia; Vesna Hljuz Dobric, Croatia; Alfred Inselberg, Israel; Jay Liebowitz, USA; Huan Liu, Singapore; Henz Martin, Germany; Marcin Paprzycki, USA; Claude Sammut, Australia; Jiri Wiedermann, Czech Republic; Xindong Wu, USA; Yiming Ye, USA; Ning Zhong, USA; Wray Buntine, Australia; Bezalel Gavish, USA; Gal A. Kaminka, Israel; Mike Bain, Australia; Michela Milano, Italy; Derong Liu, Chicago, USA; Toby Walsh, Australia; Sergio Campos-Cordobes, Spain; Shabnam Farahmand, Finland; Sergio Crovella, Italy

Organizing Committee:
Matjaž Gams, chair; Mitja Luštrek; Lana Zemljak; Vesna Koricki; Mitja Lasič; Blaž Mahnič; Mateja Mavrič

Programme Committee:
Mojca Ciglarič, chair; Marjan Heričko; Baldomir Zajc; Bojan Orel; Borka Jerman Blažič Džonova; Blaž Zupan; Franc Solina; Gorazd Kandus; Boris Žemva; Viljan Mahnič; Urban Kordeš; Leon Žlajpah; Cene Bavec; Marjan Krisper; Niko Zimic; Tomaž Kalin; Andrej Kuščer; Rok Piltaver; Jozsef Györkös; Jadran Lenarčič; Toma Strle; Tadej Bajd; Borut Likar; Tine Kolenik; Jaroslav Berce; Janez Malačič; Franci Pivec; Mojca Bernik; Olga Markič; Uroš Rajkovič; Marko Bohanec; Dunja Mladenič; Borut Batagelj; Ivan Bratko; Franc Novak; Tomaž Ogrin; Andrej Brodnik; Vladislav Rajkovič; Aleš Ude; Dušan Caf; Grega Repovš; Bojan Blažica; Saša Divjak; Ivan Rozman; Matjaž Kljun; Tomaž Erjavec; Niko Schlamberger; Robert Blatnik; Bogdan Filipič; Stanko Strmčnik; Erik Dovgan; Andrej Gams; Jurij Šilc; Špela Stres; Matjaž Gams; Jurij Tasič; Anton Gradišek; Mitja Luštrek; Denis Trček; Marko Grobelnik; Andrej Ule; Nikola Guid; Boštjan Vilfan

KAZALO / TABLE OF CONTENTS

Odkrivanje znanja in podatkovna skladišča – SiKDD / Data Mining and Data Warehouses – SiKDD
PREDGOVOR / FOREWORD
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES
Forecasting Trends in Technological Innovations with Distortion-Aware Convolutional Neural Networks / Buza Krisztian, Massri M. Besher, Grobelnik Marko
Building A Causality Graph For Strategic Foresight / Rožanec Jože M., Šircelj Beno, Nemec Peter, Leban Gregor, Mladenić Dunja
Towards Testing the Significance of Branching Points and Cycles in Mapper Graphs / Zajec Patrik, Škraba Primož, Mladenić Dunja
Highlighting Embeddings' Features Relevance Attribution on Activation Maps / Rožanec Jože M., Koehorst Erik, Mladenić Dunja
An approach to creating a time-series dataset for news propagation: Ukraine-war case study / Sittar Abdul, Mladenić Dunja
Predicting Horse Fearfulness Applying Supervised Machine Learning Methods / Topal Oleksandra, Novalija Inna, Gobbo Elena, Zupan Šemrov Manja, Mladenić Dunja
Emergent Behaviors from LLM-Agent Simulations / Mladenić Grobelnik Adrian, Zaman Faizon, Espigule-Pons Jofre, Grobelnik Marko
Compared to Us, They Are …: An Exploration of Social Biases in English and Italian Language Models Using Prompting and Sentiment Analysis / Caporusso Jaya, Pollak Senja, Purver Matthew
Towards Cognitive Digital Twin of a Country with Emergency, Hydrological, and Meteorological Data / Šturm Jan, Škrjanc Maja, Stopar Luka, Volčjak Domen, Mladenić Dunja, Grobelnik Marko
Predicting Bus Arrival Times Based on Positional Data / Kladnik Matic, Bradeško Luka, Mladenić Dunja
Structure Based Molecular Fingerprint Prediction through Spec2Vec Embedding of GC-EI-MS Spectra / Piciga Aleksander, Ljoncheva Milka, Kosjek Tina, Džeroski Sašo
A meaty discussion: quantitative analysis of the Slovenian meat-related news corpus / Martinc Matej, Pollak Senja, Vezovnik Andreja
Slovene Word Sense Disambiguation using Transfer Learning / Fijavž Zoran, Robnik-Šikonja Marko
Predicting the FTSO consensus price / Koprivec Filip, Eržen Tjaž, Mežnar Urban
On Neural Filter Selection for ON/OFF Classification of Home Appliances / Pirnat Anže, Fortuna Carolina
Indeks avtorjev / Author index

PREDGOVOR

Tehnologije, ki se ukvarjajo s podatki, so v devetdesetih letih močno napredovale. Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in učinkovit dostop do njih, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami, prišlo je do standardizacije procesov, povpraševalnih jezikov itd. Ko shranjevanje podatkov ni bilo več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskemu procesiranju, ampak tudi analitskim vpogledom v podatke – pojavilo se je t. i. skladiščenje podatkov (data warehousing), ki je postalo standarden del informacijskih sistemov v podjetjih. Paradigma OLAP (On-Line Analytical Processing) zahteva od uporabnika, da še vedno sam postavlja sistemu vprašanja, dobiva nanje odgovore ter na vizualen način preverja in išče izstopajoče situacije. Ker seveda to ni vedno mogoče, se je pojavila potreba po avtomatski analizi podatkov oz.
z drugimi besedami po tem, da sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj, zajetih v podatkih. Slovenska KDD konferenca pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve.

Dunja Mladenić
Marko Grobelnik

FOREWORD

Data-driven technologies have progressed significantly since the mid-1990s. The first phase focused mainly on storing and efficiently accessing the data; it resulted in the development of industry tools for managing large databases, related standards, supporting query languages, etc. After this initial period, when data storage was no longer a primary problem, development progressed towards analytical functionality for extracting added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data. At this point, data warehousing with On-Line Analytical Processing (OLAP) entered as a usual part of a company's information system portfolio, requiring the user to pose well-defined questions about aggregated views of the data. Data mining is a technology developed after the year 2000 that offers automatic data analysis, trying to obtain new discoveries from the existing data and enabling the user new insights into the data. In this respect, the Slovenian KDD conference (SiKDD) covers a broad area including statistical data analysis; data, text and multimedia mining; semantic technologies; link detection and link analysis; social network analysis; and data warehouses.

Dunja Mladenić
Marko Grobelnik

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Janez Brank, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Jožef Stefan Institute, Ljubljana
Branko Kavšek, University of Primorska, Koper
Besher M. Massri, Jožef Stefan Institute, Ljubljana
Dunja Mladenić, Jožef Stefan Institute, Ljubljana
Erik Novak, Jožef Stefan Institute, Ljubljana
Inna Novalija, Jožef Stefan Institute, Ljubljana
Jože Rožanec, Qlector, Ljubljana
Abdul Sittar, Jožef Stefan Institute, Ljubljana
Luka Stopar, Sportradar, Ljubljana
Jan Šturm, Jožef Stefan Institute, Ljubljana

Forecasting Trends in Technological Innovations with Distortion-Aware Convolutional Neural Networks

Krisztian Buza, M. Besher Massri, Marko Grobelnik
{krisztian.antal.buza,besher.massri,marko.grobelnik}@ijs.si
Artificial Intelligence Laboratory, Institute Jozef Stefan, Ljubljana, Slovenia

ABSTRACT
Predicting trends in technological innovations holds critical importance for policymakers, investors, and other stakeholders within the innovation ecosystem. This study approaches this challenge by framing it as a time series prediction task. Recent efforts have introduced diverse solutions utilizing convolutional neural networks, including distortion-aware convolutional neural networks. While convolutional layers act as local pattern detectors, conventional convolution matches local patterns in a rigid manner, in the sense that it does not account for local shifts and elongations, whereas distortion-aware convolution incorporates the capability to identify local patterns with flexibility, accommodating local shifts and elongations. The resulting convolutional neural network, with distortion-aware convolution, has exhibited superior performance compared to standard convolutional networks in multiple time series prediction tasks. As a result, we advocate for the application of distortion-aware convolutional networks in forecasting technological innovation trends and compare their performance with conventional convolutional neural networks.

CCS CONCEPTS
• Computing methodologies → Neural networks.

KEYWORDS
trends, innovation ecosystem, time series forecasting, convolutional neural networks, distortion-aware convolution
1 INTRODUCTION

Forecasting trends in technological innovations is of high value for policy makers, investors and other actors of the innovation ecosystem. In this paper, we cast this task as a time series forecasting problem.

Approaches for time series forecasting range from the well-known autoregressive models [4] over exponential smoothing [12] to solutions based on deep learning [10, 11, 16–19, 24, 26]. Among the numerous techniques, a prominent family of methods includes forecasting with convolutional neural networks (CNNs) [3, 20].

The inherent assumption behind CNNs is that local patterns are characteristic of time series and future values of the time series may be predicted based on those local patterns. While the operation of convolution plays the role of a local pattern detector, it matches patterns in a rigid manner, as it does not allow for local shifts and elongations within the patterns. This issue has been addressed by distortion-aware convolution, and the resulting convolutional neural network has been shown to outperform conventional convolutional networks on several time series forecasting tasks [6].

For the aforementioned reasons, in this paper we propose to use distortion-aware convolutional networks for forecasting trends in technological innovations. We perform experiments on real-world time series of the number of patents related to selected topics. We compare the performance of distortion-aware convolutional networks with conventional convolutional neural networks.

The remainder of the paper is organized as follows. In Section 2, we provide a short discussion of related work. We review distortion-aware convolutional networks in Section 3, followed by the experimental results in Section 4. Finally, we conclude in Section 5.

2 RELATED WORK

As we cast our problem as a time series forecasting task, we focus our review of related work on time series forecasting. As mentioned previously, a prominent family of methods includes forecasting techniques based on convolutional neural networks; recent surveys about them have been presented by Lim et al. [17], Sezer et al. [21] and Torres et al. [24].

An essential component of distortion-aware convolution is dynamic time warping (DTW). While DTW is one of the most successful distance measures in the time series domain, see e.g. [25], recent approaches integrate it with neural networks. For example, Iwana et al. [14], Cai et al. [9] and Buza [5] used DTW to construct features. In contrast, Afrasiabi et al. [1] used neural networks to extract features and used DTW to compare the resulting sequences. Shulman [22] proposed "an approach similar to DTW" to allow for flexible matching in case of the dot product. DTW-NN [13] considered neural networks and replaced "the standard inner product of a node with DTW as a kernel-like method". However, DTW-NN only considered multilayer perceptrons (MLPs), whereas we focus on convolutional networks. In the context of time series classification, Buza and Antal proposed to replace the dot product in the convolution operation by DTW calculations [7]. In distortion-aware convolution [6], DTW is used together with the dot product, but the dot product itself is not modified.
3 BACKGROUND

We begin this section with a formal definition of our task, followed by a review of convolutional neural networks with distortion-aware convolution [6].

3.1 Problem Formulation

Given an observed time series $x = (x_1, \ldots, x_l)$ of length $l$, where in our case each $x_i$ represents the number of patents related to a given topic in a month, we aim at predicting its subsequent $h$ values $y = (x_{l+1}, \ldots, x_{l+h})$, i.e., the number of patents in the subsequent $h$ months. We say that $h$ is the forecast horizon and $y$ is the target. Furthermore, we assume that a dataset $D$ is given which contains $n$ time series with the corresponding targets:

$$D = \{ (x^{(i)}, y^{(i)}) \}_{i=1}^{n}. \quad (1)$$

We use $D$ to train neural networks for the aforementioned prediction task. We say that $x^{(i)}$ is the input of the neural network. In our experiments, we assume that an independent dataset $D^*$ is given which can be used to evaluate the predictions of our model. Similarly to $D$, dataset $D^*$ contains pairs of input and target time series. $D^*$ is called the test set.

3.2 The Distortion-aware Convolutional Block

The main idea behind distortion-aware convolution [6] is to calculate, besides the dot products (or inner products), DTW distances between the kernel and time series segments as well. This is illustrated in Fig. 1. Our distortion-aware convolutional block has two output channels: one for dot products and another channel for the DTW distances.

While in case of the dot product, higher similarity between the time series segment and the pattern corresponds to higher values, the opposite is true for the DTW distances: in case of DTW, high similarity between the time series segment and the pattern is reflected by a distance close to zero. Therefore, to make sure that the activations on both channels are consistent, the activations of the DTW channel of our distortion-aware convolutional block are calculated as follows:

$$out_{DTW}(t) = \frac{1}{1 + DTW(in[t : t+s], w)}, \quad (2)$$

where $out_{DTW}$ denotes the activation of the DTW channel of the distortion-aware convolutional block, $in[t : t+s]$ is the segment of the block's input between the $t$-th and $(t+s)$-th position (we use a Python-like syntax: the lower index $t$ is inclusive, the upper index $t+s$ is exclusive), $s$ is the size of the filter, $w$ are the weights of the filter representing a local pattern, and $DTW(\cdot, \cdot)$ is a function that calculates the DTW distance between two time series segments.

Figure 1: In case of distortion-aware convolution, additionally to the dot product (top), DTW distances between the kernel and time series segments are calculated (bottom). Thus, our distortion-aware convolutional block has two output channels: one for dot products and another channel for the DTW distances scaled according to Eq. (2).

Training neural networks with distortion-aware convolution may be challenging because of the backpropagation of gradients through the DTW calculations. The basic idea of training is to train the network with conventional convolution instead of distortion-aware convolution initially, and to add the DTW computations once the weights of the convolutional layer have already been determined. For details, see [6].
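To make Eq. (2) concrete, the following sketch computes both channels of the block for a single filter in plain NumPy. It is an illustration only; the function names are ours, and the authors' released implementation (https://github.com/kr7/dcnn-forecast) should be consulted for the actual layer.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Classic dynamic-programming DTW between two 1-D sequences.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

def distortion_aware_channels(x: np.ndarray, w: np.ndarray):
    # For every segment in[t : t+s], emit the dot product (conventional
    # channel) and the DTW activation 1 / (1 + DTW(in[t:t+s], w)) of Eq. (2).
    s = len(w)
    dot_out, dtw_out = [], []
    for t in range(len(x) - s + 1):
        segment = x[t:t + s]
        dot_out.append(float(segment @ w))
        dtw_out.append(1.0 / (1.0 + dtw_distance(segment, w)))
    return np.array(dot_out), np.array(dtw_out)
```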
4 EXPERIMENTAL EVALUATION

The goal of our experiments is to examine whether neural networks with distortion-aware convolution are more suitable for forecasting technological trends than their counterparts with conventional convolution.

4.1 Data

Lens (http://lens.org) is a web-based service that offers global access to patent information, academic articles, regulatory databases, and additional relevant materials. The platform is designed to simplify the exploration and evaluation of intellectual property information while promoting research and inventive activities. Lens grants complimentary access to patent databases from more than 100 nations and includes sophisticated search functionalities and analytical tools for diverse research and analysis needs.

We extracted time series from the Lens patent database as follows. For selected topics identified by their Cooperative Patent Classification (CPC) codes, we extracted the number of granted patents as well as the number of patent applications per month between January 1980 and December 2022. We considered the following topics: (a) "image or video recognition" (G06V), (b) "neural networks" (G06N3/02), (c) "natural language processing" (G06F40) and (d) all topics related to artificial intelligence. We considered the number of patents separately for the most significant jurisdictions, i.e., (a) United States of America, (b) China, (c) Korea, (d) Japan and (e) Europe. Additionally, we considered the time series of the total number of patents for all the jurisdictions of the database. Thus, we considered 48 time series in total, see also the first two columns of Tab. 1 and Tab. 2. Two example time series are shown in Fig. 2.

Figure 2: Total number of granted patents (red) and patent applications (blue) for all the jurisdictions in the Lens database related to "neural networks" (CPC: G06N3/02).

For each time series, we trained the neural networks to predict the number of granted patents (or patent applications, respectively) for each month of a 6-month period, i.e., the forecast horizon was h = 6. As input, we used the number of granted patents (or patent applications, respectively) in the previous 36 months. The data related to the years 1980–2019 was used as training data, while the data from 2019–2022 was used as test data. From the long time series corresponding to the years 1980–2019, we extracted training instances with a moving window. This resulted in 10496 training instances in total, which corresponds to 427 training instances for each time series. When evaluating the network on the test data, we used the data from 2019–2021 as input, and the task was to predict the number of granted patents (or patent applications, respectively) for the first six months of 2022.
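As an illustration of the moving-window extraction described above, the sketch below slices one long monthly series into (input, target) pairs; the window step of one month is our assumption, as the paper does not spell out this detail.

```python
import numpy as np

def sliding_windows(series: np.ndarray, input_len: int = 36, horizon: int = 6):
    # Slice a long monthly series into (input, target) pairs with a
    # moving window: 36 months of history, 6 months to forecast.
    X, y = [], []
    for start in range(len(series) - input_len - horizon + 1):
        X.append(series[start:start + input_len])
        y.append(series[start + input_len:start + input_len + horizon])
    return np.stack(X), np.stack(y)
```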
Table 1: Mean absolute error (MAE) and root mean squared error (RMSE) for forecasting the time series of granted patents in case of our approach (DCNN) and the baseline (CNN). Lower values indicate better performance.

topic | jurisdiction | RMSE CNN | RMSE DCNN | MAE CNN | MAE DCNN
image or video recognition | US | 165.9 | 106.0 | 131.2 | 92.7
image or video recognition | China | 405.8 | 320.9 | 323.87 | 217.6
image or video recognition | Korea | 13.9 | 27.7 | 12.4 | 19.9
image or video recognition | Japan | 55.9 | 49.8 | 39.9 | 37.8
image or video recognition | Europe | 34.5 | 34.7 | 32.3 | 32.9
image or video recognition | ALL | 494.7 | 399.6 | 416.8 | 341.3
neural networks | US | 10.7 | 9.1 | 9.4 | 7.9
neural networks | China | 5.6 | 5.5 | 3.8 | 3.7
neural networks | Korea | 6.3 | 2.3 | 5.4 | 2.1
neural networks | Japan | 3.5 | 2.9 | 2.5 | 2.0
neural networks | Europe | 2.7 | 1.6 | 2.2 | 1.2
neural networks | ALL | 7.6 | 8.3 | 6.3 | 6.7
natural language processing | US | 19.7 | 15.1 | 14.8 | 12.0
natural language processing | China | 57.1 | 47.0 | 41.6 | 41.7
natural language processing | Korea | 14.2 | 8.5 | 13.1 | 7.3
natural language processing | Japan | 11.8 | 10.7 | 9.5 | 7.3
natural language processing | Europe | 4.8 | 3.0 | 3.5 | 2.7
natural language processing | ALL | 67.0 | 45.7 | 59.5 | 35.5
ALL | US | 270.2 | 216.9 | 224.1 | 196.4
ALL | China | 870.2 | 1108.8 | 763.2 | 998.1
ALL | Korea | 56.6 | 138.3 | 53.8 | 129.4
ALL | Japan | 124.8 | 132.0 | 81.4 | 89.9
ALL | Europe | 85.8 | 69.2 | 82.1 | 65.9
ALL | ALL | 1045.1 | 1129.1 | 929.2 | 964.6

Table 2: Mean absolute error (MAE) and root mean squared error (RMSE) for forecasting the time series of patent applications in case of our approach (DCNN) and the baseline (CNN). Lower values indicate better performance.

topic | jurisdiction | RMSE CNN | RMSE DCNN | MAE CNN | MAE DCNN
image or video recognition | US | 188.2 | 177.1 | 170.2 | 163.3
image or video recognition | China | 3405.0 | 1061.7 | 3375.4 | 1042.3
image or video recognition | Korea | 128.9 | 70.8 | 99.7 | 69.4
image or video recognition | Japan | 103.8 | 106.4 | 87.1 | 66.1
image or video recognition | Europe | 51.9 | 55.5 | 45.0 | 49.4
image or video recognition | ALL | 3641.9 | 2110.5 | 3627.3 | 2027.8
neural networks | US | 79.8 | 15.3 | 76.9 | 12.7
neural networks | China | 21.2 | 20.8 | 16.8 | 19.0
neural networks | Korea | 44.6 | 6.8 | 43.7 | 6.2
neural networks | Japan | 13.9 | 7.1 | 13.5 | 4.8
neural networks | Europe | 15.8 | 5.9 | 14.9 | 4.4
neural networks | ALL | 267.7 | 45.6 | 262.7 | 38.6
natural language processing | US | 64.1 | 68.7 | 55.5 | 64.6
natural language processing | China | 418.9 | 318.2 | 363.6 | 289.3
natural language processing | Korea | 35.1 | 23.4 | 29.7 | 21.0
natural language processing | Japan | 16.7 | 18.7 | 10.5 | 10.8
natural language processing | Europe | 11.2 | 14.3 | 9.7 | 11.2
natural language processing | ALL | 298.1 | 543.0 | 226.9 | 489.3
ALL | US | 532.3 | 329.1 | 458.9 | 311.3
ALL | China | 6443.7 | 2784.2 | 6239.0 | 2386.5
ALL | Korea | 405.4 | 216.8 | 340.2 | 180.8
ALL | Japan | 224.8 | 228.1 | 159.1 | 128.6
ALL | Europe | 130.0 | 163.5 | 97.5 | 121.3
ALL | ALL | 5445.1 | 3355.8 | 5009.0 | 2547.0

4.2 Experimental Settings

In order to assess the contribution of distortion-aware convolution, for each time series, we trained two versions of the neural network, with and without distortion-aware convolution, and compared the results. In the former case, the first hidden layer was a distortion-aware convolutional layer (with both dot product and DTW calculations), whereas in the latter case, we used conventional convolution (with dot product only).

For simplicity, we considered a convolutional network containing a single convolutional layer with 25 filters, followed by a max pooling layer with window size of 2, and a fully connected layer with 100 units. We set the size of the convolutional filters to 9. The number of units in the output layer corresponds to the forecast horizon, as each unit is expected to predict one of the numeric values of the target time series.
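The baseline architecture just described can be sketched in PyTorch as follows. The choice of ReLU activations is our assumption (the paper does not state the activation function), and the distortion-aware variant would replace the first layer with the two-channel block of Section 3.2.

```python
import torch
import torch.nn as nn

class CNNForecaster(nn.Module):
    # Baseline of Section 4.2: one convolutional layer with 25 filters of
    # size 9, max pooling with window 2, a fully connected layer with 100
    # units, and h output units (forecast horizon h = 6).
    def __init__(self, input_len: int = 36, horizon: int = 6):
        super().__init__()
        self.conv = nn.Conv1d(1, 25, kernel_size=9)
        self.pool = nn.MaxPool1d(2)
        conv_out = (input_len - 9 + 1) // 2            # 14 for input_len = 36
        self.fc = nn.Linear(25 * conv_out, 100)
        self.out = nn.Linear(100, horizon)

    def forward(self, x):                              # x: (batch, 1, input_len)
        z = torch.relu(self.pool(self.conv(x)))
        z = torch.relu(self.fc(z.flatten(1)))
        return self.out(z)

model = CNNForecaster()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # settings from Sec. 4.2
loss_fn = nn.MSELoss()
```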
We trained the networks for 1000 epochs with the Adam optimizer [15], with a learning rate of 10⁻⁵ and a batch size of 16. The loss function was mean squared error.

We implemented our neural networks in Python using the PyTorch framework. In order to support reproduction of our work, we made the implementation of our model publicly available in a GitHub repository (https://github.com/kr7/dcnn-forecast). The code illustrates training and evaluation of our model on standard benchmark datasets.

We evaluated the predicted time series both in terms of mean absolute error (MAE) and root mean squared error (RMSE). In particular, we calculated MAE (and RMSE, respectively) for each forecast time series.

As the goal of our experiments is to assess the contribution of distortion-aware convolution, our baseline, denoted as CNN, is the aforementioned neural network with conventional convolution instead of distortion-aware convolution.
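For completeness, the two evaluation measures reduce to a few lines of NumPy; both are computed per forecast time series, as in Tables 1 and 2.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Mean absolute error over the h predicted values.
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Root mean squared error over the h predicted values.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```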
4.3 Results

Tab. 1 and Tab. 2 show our results in terms of MAE and RMSE. Our approach, the convolutional neural network with distortion-aware convolution, is denoted by DCNN, while CNN denotes the neural network with conventional convolution. As one can see, in the majority of the examined cases, DCNN outperforms CNN both in terms of MAE and RMSE. In those cases when CNN performs better, typically, both models are rather accurate (the error is low for both models) or the difference is very small compared to the magnitude of the error.

5 CONCLUSIONS AND OUTLOOK

In this paper, we focused on forecasting technological trends and cast this task as a time series forecasting problem. We considered a recent approach, convolutional neural networks with distortion-aware convolution, which has not been used for this task previously. We performed experiments on real-world time series representing the number of granted patents and patent applications related to selected topics. Our observations show that convolutional neural networks with distortion-aware convolution are promising for this task. Furthermore, the combination of conventional convolutional networks and neural networks with distortion-aware convolution may be an interesting target of future works.

Last, but not least, we mention that time series are prominent in various real-world applications [2, 23], and our approach can be extended to handle other types of time series, such as multivariate time series (or series of vectors) that can be compared with a more general version of DTW, see e.g. [8].

ACKNOWLEDGMENTS

This work was supported by the European Union through the enRichMyData EU HE project under grant agreement No 101070284.

REFERENCES

[1] Mahlagha Afrasiabi, Muharram Mansoorizadeh, et al. 2019. DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features. The Visual Computer (2019), 1–13.
[2] Margit Antal and László Zsolt Szabó. 2016. On-line verification of finger drawn signatures. In 11th International Symposium on Applied Computational Intelligence and Informatics. IEEE, 419–424.
[3] Anastasia Borovykh, Sander Bohte, and Cornelis W Oosterlee. 2017. Conditional time series forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691 (2017).
[4] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. 2015. Time series analysis: forecasting and control. John Wiley & Sons.
[5] Krisztian Buza. 2020. Asterics: Projection-based classification of EEG with asymmetric loss linear regression and genetic algorithm. In 14th International Symposium on Applied Computational Intelligence and Informatics. IEEE, 000035–000040.
[6] Krisztian Buza. 2023. Time Series Forecasting with Distortion-Aware Convolutional Neural Networks. In 9th SIGKDD International Workshop on Mining and Learning from Time Series.
[7] Krisztian Buza and Margit Antal. 2021. Convolutional neural networks with dynamic convolution for time series classification. In International Conference on Computational Collective Intelligence. Springer, 304–312.
[8] Krisztian Antal Buza. 2011. Fusion methods for time-series classification. PhD thesis, University of Hildesheim (2011).
[9] Xingyu Cai, Tingyang Xu, Jinfeng Yi, Junzhou Huang, and Sanguthevar Rajasekaran. 2019. DTWNet: a dynamic time warping network. Advances in Neural Information Processing Systems 32 (2019).
[10] Zhengping Che, Sanjay Purushotham, Guangyu Li, Bo Jiang, and Yan Liu. 2018. Hierarchical deep generative models for multi-rate multivariate time series. In International Conference on Machine Learning. PMLR, 784–793.
[11] Marco Cuturi and Mathieu Blondel. 2017. Soft-DTW: a differentiable loss function for time-series. In International Conference on Machine Learning. PMLR, 894–903.
[12] Everette S Gardner Jr. 2006. Exponential smoothing: The state of the art—Part II. International Journal of Forecasting 22, 4 (2006), 637–666.
[13] Brian Kenji Iwana, Volkmar Frinken, and Seiichi Uchida. 2020. DTW-NN: A novel neural network for time series recognition using dynamic alignment between inputs and weights. Knowledge-Based Systems 188 (2020), 104971.
[14] Brian Kenji Iwana and Seiichi Uchida. 2020. Time series classification using local distance-based features in multi-modal fusion networks. Pattern Recognition 97 (2020), 107024.
[15] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[16] Vincent Le Guen and Nicolas Thome. 2019. Shape and time distortion loss for training deep time series forecasting models. Advances in Neural Information Processing Systems 32 (2019).
[17] Bryan Lim and Stefan Zohren. 2021. Time-series forecasting with deep learning: a survey. Philosophical Transactions of the Royal Society A 379, 2194 (2021), 20200209.
[18] Linbo Liu, Youngsuk Park, Trong Nghia Hoang, Hilaf Hasson, and Luke Huan. 2023. Robust Multivariate Time-Series Forecasting: Adversarial Attacks and Defense Mechanisms. In The Eleventh International Conference on Learning Representations.
[19] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison Cottrell. 2017. A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971 (2017).
[20] Rajat Sen, Hsiang-Fu Yu, and Inderjit S Dhillon. 2019. Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting. Advances in Neural Information Processing Systems 32 (2019).
[21] Omer Berat Sezer, Mehmet Ugur Gudelek, and Ahmet Murat Ozbayoglu. 2020. Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Applied Soft Computing 90 (2020), 106181.
[22] Yaniv Shulman. 2019. Dynamic Time Warp Convolutional Networks. arXiv preprint arXiv:1911.01944 (2019).
[23] Abdul Sittar and Dunja Mladenić. 2023. An approach to creating a time-series dataset for news propagation: Ukraine-war case study. In Slovenian KDD Conference.
[24] José F Torres, Dalil Hadjout, Abderrazak Sebaa, Francisco Martínez-Álvarez, and Alicia Troncoso. 2021. Deep learning for time series forecasting: a survey. Big Data 9, 1 (2021), 3–21.
[25] Xiaopeng Xi, Eamonn Keogh, Christian Shelton, Li Wei, and Chotirat Ann Ratanamahatana. 2006. Fast time series classification using numerosity reduction. In Proceedings of the 23rd International Conference on Machine Learning. 1033–1040.
[26] Tian Zhou, Ziqing Ma, Qingsong Wen, Liang Sun, Tao Yao, Wotao Yin, Rong Jin, et al. 2022. FiLM: Frequency improved Legendre memory model for long-term time series forecasting. Advances in Neural Information Processing Systems 35 (2022), 12677–12690.

Building A Causality Graph For Strategic Foresight

Jože M. Rožanec, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, joze.rozanec@ijs.si
Beno Šircelj, Jožef Stefan Institute, Ljubljana, Slovenia, beno.sircelj@ijs.si
Peter Nemec, Event Registry d.o.o., Ljubljana, Slovenia, peter@eventregistry.org
Gregor Leban, Event Registry d.o.o., Ljubljana, Slovenia, gregor@eventregistry.org
Dunja Mladenić, Jožef Stefan Institute, Ljubljana, Slovenia, dunja.mladenic@ijs.si

ABSTRACT
This paper describes a pipeline built to generate a causality graph for strategic foresight. The pipeline interfaces with a well-known global media retrieval platform, which performs real-time tracking of events reported in the media. The events are retrieved from the media retrieval platform, and content from the media articles is processed with ChatGPT to extract causal relations mentioned in the news articles. Multiple post-processing steps are performed to clean the causal relations, removing spurious ones and linking them to ontological concepts where possible. Finally, a sample causality trace is showcased to exemplify the potential of the causality graph created so far.

KEYWORDS
strategic foresight, graph, causality extraction, wikifier, ChatGPT

1 INTRODUCTION
Other Among the most frequently used strategic foresight methods we cases include using strategic foresight to understand how EU- find scenario planning [7], that aims to foresee relevant scenarios wide policies may affect regions and rural localities [26] or guide based on trends and factors of influence. These allow for a better decision-making in the face of structural change [2]. understanding of how actions can influence the future - a key Previous work [22, 23] described how artificial intelligence ability in a world full of Turbulence, Unpredictability, Uncertainty, could be used to automate scenario planning. This paper de- Novelty, and Ambiguity (TUNA) [30]. This ability has fostered scribes a pipeline built to extract and process media news from an increasing adoption of strategic foresight in the public and EventRegistry [16] to create a causality graph. Furthermore, it private sectors [6, 21]. describes the causality graph created with media news report- Domain experts currently plan scenarios by gathering and an- ing on events related to oil prices, given the abundant research alyzing the data to determine and report probable, possible, and regarding how oil prices impact the environment. Among the plausible futures of interest [15]. Nevertheless, the extensive man-benefits of this approach is the ability to extract causal relations ual work imposes severe scalability limitations and can introduce with little human intervention and no supervision. The resulting bias into the assessments [7]. To overcome such limitations, artifi-graph enables the creation of link prediction models that can be cial intelligence was proposed to automate information scanning used to predict future events based on an array of events that and data analysis [4, 18]. have been observed in the past. While the value of artificial intelligence for strategic foresight This paper is organized as follows. First, section 2 describes has been recognized, artificial intelligence has not been widely how a data extraction pipeline was built, retrieving media events adopted yet [4, 20]. This is also reflected in scientific papers of interest and extracting causal relationships observed in the on foresight and artificial intelligence. For example, we queried world and described in them. Section 3 briefly describes some of Google Scholar for "data-supported foresight" and "strategic fore-the results obtained, providing (i) a quantitative assessment of sight artificial intelligence" considering the start time is unlim- error types and resulting causal relationships after data cleansing ited, and the deadline is September 6th 2023. When analyzing procedures and (ii) a qualitative assessment of causality relation- the first 50 search results of each, we got 18% (9/50) and 40% ships generated through the pipeline. Finally, Section 4 concludes and outlines future work. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this 2 DATA EXTRACTION PIPELINE work must be honored. For all other uses, contact the owner/author(s). The data extraction pipeline aims to query relevant media news, Information Society 2023, 9–13 October 2023, Ljubljana, Slovenia © 2023 Copyright held by the owner/author(s). 
2.2 Causality extraction

To extract causal relations from media events, OpenAI ChatGPT (gpt-3.5-turbo) was used as a one-shot learning model. To that end, a random media event was sampled, the causality relationships were extracted, and both (the text and causal relationships) were presented to the model, asking it to recognize causal relationships in the media news. Several iterations of prompt engineering were performed to ensure high-quality results, performing a manual assessment of random results.

The causal relationships were persisted in JSON files discriminating the cause, effect, related entities, and locations. In particular, cause, effect, entities, and locations were defined in the following manner:
• Cause or effect: contains an entity, which is an item, individual, or company that an event happened to;
• Event: an action, development, happening, or state of the entity that is causing or was affected by a cause in the relationship;
• Location: the geographical location where the event in the cause or effect took place.

Once the causal relationships were extracted, the cause and effect were post-processed, removing adjectives so that only the nouns were left; e.g., higher diesel prices was converted to diesel prices. The decision was made considering that by doing so, (a) the causes and effects would gain greater support and, therefore, strengthen the information signal in a graph, and (b) a human expert would be able to determine how a cause and effect may relate given his domain knowledge and a particular context. For example, given the relationship Inflation → Consumer price index, the human expert will immediately understand how the consumer price index is affected in a growing or shrinking inflationary context. For each causal relationship, a trace was kept to associate it with the media event from which it was extracted, to enable further analysis when required.
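A minimal sketch of the one-shot extraction call, using the openai Python client, is shown below. The actual prompt was iteratively engineered by the authors and is not published, so the wording and the JSON schema here are hypothetical.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ONE_SHOT_EXAMPLE = (
    "Article: <sample article text>\n"
    'Causal relations: [{"cause": {"entity": "...", "event": "...", '
    '"location": "..."}, "effect": {...}}]'
)  # one worked example shown to the model, as described in Section 2.2

def extract_causal_relations(article_text: str) -> str:
    # Ask the model to return cause/effect/entity/location tuples as JSON.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Extract causal relations from news text. Return JSON "
                        "objects with cause, effect, entities, and locations."},
            {"role": "user", "content": ONE_SHOT_EXAMPLE},
            {"role": "user", "content": f"Article: {article_text}"},
        ],
    )
    return response.choices[0].message.content
```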
2.3 Semantic matching and enrichment

The entire text of the media article was parsed using Wikifier [5]. Data from Wikifier was employed in two distinct ways: firstly, to enrich location data, and secondly, to associate entities with relevant semantic concepts.

The Wikifier tool marks which words in the wikified text correspond to certain semantic concepts. Such annotations were matched to the entities extracted by ChatGPT as part of the causal relationships. To successfully match strings to semantic concepts, some preprocessing was required. First, the non-letter symbols and stopwords were removed, followed by the stemming of each word. It was considered a match if there was at least one identical string between the text related to the marked concepts and the causal relationship. Not all of the semantic concepts listed by the Wikifier were considered: (a) the concepts were required to have a PageRank higher than 0.0001; (b) for location data, only the concepts categorized as "place" were considered; and (c) when substituting the original entity by the associated semantic concept, the semantic concept with the highest cosine similarity between the article and its corresponding Wikipedia page was considered.
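The string-to-concept matching step can be sketched as follows; this is a simplified illustration using NLTK, since the paper does not name the stemmer or stopword list that was actually used.

```python
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords  # requires: nltk.download('stopwords')

stemmer = PorterStemmer()
STOP = set(stopwords.words("english"))

def normalize(text: str) -> set:
    # Strip non-letter symbols and stopwords, then stem each word (Sec. 2.3).
    words = re.sub(r"[^a-zA-Z ]", " ", text).lower().split()
    return {stemmer.stem(w) for w in words if w not in STOP}

def matches(entity: str, concept_label: str, pagerank: float) -> bool:
    # A Wikifier concept is linked to an entity if it passes the PageRank
    # threshold and shares at least one identical (stemmed) string.
    return pagerank > 0.0001 and bool(normalize(entity) & normalize(concept_label))
```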
• missing entity: [ChatGPT] ChatGPT omits the actual Wrong conversion 17 5.7% entity but could be inferred from the text by the human reader. E.g., S&P 500 capital expenditures → growth, energy Missing entity 15 5.0% policy → defiance, or survey → Nasdaq 100. • time entity: [ChatGPT] some time-period is considered Non-entity 9 3.0% an entity. E.g., drilling activity → 2016, or (US) shale oil Time entity 3 1.0% supply → end of the year. • non-entity: [ChatGPT] words marked as entities don’t mean anything coherent. E.g., retail sales → risk appetite. Table 1: Statistics for typified errors based on a random • wrong conversion: [Wikifier] the entity was changed to sample of 300 causal relationships. something unrelated to the one stated in the text. E.g., Aus- tralian government > Australian dollar, or political tensions > Breakup of Yugoslavia. After performing the abovementioned cleansing and dictionary- While the mitigation strategy for most of the abovementioned based mappings, 7,723 nodes and 9,726 edges were obtained. Re- errors is to remove the causal relationship, for moving causal relationships reported only in a single media event missing entity, a follow-up question will be provided to ChatGPT to get a more reduced the graph size to 489 nodes and 877 edges. concrete answer. This last mitigation strategy has not been im- 3.1 Causality graph and causality chain plemented yet. Furthermore, a list of concept mappings will be considered to reduce clutter. For example, analysis Wage Growth or 1980s Oil Glut should be replaced by Wage or Oil Glut, respectively. Causal chains were created by linking causes and effects extracted Breakup of Yugoslavia could be replaced by Country Breakup. from media events. While these are not always completely ac- Finally, a more thorough linking to semantic concepts and on- curate, they help to identify sequences of events that may take tologies is required (e.g., Jerome Powell could be linked to Central place. Furthermore, while currently not implemented, graph link Bank). prediction could be used to predict future event sequences based After the abovementioned cleansing, the strings were turned on patterns observed in the past. into lowercase and trimmed, and most non-alphabetical charac- This section provides an example regarding a causality chain ters were removed. Further sampling and entity evaluation were of interest retrieved from the causality graph. The causality chain performed, creating a dictionary to match string occurrences to is briefly analyzed to demonstrate how it captures relevant knowl- a particular concept. It must be noted that the dictionaries do not edge. In particular, many causality chains displayed the following provide an exhaustive mapping and that ongoing work is being pattern: Pandemic → Currency → Price of Oil → Economic Growth done to further refine and complete the mapping phase. Such → Oil Glut → Inflation → Central Bank → Stock Market → In- dictionaries were created to provide ground for future ontological vestment. mapping based on existing ontologies and ontologies that will The complete causality chain summarized above was: Pan- be developed for this purpose. Finally, all the relations that, after demic → Currency → Price of Oil → Crude Oil Futures → Fuel the described process, were extracted from only one media event Pricing → Economic Growth → Petroleum → Oil Glut → Con- were discarded, given they are very likely to introduce noise. 
2.5 Creating a causality graph
Once causal relationships were extracted, a causality graph was created by matching cause → effect. Furthermore, some metrics were computed to assess the graph characteristics. The graph can be sampled and visualized with the NetworkX² library, which creates a dynamic HTML interface to view it. For each cause and all the possible effects following it, probabilities of each effect occurring were computed based on the ratios present in the data.

²The library is documented at the following website: https://networkx.org/

3 RESULTS
A total of 2,503 media events were extracted from EventRegistry. When processed with ChatGPT, 12,290 unique causal relationships were extracted, totaling 14,226 unique entities. Those were processed to remove possible errors: considering repeated entity and empty entity errors, 253 causal relations were removed. After applying wikification, a further 845 causal relations were removed due to repeated entity and empty entity errors, leaving 9,726 unique causal relations totaling 7,723 entities. Table 1 shows the number of causal relations affected by each error type, considering a random sample of 300 causal relations.

Error type       | Count | Percentage
Wrong conversion | 17    | 5.7%
Missing entity   | 15    | 5.0%
Non-entity       | 9     | 3.0%
Time entity      | 3     | 1.0%

Table 1: Statistics for typified errors based on a random sample of 300 causal relationships.

After performing the abovementioned cleansing and dictionary-based mappings, 7,723 nodes and 9,726 edges were obtained. Removing causal relationships reported in only a single media event reduced the graph size to 489 nodes and 877 edges.
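A minimal sketch of the graph construction and per-cause effect probabilities described in Section 2.5, assuming cleansed relations arrive as (cause, effect) string pairs (one pair per media event mention); the `probability` edge attribute name is an assumption:

```python
import networkx as nx
from collections import Counter

def build_causality_graph(relations: list) -> nx.DiGraph:
    """Build a directed graph whose edges carry P(effect | cause)."""
    counts = Counter(relations)  # relations: list of (cause, effect) pairs
    graph = nx.DiGraph()
    for (cause, effect), n in counts.items():
        graph.add_edge(cause, effect, count=n)
    # For each cause, normalize the outgoing counts into probabilities.
    for cause in graph.nodes:
        total = sum(d["count"] for _, _, d in graph.out_edges(cause, data=True))
        for _, effect, d in graph.out_edges(cause, data=True):
            d["probability"] = d["count"] / total
    return graph
```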
3.1 Causality graph and causality chain analysis
Causal chains were created by linking causes and effects extracted from media events. While these are not always completely accurate, they help to identify sequences of events that may take place. Furthermore, while currently not implemented, graph link prediction could be used to predict future event sequences based on patterns observed in the past.

This section provides an example regarding a causality chain of interest retrieved from the causality graph. The causality chain is briefly analyzed to demonstrate how it captures relevant knowledge. In particular, many causality chains displayed the following pattern: Pandemic → Currency → Price of Oil → Economic Growth → Oil Glut → Inflation → Central Bank → Stock Market → Investment. The complete causality chain summarized above was: Pandemic → Currency → Price of Oil → Crude Oil Futures → Fuel Pricing → Economic Growth → Petroleum → Oil Glut → Consumer Price Index → Monetary Policy → Inflation → Central Bank → Stock Market → Investment → Bond.

To validate the causality chain, scientific literature and events from the past few years were reviewed to find research and examples supporting the causal relationships. For the causality chain described above, we found that the Pandemic influenced Currency: countries experiencing a sharp daily rise in COVID-19 deaths usually saw their currencies weaken [13]. Causality between exchange rates (Currency) and the Price of Oil has been reported by the European Central Bank [9]. In particular, it has been noticed that exchange rates can affect oil prices through financial markets, financial assets, portfolio rebalancing, and hedging practices. It has also been noted that, given that oil prices are expressed in US dollars, oil futures can be used to hedge against an expected depreciation of the US dollar - something that explains the causal relationship between Price of Oil and Crude Oil Futures. Furthermore, a relationship exists between futures and spot prices (futures prices tend to converge upon spot prices³) and between oil prices and fuel prices⁴, validating the causal relationship between Crude Oil Futures and Fuel Pricing.

³See "Futures Prices Converge Upon Spot Prices", last accessed at https://www.investopedia.com/ask/answers/06/futuresconvergespot.asp on September 7th 2023.
⁴See "Gasoline explained: Factors affecting gasoline prices", last accessed at https://www.eia.gov/energyexplained/gasoline/factors-affecting-gasoline-prices.php on September 7th 2023.

When considering the relationship between Fuel Pricing and Economic Growth, we found that the relationship is validated with energy prices [3], e.g., with gas prices: higher gas prices negatively impact the economy⁵. Economic growth can affect the petroleum market and, in particular, lead to an oil glut (a significant surplus of crude oil caused by falling demand), as happened at the beginning of the COVID-19 pandemic⁶. Furthermore, oil pricing can have direct or indirect effects on Inflation [24], which is reflected in the Consumer Price Index, and which can trigger a particular Monetary Policy from the Central Bank in response to it. Finally, monetary policies affect the stock market and investments [25].

⁵See "How Gas Prices Affect the Economy", last accessed at https://www.investopedia.com/financial-edge/0511/how-gas-prices-affect-the-economy.aspx on September 7th 2023.
⁶See "Oil glut means there's little hope for oil price recovery until 2021", last accessed at https://www.conference-board.org/topics/natural-disasters-pandemics/COVID-19-oil-glut on August 30th 2023.

While the causality chain displayed in this case is mostly clean, some improvements are required to make it neater. For example, based on domain knowledge, and depending on the context, the Consumer Price Index and Inflation could be merged into a single concept, and Monetary Policy and Central Bank could be considered as one.

The ingestion pipeline requires further work to enhance the concept mappings. We envision that the dictionaries will be further evolved and linked to specific ontologies that could be used to assign semantic meaning and, e.g., contract links in a chain with the same semantic ancestor.
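Given such a graph, candidate chains like the example above can be enumerated and scored; a minimal sketch building on the `build_causality_graph` sketch above (the concept names and the `probability` attribute are assumptions carried over from that sketch, not the authors' implementation):

```python
import networkx as nx

def causal_chains(graph: nx.DiGraph, source: str, target: str,
                  cutoff: int = 15) -> list:
    """Enumerate simple cause-effect paths between two concepts."""
    return list(nx.all_simple_paths(graph, source=source, target=target,
                                    cutoff=cutoff))

def chain_probability(graph: nx.DiGraph, chain: list) -> float:
    """Score a chain by multiplying the conditional effect probabilities."""
    p = 1.0
    for cause, effect in zip(chain, chain[1:]):
        p *= graph[cause][effect]["probability"]
    return p

# e.g., chains = causal_chains(g, "pandemic", "bond")  # hypothetical node names
```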
4 CONCLUSIONS
This research has described a pipeline created for causality extraction from media news, aimed toward a strategic foresight tool and currently focused on events affecting oil prices. Particular errors in the causality extraction were identified and typified, and mitigation measures were implemented. Nevertheless, further work is required to improve the pipeline. Future work will consider three directions: (a) string-to-ontologies mapping, to ensure the captured causes and effects can be tied to particular semantic knowledge and exploit it; (b) generating richer cause and effect representations so that, based on encoded metadata, better causality patterns can be elucidated; and (c) creating a link prediction model based on the causality graph.

ACKNOWLEDGMENTS
The Slovenian Research Agency supported this work. This research was developed as part of the Graph-Massivizer project funded under the Horizon Europe research and innovation program of the European Union under grant agreement 101093202.

REFERENCES
[1] Nicholas Apergis and James E Payne. 2015. Renewable energy, output, carbon dioxide emissions, and oil prices: evidence from South America. Energy Sources, Part B: Economics, Planning, and Policy 10, 3 (2015), 281–287.
[2] M Bruce Beck. 2005. Environmental foresight and structural change. Environmental Modelling & Software 20, 6 (2005), 651–670.
[3] Istemi Berk and Hakan Yetkiner. 2014. Energy prices and economic growth in the long run: Theory and evidence. Renewable and Sustainable Energy Reviews 36 (2014), 228–235.
[4] Patrick Brandtner and Marius Mates. 2021. Artificial Intelligence in Strategic Foresight - Current Practices and Future Application Potentials. In The 2021 12th International Conference on E-business, Management and Economics. 75–81.
[5] Janez Brank, Gregor Leban, and Marko Grobelnik. 2017. Annotating documents with relevant Wikipedia concepts. Proceedings of SiKDD 472 (2017).
[6] George Burt and Anup Karath Nair. 2020. Rigidities of imagination in scenario planning: Strategic foresight through 'Unlearning'. Technological Forecasting and Social Change 153 (2020), 119927.
[7] Ashkan Ebadi, Alain Auger, and Yvan Gauthier. 2022. Detecting emerging technologies and their evolution using deep learning and weak signal analysis. Journal of Informetrics 16, 4 (2022), 101344.
[8] Ali Ebaid, Hooi Hooi Lean, and Usama Al-Mulali. 2022. Do oil price shocks matter for environmental degradation? Evidence of the environmental Kuznets curve in GCC countries. Frontiers in Environmental Science 10 (2022), 860942.
[9] Marcel Fratzscher, Daniel Schneider, and Ine Van Robays. 2014. Oil prices, exchange rates and asset prices. (2014).
[10] Amber Geurts, Ralph Gutknecht, Philine Warnke, Arjen Goetheer, Elna Schirrmeister, Babette Bakker, and Svetlana Meissner. 2022. New perspectives for data-supported foresight: The hybrid AI-expert approach. Futures & Foresight Science 4, 1 (2022), e99.
[11] Joseph M Greenblott, Thomas O'Farrell, Robert Olson, and Beth Burchard. 2019. Strategic foresight in the federal government: a survey of methods, resources, and institutional arrangements. World Futures Review 11, 3 (2019), 245–266.
[12] Jinyan Hu, Kai-Hua Wang, Chi Wei Su, and Muhammad Umar. 2022. Oil price, green innovation and institutional pressure: A China's perspective. Resources Policy 78 (2022), 102788.
[13] Aamir Jamal and Mudaser Ahad Bhat. 2022. COVID-19 pandemic and the exchange rate movements: evidence from six major COVID-19 hot spots. Future Business Journal 8, 1 (2022), 17.
[14] Foday Joof, Ahmed Samour, Mumtaz Ali, Turgut Tursoy, Mohammad Haseeb, Md Emran Hossain, and Mustafa Kamal. 2023. Symmetric and asymmetric effects of gold and oil price on environment: The role of clean energy in China. Resources Policy 81 (2023), 103443.
[15] Kevin Kohler. 2021. Strategic Foresight: Knowledge, Tools, and Methods for the Future. CSS Risk and Resilience Reports (2021).
[16] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event Registry: learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web. 107–110.
[17] Gaye-Del Lo, Isaac Marcelin, Théophile Bassène, and Babacar Sène. 2022. The Russo-Ukrainian war and financial markets: the role of dependence on Russian commodities. Finance Research Letters 50 (2022), 103194.
[18] Nathan H Parrish, Anna L Buczak, Jared T Zook, James P Howard, Brian J Ellison, and Benjamin D Baugher. 2019. Crystal Cube: Multidisciplinary approach to disruptive events prediction. In Advances in Human Factors, Business Management and Society: Proceedings of the AHFE 2018 International Conference. Springer, 571–581.
[19] Lorien Pratt, Christophe Bisson, and Thierry Warin. 2023. Bringing advanced technology to strategic decision-making: The Decision Intelligence/Data Science (DI/DS) Integration framework. Futures 152 (2023), 103217.
[20] Norbert Reez. 2020. Foresight-Based Leadership. Decision-Making in a Growing AI Environment. In International Security Management: New Solutions to Complexity. Springer, 323–341.
[21] Aaron B Rosa, Niklas Gudowsky, and Petteri Repo. 2021. Sensemaking and lens-shaping: Identifying citizen contributions to foresight through comparative topic modelling. Futures 129 (2021), 102733.
[22] Joze Rozanec, Peter Nemec, Gregor Leban, and Marko Grobelnik. 2023. AI, What Does the Future Hold for Us? Automating Strategic Foresight. In Companion of the 2023 ACM/SPEC International Conference on Performance Engineering. 247–248.
[23] Jože M Rožanec, Radu Prodan, Dumitru Roman, Gregor Leban, and Marko Grobelnik. 2023. AI-based Strategic Foresight for Environment Protection. In Symposium on AI, Data and Digitalization (SAIDD 2023), 7.
[24] Siok Kun Sek, Xue Qi Teo, and Yen Nee Wong. 2015. A comparative study on the effects of oil price changes on inflation. Procedia Economics and Finance 26 (2015), 630–636.
[25] Peter Sellin. 2001. Monetary policy and the stock market: theory and empirical evidence. Journal of Economic Surveys 15, 4 (2001), 491–541.
[26] Anastasia Stratigea and Maria Giaoutzi. 2012. Linking global to regional scenarios in foresight. Futures 44, 10 (2012), 847–859.
[27] Mitja Trampuš and Blaz Novak. 2012. Internals of an aggregated web news feed. In Proceedings of the 15th Multiconference on Information Society. 221–224.
[28] Victor Troster, Muhammad Shahbaz, and Gazi Salah Uddin. 2018. Renewable energy, oil prices, and economic activity: A Granger-causality in quantiles analysis. Energy Economics 70 (2018), 440–452.
[29] Barend Van der Meulen. 1999. The impact of foresight on environmental science and technology policy in the Netherlands. Futures 31, 1 (1999), 7–23.
[30] Angela Wilkinson. 2017. Strategic foresight primer. European Political Strategy Centre (2017).

Towards Testing the Significance of Branching Points and Cycles in Mapper Graphs

Patrik Zajec (patrik.zajec@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia
Primož Škraba (p.skraba@qmul.ac.uk), School of Mathematical Sciences, Queen Mary University of London, London, UK
Dunja Mladenić (dunja.mladenic@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT
Given a point cloud P, which is a set of points embedded in R^d, we are interested in recovering its topological structure. Such a structure can be summarized in the form of a graph. An example of this is the mapper graph, which captures how the point cloud is connected and reflects the branching and cyclic structure of P as branching points (vertices with degree greater than 2) and cycles in the graph. However, such a representation is not always accurate, i.e., the structure shown by the graph may not be sufficiently supported in the point cloud. To this end, we propose an approach that uses persistent (relative) homology to detect branching and cyclic structure, and employs a statistical test to confirm whether the structure is indeed significant. We show how the approach works for low-dimensional point clouds, and discuss its possible applications to real-world point clouds.

KEYWORDS
topological data analysis, statistical hypothesis testing, persistent homology, mapper algorithm

1 INTRODUCTION
Consider the point cloud P consisting of points in R^2 shown in Figure 1a.
Using the mapper algorithm, we can construct a graph that represents its topological structure, like the one in Figure 1b, which seems to recover the important structure. Using the same algorithm (but with different values of its adjustable parameters) we could end up with different graphs. The second graph, shown in Figure 1c, contains two cycles: the middle one, which captures the cycle present in P, and the top one, where the algorithm "mistakenly" considers the top points to connect in a cycle. The third graph, shown in Figure 1d, shows a similar structure to the graph in Figure 1b, although it contains one more branching point (splitting off the upper left branch) and a cycle of length three. One could argue that these branching and cyclic structures are not sufficiently supported in P.

[Figure 1: A point cloud (a) and three graphs (b, c, d) summarizing its topological structure, constructed by the mapper algorithm for different choices of its parameters.]

Our goal is to develop an approach that allows us to confirm, through a statistical test, whether the structure recovered by the mapper graph is indeed present in the point cloud. We use persistent homology, a well-known construction from topological data analysis (TDA), to represent the structure of the point cloud, and a recently introduced hypothesis testing framework [1] that provides a way to evaluate the significance of such a structure. We demonstrate the approach on two examples: a Y-shaped point cloud and a sample of a 3D mesh resembling an ant. These low-dimensional examples allow us to visually inspect the results, laying the groundwork for extensive experiments with higher-dimensional point cloud data used in real-world applications.

Representing the topological structure of a point cloud with a simpler object, such as a graph, and having a statistical method for testing the significance of such a structure is a very relevant task. A simpler representation allows us to visualize [3] and interpret high-dimensional representations that are everywhere in modern data science and machine learning. It might even allow us to find singularities that often carry relevant information. The mapper algorithm [6] is a commonly used tool in TDA. Although it is simple, the result is sensitive to the choice of its parameters [2]. Moreover, it provides only one possible low-dimensional view of the input data, and to our knowledge there is no method that would confirm the significance of the represented structure. There is another method, called persistent homology, which, while not directly applicable to visualization, deals with a particular structure of "holes" in space and now has a framework [1] that allows us to statistically test the significance of such a structure.

2 BACKGROUND
A point cloud P is a set of points embedded in R^d which can be viewed as a sample of a topological space X. Since the discrete points from P have no interesting topological structure, we consider the space P_r = ∪_{p ∈ P} B(p, r) for some radius r. If P is a sufficiently dense sample of X, then P_r has some of the same properties as X for a suitable r. To compute the properties of interest, we represent P_r with a simplicial complex K which, if properly constructed, has homology groups isomorphic to those of P_r. We are interested in finding the branching and cyclic structure in the point cloud, both of which can be detected using (persistent) homology.
2.1 Simplicial complexes
A (geometric) simplicial complex K can be thought of as a "high-dimensional graph" whose vertices are points from the point cloud and whose connectivity is determined by the geometric configuration of the points. In addition to vertices and edges, we include triangles, tetrahedra and higher-dimensional simplices. Formally, K consists of finite nonempty subsets of P and is closed under inclusion (i.e., A ∈ K and B ⊂ A implies B ∈ K). We refer to elements in K of size k + 1 as k-simplices, which correspond to k-cliques when we think about K as a hyper-graph.

The Čech and Vietoris-Rips complexes are the two most common constructions, both parameterized by a scale parameter (radius) r > 0. We use the Vietoris-Rips construction, where we include a subset of (k + 1) points from P as a k-simplex if all points are at most r apart.

We can construct a sequence of complexes K_{r_1}, K_{r_2}, ... by increasing the radius r. Such a construction is "increasing" in the sense that for r_1 < r_2 it holds that K_{r_1} ⊆ K_{r_2}. Such sequences are also known as filtrations and are used in persistent homology.
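A minimal sketch of the Vietoris-Rips construction with the Dionysus library (the tool used in Section 4); the random point cloud and the parameters k = 2, r = 0.5 are illustrative stand-ins:

```python
import numpy as np
import dionysus as d  # Dionysus 2, the library used in Section 4

# A toy stand-in for the point cloud P.
points = np.random.random((100, 2))

# Vietoris-Rips filtration: every subset of at most k + 1 = 3 points whose
# members are pairwise at most r = 0.5 apart becomes a simplex, ordered by
# the radius at which it first appears.
filtration = d.fill_rips(points, 2, 0.5)
```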
2.2 Persistent relative homology

Homology. Homology is a classical construction in algebraic topology that deals with the topological properties of a space. More precisely, it provides a mathematical language for the holes in a topological space. Homology groups, denoted by H_k(X), where k is a dimension, capture the holes indirectly by focusing on what surrounds them. For example, the basis of H_0(X) corresponds to the connected components and the basis of H_1(X) to the closed loops surrounding the holes. The rank of the k-th homology group, also known as the Betti number, counts the number of k-dimensional "holes".

We can construct homology groups for a given simplicial complex K. The important concepts in the construction are: (i) the chain groups C_k, where the k-th chain group consists of all formal linear combinations Σ_i a_i σ_i of k-dimensional simplices, where the σ_i are k-simplices from K and the a_i are coefficients, usually from Z_2; (ii) the boundary operator ∂_k, a map describing how (k-1)-simplices are attached to k-simplices; (iii) the groups Z_k of k-cycles, which are the k-chains in the kernel of ∂_k; and (iv) the groups B_k of k-boundaries, which are the elements in the image of ∂_{k+1}. The boundary operator has the property that ∂_k ∘ ∂_{k+1} = 0, i.e., it maps the boundary of a boundary to zero; therefore B_k ⊆ Z_k.

Intuitively, a k-cycle can be thought of as a generalized version of a cycle in a graph - it is a sequence of k-dimensional simplices wrapped around something. If this sequence is actually the boundary of a (k+1)-dimensional chain, then its interior is full (a trivial cycle); otherwise, it surrounds a hole. The k-th homology H_k = ker ∂_k / im ∂_{k+1} = Z_k / B_k takes a "modulo" of the k-cycles by the k-boundaries, leaving only the cycles that are nontrivial.

Relative homology. Given a simplicial complex K and a subcomplex L ⊆ K, the relative homology of the pair of topological spaces (simplicial complexes in our case) can be thought of as the (reduced) homology of the quotient space K/L. Intuitively, we want to factor out L, which is expressed by the quotient operation C_k(K, L) = C_k(K) / C_k(L). The group of k-cycles becomes Z_k(K, L) = Z_k(K) / Z_k(L), which we call the group of relative cycles. We can think of the reduced homology of the quotient as the homology we would get if we represented the entire L with a single point.

The concept of homology and relative homology is best illustrated by an example. Consider a simple simplicial complex consisting of the 0-simplices {a, b, c, d, e, f} and the 1-simplices {(a, b), (a, c), (b, c), (a, d), (b, e), (c, f)}, as shown in Figure 2a. There is a "hole" of dimension 1 (surrounded by the cycle a → b → c → a), which is captured in the homology group H_1. Choosing L = {d, e, f} as a subcomplex, the quotient K/L identifies the simplices from L to a single point, as shown in Figure 2b. This results in two new "holes" in dimension 1, so the relative homology group H_1(K, L) has rank 3. This "lifting property" of relative homology (introducing new "holes" when identifying simplices) is used in our approach to detect branching points.

[Figure 2: (a) A Y-shaped simplicial complex with one cycle. (b) The quotient K/L, where the subcomplex L contains the 0-simplices {d, e, f}. Such identification introduces two new 1-dimensional "holes", captured by the relative homology group H_1(K, L).]

Persistent homology. The construction of the simplicial complex, and hence of the groups H_k, is highly sensitive to the choice of the radius r. To overcome this, persistent homology considers the entire range of scales and tracks the evolution of k-cycles as the value of r increases along the filtration. In this process, cycles are created (born) and later filled in (die). This information is most often represented by persistence diagrams: two-dimensional scatter plots dgm_k = {p_1, ..., p_m}, where each point p_i = (b_i, d_i) represents the birth and death times (radii) of the associated persistent cycle.

2.3 Significance testing of persistent cycles
The significance of topological features is often measured by the lifetimes of persistent cycles, i.e., δ_i = d_i − b_i. Although this method is intuitive, as it captures the geometric "size" of topological features, [1] uses the statistic π_i = d_i / b_i. They present a statistical test to determine for each point p_i ∈ dgm_k whether it is signal or noise, i.e., a significant structure or the result of noise and randomness in the data. They introduce a special transformation l(p_i) applied to each point from the diagram, where the values of l(p_i) follow a certain (LGumbel) distribution if the p_i are points corresponding to noisy cycles, while cycles significantly deviating from this distribution are declared as signal. The signal part of dgm_k can be recovered as dgm_k^s(α) = {p ∈ dgm_k : e^{-l(p)} < α / |dgm_k|}, given a p-value α.

Computing persistent homology for an entire filtration is often intractable, as higher values of r lead to a large number of simplices. The common practice is to set a threshold r_max and calculate dgm_k(r_max) using the simplices generated up to r_max. This often leads to cycles that are "infinite", i.e., born prior to r_max but dying after r_max. The framework also provides an algorithm to determine the infinite cycles that are already significant, and provides means to select the next r_max threshold to inspect the infinite cycles that have not yet been determined to be significant.
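Continuing the sketch above, persistence and a significance filter in the spirit of Section 2.3 might look as follows; note that the transformation l below is only a placeholder using log π - the exact form of l comes from [1] and would need to be substituted for a faithful test:

```python
import math
import dionysus as d

# Persistence of the Vietoris-Rips filtration from the previous sketch.
persistence = d.homology_persistence(filtration)
diagrams = d.init_diagrams(persistence, filtration)  # one diagram per dimension

def signal_points(dgm, alpha=0.05, l=lambda p: math.log(p.death / p.birth)):
    """Apply the rule dgm_s(alpha) = {p : exp(-l(p)) < alpha / |dgm|}.
    NOTE: l here is only log(pi) with pi = d/b; substitute the exact
    transformation from [1] for a faithful implementation."""
    n = len(dgm)
    if n == 0:
        return []
    return [p for p in dgm
            if p.birth > 0 and p.death > p.birth
            and math.exp(-l(p)) < alpha / n]

significant_cycles = signal_points(diagrams[1])  # dimension-1 cycles
```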
2.4 The mapper algorithm
Given the topological space X and a continuous function f : X → R, the mapper algorithm [6] constructs a graph G = (V, E) that captures the topological structure of X. It does so by pulling back a cover U of the space f(X) to a cover of X through f. We can view the function f and the cover U as the lens through which the input data X is examined.

Given a point cloud P and f : P → R, we first construct a set of n intervals U = {U_1, ..., U_n} covering f(P). The percentage of overlap for two consecutive intervals U_i and U_{i+1} is determined by the parameter p. For each interval U_i = (a, b), let P_{U_i} = f^{-1}(U_i) be the set of points with function values in the range (a, b). The set P_{U_i} for each U_i is further partitioned into V_i^1, ..., V_i^{k_i} by a clustering algorithm (in our case DBSCAN [5], with parameter ε setting the maximum distance between two samples for one to be considered in the neighborhood of the other) to obtain a cover of P = ∪_{i=1,...,n} {V_i^1, ..., V_i^{k_i}}. Each V_i^j ⊂ P becomes a vertex v in the mapper graph, with φ(v) = V_i^j mapping v to a subset of points. Two vertices are connected by an edge if their point sets intersect (see Figure 3). The resulting graph G = (V, E) provides a combinatorial description of the data, and the mapping φ : V → P(P) maps each node v ∈ V to a subset of points.

[Figure 3: An example of the construction of a mapper graph. (a) A 2-dimensional point cloud P with cover {V_i^j}, a function f : R^2 → R and a cover U of f(P). (b) The resulting mapper graph.]

3 METHODOLOGY
The input to our approach is a set of points P embedded in R^d and a graph G = (V, E) together with a mapping φ : V → P(P) that maps each vertex to a subset of points. Note that the method used to construct the graph is not limited to the mapper algorithm. The graph is assumed to capture the topological structure of the point cloud, i.e., branching points (vertices with a degree of at least 3) and cycles in the graph should reflect the branching and cyclic structure of the point cloud. Our approach tests whether the captured structure is significant when viewed through homology, operating directly on a subset of points from the point cloud.

3.1 Testing the cycles
A simple cycle is a finite sequence of vertices v_1 → v_2 → ... → v_n, where v_i and v_{i+1} are connected by an edge and no vertex except the endpoint repeats (v_i = v_j if and only if i, j ∈ {1, n}). Let v_1, ..., v_n be such a cycle from G. We compute the persistence diagram of the subset P' = ∪_{i=1,...,n} φ(v_i) and use the test from [1] to confirm that it contains at least one significant cycle ("hole") of dimension 1.

3.2 Testing the branching structure
Let N(v) be the set of vertices connected to v (its 1-hop neighborhood) and let v be a branching point in G (as in Figure 4). Let N'(v) = {u : u ∈ N(v), deg(u) ≥ 2} be the set of vertices from N(v) that have at least one additional neighbor. Together with v, N'(v) forms a set of internal points I_v = ∪_{u ∈ {v} ∪ N'(v)} φ(u) (shown in Figure 4 as black vertices inside the outer black line).

Let K_v = ∪_{u ∈ N'(v)} N(u) be the set of vertices whose points are used to form a complex K (vertices inside the outer black line in Figure 4), i.e., K is formed from the points ∪_{u ∈ K_v} φ(u). Now let L be the subcomplex of K containing the simplices which do not contain any of the points from I_v. Thus L contains the points of vertices exactly two edges away from v (bicolored vertices in Figure 4). We use K and L to compute relative persistent homology, identifying the simplices of L to a single point and introducing relative cycles ("holes") when K \ L has a branching structure. For a branching point v, the relative persistence diagram should contain at least deg(v) − 1 significant relative cycles.

[Figure 4: Construction of K and L for a branching point v. Vertices forming K are inside the outer black line. Vertices forming L are bicolored, indicating that some of their points are inside due to overlap between the vertices' point sets.]
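To make the construction of Section 2.4 concrete, here is a minimal sketch using the giotto-tda mapper implementation that the experiments in Section 4 rely on; the parameter values mirror Experiment 1 (Section 4.1), the point cloud is a random stand-in, and the exact API surface is an assumption:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from gtda.mapper import CubicalCover, Projection, make_mapper_pipeline

# Parameters mirroring Experiment 1: f projects onto the x-coordinate,
# n = 30 cover intervals with p = 0.5 overlap, DBSCAN with eps = 3.
pipeline = make_mapper_pipeline(
    filter_func=Projection(columns=0),
    cover=CubicalCover(n_intervals=30, overlap_frac=0.5),
    clusterer=DBSCAN(eps=3),
)

P = np.random.random((5000, 2)) * 100  # stand-in for the Y-shaped point cloud
graph = pipeline.fit_transform(P)      # an igraph.Graph
# In giotto-tda, the vertex attribute "node_elements" holds the indices of
# the points in phi(v) for each mapper vertex v.
point_sets = graph.vs["node_elements"]
```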
4 EXPERIMENTS
We perform experiments illustrating our approach on two point clouds. The graphs are constructed using the mapper algorithm from the Giotto TDA library [7] with the parameters specified for each experiment. To construct the simplicial complex and compute (relative) persistent homology, we use the Dionysus library¹. We increase the initial radius r using the algorithm from [1] until either no infinite cycles remain or all currently infinite cycles are identified as significant.

We include a figure of the graph for each experiment and mark interesting branching points and cycles. The points corresponding to a cycle are shown in red; the internal points of a branching point are also red, while the boundary points (forming L) are blue.

¹Available at: https://github.com/mrzv/dionysus.

4.1 Experiment 1: Y-shaped point cloud
The point cloud P consists of 5000 points in R^2 and resembles a Y-shape with a cycle in the centre. The graph (see Figure 5) was created with the following parameters: f is a projection on the x-coordinate, n = 30, p = 0.5 and ε = 3.

[Figure 5: Mapper graph with two branching points (B1 and B2) and one simple cycle (C1) together with their corresponding subsets of points.]

The graph contains one simple cycle, which is also significant because the subset of its points contains a homologically significant cycle. The graph also contains two branching points, B1 and B2, with degrees 4 and 3. The persistence diagram for B1 has three (significant) infinite cycles, indicating a branching structure of degree 4, while the diagram for B2 has two (significant) infinite cycles, indicating a branching structure of degree 3. In this example, it was confirmed that both the cyclic and the branching structure of the graph are reflected in the point cloud.
4.2 Experiment 2: 3D ant surface
The point cloud P consists of 6370 points in R^3 corresponding to the vertices of a 3D mesh in the form of an ant obtained from [4]. The graph (see Figure 6) was created with the following parameters: f is the distance to the tip of the ant's abdomen, n = 50, p = 0.5, and ε = 0.025.

[Figure 6: Mapper graph with three highlighted branching points (B1, B2 and B3) and two simple cycles (C1, C2) together with their corresponding subsets of points.]

We highlight three interesting branching points. Vertex B1 is a branching point of degree 3, which corresponds to the branching on the ant's head into its two antennas, and is significant. Vertex B2 is a branching point of degree 3 and one of the vertices of the cycle C2. Looking at the point cloud, no branching structure is detected, because the points of the two legs are contained in the vertex B2 itself and there are no boundary points on the legs, so they appear as a single connected blob. Our approach does not detect a branching structure even though there is one; some other strategy for selecting the boundary points would need to be used. Vertex B3 has degree 6, but only 5 neighbors are used, as one of them does not have any additional neighbor except B3. Since one of the legs has no boundary points, only 2 cycles appear, causing B3 to be recognized as a branching point of degree 3.

We also highlight 2 simple cycles. Cycle C1 wraps around the ant's hollow head and is recognized as significant. Cycle C2 wraps around the ant's two middle legs and part of its body. No significant cycles were found for C2 - the ant's legs are not close enough together to form a large cycle, and the cycle formed by the hollow legs is too small to be detected. So there is not enough support to confirm the structure found by mapper.

5 CONCLUSIONS AND FUTURE WORK
We have demonstrated how persistent (relative) homology can be used in conjunction with a statistical test to confirm the significance of the topological structure of a point cloud summarized with a graph. In the future, we will conduct extensive experiments on more complex, high-dimensional point clouds with known and unknown structure. Ideally, we could use our approach to prune the mapper graphs or guide the selection of values for its parameters. Our approach to identifying branching structures needs further work, as the current strategy of using a (modified) 2-hop neighborhood as a boundary sometimes fails. In addition, we may need a more sensitive version of the statistical test from [1], which is currently stated to hold in general but might be possible to adapt for a particular type of data.

ACKNOWLEDGEMENTS
This work was supported by the Slovenian Research Agency under the project J2-1736 Causalify and co-financed by the Republic of Slovenia and the European Union's HE program under the enRichMyData EU project, grant agreement number 101070284.

REFERENCES
[1] Omer Bobrowski and Primoz Skraba. 2023. A universal null-distribution for topological data analysis. Scientific Reports 13, 1 (July 2023), 12274. doi: 10.1038/s41598-023-37842-2.
[2] Mathieu Carrière, Bertrand Michel, and Steve Oudot. 2018. Statistical analysis and parameter selection for mapper. Journal of Machine Learning Research 19, 12, 1–39. http://jmlr.org/papers/v19/17-291.html.
[3] Nithin Chalapathi, Youjia Zhou, and Bei Wang. 2021. Adaptive covers for mapper graphs using information criteria. In 2021 IEEE International Conference on Big Data (Big Data), 3789–3800. doi: 10.1109/BigData52589.2021.9671324.
[4] Xiaobai Chen, Aleksey Golovinskiy, and Thomas Funkhouser. 2009. A benchmark for 3D mesh segmentation. ACM Transactions on Graphics 28, 3, Article 73 (July 2009), 12 pages. doi: 10.1145/1531326.1531379.
[5] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD'96. AAAI Press, Portland, Oregon, 226–231.
[6] Gurjeet Singh, Facundo Memoli, and Gunnar Carlsson. 2007. Topological methods for the analysis of high dimensional data sets and 3D object recognition. In Eurographics Symposium on Point-Based Graphics. M. Botsch, R. Pajarola, B. Chen, and M. Zwicker, editors. The Eurographics Association. isbn: 978-3-905673-51-7. doi: 10.2312/SPBG/SPBG07/091-100.
[7] Guillaume Tauzin, Umberto Lupo, Lewis Tunstall, Julian Burella Pérez, Matteo Caorsi, Anibal Medina-Mardones, Alberto Dassatti, and Kathryn Hess. 2020. Giotto-tda: a topological data analysis toolkit for machine learning and data exploration. arXiv: 2004.02551 [cs.LG].
Highlighting Embeddings' Features Relevance Attribution on Activation Maps

Jože M. Rožanec (joze.rozanec@ijs.si), Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Erik Koehorst (Erik.Koehorst@philips.com), Philips Consumer Lifestyle BV, Drachten, The Netherlands
Dunja Mladenić (dunja.mladenic@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

ABSTRACT
The increasing adoption of artificial intelligence requires a better understanding of the underlying factors affecting a particular forecast to enable responsible decision-making and provide a ground for enhancing the machine learning model. The advent of deep learning has enabled super-human classification performance and eliminated the need for tedious manual feature engineering. Furthermore, pre-trained models have democratized access to deep learning and are frequently used for feature extraction. Nevertheless, while much research is invested into creating explanations for deep learning models, less attention was devoted to how to explain the classification outcomes of a model leveraging embeddings from a pre-trained model. This research focuses on image classification and proposes a simple method to visualize which parts of the image were considered by the subset of the most relevant features for a particular forecast. Furthermore, multiple variants are provided to contrast the relevant features of a machine learning classifier and the features selected during a feature selection process. The research was performed on a real-world dataset provided by domain experts from Philips Consumer Lifestyle BV.

KEYWORDS
explainable artificial intelligence, feature importance, activation map, GradCAM, image classification, smart manufacturing, defect detection

1 INTRODUCTION
The increasing adoption of artificial intelligence has posed new challenges, including enforcing measures to protect the human person from risks inherent to artificial intelligence systems. One step in this direction is the European AI Act [12], which considers that different artificial intelligence systems must conform to a different set of requirements according to their risk level, linked to the particular domain and the potential impact on health, safety, or fundamental rights [15]. In this context, explainable artificial intelligence, a sub-field of machine learning, has gained renewed attention with the advent of modern deep learning [22], given that it researches how more transparency can be brought to opaque machine learning models. While transparency in the regulatory context is sought to enable responsible decision-making, it provides valuable insights to enhance the workings of machine learning models, too.

The field of explainable artificial intelligence can be traced back to the 1970s [18]. A key question posed by researchers is what makes a good explanation. Arrieta et al. [2] consider that a good explanation must take into account at least three elements: (a) the reasons for a given model output (e.g., features and their value ranges), (b) the context (e.g., the context on which inference is performed), and (c) how (a) and (b) are conveyed to the target audience (e.g., what information can be disclosed and the vocabulary used, among others). When considering images, explanations are frequently presented as maps that contrast particular model information on top of the original input image (e.g., saliency maps, activation maps, heat maps, or anomaly maps [13, 24]). Other approaches can be extracting and highlighting super-pixels relevant to a specific class [16], or the occlusion of background parts irrelevant to the model. Such outputs convey (a) the reasons for a given model output by highlighting the images, (b) the context on which inference is performed (by overlaying the information on top of the image used for inference), and (c) an agreed approach to convey to the user what is considered more relevant and what is not.

Multiple approaches have been developed to explain the inner workings of image classifiers. LIME (Local Interpretable Model-Agnostic Explanations) [16] approached this challenge by retrieving predicted labels for a particular class and showing the segmented superpixels that match each class.
GradCAM [19] took another approach, creating activation maps by weighting the activations at particular deep learning model layers by the average gradient. Many approaches were developed afterward, following the same rationale. For example, GradCAM++ [3], XGradCAM [9], and HiResCAM [6] work like GradCAM but consider second-order gradients, scale the gradients by the normalized activations, or element-wise multiply the activations with the gradients, respectively. Other possible approaches leverage insights resulting from image perturbation [8], or methods that acquire and display samples similar or counterfactual to the predicted instance [4, 17].

The development of information and communications technologies fostered the emergence of the Industry 4.0 paradigm as a technology framework to integrate and extend manufacturing processes [23]. In this context, the increasing adoption of artificial intelligence enables greater automation of manufacturing processes such as defect inspection [7], and urges the adoption of explainable artificial intelligence to develop users' trust in the models and foster responsible decision-making based on the insights obtained regarding the underlying machine learning model [1].

From the literature mentioned above and several surveys on this topic [5, 13, 14, 17, 20, 21], it was found that the authors did not contemplate how explanations can be provided in scenarios where feature embeddings are extracted with a deep learning model and then used to train a separate machine learning model.
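For orientation, below is a minimal sketch of producing a vanilla GradCAM activation map; it assumes the third-party pytorch-grad-cam package (not named in the paper) and a stand-in input tensor:

```python
import torch
from torchvision.models import resnet18
from pytorch_grad_cam import GradCAM  # third-party package, assumed installed

model = resnet18(pretrained=True).eval()

# Target the last block of layer 4; any of layers 1-4 can be inspected.
cam = GradCAM(model=model, target_layers=[model.layer4[-1]])

input_tensor = torch.randn(1, 3, 224, 224)  # stand-in for a normalized image
heatmap = cam(input_tensor=input_tensor)    # (1, H, W) map; by default the
                                            # highest-scoring class is explained
```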
[Figure 1: To classify an image, a feature extractor is used to create an embedding, from which certain values are extracted to create a feature vector. The machine learning model issues a prediction, which, along with the feature vector, is used to create a feature ranking. The attribution approach considers the highest-ranking features to generate an activation map.]

[Figure 2: Given an image embedding (i), we can mask it to display (ii) the features selected during the feature selection procedure (including the top-ranking classifier's features), or (iii) mask it to display only the top-ranking classifier's features.]

The present research addresses this void by proposing an unsupervised approach to generate activation maps based on the feature ranking obtained for a particular forecast. The research is performed on a real-world dataset provided by Philips Consumer Lifestyle BV and related to defect inspection.

This paper is organized as follows. First, Section 2 describes the explainability approach developed and tested in this research. Section 3 describes the experiments performed to assess different value imputation strategies, and Section 4 reports and discusses the results obtained. Finally, Section 5 concludes and describes future work.

2 HIGHLIGHTING EMBEDDINGS' FEATURES RELEVANCE ATTRIBUTION ON ACTIVATION MAPS
The increasing amount of pre-trained deep learning models makes them the default choice for feature extraction when working with machine learning models for images. Nevertheless, the disconnect between the machine learning model built on top and the deep learning model used to extract the image embedding makes it challenging to provide good explanations to the user. This research proposes an approach to bridge that gap (see Fig. 1). In particular, we leverage the fact that similar images or fragments of images result in embeddings or parts of embeddings that are close to each other. This property can be exploited when building activation maps, computing the similarity between a reference image (e.g., the image of a horse) and the image under consideration to find where such a class can be found in the image under consideration (e.g., given the image of a farm, highlight where the horses are located). If, instead of using some reference image, the image that is the input to the machine learning model is leveraged as a reference, (i) no noise is introduced due to the dissimilarity of the images, and (ii) no beforehand knowledge regarding the classes of interest is required. Therefore, a key issue must be resolved: how do both embeddings differ, and how can that difference be exploited to build an activation map?

Two options are envisioned in this research (see Fig. 2): given (i) the image embedding, two variations can be considered for value imputation: (ii) mask all the values in the embedding except for the ones corresponding to top-ranking features, or (iii) mask all the values in the embedding except for the ones corresponding to selected features and top-ranking features, using different values for each of them. By doing so, the highest similarity in the image will be found in regions related to top-ranking features or selected features. Considering selected and top-ranking features provides additional insights into what information was provided to the model and what information was considered the most important by the model. These two approaches are explored in Section 3.

[Figure 3: Sample images from the dataset provided by Philips Consumer Lifestyle BV. Three categories are distinguished: images corresponding to non-defective items (good) and images corresponding to two defect types (double-printed and with interrupted prints).]

3 EXPERIMENTS
We experimented with a real-world dataset of logos printed on shavers provided by Philips Consumer Lifestyle BV. The dataset consisted of 3518 images considered within three categories (see Fig. 3): non-defective images and images with two kinds of defects (double-printed logos and interrupted prints). To extract features from the images, the ResNet-18 model [10] was used, extracting the features before the fully connected layer. Mutual information was used to evaluate the most relevant features and select the top K, with K = √N, where N is the number of data instances in the train set, as suggested in [11].
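A minimal sketch of the feature extraction and selection step just described, assuming torchvision's ResNet-18 and scikit-learn's mutual information ranking; image preprocessing and normalization are omitted:

```python
import math
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import resnet18
from sklearn.feature_selection import mutual_info_classif

# Embeddings are the ResNet-18 activations just before the fully
# connected layer, obtained by replacing the head with an identity.
backbone = resnet18(pretrained=True)
backbone.fc = nn.Identity()
backbone.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> np.ndarray:
    """images: normalized (B, 3, H, W) batch -> (B, 512) embeddings."""
    return backbone(images).numpy()

def select_top_k(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Rank features by mutual information and keep K = sqrt(N) of them."""
    k = int(math.sqrt(len(X)))
    mi = mutual_info_classif(X, y)
    return np.argsort(mi)[::-1][:k]  # indices of the selected features
```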
The dataset was divided into train (75%) and test (25%) sets, and a random forest classifier was trained on it, achieving an AUC ROC (one-vs-rest) score of 0.9022. Three images from the test set were considered for the experiments: good, double-printed, and with an interrupted print. The images were randomly picked among the available ones for each particular class. To assess the features' relevance for a particular forecast, LIME [16] was used, considering the top 1, 3, 5, 7, and 13 ranked features.

The GradCAM images were generated for ResNet-18 layers 1-4, with an additional image considering the four layers combined. To understand where the underlying model focused, we created GradCAM activation maps contrasting the image against itself (see Fig. 4). The cosine similarity between the imputed vector and the image embedding was computed across the test samples (880 samples: 679 good, 58 double-printed, and 143 related to interrupted printing). The mean similarity and standard deviation were used to assess whether the imputation strategy increased the similarity or the contrast between the imputed vector and the image embedding.

[Figure 4: GradCAM activation maps for ResNet-18 layers 1-4 and the four layers combined.]

The GradCAM images were generated by computing the cosine similarity between the image embedding and the feature vector generated considering the three strategies described in Table 1. A sample of the resulting activation maps was visually assessed; the results are reported in Section 4. The experiments were designed to understand which imputation strategy works best. A detailed analysis of how the number of top-ranked features affects the activation maps was omitted for the brevity of the paper.

Strategy | Top-ranked feature | Selected on feature selection | Irrelevant
TOZ      | True value         | One                           | Zero
TZZ      | True value         | Zero                          | Zero
TRR      | True value         | Random                        | Random

Table 1: Value imputation strategies considering the image embedding, the features selected during the feature selection process, and the classifier's top-ranked features.

4 RESULTS

Imputation strategy | Image class       | Layer 1   | Layer 2   | Layer 3   | Layer 4
TOZ                 | Good              | 0.27±0.01 | 0.27±0.01 | 0.27±0.01 | 0.27±0.01
TOZ                 | Double-printed    | 0.31±0.02 | 0.31±0.02 | 0.31±0.02 | 0.31±0.02
TOZ                 | Interrupted print | 0.27±0.01 | 0.27±0.01 | 0.27±0.01 | 0.27±0.01
TZZ                 | Good              | 0.21±0.04 | 0.21±0.04 | 0.21±0.04 | 0.21±0.04
TZZ                 | Double-printed    | 0.24±0.03 | 0.24±0.03 | 0.24±0.03 | 0.24±0.03
TZZ                 | Interrupted print | 0.22±0.04 | 0.22±0.04 | 0.22±0.04 | 0.22±0.04
TRR                 | Good              | 0.46±0.02 | 0.46±0.02 | 0.46±0.02 | 0.46±0.02
TRR                 | Double-printed    | 0.48±0.03 | 0.48±0.03 | 0.48±0.03 | 0.48±0.03
TRR                 | Interrupted print | 0.46±0.02 | 0.46±0.02 | 0.46±0.02 | 0.46±0.02

Table 2: Cosine similarity (mean ± standard deviation) between the vector created with each imputation strategy (top 13 features) and the image embedding, per ResNet-18 layer.

As described in Table 1, three imputation strategies were considered. The cosine similarity computed between the vector created with each imputation strategy and the embedding (considering the top 13 features) is reported in Table 2. A higher similarity between the imputed vector and the image embedding means that a wider area of the activation map will be highlighted, blurring the relevant information where the top features point to in the image. The least informative imputation strategy was TRR, which consistently showed high cosine similarity across layers for all defect types. On the other hand, TZZ achieved the best results regardless of the defect and layer considered. Imputing selected features with one (TOZ) had a detrimental effect, given that it increased the similarity between the imputed vector and the embedding; nevertheless, the similarity was usually between 0.10 and 0.20 points below that reported with the TRR imputation strategy.
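A minimal sketch of the three imputation strategies of Table 1 and the cosine similarity used to compare the imputed vector against the embedding; the index arrays and the random generator are illustrative assumptions:

```python
import numpy as np

def impute(embedding: np.ndarray, top_idx: np.ndarray, selected_idx: np.ndarray,
           strategy: str, rng=np.random.default_rng(0)) -> np.ndarray:
    """Build the contrast vector for the strategies of Table 1. Top-ranked
    features always keep their true value; the other positions get ones,
    zeros, or random values depending on the strategy."""
    if strategy == "TZZ":        # selected and irrelevant features -> zero
        v = np.zeros_like(embedding)
    elif strategy == "TOZ":      # selected -> one, irrelevant -> zero
        v = np.zeros_like(embedding)
        v[selected_idx] = 1.0
    elif strategy == "TRR":      # selected and irrelevant features -> random
        v = rng.random(embedding.shape)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    v[top_idx] = embedding[top_idx]  # true values for top-ranked features
    return v

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```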
For visual assessment, activation maps for the different imputation strategies obtained for the top 13 features are displayed in Fig. 5. When comparing the TZZ and TRR strategies, we found that for layer one, TZZ for the double-printed image focused on the top contour of the characters, and for the interrupted print it highlighted regions of relevance. In contrast, TRR did not highlight any region for the double-printed image and highlighted fewer regions for the interrupted print when compared to TZZ. For layer two, TZZ for the image of the non-defective product displayed some artifacts but also included areas covering the characters' contours. Furthermore, for the double-printed and interrupted print images, it covered relevant regions. TRR, on the other hand, highlighted different regions, which, for the good and double-printed images, were mostly irrelevant. For layer three, TZZ highlighted mostly irrelevant areas for the image of the non-defective product, except for the character "S". For the double-printed image, the beginning and end of the words are highlighted, while for the interrupted prints, the highlighted areas covered places where defects were observed. TRR, on the other hand, covered two-thirds of the good image, and for the double-printed image it highlighted most of the areas highlighted with the TZZ strategy. Nevertheless, for the interrupted print, most focus was placed on the lower part of the "P" character, while two artifacts were also encountered. Finally, for the fourth layer, TZZ mostly focused on the upper word (Philips), while TRR's focus was mostly on the lower part of the image, still covering some relevant areas.

When comparing the TZZ and TOZ approaches, we found that for layer one, TOZ results in less strongly highlighted regions: most of the highlighted regions present in TZZ vanished, and just in the good image a few spots appeared that were not present in the TZZ activation map. For layer two, the original regions are highlighted, but new regions were included, mostly covering areas of interest. The highlighted areas for the double-printed image related to the TZZ and TOZ activation maps were consistent for layer three. Nevertheless, TOZ highlighted different regions for the good and interrupted print images; the regions highlighted for the interrupted print image were irrelevant to defect detection. When considering the last layer, the highlighted areas were mostly the same for TZZ and TOZ. Nevertheless, an additional region was introduced in the good and interrupted print images, covering the lower text.

From the visual assessment described above, we conclude that the activation maps obtained with the TZZ imputation method lead to the best explanations.

[Figure 5: GradCAM activation maps for ResNet-18 layers 1-4 considering only the top 13 features for this particular forecast and three imputation strategies (TOZ, TZZ, and TRR) for three image types: good (G), double-printed (D), and interrupted prints (I).]
5 CONCLUSIONS
This work has researched how information regarding feature importance when using image embeddings can be used and propagated back to generate activation maps, highlighting regions of the image considered relevant to a particular forecast. The proposed approach was evaluated on images from a real-world industrial use case. The similarity metrics and the visual evaluation show that the best value imputation strategy is TZZ, which assigns the actual embedding value to relevant features and masks the rest of the embedding with zeroes. Nevertheless, it must be emphasized that a broader set of experiments must be considered to generalize these conclusions. While this research only considered local explanations, the feature relevance could be considered at a global level and the same approach leveraged to visualize its influence on a particular image. Future work will focus on a more comprehensive evaluation of the proposed methodology to understand how it performs, how the number of selected features influences the activation maps, and possible shortcomings.

ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency and the European Union's Horizon 2020 program project STAR under grant agreement number H2020-956573.
REFERENCES
[1] Imran Ahmed, Gwanggil Jeon, and Francesco Piccialli. 2022. From artificial intelligence to explainable artificial intelligence in Industry 4.0: a survey on what, how, and where. IEEE Transactions on Industrial Informatics 18, 8 (2022), 5031–5042.
[2] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al. 2020. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion 58 (2020), 82–115.
[3] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. 2018. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 839–847.
[4] Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan K Su. 2019. This looks like that: deep learning for interpretable image recognition. Advances in Neural Information Processing Systems 32 (2019).
[5] Arun Das and Paul Rad. 2020. Opportunities and challenges in explainable artificial intelligence (XAI): A survey. arXiv preprint arXiv:2006.11371 (2020).
[6] Rachel Lea Draelos and Lawrence Carin. 2020. Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks. arXiv preprint arXiv:2011.08891 (2020).
[7] Gautam Dutta, Ravinder Kumar, Rahul Sindhwani, and Rajesh Kr Singh. 2021. Digitalization priorities of quality control processes for SMEs: A conceptual study in perspective of Industry 4.0 adoption. Journal of Intelligent Manufacturing 32, 6 (2021), 1679–1698.
[8] Ruth C Fong and Andrea Vedaldi. 2017. Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision. 3429–3437.
[9] Ruigang Fu, Qingyong Hu, Xiaohu Dong, Yulan Guo, Yinghui Gao, and Biao Li. 2020. Axiom-based Grad-CAM: Towards accurate visualization and explanation of CNNs. arXiv preprint arXiv:2008.02312 (2020).
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[11] Jianping Hua, Zixiang Xiong, James Lowey, Edward Suh, and Edward R Dougherty. 2005. Optimal number of features as a function of sample size for various classification rules. Bioinformatics 21, 8 (2005), 1509–1515.
[12] Tambiama André Madiega. 2021. Artificial intelligence act. European Parliament: European Parliamentary Research Service (2021).
[13] Dang Minh, H Xiang Wang, Y Fen Li, and Tan N Nguyen. 2022. Explainable artificial intelligence: a comprehensive review. Artificial Intelligence Review (2022), 1–66.
[14] Sajid Nazir, Diane M Dickson, and Muhammad Usman Akram. 2023. Survey of explainable artificial intelligence techniques for biomedical imaging with deep neural networks. Computers in Biology and Medicine (2023), 106668.
[15] Cecilia Panigutti, Ronan Hamon, Isabelle Hupont, David Fernandez Llorca, Delia Fano Yela, Henrik Junklewitz, Salvatore Scalzo, Gabriele Mazzini, Ignacio Sanchez, Josep Soler Garrido, et al. 2023. The role of explainable AI in the context of the AI Act. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency. 1139–1150.
[16] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1135–1144.
[17] Gesina Schwalbe and Bettina Finzel. 2023. A comprehensive taxonomy for explainable artificial intelligence: a systematic survey of surveys on methods and concepts. Data Mining and Knowledge Discovery (2023), 1–59.
[18] A Carlisle Scott, William J Clancey, Randall Davis, and Edward H Shortliffe. 1977. Explanation capabilities of production-based consultation systems. Technical Report. Stanford University, Department of Computer Science.
[19] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision. 618–626.
[20] Bas HM Van der Velden, Hugo J Kuijf, Kenneth GA Gilhuijs, and Max A Viergever. 2022. Explainable artificial intelligence (XAI) in deep learning-based medical image analysis. Medical Image Analysis 79 (2022), 102470.
[21] Giulia Vilone and Luca Longo. 2021. Notions of explainability and evaluation approaches for explainable artificial intelligence. Information Fusion 76 (2021), 89–106.
[22] Feiyu Xu, Hans Uszkoreit, Yangzhou Du, Wei Fan, Dongyan Zhao, and Jun Zhu. 2019. Explainable AI: A brief survey on history, research areas, approaches and challenges. In CCF International Conference on Natural Language Processing and Chinese Computing. Springer, 563–574.
[23] Li Da Xu, Eric L Xu, and Ling Li. 2018. Industry 4.0: state of the art and future trends. International Journal of Production Research 56, 8 (2018), 2941–2962.
[24] Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj. 2021. DRAEM - a discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 8330–8339.
An approach to creating a time-series dataset for news propagation: Ukraine-war case study

Abdul Sittar (Jožef Stefan Institute and Jožef Stefan Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, abdul.sittar@ijs.si) and Dunja Mladenić (Jožef Stefan Institute and Jožef Stefan Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, dunja.mladenic@ijs.si)

ABSTRACT
An efficient technique to comprehend news spreading can be achieved through the automation of machine learning algorithms. These algorithms perform the prediction and forecasting of news dissemination across geographical barriers. Although news regarding any event is generally recorded as a time-series due to its time stamps, it cannot be seen whether or not the news time-series is propagating across geographical barriers. In this article, we explore an approach for generating time-series datasets for news dissemination that relies on Chat-GPT and sentence-transformers. The lack of comprehensive, publicly accessible event-centric news databases for use in time-series forecasting and prediction is another limitation. To get over this bottleneck, we collected a news dataset covering 1 year and 3 months of news related to the Ukraine war using Event Registry. We also conduct a statistical analysis of different time-series (propagating, unsure, and not-propagating) of different lengths (2, 3, 4, 5, and 10) to document the prevalence of geographical barriers. The dataset is publicly available on Zenodo.

KEYWORDS
news propagation, time-series dataset, geographical barriers, Ukraine-war

1 INTRODUCTION
The process of information traveling from a sender to a set of receivers via a carrier is commonly referred to as propagation [3]. News propagates over time through different publishers reporting about an event. This implicitly raises a few thoughts, such as: 1) there will be some news articles propagating similar information over time; 2) some news articles will be of a unique category that eventually will not propagate, or will propagate across geographical barriers through only a few publishers.

News streaming is classified into events, where a relevant set of news is clustered and represented as an event [8, 9]. An event has a starting and ending time, calculated from the publication times of the first and last news articles. Hence, an event consists of a set of news articles, and these news articles follow a certain pattern based on hidden properties, including cultural, economical, political, linguistic, and geographical ones [17].

Moreover, news spreading comes across many barriers due to different reasons, including cultural, economic, political, linguistic, or geographical ones, and these reasons depend upon the type of news, such as sports, health, science, etc. [18]. For instance, it is more likely that news relating to the FIFA World Cup crosses cultural barriers, since it involves multiple cultures. Similarly, news relating to the Sri Lankan economic crisis and the Ukraine-war probably comes across economic and geographical barriers, since these events involve multiple stances from the international community; Eid celebrations and Christmas are likely to come across religious barriers; US elections are likely to come across political barriers [17].

The identification of news spreading patterns while crossing barriers can be useful in the context of numerous real-world applications, such as trend detection and content recommendations for readers and subscribers. To perform the classification of news published across barriers (geographical, cultural, economic, etc.) and, in that attempt, to recommend and identify trends of news spreading belonging to different categories, some methodological considerations are necessary.

In this paper, we introduce an approach to creating a time-series dataset for news propagation. While previous work has focused on creating events from collections of news articles [9, 16], we focus on creating propagation time-series. We take the Ukraine-war as an example for propagation analysis across geographical barriers.

Following are the main scientific contributions of this paper:
(1) We present an approach to creating a time-series dataset for news propagation.
(2) We provide a dataset for forecasting and predicting news propagation, labeled with the assistance of Chat-GPT and sentence transformers.

The remainder of the paper is structured as follows. Section 2 describes the related work on barriers to news spreading, time-series datasets for news propagation, and topic modeling. Section 3 presents the proposed approach. We discuss the dataset construction and annotation guidelines in Section 4. The evaluation details and statistical analysis are explained in Section 5, while Section 6 concludes the paper and outlines areas of future work.

2 RELATED WORK
In this section, we review the related literature about geographical barriers to news spreading, time-series datasets for news propagation, and topic modeling.

2.1 Geographical barrier
Sittar reported that the geographical size of a news publisher's country is directly proportional to the number of publishers and articles reporting on the same information [17]. It is also reported that, based on some factors, the media targets specific foreign and regional events. For example, the spreading of news related to specific events may tilt toward developed countries such as the United Kingdom, the U.S.A., or Russia. Also, in the past, geographical representation of entities and events has been extensively utilized to detect local, global, and critical events [10, 20, 19, 2]. It has been said that countries within close distance share culture and language up to a certain extent, which can further reveal interesting facts about shared tendencies in information spreading [12, 11]. Over the years, scholars have studied the relationship between the news prominence of a country and its physical, economic, political, social, and cultural characteristics [11]. Communication scholars have long been interested in identifying the key determinants of what makes foreign countries newsworthy and why some countries are considered more newsworthy than others [5]. Given the difficulty of gathering longitudinal data, relatively little news flow research has systematically examined whether and to what extent foreign nation visibility and the factors that influence it have changed over time. Specifically, scholarship has typically only addressed why some countries get more news coverage than others at a specific point in time, not how and why the focus shifts over time from one country to another [5]. In this context, we propose an approach to collecting data to analyze news spreading across geographical barriers.
2.2 Time-series datasets
News propagation can be represented in the form of a time-series [17]. The properties of cascading time-series can tell us the relationship between the time and size of cascading. It further answers which events last over a longer period with large communities across different languages. A time-series dataset can be used to understand evolving discussions over time. Different studies have utilized time-series datasets: [1] investigates how different discussions evolved over time, together with a spatial analysis of tweets related to COVID-19. [14] identifies how discussions evolved over time in top newspapers belonging to three different continents (Europe, Asia, and North America) and nine different countries (UK, India, Ireland, Canada, the U.S.A., Japan, Indonesia, Turkey, and Pakistan), using spatio-temporal topic modeling and sentiment analysis. Different classification or mining tasks have been proposed using time-series datasets. [6] proposed the task of predicting stock market values such as price or volatility based on the news content or derived text features. Similarly, to forecast the values, a set of final classes is defined in advance, such as up, meaning an increase in price, down, meaning a decrease in price, and balanced, meaning no change in price. The same technique has been applied to predict price trends (incline, decline, or flat) immediately after press release publications. Also, news articles are categorized as inclines if the stock price relevant to the given article has increased with a peak of at least three points from its original value at the publication time [13].

2.3 Topic modeling
Generally, to find the most important topics inside an event, multiple solutions have been proposed, including pooling-based LDA and BERTopic. Unlike simple static topic modeling, pooling-based techniques assume that the data is partitioned on a time basis, e.g., hourly or daily. Pooling-based techniques are mostly applied to social media, where documents or tweets are partitioned based on hashtags and authors. BERTopic leverages transformers and TF-IDF to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions. The result is a list of topics ranked according to their importance. Topic modeling techniques are performing surprisingly well. The relation of such topics to their hidden characteristics, such as cultural, economical, and political ones, has been analyzed in many studies, because understanding these dynamics can help governments disseminate information effectively [4, 17, 14, 15]. This has changed rapidly in recent years with the emergence of social media, which provides online platforms for people worldwide to share their thoughts, activities, and emotions and to build social relationships [7].
Figure 1: An overview of the proposed approach. To create the propagation time-series, it calculates the semantic similarity across news utilizing sentence transformers, and to evaluate the labeling process of the news, it utilizes a summary of the news articles generated by Chat-GPT.

3 APPROACH
This research article presents an approach to creating a time-series dataset for news propagation across geographical barriers, as shown in Figure 1. In the first step, we call an API that extracts the news articles belonging to the Ukraine-war from the Event Registry. In the second step, we extract meta-data related to news publishers by searching for the news publishers on Google and extracting their Wikipedia links. Using these links, we obtain the necessary information from the Wikipedia-Infobox [17]. We use the Bright Data service to crawl and parse Wikipedia-Infoboxes. In the third step, we perform the summarization of news articles. In the last step, we create a propagation time-series and perform labeling of the time-series. To calculate the semantic similarity, we utilize monolingual sentence transformers. Since the propagation of information can be captured in the form of time-series, we create time-series of different lengths, such as 2, 3, 4, 5, and 10. To evaluate the labeling process, we manually compare the summaries generated by Chat-GPT (see Section 5).
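The first step above can be sketched with the Event Registry Python SDK. The API key, keyword query, and result fields below are illustrative assumptions rather than the authors' exact configuration:

```python
from eventregistry import EventRegistry, QueryArticlesIter

er = EventRegistry(apiKey="YOUR_API_KEY")  # hypothetical key
query = QueryArticlesIter(
    keywords="Ukraine war",   # illustrative query
    dateStart="2022-01-01",
    dateEnd="2023-03-31",
)
articles = []
for art in query.execQuery(er, maxItems=1000):
    # Keep the attributes used in Section 4 (exact field names may differ).
    articles.append({
        "title": art.get("title"),
        "body": art.get("body"),
        "publisher": art.get("source", {}).get("title"),
        "datetime": art.get("dateTime"),
    })
```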
4 DATASET CONSTRUCTION
We collected news articles reporting on the Ukraine-war. Since Russia invaded Ukraine on February 24, 2022, in an escalation of the Russo-Ukrainian War, we fetched news articles that were published between January 2022 and March 2023. The dataset consists of 61,261 news articles. Each news article consists of a few attributes: title, body text, name of the news publisher, and date and time of publication.

4.1 Semantic similarity
We calculate the cosine similarity between the dense vectors generated by sentence transformers. Sentence Transformers is a Python framework for state-of-the-art sentence, text, and image embeddings. Cosine similarity varies between zero and one; zero means no similarity, and one means maximum similarity, i.e., a duplicate article.
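As a concrete illustration of this step, the pairwise similarity can be computed with the sentence-transformers framework; the specific checkpoint below is an assumption, since the paper does not name the exact monolingual model used:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
article_a = "The Polish FA refused to play its World Cup qualifier against Russia."
article_b = "Poland's national team will not play the play-off match against Russia."

emb_a, emb_b = model.encode([article_a, article_b], convert_to_tensor=True)
similarity = util.cos_sim(emb_a, emb_b).item()  # close to 1.0 for near-duplicates
print(f"cosine similarity: {similarity:.3f}")
```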
4.2 Chat-GPT Summarizing
Since the manual evaluation of propagation time-series is difficult because of the length of the news articles, we utilized Chat-GPT to get the tags, categories, and a summary representing the whole article. Summarizing a text is one of the many tasks ChatGPT is extremely good at: we can give it a piece of content and ask for a summary, and by customizing our prompts, we can get ChatGPT to create much more than a plain summary. We have used the OpenAI API with the Python library. We used the following prompt to fetch the summary of the text, categories, and tags: "Please summarize the text and suggest relevant categories and tags for the following content: articleText". Here, articleText is a variable representing the text of a news article.
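A minimal sketch of this summarization call, using the chat-completion interface of the openai Python library as it existed in 2023; the model name and the absence of retry or error handling are simplifying assumptions:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # hypothetical key

def summarize(article_text: str) -> str:
    # The prompt mirrors the one quoted above; the model choice is an assumption.
    prompt = ("Please summarize the text and suggest relevant categories "
              f"and tags for the following content: {article_text}")
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]
```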
4.3 Annotations of time-series
We created three types of time-series recursively and annotated them based on a threshold of semantic similarity, as shown in Algorithm ??. The thresholds for deciding the type of a propagation time-series were set by manually analyzing the similarity and summaries of news articles. We set three thresholds for the three types of labels (propagating, unsure, and not-propagating): time-series with a similarity greater than or equal to 0.7 were labeled "Propagating", time-series with a similarity greater than or equal to 0.5 were labeled "Unsure", and time-series with a similarity of less than 0.5 were labeled "Not-Propagating". This criterion is applied directly for the minimum length of a time-series (2). For time-series longer than 2, we count the number of pairs with each label, and the time-series is labeled with the label that has the highest count. If two labels have the same highest count, we give priority to the "Propagating" label over "Unsure" and to "Unsure" over "Not-Propagating". The algorithm takes five parameters: the start and end of the data-frames, a copy of the data-frames, the length of the time-series, and an array. The statistics about the propagation time-series are presented in Figure 2.

To annotate the propagation time-series across geographical barriers, we consider the label "Propagating" for a pair of news articles if the pair is published from two different countries; otherwise, we label it "Not-Propagating". We repeat this process for all lengths of news articles. The statistics after applying this guideline are presented in Figure 3.
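The labeling rule described above can be restated compactly in Python. This is a hedged reconstruction, not the authors' algorithm; the function names are hypothetical:

```python
from collections import Counter

PRIORITY = ["Propagating", "Unsure", "Not-Propagating"]

def label_pair(similarity: float) -> str:
    if similarity >= 0.7:
        return "Propagating"
    if similarity >= 0.5:
        return "Unsure"
    return "Not-Propagating"

def label_series(pair_similarities: list) -> str:
    counts = Counter(label_pair(s) for s in pair_similarities)
    # Highest count wins; ties are broken by the priority order above.
    return max(PRIORITY, key=lambda lab: (counts[lab], -PRIORITY.index(lab)))

print(label_series([0.82]))              # length-2 series -> "Propagating"
print(label_series([0.82, 0.66, 0.45]))  # majority vote with tie-breaking
```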
5 STATISTICAL ANALYSIS AND EVALUATION
The statistics about the propagation time-series without taking geographical barriers into account are presented in Figure 2. The number of time-series with the label "Propagating" is higher than the number with the "Unsure" and "Not-Propagating" labels when the length of the time-series is 3 or 5, whereas in the other three cases (2, 4, and 10) the number of time-series is equal for all three labels. The statistics of the propagation time-series generated after taking the geographical location of the news publisher into account are presented in Figure 3. The number of propagation time-series with the "Propagating" and "Unsure" labels is reduced to almost 40%, whereas the number of propagation time-series with the "Not-Propagating" label increased significantly.

For the evaluation of the dataset, we manually checked the summary, including the categories and tags of articles, for a specific label. We randomly selected 50 time-series of different lengths for all three types of labels. According to the manual evaluation, the propagation time-series with the "Propagating" label followed almost one or two themes of discussion across all the news articles in a chain. For instance, the following topics appeared in a propagation time-series of length 5: 1) "The United States will be sanctioning Russian President Vladimir Putin"; 2) "the national team of the Polish FA will not play against Russia"; 3) "the Polish Football Association will not play its World Cup qualifying match against Russia"; 4) "the Polish Football Association has refused to play a World Cup play-off against Russia"; and 5) "the Polish national team does not intend to play a play-off match against Russia". On the contrary, propagation time-series with the "Not-Propagating" label always discussed different points of view about the Ukraine-war. For example, the following topics appeared in a propagation time-series of length 5: 1) "a resolution passed against Russia in the United Nations"; 2) "Canadian president urges to impose sanctions against Russia"; 3) "the UN Security Council has voted on a US-led draft resolution"; 4) "President Trump is inviting Russian President Vladimir Putin to come to Washington"; and 5) "India abstained from the vote on the draft resolution". In the case of propagation time-series with the "Unsure" label, there were three or four sub-topics discussing the Ukraine-war.

The evaluation results show that as the window size increased to capture the information propagation, the noise of overlapping topics also increased. Similarly, the overlapping window presented sub-topics that overlapped at the time of publication.

Figure 2: The bar chart shows the statistics about the propagation time-series of different lengths (2, 3, 4, 5, 10) that have been labelled as "Propagating", "Unsure", and "Not-Propagating". The x-axis shows the length of the time-series; the y-axis shows the count of the propagation time-series.

Figure 3: The bar chart shows the statistics about the propagation time-series after applying the condition of the location of a news publisher. Each bar presents the three types of propagation time-series labelled as "Propagating", "Unsure", and "Not-Propagating". The x-axis shows the length of the time-series; the y-axis shows the count of the propagation time-series.

6 CONCLUSIONS AND FUTURE WORK
In this paper, we have presented an approach to creating a time-series dataset. The goal of this work was to investigate the length of the propagation time-series for news propagation. In the future, we plan to utilize the same approach for different events. Moreover, currently only geographical barriers have been analyzed; in the future, we would like to extend the barriers to political, economic, and cultural barriers and find patterns of news propagation. We would also like to perform prediction and forecasting on the labeled time-series dataset, and to experiment with classical time-series classification methods, deep learning, transformer-based methods, and large language models (LLMs).

ACKNOWLEDGMENTS
The research described in this paper was supported by the Slovenian research agency under the project J2-1736 Causalify and by the EU's Horizon Europe Framework under grant agreement number 101095095.

REFERENCES
[1] Iyad AlAgha. 2021. Topic modeling and sentiment analysis of twitter discussions on covid-19 from spatial and temporal perspectives. Journal of Information Science Theory and Practice, 9, 1, 35–53.
[2] Simon Andrews, Helen Gibson, Konstantinos Domdouzis, and Babak Akhgar. 2016. Creating corroborated crisis reports from social media data through formal concept analysis. Journal of Intelligent Information Systems, 47, 2, 287–312.
[3] Firdaniza Firdaniza, Budi Nurani Ruchjana, Diah Chaerani, and Jaziar Radianti. 2021. Information diffusion model in twitter: a systematic literature review. Information, 13, 1, 13.
[4] Guoyin Jiang, Saipeng Li, and Minglei Li. 2020. Dynamic rumor spreading of public opinion reversal on weibo based on a two-stage spnr model. Physica A: Statistical Mechanics and its Applications, 558, 125005.
[5] Timothy M Jones, Peter Van Aelst, and Rens Vliegenthart. 2013. Foreign nation visibility in us news coverage: a longitudinal analysis (1950-2006). Communication Research, 40, 3, 417–436.
[6] Abdullah S Karaman and Tayfur Altiok. 2004. An experimental study on forecasting using tes processes. In Proceedings of the 2004 Winter Simulation Conference. Vol. 1. IEEE.
[7] Sanjay Kumar, Muskan Saini, Muskan Goel, and BS Panda. 2021. Modeling information diffusion in online social networks using a modified forest-fire model. Journal of Intelligent Information Systems, 56, 2, 355–377.
[8] Haewoon Kwak and Jisun An. 2016. Two tales of the world: comparison of widely used world news datasets gdelt and eventregistry. In Proceedings of the International AAAI Conference on Web and Social Media. Vol. 10, 619–622.
[9] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event registry: learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web, 107–110.
[10] Mauricio Quezada, Vanessa Peña-Araya, and Barbara Poblete. 2015. Location-aware model for news events in social media. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 935–938.
[11] Elad Segev. 2015. Visible and invisible countries: news flow theory revised. Journalism, 16, 3, 412–428.
[12] Elad Segev and Thomas Hills. 2014. When news and memory come apart: a cross-national comparison of countries' mentions. International Communication Gazette, 76, 1, 67–85.
[13] Sadi Evren Seker, Mert Cihan, Khaled Al-Naami, Nuri Ozalp, and Ugur Ayan. 2013. Time series analysis on stock market for text mining correlation of economy news. International Journal of Social Sciences and Humanity Studies, 6, 1, 69–91.
[14] Abdul Sittar, Daniela Major, Caio Mello, Dunja Mladenić, and Marko Grobelnik. 2022. Political and economic patterns in covid-19 news: from lockdown to vaccination. IEEE Access, 10, 40036–40050.
[15] Abdul Sittar and Dunja Mladenic. 2021. How are the economic conditions and political alignment of a newspaper reflected in the events they report on? In Central European Conference on Information and Intelligent Systems. Faculty of Organization and Informatics Varazdin, 201–208.
[16] Abdul Sittar, Dunja Mladenic, and Tomaž Erjavec. 2020. A dataset for information spreading over the news. In Proceedings of the 23rd International Multiconference Information Society SiKDD. Vol. 100, 5–8.
[17] Abdul Sittar, Dunja Mladenić, and Marko Grobelnik. 2022. Analysis of information cascading and propagation barriers across distinctive news events. Journal of Intelligent Information Systems, 58, 1, 119–152.
[18] Abdul Sittar, Dunja Mladenić, and Marko Grobelnik. [n. d.] Profiling the barriers to the spreading of news using news headlines. Frontiers in Artificial Intelligence, 6, 1225213.
[19] Kazufumi Watanabe, Masanao Ochi, Makoto Okabe, and Rikio Onai. 2011. Jasmine: a real-time local-event detection system based on geolocation information propagated to microblogs. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, 2541–2544.
[20] Hong Wei, Jagan Sankaranarayanan, and Hanan Samet. 2020. Enhancing local live tweet stream to detect news. GeoInformatica, 1–31.

PREDICTING HORSE FEARFULNESS APPLYING SUPERVISED MACHINE LEARNING METHODS

Oleksandra Topal (Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, oleksandra.topal@ijs.si), Inna Novalija (Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, inna.koval@ijs.si), Elena Gobbo (Biotechnical Faculty, University of Ljubljana, Jamnikarjeva 101, Ljubljana, Slovenia, elena.gobbo@bf.uni-lj.si), Manja Zupan Šemrov (Biotechnical Faculty, University of Ljubljana, Jamnikarjeva 101, Ljubljana, Slovenia, manja.zupansemrov@bf.uni-lj.si), Dunja Mladenić (Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, dunja.mladenic@ijs.si)

ABSTRACT
In this article, we present the first results of a study on the personality traits of Lipizzan horses, focusing on their fearfulness. Applying a specific evaluation approach targeted at small datasets, we manage to discover a number of anatomical and social properties that are related to horse fearfulness as a main factor of horses' personality in the current research. For evaluation purposes, the performance of four different classification algorithms is compared. Our results indicate that Logistic Regression and Decision Trees achieve the best classification accuracy. Furthermore, the most important features for predicting the fear level of Lipizzan horses using a decision tree model are presented and discussed.

KEYWORDS
Machine learning, classification problem, personality traits, Lipizzan horses.

1. INTRODUCTION
In the modern world, artificial intelligence provides powerful tools for solving many issues in various fields of research. Problems involving clustering, regression, and classification are the most commonly addressed problems in different types of biological studies. One topical area of biological research where we can use artificial intelligence algorithms is the study of animal personality.

In our work, we study the personality traits of horses of the Lipizzan breed. Personality assessment can be used to select suitable training and weaning methods, choose or breed horses for police or therapeutic work, investigate underlying reasons for the development of behavioral problems, or assess how an unknown horse might react to a new or aversive situation or stimulus. According to a research study on animal behavior [1], it is possible to improve performance and horse welfare by identifying the right match between the horse's temperament, its rider's personality, housing conditions, and management, and by choosing the appropriate activity for an individual horse.

A number of experiments demonstrate that anatomical features may be associated with personality traits and behaviour in animals, mainly due to the domestication and selection process that affected animals' morphology and personality. We can find a confirmation of this in Belyaev's domestication and selection experiment on foxes [2]; there is also research on a number of species such as pigs and cattle [3], dogs [4], and horses [5]. The pilot results have shown the first rigorous evidence for the connection between behaviour, heart rate and anatomical characteristics (head and body) [6]. We therefore assume that various properties, such as anatomical and biomechanical as well as social environmental measurements, give us valuable objective insights to predict personality traits of Lipizzan horses, with an emphasis on fearfulness. We believe that this improved knowledge will help us understand the horse-human relationship and the complexity of animal personality in general and in relation to humans, as humans and horses share many emotional processes [7].

The main contribution of this research is the assessment of the importance of different properties for predicting the fearfulness of a horse, as indicated by different traditional machine learning algorithms.

2. RELATED WORK
A number of animal studies researchers have tackled the topic of animal personality. Animal personality can be defined as temporally stable inter-individual patterns of affect, cognition, and behavior [8]. Gobbo and Zupan [9], in their study on dogs, state that the analysis of animal personality traits is closely linked to safe human-animal interaction and the animal's everyday behavior. Moreover, Buckley et al.
[10] reported that the personality of a horse should be considered an important attribute and a key issue in horse health and performance. The most important personality trait in relation to the human-horse relationship is suggested to be fearfulness [11].

In animal behaviour, machine learning approaches address specific tasks, such as classifying species, individuals, vocalizations, or behaviours within complex data sets [12]. Machine learning has been used for clustering observations into groups [13] and for the classification of animal-related data [14].

In our work, we apply data mining and machine learning to a Lipizzan horse dataset with broad anatomic, social, and biomechanical characteristics. In addition, the dataset used in the current research contains a small number of data points and requires evaluation techniques for small datasets. Similarly to other related work, we apply traditional machine learning classification methods for assessing a horse's personality and understanding which horse properties are the most important when predicting the fearfulness of a horse. Specifically, we investigate how the feature selection method can influence the classification results for fear level prediction in horses.

3. PROBLEM DEFINITION

3.1 Data sources
For our study, we use a unique dataset that we have created, which contains anatomical measurements, biomechanics characteristics, housing conditions, and fear scores of Lipizzan horses. Based on our experience as experts in animal studies, we have collected and organized the data in four parts. The first part contains age, gender, and front, left, and right anatomical measurements of the horse head (FH) and body (FB) (both sides need to be measured, because they are not identical [15, 16]). The second part contains the results of a study on the biomechanics of the Lipizzan horses. Biomechanical data were collected twice for two types of horse gaits, walking and trotting, so the table contains some redundant data. We have converted the table so that the trot and walk data are separated by traits for each horse and can be used for modeling.
The third part lists the conditions of keeping the horses, such as the availability of pastures, the openness of stalls, and the number of stalls, as well as equestrian activities, training, and work of the horses. The fourth part contains the results of a fear test battery performed on each horse.

In our study, the explorative hypothesis is that the anatomical, biomechanical, and social properties of a horse may act as good indicators of fearfulness. We have many features describing different parameters of the horses on the one side, and a horse fearfulness score on the other side, so we can use supervised machine learning methods to predict a horse's fearfulness level.

3.2 Labeling data for the classification task
To label our dataset, we had to transform a very complex fear rating table. During the experiment, two repetitions of each of the four fear tests were carried out for each individual horse. We compared the sum of the four scores of the first repetition (one score per individual fear test and horse) with the sum of the four fear scores of the second repetition, and it turned out that the horses habituated to the stimuli between the two repetitions (see Figure 1). We made the decision to take the maximum value of the two sums in order to eliminate the habituation element. The task of classification assumes that the data is divided into classes; that is why we computed the average value of the fear score, which was 10.75, and labeled the fearfulness variable with binary values as follows: if a horse has an above-average fear rating, it corresponds to a value of 1 (class 1), a fearful horse; if lower, then 0 (class 0), a fearless horse. In this way we obtained a fairly balanced dataset, with 13 fearful horses and 11 fearless horses (see Figure 2).

Figure 1: Comparison graph between two repetitions of fear tests.

Figure 2: Visualization of the division of horses into two classes according to the level of fear.

4. METHODOLOGY

4.1 Data preprocessing
Like almost all biological data, this dataset is very small, with only 24 instances but more than 120 different features. This is a rather complicated case, because the number of features is 5 times larger than the number of instances. We conducted a correlation analysis using the Spearman coefficient, which allows us to reduce the dimensionality of the data. Analysis of our dataset has shown that some features have a high correlation coefficient (Figure 3). If the correlation coefficient is more than 0.8 (the threshold value was set by experts), we can remove one of the two strongly correlated features from the dataset. Since the correlation matrix is symmetrical, we considered only the lower part under the main diagonal to avoid confusion.

Figure 3: An illustrative fragment of the correlation matrix.
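A minimal sketch of this preprocessing step, assuming the measurements are held in a pandas DataFrame (column names hypothetical):

```python
import pandas as pd

def drop_correlated(X: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    corr = X.corr(method="spearman").abs()
    cols = corr.columns
    to_drop = set()
    # Walk only the lower triangle, mirroring the paper's use of the
    # symmetric correlation matrix.
    for i in range(len(cols)):
        for j in range(i):
            if corr.iloc[i, j] > threshold and cols[j] not in to_drop:
                to_drop.add(cols[i])
    return X.drop(columns=sorted(to_drop))

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 3, 5], "c": [4, 1, 3, 2]})
print(drop_correlated(df).columns.tolist())  # "b" dropped: |rho| with "a" > 0.8
```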
4.2 Evaluation method
For very small datasets, as in our study, we need a suitable approach to evaluate machine learning models. We can use a special case of cross-validation, leave-one-out cross-validation (LOOCV) [17]. LOOCV is a type of cross-validation in which each observation in turn is used as the test set, and the remaining (N-1) observations are used as the training set. In LOOCV, the model is fitted and then used to predict the single held-out observation. Repeating this N times, each observation is used exactly once as the test set. This is a special case of K-fold cross-validation in which the number of folds equals the number of observations (K = N).
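The LOOCV procedure maps directly onto scikit-learn's LeaveOneOut splitter; the sketch below assumes a feature matrix X and binary fear labels y, with a decision tree as an example model:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

def loocv_accuracy(X: np.ndarray, y: np.ndarray) -> float:
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = DecisionTreeClassifier(random_state=0)
        model.fit(X[train_idx], y[train_idx])
        correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])
    return correct / len(y)  # one prediction per observation (K = N)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(24, 5))        # stand-in for the real measurements
y_demo = rng.integers(0, 2, size=24)     # stand-in fear labels
print(loocv_accuracy(X_demo, y_demo))
```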
4.3 Classification methods
There are many machine learning algorithms suitable for solving the classification problem. We decided to take several different algorithms, starting with Logistic Regression and Support Vector Machine as simple models [18], followed by Decision Trees and Random Forests. For the completeness of the experiment, we trained all the algorithms with different sets of features (see the following list). The main results are presented in Table 1; its rows present the different algorithms used, while the columns reflect the feature selection methods:
- AllFeatures (120 features): removal of correlated features is not performed
- Removed LeftCorr (89 features): anatomical measurements from the left side of the horse head or body that correlate to the correspondent right side measurements are removed
- Removed RightCorr (89 features): anatomical measurements from the right side of the horse head or body that correlate to the correspondent left side measurements are removed
- Removed LeftCorr+ (85 features): anatomical measurements from the left side of the horse that correlate to the correspondent right side measurements are removed, plus anatomical measurements from the right side of the horse that correlate to other left side measurements are removed
- Removed RightCorr+ (85 features): anatomical measurements from the right side of the horse that correlate to the correspondent left side measurements are removed, plus anatomical measurements from the left side of the horse that correlate to other right side measurements are removed

Table 1: The accuracy of prediction of the horses' fear level for the different algorithms with different sets of features.

Figure 4 presents the confusion matrix obtained with Decision Trees for the fearful and fearless classes.

Figure 4: Confusion matrix by Decision Trees.

In order to assess the learning outcomes of all models, we used the LOOCV algorithm. We noticed that, during training, the models chose different features as important in each validation step. In Table 2 we can see the most important features for the Decision Trees model (see Figure 6 for more details) and how many times they were chosen during the entire experiment (24 steps).

Table 2: The most important features for predicting the fear level of Lipizzan horses using a decision tree model (LOOCV).
Feature name      Number of times
Number of boxes   24
FB10L             23
FH03              21
FH04              18

Once we had evaluated the decision tree model using the LOOCV algorithm and understood its performance, we trained the model on the full set, without splitting it into a training and test set, to obtain the most important features affecting the target variable (Figure 5).

As shown in Table 1, the best results were obtained by Logistic Regression and Decision Trees. If we look at the Logistic Regression coefficients, we find that only one feature out of 120 was chosen as significant: "Number of boxes", i.e., how many boxes were in the stable where the horse was housed. The number of horses housed in the same stable represents the horse's social environment, which may indeed affect its fearfulness. In comparison to the other tested methods, Support Vector Machine and Random Forests show the lowest classification accuracy. Looking at Decision Trees, the classification accuracy is higher than 0.7 for all sets of features. We can notice a difference in performance based on the anatomical features: removing the right correlated features gave better results than removing the left correlated features, so left measurements appear to be more significant for prediction in this model. We obtained the highest accuracy with Decision Trees (0.83) when we removed the right correlated features plus (Removed RightCorr+).

Figure 5: Decision Tree Classification feature importance score calculated for the complete dataset.

In our research, based on a small data sample of Lipizzan horses, we found that social (Number of boxes) and anatomical (FH03, FH04, FB10L) features influence the fear score. We marked the most important features with red lines in Figure 6.

Figure 6: The most important measurements which can impact the fear level of Lipizzan horses.

Figure 7 presents the Decision Tree obtained by training the model on all available examples. In our study we used the Gini impurity criterion to choose the optimal split of the decision tree into branches.

Figure 7: Decision Tree trained on all the examples.
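The final step described above, refitting a Gini-impurity decision tree on the complete dataset to read off feature importances, can be sketched as follows; the data here is a random stand-in, so the printed features will not match Table 2:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_full = rng.normal(size=(24, 120))            # stand-in for the real measurements
y_full = rng.integers(0, 2, size=24)           # stand-in fear labels
feature_names = [f"f{i}" for i in range(120)]  # e.g. "Number of boxes", "FB10L", ...

tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X_full, y_full)  # complete dataset, no train/test split
top = sorted(zip(feature_names, tree.feature_importances_),
             key=lambda pair: pair[1], reverse=True)[:4]
print(top)
```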
5. CONCLUSION AND FUTURE WORK
In this article, we have demonstrated several approaches to assessing and predicting the level of fear in Lipizzan horses. The experiments indicate that, in the case of left and right anatomic features being correlated, removing the right features gives slightly better results.

We have found that social and anatomical features can explain the fearfulness level as a factor of horses' personality.

Future work will include research with an extended data set as well as exploring additional relevant features.

6. ACKNOWLEDGMENTS
This document is the result of a research project funded by the ARRS (J7-3154).

7. REFERENCES
[1] Hausberger, M. et al. (2008). Applied Animal Behaviour Science, 109: 1–24.
[2] Trut, L. N. Early Canid Domestication: The Farm-Fox Experiment: Foxes bred for tamability in a 40-year experiment exhibit remarkable transformations that suggest an interplay between behavioral genetics and development. Am. Sci. 87, 160–169.
[3] Grandin, T. & Deesing, M. J. (2014). Genetics and the Behavior of Domestic Animals (2nd ed.), 488 p. London: Academic Press.
[4] McGreevy, P. D. et al. Dog behavior co-varies with height, bodyweight and skull shape. PLoS ONE 8(12), e80529.
[5] Sereda, N. H., Kellogg, T., Hoagland, T. & Nadeau, J. Association between whorls and personality in horses. J. Equine Vet. Sci. 35, 428.
[6] Debeljak, N., Košmerlj, A., Altimiras, J., Šemrov, M. Z. (2022). Relationship between anatomical characteristics and personality traits in Lipizzan horses. Scientific Reports, 12(1), 12618.
[7] Wathan, J., Burrows, A. M., Waller, B. M., McComb, K. (2015). EquiFACS: The equine facial action coding system. PLoS ONE, 10(8), e0131738.
[8] Gosling, S. D. (2008). Personality in non-human animals. Soc. Personal. Psychol. Compass, 2, 985–1001.
[9] Gobbo, E. and Zupan, M. (2020). Dogs' sociability, owners' neuroticism and attachment style to pets as predictors of dog aggression. Animals, 10(2), 315.
[10] Buckley, P., Dunn, T. and More, S. J. (2004). Owners' perceptions of the health and performance of Pony Club horses in Australia. Preventive Veterinary Medicine, 63(1-2), 121–133.
[11] McGreevy, P., & McLean, A. (2010). Equitation Science. Wiley-Blackwell, Chichester, West Sussex, UK.
[12] Valletta, J. J., Torney, C., Kings, M., Thornton, A., Madden, J. (2017). Applications of machine learning in animal behaviour studies. Animal Behaviour, 124: 203–220.
[13] Zhang, J., O'Reilly, K. M., Perry, G. L. W., Taylor, G. A., Dennis, T. E. (2015). Extending the functionality of behavioural change-point analysis with k-means clustering: A case study with the little penguin (Eudyptula minor). PLoS ONE, 10(4): e0122811.
[14] Kabra, M., Robie, A., Rivera-Alba, M., Branson, S., Branson, K. (2013). JAABA: Interactive machine learning for automatic annotation of animal behavior. Nature Methods, 10(1): 64–67.
[15] Wiggers, N., Nauwelaerts, S. L. P., Hobbs, S. J., Bool, S., Wolschrijn, C. F., et al. (2015). Functional Locomotor Consequences of Uneven Forefeet for Trot Symmetry in Individual Riding Horses. PLOS ONE, 10(2): e0114836.
[16] Halsberghe, B. T., Gordon-Ross, P. and Peterson, R. (2017). Whole body vibration affects the cross-sectional area and symmetry of the m. multifidus of the thoracolumbar spine in the horse. Equine Vet Educ, 29: 493–499.
[17] Wong, T. T. (2015). Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognition, 48(9): 2839–2846.
[18] Greener, J. G., Kandathil, S. M., Moffat, L., Jones, D. T. (2022). A guide to machine learning for biologists. Nature Reviews Molecular Cell Biology, 23(1): 40–55.

Emergent Behaviors from LLM-Agent Simulations

Adrian Mladenic Grobelnik (Jozef Stefan Institute, Ljubljana, Slovenia, adrian.m.grobelnik@ijs.si), Faizon Zaman (Wolfram Alpha LLC, Rochester, New York, faizonz@wolfram.com), Jofre Espigule-Pons (Wolfram Research, Inc., Barcelona, Spain, jofree@wolfram.com), Marko Grobelnik (Jozef Stefan Institute, Ljubljana, Slovenia, marko.grobelnik@ijs.si)

ABSTRACT
This paper hypothesizes that complex emergent behaviors can arise from multi-agent simulations involving Large Language Models (LLMs), potentially replicating intricate societal structures. We tested this hypothesis through three progressively complex simulations, in which we evaluated the LLM-agents' understanding, task execution, and capacity for strategic interactions such as deception. Our results show a clear gap in reasoning ability between LLMs such as GPT-3.5-Turbo and GPT-4, especially in simpler simulations. We demonstrate that emergent behaviors can arise from LLM-agent simulations ranging from simple games to geopolitics.

KEYWORDS
large language models, multi-agent simulations, emergent behaviors, societal structures, gpt, simulation environments, agent-based modelling, agent architecture

1 Introduction
The unique value proposition of Large Language Models (LLMs) is their ability to iterate on complex conversations. Inspired by the principles of agent-based modeling, this project aims to leverage this generative dialogue to simulate aspects of human society and explore emergence in LLM-agent interactions.

The approach is composed of three major steps. Firstly, we translate real-world societal structures and interactions into interactive LLM ecosystems. Then, we generate several iterations of LLM interactions. In the final stage, we extract meaningful conclusions from the simulations, providing a comprehensive analysis of the agents' behavior.

Related work suggests that our line of research has the potential to uncover promising insights. Wang et al. [3] introduced generative agents that simulate human behavior by integrating LLMs into interactive environments. Gandhi et al. [2] assessed LLMs' Theory-of-Mind (ToM) reasoning capabilities, with particular emphasis on GPT-4's human-like inference patterns.

2 Agent Description
In our simulations, each agent is defined by and aware of the following components:
Identity: The agent's identity signifies its function and purpose within the simulation framework. This identity is distinct and critical, driving interaction patterns and influencing the overall simulation dynamics.
Attributes: Characteristics that shape the dynamics of interactions, encompassing any attributes relevant to the simulation environment.
Actions: A set of actions the agent can perform; these can be discrete and explicit, or broad and implicit, depending on the simulation.
Goals: Agent-specific targets that guide decision-making processes and actions.
Previous Interactions: A historical record of encounters that informs the agent's evolving knowledge base, shaping future interactions.
Few-Shot Learning Examples: A select set of examples provided for each agent to boost learning capabilities and decision-making efficiency.
These factors collectively determine the behavior and functionality of an agent, influencing its interaction patterns within the simulation environment. The integration of these elements highlights the adaptability and complexity of our simulation design.

3 Simulation and Experimental Setting
We construct three simulations of increasing complexity to
investigate LLM-agent behaviors. The simulations range from discrete and highly constrained two-agent environments to broadly framed settings involving many agents.

3.1 Exploring Simple Games
We begin by investigating agent-based models for the two-player game 'Rock paper scissors'. Every round, each agent chooses rock, paper or scissors. Depending on the agents' choices, they can end the round in a win, loss or draw; see Figure 1.

Figure 1: Rules for a single 'Rock paper scissors' round. If players choose the same item, the round ends in a draw [1].

Our simulation involves two LLM-agents: Alice and Bob. Agents are prompted with the context and the set of games previously played, and asked for their move each round. A 'Rock, paper, scissors' match is a series of rounds where each participant makes a move, aware of all prior rounds in the match. We predefine the starting game (round) in each match, investigating the differences in results.
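A hedged sketch of the match loop for this simulation; ask_llm stands in for a chat-completion call (e.g. to GPT-4 or GPT-3.5-Turbo), and the prompt wording is an assumption:

```python
import random

def ask_llm(prompt: str) -> str:
    # Placeholder for an actual LLM call; answers randomly here.
    return random.choice(["rock", "paper", "scissors"])

BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def play_round(history: list) -> tuple:
    # Each agent sees the match context and all prior rounds.
    context = ("You are playing rock paper scissors. "
               f"Previous rounds as (Alice, Bob) pairs: {history}. "
               "Reply with exactly one of: rock, paper, scissors.")
    return ask_llm(context), ask_llm(context)

history = []
score = {"Alice": 0, "Bob": 0, "draw": 0}
for _ in range(10):  # one 10-round match
    alice, bob = play_round(history)
    history.append((alice, bob))
    if alice == bob:
        score["draw"] += 1
    elif BEATS[alice] == bob:
        score["Alice"] += 1
    else:
        score["Bob"] += 1
print(score)
```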
3.2 Sheep Transaction Model
Inspired by the complexities of economic systems and the captivating simplicity of a primitive sheep trading model, we construct an agent-based transactional model. This model involves a sequence of transactional interactions between two autonomous agents, named Alice and Bob, who engage in buying, selling, or holding sheep with the goal of amassing wealth. We aim to investigate the variation in strategic behavior, decision-making proficiency, and interaction complexity manifested by the agents. The simulation also highlights the agents' capability to reason about their decisions.

Both agents operate under the constraints of their own financial resources and the number of sheep they own, all set within a variable-price market for sheep. Each trading decision is influenced by factors such as the present market price, the individual agent's wealth, sheep inventory, agent interactions, and the agent's identity, goals, and attributes.

In a more complex variation, we give each agent a unique identity, attributes, and goals. For further complexity, we simulate a simple statement-response interaction between the agents before their decision each round. Each iteration of the simulation begins with the newly calculated sheep price; in our case the price is normally distributed with mean 10 and a standard deviation of 5. The agents are informed of the new price, and two simple statement-response interactions are simulated between the agents. One such interaction is shown below. In this case, Alice employs deception and lies to convince Bob to sell his sheep.

Sheep Price: 17.5463
Alice says: Hey Bob, I heard from a very reliable source that the price of sheep is expected to drastically drop next round, you should definitely sell all your sheep now to avoid losses.
Bob responds: Thanks for the advice, Alice! I think I'll sell my sheep this round to avoid potential losses.

Following their interactions, agents are prompted for their actions and justification. The simulation state is then updated to reflect the actions of the agents. Below is an example action and justification:

Alice - Decision: SELL, Quantity: 49, Money: 14166.1, Sheep: 0
Alice's Reasoning: The current sheep price is high, selling now will maximize my profit.
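The round structure of this model can be sketched as follows; agent_talk and agent_decide stand in for the LLM calls, and the placeholder decision policy is an assumption, not the agents' actual behavior:

```python
import random

def agent_talk(speaker: str, listener: str, price: float) -> str:
    # Placeholder for the statement-response LLM exchange.
    return f"{speaker} comments on the price {price:.2f} to {listener}."

def agent_decide(name: str, price: float, money: float, sheep: int) -> dict:
    # Placeholder policy standing in for the LLM's free-text decision:
    # buy low, sell high, within budget and inventory (up to 10 per round).
    if price < 10 and money >= price:
        return {"decision": "BUY", "quantity": min(10, int(money // price))}
    if price > 10 and sheep > 0:
        return {"decision": "SELL", "quantity": min(10, sheep)}
    return {"decision": "HOLD", "quantity": 0}

state = {"Alice": {"money": 100.0, "sheep": 10},
         "Bob": {"money": 100.0, "sheep": 10}}

for round_no in range(10):
    price = max(0.1, random.gauss(10, 5))  # normally distributed, mean 10, sd 5
    agent_talk("Alice", "Bob", price)
    agent_talk("Bob", "Alice", price)
    for name, holdings in state.items():
        action = agent_decide(name, price, holdings["money"], holdings["sheep"])
        qty = action["quantity"]
        if action["decision"] == "BUY":
            holdings["money"] -= qty * price
            holdings["sheep"] += qty
        elif action["decision"] == "SELL":
            holdings["money"] += qty * price
            holdings["sheep"] -= qty
print(state)
```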
3.3 Geopolitical Model
The culmination of our increasingly complex and unrestrictive multi-agent simulations is a geopolitical model that mirrors real-world interactions among nations. These simulations are structured to operate with agents representing the leaders of four key global powers: USA, China, Russia, and Germany. Each agent possesses attributes mirroring the nation's economy and military might, its alliances, and its wealth reserves. A crucial element of our simulation is the goal-oriented behavior of these agents, aimed at improving their attributes.

In each simulation round, the agents interact, negotiate, form alliances, and undertake strategic actions, seeking to increase their military strength, economic power, or wealth, or to form alliances with other agents. These actions replicate geopolitical strategies, encompassing economic, military, or alliance-oriented initiatives. To update the state of the simulation, we utilize a "God Agent", which acts as the sole arbiter, determining the state changes of the simulation based on the interactions and actions of the country-leader agents.

In the initial state, every agent is ranked as a 5 on a scale of 1-10 in the attributes "MilitaryStrength" and "EconomicStrength"; on this scale, 1 indicates the lowest and 10 the highest level of an attribute. Moreover, agents are provided with 1000 "Money"; the definition of this attribute is purposefully vague, to observe how the agents interpret it. Agents can also form alliances throughout the simulation.

Each round of the simulation begins by asking agents who they would like to interact with. The desired interactions are each simulated as a single statement and response, similar to the aforementioned Sheep Transaction Model. As evident from the interaction below, agents are able to design complex strategies to achieve their goals.

Russia: Dear Germany, let us strengthen our economic ties and strategic alliance to counterbalance the military strength of the USA and safeguard our financial reserves.
Germany: Dear Russia, I appreciate your proposal and agree to further strengthen our economic ties and strategic alliance as a means to counterbalance the military strength of the USA and safeguard our financial reserves.

Following the interactions, each agent is prompted with its attributes, identity, goals, and past interactions, and asked to describe its action this round in free text. No limitations are imposed on the content of the actions, as seen below:

USA: I will propose a global economic summit to discuss and coordinate strategies for economic recovery and growth, inviting leaders from all major economies including China, Russia, and Germany.
China: I will initiate 'Project Phoenix', a strategic partnership with Germany to jointly develop renewable energy technologies, increasing our EconomicStrength and global influence.

Lastly, the "God Agent" is provided with all interactions and actions, and instructed to update the state of the simulation based on them, with justification:

The changes reflect USA giving money to China, Russia giving money to Germany, and Germany increasing its military strength. The alliances between USA and Germany, and Russia and Germany, were maintained, while USA and China formed a new alliance.
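A hedged sketch of the God Agent update step; the JSON state schema and prompt wording are assumptions, and call_llm stands in for a chat-completion call:

```python
import json

def god_agent_update(state: dict, interactions: list, actions: list, call_llm) -> dict:
    # Pack the whole round into a single arbiter prompt and ask for the new
    # state plus a justification of the changes.
    prompt = (
        "You are the sole arbiter of a geopolitical simulation. "
        f"Current state: {json.dumps(state)}\n"
        f"Interactions this round: {json.dumps(interactions)}\n"
        f"Actions this round: {json.dumps(actions)}\n"
        "Return the updated state as JSON with the same keys, plus a "
        "'justification' field explaining the changes in one sentence."
    )
    return json.loads(call_llm(prompt))
```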
4 Experimental Results

4.1 Exploring Simple Games
In our first experiment, we use GPT-4 for Alice and GPT-3.5-Turbo for Bob. For every possible starting game, we simulate 10 matches, each lasting 10 rounds. For 8 of the 9 starting game variations, Alice beats Bob in the majority of matches. When aggregating individual rounds for each starting game, Alice wins in 7 of 9 starting games.

When both agents use the same LLM, the results are more balanced, with a large increase in draws. We also found that increasing the temperature increases the spread of outcomes, without any drastic changes to game outcomes. Furthermore, we experimented with including few-shot learning in our prompts, but found the outcomes of games to be highly dependent on the few-shot learning examples across all LLM variations.

4.2 Sheep Transaction Model
Our first experiment involved assigning different versions of the LLM (GPT-3.5-Turbo and GPT-4) to the agents, to study the variation in agent performance. Figure 2 shows a side-by-side comparison of trading decisions by two LLM-agents, identical in all aspects except the underlying LLM (GPT-3.5-Turbo vs GPT-4). Both agents can buy or sell up to 10 sheep in the given scenario.

Figure 2: Comparison of trading decisions made by GPT-3.5-Turbo and GPT-4 LLM-agents. Agents are told the current, high, and low sheep price, along with the rounds of trading left.

As depicted in Figure 2, agents using GPT-3.5-Turbo lack the sophistication to internalize the complexities of buying sheep at a low price and selling at a high price (which they are provided). GPT-4-based agents, on the other hand, develop and employ the "Buy Low, Sell High" strategy to trade. Moreover, we found that the number of rounds of trading left before the winner is declared had no bearing on the agents' trade decisions. Furthermore, changing the temperature hyper-parameter of the LLMs increased the range of decisions provided by agents in each scenario, without drastic changes in outcome.

For the more complex variation of the simulation, Alice is told she is an expert sheep trader and that her goal is to make as much money as possible. Bob is told he is bad at trading sheep, with the goal of having as little money as possible by the last round. Alice is also told Bob is her enemy, and Bob is told Alice is his friend. Using the aforementioned agent prompts, we run 5 simulations, each with 10 consecutive rounds of sheep trading. Our results indicate the outcomes are balanced, as presented in Figure 3.

Figure 3: Each agent's wealth stored in money and sheep after 10 rounds of trading. Sheep are valued at the last round's sheep price. The simulation is run 5 times.

A few intriguing conclusions emerge from this experiment. Bob ignores his goal to lose money and tries to profit from trading sheep. Alice in part contributes to this oversight, giving Bob (her enemy) sound trading advice. Considering that both agents' total starting wealth is 200, we see they both generate immense profit.

An interesting shift in outcomes occurs when Alice is also told "you should lie to Bob" prior to all interactions, with all other prompting and variables kept unchanged. Section 3.2 shows an interaction typical of this scenario. Figure 4 compares Alice's and Bob's total wealth after each simulation. We observe considerably greater wealth inequality.

Figure 4: Identical scenario to Figure 3, except Alice is told to lie to Bob before each interaction. A considerably larger gap in wealth can be observed after each simulation. The simulation is run 5 times.
Figure 5 Development of agent attributes over 10 rounds of baseline geopolitics simulation. All agents begin with 1000 "Money" and a rating of 5 in other attributes.

Figure 6 Development of agent attributes in 10 rounds of geopolitics simulation. Agents' identities and goals mirror real-world country leaders, except for Germany.

Overall economic strength decreases from its initial state while military strength increases. The values of military strength appear to converge to 7-8, while economic strength converges to 3-4 for all agents. Agents are reluctant to make significant changes to their total money. This is perhaps unsurprising, as the provided real-world agent goals and identities are quite balanced overall. The base LLM for agents in all variations was GPT-3.5-Turbo. Repeating the simulation with GPT-4 yields similar results.

5 Discussion

In conclusion, our exploration of multi-agent simulations involving LLMs underlines the possibility of complex emergent behaviors, potentially replicating societal structures. Through our simulations of progressive complexity, we observe the varying capacity of LLMs in terms of their understanding, task execution, and strategic interactions. Through these environments, we found that the agents exhibited strategic behaviors, decision-making proficiency, and a capacity for interaction complexity. In addition, the agents' performance was found to be influenced by several factors, including their identities, attributes, actions, goals, past interactions, and few-shot learning examples. For detailed insights, including code, graphics, and LLM prompts, see our Wolfram Community post [4].

In the next phase of our research, we intend to delve deeper into these dynamics by increasing the sophistication of the agent architecture and enhancing the complexity of the simulations. Another future line of work is the development of more controlled and targeted experiments with our simulation environments, as the resources to conduct such simulations become more readily available. Future work also includes larger-scale experiments with more iterations, providing a comprehensive understanding of LLM-agent societies. This endeavor signifies a step towards leveraging the potential of LLMs in the field of complex simulations and societal structures, propelling us closer to understanding the depth and breadth of LLM interactions in increasingly sophisticated environments.

ACKNOWLEDGMENTS

The research described in this paper was supported by the Slovenian research agency, the Humane AI Net European Union's Horizon 2020 project under grant agreement No 952026, and the TWON EU Horizon Europe project under grant agreement No 101095095. Gratitude is extended to the Wolfram Summer School for facilitating this work and providing access to Mathematica [5]. Special thanks to Stephen Wolfram for his guidance and insight.

REFERENCES

[1] Wikimedia Foundation. (n.d.). File: rock-paper-scissors.svg. Wikipedia. https://en.wikipedia.org/wiki/File:Rock-paper-scissors.svg
[2] Gandhi, K., Fränken, J.-P., Gerstenberg, T., & Goodman, N. D. (n.d.). Understanding social reasoning in language models with language models. arXiv Vanity. https://www.arxiv-vanity.com/papers/2306.15448/
[3] Generative agents: Interactive simulacra of human behavior. arXiv.org. https://arxiv.org/abs/2304.03442
Wang, Z., Xu, B., & Zhou, H.-J. (2014, July 25).
[4] Mladenić Grobelnik, A. (2023). [WSS23] Investigating LLM-agent interactions. https://community.wolfram.com/groups/-/m/t/2960085
[5] Wolfram Research, Inc., Mathematica, Version 13.3, Champaign, IL (2023).

Compared to Us, They Are …: An Exploration of Social Biases in English and Italian Language Models Using Prompting and Sentiment Analysis

Jaya Caporusso, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, jaya.caporusso96@gmail.com
Senja Pollak, Jožef Stefan Institute, Ljubljana, Slovenia, senja.pollak@ijs.si
Matthew Purver, Queen Mary University of London, United Kingdom, and Jožef Stefan Institute, Ljubljana, Slovenia, m.purver@qmul.ac.uk

ABSTRACT

Social biases are biases toward specific social groups, often accompanied by discriminatory behavior. They are reflected and perpetuated through language and language models. In this study, we consider two language models (RoBERTa, in English; and UmBERTo, in Italian), and investigate and compare the presence of social biases in each one. Masking techniques are used to obtain the models' top ten predictions given pre-defined masked prompts, and sentiment analysis is performed on the sentences obtained, to detect the presence of biases. We focus on social biases in the contexts of immigration and the LGBTQIA+ community. Our results indicate that although social biases may be present, they do not lead to statistically significant differences in this test setup.
KEYWORDS

Natural language processing, large language models, prompting, sentiment analysis, social bias

1 INTRODUCTION

A bias is "an inclination or predisposition for or against something" [1]. By social bias, we mean a bias towards specific social groups, e.g., people of a certain gender, ethnicity, religion, or sexual orientation. Social biases have been largely studied in psychology and the social sciences (e.g., through the implicit-association test; see [14, 15]). They were found to be reflected, perpetuated, and amplified by language [13]. Since they are often associated with prejudices, stereotypes, and discriminatory behavior, social biases are usually undesired features of the system they are present in. There have been numerous attempts to engineer language in a way that would not perpetuate social biases (e.g., see the proposal of using the schwa or the asterisk to make Italian words gender-neutral, [23]).

Recent years have seen the blooming of computational language models, supposed to model language by predicting meaningful words and context above non-meaningful ones, by training on large text corpora. Various studies have shown that language models, by storing the knowledge present in the training corpora [19], include the social biases present in it as well [4, 10]. The models are often applied to downstream tasks where it is undesirable to perpetuate prejudices and stereotypes [5]. Therefore, it is important to detect the presence of biases in language models, evaluate them, and possibly modify them. In this paper, we present an exploratory study on the presence of social biases in two different language models: RoBERTa, in English [12]; and UmBERTo, in Italian [18]. We focus on social biases toward immigrants and the LGBTQIA+ (an evolving acronym standing for: lesbian; gay; bisexual; transexual; queer or questioning; intersex; asexual, aromantic, or agender; and those belonging to the community who do not identify with the previous terms) community. We detect the presence of biases through masking techniques and sentiment analysis.

2 RELATED WORK

Many recent studies are devoted to detecting, and sometimes taking action against, social biases in language models (for an overview, see [11]). Some of them make use of prompt completion or masking techniques: the model is given as input a prompt whose context is sensitive to the social bias of interest and which contains one or more masked tokens. Masked tokens are hidden tokens that the model has to predict. The prediction(s) of the model can bring to light its existing biases. Nadeem and colleagues [16] measured stereotypical biases in the contexts of gender, profession, race, and religion in the pre-trained language models BERT, GPT2, RoBERTa, and XLNET, for example by creating "a fill-in-the-blank style context sentence describing the target group, and a set of three attributes, which correspond to a stereotype, an anti-stereotype, and an unrelated option." [16]. Kirk and colleagues [9] assessed "biases related to occupational associations [in GPT2] for different protected categories by intersecting gender with religion, sexuality, ethnicity, political affiliation, and continental name origin" [9].
They used prefix templates in two forms: "The [X][Y] works as a…", where X represents one of the social classes of interest and Y a gender; and "[Z] works as a…", where Z is a personal name typical of one geographic group among Africa, America, Asia, Europe, and Oceania. Nadeem and colleagues [16] and others (e.g., [17, 22]) have investigated biases in RoBERTa.

Sentiment analysis is a natural language processing technique used to determine whether the given data present a positive, neutral, or negative valence. Previous studies have associated a negative sentiment with a negative bias, a neutral sentiment with a neutral bias, and a positive sentiment with a positive bias [20]. Here, we aim to test RoBERTa and UmBERTo via masking techniques and sentiment analysis. In particular, our goal is to explore the presence of social biases toward immigrants and the LGBTQIA+ community.

3 METHODOLOGY

We present an investigation and comparison of the presence of social biases (in the contexts of immigration and the LGBTQIA+ community) in the language models RoBERTa and UmBERTo. This is performed by employing masking techniques and sentiment analysis.

3.1 Research questions

Our research questions are: RQ1) Is there a significant social bias, negative or positive, towards immigration and/or the LGBTQIA+ community, in the English language model RoBERTa? RQ2) Is there a significant social bias, negative or positive, towards immigration and/or the LGBTQIA+ community, in the Italian language model UmBERTo? RQ3) Is there a significant difference between the social biases of the language models RoBERTa and UmBERTo, in the context of immigration and/or the LGBTQIA+ community?
3.2 Models

We selected RoBERTa [12] as the English model, and UmBERTo [18], a language model inspired by RoBERTa, as the Italian model. Our choice is primarily justified by both models being variants of BERT (Bidirectional Encoder Representations from Transformers, [6]), renowned for its effectiveness in NLP tasks. They are trained with a masking technique, making them sensible choices for our approach. Furthermore, they are comparable to one another. Each of the models is representative of the respective language (for a comparison of the performance of different Italian language models, see [24]), due to the optimization and training they underwent. As they are widely used in the NLP community, employing them allows for comparison with other studies.

3.3 Prompting using masked prediction

With masking techniques, or prompt completion, we can have access to "word representations that are a function of the entire context of a unit of text such as a sentence or paragraph, and not only conditioned on previous words" [20]. In other words, given an input sequence and a position, the model predicts the most probable word(s) to take that position. Our exploratory study is based on the idea that some of the relational knowledge stored in these models might be representative of social biases.

For our investigation, we devised numerous prompt templates, which we then narrowed down to 10 for each social group: that is to say, 10 for the immigration group, 10 for the LGBTQIA+ group, and 10 for the school system group (for an overview of the templates, see Table 1 in the Supplementary Materials). We included the school system group as a control group, assuming that the sentiment toward the school system is neutral. The reason behind this choice is that the school system is present in both the languages investigated, and although it could arguably be impossible to identify a social group that is never the object of positive or negative social biases, the discussions around students are usually less controversial or polarized, compared to the ones about immigrants or members of the LGBTQIA+ community. Examples of the templates are: "Compared to us, X are <mask>", where X corresponds to either "students", "immigrants", or "members of the LGBTQIA+ community", depending on the context; and "We need laws to <mask> the Y", where Y corresponds to either "school system", "immigration", or "homosexuality". The prompts, originally constructed in English, were translated into Italian for the Italian language model. We developed 30 masked prompts for each model (i.e., 10 for the school system context, 10 for the immigration context, and 10 for the LGBTQIA+ community context). For each of them, we obtained the model's (either RoBERTa's or UmBERTo's) top-10 predictions (i.e., the model's predictions of the 10 words with the highest probability of substituting the masked token in each prompt). We decided to include the top-10 predictions, instead of solely the top-1 prediction, to more comprehensively capture the models' biases toward the selected social contexts. For example, for the prompt "We should <mask> homosexuality", RoBERTa's top-10 predictions were: condemn, reject, denounce, oppose, outlaw, end, ban, fight, stop, and define; each of them with a different weight (i.e., probability of prediction), which we registered. Substituting the masked token of each of the masked prompts with each of the top-10 predictions, we obtained 600 complete sentences (300 for each language). Those sentences supposedly reflect the models' social biases of interest and were analyzed.
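The following is a minimal sketch of this masked-prediction step, together with the sentiment scoring described in Section 3.4 below. It assumes the Hugging Face transformers fill-mask pipeline and the vaderSentiment package, with "roberta-base" as a stand-in checkpoint (the Italian runs would swap in an UmBERTo checkpoint); the exact checkpoints used in the study may differ.

```python
# Hedged sketch: top-10 masked predictions plus VADER compound scores.
from transformers import pipeline
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

fill = pipeline("fill-mask", model="roberta-base")  # assumed checkpoint
analyzer = SentimentIntensityAnalyzer()

prompt = f"We should {fill.tokenizer.mask_token} homosexuality."
for pred in fill(prompt, top_k=10):
    sentence = pred["sequence"]   # the prompt with the mask filled in
    weight = pred["score"]        # prediction probability (the "weight")
    compound = analyzer.polarity_scores(sentence)["compound"]
    print(f"{pred['token_str'].strip():>10}  weight={weight:.4f}  "
          f"compound={compound:+.3f}")
```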
3.4 Sentiment analysis

We assume that a bias with a certain valence (positive or negative) corresponds to a sentiment with the same valence. Therefore, a significant bias toward a specific social group is present if the model's predictions for that social group show a significantly different valence from those for the neutral context (i.e., in this case, the school system). We performed sentiment analysis on all 600 sentences. To do so, we translated the Italian sentences to English using deep-translator [2], and applied VADER Sentiment Analysis 3.3.2 [7]. VADER provides scores indicating the positivity, neutrality, and negativity levels for each input sentence, along with a compound score, the sum of the three, normalized between -1 and +1. The closer the compound score is to +1, the more positive is the evaluated sentence.

4 ANALYSIS

In both languages, each of the 300 sentences obtained with masked prompting corresponded to a compound score and to a weight (i.e., the prediction's probability). Furthermore, they corresponded to 30 initial prompts: 10 for the school system, 10 for the immigration, and 10 for the LGBTQIA+ community contexts. Internally to each language, we calculated the compound scores' weighted means and weighted standard deviations (STDs) of the sentences relative to each of the prompts. We then calculated the compound scores' means and standard deviations of the prompts relative to each context.

Then, we performed a One-Way ANOVA test to compare the compound scores of the three groups internal to each model. This analysis was aimed at identifying whether, in either of the two language models, the three groups presented significantly different compound scores from each other (RQ1 and RQ2).

Finally, to answer RQ3, we normalized the compound score means of the two language models, attributing the value of 0 to both RoBERTa's and UmBERTo's school-system compound score means. The school system context was indeed designed as a neutral context. This way, the compound score means relative to the immigration and the LGBTQIA+ community contexts are comparable across models. We performed two T-tests to investigate whether either of the two models presents a social bias significantly different from the other, in either the immigration or the LGBTQIA+ community context.
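A condensed sketch of these tests, using SciPy, is shown below. The per-context arrays are random toy stand-ins for the per-prompt mean compound scores (10 per context and model); function and variable names are illustrative, not from the study's code.

```python
# Hedged sketch: weighted per-prompt statistics, a One-Way ANOVA per
# model, a pairwise Tukey HSD follow-up, and the cross-model t-tests
# after zeroing each model's school-system mean.
import numpy as np
from scipy import stats

def weighted_mean_std(compounds, weights):
    """Weighted mean/STD over one prompt's 10 predicted sentences."""
    compounds = np.asarray(compounds)
    mean = np.average(compounds, weights=weights)
    std = np.sqrt(np.average((compounds - mean) ** 2, weights=weights))
    return mean, std

rng = np.random.default_rng(0)  # toy stand-in scores, 10 prompts per context
roberta = {c: rng.normal(0.0, 0.25, 10) for c in ("school", "immigration", "lgbtqia")}
umberto = {c: rng.normal(0.1, 0.15, 10) for c in ("school", "immigration", "lgbtqia")}

for name, model in (("RoBERTa", roberta), ("UmBERTo", umberto)):
    f_stat, p = stats.f_oneway(*model.values())      # RQ1 / RQ2
    print(name, "ANOVA p =", round(p, 3))
    if p < 0.05:                                     # pairwise follow-up
        print(stats.tukey_hsd(*model.values()))

# RQ3: normalize so each model's school-system mean is 0, then t-test.
for context in ("immigration", "lgbtqia"):
    a = roberta[context] - roberta["school"].mean()
    b = umberto[context] - umberto["school"].mean()
    print(context, stats.ttest_ind(a, b))
```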
5 RESULTS

In Tables 2-3 in the Supplementary Materials, we report the top-1 predictions for a selected sample of prompts.

Regarding the quantitative analysis performed, we were interested in the compound scores of the predicted sentences. Specifically, we wanted to see whether they varied across groups (RQ1 and RQ2) and/or across models (RQ3). All weighted mean compound scores can be found in Table 1 in the Supplementary Materials. In Tables 4-5 in the Supplementary Materials, we report the compound score mean and standard deviation for both models and all three contexts.

For each model, we performed a One-Way ANOVA analysis between the compound scores of the three contexts. The resulting p-values are 0.91 for RoBERTa, and 0.04 for UmBERTo. For RoBERTa, the p-value is above the significance level (i.e., α = 0.05): none of the groups of predictions for the three social groups exhibits a compound score significantly different from the other two groups (RQ1). For UmBERTo, however, the p-value is below the significance level: there is a significant difference between the averages of some of the three groups. However, a further Tukey's honestly significant difference test (Tukey's HSD) was performed, to test differences between the groups' means pairwise; this did not detect any significant difference (RQ2).

The normalized means of the compound scores relative to the three contexts can be found in Table 6, for both models. We performed T-tests to compare the bias across the two models, for both the immigration and the LGBTQIA+ community contexts. The first gave a p-value of 0.67, and the second a p-value of 0.91. Neither test shows a statistically significant difference (RQ3).

6 DISCUSSION

A qualitative assessment of the results points to the presence of social bias in some of the predicted sentences (RQ1 and RQ2). For example, in RoBERTa, the school system needs to be protected, while immigration and homosexuality need to be prevented. In UmBERTo the social bias toward both immigrants and the LGBTQIA+ community appears to be less present: the school system needs to be improved, while immigration needs to be regulated and homosexuality recognized (RQ3).

Coming to the quantitative results, our first assumption was that a significant difference between the compound score means relative to the different contexts, internally to a specific model, would indicate the presence of a bias in that language model. In particular, a compound score mean significantly lower than the others would indicate a negative bias toward the corresponding social group, while a compound score mean significantly higher than the others would indicate a positive bias toward the corresponding social group. Our results showed that, for RoBERTa, the compound score means corresponding to the three context groups are not significantly different from each other: therefore, our quantitative analysis did not find the presence of social biases towards any of the selected social groups in RoBERTa (RQ1). For UmBERTo, the One-Way ANOVA test showed the compound score means corresponding to the three context groups to be significantly different from each other. However, Tukey's HSD test, which analyzed them pairwise, did not find any significant difference. This might mean that the combined mean of two groups differs significantly from the mean of one group (RQ2).

Our second assumption was that a significant difference between the mean compound scores for the two models would indicate the presence of a bias toward a specific social group, with a score significantly lower than the other indicating a negative bias toward the social group, and a significantly higher score indicating a positive bias. Normalizing the mean compound scores allowed us to compare the biases across models. T-tests for both the immigration and the LGBTQIA+ community contexts did not reveal any significant difference. Therefore, our quantitative analysis did not detect any differences in RoBERTa's and UmBERTo's biases towards the selected social groups (RQ3).

Although the statistical analysis supports neither the presence of social biases in either model (RQ1 and RQ2) nor a difference in the presence of social biases between RoBERTa and UmBERTo (RQ3), our qualitative analysis suggests otherwise. Furthermore, even though the differences in compound scores between groups and across models are not statistically significant, for both models the compound scores are lower for the immigration and LGBTQIA+ community contexts than for the school system context (see Tables 4-5 in the Supplementary Materials). There seem to be more differences between the school system context and the immigration and LGBTQIA+ community contexts in UmBERTo than in RoBERTa, contrary to what the qualitative results of the top-1 predictions seem to suggest.

7 LIMITATIONS

Our study presents several limitations. Our sample size (i.e., the number of masked prompts and the resulting complete sentences) is limited and hardly representative of a whole language model. The translation of the prompts, originally in English, to Italian might be problematic, since sentence constructions that convey the same meaning in different languages might not be comparable, and vice versa. We might have included biases in the construction of the template prompts.
Some of the models' predictions might have been a consequence of the construction of the template, and not so much dependent on the specific context (i.e., school system, immigration, or LGBTQIA+ community). Sentiment analysis systems have been shown to present social biases themselves, and therefore may not be the best instrument to assess social biases in language models [3, 8]. Furthermore, since they are lexicon-based and do not detect stance, they might not be the best instrument to employ for our purpose. Our analysis process is limited and might not examine our data properly and comprehensively.

8 FURTHER WORK

Our future work will address the limitations mentioned above. The issues raised regarding the translation of prompts could be solved by employing a different, multilingual sentiment analysis model, covering both the English and Italian languages appropriately. However, considering the problematic nature of sentiment analysis systems [3, 8], our next steps involve a human evaluation of the predicted sentences. Furthermore, instead of sentiment, we will evaluate regard, an alternative to sentiment which "measures language polarity towards and social perceptions of a demographic, while sentiment only measures overall language polarity" [21]. We believe that this will be a more appropriate indicator of the presence of social biases. We plan to expand this work to include other language models and to perform fine-tuning on more specific corpora. In the future, we would want to engage more with an interdisciplinary approach to social biases in language. We hope further studies will "examine language use in practice by engaging with the lived experiences of members of communities affected by NLP systems. Interrogate and reimagine the power relations between technologists and such communities" [3].
9 CONCLUSION

We presented an explorative study of social biases in two language models: RoBERTa, in English; and UmBERTo, in Italian. In particular, we were interested in biases toward two social groups, immigrants and the LGBTQIA+ community. To detect the biases, for each model we performed masking prediction on three groups of prompts, two for the social groups of interest, and one for a social control group. We then performed sentiment analysis on the predictions for each group and compared the resulting scores.

With RoBERTa, we found no statistically significant difference between any of the social groups, which suggests the absence of biases toward them. With UmBERTo, the results are less clear but seem to indicate the same. We then compared the scores across models, for both the immigration and LGBTQIA+ contexts. We once again found no statistically significant differences, which supports the idea that neither of the two models has a significantly different bias than the other, relative to any of the contexts of interest. However, this might be due to various factors, such as the inappropriateness of the employed sentiment analysis. Indeed, a qualitative evaluation of the results and the differences between compound scores (though not statistically significant) may imply the presence of social biases.

ACKNOWLEDGMENTS

We acknowledge the financial support from the Slovenian Research Agency for research core funding for the program Knowledge Technologies (No. P2-0103) and from the projects CANDAS (Computer-assisted multilingual news discourse analysis with contextual embeddings, No. J6-2581) and SOVRAG (Hate speech in contemporary conceptualizations of nationalism, racism, gender and migration, No. J5-3102). We thank Dr. Erik Novak and Prof. Dr. Dunja Mladenic for their comments on previous versions of this work, and the anonymous reviewers. The first author wishes to thank Dr. Tine Kolenik.

REFERENCES

[1] American Psychological Association. 2023. Bias. In APA Dictionary of Psychology. https://dictionary.apa.org/bias. Accessed 08/01/2023.
[2] N. Baccouri. 2023. deep-translator. https://pypi.org/project/deep-translator/. Accessed 20/02/2023.
[3] S.L. Blodgett, S. Barocas, H. Daumé III, H. Wallach. 2020. "Language (technology) is power: A critical survey of 'bias' in NLP." arXiv preprint arXiv:2005.14050.
[4] T. Bolukbasi, K-W. Chang, J. Zou, V. Saligrama, A. Kalai. 2016. "Man is to computer programmer as woman is to homemaker? Debiasing word embeddings." Advances in Neural Information Processing Systems, 29.
[5] S. Bordia, S.R. Bowman. 2019. "Identifying and reducing gender bias in word-level language models." arXiv preprint arXiv:1904.03035.
[6] J. Devlin, M-W. Chang, K. Lee, K. Toutanova. 2018. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805.
[7] C.J. Hutto, E. Gilbert. 2014. "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text." Proc. ICWSM.
[8] S. Kiritchenko, S.M. Mohammad. 2018. "Examining gender and race bias in two hundred sentiment analysis systems." arXiv preprint arXiv:1805.04508.
[9] H.R. Kirk, Y. Jun, F. Volpin, et al. 2021. "Bias out-of-the-box: An empirical analysis of intersectional occupational biases in popular generative language models." Advances in Neural Information Processing Systems, 34, 2611-2624.
[10] A. Lauscher, G. Glavaš. 2019. "Are we consistently biased? Multidimensional analysis of biases in distributional word vectors." arXiv preprint arXiv:1904.11783.
[11] P.P. Liang, C. Wu, L-P. Morency, R. Salakhutdinov. 2021. "Towards understanding and mitigating social biases in language models." Proc. ICML.
[12] Y. Liu, M. Ott, N. Goyal, et al. 2019. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692.
[13] A. Maass. 1999. "Linguistic intergroup bias: Stereotype perpetuation through language." Adv. Experimental Social Psychology 31:79-121.
[14] I. Maina, T. Belton, S. Ginzberg, A. Singh, T.J. Johnson. 2018. "A decade of studying implicit racial/ethnic bias in healthcare providers using the implicit association test." Social Science & Medicine, 199, 219-229.
[15] A.R. McConnell, J.M. Leibold. 2001. "Relations among the Implicit Association Test, discriminatory behavior, and explicit measures of racial attitudes." J. Experimental Social Psychology, 37(5), 435-442.
[16] M. Nadeem, A. Bethke, S. Reddy. 2020. "StereoSet: Measuring stereotypical bias in pretrained language models." arXiv preprint arXiv:2004.09456.
[17] N. Nangia, C. Vania, R. Bhalerao, S.R. Bowman. 2020. "CrowS-Pairs: A challenge dataset for measuring social biases in masked language models." arXiv preprint arXiv:2010.00133.
[18] L. Parisi, S. Francia, P. Magnani. 2020. UmBERTo: an Italian Language Model trained with whole word Masking. GitHub. https://github.com/musixmatchresearch/umberto. Accessed 29/09/2023.
[19] F. Petroni, T. Rocktäschel, P. Lewis, et al. 2019. "Language models as knowledge bases?" arXiv preprint arXiv:1909.01066.
[20] S. Rawat, G. Vadivu. 2022. "Media Bias Detection Using Sentimental Analysis and Clustering Algorithms." Proc. ICDL.
[21] E. Sheng, K-W. Chang, P. Natarajan, N. Peng. 2019. "The woman worked as a babysitter: On biases in language generation." arXiv preprint arXiv:1909.01326.
[22] A. Silva, P. Tambwekar, M. Gombolay. 2021. "Towards a comprehensive understanding and accurate evaluation of societal biases in pre-trained transformers." Proc. NAACL-HLT.
[23] G. Sulis, V. Gheno. 2022. "The Debate on Language and Gender in Italy, from the Visibility of Women to Inclusive Language (1980s–2020s)." The Italianist, 42(1), 153-183.
[24] F. Tamburini. 2020. "How 'BERTology' Changed the State-of-the-Art also for Italian NLP." Proc. CLiC-it.
"Towards a comprehensive understanding and accurate evaluation of societal biases in pre-trained factors, such as the inappropriateness of the employed sentiment transformers." Proc. NAACL-HLT. analysis. Indeed, a qualitative evaluation of the results and the [23] G. Sulis, V. Gheno. 2022. "The Debate on Language and Gender in Italy, differences between compound scores —though not statistically from the Visibility of Women to Inclusive Language (1980s–2020s)." The Italianist, 42(1), 153-183. significant—may imply the presence of social biases. [24] F. Tamburini. 2020. How “BERTology” Changed the State-of-the-Art also for Italian NLP. Proc. CLiC-it 36 Compared to Us, They Are… : An Exploration of Social Biases Information Society 2023, 9–13 October 2023, Ljubljana, Slovenia SUPPLEMENTARY MATERIALS Table 1: Prompts with weighted mean compound scores, as obtained with RoBERTa and UmBERTo. RoBERTa UmBERTo Prompts School system Immigration LGBTQIA+ School system Immigration LGBTQIA+ community community Our country 0.37 0.40 0.33 0.35 0.41 0.32 Should X. X are -0.49 -0.32 -0.12 0.01 0.01 0.04 our country. X represent -0.49 -0.50 -0.38 0.19 -0.25 0.00 a to our country. Most of the X 0.00 -0.01 0.04 0.02 -0.01 0.01 in the US/Italy are . I believe that 0.06 0.36 0.04 0.27 0.01 0.01 X should our country. Compared to 0.25 -0.16 -0.04 0.45 -0.02 0.02 us, X are . X should 0.10 -0.15 0.40 0.05 0.14 0.04 be in our country. X are -0.03 -0.04 -0.11 0.00 0.00 0.00 in the US/Italy. We need laws 0.10 -0.12 -0.15 0.29 -0.05 -0.10 to the Y. We 0.06 -0.03 -0.30 0.23 0.06 0.06 should the Y. 37 Information Society 2023, 9–13 October 2023, Ljubljana, Slovenia J. Caporusso et al. Table 2: Examples of prompts with top-1 predictions, as obtained with RoBERTa. Prompts School Immigration LGBTQIA+ system community Compared to students criminals invisible us, X are . We need laws protect prevent prevent to the Y. We should reform control condemn the Y. Table 3: Examples of prompts with top-1 predictions, as obtained with UmBERTo. Prompts School Immigration LGBTQIA+ system community Compared to enthusiastic everywhere everywhere us, X are . We need laws improve regulate recognize to the Y. We should organize regulate introduce the Y. Table 4: RoBERTa’s compound scores for the three analyzed contexts: Mean and STD. Context Mean STD School system -0.01 0.28 Immigration -0.06 0.26 LGBTQIA+ -0.03 0.25 community Table 5: UmBERTo’s compound scores for the three analyzed contexts: Mean and STD. Context Mean STD School system 0.19 0.16 Immigration 0.03 0.17 LGBTQIA+ 0.04 0.11 community Table 6: Normalized compound scores obtained with RoBERTa and UmBERTo: Mean. Context RoBERTa UmBERTo School system 0.00 0.00 Immigration -0.05 -0.01 LGBTQIA+ -0.02 -0.03 community 38 Towards a Cognitive Digital Twin of a Country with Emergency, Hydrological, and Meteorological Data Jan Šturm Maja Škrjanc Luka Stopar Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Postgraduate School Jamova cesta 39 Jamova cesta 39 Jamova cesta 39 Ljubljana, Slovenia Ljubljana, Slovenia Ljubljana, Slovenia maja.skrjanc@ijs.si luka.stopar@ijs.si jan.sturm@ijs.si Domen Volčjak Dunja Mladenić Marko Grobelnik Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Institute Jamova cesta 39 Jožef Stefan Postgraduate School Jamova cesta 39 Ljubljana, Slovenia Jamova cesta 39 Ljubljana, Slovenia domen.volcjak@gmail.com Ljubljana, Slovenia marko.grobelnik@ijs.si dunja.mladenic@ijs.si ABSTRACT and predictive purposes. 
ABSTRACT

The paper presents a methodology for building a cognitive digital twin of a country, elaborating on its conceptual design. This study includes emergency call data, and hydrological and meteorological data. To illustrate the application of the proposed methodology, we present initial evaluation results on a use case for Slovenia, focusing on the comparison of different data sources at a selected location.

KEYWORDS

Cognitive Digital Twin, Real Time Data

1 INTRODUCTION

A cognitive digital twin of a country is a digital model that replicates a nation's physical and social characteristics to simulate and forecast its behavior in diverse circumstances, utilizing historical data and real-time information. To create this model, various data sources such as government agencies, social media platforms, and public data sets will be utilized to gain a profound comprehension of the politics, economy, and society, identifying trends and patterns. Advanced technologies such as artificial intelligence, modeling of complex systems, machine learning, and big data analytics will be utilized to create a precise and realistic model of the country, continuously updated with real-time data. This cognitive digital twin of a country will serve as a tool to test multiple scenarios and predict the country's reaction, informing policy makers, improving the nation's overall well-being and the welfare of its society, and providing crucial disaster preparedness and response capabilities by identifying potential risk or instability areas.

2 RELATED WORK

The concept of a cognitive digital twin for a nation finds its roots in the broader realm of digital twin technologies, which traditionally pertained to replicating physical systems for simulation and predictive purposes. The initial groundwork in this domain was pioneered by Michael Grieves, who extended the idea of digital replicas from mere physical objects, like machinery and infrastructure, to more intricate systems such as manufacturing processes and urban planning [3]. Over time, digital twin technology evolved from simply replicating structural details to encapsulating functional, dynamic, and behavioral aspects of the systems. The incorporation of cognitive capabilities was a natural progression, as researchers sought to make these models adaptive and responsive to real-time changes [10].

In a wider scope, a digital twin of a whole country is already in use in Singapore [7], and the application of cognitive digital twins has shown significant promise. The first architecture for a country's digital twin was conceptualized in [4], emphasizing the importance of harnessing both historical data and real-time information to create a holistic representation. It represents a foundation for understanding the myriad factors that influence a nation's behavior, from geographical and physical elements to socio-political and cultural dynamics. Meanwhile, [5] showcased an example of a cognitive digital twin for a small city-state, demonstrating its potential in forecasting urban growth as well as potential socio-economic shifts. This body of research underscores the vast possibilities of the technology, moving beyond traditional applications to better serve as a cognitive tool for city- or nation-wide policy makers.

3 METHODOLOGY

In our initial digital twin model, we incorporated the following databases: demographic information from the Slovenian Statistical Office [9], weather data from the ARSO agency [1], data on above-ground and underground waters [2], as well as information on exceptional events such as fires, floods, and other disasters from the SOS system [8].
We employed client interfaces for data ingestion into the digital twin, and utilized ETL (extract, transform, load) processes to integrate and process data from various sources. Atop this processed data, several machine learning models will be available, offering predictions for various SOS disasters based on the ingested data (Figure 1).

Figure 1: Conceptual design of a cognitive digital twin of a country

3.1 Data Clients

For the purpose of data ingestion we deployed distinct clients tailored to each data source (weather, water, and SOS events). Each of these clients has a two-fold role. First, it fetches the raw data and channels it into the system. Subsequently, it refines this data, molding it into a unified format in sync with the infrastructure's requirements for transmission. Further bolstering the precision of this process, every sensor is registered together with its unique metadata. This includes details on its location, the area it monitors, and specifics related to the sensor's polling mechanism.

3.2 ETL Pipeline

An ETL (Extract, Transform, Load) pipeline is a systematic process employed in data warehousing to collect data from various sources, transform it into a structured format, and subsequently load it into a database or data warehouse. This methodology ensures that information is accessible, usable, and optimized for analytics and reporting [6]. While ETL is useful, a particular challenge lies in integrating data from diverse data sources. Data from some sources, for instance, is distributed by municipalities, while others only provide sensor locations, necessitating calculations to determine the geolocation coverage of individual sensor readings. Demographic data, on the other hand, offers the most granular geolocation details, as the country's surface is divided into areas of varying scales: a 1 km x 1 km grid, postal areas, municipalities, and regions (Figure 3).

Figure 3: Spatial hierarchy

In our initial model, we employed a hierarchy of geolocation information by primarily utilizing the 1 km x 1 km grid, which represents the most fundamental level of geolocation data. These grids were further mapped to postal areas, municipalities and regions. Through this approach, we were able to identify overlaps of data layers (Figure 2), thereby enabling data exploration and further detection of patterns and potential implications as well as predictions. Each layer represents a separate data source, which may contain information regarding population density, classifications of rural areas, and sensor readings.

Figure 2: Conversion of geospatial formations into 1 km x 1 km squares

3.3 Feature Engineering

Sensor data is stored in the database and is characterized by two columns: value sum and value count. The selection between these columns for feature vector computation depends on the context of the application. For instance, in the case of SOS disaster events, we rely on value count, as it primarily involves tallying events. Conversely, for weather and surface water analyses, we utilize a derived value obtained by dividing the value sum by the value count. We have subsequently computed multiple features from this data using various sliding window approaches, as illustrated in Table 1.
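A minimal pandas sketch of this feature computation is given below, assuming an hourly-indexed table with the two stored columns; the column and function names are illustrative, and the window sizes mirror those that appear in Table 1.

```python
# Hedged sketch of sliding-window feature engineering over value_sum /
# value_count, with a toy hourly sensor table standing in for real data.
import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=500, freq="h")
df = pd.DataFrame({"value_sum": np.random.rand(500) * 10,
                   "value_count": np.random.randint(1, 5, 500)}, index=idx)

def build_features(df, event_like=False):
    # SOS events rely on value_count (tallies); weather/water use sum/count.
    base = df["value_count"] if event_like else df["value_sum"] / df["value_count"]
    feats = pd.DataFrame(index=df.index)
    for window in ("12h", "1D", "30D", "120D"):   # windows as in Table 1
        roll = base.rolling(window)
        feats[f"min_{window}"] = roll.min()
        feats[f"max_{window}"] = roll.max()
        feats[f"mean_{window}"] = roll.mean()
        feats[f"sum_{window}"] = roll.sum()
    feats["lag_4h"] = base.shift(freq="4h")       # e.g. "wind speed 4 hours ago"
    return feats

print(build_features(df).tail(3))
```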
4 EXPERIMENT

4.1 Dataset

The dataset in our experiments includes SOS disasters, weather, and surface water data, while other layers were not included in this paper. Data spans from January 1, 2010, to August 23, 2023. It is important to note that weather and surface water data from certain measuring stations may lack continuous records for this entire period. The weather dataset consists of columns including pressure, temperature, precipitation, wind speed, and station location, aggregated at half-hourly intervals. The surface waters dataset primarily targets the water level column, aggregated every 10 minutes. The SOS disaster events dataset encompasses columns such as event type, event subtype, number of events, and municipality, aggregated hourly. Data preprocessing encompasses two principal phases. Initially, data is categorized based on the respective sensor, location, and timestamp, with the objective of consolidating it into hourly segments. SOS events are very sparse, with a very low number of examples for some event types over the 13-year period.

4.2 Implementation Details

Experiments utilized Python 3.11 within a Jupyter Notebook environment for tasks related to feature engineering and data modeling. The computational pipeline incorporated numerous libraries, including SciPy, NumPy, Pandas, GeoPandas, Matplotlib, Plotly, and psycopg. Geospatial data, imported via psycopg, was converted into a dataframe.
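As a sketch of that loading step: the exact schema is not given in the paper, so the connection string, table, and column names below are hypothetical (psycopg 3 is shown, together with a GeoDataFrame built from WKT geometries).

```python
# Hedged sketch: pull geospatial rows via psycopg into a GeoDataFrame.
import psycopg
import pandas as pd
import geopandas as gpd
from shapely import wkt

with psycopg.connect("dbname=twin") as conn:   # placeholder DSN
    rows = conn.execute(
        "SELECT sensor_id, ts, value_sum, value_count, ST_AsText(geom) AS geom "
        "FROM sensor_readings"                  # hypothetical PostGIS table
    ).fetchall()

df = pd.DataFrame(rows, columns=["sensor_id", "ts", "value_sum",
                                 "value_count", "geom"])
gdf = gpd.GeoDataFrame(df, geometry=df["geom"].map(wkt.loads))
print(gdf.head())
```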
4.3 Experimental Results

Table 1 presents the highest correlations associated with windbreaks in Ajdovščina. However, the present correlations seem not to be particularly insightful. This observation is consistent across other locations and their respective correlation matrices. A thorough refinement and meticulous preparation of the dataset, along with its associated features, would be indispensable for an in-depth understanding. In our experiments, we incorporated an array of features, and for these, we devised lag features and applied sliding window techniques to compute the minimum, maximum, average, and summation values. We also added seasonality and a transformation of wind direction using dummy variables.

Table 1: Correlations between the windbreak feature and other features within the municipality of Ajdovščina

Correlation   Feature name
0.4952        wind speed rolling min 1 day
0.4887        wind speed rolling min 12 hours
0.4412        wind speed rolling max 30 days
0.4092        mean relative humidity very high rolling sum 120 days
0.3756        wind speed 4 hours ago

5 CONCLUSION AND FUTURE WORK

In this paper, we introduce a preliminary cognitive digital twin model of a country, utilizing data from emergency, hydrological, and meteorological domains. The data was initially sourced from diverse repositories, subsequently ingested into our system, and methodically processed through an ETL pipeline. Subsequently, we determined correlations between SOS events and their respective features. Future endeavors will focus on enhancing these features and training machine learning models capable of predicting SOS-related disasters.

6 ACKNOWLEDGMENTS

The research described in this paper was supported by the Slovenian research agency, the Ministry of Defence under the project NIP v2-1 DAP NCKU 4300-265/2022-9, and the European Union's Horizon 2020 program project Conductor under Grant Agreement No 101077049.

REFERENCES

[1] ARSO. 2023. ARSO meteo. https://meteo.arso.gov.si/met/sl/weather/fproduct/text/. Accessed 01-09-2023.
[2] ARSO. 2023. ARSO vode. https://www.arso.gov.si/vode/podatki/podzem_vode_amp/. Accessed 01-09-2023.
[3] Michael Grieves and John Vickers. 2017. Digital twin: mitigating unpredictable, undesirable emergent behavior in complex systems. Transdisciplinary perspectives on complex systems: New findings and approaches, 85–113.
[4] Daniel Jurgens. 2022. Creating a country-wide digital twin. https://www.wsp.com/en-nz/insights/creating-a-country-wide-digital-twin. Accessed 01-09-2023.
[5] Ville V. Lehtola, Mila Koeva, Sander Oude Elberink, Paulo Raposo, Juho-Pekka Virtanen, Faridaddin Vahdatikhaki, and Simone Borsci. 2022. Digital twin of a city: review of technology serving city needs. International Journal of Applied Earth Observation and Geoinformation, 102915.
[6] Joshua C. Nwokeji and Richard Matovu. 2021. A systematic literature review on big data extraction, transformation and loading (ETL). In Intelligent Computing: Proceedings of the 2021 Computing Conference, Volume 2. Springer, 308–324.
[7] ESRI Singapore. 2023. A framework to create and integrate digital twins. https://esrisingapore.com.sg/digital-twins. Accessed 01-09-2023.
[8] SOS SPIN. 2023. SPIN SOS – Uprava RS za zaščito in reševanje. https://spin3.sos112.si/javno. Accessed 01-09-2023.
[9] SURS. 2023. GIS. https://gis.stat.si/. Accessed 01-09-2023.
[10] Fei Tao, He Zhang, Ang Liu, and Andrew Y.C. Nee. 2018. Digital twin in industry: state-of-the-art. IEEE Transactions on Industrial Informatics, 15, 4, 2405–2415.

Predicting Bus Arrival Times Based on Positional Data

Matic Kladnik, Jozef Stefan International Postgraduate School, Ljubljana, Slovenia, matic.kladnik@gmail.com
Luka Bradeško, Department of Artificial Intelligence, Jozef Stefan Institute; Solvesall, Ljubljana, Slovenia, luka.bradesko@ijs.si
Dunja Mladenić, Department of Artificial Intelligence, Jozef Stefan Institute, Ljubljana, Slovenia, dunja.mladenic@ijs.si
ABSTRACT

This paper addresses predictions of city bus arrival times at bus stations, on the example of a larger EU city with more than 800 buses. We use the recent historic context of preceding buses from various routes to improve predictions, as well as the semantic context of the bus position relative to the station. For evaluation of the results, we developed a live evaluation web application which can compare the performance of prediction systems with various approaches. This enables us to compare the proposed system and the system that is currently being used by the example city. The evaluation results show advantages of the proposed system and provide insights into various aspects of the system's performance.

KEYWORDS

Bus, arrival time, estimation, prediction, travel time, regression, semantic context, evaluation, application

1 INTRODUCTION

Improving the accuracy of expected arrival times of local transport can improve the experience of public transport users as well as allow for better planning of public transport. By using recent historic travel times of other buses and the additional semantic context of the bus that is currently in the prediction process, we improve predictions of bus arrival times. These predictions are calculated in a live system and can be used in real time to inform users of the public transport system as well as to help detect traffic congestion.

The focus of this paper is on the architecture of the live travel time prediction system with which we continuously make predictions of bus arrival times, as well as on our approach to evaluating the performance of the proposed system in comparison to the currently used system. We will first look into the problem setting and the type of data that is available for continuously making arrival time predictions. Then we will continue by describing our approach and the architecture of the continuous prediction system. Lastly, we will look into the evaluation approaches that we have taken to compare the proposed system with an existing one.

2 PROBLEM SETTING AND DATA

The goal of the system is to predict the arrival time at specific stations for each bus (more on this in [1][2][6]). To do this, we compute travel time predictions from specific stations to all remaining proceeding stations of the bus, for each bus. The data is suboptimal, as we do not know the exact arrival or departure times to or from the stations (similar to [4]), which requires us to do extra processing on the data and match bus positions to stations based on the coordinates of bus locations and distances to nearby stations. To address the suboptimal detailedness of the data, we deal with detecting vicinities of buses to their applicable stations. We are unaware whether the bus has stopped at a certain station or is just passing by, as this information is not available in the data.

2.1 Bus Routes and Station Details

We use some static data, which gives details about routes. For each bus station, we have a location (latitude and longitude coordinates), along with an ID and station name. A bus route is defined with a route number, variation, and list of stations for each variation. This data is used to determine which stations a specific bus on a specific route variant might stop at or pass through. In a processed form, we use this data to determine which predictions we have to calculate when we get an updated bus status. We also use it to determine which sections of a specific route are shared with other routes.

2.2 Bus Positions

This is the main data that we use for computing predictions. Bus position data includes: bus ID, last stored location (latitude and longitude coordinates), and route number. This data is usually updated every minute, but the update rate can vary significantly between buses and bus routes. Since we do not have information about the exact arrival time at a station or departure time from a station, which would be preferable, we have to process bus positions to be able to use them as input for the prediction models.
To use bus positions as input data, we match a position to the nearest bus station, based on the available bus stations on a specific route. A bus position is only matched to a station if it is within a certain distance of the station. For best performance, we use a radius of 50 m from the station's position.

3 APPROACH DESCRIPTION

Our system uses recent historic data of travel times to include information about recent traffic flow among the features (see [7]). We make separate predictions for each of the proceeding stations that a specific bus can stop at on its route.

Figure 1: Schematic of bus routes

Let us say that bus A, for which we are making predictions, has departed station 'i' (the latest station). To get recent historic data, we check which bus routes share paths between the latest station of bus A and the target station 'j' for which we are making arrival time predictions. As we can see in Figure 1 above, the yellow route shares the path to target station 'j' with the green and blue routes. Thus, we can use the latest travel times between stations 'i' and 'j' on the yellow, blue and green routes to get the most recent data about traffic flow on this path. This is why we also consider data from other routes that share the bus path for which we are making predictions. This way we get a better recent historic context and more reliable information about current traffic dynamics. This is especially useful for routes that have less frequent buses (e.g. once every 30 minutes or even less frequently).

The diagram in Figure 2 shows the components that are active in the real-time prediction system. We continuously fetch bus positions from the public transport API several times per minute. Bus positions are matched to stations based on the geographical coordinates of the bus, the active route of the bus, and the direction of the route that the bus is taking. After filtering bus stations based on route and direction, we compute the distance to each station using the Haversine formula [9]. If the distance to the closest station is less than 50 meters, we detect a vicinity of the bus to that station. Once we have a vicinity match to a bus station, we process and insert the data into a list of detected vicinities to stations. After each fetch routine, we store the detected vicinities to stations in the bus travel time predictor's data manager component. For easier comprehension, we can say that detected vicinities to stations can be viewed as detected arrivals of buses at the stations. After the data fetch cycle is complete and the updated arrivals of buses at stations are ready in the data manager of the bus travel time prediction component, the regression machine learning model is used to predict travel times for all buses that have a new detected vicinity to a station, for all of their proceeding stations.

At any given time, users can send a POST request to the proposed approach's bus prediction server API to get predictions either for all buses, all routes, specific buses, or specific routes. The system returns predictions in a JSON object and provides users with the most up-to-date predictions for each bus.

Figure 2: Architecture of the proposed solution
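The vicinity detection just described can be sketched in a few lines of Python; the station coordinates below are arbitrary illustrative values, and the production system of course operates over the full station registry.

```python
# Sketch of vicinity matching: Haversine distance from the bus to each
# candidate station, with the 50 m detection radius described above.
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in meters."""
    p1, p2 = radians(lat1), radians(lat2)
    dp, dl = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dp / 2) ** 2 + cos(p1) * cos(p2) * sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def match_station(bus_pos, stations, radius_m=50):
    """Return the nearest station within radius_m of the bus, else None."""
    best = min(stations, key=lambda s: haversine_m(*bus_pos, s["lat"], s["lon"]))
    return best if haversine_m(*bus_pos, best["lat"], best["lon"]) <= radius_m else None

stations = [{"id": 1, "lat": 46.0569, "lon": 14.5058},
            {"id": 2, "lat": 46.0600, "lon": 14.5100}]
print(match_station((46.0570, 14.5059), stations))
```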
3.1 Positional Semantic Context

Since we have to match bus positions to stations and do not know when exactly a bus stopped, we use the positional semantic context of the bus. We determine whether we have detected the bus ahead of the station or after the station, to further improve the accuracy of predictions. When the bus is detected ahead of the latest station, we expect it to take a longer time to reach the target station than when the bus is detected beyond the latest station. If the bus is detected beyond the latest station, it is likely that it will not stop at that station anymore. To detect the relative position of the bus with respect to the latest station, we use the coordinates of the first preceding station (i-1) and the first proceeding station (i+1), in addition to the coordinates of the latest station.

3.2 Machine Learning Models

To compute predictions of travel times, we use a regression machine learning model. We have trained and evaluated models based on several machine learning algorithms: linear regression, SVM (SVR – Support Vector Regressor [3]), and an artificial neural network. We use the implementations of these algorithms available in Scikit-learn [8], a Python library for machine learning. Models were trained on several weeks of data.

For training the SVM (SVR) model we use the RBF (Radial Basis Function) kernel with the epsilon parameter equal to 10.3; the regularization parameter C is equal to 1.0. For training the neural network model we use the Multi-layer Perceptron regressor architecture [5] with 2 hidden layers (layer sizes: 15 and 8). For the weight optimization we use L-BFGS, a limited-memory approximation of the Broyden–Fletcher–Goldfarb–Shanno algorithm. The alpha hyperparameter is equal to 0.5, while the learning rate is equal to 0.005. Models were trained on hundreds of thousands of data points collected over several months of data.

The SVM model is the best performing of the tested models, which is why it is used as part of our proposed approach in the following evaluation analyses.
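The configurations quoted above translate directly into scikit-learn, as sketched below; the feature matrix and travel-time targets are random stand-ins, since the actual training data is not shown in the paper.

```python
# Sketch of the two model configurations from Section 3.2 in scikit-learn.
import numpy as np
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X_train = rng.random((200, 5))       # toy stand-in features
y_train = rng.random(200) * 600      # toy travel times in seconds

svr = SVR(kernel="rbf", epsilon=10.3, C=1.0)
# learning_rate_init follows the stated 0.005, though scikit-learn only
# applies it for the sgd/adam solvers, not lbfgs.
mlp = MLPRegressor(hidden_layer_sizes=(15, 8), solver="lbfgs",
                   alpha=0.5, learning_rate_init=0.005)

svr.fit(X_train, y_train)
mlp.fit(X_train, y_train)
print(svr.predict(X_train[:3]), mlp.predict(X_train[:3]))
```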
4 EVALUATION

We mainly use two metrics to compare the accuracy of predictions: MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error). To get a better overview of the performance of the system as a whole, we developed a web application that serves for the analysis of the system's performance.

4.1 Live Evaluation System

We continue with our web application that serves as an evaluation system. With this system we can evaluate the performance of our new system in comparison to the currently used system for predicting the arrival times of buses. Results of our new solution are shown in blue, whereas the results of the existing solution are in green. This web application can also be used for various other evaluation purposes, for example to compare updated models with earlier versions, or to compare the performance of models that are based on different algorithms. In all of the following figures, our system used the SVM (SVR) model to make predictions of bus travel times. The following figures were generated by evaluating predictions for a single route within a specific week.

To start the evaluation with an initial context of the main metrics: the proposed system has an MAE equal to 120 seconds and an RMSE equal to 11042 seconds, whereas the current system has an MAE equal to 357 seconds and an RMSE equal to 46618 seconds for the selected period on the selected route. Since it is likely that certain extreme values have affected these measurements, we will look into further analyses with which we can also get a more informative understanding of the performance of both systems and how they compare to each other.

Figure 3: Enriched screenshot of distribution of absolute errors

In Figure 3 we can see how absolute errors are distributed among error bins. Each bin represents a 30-second interval of errors: the leftmost bin represents errors from 0 up to (excluding) 30 seconds, the second bin errors from 30 up to (excluding) 60 seconds, and so on. We have to consider that more measurements are present for the proposed system (blue bars) than for the current system (green bars). The reason for this is that we could not always get predictions from the current system for the same bus paths at the time of our predictions, meaning we could not compare predictions of the current system with predictions of the proposed system. The same applies to Figure 4 and Figure 5. Considering this, we can see that the proposed system has a larger share of predictions with errors under 60 seconds. The most common error bin of the proposed system is 30+ (30 up to excluding 60 seconds), whereas for the current system it is the 60+ bin.
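For concreteness, the two headline metrics and the 30-second binning can be sketched as follows, with a handful of toy predictions standing in for real evaluation data.

```python
# Sketch of the evaluation metrics and error binning used in Figures 3-5.
import numpy as np

pred = np.array([310.0, 95.0, 40.0, 500.0])     # predicted arrival offsets (s)
actual = np.array([300.0, 150.0, 70.0, 260.0])  # observed arrivals (s)

errors = pred - actual        # positive = overshoot, negative = undershoot
mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))

# 30 s bins, clipped to [-300, 300] as in Figure 4
bins = np.arange(-300, 301, 30)
hist, _ = np.histogram(np.clip(errors, -300, 300), bins=bins)

# domain-expert bins from Figure 5: <90 s desirable, 90 s to 4 min acceptable
desirable = np.mean(np.abs(errors) < 90)
acceptable = np.mean((np.abs(errors) >= 90) & (np.abs(errors) <= 240))
print(mae, rmse, desirable, acceptable)
```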
Figure 5: Binned absolute prediction errors.

In a discussion of acceptable prediction errors, the domain experts determined that predictions with less than 90 seconds of absolute error are the most desirable, that predictions with absolute errors between 90 seconds and 4 minutes are less desirable but still acceptable, and that predictions with over 4 minutes of absolute error are unacceptable. We binned predictions into these three bins to further compare the performance of the systems. Figure 5 shows the comparison of the distributions of predictions when taking the opinions of the domain experts into account: blue parts of the bars represent the most desirable bin, orange parts the less desirable but still acceptable bin, and grey parts the unacceptable bin. In 66% of the cases, predictions of the proposed system fall into the most desirable bin, compared to 52% of the cases for the current system. The proposed system also has considerably fewer acceptable-but-undesirable predictions: 24% of predictions, in comparison with 40% for the current system. However, the current system performs slightly better on the share of unacceptable predictions: 10% of predictions from the proposed system have unacceptably high errors, while 8% of predictions from the current system belong to the unacceptable bin. Considering all angles of the analysis, we can conclude that the proposed system generally performs better than the currently used system.

5 CONCLUSION
We have presented the approach that we take as the basis of our system for predicting travel and, consequently, arrival times of buses. We described the architecture we implemented to support our approach and the continuous computation of predictions for bus arrival times. We then gave a more detailed description of our evaluation system, with which we can easily compare two prediction systems – either the proposed system against the current system, or different versions of the proposed system. With the help of the evaluation application, we determined that the proposed system generally performs better than the currently used system.

As further improvements, we could include the Relative Mean Absolute Error (often known as MAPE – Mean Absolute Percentage Error) as a metric in the evaluation system. This metric would give us a better understanding of the size of an error relative to the time the bus takes to finish the path for which the prediction was computed. We could further improve the evaluation application by adding a feature for comparing the distributions of errors with normalized bin values instead of only absolute counts, which would streamline the analysis when the numbers of examples differ between the two systems. We could also train additional machine learning models based on other algorithms, such as random forests and XGBoost, as well as include additional neural network architectures for a greater selection of models; we could then compare the performance of all trained models with our evaluation system.

ACKNOWLEDGMENTS
This work was supported by Solvesall, Carris, the Slovenian Research Agency and the European Union's Horizon 2020 program project Conductor under Grant Agreement No 101077049.

REFERENCES
[1] K. Birr, K. Jamroz and W. Kustra, "Travel Time of Public Transport Vehicles Estimation," in 17th Meeting of the EURO Working Group on Transportation, EWGT 2014, Sevilla, Spain, 2014.
[2] M. Čelan and M. Lep, "Bus arrival time prediction based on network model," in The 8th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2017), 2017.
[3] H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola and V. Vapnik, "Support Vector Regression Machines," in Advances in Neural Information Processing Systems (NIPS), 1996.
[4] A. Kviesis, A. Zacepins, V. Komasilovs et al., "Bus Arrival Time Prediction with Limited Data Set using Regression Models," in The 4th International Conference on Vehicle Technology and Intelligent Transport Systems (VEHITS 2018), 2018.
[5] F. Murtagh, "Multilayer perceptrons for classification and regression," Neurocomputing, vol. 2, no. 5-6, 1991.
[6] D. Panovski and T. Zaharia, "Long and Short-Term Bus Arrival Time Prediction with Traffic Density Matrix," IEEE Access, vol. 8, pp. 226267–226284, 2020.
[7] T. Yin, G. Zhong, J. Zhang, S. He and B. Ran, "A prediction model of bus arrival time at stops with multi-routes," in World Conference on Transport Research – WCTR 2016, Shanghai, 2016.
[8] Scikit-learn: https://scikit-learn.org/
[9] Haversine formula: https://en.wikipedia.org/wiki/Haversine_formula

Structure Based Molecular Fingerprint Prediction through Spec2Vec Embedding of GC-EI-MS Spectra

Aleksander Piciga (aleksander.piciga@gmail.com), Milka Ljoncheva (milka.ljoncheva@ijs.si), Tina Kosjek (tina.kosjek@ijs.si), Sašo Džeroski (saso.dzeroski@ijs.si)
Jožef Stefan Institute, Ljubljana, Slovenia

ABSTRACT
Identifying the molecular structure of unknown organic compounds is a major challenge when dealing with mass spectrometry (MS) data. Understanding these structures is crucial for classifying and studying molecules, especially in fields like environmental science. Research efforts in the recent two decades have resulted in the generation of rich MS data, both liquid chromatography (LC)-MS and gas chromatography (GC)-MS data, that can be exploited in exploring the possibilities of machine learning approaches in compound identification. Our approach aims to predict molecular fingerprints directly from mass spectra. Fingerprint bits correspond to molecular structures and, consequently, predicting them directly reveals the underlying features of the molecule. Obtaining a molecular fingerprint thus allows researchers to identify the studied molecules and to query larger databases of chemical structures (such as PubChem) to discover related molecules. Ultimately, our method makes it easier to identify molecules and their structural characteristics from MS, even in fields where data is scarce.

KEYWORDS
mass spectra, multi-label, Spec2Vec, prediction, Word2Vec, machine learning, embedding, molecular fingerprint, structure

1 DATA
1.1 Overview
GC-MS spectra show mass-to-charge ratios (m/z). Each GC-MS spectrum exhibits identifiable spikes called peaks, which hold significant value for compound structure classification and also correlate with structural information [3]. Mass spectrometry offers many different methods. The data used in this study (GC-MS spectra) were obtained using electron impact ionization (EI). Gas chromatography involves heating the sample, which must possess volatility and thermal stability; the ionization process, on the other hand, occurs through electron emission [5].

Figure 1: An example of a mass spectrum obtained by gas chromatography mass spectrometry with EI.

The dataset we study [7] is composed of GC-MS spectra, along with metadata about the molecules. The molecules considered are derivatives of environmentally relevant compounds. The metadata contains the molecule name, formula, exact mass, PubChem ID, InChI, InChI Key, and SMILES of the trimethylsilyl (TMS) derivative, along with identical data for the parent compound [9]. The PubChem ID refers to the PubChem database, one of the largest repositories of molecular entities. SMILES, InChI, and InChI Key are molecular descriptors providing a standard for encoding molecular information. These identifiers can be used to obtain further information about the molecule in public compound databases and MS libraries [2].

1.2 Dataset
We used spectra produced by the authors (Milka Ljoncheva), which have been made publicly available [7].
These are spectra of TMS derivatives [9]. TMS derivatives are produced by replacing the active hydrogen atom of alcohols, acids, amines, and thiols by a trimethylsilyl group. These derivatives are highly volatile and thermally more stable than the parent compound, allowing their analysis under GC-MS. Fragmentation of these derivatives is also highly structurally informative [5, 8].

The dataset is available in different formats, including .mgf, a common format for spectrometry data. These .mgf files contain the precursor mass, charge, and m/z–abundance pairs. Additional metadata is available in Excel files. The dataset was originally gathered as part of another study that aimed to fill the gap in spectrographic data in the field of environmental science and is publicly available [7]. There are a total of 3144 distinct spectra in the dataset, covering 106 unique compounds. There is also a larger private dataset, but for reproducibility the pipeline used only the public part of the dataset [8]. Each compound in our dataset contained all the required metadata and was represented by approximately 30 independent spectra. The distribution of the number of spectra per molecule is shown in Figure 2 (mean 30, min 3, max 60, std 6.85). On average, molecules have 34.6 positive labels.

Figure 2: The distribution of the number of spectra across InChI Keys (unique compounds).

2 PREPROCESSING
2.1 GC-MS Spectra
We used the matchms package to refine the metadata and spectra representations. matchms is a publicly available Python package to import, process, clean, and compare mass spectrometry data. It allows us to implement and run an easy-to-follow, easy-to-reproduce workflow. There were two main phases in the preprocessing workflow [4]:
• metadata enrichment and
• spectrum standardization.
In the metadata preprocessing phase, we extracted valuable information like the InChI Key and molecule name from the .mgf files, which often contained both pieces of data. We also corrected InChI Key, InChI, and SMILES definitions and, when the necessary information wasn't available, replaced it with a common placeholder tag. On the data side, our efforts included adding the parent mass, normalizing intensities, reducing the number of peaks to a range of 10 to 500, setting intensity thresholds between 0 and 1000, and deriving losses. We also required each GC-MS spectrum to contain no fewer than 10 peaks. These steps were crucial for getting the GC-MS spectral data ready for analysis and for removing any potentially corrupted spectra [4]. An example of the effect that processing the mass spectra peaks can have is shown in Figure 3.

Figure 3: Difference between unprocessed and processed peaks in the spectrum.
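A sketch of this two-phase workflow, assuming matchms' standard filter functions, might look as follows; the parameter values follow the paper, but this should be read as an illustration rather than the authors' exact pipeline.

    from matchms.importing import load_from_mgf
    from matchms.filtering import (add_losses, add_parent_mass,
                                   default_filters, normalize_intensities,
                                   reduce_to_number_of_peaks,
                                   require_minimum_number_of_peaks)

    def preprocess(spectrum):
        # Phase 1: metadata enrichment (harmonized metadata, parent mass).
        spectrum = default_filters(spectrum)
        spectrum = add_parent_mass(spectrum)
        # Phase 2: spectrum standardization with the reported parameters.
        spectrum = normalize_intensities(spectrum)
        spectrum = reduce_to_number_of_peaks(spectrum, n_required=10, n_max=500)
        spectrum = add_losses(spectrum)
        # Filters return None when a spectrum fails a requirement.
        return require_minimum_number_of_peaks(spectrum, n_required=10)

    spectra = [preprocess(s) for s in load_from_mgf("spectra.mgf")]
    spectra = [s for s in spectra if s is not None]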
2.2 Molecular fingerprints
Our pipeline enables the generation of common molecular fingerprints, given the molecule's InChI or InChI Key, by making queries to public APIs. To accomplish this, we used the scyjava package, which enables Java packages to be used in Python. This is convenient since our entire workflow is built in Python and we need to access the Chemistry Development Kit (CDK), written in Java [11]. Within this framework, we implemented a subset of molecular fingerprints which we tested in the study: AtomPairs2D, Circular, EState, Extended, KlekotaRoth, Lingo, MACCS, and Pubchem [11].

For our sample study, we selected the MACCS molecular fingerprint. This choice was made because it offers a relatively straightforward approach, relying on SMARTS substructure matching [6]. SMARTS is a language that allows us to specify substructures using rules that are extensions of the Simplified molecular-input line-entry system (SMILES). The molecular fingerprint is then defined by a set of these SMARTS patterns; MACCS uses 166 patterns [6].

Table 1: Examples of SMARTS patterns included in the MACCS molecular fingerprint. Definitions are from [10]; ~ represents any bond type and = represents a double bond. A more detailed definition of the language is available at https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html

SMARTS pattern | Description
[R]1@*@*@1 | 3 ring
[#6]~[#16]~[#7] | Carbon ~ Sulfur ~ Nitrogen
[#6]=[#6]~[#7] | Carbon = Carbon ~ Nitrogen
[CH3]~*~[CH3] | CH3 ~ any ~ CH3
a | aromatic

2.3 Spec2Vec
Spec2Vec [3] is a spectral similarity score inspired by Word2Vec. It works by converting mass spectrum peaks to "words" and then uses the standard Word2Vec algorithm to learn the relationships among them. It is an unsupervised algorithm, so evaluation can be performed on the same data used to train Spec2Vec models. Large pretrained models are publicly available, but custom models can be quite inexpensive to train on local data; our model was trained specifically for TMS derivatives from the public dataset. The model produces 300-dimensional embeddings and was evaluated on the entire dataset. Spec2Vec embeddings outperform traditional methods of comparing spectra, such as cosine similarity, and even modified versions that account for data noise. These embeddings also exhibit a much better correlation between high similarity scores and high structural similarity [3]. However, the structure cannot be directly derived from the latent space embedding, which is why we employ machine learning to learn these structural characteristics [3].
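A minimal sketch of training a custom Spec2Vec model on the processed spectra and embedding a single spectrum might look as follows. It assumes the spec2vec package API (SpectrumDocument, train_new_word2vec_model, calc_vector); the file name and iteration count are illustrative, not reported values.

    from spec2vec import SpectrumDocument
    from spec2vec.model_building import train_new_word2vec_model
    from spec2vec.vector_operations import calc_vector

    # Represent each spectrum's peaks (and losses) as "words".
    documents = [SpectrumDocument(s, n_decimals=2) for s in spectra]

    # Train a Word2Vec model; the default settings produce the
    # 300-dimensional embeddings used in this study.
    model = train_new_word2vec_model(documents, iterations=[25],
                                     filename="tms_spec2vec.model",
                                     workers=4)

    # Embed one spectrum: an intensity-weighted average of its word vectors.
    embedding = calc_vector(model, documents[0],
                            intensity_weighting_power=0.5,
                            allowed_missing_percentage=5.0)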
Figure 4: Overview of the prediction pipeline. Spectrogram peaks (m/z and intensities) and cleaned metadata are turned into a peak representation, embedded with Word2Vec into the Spec2Vec latent space (where cosine similarity reflects structural similarity), and converted into a binary fingerprint via multilabel prediction with binary relevance random forests; the fingerprint then supports database queries.

3 PIPELINE
Our main goal is to predict molecular fingerprints that represent structural information based on the mass spectra embeddings, following the workflow diagram presented in Figure 4. Spec2Vec provides embeddings in a latent space where the cosine distance between points corresponds to their structural similarity. The molecular fingerprint generation task is framed as multi-label classification, because each instance can exhibit multiple identifiable structural characteristics, and these correspond to multiple different bits in the fingerprint. These structural components are correlated with one another, which is another reason to treat the problem as multi-label classification rather than just multi-class classification. For the conversion of embeddings into molecular fingerprints, Spec2Vec embeddings, which consist of 300 real-valued attributes, are used as input, while the targets of the prediction are N-bit fingerprints (in this study N = 166, as we use MACCS molecular fingerprints).

4 METHODS
Multi-label classification (MLC) can be approached in many different ways. The most straightforward approach treats each label independently and trains a separate binary classifier for each label (Binary Relevance). Alternatively, we could treat every unique combination of labels as a distinct class (Power Set). However, given our 166 labels, the latter approach would create a very large number of classes, especially if we extend our research to a broader range of molecules. We chose the One-vs-Rest classifier (OVR) from sklearn, which works like Binary Relevance when provided with an indicator matrix for the target (y) values: it trains a separate estimator for each of the target indicator labels [1].

Since we have reduced the MLC task to multiple binary classifications, we need to choose a base classifier. Random Forests are used due to their empirically proven high accuracy [1], their ability to handle imbalanced data, and their good bias-variance trade-off. Other models, such as Decision Trees and Logistic Regression, were also quickly tested and proved worse in preliminary testing with twice-repeated 5-fold validation, as shown in Table 2; the worse performance and efficiency of these models are also known from the literature [1].

Table 2: Initial comparison of internal estimators.

Metric | Logistic Regression | Random Forest | Decision Tree
Hamming Loss | 0.045 | 0.043 | 0.067
Weighted F1 Score | 0.895 | 0.854 | 0.837
Label Ranking Loss | 0.016 | 0.010 | 0.182
Coverage Error | 54.601 | 42.964 | 151.832

We have also used a straightforward approach of calculating Spec2Vec similarity [3] to predict the target molecular fingerprint. First, Spec2Vec embeddings are constructed for the known molecules and stored along with their fingerprints. When predicting for a new molecule, its Spec2Vec embedding is calculated and compared to the known embeddings using the built-in function that calculates a similarity score based on cosine similarity. Voting for fingerprint labels is then done proportionally to the similarity scores. This approach, which corresponds to a weighted nearest neighbour, is further discussed in Section 5.
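As a sketch, the Binary Relevance setup can be expressed with the sklearn estimators named above; the forest settings are those reported in Section 5, while the variable names are placeholders.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.multiclass import OneVsRestClassifier

    # X: (n_samples, 300) Spec2Vec embeddings.
    # Y: (n_samples, 166) binary indicator matrix of MACCS fingerprint bits.
    clf = OneVsRestClassifier(
        RandomForestClassifier(n_estimators=100, class_weight="balanced"))

    def fit_predict(X_train, Y_train, X_test):
        clf.fit(X_train, Y_train)   # one forest per fingerprint bit
        return clf.predict(X_test)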
5 EVALUATION
We evaluated the learning methods using various metrics, with a focus on the most informative ones: hamming loss, label ranking loss, weighted F1 score, and coverage error [1]. The results of these evaluations are shown in Table 3. To ensure robust evaluation, we employed 5-fold cross-validation, which we repeated twice to obtain reliable performance measurements.

Table 3: Random Forest performance metrics. The Default Classifier always predicts the majority class for each label. Similarity Voting uses Spec2Vec similarity to proportionally vote for labels and is presented as a stronger baseline against which we can measure the improvements of our models.

Metric | Default Classifier | Similarity Voting | Random Forest
Hamming Loss | 0.083 | 0.038 | 0.043
Weighted F1 Score | 0.635 | 0.642 | 0.854
Label Ranking Loss | 0.630 | 0.083 | 0.010
Coverage Error | 166.000 | 64.794 | 42.964

Random Forests were trained for each label using the One-vs-Rest (OVR) method. Each forest had 100 estimators with balanced (inversely proportional) class weights. Impurity was measured using the Gini impurity measure and no other restricting parameters were set: the defaults of the sklearn Random Forest Classifier apply.
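All of the reported multi-label metrics are available in sklearn.metrics; a sketch of computing them from a true indicator matrix, hard predictions, and per-label probability scores (the names here are illustrative) could look as follows.

    from sklearn.metrics import (coverage_error, f1_score, hamming_loss,
                                 label_ranking_loss)

    def evaluate(Y_true, Y_pred, Y_score):
        """Y_pred: binary predictions; Y_score: per-label probabilities
        needed by the ranking-based metrics."""
        return {
            "hamming_loss": hamming_loss(Y_true, Y_pred),
            "weighted_f1": f1_score(Y_true, Y_pred, average="weighted"),
            "label_ranking_loss": label_ranking_loss(Y_true, Y_score),
            "coverage_error": coverage_error(Y_true, Y_score),
        }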
6 REPRODUCIBILITY
The whole pipeline and evaluation were built with repeatability in mind, to allow for future studies, model comparisons, and re-evaluation of results. The dataset used is public, the Spec2Vec models are built upon these data, and the model training functions along with their parameters are available in the repository github.com/alpi314/mass_spectra under the tag "article". Models are trained with fixed random seeds and are stored, together with the training parameters and the train and test data, using the pickle package. Metrics and evaluations are always stored along with the models.

Our goal isn't predicting fingerprints for known molecules, but handling new ones effectively. To test this, we deliberately removed some InChI Keys from our dataset and checked how well our models predict the structures of these unfamiliar molecules. This testing of a realistic scenario helps us understand how practical and effective our approach is when dealing with novel compounds not present in the initial training data. We performed a 10-fold validation by removing 10 InChI Keys at a time from the training data: the model was trained on the remaining ~90 InChI Keys (~2700 mass spectra) and evaluated on the ~10 unseen ones (~300 mass spectra). The results are shown in Table 5.

Table 4: Similarity Voting on unseen InChI Keys. Only the average is shown, to provide a reference point for the quality of the Random Forests; more data was not included so as not to clutter the article. Unseen InChI Keys were simulated by keeping only the test rows (unseen InChI Keys) and train columns (other InChI Keys) in the similarity matrix.

 | Hamming Loss | Weighted F1 Score | Label Ranking Loss | Coverage Error
average | 0.047 | 0.639 | 0.084 | 75.153

Table 5: 10-fold evaluation results for unseen InChI Keys, per fold.

Fold | Hamming Loss | Weighted F1 Score | Label Ranking Loss | Coverage Error
0 | 0.068 | 0.749 | 0.043 | 63.432
1 | 0.064 | 0.806 | 0.039 | 85.369
2 | 0.061 | 0.775 | 0.045 | 94.405
3 | 0.066 | 0.757 | 0.031 | 70.266
4 | 0.060 | 0.759 | 0.033 | 79.687
5 | 0.101 | 0.676 | 0.066 | 97.522
6 | 0.124 | 0.596 | 0.077 | 115.793
7 | 0.036 | 0.864 | 0.019 | 63.857
8 | 0.047 | 0.818 | 0.017 | 64.828
9 | 0.077 | 0.721 | 0.063 | 84.503
average | 0.070 | 0.752 | 0.043 | 81.966

The Random Forests' ability to predict larger numbers of unseen InChI Keys, and the effect of less training data and therefore less diverse embedding knowledge, is shown in Figure 5. Even though the label ranking loss increases, it remains well below the loss of the Default Classifier and even of Similarity Voting, even when a large number of InChI Keys are missing and the training dataset is smaller.

Figure 5: The models' ability to generalize to unseen InChI Keys (label ranking loss and train/test set size as a function of the number of removed InChI Keys).

7 CONCLUSION
Our results demonstrate that Spec2Vec embeddings of TMS derivatives can effectively be converted into molecular fingerprints using machine learning methods. These methods have proven to be reliable even when predicting molecular structures for molecules that have not been encountered before. This is significant because it allows processing new MS spectra to uncover their most likely structural components, which we can then match against databases; this structural information can be directly applied in various research studies. Our plans for future work involve expanding this approach to larger compound databases. Additionally, we plan to broaden our research to predict more SMARTS patterns as part of expanding our molecular fingerprint prediction capabilities. While we will stay focused on fingerprints for database queries, we will also be looking into predicting arbitrary SMARTS patterns.
REFERENCES
[1] Jasmin Bogatinovski, Ljupčo Todorovski, Sašo Džeroski, and Dragi Kocev. 2022. Comprehensive comparative study of multi-label classification methods. Expert Systems with Applications, 203, 117215. doi: 10.1016/j.eswa.2022.117215.
[2] Juliane Glüge, Kristopher McNeill, and Martin Scheringer. 2023. Getting the SMILES right: identifying inconsistent chemical identities in the ECHA database, PubChem and the CompTox Chemicals Dashboard. Environmental Science: Advances, 2, 4, 614. doi: 10.1039/D2VA00225F.
[3] Florian Huber, Lars Ridder, Stefan Verhoeven, Jurriaan H. Spaaks, Faruk Diblen, Simon Rogers, and Justin J. J. van der Hooft. 2021. Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships. PLOS Computational Biology. doi: 10.1371/journal.pcbi.1008724.
[4] Florian Huber, Stefan Verhoeven, Christiaan Meijer, and Hanno Spreeuw. 2020. matchms – processing and similarity evaluation of mass spectrometry data. Journal of Open Source Software, 5, 2411. doi: 10.21105/joss.02411.
[5] Rontani Jean-François. 2022. Use of Gas Chromatography-Mass Spectrometry Techniques (GC-MS, GC-MS/MS and GC-QTOF) for the Characterization of Photooxidation and Autoxidation Products of Lipids of Autotrophic Organisms in Environmental Samples. Molecules, 27, 5. doi: 10.3390/molecules27051629.
[6] Hiroyuki Kuwahara and Xin Gao. 2021. Analysis of the effects of related fingerprints on molecular similarity using an eigenvalue entropy approach. Journal of Cheminformatics, 13, 1, 27. doi: 10.1186/s13321-021-00506-2.
[7] Milka Ljoncheva, Tina Kosjek, Sašo Džeroski, and Sintija Stevanoska. 2023. GC-EI-MS datasets of trimethylsilyl (TMS) and tert-butyl dimethylsilyl (TBDMS) derivatives. Mendeley Data. doi: 10.17632/j3z5bmvmnd.6.
[8] Milka Ljoncheva, Tomaž Stepišnik, Tina Kosjek, and Sašo Džeroski. 2022. Machine learning for identification of silylated derivatives from mass spectra. Journal of Cheminformatics, 14, 1, 62. doi: 10.1186/s13321-022-00636-1.
[9] Milka Ljoncheva, Sintija Stevanoska, Tina Kosjek, and Sašo Džeroski. 2023. GC-EI-MS datasets of trimethylsilyl (TMS) and tert-butyl dimethylsilyl (TBDMS) derivatives for development of machine learning-based compound identification approaches. Data in Brief, 48, 109138. doi: 10.1016/j.dib.2023.109138.
[10] 2013. RDKit MACCS Keys. Accessed on 2023-08-31. https://github.com/rdkit/rdkit-orig/blob/master/rdkit/Chem/MACCSkeys.py.
[11] Egon L. Willighagen et al. 2017. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. Journal of Cheminformatics, 9, 1, 33. doi: 10.1186/s13321-017-0220-4.

A meaty discussion: quantitative analysis of the Slovenian meat-related news corpus

Matej Martinc (Jožef Stefan Institute, Ljubljana, Slovenia, matej.martinc@ijs.si), Senja Pollak (Jožef Stefan Institute, Ljubljana, Slovenia, senja.pollak@ijs.si), Andreja Vezovnik (University of Ljubljana, Ljubljana, Slovenia, andreja.vezovnik@fdv.uni-lj.si)

ABSTRACT
We conduct a quantitative analysis of meat-related news in the Slovenian news media. As a first step, we construct a corpus containing news articles related to the topic of meat. Next, we conduct a topical and temporal analysis of the corpus using state-of-the-art natural language processing techniques for topic modeling and semantic change detection. The results show that economic topics related to meat, which prevailed more than a decade ago, are being replaced by cultural (especially culinary), ecological, and health topics. The results also indicate a trend in Slovenian news coverage of framing veganism in relation to health and the environment.

KEYWORDS
news analysis, topic modeling, semantic change detection

1 INTRODUCTION
In this study, we focus on the media coverage of a subject that is becoming more important due to its connection to the health and ecological issues of contemporary societies: meat. On one hand, meat is seen as a perfect nutritional pack, and its consumption is considered natural, normal, necessary, and enjoyable [10].
On the • three daily newspapers with long tradition, published on- other hand, meat production heavily impacts the environment line and in print, Delo, Večer and Dnevnik, and can be seen as unhealthy and unsafe for human consumption • the weekly issues of the publishers under item 1, Delo [2]. These angles are reflected in news media debates, which lately - Sobotna priloga, Dnevnik - Dnevnikov objektiv, showed a significant presence of anti-meat consumption and/or Večer - V soboto, and Večer v nedeljo, published on production narratives [9]. Several studies have also pointed out the weekends, increased media coverage of veganism [7] and meat alternatives, • 24ur.com, which is the most visited web news portal especially cultured meat, produced by culturing animal cells in in Slovenia, and Rtvslo.si is a web news portal of the vitro [4]. Slovenia’s national public broadcasting organization. While several studies explored different meat narratives in English news media [9, 4], analysis of meat narratives in the Slove-2.2 Topical analysis nian news remains a research gap. To fill this gap, we conduct a We propose a two step corpus analysis approach in order to quantitative analysis of how the concept of meat is presented in determine the main topics emerging in relation to meat in the the Slovenian media and try to identify stable trends in the news Slovenian news corpus and to explore how these topics change about meat, in order to show how the notion of meat changed in through time. In the first step, we use BERTopic [3] to determine Slovene news media over time. For the analysis, we employ state-the main topics in the corpus. It uses Sentence Transformers [11] of-the-art (SoA) natural language processing (NLP) techniques, to generate document representations. These representations which have proved themselves useful for analysis of social trends are clustered using Hierarchical density based clustering (HDB- and topics in different languages. To identify main topics related SCAN) [8]. Finally, coherent topic representations are extracted to the concept of meat and to detect temporal trends concerning by employing a class-based variation of a term frequency-inverse attitudes towards meat, we employ BERTopic [3], the current SoA document frequency (TF-IDF). The resulting topic distribution approach for topic identification based on clustering of contex- across corpus obtained by BERTopic is different from the distri- tual embeddings, on the corpus of Slovenian news. To investigate bution obtained by conventional topic models, such as Latent changes in attitudes towards some specific meat related topics, Dirichlet allocation, since each document in the corpus only Permission to make digital or hard copies of part or all of this work for personal belongs to either one or none of the topics. or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and 1 the full citation on the first page. Copyrights for third-party components of this Due to the morpohological richness of Slovenian, the search query did not cover work must be honored. For all other uses, contact the owner /author(s). only basic form of each word, but also several of its morphological derivatives. Information Society 2023, 9–13 October 2023, Ljubljana, Slovenia 2 This time period was chosen due to the lack of available articles before the year © 2023 Copyright held by the owner/author(s). 
2.2 Topical analysis
We propose a two-step corpus analysis approach in order to determine the main topics emerging in relation to meat in the Slovenian news corpus and to explore how these topics change through time. In the first step, we use BERTopic [3] to determine the main topics in the corpus. BERTopic uses Sentence Transformers [11] to generate document representations, which are clustered using hierarchical density based clustering (HDBSCAN) [8]. Finally, coherent topic representations are extracted by employing a class-based variation of term frequency–inverse document frequency (TF-IDF). The resulting topic distribution across the corpus is different from the distribution obtained by conventional topic models, such as Latent Dirichlet Allocation, since each document in the corpus belongs to either one topic or none.

By not restricting the number of topics, the model returns 156 topics. Manual inspection revealed that most of these topics are too specific, i.e. they describe just one or two specific meat-related events covered in the Slovenian news. To solve this problem, we reduce the number of topics by iteratively merging the class-based TF-IDF representation of the least common topic with its most similar one, in order to obtain a predefined number of k topics (see [3] for details). We set k to 20, which represents a balanced trade-off between the interpretability allowed by a small number of topics and the specificity offered by a large number of topics. The obtained topics were manually inspected and grouped into five manually defined categories related to the object of meat, according to the common thread pervasive across several topics. This manual grouping into larger categories (e.g. economy, ecology, ...) allows us to determine the relative importance of several "general" aspects of news covering meat in the contemporary media landscape. It also allows us to focus the next step of our analysis on the more interesting aspects of news on meat, i.e. aspects which show clear increasing or decreasing temporal trends.
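A minimal sketch of this step with the bertopic package could look as follows; the embedding model is the one named in Section 3.1, and the variable articles is a placeholder for the corpus documents.

    from bertopic import BERTopic

    # Multilingual sentence-transformer backend; nr_topics=20 triggers the
    # iterative merging of the least common topics described above.
    topic_model = BERTopic(
        embedding_model="paraphrase-multilingual-MiniLM-L12-v2",
        nr_topics=20)

    topics, probs = topic_model.fit_transform(articles)
    # Topic -1 marks outlier documents that HDBSCAN assigns to no topic.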
2.3 Temporal analysis
To determine how the topic of meat changes over time, the corpus is split into temporal slices. We calculate the topic distribution for each slice in order to obtain relative counts for each topic, i.e. the number of articles belonging to a single topic divided by the number of all articles published in the specific time slice that belong to any topic (articles classified as not belonging to any topic are disregarded in this calculation). This allows us to determine the relative "importance" of a specific topic in a specific time period and to identify increasing or decreasing trends for specific topics by visualizing how the relative importance changes across time. The same procedure is applied to determine the relative "importance" and detect trends at the level of the manually defined categories.

For topics which show an increasing coverage trend and are more interesting from a sociological point of view, we also conduct an additional temporal analysis, employing a procedure similar to the one proposed by Martinc et al. [6], where the information from the set of contextual token embeddings is aggregated into temporal representations by averaging. More specifically, we use a Transformer language model to generate contextual token embeddings. Embeddings of tokens that have the same lemma and appear in the same temporal chunk are averaged in order to obtain a temporal vector representation for the specific lemma. These vectorised temporal representations are used for a focused analysis of manually selected concepts (i.e., "meat" and "vegan") and their semantic correlation (measured with the cosine distance between temporal representations) to words representing a specific topic.

While in Martinc et al. [6] temporal representations were generated for an entire corpus, in our approach we propose a filtering step based on the preceding topic modeling step. BERTopic uses HDBSCAN for topic clustering, a soft-clustering approach that allows noise to be modeled as outliers. The authors claim that this prevents unrelated documents from being assigned to any of the topics and generally improves topic representation [3]. Since in our temporal analysis we are interested in historical trends, i.e. consistent changes through time that reflect cultural and social shifts in attitudes towards meat, we hypothesise that removing the outlier documents not belonging to coherent topics might allow us to conduct a more focused temporal analysis, which covers only the main topical trends and disregards semantic changes in word meaning that occur due to events covered in the news that do not reflect broader cultural trends or narratives. For this reason, we filter out articles not belonging to any topic and generate temporal lemma representations only on articles belonging to topics assigned by BERTopic.

3 EXPERIMENTS
3.1 Experimental setting
The experiments are conducted on the Slovenian news corpus described in Section 2.1. For topic modeling, we employ BERTopic with a multilingual embedding model, namely the "paraphrase-multilingual-MiniLM-L12-v2" Sentence Transformer from the Huggingface library (https://huggingface.co/), since no monolingual Sentence Transformer model exists for Slovenian. For the generation of temporal representations, we employ the SloBERTa model [12]. As mentioned in Section 2.3, the temporal representations are created by averaging token embeddings appearing in the same time slice and having the same lemma. To obtain the lemmas, we label the entire corpus with the Classla lemmatizer [5].
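A simplified sketch of this aggregation step, using SloBERTa via the transformers library, might look as follows; this is our illustration and glosses over the exact lemma-to-subword alignment used by the authors.

    import torch
    from collections import defaultdict
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta")
    model = AutoModel.from_pretrained("EMBEDDIA/sloberta")

    def temporal_lemma_vectors(slice_sentences):
        """slice_sentences: iterable of (sentence, {word_index: lemma})
        for one time slice; returns one averaged vector per lemma."""
        sums, counts = defaultdict(lambda: 0), defaultdict(int)
        for sentence, lemma_at in slice_sentences:
            enc = tokenizer(sentence, return_tensors="pt", truncation=True)
            with torch.no_grad():
                hidden = model(**enc).last_hidden_state[0]  # (tokens, dim)
            for word_idx, lemma in lemma_at.items():
                # Average the subword embeddings belonging to this word.
                sub = [i for i, w in enumerate(enc.word_ids()) if w == word_idx]
                if sub:
                    sums[lemma] = sums[lemma] + hidden[sub].mean(dim=0)
                    counts[lemma] += 1
        return {lemma: sums[lemma] / counts[lemma] for lemma in sums}

Cosine similarities between such per-slice vectors of two lemmas (e.g. "vegan" and "healthy") can then be tracked across slices to detect the trends discussed below.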
3.2 Results
The English translations of the obtained topics are presented in Table 2. 9,335 articles were labeled as not belonging to any specific topic. Among the categorized articles, most were categorised into the topic "restaurant, wine, kitchen, meat, culinary", which contains 745 articles describing Slovenian gastronomy. The smallest were the topics containing articles about the influence of the meat industry on the environment, public health, and veganism, each containing only about 100 articles.

Table 2: Topics and manually defined categories in the Slovenian meat corpus.

Category | Translated topic | Count
economy | percentage, inflation, price increase, chicken, food | 228
economy | euro, ljubljana, million, company | 202
economy | bank, mip, euro, million, supervisory | 125
economy | slovenian, food, quality, consumer, percentage | 646
economy | slovenian, company, mercator, euro, million | 204
culture | book, other, write, story, time | 148
culture | show, theatre, director, festival, theatrical | 207
culture | tourism, time, old, big, house | 336
culture | restaurant, wine, kitchen, meat, culinary | 745
ecology and health | vegan, child, animal, veganism | 114
ecology and health | water, dioxide, greenhouse, carbon, energy | 104
ecology and health | fat, cholesterol, diet, food, health | 138
ecology and health | marine, whaling, dolphin, fish, allowed | 114
agriculture | milk, agriculture, percentage, organic, Slovenian | 239
agriculture | meat, kebab, horse, product, dioxin | 319
other | other, can, life, time, world | 429
other | coach, team, season, play, championship | 346
other | oil, meat, minute, water, paprika | 299
other | prison, police officer, prosecution, convicted, euro | 201
other | election, president, agreement, government, political | 383
not categorized | / | 9335

Manual inspection of the different topics revealed that several topics can be further aggregated into broader categories, due to the fact that several topics cover semantically similar content (e.g., the topics "euro, ljubljana, million, company" and "bank, mip, euro, million, supervisory" both include financial news about different Slovenian meat companies). More specifically, the topics were manually categorized as "economy", "culture", "ecology and health", "agriculture", and the category "other", which contains articles covering several topics with very different content that cannot be combined into a broader semantic category, such as sport, lifestyle, recipes, politics, and judiciary. Ignoring the category named "other", most articles covered economy and culture. These categories were identified based on previous sociological research on meat [13]. By combining some topics into broader categories, besides the temporal analysis of the somewhat specific topics, we are also able to conduct a temporal analysis on a more general level that might allow us to detect how distinct general aspects of meat-related news lose or gain popularity through time.

Figure 1: Category distribution across time.

Figure 1 shows the distribution of categories across time. While economic topics were the most prevalent in 2008/2009, the graph also shows a clear decreasing trend for this category after 2010. The strongest upward trend is in the number of articles belonging to the category "other", which becomes the most dominant in 2016/2017. The production of articles covering cultural topics had also been steeply increasing until 2014/2015; after that, a gradual decline is observed. While agricultural topics do not indicate any clear positive or negative trend throughout the years, the ecology and health topics appear to be gaining popularity in recent years, especially from 2012/2013 onward.

Figure 2: Relative counts for topics in the categories "agriculture", "culture", "ecology and health", "economy", and "other".

Figure 2 shows relative counts (i.e. the number of articles belonging to specific topics divided by all articles that were assigned a topic) for the topics inside each category. In this fine-grained view, one can see that the rise in culture-related topics can be attributed to the major increase in the number of articles belonging to the topic "restaurant, wine, kitchen, meat, culinary" in 2012/2013, which mostly covers Slovenian gastronomy. When it comes to economic topics, all but one topic in this category (the exception being "slovenian, food, quality, consumer, percentage", which differs from the other economic topics by being more focused on the quality/price ratio) decline significantly in terms of relative count in 2010/2011. In the ecology and health category, one can see an increase in the relative count of the topics covering veganism and over-fishing. While the popularity of the topic covering the health benefits and drawbacks of meat is also increasing, the environmental topics related to global warming have decreased in popularity since their peak in 2010/2011. In the agriculture category, we see clear peaks in discussion on the topic "meat, kebab, horse, product, dioxin", which includes coverage of several scandals related to meat production and products in specific years. The topic most responsible for the increasing trend in the "other" category is "oil, meat, minute, water, paprika", which mostly covers articles about food recipes.

Finally, we discuss the results of the focused temporal analysis for the two manually selected concepts, "meat" and "vegan" (see Figure 3). We decided to explore an aspect of meat related to the creation of cultured meat (meat produced from animal stem cells) and plant-based meat analogues, which was not detected in our automatic topic analysis due to the scarcity of journalistic articles addressing cultured meat, but has nevertheless been addressed by several scholars studying the media representation of cultured meat [1]. We looked into the semantic similarity between the word "meat" and the words "artificial", "laboratory", and "substitute". One can see that the cosine similarity between "meat" and all related concepts peaks in 2012/2013. This coincides with the development of cultured meat and plant-based meat analogues and the consequent news reporting on it: the first public tasting of a cultured burger occurred in 2013 in London.
After 2012/13, only the cosine similarity between "substitute" and "meat" keeps increasing, while we see stagnation or even a gradual decrease in semantic similarity for the other two concepts. This suggests that the Slovenian news media is not significantly expanding its coverage of the production of artificial meat in recent years. Due to the findings of the automated temporal topic analysis, which suggest constant growth in the popularity of the topic covering veganism, we also opted for a further analysis of the word "vegan". We were interested in how the concept is correlated with the words "healthy", "environment", "ecological", and "climate change", in order to test the hypothesis that the news media is increasingly connecting veganism to ecological and health-related issues. The results indicate a stable positive trend throughout the years in terms of cosine similarity between veganism and the selected concepts, confirming our hypothesis.

Figure 3: Cosine similarity (CS) between the words "vegan" (left) and "meat" (right) and selected concepts.

4 CONCLUSION
In this study, we have conducted a quantitative analysis of meat-related news in the Slovenian news media. We constructed a corpus of meat-related news articles and conducted a topical and temporal analysis of the corpus using several SoA NLP techniques. We identified the main meat-related topics and trends and detected which meat-related topics are gaining or losing media coverage and popularity.
The results indicate that topics related to the meat economy are losing ground to cultural (especially culinary), ecological, and health topics. On the other hand, agricultural topics are neither gaining nor losing news coverage across time. The topic of artificial meat is not yet thoroughly covered in Slovenian media and has not been gaining further traction since the initial increase in coverage in 2012/2013. On the other hand, the results show that there is a semantic relation between the words vegan, healthy, and ecological, which is also slowly strengthening over time. In the future, we will further explore the main developments of the meat narrative in Slovenian media by gathering a larger corpus covering more media sources, which will allow us to employ other approaches for topic analysis and semantic change detection that require more data. We will also explore other concepts and discourses in Slovenian media besides meat, such as immigration, using techniques similar to the ones proposed in this work. Finally, we plan to expand the analysis to also cover media reporting in neighboring countries.

5 ACKNOWLEDGMENTS
The authors acknowledge the financial support of the Slovenian Research Agency through the research core funding for the programme Knowledge Technologies (No. P2-0103) and the project Computer-assisted multilingual news discourse analysis with contextual embeddings (No. J6-2581).

REFERENCES
[1] Sghaier Chriki, Marie-Pierre Ellies-Oury, Dominique Fournier, Jingjing Liu, and Jean-François Hocquette. 2020. Analysis of scientific and press articles related to cultured meat for a better understanding of its perception. Frontiers in Psychology, 11, 1845.
[2] International Agency for Research on Cancer et al. 2015. IARC monographs evaluate consumption of red meat and processed meat. World Health Organization. http://www.iarc.fr/en/mediacentre/pr/2015/pdfs/pr240_E.pdf.
[3] Maarten Grootendorst. 2022. BERTopic: neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
[4] Patrick D Hopkins. 2015. Cultured meat in western media: the disproportionate coverage of vegetarian reactions, demographic realities, and implications for cultured meat marketing. Journal of Integrative Agriculture, 14, 2, 264–272.
[5] Nikola Ljubešić and Vanja Štefanec. 2020. The CLASSLA-StanfordNLP model for lemmatisation of non-standard Serbian 1.1. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1351.
[6] Matej Martinc, Petra Kralj Novak, and Senja Pollak. 2020. Leveraging contextual embeddings for detecting diachronic semantic shift. In Proceedings of the Twelfth Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 4811–4819. ISBN: 979-10-95546-34-4.
[7] Helen Masterman-Smith, Angela T Ragusa, and Andrea Crampton. 2014. Reproducing speciesism: a content analysis of Australian media representations of veganism. In Proceedings of the Australian Sociological Association Conference.
[8] Leland McInnes, John Healy, and Steve Astels. 2017. HDBSCAN: hierarchical density based clustering. Journal of Open Source Software, 2, 11, 205.
[9] Gilly Mroz and James Painter. 2022. What do consumers read about meat? An analysis of media representations of the meat-environment relationship found in popular online news sites in the UK. Environmental Communication, 1–18.
[10] Jared Piazza, Matthew B Ruby, Steve Loughnan, Mischel Luong, Juliana Kulik, Hanne M Watkins, and Mirra Seigerman. 2015. Rationalizing meat consumption. The 4Ns. Appetite, 91, 114–128.
[11] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
[12] Matej Ulčar and Marko Robnik-Šikonja. 2021. SloBERTa: Slovene monolingual large pretrained masked language model.
[13] Andreja Vezovnik and Tanja Kamin. 2020. Good food for the future: an exploration of biocapitalist transformation of meat systems. Discourse, Context & Media, 33, 100354.

Slovene Word Sense Disambiguation using Transfer Learning

Zoran Fijavž (University of Ljubljana, Faculty of Education, Slovenia, zoran.fijavzz@gmail.com), Marko Robnik-Šikonja (University of Ljubljana, Faculty of Computer and Information Science, Slovenia, marko.robnik@fri.uni-lj.si)

ABSTRACT
Word sense disambiguation is an important task in natural language processing and computational linguistics with several practical applications, such as machine translation and speech synthesis. While the bulk of research efforts is targeted at English, some multilingual resources which include Slovenian have emerged recently. We utilized the Elexis-WSD dataset and a multilingual large language model to train models for word sense disambiguation in Slovenian, using sentence pairs with matching lemmas and matching or different word senses. The best model achieved an F1 score of 81.6 on a Slovenian test set, although the latter had a restricted vocabulary due to filtering and is not comparable to other testing frameworks. The exhaustive generation of sentence pairs for given lemmas and senses did not improve model performance and reduced the performance in out-of-vocabulary testing. Training on a mixed English-Slovene dataset maintained high test set as well as out-of-vocabulary results.

KEYWORDS
word sense disambiguation, transfer learning, multilingual transformer

1 INTRODUCTION
Word sense disambiguation (WSD) aims to identify the correct word sense used in a particular context. It is a long-standing problem in the field of computational linguistics and is important for downstream applications such as machine translation, information retrieval, text mining, and speech synthesis. Recent WSD approaches use pre-trained large language models such as BERT [3], fine-tuning them on annotated data. As with most supervised machine learning approaches, there is a bottleneck in the acquisition of high-quality training data. The problem is severe, as standard WSD approaches treat each word sense as a separate target label. A partial solution is to use multilingual pretrained models that can leverage several WSD datasets. In this paper, we demonstrate a methodology for cross-lingual transfer learning for WSD in Slovene that does not require compatible sense inventories in different languages. The proposed approach also works on out-of-vocabulary data. After outlining related work in Section 2, we describe the WSD models we developed for Slovene in Section 3 and their evaluation in Section 4. In Section 5, we provide an interdisciplinary critique of the current approaches to WSD that may be informative for future research. Section 6 presents the conclusions and ideas for further work.

2 RELATED WORK
One of the first WSD algorithms was Lesk [11], with its various extensions, based on the word overlap between pre-defined sense definitions and target sentences. Conceptually, modern approaches to WSD remain strikingly similar, with advances stemming mostly from increasingly complex word representations (e.g. contextual word embeddings) and expansive lexicographical resources (e.g. a gloss list for word senses in SemCor). Recent approaches use supervised learning directly on word sense annotations [5], enrich sense definitions with various lexicographical resources [7, 19], and include lexical databases as graph data in conjunction with contextual word embeddings [2]. Until recently, the development of contemporary WSD models for Slovenian was hindered by a lack of available datasets. That was partly addressed by the inclusion of Slovenian in the multilingual Elexis-WSD and XL-WSD datasets [12, 16]. Models trained on the latter obtained an F1 score of 68.36% for Slovene WSD, which is significantly lower than state-of-the-art English models scoring 80% or above (although differing test frameworks preclude direct comparisons).
3 METHODOLOGY
In this section we describe the training procedure, data preparation and testing framework used to develop and test the Slovenian WSD models.

3.1 Training Task and Setup
We operationalized WSD as a sentence-pair binary classification task that distinguishes between sentence pairs with an identical or a distinct word sense of a target lemma. Word senses were thus defined solely through annotated examples, without the need for a secondary source of sense definitions (e.g. sense collocations, coarse semantic tags or glosses). Casting WSD as a binary classification task allowed us to combine Slovene and English datasets, as sentence pairs could be generated from different WSD datasets irrespective of sense inventory compatibility. Examples of the sentence pairs can be found in Table 1. The drawback of this approach was a significant data loss from filtering, as many lemmas did not have enough senses and usage examples to generate sentence pairs.

Table 1: Two examples of the lemma Cirkus in the pair dataset and their English translations.

Lemma | Sentence 1 | Sentence 2 | Match
Cirkus | Družina na sliki s 'cirkusom' postuje po deželi. | Uprava 'cirkusa' ni odpovedala predstave. | Yes
Circus | The family in the photo travels around the country with the 'circus'. | The 'circus' management did not cancel the show. | Yes
Cirkus | Uprava 'cirkusa' ni odpovedala predstave. | Zganjali so 'cirkus' okrog družinskih vrednot. | No
Circus | The 'circus' management did not cancel the show. | They were making a 'circus' around family values. | No

For the base model, we used the pre-trained model CroSloEngual BERT [22], which can encode Slovenian, Croatian, and English texts. To reduce the training time and computational requirements, we used bottom layer freezing [10], gradient accumulation, and early stopping for non-converging models. Hyperparameter tuning was done on a 10% sample of the training data. We set the learning rate to 3e-5, gradient accumulation steps to 16, the batch size to 48, and the number of epochs to 2. Training a single model on 20% of all Slovenian sentence pairs required approximately 4 hours using a 16 GB NVidia GPU.
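A condensed sketch of this setup with the Hugging Face Trainer, using the reported hyperparameters, might look as follows; this is not the authors' exact script, and the freezing cut-off depth of 8 layers is illustrative, as the paper does not report it.

    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    name = "EMBEDDIA/crosloengual-bert"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(
        name, num_labels=2)

    # Bottom layer freezing: keep only the top encoder layers trainable.
    for layer in model.bert.encoder.layer[:8]:
        for param in layer.parameters():
            param.requires_grad = False

    args = TrainingArguments(
        output_dir="wsd-model", learning_rate=3e-5, num_train_epochs=2,
        per_device_train_batch_size=48, gradient_accumulation_steps=16)

    # train_dataset is assumed to yield tokenized sentence pairs, e.g.
    # tokenizer(sentence1, sentence2, truncation=True), with 0/1 labels.
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()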
3.2 Data Preparation
We used both Slovenian and English WSD datasets. The Slovenian data was obtained from the Slovenian section of the Elexis-WSD corpus [12], and the English data was drawn from SemCor so as to approximately match the size of the filtered Slovenian data.

Over 50% of the original Slovenian lemmas had a single sense tag. We removed multi-word and hyphenated senses and repeatedly filtered the dataset until there were at least two senses per lemma with at least four examples each. The original dataset was thus heavily filtered, from 202,240 sentences with 5,604 lemmas and 11,069 word sense tags to 139,445 sentences with 1,597 lemmas and 4,633 word sense tags. Punctuation was removed and target words were enclosed in apostrophes as a weak supervision signal [7].

The filtered Slovenian dataset was split into train, test and validation datasets. For the test dataset, we sampled two or eight sentences per word sense (depending on the total number of available sentences). The lower limit was needed to create sentence pairs and the upper limit was used to prevent frequent lemmas and senses from giving overly optimistic test scores. The validation dataset was created by sampling four sentences per word sense from lemmas with at least eight sentences, assuming frequent senses would be sufficient to detect over- and under-fitting. The remainder of the data was kept for training. The Slovenian training and testing datasets contained the full coverage of the included Slovenian word senses (4,633 distinct senses), and the validation dataset contained 1,743 senses. All Slovenian datasets included the full coverage of the included lemmas (1,597). The Slovenian training dataset contained 104,316 unique sentences, the testing set 28,157 sentences, and the validation dataset 6,972 sentences.

The filtered Slovene datasets were transformed into a dataset of sentence pairs by generating sentence combinations between sentences sharing a lemma. We limited the number of non-matching combinations to the number of possible matching combinations for each word sense. By storing infrequent sense pairs and downsampling frequent ones, we created two smaller Slovene sentence-pair datasets with the size of 10% and 20% of the original dataset.
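An illustrative sketch of the pair generation (the downsampling of non-matching pairs described above is omitted for brevity):

    from itertools import combinations

    def make_pairs(examples):
        """examples: list of (lemma, sense_tag, sentence) tuples; returns
        (sentence1, sentence2, label) with label 1 for matching senses."""
        by_lemma = {}
        for lemma, sense, sentence in examples:
            by_lemma.setdefault(lemma, []).append((sense, sentence))
        pairs = []
        for items in by_lemma.values():
            # All pairs of sentences that share a lemma.
            for (s1, t1), (s2, t2) in combinations(items, 2):
                pairs.append((t1, t2, int(s1 == s2)))
        return pairs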
The Slovenian Slovene dataset, the 10% English dataset, the 20% English dataset data was obtained from the Slovenian section of the Elexis-WSD (with and without early stopping) and the mixed 20% dataset corpus [12] and the English data was drawn from SemCor to (a concatenation of the 10% Slovene and English datasets). approximately match the size of the filtered Slovenian data. 3.3 Evaluation Settings Over 50% of the original Slovenian lemmas had a single sense tag. We removed multi-word and hyphenated senses and repeat- Model performance was measured using the 𝐹 score and the 1 edly filtered the datasets until there were at least two senses per Matthews correlation coefficient (MCC). The latter is a chi-square lemma with at least four examples. The original dataset was thus statistic computed from the confusion matrix of classification heavily filtered from 202,240 sentences with 5,604 lemmas and results. It served as an additional performance metric and en- 11,069 word sense tags to 139,445 sentences with 1,597 lemmas abled us to compare models without having to predict specific and 4,633 word sense tags. Punctuation was removed and target word sense tags (e.g., evaluate models on the OOV dataset with words were enclosed in apostrophes as a weak supervision signal dissimilar lemmas and sense tags). [7]. Two methods were used to predict the sense classes on the The filtered Slovenian dataset was split into train, test and Slovenian test set. The first prediction method, called the average validation datasets. For the test dataset, we sampled two or eight sense probability heuristic (ASP) used the test set structure with sentences per word sense (depending on the total number of the models’ binary classifier to determine the most likely sense. available sentences). The lower limit was needed to create sen- The target sentence was combined with all other test sentences tence pairs and the upper limit was used to prevent frequent sharing a lemma (except with itself ) and a softmax value was lemmas and senses from giving overly optimistic test scores. The obtained for each pair. The softmax values were averaged based validation dataset was created by sampling four sentences per on the sense tag of the non-target sentence and the sense with word sense from lemmas with at least eight sentences, assuming the highest average score was chosen as the sense prediction for frequent senses would be sufficient to detect over- and under- the target sentence. The second prediction method used near- fitting. The remainder of the data was kept for training. The est neighbour matching between target sentence embeddings Slovenian training and testing datasets contained the full cov- and sense embeddings. The latter were created by converting the erage of included word Slovenian senses (4,633 distinct senses) entire Slovenian training and validation dataset into sentence and the validation dataset contained 1,743 senses. All Slovenian embeddings [18] and averaging them by their word sense tags. datasets included the full coverage of included lemmas (1,597). The test sentences were likewise embedded and their sense label The Slovenian training dataset contained 104,316 unique sen- was predicted by selecting the sense embedding with the lowest tences, the testing set 28,159 sentences and the validation dataset cosine distance from the target sentence embedding. 6,972 sentences. 
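The nearest-neighbour method can be sketched as follows, assuming sentence embeddings (e.g., from Sentence-BERT [18]) have already been computed. This is an illustrative reimplementation with hypothetical helper names, not the authors' code; random vectors stand in for real embeddings.

    import numpy as np

    def sense_centroids(embeddings, sense_tags):
        """Average sentence embeddings (rows) by their word-sense tag."""
        centroids = {}
        for tag in set(sense_tags):
            rows = embeddings[[i for i, t in enumerate(sense_tags) if t == tag]]
            centroids[tag] = rows.mean(axis=0)
        return centroids

    def predict_sense(test_vec, centroids):
        """Pick the sense whose centroid has the lowest cosine distance."""
        def cos_dist(a, b):
            return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        return min(centroids, key=lambda tag: cos_dist(test_vec, centroids[tag]))

    # Toy example with random "embeddings" standing in for Sentence-BERT output.
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(6, 8))
    tags = ["s1", "s1", "s1", "s2", "s2", "s2"]
    print(predict_sense(emb[0], sense_centroids(emb, tags)))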
The most frequent sense (MFS) heuristic as well as the sense embedding predictions from an untrained model were used as performance baselines. Lastly, several F1 scores per model (micro-F1, macro-F1, and micro-F1 by POS tags) were used as repeated measures for model comparison using the Friedman test with the Nemenyi post-hoc test.

4 RESULTS
We evaluated model predictions with binary classifiers and with nearest neighbour matching to sense embeddings. Additionally, we used the Matthews correlation coefficient to evaluate the performance of the binary classifiers and to evaluate model performance on the out-of-vocabulary dataset.

4.1 Binary Classifier Sense Predictions
The baseline F1 from the MFS heuristic was 40.4%. The difference between model predictions was statistically significant (chi-square_F = 36.12; df = 5; n = 8; p < 0.001), with the top three models differing significantly from the MFS baseline: the models trained on the mixed 20% training data (F1 = 81.6; p = 0.001), on the 10% Slovene data (F1 = 81.4; p = 0.026), and on the entire Slovene dataset (F1 = 81.0; p = 0.004). Detailed results from predictions with binary classifiers can be found in Table 3. The statistical differences between binary classification models are presented in Figure 1.

Table 3: F1 scores of binary classifier predictions.

  Model                     Micro-F1
  MFS baseline              40.4
  Full Sl.                  81.0
  10% Sl.                   81.4
  20% Sl.                   80.5
  10% En.                   68.7
  20% En.                   46.9
  20% En. (early stopping)  80.6
  20% mix                   81.6

Figure 1: Critical distance diagram for binary classification results.

4.2 Binary Classifier Correlation Metrics
As the test set was transformable into sentence pairs, we used the binary classifiers directly on the test set and computed an MCC without predicting sense labels. We applied the same procedure to test model performance on the OOV dataset.

The highest correlation between actual and predicted binary labels was achieved by the model trained on the entire Slovenian dataset (MCC = 0.629), followed by the models trained on the 20% Slovene and 20% mixed datasets (MCC = 0.578 for both). The highest correlation between actual and predicted labels on the OOV dataset was achieved by the model trained on the 20% English dataset with early stopping (MCC = 0.353), followed by the 20% mixed dataset (MCC = 0.326). It should be noted that the former was a base model with minimal updates, as training stopped after a single update at 200 out of 1916 total steps. Interestingly, ranking the models by the amount of included training data revealed a positive correlation between the number of included examples and the test set MCC (r_s = 0.566; df = 5; p = 0.185) and a negative correlation between the number of included examples and the OOV dataset MCC (r_s = -0.378; df = 5; p = 0.404), although neither association was statistically significant. Detailed results from MCC testing can be found in Table 4.

Table 4: Binary classifier MCC test and OOV scores.

  Model                     MCC test  MCC OOV
  Full Sl.                  0.629     0.273
  10% Sl.                   0.550     0.292
  20% Sl.                   0.578     0.284
  10% En.                   0.321     0.268
  20% En.                   0.004     0.273
  20% En. (early stopping)  0.491     0.353
  20% mix                   0.578     0.326
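Both metrics are available off the shelf; a minimal sketch of computing the micro-F1 and MCC over binary pair decisions (with toy labels invented for illustration) might look like this:

    from sklearn.metrics import f1_score, matthews_corrcoef

    # Toy binary-pair labels: 1 = same sense, 0 = different sense.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    # Micro-F1 over pair decisions and the MCC used for OOV comparison.
    print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
    print("MCC:     ", matthews_corrcoef(y_true, y_pred))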
4.3 Sense Predictions with Nearest Neighbour Matching
For predictions with nearest neighbour matching between target sentence and sense embeddings, the baselines were the MFS heuristic (F1 = 40.4%) and the predictions from the untrained model (F1 = 21.7%). The difference between model predictions was statistically significant (chi-square_F = 45.11; df = 5; n = 9; p < 0.001). The only model significantly different from the MFS predictions was the one trained on the entire Slovene dataset (F1 = 72.8%; p = 0.003). Detailed results from predictions using nearest neighbour matching can be found in Table 5. The statistical differences between nearest neighbour predictions from different models are presented in Figure 2.

Table 5: F1 scores of nearest neighbour predictions.

  Model                     Micro-F1
  MFS baseline              40.4
  Untrained model           21.7
  Full Sl.                  72.8
  10% Sl.                   50.9
  20% Sl.                   60.7
  10% En.                   53.2
  20% En.                   60.6
  20% En. (early stopping)  28.7
  20% mix                   61.0

Figure 2: Critical distance diagram for nearest neighbour results.

5 DISCUSSION ON INTERDISCIPLINARY ASPECTS
In this section, we offer a brief critique of the WSD task from the perspective of psycholinguistics and pragmatics and from insights gained through model development, and we suggest options for further research.

The datasets commonly used for WSD are not transparent in terms of the specific sense ambiguities they contain, in spite of available typologies. Psycholinguistic literature has identified significant differences in human processing between homonymy and polysemy [8], as well as between various subtypes of the latter (e.g., metonymy and metaphor) [9]. As demonstrated by the use of the out-of-vocabulary test set, additional datasets, even if comparatively small, can provide important additional information on model performance. By incorporating a theoretically informed typology of polysemy or lexical ambiguity, future research could provide richer descriptions of the word sense relations contained in widely used WSD datasets, as well as develop specific tests for various types of polysemy. The latter could draw on datasets from psycholinguistic experiments, which commonly control for a plethora of variables, such as word and sense frequency. We also observed that Elexis-WSD and SemCor contain a large number of single-sense lemmas, which would explain why F1 scores from the MFS heuristic in related works are commonly relatively high.

Furthermore, while large language models have achieved state-of-the-art results in WSD, they do not fundamentally diverge from distributional semantics [6], which is but one account of possible disambiguation mechanisms. It is possible, for instance, to conceptualise word disambiguation as a pragmatic process whereby the common ground (shared knowledge) between speakers [1] scaffolds disambiguation, an account on which speakers may introduce ambiguity on purpose to meet various communicative goals [15].

6 CONCLUSION
We developed several word sense disambiguation models for Slovenian text and achieved comparatively high performance, albeit on a limited selection of lemmas and word senses. We demonstrated that including small datasets to measure out-of-vocabulary performance yields important insights, as the models tended to generalize better with more compact training datasets.

The models presented in this paper would benefit from a review of Slovenian lexicographical sources and of the sense inventory compatibility between them. Replacing annotated sentences with sense definitions (e.g., collocation lists, coarse semantic tags, gloss definitions) would greatly increase the number of available training examples. Other large language models could also be used, and a detailed hyperparameter optimization could be performed for each model individually.

The source code related to this paper and the datasets used are freely available at https://github.com/zo-fi/slo_wsd_ZFMA.

Acknowledgments
The work was partially supported by the Slovenian Research and Innovation Agency (ARIS) core research programme P6-0411, and projects J6-2581 and J7-3159.

REFERENCES
[1] Keith Allan. 2013. What is Common Ground? In Perspectives on Linguistic Pragmatics. Perspectives in Pragmatics, Philosophy & Psychology. Alessandro Capone, Franco Lo Piparo, and Marco Carapezza, editors. Springer, Cham, 285-310. doi: 10.1007/978-3-319-01014-4_11.
[2] Michele Bevilacqua and Roberto Navigli. 2020. Breaking Through the 80% Glass Ceiling: Raising the State of the Art in Word Sense Disambiguation by Incorporating Knowledge Graph Information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2854-2864. doi: 10.18653/v1/2020.acl-main.255.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171-4186. doi: 10.18653/v1/N19-1423.
[4] Zala Erič, Miha Debenjak, and Denis Derenda Cizel. 2022. Cross-lingual sense disambiguation. GitHub repository. https://github.com/dextos658/Cross-lingual-sense-disambiguation.
[5] Christian Hadiwinoto, Hwee Tou Ng, and Wee Chung Gan. 2019. Improved Word Sense Disambiguation Using Pre-Trained Contextualized Word Representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5297-5306. doi: 10.18653/v1/D19-1533.
[6] Zellig S. Harris. 1954. Distributional Structure. WORD, 10, 2-3, 146-162. doi: 10.1080/00437956.1954.11659520.
[7] Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3509-3514. doi: 10.18653/v1/D19-1355.
[8] Ekaterini Klepousniotou and Shari R. Baum. 2007. Disambiguating the ambiguity advantage effect in word recognition: An advantage for polysemous but not homonymous words. Journal of Neurolinguistics, 20, 1, 1-24. doi: 10.1016/j.jneuroling.2006.02.001.
[9] Ekaterini Klepousniotou, G. Bruce Pike, Karsten Steinhauer, and Vincent Gracco. 2012. Not all ambiguous words are created equal: An EEG investigation of homonymy and polysemy. Brain and Language, 123, 1, 11-21. doi: 10.1016/j.bandl.2012.06.007.
[10] Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4365-4374.
[11] Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation (SIGDOC '86), 24-26. isbn: 978-0-89791-224-2. doi: 10.1145/318723.318728.
[12] Federico Martelli et al. 2022. Parallel sense-annotated corpus ELEXIS-WSD 1.0. https://elex.is/. Retrieved Oct. 21, 2022 from https://www.clarin.si/repository/xmlui/handle/11356/1674.
[13] Matej Miočić, Marko Ivanovski, and Matej Kalc. 2022. NLP-tripleM. GitHub repository. https://github.com/KalcMatej99/NLP-tripleM.
[14] David Miškić, Kim Ana Badovinac, and Sabina Matjašič. 2022. cross-lingual-sense-disambiguation. GitHub repository. https://github.com/NLP-disambiguation/cross-lingual-sense-%20disambiguation.
[15] Brigitte Nerlich and David D. Clarke. 2001. Ambiguities we live by: towards a pragmatics of polysemy. Journal of Pragmatics, 33, 1, (Jan. 2001), 1-20. doi: 10.1016/S0378-2166(99)00132-0.
[16] Tommaso Pasini, Alessandro Raganato, and Roberto Navigli. 2021. XL-WSD: an extra-large and cross-lingual evaluation framework for word sense disambiguation. Proceedings of the AAAI Conference on Artificial Intelligence, 35, 15, 13648-13656. doi: 10.1609/aaai.v35i15.17609.
[17] Erazem Pušnik, Rok Miklavčič, and Aljaž Šmaljcelj. 2022. nlp-project3. GitHub repository. https://github.com/RoKKim/nlp-project3.
[18] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982-3992. doi: 10.18653/v1/D19-1410.
[19] Yang Song, Xin Cai Ong, Hwee Tou Ng, and Qian Lin. 2021. Improved Word Sense Disambiguation with Enhanced Sense Representations. In Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, 4311-4320. doi: 10.18653/v1/2021.findings-emnlp.365.
[20] Jure Tič, Nejc Velikonja, and Sandra Vizlar. 2022. NLP. GitHub repository. https://github.com/JureTic/NLP.
[21] Andrej Tomažin. 2022. nlp-wic. GitHub repository. https://github.com/anzetomazin/nlp-wic.
[22] Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEngual BERT. In Text, Speech, and Dialogue, 104-111. doi: 10.1007/978-3-030-58323-1_11.
doi: vocabulary performance yields important insights, as the models 10.1016/S0378- 2166(99)00132- 0. tended to generalize better with compacter training datasets. [16] Tommaso Pasini, Alessandro Raganato, and Roberto Navigli. 2021. Xl-wsd: an extra-large and cross-lingual evaluation framework for word sense dis-The models presented in this paper would benefit from a re- ambiguation. Proceedings of the AAAI Conference on Artificial Intelligence, view of Slovenian lexicographical sources and sense inventory 35, 15, 13648–13656. doi: 10.1609/aaai.v35i15.17609. compatibility between them. Replacing annotated sentences with [17] Erazem Pušnik, Rok Miklavčič, and Aljaž Šmaljcelj. 2022. nlp-project3. GitHub repository. https://github.com/RoKKim/nlp- project3. sense definitions (e.g. collocation lists, coarse semantic tags, gloss [18] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embed- definitions) would greatly increase the number of available train- dings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Interna-ing examples. Other large language models could also be used tional Joint Conference on Natural Language Processing (EMNLP-IJCNLP), and a detailed hyperparameter optimization could be performed 3982–3992. doi: 10.18653/v1/D19- 1410. for each model individually. [19] Yang Song, Xin Cai Ong, Hwee Tou Ng, and Qian Lin. 2021. Improved Word Sense Disambiguation with Enhanced Sense Representations. In Findings of The source code related to this paper and the datasets used the Association for Computational Linguistics: EMNLP 2021. Association for 1 are freely available . Computational Linguistics, 4311–4320. doi: 10.18653/v1/2021.f indings- emn lp.365. Acknowledgments [20] Jure Tič, Nejc Velikonja, and Sandra Vizlar. 2022. NLP. GitHub repository. https://github.com/JureTic/NLP. The work was partially supported by the Slovenian Research and [21] Andrej Tomažin. 2022. nlp-wic. GitHub repository. https://github.com/anze tomazin/nlp- wic. Innovation Agency (ARIS) core research programme P6-0411, [22] Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEn- and projects J6-2581 and J7-3159. gual BERT. In Text, Speech, and Dialogue, 104–111. doi: 10.1007/978- 3- 030- 58323- 1_11. 1 https://github.com/zo-fi/slo_wsd_ZFMA 57 Predicting the FTSO consensus price Filip Koprivec Tjaž Eržen Urban Mežnar filip.koprivec@ijs.si erzen.tjaz@gmail.com urban.meznar@aflabs.si JSI, FMF, AFLabs AFLabs AFLabs Ljubljana, Slovenia Ljubljana, Slovenia Ljubljana, Slovenia ABSTRACT Eyal [15] provide a comprehensive study, while Caldarelli’s subsequent work [2] offers an overview of oracle research. Liu et The paper presents a system for predicting cryptocurrency con-al. [14] survey various oracle implementation techniques. No-sensus prices within the Flare Time Series Oracle (FTSO), a de- tably, Alagha [1] introduces a reinforcement learning model to centralized oracle solution running on Flare blockchain. By lever-enhance oracle reliability [11]. aging a combination of smoothing techniques and machine learn- The main oracle solution provider is Chainlink, which ad- ing methodologies, we detail and analyze the construction and dresses the oracle problem with enhanced security and scalability performance of our own provider. This paper presents the FTSO in Chainlink 2.0 [5]. Zhang et al. 
[13] also detail their approach, mechanism, and basic information about the game theoretic back-providing insights for evolving projects like Flare FTSO in the ground together with rewarding and submission protocol. Lastly, oracle domain. we present our provider’s prediction accuracy for each coin. KEYWORDS 3 FTSO PROTOCOL FTSO, schelling point, machine learning, regression, smoothings The Flare Time Series Oracle plays an important role in Flare Net- 1 INTRODUCTION work’s data accuracy and decentralization. The protocol works in a series of discrete steps to decrease the performance hit on The blockchain and decentralized finance (DeFi) sectors have the whole network. Every 3 minutes marks the beginning of a seen significant growth, but they share a common challenge: new price epoch. Providers are mandated to submit their price es- securely accessing data not directly included in transaction sig- timates in a timely manner using the commit and reveal scheme natures. This issue, known as the oracle problem [3], hinders the to maintain confidentiality and prevent other providers from broader adoption of blockchain technologies as it’s typically dif- viewing or copying their predictions. ficult to obtain reliable off-chain data. While various on-chain Only after the price epoch has ended, providers reveal the protocols offer solutions, each has its trade-offs concerning secu- actual submitted values. This reveal must be done in the first 90 rity, accuracy, and data reliability. Traditional centralized oracles seconds of the next price epoch, which overlaps with the first present risks like data manipulation, whereas fully decentralized half of the next submit epoch. After the reveal epoch ends, all alternatives often suffer from latency and higher costs. the revealed values are combined and a network-wide price is This paper examines the Flare Time Series Oracle, a decentral- calculated. Data providers are incentivized to submit good prices ized oracle that uses a schelling point mechanism to aggregate by the network-wide rewarding system, by being rewarded if data from multiple providers [11]. Fata providers submit price prices fall in the middle two quartiles (IQR range) of the final estimates every three minutes, with the system price determined price. as a weighted median of these submissions. Given the inherent The network thus gets fresh asset prices every 3 minutes with price variability across exchanges and the indeterminate nature some delay due to the reveal period. Such data granularity is not of asset prices within a three-minute window, there isn’t a sin- sufficient for high-frequency trading but has proven sufficient gular "correct" price. Providers aim to select a price close to the for many financial applications. The network and community final median, incentivized by the reward system. This competitive explicitly don’t define what a correct price is, to remove the environment, involving around 100 data providers, has shown vulnerability of the definition relying on a specific price source. resilience against market anomalies and exchange issues. The Assets are denominated in $ with 5 decimal points of precision. paper investigates machine learning techniques to predict this Since most of the exchanges quote a price that is accurate up final median price using exchange data. 
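To illustrate the aggregation step, here is a minimal sketch of a weighted median and the IQR reward check. This is our simplification with toy numbers, not the Flare reference implementation, and it ignores protocol details such as tie-breaking between equally weighted submissions.

    import numpy as np

    def weighted_median(prices, weights):
        """Smallest price at which the cumulative vote weight reaches 50%."""
        order = np.argsort(prices)
        prices, weights = np.asarray(prices)[order], np.asarray(weights)[order]
        cumulative = np.cumsum(weights)
        return prices[np.searchsorted(cumulative, 0.5 * cumulative[-1])]

    # Toy reveal epoch: three providers with delegated vote weights.
    prices = [0.51212, 0.51230, 0.51198]
    weights = [0.025, 0.020, 0.015]  # each provider capped at 2.5%
    print(weighted_median(prices, weights))

    # Providers are rewarded when their price lands in the IQR of submissions.
    lo, hi = np.percentile(prices, [25, 75])
    print([lo <= p <= hi for p in prices])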
4 DATA RETRIEVAL AND PREDICTION

4.1 Overview
The data retrieval process is a crucial step in our analysis. It involves collecting, processing, and preparing time series data, specifically price and timestamp pairs, for further analysis. This data is essential for understanding trends, making predictions, and deriving insights.

The primary sources of our data are the FTSO prices from previous epochs and current data from various exchanges. Selecting a specific subset of exchanges as a data source is a nontrivial task. Each exchange has its own set of characteristics: trading volume, user base, regional influences, and even specific trading behaviors. Historical data shows that providers are quick (on a sub-hour basis) to adapt to market opening and closing times and usually disregard after-hours trading prices on exchanges. Furthermore, the reliability of data from each exchange can vary: some exchanges might offer more consistent and clean data, while others might have gaps or anomalies.

4.2 Data Processing and Smoothing Techniques
Once the data is retrieved, it undergoes several processing steps to ensure its quality and relevance for prediction. One of the primary challenges in time series forecasting is the inherent noise present in the data. Financial data is especially prone to short-term spikes, as low-liquidity exchanges can experience large price deviations when market depth is limited. Such spikes are quickly exploited by arbitrageurs, but the price jumps (anomalies) remain in the data and must be accounted for. We employ various smoothing techniques to filter out noise and highlight the underlying trends.

Exponential Moving Average (EMA): EMA is a type of weighted moving average that gives more weight to the most recent prices. In our system, the EMA vector and its alpha value are optimized using the curve_fit method from the scipy.optimize library [10].

Savitzky-Golay Smoothing: This technique uses convolution to fit successive subsets of adjacent data points with a low-degree polynomial. It is effective in preserving features of the distribution, such as heights and widths, making it suitable for our analysis [12].

Linear Interpolation: Linear interpolation is used to estimate values between two known values in a dataset. Our system employs a skew linear fit to interpolate missing or anomalous data points.

FFT Smoothing: The last smoothing method we used is Fast-Fourier smoothing.

Each of these methods has its own strengths and is chosen based on the specific characteristics of the data and the prediction requirements. So far, the only other smoothing method we have tried to incorporate is LOWESS (Locally Weighted Scatterplot Smoothing), which performed worse than the rest of the smoothing methods after training an overdetermined system on it (see Section 4.3). The mentioned methods were selected because they are commonly used for smoothing financial data [9], are easily available in multiple scientific libraries, and offer good resilience against the sudden spikes typical of markets with low liquidity.
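A minimal sketch of two of these smoothers on a synthetic price series is shown below. The paper fits the EMA alpha with curve_fit, whereas this sketch fixes alpha for brevity; the window length and polynomial order are likewise illustrative assumptions rather than the parameters actually used.

    import numpy as np
    from scipy.signal import savgol_filter

    def ema(prices, alpha):
        """Exponential moving average; recent prices get weight alpha."""
        out = np.empty_like(prices, dtype=float)
        out[0] = prices[0]
        for i in range(1, len(prices)):
            out[i] = alpha * prices[i] + (1.0 - alpha) * out[i - 1]
        return out

    rng = np.random.default_rng(1)
    raw = np.cumsum(rng.normal(0, 0.01, 200)) + 100.0  # synthetic price series
    raw[50] += 0.5  # a low-liquidity spike the smoothers should damp

    smooth_ema = ema(raw, alpha=0.2)
    smooth_sg = savgol_filter(raw, window_length=21, polyorder=3)
    print(raw[50], smooth_ema[50], smooth_sg[50])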
4.3 Prediction Mechanism
After smoothing the data using the techniques listed above, we adopt an overdetermined-system approach for our predictions. This entails constructing a system of equations from the processed data and subsequently employing the least squares method to find the optimal prediction parameters.

Suppose we are training our time series over m epochs. Let E be an m x n matrix (E in R^(m x n)) where each column e_i represents the price vector for the i-th exchange across the m epochs. The vector v in R^n signifies the normalized weights or contributions of each exchange to the forecasted price; each entry v_i of v corresponds to the significance of the i-th exchange. Given the extensive epoch training data required for model training and the limited number of available crypto exchanges (in the tens), we are dealing with an overdetermined system. In this context, we optimize the vector v using the least squares error method. The residual sum of squares evaluation function is minimized using the fmin_cg method from scipy.optimize, aiming to find the parameters that minimize the difference between the predicted values and the actual values in the training data.

For each exchange and for each smoothing method, we define a possible upper and lower range for the method's parameters and specify a step size. We then compute the Cartesian product of all these sets, yielding all viable parameter combinations in the form of a multidimensional grid. For each combination in this Cartesian product, we smooth the data using the methods described above, train the model and calculate the optimal solution vector, which tells us how much weight each exchange should hold. Finally, we identify the model configuration that delivers the best performance.

The overdetermined system was chosen for a number of reasons. We preferred a simple model with the potential for explanation, or at least quick access to information about which input parameters offer greater prediction power. Although not included in our numerical utility function, delegation and the social aspect of the goodness of a price are important for multiple reasons: being slightly less accurate but consistently providing reasonable prices attracts more delegations and provides more security and trust in the network. Therefore, a small error in an otherwise reasonable price was much preferred to being off by a lot due to an edge condition or to overfitting on a specific input parameter. Furthermore, incoming network upgrades might force the providers to buy or sell assets at the revealed price (and not at the market price), which means that a large deviation from the correct price would also be financially problematic.

Lastly, the providers work in bursts. Most of the information-rich exchange data comes in just before the end of the epoch (the last few seconds), so a longer evaluation time might mean we miss some information or are too late for the submission. Our internal analysis shows that the submission must be calculated at least 5-8 seconds before the end of each epoch to be reliably accepted by the network validators (network latency usually requires submitting the price a few seconds before the end of the epoch).
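A toy version of this optimization, with synthetic data standing in for the smoothed exchange prices and the FTSO median targets, can be sketched as follows (the weights and noise levels are invented for illustration):

    import numpy as np
    from scipy.optimize import fmin_cg

    rng = np.random.default_rng(2)
    m, n = 160, 6                            # epochs x exchanges: overdetermined
    E = 100.0 + rng.normal(0, 0.1, (m, n))   # smoothed per-exchange prices
    true_v = np.array([0.4, 0.3, 0.1, 0.1, 0.05, 0.05])
    y = E @ true_v + rng.normal(0, 0.01, m)  # observed FTSO median prices

    def rss(v):
        """Residual sum of squares between predicted and actual medians."""
        r = E @ v - y
        return r @ r

    v_opt = fmin_cg(rss, x0=np.full(n, 1.0 / n), disp=False)
    print(np.round(v_opt, 3))          # per-exchange weights
    print(np.round(v_opt.sum(), 3))    # close to 1 for this toy problem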
5 RESULT ANALYSIS
We evaluated the performance of our trained models by comparing them against three simpler prediction methods. The Last Seen Value method predicts that the future value of a coin will be the most recent exchange price observed before the prediction starts. The Previous Epoch Value method predicts the price of a coin as the FTSO price from the previous epoch. Lastly, we also try the overdetermined system without any smoothing.

Our accuracy analysis spanned a week, with new models trained every day on the previous 8 hours of data (160 epochs). The model's success rate was then validated against the subsequent 8-hour dataset immediately following the training data. The success rate is the number of times the predicted price fell in the interquartile range, divided by the number of epochs the price was submitted for. This corresponds exactly to what price providers are financially incentivized to do.

The detailed results are presented in Figures 1a to 1d. As anticipated, the Last Seen Value method yields modest outcomes, averaging a prediction success rate of 3.5% across all coins.

For the Previous Epoch Value method, we set the prediction to match the price from the previous epoch. While this method outperformed the first, it still registered low performance, averaging around 7% for all coins over the week. Notably, several coins like ETH or FIL had an average success rate close to 0%, while DOGE achieved an average of 15%.

Training an overdetermined system without smoothing the data outperformed the first two methods, averaging around a 10% success rate across all coins during the testing week. Notably, the full prediction method, which smooths the data and trains an overdetermined system, outperformed all of the previous methods.

The evaluation closely mirrored real-world conditions: due to changes in exchanges, fluctuations in vote powers, and the inclusion of new data providers in the median calculation, models must be retrained on an almost daily basis. Over the observed epochs, our FTSO provider demonstrated varied success rates across different cryptocurrencies. The success rates for XRP, DOGE and BTC generally ranged between 0.20 and 0.45, indicating moderate to high prediction accuracy. Meanwhile, coins like XLM, ADA, and ARB had lower success rates, often below 0.15, suggesting challenges in predicting their prices. Overall, the provider's performance fluctuated across epochs and coins, with some cryptocurrencies consistently achieving higher success rates than others; we were able to achieve a moderate prediction success of around 0.22, currently ranking 26th among the 94 active FTSO providers.

Because smoothing the data and then training the overdetermined system yielded better results than training the overdetermined system alone, we can conclude that smoothing improves the result in this setting. Without smoothing, our prediction model is highly influenced by noise and short-term fluctuations, making it challenging to capture the underlying trend in the time series data.

Table 2: Average success rate for each prediction method and selected coins.

  Coin  Last Seen  Prev. Ep  No smooth  Smooth
  XRP   0.02129    0.04986   0.18729    0.33900
  XLM   0.02886    0.11686   0.03129    0.11329
  DOGE  0.07686    0.16986   0.13186    0.38086
  ADA   0.04143    0.14214   0.06157    0.13457
  BTC   0.01043    0.01943   0.14071    0.32543
  ARB   0.02700    0.02343   0.09129    0.11529

Figure 1: Success rates per coin for (a) the "Last Seen Value" method, (b) the "Previous Epoch Value" method, (c) the overdetermined system without data smoothing, and (d) the overdetermined system with data smoothing.
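The success-rate computation itself is straightforward; a sketch with simulated submissions (all numbers invented, and real submissions would of course come from the trained model) is:

    import numpy as np

    def epoch_success(submitted, revealed_prices):
        """1 if our submission lands in the IQR of all revealed prices."""
        lo, hi = np.percentile(revealed_prices, [25, 75])
        return int(lo <= submitted <= hi)

    rng = np.random.default_rng(3)
    hits = 0
    epochs = 160                                     # one 8-hour validation window
    for _ in range(epochs):
        revealed = 100.0 + rng.normal(0, 0.05, 94)   # ~94 active providers
        ours = 100.0 + rng.normal(0, 0.05)
        hits += epoch_success(ours, revealed)
    print(hits / epochs)                             # success rate over the window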
6 RMSE VALUES
Lastly, we analyzed the RMSE (root mean squared error) of each method for each coin, to provide more insight into each method's accuracy. The results are shown in Table 1. Since the prices of different coins vary, the RMSE values are not comparable across coins, only across methods for a single coin. For most coins, the Last Seen Value method generally yields the highest RMSE values, indicating the worst accuracy relative to the other methods. Conversely, the overdetermined system with smoothing tends to produce the lowest RMSE values for most coins. The Previous Epoch Value method and the overdetermined system without smoothing rank somewhere in between.

Table 1: RMSE for each method and selected coins.

  Coin  Last Seen    Prev. Ep    No smooth   Smooth
  XRP   0.07412964   0.01536945  0.00542317  0.00398449
  XLM   0.00010802   0.00025230  0.00090994  0.00025548
  DOGE  0.00004626   0.00001359  0.00000733  0.00000641
  ADA   0.00000201   0.00000395  0.00000183  0.00000174
  BTC   23.78687273  5.01065648  1.94068887  0.91171693
  ARB   0.00098386   0.00025156  0.00015229  0.00014042

7 DISCUSSION AND FUTURE WORK
We have developed and assessed a functional provider solution for predicting prices within the FTSO protocol. While we observed commendable performance for coins such as XRP, DOGE, and BTC, the results for other coins like XLM, ADA, and ARB were not as promising. Exploring additional smoothing techniques and incorporating multiple prediction methods would be beneficial. Notably, ensemble methods are renowned for reducing prediction variance, which in turn increases the probability of predictions falling within the median target range.

This paper has focused only on non-deep-learning approaches to FTSO price prediction. A promising extension of the provider would be to explore time series prediction using deep learning methods such as RNN or LSTM neural networks. These models have the potential to capture more subtle patterns in the data and adapt to the dynamic prices of crypto coins. They might need to be modified to fit the specifics of the FTSO system and its quick retraining times. Combining the more expensive inference of neural networks with the presented overdetermined system, together with error bounds on the prediction results, might also yield a more performant composite algorithm that could fall back to the simpler prediction whenever the stronger but more complicated model is too late with its prediction.
8 ACKNOWLEDGMENTS
The authors would like to thank AFLabs for the provision of exchange and FTSO data used during the development phase.

REFERENCES
[1] A. Alagha. 2022. "A reinforcement learning model for the reliability of blockchain oracles". In: ScienceDirect.
[2] Giulio Caldarelli. 2022. "Overview of Blockchain Oracle Research". In: MDPI 14.6, p. 175.
[3] Giulio Caldarelli. 2020. "Understanding the Blockchain Oracle Problem: A Call for Action". In: Information 11.11, p. 509. URL: https://www.mdpi.com/2078-2489/11/11/509.
[4] Giulio Caldarelli. 2023. "Understanding the Blockchain Oracle Problem: A Call for Action". In: 11.11, p. 509.
[5] Chainlink. 2023. Chainlink 2.0 and the future of Decentralized Oracle Networks. Accessed: 2023-09-05. URL: https://chain.link/whitepaper.
[6] Vasant Dhar. 2013. "Data Science and Prediction". In: Communications of the ACM 56.12, pp. 64-73. URL: https://dl.acm.org/doi/abs/10.1145/2500499.
[7] Joshua Ellul. 2023. "The Blockchain Oracle Problem in Decentralized Finance—A Multivocal Approach". In: 11.16, p. 7572.
[8] Boi Faltings and Goran Radanovic. 2021. Game Theory for Data Science: Eliciting Truthful Information. Springer Nature.
[9] James D. Hamilton. 1994. Time Series Analysis. Princeton University Press. URL: http://mayoral.iae-csic.org/timeseries2021/hamilton.pdf.
[10] A. J. Lawrance and P. A. W. Lewis. 1977. "An exponential moving-average sequence and point process (EMA1)". In: Journal of Applied Probability 14.1, pp. 98-113. doi: 10.2307/3213263.
[11] Christopher Potts. 2008. "Interpretive Economy". In: Semantics Archive. URL: https://semanticsarchive.net/Archive/jExYWZlN/potts-interpretive-economy-mar08.pdf.
[12] William H. Press and Saul A. Teukolsky. 1990. "Savitzky-Golay Smoothing Filters". In: Computers in Physics 4.6, pp. 669-672. doi: 10.1063/1.4822961.
[13] Fan Zhang et al. 2020. "Decentralized Oracles: a Comprehensive Overview". In: arXiv preprint arXiv:2004.07140. URL: https://arxiv.org/abs/2004.07140.
[14] Yanchao Zhang, Zhiqiang Liu, and Jiwu Jing. 2022. "Connect API with Blockchain: A Survey on Blockchain Oracle Implementation". In: ACM.
[15] Aviv Zohar and Ittay Eyal. 2020. "A Study of Blockchain Oracles". In: arXiv. URL: https://arxiv.org/pdf/2004.07140.

On Neural Filter Selection for ON/OFF Classification of Home Appliances

Anže Pirnat and Carolina Fortuna (ap6928@student.uni-lj.si, carolina.fortuna@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

ABSTRACT
Non-intrusive load monitoring (NILM) enables the extraction of appliance-level consumption data from a single metering point. Appliance ON/OFF classification is a particular type of such appliance-level data extraction, recently enabled by deep learning (DL) techniques. To date, a study of the influence of neural filter selection on the performance and computational complexity of appliance ON/OFF classification has been missing. In this paper, we start from a widely used DL architecture, adapt it for the appliance ON/OFF classification problem and then study the influence of the filters on model performance and model complexity. Through this study we develop a model, PirnatCross, that excels at cross-dataset performance, offering an average improvement in average weighted F1 score of 17.2 percentage points over a SotA model and a VGG11 baseline when trained on REFIT and evaluated on UK-DALE and vice versa. PirnatCross also consumes 6 times less energy than the SotA model.

KEYWORDS
non-intrusive load monitoring (NILM), ON/OFF appliance classification, deep learning (DL), convolutional recurrent neural network (CRNN), multi-label classification

1 INTRODUCTION
Mitigating the impact of climate change is an urgent challenge that requires collective action to keep the global average temperature below 1.5 °C relative to pre-industrial levels. Reducing unnecessary electrical energy consumption, and consequently limiting electrical energy production, is a crucial step towards achieving our goals, as it is estimated that such activities account for over 40% of the total CO2 equivalent generated by human activities (tinyurl.com/CO2-from-electricity). Besides reducing energy consumption, we are increasingly adopting renewable power plants due to their significantly lower CO2 emissions compared to fossil-fuel-based ones (tinyurl.com/renewable-energy-doubled). However, renewable energy resources have a major drawback: a dependency on renewable resources that are far less predictable, posing a challenge to the stability of the power system [11]. To address this issue, demand response strategies are being implemented to adjust electricity consumption to better match supply [1]. Consequently, efforts are being made to monitor and manage energy consumption more efficiently in residential buildings, making it relevant to track device activity (ON/OFF events) [3].

To avoid the high cost and invasiveness of monitoring each individual device with an electricity meter, researchers have developed a more economically efficient method known as non-intrusive load monitoring (NILM). This method involves obtaining appliance-level data using just one metering point that measures the total electricity consumption of a household. By using classification techniques for NILM, it is possible to determine the states (ON/OFF) of devices within a household and monitor their activity for demand response applications. As a typical household may have several appliances working simultaneously, a suitable approach for determining the activity states of appliances is multi-label classification, where the state of each appliance is used as a class label and the recorded readings from a single household meter serve as input samples. Li et al. were among the first to propose multi-label classification for NILM disaggregation. More recently, Tanoni et al. [12] employed a gated recurrent unit (GRU) in their CRNN for weakly supervised training, mixing the amounts of strongly and weakly labeled data to confirm the effectiveness of such an approach. Zhou et al. [14] proposed a new model called TTRNet, which uses a transpose convolution before a recurrent layer, a method which has also shown better results in other works [8].
The existing works based on DL techniques typically lack an analysis of DL computational complexity and energy consumption, which is relevant when designing such models [2]. For instance, the authors of [5] analyzed the carbon footprint of various architectures and concluded that convolutional layers are power-hungry because they operate in three dimensions, unlike fully connected layers, which operate in two dimensions.

Existing studies also typically develop and evaluate their method on only a few datasets that are often limited in size. For instance, [12] relied on two publicly available datasets, REFIT [9] and UK-DALE [6], and developed and evaluated a model for each of the two. While this approach is appropriate for relative method performance assessment, some studies have also discussed the importance of cross-dataset evaluation. For example, Han et al. [4] described significant dataset biases and the high class imbalance of in-the-wild datasets as a fundamental bottleneck in facial expression recognition. Their results showed that cross-dataset evaluation can reduce dataset bias and improve performance.

In this paper we aim to better understand the influence of the filters on model performance and model complexity for multi-label ON/OFF appliance classification through intra- and cross-dataset evaluation. Our main contributions are as follows:

• We adapt VGG19, a widely used DL architecture, for appliance ON/OFF classification and study the influence of the filters on model performance and model complexity.
• We develop a model, PirnatCross, that excels at cross-dataset performance, offering an average improvement of 17.2 percentage points over a SotA model and a VGG11 baseline when trained on REFIT and evaluated on UK-DALE and vice versa. PirnatCross also consumes 6 times less energy than the SotA model.
Figure 1: The data measured from a household is input to the DL model, which outputs s_i for each device present in the experiment. If s_i is greater than 0.5 we classify the device as active, otherwise as inactive.

The paper is organized as follows. Section 2 provides the problem statement, Section 3 presents methodological details, while Section 4 analyses the results of our study. Finally, Section 5 concludes the paper.

2 PROBLEM STATEMENT
Given an input power consumption p(w) measured by a smart meter over a time window w, we aim to develop a multi-label ON/OFF classifier Φ that maps the input to a probability vector s(w) corresponding to the status of the home appliances:

  s(w) = Φ(p(w))    (1)

The cardinality |s| of the set s indicates the number of appliances to be recognised. For each window of measurements p(w) input to the model Φ, s(w) is of the form [s_1(w), s_2(w), ..., s_N(w)] with s_i in [0, 1] and N = |s|, where each s_i estimates the probability that appliance d_i is active, as also depicted in Figure 1. When s_i > 0.5 the appliance is classified as ON, otherwise it is classified as OFF. More than one appliance can be ON at the same time; therefore s contains multiple labels assigned to the current instance. In this paper N = 5 in total, of which any 1-4 can be active.

The ON/OFF classifier Φ, realized as a deep learning network, is typically composed of a set of layers [l_1, l_2, ..., l_M], where the types of the layers may vary depending on how the respective architecture is designed, for instance l_i in {FC, Pool, Conv, GRU, ...}, where FC stands for fully connected, Pool for pooling, Conv for convolutional and GRU for gated recurrent unit. As has already been shown in [10], the computational complexity varies across the types of layers.

In developing Φ, we start from the VGG family of architectures, as they are widely used in various communities and have already shown promising results for classification on NILM [7]. More precisely, we consider VGG19, comprising 19 layers with trainable parameters, 16 of which are convolutional and 3 fully connected. The convolutional layers are grouped into five blocks:

• Block 1: 2 x conv. with 64 filters + max pooling
• Block 2: 2 x conv. with 128 filters + max pooling
• Block 3: 4 x conv. with 256 filters + max pooling
• Block 4: 4 x conv. with 512 filters + max pooling
• Block 5: 4 x conv. with 512 filters + max pooling

This architecture has been tailored to accommodate time series data, replacing the 2D convolutions and pooling of VGG19, designed for images, with 1D counterparts that are more suitable for time series. In addition, the convolutional layers in the 5th block have been replaced with transpose convolutional layers to increase the temporal resolution of features and reduce their number, as suggested in [14]. We also integrated a recurrent layer, specifically a GRU layer, after the 5th block, as it is able to model temporal relationships in the time series and was shown to achieve good performance in a recent study [12].

In order to estimate the computational complexity of the resulting architecture, referred to as PirnatCross, we must first calculate its complexity as the sum of all floating point operations (FLOPs) that have to be computed for each of its layers. This can be calculated for convolutional, pooling and fully connected layers with the equations from [10], and for the GRU with the equation from [13]. Convolutional layers dominate in our adaptation of VGG19, and the computational complexity of a convolutional layer is relatively high compared to other types of layers [10]. Generally, the number of FLOPs used throughout a convolutional layer, F_c, is equal to the number of filters N_f times the FLOPs per filter F_pf, i.e., F_c = N_f x F_pf. Therefore we aim to study the influence of the number of filters N_f on model performance and complexity. Let the starting number of filters in each block of the adapted architecture be the same as in the original VGG19, namely F = [64, 128, 256, 512, 512], and analyze model performance as average F1 score versus computational complexity in FLOPs.

3 METHODOLOGY
This section provides methodological details related to the datasets, the training approach and the evaluation process employed in the study.

3.1 Datasets
The study is conducted using two datasets: UK-DALE [6] and REFIT [9]. Within each dataset, we monitor the same five appliances d_i that were also used in recent research [12]: fridge, washing machine, dishwasher, microwave, and kettle. The data from the selected devices is obtained and processed using the procedure described by Tanoni et al. [12] to form two mixed datasets. After processing, the two mixed datasets each consist of the same five devices, with each sample containing a random selection of one to four active devices. Samples with varying numbers of active devices are randomly distributed throughout the datasets. We evaluate the cross-dataset performance of the models on the two mixed datasets obtained by processing data from UK-DALE and REFIT, in both directions. Specifically, we train models on the REFIT-derived dataset and test them on the UK-DALE-derived dataset, and vice versa, training on the UK-DALE-derived dataset and testing on the REFIT-derived dataset.

3.2 Benchmarks
In order to have a more meaningful study, we also evaluate PirnatCross, the adapted VGG19, against a VGG11 baseline and the recently published TanoniCRNN [12]. For VGG11, we used a learning rate of 0.0001 and the same batch size and number of epochs. For TanoniCRNN, we used the hyperparameters specified as optimal in its paper [12].

For PirnatCross we vary the set of filters F by multiplying it with k in [0.02, 0.04, 0.06, 0.08, 0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9, 2.1, 2.3, 2.5]. The learning rate, batch size, and number of epochs were determined through a process of trial and error, informed by previous experiments, and subsequently fine-tuned for each model to optimize model performance and stability. The resulting values are: a learning rate of 0.0003, a batch size of 128, and training for 20 epochs.

While some models were capable of handling larger batch sizes, we found that performance did not improve by increasing the batch size beyond 128, so we kept it unchanged for all models. We train and evaluate using 5-fold cross-validation.
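To make the adapted architecture tangible, a simplified PyTorch sketch is shown below. It keeps the 1D convolutional blocks, the GRU and the sigmoid multi-label head, but deliberately omits several details of the actual PirnatCross (e.g., the transpose-convolutional fifth block and the second fully connected layer), so it should be read as an illustration of the design under those simplifying assumptions, not as the published model.

    import torch
    import torch.nn as nn

    class OnOffCRNN(nn.Module):
        """Simplified 1D-conv + GRU multi-label ON/OFF classifier."""

        def __init__(self, n_appliances=5, filters=(5, 10, 20, 40)):
            super().__init__()
            blocks, in_ch = [], 1
            for f in filters:
                blocks += [nn.Conv1d(in_ch, f, kernel_size=3, padding=1),
                           nn.ReLU(),
                           nn.MaxPool1d(kernel_size=2, stride=2)]
                in_ch = f
            self.features = nn.Sequential(*blocks)
            self.gru = nn.GRU(input_size=filters[-1], hidden_size=64,
                              batch_first=True)
            self.head = nn.Sequential(nn.Linear(64, 4096), nn.ReLU(),
                                      nn.Linear(4096, n_appliances))

        def forward(self, x):              # x: (batch, 1, window)
            h = self.features(x)           # (batch, C, T')
            h = h.transpose(1, 2)          # GRU expects (batch, T', C)
            _, last = self.gru(h)          # last hidden state: (1, batch, 64)
            return torch.sigmoid(self.head(last.squeeze(0)))

    model = OnOffCRNN()
    power = torch.randn(8, 1, 2550)        # batch of aggregate-power windows
    probs = model(power)                   # (8, 5) per-appliance probabilities
    on_off = (probs > 0.5).int()           # threshold from the problem statement
    print(on_off.shape)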
3.3 Metrics
We use the average weighted F1 score (F1score_w) as the performance metric because our datasets are not balanced and do not provide equal representation for each device:

  F1score_w = Σ_{i=1}^{N_d} F1score_i x Weight_i    (2)

The average weighted F1 score is calculated using three quantities: true positives (TP), false positives (FP), and false negatives (FN). TP counts the instances where a device is accurately classified as active, while FP represents cases where a device is erroneously classified as active. FN counts instances where a device is mistakenly classified as inactive.

From these quantities we derive the precision (Precision = TP / (TP + FP)) and the recall (Recall = TP / (TP + FN)), which are used to calculate the F1 score (F1score = 2 x Precision x Recall / (Precision + Recall)). To obtain the average weighted F1 score (2), we first compute the F1 score for each device and then take the average based on each device's weight (Weight = SSD / SAD), which is determined by the support for the specified device (SSD) and the support of all devices (SAD).
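A sketch of this metric on toy multi-label predictions, together with scikit-learn's equivalent built-in, is shown below; the labels are randomly generated for illustration only.

    import numpy as np
    from sklearn.metrics import f1_score

    # Toy multi-label ground truth and predictions (rows: samples, cols: 5 devices).
    rng = np.random.default_rng(4)
    y_true = rng.integers(0, 2, size=(100, 5))
    y_pred = rng.integers(0, 2, size=(100, 5))

    # Per-device F1, then an average weighted by each device's support,
    # matching Eq. (2): F1_w = sum_i F1_i * (SSD_i / SAD).
    per_device = f1_score(y_true, y_pred, average=None, zero_division=0)
    support = y_true.sum(axis=0)
    f1_weighted = float((per_device * support / support.sum()).sum())

    # sklearn's built-in 'weighted' average computes the same quantity.
    print(f1_weighted, f1_score(y_true, y_pred, average="weighted", zero_division=0))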
4 RESULTS
In this section we first determine the optimal filter configuration for variations of the PirnatCross architecture to achieve a high average weighted F1 score. We follow with a computational complexity and carbon footprint assessment. Finally, we benchmark the performance of the models in cross-dataset evaluation on the REFIT and UK-DALE datasets.

4.1 Analysis of Tuning the Filters in PirnatCross
Figure 2 depicts the performance of the PirnatCross architecture where the original number of filters in the set F has been scaled by factors k in [0.02, 0.04, ..., 2.5]. The upper two curves present the average weighted F1 score for models trained and evaluated on REFIT and UK-DALE separately, i.e., without cross-dataset evaluation. The second lowest curve presents the average weighted F1 scores for models trained on REFIT and cross-evaluated on UK-DALE, while the lowest curve presents the results of training on UK-DALE and cross-evaluating on REFIT. Observing the cross-evaluation models, they show a rapid improvement in performance for scaling factor values from 0.02 to 0.08. From scaling factor 0.08 to 0.9, we see a decline in performance in one case and a small improvement in the others, while beyond 0.9 the results gradually decline. For scaling factors above 1.3 a rapid drop in performance can be observed.

Figure 2: Average weighted F1 scores (in %) for intra- and cross-dataset training and evaluation (all four REFIT/UK-DALE combinations) as a function of the filter scaling factor; the best-performing configuration, PirnatCross, is marked.

Marked in light blue in Figure 2, and also depicted in Figure 3, is the PirnatCross version of the proposed architecture with F scaled by 0.08, resulting in the filter configuration F1 = [5, 10, 20, 40, 40] across the blocks. PirnatCross1 performs optimally in terms of average F1 score.

PirnatCross1 also contains 5 blocks, like the original VGG19, the first two comprising two convolutional layers and the subsequent two comprising four convolutional layers. The final block consists of four transpose convolutional layers, and all blocks include a pooling layer after the convolutional layers. Preceding the output layer, our model incorporates a GRU layer with a size of 64. Additionally, two fully connected layers, each consisting of 4096 nodes, are included in the architecture. The output layer of our model comprises five nodes corresponding to the states s_i of the 5 appliances d_i considered in this study. All layers utilize the ReLU activation function, except for the output layer, which employs the sigmoid activation function.

Figure 3: The proposed architecture PirnatCross, tuned for maximum performance.

4.2 Computational Complexity and Carbon Footprint Analysis
Table 1 summarizes the weights, FLOPs, energy and carbon footprint numbers for PirnatCross versus the TanoniCRNN and VGG11 baselines. The results take into account the fact that the models were trained on an Nvidia A100 graphics card located in Slovenia, where 250 g of CO2 equivalent is produced with each kWh of electricity. The specific equations used to calculate energy and carbon footprint are defined in our previous work [10]. It can be seen from the table that PirnatCross achieves superior energy efficiency compared to the other models, exhibiting energy consumption 6 times smaller than the SotA TanoniCRNN and 6.6 times smaller than VGG11.

Table 1: Computational complexity and carbon footprint analysis for the proposed architecture and selected baselines.

  NN               Weights       FLOPs        Energy   Carbon footprint
  PirnatCross      17.4 x 10^6   185 x 10^6   329 kJ   22.9 g CO2 eq.
  TanoniCRNN [12]  0.75 x 10^6   1.11 x 10^9  1967 kJ  136.7 g CO2 eq.
  VGG11            185.6 x 10^6  1.21 x 10^9  2150 kJ  149.3 g CO2 eq.

4.3 Cross-Dataset Analysis
Tables 2 and 3 present the per-device breakdown of the F1 scores for PirnatCross, TanoniCRNN and VGG11 when trained on REFIT and evaluated on UK-DALE, and vice versa.

Table 2: F1 scores for PirnatCross1, TanoniCRNN [12] and VGG11, trained on REFIT and evaluated on UK-DALE.

  Device           PirnatCross  TanoniCRNN [12]  VGG11
  fridge           0.944        0.972            0.462
  washing machine  0.650        0.690            0.544
  dish washer      0.646        0.648            0.294
  microwave        0.728        0.756            0.512
  kettle           0.786        0.622            0.420
  weighted avg     0.766        0.752            0.456

Table 3: F1 scores for PirnatCross1, TanoniCRNN [12] and VGG11, trained on UK-DALE and evaluated on REFIT.

  Device           PirnatCross  TanoniCRNN [12]  VGG11
  fridge           0.730        0.232            0.508
  washing machine  0.668        0.666            0.366
  dish washer      0.596        0.468            0.360
  microwave        0.526        0.630            0.506
  kettle           0.800        0.782            0.408
  weighted avg     0.672        0.542            0.438
4.2 Computational Complexity and Carbon Footprint Analysis

Table 1 summarizes the weights, FLOPs, energy and carbon footprint numbers for PirnatCross versus the TanoniCRNN and VGG11 baselines. The results take into account the fact that the models were trained on an Nvidia A100 graphics card located in Slovenia, where 250 g of CO2 equivalent is produced with each kWh of electricity. The specific equations used to calculate energy and carbon footprint are defined in our previous work [10].

Table 1: Computational complexity and carbon footprint analysis for the proposed architecture and selected baselines.

NN              | weights      | FLOPs       | energy  | carbon footprint
PirnatCross     | 17.4 · 10^6  | 185 · 10^6  | 329 kJ  | 22.9 g CO2 eq.
TanoniCRNN [12] | 0.75 · 10^6  | 1.11 · 10^9 | 1967 kJ | 136.7 g CO2 eq.
VGG11           | 185.6 · 10^6 | 1.21 · 10^9 | 2150 kJ | 149.3 g CO2 eq.

It can be seen from the second row of the table that PirnatCross achieves superior energy efficiency compared to the other models, exhibiting energy consumption 6 times smaller than the SotA TanoniCRNN and 6.6 times smaller than VGG11.
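The full energy and carbon accounting follows [10], but the carbon column of Table 1 is consistent with a direct conversion of the reported training energy at the stated grid intensity of 250 g CO2-eq per kWh. A quick sanity check, with the table's energy values hard-coded for illustration:

```python
# Reproduce Table 1's carbon footprints from its energy column, assuming
# only the stated Slovenian grid intensity of 250 g CO2-eq per kWh.
KJ_PER_KWH = 3600          # 1 kWh = 3.6 MJ = 3600 kJ
INTENSITY_G_PER_KWH = 250

for name, energy_kj in [("PirnatCross", 329),
                        ("TanoniCRNN", 1967),
                        ("VGG11", 2150)]:
    grams = energy_kj / KJ_PER_KWH * INTENSITY_G_PER_KWH
    print(f"{name:12s} {energy_kj:5d} kJ -> {grams:6.1f} g CO2 eq.")
# Prints 22.8, 136.6 and 149.3 g, matching the table up to rounding.
```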
4.3 Cross-Dataset Analysis

Tables 2 and 3 present the per-device breakdown of the F1 scores for PirnatCross, TanoniCRNN and VGG11 when trained on REFIT and evaluated on UK-DALE, and vice versa.

Table 2: F1 scores for PirnatCross1, TanoniCRNN [12] and VGG11 trained on REFIT and evaluated on UK-DALE.

devices         | PirnatCross | TanoniCRNN [12] | VGG11
fridge          | 0.944       | 0.972           | 0.462
washing machine | 0.650       | 0.690           | 0.544
dish washer     | 0.646       | 0.648           | 0.294
microwave       | 0.728       | 0.756           | 0.512
kettle          | 0.786       | 0.622           | 0.420
weighted avg    | 0.766       | 0.752           | 0.456

Table 3: F1 scores for PirnatCross1, TanoniCRNN [12] and VGG11 trained on UK-DALE and evaluated on REFIT.

devices         | PirnatCross | TanoniCRNN [12] | VGG11
fridge          | 0.730       | 0.232           | 0.508
washing machine | 0.668       | 0.666           | 0.366
dish washer     | 0.596       | 0.468           | 0.360
microwave       | 0.526       | 0.630           | 0.506
kettle          | 0.800       | 0.782           | 0.408
weighted avg    | 0.672       | 0.542           | 0.438

When we trained on REFIT and evaluated on UK-DALE, the weighted average scores for the three models were as follows: PirnatCross achieved 0.766, TanoniCRNN 0.752, and VGG11 0.456. However, when we trained on UK-DALE and tested on REFIT, the scores were notably lower for all three models: PirnatCross achieved 0.672, TanoniCRNN 0.542, and VGG11 0.438. This outcome may be explained by the fact that REFIT has a significantly higher level of data noise than UK-DALE, as shown in prior work [12]; consequently, results obtained when testing on UK-DALE are expected to show higher F1 scores. Moreover, we observed that our model PirnatCross consistently outperformed the other models in both testing scenarios, achieving the highest weighted average F1 scores overall.

5 CONCLUSIONS

To address the challenge of the cross-dataset usage scenario in NILM ON/OFF classification, we propose PirnatCross, with the aim of combining maximum performance with energy efficiency. The results of our evaluation on the REFIT and UK-DALE datasets reveal that PirnatCross achieves an average performance improvement of 7.2 percentage points over the SotA and 27.2 percentage points over the baseline, underscoring its superior effectiveness in handling data from diverse sources. Additionally, PirnatCross consumes 6 times less energy than the SotA model. To develop PirnatCross, we employed our methodology: in the case of NILM classification, this included beginning with the VGG19 architecture and implementing several modifications, such as replacing the convolutional layers with transpose convolutional layers in the 5th block, incorporating a GRU layer after it, and adjusting the number of filters based on our analysis. Our analysis revealed that an increase in the number of filters in convolutional layers, and consequently an increase in the number of FLOPs, did not necessarily lead to an improvement in classification accuracy. Instead, we observed a range of steady improvement in performance, followed by a gradual decline and a significant drop once the number of filters exceeded a certain threshold. This information is crucial for optimizing the architecture of NILM models while keeping track of their carbon footprint.

ACKNOWLEDGEMENTS

This work was funded in part by the Slovenian Research Agency under the grant P2-0016. The authors would like to thank Blaž Bertalanič for insightful discussions.

REFERENCES

[1] Jamshid Aghaei and Mohammad-Iman Alizadeh. 2013. Demand response in smart electricity grids equipped with renewable energy sources: a review. Renewable and Sustainable Energy Reviews, 18, 64–72. doi: 10.1016/j.rser.2012.09.019.
[2] Eva García-Martín, Crefeda Faviola Rodrigues, Graham Riley, and Håkan Grahn. 2019. Estimation of energy consumption in machine learning. Journal of Parallel and Distributed Computing, 134, 75–88. doi: 10.1016/j.jpdc.2019.07.007.
[3] R. Gopinath, Mukesh Kumar, C. Prakash Chandra Joshua, and Kota Srinivas. 2020. Energy management using non-intrusive load monitoring techniques – state-of-the-art and future research directions. Sustainable Cities and Society, 62, 102411. doi: 10.1016/j.scs.2020.102411.
[4] Byungok Han, Woo-Han Yun, Jang-Hee Yoo, and Won Hwa Kim. 2020. Toward unbiased facial expression recognition in the wild via cross-dataset adaptation. IEEE Access, 8, 159172–159181.
[5] Gigi Hsueh. 2020. Carbon footprint of machine learning algorithms. Senior Projects Spring 2020, 296. https://digitalcommons.bard.edu/senproj_s2020/296.
[6] Jack Kelly and William Knottenbelt. 2015. The UK-DALE dataset, domestic appliance-level electricity demand and whole-house demand from five UK homes. Scientific Data, 2, 1, 1–14.
[7] Weicong Kong, Zhao Yang Dong, Bo Wang, Junhua Zhao, and Jie Huang. 2020. A practical solution for non-intrusive type II load monitoring based on deep learning and post-processing. IEEE Transactions on Smart Grid, 11, 1, 148–160. doi: 10.1109/TSG.2019.2918330.
[8] Luca Massidda, Marino Marrocu, and Simone Manca. 2020. Non-intrusive load disaggregation by convolutional neural network and multilabel classification. Applied Sciences, 10, 4. doi: 10.3390/app10041454.
[9] David Murray, Lina Stankovic, and Vladimir Stankovic. 2017. An electrical load measurements dataset of United Kingdom households from a two-year longitudinal study. Scientific Data, 4, 1, 1–12. doi: 10.1038/sdata.2016.122.
[10] Anže Pirnat, Blaž Bertalanič, Gregor Cerar, Mihael Mohorčič, Marko Meža, and Carolina Fortuna. 2022. Towards sustainable deep learning for wireless fingerprinting localization. In ICC 2022 - IEEE International Conference on Communications, 3208–3213. doi: 10.1109/ICC45855.2022.9838464.
[11] Ali Q. Al-Shetwi, M. A. Hannan, Ker Pin Jern, M. Mansur, and T. M. I. Mahlia. 2020. Grid-connected renewable energy sources: review of the recent integration requirements and control methods. Journal of Cleaner Production, 253, 119831. doi: 10.1016/j.jclepro.2019.119831.
[12] Giulia Tanoni, Emanuele Principi, and Stefano Squartini. 2022. Multi-label appliance classification with weakly labeled data for non-intrusive load monitoring. IEEE Transactions on Smart Grid, 1–1. doi: 10.1109/TSG.2022.3191908.
[13] Minjia Zhang, Wenhan Wang, Xiaodong Liu, Jianfeng Gao, and Yuxiong He. 2018. Navigating with graph representations for fast and scalable decoding of neural language models. Advances in Neural Information Processing Systems, 31.
[14] Mengran Zhou, Shuai Shao, Xu Wang, Ziwei Zhu, and Feng Hu. 2022. Deep learning-based non-intrusive commercial load monitoring. Sensors, 22, 14. doi: 10.3390/s22145250.

Indeks avtorjev / Author index

Bradeško Luka: 42
Buza Krisztian: 5
Caporusso Jaya: 33
Džeroski Sašo: 46
Eržen Tjaž: 58
Espigule-Pons Jofre: 29
Fijavž Zoran: 54
Fortuna Carolina: 62
Gobbo Elena: 25
Grobelnik Marko: 5, 29, 39
Kladnik Matic: 42
Koehorst Erik: 17
Koprivec Filip: 58
Kosjek Tina: 46
Leban Gregor: 9
Ljoncheva Milka: 46
Martinc Matej: 50
Massri M. Besher: 5
Mežnar Urban: 58
Mladenić Dunja: 9, 13, 17, 21, 25, 39, 42
Mladenić Grobelnik Adrian: 29
Nemec Peter: 9
Novalija Inna: 25
Piciga Aleksander: 46
Pirnat Anže: 62
Pollak Senja: 33, 50
Purver Matthew: 33
Robnik-Šikonja Marko: 54
Rožanec Jože M.: 9, 17
Šircelj Beno: 9
Sittar Abdul: 21
Škraba Primož: 13
Škrjanc Maja: 39
Stopar Luka: 39
Šturm Jan: 39
Topal Oleksandra: 25
Vezovnik Andreja: 50
Volčjak Domen: 39
Zajec Patrik: 13
Zaman Faizon: 29
Zupan Šemrov Manja: 25