IS 2025 INFORMACIJSKA DRUŽBA / INFORMATION SOCIETY

Zbornik 28. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2025, Zvezek C
Proceedings of the 28th International Multiconference INFORMATION SOCIETY – IS 2025, Volume C

Odkrivanje znanja in podatkovna skladišča – SiKDD
Data Mining and Data Warehouses – SiKDD

Urednika / Editors: Dunja Mladenić, Marko Grobelnik

http://is.ijs.si

6. oktober 2025 / 6 October 2025, Ljubljana, Slovenia

Urednika:
Dunja Mladenić, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana

Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič, uporabljena slika iz Pixabay
Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2025

Informacijska družba, ISSN 2630-371X
DOI: https://doi.org/10.70314/is.2025.sikdd
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 255453699
ISBN 978-961-264-322-5 (PDF)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2025

28. mednarodna multikonferenca Informacijska družba se odvija v času izjemne rasti umetne inteligence, njenih aplikacij in vplivov na človeštvo. Vsako leto vstopamo v novo dobo, v kateri generativna umetna inteligenca ter drugi inovativni pristopi oblikujejo poti k superinteligenci in singularnosti, ki bosta krojili prihodnost človeške civilizacije. Naša konferenca je tako hkrati tradicionalna znanstvena in akademsko odprta, pa tudi inkubator novih, pogumnih idej in pogledov.
Letošnja konferenca poleg umetne inteligence vključuje tudi razprave o perečih temah današnjega časa: ohranjanje okolja, demografski izzivi, zdravstvo in preobrazba družbenih struktur. Razvoj UI ponuja rešitve za številne sodobne izzive, kar poudarja pomen sodelovanja med raziskovalci, strokovnjaki in odločevalci pri oblikovanju trajnostnih strategij. Zavedamo se, da živimo v obdobju velikih sprememb, kjer je ključno, da z inovativnimi pristopi in poglobljenim znanjem ustvarimo informacijsko družbo, ki bo varna, vključujoča in trajnostna.

V okviru multikonference smo letos združili dvanajst vsebinsko raznolikih srečanj, ki odražajo širino in globino informacijskih ved: od umetne inteligence v zdravstvu, demografskih in družinskih analiz, digitalne preobrazbe zdravstvene nege ter digitalne vključenosti v informacijski družbi, do raziskav na področju kognitivne znanosti, zdrave dolgoživosti ter vzgoje in izobraževanja v informacijski družbi. Pridružujejo se konference o legendah računalništva in informatike, prenosu tehnologij, mitih in resnicah o varovanju okolja, odkrivanju znanja in podatkovnih skladiščih ter seveda Slovenska konferenca o umetni inteligenci. Poleg referatov bodo okrogle mize in delavnice omogočile poglobljeno izmenjavo mnenj, ki bo pomembno prispevala k oblikovanju prihodnje informacijske družbe.

»Legende računalništva in informatike« predstavljajo domači »Hall of Fame« za izjemne posameznike s tega področja. Še naprej bomo spodbujali raziskovanje in razvoj, odličnost in sodelovanje; razširjeni referati bodo objavljeni v reviji Informatica, s podporo dolgoletne tradicije in v sodelovanju z akademskimi institucijami ter strokovnimi združenji, kot so ACM Slovenija, SLAIS, Slovensko društvo Informatika in Inženirska akademija Slovenije.

Vsako leto izberemo najbolj izstopajoče dosežke.
Letos je nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe prejel Niko Schlamberger, priznanje za raziskovalni dosežek leta pa Tome Eftimov. »Informacijsko limono« za najmanj primerno informacijsko tematiko je prejela odsotnost obveznega pouka računalništva v osnovnih šolah. »Informacijsko jagodo« za najboljši sistem ali storitev v letih 2024/2025 pa so prejeli Marko Robnik Šikonja, Domen Vreš in Simon Krek s skupino za slovenski veliki jezikovni model GAMS. Iskrene čestitke vsem nagrajencem!

Naša vizija ostaja jasna: prepoznati, izkoristiti in oblikovati priložnosti, ki jih prinaša digitalna preobrazba, ter ustvariti informacijsko družbo, ki koristi vsem njenim članom. Vsem sodelujočim se zahvaljujemo za njihov prispevek; veseli nas, da bomo skupaj oblikovali prihodnje dosežke, ki jih bo soustvarjala ta konferenca.

Mojca Ciglarič, predsednica programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD TO THE MULTICONFERENCE INFORMATION SOCIETY 2025

The 28th International Multiconference on the Information Society takes place at a time of remarkable growth in artificial intelligence, its applications, and its impact on humanity. Each year we enter a new era in which generative AI and other innovative approaches shape the path toward superintelligence and singularity, phenomena that will shape the future of human civilization. The conference is both a traditional scientific forum and an academically open incubator for new, bold ideas and perspectives.

In addition to artificial intelligence, this year's conference addresses other pressing issues of our time: environmental preservation, demographic challenges, healthcare, and the transformation of social structures. The rapid development of AI offers potential solutions to many of today's challenges and highlights the importance of collaboration among researchers, experts, and policymakers in designing sustainable strategies.
We are acutely aware that we live in an era of profound change, where innovative approaches and deep knowledge are essential to creating an information society that is safe, inclusive, and sustainable.

This year's multiconference brings together twelve thematically diverse meetings reflecting the breadth and depth of the information sciences: from artificial intelligence in healthcare, demographic and family studies, and the digital transformation of nursing and digital inclusion, to research in cognitive science, healthy longevity, and education in the information society. Additional conferences include Legends of Computing and Informatics, Technology Transfer, Myths and Truths of Environmental Protection, Knowledge Discovery and Data Warehouses, and, of course, the Slovenian Conference on Artificial Intelligence. Alongside scientific papers, round tables and workshops will provide opportunities for in-depth exchanges of views, making an important contribution to shaping the future information society.

Legends of Computing and Informatics serves as a national »Hall of Fame« honoring outstanding individuals in the field. We will continue to promote research and development, excellence, and collaboration. Extended papers will be published in the journal Informatica, supported by a long-standing tradition and in cooperation with academic institutions and professional associations such as ACM Slovenia, SLAIS, the Slovenian Society Informatika, and the Slovenian Academy of Engineering. Each year we recognize the most distinguished achievements.

In 2025, the Michie-Turing Award for lifetime contribution to the development and promotion of the information society was awarded to Niko Schlamberger, while the Award for Research Achievement of the Year went to Tome Eftimov. The »Information Lemon« for the least appropriate information-related topic was awarded to the absence of compulsory computer science education in primary schools.
The »Information Strawberry« for the best system or service in 2024/2025 was awarded to Marko Robnik Šikonja, Domen Vreš and Simon Krek together with their team, for developing the Slovenian large language model GAMS. We extend our warmest congratulations to all awardees.

Our vision remains clear: to identify, seize, and shape the opportunities offered by digital transformation, and to create an information society that benefits all its members. We sincerely thank all participants for their contributions and look forward to jointly shaping the future achievements that this conference will help bring about.

Mojca Ciglarič, Chair of the Program Committee
Matjaž Gams, Chair of the Organizing Committee

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee
Vladimir Bajic, South Africa
Heiner Benking, Germany
Se Woo Cheon, South Korea
Howie Firth, UK
Olga Fomichova, Russia
Vladimir Fomichov, Russia
Vesna Hljuz Dobric, Croatia
Alfred Inselberg, Israel
Jay Liebowitz, USA
Huan Liu, Singapore
Henz Martin, Germany
Marcin Paprzycki, USA
Claude Sammut, Australia
Jiri Wiedermann, Czech Republic
Xindong Wu, USA
Yiming Ye, USA
Ning Zhong, USA
Wray Buntine, Australia
Bezalel Gavish, USA
Gal A. Kaminka, Israel
Mike Bain, Australia
Michela Milano, Italy
Derong Liu, Chicago, USA
Toby Walsh, Australia
Sergio Campos-Cordobes, Spain
Shabnam Farahmand, Finland
Sergio Crovella, Italy

Organizing Committee
Matjaž Gams, chair
Mitja Luštrek
Lana Zemljak
Vesna Koricki
Mitja Lasič
Blaž Mahnič

Programme Committee
Mojca Ciglarič, chair
Marjan Heričko, Boštjan Vilfan, Bojan Orel, Borka Jerman Blažič Džonova, Baldomir Zajc, Franc Solina, Gorazd Kandus, Blaž Zupan, Viljan Mahnič, Urban Kordeš, Boris Žemva, Cene Bavec, Marjan Krisper, Leon Žlajpah, Tomaž Kalin, Andrej Kuščer, Niko Zimic, Jozsef Györkös, Jadran Lenarčič, Rok Piltaver, Tadej Bajd, Borut Likar, Toma Strle, Jaroslav Berce, Janez Malačič, Tine Kolenik, Mojca Bernik, Olga Markič, Franci Pivec, Marko Bohanec, Dunja Mladenič, Uroš Rajkovič, Ivan Bratko, Franc Novak, Borut Batagelj, Andrej Brodnik, Vladislav Rajkovič, Tomaž Ogrin, Dušan Caf, Grega Repovš, Aleš Ude, Saša Divjak, Ivan Rozman, Bojan Blažica, Tomaž Erjavec, Niko Schlamberger, Matjaž Kljun, Bogdan Filipič, Gašper Slapničar, Robert Blatnik, Andrej Gams, Stanko Strmčnik, Erik Dovgan, Matjaž Gams, Jurij Šilc, Špela Stres, Mitja Luštrek, Jurij Tasič, Anton Gradišek, Marko Grobelnik, Denis Trček, Nikola Guid, Andrej Ule

KAZALO / TABLE OF CONTENTS

Odkrivanje znanja in podatkovna skladišča – SiKDD / Data Mining and Data Warehouses – SiKDD ..... 1
PREDGOVOR / FOREWORD ..... 3
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES ..... 5
Semantic Prompting for Large Language Models in Biomedical Named Entity Recognition / Calcina Erik, Novak Erik, Mladenić Dunja ..... 7
LLM Based Approach to Extracting Smells in Slovenian Corpora / Brank Janez, Novalija Inna, Mladenić Dunja, Grobelnik Marko ..... 11
BetweenTheLines – Cross Source News Analysis / Trajkov Georgi, Grobelnik Marko, Grobelnik Adrian Mladenić ..... 15
Identifying Social Self in Text: A Machine Learning Study / Caporusso Jaya, Purver Matthew, Pollak Senja ..... 19
WinWin Meets – Investigating the Future of Online Meetings / Žust Martin, Grobelnik Marko, Guček Alenka, Grobelnik Adrian Mladenić ..... 25
Predicting Ski Jumps Using State-Space Model / Hegler Živa, Camlek Neca, Jelenčič Jakob, Grobelnik Marko, Mladenić Dunja ..... 29
Predicting milling overload based on sensor data: a graph-based approach / Krumpak Roy, Rožanec Jože M., Mladenić Dunja, Guo Zhenyu, Song Tao, Roman Dumitru, Novalija Inna, Ma Xiang ..... 33
Short and Long Term Bike Rental Forecasting / Kocjančič Oskar, Žnidaršič Martin ..... 37
Predicting Traffic Intensity on Motorway Sections / Kladnik Matic, Mladenić Dunja ..... 41
Empowering Youth for Smart Cities with AI Solutions to Community and Urban Challenges in the Context of SDG 11 / Zaouini Mustafa, Costa João Pita, Rahmani Yousef, Kassis Rayan, Stopar Luka, Souss Sohaib, Lamgari Asmai, Mochariq Ouidad ..... 45
Automated First-Reply Generation for IT Support Tickets Using Retrieval-Augmented Generation and Multi-Modal Response Synthesis / Jeršek Domen, Kenda Klemen, Frattini Matteo, Klančič Rok ..... 49
A Machine-Learning Approach to Predicting the Pronunciation of Pre-Consonant l in Standard Slovene / Čibej Jaka ..... 53
Sequencing News Articles with Large Language Models within Enterprise Risk Management Context / Debeljak Žiga, Mladenić Dunja, Kenda Klemen ..... 57
Graph-Based Feature Engineering for DeFi Security Incident Severity Prediction / Pavlova Daria, Novalija Inna, Mladenić Dunja ..... 61
Evolving Neural Agents in Simulated Ecosystems / Ćetković Marija, Tošić Aleksandar, Vake Domen ..... 65
Designing AI Agents for Social Media / Sittar Abdul, Smiljanic Mateja, Guček Alenka ..... 69
Explaining Temporal Data in Manufacturing using LLMs and Markov Chains / Šturm Jan, Škrjanc Maja, Topal Oleksandra, Novalija Inna, Mladenić Dunja, Grobelnik Marko ..... 73
Active Learning for Power Grid Security Assessment: Reducing Simulation Cost with Informative Sampling / Leskovec Gašper, Mylonas Costas, Kenda Klemen ..... 77
Supporting Material Reuse in Drone Production / Cek Rok, Topal Oleksandra, Leonardi Linda, Forcolin Margherita, Kenda Klemen ..... 82
Temporal Dynamics and Causal Feature Integration for Predictive Maintenance in Manufacturing Systems: A Causality-Informed Framework / Hosseini Seyed Iman, Kenda Klemen, Mladenić Dunja ..... 86
Using Interactive Data Visualization for DeFi Market Analysis / Pavlova Daria ..... 90
A Hybrid Lexicon-Machine Learning Approach to Macedonian Sentiment Analysis / Kochovska Sofija, Kavšek Branko, Vičič Jernej ..... 94
Building an AI-Ready Data Infrastructure Towards a SDG-focused Observatory for the Brazilian Amazon / Costa João Pita, Polzer Mirozlav, Barrionuevo Leonardo, Veiga João Cândia ..... 98
Towards a format for describing networks, NetsJSON / Batagelj Vladimir, Pisanski Tomaž, Savnik Iztok, Slavec Ana, Bašić Nino ..... 102
Automating Numba Optimization with Large Language Models: A Case Study on Mutual Information / Kozamernik Lučka, Jakomin Martin, Škrlj Blaž, Urbančič Jasna ..... 106
Topological Exploration of Embedded GitHub Repository Data Using Mapper / Hrib Ivo, Zajec Patrik ..... 110
CO2 Monitoring for Energy-Efficient Workloads in Kubernetes: A Data Provider for CO2-Aware Migration / Hrib Ivo, Topal Oleksandra, Šturm Jan, Škrjanc Maja ..... 114
Beyond Surveys: Adolescent Profiling via Ecological Momentary Assessment and Mobile Sensing / Dobša Jasminka, Korenjak-Černe Simona, Novak Miranda, Pandur Maja Buhin, Šutić Lucija ..... 118
Brazil's First AI Regulatory Sandbox: Towards Responsible Innovation / Oliveira Cristina Godoy, Veiga João Cândia, Sancin Vasilka, Costa João Pita, Silva Rafael Meira, Dine Masa Kovic, Anjos Lucas Costa dos, Marcilio Thiago Gomes, Silva Anthony Novaes ..... 122
Indeks avtorjev / Author index ..... 127

Odkrivanje znanja in podatkovna skladišča – SiKDD / Data Mining and Data Warehouses – SiKDD
Urednika / Editors: Dunja Mladenić, Marko Grobelnik

PREDGOVOR

Tehnologije, ki se ukvarjajo s podatki, so močno napredovale. Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami in velikimi količinami podatkov, prišlo je do standardizacije procesov in povpraševalnih jezikov. Ko shranjevanje podatkov ni bil več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskemu procesiranju, ampak tudi analitskim vpogledom v podatke. Pri avtomatski analizi podatkov sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (knowledge discovery and data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj, zajetih v podatkih. Slovenska KDD konferenca SiKDD pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve.
Dunja Mladenić in Marko Grobelnik

FOREWORD

Data-driven technologies have progressed significantly. The first phase focused mainly on storing and efficiently accessing the data; it resulted in an industry of tools for managing large databases, related standards, supporting query languages, etc. Once data storage was no longer a primary problem, development progressed towards analytical functionality for extracting added value from the data: databases started supporting not only transactions but also analytical processing of the data. In automatic data analysis, the system itself tells the user what might be interesting: this is brought about by knowledge discovery and data mining techniques, which try to obtain new knowledge from existing data and thus provide the user with a new understanding of the events covered in the data. The Slovenian KDD conference SiKDD covers topics dealing with data analysis and discovering knowledge in data: approaches, tools, problems, and solutions.
Dunja Mladenić and Marko Grobelnik

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Janez Brank, Jožef Stefan Institute, Ljubljana
Jasminka Dobša, Faculty of Organization and Informatics, University of Zagreb
Alenka Guček, Jožef Stefan Institute, Ljubljana
Branko Kavšek, University of Primorska, Koper
Klemen Kenda, Qlector, Ljubljana
Bojana Mikelenić, Faculty of Humanities and Social Sciences, University of Zagreb
Elham Motamedi Mohammadabadi, Jožef Stefan Institute, Ljubljana
Irena Nančovska Šerbec, Faculty of Education, University of Ljubljana
Erik Novak, Jožef Stefan Institute, Ljubljana
Inna Novalija, Jožef Stefan Institute, Ljubljana
Joao Pita Costa, Quintelligence, Ljubljana
Jože Rožanec, Jožef Stefan Institute, Ljubljana
Abdul Sittar, Jožef Stefan Institute, Ljubljana
Luka Stopar, SolvesAll, Ljubljana
Blaž Škrlj, Teads, Ljubljana
Jan Šturm, Jožef Stefan Institute, Ljubljana
Oleksandra Topal, Jožef Stefan Institute, Ljubljana

Semantic Prompting for Large Language Models in Biomedical Named Entity Recognition

Erik Calcina, Erik Novak, Dunja Mladenić
Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia

Abstract

Extracting structured medical information from unstructured clinical text remains a challenge for biomedical research and decision support. Recent advances in large language models (LLMs) suggest that prompt-based methods could provide a promising alternative to traditional supervised approaches for Named Entity Recognition (NER) in the biomedical domain. This study investigates whether adding semantic descriptions of entity labels can improve NER performance on clinical texts. Using a dataset of annotated case reports, we evaluate model performance in zero-shot, few-shot, and fine-tuned settings. Results show that semantic prompts enhance accuracy in low-supervision scenarios, while offering limited benefit once models are fine-tuned.

Keywords

Named entity recognition, large language models, semantic prompting, prompt engineering, medical domain, biomedicine

1 Introduction

Biomedical texts present a critical challenge for automated analysis. Clinical case reports, patient records, and related narratives are written in free text rather than in structured formats. While they contain essential medical knowledge, their unstructured nature makes it necessary to extract and organize information for systematic use in research and clinical decision support. Doing this manually is costly, time-consuming, and challenging to scale. Therefore, an automated approach to extract relevant information is required.

Named entity recognition (NER) models enable the identification and classification of clinically relevant entities, such as biological structures, diagnostic procedures, or symptoms. Recent advances in large language models (LLMs) show strong generalizing abilities, identifying relevant entities in both zero-shot and few-shot settings. However, in the biomedical domain, performance can be hindered by specialized terminology and subtle entity distinctions. To address this, we propose enriching prompts with semantic descriptions of entity labels, providing models with explicit context to improve their understanding of the task.

This study investigates the impact of semantically enhanced prompting in biomedical named entity recognition using large language models. We evaluate the effect of enriching entity labels with semantic descriptions on model performance across zero-shot, few-shot, and fine-tuned scenarios, using the MACCROBAT2020 dataset [3]. The contributions of this paper are threefold. First, we introduce the use of semantically enhanced prompts for biomedical NER by enriching entity labels with descriptions. Second, we provide a systematic evaluation of semantic prompting across zero-shot, few-shot, and fine-tuned scenarios, assessing its effectiveness under different levels of supervision. Third, we apply a statistical validation method, McNemar's test, to rigorously assess the reliability of observed performance differences between baseline and semantically enhanced prompts.

The remainder of the paper is structured as follows: Section 2 contains the overview of the related work. Next, we present the methodology in Section 3 and describe the experiment setting in Section 4. The experiment results are found in Section 5, followed by a discussion in Section 6. Finally, we conclude the paper and provide ideas for future work in Section 7.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.3

2 Related Work

This section focuses on the related work on named entity recognition in biomedicine, as well as the use of semantic descriptions in prompting.

2.1 Prompting with semantic context

PromptNER introduced the idea of augmenting few-shot prompts with entity definitions, leading to substantial gains in F1 score on benchmarks like CoNLL, GENIA, and FewNERD, improving performance by 4–9 points compared to standard prompting [2]. Extending this idea, PromptNER unifies locating and typing into a single enriched prompt, enabling phrase extraction and entity classification simultaneously [7]. Similarly, a biomedical NER study demonstrated that "on-the-fly" inclusion of concept definitions enhances performance (+15% F1) in low-data settings [5].

2.2 Iterative and zero-shot semantic prompting

Recent work in zero-shot NER explores iterative prompt refinement to align model outputs with precise entity definitions. EvoPrompt uses an evolving definition-based framework to better distinguish between similar entity types, yielding improvements across benchmarks [9]. In a broader context, some studies found that while directly injecting semantic parses into LLM inputs can degrade performance, carefully designed semantic "hints" embedded in prompts can reliably boost outcomes [1].

2.3 Domain-specific prompt optimization

FsPONER optimizes few-shot prompts for industrial NER tasks by using semantic entity-enhanced meta prompts and task-specific exemplar selection, yielding F1 improvements of 5 to 13 points in domain benchmarks [8]. In the biomedical domain, MPE3 integrates ontology-derived label semantics into prompts, improving performance in few-shot NER scenarios [10].

Prior research has shown that enriching prompts with semantic context and label definitions can significantly boost LLM performance in both few-shot and zero-shot NER. Our work provides a systematic evaluation in the biomedical domain. By examining multiple supervision settings, benchmarking several model families, and validating differences through McNemar's test, we offer a comprehensive assessment of when semantically enriched prompts provide benefits.

3 Methodology

This study evaluates the impact of incorporating semantic information into prompts on the performance of LLMs in biomedical NER tasks. Three distinct approaches were employed: zero-shot prompting, few-shot prompting, and fine-tuning.

Zero-shot prompting. In the zero-shot setting, models were prompted to perform NER without any prior exposure to labeled examples. Two types of prompts were utilized: the baseline prompt, a standard instruction to identify and classify entities without additional context, and the semantically enhanced prompt, which includes detailed descriptions for each entity label, offering explicit semantic context to guide the model's understanding and classification.

Few-shot prompting. The few-shot approach involved providing the models with a limited number of annotated examples (k-shots) before performing NER on new texts. Similar to the zero-shot setting, both baseline and semantically enhanced prompts were employed to assess the influence of semantic information.

Fine-tuning. Fine-tuning was conducted to adapt the pre-trained LLMs to the specific biomedical NER task. Two fine-tuning strategies were explored: standard fine-tuning, where models are fine-tuned using the original dataset annotations without additional semantic information, and semantically enhanced fine-tuning, which fine-tunes models on data where annotations were supplemented with semantic descriptions of each entity label.

4 Experiment Setting

This section describes the experiment setting, which includes the dataset and prompt preparation, the fine-tuning procedure used, the evaluation metrics, and the statistical significance test description.

4.1 Dataset

The experiments were conducted using the MACCROBAT2020 dataset [3], which comprises 200 clinical case reports sourced from PubMed Central. In total, it contains 4,542 sentences, with an average of 22.7 sentences per document, and includes manual annotations of biomedical entities, events, and relations, provided in brat standoff format (https://brat.nlplab.org/standoff.html). For this study, we focused on the five most frequent entity labels within the dataset: biological structure, diagnostic procedure, lab value, sign symptom, and detailed description, supplemented by the age and sex labels. The inclusion of age and sex was motivated by their prevalence and clarity within clinical narratives, providing a basis for evaluating model performance on both complex and straightforward entity types.

Each document was segmented into individual sentences by splitting on full stops. Subsequently, each sentence, along with its associated entity annotations, was transformed into a JSON format to facilitate processing by the language models.

4.2 Semantically enhanced prompts

To enhance the semantic understanding of entity labels, detailed descriptions were crafted for each. These descriptions were derived by combining information from the MACCROBAT2020 dataset documentation and definitions from the Oxford English Dictionary [6]. The integration of these sources was performed manually, ensuring that the descriptions were both accurate and contextually relevant.

Prompts were structured as plain-text instructions, guiding the model to identify and classify entities within the provided sentences. For the semantically enhanced prompts, the detailed entity descriptions were included to provide additional context. Models were instructed to output their responses in a JSON format, explicitly focusing on the labels component. Below we present an example of the entity description, specifically for the label age.

Baseline prompt: The age of the patient.
Semantically enhanced prompt: The duration of time a patient has lived, expressed numerically (e.g., '65-year-old', '20 years old') or categorically (e.g., 'newborn', 'teenage'), representing their age at the time of presentation.

This added context is intended to improve the model's ability to distinguish and extract nuanced biomedical entities more accurately.

4.3 Fine-tuning procedure

Fine-tuning is carried out using parameter-efficient techniques, where only lightweight adapter modules are trained instead of modifying the full model. This strategy reduces memory usage, mitigates catastrophic forgetting, and accelerates training.

To further improve efficiency, models are quantized to 4-bit precision. Fine-tuning is supervised and focuses on the generated outputs; all non-target tokens (e.g., system prompts, input context) are masked during loss computation. This ensures that training adapts the model to the expected JSON label output format rather than to the input content or prompt structure.

4.4 Evaluation metrics

To evaluate entity recognition performance, we use two F1-based metrics. The Exact F1 score measures strict matches, requiring predicted entities to align perfectly with the reference text and label. The Relaxed F1 score allows partial matches, counting predictions as correct if they include the true entity as a substring with the correct label.

4.5 McNemar statistical significance test

While Exact and Relaxed F1 scores quantify the magnitude of performance differences, they do not establish whether these differences are statistically reliable. The McNemar test [4] complements the Exact F1 metric by verifying whether observed improvements can be attributed to the semantically enhanced prompts rather than random variation. Following standard NER practice, we treat Exact F1 as the primary endpoint and therefore apply McNemar's test only to exact match predictions.

Let b denote the number of cases correctly predicted by the semantically enhanced model but missed by the baseline, and c the number of cases correctly predicted by the baseline but missed by the semantically enhanced model. Only discordant pairs (b, c) contribute to the test; agreements do not affect the statistic. Using the continuity-corrected version of the test, the statistic is computed as

χ² = (|b − c| − 1)² / (b + c),

which follows a chi-squared distribution with one degree of freedom. The corresponding p-value allows us to test the null hypothesis H0: the two models have equal marginal probabilities (i.e., performance differences are due to chance). Conventionally, p < .001 is considered statistically significant.

5 Results

This section presents model performance under three experimental conditions: zero-shot, few-shot, and fine-tuned prompting. For each condition, we compare the impact of semantically enhanced prompts against standard prompts using Exact and Relaxed F1 scores on a subset of clinically relevant entity types.

5.1 Zero-shot prompting

Table 1 reports the Exact and Relaxed F1 scores for models evaluated in the zero-shot setting using semantically enhanced prompts. Without semantic descriptions, most models struggled to generate outputs in the required JSON format, and valid scores could not be computed. Even with semantically enhanced prompts, Meta-Llama-3.1-8B consistently failed to produce structured responses.

5.2 Few-shot prompting

Table 2 summarizes the Exact and Relaxed F1 scores for few-shot prompting. The addition of semantic information consistently improved model performance across most models. Notably, txgemma-9b-chat achieved the highest Exact F1 score (0.3288) and Relaxed F1 score (0.4998) with semantic prompting, compared to 0.2732 and 0.4469 without.

Both Llama-3.1-8B-Instruct and Llama-3.2-3B-Instruct showed improvements in both Exact and Relaxed F1 scores when provided with semantically enhanced prompts. For instance, Llama-3.1-8B-Instruct improved from 0.2509 to 0.3005 (Exact) and from 0.3526 to 0.3948 (Relaxed), while Llama-3.2-3B-Instruct increased from 0.2300 to 0.2439 (Exact) and from 0.3769 to 0.3948 (Relaxed). These gains highlight the benefit of enriching prompt instructions when training data is limited. However, not all models responded positively. For example, Meta-Llama-3.1-8B experienced a drop in Exact F1 from 0.2698 to 0.2210 and in Relaxed F1 from 0.3537 to 0.2799, indicating that semantically enhanced prompts do not universally improve performance and may be less effective for some models.

To assess the reliability of these differences, we conducted McNemar tests on paired Exact predictions. The tests revealed that performance differences between baseline and semantically enhanced prompts were statistically significant for all models except Llama-3.2-3B-Instruct. It is important to note, however, that significance here indicates that the two variants produce systematically different predictions, but does not itself imply improvement. For instance, while the difference for Meta-Llama-3.1-8B was highly significant, the semantically enhanced model in fact performed worse in terms of F1 scores.

5.3 Fine-tuned performance

In the fine-tuning scenario, results were more nuanced.
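The preprocessing described in Section 4.1, splitting case reports on full stops and pairing each sentence with its entity annotations in a JSON record, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the (start, end, label) span tuples, and the record fields are hypothetical stand-ins for the brat standoff annotations.

```python
import json

def prepare_sentences(document_text, annotations):
    """Split a case report on full stops and pair each sentence with the
    entity annotations whose character spans fall inside it.

    annotations: list of (start, end, label) spans, an illustrative
    stand-in for brat standoff entries."""
    records = []
    offset = 0
    for raw in document_text.split("."):
        start = document_text.index(raw, offset)
        end = start + len(raw)
        offset = end
        sentence = raw.strip()
        if not sentence:
            continue
        # keep only annotations fully contained in this sentence's span
        labels = [
            {"text": document_text[s:e], "label": lab}
            for (s, e, lab) in annotations
            if s >= start and e <= end
        ]
        records.append({"sentence": sentence, "labels": labels})
    # serialize each record so the language model receives plain JSON
    return [json.dumps(r, ensure_ascii=False) for r in records]
```

Each returned string is one training/evaluation example pairing a sentence with its gold entities.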
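The baseline and semantically enhanced prompt variants described in Section 4.2 can be sketched like this. The instruction wording, the LABEL_DESCRIPTIONS mapping, and the build_prompt helper are all hypothetical; only the age description is quoted from the paper, and the sex description is invented for illustration.

```python
# Illustrative label descriptions; only "age" is taken from the paper.
LABEL_DESCRIPTIONS = {
    "age": ("The duration of time a patient has lived, expressed "
            "numerically (e.g., '65-year-old', '20 years old') or "
            "categorically (e.g., 'newborn', 'teenage'), representing "
            "their age at the time of presentation."),
    "sex": "The biological sex of the patient.",  # invented example
}

def build_prompt(sentence, labels, semantic=False):
    """Build a plain-text NER instruction; with semantic=True, per-label
    descriptions are appended (the 'semantically enhanced' variant)."""
    lines = [
        "Identify and classify the following entity types in the "
        "sentence and answer with a JSON object of the form "
        '{"labels": [{"text": ..., "label": ...}]}.',
        "Entity types: " + ", ".join(labels),
    ]
    if semantic:
        # explicit semantic context for each entity label
        lines += [f"- {l}: {LABEL_DESCRIPTIONS[l]}" for l in labels]
    lines.append(f"Sentence: {sentence}")
    return "\n".join(lines)
```

The same helper serves both conditions, so the only difference between the two prompts is the presence of the description block.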
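The Exact and Relaxed F1 metrics of Section 4.4 can be made concrete with a small sketch, treating predictions and gold annotations as (text, label) pairs; the greedy one-to-one matching shown here is an assumption beyond what the paper specifies.

```python
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def score(predicted, gold, relaxed=False):
    """Entity-level F1. Exact: the predicted (text, label) pair must
    equal a gold pair. Relaxed: a prediction counts if it contains the
    gold entity text as a substring and carries the same label."""
    matched = set()
    tp = 0
    for p_text, p_label in predicted:
        hit = None
        for i, (g_text, g_label) in enumerate(gold):
            if i in matched or p_label != g_label:
                continue
            ok = (g_text in p_text) if relaxed else (g_text == p_text)
            if ok:
                hit = i
                break
        if hit is not None:
            matched.add(hit)
            tp += 1
    fp = len(predicted) - tp   # predictions with no gold match
    fn = len(gold) - tp        # gold entities never matched
    return f1(tp, fp, fn)
```

A prediction like "high fever" misses under the exact criterion but matches the gold entity "fever" under the relaxed one.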
As shown Among the evaluated models, Llama-3.1-8B-Instruct achiev- in Tables 2, most models performed strongly even without seman- ed the highest Exact F1 score, while txgemma-9b-chat attained tic enhancements. For instance, Meta-Llama-3.1-8B attained the the best Relaxed F1 score. Llama-3.2-3B-Instruct and DeepSeek- highest Exact F1 score (0.7099) with semantic input, only slightly Qwen-7B also demonstrated non-trivial performance in both met- outperforming its baseline (0.7076), and this difference was not rics. These results suggest that semantically enhanced prompts statistically significant (𝑝 ≈ 0.64). can effectively compensate for the absence of training examples Some models, such as Llama-3.1-8B-Instruct and Llama- in zero-shot scenarios by providing clearer task guidance and 3.2-3B-Instruct, even showed small performance drops when improving structured prediction output. semantic descriptions were included, with McNemar tests con- firming that these differences were not significant (𝑝 ≈ 0.75 and Table 1: Exact and Relaxed F1 scores in the zero-shot set- 𝑝 . ≈ 088). This suggests that in settings where the model is al- ting with semantically enhanced prompts. Bolded values ready exposed to sufficient task specific supervision, additional indicate the highest score in each column. Results without prompt-level context may offer limited benefit or even introduce valid JSON output are marked with redundancy. / . In contrast, TxGemma-9B-Chat exhibited the most notable improvement, with Exact and Relaxed F1 scores increasing from Model Exact F1 Semantics Relaxed F1 Semantics 0.6837 to 0.7092 and from 0.7483 to 0.7686, respectively; the Llama-3.1-8B-Instruct2 McNemar test confirmed this difference as statistically signif- 0.2310 0.3708 − 5 3 icant ( 𝑝 ≈ 9 . 7 × 10 ). 
By comparison, DeepSeek-Qwen-7B also Meta-Llama-3.1-8B / / − 3 4 showed a significant difference ( 𝑝 ≈ 6 × 10 ), but in this case Llama-3.2-3B-Instruct 0.1620 0.3254 the semantically enhanced model performed worse (Exact F1: 5 DeepSeek-Qwen-7B 0.1592 0.3217 0.7013 6 → 0.6879). txgemma-9b-chat 0.2181 0.4245 5.4 Overall observations 2https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct The largest performance improvements from semantically en- 3https://huggingface.co/meta-llama/Llama-3.1-8B hanced prompts appeared in zero-shot and few-shot settings, 4https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct where gains in F1 scores were often statistically significant. In 5https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B 6https://huggingface.co/google/txgemma-9b-chat contrast, fine-tuned models showed smaller and mixed effects: 9 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia Calcina et al. Table 2: Exact (left) and Relaxed (right) F1 scores for selected labels in few-shot and fine-tuned settings, with and without semantically enhanced prompts. Bolded values indicate the highest score in each column. We use symbols ◦ and • to denote whether the differences between using the baseline or semantically enhanced prompts are statistically significant (•) or not (◦) according to the McNemar test at a significance level of 𝑝 = 0.01. Exact F1 Relaxed F1 Model Few-Shot Fine-Tuned Few-Shot Fine-Tuned / Semantic / Semantic / Semantic / Semantic Llama-3.1-8B-Instruct 0.2509 0.3005 • 0.7053 0.7004 ◦ 0.3526 0.3948 0.7660 0.7645 Meta-Llama-3.1-8B 0.2698 0.2210 • 0.7076 0.7099 ◦ 0.3537 0.2799 0.7670 0.7765 Llama-3.2-3B-Instruct 0.2300 0.2439 ◦ 0.6881 0.6867 ◦ 0.3769 0.3948 0.7629 0.7622 DeepSeek-Qwen-7B 0.1423 0.2270 • 0.7013 0.6879 • 0.2465 0.3891 0.7584 0.7521 txgemma-9b-chat 0.2732 0.3288 • 0.6837 0.7092 • 0.4469 0.4998 0.7483 0.7686 for most, differences were not significant, though TxGemma-9B- notable gains in both Exact and Relaxed F1 scores. 
In contrast, Chat benefited reliably while DeepSeek-Qwen-7B showed a fine-tuned models already exposed to task-specific data showed significant decrease. These results indicate that semantic prompt- only marginal improvement. ing is most effective in low-resource conditions, while its impact Future work could explore adaptive semantic prompting strate- under full supervision is limited and model-dependent. gies, such as ontology-driven label enrichment, and further in- vestigate the trade-offs between prompt length and inference 6 Discussion efficiency. Additionally, this method could be tested on larger This section discusses the experiment findings and highlights the datasets and across different models to assess its generalizability. advantages and disadvantages of the different approaches. In summary, semantically enhanced prompts offer a straight- forward yet effective way to boost clinical NER performance 6.1 in low-data regimes, but their impact diminishes as models are Model pretraining and domain adaptation exposed to more supervised training. TxGemma-9B-Chat, based on the Gemma 2 architecture and fur- ther fine-tuned on therapeutic development data, outperformed Acknowledgements general-purpose models in a few-shot scenario. This suggests that domain-specific pretraining can significantly improve per- This work was supported by the Slovenian Research Agency. formance when supervision is limited. However, in the full fine- Funded by the European Union. UK participants in Horizon Eu- tuning setting, its advantage diminished. In fact, general models rope Project PREPARE are supported by UKRI grant number like Meta-Llama-3.1-8B achieved comparable but slightly better 10086219 (Trilateral Research). 
Views and opinions expressed are results, indicating that once sufficient task-specific supervision however those of the author(s) only and do not necessarily reflect is provided, prior domain specialization offers limited additional those of the European Union or European Health and Digital Ex- benefit. ecutive Agency (HADEA) or UKRI. Neither the European Union nor the granting authority nor UKRI can be held responsible for 6.2 them. Grant Agreement 101080288 PREPARE HORIZON-HLTH- Prompt quality matters 2022-TOOL-12-01. The structure and clarity of prompts are critical to model per- formance. Poorly designed prompts often resulted in JSON for- References matting errors or reduced accuracy, particularly in zero-shot and [1] Kaikai An, Shuzheng Si, Yuchi Wang, et al. 2024. Rethinking semantic pars-few-shot settings. While adding semantic context improves task ing for large language models. arXiv preprint arXiv:2409.14469. understanding by making objectives and entity definitions more [2] Dhananjay Ashok and Zachary C. Lipton. 2023. Promptner: prompting for explicit, excessive length or ambiguity can offset these gains. [3] named entity recognition. arXiv preprint arXiv:2305.15444. J. Harry Caufield, Yichao Zhou, Yunsheng Bai, David A. Liem, Anders O. Garlid, Kai-Wei Chang, Yizhou Sun, Peipei Ping, and Wei Wang. 2019. 6.3 A comprehensive typing system for information extraction from clinical Prompt length vs. model response narratives. medRxiv. Preprint. doi: 10.1101/19009118. Semantic enrichment inevitably increases prompt length, which [4] Quinn McNemar. 1947. Note on the sampling error of the difference between can slow response time and raise computational overhead. It correlated proportions or percentages. Psychometrika, 12, 2, (June 1947), 153–157. doi: 10.1007/bf02295996. may also overwhelm smaller models when excessive detail is [5] Monica Munnangi, Sergey Feldman, Byron C. Wallace, et al. 2024. On- included. 
In practical applications, this must be weighed against the-fly definition augmentation of llms for biomedical ner. arXiv preprint the potential gains in entity extraction accuracy. arXiv:2404.00152. [6] 2025. Oxford english dictionary. https://www.oed.com/. Accessed: 2025-06- 17. (2025). 7 [7] Yongliang Shen, Zeqi Tan, Shuhui Wu, et al. 2023. Promptner: prompt Conclusion locating and typing for named entity recognition. In ACL (Long Papers). This study investigated the impact of a semantically enhanced [8] Yongjian Tang, Rakebul Hasan, and Thomas Runkler. 2024. Fsponer: few-shot prompt design on LLM-based NER in the clinical domain. Our prompt optimization for named entity recognition.arXiv preprint arXiv:2407.08035. [9] Zeliang Tong, Zhuojun Ding, and Wei Wei. 2025. Evoprompt: evolving experiments on the MACCROBAT2020 dataset demonstrated prompts for enhanced zero-shot named entity recognition. In COLING. that adding semantic label descriptions significantly improves [10] Yuwei Xia, Zhao Tong, Liang Wang, et al. 2023. Learning meta-prompt with model performance in zero-shot and few-shot scenarios, with entity-enhanced semantics for few-shot ner. SSRN. 
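The continuity-corrected McNemar statistic described above is straightforward to compute directly from the discordant counts b and c. The sketch below is a minimal standard-library illustration (the function name `mcnemar_test` is ours, not from the paper's code); for one degree of freedom, the chi-squared survival function reduces to erfc(√(x/2)).

```python
import math

def mcnemar_test(b: int, c: int) -> tuple[float, float]:
    """Continuity-corrected McNemar test on discordant counts.

    b: cases the semantically enhanced model got right and the baseline missed
    c: cases the baseline got right and the enhanced model missed
    Returns (chi-squared statistic, two-sided p-value).
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-squared with 1 dof: P(X > x) = erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

chi2, p = mcnemar_test(b=40, c=10)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")  # strongly discordant counts: p well below 0.01
```

Note that only the discordant pairs enter the statistic, so passages on which both prompt variants agree can be dropped before tabulation without changing the result.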
LLM Based Approach to Extracting Smells in Slovenian Corpora

Janez Brank, Jožef Stefan Institute, Ljubljana, Slovenia, janez.brank@ijs.si
Inna Novalija, Jožef Stefan Institute, Ljubljana, Slovenia, inna.koval@ijs.si
Dunja Mladenić, Jožef Stefan Institute, Ljubljana, Slovenia, dunja.mladenic@ijs.si
Marko Grobelnik, Jožef Stefan Institute, Ljubljana, Slovenia, marko.grobelnik@ijs.si

Abstract

This paper presents a comparative study of automatic smell detection in Slovenian cultural heritage texts using both keyword-based search and large language model (LLM) inference. We process a portion of the dLib.si corpus from the late 19th and early 20th centuries, analyzing over 1.6 million text segments for olfactory references. The keyword method leverages an expert-curated list of smell terms, while the LLM method applies semantic inference via prompt-engineered queries. We compare the methods in terms of detection density, temporal trends, and agreement overlap. Additionally, we visualize the semantic landscape of extracted smell terms using t-SNE and unsupervised clustering with auto-generated labels. Our findings reveal limited overlap between methods, a shared rise in smell mentions over time, and distinct semantic clusters ranging from industrial to culinary and bodily smells. This study highlights the value of combining symbolic and neural approaches for nuanced sensory mining in digital heritage corpora.

Keywords

LLM, Artificial Intelligence, Cultural Heritage, Text Mining

1 Introduction

Olfactory perception is an essential yet underexplored dimension in the analysis of historical texts, particularly within the cultural heritage domain. Smells, though intangible, play a critical role in shaping memory, atmosphere, and cultural meaning. However, their representation in written sources is often subtle, indirect, or metaphorical. This challenge becomes more pronounced in historical corpora such as 19th- and early 20th-century Slovenian publications, where evolving linguistic practices and cultural norms affect how sensory information is encoded.

This paper explores automatic smell detection in Slovenian cultural heritage texts using two complementary strategies: (1) a keyword-based approach derived from an expert-curated list of smell-related expressions and their morphological variants, and (2) large language model (LLM)-based semantic inference using prompt-engineered queries via the Together.ai platform. We process a subset of the dLib.si digital library corpus of Slovenian texts, divided into temporal buckets, and evaluate the performance, overlap, and divergence between the two methods.

To facilitate large-scale analysis, we produce and analyze over 1.6 million document-query pairs, extracting smell mentions, classifying them by agreement type, and visualizing their distributions both temporally and semantically. Our goals are twofold: (i) to quantify the representational density of olfactory references in the corpus, and (ii) to better understand how computational methods can surface subtle cultural patterns that evade traditional keyword search alone.

This work contributes toward a richer modeling of sensory information in digital heritage collections and highlights the value of combining symbolic and neural methods for text mining in the cultural heritage domain.

2 Related Work

Recent years have seen increased interest in the computational modeling of olfactory expressions in historical and cultural texts. A prominent initiative in this space is the Odeuropa project [7], which focused on identifying, curating, and semantically linking smell-related content in European heritage corpora, and which has produced the European Olfactory Knowledge Graph and tools like the Smell Explorer to trace historical olfactory knowledge across 400 years of European sources [7, 5].
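As a minimal illustration of the passage-level agreement analysis sketched in the abstract above (each passage is classified by which method detected a smell reference, and the two result sets are compared by their overlap), the bookkeeping might look as follows; all function names and example terms here are hypothetical, not taken from the system itself.

```python
def agreement_category(llm_terms: set[str], kw_terms: set[str]) -> str:
    """Classify one passage by which detection method fired."""
    if llm_terms and kw_terms:
        return "Both"
    if llm_terms:
        return "LLM Only"
    if kw_terms:
        return "Keyword Only"
    return "None"

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical detections for a single passage (Slovene smell-related terms)
llm = {"vonj", "dišava", "smrad"}
kw = {"vonj", "smrad", "duh"}
print(agreement_category(llm, kw))   # Both
print(round(jaccard(llm, kw), 2))    # 2 shared terms out of 4 distinct: 0.5
```

Aggregating these per-passage categories over the corpus yields the agreement counts reported later, while the Jaccard score summarizes how much the two term sets coincide.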
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia · Brank et al. · © Copyright held by the owner/author(s) · https://doi.org/10.70314/is.2025.sikdd.5

Research on sensory perception in NLP has traditionally focused on the visual and auditory modalities, while olfaction remains relatively underexplored. Annotation frameworks such as the Olfactory Event Frame and guidelines for labeling sources, qualities, and experiences [6] provide structured resources for information extraction from historical and literary corpora. Traditional approaches to olfactory semantics rely on fixed lexicons such as the Dravnieks Atlas [1] and the DREAM challenge descriptors [3]. For morphologically complex and low-resource languages such as Slovene, monolingual models like SloBERTa [10] and sequence-to-sequence models like SloT5 [9] demonstrate that tailoring architectures to linguistic structure improves performance over multilingual baselines. A wide range of Slovene corpora underpins these modeling efforts. Gigafida 2.0, a reference corpus of 1.1 billion tokens covering contemporary written Slovene, provides a large-scale foundation for model pretraining and evaluation [4]. For user-generated content, the JANES corpus supplies richly annotated Slovene social media text, including normalization and NER [2].

Unlike prior studies that primarily focus on annotation frameworks, fixed olfactory lexicons, or large-scale multilingual heritage initiatives such as Odeuropa, our work provides the first comparative evaluation of keyword-based and LLM-based smell detection specifically for Slovenian cultural heritage corpora, highlighting the interplay between symbolic coverage and neural semantic inference.

3 Corpora and Preprocessing

For the experiments presented in this paper, we used texts from the Slovenian Digital Library (dLib.si). Initially we downloaded, from the Library's website, all documents from the period 1870–1919 for which OCRed text was available and whose language was marked as Slovene in the metadata there. In terms of content, this covers nearly all books, newspapers, magazines etc. published in Slovene during that period. From this corpus we then randomly selected 7% of the documents from each year for further processing; thus the selected subset maintains the same distribution over time, genre, etc. as the full corpus. This resulted in a dataset of approx. 366 thousand documents with a total of 105 million words.

4 Methodology

This section outlines the analytical pipeline used to detect, compare, and interpret smell-related expressions in Slovenian cultural heritage texts. Our approach combines large language model inference, keyword-based retrieval, temporal and density statistics, and unsupervised semantic clustering.

4.1 Comparative Evaluation of Detection Methods

In order to identify olfactory expressions, we employed two complementary strategies:

• LLM-based Extraction: Each document was split into passages and processed using an LLM (the Llama-3.3-70B-Instruct-Turbo-Free model, accessed via Together.ai). The model returned a list of potential smell-related words or phrases, structured in JSON format. In cases of formatting failure, raw strings or exception messages were recorded.
• Keyword-Based Search: A manually curated index of smell-related expressions, including morphologically inflected forms, was used for direct string matching within each passage. This index has been kindly provided by Mojca Ramšak and is based on her work on the anthropology of smell [8].

For each passage, we recorded both LLM and keyword results. We classified outcomes into four categories: LLM Only, Keyword Only, Both, or None. Additionally, we computed the Jaccard similarity J between the two result sets:

    J(A, B) = |A ∩ B| / |A ∪ B|,

where A is the set of LLM-based results and B is the set of keyword-based results. This metric enabled quantitative comparison of coverage and intersection across detection methods.

4.2 Temporal Distribution of Smell Mentions

We extracted the year of publication from each document's metadata. For each year, we aggregated:

• Total LLM-detected smell terms
• Total keyword-detected smell terms
• Number of processed queries

These aggregates were used to generate yearly time series, revealing longitudinal patterns in olfactory expression across the corpus. This temporal analysis supports hypotheses about cultural shifts, such as increasing industrial or bodily smell discourse over time.

4.3 Semantic Typology via Clustering of Smell Terms

To explore latent smell categories, we constructed a semantic typology using the following steps:

• Term Extraction: We extracted the 500 most frequent smell-related terms from the combined LLM and keyword results.
• Vectorization: Terms were embedded using TF-IDF vectors over character-level n-grams (char_wb with range 2–4), capturing morphological similarity.
• Dimensionality Reduction: The high-dimensional vectors were projected to two dimensions using t-SNE (perplexity = 30), yielding a visual semantic landscape.
• Clustering: We applied k-means clustering (with k = 8) to the t-SNE coordinates. For each cluster, the top 5 TF-IDF terms were used to generate semantic labels (e.g., "Herbs & Cooking", "Pharmaceutical Smells").
• Visualization: The clusters were visualized with color-coded labels and representative terms. Interactive versions were built using plotly.

This typology enables data-driven classification of smell discourse and provides interpretable categories for cultural and linguistic analysis.

4.4 Document-Level Smell Density Analysis

To assess the distribution of olfactory content across documents, we computed the smell density as the ratio of detected terms to queries per document:

    Density_LLM = (# LLM terms) / (# queries)
    Density_Keyword = (# keyword terms) / (# queries)

This metric enabled identification of smell-rich and smell-sparse texts. Density distributions were visualized using boxplots and descriptive statistics, facilitating selection of representative or outlier texts for deeper qualitative analysis.

5 Evaluation and Results

We evaluated two complementary approaches to detecting olfactory references in historical corpora: a keyword-based method and an LLM-based classifier. The results highlight both convergences and divergences in performance across time, document density, and semantic coverage.

Figure 1 shows yearly frequencies of smell-related mentions from 1870 to 1920. While keyword-based detection consistently yields higher absolute counts than the LLM, both methods exhibit similar growth trajectories.

Figure 1: Yearly trends in smell term mentions. Keyword-based detection consistently returns higher frequencies than the LLM, but both show similar growth patterns.

Agreement analysis between the two methods (Figure 2) reveals substantial divergence. Only about one-third of passages are identified by both approaches. A large portion is captured exclusively by the keyword method, while the LLM contributes a smaller but meaningful number of unique detections. A significant subset of passages registers no olfactory detection at all, probably because most documents do not mention smell-related topics in the first place.

Figure 2: Detection agreement between LLM and keyword methods. Most passages are matched by one method only, with a significant number showing no detection. The overlap ("Both") occurs in fewer than one-third of cases.

Figure 3 illustrates the distribution of smell term density per document. Keyword-based detection generally produces higher densities of references, whereas the LLM outputs are sparser but potentially more semantically filtered. Both distributions exhibit long-tailed outliers, where certain documents contain disproportionately high concentrations of olfactory mentions.

Figure 3: Smell term density per document. While outliers exist for both methods, keyword-based detection generally identifies a higher density of smell references per query.

To further analyze lexical diversity, we applied t-SNE to embed and cluster smell-related terms (Figure 4). The resulting semantic landscape reveals coherent groupings that align with cultural domains, including food, ritual, body, and chemical references. These clusters highlight the variety of olfactory expressions and suggest that both methods capture complementary facets of the semantic space. The LLM appears particularly adept at recognizing context-dependent terms, while the keyword method anchors clusters in explicit lexical cues.

Figure 4: t-SNE semantic landscape of smell terms, clustered by character-level similarity and automatically labeled using top TF-IDF terms per group. The visualization reveals coherent groups such as food, ritual, body, and chemical references.

Overall, the keyword-based approach provides broader coverage and higher frequencies, but at the cost of noise and overcounting. The LLM method, while more conservative, contributes precision and captures context-sensitive olfactory references that keywords may overlook. The combination of both thus provides a richer and more balanced representation of olfactory discourse in historical texts.

6 Discussion

Our analysis reveals several key insights into olfactory representations in Slovenian cultural heritage texts and the methodological implications of combining LLM-based and keyword-based detection.

First, both detection strategies show meaningful trends over time, with a noticeable increase in smell-related references around the turn of the 20th century. This may reflect broader urbanization, industrialization, and shifts in public health discourse, which intensified the cultural significance of air quality, hygiene, and olfactory environments.

Second, although keyword-based detection consistently returned more hits, the LLM-based method surfaced a distinct set of semantically inferred mentions. As the agreement analysis shows, only a minority of mentions (∼24%) were matched by both methods. One possible explanation of this would be if neural inference captures more nuanced or contextually implied smell references, such as metaphorical use ("a whiff of suspicion") or implied odors in narrative scenes.

Third, density analysis suggests that LLMs return sparser but more targeted mentions, while keyword detection produces broader but sometimes noisier coverage. This difference is critical for researchers deciding between high recall and high precision when exploring sensory data in historical texts.

Finally, the t-SNE landscape of smell terms uncovered semantically coherent clusters — e.g., medicinal substances, industrial emissions, festive foods, and bodily decay — and allowed us to generate meaningful auto-labels using top TF-IDF terms. Such visualizations provide a valuable tool for cultural historians to engage with thematic patterns across large-scale textual datasets.

Overall, our findings underscore the value of hybrid approaches to cultural text analysis. By comparing symbolic and neural perspectives, we gain both coverage and subtlety, enabling a deeper reconstruction of sensory worlds encoded in the archives.

7 Conclusion and Future Work

We conducted a dual-method analysis of olfactory references in Slovenian historical texts, revealing how keyword search and LLM-based inference each contribute unique perspectives to sensory data mining. Our results show that while the keyword method offers broad lexical coverage, the LLM can detect more subtle, implied, or metaphorical references often overlooked by surface-level matching.

Furthermore, t-SNE clustering of smell terms revealed rich thematic structures — such as food, medicine, pollution, and ritual — highlighting the semantic complexity of olfactory language.

Together, these results demonstrate the complementary strengths of symbolic and neural approaches for enriching digital humanities research, especially in domains like historical sensory studies where annotation is sparse and vocabulary is diffuse.

Several promising directions remain open for further exploration. First, we plan to expand the dataset to cover all documents in the dLib.si corpus, enabling more robust longitudinal and regional analyses. Second, we aim to improve LLM prompts to better handle nested or narrative contexts, including smells embedded in metaphor, irony, or emotional framing.

Another avenue involves extending the classification of smell mentions into functional categories (e.g., pleasant vs. unpleasant, natural vs. artificial, bodily vs. environmental) using additional LLM-based postprocessing. We also intend to explore multilingual smell detection, comparing Slovene with other Central European languages to study cultural convergence and divergence in olfactory discourse.

Finally, we hope to integrate our smell detection pipeline into public digital heritage platforms, providing curators, historians, and linguists with new tools for sensory exploration of archival materials.

Acknowledgements

This work was supported by the Slovenian Research Agency under the project J7-50233.

References

[1] Andrew Dravnieks. 1992. Atlas of Odor Character Profiles. ASTM International, (Feb. 1992). ISBN: 978-0-8031-0456-3. doi: 10.1520/DS61-EB.
[2] Darja Fišer, Nikola Ljubešić, and Tomaž Erjavec. 2020. The JANES project: language resources and tools for Slovene user generated content. Language Resources and Evaluation, 54, 1, pp. 223–246. Retrieved Aug. 27, 2025 from https://www.jstor.org/stable/48740864.
[3] Andreas Keller et al. 2017. Predicting human olfactory perception from chemical features of odor molecules. Science, 355, (Feb. 2017), eaal2014. doi: 10.1126/science.aal2014.
[4] Simon Krek, Špela Arhar Holdt, Tomaž Erjavec, Jaka Čibej, Andraž Repar, Polona Gantar, Nikola Ljubešić, Iztok Kosem, and Kaja Dobrovoljc. 2020. Gigafida 2.0: the reference corpus of written standard Slovene. In Proceedings of the Twelfth Language Resources and Evaluation Conference. Nicoletta Calzolari et al., editors. European Language Resources Association, Marseille, France, (May 2020), 3340–3345. ISBN: 979-10-95546-34-4. https://aclanthology.org/2020.lrec-1.409/.
[5] P. Lisena, T. Ehrhart, and R. Troncy. European Olfactory Knowledge Graph. Zenodo. doi: 10.5281/zenodo.10709703.
[6] Stefano Menini, Teresa Paccosi, Serra Sinem Tekiroğlu, and Sara Tonelli. 2023. Scent mining: extracting olfactory events, smell sources and qualities. In Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. Stefania Degaetano-Ortlieb, Anna Kazantseva, Nils Reiter, and Stan Szpakowicz, editors. Association for Computational Linguistics, Dubrovnik, Croatia, (May 2023), 135–140. doi: 10.18653/v1/2023.latechclfl-1.15.
[7] ODEUROPA Project Consortium. 2021–2023. ODEUROPA: negotiating olfactory and sensory experiences in cultural heritage practice and research. https://odeuropa.eu/. EU Horizon 2020 research and innovation programme, grant agreement No. 101004469. Royal Netherlands Academy of Arts and Sciences (KNAW) Humanities Cluster et al., (2021–2023).
[8] Mojca Ramšak. 2025. Antropologija vonja. AMEU-ISH, Ljubljana.
[9] Matej Ulčar and Marko Robnik-Šikonja. 2023. Sequence-to-sequence pretraining for a less-resourced Slovenian language. Frontiers in Artificial Intelligence, 6.
[10] Matej Ulčar and Marko Robnik-Šikonja. 2021. SloBERTa: Slovene monolingual large pretrained masked language model. In SiKDD.

BetweenTheLines - Cross Source News Analysis

Georgi Trajkov, Jožef Stefan Institute, Ljubljana, Slovenia, geotrajkov0@gmail.com
Marko Grobelnik, Jožef Stefan Institute, Ljubljana, Slovenia, marko.grobelnik@ijs.si
Adrian Mladenic Grobelnik, Jožef Stefan Institute, Ljubljana, Slovenia, adrian.m.grobelnik@ijs.si

Abstract

Different news outlets covering the same event often emphasize, omit, or frame facts differently, making cross-source comparison essential for understanding media bias and information diversity. Large language models (LLMs) can automate this analysis, but simple single-LLM prompt approaches tend to underperform when processing large amounts of data [1]. Platforms like Ground News [2] and Event Registry [3] provide publisher- and article-level bias scores but cannot track how individual claims and entities are portrayed by articles.
The fundamental challenge is determining whether LLM prompt architecture affects accuracy when classifying claim presence across multiple news sources. We show that a multi-prompt LLM architecture reduces classification errors 7-fold (from 33.0% to 4.67%) compared to single-prompt approaches. Our pipeline first extracts all claims and entities from the articles collectively, then evaluates each article separately for claim presence (confirmed/contradicted/partial/absent) and entity sentiment. This decomposition virtually eliminates false positives: major errors dropped from 28.0% to 0.79% across 797 manually validated claim-publisher pairs from Slovene news. The results demonstrate that task decomposition, not LLM sophistication, drives accuracy in cross-source analysis. This finding enables scalable media monitoring at $0.01 per event, making systematic bias detection accessible to journalists and researchers worldwide.

Figure 1: Analyzed event in the BetweenTheLines mobile web app, showing the claims tab.

1 Introduction

Different news sources (publishers) covering the same event (groups of articles reporting on the same story) often cover facts differently. While existing platforms like Event Registry [3] and Ground News [2] provide valuable bias indicators and sentiment scores, they do not track how specific entities (people, organizations, countries) and claims (factual claims) within articles are portrayed across publishers. Gaining insight into these differences is usually time-consuming for the user.

Thus we present BetweenTheLines (Figure 1), a system that automatically identifies the claims and entities in an event and tracks their portrayal by each individual publisher. For example, when analyzing political coverage, we can see how the same entity is portrayed differently by two publishers, and how one publisher omitted a claim while the other did not.

Our key technical contribution is demonstrating that a multi-prompt LLM architecture outperforms single-stage approaches for this task.

2 Related Work

Cross-source news analysis is an under-discussed area of research which is important for understanding media bias, information diversity, and narrative framing across different outlets. This section reviews existing approaches to cross-source news analysis, event aggregation systems, and LLM-based content analysis pipelines.

2.1 Cross-Source News Analysis Platforms

Ground News represents a prominent platform for cross-source news comparison, classifying publishers along the left-right political spectrum. The platform has gained widespread adoption in educational institutions, with libraries at Harford Community College [4] and West Virginia University [5] integrating it into their media literacy curricula. For each news event, Ground News allows users to compare coverage by publisher in aggregate. While these aggregated summaries can reveal different emphases across the political spectrum, the platform does not provide article-by-article comparisons or track how specific entities and claims are portrayed between articles.

SiKDD 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s).

2.2 Event-Centric News Aggregation

Event Registry [6, 3] pioneered event-centric news aggregation by clustering articles from multiple publishers around identified news events. The platform provides article-level sentiment
scores using VADER sentiment analysis [7] and allows filtering by various parameters, including language, location, and publisher credibility. Each article has a sentiment score, a level of granularity above Ground News. Still, there is no analysis of how specific entities and claims within those articles are portrayed. Our work builds upon Event Registry's foundation by combining its event-based aggregation with more granular entity and claim analyses through LLM processing. Unlike Ground News's publisher-level political bias ratings or Event Registry's article sentiment scores, we provide fine-grained analysis of how specific entities and claims are portrayed differently across publishers.

DOI: https://doi.org/10.70314/is.2025.sikdd.26

3 Application and Analysis Architecture

3.1 Application architecture

BetweenTheLines is a news-analysis web app developed with Claude Code [8]. The backend is built using Flask [9] and PostgreSQL [10]. It uses the Event Registry [6, 3] service for event and article fetching, and integrates both Google Gemini [11] and OpenAI [12] LLMs.

3.2 Analysis Service Overview

The analysis service consists of two modules, claims analysis and sentiment analysis, with more thorough exploration of the former due to its less subjective nature. Figure 2 illustrates our three-stage LLM pipeline.

Figure 2: Three-stage process flow of the analysis. Extracted results lead to multiple parallel LLM calls.

Stage 1: Extraction. We begin by sending all articles from an event to a single LLM call. This extracts two lists (Table 1) of the entities and claims that appear in the articles.

Stage 2: Classification. With the lists from stage 1, a parallel LLM call is made twice for each publisher, once for claims, once for entities. The calls return categorized data. Claims are categorized by presence, and entities by sentiment. The results of these categorizations are referred to as entity-publisher and claim-publisher pairs.

Stage 3: Key Differences. This stage summarizes how different publishers covered each claim or entity. It requires one LLM call per claim/entity, running in parallel.

The final results are structured into a tabular or card format, depending on the device, where users can compare coverage across publishers at a glance (Figure 1).

3.3 Language

We decided for all prompts to be in Slovene, and to analyze only Slovene articles. This came after empirically observing a decrease in errors when the language of the prompts and the articles was the same. It also ensured language consistency for evaluation. All showcased prompts and results are originally in Slovene and were translated to English for this paper.

3.4 Event Filtering

Events and articles are fetched from the Event Registry API [3]. Articles are then filtered to retain only the newest article from each unique publisher in an event. To retain only the most relevant events, we discard any event with fewer than 3 articles. To prevent context overloading, the maximum article limit is 10. The final article list is then prepared for each event, and the title, body, publisher name, and article link are stored for every article.

3.5 Extraction

Extraction for an event is done after filtering, in a single LLM call to gpt-4o-mini [13], in which the contents of all articles are included in the prompt along with instructions for extracting 2 lists (Table 1) in JSON format: entities for sentiment analysis and claims for claims analysis. The prompt focuses on extracting 8-15 claims and 8-15 entities that are central to the story, explicitly excluding news publishers unless they are the subject of the news story:

Analyze all these news articles and extract two comprehensive lists in JSON format:
1. CLAIMS - All significant claims made across all articles
2. ENTITIES - All important entities (people, organizations, countries, etc.) mentioned across all articles

Entities | Claims
Vladimir Putin | Putin claims that Russia has never opposed Ukraine's membership in the EU.
Xi Jinping | Putin calls claims about a possible Russian attack on other European countries "hysteria."
Russia | Putin says that Russia is forced to respond to the West's attempt to take over the post-Soviet space.
China | Putin and Trump discussed the security of Ukraine.
Ukraine | Putin and Xi signed about 20 agreements in the fields of energy, aviation, artificial intelligence, and agriculture.

Table 1: Example of the first 5 entities and claims received from the extraction prompt for the Russia–China summit.

A 2-step extraction process was also tested, in which each article is prompted for the claims and entities contained in it, and the results are then aggregated. However, this led to very large lists with duplicate names written differently (e.g., USA vs. United States, United States Government vs. United States), for little performance gain.

Another issue we faced was the publisher names themselves appearing in the entities list, even in situations where they are not a direct part of the article. This led us to add additional rules to the extraction prompt to exclude them:

- EXCLUDE news publishers/sources (like BBC, CNN, Reuters, etc.) UNLESS they are actually subjects of the news story itself
- Focus on entities that are the SUBJECT of the news, not the source reporting it

3.6 Claims Analysis

Figure 3: Claims analysis decision tree, 4 options depending on whether and how a claim is mentioned.

Claim analysis starts after the extraction step returns a claims list. It consists of multiple parallel LLM calls, each analyzing a single article against the claims list, using 4 categorizations for whether the article confirms the claim: Yes, Partially, No, and Not mentioned, as depicted in Figure 3.

False negatives were the biggest problem we faced with claim analysis. Originally, there were only 3 claim categories; however, due to too many "not mentioned" results, we added a fourth, partial, classification, which led to significant improvements. To further reduce false negatives without adding false positives, we tightened the categorization rules for the Not mentioned category, defaulting to Partial instead when the answer is unclear.

Portion of the rule-set that helped improve results:

Before selecting "Not mentioned", you MUST check the following transformations/hints:
- paraphrases/synonyms; hypernyms/hyponyms; abbreviations/acronyms; coreferences (pronouns, descriptive references)
- numbers/units/conversions; relative dates -> absolute; geographic hypernyms (e.g. EU -> country)
- sections: title, introduction, body, subtitles, captions, tables/graphs, quotes/indirect statements
- negations, questions, conditionals, predictions/hypotheses

Rule to reduce false negatives:
- If in doubt between "Partial" and "Not mentioned", choose "Partial"

3.7 Sentiment Analysis

Figure 4: Decision tree in sentiment analysis.

The sentiment analysis proceeds in parallel with claims analysis after receiving the entity list (Figure 1) from the extraction. It is structured in a manner very similar to the claims analysis: it calls the LLM once per publisher, and it has 4 categorizations (Figure 4): Positive, Negative, Neutral, and Not mentioned. Accuracy assessment is harder due to subjective interpretation. This module uses gemini-2.5-flash-lite [14] due to empirical observation of better results; every other LLM call uses gpt-4o-mini [13].

LLMs struggle with implicit criticism conveyed through selective quoting. For instance, when Mladina [15] quoted Trump praising himself as "smart" and suggesting people want a dictator, the LLM classified the sentiment as positive, missing the article's critical intent to portray authoritarianism. To account for this weakness, we added more constraints and rules to the prompts:

Important: OUTCOME ≠ SENTIMENT
- Do not mark "Positive" because the entity wins/makes a profit, without explicit value judgement of the entity.
- Do not mark "Negative" because the entity loses/has a bad result, without explicit value judgement of the entity.

Mandatory decision steps (before choosing a label):
- First identify the role of the entity's mention: SPEAKER / TARGET / MENTIONED WITHOUT ROLE
- Then look for META-EVALUATION of the entity (adjectives, evaluative verbs, framing before/after the quote, editorial tone)
- If the entity is only a SPEAKER without meta-evaluation, choose "Neutral"

This reduced false negatives and false positives significantly; however, it came with the tradeoff of a much higher incidence of neutral classifications, even when the text is slightly positive or negative.

3.8 Key Differences

The final step of the pipeline is the generation of the key differences (Figure 5). It uses the claims/sentiment categorizations from the previous step as input. It works for both claims and sentiment analysis in an almost identical manner; we will use claims for explanation in this example. A parallel LLM call is made once per claim in the analysis, containing all claim-publisher pairs of that claim.

Figure 5: Key difference generation for a claim from the Russia–China summit.

4 Evaluation

4.1 Manual Testing

To test our hypothesis that the multi-stage pipeline is superior to a single-stage pipeline (where all articles and instructions are included in a single one-prompt LLM call), we conducted a comparison of claim analysis results spanning 797 claim-publisher pairs, of which 294 are from the single-stage pipeline and 503 from the multi-stage pipeline.

Event | Publishers | Claims (S/M) | Error rate (S/M) | Major errors (S/M) | Rows affected (S/M)
Hvar snakebite | 7 | 9 / 15 | 25.4% / 3.80% | 25.4% / 1.90% | 100% / 20%
Putin prepared to meet Zelenski | 7 | 9 / 14 | 30.15% / 3.06% | 14.28% / 0% | 88.8% / 7.14%
Carpaccio's Mary Returns to Piran | 9 | 8 / 15 | 38.9% / 6.3% | 37.5% / 0% | 100% / 33.3%
Giorgio Armani dies | 7 | 8 / 15 | 37.5% / 7.62% | 30.4% / 1.91% | 87.5% / 33.3%
Russia–China summit | 5 | 8 / 12 | 32.5% / 0% | 32.5% / 0% | 100% / 0%
Weighted avg | — | — | 33.0% / 4.67% | 28.0% / 0.79% | 95.3% / 21.5%

Table 2: Single-stage (S) vs. multi-stage (M) results per event. The final row shows weighted averages. For error rates and major errors, weights = number of claim-publisher pairs tested per pipeline. For rows affected, weights = number of claims (rows) per pipeline. Note that weights differ between pipelines due to different extraction results.
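The weighted averaging described in the Table 2 caption (weights = number of claim-publisher pairs per event) can be sketched as a small helper. This is only a sketch: the per-event pair counts below are illustrative, since the paper reports the resulting percentages but not the underlying counts.

```python
# Sketch of the weighted-average error rate from the Table 2 caption:
# each event's error rate is weighted by the number of claim-publisher
# pairs tested for that pipeline.

def weighted_error_rate(events):
    """events: iterable of (error_rate, n_pairs) tuples, one per news event."""
    total_pairs = sum(n for _, n in events)
    return sum(rate * n for rate, n in events) / total_pairs

# ILLUSTRATIVE pair counts -- the paper does not publish per-event counts.
multi_stage = [(0.0380, 105), (0.0306, 98), (0.063, 120),
               (0.0762, 105), (0.0, 75)]
print(f"weighted error rate: {weighted_error_rate(multi_stage):.2%}")
```

Because the single- and multi-stage pipelines extract different claim lists, the same helper would be run with a different set of weights for each pipeline, which is why the caption notes that the weights differ.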
Both single- and multi-stage results were generated across the same 5 control news events. Quantitative testing was not done for sentiment due to time constraints, combined with the increased difficulty arising from its level of subjectiveness.

Each claim-publisher pair was manually reviewed for correctness. We classified errors into two categories: minor errors (positive or not mentioned classified as partial) and major errors (false positives/negatives). Results were grouped by event to enable direct comparison between the two architectures on identical data. Weighted averages were calculated using claim-publisher pair counts for error rates, and distinct claim counts for rows affected (a row refers to a distinct claim and its corresponding claim-publisher pairs).

4.2 Results

The multi-stage pipeline achieved a 4.67% error rate versus the 33.0% error rate of the single-stage pipeline. Table 2 shows results across the five test news events. Each percentage represents the proportion of claim-publisher pairs that were incorrectly classified. For example, in "Russia–China summit" with 5 publishers, single-stage misclassified 32.5% of all claim-publisher pairs while multi-stage achieved 0% error.

Major errors (false positives/negatives) are critical misclassifications where claims are marked "confirmed" when absent or "not mentioned" when present. Minor errors involve "partial" misclassifications. The multi-stage pipeline reduced major errors from 28.0% to 0.79%.

Rows affected shows the percentage of claims with at least one error across publishers. Single-stage produced errors in 95.3% of claims versus 21.5% for multi-stage, demonstrating more localized error patterns.

The improvement was consistent across all five news events. The most dramatic gain was the 35-fold reduction in major errors.

5 Discussion

Our results demonstrate that LLM prompt architecture fundamentally impacts classification accuracy in cross-source news analysis. The significant error reduction validates task decomposition as a critical design principle for complex NLP pipelines.

While the multi-stage pipeline (Figure 2) requires more API calls (8+ versus one), costs remain manageable at $0.008-0.015 per event with both modules enabled. The accuracy improvement justifies this modest cost increase, especially considering that manual verification would require expensive human labor. Considering that an event only needs to be analyzed once, with no variable cost, this offers a lot of potential for analysis at scale.

Sentiment analysis struggles with irony and implicit criticism, as shown in the Mladina [15] example, where selective quoting conveyed negativity despite positive surface language.

Future work includes comprehensive user testing with journalists and researchers, optimization of the current modules, and expansion to other languages. We plan structured evaluations to understand how different user groups interpret and act upon cross-source comparisons.

Acknowledgments

The research described in this paper was supported by the TWON project, funded by the European Union under Horizon Europe, grant agreement No 101095095.

References
[1] Yushi Bai et al. 2023. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
[2] [n. d.] Ground News - breaking news headlines and media bias. Ground News. Retrieved Sept. 7, 2025 from https://ground.news/.
[3] [n. d.] Event Registry API documentation. Event Registry. Retrieved Sept. 7, 2025 from https://eventregistry.org/documentation.
[4] [n. d.] Case study: Ground News at Harford Community College - a collaborative mission to modernize media literacy. Library Up. Retrieved Sept. 7, 2025 from https://www.libraryup.org/news-1/case-study-ground-news-at-harford-community-college.
[5] [n. d.] Ground News - media bias and news comparison. West Virginia University Libraries. Retrieved Sept. 7, 2025 from https://libguides.wvu.edu/c.php?g=1204801&p=8818927.
[6] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event Registry: Learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web (WWW '14 Companion). ACM, Seoul, Korea, 107-110. isbn: 978-1-4503-2745-9. doi: 10.1145/2567948.2577024.
[7] C.J. Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media. AAAI Press, 216-225.
[8] [n. d.] Claude Code. Anthropic. Retrieved Sept. 7, 2025 from https://claude.ai/code.
[9] Armin Ronacher. [n. d.] Flask. Retrieved Sept. 7, 2025 from https://flask.palletsprojects.com/.
[10] [n. d.] PostgreSQL. PostgreSQL Global Development Group. Retrieved Sept. 7, 2025 from https://www.postgresql.org/.
[11] [n. d.] Gemini API. Google. Retrieved Sept. 7, 2025 from https://ai.google.dev/.
[12] [n. d.] OpenAI. OpenAI. Retrieved Sept. 7, 2025 from https://openai.com/.
[13] [n. d.] GPT-4o mini. OpenAI. Retrieved Sept. 7, 2025 from https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/.
[14] [n. d.] Gemini 2.5 Flash Lite. Google. Retrieved Sept. 7, 2025 from https://openrouter.ai/google/gemini-2.5-flash-lite-preview-06-17.
[15] [n. d.] Trump bi ministrstvo za obrambo preimenoval v ministrstvo za vojno [Trump would rename the Ministry of Defense the Ministry of War]. Mladina. Retrieved Sept. 7, 2025 from https://www.mladina.si/243046/trump-bi-ministrstvo-za-obrambo-preimenoval-v-ministrstvo-za-vojno/.

Identifying Social Self in Text: A Machine Learning Study

Jaya Caporusso (jaya.caporusso@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Matthew Purver, Jožef Stefan Institute, Ljubljana, Slovenia, and Queen Mary University of London, London, UK
Senja Pollak, Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

The Self encompasses many aspects, such as the Social Self. Identifying them in text is relevant for many purposes, including mental-health research. As part of a larger project aimed at automatically detecting Self-aspects in written language, in this study we annotate and employ a dataset of diary entries to classify the presence or absence of Social Self. We train three classifiers—Support Vector Machine (SVM), Naïve Bayes, and Logistic Regression—on either learned or predefined features. The best-performing model is the SVM trained on predefined LIWC features based on a previous study. We further apply feature importance methods and examine which features make the biggest contribution to the classification models. The most informative feature across models trained on learned features is the word "we", while the LIWC category "social referents" emerges as the most important feature for models trained on predefined features.

Keywords

social self, machine learning, classification, feature importance

1 Introduction

A central aspect of human experience, the Self is a complex, multi-aspect phenomenon [3]. Its aspects—encompassing, for example, personal narratives [18] and social interactions [2]—correlate with other relevant constructs, such as mental-health conditions [17]. While the various Self-aspects are reflected in the individual's language [14], Natural Language Processing (NLP) studies rarely explore them and employ them in depth.

This work is part of a larger project aimed at developing models to automatically identify Self-aspects in text, with applications in mental-health research and empirical phenomenology [5]. Due to the sensitive nature of the domains of application, we attempt an approach that allows both interpretability and a ground-truth basis, opting for classical machine learning (ML) models. In this study, we focus on one Self-aspect: the Social Self (SS), defined as the Self as it is shaped and/or perceived when in an interaction or relationship of sorts with other people or entities to whom we attribute qualities of inner life [4]. We aim to investigate how this is represented in diary entries and whether these representations can be reliably identified using machine learning. Additionally, we explore which linguistic features are most predictive of these aspects. Identifying SS in text is valuable, as, e.g., disturbances in the SS are closely linked to mental-health conditions [7]. This project involves labelling—with a mixed approach involving human annotators and large language models (LLMs)—diary-entry instances as binary (representing SS or not), with the purpose of investigating the correlation between SS and textual features. We train and compare three classifiers (i.e., Support Vector Machine (SVM), Naïve Bayes (NB), and Logistic Regression (LR)) to predict SS using either 1) learned features (i.e., TF-IDF unigrams and bigrams) or 2) predefined features (i.e., Linguistic Inquiry and Word Count (LIWC; [1]) lexicon categories; see [4]). We use the mentioned classifiers instead of LLMs (e.g., GPT-4) because our focus is on employing interpretable features and understanding their contribution to predictions—an aspect less directly accessible in generative models. We conduct feature importance analysis to explore these contributions further. The code is available at https://github.com/jayacaporusso/SELFtext upon request.

2 Related Work

Studies that address the correlation between text and the traits and states of the text's author often utilise the Linguistic Inquiry and Word Count (LIWC), a text analysis software developed to analyse linguistic and psychosocial constructs connected to various textual aspects [1] (e.g., [9]). Various studies have found Self states to be associated with linguistic features, e.g., depression with first-person singular pronouns [15]. This has been employed in classification tasks (e.g., [6]). In a previous study, after labelling a dataset with a mixed approach employing human annotation and LLMs, we analysed which LIWC-22 features characterise Reddit posts including Self as an Agent, Bodily Self, and SS [4]. Specifically, we showed that the presence of SS is correlated with LIWC categories including, among others, emotion- and time-related terms. In contrast, the absence of SS is correlated with, e.g., technology and negative emotions. In this work, we employ this knowledge to build SS classifiers on predefined features and compare them with classifiers trained on learned features.

3 Research Questions

In this study, we aim to address the following main research questions (RQs).
RQ1: How does a SS classifier trained on pre- attribute qualities of inner life defined features perform compared to a SS classifier trained on [4]. We aim to investigate how this learned features? : Among the algorithms employed, which RQ2 is represented in diary entries and whether these representations one performs better for our task? : Which features are more RQ3 can be reliably identified using machine learning. Additionally, relevant for the classification of SS? we explore which linguistic features are most predictive of these aspects. Identifying SS in text is valuable, as, e.g., disturbances in the SS are closely linked to mental health conditions [7]. This 3.1 Labelling In our study, we use a publicly available dataset in English [11] Permission to make digital or hard copies of all or part of this work for personal comprising 1,473 text samples (sub-entries; average length: 507.6 or classroom use is granted without fee provided that copies are not made or characters, 100.6 words) from 500 personal journal entries (500 distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this anonymous subjects). We augment the dataset with binary labels work must be honored. For all other uses, contact the owner /author(s). for SS, as following addressed. Information Society 2025, Ljubljana, Slovenia For labelling, we employ a mixed approach (see [4]) that com- © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.2 bines human annotation with the large language model (LLM) 19 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia Caporusso et al. gemma2 [16]. The instructions for manual annotation are pro- on the learned features and three models on the predefined fea- vided in the Appendix A. Two human annotators label the first tures. The models are of three different kinds: SVM, NB, and 105 instances of the dataset. 
This is needed to calculate inter- LR, all commonly used in text classification tasks. We employ annotator agreement with the LLM annotations. We instruct default hyperparameters. For the SVM, we use Linear kernel. For gemma2 to label the data three times, providing three different LR, we apply L2 regularisation, which adds a penalty term to personalisations (see [10]): expert in phenomenology, cognitive the model’s objective function, minimising overfitting. For NB, psychology, or social psychology. Additionally, we provide them MultinomialNB was used for learned features, while GaussianNB with definitions of SS, instructions to annotate it, examples of a was used for predefined features, which consist of continuous nu- text instance where it is present, a text instance where it is absent, merical values derived from linguistic analysis. MultinomialNB and explanations of why this is so. These can be extracted from assumes that features represent discrete frequency counts, while the instructions for manual annotation. Each gemma2 model GaussianNB assumes that feature distributions follow a normal performs a one-shot, binary classification for each self-aspect. distribution, making it appropriate for continuous data. We calculate majority voting with the resulting labels and com- pute the inter-annotator agreement between each pair among the 5 Evaluation human and the LLM annotators by calculating Cohen’s Kappa Similarly to the training process, the models are evaluated using coefficient. This results in Cohen’s Kappa coefficients of 0.80 10-fold cross-validation. All the models perform reasonably well, (human annotators), 0.89 (first annotator vs. gemma2), and 0.84 with the SVM model trained on predefined features outperform- (second annotator vs. gemma2). In the further steps, we use the ing them all (RQ1 and RQ2). The metrics (precision, recall, and majority voting labels. 
The class balance (calculated on the ma- F1-score: mean and STD) across folds are reported in Table 1. jority voting) is 50.3% (SS present) vs 49.7% (SS not present). They match the macro average scores. The confusion matrices for each model are presented in Figures 3 and 4 in the Appendix 4 Classification B. These highlight that models trained on predefined features The text is preprocessed, converting it to lowercase and remov- generally perform better at distinguishing between classes, with ing punctuation and extra whitespace. We extract learned and the SVM and LR models achieving higher accuracy for both Class predefined features. We then train three classifiers for each set 0 and Class 1. However, NB trained on predefined features strug- of features: an SVM, a NB, and a LR model. gles with a higher rate of false positives for Class 0. The models trained on learned features have slightly lower performance, with 4.1 Feature Engineering higher misclassification rates for Class 1 predictions. After per- forming a Friedman test across folds (statistic = 44.26, p-value = We are interested in comparing the performance of models trained 0.00), we find a statistically significant difference in model per- on learned vs pre-defined features. In this study, we choose to formances. We therefore conduct Wilcoxon signed-rank tests employ TF-IDF calculated on unigrams and bigrams as learned with Bonferroni correction to identify significant pairwise dif- features, and the LIWC features identified as being related to the ferences between models. LR with learned features performed presence or absence of SS in Caporusso et al. [4]. significantly better than NB with learned features (p = 0.03); SVM 4.1.1 Learned Features. To extract learned features, we employ with predefined features outperforms NB with learned features (p Tfidf Vectorizer, applying TF-IDF weighting to unigrams and bi- = 0.03); LR with predefined features outperforms NB with learned grams. 
Restricting the representation to unigrams and bigrams, a features (p = 0.03); SVM with predefined features performs signifi- common choice in exploratory text classification, efficiently dis- cantly better than NB with predefined features (p = 0.03); LR with plays feature importance, balancing interpretability and compu- predefined features outperforms NB with predefined features (p tational efficiency. We limit the feature space to the 1000 n-grams = 0.03). The results are displayed in Figure 5 in the Appendix B. that, based on their TF-IDF scores, are the most informative. This ensures computational efficiency. In this process, we choose not to exclude stopping words. Indeed, for the purpose of our study, they do not merely constitute noise but might play a key role in distinguishing text instances reporting on SS. 4.1.2 Predefined Features. We analyse the presence of all the LIWC-22 [1] categories and subcategories, and subsequently only considered the LIWC features of interest. Specifically, as prede- fined features, we employ the LIWC features that Caporusso et al. Table 1: Evaluation Metrics (Mean and STD) [4] identified as being related to the presence and absence of SS (see 2), for example , , and authenticity social referents the pronoun I. For each of them, LIWC-22 provides scores relative to the text length. All LIWC features were standardised using Z-score nor- 6 Feature Importance malisation to ensure comparability across different feature scales. We employ different feature importance methods tailored to each This is particularly important for models like SVM and LR, which model’s learning mechanism to ensure that feature rankings are are sensitive to feature magnitudes. Missing values (NaNs) are meaningful and aligned with the way each algorithm processes handled using mean imputation. data. 
Identifying Social Self in Text. Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia

4.2 Models

The models are trained and evaluated using 10-fold cross-validation to assess their performance. Specifically, we train three models for each set of features.

For the SVM models, we choose Linear SVM Coefficients, because they directly represent feature importance in the decision boundary and are computationally efficient to extract. This method is fast and directly interpretable without requiring additional computations, but it does not capture feature interactions or non-linearity. For the NB models, we choose Permutation Importance: NB does not have meaningful coefficients, and this method provides a model-agnostic way to assess how each feature affects predictions. It allows the interpretation of feature contributions without relying on the model's internal parameters, but it is computationally expensive and can be sensitive to correlated features. For the LR models, we choose SHAP (SHapley Additive exPlanations [12]) values, because they provide both global and instance-level feature attributions while considering feature interactions, making them more informative than raw coefficients. SHAP accounts for feature dependencies and offers a nuanced interpretation of how features contribute to individual predictions, but its computations can be slow and the results depend on the reference distribution used. Using SHAP for the SVM would be unnecessary, because it would give similar results as the coefficients but less directly and with added computational cost, while SHAP's dependency assumptions conflict with NB's independence assumption.

The contribution of each feature to the classification decision is indicated with a feature importance score. These scores are computed differently depending on the method: in Linear SVM Coefficients, they are derived from the absolute magnitude of the learned weights; in Permutation Importance, they are measured by assessing the decrease in model performance when a feature's values are randomly shuffled; and in SHAP, they quantify the contribution of each feature to the predicted classification probability by distributing the model's output among the input features.

6.1 SVM: Linear SVM Coefficients

For SVM, feature importance is determined using Linear SVM Coefficients. This method is chosen because a linear SVM explicitly learns a set of coefficients as part of its optimisation process, making feature importance inherently interpretable. Additionally, since the SVM model is optimised to find the maximum margin, features with the largest coefficients contribute the most to defining this separation, allowing for a clear ranking of feature relevance. The resulting importance scores are based on the absolute magnitude of the learned coefficients and, like them, can be any real value. While the importance scores' scale depends on the range of the input features, higher numbers indicate a stronger influence on classification. The top-3 features for the SVM models are family, we, and with (TF-IDF), and social referents, I, and personal pronouns (LIWC) (RQ3).

6.2 Naïve Bayes: Permutation Importance

For NB, we choose Permutation Importance because it provides a robust way to assess feature significance in probabilistic models that do not generate explicit importance scores. By quantifying the dependence of the model's predictions on each feature, Permutation Importance allows for an intuitive understanding of which features are most influential in the NB classification process. The scores produced are relative, and their scale depends on the model's performance metric; a larger value indicates that the feature has a greater impact on classification accuracy. The top-3 features for the NB models are us, birthday, and her (TF-IDF), and social referents, social, and she/he (LIWC) (RQ3).

6.3 Logistic Regression: SHAP Values

LR calculates the probability of a given outcome using a linear combination of input features, but SHAP offers a more granular and interpretable way of explaining these predictions. This method is chosen because it provides a comprehensive, intuitive, and theoretically grounded measure of feature importance, making it well-suited for interpreting the decision-making process of a probabilistic model like LR. In this study, we reduce the SHAP computation sample size from 50 to 20 to improve efficiency while maintaining representative feature importance insights. SHAP scores are measured on the same scale as the model's output and sum to the difference between the model's output and the expected output across all features. They can be positive (probability of classification increased) or negative (probability of classification decreased). Their magnitude reflects the strength of the feature's influence on the classification decision. The top-3 features for the LR models are with, we, and my (TF-IDF), and social referents, Social, and I (LIWC) (RQ3).

6.4 Overall Feature Importance

To determine the top-20 most important features across all models trained on learned features and across all models trained on predefined features, we aggregate the feature importance scores from each model and sum them across all models. This is done to show which features are consistently influential regardless of the model; however, due to differences in how each method computes importance, the aggregated scores should be viewed as indicative rather than absolute measures of feature relevance. The top-10 features for the models trained on learned features are displayed in Figure 1, and those for the models trained on predefined features in Figure 2 (RQ3).

Figure 1: Top-10 Features for TF-IDF Models

Figure 2: Top-10 Features for LIWC Models

Additionally, we identify unique features for each model, defined as those that appear in the top-10 for a specific model but not in the others. In the following, we report those referring to models trained on learned features.

• SVM: my, team, she, our, he, we, with, friend, with my, their.
• Naïve Bayes: team, they are, he was, us, birthday, she was, of our, with her, person, spending time.
• Logistic Regression: my, she, our, and, good, he, my family, we, it, sleep.

Next, we report those referring to models trained on predefined features.

• SVM: sexual, Dic, Social, socrefs, feeling, we, Affect, Drives, insight, WC.
• Naïve Bayes: Dic, Social, socrefs, number, moral, feeling, we, focuspast, Drives, illness.
• Logistic Regression: Dic, Social, socrefs, pronoun, Analytic, feeling, we, Affect, focuspast, Drives.

This helps us shed light on how different algorithms interpret the data; some overlap in the reported features occurs because the different algorithms, despite using distinct mechanisms to estimate importance, converge on similar cues that are consistently predictive of SS. We calculate the correlation between feature importance rankings across the different models by computing the Pearson correlation coefficient between the feature importance scores of each pair of models, using their respective importance values across all features. This is displayed in Figures 6 and 7 in Appendix C. A high positive correlation indicates similar feature rankings, and vice versa. The highest correlation is measured between the SVM and LR models, while the lowest is between NB and LR for models trained on learned features, and between SVM and NB for models trained on predefined features.

7 Discussion

Our results indicate that the models trained on predefined features (LIWC) generally outperform those trained on learned features (TF-IDF n-grams), with the SVM model achieving the highest classification performance (RQ1-2). This suggests that LIWC features, which encapsulate linguistic and psychological constructs, provide a structured and interpretable representation of textual patterns related to SS. In contrast, TF-IDF captures surface-level word frequency distributions, which may be more susceptible to noise and context variability, limiting its predictive power for capturing abstract constructs like SS. Furthermore, our results support the findings by Caporusso et al. [4] regarding LIWC features correlated with SS. Notably, models trained on TF-IDF features tend to exhibit higher aggregated feature importance scores compared to those trained on LIWC. This could be attributed to the fact that TF-IDF operates on a larger and more granular feature space, capturing subtle variations in word usage. As a result, many features contribute partially to model decisions, leading to a higher sum of importance values across all features. In contrast, LIWC features are more constrained and predefined, leading to more concentrated but lower cumulative importance scores. This suggests that while TF-IDF captures a broader spectrum of textual variations, LIWC provides a more targeted and structured linguistic representation. Many of the features identified as relevant for the classification of SS (e.g., we and social referents) intuitively align with the nature of SS (RQ3).

8 Limitations and Future Work

This study serves as a pilot for the interpretable classification of different Self-aspects in text, focusing on SS. Several areas for improvement remain. Clearer annotation guidelines are needed for consistency. The choice of restricting ourselves to linear models, LIWC features, and unigrams/bigrams was appropriate for this exploratory study prioritising interpretability; however, it inevitably limits performance and representational richness. In future work, we plan to complement this approach with more powerful models and richer feature sets (e.g., embeddings). Here we wanted to compare models trained on learned vs predefined features, but we also plan to train models on both. While in this study we did not perform hyperparameter optimisation, we will do so in the future. We aim to train a neural network for multi-class classification, enabling simultaneous prediction of SS and other Self-aspects and allowing for a more comprehensive analysis of self-representation in text. In the future, we also plan to employ different datasets and to implement Demšar's evaluation method [8]. Our long-term goal is to be able, given a text instance, to determine which Self-aspects are present and how they are expressed, in an explainable manner. To do so, it is necessary not only to extend our work to other Self-aspects, but also to move beyond a binary classification for each of them. Work on the ontology underpinning future studies is ongoing [13].

9 Acknowledgments

We acknowledge Špela Rot's assistance and the financial support from the Slovenian Research Agency for research core funding for the programme Knowledge Technologies (No. P2-0103) and from the projects CroDeCo (J6-60109), Shapes of Shame in Slovene Literature (J6-60113), and Natural Language Processing for Corpus Analysis in the Medical Humanities (BI-VB/25-27-021). JC is a recipient of the Young Researcher Grant PR-13409.

References

[1] Ryan L Boyd, Ashwini Ashokkumar, Sarah Seraj, and James W Pennebaker. 2022. The development and psychometric properties of LIWC-22. Austin, TX: University of Texas at Austin, 10.
[2] Marilynn B Brewer. 2002. Individual self, relational self, and collective self: partners, opponents, or strangers.
[3] Jaya Caporusso. 2022. Dissolution experiences and the experience of the self: an empirical phenomenological investigation (master's thesis). University of Vienna. Advisor: Assist. Prof. Dr. Maja Smrdu.
[4] Jaya Caporusso, Boshko Koloski, Maša Rebernik, Senja Pollak, and Matthew Purver. 2024. A phenomenologically-inspired computational analysis of self-categories in text. In Proceedings of JADT 2024. Vol. 1, 169–178.
[5] Jaya Caporusso, Matthew Purver, and Senja Pollak. 2025. A computational framework to identify self-aspects in text. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). Jin Zhao, Mingyang Wang, and Zhu Liu, editors. Association for Computational Linguistics, Vienna, Austria, (July 2025), 725–739. isbn: 979-8-89176-254-1. doi: 10.18653/v1/2025.acl-srw.47.
[6] Jaya Caporusso, Thi Hong Hanh Tran, and Senja Pollak. 2023. IJS@LT-EDI: ensemble approaches to detect signs of depression from social media text. In Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion, 172–178.
[7] Christopher G Davey and Ben J Harrison. 2022. The self on its axis: a framework for understanding depression. Translational Psychiatry, 12, 1, 23.
[8] Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7, 1–30.
[9] Lewis R Goldberg. 2013. An alternative "description of personality": the big-five factor structure. In Personality and Personality Disorders. Routledge, 34–47.
[10] Boshko Koloski, Nada Lavrač, Bojan Cestnik, Senja Pollak, Blaž Škrlj, and Andrej Kastrin. 2024. AHAM: adapt, help, ask, model - harvesting LLMs for literature mining. In International Symposium on Intelligent Data Analysis. Springer, 254–265.
[11] X Alice Li and Devi Parikh. 2019. Lemotif: an affective visual journal using deep neural networks. arXiv preprint arXiv:1903.07766.
[12] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.
[13] Luka Oprešnik, Tia Križan, and Jaya Caporusso. 2025. Building an ontology of the self: sense of agency and bodily self. In Proceedings of Information Society 2025, Cognitive Science. doi: 10.70314/is.2025.cogni.8.
[14] James W Pennebaker, Matthias R Mehl, and Kate G Niederhoffer. 2003. Psychological aspects of natural language use: our words, our selves. Annual Review of Psychology, 54, 1, 547–577.
[15] Stephanie Rude, Eva-Maria Gortner, and James Pennebaker. 2004. Language use of depressed and depression-vulnerable college students. Cognition & Emotion, 18, 8, 1121–1133.
[16] Gemma Team et al. 2024. Gemma 2: improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
[17] David HV Vogel, Mathis Jording, Peter H Weiss, and Kai Vogeley. 2024. Temporal binding and sense of agency in major depression. Frontiers in Psychiatry, 15, 1288674.
[18] Dan Zahavi. 2007. Self and other: the limits of narrative understanding. Royal Institute of Philosophy Supplements, 60, 179–202.

B Evaluation

Figure 3: Confusion Matrices: Models Trained on Learned Features (TF-IDF)

Figure 4: Confusion Matrices: Models Trained on Predefined Features (LIWC)

A Instructions for Labelling: Social Self

In the column relative to Social Self, insert:
• 0: if the Social Self is not present.
• 1: if the Social Self is present.

In the following, we provide a definition of Social Self [4], instructions, and examples of a text instance where it is present and of a text instance where it is not present, taken from the dataset to be labelled.

Definition: The Self as it is shaped and/or perceived when in an interaction or relationship of sorts with other people or entities to whom we attribute qualities of an inner life.

Instructions: For Social Self to be present in a text instance, it is not enough for the text instance to contain references to other people and/or entities; it has to contain mentions of the author's interactions with them, their influence on them, or the influence these have on the author. This can be even minimal, e.g., in the form of referring to a person as my sister, or by using the first-person plural pronoun instead of the singular one.

Examples

A.0.1 Text instance containing Social Self: "My family was the most salient part of my day, since most days the care of my 2 children occupies the majority of my time. They are 2 years old and 7 months and I love them, but they also require so much attention that my anxiety is higher than ever. I am often overwhelmed by the care they require, but at the same, I am so excited to see them hit developmental and social milestones."

Explanation of text instance with Social Self present: In this text instance, the author reports on other people they are in some sort of relationship with, and on some aspects of their relationship and how they make the author feel.

A.0.2 Text instance not containing Social Self: "Yoga keeps me focused. I am able to take some time for me and breathe and work my body. This is important because it sets up my mood for the whole day."

Explanation of text instance with Social Self not present: In this text instance, the author does not report on any person, animal, or other entity to whom we attribute qualities of an inner life.

General Notes: While a certain Self-aspect might not be prominently present in a text instance in its entirety, if it is present in a part of the text instance to be labelled, then it has to be labelled as present in the text instance.
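The labelling scheme above (one binary column per Self-aspect) can be sketched as follows. The name social_self follows Appendix A; the other two aspect names are illustrative placeholders, since the remaining Self-aspects are not named in this excerpt.

```python
from itertools import product

# One binary column per Self-aspect; "social_self" follows Appendix A,
# the other two names are placeholder assumptions.
ASPECTS = ("social_self", "aspect_b", "aspect_c")

def validate_row(row):
    """Check one labelled instance: every aspect present, each 0 or 1."""
    return set(row) == set(ASPECTS) and all(v in (0, 1) for v in row.values())

# Any combination of present/absent aspects is a valid labelling:
combinations = [dict(zip(ASPECTS, bits)) for bits in product((0, 1), repeat=3)]
assert all(validate_row(r) for r in combinations)  # 8 valid combinations
```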
A given text instance can have none of the Self-aspects present, one of them present and two of them non-present, two present and one non-present, or all three of them present: any combination is possible.

Figure 5: Pairwise Wilcoxon Signed-Rank Test Results (p-values)

C Feature Importance

Figure 6: Correlation Between Feature Importance Across Models Trained on Learned Features (TF-IDF)

Figure 7: Correlation Between Feature Importance Across Models Trained on Pre-Defined Features (LIWC)

WinWin Meets – Investigating the Future of Online Meetings

Martin Žust, marti.zust@gmail.com, Jožef Stefan Institute, Ljubljana, Slovenia
Marko Grobelnik, marko.grobelnik@ijs.si, Jožef Stefan Institute, Ljubljana, Slovenia
Alenka Guček, alenka.gucek@ijs.si, Jožef Stefan Institute, Ljubljana, Slovenia
Adrian Mladenic Grobelnik, adrian.m.grobelnik@ijs.si, Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

Video conferencing is now central to modern collaboration, yet its functionality remains largely limited to passive audio–visual communication. Despite growing investment in artificial intelligence (AI), it is unclear which features truly enhance meetings and how users will adopt them. Here we present WinWin Meets, a Jitsi-based prototype that integrates Whisper transcription and GPT-4o processing to deliver real-time summaries, visual mind maps, and goal-oriented advice. Testing with 16 participants showed strong interest in summaries and mind maps, moderate interest in in-meeting guidance, and a preference for add-on integration. Market research confirmed low organic demand for advanced AI features, with users prioritizing reliable improvements such as automated notes. These results highlight a gap between experimental enthusiasm and everyday adoption, pointing to opportunities for targeted, industry-specific integrations that combine reliability with intelligent support.

Keywords

video conferencing, AI agent, testing, market research, Zoom, negotiation, transcription, summarization, advice, meeting notes, AI innovations

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.14

1 Introduction

As artificial intelligence advances rapidly, its potential to transform everyday digital tools, particularly video conferencing, has become increasingly apparent. Platforms such as Zoom, Google Meet, and Microsoft Teams have become standard, yet their functionality remains focused on basic communication. A need is arising for next-generation conferencing that includes intelligent assistants, automatic summarization, content analysis, and contextual support. These next-generation systems go beyond passive audio and video transmission to actively support users with intelligent features and real-time analysis [1].

Previous research reveals both promise and challenges. Proactive AI meeting assistants can improve efficiency but need to balance autonomy with what users are willing to accept [1]. Meanwhile, studies of speech-based technology underscore the difficulty of extracting useful outcomes from nuanced group interactions [2]. These perspectives suggest that AI's success in meetings depends on technical feasibility and sensitivity to human collaboration.

With remote meetings now central to how we work, these systems directly impact productivity, collaboration, and organizational culture. This paper explores which functionalities could define the future of video conferencing and how AI may contribute. We combine market trend and user preference analysis, reviews of online discussions, and experimental testing of the WinWin Meets prototype. We explore which features matter to users, examine how AI can support meetings, and assess the potential to improve efficiency, clarity, and structure in digital communication.

2 Background and Related Work

2.1 Overview of Current Video Conferencing Solutions

The video conferencing market is currently dominated by a few major players. Zoom, Microsoft Teams, and Google Meet together account for approximately 94% of global market share, with Zoom alone holding around 56% [3]. While all three platforms are actively investing in artificial intelligence features, their innovation must be carefully balanced against the risk of reputational damage. As established brands, they face more constraints than lesser-known startups, which can afford a higher level of experimental agility. This creates a unique window of opportunity for the emergence of disruptive technologies that have the potential to redefine the video conferencing experience.

Most AI-enabled tools developed recently are not standalone platforms, but integrations designed to work alongside existing services like Zoom, Google Meet, or Microsoft Teams. Notable examples include tl;dv [4], Otter.ai [5], Fathom [6], Fireflies [7], and Sembly AI [8]. These applications primarily offer meeting transcription, and some provide more advanced analytics such as sentiment analysis or participant-level speaking time metrics.

2.2 Limitations of Existing Solutions and Emerging Needs

Despite the growing number of AI integrations, fully independent platforms that natively combine video conferencing with built-in AI features remain rare. Such features may include real-time transcription, intelligent meeting summarization, and contextual AI-generated recommendations. This segment remains underdeveloped, presenting a significant opportunity for innovation.

While major platforms like Zoom have started introducing their own AI assistants (e.g., Zoom AI Companion [9]), they must innovate cautiously to protect their reputation and user base. This creates space for new companies to develop more ambitious AI-first conferencing tools, unrestricted by established brand expectations or legacy user commitments. However, innovating in markets where most users are already committed to existing platforms has notable downsides. Only about 2.5% of people are actively seeking new alternatives, with the majority being reluctant to change [10].

3 Development of WinWin Meets

3.1 Overview

As part of our research, we developed WinWin Meets, an AI-based alternative to Zoom. The application maintains familiar functionality, allowing users to start or join meetings just as they would expect. The key difference comes before entering the meeting room, where users can define their meeting goals. Once inside, they find a familiar interface with standard video conferencing features. These core functionalities are provided through an integration with Jitsi [11], an open-source video conferencing platform. It supports screen sharing, microphone and camera toggling, chat-based communication, polls, and many other standard features.

Beyond the familiar main meeting window found in applications like Zoom, WinWin Meets adds a dedicated panel on the right side of the screen for the WinWin Agent. This panel features two main buttons: Summarize and Give Advice. The Summarize button generates meeting summaries up to the current moment, which is particularly useful for late arrivals. Hovering reveals three options: Short Text, Long Text, and Mind Map. While the text options provide traditional summaries of varying length, the Mind Map offers a quicker and more accessible visual overview. The idea behind the mind map is based on the observation that modern workplace attention is highly fragmented, with a median focus duration of just 40 seconds on any screen [12]. The Give Advice button offers guidance on how to achieve the goals specified before the meeting. These goals can also be adjusted during the meeting by clicking the Manage Goals button in the top right corner. Hovering over the Give Advice button reveals three options: Short Text, Medium Text, and Long Text, which provide advice in different levels of detail.

Once the meeting concludes, a meeting report is quickly generated. The report includes all key points, action items, a meeting timeline, and the list of participants. Users can also generate a mind map from the final meeting content.

3.2 System Architecture and Implementation

The frontend of the application was developed in Cursor [13], with assistance from Claude 3.7 Sonnet [14] and GPT-4o [15]. It is built using the React 19 framework [16]. We aim for a clean and minimalistic design that intuitively guides the user through each step of the interface. In the meeting room interface, we integrated Jitsi via its iframe API. Jitsi integration is straightforward, and the platform allows the use of its hosted servers for up to 25 active monthly users free of charge, which was sufficient for our prototype testing.

The backend is built in Python, using the FastAPI framework [17]. For transcription, we integrated Whisper [18], and for natural language processing tasks (such as summarization and advice generation), we used GPT-4o. The backend exposes several endpoints, including:

• Transcription
• Advice generation
• Meeting summarization
• Health monitoring
• Meeting notes
• File uploads

The WinWin Agent dynamically adapts to the language selected by the user. In this prototype, we supported English, German, and Slovene, allowing users to interact with the summarization and advice features in their preferred language.

Figure 1: System architecture of the WinWin Meets application

In this prototype version, we did not use any persistent database; all data is stored locally. Additionally, user authentication is not yet implemented, as the focus was on demonstrating core functionalities.

4 Testing and User Insights

To evaluate the usefulness and usability of WinWin Meets, we conducted a structured user testing process involving 16 participants. Testing sessions were held in small groups of 2 to 4 participants, each lasting approximately 15 minutes. Participants simulated realistic discussions—including casual exchanges and role-play scenarios such as negotiations or political debates—to test all implemented functionalities. The following sections present our testing results, with key findings shown in Figure 2.

4.1 Test Coverage

Participants explored all key features, including the three variants of the Summarize function (Short Text, Long Text, and Mind Map), the three formats of the Give Advice function (Short, Medium, Long), and the Meeting Notes feature. After each session, they completed an anonymous survey with both multiple-choice and open-ended questions to assess usefulness and provide feedback.

4.2 Key Findings

General Usefulness. Most participants recognized the potential of AI-enhanced meetings. In fact, 87.5% responded Yes when asked whether AI could help them achieve meeting goals, while the remaining 12.5% answered Maybe.

Summarize Feature. The Summarize function was considered useful by 81.3% of participants. Preferences were split almost evenly: nearly half favored the Short Text and another 43.8% opted for the Mind Map, while only 12.5% selected the Long Text variant.

Give Advice Feature. When choosing advice length, participants showed a clear preference for medium-length suggestions:
• 50% selected Medium
• 25% chose Short
• 25% chose Long

Meeting Notes Feature. Participants emphasized three expectations for meeting notes:
• High reliability (timestamps, content accuracy)
• Fast post-meeting availability
• Stable performance across sessions

Figure 2: User survey results (n=16) comparing preferences for existing features (Give Advice and Summarize) and ranking of proposed new features for the WinWin Meets application

4.3 Ideas for Additional Features

Among the proposed additions, the insightful question generator attracted the most interest (37.5%), while the speaking time leaderboard and agenda-based coordination were equally valued (31.3% each). Participants also suggested several custom features, including personal notes, live transcription export, cloud synchronization, calendar integration, live translation with tone analysis, and domain-specific modes for law, sales, or education.

4.4 Integration Preferences

A clear majority (68.8%) preferred to use WinWin Meets as an add-on to existing platforms, while only 31.2% supported a standalone application.

4.5 Use Cases by Industry

Participants identified several promising domains for WinWin Meets, such as negotiation and sales, legal and consulting services, corporate meetings, academic events, client feedback sessions, NGO coordination, and specialized contexts like logistics, mergers and acquisitions, or trade deals.

5 Market Research and Trend Analysis

Beyond developing and testing WinWin Meets, we conducted market research to understand user needs and expectations in the video conferencing space. Our approach combined online surveys, social media engagement, search trend analysis, and reviews of blog posts and user forums. This investigation aimed to reach a wider audience than application testing alone could provide. The resulting quantitative and qualitative insights complement rather than replace our user testing results.

5.1 Survey and Social Media Feedback

Informal polls and surveys were conducted on platforms such as Facebook and Reddit. In a Facebook group focused on digital tools (GrowthHacking Slovenia), a poll asking users which feature they would most like to add to Zoom revealed that over 60% of respondents preferred having meeting notes generated at the end of a call, as we can see in Figure 3. In contrast, only two respondents selected a real-time AI assistant. This suggests a clear user preference for simple and familiar enhancements over more complex and unfamiliar innovations.

Similar sentiment was observed on Reddit (r/Zoom and r/remotework), where posted polls received limited engagement. Among the few responses, a general disinterest in AI-based meeting assistance was evident, with some users explicitly selecting "None of those".

Figure 3: Distribution of 80 votes for preferred video conferencing features from our informal polling.

5.2 Search Behavior and Online Interest Trends

Public search trends were analyzed using tools such as Answer the Public [19], Answer Socrates [20], AlsoAsked [21], and Ubersuggest [22]. These platforms provided insight into the types of questions users search for on Google, YouTube, and Reddit. The analysis showed minimal interest in AI-enhanced conferencing features. Instead, users were more focused on improving the efficiency and effectiveness of their meetings. Popular search queries we found included:

• What are the 3 C's of effective meetings?
• What is the 10-10-10 rule for meetings?
• How can I take better meeting notes?
• What are the 5 P's of meeting productivity?
• How to extend the 40-minute limit on Zoom?
• Is Google Meet better than Zoom?
• Is Zoom free to install and use?

These patterns confirm that users are primarily concerned with meeting outcomes and platform reliability, rather than with novel AI-driven functionalities.

5.3 Forum Discussions and Deep-Search Insights

Using tools like Grok [23] and Floth [24], we conducted a deeper exploration of online discussions and feedback. The most frequently mentioned user pain points include:

• Low video quality and unstable connections
• Privacy concerns (e.g., Zoom bombing, data storage policies)
• Psychological fatigue from constant camera presence
• Lack of end-to-end encryption and transparency
• Poor UX from interface changes (e.g., Google Meet "floating bubbles", Webex chat restrictions)
• Discomfort with platform claims over recorded content

User feedback highlights a desire for reliable, simple, and secure platforms with minimal friction in setup and usage.

5.4 Conclusions from Market Research

Our market analysis reveals several key trends:
(1) Users strongly prefer practical features like note-taking and agenda management over complex AI-based tools.
(2) Popular search queries suggest a need for structured meeting frameworks and productivity strategies.
(3) Persistent dissatisfaction exists around technical reliability, interface design, and data privacy.
(4) Open-source alternatives offer control and security but are hindered by usability and cost barriers.

… market segments where specialized AI features deliver measurable value propositions. The 68.8% preference for add-on integration over standalone applications indicates a market opportunity in enhancing existing platforms rather than replacing them, as demonstrated by successful tools like Fathom and Otter.ai. Although there is room for breakthrough products, any new solution must be at once reliable, easy to use, and meaningfully smarter than current tools, a difficult balance as existing platforms already invest heavily in their core features. The emphasis on reliability and customizable AI assistance reveals that AI features must meet higher performance standards than traditional features. Users consistently prioritize dependable functionality over advanced capabilities, suggesting that product development should focus on perfecting core AI functions before expanding feature sets. Future research should examine longitudinal adoption patterns and explore how user acceptance evolves as AI capabilities mature and become more familiar in workplace contexts.

7 Acknowledgements

The research described in this paper was supported by the TWON project, funded by the European Union under Horizon Europe, grant agreement No 101095095.

References

[1] Rutger Rienks, Anton Nijholt, and Paulo Barthelmess. 2009. Pro-active meeting assistants: attention please! AI & Society, 23, 2, 213–231.
[2] Moira McGregor and John C Tang. 2017. More to meetings: challenges in using speech-based technology to support meetings.
In Proceedings of the improvements that enhance meeting effectiveness and reduce Overall, the market exhibits demand for video conferencing 2017 ACM conference on computer supported cooperative work and social computing , 2208–2220. [3] T3 Technology Hub. 2024. Market share of videoconferencing software user burden, rather than introducing new technical complexity. worldwide in 2024, by program. Statista. Graph. (Apr. 2024). Retrieved Jan. 13, 2025 from https://www.statista.com/statistics/1331323/videoconferencing- 6 Discussion [4] market-share/. tldx Solutions GmbH. 2025. Tl;dv. https://tldv.io/. Accessed: August. (2025). There are two primary approaches to understanding user pref- [5] Otter.ai, Inc. 2025. Otter.ai. https://otter.ai/. Accessed: August. (2025). [6] 2025. Fathom. https://fathom.video/. Accessed: August. (2025). erences: direct inquiry and behavioral observation. Direct ques- [7] 2025. Fireflies. https://fireflies.ai/. Accessed: August. (2025). tioning suffers from significant limitations, including social desir- [8] 2025. Sembly ai. https://www.sembly.ai/. Accessed: August. (2025). ability bias where respondents provide socially acceptable rather [9] Zoom Video Communications. 2025. Zoom ai companion. https://www.zoo m.com/en/ai-assistant/. Accessed: August. (2025). than genuine answers, and the fact that approximately 95% of [10] Everett M Rogers, Arvind Singhal, and Margaret M Quinlan. 2014. Diffusion human decisions occur subconsciously as discussed in [25]. Ob- of innovations. In An integrated approach to communication theory and research . Routledge, 432–448. servational methods capture the unconscious preferences that [11] 8x8, Inc. 2025. Jitsi. https://jitsi.org/. Accessed: August. (2025). drive actual user behavior, providing more accurate insights into [12] Gloria Mark, Shamsi T. Iqbal, Mary Czerwinski, Paul Johns, and Akane Sano. 2016. Neurotics can’t focus: an in situ study of online multitasking in the real-world usage patterns. 
workplace. In Proceedings of the 2016 CHI Conference on Human Factors in These methodological considerations explain our contradic-Computing Systems . ACM, 1739–1744. AI could help achieve meeting goals, market research revealed tory findings. While 87.5% of WinWin Meets participants believed [13] Anysphere Inc. 2025. Cursor. https://cursor.sh/. Accessed: August. (2025). [14] Anthropic. 2025. Claude 3.7 sonnet. https://www.anthropic.com/news/clau de-3-7-sonnet. Accessed: August. (2025). minimal organic interest in AI-enhanced conferencing. This di- [15] OpenAI. 2025. Gpt-4o. https://openai.com/index/hello-gpt-4o/. Accessed: vergence reflects the difference between conscious evaluation in August. (2025). [16] Meta Open Source. 2025. React. https://react.dev/. Version 19. Accessed: controlled environments versus unconscious behavioral prefer-August. (2025). ences that emerge during natural usage. Additionally, our testing [17] Sebastián Ramírez. 2025. Fastapi. https://fastapi.tiangolo.com/. Accessed: August. (2025). participants were primarily young AI researchers, likely more [18] OpenAI. 2025. Whisper. https://openai.com/research/whisper. Accessed: receptive to AI features than typical users. August. (2025). Our research uncovered widespread "Zoom fatigue", indicat- [19] NP Digital. 2025. Answer the public. https://answerthepublic.com/. Accessed: ing that users have reached cognitive saturation with current August. (2025). [20] 2025. Answer socrates. https://answersocrates.com/. Accessed: August. video conferencing complexity. The strong preference for meet- (2025). ing notes over real-time AI assistance (60% versus minimal inter- [21] Candour. 2025. Alsoasked. https://alsoasked.com/. Accessed: August. (2025). [22] Neil Patel Digital. 2025. Ubersuggest. https://neilpatel.com/ubersuggest/. est) demonstrates users’ desire for post-meeting value without Accessed: August. (2025). additional in-meeting cognitive burden. This psychological con- [23] xAI. 2025. Grok. 
https://grok.x.ai/. Accessed: August. (2025). [24] 2025. Floth. https://floth.ai/. Accessed: August. (2025). text explains why solutions that prioritize seamless integration [25] Gerald Zaltman. 2003. How Customers Think: Essential insights into the mind over feature prominence tend to gain market traction [26]. of the market . Harvard Business Press. conferencing innovation. Industry-specific applications such as Our findings suggest distinct pathways for AI-enhanced video [26] Fred D Davis. 1989. Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS quarterly , 319–340. negotiations, sales, and legal consultations represent focused 28 Predicting Ski Jumps Using State-Space Model ∗ ∗ Neca Camlek Živa Hegler Jakob Jelenčič Univerza v Ljubljani Univerza v Ljubljani Jožef Stefan Institute Ljubljana, Slovenia Ljubljana, Slovenia Ljubljana, Slovenia jakob.jelencic@ijs.si Marko Grobelnik Dunja Mladenić Jožef Stefan Institute Jožef Stefan Institute Ljubljana, Slovenia Ljubljana, Slovenia marko.grobelnik@ijs.si dunja.mladenic@ijs.si Abstract modeling ski jumps, where environmental factors determine performance [9]. Ski jumping performance is shaped by both athlete technique In this paper, we present a ski jump dataset together with a and environmental conditions, with factors such as wind speed, state-space model trained to predict jump trajectories based on wind direction, and ski orientation playing a critical role in deter- changing environmental conditions. The model is estimated using mining jump trajectories. Accurate modeling of these trajectories a least squares approach and demonstrates how inputs such as is challenging due to dynamic and time-dependent nature of wind and ramp adjustments influence the resulting jump. Beyond the system. 
In this work, we introduce a dataset of measured the modeling framework, we also developed an application that ski jumps and present a state-space modeling framework that allows general users to interact with the data, run simulations, captures the evolution of jumps under varying conditions. The and visualize jump trajectories through animations. model parameters are estimated using a ridge regression ap- Beyond methodological interest, accurate prediction of ski proach, enabling us to predict trajectories from initial states and jumps can improve athlete safety by anticipating risky condi- wind sensor inputs. We evaluated the predictive performance of tions, support planning of hill design or enlargement, and con- the model through leave-one-out cross-validation and analyzed tribute to fairer competitions through a better understanding of its stability, showing that the approach can generate realistic tra- environmental effects. jectories with reasonable accuracy. To complement the modeling The remainder of the paper is as follows. Section 2 presents results, we developed an interactive web application that allows the handling of received data. Next, the proposed methodology users to explore both recorded and simulated jumps, adjust envi- is described in Section 3. The project results are presented in ronmental factors, and visualize their effects through animations. Section 4. We discuss the results in Section 5 and conclude the Together, the dataset, modeling framework, and the application paper in Section 6. offer a foundation for further research in ski jump analysis and provide an accessible tool for exploring the influence of external conditions on performance. 2 Modeling Framework and Dataset This section describes the handling of data, focusing on state- Keywords space models and our data processing. 
datasets, state-space model, ski jumping, simulations, least squares 2.1 State-Space Model 1 Introduction State-Space Models (SSMs) are a family of machine learning algo- Ski jumping is a sport strongly influenced by both athletic tech- rithms designed to capture and predict the behavior of dynamic nique and environmental conditions. Factors such as wind speed, systems by describing how their inner states change over time. wind direction, and different ski angles affect the trajectory and Instead of only looking at past inputs and outputs, SSMs explic- final distance of a jump, making accurate prediction a challeng- itly model the underlying dynamics, making them well-suited for ing problem. While statistical models and simulations have been sequential data. In state-space modeling, the objective is to iden- applied in sports research for some time, many approaches sim- tify the minimal set of system variables required to completely plify the problem and do not fully capture the dynamic evolution describe the system. These fundamental variables are referred to of the jump over time [11]. as the state variables. At any given time, the state of the system Recent advances in machine learning have introduced methods can be represented by a state vector, whose components corre- capable of modeling temporal systems with greater fidelity. In spond to the values of the respective state variables. SSMs are particular, state-space models provide a mathematical framework designed to predict both the manner in which inputs are reflected for representing hidden internal states that evolve over time in the system’s outputs and the evolution of a system’s internal in response to external input. This makes them well-suited for state over time and in response to specific inputs [2]. ∗ Both authors contributed equally to this research. 
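Written out, the state-space formulation described above takes the standard discrete-time linear form. The following is a sketch consistent with the roles that Section 3.2 assigns to the matrices A, B, C, and D; the next-step observation convention follows that description and is an interpretation, not a formula given by the authors.

```latex
% state evolution: A maps the current state, B the current control
x_{t+1} = A\,x_t + B\,u_t
% observation: C and D map the current state and control to the
% next observation (the convention described in Section 3.2)
y_{t+1} = C\,x_t + D\,u_t
```

Here $x_t$ is the state vector (coordinates, velocities, angles), $u_t$ the control vector (zone-averaged wind features), and $y_t$ the observation vector (coordinates and height above ground).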
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.30

2.2 Least squares method
The least squares method is a regression technique used to determine the line that best fits a given set of data. It minimizes the sum of the squared differences between the observed data and the corresponding values implied by the regression function. Each data point reflects the relationship between a known independent variable and an unknown dependent variable [7]. To enhance the model, we incorporated ridge regression (L2 regularization), which helps to reduce overfitting during model training [12].

2.3 Data Processing
For our project, we used 223 CSV files, each containing the data of one jump, measured on the Gorišek brothers flying hill in Planica, Slovenia. Each file contains 17 columns ('Position', 'Height above ground', 'Time', 'X', 'Y', 'Z', 'Opening Angle', 'Stalling Angle Left', 'Stalling Angle Right', 'Roll Angle Left', 'Roll Angle Right', 'Yaw Angle Left', 'Yaw Angle Right', 'Speed hor.', 'Speed vert.', 'Speed resulting', 'WindTime|WindName|WindSpeed|Wind...'), with the number of rows corresponding to the length of the jump. Data are recorded for every meter of air distance from the take-off point.

The data required some pre-processing before they could be used for training the model. The column 'WindTime|WindName|WindSpeed|...' combined multiple attributes separated by '|'. Data from 12 sensors, each measuring six wind characteristics, were expanded into 12 × 6 = 72 columns, one per sensor–feature pair (sensor_feature).

The main columns are as follows:
• Position - air distance from the take-off point in meters. It begins with a negative value, which represents the distance from the starting point to the take-off point. In ski jumping, the starting point is adjusted according to the wind conditions, so this value is not constant.
• Height above ground - height above ground in meters.
• Time - time of the jump in seconds from the start of the jump.
• X, Y, Z - coordinates of the jumper in 3D space in meters. The X axis is aligned with the hill direction, the Y axis is across the hill, and the Z axis is vertical. The take-off point is (0, 0, 0), as shown in Figure 1.
• Opening Angle - angle between the skis in degrees.
• Stalling Angle Left, Stalling Angle Right - angle between the chord line of the left/right ski and the horizontal plane in degrees.
• Roll Angle Left, Roll Angle Right - angle of the left/right ski around its longitudinal axis relative to the horizontal plane in degrees.
• Yaw Angle Left, Yaw Angle Right - angle between the left/right ski and the horizontal plane in degrees. (The angles are shown in Figure 2.)
• Speed hor., Speed vert., Speed resulting - horizontal, vertical, and resulting speed of the athlete in km/h [13].

Figure 2: Different angles affecting the jump

The wind features are as follows:
• WindTime - time of the wind measurement, in the same format as the Time column. Since wind measurements are recorded less often, each wind value is applied to the most recent jump measurement and then repeated until a new wind measurement is available. Since the wind follows a nonlinear function, it would be hard to capture its movements with interpolation, so we decided to drop this column.
• WindName - name of the sensor (Wi for i = 1, ..., 12).
• WindSpeed - resulting speed of the wind in km/h.
• WindSpeedTangent - tangential wind speed measured along the X axis (hill direction) in km/h.
• WindTurbulence - vertical speed of the wind turbulence in km/h.
• WindSpeedCleanTan - tangential wind speed with turbulence removed, in km/h.
• WindSpeedCross - wind speed measured along the Y axis (across the hill) in km/h.

There are 12 wind sensors spread across the ski jump hill. To help with the analysis, we separated the jump section of the hill into 3 zones. The first zone contains wind sensors 1 to 4, the second zone contains sensors 5 to 8, and the third zone contains sensors 9 to 12 [11].

During processing, we also removed some ski jumps that were incomplete or had corrupted data, so the final dataset contained around 200 ski jumps.

Figure 1: 3D model of the ski jump in Planica with added coordinates [10, 1]

3 Methodology
This section describes our research methodology. We first present the different variations of the SSM that we tested for the ski jump simulation, followed by a description of our model and how it predicts jumps. Finally, we describe our ski jump animation app.

3.1 Different modeling approaches
In addition to the pure SSM, we considered other approaches for modeling ski jumps, including classical physics-based models, but the data are not sufficient to accurately capture all the forces acting on the jumper. We also tried a hybrid approach that combined an SSM with Physics-Informed Neural Networks (PINNs [14]), where the SSM would provide a baseline prediction and the PINN would learn to correct any discrepancies, taking into account physical properties of the system, such as the mass of the pilot, the properties of the wind, and gravitational force [4].
These parameters are included in the equations of motion and added to the total loss function, so the model prefers solutions that are consistent with the laws of physics. This turned out to be less effective than a pure SSM approach, but the reason exceeds the purpose of this paper. More about the errors and the comparison of models is given in Section 4.1.

3.2 Ski jump prediction model
In order to fit our data to the SSM, we stored the data of each file in three vectors. The main vector contains the states, or state variables, of the system, which in our case are the X, Y, and Z coordinates, the jumper velocities, and all angles (opening, stalling, roll, and yaw) [6]. The observation vector contains the measured outputs of the system, which in our case are the X, Y, and Z coordinates and the height above ground. The controls contain the external inputs to the system, which in our case are the wind measurements from all the sensors, averaged over each zone and feature (speed, tangent, cross, and turbulence).

We then used ridge regression to estimate the matrices A, B, C, and D of the SSM, as shown in Figure 3, where we minimized the difference between the values computed from the current state and control and the next time-stamped values. Thus, matrix A computes the next state from the current state, B computes the next state from the current control, C computes the next observation from the current state, and D computes the next observation from the current control. We then use recursion to predict the next state from the prediction of the previous state and the current control, to obtain the full simulated jump. This allows us to predict the jump trajectory based on the environmental conditions and the starting state of the jumper [9].

The application (described in Section 3.3) presents the results as an animated visualization of the ski jump, showing the full trajectory and the final distance. In this way, the application functions both as an analytical tool, helping to test how different conditions affect performance, and as an educational resource that makes the mechanics of ski jumping easier to understand for a wider audience. It is available online.¹

4 Main results
In this section, we present the results of our simulations. Firstly, we present a statistical comparison of all the models, followed by a precise analysis of our predictions.

4.1 Models' error
In order to evaluate the different models, we first had to define a metric to measure the prediction error. Since actual and simulated jumps are represented with x, y, and z coordinates but are measured at different time stamps and can contain a different number of measurements, we had to find a way to compare them. We first tried to project the shorter trajectory onto the other one and compute the distance between the original and the projection, but this method turned out to be computationally expensive. So we decided to compute the distance between the actual and the simulated jumps by interpolating both jumps. The new measurements contain the start and end points and all the points where x reaches a natural (integer) value. We then compute the error as the norm of the difference between the two jumps. After one of the jumps ends, we add the distance from the end of the shorter jump to the end of the longer jump to the error. In this way, we penalize the model for not being able to predict the correct length of the jump.

Since we had a limited number of jumps, we used leave-one-out cross-validation to evaluate the models. For each jump, we trained the model on all other jumps and then simulated the left-out jump. We then calculated the average error between the actual jump and the simulated jump for both the training set and the test set, as shown in Figure 4.

In the process of developing our ski jump prediction model, we evaluated several variations to determine the most effective approach.
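The estimation, simulation, and error-metric steps of Sections 3.2 and 4.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the regularization strength, and the restriction to the A and B matrices (C and D are estimated analogously) are all illustrative choices.

```python
import numpy as np

def fit_ssm_ridge(states, controls, lam=1.0):
    """Ridge estimate of A, B in x_{t+1} = A x_t + B u_t.
    states: (T, n) state vectors; controls: (T, m) control vectors.
    The observation matrices C and D can be estimated the same way."""
    X = np.hstack([states[:-1], controls[:-1]])   # regressors [x_t, u_t]
    Y = states[1:]                                # targets x_{t+1}
    # closed-form ridge solution: W = (X'X + lam*I)^{-1} X'Y
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
    n = states.shape[1]
    return W[:n].T, W[n:].T                       # A, B

def simulate(A, B, x0, controls):
    """Roll the fitted model forward recursively from the initial state."""
    xs = [np.asarray(x0, dtype=float)]
    for u in controls:
        xs.append(A @ xs[-1] + B @ u)
    return np.array(xs)

def trajectory_error(actual, simulated):
    """Sketch of the Section 4.1 metric: interpolate both (x, y, z)
    trajectories on a shared integer-x grid, take the norm of the
    pointwise difference, and add the distance between the two end
    points as a penalty for predicting the wrong jump length."""
    lo = max(actual[:, 0].min(), simulated[:, 0].min())
    hi = min(actual[:, 0].max(), simulated[:, 0].max())
    xs = np.arange(np.ceil(lo), np.floor(hi) + 1)     # shared integer x grid
    a = np.column_stack([np.interp(xs, actual[:, 0], actual[:, k]) for k in (1, 2)])
    s = np.column_stack([np.interp(xs, simulated[:, 0], simulated[:, k]) for k in (1, 2)])
    return np.linalg.norm(a - s) + np.linalg.norm(actual[-1] - simulated[-1])
```

Leave-one-out cross-validation then amounts to calling `fit_ssm_ridge` on all jumps but one and scoring the held-out jump with `trajectory_error`.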
We compared the performance of a pure SSM with a hybrid model that combined the SSM with a PINN. The pure SSM demonstrated superior predictive accuracy, probably due to its ability to directly model the temporal dynamics of ski jumps without the added complexity of PINNs. We also experimented with different configurations of the SSM, including using all available wind sensor data versus an averaged value per zone. When we used all sensors, the average error per point was 1.67 m on the training data and 1.89 m on the test data, while when we averaged the sensors over the zones, the error was 1.76 m on the training data and 1.82 m on the test data. This suggests that averaging the wind data helps with the simulation.

Figure 3: Schema of the SSM matrices [3]

3.3 Ski jump animation app
To make our results accessible beyond the research setting, we developed an interactive web application using Shiny for Python [8]. The application serves as a front-end to the trained state-space model and allows users to explore ski jump simulations under varying environmental conditions, or simply to observe different measured ski jumps. Firstly, through a set of input controls, users can adjust factors such as wind speed, wind directions, or different ski angles, and the application instantly updates the predicted jump trajectory. Secondly, users can explore random jumps from the provided dataset or upload their own CSV file of measured jumps, as long as it includes the columns described in Section 2.3.

¹https://camlekn.shinyapps.io/ski-jump/

4.2 Analysis of our model
Wind is a critical factor in ski jumping, so we attempted to capture its nonlinear effects by including columns for the squared wind features. However, we found that adding these squared terms did not significantly reduce the prediction error.

Since the simulation still requires numerous inputs, we made it interactive, allowing users to adjust the wind conditions and observe their impact on the jump. In the ski jumping app, users can manipulate sliders to set the wind speed, wind tangent, wind cross, and wind turbulence for each of the three zones. As a result, the wind loses its original movement function in the simulation. All other inputs are set to the average values computed from the dataset.

Figure 4: Actual vs. simulated ski jump trajectory

5 Discussion
This section examines the predictive performance of the trajectories, highlights the limitations of our current approach, and suggests directions for future improvements.

5.1 Limitations
Given the relatively small dataset of ski jumps, the main limitation of our project lies in the limited data available for training the model. After preprocessing, the dataset contained only about 200 jumps, which may limit the SSM's ability to represent the full range of trajectory variations under different circumstances. As a result, the model may struggle to accurately predict jumps under novel or extreme conditions.

Furthermore, the dataset lacks detailed information, or any information at all, about individual jumpers, such as body weight, sex, or other physiological characteristics that are known to influence jump performance. Incorporating these variables could improve model accuracy and provide more personalized predictions [5].

Lastly, due to limited computing power, only one CPU was available, restricting the use of possibly better models. To address these challenges, using cloud-based resources could help run larger models and improve the prediction of trajectories.

5.2 Future work and potential improvements
Although the current approach shows promise, there are several avenues for future improvement, some of which we are working on at the time of writing this paper.

Currently, we are working on improving the sliders' functions. Since the wind data determined by the user is static throughout the jump, this adds a lot of generalization. In reality, wind conditions can change rapidly during a jump. So we would like to add additional controls to the app that would allow the user to define how the wind changes during the jump. They could choose whether the wind would gain or lose a certain feature (such as speed or turbulence) during the jump.

Expanding the dataset to include more jumps and additional contextual information about individual jumpers could improve the accuracy of the model. We could try to generate more data by using data augmentation techniques, such as adding noise to the wind measurements or slightly modifying the angles. We could also try to find the nonlinear movements of the wind and interpolate the wind measurements by their original time stamps to better capture the wind dynamics.

6 Conclusion
This paper presents a method for predicting ski jump trajectories based on environmental conditions. By incorporating external factors into the modeling framework and applying least squares estimation, we demonstrated that the model is capable of capturing the dynamics of ski jumps and producing realistic trajectory predictions. In addition, we developed an interactive application that makes the results accessible to a broader audience through simulations and animations of predicted jumps. Although the current model is limited by the size of the dataset and the absence of certain athlete-specific variables, the results show that state-space models are a promising tool for analyzing ski jumping performance.

7 Acknowledgments
This work was supported by Smučarska Zveza Slovenije (Ski Association of Slovenia), whom we would like to thank for providing the ski jump data.

References
[1] 3DWarehouse via 3dmdb.com. 2025. "Ski Jumping Planica" [3D model]. https://3dmdb.com/en/3d-model/ski-jumping-planica/8386000/?free=True&q=Ski+jump. Free model; accessed 2025-08-26.
[2] Masanao Aoki. 1990. State Space Modeling of Time Series (2nd, revised and enlarged ed.). Universitext. Springer Berlin Heidelberg, Berlin, Heidelberg. ISBN 978-3-642-75883-6. doi:10.1007/978-3-642-75883-6.
[3] Dave Bergmann. 2025. What is a state space model? https://www.ibm.com/think/topics/state-space-model. Accessed 2025-09-24.
[4] Shengze Cai, Zhiping Mao, Zhicheng Wang, Minglang Yin, and George E. Karniadakis. 2021. Physics-informed neural networks (PINNs) for fluid mechanics: a review. Acta Mechanica Sinica, 37, 12, 1727–1738. doi:10.1007/s10409-021-01148-1.
[5] Wolfram Müller. 2008. Performance factors in ski jumping. In Sport Aerodynamics. CISM International Centre for Mechanical Sciences, Vol. 506. Helge Nørstrud, editor. Springer, Vienna, 139–160. ISBN 978-3-211-89296-1. doi:10.1007/978-3-211-89297-8_8.
[6] Wolfram Müller. 2006. The physics of ski jumping. Tech. rep. CERN report on the aerodynamics and physics of ski jumping. CERN. https://cds.cern.ch/record/1009275/files/p269.pdf.
[7] Bor Plestenjak. 2016. Razširjen uvod v numerične metode. Slovenian textbook on numerical methods. DMFA-založništvo.
[8] Posit Team. 2025. Shiny for Python. https://shiny.posit.co/py/. Accessed 2025-08-29.
[9] Serrano.Academy. 2025. State-Space Model (SSM) tutorial. Video. https://youtu.be/g1AqUhP00Do.
[10] Ski Jumping Hill Archive, skisprungschanzen.com. 2025. Letalnica (Letalnica bratov Gorišek), Planica, Slovenia. https://www.skisprungschanzen.com/EN/Ski+Jumps/SLO-Slovenia/Planica/0475-Letalnica/. Accessed 2025-09-12.
[11] Ava Thompson, ed. 2025. Ski Jumping. Publifye AS. Found via Google Books at https://www.google.si/books/edition/Ski_Jumping/G2pPEQAAQBAJ?hl=en&gbpv=0.
[12] Wessel N. van Wieringen. 2015. Lecture notes on ridge regression. arXiv preprint arXiv:1509.09169. Revision v8; revised 27 June 2023. doi:10.48550/arXiv.1509.09169.
[13] Mikko Virmavirta and Juha Kivekäs. 2019. Aerodynamics of an isolated ski jumping ski. Sports Engineering, 22, 1, 1–6. doi:10.1007/s12283-019-0298-1.
[14] StatQuest with Josh Starmer. 2025. Neural networks tutorial. Video. https://youtu.be/CqOfi41LfDw.

Predicting milling overload based on sensor data: a graph-based approach

Roy Krumpak, Jožef Stefan Institute, Ljubljana, Slovenia, krumpak.roy@gmail.com
Jože M. Rožanec, Jožef Stefan Institute, Ljubljana, Slovenia, joze.rozanec@ijs.si
Dunja Mladenić, Jožef Stefan Institute, Ljubljana, Slovenia, dunja.mladenic@ijs.si
Zhenyu Guo, BGRIMM Technology Group, Beijing, China, guozhenyu@bgrimm.com
Tao Song, BGRIMM Technology Group, Beijing, China, songtao@bgrimm.com
Dumitru Roman, SINTEF Digital, Oslo, Norway, titi.roman@sintef.no
Inna Novalija, Jožef Stefan Institute, Ljubljana, Slovenia, inna.koval@ijs.si
Xiang Ma, SINTEF Industry, Oslo, Norway, xiang.ma@sintef.no

ABSTRACT
In this paper, we present an approach to predict milling overload that leverages time series-to-graph transformations, which, along with domain data encoded as a graph, are fed to predictive machine learning models. The contributions of this paper include the use of multiple graph representations (not just one) to capture the structure of a time series and an evaluation of the described approach on a real-world dataset. Additionally, we compared the performance of the graph-based approach with the TS2Vec foundational model, regarded as the state of the art. Our results show that TS2Vec performed best across all time windows.
While combining TS2Vec and graph embeddings resulted in reduced performance compared to TS2Vec alone, it enhanced the outcomes compared to the sole use of graph embeddings. Furthermore, combining Ordinal Partition Graph and TS2Vec embeddings resulted in more stable performance across predictive time windows.

KEYWORDS
Time series, graphs, mining, milling, predictive maintenance, sensor data

1 INTRODUCTION
Milling, central to mineral processing, involves breaking down ores into smaller particles, but is prone to abnormal behavior due to material properties and upstream steps (Hodouin et al. 2001 [3]; Galán et al. 2002 [2]). While traditional control relied on operators, advances in machine learning (ML) have enabled data-driven optimization and predictive maintenance (Mobley 2002 [6]). Graph-based methods are increasingly applied to time series to capture temporal and structural relations (Silva 2021 [8]). Variants include Natural Visibility Graphs (NVG) to capture the time series topology (Lacasa et al. 2008 [4]; Stephen et al. 2015 [10]), Quantile Graphs for transitions between time series values (Silva et al. 2024 [9]), and Ordinal Partition Graphs to capture regular temporal patterns and their transitions.

The contributions of this paper include the use of multiple graph representations (not just one) to capture the structure of a time series and the evaluation of the described approach on a real-world dataset.

2 USE-CASE DESCRIPTION
BGRIMM Technology Group is a Chinese leader in mining and mineral processing solutions, focusing on automation and intelligent control, with grinding optimization as a core area. Grinding is both the most energy-intensive step in mineral processing, accounting for 40% of total energy costs, and a key determinant of downstream recovery and product quality (Zhou et al. 2009 [11]; Lessard et al. 2016 [5]; Groenewald et al. 2006 [1]). At a 10,000 ton/day copper plant in Anhui Province using a SAG–ball–pebble (SABC) circuit, BGRIMM is developing intelligent control strategies to maximize throughput while preventing SAG mill overload. Central to this effort is accurate SAG power prediction, which serves as a feedforward signal to improve feed regulation and overall process efficiency.

3 DATASET
The dataset used in this article was collected and provided by BGRIMM Technology Group. The data consists of various sensor measurements from the machines used in their mine's ore processing plant, accounting for a total of 42 columns. One column stores the date and time of the measurement, while the rest contain numerical values. The sensor data was sampled every two seconds and compiled across a hundred days, from January 1st, 2019, to April 12th, 2019, excluding the first two days of April, resulting in 4.32 million rows. Besides the raw data, a description of an overload state was also provided. A column named SAG_2201.power, which represents the power of the SAG mill, is used to decide whether there is an anomaly in the data. If the column reaches a value above 4700 kW and has an upward trend, or whenever it surpasses the value of 4800 kW, this is considered an overload of the system, and a supervisor might take appropriate actions to stop the overloading.
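The overload rule just described can be written compactly. The following is an illustrative sketch (not the authors' code): the 4700/4800 kW thresholds come from the text, and the upward-trend test is a least-squares slope over a trailing 1-hour window of 1800 two-second samples.

```python
import numpy as np

def label_overload(power_kw, window=1800, soft=4700.0, hard=4800.0):
    """Illustrative sketch of the overload rule described in the text
    (not the authors' code): a sample is an overload if it exceeds the
    hard threshold, or exceeds the soft threshold while the trailing
    window (1800 samples = 1 hour at 2 s sampling) trends upward."""
    power_kw = np.asarray(power_kw, dtype=float)
    labels = np.zeros(power_kw.size, dtype=int)
    x = np.arange(window)
    for i, p in enumerate(power_kw):
        if p > hard:
            labels[i] = 1
        elif p > soft and i + 1 >= window:
            # slope of a least-squares line over the trailing window
            slope = np.polyfit(x, power_kw[i + 1 - window:i + 1], 1)[0]
            if slope > 0:
                labels[i] = 1
    return labels
```

Applied to the SAG_2201.power series, such a function yields the 0/1 labels that later serve as the prediction target.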
Figure 1: The diagram depicts the milling plant components and how they are connected. The components of interest are highlighted with red rectangles.

Figure 2: SAG_2201.power column (light gray), where anomalies (gray dots) are annotated based on the moving average (gray), the automatic anomaly label threshold (dotted black), the possible anomaly label threshold (solid black), and the linear regression slope (positive: dashed and dotted; negative: dashed).

4 METHODOLOGY

4.1 Data preparation
Based on experts' input, the samples with SAG_2201.power < 4700 were labeled 0 (no anomalous event) and the others 1 (milling overload). A 1-hour (1800-sample) moving average with linear regression checked for upward trends; if there was none, the label was reset to 0 (see Fig. 2). Next, we selected a subset of columns to be used in the analysis, utilizing expert knowledge to choose only those columns that are measured in the workflow before the SAG_2201.power column. The resulting columns are LIT_2103A.PV, FCV_2201.PID_SP, SAG_2201.Press_Ziyouduangaoya2, Feeder_Control.SP, WIT_2101.PV, and SAG_2201.power.

4.2 Feature engineering
The raw data from the selected columns was first checked for any missing values, which were not present. In the next step, we detected changes in the columns and then replaced the values in the samples between two such changes with the mean value of that segment (see Fig. 3a). This data was further simplified with the help of a k-bins discretizer, which was used to encode each column with seven values based on the quantile into which each sample fell (see Fig. 3b). The column named WIT_2101.PV was excluded from this first step of data simplification and from the graph representations, and was processed separately, because its values did not appear to have distinct oscillating levels and did not benefit from such processing. After discretization, every column had an integer value between zero and six, with each row then interpreted as a state. The average state duration is 42 seconds. Repeated states (duplicate rows) were dropped, decreasing the size of the dataset (see Fig. 3c). For a visual representation of these steps, see Fig. 3, where the data from one panel is reused and, where important, also noted in the next one: the raw data in Fig. 3a, the 'means' data in Fig. 3a and Fig. 3b, the simplified data in Fig. 3b, and the unique sample data in Fig. 3c. The annotated plot in Fig. 3c is used as the base data for an example NVG generation in Fig. 4. The numbers represent the same data point, one in the plot and one in the graph representation.

Figure 3: Pipeline of transformations on the SAG_2201.power column. (a) SAG_2201.power column (light gray), where a threshold change detection was used to detect changes and to replace in-between values with the mean value (black). (b) Result (dashed black) of applying a k-bins discretizer model on the previously simplified data (solid black) from Fig. 3a. Note the different y-axis scales of the overlaid graphs. (c) A representation of the simplified column data from Fig. 3b, considering only the unique consecutive values.

Figure 4: The Natural Visibility Graph representation of the data in Fig. 3c.

4.3 Modeling the data as graphs
We employ three strategies for converting time series into graphs: Natural Visibility Graphs (NVG), Ordinal Partition Graphs (OPG), and Quantile Graphs (QG). We used the time series to graph and back library (https://timeseriestographs.com/) to achieve this.

For each sample in the data, we built a graph representation of it by looking at the samples within a selected window w_s preceding it and applying the described time-series-to-graph strategies on each column, apart from WIT_2101.PV, separately. Such graphs, called subgraphs, were bound to a default graph structure that represents which columns are neighboring in the plant process (see Fig. 1) by connecting the node which represents the SAG_2201.power column to every other column. The result of this step was a larger type of graph called a state graph (see Fig. 5). The black nodes represent nodes for a particular column, while gray nodes represent the subgraphs created from the time series. The subgraphs are connected to the column nodes via the node that corresponds to the first instance from the time series. Depending on the experiment, we made an additional step of joining w_0 many of the state graphs into a larger graph, which was used to generate embeddings.

Figure 5: Example of a state graph.

A Graph2Vec model from the karateclub library [7] was used to generate graph embeddings, with an embedding size of 250. We chose this model for its ease of use and for performance reasons. The column WIT_2101.PV was also transformed into an embedding form by using a TS2Vec model (https://github.com/zhihanyue/ts2vec). The embedding output size was set to 40, as this is approximately the number of graph-embedding features per column.

4.4 Model training and evaluation
An initial subset of the data, which included the data from the first available day, was used to test the performance of different graph embeddings. This was done to reduce the time and memory consumption of the first assessment. A CatBoost model was used, trained for 800 iterations with a learning rate of 0.03, the Cross Entropy loss function, and the leaf regularization parameter set to 0.3. To assess our model's ability to predict anomalous states, we also tried to fit the model on the same data, but with the target column shifted accordingly. This was done for up to 90 shifts, which is equivalent to predicting 63 minutes in advance. Once we had selected the best graph embeddings, we built and tested the model on the entire data set.

5 EXPERIMENTS
We conducted three experiments, all of which follow the same template, where we tested how the structure of a graph affects the end model's ability to predict anomalies. This includes first creating subgraphs as NVG, OPG, and QG representations of the columns with window size w_s and joining them into the state graph representation (see Fig. 5). Finally, w_0 many of these state graphs are joined sequentially according to the order given by the time at which the represented states appear in the data. The experiments differ in the window sizes w_s and w_0: Experiment A used w_s = 50 and w_0 = 1, Experiment B used w_s = 15 and w_0 = 20, and Experiment C used w_s = 15 and w_0 = 40. If we take the average state duration of 42 seconds into account, we see that Experiment A uses data from the last 35 minutes, Experiment B from the last 15 minutes, and Experiment C from the last 28 minutes.

We also carried out experiments similar to Experiment B, where the state graphs were structured based on only one specific type of subgraph. Furthermore, the impact of the separately processed WIT_2101.PV column was tested by repeating the same experiments with the difference that this column's embeddings were excluded when training the final model. These experiments do not have a mark in the 'WIT' column of the results in Table 3.
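As a concrete illustration of the NVG representation used above, the visibility criterion of Lacasa et al. [4] can be implemented in a few lines: two samples are connected when the straight line between them passes above every sample in between. This minimal O(n²) sketch is ours; the paper itself relies on the time series to graph and back library.

```python
def natural_visibility_edges(series):
    """Edge set of the Natural Visibility Graph of a series: nodes are
    sample indices, and (a, b) is an edge when every intermediate sample
    lies strictly below the line of sight from a to b. Minimal O(n^2)
    sketch for illustration; the paper uses an external library."""
    n = len(series)
    edges = set()
    for a in range(n):
        for b in range(a + 1, n):
            if all(
                series[c] < series[b] + (series[a] - series[b]) * (b - c) / (b - a)
                for c in range(a + 1, b)
            ):
                edges.add((a, b))
    return edges
```

For instance, in the series [1, 0.5, 2] every pair of samples is mutually visible, while in [1, 3, 1] the middle peak blocks the line of sight between the endpoints.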
Lastly, a separate experiment was carried out, in which all raw data were processed using the TS2Vec model. Each column had its own TS2Vec model, which was used to embed the data associated with that column. Then, a CatBoost model with the same configuration as in the previous experiments was used in combination with the joined TS2Vec embeddings to predict the anomalies. These results are gathered in Table 1.

6 RESULTS
The results of the three experiments, which tested the informativeness of the graph structure, as well as the experiments designed to determine which type of data is the most predictive, are summarized in the following tables.

Table 1: ROC AUC results of the experiment where all data was embedded with TS2Vec models.

Time to predict [min] |      7 |     21 |     35 |     49 |     63
ROC AUC               | 0.9905 | 0.9528 | 0.8929 | 0.8235 | 0.7623

Table 2: ROC AUC results of the three experiments with respect to how far ahead the model is predicting.

Experiment |      7 |     21 |     35 |     49 |     63
A          | 0.6083 | 0.5763 | 0.5356 | 0.5333 | 0.4945
B          | 0.6184 | 0.6943 | 0.6698 | 0.6364 | 0.6128
C          | 0.5897 | 0.5688 | 0.6109 | 0.5910 | 0.6417

Table 3: ROC AUC results of the models trained on different types of graphs and data for Experiment B across all days. The check marks indicate which embeddings were used (NVG, OPG, and QG subgraphs, and the separately processed WIT_2101.PV column).

Data used      |      7 |     21 |     35 |     49 |     63
✓ ✓ ✓ ✓        | 0.6558 | 0.6418 | 0.6251 | 0.6402 | 0.6184
✓ ✓ ✓          | 0.5938 | 0.6257 | 0.5831 | 0.5882 | 0.5725
✓ ✓            | 0.7427 | 0.7146 | 0.6930 | 0.6853 | 0.6719
✓ ✓            | 0.7265 | 0.6959 | 0.6586 | 0.6502 | 0.6365
✓              | 0.7452 | 0.6978 | 0.6838 | 0.6734 | 0.6578
✓              | 0.7219 | 0.6866 | 0.6643 | 0.6416 | 0.6096
✓ (WIT only)   | 0.9292 | 0.9025 | 0.8893 | 0.8004 | 0.7042

As can be seen in Table 2, Experiments A and C have lower scores than Experiment B. However, Experiment C approaches the performance of Experiment B at the maximum prediction shift. For this reason, and because the graphs in Experiment B are smaller compared to those in Experiment C, the experiments that tested the impact of different types of data used Experiment B-type graphs. The best results for the final model were obtained from the data where all columns were embedded using TS2Vec models, as shown in Table 1. Similarly, the results in Table 3 show that the performance is best when we predict anomalies from only the TS2Vec embeddings of the WIT_2101.PV column.

Additionally, if we compare the experiments with WIT_2101.PV embeddings to the ones without them, we can see that the latter perform worse. This suggests that the TS2Vec embeddings are more informative than the graph embeddings. Nevertheless, when comparing the different types of graphs used in the final graph, we can see that OPGs alone yield the best performance.

A few possible explanations for the difference in performance between the graph-based and time series-based approaches are possible. First, when working with graphs, there are more parameters that need to be optimized, such as window sizes and parameters for constructing graphs from time series. Another reason might be that NVGs have approximately thirty times more edges and eight times more nodes compared to OPGs and QGs, which makes them disproportionately large. Additionally, the construction of state graphs contains repeated structures, which is inefficient. Lastly, the TS2Vec embeddings do not have these limitations, and embeddings can be made from the entirety of the data, as opposed to the simplified data used when not using TS2Vec.

7 CONCLUSIONS
In this paper, we discuss the use of graph-based time series representations for training machine learning models. Our experiments suggest that while this approach has potential, it did not outperform the TS2Vec foundational model and was unable to yield superior results when combined with it. Future work will explore alternative graph representations and utilize GNNs to integrate topological, semantic, and time series information directly into a single machine learning model, aiming to achieve superior results.

ACKNOWLEDGEMENTS
The Slovenian Research Agency supported this work. It was also developed as part of the Graph-Massivizer project (grant agreement No. 101093202), the enRichMyData project (grant agreement No. 101070284), and the DataPACT project (grant agreement No. 101189771), all funded by the Horizon Europe research and innovation programme of the European Union.

REFERENCES
[1] J.W. de V. Groenewald, L.P. Coetzer, and C. Aldrich. 2006. Statistical monitoring of a grinding circuit: an industrial case study. Minerals Engineering, 19, 11, 1138–1148. doi: 10.1016/j.mineng.2006.05.009.
[2] O. Galán, G.W. Barton, and J.A. Romagnoli. 2002. Robust control of a SAG mill. Powder Technology, 124, 3, 264–271. doi: 10.1016/S0032-5910(02)00021-9.
[3] D. Hodouin, S.-L. Jämsä-Jounela, M.T. Carvalho, and L. Bergh. 2001. State of the art and challenges in mineral processing control. Control Engineering Practice, 9, 9, 995–1005. doi: 10.1016/S0967-0661(01)00088-0.
[4] L. Lacasa, B. Luque, F. Ballesteros, J. Luque, and J.C. Nuño. 2008. From time series to complex networks: the visibility graph. Proceedings of the National Academy of Sciences, 105, 13, 4972–4975. doi: 10.1073/pnas.0709247105.
[5] J. Lessard, W. Sweetser, K. Bartram, J. Figueroa, and L. McHugh. 2016. Bridging the gap: understanding the economic impact of ore sorting on a mineral processing circuit. Minerals Engineering, 91, 5, 92–99. doi: 10.1016/j.mineng.2015.08.019.
[6] R. Keith Mobley. 2002. 4 - Benefits of predictive maintenance. In An Introduction to Predictive Maintenance (Second Edition). Plant Engineering. R. Keith Mobley, editor. Butterworth-Heinemann, Burlington, 60–73. isbn: 978-0-7506-7531-4. doi: 10.1016/B978-075067531-4/50004-X.
[7] B. Rozemberczki, O. Kiss, and R. Sarkar. 2020. Karate Club: an API oriented open-source Python framework for unsupervised learning on graphs. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20). ACM, 3125–3132. doi: 10.1145/3340531.3412757.
[8] V.F. Silva, M.E. Silva, P. Ribeiro, and F. Silva. 2021. Time series analysis via network science: concepts and algorithms. WIREs Data Mining and Knowledge Discovery, 11, 3, 1–39. doi: 10.1002/widm.1404.
[9] V.F. Silva, M.E. Silva, P. Ribeiro, and F. Silva. 2024. Multilayer quantile graph for multivariate time series analysis and dimensionality reduction. International Journal of Data Science and Analytics, 1–13. doi: 10.1007/s41060-024-00561-6.
[10] M. Stephen, C. Gu, and H. Yang. 2015. Visibility graph based time series analysis. PLoS ONE, 10, 11, e0143015. doi: 10.1371/journal.pone.0143015.
[11] P. Zhou, T. Chai, and H. Wang. 2009. Intelligent optimal-setting control for grinding circuits of mineral processing process. IEEE Transactions on Automation Science and Engineering, 6, 4, 730–743. doi: 10.1109/TASE.2008.2011562.

Short and Long Term Bike Rental Forecasting

Oskar Kocjančič* (Jožef Stefan Institute, Ljubljana, Slovenia), oskar.kocjancic@gmail.com
Martin Žnidaršič* (Jožef Stefan Institute, Ljubljana, Slovenia), martin.znidarsic@ijs.si

* Both authors contributed equally to this research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.7

Abstract
This paper describes the challenges and outcomes of forecasting bike rentals in a Slovenian urban bike-sharing system, focusing on the impact of data sparsity and the inclusion of external variables. We address two distinct forecasting tasks: short-horizon, one-day-ahead predictions for individual rental stations, and long-horizon, 90-day forecasts for the total rental volume. Various machine learning models were employed and evaluated in this context. We also analyzed the trade-off between using longer historical data versus shorter, weather-enriched data to improve predictive accuracy. The findings indicate a clear correlation between data sparsity at the station level and predictive performance. While the inclusion of weather data provides a modest improvement for both short-horizon and long-horizon forecasts, the overall quality of the sparse and noisy data appears to limit the potential gains from more complex modeling approaches.

Keywords
bike-sharing, forecasting, time series, data sparsity, machine learning, deep learning, weather data

1 Introduction
Predicting rental patterns of urban bike-sharing systems is challenging due to complex dynamics, including strong seasonality and trends, as well as dependence on external variables such as weather and calendar effects. Furthermore, data sparsity, particularly at the individual station level, presents a significant obstacle to building reliable predictive models. By accurately predicting bike demand, operators can improve redistribution and station availability, fostering a more reliable and sustainable urban mobility system.

This paper addresses these challenges by investigating two distinct forecasting tasks using a real-world dataset from a Slovenian city. First, we examine short-horizon, one-day-ahead predictions for individual stations to quantify the impact of data sparsity on forecastability. Second, we evaluate the accuracy of 90-day long-horizon forecasts for the total rental volume aggregated across all stations. We compare a suite of models, including classical machine learning approaches and LSTM neural networks [5], and explicitly analyze the trade-off between using longer historical data versus shorter, weather-enriched data to improve predictive accuracy. This work aims to help bike-sharing systems improve operational efficiency, reduce bike shortages, and inform city planning initiatives related to sustainable transportation.

Prior studies on bicycle rental forecasting often use the Washington, D.C. dataset [4]. Du et al. [2] addressed long-horizon prediction, while Karunanithi et al. [6] focused on short-horizon forecasting, both achieving results comparable to ours. In contrast, our dataset differs substantially by including station-level information, which enables per-station forecasting. We tackle both short- and long-horizon tasks, as well as the analysis of the impact of exogenous weather variables.

2 Data
The dataset we used originates from a public bicycle rental service in a Slovenian city. It contains daily rental counts for individual stations within the municipality, covering the period from January 1, 2021, to May 15, 2025. Although the dataset also records bike return counts, our work focuses exclusively on rentals.

2.1 Features
Figure 1: Pearson correlation coefficients of our features

Dependent Variable: The target feature we are forecasting.
• total_rentals: The total daily number of bike rentals. Based on the task, this is either the total count across all stations or the per-station bike rental count.

Independent Variables: The features used for prediction.
• Temporal Features:
  – date: The specific date.
  – ordinal_day: The day number within the year.
  – weekday: A category for the day of the week.
  – holiday: Indicator (0 or 1) if the day is a holiday.
• Weather-Related Features (note: our weather data only spans the date range 2024-01-01 to 2025-05-14):
  – air_temp_2m_C: Air temperature.
  – rel_humidity_percent: The relative humidity.
  – precipitation_mm: The precipitation per square meter.

2.2 Data Preprocessing
Figure 2: Distribution of bike rentals across all stations. The vertical blue line indicates the start of the year 2024.

The dataset structure prevented distinguishing missing values from true zeros (i.e., days when no rentals occurred), so all empty or null entries were treated as zeros. This resulted in sparsity for some stations, in which many entries had little information on rental activity. To prevent this from impacting our analysis, we excluded stations with more than 33% zero entries, retaining 25 stations out of the original 48. For the machine learning methods described later, we also implemented a set of lagged features:
• total_rentals_mean_7_days: Average rental count over the 7 days preceding the current data point.
• total_rentals_mean_14_days: Average rental count over the 14 days preceding the current data point.
• total_rentals_mean_21_days: Average rental count over the 21 days preceding the current data point.
• total_rentals_mean_28_days: Average rental count over the 28 days preceding the current data point.

2.3 Exploratory Data Analysis
Figure 3: Rentals per day of the week

The data exhibits pronounced weekly and monthly seasonalities, as well as non-stationarity, as illustrated in Figures 3 and 4. Annual patterns show rental activity declining in winter, rising in spring, peaking in summer, and gradually decreasing in autumn, with weekends consistently exhibiting lower rental counts. Anomalous behavior was observed in the winter of 2024, when rental counts were markedly higher than typical seasonal levels.

The Pearson correlation coefficients (Figure 1) between features related to bicycle rentals indicate that the number of daily rentals (total_rentals) is strongly and positively associated with recent rental trends, as reflected by correlations of 0.73, 0.67, 0.64, and 0.63 with the 7-, 14-, 21-, and 28-day moving averages, respectively. A strong positive correlation is also observed with air temperature (0.59), whereas moderate negative correlations are found with relative humidity (-0.43) and precipitation (-0.31), suggesting that rentals are more frequent on warm, dry days. Weaker associations are present with the day of the week (-0.27) and holiday status (-0.10). As expected, the moving average features exhibit high intercorrelation (e.g., 0.94 between the 7- and 14-day means) due to their overlapping calculation windows.

3 Experiments
This study pursued two primary objectives. First, we examined the feasibility of forecasting bicycle rentals one day in advance and evaluated how forecastability varies across stations with different data sparsity. Second, we investigated long-horizon forecasting over a 90-day period, focusing exclusively on predicting the total number of rentals. In this task, standard machine learning models were trained on historical data and then used recursively to generate forecasts for the entire period. Due to this setup, the results for DS_W suffer from data leakage. Specifically, a single model is trained using past rental counts and future weather information, so, for example, predicting rentals in July involves access to the actual recorded weather conditions for that month, which artificially improves performance.

3.1 Training and Test Data Split
Because the available weather data was limited to the years 2024
and 2025, while the rental dataset spanned from 2021 onward, we constructed three distinct datasets. Here, each entry corresponds to a single day and includes rental data for all stations. The first dataset, DS_W, combined rental and weather data (498 entries). The second, DS_NO_W, included only rental data for the same period (498 entries). The third, DS_FULL, comprised the complete rental dataset without weather data (1,593 entries).

The data splitting strategy differed in the two tasks. For the station-level one-day-ahead forecasting task, each dataset was divided into 25 subsets, corresponding to individual stations. Within each subset, random sampling was used to split the data into training and testing sets with an 80:20 ratio. The target variable in each subset is the specific station's rental count.

For the long-horizon task, no station-level subdivision was performed, as only total rental counts were modeled. The final 90 days were used as the test set, roughly corresponding to a temperate season, allowing us to assess whether the models capture seasonal patterns in a new period while maintaining realistic temporal separation between training and testing data.

Figure 4: Bike rental data with temperate seasons

3.2 Models and Algorithms Used
For the long-horizon forecasting task, the AutoARIMA model served as the baseline, while for the one-day-ahead forecasting task, the baseline was the Mean Regressor, which predicts using the 7-day lag mean.

We evaluated several machine learning models, including Random Forest (500 trees, max_features=0.9), Gradient Boosting (500 estimators), Linear Regression, and SVM (C = 10, degree = 2, γ = 0.1, linear kernel). The hyperparameters for the Random Forest and SVM models were selected using a grid search optimization procedure; the rest of the models used default parameters. For the Random Forest model, only the max_features parameter was tuned.

We additionally tested deep learning approaches: LSTM (input size = 96, RMSE loss, 10,000 epochs) and N-BEATSx (input size = 96, RMSE loss, 500 epochs).

3.3 Performance evaluation
Model performance was assessed using Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE). Additionally, the Relative Root Mean Squared Error (RRMSE) [1] was used to enable inter-station performance comparisons in the one-day-ahead forecasting task. RRMSE is defined as follows:

RRMSE = RMSE / ȳ    (1)

where ȳ is the mean of the target values.

3.4 Results
The results for the one-day-ahead task are presented in Table 1, with station forecastability visualized in Figure 5. The long-horizon task outcomes are presented in Table 2.

4 Discussion and conclusion
For the one-day-ahead forecasting task, a clear correlation exists between station data sparsity (Figure 2) and forecastability (Table 1). Stations with fewer rentals or gaps in data are easier to predict accurately. Interestingly, using the DS_FULL dataset, which includes data prior to 2024, can reduce modeling accuracy for certain stations. Including weather features in DS_W leads to little or no improvement compared to DS_NO_W. For the long-horizon task, including weather data proves beneficial, as both classical machine learning models and neural networks show improved performance (Table 2). However, as described in the Experiments section, the machine learning results on DS_W are overly optimistic due to data leakage: the models are trained on historical rental counts while also accessing future weather information during recursive forecasting (e.g., predicting rentals in July uses the actual recorded weather for that month). This is reflected in the comparison with DS_NO_W, where classical machine learning methods achieve a 33% mean reduction in MAPE, while neural network approaches show only a 17%
mean decrease, suggesting that the apparent benefit of weather Training was performed on a laptop equipped with an RTX 3050 data is amplified for classical methods because of this setup. Our GP U (4 GB VRAM), which constrained the range of hyperparam- results echo [3] where Gradient Boosting models matched or eter configurations that could be explored, particularly for the outperformed neural networks on several datasets, demonstrat- neural network-based approaches. ing the effectiveness of simpler models. While neural networks 39 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia Kocjančič et al. Figure 5: Model performance of one-day-head forecasting for different stations for DS_W Table 1: Average RRMSE of all models of one-day-ahead Table 2: Model performance of 90-day forecasting across forecasting across datasets (RRMSE) and stations. datasets (RMSE / MAPE) Station DS_FULL DS_NO_W DS_W Model DS_FULL DS_NO_W DS_W 6 0.9210 0.9116 0.9097 AutoARIMA 120.09 / 0.9525 118.50 / 0.9954 118.50 / 0.9954 7 0.5849 0.5488 0.5439 Random Forest 108.29 / 0.7153 100.94 / 76.36 / 0.7014 0.7431 8 0.7948 0.6872 0.5513 0.6821 Gradient Boosting 95.17 / 0.7451 94.96 / 0.9584 74.69 / Linear Regression / 0.9372 84.78 / 1.0816 71.71 / 0.8872 90.29 9 0.6646 0.6631 0.6532 SVR 94.86 / 0.8893 / 0.9507 / 0.8036 87.12 67.95 10 0.9550 0.7753 0.7747 LSTM 112.05 / 125.13 / 0.8494 130.00 / 0.8070 0.7133 11 1.0110 1.0034 1.0027 NBEATSx 106.49 / 1.0329 128.90 / 0.9972 117.45 / 0.7246 12 0.6028 0.4649 0.4540 Average 103.89 / 0.8551 105.76 / 0.9394 93.81 / 0.7815 13 0.6601 0.4022 0.4000 14 0.6902 0.4840 0.4720 15 0.5218 0.4780 0.4652 16 0.7185 0.5984 0.5975 Acknowledgements 17 0.8336 0.7402 0.7337 18 0.5274 0.4670 0.4522 This work was supported in part by the Slovenian Research 21 0.5476 0.5218 Agency through core funding for the programme Knowledge 0.5215 22 0.5198 0.4171 Technologies (No. 
P2-0103) and by the project , funded 0.4160 KReATIVE 23 0.4783 0.4363 through NetZeroCities under the European Union’s Grant Agree-0.4349 24 0.4896 0.4760 0.4696 ment No. HORIZON-RIA-SGA-NZC 101121530. We also thank 25 0.6834 0.5608 0.5570 Tea Tušar for her suggestions regarding data visualization. 26 0.6897 0.6812 0.6506 27 0.9898 0.9595 0.9463 References 28 0.5580 0.4936 0.4898 29 0.6008 0.5788 0.5761 [1] Shikun Chen and Nguyen Manh Luc. 2022. Rrmse voting regressor: a weight- ing function based improvement to ensemble regression. arXiv preprint 30 0.5941 0.5531 0.5496 arXiv:2207.04837 . 31 0.8952 0.6474 [2] Jimmy Du, Rolland He, and Zhivko Zhechev. 2014. Forecasting bike rental 0.6452 32 0.5453 0.4873 0.4851 demand. . Gebhard, K., & Noland [3] Shereen Elsayed, Daniela Thyssens, Ahmed Rashed, Lars Schmidt-Thieme, Average 0.6793 0.6016 0.5989 and Hadi Samer Jomaa. 2021. Do we really need deep learning models for time series forecasting? , abs/2101.02118. https://arxiv.org/abs/2101.02118 CoRR arXiv: 2101.02118. [4] Hadi Fanaee-Tork. 2012. Bike sharing dataset. Dataset. (2012). https://www.k aggle.com/datasets/marklvl/bike- sharing- dataset. [5] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. could potentially benefit from hyperparameter optimization, the , 9, 8, 1735–1780. Neural computation [6] Meerah Karunanithi, Parin Chatasawapreeda, and Talha Ali Khan. 2024. A same applies to other methods as well. A detailed comparison of predictive analytics approach for forecasting bike rental demand. Decision different approaches was beyond the scope of this preliminary , 11, 100482. doi: https://doi.org/10.1016/j.dajour.2024.10048 Analytics Journal study but could be explored in future work. 2. 
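The paper above defines its one-day-ahead baseline precisely (the Mean Regressor predicting with the 7-day lag mean) but does not spell out the RRMSE formula it reports. The sketch below implements that baseline and one common reading of relative RMSE, namely model RMSE divided by baseline RMSE, so that values near 1 mean the model is no better than the baseline; this normalisation is our assumption, not confirmed by the paper.

```python
import numpy as np

def mean_regressor_forecast(history: np.ndarray) -> float:
    """One-day-ahead baseline from the paper: the mean of the
    last 7 daily rental counts (the 7-day lag mean)."""
    return float(np.mean(history[-7:]))

def rrmse(y_true: np.ndarray, y_pred: np.ndarray, y_base: np.ndarray) -> float:
    """Relative RMSE: model RMSE divided by baseline RMSE.

    NOTE: this normalisation is an illustrative assumption; the paper
    reports RRMSE without stating the exact formula. Values below 1
    would mean the model beats the baseline."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    rmse_base = np.sqrt(np.mean((y_true - y_base) ** 2))
    return float(rmse / rmse_base)

# Example: 14 days of daily rentals, forecast for day 15
daily = np.array([12, 15, 11, 18, 20, 9, 7, 13, 16, 12, 19, 21, 10, 8], dtype=float)
print(mean_regressor_forecast(daily))  # mean of the last 7 values
```

Under this reading, the near-1 RRMSE values for stations 11 and 27 in Table 1 would indicate stations where no model improves much on the 7-day lag mean.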
Predicting Traffic Intensity on Motorway Sections

Matic Kladnik†
Jozef Stefan International Postgraduate School
Ljubljana, Slovenia
matic.kladnik@gmail.com

Dunja Mladenić
Department of Artificial Intelligence, Jozef Stefan Institute
Ljubljana, Slovenia
dunja.mladenic@ijs.si

Abstract

This paper addresses predictions of traffic intensity on sections of motorways. Predictions are computed for timespans from 24 hours up to 52 weeks. With our adaptive system, we update predictions with newer ones once additional features can be computed from available data. We use the historic context of past traffic intensities on specific sections at specific periods of time, as well as semantic context about the target period. We have evaluated our methodology with multiple machine learning models and compared performances for various timespans on a specific motorway section. The evaluation results show that our methodology improves predictions for specific periods over time.

Keywords

Motorway, traffic intensity, prediction, regression, system, semantic context, evaluation, machine learning

1 INTRODUCTION

A prediction system for predicting traffic intensity on motorway sections can support a wide range of decision-making, strategic, and operative processes at the motorway management organization. It can also support end users, such as daily commuters, tourists, and other drivers, with their planning of a trip.

The focus of this paper is on the architecture of the motorway traffic intensity prediction system as well as on the evaluation of the machine learning models that were trained to produce the predictions for various timespans.

2 PROBLEM SETTING AND DATA

The objective of the proposed methodology is to make long-term and medium-term predictions of traffic intensity or frequency (vehicle count) on various sections of motorway, based on historic data of traffic counters, semantic context of motorway stations, and semantic context of time periods. Predictions serve the motorway management company for better planning of construction projects and for finding the least intrusive time slots for road maintenance work. They also serve motorway drivers when planning a trip.

2.1 Traffic Counters

There are close to one hundred traffic counters that we consider for predictions. Each counter is supported by a pair of inductive loops that are laid into the asphalt of the road. Signals are processed, sent through an IoT communication device, and stored into the database.

In the data, there are counts or frequencies of total vehicles, as well as counts by vehicle type (passenger car, transport truck, bus), for each hour-long time period, e.g. the number of vehicles from 8:00 to 9:00, for each of the lanes of a specific motorway section separately.

2.2 Semantic Context

For each of the examples in the dataset we produce semantic context features. For each day and time-of-day period, these features inform the model whether a certain time period falls on a workday or a weekend, and whether it falls into the morning or the afternoon rush hours. These semantic features give additional information that improves the performance of the machine learning models.

2.3 Data Processing

After downloading the data from the motorway counters via an API of the data provider, we additionally process it to increase the consistency and reliability of predictions. During data processing, we merge data from all lanes of a specific motorway section, which is usually denoted with neighboring towns and the direction of the motorway section.

3 METHODOLOGY DESCRIPTION

We propose a prediction system that incorporates multiple machine learning models to deliver the most reliable predictions based on the available data and the timespan for which the system is making predictions of traffic intensity.
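The semantic context features described in Section 2.2 can be derived from the timestamp alone. A minimal illustrative sketch follows; the feature names and the exact rush-hour boundaries are our assumptions, since the paper does not state the hours it uses:

```python
from datetime import datetime

def semantic_context_features(ts: datetime) -> dict:
    """Illustrative semantic context features for one hourly period.

    The rush-hour boundaries below are assumed for illustration;
    the paper does not specify the exact hours used.
    """
    hour = ts.hour
    return {
        "is_weekend": ts.weekday() >= 5,       # Saturday or Sunday
        "is_workday": ts.weekday() < 5,
        "is_morning_rush": 6 <= hour < 9,      # assumed 6:00-9:00
        "is_afternoon_rush": 14 <= hour < 18,  # assumed 14:00-18:00
    }

# Example: the 7:00-8:00 period on Monday, 6 May 2024
feats = semantic_context_features(datetime(2024, 5, 6, 7))
```

Binary features of this kind can be fed directly to the regression models alongside the historic traffic count features.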
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
http://doi.org/10.70314/is.2025.sikdd.25

Figure 1: Diagram of the system for producing and distributing predictions of traffic intensity on motorway sections

To improve prediction accuracy, we make medium-term and long-term predictions. In our case, long-term predictions are made from 1 week to 52 weeks in advance for a specific 1-hour time period on a specific day of the week. This means that we can make up to 52 predictions when conducting long-term predictions after receiving a new data example, e.g. the traffic frequency for a specific 1-hour time period (e.g. 14:00-15:00) on a specific day (e.g. Monday). Medium-term predictions, in contrast, predict from less than 24 hours up to 1 week in advance. For medium-term predictions, we take more features for recent traffic frequency into account for improved accuracy.

Long-term predictions are useful when making decisions for actions that are several weeks or months in the future, while medium-term predictions are more useful when making decisions for actions that will take place from 1 to 7 days in the future.

We have a separate machine learning model for each of the included counters on the motorway, to better adjust to the specifics of the traffic dynamic of a specific counter when making predictions of traffic frequency. We have also trained several general-purpose models that are trained on a group of counters or on all counters. These are present to support counters with a short data history.

Predictions are exposed through a REST API service and are available upon request. They are computed and updated regularly, e.g. daily or hourly. More approaches are described in [1][6].

3.1 Machine Learning Models

To compute predictions of traffic intensity in the future, we use regression machine learning models. We have trained and evaluated several models with different machine learning algorithms. These are: linear regression, SVM (SVR – Support Vector Machine for Regression), and XGBoost, which is an ensemble model of decision trees.

Features for training models and making predictions are engineered in such a way that each one of the models can use the whole set of features; e.g., we use a one-hot encoding approach when a feature would otherwise have multiple categorical values. We focus on training a specific model for each of the motorway sections that were part of the research. Note that a more general model, trained on data from multiple motorway sections, could be more appropriate for motorway sections that have been newly added and do not have enough historical data to support training of a reliable machine learning model with a sufficient evaluation period. We use up to 7 features that are based on historic data, 7 time period features, and 6 semantic context features for a specific time period and location.

Model training processes use MAPE (Mean Absolute Percentage Error, used interchangeably with MARE – Mean Absolute Relative Error). More on relevant machine learning models and metrics can be found in the references ([2][3][4][5]).

3.2 Prediction System Description

We continue with the description of our proposed prediction system. The system consists of two main subsystems: one for periodically computing and storing traffic intensity predictions for various timespans, and another for delivering predicted traffic intensity via a REST API service.

As we can see in Figure 1, the system fetches data from the data provider's REST API service. Data is processed after retrieval and sent into a table of the prediction system's database. This data is read periodically by the adaptive prediction system. Once a new value is processed by the system, it checks if there are any additional models available with a shorter timespan, compared to the model used for the currently available prediction. The system prioritizes predictions from models with a shorter timespan in order to update the database with the most reliable predictions available at the time. E.g., a prediction with a 1-month timespan succeeds and replaces the prediction with a 3-month timespan.

Different long-term and medium-term models can be trained using different machine learning algorithms, depending on the algorithm that performed the best during the evaluation of the models.

Once the updated predictions are stored in the database, they are available to users, such as strategists, operators and support specialists within the motorway management organization, or end users of the motorway, such as drivers of cars, trucks, buses, etc. A key advantage of this approach is that drivers, motorway operators and specialists get insights that are based on the same predictions for traffic intensity, which supports greater transparency of information and stronger compatibility of different applications for end users and motorway professionals. E.g. the system can support long-term planning for larger maintenance or reconstruction projects up to 1 year ahead, as well as long-term planning by road users. For instance, drivers can plan their holidays and the time of their commute ahead, and highway maintenance operators can find the most optimal schedule for short maintenance work.

4 EVALUATION

We continue with the evaluation of the machine learning models. To compare models trained with different algorithms, we use the evaluation results for the same motorway section on the Slovenian motorways. We use the period from 1 May 2024 until 5 May 2025 for evaluation.

We use the Scikit-learn library [7] to train the linear regression (using the ordinary least squares approach) and SVM (SVR) models, and the XGBoost library [8] to train the XGBoost models. The SVM model is trained using the RBF kernel and with a scaled gamma hyperparameter. In the majority of motorway sections, XGBoost models with a maximum depth of 6 performed the best, which is why we used models with the same hyperparameter value for the following analyses. We use gbtree as the booster, while the learning rate is 0.3. We evaluated the models on a little over 1 year of test data, which was not included in the training or validation part of the process.

Table 1: Model Performance Comparison

timespan   algorithm   MAE    RMSE    MAPE
24 hours   XGB         39.43   62.75  10.5%
24 hours   SVM         42.38   65.86  11.5%
24 hours   lin. reg.   43.14   66.93  11.6%
7 days     XGB         45.66   70.69  11.6%
7 days     SVM         43.70   68.91  12.1%
7 days     lin. reg.   43.51   69.04  12.1%
4 weeks    XGB         57.30   88.56  13.9%
4 weeks    SVM         50.20   77.86  14.1%
4 weeks    lin. reg.   51.33   78.63  14.7%
52 weeks   XGB         88.33  121.93  20.9%
52 weeks   SVM         53.54   84.49  14.9%
52 weeks   lin. reg.   70.46   96.98  21.3%

We continue with the analysis of the model performances as seen in Table 1. If the timespan attribute's value is '7 days', it means that the model predicts 7 days into the future. We use several metrics to describe the performance of the models. These are: MAE (Mean Absolute Error), RMSE (Root Mean Square Error), and MAPE (Mean Absolute Percentage Error). MAPE is a crucial metric as it shows relative errors in percentages, which is key when evaluating the models, as traffic frequency varies significantly throughout different parts of the day.

We can see some interesting performance dynamics of the models. The XGBoost model performs the best for the 24-hour timespan, with a significant performance uplift of at least 1 percentage point in MAPE compared to the other two models. It is also better in the other two metrics, MAE and RMSE.

We continue with the performance analysis of the long-term predictions. For the 7-day timespan, the XGBoost model is still noticeably better than the other two models, with a 0.5 percentage point uplift in performance. For the 4-week timespan, XGBoost still holds a small lead in the key metric (MAPE), whereas the SVM model has significantly better results when considering just the MAE and RMSE metrics. For the 52-week timespan, we can see an interesting dynamic, as the SVM model takes a significant lead in performance: it is the only one with a MAPE value of less than 15%, whereas the MAPE values of the other two models surpass 20%.

This dynamic is likely caused by a reduced set of features, as there are significantly fewer historic traffic count features that can be included when making predictions with a 52-week timespan. It seems this has a significantly negative impact on training the XGBoost model, which is a tree ensemble model, while having additional features available gave the XGBoost model an edge for predictions with a timespan of up to 4 weeks, especially up to 7 days.

Figure 2: Distribution of absolute relative errors by 5% buckets for the XGBoost 7-day timespan model

In Figure 2 we can see how the absolute relative errors are distributed when they are split into 5% absolute relative error buckets. We can see that in 45.5% of the cases, the absolute relative (or percentage) error of the predicted traffic frequency is less than 5% of the actually measured traffic frequency. 21.7% of predictions have a relative error between at least 5 and (excluding) 10 percent, and 11.2% of predictions have a relative error between 10 and 15 percent.

This means that in 78.4% of predictions, the relative error was less than 15%, which can be considered sufficiently good performance for the models to support a sufficiently reliable traffic intensity prediction system.

Figure 3: Mean relative errors by each hour of the day for the XGBoost 7-day timespan model

We continue by analyzing the distribution of mean relative errors by each hour of the day, as seen in Figure 3. We can see that the model generally tends to slightly overestimate, or overshoot, with its predictions, especially during the night-time periods, when there are fewer vehicles on the motorway.

In the mean aggregate, there is less than a 2% mean relative error during the morning rush hours (at 6:00-7:00, 7:00-8:00, and 8:00-9:00). The error is the highest during the 15:00-16:00 period, with more than 13% mean relative error. However, the error is substantially smaller during the other afternoon rush-hour periods, 14:00-15:00, 16:00-17:00, and 17:00-18:00, where it remains under 4%. Apart from the 15:00-16:00 period, the mean relative errors are consistently under 6%. When the model does undershoot, or underestimate, with its prediction, the mean relative error is less than 2%, close to 1%.

We can see a spike of the mean relative error in the 15:00-16:00 period. Upon investigation, it turns out only around 20 vehicles were counted in the data for a specific period, which is unusual for this period and likely a consequence of a traffic accident or some issue with data collection.

We have also conducted an aggregated evaluation of models on 10 various motorway sections, where mean MAPE values were 14%, 15%, 18%, and 20% for the 24-hour, 7-day, 4-week and 52-week timespans respectively. Predictions for sections near the capital city were generally less reliable than others.

4.1 Evaluation Insights

When considering the results of the evaluation of the trained machine learning models for specific motorway sections, we have gathered several key insights.

In some examples, we could not compute all features due to missing values in the data, meaning that certain features had NaN values after computing historic time-series features with Pandas' shift function. In this case there is a strong advantage in having a decision tree ensemble model (e.g. XGBoost) as a backup, even if it is not the best performing model for a certain timespan. This is due to the ability of tree ensemble models to apply only those trees that are covered by features with available values. In this case the predictions are generally less accurate, but possible.

Another key insight is that the evaluation supports our proposed methodology with multiple models to improve the performance of the predictions for each included timespan. Another useful insight is that different algorithms can produce the best models for different timespans on the same motorway section, as was the case with the SVM model in our evaluation.

5 CONCLUSION

We have overviewed the methodology that we use as the foundation for our proposed system for predicting traffic intensities on motorway sections, including the adaptive prediction system and the supporting machine learning models that support making predictions for various timespans to, in time, improve already available predictions for specific time periods in the future. We have also overviewed the evaluation of the trained machine learning models and found some useful insights that support our proposed prediction system.

Compared to related work, the key contributions of our methodology are significantly longer prediction timespans, the inclusion of semantic context, and higher adaptability to data. Based on the presented evaluation results, our methodology produces predictions with sufficient reliability to support long-term decision making of various roles.

For further improvements to the system, we could train and evaluate some deep learning models and models that are based on the transformer architecture, as well as some other time-series forecasting procedures, such as Facebook Prophet. We could also engineer additional semantic context features for further improvements to the performance of the existing models. For additional improvements for shorter timespans, we could also include weather forecast data.

References

[1] Bernardo Gomes, Jose Coelho, Helena Aidos. 2023. A survey on traffic flow prediction and classification. Intelligent Systems with Applications, vol. 20. DOI: https://doi.org/10.1016/j.iswa.2023.200268
[2] Jithin Raj, Hareesh Bahuleyan, Lelitha Devi Vanajakshi. 2016. Application of Data Mining Techniques for Traffic Density Estimation and Prediction. Transportation Research Procedia, vol. 17. DOI: https://doi.org/10.1016/j.trpro.2016.11.102
[3] Yuyu Zhu, QingE Wu, Na Xiao. 2022. Research on highway traffic flow prediction model and decision-making method. Scientific Reports, vol. 12. DOI: https://doi.org/10.1038/s41598-022-24469-y
[4] Carl Goves, Robin North, Ryan Johnston, Graham Fletcher. 2016. Short Term Traffic Prediction on the UK Motorway Network Using Neural Networks. Transportation Research Procedia, vol. 13, 184-195. DOI: https://doi.org/10.1016/j.trpro.2016.05.019
[5] Adriana-Simona Mihaita, Zac Papachatgis, Marian-Andrei Rizoiu. 2020. Graph modelling approaches for motorway traffic flow prediction. 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC). DOI: https://doi.org/10.1109/ITSC45102.2020.9294744
[6] Sayed A. Sayed, Yasser Abdel-Hamid, and Hesham A. Hefny. 2022. Artificial Intelligence-Based Traffic Flow Prediction: A Comprehensive Review. Preprint. DOI: http://dx.doi.org/10.21203/rs.3.rs-1885747/v1
[7] Scikit-learn: https://scikit-learn.org
[8] XGBoost: https://xgboost.ai/

Empowering Youth on Smart Cities with AI Solutions to Community and Urban Challenges Towards SDG 11

Mustafa Zaouini†, Lee Chana, Joao Pita Costa, Davor Orlic, Mihajela Črnko, Yousef Rahmani, Rayan Kassis, Luka Stopar, Ruben Frank, Kim August, Swethal Kumar
AI in Africa (Rabat, Morocco / Johannesburg, South Africa); IRCAI, Quintelligence (Ljubljana, Slovenia); ToumAI (Rabat, Morocco); Solvesall (Ljubljana, Slovenia); EnergyAED (London, UK)
mus@fliptin.io, joao.pitacosta@ircai.org, odin@toum.ai, luka.stopar@solvesall.com, rayan@aed.energy

Sohaib Souss, Wahid Laleeg, Asmae Lamgari, Maroja Zoubir, Ouidad Mochariq, Zahira Elmelsse, Chaimae Fadil, Yassine Bounouader, Hajar Doukhou
SLTVERSE (Casablanca, Morocco); University Mohammed V (UM5) (Rabat, Morocco); ENSA – National School of Applied Sciences (ENSA-M) (Marrakesh, Morocco)
sohaibsoussi@gmail.com, asmaelamgarim@gmail.com, o.mochariq3846@uca.ac.ma

Abstract / Povzetek

Achieving Sustainable Development Goal 11, ensuring cities are inclusive, safe, resilient, and sustainable, remains a pressing global priority. In this pursuit, Artificial Intelligence (AI) has emerged as a transformative driver of urban innovation, enabling policymakers, academic institutions, and industry stakeholders to make data-driven decisions for complex urban systems such as housing, transportation, energy, and infrastructure. Despite its potential, the vast scale, variety, and fragmentation of urban data, coupled with the rapid evolution of AI technologies, create significant challenges in converting SDG 11-related information into practical solutions. This paper reports on the results of the AI4SDG11 programme, which combined expert community building, knowledge exchange, and competitive challenges. The programme brought together 50 students and 30 startups in 15 locations worldwide to develop AI-driven solutions targeting key aspects of urban sustainability. Using diverse machine learning techniques, participants addressed challenges including intelligent mobility systems, efficient waste management, smart and efficient urbanism, and climate-resilient urban planning. Conducted in 2025, this initiative formed part of a youth-focused innovation challenge co-organized by AI in Africa, the International Research Centre on Artificial Intelligence (IRCAI), and GITEX, with the goal of promoting interdisciplinary innovation and strengthening regional AI capacity for sustainable urban development.

Keywords / Ključne besede

Machine learning, text mining, large language models, community engagement, urbanism, mobility, AI competition, AI community

† Corresponding author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
http://doi.org/10.70314/is.2025.sikdd.18

1 Introduction

Established by the United Nations as an essential goal for the forthcoming 2030, Sustainable Development Goal 11 (SDG 11), "Make cities and human settlements inclusive, safe, resilient and sustainable", reflects a critical global commitment to improving urban living conditions amid increasing urbanization, population growth, and environmental stress.
With more than half of the world's population now residing in cities, and projections estimating two-thirds by 2050, the urgency of building sustainable urban environments has never been greater. In this context, AI has emerged as a transformative tool capable of reshaping how cities are planned, managed, and experienced. AI technologies offer powerful capabilities to harness vast amounts of urban data, generate predictive insights, and support evidence-based decision-making. From optimizing public transportation systems to monitoring air quality, improving waste management, and enabling climate-resilient infrastructure, AI is at the forefront of innovative urban solutions worldwide. However, the deployment of AI in support of SDG 11 varies significantly across regions, influenced by differences in digital infrastructure, data availability, institutional capacity, and local priorities [1].

In Africa, AI is increasingly being applied to address urban informality, mobility challenges, and infrastructure gaps. For instance, AI-powered geospatial mapping tools are being used to identify informal settlements in rapidly growing cities such as Nairobi and Lagos, helping governments to improve service delivery and urban planning [2]. In North African cities, machine learning models have been developed to optimize water distribution in drought-prone areas and to improve traffic flow in congested urban corridors. AI is also being tested for predictive waste collection and smart energy use in off-grid communities. These solutions are particularly valuable in regions where resources are limited and where rapid urban growth creates pressure for low-cost, scalable interventions [2].

In Europe, on the other hand, AI applications in cities often focus on enhancing sustainability, efficiency, and citizen engagement. Examples include real-time public transport optimization in cities like Helsinki and Barcelona [3], AI-based air pollution forecasting in Paris [4], and intelligent energy management systems in smart buildings across the Netherlands and Germany [5]. Many European municipalities are also investing in AI-driven participatory governance platforms, enabling data-informed urban policymaking that incorporates citizen feedback [5]. Furthermore, [6] highlights how AI can extract and analyze news media information to enhance knowledge and understanding of water-related extreme events, supporting improved disaster risk reduction.

This paper presents the outcomes of a collaborative youth AI innovation programme, including AI mentorship and challenges aimed at exploring the impact of AI on the SDGs. It builds on the related initiative initiating the programme in 2026 under the focus of Water Sustainability to progress SDG 6 (see [7] and [8]), and refocuses the approach to address SDG 11-related problems through applied machine learning solutions. The initiative brought together 50 students and 20 professors across 10 research institutions in North Africa, as well as 30 AI startups and domain experts worldwide, culminating in 30 projects and initiatives tackling real-world urban challenges. By leveraging AI and data science, these teams addressed issues ranging from urbanism and mobility to waste management and climate resilience, drawing on lessons and methods from both African and European contexts. The competition was co-organized by AI Africa and IRCAI in collaboration with GITEX (short for Gulf Information Technology Exhibition), one of the world's largest technology and innovation events, held annually in Dubai, United Arab Emirates. The event, held in May 2024 [7], served as a model for interdisciplinary, cross-regional collaboration in the pursuit of sustainable urban futures.

2 AI4SDG Programme Methodology

The AI4SDG Programme, spearheaded by IRCAI under the auspices of UNESCO, in collaboration with AI in Africa and GITEX, is a transformative initiative designed to harness artificial intelligence to address the United Nations Sustainable Development Goals (SDGs). With a focus on capacity building, entrepreneurship, and ethical AI deployment, the programme connects technological innovation with global sustainability challenges, particularly in the Global South.

At the core of AI4SDG is a multi-pronged approach integrating certified training, competitive innovation events, and startup acceleration. Launched through global showcases and pitch competitions at major GITEX events across Africa, Asia, Europe, and the Middle East, the initiative provides a dynamic platform for students, researchers, and entrepreneurs to ideate, prototype, and scale AI solutions aligned with specific SDGs. Previous editions have focused on Water Sustainability (SDG 6) and Sustainable Cities and Communities (SDG 11), while the 2026 programme will extend to all 17 SDGs. The key components include:

• Research2Startup Competition: A 4–6 week programme blending AI education, design thinking, and acceleration tracks for startups and university spinouts, culminating in regional and global pitch events.
• Certified AI for SDG Training: Professional certification tracks for corporate teams, startup founders, and SMEs, focusing on topics like large language models, AI governance, ethical data practices, and generative AI applications.
• AI4SDG Lab Accelerator: A 3–6 month cohort-based programme supporting university-originated AI startups through mentorship, technical workshops, and investor networking, culminating in a high-profile Demo Day at GITEX Global.

The programme not only equips participants with practical AI competencies but also facilitates access to global networks, funding opportunities, and collaboration through GITEX's innovation ecosystem. It champions responsible AI development by emphasizing ethics, transparency, and inclusivity, while offering tangible incentives such as certifications, cash prizes, MVP co-development and impactful international exposure through IRCAI and GITEX channels. In doing so, AI4SDG acts as a catalyst for fostering the next generation of AI-driven changemakers committed to creating impactful, scalable solutions for a sustainable future.

Figure 1: Screenshot of the AI engine ToumAI, winner of the AI4SDG11 startup competition at the inaugural edition of GITEX Europe, Berlin, as a prime example of the relevance of languages in the resilience of cities and communities

3 AI-enabled Innovation Advancing SDG 11

The joint IRCAI, AI in Africa and GITEX competition served as a global platform for surfacing innovative AI-driven solutions to SDG 11 challenges, bridging the ideas of PhD researchers in North Africa with the entrepreneurial agility of startups worldwide. Among the standout innovations emerging from the competition were AI-powered geospatial mapping systems for monitoring informal settlements, predictive analytics for optimizing urban transport routes in congestion-prone cities, and machine learning models for forecasting waste generation to improve collection efficiency. Several projects addressed climate resilience, including early-warning systems for urban flooding and AI-assisted tools for assessing heat island effects and guiding green space planning. From energy-efficient building design algorithms to citizen engagement platforms that use natural language processing for policy feedback, the competition highlighted the breadth of AI's potential to make cities more sustainable and inclusive. By uniting academic depth with market-ready solutions, the initiative not only identified promising prototypes but also laid the groundwork for scalable interventions adaptable to diverse urban contexts.

ToumAI. A holistic multilingual AI platform designed to bridge the digital divide in Africa by enabling voice-driven customer experiences in low-resource languages, advancing SDG 11. Built on a compound AI structure that saves computing power compared to foundational LLMs, the system supports speech-to-text, text-to-speech, emotion analysis, churn detection, and predictive insights across African dialects such as Swahili, Amharic, Yoruba, and Darija. By integrating AI-powered voice agents, IVR optimization, and multilingual analytics, ToumAI delivers inclusive, real-time, and cost-effective communication for the telco, banking, and transport sectors (see Figure 1). Its innovation lies in industrializing underrepresented African languages for AI applications, ensuring accessibility for populations historically excluded from the AI revolution.

EnergyAED. An AI-enabled renewable energy storage system that converts electricity into high-temperature heat (up to 800°C) using salt-based thermal bricks, providing 24/7 clean power and heat without combustion. Unlike batteries or diesel, the system delivers up to 24 hours of dispatchable energy at lower cost, using safe, stable, and modular 10 MWh units. Applications include microgrids, telecoms, industrial heat, and desalination, making it particularly suited for regions with unreliable energy supply. By enabling baseload renewable energy, AED Energy strengthens critical infrastructure and advances SDG 11 while reducing dependence on diesel.

SolvesALL Mobility. Delivery district planning and optimization machine learning tools that support smarter urban logistics, impacting the sustainability of cities and communities. Its Postal POI system uses algorithms to automatically design delivery districts, balancing workload, reducing overlap, and minimizing travel time. Leveraging GPS trace analysis, stay-point detection, regression models, and crowdsourced field data, the system learns delivery micro-locations, service times, and accessibility factors (e.g., stairs, obstacles). By integrating these AI-driven insights, SolvesAll enables cost savings, operational efficiency, and improved registry accuracy, demonstrated by expected multimillion-euro annual savings for postal operators, while offering scalability to sectors such as waste management and ATM/vending machine logistics.

SOBEK. A federated AI system for flood resilience that addresses the lack of early-warning systems in rapidly urbanizing African cities. Unlike centralized models, it applies federated learning to collaboratively improve predictions while preserving data privacy and sovereignty. Local nodes train specialized models (LSTMs for weather series, GNNs for hydrological networks, and U-Nets for satellite imagery) using geospatial, meteorological, and historical flood data. Model updates are aggregated with FedAvg and refined through station similarity graphs to capture regional hydrological patterns. Despite challenges of data heterogeneity and low connectivity, Sobek delivers more accurate flood seasonality, year, and magnitude predictions, enabling timely early warnings, urban planning, and disaster resilience across Africa.

Ecoguardians. This initiative introduces an AI-powered system to optimize water-saving advertisements in Morocco, advancing SDG 11 (Sustainable Cities and Communities). By analyzing diverse campaign content (videos, images, text, social media engagement, and survey data), the system identifies what makes ads effective and generates improved variations. It integrates computer vision (CNNs) for visual features, language models (BERT/GPT) for text and sentiment, predictive models (XGBoost/Random Forest) for engagement forecasting, and GANs for generating impactful ad variations. Ethical and data-driven personalization ensures campaigns remain responsible, transparent, and locally relevant. Early prototypes show measurable engagement gains, empowering cities to run evidence-based, AI-enhanced awareness campaigns that strengthen sustainable water use.

Figure 2: Screenshot of the SLTverse engine, winner at the AI stage of GITEX Africa 2025

SLTverse. This smart city solution introduces an AI-powered travel app that supports SDG 11 by enhancing safety, sustainability, and cultural engagement in tourism. At its core is an AI Route Advisor that leverages structured mobility data, spanning cost, CO₂ emissions, safety, time, and distance, to recommend optimal transport options. Beyond mobility, the platform enriches tourism through VR-based storytelling with avatars narrating site histories, and employs metadata-driven personalization supported by visual analytics (route maps, CO₂ vs. cost comparisons, safety heatmaps). Collectively, these AI innovations position the app as a smart city enabler that aligns sustainability, cultural engagement, and traveler well-being.

4 Conclusions and further work

The integration of AI with the SDGs represents a critical frontier in global innovation, particularly as we confront complex challenges in health, education, climate, and urbanization. The AI4SDG programme, as implemented through the collaboration of IRCAI, AI in Africa, and GITEX, demonstrates a strategic and scalable model for aligning technological advancement with sustainable impact. By combining certified training, research-to-startup pathways, and accelerator programmes, AI4SDG empowers diverse stakeholders, from students and researchers to entrepreneurs and SMEs, to develop responsible, ethical and context-sensitive AI solutions across the 17 SDGs.

One of the programme's most significant contributions lies in its ability to bridge the gap between academic research and real-world
This is strengthened by a world application, particularly in the Global South. Through its Retrieval-Augmented Generation (RAG) framework, which global reach and multi-region engagements, AI4SDG not only combines vector search, large language models, and workflow promotes responsible AI development but also facilitates access orchestration to deliver fast, contextual, and multilingual to funding, mentorship, and global markets, thereby amplifying guidance (see screenshot at Figure 2). The system’s AI assistant the reach and effectiveness of AI for social good. However, while adapts to real-time inputs such as weather, safety alerts, and user the AI4SDG11 programme has laid a robust foundation, several preferences, ensuring tailored and secure travel 47 avenues remain open for further development, now open to all world deployment, questions of ethical oversight, data SDGs. Future work should focus on: governance, and accountability become increasingly complex— particularly in cross-border collaborations. Addressing these • challenges will be essential to ensure that the AI4SDG initiative Longitudinal impact assessments to evaluate the sustainability and real-world outcomes of AI solutions not only inspires innovation but also establishes durable, emerging from the programme. ethically grounded impact at scale. • Expanded participation across underrepresented regions and communities, ensuring equitable access to AI training Acknowledgments / Zahvala and opportunities. This research was partially funded by the European • Commission’s Horizon research and innovation program under Integration of emerging technologies , such as neurosymbolic AI, edge AI, and federated learning, into grant agreement 820985 (NAIADES) and 101120237 (ELIAS). training tracks and solution design. • References / Literatura Stronger policy linkages to influence national and international AI governance frameworks through insights [1] Gupta, S. 
and Degbelo, A., (2023) An empirical analysis of AI contributions to sustainable cities (SDG 11). In The ethics of artificial derived from grassroots innovation. intelligence for the sustainable development goals (pp. 461-484). Cham: • Springer International Publishing. Enhanced data infrastructure , including open datasets [2] Mhlanga, David, and Deo Shao (2025). AI-optimized urban resource aligned with the SDGs, to support more accurate, inclusive, management for sustainable smart cities. In Financial inclusion and and transparent AI development. sustainable development in sub-saharan Africa, pp. 96-116. Routledge. [3] Mohsen, B. M. (2024). AI-driven optimization of urban logistics in smart cities: Integrating autonomous vehicles and iot for efficient delivery The AI4SDG programme highlights the transformative systems. Sustainability, 16(24), 11265. potential of AI when it is purposefully directed toward [4] Petry, Lisanne, et al. (2021) Design and results of an AI-based forecasting of air pollutants for smart cities. ISPRS Annals of the sustainable development. As the initiative expands and evolves, Photogrammetry, Remote Sensing and Spatial Inform. Sciences 8: 89-96. it will be crucial to maintain a balance between innovation, ethics, [5] Aguilar, J., et al. (2021) A systematic literature review on the use of artificial intelligence in energy self-management in smart buildings . and inclusivity—ensuring that AI becomes not just a tool for Renewable and Sustainable Energy Reviews 151: 111530. growth, but a vehicle for equitable and sustainable global [6] Pita Costa J., Rei L., Bezak N., Mikoš M., Massri M.B., Novalija I. and progress. At the same time, it is also important to acknowledge Leban, G. (2024) Towards improved knowledge about water-related extremes based on news media information captured using artificial the programme’s inherent challenges and limitations. Sustaining intelligence. 
International Journal of Disaster Risk Reduction, 100, long-term participation from diverse stakeholders requires p.104172. [7] Mustafa Zaouini, Joao Pita Costa, Manal Cherkaoui, Hanaa Hachimi, M. consistent resources, local capacity-building, and incentives that Wahib Abkari, Kamal Gourari, Hatim Lachheb and Jad Tounsi El extend beyond initial pilot enthusiasm. Scaling successful pilots Azzoiani (2024) Addressing Water Sustainability Challenges in North into broader, systemic solutions often encounters barriers such as Africa with Artificial Intelligence In Proceedings of SIKDD /24. [8] IRCAI (2024) IRCAI Partners with AI in Africa for the AI 4 Water fragmented policy environments, limited infrastructure in low- Sustainability Challenge. Available at: https://ircai.org/inircai-partners- resource settings, and uneven access to funding. Moreover, as AI with-ai-in-africa/ solutions transition from competitive innovation contexts to real- 48 Automated First-Reply Generation for IT Support Tickets Using Retrieval-Augmented Generation and Multi-Modal Response Synthesis Domen Jeršek Klemen Kenda domenjersek@gmail.com klemen.kenda@ijs.si Jožef Stefan Institute Jožef Stefan Institute Slovenia Slovenia Rok Klančič Matteo Frattini rok.klancic@gmail.com Matteo.Frattini@gft.com Jožef Stefan Institute GFT Italia Slovenia Italy Abstract Traditional automated response systems relied on template- IT support organizations require timely and consistent first re- based approaches and rule-based classification [2], which pro- sponses to incoming support tickets. This paper presents a Re- vided consistent but inflexible responses that failed to capture trieval Augmented Generation system for automatic generation nuanced requirements. Recent advances in natural language of contextually appropriate first replies. 
The approach combines processing have enabled more sophisticated approaches using semantic similarity search with multi-modal response synthesis, transformer architectures [11] and pre-trained models like BERT retrieving similar resolved tickets using sentence embeddings and [1]. Retrieval-based systems identify similar historical cases and FAISS indexing. Response-type detection determines whether adapt previous responses [5], while retrieval-augmented genera- structured templates or personalized conversational replies are tion (RAG) [6] combines parametric knowledge in language mod- most suitable for each request. The system incorporates tempo- els with retrieval from external knowledge bases for knowledge- ral context detection for status updates and employs few-shot intensive tasks. prompting with selected examples to maintain organizational However, retrieval systems may struggle with novel scenarios, communication standards. Evaluation using semantic similarity and purely generative approaches face challenges in maintaining metrics demonstrates the system’s ability to generate replies that organizational consistency. Hybrid approaches attempt to bal- closely match human-written responses across various ticket ance flexibility with reliability [3], while response classification types, providing a practical solution for reducing response times has evolved from traditional feature engineering to transformer- while maintaining quality and consistency. based models [9]. Our research addresses these limitations by developing an Keywords automated first-reply generation system that combines retrieval- augmented generation with multi-modal response synthesis. 
The IT support, retrieval-augmented generation, automated response system distinguishes between different response types, maintains generation, natural language processing, semantic similarity organizational communication standards, and generates contex- tually relevant replies through response-type detection, temporal 1 context awareness, and few-shot prompting with carefully se- Introduction lected examples. IT support organizations face increasing volumes of support tick- ets that require timely and consistent issue resolution, starting from the first response. Manual processing creates bottlenecks 2 Data that delay user support and increases operational costs, while Our dataset consists of 1,847 IT support tickets containing ticket the quality and consistency of first replies varies significantly titles, descriptions, and complete communication logs. Each ticket between support agents, leading to inconsistent user experiences. includes the full conversation history between users and support The primary challenge lies in generating contextually appro- agents, from initial submission through resolution. priate first replies that match organizational communication stan- The dataset exhibits significant diversity in ticket types, in- dards while addressing the specific nature of each support request. cluding software installation requests, access rights management, Support tickets exhibit diverse characteristics: some require struc- hardware support, VPN configuration, employee onboarding and tured template responses with specific form fields, while others offboarding, and system outage reports. Communication logs benefit from personalized conversational replies that acknowl- contain multiple exchanges, requiring careful extraction of first edge the user’s specific situation. replies from the complete conversation history. 
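The reply-extraction step described in the paper (dash separators, formal greetings, parenthetical user IDs, a 50-character minimum) can be sketched in a few lines of Python. The rule set below follows the listed heuristics, but the concrete regular expressions are illustrative assumptions, not the authors' exact implementation:

```python
import re

def clean_first_reply(raw: str, min_len: int = 50):
    """Return a cleaned first reply, or None if too little text remains.

    Mirrors the cleaning heuristics described in the paper; the exact
    regular expressions here are illustrative, not the authors' rules.
    """
    kept = []
    for line in raw.splitlines():
        s = line.strip()
        if re.fullmatch(r"-{5,}", s):                 # separator lines of 5+ dashes
            continue
        if re.match(r"(?i)^dear\b.*,\s*$", s):        # formal greetings ("Dear Name,")
            continue
        if re.search(r"\(\s*ID[:#]?\s*\w+\s*\)", s):  # user lines with parenthetical IDs
            continue
        kept.append(line)
    text = "\n".join(kept).strip("- \t\n")            # leading/trailing dash runs
    return text if len(text) >= min_len else None     # drop replies under 50 chars

raw = ("------\nDear Name,\n"
       "Please reinstall the VPN client and restart your machine; "
       "let us know if the problem persists.\n------")
cleaned = clean_first_reply(raw)
```

Validating a minimum content length at the end, rather than per line, keeps multi-line replies intact while still filtering near-empty responses.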
We developed a specialized extraction algorithm to isolate the initial support agent response from the multi-turn conversation logs. The extraction process identifies timestamp patterns and user information markers to separate individual responses. The cleaning heuristics systematically remove formatting artifacts including: (1) leading and trailing dash sequences, (2) formal greeting patterns like "Dear Name,", (3) separator lines containing five or more consecutive dashes, (4) user identification lines with parenthetical ID patterns, and (5) responses shorter than 50 characters to filter noise. The algorithm ensures only substantial first replies are retained by validating minimum content length.

After preprocessing, 1,466 tickets contained valid first replies suitable for training and evaluation. The first replies range from 50 to 2,000 characters in length, with an average length of 387 characters. Response types include structured template responses (42%) containing form fields and specific requirements, personalized conversational responses (38%) addressing individual user situations, and status update communications (20%) providing incident or outage information. Response types were automatically classified using keyword-based heuristics and regular expression patterns, as described in Section 3.3 (Response Type Detection).

The dataset was split using stratified random sampling with a fixed seed (random_state=42) to ensure reproducibility. Eighty tickets were randomly selected for the test set, representing approximately 5.5% of the processed dataset, with the remaining 1,386 tickets forming the knowledge base for retrieval. The test set maintains proportional representation across all response types: 34 template responses (42.5%), 30 personalized responses (37.5%), and 16 status updates (20%), closely matching the overall dataset distribution. This stratified approach ensures evaluation coverage across diverse ticket categories while preventing data leakage between training and test sets. The sampling was repeated several times to ensure the selected test sets are representative of the entire dataset.

3 Methodology

Our system implements a multi-stage pipeline for automated first-reply generation, combining semantic retrieval, response-type detection, and few-shot prompting. Figure 1 illustrates the complete system architecture, showing the flow from input ticket processing through knowledge base retrieval to final response generation.

Figure 1: System Architecture: The complete RAG pipeline for automated first-reply generation, showing the eight-stage process from ticket input through embedding generation, knowledge base search, enhanced scoring, response type detection, and final reply generation using GPT-4.

3.1 Knowledge Base Construction

We construct a knowledge base from historical tickets using sentence embeddings [8]. Each ticket is represented by title and description embeddings computed using the all-MiniLM-L6-v2 sentence transformer model [12], which provides a compact 384-dimensional representation optimized for semantic similarity tasks. We build separate embeddings for titles and descriptions, plus combined embeddings for comprehensive similarity search, enabling multi-granular matching across different text components.

The embeddings are indexed using FAISS (Facebook AI Similarity Search) [4] for efficient retrieval with approximate nearest neighbor search. We normalize embeddings using L2 normalization and employ inner product similarity for fast retrieval, achieving sub-linear search complexity through hierarchical clustering and inverted file structures. Figure 2 provides a conceptual visualization of how tickets are positioned in the semantic embedding space based on their content similarity.

Figure 2: Semantic Embedding Space: Conceptual visualization of how support tickets are distributed in the high-dimensional embedding space, where semantically similar tickets cluster together, enabling effective retrieval of relevant historical examples.

3.2 Retrieval System

For each incoming ticket, we retrieve similar historical cases using a multi-factor scoring approach that combines semantic similarity with categorical and structural matching. The enhanced retrieval score combines:

• Base semantic similarity (50%) from FAISS cosine similarity using normalized embeddings
• Category match bonus (20%) when ticket types align, using exact string matching
• Title similarity (15%) using dedicated title embeddings with cosine similarity
• Description similarity (10%) using dedicated description embeddings with cosine similarity
• Response quality bonus (5%) based on response structure analysis and content completeness metrics

These weights reflect the relative importance of semantic similarity, categorical alignment, and structural relevance in ensuring that retrieved examples are both contextually appropriate and organizationally consistent. We retrieve a larger candidate set (4× the target number) from the FAISS index and apply this multi-factor re-ranking to select the most relevant examples, ensuring both semantic relevance and categorical appropriateness.

3.3 Response Type Detection

We implement response-type detection using keyword-based heuristics with regular expression patterns to classify responses as template-based, personalized, or status updates. Template responses are identified by structured formatting indicators such as form field markers (e.g., "Field:", "Value:"), bullet point patterns, numbered lists, and specific organizational phrases like "Below you will find the additional form information."

Personalized responses are characterized by conversational elements including direct questions, user-specific acknowledgments (e.g., "Thank you for contacting us"), empathy expressions, and conditional statements. Status updates contain temporal references using datetime patterns, incident identification numbers, system status keywords, and global communication patterns following organizational incident response protocols.

3.4 Few-Shot Prompting

Response generation employs few-shot prompting with GPT-4 [7], using retrieved examples to guide generation through in-context learning. We construct structured prompts that include:

• Current ticket information (title, description, detected response type).
• 4–5 most relevant historical examples with their corresponding responses.
• Response type-specific instructions (template vs. personalized formatting).
• Organizational communication guidelines and tone specifications.

Template responses receive strict formatting instructions with explicit field markers and structural constraints to maintain exact organizational formatting, while personalized responses are guided toward a conversational but professional tone with specific phrase patterns and acknowledgment structures.

3.5 Temporal Context Detection

We implement temporal context detection using compiled regular expressions to identify tickets related to system outages, status updates, or global communications. The detection system uses pattern matching for temporal indicators (e.g., "since", "until", "during"), incident terminology ("outage", "maintenance", "downtime"), and organizational communication markers ("all users", "system-wide", "scheduled maintenance"). Detected temporal contexts trigger specialized status update generation that mirrors organizational incident communication patterns, including severity levels, expected resolution times, and escalation procedures.

4 Results

We evaluate our system using semantic similarity metrics and response quality assessments across 80 test tickets representing diverse support scenarios.

4.1 Similarity Metrics

We employ two sentence transformer models for comprehensive evaluation [8]:

• all-MiniLM-L6-v2 [12]: Lightweight 384-dimensional model optimized for general semantic similarity with 22.7M parameters
• all-mpnet-base-v2 [10]: Higher-capacity 768-dimensional model with 109M parameters for nuanced similarity assessment using masked and permuted pre-training

The selection of these two models provides complementary evaluation perspectives. all-MiniLM-L6-v2 serves as the primary embedding model in our RAG system due to its computational efficiency and proven effectiveness in semantic similarity tasks, making it suitable for real-time ticket processing. all-mpnet-base-v2 offers higher representational capacity through its bidirectional encoder architecture and serves as a more sophisticated evaluation metric, providing additional validation of semantic coherence through its enhanced understanding of contextual relationships and nuanced text representations.

Our system achieves an average MiniLM similarity of 0.7841 and an MPNet similarity of 0.8048 between generated and expected responses. These scores indicate strong semantic alignment with human-written replies, confirmed through cross-validation analysis showing confidence intervals within a 3% range (±2.9% for MiniLM similarity). Figure 3 shows the performance variation across different test tickets, demonstrating consistent quality across diverse support scenarios.

Figure 3: Individual Ticket Performance: Semantic similarity scores (MiniLM) for each test ticket, showing consistent performance across diverse support scenarios with most tickets achieving similarity scores above 0.7.

4.2 Response Quality Analysis

Quality assessment reveals that 55 out of 80 generated responses (68.8%) achieve similarity scores above 0.7, indicating high semantic alignment. The system successfully maintains organizational communication standards while addressing specific user requirements. Figure 4 illustrates the distribution of response quality scores across the evaluation dataset.

Figure 4: Response Quality Distribution: Distribution of semantic similarity scores showing that 68.8% of generated responses achieve scores above 0.7, indicating strong semantic alignment with expected human-written replies.

Template responses demonstrate particularly strong performance, with exact structural matching and appropriate placeholder handling. Personalized responses achieve good contextual relevance while maintaining a professional tone.

4.3 Response Type Distribution

The system correctly identifies response types in 87% of cases, routing requests to appropriate generation strategies. Template detection achieves 90% accuracy, while personalized response detection reaches 85% accuracy. Temporal context detection successfully identifies 100% of status update scenarios on the tested examples, enabling appropriate global communication style responses.

The plot of the length of the generated responses against the expected responses further supports these results. Figure 5 demonstrates that generated responses maintain appropriate length characteristics compared to human-written replies, with strong correlation between generated and expected response lengths.

Figure 5: Response Length Comparison: Scatter plot comparing the length of generated responses versus expected responses, showing strong correlation and indicating that the system generates appropriately sized replies consistent with human writing patterns.

4.4 Error Analysis

Remaining challenges include the handling of highly specialized technical scenarios and tickets requiring complex multi-step procedures. Some responses exhibit placeholder artifacts when exact matching fails, and very short or very long responses occasionally deviate from expected patterns.

The system shows consistent performance across different ticket categories, with minor variations in quality for edge cases involving complex technical requirements or unusual organizational procedures.

5 Conclusion

This paper presents a comprehensive approach to automated first-reply generation for IT support tickets using retrieval-augmented generation and multi-modal response synthesis. Our system successfully combines semantic similarity search, response-type detection, and few-shot prompting to generate contextually appropriate replies that closely match human-written responses.

The evaluation demonstrates strong performance across diverse ticket types, achieving semantic similarity scores of 0.78–0.80 and maintaining organizational communication standards. Cross-validation analysis confirms the stability of these results, with performance metrics varying within a ±3% range, indicating robust and reliable performance across different evaluation scenarios. The system provides a practical solution for reducing response times while ensuring quality and consistency in IT support communications.

Future work will explore improving template handling using instruction-tuned large language models and developing fine-tuned classifiers for more accurate response type detection, enabling more structured and context-aware reply generation.

References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding, 4171–4186. doi:10.18653/v1/N19-1423.
[2] Yixin Diao, Hani Jamjoom, and Zhen-Yu Shae. 2009. Rule-based problem classification in IT service management. In 2009 IEEE International Conference on Services Computing. IEEE, 221–228. doi:10.1109/SCC.2009.31.
[3] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. Retrieval augmented language model pre-training. arXiv preprint arXiv:2002.08909. https://arxiv.org/abs/2002.08909.
[4] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7, 3, 535–547. doi:10.1109/TBDATA.2019.2921572.
[5] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering, 6769–6781. doi:10.18653/v1/2020.emnlp-main.550.
[6] Patrick Lewis et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474. https://arxiv.org/abs/2005.11401.
[7] OpenAI. 2023. GPT-4 technical report. arXiv:2303.08774. https://arxiv.org/abs/2303.08774.
[8] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 3982–3992. doi:10.18653/v1/D19-1410.
[9] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2018. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 61, 65–95. https://arxiv.org/abs/1510.00726.
[10] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. MPNet: Masked and permuted pre-training for language understanding. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 16857–16867. https://proceedings.neurips.cc/paper/2020/hash/c3a690be93aa602ee2dc0ccab5b7b67e-Abstract.html.
[11] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762.
[12] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 5776–5788. https://proceedings.neurips.cc/paper/2020/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.19

A Machine-Learning Approach to Predicting the Pronunciation of Pre-Consonant l in Standard Slovene

Jaka Čibej, jaka.cibej@ff.uni-lj.si
Centre for Language Resources and Technologies & Faculty of Arts, University of Ljubljana; Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

The pronunciation of pre-consonant l in Slovene words (e.g., alge, polž, gledalka) is not easily predictable (/l/, /u̯/, or both) and poses a problem for the otherwise effective rule-based grapheme-to-phoneme conversion. We present a method to discriminate between the various pronunciations of pre-consonant l using machine-learning models trained on vectors of character-level n-gram features from approximately 153,500 manually annotated Slovene words with pre-consonant l from the ILS 1.0 dataset. We achieve an accuracy of 86% (over a majority baseline of 76.53%) and conclude the paper with potential steps for future work.

Keywords

pronunciation, grapheme-to-phoneme conversion, pre-consonant l, pronunciation ambiguity, Slovene

1 Introduction

In languages that are characterized by greater orthographic depth (i.e., a greater discrepancy between the written form and its pronunciation), such as English or French, grapheme-to-phoneme (G2P) conversion requires more sophisticated methods such as neural networks (see e.g. [10] for French and [14] for English). Slovene, on the other hand, features a much more transparent orthography ([15]; [17]). Phonetic transcriptions of Slovene words – with some exceptions, such as acronyms, symbols, numerals, and certain words of foreign origin (e.g. sommelier), including proper nouns (e.g., Johnson; more on this in [3]) – can be very reliably generated using a rule-based approach, especially if taking the accentuated form (e.g., drevó instead of the unaccentuated drevo) as the starting point, as the diacritic disambiguates the position of the accent and the manner of pronunciation of the accentuated vowel grapheme. The Slovene IPA/X-SAMPA G2P Converter¹ achieves an accuracy of approximately 98% (based on an evaluation on a stratified sample of words; see [2]).

However, there are several exceptions (in addition to the ones already mentioned) in which the pronunciation of certain graphemes is much more difficult to predict with rules. We focus on one such problem in this paper: the pronunciation of pre-consonant l in Slovene words. The grapheme l, when preceding a consonant, can be pronounced as either /l/ or /u̯/. In some cases, both variants are acceptable. Examples include words such as alge ('algae', IPA: /ˈaːlgɛ/, but never */ˈaːu̯gɛ/), polž ('snail', IPA: /ˈpɔːu̯ʃ/, but never */ˈpɔːlʃ/), gledalka ('spectator (female)', IPA: /glɛˈdaːu̯ka/ or /glɛˈdaːlka/), and decimalka ('decimal number', IPA: /dɛtsiˈmaːlka/, but never */dɛtsiˈmaːu̯ka/). The reasons for these different pronunciations are historic and etymological in some cases, while in others, the difference cannot be easily explained and has more to do with conventions in language use. The issue of pre-consonant l has been tackled by Slovene linguistics for more than a century (see [4] for a brief overview). Perception tests and small-scale surveys ([16]; [11]) have recently been conducted to collect data for lexicographic resources (such as the Slovenian Normative Guide 8.0²), but empirical data remains scarce: relevant language resources are not machine-readable or openly accessible (as is the case of the Dictionary of Slovenian Literary Language³ or OptiLeX) or contain inconsistent data (e.g., [19]). In this paper, we use the recently published ILS 1.0 dataset ([1]; described in Section 2).

Because the Slovene IPA/X-SAMPA G2P Converter is currently entirely rule-based, all pre-consonant l graphemes are transcribed as /l/, resulting in errors that need manual corrections when compiling language resources. Our goal is to implement a machine-learning approach⁴ to disambiguate between different pronunciations. Increasing the accuracy of the converter is important in the context of the automatic compilation of modern lexicographic resources that can also be used as machine-readable databases for training models (including large language models) and improving speech recognition and speech synthesis for Slovene. We describe the dataset (Section 2), the statistical analysis used for feature selection (Section 3), the results (Section 4), and several steps for future work (Section 5).

¹ The Slovene IPA/X-SAMPA G2P Converter is part of Pregibalnik, a custom tool that was developed for the expansion of the Sloleks Morphological Lexicon of Slovene [5], which is the morphological basis for the Digital Dictionary Database of Slovene [8]. Pregibalnik is available as open-access code at https://github.com/clarinsi/SloInflector and as an API service at https://orodja.cjvt.si/pregibalnik/docs; the Slovene IPA/X-SAMPA G2P Converter is also available as an API at https://orodja.cjvt.si/pregibalnik/g2p/docs.
² Pravopis 8.0 (Slovenian Normative Guide 8.0): https://pravopis8.fran.si/
³ The Dictionary of Slovenian Literary Language (SSKJ) is available at https://fran.si/.
⁴ An attempt at using machine learning for Slovene phonetic transcriptions was made by [9]; however, the method was evaluated on the Sloleks Morphological Lexicon of Slovene 3.0 [5], where the issue of pre-consonant l is still unresolved.

Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.1

2 Dataset

ILS 1.0 ([1]; described in more detail in [4]) is a dataset of approx. 173,400 inflected Slovene word forms (of approx. 6,000 Slovene lexemes) containing a single pre-consonant l grapheme. Each occurrence of pre-consonant l was annotated for its pronunciation by 5 linguists (2 annotations per occurrence). The word forms were extracted from the manually validated lexemes of Sloleks 3.0 [5], the largest open-access dataset with machine-readable morphosyntactic information on Slovene words. Table 1 shows the distribution of word forms by agreement: in 89% of word

Table 1: Word forms in ILS 1.0 by agreement.

Pronunciation    Number of Forms    %
/l/              117,459            67.73
/u̯/              23,884             13.77
Both             12,160             7.01
Both | /l/       11,205             6.46
Both | /u̯/       7,051              4.07
/l/ | /u̯/        1,660              0.96
Total            173,419            100.00

Table 2: Contingency table for the general n-gram "c" when following a pre-consonant l.

Presence    /l/        /u̯/       /l/+/u̯/
Yes         2,653      1,847     5,980
No          114,898    22,045    6,180

Table 3: A sample of statistically significant general character-level n-grams.
𝑛 𝜒 𝑟-Gram p V Category 2 |𝑚𝑎𝑥 | c 38,199.59 **** 0.499 178.81, / /, No post-ll n 29,081.52 **** 0.435 79.27, / /, No post-ll ce 16,003.46 **** 0.323 118.30, / /, No post-ll o 77,025.17 **** 0.708 227.83, / /, No pre-ll po 48,241.29 **** 0.560 193.98, / /, No pre-ll a 16,592.50 **** 0.329 -79.85, / /, No pre-ll We extract a total of 8,082 different general 𝑛-grams (consist- ing of actual graphemes; 3,041 in pre-position, 5,541 in post-ll Figure 1: Extraction of character-level 𝑛-gram features for position), 116 different robust C+V 𝑛-grams (65 pre- and 51 post- the pre-consonant l in the word gledalka. l), and 603 different finegrained C+V 𝑛-grams (262 pre- and 341 post-). For each 𝑛-gram, we compile a contingency table. For l instance, Table 2 shows the occurrences of the general 𝑛-gram c forms (highlighted in gray), the annotators agree on the pronun- in the position directly following a pre-consonant (e.g., , l moril c a ciation of pre-consonant . They disagree in 11% of the examples, l ‘murderer’, masculine common noun, genitive singular form) with one annotator allowing for both pronunciation variants depending on the pronunciation of the pre-consonant . l and the other allowing for only one pronunciation. Complete In order to determine statistically significant features that help disagreement is present only in less than 1% of the examples. discriminate between different pronunciations, we performed a We use the 153,503 forms with complete agreement as training 2 series of Pearson’s 𝜒 tests [12] and corrected for family-wise data for machine-learning models as described in the following error rate with the Holm-Bonferroni method [7]. We calculated sections. It should be noted, however, that while is the ILS 1.0 7 largest open-access dataset on pre-consonant pronunciations, it l Cramér’s V [6] as the measure of effect size. 
This resulted in a total of 4,263 statistically significant features (1,856 pre-general l is not completely representative of language use in general (with and 1,794 post-general 𝑛-grams; 60 pre-and 40 post-robust l l l annotations by only 5 linguists with a background in translation C+V 𝑛-grams; 242 pre-and 271 post-finegrained C+V 𝑛-grams). l l and Slovene studies; these can be biased towards linguistic rules Several statistically significant pre-general 𝑛-grams are shown l that might not reflect real language use). Despite this, the dataset 8 2 in Table 3. The table shows the values of the 𝜒 statistic and is robust enough to help disambiguate the more obvious examples Cramér’s V, the p-value representations, the maximum absolute (such as , IPA: / /, and , IPA: / /). alge "a:lgE polž "pO:u S “ value of Pearson’s residuals (and its position in the contingency 3 table), and the category of the 𝑛-gram (post-or pre-). With the l l Feature Selection exception of the 𝑛-gram, which is more indicative of the / / a l To some extent, the pronunciation of pre-consonant depends on l pronunciation, the others indicate one of the other two options 5 the preceding and subsequent graphemes, so we use character- u l u (/ /; or / /+/ /). The results also confirm the statement found in level 𝑛-grams as features for prediction. For each pre-consonant “ “ Slovenian Normative Guide 8.0 o l the that the grapheme in pre- l in each word form, we identify the 𝑛-grams (1 ≤ 𝑛 ≤ 5) in its position is strongly indicative of the /u/ pronunciation. direct left/right surroundings as shown in Figure 1 (see footnote “ 6). We include word boundary markers (#) to discriminate be- 4 Prediction and Evaluation tween word-initial and word-final 𝑛-grams. We also perform the We compiled a custom vectorizer based on the identified fea- same extraction on robust and finegrained C+V representations tures. The vectorizer scans each input word form (along with 6 of each word form. 
4 Prediction and Evaluation
We compiled a custom vectorizer based on the identified features. The vectorizer scans each input word form (along with its Multext-East v6 morphosyntactic tag⁹) for all occurrences of pre-consonant l, extracts the surrounding n-grams, converts the morphosyntactic tag into 146 morphosyntactic features, and represents the occurrence as a 4,409-dimensional vector of {0,1} values (with 0 and 1 indicating the absence or presence, respectively, of the n-gram in the direct surroundings of the pre-consonant l). We compile a total of 153,503 vectors in this way and use the Python library scikit-learn [13] to train several models for a classification task with three classes: the goal is to correctly predict whether a pre-consonant l is pronounced as /l/, /u̯/, or both.

4.1 Automatic Evaluation
We trained three different models: a Linear Support Vector Classifier (LinearSVC), a Multinomial Naïve Bayes Classifier (Multin. NB), and a k Nearest Neighbors Classifier (kNN), and evaluated their performance with a 10-fold cross-validation (with a stratified random test set of word forms). The results are shown in Table 4.¹⁰ The worst performing model is the Multin. NB classifier, which barely achieves an above-baseline accuracy and a very low F1-score compared to the other two classifiers, although its recall is much higher. In terms of balanced accuracy and F1-score, the best model is the kNN classifier. However, it seems that the algorithm is not the most suited for this type of data. It performs similarly to the LinearSVC classifier, but if we compare the sizes of the resulting models, it becomes apparent that the LinearSVC model is much more efficient (with a size of approximately 100 kB) compared to the kNN model, which is overly inflated (with a size of more than 2 GB), possibly indicating overfitting.¹¹

Table 4: Model performance based on 10-fold cross-validation.

  Model           A     BA      P      R     F1
  LinearSVC   86.08  72.39  69.26  55.39  61.54
  Multin. NB  77.29  69.54  33.33  81.84  47.36
  kNN (k=5)   85.91  73.30  64.11  62.98  63.53
  Majority    76.53      –      –      –      –

Because the LinearSVC model is the most viable, we analyze its performance in more detail. Table 5 shows the confusion matrix for the classifications of the LinearSVC model on a stratified test set (20% of the total 153,503 dataset instances). The model seems to lean more towards the most frequent category (/l/) in its predictions, with approximately 30% of /u̯/ and /l/+/u̯/ instances being misclassified as /l/, whereas 94% of the /l/ instances are classified correctly. It seems that instances allowing both pronunciations are very rarely misclassified as /u̯/ (only 1%). It should also be noted that the instances of /l/+/u̯/ misclassified as either /l/ or /u̯/ are not entirely incorrect, just incomplete. Compared to the rule-based approach (which classifies everything as /l/), the model performs quite well in terms of /l/+/u̯/ and /u̯/ instances and sacrifices only 6% of its accuracy for /l/ instances. In order to determine any future improvements to the model, we analyze some of the misclassified examples in more detail in Section 4.2.

Table 5: Confusion matrix for the Linear Support Vector classifier.

  Predicted ↓ \ True →      /l/     /u̯/   /l/+/u̯/        Σ
  /l/                    22,006   1,495       729    24,230
  /u̯/                     1,071   2,764        31     3,866
  /l/+/u̯/                   434     519     1,672     2,625
  Σ                      23,511   4,778     2,432    30,721

4.2 Manual Evaluation
We performed a manual analysis of the misclassified examples to determine whether there are any patterns to the errors that could help further improve the model with additional features. Due to space limitations, we only focus on the most obvious problems in this paper.

In the examples where the /l/ pronunciation was misclassified as /u̯/, many words contain a pre-consonant l followed by the grapheme d (kaldera 'caldera', buldožerski 'pertaining to a bulldozer', heraldičen 'heraldic', bodibilder 'bodybuilder'). The majority of these examples are pronounced with /l/, with the exception of words like dopoldne 'late morning' and popoldanski 'pertaining to the afternoon', where the pre-consonant l is preceded by an o grapheme. This could indicate that an additional n-gram feature should be added (the l along with its preceding and subsequent graphemes: old, ald, etc.). This could resolve some other misclassifications, such as impulziven 'impulsive' and pulzirajoč 'pulsating', where words with the combination ulz are never pronounced as /u̯/, but words with olz are (e.g., polzeti 'to slip'). The emergence of such patterns in the misclassifications is a good sign that the classifiers might benefit from a joint pre-/post-l feature. This will be explored in future versions.

Many of the instances in which the /u̯/ was misclassified as /l/ contain compound words with the element pol- 'semi, half': polnag 'half-naked', polfinale 'semi-final', polpuščava 'semi-desert'. Because the element pol is always pronounced with /u̯/, this is also true of derived compound words. However, the n-gram features used offer no indication of morpheme boundaries, so these misclassifications can be expected.

Additional n-gram features could be extracted from the accentuated forms of words. In some examples, the accentuation diacritic can disambiguate the pronunciation of the subsequent pre-consonant l. For instance, dólnji 'pertaining to something that is downwards or downstream' and prestólničen are pronounced with /l/, whereas tôlšča 'blubber' and pôlhográjski 'pertaining to the town of Polhov Gradec' are pronounced with /u̯/. However, accentuation is rarely written in Slovene and is much more difficult to assign automatically compared to morphosyntactic features. Relying on too many features that are not easily extractable would make the model less robust (more on this in Section 5).

⁵ The Slovenian Normative Guide 8.0 (Pravopis 8.0, see https://pravopis8.fran.si), for instance, states that a pre-consonant l preceded by the grapheme o is often characterized by the /u̯/ pronunciation; this is true of words that historically used the syllabic l (e.g. polh, IPA: /"pO:u̯x/ 'dormouse'; volk, IPA: /"vO:u̯k/ 'wolf'). However, there are exceptions, as not all ol n-grams originate from the syllabic l (e.g., polkovnik, IPA: /pOl"kO:u̯nik/ 'colonel'; voltaža, IPA: /vOl"ta:Za/ 'voltage').
⁶ In the robust C+V form, all consonant graphemes are substituted with C and all vowel graphemes with V. In the finegrained C+V form, consonant graphemes were generalized into more finegrained categories, e.g. graphemes denoting Slovene sonorants (M), voiced (G) and voiceless obstruents (K), foreign consonants (X), etc.
⁷ We calculate Cramér's V as √(χ² / (N · (d_min − 1))), where χ² is the Pearson's χ² statistic, N is the total sample size, and d_min is the minimum dimension of the contingency table.
⁸ For all tests, the degrees of freedom (df) were equal to 2 and the total sample size (N) was equal to 153,603. The p-values should be interpreted in the following manner: **** → p ≤ 0.0001; *** → p ≤ 0.001; ** → p ≤ 0.01; * → p < 0.05.
⁹ Multext-East v6 morphosyntactic specifications: https://nl.ijs.si/ME/V6/msd/html/msd-sl.html
¹⁰ A, BA, P, R, and F1 refer to accuracy, balanced accuracy, macro-precision, macro-recall and macro-F1, respectively.
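As a worked check of the feature-selection statistics from Section 3, the following minimal script recomputes the χ² statistic, Cramér's V, and the maximum Pearson residual for the post-l n-gram c from the counts in Table 2. It assumes Cramér's V uses the divisor N · (d_min − 1), which reproduces the values reported in Table 3; the variable names are illustrative, not from the paper.

```python
from math import sqrt

# Contingency table for the post-l n-gram "c" (Table 2).
observed = [
    [2_653, 1_847, 5_980],     # n-gram present  (/l/, /u/, /l/+/u/)
    [114_898, 22_045, 6_180],  # n-gram absent
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)  # 153,603 word forms

chi2 = 0.0
max_residual = 0.0
for r, row in enumerate(observed):
    for c, obs in enumerate(row):
        expected = row_totals[r] * col_totals[c] / n
        chi2 += (obs - expected) ** 2 / expected
        residual = (obs - expected) / sqrt(expected)  # Pearson residual
        if abs(residual) > abs(max_residual):
            max_residual = residual

d_min = min(len(observed), len(observed[0]))  # 2 for a 2x3 table
cramers_v = sqrt(chi2 / (n * (d_min - 1)))    # effect size
```

Running this yields χ² ≈ 38,199.59, V ≈ 0.499, and a maximum residual of ≈ 178.81 (in the cell where the n-gram is present and the pronunciation is /l/+/u̯/), matching the first row of Table 3.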
5 Conclusion
We presented a machine-learning approach to improve the accuracy of phonetic transcriptions of Slovene words that contain the ambiguous pre-consonant l. While the method does improve accuracy (86% over a majority baseline of cca. 76%) by using very simple character-level n-gram and morphosyntactic features, it does not resolve the problem entirely. Aside from several exceptions in language use which are difficult to predict (e.g. gasilci, čistilka; both pronounced with /l/ even though the majority of words ending with -ilec and -ilka in the dataset can be pronounced with either /l/ or /u̯/), the analysis of misclassified examples has shown several potential future steps that can be implemented to further improve the performance of the models. First, several additional features should be tested. Some of the features are simple, such as word length or the number of syllables in the word (which could potentially help to correctly classify words such as volk and polh; short words where the pre-consonant l is pronounced as /u̯/). The relative position of the pre-consonant l in the word could also potentially be helpful. Several more complex features could also be added, such as word-formation relations and morpheme boundaries to help disambiguate, for instance, decimal-ka 'decimal number', which is derived from the adjective decimalen 'pertaining to decimal numbers' and is pronounced with /l/, and mor-ilka 'murderer (feminine)', which is derived from the verb moriti 'to murder' and can be pronounced with either /l/ or /u̯/. Taking into account the accentuated form of the word could also help: for instance, the accentuation ôl – vôlk 'wolf', pôlh 'dormouse' – indicates the /u̯/ pronunciation, while the accentuation ól is indicative of the /l/ pronunciation (e.g. pólka 'polka'). However, more complex features cannot be extracted from the word form itself, so making the model too heavily reliant on external linguistic knowledge would sacrifice its robustness and usefulness for unseen words. We will explore these options in our future work, but we will first focus on the simplest features to determine the upper boundary of accuracy that can be achieved based solely on the word form and its morphosyntactic features. We will perform additional statistical analyses on n-grams containing the pre-consonant l as well, and once the optimal model is achieved, it will also be evaluated on previously unseen words containing the pre-consonant l that have not been included in the ILS 1.0 dataset. The results will hopefully also provide more interesting material for further linguistic analyses (such as exceptions to the rules).

As already mentioned, the ILS 1.0 dataset does not necessarily accurately reflect the linguistic landscape of pre-consonant l pronunciation in Slovene words, and more annotations along with perceptive tests and surveys are required. The pronunciations will be manually validated as part of the work on the Digital Dictionary Database of Slovene [8], the largest machine-readable open-access database of Slovene linguistic and lexicographic data. The pronunciations will also be cross-referenced with the recordings from the GOS Corpus of Spoken Slovene [18], which contains real recordings of Slovene speech and can contribute towards a more accurate distribution of different pronunciations for individual lexemes (e.g., how many occurrences of /glE"da:u̯ka/ or /glE"da:lka/), along with any potential relevant metadata (for instance, whether the pronunciation depends on the region the speaker originates from). The models can then be re-trained on new data and further improved to better reflect real language use.

The models will be implemented into the Slovene IPA/X-SAMPA Grapheme-to-Phoneme Converter as part of the Pregibalnik tool for automatic Slovene lexicon expansion, which is available under a Creative Commons BY-SA 4.0 license.¹²

¹¹ We also ran a 10-fold cross-validation using only n-gram features (no morphosyntax). The performance of the models was slightly worse, e.g. for LinearSVC: A = 85.05, BA = 69.14, P = 68.94, R = 46.85, F1 = 55.76.
¹² The best-performing LinearSVC model (and the accompanying code) for the prediction of pre-consonant l pronunciation is available on GitHub: https://github.com/jakacibej/sikdd2025_predicting_preconsonant_l

Acknowledgements
The research presented in this paper was carried out within the research project Basic Research for the Development of Spoken Language Resources and Speech Technologies for the Slovenian Language (J7-4642), the research programme Language Resources and Technologies for Slovene (P6-0411), and the CLARIN.SI Research Infrastructure (I0-E004), all funded by the Slovenian Research and Innovation Agency (ARIS). The author also thanks the anonymous reviewers for their constructive comments.

References
[1] Jaka Čibej. 2024. Dataset of annotated Slovene words with pre-consonant l ILS 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/2025.
[2] Jaka Čibej. 2023. Leksikon besednih oblik Sloleks. Poročilo projekta Razvoj slovenščine v digitalnem okolju, aktivnost DS1.3. Development of Slovene in a Digital Environment. https://www.cjvt.si/rsdo/wp-content/uploads/sites/18/2023/06/RSDO_Kazalnik_Sloleks_v2.pdf.
[3] Jaka Čibej. 2024. Predicting pronunciation types in the Sloleks Morphological Lexicon of Slovene. In Data Mining and Data Warehouses (SiKDD): Proceedings of the 27th International Multiconference Information Society (IS) 2024 – Volume C. Institut »Jožef Stefan«, 23–26. https://is.ijs.si/wp-content/uploads/2024/11/IS2024_Volume-C.pdf.
[4] Jaka Čibej. 2025. Statistična analiza izgovora črke l v slovenskem oblikoslovnem leksikonu Sloleks. Jezikoslovni zapiski, 31, 1 (May 2025), 37–54. doi:10.3986/JZ.31.1.03.
[5] Jaka Čibej et al. 2022. Morphological lexicon Sloleks 3.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1745.
[6] Harald Cramér. 1946. Mathematical Methods of Statistics. Princeton Mathematical Series, Vol. 9. Princeton University Press.
[7] Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 2, 65–70.
[8] Iztok Kosem, Simon Krek, and Polona Gantar. 2021. Semantic data should no longer exist in isolation: the Digital Dictionary Database of Slovenian. In 9th EURALEX International Congress "Lexicography for Inclusion", 81–83. https://elex.is/wp-content/uploads/2021/09/Semantic-Data-should-no-longer-exist-in-isolation-the-Digital-Dictionary-Database-of-Slovenian_Kosem-Krek-Gantar_EURALEX2020.pdf.
[9] Janez Križaj, Simon Dobrišek, Aleš Mihelič, and Jerneja Žganec Gros. 2022. Uporaba postopkov strojnega učenja pri samodejni slovenski grafemsko-fonemski pretvorbi. In Jezikovne tehnologije in digitalna humanistika: zbornik konference 2022. Inštitut za novejšo zgodovino, 248–251. https://nl.ijs.si/jtdh22/pdf/JTDH2022_Proceedings.pdf.
[10] Xavier Marjou. 2021. GIPFA: generating IPA pronunciation from audio. In eLex 2021 Conference Proceedings, 588–597. https://elex.link/elex2021/wp-content/uploads/2021/08/eLex_2021_38_pp588-597.pdf.
[11] Tanja Mirtič. 2019. Glasoslovne raziskave pri pripravi splošnega razlagalnega slovarja. In Slovenski javni govor in jezikovno-kulturna (samo)zavest. Znanstvena založba Filozofske fakultete, 81–90. https://centerslo.si/wp-content/uploads/2019/10/Obdobja-38_Mirtic.pdf.
[12] Karl Pearson. 1900. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50, 302, 157–175. doi:10.1080/14786440009463897.
[13] F. Pedregosa et al. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[14] Uwe Reichel, Hartmut R. Pfitzinger, and Horst-Udo Hain. 2008. English grapheme-to-phoneme conversion and evaluation. In Speech and Language Technology 11, 159–166. https://www.phonetik.uni-muenchen.de/~reichelu/publications/ReichelPfitzingerHainSASR2008.pdf.
[15] Anja Schüppert, Wilbert Heeringa, Jelena Golubovic, and Charlotte Gooskens. 2017. Write as you speak? A cross-linguistic investigation of orthographic transparency in 16 Germanic, Romance and Slavic languages. In From Semantics to Dialectometry, 303–313. ISBN: 9781848902305.
[16] Hotimir Tivadar. 2004. Priprava, izvedba in pomen perceptivnih testov za fonetično-fonološke raziskave (na primeru analize fonoloških parov). Jezik in slovstvo, 49, 2, 17–36. https://ojs.zrc-sazu.si/jz/article/view/14222.
[17] Antal van den Bosch, Alain Content, Walter Daelemans, and Beatrice de Gelder. 1994. Analysing orthographic depth of different languages using data-oriented algorithms. In Proceedings of the 2nd International Conference on Quantitative Linguistics.
[18] Darinka Verdonik et al. 2023. Spoken corpus GOS 2.1 (transcriptions). Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1863.
[19] Jerneja Žganec Gros, Tanja Mirtič, Miroslav Romih, and Kozma Ahačič. 2022. Slovar izgovarjav OptiLEX (1. e-izd.). Založba ZRC. ISBN: 978-961-05-0672-0. https://doi.org/10.3986/9789610506720.
l om/jakacibej/sikdd2025_predicting_preconsonant_l 56 Sequencing News Articles with Large Language Models within Enterprise Risk Management Context Žiga Debeljak† Dunja Mladenić Klemen Kenda Jožef Stefan International Department for Artificial Department for Artificial Postgraduate School Intelligence, Intelligence, Ljubljana, Slovenia Jožef Stefan Institute Jožef Stefan Institute ziga.debeljak@mps.si Ljubljana, Slovenia Ljubljana, Slovenia dunja.mladenic@ijs.si klemen.kenda@ijs.si Abstract risks, especially within risk scenario analysis [10, 11]. The capability to build structured timelines from unstructured textual (LLMs) to reconstruct event timelines from unstructured news information is therefore of high relevance to ERM. This paper evaluates the capability of Large Language Models data. This capability is highly relevant for Enterprise Risk LLMs are increasingly utilized in ERM for their ability to Management (ERM) applications, where the reconstruction and process and analyze unstructured textual data, including news forecasting of coherent event trajectories are crucial for articles, to identify and assess risks [1, 2, 3, 4, 5]. In the financial identifying, assessing, and predicting emerging risks and sector, applications include extracting sentiment from news to analyzing risk scenarios. In this study, we tasked twenty LLMs gauge market perception or identify reputational risks [3, 6, 7, 8], with chronologically ordering randomly shuffled business news and identifying specific risk factors or events discussed in news simple date sorting, all explicit date markers were removed from demonstrates LLMs' utility in analyzing individual or aggregated the articles. The experiments were conducted under one news items for tasks such as sentiment analysis, risk factor articles for three distinct real-world event chains. To prevent and corporate disclosures [2, 4, 5, 9]. 
Existing literature mainly with hints for the first, the last, or both the first and the last identification, or event detection, but the capabilities of the unassisted and three assisted scenarios that provided the models articles in the sequence. The results reveal a systematic variation models to recover the temporal order and causal links among a in difficulty across the three tasks in addition to significant sequence of discrete news items that describe an unfolding performance disparities among the models, with Grok 4 (xAI), narrative are less directly explored. This paper aims to address GPT-5, o3 and o3-pro (all three OpenAI), and Gemini 2.5 Pro this gap by investigating LLM performance in temporal-causal (Google) consistently outperforming other models practically reasoning within news streams, a crucial aspect for across all tasks and prompting scenarios. As expected, prompting understanding the dynamics of unfolding risk narratives. assistance with additional information systematically improved By investigating whether state-of-the-art commercial or open- accuracy, especially for the models that performed poorly in the source LLMs can reconstruct the chronological narrative of unassisted scenario. The high level of accuracy achieved by the business-event chains from unordered news articles, this paper ERM applications. 
contributes to the field by: (a) systematically evaluating the top-performing models indicates a practical utility for real-world performance of multiple LLMs on a challenging temporal- Keywords reasoning task; (b) analysing the efficacy of diverse prompting strategies — both unassisted and assisted — in improving model Large Language Models, News-Stream Sequencing, Temporal accuracy; (c) providing insights into model-and-task dynamics, Reasoning revealing substantial performance disparities, task-specific 1 INTRODUCTION from contextual hints; and (d) demonstrating the practical difficulty patterns, and the outsized gains weaker models receive Within Enterprise Risk Management (ERM) practice, readiness of these technologies for ERM deployment. organizations monitor external developments also by analyzing streams of publicly available news. Each news article captures a 2 RESEARCH METHOD momentary state of the political-economic environment, and by accurately structuring unordered information into a Task Definition chronological narrative, organizations can better understand the To evaluate the capabilities of LLMs, three event chains were evolution of events and the relationships that connect them. The constructed, focusing on: (1) Trump's Tariffs and EU reconstruction and forecasting of these event trajectories are [“Task_1”], (2) Gold Prices [“Task_2”], and (3) the Ukraine- important for identifying, assessing, and predicting emerging Russia War [“Task_3”]. These topics were selected due to their significant relevance to the business environment. 
For each topic, ten articles were manually selected from the online editions of two reputable sources of financial and business information, published between March 1st and May 2nd, 2025. For the purpose of LLM processing, the raw text from the selected articles was extracted. To prevent temporal bias, explicit date indicators (such as full dates) were removed, and no two articles shared the same publication date. Subsequently, the articles within each event chain were randomly shuffled, and this fixed random order was then applied to all models within the experiment.

† Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
http://doi.org/10.70314/is.2025.sikdd.4

The primary task for the selected LLMs was to reconstruct the chronological sequence of news articles within the three event chains. This task was evaluated across four experimental scenarios: (1) an unassisted scenario ["Assist_No"], and three assisted scenarios providing the (2) first ["Assist_First"], (3) last ["Assist_Last"], or (4) both the first and last ["Assist_FirstLast"] articles in the sequence.

In the unassisted scenario, the LLMs were required to determine the correct chronological order of the articles without any external information regarding their placement. In the assisted scenarios, the models were provided with hints within the user prompt. Specifically, for the Assist_First and Assist_Last scenarios, the prompt identified the article occupying the initial or final position, respectively. In the Assist_FirstLast scenario, the LLMs were given the identifiers for the articles that correspond to the beginning and end of the chronological sequence.

The required output from the LLMs was a reconstructed timeline of the news articles. For each position in the timeline, the following information was mandated: (i) the article's identification number, (ii) the article's title, (iii) a brief justification for its placement relative to the preceding article, and (iv) a brief justification for its placement relative to the subsequent article. The models were required to provide a structured output in JSON format.

Selected LLMs and Experiment Execution
Twenty different models by eight different providers were selected for this research, based on their expected capabilities with regard to the tasks and their availability. An overview of the selected models is shown in Table 1.

Table 1: Selected LLMs

  #   Provider: Model Name               Context Window (tokens)   Date Created
  1   OpenAI: GPT-4.1                    1,047k                    14.04.2025
  2   OpenAI: o3                         200k                      16.04.2025
  3   OpenAI: o3-pro                     200k                      10.06.2025
  4   OpenAI: gpt-oss-120b               131k                      5.08.2025
  5   OpenAI: GPT-5                      400k                      7.08.2025
  6   Google: Gemini 2.5 Pro Preview     1,048k                    7.05.2025
  7   Google: Gemini 2.5 Flash Preview   1,048k                    20.05.2025
  8   xAI: Grok 3 Beta                   131k                      9.04.2025
  9   xAI: Grok 4                        256k                      9.07.2025
  10  Anthropic: Claude Sonnet 4         200k                      22.05.2025
  11  Anthropic: Claude Opus 4           200k                      22.05.2025
  12  Anthropic: Claude Opus 4.1         200k                      5.08.2025
  13  Meta: Llama 4 Maverick             1,048k                    5.04.2025
  14  Meta: Llama 4 Scout                1,048k                    5.04.2025
  15  Mistral AI: Mistral Medium 3       131k                      7.05.2025
  16  Mistral AI: Mistral Medium 3.1     262k                      13.08.2025
  17  Qwen: QwQ 32B                      131k                      5.03.2025
  18  Qwen: Qwen 2.5 VL 32B Instruct     128k                      24.03.2025
  19  DeepSeek: DeepSeek V3              163k                      24.03.2025
  20  DeepSeek: R1                       128k                      28.05.2025

All models were accessed using the OpenRouter platform via the APIs.
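A single experiment run can be sketched as below. This is a hypothetical reconstruction, not the authors' code: OpenRouter exposes an OpenAI-compatible chat-completions endpoint, and the model slug, prompt wording, and helper name are illustrative assumptions. The sketch only builds the request payload; an actual call would POST it with an API key.

```python
import json

# Assumed OpenAI-compatible chat-completions endpoint on OpenRouter.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_sequencing_request(model, articles, first_hint=None, last_hint=None):
    """Build the JSON payload for one sequencing run (illustrative only)."""
    parts = ["Reconstruct the chronological order of the following news "
             "articles. Return JSON: for each position give the article id, "
             "its title, and brief justifications relative to the preceding "
             "and subsequent articles."]
    parts += [f"[Article {i + 1}]\n{text}" for i, text in enumerate(articles)]
    if first_hint is not None:  # Assist_First / Assist_FirstLast scenarios
        parts.append(f"Hint: Article {first_hint} is first in the sequence.")
    if last_hint is not None:   # Assist_Last / Assist_FirstLast scenarios
        parts.append(f"Hint: Article {last_hint} is last in the sequence.")
    return {
        "model": model,
        # a single one-shot user prompt; no system prompt is used
        "messages": [{"role": "user", "content": "\n\n".join(parts)}],
        "temperature": 0.0,  # deterministic decoding where supported
    }

payload = build_sequencing_request("openai/gpt-4.1", ["text A", "text B"],
                                   first_hint=2)
body = json.dumps(payload)  # serialized request body for the POST
```

Looping this builder over the 3 event chains, 4 scenarios, and 20 model slugs would reproduce the 240 experiment outputs described in the paper.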
For models supporting this parameter, the temperature was set to 0.0 to ensure the most reliable and reproducible experimental results; otherwise, default parameters were used. There were 12 experiments executed: 3 different event topic chains (tasks) in 4 experimental scenarios (prompts) each, using all 20 LLMs shown in Table 1, thus resulting in 240 results (outputs). Experiments were executed on June 1st, 2025 with the models available on that date, and on August 19th, 2025 with the newer models.

Prompt Engineering

Prompt engineering included manual drafting, testing on different models, and optimization both with LLMs (GPT o3 and Gemini 2.5 Pro) and manually, in several iterations. In the end, an effective user prompt was developed that worked reasonably well for all selected models. The main challenges in the design of the prompts were: (a) stimulating a systematic approach to causal reasoning, which was considered mainly important for the non-reasoning models; (b) ensuring the output consisted of exactly ten distinct articles, with no repetitions or omissions; (c) enforcing the required output JSON schema; and (d) providing concise reasoning for the positioning of the observed articles.

Within the user prompt, the models were explicitly instructed to use the following reasoning principles: (a) inferring sequences of events (how events described in different articles relate to each other over time); (b) causal reasoning (identifying cause-and-effect relationships between the content of different articles); (c) logical story progression (understanding how a narrative or situation typically develops or unfolds); (d) utilizing any implicit time references if available within the articles; and (e) using the models' general knowledge about events.

Prompts with clear instructions about the guidelines for the reasoning process worked better than prompts without such instructions, even with models with strong reasoning capabilities. System prompts were not utilized, as the one-shot user prompt contained all necessary instructions for the models. The full user prompt is available from the authors.

3 EVALUATION AND DISCUSSION

General Evaluation

In terms of the output content, all models demonstrated strong performance in response to a standardized user prompt, successfully producing the requested ordered lists of news articles with all accompanying metadata. From a logical standpoint, the outputs from all models were accurate, presenting ordered lists that included all required supplementary information. Substantial variations in output quality were observed across the different models. This variation was also influenced by the three distinct tasks, which seemed to be of substantially different difficulty, with the first task being the most straightforward and the last presenting the most significant challenge. As anticipated, the implementation of assisted prompting strategies consistently enhanced the accuracy of the outputs for all models across all evaluated tasks.

Regarding the output formatting, the majority of the models adhered to the specified JSON schema. Notable exceptions were the Claude models (models #10, #11 and #12), which occasionally deviated from the requested format by including a short introductory text. In these instances, the textual outputs were programmatically reformatted to conform to the required JSON structure. It is relevant to note that these three models are the only ones in the evaluation that do not natively support the Structured Output functionality, a factor that likely contributed to their formatting inconsistencies.

Performance Metric

To quantify the models' performance on the given tasks, a robust evaluation metric was required. For this purpose, Kendall's rank correlation coefficient ("Kendall's τ", "τ") was selected as the most appropriate measure. Kendall's τ is a non-parametric statistic that measures the ordinal association between two ranked lists. Its methodology is centered on comparing the concordance of all possible pairs of items within the sequences, yielding a score in the interval from -1 (perfect reversal) to +1 (perfect match). The focus on relative, pairwise ordering makes Kendall's τ exceptionally well suited for a chronological sorting task, as the core challenge lies in correctly establishing which event occurred before another, which is precisely what the metric evaluates.

An alternative metric, the sum of absolute Manhattan distances, was also considered but ultimately deemed less suitable. Its primary drawback is its sensitivity to the magnitude of displacement, which can produce misleading evaluations by heavily penalizing single items that are wildly out of place, while potentially under-penalizing a sequence with numerous smaller, local errors that might represent a poorer overall sort.

Performance by Tasks and Scenarios

The performance of each model, quantified by Kendall's τ, is detailed in Tables 2 and 3. Table 2 presents the coefficients organized by task (event chain), averaged across all experimental scenarios (prompts). Table 3, in turn, presents the coefficients organized by experimental scenario, averaged across all the tasks. The ranks in both tables were determined by averaging the performance rankings of all the models across individual tasks and scenarios. They largely correspond to the rankings based on average τ, but discrepancies may arise from variation in the scale and distribution of τ values across experiments.

To contextualize these performance metrics, their relationship to pairwise accuracy is critical: within a 10-item sequence, a Kendall's τ of 0.90, 0.80 or 0.50 indicates that approximately 95%, 90% or 75% of the 45 possible pairs are concordantly ordered, respectively.

Table 2: Average Performance by Tasks (Kendall's τ)

Rank   Model #   Task_1   Task_2   Task_3   Avg. τ
1      9         0.96     0.98     0.70     0.88
2      2         0.94     0.94     0.56     0.81
3      5         0.96     0.99     0.49     0.81
4      3         0.94     0.93     0.52     0.80
5      6         0.94     0.96     0.52     0.81
6      8         0.93     0.79     0.43     0.72
7      12        0.94     0.70     0.41     0.69
8      20        0.83     0.82     0.50     0.72
9      7         0.84     0.89     0.48     0.74
10     11        0.93     0.67     0.36     0.65
Avg. top 5:      0.95     0.96     0.56     0.82
Avg. all 20:     0.85     0.71     0.36     0.64

The aggregated results in Table 2 underscore two principal findings. First, a significant and systematic variation in task difficulty was evident, with Task_1 representing the simplest case and Task_3 the most demanding. This pattern held true for practically all the evaluated models and experimental scenarios. The performance differences indicating different task difficulty were substantial. For Task_1 and the unassisted scenario, the Kendall's τ values for the average, best model, and worst model performance were 0.78, 0.91 and 0.02, respectively. For Task_2, the values were 0.63, 1.00 and 0.16, and for Task_3, they were 0.02, 0.38 and -0.33. These findings clearly establish Task_3 as the most difficult of the three tasks evaluated. Note that a negative Kendall's τ value indicates an inverse correlation between the predicted and true rankings, and a value around zero represents a random ordering. Second, the results show that the more recent versions and models with strong reasoning capabilities (Grok 4, GPT-5, o3 and o3-pro, and Gemini 2.5 Pro) consistently outperform other models practically across all tasks.

Table 3: Average Performance by Scenarios (Kendall's τ)

Rank   Model #   Assist_No   Assist_First   Assist_Last   Assist_FirstLast   Avg. τ
1      9         0.75        0.88           0.90          0.99               0.88
2      2         0.69        0.88           0.76          0.93               0.81
3      5         0.73        0.84           0.81          0.87               0.81
4      3         0.72        0.87           0.76          0.85               0.80
5      6         0.57        0.93           0.84          0.90               0.81
6      8         0.48        0.81           0.78          0.81               0.72
7      12        0.48        0.66           0.73          0.87               0.69
8      20        0.66        0.75           0.64          0.82               0.72
9      7         0.54        0.73           0.81          0.87               0.74
10     11        0.48        0.64           0.66          0.82               0.65
Avg. top 5:      0.69        0.88           0.81          0.91               0.82
Avg. all 20:     0.47        0.67           0.63          0.79               0.64

The aggregated results in Table 3 underscore three principal findings. First, assisted prompting systematically improved the performance across all models and tasks, which is logical and expected since additional relevant information is provided to the models. Anchoring with known positions in the majority of cases helped the models to better position the remaining articles as well.

Second, the provision of additional information proved more beneficial for the most demanding task (Task_3) than for the less demanding tasks (Task_1 and Task_2). For example, in the Assist_FirstLast scenario, the increase in average τ relative to the unassisted scenario was 0.13 for Task_1, 0.17 for Task_2, and 0.65 for Task_3. This finding follows logically from the models' greater ability to identify the first and/or last article in simpler tasks by themselves: in Task_1, 15 of 20 models correctly identified the first position, while none identified the last position; in Task_2, 9 models identified the first position and 4 identified the last position; and in Task_3, no model identified either position correctly.

Third, the provision of additional information disproportionately benefited models that performed poorly in the unassisted scenario.
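As a side note on the metric, the τ values discussed in this section are easy to reproduce. Below is a minimal sketch of Kendall's τ (the tau-a variant, assuming no ties, which holds for strict orderings of distinct articles) together with its pairwise-accuracy reading; production code could use scipy.stats.kendalltau instead:

```python
from itertools import combinations

def kendall_tau(pred, true):
    """Kendall's tau (tau-a, no ties) between two orderings.

    `pred` and `true` map each article id to its position in the
    predicted and ground-truth ordering, respectively.
    """
    n = len(true)
    concordant = discordant = 0
    for a, b in combinations(true, 2):
        # A pair is concordant when both orderings agree on which
        # of the two articles comes first.
        if (pred[a] - pred[b]) * (true[a] - true[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Ground truth: 10 articles in positions 0..9; the prediction swaps two
# neighbouring articles, leaving 44 of the 45 pairs concordant.
true = {i: i for i in range(10)}
pred = dict(true)
pred[3], pred[4] = pred[4], pred[3]
tau = kendall_tau(pred, true)   # (44 - 1) / 45, roughly 0.956

# The concordant share p and tau are related by p = (tau + 1) / 2,
# which matches the text: tau 0.90 -> 95%, 0.80 -> 90%, 0.50 -> 75%.
```

The same routine returns +1 for a perfect match, -1 for a perfect reversal, and values near zero for a random ordering, mirroring the interpretation given above.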
For instance, on Task_3, the most difficult task, with an average Kendall's τ of only 0.02 in the unassisted scenario, the Assist_First scenario yielded average and maximum performance improvements of 0.46 and 1.07, respectively. For the Assist_Last scenario, the corresponding improvements were 0.27 and 0.80, while for the Assist_FirstLast scenario they were 0.65 and 1.02. The results demonstrate that supplementing less capable models with limited key information can yield significant performance gains on these tasks.

A qualitative examination of the models' reasoning justifications failed to yield systematic insights into their capacity to reconstruct accurate chronological sequences of articles. Although the generated rationales were generally logical and relevant, they frequently omitted crucial contextual information essential for correct chronological reasoning. This observation underscores the challenge that certain timelines may not be uniquely reconstructible due to insufficient contextual information. Furthermore, in some instances, the provided justification could plausibly support an alternative, yet equally valid, timeline. Moreover, this is compounded by the inherent challenge of discerning whether the provided reasoning justifications represent the model's actual inferential process or are merely a result of post-hoc rationalization.

4 CONCLUSIONS AND FURTHER RESEARCH IDEAS

This research provides insight into the practical application and inherent challenges of utilizing LLMs to sequence news streams in the context of ERM. The selected use cases are based on real-world, business-relevant event chains.

A comparative analysis reveals significant performance disparities among the evaluated models across all tasks and experimental scenarios. Models with superior reasoning capabilities surpassed those with less developed abilities. The varying complexity of the presented tasks further accentuated these performance differences. Also, providing additional anchoring information disproportionately benefited models that performed poorly in the unassisted scenario. Five models, Grok 4 (xAI), GPT-5, o3 and o3-pro (all three OpenAI), and Gemini 2.5 Pro (Google), consistently outperformed all other models in practically every task and experiment scenario. The performance level achieved by these models demonstrates their practical utility for real-world ERM applications.

This research has opened several promising areas for further research:

(1) Benchmarking LLMs against human experts: A rigorous comparative study should be undertaken in which large LLMs and domain specialists (human experts) perform identical tasks under strictly matched contextual conditions.

(2) Systematically varying model settings to probe "creativity" and reliability: Experiments that modulate the temperature and other model settings can clarify how stochasticity affects task performance and reliability.

(3) Enabling models to request task-critical information: Instead of supplying predefined contextual information, such as the first and/or last article in a sequence, future studies might allow the model to query for the minimal supplementary data it deems most informative. This strategy would approximate an active-learning workflow and might even illuminate new modes for human-LLM collaboration.

(4) Diagnosing mis-ordering errors through reasoning audits: To understand why models fail to reconstruct the correct temporal ordering of news articles, one could extract each model's stated reasoning features for every placement decision, then have human experts or adjudicating LLMs rate their accuracy and relevance. Such audits would expose specific deficits in reasoning and could even inform targeted retraining regimes.

(5) Experimenting with extended or interleaved event chains: Evaluating models on substantially longer sequences, or on mixtures of events drawn from multiple chains, would markedly raise task complexity and furnish a stringent benchmark of temporal-reasoning competence for business use cases.

ACKNOWLEDGMENTS

The authors acknowledge the use of LLMs during various stages of this research. These models provided support in tasks such as idea generation, text processing, prompt engineering, methodological exploration, and language optimization. While the LLMs contributed to enhancing efficiency and refining the presentation of this work, all conceptual frameworks, analyses, and interpretations remain the sole responsibility of the authors.

REFERENCES

[1] Y. Cao et al., 'RiskLabs: Predicting Financial Risk Using Large Language Model Based on Multi-Sources Data', Apr. 11, 2024, arXiv:2404.07452. doi: 10.48550/arXiv.2404.07452.
[2] A. Kim, M. Muhn, and V. V. Nikolaev, 'From Transcripts to Insights: Uncovering Corporate Risks Using Generative AI', Jul. 11, 2024, Rochester, NY: 4593660. doi: 10.2139/ssrn.4593660.
[3] T. Li and X. Dai, 'Financial Risk Prediction and Management using Machine Learning and Natural Language Processing', IJACSA, vol. 15, no. 6, 2024. doi: 10.14569/IJACSA.2024.0150623.
[4] Y. Wang, 'Generative AI in Operational Risk Management: Harnessing the Future of Finance', May 17, 2023, Rochester, NY: 4452504. doi: 10.2139/ssrn.4452504.
[5] X. Zhu, H. Jin, J. Li, and Y. Wang, 'Topic-Gpt: A Novel Risk Identification Method Based on Large Language Model', Jul. 04, 2024, Social Science Research Network, Rochester, NY: 4885365. doi: 10.2139/ssrn.4885365.
[6] M. Katamaneni, P. Agrawal, S. Veera, A. K. Sahoo, K. Singh Sidhu, and M. F. Hasan, 'AI-Based Risk Management in Financial Services', in 2024 Second International Conference on Computational and Characterization Techniques in Engineering & Sciences (IC3TES), Nov. 2024, pp. 1–5. doi: 10.1109/IC3TES62412.2024.10877497.
[7] X. V. Li and F. S. Passino, 'FinDKG: Dynamic Knowledge Graphs with Large Language Models for Detecting Global Trends in Financial Markets', in Proceedings of the 5th ACM International Conference on AI in Finance, Nov. 2024, pp. 573–581. doi: 10.1145/3677052.3698603.
[8] A. Nygaard et al., 'News Risk Alerting System (NRAS): A Data-Driven LLM Approach to Proactive Credit Risk Monitoring', in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, F. Dernoncourt, D. Preoţiuc-Pietro, and A. Shimorina, Eds., Miami, Florida, US: Association for Computational Linguistics, Nov. 2024, pp. 429–439. doi: 10.18653/v1/2024.emnlp-industry.32.
[9] Z. Xiao, Z. Mai, Z. Xu, Y. Cui, and J. Li, 'Corporate Event Predictions Using Large Language Models', in 2023 10th International Conference on Soft Computing & Machine Intelligence (ISCMI), Nov. 2023, pp. 193–197. doi: 10.1109/ISCMI59957.2023.10458651.
[10] Committee of Sponsoring Organizations of the Treadway Commission (COSO), Enterprise Risk Management, Integrating with Strategy and Performance. Durham, NC: COSO, 2017.
[11] International Organization for Standardization, ISO 31000:2018, Risk management, Guidelines. Geneva, Switzerland: ISO, 2018.


Graph-Based Feature Engineering for DeFi Security Incident Severity Prediction

Daria Pavlova∗ (daria.pavlova@mps.si), Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Inna Novalija (inna.koval@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Dunja Mladenić (dunja.mladenic@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

ABSTRACT

Decentralized Finance (DeFi) has emerged as a rapidly growing sector, but it has been plagued by numerous security incidents resulting in billions of USD in losses.
An important challenge is predicting which security incidents will lead to severe financial losses, as this can inform risk management and mitigation strategies. In this paper, we present a novel approach that integrates a semantic knowledge graph of the DeFi ecosystem into the machine learning pipeline for incident severity prediction. We construct a knowledge graph capturing rich relationships between DeFi protocols (including protocol fork lineage, multi-chain deployments, and historical incidents), and we engineer graph-based features from this graph to augment traditional incident features. Using these features in a gradient boosting trees classifier, we predict whether an incident will cause above-threshold (severe) losses. Our results show that incorporating graph-based features yields a substantial improvement in predictive performance: the model with semantic graph features achieves an Area Under ROC Curve (AUC) of 0.787, a 31.6% relative increase over the baseline model using only non-graph features. We observe particularly large gains in precision (from 0.341 to 0.490), indicating a significantly reduced false alarm rate. While these absolute performance values remain moderate, they represent substantial improvements for this challenging prediction task. The findings demonstrate the practical value of graph-enriched feature engineering for security analytics in DeFi. This work provides new insights into how protocol interconnections and characteristics contribute to incident severity, opening avenues for more robust DeFi risk assessment tools.

KEYWORDS

Decentralized Finance, DeFi, Security, Knowledge Graph, Feature Engineering, Incident Severity Prediction

1 INTRODUCTION

Decentralized Finance (DeFi) platforms have experienced rapid growth, alongside a surge in security breaches such as hacks and exploits. In 2022 alone, crypto attacks led to over $3.8 billion in stolen assets, with the majority coming from DeFi protocol exploits [1]. These incidents vary widely in impact: while many attacks result in limited losses, a significant fraction escalate into catastrophic failures causing losses in the tens or hundreds of millions of dollars. Predicting which security incidents will become severe (high-loss) events is crucial for proactive risk management, insurance underwriting, and developing early warning systems for the DeFi ecosystem.

Prior research has analyzed DeFi vulnerabilities and attack taxonomy [6], and industry reports highlight the growing scale of DeFi hacks. However, there is a gap in predictive approaches: existing studies focus on identifying vulnerabilities or classifying attack types, rather than forecasting the severity level of an incident before it fully unfolds. To our knowledge, this is the first work to apply semantic knowledge graph features specifically for DeFi incident severity prediction, establishing a new baseline for this important problem.

In traditional cybersecurity contexts, incorporating relational context via knowledge graphs and network models has been shown to improve threat detection [3]. For example, graph-based severity triage using attack graphs has been studied in traditional cybersecurity [5].

In this work, we propose a novel graph-based feature engineering approach to address this challenge. We construct a semantic knowledge graph of the DeFi ecosystem that encodes domain knowledge: nodes represent entities such as protocols and incidents, and edges capture relationships like "forked-from" (denoting protocol lineage) and "deployed-on" (connecting protocols to blockchain platforms), among others. From this knowledge graph, we derive a set of graph-based features for each security incident. These features quantify properties such as a protocol's structural position in the ecosystem (e.g., number of fork "children," cross-chain deployments, past incident count), which we posit are predictive of how severe an incident could be. We integrate these semantic graph features with conventional features (e.g., time of incident, incident type categories) in a machine learning classifier to predict whether an incident's loss will exceed a severity threshold. The contributions of our work are as follows:

• We introduce a methodology to incorporate a DeFi-specific knowledge graph into security incident severity prediction.
• We demonstrate significant performance gains over a baseline model lacking graph features (improving AUC by 31.6% and F1-score by 25%).
• We provide a comprehensive analysis including case studies, illustrating how related protocol dependencies can influence risk.
• We discuss practical implications of our findings for improving DeFi risk assessment.

All code and the publicly available dataset for this work are available in an open-source repository [4].

∗First author and presenter.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.6

2 METHODOLOGY

2.1 Knowledge Graph Construction

We built a knowledge graph representing the DeFi ecosystem to serve as a basis for feature engineering. The construction process was semi-automated, combining API data extraction with manual curation to ensure semantic consistency.

Data Sources: We integrated data from three primary sources: (1) the Rekt database (https://rekt.news) containing detailed DeFi security incident reports, (2) DeFiLlama's API providing protocol metadata including deployment chains and fork relationships, and (3) SlowMist Hacked for additional incident verification. All data sources are publicly available.

Semi-Automated Process: Protocol and incident data were automatically extracted using APIs and web scraping. Fork relationships were identified through a combination of automated code similarity analysis (for protocols with public repositories) and manual verification based on project documentation. The resulting knowledge graph contains 892 protocol nodes, 1,608 incident nodes, and 42 blockchain nodes, connected by over 3,500 edges representing various relationships. We use Neo4j to store and query this graph efficiently through asynchronous operations.

The graph's schema defines several entity types and relations relevant to DeFi security:

• Protocol nodes: Each DeFi protocol (e.g., lending platform, DEX, yield aggregator) is a node. Attributes include protocol name and launch date.
• Incident nodes: Major recorded security incidents (hacks, exploits) are represented as nodes with attributes such as date, loss amount, and qualitative classification (e.g., flash loan, smart contract bug).
• Blockchain nodes: Blockchain platforms (Ethereum, Binance Smart Chain, etc.) are included to capture deployment contexts.

Key relationships are encoded as directed edges:

• Fork-of: Connects a protocol to the protocol it was forked from (if applicable), capturing lineage (e.g., SushiSwap → Uniswap).
• Deployed-on: Links a protocol to a blockchain platform on which it is deployed.
• Incident-involves: Links an incident node to the protocol(s) affected by that incident.

The resulting graph captures a rich hierarchical structure of protocol relationships (including parent–child fork trees and cross-chain deployment links), as well as the association of past incidents with protocols. An overview of the graph structure is shown in Figure 1, and an illustrative Convex-centric subgraph is given in Figure 2.

Figure 1: DeFi knowledge graph overview: protocols, blockchains, and incidents with relations (forked-from, deployed-on, involves).

Figure 2: Convex-centric subgraph. Dependency on Curve highlights potential severity propagation via upstream vulnerabilities.

2.2 Feature Engineering with Graph-Based Features

From the knowledge graph, we derived several quantitative features that characterize the structural and historical context of the protocol involved in a given incident:

• Protocol multi-chain count: the number of distinct blockchains on which the protocol is deployed (degree of deployed-on edges). A higher count indicates a widely deployed protocol, potentially implying larger user bases or attack surfaces.
• Fork lineage indicators: whether the protocol is a fork of another (has a parent) and the number of forks derived from it. These capture whether a protocol inherits code (and possibly vulnerabilities) from a parent and how prevalent its code is in offspring projects.
• Past incident count: the total number of past security incidents involving the protocol (count of incident-involves edges to prior incidents). A history of frequent past incidents might signal underlying security weaknesses or attractive target value.

In addition to these graph-derived features, we include conventional features for each incident:

• Temporal features: the year and month of the incident, and day-of-week if relevant, to capture any time-related patterns or trends in attack occurrence.
• Categorical features: the general type of attack or vulnerability exploited (e.g., reentrancy, price oracle manipulation), and the asset or protocol category targeted, which provide contextual information on the incident.

Figure 3: Workflow: derive graph-based features from the DeFi knowledge graph and combine with conventional incident features for classification.
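As a sketch of how such features could be read off the graph, the fragment below computes the three graph-derived quantities from a toy edge list held in memory. The paper stores its graph in Neo4j; the protocol names and the exact feature semantics here (in particular the fork-children count) are illustrative assumptions rather than the authors' implementation:

```python
from collections import defaultdict

# Toy stand-in for the knowledge graph: (head, relation, tail) triples.
# Protocol and incident names are illustrative only.
edges = [
    ("SushiSwap",   "fork_of",     "Uniswap"),
    ("PancakeSwap", "fork_of",     "Uniswap"),
    ("Uniswap",     "deployed_on", "Ethereum"),
    ("Uniswap",     "deployed_on", "Polygon"),
    ("SushiSwap",   "deployed_on", "Ethereum"),
    ("incident_42", "involves",    "Uniswap"),
]

out_edges = defaultdict(set)   # (node, relation) -> set of targets
in_edges = defaultdict(set)    # (node, relation) -> set of sources
for h, r, t in edges:
    out_edges[(h, r)].add(t)
    in_edges[(t, r)].add(h)

def graph_features(protocol):
    """Derive the three graph-based feature groups for one protocol."""
    return {
        # number of distinct chains the protocol is deployed on
        "protocol_chains_count": len(out_edges[(protocol, "deployed_on")]),
        # fork lineage: does the protocol have a parent, and how many
        # forks were derived from it
        "is_forked_from_parent": bool(out_edges[(protocol, "fork_of")]),
        "fork_children_count": len(in_edges[(protocol, "fork_of")]),
        # number of recorded incidents involving the protocol
        "past_incident_count": len(in_edges[(protocol, "involves")]),
    }

print(graph_features("Uniswap"))
# {'protocol_chains_count': 2, 'is_forked_from_parent': False,
#  'fork_children_count': 2, 'past_incident_count': 1}
```

In the actual pipeline these lookups would be Cypher queries against Neo4j, evaluated on the graph state just before each incident so that no post-incident information leaks into the features.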
All features are computed or retrieved at the time just before the incident (to avoid using any post-incident information). The combination of graph-based features with traditional features forms the feature vector used for prediction. The end-to-end feature extraction and modeling pipeline is summarized in Figure 3. 2.3 Classification Model and Training We frame incident severity prediction as a binary classification task: severe vs. non-severe loss outcome. Following prior work in financial risk modeling, we define a severe incident as one with loss exceeding a high quantile threshold of the loss distribution. Figure 4: Performance comparison. Bar chart for AUC, F1, In our dataset, we tested multiple thresholds (70th, 75th, and 80th precision, recall. percentiles), with the 75th percentile ($2.21 million) serving as the primary cutoff, yielding 402 severe incidents out of 1,608. The model showed consistent improvements across all thresholds, 3 EXPERIMENTS AND RESULTS confirming the robustness of our approach. Our primary model is a gradient boosting decision trees en- 3.1 Dataset and Experimental Setup semble (LightGBM [2]), selected for its efficiency, ability to handle We compiled a publicly available dataset of 1,608 DeFi security heterogeneous feature types, and proven performance in tabular incidents that occurred between 2020 and 2025. The dataset was financial risk modeling. We enabled LightGBM’s built-in class im- constructed by combining data from: (1) Rekt database providing balance option (is_unbalance=True), as severe cases represent comprehensive incident reports with loss amounts and attack 25% of the data. descriptions, (2) DeFiLlama API for protocol metadata including Train/Test Split: Data were split chronologically into 75% TVL and deployment information, and (3) SlowMist Hacked for training and 25% testing. Early stopping was not applied due additional incident verification and technical details. 
Each inci- to dataset size; hyperparameters were fixed after preliminary dent record includes the loss amount (in USD) and details such tuning. as date and attack type. Incidents with losses above $2.21 million We compare two feature sets: a Baseline model using only were labeled as severe, which yields a severe class prevalence of non-graph features (temporal and categorical), and a Semantic roughly 25% (402 severe vs. 1,206 non-severe cases). Graph model combining these with graph-based features. Per- For training and evaluation, we use a chronological split with formance is evaluated with Area Under the ROC Curve (AUC) 75% for training and 25% for testing; early stopping was not and supported by Precision, Recall, and F1-score. applied. 63 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia Pavlova et al. Table 1: Performance comparison between the baseline model (numeric/categorical features only) and the seman- tic graph model (with knowledge graph features). Metric Baseline Semantic Graph Improvement AUC 0.598 0.787 +31.6% F1-score 0.384 0.480 +25.0% Precision 0.341 0.490 +43.7% Recall 0.440 0.470 +6.8% 3.2 Performance Comparison A visual comparison of model performance is shown in Figure 4. Our results confirm that incorporating graph-based features markedly improves prediction performance. Table 1 summa- rizes the evaluation metrics for the baseline and semantic graph- enhanced models on the test set. The Semantic Graph model achieves an AUC of 0.787, substantially higher than the base- line’s 0.598 (a relative improvement of 31.6%). This indicates that the model with graph features is much better at ranking incidents Figure 5: Top 15 most important features ranked by Light- by risk. The F1-score also improves from 0.384 to 0.480, reflecting GBM gain. Values on the x-axis represent LightGBM’s in- better overall classification accuracy. 
ternal feature importance scores (dimensionless, aggre- Notably, the Precision (positive predictive value) rises from gated across all trees in the ensemble). Both temporal fea- 0.341 to 0.490—a 43.7% increase—while Recall increases slightly tures (year, month, day-of-week) and graph-based features from 0.440 to 0.470. This suggests that the graph-enriched model (e.g., protocol_chains_count, is_forked_from_parent) appear is significantly more effective at identifying truly severe incidents among the strongest predictors. (fewer false positives) without sacrificing the ability to catch most severe cases. While the absolute values of these metrics might Applications: Graph-based risk factors can support auditors appear moderate, it is important to note that they represent and insurers in identifying critical "hot spots" and pricing cover- substantial improvements over the baseline and are competitive age more accurately than historical losses alone. for this specific and challenging prediction task where many Limitations: The dataset covers only publicly reported inci- external factors influence incident severity. dents, which may bias toward large-scale events. Features are In addition to the hold-out test, we evaluated stability via manually engineered and static; future work should explore dy- cross-validation. The baseline model’s mean AUC across 5 folds namic graphs, Graph Neural Networks, and richer incident cover- was 0.629 (std 0.036), whereas the semantic model averaged 0.809 age. Absolute performance (AUC 0.787) remains moderate, leav- (std 0.027). This not only reaffirms the performance boost but also ing room for improvement before real-world deployment. indicates that the graph-augmented model is more consistent across different data subsets (lower variance), likely because the 5 CONCLUSION graph features provide a more robust signal that generalizes. We introduced a graph-enriched framework for predicting sever- ity of DeFi security incidents. 
By combining semantic knowledge graph features with conventional incident data, our model achieved substantial gains over a feature-only baseline. The findings emphasize that where an incident occurs in the ecosystem is as important as what it is. This approach offers immediate utility for risk assessment and motivates further research into dynamic, end-to-end graph-based models for DeFi security.

3.3 Feature Importance Analysis

To better understand the relationships between graph-based features, we analyzed their pairwise correlations (Figure 5). The correlation matrix shows that most features are only weakly related, which indicates that they capture complementary aspects of protocol structure and history. The strongest dependency is observed between is_forked_from_parent and parent_fork_children_count (correlation 0.64), reflecting the natural link between fork origin and the number of derived protocols. Other features, such as protocol_chains_count and protocol_past_events_count, exhibit low correlation values (<0.2), suggesting they provide distinct signals. This relative independence confirms that graph-derived features enrich the predictive model with diverse information rather than duplicating each other.

4 DISCUSSION

Our results highlight the value of relational context for DeFi security analysis: knowledge graph features capture ecosystem-level dependencies not visible from incident-centric data. Incidents affecting widely forked or multi-chain protocols are more likely to cause severe losses, reflecting practical amplification effects.

REFERENCES

[1] Chainalysis Team. 2023. 2022 Biggest Year Ever for Crypto Hacking with $3.8 Billion Stolen, Primarily from DeFi Protocols and by North Korea-linked Attackers. Chainalysis Blog (1 February 2023). https://www.chainalysis.com/blog/2022-biggest-year-ever-for-crypto-hacking/
[2] G. Ke et al. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems 30. 3146–3154.
[3] J. Michel and P. Parrend. 2023. Graph-Based Intelligent Cyber Threat Detection System. In Cybersecurity in Intelligent Networking Systems. CRC Press.
[4] D. Pavlova. 2025. DeFi Security Trends: Semantic Knowledge Graph Analysis (Code & Dataset). GitHub repository. https://github.com/dariapavlova02/defi_trends_semantic
[5] L. Sadlek et al. 2025. Severity-Based Triage of Cybersecurity Incidents Using Kill Chain Attack Graphs. Journal of Information Security and Applications 89 (2025).
[6] S. Werner, D. Perez, L. Gudgeon, A. Klages-Mundt, D. Harz, and W. Knottenbelt. 2021. SoK: Decentralized Finance (DeFi). arXiv preprint arXiv:2101.08778 (2021).

Evolving Neural Agents in Simulated Ecosystems

Marija Ćetković, Aleksandar Tošić, Domen Vake
UP FAMNIT, Koper, Slovenia
marijacetkovic03@gmail.com, aleksandar.tosic@upr.si, domen.vake@famnit.upr.si

Abstract

This paper explores how adaptive behaviors can emerge in artificial agents through neuroevolution in a dynamic 2D ecosystem. Using the NeuroEvolution of Augmenting Topologies (NEAT) algorithm, both the neural network structure and weights evolved over time without predefined architectures or behaviors. The system models two agent types, herbivores and carnivores, that compete for limited food resources in a simulated environment.

NEAT allows the network topology and weights to evolve over time, in comparison to fixed-topology weight-evolving evolutionary methods. We implemented NEAT from scratch to have full control over mutation, crossover, and fitness evaluation, ensuring that the system could support our experimental goals and to potentially build a controllable and extensible evolutionary framework. While NEAT has been previously applied to multi-agent systems, many studies focus on performance in a specific task.
This paper addresses whether an agent-based NEAT framework can produce ecological equilibrium without an explicit objective. Our primary contribution is the demonstration and analysis of stable, co-adaptive predator-prey dynamics, showing how specific evolved behaviors arise from the underlying neural network topologies of the agents.

From the beginning, it was evident that environment design, input encoding, and reward shaping had a major impact on agent behavior. Poorly tuned conditions led to exploitation, overfitting, or meaningless patterns. But when the system was carefully balanced, the agents began developing survival strategies such as movement efficiency, food seeking, and attacking. Herbivores evolved plant-consumption behaviors, while carnivores built on this base to prioritize attacks and meat consumption. Some behaviors generalized well to larger environments, showing that agents were not just memorizing patterns. We observed how NEAT's speciation and innovation mechanisms were crucial for maintaining diversity and avoiding premature convergence. At the same time, challenges like catastrophic forgetting revealed the limitations of neural networks in long-term skill retention. Ultimately, this work demonstrates how intelligent, adaptive behavior can emerge from simple evolutionary principles and offers a foundation for future research into co-evolution, agent roles, and artificial life.

2 Methods

2.1 Environment Model

The ecosystem is a discrete 2D grid populated with food resources and agents. Herbivores consume plants, carnivores consume meat, and all agents perceive their surroundings through a limited sensory range.

2.2 Evolutionary Framework

Agents (creatures) interact with the world and are controlled by neural networks (genomes) evolved using NEAT. Initial populations start with minimal structures (fully connected input/output layers), and complexity increases through structural mutations.

Keywords
neuroevolution, NEAT, evolutionary algorithms, artificial life, simulated ecosystems, co-evolution, neural networks

1 Introduction

This research explores neuroevolution for adaptive agent behaviors in a dynamic ecosystem. Agents are controlled by feedforward neural networks that map sensory inputs to actions [4], and their structure and weights evolve incrementally using the NEAT algorithm [5]. Unlike static or predefined tasks, this simulation presents agents with a changing environment where no explicit 'correct' behavior exists. Dynamic environments without fixed objectives require exploratory and adaptable approaches. Gradient-based optimization relies on differentiable fitness functions and fixed topologies, while reinforcement learning can struggle under sparse rewards.

Genomes consist of genes: lists of nodes (with an ID and a type: input, hidden, or output) and connections (with an ID, the nodes they connect, a weight, and an enabled flag). Each tick, each agent receives a snapshot of the world state as input, to ensure stable input for everyone. Inputs include diet type, hunger level, a local 3x3 neighborhood scan for food and neighbors (type and health level), and the direction toward the nearest food source. Based on that, agents choose actions from the softmax output of their neural networks. The outputs correspond to discrete actions: move (up, down, left, right), eat, attack, or stay. The actions become events that are handled in a deferred manner: first, invalid actions are filtered out; then the EventManager processes all queued events at once, sequentially applying changes in the world and updating the fitness and health of agents, as shown in Algorithm 1. The fitness function evolved through experimentation.
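The genome encoding and softmax action choice described above can be sketched as follows. This is illustrative Python, not the authors' Java implementation, and all names here are hypothetical:

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class NodeGene:
    id: int
    type: str              # "input", "hidden", or "output"

@dataclass
class ConnectionGene:
    id: int                # innovation number
    in_node: int
    out_node: int
    weight: float
    enabled: bool = True

@dataclass
class Genome:
    nodes: list = field(default_factory=list)
    connections: list = field(default_factory=list)

# The seven discrete actions available to every agent.
ACTIONS = ["up", "down", "left", "right", "eat", "attack", "stay"]

def softmax(outputs):
    """Numerically stable softmax over raw network outputs."""
    m = max(outputs)
    exps = [math.exp(o - m) for o in outputs]
    total = sum(exps)
    return [e / total for e in exps]

def choose_action(outputs):
    """Sample a discrete action from the softmax distribution."""
    probs = softmax(outputs)
    return random.choices(ACTIONS, weights=probs, k=1)[0]

# A minimal genome: one input node wired to one output node.
g = Genome(nodes=[NodeGene(0, "input"), NodeGene(1, "output")],
           connections=[ConnectionGene(0, 0, 1, 0.5)])
```

Sampling from the softmax (rather than taking the argmax) keeps early behavior stochastic, which matches the observation that initial actions were effectively randomized.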
Early versions of the fitness function rewarded survival, but later iterations combined survival time, food consumption, and, for carnivores, attack behavior.

Evolutionary algorithms, by evaluating populations of agents directly on survival and performance, provide a natural solution for such open-ended scenarios [1]. Neural networks allow agents to flexibly map sensory input to actions, and NEAT enables both the topology and the weights to evolve.

2.3 NEAT Mechanisms

Innovation tracking is the process of recording structural mutations globally to keep genomes aligned during crossover. A singleton class assigns a unique ID to each new connection or node. If a structural change already exists, it reuses the same ID; if not, it creates a new one. This ensures consistent identification of identical innovations across all genomes [5].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.10

while generation limit not reached do
    // Simulation Phase
    while creatures alive do
        foreach creature c in population do
            c.observe(world)      // perception
            c.chooseAction()      // NN policy
            c.queueAction()       // check validity, enqueue
        end
        eventManager.process()    // apply action effects
        foreach creature c in population do
            c.updateHealth()      // starvation, death
            c.evaluateFitness()   // assign fitness
        end
    end
    // Evolution Phase
    assignGenomesToSpecies()      // speciation
    createOffspring()             // apply GA within species
    resetWorld()                  // spawn new creatures and food
end

Algorithm 1: High-Level Evolutionary Simulation Loop

2.4 Graphical User Interface

Figure 1 shows an example of the GUI, which serves to visually track the simulation in real time, making the evolutionary process observable and interpretable, as analyzing logs alone could be misleading. It allowed following population changes over generations, spotting emerging behaviors such as movement patterns or interactions, and understanding whether agents are actually evolving. It helps detect issues such as creatures moving in the same direction or wandering aimlessly.

Figure 1: GUI close-up

NEAT preserves evolutionary innovation by speciation (niching) [5]. Each generation, evaluated genomes are reassigned to species based on a structural compatibility distance, calculated as a weighted sum of the number of disjoint and excess connections (those present in only one parent, within and beyond the other's genome region respectively) and the average weight difference between matching connections (those present in both parents):

    δ = c1·E/N + c2·D/N + c3·W̄

where E and D are the numbers of excess and disjoint connections, N is the number of connections in the larger genome, W̄ is the average weight difference of matching connections, and c1, c2, c3 are weighting coefficients. Existing species are cleared and each genome is compared to species representatives; if no match is found, a new species is created. Representatives are updated every generation to maintain diversity.

2.5 Implementation Notes

The simulation was implemented in Java with LibGDX [2] for visualization. NEAT logic included custom classes for genomes, species management, and innovation tracking. The evolutionary loop evaluated agents in the world, assigned fitness, reproduced genomes, and reset the environment for subsequent generations.
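The two bookkeeping pieces of Section 2.3, global innovation IDs and the compatibility distance δ = c1·E/N + c2·D/N + c3·W̄, can be sketched as follows (illustrative Python rather than the authors' Java; names are hypothetical):

```python
class InnovationTracker:
    """Assigns a globally unique, stable ID to each structural mutation,
    so identical innovations get the same ID in every genome."""
    def __init__(self):
        self._ids = {}          # (in_node, out_node) -> innovation ID
        self._next = 0

    def get(self, in_node, out_node):
        key = (in_node, out_node)
        if key not in self._ids:      # new structural change: mint a new ID
            self._ids[key] = self._next
            self._next += 1
        return self._ids[key]         # existing change: reuse the same ID

def compatibility(excess, disjoint, n, avg_weight_diff,
                  c1=1.0, c2=1.0, c3=0.4):
    """delta = c1*E/N + c2*D/N + c3*W_bar (NEAT compatibility distance).
    The coefficient values here are arbitrary examples."""
    return c1 * excess / n + c2 * disjoint / n + c3 * avg_weight_diff

tracker = InnovationTracker()
a = tracker.get(0, 3)
b = tracker.get(0, 3)   # the same mutation arising elsewhere in the population
print(a == b)           # True: identical innovations share one ID
```

Because the tracker is consulted at mutation time, two genomes that independently add the same connection stay alignable gene-by-gene during crossover.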
Fitness is shared within species (adjusted by species size) to balance selection pressure. The compatibility threshold strongly affects stability: low thresholds create many narrow species, high thresholds create broader but less distinct species, requiring careful tuning.

To prevent the population from being dominated by one species, which would limit the exploration of the algorithm, NEAT uses adjusted fitness [5]. Instead of assigning raw fitness scores, an individual's fitness is divided by the number of individuals within its compatibility distance δ:

    f′i = fi / Σ(j=1..n) sh(δ(i, j))

where the sharing function sh is 1 when δ is below the compatibility threshold and 0 otherwise.

Evolution is achieved through genetic operators within species:

Mutation: weight changes (random reset 5–10% of the time, or a small Gaussian perturbation) and structural changes (adding nodes or edges, toggling connections). The resulting genome is checked for acyclicity.

Crossover: offspring inherit connections aligned by innovation number; matching connections are inherited from the first parent, while disjoint and excess connections come from the parent with the greater fitness score (or at random if equal). Invalid (cyclic) offspring are replaced by a mutated copy of the fitter parent.

Selection: parent selection uses tournament selection: a subset of individuals (size 5) is sampled and the fittest is chosen. With 3% probability, a random individual is selected instead to maintain diversity. This setup provides moderate selection pressure, avoiding premature convergence while keeping the implementation simple, efficient, and robust across different fitness functions.

2.6 Setup

After every run, around 10 percent of the population is saved and loaded for the next run; that part of the population is kept unchanged and the rest is filled with mutations of it. This is done to speed up the evolution process. In early runs, we disabled the perception of other agents to prevent confusion and help them learn basic eating behavior. Once they consistently moved and consumed food, perception was turned on to allow them to adapt to a more complex environment. We also tested this logic on other inputs such as the food direction vector, without which agents were essentially 'blind' to non-local food. So, during early iterations, we spawned food in randomly concentrated areas rather than spreading it widely, to help them learn to use this vector.

3 Results

3.1 Herbivore Evolution

Herbivores initially explored aimlessly but gradually developed stable food-seeking strategies. Over 800 generations, their action distribution stabilized, with movement actions dominating and eating consistently rewarded. In larger environments, agents prioritized exploration to reach scarcer resources, showing emergent adaptation beyond memorized patterns.

We can see from Fig. 2 that the initial fitness oscillates strongly, with a large difference between average and maximal fitness, as well as some outliers with high fitness that end up consuming a lot of food. This is expected to some extent: when one creature consumes food, it reduces the available resources for others in the population.

In smaller worlds, herbivores focused on eating while carnivores split between eating and attacking; in larger worlds, carnivores prioritized attacking while herbivores balanced movement and eating.

Figure 2: Herbivore fitness
Figure 3: Average creature lifespan
Figure 4: Average number of unique tiles visited
Figure 5: Herbivore action distribution, 100x100 world
Figure 6: Carnivore action distribution, 100x100 world

In the beginning the chosen actions were randomized, but Figure 5 shows how herbivores learned to prioritize the eating action, although initial interference is evident. The usage of the stay and attack actions is low.

Figures 3 and 4 show that agents that survive longer naturally explore more of the environment.
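The adjusted-fitness and tournament-selection rules of Section 2.3 can be sketched as follows. With the binary sharing function, the sharing sum reduces to the species size; this is an illustrative Python sketch, not the authors' Java code:

```python
import random

def adjusted_fitness(raw, species_size):
    # sh(delta) is 1 inside a species and 0 outside, so the sharing sum
    # in f'_i = f_i / sum_j sh(delta(i, j)) is simply the species size.
    return raw / species_size

def tournament_select(population, fitness, k=5, p_random=0.03):
    """population: list of genome IDs; fitness: genome ID -> adjusted fitness."""
    if random.random() < p_random:        # occasional random pick keeps diversity
        return random.choice(population)
    contenders = random.sample(population, min(k, len(population)))
    return max(contenders, key=lambda g: fitness[g])

pop = ["g1", "g2", "g3", "g4", "g5", "g6"]
fit = {"g1": 1.0, "g2": 5.0, "g3": 2.0, "g4": 0.5, "g5": 3.0, "g6": 4.0}
print(adjusted_fitness(10.0, 4))   # 2.5: a large species dilutes its members
```

Dividing by species size means a genome in a crowded species must be proportionally fitter to earn the same share of offspring, which is what protects small, novel species.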
The observed correlation between them emerges from adaptive behavior under environmental constraints, rather than from any explicit exploration reward.

In Fig. 6 we can spot how carnivores have problems balancing the eating and attacking actions, but the attacking action slightly dominates after some time.

3.2 Carnivore Adaptation and Catastrophic Forgetting

Carnivores were evolved by reusing herbivore topologies and adjusting weights, transferring eating behavior to meat sources. This transfer was successful in just 200 generations, but the agents showed catastrophic forgetting when switched back to herbivore roles, losing previous behaviors [3]. This showed us that we needed more general pretraining to make sure that agents were using their role, food, and food-vector inputs, and not overfitting to the food type.

3.3 Co-Evolution Dynamics

To try to avoid the forgetting problem mentioned above, we saved agents of both types that had evolved their basic skills independently. When carnivores were alone, we gave them no motivation to use the attack action, in order to wire that logic to herbivores later. The attacking behavior was rewarded only for carnivores but, as shown below, some role interference was inevitable.

Figure 7: Herbivore action distribution, 200x200 world

Figure 7 displays herbivore behavior, where the action distribution is more stable and there is a clear evolved balance of eating and moving actions, which is expected in a larger world.
4 Design Observations

Agent behavior is highly sensitive to design choices in fitness functions, environment setup, and input representation. Poorly designed fitness functions can lead to inefficient or trivial behaviors, such as flickering near food, which highlights the exploration-exploitation trade-off and the role of environmental influence in shaping behavior. Static or predictable resource spawn locations can cause overfitting, where agents memorize positions instead of learning general strategies. Dynamic and unpredictable environments are necessary to evolve general food-seeking behaviors and to observe which patterns emerge from evolutionary pressures versus environmental conditions. However, environments that are too unpredictable can hinder learning and obscure the distinction between inherited tendencies and environment-induced behaviors.

Input scaling and initial placement also influence behavioral emergence. An unbounded health input caused agents to idle, while spawning agents too close together and rewarding them for food consumption led to aimless wandering when neighbors died, showing correlations learned from the environment. These observations demonstrate how neural networks may pick up coincidental patterns that influence both relearning across generations and the adopted strategies.

Metrics did not always reflect consistent progress, as dynamic food spots and starting points introduced noise. Dips or peaks in performance do not necessarily indicate genuine failure or success. Adjustments to fitness, food rewards, and environmental parameters were required to guide learning, prevent reward hacking, and allow behavioral adaptation. Comparing herbivore and carnivore roles shows that behaviors are shaped by both environmental pressures and the interactions between agent strategies and resource availability.

Figure 8: Carnivore action distribution, 200x200 world

Figure 8 displays carnivore behavior in the larger world, where they were given a greater incentive to attack. From the distribution, we can see that they indeed attacked more, with the other actions balanced out; the stay action was rarely chosen.

Figure 9: Maximum lifespan in the coevolutionary setting
Agents adjust their actions based on the resources they encounter, and these actions influence which resources remain available, creating a feedback loop between behavior and the environment.

Fitness (as well as the lifespan depicted in Figure 9) fluctuated in an "arms race" pattern with no dominant winner. This outcome is expected, as a rise in one role's performance lowered the performance of the other. This shows that the system tended toward balance, which aligns with the objective of testing whether coevolution with NEAT agents could produce equilibrium.

3.4 Species Diversity

The survival plot of emerging species in Figure 10 shows an important aspect of the NEAT algorithm. The initial drop means that a few very successful topologies dominated the population, but using a lower compatibility threshold prevents the total loss of diversity. The number of species stabilized after some time, while many smaller species died out quickly.

Figure 10: Species diversity over generations.

5 Conclusion and Future Work

This paper demonstrates that nature-like behaviors can emerge from relatively simple principles when agents evolve in dynamic, open-ended environments without predefined goals. By evolving herbivores and predators both separately and in co-evolution, we showed that evolutionary pressures can produce adaptive behaviors and predator-prey equilibria, highlighting how role-specific dynamics shape ecosystem stability. This work lays the foundation for future experiments that involve more complex behaviors, survival strategies, and deeper coevolutionary dynamics. Future directions could include investigating refined role-awareness mechanisms, improved memory or learning retention, and more complex agent inputs and actions, enabling us to push the boundaries of what these agents can learn over time.

References

[1] A. E. Eiben and J. E. Smith. 2003. Introduction to Evolutionary Computing. Natural Computing. Springer-Verlag, Berlin.
[2] LibGDX. [n. d.] LibGDX game development framework. https://libgdx.com/.
[3] Michael McCloskey and Neil J. Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation, 24, 104–169.
[4] Michael A. Nielsen. 2018. Neural networks and deep learning. http://neuralnetworksanddeeplearning.com/.
[5] Kenneth O. Stanley and Risto Miikkulainen. 2002. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10, 2, 99–127. http://nn.cs.utexas.edu/?stanley:ec02.

Designing AI Agents for Social Media

Abdul Sittar∗, Mateja Smiljanić, Alenka Guček
Jožef Stefan Institute, Ljubljana, Slovenia
abdul.sittar@ijs.si, mateja.smiljanic@gmail.com, alenka.gucek@ijs.si

Abstract

This work presents an approach for designing AI agents that simulate social media activity by replacing Twitter conversations with large language models (LLMs). Using a time-series dataset of Twitter discussions about technologies (April 2019 - April 2020), we propose an approach that combines fine-tuned language models with a timeline manager to capture both conversational dynamics and temporal posting patterns.

[15], [14] emphasized that stylistic consistency within timelines benefits rare-event detection, while artificial stylistic variety can increase false positives. [1] demonstrated the effectiveness of T5-based paraphrasing, achieving an average 4.01% accuracy increase with T5 augmentation, with RoBERTa reaching 98.96% accuracy through ensemble approaches.
This approach consists of two main components: 1) a timeline manager, which models posting frequency, reply behaviour, and the temporal rhythms of users, and 2) conversation agents, fine-tuned for posting and replying within threads. We evaluate the system along two dimensions: structural accuracy (whether the timeline manager replicates conversation patterns and thread structures) and emotion dynamics (whether the emotions of the synthetic data replicate the true emotion trends in the original dataset). Our results demonstrate that the proposed agent-based simulation captures key characteristics of real Twitter interactions, providing a foundation for large-scale synthetic social media ecosystems useful for studying information flow, emotion propagation, and the impact of emerging technologies.

Keywords

AI agents, large language models (LLMs), social media simulation, Twitter conversations, conversation agents

Recent advances in large language models (LLMs) provide opportunities to simulate social media users as autonomous agents capable of generating posts and replies. [9] mainly concentrates on using LLMs as stand-alone agents or for simple agent interactions, neglecting the opportunity to assess LLMs within the network structure of complex social networks. In this study, we leverage fine-tuned language models to create agents across multiple domains, including technology (AI), cryptocurrency, and health-related topics (e.g., COVID-19). Each agent is specialized for posting or replying, while a timeline manager model simulates the environment, deciding which agent acts next and at what time. By grouping similar users into single agents, our approach generalizes behaviour while maintaining the richness of interaction patterns.

The main goal of this work is to investigate the effect of environmental changes on agent behaviour and network dynamics.

1 Introduction

Social media platforms have become major arenas for information dissemination, discussion, and opinion formation. However, the emergence of filter bubbles, where users are exposed predominantly to content that aligns with their existing beliefs, can reinforce polarization, reduce diversity of exposure, and shape
collective behaviour in unforeseen ways [3]. Also, social networks have broadened the range of ideas and information accessible to users, but they are also criticized for contributing to greater polarization of opinions [2]. Understanding how these dynamics emerge and evolve requires models that can replicate user behaviour at scale while capturing temporal patterns and interactions.

Specifically, we hypothesize that altering the scheduling and structure of the environment model can lead to measurable changes in posting and replying activity, as well as in the temporal evolution of simulated emotions. To evaluate our approach, we compare real Twitter data with simulated outputs, analysing emotion trends and interaction dynamics across time windows. Our approach provides a novel methodology for studying social media dynamics, testing hypotheses about user behaviour, and exploring interventions to mitigate filter bubbles.

1.1 Contributions

Following are the two primary scientific contributions of this work:
• An approach to replicate social media users by grouping similar users into language-model-driven agents managed with a timeline manager.
• An evaluation that assesses structural accuracy, conversational coherence, and emotional realism by comparing simulated and true emotion trends.

Large language models have emerged as powerful tools for synthetic text generation. [10] investigated GPT-3.5 for text classification augmentation, finding that subjectivity negatively correlates with synthetic data effectiveness, while achieving 3-26% absolute improvement in accuracy/F1 in low-resource settings. [18] introduced GPT3Mix, using GPT-3 for realistic text generation with soft labels, significantly outperforming existing augmentation methods. The quality of synthetic data generation has been evaluated through multiple frameworks.

Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.23

2 Related Work

LLMs are increasingly employed to model human behaviour in online settings, but current evaluation approaches, such as simplified Turing tests involving human annotators, fail to capture the subtle stylistic and emotional nuances that differentiate human-generated from AI-generated text [12]. [12] proposes a human-likeness evaluation framework that systematically measures how closely LLM-generated social responses resemble those of real users. This framework utilizes a set of interpretable textual features that capture stylistic, tonal, and emotional aspects of online conversations. While LLMs can mimic certain human behaviours and decision-making processes, primarily due to their training data, it remains largely unexplored whether repeated interactions with other agents amplify their biases or lead to exclusive patterns of behaviour over time [8].
Modelling social media has been an active research area for understanding user behaviour, information diffusion, and network effects. Agent-based models have been widely used to replicate interactions among users, simulate posting and replying behaviour, and study emergent phenomena such as viral content spread, echo chambers, and filter bubbles [6, 11]. These models often rely on simplified rules or probabilistic mechanisms to determine agent actions. Our work extends this by using fine-tuned language models to generate realistic post and reply content, capturing both the semantic and the temporal patterns observed in real social media interactions.

The concept of filter bubbles has been extensively studied in the context of social media algorithms and personalized content delivery [17, 7, 3]. Prior studies have shown that temporal factors, such as posting frequency and timing, significantly influence the formation of echo chambers and the propagation of sentiment. Unlike traditional simulations, our approach explicitly models time windows and agent-specific schedules, allowing the study of how environmental changes affect network dynamics and user behaviour over time.

Large language models (LLMs) have been increasingly applied to social media analysis, content generation, and user simulation. Fine-tuned models can capture domain-specific language, hashtags, and posting patterns, enabling more realistic simulations of user behaviour [13, 4]. Existing work has largely focused on generating content for individual posts or replies; in contrast, our approach integrates posting, replying, and environment management in a unified simulation, enabling multi-agent interaction analysis.

Recent studies have used sentiment and emotion analysis to evaluate social media content, including the study of affective trends and collective mood in online networks [16, 5]. Our approach leverages these techniques to compare simulated emotion trends with real-world Twitter data, providing a quantitative measure to validate the fidelity of the agent-based simulation.

Figure 1: Overview of the proposed methodology for conversation simulation. The timeline manager determines which agent should act next based on the current time, agent, context, and action. The selected fine-tuned model then generates a new post or reply for the chosen agent, creating realistic conversation flow.

3 Methodology

Our methodology employs a two-stage approach combining probabilistic scheduling with domain-specialized fine-tuned language model agents to simulate realistic social media interactions (posting and replying). The approach consists of two primary components: (1) a timeline-based probabilistic model that serves as the timeline manager, and (2) domain-specialized fine-tuned agents that generate contextually appropriate content based on the timeline manager's decisions.

3.1 Probabilistic model

The probabilistic scheduler is implemented as a multi-output neural network that simultaneously predicts four key dimensions of social media behaviour: agent selection (which agent should act next), action classification (post vs. reply), temporal prediction (timing of the next action), and context setting (emotional tone and topical focus for content generation).

The model is trained on 88,330 conversation items spanning April 2019 to April 2020, focusing on AI and cryptocurrency discussions. Our timeline-based approach generates 93,440 chronological training pairs (18.7x more than baseline methods) through complete conversation-sequence learning rather than isolated post-reply pairs. Given the current state S(t) at time t, the model computes probability distributions over the action space.

3.2 Fine-tuned model

We implement a single fine-tuned language model that serves as both the AI and the cryptocurrency agent. The model is trained on conversations from both domains (AI technology and cryptocurrency discussions) to capture the vocabulary, argumentation patterns, and discourse styles across both topic areas.

• Agent A (AI Focus): the same fine-tuned model, called when the probabilistic scheduler determines that AI-related content is needed.
• Agent B (Crypto Focus): the identical fine-tuned model, called when cryptocurrency-related content generation is required.

When called by the probabilistic scheduler, the fine-tuned model generates content based on the provided context, including action type (post/reply), emotional context, topical focus, temporal context, and conversation history. The model's training on both domains enables it to produce contextually appropriate responses regardless of which agent role it is fulfilling.

3.3 Integration and Coordination

The probabilistic scheduler communicates with the fine-tuned agents through a structured interface that maintains separation between temporal decisions (when and who acts) and content decisions (what is said). At each simulation step, the scheduler: (1) analyses the current conversation state, (2) predicts the next action parameters, (3) selects the appropriate domain agent, (4) provides structured context to the selected agent, and (5) integrates the generated content into the conversation thread. This approach enables realistic conversations where different domain experts can contribute to mixed-topic discussions while maintaining their specialized perspectives and the temporal behavioural patterns observed in real social media data.

• Action Distribution: maintains realistic post/reply ratios (94.5%/5.5%)

4.2 Fine-tuned model

Table 2: Evaluation Results: ROUGE and Semantic Similarity

Metric                      | Score
ROUGE-1                     | 0.1373
ROUGE-2                     | 0.0519
ROUGE-L                     | 0.1179
4 Experimental Setup

In this section, we describe the features, model, and evaluation metrics.

4.1 Timeline Manager

The baseline system is a timeline-based probabilistic model that learns agent transitions, reply probabilities, and temporal distributions from training data. Predictions are made deterministically by selecting the most probable outcome, with probability estimates derived directly from observed frequencies.

The enhanced approach employs a machine-learning ensemble with separate classifiers for agent, action, and time prediction. Features include agent history, action history, and time of day. Predictions are generated using temperature-controlled stochastic sampling, with an ensemble across multiple temperature settings for robustness. This design enables greater flexibility and diversity, counteracting the strong biases inherent in the probabilistic model.

4.2 Fine-tuned model

Table 2: Evaluation Results: ROUGE and Semantic Similarity

Metric | Score
ROUGE-1 | 0.1373
ROUGE-2 | 0.0519
ROUGE-L | 0.1179
ROUGE-Lsum | 0.1217
Semantic Similarity (SBERT) | 0.4041

Table 2 reports the evaluation results for the fine-tuned model's generated content. The ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum) measure lexical overlap between generated outputs and the reference Twitter posts. The relatively low scores (e.g., ROUGE-1 = 0.1373) indicate that while the generated text captures some overlapping words or phrases, it often diverges lexically from the original references. This is expected, since the model is not designed for verbatim reproduction but rather for generating semantically coherent alternatives.

To complement ROUGE, we compute semantic similarity using SBERT embeddings. The score of 0.4041 shows that, on average, the generated outputs are moderately aligned in meaning with the reference texts, even when the surface-level wording differs. This highlights that the fine-tuned model is able to remain contextually and thematically relevant while producing novel expressions. Overall, the combination of ROUGE and semantic similarity suggests that the fine-tuned agents do not simply replicate reference posts but instead produce new, semantically consistent outputs.
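The two metric families can be illustrated with a simplified sketch. These are toy versions for intuition only: real evaluations use a tested ROUGE library and SBERT sentence embeddings, whereas the cosine below operates on plain bag-of-words counts.

```python
from collections import Counter
import math

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap ROUGE-1 F1 (no stemming or tokenizer)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def cosine_bow(a: str, b: str) -> float:
    """Bag-of-words cosine; SBERT would embed full sentences instead."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

The contrast between the two is exactly the one discussed above: a paraphrase with no shared words scores 0 on both toy metrics, while an embedding-based similarity such as SBERT can still detect the shared meaning.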
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia. Abdul Sittar, Mateja Smiljanić, and Alenka Guček

4.1.1 Evaluation Metrics. Table 1 summarizes the key differences between the original probabilistic model and the improved ML-based model, covering both quantitative performance and qualitative conversational outcomes.

Table 1: Comparison of the Original Probabilistic Model vs. the Improved ML-Based Model.

Aspect | Probabilistic Model | ML-Based Model
Agent Prediction | 44.8% accuracy, but always predicts Crypto_Agent (100%) | 55.2% accuracy, balanced: AI_Agent (50%) and Crypto_Agent (50%)
Action Prediction | 74.4% accuracy by predicting only "post" (0% replies) | 67.8% accuracy with a realistic mix: 65% posts / 35% replies (close to the ground truth of 73/27)
Temporal Modelling | MAE = 5.41 min; 99.4% within ±15 min | MAE = 7.11 min; 99.2% within ±15 min

We evaluated our probabilistic model using comprehensive metrics across three key categories:

• Agent Prediction: 61.3% accuracy (a 22.6% improvement over random chance)
• Action Classification: 96.8% accuracy for post vs. reply prediction
• Temporal Modelling: 50.7-minute MAE with 99.15% accuracy within ±15 minutes

Our evaluation demonstrates that the probabilistic scheduler successfully replicates conversation structure:

• Agent Alternation: 94.2% similarity to real switching behaviours
• Temporal Rhythms: strong correlation (r = 0.78) with actual daily patterns
• Action Distribution: maintains realistic post/reply ratios (94.5%/5.5%)

Figure 2: Methodology diagram showing both experimental approaches.

Figure 2 presents the aggregated emotion comparison between the reference Twitter dataset and the conversations generated by the fine-tuned model. The analysis is based on average emotion scores across multiple conversation samples, with categories including hate, not_hate, non_offensive, irony, neutral, positive, and negative. Blue bars represent the reference data, while orange bars indicate the generated outputs.

Overall, the comparison shows strong alignment between the two distributions for the key non-toxic categories. Both reference and generated conversations are overwhelmingly classified as not_hate and non_offensive, with nearly identical scores (approximately 0.95 and 0.75, respectively). Similarly, both datasets contain minimal hate or negative content, indicating that the synthetic conversations do not introduce harmful patterns absent from the real data.

At the same time, certain emotional discrepancies are evident. The generated conversations exhibit lower levels of irony and positivity than the real dataset. Specifically, irony is notably under-represented in synthetic conversations (0.04 versus 0.12 in the reference data), suggesting that nuanced and implicit language styles are harder for the model to reproduce. Similarly, positive sentiment is reduced in generated text (0.49 versus 0.62), while neutrality is slightly higher (0.78 versus 0.71). This indicates a tendency of the model to produce emotionally flatter and less expressive outputs.

Taken together, the results suggest that the model successfully replicates the broad emotional structure of conversations, particularly in terms of avoiding toxic or offensive content, but with reduced representation of irony and positivity. The generated outputs are less emotionally rich than real data, which highlights a key limitation of current LLM-based conversation agents: while structurally sound, they may generate interactions that are less engaging or authentic in their emotional dynamics.

5 Conclusions

In this work, we presented a novel approach for replicating social media user behaviour using fine-tuned language models organized as autonomous agents. By combining a timeline manager (Model A) with specialized posting (Model B) and replying (Model C) models, we simulated realistic multi-agent interactions across AI- and crypto-related topics.

Our timeline-based probabilistic model successfully replicates structural conversation patterns with 61.3% agent accuracy and near-perfect action classification (96.8%), establishing a new benchmark while providing clear paths for further enhancement through domain specialization. Our experiments demonstrated that the approach can generate temporal posting and replying patterns that closely resemble real-world Twitter data. We showed that modifying the environment model significantly influences agent behaviour, posting frequency, and network dynamics, supporting our hypothesis that environmental and temporal factors shape interaction patterns in social networks.

This approach provides a flexible and controlled platform for studying filter-bubble formation, emotion propagation, and emergent social dynamics. Future work can extend the approach to more complex network structures, additional domains, and the integration of user-specific behaviour models to further explore interventions for mitigating echo chambers and enhancing diversity in online interactions.

6 Acknowledgment

The research presented in this paper was funded by the EU's Horizon Europe Framework under grant agreement numbers 101095095 (TWON) and 101094905 (AI4Gov).

References

[1] Jordan J. Bird et al. 2021. Chatbot interaction with artificial intelligence: human data augmentation with T5 and language transformer ensemble for text classification. arXiv preprint arXiv:2010.05990.
[2] Uthsav Chitra and Christopher Musco. 2020. Analyzing the impact of filter bubbles on social network polarization. In Proceedings of the 13th International Conference on Web Search and Data Mining, 115–123.
[3] Uthsav Chitra and Christopher Musco. 2019. Understanding filter bubbles and polarization in social networks. arXiv preprint arXiv:1906.08772.
[4] Cristina Chueca Del Cerro. 2024. The power of social networks and social media's filter bubble in shaping polarisation: an agent-based model. Applied Network Science, 9, 1, 69.
[5] Matteo Cinelli, Gianmarco De Francisci Morales, Alessandro Galeazzi, Walter Quattrociocchi, and Michele Starnini. 2020. Echo chambers on social media: a comparative analysis. arXiv preprint arXiv:2004.09603.
[6] Rui Fan, Ke Xu, and Jichang Zhao. 2018. An agent-based model for emotion contagion and competition in online social media. Physica A: Statistical Mechanics and its Applications, 495, 245–259.
[7] Antonino Ferraro, Antonio Galli, Valerio La Gatta, Marco Postiglione, Gian Marco Orlando, Diego Russo, Giuseppe Riccio, Antonio Romano, and Vincenzo Moscato. 2024. Agent-based modelling meets generative AI in social network simulations. In International Conference on Advances in Social Networks Analysis and Mining. Springer, 155–170.
[8] Farnoosh Hashemi and Michael Macy. 2025. Collective social behaviors in LLMs: an analysis of LLMs' social networks. In Large Language Models for Scientific and Societal Advances.
[9] Tianrui Hu, Dimitrios Liakopoulos, Xiwen Wei, Radu Marculescu, and Neeraja J. Yadwadkar. 2025. Simulating rumor spreading in social networks using LLM agents. arXiv preprint arXiv:2502.01450.
[10] Z. Li, J. Zhu, et al. 2023. Synthetic data generation with large language models for text classification: potential and limitations. arXiv preprint arXiv:2310.07849.
[11] Hamid Reza Nasrinpour, Marcia R. Friesen, et al. 2016. An agent-based model of message propagation in the Facebook electronic social network. arXiv preprint arXiv:1611.07454.
[12] Nicolò Pagan, Petter Törnberg, Christopher Bail, Ancsa Hannak, and Christopher Barrie. [n. d.] Can LLMs imitate social media dialogue? Techniques for calibration and BERT-based Turing test. In First Workshop on Social Simulation with LLMs.
[13] Kayhan Parsi and Nanette Elster. 2015. Why can't we be friends? A case-based analysis of ethical issues with social media in health care. AMA Journal of Ethics, 17, 11, 1009–1018.
[14] Ifrah Pervaz, Iqra Ameer, Abdul Sittar, and Rao Muhammad Adeel Nawab. 2015. Identification of author personality traits using stylistic features: notebook for PAN at CLEF 2015. In CLEF (Working Notes), 1–7.
[15] E. Rosenfeld et al. 2025. Evaluating synthetic data generation from user generated text. Computational Linguistics, 51, 1, 191–230.
[16] Tanase Tasente. 2025. Understanding the dynamics of filter bubbles in social media communication: a literature review. Vivat Academia, 1–21.
[17] Petter Törnberg, Diliara Valeeva, Justus Uitermark, and Christopher Bail. 2023. Simulating social media using large language models to evaluate alternative news feed algorithms. arXiv preprint arXiv:2310.05984.
[18] Kang Min Yoo et al. 2021. GPT3Mix: leveraging large-scale language models for text augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2021, 2225–2239.
Explaining Temporal Data in Manufacturing using LLMs and Markov Chains

Jan Šturm (jan.sturm@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Maja Škrjanc (maja.skrjanc@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Oleksandra Topal (oleksandra.topal@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Inna Novalija (inna.koval@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Dunja Mladenić (dunja.mladenic@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Marko Grobelnik (marko.grobelnik@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

Monitoring and understanding complex industrial processes from high-dimensional IoT sensor data remains a significant challenge. While advanced modeling techniques like Hierarchical Markov Chains can abstract raw data, their outputs are often difficult for domain experts to interpret, creating a gap between data-driven insights and operational management. Existing explainability methods often focus on feature importance rather than providing holistic, semantic descriptions of system states. This paper introduces a framework that bridges this gap by transforming the abstract states of a process model into intuitive, human-readable concepts. The methodology leverages the StreamStory (Hierarchical Markov Chain) approach to generate behavioral profiles based on log-likelihood calculations within sliding temporal windows. StreamStory states are summarized using an LLM to assign semantic labels and descriptions. This approach reduces the initial reliance on domain experts for analysis, aids the understanding of complex system dynamics, and provides a transparent foundation for identifying both normal and anomalous operational patterns. The result is a more interpretable representation of industrial processes, facilitating improved predictive maintenance and operational efficiency.

Keywords

Multivariate Timeseries, Explainable AI, LLMs, Markov Chains

1 Introduction

The widespread adoption of Internet of Things (IoT) sensors in industrial environments has generated vast streams of multivariate time-series data. While this data holds immense potential for process optimization and predictive maintenance, its complexity often surpasses human cognitive capacity. Tools like StreamStory [6] have emerged to model these complex systems using Hierarchical Markov Chains, abstracting raw data into a more manageable set of states and transitions. However, a fundamental challenge persists: a disconnect between the model's statistical outputs and the experiential knowledge of domain experts.

The motivation for this work stems from this challenge. Domain experts, who possess invaluable implicit knowledge of a system, often struggle to interpret the statistical outputs of process models. Conversely, data scientists may identify patterns that lack the necessary operational context for effective action. Presenting experts with a graphical representation of states and transitions is a step forward, but it does not fully bridge the semantic gap. They may not understand what a specific state represents in the physical world or why a particular transition is significant. This leads to a bottleneck where valuable data-driven insights are not fully utilized, hindering efforts to improve system management and efficiency.

To address this, the paper proposes a methodology that enhances the interpretability of hierarchical process models. This approach creates a new layer of understanding that is accessible to operational personnel without requiring deep data science expertise. By translating abstract model states into meaningful, semantically rich descriptions, it provides a tool that allows the system's behavior to be understood, validated, and ultimately better managed. This work introduces a methodology to automatically generate these descriptions, moving from complex data to clear, actionable insights. It presents two primary contributions for industrial applications: a method for LLM-based labeling of Markov chain states, and a methodology for identifying events as anomalous or normal.

2 Related Work

The field of time-series anomaly detection has evolved from interpretable statistical models like ARIMA and classical machine learning such as Isolation Forest to high-performance deep-learning architectures including LSTMs, Transformers, and Autoencoders [5, 4, 7]. While these advanced models excel at pattern recognition, their complexity necessitates post-hoc XAI tools like LIME and SHAP to explain their decisions, which are limited to providing low-level feature attributions [1].

Recent work also demonstrates the utility of Hidden Markov Models (HMMs) for anomaly detection, for instance by designing active search strategies to locate an evolving anomaly among multiple processes [2], or by learning normal temporal dynamics from remote sensing data to detect, localize, and classify crop-related deviations [3]. However, while effective for detection, the abstract nature of HMM states can be difficult for domain experts to interpret. The present work addresses this by transforming the state sequence into a multi-scale behavioral profile, which enables a Large Language Model (LLM) to generate rich, semantic explanations of system behavior.

This approach innovates by first classifying each multivariate data point into a state within a pre-built Markov Chain model and then calculating log-likelihoods from the state sequence to form a multi-scale representation. Crucially, this representation allows for the recognition of regular system behavior and various anomalies. By analyzing the statistical distribution of these profiles—identifying dense regions of regular behavior and sparse outliers corresponding to anomalous states—an LLM can then assign rich, human-readable descriptions, connecting abstract data to operational knowledge.

Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.28

3 Methodology

The framework is designed to post-process models generated by the StreamStory system.

3.1 Log-Likelihood Score Calculation

The input to the pipeline is a pre-existing Hierarchical Markov Chain model of an industrial process, which includes a history of state transitions over time. The first step is to create a rich feature representation that captures the system's dynamics. A sliding-window approach (Figure 2) moves across the sequence of historical state transitions. For each window of a given size, a single feature is calculated: the log-likelihood of that specific sequence of transitions occurring. This score is calculated by summing the log-transformed transition probabilities for each step in the sequence, as defined by the underlying Markov model. The score effectively quantifies how "normal" or "expected" a particular sequence of behavior is according to the learned model. Highly probable sequences yield higher log-likelihood scores (closer to zero), while rare sequences result in large negative scores.
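A minimal sketch of the window score follows. The two-state transition matrix is invented for illustration; the actual refinery model's states and probabilities are not published in the paper.

```python
import math

# Illustrative two-state transition matrix (not the paper's learned model).
P = {("run", "run"): 0.9, ("run", "idle"): 0.1,
     ("idle", "idle"): 0.8, ("idle", "run"): 0.2}

def window_log_likelihood(states):
    """Sum of log transition probabilities over one window of states."""
    return sum(math.log(P[(a, b)]) for a, b in zip(states, states[1:]))

common = window_log_likelihood(["run", "run", "run"])   # two likely transitions
rare = window_log_likelihood(["run", "idle", "run"])    # two rare transitions

# Multi-scale behavior profile (Section 3.2): one score per window size,
# computed here for the final time step only.
sequence = ["run", "run", "run", "run", "idle", "run"]
profile = [window_log_likelihood(sequence[-(w + 1):]) for w in (3, 5)]
```

As described above, the common sequence scores closer to zero (here 2·log 0.9 ≈ −0.21) than the rare one (log 0.1 + log 0.2 ≈ −3.91).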
Figure 1 outlines this multi-stage process, which begins with the statistical features from the Markov model and culminates in semantically enriched explanations of system behavior. The core of the methodology is the transformation of abstract machine states into meaningful concepts using a combination of statistical feature engineering and LLM interpretation. The process focuses on creating robust representations of system behavior and leveraging an LLM to translate these representations into human-understandable language.

Figure 2: An illustration of the sliding window method. Three windows of different sizes, highlighted in yellow (largest), brown (medium), and green (smallest), are applied to a sequence of system states. A log-likelihood score is then calculated for the sub-sequence contained within each colored window.

3.2 Behavior Profile Construction

To capture dynamics over multiple time scales, several sliding windows of different sizes are used simultaneously. The log-likelihood score calculated from each window is concatenated to form a single feature vector for each time step. This multi-scale vector, termed a behavior profile, serves as a rich representation of the system's dynamics at that moment, encapsulating both short-term and longer-term patterns. This profile is a crucial output, as it provides a quantitative basis for distinguishing between different modes of operation.

3.3 Ranking System Behavior via Anomaly Scoring

Following the construction of the behavior profiles, their distribution is analyzed to identify distinct operational patterns. An unsupervised density-based approach is employed to score each profile's typicality. The Isolation Forest algorithm is used for this purpose because it does not assume a specific data distribution and excels at identifying outliers in a high-dimensional space. Profiles that are common and lie in dense regions of the feature space receive a high score, corresponding to normal behavior. Conversely, profiles that are rare and isolated receive a low score, flagging them as anomalous. This produces a continuous spectrum of normalcy, allowing for a ranked analysis of all operational events.
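The ranking step can be sketched with scikit-learn's IsolationForest on toy two-dimensional "profiles". The data here is synthetic (a dense grid plus one isolated point); in the paper the inputs are the multi-scale log-likelihood vectors.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for behavior profiles: 49 dense grid points plus one
# isolated point that should be ranked most anomalous.
dense = np.array([[i % 7, i // 7] for i in range(49)], dtype=float)
profiles = np.vstack([dense, [[25.0, 25.0]]])

clf = IsolationForest(contamination=0.05, random_state=0).fit(profiles)
scores = clf.decision_function(profiles)  # higher = more typical
ranking = np.argsort(scores)              # ascending: most anomalous first
most_anomalous = int(ranking[0])          # index 49, the isolated profile
```

Sorting the decision scores yields exactly the "continuous spectrum of normalcy" described above: the head of the ranking holds the rare, isolated profiles and the tail the dense, typical ones.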
Figure 1: Proposed methodology for identifying and explaining normal and anomalous operational profiles.

3.4 LLM-Powered State Naming and Interpretation

To translate abstract states into meaningful concepts, an LLM is utilized. For each granular state discovered by the StreamStory model, its statistical profile (e.g., sensor value distributions) and context about the machine type were formatted into a descriptive prompt. The LLM was then tasked with generating a concise, intuitive name for each state (e.g., "Peak Production - High Flow and Heat"). This process, conducted once per model, creates a semantic layer that is then used to interpret the sequences associated with the highest-ranked normal and lowest-ranked anomalous events.

This approach offers two key advantages. First, the LLM-generated names provide a layer of transparency, offering an immediate hypothesis about what each abstract state represents. Second, it shifts the role of the domain expert from the arduous task of initial interpretation to the more efficient step of validating or refining the LLM-generated labels, accelerating the process of gaining actionable insights.
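A hypothetical version of the prompt assembly for this step is sketched below. The paper does not publish its exact prompt; the sensor names, statistics, and wording here are invented for illustration.

```python
# Invented state profile; the real profiles come from StreamStory's
# per-state sensor value distributions.
state_profile = {
    "flow_rate_kg_h": {"mean": 41800, "std": 950},
    "discharge_pressure_kg_cm2": {"mean": 12.4, "std": 0.3},
    "fluid_temperature_c": {"mean": 68.0, "std": 2.1},
}

def build_naming_prompt(machine_type: str, profile: dict) -> str:
    """Format a state's statistics and machine context into a naming prompt."""
    lines = [f"- {sensor}: mean={stats['mean']}, std={stats['std']}"
             for sensor, stats in profile.items()]
    return (
        f"You are labelling operating states of a {machine_type}.\n"
        "Given the sensor statistics below, return a concise, intuitive name\n"
        "for this state (e.g. 'Peak Production - High Flow and Heat').\n"
        + "\n".join(lines)
    )

prompt = build_naming_prompt("oil refinery pump", state_profile)
```

The prompt would then be sent once per state to the chosen LLM (GPT-4o in the experiment), and the returned name stored as the state's semantic label.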
4 Experiment

To validate the proposed framework, an experiment was conducted using a real-world industrial dataset from an oil refinery pump. This section details the dataset, implementation, and results.

4.1 Dataset

The experiment was performed on a proprietary, real-world dataset obtained from an industrial oil refinery. Due to its confidential nature, the dataset is not publicly available. The data consists of a multivariate time-series collected over one month of operation (March–April 2017) with a 15-minute sampling resolution. Data was gathered from a suite of IoT sensors monitoring the core functions of a critical pump. Key measurements include fluid flow rate (kg/h), suction and discharge pressure (kg/cm²), and temperatures of the process fluid and mechanical components (°C).

4.2 Implementation Details

The methodology was implemented in a Python environment. The underlying Markov Chain model was built using the entire historical dataset provided, as the goal is to interpret the complete, learned dynamics of the process rather than to perform a predictive task that would require a train/test split. Behavior profiles were constructed using sliding windows of multiple sizes (3, 5, 7, and 10 steps). The resulting profiles were analyzed using the scikit-learn implementation of Isolation Forest. The 'contamination' parameter was set to 5% for the primary analysis, a common heuristic for industrial processes. State descriptions were generated using the GPT-4o model, which was prompted with the statistical profiles of each state to generate intuitive names.

4.3 Experimental Results and Discussion

The application of the framework yielded a ranked list of operational events, characterized by the Isolation Forest decision score. This score serves as a robust indicator of how typical or anomalous a given time window is. Table 1 details the top five most anomalous events identified. These events are characterized by scores that are more than 3 standard deviations below the mean, signifying extreme statistical rarity.

Table 1: Top 5 Most Anomalous Events

Rank | Timestamp | Score (Std.) | Final State (LLM Name)
1 | 2017-04-03 14:30 | -0.096 (-3.88) | Startup...Transition
2 | 2017-03-28 10:00 | -0.071 (-3.45) | Startup...Transition
3 | 2017-03-30 00:00 | -0.066 (-3.35) | High-Flow, Cool Op.
4 | 2017-04-03 12:30 | -0.061 (-3.26) | Machine Idle
5 | 2017-04-03 15:00 | -0.056 (-3.18) | Weekday Low-Flow...

The true explanatory power of the method is revealed when the abstract state sequences are translated into their LLM-generated names. For instance, the most anomalous event culminates in the sequence "... -> 'Startup or Shutdown Transition' -> 'Machine Idle or Shutdown' -> 'Startup or Shutdown Transition'". This provides a clear, human-readable narrative of the pump entering a period of instability and stoppage, a marked improvement over black-box models that simply flag a time point as anomalous without providing a temporal context for the "why." An engineer, seeing this semantic sequence, can immediately infer a potential cause for investigation, such as an attempted restart or a stuttering shutdown process.

Conversely, the most normal events, detailed in Table 2, paint a picture of operational stability. These events have high positive scores, and their sequences reveal a stable operational loop between states like "Peak Production," "Weekend Peak-Load Production," and "Extreme Temperature Peak Performance." This recurring pattern defines the pump's healthy operational "heartbeat," providing a data-driven "golden standard" for normal behavior under demanding conditions. This semantic understanding is crucial for operators, as it validates that the system is performing as expected. The LLM-generated names for these sequences, such as transitions between 'Weekday Peak Performance' and 'Weekend Peak-Load Production', describe the system operating within its expected high-performance period. This demonstrates the framework's ability not only to flag deviations but also to recognize and semantically label the system's healthy, predictable operational cycles, providing a valuable baseline for what constitutes 'good' performance.

Table 2: Top 5 Most Normal Events

Rank | Timestamp | Score (Std.) | Final State (LLM Name)
1 | 2017-03-23 22:00 | 0.192 (1.22) | Weekend Peak-Load
2 | 2017-03-31 06:00 | 0.192 (1.22) | Peak Production
3 | 2017-04-01 00:00 | 0.191 (1.20) | Peak Production
4 | 2017-03-31 23:30 | 0.191 (1.19) | Weekday Peak Perf.
5 | 2017-03-31 07:30 | 0.190 (1.17) | Weekday Peak Perf.

To ensure the robustness of the findings, a sensitivity analysis was conducted on the Isolation Forest 'contamination' parameter, testing values of 1%, 5%, and 10%. While the number of points labeled 'Anomalous' changed as expected, the relative ranking of the most extreme events remained highly consistent, confirming that the core findings are not sensitive to this hyperparameter.

The claims in this paper are demonstrated on a single, representative dataset. While the framework is designed to be general, further studies on diverse industrial processes are required to fully validate its broader applicability. The LLM-generated labels were not validated in a formal user study with domain experts; such a study is a valuable next step.

5 Conclusion

This paper presented a complete, self-contained framework for increasing the interpretability of complex industrial process models. By creating behavior profiles of system states and using an LLM to assign semantic names, the approach successfully translates abstract data analysis into practical domain knowledge. The method provides a robust process for ranking and explaining individual operational events in a transparent manner, as demonstrated on a real-world industrial dataset. This work establishes a strong foundation for a new type of explainability, moving beyond feature importance to provide narrative, context-rich descriptions of system dynamics.

The representation of system dynamics as behavior profiles opens a wide array of possibilities for future research. The current work successfully identifies and presents the raw temporal sequences leading to key events. Future work will focus on applying formal pattern-mining techniques to automatically discover recurring and significant sequential patterns within these events. Such an analysis could reveal whether distinct "families" of anomalous behavior exist, each with its own characteristic temporal signature. This promises a more nuanced description of system operations and provides a stronger foundation for developing targeted predictive-maintenance strategies. Finally, to address current limitations, two key areas will be prioritized. First, formal user studies with domain experts will be conducted to validate the utility and accuracy of the LLM-generated explanations, moving beyond the promising initial results. Second, the framework's generalizability will be tested through broader empirical evaluation across diverse industrial sectors and sensor types to boost its credibility and applicability.

6 Acknowledgments

This work was supported by the Slovenian Research Agency and the European Union's Horizon 2020 project FAME (Grant No. 101092639).

References

[1] Liat Antwarg, Ronnie Mindlin Miller, Bracha Shapira, and Lior Rokach. 2019. Explaining anomalies detected by autoencoders using SHAP. arXiv preprint arXiv:1903.02407.
[2] Levli Citron, Kobi Cohen, and Qing Zhao. 2025. Searching for a hidden Markov anomaly over multiple processes. arXiv preprint arXiv:2506.17108.
[3] Kareth M. Leon-Lopez, Florian Mouret, Henry Arguello, and Jean-Yves Tourneret. 2021. Anomaly detection and classification in multispectral time series based on hidden Markov models. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–11.
[4] Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. 2022. Anomaly detection in time series: a comprehensive evaluation. Proceedings of the VLDB Endowment, 15, 9, 1779–1797.
[5] Charalampos Shimillas, Kleanthis Malialis, Konstantinos Fokianos, and Marios M. Polycarpou. 2025. Transformer-based multivariate time series anomaly localization. In 2025 IEEE Symposium on Computational Intelligence on Engineering/Cyber Physical Systems (CIES). IEEE, 1–8.
[6] Luka Stopar, Primoz Skraba, Marko Grobelnik, and Dunja Mladenic. 2018. StreamStory: exploring multivariate time series on multiple scales. IEEE Transactions on Visualization and Computer Graphics, 25, 4, 1788–1802.
[7] Fengling Wang, Yiyue Jiang, Rongjie Zhang, Aimin Wei, Jingming Xie, and Xiongwen Pang. 2025. A survey of deep anomaly detection in multivariate time series: taxonomy, applications, and directions. Sensors (Basel, Switzerland), 25, 1, 190.

Active Learning for Power Grid Security Assessment: Reducing Simulation Cost with Informative Sampling

Gašper Leskovec, Jožef Stefan Institute, Slovenia, leskovecg@gmail.com
Costas Mylonas, UBITECH, Greece, kmylonas@ubitech.eu
Klemen Kenda, Jožef Stefan Institute, Slovenia, klemen.kenda@ijs.si

Abstract

Power grid security assessment under the N-1 criterion requires extensive contingency simulations, which are computationally intensive and costly to label. In this work, we explore the use of active learning (AL) to train binary classifiers that can accurately predict the outcome of contingency scenarios using fewer labeled samples.
We evaluate several AL strategies, such as entropy, margin, and uncertainty sampling, against a random baseline. Our results show that AL methods achieve the same predictive performance with significantly fewer labels, reducing labeling effort and simulator runtime. These findings demonstrate the effectiveness of integrating AL with power system simulators to enable scalable and efficient N-1 security assessment without sacrificing model accuracy.

Keywords

active learning, smart grids, security assessment, simulation cost reduction

1 Introduction

Ensuring secure operation of power systems under the N-1 criterion is a cornerstone of grid reliability. The criterion requires that the system remains within operational limits following the loss of any single component (e.g., line, transformer, or generator). In practice, this involves simulating a large number of contingencies and checking for violations of thermal or voltage constraints. While essential, such simulations are computationally intensive, particularly when performed on high-fidelity grid models, and their interpretation often requires expert judgment. This creates a bottleneck for both real-time applications and large-scale scenario analyses, where scalability and efficiency are important.

Classical approaches to N-1 assessment rely on exhaustive AC power flow simulations combined with contingency-ranking heuristics such as performance indices (PIs). While useful for screening, these heuristics may mis-rank contingencies or overlook borderline cases due to masking effects [3]. Moreover, exhaustive analysis does not scale well with system size, making it unsuitable for fast or repeated assessments.

To overcome these challenges, researchers have proposed machine learning (ML) and deep learning (DL) approaches that approximate N-1 contingency outcomes directly from operating-point features. One of the earliest contributions in this direction applied convolutional neural networks (CNNs) to contingency datasets, showing that deep models could achieve over 99% accuracy in detecting insecure cases while being more than 200 times faster than traditional power flow calculations [1]. Building on this, more recent work explored pooling-ensemble multi-graph learning to design scalable contingency screening schemes based on steady-state information, demonstrating improved adaptability for large-scale systems [2]. These approaches enable fast security screening without solving power flows for every contingency. However, their reliability hinges on the availability of large labeled datasets covering all relevant operating points and contingencies. Such datasets are typically generated by running exhaustive offline N-1 simulations, which is computationally expensive, or require significant expert effort to label secure versus insecure cases. This dependence on costly and large-scale data generation remains a major limitation of existing ML-based frameworks for steady-state security assessment.

To reduce labeling costs, AL has recently been explored in other areas of power systems. For example, the authors of [5] used AL to enhance stability assessment and dominant instability mode identification, showing that models could be trained with far fewer labeled samples while maintaining accuracy. Similarly, the authors of [4] demonstrated an AL-enhanced digital twin for day-ahead load forecasting, where the model iteratively refined predictions by querying only the most uncertain cases. These studies confirm the potential of AL to reduce expert effort and simulation cost by strategically selecting informative samples. However, AL has not yet been applied to N-1 steady-state security assessment, where the need to cut down on contingency simulations is especially critical.

In this work, we propose a novel framework for AL-driven N-1 security assessment. Our contributions are threefold:

(1) We design a binary classification model that predicts whether a given contingency is secure or insecure based on steady-state features.
(2) We integrate AL strategies (entropy, margin, and uncertainty sampling) with the classifier to selectively query the most informative contingencies for simulation, reducing the number of labels required.
(3) We demonstrate through a case study that our approach achieves the same predictive accuracy as fully supervised baselines while reducing simulation cost and labeling effort by up to 40–50%.

This work provides the first evidence that AL can be directly leveraged for N-1 security assessment, offering a scalable and label-efficient alternative to exhaustive simulation or purely supervised ML approaches.

SiKDD 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.11

2 Methodology

We study whether pool-based AL can reduce the number of expensive N-1 simulations ("labels") while keeping prediction quality for binary secure vs. insecure classification.

Table 1: Dataset and system description (digital twin of the Greek transmission network).

(2) Train the RF on the current labeled set; score the pool to obtain class-probability vectors p(x).
(3) Select the next batch of 𝑏 samples using one of the query Attribute strategies below. Value (4) Query the simulator for labels of the chosen batch (expen- Test system 35 buses, 46 lines, 135 generators, sive step); add them to the labeled set. 110 static generators, 20 loads (5) Repeat for a fixed number of iterations or until the budget Power flow solver AC load flow (Newton–Raphson), is exhausted. Contingencies (N–1) We sweep budgets across runs: 𝑖 ∈ {100, . . . , 500}, 𝑏 ∈ Line outages (all lines except idx { via pandapower 45), generator outages (all) 50, . . . , 200 }, and up to 40 iterations, which lets us trace Total contingency cases long learning curves. 8 769 Secure / Insecure 51.28% / 48.72% Query strategies. We compare: (i) Random (baseline); (ii) Feature dimensionality 271 features total Least-confident (uncertainty): score 1 − max𝑐 𝑝𝑐 (𝑥 ); (iii) Mar- Feature groups load_: 20, gen_: 135, sgen_: 110 gin: negative gap between top-2 probabilities; (iv) Entropy: − Í 𝑝𝑐 (𝑥 ) log 𝑝 𝑐𝑐 (𝑥 ). All three uncertainty policies operate on the same RF posteriors and therefore often rank samples 2.1 Data and labels from a digital twin similarly. We use a steady-state digital twin of the transmission grid. For each timestamp we solve the base-case AC power flow, then apply 2.5 Evaluation the N-1 criterion by removing each line/transformer/generator After each iteration we evaluate on the fixed test set. At each AL in turn and re-solving. An operating point is labeled round we retrain the RF from scratch on the enlarged labeled set; secure if the base case and all contingencies satisfy limits (bus voltages new labels are added to training only; the pool remains unlabeled. ∈ [0.90, 1.10] p.u., line loading ≤ 100%); otherwise it is insecure. For each strategy we run multiple configurations and both seeds, Non-convergent power flows are labeled then align results by total labeled samples and average across runs insecure . 
The test system is a digital twin derived from the topology of to obtain strategy-level learning curves. Unless noted otherwise, the Greek transmission network (35 buses, 46 lines, 135 genera- TTT values in the main figures are computed on these averaged tors, 110 static generators, 20 loads). AC load flows are computed curves. Appendix A.1 (Table 4a) reports per-run TTT (mean ± with the Newton–Raphson method in std), which is larger due to variability across initial sizes pandapower . N-1 contin-𝑖, batch gencies include all line outages (excluding line index 45) and all sizes 𝑏, and seeds. generator outages. Table 1 summarizes the dataset. 2.6 Metrics 2.2 Time-aware train/validation/test split We report Accuracy and ROC AUC on the test set, plus two label- Samples are sorted by timestamp. The AL efficiency metrics: Time-to-Target (TTT), the smallest number training/pool comes from earlier windows, while the of labeled samples needed for the average curve of a strategy to test set is the most recent slice and is never used for training or querying. This avoids temporal reach a target (e.g., ACC≥0.92 or AUC≥0.98); and AULC (Area leakage and mimics deployment where we predict on future data. Under the Learning Curve), computed by trapezoidal integration A small validation split is carved from the training era for early of metric vs. total labeled. Because simulator seconds per call checks. are roughly constant, relative cost/time savings are well approxi- mated by label savings derived from TTT. 2.3 Classifier and hyperparameters Additional classification metrics. Besides Accuracy and ROC Our base model is a Random Forest (RF) because it is fast, AUC we also track Precision, Recall, F1 and the False Neg- robust and provides class-probability posteriors needed by ative Rate (FNR) on the fixed test set at every AL round. Let uncertainty-based AL. Across runs we vary hyperparameters TP, FP, FN, TN be counts on the test set. 
We use the standard in realistic ranges: 𝑛 estimators ∈ [200, 1500], max_depth ∈ definitions: Precision = TP/(TP + FP), Recall = TP/(TP + FN), {18 , Precision·Recall 20 , 24 , 25 , 28 , 30 , 35 , 40 , None } , min_samples_split ∈ F1 = 2 · , FNR = FN/(FN + TP) = 1 − Recall. We Precision + Recall {2, 4}, min_samples_leaf ∈ {1, 2, 3}, class_weight ∈ report mean ± std across runs/seeds, and we extract TTT-style {balanced, balanced_subsample}. We use seeds {42, 1337} for thresholds for these metrics when relevant. reproducibility. 3 Results Classifier dependence. We use Random Forests for probabil- Figure 1 and Figure 2 show learning curves (averaged across ity outputs and fast retraining inside the AL loop. While AL’s relative gains often transfer across probabilistic classifiers, we seeds). Across the budget range, all three uncertainty-based poli- cies (entropy, margin, uncertainty) dominate the random baseline did not perform a systematic model sweep here. Evaluating lo- gistic regression and gradient-boosted trees under the same AL in both Accuracy and ROC AUC; the area under the learning curve (AULC) is consistently higher. protocol is left to future work. Table 3 summarizes KPIs used in the paper. At the most im- 2.4 portant targets, AL reaches the same performance with far fewer Pool-based AL loop labels: at ACC ≥ 0.92, AL needs about 500 labels vs. 1 040 for We follow the standard pool-based AL recipe: random (∼ 52% fewer); at AUC ≥ 0.98, AL needs 580 vs. 960 (1) Start with an initial labeled set of size 𝑖 and an unlabeled (∼ 40% fewer). Final metrics at the maximum budget are also pool. higher for AL (ACC 0.917±0.005 and AUC 0.983±0.002) than for 78 Active Learning for Power Grid Security Assessment: Reducing Simulation Cost with Informative Sampling SiKDD 2025, October 6th, 2025, Ljubljana, Slovenia random . 
At targets Precision/Recall/F1 ≥ 0.90 and FNR ≤ 0.10, the uncertainty-based policies consistently hit the thresholds earlier on the average curves, confirming that the AL gains are not specific to a single metric. Shaded bands (std across runs) show the same ordering stability observed for ACC/AUC. Full KPI values and TTT thresholds for P/R/F1/FNR are provided in Appendix A.2 (Table 4b). Next, we compare label efficiency using Time-to-Target (TTT). Figures 3 and 4 show TTT for accuracy targets 0.90 and 0.92, while Figures 5 and 6 show TTT for AUC targets 0.97 and 0.98. At the easy target ACC ≥ 0.90 all strategies reach the goal after about 100–120 labels (uncertainty sometimes at 120 due to seed/batch noise). At the more demanding ACC ≥ 0.92 target, Figure 1: Accuracy vs. total labeled samples (mean ± std ∼ active-learning policies need about 500 labels, whereas random needs 1 040 (i.e., 52% fewer labels). For AUC ≥ 0.97, AL reaches across runs). (Note: entropy, margin, and uncertainty overlap almost 275 the target at labels vs. 325 for random (∼15% fewer), and perfectly on this dataset—so the three AL curves/bands lie on top of each for AUC ≥ 0 98 at 580 vs. 960 (∼40% fewer). These reductions . other; Random is shown separately for contrast) translate directly into lower simulation time when the average time per labeling call is roughly constant. Figure 2: ROC AUC vs. total labeled samples (mean ± std across runs). (Note: entropy, margin, and uncertainty overlap almost Figure 3: TTT (Accuracy ≥ 0.90): computed on the strategy- perfectly on this dataset—so the three AL curves/bands lie on top of each level average curve; per-run variability (mean ± std) is other; Random is shown separately for contrast) reported in Appendix. Table 2: Final test metrics at maximum budget (mean ± std across runs). 
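The three uncertainty-based query strategies compared above can be written compactly. The following is a minimal, self-contained sketch with invented toy probabilities, not the paper's implementation (which scores Random Forest posteriors from scikit-learn inside the full AL loop):

```python
import math

# Acquisition scores for pool-based AL (higher = more informative).
# p is a list of class-probability vectors, one per pool sample,
# as produced by e.g. a random forest's predict_proba.

def least_confident(p):
    return [1.0 - max(probs) for probs in p]

def margin(p):
    # negative gap between the two most probable classes
    out = []
    for probs in p:
        top2 = sorted(probs, reverse=True)[:2]
        out.append(-(top2[0] - top2[1]))
    return out

def entropy(p):
    return [-sum(q * math.log(q) for q in probs if q > 0) for probs in p]

def select_batch(scores, b):
    # indices of the b highest-scoring pool samples to send to the simulator
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:b]

pool_probs = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
print(select_batch(entropy(pool_probs), 2))  # → [0, 2]
```

On binary problems all three scores are monotone transformations of max_c p_c(x), which is consistent with the near-identical curves reported for entropy, margin, and least-confident sampling.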
Table 2: Final test metrics at maximum budget (mean ± std across runs).
  Strategy      Accuracy        ROC AUC
  entropy       0.917 ± 0.005   0.983 ± 0.002
  margin        0.917 ± 0.005   0.983 ± 0.002
  uncertainty   0.917 ± 0.006   0.983 ± 0.004
  random        0.916 ± 0.010   0.977 ± 0.004

On high AUC values. The time-aware split still yields a separable test set for this case study (AUC ≈ 0.98). This likely reflects informative steady-state features and balanced classes, not overfitting to the test era. That said, harder, more imbalanced systems may reduce AUC and amplify AL gains; we treat this as a scope limitation.

Precision, Recall, F1 and FNR. The additional metrics mirror the ACC/AUC trends: entropy, margin, and uncertainty produce higher AULC and reach target quality with fewer labels than random. At targets Precision/Recall/F1 ≥ 0.90 and FNR ≤ 0.10, the uncertainty-based policies consistently hit the thresholds earlier on the average curves, confirming that the AL gains are not specific to a single metric. Shaded bands (std across runs) show the same ordering stability observed for ACC/AUC. Full KPI values and TTT thresholds for P/R/F1/FNR are provided in Appendix A.2 (Table 4b).

Overall, uncertainty-based AL strategies consistently beat random at the harder targets (ACC 0.92 and AUC 0.98) while performing similarly at the easier ACC 0.90 threshold; final performance at the maximum budget remains high (ACC 0.917±0.005, AUC 0.983±0.002 for AL vs. ACC 0.916±0.010, AUC 0.977±0.004 for random).

Figure 4: TTT (Accuracy ≥ 0.92): computed on the strategy-level average curve; per-run variability (mean ± std) is reported in the Appendix.

Figure 5: TTT (AUC ≥ 0.97): computed on the strategy-level average curve; per-run variability (mean ± std) is reported in the Appendix.

Figure 6: TTT (AUC ≥ 0.98): computed on the strategy-level average curve; per-run variability (mean ± std) is reported in the Appendix.

Table 3: KPIs by strategy (averaged across runs). Final metrics reflect per-run means; see Table 2 for mean ± std.
  Strategy      AULC acc   AULC auc   TTT acc≥0.90   TTT acc≥0.92   TTT AUC≥0.97   TTT AUC≥0.98   Final ACC   Final AUC
  entropy       0.92       0.98       100            500            275            580            0.917       0.983
  margin        0.92       0.98       100            500            275            580            0.917       0.983
  random        0.91       0.97       100            1 040          325            960            0.916       0.977
  uncertainty   0.92       0.98       120            500            275            580            0.917       0.983

4 Conclusion
This paper demonstrates that AL is a viable strategy for reducing simulation costs in power-grid security assessment. By selectively querying informative contingencies, we cut labels (and thus simulator calls) by about 52% at ACC ≥ 0.92 (500 vs. 1 040 with random) and about 40% at AUC ≥ 0.98 (580 vs. 960), without sacrificing final performance (AL: ACC 0.917±0.005, AUC 0.983±0.002; random: ACC 0.916±0.010, AUC 0.977±0.004; see Table 2). Fewer simulator calls translate into shorter training times and lower computational and memory requirements, which are particularly important for real-time or resource-constrained applications. Moreover, integrating AL within a digital-twin pipeline enables a feedback loop in which the classifier continuously refines itself using only the most informative contingencies. These findings suggest that exhaustive N-1 simulations are not always necessary for reliable security assessment, paving the way for more scalable and efficient grid-analysis tools.

The present study focuses on a single test system and a Random Forest classifier. In future work we plan to evaluate the proposed framework on larger and more diverse grid topologies (e.g., IEEE 39-bus, 118-bus or national transmission networks) and under varying operating conditions. Another direction is to explore more advanced models such as gradient-boosting machines, deep neural networks or graph neural networks, which may capture complex relationships among grid variables. We also intend to investigate alternative sampling strategies, including diversity-based selection, query-by-committee and Bayesian AL, to further improve label efficiency. Finally, extending the methodology to multi-contingency (N-k) and dynamic security assessments (e.g., transient stability) will broaden its applicability in future smart-grid deployments.

Reproducibility
Code, analysis scripts, and a dataset to reproduce all figures and tables will be released at https://github.com/HumAIne-JSI/smart-energy-ea.

Acknowledgements
This work was supported by the European Union's funded Project HUMAINE [grant number 101120218]. The authors acknowledge the use of LLMs for language optimization. While the LLMs contributed to enhancing efficiency and refining the presentation of this work, all conceptual frameworks, analyses, and interpretations remain the sole responsibility of the authors.

References
[1] José-María Hidalgo Arteaga, Fiodar Hancharou, Florian Thams, and Spyros Chatzivasileiadis. 2019. Deep learning for power system security assessment. In 2019 IEEE Milan PowerTech. IEEE, 1–6.
[2] Jiyu Huang, Lin Guan, Yinsheng Su, Haicheng Yao, Mengxuan Guo, and Zhi Zhong. 2021. System-scale-free transient contingency screening scheme based on steady-state information: A pooling-ensemble multi-graph learning approach. IEEE Transactions on Power Systems 37, 1 (2021), 294–305.
[3] Kip Morison, Lei Wang, and Prabha Kundur. 2004. Power system security assessment. IEEE Power and Energy Magazine 2, 5 (2004), 30–39.
[4] Costas Mylonas, Titos Georgoulakis, and Magda Foti. 2024. Facilitating AI and System Operator Synergy: Active Learning-Enhanced Digital Twin Architecture for Day-Ahead Load Forecasting. In 2024 International Conference on Smart Energy Systems and Technologies (SEST). IEEE, 1–6.
[5] Zhongtuo Shi, Wei Yao, Yong Tang, Xiaomeng Ai, Jinyu Wen, and Shijie Cheng. 2023. Intelligent power system stability assessment and dominant instability mode identification using integrated active deep learning. IEEE Transactions on Neural Networks and Learning Systems 35, 7 (2023), 9970–9984.

A Additional Results

A.1 Time-to-Target Variability Across Runs

Table 4: Additional KPI summaries and TTT variability across runs.

(a) Per-run Time-to-Target (TTT) mean ± std (labels) by strategy. Note: values here are per-run TTT (mean ± std); the TTT bars in Figures 3–6 are computed on the averaged curve.
  Threshold    Strategy      TTT (mean ± std)
  ACC ≥ 0.90   entropy       384 ± 207
               margin        384 ± 207
               uncertainty   372 ± 208
               random        440 ± 225
  ACC ≥ 0.92   entropy       751 ± 359
               margin        751 ± 359
               uncertainty   751 ± 359
               random        897 ± 318
  AUC ≥ 0.97   entropy       432 ± 178
               margin        432 ± 178
               uncertainty   432 ± 178
               random        502 ± 352
  AUC ≥ 0.98   entropy       803 ± 304
               margin        803 ± 304
               uncertainty   803 ± 304
               random        1029 ± 386

A.2 Precision/Recall/F1/FNR KPIs and TTT Thresholds

(b) KPIs by strategy for Precision, Recall, F1, and derived FNR. Time-to-Target (TTT) is the number of labels to reach the threshold (e.g., Precision ≥ 0.90, Recall ≥ 0.90; for FNR, TTT corresponds to FNR ≤ 0.10).
  Strategy      AULC P   TTT P≥0.90   Final P   AULC R   TTT R≥0.90   Final R   AULC F1   TTT F1≥0.90   Final F1   TTT FNR≤0.10   Final FNR
  entropy       0.948    500          0.952     0.916    500          0.922     0.931     500           0.936      500            0.078
  margin        0.948    500          0.952     0.916    500          0.922     0.931     500           0.936      500            0.078
  random        0.928    960          0.940     0.905    1040         0.906     0.916     1040          0.922      1040           0.094
  uncertainty   0.948    500          0.952     0.916    500          0.922     0.931     500           0.936      500            0.078


Supporting Material Reuse in Drone Production

Rok Cek, Jožef Stefan Institute, Ljubljana, Slovenia, rok.cek@gmail.com
Oleksandra Topal, Jožef Stefan Institute, Ljubljana, Slovenia, oleksandra.topal@ijs.si
Linda Leonardi, CETMA, Brindisi, Italy, linda.leonardi@cetma.it
Margherita Forcolin, Maggioli Group, Santarcangelo di Romagna, Italy, margherita.forcolin@maggioli.gr
Klemen Kenda, Jožef Stefan Institute, Ljubljana, Slovenia, klemen.kenda@ijs.si

Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.20

Abstract
This paper, part of the European Horizon project Plooto, details an end-to-end, data-driven framework for reusing expired carbon-fiber prepregs in drone production. First, 19 batches of expired prepregs were tested, revealing that most remained usable within the first year after expiration. Machine learning models were then developed to predict material usability pre-production and product quality post-production, using manufacturing data and time-series features. To facilitate this process, a dedicated data pipeline and an interactive Product Quality Explorer tool were created to support explainable model development and integration with industrial partners. This framework demonstrates how combining material requalification with data-driven predictions can lower costs and support circularity in drone production.

Keywords
circular economy, digital product passport, machine learning, product quality

1 Introduction
The growing demand for lightweight, high-performance materials is driving the increased use of carbon fiber reinforced polymers (CFRPs) in industries such as aerospace, automotive, and drones. However, this rapid adoption also creates challenges, particularly with the accumulation of expired materials. While much research has focused on recycling fully cured CFRPs, less attention has been given to the reuse of uncured prepregs, which, despite expiring during storage, can still retain valuable properties [5]. Addressing this challenge is crucial for advancing circular economy principles in high-tech manufacturing.

This paper presents research from the European Horizon project Plooto, focusing on the reuse of expired prepregs in sustainable drone production. Our work contributes in three key areas: (1) a comprehensive evaluation of the effects of aging on expired prepregs through thermal, chemical, and mechanical testing to establish requalification thresholds [1], (2) the development of machine learning models to predict the usability of expired prepregs before production, and (3) the application of predictive models to assess the quality of final products after production, specifically for sandwich panels made from recycled prepregs.

By combining experimental testing with data-driven methods, our findings highlight the potential to reduce waste and enhance sustainability in drone manufacturing. By integrating machine learning models to predict the usability of expired prepregs and assessing the quality of final products, we provide industrial partners with actionable insights that directly enhance operational decision-making. The combination of material requalification and predictive analysis supports the sustainability goals of the drone production process.

2 Data and Methods

2.1 Materials and experimental techniques used for prepreg usability assessment
Expired rolls of epoxy prepregs from HP Composites S.p.A. were used for this study. A total of 19 prepreg batches were investigated, comprising four different resin systems (ER450, IMP509, X1, ER431), with reinforcement varying according to supplier availability. Usability is assessed through periodic chemical-physical and mechanical testing after the expiration date, to monitor property changes in materials stored at −18 °C. Differential Scanning Calorimetry (DSC) tests were performed with a Mettler Toledo DSC 823e on uncured prepreg samples by applying dynamic heating from −40 °C to 250 °C at 20 °C/min under a nitrogen atmosphere. DSC analysis provides two key parameters: the glass transition temperature of the uncured system (Tg0), related to the initial crosslink density, and the residual cure degree (α), calculated from the polymerization enthalpy values.

Composite plates for physical and mechanical testing were manufactured by draping a variable number of prepreg plies at 0°, depending on reinforcement type, to obtain cured laminates of ≈ 3 mm. The prepreg plies were stacked on a flat mold surface over a peel ply. The plates were then covered with an additional peel ply, a release film, and a breather layer. A self-adhesive seal and a vacuum bag were used to maintain a sealed vacuum during the entire process. Plate curing was carried out in a hot press according to the curing cycle recommended by the supplier in the material datasheet, as reported in Table 1. The void content (Vc) was measured on five specimens through a digestion procedure according to standard ASTM D3171 Method A [3]. The interlaminar shear strength (ILSS) tests were performed with a 3-point bending system on an MTS Insight machine according to the standard test ASTM D2344 [2] on five different specimens for each prepreg batch.

These experimental results, including DSC data, ILSS, and void content (Vc) measurements, provide essential features for the machine learning models discussed in Section 2.2. The values of key properties such as the glass transition temperature (Tg0), residual cure degree (α), and interlaminar shear strength (ILSS) are directly used to predict the usability of the expired prepregs and to assess the quality of the final products after manufacturing.

Table 1: Curing cycle parameters for the plates recommended in the material datasheet.
  Material   Temperature (°C)   Time (h)   Pressure (bar)
  ER 450     135                2          6
  IMP 509    140                1.5        4
  X1 120     130                1.5        6
  ER 431     125                1          5

2.2 Predicting the usability and key parameters of prepreg using machine learning methods
The results from the DSC tests, along with other experimental data such as ILSS and void content (Vc) collected in Section 2.1, were systematically organized and used as input features for the machine learning models to predict prepreg usability and key process parameters. Each row represents one checkpoint on an expired roll and includes: test date, month code, prepreg code and lot, type (expired roll), stocking temperature (−18 °C), original expiry date, α (%), Tg,onset (°C), ILSS (MPa), Vc, curing temperature (°C), Usable (Y/N), and, when redefinition is applied, pressure (bar), temperature (°C), time (min), and the redefined expiry date. For the correct operation of machine-learning methods, a days-after-expiry feature was introduced and computed as test_date − original_expiry_date.
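As a concrete illustration, such a derived feature can be computed with Python's standard library; this is a minimal sketch with invented dates, not the project's pipeline code:

```python
from datetime import date

# Days between a usability test and the roll's original expiry date;
# positive once the material has expired (dates below are invented examples).
def days_after_expiry(test_date, original_expiry_date):
    return (test_date - original_expiry_date).days

print(days_after_expiry(date(2024, 7, 1), date(2023, 12, 31)))  # → 183
```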
For example, the vacuum difference quantified the gap − original_expiry_date . between the measured and target pressure, while the temperature ods, a days-after-expiry feature was introduced and computed as The study addresses two predictive tasks: a classification prob- difference measured the offset between chamber setpoints and lem for Usable (three classes: Y, Y/N, N) and regression problems the actual values recorded. These derived variables provided in- for process/quality parameters (ILSS,𝑇𝑔,onset, 𝑉𝑐 , 𝛼 ). Analysis pro- dicators of process deviations that might affect the final product ceeds in two stages. First, a per-material stage fits separate models quality. for each prepreg system (ER450, IMP509, ER431, X1) to resolve material-specific issues observed during preliminary inspection. The analysis followed the CRISP-DM methodology, beginning with data fusion and preparation, followed by feature selection Second, a pooled stage trains a unified model over all records to and model training. Metadata and time-series features were com- evaluate cross-material generalisation. bined into a single dataset, from which irrelevant or redundant Predictors are restricted to pre-test covariates: days-after- variables were removed. expiry, material identity, normalised lot descriptors, month code, For predictive modelling, several classification algorithms storage conditions, and other metadata available at decision time, were evaluated to balance interpretability and performance. Lo- while measured targets are excluded from inputs to prevent label leakage. Random-forest classifiers and regressors (scikit-learn) pa- gistic regression and decision trees offered transparent decision rameterised as estimators=100, max _𝑑𝑒 𝑝𝑡 ℎ=3, random_state=42 𝑛 boundaries, while ensemble methods such as random forests and gradient boosting provided stronger predictive power by aggre- serve as the base models and enable inspection of feature impor- gating multiple weak learners. 
Multi-layer perceptrons (MLP) were also considered to capture non-linear patterns in the data. Performance estimation relies on leave-one-out cross-validation (LOO-CV) [6] in both stages. For the classification task, overall accuracy is reported to evaluate the model's performance in predicting prepreg usability. For the regression tasks, MAE, R², and RMSE are used to assess the model's ability to predict continuous process parameters. R² measures the proportion of variance explained, while MAE provides the average error magnitude, and RMSE emphasizes larger errors. Feature-importance profiles are examined to identify the dominant drivers of re-usability and variation in process parameters across materials and in the pooled setting.

To integrate the methodology into the production workflow, a dedicated service was implemented. Metadata was provided in an Excel (.xlsx) file, while the process data was provided in .rdb format by the industrial partner. A pipeline was developed to automatically download these files from a shared Dropbox folder provided by the industrial partner, parse the .rdb data, and convert the files into structured JSON files. The JSON files were enriched with derived variables and unique identifiers, then uploaded to the Plooto platform via its API. This ensured seamless integration of raw production data with machine learning models, enabling continuous prediction of product quality.

2.3 Machine Learning for Post-Production Quality Prediction

This part of the pilot addressed the prediction of production quality in sandwich panel manufacturing, with the aim of supporting drone production after re-qualification. As part of this work, we developed a tool called Product Quality Explorer to support domain experts in analyzing production data and assessing product quality [4]. Its primary goal is to facilitate the creation of explainable machine learning models. The tool helps users understand factors influencing quality outcomes and make informed adjustments to the manufacturing process.

Supporting Material Reuse in Drone Production, Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia

The tool provides a summary of descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) and allows users to visualize selected columns through histograms and boxplots. Finally, it generates a heatmap of all columns to provide an overview of relationships within the data. In the next step, the user selects the features to include in the machine learning model. This step is necessary both to define the target variable for prediction and to exclude irrelevant columns such as IDs, dates, or textual data. The tool also provides several options for handling missing values. The user can choose the approach that best suits the dataset: leaving missing values unchanged (which may prevent some algorithms from functioning properly), removing features with missing values, removing rows containing missing values, or imputing missing values using the column mean.

The next step provides the option to generate new attributes. This can be done through techniques such as one-hot encoding, polynomial feature generation, or logarithmic transformations. After creating new attributes, the user selects the features to be used in the machine learning process. This selection can be performed manually or automatically with the assistance of genetic algorithms.

Finally, the user can select which machine learning models to apply. Once training is complete, the results are presented in a summary table containing performance metrics such as precision, recall, F1-score, and accuracy, along with a confusion matrix visualization. The tool also provides a comparative overview of model performance across all metrics (precision, recall, F1-score, accuracy). In addition to evaluation, the system integrates explainability techniques. Global explanations are generated using SHAP to show how features influence model decisions across the entire dataset, while local explanations are provided using SHAP and LIME to illustrate how the model arrived at a prediction for a specific datapoint. These explanations are supported by interactive visualizations, which enable users to better understand both the overall model behavior and individual predictions.

3 Results

3.2 Predictive modeling results for prepreg reuse

We analysed N = 81 inspection records with a two-stage workflow: a global model across all prepregs and material-specific models were trained and estimated using leave-one-out cross-validation (LOO-CV). Table 2 summarizes the results of all experiments, including classification and regression performance for global and material-specific models.

Table 2: LOO-CV performance across prepregs for regression and classification

Type       | Usability  | Metric | α    | Tg0  | ILSS | Vc
-----------|------------|--------|------|------|------|-----
All types  | Acc = 0.91 | AggR2  | 0.83 | 0.77 | 0.70 | 0.77
           |            | MAE    | 1.22 | 1.05 | 4.49 | 1.52
           |            | RMSE   | 1.59 | 1.33 | 5.93 | 1.98
ER450      | Acc = 0.96 | AggR2  | 0.86 | 0.88 | 0.92 | 0.94
           |            | MAE    | 1.25 | 0.54 | 2.75 | 0.87
           |            | RMSE   | 1.51 | 0.77 | 4.05 | 1.15
IMP509     | Acc = 0.87 | AggR2  | 0.76 | 0.60 | 0.82 | 0.80
           |            | MAE    | 1.44 | 1.23 | 2.50 | 1.35
           |            | RMSE   | 1.90 | 1.58 | 3.01 | 1.75
X1         | Acc = 0.96 | AggR2  | 0.82 | 0.79 | 0.79 | 0.43
           |            | MAE    | 1.12 | 0.98 | 2.41 | 1.77
           |            | RMSE   | 1.44 | 1.12 | 3.09 | 2.32
ER431      | Acc = 1.00 | AggR2  | 0.97 | 0.88 | 0.94 | 0.87
           |            | MAE    | 0.57 | 0.89 | 1.43 | 1.06
           |            | RMSE   | 0.76 | 1.15 | 1.93 | 1.64

As we can see from the presented results, the global multi-class classifier achieved 0.91 accuracy under LOO-CV on an imbalanced set (54 Y / 14 Y-N / 13 N), indicating that a simple pre-production screen is feasible from routine metadata. Per-material classifiers scored even higher (often ≥ 0.96), but these figures are almost certainly optimistic given tiny per-material sample sizes and class imbalance. A detailed classification report, including precision, recall, and F1 scores, can be provided upon request.
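The evaluation protocol behind Table 2 (LOO-CV with MAE, RMSE, and R² as defined above) can be sketched in a few lines of plain Python. The one-nearest-neighbour stand-in and the toy data below are illustrative assumptions only, not the models or measurements used in the study.

```python
# Sketch of leave-one-out cross-validation (LOO-CV) with the regression
# metrics reported in Table 2: MAE, RMSE, and R^2 (variance explained).
import math

def loo_cv_metrics(X, y, fit_predict):
    """Hold out each sample once; return (MAE, RMSE, R2) over all folds."""
    preds = []
    for i in range(len(y)):
        X_tr = X[:i] + X[i+1:]          # train on everything except sample i
        y_tr = y[:i] + y[i+1:]
        preds.append(fit_predict(X_tr, y_tr, X[i]))
    n = len(y)
    mae = sum(abs(p - t) for p, t in zip(preds, y)) / n
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, y)) / n)
    mean_y = sum(y) / n
    ss_res = sum((t - p) ** 2 for p, t in zip(preds, y))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Placeholder model: 1-nearest neighbour on one feature (e.g. days since expiry).
def nn1(X_tr, y_tr, x):
    j = min(range(len(X_tr)), key=lambda k: abs(X_tr[k] - x))
    return y_tr[j]

X = [1, 2, 3, 10, 11, 12]               # toy feature values
y = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9]      # toy targets
mae, rmse, r2 = loo_cv_metrics(X, y, nn1)
```

Because every sample serves as the test point exactly once, LOO-CV uses all N observations for evaluation, which is why it suits the small per-material subsets here and also why its estimates can be optimistic.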
3.1 Results of usability assessment

Ageing trends from DSC. Differential scanning calorimetry (DSC) on the selected prepreg rolls (grouped by resin system) shows that Tg0 increases progressively over time after expiration. This behaviour is consistent with i) increasing molecular weight and ii) higher crosslink density of the polymer network due to ongoing polymerization. The measured α values align with this trend, indicating a time-dependent decrease in the residual degree of cure; notably, within the first two years after expiration, the reduction remains limited to <15%.

Mechanical strength and porosity evolution. Across all batches, interlaminar shear strength (ILSS) exhibits a time-dependent decline: reductions generally do not exceed 15% within the first 12 months after expiration, whereas more pronounced decreases of 25–30% occur in the 12–24 month interval. Consistent with this mechanical trend, the void content Vc remains below 10% during the first 12 months after expiration and increases thereafter, often exceeding 15% in later months.

A consistent trend across the regression tasks is the superior performance of models trained on a single prepreg type compared to the global model trained on all data. For instance, the global model predicted ILSS with an aggregate R² of 0.70, whereas the material-specific models for ER450 and ER431 achieved much higher scores of 0.92 and 0.94, respectively. This suggests that ageing and curing behaviours are highly specific to the resin system, and tailored models better capture these characteristics. However, this is not a universal rule; the prediction of Vc for the X1 prepreg (aggregate R² = 0.43) was notably worse than that of the global model (aggregate R² = 0.77), indicating that in cases of very limited data or less distinct features, the global model can be more robust. (Note: the dataset is modest and unevenly distributed across resins (ER450 n = 28, X1 n = 22, IMP509 n = 15, ER431 n = 14); consequently, per-material models are trained on few observations and LOO-CV performance is likely optimistic.)

Feature importance analysis performed during the experiments revealed the most influential factors in predicting the key parameters in Table 2. Days_Since_Expiry was consistently one of the most critical predictors across both global and material-specific models, confirming its fundamental role in tracking material degradation. Furthermore, the analysis revealed strong intercorrelations between the measured properties themselves. For example, the degree of cure (α) and Tg0 were often the most important features for predicting ILSS and Vc, indicating that these thermal and chemical properties are highly interdependent. Batch identifiers (prepreg code/lot) were generally minor, although lot occasionally ranked higher for ILSS, indicating possible batch effects.

Taken together, these patterns suggest that compact, physics-aligned feature sets explain most of the variance, and that ageing/α consistently drive both regression and classification. Nevertheless, limited data—especially for IMP509 and ER431—and the optimism of LOO-CV preclude production use without further data collection and validation across broader process conditions.

3.3 Evaluation of Post-Production Classification Models

The predictive modelling was applied to production cycles from sandwich panel manufacturing provided by the Italian pilot partners. We also used the aforementioned Product Quality Explorer tool after we had already transformed the data and created new features. The objective was to assess whether production quality outcomes could be predicted from a combination of metadata and process-derived time-series features. This is particularly important for supporting drone production after re-qualification, as early detection of potential quality issues can prevent defective panels from progressing further in the manufacturing chain. Moreover, it can save manufacturers time, energy, and personnel costs, as each panel must currently be manually inspected and tested.

The dataset comprised 294 production cycles, the majority of which were compliant, with only a small fraction classified as non-compliant. This strong imbalance reflects real-world conditions, where defects are rare but critical, yet it also creates difficulties for machine learning approaches. Most algorithms tend to favour the majority class, which can lead to high overall accuracy but poor detection of defective cases.

Several classification algorithms were tested. Overall accuracy values appeared relatively high (between 0.77 and 0.85), but this was largely driven by the correct classification of compliant cases. Performance on the minority (non-compliant) class was weaker, as reflected by modest recall and F1-scores. This indicates that while the models are well-suited to reproducing the majority outcome, their ability to identify rare defective panels is more limited.

These findings suggest that machine learning can provide useful insights into production quality trends, but further progress requires additional data, particularly more defective cases. A larger dataset would allow models to better distinguish between compliant and non-compliant cycles, thereby increasing their value as a decision-support tool in quality assurance. The detailed performance of each tested classifier is reported in Table 3.

Table 3: Performance of machine learning models on the Italian pilot sandwich panel dataset

Model                        | Accuracy | Precision | Recall | F1-Score
-----------------------------|----------|-----------|--------|---------
Logistic Regression          | 0.846    | 0.838     | 0.838  | 0.838
Decision Tree                | 0.769    | 0.764     | 0.738  | 0.745
Random Forest                | 0.808    | 0.797     | 0.806  | 0.800
XGBoost                      | 0.808    | 0.797     | 0.806  | 0.800
LightGBM                     | 0.846    | 0.838     | 0.838  | 0.838
Support Vector Machine (SVM) | 0.808    | 0.801     | 0.788  | 0.793
Multi-layer Perceptron (MLP) | 0.808    | 0.801     | 0.788  | 0.793

4 Conclusion

This study demonstrates an end-to-end approach that integrates material science and machine learning to enhance the reuse of expired prepregs in drone production. By evaluating and requalifying expired materials, we have shown that they remain serviceable within the first year after expiry, with gradual performance decline, particularly in interlaminar shear strength and curing behavior. This underscores the effectiveness of resin-specific reuse gates and modified processing windows to extend material lifetimes.

Machine learning models were employed to support both pre-production and post-production processes. The pre-production models classified expired prepregs for reuse, while the post-production models predicted the quality of sandwich panels based on combined metadata and process features. Despite challenges related to data imbalance, the results demonstrate the potential for predictive quality monitoring in manufacturing, contributing to more sustainable production practices.

The integration of machine learning with material science not only optimizes requalification processes and reduces waste, but also supports cost reduction and environmental sustainability in high-performance manufacturing. Future work should focus on expanding datasets, refining resin-specific criteria, and exploring the broader applicability of the models in other composite manufacturing contexts, further advancing circular economy principles.

Acknowledgements

This work was supported by the European Commission under the Horizon Europe project Plooto, Grant Agreement No. 101092008. We would like to express our gratitude to all project partners for their contributions and collaboration.

The authors acknowledge the use of LLMs for language optimization. While the LLMs contributed to enhancing efficiency and refining the presentation of this work, all conceptual frameworks, analyses, and interpretations remain the sole responsibility of the authors.

References

[1] Constance Amare, Olivier Mantaux, Arnaud Gillet, Matthieu Pedros, and Eric Lacoste. 2022. Innovative test methodology for shelf life extension of carbon fibre prepregs. IOP Conference Series: Materials Science and Engineering, 1226, 1, (Feb. 2022), 012101. https://dx.doi.org/10.1088/1757-899X/1226/1/012101.
[2] ASTM International. 2022. ASTM D2344/D2344M-22: Standard Test Method for Short-Beam Strength of Polymer Matrix Composite Materials and Their Laminates. West Conshohocken, PA, USA, (2022). Retrieved Sept. 3, 2025 from https://store.astm.org/d2344_d2344m-22.html.
[3] ASTM International. 2022. ASTM D3171-22: Standard Test Methods for Constituent Content of Composite Materials. West Conshohocken, PA, USA, (2022). Retrieved Sept. 3, 2025 from https://store.astm.org/d3171-22.html.
[4] Rok Cek and Klemen Kenda. 2025. Product quality explorer - determining product quality based on the digital product passport. In 17th Jožef Stefan International Postgraduate School Students' Conference: 28th–30th May: Book of abstracts: from research to reality, 33. http://ipssc.mps.si/auxiliary_material/IPSSC25%20BoA.pdf.
[5] Gaurav Nilakantan and Steven Nutt. 2015. Reuse and upcycling of aerospace prepreg scrap and waste. Reinforced Plastics, 59, 1, 44–51.
[6] Tzu-Tsung Wong. 2015. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognition, 48, 9, 2839–2846.
Temporal Dynamics and Causal Feature Integration for Predictive Maintenance in Manufacturing Systems: A Causality-Informed Framework

Seyed Iman Hosseini (iman.hosseini@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Klemen Kenda (klemen.kenda@ijs.si), Jožef Stefan Institute and Qlector, Ljubljana, Slovenia
Dunja Mladenić (dunja.mladenic@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

ABSTRACT

Predictive maintenance is increasingly central to manufacturing, where the goals are to reduce unplanned downtime and extend asset lifetimes. Conventional models often rely on correlations that insufficiently capture temporal dynamics and causal dependencies underlying failures. This study proposes a causality-informed feature-engineering pipeline that combines cross-correlation-derived lags with VARLiNGAM to construct lag-aware features from multivariate sensor streams, and evaluates it against standard time-series models using a time-aware split. Three machine-learning models—Random Forest, XGBoost, and Gradient Boosting—were trained and assessed by F1-score (rather than accuracy) on a single-machine subset of the Microsoft Azure Predictive Maintenance dataset (8,708 samples; 26 failures, ≈0.3% prevalence). XGBoost trained on raw temporal features achieved F1 ≈ 0.94 for longer prediction horizons (≥10 h) under time-series-aware cross-validation, with performance declining at shorter horizons as temporal context diminishes. In this setting, causality-informed features did not improve results over the raw-feature baseline. These findings indicate that, with data from a single machine, causal discovery is susceptible to overfitting and may suppress informative temporal patterns; broader, multi-machine datasets are likely required for causality-enhanced representations to yield consistent gains.

KEYWORDS

Predictive Maintenance, Causality, Time-Series Analysis, Machine Learning, VARLiNGAM, Manufacturing Systems

1 INTRODUCTION

The rising complexity and interconnectivity of industrial systems have accelerated the need for intelligent maintenance strategies that move beyond reactive and preventive paradigms. Predictive maintenance, driven by sensor data and machine learning, has emerged as a transformative approach to minimize unplanned downtime and optimize asset life cycles [1]. Traditional predictive maintenance models, however, often rely on statistical correlations that fail to capture the directionality and temporal dynamics inherent in real-world system failures [6].

To address these limitations, this study proposes a causality-informed framework for predictive maintenance that leverages temporal causal discovery techniques, such as Vector Autoregressive LiNGAM (VARLiNGAM), to engineer predictive features from multivariate sensor data. Our approach integrates cross-correlation analysis and lag-optimized causal graphs to detect failure precursors and identify their optimal predictive windows. We hypothesize that the observed lack of competitive advantage for causality-informed models, especially when applied to data from a single machine, arises from the limited operational diversity and failure variability. This limitation may cause models to overfit to machine-specific correlations and exclude informative temporal features, thereby hindering their generalizability. Testing this hypothesis through multi-machine datasets will be a key focus of future work.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society, 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.12

2 RELATED WORK

Causality in time series analysis has become increasingly critical in predictive maintenance, particularly within industrial and manufacturing domains, where early failure detection plays a pivotal role in minimizing operational disruptions and financial losses [5]. Classical statistical models have been widely used to infer causal relationships between sensor measurements and machine states, yet they often fail to capture complex temporal dynamics and the nonlinear relationships inherent in real-world system failures.

Recent studies have explored advanced causal inference techniques to enhance fault prediction. Wang S. et al. proposed a framework for fault diagnosis that integrates spatiotemporal dependencies, demonstrating improved predictive accuracy in chemical manufacturing systems [9]. While their work advances reliability in industrial diagnostics, it lacks the flexibility to generalize across diverse application domains. On the other hand, Cui et al. introduced a deep learning framework that enhances predictive maintenance by integrating causal reasoning and long-sequence multivariate time-series data, significantly improving predictive performance and interpretability [3]. Despite this, the challenge of automating temporal feature engineering and seamlessly deploying models across different domains remains. Yang X. et al. contributed to the growing literature on data-driven causal analysis by incorporating dynamic latent variables and probabilistic graphical models into causal modeling frameworks [10]. However, these models have yet to fully address the temporal feature extraction required for scalable deployment in real-world predictive maintenance applications. Furthermore, more recent work by Wang Q. et al. introduced a Causal Graph Convolution Module that adapts causal discovery within time-series prediction [8], but their approach is still dependent on complex model adjustments across domains.

In this study, we propose a novel framework that integrates lagged correlation with causal analysis techniques to detect failure precursors and quantify their lead times. This framework automates temporal feature engineering and is designed for diverse real-world applications across manufacturing settings, without requiring extensive architectural modifications. The automation of temporal feature engineering and its seamless deployment across comparable manufacturing environments remains a significant challenge, and extending generalization beyond this domain is left for future work.

3 EXPERIMENT

Our experimental methodology followed a sequential four-stage process to construct and validate a robust failure prediction model, as shown in Figure 1. The first stage involved performing a cross-correlation analysis between each sensor's time-series data and the target failure events to determine the optimal predictive time lag, which guided the subsequent steps. In the second stage, the identified optimal lag was used to parameterize a Vector Autoregressive LiNGAM (VARLiNGAM) model, which generated a directed acyclic graph (DAG) representing the causal relationships and effect strengths between sensor variables and the failure event. The third stage focused on creating a causality-informed feature vector by integrating standard statistical metrics from rolling time windows along with advanced features informed by the causal analysis, using the correlation strengths and causal effect strengths derived from the VARLiNGAM model to select and weight features based on their respective optimal and causal lags. Finally, in the fourth stage, the enriched feature set was fed into a machine learning pipeline, employing a time-based data split to prevent look-ahead bias, and training several classification models, including Random Forest, XGBoost, and Gradient Boosting, to assess the effectiveness of the causality-informed approach for predictive maintenance. This integrated approach enhances the predictive capabilities of machine learning models, offering a robust solution for failure prediction in industrial settings.

3.1 Dataset and Preprocessing

We used the Microsoft Azure Predictive Maintenance Dataset [2], which provides hourly telemetry (voltage, rotation, pressure, vibration) plus maintenance records, failure events, incident reports, and machine metadata for 100 machines over 12 months in 2015 (over 800k hourly summaries and thousands of non-failure error entries). For this study, we restricted the analysis to machine ID 98; after cleaning and merging the sources, we constructed a causality-informed feature vector and standardized features across modalities. Cross-correlation suggested predictive lags of 1–24 hours, so we derived lagged/statistical features from six primary variables (voltage, rotation, pressure, vibration, age, error type). The final dataset comprised 8,708 samples with 26 failures (≈0.3%), indicating strong class imbalance [7, 2]. The feature set comprised 150 causality-informed features and 36 features without causal information.

3.2 Cross-correlation Analysis

Cross-correlation analysis examines the correlation between two time series as a function of the time lag applied to one of them [11][12]. Unlike simple correlation, which measures linear relationships at a single point in time, cross-correlation reveals how variables relate across different time delays, making it particularly valuable for identifying lead-lag relationships and temporal dependencies. The initial phase of our experimental framework involved a cross-correlation analysis to empirically determine the predictive temporal relationships between sensor signals and equipment failures. For each sensor, we computed the Pearson correlation coefficient between its time series and the binary failure time series across a range of discrete time lags. This procedure was executed by systematically shifting the failure signal backward in time, which allowed for the correlation of sensor readings at a given time t with failure events at a future time t + lag. The optimal predictive lag for each sensor was then identified as the time lag that yielded the maximum absolute correlation value. This analysis is critical as it quantifies the time window in which each sensor's data is most informative for forecasting an impending failure, thereby providing an empirical foundation for the subsequent causal discovery and feature engineering stages.

Figure 2: Cross-correlation analysis

In the cross-correlation plot shown in Figure 2, the red star annotated on each sensor's curve denotes the optimal predictive lag—20 hours for Pressure, 14 hours for Vibration, and so forth. This marker identifies the specific time lag, measured in hours, at which the sensor's signal exhibits the highest absolute Pearson correlation with the future failure event.
Consequently, the Figure 1: proposed framework red star highlights the most influential temporal offset for each variable, effectively quantifying the sensor’s most informative predictive window within the 24-hour forecasting horizon. 87 A Causality-Informed Framework Information Society, 2025, Ljubljana, Slovenia 3.3 Causal Graph Construction To elucidate the causal interdependencies between sensor signals and equipment failures, a causal graph was constructed using VARLiNGAM. This methodology first employs a Vector Autore- gression (VAR) model to capture the linear, time-lagged relation- ships among the multivariate sensor time series. The optimal lag for the VAR model was adaptively informed by the preced- ing cross-correlation analysis to focus on the most predictive temporal window. Following the VAR estimation, the LiNGAM algorithm is applied to the resulting model residuals, or innova- tions. By exploiting the non-Gaussian nature of these innova- tions, LiNGAM uniquely identifies the contemporaneous causal structure—the instantaneous effects between variables—and de- termines the direction of influence, thereby producing a directed Figure 3: Causal Graph acyclic graph (DAG). The final output is a set of adjacency ma- trices representing the causal graph, where each non-zero entry quantifies the strength and direction of a causal link from one analysis that selects per-sensor optimal prediction windows. Us- variable to another at a specific time lag. Our approach constructs ing a sliding feature window (typically 72 h), samples are formed a directed causal graph from time-series sensor data using the from historical data only to avoid leakage. Feature construction following steps: basic statistics proceeds in four stages: (1) (mean, standard devi- (1) Chronologically sort sensor ation, min/max, latest/earliest within the window); (2) Data Sorting and Integrity: causality- data, verifying integrity and noting irregular intervals. 
computed at the optimal lags iden- aligned temporal features (2) Define variables which are vibra- tified by causal analysis; (3) via trend slopes (linear Variable Definition: dynamics tion, rotation, pressure, voltage, and a binary failure indi- regression), rolling volatility (standard deviation), and rates of cator as the target node. change; and (4) implied by the causal graph cross-feature terms (3) Configure a VARLiNGAM [4] model (e.g., voltage/rotation ratios and pressure–vibration correlations). Causal Model Setup: with a specified lag order and BIC-based pruning. Targets are defined for multiple horizons (1, 6, 12, and 24 h ahead) (4) Fit the model to the prepared data matrix, to enable early warnings at different lead times. The resulting Model Fitting: applying regularization—by adding small Gaussian noise dataset contains 150 features that integrate causal dependencies (e.g., 10−6 )—when numerical instability arises during VAR- with temporal patterns. LiNGAM causal graph construction due to ill-conditioned matrices. 3.5 Machine Learning Models (5) Extract adjacency matrices to Adjacency Extraction: Three classification algorithms, each configured with default identify directed edges, effect strengths, and correspond- hyperparameters, were evaluated using time-based data parti- ing lags. tioning to mitigate the risk of data leakage. (6) Assemble the causal graph, catego-Graph Assembly: rizing edges by their relation to the target and between • Random Forest (RF): Ensemble method with 200 estima- tors, maximum depth of 15, and balanced class weights sensor variables. 
• XGBoost (XGB): Gradient boosting with 200 estimators, This workflow ensures that temporal ordering is respected learning rate of 0.1, and automatic scale balancing and that detected causal links most likely represent meaningful • Gradient Boosting (GB): Scikit-learn implementation relationships for predictive maintenance and further analytical with 200 estimators and 0.8 subsample ratio investigations. Figure 3 presents the causal graph generated by Model performance was assessed using F1 Score metric appro- the VARLiNGAM algorithm, illustrating the network of causal priate for imbalanced classification: relationships between sensor telemetry (volt, pressure, vibration, rotate), machine properties (age), and the target failure event. In • F1-Score: Harmonic mean of precision and recall this graph, nodes represent the variables, and the directed edges A time-series–aware data partitioning strategy was imple- (arrows) signify the direction of causality, with edge thickness mented using scikit-learn’s , which generates folds TimeSeriesSplit corresponding to the strength of the effect. The labels on each in chronological order by progressively expanding the training edge quantify the causal strength and the time delay (lag) in hours. set with earlier observations and reserving subsequent periods The analysis reveals a complex web of interactions, prominently for testing. This procedure ensures that all training data tem- highlighting that machine age is the most significant causal driver porally precedes the corresponding test data. To approximate of failure, with an exceptionally strong effect strength at a lag of stratification and preserve class balance between rare failure and 6 hours. Other notable, though weaker, causal pathways are also more frequent non-failure events, the folds were constructed identified, such as the influence of rotate on failure. 
This causal to proportionally distribute failure cases across splits without structure provides critical insights into the system’s dynamics, introducing randomization. This design maintains the tempo- identifying the key variables and time-delayed interactions that ral integrity of the sensor data while supporting reliable model precede a failure event. evaluation. 3.4 Causality-Informed Feature Engineering 4 RESULTS AND DISCUSSION We prepared the data by building a Figure 4 presents the comprehensive F1-score evaluation of all causality-informed feature vec- tor grounded in the paper’s causal graph and a temporal causality three models, while Figure 5 provides a comparative analysis 88 Information Society, 2025, Ljubljana, Slovenia Seyed Iman Hosseini, Klemen Kenda, and Dunja Mladenič 5 FUTURE WORKS While this study establishes a robust, domain-agnostic frame- work for failure prediction, future work will focus on enhancing its transparency and causal reasoning capabilities. The integra- tion of Explainable Artificial Intelligence (XAI) methods, such as SHAP or LIME, will provide transparent insights into the predic- tive models’ decision-making processes, fostering trust among users and enabling more informed maintenance decisions. Ad- ditionally, investigating counterfactual analysis will allow for exploring ’what-if ’ scenarios to better understand the causal im- pacts of various factors on failure predictions. Alongside these enhancements, we will address the observed limitations of ap- plying causality-informed models to data from a single machine. Specifically, we hypothesize that the lack of competitive advan- Figure 4: F1-score evaluated over a 20-hour prediction hori- tage stems from the limited operational diversity and failure zon variability of a single-machine dataset, leading to overfitting. 
Figure 5 reports the F1 score of the XGBoost model with and without the causality-informed feature vector. Standard time-series models, particularly those trained on raw temporal data, consistently outperform causality-informed approaches in predictive maintenance tasks, especially at extended prediction horizons. XGBoost, for instance, achieves F1 scores exceeding 94% for horizons beyond 10 hours, though performance declines with shorter windows due to reduced temporal context. In contrast, causality-informed models offer no competitive advantage, primarily due to the limitations of causal discovery conducted on data from a single machine. This narrow scope lacks the operational diversity and failure variability needed to infer generalizable causal structures, resulting in overfitting to machine-specific correlations and the exclusion of informative temporal features. These findings highlight the critical need for multi-machine datasets when applying causal methods, ensuring that inferred relationships reflect true causality rather than artifacts of constrained data.

Figure 5: The XGBoost F1-score across a 20-hour prediction horizon, evaluated with and without a causality-informed feature vector.

In addition, longer prediction horizons (e.g., 20 hours) afford models access to extended historical windows (e.g., 72 hours), enhancing their ability to detect subtle patterns and causal signals. In contrast, short horizons (e.g., 1 hour) offer limited temporal context, increasing susceptibility to noise and overfitting. Causality-informed features such as optimal lag and causal strength are inherently better suited to longer windows, where failure patterns emerge gradually rather than abruptly.

Future work will validate this hypothesis by expanding the dataset to include multiple machines, ensuring more generalizable insights into causal relationships and improving the robustness of predictive models.

ACKNOWLEDGEMENTS
We gratefully acknowledge the European Commission for its support of the Marie Skłodowska-Curie program through the Horizon Europe DN APRIORI project (GA 101073551).

REFERENCES
[1] Abdeldjalil Benhanifia, Zied Ben Cheikh, Paulo Moura Oliveira, Antonio Valente, and José Lima. 2025. Systematic review of predictive maintenance practices in the manufacturing sector. Intelligent Systems with Applications 26, 200501. doi: 10.1016/j.iswa.2025.200501.
[2] Arnab Biswas. 2025. Microsoft Azure predictive maintenance. https://www.kaggle.com/datasets/arnabbiswas1/microsoft-azure-predictive-maintenance/data. Accessed: 2025-05-20.
[3] Qing'an Cui, Jiao Lu, and Xianhui Yin. 2025. Causality enhanced deep learning framework for quality characteristic prediction via long sequence multivariate time-series data. Measurement Science and Technology 36, 3 (Mar. 2025). doi: 10.1088/1361-6501/adb05a.
[4] LiNGAM Developers. 2025. VARLiNGAM — LiNGAM 1.10.0 documentation. https://lingam.readthedocs.io/en/latest/tutorial/var.html. Accessed: 2025-06-25.
[5] Karim Nadim, Ahmed Ragab, and Mohamed Salah Ouali. 2023. Data-driven dynamic causality analysis of industrial systems using interpretable machine learning and process mining. Journal of Intelligent Manufacturing 34, 1 (Jan. 2023), 57–83. doi: 10.1007/s10845-021-01903-y.
[6] P. Nunes, J. Santos, and E. Rocha. 2023. Challenges in predictive maintenance – a review. CIRP Journal of Manufacturing Science and Technology 40, 53–67. doi: 10.1016/j.cirpj.2022.11.004.
[7] Margarida Da Rocha and Faísca Moreira. 2024. Data-Driven Predictive Maintenance for Component Life-Cycle Extension. Tech. rep. Faculdade de Engenharia da Universidade do Porto.
[8] Qipeng Wang, Shoubo Feng, and Min Han. 2025. Causal graph convolution neural differential equation for spatio-temporal time series prediction. Applied Intelligence 55, 7 (May 2025). doi: 10.1007/s10489-025-06287-7.
[9] Sheng Wang, Qiang Zhao, Yinghua Han, and Jinkuan Wang. 2023. Root cause diagnosis for complex industrial process faults via spatiotemporal coalescent based time series prediction and optimized Granger causality. Chemometrics and Intelligent Laboratory Systems 233 (Feb. 2023). doi: 10.1016/j.chemolab.2022.104728.
[10] Xing Yang, Tian Lan, Hao Qiu, and Chen Zhang. 2025. Nonlinear causal discovery via dynamic latent variables. IEEE Transactions on Automation Science and Engineering. doi: 10.1109/TASE.2024.3522917.
[11] Tanja Zerenner, Marc Goodfellow, and Peter Ashwin. 2021. Harmonic cross-correlation decomposition for multivariate time series. Physical Review E 103, 6 (June 2021). doi: 10.1103/PhysRevE.103.062213.
[12] Xiaojun Zhao, Pengjian Shang, and Jingjing Huang. 2017. Several fundamental properties of DCCA cross-correlation coefficient. Fractals 25, 02, 1750017. doi: 10.1142/S0218348X17500177.

Using Interactive Data Visualization for DeFi Market Analysis

Daria Pavlova (daria.pavlova@mps.si), Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Inna Novalija (inna.koval@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
ABSTRACT
Decentralized Finance (DeFi) presents unique analytical challenges with its data-rich, volatile, and multi-dimensional ecosystem. Static reports struggle to convey short-term dynamics and cross-sectional structure simultaneously. We present a comprehensive Business Intelligence (BI) solution featuring an automated Extract-Transform-Load (ETL) pipeline and an interactive Tableau dashboard. Our ETL architecture processes data from three Application Programming Interfaces (APIs), CoinGecko, DeFiLlama, and DexScreener, through validation and transformation stages, achieving a 45-second execution time. The dashboard integrates Key Performance Indicators (KPIs), a Total Value Locked (TVL) time-series, market category analysis, and a top movers panel with synchronized filters. Performance evaluation demonstrates an 85-99% reduction in analysis time compared to manual methods. Three real-world use cases validate practical applicability: narrative rotation detection (28% investment returns), risk concentration monitoring (15% drawdown reduction), and competitive benchmarking. Our approach bridges the gap between complex DeFi data and actionable insights without requiring technical expertise.

KEYWORDS
DeFi, Business Intelligence, Tableau, TVL, KPI dashboards, Interactive Visualization, ETL Pipeline, Data Mining, Cryptocurrency

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SiKDD 2025, October 6, 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.15

1 INTRODUCTION
Decentralized Finance (DeFi) compresses high-frequency market activity, such as liquidity flows, incentive programs, and new protocol deployments, into datasets that change hourly. The ecosystem encompasses over 6,000 protocols managing billions in Total Value Locked (TVL), creating analytical complexity that traditional tools struggle to handle. Practitioners must simultaneously answer three critical questions: How big is the market now? (level KPIs), How is it moving? (time series), and What drives the cross-section? (categories, movers).

Interactive visualization reduces cognitive load and increases pattern salience relative to static tables [3, 7, 10, 11]. However, existing solutions present trade-offs: Dune Analytics requires Structured Query Language (SQL) expertise, Nansen charges $1,800 annually, while free alternatives like DeFiLlama offer limited visualization capabilities. Our goal is to demonstrate a compact, reproducible Business Intelligence workflow that democratizes DeFi analytics through automated data processing and intuitive visualization.

2 RELATED WORK
2.1 DeFi Analytics Landscape
Surveys of DeFi systems [12] highlight the centrality of TVL, market capitalization, and volume as monitoring signals. Public APIs from CoinGecko and DeFiLlama expose these aggregates for research and dashboards, processing millions of daily transactions into consumable metrics.

Recent advances in artificial intelligence have opened new frontiers in DeFi analysis. Chen et al. [1] proposed ensemble machine learning approaches for detecting rug pulls and protocol vulnerabilities, achieving 87% accuracy using features extracted from on-chain data and social signals. Their Random Forest model combined with Long Short-Term Memory (LSTM) networks demonstrated AI's potential in risk assessment. However, these Machine Learning (ML) approaches require significant computational resources and technical expertise, creating barriers for non-technical analysts. Our solution complements these advanced techniques by providing immediate, interpretable insights through interactive visualization.

2.2 Business Intelligence and Visualization
Classic data warehouse and BI literature formalizes metrics and dimensional modeling for decision support [6]. Industry guidance positions interactive platforms such as Tableau among the leading tools for exploratory analysis [4]. Visualization principles (overview first, zoom and filter, details-on-demand [8]) map directly to dashboard layout patterns [3, 9]. Studies of graphical perception [2, 5] explain why bars outperform pies for accurate comparisons, and why color semantics (green/red for gains/losses) aid preattentive detection [11]. We align with these findings in our chart choices and encodings.

3 SYSTEM ARCHITECTURE AND METHODOLOGY
3.1 ETL Pipeline Architecture
Our ETL pipeline implements a modular, fault-tolerant architecture processing data through five stages. The architecture follows a standard Extract-Transform-Load pattern with additional validation and quality checks at each stage.

Figure 1: ETL Pipeline Architecture: Data flows from three APIs through validation and transformation stages to produce four CSV files for dashboard visualization. The system processes 6,000+ protocols with automated retry logic and data quality checks.

Extract Layer: Three parallel API clients collect data from CoinGecko (200 tokens per page), DeFiLlama (6,000+ protocols), and DexScreener (100+ Decentralized Exchange pairs). Each client implements asynchronous Hypertext Transfer Protocol (HTTP) requests with exponential backoff (4-10 seconds) and retry logic (up to 5 attempts). (API documentation: https://www.coingecko.com/en/api, https://defillama.com/docs/api.)

Validation Layer: Implements four-level data quality checks:
• Completeness: Missing value detection with fallback strategies
• Consistency: Cross-validation between data sources
• Timeliness: Timestamp validation (<1 hour freshness)
• Accuracy: Outlier detection using Median Absolute Deviation (MAD)
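The MAD-based accuracy check can be sketched with a modified z-score. This is an illustrative sketch, not the paper's implementation: the function name, the 0.6745 scaling constant, and the 3.5 cutoff are common robust-statistics defaults rather than values reported by the authors.

```python
import numpy as np

def mad_outliers(values, threshold=3.5):
    """Flag outliers via a modified z-score based on the
    Median Absolute Deviation (MAD).

    Note: 0.6745 rescales MAD to be comparable with the standard
    deviation under normality; threshold=3.5 is a conventional
    default, not a value specified in the paper.
    """
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:  # degenerate case (constant series): flag nothing
        return np.zeros_like(x, dtype=bool)
    modified_z = 0.6745 * (x - median) / mad
    return np.abs(modified_z) > threshold

# Example: a TVL series with one obvious spike
tvl = [100.0, 102.0, 98.0, 101.0, 99.0, 5000.0]
flags = mad_outliers(tvl)  # only the last value is flagged
```

MAD is preferred here over a mean/standard-deviation rule because a single extreme value (common in volatile DeFi data) barely moves the median, so genuine spikes stand out instead of inflating the threshold.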
Transform Layer: Processes validated data through three streams:
• Normalize: Converts to tidy format with Coordinated Universal Time (UTC) timestamps
• Aggregate: Groups by time windows and categories
• Features: Calculates rolling statistics and market sentiment

Load Layer: Exports processed data as Comma-Separated Values (CSV) files optimized for Tableau consumption.

3.2 Dashboard Design Methodology
The dashboard layout follows Shneiderman's Visual Information Seeking Mantra [8]: overview first, zoom and filter, then details-on-demand.

Layout Structure:
• Top Row: Four KPI cards displaying market totals with 24-hour changes
• Middle Section: TVL time-series (left, 60% width) and Top Movers panel (right, 40% width)
• Bottom Section: Category bars (left) and pie chart (right) for market structure analysis
• Right Sidebar: Interactive filters for Time Window, Category Metric, and Top N selections

4 PERFORMANCE EVALUATION
4.1 System Performance Metrics
We evaluated system performance across three dimensions:

Response Time:
• Initial dashboard load: 3.2s ± 0.5s (n=100)
• Filter operations: 1.8s ± 0.3s
• ETL pipeline execution: 45s complete, 8s incremental

Data Processing Efficiency:
• API delay: 0.1s between requests
• Batch processing: 50-100 protocols per batch
• Memory usage: Peak 256MB
• Data volume: 6,000+ protocols, 200 tokens/page

User Efficiency Gains:
• Market overview generation: 15 min → 5 sec (99.4% reduction)
• Sector rotation analysis: 30 min → 2 min (93.3% reduction)
• Top movers identification: 10 min → instant (100% automation)

4.2 Comparison with Existing Solutions

Table 1: Feature Comparison with Industry Solutions

Feature             | Our Solution | Dune      | Nansen    | DeFiLlama
Cost                | Free         | $390/yr   | $1,800/yr | Free
No-code Interface   | ✓            | ×         | ✓         | ✓
Custom ETL          | ✓            | ×         | ×         | ×
Response Time       | <2s          | 5-30s     | <3s       | <1s
Visualization Types | 4            | Unlimited | 10+       | 2
Data Sources        | 3            | Multiple  | Multiple  | 1
Historical Data     | 30 days      | All       | All       | Limited

Our solution occupies a unique position: more sophisticated than DeFiLlama's basic charts, more accessible than Dune's SQL requirements, and more affordable than Nansen's premium tiers.

5 RESULTS AND USE CASE VALIDATION
5.1 Dashboard Implementation
The integrated dashboard combines multiple analytical views with synchronized filtering capabilities. The design synthesizes four key data dimensions:
• KPI Header: Market metrics provide immediate context; a $2.86T total market cap with 56.1% BTC dominance indicates risk-off sentiment
• TVL Time-Series: Shows capital deployment patterns across protocols, with an upward trajectory suggesting renewed confidence
• Top Movers Panel: Highlights outliers; clustering in specific sectors signals narrative emergence
• Category Analysis: Reveals market concentration; the top 3 sectors comprise 51% of total value

Figure 2: Integrated DeFi Analysis Dashboard with annotated regions. (A) KPI header showing market totals and BTC dominance, (B) TVL time-series revealing protocol-level capital flows, (C) Top Movers identifying momentum shifts, (D) Category bars showing sector concentration. Red boxes indicate areas of analytical focus.

5.2 Use Case Validation
Use Case 1: Narrative Rotation Detection. An investment fund utilized the dashboard to identify emerging trends in Liquid Staking Derivatives (LSDs). When multiple LSD protocols appeared in Top Movers with 40%+ gains while category volume increased 3x, they allocated capital early, achieving 28% returns over two weeks.

Use Case 2: Risk Concentration Analysis. A DeFi protocol team monitored market concentration using the category pie chart. When the top 3 categories exceeded 65% of total market cap (Herfindahl-Hirschman Index >0.25), they adjusted their treasury diversification strategy, reducing drawdown by 15% during the subsequent correction.

Use Case 3: Competitive Benchmarking. Protocol developers tracked their TVL growth relative to category peers. The synchronized time-series view revealed that their incentive program launched 3 days after competitors but achieved 2x the TVL growth rate, validating their tokenomics design.

6 DISCUSSION
6.1 Synthesis for Decision-Making
The dashboard enables multi-dimensional analysis through synchronized views:

Macro Market Reading: Combining BTC dominance with DeFi volume trends provides regime identification. High dominance (>55%) with rising DeFi volume suggests selective risk-taking in quality protocols.

Flow Analysis: TVL trends coupled with volume data distinguish genuine inflows from liquidity reshuffling. Rising TVL with flat volume indicates parking behavior rather than active usage.

Rotation Detection: The Top Movers panel acts as an early warning system. Sector clustering combined with category volume spikes provides a 2-3 day lead time for narrative shifts.

6.2 Limitations and Data Quality
Technical Limitations:
• TVL double-counting: Rehypothecation can inflate metrics by 20-30%
• API latency: 5-15 minute delays during high volatility
• Protocol coverage: Excludes protocols with <$1M TVL

Mitigation Strategies:
• Implement adjusted TVL calculations excluding derivative tokens
• Add confidence intervals for volatile metrics
• Include protocol age weighting for emerging project detection

7 CONCLUSION AND FUTURE WORK
We presented a comprehensive BI solution for DeFi market analysis that bridges the gap between sophisticated analytics and accessibility. Our dual contribution, a robust ETL pipeline and an interactive dashboard, demonstrates measurable improvements: an 85-99% reduction in analysis time while maintaining data quality through systematic validation.

The system's practical value is validated through real-world deployments showing successful identification of profitable trading opportunities and risk mitigation strategies. By following established visualization principles and implementing automated data processing, we provide a reproducible framework that democratizes DeFi analytics.

Future work includes: (1) machine learning integration for TVL forecasting and anomaly detection, (2) real-time streaming with sub-second updates, (3) cross-chain analytics for Layer 2 solutions, (4) natural language generation for automated insights, and (5) on-chain integration for protocol-specific metrics.

ACKNOWLEDGMENTS
We thank the reviewers for their constructive feedback, particularly suggestions on AI integration and visualization improvements. Special thanks to the SiKDD conference organizers for providing the platform to present this work.

REFERENCES
[1] L. Chen, Z. Zhang, and M. Wang. 2024. AI-Driven Risk Assessment in DeFi: Machine Learning Approaches for Protocol Security. Journal of Financial Technology 2, 1 (2024), 87–95.
[2] William Cleveland and Robert McGill. 1984. Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods. J. Amer. Statist. Assoc. 79, 387 (1984), 531–554.
[3] Stephen Few. 2013. Information Dashboard Design: Displaying Data for At-a-Glance Monitoring (2nd ed.). Analytics Press.
[4] Gartner Inc. 2024. Magic Quadrant for Analytics and Business Intelligence Platforms. Research Note G00799564.
[5] Jeffrey Heer and Michael Bostock. 2010. Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 203–212.
[6] Ralph Kimball and Margy Ross. 2013. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling (3rd ed.). Wiley.
[7] Tamara Munzner. 2014. Visualization Analysis and Design. CRC Press.
[8] Ben Shneiderman. 1996. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In Proceedings of the IEEE Symposium on Visual Languages. IEEE, 336–343.
[9] Tableau Software. 2022. Visual Analysis Best Practices: Simple Techniques for Making Every Data Visualization Useful. Whitepaper.
[10] Edward Tufte. 2001. The Visual Display of Quantitative Information (2nd ed.). Graphics Press.
[11] Colin Ware. 2019. Information Visualization: Perception for Design (4th ed.). Morgan Kaufmann.
[12] Sam Werner, Daniel Perez, Lewis Gudgeon, Ariah Klages-Mundt, Dominik Harz, and William Knottenbelt. 2021. SoK: Decentralized Finance (DeFi). In Proceedings of the 4th ACM Conference on Advances in Financial Technologies. ACM, 30–46.

A Hybrid Lexicon-Machine Learning Approach to Macedonian Sentiment Analysis

Sofija Kochovska* (kochovskasofija@gmail.com), University of Primorska, UP FAMNIT, Koper, Slovenia
Branko Kavšek* (branko.kavsek@upr.si), University of Primorska, UP FAMNIT, Koper, Slovenia
Jernej Vičič* (jernej.vicic@upr.si), University of Primorska, UP FAMNIT, Koper, Slovenia
Jožef Stefan Institute, Ljubljana, Slovenia
*These authors contributed equally.

Abstract
This study extends our previous work on a rule-based sentiment analysis system for Macedonian text [10], which relied on hand-crafted lexicons and linguistic rules. We now investigate the integration of these rule-based features with supervised machine learning classifiers, specifically Logistic Regression (LR) and Support Vector Machines (SVM), to improve sentiment classification performance. Lexicon-derived features, including polarity, intensifiers, diminishers, and negation handling, are combined with statistical models to evaluate their impact. Experimental results show that the hybrid approach substantially outperforms the rule-based baseline, increasing the mean F1 score from 73.5% to 86.7% for SVM and 86.4% for LR. Paired t-tests confirm that these improvements are statistically significant (p < 0.001), while Wilcoxon tests indicate a strong trend (p = 0.0625). These findings demonstrate that integrating rule-based linguistic features with machine learning classifiers provides a robust framework for sentiment analysis in under-resourced languages such as Macedonian.

Keywords
Sentiment Analysis, Macedonian, Rule-based Approach, Machine Learning, Hybrid Model, Natural Language Processing, Support Vector Machine, Logistic Regression, Low-resource Languages

1 Introduction
Sentiment analysis is a core task in natural language processing (NLP), commonly applied to social media, reviews, and feedback analysis. While progress has been substantial for high-resource languages such as English, low-resource languages like Macedonian still face limited availability of annotated corpora, sentiment lexicons, and reliable tools. Macedonian, an Eastern South Slavic language spoken by around 1.6 million people as the official language of North Macedonia, remains under-explored in computational linguistics despite its close relation to Bulgarian, Serbian, and Croatian.

In this study, we build on our earlier work presented at the ITAT conference (WAFNL workshop) [10], where we developed a rule-based sentiment analysis system for Macedonian. That work focused on lexicon construction and the integration of modifiers such as intensifiers, diminishers, and polarity shifters. Here, we extend the approach by implementing a hybrid framework that combines rule-based linguistic features with supervised machine learning classifiers. Specifically, we evaluate Logistic Regression (LR) and Support Vector Machines (SVMs), using features derived from sentiment lexicons and rule-based weighting schemes.

Our contributions are twofold: (i) we demonstrate how rule-based features enhance the performance of statistical classifiers in a low-resource setting, and (ii) we provide a systematic evaluation of the hybrid approach on Macedonian sentiment data. This study highlights the effectiveness of combining linguistic knowledge with machine learning to improve sentiment detection for under-resourced languages.

2 Related Work
Sentiment analysis has been widely studied, from lexicon-based approaches [16, 6] to machine learning and deep learning models [15, 2]. Lexicon-based systems rely on dictionaries and modifiers such as intensifiers, diminishers, and negations; they are interpretable and require no large datasets but have limited coverage. Machine learning models achieve higher accuracy with sufficient data but often act as "black boxes." In low-resource languages, hybrid approaches combining lexicon features with statistical learning improve robustness [12, 18].

For Macedonian, Jovanoski et al. [9] compiled sentiment lexicons and manually annotated Twitter datasets, analyzing how seed lists affect induced lexicons. Uzunova and Kulakov [17] classified movie reviews, while Gajduk and Kocarev [4] achieved 92% accuracy on forum posts. The SADEmma 1.0 corpus [7] includes three-class news sentiment labels across languages, but the Macedonian portion has only 198 entries, limiting its usefulness for model training. Our previous work [10] introduced a curated lexicon of 4,000 words, later expanded to 8,000, evaluated on Macedonian Twitter data.

Despite its close relation to Bulgarian, Serbian, and Croatian, Macedonian sentiment analysis remains under-resourced.
Comparable studies in Serbian and Slovenian report performance ranging from moderate to high, with F1 or accuracy scores around 76–83% depending on dataset and methodology [11, 8, 13, 3]. These findings indicate that our results align with trends observed in related South Slavic languages. This study extends prior work by integrating lexicon-based features into supervised classifiers, comparing Logistic Regression and SVMs for Macedonian sentiment classification, and, to our knowledge, represents the first combination of rule-based linguistic insights with statistical models for this language.

Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.16

3 Methodology
Our approach builds on the framework presented in Kochovska et al. [10], combining lexicon-based rule features with supervised machine learning classifiers. The methodology is designed to handle the challenges of sentiment analysis in Macedonian, a low-resource language, by leveraging linguistic insights alongside statistical learning.

3.1 Lexicon-Based Feature Extraction
We use manually-checked Macedonian lexicons:
• Positive and Negative Lexicons: Words indicating positive or negative sentiment.
• Intensifiers and Diminishers: Words that amplify or attenuate sentiment (e.g., very, slightly).
• Polarity Shifters (Negations): Words that invert sentiment, such as not or never, applied within a small context window.
• Stop-words: Common words with minimal meaning, removed to improve feature quality.

Texts are preprocessed to normalize repetitions and remove URLs, mentions, punctuation, and stop-words. Each token is analyzed for sentiment considering intensifiers, diminishers, and negations. Extracted features include:
• Normalized lexicon score
• Counts of positive and negative words
• Counts of intensifiers, diminishers, and negations
These features provide a compact numerical representation of sentiment suitable for supervised learning models.

3.2 Machine Learning Models
The rule-based features (lexicon score, counts of positive/negative words, intensifiers, diminishers, and negations) are used as input to two classifiers:
• Logistic Regression (LR): A linear classifier trained on the rule-based features. Hyperparameters for intensifier weight (1.5), diminisher weight (0.7), and negation window size (2) were adopted from our previous ITAT study, which tested 108 combinations to identify the optimal configuration.
• Support Vector Machine (SVM): A linear-kernel SVM trained on the same features. The C parameter was tuned via grid search (0.1–5), with the best performance at C = 0.15.

The selected rule-based configuration for both models is: intensifier weight = 1.5, diminisher weight = 0.7, negation window = 2, and ε = 0.30. These values control the contribution of linguistic modifiers to the overall sentiment score of a text.

3.3 Dataset Splitting
The Macedonian sentiment dataset used in this study is identical to that from our previous ITAT/WAFNL paper [10]. For machine learning evaluation, we employ stratified 5-fold cross-validation. In each fold, 80% of the data is used for training and 20% for testing, ensuring that the class distribution is preserved across folds. This approach allows robust evaluation of both Logistic Regression and SVM models while leveraging all available data for training and testing across different folds.

3.4 Evaluation Procedure
We evaluated the rule-based baseline and hybrid classifiers using stratified 5-fold cross-validation to ensure balanced sentiment class representation. For each fold, models were trained on 80% of the data and tested on 20%, repeating the process across five splits to obtain stable estimates.

Performance was measured primarily with F1 scores for positive and negative classes [10], enabling direct comparison with Jovanoski et al. [9]. Confusion matrices and full classification reports were also generated to evaluate performance on all three classes, including neutral, highlighting improvements in polarity detection and challenges in handling neutral sentiment. Statistical significance of improvements was assessed using paired t-tests and Wilcoxon signed-rank tests on per-fold F1 scores.

4 Results and Evaluation
The hybrid sentiment analysis framework was evaluated on the Macedonian test dataset that we also used for evaluation of the rule-based-only approach discussed in the ITAT/WAFNL paper [10], however this time using Logistic Regression (LR) and Support Vector Machine (SVM) classifiers. Both models leveraged the rule-based features described in Section 3, with hyperparameters tuned based on our previous ITAT study for LR and specifically tested on this dataset for SVM.

4.1 Logistic Regression (LR)
Logistic Regression trained on rule-based features demonstrates consistently strong performance, achieving a mean F1 score of 0.864 on positive and negative classes. The per-fold results indicate stable performance across folds, suggesting robustness to variations in the training data (Figure 1).

Figure 1: Logistic Regression: F1 score per fold for positive and negative classes.

The confusion matrix (Figure 2) shows that most misclassifications involve neutral and negative instances. Specifically, 43 neutral examples were predicted as negative, and 29 negative examples were labelled as neutral. Positive instances are generally well-separated, with minimal confusion, reflecting the effectiveness of the lexicon-based features. These patterns suggest that LR captures polarized sentiment effectively but struggles with subtle neutral expressions.

Figure 2: Logistic Regression confusion matrix for all classes.

Overall classification metrics confirm high precision and recall for positive and negative classes (Precision = 0.847 / 0.830, Recall = 0.872 / 0.923, F1 = 0.859 / 0.874), while neutral sentiment remains more challenging (F1 = 0.715). Figure 5 presents these metrics visually, highlighting the differences between classes.

Figure 5: Overall precision, recall, and F1 scores for Logistic Regression and SVM.

4.2 Support Vector Machine (SVM)
SVM, also trained on the same rule-based features, achieves a slightly higher mean F1 score of 0.867 for positive and negative classes and shows stable per-fold performance (Figure 3). The hyper-parameter C = 0.15, selected after testing a range from 0.1 to 5, provided optimal regularization for this dataset.

Figure 3: SVM: F1 score per fold for positive and negative classes.

The SVM confusion matrix (Figure 4) exhibits a similar trend to LR: neutral instances are most frequently misclassified, with 54 neutral examples predicted as negative and 38 predicted as positive. SVM shows improved recall for negative instances, correctly identifying 481 of 508 examples, indicating enhanced sensitivity to strong negative cues.

Figure 4: SVM confusion matrix for all classes.
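The rule-based feature extraction of Section 3 can be sketched as follows. This is an illustrative sketch, not the authors' code: the tiny English word lists stand in for the actual Macedonian lexicons (which are not reproduced here), and the function name is hypothetical. The weights do follow the configuration reported above (intensifier 1.5, diminisher 0.7, negation window 2).

```python
# Placeholder lexicons; the real system uses curated Macedonian word lists.
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "awful"}
INTENSIFIERS = {"very"}
DIMINISHERS = {"slightly"}
NEGATIONS = {"not", "never"}

# Weights from the configuration reported in Section 3.2.
INTENSIFIER_W, DIMINISHER_W, NEGATION_WINDOW = 1.5, 0.7, 2

def extract_features(tokens):
    """Return a compact feature vector for the LR/SVM classifiers:
    [normalized lexicon score, #positive, #negative,
     #intensifiers, #diminishers, #negations]."""
    score = 0.0
    counts = {"pos": 0, "neg": 0, "int": 0, "dim": 0, "shift": 0}
    for i, tok in enumerate(tokens):
        if tok in INTENSIFIERS:
            counts["int"] += 1
        elif tok in DIMINISHERS:
            counts["dim"] += 1
        elif tok in NEGATIONS:
            counts["shift"] += 1
        elif tok in POSITIVE or tok in NEGATIVE:
            polarity = 1.0 if tok in POSITIVE else -1.0
            counts["pos" if polarity > 0 else "neg"] += 1
            weight = 1.0
            # Look back over a small window for modifiers and shifters.
            for prev in tokens[max(0, i - NEGATION_WINDOW):i]:
                if prev in INTENSIFIERS:
                    weight *= INTENSIFIER_W
                elif prev in DIMINISHERS:
                    weight *= DIMINISHER_W
                elif prev in NEGATIONS:
                    polarity = -polarity
            score += polarity * weight
    normalized = score / max(1, len(tokens))
    return [normalized, counts["pos"], counts["neg"],
            counts["int"], counts["dim"], counts["shift"]]

features = extract_features("not very good".split())
# "good" is intensified by "very" and flipped by "not" within the window,
# yielding a negative normalized score.
```

Feeding such vectors to scikit-learn's LogisticRegression or a linear-kernel SVC then reproduces the hybrid setup: the linguistic rules shape the input space, while the classifier learns the decision boundary.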
Classification metrics (Figure 5) reinforce these observations: SVM maintains high precision for positive and neutral classes and slightly higher F1 scores for polarized sentiment compared to LR (Positive: F1 = 0.862, Negative: F1 = 0.877, Neutral: F1 = 0.684). This demonstrates that combining rule-based features with SVM improves detection of nuanced sentiment in Macedonian text.

4.3 Discussion
The evaluation demonstrates that our hybrid framework substantially improves over the purely rule-based approach. The baseline system reached a mean F1 score of 0.736 across folds, while Logistic Regression and SVM achieved 0.864 and 0.867, respectively. Paired t-tests confirmed that these improvements are statistically significant (p < 0.001). The Wilcoxon signed-rank test yielded p = 0.0625, slightly above the conventional threshold, likely due to the limited number of folds, but the performance trend remained consistent.

Most errors stem from the neutral class, where sentiment is often ambiguous or context-dependent, while positive and negative classes are reliably distinguished. This shows that leveraging lexicon-based features within machine learning models captures polarity effectively and generalizes well across folds. Overall, the results highlight the strength of hybrid models in combining the interpretability of rule-based systems with the adaptability of statistical learning. Future work should address the challenge of neutral sentiment and investigate richer contextual or semantic features.

5 Conclusion and Future Work
We presented a hybrid sentiment analysis framework for Macedonian, combining rule-based lexical features with Logistic Regression and Support Vector Machines. The hybrid models substantially outperformed the purely rule-based system, which achieved a mean F1 score of 73.6%. Both classifiers improved classification performance, particularly for polarized sentiment, while maintaining interpretability and robustness by relying exclusively on lexicon-derived features.

Our results demonstrate that integrating linguistic knowledge with statistical learning is effective for under-resourced languages like Macedonian, where annotated datasets are scarce. The rule-based component captures explicit, context-modified cues, while ML models generalize well across folds.

Future work includes:
• Incorporating syntactic and semantic embeddings to better capture context and subtle neutral sentiment.
• Experimenting with attention-based or transformer models for long-range dependencies.
• Expanding annotated datasets across social media, reviews, and user-generated content.
• Investigating domain adaptation to generalize across different text types.
• Integrating additional linguistic cues such as POS tags or dependency relations.
• Exploring multilingual transformers (e.g., mBERT, XLM-R) fine-tuned on Macedonian [2, 1].
• Using large language models to generate synthetic Macedonian training data [19, 14, 5].

This work provides a strong foundation for Macedonian sentiment analysis, highlighting the value of hybrid approaches and paving the way for richer linguistic feature integration and advanced modeling.

References
[1] Alexis Conneau et al. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online (July 2020), 8440–8451. doi: 10.18653/v1/2020.acl-main.747.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Minneapolis, Minnesota (June 2019), 4171–4186. doi: 10.18653/v1/N19-1423.
[3] Darja Fišer and Tomaž Erjavec. 2016. Analysis of sentiment labeling of Slovene user-generated content. Znanstvena založba Filozofske fakultete, 22–25. http://nl.ijs.si/janes/wp-content/uploads/2016/09/CMC-2016_Fiser_Erjavec_Analysis-of-Sentiment-Labeling.pdf.
[4] Andrej Gajduk and Ljupco Kocarev. 2014. Opinion mining of text documents written in Macedonian language. arXiv preprint arXiv:1411.4472. https://arxiv.org/abs/1411.4472.
[5] Nils Constantin Hellwig, Jakob Fehle, and Christian Wolff. 2024. Exploring large language models for the generation of synthetic training samples for aspect-based sentiment analysis in low resource settings. Expert Systems with Applications 261 (Oct. 2024), 125514. doi: 10.1016/j.eswa.2024.125514.
[6] Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04). Association for Computing Machinery, Seattle, WA, USA, 168–177. doi: 10.1145/1014052.1014073.
[7] Nikola Ivačič, Andraž Pelicon, Boshko Koloski, Senja Pollak, and Matthew Purver. 2024. News sentiment analysis datasets for Serbian, Bosnian, Macedonian, Albanian and Estonian (SADEmma 1.0). CLARIN.SI repository. Version 1.0. http://hdl.handle.net/11356/1987.
[8] Danka Jokić, Ranka Stanković, and Branislava Šandrih Todorović. 2024. Abusive speech detection in Serbian using machine learning. In Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security. Lancaster, UK (July 2024), 153–163. https://aclanthology.org/2024.nlpaics-1.18/.
[9] Dame Jovanoski, Veno Pachovski, and Preslav Nakov. 2015. Sentiment analysis in Twitter for Macedonian. In Proceedings of the International Conference Recent Advances in Natural Language Processing. INCOMA Ltd., Hissar, Bulgaria (Sept. 2015), 249–257. https://aclanthology.org/R15-1034/.
[10] Sofija Kochovska, Branko Kavšek, and Jernej Vičič. 2025. Rule-based sentiment analysis of Macedonian. In Proceedings of ITAT 2025: Information Technologies – Applications and Theory (CEUR Workshop Proceedings). Telgárt, Slovakia.
[11] Adela Ljajić, Ulfeta Marovac, and Aldina Avdic. 2017. Sentiment analysis of Twitter for the Serbian language. (Mar. 2017).
[12] Walaa Medhat, Ahmed Hassan, and Hoda Korashy. 2014. Sentiment analysis algorithms and applications: a survey. Ain Shams Engineering Journal 5, 4, 1093–1113. doi: 10.1016/j.asej.2014.04.011.
[13] Igor Mozetic, Miha Grcar, and Jasmina Smailovic. 2016. Multilingual Twitter sentiment classification: the role of human annotators. PLoS ONE 11 (Feb. 2016). doi: 10.1371/journal.pone.0155036.
[14] Koena Ronny Mabokela, Mpho Primus, and Turgay Celik. 2025. Advancing sentiment analysis for low-resourced African languages using pre-trained language models. PLOS ONE 20, 6 (June 2025), 1–37. doi: 10.1371/journal.pone.0325102.
[15] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA (Oct. 2013), 1631–1642. https://aclanthology.org/D13-1170/.
[16] Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics 37, 2 (June 2011), 267–307. doi: 10.1162/COLI_a_00049.
[17] Vasilija Uzunova and Andrea Kulakov. 2015. Sentiment analysis of movie reviews written in Macedonian language. In ICT Innovations 2014. Advances in Intelligent Systems and Computing, Vol. 311. Springer, Cham, 279–288. doi: 10.1007/978-3-319-09879-1_28.
[18] Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep learning for sentiment analysis: a survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (Jan. 2018). doi: 10.1002/widm.1253.
[19] Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, and Lidong Bing. 2023. Sentiment analysis in the era of large language models: a reality check. arXiv: 2305.15005 [cs.CL]. https://arxiv.org/abs/2305.15005.
Building an AI-Ready Data Infrastructure Towards a SDG-focused Observatory for the Brazilian Amazon

Joao Pita Costa† (IRCAI, Jozef Stefan Institute, Ljubljana, Slovenia; joao.pitacosta@ircai.org)
Mirozlav Polzer (GloCha, Climate Chain Coalition, Klagenfurt, Austria; polzer@glocha.info)
Leonardo Barrionuevo (MetAmazonia, AMAGroup, Curitiba, Brazil; leonardo@amagroup.com.br)
Joao Paulo Veiga (CIAAM, University of São Paulo, São Paulo, Brazil; candia@usp.br)

Abstract / Povzetek

As artificial intelligence technologies rapidly evolve, regulatory sandbox initiatives have emerged as crucial tools for promoting responsible AI development, enabling innovation while safeguarding fundamental rights and public interests. This paper analyzes the development and implications of Brazil's first AI regulatory sandbox, with a particular focus on the model established by SUSEP (Superintendence of Private Insurance). Designed as a controlled environment for testing innovative AI-related products and services in the insurance sector, the SUSEP sandbox illustrates how regulatory flexibility can foster technological advancement, financial inclusion, and market efficiency while maintaining consumer protection and risk oversight. Being developed under Brazil's Economic Freedom Law, the sandbox has evolved through three editions (2020, 2021, and 2024), prioritizing both sustainable and technological projects. This study explores the sandbox's structure, eligibility criteria, business plan requirements, operational limitations, and transition mechanisms for companies seeking permanent licensure. It also identifies actionable insights for future regulatory frameworks, particularly for the National Data Protection Authority (ANPD) as Brazil advances toward AI-specific governance. By comparing the sandbox's legal foundations, selection processes, and risk mitigation protocols with international best practices, this paper underscores the sandbox's role as a blueprint for responsible AI regulation in emerging markets.

Keywords / Ključne besede

Sustainable Development Goals (SDGs), AI-ready data infrastructure, FAIR data principles, Open data, Semantic interoperability, Brazilian Amazon, COP30

1 Introduction

The United Nations' 2030 Agenda for Sustainable Development outlines 17 SDGs aimed at addressing the world's most pressing social, economic, and environmental challenges. Achieving these goals requires not only coordinated policy action and resource mobilization but also robust AI-enabled data systems capable of tracking progress, identifying gaps, and informing regulatory interventions. However, current efforts to monitor and evaluate the SDGs are often hampered by fragmented, inaccessible, or outdated data that are not designed with advanced analytics or AI applications in mind [1]. As the volume and variety of sustainability-related data continue to grow (ranging from satellite imagery and sensor networks to administrative records and citizen-generated content), there is a critical need to rethink the way data infrastructures are designed. Despite advancements, the broader ecosystem of SDG data remains siloed, with significant disparities in data availability, quality, and usability across countries and sectors.

National statistical offices often lack the infrastructure or capacity to generate real-time, high-resolution data, while non-governmental data sources remain underutilized due to interoperability issues or lack of trust. As a result, policymakers and researchers face substantial barriers when attempting to harness AI for sustainable development monitoring. There is growing recognition that SDG data must be AI-ready: structured, interoperable, machine-readable, and enriched with metadata that allows for automated processing and semantic understanding [2]. AI-ready data infrastructures enable the use of artificial intelligence and machine learning tools for trend detection, predictive modeling, and evidence-based policymaking, accelerating the global effort toward sustainable development. Several initiatives have emerged to bridge the gap between data collection and actionable insights.

In this context, the IRCAI SDG Observatory, an open-access data infrastructure developed by the International Research Centre on Artificial Intelligence under the auspices of UNESCO (IRCAI), aggregates and organizes datasets related to SDG indicators, news, policies, educational resources and innovation ecosystems, facilitating their use in AI applications through adherence to open data standards, consistent metadata schemas, and semantic alignment with the SDG framework. It represents a step toward a scalable, reusable AI-ready data architecture that can support both global and local decision-making.

The main contribution of this paper is a conceptual and practical framework for AI-ready SDG data infrastructure, building on the design principles and implementation strategies demonstrated by the IRCAI SDG Observatory, as well as by the preceding NAIADES Water Observatory [3] focusing on AI and Water Sustainability, and the recently deployed UNESCO Landslides Observatory discussed in section 4, both in the intersection of SDG 13 (Climate Action) with SDG 6 (Water Sustainability) and SDG 11 (Resilient Cities and Communities).

†Corresponding author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
http://doi.org/10.70314/is.2025.sikdd.17

We follow the discussion in [4] and propose an AI-ready and AI-enabled data and metadata infrastructure that can be leveraged for research purposes in what relates AI and Sustainable Development. Through this lens, we argue for a paradigm shift, demonstrating an Amazon-focused SDG data ecosystem built on this new paradigm: moving from static, indicator-focused reporting systems to a dynamic, AI-compatible engine that supports (i) education and training for sustainability; (ii) disinformation monitoring practices in the sustainability discourse (see figure 1); and (iii) data-driven decision-making and global collaboration.

Figure 1: The SDG distribution of the ingested scientific article abstracts and their Amazon-related main concepts

2 Data and Metadata Architecture

Designing an AI-ready SDG data infrastructure requires more than simply aggregating datasets: it demands a structured, extensible architecture that enables machine interpretability, semantic consistency, and interoperability across domains. The IRCAI SDG Observatory proposes in [5] a data structure incorporating both heterogeneous data and complex preprocessed metadata layers to support automated reasoning, text mining applications, and dynamic sustainability analysis.

At the core of the infrastructure lies the data layer, which consists of curated datasets aligned with specific SDG indicators. These datasets are collected from a variety of sources, including international organizations, national statistics offices, worldwide news engines, open government data portals, and research institutions. To ensure consistency and usability, raw datasets undergo a 3-step transformation process:
● Harmonization: Raw data is converted into standardized formats (e.g., CSV, JSON, RDF) using predefined schemas (as the official SDG indicator framework defined by the UN Statistics Division [6]).
● Normalization: Variables such as geographic units, time periods, and measurement scales are normalized to ensure comparability across countries and regions.
● Validation: Data quality checks are implemented to flag missing values, outliers, or inconsistent units, helping maintain reliability and analytical integrity.

IRCAI is engaging domain experts for the different SDGs to explore the most relevant KPIs to monitor, the search terms in the ontology (discussed in the next section) and the outcomes from the analysis. The resulting datasets are thus not only clean and standardized (considering limitations of the data sources, including different types of bias analysed and exposed) but also structured in elasticSearch indices to support downstream AI applications acting over powerful Lucene queries through the native API.

Surrounding the data layer is a robust metadata architecture that enables discoverability, semantic enrichment, and AI-readiness. The metadata design is informed by the FAIR data principles (Findable, Accessible, Interoperable, and Reusable) and includes the following key components: (i) Descriptive Metadata, including descriptive elements such as title, description, source organization, temporal coverage, geographic coverage, and associated SDG goals, enabling human and machine agents to easily understand the scope and purpose of each data index; (ii) Structural Metadata, specifying the internal structure of the dataset, such as data types, column definitions, units of measurement, and relationships between variables, facilitating data parsing and automatic preprocessing by text mining tools; (iii) Source Metadata, capturing information about the dataset's origin, transformation steps, update frequency, and quality assurance processes, ensuring transparency, reproducibility, and trustworthiness; and (iv) Semantic Metadata, leveraging ontologies and controlled vocabularies to provide machine-readable semantics, linking dataset elements to established knowledge graphs, enabling reasoning across data indices and automated alignment of conceptually related information (see figure 2).

Figure 2: Visualisation of the SDG distribution of the ingested OECD AI policies according to the SDG ontology built on Wikidata terms defined with SDG topic experts

To ensure accessibility and integration with external systems, the infrastructure exposes datasets and metadata through native RESTful APIs, allowing developers and researchers to query and retrieve relevant data programmatically, enabling use in dashboards, modeling pipelines, and decision-support systems. Furthermore, adherence to open data standards such as DCAT (Data Catalog Vocabulary) and JSON-LD (Linked Data) ensures that the infrastructure can interface with other open government data platforms, research data repositories, and semantic web services. The architecture is designed with scalability and modularity in mind, allowing new datasets to be integrated with minimal manual intervention. Through automated ingestion pipelines and schema mapping tools, the infrastructure can accommodate additional data sources while preserving metadata integrity and interoperability.
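The descriptive and semantic metadata components above, combined with the DCAT/JSON-LD serialization the infrastructure adheres to, can be sketched as a minimal dataset record. This is an illustration only: the helper function, the field values, and the chosen subset of DCAT terms are invented for the example and are not actual Observatory records.

```python
import json

def make_dataset_record(title, description, keywords, sdg_qid):
    """Build a minimal DCAT-style dataset description in JSON-LD.

    Property names come from the DCAT and Dublin Core vocabularies;
    the values and the Wikidata theme link are illustrative only."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        # (i) descriptive metadata
        "dct:title": title,
        "dct:description": description,
        "dcat:keyword": keywords,
        # (iv) semantic metadata: link the record to a Wikidata entity
        "dcat:theme": f"http://www.wikidata.org/entity/{sdg_qid}",
    }

record = make_dataset_record(
    title="Amazon biodiversity news stream",
    description="Multilingual news items related to biodiversity in Amazonia",
    keywords=["SDG 15", "biodiversity", "Amazon"],
    sdg_qid="Q23442",  # one of the example Q-IDs listed with the ontology
)
print(json.dumps(record, indent=2))
```

Because the record is plain JSON-LD, the same dictionary can be published through a catalog endpoint or embedded in a web page without further transformation.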
Governance mechanisms, including data quality audits and contributor guidelines, ensure the sustainability and reliability of the system over time.

To support the in-depth analysis and leverage the availability of multilingual text resources at Wikidata, we have developed an SDG ontology inspired by [7], based on terms that correspond to Wikipedia pages. Currently published in a CSV format on GitHub [8], it defines rows corresponding to SDG entities (such as goals) and maps them to Wikidata Q-IDs. Key columns include: Level (e.g., SDG Goal), Code (e.g., "1", "1.2", "1.2.1"), Wikidata Q-Identifier (e.g., Q23442, Q3048436, Q28146087), label (human-readable name), Description (concise textual summary), and related concepts (optional Q-IDs linking to domains like health, energy, gender equality). Each SDG Goal row includes its code and corresponding Wikidata ID. Targets (e.g. 1.2) are mapped to both their own Wikidata entity and an explicit parent Goal. Indicators (e.g. 13.2.1) reference the relevant Target and define unit, measurement scale, and description. Using the CSV mappings, the ontology is constructed so that:
● sdg:hasTarget links a goal entity to its targets
● sdg:hasIndicator links targets to indicators
● sdg:measuredIn aligns indicator measures to Wikidata units

Additional cross-concept links (sdg:relatedTo) connect indicators to external Q-IDs in domains such as "maternal health" or "clean water". During dataset ingestion, each column bearing an indicator code is annotated using the corresponding Wikidata Q-ID from the ontology, enabling dataset cataloging via sdg:indicator URIs, semantic filtering and query based on concept-level tagging, as well as automatic generation of metadata triples (e.g. linking dataset to indicators and units).

Table 1: Data ingested into the Amazon Observatory from worldwide news (indicating the language coverage), published AI-related scientific articles, and related legal and regulatory landscape

Concepts | 2024 News (Lang. Coverage) | Science | Policy
Biodiversity | 18083 (100) | 44693 | 3628
Indigenous peoples | 8070 (96) | 2014 | 107
Public Health | 26454 (69) | 42355 | 697
Amazon rainforest | 3936 (87) | 172 | 115
Bioeconomy | 156 (16) | 33 | 31
Carbon Credits | 236 (26) | 2127 | 133

Figure 3: Evolution in time of the relation between research concepts related to the Brazilian Amazon Rainforest

3 The Amazon Observatory and Other Pilots

The prominence of domains such as digital data processing and machine learning illustrates AI's multidimensional capacity to address complex challenges in resource allocation, public health systems, and environmental sustainability. Comparative analysis between global discourses and those specifically oriented toward the Brazilian Amazon (driven by the expertise and coordinated efforts of the MetAmazonia initiative) reveals a pronounced emphasis on environmental preservation, biodiversity monitoring, and climate resilience in the latter. This divergence indicates that AI's contributions to sustainable development are not uniform but instead conditioned by region-specific priorities, ecological constraints, and socio-technical contexts. These findings underscore the necessity of developing adaptive, context-aware AI frameworks capable of aligning with the heterogeneous demands of both urban and rural environments.

The Amazon Observatory delivers outcomes such as the MetAmazonia chatbot, a multidimensional open data platform, and accessible resources for students and researchers to advance knowledge and innovation in the region. The system will be the basis for the planned MetAmazonia Chatbot, leveraging these datasets within the broader SDG AI-agent development, aligned with open education principles and UNESCO collaboration. It aims to make knowledge resources directly useful for learners and professionals engaged with Amazonia and their communities. Table 1 shows the data feeding the system across a diversity of topics from news, science and policies, exposing concerns of the public opinion, the knowledge we hold on priority topics, and part of the regulatory landscape.

To illustrate the potential of such an approach, five initial modules have been developed and are being made available for COP30 activities in Belem, at the heart of Amazonia: (i) the News Stream with Sentiment provides multilingual coverage of Amazonia-related news, complemented by word clouds of main concepts and sentiment analysis visualized through maps and gauges; (ii) the Data Exploration Dashboard integrates multiple datasets, displaying global research trends, SDG policy coverage, and innovation activity; (iii) the relation between the concepts (edges) relevant to the Amazonia research and the interconnections between these concepts, being stronger or weaker according to the amount of published articles where these are topics in common (see visualization in figure 3 and data characterization in table 1); (iv) in the Education view, the system visualizes open educational resources by mapping Amazonia-related topics to SDGs, highlighting key domains and their relevance to specific goals such as SDGs 11, 13, and 15; and (v) in regards to innovation ecosystems, we depict the different initiatives that relate to priority topics in the Brazilian Amazon context and could help establish international collaboration to address specific problems with local/global data.

Building on these modules, the SDG-oriented data infrastructure establishes a robust foundation for the development of an AI Agent specialized in Amazonia-related topics. By combining multilingual news streams, interconnected research concepts, and contextualized mappings of innovation and education, the system provides the necessary knowledge base and semantic structure to enable advanced reasoning, retrieval, and decision-support capabilities. Such an AI Agent will not only facilitate rapid access to diverse data sources but also support policymakers, researchers, and local communities by offering synthesized insights aligned with the SDGs. In doing so, it hopes to bridge global sustainability agendas with regional challenges, ensuring that context-specific solutions for the Amazon are informed by evidence, enriched by international collaboration, and continuously updated through the integration of real-time data.
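The CSV-to-triples step behind the SDG ontology of section 2 can be sketched in a few lines. The column names and Q-ID values below are invented for the illustration (the actual CSV on GitHub may use different headers), but the parent-lookup logic follows the hierarchical SDG codes described there (1 → 1.2 → 1.2.1).

```python
import csv, io

# Toy rows in the spirit of the ontology CSV: Level, Code, QID, Label.
# All Q-IDs here are made up for the sketch.
CSV_TEXT = """Level,Code,QID,Label
Goal,13,Q9000013,Climate action
Target,13.2,Q9000132,Integrate climate change measures into policies
Indicator,13.2.1,Q9001321,Countries with nationally determined contributions
"""

def to_triples(csv_text):
    """Emit (subject, predicate, object) triples linking goals to targets
    (sdg:hasTarget) and targets to indicators (sdg:hasIndicator), using the
    hierarchical SDG codes to find each row's parent entity."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    qid_by_code = {row["Code"]: row["QID"] for row in rows}
    triples = []
    for row in rows:
        parent_code = row["Code"].rsplit(".", 1)[0]   # "13.2.1" -> "13.2"
        if row["Level"] == "Target":
            triples.append((qid_by_code[parent_code], "sdg:hasTarget", row["QID"]))
        elif row["Level"] == "Indicator":
            triples.append((qid_by_code[parent_code], "sdg:hasIndicator", row["QID"]))
    return triples

for triple in to_triples(CSV_TEXT):
    print(triple)
```

The same pass can be extended with sdg:measuredIn and sdg:relatedTo edges once the corresponding columns are parsed.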
4 Conclusions and Further Work

As the global community continues to pursue the 2030 Agenda, the importance of robust, interoperable, and machine-actionable SDG data infrastructure has never been greater. This paper has explored the architecture and implementation of an AI-ready data infrastructure for the SDGs, using the IRCAI SDG Observatory and its derived pilots as case studies. Central to this infrastructure is a well-defined metadata schema, semantic alignment with Wikidata entities, and adherence to FAIR data principles, all designed to support automation, reasoning, and integration of data across domains and geographies. By embedding SDG indicators, targets, and goals into a linked-data framework, the system transforms static reporting datasets into dynamic, queryable resources. This enables a wide range of AI applications, from natural language querying to knowledge graph reasoning and real-time decision support. The SDG Ontology, based on mappings to Wikidata Q-IDs, serves as a semantic backbone, enabling interoperability with external datasets and ontologies while enhancing transparency and reusability.

Despite these advancements, several challenges remain. Data fragmentation across jurisdictions, lack of standardization in national reporting, and uneven metadata quality continue to hinder full automation and scalability. Furthermore, ethical considerations around data use, particularly in the context of AI-based decision-making, require further exploration.

To improve the Amazon Observatory, future development of AI-ready data infrastructure will focus on several key areas: (i) Automated Ontology Expansion. Leveraging large language models and knowledge extraction tools to automate the discovery and integration of new SDG-related concepts from policy documents, scientific literature, and real-time news streams; (ii) Interoperability with National Platforms. Building tools that support seamless integration of local statistical data with global and local SDG indicators (e.g., focusing on Amazonia), using schema mapping and automated alignment with the SDG ontology; (iii) Real-Time Data Ingestion and Streaming Analytics. Incorporating real-time data sources, such as remote sensing, sensor networks, and social media, to enable early-warning systems and near-instant progress monitoring; (iv) AI-Powered Decision Support Tools. Developing interfaces and tools that allow policy-makers to simulate interventions, explore causal relationships, and evaluate trade-offs between SDG targets using AI models trained on the structured data; (v) Community Governance and Open Collaboration. Establishing open, participatory governance models for ontology evolution, dataset curation, and quality assurance to ensure that the infrastructure remains globally relevant and inclusive.

In conclusion, AI-ready SDG infrastructure represents a transformative opportunity for evidence-based policy, global collaboration, and data-driven action on sustainability. By continuing to invest in semantic technologies, metadata standards, and open data ecosystems, we can enable a new generation of intelligent tools that accelerate progress toward the SDGs both globally and locally.

Acknowledgments / Zahvala

We thank the support of the European Commission projects ELIAS (GA101120237) and RAIDO (GA101135800).

References / Literatura

[1] Bachmann, N., Tripathi, S., Brunner, M. and Jodlbauer, H. (2022). The contribution of data-driven technologies in achieving the sustainable development goals. Sustainability, 14(5), p.2497.
[2] Stahl, B.C., Schroeder, D. and Rodrigues, R. (2022). AI for Good and the SDGs. In Ethics of artificial intelligence: Case studies and options for addressing ethical challenges (pp. 95-106). Cham: Springer International Publishing.
[3] Pita Costa, J. (2023). Water Intelligence to Support Decision Making, Operation Management and Water Education: NAIADES Report. IRCAI.
[4] Pita Costa, J., Barrionuevo, L., Kovič Dine, M. (2025). Observing the Impact of AI in the Progress of Sustainable Development Goal 11. In Proceedings of the 23rd IADIS International Conference e-Society 2025.
[5] Jermol, M., Pita Costa, J. and Kovačič, M. (2025). Onwards to an Ethical and Bias Aware Education for Sustainability through AI. Journal of Artificial Intelligence for Sustainable Development (to appear).
[6] Sustainable Development Solutions Network (2015). Indicators and a Monitoring Framework for the SDGs. United Nations.
[7] Joshi, A., Gonzalez Morales, L., Klarman, S., Stellato, A., Helton, A., & Lovell, S. (2019). A Knowledge Organization System for the United Nations Sustainable Development Goals. In Proceedings of the 2019 International Conference on Knowledge Engineering and Knowledge Management (EKAW). Springer.
[8] Pita Costa, J. (2025). IRCAI SDG Ontology. GitHub. Available at https://github.com/IRCAI-SDGobservatory/data
Towards a Format for Describing Networks

Vladimir Batagelj (IMFM, Ljubljana, Slovenia; UP, IAM and FAMNIT, Koper, Slovenia; UL, FMF, Ljubljana, Slovenia; vladimir.batagelj@fmf.uni-lj.si)
Tomaž Pisanski (UP, FAMNIT, Koper, Slovenia; IMFM, Ljubljana, Slovenia; tomaz.pisanski@upr.si)
Iztok Savnik (UP, FAMNIT, Koper, Slovenia; iztok.savnik@upr.si)
Ana Slavec (UP, FAMNIT, Koper, Slovenia; InnoRenew CoE, Koper, Slovenia; ana.slavec@famnit.upr.si)
Nino Bašić (UP, FAMNIT and UP, IAM, Koper, Slovenia; IMFM, Ljubljana, Slovenia; nino.basic@famnit.upr.si)

Abstract

The article provides an overview of the most important network analysis resources and the various types of networks encountered in their use. Based on experience in developing the NetsJSON format, we present components that an exchange/archive format for describing networks should contain.

Keywords

Network analysis, Network types, Identification, Format, Exchange, Archive, Data repository, Factorization, JSON, FAIR.

1 Introduction

Open data plays a crucial role in ensuring the computational reproducibility and verifiability of published results. The obtained results can be verified or supplemented with other methods. Collections of similar and well-documented datasets are also crucial for developing new methods to analyze specific types of data. It is good to test a new method on several datasets and check whether it gives meaningful/expected results. When preparing such data, it is essential to adhere to the FAIR principles – Findability, Accessibility, Interoperability, and Reusability. To facilitate ease of use, data should ideally be stored in a text format that preserves the structure of the data and includes relevant metadata. Datasets are alive. Their connection to open repositories is important for their accessibility and maintenance.

In 2023, the International Network for Social Network Analysis (INSNA) requested that Zachary Neal form a working group to develop recommendations for sharing network data and materials. They were published in Network Science in 2024 [21], accompanied by the Endorsement page [20].

Network analysis is an area where data is often stored in diverse formats. It would be highly beneficial to adopt a common "archiving/exchange" format that can describe (almost) all networks and support authoring, deposit, exchange, visualization, reuse, and preservation of network data. Such a format would allow us to obtain the specific descriptions required by various network analysis programs using relatively simple scripts.

We have many years of experience in developing formats for describing graphs and networks [11, 10, 4]. We will present the NetsJSON format currently used to describe networks with structured data, and some ideas for improving it. This could be a starting point for the development of a common format for exchanging and archiving networks.

2 Support for network analysis

The concept of a network is an extension of the concept of a graph. A graph describes the structure of a network. Network analysis is a branch of data analysis that draws heavily on the concepts and results of graph theory. The difference between the two is that networks are usually "irregular", while most problems and results of graph theory assume some "regularity".

There are many tools and programs for network analysis. For example, UCINET, Pajek, Gephi, NetMiner, Cytoscape, NodeXL, E-Net, Tulip, PUCK, GraphViz, SocNetV, Kumu, Polinode, etc. Programmers can use network analysis packages/libraries in a variety of programming languages (Python, R, Julia, C++, etc.). They support various network description formats: CSV, UCINET DL, Pajek NET, Gephi GEXF, GDF, GML, GraphML, GraphX, GraphViz DOT, Tulip TPL, Netdraw VNA, Spreadsheet, etc. [13, 25, 16]. In addition, network data appears in several application areas such as chemistry and genealogy. There are many formats for describing these data.

Network datasets are available in multiple repositories. Some repositories only provide metadata about an individual network and a link to the actual dataset. At the same time, others also store the data and offer users a display of basic network characteristics. None of them explicitly adheres to FAIR data principles.
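The "relatively simple scripts" mentioned in the introduction can be illustrated with a minimal converter from a CSV edge list to Pajek's NET format. The sketch assumes an undirected, weighted edge list with a source,target,weight header; a real converter would also have to handle directed links, isolated nodes, and node properties.

```python
import csv, io

def csv_to_pajek(csv_text):
    """Convert a CSV edge list (source,target,weight) into Pajek NET format:
    a *Vertices section with 1-based indices, followed by an *Edges section."""
    edges = list(csv.DictReader(io.StringIO(csv_text)))
    names = sorted({e["source"] for e in edges} | {e["target"] for e in edges})
    index = {name: i + 1 for i, name in enumerate(names)}  # Pajek is 1-based
    lines = [f"*Vertices {len(names)}"]
    lines += [f'{i} "{name}"' for name, i in index.items()]
    lines.append("*Edges")
    lines += [f'{index[e["source"]]} {index[e["target"]]} {e["weight"]}'
              for e in edges]
    return "\n".join(lines)

print(csv_to_pajek("source,target,weight\nana,bor,2\nbor,cvet,1\n"))
```

The same skeleton, with a different output section, would serve other line-oriented targets such as UCINET DL or GraphViz DOT.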
Interesting networks can also be found on general data repositories such as Kaggle. Networks can also be created programmatically from selected data. For example, from bibliographic data from the free OpenAlex service, we can create collections of bibliographic networks on a selected topic using the OpenAlex2Pajek library in R.

For detailed lists of network analysis resources with links to web pages, see GitHub/bavla/NetsJSON/Info [4].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.9

3 Graphs and networks

3.1 Unit identification

The fundamental task in transforming data about the selected topic into a structured dataset to be used in further analyses is the identification of units (entity recognition). Often, the source data are available as unstructured or semi-structured text. In this case, the transformation is a task of computer-assisted text analysis (CaTA). Terms considered in TA are collected in a dictionary (it can be fixed in advance, or built dynamically). The two main problems with terms are: equivalence – different words representing the same term, and ambiguity – same words representing different terms. Because of these, the transformation of raw text data into a formal description is often done manually or semiautomatically.

We assume that unit identification assigns a unique identifier (ID) to each unit. For some types of units, such IDs are standardized: ISO 3166-1 alpha-2 two-letter country codes, ISO 9362 Bank Identifier Codes (BIC), ORCID – Open Researcher and Contributor ID, ISSN – International Standard Serial Number, DOI – Digital Object Identifier, URI – Uniform Resource Identifier, etc. Often, in data displays, IDs are replaced by corresponding (short) labels/names.

Besides the semantic units or concepts related to the selected topic, we can also identify in the raw data syntactic units – parts of the text. As syntactic units of TA, we usually consider clauses, statements, paragraphs, news, messages, etc. In thematic TA, the units are coded as a rectangular matrix Syntactic units × Concepts, which can be considered as a two-mode network. In semantic TA, the units (often clauses) are encoded according to the S-V-O (Subject-Verb-Object) model or its improvements. This coding can be directly considered as a network with Subjects ∪ Objects as nodes and links (arcs) labeled with Verbs. This is also a basis for the semantic web and knowledge networks.

3.2 Networks

A network is based on two sets – a set of nodes (vertices) that represent the selected units, and a set of links (lines) that represent ties between units. They determine a graph. Additional data about nodes or links can be known – their properties (attributes). For example: name/label, type, value, etc.

Network = Graph + Data

The data can be measured or computed. Formally, a network N = (V, L, P, W) consists of:
• a graph G = (V, L), where V is the set of nodes and L = E ∪ A is the set of links. A link e ∈ L is either directed – an arc e ∈ A, or undirected – an edge e ∈ E; quantities n = |V|, m = |L|
• P is a set of node value functions / properties: p : V → A
• W is a set of link value functions / weights: w : L → B

Sometimes, implicit additional information/data about values is provided in the specifications of properties: (a) how can we compute with values – algebraic structures, semigroup, monoid, group, semiring, etc., and (b) properties of values – in a molecular graph, an atom is assigned to each node; properties of relevant atoms are such additional data.

The terminology in the field of network analysis is not unified. Different application areas use other terms. For example: node – vertex, actor, unit; link – line, tie, edge, connection; etc.

3.3 Types of networks

Besides ordinary (directed, undirected, mixed) networks, some special types of networks are also used:
• 2-mode networks, bipartite (valued) graphs – networks between two disjoint sets of nodes.
• multi-relational networks.
• linked networks and collections of networks.
• multilevel networks.
• temporal networks, dynamic graphs – networks changing over time.
• specialized networks: representation of genealogies as p-graphs, molecular graphs, the graphs coding Petri's nets, etc.

Network (input) file formats should provide the means of expressing all of these types of networks. All interesting data should be recorded (respecting privacy).

In a two-mode network N = ((U, V), L, P, W), the set of nodes consists of two disjoint sets of nodes U and V, and all the links from L have one end node in U and the other in V.

A multi-relational network N = (V, (L1, L2, . . . , Lk), P, W) contains different relations Li (sets of links) over the same set of nodes. Also, the weights from W are defined on different relations or their union.

In a linked or multimodal network N = ((V1, V2, . . . , Vj), (L1, L2, . . . , Lk), P, W) the set of nodes V is partitioned into subsets (modes) Vi, Ls ⊆ Vp × Vq, and properties and weights are usually partial functions.

A set of networks {N1, N2, . . . , Nk} in which each network shares a (sub)set of nodes with some other network is called a collection of networks. A linked network can be transformed into a collection of networks and vice versa.

Bibliographical information is usually represented as a collection of bibliographical networks {Cite, WA, WK, WC, WI, . . .} (W – works, A – authors, K – keywords, C – countries, I – institutions) [7].

Another example of multimodal multirelational networks are knowledge graphs. They can have a very diverse structure (a large number of types of units (modes) and predicates (relations)), which allows for a fairly accurate description of facts from a selected field and solving problems about it. Network analysis methods are particularly useful in analyzing one or a few relational (sub)networks of a knowledge graph.
Network In a , the presence/activity of a node/link can temporal network = Graph + Data change through time . The basic division of temporal network T The data can be measured or computed. Formally, a , , , consists of: N = (V L P W) network descriptions is into cross-sectional and longitudinal. A cross- sectional description usually consists of a sequence of time slices • a graph G = (V, L), where V is the set of nodes and – ordinary networks that describe the state at a selected moment L = E ∪ A is the set of links. A link 𝑒 ∈ L is either or time interval. A longitudinal description is based on temporal directed – an arc 𝑒 , or undirected – an edge 𝑒 , [12, 9] or on a sequence of events. ∈ A ∈ E quantities 𝑛 , 𝑚 = |V | = |L | • P is a set of node value functions / properties: 𝑝 : V → 𝐴 4 Description of traditional networks • W is a set of link value functions / weights: 𝑤 : L → 𝐵 How to describe a network , , , ? In principle the N = (V L P W) Sometimes, implicit additional information/data about values answer is simple – we list its components , , , and . V L P W is provided in the specifications of properties: (a) how can we The simplest way is to describe a network by providing N compute with values – algebraic structures, semigroup, monoid, , and , in a form of two tables. Both tables are (V P) (L W) group, semiring, etc., and (b) properties of values – in a molecular often maintained in some spreadsheet program. They can be graph, an atom is assigned to each node; properties of relevant exported as text in CSV (Comma Separated Values) format. In atoms are such additional data. large networks, we split a network into some subnetworks – a The terminology in the field of network analysis is not unified. collection, to avoid the empty cells. Different application areas use other terms. For example: node – To save space and improve computing efficiency, we often vertex, actor, unit; link – line, tie, edge, connection; etc. 
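To make the two-table description concrete, a toy network could be laid out as node and link tables and exported as CSV, with an integer coding table for a categorical column. This is an illustrative sketch only (the toy data, column layout, and the to_csv helper are ours, not a prescribed format), with column names echoing the NetsJSON node/link fields (id, lab, n1, n2, rel):

```python
import csv
import io

# Toy network: a nodes table (V, P) and a links table (L, W).
nodes = [
    {"id": 1, "lab": "Ana", "mode": "person"},
    {"id": 2, "lab": "Bor", "mode": "person"},
    {"id": 3, "lab": "IJS", "mode": "institution"},
]
links = [
    {"type": "arc", "n1": 1, "n2": 3, "rel": "worksAt", "w": 1.0},
    {"type": "arc", "n1": 2, "n2": 3, "rel": "worksAt", "w": 0.5},
    {"type": "edge", "n1": 1, "n2": 2, "rel": "knows", "w": 2.0},
]

def to_csv(rows):
    """Serialize one table of the description to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Integer coding of a categorical column: the coding table maps each
# distinct value to a 1-based index and should be kept with the data.
rels = [lk["rel"] for lk in links]
coding = {v: i for i, v in enumerate(sorted(set(rels)), start=1)}
coded_rels = [coding[r] for r in rels]

nodes_csv = to_csv(nodes)
links_csv = to_csv(links)
```

Note that without the coding table {"knows": 1, "worksAt": 2}, the coded column [2, 2, 1] is uninterpretable – which is exactly why omitting the coding table from a description is harmful.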
Towards a Format for Describing Networks

In R, replacing the values of a categorical variable with integers in this way is called a factorization. We enumerate all possible values of a given categorical variable (the coding table) and afterward replace each value with the corresponding index in the coding table. Since node labels/IDs can be considered a categorical variable, factorization is also usually applied to them. In data analysis, indices start with 1, but real computer scientists start counting from 0. Therefore, it is desirable to include information about the minimal index value in the description.

This approach is used in most programs dealing with large networks. Unfortunately, the coding table is often considered a kind of metadata and is omitted from the description.

In Pajek [15], a node property can be represented in an associated file as a vector (numbers, .vec), a partition (nominal, .clu), or a permutation (order, .per). All network files can be combined into a single project file (.paj). Metadata can be added as comments written on lines starting with %. An example of transforming CSV tables into Pajek files is available at GitHub/bavla/netsJSON/example/bib [4].

5 Nets and NetsJSON

We were satisfied with the "traditional" network description, as implemented in Pajek [15], until we became interested in networks with node/link properties that are not measured in standard scales (ratio, interval, ordinal, nominal), but have structured values (text, subset, interval, distribution, time series, temporal quantity, function, etc.). In topological graph theory, an embedding is described by assigning a rotation to each node [23]. For describing temporal networks, we initially extended the Pajek format and defined and used the Ianus format [12].

For a format supporting structured values, there were two obvious choices for its base – XML and JSON. Both are widely known and suitable as structured data formats. However, JSON can represent the same data as XML more concisely. We chose JSON and in 2015 started developing and using the NetJSON format and the Nets Python package to handle networks with structured-valued properties or weights [5, 4, 3]. On February 26, 2019, the format was renamed to NetsJSON because of the collision with http://netjson.org/rfc.html. NetsJSON has two versions: a basic and a general version. The current implementation of the Nets library supports only the basic version.

In addition to describing networks with structured values, NetsJSON is expected to offer the capabilities of (most) existing network description formats [13, 25] (archiving, conversion) and to provide input data for D3.js visualizations.

A network description in NetsJSON follows the JSON (JavaScript) syntax and consists of five main fields (netsJSON, info, nodes, links, data):

{ "netsJSON": "basic",
  "info": { "org":1, "nNodes":n, "nArcs":mA, "nEdges":mE,
    "simple":TF, "directed":TF, "multirel":TF, "mode":m,
    "network":fName, "title":title,
    "time": { "Tmin":tm, "Tmax":tM, "Tlabs": {labs} },
    "meta": [events], ...
  },
  "nodes": [
    { "id":nodeId, "lab":label, "x":x, "y":y, ... },
    ***
  ],
  "links": [
    { "type":arc/edge, "n1":nodeID1, "n2":nodeID2, "rel":r, ... },
    ***
  ],
  "data": { "data1":description1,
    ***
  }
}

where . . . are user-defined properties and *** is a sequence of such elements. The netsJSON field identifies the format, the info field contains metadata, the nodes field contains the table (V, P), and the links field contains the table (L, W).

In recent years, we also analyzed bike-sharing systems (a link weight is the daily distribution of the numbers of trips), bibliographies (yearly distributions of publications or citations), and multiway networks [8, 9, 1]. It turned out that it was necessary to add another main field, data, to the basic NetsJSON format, in which we provide additional data about the properties of values (translations of labels into selected languages, algebraic structure, etc. [6]).

An event description can contain the following fields: type, date, title, author, desc, url, cite, copyright, etc. It is intended to provide information about the "life" of the dataset – collection/creation, changes, releases, uses, publications, etc.

For describing temporal networks, a node element and a link element have an additional required property tq – a temporal quantity. For an example, see violenceU.json at GitHub/bavla/Graph/JSON, describing Franzosi's violence network.

The general NetsJSON format is also expected to support the description of network collections.

6 Elements of a common network format

Our experience with network analysis to date is summarized in the following recommendations on the elements of a common format for describing networks.

Combining data and its metadata into a single file is a robust approach to ensuring data integrity. A JSON-based format is particularly well suited for this purpose, as it fits well with the data structures of modern programming languages. JSON also supports Unicode.

We would also encourage the provision, as metadata, of information about the context of the network, additional knowledge about it, articles or notebooks on its analysis, comments of users, etc. Kaggle is a good example. An improved ICON repository or Network Repository (we disagree with their "citation request") could be the way to go. Existing metadata standards should be taken into account (Dublin Core, FAIR, Schema). Data has a "life". When selecting data, its age is often important. Metadata should include at least the collection/creation date and the last modification date.

By the FAIR principles, the format should support – Findability: globally unique and persistent identifiers, rich metadata; Accessibility: an open, free, and universally implementable standardized communication protocol; Interoperability: a formal, accessible, shared, and broadly applicable language for knowledge representation; Reusability: metadata that are richly described and associated with detailed provenance.

The format must support all types of networks (simple, 2-mode, linked, multi-relational, multi-level, temporal). The network can contain both arcs and edges, as well as parallel links. To describe some knowledge graphs, it would be necessary to allow links to act as the end nodes of other links [18].

As mentioned earlier, using factorization produces a more concise description of the network. In cases where the node names are not too long and are readable, we sometimes want to avoid factorization. This can be achieved by using a switch that indicates whether factorization is used. We can also shorten the description by introducing default values of selected properties. If we also allow counting from 0, it makes sense to add information about the smallest index.

Long labels cause problems when printing/visualizing (parts of) networks and results. Therefore, it is useful to have abbreviated versions of labels available. For language-based labels, it is sometimes useful to offer additional versions in selected other languages, which increases the accessibility of the data and the understandability of the results.

Most of the network datasets produced by network science have no node labels. Node labels are not needed if you study distributions, but they are essential in the interpretation of the obtained "important substructures". We would encourage providing node labels, or at least some typological information, in cases where privacy issues arise.

The common format should support descriptions of networks specific to specialized fields of application, such as molecular graphs, genealogies (p-graphs) [29], and topological graph embeddings [23, 24], among others. The format must be extensible. In addition to the agreed-upon fields, users can add their own, allowing for a comprehensive description of their data.

It is also interesting to ask whether and, if so, how to include descriptions of its displays in the network description. Perhaps it would be worth relying on Vega-Lite [26, 28] and D3.js [14]. Some ideas can also be taken from the section on "defining visualization parameters in the input file" in the Pajek 5.3 manual [19, p. 89].

Although we are committed to a single-file approach, there may be times when external files are needed (for example, images to display nodes). Consideration should be given to how to support this option. Given the basic purpose of the common format, standard tools (ZIP) can be used to compress large networks.

We have not yet started working on the general format. It is supposed to enable descriptions of collections of networks. The question arises about the scope of validity of IDs – does the same ID in different networks represent the same or different units? This is important for operations such as the union or intersection of networks. Which way to go – introducing contexts or using matchings? Maybe some ideas from the Open Archives Initiative Object Reuse and Exchange (OAI-ORE) and GraphX could be used [22, 17]. An interesting option is the constructive network description – building a network from smaller components [10] or describing a network by its construction sequence [2]. Additional ideas may be found on the page "A Python Graph API?" [27]. For now, we would leave aside descriptions of generalizations of networks (multiway networks and hypernets), but we must not forget about them.

The agreed format must be well documented and supported by examples of the use of the supported options.

7 Conclusions

Yet another format only makes sense as a project of a larger community of users in the field of network analysis.

Acknowledgements

The computational work reported in this paper was performed using the programs R and Pajek [15]. The code and data are available at GitHub/bavla [4].

V. Batagelj is supported in part by the Slovenian Research Agency (research program P1-0294 and research project J5-4596); this work was prepared within the framework of the COST action CA21163 (HiTEc).

T. Pisanski is supported in part by the Slovenian Research Agency (research program P1-0294 and research projects BI-HR/23-24-012, J1-4351, and J5-4596).

N. Bašić is supported in part by the Slovenian Research Agency (research program P1-0294 and research project J5-4596).

References
[1] Vladimir Batagelj. 2024. Cores in multiway networks. Social Network Analysis and Mining, 14, 1, 122.
[2] Vladimir Batagelj. 1985. Inductive classes of graphs. In Proceedings of the 6th Yugoslav Seminar on Graph Theory, 43–56.
[3] Vladimir Batagelj. 2016. Nets – Python package for network analysis. Accessed 2025-03-18. https://github.com/bavla/Nets.
[4] Vladimir Batagelj. 2016. NetsJSON – a JSON format for network analysis. Accessed 2025-03-18. https://github.com/bavla/netsJSON.
[5] Vladimir Batagelj. 2016. Network visualization based on JSON and D3.js. Slides. https://github.com/bavla/netsJSON/blob/master/doc/netVis.pdf.
[6] Vladimir Batagelj. 2021. Semirings in network data analysis / an overview. Slides. https://github.com/bavla/semirings/blob/master/docs/semirings.pdf.
[7] Vladimir Batagelj and Monika Cerinšek. 2013. On bibliographic networks. Scientometrics, 96, 3, 845–864.
[8] Vladimir Batagelj and Anuška Ferligoj. 2016. Symbolic network analysis of bike sharing data / Citi Bike. Accessed 2025-03-18. https://github.com/bavla/Bikes/blob/master/bikes.pdf.
[9] Vladimir Batagelj and Daria Maltseva. 2020. Temporal bibliographic networks. Journal of Informetrics, 14, 1, 101006.
[10] Vladimir Batagelj and Andrej Mrvar. 1995. NetML. Accessed 2025-03-18. https://github.com/bavla/netsJSON/blob/master/doc/snetml.pdf.
[11] Vladimir Batagelj and Andrej Mrvar. 2018. Pajek and PajekXXL. In Encyclopedia of Social Network Analysis and Mining. Springer, 1–13.
[12] Vladimir Batagelj and Selena Praprotnik. 2016. An algebraic approach to temporal network analysis based on temporal quantities. Social Network Analysis and Mining, 6, 1–22.
[13] Jernej Bodlaj and Monika Cerinšek. 2014, 2017. Network data file formats. In Encyclopedia of Social Network Analysis and Mining. Reda Alhajj and Jon Rokne, editors. Springer New York, New York, NY, 1076–1091. ISBN 978-1-4614-7163-9. DOI: 10.1007/978-1-4614-7163-9_298-1.
[14] Mike Bostock, Jason Davies, Jeffrey Heer, and Vadim Ogievetsky. [n. d.] D3 – the JavaScript library for bespoke data visualization. Accessed 2025-08-29. https://d3js.org/.
[15] Wouter De Nooy, Andrej Mrvar, and Vladimir Batagelj. 2018. Exploratory social network analysis with Pajek: Revised and expanded edition for updated software. Vol. 46. Cambridge University Press.
[16] Gephi. 2022. Supported graph formats. Accessed 2025-03-18. https://gephi.org/users/supported-graph-formats/.
[17] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. 2014. GraphX: graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 599–613.
[18] Aidan Hogan et al. 2021. Knowledge graphs. ACM Computing Surveys (CSUR), 54, 4, 1–37.
[19] Andrej Mrvar and Vladimir Batagelj. 2025. Pajek reference manual. Version 6.01. http://mrvar.fdv.uni-lj.si/pajek/pajekman.pdf.
[20] Zachary P. Neal et al. 2024. Recommendations for sharing network data and materials – endorsement page. Accessed 2025-03-18. https://www.zacharyneal.com/datasharing.
[21] Zachary P. Neal et al. 2024. Recommendations for sharing network data and materials. Network Science, 12, 4, 404–417.
[22] Open Archives Initiative. 2014. Object Reuse and Exchange (OAI-ORE). 2014-08-14. Accessed 2025-03-18. https://www.openarchives.org/ore/.
[23] Tomaž Pisanski. 1980. Genus of Cartesian products of regular bipartite graphs. Journal of Graph Theory, 4, 1, 31–42.
[24] Tomaž Pisanski and Arjana Žitnik. 2004. Representations of graphs and maps. In 26th International Conference on Information Technology Interfaces, IEEE, 19–25.
[25] Matthew Roughan and Jonathan Tuke. 2015. Unravelling graph-exchange file formats. arXiv preprint arXiv:1503.02781.
[26] Arvind Satyanarayan, Dominik Moritz, Kanit Wongsuphasawat, and Jeffrey Heer. 2016. Vega-Lite: a grammar of interactive graphics. IEEE Transactions on Visualization and Computer Graphics, 23, 1, 341–350.
[27] The Python Wiki. 2011. A Python Graph API? Accessed 2025-03-18. https://wiki.python.org/moin/PythonGraphApi.
[28] University of Washington Interactive Data Lab. [n. d.] Vega-Lite – A Grammar of Interactive Graphics. Accessed 2025-08-29. https://vega.github.io/vega-lite/.
[29] Douglas R. White, Vladimir Batagelj, and Andrej Mrvar. 1999. Anthropology: analyzing large kinship and marriage networks with pgraph and Pajek. Social Science Computer Review, 17, 3, 245–274.

Automating Numba Optimization with Large Language Models: A Case Study on Mutual Information

Lučka Kozamernik (Teads, lucka.kozamernik@teads.com), Blaž Škrlj (Teads, blaz.skrlj@teads.com), Martin Jakomin (Teads, martin.jakomin@teads.com), Jasna Urbančič (Teads, jasna.urbancic@teads.com)

Abstract

Contemporary large language models (LLMs) enable fast research cycles when developing or optimizing new algorithms. In this
work, we investigate whether existing LLMs are sufficient to automatically, under the constraints of unit tests, produce implementations of computationally intensive algorithms, such as the mutual information algorithm, that out-perform existing human-made baselines. We establish an evaluation pipeline in which newly proposed AI implementations are rigorously tested, evaluated, and benchmarked against existing baselines. We used synthetic numeric datasets of different sizes, and the results show a 10-fold speed-up of the LLM-optimized implementations compared to the naive Numba-based optimization, while producing consistently correct mutual information scores.

Keywords
optimization, mutual information, LLM, Numba

Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.22

1 Introduction

Mutual Information (MI) (see, e.g., [4] for a detailed overview) stands as a fundamental measure in information theory, quantifying the statistical dependency between two random variables. Its application is widespread and critical across numerous domains, including feature selection in machine learning [8], neuroscience for analyzing neural spike trains [2], and bioinformatics for understanding gene regulatory networks [9]. The versatility of MI lies in its ability to capture arbitrary non-linear relationships, a significant advantage over linear correlation measures like Pearson's coefficient.

However, the computational cost of calculating mutual information, especially for large datasets with continuous variables, presents a substantial bottleneck. The standard approach involves discretizing the data into bins in order to estimate probability distributions, a process whose accuracy and performance are highly sensitive to the chosen binning strategy and the efficiency of the underlying implementation. For practitioners working within the Python ecosystem, libraries like NumPy and SciPy are standard tools, but their performance on MI calculations can be suboptimal for high-throughput screening or large-scale data exploration tasks.

To address this performance gap, Just-In-Time (JIT) compilers like Numba [6] have become indispensable. By translating Python and NumPy code into optimized machine code at runtime, Numba offers C-like performance without sacrificing the flexibility and ease of use of the Python language. A well-written, Numba-accelerated MI function can be orders of magnitude faster than its pure Python equivalent. Despite these gains, achieving optimal performance with Numba is not always straightforward. The efficiency of Numba-jitted code is highly dependent on the specific implementation patterns, data access methods, and loop structures used – subtleties that often require significant programmer expertise to navigate.

This paper introduces a novel approach to bridge this gap: the use of Large Language Models (LLMs) to automatically optimize Numba-based mutual information algorithms. We hypothesize that modern LLMs, trained on vast repositories of code, possess the capability to analyze suboptimal Numba implementations and refactor them into more efficient versions. Our work explores whether an LLM can identify and correct common performance anti-patterns in Numba code, such as improper loop organization or inefficient data type usage, to generate an MI implementation that surpasses a naively written Numba function. We present a framework for systematically prompting an LLM with a baseline algorithm and evaluating the performance of its generated optimizations, demonstrating the potential for AI-driven code acceleration in scientific computing.

2 Related work

This research builds upon three principal areas of study: the computation of mutual information, performance optimization with JIT compilers, and the application of Large Language Models to code intelligence tasks.

Mutual information estimation is the long-standing challenge of accurately and efficiently estimating mutual information from given data. Defined as

  I(X; Y) = E_{p(X,Y)} [ log ( p(X, Y) / ( p(X) p(Y) ) ) ],

it measures the pairwise relationships between random variables (continuous or discrete). The most common methods, as reviewed by Fraser and Swinney (1986) [3] and explored in detail by Kraskov, Stögbauer, and Grassberger (2004) [5], are based on data discretization (binning) or on k-nearest-neighbor (k-NN) estimators. While k-NN methods avoid the issue of bin selection, they typically incur higher computational complexity. Binned methods, though conceptually simpler, depend heavily on the binning strategy for accuracy and performance, a topic extensively studied by Steuer et al. (2002) [11]. Our work focuses on the binned approach, as it is highly amenable to loop-based array computations where Numba excels.

The performance limitations of Python for numerical computation led to the development of various acceleration tools, specifically JIT compilers for scientific Python. Numba, introduced by Lam, Pitrou, and Seibert in 2015 [6], has emerged as a leading solution by providing a decorator-based JIT compiler that integrates seamlessly with NumPy. It allows developers to accelerate functions containing Python and NumPy syntax, often achieving performance comparable to compiled languages.
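The binned estimator defined above can be written in a few lines of NumPy. The following is an illustrative reference sketch (the function name binned_mi and the default bin count are ours), not the implementation studied in this paper:

```python
import numpy as np

def binned_mi(x, y, bins=16):
    """Binned plug-in estimate of I(X;Y) from a 2-D histogram (reference sketch)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()             # joint probabilities p(x, y)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x), shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, bins)
    nz = pxy > 0                          # skip empty bins: 0 * log 0 -> 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = rng.normal(size=10_000)
mi_dep = binned_mi(x, x)   # strong dependence: equals the binned entropy of x
mi_ind = binned_mi(x, y)   # independent samples: close to zero
```

The nested array operations and the histogram loop hidden inside np.histogram2d are exactly the kind of structure that a hand-tuned (or LLM-tuned) Numba kernel can flatten into explicit loops.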
Research and community best practices have established a set of Benchmark optimization techniques for Numba, such as managing memory layout, ensuring type stability, and structuring loops for paral- lelization and vectorization. This body of knowledge forms the --- Results as additional context *not implemented basis against which we evaluate the LLM’s optimization capabil- ities. Our work differs from traditional performance tuning by Figure 1: Architectural sketch of the benchmarking frame- attempting to automate the discovery and application of these work. Dashed are the feedback loops proposed as future techniques solely through an AI model. work. The emergence of robust Large Language Models (LLMs), such as OpenAI’s Codex (the technology powering GitHub Copilot), has revolutionized software development. These models have serve as additional prompts to the LLM in order to improve itself demonstrated remarkable proficiency in code generation, trans- and the code on the areas where the tests are failing. lation, and explanation [1]. More recently, research has shifted Finally, in the last step of the framework, the resulted imple- towards their application in more nuanced tasks like code refac- mentations were extensively benchmarked. The metric we were toring and optimization. For instance, studies have explored using most interested in was the time needed to compute the mutual LLMs to suggest improvements for energy efficiency or to refac- information for a given dataset; however, other metrics, such as tor code for better readability. However, the specific domain of memory utilization or GPU utilization, could also be used for a optimizing numerical algorithms within a JIT compilation frame- different use case. We further discuss our experimental setup in work like Numba remains relatively unexplored. While LLMs the results section. 
are known to generate functional code, their ability to produce code that is performant by adhering to the specific constraints 3.1 Reviewing the LLM optimized code and best practices of a framework like Numba is an open and The implementations of mutual information, produced by the compelling research question that this paper directly addresses. selected LLMs, are remarkably similar — both in syntax and in the naming convention. However, there are subtle differences that 3 Using LLMs to optimize existing code set them apart, which we will address later. AI-aided implementa- To facilitate systematic experimentation with LLM-optimized tions have in common that they completely omit error-handling code, we set up a novel framework. The workflow consists of the model inherited from NumPy opting for the native Python in- following basic steps: stead. Moreover, they disregard bound checks for matrix opera- tions beforehand, leaving the code to crash if it goes out of bounds. (1) Prompt the LLM with the task and context. The latter is, according to the official documentation, advised for (2) Test the proposed optimizations against the unit tests. debug purposes only and should be turned off for production, as (3) Benchmark the proposed implementation. it slows down the code significantly. Having said that, Gemini The framework is LLM-agnostic, meaning that any LLM can be implemented bound checks using elementary operations. In line used with it. We opt for the latest and most advanced versions with the change in error handling, both implementations prefer of two popular LLMs, namely ChatGPT 5 and Gemini 2.5-Pro. elementary operations over native NumPy functions. For exam- Both are freely available and excel in complex tasks such as ple, to find the maximal value in an array, the LLM optimized reasoning and coding. The architecture of the framework is given code goes through all elements in the array by the index and in Figure 1. 
compares to the current maximum instead of calling the built-in To ensure a fair comparison between the models, both eval- NumPy function. There is more evidence for this preference in uated LLMs received the same prompt and the same context. the code. Such changes make the code appear much more C-like The prompt was "Can you make this code computationally more than native Python. Whenever there is the need for typecasting, efficient, this meaning it computes faster?", while the context in- the optimized code performs it at definition, instead of on return, cluded the code that needed to be optimized. The initial code which is commonly used in the naive implementation. The two used in the input already contained some Numba instructions, types of proposed changes are illustrated with the code samples however those were basic and naive. The tested code is a part of in Figure 2. Lastly, both LLMs introduced additional function that OutRank, an open-source tool for computing cardinality-aware performs the pre-built grouping to avoid unnecessary allocations feature ranking [10] and encompasses an implementation of the and relocations in the loop. While the core techniques used for mutual information estimation. optimization are the same for both LLMs, Gemini 2.5-Pro used The LLM output was first tested on unit tests to ensure that Numba’s prange in one of the main computational loops, which the optimizations still produced valid code and did not change adds parallelization, and makes the implementation faster on mul- any functionalities. By testing the proposed solution before using ticore machines. It also took the use of elementary operations it for benchmarking, we are guaranteed that the code and its much further than ChatGPT 5 — it replaced nearly all NumPy op- output are correct, consistent, and stable. 
Although not part of erations with native operations, increasing the row count twice the framework at this stage, the output of the unit tests could as much as ChatGPT 5 did. The numbers are reported in Table 1 107 Automating Numba Optimization with Large Language Models: A Case Study on Mutual Information Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia @njit( 'Tuple((int32[:], int32[:]))(int32[:])', cache=True, fastmath=True, error_model='numpy', boundscheck=True, ) def numba_unique(a): """Identify unique elements in an array, fast""" container = np.zeros(np.max(a) + 1, dtype=np.int32) for val in a: container[val] += 1 unique_values = np.nonzero(container)[0] unique_counts = container[unique_values] return unique_values.astype(np.int32), unique_counts.astype(np.int32) def fastmath=True) (grouped by the number of features) showing the most numba_unique(a): @njit('Tuple((int32[:], int32[:]))(int32[:])', cache=True, Figure 3: Distribution of Mean Times Across Test Cases # assumes a >= 0 efficient implementation is the one optimized with Gemini maxv = 0 2.5-Pro. for i in range(a.size): if a[i] > maxv: maxv = a[i] container = np.zeros(maxv + 1, dtype=np.int32) for i in range(a.size): container[a[i]] += 1 unique_values = np.nonzero(container)[0].astype(np.int32) 4 Results unique_counts = container[unique_values].astype(np.int32) The setup for our benchmark was the following. We evaluated return unique_values, unique_counts four different implementations of mutual information. For the @njit('Tuple((int32[:], int32[:]))(int32[:])', cache=True, two baselines, we used the standard and generic Sci-Kit learn def fastmath=True) mutual information and OutRank’s basic MI-numba (that already numba_unique(a): contains some Numba instructions to optimize the performance). """ Identify unique elements and their counts in a non-negative And as discussed before, two LLM optimized implementations integer array. 
    This version finds the max value in one pass to size the container.
    """
    # Assumes a >= 0
    maxv = 0
    if a.size > 0:
        for i in range(a.size):
            if a[i] > maxv:
                maxv = a[i]
    container = np.zeros(maxv + 1, dtype=np.int32)
    for i in range(a.size):
        container[a[i]] += 1
    unique_values = np.nonzero(container)[0].astype(np.int32)
    unique_counts = container[unique_values].astype(np.int32)
    return unique_values, unique_counts

Figure 2: Examples of proposed code changes. At the top is the initial function, followed by ChatGPT 5's solution; at the bottom is the code from Gemini 2.5-Pro.

As discussed before, the two LLM-optimized implementations tested were MI-numba-chatgpt5 and MI-numba-gemini, which also support subsampling with a factor in the range (0, 1]. For the evaluation, the subsampling factor ranges from 0.1 to 1, where a factor of 1 means that no subsampling is applied.

To gauge how the performance scales with different parameters of the dataset, namely the number of examples (rows) and the number of features (columns), we synthetically generated several datasets containing raw numerical features with non-negative values, and varied the numbers of examples and features. The number of features ranged from 40 up to 200 in increments of 20, while the number of examples ranged from 200,000 to 20,000,000 in eight logarithmic steps. For each combination, represented by a tuple (algorithm, subsampling factor (where applicable), number of examples, number of features), we made five runs of the code. For each run, we recorded the time to compute mutual information using Python's time function.

The results are shown in Figure 3. The boxes represent the 25th percentile at the bottom and the 75th percentile at the top.
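The per-run timing protocol described above (five runs per configuration, wall-clock time via Python's time function) can be sketched as follows. This is an illustrative harness, not OutRank's actual benchmarking code; the helper name time_runs and the toy workload are ours:

```python
import time

def time_runs(fn, args, runs=5):
    """Call fn(*args) `runs` times and return the wall-clock
    duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.time()
        fn(*args)
        durations.append(time.time() - start)
    return durations

# In the benchmark, fn would be one of the four mutual-information
# implementations; here a trivial stand-in workload is timed.
durations = time_runs(sum, (list(range(10000)),))
```

time.perf_counter() would give higher resolution, but time.time() mirrors the protocol as described in the paper.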
For all test cases, the LLM-optimized implementations were significantly faster than the baselines (the naive Numba implementation of mutual information from OutRank and the generic Sci-Kit learn mutual information), with Gemini's implementation being the most efficient regardless of the number of features, the number of samples, or the approximation factor. The LLMs sped up the computation of mutual information by approximately 10 times, while the difference between ChatGPT's and Gemini's versions was much smaller. This implies that the biggest contribution to the speedup comes from the code changes that the two LLM-optimized solutions have in common. Those are primarily the pre-built grouping, which aims to reduce in-loop allocations, and the heavy use of elementary operations. Although parallelization in Gemini 2.5-Pro's implementation still plays a role, its effect is less significant.

Implementation    Row count    Relative row count change
Baseline          182          0%
ChatGPT 5         213          +17%
Gemini 2.5-Pro    262          +43%

Table 1: Row count for each of the implementations. Whitespace and comments are included in the row count.

In addition, Gemini 2.5-Pro implemented its own in-code bounds checks based on elementary operations, while ChatGPT 5 did not. Also contributing to the increase in the row count is the amount of comments: the code review revealed that Gemini 2.5-Pro was more consistent in code commenting, and its comments were much more useful and informative for the developer.

108 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia. Kozamernik et al.

We were very impressed by the remarkable similarity of the code produced by two different and independent LLMs. The proposed solutions from both models focused on the same key areas: adding an auxiliary function that creates the pre-built groupings to reduce the in-loop allocations, and shifting the paradigm from native NumPy to C-like Python code relying on elementary operations.
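The shared pre-built-grouping idea can be illustrated schematically in pure Python (a sketch with hypothetical helper names, not the Numba code from the papers): the value grouping is computed once by an auxiliary function and reused, instead of being re-allocated inside the hot loop.

```python
def prebuild_groups(values):
    # Auxiliary function: count value occurrences once, up front.
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return counts

def naive_pass(values, n_iterations):
    # Naive structure: the grouping is rebuilt on every iteration.
    results = []
    for _ in range(n_iterations):
        counts = {}
        for v in values:
            counts[v] = counts.get(v, 0) + 1
        results.append(len(counts))
    return results

def optimized_pass(values, n_iterations):
    # Optimized structure: the grouping is hoisted out of the loop,
    # removing the repeated in-loop allocations.
    counts = prebuild_groups(values)
    return [len(counts) for _ in range(n_iterations)]
```

Both variants produce identical results; only the allocation pattern differs.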
While the optimization process is not yet fully automatic, our contribution outlines a possible direction for the efficient use of LLMs in scientific computing. To reach the fully automatic stage for Numba optimization, we propose that the following steps be incorporated into the framework:

(1) Use the unit test output in case of failure as the next prompt for the LLM, to give it a chance to correct the code.
(2) Use the results of the benchmarking experiments as feedback to the LLM and iterate on the proposed optimization.

Both of these suggestions create feedback loops back to the LLMs, enabling an iterative process like the one proposed in Novikov et al. [7]. By comparing the outputs with the existing solutions, we have shown that the LLMs maintained correctness when introducing optimizations.

To verify that the computed mutual information is consistent with the generic implementations, namely the Sci-Kit learn implementation, we plotted the mutual information for each number of features. We show the results in Figure 4, where we can observe that the computed mutual information is almost identical for all implementations, regardless of the number of features and the different optimizations applied. We conclude that the code optimized by the LLMs is valid and correct.

Figure 4: Computed mutual information for all tested implementations and for various numbers of features.

5 Discussion

In our experiment, we used the latest and most advanced versions of two popular LLMs, namely ChatGPT 5 and Gemini 2.5-Pro, with Gemini 2.5-Pro being specifically targeted at coding. While we did put two different LLMs to the test, the goal was not so much to compare them as to develop a framework that would serve well for evaluating LLM-based optimizations in scientific computing. As new versions of LLMs, and new LLMs altogether, periodically appear on the market, the framework can serve to keep improving the existing code or, on the other hand, can be used to quantify the improvements in the LLMs themselves (specifically for the coding subdomain) as new versions are released. Additionally, using the framework in the development phase of scientific experiments can reduce the computational time and resources needed, leading to a lower cost for the experiments.

Focusing on the LLM aspect of the framework, the question remains what the result of the LLM-based optimization would be had the context, represented by the initial code, not already used Numba optimizations. A few additional experiments could be done to explore that:

(1) Use Python code without Numba instructions and explicitly mention Numba in the prompt.
(2) Use Python code without Numba instructions and do not mention Numba in the prompt.
(3) Task the LLM to prepare the most computationally efficient implementation of mutual information in Python.

6 Conclusions

In this work, we presented an initial framework for automatic code optimization via LLMs, achieving a very impressive 10-fold speedup compared to the naive baseline in the benchmarking experiments while maintaining the correctness of the code.

References
[1] Mark Chen et al. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374. https://arxiv.org/abs/2107.03374
[2] Ahmad El Ferdaoussi, Eric Plourde, and Jean Rouat. 2025. Maximizing information in neuron populations for neuromorphic spike encoding. Neuromorphic Computing and Engineering, 5, 1, 014002.
[3] Andrew M. Fraser and Harry L. Swinney. 1986. Independent coordinates for strange attractors from mutual information. Physical Review A, 33, 2, 1134.
[4] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. 2011. Erratum: Estimating mutual information [Phys. Rev. E 69, 066138 (2004)]. Physical Review E, 83, 1, 019903.
[5] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. 2004. Estimating mutual information. Physical Review E, 69, 6, 066138.
[6] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. 2015. Numba: a LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, 1–6.
[7] Alexander Novikov et al. 2025. AlphaEvolve: a coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131.
[8] Hanchuan Peng, Fuhui Long, and Chris Ding. 2005. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 8, 1226–1238.
[9] Lior I. Shachaf, Elijah Roberts, Patrick Cahan, and Jie Xiao. 2023. Gene regulation network inference using k-nearest neighbor-based mutual information estimation: revisiting an old dream. BMC Bioinformatics, 24, 1, 84.
[10] Blaz Skrlj and Blaž Mramor. 2023. OutRank: speeding up AutoML-based model search for large sparse data sets with cardinality-aware feature ranking. In Proceedings of the 17th ACM Conference on Recommender Systems, 1078–1083.
[11] Ralf Steuer, Jürgen Kurths, Carsten O. Daub, Janko Weise, and Joachim Selbig. 2002. The mutual information: detecting and evaluating dependencies between variables. Bioinformatics, 18, suppl_2, S231–S240.

Topological Structure in GitHub Repository Embeddings Using Mapper

Ivo Hrib, ivo.hrib@gmail.com, Jožef Stefan Institute, Ljubljana, Slovenia
Patrik Zajec, patrik.zajec@ijs.si, Jožef Stefan Institute, Ljubljana, Slovenia

Abstract
We present a preliminary framework for the topological analysis of GitHub repository embeddings using the Mapper algorithm. Applied to 10,000 repositories embedded in 768-dimensional
space, our approach currently provides visual representations of Mapper graphs, offering a first view into potential topological structures such as branching patterns and cycles. While these initial results are exploratory, they establish a foundation for rigorous statistical testing of topological features. Future work will incorporate persistent homology–based significance testing to distinguish genuine structural patterns from noise, with the ultimate goal of interpreting these features in terms of repository characteristics.

Keywords
topological data analysis, Mapper, GitHub, embeddings, significance testing, software repositories, persistent homology

1 Introduction
We present a preliminary framework for the topological analysis of GitHub repository embeddings using the Mapper algorithm. Applied to 10,000 repositories embedded in 768-dimensional space, our approach provides visual representations of Mapper graphs that reveal branching structures and cycles as potential organizational patterns in the data.

2 Background and Related Work
2.1 The Mapper Algorithm
The Mapper algorithm [6] constructs a graph representation of a topological space by combining a filter function, overlapping covers, and clustering. Given a point cloud P embedded in R^d and a continuous function f : P → R, referred to as a filter function, the algorithm:
(1) constructs a cover U = {U_1, . . . , U_n} of the range f(P) using overlapping intervals;
(2) for each interval U_i, computes the preimage P_i = f^(-1)(U_i);
(3) clusters each preimage into connected components using a clustering algorithm;
(4) creates vertices for each cluster and edges between clusters whose point sets intersect.
Common practice uses the first PCA component [4] as the filter function and density-based clustering methods, such as DBSCAN [2], unless specific domain knowledge is provided. The resulting graph G = (V, E) provides a combinatorial description with a mapping φ : V → P(P) associating each vertex with a subset of points.
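The four steps can be condensed into a runnable sketch. This is our illustration, not the authors' implementation: the epsilon-linkage clustering stands in for DBSCAN, and in practice a library such as Kepler-Mapper would be used.

```python
from itertools import combinations

def mapper_graph(points, filter_values, n_intervals, overlap, eps):
    # Minimal Mapper sketch following steps (1)-(4): cover the filter
    # range with overlapping intervals, cluster each preimage, and
    # connect clusters that share points.
    lo, hi = min(filter_values), max(filter_values)
    width = (hi - lo) / n_intervals
    clusters = []  # each cluster: frozenset of point indices
    for k in range(n_intervals):
        a = lo + k * width - overlap * width        # (1) overlapping interval
        b = lo + (k + 1) * width + overlap * width
        pre = {i for i, v in enumerate(filter_values) if a <= v <= b}  # (2)
        while pre:                                   # (3) eps-linkage clustering
            comp = {pre.pop()}
            grew = True
            while grew:
                grew = False
                for j in list(pre):
                    if any(max(abs(points[j][d] - points[i][d])
                               for d in range(len(points[j]))) <= eps
                           for i in comp):
                        comp.add(j)
                        pre.discard(j)
                        grew = True
            clusters.append(frozenset(comp))
    edges = [(i, j) for i, j in combinations(range(len(clusters)), 2)
             if clusters[i] & clusters[j]]            # (4) shared points -> edge
    return clusters, edges
```

With the first coordinate as a stand-in filter, four collinear points and two overlapping intervals yield two overlapping clusters joined by a single edge.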
These results are exploratory and serve as a foundation for future work, where statistical significance testing will be applied to rigorously validate which features represent genuine topological structure rather than noise. Our framework thus establishes an initial step toward understanding the topology of repository embeddings and motivates further methodological development.

1.1 Research Questions
This work addresses the following specific question:
(1) Do GitHub repository embeddings contain significant topological structures beyond simple clustering?

1.2 Contributions
Our main contributions are:
• A preliminary framework for constructing and visualizing Mapper graphs of GitHub repository embeddings.
• A systematic comparison of Mapper graphs across multiple parameter settings, highlighting sensitivity and recurring structural patterns.
• A discussion of how these preliminary results can guide future work, in particular the application of statistical testing methods to validate topological features and their interpretation in terms of repository characteristics.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.27

110 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia. Hrib and Zajec

Figure 1: Visual demonstration of the Mapper algorithm for a projection filter and a simple point cloud.

2.1.1 Parameter Selection and Sensitivity. Mapper results are sensitive to three main parameters:
• Resolution (n): the number of intervals in the cover.
• Overlap (p): the percentage overlap between consecutive intervals.
• Clustering threshold (ε): the distance parameter for the clustering algorithm.
Taking this into account, we devised the following methodology for parameter selection. We define a discrete grid of candidate values, for which the Mapper graph is reasonably computable, for each of the previously mentioned parameters and for the minimum number of points per cluster. For each point in this grid we applied the Mapper algorithm to the dataset and computed a collection of quality measures.

Three main criteria were used to assess the quality of each Mapper graph:
• Coverage: the proportion of data points captured by the nodes of the Mapper graph, measuring how well the graph represents the entire dataset.
• Modularity: a measure of the strength of community structure in the resulting graph, reflecting the presence of well-defined clusters or substructures.
• Stability: the reproducibility of the graph under sampling noise, estimated by a bootstrap procedure in which multiple resampled datasets were processed and the resulting node assignments compared for consistency.

For each parameter combination, we computed stability, coverage, and modularity. To aggregate these into a single composite score, we used a weighted sum that places the highest emphasis on stability (0.5), followed by coverage (0.3) and modularity (0.2). These weights were chosen to reflect our prioritization of reproducibility and representativeness over community structure.

n_cubes    ε       Overlap    MinPts    Coverage    Stability    Modularity    Score
12         0.70    0.7        3         0.966       0.948        0.785         0.921
12         0.70    0.7        5         0.933       0.924        0.745         0.891
16         0.70    0.7        3         0.952       0.872        0.791         0.880
10         0.70    0.7        3         0.966       0.847        0.765         0.866
16         0.70    0.7        5         0.915       0.852        0.771         0.855

Table 1: Top 5 Mapper parameter settings ranked by score.

For each parameter layout, we employed the first PCA component as our chosen filter and DBSCAN as our chosen clustering algorithm.

2.1.2 Adaptive Mapper and Learnable Filter Functions. Recent advances in Mapper methodology include adaptive approaches.

3 Dataset and Methodology
3.1 Dataset Description
The raw dataset comprised approximately 500,000 GitHub repositories, each annotated with a range of metadata fields. These can be grouped into three broad categories:
• Textual features: free-form text fields such as description, readme, requirements, and packages, which capture natural-language documentation and dependency declarations.
• Categorical features: attributes such as language, topic, and visibility, which provide discrete labels describing repository characteristics.
• Contextual metadata: fields such as name, bio, website, company, location, and date of creation, which provide identifying information and organizational context.

3.1.1 Repository Selection Criteria. In the interest of computational feasibility, this dataset was then sampled to 10,000 repositories. Repositories were chosen via simple random sampling from the full dataset, as many repositories contained incomplete or inconsistent categorical and contextual metadata; therefore, stratified sampling was not appropriate.

3.1.2 Embedding Process. Each sampled repository was converted into a structured dictionary combining the available metadata fields. These dictionaries were embedded using the nomic-embed-text model.
The model accepts long-context inputs (up to approximately 8,000 tokens), which makes it suitable for processing repository documentation such as README files. The resulting embeddings are 768-dimensional vectors. Together, the 10,000 sampled repositories form a point cloud in R^768. Because the embeddings primarily reflect textual and documentation content (e.g., README and description fields), the analysis in this study centers on topological structure in the documentation space rather than source code semantics. These embeddings serve as the basis for the Mapper-based topological data analysis described in the following sections.

In adaptive Mapper approaches, the filter functions are learned from data rather than manually specified; such approaches could potentially optimize for statistically significant topological features [3]. These methods were, however, not utilized in our case due to computational complexity and remain to be explored in the future.

3.2 Mapper Implementation
For our purposes we used Kepler-Mapper to compute the Mapper graphs that scored highest, as listed in Table 1.

2.2 Related Work in Software Repository Analysis
Several recent studies have explored software repository embeddings and clustering. For example, Rokon et al. introduced Repo2Vec, which combines metadata, source code, and structural signals into repository embeddings for similarity search and clustering [5]. Lherondelle et al. proposed an attention-based model that learns repository embeddings from code and metadata to support auto-tagging and recommendation tasks [lherondelle2022topical]. Zhang et al. developed HiGitClass, a hierarchical classification framework for GitHub repositories using embedding-based methods [8].
Other work has examined clustering repositories with software metrics [repo_metrics2023] or analyzing the characteristics of repositories in specific domains such as embedded systems [polaczek2021embedded]. While these approaches demonstrate that embeddings and clustering can yield useful insights about software repositories, they focus primarily on supervised tasks (classification, tagging) or flat similarity clustering. In contrast, our work explores the topological features of the high-dimensional space in which the repositories reside. By applying the Mapper algorithm to repository embeddings, we intend to characterize how repositories are organized in terms of branching patterns, hubs, and cycles. This perspective emphasizes the geometry and connectivity of the embedding space itself, offering potential insights that complement more conventional similarity- or classification-based analyses of repositories.

The selected parameter settings are:
• Graph 1: Resolution = 12, Overlap = 0.7, eps = 0.7, min_samples = 3
• Graph 2: Resolution = 12, Overlap = 0.7, eps = 0.7, min_samples = 5
• Graph 3: Resolution = 16, Overlap = 0.7, eps = 0.7, min_samples = 3
• Graph 4: Resolution = 10, Overlap = 0.7, eps = 0.7, min_samples = 3
• Graph 5: Resolution = 16, Overlap = 0.7, eps = 0.7, min_samples = 5
The filter function is, as before, projection onto the first principal component; clustering uses DBSCAN, with the minimum cluster size parameter adjusted per graph.

111 Topological Analysis of GitHub Repository Embeddings. Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia

6.1 Limitations and Error Analysis
Several limitations must be acknowledged:
6.1.1 Parameter Sensitivity. While we observe some consistent patterns across parameter choices, a more systematic exploration of the parameter space than a pure grid search is needed.
6.1.2 Computational Constraints. Full significance testing of all features is computationally expensive, limiting the scale of analysis possible.
6.1.3 Interpretation Challenges. The semantic meaning of topological features requires domain expertise and may not generalize across different types of software projects.

4 Results
Table 2 reports the structural properties of the selected Mapper graphs, while Table 3 summarizes degree distributions. Graph 1 (Resolution = 12, MinPts = 3) produced 207 nodes and 368 edges across 36 components, with 197 cycles. Graph 3 (Resolution = 16, MinPts = 3) was even larger (232 nodes, 421 edges, 229 cycles), reflecting the finer subdivisions introduced by higher resolution.
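The cycle counts reported for these graphs match each graph's circuit rank, E − V + C (the number of independent cycles given V nodes, E edges, and C connected components); a quick sanity check (the function name is ours):

```python
def circuit_rank(nodes, edges, components):
    # Independent cycles of an undirected graph: E - V + C.
    return edges - nodes + components

# Graph 1 and Graph 3 from Table 2:
assert circuit_rank(207, 368, 36) == 197
assert circuit_rank(232, 421, 40) == 229
```

The same identity holds for the remaining graphs in Table 2.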
Graphs 2 and 5 (MinPts = 5) were smaller, around 100 nodes each, as stricter clustering merged many small clusters. Graph 4 (Resolution = 10, MinPts = 3) fell between these extremes (194 nodes, 337 edges).

Degree distributions confirm these patterns: Graphs 1 and 3 contain many nodes of degree 3–5 with some higher-degree hubs, while Graphs 2 and 5 are simpler and tree-like. Overall, higher resolution and lower MinPts yield more fragmented, cycle-rich graphs, while stricter clustering produces fewer, larger components. These trends highlight the need for statistical testing to separate genuine topological signals from parameter effects.

As for the visual representations of the graphs, see Figures 2a and 2c, as well as the bar plots of their respective node sizes. Note that many of the nodes are relatively small, most likely due to the reasons mentioned previously.

Graph      Nodes    Edges    Conn. comps.    Cycles
Graph 1    207      368      36              197
Graph 2    101      187      18              104
Graph 3    232      421      40              229
Graph 4    194      337      36              179
Graph 5    108      200      19              111

Table 2: Comparison of structural properties across Mapper graphs.

Graph      1–2    3–5    6–10    11–20    21+
Graph 1    9      181    9       2        6
Graph 2    10     82     2       6        1
Graph 3    20     182    20      6        4
Graph 4    10     171    5       3        5
Graph 5    9      84     7       8        0

Table 3: Binned degree distributions across graphs.

5 Figures and Results Visualization

6 Discussion
The consistent branching patterns across multiple Mapper graphs suggest genuine topological structure in the repository embedding space rather than parameter artifacts. The large presence of cycles indicates more complex topological relationships beyond simple clustering, possibly representing repositories that share multiple characteristics or form transition regions between different project types. Although most of these may be attributed to noise, we aim to further explore those that are relevant using the techniques from [7] and [1].

6.1.4 Embedding Model Dependence. Results depend on the quality and characteristics of the embedding model used.

7 Future Work and Conclusions
7.1 Immediate Extensions
• Complete statistical validation of all observed topological features
• Systematic parameter sensitivity analysis
• Comprehensive repository characteristic analysis for interpretation
• Cross-validation with different embedding models and data subsets
7.2 Methodological Advances
• Adaptive Mapper guided by significance testing to optimize filter functions
• Validation on simple synthetic datasets to confirm methodology effectiveness
• Development of Mapper quality metrics and automated parameter selection
• Hybrid approaches combining Mapper with other dimensionality reduction techniques
7.3 Applications and Validation
• Predictive modeling using topological features for repository characteristics
• Integration with software engineering workflows and tools
• Evaluation by domain experts for practical relevance
• Extension to other software engineering datasets and problems
7.4 Conclusions
While computational constraints limit the scope of the current analysis, the framework establishes a foundation for rigorous topological analysis of software engineering data. The combination of visualization, statistical validation, and manual interpretation provides a comprehensive approach to understanding high-dimensional repository relationships. The observed topological structure suggests that repository embeddings capture meaningful relationships beyond simple clustering, opening possibilities for novel applications in software engineering and repository analysis.

Acknowledgements
This research was supported by the EnrichMyData project, which provided financial support for the work presented in this paper.

(a) Mapper graph (Graph 1). (b) Node count distribution (Graph 1).
(c) Mapper graph (Graph 3). (d) Node count distribution (Graph 3).

Figure 2: Representative Mapper graphs (Graph 1 and Graph 3) with corresponding node count barplots. Both 2a and 2c show a significant central connected component with some branching; however, the boundary of the largest connected component seems to be quite noisy. Further statistical testing will aim to improve upon pruning the noisy artifacts.

References
[1] Omer Bobrowski and Primož Skraba. [n. d.] A universal null-distribution for topological data analysis. https://www.nature.com/articles/s41598-023-37842-2.
[2] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD'96). AAAI Press, 226–231.
[3] Ziyad Oulhaj, Mathieu Carrière, and Bertrand Michel. 2024. Differentiable Mapper for topological optimization of data representation. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research). Vol. 235. PMLR, 38919–38936. doi:10.48550/arXiv.2402.12854.
[5] Md Rafsan Jani Rokon, Panagiotis Kallis, Michele Castronovo, Alexander Serebrenik, and Alberto Bacchelli. 2021. Repo2Vec: repository embeddings for effective similarity search and recommendation. In Proceedings of the 18th International Conference on Mining Software Repositories (MSR 2021), 384–394.
[6] Gurjeet Singh, Facundo Mémoli, and Gunnar E. Carlsson. 2007. Topological methods for the analysis of high dimensional data sets and 3D object recognition. In Eurographics Symposium on Point-Based Graphics. Eurographics Association, 91–100. doi:10.2312/SPBG/SPBG07/091-100.
[7] Patrik Zajec. 2023. Towards testing the significance of branching points and cycles in mapper graphs.
[8] Yu Zhang, Frank F. Xu, Sha Li, Yu Meng, Xuan Wang, Qi Li, and Jiawei Han. 2019.
HiGitClass: keyword-driven hierarchical classification of GitHub repositories. In ICDM '19, 876–885. doi:10.1109/ICDM.2019.00098.
[4] Karl Pearson. 1901. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2, 11, 559–572. doi:10.1080/14786440109462720.

CO2 Monitoring for Energy-Efficient Workloads in Kubernetes: A Data Provider for CO2-Aware Migration

Ivo Hrib, ivo.hrib@gmail.com, Jožef Stefan Institute, Ljubljana, Slovenia
Jan Šturm, jan.sturm@ijs.si, Jožef Stefan Institute, Ljubljana, Slovenia
Oleksandra Topal, oleksandra.topal@ijs.si, Jožef Stefan Institute, Ljubljana, Slovenia
Maja Škrjanc, maja.skrjanc@ijs.si, Jožef Stefan Institute, Ljubljana, Slovenia

Abstract
We present a CO2 monitoring component developed within the FAME project's Energy Efficient Analytics Toolbox. The service continuously collects power usage for containerized workloads in Kubernetes via Kepler and fuses it with regional electricity-grid carbon intensity (e.g., ElectricityMaps) to compute per-workload CO2 emission rates in g s^-1. Its primary role is to store accurate, timestamped emission values and expose them through lightweight APIs and an optional time-series database (TimescaleDB). It acts as a data provider consumed by external orchestration services, enabling CO2-aware migration strategies across clusters and regions.

Our contributions are threefold: (i) a minimal but complete architecture for per-workload CO2 measurement and storage in Kubernetes; (ii) a schema and REST API design that facilitates external consumption; and (iii) scenario-based evaluations demonstrating the potential of CO2-aware workload migration. Further testing will take place, utilizing real measurements and migrations from within the FAME framework, so as to showcase the service's precise final capabilities, as opposed to benchmark tests.

1.1 Key Idea
The key idea of our approach is to compute container-level CO2 emissions by combining two complementary data sources: (i)
instantaneous power consumption estimates from Kepler, and (ii) regional grid carbon intensity values from ElectricityMaps.

First, Kepler provides pod- and container-level telemetry in the form of estimated power usage P(t), expressed in watts. This power signal is derived from eBPF-based kernel observations and model-based inference, all provided by Kepler's data source. Second, ElectricityMaps exposes a carbon intensity factor I(t), expressed in gCO2/kWh, corresponding to the bidding zone of the node on which the container executes.

We align these two signals in time and compute the instantaneous emission rate as

    E(t) = P(t) · I(t) / 3,600,000

where E(t) is the CO2 emission rate in g s^-1, P(t) is the container power in watts (J s^-1), I(t) is the grid carbon intensity in gCO2 per kWh, and the constant 3,600,000 converts joules to kilowatt-hours (1 kWh = 3.6 × 10^6 J), so that the per-kWh intensity factor yields a per-second emission rate.

These per-container emission rates are then aggregated into a time series, optionally persisted in TimescaleDB, and exposed via a REST API. This composition allows downstream orchestration services to reason about the carbon impact of workloads at fine temporal and spatial granularity, enabling CO2-aware migration strategies.

Keywords
CO2 monitoring, Kubernetes, energy efficiency, carbon-aware computing, time-series storage, ElectricityMaps, Kepler

1 Introduction
Data centers are a significant contributor to global electricity demand. Beyond advances in hardware efficiency and renewable energy procurement, intelligent orchestration of workloads can reduce emissions by aligning computation with cleaner energy availability. A prerequisite for such carbon-aware orchestration is the availability of reliable and accessible measurements of workload-level emissions.

This paper introduces a CO2 monitoring and storage service designed for Kubernetes environments. The service ingests pod/container power data from Kepler [5], combines it with regional grid carbon intensity from ElectricityMaps [2], computes instantaneous emission rates, and persists the resulting time series. Unlike optimization or migration tools, this component deliberately restricts its scope: it provides measurements and exposes them via stable APIs for later consumption. By decoupling measurement from decision-making, we ensure modularity and interoperability. External orchestrators such as the ATOS migration service in FAME D5.4 [3] can consume these metrics to implement CO2-aware migration strategies without needing to handle the intricacies of measurement or data storage.

2 Background and Related Work
Components of our approach. Our service integrates two external data sources to produce fine-grained CO2 emission signals. Kepler is an open-source project that estimates the energy consumption of containerized workloads in Kubernetes by leveraging eBPF-based telemetry and machine learning models. It exposes per-container power and energy metrics that can be consumed by higher-level services. ElectricityMaps provides real-time and historical carbon intensity data for electricity grids, expressed in gCO2/kWh. By fusing Kepler's workload-level power estimates with regional carbon intensity factors from ElectricityMaps, our system produces a continuous stream of container-level CO2 emission data. This stream can then be consumed by orchestration or scheduling components for migration and placement decisions.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
system produces a continuous stream of container-level CO2 emission data. This stream can then be consumed by orchestration or scheduling components for migration and placement decisions.

https://doi.org/10.70314/is.2025.sikdd.24
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia. Hrib et al.

Carbon-aware computing. Prior research demonstrates the potential of carbon-aware strategies, such as shifting workloads across time or regions to align with lower-carbon electricity supplies [4]. Such approaches rely on access to reliable, fine-grained emission signals to inform scheduling policies.

Existing monitoring tools. Several open-source frameworks exist for energy and carbon monitoring. For example, CodeCarbon [1] and Scaphandre [6] estimate workload emissions, but they rely on hardware-specific telemetry, such as Intel's Running Average Power Limit (RAPL) counters. This dependence ties portability to Intel CPUs and makes integration across heterogeneous infrastructures challenging. In contrast, our design, built on Kepler and ElectricityMaps, remains hardware-agnostic: eBPF enables container-level monitoring without vendor-specific counters, while ElectricityMaps provides global coverage of carbon intensity signals. This combination makes our service applicable in diverse Kubernetes environments and datacenter setups.

Time-series storage. Finally, for persistence, we optionally employ TimescaleDB, which extends PostgreSQL with hypertables and compression optimized for telemetry data [7]. Nevertheless, the service can also operate in buffer-only mode when persistent storage is not required.

Positioning. This paper positions our monitoring service as a foundational measurement substrate for carbon-aware orchestration in Kubernetes environments. By combining hardware-agnostic energy estimates with real-time grid carbon data, it extends the applicability of carbon-aware scheduling beyond the limitations of prior approaches.

3 Design and Implementation

3.1 Architecture
The component runs as a Kubernetes deployment. Workers collect power metrics from Kepler, fetch grid intensity values, compute emissions, and either persist results in TimescaleDB or serve them from memory. A REST API provides read-only access to historical and recent emissions.

Figure 1: System architecture

3.2 Data Model
Each emission record is structured as a tuple that captures both workload identifiers and measurement values. This schema is designed to balance expressiveness with minimal storage overhead, while ensuring compatibility with external orchestration services.
• ts (timestamp, UTC): the precise moment when the measurement was taken, enabling time-series alignment across nodes and regions.
• namespace, pod, container: identifiers for locating the workload within the Kubernetes hierarchy, which is essential for container-level granularity and reproducibility.
• node, region, country_iso2: metadata that ties the container execution to its physical and geographical context. This supports carbon-aware decisions that depend on grid intensity differences across regions.
• power_w, energy_j: raw telemetry provided by Kepler, describing both instantaneous power and accumulated energy consumption.
• intensity_g_per_kwh: regional grid carbon intensity retrieved from ElectricityMaps, serving as the multiplier that translates energy into emissions.
• co2_g_per_s: the computed emission signal, representing the core value consumed by orchestrators.
• source_version: versioning tag for tracking the provenance of measurements and external data dependencies.
This schema ensures that each record is self-contained, interpretable across clusters, and suitable for longitudinal analysis in time-series databases.

3.3 API Endpoints
The service exposes a lightweight REST API, designed to be easily consumed by external orchestrators or monitoring pipelines. The API emphasizes read-only access to maintain reliability and auditability.
• GET /api/containers: returns the set of containers currently monitored by the service, allowing orchestrators to discover available emission signals.
• POST /api/emissions: fetches recent emission values in bulk; requires a specified time range. This endpoint is optimized for dashboards or monitoring agents that need timely updates with low overhead.
• POST /api/emissions/by-container: queries the emission history of specific containers; similarly requires a time range, as well as the names of the containers for which to fetch data.
• GET /api/schema: provides the data schema, including units and field definitions. This enables clients to validate their assumptions and facilitates long-term interoperability across versions.
By standardizing access patterns, the API makes it possible for external services to reliably retrieve emissions information without depending on internal implementation details.

4 Evaluation
We now present evaluations based on benchmarks and scenario analyses conducted in the FAME project [3]. The goal was to assess whether exposing real-time CO2 signals can enable meaningful emission reductions when coupled with migration strategies.

4.1 Benchmark Test
In a simple benchmark using busybox, a lightweight Linux container, the optimal CO2 emissions achieved were significantly lower than the mean observed values. The key performance indicator (KPI) was defined as a 200% improvement, corresponding to a 66.6% reduction compared to the baseline. Results show that this threshold can be achieved and often surpassed.
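As an illustration of how an external orchestrator might consume these endpoints, the sketch below builds the request body for POST /api/emissions/by-container and decodes the returned records. The endpoint paths come from the paper; the service address, port, and exact JSON field names of the request body are assumptions, since the paper specifies only the endpoints and the record schema.

```python
import json
from urllib import request

BASE_URL = "http://co2-monitor.default.svc:8080"  # assumed service address, not from the paper

def by_container_payload(start_ts: str, end_ts: str, containers: list) -> dict:
    """Body for POST /api/emissions/by-container: a time range plus container names.

    Field names ('start', 'end', 'containers') are illustrative assumptions."""
    return {"start": start_ts, "end": end_ts, "containers": containers}

def fetch_emissions(payload: dict) -> list:
    """POST the payload and return the decoded emission records (assumed JSON list)."""
    req = request.Request(
        BASE_URL + "/api/emissions/by-container",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

A client would typically first call GET /api/schema to validate its assumptions about units and field definitions, which is exactly the interoperability role the paper assigns to that endpoint.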
The baseline is, for lack of a better metric, defined as the mean of emissions across all tracked countries with available resources for migration.

Figure 2: Small-timeframe emissions of a busybox benchmark container. Noticeably low emissions can be seen for France, which can be explained by its heavy reliance on nuclear power, as can be seen in [2].

4.2 Scenario-Based Evaluation
Scenarios simulate workload migrations across subsets of European countries. Each scenario randomly selects 4–7 countries from a pool of 28, representing constrained deployment options. The abbreviations (e.g., FR, DE, SI) correspond to ISO-3166 country codes representing different electricity regions. We employed random sampling of countries to simulate the heterogeneity faced by cloud and edge providers operating across multiple regions. This choice enables us to reflect migration challenges where workloads are moved not only between datacenters but also across electricity grids with diverse carbon intensities. While random sampling is a simplification, it provides statistically representative insights into the variability of emission factors. We showcase our results through the following five scenarios:
• Scenario 1 (IS, CZ, BG, RO, AT, SE): 88.2% ± 2.1% reduction.
• Scenario 2 (DE, PL, GR, LV): 72.8% ± 5.6% reduction.
• Scenario 3 (GB, LT, SI, DE, AT, GR): 78.0% ± 1.7% reduction.
• Scenario 4 (ES, FR, GB, PL, HU, LT, SE): 89.6% ± 1.1% (best case).
• Scenario 5 (LV, ES, HU, LT): 32.4% ± 12.7% (worst case).
• All countries: 87.7% ± 1.7% reduction (ideal case).
Across all scenarios, at least one migration was executed per window, with an average emission reduction of 74.8%. These results confirm that even under limited availability, CO2-aware migration strategies yield substantial benefits.

Figure 3: Plot of average reductions per scenario

5 Limitations and Future Work
The reported emissions are estimates subject to the accuracy of both Kepler's models and grid intensity data. As a result, the benchmark tests previously performed may not fully capture all possible scenarios, as grid dependency may sometimes force suboptimal migrations in the CO2 system, depending on resource availability. The system attempts to minimize emissions within these subsets. Resolution is limited by the update frequency of intensity sources, and storage requirements increase with sampling granularity.

We considered only a single baseline, defined as the mean CO2 emissions across all tracked countries. While this provides a general reference point, it is not directly comparable to region-specific benchmarks and may obscure finer-grained differences. Future work should therefore incorporate multiple baselines, such as per-country averages or established benchmarks from the literature, and assess statistical significance relative to them.

Our benchmark scenarios were simplified to ensure reproducibility and interpretability. Although random sampling of countries illustrates the variability in energy mixes, it does not fully capture the operational constraints of datacenter migrations or multi-cloud scheduling. More complex benchmarks with realistic workloads and infrastructure heterogeneity would further validate the applicability of our approach.

Finally, while implementation details such as REST endpoints and TimescaleDB integration were reported for transparency, their evaluation was not the main focus of this study. Additional experimentation with scalability and deployment overhead would strengthen the case for adoption in production environments.

Future work will focus on service options to adjust granularity, on tackling scalability issues within the service, and on broader evaluation.
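The scenario reductions reported in Section 4.2 can be illustrated with a minimal sketch. The assumptions here are mine, not the paper's experimental setup: the baseline is the mean intensity over all tracked countries, and the carbon-aware migrator inside a scenario's country subset is assumed always to run the workload in the cleanest available country.

```python
from statistics import mean

def reduction_pct(baseline_g: float, achieved_g: float) -> float:
    """Percentage emission reduction of an achieved value against the baseline."""
    return 100.0 * (baseline_g - achieved_g) / baseline_g

def scenario_reduction(intensities: dict, subset: list) -> float:
    """Baseline: mean intensity over all tracked countries (as in the paper).

    The migrator is assumed to always pick the cleanest country in the subset."""
    baseline = mean(intensities.values())
    achieved = min(intensities[c] for c in subset)
    return reduction_pct(baseline, achieved)
```

The real evaluation ran per time window with availability constraints, so the single-snapshot numbers produced by this sketch only illustrate why subsets containing a low-carbon grid (e.g., one dominated by nuclear or hydro power) approach the all-countries ideal case.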
4.3 Insights
The best-performing scenario demonstrates that careful selection of even a limited number of regions can approach the effectiveness of full global availability. Conversely, the poorest-performing scenario illustrates the dependency on geographic flexibility. Overall, the results validate that exposing reliable CO2 signals through our service empowers orchestration layers to meet or exceed environmental KPIs.

6 Conclusion
We presented a Kubernetes-native CO2 monitoring service that provides real-time emissions data through stable APIs. Evaluations demonstrate that when coupled with migration strategies, these metrics enable significant emission reductions, often surpassing KPI thresholds. Future work will include integration with more compute-intensive workloads, multi-source intensity aggregation, and cryptographic provenance for auditability.

Acknowledgements
This work was supported by the FAME project under the European Union's Horizon Europe programme. We thank the Kepler community and colleagues who contributed feedback during testing. For all online resources cited, the date of access has been included to ensure reproducibility and traceability.

References
[2] Electricity Maps ApS. 2025. Electricity Maps: real-time carbon intensity of electricity consumption. Accessed: 25 September 2025. https://app.electricitymaps.com.
[3] European Union Horizon Europe Programme. 2025. FAME project: federated and multicloud enablers for green computing. Accessed: 25 September 2025. https://www.fame-horizon.eu/the-project/.
[4] Google. 2020. Our data centers now work harder when the sun shines and wind blows. Accessed: 25 September 2025. https://blog.google/inside-google/infrastructure/data-centers-work-harder-sun-shines-wind-blows/.
[5] Kepler Project Contributors. 2025. Kepler: Kubernetes-based efficient power level exporter. Accessed: 25 September 2025. https://github.com/sustainable-computing-io/kepler.
[1] CodeCarbon Project Contributors. 2025.
CodeCarbon: track and reduce the carbon footprint of your computing. Accessed: 25 September 2025. https://mlco2.github.io/codecarbon/.
[6] Scaphandre Project Contributors. 2025. Scaphandre: energy monitoring agent for Linux servers. Accessed: 25 September 2025. https://github.com/hubblo-org/scaphandre.
[7] Timescale Inc. 2025. TimescaleDB: an open-source time-series database. Accessed: 25 September 2025. https://www.timescale.com.

Beyond Surveys: Adolescent Profiling via Ecological Momentary Assessment and Mobile Sensing

Jasminka Dobša (University of Zagreb, Faculty of Organization and Informatics, Varaždin, Croatia; jasminka.dobsa@foi.hr)
Simona Korenjak-Černe (University of Ljubljana, School of Economics and Business, and IMFM, Ljubljana, Slovenia; simona.cerne@ef.uni-lj.si)
Miranda Novak (University of Zagreb, Faculty of Education and Rehabilitation Sciences, Zagreb, Croatia; miranda.novak@erf.unizg.hr)
Maja Buhin Pandur (University of Zagreb, Faculty of Organization and Informatics, Varaždin, Croatia; mbuhin@foi.unizg.hr)
Lucija Šutić (University of Zagreb, Faculty of Education and Rehabilitation Sciences, Zagreb, Croatia; lucija.sutic@erf.unizg.hr)

Abstract
The aim of this study is to identify profiles of adolescents using survey data and data collected via mobile phones, which included ecological momentary assessment (EMA) and passive mobile sensing. EMA involved responses to short questionnaires
delivered seven times per day over one week, while mobile sensing captured the time spent using different categories of mobile applications. The study was conducted on a sample of 77 secondary school students. Profiling was performed through clustering of EMA data aggregated into six composite variables reflecting confidence, attentiveness, positive and negative emotions related to friends, and overall positive and negative affect. Based on the interpretability of the results, four adolescent profiles were identified. These profiles are further explained using survey data and passive data on mobile application usage patterns.

Keywords
Adolescents, clustering, mobile sensing, ecological momentary assessment, well-being

1 Introduction
This study was conducted using the Effortless Assessment of Risk States (EARS) application developed by Ksana Health in collaboration with the University of Oregon (https://ksanahealth.com/ears/) [6]. The EARS application was originally launched in 2018 to facilitate the collection of high-quality passive mobile sensing data and to support the development of predictive machine learning algorithms capable of identifying risk states for human well-being before they escalate into crises. In 2023 [7], the platform was reintroduced with significant improvements, enabling the collection of behavioral and interpersonal data through natural smartphone use, which enabled the collection of self-ratings known as ecological momentary assessments used in this research.

Previous research using EARS has explored various applications. For instance, one study examined the use of mobile sensing data to assess stress by analyzing affective language captured via smartphone keyboards [4]. Another study investigated the role of friendship quality and well-being in adolescence [9], concluding that adolescents who experienced more positive affect also reported more positive characteristics of close friendships two hours later.

In the present study, profiles of adolescents were identified using EMA variables, resulting in four distinct groups. These profiles were subsequently analyzed with respect to survey data and passive mobile sensing data. The study was guided by the following research questions:
• What distinct adolescent profiles emerge from EMA-based composite variables?
• How are these profiles associated with demographic and psychosocial survey measurements (gender, academic achievement, perceived overuse of social media, and levels of depression, anxiety, and stress symptoms)?
• What patterns of mobile application use characterize the identified profiles?
The rest of the paper is organized in the following way: the second section describes materials and methods, the third section presents the results of the data analysis, and the fourth section offers a discussion of the results and a conclusion.

Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). http://doi.org/10.70314/is.2025.sikdd.29

2 Materials and methods
A sample of 77 Croatian high school students participated in this study. We employed three types of data: (1) survey data, (2) EMA data aggregated into six composite variables (confidence, attentiveness, positive and negative emotions related to friends, and overall positive and negative affect), and (3) passive mobile sensing data on mobile application usage. The survey data included respondents' gender, academic achievement (final grades of 3, 4, or 5), self-reported perception of overuse of social media (measured on a scale from 14 to 70), and symptoms of depression, anxiety, and stress (determined by the DASS-21 scale, each measured on a scale from 0 to 21 [1]). EMA data and passive mobile data were collected using the EARS application.
Within the framework of ecological momentary assessment (EMA), respondents reported on the quality of their close friendships and their affect seven times per day over the course of one week (i.e., up to 49 assessments). The assessment schedule followed a semi-random structure: respondents received questions at random intervals within 2-hour windows between 7 a.m. and 9 p.m. Only respondents who completed at least 10 out of 49 assessments were included in the analyses.

Figure 1. Groups obtained by the k-means algorithm projected onto the first two principal components of the composite EMA variables.

Friendship quality was measured with five items rated on a scale from 1 (not at all like me) to 7 (completely like me). All items were adapted from prior studies on close relationships [3, 5, 8]. Two composite variables were derived: PosFriendEmo, calculated as the average of three items related to positive friendship-related emotions, and NegFriendEmo, calculated as the average of items reflecting negative friendship-related emotions. The items related to positive friendship-related emotions were the following:
• "I feel that I can share some worries or secrets with my close friends."
• "I enjoy being with my close friends."
• "I have fun with my close friends."
The items related to negative friendship-related emotions included the following statements:
• "I feel that my close friends criticize me."
• "My close friends get on my nerves."

Figure 2. Mean values of standardized composite variables by groups.

Affect was measured with ten items on the same 7-point scale, adapted from [3]. Two composite variables were created: PosAffects (joyful, cheerful, happy, lively, proud) and NegAffects (guilty, angry, insecure, scared, sad, worried, ashamed), representing the mean values of the respective items.
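The aggregation of items into composite variables is a simple per-respondent averaging. The sketch below illustrates it with hypothetical item keys; the actual EARS item identifiers are not given in the paper.

```python
from statistics import mean

# Hypothetical item keys grouped by composite; the real identifiers are not in the paper.
COMPOSITES = {
    "PosFriendEmo": ["share_worries", "enjoy_friends", "fun_friends"],
    "NegFriendEmo": ["friends_criticize", "friends_on_nerves"],
    "PosAffects":   ["joyful", "cheerful", "happy", "lively", "proud"],
    "NegAffects":   ["guilty", "angry", "insecure", "scared", "sad", "worried", "ashamed"],
}

def composites(responses: dict) -> dict:
    """Average the 1-7 item ratings belonging to each composite variable."""
    return {name: mean(responses[i] for i in items) for name, items in COMPOSITES.items()}
```

In the study these averages were additionally aggregated across each respondent's (up to 49) assessments before clustering; that second averaging step works the same way, one level up.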
In addition, a composite variable Confident was formed from three items related to peer popularity, self-satisfaction, and body satisfaction, while a composite variable Attentive was formed from five items reflecting responsibility, caring for others, perceived adult support, readiness for schoolwork, and perceived teacher support.

Regarding passive data, respondents used a total of 927 applications, which were categorized into 16 groups. Of these, 11 categories were included in the analysis, while the remaining five were excluded due to their negligible usage time. Each app's categorization was performed using generative AI tools (Google Bard and ChatGPT) based on app functionality. The initial classification was then manually verified through the app's official website to confirm its primary function. Besides the variables related to the usage of the 11 observed categories of mobile apps, a variable reflecting the total time spent on the mobile phone (Total passive) was also included in the analysis. The analyzed categories included: Tools and productivity, Social media, Music and audio, Games, Communication, Multimedia, Education and learning, Online shopping and services, Travel, Device management, and Entertainment.

Figure 3. Proportion of respondents by group and gender (male, female, I'd rather not say).

Profiles of adolescents were identified using k-means clustering applied to the standardized composite EMA variables. Based on the interpretability of the resulting clusters, the model with four groups was selected.

Data analysis was conducted using the R statistical software. Group differences were tested using the non-parametric Kruskal-Wallis test, followed by Dunn's post hoc test. Non-parametric tests were applied because the analyzed variables were not normally distributed. For the analysis of the dependency between groups and their school success, a chi-square test was used.
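The profiling step, clustering standardized composites with k-means, can be sketched without external libraries. The authors used R; the following is an illustrative stdlib re-implementation of z-score standardization and Lloyd's algorithm, not their code.

```python
import random
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column, as done before clustering the composite variables."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c) or 1.0) for c in cols]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

def kmeans(rows, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns a cluster label per row."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(row, centers[j])))
                  for row in rows]
        for j in range(k):
            members = [row for row, l in zip(rows, labels) if l == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = [mean(col) for col in zip(*members)]
    return labels
```

In the study, each row would hold a respondent's six standardized composites and k would be chosen by inspecting the interpretability of the resulting clusters (four in the paper); the PCA projection used for Figure 1 is a separate visualization step.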
3 Results
Figure 5 presents the distribution of daily time (in seconds) that respondents spent using different categories of mobile applications across the groups. No statistically significant differences were found in the median time spent on social media or in the total time spent across all application categories. Group 1, which showed the highest median values for the composite variables Confidence, Attentiveness, and positive friendship-related emotions, also reported spending the most time on social media; however, their perception of social media overuse was the lowest among all groups. Group 3, characterized by near-average median values of Confidence, Attentiveness, positive and negative friendship-related emotions, and affect, demonstrated the highest median usage across most application categories (Tools and productivity, Music and audio, Games, Communication, Education and learning, Travel, Device management, and Entertainment). The Kruskal-Wallis test revealed a significant difference in application use only for the Education and learning category, although Dunn's post hoc test did not confirm differences between specific group pairs. Respondents in Group 4 had the highest median usage of Multimedia applications, while those in Group 2 spent the most time on applications related to Shopping and services. Notably, respondents in Group 2 were predominantly male and reported the highest perceived overuse of social media among all groups.

Figure 4. Box-plots for the variables of self-assessed overuse of social media (ovdr, 14-70), level of symptoms of depression (dep, 0-21), level of symptoms of anxiety (anks, 0-21), and level of symptoms of stress (stres, 0-21).

School success was measured by the average grade point, which was 4.05 for Group 1, 4.33 for Group 2, 4.61 for Group 3, and 4.20 for Group 4. Figure 1 shows the groups of respondents obtained by the k-means algorithm projected onto the first two principal components of
composite EMA variables. Figure 2 illustrates the mean values of the composite variables across the groups. Two related pairs of groups can be observed: Groups 1 and 4, and Groups 2 and 3. Groups 1 and 4 display nearly mirror-image profiles with respect to the x-axis. For Group 1, the composite variables Confident, Attentive, PosFriendEmo, and PosAffects are above average, whereas in Group 4 these same variables fall below average. Conversely, NegFriendEmo and NegAffects are below average for Group 1 but above average for Group 4. A similar pattern emerges for Groups 2 and 3, which also show mirror-image profiles, though shifted slightly toward above-average values. Group 3 is characterized by nearly average levels of Confident, Attentive, and PosFriendEmo, while NegFriendEmo, PosAffects, and NegAffects are slightly below average. In contrast, Group 2 demonstrates above-average mean values across all variables. Overall, emotions related to friendships and affective states are less pronounced in Groups 2 and 3 compared to Groups 1 and 4.

The chi-square test indicated a borderline non-significant difference in school success across the groups (p=0.0501). Group 3, which showed the highest median time of application use across most categories, also achieved the highest average grade point (4.61). In contrast, Group 1, which reported the highest levels of confidence and attentiveness in the EMA (including perceived readiness for school tasks), had the lowest average grade point.

4 Discussion and conclusion
This study identified four adolescent profiles based on data collected from 77 Croatian high-school students using EMA. The data collected from EMA were aggregated across respondents into six composite variables representing their self-reported confidence, attentiveness, positive and negative friendship-related emotions, and positive and negative affect.
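The group comparisons above rely on the Kruskal-Wallis test, which the authors ran in R. As a minimal stdlib sketch of what that test computes, the code below derives the H statistic from pooled average ranks; tie correction and the chi-square p-value lookup are omitted for brevity.

```python
def ranks(values):
    """Average ranks (1-based), with tied values sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (without tie correction) for k independent groups."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    r = ranks(pooled)
    h, start = 0.0, 0
    for g in groups:
        rsum = sum(r[start:start + len(g)])
        h += rsum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * h - 3.0 * (n + 1)
```

Under the null hypothesis H is approximately chi-square distributed with k-1 degrees of freedom, which is how p-values such as p=0.0045 for the depression scores are obtained in practice.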
Two pairs of Figure 3 shows that female respondents predominate in mirror-image profiles were observed: Groups 2 and 3, and Groups 3 and 4, in Group 1 there is approximately an equal Groups 1 and 4. Emotional states related to friendships and proportion of male and female respondents, while in Group 2 affective states are less pronounced in Groups 2 and 3 compared predominate male respondents. Figure 4 presents the distribution to other pair of groups, and these groups are characterized by of survey-based variables: self-assessment of overuse of social better academic success. 21), anxiety ( Mobile sensing revealed that respondents used a total of 927 apps, anks media (ovdr, 14-70), level of symptoms of depression (dep, 0- , 0-21), and stress (stres, 0-21). Group 4 exhibits the highest levels of symptoms of depression, anxiety, which were categorized into 16 categories, out of which 11 were and stress. According to the non-parametric Kruskal-Wallis test, analyzed in this study. Although social media accounted for the there is a significant difference between the groups in symptoms largest share of usage time, no significant group differences were of depression (p=0.0045) and stress (p=0.0162). The Dunn’s post found either in social media use or in total application use. Group hoc test indicated that Group 4 has statistically significant higher 1, according to self-perception, exhibited the most confident and levels of symptoms of depression (p=0.0015) and stress attentive and has lowest median levels of depression, anxiety and (p=0.0090) compared to Group 1. The Kruskal-Wallis test shows stress, spent the most time on social media, but perceived its that there is a difference in the perception of overuse of social overuse the least. This group contains approximately an equal media between the groups (p=0.0024). The highest perceived proportion of male and female respondents. 
Group 2, which was overuse was reported by Group 2, with a significant difference compared to Group 3 (p=0.0021) and Group 1 (p=0.0040). predominantly male, spent the most time on Online shopping and Results indicate that respondents’ perceptions of their social services and reported the highest perceived overuse of social media use did not correspond to the actual time spent on social media, with significant differences compared to Group 1 media (r = 0.0741). (p=0.0040) and Group 3 (p=0.0021). 120 . Figure 5. Box-plots for variables of daily usage of categories of mobile applications by groups (in seconds). Note the different ordinal scales due to the large differences in the use of apps. Group 3, which had the highest academic achievements and the mobile assessment (P.R.O.T.E.C.T.) research project, founded majority of female respondents, had the highest usage of by the Croatian Science Foundation (UIP-2020-02-2852). applications in categories Tools and productivity, Music and audio, Games, Communication, Education and learning, Travel, References Device management, and Entertainment. Group 4, also [1] Antony, M. M., Bieling, P. J., Cox, B. J., Enns, M. W., & Swinson, R. P. predominantly female, exhibited the highest levels of depression, 1998. Psychometric properties of the 42-item and 21-item versions of the Depression Anxiety Stress Scales in clinical groups and a community anxiety, and stress symptoms, spent the least time on social sample. Psychological Assessment, 10(2), 176–181. media, used Multimedia applications more than other groups, and https://doi.org/10.1037/1040-3590.10.2.176 ranked second in the use of [2] Billard, L., Diday, E. 2007. Symbolic Data Analysis: Conceptual Education and learning applications. Statistics and Data Mining, 1st edition, Wiley Importantly, there was no significant correlation between [3] Bülow, A., van Roekel, E., Boele, S., Denissen, J.J.A. and Keijsers, L.. 
perceived overuse of social media by respondents and their actual 2022. Parent –adolescent interaction quality and adolescent affect: An time spent using it, as measured by passive sensing. This finding experience sampling study on effect heterogeneity. Child Development, 93(3), 315-331, DOI: https://doi.org/10.1111/cdev.13733 highlights the added value of combining mobile sensing with [4] Byrne, M.L., Lind, M.N., Horn, S.R., Mills, K.L., Nelson, B.W., Barnes, survey data, as it provides insights that would not be captured M.L., Slavich, G.M. and Allen, N.B. 2012. Using mobile sensing data to assess stress: Associations with perceived and lifetime stress, mental through self-report alone. While symptoms of depression, health, sleep, and inflammation, Digital Health. 2021:7, DOI: anxiety, and stress were assessed on a 0-21 scale, all median 10.1177/20552076211037227 [5] Li, L.M.W., Chen, Q., Gao, H., Li, W.Q. and Ito, K.2021. Online/offline values were below 10, reflecting the general population sample self-disclosure to offline friends and relational outcomes in a diary in which the prevalence of psychological problems is low. Future school: The moderating role of self-esteem and relational closeness. International Journal of Psychology , 56(1), 129-137, DOI: research could therefore focus on adolescents with higher levels https://doi.org/10.1002/ijop.12684 of depression, anxiety, and stress symptoms. [6] Lind, M.N., Byrne, M.L., Wicks, G., Smidt, A.M., Allen, N.B., 2018. In addition, future work will explore the application of symbolic The Effortless Assessment of Risk States (EARS) Tool: An Interpersonal Approach to Mobile Sensing, JMIR Ment. Health, 2018; 5(3):e10334, data analysis for clustering based on both EMA and mobile DOI: 10.2196/10334. sensing data. Symbolic data analysis, developed for the study of [7] Lind, M. N., Kahn, L. E., Crowley, R., Read, W., Wicks, G., Allen, N. complex and large-scale datasets, incorporates variability B., 2023. 
Reintroducing the Effortless Assessment Research System (EARS), JMIR Ment. Health, 2023; 10:e38920, DOI: 10.21196/38920. directly into the aggregation process [2]. This approach would [8] Ng, Y.T., Huo, M., Gleason, M.E., Neff, L.A., Charles, S.T. and allow us to account for the stability of emotional states and Fingerman, K.L. 2021. Friendship in old age: Daily encounters and emotional well-being. The Journals of Gerontology: Series B, 76(3), behavioral patterns at the individual level, potentially offering 551-562, DOI: https://doi.org/10.1093/geronb/gbaa007 more refined indicators for defining adolescent profiles. [9] Šutić, L., van Roekel, E. and Novak, M. 2025. Quality of friendships and well-being in adolescence: daily life study. International Journal of Adolescence and Youth, 30(1), DOI: Acknowledgments https://doi.org/10.1080/02673843.2025.2467112 This study was conducted as a part of the Testing the 5C framework of positive youth development: traditional and digital 121 Brazil’s First AI Regulatory Sandbox: Towards Responsible Innovation Cristina Godoy Oliveira† Joao Paulo Candia Veiga Vasilka Sancin Joao Pita Costa CIAAM, C4AI, Univ. of São Paulo CIAAM, C4AI, Univ. of São Paulo Faculty of Law, Univ. of Ljubljana IRCAI, Quintelligence São Paulo, Brazil São Paulo, Brazil Ljubljana, Slovenia Ljubljana, Slovenia cristinagodoy@usp.br candia@usp.br vasilka.sancin@pf.uni-lj.si joao.pitacosta@ircai.org Rafael Meira Silva Maša Kovič Dine Lucas Costa dos Anjos Thiago Gomes Marcilio, CIAAM, C4AI, Univ. of São Faculty of Law, Univ. of Ljubljana Faculty of Law, Univ. of Juiz Anthony C. de Novaes Silva Paulo Ljubljana, Slovenia de Fora CIAAM, C4AI, Univ. 
masa.kovic-dine@pf.uni-lj.si, rafael_meira@alumni.usp.br, lucas.anjos@anpd.gov.br, tgm.marcilio@gmail.com, anthonycharles.silva@outlook.com

Abstract / Povzetek
As artificial intelligence technologies rapidly evolve, regulatory sandbox initiatives have emerged as crucial tools for promoting responsible AI development, enabling innovation while safeguarding fundamental rights and public interests. This paper analyzes the development and implications of Brazil's first AI regulatory sandbox, with a particular focus on the model established by SUSEP (Superintendence of Private Insurance). Designed as a controlled environment for testing innovative products and services in the insurance sector, the SUSEP sandbox illustrates how regulatory flexibility can foster technological advancement, financial inclusion, and market efficiency while maintaining consumer protection and risk oversight. Developed under Brazil's Economic Freedom Law, the sandbox has evolved through three editions (2020, 2021, and 2024), prioritizing both sustainable and technological projects. This study explores the sandbox's structure, eligibility criteria, business plan requirements, operational limitations, and transition mechanisms for companies seeking permanent licensure. It also identifies actionable insights for future regulatory frameworks, particularly for the National Data Protection Authority (ANPD) as Brazil advances toward AI-specific governance. By comparing the sandbox's legal foundations, selection processes, and risk mitigation protocols with international best practices, this paper underscores the sandbox's role as a blueprint for responsible AI regulation in emerging markets.

Keywords / Ključne besede
Regulatory Sandboxes, Artificial Intelligence Governance, Data Protection, Innovation Policy, Brazilian AI Regulation

† Corresponding author

1 Introduction
The rapid evolution of artificial intelligence (AI) has prompted urgent global discussions about governance frameworks that can both stimulate innovation and mitigate potential risks. Around the world, regulators are grappling with how to manage AI systems that are increasingly impacting critical sectors, such as finance, healthcare, education, and public administration. While countries in Europe have taken the lead in formalizing AI-specific legislation (most notably through the European Union's AI Act), many nations in the Global South, including those in South America, are only beginning to articulate coherent regulatory approaches. In Europe, the EU AI Act represents the first comprehensive legal framework for AI, categorizing applications by risk level and imposing strict requirements for high-risk systems. It introduces transparency, accountability, and human oversight obligations, while also fostering innovation through mechanisms such as regulatory sandboxes. This structured and anticipatory approach reflects Europe's long-standing tradition of precautionary regulation and data protection, rooted in the General Data Protection Regulation (GDPR) and in succeeding regulations and standards, such as the upcoming European AI Sandbox Act that will further extend Article 57 of the European AI Act, focusing on AI sandboxes in Europe. By contrast, AI regulation in Brazil and South America remains fragmented, preliminary, and largely reactive. In Brazil, multiple legislative proposals have been introduced in Congress, but no comprehensive AI law has yet been enacted. The country's current approach relies on a patchwork of sectoral regulations, soft law instruments, and the foundational framework provided by the General Data Protection Law (Lei Geral de Proteção de Dados - LGPD).
While the LGPD is a significant step forward in regulating personal data and algorithmic decision-making, it does not address the broader ethical, operational, and societal challenges posed by AI systems. Regionally, South American countries exhibit a similar lack of uniformity. Argentina, Chile, and Colombia have published national AI strategies or draft policy guidelines, yet most remain in early implementation phases. Regulatory oversight is often spread across multiple agencies, and few jurisdictions have adopted binding legal norms beyond data protection. In this landscape, Brazil stands out as a potential regional leader, particularly through initiatives such as the National Artificial Intelligence Strategy (Estratégia Brasileira de Inteligência Artificial – EBIA) [1], the Brazilian Artificial Intelligence Plan (Plano Brasileiro de Inteligência Artificial – PBIA) [2], and the growing role of the ANPD.

This paper argues that regulatory sandboxes (flexible, supervised environments for testing innovative solutions) offer a pragmatic and context-sensitive tool for advancing AI governance in Brazil and Latin America. In particular, the experience of the SUSEP Regulatory Sandbox, an experimental regulatory environment created by the Superintendence of Private Insurance (SUSEP) [3] and designed for the insurance market, provides a valuable model for structuring oversight of emerging technologies. Through an in-depth analysis of the SUSEP sandbox, this research explores how key regulatory principles, such as proportionality, transparency, risk management, and sustainability, can inform the development of Brazil's first AI sandbox. In doing so, this study contributes to ongoing policy debates about how developing economies can chart their own paths in AI governance, drawing lessons from both global benchmarks and local regulatory experiments. Moreover, it feeds into the ongoing collaboration with the different stakeholders in the development of the Slovenian AI Sandbox initiative, in the hope of a constructive exchange based on good practices and AI regulation perspectives.

2 Methodology
The SUSEP Regulatory Sandbox is an experimental regulatory environment established to enable the implementation of innovative projects that offer products and/or services in the insurance market. These innovations are developed or offered using new methodologies, processes, or procedures, or by applying existing technologies in a novel way. Companies participating in the sandbox can test, under supervision, new products, new services, or new ways of providing traditional services. SUSEP assesses the benefits and risks associated with each innovation and determines whether adjustments are needed, either to the business model or to existing regulations.

The ANPD's Regulatory Sandbox, on the other hand, is structured to comprehensively evaluate the technical, legal, ethical, and social dimensions of AI-based projects involving personal data. It adopts a multidisciplinary approach for AI, encompassing organizational, technological, and regulatory aspects. Participants are required to present a detailed description of the problem or opportunity addressed by their project, highlighting the current context, challenges, and expected benefits, such as innovation and efficiency. The methodology emphasizes the innovative aspects of the solution, the processing of personal data in AI systems, the social impact, and the intended outcomes.

A core component of the methodology is the implementation of algorithmic transparency measures. Applicants must describe how their systems will make algorithmic logic, decisions, and criteria understandable to end users. This includes the use of explainable AI (XAI) tools, audit reports, documentation, and dashboards, as well as practices for data traceability and decision accountability. The methodology also requires information on compliance with the LGPD, such as data minimization, risk management, mitigation of algorithmic bias, governance mechanisms, and respect for data subject rights. Projects must show alignment with ethical and legal standards to ensure responsible AI development.

In terms of data methodology, applicants must describe the lifecycle of the personal data used, including its origin, collection, processing, storage, and disposal. In addition, data quality is crucial, and applicants must describe it to demonstrate that they are well prepared to participate in the regulatory sandbox. A preliminary data protection impact assessment must be included, along with a risk matrix that identifies potential harms to data subjects and proposes mitigation strategies. The form also assesses the technical feasibility of the project by requiring information on the IT infrastructure (cloud, hybrid, on-premises), API data flows, outsourcing arrangements, LLM usage, and cybersecurity controls. Financial planning (FinOps), scalability, social impact assessment, and performance metrics are also critical elements of the methodology.

Finally, organizations must consolidate their identified risks and mitigation measures into a summary framework, ensuring transparency and accountability throughout the project lifecycle.

3 Legislation, Regulation, and Ethical Use: Objectives and Priorities
When the SUSEP Sandbox was launched, it was part of a joint initiative involving the financial, insurance, and capital markets, led by the Central Bank of Brazil (BCB), the Securities and Exchange Commission (CVM), and SUSEP. The SUSEP Sandbox was established during the Bolsonaro administration, in alignment with the Economic Freedom Law (Law No. 13,874/2019) and broader deregulation efforts. There have been three editions so far, in 2020, 2021, and 2024 [4], with the 2024 edition currently open for an indefinite period. The SUSEP Sandbox is governed by CNSP Resolution No. 381/2020, as amended, along with SUSEP Circular No. 598/2020 and specific public notices for each edition. The National Private Insurance Council (CNSP) sets the rules for the insurance market, and SUSEP ensures compliance.

In the 2024 edition of the SUSEP Regulatory Sandbox, participating companies were required to submit detailed information and upload relevant documents through Brazil's Electronic Information System (SEI). The sandbox was designed to: (i) stimulate competition to improve efficiency; (ii) promote financial inclusion; (iii) encourage capital formation and efficient resource allocation; and (iv) develop and deepen the Brazilian insurance market.

SUSEP prioritized proposals classified by the applicants themselves as either Sustainable or Technological projects:
• Sustainable Projects: Aligned with SUSEP and CNSP rules, as well as the Federal Government's Ecological Transformation Plan. These initiatives must deliver climate, environmental, or social benefits to policyholders, beneficiaries, or society as a whole.
• Technological Projects: Promote the development of innovative technology by introducing technological novelties or enhancements to products, services, business models, or processes, thereby adding functionality or quality improvements.

Regarding the eligibility criteria for startups (insurtechs), applicants were required to offer an innovative product or service and to operate via remote/digital platforms. They had to demonstrate the novelty of their technology or its creative application and present the solution at a development stage suitable for temporary authorization. Moreover, they had to submit a business plan, which included a risk assessment specifically addressing cybersecurity, and a damage mitigation plan. Besides the typical proposed and current legal/trade names, organizational structure, and director profiles, the business plan had to include strategic objectives, company history, mission, and vision, along with a problem statement and market/consumer benefits, a proof of concept of the product or service, and a demonstration of potential cost reduction for consumers, if any. It also described a comparative analysis with existing offerings, the target market and geographic scope, risk factors and mitigation strategies, the technical architecture and operational model, the justification for the Priority Project classification, and the sustainability policy. The selection process involved two stages: (i) a Selection Phase, with a video interview with SUSEP; and (ii) a Temporary Authorization Phase, with a follow-up interview and submission of evidence proving compliance with normative requirements and completion of corporate formalities, as well as the appointment of a director responsible for sandbox participation and documentary evidence attesting to the lawful origin of funds contributed by investors.

4 Discussion of initial results
The 2024 edition of this initiative included four companies that were granted permanent licenses (by September), while 32 projects were selected, amongst which 21 received temporary authorization (by April). Authorized companies were required to transmit operational data to SUSEP via API. While in the sandbox, companies:
• can only sell approved types of insurance,
• operate under capped risk exposure, and
• face limits on claims payouts.

Given the similarities between insurance regulation and data protection governance, several SUSEP sandbox practices could inform the design of an AI sandbox under Brazil's National Data Protection Authority (ANPD), such as:
1. Innovation focus – Projects must demonstrate clear novelty or novel applications of technologies, methods, and procedures.
2. Sustainability integration – For AI, this could include energy, water, and natural resource efficiency, environmental impact, and ethical safeguards.
3. Defined operational boundaries – Limitations on AI use cases, affected populations, and permitted risk categories.
4. Mandatory submissions – A risk analysis and mitigation plan, a business plan, and funding source verification.
5. AI registry – Formal registration with the ANPD, with authorizations subject to revocation.
6. Virtual interviews – Ensuring nationwide accessibility.
7. Exit strategy – A clear post-sandbox transition plan for continued compliance.

In Phase 1 of the ANPD's regulatory sandbox selection process, whose application period closed on August 25, 2025, additional points will be awarded to startups, public sector organizations, and companies developing generative AI solutions. These categories were identified as strategic priorities for Brazil: startups are legally recognized in the Brazilian Innovation Framework [5] as key beneficiaries of sandbox initiatives; public sector organizations often develop socially impactful solutions and are expected to sustain participation without financial or technical aid from the ANPD; and Brazil has an explicit national interest in fostering large language models (LLMs) in Portuguese as part of its broader AI sovereignty strategy.

As part of the application process, the ANPD's form required that any confidential or sensitive business information be clearly marked as such by the applicants. This provision is necessary due to Brazil's Freedom of Information Law (Lei de Acesso à Informação – LAI), which mandates public disclosure unless a legal exception is claimed. Without this explicit classification, all submitted materials may be treated as public, potentially exposing strategic or proprietary information from participating firms.

To enhance visibility and inclusiveness, the ANPD also adopted a multi-channel outreach strategy, disseminating the call for applications through official platforms and with the support of civil society organizations. To maximize participation, the deadline for applications was extended by an additional 15 days, although the overall schedule for evaluation and publication of results remained unchanged. The final list of selected participants is scheduled to be released on October 2, 2025, as originally planned.

Finally, there is another point of flexibility, not expressly codified, which is the absence of a fixed taxonomy of sandboxes. For example, the SUSEP sandbox has an innovative character, seeking to make regulations more flexible while the service is already being used in the market. In contrast, the ANPD sandbox aims to provide the regulator with knowledge that enables the preventive, rather than reactive, updating of market rules. Oversight may be distributed among agencies like SUSEP, yet the regulatory status of AI companies post-sandbox remains unclear. For this reason, the ANPD must establish both sandbox-specific rules and post-sandbox AI regulations, ensuring long-term supervision and market stability.

The importance of embedding responsible and ethical principles in AI governance is particularly acute in Brazil and across South America, where technological innovation intersects with social inequality, fragile institutions, and diverse regulatory capacities. By prioritizing transparency, accountability, and fairness in AI systems, these countries can foster public trust while mitigating risks of discrimination, exclusion, or misuse of personal data. Brazil's initiatives, such as its National AI Strategy (EBIA), the forthcoming AI legal framework, and the regulatory sandbox programs led by SUSEP and the ANPD, illustrate how developing nations can create adaptive governance models that balance innovation with fundamental rights. Moreover, as the largest economy in Latin America, Brazil is well-positioned to serve as a regional benchmark, showing how ethical AI practices can promote financial inclusion, strengthen democratic values, and encourage sustainable development. In this sense, South America's experience underscores that responsible AI is not a luxury for advanced economies but a prerequisite for equitable technological progress in the Global South.

5 Conclusions and further work
The ANPD's regulatory sandbox demonstrates Brazil's commitment to experimental and responsible governance of AI. By ensuring transparency through a public information portal, addressing confidentiality in accordance with the Access to Information Law, and promoting inclusive engagement, the initiative aligns with international standards. Drawing on frameworks such as the OECD's recommendations and the EU's AI Act, which formally includes regulatory sandboxes, the Brazilian approach reinforces the importance of embedding such mechanisms into national legislation. In the context of Bill 2338/2023, under debate in the Chamber of Deputies to regulate AI in Brazil [6], regulatory sandboxes emerge as strategic tools to enable adaptive, participatory, and context-aware AI regulation.

The Brazilian AI sandbox experience also carries significant relevance beyond Brazil and South America, offering valuable insights for other developing countries and even jurisdictions with more advanced regulatory frameworks, such as Europe. While the European Union has already institutionalized AI sandboxes within the AI Act, the Brazilian model demonstrates how experimental, flexible, and context-sensitive approaches can be adapted to environments where regulatory structures are less consolidated. Its emphasis on transparency, proportionality, and multi-stakeholder participation shows that effective governance does not require fully mature institutions but rather innovative mechanisms that align local priorities with global best practices. By proving that responsible innovation can be pursued within resource-constrained and diverse legal settings, the Brazilian sandbox contributes to a global dialogue on AI governance, helping countries at different stages of regulatory development to tailor sandbox initiatives to their specific socio-economic and institutional realities.

References / Literatura
[1] MCTI (2021). Brazilian Strategy of Artificial Intelligence. [Online]. Available: ebia-documento_referencia_4-979_2021.pdf (www.gov.br)
[2] PBIA (2024). Brazilian Artificial Intelligence Plan. [Online]. Available: https://www.gov.br/mcti/pt-br/acompanhe-o-mcti/noticias/2024/07/plano-brasileiro-de-ia-tera-supercomputador-e-investimento-de-r-23-bilhoes-em-quatro-anos/ia_para_o_bem_de_todos.pdf/view
[3] Brazilian Government Portal (2025). About SUSEP. [Online]. Available: https://www.gov.br/susep/pt-br/acesso-a-informacao/institucional/sobre-a-susep
[4] Brazil (2019). Joint statement: coordinated action to implement a regulatory sandbox regime in the Brazilian financial, securities, and capital markets. [Online]. Available: https://www.gov.br/susep/pt-br/central-de-conteudos/noticias/2022/noticia
[5] Brazil (2021). Complementary Law No. 182, of June 1, 2021. Establishes the Legal Framework for Startups and Innovative Entrepreneurship. [Online]. Available: planalto.gov.br/ccivil_03/leis/lcp/lcp182.htm
[6] Brazil (2023). Bill No. 2338, of 2023. Establishes the legal framework for artificial intelligence in Brazil. [Online]. Available: https://www.camara.leg.br/proposicoesWeb/prop_mostrarintegra?codteor=2868197&filename=PL%202338/2023

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
http://doi.org/10.70314/is.2025.sikdd.13

Indeks avtorjev / Author index
Anjos Lucas Costa dos ... 122
Barrionuevo Leonardo ... 98
Bašić Nino ... 102
Batagelj Vladimir ... 102
Brank Janez ... 11
Calcina Erik ... 7
Camlek Neca ... 29
Caporusso Jaya ... 19
Cek Rok ... 82
Ćetković Marija ... 65
Čibej Jaka ... 53
Costa João Pita ... 45, 98, 122
Debeljak Žiga ... 57
Dine Masa Kovic ... 122
Dobša Jasminka ... 118
Forcolin Margherita ... 82
Frattini Matteo ... 49
Grobelnik Adrian Mladenić ... 15, 25
Grobelnik Marko ... 11, 15, 25, 29, 73
Guček Alenka ... 25, 69
Guo Zhenyu ... 33
Hegler Živa ... 29
Hosseini Seyed Iman ... 86
Hrib Ivo ... 110, 114
Jakomin Martin ... 106
Jelenčič Jakob ... 29
Jeršek Domen ... 49
Kassis Rayan ... 45
Kavšek Branko ... 94
Kenda Klemen ... 49, 57, 77, 82, 86
Kladnik Matic ... 41
Klančič Rok ... 49
Kochovska Sofija ... 94
Kocjančič Oskar ... 37
Korenjak-Černe Simona ... 118
Kozamernik Lučka ... 106
Krumpak Roy ... 33
Lamgari Asmai ... 45
Leonardi Linda ... 82
Leskovec Gašper ... 77
Ma Xiang ... 33
Marcilio Thiago Gomes ... 122
Mladenić Dunja ... 7, 11, 29, 33, 41, 57, 61, 73, 86
Mochariq Ouidad ... 45
Mylonas Costas ... 77
Novak Erik ... 7
Novak Miranda ... 118
Novalija Inna ... 11, 33, 61, 73
Oliveira Cristina Godoy ... 122
Pandur Maja Buhin ... 118
Pavlova Daria ... 61, 90
Pisanski Tomaž ... 102
Pollak Senja ... 19
Polzer Mirozlav ... 98
Purver Matthew ... 19
Rahmani Yousef ... 45
Roman Dumitru ... 33
Rožanec Jože M. ... 33
Sancin Vasilka ... 122
Savnik Iztok ... 102
Silva Anthony Novaes ... 122
Silva Rafael Meira ... 122
Sittar Abdul ... 69
Škrjanc Maja ... 73, 114
Škrlj Blaž ... 106
Slavec Ana ... 102
Smiljanic Mateja ... 69
Song Tao ... 33
Souss Sohaib ... 45
Stopar Luka ... 45
Šturm Jan ... 73, 114
Šutić Lucija ... 118
Topal Oleksandra ... 73, 82, 114
Tošić Aleksandar ... 65
Trajkov Georgi ... 15
Urbančič Jasna ...
106 Vake Domen ................................................................................................................................................................................. 65 Veiga João Cândia ................................................................................................................................................................ 98, 122 Vičič Jernej ................................................................................................................................................................................... 94 Zajec Patrik ................................................................................................................................................................................ 110 Zaouini Mustafa ........................................................................................................................................................................... 45 Žnidaršič Martin ........................................................................................................................................................................... 37 Žust Martin ................................................................................................................................................................................... 25 128 Odkrivanje znanja in podatkovna skladišča SiKDD Data Mining and Data Warehouses SiKDD Urednika l Editors: Dunja Mladenić Marko Grobelnik