IS 2025 INFORMACIJSKA DRUŽBA / INFORMATION SOCIETY

Zbornik 28. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2025, Zvezek C
Proceedings of the 28th International Multiconference INFORMATION SOCIETY – IS 2025, Volume C

Odkrivanje znanja in podatkovna skladišča – SiKDD
Data Mining and Data Warehouses – SiKDD

Urednika / Editors: Dunja Mladenić, Marko Grobelnik

http://is.ijs.si

6. oktober 2025 / 6 October 2025, Ljubljana, Slovenia

Urednika:
Dunja Mladenić, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana

Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič, uporabljena slika iz Pixabay
Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2025

Informacijska družba, ISSN 2630-371X
DOI: https://doi.org/10.70314/is.2025.sikdd
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 255453699
ISBN 978-961-264-322-5 (PDF)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2025

28. mednarodna multikonferenca Informacijska družba se odvija v času izjemne rasti umetne inteligence, njenih aplikacij in vplivov na človeštvo. Vsako leto vstopamo v novo dobo, v kateri generativna umetna inteligenca ter drugi inovativni pristopi oblikujejo poti k superinteligenci in singularnosti, ki bosta krojili prihodnost človeške civilizacije. Naša konferenca je tako hkrati tradicionalna znanstvena in akademsko odprta, pa tudi inkubator novih, pogumnih idej in pogledov.
Letošnja konferenca poleg umetne inteligence vključuje tudi razprave o perečih temah današnjega časa: ohranjanje okolja, demografski izzivi, zdravstvo in preobrazba družbenih struktur. Razvoj UI ponuja rešitve za številne sodobne izzive, kar poudarja pomen sodelovanja med raziskovalci, strokovnjaki in odločevalci pri oblikovanju trajnostnih strategij. Zavedamo se, da živimo v obdobju velikih sprememb, kjer je ključno, da z inovativnimi pristopi in poglobljenim znanjem ustvarimo informacijsko družbo, ki bo varna, vključujoča in trajnostna.

V okviru multikonference smo letos združili dvanajst vsebinsko raznolikih srečanj, ki odražajo širino in globino informacijskih ved: od umetne inteligence v zdravstvu, demografskih in družinskih analiz, digitalne preobrazbe zdravstvene nege ter digitalne vključenosti v informacijski družbi, do raziskav na področju kognitivne znanosti, zdrave dolgoživosti ter vzgoje in izobraževanja v informacijski družbi. Pridružujejo se konference o legendah računalništva in informatike, prenosu tehnologij, mitih in resnicah o varovanju okolja, odkrivanju znanja in podatkovnih skladiščih ter seveda Slovenska konferenca o umetni inteligenci. Poleg referatov bodo okrogle mize in delavnice omogočile poglobljeno izmenjavo mnenj, ki bo pomembno prispevala k oblikovanju prihodnje informacijske družbe.

»Legende računalništva in informatike« predstavljajo domači »Hall of Fame« za izjemne posameznike s tega področja. Še naprej bomo spodbujali raziskovanje in razvoj, odličnost in sodelovanje; razširjeni referati bodo objavljeni v reviji Informatica, s podporo dolgoletne tradicije in v sodelovanju z akademskimi institucijami ter strokovnimi združenji, kot so ACM Slovenija, SLAIS, Slovensko društvo Informatika in Inženirska akademija Slovenije.

Vsako leto izberemo najbolj izstopajoče dosežke.
Letos je nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe prejel Niko Schlamberger, priznanje za raziskovalni dosežek leta pa Tome Eftimov. »Informacijsko limono« za najmanj primerno informacijsko tematiko je prejela odsotnost obveznega pouka računalništva v osnovnih šolah. »Informacijsko jagodo« za najboljši sistem ali storitev v letih 2024/2025 pa so prejeli Marko Robnik Šikonja, Domen Vreš in Simon Krek s skupino za slovenski veliki jezikovni model GAMS. Iskrene čestitke vsem nagrajencem!

Naša vizija ostaja jasna: prepoznati, izkoristiti in oblikovati priložnosti, ki jih prinaša digitalna preobrazba, ter ustvariti informacijsko družbo, ki koristi vsem njenim članom. Vsem sodelujočim se zahvaljujemo za njihov prispevek; veseli nas, da bomo skupaj oblikovali prihodnje dosežke, ki jih bo soustvarjala ta konferenca.

Mojca Ciglarič, predsednica programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD TO THE MULTICONFERENCE INFORMATION SOCIETY 2025

The 28th International Multiconference on the Information Society takes place at a time of remarkable growth in artificial intelligence, its applications, and its impact on humanity. Each year we enter a new era in which generative AI and other innovative approaches shape the path toward superintelligence and singularity, phenomena that will shape the future of human civilization. The conference is both a traditional scientific forum and an academically open incubator for new, bold ideas and perspectives.

In addition to artificial intelligence, this year's conference addresses other pressing issues of our time: environmental preservation, demographic challenges, healthcare, and the transformation of social structures. The rapid development of AI offers potential solutions to many of today's challenges and highlights the importance of collaboration among researchers, experts, and policymakers in designing sustainable strategies.
We are acutely aware that we live in an era of profound change, where innovative approaches and deep knowledge are essential to creating an information society that is safe, inclusive, and sustainable.

This year's multiconference brings together twelve thematically diverse meetings reflecting the breadth and depth of the information sciences: from artificial intelligence in healthcare, demographic and family studies, and the digital transformation of nursing and digital inclusion, to research in cognitive science, healthy longevity, and education in the information society. Additional conferences include Legends of Computing and Informatics, Technology Transfer, Myths and Truths of Environmental Protection, Knowledge Discovery and Data Warehouses, and, of course, the Slovenian Conference on Artificial Intelligence. Alongside scientific papers, round tables and workshops will provide opportunities for in-depth exchanges of views, making an important contribution to shaping the future information society.

Legends of Computing and Informatics serves as a national »Hall of Fame« honoring outstanding individuals in the field. We will continue to promote research and development, excellence, and collaboration. Extended papers will be published in the journal Informatica, supported by a long-standing tradition and in cooperation with academic institutions and professional associations such as ACM Slovenia, SLAIS, the Slovenian Society Informatika, and the Slovenian Academy of Engineering. Each year we recognize the most distinguished achievements.

In 2025, the Michie-Turing Award for lifetime contribution to the development and promotion of the information society was awarded to Niko Schlamberger, while the Award for Research Achievement of the Year went to Tome Eftimov. The »Information Lemon« for the least appropriate information-related topic was awarded to the absence of compulsory computer science education in primary schools.
The »Information Strawberry« for the best system or service in 2024/2025 was awarded to Marko Robnik Šikonja, Domen Vreš and Simon Krek together with their team, for developing the Slovenian large language model GAMS. We extend our warmest congratulations to all awardees.

Our vision remains clear: to identify, seize, and shape the opportunities offered by digital transformation, and to create an information society that benefits all its members. We sincerely thank all participants for their contributions and look forward to jointly shaping the future achievements that this conference will help bring about.

Mojca Ciglarič, Chair of the Program Committee
Matjaž Gams, Chair of the Organizing Committee

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee
Vladimir Bajic, South Africa
Heiner Benking, Germany
Se Woo Cheon, South Korea
Howie Firth, UK
Olga Fomichova, Russia
Vladimir Fomichov, Russia
Vesna Hljuz Dobric, Croatia
Alfred Inselberg, Israel
Jay Liebowitz, USA
Huan Liu, Singapore
Henz Martin, Germany
Marcin Paprzycki, USA
Claude Sammut, Australia
Jiri Wiedermann, Czech Republic
Xindong Wu, USA
Yiming Ye, USA
Ning Zhong, USA
Wray Buntine, Australia
Bezalel Gavish, USA
Gal A. Kaminka, Israel
Mike Bain, Australia
Michela Milano, Italy
Derong Liu, Chicago, USA
Toby Walsh, Australia
Sergio Campos-Cordobes, Spain
Shabnam Farahmand, Finland
Sergio Crovella, Italy

Organizing Committee
Matjaž Gams, chair
Mitja Luštrek
Lana Zemljak
Vesna Koricki
Mitja Lasič
Blaž Mahnič

Programme Committee
Mojca Ciglarič, chair
Marjan Heričko, Boštjan Vilfan, Bojan Orel, Borka Jerman Blažič Džonova, Baldomir Zajc, Franc Solina, Gorazd Kandus, Blaž Zupan, Viljan Mahnič, Urban Kordeš, Boris Žemva, Cene Bavec, Marjan Krisper, Leon Žlajpah, Tomaž Kalin, Andrej Kuščer, Niko Zimic, Jozsef Györkös, Jadran Lenarčič, Rok Piltaver, Tadej Bajd, Borut Likar, Toma Strle, Jaroslav Berce, Janez Malačič, Tine Kolenik, Mojca Bernik, Olga Markič, Franci Pivec, Marko Bohanec, Dunja Mladenič, Uroš Rajkovič, Ivan Bratko, Franc Novak, Borut Batagelj, Andrej Brodnik, Vladislav Rajkovič, Tomaž Ogrin, Dušan Caf, Grega Repovš, Aleš Ude, Saša Divjak, Ivan Rozman, Bojan Blažica, Tomaž Erjavec, Niko Schlamberger, Matjaž Kljun, Bogdan Filipič, Gašper Slapničar, Robert Blatnik, Andrej Gams, Stanko Strmčnik, Erik Dovgan, Matjaž Gams, Jurij Šilc, Špela Stres, Mitja Luštrek, Jurij Tasič, Anton Gradišek, Marko Grobelnik, Denis Trček, Nikola Guid, Andrej Ule

KAZALO / TABLE OF CONTENTS

Odkrivanje znanja in podatkovna skladišča – SiKDD / Data Mining and Data Warehouses – SiKDD ..... 1
PREDGOVOR / FOREWORD ..... 3
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES ..... 5
Semantic Prompting for Large Language Models in Biomedical Named Entity Recognition / Calcina Erik, Novak Erik, Mladenić Dunja ..... 7
LLM Based Approach to Extracting Smells in Slovenian Corpora / Brank Janez, Novalija Inna, Mladenić Dunja, Grobelnik Marko ..... 11
BetweenTheLines – Cross Source News Analysis / Trajkov Georgi, Grobelnik Marko, Grobelnik Adrian Mladenić ..... 15
Identifying Social Self in Text: A Machine Learning Study / Caporusso Jaya, Purver Matthew, Pollak Senja ..... 19
WinWin Meets – Investigating the Future of Online Meetings / Žust Martin, Grobelnik Marko, Guček Alenka, Grobelnik Adrian Mladenić ..... 25
Predicting Ski Jumps Using State-Space Model / Hegler Živa, Camlek Neca, Jelenčič Jakob, Grobelnik Marko, Mladenić Dunja ..... 29
Predicting milling overload based on sensor data: a graph-based approach / Krumpak Roy, Rožanec Jože M., Mladenić Dunja, Guo Zhenyu, Song Tao, Roman Dumitru, Novalija Inna, Ma Xiang ..... 33
Short and Long Term Bike Rental Forecasting / Kocjančič Oskar, Žnidaršič Martin ..... 37
Predicting Traffic Intensity on Motorway Sections / Kladnik Matic, Mladenić Dunja ..... 41
Empowering Youth for Smart Cities with AI Solutions to Community and Urban Challenges in the Context of SDG 11 / Zaouini Mustafa, Costa João Pita, Rahmani Yousef, Kassis Rayan, Stopar Luka, Souss Sohaib, Lamgari Asmai, Mochariq Ouidad ..... 45
Automated First-Reply Generation for IT Support Tickets Using Retrieval-Augmented Generation and Multi-Modal Response Synthesis / Jeršek Domen, Kenda Klemen, Frattini Matteo, Klančič Rok ..... 49
A Machine-Learning Approach to Predicting the Pronunciation of Pre-Consonant l in Standard Slovene / Čibej Jaka ..... 53
Sequencing News Articles with Large Language Models within Enterprise Risk Management Context / Debeljak Žiga, Mladenić Dunja, Kenda Klemen ..... 57
Graph-Based Feature Engineering for DeFi Security Incident Severity Prediction / Pavlova Daria, Novalija Inna, Mladenić Dunja ..... 61
Evolving Neural Agents in Simulated Ecosystems / Ćetković Marija, Tošić Aleksandar, Vake Domen ..... 65
Designing AI Agents for Social Media / Sittar Abdul, Smiljanic Mateja, Guček Alenka ..... 69
Explaining Temporal Data in Manufacturing using LLMs and Markov Chains / Šturm Jan, Škrjanc Maja, Topal Oleksandra, Novalija Inna, Mladenić Dunja, Grobelnik Marko ..... 73
Active Learning for Power Grid Security Assessment: Reducing Simulation Cost with Informative Sampling / Leskovec Gašper, Mylonas Costas, Kenda Klemen ..... 77
Supporting Material Reuse in Drone Production / Cek Rok, Topal Oleksandra, Leonardi Linda, Forcolin Margherita, Kenda Klemen ..... 82
Temporal Dynamics and Causal Feature Integration for Predictive Maintenance in Manufacturing Systems: A Causality-Informed Framework / Hosseini Seyed Iman, Kenda Klemen, Mladenić Dunja ..... 86
Using Interactive Data Visualization for DeFi Market Analysis / Pavlova Daria ..... 90
A Hybrid Lexicon-Machine Learning Approach to Macedonian Sentiment Analysis / Kochovska Sofija, Kavšek Branko, Vičič Jernej ..... 94
Building an AI-Ready Data Infrastructure Towards a SDG-focused Observatory for the Brazilian Amazon / Costa João Pita, Polzer Mirozlav, Barrionuevo Leonardo, Veiga João Cândia ..... 98
Towards a format for describing networks, NetsJSON / Batagelj Vladimir, Pisanski Tomaž, Savnik Iztok, Slavec Ana, Bašić Nino ..... 102
Automating Numba Optimization with Large Language Models: A Case Study on Mutual Information / Kozamernik Lučka, Jakomin Martin, Škrlj Blaž, Urbančič Jasna ..... 106
Topological Exploration of Embedded GitHub Repository Data Using Mapper / Hrib Ivo, Zajec Patrik ..... 110
CO2 Monitoring for Energy-Efficient Workloads in Kubernetes: A Data Provider for CO2-Aware Migration / Hrib Ivo, Topal Oleksandra, Šturm Jan, Škrjanc Maja ..... 114
Beyond Surveys: Adolescent Profiling via Ecological Momentary Assessment and Mobile Sensing / Dobša Jasminka, Korenjak-Černe Simona, Novak Miranda, Pandur Maja Buhin, Šutić Lucija ..... 118
Brazil's First AI Regulatory Sandbox: Towards Responsible Innovation / Oliveira Cristina Godoy, Veiga João Cândia, Sancin Vasilka, Costa João Pita, Silva Rafael Meira, Dine Masa Kovic, Anjos Lucas Costa dos, Marcilio Thiago Gomes, Silva Anthony Novaes ..... 122
Indeks avtorjev / Author index ..... 127

Odkrivanje znanja in podatkovna skladišča – SiKDD / Data Mining and Data Warehouses – SiKDD
Urednika / Editors: Dunja Mladenić, Marko Grobelnik

PREDGOVOR

Tehnologije, ki se ukvarjajo s podatki, so močno napredovale. Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami in velikimi količinami podatkov, prišlo je do standardizacije procesov in povpraševalnih jezikov. Ko shranjevanje podatkov ni bil več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskemu procesiranju, ampak tudi analitskim vpogledom v podatke. Pri avtomatski analizi podatkov sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (knowledge discovery and data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj, zajetih v podatkih. Slovenska KDD konferenca SiKDD pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve.
Dunja Mladenić in Marko Grobelnik

FOREWORD

Data-driven technologies have progressed significantly. The first phase focused mainly on storing and efficiently accessing the data; it resulted in an industry of tools for managing large databases, related standards, supporting query languages, etc. Once data storage was no longer a primary problem, development progressed towards analytical functionality for extracting added value from the data: databases started supporting not only transactions but also analytical processing of the data. In automatic data analysis, the system itself tells the user what might be interesting: this is brought about by knowledge discovery and data mining techniques, which try to obtain new knowledge from existing data and thus provide the user with a new understanding of the events covered in the data. The Slovenian KDD conference SiKDD covers topics dealing with data analysis and discovering knowledge in data: approaches, tools, problems, and solutions.
Dunja Mladenić and Marko Grobelnik

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Janez Brank, Jožef Stefan Institute, Ljubljana
Jasminka Dobša, Faculty of Organization and Informatics, University of Zagreb
Alenka Guček, Jožef Stefan Institute, Ljubljana
Branko Kavšek, University of Primorska, Koper
Klemen Kenda, Qlector, Ljubljana
Bojana Mikelenić, Faculty of Humanities and Social Sciences, University of Zagreb
Elham Motamedi Mohammadabadi, Jožef Stefan Institute, Ljubljana
Irena Nančovska Šerbec, Faculty of Education, University of Ljubljana
Erik Novak, Jožef Stefan Institute, Ljubljana
Inna Novalija, Jožef Stefan Institute, Ljubljana
Joao Pita Costa, Quintelligence, Ljubljana
Jože Rožanec, Jožef Stefan Institute, Ljubljana
Abdul Sittar, Jožef Stefan Institute, Ljubljana
Luka Stopar, SolvesAll, Ljubljana
Blaž Škrlj, Teads, Ljubljana
Jan Šturm, Jožef Stefan Institute, Ljubljana
Oleksandra Topal, Jožef Stefan Institute, Ljubljana

Semantic Prompting for Large Language Models in Biomedical Named Entity Recognition

Erik Calcina, Erik Novak, Dunja Mladenić
Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia

Abstract

Extracting structured medical information from unstructured clinical text remains a challenge for biomedical research and decision support. Recent advances in large language models (LLMs) suggest that prompt-based methods could provide a promising alternative to traditional supervised approaches for Named Entity Recognition (NER) in the biomedical domain. This study investigates whether adding semantic descriptions of entity labels can improve NER performance on clinical texts. Using a dataset of annotated case reports, we evaluate model performance in zero-shot, few-shot, and fine-tuned settings. Results show that semantic prompts enhance accuracy in low-supervision scenarios, while offering limited benefit once models are fine-tuned.

Keywords

Named entity recognition, large language models, semantic prompting, prompt engineering, medical domain, biomedicine

1 Introduction

Biomedical texts present a critical challenge for automated analysis. Clinical case reports, patient records, and related narratives are written in free text rather than in structured formats. While they contain essential medical knowledge, their unstructured nature makes it necessary to extract and organize information for systematic use in research and clinical decision support. Doing this manually is costly, time-consuming, and challenging to scale. Therefore, an automated approach to extract relevant information is required.

Named entity recognition (NER) models enable the identification and classification of clinically relevant entities, such as biological structures, diagnostic procedures, or symptoms. Recent advances in large language models (LLMs) show strong generalizing abilities, identifying relevant entities in both zero-shot and few-shot settings. However, in the biomedical domain, performance can be hindered by specialized terminology and subtle entity distinctions. To address this, we propose enriching prompts with semantic descriptions of entity labels, providing models with explicit context to improve their understanding of the task.

This study investigates the impact of semantically enhanced prompting in biomedical named entity recognition using large language models. We evaluate the effect of enriching entity labels with semantic descriptions on model performance across zero-shot, few-shot, and fine-tuned scenarios, using the MACCROBAT2020 dataset [3]. The contributions of this paper are threefold. First, we introduce the use of semantically enhanced prompts for biomedical NER by enriching entity labels with descriptions. Second, we provide a systematic evaluation of semantic prompting across zero-shot, few-shot, and fine-tuned scenarios, assessing its effectiveness under different levels of supervision. Third, we apply a statistical validation method, McNemar's test, to rigorously assess the reliability of observed performance differences between baseline and semantically enhanced prompts.

The remainder of the paper is structured as follows: Section 2 contains the overview of the related work. Next, we present the methodology in Section 3 and describe the experiment setting in Section 4. The experiment results are found in Section 5, followed by a discussion in Section 6. Finally, we conclude the paper and provide ideas for future work in Section 7.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.3

2 Related Work

This section focuses on the related work on named entity recognition in biomedicine, as well as the use of semantic descriptions in prompting.

2.1 Prompting with semantic context

PromptNER introduced the idea of augmenting few-shot prompts with entity definitions, leading to substantial gains in F1 score on benchmarks like CoNLL, GENIA, and FewNERD, improving performance by 4–9 points compared to standard prompting [2]. Extending this idea, PromptNER unifies locating and typing into a single enriched prompt, enabling phrase extraction and entity classification simultaneously [7]. Similarly, a biomedical NER study demonstrated that "on-the-fly" inclusion of concept definitions enhances performance (+15% F1) in low-data settings [5].

2.2 Iterative and zero-shot semantic prompting

Recent work in zero-shot NER explores iterative prompt refinement to align model outputs with precise entity definitions. EvoPrompt uses an evolving definition-based framework to better distinguish between similar entity types, yielding improvements across benchmarks [9]. In a broader context, some studies found that while directly injecting semantic parses into LLM inputs can degrade performance, carefully designed semantic "hints" embedded in prompts can reliably boost outcomes [1].

2.3 Domain-specific prompt optimization

FsPONER optimizes few-shot prompts for industrial NER tasks by using semantic entity-enhanced meta prompts and task-specific exemplar selection, yielding F1 improvements of 5 to 13 points in domain benchmarks [8]. In the biomedical domain, MPE3 integrates ontology-derived label semantics into prompts, improving performance in few-shot NER scenarios [10].

Prior research has shown that enriching prompts with semantic context and label definitions can significantly boost LLM performance in both few-shot and zero-shot NER. Our work provides a systematic evaluation in the biomedical domain. By examining multiple supervision settings, benchmarking several model families, and validating differences through McNemar's test, we offer a comprehensive assessment of when semantically enriched prompts provide benefits.

3 Methodology

This study evaluates the impact of incorporating semantic information into prompts on the performance of LLMs in biomedical NER tasks. Three distinct approaches were employed: zero-shot prompting, few-shot prompting, and fine-tuning.

Zero-shot prompting. In the zero-shot setting, models were prompted to perform NER without any prior exposure to labeled examples. Two types of prompts were utilized: the baseline prompt, a standard instruction to identify and classify entities without additional context, and the semantically enhanced prompt, which includes detailed descriptions for each entity label, offering explicit semantic context to guide the model's understanding and classification.

Few-shot prompting. The few-shot approach involved providing the models with a limited number of annotated examples (k-shots) before performing NER on new texts. Similar to the zero-shot setting, both baseline and semantically enhanced prompts were employed to assess the influence of semantic information.

Fine-tuning. Fine-tuning was conducted to adapt the pre-trained LLMs to the specific biomedical NER task. Two fine-tuning strategies were explored: standard fine-tuning, where models are fine-tuned using the original dataset annotations without additional semantic information, and semantically enhanced fine-tuning, which fine-tunes models on data where annotations were supplemented with semantic descriptions of each entity label.

4 Experiment Setting

This section describes the experiment setting, which includes the dataset and prompt preparation, the fine-tuning procedure used, the evaluation metrics, and the statistical significance test description.

4.1 Dataset

The experiments were conducted using the MACCROBAT2020 dataset [3], which comprises 200 clinical case reports sourced from PubMed Central. In total, it contains 4,542 sentences, with an average of 22.7 sentences per document, and includes manual annotations of biomedical entities, events, and relations, provided in brat standoff format (https://brat.nlplab.org/standoff.html). For this study, we focused on the five most frequent entity labels within the dataset: biological structure, diagnostic procedure, lab value, sign symptom, and detailed description, supplemented by the age and sex labels. The inclusion of age and sex was motivated by their prevalence and clarity within clinical narratives, providing a basis for evaluating model performance on both complex and straightforward entity types.

Each document was segmented into individual sentences by splitting on full stops. Subsequently, each sentence, along with its associated entity annotations, was transformed into a JSON format to facilitate processing by the language models.

4.2 Semantically enhanced prompts

To enhance the semantic understanding of entity labels, detailed descriptions were crafted for each. These descriptions were derived by combining information from the MACCROBAT2020 dataset documentation and definitions from the Oxford English Dictionary [6]. The integration of these sources was performed manually, ensuring that the descriptions were both accurate and contextually relevant.

Prompts were structured as plain-text instructions, guiding the model to identify and classify entities within the provided sentences. For the semantically enhanced prompts, the detailed entity descriptions were included to provide additional context. Models were instructed to output their responses in a JSON format, explicitly focusing on the labels component. Below we present an example of the entity description, specifically for the label age.

Baseline prompt: The age of the patient.
Semantically enhanced prompt: The duration of time a patient has lived, expressed numerically (e.g., '65-year-old', '20 years old') or categorically (e.g., 'newborn', 'teenage'), representing their age at the time of presentation.

This added context is intended to improve the model's ability to distinguish and extract nuanced biomedical entities more accurately.

4.3 Fine-tuning procedure

Fine-tuning is carried out using parameter-efficient techniques, where only lightweight adapter modules are trained instead of modifying the full model. This strategy reduces memory usage, mitigates catastrophic forgetting, and accelerates training.

To further improve efficiency, models are quantized to 4-bit precision. Fine-tuning is supervised and focuses on the generated outputs; all non-target tokens (e.g., system prompts, input context) are masked during loss computation. This ensures that training adapts the model to the expected JSON label output format rather than to the input content or prompt structure.

4.4 Evaluation metrics

To evaluate entity recognition performance, we use two F1-based metrics. The Exact F1 score measures strict matches, requiring predicted entities to align perfectly with the reference text and label. The Relaxed F1 score allows partial matches, counting predictions as correct if they include the true entity as a substring with the correct label.

4.5 McNemar statistical significance test

While Exact and Relaxed F1 scores quantify the magnitude of performance differences, they do not establish whether these differences are statistically reliable. The McNemar test [4] complements the Exact F1 metric by verifying whether observed improvements can be attributed to the semantically enhanced prompts rather than random variation. Following standard NER practice, we treat Exact F1 as the primary endpoint and therefore apply McNemar's test only to exact match predictions.

Let b denote the number of cases correctly predicted by the semantically enhanced model but missed by the baseline, and c the number of cases correctly predicted by the baseline but missed by the semantically enhanced model. Only discordant pairs (b, c) contribute to the test; agreements do not affect the statistic. Using the continuity-corrected version of the test, the statistic is computed as

χ² = (|b − c| − 1)² / (b + c),

which follows a chi-squared distribution with one degree of freedom. The corresponding p-value allows us to test the null hypothesis H0: the two models have equal marginal probabilities (i.e., performance differences are due to chance). Conventionally, p < .001 is considered statistically significant.

5 Results

This section presents model performance under three experimental conditions: zero-shot, few-shot, and fine-tuned prompting. For each condition, we compare the impact of semantically enhanced prompts against standard prompts using Exact and Relaxed F1 scores on a subset of clinically relevant entity types.

5.1 Zero-shot prompting

Table 1 reports the Exact and Relaxed F1 scores for models evaluated in the zero-shot setting using semantically enhanced prompts. Without semantic descriptions, most models struggled to generate outputs in the required JSON format, and valid scores could not be computed. Even with semantically enhanced prompts, Meta-Llama-3.1-8B consistently failed to produce structured responses.

5.2 Few-shot prompting

Table 2 summarizes the Exact and Relaxed F1 scores for few-shot prompting. The addition of semantic information consistently improved model performance across most models. Notably, txgemma-9b-chat achieved the highest Exact F1 score (0.3288) and Relaxed F1 score (0.4998) with semantic prompting, compared to 0.2732 and 0.4469 without.

Both Llama-3.1-8B-Instruct and Llama-3.2-3B-Instruct showed improvements in both Exact and Relaxed F1 scores when provided with semantically enhanced prompts. For instance, Llama-3.1-8B-Instruct improved from 0.2509 to 0.3005 (Exact) and from 0.3526 to 0.3948 (Relaxed), while Llama-3.2-3B-Instruct increased from 0.2300 to 0.2439 (Exact) and from 0.3769 to 0.3948 (Relaxed). These gains highlight the benefit of enriching prompt instructions when training data is limited. However, not all models responded positively. For example, Meta-Llama-3.1-8B experienced a drop in Exact F1 from 0.2698 to 0.2210 and in Relaxed F1 from 0.3537 to 0.2799, indicating that semantically enhanced prompts do not universally improve performance and may be less effective for some models.

To assess the reliability of these differences, we conducted McNemar tests on paired Exact predictions. The tests revealed that performance differences between baseline and semantically enhanced prompts were statistically significant for all models except Llama-3.2-3B-Instruct. It is important to note, however, that significance here indicates that the two variants produce systematically different predictions, but does not itself imply improvement. For instance, while the difference for Meta-Llama-3.1-8B was highly significant, the semantically enhanced model in fact performed worse in terms of F1 scores.

5.3 Fine-tuned performance

In the fine-tuning scenario, results were more nuanced.
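The preprocessing described in Section 4.1, splitting case reports on full stops and pairing each sentence with its entity annotations in a JSON record, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the (start, end, label) span tuples, and the record fields are hypothetical stand-ins for the brat standoff annotations.

```python
import json

def prepare_sentences(document_text, annotations):
    """Split a case report on full stops and pair each sentence with the
    entity annotations whose character spans fall inside it.

    annotations: list of (start, end, label) spans, an illustrative
    stand-in for brat standoff entries."""
    records = []
    offset = 0
    for raw in document_text.split("."):
        start = document_text.index(raw, offset)
        end = start + len(raw)
        offset = end
        sentence = raw.strip()
        if not sentence:
            continue
        # keep only annotations fully contained in this sentence's span
        labels = [
            {"text": document_text[s:e], "label": lab}
            for (s, e, lab) in annotations
            if s >= start and e <= end
        ]
        records.append({"sentence": sentence, "labels": labels})
    # serialize each record so the language model receives plain JSON
    return [json.dumps(r, ensure_ascii=False) for r in records]
```

Each returned string is one training/evaluation example pairing a sentence with its gold entities.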
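The baseline and semantically enhanced prompt variants described in Section 4.2 can be sketched like this. The instruction wording, the LABEL_DESCRIPTIONS mapping, and the build_prompt helper are all hypothetical; only the age description is quoted from the paper, and the sex description is invented for illustration.

```python
# Illustrative label descriptions; only "age" is taken from the paper.
LABEL_DESCRIPTIONS = {
    "age": ("The duration of time a patient has lived, expressed "
            "numerically (e.g., '65-year-old', '20 years old') or "
            "categorically (e.g., 'newborn', 'teenage'), representing "
            "their age at the time of presentation."),
    "sex": "The biological sex of the patient.",  # invented example
}

def build_prompt(sentence, labels, semantic=False):
    """Build a plain-text NER instruction; with semantic=True, per-label
    descriptions are appended (the 'semantically enhanced' variant)."""
    lines = [
        "Identify and classify the following entity types in the "
        "sentence and answer with a JSON object of the form "
        '{"labels": [{"text": ..., "label": ...}]}.',
        "Entity types: " + ", ".join(labels),
    ]
    if semantic:
        # explicit semantic context for each entity label
        lines += [f"- {l}: {LABEL_DESCRIPTIONS[l]}" for l in labels]
    lines.append(f"Sentence: {sentence}")
    return "\n".join(lines)
```

The same helper serves both conditions, so the only difference between the two prompts is the presence of the description block.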
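The Exact and Relaxed F1 metrics of Section 4.4 can be made concrete with a small sketch, treating predictions and gold annotations as (text, label) pairs; the greedy one-to-one matching shown here is an assumption beyond what the paper specifies.

```python
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def score(predicted, gold, relaxed=False):
    """Entity-level F1. Exact: the predicted (text, label) pair must
    equal a gold pair. Relaxed: a prediction counts if it contains the
    gold entity text as a substring and carries the same label."""
    matched = set()
    tp = 0
    for p_text, p_label in predicted:
        hit = None
        for i, (g_text, g_label) in enumerate(gold):
            if i in matched or p_label != g_label:
                continue
            ok = (g_text in p_text) if relaxed else (g_text == p_text)
            if ok:
                hit = i
                break
        if hit is not None:
            matched.add(hit)
            tp += 1
    fp = len(predicted) - tp   # predictions with no gold match
    fn = len(gold) - tp        # gold entities never matched
    return f1(tp, fp, fn)
```

A prediction like "high fever" misses under the exact criterion but matches the gold entity "fever" under the relaxed one.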
As shown Among the evaluated models, Llama-3.1-8B-Instruct achiev- in Tables 2, most models performed strongly even without seman- ed the highest Exact F1 score, while txgemma-9b-chat attained tic enhancements. For instance, Meta-Llama-3.1-8B attained the the best Relaxed F1 score. Llama-3.2-3B-Instruct and DeepSeek- highest Exact F1 score (0.7099) with semantic input, only slightly Qwen-7B also demonstrated non-trivial performance in both met- outperforming its baseline (0.7076), and this difference was not rics. These results suggest that semantically enhanced prompts statistically significant (𝑝 ≈ 0.64). can effectively compensate for the absence of training examples Some models, such as Llama-3.1-8B-Instruct and Llama- in zero-shot scenarios by providing clearer task guidance and 3.2-3B-Instruct, even showed small performance drops when improving structured prediction output. semantic descriptions were included, with McNemar tests con- firming that these differences were not significant (𝑝 ≈ 0.75 and Table 1: Exact and Relaxed F1 scores in the zero-shot set- 𝑝 . ≈ 088). This suggests that in settings where the model is al- ting with semantically enhanced prompts. Bolded values ready exposed to sufficient task specific supervision, additional indicate the highest score in each column. Results without prompt-level context may offer limited benefit or even introduce valid JSON output are marked with redundancy. / . In contrast, TxGemma-9B-Chat exhibited the most notable improvement, with Exact and Relaxed F1 scores increasing from Model Exact F1 Semantics Relaxed F1 Semantics 0.6837 to 0.7092 and from 0.7483 to 0.7686, respectively; the Llama-3.1-8B-Instruct2 McNemar test confirmed this difference as statistically signif- 0.2310 0.3708 − 5 3 icant ( 𝑝 ≈ 9 . 7 × 10 ). 
By comparison, DeepSeek-Qwen-7B also Meta-Llama-3.1-8B / / − 3 4 showed a significant difference ( 𝑝 ≈ 6 × 10 ), but in this case Llama-3.2-3B-Instruct 0.1620 0.3254 the semantically enhanced model performed worse (Exact F1: 5 DeepSeek-Qwen-7B 0.1592 0.3217 0.7013 6 → 0.6879). txgemma-9b-chat 0.2181 0.4245 5.4 Overall observations 2https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct The largest performance improvements from semantically en- 3https://huggingface.co/meta-llama/Llama-3.1-8B hanced prompts appeared in zero-shot and few-shot settings, 4https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct where gains in F1 scores were often statistically significant. In 5https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B 6https://huggingface.co/google/txgemma-9b-chat contrast, fine-tuned models showed smaller and mixed effects: 9 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia Calcina et al. Table 2: Exact (left) and Relaxed (right) F1 scores for selected labels in few-shot and fine-tuned settings, with and without semantically enhanced prompts. Bolded values indicate the highest score in each column. We use symbols ◦ and • to denote whether the differences between using the baseline or semantically enhanced prompts are statistically significant (•) or not (◦) according to the McNemar test at a significance level of 𝑝 = 0.01. Exact F1 Relaxed F1 Model Few-Shot Fine-Tuned Few-Shot Fine-Tuned / Semantic / Semantic / Semantic / Semantic Llama-3.1-8B-Instruct 0.2509 0.3005 • 0.7053 0.7004 ◦ 0.3526 0.3948 0.7660 0.7645 Meta-Llama-3.1-8B 0.2698 0.2210 • 0.7076 0.7099 ◦ 0.3537 0.2799 0.7670 0.7765 Llama-3.2-3B-Instruct 0.2300 0.2439 ◦ 0.6881 0.6867 ◦ 0.3769 0.3948 0.7629 0.7622 DeepSeek-Qwen-7B 0.1423 0.2270 • 0.7013 0.6879 • 0.2465 0.3891 0.7584 0.7521 txgemma-9b-chat 0.2732 0.3288 • 0.6837 0.7092 • 0.4469 0.4998 0.7483 0.7686 for most, differences were not significant, though TxGemma-9B- notable gains in both Exact and Relaxed F1 scores. 
In contrast, Chat benefited reliably while DeepSeek-Qwen-7B showed a fine-tuned models already exposed to task-specific data showed significant decrease. These results indicate that semantic prompt- only marginal improvement. ing is most effective in low-resource conditions, while its impact Future work could explore adaptive semantic prompting strate- under full supervision is limited and model-dependent. gies, such as ontology-driven label enrichment, and further in- vestigate the trade-offs between prompt length and inference 6 Discussion efficiency. Additionally, this method could be tested on larger This section discusses the experiment findings and highlights the datasets and across different models to assess its generalizability. advantages and disadvantages of the different approaches. In summary, semantically enhanced prompts offer a straight- forward yet effective way to boost clinical NER performance 6.1 in low-data regimes, but their impact diminishes as models are Model pretraining and domain adaptation exposed to more supervised training. TxGemma-9B-Chat, based on the Gemma 2 architecture and fur- ther fine-tuned on therapeutic development data, outperformed Acknowledgements general-purpose models in a few-shot scenario. This suggests that domain-specific pretraining can significantly improve per- This work was supported by the Slovenian Research Agency. formance when supervision is limited. However, in the full fine- Funded by the European Union. UK participants in Horizon Eu- tuning setting, its advantage diminished. In fact, general models rope Project PREPARE are supported by UKRI grant number like Meta-Llama-3.1-8B achieved comparable but slightly better 10086219 (Trilateral Research). 
Views and opinions expressed are results, indicating that once sufficient task-specific supervision however those of the author(s) only and do not necessarily reflect is provided, prior domain specialization offers limited additional those of the European Union or European Health and Digital Ex- benefit. ecutive Agency (HADEA) or UKRI. Neither the European Union nor the granting authority nor UKRI can be held responsible for 6.2 them. Grant Agreement 101080288 PREPARE HORIZON-HLTH- Prompt quality matters 2022-TOOL-12-01. The structure and clarity of prompts are critical to model per- formance. Poorly designed prompts often resulted in JSON for- References matting errors or reduced accuracy, particularly in zero-shot and [1] Kaikai An, Shuzheng Si, Yuchi Wang, et al. 2024. Rethinking semantic pars-few-shot settings. While adding semantic context improves task ing for large language models. arXiv preprint arXiv:2409.14469. understanding by making objectives and entity definitions more [2] Dhananjay Ashok and Zachary C. Lipton. 2023. Promptner: prompting for explicit, excessive length or ambiguity can offset these gains. [3] named entity recognition. arXiv preprint arXiv:2305.15444. J. Harry Caufield, Yichao Zhou, Yunsheng Bai, David A. Liem, Anders O. Garlid, Kai-Wei Chang, Yizhou Sun, Peipei Ping, and Wei Wang. 2019. 6.3 A comprehensive typing system for information extraction from clinical Prompt length vs. model response narratives. medRxiv. Preprint. doi: 10.1101/19009118. Semantic enrichment inevitably increases prompt length, which [4] Quinn McNemar. 1947. Note on the sampling error of the difference between can slow response time and raise computational overhead. It correlated proportions or percentages. Psychometrika, 12, 2, (June 1947), 153–157. doi: 10.1007/bf02295996. may also overwhelm smaller models when excessive detail is [5] Monica Munnangi, Sergey Feldman, Byron C. Wallace, et al. 2024. On- included. 
In practical applications, this must be weighed against the-fly definition augmentation of llms for biomedical ner. arXiv preprint the potential gains in entity extraction accuracy. arXiv:2404.00152. [6] 2025. Oxford english dictionary. https://www.oed.com/. Accessed: 2025-06- 17. (2025). 7 [7] Yongliang Shen, Zeqi Tan, Shuhui Wu, et al. 2023. Promptner: prompt Conclusion locating and typing for named entity recognition. In ACL (Long Papers). This study investigated the impact of a semantically enhanced [8] Yongjian Tang, Rakebul Hasan, and Thomas Runkler. 2024. Fsponer: few-shot prompt design on LLM-based NER in the clinical domain. Our prompt optimization for named entity recognition.arXiv preprint arXiv:2407.08035. [9] Zeliang Tong, Zhuojun Ding, and Wei Wei. 2025. Evoprompt: evolving experiments on the MACCROBAT2020 dataset demonstrated prompts for enhanced zero-shot named entity recognition. In COLING. that adding semantic label descriptions significantly improves [10] Yuwei Xia, Zhao Tong, Liang Wang, et al. 2023. Learning meta-prompt with model performance in zero-shot and few-shot scenarios, with entity-enhanced semantics for few-shot ner. SSRN. 
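The continuity-corrected McNemar statistic described above is straightforward to compute directly from the discordant counts b and c. The sketch below is a minimal standard-library illustration (the function name `mcnemar_test` is ours, not from the paper's code); for one degree of freedom, the chi-squared survival function reduces to erfc(√(x/2)).

```python
import math

def mcnemar_test(b: int, c: int) -> tuple[float, float]:
    """Continuity-corrected McNemar test on discordant counts.

    b: cases the semantically enhanced model got right and the baseline missed
    c: cases the baseline got right and the enhanced model missed
    Returns (chi-squared statistic, two-sided p-value).
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-squared with 1 dof: P(X > x) = erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

chi2, p = mcnemar_test(b=40, c=10)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")  # strongly discordant counts: p well below 0.01
```

Note that only the discordant pairs enter the statistic, so passages on which both prompt variants agree can be dropped before tabulation without changing the result.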
LLM Based Approach to Extracting Smells in Slovenian Corpora

Janez Brank, Jožef Stefan Institute, Ljubljana, Slovenia, janez.brank@ijs.si
Inna Novalija, Jožef Stefan Institute, Ljubljana, Slovenia, inna.koval@ijs.si
Dunja Mladenić, Jožef Stefan Institute, Ljubljana, Slovenia, dunja.mladenic@ijs.si
Marko Grobelnik, Jožef Stefan Institute, Ljubljana, Slovenia, marko.grobelnik@ijs.si

Abstract

This paper presents a comparative study of automatic smell detection in Slovenian cultural heritage texts using both keyword-based search and large language model (LLM) inference. We process a portion of the dLib.si corpus from the late 19th and early 20th centuries, analyzing over 1.6 million text segments for olfactory references. The keyword method leverages an expert-curated list of smell terms, while the LLM method applies semantic inference via prompt-engineered queries. We compare the methods in terms of detection density, temporal trends, and agreement overlap. Additionally, we visualize the semantic landscape of extracted smell terms using t-SNE and unsupervised clustering with auto-generated labels. Our findings reveal limited overlap between methods, a shared rise in smell mentions over time, and distinct semantic clusters ranging from industrial to culinary and bodily smells. This study highlights the value of combining symbolic and neural approaches for nuanced sensory mining in digital heritage corpora.

Keywords

LLM, Artificial Intelligence, Cultural Heritage, Text Mining

1 Introduction

Olfactory perception is an essential yet underexplored dimension in the analysis of historical texts, particularly within the cultural heritage domain. Smells, though intangible, play a critical role in shaping memory, atmosphere, and cultural meaning. However, their representation in written sources is often subtle, indirect, or metaphorical. This challenge becomes more pronounced in historical corpora such as 19th- and early 20th-century Slovenian publications, where evolving linguistic practices and cultural norms affect how sensory information is encoded.

This paper explores automatic smell detection in Slovenian cultural heritage texts using two complementary strategies: (1) a keyword-based approach derived from an expert-curated list of smell-related expressions and their morphological variants, and (2) large language model (LLM)-based semantic inference using prompt-engineered queries via the Together.ai platform. We process a subset of the dLib.si digital library corpus of Slovenian texts, divided into temporal buckets, and evaluate the performance, overlap, and divergence between the two methods.

To facilitate large-scale analysis, we produce and analyze over 1.6 million document-query pairs, extracting smell mentions, classifying them by agreement type, and visualizing their distributions both temporally and semantically. Our goals are twofold: (i) to quantify the representational density of olfactory references in the corpus, and (ii) to better understand how computational methods can surface subtle cultural patterns that evade traditional keyword search alone.

This work contributes toward a richer modeling of sensory information in digital heritage collections and highlights the value of combining symbolic and neural methods for text mining in the cultural heritage domain.

2 Related Work

Recent years have seen increased interest in the computational modeling of olfactory expressions in historical and cultural texts. A prominent initiative in this space is the Odeuropa project [7], which focused on identifying, curating, and semantically linking smell-related content in European heritage corpora, and which has produced the European Olfactory Knowledge Graph and tools like the Smell Explorer to trace historical olfactory knowledge across 400 years of European sources [7, 5].
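As a minimal illustration of the passage-level agreement analysis sketched in the abstract above (each passage is classified by which method detected a smell reference, and the two result sets are compared by their overlap), the bookkeeping might look as follows; all function names and example terms here are hypothetical, not taken from the system itself.

```python
def agreement_category(llm_terms: set[str], kw_terms: set[str]) -> str:
    """Classify one passage by which detection method fired."""
    if llm_terms and kw_terms:
        return "Both"
    if llm_terms:
        return "LLM Only"
    if kw_terms:
        return "Keyword Only"
    return "None"

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical detections for a single passage (Slovene smell-related terms)
llm = {"vonj", "dišava", "smrad"}
kw = {"vonj", "smrad", "duh"}
print(agreement_category(llm, kw))   # Both
print(round(jaccard(llm, kw), 2))    # 2 shared terms out of 4 distinct: 0.5
```

Aggregating these per-passage categories over the corpus yields the agreement counts reported later, while the Jaccard score summarizes how much the two term sets coincide.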
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia · Brank et al. · © Copyright held by the owner/author(s) · https://doi.org/10.70314/is.2025.sikdd.5

Research on sensory perception in NLP has traditionally focused on the visual and auditory modalities, while olfaction remains relatively underexplored. Annotation frameworks such as the Olfactory Event Frame and guidelines for labeling sources, qualities, and experiences [6] provide structured resources for information extraction from historical and literary corpora. Traditional approaches to olfactory semantics rely on fixed lexicons such as the Dravnieks Atlas [1] and the DREAM challenge descriptors [3]. For morphologically complex and low-resource languages such as Slovene, monolingual models like SloBERTa [10] and sequence-to-sequence models like SloT5 [9] demonstrate that tailoring architectures to linguistic structure improves performance over multilingual baselines. A wide range of Slovene corpora underpins these modeling efforts. Gigafida 2.0, a reference corpus of 1.1 billion tokens covering contemporary written Slovene, provides a large-scale foundation for model pretraining and evaluation [4]. For user-generated content, the JANES corpus supplies richly annotated Slovene social media text, including normalization and NER [2].

Unlike prior studies that primarily focus on annotation frameworks, fixed olfactory lexicons, or large-scale multilingual heritage initiatives such as Odeuropa, our work provides the first comparative evaluation of keyword-based and LLM-based smell detection specifically for Slovenian cultural heritage corpora, highlighting the interplay between symbolic coverage and neural semantic inference.

3 Corpora and Preprocessing

For the experiments presented in this paper, we used texts from the Slovenian Digital Library (dLib.si). Initially we downloaded, from the Library's website, all documents from the period 1870–1919 for which OCRed text was available and whose language was marked as Slovene in the metadata there. In terms of content, this covers nearly all books, newspapers, magazines etc. published in Slovene during that period. From this corpus we then randomly selected 7% of the documents from each year for further processing; thus the selected subset maintains the same distribution over time, genre, etc. as the full corpus. This resulted in a dataset of approx. 366 thousand documents with a total of 105 million words.

4 Methodology

This section outlines the analytical pipeline used to detect, compare, and interpret smell-related expressions in Slovenian cultural heritage texts. Our approach combines large language model inference, keyword-based retrieval, temporal and density statistics, and unsupervised semantic clustering.

4.1 Comparative Evaluation of Detection Methods

In order to identify olfactory expressions, we employed two complementary strategies:

• LLM-based Extraction: Each document was split into passages and processed using an LLM (the Llama-3.3-70B-Instruct-Turbo-Free model, accessed via Together.ai). The model returned a list of potential smell-related words or phrases, structured in JSON format. In cases of formatting failure, raw strings or exception messages were recorded.
• Keyword-Based Search: A manually curated index of smell-related expressions, including morphologically inflected forms, was used for direct string matching within each passage. This index has been kindly provided by Mojca Ramšak and is based on her work on the anthropology of smell [8].

For each passage, we recorded both LLM and keyword results. We classified outcomes into four categories: LLM Only, Keyword Only, Both, or None. Additionally, we computed the Jaccard similarity J between the two result sets:

    J(A, B) = |A ∩ B| / |A ∪ B|,

where A is the set of LLM-based results and B is the set of keyword-based results. This metric enabled quantitative comparison of coverage and intersection across detection methods.

4.2 Temporal Distribution of Smell Mentions

We extracted the year of publication from each document's metadata. For each year, we aggregated:

• Total LLM-detected smell terms
• Total keyword-detected smell terms
• Number of processed queries

These aggregates were used to generate yearly time series, revealing longitudinal patterns in olfactory expression across the corpus. This temporal analysis supports hypotheses about cultural shifts, such as increasing industrial or bodily smell discourse over time.

4.3 Semantic Typology via Clustering of Smell Terms

To explore latent smell categories, we constructed a semantic typology using the following steps:

• Term Extraction: We extracted the 500 most frequent smell-related terms from the combined LLM and keyword results.
• Vectorization: Terms were embedded using TF-IDF vectors over character-level n-grams (char_wb with range 2–4), capturing morphological similarity.
• Dimensionality Reduction: The high-dimensional vectors were projected to two dimensions using t-SNE (perplexity = 30), yielding a visual semantic landscape.
• Clustering: We applied k-means clustering (with k = 8) to the t-SNE coordinates. For each cluster, the top 5 TF-IDF terms were used to generate semantic labels (e.g., "Herbs & Cooking", "Pharmaceutical Smells").
• Visualization: The clusters were visualized with color-coded labels and representative terms. Interactive versions were built using plotly.

This typology enables data-driven classification of smell discourse and provides interpretable categories for cultural and linguistic analysis.

4.4 Document-Level Smell Density Analysis

To assess the distribution of olfactory content across documents, we computed the smell density as the ratio of detected terms to queries per document:

    Density_LLM = (# LLM terms) / (# queries)
    Density_Keyword = (# keyword terms) / (# queries)

This metric enabled identification of smell-rich and smell-sparse texts. Density distributions were visualized using boxplots and descriptive statistics, facilitating selection of representative or outlier texts for deeper qualitative analysis.

5 Evaluation and Results

We evaluated two complementary approaches to detecting olfactory references in historical corpora: a keyword-based method and an LLM-based classifier. The results highlight both convergences and divergences in performance across time, document density, and semantic coverage.

Figure 1 shows yearly frequencies of smell-related mentions from 1870 to 1920. While keyword-based detection consistently yields higher absolute counts than the LLM, both methods exhibit similar growth trajectories.

Figure 1: Yearly trends in smell term mentions. Keyword-based detection consistently returns higher frequencies than the LLM, but both show similar growth patterns.

Agreement analysis between the two methods (Figure 2) reveals substantial divergence. Only about one-third of passages are identified by both approaches. A large portion is captured exclusively by the keyword method, while the LLM contributes a smaller but meaningful number of unique detections. A significant subset of passages registers no olfactory detection at all, probably because most documents do not mention smell-related topics in the first place.

Figure 2: Detection agreement between LLM and keyword methods. Most passages are matched by one method only, with a significant number showing no detection. The overlap ("Both") occurs in fewer than one-third of cases.

Figure 3 illustrates the distribution of smell term density per document. Keyword-based detection generally produces higher densities of references, whereas the LLM outputs are sparser but potentially more semantically filtered. Both distributions exhibit long-tailed outliers, where certain documents contain disproportionately high concentrations of olfactory mentions.

Figure 3: Smell term density per document. While outliers exist for both methods, keyword-based detection generally identifies a higher density of smell references per query.

To further analyze lexical diversity, we applied t-SNE to embed and cluster smell-related terms (Figure 4). The resulting semantic landscape reveals coherent groupings that align with cultural domains, including food, ritual, body, and chemical references. These clusters highlight the variety of olfactory expressions and suggest that both methods capture complementary facets of the semantic space. The LLM appears particularly adept at recognizing context-dependent terms, while the keyword method anchors clusters in explicit lexical cues.

Figure 4: t-SNE semantic landscape of smell terms, clustered by character-level similarity and automatically labeled using top TF-IDF terms per group. The visualization reveals coherent groups such as food, ritual, body, and chemical references.

Overall, the keyword-based approach provides broader coverage and higher frequencies, but at the cost of noise and overcounting. The LLM method, while more conservative, contributes precision and captures context-sensitive olfactory references that keywords may overlook. The combination of both thus provides a richer and more balanced representation of olfactory discourse in historical texts.

6 Discussion

Our analysis reveals several key insights into olfactory representations in Slovenian cultural heritage texts and the methodological implications of combining LLM-based and keyword-based detection.

First, both detection strategies show meaningful trends over time, with a noticeable increase in smell-related references around the turn of the 20th century. This may reflect broader urbanization, industrialization, and shifts in public health discourse, which intensified the cultural significance of air quality, hygiene, and olfactory environments.

Second, although keyword-based detection consistently returned more hits, the LLM-based method surfaced a distinct set of semantically inferred mentions. As the agreement analysis shows, only a minority of mentions (∼24%) were matched by both methods. One possible explanation of this would be if neural inference captures more nuanced or contextually implied smell references, such as metaphorical use ("a whiff of suspicion") or implied odors in narrative scenes.

Third, density analysis suggests that LLMs return sparser but more targeted mentions, while keyword detection produces broader but sometimes noisier coverage. This difference is critical for researchers deciding between high recall and high precision when exploring sensory data in historical texts.

Finally, the t-SNE landscape of smell terms uncovered semantically coherent clusters — e.g., medicinal substances, industrial emissions, festive foods, and bodily decay — and allowed us to generate meaningful auto-labels using top TF-IDF terms. Such visualizations provide a valuable tool for cultural historians to engage with thematic patterns across large-scale textual datasets.

Overall, our findings underscore the value of hybrid approaches to cultural text analysis. By comparing symbolic and neural perspectives, we gain both coverage and subtlety, enabling a deeper reconstruction of sensory worlds encoded in the archives.

7 Conclusion and Future Work

We conducted a dual-method analysis of olfactory references in Slovenian historical texts, revealing how keyword search and LLM-based inference each contribute unique perspectives to sensory data mining. Our results show that while the keyword method offers broad lexical coverage, the LLM can detect more subtle, implied, or metaphorical references often overlooked by surface-level matching.

Furthermore, t-SNE clustering of smell terms revealed rich thematic structures — such as food, medicine, pollution, and ritual — highlighting the semantic complexity of olfactory language.

Together, these results demonstrate the complementary strengths of symbolic and neural approaches for enriching digital humanities research, especially in domains like historical sensory studies where annotation is sparse and vocabulary is diffuse.

Several promising directions remain open for further exploration. First, we plan to expand the dataset to cover all documents in the dLib.si corpus, enabling more robust longitudinal and regional analyses. Second, we aim to improve LLM prompts to better handle nested or narrative contexts, including smells embedded in metaphor, irony, or emotional framing.

Another avenue involves extending the classification of smell mentions into functional categories (e.g., pleasant vs. unpleasant, natural vs. artificial, bodily vs. environmental) using additional LLM-based postprocessing. We also intend to explore multilingual smell detection, comparing Slovene with other Central European languages to study cultural convergence and divergence in olfactory discourse.

Finally, we hope to integrate our smell detection pipeline into public digital heritage platforms, providing curators, historians, and linguists with new tools for sensory exploration of archival materials.

Acknowledgements

This work was supported by the Slovenian Research Agency under the project J7-50233.

References

[1] Andrew Dravnieks. 1992. Atlas of Odor Character Profiles. ASTM International, (Feb. 1992). ISBN: 978-0-8031-0456-3. doi: 10.1520/DS61-EB.
[2] Darja Fišer, Nikola Ljubešić, and Tomaž Erjavec. 2020. The JANES project: language resources and tools for Slovene user generated content. Language Resources and Evaluation, 54, 1, pp. 223–246. Retrieved Aug. 27, 2025 from https://www.jstor.org/stable/48740864.
[3] Andreas Keller et al. 2017. Predicting human olfactory perception from chemical features of odor molecules. Science, 355, (Feb. 2017), eaal2014. doi: 10.1126/science.aal2014.
[4] Simon Krek, Špela Arhar Holdt, Tomaž Erjavec, Jaka Čibej, Andraž Repar, Polona Gantar, Nikola Ljubešić, Iztok Kosem, and Kaja Dobrovoljc. 2020. Gigafida 2.0: the reference corpus of written standard Slovene. In Proceedings of the Twelfth Language Resources and Evaluation Conference. Nicoletta Calzolari et al., editors. European Language Resources Association, Marseille, France, (May 2020), 3340–3345. ISBN: 979-10-95546-34-4. https://aclanthology.org/2020.lrec-1.409/.
[5] P. Lisena, T. Ehrhart, and R. Troncy. European Olfactory Knowledge Graph. Zenodo. doi: 10.5281/zenodo.10709703.
[6] Stefano Menini, Teresa Paccosi, Serra Sinem Tekiroğlu, and Sara Tonelli. 2023. Scent mining: extracting olfactory events, smell sources and qualities. In Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. Stefania Degaetano-Ortlieb, Anna Kazantseva, Nils Reiter, and Stan Szpakowicz, editors. Association for Computational Linguistics, Dubrovnik, Croatia, (May 2023), 135–140. doi: 10.18653/v1/2023.latechclfl-1.15.
[7] ODEUROPA Project Consortium. 2021–2023. ODEUROPA: negotiating olfactory and sensory experiences in cultural heritage practice and research. https://odeuropa.eu/. EU Horizon 2020 research and innovation programme, grant agreement No. 101004469. Royal Netherlands Academy of Arts and Sciences (KNAW) Humanities Cluster et al., (2021–2023).
[8] Mojca Ramšak. 2025. Antropologija vonja. AMEU-ISH, Ljubljana.
[9] Matej Ulčar and Marko Robnik-Šikonja. 2023. Sequence-to-sequence pretraining for a less-resourced Slovenian language. Frontiers in Artificial Intelligence, 6.
[10] Matej Ulčar and Marko Robnik-Šikonja. 2021. SloBERTa: Slovene monolingual large pretrained masked language model. In SiKDD.

BetweenTheLines - Cross Source News Analysis

Georgi Trajkov, Jožef Stefan Institute, Ljubljana, Slovenia, geotrajkov0@gmail.com
Marko Grobelnik, Jožef Stefan Institute, Ljubljana, Slovenia, marko.grobelnik@ijs.si
Adrian Mladenic Grobelnik, Jožef Stefan Institute, Ljubljana, Slovenia, adrian.m.grobelnik@ijs.si

Abstract

Different news outlets covering the same event often emphasize, omit, or frame facts differently, making cross-source comparison essential for understanding media bias and information diversity. Large language models (LLMs) can automate this analysis, but simple single-LLM prompt approaches tend to underperform when processing large amounts of data [1]. Platforms like Ground News [2] and Event Registry [3] provide publisher- and article-level bias scores but cannot track how individual claims and entities are portrayed by articles.
The fundamental challenge is determining whether LLM prompt architecture affects accuracy when classifying claim presence across multiple news sources. We show that a multi-prompt LLM architecture reduces classification errors 7-fold (from 33.0% to 4.67%) compared to single-prompt approaches. Our pipeline first extracts all claims and entities from the articles collectively, then evaluates each article separately for claim presence (confirmed/contradicted/partial/absent) and entity sentiment. This decomposition virtually eliminates false positives: major errors dropped from 28.0% to 0.79% across 797 manually validated claim-publisher pairs from Slovene news. The results demonstrate that task decomposition, not LLM sophistication, drives accuracy in cross-source analysis. This finding enables scalable media monitoring at $0.01 per event, making systematic bias detection accessible to journalists and researchers worldwide.

Figure 1: Analyzed event in the BetweenTheLines mobile web app, showing the claims tab.

1 Introduction

Different news sources (publishers) covering the same event (groups of articles reporting on the same story) often cover facts differently. While existing platforms like Event Registry [3] and Ground News [2] provide valuable bias indicators and sentiment scores, they do not track how specific entities (people, organizations, countries) and claims (factual claims) within articles are portrayed across publishers. Gaining insight into these differences is usually time-consuming for the user.

Thus we present BetweenTheLines (Figure 1), a system that automatically identifies the claims and entities in an event and tracks their portrayal by each individual publisher. For example, when analyzing political coverage, we can see how the same entity is portrayed differently by two publishers, and how one publisher omitted a claim while the other did not.

Our key technical contribution is demonstrating that a multi-prompt LLM architecture outperforms single-stage approaches for this task.

2 Related Work

Cross-source news analysis is an under-discussed area of research which is important for understanding media bias, information diversity, and narrative framing across different outlets. This section reviews existing approaches to cross-source news analysis, event aggregation systems, and LLM-based content analysis pipelines.

2.1 Cross-Source News Analysis Platforms

Ground News represents a prominent platform for cross-source news comparison, classifying publishers along the left-right political spectrum. The platform has gained widespread adoption in educational institutions, with libraries at Harford Community College [4] and West Virginia University [5] integrating it into their media literacy curricula. For each news event, Ground News allows users to compare coverage by publisher in aggregate. While these aggregated summaries can reveal different emphases across the political spectrum, the platform does not provide article-by-article comparisons or track how specific entities and claims are portrayed between articles.

SiKDD 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s).

2.2 Event-Centric News Aggregation

Event Registry [6, 3] pioneered event-centric news aggregation by clustering articles from multiple publishers around identified news events. The platform provides article-level sentiment
scores using VADER sentiment analysis [7] and allows filtering by various parameters, including language, location, and publisher credibility. Each article has a sentiment score, a level of granularity above Ground News. Still, there is no analysis of how specific entities and claims within those articles are portrayed. Our work builds upon Event Registry's foundation by combining its event-based aggregation with more granular entity and claim analyses through LLM processing. Unlike Ground News's publisher-level political bias ratings or Event Registry's article sentiment scores, we provide fine-grained analysis of how specific entities and claims are portrayed differently across publishers.

DOI: https://doi.org/10.70314/is.2025.sikdd.26

3 Application and Analysis Architecture

3.1 Application architecture

BetweenTheLines is a news-analysis web app developed with Claude Code [8]. The backend is built using Flask [9] and PostgreSQL [10]. It uses the Event Registry [6, 3] service for event and article fetching, and integrates both Google Gemini [11] and OpenAI [12] LLMs.

3.2 Analysis Service Overview

The analysis service consists of two modules, claims analysis and sentiment analysis, with more thorough exploration of the former due to its less subjective nature. Figure 2 illustrates our three-stage LLM pipeline.

Figure 2: Three-stage process flow of the analysis. Extracted results lead to multiple parallel LLM calls.

Stage 1: Extraction. We begin by sending all articles from an event to a single LLM call. This extracts two lists (Table 1) of the entities and claims that appear in the articles.

Stage 2: Classification. With the lists from stage 1, a parallel LLM call is made twice for each publisher, once for claims, once for entities. The calls return categorized data. Claims are categorized by presence, and entities by sentiment. The results of these categorizations are referred to as entity-publisher and claim-publisher pairs.

Stage 3: Key Differences. This stage summarizes how different publishers covered each claim or entity. It requires one LLM call per claim/entity, running in parallel.

The final results are structured into a tabular or card format, depending on the device, where users can compare coverage across publishers at a glance (Figure 1).

3.3 Language

We decided for all prompts to be in Slovene, and to analyze only Slovene articles. This came after empirically observing a decrease in errors when the language of the prompts and the articles was the same. It also ensured language consistency for evaluation. All showcased prompts and results are originally in Slovene and were translated to English for this paper.

3.4 Event Filtering

Events and articles are fetched from the Event Registry API [3]. Articles are then filtered to retain only the newest article from each unique publisher in an event. To retain only the most relevant events, we discard any event with fewer than 3 articles. To prevent context overloading, the maximum article limit is 10. The final article list is then prepared for each event, and the title, body, publisher name, and article link are stored for every article.

3.5 Extraction

Extraction for an event is done after filtering, in a single LLM call to gpt-4o-mini [13], in which the contents of all articles are included in the prompt along with instructions for extracting 2 lists (Table 1) in JSON format: entities for sentiment analysis and claims for claims analysis. The prompt focuses on extracting 8-15 claims and 8-15 entities that are central to the story, explicitly excluding news publishers unless they are the subject of the news story:

Analyze all these news articles and extract two comprehensive lists in JSON format:
1. CLAIMS - All significant claims made across all articles
2. ENTITIES - All important entities (people, organizations, countries, etc.) mentioned across all articles

Entities | Claims
Vladimir Putin | Putin claims that Russia has never opposed Ukraine's membership in the EU.
Xi Jinping | Putin calls claims about a possible Russian attack on other European countries "hysteria."
Russia | Putin says that Russia is forced to respond to the West's attempt to take over the post-Soviet space.
China | Putin and Trump discussed the security of Ukraine.
Ukraine | Putin and Xi signed about 20 agreements in the fields of energy, aviation, artificial intelligence, and agriculture.

Table 1: Example of the first 5 entities and claims received from the extraction prompt for the Russia–China summit.

A 2-step extraction process was also tested, in which each article is prompted for the claims and entities contained in it, and the results are then aggregated. However, this led to very large lists with duplicate names written differently (e.g., USA vs. United States, United States Government vs. United States), for little performance gain.

Another issue we faced was the publisher names themselves appearing in the entities list, even in situations where they are not a direct part of the article. This led us to add additional rules to the extraction prompt to exclude them:

- EXCLUDE news publishers/sources (like BBC, CNN, Reuters, etc.) UNLESS they are actually subjects of the news story itself
- Focus on entities that are the SUBJECT of the news, not the source reporting it

3.6 Claims Analysis

Figure 3: Claims analysis decision tree, 4 options depending on whether and how a claim is mentioned.

Claim analysis starts after the extraction step returns a claims list. It consists of multiple parallel LLM calls, each analyzing a single article against the claims list, using 4 categorizations for whether the article confirms the claim: Yes, Partially, No, and Not mentioned, as depicted in Figure 3.

False negatives were the biggest problem we faced with claim analysis. Originally, there were only 3 claim categories; however, due to too many "not mentioned" results, we added a fourth, partial, classification, which led to significant improvements. To further reduce false negatives without adding false positives, we tightened the categorization rules for the Not mentioned category, defaulting to Partial instead when the answer is unclear.

Portion of the rule-set that helped improve results:

Before selecting "Not mentioned", you MUST check the following transformations/hints:
- paraphrases/synonyms; hypernyms/hyponyms; abbreviations/acronyms; coreferences (pronouns, descriptive references)
- numbers/units/conversions; relative dates -> absolute; geographic hypernyms (e.g. EU -> country)
- sections: title, introduction, body, subtitles, captions, tables/graphs, quotes/indirect statements
- negations, questions, conditionals, predictions/hypotheses

Rule to reduce false negatives:
- If in doubt between "Partial" and "Not mentioned", choose "Partial"

3.7 Sentiment Analysis

Figure 4: Decision tree in sentiment analysis.

The sentiment analysis proceeds in parallel with claims analysis after receiving the entity list (Figure 1) from the extraction. It is structured in a manner very similar to the claims analysis: it calls the LLM once per publisher, and it has 4 categorizations (Figure 4): Positive, Negative, Neutral, and Not mentioned. Accuracy assessment is harder due to subjective interpretation. This module uses gemini-2.5-flash-lite [14] due to empirical observation of better results; every other LLM call uses gpt-4o-mini [13].

LLMs struggle with implicit criticism conveyed through selective quoting. For instance, when Mladina [15] quoted Trump praising himself as "smart" and suggesting people want a dictator, the LLM classified the sentiment as positive, missing the article's critical intent to portray authoritarianism. To account for this weakness, we added more constraints and rules to the prompts:

Important: OUTCOME ≠ SENTIMENT
- Do not mark "Positive" because the entity wins/makes a profit, without explicit value judgement of the entity.
- Do not mark "Negative" because the entity loses/has a bad result, without explicit value judgement of the entity.

Mandatory decision steps (before choosing a label):
- First identify the role of the entity's mention: SPEAKER / TARGET / MENTIONED WITHOUT ROLE
- Then look for META-EVALUATION of the entity (adjectives, evaluative verbs, framing before/after the quote, editorial tone)
- If the entity is only a SPEAKER without meta-evaluation, choose "Neutral"

This reduced false negatives and false positives significantly; however, it came with the tradeoff of a much higher incidence of neutral classifications, even when the text is slightly positive or negative.

3.8 Key Differences

The final step of the pipeline is the generation of the key differences (Figure 5). It uses the claims/sentiment categorizations from the previous step as input. It works for both claims and sentiment analysis in an almost identical manner; we will use claims for explanation in this example. A parallel LLM call is made once per claim in the analysis, containing all claim-publisher pairs of that claim.

Figure 5: Key difference generation for a claim from the Russia–China summit.

4 Evaluation

4.1 Manual Testing

To test our hypothesis that the multi-stage pipeline is superior to a single-stage pipeline (where all articles and instructions are included in a single one-prompt LLM call), we conducted a comparison of claim analysis results spanning 797 claim-publisher pairs, of which 294 are from the single-stage pipeline and 503 from the multi-stage pipeline.

Event | Publishers | Claims (S/M) | Error rate (S/M) | Major errors (S/M) | Rows affected (S/M)
Hvar snakebite | 7 | 9 / 15 | 25.4% / 3.80% | 25.4% / 1.90% | 100% / 20%
Putin prepared to meet Zelenski | 7 | 9 / 14 | 30.15% / 3.06% | 14.28% / 0% | 88.8% / 7.14%
Carpaccio's Mary Returns to Piran | 9 | 8 / 15 | 38.9% / 6.3% | 37.5% / 0% | 100% / 33.3%
Giorgio Armani dies | 7 | 8 / 15 | 37.5% / 7.62% | 30.4% / 1.91% | 87.5% / 33.3%
Russia–China summit | 5 | 8 / 12 | 32.5% / 0% | 32.5% / 0% | 100% / 0%
Weighted avg | — | — | 33.0% / 4.67% | 28.0% / 0.79% | 95.3% / 21.5%

Table 2: Single-stage (S) vs. multi-stage (M) results per event. The final row shows weighted averages. For error rates and major errors, weights = number of claim-publisher pairs tested per pipeline. For rows affected, weights = number of claims (rows) per pipeline. Note that weights differ between pipelines due to different extraction results.
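The weighted averaging described in the Table 2 caption (weights = number of claim-publisher pairs per event) can be sketched as a small helper. This is only a sketch: the per-event pair counts below are illustrative, since the paper reports the resulting percentages but not the underlying counts.

```python
# Sketch of the weighted-average error rate from the Table 2 caption:
# each event's error rate is weighted by the number of claim-publisher
# pairs tested for that pipeline.

def weighted_error_rate(events):
    """events: iterable of (error_rate, n_pairs) tuples, one per news event."""
    total_pairs = sum(n for _, n in events)
    return sum(rate * n for rate, n in events) / total_pairs

# ILLUSTRATIVE pair counts -- the paper does not publish per-event counts.
multi_stage = [(0.0380, 105), (0.0306, 98), (0.063, 120),
               (0.0762, 105), (0.0, 75)]
print(f"weighted error rate: {weighted_error_rate(multi_stage):.2%}")
```

Because the single- and multi-stage pipelines extract different claim lists, the same helper would be run with a different set of weights for each pipeline, which is why the caption notes that the weights differ.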
Both single- and multi-stage results were generated across the same 5 control news events. Quantitative testing was not done for sentiment due to time constraints, combined with the increased difficulty arising from its level of subjectiveness.

Each claim-publisher pair was manually reviewed for correctness. We classified errors into two categories: minor errors (positive or not mentioned classified as partial) and major errors (false positives/negatives). Results were grouped by event to enable direct comparison between the two architectures on identical data. Weighted averages were calculated using claim-publisher pair counts for error rates, and distinct claim counts for rows affected (a row refers to a distinct claim and its corresponding claim-publisher pairs).

4.2 Results

The multi-stage pipeline achieved a 4.67% error rate versus the 33.0% error rate of the single-stage pipeline. Table 2 shows results across the five test news events. Each percentage represents the proportion of claim-publisher pairs that were incorrectly classified. For example, in "Russia–China summit" with 5 publishers, single-stage misclassified 32.5% of all claim-publisher pairs while multi-stage achieved 0% error.

Major errors (false positives/negatives) are critical misclassifications where claims are marked "confirmed" when absent or "not mentioned" when present. Minor errors involve "partial" misclassifications. The multi-stage pipeline reduced major errors from 28.0% to 0.79%.

Rows affected shows the percentage of claims with at least one error across publishers. Single-stage produced errors in 95.3% of claims versus 21.5% for multi-stage, demonstrating more localized error patterns.

The improvement was consistent across all five news events. The most dramatic gain was the 35-fold reduction in major errors.

5 Discussion

Our results demonstrate that LLM prompt architecture fundamentally impacts classification accuracy in cross-source news analysis. The significant error reduction validates task decomposition as a critical design principle for complex NLP pipelines.

While the multi-stage pipeline (Figure 2) requires more API calls (8+ versus one), costs remain manageable at $0.008-0.015 per event with both modules enabled. The accuracy improvement justifies this modest cost increase, especially considering that manual verification would require expensive human labor. Considering that an event only needs to be analyzed once, with no variable cost, this offers a lot of potential for analysis at scale.

Sentiment analysis struggles with irony and implicit criticism, as shown in the Mladina [15] example, where selective quoting conveyed negativity despite positive surface language.

Future work includes comprehensive user testing with journalists and researchers, optimization of the current modules, and expansion to other languages. We plan structured evaluations to understand how different user groups interpret and act upon cross-source comparisons.

Acknowledgments

The research described in this paper was supported by the TWON project, funded by the European Union under Horizon Europe, grant agreement No 101095095.

References
[1] Yushi Bai et al. 2023. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
[2] [n. d.] Ground News - breaking news headlines and media bias. Ground News. Retrieved Sept. 7, 2025 from https://ground.news/.
[3] [n. d.] Event Registry API documentation. Event Registry. Retrieved Sept. 7, 2025 from https://eventregistry.org/documentation.
[4] [n. d.] Case study: Ground News at Harford Community College - a collaborative mission to modernize media literacy. Library Up. Retrieved Sept. 7, 2025 from https://www.libraryup.org/news-1/case-study-ground-news-at-harford-community-college.
[5] [n. d.] Ground News - media bias and news comparison. West Virginia University Libraries. Retrieved Sept. 7, 2025 from https://libguides.wvu.edu/c.php?g=1204801&p=8818927.
[6] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event Registry: Learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web (WWW '14 Companion). ACM, Seoul, Korea, 107-110. isbn: 978-1-4503-2745-9. doi: 10.1145/2567948.2577024.
[7] C.J. Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media. AAAI Press, 216-225.
[8] [n. d.] Claude Code. Anthropic. Retrieved Sept. 7, 2025 from https://claude.ai/code.
[9] Armin Ronacher. [n. d.] Flask. Retrieved Sept. 7, 2025 from https://flask.palletsprojects.com/.
[10] [n. d.] PostgreSQL. PostgreSQL Global Development Group. Retrieved Sept. 7, 2025 from https://www.postgresql.org/.
[11] [n. d.] Gemini API. Google. Retrieved Sept. 7, 2025 from https://ai.google.dev/.
[12] [n. d.] OpenAI. OpenAI. Retrieved Sept. 7, 2025 from https://openai.com/.
[13] [n. d.] GPT-4o mini. OpenAI. Retrieved Sept. 7, 2025 from https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/.
[14] [n. d.] Gemini 2.5 Flash Lite. Google. Retrieved Sept. 7, 2025 from https://openrouter.ai/google/gemini-2.5-flash-lite-preview-06-17.
[15] [n. d.] Trump bi ministrstvo za obrambo preimenoval v ministrstvo za vojno [Trump would rename the Ministry of Defense the Ministry of War]. Mladina. Retrieved Sept. 7, 2025 from https://www.mladina.si/243046/trump-bi-ministrstvo-za-obrambo-preimenoval-v-ministrstvo-za-vojno/.

Identifying Social Self in Text: A Machine Learning Study

Jaya Caporusso (jaya.caporusso@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Matthew Purver, Jožef Stefan Institute, Ljubljana, Slovenia, and Queen Mary University of London, London, UK
Senja Pollak, Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

The Self encompasses many aspects, such as the Social Self. Identifying them in text is relevant for many purposes, including mental-health research. As part of a larger project aimed at automatically detecting Self-aspects in written language, in this study we annotate and employ a dataset of diary entries to classify the presence or absence of Social Self. We train three classifiers—Support Vector Machine (SVM), Naïve Bayes, and Logistic Regression—on either learned or predefined features. The best-performing model is the SVM trained on predefined LIWC features based on a previous study. We further apply feature importance methods and examine which features make the biggest contribution to the classification models. The most informative feature across models trained on learned features is the word "we", while the LIWC category "social referents" emerges as the most important feature for models trained on predefined features.

Keywords

social self, machine learning, classification, feature importance

1 Introduction

A central aspect of human experience, the Self is a complex, multi-aspect phenomenon [3]. Its aspects—encompassing, for example, personal narratives [18] and social interactions [2]—correlate with other relevant constructs, such as mental-health conditions [17]. While the various Self-aspects are reflected in the individual's language [14], Natural Language Processing (NLP) studies rarely explore them and employ them in depth.

This work is part of a larger project aimed at developing models to automatically identify Self-aspects in text, with applications in mental-health research and empirical phenomenology [5]. Due to the sensitive nature of the domains of application, we attempt an approach that allows both interpretability and a ground-truth basis, opting for classical machine learning (ML) models. In this study, we focus on one Self-aspect: the Social Self (SS), defined as the Self as it is shaped and/or perceived when in an interaction or relationship of sorts with other people or entities to whom we attribute qualities of inner life [4]. We aim to investigate how this is represented in diary entries and whether these representations can be reliably identified using machine learning. Additionally, we explore which linguistic features are most predictive of these aspects. Identifying SS in text is valuable, as, e.g., disturbances in the SS are closely linked to mental-health conditions [7]. This project involves labelling—with a mixed approach involving human annotators and large language models (LLMs)—diary-entry instances as binary (representing SS or not), with the purpose of investigating the correlation between SS and textual features. We train and compare three classifiers (i.e., Support Vector Machine (SVM), Naïve Bayes (NB), and Logistic Regression (LR)) to predict SS using either 1) learned features (i.e., TF-IDF unigrams and bigrams) or 2) predefined features (i.e., Linguistic Inquiry and Word Count (LIWC; [1]) lexicon categories; see [4]). We use the mentioned classifiers instead of LLMs (e.g., GPT-4) because our focus is on employing interpretable features and understanding their contribution to predictions—an aspect less directly accessible in generative models. We conduct feature importance analysis to explore these contributions further. The code is available at https://github.com/jayacaporusso/SELFtext upon request.

2 Related Work

Studies that address the correlation between text and the traits and states of the text's author often utilise the Linguistic Inquiry and Word Count (LIWC), a text analysis software developed to analyse linguistic and psychosocial constructs connected to various textual aspects [1] (e.g., [9]). Various studies have found Self states to be associated with linguistic features, e.g., depression with first-person singular pronouns [15]. This has been employed in classification tasks (e.g., [6]). In a previous study, after labelling a dataset with a mixed approach employing human annotation and LLMs, we analysed which LIWC-22 features characterise Reddit posts including Self as an Agent, Bodily Self, and SS [4]. Specifically, we showed that the presence of SS is correlated with LIWC categories including, among others, emotion- and time-related terms. In contrast, the absence of SS is correlated with, e.g., technology and negative emotions. In this work, we employ this knowledge to build SS classifiers on predefined features and compare them with classifiers trained on learned features.

3 Research Questions

In this study, we aim to address the following main research questions (RQs).
RQ1: How does a SS classifier trained on pre- attribute qualities of inner life defined features perform compared to a SS classifier trained on [4]. We aim to investigate how this learned features? : Among the algorithms employed, which RQ2 is represented in diary entries and whether these representations one performs better for our task? : Which features are more RQ3 can be reliably identified using machine learning. Additionally, relevant for the classification of SS? we explore which linguistic features are most predictive of these aspects. Identifying SS in text is valuable, as, e.g., disturbances in the SS are closely linked to mental health conditions [7]. This 3.1 Labelling In our study, we use a publicly available dataset in English [11] Permission to make digital or hard copies of all or part of this work for personal comprising 1,473 text samples (sub-entries; average length: 507.6 or classroom use is granted without fee provided that copies are not made or characters, 100.6 words) from 500 personal journal entries (500 distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this anonymous subjects). We augment the dataset with binary labels work must be honored. For all other uses, contact the owner /author(s). for SS, as following addressed. Information Society 2025, Ljubljana, Slovenia For labelling, we employ a mixed approach (see [4]) that com- © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.2 bines human annotation with the large language model (LLM) 19 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia Caporusso et al. gemma2 [16]. The instructions for manual annotation are pro- on the learned features and three models on the predefined fea- vided in the Appendix A. Two human annotators label the first tures. The models are of three different kinds: SVM, NB, and 105 instances of the dataset. 
This is needed to calculate inter- LR, all commonly used in text classification tasks. We employ annotator agreement with the LLM annotations. We instruct default hyperparameters. For the SVM, we use Linear kernel. For gemma2 to label the data three times, providing three different LR, we apply L2 regularisation, which adds a penalty term to personalisations (see [10]): expert in phenomenology, cognitive the model’s objective function, minimising overfitting. For NB, psychology, or social psychology. Additionally, we provide them MultinomialNB was used for learned features, while GaussianNB with definitions of SS, instructions to annotate it, examples of a was used for predefined features, which consist of continuous nu- text instance where it is present, a text instance where it is absent, merical values derived from linguistic analysis. MultinomialNB and explanations of why this is so. These can be extracted from assumes that features represent discrete frequency counts, while the instructions for manual annotation. Each gemma2 model GaussianNB assumes that feature distributions follow a normal performs a one-shot, binary classification for each self-aspect. distribution, making it appropriate for continuous data. We calculate majority voting with the resulting labels and com- pute the inter-annotator agreement between each pair among the 5 Evaluation human and the LLM annotators by calculating Cohen’s Kappa Similarly to the training process, the models are evaluated using coefficient. This results in Cohen’s Kappa coefficients of 0.80 10-fold cross-validation. All the models perform reasonably well, (human annotators), 0.89 (first annotator vs. gemma2), and 0.84 with the SVM model trained on predefined features outperform- (second annotator vs. gemma2). In the further steps, we use the ing them all (RQ1 and RQ2). The metrics (precision, recall, and majority voting labels. 
The class balance (calculated on the ma- F1-score: mean and STD) across folds are reported in Table 1. jority voting) is 50.3% (SS present) vs 49.7% (SS not present). They match the macro average scores. The confusion matrices for each model are presented in Figures 3 and 4 in the Appendix 4 Classification B. These highlight that models trained on predefined features The text is preprocessed, converting it to lowercase and remov- generally perform better at distinguishing between classes, with ing punctuation and extra whitespace. We extract learned and the SVM and LR models achieving higher accuracy for both Class predefined features. We then train three classifiers for each set 0 and Class 1. However, NB trained on predefined features strug- of features: an SVM, a NB, and a LR model. gles with a higher rate of false positives for Class 0. The models trained on learned features have slightly lower performance, with 4.1 Feature Engineering higher misclassification rates for Class 1 predictions. After per- forming a Friedman test across folds (statistic = 44.26, p-value = We are interested in comparing the performance of models trained 0.00), we find a statistically significant difference in model per- on learned vs pre-defined features. In this study, we choose to formances. We therefore conduct Wilcoxon signed-rank tests employ TF-IDF calculated on unigrams and bigrams as learned with Bonferroni correction to identify significant pairwise dif- features, and the LIWC features identified as being related to the ferences between models. LR with learned features performed presence or absence of SS in Caporusso et al. [4]. significantly better than NB with learned features (p = 0.03); SVM 4.1.1 Learned Features. To extract learned features, we employ with predefined features outperforms NB with learned features (p Tfidf Vectorizer, applying TF-IDF weighting to unigrams and bi- = 0.03); LR with predefined features outperforms NB with learned grams. 
Restricting the representation to unigrams and bigrams, a features (p = 0.03); SVM with predefined features performs signifi- common choice in exploratory text classification, efficiently dis- cantly better than NB with predefined features (p = 0.03); LR with plays feature importance, balancing interpretability and compu- predefined features outperforms NB with predefined features (p tational efficiency. We limit the feature space to the 1000 n-grams = 0.03). The results are displayed in Figure 5 in the Appendix B. that, based on their TF-IDF scores, are the most informative. This ensures computational efficiency. In this process, we choose not to exclude stopping words. Indeed, for the purpose of our study, they do not merely constitute noise but might play a key role in distinguishing text instances reporting on SS. 4.1.2 Predefined Features. We analyse the presence of all the LIWC-22 [1] categories and subcategories, and subsequently only considered the LIWC features of interest. Specifically, as prede- fined features, we employ the LIWC features that Caporusso et al. Table 1: Evaluation Metrics (Mean and STD) [4] identified as being related to the presence and absence of SS (see 2), for example , , and authenticity social referents the pronoun I. For each of them, LIWC-22 provides scores relative to the text length. All LIWC features were standardised using Z-score nor- 6 Feature Importance malisation to ensure comparability across different feature scales. We employ different feature importance methods tailored to each This is particularly important for models like SVM and LR, which model’s learning mechanism to ensure that feature rankings are are sensitive to feature magnitudes. Missing values (NaNs) are meaningful and aligned with the way each algorithm processes handled using mean imputation. data. 
Identifying Social Self in Text. Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia

4.2 Models

The models are trained and evaluated using 10-fold cross-validation to assess their performance. Specifically, we train three models for each set of features.

For the SVM models, we choose Linear SVM Coefficients, because they directly represent feature importance in the decision boundary and are computationally efficient to extract. This method is fast and directly interpretable without requiring additional computations, but it does not capture feature interactions or non-linearity. For the NB models, we choose Permutation Importance: NB does not have meaningful coefficients, and this method provides a model-agnostic way to assess how each feature affects predictions. It allows the interpretation of feature contributions without relying on the model's internal parameters, but it is computationally expensive and can be sensitive to correlated features. For the LR models, we choose SHAP (SHapley Additive exPlanations [12]) values, because they provide both global and instance-level feature attributions while considering feature interactions, making them more informative than raw coefficients. SHAP accounts for feature dependencies and offers a nuanced interpretation of how features contribute to individual predictions, but its computations can be slow and the results depend on the reference distribution used. Using SHAP for the SVM would be unnecessary, because it would give similar results as the coefficients but less directly and with added computational cost, while SHAP's dependency assumptions conflict with NB's independence assumption.

The contribution of each feature to the classification decision is indicated with a feature importance score. These scores are computed differently depending on the method: in Linear SVM Coefficients, they are derived from the absolute magnitude of the learned weights; in Permutation Importance, they are measured by assessing the decrease in model performance when a feature's values are randomly shuffled; and in SHAP, they quantify the contribution of each feature to the predicted classification probability by distributing the model's output among the input features.

6.1 SVM: Linear SVM Coefficients

For SVM, feature importance is determined using Linear SVM Coefficients. This method is chosen because a linear SVM explicitly learns a set of coefficients as part of its optimisation process, making feature importance inherently interpretable. Additionally, since the SVM model is optimised to find the maximum margin, features with the largest coefficients contribute the most to defining this separation, allowing for a clear ranking of feature relevance. The resulting importance scores are based on the absolute magnitude of the learned coefficients and, like them, can be any real value. While the importance scores' scale depends on the range of the input features, higher numbers indicate a stronger influence on classification. The top-3 features for the SVM models are family, we, and with (TF-IDF), and social referents, I, and personal pronouns (LIWC) (RQ3).

6.2 Naïve Bayes: Permutation Importance

For NB, we choose Permutation Importance because it provides a robust way to assess feature significance in probabilistic models that do not generate explicit importance scores. By quantifying the dependence of the model's predictions on each feature, Permutation Importance allows for an intuitive understanding of which features are most influential in the NB classification process. The scores produced are relative, and their scale depends on the model's performance metric; a larger value indicates that the feature has a greater impact on classification accuracy. The top-3 features for the NB models are us, birthday, and her (TF-IDF), and social referents, social, and she/he (LIWC) (RQ3).

6.3 Logistic Regression: SHAP Values

LR calculates the probability of a given outcome using a linear combination of input features, but SHAP offers a more granular and interpretable way of explaining these predictions. This method is chosen because it provides a comprehensive, intuitive, and theoretically grounded measure of feature importance, making it well-suited for interpreting the decision-making process of a probabilistic model like LR. In this study, we reduce the SHAP computation sample size from 50 to 20 to improve efficiency while maintaining representative feature importance insights. SHAP scores are measured on the same scale as the model's output and sum to the difference between the model's output and the expected output across all features. They can be positive (probability of classification increased) or negative (probability of classification decreased). Their magnitude reflects the strength of the feature's influence on the classification decision. The top-3 features for the LR models are with, we, and my (TF-IDF), and social referents, Social, and I (LIWC) (RQ3).

6.4 Overall Feature Importance

To determine the top-20 most important features across all models trained on learned features and across all models trained on predefined features, we aggregate the feature importance scores from each model and sum them across all models. This is done to show which features are consistently influential regardless of the model; however, due to differences in how each method computes importance, the aggregated scores should be viewed as indicative rather than absolute measures of feature relevance. The top-10 features for the models trained on learned features are displayed in Figure 1, and those for the models trained on predefined features in Figure 2 (RQ3).

Figure 1: Top-10 Features for TF-IDF Models

Figure 2: Top-10 Features for LIWC Models

Additionally, we identify unique features for each model, defined as those that appear in the top-10 for a specific model but not in the others. In the following, we report those referring to models trained on learned features.

• SVM: my, team, she, our, he, we, with, friend, with my, their.
• Naïve Bayes: team, they are, he was, us, birthday, she was, of our, with her, person, spending time.
• Logistic Regression: my, she, our, and, good, he, my family, we, it, sleep.

Next, we report those referring to models trained on predefined features.

• SVM: sexual, Dic, Social, socrefs, feeling, we, Affect, Drives, insight, WC.
• Naïve Bayes: Dic, Social, socrefs, number, moral, feeling, we, focuspast, Drives, illness.
• Logistic Regression: Dic, Social, socrefs, pronoun, Analytic, feeling, we, Affect, focuspast, Drives.

This helps us shed light on how different algorithms interpret the data; some overlap in the reported features occurs because the different algorithms, despite using distinct mechanisms to estimate importance, converge on similar cues that are consistently predictive of SS. We calculate the correlation between feature importance rankings across the different models by computing the Pearson correlation coefficient between the feature importance scores of each pair of models, using their respective importance values across all features. This is displayed in Figures 6 and 7 in Appendix C. A high positive correlation indicates similar feature rankings, and vice versa. The highest correlation is measured between the SVM and LR models, while the lowest is between NB and LR for models trained on learned features, and between SVM and NB for models trained on predefined features.

7 Discussion

Our results indicate that the models trained on predefined features (LIWC) generally outperform those trained on learned features (TF-IDF n-grams), with the SVM model achieving the highest classification performance (RQ1-2). This suggests that LIWC features, which encapsulate linguistic and psychological constructs, provide a structured and interpretable representation of textual patterns related to SS. In contrast, TF-IDF captures surface-level word frequency distributions, which may be more susceptible to noise and context variability, limiting its predictive power for capturing abstract constructs like SS. Furthermore, our results support the findings by Caporusso et al. [4] regarding LIWC features correlated with SS. Notably, models trained on TF-IDF features tend to exhibit higher aggregated feature importance scores compared to those trained on LIWC. This could be attributed to the fact that TF-IDF operates on a larger and more granular feature space, capturing subtle variations in word usage. As a result, many features contribute partially to model decisions, leading to a higher sum of importance values across all features. In contrast, LIWC features are more constrained and predefined, leading to more concentrated but lower cumulative importance scores. This suggests that while TF-IDF captures a broader spectrum of textual variations, LIWC provides a more targeted and structured linguistic representation. Many of the features identified as relevant for the classification of SS (e.g., we and social referents) intuitively align with the nature of SS (RQ3).

8 Limitations and Future Work

This study serves as a pilot for the interpretable classification of different Self-aspects in text, focusing on SS. Several areas for improvement remain. Clearer annotation guidelines are needed for consistency. The choice of restricting ourselves to linear models, LIWC features, and unigrams/bigrams was appropriate for this exploratory study prioritising interpretability; however, it inevitably limits performance and representational richness. In future work, we plan to complement this approach with more powerful models and richer feature sets (e.g., embeddings). Here we wanted to compare models trained on learned vs predefined features, but we also plan to train models on both. While in this study we did not perform hyperparameter optimisation, we will do so in the future. We aim to train a neural network for multi-class classification, enabling simultaneous prediction of SS and other Self-aspects and allowing for a more comprehensive analysis of self-representation in text. In the future, we also plan to employ different datasets and to implement Demšar's evaluation method [8]. Our long-term goal is to be able, given a text instance, to determine which Self-aspects are present and how they are expressed, in an explainable manner. To do so, it is necessary not only to extend our work to other Self-aspects, but also to move beyond a binary classification for each of them. Work on the ontology underpinning future studies is ongoing [13].

9 Acknowledgments

We acknowledge Špela Rot's assistance and the financial support from the Slovenian Research Agency for research core funding for the programme Knowledge Technologies (No. P2-0103) and from the projects CroDeCo (J6-60109), Shapes of Shame in Slovene Literature (J6-60113), and Natural Language Processing for Corpus Analysis in the Medical Humanities (BI-VB/25-27-021). JC is a recipient of the Young Researcher Grant PR-13409.

References

[1] Ryan L Boyd, Ashwini Ashokkumar, Sarah Seraj, and James W Pennebaker. 2022. The development and psychometric properties of LIWC-22. Austin, TX: University of Texas at Austin, 10.
[2] Marilynn B Brewer. 2002. Individual self, relational self, and collective self: partners, opponents, or strangers.
[3] Jaya Caporusso. 2022. Dissolution experiences and the experience of the self: an empirical phenomenological investigation (master's thesis). University of Vienna. Advisor: Assist. Prof. Dr. Maja Smrdu.
[4] Jaya Caporusso, Boshko Koloski, Maša Rebernik, Senja Pollak, and Matthew Purver. 2024. A phenomenologically-inspired computational analysis of self-categories in text. In Proceedings of JADT 2024. Vol. 1, 169–178.
[5] Jaya Caporusso, Matthew Purver, and Senja Pollak. 2025. A computational framework to identify self-aspects in text. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). Jin Zhao, Mingyang Wang, and Zhu Liu, editors. Association for Computational Linguistics, Vienna, Austria, (July 2025), 725–739. isbn: 979-8-89176-254-1. doi: 10.18653/v1/2025.acl-srw.47.
[6] Jaya Caporusso, Thi Hong Hanh Tran, and Senja Pollak. 2023. IJS@LT-EDI: ensemble approaches to detect signs of depression from social media text. In Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion, 172–178.
[7] Christopher G Davey and Ben J Harrison. 2022. The self on its axis: a framework for understanding depression. Translational Psychiatry, 12, 1, 23.
[8] Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7, 1–30.
[9] Lewis R Goldberg. 2013. An alternative "description of personality": the big-five factor structure. In Personality and Personality Disorders. Routledge, 34–47.
[10] Boshko Koloski, Nada Lavrač, Bojan Cestnik, Senja Pollak, Blaž Škrlj, and Andrej Kastrin. 2024. AHAM: adapt, help, ask, model - harvesting LLMs for literature mining. In International Symposium on Intelligent Data Analysis. Springer, 254–265.
[11] X Alice Li and Devi Parikh. 2019. Lemotif: an affective visual journal using deep neural networks. arXiv preprint arXiv:1903.07766.
[12] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.
[13] Luka Oprešnik, Tia Križan, and Jaya Caporusso. 2025. Building an ontology of the self: sense of agency and bodily self. In Proceedings of Information Society 2025, Cognitive Science. doi: 10.70314/is.2025.cogni.8.
[14] James W Pennebaker, Matthias R Mehl, and Kate G Niederhoffer. 2003. Psychological aspects of natural language use: our words, our selves. Annual Review of Psychology, 54, 1, 547–577.
[15] Stephanie Rude, Eva-Maria Gortner, and James Pennebaker. 2004. Language use of depressed and depression-vulnerable college students. Cognition & Emotion, 18, 8, 1121–1133.
[16] Gemma Team et al. 2024. Gemma 2: improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
[17] David HV Vogel, Mathis Jording, Peter H Weiss, and Kai Vogeley. 2024. Temporal binding and sense of agency in major depression. Frontiers in Psychiatry, 15, 1288674.
[18] Dan Zahavi. 2007. Self and other: the limits of narrative understanding. Royal Institute of Philosophy Supplements, 60, 179–202.

B Evaluation

Figure 3: Confusion Matrices: Models Trained on Learned Features (TF-IDF)

Figure 4: Confusion Matrices: Models Trained on Predefined Features (LIWC)

A Instructions for Labelling: Social Self

In the column relative to Social Self, insert:
• 0: if the Social Self is not present.
• 1: if the Social Self is present.

In the following, we provide a definition of Social Self [4], instructions, and examples of a text instance where it is present and of a text instance where it is not present, taken from the dataset to be labelled.

Definition: The Self as it is shaped and/or perceived when in an interaction or relationship of sorts with other people or entities to whom we attribute qualities of an inner life.

Instructions: For Social Self to be present in a text instance, it is not enough for the text instance to contain references to other people and/or entities; it has to contain mentions of the author's interactions with them, their influence on them, or the influence these have on the author. This can be even minimal, e.g., in the form of referring to a person as my sister, or by using the first-person plural pronoun instead of the singular one.

Examples

A.0.1 Text instance containing Social Self: "My family was the most salient part of my day, since most days the care of my 2 children occupies the majority of my time. They are 2 years old and 7 months and I love them, but they also require so much attention that my anxiety is higher than ever. I am often overwhelmed by the care they require, but at the same, I am so excited to see them hit developmental and social milestones."

Explanation of text instance with Social Self present: In this text instance, the author reports on other people they are in some sort of relationship with, and on some aspects of their relationship and how they make the author feel.

A.0.2 Text instance not containing Social Self: "Yoga keeps me focused. I am able to take some time for me and breathe and work my body. This is important because it sets up my mood for the whole day."

Explanation of text instance with Social Self not present: In this text instance, the author does not report on any person, animal, or other entity to whom we attribute qualities of an inner life.

General Notes: While a certain Self-aspect might not be prominently present in a text instance in its entirety, if it is present in a part of the text instance to be labelled, then it has to be labelled as present in the text instance.
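The labelling scheme above (one binary column per Self-aspect) can be sketched as follows. The name social_self follows Appendix A; the other two aspect names are illustrative placeholders, since the remaining Self-aspects are not named in this excerpt.

```python
from itertools import product

# One binary column per Self-aspect; "social_self" follows Appendix A,
# the other two names are placeholder assumptions.
ASPECTS = ("social_self", "aspect_b", "aspect_c")

def validate_row(row):
    """Check one labelled instance: every aspect present, each 0 or 1."""
    return set(row) == set(ASPECTS) and all(v in (0, 1) for v in row.values())

# Any combination of present/absent aspects is a valid labelling:
combinations = [dict(zip(ASPECTS, bits)) for bits in product((0, 1), repeat=3)]
assert all(validate_row(r) for r in combinations)  # 8 valid combinations
```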
A given text instance can have none of the Self-aspects present, one of them present and two of them non-present, two present and one non-present, or all three of them present: any combination is possible.

Figure 5: Pairwise Wilcoxon Signed-Rank Test Results (p-values)

C Feature Importance

Figure 6: Correlation Between Feature Importance Across Models Trained on Learned Features (TF-IDF)

Figure 7: Correlation Between Feature Importance Across Models Trained on Pre-Defined Features (LIWC)

WinWin Meets – Investigating the Future of Online Meetings

Martin Žust, marti.zust@gmail.com, Jožef Stefan Institute, Ljubljana, Slovenia
Marko Grobelnik, marko.grobelnik@ijs.si, Jožef Stefan Institute, Ljubljana, Slovenia
Alenka Guček, alenka.gucek@ijs.si, Jožef Stefan Institute, Ljubljana, Slovenia
Adrian Mladenic Grobelnik, adrian.m.grobelnik@ijs.si, Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

Video conferencing is now central to modern collaboration, yet its functionality remains largely limited to passive audio–visual communication. Despite growing investment in artificial intelligence (AI), it is unclear which features truly enhance meetings and how users will adopt them. Here we present WinWin Meets, a Jitsi-based prototype that integrates Whisper transcription and GPT-4o processing to deliver real-time summaries, visual mind maps, and goal-oriented advice. Testing with 16 participants showed strong interest in summaries and mind maps, moderate interest in in-meeting guidance, and a preference for add-on integration. Market research confirmed low organic demand for advanced AI features, with users prioritizing reliable improvements such as automated notes. These results highlight a gap between experimental enthusiasm and everyday adoption, pointing to opportunities for targeted, industry-specific integrations that combine reliability with intelligent support.

Keywords

video conferencing, AI agent, testing, market research, Zoom, negotiation, transcription, summarization, advice, meeting notes, AI innovations

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.14

1 Introduction

As artificial intelligence advances rapidly, its potential to transform everyday digital tools, particularly video conferencing, has become increasingly apparent. Platforms such as Zoom, Google Meet, and Microsoft Teams have become standard, yet their functionality remains focused on basic communication. A need is arising for next-generation conferencing that includes intelligent assistants, automatic summarization, content analysis, and contextual support. These next-generation systems go beyond passive audio and video transmission to actively support users with intelligent features and real-time analysis [1].

Previous research reveals both promise and challenges. Proactive AI meeting assistants can improve efficiency but need to balance autonomy with what users are willing to accept [1]. Meanwhile, studies of speech-based technology underscore the difficulty of extracting useful outcomes from nuanced group interactions [2]. These perspectives suggest that AI's success in meetings depends on technical feasibility and sensitivity to human collaboration.

With remote meetings now central to how we work, these systems directly impact productivity, collaboration, and organizational culture. This paper explores which functionalities could define the future of video conferencing and how AI may contribute. We combine market trend and user preference analysis, reviews of online discussions, and experimental testing of the WinWin Meets prototype. We explore which features matter to users, examine how AI can support meetings, and assess the potential to improve efficiency, clarity, and structure in digital communication.

2 Background and Related Work

2.1 Overview of Current Video Conferencing Solutions

The video conferencing market is currently dominated by a few major players. Zoom, Microsoft Teams, and Google Meet together account for approximately 94% of global market share, with Zoom alone holding around 56% [3]. While all three platforms are actively investing in artificial intelligence features, their innovation must be carefully balanced against the risk of reputational damage. As established brands, they face more constraints than lesser-known startups, which can afford a higher level of experimental agility. This creates a unique window of opportunity for the emergence of disruptive technologies that have the potential to redefine the video conferencing experience.

Most AI-enabled tools developed recently are not standalone platforms, but integrations designed to work alongside existing services like Zoom, Google Meet, or Microsoft Teams. Notable examples include tl;dv [4], Otter.ai [5], Fathom [6], Fireflies [7], and Sembly AI [8]. These applications primarily offer meeting transcription, and some provide more advanced analytics such as sentiment analysis or participant-level speaking time metrics.

2.2 Limitations of Existing Solutions and Emerging Needs

Despite the growing number of AI integrations, fully independent platforms that natively combine video conferencing with built-in AI features remain rare. Such features may include real-time transcription, intelligent meeting summarization, and contextual AI-generated recommendations. This segment remains underdeveloped, presenting a significant opportunity for innovation.

While major platforms like Zoom have started introducing their own AI assistants (e.g., Zoom AI Companion [9]), they must innovate cautiously to protect their reputation and user base. This creates space for new companies to develop more ambitious AI-first conferencing tools, unrestricted by established brand expectations or legacy user commitments. However, innovating in markets where most users are already committed to existing platforms has notable downsides. Only about 2.5% of people are actively seeking new alternatives, with the majority being reluctant to change [10].

3 Development of WinWin Meets

3.1 Overview

As part of our research, we developed WinWin Meets, an AI-based alternative to Zoom. The application maintains familiar functionality, allowing users to start or join meetings just as they would expect. The key difference comes before entering the meeting room, where users can define their meeting goals. Once inside, they find a familiar interface with standard video conferencing features. These core functionalities are provided through an integration with Jitsi [11], an open-source video conferencing platform. It supports screen sharing, microphone and camera toggling, chat-based communication, polls, and many other standard features.

Beyond the familiar main meeting window found in applications like Zoom, WinWin Meets adds a dedicated panel on the right side of the screen for the WinWin Agent. This panel features two main buttons: Summarize and Give Advice. The Summarize button generates meeting summaries up to the current moment, which is particularly useful for late arrivals. Hovering reveals three options: Short Text, Long Text, and Mind Map. While the text options provide traditional summaries of varying length, the Mind Map offers a quicker and more accessible visual overview. The idea behind the mind map is based on the observation that modern workplace attention is highly fragmented, with a median focus duration of just 40 seconds on any screen [12]. The Give Advice button offers guidance on how to achieve the goals specified before the meeting. These goals can also be adjusted during the meeting by clicking the Manage Goals button in the top right corner. Hovering over the Give Advice button reveals three options: Short Text, Medium Text, and Long Text, which provide advice in different levels of detail.

Once the meeting concludes, a meeting report is quickly generated. The report includes all key points, action items, a meeting timeline, and the list of participants. Users can also generate a mind map from the final meeting content.

3.2 System Architecture and Implementation

The frontend of the application was developed in Cursor [13], with assistance from Claude 3.7 Sonnet [14] and GPT-4o [15]. It is built using the React 19 framework [16]. We aim for a clean and minimalistic design that intuitively guides the user through each step of the interface. In the meeting room interface, we integrated Jitsi via its iframe API. Jitsi integration is straightforward, and the platform allows the use of its hosted servers for up to 25 active monthly users free of charge, which was sufficient for our prototype testing.

The backend is built in Python, using the FastAPI framework [17]. For transcription, we integrated Whisper [18], and for natural language processing tasks (such as summarization and advice generation), we used GPT-4o. The backend exposes several endpoints, including:

• Transcription
• Advice generation
• Meeting summarization
• Health monitoring
• Meeting notes
• File uploads

The WinWin Agent dynamically adapts to the language selected by the user. In this prototype, we supported English, German, and Slovene, allowing users to interact with the summarization and advice features in their preferred language.

Figure 1: System architecture of the WinWin Meets application

In this prototype version, we did not use any persistent database; all data is stored locally. Additionally, user authentication is not yet implemented, as the focus was on demonstrating core functionalities.

4 Testing and User Insights

To evaluate the usefulness and usability of WinWin Meets, we conducted a structured user testing process involving 16 participants. Testing sessions were held in small groups of 2 to 4 participants, each lasting approximately 15 minutes. Participants simulated realistic discussions—including casual exchanges and role-play scenarios such as negotiations or political debates—to test all implemented functionalities. The following sections present our testing results, with key findings shown in Figure 2.

4.1 Test Coverage

Participants explored all key features, including the three variants of the Summarize function (Short Text, Long Text, and Mind Map), the three formats of the Give Advice function (Short, Medium, Long), and the Meeting Notes feature. After each session, they completed an anonymous survey with both multiple-choice and open-ended questions to assess usefulness and provide feedback.

4.2 Key Findings

General Usefulness. Most participants recognized the potential of AI-enhanced meetings. In fact, 87.5% responded Yes when asked whether AI could help them achieve meeting goals, while the remaining 12.5% answered Maybe.

Summarize Feature. The Summarize function was considered useful by 81.3% of participants. Preferences were split almost evenly: nearly half favored the Short Text and another 43.8% opted for the Mind Map, while only 12.5% selected the Long Text variant.

Give Advice Feature. When choosing advice length, participants showed a clear preference for medium-length suggestions:
• 50% selected Medium
• 25% chose Short
• 25% chose Long

Meeting Notes Feature. Participants emphasized three expectations for meeting notes:
• High reliability (timestamps, content accuracy)
• Fast post-meeting availability
• Stable performance across sessions

Figure 2: User survey results (n=16) comparing preferences for existing features (Give Advice and Summarize) and ranking of proposed new features for the WinWin Meets application

4.3 Ideas for Additional Features

Among the proposed additions, the insightful question generator attracted the most interest (37.5%), while the speaking time leaderboard and agenda-based coordination were equally valued (31.3% each). Participants also suggested several custom features, including personal notes, live transcription export, cloud synchronization, calendar integration, live translation with tone analysis, and domain-specific modes for law, sales, or education.

4.4 Integration Preferences

A clear majority (68.8%) preferred to use WinWin Meets as an add-on to existing platforms, while only 31.2% supported a standalone application.

4.5 Use Cases by Industry

Participants identified several promising domains for WinWin Meets, such as negotiation and sales, legal and consulting services, corporate meetings, academic events, client feedback sessions, NGO coordination, and specialized contexts like logistics, mergers and acquisitions, or trade deals.

5 Market Research and Trend Analysis

Beyond developing and testing WinWin Meets, we conducted market research to understand user needs and expectations in the video conferencing space. Our approach combined online surveys, social media engagement, search trend analysis, and reviews of blog posts and user forums. This investigation aimed to reach a wider audience than application testing alone could provide. The resulting quantitative and qualitative insights complement rather than replace our user testing results.

5.1 Survey and Social Media Feedback

Informal polls and surveys were conducted on platforms such as Facebook and Reddit. In a Facebook group focused on digital tools (GrowthHacking Slovenia), a poll asking users which feature they would most like to add to Zoom revealed that over 60% of respondents preferred having meeting notes generated at the end of a call, as we can see in Figure 3. In contrast, only two respondents selected a real-time AI assistant. This suggests a clear user preference for simple and familiar enhancements over more complex and unfamiliar innovations.

Similar sentiment was observed on Reddit (r/Zoom and r/remotework), where posted polls received limited engagement. Among the few responses, a general disinterest in AI-based meeting assistance was evident, with some users explicitly selecting "None of those".

Figure 3: Distribution of 80 votes for preferred video conferencing features from our informal polling.

5.2 Search Behavior and Online Interest Trends

Public search trends were analyzed using tools such as Answer the Public [19], Answer Socrates [20], AlsoAsked [21], and Ubersuggest [22]. These platforms provided insight into the types of questions users search for on Google, YouTube, and Reddit. The analysis showed minimal interest in AI-enhanced conferencing features. Instead, users were more focused on improving the efficiency and effectiveness of their meetings. Popular search queries we found included:

• What are the 3 C's of effective meetings?
• What is the 10-10-10 rule for meetings?
• How can I take better meeting notes?
• What are the 5 P's of meeting productivity?
• How to extend the 40-minute limit on Zoom?
• Is Google Meet better than Zoom?
• Is Zoom free to install and use?

These patterns confirm that users are primarily concerned with meeting outcomes and platform reliability, rather than with novel AI-driven functionalities.

5.3 Forum Discussions and Deep-Search Insights

Using tools like Grok [23] and Floth [24], we conducted a deeper exploration of online discussions and feedback. The most frequently mentioned user pain points include:

• Low video quality and unstable connections
• Privacy concerns (e.g., Zoom bombing, data storage policies)
• Psychological fatigue from constant camera presence
• Lack of end-to-end encryption and transparency
• Poor UX from interface changes (e.g., Google Meet "floating bubbles", Webex chat restrictions)
• Discomfort with platform claims over recorded content

User feedback highlights a desire for reliable, simple, and secure platforms with minimal friction in setup and usage.

5.4 Conclusions from Market Research

Our market analysis reveals several key trends:
(1) Users strongly prefer practical features like note-taking and agenda management over complex AI-based tools.
(2) Popular search queries suggest a need for structured meeting frameworks and productivity strategies.
(3) Persistent dissatisfaction exists around technical reliability, interface design, and data privacy.
(4) Open-source alternatives offer control and security but are hindered by usability and cost barriers.

… market segments where specialized AI features deliver measurable value propositions. The 68.8% preference for add-on integration over standalone applications indicates a market opportunity in enhancing existing platforms rather than replacing them, as demonstrated by successful tools like Fathom and Otter.ai. Although there is room for breakthrough products, any new solution must be at once reliable, easy to use, and meaningfully smarter than current tools, a difficult balance as existing platforms already invest heavily in their core features. The emphasis on reliability and customizable AI assistance reveals that AI features must meet higher performance standards than traditional features. Users consistently prioritize dependable functionality over advanced capabilities, suggesting that product development should focus on perfecting core AI functions before expanding feature sets. Future research should examine longitudinal adoption patterns and explore how user acceptance evolves as AI capabilities mature and become more familiar in workplace contexts.

7 Acknowledgements

The research described in this paper was supported by the TWON project, funded by the European Union under Horizon Europe, grant agreement No 101095095.

References

[1] Rutger Rienks, Anton Nijholt, and Paulo Barthelmess. 2009. Pro-active meeting assistants: attention please! AI & Society, 23, 2, 213–231.
[2] Moira McGregor and John C Tang. 2017. More to meetings: challenges in using speech-based technology to support meetings.
In Proceedings of the improvements that enhance meeting effectiveness and reduce Overall, the market exhibits demand for video conferencing 2017 ACM conference on computer supported cooperative work and social computing , 2208–2220. [3] T3 Technology Hub. 2024. Market share of videoconferencing software user burden, rather than introducing new technical complexity. worldwide in 2024, by program. Statista. Graph. (Apr. 2024). Retrieved Jan. 13, 2025 from https://www.statista.com/statistics/1331323/videoconferencing- 6 Discussion [4] market-share/. tldx Solutions GmbH. 2025. Tl;dv. https://tldv.io/. Accessed: August. (2025). There are two primary approaches to understanding user pref- [5] Otter.ai, Inc. 2025. Otter.ai. https://otter.ai/. Accessed: August. (2025). [6] 2025. Fathom. https://fathom.video/. Accessed: August. (2025). erences: direct inquiry and behavioral observation. Direct ques- [7] 2025. Fireflies. https://fireflies.ai/. Accessed: August. (2025). tioning suffers from significant limitations, including social desir- [8] 2025. Sembly ai. https://www.sembly.ai/. Accessed: August. (2025). ability bias where respondents provide socially acceptable rather [9] Zoom Video Communications. 2025. Zoom ai companion. https://www.zoo m.com/en/ai-assistant/. Accessed: August. (2025). than genuine answers, and the fact that approximately 95% of [10] Everett M Rogers, Arvind Singhal, and Margaret M Quinlan. 2014. Diffusion human decisions occur subconsciously as discussed in [25]. Ob- of innovations. In An integrated approach to communication theory and research . Routledge, 432–448. servational methods capture the unconscious preferences that [11] 8x8, Inc. 2025. Jitsi. https://jitsi.org/. Accessed: August. (2025). drive actual user behavior, providing more accurate insights into [12] Gloria Mark, Shamsi T. Iqbal, Mary Czerwinski, Paul Johns, and Akane Sano. 2016. Neurotics can’t focus: an in situ study of online multitasking in the real-world usage patterns. 
workplace. In Proceedings of the 2016 CHI Conference on Human Factors in These methodological considerations explain our contradic-Computing Systems . ACM, 1739–1744. AI could help achieve meeting goals, market research revealed tory findings. While 87.5% of WinWin Meets participants believed [13] Anysphere Inc. 2025. Cursor. https://cursor.sh/. Accessed: August. (2025). [14] Anthropic. 2025. Claude 3.7 sonnet. https://www.anthropic.com/news/clau de-3-7-sonnet. Accessed: August. (2025). minimal organic interest in AI-enhanced conferencing. This di- [15] OpenAI. 2025. Gpt-4o. https://openai.com/index/hello-gpt-4o/. Accessed: vergence reflects the difference between conscious evaluation in August. (2025). [16] Meta Open Source. 2025. React. https://react.dev/. Version 19. Accessed: controlled environments versus unconscious behavioral prefer-August. (2025). ences that emerge during natural usage. Additionally, our testing [17] Sebastián Ramírez. 2025. Fastapi. https://fastapi.tiangolo.com/. Accessed: August. (2025). participants were primarily young AI researchers, likely more [18] OpenAI. 2025. Whisper. https://openai.com/research/whisper. Accessed: receptive to AI features than typical users. August. (2025). Our research uncovered widespread "Zoom fatigue", indicat- [19] NP Digital. 2025. Answer the public. https://answerthepublic.com/. Accessed: ing that users have reached cognitive saturation with current August. (2025). [20] 2025. Answer socrates. https://answersocrates.com/. Accessed: August. video conferencing complexity. The strong preference for meet- (2025). ing notes over real-time AI assistance (60% versus minimal inter- [21] Candour. 2025. Alsoasked. https://alsoasked.com/. Accessed: August. (2025). [22] Neil Patel Digital. 2025. Ubersuggest. https://neilpatel.com/ubersuggest/. est) demonstrates users’ desire for post-meeting value without Accessed: August. (2025). additional in-meeting cognitive burden. This psychological con- [23] xAI. 2025. Grok. 
https://grok.x.ai/. Accessed: August. (2025). [24] 2025. Floth. https://floth.ai/. Accessed: August. (2025). text explains why solutions that prioritize seamless integration [25] Gerald Zaltman. 2003. How Customers Think: Essential insights into the mind over feature prominence tend to gain market traction [26]. of the market . Harvard Business Press. conferencing innovation. Industry-specific applications such as Our findings suggest distinct pathways for AI-enhanced video [26] Fred D Davis. 1989. Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS quarterly , 319–340. negotiations, sales, and legal consultations represent focused 28 Predicting Ski Jumps Using State-Space Model ∗ ∗ Neca Camlek Živa Hegler Jakob Jelenčič Univerza v Ljubljani Univerza v Ljubljani Jožef Stefan Institute Ljubljana, Slovenia Ljubljana, Slovenia Ljubljana, Slovenia jakob.jelencic@ijs.si Marko Grobelnik Dunja Mladenić Jožef Stefan Institute Jožef Stefan Institute Ljubljana, Slovenia Ljubljana, Slovenia marko.grobelnik@ijs.si dunja.mladenic@ijs.si Abstract modeling ski jumps, where environmental factors determine performance [9]. Ski jumping performance is shaped by both athlete technique In this paper, we present a ski jump dataset together with a and environmental conditions, with factors such as wind speed, state-space model trained to predict jump trajectories based on wind direction, and ski orientation playing a critical role in deter- changing environmental conditions. The model is estimated using mining jump trajectories. Accurate modeling of these trajectories a least squares approach and demonstrates how inputs such as is challenging due to dynamic and time-dependent nature of wind and ramp adjustments influence the resulting jump. Beyond the system. 
In this work, we introduce a dataset of measured the modeling framework, we also developed an application that ski jumps and present a state-space modeling framework that allows general users to interact with the data, run simulations, captures the evolution of jumps under varying conditions. The and visualize jump trajectories through animations. model parameters are estimated using a ridge regression ap- Beyond methodological interest, accurate prediction of ski proach, enabling us to predict trajectories from initial states and jumps can improve athlete safety by anticipating risky condi- wind sensor inputs. We evaluated the predictive performance of tions, support planning of hill design or enlargement, and con- the model through leave-one-out cross-validation and analyzed tribute to fairer competitions through a better understanding of its stability, showing that the approach can generate realistic tra- environmental effects. jectories with reasonable accuracy. To complement the modeling The remainder of the paper is as follows. Section 2 presents results, we developed an interactive web application that allows the handling of received data. Next, the proposed methodology users to explore both recorded and simulated jumps, adjust envi- is described in Section 3. The project results are presented in ronmental factors, and visualize their effects through animations. Section 4. We discuss the results in Section 5 and conclude the Together, the dataset, modeling framework, and the application paper in Section 6. offer a foundation for further research in ski jump analysis and provide an accessible tool for exploring the influence of external conditions on performance. 2 Modeling Framework and Dataset This section describes the handling of data, focusing on state- Keywords space models and our data processing. 
datasets, state-space model, ski jumping, simulations, least squares 2.1 State-Space Model 1 Introduction State-Space Models (SSMs) are a family of machine learning algo- Ski jumping is a sport strongly influenced by both athletic tech- rithms designed to capture and predict the behavior of dynamic nique and environmental conditions. Factors such as wind speed, systems by describing how their inner states change over time. wind direction, and different ski angles affect the trajectory and Instead of only looking at past inputs and outputs, SSMs explic- final distance of a jump, making accurate prediction a challeng- itly model the underlying dynamics, making them well-suited for ing problem. While statistical models and simulations have been sequential data. In state-space modeling, the objective is to iden- applied in sports research for some time, many approaches sim- tify the minimal set of system variables required to completely plify the problem and do not fully capture the dynamic evolution describe the system. These fundamental variables are referred to of the jump over time [11]. as the state variables. At any given time, the state of the system Recent advances in machine learning have introduced methods can be represented by a state vector, whose components corre- capable of modeling temporal systems with greater fidelity. In spond to the values of the respective state variables. SSMs are particular, state-space models provide a mathematical framework designed to predict both the manner in which inputs are reflected for representing hidden internal states that evolve over time in the system’s outputs and the evolution of a system’s internal in response to external input. This makes them well-suited for state over time and in response to specific inputs [2]. ∗ Both authors contributed equally to this research. 
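Written out, the state-space formulation described above takes the standard discrete-time linear form. The following is a sketch consistent with the roles that Section 3.2 assigns to the matrices A, B, C, and D; the next-step observation convention follows that description and is an interpretation, not a formula given by the authors.

```latex
% state evolution: A maps the current state, B the current control
x_{t+1} = A\,x_t + B\,u_t
% observation: C and D map the current state and control to the
% next observation (the convention described in Section 3.2)
y_{t+1} = C\,x_t + D\,u_t
```

Here $x_t$ is the state vector (coordinates, velocities, angles), $u_t$ the control vector (zone-averaged wind features), and $y_t$ the observation vector (coordinates and height above ground).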
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.30

2.2 Least squares method
The least squares method is a regression technique used to determine the line that best fits a given set of data. It minimizes the sum of the squared differences between the observed data and the corresponding values implied by the regression function. Each data point reflects the relationship between a known independent variable and an unknown dependent variable [7]. To enhance the model, we incorporated ridge regression (L2 regularization), which helps to reduce overfitting during model training [12].

2.3 Data Processing
For our project, we used 223 CSV files, each containing the data of one jump, measured on the Gorišek brothers flying hill in Planica, Slovenia. Each file contains 17 columns ('Position', 'Height above ground', 'Time', 'X', 'Y', 'Z', 'Opening Angle', 'Stalling Angle Left', 'Stalling Angle Right', 'Roll Angle Left', 'Roll Angle Right', 'Yaw Angle Left', 'Yaw Angle Right', 'Speed hor.', 'Speed vert.', 'Speed resulting', 'WindTime|WindName|WindSpeed|Wind...'), with the number of rows corresponding to the length of the jump. Data are recorded for every meter of air distance from the take-off point.

The data required some pre-processing before they could be used for training the model. The column 'WindTime|WindName|WindSpeed|...' combined multiple attributes separated by '|'. Data from 12 sensors, each measuring six wind characteristics, were expanded into 12 × 6 = 72 columns, one per sensor–feature pair (sensor_feature).

The main columns are as follows:
• Position - air distance from the take-off point in meters. It begins with a negative value, which represents the distance from the starting point to the take-off point. In ski jumping, the starting point is adjusted according to the wind conditions, so this value is not constant.
• Height above ground - height above ground in meters.
• Time - time of the jump in seconds from the start of the jump.
• X, Y, Z - coordinates of the jumper in 3D space in meters. The X axis is aligned with the hill direction, the Y axis is across the hill, and the Z axis is vertical. The take-off point is (0, 0, 0), as shown in Figure 1.
• Opening Angle - angle between the skis in degrees.
• Stalling Angle Left, Stalling Angle Right - angle between the chord line of the left/right ski and the horizontal plane in degrees.
• Roll Angle Left, Roll Angle Right - angle of the left/right ski around its longitudinal axis relative to the horizontal plane in degrees.
• Yaw Angle Left, Yaw Angle Right - angle between the left/right ski and the horizontal plane in degrees. (The angles are shown in Figure 2.)
• Speed hor., Speed vert., Speed resulting - horizontal, vertical, and resulting speed of the athlete in km/h [13].

Figure 2: Different angles affecting the jump

The wind features are as follows:
• WindTime - time of the wind measurement, in the same format as the Time column. Since wind measurements are recorded less often, each wind value is applied to the most recent jump measurement and then repeated until a new wind measurement is available. Since the wind follows a nonlinear function, it would be hard to capture its movements with interpolation, so we decided to drop this column.
• WindName - name of the sensor (Wi for i = 1, ..., 12).
• WindSpeed - resulting speed of the wind in km/h.
• WindSpeedTangent - tangential wind speed measured along the X axis (hill direction) in km/h.
• WindTurbulence - vertical speed of the wind turbulence in km/h.
• WindSpeedCleanTan - tangential wind speed with turbulence removed, in km/h.
• WindSpeedCross - wind speed measured along the Y axis (across the hill) in km/h.

There are 12 wind sensors spread across the ski jump hill. To help with the analysis, we separated the jump section of the hill into 3 zones. The first zone contains wind sensors 1 to 4, the second zone contains sensors 5 to 8, and the third zone contains sensors 9 to 12 [11].

During processing, we also removed some ski jumps that were incomplete or had corrupted data, so the final dataset contained around 200 ski jumps.

Figure 1: 3D model of the ski jump in Planica with added coordinates [10, 1]

3 Methodology
This section describes our research methodology. We first present the different variations of the SSM that we tested for the ski jump simulation, followed by a description of our model and how it predicts jumps. Finally, we describe our ski jump animation app.

3.1 Different modeling approaches
In addition to the pure SSM, we considered other approaches for modeling ski jumps, including classical physics-based models, but the data are not sufficient to accurately capture all the forces acting on the jumper. We also tried a hybrid approach that combined an SSM with Physics-Informed Neural Networks (PINNs [14]), where the SSM would provide a baseline prediction and the PINN would learn to correct any discrepancies, taking into account physical properties of the system, such as the mass of the pilot, the properties of the wind, and gravitational force [4].
These parameters are included in the equations of motion and added to the total loss function, so the model prefers solutions that are consistent with the laws of physics. This turned out to be less effective than a pure SSM approach, but the reason exceeds the purpose of this paper. More about the errors and the comparison of models is given in Section 4.1.

3.2 Ski jump prediction model
In order to fit our data to the SSM, we stored the data of each file in three vectors. The main vector contains the states, or state variables, of the system, which in our case are the X, Y, and Z coordinates, the jumper velocities, and all angles (opening, stalling, roll, and yaw) [6]. The observation vector contains the measured outputs of the system, which in our case are the X, Y, and Z coordinates and the height above ground. The controls contain the external inputs to the system, which in our case are the wind measurements from all the sensors, averaged over each zone and feature (speed, tangent, cross, and turbulence).

We then used ridge regression to estimate the matrices A, B, C, and D of the SSM, as shown in Figure 3, where we minimized the difference between the values computed from the current state and control and the next time-stamped values. Thus, matrix A computes the next state from the current state, B computes the next state from the current control, C computes the next observation from the current state, and D computes the next observation from the current control. We then use recursion to predict the next state from the prediction of the previous state and the current control, to obtain the full simulated jump. This allows us to predict the jump trajectory based on the environmental conditions and the starting state of the jumper [9].

The application (described in Section 3.3) presents the results as an animated visualization of the ski jump, showing the full trajectory and the final distance. In this way, the application functions both as an analytical tool, helping to test how different conditions affect performance, and as an educational resource that makes the mechanics of ski jumping easier to understand for a wider audience. It is available online.¹

4 Main results
In this section, we present the results of our simulations. Firstly, we present a statistical comparison of all the models, followed by a precise analysis of our predictions.

4.1 Models' error
In order to evaluate the different models, we first had to define a metric to measure the prediction error. Since actual and simulated jumps are represented with x, y, and z coordinates but are measured at different time stamps and can contain a different number of measurements, we had to find a way to compare them. We first tried to project the shorter trajectory onto the other one and compute the distance between the original and the projection, but this method turned out to be computationally expensive. So we decided to compute the distance between the actual and the simulated jumps by interpolating both jumps. The new measurements contain the start and end points and all the points where x reaches a natural (integer) value. We then compute the error as the norm of the difference between the two jumps. After one of the jumps ends, we add the distance from the end of the shorter jump to the end of the longer jump to the error. In this way, we penalize the model for not being able to predict the correct length of the jump.

Since we had a limited number of jumps, we used leave-one-out cross-validation to evaluate the models. For each jump, we trained the model on all other jumps and then simulated the left-out jump. We then calculated the average error between the actual jump and the simulated jump for both the training set and the test set, as shown in Figure 4.

In the process of developing our ski jump prediction model, we evaluated several variations to determine the most effective approach.
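The estimation, simulation, and error-metric steps of Sections 3.2 and 4.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the regularization strength, and the restriction to the A and B matrices (C and D are estimated analogously) are all illustrative choices.

```python
import numpy as np

def fit_ssm_ridge(states, controls, lam=1.0):
    """Ridge estimate of A, B in x_{t+1} = A x_t + B u_t.
    states: (T, n) state vectors; controls: (T, m) control vectors.
    The observation matrices C and D can be estimated the same way."""
    X = np.hstack([states[:-1], controls[:-1]])   # regressors [x_t, u_t]
    Y = states[1:]                                # targets x_{t+1}
    # closed-form ridge solution: W = (X'X + lam*I)^{-1} X'Y
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
    n = states.shape[1]
    return W[:n].T, W[n:].T                       # A, B

def simulate(A, B, x0, controls):
    """Roll the fitted model forward recursively from the initial state."""
    xs = [np.asarray(x0, dtype=float)]
    for u in controls:
        xs.append(A @ xs[-1] + B @ u)
    return np.array(xs)

def trajectory_error(actual, simulated):
    """Sketch of the Section 4.1 metric: interpolate both (x, y, z)
    trajectories on a shared integer-x grid, take the norm of the
    pointwise difference, and add the distance between the two end
    points as a penalty for predicting the wrong jump length."""
    lo = max(actual[:, 0].min(), simulated[:, 0].min())
    hi = min(actual[:, 0].max(), simulated[:, 0].max())
    xs = np.arange(np.ceil(lo), np.floor(hi) + 1)     # shared integer x grid
    a = np.column_stack([np.interp(xs, actual[:, 0], actual[:, k]) for k in (1, 2)])
    s = np.column_stack([np.interp(xs, simulated[:, 0], simulated[:, k]) for k in (1, 2)])
    return np.linalg.norm(a - s) + np.linalg.norm(actual[-1] - simulated[-1])
```

Leave-one-out cross-validation then amounts to calling `fit_ssm_ridge` on all jumps but one and scoring the held-out jump with `trajectory_error`.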
We compared the performance of a pure SSM with a hybrid model that combined the SSM with a PINN. The pure SSM demonstrated superior predictive accuracy, probably due to its ability to directly model the temporal dynamics of ski jumps without the added complexity of PINNs. We also experimented with different configurations of the SSM, including using all available wind sensor data versus an averaged value per zone. When we used all sensors, the average error per point was 1.67 m on the training data and 1.89 m on the test data, while when we averaged the sensors over the zones, the error was 1.76 m on the training data and 1.82 m on the test data. This suggests that averaging the wind data helps with the simulation.

Figure 3: Schema of the SSM matrices [3]

3.3 Ski jump animation app
To make our results accessible beyond the research setting, we developed an interactive web application using Shiny for Python [8]. The application serves as a front-end to the trained state-space model and allows users to explore ski jump simulations under varying environmental conditions, or simply to observe different measured ski jumps. Firstly, through a set of input controls, users can adjust factors such as wind speed, wind directions, or different ski angles, and the application instantly updates the predicted jump trajectory. Secondly, users can explore random jumps from the provided dataset or upload their own CSV file of measured jumps, as long as it includes the columns described in Section 2.3.

¹https://camlekn.shinyapps.io/ski-jump/

4.2 Analysis of our model
Wind is a critical factor in ski jumping, so we attempted to capture its nonlinear effects by including columns for the squared wind features. However, we found that adding these squared terms did not significantly reduce the prediction error.

Since the simulation still requires numerous inputs, we made it interactive, allowing users to adjust the wind conditions and observe their impact on the jump. In the ski jumping app, users can manipulate sliders to set the wind speed, wind tangent, wind cross, and wind turbulence for each of the three zones. As a result, the wind loses its original movement function in the simulation. All other inputs are set to the average values computed from the dataset.

Figure 4: Actual vs. simulated ski jump trajectory

5 Discussion
This section examines the predictive performance of the trajectories, highlights the limitations of our current approach, and suggests directions for future improvements.

5.1 Limitations
Given the relatively small dataset of ski jumps, the main limitation of our project lies in the limited data available for training the model. After preprocessing, the dataset contained only about 200 jumps, which may limit the SSM's ability to represent the full range of trajectory variations under different circumstances. As a result, the model may struggle to accurately predict jumps under novel or extreme conditions.

Furthermore, the dataset lacks detailed information, or any information at all, about individual jumpers, such as body weight, sex, or other physiological characteristics that are known to influence jump performance. Incorporating these variables could improve model accuracy and provide more personalized predictions [5].

Lastly, due to limited computing power, only one CPU was available, restricting the use of possibly better models. To address these challenges, using cloud-based resources could help run larger models and improve the prediction of trajectories.

5.2 Future work and potential improvements
Although the current approach shows promise, there are several avenues for future improvement, some of which we are working on at the time of writing this paper.

Currently, we are working on improving the sliders' functions. Since the wind data determined by the user is static throughout the jump, this adds a lot of generalization. In reality, wind conditions can change rapidly during a jump. So we would like to add additional controls to the app that would allow the user to define how the wind changes during the jump. They could choose whether the wind would gain or lose a certain feature (such as speed or turbulence) during the jump.

Expanding the dataset to include more jumps and additional contextual information about individual jumpers could improve the accuracy of the model. We could try to generate more data by using data augmentation techniques, such as adding noise to the wind measurements or slightly modifying the angles. We could also try to find the nonlinear movements of the wind and interpolate the wind measurements by their original time stamps to better capture the wind dynamics.

6 Conclusion
This paper presents a method for predicting ski jump trajectories based on environmental conditions. By incorporating external factors into the modeling framework and applying least squares estimation, we demonstrated that the model is capable of capturing the dynamics of ski jumps and producing realistic trajectory predictions. In addition, we developed an interactive application that makes the results accessible to a broader audience through simulations and animations of predicted jumps. Although the current model is limited by the size of the dataset and the absence of certain athlete-specific variables, the results show that state-space models are a promising tool for analyzing ski jumping performance.

7 Acknowledgments
This work was supported by Smučarska Zveza Slovenije (Ski Association of Slovenia), whom we would like to thank for providing the ski jump data.

References
[1] 3DWarehouse via 3dmdb.com. 2025. "Ski Jumping Planica" [3D model]. https://3dmdb.com/en/3d-model/ski-jumping-planica/8386000/?free=True&q=Ski+jump. Free model; accessed 2025-08-26.
[2] Masanao Aoki. 1990. State Space Modeling of Time Series (2nd, revised and enlarged ed.). Universitext. Springer Berlin Heidelberg, Berlin, Heidelberg. ISBN 978-3-642-75883-6. doi:10.1007/978-3-642-75883-6.
[3] Dave Bergmann. 2025. What is a state space model? https://www.ibm.com/think/topics/state-space-model. Accessed 2025-09-24.
[4] Shengze Cai, Zhiping Mao, Zhicheng Wang, Minglang Yin, and George E. Karniadakis. 2021. Physics-informed neural networks (PINNs) for fluid mechanics: a review. Acta Mechanica Sinica, 37, 12, 1727–1738. doi:10.1007/s10409-021-01148-1.
[5] Wolfram Müller. 2008. Performance factors in ski jumping. In Sport Aerodynamics. CISM International Centre for Mechanical Sciences, Vol. 506. Helge Nørstrud, editor. Springer, Vienna, 139–160. ISBN 978-3-211-89296-1. doi:10.1007/978-3-211-89297-8_8.
[6] Wolfram Müller. 2006. The physics of ski jumping. Tech. rep. CERN report on the aerodynamics and physics of ski jumping. CERN. https://cds.cern.ch/record/1009275/files/p269.pdf.
[7] Bor Plestenjak. 2016. Razširjen uvod v numerične metode. Slovenian textbook on numerical methods. DMFA-založništvo.
[8] Posit Team. 2025. Shiny for Python. https://shiny.posit.co/py/. Accessed 2025-08-29.
[9] Serrano.Academy. 2025. State-Space Model (SSM) tutorial. Video. https://youtu.be/g1AqUhP00Do.
[10] Ski Jumping Hill Archive, skisprungschanzen.com. 2025. Letalnica (Letalnica bratov Gorišek), Planica, Slovenia. https://www.skisprungschanzen.com/EN/Ski+Jumps/SLO-Slovenia/Planica/0475-Letalnica/. Accessed 2025-09-12.
[11] Ava Thompson, ed. 2025. Ski Jumping. Publifye AS. Found via Google Books at https://www.google.si/books/edition/Ski_Jumping/G2pPEQAAQBAJ?hl=en&gbpv=0.
[12] Wessel N. van Wieringen. 2015. Lecture notes on ridge regression. arXiv preprint arXiv:1509.09169. Revision v8; revised 27 June 2023. doi:10.48550/arXiv.1509.09169.
[13] Mikko Virmavirta and Juha Kivekäs. 2019. Aerodynamics of an isolated ski jumping ski. Sports Engineering, 22, 1, 1–6. doi:10.1007/s12283-019-0298-1.
[14] StatQuest with Josh Starmer. 2025. Neural networks tutorial. Video. https://youtu.be/CqOfi41LfDw.

Predicting milling overload based on sensor data: a graph-based approach

Roy Krumpak, Jožef Stefan Institute, Ljubljana, Slovenia, krumpak.roy@gmail.com
Jože M. Rožanec, Jožef Stefan Institute, Ljubljana, Slovenia, joze.rozanec@ijs.si
Dunja Mladenić, Jožef Stefan Institute, Ljubljana, Slovenia, dunja.mladenic@ijs.si
Zhenyu Guo, BGRIMM Technology Group, Beijing, China, guozhenyu@bgrimm.com
Tao Song, BGRIMM Technology Group, Beijing, China, songtao@bgrimm.com
Dumitru Roman, SINTEF Digital, Oslo, Norway, titi.roman@sintef.no
Inna Novalija, Jožef Stefan Institute, Ljubljana, Slovenia, inna.koval@ijs.si
Xiang Ma, SINTEF Industry, Oslo, Norway, xiang.ma@sintef.no

ABSTRACT
In this paper, we present an approach to predict milling overload that leverages time series-to-graph transformations, which, along with domain data encoded as a graph, are fed to predictive machine learning models. The contributions of this paper include the use of multiple graph representations (not just one) to capture the structure of a time series and an evaluation of the described approach on a real-world dataset. Additionally, we compared the performance of the graph-based approach with the TS2Vec foundational model, regarded as the state of the art. Our results show that TS2Vec performed best across all time windows.
While combining TS2Vec and graph embeddings resulted in reduced performance compared to TS2Vec alone, it enhanced the outcomes compared to the sole use of graph embeddings. Furthermore, combining Ordinal Partition Graph and TS2Vec embeddings resulted in more stable performance across predictive time windows.

KEYWORDS
Time series, graphs, mining, milling, predictive maintenance, sensor data

1 INTRODUCTION
Milling, central to mineral processing, involves breaking down ores into smaller particles, but is prone to abnormal behavior due to material properties and upstream steps (Hodouin et al. 2001 [3]; Galán et al. 2002 [2]). While traditional control relied on operators, advances in machine learning (ML) have enabled data-driven optimization and predictive maintenance (Mobley 2002 [6]). Graph-based methods are increasingly applied to time series to capture temporal and structural relations (Silva 2021 [8]). Variants include Natural Visibility Graphs (NVG) to capture the time series topology (Lacasa et al. 2008 [4]; Stephen et al. 2015 [10]), Quantile Graphs for transitions between time series values (Silva et al. 2024 [9]), and Ordinal Partition Graphs to capture regular temporal patterns and their transitions.

The contributions of this paper include the use of multiple graph representations (not just one) to capture the structure of a time series and the evaluation of the described approach on a real-world dataset.

2 USE-CASE DESCRIPTION
BGRIMM Technology Group is a Chinese leader in mining and mineral processing solutions, focusing on automation and intelligent control, with grinding optimization as a core area. Grinding is both the most energy-intensive step in mineral processing, accounting for 40% of total energy costs, and a key determinant of downstream recovery and product quality (Zhou et al. 2009 [11]; Lessard et al. 2016 [5]; Groenewald et al. 2006 [1]). At a 10,000 ton/day copper plant in Anhui Province using a SAG–ball–pebble (SABC) circuit, BGRIMM is developing intelligent control strategies to maximize throughput while preventing SAG mill overload. Central to this effort is accurate SAG power prediction, which serves as a feedforward signal to improve feed regulation and overall process efficiency.

3 DATASET
The dataset used in this article was collected and provided by BGRIMM Technology Group. The data consists of various sensor measurements from the machines used in their mine's ore processing plant, accounting for a total of 42 columns. One column stores the date and time of the measurement, while the rest contain numerical values. The sensor data was sampled every two seconds and compiled across a hundred days, from January 1st, 2019, to April 12th, 2019, excluding the first two days of April, resulting in 4.32 million rows. Besides the raw data, a description of an overload state was also provided. A column named SAG_2201.power, which represents the power of the SAG mill, is used to decide whether there is an anomaly in the data. If the column reaches a value above 4700 kW and has an upward trend, or whenever it surpasses the value of 4800 kW, this is considered an overload of the system, and a supervisor might take appropriate actions to stop the overloading.
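The overload rule just described can be written compactly. The following is an illustrative sketch (not the authors' code): the 4700/4800 kW thresholds come from the text, and the upward-trend test is a least-squares slope over a trailing 1-hour window of 1800 two-second samples.

```python
import numpy as np

def label_overload(power_kw, window=1800, soft=4700.0, hard=4800.0):
    """Illustrative sketch of the overload rule described in the text
    (not the authors' code): a sample is an overload if it exceeds the
    hard threshold, or exceeds the soft threshold while the trailing
    window (1800 samples = 1 hour at 2 s sampling) trends upward."""
    power_kw = np.asarray(power_kw, dtype=float)
    labels = np.zeros(power_kw.size, dtype=int)
    x = np.arange(window)
    for i, p in enumerate(power_kw):
        if p > hard:
            labels[i] = 1
        elif p > soft and i + 1 >= window:
            # slope of a least-squares line over the trailing window
            slope = np.polyfit(x, power_kw[i + 1 - window:i + 1], 1)[0]
            if slope > 0:
                labels[i] = 1
    return labels
```

Applied to the SAG_2201.power series, such a function yields the 0/1 labels that later serve as the prediction target.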
Figure 1: The diagram depicts the milling plant components and how they are connected. The components of interest are highlighted with red rectangles.

Figure 2: SAG_2201.power column (light gray), where anomalies (gray dots) are annotated based on the moving average (gray), the automatic anomaly label threshold (dotted black), the possible anomaly label threshold (solid black), and the linear regression slope (positive: dashed and dotted; negative: dashed).

4 METHODOLOGY

4.1 Data preparation
Based on experts' input, the samples with SAG_2201.power < 4700 were labeled 0 (no anomalous event) and the others 1 (milling overload). A 1-hour (1800-sample) moving average with linear regression checked for upward trends; if there was none, the label was reset to 0 (see Fig. 2). Next, we selected a subset of columns to be used in the analysis, utilizing expert knowledge to choose only those columns that are measured in the workflow before the SAG_2201.power column. The resulting columns are LIT_2103A.PV, FCV_2201.PID_SP, SAG_2201.Press_Ziyouduangaoya2, Feeder_Control.SP, WIT_2101.PV, and SAG_2201.power.

4.2 Feature engineering
The raw data from the selected columns was first checked for any missing values, which were not present. In the next step, we detected changes in the columns and then replaced the values in the samples between two such changes with the mean value of that segment (see Fig. 3a). This data was further simplified with the help of a k-bins discretizer, which was used to encode each column with seven values based on the quantile into which each sample fell (see Fig. 3b). The column named WIT_2101.PV was excluded from this first step of data simplification and from the graph representations, and was processed separately, because its values did not appear to have distinct oscillating levels and did not benefit from such processing. After discretization, every column had an integer value between zero and six, with each row then interpreted as a state. The average state duration is 42 seconds. Repeated states (duplicate rows) were dropped, decreasing the size of the dataset (see Fig. 3c). For a visual representation of these steps, see Fig. 3, where the data from one panel is reused and, where important, also noted in the next one: the raw data in Fig. 3a, the 'means' data in Fig. 3a and Fig. 3b, the simplified data in Fig. 3b, and the unique sample data in Fig. 3c. The annotated plot in Fig. 3c is used as the base data for an example NVG generation in Fig. 4. The numbers represent the same data point, one in the plot and one in the graph representation.

Figure 3: Pipeline of transformations on the SAG_2201.power column. (a) SAG_2201.power column (light gray), where a threshold change detection was used to detect changes and to replace in-between values with the mean value (black). (b) Result (dashed black) of applying a k-bins discretizer model on the previously simplified data (solid black) from Fig. 3a. Note the different y-axis scales of the overlaid graphs. (c) A representation of the simplified column data from Fig. 3b, considering only the unique consecutive values.

Figure 4: The Natural Visibility Graph representation of the data in Fig. 3c.

4.3 Modeling the data as graphs
We employ three strategies for converting time series into graphs: Natural Visibility Graphs (NVG), Ordinal Partition Graphs (OPG), and Quantile Graphs (QG). We used the time series to graph and back library (https://timeseriestographs.com/) to achieve this.

For each sample in the data, we built a graph representation of it by looking at the samples within a selected window w_s preceding it and applying the described time-series-to-graph strategies on each column, apart from WIT_2101.PV, separately. Such graphs, called subgraphs, were bound to a default graph structure that represents which columns are neighboring in the plant process (see Fig. 1) by connecting the node which represents the SAG_2201.power column to every other column. The result of this step was a larger type of graph called a state graph (see Fig. 5). The black nodes represent nodes for a particular column, while gray nodes represent the subgraphs created from the time series. The subgraphs are connected to the column nodes via the node that corresponds to the first instance from the time series. Depending on the experiment, we made an additional step of joining w_0 many of the state graphs into a larger graph, which was used to generate embeddings.

Figure 5: Example of a state graph.

A Graph2Vec model from the karateclub library [7] was used to generate graph embeddings, with an embedding size of 250. We chose this model for its ease of use and for performance reasons. The column WIT_2101.PV was also transformed into an embedding form by using a TS2Vec model (https://github.com/zhihanyue/ts2vec). The embedding output size was set to 40, as this is approximately the number of graph-embedding features per column.

4.4 Model training and evaluation
An initial subset of the data, which included the data from the first available day, was used to test the performance of different graph embeddings. This was done to reduce the time and memory consumption of the first assessment. A CatBoost model was used, trained for 800 iterations with a learning rate of 0.03, the Cross Entropy loss function, and the leaf regularization parameter set to 0.3. To assess our model's ability to predict anomalous states, we also tried to fit the model on the same data, but with the target column shifted accordingly. This was done for up to 90 shifts, which is equivalent to predicting 63 minutes in advance. Once we had selected the best graph embeddings, we built and tested the model on the entire data set.

5 EXPERIMENTS
We conducted three experiments, all of which follow the same template, where we tested how the structure of a graph affects the end model's ability to predict anomalies. This includes first creating subgraphs as NVG, OPG, and QG representations of the columns with window size w_s and joining them into the state graph representation (see Fig. 5). Finally, w_0 many of these state graphs are joined sequentially according to the order given by the time at which the represented states appear in the data. The experiments differ in the window sizes w_s and w_0: Experiment A used w_s = 50 and w_0 = 1, Experiment B used w_s = 15 and w_0 = 20, and Experiment C used w_s = 15 and w_0 = 40. If we take the average state duration of 42 seconds into account, we see that Experiment A uses data from the last 35 minutes, Experiment B from the last 15 minutes, and Experiment C from the last 28 minutes.

We also carried out experiments similar to Experiment B, where the state graphs were structured based on only one specific type of subgraph. Furthermore, the impact of the separately processed WIT_2101.PV column was tested by repeating the same experiments with the difference that this column's embeddings were excluded when training the final model. These experiments do not have a mark in the 'WIT' column of the results in Table 3.
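As a concrete illustration of the NVG representation used above, the visibility criterion of Lacasa et al. [4] can be implemented in a few lines: two samples are connected when the straight line between them passes above every sample in between. This minimal O(n²) sketch is ours; the paper itself relies on the time series to graph and back library.

```python
def natural_visibility_edges(series):
    """Edge set of the Natural Visibility Graph of a series: nodes are
    sample indices, and (a, b) is an edge when every intermediate sample
    lies strictly below the line of sight from a to b. Minimal O(n^2)
    sketch for illustration; the paper uses an external library."""
    n = len(series)
    edges = set()
    for a in range(n):
        for b in range(a + 1, n):
            if all(
                series[c] < series[b] + (series[a] - series[b]) * (b - c) / (b - a)
                for c in range(a + 1, b)
            ):
                edges.add((a, b))
    return edges
```

For instance, in the series [1, 0.5, 2] every pair of samples is mutually visible, while in [1, 3, 1] the middle peak blocks the line of sight between the endpoints.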
Lastly, a separate experiment was carried out, in which all raw data were processed using the TS2Vec model. Each column had its own TS2Vec model, which was used to embed the data associated with that column. Then, a CatBoost model with the same configuration as in the previous experiments was used in combination with the joined TS2Vec embeddings to predict the anomalies. These results are gathered in Table 1.

6 RESULTS
The results of the three experiments, which tested the informativeness of the graph structure, as well as the experiments designed to determine which type of data is the most predictive, are summarized in the following tables.

Table 1: ROC AUC results of the experiment where all data was embedded with TS2Vec models.

Time to predict [min] |      7 |     21 |     35 |     49 |     63
ROC AUC               | 0.9905 | 0.9528 | 0.8929 | 0.8235 | 0.7623

Table 2: ROC AUC results of the three experiments with respect to how far ahead the model is predicting.

Experiment |      7 |     21 |     35 |     49 |     63
A          | 0.6083 | 0.5763 | 0.5356 | 0.5333 | 0.4945
B          | 0.6184 | 0.6943 | 0.6698 | 0.6364 | 0.6128
C          | 0.5897 | 0.5688 | 0.6109 | 0.5910 | 0.6417

Table 3: ROC AUC results of the models trained on different types of graphs and data for Experiment B across all days. The check marks indicate which embeddings were used (NVG, OPG, and QG subgraphs, and the separately processed WIT_2101.PV column).

Data used      |      7 |     21 |     35 |     49 |     63
✓ ✓ ✓ ✓        | 0.6558 | 0.6418 | 0.6251 | 0.6402 | 0.6184
✓ ✓ ✓          | 0.5938 | 0.6257 | 0.5831 | 0.5882 | 0.5725
✓ ✓            | 0.7427 | 0.7146 | 0.6930 | 0.6853 | 0.6719
✓ ✓            | 0.7265 | 0.6959 | 0.6586 | 0.6502 | 0.6365
✓              | 0.7452 | 0.6978 | 0.6838 | 0.6734 | 0.6578
✓              | 0.7219 | 0.6866 | 0.6643 | 0.6416 | 0.6096
✓ (WIT only)   | 0.9292 | 0.9025 | 0.8893 | 0.8004 | 0.7042

As can be seen in Table 2, Experiments A and C have lower scores than Experiment B. However, Experiment C approaches the performance of Experiment B at the maximum prediction shift. For this reason, and because the graphs in Experiment B are smaller compared to those in Experiment C, the experiments that tested the impact of different types of data used Experiment B-type graphs. The best results for the final model were obtained from the data where all columns were embedded using TS2Vec models, as shown in Table 1. Similarly, the results in Table 3 show that the performance is best when we predict anomalies from only the TS2Vec embeddings of the WIT_2101.PV column.

Additionally, if we compare the experiments with WIT_2101.PV embeddings to the ones without them, we can see that the latter perform worse. This suggests that the TS2Vec embeddings are more informative than the graph embeddings. Nevertheless, when comparing the different types of graphs used in the final graph, we can see that OPGs alone yield the best performance.

A few possible explanations for the difference in performance between the graph-based and time series-based approaches are possible. First, when working with graphs, there are more parameters that need to be optimized, such as window sizes and parameters for constructing graphs from time series. Another reason might be that NVGs have approximately thirty times more edges and eight times more nodes compared to OPGs and QGs, which makes them disproportionately large. Additionally, the construction of state graphs contains repeated structures, which is inefficient. Lastly, the TS2Vec embeddings do not have these limitations, and embeddings can be made from the entirety of the data, as opposed to the simplified data used when not using TS2Vec.

7 CONCLUSIONS
In this paper, we discuss the use of graph-based time series representations for training machine learning models. Our experiments suggest that while this approach has potential, it did not outperform the TS2Vec foundational model and was unable to yield superior results when combined with it. Future work will explore alternative graph representations and utilize GNNs to integrate topological, semantic, and time series information directly into a single machine learning model, aiming to achieve superior results.

ACKNOWLEDGEMENTS
The Slovenian Research Agency supported this work. It was also developed as part of the Graph-Massivizer project (grant agreement No. 101093202), the enRichMyData project (grant agreement No. 101070284), and the DataPACT project (grant agreement No. 101189771), all funded by the Horizon Europe research and innovation programme of the European Union.

REFERENCES
[1] J.W. de V. Groenewald, L.P. Coetzer, and C. Aldrich. 2006. Statistical monitoring of a grinding circuit: an industrial case study. Minerals Engineering, 19, 11, 1138–1148. doi: 10.1016/j.mineng.2006.05.009.
[2] O. Galán, G.W. Barton, and J.A. Romagnoli. 2002. Robust control of a SAG mill. Powder Technology, 124, 3, 264–271. doi: 10.1016/S0032-5910(02)00021-9.
[3] D. Hodouin, S.-L. Jämsä-Jounela, M.T. Carvalho, and L. Bergh. 2001. State of the art and challenges in mineral processing control. Control Engineering Practice, 9, 9, 995–1005. doi: 10.1016/S0967-0661(01)00088-0.
[4] L. Lacasa, B. Luque, F. Ballesteros, J. Luque, and J.C. Nuño. 2008. From time series to complex networks: the visibility graph. Proceedings of the National Academy of Sciences, 105, 13, 4972–4975. doi: 10.1073/pnas.0709247105.
[5] J. Lessard, W. Sweetser, K. Bartram, J. Figueroa, and L. McHugh. 2016. Bridging the gap: understanding the economic impact of ore sorting on a mineral processing circuit. Minerals Engineering, 91, 5, 92–99. doi: 10.1016/j.mineng.2015.08.019.
[6] R. Keith Mobley. 2002. 4 - Benefits of predictive maintenance. In An Introduction to Predictive Maintenance (Second Edition). Plant Engineering. R. Keith Mobley, editor. Butterworth-Heinemann, Burlington, 60–73. isbn: 978-0-7506-7531-4. doi: 10.1016/B978-075067531-4/50004-X.
[7] B. Rozemberczki, O. Kiss, and R. Sarkar. 2020. Karate Club: an API oriented open-source Python framework for unsupervised learning on graphs. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20). ACM, 3125–3132. doi: 10.1145/3340531.3412757.
[8] V.F. Silva, M.E. Silva, P. Ribeiro, and F. Silva. 2021. Time series analysis via network science: concepts and algorithms. WIREs Data Mining and Knowledge Discovery, 11, 3, 1–39. doi: 10.1002/widm.1404.
[9] V.F. Silva, M.E. Silva, P. Ribeiro, and F. Silva. 2024. Multilayer quantile graph for multivariate time series analysis and dimensionality reduction. International Journal of Data Science and Analytics, 1–13. doi: 10.1007/s41060-024-00561-6.
[10] M. Stephen, C. Gu, and H. Yang. 2015. Visibility graph based time series analysis. PLoS ONE, 10, 11, e0143015. doi: 10.1371/journal.pone.0143015.
[11] P. Zhou, T. Chai, and H. Wang. 2009. Intelligent optimal-setting control for grinding circuits of mineral processing process. IEEE Transactions on Automation Science and Engineering, 6, 4, 730–743. doi: 10.1109/TASE.2008.2011562.

Short and Long Term Bike Rental Forecasting

Oskar Kocjančič* (Jožef Stefan Institute, Ljubljana, Slovenia), oskar.kocjancic@gmail.com
Martin Žnidaršič* (Jožef Stefan Institute, Ljubljana, Slovenia), martin.znidarsic@ijs.si

* Both authors contributed equally to this research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.7

Abstract
This paper describes the challenges and outcomes of forecasting bike rentals in a Slovenian urban bike-sharing system, focusing on the impact of data sparsity and the inclusion of external variables. We address two distinct forecasting tasks: short-horizon, one-day-ahead predictions for individual rental stations, and long-horizon, 90-day forecasts for the total rental volume. Various machine learning models were employed and evaluated in this context. We also analyzed the trade-off between using longer historical data versus shorter, weather-enriched data to improve predictive accuracy. The findings indicate a clear correlation between data sparsity at the station level and predictive performance. While the inclusion of weather data provides a modest improvement for both short-horizon and long-horizon forecasts, the overall quality of the sparse and noisy data appears to limit the potential gains from more complex modeling approaches.

Keywords
bike-sharing, forecasting, time series, data sparsity, machine learning, deep learning, weather data

1 Introduction
Predicting rental patterns of urban bike-sharing systems is challenging due to complex dynamics, including strong seasonality and trends, as well as dependence on external variables such as weather and calendar effects. Furthermore, data sparsity, particularly at the individual station level, presents a significant obstacle to building reliable predictive models. By accurately predicting bike demand, operators can improve redistribution and station availability, fostering a more reliable and sustainable urban mobility system.

This paper addresses these challenges by investigating two distinct forecasting tasks using a real-world dataset from a Slovenian city. First, we examine short-horizon, one-day-ahead predictions for individual stations to quantify the impact of data sparsity on forecastability. Second, we evaluate the accuracy of 90-day long-horizon forecasts for the total rental volume aggregated across all stations. We compare a suite of models, including classical machine learning approaches and LSTM neural networks [5], and explicitly analyze the trade-off between using longer historical data versus shorter, weather-enriched data to improve predictive accuracy. This work aims to help bike-sharing systems improve operational efficiency, reduce bike shortages, and inform city planning initiatives related to sustainable transportation.

Prior studies on bicycle rental forecasting often use the Washington, D.C. dataset [4]. Du et al. [2] addressed long-horizon prediction, while Karunanithi et al. [6] focused on short-horizon forecasting, both achieving results comparable to ours. In contrast, our dataset differs substantially by including station-level information, which enables per-station forecasting. We tackle both short- and long-horizon tasks, as well as the analysis of the impact of exogenous weather variables.

2 Data
The dataset we used originates from a public bicycle rental service in a Slovenian city. It contains daily rental counts for individual stations within the municipality, covering the period from January 1, 2021, to May 15, 2025. Although the dataset also records bike return counts, our work focuses exclusively on rentals.

2.1 Features
Figure 1: Pearson correlation coefficients of our features

Dependent Variable: The target feature we are forecasting.
• total_rentals: The total daily number of bike rentals. Based on the task, this is either the total count across all stations or the per-station bike rental count.

Independent Variables: The features used for prediction.
• Temporal Features:
  – date: The specific date.
  – ordinal_day: The day number within the year.
  – weekday: A category for the day of the week.
  – holiday: Indicator (0 or 1) if the day is a holiday.
• Weather-Related Features (note: our weather data only spans the date range 2024-01-01 to 2025-05-14):
  – air_temp_2m_C: Air temperature.
  – rel_humidity_percent: The relative humidity.
  – precipitation_mm: The precipitation per square meter.

2.2 Data Preprocessing
Figure 2: Distribution of bike rentals across all stations. The vertical blue line indicates the start of the year 2024.

The dataset structure prevented distinguishing missing values from true zeros (i.e., days when no rentals occurred), so all empty or null entries were treated as zeros. This resulted in sparsity for some stations, in which many entries had little information on rental activity. To prevent this from impacting our analysis, we excluded stations with more than 33% zero entries, retaining 25 stations out of the original 48. For the machine learning methods described later, we also implemented a set of lagged features:
• total_rentals_mean_7_days: Average rental count over the 7 days preceding the current data point.
• total_rentals_mean_14_days: Average rental count over the 14 days preceding the current data point.
• total_rentals_mean_21_days: Average rental count over the 21 days preceding the current data point.
• total_rentals_mean_28_days: Average rental count over the 28 days preceding the current data point.

2.3 Exploratory Data Analysis
Figure 3: Rentals per day of the week

The data exhibits pronounced weekly and monthly seasonalities, as well as non-stationarity, as illustrated in Figures 3 and 4. Annual patterns show rental activity declining in winter, rising in spring, peaking in summer, and gradually decreasing in autumn, with weekends consistently exhibiting lower rental counts. Anomalous behavior was observed in the winter of 2024, when rental counts were markedly higher than typical seasonal levels.

The Pearson correlation coefficients (Figure 1) between features related to bicycle rentals indicate that the number of daily rentals (total_rentals) is strongly and positively associated with recent rental trends, as reflected by correlations of 0.73, 0.67, 0.64, and 0.63 with the 7-, 14-, 21-, and 28-day moving averages, respectively. A strong positive correlation is also observed with air temperature (0.59), whereas moderate negative correlations are found with relative humidity (-0.43) and precipitation (-0.31), suggesting that rentals are more frequent on warm, dry days. Weaker associations are present with the day of the week (-0.27) and holiday status (-0.10). As expected, the moving average features exhibit high intercorrelation (e.g., 0.94 between the 7- and 14-day means) due to their overlapping calculation windows.

3 Experiments
This study pursued two primary objectives. First, we examined the feasibility of forecasting bicycle rentals one day in advance and evaluated how forecastability varies across stations with different data sparsity. Second, we investigated long-horizon forecasting over a 90-day period, focusing exclusively on predicting the total number of rentals. In this task, standard machine learning models were trained on historical data and then used recursively to generate forecasts for the entire period. Due to this setup, the results for DS_W suffer from data leakage. Specifically, a single model is trained using past rental counts and future weather information, so, for example, predicting rentals in July involves access to the actual recorded weather conditions for that month, which artificially improves performance.

3.1 Training and Test Data Split
Because the available weather data was limited to the years 2024
and 2025, while the rental dataset spanned from 2021 onward, we constructed three distinct datasets. Here, each entry corresponds to a single day and includes rental data for all stations. The first dataset, DS_W, combined rental and weather data (498 entries). The second, DS_NO_W, included only rental data for the same period (498 entries). The third, DS_FULL, comprised the complete rental dataset without weather data (1,593 entries).

The data splitting strategy differed in the two tasks. For the station-level one-day-ahead forecasting task, each dataset was divided into 25 subsets, corresponding to individual stations. Within each subset, random sampling was used to split the data into training and testing sets with an 80:20 ratio. The target variable in each subset is the specific station's rental count.

For the long-horizon task, no station-level subdivision was performed, as only total rental counts were modeled. The final 90 days were used as the test set, roughly corresponding to a temperate season, allowing us to assess whether the models capture seasonal patterns in a new period while maintaining realistic temporal separation between training and testing data.

Figure 4: Bike rental data with temperate seasons

3.2 Models and Algorithms Used
For the long-horizon forecasting task, the AutoARIMA model served as the baseline, while for the one-day-ahead forecasting task, the baseline was the Mean Regressor, which predicts using the 7-day lag mean.

We evaluated several machine learning models, including Random Forest (500 trees, max_features=0.9), Gradient Boosting (500 estimators), Linear Regression, and SVM (C = 10, degree = 2, γ = 0.1, linear kernel). The hyperparameters for the Random Forest and SVM models were selected using a grid search optimization procedure; the rest of the models used default parameters. For the Random Forest model, only the max_features parameter was tuned.

We additionally tested deep learning approaches: LSTM (input size = 96, RMSE loss, 10,000 epochs) and N-BEATSx (input size = 96, RMSE loss, 500 epochs).

3.3 Performance evaluation
Model performance was assessed using Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE). Additionally, the Relative Root Mean Squared Error (RRMSE) [1] was used to enable inter-station performance comparisons in the one-day-ahead forecasting task. RRMSE is defined as follows:

RRMSE = RMSE / ȳ    (1)

where ȳ is the mean of the target values.

3.4 Results
The results for the one-day-ahead task are presented in Table 1, with station forecastability visualized in Figure 5. The long-horizon task outcomes are presented in Table 2.

4 Discussion and conclusion
For the one-day-ahead forecasting task, a clear correlation exists between station data sparsity (Figure 2) and forecastability (Table 1). Stations with fewer rentals or gaps in data are easier to predict accurately. Interestingly, using the DS_FULL dataset, which includes data prior to 2024, can reduce modeling accuracy for certain stations. Including weather features in DS_W leads to little or no improvement compared to DS_NO_W. For the long-horizon task, including weather data proves beneficial, as both classical machine learning models and neural networks show improved performance (Table 2). However, as described in the Experiments section, the machine learning results on DS_W are overly optimistic due to data leakage: the models are trained on historical rental counts while also accessing future weather information during recursive forecasting (e.g., predicting rentals in July uses the actual recorded weather for that month). This is reflected in the comparison with DS_NO_W, where classical machine learning methods achieve a 33% mean reduction in MAPE, while neural network approaches show only a 17%
mean decrease, suggesting that the apparent benefit of weather Training was performed on a laptop equipped with an RTX 3050 data is amplified for classical methods because of this setup. Our GP U (4 GB VRAM), which constrained the range of hyperparam- results echo [3] where Gradient Boosting models matched or eter configurations that could be explored, particularly for the outperformed neural networks on several datasets, demonstrat- neural network-based approaches. ing the effectiveness of simpler models. While neural networks 39 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia Kocjančič et al. Figure 5: Model performance of one-day-head forecasting for different stations for DS_W Table 1: Average RRMSE of all models of one-day-ahead Table 2: Model performance of 90-day forecasting across forecasting across datasets (RRMSE) and stations. datasets (RMSE / MAPE) Station DS_FULL DS_NO_W DS_W Model DS_FULL DS_NO_W DS_W 6 0.9210 0.9116 0.9097 AutoARIMA 120.09 / 0.9525 118.50 / 0.9954 118.50 / 0.9954 7 0.5849 0.5488 0.5439 Random Forest 108.29 / 0.7153 100.94 / 76.36 / 0.7014 0.7431 8 0.7948 0.6872 0.5513 0.6821 Gradient Boosting 95.17 / 0.7451 94.96 / 0.9584 74.69 / Linear Regression / 0.9372 84.78 / 1.0816 71.71 / 0.8872 90.29 9 0.6646 0.6631 0.6532 SVR 94.86 / 0.8893 / 0.9507 / 0.8036 87.12 67.95 10 0.9550 0.7753 0.7747 LSTM 112.05 / 125.13 / 0.8494 130.00 / 0.8070 0.7133 11 1.0110 1.0034 1.0027 NBEATSx 106.49 / 1.0329 128.90 / 0.9972 117.45 / 0.7246 12 0.6028 0.4649 0.4540 Average 103.89 / 0.8551 105.76 / 0.9394 93.81 / 0.7815 13 0.6601 0.4022 0.4000 14 0.6902 0.4840 0.4720 15 0.5218 0.4780 0.4652 16 0.7185 0.5984 0.5975 Acknowledgements 17 0.8336 0.7402 0.7337 18 0.5274 0.4670 0.4522 This work was supported in part by the Slovenian Research 21 0.5476 0.5218 Agency through core funding for the programme Knowledge 0.5215 22 0.5198 0.4171 Technologies (No. 
P2-0103) and by the project , funded 0.4160 KReATIVE 23 0.4783 0.4363 through NetZeroCities under the European Union’s Grant Agree-0.4349 24 0.4896 0.4760 0.4696 ment No. HORIZON-RIA-SGA-NZC 101121530. We also thank 25 0.6834 0.5608 0.5570 Tea Tušar for her suggestions regarding data visualization. 26 0.6897 0.6812 0.6506 27 0.9898 0.9595 0.9463 References 28 0.5580 0.4936 0.4898 29 0.6008 0.5788 0.5761 [1] Shikun Chen and Nguyen Manh Luc. 2022. Rrmse voting regressor: a weight- ing function based improvement to ensemble regression. arXiv preprint 30 0.5941 0.5531 0.5496 arXiv:2207.04837 . 31 0.8952 0.6474 [2] Jimmy Du, Rolland He, and Zhivko Zhechev. 2014. Forecasting bike rental 0.6452 32 0.5453 0.4873 0.4851 demand. . Gebhard, K., & Noland [3] Shereen Elsayed, Daniela Thyssens, Ahmed Rashed, Lars Schmidt-Thieme, Average 0.6793 0.6016 0.5989 and Hadi Samer Jomaa. 2021. Do we really need deep learning models for time series forecasting? , abs/2101.02118. https://arxiv.org/abs/2101.02118 CoRR arXiv: 2101.02118. [4] Hadi Fanaee-Tork. 2012. Bike sharing dataset. Dataset. (2012). https://www.k aggle.com/datasets/marklvl/bike- sharing- dataset. [5] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. could potentially benefit from hyperparameter optimization, the , 9, 8, 1735–1780. Neural computation [6] Meerah Karunanithi, Parin Chatasawapreeda, and Talha Ali Khan. 2024. A same applies to other methods as well. A detailed comparison of predictive analytics approach for forecasting bike rental demand. Decision different approaches was beyond the scope of this preliminary , 11, 100482. doi: https://doi.org/10.1016/j.dajour.2024.10048 Analytics Journal study but could be explored in future work. 2. 
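The paper above defines its one-day-ahead baseline precisely (the Mean Regressor predicting with the 7-day lag mean) but does not spell out the RRMSE formula it reports. The sketch below implements that baseline and one common reading of relative RMSE, namely model RMSE divided by baseline RMSE, so that values near 1 mean the model is no better than the baseline; this normalisation is our assumption, not confirmed by the paper.

```python
import numpy as np

def mean_regressor_forecast(history: np.ndarray) -> float:
    """One-day-ahead baseline from the paper: the mean of the
    last 7 daily rental counts (the 7-day lag mean)."""
    return float(np.mean(history[-7:]))

def rrmse(y_true: np.ndarray, y_pred: np.ndarray, y_base: np.ndarray) -> float:
    """Relative RMSE: model RMSE divided by baseline RMSE.

    NOTE: this normalisation is an illustrative assumption; the paper
    reports RRMSE without stating the exact formula. Values below 1
    would mean the model beats the baseline."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    rmse_base = np.sqrt(np.mean((y_true - y_base) ** 2))
    return float(rmse / rmse_base)

# Example: 14 days of daily rentals, forecast for day 15
daily = np.array([12, 15, 11, 18, 20, 9, 7, 13, 16, 12, 19, 21, 10, 8], dtype=float)
print(mean_regressor_forecast(daily))  # mean of the last 7 values
```

Under this reading, the near-1 RRMSE values for stations 11 and 27 in Table 1 would indicate stations where no model improves much on the 7-day lag mean.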
Predicting Traffic Intensity on Motorway Sections

Matic Kladnik†
Jozef Stefan International Postgraduate School
Ljubljana, Slovenia
matic.kladnik@gmail.com

Dunja Mladenić
Department of Artificial Intelligence, Jozef Stefan Institute
Ljubljana, Slovenia
dunja.mladenic@ijs.si

Abstract

This paper addresses predictions of traffic intensity on sections of motorways. Predictions are computed for timespans from 24 hours up to 52 weeks. With our adaptive system, we update predictions with newer ones once additional features can be computed from available data. We use the historic context of past traffic intensities on specific sections at specific periods of time, as well as semantic context about the target period. We have evaluated our methodology with multiple machine learning models and compared performances for various timespans on a specific motorway section. The evaluation results show that our methodology improves predictions for specific periods over time.

Keywords

Motorway, traffic intensity, prediction, regression, system, semantic context, evaluation, machine learning

1 INTRODUCTION

A prediction system for predicting traffic intensity on motorway sections can support a wide range of decision-making, strategic, and operative processes at the motorway management organization. It can also support end users, such as daily commuters, tourists, and other drivers, with their planning of a trip.

The focus of this paper is on the architecture of the motorway traffic intensity prediction system as well as on the evaluation of the machine learning models that were trained to produce the predictions for various timespans.

2 PROBLEM SETTING AND DATA

The objective of the proposed methodology is to make long-term and medium-term predictions of traffic intensity or frequency (vehicle count) on various sections of motorway, based on historic data of traffic counters, semantic context of motorway stations, and semantic context of time periods. Predictions serve the motorway management company for better planning of construction projects and for finding the least intrusive time slots for road maintenance work. They also serve motorway drivers when planning a trip.

2.1 Traffic Counters

There are close to one hundred traffic counters that we consider for predictions. Each counter is supported by a pair of inductive loops that are laid into the asphalt of the road. Signals are processed, sent through an IoT communication device, and stored into the database.

In the data, there are counts or frequencies of total vehicles, as well as counts by vehicle type (passenger car, transport truck, bus), for each hour-long time period, e.g. the number of vehicles from 8:00 to 9:00, for each of the lanes of a specific motorway section separately.

2.2 Semantic Context

For each of the examples in the dataset we produce semantic context features. For each day and time-of-day period, these features inform the model whether a certain time period falls on a workday or a weekend, and whether it falls into the morning or the afternoon rush hours. These semantic features give additional information that improves the performance of the machine learning models.

2.3 Data Processing

After downloading the data from the motorway counters via an API of the data provider, we additionally process it to increase the consistency and reliability of predictions. During data processing, we merge data from all lanes of a specific motorway section, which is usually denoted with neighboring towns and the direction of the motorway section.

3 METHODOLOGY DESCRIPTION

We propose a prediction system that incorporates multiple machine learning models to deliver the most reliable predictions based on the available data and the timespan for which the system is making predictions of traffic intensity.
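The semantic context features described in Section 2.2 can be derived from the timestamp alone. A minimal illustrative sketch follows; the feature names and the exact rush-hour boundaries are our assumptions, since the paper does not state the hours it uses:

```python
from datetime import datetime

def semantic_context_features(ts: datetime) -> dict:
    """Illustrative semantic context features for one hourly period.

    The rush-hour boundaries below are assumed for illustration;
    the paper does not specify the exact hours used.
    """
    hour = ts.hour
    return {
        "is_weekend": ts.weekday() >= 5,       # Saturday or Sunday
        "is_workday": ts.weekday() < 5,
        "is_morning_rush": 6 <= hour < 9,      # assumed 6:00-9:00
        "is_afternoon_rush": 14 <= hour < 18,  # assumed 14:00-18:00
    }

# Example: the 7:00-8:00 period on Monday, 6 May 2024
feats = semantic_context_features(datetime(2024, 5, 6, 7))
```

Binary features of this kind can be fed directly to the regression models alongside the historic traffic count features.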
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
http://doi.org/10.70314/is.2025.sikdd.25

Figure 1: Diagram of the system for producing and distributing predictions of traffic intensity on motorway sections

To improve prediction accuracy, we make medium-term and long-term predictions. In our case, long-term predictions are made from 1 week to 52 weeks in advance for a specific 1-hour time period on a specific day of the week. This means that we can make up to 52 predictions when conducting long-term predictions after receiving a new data example, e.g. the traffic frequency for a specific 1-hour time period (e.g. 14:00-15:00) on a specific day (e.g. Monday). Medium-term predictions, in contrast, predict from less than 24 hours up to 1 week in advance. For medium-term predictions, we take more features for recent traffic frequency into account for improved accuracy.

Long-term predictions are useful when making decisions for actions that are several weeks or months in the future, while medium-term predictions are more useful when making decisions for actions that will take place from 1 to 7 days in the future.

We have a separate machine learning model for each of the included counters on the motorway, to better adjust to the specifics of the traffic dynamic of a specific counter when making predictions of traffic frequency. We have also trained several general-purpose models that are trained on a group of counters or on all counters. These are present to support counters with a short data history.

Predictions are exposed through a REST API service and are available upon request. They are computed and updated regularly, e.g. daily or hourly. More approaches are described in [1][6].

3.1 Machine Learning Models

To compute predictions of traffic intensity in the future, we use regression machine learning models. We have trained and evaluated several models with different machine learning algorithms. These are: linear regression, SVM (SVR – Support Vector Machine for Regression), and XGBoost, which is an ensemble model of decision trees.

Features for training models and making predictions are engineered in such a way that each one of the models can use the whole set of features; e.g., we use a one-hot encoding approach when a feature would otherwise have multiple categorical values. We focus on training a specific model for each of the motorway sections that were part of the research. Note that a more general model, trained on data from multiple motorway sections, could be more appropriate for motorway sections that have been newly added and do not have enough historical data to support training of a reliable machine learning model with a sufficient evaluation period. We use up to 7 features that are based on historic data, 7 time period features, and 6 semantic context features for a specific time period and location.

Model training processes use MAPE (Mean Absolute Percentage Error, used interchangeably with MARE – Mean Absolute Relative Error). More on relevant machine learning models and metrics can be found in the references ([2][3][4][5]).

3.2 Prediction System Description

We continue with the description of our proposed prediction system. The system consists of two main subsystems: one for periodically computing and storing traffic intensity predictions for various timespans, and another for delivering predicted traffic intensity via a REST API service.

As we can see in Figure 1, the system fetches data from the data provider's REST API service. Data is processed after retrieval and sent into a table of the prediction system's database. This data is read periodically by the adaptive prediction system. Once a new value is processed by the system, it checks if there are any additional models available with a shorter timespan, compared to the model used for the currently available prediction. The system prioritizes predictions from models with a shorter timespan in order to update the database with the most reliable predictions available at the time. E.g., a prediction with a 1-month timespan succeeds and replaces the prediction with a 3-month timespan.

Different long-term and medium-term models can be trained using different machine learning algorithms, depending on the algorithm that performed the best during the evaluation of the models.

Once the updated predictions are stored in the database, they are available to users, such as strategists, operators and support specialists within the motorway management organization, or end users of the motorway, such as drivers of cars, trucks, buses, etc. A key advantage of this approach is that drivers, motorway operators and specialists get insights that are based on the same predictions for traffic intensity, which supports greater transparency of information and stronger compatibility of different applications for end users and motorway professionals. E.g. the system can support long-term planning for larger maintenance or reconstruction projects up to 1 year ahead, as well as long-term planning by road users. For instance, drivers can plan their holidays and the time of their commute ahead, and highway maintenance operators can find the most optimal schedule for short maintenance work.

4 EVALUATION

We continue with the evaluation of the machine learning models. To compare models trained with different algorithms, we use the evaluation results for the same motorway section on the Slovenian motorways. We use the period from 1 May 2024 until 5 May 2025 for evaluation.

We use the Scikit-learn library [7] to train the linear regression (using the ordinary least squares approach) and SVM (SVR) models, and the XGBoost library [8] to train the XGBoost models. The SVM model is trained using the RBF kernel and with a scaled gamma hyperparameter. In the majority of motorway sections, XGBoost models with a maximum depth of 6 performed the best, which is why we used models with the same hyperparameter value for the following analyses. We use gbtree as the booster, while the learning rate is 0.3. We evaluated the models on a little over 1 year of test data, which was not included in the training or validation part of the process.

Table 1: Model Performance Comparison

timespan   algorithm   MAE    RMSE    MAPE
24 hours   XGB         39.43   62.75  10.5%
24 hours   SVM         42.38   65.86  11.5%
24 hours   lin. reg.   43.14   66.93  11.6%
7 days     XGB         45.66   70.69  11.6%
7 days     SVM         43.70   68.91  12.1%
7 days     lin. reg.   43.51   69.04  12.1%
4 weeks    XGB         57.30   88.56  13.9%
4 weeks    SVM         50.20   77.86  14.1%
4 weeks    lin. reg.   51.33   78.63  14.7%
52 weeks   XGB         88.33  121.93  20.9%
52 weeks   SVM         53.54   84.49  14.9%
52 weeks   lin. reg.   70.46   96.98  21.3%

We continue with the analysis of the model performances as seen in Table 1. If the timespan attribute's value is '7 days', it means that the model predicts 7 days into the future. We use several metrics to describe the performance of the models. These are: MAE (Mean Absolute Error), RMSE (Root Mean Square Error), and MAPE (Mean Absolute Percentage Error). MAPE is a crucial metric as it shows relative errors in percentages, which is key when evaluating the models, as traffic frequency varies significantly throughout different parts of the day.

We can see some interesting performance dynamics of the models. The XGBoost model performs the best for the 24-hour timespan, with a significant performance uplift of at least 1 percentage point in MAPE compared to the other two models. It is also better in the other two metrics, MAE and RMSE.

We continue with the performance analysis of the long-term predictions. For the 7-day timespan, the XGBoost model is still noticeably better than the other two models, with a 0.5 percentage point uplift in performance. For the 4-week timespan, XGBoost still holds a small lead in the key metric (MAPE), whereas the SVM model has significantly better results when considering just the MAE and RMSE metrics. For the 52-week timespan, we can see an interesting dynamic, as the SVM model takes a significant lead in performance: it is the only one with a MAPE value of less than 15%, whereas the MAPE values of the other two models surpass 20%.

This dynamic is likely caused by a reduced set of features, as there are significantly fewer historic traffic count features that can be included when making predictions with a 52-week timespan. It seems this has a significantly negative impact on training the XGBoost model, which is a tree ensemble model, while having additional features available gave the XGBoost model an edge for predictions with a timespan of up to 4 weeks, especially up to 7 days.

Figure 2: Distribution of absolute relative errors by 5% buckets for the XGBoost 7-day timespan model

In Figure 2 we can see how the absolute relative errors are distributed when they are split into 5% absolute relative error buckets. We can see that in 45.5% of the cases, the absolute relative (or percentage) error of the predicted traffic frequency is less than 5% of the actually measured traffic frequency. 21.7% of predictions have a relative error between at least 5 and (excluding) 10 percent, and 11.2% of predictions have a relative error between 10 and 15 percent.

This means that in 78.4% of predictions, the relative error was less than 15%, which can be considered sufficiently good performance for the models to support a sufficiently reliable traffic intensity prediction system.

Figure 3: Mean relative errors by each hour of the day for the XGBoost 7-day timespan model

We continue by analyzing the distribution of mean relative errors by each hour of the day, as seen in Figure 3. We can see that the model generally tends to slightly overestimate, or overshoot, with its predictions, especially during the night-time periods, when there are fewer vehicles on the motorway.

In the mean aggregate, there is less than a 2% mean relative error during the morning rush hours (at 6:00-7:00, 7:00-8:00, and 8:00-9:00). The error is the highest during the 15:00-16:00 period, with more than 13% mean relative error. However, the error is substantially smaller during the other afternoon rush-hour periods, 14:00-15:00, 16:00-17:00, and 17:00-18:00, where it remains under 4%. Apart from the 15:00-16:00 period, the mean relative errors are consistently under 6%. When the model does undershoot, or underestimate, with its prediction, the mean relative error is less than 2%, close to 1%.

We can see a spike of the mean relative error in the 15:00-16:00 period. Upon investigation, it turns out only around 20 vehicles were counted in the data for a specific period, which is unusual for this period and likely a consequence of a traffic accident or some issue with data collection.

We have also conducted an aggregated evaluation of models on 10 various motorway sections, where mean MAPE values were 14%, 15%, 18%, and 20% for the 24-hour, 7-day, 4-week and 52-week timespans respectively. Predictions for sections near the capital city were generally less reliable than others.

4.1 Evaluation Insights

When considering the results of the evaluation of the trained machine learning models for specific motorway sections, we have gathered several key insights.

In some examples, we could not compute all features due to missing values in the data, meaning that certain features had NaN values after computing historic time-series features with Pandas' shift function. In this case there is a strong advantage in having a decision tree ensemble model (e.g. XGBoost) as a backup, even if it is not the best performing model for a certain timespan. This is due to the ability of tree ensemble models to apply only those trees that are covered by features with available values. In this case the predictions are generally less accurate, but possible.

Another key insight is that the evaluation supports our proposed methodology with multiple models to improve the performance of the predictions for each included timespan. Another useful insight is that different algorithms can produce the best models for different timespans on the same motorway section, as was the case with the SVM model in our evaluation.

5 CONCLUSION

We have overviewed the methodology that we use as the foundation for our proposed system for predicting traffic intensities on motorway sections, including the adaptive prediction system and the supporting machine learning models that support making predictions for various timespans to, in time, improve already available predictions for specific time periods in the future. We have also overviewed the evaluation of the trained machine learning models and found some useful insights that support our proposed prediction system.

Compared to related work, the key contributions of our methodology are significantly longer prediction timespans, the inclusion of semantic context, and higher adaptability to data. Based on the presented evaluation results, our methodology produces predictions with sufficient reliability to support long-term decision making of various roles.

For further improvements to the system, we could train and evaluate some deep learning models and models that are based on the transformer architecture, as well as some other time-series forecasting procedures, such as Facebook Prophet. We could also engineer additional semantic context features for further improvements to the performance of the existing models. For additional improvements for shorter timespans, we could also include weather forecast data.

References

[1] Bernardo Gomes, Jose Coelho, Helena Aidos. 2023. A survey on traffic flow prediction and classification. Intelligent Systems with Applications, vol. 20. DOI: https://doi.org/10.1016/j.iswa.2023.200268
[2] Jithin Raj, Hareesh Bahuleyan, Lelitha Devi Vanajakshi. 2016. Application of Data Mining Techniques for Traffic Density Estimation and Prediction. Transportation Research Procedia, vol. 17. DOI: https://doi.org/10.1016/j.trpro.2016.11.102
[3] Yuyu Zhu, QingE Wu, Na Xiao. 2022. Research on highway traffic flow prediction model and decision-making method. Scientific Reports, vol. 12. DOI: https://doi.org/10.1038/s41598-022-24469-y
[4] Carl Goves, Robin North, Ryan Johnston, Graham Fletcher. 2016. Short Term Traffic Prediction on the UK Motorway Network Using Neural Networks. Transportation Research Procedia, vol. 13, 184-195. DOI: https://doi.org/10.1016/j.trpro.2016.05.019
[5] Adriana-Simona Mihaita, Zac Papachatgis, Marian-Andrei Rizoiu. 2020. Graph modelling approaches for motorway traffic flow prediction. 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC). DOI: https://doi.org/10.1109/ITSC45102.2020.9294744
[6] Sayed A. Sayed, Yasser Abdel-Hamid, and Hesham A. Hefny. 2022. Artificial Intelligence-Based Traffic Flow Prediction: A Comprehensive Review. Preprint. DOI: http://dx.doi.org/10.21203/rs.3.rs-1885747/v1
[7] Scikit-learn: https://scikit-learn.org
[8] XGBoost: https://xgboost.ai/

Empowering Youth on Smart Cities with AI Solutions to Community and Urban Challenges Towards SDG 11

Mustafa Zaouini†, Lee Chana, Joao Pita Costa, Davor Orlic, Mihajela Črnko, Yousef Rahmani, Rayan Kassis, Luka Stopar, Ruben Frank, Kim August, Swethal Kumar
AI in Africa (Rabat, Morocco / Johannesburg, South Africa); IRCAI, Quintelligence (Ljubljana, Slovenia); ToumAI (Rabat, Morocco); Solvesall (Ljubljana, Slovenia); EnergyAED (London, UK)
mus@fliptin.io, joao.pitacosta@ircai.org, odin@toum.ai, luka.stopar@solvesall.com, rayan@aed.energy

Sohaib Souss, Wahid Laleeg, Asmae Lamgari, Maroja Zoubir, Ouidad Mochariq, Zahira Elmelsse, Chaimae Fadil, Yassine Bounouader, Hajar Doukhou
SLTVERSE (Casablanca, Morocco); University Mohammed V (UM5) (Rabat, Morocco); ENSA – National School of Applied Sciences (ENSA-M) (Marrakesh, Morocco)
sohaibsoussi@gmail.com, asmaelamgarim@gmail.com, o.mochariq3846@uca.ac.ma

Abstract / Povzetek

Achieving Sustainable Development Goal 11, ensuring cities are inclusive, safe, resilient, and sustainable, remains a pressing global priority. In this pursuit, Artificial Intelligence (AI) has emerged as a transformative driver of urban innovation, enabling policymakers, academic institutions, and industry stakeholders to make data-driven decisions for complex urban systems such as housing, transportation, energy, and infrastructure. Despite its potential, the vast scale, variety, and fragmentation of urban data, coupled with the rapid evolution of AI technologies, create significant challenges in converting SDG 11-related information into practical solutions. This paper reports on the results of the AI4SDG11 programme, which combined expert community building, knowledge exchange, and competitive challenges. The programme brought together 50 students and 30 startups in 15 locations worldwide to develop AI-driven solutions targeting key aspects of urban sustainability. Using diverse machine learning techniques, participants addressed challenges including intelligent mobility systems, efficient waste management, smart and efficient urbanism, and climate-resilient urban planning. Conducted in 2025, this initiative formed part of a youth-focused innovation challenge co-organized by AI in Africa, the International Research Centre on Artificial Intelligence (IRCAI), and GITEX, with the goal of promoting interdisciplinary innovation and strengthening regional AI capacity for sustainable urban development.

Keywords / Ključne besede

Machine learning, text mining, large language models, community engagement, urbanism, mobility, AI competition, AI community

† Corresponding author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
http://doi.org/10.70314/is.2025.sikdd.18

1 Introduction

Established by the United Nations as an essential goal for the forthcoming 2030, Sustainable Development Goal 11 (SDG 11), "Make cities and human settlements inclusive, safe, resilient and sustainable", reflects a critical global commitment to improving urban living conditions amid increasing urbanization, population growth, and environmental stress.
With more than half of the world's population now residing in cities, and projections estimating two-thirds by 2050, the urgency of building sustainable urban environments has never been greater. In this context, AI has emerged as a transformative tool capable of reshaping how cities are planned, managed, and experienced. AI technologies offer powerful capabilities to harness vast amounts of urban data, generate predictive insights, and support evidence-based decision-making. From optimizing public transportation systems to monitoring air quality, improving waste management, and enabling climate-resilient infrastructure, AI is at the forefront of innovative urban solutions worldwide. However, the deployment of AI in support of SDG 11 varies significantly across regions, influenced by differences in digital infrastructure, data availability, institutional capacity, and local priorities [1].

In Africa, AI is increasingly being applied to address urban informality, mobility challenges, and infrastructure gaps. For instance, AI-powered geospatial mapping tools are being used to identify informal settlements in rapidly growing cities such as Nairobi and Lagos, helping governments to improve service delivery and urban planning [2]. In North African cities, machine learning models have been developed to optimize water distribution in drought-prone areas and to improve traffic flow in congested urban corridors. AI is also being tested for predictive waste collection and smart energy use in off-grid communities. These solutions are particularly valuable in regions where resources are limited and where rapid urban growth creates pressure for low-cost, scalable interventions [2].

In Europe, on the other hand, AI applications in cities often focus on enhancing sustainability, efficiency, and citizen engagement. Examples include real-time public transport optimization in cities like Helsinki and Barcelona [3], AI-based air pollution forecasting in Paris [4], and intelligent energy management systems in smart buildings across the Netherlands and Germany [5]. Many European municipalities are also investing in AI-driven participatory governance platforms, enabling data-informed urban policymaking that incorporates citizen feedback [5]. Furthermore, [6] highlights how AI can extract and analyze news media information to enhance knowledge and understanding of water-related extreme events, supporting improved disaster risk reduction.

This paper presents the outcomes of a collaborative youth AI innovation programme, including AI mentorship and challenges aimed at exploring the impact of AI on the SDGs. It builds on the related initiative initiating the programme in 2026 under the focus of Water Sustainability to progress SDG 6 (see [7] and [8]), and refocuses the approach to address SDG 11-related problems through applied machine learning solutions. The initiative brought together 50 students and 20 professors across 10 research institutions in North Africa, as well as 30 AI startups and domain experts worldwide, culminating in 30 projects and initiatives tackling real-world urban challenges. By leveraging AI and data science, these teams addressed issues ranging from urbanism and mobility to waste management and climate resilience, drawing on lessons and methods from both African and European contexts. The competition was co-organized by AI Africa and IRCAI in collaboration with GITEX (short for Gulf Information Technology Exhibition), one of the world's largest technology and innovation events, held annually in Dubai, United Arab Emirates. The event, held in May 2024 [7], served as a model for interdisciplinary, cross-regional collaboration in the pursuit of sustainable urban futures.

2 AI4SDG Programme Methodology

The AI4SDG Programme, spearheaded by IRCAI under the auspices of UNESCO, in collaboration with AI in Africa and GITEX, is a transformative initiative designed to harness artificial intelligence to address the United Nations Sustainable Development Goals (SDGs). With a focus on capacity building, entrepreneurship, and ethical AI deployment, the programme connects technological innovation with global sustainability challenges, particularly in the Global South.

At the core of AI4SDG is a multi-pronged approach integrating certified training, competitive innovation events, and startup acceleration. Launched through global showcases and pitch competitions at major GITEX events across Africa, Asia, Europe, and the Middle East, the initiative provides a dynamic platform for students, researchers, and entrepreneurs to ideate, prototype, and scale AI solutions aligned with specific SDGs. Previous editions have focused on Water Sustainability (SDG 6) and Sustainable Cities and Communities (SDG 11), while the 2026 programme will extend to all 17 SDGs. The key components include:

• Research2Startup Competition: A 4–6 week programme blending AI education, design thinking, and acceleration tracks for startups and university spinouts, culminating in regional and global pitch events.
• Certified AI for SDG Training: Professional certification tracks for corporate teams, startup founders, and SMEs, focusing on topics like large language models, AI governance, ethical data practices, and generative AI applications.
• AI4SDG Lab Accelerator: A 3–6 month cohort-based programme supporting university-originated AI startups through mentorship, technical workshops, and investor networking, culminating in a high-profile Demo Day at GITEX Global.

The programme not only equips participants with practical AI competencies but also facilitates access to global networks, funding opportunities, and collaboration through GITEX's innovation ecosystem. It champions responsible AI development by emphasizing ethics, transparency, and inclusivity, while offering tangible incentives such as certifications, cash prizes, MVP co-development and impactful international exposure through IRCAI and GITEX channels. In doing so, AI4SDG acts as a catalyst for fostering the next generation of AI-driven changemakers committed to creating impactful, scalable solutions for a sustainable future.

Figure 1: Screenshot of the AI engine ToumAI, winner of the AI4SDG11 startup competition at the inaugural edition of GITEX Europe, Berlin, as a prime example of the relevance of languages in the resilience of cities and communities

3 AI-enabled Innovation Advancing SDG 11

The joint IRCAI, AI in Africa and GITEX competition served as a global platform for surfacing innovative AI-driven solutions to SDG 11 challenges, bridging the ideas of PhD researchers in North Africa with the entrepreneurial agility of startups worldwide. Among the standout innovations emerging from the competition were AI-powered geospatial mapping systems for monitoring informal settlements, predictive analytics for optimizing urban transport routes in congestion-prone cities, and machine learning models for forecasting waste generation to improve collection efficiency. Several projects addressed climate resilience, including early-warning systems for urban flooding and AI-assisted tools for assessing heat island effects and guiding green space planning. From energy-efficient building design algorithms to citizen engagement platforms that use natural language processing for policy feedback, the competition highlighted the breadth of AI's potential to make cities more sustainable and inclusive. By uniting academic depth with market-ready solutions, the initiative not only identified promising prototypes but also laid the groundwork for scalable interventions adaptable to diverse urban contexts.

ToumAI. A holistic multilingual AI platform designed to bridge the digital divide in Africa by enabling voice-driven customer experiences in low-resource languages, advancing SDG 11. Built on a compound AI structure that saves computing power compared to foundational LLMs, the system supports speech-to-text, text-to-speech, emotion analysis, churn detection, and predictive insights across African dialects such as Swahili, Amharic, Yoruba, and Darija. By integrating AI-powered voice agents, IVR optimization, and multilingual analytics, ToumAI delivers inclusive, real-time, and cost-effective communication for the telco, banking, and transport sectors (see Figure 1). Its innovation lies in industrializing underrepresented African languages for AI applications, ensuring accessibility for populations historically excluded from the AI revolution.

EnergyAED. An AI-enabled renewable energy storage system that converts electricity into high-temperature heat (up to 800°C) using salt-based thermal bricks, providing 24/7 clean power and heat without combustion. Unlike batteries or diesel, the system delivers up to 24 hours of dispatchable energy at lower cost, using safe, stable, and modular 10 MWh units. Applications include microgrids, telecoms, industrial heat, and desalination, making it particularly suited for regions with unreliable energy supply. By enabling baseload renewable energy, AED Energy strengthens critical infrastructure and advances SDG 11 while reducing dependence on diesel.

SolvesALL Mobility. Delivery district planning and optimization machine learning tools that support smarter urban logistics, impacting the sustainability of cities and communities. Its Postal POI system uses algorithms to automatically design delivery districts, balancing workload, reducing overlap, and minimizing travel time. Leveraging GPS trace analysis, stay-point detection, regression models, and crowdsourced field data, the system learns delivery micro-locations, service times, and accessibility factors (e.g., stairs, obstacles). By integrating these AI-driven insights, SolvesAll enables cost savings, operational efficiency, and improved registry accuracy, demonstrated by expected multimillion-euro annual savings for postal operators, while offering scalability to sectors such as waste management and ATM/vending machine logistics.

SOBEK. A federated AI system for flood resilience that addresses the lack of early-warning systems in rapidly urbanizing African cities. Unlike centralized models, it applies federated learning to collaboratively improve predictions while preserving data privacy and sovereignty. Local nodes train specialized models (LSTMs for weather series, GNNs for hydrological networks, and U-Nets for satellite imagery) using geospatial, meteorological, and historical flood data. Model updates are aggregated with FedAvg and refined through station similarity graphs to capture regional hydrological patterns. Despite challenges of data heterogeneity and low connectivity, Sobek delivers more accurate flood seasonality, year, and magnitude predictions, enabling timely early warnings, urban planning, and disaster resilience across Africa.

Ecoguardians. This initiative introduces an AI-powered system to optimize water-saving advertisements in Morocco, advancing SDG 11 (Sustainable Cities and Communities). By analyzing diverse campaign content (videos, images, text, social media engagement, and survey data), the system identifies what makes ads effective and generates improved variations. It integrates computer vision (CNNs) for visual features, language models (BERT/GPT) for text and sentiment, predictive models (XGBoost/Random Forest) for engagement forecasting, and GANs for generating impactful ad variations. Ethical and data-driven personalization ensures campaigns remain responsible, transparent, and locally relevant. Early prototypes show measurable engagement gains, empowering cities to run evidence-based, AI-enhanced awareness campaigns that strengthen sustainable water use.

Figure 2: Screenshot of the SLTverse engine, winner at the AI stage of GITEX Africa 2025

SLTverse. This smart city solution introduces an AI-powered travel app that supports SDG 11 by enhancing safety, sustainability, and cultural engagement in tourism. At its core is an AI Route Advisor that leverages structured mobility data, spanning cost, CO₂ emissions, safety, time, and distance, to recommend optimal transport options. Beyond mobility, the platform enriches tourism through VR-based storytelling with avatars narrating site histories, and employs metadata-driven personalization supported by visual analytics (route maps, CO₂ vs. cost comparisons, safety heatmaps). Collectively, these AI innovations position the app as a smart city enabler that aligns sustainability, cultural engagement, and traveler well-being.

4 Conclusions and further work

The integration of AI with the SDGs represents a critical frontier in global innovation, particularly as we confront complex challenges in health, education, climate, and urbanization. The AI4SDG programme, as implemented through the collaboration of IRCAI, AI in Africa, and GITEX, demonstrates a strategic and scalable model for aligning technological advancement with sustainable impact. By combining certified training, research-to-startup pathways, and accelerator programmes, AI4SDG empowers diverse stakeholders, from students and researchers to entrepreneurs and SMEs, to develop responsible, ethical and context-sensitive AI solutions across the 17 SDGs.

One of the programme's most significant contributions lies in its ability to bridge the gap between academic research and real-world
This is strengthened by a world application, particularly in the Global South. Through its Retrieval-Augmented Generation (RAG) framework, which global reach and multi-region engagements, AI4SDG not only combines vector search, large language models, and workflow promotes responsible AI development but also facilitates access orchestration to deliver fast, contextual, and multilingual to funding, mentorship, and global markets, thereby amplifying guidance (see screenshot at Figure 2). The system’s AI assistant the reach and effectiveness of AI for social good. However, while adapts to real-time inputs such as weather, safety alerts, and user the AI4SDG11 programme has laid a robust foundation, several preferences, ensuring tailored and secure travel 47 avenues remain open for further development, now open to all world deployment, questions of ethical oversight, data SDGs. Future work should focus on: governance, and accountability become increasingly complex— particularly in cross-border collaborations. Addressing these • challenges will be essential to ensure that the AI4SDG initiative Longitudinal impact assessments to evaluate the sustainability and real-world outcomes of AI solutions not only inspires innovation but also establishes durable, emerging from the programme. ethically grounded impact at scale. • Expanded participation across underrepresented regions and communities, ensuring equitable access to AI training Acknowledgments / Zahvala and opportunities. This research was partially funded by the European • Commission’s Horizon research and innovation program under Integration of emerging technologies , such as neurosymbolic AI, edge AI, and federated learning, into grant agreement 820985 (NAIADES) and 101120237 (ELIAS). training tracks and solution design. • References / Literatura Stronger policy linkages to influence national and international AI governance frameworks through insights [1] Gupta, S. 
and Degbelo, A., (2023) An empirical analysis of AI contributions to sustainable cities (SDG 11). In The ethics of artificial derived from grassroots innovation. intelligence for the sustainable development goals (pp. 461-484). Cham: • Springer International Publishing. Enhanced data infrastructure , including open datasets [2] Mhlanga, David, and Deo Shao (2025). AI-optimized urban resource aligned with the SDGs, to support more accurate, inclusive, management for sustainable smart cities. In Financial inclusion and and transparent AI development. sustainable development in sub-saharan Africa, pp. 96-116. Routledge. [3] Mohsen, B. M. (2024). AI-driven optimization of urban logistics in smart cities: Integrating autonomous vehicles and iot for efficient delivery The AI4SDG programme highlights the transformative systems. Sustainability, 16(24), 11265. potential of AI when it is purposefully directed toward [4] Petry, Lisanne, et al. (2021) Design and results of an AI-based forecasting of air pollutants for smart cities. ISPRS Annals of the sustainable development. As the initiative expands and evolves, Photogrammetry, Remote Sensing and Spatial Inform. Sciences 8: 89-96. it will be crucial to maintain a balance between innovation, ethics, [5] Aguilar, J., et al. (2021) A systematic literature review on the use of artificial intelligence in energy self-management in smart buildings . and inclusivity—ensuring that AI becomes not just a tool for Renewable and Sustainable Energy Reviews 151: 111530. growth, but a vehicle for equitable and sustainable global [6] Pita Costa J., Rei L., Bezak N., Mikoš M., Massri M.B., Novalija I. and progress. At the same time, it is also important to acknowledge Leban, G. (2024) Towards improved knowledge about water-related extremes based on news media information captured using artificial the programme’s inherent challenges and limitations. Sustaining intelligence. 
International Journal of Disaster Risk Reduction, 100, long-term participation from diverse stakeholders requires p.104172. [7] Mustafa Zaouini, Joao Pita Costa, Manal Cherkaoui, Hanaa Hachimi, M. consistent resources, local capacity-building, and incentives that Wahib Abkari, Kamal Gourari, Hatim Lachheb and Jad Tounsi El extend beyond initial pilot enthusiasm. Scaling successful pilots Azzoiani (2024) Addressing Water Sustainability Challenges in North into broader, systemic solutions often encounters barriers such as Africa with Artificial Intelligence In Proceedings of SIKDD /24. [8] IRCAI (2024) IRCAI Partners with AI in Africa for the AI 4 Water fragmented policy environments, limited infrastructure in low- Sustainability Challenge. Available at: https://ircai.org/inircai-partners- resource settings, and uneven access to funding. Moreover, as AI with-ai-in-africa/ solutions transition from competitive innovation contexts to real- 48 Automated First-Reply Generation for IT Support Tickets Using Retrieval-Augmented Generation and Multi-Modal Response Synthesis Domen Jeršek Klemen Kenda domenjersek@gmail.com klemen.kenda@ijs.si Jožef Stefan Institute Jožef Stefan Institute Slovenia Slovenia Rok Klančič Matteo Frattini rok.klancic@gmail.com Matteo.Frattini@gft.com Jožef Stefan Institute GFT Italia Slovenia Italy Abstract Traditional automated response systems relied on template- IT support organizations require timely and consistent first re- based approaches and rule-based classification [2], which pro- sponses to incoming support tickets. This paper presents a Re- vided consistent but inflexible responses that failed to capture trieval Augmented Generation system for automatic generation nuanced requirements. Recent advances in natural language of contextually appropriate first replies. 
The approach combines processing have enabled more sophisticated approaches using semantic similarity search with multi-modal response synthesis, transformer architectures [11] and pre-trained models like BERT retrieving similar resolved tickets using sentence embeddings and [1]. Retrieval-based systems identify similar historical cases and FAISS indexing. Response-type detection determines whether adapt previous responses [5], while retrieval-augmented genera- structured templates or personalized conversational replies are tion (RAG) [6] combines parametric knowledge in language mod- most suitable for each request. The system incorporates tempo- els with retrieval from external knowledge bases for knowledge- ral context detection for status updates and employs few-shot intensive tasks. prompting with selected examples to maintain organizational However, retrieval systems may struggle with novel scenarios, communication standards. Evaluation using semantic similarity and purely generative approaches face challenges in maintaining metrics demonstrates the system’s ability to generate replies that organizational consistency. Hybrid approaches attempt to bal- closely match human-written responses across various ticket ance flexibility with reliability [3], while response classification types, providing a practical solution for reducing response times has evolved from traditional feature engineering to transformer- while maintaining quality and consistency. based models [9]. Our research addresses these limitations by developing an Keywords automated first-reply generation system that combines retrieval- augmented generation with multi-modal response synthesis. 
The IT support, retrieval-augmented generation, automated response system distinguishes between different response types, maintains generation, natural language processing, semantic similarity organizational communication standards, and generates contex- tually relevant replies through response-type detection, temporal 1 context awareness, and few-shot prompting with carefully se- Introduction lected examples. IT support organizations face increasing volumes of support tick- ets that require timely and consistent issue resolution, starting from the first response. Manual processing creates bottlenecks 2 Data that delay user support and increases operational costs, while Our dataset consists of 1,847 IT support tickets containing ticket the quality and consistency of first replies varies significantly titles, descriptions, and complete communication logs. Each ticket between support agents, leading to inconsistent user experiences. includes the full conversation history between users and support The primary challenge lies in generating contextually appro- agents, from initial submission through resolution. priate first replies that match organizational communication stan- The dataset exhibits significant diversity in ticket types, in- dards while addressing the specific nature of each support request. cluding software installation requests, access rights management, Support tickets exhibit diverse characteristics: some require struc- hardware support, VPN configuration, employee onboarding and tured template responses with specific form fields, while others offboarding, and system outage reports. Communication logs benefit from personalized conversational replies that acknowl- contain multiple exchanges, requiring careful extraction of first edge the user’s specific situation. replies from the complete conversation history. 
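The reply-extraction step described in the paper (dash separators, formal greetings, parenthetical user IDs, a 50-character minimum) can be sketched in a few lines of Python. The rule set below follows the listed heuristics, but the concrete regular expressions are illustrative assumptions, not the authors' exact implementation:

```python
import re

def clean_first_reply(raw: str, min_len: int = 50):
    """Return a cleaned first reply, or None if too little text remains.

    Mirrors the cleaning heuristics described in the paper; the exact
    regular expressions here are illustrative, not the authors' rules.
    """
    kept = []
    for line in raw.splitlines():
        s = line.strip()
        if re.fullmatch(r"-{5,}", s):                 # separator lines of 5+ dashes
            continue
        if re.match(r"(?i)^dear\b.*,\s*$", s):        # formal greetings ("Dear Name,")
            continue
        if re.search(r"\(\s*ID[:#]?\s*\w+\s*\)", s):  # user lines with parenthetical IDs
            continue
        kept.append(line)
    text = "\n".join(kept).strip("- \t\n")            # leading/trailing dash runs
    return text if len(text) >= min_len else None     # drop replies under 50 chars

raw = ("------\nDear Name,\n"
       "Please reinstall the VPN client and restart your machine; "
       "let us know if the problem persists.\n------")
cleaned = clean_first_reply(raw)
```

Validating a minimum content length at the end, rather than per line, keeps multi-line replies intact while still filtering near-empty responses.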
We developed a specialized extraction algorithm to isolate the initial support agent response from the multi-turn conversation logs. The extraction process identifies timestamp patterns and user information markers to separate individual responses. The cleaning heuristics systematically remove formatting artifacts including: (1) leading and trailing dash sequences, (2) formal greeting patterns like "Dear Name,", (3) separator lines containing five or more consecutive dashes, (4) user identification lines with parenthetical ID patterns, and (5) responses shorter than 50 characters to filter noise. The algorithm ensures only substantial first replies are retained by validating minimum content length.

After preprocessing, 1,466 tickets contained valid first replies suitable for training and evaluation. The first replies range from 50 to 2,000 characters in length, with an average length of 387 characters. Response types include structured template responses (42%) containing form fields and specific requirements, personalized conversational responses (38%) addressing individual user situations, and status update communications (20%) providing incident or outage information. Response types were automatically classified using keyword-based heuristics and regular expression patterns, as described in Section 3.3 (Response Type Detection).

The dataset was split using stratified random sampling with a fixed seed (random_state=42) to ensure reproducibility. Eighty tickets were randomly selected for the test set, representing approximately 5.5% of the processed dataset, with the remaining 1,386 tickets forming the knowledge base for retrieval. The test set maintains proportional representation across all response types: 34 template responses (42.5%), 30 personalized responses (37.5%), and 16 status updates (20%), closely matching the overall dataset distribution. This stratified approach ensures evaluation coverage across diverse ticket categories while preventing data leakage between training and test sets. The sampling was repeated several times to ensure the selected test sets are representative of the entire dataset.

3 Methodology

Our system implements a multi-stage pipeline for automated first-reply generation, combining semantic retrieval, response-type detection, and few-shot prompting. Figure 1 illustrates the complete system architecture, showing the flow from input ticket processing through knowledge base retrieval to final response generation.

Figure 1: System Architecture: The complete RAG pipeline for automated first-reply generation, showing the eight-stage process from ticket input through embedding generation, knowledge base search, enhanced scoring, response type detection, and final reply generation using GPT-4.

3.1 Knowledge Base Construction

We construct a knowledge base from historical tickets using sentence embeddings [8]. Each ticket is represented by title and description embeddings computed using the all-MiniLM-L6-v2 sentence transformer model [12], which provides a compact 384-dimensional representation optimized for semantic similarity tasks. We build separate embeddings for titles and descriptions, plus combined embeddings for comprehensive similarity search, enabling multi-granular matching across different text components.

The embeddings are indexed using FAISS (Facebook AI Similarity Search) [4] for efficient retrieval with approximate nearest neighbor search. We normalize embeddings using L2 normalization and employ inner product similarity for fast retrieval, achieving sub-linear search complexity through hierarchical clustering and inverted file structures. Figure 2 provides a conceptual visualization of how tickets are positioned in the semantic embedding space based on their content similarity.

Figure 2: Semantic Embedding Space: Conceptual visualization of how support tickets are distributed in the high-dimensional embedding space, where semantically similar tickets cluster together, enabling effective retrieval of relevant historical examples.

3.2 Retrieval System

For each incoming ticket, we retrieve similar historical cases using a multi-factor scoring approach that combines semantic similarity with categorical and structural matching. The enhanced retrieval score combines:

• Base semantic similarity (50%) from FAISS cosine similarity using normalized embeddings
• Category match bonus (20%) when ticket types align, using exact string matching
• Title similarity (15%) using dedicated title embeddings with cosine similarity
• Description similarity (10%) using dedicated description embeddings with cosine similarity
• Response quality bonus (5%) based on response structure analysis and content completeness metrics

These weights reflect the relative importance of semantic similarity, categorical alignment, and structural relevance in ensuring that retrieved examples are both contextually appropriate and organizationally consistent. We retrieve a larger candidate set (4× the target number) from the FAISS index and apply this multi-factor re-ranking to select the most relevant examples, ensuring both semantic relevance and categorical appropriateness.

3.3 Response Type Detection

We implement response-type detection using keyword-based heuristics with regular expression patterns to classify responses as template-based, personalized, or status updates. Template responses are identified by structured formatting indicators such as form field markers (e.g., "Field:", "Value:"), bullet point patterns, numbered lists, and specific organizational phrases like "Below you will find the additional form information."

Personalized responses are characterized by conversational elements including direct questions, user-specific acknowledgments (e.g., "Thank you for contacting us"), empathy expressions, and conditional statements. Status updates contain temporal references using datetime patterns, incident identification numbers, system status keywords, and global communication patterns following organizational incident response protocols.

3.4 Few-Shot Prompting

Response generation employs few-shot prompting with GPT-4 [7], using retrieved examples to guide generation through in-context learning. We construct structured prompts that include:

• Current ticket information (title, description, detected response type).
• 4–5 most relevant historical examples with their corresponding responses.
• Response type-specific instructions (template vs. personalized formatting).
• Organizational communication guidelines and tone specifications.

Template responses receive strict formatting instructions with explicit field markers and structural constraints to maintain exact organizational formatting, while personalized responses are guided toward a conversational but professional tone with specific phrase patterns and acknowledgment structures.

3.5 Temporal Context Detection

We implement temporal context detection using compiled regular expressions to identify tickets related to system outages, status updates, or global communications. The detection system uses pattern matching for temporal indicators (e.g., "since", "until", "during"), incident terminology ("outage", "maintenance", "downtime"), and organizational communication markers ("all users", "system-wide", "scheduled maintenance"). Detected temporal contexts trigger specialized status update generation that mirrors organizational incident communication patterns, including severity levels, expected resolution times, and escalation procedures.

4 Results

We evaluate our system using semantic similarity metrics and response quality assessments across 80 test tickets representing diverse support scenarios.

4.1 Similarity Metrics

We employ two sentence transformer models for comprehensive evaluation [8]:

• all-MiniLM-L6-v2 [12]: Lightweight 384-dimensional model optimized for general semantic similarity with 22.7M parameters
• all-mpnet-base-v2 [10]: Higher-capacity 768-dimensional model with 109M parameters for nuanced similarity assessment using masked and permuted pre-training

The selection of these two models provides complementary evaluation perspectives. all-MiniLM-L6-v2 serves as the primary embedding model in our RAG system due to its computational efficiency and proven effectiveness in semantic similarity tasks, making it suitable for real-time ticket processing. all-mpnet-base-v2 offers higher representational capacity through its bidirectional encoder architecture and serves as a more sophisticated evaluation metric, providing additional validation of semantic coherence through its enhanced understanding of contextual relationships and nuanced text representations.

Our system achieves an average MiniLM similarity of 0.7841 and an MPNet similarity of 0.8048 between generated and expected responses. These scores indicate strong semantic alignment with human-written replies, confirmed through cross-validation analysis showing confidence intervals within a 3% range (±2.9% for MiniLM similarity). Figure 3 shows the performance variation across different test tickets, demonstrating consistent quality across diverse support scenarios.

Figure 3: Individual Ticket Performance: Semantic similarity scores (MiniLM) for each test ticket, showing consistent performance across diverse support scenarios with most tickets achieving similarity scores above 0.7.

4.2 Response Quality Analysis

Quality assessment reveals that 55 out of 80 generated responses (68.8%) achieve similarity scores above 0.7, indicating high semantic alignment. The system successfully maintains organizational communication standards while addressing specific user requirements. Figure 4 illustrates the distribution of response quality scores across the evaluation dataset.

Figure 4: Response Quality Distribution: Distribution of semantic similarity scores showing that 68.8% of generated responses achieve scores above 0.7, indicating strong semantic alignment with expected human-written replies.

Template responses demonstrate particularly strong performance, with exact structural matching and appropriate placeholder handling. Personalized responses achieve good contextual relevance while maintaining a professional tone.

4.3 Response Type Distribution

The system correctly identifies response types in 87% of cases, routing requests to appropriate generation strategies. Template detection achieves 90% accuracy, while personalized response detection reaches 85% accuracy. Temporal context detection successfully identifies 100% of status update scenarios on the tested examples, enabling appropriate global communication style responses.

The plot of the length of the generated responses against the expected responses further supports these results. Figure 5 demonstrates that generated responses maintain appropriate length characteristics compared to human-written replies, with strong correlation between generated and expected response lengths.

Figure 5: Response Length Comparison: Scatter plot comparing the length of generated responses versus expected responses, showing strong correlation and indicating that the system generates appropriately sized replies consistent with human writing patterns.

4.4 Error Analysis

Remaining challenges include the handling of highly specialized technical scenarios and tickets requiring complex multi-step procedures. Some responses exhibit placeholder artifacts when exact matching fails, and very short or very long responses occasionally deviate from expected patterns.

The system shows consistent performance across different ticket categories, with minor variations in quality for edge cases involving complex technical requirements or unusual organizational procedures.

5 Conclusion

This paper presents a comprehensive approach to automated first-reply generation for IT support tickets using retrieval-augmented generation and multi-modal response synthesis. Our system successfully combines semantic similarity search, response-type detection, and few-shot prompting to generate contextually appropriate replies that closely match human-written responses.

The evaluation demonstrates strong performance across diverse ticket types, achieving semantic similarity scores of 0.78–0.80 and maintaining organizational communication standards. Cross-validation analysis confirms the stability of these results, with performance metrics varying within a ±3% range, indicating robust and reliable performance across different evaluation scenarios. The system provides a practical solution for reducing response times while ensuring quality and consistency in IT support communications.

Future work will explore improving template handling using instruction-tuned large language models and developing fine-tuned classifiers for more accurate response type detection, enabling more structured and context-aware reply generation.

References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding, 4171–4186. doi:10.18653/v1/N19-1423.
[2] Yixin Diao, Hani Jamjoom, and Zhen-Yu Shae. 2009. Rule-based problem classification in IT service management. In 2009 IEEE International Conference on Services Computing. IEEE, 221–228. doi:10.1109/SCC.2009.31.
[3] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. Retrieval augmented language model pre-training. arXiv preprint arXiv:2002.08909. https://arxiv.org/abs/2002.08909.
[4] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7, 3, 535–547. doi:10.1109/TBDATA.2019.2921572.
[5] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering, 6769–6781. doi:10.18653/v1/2020.emnlp-main.550.
[6] Patrick Lewis et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474. https://arxiv.org/abs/2005.11401.
[7] OpenAI. 2023. GPT-4 technical report. arXiv:2303.08774. https://arxiv.org/abs/2303.08774.
[8] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 3982–3992. doi:10.18653/v1/D19-1410.
[9] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2018. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 61, 65–95. https://arxiv.org/abs/1510.00726.
[10] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. MPNet: Masked and permuted pre-training for language understanding. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 16857–16867. https://proceedings.neurips.cc/paper/2020/hash/c3a690be93aa602ee2dc0ccab5b7b67e-Abstract.html.
[11] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762.
[12] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 5776–5788. https://proceedings.neurips.cc/paper/2020/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.19

A Machine-Learning Approach to Predicting the Pronunciation of Pre-Consonant l in Standard Slovene

Jaka Čibej, jaka.cibej@ff.uni-lj.si
Centre for Language Resources and Technologies & Faculty of Arts, University of Ljubljana; Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

The pronunciation of pre-consonant l in Slovene words (e.g., alge, polž, gledalka) is not easily predictable (/l/, /u̯/, or both) and poses a problem for the otherwise effective rule-based grapheme-to-phoneme conversion. We present a method to discriminate between the various pronunciations of pre-consonant l using machine-learning models trained on vectors of character-level n-gram features from approximately 153,500 manually annotated Slovene words with pre-consonant l from the ILS 1.0 dataset. We achieve an accuracy of 86% (over a majority baseline of 76.53%) and conclude the paper with potential steps for future work.

Keywords

pronunciation, grapheme-to-phoneme conversion, pre-consonant l, pronunciation ambiguity, Slovene

1 Introduction

In languages that are characterized by greater orthographic depth (i.e., a greater discrepancy between the written form and its pronunciation), such as English or French, grapheme-to-phoneme (G2P) conversion requires more sophisticated methods such as neural networks (see e.g. [10] for French and [14] for English). Slovene, on the other hand, features a much more transparent orthography ([15]; [17]). Phonetic transcriptions of Slovene words – with some exceptions, such as acronyms, symbols, numerals, and certain words of foreign origin (e.g. sommelier), including proper nouns (e.g., Johnson; more on this in [3]) – can be very reliably generated using a rule-based approach, especially if taking the accentuated form (e.g., drevó instead of the unaccentuated drevo) as the starting point, as the diacritic disambiguates the position of the accent and the manner of pronunciation of the accentuated vowel grapheme. The Slovene IPA/X-SAMPA G2P Converter¹ achieves an accuracy of approximately 98% (based on an evaluation on a stratified sample of words; see [2]).

However, there are several exceptions (in addition to the ones already mentioned) in which the pronunciation of certain graphemes is much more difficult to predict with rules. We focus on one such problem in this paper: the pronunciation of pre-consonant l in Slovene words. The grapheme l, when preceding a consonant, can be pronounced as either /l/ or /u̯/. In some cases, both variants are acceptable. Examples include words such as alge ('algae', IPA: /ˈaːlgɛ/, but never */ˈaːu̯gɛ/), polž ('snail', IPA: /ˈpɔːu̯ʃ/, but never */ˈpɔːlʃ/), gledalka ('spectator (female)', IPA: /glɛˈdaːu̯ka/ or /glɛˈdaːlka/), and decimalka ('decimal number', IPA: /dɛtsiˈmaːlka/, but never */dɛtsiˈmaːu̯ka/). The reasons for these different pronunciations are historic and etymological in some cases, while in others, the difference cannot be easily explained and has more to do with conventions in language use. The issue of pre-consonant l has been tackled by Slovene linguistics for more than a century (see [4] for a brief overview). Perception tests and small-scale surveys ([16]; [11]) have recently been conducted to collect data for lexicographic resources (such as the Slovenian Normative Guide 8.0²), but empirical data remains scarce: relevant language resources are not machine-readable or openly accessible (as is the case of the Dictionary of Slovenian Literary Language³ or OptiLeX) or contain inconsistent data (e.g., [19]). In this paper, we use the recently published ILS 1.0 dataset ([1]; described in Section 2).

Because the Slovene IPA/X-SAMPA G2P Converter is currently entirely rule-based, all pre-consonant l graphemes are transcribed as /l/, resulting in errors that need manual corrections when compiling language resources. Our goal is to implement a machine-learning approach⁴ to disambiguate between different pronunciations. Increasing the accuracy of the converter is important in the context of the automatic compilation of modern lexicographic resources that can also be used as machine-readable databases for training models (including large language models) and improving speech recognition and speech synthesis for Slovene. We describe the dataset (Section 2), the statistical analysis used for feature selection (Section 3), the results (Section 4), and several steps for future work (Section 5).

¹ The Slovene IPA/X-SAMPA G2P Converter is part of Pregibalnik, a custom tool that was developed for the expansion of the Sloleks Morphological Lexicon of Slovene [5], which is the morphological basis for the Digital Dictionary Database of Slovene [8]. Pregibalnik is available as open-access code at https://github.com/clarinsi/SloInflector and as an API service at https://orodja.cjvt.si/pregibalnik/docs; the Slovene IPA/X-SAMPA G2P Converter is also available as an API at https://orodja.cjvt.si/pregibalnik/g2p/docs.
² Pravopis 8.0 (Slovenian Normative Guide 8.0): https://pravopis8.fran.si/
³ The Dictionary of Slovenian Literary Language (SSKJ) is available at https://fran.si/.
⁴ An attempt at using machine learning for Slovene phonetic transcriptions was made by [9]; however, the method was evaluated on the Sloleks Morphological Lexicon of Slovene 3.0 [5], where the issue of pre-consonant l is still unresolved.

Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.1

2 Dataset

ILS 1.0 ([1]; described in more detail in [4]) is a dataset of approx. 173,400 inflected Slovene word forms (of approx. 6,000 Slovene lexemes) containing a single pre-consonant l grapheme. Each occurrence of pre-consonant l was annotated for its pronunciation by 5 linguists (2 annotations per occurrence). The word forms were extracted from the manually validated lexemes of Sloleks 3.0 [5], the largest open-access dataset with machine-readable morphosyntactic information on Slovene words. Table 1 shows the distribution of word forms by agreement: in 89% of word

Table 1: Word forms in ILS 1.0 by agreement.

Pronunciation    Number of Forms    %
/l/              117,459            67.73
/u̯/              23,884             13.77
Both             12,160             7.01
Both | /l/       11,205             6.46
Both | /u̯/       7,051              4.07
/l/ | /u̯/        1,660              0.96
Total            173,419            100.00

Table 2: Contingency table for the general n-gram "c" when following a pre-consonant l.

Presence    /l/        /u̯/       /l/+/u̯/
Yes         2,653      1,847     5,980
No          114,898    22,045    6,180

Table 3: A sample of statistically significant general character-level n-grams.
𝑛 𝜒 𝑟-Gram p V Category 2 |𝑚𝑎𝑥 | c 38,199.59 **** 0.499 178.81, / /, No post-ll n 29,081.52 **** 0.435 79.27, / /, No post-ll ce 16,003.46 **** 0.323 118.30, / /, No post-ll o 77,025.17 **** 0.708 227.83, / /, No pre-ll po 48,241.29 **** 0.560 193.98, / /, No pre-ll a 16,592.50 **** 0.329 -79.85, / /, No pre-ll We extract a total of 8,082 different general 𝑛-grams (consist- ing of actual graphemes; 3,041 in pre-position, 5,541 in post-ll Figure 1: Extraction of character-level 𝑛-gram features for position), 116 different robust C+V 𝑛-grams (65 pre- and 51 post- the pre-consonant l in the word gledalka. l), and 603 different finegrained C+V 𝑛-grams (262 pre- and 341 post-). For each 𝑛-gram, we compile a contingency table. For l instance, Table 2 shows the occurrences of the general 𝑛-gram c forms (highlighted in gray), the annotators agree on the pronun- in the position directly following a pre-consonant (e.g., , l moril c a ciation of pre-consonant . They disagree in 11% of the examples, l ‘murderer’, masculine common noun, genitive singular form) with one annotator allowing for both pronunciation variants depending on the pronunciation of the pre-consonant . l and the other allowing for only one pronunciation. Complete In order to determine statistically significant features that help disagreement is present only in less than 1% of the examples. discriminate between different pronunciations, we performed a We use the 153,503 forms with complete agreement as training 2 series of Pearson’s 𝜒 tests [12] and corrected for family-wise data for machine-learning models as described in the following error rate with the Holm-Bonferroni method [7]. We calculated sections. It should be noted, however, that while is the ILS 1.0 7 largest open-access dataset on pre-consonant pronunciations, it l Cramér’s V [6] as the measure of effect size. 
This resulted in a total of 4,263 statistically significant features (1,856 pre-general l is not completely representative of language use in general (with and 1,794 post-general 𝑛-grams; 60 pre-and 40 post-robust l l l annotations by only 5 linguists with a background in translation C+V 𝑛-grams; 242 pre-and 271 post-finegrained C+V 𝑛-grams). l l and Slovene studies; these can be biased towards linguistic rules Several statistically significant pre-general 𝑛-grams are shown l that might not reflect real language use). Despite this, the dataset 8 2 in Table 3. The table shows the values of the 𝜒 statistic and is robust enough to help disambiguate the more obvious examples Cramér’s V, the p-value representations, the maximum absolute (such as , IPA: / /, and , IPA: / /). alge "a:lgE polž "pO:u S “ value of Pearson’s residuals (and its position in the contingency 3 table), and the category of the 𝑛-gram (post-or pre-). With the l l Feature Selection exception of the 𝑛-gram, which is more indicative of the / / a l To some extent, the pronunciation of pre-consonant depends on l pronunciation, the others indicate one of the other two options 5 the preceding and subsequent graphemes, so we use character- u l u (/ /; or / /+/ /). The results also confirm the statement found in level 𝑛-grams as features for prediction. For each pre-consonant “ “ Slovenian Normative Guide 8.0 o l the that the grapheme in pre- l in each word form, we identify the 𝑛-grams (1 ≤ 𝑛 ≤ 5) in its position is strongly indicative of the /u/ pronunciation. direct left/right surroundings as shown in Figure 1 (see footnote “ 6). We include word boundary markers (#) to discriminate be- 4 Prediction and Evaluation tween word-initial and word-final 𝑛-grams. We also perform the We compiled a custom vectorizer based on the identified fea- same extraction on robust and finegrained C+V representations tures. The vectorizer scans each input word form (along with 6 of each word form. 
4 Prediction and Evaluation
We compiled a custom vectorizer based on the identified features. The vectorizer scans each input word form (along with its Multext-East v6 morphosyntactic tag⁹) for all occurrences of pre-consonant l, extracts the surrounding n-grams, converts the morphosyntactic tag into 146 morphosyntactic features, and represents the occurrence as a 4,409-dimensional vector of {0,1} values (with 0 and 1 indicating the absence or presence, respectively, of the n-gram in the direct surroundings of the pre-consonant l). We compile a total of 153,503 vectors in this way and use the Python library scikit-learn [13] to train several models for a classification task with three classes: the goal is to correctly predict whether a pre-consonant l is pronounced as /l/, /u̯/, or both.

4.1 Automatic Evaluation
We trained three different models: a Linear Support Vector Classifier (LinearSVC), a Multinomial Naïve Bayes Classifier (Multin. NB), and a k Nearest Neighbors Classifier (kNN), and evaluated their performance with a 10-fold cross-validation (with a stratified random test set of word forms). The results are shown in Table 4.¹⁰ The worst performing model is the Multin. NB classifier, which barely achieves an above-baseline accuracy and a very low F1-score compared to the other two classifiers, although its recall is much higher. In terms of balanced accuracy and F1-score, the best model is the kNN classifier. However, it seems that the algorithm is not the most suited for this type of data. It performs similarly to the LinearSVC classifier, but if we compare the sizes of the resulting models, it becomes apparent that the LinearSVC model is much more efficient (with a size of approximately 100 kB) compared to the kNN model, which is overly inflated (with a size of more than 2 GB), possibly indicating overfitting.¹¹

Table 4: Model performance based on 10-fold cross-validation.

  Model           A     BA      P      R     F1
  LinearSVC   86.08  72.39  69.26  55.39  61.54
  Multin. NB  77.29  69.54  33.33  81.84  47.36
  kNN (k=5)   85.91  73.30  64.11  62.98  63.53
  Majority    76.53      –      –      –      –

Because the LinearSVC model is the most viable, we analyze its performance in more detail. Table 5 shows the confusion matrix for the classifications of the LinearSVC model on a stratified test set (20% of the total 153,503 dataset instances). The model seems to lean more towards the most frequent category (/l/) in its predictions, with approximately 30% of /u̯/ and /l/+/u̯/ instances being misclassified as /l/, whereas 94% of the /l/ instances are classified correctly. It seems that instances allowing both pronunciations are very rarely misclassified as /u̯/ (only 1%). It should also be noted that the instances of /l/+/u̯/ misclassified as either /l/ or /u̯/ are not entirely incorrect, just incomplete. Compared to the rule-based approach (which classifies everything as /l/), the model performs quite well in terms of /l/+/u̯/ and /u̯/ instances and sacrifices only 6% of its accuracy for /l/ instances. In order to determine any future improvements to the model, we analyze some of the misclassified examples in more detail in Section 4.2.

Table 5: Confusion matrix for the Linear Support Vector classifier.

  Predicted ↓ \ True →      /l/     /u̯/   /l/+/u̯/        Σ
  /l/                    22,006   1,495       729    24,230
  /u̯/                     1,071   2,764        31     3,866
  /l/+/u̯/                   434     519     1,672     2,625
  Σ                      23,511   4,778     2,432    30,721

4.2 Manual Evaluation
We performed a manual analysis of the misclassified examples to determine whether there are any patterns to the errors that could help further improve the model with additional features. Due to space limitations, we only focus on the most obvious problems in this paper.

In the examples where the /l/ pronunciation was misclassified as /u̯/, many words contain a pre-consonant l followed by the grapheme d (kaldera 'caldera', buldožerski 'pertaining to a bulldozer', heraldičen 'heraldic', bodibilder 'bodybuilder'). The majority of these examples are pronounced with /l/, with the exception of words like dopoldne 'late morning' and popoldanski 'pertaining to the afternoon', where the pre-consonant l is preceded by an o grapheme. This could indicate that an additional n-gram feature should be added (the l along with its preceding and subsequent graphemes: old, ald, etc.). This could resolve some other misclassifications, such as impulziven 'impulsive' and pulzirajoč 'pulsating', where words with the combination ulz are never pronounced as /u̯/, but words with olz are (e.g., polzeti 'to slip'). The emergence of such patterns in the misclassifications is a good sign that the classifiers might benefit from a joint pre-/post-l feature. This will be explored in future versions.

Many of the instances in which the /u̯/ was misclassified as /l/ contain compound words with the element pol- 'semi, half': polnag 'half-naked', polfinale 'semi-final', polpuščava 'semi-desert'. Because the element pol is always pronounced with /u̯/, this is also true of derived compound words. However, the n-gram features used offer no indication of morpheme boundaries, so these misclassifications can be expected.

Additional n-gram features could be extracted from the accentuated forms of words. In some examples, the accentuation diacritic can disambiguate the pronunciation of the subsequent pre-consonant l. For instance, dólnji 'pertaining to something that is downwards or downstream' and prestólničen are pronounced with /l/, whereas tôlšča 'blubber' and pôlhográjski 'pertaining to the town of Polhov Gradec' are pronounced with /u̯/. However, accentuation is rarely written in Slovene and is much more difficult to assign automatically compared to morphosyntactic features. Relying on too many features that are not easily extractable would make the model less robust (more on this in Section 5).

⁵ The Slovenian Normative Guide 8.0 (Pravopis 8.0, see https://pravopis8.fran.si), for instance, states that a pre-consonant l preceded by the grapheme o is often characterized by the /u̯/ pronunciation; this is true of words that historically used the syllabic l (e.g. polh, IPA: /"pO:u̯x/ 'dormouse'; volk, IPA: /"vO:u̯k/ 'wolf'). However, there are exceptions, as not all ol n-grams originate from the syllabic l (e.g., polkovnik, IPA: /pOl"kO:u̯nik/ 'colonel'; voltaža, IPA: /vOl"ta:Za/ 'voltage').
⁶ In the robust C+V form, all consonant graphemes are substituted with C and all vowel graphemes with V. In the finegrained C+V form, consonant graphemes were generalized into more finegrained categories, e.g. graphemes denoting Slovene sonorants (M), voiced (G) and voiceless obstruents (K), foreign consonants (X), etc.
⁷ We calculate Cramér's V as √(χ² / (N · (d_min − 1))), where χ² is the Pearson's χ² statistic, N is the total sample size, and d_min is the minimum dimension of the contingency table.
⁸ For all tests, the degrees of freedom (df) were equal to 2 and the total sample size (N) was equal to 153,603. The p-values should be interpreted in the following manner: **** → p ≤ 0.0001; *** → p ≤ 0.001; ** → p ≤ 0.01; * → p < 0.05.
⁹ Multext-East v6 morphosyntactic specifications: https://nl.ijs.si/ME/V6/msd/html/msd-sl.html
¹⁰ A, BA, P, R, and F1 refer to accuracy, balanced accuracy, macro-precision, macro-recall and macro-F1, respectively.
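As a worked check of the feature-selection statistics from Section 3, the following minimal script recomputes the χ² statistic, Cramér's V, and the maximum Pearson residual for the post-l n-gram c from the counts in Table 2. It assumes Cramér's V uses the divisor N · (d_min − 1), which reproduces the values reported in Table 3; the variable names are illustrative, not from the paper.

```python
from math import sqrt

# Contingency table for the post-l n-gram "c" (Table 2).
observed = [
    [2_653, 1_847, 5_980],     # n-gram present  (/l/, /u/, /l/+/u/)
    [114_898, 22_045, 6_180],  # n-gram absent
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)  # 153,603 word forms

chi2 = 0.0
max_residual = 0.0
for r, row in enumerate(observed):
    for c, obs in enumerate(row):
        expected = row_totals[r] * col_totals[c] / n
        chi2 += (obs - expected) ** 2 / expected
        residual = (obs - expected) / sqrt(expected)  # Pearson residual
        if abs(residual) > abs(max_residual):
            max_residual = residual

d_min = min(len(observed), len(observed[0]))  # 2 for a 2x3 table
cramers_v = sqrt(chi2 / (n * (d_min - 1)))    # effect size
```

Running this yields χ² ≈ 38,199.59, V ≈ 0.499, and a maximum residual of ≈ 178.81 (in the cell where the n-gram is present and the pronunciation is /l/+/u̯/), matching the first row of Table 3.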
5 Conclusion
We presented a machine-learning approach to improve the accuracy of phonetic transcriptions of Slovene words that contain the ambiguous pre-consonant l. While the method does improve accuracy (86% over a majority baseline of cca. 76%) by using very simple character-level n-gram and morphosyntactic features, it does not resolve the problem entirely. Aside from several exceptions in language use which are difficult to predict (e.g. gasilci, čistilka; both pronounced with /l/ even though the majority of words ending with -ilec and -ilka in the dataset can be pronounced with either /l/ or /u̯/), the analysis of misclassified examples has shown several potential future steps that can be implemented to further improve the performance of the models. First, several additional features should be tested. Some of the features are simple, such as word length or the number of syllables in the word (which could potentially help to correctly classify words such as volk and polh; short words where the pre-consonant l is pronounced as /u̯/). The relative position of the pre-consonant l in the word could also potentially be helpful. Several more complex features could also be added, such as word-formation relations and morpheme boundaries to help disambiguate, for instance, decimal-ka 'decimal number', which is derived from the adjective decimalen 'pertaining to decimal numbers' and is pronounced with /l/, and mor-ilka 'murderer (feminine)', which is derived from the verb moriti 'to murder' and can be pronounced with either /l/ or /u̯/. Taking into account the accentuated form of the word could also help: for instance, the accentuation ôl – vôlk 'wolf', pôlh 'dormouse' – indicates the /u̯/ pronunciation, while the accentuation ól is indicative of the /l/ pronunciation (e.g. pólka 'polka'). However, more complex features cannot be extracted from the word form itself, so making the model too heavily reliant on external linguistic knowledge would sacrifice its robustness and usefulness for unseen words. We will explore these options in our future work, but we will first focus on the simplest features to determine the upper boundary of accuracy that can be achieved based solely on the word form and its morphosyntactic features. We will perform additional statistical analyses on n-grams containing the pre-consonant l as well, and once the optimal model is achieved, it will also be evaluated on previously unseen words containing the pre-consonant l that have not been included in the ILS 1.0 dataset. The results will hopefully also provide more interesting material for further linguistic analyses (such as exceptions to the rules).

As already mentioned, the ILS 1.0 dataset does not necessarily accurately reflect the linguistic landscape of pre-consonant l pronunciation in Slovene words, and more annotations along with perceptive tests and surveys are required. The pronunciations will be manually validated as part of the work on the Digital Dictionary Database of Slovene [8], the largest machine-readable open-access database of Slovene linguistic and lexicographic data. The pronunciations will also be cross-referenced with the recordings from the GOS Corpus of Spoken Slovene [18], which contains real recordings of Slovene speech and can contribute towards a more accurate distribution of different pronunciations for individual lexemes (e.g., how many occurrences of /glE"da:u̯ka/ or /glE"da:lka/), along with any potential relevant metadata (for instance, whether the pronunciation depends on the region the speaker originates from). The models can then be re-trained on new data and further improved to better reflect real language use.

The models will be implemented into the Slovene IPA/X-SAMPA Grapheme-to-Phoneme Converter as part of the Pregibalnik tool for automatic Slovene lexicon expansion, which is available under a Creative Commons BY-SA 4.0 license.¹²

¹¹ We also ran a 10-fold cross-validation using only n-gram features (no morphosyntax). The performance of the models was slightly worse, e.g. for LinearSVC: A = 85.05, BA = 69.14, P = 68.94, R = 46.85, F1 = 55.76.
¹² The best-performing LinearSVC model (and the accompanying code) for the prediction of pre-consonant l pronunciation is available on GitHub: https://github.com/jakacibej/sikdd2025_predicting_preconsonant_l

Acknowledgements
The research presented in this paper was carried out within the research project Basic Research for the Development of Spoken Language Resources and Speech Technologies for the Slovenian Language (J7-4642), the research programme Language Resources and Technologies for Slovene (P6-0411), and the CLARIN.SI Research Infrastructure (I0-E004), all funded by the Slovenian Research and Innovation Agency (ARIS). The author also thanks the anonymous reviewers for their constructive comments.

References
[1] Jaka Čibej. 2024. Dataset of annotated Slovene words with pre-consonant l ILS 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/2025.
[2] Jaka Čibej. 2023. Leksikon besednih oblik Sloleks. Poročilo projekta Razvoj slovenščine v digitalnem okolju, aktivnost DS1.3. Development of Slovene in a Digital Environment. https://www.cjvt.si/rsdo/wp-content/uploads/sites/18/2023/06/RSDO_Kazalnik_Sloleks_v2.pdf.
[3] Jaka Čibej. 2024. Predicting pronunciation types in the Sloleks Morphological Lexicon of Slovene. In Data Mining and Data Warehouses (SiKDD): Proceedings of the 27th International Multiconference Information Society (IS) 2024 – Volume C. Institut »Jožef Stefan«, 23–26. https://is.ijs.si/wp-content/uploads/2024/11/IS2024_Volume-C.pdf.
[4] Jaka Čibej. 2025. Statistična analiza izgovora črke l v slovenskem oblikoslovnem leksikonu Sloleks. Jezikoslovni zapiski, 31, 1 (May 2025), 37–54. doi:10.3986/JZ.31.1.03.
[5] Jaka Čibej et al. 2022. Morphological lexicon Sloleks 3.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1745.
[6] Harald Cramér. 1946. Mathematical Methods of Statistics. Princeton Mathematical Series, Vol. 9. Princeton University Press.
[7] Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 2, 65–70.
[8] Iztok Kosem, Simon Krek, and Polona Gantar. 2021. Semantic data should no longer exist in isolation: the Digital Dictionary Database of Slovenian. In 9th EURALEX International Congress "Lexicography for Inclusion", 81–83. https://elex.is/wp-content/uploads/2021/09/Semantic-Data-should-no-longer-exist-in-isolation-the-Digital-Dictionary-Database-of-Slovenian_Kosem-Krek-Gantar_EURALEX2020.pdf.
[9] Janez Križaj, Simon Dobrišek, Aleš Mihelič, and Jerneja Žganec Gros. 2022. Uporaba postopkov strojnega učenja pri samodejni slovenski grafemsko-fonemski pretvorbi. In Jezikovne tehnologije in digitalna humanistika: zbornik konference 2022. Inštitut za novejšo zgodovino, 248–251. https://nl.ijs.si/jtdh22/pdf/JTDH2022_Proceedings.pdf.
[10] Xavier Marjou. 2021. GIPFA: generating IPA pronunciation from audio. In eLex 2021 Conference Proceedings, 588–597. https://elex.link/elex2021/wp-content/uploads/2021/08/eLex_2021_38_pp588-597.pdf.
[11] Tanja Mirtič. 2019. Glasoslovne raziskave pri pripravi splošnega razlagalnega slovarja. In Slovenski javni govor in jezikovno-kulturna (samo)zavest. Znanstvena založba Filozofske fakultete, 81–90. https://centerslo.si/wp-content/uploads/2019/10/Obdobja-38_Mirtic.pdf.
[12] Karl Pearson. 1900. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50, 302, 157–175. doi:10.1080/14786440009463897.
[13] F. Pedregosa et al. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[14] Uwe Reichel, Hartmut R. Pfitzinger, and Horst-Udo Hain. 2008. English grapheme-to-phoneme conversion and evaluation. In Speech and Language Technology 11, 159–166. https://www.phonetik.uni-muenchen.de/~reichelu/publications/ReichelPfitzingerHainSASR2008.pdf.
[15] Anja Schüppert, Wilbert Heeringa, Jelena Golubovic, and Charlotte Gooskens. 2017. Write as you speak? A cross-linguistic investigation of orthographic transparency in 16 Germanic, Romance and Slavic languages. In From Semantics to Dialectometry, 303–313. ISBN: 9781848902305.
[16] Hotimir Tivadar. 2004. Priprava, izvedba in pomen perceptivnih testov za fonetično-fonološke raziskave (na primeru analize fonoloških parov). Jezik in slovstvo, 49, 2, 17–36. https://ojs.zrc-sazu.si/jz/article/view/14222.
[17] Antal van den Bosch, Alain Content, Walter Daelemans, and Beatrice de Gelder. 1994. Analysing orthographic depth of different languages using data-oriented algorithms. In Proceedings of the 2nd International Conference on Quantitative Linguistics.
[18] Darinka Verdonik et al. 2023. Spoken corpus GOS 2.1 (transcriptions). Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1863.
[19] Jerneja Žganec Gros, Tanja Mirtič, Miroslav Romih, and Kozma Ahačič. 2022. Slovar izgovarjav OptiLEX (1. e-izd.). Založba ZRC. ISBN: 978-961-05-0672-0. https://doi.org/10.3986/9789610506720.
l om/jakacibej/sikdd2025_predicting_preconsonant_l 56 Sequencing News Articles with Large Language Models within Enterprise Risk Management Context Žiga Debeljak† Dunja Mladenić Klemen Kenda Jožef Stefan International Department for Artificial Department for Artificial Postgraduate School Intelligence, Intelligence, Ljubljana, Slovenia Jožef Stefan Institute Jožef Stefan Institute ziga.debeljak@mps.si Ljubljana, Slovenia Ljubljana, Slovenia dunja.mladenic@ijs.si klemen.kenda@ijs.si Abstract risks, especially within risk scenario analysis [10, 11]. The capability to build structured timelines from unstructured textual (LLMs) to reconstruct event timelines from unstructured news information is therefore of high relevance to ERM. This paper evaluates the capability of Large Language Models data. This capability is highly relevant for Enterprise Risk LLMs are increasingly utilized in ERM for their ability to Management (ERM) applications, where the reconstruction and process and analyze unstructured textual data, including news forecasting of coherent event trajectories are crucial for articles, to identify and assess risks [1, 2, 3, 4, 5]. In the financial identifying, assessing, and predicting emerging risks and sector, applications include extracting sentiment from news to analyzing risk scenarios. In this study, we tasked twenty LLMs gauge market perception or identify reputational risks [3, 6, 7, 8], with chronologically ordering randomly shuffled business news and identifying specific risk factors or events discussed in news simple date sorting, all explicit date markers were removed from demonstrates LLMs' utility in analyzing individual or aggregated the articles. The experiments were conducted under one news items for tasks such as sentiment analysis, risk factor articles for three distinct real-world event chains. To prevent and corporate disclosures [2, 4, 5, 9]. 
Existing literature mainly with hints for the first, the last, or both the first and the last identification, or event detection, but the capabilities of the unassisted and three assisted scenarios that provided the models articles in the sequence. The results reveal a systematic variation models to recover the temporal order and causal links among a in difficulty across the three tasks in addition to significant sequence of discrete news items that describe an unfolding performance disparities among the models, with Grok 4 (xAI), narrative are less directly explored. This paper aims to address GPT-5, o3 and o3-pro (all three OpenAI), and Gemini 2.5 Pro this gap by investigating LLM performance in temporal-causal (Google) consistently outperforming other models practically reasoning within news streams, a crucial aspect for across all tasks and prompting scenarios. As expected, prompting understanding the dynamics of unfolding risk narratives. assistance with additional information systematically improved By investigating whether state-of-the-art commercial or open- accuracy, especially for the models that performed poorly in the source LLMs can reconstruct the chronological narrative of unassisted scenario. The high level of accuracy achieved by the business-event chains from unordered news articles, this paper ERM applications. 
contributes to the field by: (a) systematically evaluating the top-performing models indicates a practical utility for real-world performance of multiple LLMs on a challenging temporal- Keywords reasoning task; (b) analysing the efficacy of diverse prompting strategies — both unassisted and assisted — in improving model Large Language Models, News-Stream Sequencing, Temporal accuracy; (c) providing insights into model-and-task dynamics, Reasoning revealing substantial performance disparities, task-specific 1 INTRODUCTION from contextual hints; and (d) demonstrating the practical difficulty patterns, and the outsized gains weaker models receive Within Enterprise Risk Management (ERM) practice, readiness of these technologies for ERM deployment. organizations monitor external developments also by analyzing streams of publicly available news. Each news article captures a 2 RESEARCH METHOD momentary state of the political-economic environment, and by accurately structuring unordered information into a Task Definition chronological narrative, organizations can better understand the To evaluate the capabilities of LLMs, three event chains were evolution of events and the relationships that connect them. The constructed, focusing on: (1) Trump's Tariffs and EU reconstruction and forecasting of these event trajectories are [“Task_1”], (2) Gold Prices [“Task_2”], and (3) the Ukraine- important for identifying, assessing, and predicting emerging Russia War [“Task_3”]. These topics were selected due to their significant relevance to the business environment. 
For each topic, ten articles were manually selected from the online editions of two reputable sources of financial and business information, published between March 1st and May 2nd, 2025. For the purpose of LLM processing, the raw text from the selected articles was extracted. To prevent temporal bias, explicit date indicators (such as full dates) were removed, and no two articles shared the same publication date. Subsequently, the articles within each event chain were randomly shuffled, and this fixed random order was then applied to all models within the experiment.

† Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
http://doi.org/10.70314/is.2025.sikdd.4

The primary task for the selected LLMs was to reconstruct the chronological sequence of news articles within the three event chains. This task was evaluated across four experimental scenarios: (1) an unassisted scenario ["Assist_No"], and three assisted scenarios providing the (2) first ["Assist_First"], (3) last ["Assist_Last"], or (4) both the first and last ["Assist_FirstLast"] articles in the sequence.

In the unassisted scenario, the LLMs were required to determine the correct chronological order of the articles without any external information regarding their placement. In the assisted scenarios, the models were provided with hints within the user prompt. Specifically, for the Assist_First and Assist_Last scenarios, the prompt identified the article occupying the initial or final position, respectively. In the Assist_FirstLast scenario, the LLMs were given the identifiers for the articles that correspond to the beginning and end of the chronological sequence.

The required output from the LLMs was a reconstructed timeline of the news articles. For each position in the timeline, the following information was mandated: (i) the article's identification number, (ii) the article's title, (iii) a brief justification for its placement relative to the preceding article, and (iv) a brief justification for its placement relative to the subsequent article. The models were required to provide a structured output in JSON format.

Selected LLMs and Experiment Execution
Twenty different models by eight different providers were selected for this research, based on their expected capabilities with regard to the tasks and their availability. An overview of the selected models is shown in Table 1.

Table 1: Selected LLMs

  #   Provider: Model Name               Context Window (tokens)   Date Created
  1   OpenAI: GPT-4.1                    1,047k                    14.04.2025
  2   OpenAI: o3                         200k                      16.04.2025
  3   OpenAI: o3-pro                     200k                      10.06.2025
  4   OpenAI: gpt-oss-120b               131k                      5.08.2025
  5   OpenAI: GPT-5                      400k                      7.08.2025
  6   Google: Gemini 2.5 Pro Preview     1,048k                    7.05.2025
  7   Google: Gemini 2.5 Flash Preview   1,048k                    20.05.2025
  8   xAI: Grok 3 Beta                   131k                      9.04.2025
  9   xAI: Grok 4                        256k                      9.07.2025
  10  Anthropic: Claude Sonnet 4         200k                      22.05.2025
  11  Anthropic: Claude Opus 4           200k                      22.05.2025
  12  Anthropic: Claude Opus 4.1         200k                      5.08.2025
  13  Meta: Llama 4 Maverick             1,048k                    5.04.2025
  14  Meta: Llama 4 Scout                1,048k                    5.04.2025
  15  Mistral AI: Mistral Medium 3       131k                      7.05.2025
  16  Mistral AI: Mistral Medium 3.1     262k                      13.08.2025
  17  Qwen: QwQ 32B                      131k                      5.03.2025
  18  Qwen: Qwen 2.5 VL 32B Instruct     128k                      24.03.2025
  19  DeepSeek: DeepSeek V3              163k                      24.03.2025
  20  DeepSeek: R1                       128k                      28.05.2025

All models were accessed using the OpenRouter platform via the APIs.
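A single experiment run can be sketched as below. This is a hypothetical reconstruction, not the authors' code: OpenRouter exposes an OpenAI-compatible chat-completions endpoint, and the model slug, prompt wording, and helper name are illustrative assumptions. The sketch only builds the request payload; an actual call would POST it with an API key.

```python
import json

# Assumed OpenAI-compatible chat-completions endpoint on OpenRouter.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_sequencing_request(model, articles, first_hint=None, last_hint=None):
    """Build the JSON payload for one sequencing run (illustrative only)."""
    parts = ["Reconstruct the chronological order of the following news "
             "articles. Return JSON: for each position give the article id, "
             "its title, and brief justifications relative to the preceding "
             "and subsequent articles."]
    parts += [f"[Article {i + 1}]\n{text}" for i, text in enumerate(articles)]
    if first_hint is not None:  # Assist_First / Assist_FirstLast scenarios
        parts.append(f"Hint: Article {first_hint} is first in the sequence.")
    if last_hint is not None:   # Assist_Last / Assist_FirstLast scenarios
        parts.append(f"Hint: Article {last_hint} is last in the sequence.")
    return {
        "model": model,
        # a single one-shot user prompt; no system prompt is used
        "messages": [{"role": "user", "content": "\n\n".join(parts)}],
        "temperature": 0.0,  # deterministic decoding where supported
    }

payload = build_sequencing_request("openai/gpt-4.1", ["text A", "text B"],
                                   first_hint=2)
body = json.dumps(payload)  # serialized request body for the POST
```

Looping this builder over the 3 event chains, 4 scenarios, and 20 model slugs would reproduce the 240 experiment outputs described in the paper.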
For models supporting this parameter, the temperature was set to 0.0 to ensure the most reliable and reproducible experimental results; otherwise, default parameters were used. There were 12 experiments executed: 3 different event topic chains (tasks) in 4 experimental scenarios (prompts) each, using all 20 LLMs shown in Table 1, thus resulting in 240 results (outputs). Experiments were executed on June 1st, 2025 with the models available on that date, and on August 19th, 2025 with the newer models.

Prompt Engineering

Prompt engineering included manual drafting, testing on different models, and optimization both with LLMs (GPT o3 and Gemini 2.5 Pro) and manually, in several iterations. In the end, an effective user prompt was developed that worked reasonably well for all selected models. The main challenges in the design of the prompts were: (a) stimulating a systematic approach to causal reasoning, which was considered mainly important for the non-reasoning models; (b) ensuring the output consisted of exactly ten distinct articles, with no repetitions or omissions; (c) enforcing the required output JSON schema; and (d) providing concise reasoning for the positioning of the observed articles.

Within the user prompt, the models were explicitly instructed to use the following reasoning principles: (a) inferring sequences of events (how events described in different articles relate to each other over time); (b) causal reasoning (identifying cause-and-effect relationships between the content of different articles); (c) logical story progression (understanding how a narrative or situation typically develops or unfolds); (d) utilizing any implicit time references if available within the articles; and (e) using the models' general knowledge about events.

Prompts with clear instructions about the guidelines for the reasoning process worked better than prompts without such instructions, even with models with strong reasoning capabilities. System prompts were not utilized, as the one-shot user prompt contained all necessary instructions for the models. The full user prompt is available from the authors.

3 EVALUATION AND DISCUSSION

General Evaluation

In terms of the output content, all models demonstrated strong performance in response to a standardized user prompt, successfully producing the requested ordered lists of news articles with all accompanying metadata. From a logical standpoint, the outputs from all models were accurate, presenting ordered lists that included all required supplementary information. Substantial variations in output quality were observed across the different models. This variation was also influenced by the three distinct tasks, which seemed to be of substantially different difficulty, with the first task being the most straightforward and the last presenting the most significant challenge. As anticipated, the implementation of assisted prompting strategies consistently enhanced the accuracy of the outputs for all models across all evaluated tasks.

Regarding the output formatting, the majority of the models adhered to the specified JSON schema. Notable exceptions were the Claude models (models #10, #11 and #12), which occasionally deviated from the requested format by including a short introductory text. In these instances, the textual outputs were programmatically reformatted to conform to the required JSON structure. It is relevant to note that these three models are the only ones in the evaluation that do not natively support the Structured Output functionality, a factor that likely contributed to their formatting inconsistencies.

Performance Metric

To quantify the models' performance on the given tasks, a robust evaluation metric was required. For this purpose, Kendall's rank correlation coefficient ("Kendall's τ", "τ") was selected as the most appropriate measure. Kendall's τ is a non-parametric statistic that measures the ordinal association between two ranked lists. Its methodology is centered on comparing the concordance of all possible pairs of items within the sequences, yielding a score in the interval from -1 (perfect reversal) to +1 (perfect match). The focus on relative, pairwise ordering makes Kendall's τ exceptionally well suited for a chronological sorting task, as the core challenge lies in correctly establishing which event occurred before another, which is precisely what the metric evaluates.

An alternative metric, the sum of absolute Manhattan distances, was also considered but ultimately deemed less suitable. Its primary drawback is its sensitivity to the magnitude of displacement, which can produce misleading evaluations by heavily penalizing single items that are wildly out of place, while potentially under-penalizing a sequence with numerous smaller, local errors that might represent a poorer overall sort.

Performance by Tasks and Scenarios

The performance of each model, quantified by Kendall's τ, is detailed in Tables 2 and 3. Table 2 presents the coefficients organized by task (event chain), averaged across all experimental scenarios (prompts). Table 3, in turn, presents the coefficients organized by experimental scenario, averaged across all the tasks. The ranks in both tables were determined by averaging the performance rankings of all the models across individual tasks and scenarios. They largely correspond to the rankings based on average τ, but discrepancies may arise from variation in the scale and distribution of τ values across experiments.

To contextualize these performance metrics, their relationship to pairwise accuracy is critical: within a 10-item sequence, a Kendall's τ of 0.90, 0.80 or 0.50 indicates that approximately 95%, 90% or 75% of the 45 possible pairs are concordantly ordered, respectively.

Table 2: Average Performance by Tasks (Kendall's τ)

Rank   Model #   Task_1   Task_2   Task_3   Avg. τ
1      9         0.96     0.98     0.70     0.88
2      2         0.94     0.94     0.56     0.81
3      5         0.96     0.99     0.49     0.81
4      3         0.94     0.93     0.52     0.80
5      6         0.94     0.96     0.52     0.81
6      8         0.93     0.79     0.43     0.72
7      12        0.94     0.70     0.41     0.69
8      20        0.83     0.82     0.50     0.72
9      7         0.84     0.89     0.48     0.74
10     11        0.93     0.67     0.36     0.65
Avg. top 5:      0.95     0.96     0.56     0.82
Avg. all 20:     0.85     0.71     0.36     0.64

The aggregated results in Table 2 underscore two principal findings. First, a significant and systematic variation in task difficulty was evident, with Task_1 representing the simplest case and Task_3 the most demanding. This pattern held true for practically all the evaluated models and experimental scenarios. The performance differences indicating different task difficulty were substantial. For Task_1 and the unassisted scenario, the Kendall's τ values for the average, best model, and worst model performance were 0.78, 0.91 and 0.02, respectively. For Task_2, the values were 0.63, 1.00 and 0.16, and for Task_3, they were 0.02, 0.38 and -0.33. These findings clearly establish Task_3 as the most difficult of the three tasks evaluated. Note that a negative Kendall's τ value indicates an inverse correlation between the predicted and true rankings, and a value around zero represents a random ordering. Second, the results show that the more recent versions and models with strong reasoning capabilities (Grok 4, GPT-5, o3 and o3-pro, and Gemini 2.5 Pro) consistently outperform other models practically across all tasks.

Table 3: Average Performance by Scenarios (Kendall's τ)

Rank   Model #   Assist_No   Assist_First   Assist_Last   Assist_FirstLast   Avg. τ
1      9         0.75        0.88           0.90          0.99               0.88
2      2         0.69        0.88           0.76          0.93               0.81
3      5         0.73        0.84           0.81          0.87               0.81
4      3         0.72        0.87           0.76          0.85               0.80
5      6         0.57        0.93           0.84          0.90               0.81
6      8         0.48        0.81           0.78          0.81               0.72
7      12        0.48        0.66           0.73          0.87               0.69
8      20        0.66        0.75           0.64          0.82               0.72
9      7         0.54        0.73           0.81          0.87               0.74
10     11        0.48        0.64           0.66          0.82               0.65
Avg. top 5:      0.69        0.88           0.81          0.91               0.82
Avg. all 20:     0.47        0.67           0.63          0.79               0.64

The aggregated results in Table 3 underscore three principal findings. First, assisted prompting systematically improved the performance across all models and tasks, which is logical and expected since additional relevant information is provided to the models. Anchoring with known positions in the majority of cases helped the models to better position the remaining articles as well.

Second, the provision of additional information proved more beneficial for the most demanding task (Task_3) than for the less demanding tasks (Task_1 and Task_2). For example, in the Assist_FirstLast scenario, the increase in average τ relative to the unassisted scenario was 0.13 for Task_1, 0.17 for Task_2, and 0.65 for Task_3. This finding follows logically from the models' greater ability to identify the first and/or last article in simpler tasks by themselves: in Task_1, 15 of 20 models correctly identified the first position, while none identified the last position; in Task_2, 9 models identified the first position and 4 identified the last position; and in Task_3, no model identified either position correctly.

Third, the provision of additional information disproportionately benefited models that performed poorly in the unassisted scenario.
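As a side note on the metric, the τ values discussed in this section are easy to reproduce. Below is a minimal sketch of Kendall's τ (the tau-a variant, assuming no ties, which holds for strict orderings of distinct articles) together with its pairwise-accuracy reading; production code could use scipy.stats.kendalltau instead:

```python
from itertools import combinations

def kendall_tau(pred, true):
    """Kendall's tau (tau-a, no ties) between two orderings.

    `pred` and `true` map each article id to its position in the
    predicted and ground-truth ordering, respectively.
    """
    n = len(true)
    concordant = discordant = 0
    for a, b in combinations(true, 2):
        # A pair is concordant when both orderings agree on which
        # of the two articles comes first.
        if (pred[a] - pred[b]) * (true[a] - true[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Ground truth: 10 articles in positions 0..9; the prediction swaps two
# neighbouring articles, leaving 44 of the 45 pairs concordant.
true = {i: i for i in range(10)}
pred = dict(true)
pred[3], pred[4] = pred[4], pred[3]
tau = kendall_tau(pred, true)   # (44 - 1) / 45, roughly 0.956

# The concordant share p and tau are related by p = (tau + 1) / 2,
# which matches the text: tau 0.90 -> 95%, 0.80 -> 90%, 0.50 -> 75%.
```

The same routine returns +1 for a perfect match, -1 for a perfect reversal, and values near zero for a random ordering, mirroring the interpretation given above.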
For instance, on Task_3, the most difficult task, with an average Kendall's τ of only 0.02 in the unassisted scenario, the Assist_First scenario yielded average and maximum performance improvements of 0.46 and 1.07, respectively. For the Assist_Last scenario, the corresponding improvements were 0.27 and 0.80, while for the Assist_FirstLast scenario they were 0.65 and 1.02. The results demonstrate that supplementing less capable models with limited key information can yield significant performance gains on these tasks.

A qualitative examination of the models' reasoning justifications failed to yield systematic insights into their capacity to reconstruct accurate chronological sequences of articles. Although the generated rationales were generally logical and relevant, they frequently omitted crucial contextual information essential for correct chronological reasoning. This observation underscores the challenge that certain timelines may not be uniquely reconstructible due to insufficient contextual information. Furthermore, in some instances, the provided justification could plausibly support an alternative, yet equally valid, timeline. Moreover, this is compounded by the inherent challenge of discerning whether the provided reasoning justifications represent the model's actual inferential process or are merely a result of post-hoc rationalization.

4 CONCLUSIONS AND FURTHER RESEARCH IDEAS

This research provides insight into the practical application and inherent challenges of utilizing LLMs to sequence news streams in the context of ERM. The selected use cases are based on real-world, business-relevant event chains.

A comparative analysis reveals significant performance disparities among the evaluated models across all tasks and experimental scenarios. Models with superior reasoning capabilities surpassed those with less developed abilities. The varying complexity of the presented tasks further accentuated these performance differences. Also, providing additional anchoring information disproportionately benefited models that performed poorly in the unassisted scenario. Five models, Grok 4 (xAI), GPT-5, o3 and o3-pro (all three OpenAI), and Gemini 2.5 Pro (Google), consistently outperformed all other models in practically every task and experiment scenario. The performance level achieved by these models demonstrates their practical utility for real-world ERM applications.

This research has opened several promising areas for further research:

(1) Benchmarking LLMs against human experts: A rigorous comparative study should be undertaken in which large LLMs and domain specialists (human experts) perform identical tasks under strictly matched contextual conditions.

(2) Systematically varying model settings to probe "creativity" and reliability: Experiments that modulate the temperature and other model settings can clarify how stochasticity affects task performance and reliability.

(3) Enabling models to request task-critical information: Instead of supplying predefined contextual information, such as the first and/or last article in a sequence, future studies might allow the model to query for the minimal supplementary data it deems most informative. This strategy would approximate an active-learning workflow and might even illuminate new modes for human-LLM collaboration.

(4) Diagnosing mis-ordering errors through reasoning audits: To understand why models fail to reconstruct the correct temporal ordering of news articles, one could extract each model's stated reasoning features for every placement decision, then have human experts or adjudicating LLMs rate their accuracy and relevance. Such audits would expose specific deficits in reasoning and could even inform targeted retraining regimes.

(5) Experimenting with extended or interleaved event chains: Evaluating models on substantially longer sequences, or on mixtures of events drawn from multiple chains, would markedly raise task complexity and furnish a stringent benchmark of temporal-reasoning competence for business use cases.

ACKNOWLEDGMENTS

The authors acknowledge the use of LLMs during various stages of this research. These models provided support in tasks such as idea generation, text processing, prompt engineering, methodological exploration, and language optimization. While the LLMs contributed to enhancing efficiency and refining the presentation of this work, all conceptual frameworks, analyses, and interpretations remain the sole responsibility of the authors.

REFERENCES

[1] Y. Cao et al., 'RiskLabs: Predicting Financial Risk Using Large Language Model Based on Multi-Sources Data', Apr. 11, 2024, arXiv:2404.07452. doi: 10.48550/arXiv.2404.07452.
[2] A. Kim, M. Muhn, and V. V. Nikolaev, 'From Transcripts to Insights: Uncovering Corporate Risks Using Generative AI', Jul. 11, 2024, Rochester, NY: 4593660. doi: 10.2139/ssrn.4593660.
[3] T. Li and X. Dai, 'Financial Risk Prediction and Management using Machine Learning and Natural Language Processing', IJACSA, vol. 15, no. 6, 2024. doi: 10.14569/IJACSA.2024.0150623.
[4] Y. Wang, 'Generative AI in Operational Risk Management: Harnessing the Future of Finance', May 17, 2023, Rochester, NY: 4452504. doi: 10.2139/ssrn.4452504.
[5] X. Zhu, H. Jin, J. Li, and Y. Wang, 'Topic-Gpt: A Novel Risk Identification Method Based on Large Language Model', Jul. 04, 2024, Social Science Research Network, Rochester, NY: 4885365. doi: 10.2139/ssrn.4885365.
[6] M. Katamaneni, P. Agrawal, S. Veera, A. K. Sahoo, K. Singh Sidhu, and M. F. Hasan, 'AI-Based Risk Management in Financial Services', in 2024 Second International Conference on Computational and Characterization Techniques in Engineering & Sciences (IC3TES), Nov. 2024, pp. 1–5. doi: 10.1109/IC3TES62412.2024.10877497.
[7] X. V. Li and F. S. Passino, 'FinDKG: Dynamic Knowledge Graphs with Large Language Models for Detecting Global Trends in Financial Markets', in Proceedings of the 5th ACM International Conference on AI in Finance, Nov. 2024, pp. 573–581. doi: 10.1145/3677052.3698603.
[8] A. Nygaard et al., 'News Risk Alerting System (NRAS): A Data-Driven LLM Approach to Proactive Credit Risk Monitoring', in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, F. Dernoncourt, D. Preoţiuc-Pietro, and A. Shimorina, Eds., Miami, Florida, US: Association for Computational Linguistics, Nov. 2024, pp. 429–439. doi: 10.18653/v1/2024.emnlp-industry.32.
[9] Z. Xiao, Z. Mai, Z. Xu, Y. Cui, and J. Li, 'Corporate Event Predictions Using Large Language Models', in 2023 10th International Conference on Soft Computing & Machine Intelligence (ISCMI), Nov. 2023, pp. 193–197. doi: 10.1109/ISCMI59957.2023.10458651.
[10] Committee of Sponsoring Organizations of the Treadway Commission (COSO), Enterprise Risk Management, Integrating with Strategy and Performance. Durham, NC: COSO, 2017.
[11] International Organization for Standardization, ISO 31000:2018, Risk management, Guidelines. Geneva, Switzerland: ISO, 2018.


Graph-Based Feature Engineering for DeFi Security Incident Severity Prediction

Daria Pavlova∗ (daria.pavlova@mps.si), Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Inna Novalija (inna.koval@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Dunja Mladenić (dunja.mladenic@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

ABSTRACT

Decentralized Finance (DeFi) has emerged as a rapidly growing sector, but it has been plagued by numerous security incidents resulting in billions of USD in losses.
An important challenge is predicting which security incidents will lead to severe financial losses, as this can inform risk management and mitigation strategies. In this paper, we present a novel approach that integrates a semantic knowledge graph of the DeFi ecosystem into the machine learning pipeline for incident severity prediction. We construct a knowledge graph capturing rich relationships between DeFi protocols (including protocol fork lineage, multi-chain deployments, and historical incidents), and we engineer graph-based features from this graph to augment traditional incident features. Using these features in a gradient boosting trees classifier, we predict whether an incident will cause above-threshold (severe) losses. Our results show that incorporating graph-based features yields a substantial improvement in predictive performance: the model with semantic graph features achieves an Area Under ROC Curve (AUC) of 0.787, a 31.6% relative increase over the baseline model using only non-graph features. We observe particularly large gains in precision (from 0.341 to 0.490), indicating a significantly reduced false alarm rate. While these absolute performance values remain moderate, they represent substantial improvements for this challenging prediction task. The findings demonstrate the practical value of graph-enriched feature engineering for security analytics in DeFi. This work provides new insights into how protocol interconnections and characteristics contribute to incident severity, opening avenues for more robust DeFi risk assessment tools.

KEYWORDS

Decentralized Finance, DeFi, Security, Knowledge Graph, Feature Engineering, Incident Severity Prediction

1 INTRODUCTION

Decentralized Finance (DeFi) platforms have experienced rapid growth, alongside a surge in security breaches such as hacks and exploits. In 2022 alone, crypto attacks led to over $3.8 billion in stolen assets, with the majority coming from DeFi protocol exploits [1]. These incidents vary widely in impact: while many attacks result in limited losses, a significant fraction escalate into catastrophic failures causing losses in the tens or hundreds of millions of dollars. Predicting which security incidents will become severe (high-loss) events is crucial for proactive risk management, insurance underwriting, and developing early warning systems for the DeFi ecosystem.

Prior research has analyzed DeFi vulnerabilities and attack taxonomy [6], and industry reports highlight the growing scale of DeFi hacks. However, there is a gap in predictive approaches: existing studies focus on identifying vulnerabilities or classifying attack types, rather than forecasting the severity level of an incident before it fully unfolds. To our knowledge, this is the first work to apply semantic knowledge graph features specifically for DeFi incident severity prediction, establishing a new baseline for this important problem.

In traditional cybersecurity contexts, incorporating relational context via knowledge graphs and network models has been shown to improve threat detection [3]. For example, graph-based severity triage using attack graphs has been studied in traditional cybersecurity [5].

In this work, we propose a novel graph-based feature engineering approach to address this challenge. We construct a semantic knowledge graph of the DeFi ecosystem that encodes domain knowledge: nodes represent entities such as protocols and incidents, and edges capture relationships like "forked-from" (denoting protocol lineage) and "deployed-on" (connecting protocols to blockchain platforms), among others. From this knowledge graph, we derive a set of graph-based features for each security incident. These features quantify properties such as a protocol's structural position in the ecosystem (e.g., number of fork "children," cross-chain deployments, past incident count), which we posit are predictive of how severe an incident could be. We integrate these semantic graph features with conventional features (e.g., time of incident, incident type categories) in a machine learning classifier to predict whether an incident's loss will exceed a severity threshold. The contributions of our work are as follows:

• We introduce a methodology to incorporate a DeFi-specific knowledge graph into security incident severity prediction.
• We demonstrate significant performance gains over a baseline model lacking graph features (improving AUC by 31.6% and F1-score by 25%).
• We provide a comprehensive analysis including case studies, illustrating how related protocol dependencies can influence risk.
• We discuss practical implications of our findings for improving DeFi risk assessment.

All code and the publicly available dataset for this work are available in an open-source repository [4].

∗First author and presenter.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.6

2 METHODOLOGY

2.1 Knowledge Graph Construction

We built a knowledge graph representing the DeFi ecosystem to serve as a basis for feature engineering. The construction process was semi-automated, combining API data extraction with manual curation to ensure semantic consistency.

Data Sources: We integrated data from three primary sources: (1) the Rekt database (https://rekt.news) containing detailed DeFi security incident reports, (2) DeFiLlama's API providing protocol metadata including deployment chains and fork relationships, and (3) SlowMist Hacked for additional incident verification. All data sources are publicly available.

Semi-Automated Process: Protocol and incident data were automatically extracted using APIs and web scraping. Fork relationships were identified through a combination of automated code similarity analysis (for protocols with public repositories) and manual verification based on project documentation. The resulting knowledge graph contains 892 protocol nodes, 1,608 incident nodes, and 42 blockchain nodes, connected by over 3,500 edges representing various relationships. We use Neo4j to store and query this graph efficiently through asynchronous operations.

The graph's schema defines several entity types and relations relevant to DeFi security:

• Protocol nodes: Each DeFi protocol (e.g., lending platform, DEX, yield aggregator) is a node. Attributes include protocol name and launch date.
• Incident nodes: Major recorded security incidents (hacks, exploits) are represented as nodes with attributes such as date, loss amount, and qualitative classification (e.g., flash loan, smart contract bug).
• Blockchain nodes: Blockchain platforms (Ethereum, Binance Smart Chain, etc.) are included to capture deployment contexts.

Key relationships are encoded as directed edges:

• Fork-of: Connects a protocol to the protocol it was forked from (if applicable), capturing lineage (e.g., SushiSwap → Uniswap).
• Deployed-on: Links a protocol to a blockchain platform on which it is deployed.
• Incident-involves: Links an incident node to the protocol(s) affected by that incident.

The resulting graph captures a rich hierarchical structure of protocol relationships (including parent–child fork trees and cross-chain deployment links), as well as the association of past incidents with protocols. An overview of the graph structure is shown in Figure 1, and an illustrative Convex-centric subgraph is given in Figure 2.

Figure 1: DeFi knowledge graph overview: protocols, blockchains, and incidents with relations (forked-from, deployed-on, involves).

Figure 2: Convex-centric subgraph. Dependency on Curve highlights potential severity propagation via upstream vulnerabilities.

2.2 Feature Engineering with Graph-Based Features

From the knowledge graph, we derived several quantitative features that characterize the structural and historical context of the protocol involved in a given incident:

• Protocol multi-chain count: the number of distinct blockchains on which the protocol is deployed (degree of deployed-on edges). A higher count indicates a widely deployed protocol, potentially implying larger user bases or attack surfaces.
• Fork lineage indicators: whether the protocol is a fork of another (has a parent) and the number of forks derived from it. These capture whether a protocol inherits code (and possibly vulnerabilities) from a parent and how prevalent its code is in offspring projects.
• Past incident count: the total number of past security incidents involving the protocol (count of incident-involves edges to prior incidents). A history of frequent past incidents might signal underlying security weaknesses or attractive target value.

In addition to these graph-derived features, we include conventional features for each incident:

• Temporal features: the year and month of the incident, and day-of-week if relevant, to capture any time-related patterns or trends in attack occurrence.
• Categorical features: the general type of attack or vulnerability exploited (e.g., reentrancy, price oracle manipulation), and the asset or protocol category targeted, which provide contextual information on the incident.

Figure 3: Workflow: derive graph-based features from the DeFi knowledge graph and combine with conventional incident features for classification.
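As a sketch of how such features could be read off the graph, the fragment below computes the three graph-derived quantities from a toy edge list held in memory. The paper stores its graph in Neo4j; the protocol names and the exact feature semantics here (in particular the fork-children count) are illustrative assumptions rather than the authors' implementation:

```python
from collections import defaultdict

# Toy stand-in for the knowledge graph: (head, relation, tail) triples.
# Protocol and incident names are illustrative only.
edges = [
    ("SushiSwap",   "fork_of",     "Uniswap"),
    ("PancakeSwap", "fork_of",     "Uniswap"),
    ("Uniswap",     "deployed_on", "Ethereum"),
    ("Uniswap",     "deployed_on", "Polygon"),
    ("SushiSwap",   "deployed_on", "Ethereum"),
    ("incident_42", "involves",    "Uniswap"),
]

out_edges = defaultdict(set)   # (node, relation) -> set of targets
in_edges = defaultdict(set)    # (node, relation) -> set of sources
for h, r, t in edges:
    out_edges[(h, r)].add(t)
    in_edges[(t, r)].add(h)

def graph_features(protocol):
    """Derive the three graph-based feature groups for one protocol."""
    return {
        # number of distinct chains the protocol is deployed on
        "protocol_chains_count": len(out_edges[(protocol, "deployed_on")]),
        # fork lineage: does the protocol have a parent, and how many
        # forks were derived from it
        "is_forked_from_parent": bool(out_edges[(protocol, "fork_of")]),
        "fork_children_count": len(in_edges[(protocol, "fork_of")]),
        # number of recorded incidents involving the protocol
        "past_incident_count": len(in_edges[(protocol, "involves")]),
    }

print(graph_features("Uniswap"))
# {'protocol_chains_count': 2, 'is_forked_from_parent': False,
#  'fork_children_count': 2, 'past_incident_count': 1}
```

In the actual pipeline these lookups would be Cypher queries against Neo4j, evaluated on the graph state just before each incident so that no post-incident information leaks into the features.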
All features are computed or retrieved at the time just before the incident (to avoid using any post-incident information). The combination of graph-based features with traditional features forms the feature vector used for prediction. The end-to-end feature extraction and modeling pipeline is summarized in Figure 3. 2.3 Classification Model and Training We frame incident severity prediction as a binary classification task: severe vs. non-severe loss outcome. Following prior work in financial risk modeling, we define a severe incident as one with loss exceeding a high quantile threshold of the loss distribution. Figure 4: Performance comparison. Bar chart for AUC, F1, In our dataset, we tested multiple thresholds (70th, 75th, and 80th precision, recall. percentiles), with the 75th percentile ($2.21 million) serving as the primary cutoff, yielding 402 severe incidents out of 1,608. The model showed consistent improvements across all thresholds, 3 EXPERIMENTS AND RESULTS confirming the robustness of our approach. Our primary model is a gradient boosting decision trees en- 3.1 Dataset and Experimental Setup semble (LightGBM [2]), selected for its efficiency, ability to handle We compiled a publicly available dataset of 1,608 DeFi security heterogeneous feature types, and proven performance in tabular incidents that occurred between 2020 and 2025. The dataset was financial risk modeling. We enabled LightGBM’s built-in class im- constructed by combining data from: (1) Rekt database providing balance option (is_unbalance=True), as severe cases represent comprehensive incident reports with loss amounts and attack 25% of the data. descriptions, (2) DeFiLlama API for protocol metadata including Train/Test Split: Data were split chronologically into 75% TVL and deployment information, and (3) SlowMist Hacked for training and 25% testing. Early stopping was not applied due additional incident verification and technical details. 
Each inci- to dataset size; hyperparameters were fixed after preliminary dent record includes the loss amount (in USD) and details such tuning. as date and attack type. Incidents with losses above $2.21 million We compare two feature sets: a Baseline model using only were labeled as severe, which yields a severe class prevalence of non-graph features (temporal and categorical), and a Semantic roughly 25% (402 severe vs. 1,206 non-severe cases). Graph model combining these with graph-based features. Per- For training and evaluation, we use a chronological split with formance is evaluated with Area Under the ROC Curve (AUC) 75% for training and 25% for testing; early stopping was not and supported by Precision, Recall, and F1-score. applied. 63 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia Pavlova et al. Table 1: Performance comparison between the baseline model (numeric/categorical features only) and the seman- tic graph model (with knowledge graph features). Metric Baseline Semantic Graph Improvement AUC 0.598 0.787 +31.6% F1-score 0.384 0.480 +25.0% Precision 0.341 0.490 +43.7% Recall 0.440 0.470 +6.8% 3.2 Performance Comparison A visual comparison of model performance is shown in Figure 4. Our results confirm that incorporating graph-based features markedly improves prediction performance. Table 1 summa- rizes the evaluation metrics for the baseline and semantic graph- enhanced models on the test set. The Semantic Graph model achieves an AUC of 0.787, substantially higher than the base- line’s 0.598 (a relative improvement of 31.6%). This indicates that the model with graph features is much better at ranking incidents Figure 5: Top 15 most important features ranked by Light- by risk. The F1-score also improves from 0.384 to 0.480, reflecting GBM gain. Values on the x-axis represent LightGBM’s in- better overall classification accuracy. 
ternal feature importance scores (dimensionless, aggre- Notably, the Precision (positive predictive value) rises from gated across all trees in the ensemble). Both temporal fea- 0.341 to 0.490—a 43.7% increase—while Recall increases slightly tures (year, month, day-of-week) and graph-based features from 0.440 to 0.470. This suggests that the graph-enriched model (e.g., protocol_chains_count, is_forked_from_parent) appear is significantly more effective at identifying truly severe incidents among the strongest predictors. (fewer false positives) without sacrificing the ability to catch most severe cases. While the absolute values of these metrics might Applications: Graph-based risk factors can support auditors appear moderate, it is important to note that they represent and insurers in identifying critical "hot spots" and pricing cover- substantial improvements over the baseline and are competitive age more accurately than historical losses alone. for this specific and challenging prediction task where many Limitations: The dataset covers only publicly reported inci- external factors influence incident severity. dents, which may bias toward large-scale events. Features are In addition to the hold-out test, we evaluated stability via manually engineered and static; future work should explore dy- cross-validation. The baseline model’s mean AUC across 5 folds namic graphs, Graph Neural Networks, and richer incident cover- was 0.629 (std 0.036), whereas the semantic model averaged 0.809 age. Absolute performance (AUC 0.787) remains moderate, leav- (std 0.027). This not only reaffirms the performance boost but also ing room for improvement before real-world deployment. indicates that the graph-augmented model is more consistent across different data subsets (lower variance), likely because the 5 CONCLUSION graph features provide a more robust signal that generalizes. We introduced a graph-enriched framework for predicting sever- ity of DeFi security incidents. 
By combining semantic knowledge graph features with conventional incident data, our model achieved substantial gains over a feature-only baseline. The findings emphasize that where an incident occurs in the ecosystem is as important as what it is. This approach offers immediate utility for risk assessment and motivates further research into dynamic, end-to-end graph-based models for DeFi security.

3.3 Feature Importance Analysis

To better understand the relationships between graph-based features, we analyzed their pairwise correlations (Figure 5). The correlation matrix shows that most features are only weakly related, which indicates that they capture complementary aspects of protocol structure and history. The strongest dependency is observed between is_forked_from_parent and parent_fork_children_count (correlation 0.64), reflecting the natural link between fork origin and the number of derived protocols. Other features, such as protocol_chains_count and protocol_past_events_count, exhibit low correlation values (<0.2), suggesting they provide distinct signals. This relative independence confirms that graph-derived features enrich the predictive model with diverse information rather than duplicating each other.

4 DISCUSSION

Our results highlight the value of relational context for DeFi security analysis: knowledge graph features capture ecosystem-level dependencies not visible from incident-centric data. Incidents affecting widely forked or multi-chain protocols are more likely to cause severe losses, reflecting practical amplification effects.

REFERENCES

[1] Chainalysis Team. 2023. 2022 Biggest Year Ever for Crypto Hacking with $3.8 Billion Stolen, Primarily from DeFi Protocols and by North Korea-linked Attackers. Chainalysis Blog (1 February 2023). https://www.chainalysis.com/blog/2022-biggest-year-ever-for-crypto-hacking/
[2] G. Ke et al. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems 30. 3146–3154.
[3] J. Michel and P. Parrend. 2023. Graph-Based Intelligent Cyber Threat Detection System. In Cybersecurity in Intelligent Networking Systems. CRC Press.
[4] D. Pavlova. 2025. DeFi Security Trends: Semantic Knowledge Graph Analysis (Code & Dataset). GitHub repository. https://github.com/dariapavlova02/defi_trends_semantic
[5] L. Sadlek et al. 2025. Severity-Based Triage of Cybersecurity Incidents Using Kill Chain Attack Graphs. Journal of Information Security and Applications 89 (2025).
[6] S. Werner, D. Perez, L. Gudgeon, A. Klages-Mundt, D. Harz, and W. Knottenbelt. 2021. SoK: Decentralized Finance (DeFi). arXiv preprint arXiv:2101.08778 (2021).

Evolving Neural Agents in Simulated Ecosystems

Marija Ćetković, Aleksandar Tošić, Domen Vake
UP FAMNIT, Koper, Slovenia
marijacetkovic03@gmail.com, aleksandar.tosic@upr.si, domen.vake@famnit.upr.si

Abstract

This paper explores how adaptive behaviors can emerge in artificial agents through neuroevolution in a dynamic 2D ecosystem. Using the NeuroEvolution of Augmenting Topologies (NEAT) algorithm, both the neural network structure and weights evolved over time without predefined architectures or behaviors. The system models two agent types, herbivores and carnivores, that compete for limited food resources in a simulated environment.

NEAT allows the network topology and weights to evolve over time, in comparison to fixed-topology weight-evolving evolutionary methods. We implemented NEAT from scratch to have full control over mutation, crossover, and fitness evaluation, ensuring that the system could support our experimental goals and to potentially build a controllable and extensible evolutionary framework. While NEAT has been previously applied to multi-agent systems, many studies focus on performance in a specific task.
This paper addresses whether an agent-based NEAT framework can produce ecological equilibrium without an explicit objective. Our primary contribution is the demonstration and analysis of stable, co-adaptive predator-prey dynamics, showing how specific evolved behaviors arise from the underlying neural network topologies of the agents.

From the beginning, it was evident that environment design, input encoding, and reward shaping had a major impact on agent behavior. Poorly tuned conditions led to exploitation, overfitting, or meaningless patterns. But when the system was carefully balanced, the agents began developing survival strategies such as movement efficiency, food seeking, and attacking. Herbivores evolved plant-consumption behaviors, while carnivores built on this base to prioritize attacks and meat consumption. Some behaviors generalized well to larger environments, showing that agents were not just memorizing patterns. We observed how NEAT's speciation and innovation mechanisms were crucial for maintaining diversity and avoiding premature convergence. At the same time, challenges like catastrophic forgetting revealed the limitations of neural networks in long-term skill retention. Ultimately, this work demonstrates how intelligent, adaptive behavior can emerge from simple evolutionary principles and offers a foundation for future research into co-evolution, agent roles, and artificial life.

2 Methods

2.1 Environment Model

The ecosystem is a discrete 2D grid populated with food resources and agents. Herbivores consume plants, carnivores consume meat, and all agents perceive their surroundings through a limited sensory range.

2.2 Evolutionary Framework

Agents (creatures) interact with the world and are controlled by neural networks (genomes) evolved using NEAT. Initial populations start with minimal structures (fully connected input/output layers), and complexity increases through structural mutations.

Keywords
neuroevolution, NEAT, evolutionary algorithms, artificial life, simulated ecosystems, co-evolution, neural networks

1 Introduction

This research explores neuroevolution for adaptive agent behaviors in a dynamic ecosystem. Agents are controlled by feedforward neural networks that map sensory inputs to actions [4], and their structure and weights evolve incrementally using the NEAT algorithm [5]. Unlike static or predefined tasks, this simulation presents agents with a changing environment where no explicit 'correct' behavior exists. Dynamic environments without fixed objectives require exploratory and adaptable approaches. Gradient-based optimization relies on differentiable fitness functions and fixed topologies, while reinforcement learning can struggle under sparse rewards.

Genomes consist of genes: lists of nodes (with an ID and a type: input, hidden, or output) and connections (with an ID, the nodes they connect, a weight, and an enabled flag). Each tick, each agent receives a snapshot of the world state as input, to ensure stable input for everyone. Inputs include diet type, hunger level, a local 3x3 neighborhood scan for food and neighbors (type and health level), and the direction toward the nearest food source. Based on that, agents choose actions from the softmax output of their neural networks. The outputs correspond to discrete actions: move (up, down, left, right), eat, attack, or stay. The actions become events that are handled in a deferred manner: first, invalid actions are filtered out; then the EventManager processes all queued events at once, sequentially applying changes in the world and updating the fitness and health of agents, as shown in Algorithm 1. The fitness function evolved through experimentation.
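The genome encoding and softmax action choice described above can be sketched as follows. This is illustrative Python, not the authors' Java implementation, and all names here are hypothetical:

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class NodeGene:
    id: int
    type: str              # "input", "hidden", or "output"

@dataclass
class ConnectionGene:
    id: int                # innovation number
    in_node: int
    out_node: int
    weight: float
    enabled: bool = True

@dataclass
class Genome:
    nodes: list = field(default_factory=list)
    connections: list = field(default_factory=list)

# The seven discrete actions available to every agent.
ACTIONS = ["up", "down", "left", "right", "eat", "attack", "stay"]

def softmax(outputs):
    """Numerically stable softmax over raw network outputs."""
    m = max(outputs)
    exps = [math.exp(o - m) for o in outputs]
    total = sum(exps)
    return [e / total for e in exps]

def choose_action(outputs):
    """Sample a discrete action from the softmax distribution."""
    probs = softmax(outputs)
    return random.choices(ACTIONS, weights=probs, k=1)[0]

# A minimal genome: one input node wired to one output node.
g = Genome(nodes=[NodeGene(0, "input"), NodeGene(1, "output")],
           connections=[ConnectionGene(0, 0, 1, 0.5)])
```

Sampling from the softmax (rather than taking the argmax) keeps early behavior stochastic, which matches the observation that initial actions were effectively randomized.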
Early versions of the fitness function rewarded survival, but later iterations combined survival time, food consumption, and, for carnivores, attack behavior.

Evolutionary algorithms, by evaluating populations of agents directly on survival and performance, provide a natural solution for such open-ended scenarios [1]. Neural networks allow agents to flexibly map sensory input to actions, and NEAT enables both the topology and the weights to evolve.

2.3 NEAT Mechanisms

Innovation tracking is the process of recording structural mutations globally to keep genomes aligned during crossover. A singleton class assigns a unique ID to each new connection or node. If a structural change already exists, it reuses the same ID; if not, it creates a new one. This ensures consistent identification of identical innovations across all genomes [5].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.10

while generation limit not reached do
    // Simulation Phase
    while creatures alive do
        foreach creature c in population do
            c.observe(world)      // perception
            c.chooseAction()      // NN policy
            c.queueAction()       // check validity, enqueue
        end
        eventManager.process()    // apply action effects
        foreach creature c in population do
            c.updateHealth()      // starvation, death
            c.evaluateFitness()   // assign fitness
        end
    end
    // Evolution Phase
    assignGenomesToSpecies()      // speciation
    createOffspring()             // apply GA within species
    resetWorld()                  // spawn new creatures and food
end

Algorithm 1: High-Level Evolutionary Simulation Loop

2.4 Graphical User Interface

Figure 1 shows an example of the GUI, which serves to visually track the simulation in real time, making the evolutionary process observable and interpretable, as analyzing logs alone could be misleading. It allowed following population changes over generations, spotting emerging behaviors such as movement patterns or interactions, and understanding whether agents are actually evolving. It helps detect issues such as creatures moving in the same direction or wandering aimlessly.

Figure 1: GUI close-up

NEAT preserves evolutionary innovation by speciation (niching) [5]. Each generation, evaluated genomes are reassigned to species based on a structural compatibility distance, calculated as a weighted sum of the number of disjoint and excess connections (those present in only one parent, within and beyond the other's genome region respectively) and the average weight difference between matching connections (those present in both parents):

    δ = c1·E/N + c2·D/N + c3·W̄

where E and D are the numbers of excess and disjoint connections, N is the number of connections in the larger genome, W̄ is the average weight difference of matching connections, and c1, c2, c3 are weighting coefficients. Existing species are cleared and each genome is compared to species representatives; if no match is found, a new species is created. Representatives are updated every generation to maintain diversity.

2.5 Implementation Notes

The simulation was implemented in Java with LibGDX [2] for visualization. NEAT logic included custom classes for genomes, species management, and innovation tracking. The evolutionary loop evaluated agents in the world, assigned fitness, reproduced genomes, and reset the environment for subsequent generations.
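The two bookkeeping pieces of Section 2.3, global innovation IDs and the compatibility distance δ = c1·E/N + c2·D/N + c3·W̄, can be sketched as follows (illustrative Python rather than the authors' Java; names are hypothetical):

```python
class InnovationTracker:
    """Assigns a globally unique, stable ID to each structural mutation,
    so identical innovations get the same ID in every genome."""
    def __init__(self):
        self._ids = {}          # (in_node, out_node) -> innovation ID
        self._next = 0

    def get(self, in_node, out_node):
        key = (in_node, out_node)
        if key not in self._ids:      # new structural change: mint a new ID
            self._ids[key] = self._next
            self._next += 1
        return self._ids[key]         # existing change: reuse the same ID

def compatibility(excess, disjoint, n, avg_weight_diff,
                  c1=1.0, c2=1.0, c3=0.4):
    """delta = c1*E/N + c2*D/N + c3*W_bar (NEAT compatibility distance).
    The coefficient values here are arbitrary examples."""
    return c1 * excess / n + c2 * disjoint / n + c3 * avg_weight_diff

tracker = InnovationTracker()
a = tracker.get(0, 3)
b = tracker.get(0, 3)   # the same mutation arising elsewhere in the population
print(a == b)           # True: identical innovations share one ID
```

Because the tracker is consulted at mutation time, two genomes that independently add the same connection stay alignable gene-by-gene during crossover.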
Fitness is shared within species (adjusted by species size) to balance selection pressure. The compatibility threshold strongly affects stability: low thresholds create many narrow species, high thresholds create broader but less distinct species, requiring careful tuning.

To prevent the population from being dominated by one species, which would limit the exploration of the algorithm, NEAT uses adjusted fitness [5]. Instead of assigning raw fitness scores, an individual's fitness is divided by the number of individuals within its compatibility distance δ:

    f′i = fi / Σ(j=1..n) sh(δ(i, j))

where the sharing function sh is 1 when δ is below the compatibility threshold and 0 otherwise.

Evolution is achieved through genetic operators within species:

Mutation: weight changes (random reset 5–10% of the time, or a small Gaussian perturbation) and structural changes (adding nodes or edges, toggling connections). The resulting genome is checked for acyclicity.

Crossover: offspring inherit connections aligned by innovation number; matching connections are inherited from the first parent, while disjoint and excess connections come from the parent with the greater fitness score (or at random if equal). Invalid (cyclic) offspring are replaced by a mutated copy of the fitter parent.

Selection: parent selection uses tournament selection: a subset of individuals (size 5) is sampled and the fittest is chosen. With 3% probability, a random individual is selected instead to maintain diversity. This setup provides moderate selection pressure, avoiding premature convergence while keeping the implementation simple, efficient, and robust across different fitness functions.

2.6 Setup

After every run, around 10 percent of the population is saved and loaded for the next run; that part of the population is kept unchanged and the rest is filled with mutations of it. This is done to speed up the evolution process. In early runs, we disabled the perception of other agents to prevent confusion and help them learn basic eating behavior. Once they consistently moved and consumed food, perception was turned on to allow them to adapt to a more complex environment. We also tested this logic on other inputs such as the food direction vector, without which agents were essentially 'blind' to non-local food. So, during early iterations, we spawned food in randomly concentrated areas rather than spreading it widely, to help them learn to use this vector.

3 Results

3.1 Herbivore Evolution

Herbivores initially explored aimlessly but gradually developed stable food-seeking strategies. Over 800 generations, their action distribution stabilized, with movement actions dominating and eating consistently rewarded. In larger environments, agents prioritized exploration to reach scarcer resources, showing emergent adaptation beyond memorized patterns.

We can see from Fig. 2 that the initial fitness oscillates strongly, with a large difference between average and maximal fitness, as well as some outliers with high fitness that end up consuming a lot of food. This is expected to some extent: when one creature consumes food, it reduces the available resources for others in the population.

In smaller worlds, herbivores focused on eating while carnivores split between eating and attacking; in larger worlds, carnivores prioritized attacking while herbivores balanced movement and eating.

Figure 2: Herbivore fitness
Figure 3: Average creature lifespan
Figure 4: Average number of unique tiles visited
Figure 5: Herbivore action distribution, 100x100 world
Figure 6: Carnivore action distribution, 100x100 world

In the beginning the chosen actions were randomized, but Figure 5 shows how herbivores learned to prioritize the eating action, although initial interference is evident. The usage of the stay and attack actions is low.

Figures 3 and 4 show that agents that survive longer naturally explore more of the environment.
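The adjusted-fitness and tournament-selection rules of Section 2.3 can be sketched as follows. With the binary sharing function, the sharing sum reduces to the species size; this is an illustrative Python sketch, not the authors' Java code:

```python
import random

def adjusted_fitness(raw, species_size):
    # sh(delta) is 1 inside a species and 0 outside, so the sharing sum
    # in f'_i = f_i / sum_j sh(delta(i, j)) is simply the species size.
    return raw / species_size

def tournament_select(population, fitness, k=5, p_random=0.03):
    """population: list of genome IDs; fitness: genome ID -> adjusted fitness."""
    if random.random() < p_random:        # occasional random pick keeps diversity
        return random.choice(population)
    contenders = random.sample(population, min(k, len(population)))
    return max(contenders, key=lambda g: fitness[g])

pop = ["g1", "g2", "g3", "g4", "g5", "g6"]
fit = {"g1": 1.0, "g2": 5.0, "g3": 2.0, "g4": 0.5, "g5": 3.0, "g6": 4.0}
print(adjusted_fitness(10.0, 4))   # 2.5: a large species dilutes its members
```

Dividing by species size means a genome in a crowded species must be proportionally fitter to earn the same share of offspring, which is what protects small, novel species.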
The observed correlation between them emerges from adaptive behavior under environmental constraints, rather than from any explicit exploration reward.

In Fig. 6 we can spot how carnivores have problems balancing the eating and attacking actions, but the attacking action slightly dominates after some time.

3.2 Carnivore Adaptation and Catastrophic Forgetting

Carnivores were evolved by reusing herbivore topologies and adjusting weights, transferring eating behavior to meat sources. This transfer was successful in just 200 generations, but the agents showed catastrophic forgetting when switched back to herbivore roles, losing previous behaviors [3]. This showed us that we needed more general pretraining to make sure that agents were using their role, food, and food-vector inputs, and not overfitting to the food type.

3.3 Co-Evolution Dynamics

To try to avoid the forgetting problem mentioned above, we saved agents of both types that had evolved their basic skills independently. When carnivores were alone, we gave them no motivation to use the attack action, in order to wire that logic to herbivores later. The attacking behavior was rewarded only for carnivores but, as shown below, some role interference was inevitable.

Figure 7: Herbivore action distribution, 200x200 world

Figure 7 displays herbivore behavior, where the action distribution is more stable and there is a clear evolved balance of eating and moving actions, which is expected in a larger world.
4 Design Observations

Agent behavior is highly sensitive to design choices in fitness functions, environment setup, and input representation. Poorly designed fitness functions can lead to inefficient or trivial behaviors, such as flickering near food, which highlights the exploration-exploitation trade-off and the role of environmental influence in shaping behavior. Static or predictable resource spawn locations can cause overfitting, where agents memorize positions instead of learning general strategies. Dynamic and unpredictable environments are necessary to evolve general food-seeking behaviors and to observe which patterns emerge from evolutionary pressures versus environmental conditions. However, environments that are too unpredictable can hinder learning and obscure the distinction between inherited tendencies and environment-induced behaviors.

Input scaling and initial placement also influence behavioral emergence. An unbounded health input caused agents to idle, while spawning agents too close together and rewarding them for food consumption led to aimless wandering when neighbors died, showing correlations learned from the environment. These observations demonstrate how neural networks may pick up coincidental patterns that influence both relearning across generations and the adopted strategies.

Metrics did not always reflect consistent progress, as dynamic food spots and starting points introduced noise. Dips or peaks in performance do not necessarily indicate genuine failure or success. Adjustments to fitness, food rewards, and environmental parameters were required to guide learning, prevent reward hacking, and allow behavioral adaptation. Comparing herbivore and carnivore roles shows that behaviors are shaped by both environmental pressures and the interactions between agent strategies and resource availability.

Figure 8: Carnivore action distribution, 200x200 world

Figure 8 displays carnivore behavior in the larger world, where they were given a greater incentive to attack. From the distribution, we can see that they indeed attacked more, with the other actions balanced out; the stay action was rarely chosen.

Figure 9: Maximum lifespan in the coevolutionary setting
Agents adjust their actions based on the resources they encounter, and these actions influence which resources remain available, creating a feedback loop between behavior and the environment.

Fitness (as well as the lifespan depicted in Figure 9) fluctuated in an "arms race" pattern with no dominant winner. This outcome is expected, as a rise in one role's performance lowered the performance of the other. This shows that the system tended toward balance, which aligns with the objective of testing whether coevolution with NEAT agents could produce equilibrium.

3.4 Species Diversity

The survival plot of emerging species in Figure 10 shows an important aspect of the NEAT algorithm. The initial drop means that a few very successful topologies dominated the population, but using a lower compatibility threshold prevents the total loss of diversity. The number of species stabilized after some time, while many smaller species died out quickly.

Figure 10: Species diversity over generations.

5 Conclusion and Future Work

This paper demonstrates that nature-like behaviors can emerge from relatively simple principles when agents evolve in dynamic, open-ended environments without predefined goals. By evolving herbivores and predators both separately and in co-evolution, we showed that evolutionary pressures can produce adaptive behaviors and predator-prey equilibria, highlighting how role-specific dynamics shape ecosystem stability. This work lays the foundation for future experiments that involve more complex behaviors, survival strategies, and deeper coevolutionary dynamics. Future directions could include investigating refined role-awareness mechanisms, improved memory or learning retention, and more complex agent inputs and actions, enabling us to push the boundaries of what these agents can learn over time.

References

[1] A. E. Eiben and J. E. Smith. 2003. Introduction to Evolutionary Computing. Natural Computing. Springer-Verlag, Berlin.
[2] LibGDX. [n. d.] LibGDX game development framework. https://libgdx.com/.
[3] Michael McCloskey and Neil J. Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation, 24, 104–169.
[4] Michael A. Nielsen. 2018. Neural networks and deep learning. http://neuralnetworksanddeeplearning.com/.
[5] Kenneth O. Stanley and Risto Miikkulainen. 2002. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10, 2, 99–127. http://nn.cs.utexas.edu/?stanley:ec02.

Designing AI Agents for Social Media

Abdul Sittar∗, Mateja Smiljanić, Alenka Guček
Jožef Stefan Institute, Ljubljana, Slovenia
abdul.sittar@ijs.si, mateja.smiljanic@gmail.com, alenka.gucek@ijs.si

Abstract

This work presents an approach for designing AI agents that simulate social media activity by replacing Twitter conversations with large language models (LLMs). Using a time-series dataset of Twitter discussions about technologies (April 2019 - April 2020), we propose an approach that combines fine-tuned language models with a timeline manager to capture both conversational dynamics and temporal posting patterns.

[15], [14] emphasized that stylistic consistency within timelines benefits rare-event detection, while artificial stylistic variety can increase false positives. [1] demonstrated the effectiveness of T5-based paraphrasing, achieving an average 4.01% accuracy increase with T5 augmentation, with RoBERTa reaching 98.96% accuracy through ensemble approaches.
This approach consists of two main components: 1) a timeline manager, which models posting frequency, reply behaviour, and the temporal rhythms of users, and 2) conversation agents, fine-tuned for posting and replying within threads. We evaluate the system along two dimensions: structural accuracy (whether the timeline manager replicates conversation patterns and thread structures) and emotion dynamics (whether the emotions of the synthetic data replicate the true emotion trends in the original dataset). Our results demonstrate that the proposed agent-based simulation captures key characteristics of real Twitter interactions, providing a foundation for large-scale synthetic social media ecosystems useful for studying information flow, emotion propagation, and the impact of emerging technologies.

Keywords

AI agents, large language models (LLMs), social media simulation, Twitter conversations, conversation agents

Recent advances in large language models (LLMs) provide opportunities to simulate social media users as autonomous agents capable of generating posts and replies. [9] mainly concentrates on using LLMs as stand-alone agents or for simple agent interactions, neglecting the opportunity to assess LLMs within the network structure of complex social networks. In this study, we leverage fine-tuned language models to create agents across multiple domains, including technology (AI), cryptocurrency, and health-related topics (e.g., COVID-19). Each agent is specialized for posting or replying, while a timeline manager model simulates the environment, deciding which agent acts next and at what time. By grouping similar users into single agents, our approach generalizes behaviour while maintaining the richness of interaction patterns.

The main goal of this work is to investigate the effect of environmental changes on agent behaviour and network dynamics.

1 Introduction

Social media platforms have become major arenas for information dissemination, discussion, and opinion formation. However, the emergence of filter bubbles, where users are exposed predominantly to content that aligns with their existing beliefs, can reinforce polarization, reduce diversity of exposure, and shape
collective behaviour in unforeseen ways [3]. Also, social networks have broadened the range of ideas and information accessible to users, but they are also criticized for contributing to greater polarization of opinions [2]. Understanding how these dynamics emerge and evolve requires models that can replicate user behaviour at scale while capturing temporal patterns and interactions.

Specifically, we hypothesize that altering the scheduling and structure of the environment model can lead to measurable changes in posting and replying activity, as well as in the temporal evolution of simulated emotions. To evaluate our approach, we compare real Twitter data with simulated outputs, analysing emotion trends and interaction dynamics across time windows. Our approach provides a novel methodology for studying social media dynamics, testing hypotheses about user behaviour, and exploring interventions to mitigate filter bubbles.

1.1 Contributions

Following are the two primary scientific contributions of this work:
• An approach to replicate social media users by grouping similar users into language-model-driven agents managed with a timeline manager.
• An evaluation that assesses structural accuracy, conversational coherence, and emotional realism by comparing simulated and true emotion trends.

Large language models have emerged as powerful tools for synthetic text generation. [10] investigated GPT-3.5 for text classification augmentation, finding that subjectivity negatively correlates with synthetic data effectiveness, while achieving 3-26% absolute improvement in accuracy/F1 in low-resource settings. [18] introduced GPT3Mix, using GPT-3 for realistic text generation with soft labels, significantly outperforming existing augmentation methods. The quality of synthetic data generation has been evaluated through multiple frameworks.

Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.23

2 Related Work

LLMs are increasingly employed to model human behaviour in online settings, but current evaluation approaches, such as simplified Turing tests involving human annotators, fail to capture the subtle stylistic and emotional nuances that differentiate human-generated from AI-generated text [12]. [12] proposes a human-likeness evaluation framework that systematically measures how closely LLM-generated social responses resemble those of real users. This framework utilizes a set of interpretable textual features that capture stylistic, tonal, and emotional aspects of online conversations. While LLMs can mimic certain human behaviours and decision-making processes, primarily due to their training data, it remains largely unexplored whether repeated interactions with other agents amplify their biases or lead to exclusive patterns of behaviour over time [8].
Modelling social media has been an active research area for understanding user behaviour, information diffusion, and network effects. Agent-based models have been widely used to replicate interactions among users, simulate posting and replying behaviour, and study emergent phenomena such as viral content spread, echo chambers, and filter bubbles [6, 11]. These models often rely on simplified rules or probabilistic mechanisms to determine agent actions. Our work extends this by using fine-tuned language models to generate realistic post and reply content, capturing both the semantic and the temporal patterns observed in real social media interactions.

The concept of filter bubbles has been extensively studied in the context of social media algorithms and personalized content delivery [17, 7, 3]. Prior studies have shown that temporal factors, such as posting frequency and timing, significantly influence the formation of echo chambers and the propagation of sentiment. Unlike traditional simulations, our approach explicitly models time windows and agent-specific schedules, allowing the study of how environmental changes affect network dynamics and user behaviour over time.

Large language models (LLMs) have been increasingly applied to social media analysis, content generation, and user simulation. Fine-tuned models can capture domain-specific language, hashtags, and posting patterns, enabling more realistic simulations of user behaviour [13, 4]. Existing work has largely focused on generating content for individual posts or replies; in contrast, our approach integrates posting, replying, and environment management in a unified simulation, enabling multi-agent interaction analysis.

Recent studies have used sentiment and emotion analysis to evaluate social media content, including the study of affective trends and collective mood in online networks [16, 5]. Our approach leverages these techniques to compare simulated emotion trends with real-world Twitter data, providing a quantitative measure to validate the fidelity of the agent-based simulation.

Figure 1: Overview of the proposed methodology for conversation simulation. The timeline manager determines which agent should act next based on the current time, agent, context, and action. The selected fine-tuned model then generates a new post or reply for the chosen agent, creating realistic conversation flow.

3 Methodology

Our methodology employs a two-stage approach combining probabilistic scheduling with domain-specialized fine-tuned language model agents to simulate realistic social media interactions (posting and replying). The approach consists of two primary components: (1) a timeline-based probabilistic model that serves as the timeline manager, and (2) domain-specialized fine-tuned agents that generate contextually appropriate content based on the timeline manager's decisions.

3.1 Probabilistic model

The probabilistic scheduler is implemented as a multi-output neural network that simultaneously predicts four key dimensions of social media behaviour: agent selection (which agent should act next), action classification (post vs. reply), temporal prediction (timing of the next action), and context setting (emotional tone and topical focus for content generation).

The model is trained on 88,330 conversation items spanning April 2019 to April 2020, focusing on AI and cryptocurrency discussions. Our timeline-based approach generates 93,440 chronological training pairs (18.7x more than baseline methods) through complete conversation-sequence learning rather than isolated post-reply pairs. Given the current state S(t) at time t, the model computes probability distributions over the action space.

3.2 Fine-tuned model

We implement a single fine-tuned language model that serves as both the AI and the cryptocurrency agent. The model is trained on conversations from both domains (AI technology and cryptocurrency discussions) to capture the vocabulary, argumentation patterns, and discourse styles across both topic areas.

• Agent A (AI Focus): the same fine-tuned model, called when the probabilistic scheduler determines that AI-related content is needed.
• Agent B (Crypto Focus): the identical fine-tuned model, called when cryptocurrency-related content generation is required.

When called by the probabilistic scheduler, the fine-tuned model generates content based on the provided context, including action type (post/reply), emotional context, topical focus, temporal context, and conversation history. The model's training on both domains enables it to produce contextually appropriate responses regardless of which agent role it is fulfilling.

3.3 Integration and Coordination

The probabilistic scheduler communicates with the fine-tuned agents through a structured interface that maintains separation between temporal decisions (when and who acts) and content decisions (what is said). At each simulation step, the scheduler: (1) analyses the current conversation state, (2) predicts the next action parameters, (3) selects the appropriate domain agent, (4) provides structured context to the selected agent, and (5) integrates the generated content into the conversation thread. This approach enables realistic conversations where different domain experts can contribute to mixed-topic discussions while maintaining their specialized perspectives and the temporal behavioural patterns observed in real social media data.

• Action Distribution: maintains realistic post/reply ratios (94.5%/5.5%)

4.2 Fine-tuned model

Table 2: Evaluation Results: ROUGE and Semantic Similarity

Metric                      | Score
ROUGE-1                     | 0.1373
ROUGE-2                     | 0.0519
ROUGE-L                     | 0.1179
4 Experimental Setup

In this section, we describe the features, model, and evaluation metrics.

4.1 Timeline Manager

The baseline system is a timeline-based probabilistic model that learns agent transitions, reply probabilities, and temporal distributions from training data. Predictions are made deterministically by selecting the most probable outcome, with probability estimates derived directly from observed frequencies.

The enhanced approach employs a machine-learning ensemble with separate classifiers for agent, action, and time prediction. Features include agent history, action history, and time of day. Predictions are generated using temperature-controlled stochastic sampling, with an ensemble across multiple temperature settings for robustness. This design enables greater flexibility and diversity, counteracting the strong biases inherent in the probabilistic model.

4.2 Fine-tuned model

Table 2: Evaluation Results: ROUGE and Semantic Similarity

Metric | Score
ROUGE-1 | 0.1373
ROUGE-2 | 0.0519
ROUGE-L | 0.1179
ROUGE-Lsum | 0.1217
Semantic Similarity (SBERT) | 0.4041

Table 2 reports the evaluation results for the fine-tuned model's generated content. The ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum) measure lexical overlap between generated outputs and the reference Twitter posts. The relatively low scores (e.g., ROUGE-1 = 0.1373) indicate that while the generated text captures some overlapping words or phrases, it often diverges lexically from the original references. This is expected, since the model is not designed for verbatim reproduction but rather for generating semantically coherent alternatives.

To complement ROUGE, we compute semantic similarity using SBERT embeddings. The score of 0.4041 shows that, on average, the generated outputs are moderately aligned in meaning with the reference texts, even when the surface-level wording differs. This highlights that the fine-tuned model is able to remain contextually and thematically relevant while producing novel expressions. Overall, the combination of ROUGE and semantic similarity suggests that the fine-tuned agents do not simply replicate reference posts but instead produce new, semantically consistent outputs.
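The two metric families can be illustrated with a simplified sketch. These are toy versions for intuition only: real evaluations use a tested ROUGE library and SBERT sentence embeddings, whereas the cosine below operates on plain bag-of-words counts.

```python
from collections import Counter
import math

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap ROUGE-1 F1 (no stemming or tokenizer)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def cosine_bow(a: str, b: str) -> float:
    """Bag-of-words cosine; SBERT would embed full sentences instead."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

The contrast between the two is exactly the one discussed above: a paraphrase with no shared words scores 0 on both toy metrics, while an embedding-based similarity such as SBERT can still detect the shared meaning.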
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia. Abdul Sittar, Mateja Smiljanić, and Alenka Guček

4.1.1 Evaluation Metrics. Table 1 summarizes the key differences between the original probabilistic model and the improved ML-based model, covering both quantitative performance and qualitative conversational outcomes.

Table 1: Comparison of the Original Probabilistic Model vs. the Improved ML-Based Model.

Aspect | Probabilistic Model | ML-Based Model
Agent Prediction | 44.8% accuracy, but always predicts Crypto_Agent (100%) | 55.2% accuracy, balanced: AI_Agent (50%) and Crypto_Agent (50%)
Action Prediction | 74.4% accuracy by predicting only "post" (0% replies) | 67.8% accuracy with a realistic mix: 65% posts / 35% replies (close to the ground truth of 73/27)
Temporal Modelling | MAE = 5.41 min; 99.4% within ±15 min | MAE = 7.11 min; 99.2% within ±15 min

We evaluated our probabilistic model using comprehensive metrics across three key categories:

• Agent Prediction: 61.3% accuracy (a 22.6% improvement over random chance)
• Action Classification: 96.8% accuracy for post vs. reply prediction
• Temporal Modelling: 50.7-minute MAE with 99.15% accuracy within ±15 minutes

Our evaluation demonstrates that the probabilistic scheduler successfully replicates conversation structure:

• Agent Alternation: 94.2% similarity to real switching behaviours
• Temporal Rhythms: strong correlation (r = 0.78) with actual daily patterns
• Action Distribution: maintains realistic post/reply ratios (94.5%/5.5%)

Figure 2: Methodology diagram showing both experimental approaches.

Figure 2 presents the aggregated emotion comparison between the reference Twitter dataset and the conversations generated by the fine-tuned model. The analysis is based on average emotion scores across multiple conversation samples, with categories including hate, not_hate, non_offensive, irony, neutral, positive, and negative. Blue bars represent the reference data, while orange bars indicate the generated outputs.

Overall, the comparison shows strong alignment between the two distributions for the key non-toxic categories. Both reference and generated conversations are overwhelmingly classified as not_hate and non_offensive, with nearly identical scores (approximately 0.95 and 0.75, respectively). Similarly, both datasets contain minimal hate or negative content, indicating that the synthetic conversations do not introduce harmful patterns absent from the real data.

At the same time, certain emotional discrepancies are evident. The generated conversations exhibit lower levels of irony and positivity than the real dataset. Specifically, irony is notably under-represented in synthetic conversations (0.04 versus 0.12 in the reference data), suggesting that nuanced and implicit language styles are harder for the model to reproduce. Similarly, positive sentiment is reduced in generated text (0.49 versus 0.62), while neutrality is slightly higher (0.78 versus 0.71). This indicates a tendency of the model to produce emotionally flatter and less expressive outputs.

Taken together, the results suggest that the model successfully replicates the broad emotional structure of conversations, particularly in terms of avoiding toxic or offensive content, but with reduced representation of irony and positivity. The generated outputs are less emotionally rich than real data, which highlights a key limitation of current LLM-based conversation agents: while structurally sound, they may generate interactions that are less engaging or authentic in their emotional dynamics.

5 Conclusions

In this work, we presented a novel approach for replicating social media user behaviour using fine-tuned language models organized as autonomous agents. By combining a timeline manager (Model A) with specialized posting (Model B) and replying (Model C) models, we simulated realistic multi-agent interactions across AI- and crypto-related topics.

Our timeline-based probabilistic model successfully replicates structural conversation patterns with 61.3% agent accuracy and near-perfect action classification (96.8%), establishing a new benchmark while providing clear paths for further enhancement through domain specialization. Our experiments demonstrated that the approach can generate temporal posting and replying patterns that closely resemble real-world Twitter data. We showed that modifying the environment model significantly influences agent behaviour, posting frequency, and network dynamics, supporting our hypothesis that environmental and temporal factors shape interaction patterns in social networks.

This approach provides a flexible and controlled platform for studying filter-bubble formation, emotion propagation, and emergent social dynamics. Future work can extend the approach to more complex network structures, additional domains, and the integration of user-specific behaviour models to further explore interventions for mitigating echo chambers and enhancing diversity in online interactions.

6 Acknowledgment

The research presented in this paper was funded by the EU's Horizon Europe Framework under grant agreement numbers 101095095 (TWON) and 101094905 (AI4Gov).

References

[1] Jordan J. Bird et al. 2021. Chatbot interaction with artificial intelligence: human data augmentation with T5 and language transformer ensemble for text classification. arXiv preprint arXiv:2010.05990.
[2] Uthsav Chitra and Christopher Musco. 2020. Analyzing the impact of filter bubbles on social network polarization. In Proceedings of the 13th International Conference on Web Search and Data Mining, 115–123.
[3] Uthsav Chitra and Christopher Musco. 2019. Understanding filter bubbles and polarization in social networks. arXiv preprint arXiv:1906.08772.
[4] Cristina Chueca Del Cerro. 2024. The power of social networks and social media's filter bubble in shaping polarisation: an agent-based model. Applied Network Science, 9, 1, 69.
[5] Matteo Cinelli, Gianmarco De Francisci Morales, Alessandro Galeazzi, Walter Quattrociocchi, and Michele Starnini. 2020. Echo chambers on social media: a comparative analysis. arXiv preprint arXiv:2004.09603.
[6] Rui Fan, Ke Xu, and Jichang Zhao. 2018. An agent-based model for emotion contagion and competition in online social media. Physica A: Statistical Mechanics and its Applications, 495, 245–259.
[7] Antonino Ferraro, Antonio Galli, Valerio La Gatta, Marco Postiglione, Gian Marco Orlando, Diego Russo, Giuseppe Riccio, Antonio Romano, and Vincenzo Moscato. 2024. Agent-based modelling meets generative AI in social network simulations. In International Conference on Advances in Social Networks Analysis and Mining. Springer, 155–170.
[8] Farnoosh Hashemi and Michael Macy. 2025. Collective social behaviors in LLMs: an analysis of LLMs' social networks. In Large Language Models for Scientific and Societal Advances.
[9] Tianrui Hu, Dimitrios Liakopoulos, Xiwen Wei, Radu Marculescu, and Neeraja J. Yadwadkar. 2025. Simulating rumor spreading in social networks using LLM agents. arXiv preprint arXiv:2502.01450.
[10] Z. Li, J. Zhu, et al. 2023. Synthetic data generation with large language models for text classification: potential and limitations. arXiv preprint arXiv:2310.07849.
[11] Hamid Reza Nasrinpour, Marcia R. Friesen, et al. 2016. An agent-based model of message propagation in the Facebook electronic social network. arXiv preprint arXiv:1611.07454.
[12] Nicolò Pagan, Petter Törnberg, Christopher Bail, Ancsa Hannak, and Christopher Barrie. [n. d.] Can LLMs imitate social media dialogue? Techniques for calibration and BERT-based Turing test. In First Workshop on Social Simulation with LLMs.
[13] Kayhan Parsi and Nanette Elster. 2015. Why can't we be friends? A case-based analysis of ethical issues with social media in health care. AMA Journal of Ethics, 17, 11, 1009–1018.
[14] Ifrah Pervaz, Iqra Ameer, Abdul Sittar, and Rao Muhammad Adeel Nawab. 2015. Identification of author personality traits using stylistic features: notebook for PAN at CLEF 2015. In CLEF (Working Notes), 1–7.
[15] E. Rosenfeld et al. 2025. Evaluating synthetic data generation from user generated text. Computational Linguistics, 51, 1, 191–230.
[16] Tanase Tasente. 2025. Understanding the dynamics of filter bubbles in social media communication: a literature review. Vivat Academia, 1–21.
[17] Petter Törnberg, Diliara Valeeva, Justus Uitermark, and Christopher Bail. 2023. Simulating social media using large language models to evaluate alternative news feed algorithms. arXiv preprint arXiv:2310.05984.
[18] Kang Min Yoo et al. 2021. GPT3Mix: leveraging large-scale language models for text augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2021, 2225–2239.
Explaining Temporal Data in Manufacturing using LLMs and Markov Chains

Jan Šturm (jan.sturm@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Maja Škrjanc (maja.skrjanc@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Oleksandra Topal (oleksandra.topal@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Inna Novalija (inna.koval@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Dunja Mladenić (dunja.mladenic@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Marko Grobelnik (marko.grobelnik@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

Monitoring and understanding complex industrial processes from high-dimensional IoT sensor data remains a significant challenge. While advanced modeling techniques like Hierarchical Markov Chains can abstract raw data, their outputs are often difficult for domain experts to interpret, creating a gap between data-driven insights and operational management. Existing explainability methods often focus on feature importance rather than providing holistic, semantic descriptions of system states. This paper introduces a framework that bridges this gap by transforming the abstract states of a process model into intuitive, human-readable concepts. The methodology leverages the StreamStory (Hierarchical Markov Chain) approach to generate behavioral profiles based on log-likelihood calculations within sliding temporal windows. StreamStory states are summarized using an LLM to assign semantic labels and descriptions. This approach reduces the initial reliance on domain experts for analysis, aids the understanding of complex system dynamics, and provides a transparent foundation for identifying both normal and anomalous operational patterns. The result is a more interpretable representation of industrial processes, facilitating improved predictive maintenance and operational efficiency.

Keywords

Multivariate Timeseries, Explainable AI, LLMs, Markov Chains

1 Introduction

The widespread adoption of Internet of Things (IoT) sensors in industrial environments has generated vast streams of multivariate time-series data. While this data holds immense potential for process optimization and predictive maintenance, its complexity often surpasses human cognitive capacity. Tools like StreamStory [6] have emerged to model these complex systems using Hierarchical Markov Chains, abstracting raw data into a more manageable set of states and transitions. However, a fundamental challenge persists: a disconnect between the model's statistical outputs and the experiential knowledge of domain experts.

The motivation for this work stems from this challenge. Domain experts, who possess invaluable implicit knowledge of a system, often struggle to interpret the statistical outputs of process models. Conversely, data scientists may identify patterns that lack the necessary operational context for effective action. Presenting experts with a graphical representation of states and transitions is a step forward, but it does not fully bridge the semantic gap. They may not understand what a specific state represents in the physical world or why a particular transition is significant. This leads to a bottleneck where valuable data-driven insights are not fully utilized, hindering efforts to improve system management and efficiency.

To address this, the paper proposes a methodology that enhances the interpretability of hierarchical process models. This approach creates a new layer of understanding that is accessible to operational personnel without requiring deep data science expertise. By translating abstract model states into meaningful, semantically rich descriptions, it provides a tool that allows the system's behavior to be understood, validated, and ultimately better managed. This work introduces a methodology to automatically generate these descriptions, moving from complex data to clear, actionable insights. It presents two primary contributions for industrial applications: a method for LLM-based labeling of Markov chain states, and a methodology for identifying events as anomalous or normal.

2 Related Work

The field of time-series anomaly detection has evolved from interpretable statistical models like ARIMA and classical machine learning such as Isolation Forest to high-performance deep-learning architectures including LSTMs, Transformers, and Autoencoders [5, 4, 7]. While these advanced models excel at pattern recognition, their complexity necessitates post-hoc XAI tools like LIME and SHAP to explain their decisions, which are limited to providing low-level feature attributions [1].

Recent work also demonstrates the utility of Hidden Markov Models (HMMs) for anomaly detection, for instance by designing active search strategies to locate an evolving anomaly among multiple processes [2], or by learning normal temporal dynamics from remote sensing data to detect, localize, and classify crop-related deviations [3]. However, while effective for detection, the abstract nature of HMM states can be difficult for domain experts to interpret. The present work addresses this by transforming the state sequence into a multi-scale behavioral profile, which enables a Large Language Model (LLM) to generate rich, semantic explanations of system behavior.

This approach innovates by first classifying each multivariate data point into a state within a pre-built Markov Chain model and then calculating log-likelihoods from the state sequence to form a multi-scale representation. Crucially, this representation allows for the recognition of regular system behavior and various anomalies. By analyzing the statistical distribution of these profiles—identifying dense regions of regular behavior and sparse outliers corresponding to anomalous states—an LLM can then assign rich, human-readable descriptions, connecting abstract data to operational knowledge.

Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.28

3 Methodology

The framework is designed to post-process models generated by the StreamStory system.

3.1 Log-Likelihood Score Calculation

The input to the pipeline is a pre-existing Hierarchical Markov Chain model of an industrial process, which includes a history of state transitions over time. The first step is to create a rich feature representation that captures the system's dynamics. A sliding-window approach (Figure 2) moves across the sequence of historical state transitions. For each window of a given size, a single feature is calculated: the log-likelihood of that specific sequence of transitions occurring. This score is calculated by summing the log-transformed transition probabilities for each step in the sequence, as defined by the underlying Markov model. The score effectively quantifies how "normal" or "expected" a particular sequence of behavior is according to the learned model. Highly probable sequences yield higher log-likelihood scores (closer to zero), while rare sequences result in large negative scores.
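A minimal sketch of the window score follows. The two-state transition matrix is invented for illustration; the actual refinery model's states and probabilities are not published in the paper.

```python
import math

# Illustrative two-state transition matrix (not the paper's learned model).
P = {("run", "run"): 0.9, ("run", "idle"): 0.1,
     ("idle", "idle"): 0.8, ("idle", "run"): 0.2}

def window_log_likelihood(states):
    """Sum of log transition probabilities over one window of states."""
    return sum(math.log(P[(a, b)]) for a, b in zip(states, states[1:]))

common = window_log_likelihood(["run", "run", "run"])   # two likely transitions
rare = window_log_likelihood(["run", "idle", "run"])    # two rare transitions

# Multi-scale behavior profile (Section 3.2): one score per window size,
# computed here for the final time step only.
sequence = ["run", "run", "run", "run", "idle", "run"]
profile = [window_log_likelihood(sequence[-(w + 1):]) for w in (3, 5)]
```

As described above, the common sequence scores closer to zero (here 2·log 0.9 ≈ −0.21) than the rare one (log 0.1 + log 0.2 ≈ −3.91).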
Figure 1 outlines this multi-stage process, which begins with the statistical features from the Markov model and culminates in semantically enriched explanations of system behavior. The core of the methodology is the transformation of abstract machine states into meaningful concepts using a combination of statistical feature engineering and LLM interpretation. The process focuses on creating robust representations of system behavior and leveraging an LLM to translate these representations into human-understandable language.

Figure 2: An illustration of the sliding window method. Three windows of different sizes, highlighted in yellow (largest), brown (medium), and green (smallest), are applied to a sequence of system states. A log-likelihood score is then calculated for the sub-sequence contained within each colored window.

3.2 Behavior Profile Construction

To capture dynamics over multiple time scales, several sliding windows of different sizes are used simultaneously. The log-likelihood score calculated from each window is concatenated to form a single feature vector for each time step. This multi-scale vector, termed a behavior profile, serves as a rich representation of the system's dynamics at that moment, encapsulating both short-term and longer-term patterns. This profile is a crucial output, as it provides a quantitative basis for distinguishing between different modes of operation.

3.3 Ranking System Behavior via Anomaly Scoring

Following the construction of the behavior profiles, their distribution is analyzed to identify distinct operational patterns. An unsupervised density-based approach is employed to score each profile's typicality. The Isolation Forest algorithm is used for this purpose because it does not assume a specific data distribution and excels at identifying outliers in a high-dimensional space. Profiles that are common and lie in dense regions of the feature space receive a high score, corresponding to normal behavior. Conversely, profiles that are rare and isolated receive a low score, flagging them as anomalous. This produces a continuous spectrum of normalcy, allowing for a ranked analysis of all operational events.
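The ranking step can be sketched with scikit-learn's IsolationForest on toy two-dimensional "profiles". The data here is synthetic (a dense grid plus one isolated point); in the paper the inputs are the multi-scale log-likelihood vectors.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for behavior profiles: 49 dense grid points plus one
# isolated point that should be ranked most anomalous.
dense = np.array([[i % 7, i // 7] for i in range(49)], dtype=float)
profiles = np.vstack([dense, [[25.0, 25.0]]])

clf = IsolationForest(contamination=0.05, random_state=0).fit(profiles)
scores = clf.decision_function(profiles)  # higher = more typical
ranking = np.argsort(scores)              # ascending: most anomalous first
most_anomalous = int(ranking[0])          # index 49, the isolated profile
```

Sorting the decision scores yields exactly the "continuous spectrum of normalcy" described above: the head of the ranking holds the rare, isolated profiles and the tail the dense, typical ones.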
Figure 1: Proposed methodology for identifying and explaining normal and anomalous operational profiles.

3.4 LLM-Powered State Naming and Interpretation

To translate abstract states into meaningful concepts, an LLM is utilized. For each granular state discovered by the StreamStory model, its statistical profile (e.g., sensor value distributions) and context about the machine type were formatted into a descriptive prompt. The LLM was then tasked with generating a concise, intuitive name for each state (e.g., "Peak Production - High Flow and Heat"). This process, conducted once per model, creates a semantic layer that is then used to interpret the sequences associated with the highest-ranked normal and lowest-ranked anomalous events.

This approach offers two key advantages. First, the LLM-generated names provide a layer of transparency, offering an immediate hypothesis about what each abstract state represents. Second, it shifts the role of the domain expert from the arduous task of initial interpretation to the more efficient step of validating or refining the LLM-generated labels, accelerating the process of gaining actionable insights.
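A hypothetical version of the prompt assembly for this step is sketched below. The paper does not publish its exact prompt; the sensor names, statistics, and wording here are invented for illustration.

```python
# Invented state profile; the real profiles come from StreamStory's
# per-state sensor value distributions.
state_profile = {
    "flow_rate_kg_h": {"mean": 41800, "std": 950},
    "discharge_pressure_kg_cm2": {"mean": 12.4, "std": 0.3},
    "fluid_temperature_c": {"mean": 68.0, "std": 2.1},
}

def build_naming_prompt(machine_type: str, profile: dict) -> str:
    """Format a state's statistics and machine context into a naming prompt."""
    lines = [f"- {sensor}: mean={stats['mean']}, std={stats['std']}"
             for sensor, stats in profile.items()]
    return (
        f"You are labelling operating states of a {machine_type}.\n"
        "Given the sensor statistics below, return a concise, intuitive name\n"
        "for this state (e.g. 'Peak Production - High Flow and Heat').\n"
        + "\n".join(lines)
    )

prompt = build_naming_prompt("oil refinery pump", state_profile)
```

The prompt would then be sent once per state to the chosen LLM (GPT-4o in the experiment), and the returned name stored as the state's semantic label.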
4 Experiment

To validate the proposed framework, an experiment was conducted using a real-world industrial dataset from an oil refinery pump. This section details the dataset, implementation, and results.

4.1 Dataset

The experiment was performed on a proprietary, real-world dataset obtained from an industrial oil refinery. Due to its confidential nature, the dataset is not publicly available. The data consists of a multivariate time-series collected over one month of operation (March–April 2017) with a 15-minute sampling resolution. Data was gathered from a suite of IoT sensors monitoring the core functions of a critical pump. Key measurements include fluid flow rate (kg/h), suction and discharge pressure (kg/cm²), and temperatures of the process fluid and mechanical components (°C).

4.2 Implementation Details

The methodology was implemented in a Python environment. The underlying Markov Chain model was built using the entire historical dataset provided, as the goal is to interpret the complete, learned dynamics of the process rather than to perform a predictive task that would require a train/test split. Behavior profiles were constructed using sliding windows of multiple sizes (3, 5, 7, and 10 steps). The resulting profiles were analyzed using the scikit-learn implementation of Isolation Forest. The 'contamination' parameter was set to 5% for the primary analysis, a common heuristic for industrial processes. State descriptions were generated using the GPT-4o model, which was prompted with the statistical profiles of each state to generate intuitive names.

4.3 Experimental Results and Discussion

The application of the framework yielded a ranked list of operational events, characterized by the Isolation Forest decision score. This score serves as a robust indicator of how typical or anomalous a given time window is. Table 1 details the top five most anomalous events identified. These events are characterized by scores that are more than 3 standard deviations below the mean, signifying extreme statistical rarity.

Table 1: Top 5 Most Anomalous Events

Rank | Timestamp | Score (Std.) | Final State (LLM Name)
1 | 2017-04-03 14:30 | -0.096 (-3.88) | Startup...Transition
2 | 2017-03-28 10:00 | -0.071 (-3.45) | Startup...Transition
3 | 2017-03-30 00:00 | -0.066 (-3.35) | High-Flow, Cool Op.
4 | 2017-04-03 12:30 | -0.061 (-3.26) | Machine Idle
5 | 2017-04-03 15:00 | -0.056 (-3.18) | Weekday Low-Flow...

The true explanatory power of the method is revealed when the abstract state sequences are translated into their LLM-generated names. For instance, the most anomalous event culminates in the sequence "... -> 'Startup or Shutdown Transition' -> 'Machine Idle or Shutdown' -> 'Startup or Shutdown Transition'". This provides a clear, human-readable narrative of the pump entering a period of instability and stoppage, a marked improvement over black-box models that simply flag a time point as anomalous without providing a temporal context for the "why." An engineer, seeing this semantic sequence, can immediately infer a potential cause for investigation, such as an attempted restart or a stuttering shutdown process.

Conversely, the most normal events, detailed in Table 2, paint a picture of operational stability. These events have high positive scores, and their sequences reveal a stable operational loop between states like "Peak Production," "Weekend Peak-Load Production," and "Extreme Temperature Peak Performance." This recurring pattern defines the pump's healthy operational "heartbeat," providing a data-driven "golden standard" for normal behavior under demanding conditions. This semantic understanding is crucial for operators, as it validates that the system is performing as expected. The LLM-generated names for these sequences, such as transitions between 'Weekday Peak Performance' and 'Weekend Peak-Load Production', describe the system operating within its expected high-performance period. This demonstrates the framework's ability not only to flag deviations but also to recognize and semantically label the system's healthy, predictable operational cycles, providing a valuable baseline for what constitutes 'good' performance.

Table 2: Top 5 Most Normal Events

Rank | Timestamp | Score (Std.) | Final State (LLM Name)
1 | 2017-03-23 22:00 | 0.192 (1.22) | Weekend Peak-Load
2 | 2017-03-31 06:00 | 0.192 (1.22) | Peak Production
3 | 2017-04-01 00:00 | 0.191 (1.20) | Peak Production
4 | 2017-03-31 23:30 | 0.191 (1.19) | Weekday Peak Perf.
5 | 2017-03-31 07:30 | 0.190 (1.17) | Weekday Peak Perf.

To ensure the robustness of the findings, a sensitivity analysis was conducted on the Isolation Forest 'contamination' parameter, testing values of 1%, 5%, and 10%. While the number of points labeled 'Anomalous' changed as expected, the relative ranking of the most extreme events remained highly consistent, confirming that the core findings are not sensitive to this hyperparameter.

The claims in this paper are demonstrated on a single, representative dataset. While the framework is designed to be general, further studies on diverse industrial processes are required to fully validate its broader applicability. The LLM-generated labels were not validated in a formal user study with domain experts; such a study is a valuable next step.

5 Conclusion

This paper presented a complete, self-contained framework for increasing the interpretability of complex industrial process models. By creating behavior profiles of system states and using an LLM to assign semantic names, the approach successfully translates abstract data analysis into practical domain knowledge. The method provides a robust process for ranking and explaining individual operational events in a transparent manner, as demonstrated on a real-world industrial dataset. This work establishes a strong foundation for a new type of explainability, moving beyond feature importance to provide narrative, context-rich descriptions of system dynamics.

The representation of system dynamics as behavior profiles opens a wide array of possibilities for future research. The current work successfully identifies and presents the raw temporal sequences leading to key events. Future work will focus on applying formal pattern-mining techniques to automatically discover recurring and significant sequential patterns within these events. Such an analysis could reveal whether distinct "families" of anomalous behavior exist, each with its own characteristic temporal signature. This promises a more nuanced description of system operations and provides a stronger foundation for developing targeted predictive-maintenance strategies. Finally, to address current limitations, two key areas will be prioritized. First, formal user studies with domain experts will be conducted to validate the utility and accuracy of the LLM-generated explanations, moving beyond the promising initial results. Second, the framework's generalizability will be tested through broader empirical evaluation across diverse industrial sectors and sensor types to boost its credibility and applicability.

6 Acknowledgments

This work was supported by the Slovenian Research Agency and the European Union's Horizon 2020 project FAME (Grant No. 101092639).

References

[1] Liat Antwarg, Ronnie Mindlin Miller, Bracha Shapira, and Lior Rokach. 2019. Explaining anomalies detected by autoencoders using SHAP. arXiv preprint arXiv:1903.02407.
[2] Levli Citron, Kobi Cohen, and Qing Zhao. 2025. Searching for a hidden Markov anomaly over multiple processes. arXiv preprint arXiv:2506.17108.
[3] Kareth M. Leon-Lopez, Florian Mouret, Henry Arguello, and Jean-Yves Tourneret. 2021. Anomaly detection and classification in multispectral time series based on hidden Markov models. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–11.
[4] Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. 2022. Anomaly detection in time series: a comprehensive evaluation. Proceedings of the VLDB Endowment, 15, 9, 1779–1797.
[5] Charalampos Shimillas, Kleanthis Malialis, Konstantinos Fokianos, and Marios M. Polycarpou. 2025. Transformer-based multivariate time series anomaly localization. In 2025 IEEE Symposium on Computational Intelligence on Engineering/Cyber Physical Systems (CIES). IEEE, 1–8.
[6] Luka Stopar, Primoz Skraba, Marko Grobelnik, and Dunja Mladenic. 2018. StreamStory: exploring multivariate time series on multiple scales. IEEE Transactions on Visualization and Computer Graphics, 25, 4, 1788–1802.
[7] Fengling Wang, Yiyue Jiang, Rongjie Zhang, Aimin Wei, Jingming Xie, and Xiongwen Pang. 2025. A survey of deep anomaly detection in multivariate time series: taxonomy, applications, and directions. Sensors (Basel, Switzerland), 25, 1, 190.

Active Learning for Power Grid Security Assessment: Reducing Simulation Cost with Informative Sampling

Gašper Leskovec, Jožef Stefan Institute, Slovenia, leskovecg@gmail.com
Costas Mylonas, UBITECH, Greece, kmylonas@ubitech.eu
Klemen Kenda, Jožef Stefan Institute, Slovenia, klemen.kenda@ijs.si

Abstract

Power grid security assessment under the N-1 criterion requires extensive contingency simulations, which are computationally intensive and costly to label. In this work, we explore the use of active learning (AL) to train binary classifiers that can accurately predict the outcome of contingency scenarios using fewer labeled samples.
We evaluate several AL strategies, such as entropy, margin, and uncertainty sampling, against a random baseline. Our results show that AL methods achieve the same predictive performance with significantly fewer labels, reducing labeling effort and simulator runtime. These findings demonstrate the effectiveness of integrating AL with power system simulators to enable scalable and efficient N-1 security assessment without sacrificing model accuracy.

Keywords

active learning, smart grids, security assessment, simulation cost reduction

1 Introduction

Ensuring secure operation of power systems under the N-1 criterion is a cornerstone of grid reliability. The criterion requires that the system remains within operational limits following the loss of any single component (e.g., line, transformer, or generator). In practice, this involves simulating a large number of contingencies and checking for violations of thermal or voltage constraints. While essential, such simulations are computationally intensive, particularly when performed on high-fidelity grid models, and their interpretation often requires expert judgment. This creates a bottleneck for both real-time applications and large-scale scenario analyses, where scalability and efficiency are important.

Classical approaches to N-1 assessment rely on exhaustive AC power flow simulations combined with contingency-ranking heuristics such as performance indices (PIs). While useful for screening, these heuristics may mis-rank contingencies or overlook borderline cases due to masking effects [3]. Moreover, exhaustive analysis does not scale well with system size, making it unsuitable for fast or repeated assessments.

To overcome these challenges, researchers have proposed machine learning (ML) and deep learning (DL) approaches that approximate N-1 contingency outcomes directly from operating-point features. One of the earliest contributions in this direction applied convolutional neural networks (CNNs) to contingency datasets, showing that deep models could achieve over 99% accuracy in detecting insecure cases while being more than 200 times faster than traditional power flow calculations [1]. Building on this, more recent work explored pooling-ensemble multi-graph learning to design scalable contingency screening schemes based on steady-state information, demonstrating improved adaptability for large-scale systems [2]. These approaches enable fast security screening without solving power flows for every contingency. However, their reliability hinges on the availability of large labeled datasets covering all relevant operating points and contingencies. Such datasets are typically generated by running exhaustive offline N-1 simulations, which is computationally expensive, or require significant expert effort to label secure versus insecure cases. This dependence on costly and large-scale data generation remains a major limitation of existing ML-based frameworks for steady-state security assessment.

To reduce labeling costs, AL has recently been explored in other areas of power systems. For example, the authors of [5] used AL to enhance stability assessment and dominant instability mode identification, showing that models could be trained with far fewer labeled samples while maintaining accuracy. Similarly, the authors of [4] demonstrated an AL-enhanced digital twin for day-ahead load forecasting, where the model iteratively refined predictions by querying only the most uncertain cases. These studies confirm the potential of AL to reduce expert effort and simulation cost by strategically selecting informative samples. However, AL has not yet been applied to N-1 steady-state security assessment, where the need to cut down on contingency simulations is especially critical.

In this work, we propose a novel framework for AL-driven N-1 security assessment. Our contributions are threefold:

(1) We design a binary classification model that predicts whether a given contingency is secure or insecure based on steady-state features.
(2) We integrate AL strategies (entropy, margin, and uncertainty sampling) with the classifier to selectively query the most informative contingencies for simulation, reducing the number of labels required.
(3) We demonstrate through a case study that our approach achieves the same predictive accuracy as fully supervised baselines while reducing simulation cost and labeling effort by up to 40–50%.

This work provides the first evidence that AL can be directly leveraged for N-1 security assessment, offering a scalable and label-efficient alternative to exhaustive simulation or purely supervised ML approaches.

SiKDD 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.11

2 Methodology

We study whether pool-based AL can reduce the number of expensive N-1 simulations ("labels") while keeping prediction quality for binary secure vs. insecure classification.

Table 1: Dataset and system description (digital twin of the Greek transmission network).

(2) Train the RF on the current labeled set; score the pool to obtain class-probability vectors p(x).
(3) Select the next batch of 𝑏 samples using one of the query Attribute strategies below. Value (4) Query the simulator for labels of the chosen batch (expen- Test system 35 buses, 46 lines, 135 generators, sive step); add them to the labeled set. 110 static generators, 20 loads (5) Repeat for a fixed number of iterations or until the budget Power flow solver AC load flow (Newton–Raphson), is exhausted. Contingencies (N–1) We sweep budgets across runs: 𝑖 ∈ {100, . . . , 500}, 𝑏 ∈ Line outages (all lines except idx { via pandapower 45), generator outages (all) 50, . . . , 200 }, and up to 40 iterations, which lets us trace Total contingency cases long learning curves. 8 769 Secure / Insecure 51.28% / 48.72% Query strategies. We compare: (i) Random (baseline); (ii) Feature dimensionality 271 features total Least-confident (uncertainty): score 1 − max𝑐 𝑝𝑐 (𝑥 ); (iii) Mar- Feature groups load_: 20, gen_: 135, sgen_: 110 gin: negative gap between top-2 probabilities; (iv) Entropy: − Í 𝑝𝑐 (𝑥 ) log 𝑝 𝑐𝑐 (𝑥 ). All three uncertainty policies operate on the same RF posteriors and therefore often rank samples 2.1 Data and labels from a digital twin similarly. We use a steady-state digital twin of the transmission grid. For each timestamp we solve the base-case AC power flow, then apply 2.5 Evaluation the N-1 criterion by removing each line/transformer/generator After each iteration we evaluate on the fixed test set. At each AL in turn and re-solving. An operating point is labeled round we retrain the RF from scratch on the enlarged labeled set; secure if the base case and all contingencies satisfy limits (bus voltages new labels are added to training only; the pool remains unlabeled. ∈ [0.90, 1.10] p.u., line loading ≤ 100%); otherwise it is insecure. For each strategy we run multiple configurations and both seeds, Non-convergent power flows are labeled then align results by total labeled samples and average across runs insecure . 
The test system is a digital twin derived from the topology of to obtain strategy-level learning curves. Unless noted otherwise, the Greek transmission network (35 buses, 46 lines, 135 genera- TTT values in the main figures are computed on these averaged tors, 110 static generators, 20 loads). AC load flows are computed curves. Appendix A.1 (Table 4a) reports per-run TTT (mean ± with the Newton–Raphson method in std), which is larger due to variability across initial sizes pandapower . N-1 contin-𝑖, batch gencies include all line outages (excluding line index 45) and all sizes 𝑏, and seeds. generator outages. Table 1 summarizes the dataset. 2.6 Metrics 2.2 Time-aware train/validation/test split We report Accuracy and ROC AUC on the test set, plus two label- Samples are sorted by timestamp. The AL efficiency metrics: Time-to-Target (TTT), the smallest number training/pool comes from earlier windows, while the of labeled samples needed for the average curve of a strategy to test set is the most recent slice and is never used for training or querying. This avoids temporal reach a target (e.g., ACC≥0.92 or AUC≥0.98); and AULC (Area leakage and mimics deployment where we predict on future data. Under the Learning Curve), computed by trapezoidal integration A small validation split is carved from the training era for early of metric vs. total labeled. Because simulator seconds per call checks. are roughly constant, relative cost/time savings are well approxi- mated by label savings derived from TTT. 2.3 Classifier and hyperparameters Additional classification metrics. Besides Accuracy and ROC Our base model is a Random Forest (RF) because it is fast, AUC we also track Precision, Recall, F1 and the False Neg- robust and provides class-probability posteriors needed by ative Rate (FNR) on the fixed test set at every AL round. Let uncertainty-based AL. Across runs we vary hyperparameters TP, FP, FN, TN be counts on the test set. 
We use the standard in realistic ranges: 𝑛 estimators ∈ [200, 1500], max_depth ∈ definitions: Precision = TP/(TP + FP), Recall = TP/(TP + FN), {18 , Precision·Recall 20 , 24 , 25 , 28 , 30 , 35 , 40 , None } , min_samples_split ∈ F1 = 2 · , FNR = FN/(FN + TP) = 1 − Recall. We Precision + Recall {2, 4}, min_samples_leaf ∈ {1, 2, 3}, class_weight ∈ report mean ± std across runs/seeds, and we extract TTT-style {balanced, balanced_subsample}. We use seeds {42, 1337} for thresholds for these metrics when relevant. reproducibility. 3 Results Classifier dependence. We use Random Forests for probabil- Figure 1 and Figure 2 show learning curves (averaged across ity outputs and fast retraining inside the AL loop. While AL’s relative gains often transfer across probabilistic classifiers, we seeds). Across the budget range, all three uncertainty-based poli- cies (entropy, margin, uncertainty) dominate the random baseline did not perform a systematic model sweep here. Evaluating lo- gistic regression and gradient-boosted trees under the same AL in both Accuracy and ROC AUC; the area under the learning curve (AULC) is consistently higher. protocol is left to future work. Table 3 summarizes KPIs used in the paper. At the most im- 2.4 portant targets, AL reaches the same performance with far fewer Pool-based AL loop labels: at ACC ≥ 0.92, AL needs about 500 labels vs. 1 040 for We follow the standard pool-based AL recipe: random (∼ 52% fewer); at AUC ≥ 0.98, AL needs 580 vs. 960 (1) Start with an initial labeled set of size 𝑖 and an unlabeled (∼ 40% fewer). Final metrics at the maximum budget are also pool. higher for AL (ACC 0.917±0.005 and AUC 0.983±0.002) than for 78 Active Learning for Power Grid Security Assessment: Reducing Simulation Cost with Informative Sampling SiKDD 2025, October 6th, 2025, Ljubljana, Slovenia random . 
At targets Precision/Recall/F1 ≥ 0.90 and FNR ≤ 0.10, the uncertainty-based policies consistently hit the thresholds earlier on the average curves, confirming that the AL gains are not specific to a single metric. Shaded bands (std across runs) show the same ordering stability observed for ACC/AUC. Full KPI values and TTT thresholds for P/R/F1/FNR are provided in Appendix A.2 (Table 4b). Next, we compare label efficiency using Time-to-Target (TTT). Figures 3 and 4 show TTT for accuracy targets 0.90 and 0.92, while Figures 5 and 6 show TTT for AUC targets 0.97 and 0.98. At the easy target ACC ≥ 0.90 all strategies reach the goal after about 100–120 labels (uncertainty sometimes at 120 due to seed/batch noise). At the more demanding ACC ≥ 0.92 target, Figure 1: Accuracy vs. total labeled samples (mean ± std ∼ active-learning policies need about 500 labels, whereas random needs 1 040 (i.e., 52% fewer labels). For AUC ≥ 0.97, AL reaches across runs). (Note: entropy, margin, and uncertainty overlap almost 275 the target at labels vs. 325 for random (∼15% fewer), and perfectly on this dataset—so the three AL curves/bands lie on top of each for AUC ≥ 0 98 at 580 vs. 960 (∼40% fewer). These reductions . other; Random is shown separately for contrast) translate directly into lower simulation time when the average time per labeling call is roughly constant. Figure 2: ROC AUC vs. total labeled samples (mean ± std across runs). (Note: entropy, margin, and uncertainty overlap almost Figure 3: TTT (Accuracy ≥ 0.90): computed on the strategy- perfectly on this dataset—so the three AL curves/bands lie on top of each level average curve; per-run variability (mean ± std) is other; Random is shown separately for contrast) reported in Appendix. Table 2: Final test metrics at maximum budget (mean ± std across runs). 
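The three uncertainty-based query strategies compared above can be written compactly. The following is a minimal, self-contained sketch with invented toy probabilities, not the paper's implementation (which scores Random Forest posteriors from scikit-learn inside the full AL loop):

```python
import math

# Acquisition scores for pool-based AL (higher = more informative).
# p is a list of class-probability vectors, one per pool sample,
# as produced by e.g. a random forest's predict_proba.

def least_confident(p):
    return [1.0 - max(probs) for probs in p]

def margin(p):
    # negative gap between the two most probable classes
    out = []
    for probs in p:
        top2 = sorted(probs, reverse=True)[:2]
        out.append(-(top2[0] - top2[1]))
    return out

def entropy(p):
    return [-sum(q * math.log(q) for q in probs if q > 0) for probs in p]

def select_batch(scores, b):
    # indices of the b highest-scoring pool samples to send to the simulator
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:b]

pool_probs = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
print(select_batch(entropy(pool_probs), 2))  # → [0, 2]
```

On binary problems all three scores are monotone transformations of max_c p_c(x), which is consistent with the near-identical curves reported for entropy, margin, and least-confident sampling.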
Table 2: Final test metrics at maximum budget (mean ± std across runs).
  Strategy      Accuracy        ROC AUC
  entropy       0.917 ± 0.005   0.983 ± 0.002
  margin        0.917 ± 0.005   0.983 ± 0.002
  uncertainty   0.917 ± 0.006   0.983 ± 0.004
  random        0.916 ± 0.010   0.977 ± 0.004

On high AUC values. The time-aware split still yields a separable test set for this case study (AUC ≈ 0.98). This likely reflects informative steady-state features and balanced classes, not overfitting to the test era. That said, harder, more imbalanced systems may reduce AUC and amplify AL gains; we treat this as a scope limitation.

Precision, Recall, F1 and FNR. The additional metrics mirror the ACC/AUC trends: entropy, margin, and uncertainty produce higher AULC and reach target quality with fewer labels than random. At targets Precision/Recall/F1 ≥ 0.90 and FNR ≤ 0.10, the uncertainty-based policies consistently hit the thresholds earlier on the average curves, confirming that the AL gains are not specific to a single metric. Shaded bands (std across runs) show the same ordering stability observed for ACC/AUC. Full KPI values and TTT thresholds for P/R/F1/FNR are provided in Appendix A.2 (Table 4b).

Overall, uncertainty-based AL strategies consistently beat random at the harder targets (ACC 0.92 and AUC 0.98) while performing similarly at the easier ACC 0.90 threshold; final performance at the maximum budget remains high (ACC 0.917±0.005, AUC 0.983±0.002 for AL vs. ACC 0.916±0.010, AUC 0.977±0.004 for random).

Figure 4: TTT (Accuracy ≥ 0.92): computed on the strategy-level average curve; per-run variability (mean ± std) is reported in the Appendix.

Figure 5: TTT (AUC ≥ 0.97): computed on the strategy-level average curve; per-run variability (mean ± std) is reported in the Appendix.

Figure 6: TTT (AUC ≥ 0.98): computed on the strategy-level average curve; per-run variability (mean ± std) is reported in the Appendix.

Table 3: KPIs by strategy (averaged across runs). Final metrics reflect per-run means; see Table 2 for mean ± std.
  Strategy      AULC acc   AULC auc   TTT acc≥0.90   TTT acc≥0.92   TTT AUC≥0.97   TTT AUC≥0.98   Final ACC   Final AUC
  entropy       0.92       0.98       100            500            275            580            0.917       0.983
  margin        0.92       0.98       100            500            275            580            0.917       0.983
  random        0.91       0.97       100            1 040          325            960            0.916       0.977
  uncertainty   0.92       0.98       120            500            275            580            0.917       0.983

4 Conclusion
This paper demonstrates that AL is a viable strategy for reducing simulation costs in power-grid security assessment. By selectively querying informative contingencies, we cut labels (and thus simulator calls) by about 52% at ACC ≥ 0.92 (500 vs. 1 040 with random) and about 40% at AUC ≥ 0.98 (580 vs. 960), without sacrificing final performance (AL: ACC 0.917±0.005, AUC 0.983±0.002; random: ACC 0.916±0.010, AUC 0.977±0.004; see Table 2). Fewer simulator calls translate into shorter training times and lower computational and memory requirements, which are particularly important for real-time or resource-constrained applications. Moreover, integrating AL within a digital-twin pipeline enables a feedback loop in which the classifier continuously refines itself using only the most informative contingencies. These findings suggest that exhaustive N-1 simulations are not always necessary for reliable security assessment, paving the way for more scalable and efficient grid-analysis tools.

The present study focuses on a single test system and a Random Forest classifier. In future work we plan to evaluate the proposed framework on larger and more diverse grid topologies (e.g., IEEE 39-bus, 118-bus or national transmission networks) and under varying operating conditions. Another direction is to explore more advanced models such as gradient-boosting machines, deep neural networks or graph neural networks, which may capture complex relationships among grid variables. We also intend to investigate alternative sampling strategies, including diversity-based selection, query-by-committee and Bayesian AL, to further improve label efficiency. Finally, extending the methodology to multi-contingency (N-k) and dynamic security assessments (e.g., transient stability) will broaden its applicability in future smart-grid deployments.

Reproducibility
Code, analysis scripts, and a dataset to reproduce all figures and tables will be released at https://github.com/HumAIne-JSI/smart-energy-ea.

Acknowledgements
This work was supported by the European Union's funded Project HUMAINE [grant number 101120218]. The authors acknowledge the use of LLMs for language optimization. While the LLMs contributed to enhancing efficiency and refining the presentation of this work, all conceptual frameworks, analyses, and interpretations remain the sole responsibility of the authors.

References
[1] José-María Hidalgo Arteaga, Fiodar Hancharou, Florian Thams, and Spyros Chatzivasileiadis. 2019. Deep learning for power system security assessment. In 2019 IEEE Milan PowerTech. IEEE, 1–6.
[2] Jiyu Huang, Lin Guan, Yinsheng Su, Haicheng Yao, Mengxuan Guo, and Zhi Zhong. 2021. System-scale-free transient contingency screening scheme based on steady-state information: A pooling-ensemble multi-graph learning approach. IEEE Transactions on Power Systems 37, 1 (2021), 294–305.
[3] Kip Morison, Lei Wang, and Prabha Kundur. 2004. Power system security assessment. IEEE Power and Energy Magazine 2, 5 (2004), 30–39.
[4] Costas Mylonas, Titos Georgoulakis, and Magda Foti. 2024. Facilitating AI and System Operator Synergy: Active Learning-Enhanced Digital Twin Architecture for Day-Ahead Load Forecasting. In 2024 International Conference on Smart Energy Systems and Technologies (SEST). IEEE, 1–6.
[5] Zhongtuo Shi, Wei Yao, Yong Tang, Xiaomeng Ai, Jinyu Wen, and Shijie Cheng. 2023. Intelligent power system stability assessment and dominant instability mode identification using integrated active deep learning. IEEE Transactions on Neural Networks and Learning Systems 35, 7 (2023), 9970–9984.

A Additional Results

A.1 Time-to-Target Variability Across Runs

Table 4: Additional KPI summaries and TTT variability across runs.

(a) Per-run Time-to-Target (TTT) mean ± std (labels) by strategy. Note: values here are per-run TTT (mean ± std); the TTT bars in Figures 3–6 are computed on the averaged curve.
  Threshold    Strategy      TTT (mean ± std)
  ACC ≥ 0.90   entropy       384 ± 207
               margin        384 ± 207
               uncertainty   372 ± 208
               random        440 ± 225
  ACC ≥ 0.92   entropy       751 ± 359
               margin        751 ± 359
               uncertainty   751 ± 359
               random        897 ± 318
  AUC ≥ 0.97   entropy       432 ± 178
               margin        432 ± 178
               uncertainty   432 ± 178
               random        502 ± 352
  AUC ≥ 0.98   entropy       803 ± 304
               margin        803 ± 304
               uncertainty   803 ± 304
               random        1029 ± 386

A.2 Precision/Recall/F1/FNR KPIs and TTT Thresholds

(b) KPIs by strategy for Precision, Recall, F1, and derived FNR. Time-to-Target (TTT) is the number of labels to reach the threshold (e.g., Precision ≥ 0.90, Recall ≥ 0.90; for FNR, TTT corresponds to FNR ≤ 0.10).
  Strategy      AULC P   TTT P≥0.90   Final P   AULC R   TTT R≥0.90   Final R   AULC F1   TTT F1≥0.90   Final F1   TTT FNR≤0.10   Final FNR
  entropy       0.948    500          0.952     0.916    500          0.922     0.931     500           0.936      500            0.078
  margin        0.948    500          0.952     0.916    500          0.922     0.931     500           0.936      500            0.078
  random        0.928    960          0.940     0.905    1040         0.906     0.916     1040          0.922      1040           0.094
  uncertainty   0.948    500          0.952     0.916    500          0.922     0.931     500           0.936      500            0.078


Supporting Material Reuse in Drone Production

Rok Cek, Jožef Stefan Institute, Ljubljana, Slovenia, rok.cek@gmail.com
Oleksandra Topal, Jožef Stefan Institute, Ljubljana, Slovenia, oleksandra.topal@ijs.si
Linda Leonardi, CETMA, Brindisi, Italy, linda.leonardi@cetma.it
Margherita Forcolin, Maggioli Group, Santarcangelo di Romagna, Italy, margherita.forcolin@maggioli.gr
Klemen Kenda, Jožef Stefan Institute, Ljubljana, Slovenia, klemen.kenda@ijs.si

Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.20

Abstract
This paper, part of the European Horizon project Plooto, details an end-to-end, data-driven framework for reusing expired carbon-fiber prepregs in drone production. First, 19 batches of expired prepregs were tested, revealing that most remained usable within the first year after expiration. Machine learning models were then developed to predict material usability pre-production and product quality post-production, using manufacturing data and time-series features. To facilitate this process, a dedicated data pipeline and an interactive Product Quality Explorer tool were created to support explainable model development and integration with industrial partners. This framework demonstrates how combining material requalification with data-driven predictions can lower costs and support circularity in drone production.

Keywords
circular economy, digital product passport, machine learning, product quality

1 Introduction
The growing demand for lightweight, high-performance materials is driving the increased use of carbon fiber reinforced polymers (CFRPs) in industries such as aerospace, automotive, and drones. However, this rapid adoption also creates challenges, particularly with the accumulation of expired materials. While much research has focused on recycling fully cured CFRPs, less attention has been given to the reuse of uncured prepregs, which, despite expiring during storage, can still retain valuable properties [5]. Addressing this challenge is crucial for advancing circular economy principles in high-tech manufacturing.

This paper presents research from the European Horizon project Plooto, focusing on the reuse of expired prepregs in sustainable drone production. Our work contributes in three key areas: (1) a comprehensive evaluation of the effects of aging on expired prepregs through thermal, chemical, and mechanical testing to establish requalification thresholds [1], (2) the development of machine learning models to predict the usability of expired prepregs before production, and (3) the application of predictive models to assess the quality of final products after production, specifically for sandwich panels made from recycled prepregs.

By combining experimental testing with data-driven methods, our findings highlight the potential to reduce waste and enhance sustainability in drone manufacturing. By integrating machine learning models to predict the usability of expired prepregs and assessing the quality of final products, we provide industrial partners with actionable insights that directly enhance operational decision-making. The combination of material requalification and predictive analysis supports the sustainability goals of the drone production process.

2 Data and Methods

2.1 Materials and experimental techniques used for prepreg usability assessment
Expired rolls of epoxy prepregs from HP Composites S.p.A. were used for this study. A total of 19 prepreg batches were investigated, comprising four different resin systems (ER450, IMP509, X1, ER431), with reinforcement varying according to supplier availability. Usability is assessed through periodic chemical-physical and mechanical testing after the expiration date, to monitor property changes in materials stored at −18 °C. Differential Scanning Calorimetry (DSC) tests were performed with a Mettler Toledo DSC 823e on uncured prepreg samples by applying dynamic heating from −40 °C to 250 °C at 20 °C/min under a nitrogen atmosphere. DSC analysis provides two key parameters: the glass transition temperature of the uncured system (Tg0), related to the initial crosslink density, and the residual cure degree (α), calculated from the polymerization enthalpy values.

Composite plates for physical and mechanical testing were manufactured by draping a variable number of prepreg plies at 0°, depending on reinforcement type, to obtain cured laminates of ≈ 3 mm. The prepreg plies were stacked on a flat mold surface over a peel ply. The plates were then covered with an additional peel ply, a release film, and a breather layer. A self-adhesive seal and a vacuum bag were used to maintain a sealed vacuum during the entire process. Plate curing was carried out in a hot press according to the curing cycle recommended by the supplier in the material datasheet, as reported in Table 1. The void content (Vc) was measured on five specimens through a digestion procedure according to standard ASTM D3171 Method A [3]. The interlaminar shear strength (ILSS) tests were performed with a 3-point bending system on an MTS Insight machine according to the standard test ASTM D2344 [2] on five different specimens for each prepreg batch.

These experimental results, including DSC data, ILSS, and void content (Vc) measurements, provide essential features for the machine learning models discussed in Section 2.2. The values of key properties such as the glass transition temperature (Tg0), residual cure degree (α), and interlaminar shear strength (ILSS) are directly used to predict the usability of the expired prepregs and to assess the quality of the final products after manufacturing.

Table 1: Curing cycle parameters for the plates recommended in the material datasheet.
  Material   Temperature (°C)   Time (h)   Pressure (bar)
  ER 450     135                2          6
  IMP 509    140                1.5        4
  X1 120     130                1.5        6
  ER 431     125                1          5

2.2 Predicting the usability and key parameters of prepreg using machine learning methods
The results from the DSC tests, along with other experimental data such as ILSS and void content (Vc) collected in Section 2.1, were systematically organized and used as input features for the machine learning models to predict prepreg usability and key process parameters. Each row represents one checkpoint on an expired roll and includes: test date, month code, prepreg code and lot, type (expired roll), stocking temperature (−18 °C), original expiry date, α (%), Tg,onset (°C), ILSS (MPa), Vc, curing temperature (°C), Usable (Y/N), and, when redefinition is applied, pressure (bar), temperature (°C), time (min), and the redefined expiry date. For the correct operation of machine-learning methods, a days-after-expiry feature was introduced and computed as test_date − original_expiry_date.
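As a concrete illustration, such a derived feature can be computed with Python's standard library; this is a minimal sketch with invented dates, not the project's pipeline code:

```python
from datetime import date

# Days between a usability test and the roll's original expiry date;
# positive once the material has expired (dates below are invented examples).
def days_after_expiry(test_date, original_expiry_date):
    return (test_date - original_expiry_date).days

print(days_after_expiry(date(2024, 7, 1), date(2023, 12, 31)))  # → 183
```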
For example, the vacuum difference quantified the gap − original_expiry_date . between the measured and target pressure, while the temperature ods, a days-after-expiry feature was introduced and computed as The study addresses two predictive tasks: a classification prob- difference measured the offset between chamber setpoints and lem for Usable (three classes: Y, Y/N, N) and regression problems the actual values recorded. These derived variables provided in- for process/quality parameters (ILSS,𝑇𝑔,onset, 𝑉𝑐 , 𝛼 ). Analysis pro- dicators of process deviations that might affect the final product ceeds in two stages. First, a per-material stage fits separate models quality. for each prepreg system (ER450, IMP509, ER431, X1) to resolve material-specific issues observed during preliminary inspection. The analysis followed the CRISP-DM methodology, beginning with data fusion and preparation, followed by feature selection Second, a pooled stage trains a unified model over all records to and model training. Metadata and time-series features were com- evaluate cross-material generalisation. bined into a single dataset, from which irrelevant or redundant Predictors are restricted to pre-test covariates: days-after- variables were removed. expiry, material identity, normalised lot descriptors, month code, For predictive modelling, several classification algorithms storage conditions, and other metadata available at decision time, were evaluated to balance interpretability and performance. Lo- while measured targets are excluded from inputs to prevent label leakage. Random-forest classifiers and regressors (scikit-learn) pa- gistic regression and decision trees offered transparent decision rameterised as estimators=100, max _𝑑𝑒 𝑝𝑡 ℎ=3, random_state=42 𝑛 boundaries, while ensemble methods such as random forests and gradient boosting provided stronger predictive power by aggre- serve as the base models and enable inspection of feature impor- gating multiple weak learners. 
Multi-layer perceptrons (MLP) were also considered to capture non-linear patterns in the data. Performance estimation relies on leave-one-out cross-validation (LOO-CV) [6] in both stages. For the classification task, overall accuracy is reported to evaluate the model's performance in predicting prepreg usability. For the regression tasks, MAE, R², and RMSE are used to assess the model's ability to predict continuous process parameters. R² measures the proportion of variance explained, while MAE provides the average error magnitude, and RMSE emphasizes larger errors. Feature-importance profiles are examined to identify the dominant drivers of re-usability and variation in process parameters across materials and in the pooled setting.

To integrate the methodology into the production workflow, a dedicated service was implemented. Metadata was provided in an Excel (.xlsx) file, while the process data was provided in .rdb format by the industrial partner. A pipeline was developed to automatically download these files from a shared Dropbox folder provided by the industrial partner, parse the .rdb data, and convert the files into structured JSON files. The JSON files were enriched with derived variables and unique identifiers, then uploaded to the Plooto platform via its API. This ensured seamless integration of raw production data with machine learning models, enabling continuous prediction of product quality.

2.3 Machine Learning for Post-Production Quality Prediction

This part of the pilot addressed the prediction of production quality in sandwich panel manufacturing, with the aim of supporting drone production after re-qualification. As part of this work, we developed a tool called Product Quality Explorer to support domain experts in analyzing production data and assessing product quality [4]. Its primary goal is to facilitate the creation of explainable machine learning models. The tool helps users understand factors influencing quality outcomes and make informed adjustments to the manufacturing process.

Supporting Material Reuse in Drone Production, Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia

The tool provides a summary of descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) and allows users to visualize selected columns through histograms and boxplots. Finally, it generates a heatmap of all columns to provide an overview of relationships within the data. In the next step, the user selects the features to include in the machine learning model. This step is necessary both to define the target variable for prediction and to exclude irrelevant columns such as IDs, dates, or textual data. The tool also provides several options for handling missing values. The user can choose the approach that best suits the dataset: leaving missing values unchanged (which may prevent some algorithms from functioning properly), removing features with missing values, removing rows containing missing values, or imputing missing values using the column mean.

The next step provides the option to generate new attributes. This can be done through techniques such as one-hot encoding, polynomial feature generation, or logarithmic transformations. After creating new attributes, the user selects the features to be used in the machine learning process. This selection can be performed manually or automatically with the assistance of genetic algorithms.

Finally, the user can select which machine learning models to apply. Once training is complete, the results are presented in a summary table containing performance metrics such as precision, recall, F1-score, and accuracy, along with a confusion matrix visualization. The tool also provides a comparative overview of model performance across all metrics (precision, recall, F1-score, accuracy). In addition to evaluation, the system integrates explainability techniques. Global explanations are generated using SHAP to show how features influence model decisions across the entire dataset, while local explanations are provided using SHAP and LIME to illustrate how the model arrived at a prediction for a specific datapoint. These explanations are supported by interactive visualizations, which enable users to better understand both the overall model behavior and individual predictions.

3 Results

3.2 Predictive modeling results for prepreg reuse

We analysed N = 81 inspection records with a two-stage workflow: a global model across all prepregs and material-specific models were trained and estimated using leave-one-out cross-validation (LOO-CV). Table 2 summarizes the results of all experiments, including classification and regression performance for global and material-specific models.

Table 2: LOO-CV performance across prepregs for regression and classification

Type       | Usability  | Metric | α    | Tg0  | ILSS | Vc
-----------|------------|--------|------|------|------|-----
All types  | Acc = 0.91 | AggR2  | 0.83 | 0.77 | 0.70 | 0.77
           |            | MAE    | 1.22 | 1.05 | 4.49 | 1.52
           |            | RMSE   | 1.59 | 1.33 | 5.93 | 1.98
ER450      | Acc = 0.96 | AggR2  | 0.86 | 0.88 | 0.92 | 0.94
           |            | MAE    | 1.25 | 0.54 | 2.75 | 0.87
           |            | RMSE   | 1.51 | 0.77 | 4.05 | 1.15
IMP509     | Acc = 0.87 | AggR2  | 0.76 | 0.60 | 0.82 | 0.80
           |            | MAE    | 1.44 | 1.23 | 2.50 | 1.35
           |            | RMSE   | 1.90 | 1.58 | 3.01 | 1.75
X1         | Acc = 0.96 | AggR2  | 0.82 | 0.79 | 0.79 | 0.43
           |            | MAE    | 1.12 | 0.98 | 2.41 | 1.77
           |            | RMSE   | 1.44 | 1.12 | 3.09 | 2.32
ER431      | Acc = 1.00 | AggR2  | 0.97 | 0.88 | 0.94 | 0.87
           |            | MAE    | 0.57 | 0.89 | 1.43 | 1.06
           |            | RMSE   | 0.76 | 1.15 | 1.93 | 1.64

As we can see from the presented results, the global multi-class classifier achieved 0.91 accuracy under LOO-CV on an imbalanced set (54 Y / 14 Y-N / 13 N), indicating that a simple pre-production screen is feasible from routine metadata. Per-material classifiers scored even higher (often ≥ 0.96), but these figures are almost certainly optimistic given tiny per-material sample sizes and class imbalance. A detailed classification report, including precision, recall, and F1 scores, can be provided upon request.
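The evaluation protocol behind Table 2 (LOO-CV with MAE, RMSE, and R² as defined above) can be sketched in a few lines of plain Python. The one-nearest-neighbour stand-in and the toy data below are illustrative assumptions only, not the models or measurements used in the study.

```python
# Sketch of leave-one-out cross-validation (LOO-CV) with the regression
# metrics reported in Table 2: MAE, RMSE, and R^2 (variance explained).
import math

def loo_cv_metrics(X, y, fit_predict):
    """Hold out each sample once; return (MAE, RMSE, R2) over all folds."""
    preds = []
    for i in range(len(y)):
        X_tr = X[:i] + X[i+1:]          # train on everything except sample i
        y_tr = y[:i] + y[i+1:]
        preds.append(fit_predict(X_tr, y_tr, X[i]))
    n = len(y)
    mae = sum(abs(p - t) for p, t in zip(preds, y)) / n
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, y)) / n)
    mean_y = sum(y) / n
    ss_res = sum((t - p) ** 2 for p, t in zip(preds, y))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Placeholder model: 1-nearest neighbour on one feature (e.g. days since expiry).
def nn1(X_tr, y_tr, x):
    j = min(range(len(X_tr)), key=lambda k: abs(X_tr[k] - x))
    return y_tr[j]

X = [1, 2, 3, 10, 11, 12]               # toy feature values
y = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9]      # toy targets
mae, rmse, r2 = loo_cv_metrics(X, y, nn1)
```

Because every sample serves as the test point exactly once, LOO-CV uses all N observations for evaluation, which is why it suits the small per-material subsets here and also why its estimates can be optimistic.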
3.1 Results of usability assessment

Ageing trends from DSC. Differential scanning calorimetry (DSC) on the selected prepreg rolls (grouped by resin system) shows that Tg0 increases progressively over time after expiration. This behaviour is consistent with i) increasing molecular weight and ii) higher crosslink density of the polymer network due to ongoing polymerization. The measured α values align with this trend, indicating a time-dependent decrease in the residual degree of cure; notably, within the first two years after expiration, the reduction remains limited to <15%.

Mechanical strength and porosity evolution. Across all batches, interlaminar shear strength (ILSS) exhibits a time-dependent decline: reductions generally do not exceed 15% within the first 12 months after expiration, whereas more pronounced decreases of 25–30% occur in the 12–24 month interval. Consistent with this mechanical trend, the void content Vc remains below 10% during the first 12 months after expiration and increases thereafter, often exceeding 15% in later months.

A consistent trend across the regression tasks is the superior performance of models trained on a single prepreg type compared to the global model trained on all data. For instance, the global model predicted ILSS with an aggregate R² of 0.70, whereas the material-specific models for ER450 and ER431 achieved much higher scores of 0.92 and 0.94, respectively. This suggests that ageing and curing behaviours are highly specific to the resin system, and tailored models better capture these characteristics. However, this is not a universal rule; the prediction of Vc for the X1 prepreg (aggregate R² = 0.43) was notably worse than that of the global model (aggregate R² = 0.77), indicating that in cases of very limited data or less distinct features, the global model can be more robust. (Note: the dataset is modest and unevenly distributed across resins (ER450 n = 28, X1 n = 22, IMP509 n = 15, ER431 n = 14); consequently, per-material models are trained on few observations and LOO-CV performance is likely optimistic.)

Feature importance analysis performed during the experiments revealed the most influential factors in predicting the key parameters in Table 2. Days_Since_Expiry was consistently one of the most critical predictors across both global and material-specific models, confirming its fundamental role in tracking material degradation. Furthermore, the analysis revealed strong intercorrelations between the measured properties themselves. For example, the degree of cure (α) and Tg0 were often the most important features for predicting ILSS and Vc, indicating that these thermal and chemical properties are highly interdependent. Batch identifiers (prepreg code/lot) were generally minor, although lot occasionally ranked higher for ILSS, indicating possible batch effects.

Taken together, these patterns suggest that compact, physics-aligned feature sets explain most of the variance, and that ageing/α consistently drive both regression and classification. Nevertheless, limited data—especially for IMP509 and ER431—and the optimism of LOO-CV preclude production use without further data collection and validation across broader process conditions.

3.3 Evaluation of Post-Production Classification Models

The predictive modelling was applied to production cycles from sandwich panel manufacturing provided by the Italian pilot partners. We also used the aforementioned Product Quality Explorer tool after we had already transformed the data and created new features. The objective was to assess whether production quality outcomes could be predicted from a combination of metadata and process-derived time-series features. This is particularly important for supporting drone production after re-qualification, as early detection of potential quality issues can prevent defective panels from progressing further in the manufacturing chain. Moreover, it can save manufacturers time, energy, and personnel costs, as each panel must currently be manually inspected and tested.

The dataset comprised 294 production cycles, the majority of which were compliant, with only a small fraction classified as non-compliant. This strong imbalance reflects real-world conditions, where defects are rare but critical, yet it also creates difficulties for machine learning approaches. Most algorithms tend to favour the majority class, which can lead to high overall accuracy but poor detection of defective cases.

Several classification algorithms were tested. Overall accuracy values appeared relatively high (between 0.77 and 0.85), but this was largely driven by the correct classification of compliant cases. Performance on the minority (non-compliant) class was weaker, as reflected by modest recall and F1-scores. This indicates that while the models are well-suited to reproducing the majority outcome, their ability to identify rare defective panels is more limited.

These findings suggest that machine learning can provide useful insights into production quality trends, but further progress requires additional data, particularly more defective cases. A larger dataset would allow models to better distinguish between compliant and non-compliant cycles, thereby increasing their value as a decision-support tool in quality assurance. The detailed performance of each tested classifier is reported in Table 3.

Table 3: Performance of machine learning models on the Italian pilot sandwich panel dataset

Model                        | Accuracy | Precision | Recall | F1-Score
-----------------------------|----------|-----------|--------|---------
Logistic Regression          | 0.846    | 0.838     | 0.838  | 0.838
Decision Tree                | 0.769    | 0.764     | 0.738  | 0.745
Random Forest                | 0.808    | 0.797     | 0.806  | 0.800
XGBoost                      | 0.808    | 0.797     | 0.806  | 0.800
LightGBM                     | 0.846    | 0.838     | 0.838  | 0.838
Support Vector Machine (SVM) | 0.808    | 0.801     | 0.788  | 0.793
Multi-layer Perceptron (MLP) | 0.808    | 0.801     | 0.788  | 0.793

4 Conclusion

This study demonstrates an end-to-end approach that integrates material science and machine learning to enhance the reuse of expired prepregs in drone production. By evaluating and requalifying expired materials, we have shown that they remain serviceable within the first year after expiry, with gradual performance decline, particularly in interlaminar shear strength and curing behavior. This underscores the effectiveness of resin-specific reuse gates and modified processing windows to extend material lifetimes.

Machine learning models were employed to support both pre-production and post-production processes. The pre-production models classified expired prepregs for reuse, while the post-production models predicted the quality of sandwich panels based on combined metadata and process features. Despite challenges related to data imbalance, the results demonstrate the potential for predictive quality monitoring in manufacturing, contributing to more sustainable production practices.

The integration of machine learning with material science not only optimizes requalification processes and reduces waste, but also supports cost reduction and environmental sustainability in high-performance manufacturing. Future work should focus on expanding datasets, refining resin-specific criteria, and exploring the broader applicability of the models in other composite manufacturing contexts, further advancing circular economy principles.

Acknowledgements

This work was supported by the European Commission under the Horizon Europe project Plooto, Grant Agreement No. 101092008. We would like to express our gratitude to all project partners for their contributions and collaboration.

The authors acknowledge the use of LLMs for language optimization. While the LLMs contributed to enhancing efficiency and refining the presentation of this work, all conceptual frameworks, analyses, and interpretations remain the sole responsibility of the authors.

References

[1] Constance Amare, Olivier Mantaux, Arnaud Gillet, Matthieu Pedros, and Eric Lacoste. 2022. Innovative test methodology for shelf life extension of carbon fibre prepregs. IOP Conference Series: Materials Science and Engineering, 1226, 1, (Feb. 2022), 012101. https://dx.doi.org/10.1088/1757-899X/1226/1/012101.
[2] ASTM International. 2022. ASTM D2344/D2344M-22: Standard Test Method for Short-Beam Strength of Polymer Matrix Composite Materials and Their Laminates. West Conshohocken, PA, USA, (2022). Retrieved Sept. 3, 2025 from https://store.astm.org/d2344_d2344m-22.html.
[3] ASTM International. 2022. ASTM D3171-22: Standard Test Methods for Constituent Content of Composite Materials. West Conshohocken, PA, USA, (2022). Retrieved Sept. 3, 2025 from https://store.astm.org/d3171-22.html.
[4] Rok Cek and Klemen Kenda. 2025. Product quality explorer - determining product quality based on the digital product passport. In 17th Jožef Stefan International Postgraduate School Students' Conference: 28th–30th May: Book of abstracts: from research to reality, 33. http://ipssc.mps.si/auxiliary_material/IPSSC25%20BoA.pdf.
[5] Gaurav Nilakantan and Steven Nutt. 2015. Reuse and upcycling of aerospace prepreg scrap and waste. Reinforced Plastics, 59, 1, 44–51.
[6] Tzu-Tsung Wong. 2015. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognition, 48, 9, 2839–2846.
Temporal Dynamics and Causal Feature Integration for Predictive Maintenance in Manufacturing Systems: A Causality-Informed Framework

Seyed Iman Hosseini (iman.hosseini@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Klemen Kenda (klemen.kenda@ijs.si), Jožef Stefan Institute and Qlector, Ljubljana, Slovenia
Dunja Mladenić (dunja.mladenic@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

ABSTRACT

Predictive maintenance is increasingly central to manufacturing, where the goals are to reduce unplanned downtime and extend asset lifetimes. Conventional models often rely on correlations that insufficiently capture temporal dynamics and causal dependencies underlying failures. This study proposes a causality-informed feature-engineering pipeline that combines cross-correlation-derived lags with VARLiNGAM to construct lag-aware features from multivariate sensor streams, and evaluates it against standard time-series models using a time-aware split. Three machine-learning models—Random Forest, XGBoost, and Gradient Boosting—were trained and assessed by F1-score (rather than accuracy) on a single-machine subset of the Microsoft Azure Predictive Maintenance dataset (8,708 samples; 26 failures, ≈0.3% prevalence). XGBoost trained on raw temporal features achieved F1 ≈ 0.94 for longer prediction horizons (≥10 h) under time-series-aware cross-validation, with performance declining at shorter horizons as temporal context diminishes. In this setting, causality-informed features did not improve results over the raw-feature baseline. These findings indicate that, with data from a single machine, causal discovery is susceptible to overfitting and may suppress informative temporal patterns; broader, multi-machine datasets are likely required for causality-enhanced representations to yield consistent gains.

KEYWORDS

Predictive Maintenance, Causality, Time-Series Analysis, Machine Learning, VARLiNGAM, Manufacturing Systems

1 INTRODUCTION

The rising complexity and interconnectivity of industrial systems have accelerated the need for intelligent maintenance strategies that move beyond reactive and preventive paradigms. Predictive maintenance, driven by sensor data and machine learning, has emerged as a transformative approach to minimize unplanned downtime and optimize asset life cycles [1]. Traditional predictive maintenance models, however, often rely on statistical correlations that fail to capture the directionality and temporal dynamics inherent in real-world system failures [6].

To address these limitations, this study proposes a causality-informed framework for predictive maintenance that leverages temporal causal discovery techniques, such as Vector Autoregressive LiNGAM (VARLiNGAM), to engineer predictive features from multivariate sensor data. Our approach integrates cross-correlation analysis and lag-optimized causal graphs to detect failure precursors and identify their optimal predictive windows. We hypothesize that the observed lack of competitive advantage for causality-informed models, especially when applied to data from a single machine, arises from the limited operational diversity and failure variability. This limitation may cause models to overfit to machine-specific correlations and exclude informative temporal features, thereby hindering their generalizability. Testing this hypothesis through multi-machine datasets will be a key focus of future work.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society, 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.12

2 RELATED WORK

Causality in time series analysis has become increasingly critical in predictive maintenance, particularly within industrial and manufacturing domains, where early failure detection plays a pivotal role in minimizing operational disruptions and financial losses [5]. Classical statistical models have been widely used to infer causal relationships between sensor measurements and machine states, yet they often fail to capture complex temporal dynamics and the nonlinear relationships inherent in real-world system failures.

Recent studies have explored advanced causal inference techniques to enhance fault prediction. Wang S. et al. proposed a framework for fault diagnosis that integrates spatiotemporal dependencies, demonstrating improved predictive accuracy in chemical manufacturing systems [9]. While their work advances reliability in industrial diagnostics, it lacks the flexibility to generalize across diverse application domains. On the other hand, Cui et al. introduced a deep learning framework that enhances predictive maintenance by integrating causal reasoning and long-sequence multivariate time-series data, significantly improving predictive performance and interpretability [3]. Despite this, the challenge of automating temporal feature engineering and seamlessly deploying models across different domains remains. Yang X. et al. contributed to the growing literature on data-driven causal analysis by incorporating dynamic latent variables and probabilistic graphical models into causal modeling frameworks [10]. However, these models have yet to fully address the temporal feature extraction required for scalable deployment in real-world predictive maintenance applications. Furthermore, more recent work by Wang Q. et al. introduced a Causal Graph Convolution Module that adapts causal discovery within time-series prediction [8], but their approach is still dependent on complex model adjustments across domains.

In this study, we propose a novel framework that integrates lagged correlation with causal analysis techniques to detect failure precursors and quantify their lead times. This framework automates temporal feature engineering and is designed for diverse real-world applications across manufacturing settings, without requiring extensive architectural modifications. The automation of temporal feature engineering and its seamless deployment across comparable manufacturing environments remains a significant challenge, and extending generalization beyond this domain is left for future work.

3 EXPERIMENT

Our experimental methodology followed a sequential four-stage process to construct and validate a robust failure prediction model, as shown in Figure 1. The first stage involved performing a cross-correlation analysis between each sensor's time-series data and the target failure events to determine the optimal predictive time lag, which guided the subsequent steps. In the second stage, the identified optimal lag was used to parameterize a Vector Autoregressive LiNGAM (VARLiNGAM) model, which generated a directed acyclic graph (DAG) representing the causal relationships and effect strengths between sensor variables and the failure event. The third stage focused on creating a causality-informed feature vector by integrating standard statistical metrics from rolling time windows along with advanced features informed by the causal analysis, using the correlation strengths and causal effect strengths derived from the VARLiNGAM model to select and weight features based on their respective optimal and causal lags. Finally, in the fourth stage, the enriched feature set was fed into a machine learning pipeline, employing a time-based data split to prevent look-ahead bias, and training several classification models, including Random Forest, XGBoost, and Gradient Boosting, to assess the effectiveness of the causality-informed approach for predictive maintenance. This integrated approach enhances the predictive capabilities of machine learning models, offering a robust solution for failure prediction in industrial settings.

3.1 Dataset and Preprocessing

We used the Microsoft Azure Predictive Maintenance Dataset [2], which provides hourly telemetry (voltage, rotation, pressure, vibration) plus maintenance records, failure events, incident reports, and machine metadata for 100 machines over 12 months in 2015 (over 800k hourly summaries and thousands of non-failure error entries). For this study, we restricted the analysis to machine ID 98; after cleaning and merging the sources, we constructed a causality-informed feature vector and standardized features across modalities. Cross-correlation suggested predictive lags of 1–24 hours, so we derived lagged/statistical features from six primary variables (voltage, rotation, pressure, vibration, age, error type). The final dataset comprised 8,708 samples with 26 failures (≈0.3%), indicating strong class imbalance [7, 2]. The feature set comprised 150 causality-informed features and 36 features without causal information.

3.2 Cross-correlation Analysis

Cross-correlation analysis examines the correlation between two time series as a function of the time lag applied to one of them [11][12]. Unlike simple correlation, which measures linear relationships at a single point in time, cross-correlation reveals how variables relate across different time delays, making it particularly valuable for identifying lead-lag relationships and temporal dependencies. The initial phase of our experimental framework involved a cross-correlation analysis to empirically determine the predictive temporal relationships between sensor signals and equipment failures. For each sensor, we computed the Pearson correlation coefficient between its time series and the binary failure time series across a range of discrete time lags. This procedure was executed by systematically shifting the failure signal backward in time, which allowed for the correlation of sensor readings at a given time t with failure events at a future time t + lag. The optimal predictive lag for each sensor was then identified as the time lag that yielded the maximum absolute correlation value. This analysis is critical as it quantifies the time window in which each sensor's data is most informative for forecasting an impending failure, thereby providing an empirical foundation for the subsequent causal discovery and feature engineering stages.

Figure 2: Cross-correlation analysis

In the cross-correlation plot shown in Figure 2, the red star annotated on each sensor's curve denotes the optimal predictive lag—20 hours for Pressure, 14 hours for Vibration, and so forth. This marker identifies the specific time lag, measured in hours, at which the sensor's signal exhibits the highest absolute Pearson correlation with the future failure event.
Consequently, the Figure 1: proposed framework red star highlights the most influential temporal offset for each variable, effectively quantifying the sensor’s most informative predictive window within the 24-hour forecasting horizon. 87 A Causality-Informed Framework Information Society, 2025, Ljubljana, Slovenia 3.3 Causal Graph Construction To elucidate the causal interdependencies between sensor signals and equipment failures, a causal graph was constructed using VARLiNGAM. This methodology first employs a Vector Autore- gression (VAR) model to capture the linear, time-lagged relation- ships among the multivariate sensor time series. The optimal lag for the VAR model was adaptively informed by the preced- ing cross-correlation analysis to focus on the most predictive temporal window. Following the VAR estimation, the LiNGAM algorithm is applied to the resulting model residuals, or innova- tions. By exploiting the non-Gaussian nature of these innova- tions, LiNGAM uniquely identifies the contemporaneous causal structure—the instantaneous effects between variables—and de- termines the direction of influence, thereby producing a directed Figure 3: Causal Graph acyclic graph (DAG). The final output is a set of adjacency ma- trices representing the causal graph, where each non-zero entry quantifies the strength and direction of a causal link from one analysis that selects per-sensor optimal prediction windows. Us- variable to another at a specific time lag. Our approach constructs ing a sliding feature window (typically 72 h), samples are formed a directed causal graph from time-series sensor data using the from historical data only to avoid leakage. Feature construction following steps: basic statistics proceeds in four stages: (1) (mean, standard devi- (1) Chronologically sort sensor ation, min/max, latest/earliest within the window); (2) Data Sorting and Integrity: causality- data, verifying integrity and noting irregular intervals. 
computed at the optimal lags iden- aligned temporal features (2) Define variables which are vibra- tified by causal analysis; (3) via trend slopes (linear Variable Definition: dynamics tion, rotation, pressure, voltage, and a binary failure indi- regression), rolling volatility (standard deviation), and rates of cator as the target node. change; and (4) implied by the causal graph cross-feature terms (3) Configure a VARLiNGAM [4] model (e.g., voltage/rotation ratios and pressure–vibration correlations). Causal Model Setup: with a specified lag order and BIC-based pruning. Targets are defined for multiple horizons (1, 6, 12, and 24 h ahead) (4) Fit the model to the prepared data matrix, to enable early warnings at different lead times. The resulting Model Fitting: applying regularization—by adding small Gaussian noise dataset contains 150 features that integrate causal dependencies (e.g., 10−6 )—when numerical instability arises during VAR- with temporal patterns. LiNGAM causal graph construction due to ill-conditioned matrices. 3.5 Machine Learning Models (5) Extract adjacency matrices to Adjacency Extraction: Three classification algorithms, each configured with default identify directed edges, effect strengths, and correspond- hyperparameters, were evaluated using time-based data parti- ing lags. tioning to mitigate the risk of data leakage. (6) Assemble the causal graph, catego-Graph Assembly: rizing edges by their relation to the target and between • Random Forest (RF): Ensemble method with 200 estima- tors, maximum depth of 15, and balanced class weights sensor variables. 
• XGBoost (XGB): Gradient boosting with 200 estimators, This workflow ensures that temporal ordering is respected learning rate of 0.1, and automatic scale balancing and that detected causal links most likely represent meaningful • Gradient Boosting (GB): Scikit-learn implementation relationships for predictive maintenance and further analytical with 200 estimators and 0.8 subsample ratio investigations. Figure 3 presents the causal graph generated by Model performance was assessed using F1 Score metric appro- the VARLiNGAM algorithm, illustrating the network of causal priate for imbalanced classification: relationships between sensor telemetry (volt, pressure, vibration, rotate), machine properties (age), and the target failure event. In • F1-Score: Harmonic mean of precision and recall this graph, nodes represent the variables, and the directed edges A time-series–aware data partitioning strategy was imple- (arrows) signify the direction of causality, with edge thickness mented using scikit-learn’s , which generates folds TimeSeriesSplit corresponding to the strength of the effect. The labels on each in chronological order by progressively expanding the training edge quantify the causal strength and the time delay (lag) in hours. set with earlier observations and reserving subsequent periods The analysis reveals a complex web of interactions, prominently for testing. This procedure ensures that all training data tem- highlighting that machine age is the most significant causal driver porally precedes the corresponding test data. To approximate of failure, with an exceptionally strong effect strength at a lag of stratification and preserve class balance between rare failure and 6 hours. Other notable, though weaker, causal pathways are also more frequent non-failure events, the folds were constructed identified, such as the influence of rotate on failure. 
This causal to proportionally distribute failure cases across splits without structure provides critical insights into the system’s dynamics, introducing randomization. This design maintains the tempo- identifying the key variables and time-delayed interactions that ral integrity of the sensor data while supporting reliable model precede a failure event. evaluation. 3.4 Causality-Informed Feature Engineering 4 RESULTS AND DISCUSSION We prepared the data by building a Figure 4 presents the comprehensive F1-score evaluation of all causality-informed feature vec- tor grounded in the paper’s causal graph and a temporal causality three models, while Figure 5 provides a comparative analysis 88 Information Society, 2025, Ljubljana, Slovenia Seyed Iman Hosseini, Klemen Kenda, and Dunja Mladenič 5 FUTURE WORKS While this study establishes a robust, domain-agnostic frame- work for failure prediction, future work will focus on enhancing its transparency and causal reasoning capabilities. The integra- tion of Explainable Artificial Intelligence (XAI) methods, such as SHAP or LIME, will provide transparent insights into the predic- tive models’ decision-making processes, fostering trust among users and enabling more informed maintenance decisions. Ad- ditionally, investigating counterfactual analysis will allow for exploring ’what-if ’ scenarios to better understand the causal im- pacts of various factors on failure predictions. Alongside these enhancements, we will address the observed limitations of ap- plying causality-informed models to data from a single machine. Specifically, we hypothesize that the lack of competitive advan- Figure 4: F1-score evaluated over a 20-hour prediction hori- tage stems from the limited operational diversity and failure zon variability of a single-machine dataset, leading to overfitting. 
Figure 5 reports the F1 score of the XGBoost model with and without the causality-informed feature vector. Standard time-series models, particularly those trained on raw temporal data, consistently outperform causality-informed approaches in predictive maintenance tasks, especially at extended prediction horizons. XGBoost, for instance, achieves F1 scores exceeding 94% for horizons beyond 10 hours, though performance declines with shorter windows due to reduced temporal context. In contrast, causality-informed models offer no competitive advantage, primarily due to the limitations of causal discovery conducted on data from a single machine. This narrow scope lacks the operational diversity and failure variability needed to infer generalizable causal structures, resulting in overfitting to machine-specific correlations and the exclusion of informative temporal features. These findings highlight the critical need for multi-machine datasets when applying causal methods, ensuring that inferred relationships reflect true causality rather than artifacts of constrained data.

Figure 5: The XGBoost F1-score across a 20-hour prediction horizon, evaluated with and without a causality-informed feature vector.

In addition, longer prediction horizons (e.g., 20 hours) afford models access to extended historical windows (e.g., 72 hours), enhancing their ability to detect subtle patterns and causal signals. In contrast, short horizons (e.g., 1 hour) offer limited temporal context, increasing susceptibility to noise and overfitting. Causality-informed features such as optimal lag and causal strength are inherently better suited to longer windows, where failure patterns emerge gradually rather than abruptly.

Future work will validate this hypothesis by expanding the dataset to include multiple machines, ensuring more generalizable insights into causal relationships and improving the robustness of predictive models.

ACKNOWLEDGEMENTS
We gratefully acknowledge the European Commission for its support of the Marie Skłodowska-Curie program through the Horizon Europe DN APRIORI project (GA 101073551).

REFERENCES
[1] Abdeldjalil Benhanifia, Zied Ben Cheikh, Paulo Moura Oliveira, Antonio Valente, and José Lima. 2025. Systematic review of predictive maintenance practices in the manufacturing sector. Intelligent Systems with Applications 26, 200501. doi: 10.1016/j.iswa.2025.200501.
[2] Arnab Biswas. 2025. Microsoft Azure predictive maintenance. https://www.kaggle.com/datasets/arnabbiswas1/microsoft-azure-predictive-maintenance/data. Accessed: 2025-05-20.
[3] Qing'an Cui, Jiao Lu, and Xianhui Yin. 2025. Causality enhanced deep learning framework for quality characteristic prediction via long sequence multivariate time-series data. Measurement Science and Technology 36, 3 (Mar. 2025). doi: 10.1088/1361-6501/adb05a.
[4] LiNGAM Developers. 2025. VARLiNGAM — LiNGAM 1.10.0 documentation. https://lingam.readthedocs.io/en/latest/tutorial/var.html. Accessed: 2025-06-25.
[5] Karim Nadim, Ahmed Ragab, and Mohamed Salah Ouali. 2023. Data-driven dynamic causality analysis of industrial systems using interpretable machine learning and process mining. Journal of Intelligent Manufacturing 34, 1 (Jan. 2023), 57–83. doi: 10.1007/s10845-021-01903-y.
[6] P. Nunes, J. Santos, and E. Rocha. 2023. Challenges in predictive maintenance – a review. CIRP Journal of Manufacturing Science and Technology 40, 53–67. doi: 10.1016/j.cirpj.2022.11.004.
[7] Margarida Da Rocha and Faísca Moreira. 2024. Data-Driven Predictive Maintenance for Component Life-Cycle Extension. Tech. rep. Faculdade de Engenharia da Universidade do Porto.
[8] Qipeng Wang, Shoubo Feng, and Min Han. 2025. Causal graph convolution neural differential equation for spatio-temporal time series prediction. Applied Intelligence 55, 7 (May 2025). doi: 10.1007/s10489-025-06287-7.
[9] Sheng Wang, Qiang Zhao, Yinghua Han, and Jinkuan Wang. 2023. Root cause diagnosis for complex industrial process faults via spatiotemporal coalescent based time series prediction and optimized Granger causality. Chemometrics and Intelligent Laboratory Systems 233 (Feb. 2023). doi: 10.1016/j.chemolab.2022.104728.
[10] Xing Yang, Tian Lan, Hao Qiu, and Chen Zhang. 2025. Nonlinear causal discovery via dynamic latent variables. IEEE Transactions on Automation Science and Engineering. doi: 10.1109/TASE.2024.3522917.
[11] Tanja Zerenner, Marc Goodfellow, and Peter Ashwin. 2021. Harmonic cross-correlation decomposition for multivariate time series. Physical Review E 103, 6 (June 2021). doi: 10.1103/PhysRevE.103.062213.
[12] Xiaojun Zhao, Pengjian Shang, and Jingjing Huang. 2017. Several fundamental properties of DCCA cross-correlation coefficient. Fractals 25, 02, 1750017. doi: 10.1142/S0218348X17500177.

Using Interactive Data Visualization for DeFi Market Analysis

Daria Pavlova (daria.pavlova@mps.si), Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Inna Novalija (inna.koval@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
ABSTRACT
Decentralized Finance (DeFi) presents unique analytical challenges with its data-rich, volatile, and multi-dimensional ecosystem. Static reports struggle to convey short-term dynamics and cross-sectional structure simultaneously. We present a comprehensive Business Intelligence (BI) solution featuring an automated Extract-Transform-Load (ETL) pipeline and an interactive Tableau dashboard. Our ETL architecture processes data from three Application Programming Interfaces (APIs), CoinGecko, DeFiLlama, and DexScreener, through validation and transformation stages, achieving a 45-second execution time. The dashboard integrates Key Performance Indicators (KPIs), a Total Value Locked (TVL) time-series, market category analysis, and a top movers panel with synchronized filters. Performance evaluation demonstrates an 85-99% reduction in analysis time compared to manual methods. Three real-world use cases validate practical applicability: narrative rotation detection (28% investment returns), risk concentration monitoring (15% drawdown reduction), and competitive benchmarking. Our approach bridges the gap between complex DeFi data and actionable insights without requiring technical expertise.

KEYWORDS
DeFi, Business Intelligence, Tableau, TVL, KPI dashboards, Interactive Visualization, ETL Pipeline, Data Mining, Cryptocurrency

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SiKDD 2025, October 6, 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.15

1 INTRODUCTION
Decentralized Finance (DeFi) compresses high-frequency market activity, such as liquidity flows, incentive programs, and new protocol deployments, into datasets that change hourly. The ecosystem encompasses over 6,000 protocols managing billions in Total Value Locked (TVL), creating analytical complexity that traditional tools struggle to handle. Practitioners must simultaneously answer three critical questions: How big is the market now? (level KPIs), How is it moving? (time series), and What drives the cross-section? (categories, movers).

Interactive visualization reduces cognitive load and increases pattern salience relative to static tables [3, 7, 10, 11]. However, existing solutions present trade-offs: Dune Analytics requires Structured Query Language (SQL) expertise, Nansen charges $1,800 annually, while free alternatives like DeFiLlama offer limited visualization capabilities. Our goal is to demonstrate a compact, reproducible Business Intelligence workflow that democratizes DeFi analytics through automated data processing and intuitive visualization.

2 RELATED WORK
2.1 DeFi Analytics Landscape
Surveys of DeFi systems [12] highlight the centrality of TVL, market capitalization, and volume as monitoring signals. Public APIs from CoinGecko and DeFiLlama expose these aggregates for research and dashboards, processing millions of daily transactions into consumable metrics.

Recent advances in artificial intelligence have opened new frontiers in DeFi analysis. Chen et al. [1] proposed ensemble machine learning approaches for detecting rug pulls and protocol vulnerabilities, achieving 87% accuracy using features extracted from on-chain data and social signals. Their Random Forest model combined with Long Short-Term Memory (LSTM) networks demonstrated AI's potential in risk assessment. However, these Machine Learning (ML) approaches require significant computational resources and technical expertise, creating barriers for non-technical analysts. Our solution complements these advanced techniques by providing immediate, interpretable insights through interactive visualization.

2.2 Business Intelligence and Visualization
Classic data warehouse and BI literature formalizes metrics and dimensional modeling for decision support [6]. Industry guidance positions interactive platforms such as Tableau among the leading tools for exploratory analysis [4]. Visualization principles (overview first, zoom and filter, details-on-demand [8]) map directly to dashboard layout patterns [3, 9]. Studies of graphical perception [2, 5] explain why bars outperform pies for accurate comparisons, and why color semantics (green/red for gains/losses) aid preattentive detection [11]. We align with these findings in our chart choices and encodings.

3 SYSTEM ARCHITECTURE AND METHODOLOGY
3.1 ETL Pipeline Architecture
Our ETL pipeline implements a modular, fault-tolerant architecture processing data through five stages. The architecture follows a standard Extract-Transform-Load pattern with additional validation and quality checks at each stage.

Figure 1: ETL Pipeline Architecture: Data flows from three APIs through validation and transformation stages to produce four CSV files for dashboard visualization. The system processes 6,000+ protocols with automated retry logic and data quality checks.

Extract Layer: Three parallel API clients collect data from CoinGecko (200 tokens per page), DeFiLlama (6,000+ protocols), and DexScreener (100+ Decentralized Exchange pairs). Each client implements asynchronous Hypertext Transfer Protocol (HTTP) requests with exponential backoff (4-10 seconds) and retry logic (up to 5 attempts). (API documentation: https://www.coingecko.com/en/api, https://defillama.com/docs/api.)

Validation Layer: Implements four-level data quality checks:
• Completeness: Missing value detection with fallback strategies
• Consistency: Cross-validation between data sources
• Timeliness: Timestamp validation (<1 hour freshness)
• Accuracy: Outlier detection using Median Absolute Deviation (MAD)
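The MAD-based accuracy check can be sketched with a modified z-score. This is an illustrative sketch, not the paper's implementation: the function name, the 0.6745 scaling constant, and the 3.5 cutoff are common robust-statistics defaults rather than values reported by the authors.

```python
import numpy as np

def mad_outliers(values, threshold=3.5):
    """Flag outliers via a modified z-score based on the
    Median Absolute Deviation (MAD).

    Note: 0.6745 rescales MAD to be comparable with the standard
    deviation under normality; threshold=3.5 is a conventional
    default, not a value specified in the paper.
    """
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:  # degenerate case (constant series): flag nothing
        return np.zeros_like(x, dtype=bool)
    modified_z = 0.6745 * (x - median) / mad
    return np.abs(modified_z) > threshold

# Example: a TVL series with one obvious spike
tvl = [100.0, 102.0, 98.0, 101.0, 99.0, 5000.0]
flags = mad_outliers(tvl)  # only the last value is flagged
```

MAD is preferred here over a mean/standard-deviation rule because a single extreme value (common in volatile DeFi data) barely moves the median, so genuine spikes stand out instead of inflating the threshold.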
Transform Layer: Processes validated data through three streams:
• Normalize: Converts to tidy format with Coordinated Universal Time (UTC) timestamps
• Aggregate: Groups by time windows and categories
• Features: Calculates rolling statistics and market sentiment

Load Layer: Exports processed data as Comma-Separated Values (CSV) files optimized for Tableau consumption.

3.2 Dashboard Design Methodology
The dashboard layout follows Shneiderman's Visual Information Seeking Mantra [8]: overview first, zoom and filter, then details-on-demand.

Layout Structure:
• Top Row: Four KPI cards displaying market totals with 24-hour changes
• Middle Section: TVL time-series (left, 60% width) and Top Movers panel (right, 40% width)
• Bottom Section: Category bars (left) and pie chart (right) for market structure analysis
• Right Sidebar: Interactive filters for Time Window, Category Metric, and Top N selections

4 PERFORMANCE EVALUATION
4.1 System Performance Metrics
We evaluated system performance across three dimensions:

Response Time:
• Initial dashboard load: 3.2s ± 0.5s (n=100)
• Filter operations: 1.8s ± 0.3s
• ETL pipeline execution: 45s complete, 8s incremental

Data Processing Efficiency:
• API delay: 0.1s between requests
• Batch processing: 50-100 protocols per batch
• Memory usage: Peak 256MB
• Data volume: 6,000+ protocols, 200 tokens/page

User Efficiency Gains:
• Market overview generation: 15 min → 5 sec (99.4% reduction)
• Sector rotation analysis: 30 min → 2 min (93.3% reduction)
• Top movers identification: 10 min → instant (100% automation)

4.2 Comparison with Existing Solutions

Table 1: Feature Comparison with Industry Solutions

Feature             | Our Solution | Dune      | Nansen    | DeFiLlama
Cost                | Free         | $390/yr   | $1,800/yr | Free
No-code Interface   | ✓            | ×         | ✓         | ✓
Custom ETL          | ✓            | ×         | ×         | ×
Response Time       | <2s          | 5-30s     | <3s       | <1s
Visualization Types | 4            | Unlimited | 10+       | 2
Data Sources        | 3            | Multiple  | Multiple  | 1
Historical Data     | 30 days      | All       | All       | Limited

Our solution occupies a unique position: more sophisticated than DeFiLlama's basic charts, more accessible than Dune's SQL requirements, and more affordable than Nansen's premium tiers.

5 RESULTS AND USE CASE VALIDATION
5.1 Dashboard Implementation
The integrated dashboard combines multiple analytical views with synchronized filtering capabilities. The design synthesizes four key data dimensions:
• KPI Header: Market metrics provide immediate context; a $2.86T total market cap with 56.1% BTC dominance indicates risk-off sentiment
• TVL Time-Series: Shows capital deployment patterns across protocols, with an upward trajectory suggesting renewed confidence
• Top Movers Panel: Highlights outliers; clustering in specific sectors signals narrative emergence
• Category Analysis: Reveals market concentration; the top 3 sectors comprise 51% of total value

Figure 2: Integrated DeFi Analysis Dashboard with annotated regions. (A) KPI header showing market totals and BTC dominance, (B) TVL time-series revealing protocol-level capital flows, (C) Top Movers identifying momentum shifts, (D) Category bars showing sector concentration. Red boxes indicate areas of analytical focus.

5.2 Use Case Validation
Use Case 1: Narrative Rotation Detection. An investment fund utilized the dashboard to identify emerging trends in Liquid Staking Derivatives (LSDs). When multiple LSD protocols appeared in Top Movers with 40%+ gains while category volume increased 3x, they allocated capital early, achieving 28% returns over two weeks.

Use Case 2: Risk Concentration Analysis. A DeFi protocol team monitored market concentration using the category pie chart. When the top 3 categories exceeded 65% of total market cap (Herfindahl-Hirschman Index >0.25), they adjusted their treasury diversification strategy, reducing drawdown by 15% during the subsequent correction.

Use Case 3: Competitive Benchmarking. Protocol developers tracked their TVL growth relative to category peers. The synchronized time-series view revealed that their incentive program launched 3 days after competitors but achieved 2x the TVL growth rate, validating their tokenomics design.

6 DISCUSSION
6.1 Synthesis for Decision-Making
The dashboard enables multi-dimensional analysis through synchronized views:

Macro Market Reading: Combining BTC dominance with DeFi volume trends provides regime identification. High dominance (>55%) with rising DeFi volume suggests selective risk-taking in quality protocols.

Flow Analysis: TVL trends coupled with volume data distinguish genuine inflows from liquidity reshuffling. Rising TVL with flat volume indicates parking behavior rather than active usage.

Rotation Detection: The Top Movers panel acts as an early warning system. Sector clustering combined with category volume spikes provides a 2-3 day lead time for narrative shifts.

6.2 Limitations and Data Quality
Technical Limitations:
• TVL double-counting: Rehypothecation can inflate metrics by 20-30%
• API latency: 5-15 minute delays during high volatility
• Protocol coverage: Excludes protocols with <$1M TVL

Mitigation Strategies:
• Implement adjusted TVL calculations excluding derivative tokens
• Add confidence intervals for volatile metrics
• Include protocol age weighting for emerging project detection

7 CONCLUSION AND FUTURE WORK
We presented a comprehensive BI solution for DeFi market analysis that bridges the gap between sophisticated analytics and accessibility. Our dual contribution, a robust ETL pipeline and an interactive dashboard, demonstrates measurable improvements: an 85-99% reduction in analysis time while maintaining data quality through systematic validation.

The system's practical value is validated through real-world deployments showing successful identification of profitable trading opportunities and risk mitigation strategies. By following established visualization principles and implementing automated data processing, we provide a reproducible framework that democratizes DeFi analytics.

Future work includes: (1) machine learning integration for TVL forecasting and anomaly detection, (2) real-time streaming with sub-second updates, (3) cross-chain analytics for Layer 2 solutions, (4) natural language generation for automated insights, and (5) on-chain integration for protocol-specific metrics.

ACKNOWLEDGMENTS
We thank the reviewers for their constructive feedback, particularly suggestions on AI integration and visualization improvements. Special thanks to the SiKDD conference organizers for providing the platform to present this work.

REFERENCES
[1] L. Chen, Z. Zhang, and M. Wang. 2024. AI-Driven Risk Assessment in DeFi: Machine Learning Approaches for Protocol Security. Journal of Financial Technology 2, 1 (2024), 87–95.
[2] William Cleveland and Robert McGill. 1984. Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods. J. Amer. Statist. Assoc. 79, 387 (1984), 531–554.
[3] Stephen Few. 2013. Information Dashboard Design: Displaying Data for At-a-Glance Monitoring (2nd ed.). Analytics Press.
[4] Gartner Inc. 2024. Magic Quadrant for Analytics and Business Intelligence Platforms. Research Note G00799564.
[5] Jeffrey Heer and Michael Bostock. 2010. Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 203–212.
[6] Ralph Kimball and Margy Ross. 2013. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling (3rd ed.). Wiley.
[7] Tamara Munzner. 2014. Visualization Analysis and Design. CRC Press.
[8] Ben Shneiderman. 1996. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In Proceedings of the IEEE Symposium on Visual Languages. IEEE, 336–343.
[9] Tableau Software. 2022. Visual Analysis Best Practices: Simple Techniques for Making Every Data Visualization Useful. Whitepaper.
[10] Edward Tufte. 2001. The Visual Display of Quantitative Information (2nd ed.). Graphics Press.
[11] Colin Ware. 2019. Information Visualization: Perception for Design (4th ed.). Morgan Kaufmann.
[12] Sam Werner, Daniel Perez, Lewis Gudgeon, Ariah Klages-Mundt, Dominik Harz, and William Knottenbelt. 2021. SoK: Decentralized Finance (DeFi). In Proceedings of the 4th ACM Conference on Advances in Financial Technologies. ACM, 30–46.

A Hybrid Lexicon-Machine Learning Approach to Macedonian Sentiment Analysis

Sofija Kochovska* (kochovskasofija@gmail.com), University of Primorska, UP FAMNIT, Koper, Slovenia
Branko Kavšek* (branko.kavsek@upr.si), University of Primorska, UP FAMNIT, Koper, Slovenia
Jernej Vičič* (jernej.vicic@upr.si), University of Primorska, UP FAMNIT, Koper, Slovenia
Jožef Stefan Institute, Ljubljana, Slovenia
*These authors contributed equally.

Abstract
This study extends our previous work on a rule-based sentiment analysis system for Macedonian text [10], which relied on hand-crafted lexicons and linguistic rules. We now investigate the integration of these rule-based features with supervised machine learning classifiers, specifically Logistic Regression (LR) and Support Vector Machines (SVM), to improve sentiment classification performance. Lexicon-derived features, including polarity, intensifiers, diminishers, and negation handling, are combined with statistical models to evaluate their impact. Experimental results show that the hybrid approach substantially outperforms the rule-based baseline, increasing the mean F1 score from 73.5% to 86.7% for SVM and 86.4% for LR. Paired t-tests confirm that these improvements are statistically significant (p < 0.001), while Wilcoxon tests indicate a strong trend (p = 0.0625). These findings demonstrate that integrating rule-based linguistic features with machine learning classifiers provides a robust framework for sentiment analysis in under-resourced languages such as Macedonian.

Keywords
Sentiment Analysis, Macedonian, Rule-based Approach, Machine Learning, Hybrid Model, Natural Language Processing, Support Vector Machine, Logistic Regression, Low-resource Languages

1 Introduction
Sentiment analysis is a core task in natural language processing (NLP), commonly applied to social media, reviews, and feedback analysis. While progress has been substantial for high-resource languages such as English, low-resource languages like Macedonian still face limited availability of annotated corpora, sentiment lexicons, and reliable tools. Macedonian, an Eastern South Slavic language spoken by around 1.6 million people as the official language of North Macedonia, remains under-explored in computational linguistics despite its close relation to Bulgarian, Serbian, and Croatian.

In this study, we build on our earlier work presented at the ITAT conference (WAFNL workshop) [10], where we developed a rule-based sentiment analysis system for Macedonian. That work focused on lexicon construction and the integration of modifiers such as intensifiers, diminishers, and polarity shifters. Here, we extend the approach by implementing a hybrid framework that combines rule-based linguistic features with supervised machine learning classifiers. Specifically, we evaluate Logistic Regression (LR) and Support Vector Machines (SVMs), using features derived from sentiment lexicons and rule-based weighting schemes.

Our contributions are twofold: (i) we demonstrate how rule-based features enhance the performance of statistical classifiers in a low-resource setting, and (ii) we provide a systematic evaluation of the hybrid approach on Macedonian sentiment data. This study highlights the effectiveness of combining linguistic knowledge with machine learning to improve sentiment detection for under-resourced languages.

2 Related Work
Sentiment analysis has been widely studied, from lexicon-based approaches [16, 6] to machine learning and deep learning models [15, 2]. Lexicon-based systems rely on dictionaries and modifiers such as intensifiers, diminishers, and negations; they are interpretable and require no large datasets but have limited coverage. Machine learning models achieve higher accuracy with sufficient data but often act as "black boxes." In low-resource languages, hybrid approaches combining lexicon features with statistical learning improve robustness [12, 18].

For Macedonian, Jovanoski et al. [9] compiled sentiment lexicons and manually annotated Twitter datasets, analyzing how seed lists affect induced lexicons. Uzunova and Kulakov [17] classified movie reviews, while Gajduk and Kocarev [4] achieved 92% accuracy on forum posts. The SADEmma 1.0 corpus [7] includes three-class news sentiment labels across languages, but the Macedonian portion has only 198 entries, limiting its usefulness for model training. Our previous work [10] introduced a curated lexicon of 4,000 words, later expanded to 8,000, evaluated on Macedonian Twitter data.

Despite its close relation to Bulgarian, Serbian, and Croatian, Macedonian sentiment analysis remains under-resourced.
Comparable studies in Serbian and Slovenian report performance ranging from moderate to high, with F1 or accuracy scores around 76–83% depending on dataset and methodology [11, 8, 13, 3]. These findings indicate that our results align with trends observed in related South Slavic languages. This study extends prior work by integrating lexicon-based features into supervised classifiers, comparing Logistic Regression and SVMs for Macedonian sentiment classification, and, to our knowledge, represents the first combination of rule-based linguistic insights with statistical models for this language.

Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.16

3 Methodology
Our approach builds on the framework presented in Kochovska et al. [10], combining lexicon-based rule features with supervised machine learning classifiers. The methodology is designed to handle the challenges of sentiment analysis in Macedonian, a low-resource language, by leveraging linguistic insights alongside statistical learning.

3.1 Lexicon-Based Feature Extraction
We use manually-checked Macedonian lexicons:
• Positive and Negative Lexicons: Words indicating positive or negative sentiment.
• Intensifiers and Diminishers: Words that amplify or attenuate sentiment (e.g., very, slightly).
• Polarity Shifters (Negations): Words that invert sentiment, such as not or never, applied within a small context window.
• Stop-words: Common words with minimal meaning, removed to improve feature quality.

Texts are preprocessed to normalize repetitions and remove URLs, mentions, punctuation, and stop-words. Each token is analyzed for sentiment considering intensifiers, diminishers, and negations. Extracted features include:
• Normalized lexicon score
• Counts of positive and negative words
• Counts of intensifiers, diminishers, and negations
These features provide a compact numerical representation of sentiment suitable for supervised learning models.

3.2 Machine Learning Models
The rule-based features (lexicon score, counts of positive/negative words, intensifiers, diminishers, and negations) are used as input to two classifiers:
• Logistic Regression (LR): A linear classifier trained on the rule-based features. Hyperparameters for intensifier weight (1.5), diminisher weight (0.7), and negation window size (2) were adopted from our previous ITAT study, which tested 108 combinations to identify the optimal configuration.
• Support Vector Machine (SVM): A linear-kernel SVM trained on the same features. The C parameter was tuned via grid search (0.1–5), with the best performance at C = 0.15.

The selected rule-based configuration for both models is: intensifier weight = 1.5, diminisher weight = 0.7, negation window = 2, and ε = 0.30. These values control the contribution of linguistic modifiers to the overall sentiment score of a text.

3.3 Dataset Splitting
The Macedonian sentiment dataset used in this study is identical to that from our previous ITAT/WAFNL paper [10]. For machine learning evaluation, we employ stratified 5-fold cross-validation. In each fold, 80% of the data is used for training and 20% for testing, ensuring that the class distribution is preserved across folds. This approach allows robust evaluation of both Logistic Regression and SVM models while leveraging all available data for training and testing across different folds.

3.4 Evaluation Procedure
We evaluated the rule-based baseline and hybrid classifiers using stratified 5-fold cross-validation to ensure balanced sentiment class representation. For each fold, models were trained on 80% of the data and tested on 20%, repeating the process across five splits to obtain stable estimates.

Performance was measured primarily with F1 scores for positive and negative classes [10], enabling direct comparison with Jovanoski et al. [9]. Confusion matrices and full classification reports were also generated to evaluate performance on all three classes, including neutral, highlighting improvements in polarity detection and challenges in handling neutral sentiment. Statistical significance of improvements was assessed using paired t-tests and Wilcoxon signed-rank tests on per-fold F1 scores.

4 Results and Evaluation
The hybrid sentiment analysis framework was evaluated on the Macedonian test dataset that we also used for evaluation of the rule-based-only approach discussed in the ITAT/WAFNL paper [10], however this time using Logistic Regression (LR) and Support Vector Machine (SVM) classifiers. Both models leveraged the rule-based features described in Section 3, with hyperparameters tuned based on our previous ITAT study for LR and specifically tested on this dataset for SVM.

4.1 Logistic Regression (LR)
Logistic Regression trained on rule-based features demonstrates consistently strong performance, achieving a mean F1 score of 0.864 on positive and negative classes. The per-fold results indicate stable performance across folds, suggesting robustness to variations in the training data (Figure 1).

Figure 1: Logistic Regression: F1 score per fold for positive and negative classes.

The confusion matrix (Figure 2) shows that most misclassifications involve neutral and negative instances. Specifically, 43 neutral examples were predicted as negative, and 29 negative examples were labelled as neutral. Positive instances are generally well-separated, with minimal confusion, reflecting the effectiveness of the lexicon-based features. These patterns suggest that LR captures polarized sentiment effectively but struggles with subtle neutral expressions.

Figure 2: Logistic Regression confusion matrix for all classes.

Overall classification metrics confirm high precision and recall for positive and negative classes (Precision = 0.847 / 0.830, Recall = 0.872 / 0.923, F1 = 0.859 / 0.874), while neutral sentiment remains more challenging (F1 = 0.715). Figure 5 presents these metrics visually, highlighting the differences between classes.

Figure 5: Overall precision, recall, and F1 scores for Logistic Regression and SVM.

4.2 Support Vector Machine (SVM)
SVM, also trained on the same rule-based features, achieves a slightly higher mean F1 score of 0.867 for positive and negative classes and shows stable per-fold performance (Figure 3). The hyper-parameter C = 0.15, selected after testing a range from 0.1 to 5, provided optimal regularization for this dataset.

Figure 3: SVM: F1 score per fold for positive and negative classes.

The SVM confusion matrix (Figure 4) exhibits a similar trend to LR: neutral instances are most frequently misclassified, with 54 neutral examples predicted as negative and 38 predicted as positive. SVM shows improved recall for negative instances, correctly identifying 481 of 508 examples, indicating enhanced sensitivity to strong negative cues.

Figure 4: SVM confusion matrix for all classes.
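The rule-based feature extraction of Section 3 can be sketched as follows. This is an illustrative sketch, not the authors' code: the tiny English word lists stand in for the actual Macedonian lexicons (which are not reproduced here), and the function name is hypothetical. The weights do follow the configuration reported above (intensifier 1.5, diminisher 0.7, negation window 2).

```python
# Placeholder lexicons; the real system uses curated Macedonian word lists.
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "awful"}
INTENSIFIERS = {"very"}
DIMINISHERS = {"slightly"}
NEGATIONS = {"not", "never"}

# Weights from the configuration reported in Section 3.2.
INTENSIFIER_W, DIMINISHER_W, NEGATION_WINDOW = 1.5, 0.7, 2

def extract_features(tokens):
    """Return a compact feature vector for the LR/SVM classifiers:
    [normalized lexicon score, #positive, #negative,
     #intensifiers, #diminishers, #negations]."""
    score = 0.0
    counts = {"pos": 0, "neg": 0, "int": 0, "dim": 0, "shift": 0}
    for i, tok in enumerate(tokens):
        if tok in INTENSIFIERS:
            counts["int"] += 1
        elif tok in DIMINISHERS:
            counts["dim"] += 1
        elif tok in NEGATIONS:
            counts["shift"] += 1
        elif tok in POSITIVE or tok in NEGATIVE:
            polarity = 1.0 if tok in POSITIVE else -1.0
            counts["pos" if polarity > 0 else "neg"] += 1
            weight = 1.0
            # Look back over a small window for modifiers and shifters.
            for prev in tokens[max(0, i - NEGATION_WINDOW):i]:
                if prev in INTENSIFIERS:
                    weight *= INTENSIFIER_W
                elif prev in DIMINISHERS:
                    weight *= DIMINISHER_W
                elif prev in NEGATIONS:
                    polarity = -polarity
            score += polarity * weight
    normalized = score / max(1, len(tokens))
    return [normalized, counts["pos"], counts["neg"],
            counts["int"], counts["dim"], counts["shift"]]

features = extract_features("not very good".split())
# "good" is intensified by "very" and flipped by "not" within the window,
# yielding a negative normalized score.
```

Feeding such vectors to scikit-learn's LogisticRegression or a linear-kernel SVC then reproduces the hybrid setup: the linguistic rules shape the input space, while the classifier learns the decision boundary.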
Classification metrics (Figure 5) reinforce these observations: SVM maintains high precision for positive and neutral classes and slightly higher F1 scores for polarized sentiment compared to LR (Positive: F1 = 0.862, Negative: F1 = 0.877, Neutral: F1 = 0.684). This demonstrates that combining rule-based features with SVM improves detection of nuanced sentiment in Macedonian text.

4.3 Discussion
The evaluation demonstrates that our hybrid framework substantially improves over the purely rule-based approach. The baseline system reached a mean F1 score of 0.736 across folds, while Logistic Regression and SVM achieved 0.864 and 0.867, respectively. Paired t-tests confirmed that these improvements are statistically significant (p < 0.001). The Wilcoxon signed-rank test yielded p = 0.0625, slightly above the conventional threshold, likely due to the limited number of folds, but the performance trend remained consistent.

Most errors stem from the neutral class, where sentiment is often ambiguous or context-dependent, while positive and negative classes are reliably distinguished. This shows that leveraging lexicon-based features within machine learning models captures polarity effectively and generalizes well across folds. Overall, the results highlight the strength of hybrid models in combining the interpretability of rule-based systems with the adaptability of statistical learning. Future work should address the challenge of neutral sentiment and investigate richer contextual or semantic features.

5 Conclusion and Future Work
We presented a hybrid sentiment analysis framework for Macedonian, combining rule-based lexical features with Logistic Regression and Support Vector Machines. The hybrid models substantially outperformed the purely rule-based system, which achieved a mean F1 score of 73.6%. Both classifiers improved classification performance, particularly for polarized sentiment, while maintaining interpretability and robustness by relying exclusively on lexicon-derived features.

Our results demonstrate that integrating linguistic knowledge with statistical learning is effective for under-resourced languages like Macedonian, where annotated datasets are scarce. The rule-based component captures explicit, context-modified cues, while ML models generalize well across folds.

Future work includes:
• Incorporating syntactic and semantic embeddings to better capture context and subtle neutral sentiment.
• Experimenting with attention-based or transformer models for long-range dependencies.
• Expanding annotated datasets across social media, reviews, and user-generated content.
• Investigating domain adaptation to generalize across different text types.
• Integrating additional linguistic cues such as POS tags or dependency relations.
• Exploring multilingual transformers (e.g., mBERT, XLM-R) fine-tuned on Macedonian [2, 1].
• Using large language models to generate synthetic Macedonian training data [19, 14, 5].

This work provides a strong foundation for Macedonian sentiment analysis, highlighting the value of hybrid approaches and paving the way for richer linguistic feature integration and advanced modeling.

References
[1] Alexis Conneau et al. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online (July 2020), 8440–8451. doi: 10.18653/v1/2020.acl-main.747.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Minneapolis, Minnesota (June 2019), 4171–4186. doi: 10.18653/v1/N19-1423.
[3] Darja Fišer and Tomaž Erjavec. 2016. Analysis of sentiment labeling of Slovene user-generated content. Znanstvena založba Filozofske fakultete, 22–25. http://nl.ijs.si/janes/wp-content/uploads/2016/09/CMC-2016_Fiser_Erjavec_Analysis-of-Sentiment-Labeling.pdf.
[4] Andrej Gajduk and Ljupco Kocarev. 2014. Opinion mining of text documents written in Macedonian language. arXiv preprint arXiv:1411.4472. https://arxiv.org/abs/1411.4472.
[5] Nils Constantin Hellwig, Jakob Fehle, and Christian Wolff. 2024. Exploring large language models for the generation of synthetic training samples for aspect-based sentiment analysis in low resource settings. Expert Systems with Applications 261 (Oct. 2024), 125514. doi: 10.1016/j.eswa.2024.125514.
[6] Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04). Association for Computing Machinery, Seattle, WA, USA, 168–177. doi: 10.1145/1014052.1014073.
[7] Nikola Ivačič, Andraž Pelicon, Boshko Koloski, Senja Pollak, and Matthew Purver. 2024. News sentiment analysis datasets for Serbian, Bosnian, Macedonian, Albanian and Estonian (SADEmma 1.0). CLARIN.SI repository. Version 1.0. http://hdl.handle.net/11356/1987.
[8] Danka Jokić, Ranka Stanković, and Branislava Šandrih Todorović. 2024. Abusive speech detection in Serbian using machine learning. In Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security. Lancaster, UK (July 2024), 153–163. https://aclanthology.org/2024.nlpaics-1.18/.
[9] Dame Jovanoski, Veno Pachovski, and Preslav Nakov. 2015. Sentiment analysis in Twitter for Macedonian. In Proceedings of the International Conference Recent Advances in Natural Language Processing. INCOMA Ltd., Hissar, Bulgaria (Sept. 2015), 249–257. https://aclanthology.org/R15-1034/.
[10] Sofija Kochovska, Branko Kavšek, and Jernej Vičič. 2025. Rule-based sentiment analysis of Macedonian. In Proceedings of ITAT 2025: Information Technologies – Applications and Theory (CEUR Workshop Proceedings). Telgárt, Slovakia.
[11] Adela Ljajić, Ulfeta Marovac, and Aldina Avdic. 2017. Sentiment analysis of Twitter for the Serbian language. (Mar. 2017).
[12] Walaa Medhat, Ahmed Hassan, and Hoda Korashy. 2014. Sentiment analysis algorithms and applications: a survey. Ain Shams Engineering Journal 5, 4, 1093–1113. doi: 10.1016/j.asej.2014.04.011.
[13] Igor Mozetic, Miha Grcar, and Jasmina Smailovic. 2016. Multilingual Twitter sentiment classification: the role of human annotators. PLoS ONE 11 (Feb. 2016). doi: 10.1371/journal.pone.0155036.
[14] Koena Ronny Mabokela, Mpho Primus, and Turgay Celik. 2025. Advancing sentiment analysis for low-resourced African languages using pre-trained language models. PLOS ONE 20, 6 (June 2025), 1–37. doi: 10.1371/journal.pone.0325102.
[15] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA (Oct. 2013), 1631–1642. https://aclanthology.org/D13-1170/.
[16] Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics 37, 2 (June 2011), 267–307. doi: 10.1162/COLI_a_00049.
[17] Vasilija Uzunova and Andrea Kulakov. 2015. Sentiment analysis of movie reviews written in Macedonian language. In ICT Innovations 2014. Advances in Intelligent Systems and Computing, Vol. 311. Springer, Cham, 279–288. doi: 10.1007/978-3-319-09879-1_28.
[18] Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep learning for sentiment analysis: a survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (Jan. 2018). doi: 10.1002/widm.1253.
[19] Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, and Lidong Bing. 2023. Sentiment analysis in the era of large language models: a reality check. arXiv: 2305.15005 [cs.CL]. https://arxiv.org/abs/2305.15005.
Building an AI-Ready Data Infrastructure Towards a SDG-focused Observatory for the Brazilian Amazon

Joao Pita Costa† (IRCAI, Jozef Stefan Institute, Ljubljana, Slovenia; joao.pitacosta@ircai.org)
Mirozlav Polzer (GloCha, Climate Chain Coalition, Klagenfurt, Austria; polzer@glocha.info)
Leonardo Barrionuevo (MetAmazonia, AMAGroup, Curitiba, Brazil; leonardo@amagroup.com.br)
Joao Paulo Veiga (CIAAM, University of São Paulo, São Paulo, Brazil; candia@usp.br)

Abstract / Povzetek

As artificial intelligence technologies rapidly evolve, regulatory sandbox initiatives have emerged as crucial tools for promoting responsible AI development, enabling innovation while safeguarding fundamental rights and public interests. This paper analyzes the development and implications of Brazil's first AI regulatory sandbox, with a particular focus on the model established by SUSEP (Superintendence of Private Insurance). Designed as a controlled environment for testing innovative AI-related products and services in the insurance sector, the SUSEP sandbox illustrates how regulatory flexibility can foster technological advancement, financial inclusion, and market efficiency while maintaining consumer protection and risk oversight. Being developed under Brazil's Economic Freedom Law, the sandbox has evolved through three editions (2020, 2021, and 2024), prioritizing both sustainable and technological projects. This study explores the sandbox's structure, eligibility criteria, business plan requirements, operational limitations, and transition mechanisms for companies seeking permanent licensure. It also identifies actionable insights for future regulatory frameworks, particularly for the National Data Protection Authority (ANPD) as Brazil advances toward AI-specific governance. By comparing the sandbox's legal foundations, selection processes, and risk mitigation protocols with international best practices, this paper underscores the sandbox's role as a blueprint for responsible AI regulation in emerging markets.

Keywords / Ključne besede

Sustainable Development Goals (SDGs), AI-ready data infrastructure, FAIR data principles, Open data, Semantic interoperability, Brazilian Amazon, COP30

1 Introduction

The United Nations' 2030 Agenda for Sustainable Development outlines 17 SDGs aimed at addressing the world's most pressing social, economic, and environmental challenges. Achieving these goals requires not only coordinated policy action and resource mobilization but also robust AI-enabled data systems capable of tracking progress, identifying gaps, and informing regulatory interventions. However, current efforts to monitor and evaluate the SDGs are often hampered by fragmented, inaccessible, or outdated data that are not designed with advanced analytics or AI applications in mind [1]. As the volume and variety of sustainability-related data continue to grow (ranging from satellite imagery and sensor networks to administrative records and citizen-generated content), there is a critical need to rethink the way data infrastructures are designed. Despite advancements, the broader ecosystem of SDG data remains siloed, with significant disparities in data availability, quality, and usability across countries and sectors.

National statistical offices often lack the infrastructure or capacity to generate real-time, high-resolution data, while non-governmental data sources remain underutilized due to interoperability issues or lack of trust. As a result, policymakers and researchers face substantial barriers when attempting to harness AI for sustainable development monitoring. There is growing recognition that SDG data must be AI-ready: structured, interoperable, machine-readable, and enriched with metadata that allows for automated processing and semantic understanding [2]. AI-ready data infrastructures enable the use of artificial intelligence and machine learning tools for trend detection, predictive modeling, and evidence-based policymaking, accelerating the global effort toward sustainable development. Several initiatives have emerged to bridge the gap between data collection and actionable insights.

In this context, the IRCAI SDG Observatory, an open-access data infrastructure developed by the International Research Centre on Artificial Intelligence under the auspices of UNESCO (IRCAI), aggregates and organizes datasets related to SDG indicators, news, policies, educational resources and innovation ecosystems, facilitating their use in AI applications through adherence to open data standards, consistent metadata schemas, and semantic alignment with the SDG framework. It represents a step toward a scalable, reusable AI-ready data architecture that can support both global and local decision-making.

The main contribution of this paper is a conceptual and practical framework for AI-ready SDG data infrastructure, building on the design principles and implementation strategies demonstrated by the IRCAI SDG Observatory, as well as by the preceding NAIADES Water Observatory [3] focusing on AI and Water Sustainability, and the recently deployed UNESCO Landslides Observatory discussed in section 4, both in the intersection of SDG 13 (Climate Action) with SDG 6 (Water Sustainability) and SDG 11 (Resilient Cities and Communities).

†Corresponding author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
http://doi.org/10.70314/is.2025.sikdd.17

We follow the discussion in [4] and propose an AI-ready and AI-enabled data and metadata infrastructure that can be leveraged for research purposes in what relates AI and Sustainable Development. Through this lens, we argue for a paradigm shift, demonstrating an Amazon-focused SDG data ecosystem built on this new paradigm: moving from static, indicator-focused reporting systems to a dynamic, AI-compatible engine that supports (i) education and training for sustainability; (ii) disinformation monitoring practices in the sustainability discourse (see figure 1); and (iii) data-driven decision-making and global collaboration.

Figure 1: The SDG distribution of the ingested scientific article abstracts and their Amazon-related main concepts

2 Data and Metadata Architecture

Designing an AI-ready SDG data infrastructure requires more than simply aggregating datasets: it demands a structured, extensible architecture that enables machine interpretability, semantic consistency, and interoperability across domains. The IRCAI SDG Observatory proposes in [5] a data structure incorporating both heterogeneous data and complex preprocessed metadata layers to support automated reasoning, text mining applications, and dynamic sustainability analysis.

At the core of the infrastructure lies the data layer, which consists of curated datasets aligned with specific SDG indicators. These datasets are collected from a variety of sources, including international organizations, national statistics offices, worldwide news engines, open government data portals, and research institutions. To ensure consistency and usability, raw datasets undergo a 3-step transformation process:
● Harmonization: Raw data is converted into standardized formats (e.g., CSV, JSON, RDF) using predefined schemas (as the official SDG indicator framework defined by the UN Statistics Division [6]).
● Normalization: Variables such as geographic units, time periods, and measurement scales are normalized to ensure comparability across countries and regions.
● Validation: Data quality checks are implemented to flag missing values, outliers, or inconsistent units, helping maintain reliability and analytical integrity.

IRCAI is engaging domain experts for the different SDGs to explore the most relevant KPIs to monitor, the search terms in the ontology (discussed in the next section) and the outcomes from the analysis. The resulting datasets are thus not only clean and standardized (considering limitations of the data sources, including different types of bias analysed and exposed) but also structured in elasticSearch indices to support downstream AI applications acting over powerful Lucene queries through the native API.

Surrounding the data layer is a robust metadata architecture that enables discoverability, semantic enrichment, and AI-readiness. The metadata design is informed by the FAIR data principles (Findable, Accessible, Interoperable, and Reusable) and includes the following key components: (i) Descriptive Metadata, including descriptive elements such as title, description, source organization, temporal coverage, geographic coverage, and associated SDG goals, enabling human and machine agents to easily understand the scope and purpose of each data index; (ii) Structural Metadata, specifying the internal structure of the dataset, such as data types, column definitions, units of measurement, and relationships between variables, facilitating data parsing and automatic preprocessing by text mining tools; (iii) Source Metadata, capturing information about the dataset's origin, transformation steps, update frequency, and quality assurance processes, ensuring transparency, reproducibility, and trustworthiness; and (iv) Semantic Metadata, leveraging ontologies and controlled vocabularies to provide machine-readable semantics, linking dataset elements to established knowledge graphs, enabling reasoning across data indices and automated alignment of conceptually related information (see figure 2).

Figure 2: Visualisation of the SDG distribution of the ingested OECD AI policies according to the SDG ontology built on Wikidata terms defined with SDG topic experts

To ensure accessibility and integration with external systems, the infrastructure exposes datasets and metadata through native RESTful APIs, allowing developers and researchers to query and retrieve relevant data programmatically, enabling use in dashboards, modeling pipelines, and decision-support systems. Furthermore, adherence to open data standards such as DCAT (Data Catalog Vocabulary) and JSON-LD (Linked Data) ensures that the infrastructure can interface with other open government data platforms, research data repositories, and semantic web services. The architecture is designed with scalability and modularity in mind, allowing new datasets to be integrated with minimal manual intervention. Through automated ingestion pipelines and schema mapping tools, the infrastructure can accommodate additional data sources while preserving metadata integrity and interoperability.
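The descriptive and semantic metadata components above, combined with the DCAT/JSON-LD serialization the infrastructure adheres to, can be sketched as a minimal dataset record. This is an illustration only: the helper function, the field values, and the chosen subset of DCAT terms are invented for the example and are not actual Observatory records.

```python
import json

def make_dataset_record(title, description, keywords, sdg_qid):
    """Build a minimal DCAT-style dataset description in JSON-LD.

    Property names come from the DCAT and Dublin Core vocabularies;
    the values and the Wikidata theme link are illustrative only."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        # (i) descriptive metadata
        "dct:title": title,
        "dct:description": description,
        "dcat:keyword": keywords,
        # (iv) semantic metadata: link the record to a Wikidata entity
        "dcat:theme": f"http://www.wikidata.org/entity/{sdg_qid}",
    }

record = make_dataset_record(
    title="Amazon biodiversity news stream",
    description="Multilingual news items related to biodiversity in Amazonia",
    keywords=["SDG 15", "biodiversity", "Amazon"],
    sdg_qid="Q23442",  # one of the example Q-IDs listed with the ontology
)
print(json.dumps(record, indent=2))
```

Because the record is plain JSON-LD, the same dictionary can be published through a catalog endpoint or embedded in a web page without further transformation.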
Governance mechanisms, including data quality audits and contributor guidelines, ensure the sustainability and reliability of the system over time.

To support the in-depth analysis and leverage the availability of multilingual text resources at Wikidata, we have developed an SDG ontology inspired by [7], based on terms that correspond to Wikipedia pages. Currently published in a CSV format on GitHub [8], it defines rows corresponding to SDG entities (such as goals) and maps them to Wikidata Q-IDs. Key columns include: Level (e.g., SDG Goal), Code (e.g., "1", "1.2", "1.2.1"), Wikidata Q-Identifier (e.g., Q23442, Q3048436, Q28146087), label (human-readable name), Description (concise textual summary), and related concepts (optional Q-IDs linking to domains like health, energy, gender equality). Each SDG Goal row includes its code and corresponding Wikidata ID. Targets (e.g. 1.2) are mapped to both their own Wikidata entity and an explicit parent Goal. Indicators (e.g. 13.2.1) reference the relevant Target and define unit, measurement scale, and description. Using the CSV mappings, the ontology is constructed so that:
● sdg:hasTarget links a goal entity to its targets
● sdg:hasIndicator links targets to indicators
● sdg:measuredIn aligns indicator measures to Wikidata units

Additional cross-concept links (sdg:relatedTo) connect indicators to external Q-IDs in domains such as "maternal health" or "clean water". During dataset ingestion, each column bearing an indicator code is annotated using the corresponding Wikidata Q-ID from the ontology, enabling dataset cataloging via sdg:indicator URIs, semantic filtering and query based on concept-level tagging, as well as automatic generation of metadata triples (e.g. linking dataset to indicators and units).

Table 1: Data ingested into the Amazon Observatory from worldwide news (indicating the language coverage), published AI-related scientific articles, and related legal and regulatory landscape

Concepts | 2024 News (Lang. Coverage) | Science | Policy
Biodiversity | 18083 (100) | 44693 | 3628
Indigenous peoples | 8070 (96) | 2014 | 107
Public Health | 26454 (69) | 42355 | 697
Amazon rainforest | 3936 (87) | 172 | 115
Bioeconomy | 156 (16) | 33 | 31
Carbon Credits | 236 (26) | 2127 | 133

Figure 3: Evolution in time of the relation between research concepts related to the Brazilian Amazon Rainforest

3 The Amazon Observatory and Other Pilots

The prominence of domains such as digital data processing and machine learning illustrates AI's multidimensional capacity to address complex challenges in resource allocation, public health systems, and environmental sustainability. Comparative analysis between global discourses and those specifically oriented toward the Brazilian Amazon (driven by the expertise and coordinated efforts of the MetAmazonia initiative) reveals a pronounced emphasis on environmental preservation, biodiversity monitoring, and climate resilience in the latter. This divergence indicates that AI's contributions to sustainable development are not uniform but instead conditioned by region-specific priorities, ecological constraints, and socio-technical contexts. These findings underscore the necessity of developing adaptive, context-aware AI frameworks capable of aligning with the heterogeneous demands of both urban and rural environments.

The Amazon Observatory delivers outcomes such as the MetAmazonia chatbot, a multidimensional open data platform, and accessible resources for students and researchers to advance knowledge and innovation in the region. The system will be the basis for the planned MetAmazonia Chatbot, leveraging these datasets within the broader SDG AI-agent development, aligned with open education principles and UNESCO collaboration. It aims to make knowledge resources directly useful for learners and professionals engaged with Amazonia and their communities. Table 1 shows the data feeding the system across a diversity of topics from news, science and policies, exposing concerns of the public opinion, the knowledge we hold on priority topics, and part of the regulatory landscape.

To illustrate the potential of such an approach, five initial modules have been developed and are being made available for COP30 activities in Belem, at the heart of Amazonia: (i) the News Stream with Sentiment provides multilingual coverage of Amazonia-related news, complemented by word clouds of main concepts and sentiment analysis visualized through maps and gauges; (ii) the Data Exploration Dashboard integrates multiple datasets, displaying global research trends, SDG policy coverage, and innovation activity; (iii) the relation between the concepts (edges) relevant to the Amazonia research and the interconnections between these concepts, being stronger or weaker according to the amount of published articles where these are topics in common (see visualization in figure 3 and data characterization in table 1); (iv) in the Education view, the system visualizes open educational resources by mapping Amazonia-related topics to SDGs, highlighting key domains and their relevance to specific goals such as SDGs 11, 13, and 15; and (v) in regards to innovation ecosystems, we depict the different initiatives that relate to priority topics in the Brazilian Amazon context and could help establish international collaboration to address specific problems with local/global data.

Building on these modules, the SDG-oriented data infrastructure establishes a robust foundation for the development of an AI Agent specialized in Amazonia-related topics. By combining multilingual news streams, interconnected research concepts, and contextualized mappings of innovation and education, the system provides the necessary knowledge base and semantic structure to enable advanced reasoning, retrieval, and decision-support capabilities. Such an AI Agent will not only facilitate rapid access to diverse data sources but also support policymakers, researchers, and local communities by offering synthesized insights aligned with the SDGs. In doing so, it hopes to bridge global sustainability agendas with regional challenges, ensuring that context-specific solutions for the Amazon are informed by evidence, enriched by international collaboration, and continuously updated through the integration of real-time data.
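The CSV-to-triples step behind the SDG ontology of section 2 can be sketched in a few lines. The column names and Q-ID values below are invented for the illustration (the actual CSV on GitHub may use different headers), but the parent-lookup logic follows the hierarchical SDG codes described there (1 → 1.2 → 1.2.1).

```python
import csv, io

# Toy rows in the spirit of the ontology CSV: Level, Code, QID, Label.
# All Q-IDs here are made up for the sketch.
CSV_TEXT = """Level,Code,QID,Label
Goal,13,Q9000013,Climate action
Target,13.2,Q9000132,Integrate climate change measures into policies
Indicator,13.2.1,Q9001321,Countries with nationally determined contributions
"""

def to_triples(csv_text):
    """Emit (subject, predicate, object) triples linking goals to targets
    (sdg:hasTarget) and targets to indicators (sdg:hasIndicator), using the
    hierarchical SDG codes to find each row's parent entity."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    qid_by_code = {row["Code"]: row["QID"] for row in rows}
    triples = []
    for row in rows:
        parent_code = row["Code"].rsplit(".", 1)[0]   # "13.2.1" -> "13.2"
        if row["Level"] == "Target":
            triples.append((qid_by_code[parent_code], "sdg:hasTarget", row["QID"]))
        elif row["Level"] == "Indicator":
            triples.append((qid_by_code[parent_code], "sdg:hasIndicator", row["QID"]))
    return triples

for triple in to_triples(CSV_TEXT):
    print(triple)
```

The same pass can be extended with sdg:measuredIn and sdg:relatedTo edges once the corresponding columns are parsed.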
4 Conclusions and Further Work

As the global community continues to pursue the 2030 Agenda, the importance of robust, interoperable, and machine-actionable SDG data infrastructure has never been greater. This paper has explored the architecture and implementation of an AI-ready data infrastructure for the SDGs, using the IRCAI SDG Observatory and its derived pilots as case studies. Central to this infrastructure is a well-defined metadata schema, semantic alignment with Wikidata entities, and adherence to FAIR data principles, all designed to support automation, reasoning, and integration of data across domains and geographies. By embedding SDG indicators, targets, and goals into a linked-data framework, the system transforms static reporting datasets into dynamic, queryable resources. This enables a wide range of AI applications, from natural language querying to knowledge graph reasoning and real-time decision support. The SDG Ontology, based on mappings to Wikidata Q-IDs, serves as a semantic backbone, enabling interoperability with external datasets and ontologies while enhancing transparency and reusability.

Despite these advancements, several challenges remain. Data fragmentation across jurisdictions, lack of standardization in national reporting, and uneven metadata quality continue to hinder full automation and scalability. Furthermore, ethical considerations around data use, particularly in the context of AI-based decision-making, require further exploration.

To improve the Amazon Observatory, future development of AI-ready data infrastructure will focus on several key areas: (i) Automated Ontology Expansion. Leveraging large language models and knowledge extraction tools to automate the discovery and integration of new SDG-related concepts from policy documents, scientific literature, and real-time news streams; (ii) Interoperability with National Platforms. Building tools that support seamless integration of local statistical data with global and local SDG indicators (e.g., focusing on Amazonia), using schema mapping and automated alignment with the SDG ontology; (iii) Real-Time Data Ingestion and Streaming Analytics. Incorporating real-time data sources, such as remote sensing, sensor networks, and social media, to enable early-warning systems and near-instant progress monitoring; (iv) AI-Powered Decision Support Tools. Developing interfaces and tools that allow policy-makers to simulate interventions, explore causal relationships, and evaluate trade-offs between SDG targets using AI models trained on the structured data; (v) Community Governance and Open Collaboration. Establishing open, participatory governance models for ontology evolution, dataset curation, and quality assurance to ensure that the infrastructure remains globally relevant and inclusive.

In conclusion, AI-ready SDG infrastructure represents a transformative opportunity for evidence-based policy, global collaboration, and data-driven action on sustainability. By continuing to invest in semantic technologies, metadata standards, and open data ecosystems, we can enable a new generation of intelligent tools that accelerate progress toward the SDGs both globally and locally.

Acknowledgments / Zahvala

We thank the support of the European Commission projects ELIAS (GA101120237) and RAIDO (GA101135800).

References / Literatura

[1] Bachmann, N., Tripathi, S., Brunner, M. and Jodlbauer, H. (2022). The contribution of data-driven technologies in achieving the sustainable development goals. Sustainability, 14(5), p.2497.
[2] Stahl, B.C., Schroeder, D. and Rodrigues, R. (2022). AI for Good and the SDGs. In Ethics of artificial intelligence: Case studies and options for addressing ethical challenges (pp. 95-106). Cham: Springer International Publishing.
[3] Pita Costa, J. (2023). Water Intelligence to Support Decision Making, Operation Management and Water Education: NAIADES Report. IRCAI.
[4] Pita Costa, J., Barrionuevo, L., Kovič Dine, M. (2025). Observing the Impact of AI in the Progress of Sustainable Development Goal 11. In Proceedings of the 23rd IADIS International Conference e-Society 2025.
[5] Jermol, M., Pita Costa, J. and Kovačič, M. (2025). Onwards to an Ethical and Bias Aware Education for Sustainability through AI. Journal of Artificial Intelligence for Sustainable Development (to appear).
[6] Sustainable Development Solutions Network (2015). Indicators and a Monitoring Framework for the SDGs. United Nations.
[7] Joshi, A., Gonzalez Morales, L., Klarman, S., Stellato, A., Helton, A., & Lovell, S. (2019). A Knowledge Organization System for the United Nations Sustainable Development Goals. In Proceedings of the 2019 International Conference on Knowledge Engineering and Knowledge Management (EKAW). Springer.
[8] Pita Costa, J. (2025). IRCAI SDG Ontology. GitHub. Available at https://github.com/IRCAI-SDGobservatory/data
Towards a Format for Describing Networks

Vladimir Batagelj (IMFM, Ljubljana, Slovenia; UP, IAM and FAMNIT, Koper, Slovenia; UL, FMF, Ljubljana, Slovenia; vladimir.batagelj@fmf.uni-lj.si)
Tomaž Pisanski (UP, FAMNIT, Koper, Slovenia; IMFM, Ljubljana, Slovenia; tomaz.pisanski@upr.si)
Iztok Savnik (UP, FAMNIT, Koper, Slovenia; iztok.savnik@upr.si)
Ana Slavec (UP, FAMNIT, Koper, Slovenia; InnoRenew CoE, Koper, Slovenia; ana.slavec@famnit.upr.si)
Nino Bašić (UP, FAMNIT and UP, IAM, Koper, Slovenia; IMFM, Ljubljana, Slovenia; nino.basic@famnit.upr.si)

Abstract

The article provides an overview of the most important network analysis resources and the various types of networks encountered in their use. Based on experience in developing the NetsJSON format, we present components that an exchange/archive format for describing networks should contain.

Keywords

Network analysis, Network types, Identification, Format, Exchange, Archive, Data repository, Factorization, JSON, FAIR.

1 Introduction

Open data plays a crucial role in ensuring the computational reproducibility and verifiability of published results. The obtained results can be verified or supplemented with other methods. Collections of similar and well-documented datasets are also crucial for developing new methods to analyze specific types of data. It is good to test a new method on several datasets and check whether it gives meaningful/expected results. When preparing such data, it is essential to adhere to the FAIR principles – Findability, Accessibility, Interoperability, and Reusability. To facilitate ease of use, data should ideally be stored in a text format that preserves the structure of the data and includes relevant metadata. Datasets are alive. Their connection to open repositories is important for their accessibility and maintenance.

In 2023, the International Network for Social Network Analysis (INSNA) requested that Zachary Neal form a working group to develop recommendations for sharing network data and materials. They were published in Network Science in 2024 [21], accompanied by the Endorsement page [20].

Network analysis is an area where data is often stored in diverse formats. It would be highly beneficial to adopt a common "archiving/exchange" format that can describe (almost) all networks and support authoring, deposit, exchange, visualization, reuse, and preservation of network data. Such a format would allow us to obtain the specific descriptions required by various network analysis programs using relatively simple scripts.

We have many years of experience in developing formats for describing graphs and networks [11, 10, 4]. We will present the NetsJSON format currently used to describe networks with structured data, and some ideas for improving it. This could be a starting point for the development of a common format for exchanging and archiving networks.

2 Support for network analysis

The concept of a network is an extension of the concept of a graph. A graph describes the structure of a network. Network analysis is a branch of data analysis that draws heavily on the concepts and results of graph theory. The difference between the two is that networks are usually "irregular", while most problems and results of graph theory assume some "regularity".

There are many tools and programs for network analysis. For example, UCINET, Pajek, Gephi, NetMiner, Cytoscape, NodeXL, E-Net, Tulip, PUCK, GraphViz, SocNetV, Kumu, Polinode, etc. Programmers can use network analysis packages/libraries in a variety of programming languages (Python, R, Julia, C++, etc.). They support various network description formats: CSV, UCINET DL, Pajek NET, Gephi GEXF, GDF, GML, GraphML, GraphX, GraphViz DOT, Tulip TPL, Netdraw VNA, Spreadsheet, etc. [13, 25, 16]. In addition, network data appears in several application areas such as chemistry and genealogy. There are many formats for describing these data.

Network datasets are available in multiple repositories. Some repositories only provide metadata about an individual network and a link to the actual dataset. At the same time, others also store the data and offer users a display of basic network characteristics. None of them explicitly adheres to FAIR data principles.
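The "relatively simple scripts" mentioned in the introduction can be illustrated with a minimal converter from a CSV edge list to Pajek's NET format. The sketch assumes an undirected, weighted edge list with a source,target,weight header; a real converter would also have to handle directed links, isolated nodes, and node properties.

```python
import csv, io

def csv_to_pajek(csv_text):
    """Convert a CSV edge list (source,target,weight) into Pajek NET format:
    a *Vertices section with 1-based indices, followed by an *Edges section."""
    edges = list(csv.DictReader(io.StringIO(csv_text)))
    names = sorted({e["source"] for e in edges} | {e["target"] for e in edges})
    index = {name: i + 1 for i, name in enumerate(names)}  # Pajek is 1-based
    lines = [f"*Vertices {len(names)}"]
    lines += [f'{i} "{name}"' for name, i in index.items()]
    lines.append("*Edges")
    lines += [f'{index[e["source"]]} {index[e["target"]]} {e["weight"]}'
              for e in edges]
    return "\n".join(lines)

print(csv_to_pajek("source,target,weight\nana,bor,2\nbor,cvet,1\n"))
```

The same skeleton, with a different output section, would serve other line-oriented targets such as UCINET DL or GraphViz DOT.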
Interesting networks can also be found on general data repositories such as Kaggle. Networks can also be created programmatically from selected data. For example, from bibliographic data from the free OpenAlex service, we can create collections of bibliographic networks on a selected topic using the OpenAlex2Pajek library in R.

For detailed lists of network analysis resources with links to web pages, see GitHub/bavla/NetsJSON/Info [4].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.9

3 Graphs and networks

3.1 Unit identification

The fundamental task in transforming data about the selected topic into a structured dataset to be used in further analyses is the identification of units (entity recognition). Often, the source data are available as unstructured or semi-structured text. In this case, the transformation is a task of computer-assisted text analysis (CaTA). Terms considered in TA are collected in a dictionary (it can be fixed in advance, or built dynamically). The two main problems with terms are: equivalence – different words representing the same term, and ambiguity – same words representing different terms. Because of these, the transformation of raw text data into a formal description is often done manually or semiautomatically.

We assume that unit identification assigns a unique identifier (ID) to each unit. For some types of units, such IDs are standardized: ISO 3166-1 alpha-2 two-letter country codes, ISO 9362 Bank Identifier Codes (BIC), ORCID – Open Researcher and Contributor ID, ISSN – International Standard Serial Number, DOI – Digital Object Identifier, URI – Uniform Resource Identifier, etc. Often, in data displays, IDs are replaced by corresponding (short) labels/names.

Besides the semantic units or concepts related to the selected topic, we can also identify in the raw data syntactic units – parts of the text. As syntactic units of TA, we usually consider clauses, statements, paragraphs, news, messages, etc. In thematic TA, the units are coded as a rectangular matrix Syntactic units × Concepts, which can be considered as a two-mode network. In semantic TA, the units (often clauses) are encoded according to the S-V-O (Subject-Verb-Object) model or its improvements. This coding can be directly considered as a network with Subjects ∪ Objects as nodes and links (arcs) labeled with Verbs. This is also a basis for the semantic web and knowledge networks.

3.2 Networks

A network is based on two sets – a set of nodes (vertices) that represent the selected units, and a set of links (lines) that represent ties between units. They determine a graph. Additional data about nodes or links can be known – their properties (attributes). For example: name/label, type, value, etc.

Network = Graph + Data

The data can be measured or computed. Formally, a network N = (V, L, P, W) consists of:
• a graph G = (V, L), where V is the set of nodes and L = E ∪ A is the set of links. A link e ∈ L is either directed – an arc e ∈ A, or undirected – an edge e ∈ E; quantities n = |V|, m = |L|
• P is a set of node value functions / properties: p : V → A
• W is a set of link value functions / weights: w : L → B

Sometimes, implicit additional information/data about values is provided in the specifications of properties: (a) how can we compute with values – algebraic structures, semigroup, monoid, group, semiring, etc., and (b) properties of values – in a molecular graph, an atom is assigned to each node; properties of relevant atoms are such additional data.

The terminology in the field of network analysis is not unified. Different application areas use other terms. For example: node – vertex, actor, unit; link – line, tie, edge, connection; etc.

3.3 Types of networks

Besides ordinary (directed, undirected, mixed) networks, some special types of networks are also used:
• 2-mode networks, bipartite (valued) graphs – networks between two disjoint sets of nodes.
• multi-relational networks.
• linked networks and collections of networks.
• multilevel networks.
• temporal networks, dynamic graphs – networks changing over time.
• specialized networks: representation of genealogies as p-graphs, molecular graphs, the graphs coding Petri's nets, etc.

Network (input) file formats should provide the means of expressing all of these types of networks. All interesting data should be recorded (respecting privacy).

In a two-mode network N = ((U, V), L, P, W), the set of nodes consists of two disjoint sets of nodes U and V, and all the links from L have one end node in U and the other in V.

A multi-relational network N = (V, (L1, L2, . . . , Lk), P, W) contains different relations Li (sets of links) over the same set of nodes. Also, the weights from W are defined on different relations or their union.

In a linked or multimodal network N = ((V1, V2, . . . , Vj), (L1, L2, . . . , Lk), P, W) the set of nodes V is partitioned into subsets (modes) Vi, Ls ⊆ Vp × Vq, and properties and weights are usually partial functions.

A set of networks {N1, N2, . . . , Nk} in which each network shares a (sub)set of nodes with some other network is called a collection of networks. A linked network can be transformed into a collection of networks and vice versa.

Bibliographical information is usually represented as a collection of bibliographical networks {Cite, WA, WK, WC, WI, . . .} (W – works, A – authors, K – keywords, C – countries, I – institutions) [7].

Another example of multimodal multirelational networks are knowledge graphs. They can have a very diverse structure (a large number of types of units (modes) and predicates (relations)), which allows for a fairly accurate description of facts from a selected field and solving problems about it. Network analysis methods are particularly useful in analyzing one or a few relational (sub)networks of a knowledge graph.
Network In a , the presence/activity of a node/link can temporal network = Graph + Data change through time . The basic division of temporal network T The data can be measured or computed. Formally, a , , , consists of: N = (V L P W) network descriptions is into cross-sectional and longitudinal. A cross- sectional description usually consists of a sequence of time slices • a graph G = (V, L), where V is the set of nodes and – ordinary networks that describe the state at a selected moment L = E ∪ A is the set of links. A link 𝑒 ∈ L is either or time interval. A longitudinal description is based on temporal directed – an arc 𝑒 , or undirected – an edge 𝑒 , [12, 9] or on a sequence of events. ∈ A ∈ E quantities 𝑛 , 𝑚 = |V | = |L | • P is a set of node value functions / properties: 𝑝 : V → 𝐴 4 Description of traditional networks • W is a set of link value functions / weights: 𝑤 : L → 𝐵 How to describe a network , , , ? In principle the N = (V L P W) Sometimes, implicit additional information/data about values answer is simple – we list its components , , , and . V L P W is provided in the specifications of properties: (a) how can we The simplest way is to describe a network by providing N compute with values – algebraic structures, semigroup, monoid, , and , in a form of two tables. Both tables are (V P) (L W) group, semiring, etc., and (b) properties of values – in a molecular often maintained in some spreadsheet program. They can be graph, an atom is assigned to each node; properties of relevant exported as text in CSV (Comma Separated Values) format. In atoms are such additional data. large networks, we split a network into some subnetworks – a The terminology in the field of network analysis is not unified. collection, to avoid the empty cells. Different application areas use other terms. For example: node – To save space and improve computing efficiency, we often vertex, actor, unit; link – line, tie, edge, connection; etc. 
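To make the two-table description concrete, a toy network could be laid out as node and link tables and exported as CSV, with an integer coding table for a categorical column. This is an illustrative sketch only (the toy data, column layout, and the to_csv helper are ours, not a prescribed format), with column names echoing the NetsJSON node/link fields (id, lab, n1, n2, rel):

```python
import csv
import io

# Toy network: a nodes table (V, P) and a links table (L, W).
nodes = [
    {"id": 1, "lab": "Ana", "mode": "person"},
    {"id": 2, "lab": "Bor", "mode": "person"},
    {"id": 3, "lab": "IJS", "mode": "institution"},
]
links = [
    {"type": "arc", "n1": 1, "n2": 3, "rel": "worksAt", "w": 1.0},
    {"type": "arc", "n1": 2, "n2": 3, "rel": "worksAt", "w": 0.5},
    {"type": "edge", "n1": 1, "n2": 2, "rel": "knows", "w": 2.0},
]

def to_csv(rows):
    """Serialize one table of the description to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Integer coding of a categorical column: the coding table maps each
# distinct value to a 1-based index and should be kept with the data.
rels = [lk["rel"] for lk in links]
coding = {v: i for i, v in enumerate(sorted(set(rels)), start=1)}
coded_rels = [coding[r] for r in rels]

nodes_csv = to_csv(nodes)
links_csv = to_csv(links)
```

Note that without the coding table {"knows": 1, "worksAt": 2}, the coded column [2, 2, 1] is uninterpretable – which is exactly why omitting the coding table from a description is harmful.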
Towards a Format for Describing Networks

In R, replacing the values of a categorical variable with integers in this way is called a factorization. We enumerate all possible values of a given categorical variable (the coding table) and afterward replace each value with the corresponding index in the coding table. Since node labels/IDs can be considered a categorical variable, factorization is also usually applied to them. In data analysis, indices start with 1, but real computer scientists start counting from 0. Therefore, it is desirable to include information about the minimal index value in the description.

This approach is used in most programs dealing with large networks. Unfortunately, the coding table is often considered a kind of metadata and is omitted from the description.

In Pajek [15], a node property can be represented in an associated file as a vector (numbers, .vec), a partition (nominal, .clu), or a permutation (order, .per). All network files can be combined into a single project file (.paj). Metadata can be added as comments written on lines starting with %. An example of transforming CSV tables into Pajek files is available at GitHub/bavla/netsJSON/example/bib [4].

5 Nets and NetsJSON

We were satisfied with the "traditional" network description, as implemented in Pajek [15], until we became interested in networks with node/link properties that are not measured in standard scales (ratio, interval, ordinal, nominal), but have structured values (text, subset, interval, distribution, time series, temporal quantity, function, etc.). In topological graph theory, an embedding is described by assigning a rotation to each node [23]. For describing temporal networks, we initially extended the Pajek format and defined and used the Ianus format [12].

For a format supporting structured values, there were two obvious choices for its base – XML and JSON. Both are widely known and suitable as structured data formats. However, JSON can represent the same data as XML more concisely. We chose JSON and in 2015 started developing and using the NetJSON format and the Nets Python package to handle networks with structured-valued properties or weights [5, 4, 3]. On February 26, 2019, the format was renamed to NetsJSON because of the collision with http://netjson.org/rfc.html. NetsJSON has two versions: a basic and a general version. The current implementation of the Nets library supports only the basic version.

In addition to describing networks with structured values, NetsJSON is expected to offer the capabilities of (most) existing network description formats [13, 25] (archiving, conversion) and to provide input data for D3.js visualizations.

A network description in NetsJSON follows the JSON (JavaScript) syntax and consists of five main fields (netsJSON, info, nodes, links, data):

{ "netsJSON": "basic",
  "info": { "org":1, "nNodes":n, "nArcs":mA, "nEdges":mE,
    "simple":TF, "directed":TF, "multirel":TF, "mode":m,
    "network":fName, "title":title,
    "time": { "Tmin":tm, "Tmax":tM, "Tlabs": {labs} },
    "meta": [events], ...
  },
  "nodes": [
    { "id":nodeId, "lab":label, "x":x, "y":y, ... },
    ***
  ],
  "links": [
    { "type":arc/edge, "n1":nodeID1, "n2":nodeID2, "rel":r, ... },
    ***
  ],
  "data": { "data1":description1,
    ***
  }
}

where . . . are user-defined properties and *** is a sequence of such elements. The netsJSON field identifies the format, the info field contains metadata, the nodes field contains the table (V, P), and the links field contains the table (L, W).

In recent years, we also analyzed bike-sharing systems (a link weight is the daily distribution of the numbers of trips), bibliographies (yearly distributions of publications or citations), and multiway networks [8, 9, 1]. It turned out that it was necessary to add another main field, data, to the basic NetsJSON format, in which we provide additional data about the properties of values (translations of labels into selected languages, algebraic structure, etc. [6]).

An event description can contain the following fields: type, date, title, author, desc, url, cite, copyright, etc. It is intended to provide information about the "life" of the dataset – collection/creation, changes, releases, uses, publications, etc.

For describing temporal networks, a node element and a link element have an additional required property tq – a temporal quantity. For an example, see violenceU.json at GitHub/bavla/Graph/JSON, describing Franzosi's violence network.

The general NetsJSON format is also expected to support the description of network collections.

6 Elements of a common network format

Our experience with network analysis to date is summarized in the following recommendations on the elements of a common format for describing networks.

Combining data and its metadata into a single file is a robust approach to ensuring data integrity. A JSON-based format is particularly well suited for this purpose, as it fits well with the data structures of modern programming languages. JSON also supports Unicode.

We would also encourage the provision, as metadata, of information about the context of the network, additional knowledge about it, articles or notebooks on its analysis, comments of users, etc. Kaggle is a good example. An improved ICON repository or Network Repository (we disagree with their "citation request") could be the way to go. Existing metadata standards should be taken into account (Dublin Core, FAIR, Schema). Data has a "life". When selecting data, its age is often important. Metadata should include at least the collection/creation date and the last modification date.

By the FAIR principles, the format should support – Findability: globally unique and persistent identifiers, rich metadata; Accessibility: an open, free, and universally implementable standardized communication protocol; Interoperability: a formal, accessible, shared, and broadly applicable language for knowledge representation; Reusability: metadata that are richly described and associated with detailed provenance.

The format must support all types of networks (simple, 2-mode, linked, multi-relational, multi-level, temporal). The network can contain both arcs and edges, as well as parallel links. To describe some knowledge graphs, it would be necessary to allow links to act as the end nodes of other links [18].

As mentioned earlier, using factorization produces a more concise description of the network. In cases where the node names are not too long and are readable, we sometimes want to avoid factorization. This can be achieved by using a switch that indicates whether factorization is used. We can also shorten the description by introducing default values of selected properties. If we also allow counting from 0, it makes sense to add information about the smallest index.

Long labels cause problems when printing/visualizing (parts of) networks and results. Therefore, it is useful to have abbreviated versions of labels available. For language-based labels, it is sometimes useful to offer additional versions in selected other languages, which increases the accessibility of the data and the understandability of the results.

Most of the network datasets produced by network science have no node labels. Node labels are not needed if you study distributions, but they are essential in the interpretation of the obtained "important substructures". We would encourage providing node labels, or at least some typological information, in cases where privacy issues arise.

The common format should support descriptions of networks specific to specialized fields of application, such as molecular graphs, genealogies (p-graphs) [29], and topological graph embeddings [23, 24], among others. The format must be extensible. In addition to the agreed-upon fields, users can add their own, allowing for a comprehensive description of their data.

It is also interesting to ask whether and, if so, how to include descriptions of its displays in the network description. Perhaps it would be worth relying on Vega-Lite [26, 28] and D3.js [14]. Some ideas can also be taken from the section on "defining visualization parameters in the input file" in the Pajek 5.3 manual [19, p. 89].

Although we are committed to a single-file approach, there may be times when external files are needed (for example, images to display nodes). Consideration should be given to how to support this option. Given the basic purpose of the common format, standard tools (ZIP) can be used to compress large networks.

We have not yet started working on the general format. It is supposed to enable descriptions of collections of networks. The question arises about the scope of validity of IDs – does the same ID in different networks represent the same or different units? This is important for operations such as the union or intersection of networks. Which way to go – introducing contexts or using matchings? Maybe some ideas from the Open Archives Initiative Object Reuse and Exchange (OAI-ORE) and GraphX could be used [22, 17]. An interesting option is the constructive network description – building a network from smaller components [10] or describing a network by its construction sequence [2]. Additional ideas may be found on the page "A Python Graph API?" [27]. For now, we would leave aside descriptions of generalizations of networks (multiway networks and hypernets), but we must not forget about them.

The agreed format must be well documented and supported by examples of the use of the supported options.

7 Conclusions

Yet another format only makes sense as a project of a larger community of users in the field of network analysis.

Acknowledgements

The computational work reported in this paper was performed using the programs R and Pajek [15]. The code and data are available at GitHub/bavla [4].

V. Batagelj is supported in part by the Slovenian Research Agency (research program P1-0294 and research project J5-4596); this work was prepared within the framework of the COST action CA21163 (HiTEc).

T. Pisanski is supported in part by the Slovenian Research Agency (research program P1-0294 and research projects BI-HR/23-24-012, J1-4351, and J5-4596).

N. Bašić is supported in part by the Slovenian Research Agency (research program P1-0294 and research project J5-4596).

References
[1] Vladimir Batagelj. 2024. Cores in multiway networks. Social Network Analysis and Mining, 14, 1, 122.
[2] Vladimir Batagelj. 1985. Inductive classes of graphs. In Proceedings of the 6th Yugoslav Seminar on Graph Theory, 43–56.
[3] Vladimir Batagelj. 2016. Nets – Python package for network analysis. Accessed 2025-03-18. https://github.com/bavla/Nets.
[4] Vladimir Batagelj. 2016. NetsJSON – a JSON format for network analysis. Accessed 2025-03-18. https://github.com/bavla/netsJSON.
[5] Vladimir Batagelj. 2016. Network visualization based on JSON and D3.js. Slides. https://github.com/bavla/netsJSON/blob/master/doc/netVis.pdf.
[6] Vladimir Batagelj. 2021. Semirings in network data analysis / an overview. Slides. https://github.com/bavla/semirings/blob/master/docs/semirings.pdf.
[7] Vladimir Batagelj and Monika Cerinšek. 2013. On bibliographic networks. Scientometrics, 96, 3, 845–864.
[8] Vladimir Batagelj and Anuška Ferligoj. 2016. Symbolic network analysis of bike sharing data / Citi Bike. Accessed 2025-03-18. https://github.com/bavla/Bikes/blob/master/bikes.pdf.
[9] Vladimir Batagelj and Daria Maltseva. 2020. Temporal bibliographic networks. Journal of Informetrics, 14, 1, 101006.
[10] Vladimir Batagelj and Andrej Mrvar. 1995. NetML. Accessed 2025-03-18. https://github.com/bavla/netsJSON/blob/master/doc/snetml.pdf.
[11] Vladimir Batagelj and Andrej Mrvar. 2018. Pajek and PajekXXL. In Encyclopedia of Social Network Analysis and Mining. Springer, 1–13.
[12] Vladimir Batagelj and Selena Praprotnik. 2016. An algebraic approach to temporal network analysis based on temporal quantities. Social Network Analysis and Mining, 6, 1–22.
[13] Jernej Bodlaj and Monika Cerinšek. 2014, 2017. Network data file formats. In Encyclopedia of Social Network Analysis and Mining. Reda Alhajj and Jon Rokne, editors. Springer New York, New York, NY, 1076–1091. ISBN 978-1-4614-7163-9. DOI: 10.1007/978-1-4614-7163-9_298-1.
[14] Mike Bostock, Jason Davies, Jeffrey Heer, and Vadim Ogievetsky. [n. d.] D3 – the JavaScript library for bespoke data visualization. Accessed 2025-08-29. https://d3js.org/.
[15] Wouter De Nooy, Andrej Mrvar, and Vladimir Batagelj. 2018. Exploratory social network analysis with Pajek: Revised and expanded edition for updated software. Vol. 46. Cambridge University Press.
[16] Gephi. 2022. Supported graph formats. Accessed 2025-03-18. https://gephi.org/users/supported-graph-formats/.
[17] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. 2014. GraphX: graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 599–613.
[18] Aidan Hogan et al. 2021. Knowledge graphs. ACM Computing Surveys (CSUR), 54, 4, 1–37.
[19] Andrej Mrvar and Vladimir Batagelj. 2025. Pajek reference manual. Version 6.01. http://mrvar.fdv.uni-lj.si/pajek/pajekman.pdf.
[20] Zachary P. Neal et al. 2024. Recommendations for sharing network data and materials – endorsement page. Accessed 2025-03-18. https://www.zacharyneal.com/datasharing.
[21] Zachary P. Neal et al. 2024. Recommendations for sharing network data and materials. Network Science, 12, 4, 404–417.
[22] Open Archives Initiative. 2014. Object Reuse and Exchange (OAI-ORE). 2014-08-14. Accessed 2025-03-18. https://www.openarchives.org/ore/.
[23] Tomaž Pisanski. 1980. Genus of Cartesian products of regular bipartite graphs. Journal of Graph Theory, 4, 1, 31–42.
[24] Tomaž Pisanski and Arjana Žitnik. 2004. Representations of graphs and maps. In 26th International Conference on Information Technology Interfaces, IEEE, 19–25.
[25] Matthew Roughan and Jonathan Tuke. 2015. Unravelling graph-exchange file formats. arXiv preprint arXiv:1503.02781.
[26] Arvind Satyanarayan, Dominik Moritz, Kanit Wongsuphasawat, and Jeffrey Heer. 2016. Vega-Lite: a grammar of interactive graphics. IEEE Transactions on Visualization and Computer Graphics, 23, 1, 341–350.
[27] The Python Wiki. 2011. A Python Graph API? Accessed 2025-03-18. https://wiki.python.org/moin/PythonGraphApi.
[28] University of Washington Interactive Data Lab. [n. d.] Vega-Lite – A Grammar of Interactive Graphics. Accessed 2025-08-29. https://vega.github.io/vega-lite/.
[29] Douglas R. White, Vladimir Batagelj, and Andrej Mrvar. 1999. Anthropology: analyzing large kinship and marriage networks with pgraph and Pajek. Social Science Computer Review, 17, 3, 245–274.

Automating Numba Optimization with Large Language Models: A Case Study on Mutual Information

Lučka Kozamernik (Teads, lucka.kozamernik@teads.com), Blaž Škrlj (Teads, blaz.skrlj@teads.com), Martin Jakomin (Teads, martin.jakomin@teads.com), Jasna Urbančič (Teads, jasna.urbancic@teads.com)

Abstract

Contemporary large language models (LLMs) enable fast research cycles when developing or optimizing new algorithms. In this
work, we investigate whether existing LLMs are sufficient to automatically, under the constraints of unit tests, produce implementations of computationally intensive algorithms, such as the mutual information algorithm, that out-perform existing human-made baselines. We establish an evaluation pipeline in which newly proposed AI implementations are rigorously tested, evaluated, and benchmarked against existing baselines. We used synthetic numeric datasets of different sizes, and the results show a 10-fold speed-up of the LLM-optimized implementations compared to the naive Numba-based optimization, while producing consistently correct mutual information scores.

Keywords
optimization, mutual information, LLM, Numba

Information Society 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.sikdd.22

1 Introduction

Mutual Information (MI) (see, e.g., [4] for a detailed overview) stands as a fundamental measure in information theory, quantifying the statistical dependency between two random variables. Its application is widespread and critical across numerous domains, including feature selection in machine learning [8], neuroscience for analyzing neural spike trains [2], and bioinformatics for understanding gene regulatory networks [9]. The versatility of MI lies in its ability to capture arbitrary non-linear relationships, a significant advantage over linear correlation measures like Pearson's coefficient.

However, the computational cost of calculating mutual information, especially for large datasets with continuous variables, presents a substantial bottleneck. The standard approach involves discretizing the data into bins in order to estimate probability distributions, a process whose accuracy and performance are highly sensitive to the chosen binning strategy and the efficiency of the underlying implementation. For practitioners working within the Python ecosystem, libraries like NumPy and SciPy are standard tools, but their performance on MI calculations can be suboptimal for high-throughput screening or large-scale data exploration tasks.

To address this performance gap, Just-In-Time (JIT) compilers like Numba [6] have become indispensable. By translating Python and NumPy code into optimized machine code at runtime, Numba offers C-like performance without sacrificing the flexibility and ease of use of the Python language. A well-written, Numba-accelerated MI function can be orders of magnitude faster than its pure Python equivalent. Despite these gains, achieving optimal performance with Numba is not always straightforward. The efficiency of Numba-jitted code is highly dependent on the specific implementation patterns, data access methods, and loop structures used – subtleties that often require significant programmer expertise to navigate.

This paper introduces a novel approach to bridge this gap: the use of Large Language Models (LLMs) to automatically optimize Numba-based mutual information algorithms. We hypothesize that modern LLMs, trained on vast repositories of code, possess the capability to analyze suboptimal Numba implementations and refactor them into more efficient versions. Our work explores whether an LLM can identify and correct common performance anti-patterns in Numba code, such as improper loop organization or inefficient data type usage, to generate an MI implementation that surpasses a naively written Numba function. We present a framework for systematically prompting an LLM with a baseline algorithm and evaluating the performance of its generated optimizations, demonstrating the potential for AI-driven code acceleration in scientific computing.

2 Related work

This research builds upon three principal areas of study: the computation of mutual information, performance optimization with JIT compilers, and the application of Large Language Models to code intelligence tasks.

Mutual information estimation is the long-standing challenge of accurately and efficiently estimating mutual information from given data. Defined as

  I(X; Y) = E_{p(X,Y)} [ log ( p(X, Y) / ( p(X) p(Y) ) ) ],

it measures the pairwise relationships between random variables (continuous or discrete). The most common methods, as reviewed by Fraser and Swinney (1986) [3] and explored in detail by Kraskov, Stögbauer, and Grassberger (2004) [5], are based on data discretization (binning) or on k-nearest-neighbor (k-NN) estimators. While k-NN methods avoid the issue of bin selection, they typically incur higher computational complexity. Binned methods, though conceptually simpler, depend heavily on the binning strategy for accuracy and performance, a topic extensively studied by Steuer et al. (2002) [11]. Our work focuses on the binned approach, as it is highly amenable to loop-based array computations where Numba excels.

The performance limitations of Python for numerical computation led to the development of various acceleration tools, specifically JIT compilers for scientific Python. Numba, introduced by Lam, Pitrou, and Seibert in 2015 [6], has emerged as a leading solution by providing a decorator-based JIT compiler that integrates seamlessly with NumPy. It allows developers to accelerate functions containing Python and NumPy syntax, often achieving performance comparable to compiled languages.
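The binned estimator defined above can be written in a few lines of NumPy. The following is an illustrative reference sketch (the function name binned_mi and the default bin count are ours), not the implementation studied in this paper:

```python
import numpy as np

def binned_mi(x, y, bins=16):
    """Binned plug-in estimate of I(X;Y) from a 2-D histogram (reference sketch)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()             # joint probabilities p(x, y)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x), shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, bins)
    nz = pxy > 0                          # skip empty bins: 0 * log 0 -> 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = rng.normal(size=10_000)
mi_dep = binned_mi(x, x)   # strong dependence: equals the binned entropy of x
mi_ind = binned_mi(x, y)   # independent samples: close to zero
```

The nested array operations and the histogram loop hidden inside np.histogram2d are exactly the kind of structure that a hand-tuned (or LLM-tuned) Numba kernel can flatten into explicit loops.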
Research and community best practices have established a set of Benchmark optimization techniques for Numba, such as managing memory layout, ensuring type stability, and structuring loops for paral- lelization and vectorization. This body of knowledge forms the --- Results as additional context *not implemented basis against which we evaluate the LLM’s optimization capabil- ities. Our work differs from traditional performance tuning by Figure 1: Architectural sketch of the benchmarking frame- attempting to automate the discovery and application of these work. Dashed are the feedback loops proposed as future techniques solely through an AI model. work. The emergence of robust Large Language Models (LLMs), such as OpenAI’s Codex (the technology powering GitHub Copilot), has revolutionized software development. These models have serve as additional prompts to the LLM in order to improve itself demonstrated remarkable proficiency in code generation, trans- and the code on the areas where the tests are failing. lation, and explanation [1]. More recently, research has shifted Finally, in the last step of the framework, the resulted imple- towards their application in more nuanced tasks like code refac- mentations were extensively benchmarked. The metric we were toring and optimization. For instance, studies have explored using most interested in was the time needed to compute the mutual LLMs to suggest improvements for energy efficiency or to refac- information for a given dataset; however, other metrics, such as tor code for better readability. However, the specific domain of memory utilization or GPU utilization, could also be used for a optimizing numerical algorithms within a JIT compilation frame- different use case. We further discuss our experimental setup in work like Numba remains relatively unexplored. While LLMs the results section. 
are known to generate functional code, their ability to produce code that is performant by adhering to the specific constraints 3.1 Reviewing the LLM optimized code and best practices of a framework like Numba is an open and The implementations of mutual information, produced by the compelling research question that this paper directly addresses. selected LLMs, are remarkably similar — both in syntax and in the naming convention. However, there are subtle differences that 3 Using LLMs to optimize existing code set them apart, which we will address later. AI-aided implementa- To facilitate systematic experimentation with LLM-optimized tions have in common that they completely omit error-handling code, we set up a novel framework. The workflow consists of the model inherited from NumPy opting for the native Python in- following basic steps: stead. Moreover, they disregard bound checks for matrix opera- tions beforehand, leaving the code to crash if it goes out of bounds. (1) Prompt the LLM with the task and context. The latter is, according to the official documentation, advised for (2) Test the proposed optimizations against the unit tests. debug purposes only and should be turned off for production, as (3) Benchmark the proposed implementation. it slows down the code significantly. Having said that, Gemini The framework is LLM-agnostic, meaning that any LLM can be implemented bound checks using elementary operations. In line used with it. We opt for the latest and most advanced versions with the change in error handling, both implementations prefer of two popular LLMs, namely ChatGPT 5 and Gemini 2.5-Pro. elementary operations over native NumPy functions. For exam- Both are freely available and excel in complex tasks such as ple, to find the maximal value in an array, the LLM optimized reasoning and coding. The architecture of the framework is given code goes through all elements in the array by the index and in Figure 1. 
compares to the current maximum instead of calling the built-in To ensure a fair comparison between the models, both eval- NumPy function. There is more evidence for this preference in uated LLMs received the same prompt and the same context. the code. Such changes make the code appear much more C-like The prompt was "Can you make this code computationally more than native Python. Whenever there is the need for typecasting, efficient, this meaning it computes faster?", while the context in- the optimized code performs it at definition, instead of on return, cluded the code that needed to be optimized. The initial code which is commonly used in the naive implementation. The two used in the input already contained some Numba instructions, types of proposed changes are illustrated with the code samples however those were basic and naive. The tested code is a part of in Figure 2. Lastly, both LLMs introduced additional function that OutRank, an open-source tool for computing cardinality-aware performs the pre-built grouping to avoid unnecessary allocations feature ranking [10] and encompasses an implementation of the and relocations in the loop. While the core techniques used for mutual information estimation. optimization are the same for both LLMs, Gemini 2.5-Pro used The LLM output was first tested on unit tests to ensure that Numba’s prange in one of the main computational loops, which the optimizations still produced valid code and did not change adds parallelization, and makes the implementation faster on mul- any functionalities. By testing the proposed solution before using ticore machines. It also took the use of elementary operations it for benchmarking, we are guaranteed that the code and its much further than ChatGPT 5 — it replaced nearly all NumPy op- output are correct, consistent, and stable. 
Although not part of erations with native operations, increasing the row count twice the framework at this stage, the output of the unit tests could as much as ChatGPT 5 did. The numbers are reported in Table 1 107 Automating Numba Optimization with Large Language Models: A Case Study on Mutual Information Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia @njit( 'Tuple((int32[:], int32[:]))(int32[:])', cache=True, fastmath=True, error_model='numpy', boundscheck=True, ) def numba_unique(a): """Identify unique elements in an array, fast""" container = np.zeros(np.max(a) + 1, dtype=np.int32) for val in a: container[val] += 1 unique_values = np.nonzero(container)[0] unique_counts = container[unique_values] return unique_values.astype(np.int32), unique_counts.astype(np.int32) def fastmath=True) (grouped by the number of features) showing the most numba_unique(a): @njit('Tuple((int32[:], int32[:]))(int32[:])', cache=True, Figure 3: Distribution of Mean Times Across Test Cases # assumes a >= 0 efficient implementation is the one optimized with Gemini maxv = 0 2.5-Pro. for i in range(a.size): if a[i] > maxv: maxv = a[i] container = np.zeros(maxv + 1, dtype=np.int32) for i in range(a.size): container[a[i]] += 1 unique_values = np.nonzero(container)[0].astype(np.int32) 4 Results unique_counts = container[unique_values].astype(np.int32) The setup for our benchmark was the following. We evaluated return unique_values, unique_counts four different implementations of mutual information. For the @njit('Tuple((int32[:], int32[:]))(int32[:])', cache=True, two baselines, we used the standard and generic Sci-Kit learn def fastmath=True) mutual information and OutRank’s basic MI-numba (that already numba_unique(a): contains some Numba instructions to optimize the performance). """ Identify unique elements and their counts in a non-negative And as discussed before, two LLM optimized implementations integer array. 
    This version finds the max value in one pass to size the container.
    """
    # Assumes a >= 0
    maxv = 0
    if a.size > 0:
        for i in range(a.size):
            if a[i] > maxv:
                maxv = a[i]
    container = np.zeros(maxv + 1, dtype=np.int32)
    for i in range(a.size):
        container[a[i]] += 1
    unique_values = np.nonzero(container)[0].astype(np.int32)
    unique_counts = container[unique_values].astype(np.int32)
    return unique_values, unique_counts

Figure 2: Examples of proposed code changes. At the top is the initial function, followed by ChatGPT 5's solution; at the bottom is the code from Gemini 2.5-Pro.

As discussed before, the two LLM-optimized implementations tested were MI-numba-chatgpt5 and MI-numba-gemini, which also support subsampling with a factor in the range (0, 1]. For the evaluation, the subsampling factor ranges from 0.1 to 1, where a factor of 1 means that no subsampling is applied.

To gauge how the performance scales with different parameters of the dataset, namely the number of examples (rows) and the number of features (columns), we synthetically generated several datasets containing raw numerical features with non-negative values, and varied the numbers of examples and features. The number of features ranged from 40 up to 200 in increments of 20, while the number of examples ranged from 200,000 to 20,000,000 in eight logarithmic steps. For each combination, represented by a tuple (algorithm, subsampling factor (where applicable), number of examples, number of features), we made five runs of the code. For each run, we recorded the time to compute mutual information using Python's time function.

The results are shown in Figure 3. The boxes represent the 25th percentile at the bottom and the 75th percentile at the top.
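The per-run timing protocol described above (five runs per configuration, wall-clock time via Python's time function) can be sketched as follows. This is an illustrative harness, not OutRank's actual benchmarking code; the helper name time_runs and the toy workload are ours:

```python
import time

def time_runs(fn, args, runs=5):
    """Call fn(*args) `runs` times and return the wall-clock
    duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.time()
        fn(*args)
        durations.append(time.time() - start)
    return durations

# In the benchmark, fn would be one of the four mutual-information
# implementations; here a trivial stand-in workload is timed.
durations = time_runs(sum, (list(range(10000)),))
```

time.perf_counter() would give higher resolution, but time.time() mirrors the protocol as described in the paper.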
For all test cases, the LLM-optimized implementations were significantly faster than the baselines (the naive Numba implementation of mutual information from OutRank and the generic Sci-Kit learn mutual information), with Gemini's implementation being the most efficient regardless of the number of features, the number of samples, or the approximation factor. The LLMs sped up the computation of mutual information by approximately 10 times, while the difference between ChatGPT's and Gemini's versions was much smaller. This implies that the biggest contribution to the speedup comes from the code changes that the two LLM-optimized solutions have in common. Those are primarily the pre-built grouping, which aims to reduce in-loop allocations, and the heavy use of elementary operations. Although parallelization in Gemini 2.5-Pro's implementation still plays a role, its effect is less significant.

Implementation    Row count    Relative row count change
Baseline          182          0%
ChatGPT 5         213          +17%
Gemini 2.5-Pro    262          +43%

Table 1: Row count for each of the implementations. Whitespace and comments are included in the row count.

In addition, Gemini 2.5-Pro implemented its own in-code bounds checks based on elementary operations, while ChatGPT 5 did not. Also contributing to the increase in the row count is the amount of comments: the code review revealed that Gemini 2.5-Pro was more consistent in code commenting, and its comments were much more useful and informative for the developer.

108 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia. Kozamernik et al.

We were very impressed by the remarkable similarity of the code produced by two different and independent LLMs. The proposed solutions from both models focused on the same key areas: adding an auxiliary function that creates the pre-built groupings to reduce the in-loop allocations, and shifting the paradigm from native NumPy to C-like Python code relying on elementary operations.
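The shared pre-built-grouping idea can be illustrated schematically in pure Python (a sketch with hypothetical helper names, not the Numba code from the papers): the value grouping is computed once by an auxiliary function and reused, instead of being re-allocated inside the hot loop.

```python
def prebuild_groups(values):
    # Auxiliary function: count value occurrences once, up front.
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return counts

def naive_pass(values, n_iterations):
    # Naive structure: the grouping is rebuilt on every iteration.
    results = []
    for _ in range(n_iterations):
        counts = {}
        for v in values:
            counts[v] = counts.get(v, 0) + 1
        results.append(len(counts))
    return results

def optimized_pass(values, n_iterations):
    # Optimized structure: the grouping is hoisted out of the loop,
    # removing the repeated in-loop allocations.
    counts = prebuild_groups(values)
    return [len(counts) for _ in range(n_iterations)]
```

Both variants produce identical results; only the allocation pattern differs.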
While the optimization process is not yet fully automatic, our contribution outlines a possible direction for the efficient use of LLMs in scientific computing. To reach the fully automatic stage for Numba optimization, we propose that the following steps be incorporated into the framework:

(1) Use the unit test output in case of failure as the next prompt for the LLM, to give it a chance to correct the code.
(2) Use the results of the benchmarking experiments as feedback to the LLM and iterate on the proposed optimization.

Both of these suggestions create feedback loops back to the LLMs, enabling an iterative process like the one proposed in Novikov et al. [7]. By comparing the outputs with the existing solutions, we have shown that the LLMs maintained correctness when introducing optimizations.

To verify that the computed mutual information is consistent with the generic implementations, namely the Sci-Kit learn implementation, we plotted the mutual information for each number of features. We show the results in Figure 4, where we can observe that the computed mutual information is almost identical for all implementations, regardless of the number of features and the different optimizations applied. We conclude that the code optimized by the LLMs is valid and correct.

Figure 4: Computed mutual information for all tested implementations and for various numbers of features.

5 Discussion

In our experiment, we used the latest and most advanced versions of two popular LLMs, namely ChatGPT 5 and Gemini 2.5-Pro, with Gemini 2.5-Pro being specifically targeted at coding. While we did put two different LLMs to the test, the goal was not so much to compare them as to develop a framework that would serve well for evaluating LLM-based optimizations in scientific computing. As new versions of LLMs, and new LLMs altogether, periodically appear on the market, the framework can serve to keep improving the existing code or, on the other hand, can be used to quantify the improvements in the LLMs themselves (specifically for the coding subdomain) as new versions are released. Additionally, using the framework in the development phase of scientific experiments can reduce the computational time and resources needed, leading to a lower cost for the experiments.

Focusing on the LLM aspect of the framework, the question remains what the result of the LLM-based optimization would be had the context, represented by the initial code, not already used Numba optimizations. A few additional experiments could be done to explore that:

(1) Use Python code without Numba instructions and explicitly mention Numba in the prompt.
(2) Use Python code without Numba instructions and do not mention Numba in the prompt.
(3) Task the LLM to prepare the most computationally efficient implementation of mutual information in Python.

6 Conclusions

In this work, we presented an initial framework for automatic code optimization via LLMs, achieving a very impressive 10-fold speedup compared to the naive baseline in the benchmarking experiments while maintaining the correctness of the code.

References
[1] Mark Chen et al. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374. https://arxiv.org/abs/2107.03374
[2] Ahmad El Ferdaoussi, Eric Plourde, and Jean Rouat. 2025. Maximizing information in neuron populations for neuromorphic spike encoding. Neuromorphic Computing and Engineering, 5, 1, 014002.
[3] Andrew M. Fraser and Harry L. Swinney. 1986. Independent coordinates for strange attractors from mutual information. Physical Review A, 33, 2, 1134.
[4] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. 2011. Erratum: Estimating mutual information [Phys. Rev. E 69, 066138 (2004)]. Physical Review E, 83, 1, 019903.
[5] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. 2004. Estimating mutual information. Physical Review E, 69, 6, 066138.
[6] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. 2015. Numba: a LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, 1–6.
[7] Alexander Novikov et al. 2025. AlphaEvolve: a coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131.
[8] Hanchuan Peng, Fuhui Long, and Chris Ding. 2005. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 8, 1226–1238.
[9] Lior I. Shachaf, Elijah Roberts, Patrick Cahan, and Jie Xiao. 2023. Gene regulation network inference using k-nearest neighbor-based mutual information estimation: revisiting an old dream. BMC Bioinformatics, 24, 1, 84.
[10] Blaz Skrlj and Blaž Mramor. 2023. OutRank: speeding up AutoML-based model search for large sparse data sets with cardinality-aware feature ranking. In Proceedings of the 17th ACM Conference on Recommender Systems, 1078–1083.
[11] Ralf Steuer, Jürgen Kurths, Carsten O. Daub, Janko Weise, and Joachim Selbig. 2002. The mutual information: detecting and evaluating dependencies between variables. Bioinformatics, 18, suppl_2, S231–S240.

Topological Structure in GitHub Repository Embeddings Using Mapper

Ivo Hrib, ivo.hrib@gmail.com, Jožef Stefan Institute, Ljubljana, Slovenia
Patrik Zajec, patrik.zajec@ijs.si, Jožef Stefan Institute, Ljubljana, Slovenia

Abstract
We present a preliminary framework for the topological analysis of GitHub repository embeddings using the Mapper algorithm. Applied to 10,000 repositories embedded in 768-dimensional
space, our approach currently provides visual representations of Mapper graphs, offering a first view into potential topological structures such as branching patterns and cycles. While these initial results are exploratory, they establish a foundation for rigorous statistical testing of topological features. Future work will incorporate persistent homology–based significance testing to distinguish genuine structural patterns from noise, with the ultimate goal of interpreting these features in terms of repository characteristics.

Keywords
topological data analysis, Mapper, GitHub, embeddings, significance testing, software repositories, persistent homology

1 Introduction
We present a preliminary framework for the topological analysis of GitHub repository embeddings using the Mapper algorithm. Applied to 10,000 repositories embedded in 768-dimensional space, our approach provides visual representations of Mapper graphs that reveal branching structures and cycles as potential organizational patterns in the data.

2 Background and Related Work
2.1 The Mapper Algorithm
The Mapper algorithm [6] constructs a graph representation of a topological space by combining a filter function, overlapping covers, and clustering. Given a point cloud P embedded in R^d and a continuous function f : P → R, referred to as a filter function, the algorithm:
(1) constructs a cover U = {U_1, . . . , U_n} of the range f(P) using overlapping intervals;
(2) for each interval U_i, computes the preimage P_i = f^(-1)(U_i);
(3) clusters each preimage into connected components using a clustering algorithm;
(4) creates vertices for each cluster and edges between clusters whose point sets intersect.
Common practice uses the first PCA component [4] as the filter function and density-based clustering methods, such as DBSCAN [2], unless specific domain knowledge is provided. The resulting graph G = (V, E) provides a combinatorial description with a mapping φ : V → P(P) associating each vertex with a subset of points.
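The four steps can be condensed into a runnable sketch. This is our illustration, not the authors' implementation: the epsilon-linkage clustering stands in for DBSCAN, and in practice a library such as Kepler-Mapper would be used.

```python
from itertools import combinations

def mapper_graph(points, filter_values, n_intervals, overlap, eps):
    # Minimal Mapper sketch following steps (1)-(4): cover the filter
    # range with overlapping intervals, cluster each preimage, and
    # connect clusters that share points.
    lo, hi = min(filter_values), max(filter_values)
    width = (hi - lo) / n_intervals
    clusters = []  # each cluster: frozenset of point indices
    for k in range(n_intervals):
        a = lo + k * width - overlap * width        # (1) overlapping interval
        b = lo + (k + 1) * width + overlap * width
        pre = {i for i, v in enumerate(filter_values) if a <= v <= b}  # (2)
        while pre:                                   # (3) eps-linkage clustering
            comp = {pre.pop()}
            grew = True
            while grew:
                grew = False
                for j in list(pre):
                    if any(max(abs(points[j][d] - points[i][d])
                               for d in range(len(points[j]))) <= eps
                           for i in comp):
                        comp.add(j)
                        pre.discard(j)
                        grew = True
            clusters.append(frozenset(comp))
    edges = [(i, j) for i, j in combinations(range(len(clusters)), 2)
             if clusters[i] & clusters[j]]            # (4) shared points -> edge
    return clusters, edges
```

With the first coordinate as a stand-in filter, four collinear points and two overlapping intervals yield two overlapping clusters joined by a single edge.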
These results are exploratory and serve as a foundation for future work, where statistical significance testing will be applied to rigorously validate which features represent genuine topological structure rather than noise. Our framework thus establishes an initial step toward understanding the topology of repository embeddings and motivates further methodological development.

1.1 Research Questions
This work addresses the following specific question:
(1) Do GitHub repository embeddings contain significant topological structures beyond simple clustering?

1.2 Contributions
Our main contributions are:
• A preliminary framework for constructing and visualizing Mapper graphs of GitHub repository embeddings.
• A systematic comparison of Mapper graphs across multiple parameter settings, highlighting sensitivity and recurring structural patterns.
• A discussion of how these preliminary results can guide future work, in particular the application of statistical testing methods to validate topological features and their interpretation in terms of repository characteristics.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.sikdd.27

110 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia. Hrib and Zajec

Figure 1: Visual demonstration of the Mapper algorithm for a projection filter and a simple point cloud.

2.1.1 Parameter Selection and Sensitivity. Mapper results are sensitive to three main parameters:
• Resolution (n): the number of intervals in the cover.
• Overlap (p): the percentage overlap between consecutive intervals.
• Clustering threshold (ε): the distance parameter for the clustering algorithm.
Taking this into account, we devised the following methodology for parameter selection. We define a discrete grid of candidate values, for which the Mapper graph is reasonably computable, for each of the previously mentioned parameters and for the minimum number of points per cluster. For each point in this grid we applied the Mapper algorithm to the dataset and computed a collection of quality measures.

Three main criteria were used to assess the quality of each Mapper graph:
• Coverage: the proportion of data points captured by the nodes of the Mapper graph, measuring how well the graph represents the entire dataset.
• Modularity: a measure of the strength of community structure in the resulting graph, reflecting the presence of well-defined clusters or substructures.
• Stability: the reproducibility of the graph under sampling noise, estimated by a bootstrap procedure in which multiple resampled datasets were processed and the resulting node assignments compared for consistency.

For each parameter combination, we computed stability, coverage, and modularity. To aggregate these into a single composite score, we used a weighted sum that places the highest emphasis on stability (0.5), followed by coverage (0.3) and modularity (0.2). These weights were chosen to reflect our prioritization of reproducibility and representativeness over community structure.

n_cubes    ε       Overlap    MinPts    Coverage    Stability    Modularity    Score
12         0.70    0.7        3         0.966       0.948        0.785         0.921
12         0.70    0.7        5         0.933       0.924        0.745         0.891
16         0.70    0.7        3         0.952       0.872        0.791         0.880
10         0.70    0.7        3         0.966       0.847        0.765         0.866
16         0.70    0.7        5         0.915       0.852        0.771         0.855

Table 1: Top 5 Mapper parameter settings ranked by score.

For each parameter layout, we employed the first PCA component as our chosen filter and DBSCAN as our chosen clustering algorithm.

2.1.2 Adaptive Mapper and Learnable Filter Functions. Recent advances in Mapper methodology include adaptive approaches.

3 Dataset and Methodology
3.1 Dataset Description
The raw dataset comprised approximately 500,000 GitHub repositories, each annotated with a range of metadata fields. These can be grouped into three broad categories:
• Textual features: free-form text fields such as description, readme, requirements, and packages, which capture natural-language documentation and dependency declarations.
• Categorical features: attributes such as language, topic, and visibility, which provide discrete labels describing repository characteristics.
• Contextual metadata: fields such as name, bio, website, company, location, and date of creation, which provide identifying information and organizational context.

3.1.1 Repository Selection Criteria. In the interest of computational feasibility, this dataset was then sampled to 10,000 repositories. Repositories were chosen via simple random sampling from the full dataset, as many repositories contained incomplete or inconsistent categorical and contextual metadata; therefore, stratified sampling was not appropriate.

3.1.2 Embedding Process. Each sampled repository was converted into a structured dictionary combining the available metadata fields. These dictionaries were embedded using the nomic-embed-text model.
The model accepts long-context inputs (up to approximately 8,000 tokens), which makes it suitable for processing repository documentation such as README files. The resulting embeddings are 768-dimensional vectors. Together, the 10,000 sampled repositories form a point cloud in R^768. Because the embeddings primarily reflect textual and documentation content (e.g., README and description fields), the analysis in this study centers on topological structure in the documentation space rather than source code semantics. These embeddings serve as the basis for the Mapper-based topological data analysis described in the following sections.

In adaptive Mapper approaches, the filter functions are learned from data rather than manually specified; such approaches could potentially optimize for statistically significant topological features [3]. These methods were, however, not utilized in our case due to computational complexity and remain to be explored in the future.

3.2 Mapper Implementation
For our purposes we used Kepler-Mapper to compute the Mapper graphs that scored highest, as listed in Table 1.

2.2 Related Work in Software Repository Analysis
Several recent studies have explored software repository embeddings and clustering. For example, Rokon et al. introduced Repo2Vec, which combines metadata, source code, and structural signals into repository embeddings for similarity search and clustering [5]. Lherondelle et al. proposed an attention-based model that learns repository embeddings from code and metadata to support auto-tagging and recommendation tasks [lherondelle2022topical]. Zhang et al. developed HiGitClass, a hierarchical classification framework for GitHub repositories using embedding-based methods [8].
Other work has examined clustering repositories with software metrics [repo_metrics2023] or analyzing the characteristics of repositories in specific domains such as embedded systems [polaczek2021embedded]. While these approaches demonstrate that embeddings and clustering can yield useful insights about software repositories, they focus primarily on supervised tasks (classification, tagging) or flat similarity clustering. In contrast, our work explores the topological features of the high-dimensional space in which the repositories reside. By applying the Mapper algorithm to repository embeddings, we intend to characterize how repositories are organized in terms of branching patterns, hubs, and cycles. This perspective emphasizes the geometry and connectivity of the embedding space itself, offering potential insights that complement more conventional similarity- or classification-based analyses of repositories.

The selected parameter settings are:
• Graph 1: Resolution = 12, Overlap = 0.7, eps = 0.7, min_samples = 3
• Graph 2: Resolution = 12, Overlap = 0.7, eps = 0.7, min_samples = 5
• Graph 3: Resolution = 16, Overlap = 0.7, eps = 0.7, min_samples = 3
• Graph 4: Resolution = 10, Overlap = 0.7, eps = 0.7, min_samples = 3
• Graph 5: Resolution = 16, Overlap = 0.7, eps = 0.7, min_samples = 5
The filter function is, as before, projection onto the first principal component; clustering uses DBSCAN, with the minimum cluster size parameter adjusted per graph.

111 Topological Analysis of GitHub Repository Embeddings. Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia

6.1 Limitations and Error Analysis
Several limitations must be acknowledged:
6.1.1 Parameter Sensitivity. While we observe some consistent patterns across parameter choices, a more systematic exploration of the parameter space than a pure grid search is needed.
6.1.2 Computational Constraints. Full significance testing of all features is computationally expensive, limiting the scale of analysis possible.
6.1.3 Interpretation Challenges. The semantic meaning of topological features requires domain expertise and may not generalize across different types of software projects.

4 Results
Table 2 reports the structural properties of the selected Mapper graphs, while Table 3 summarizes degree distributions. Graph 1 (Resolution = 12, MinPts = 3) produced 207 nodes and 368 edges across 36 components, with 197 cycles. Graph 3 (Resolution = 16, MinPts = 3) was even larger (232 nodes, 421 edges, 229 cycles), reflecting the finer subdivisions introduced by higher resolution.
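The cycle counts reported for these graphs match each graph's circuit rank, E − V + C (the number of independent cycles given V nodes, E edges, and C connected components); a quick sanity check (the function name is ours):

```python
def circuit_rank(nodes, edges, components):
    # Independent cycles of an undirected graph: E - V + C.
    return edges - nodes + components

# Graph 1 and Graph 3 from Table 2:
assert circuit_rank(207, 368, 36) == 197
assert circuit_rank(232, 421, 40) == 229
```

The same identity holds for the remaining graphs in Table 2.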
Graphs 2 and 5 (MinPts = 5) were smaller, around 100 nodes each, as stricter clustering merged many small clusters. Graph 4 (Resolution = 10, MinPts = 3) fell between these extremes (194 nodes, 337 edges).

Degree distributions confirm these patterns: Graphs 1 and 3 contain many nodes of degree 3–5 with some higher-degree hubs, while Graphs 2 and 5 are simpler and tree-like. Overall, higher resolution and lower MinPts yield more fragmented, cycle-rich graphs, while stricter clustering produces fewer, larger components. These trends highlight the need for statistical testing to separate genuine topological signals from parameter effects.

As for the visual representations of the graphs, see Figures 2a and 2c, as well as the bar plots of their respective node sizes. Note that many of the nodes are relatively small, most likely due to the reasons mentioned previously.

Graph      Nodes    Edges    Conn. comps.    Cycles
Graph 1    207      368      36              197
Graph 2    101      187      18              104
Graph 3    232      421      40              229
Graph 4    194      337      36              179
Graph 5    108      200      19              111

Table 2: Comparison of structural properties across Mapper graphs.

Graph      1–2    3–5    6–10    11–20    21+
Graph 1    9      181    9       2        6
Graph 2    10     82     2       6        1
Graph 3    20     182    20      6        4
Graph 4    10     171    5       3        5
Graph 5    9      84     7       8        0

Table 3: Binned degree distributions across graphs.

5 Figures and Results Visualization

6 Discussion
The consistent branching patterns across multiple Mapper graphs suggest genuine topological structure in the repository embedding space rather than parameter artifacts. The large presence of cycles indicates more complex topological relationships beyond simple clustering, possibly representing repositories that share multiple characteristics or form transition regions between different project types. Although most of these may be attributed to noise, we aim to further explore those that are relevant using the techniques from [7] and [1].

6.1.4 Embedding Model Dependence. Results depend on the quality and characteristics of the embedding model used.

7 Future Work and Conclusions
7.1 Immediate Extensions
• Complete statistical validation of all observed topological features
• Systematic parameter sensitivity analysis
• Comprehensive repository characteristic analysis for interpretation
• Cross-validation with different embedding models and data subsets
7.2 Methodological Advances
• Adaptive Mapper guided by significance testing to optimize filter functions
• Validation on simple synthetic datasets to confirm methodology effectiveness
• Development of Mapper quality metrics and automated parameter selection
• Hybrid approaches combining Mapper with other dimensionality reduction techniques
7.3 Applications and Validation
• Predictive modeling using topological features for repository characteristics
• Integration with software engineering workflows and tools
• Evaluation by domain experts for practical relevance
• Extension to other software engineering datasets and problems
7.4 Conclusions
While computational constraints limit the scope of the current analysis, the framework establishes a foundation for rigorous topological analysis of software engineering data. The combination of visualization, statistical validation, and manual interpretation provides a comprehensive approach to understanding high-dimensional repository relationships. The observed topological structure suggests that repository embeddings capture meaningful relationships beyond simple clustering, opening possibilities for novel applications in software engineering and repository analysis.

Acknowledgements
This research was supported by the EnrichMyData project, which provided financial support for the work presented in this paper.

(a) Mapper graph (Graph 1). (b) Node count distribution (Graph 1).
(c) Mapper graph (Graph 3). (d) Node count distribution (Graph 3).

Figure 2: Representative Mapper graphs (Graph 1 and Graph 3) with corresponding node count barplots. Both 2a and 2c show a significant central connected component with some branching; however, the boundary of the largest connected component seems to be quite noisy. Further statistical testing will aim to improve upon pruning the noisy artifacts.

References
[1] Omer Bobrowski and Primož Skraba. [n. d.] A universal null-distribution for topological data analysis. https://www.nature.com/articles/s41598-023-37842-2.
[2] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD'96). AAAI Press, 226–231.
[3] Ziyad Oulhaj, Mathieu Carrière, and Bertrand Michel. 2024. Differentiable Mapper for topological optimization of data representation. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research). Vol. 235. PMLR, 38919–38936. doi:10.48550/arXiv.2402.12854.
[5] Md Rafsan Jani Rokon, Panagiotis Kallis, Michele Castronovo, Alexander Serebrenik, and Alberto Bacchelli. 2021. Repo2Vec: repository embeddings for effective similarity search and recommendation. In Proceedings of the 18th International Conference on Mining Software Repositories (MSR 2021), 384–394.
[6] Gurjeet Singh, Facundo Mémoli, and Gunnar E. Carlsson. 2007. Topological methods for the analysis of high dimensional data sets and 3D object recognition. In Eurographics Symposium on Point-Based Graphics. Eurographics Association, 91–100. doi:10.2312/SPBG/SPBG07/091-100.
[7] Patrik Zajec. 2023. Towards testing the significance of branching points and cycles in mapper graphs.
[8] Yu Zhang, Frank F. Xu, Sha Li, Yu Meng, Xuan Wang, Qi Li, and Jiawei Han. 2019.
HiGitClass: keyword-driven hierarchical classification of GitHub repositories. In ICDM '19, 876–885. doi:10.1109/ICDM.2019.00098.
[4] Karl Pearson. 1901. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2, 11, 559–572. doi:10.1080/14786440109462720.

CO2 Monitoring for Energy-Efficient Workloads in Kubernetes: A Data Provider for CO2-Aware Migration

Ivo Hrib, ivo.hrib@gmail.com, Jožef Stefan Institute, Ljubljana, Slovenia
Jan Šturm, jan.sturm@ijs.si, Jožef Stefan Institute, Ljubljana, Slovenia
Oleksandra Topal, oleksandra.topal@ijs.si, Jožef Stefan Institute, Ljubljana, Slovenia
Maja Škrjanc, maja.skrjanc@ijs.si, Jožef Stefan Institute, Ljubljana, Slovenia

Abstract
We present a CO2 monitoring component developed within the FAME project's Energy Efficient Analytics Toolbox. The service continuously collects power usage for containerized workloads in Kubernetes via Kepler and fuses it with regional electricity-grid carbon intensity (e.g., ElectricityMaps) to compute per-workload CO2 emission rates in g s^-1. Its primary role is to store accurate, timestamped emission values and expose them through lightweight APIs and an optional time-series database (TimescaleDB). It acts as a data provider consumed by external orchestration services, enabling CO2-aware migration strategies across clusters and regions.

Our contributions are threefold: (i) a minimal but complete architecture for per-workload CO2 measurement and storage in Kubernetes; (ii) a schema and REST API design that facilitates external consumption; and (iii) scenario-based evaluations demonstrating the potential of CO2-aware workload migration. Further testing will take place, utilizing real measurements and migrations from within the FAME framework, so as to showcase the service's precise final capabilities, as opposed to benchmark tests.

1.1 Key Idea
The key idea of our approach is to compute container-level CO2 emissions by combining two complementary data sources: (i)
instantaneous power consumption estimates from Kepler, and (ii) regional grid carbon intensity values from ElectricityMaps.

First, Kepler provides pod- and container-level telemetry in the form of estimated power usage P(t), expressed in watts. This power signal is derived from eBPF-based kernel observations and model-based inference, all provided by Kepler's data source. Second, ElectricityMaps exposes a carbon intensity factor I(t), expressed in gCO2/kWh, corresponding to the bidding zone of the node on which the container executes.

We align these two signals in time and compute the instantaneous emission rate as

    E(t) = P(t) · I(t) / 3,600,000

where E(t) is the CO2 emission rate in g s^-1, P(t) is the container power in watts (J s^-1), I(t) is the grid carbon intensity in gCO2 per kWh, and the constant 3,600,000 converts joules to kilowatt-hours (1 kWh = 3.6 × 10^6 J), so that the per-kWh intensity factor yields a per-second emission rate.

These per-container emission rates are then aggregated into a time series, optionally persisted in TimescaleDB, and exposed via a REST API. This composition allows downstream orchestration services to reason about the carbon impact of workloads at fine temporal and spatial granularity, enabling CO2-aware migration strategies.

Keywords
CO2 monitoring, Kubernetes, energy efficiency, carbon-aware computing, time-series storage, ElectricityMaps, Kepler

1 Introduction
Data centers are a significant contributor to global electricity demand. Beyond advances in hardware efficiency and renewable energy procurement, intelligent orchestration of workloads can reduce emissions by aligning computation with cleaner energy availability. A prerequisite for such carbon-aware orchestration is the availability of reliable and accessible measurements of workload-level emissions.

This paper introduces a CO2 monitoring and storage service designed for Kubernetes environments. The service ingests pod/container power data from Kepler [5], combines it with regional grid carbon intensity from ElectricityMaps [2], computes instantaneous emission rates, and persists the resulting time series. Unlike optimization or migration tools, this component deliberately restricts its scope: it provides measurements and exposes them via stable APIs for later consumption. By decoupling measurement from decision-making, we ensure modularity and interoperability. External orchestrators such as the ATOS migration service in FAME D5.4 [3] can consume these metrics to implement CO2-aware migration strategies without needing to handle the intricacies of measurement or data storage.

2 Background and Related Work
Components of our approach. Our service integrates two external data sources to produce fine-grained CO2 emission signals. Kepler is an open-source project that estimates the energy consumption of containerized workloads in Kubernetes by leveraging eBPF-based telemetry and machine learning models. It exposes per-container power and energy metrics that can be consumed by higher-level services. ElectricityMaps provides real-time and historical carbon intensity data for electricity grids, expressed in gCO2/kWh. By fusing Kepler's workload-level power estimates with regional carbon intensity factors from ElectricityMaps, our system produces a continuous stream of container-level CO2 emission data. This stream can then be consumed by orchestration or scheduling components for migration and placement decisions.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
system produces a continuous stream of container-level CO2 emission data. This stream can then be consumed by orchestration or scheduling components for migration and placement decisions.

https://doi.org/10.70314/is.2025.sikdd.24
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia. Hrib et al.

Carbon-aware computing. Prior research demonstrates the potential of carbon-aware strategies, such as shifting workloads across time or regions to align with lower-carbon electricity supplies [4]. Such approaches rely on access to reliable, fine-grained emission signals to inform scheduling policies.

Existing monitoring tools. Several open-source frameworks exist for energy and carbon monitoring. For example, CodeCarbon [1] and Scaphandre [6] estimate workload emissions, but they rely on hardware-specific telemetry, such as Intel's Running Average Power Limit (RAPL) counters. This dependence ties portability to Intel CPUs and makes integration across heterogeneous infrastructures challenging. In contrast, our design, built on Kepler and ElectricityMaps, remains hardware-agnostic: eBPF enables container-level monitoring without vendor-specific counters, while ElectricityMaps provides global coverage of carbon intensity signals. This combination makes our service applicable in diverse Kubernetes environments and datacenter setups.

Time-series storage. Finally, for persistence, we optionally employ TimescaleDB, which extends PostgreSQL with hypertables and compression optimized for telemetry data [7]. Nevertheless, the service can also operate in buffer-only mode when persistent storage is not required.

Positioning. This paper positions our monitoring service as a foundational measurement substrate for carbon-aware orchestration in Kubernetes environments. By combining hardware-agnostic energy estimates with real-time grid carbon data, it extends the applicability of carbon-aware scheduling beyond the limitations of prior approaches.

3 Design and Implementation

3.1 Architecture
The component runs as a Kubernetes deployment. Workers collect power metrics from Kepler, fetch grid intensity values, compute emissions, and either persist results in TimescaleDB or serve them from memory. A REST API provides read-only access to historical and recent emissions.

Figure 1: System architecture

3.2 Data Model
Each emission record is structured as a tuple that captures both workload identifiers and measurement values. This schema is designed to balance expressiveness with minimal storage overhead, while ensuring compatibility with external orchestration services.
• ts (timestamp, UTC): the precise moment when the measurement was taken, enabling time-series alignment across nodes and regions.
• namespace, pod, container: identifiers for locating the workload within the Kubernetes hierarchy, which is essential for container-level granularity and reproducibility.
• node, region, country_iso2: metadata that ties the container execution to its physical and geographical context. This supports carbon-aware decisions that depend on grid intensity differences across regions.
• power_w, energy_j: raw telemetry provided by Kepler, describing both instantaneous power and accumulated energy consumption.
• intensity_g_per_kwh: regional grid carbon intensity retrieved from ElectricityMaps, serving as the multiplier that translates energy into emissions.
• co2_g_per_s: the computed emission signal, representing the core value consumed by orchestrators.
• source_version: versioning tag for tracking the provenance of measurements and external data dependencies.
This schema ensures that each record is self-contained, interpretable across clusters, and suitable for longitudinal analysis in time-series databases.

3.3 API Endpoints
The service exposes a lightweight REST API, designed to be easily consumed by external orchestrators or monitoring pipelines. The API emphasizes read-only access to maintain reliability and auditability.
• GET /api/containers: returns the set of containers currently monitored by the service, allowing orchestrators to discover available emission signals.
• POST /api/emissions: fetches recent emission values in bulk; requires a specified time range. This endpoint is optimized for dashboards or monitoring agents that need timely updates with low overhead.
• POST /api/emissions/by-container: queries the emission history of specific containers; similarly requires a time range, as well as the names of the containers for which to fetch data.
• GET /api/schema: provides the data schema, including units and field definitions. This enables clients to validate their assumptions and facilitates long-term interoperability across versions.
By standardizing access patterns, the API makes it possible for external services to reliably retrieve emissions information without depending on internal implementation details.

4 Evaluation
We now present evaluations based on benchmarks and scenario analyses conducted in the FAME project [3]. The goal was to assess whether exposing real-time CO2 signals can enable meaningful emission reductions when coupled with migration strategies.

4.1 Benchmark Test
In a simple benchmark using busybox, a lightweight Linux container, the optimal CO2 emissions achieved were significantly lower than the mean observed values. The key performance indicator (KPI) was defined as a 200% improvement, corresponding to a 66.6% reduction compared to the baseline. Results show that this threshold can be achieved and often surpassed.
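As an illustration of how an external orchestrator might consume these endpoints, the sketch below builds the request body for POST /api/emissions/by-container and decodes the returned records. The endpoint paths come from the paper; the service address, port, and exact JSON field names of the request body are assumptions, since the paper specifies only the endpoints and the record schema.

```python
import json
from urllib import request

BASE_URL = "http://co2-monitor.default.svc:8080"  # assumed service address, not from the paper

def by_container_payload(start_ts: str, end_ts: str, containers: list) -> dict:
    """Body for POST /api/emissions/by-container: a time range plus container names.

    Field names ('start', 'end', 'containers') are illustrative assumptions."""
    return {"start": start_ts, "end": end_ts, "containers": containers}

def fetch_emissions(payload: dict) -> list:
    """POST the payload and return the decoded emission records (assumed JSON list)."""
    req = request.Request(
        BASE_URL + "/api/emissions/by-container",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

A client would typically first call GET /api/schema to validate its assumptions about units and field definitions, which is exactly the interoperability role the paper assigns to that endpoint.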
The baseline is, for lack of a better metric, defined as the mean of emissions across all tracked countries with available resources for migration.

Figure 2: Small-timeframe emissions of a busybox benchmark container. Noticeably low emissions can be seen for France, which can be explained by its heavy reliance on nuclear power, as can be seen in [2].

4.2 Scenario-Based Evaluation
Scenarios simulate workload migrations across subsets of European countries. Each scenario randomly selects 4–7 countries from a pool of 28, representing constrained deployment options. The abbreviations (e.g., FR, DE, SI) correspond to ISO-3166 country codes representing different electricity regions. We employed random sampling of countries to simulate the heterogeneity faced by cloud and edge providers operating across multiple regions. This choice enables us to reflect migration challenges where workloads are moved not only between datacenters but also across electricity grids with diverse carbon intensities. While random sampling is a simplification, it provides statistically representative insights into the variability of emission factors. We showcase our results through the following five scenarios:
• Scenario 1 (IS, CZ, BG, RO, AT, SE): 88.2% ± 2.1% reduction.
• Scenario 2 (DE, PL, GR, LV): 72.8% ± 5.6% reduction.
• Scenario 3 (GB, LT, SI, DE, AT, GR): 78.0% ± 1.7% reduction.
• Scenario 4 (ES, FR, GB, PL, HU, LT, SE): 89.6% ± 1.1% (best case).
• Scenario 5 (LV, ES, HU, LT): 32.4% ± 12.7% (worst case).
• All countries: 87.7% ± 1.7% reduction (ideal case).
Across all scenarios, at least one migration was executed per window, with an average emission reduction of 74.8%. These results confirm that even under limited availability, CO2-aware migration strategies yield substantial benefits.

Figure 3: Plot of average reductions per scenario

5 Limitations and Future Work
The reported emissions are estimates subject to the accuracy of both Kepler's models and grid intensity data. As a result, the benchmark tests previously performed may not fully capture all possible scenarios, as grid dependency may sometimes force suboptimal migrations in the CO2 system, depending on resource availability. The system attempts to minimize emissions within these subsets. Resolution is limited by the update frequency of intensity sources, and storage requirements increase with sampling granularity.

We considered only a single baseline, defined as the mean CO2 emissions across all tracked countries. While this provides a general reference point, it is not directly comparable to region-specific benchmarks and may obscure finer-grained differences. Future work should therefore incorporate multiple baselines, such as per-country averages or established benchmarks from the literature, and assess statistical significance relative to them.

Our benchmark scenarios were simplified to ensure reproducibility and interpretability. Although random sampling of countries illustrates the variability in energy mixes, it does not fully capture the operational constraints of datacenter migrations or multi-cloud scheduling. More complex benchmarks with realistic workloads and infrastructure heterogeneity would further validate the applicability of our approach.

Finally, while implementation details such as REST endpoints and TimescaleDB integration were reported for transparency, their evaluation was not the main focus of this study. Additional experimentation with scalability and deployment overhead would strengthen the case for adoption in production environments.

Future work will focus on service options to adjust granularity, on tackling scalability issues within the service, and on broader evaluation.
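The scenario reductions reported in Section 4.2 can be illustrated with a minimal sketch. The assumptions here are mine, not the paper's experimental setup: the baseline is the mean intensity over all tracked countries, and the carbon-aware migrator inside a scenario's country subset is assumed always to run the workload in the cleanest available country.

```python
from statistics import mean

def reduction_pct(baseline_g: float, achieved_g: float) -> float:
    """Percentage emission reduction of an achieved value against the baseline."""
    return 100.0 * (baseline_g - achieved_g) / baseline_g

def scenario_reduction(intensities: dict, subset: list) -> float:
    """Baseline: mean intensity over all tracked countries (as in the paper).

    The migrator is assumed to always pick the cleanest country in the subset."""
    baseline = mean(intensities.values())
    achieved = min(intensities[c] for c in subset)
    return reduction_pct(baseline, achieved)
```

The real evaluation ran per time window with availability constraints, so the single-snapshot numbers produced by this sketch only illustrate why subsets containing a low-carbon grid (e.g., one dominated by nuclear or hydro power) approach the all-countries ideal case.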
4.3 Insights
The best-performing scenario demonstrates that careful selection of even a limited number of regions can approach the effectiveness of full global availability. Conversely, the poorest-performing scenario illustrates the dependency on geographic flexibility. Overall, the results validate that exposing reliable CO2 signals through our service empowers orchestration layers to meet or exceed environmental KPIs.

6 Conclusion
We presented a Kubernetes-native CO2 monitoring service that provides real-time emissions data through stable APIs. Evaluations demonstrate that when coupled with migration strategies, these metrics enable significant emission reductions, often surpassing KPI thresholds. Future work will include integration with more compute-intensive workloads, multi-source intensity aggregation, and cryptographic provenance for auditability.

Acknowledgements
This work was supported by the FAME project under the European Union's Horizon Europe programme. We thank the Kepler community and colleagues who contributed feedback during testing. For all online resources cited, the date of access has been included to ensure reproducibility and traceability.

References
[2] Electricity Maps ApS. 2025. Electricity Maps: real-time carbon intensity of electricity consumption. Accessed: 25 September 2025. https://app.electricitymaps.com.
[3] European Union Horizon Europe Programme. 2025. FAME project: federated and multicloud enablers for green computing. Accessed: 25 September 2025. https://www.fame-horizon.eu/the-project/.
[4] Google. 2020. Our data centers now work harder when the sun shines and wind blows. Accessed: 25 September 2025. https://blog.google/inside-google/infrastructure/data-centers-work-harder-sun-shines-wind-blows/.
[5] Kepler Project Contributors. 2025. Kepler: Kubernetes-based efficient power level exporter. Accessed: 25 September 2025. https://github.com/sustainable-computing-io/kepler.
[1] CodeCarbon Project Contributors. 2025.
CodeCarbon: track and reduce the carbon footprint of your computing. Accessed: 25 September 2025. https://mlco2.github.io/codecarbon/.
[6] Scaphandre Project Contributors. 2025. Scaphandre: energy monitoring agent for Linux servers. Accessed: 25 September 2025. https://github.com/hubblo-org/scaphandre.
[7] Timescale Inc. 2025. TimescaleDB: an open-source time-series database. Accessed: 25 September 2025. https://www.timescale.com.

Beyond Surveys: Adolescent Profiling via Ecological Momentary Assessment and Mobile Sensing

Jasminka Dobša (University of Zagreb, Faculty of Organization and Informatics, Varaždin, Croatia; jasminka.dobsa@foi.hr)
Simona Korenjak-Černe (University of Ljubljana, School of Economics and Business, and IMFM, Ljubljana, Slovenia; simona.cerne@ef.uni-lj.si)
Miranda Novak (University of Zagreb, Faculty of Education and Rehabilitation Sciences, Zagreb, Croatia; miranda.novak@erf.unizg.hr)
Maja Buhin Pandur (University of Zagreb, Faculty of Organization and Informatics, Varaždin, Croatia; mbuhin@foi.unizg.hr)
Lucija Šutić (University of Zagreb, Faculty of Education and Rehabilitation Sciences, Zagreb, Croatia; lucija.sutic@erf.unizg.hr)

Abstract
The aim of this study is to identify profiles of adolescents using survey data and data collected via mobile phones, which included ecological momentary assessment (EMA) and passive mobile sensing. EMA involved responses to short questionnaires
delivered seven times per day over one week, while mobile sensing captured the time spent using different categories of mobile applications. The study was conducted on a sample of 77 secondary school students. Profiling was performed through clustering of EMA data aggregated into six composite variables reflecting confidence, attentiveness, positive and negative emotions related to friends, and overall positive and negative affect. Based on the interpretability of the results, four adolescent profiles were identified. These profiles are further explained using survey data and passive data on mobile application usage patterns.

Keywords
Adolescents, clustering, mobile sensing, ecological momentary assessment, well-being

1 Introduction
This study was conducted using the Effortless Assessment of Risk States (EARS) application developed by Ksana Health in collaboration with the University of Oregon (https://ksanahealth.com/ears/) [6]. The EARS application was originally launched in 2018 to facilitate the collection of high-quality passive mobile sensing data and to support the development of predictive machine learning algorithms capable of identifying risk states for human well-being before they escalate into crises. In 2023 [7], the platform was reintroduced with significant improvements, enabling the collection of behavioral and interpersonal data through natural smartphone use, which enabled the collection of self-ratings known as ecological momentary assessments used in this research.

Previous research using EARS has explored various applications. For instance, one study examined the use of mobile sensing data to assess stress by analyzing affective language captured via smartphone keyboards [4]. Another study investigated the role of friendship quality and well-being in adolescence [9], concluding that adolescents who experienced more positive affect also reported more positive characteristics of close friendships two hours later.

In the present study, profiles of adolescents were identified using EMA variables, resulting in four distinct groups. These profiles were subsequently analyzed with respect to survey data and passive mobile sensing data. The study was guided by the following research questions:
• What distinct adolescent profiles emerge from EMA-based composite variables?
• How are these profiles associated with demographic and psychosocial survey measurements (gender, academic achievement, perceived overuse of social media, and levels of depression, anxiety, and stress symptoms)?
• What patterns of mobile application use characterize the identified profiles?
The rest of the paper is organized in the following way: the second section describes materials and methods, the third section presents the results of the data analysis, and the fourth section offers a discussion of the results and a conclusion.

Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia. © 2025 Copyright held by the owner/author(s). http://doi.org/10.70314/is.2025.sikdd.29

2 Materials and methods
A sample of 77 Croatian high school students participated in this study. We employed three types of data: (1) survey data, (2) EMA data aggregated into six composite variables (confidence, attentiveness, positive and negative emotions related to friends, and overall positive and negative affect), and (3) passive mobile sensing data on mobile application usage. The survey data included respondents' gender, academic achievement (final grades of 3, 4, or 5), self-reported perception of overuse of social media (measured on a scale from 14 to 70), and symptoms of depression, anxiety, and stress (determined by the DASS-21 scale, each measured on a scale from 0 to 21 [1]). EMA data and passive mobile data were collected using the EARS application.
Within the framework of ecological momentary assessment (EMA), respondents reported on the quality of their close friendships and their affect seven times per day over the course of one week (i.e., up to 49 assessments). The assessment schedule followed a semi-random structure: respondents received questions at random intervals within 2-hour windows between 7 a.m. and 9 p.m. Only respondents who completed at least 10 out of 49 assessments were included in the analyses.

Figure 1. Groups obtained by the k-means algorithm projected onto the first two principal components of the composite EMA variables.

Friendship quality was measured with five items rated on a scale from 1 (not at all like me) to 7 (completely like me). All items were adapted from prior studies on close relationships [3, 5, 8]. Two composite variables were derived: PosFriendEmo, calculated as the average of three items related to positive friendship-related emotions, and NegFriendEmo, calculated as the average of items reflecting negative friendship-related emotions. The items related to positive friendship-related emotions were the following:
• "I feel that I can share some worries or secrets with my close friends."
• "I enjoy being with my close friends."
• "I have fun with my close friends."
The items related to negative friendship-related emotions included the following statements:
• "I feel that my close friends criticize me."
• "My close friends get on my nerves."

Figure 2. Mean values of standardized composite variables by groups.

Affect was measured with ten items on the same 7-point scale, adapted from [3]. Two composite variables were created: PosAffects (joyful, cheerful, happy, lively, proud) and NegAffects (guilty, angry, insecure, scared, sad, worried, ashamed), representing the mean values of the respective items.
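The aggregation of items into composite variables is a simple per-respondent averaging. The sketch below illustrates it with hypothetical item keys; the actual EARS item identifiers are not given in the paper.

```python
from statistics import mean

# Hypothetical item keys grouped by composite; the real identifiers are not in the paper.
COMPOSITES = {
    "PosFriendEmo": ["share_worries", "enjoy_friends", "fun_friends"],
    "NegFriendEmo": ["friends_criticize", "friends_on_nerves"],
    "PosAffects":   ["joyful", "cheerful", "happy", "lively", "proud"],
    "NegAffects":   ["guilty", "angry", "insecure", "scared", "sad", "worried", "ashamed"],
}

def composites(responses: dict) -> dict:
    """Average the 1-7 item ratings belonging to each composite variable."""
    return {name: mean(responses[i] for i in items) for name, items in COMPOSITES.items()}
```

In the study these averages were additionally aggregated across each respondent's (up to 49) assessments before clustering; that second averaging step works the same way, one level up.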
In addition, a composite variable Confident was formed from three items related to peer popularity, self-satisfaction, and body satisfaction, while a composite variable Attentive was formed from five items reflecting responsibility, caring for others, perceived adult support, readiness for schoolwork, and perceived teacher support.

Regarding passive data, respondents used a total of 927 applications, which were categorized into 16 groups. Of these, 11 categories were included in the analysis, while the remaining five were excluded due to their negligible usage time. Each app's categorization was performed using generative AI tools (Google Bard and ChatGPT) based on app functionality. The initial classification was then manually verified through the app's official website to confirm its primary function. Besides the variables related to the usage of the 11 observed categories of mobile apps, a variable reflecting the total time spent on the mobile phone (Total passive) was also included in the analysis. The analyzed categories included: Tools and productivity, Social media, Music and audio, Games, Communication, Multimedia, Education and learning, Online shopping and services, Travel, Device management, and Entertainment.

Figure 3. Proportion of respondents by group and gender (male, female, I'd rather not say).

Profiles of adolescents were identified using k-means clustering applied to the standardized composite EMA variables. Based on the interpretability of the resulting clusters, the model with four groups was selected.

Data analysis was conducted using the R statistical software. Group differences were tested using the non-parametric Kruskal-Wallis test, followed by Dunn's post hoc test. Non-parametric tests were applied because the analyzed variables were not normally distributed. For the analysis of the dependency between groups and their school success, a chi-square test was used.
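The profiling step, clustering standardized composites with k-means, can be sketched without external libraries. The authors used R; the following is an illustrative stdlib re-implementation of z-score standardization and Lloyd's algorithm, not their code.

```python
import random
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column, as done before clustering the composite variables."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c) or 1.0) for c in cols]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

def kmeans(rows, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns a cluster label per row."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(row, centers[j])))
                  for row in rows]
        for j in range(k):
            members = [row for row, l in zip(rows, labels) if l == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = [mean(col) for col in zip(*members)]
    return labels
```

In the study, each row would hold a respondent's six standardized composites and k would be chosen by inspecting the interpretability of the resulting clusters (four in the paper); the PCA projection used for Figure 1 is a separate visualization step.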
3 Results
Figure 5 presents the distribution of daily time (in seconds) that respondents spent using different categories of mobile applications across the groups. No statistically significant differences were found in the median time spent on social media or in the total time spent across all application categories. Group 1, which showed the highest median values for the composite variables Confidence, Attentiveness, and positive friendship-related emotions, also reported spending the most time on social media; however, their perception of social media overuse was the lowest among all groups. Group 3, characterized by near-average median values of Confidence, Attentiveness, positive and negative friendship-related emotions, and affect, demonstrated the highest median usage across most application categories (Tools and productivity, Music and audio, Games, Communication, Education and learning, Travel, Device management, and Entertainment). The Kruskal-Wallis test revealed a significant difference in application use only for the Education and learning category, although Dunn's post hoc test did not confirm differences between specific group pairs. Respondents in Group 4 had the highest median usage of Multimedia applications, while those in Group 2 spent the most time on applications related to Shopping and services. Notably, respondents in Group 2 were predominantly male and reported the highest perceived overuse of social media among all groups.

Figure 4. Box-plots for the variables of self-assessed overuse of social media (ovdr, 14-70), level of symptoms of depression (dep, 0-21), level of symptoms of anxiety (anks, 0-21), and level of symptoms of stress (stres, 0-21).

School success was measured by the average grade point, which was 4.05 for Group 1, 4.33 for Group 2, 4.61 for Group 3, and 4.20 for Group 4. Figure 1 shows the groups of respondents obtained by the k-means algorithm projected onto the first two principal components of
composite EMA variables. Figure 2 illustrates the mean values of the composite variables across the groups. Two related pairs of groups can be observed: Groups 1 and 4, and Groups 2 and 3. Groups 1 and 4 display nearly mirror-image profiles with respect to the x-axis. For Group 1, the composite variables Confident, Attentive, PosFriendEmo, and PosAffects are above average, whereas in Group 4 these same variables fall below average. Conversely, NegFriendEmo and NegAffects are below average for Group 1 but above average for Group 4. A similar pattern emerges for Groups 2 and 3, which also show mirror-image profiles, though shifted slightly toward above-average values. Group 3 is characterized by nearly average levels of Confident, Attentive, and PosFriendEmo, while NegFriendEmo, PosAffects, and NegAffects are slightly below average. In contrast, Group 2 demonstrates above-average mean values across all variables. Overall, emotions related to friendships and affective states are less pronounced in Groups 2 and 3 compared to Groups 1 and 4.

The chi-square test indicated a borderline non-significant difference in school success across the groups (p=0.0501). Group 3, which showed the highest median time of application use across most categories, also achieved the highest average grade point (4.61). In contrast, Group 1, which reported the highest levels of confidence and attentiveness in the EMA (including perceived readiness for school tasks), had the lowest average grade point.

4 Discussion and conclusion
This study identified four adolescent profiles based on data collected from 77 Croatian high-school students using EMA. The data collected from EMA were aggregated across respondents into six composite variables representing their self-reported confidence, attentiveness, positive and negative friendship-related emotions, and positive and negative affect.
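The group comparisons above rely on the Kruskal-Wallis test, which the authors ran in R. As a minimal stdlib sketch of what that test computes, the code below derives the H statistic from pooled average ranks; tie correction and the chi-square p-value lookup are omitted for brevity.

```python
def ranks(values):
    """Average ranks (1-based), with tied values sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (without tie correction) for k independent groups."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    r = ranks(pooled)
    h, start = 0.0, 0
    for g in groups:
        rsum = sum(r[start:start + len(g)])
        h += rsum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * h - 3.0 * (n + 1)
```

Under the null hypothesis H is approximately chi-square distributed with k-1 degrees of freedom, which is how p-values such as p=0.0045 for the depression scores are obtained in practice.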
Two pairs of Figure 3 shows that female respondents predominate in mirror-image profiles were observed: Groups 2 and 3, and Groups 3 and 4, in Group 1 there is approximately an equal Groups 1 and 4. Emotional states related to friendships and proportion of male and female respondents, while in Group 2 affective states are less pronounced in Groups 2 and 3 compared predominate male respondents. Figure 4 presents the distribution to other pair of groups, and these groups are characterized by of survey-based variables: self-assessment of overuse of social better academic success. 21), anxiety ( Mobile sensing revealed that respondents used a total of 927 apps, anks media (ovdr, 14-70), level of symptoms of depression (dep, 0- , 0-21), and stress (stres, 0-21). Group 4 exhibits the highest levels of symptoms of depression, anxiety, which were categorized into 16 categories, out of which 11 were and stress. According to the non-parametric Kruskal-Wallis test, analyzed in this study. Although social media accounted for the there is a significant difference between the groups in symptoms largest share of usage time, no significant group differences were of depression (p=0.0045) and stress (p=0.0162). The Dunn’s post found either in social media use or in total application use. Group hoc test indicated that Group 4 has statistically significant higher 1, according to self-perception, exhibited the most confident and levels of symptoms of depression (p=0.0015) and stress attentive and has lowest median levels of depression, anxiety and (p=0.0090) compared to Group 1. The Kruskal-Wallis test shows stress, spent the most time on social media, but perceived its that there is a difference in the perception of overuse of social overuse the least. This group contains approximately an equal media between the groups (p=0.0024). The highest perceived proportion of male and female respondents. 
Group 2, which was overuse was reported by Group 2, with a significant difference compared to Group 3 (p=0.0021) and Group 1 (p=0.0040). predominantly male, spent the most time on Online shopping and Results indicate that respondents’ perceptions of their social services and reported the highest perceived overuse of social media use did not correspond to the actual time spent on social media, with significant differences compared to Group 1 media (r = 0.0741). (p=0.0040) and Group 3 (p=0.0021). 120 . Figure 5. Box-plots for variables of daily usage of categories of mobile applications by groups (in seconds). Note the different ordinal scales due to the large differences in the use of apps. Group 3, which had the highest academic achievements and the mobile assessment (P.R.O.T.E.C.T.) research project, founded majority of female respondents, had the highest usage of by the Croatian Science Foundation (UIP-2020-02-2852). applications in categories Tools and productivity, Music and audio, Games, Communication, Education and learning, Travel, References Device management, and Entertainment. Group 4, also [1] Antony, M. M., Bieling, P. J., Cox, B. J., Enns, M. W., & Swinson, R. P. predominantly female, exhibited the highest levels of depression, 1998. Psychometric properties of the 42-item and 21-item versions of the Depression Anxiety Stress Scales in clinical groups and a community anxiety, and stress symptoms, spent the least time on social sample. Psychological Assessment, 10(2), 176–181. media, used Multimedia applications more than other groups, and https://doi.org/10.1037/1040-3590.10.2.176 ranked second in the use of [2] Billard, L., Diday, E. 2007. Symbolic Data Analysis: Conceptual Education and learning applications. Statistics and Data Mining, 1st edition, Wiley Importantly, there was no significant correlation between [3] Bülow, A., van Roekel, E., Boele, S., Denissen, J.J.A. and Keijsers, L.. 
perceived overuse of social media by respondents and their actual 2022. Parent –adolescent interaction quality and adolescent affect: An time spent using it, as measured by passive sensing. This finding experience sampling study on effect heterogeneity. Child Development, 93(3), 315-331, DOI: https://doi.org/10.1111/cdev.13733 highlights the added value of combining mobile sensing with [4] Byrne, M.L., Lind, M.N., Horn, S.R., Mills, K.L., Nelson, B.W., Barnes, survey data, as it provides insights that would not be captured M.L., Slavich, G.M. and Allen, N.B. 2012. Using mobile sensing data to assess stress: Associations with perceived and lifetime stress, mental through self-report alone. While symptoms of depression, health, sleep, and inflammation, Digital Health. 2021:7, DOI: anxiety, and stress were assessed on a 0-21 scale, all median 10.1177/20552076211037227 [5] Li, L.M.W., Chen, Q., Gao, H., Li, W.Q. and Ito, K.2021. Online/offline values were below 10, reflecting the general population sample self-disclosure to offline friends and relational outcomes in a diary in which the prevalence of psychological problems is low. Future school: The moderating role of self-esteem and relational closeness. International Journal of Psychology , 56(1), 129-137, DOI: research could therefore focus on adolescents with higher levels https://doi.org/10.1002/ijop.12684 of depression, anxiety, and stress symptoms. [6] Lind, M.N., Byrne, M.L., Wicks, G., Smidt, A.M., Allen, N.B., 2018. In addition, future work will explore the application of symbolic The Effortless Assessment of Risk States (EARS) Tool: An Interpersonal Approach to Mobile Sensing, JMIR Ment. Health, 2018; 5(3):e10334, data analysis for clustering based on both EMA and mobile DOI: 10.2196/10334. sensing data. Symbolic data analysis, developed for the study of [7] Lind, M. N., Kahn, L. E., Crowley, R., Read, W., Wicks, G., Allen, N. complex and large-scale datasets, incorporates variability B., 2023. 
Reintroducing the Effortless Assessment Research System (EARS), JMIR Ment. Health, 2023; 10:e38920, DOI: 10.21196/38920. directly into the aggregation process [2]. This approach would [8] Ng, Y.T., Huo, M., Gleason, M.E., Neff, L.A., Charles, S.T. and allow us to account for the stability of emotional states and Fingerman, K.L. 2021. Friendship in old age: Daily encounters and emotional well-being. The Journals of Gerontology: Series B, 76(3), behavioral patterns at the individual level, potentially offering 551-562, DOI: https://doi.org/10.1093/geronb/gbaa007 more refined indicators for defining adolescent profiles. [9] Šutić, L., van Roekel, E. and Novak, M. 2025. Quality of friendships and well-being in adolescence: daily life study. International Journal of Adolescence and Youth, 30(1), DOI: Acknowledgments https://doi.org/10.1080/02673843.2025.2467112 This study was conducted as a part of the Testing the 5C framework of positive youth development: traditional and digital 121 Brazil’s First AI Regulatory Sandbox: Towards Responsible Innovation Cristina Godoy Oliveira† Joao Paulo Candia Veiga Vasilka Sancin Joao Pita Costa CIAAM, C4AI, Univ. of São Paulo CIAAM, C4AI, Univ. of São Paulo Faculty of Law, Univ. of Ljubljana IRCAI, Quintelligence São Paulo, Brazil São Paulo, Brazil Ljubljana, Slovenia Ljubljana, Slovenia cristinagodoy@usp.br candia@usp.br vasilka.sancin@pf.uni-lj.si joao.pitacosta@ircai.org Rafael Meira Silva Maša Kovič Dine Lucas Costa dos Anjos Thiago Gomes Marcilio, CIAAM, C4AI, Univ. of São Faculty of Law, Univ. of Ljubljana Faculty of Law, Univ. of Juiz Anthony C. de Novaes Silva Paulo Ljubljana, Slovenia de Fora CIAAM, C4AI, Univ. 
masa.kovic-dine@pf.uni-lj.si, rafael_meira@alumni.usp.br, lucas.anjos@anpd.gov.br, tgm.marcilio@gmail.com, anthonycharles.silva@outlook.com

Abstract / Povzetek
As artificial intelligence technologies rapidly evolve, regulatory sandbox initiatives have emerged as crucial tools for promoting responsible AI development, enabling innovation while safeguarding fundamental rights and public interests. This paper analyzes the development and implications of Brazil's first AI regulatory sandbox, with a particular focus on the model established by SUSEP (Superintendence of Private Insurance). Designed as a controlled environment for testing innovative products and services in the insurance sector, the SUSEP sandbox illustrates how regulatory flexibility can foster technological advancement, financial inclusion, and market efficiency while maintaining consumer protection and risk oversight. Developed under Brazil's Economic Freedom Law, the sandbox has evolved through three editions (2020, 2021, and 2024), prioritizing both sustainable and technological projects. This study explores the sandbox's structure, eligibility criteria, business plan requirements, operational limitations, and transition mechanisms for companies seeking permanent licensure. It also identifies actionable insights for future regulatory frameworks, particularly for the National Data Protection Authority (ANPD) as Brazil advances toward AI-specific governance. By comparing the sandbox's legal foundations, selection processes, and risk mitigation protocols with international best practices, this paper underscores the sandbox's role as a blueprint for responsible AI regulation in emerging markets.

Keywords / Ključne besede
Regulatory Sandboxes, Artificial Intelligence Governance, Data Protection, Innovation Policy, Brazilian AI Regulation

† Corresponding author

1 Introduction
The rapid evolution of artificial intelligence (AI) has prompted urgent global discussions about governance frameworks that can both stimulate innovation and mitigate potential risks. Around the world, regulators are grappling with how to manage AI systems that are increasingly impacting critical sectors, such as finance, healthcare, education, and public administration. While countries in Europe have taken the lead in formalizing AI-specific legislation (most notably through the European Union's AI Act), many nations in the Global South, including those in South America, are only beginning to articulate coherent regulatory approaches. In Europe, the EU AI Act represents the first comprehensive legal framework for AI, categorizing applications by risk level and imposing strict requirements for high-risk systems. It introduces transparency, accountability, and human oversight obligations, while also fostering innovation through mechanisms such as regulatory sandboxes. This structured and anticipatory approach reflects Europe's long-standing tradition of precautionary regulation and data protection, rooted in the General Data Protection Regulation (GDPR) and in succeeding regulations and standards, such as the upcoming European AI Sandbox Act that will further extend Article 57 of the European AI Act, focusing on AI sandboxes in Europe. By contrast, AI regulation in Brazil and South America remains fragmented, preliminary, and largely reactive. In Brazil, multiple legislative proposals have been introduced in Congress, but no comprehensive AI law has yet been enacted. The country's current approach relies on a patchwork of sectoral regulations, soft law instruments, and the foundational framework provided by the General Data Protection Law (Lei Geral de Proteção de Dados - LGPD).
While the LGPD is a significant step forward in regulating personal data and algorithmic decision-making, it does not address the broader ethical, operational, and societal challenges posed by AI systems. Regionally, South American countries exhibit a similar lack of uniformity. Argentina, Chile, and Colombia have published national AI strategies or draft policy guidelines, yet most remain in early implementation phases. Regulatory oversight is often spread across multiple agencies, and few jurisdictions have adopted binding legal norms beyond data protection. In this landscape, Brazil stands out as a potential regional leader, particularly through initiatives such as the National Artificial Intelligence Strategy (Estratégia Brasileira de Inteligência Artificial – EBIA) [1], the Brazilian Artificial Intelligence Plan (Plano Brasileiro de Inteligência Artificial – PBIA) [2], and the growing role of the ANPD.

This paper argues that regulatory sandboxes (flexible, supervised environments for testing innovative solutions) offer a pragmatic and context-sensitive tool for advancing AI governance in Brazil and Latin America. In particular, the experience of the SUSEP Regulatory Sandbox, an experimental regulatory environment created by the Superintendence of Private Insurance (SUSEP) [3] and designed for the insurance market, provides a valuable model for structuring oversight of emerging technologies. Through an in-depth analysis of the SUSEP sandbox, this research explores how key regulatory principles, such as proportionality, transparency, risk management, and sustainability, can inform the development of Brazil's first AI sandbox. In doing so, this study contributes to ongoing policy debates about how developing economies can chart their own paths in AI governance, drawing lessons from both global benchmarks and local regulatory experiments. Moreover, it feeds into the ongoing collaboration with the different stakeholders in the development of the Slovenian AI Sandbox initiative, in the hope of a constructive exchange based on good practices and AI regulation perspectives.

2 Methodology
The SUSEP Regulatory Sandbox is an experimental regulatory environment established to enable the implementation of innovative projects that offer products and/or services in the insurance market. These innovations are developed or offered using new methodologies, processes, or procedures, or by applying existing technologies in a novel way. Companies participating in the sandbox can test, under supervision, new products, new services, or new ways of providing traditional services. SUSEP assesses the benefits and risks associated with each innovation and determines whether adjustments are needed, either to the business model or to existing regulations.

The ANPD's Regulatory Sandbox, on the other hand, is structured to comprehensively evaluate the technical, legal, ethical, and social dimensions of AI-based projects involving personal data. It adopts a multidisciplinary approach for AI, encompassing organizational, technological, and regulatory aspects. Participants are required to present a detailed description of the problem or opportunity addressed by their project, highlighting the current context, challenges, and expected benefits, such as innovation and efficiency. The methodology emphasizes the innovative aspects of the solution, the processing of personal data in AI systems, the social impact, and the intended outcomes.

A core component of the methodology is the implementation of algorithmic transparency measures. Applicants must describe how their systems will make algorithmic logic, decisions, and criteria understandable to end users. This includes the use of explainable AI (XAI) tools, audit reports, documentation, and dashboards, as well as practices for data traceability and decision accountability. The methodology also requires information on compliance with the LGPD, such as data minimization, risk management, mitigation of algorithmic bias, governance mechanisms, and respect for data subject rights. Projects must show alignment with ethical and legal standards to ensure responsible AI development.

In terms of data methodology, applicants must describe the lifecycle of the personal data used, including its origin, collection, processing, storage, and disposal. In addition, data quality is crucial, and applicants must describe it to demonstrate that they are well prepared to participate in the regulatory sandbox. A preliminary data protection impact assessment must be included, along with a risk matrix that identifies potential harms to data subjects and proposes mitigation strategies. The form also assesses the technical feasibility of the project by requiring information on the IT infrastructure (cloud, hybrid, on-premises), API data flows, outsourcing arrangements, LLM usage, and cybersecurity controls. Financial planning (FinOps), scalability, social impact assessment, and performance metrics are also critical elements of the methodology.

Finally, organizations must consolidate their identified risks and mitigation measures into a summary framework, ensuring transparency and accountability throughout the project lifecycle.

3 Legislation, Regulation, and Ethical Use: Objectives and Priorities
When the SUSEP Sandbox was launched, it was part of a joint initiative involving the financial, insurance, and capital markets, led by the Central Bank of Brazil (BCB), the Securities and Exchange Commission (CVM), and SUSEP. The SUSEP Sandbox was established during the Bolsonaro administration, in alignment with the Economic Freedom Law (Law No. 13,874/2019) and broader deregulation efforts. There have been three editions so far, in 2020, 2021, and 2024 [4], with the 2024 edition currently open for an indefinite period. The SUSEP Sandbox is governed by CNSP Resolution No. 381/2020, as amended, along with SUSEP Circular No. 598/2020 and specific public notices for each edition. The National Private Insurance Council (CNSP) sets the rules for the insurance market, and SUSEP ensures compliance.

In the 2024 edition of the SUSEP Regulatory Sandbox, participating companies were required to submit detailed information and upload relevant documents through Brazil's Electronic Information System (SEI). The sandbox was designed to: (i) stimulate competition to improve efficiency; (ii) promote financial inclusion; (iii) encourage capital formation and efficient resource allocation; and (iv) develop and deepen the Brazilian insurance market.

SUSEP prioritized proposals classified by the applicants themselves as either Sustainable or Technological projects:
• Sustainable Projects: Aligned with SUSEP and CNSP rules, as well as the Federal Government's Ecological Transformation Plan. These initiatives must deliver climate, environmental, or social benefits to policyholders, beneficiaries, or society as a whole.
• Technological Projects: Promote the development of innovative technology by introducing technological novelties or enhancements to products, services, business models, or processes, thereby adding functionality or quality improvements.

Regarding the eligibility criteria for startups (insurtechs), applicants were required to offer an innovative product or service and to operate via remote/digital platforms. They had to demonstrate the novelty of their technology or its creative application and present the solution at a development stage suitable for temporary authorization. Moreover, they had to submit a business plan, which included a risk assessment specifically addressing cybersecurity, and a damage mitigation plan. Besides the typical proposed and current legal/trade names, organizational structure, and director profiles, the business plan had to include strategic objectives, company history, mission, and vision, along with a problem statement and market/consumer benefits, a proof of concept of the product or service, and a demonstration of potential cost reduction for consumers, if any. It also described a comparative analysis with existing offerings, the target market and geographic scope, risk factors and mitigation strategies, the technical architecture and operational model, the justification for the Priority Project classification, and the sustainability policy. The selection process involved two stages: (i) a Selection Phase, with a video interview with SUSEP; and (ii) a Temporary Authorization Phase, with a follow-up interview and submission of evidence proving compliance with normative requirements and completion of corporate formalities, as well as the appointment of a director responsible for sandbox participation and documentary evidence attesting to the lawful origin of funds contributed by investors.

4 Discussion of initial results
The 2024 edition of this initiative included four companies that were granted permanent licenses (by September), while 32 projects were selected, amongst which 21 received temporary authorization (by April). Authorized companies were required to transmit operational data to SUSEP via API. While in the sandbox, companies:
• can only sell approved types of insurance,
• operate under capped risk exposure, and
• face limits on claims payouts.

Given the similarities between insurance regulation and data protection governance, several SUSEP sandbox practices could inform the design of an AI sandbox under Brazil's National Data Protection Authority (ANPD), such as:
1. Innovation focus – Projects must demonstrate clear novelty or novel applications of technologies, methods, and procedures.
2. Sustainability integration – For AI, this could include energy, water, and natural resource efficiency, environmental impact, and ethical safeguards.
3. Defined operational boundaries – Limitations on AI use cases, affected populations, and permitted risk categories.
4. Mandatory submissions – A risk analysis and mitigation plan, a business plan, and funding source verification.
5. AI registry – Formal registration with the ANPD, with authorizations subject to revocation.
6. Virtual interviews – Ensuring nationwide accessibility.
7. Exit strategy – A clear post-sandbox transition plan for continued compliance.

In Phase 1 of the ANPD's regulatory sandbox selection process, whose application period closed on August 25, 2025, additional points will be awarded to startups, public sector organizations, and companies developing generative AI solutions. These categories were identified as strategic priorities for Brazil: startups are legally recognized in the Brazilian Innovation Framework [5] as key beneficiaries of sandbox initiatives; public sector organizations often develop socially impactful solutions and are expected to sustain participation without financial or technical aid from the ANPD; and Brazil has an explicit national interest in fostering large language models (LLMs) in Portuguese as part of its broader AI sovereignty strategy.

As part of the application process, the ANPD's form required that any confidential or sensitive business information be clearly marked as such by the applicants. This provision is necessary due to Brazil's Freedom of Information Law (Lei de Acesso à Informação – LAI), which mandates public disclosure unless a legal exception is claimed. Without this explicit classification, all submitted materials may be treated as public, potentially exposing strategic or proprietary information from participating firms.

To enhance visibility and inclusiveness, the ANPD also adopted a multi-channel outreach strategy, disseminating the call for applications through official platforms and with the support of civil society organizations. To maximize participation, the deadline for applications was extended by an additional 15 days, although the overall schedule for evaluation and publication of results remained unchanged. The final list of selected participants is scheduled to be released on October 2, 2025, as originally planned.

Finally, there is another point of flexibility, not expressly codified, which is the absence of a fixed taxonomy of sandboxes. For example, the SUSEP sandbox has an innovative character, seeking to make regulations more flexible while the service is already being used in the market. In contrast, the ANPD sandbox aims to provide the regulator with knowledge that enables the preventive, rather than reactive, updating of market rules. Oversight may be distributed among agencies like SUSEP, yet the regulatory status of AI companies post-sandbox remains unclear. For this reason, the ANPD must establish both sandbox-specific rules and post-sandbox AI regulations, ensuring long-term supervision and market stability.

The importance of embedding responsible and ethical principles in AI governance is particularly acute in Brazil and across South America, where technological innovation intersects with social inequality, fragile institutions, and diverse regulatory capacities. By prioritizing transparency, accountability, and fairness in AI systems, these countries can foster public trust while mitigating risks of discrimination, exclusion, or misuse of personal data. Brazil's initiatives, such as its National AI Strategy (EBIA), the forthcoming AI legal framework, and the regulatory sandbox programs led by SUSEP and the ANPD, illustrate how developing nations can create adaptive governance models that balance innovation with fundamental rights. Moreover, as the largest economy in Latin America, Brazil is well-positioned to serve as a regional benchmark, showing how ethical AI practices can promote financial inclusion, strengthen democratic values, and encourage sustainable development. In this sense, South America's experience underscores that responsible AI is not a luxury for advanced economies but a prerequisite for equitable technological progress in the Global South.

5 Conclusions and further work
The ANPD's regulatory sandbox demonstrates Brazil's commitment to experimental and responsible governance of AI. By ensuring transparency through a public information portal, addressing confidentiality in accordance with the Access to Information Law, and promoting inclusive engagement, the initiative aligns with international standards. Drawing on frameworks such as the OECD's recommendations and the EU's AI Act, which formally includes regulatory sandboxes, the Brazilian approach reinforces the importance of embedding such mechanisms into national legislation. In the context of Bill 2338/2023, under debate in the Chamber of Deputies to regulate AI in Brazil [6], regulatory sandboxes emerge as strategic tools to enable adaptive, participatory, and context-aware AI regulation.

The Brazilian AI sandbox experience also carries significant relevance beyond Brazil and South America, offering valuable insights for other developing countries and even jurisdictions with more advanced regulatory frameworks, such as Europe. While the European Union has already institutionalized AI sandboxes within the AI Act, the Brazilian model demonstrates how experimental, flexible, and context-sensitive approaches can be adapted to environments where regulatory structures are less consolidated. Its emphasis on transparency, proportionality, and multi-stakeholder participation shows that effective governance does not require fully mature institutions but rather innovative mechanisms that align local priorities with global best practices. By proving that responsible innovation can be pursued within resource-constrained and diverse legal settings, the Brazilian sandbox contributes to a global dialogue on AI governance, helping countries at different stages of regulatory development to tailor sandbox initiatives to their specific socio-economic and institutional realities.

References / Literatura
[1] MCTI (2021). Brazilian Strategy of Artificial Intelligence. [Online]. Available: ebia-documento_referencia_4-979_2021.pdf (www.gov.br)
[2] PBIA (2024). Brazilian Artificial Intelligence Plan. [Online]. Available: https://www.gov.br/mcti/pt-br/acompanhe-o-mcti/noticias/2024/07/plano-brasileiro-de-ia-tera-supercomputador-e-investimento-de-r-23-bilhoes-em-quatro-anos/ia_para_o_bem_de_todos.pdf/view
[3] Brazilian Government Portal (2025). About SUSEP. [Online]. Available: https://www.gov.br/susep/pt-br/acesso-a-informacao/institucional/sobre-a-susep
[4] Brazil (2019). Joint statement: coordinated action to implement a regulatory sandbox regime in the Brazilian financial, securities, and capital markets. [Online]. Available: https://www.gov.br/susep/pt-br/central-de-conteudos/noticias/2022/noticia
[5] Brazil (2021). Complementary Law No. 182, of June 1, 2021. Establishes the Legal Framework for Startups and Innovative Entrepreneurship. [Online]. Available: planalto.gov.br/ccivil_03/leis/lcp/lcp182.htm
[6] Brazil (2023). Bill No. 2338, of 2023. Establishes the legal framework for artificial intelligence in Brazil. [Online]. Available: https://www.camara.leg.br/proposicoesWeb/prop_mostrarintegra?codteor=2868197&filename=PL%202338/2023

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
http://doi.org/10.70314/is.2025.sikdd.13

Indeks avtorjev / Author index
Anjos Lucas Costa dos ... 122
Barrionuevo Leonardo ... 98
Bašić Nino ... 102
Batagelj Vladimir ... 102
Brank Janez ... 11
Calcina Erik ... 7
Camlek Neca ... 29
Caporusso Jaya ... 19
Cek Rok ... 82
Ćetković Marija ... 65
Čibej Jaka ... 53
Costa João Pita ... 45, 98, 122
Debeljak Žiga ... 57
Dine Masa Kovic ... 122
Dobša Jasminka ... 118
Forcolin Margherita ... 82
Frattini Matteo ... 49
Grobelnik Adrian Mladenić ... 15, 25
Grobelnik Marko ... 11, 15, 25, 29, 73
Guček Alenka ... 25, 69
Guo Zhenyu ... 33
Hegler Živa ... 29
Hosseini Seyed Iman ... 86
Hrib Ivo ... 110, 114
Jakomin Martin ... 106
Jelenčič Jakob ... 29
Jeršek Domen ... 49
Kassis Rayan ... 45
Kavšek Branko ... 94
Kenda Klemen ... 49, 57, 77, 82, 86
Kladnik Matic ... 41
Klančič Rok ... 49
Kochovska Sofija ... 94
Kocjančič Oskar ... 37
Korenjak-Černe Simona ... 118
Kozamernik Lučka ... 106
Krumpak Roy ... 33
Lamgari Asmai ... 45
Leonardi Linda ... 82
Leskovec Gašper ... 77
Ma Xiang ... 33
Marcilio Thiago Gomes ... 122
Mladenić Dunja ... 7, 11, 29, 33, 41, 57, 61, 73, 86
Mochariq Ouidad ... 45
Mylonas Costas ... 77
Novak Erik ... 7
Novak Miranda ... 118
Novalija Inna ... 11, 33, 61, 73
Oliveira Cristina Godoy ... 122
Pandur Maja Buhin ... 118
Pavlova Daria ... 61, 90
Pisanski Tomaž ... 102
Pollak Senja ... 19
Polzer Mirozlav ... 98
Purver Matthew ... 19
Rahmani Yousef ... 45
Roman Dumitru ... 33
Rožanec Jože M. ... 33
Sancin Vasilka ... 122
Savnik Iztok ... 102
Silva Anthony Novaes ... 122
Silva Rafael Meira ... 122
Sittar Abdul ... 69
Škrjanc Maja ... 73, 114
Škrlj Blaž ... 106
Slavec Ana ... 102
Smiljanic Mateja ... 69
Song Tao ... 33
Souss Sohaib ... 45
Stopar Luka ... 45
Šturm Jan ... 73, 114
Šutić Lucija ... 118
Topal Oleksandra ... 73, 82, 114
Tošić Aleksandar ... 65
Trajkov Georgi ... 15
Urbančič Jasna ...
106 Vake Domen ................................................................................................................................................................................. 65 Veiga João Cândia ................................................................................................................................................................ 98, 122 Vičič Jernej ................................................................................................................................................................................... 94 Zajec Patrik ................................................................................................................................................................................ 110 Zaouini Mustafa ........................................................................................................................................................................... 45 Žnidaršič Martin ........................................................................................................................................................................... 37 Žust Martin ................................................................................................................................................................................... 25 128 Odkrivanje znanja in podatkovna skladišča SiKDD Data Mining and Data Warehouses SiKDD Urednika l Editors: Dunja Mladenić Marko Grobelnik