8.–9. oktober 2025 | 8–9 October 2025, Ljubljana, Slovenia

IS 2025 INFORMACIJSKA DRUŽBA | INFORMATION SOCIETY

Slovenska konferenca o umetni inteligenci | Slovenian Conference on Artificial Intelligence

Zbornik 28. mednarodne multikonference, Zvezek A | Proceedings of the 28th International Multiconference, Volume A

Uredniki | Editors: Mitja Luštrek, Matjaž Gams, Rok Piltaver

Zbornik 28. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2025, Zvezek A
Proceedings of the 28th International Multiconference INFORMATION SOCIETY – IS 2025, Volume A

Slovenska konferenca o umetni inteligenci
Slovenian Conference on Artificial Intelligence

Uredniki / Editors: Mitja Luštrek, Matjaž Gams, Rok Piltaver

http://is.ijs.si

8.–9. oktober 2025 / 8–9 October 2025, Ljubljana, Slovenia

Uredniki:
Mitja Luštrek, Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana
Matjaž Gams, Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana
Rok Piltaver, Outfit7, Ljubljana

Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič, uporabljena slika iz Pixabay
Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2025

Informacijska družba, ISSN 2630-371X
DOI: https://doi.org/10.70314/is.2025.skui
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 255435267
ISBN 978-961-264-319-5 (PDF)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2025

28. mednarodna multikonferenca Informacijska družba se odvija v času izjemne rasti umetne inteligence, njenih aplikacij in vplivov na človeštvo. Vsako leto vstopamo v novo dobo, v kateri generativna umetna inteligenca ter drugi inovativni pristopi oblikujejo poti k superinteligenci in singularnosti, ki bosta krojili prihodnost človeške civilizacije. Naša konferenca je tako hkrati tradicionalna znanstvena in akademsko odprta, pa tudi inkubator novih, pogumnih idej in pogledov. Letošnja konferenca poleg umetne inteligence vključuje tudi razprave o perečih temah današnjega časa: ohranjanje okolja, demografski izzivi, zdravstvo in preobrazba družbenih struktur. Razvoj UI ponuja rešitve za številne sodobne izzive, kar poudarja pomen sodelovanja med raziskovalci, strokovnjaki in odločevalci pri oblikovanju trajnostnih strategij. Zavedamo se, da živimo v obdobju velikih sprememb, kjer je ključno, da z inovativnimi pristopi in poglobljenim znanjem ustvarimo informacijsko družbo, ki bo varna, vključujoča in trajnostna.

V okviru multikonference smo letos združili dvanajst vsebinsko raznolikih srečanj, ki odražajo širino in globino informacijskih ved: od umetne inteligence v zdravstvu, demografskih in družinskih analiz, digitalne preobrazbe zdravstvene nege ter digitalne vključenosti v informacijski družbi, do raziskav na področju kognitivne znanosti, zdrave dolgoživosti ter vzgoje in izobraževanja v informacijski družbi. Pridružujejo se konference o legendah računalništva in informatike, prenosu tehnologij, mitih in resnicah o varovanju okolja, odkrivanju znanja in podatkovnih skladiščih ter seveda Slovenska konferenca o umetni inteligenci. Poleg referatov bodo okrogle mize in delavnice omogočile poglobljeno izmenjavo mnenj, ki bo pomembno prispevala k oblikovanju prihodnje informacijske družbe. »Legende računalništva in informatike« predstavljajo domači »Hall of Fame« za izjemne posameznike s tega področja.
Še naprej bomo spodbujali raziskovanje in razvoj, odličnost in sodelovanje; razširjeni referati bodo objavljeni v reviji Informatica, s podporo dolgoletne tradicije in v sodelovanju z akademskimi institucijami ter strokovnimi združenji, kot so ACM Slovenija, SLAIS, Slovensko društvo Informatika in Inženirska akademija Slovenije.

Vsako leto izberemo najbolj izstopajoče dosežke. Letos je nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe prejel Niko Schlamberger, priznanje za raziskovalni dosežek leta pa Tome Eftimov. »Informacijsko limono« za najmanj primerno informacijsko tematiko je prejela odsotnost obveznega pouka računalništva v osnovnih šolah. »Informacijsko jagodo« za najboljši sistem ali storitev v letih 2024/2025 pa so prejeli Marko Robnik Šikonja, Domen Vreš in Simon Krek s skupino za slovenski veliki jezikovni model GaMS. Iskrene čestitke vsem nagrajencem!

Naša vizija ostaja jasna: prepoznati, izkoristiti in oblikovati priložnosti, ki jih prinaša digitalna preobrazba, ter ustvariti informacijsko družbo, ki koristi vsem njenim članom. Vsem sodelujočim se zahvaljujemo za njihov prispevek — veseli nas, da bomo skupaj oblikovali prihodnje dosežke, ki jih bo soustvarjala ta konferenca.

Mojca Ciglarič, predsednica programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD TO THE MULTICONFERENCE INFORMATION SOCIETY 2025

The 28th International Multiconference on the Information Society takes place at a time of remarkable growth in artificial intelligence, its applications, and its impact on humanity. Each year we enter a new era in which generative AI and other innovative approaches shape the path toward superintelligence and singularity — phenomena that will shape the future of human civilization. The conference is both a traditional scientific forum and an academically open incubator for new, bold ideas and perspectives. In addition to artificial intelligence, this year's conference addresses other pressing issues of our time: environmental preservation, demographic challenges, healthcare, and the transformation of social structures. The rapid development of AI offers potential solutions to many of today's challenges and highlights the importance of collaboration among researchers, experts, and policymakers in designing sustainable strategies. We are acutely aware that we live in an era of profound change, where innovative approaches and deep knowledge are essential to creating an information society that is safe, inclusive, and sustainable.

This year's multiconference brings together twelve thematically diverse meetings reflecting the breadth and depth of the information sciences: from artificial intelligence in healthcare, demographic and family studies, and the digital transformation of nursing and digital inclusion, to research in cognitive science, healthy longevity, and education in the information society. Additional conferences include Legends of Computing and Informatics, Technology Transfer, Myths and Truths of Environmental Protection, Knowledge Discovery and Data Warehouses, and, of course, the Slovenian Conference on Artificial Intelligence. Alongside scientific papers, round tables and workshops will provide opportunities for in-depth exchanges of views, making an important contribution to shaping the future information society. Legends of Computing and Informatics serves as a national »Hall of Fame« honoring outstanding individuals in the field.
We will continue to promote research and development, excellence, and collaboration. Extended papers will be published in the journal Informatica, supported by a long-standing tradition and in cooperation with academic institutions and professional associations such as ACM Slovenia, SLAIS, the Slovenian Society Informatika, and the Slovenian Academy of Engineering.

Each year we recognize the most distinguished achievements. In 2025, the Michie-Turing Award for lifetime contribution to the development and promotion of the information society was awarded to Niko Schlamberger, while the Award for Research Achievement of the Year went to Tome Eftimov. The »Information Lemon« for the least appropriate information-related topic was awarded to the absence of compulsory computer science education in primary schools. The »Information Strawberry« for the best system or service in 2024/2025 was awarded to Marko Robnik Šikonja, Domen Vreš and Simon Krek together with their team, for developing the Slovenian large language model GaMS. We extend our warmest congratulations to all awardees.

Our vision remains clear: to identify, seize, and shape the opportunities offered by digital transformation, and to create an information society that benefits all its members. We sincerely thank all participants for their contributions and look forward to jointly shaping the future achievements that this conference will help bring about.

Mojca Ciglarič, Chair of the Program Committee
Matjaž Gams, Chair of the Organizing Committee

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee: Vladimir Bajic (South Africa), Heiner Benking (Germany), Se Woo Cheon (South Korea), Howie Firth (UK), Olga Fomichova (Russia), Vladimir Fomichov (Russia), Vesna Hljuz Dobric (Croatia), Alfred Inselberg (Israel), Jay Liebowitz (USA), Huan Liu (Singapore), Henz Martin (Germany), Marcin Paprzycki (USA), Claude Sammut (Australia), Jiri Wiedermann (Czech Republic), Xindong Wu (USA), Yiming Ye (USA), Ning Zhong (USA), Wray Buntine (Australia), Bezalel Gavish (USA), Gal A. Kaminka (Israel), Mike Bain (Australia), Michela Milano (Italy), Derong Liu (Chicago, USA), Toby Walsh (Australia), Sergio Campos-Cordobes (Spain), Shabnam Farahmand (Finland), Sergio Crovella (Italy)

Organizing Committee: Matjaž Gams (chair), Mitja Luštrek, Lana Zemljak, Vesna Koricki, Mitja Lasič, Blaž Mahnič

Programme Committee: Mojca Ciglarič (chair), Marjan Heričko, Boštjan Vilfan, Bojan Orel, Borka Jerman Blažič Džonova, Baldomir Zajc, Franc Solina, Gorazd Kandus, Blaž Zupan, Viljan Mahnič, Urban Kordeš, Boris Žemva, Cene Bavec, Marjan Krisper, Leon Žlajpah, Tomaž Kalin, Andrej Kuščer, Niko Zimic, Jozsef Györkös, Jadran Lenarčič, Rok Piltaver, Tadej Bajd, Borut Likar, Toma Strle, Jaroslav Berce, Janez Malačič, Tine Kolenik, Mojca Bernik, Olga Markič, Franci Pivec, Marko Bohanec, Dunja Mladenič, Uroš Rajkovič, Ivan Bratko, Franc Novak, Borut Batagelj, Andrej Brodnik, Vladislav Rajkovič, Tomaž Ogrin, Dušan Caf, Grega Repovš, Aleš Ude, Saša Divjak, Ivan Rozman, Bojan Blažica, Tomaž Erjavec, Niko Schlamberger, Matjaž Kljun, Bogdan Filipič, Gašper Slapničar, Robert Blatnik, Andrej Gams, Stanko Strmčnik, Erik Dovgan, Matjaž Gams, Jurij Šilc, Špela Stres, Mitja Luštrek, Jurij Tasič, Anton Gradišek, Marko Grobelnik, Denis Trček, Nikola Guid, Andrej Ule

KAZALO / TABLE OF CONTENTS

Slovenska konferenca o umetni inteligenci / Slovenian Conference on Artificial Intelligence .......... 1
PREDGOVOR / FOREWORD .......... 3
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES .......... 5
Detecting Pollinators from Stem Vibrations Using a Neural Network / Ambrožič Žan, Bianco Lorenzo, Šturm Rok, Susič David, Smerkol Maj, Gradišek Anton .......... 7
Thermal Camera-Based Cognitive Load Estimation: A Non-Invasive Approach / Anžur Zoja, Slapničar Gašper, Luštrek Mitja .......... 11
A Critical Perspective on MNAR Data: Imputation, Generation, and the Path Toward a Unified Framework / Azad Fatemeh, Kukar Matjaž .......... 15
Utilizing Large Language Models for Supporting Multi-Criteria Decision Modelling Method DEX / Bohanec Marko, Rajkovič Uroš, Rajkovič Vladislav .......... 19
Landscape-Aware Selection of Constraint Handling Techniques in Multiobjective Optimisation / Cork Jordan, Andova Andrejaana, Krömer Pavel, Tušar Tea, Filipič Bogdan .......... 23
Explaining Deep Reinforcement Learning Policy in Distribution Network Control / Dobravec Blaž, Žabkar Jure .......... 27
Leveraging AI in Melanoma Skin Cancer Diagnosis: Human Expertise vs. Machine Precision / Herke Anna-Katharina .......... 31
Prediction of Root Canal Treatment Using Machine Learning / Jelenc Matej, Shulajkovska Miljana, Jurič Rok, Gradišek Anton .......... 35
Predictive Maintenance of Machines in LABtop Production Environment / Kocuvan Primož, Longar Vinko, Struna Rok .......... 39
Machine Learning for Cutting Tool Wear Detection: A Multi-Dataset Benchmark Study Toward Predictive Maintenance / Kolar Žiga, Comte Thibault, Hassani Yanny, Louvancour Hugues, Gams Matjaž .......... 43
Extracting Structured Information About Food Loss and Waste Measurement Practices Using Large Language Models: A Feasibility Study / Lukan Junoš, Inagawa Maori, Luštrek Mitja .......... 47
Eye-Tracking Explains Cognitive Test Performance in Schizophrenia / Marinković Mila, Žabkar Jure .......... 51
Data-Driven Evaluation of Truck Driving Performance with Statistical and Machine Learning Methods / Nemec Vid, Slapničar Gašper, Luštrek Mitja .......... 55
Automated Explainable Schizophrenia Assessment from Verbal-Fluency Audio / Rajher Rok, Žabkar Jure .......... 59
Mapping Medical Procedure Codes Using Language Models / Ratajec Mariša, Gradišek Anton, Reščič Nina .......... 63
AI-Enabled Dynamic Spectrum Sharing in the Telecommunication Sector – Technical Aspects and Legal Challenges / Rechberger Nina .......... 67
SmartCHANGE Risk Prediction Tool: Next-Generation Risk Assessment for Children and Youth / Reščič Nina, Jordan Marko, Kramar Sebastjan, Krstevska Ana, Založnik Marcel, van der Jagt Lotte, op den Akker Harm, Vastenburg Martijn, Di Giacomo Valentina, Mancuso Elena, Fenoglio Dario, Dominici Gabrielle, Luštrek Mitja .......... 71
GNN Fusion of Voronoi Spatial Graphs and City–Year Temporal Graphs for Climate Analysis / Romanova Alex .......... 75
Towards Anomaly Detection in Forest Biodiversity Monitoring: A Pilot Study with Variational Autoencoders / Susič David, Buchaillot Maria Luisa, Crozzoli Miguel, Builder Calum, Maistrou Sevasti, Gradišek Anton, Vukašinović Dragana .......... 79
Development of a Lightweight Model for Detecting Solitary-Bee Buzz Using Pruning and Quantization for Edge Deployment / Yagi Ryo, Susič David, Smerkol Maj, Finžgar Miha, Gradišek Anton .......... 83
Interpretable Predictive Clustering Tree for Post-Intubation Hypotension Assessment / Žugelj Tapia Estefanía, Kirn Borut, Džeroski Sašo .......... 87
Indeks avtorjev / Author index .......... 91

Zbornik 28. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2025, Zvezek A
Proceedings of the 28th International Multiconference INFORMATION SOCIETY – IS 2025, Volume A

Slovenska konferenca o umetni inteligenci
Slovenian Conference on Artificial Intelligence

Uredniki / Editors: Mitja Luštrek, Matjaž Gams, Rok Piltaver

http://is.ijs.si

8.–9. oktober 2025 / 8–9 October 2025, Ljubljana, Slovenia

PREDGOVOR SLOVENSKI KONFERENCI O UMETNI INTELIGENCI

Slovenska konferenca o umetni inteligenci se letos odvija v času, ko umetna inteligenca še naprej intenzivno prodira v znanost, industrijo in vsakdanje življenje, še nikoli tako hitro in tako koristno. Še vedno so v ospredju veliki jezikovni modeli, ki so svoje zmožnosti razumevanja in generiranja že uspešno razširili na zvok, slike in video. Zanimivo raziskovalno področje so tudi temeljni (angl. foundation) modeli za druge vrste podatkov – npr. senzorskih in bioloških, pa tudi takih za robotske akcije, ki jih je takisto mogoče povezati z jezikom. Ti modeli so posebej dragoceni v medicinskih raziskavah, kjer so že privedli do razvoja novih zdravilnih učinkovin. Tovrstne raziskave bodo morda privedle do modelov, ki bodo znali celostno razumevati svet in nanj tudi vplivati, kar močno diši po umetni splošni inteligenci. Najnaprednejše raziskave umetne inteligence danes zahtevajo infrastrukturo, ki je v Sloveniji nimamo in se je tudi ne moremo nadejati, vseeno pa se je v zadnjem letu tudi v domačih logih zgodilo marsikaj zanimivega.
Največji dogodek je bil bržkone pridobitev financiranja za Slovensko tovarno umetne inteligence – superračunalnik za 150 milijonov EUR, prilagojen umetni inteligenci. Poleg tega je bil zgrajen velik jezikovni model za slovenščino GaMS, ki omogoča boljše izražanje v našem jeziku in prispeva k slovenski digitalni suverenosti. V Sloveniji nastaja tudi veliko aplikacij, ki uporabljajo velike jezikovne modele. Med njimi bi radi izpostavili zdravstvenega pomočnika HomeDOCtor, ki zna državljanom svetovati glede zdravstvenih težav bolje kot splošnonamenski modeli.

Vrnimo se zdaj h konferenci: letos ima 21 prispevkov, kar je največ po rekordnem letu 2020. Od teh jih dve tretjini prihajata z Instituta Jožef Stefan, kar ne odstopa dosti od statistike zadnjih let. Tako širša zastopanost različnih slovenskih ustanov vključno z industrijo še vedno ostaja naša želja.

Ponosni smo, da smo letošnjo konferenco obogatili s kar tremi posebnimi dogodki. Otvoritev sestavljata uvodni nagovor predstavnice Ministrstva za digitalno preobrazbo in vabljeno predavanje Eve Tube, ki je v Slovenijo prišla na prestižno mesto ERA Chair v okviru projekta AutoLearn-SI. Ker umetna inteligenca prodira v vse pore našega življenja, med katere sodi tudi umetnost, smo organizirali sekcijo Beyond Human Art prav na to temo. In nenazadnje smo Slovensko tovarno umetne inteligence obeležili s sekcijo, kjer smo se poučili o tovarni, njeni uporabi v znanstvenih raziskavah in vlogi pri obdelavi senzorskih podatkov.

Konferenca ostaja enkraten slovenski in mednarodni prostor odličnosti, odprte akademske razprave in novih idej. Ponosni smo, da skupaj gradimo slovensko skupnost umetne inteligence, ki s svojim znanjem in inovacijami prispeva k reševanju ključnih izzivov sodobnega časa ter krepi vlogo Slovenije v evropskem in svetovnem prostoru.

Mitja Luštrek, Matjaž Gams, Rok Piltaver

FOREWORD TO THE SLOVENIAN CONFERENCE ON ARTIFICIAL INTELLIGENCE

The Slovenian Conference on Artificial Intelligence is taking place this year at a time when AI continues to advance rapidly into science, industry, and everyday life, faster and more usefully than ever before. Large language models are still at the forefront, having already successfully expanded their capabilities to the understanding and generation of sound, images and video. Another interesting research area is foundation models for other types of data – for example, sensor and biological data, as well as robotic actions, which can likewise be connected to language. These models are especially valuable in medical research, where they have already led to the development of new therapeutic compounds. Such research may eventually result in models capable of comprehensively understanding the world and interacting with it, which strongly suggests artificial general intelligence. The most advanced artificial intelligence research today requires infrastructure that Slovenia does not have and cannot realistically expect, yet the past year has nevertheless seen several significant and interesting advances in Slovenia as well.

The most important milestone was probably securing the funding for the Slovenian Artificial Intelligence Factory – a 150 million EUR supercomputer tailored to artificial intelligence. In addition, a large language model for Slovenian, GaMS, was built, enabling better expression in our language and contributing to Slovenian digital sovereignty. Slovenia is also seeing the rise of many applications that make use of large language models.
Among them we would like to highlight the healthcare assistant HomeDOCtor, which is able to advise citizens on health issues better than general-purpose models.

Returning to the conference: this year it features 21 papers, the highest number since the record year of 2020. Of these, two thirds come from the Jožef Stefan Institute, which does not differ much from the statistics of recent years; a broader representation of various Slovenian institutions, including industry, thus remains our goal.

We are proud that this year's conference was enriched with three special events. The opening included a welcome address by a representative of the Ministry of Digital Transformation and a keynote lecture by Eva Tuba, who came to Slovenia to take up the prestigious ERA Chair position within the AutoLearn-SI project. Since artificial intelligence is making its way into every aspect of our lives, including art, we organized a special section titled Beyond Human Art dedicated to this theme. Finally, we marked the Slovenian Artificial Intelligence Factory with a session where we learned about the factory itself, its use in scientific research, and its role in processing sensor data.

The conference is a unique Slovenian and international venue for excellence, open academic debate and new ideas. We are proud that together we are building the Slovenian AI community, which, through its knowledge and innovations, contributes to addressing the key challenges of our time and strengthens Slovenia's role in Europe and globally.

Mitja Luštrek, Matjaž Gams, Rok Piltaver

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE
Mitja Luštrek, Matjaž Gams, Rok Piltaver, Zoja Anžur, Cene Bavec, Marko Bohanec, Marko Bonač, Ivan Bratko, Bojan Cestnik, Aleš Dobnikar, Erik Dovgan, Bogdan Filipič, Borka Jerman Blažič, Marjan Krisper, Marjan Mernik, Biljana Mileva Boshkoska, Vladislav Rajkovič, Niko Schlamberger, Tomaž Seljak, Peter Stanovnik, Damjan Strnad, Miha Štajdohar, Vasja Vehovar

Detecting Pollinators from Stem Vibrations Using a Neural Network

Žan Ambrožič (za44564@student.uni-lj.si), Faculty of Mathematics and Physics, University of Ljubljana, Ljubljana, Slovenia
Lorenzo Bianco (l.bianco@unito.it), Department of Life Science and System Biology, University of Turin, Turin, Italy
Rok Šturm (rok.sturm@nib.si), National Institute for Biology, Ljubljana, Slovenia
David Susič, Maj Smerkol, Anton Gradišek, Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia

Abstract
Passive sensing of pollinator activity is important for biodiversity monitoring and conservation, yet conventional acoustic or visual methods produce large amounts of data and face deployment challenges. In this work, we present initial results on investigating stem vibration as an alternative signal for detecting pollinator presence on flowers. Vibration recordings were collected with a laser vibration instrument from various flower species at multiple locations in Slovenia, totaling approximately 14 hours, of which 3 hours were expert-annotated as positive (insect activity present). The task was formulated as a binary classification problem: determining whether a vibration segment corresponds to a pollinator physically touching the flower. Using a neural network model, performance was evaluated with five-fold cross-validation across three experiments: (i) using a balanced subset, (ii) using the full dataset, and (iii) using the full dataset with heuristic prediction smoothing. On the balanced subset, the model achieved an average F1-score of 0.86 ± 0.06; on the full dataset, 0.62 ± 0.07; and with heuristic smoothing, 0.69 ± 0.11, demonstrating both the feasibility of vibration-based detection and the benefits of post-processing. Beyond binary detection, future work will focus on species- and activity-level classification. Ultimately, the goal is to develop lightweight vibration detectors deployable directly on plants, enabling scalable estimation of pollinator visitation rates in natural and agricultural environments.

Keywords
stem vibrations, pollination, neural networks, buzz detection, spectrograms

1 Introduction
Europe supports a rich diversity of wild pollinators, among them 2,051 species of bees and 892 species of hoverflies. Collectively, pollinators provide a wide range of benefits to society, including a contribution of more than €15 billion per year to the market value of European crops, pollinating around 78 percent of wild flowering plants. This pollination service ensures healthy ecosystem functioning and maintains wider biodiversity as well as culturally important flower-rich landscapes [1]. Many reviews highlight the global decline in insects [2], [3] and in particular wild bees [4], [5]. Internationally, the UN Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES) and the Convention on Biological Diversity (CBD) highlighted the need to collect long-term, high-quality data on pollinators and pollination services in order to direct policy and practice responses to address the decline. There have already been some attempts to monitor pollinators' activity from sound/soundscape recordings (e.g., [6]). Here, we explore for the first time the monitoring of pollination activity using vibroscape recordings [7] from flowering plants visited by different pollinators. We investigated the possibility of using neural networks for the automatic detection of pollinator visits on flowers.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.skui.5707
2 Dataset
The dataset comprises vibration waveforms from flowers, which were used for model training, and auxiliary audio and camera recordings collected for labeling and species identification. All recordings were obtained during July and August 2024 at various locations in Slovenia. The vibrations were measured using a VibroGo (Polytec, Germany) laser vibration instrument, which has an operational range of up to 30 m and can detect movements of up to 6 m/s at frequencies up to 320 kHz. For this study, measurements were performed at close range, with a frequency span of 0–24 kHz and a sampling rate of 48 kHz.

For the measurements, the flower stem was fixed to a pole to minimize large movements, and a small piece of reflective foil was attached slightly below the flower to enable the laser vibrometer to capture fine vibrations caused by insect activity. Our data acquisition setup is shown in Figure 1.

[Figure 1: Data acquisition setup for recording vibration signals, audio, and visual recordings from flowers.]

The dataset comprised vibration recordings of up to 10 minutes each, collected from flowers belonging to the genera Calystegia, Cichorium, Crepis, Epilobium, Knautia (the majority of samples), Leontodon, Lotus, Pastinaca, and Trifolium. In total, the recordings amounted to approximately 14 hours, of which 3 hours were annotated for insect activity (as positive), while the rest did not contain insect activity and was considered negative. Labeling was performed in Raven Pro by expert annotators, who used synchronized audio and video recordings to confirm insect presence and identify species. Each recording was annotated with time intervals indicating insect activity, insect species, activity type, and, when relevant, additional notes. For the purpose of this study, where we are only interested in the binary classification of detecting pollinators, all intervals with any insect activity that included physically touching the flower were labeled as 1, and 0 otherwise.

Labeled intervals were cut into clips of one second with 0.1-second overlap (positive instances), whereas unlabeled portions were similarly divided and treated as negative instances. To balance the dataset, the negative instances were randomly downsampled. Some negative instances contained environmental noise, such as speech, machinery, or wind, and wind noise was occasionally present in positive instances. Examples of vibration signals from honey bee foraging and from wind are shown in Figures 2 and 3, respectively. The final balanced subset consisted of 7334 positive and 8664 negative instances. The positive data distribution by insects is given in Table 1.

[Figure 2: Sample spectrogram of honey bee foraging (positive).]
[Figure 3: Sample spectrogram of light wind blowing (negative).]

Table 1: Number of labels and the corresponding number of instances by insect.

Insect      | Number of labels | Instances
fly         | 76               | 4146
honey bee   | 253              | 1688
wild bee    | 98               | 1307
hoverfly    | 82               | 155
bumble bee  | 14               | 24
wasp        | 3                | 9
moth        | 1                | 5
Total       | 527              | 7334
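The clip-extraction step described above can be made concrete with a short sketch. This is an illustrative reimplementation, not the authors' code: the function name extract_clips and the representation of annotated intervals as (start, end) pairs in seconds are assumptions, and the paper's exact handling of labeled versus unlabeled portions may differ in detail.

```python
import numpy as np

def extract_clips(waveform, sr, intervals, clip_s=1.0, overlap_s=0.1):
    """Cut a recording into 1 s clips with 0.1 s overlap; a clip is labeled
    1 if it overlaps an expert-annotated insect-activity interval."""
    clip = int(clip_s * sr)
    step = clip - int(overlap_s * sr)   # consecutive clips share 0.1 s
    clips, labels = [], []
    for start in range(0, len(waveform) - clip + 1, step):
        t0, t1 = start / sr, (start + clip) / sr
        # positive if the clip overlaps any annotated (a, b) interval
        positive = any(t0 < b and a < t1 for (a, b) in intervals)
        clips.append(waveform[start:start + clip])
        labels.append(int(positive))
    return np.stack(clips), np.array(labels)
```

Negative clips would then be randomly downsampled to produce the balanced subset described in the text.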
3 Methodology
The objective of this study was to assess whether stem vibrations can be used to detect the presence of pollinators on flowers. From a machine learning perspective, the problem was framed as a binary classification task: distinguishing between the presence and absence of insects in physical contact with the flower. The methodology consisted of the initial recording of waveforms and labeling, preprocessing the data, selecting the appropriate neural network architecture, and training and evaluating the model.

3.1 Data Preprocessing
First, the instances that were shorter than one second (in cases where the expert-labeled interval was shorter than one second) were padded. After that, all instances were converted into Mel spectrograms of size 64x64 using the fast Fourier transform with a frequency range of 0–3 kHz.
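As a rough illustration of this preprocessing step, the sketch below computes a 64x64 log-Mel spectrogram for a one-second clip using the librosa library. It is a minimal sketch under stated assumptions (48 kHz input, a hop length chosen to yield 64 time frames, FFT size 2048, and log scaling); the paper does not report these implementation details.

```python
import numpy as np
import librosa

SR = 48_000  # recording sampling rate from Section 2

def clip_to_melspec(clip, sr=SR, n_mels=64, n_frames=64, fmax=3000):
    """Convert a 1 s vibration clip into a 64x64 log-Mel spectrogram
    restricted to 0-3 kHz, as described in Section 3.1."""
    if len(clip) < sr:                        # pad short expert intervals
        clip = np.pad(clip, (0, sr - len(clip)))
    hop = int(np.ceil(len(clip) / n_frames))  # ~750 samples -> 64 frames
    mel = librosa.feature.melspectrogram(
        y=clip.astype(float), sr=sr, n_fft=2048, hop_length=hop,
        n_mels=n_mels, fmin=0.0, fmax=fmax)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    return mel_db[:, :n_frames]               # 64 mel bins x 64 frames
```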
3.2 Model Architecture
For the model, a network of residual blocks in combination with convolution was used. It is a smaller version of ResNet-style models (e.g., ResNet-18). Residual blocks offer efficient reuse of features across the layers. As shown in Figure 4, the input spectrogram goes through a 3x3 convolution, followed by three residual blocks, before final pooling. The residual block, shown in Figure 5, consists of two 3x3 convolutions to identify features, while the residual path only uses stride to match the shape before addition. To prevent overfitting and to enable extended training, dropout of 0.5 was used, which improved performance more than data augmentation (and was also computationally more efficient).

[Figure 4: Model architecture. Input (1×64×64) → Conv 3×3 + BN + ReLU (1→32) → Res Block (32→64) → Res Block (64→128) → Res Block (128→256) → Global AvgPool → Linear (256→1).]

[Figure 5: Residual block (Res Block) in Figure 4. Main path: Conv 3×3 (stride 2) + BN + ReLU → Conv 3×3 (stride 1) + BN; residual path: 1×1 Conv (stride 2) + BN; Add → ReLU.]

3.3 Model Training Settings
The model was trained using the binary cross-entropy loss. Optimization was performed with the Adam optimizer with a learning rate of $10^{-4}$ and weight decay of $10^{-5}$. A batch size of 16 and 30 epochs were used.
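A minimal PyTorch sketch of the architecture in Figures 4 and 5 and the training settings of Section 3.3 follows. This is a plausible reconstruction rather than the authors' code: the exact placement of the dropout layer, padding, and bias choices are assumptions not specified in the paper.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block from Figure 5: two 3x3 convolutions on the main path,
    a strided 1x1 convolution on the skip path to match shapes."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(c_out))
        self.skip = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
            nn.BatchNorm2d(c_out))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.main(x) + self.skip(x))

class BuzzDetector(nn.Module):
    """Figure 4: Conv 3x3 (1->32) + three residual blocks + global pooling."""
    def __init__(self, dropout=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1, bias=False),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            ResBlock(32, 64), ResBlock(64, 128), ResBlock(128, 256),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Sequential(nn.Flatten(), nn.Dropout(dropout),
                                  nn.Linear(256, 1))

    def forward(self, x):                    # x: (batch, 1, 64, 64)
        return self.head(self.features(x))   # one logit per clip

model = BuzzDetector()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on logits
```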
4 Evaluation Metrics
We used standard performance evaluation metrics: accuracy, precision, recall and F1-score, which were computed from the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)$$
$$\mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (4)$$

In the confusion matrices, we used relative numbers of samples for the colors instead of absolute numbers (which are only listed), because there were many more negatives than positives in the detection test. Relative shares are based on true labels (e.g., the fraction of FN among all negatively labeled samples).

4.1 Experiments
The model was evaluated in three experimental settings, all using 5-fold cross-validation. Instances originating from the same recording were always assigned to the same fold to better reflect real-world variability. Training and testing were repeated five times, each with a different fold held out for testing and the remaining folds used for training. Reported results are averages across the five folds.

4.1.1 Balanced labeled subset. In the first experiment, called "Subset", only the manually labeled subset of the dataset was used. This consisted of the 7334 positive and 8664 negative instances described above. These were treated as balanced binary classification data and evaluated directly.

4.1.2 Full dataset with raw labeling. In the second experiment, called "Full data (raw)", the entire dataset was included by segmenting recordings into 1.0 s windows with a step size of 0.1 s. Expert annotations were then used to assign labels to these windows, yielding a much larger evaluation set. However, such raw labeling frequently introduced short, isolated positive or negative events that were likely erroneous. When the model predicted such isolated events, performance metrics were underestimated, as the evaluation framework treated them as genuine labels. This motivated the introduction of a heuristic smoothing procedure.

4.1.3 Full dataset with heuristic labeling. The third experiment, called "Full data (heuristics)", used the same sliding-window segmentation as the raw labeling experiment, but applied a heuristic smoothing procedure to adjust labels. The aim was to reduce the influence of short, likely erroneous events while preserving longer, fragmented signals as single detections. Two rules were applied (a code sketch follows below):

- If the model predicted at least 10 consecutive positive windows (equivalent to 1.0 s), the entire interval was relabeled as positive.
- If at least 82% of 50 consecutive windows (equivalent to 5.0 s) were predicted as positive, the entire interval was relabeled as positive.

These empirically determined thresholds suppressed short false positives while ensuring that extended pollinator events with intermittent weak signals were still detected as continuous segments. Finally, because the sliding window (1.0 s) exceeded the step size (0.1 s), prediction timestamps were shifted backward by 0.5 s to align the window centers with the expert annotations.
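The two smoothing rules can be expressed compactly in code. The sketch below is one possible reading of the procedure (the paper does not publish an implementation): short positive runs are treated as erroneous and dropped, while long runs and densely positive 5-second stretches are kept as continuous positive segments.

```python
import numpy as np

def smooth_predictions(pred, min_run=10, window=50, frac=0.82):
    """Heuristic smoothing of per-window predictions (Section 4.1.3).

    pred: binary model predictions for consecutive 0.1 s-step windows.
    Rule 1: runs of >= min_run (1.0 s) consecutive positives are kept.
    Rule 2: any stretch of `window` (5.0 s) windows whose positive share
            reaches `frac` is relabeled entirely positive.
    """
    pred = np.asarray(pred, dtype=int)
    out = np.zeros_like(pred)

    # Rule 1: keep sufficiently long runs of consecutive positives
    run_start = None
    for i, p in enumerate(np.append(pred, 0)):   # sentinel ends last run
        if p and run_start is None:
            run_start = i
        elif not p and run_start is not None:
            if i - run_start >= min_run:
                out[run_start:i] = 1
            run_start = None

    # Rule 2: densely positive 5 s stretches become fully positive
    for i in range(len(pred) - window + 1):
        if pred[i:i + window].mean() >= frac:
            out[i:i + window] = 1
    return out
```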
5 Results and Discussion
The results of all three experiments are shown in Table 2, along with the confusion matrices in Figure 6.

Table 2: Results of all experiments. The numbers represent the average ± standard deviation across the five folds of the cross-validation.

Metric    | Subset      | Full data (raw) | Full data (heur.)
Accuracy  | 0.87 ± 0.03 | 0.80 ± 0.02     | 0.86 ± 0.05
Precision | 0.85 ± 0.09 | 0.54 ± 0.11     | 0.68 ± 0.15
Recall    | 0.87 ± 0.04 | 0.75 ± 0.11     | 0.73 ± 0.13
F1-score  | 0.86 ± 0.06 | 0.62 ± 0.07     | 0.69 ± 0.11

[Figure 6: Confusion matrices of all three experiments described in Section 4.1 (rows: actual, columns: predicted). Subset: TP 6409 (0.87), FN 925 (0.13), FP 1095 (0.13), TN 7569 (0.87). Full data (raw): TP 85k (0.74), FN 29k (0.26), FP 72k (0.18), TN 322k (0.82). Full data (heur.): TP 82k (0.72), FN 32k (0.28), FP 40k (0.10), TN 354k (0.90).]

The results show that there was a significant reduction in performance when we switched from the balanced subset to recordings from the full dataset. There are several possible sources of error. Labels are annotated on the waveform, and samples are extracted in such a way that the whole non-padded (therefore non-silent) part is either positive or negative; furthermore, the prediction for a specific time $t$ is generated based on the window beginning at $t$ and ending at $t + 1$ s, which might lead to inaccuracies at the edges of labels, although we shifted the time to match it as well as possible. In addition, the balanced samples contain no other insects or activities, whereas these do occur in the full recordings and are sometimes falsely detected as positive. It is important to note that "Full data" is not a balanced set (while "Subset" is) and is meant as a test for a real-world scenario, where conditions and the frequency of pollinators vary on short time scales (hours), which makes loss balancing (which would reduce the gap between recall and precision) very difficult in practice. For this reason, we did not use it, and we left the thresholds the same as in the "Subset" experiment, so the results serve as a valid estimation of the performance in reality. Figure 7 shows how the heuristics helped the model by smoothing out short erroneous predictions, resulting in improved performance. To improve model performance even further, additional heuristic filters may be added.

[Figure 7: Output example: (blue) model prediction, (green) heuristic filter, (yellow) expert labels.]

6 Conclusion
We presented initial results on the feasibility of detecting pollinator presence on flowers from stem vibration recordings using machine learning methods. We evaluated models under three experimental settings: a balanced labeled subset, the full dataset with raw expert annotations, and the full dataset with heuristic label smoothing. The results demonstrate that pollinator activity can be reliably inferred from vibration signals, with heuristic post-processing substantially reducing the impact of isolated erroneous predictions and improving the robustness of detection.

Future work will focus on extending the models beyond binary detection towards classification of pollinator species and potentially of behavioral activities. From an applied perspective, the long-term goal is to develop lightweight vibration detectors that can be mounted directly on plants to automatically register pollinator visits. Deploying a small number of such sensors in a field or meadow would enable scalable estimation of pollinator abundance and activity, providing a valuable tool for biodiversity monitoring and conservation studies.

Acknowledgements
The authors acknowledge the funding from the Slovenian Research and Innovation Agency, Grants PR-10495, P1-0255, J7-50040, Z1-50018, the basic core funding P2-0209, and the support received from the Erasmus+ Traineeship programme.

References
[1] European Commission, Joint Research Centre. 2021. Proposal for an EU pollinator monitoring scheme. Publications Office, LU. doi: 10.2760/881843.
[2] David L. Wagner. 2020. Insect declines in the Anthropocene. Annual Review of Entomology, 65, 1, (Jan. 2020), 457–480. doi: 10.1146/annurev-ento-011019-025151.
[3] Roel van Klink, Diana E. Bowler, Konstantin B. Gongalsky, Ann B. Swengel, Alessandro Gentile, and Jonathan M. Chase. 2020. Meta-analysis reveals declines in terrestrial but increases in freshwater insect abundances. Science, 368, 6489, (Apr. 2020), 417–420. doi: 10.1126/science.aax9931.
[4] J. C. Biesmeijer et al. 2006. Parallel declines in pollinators and insect-pollinated plants in Britain and the Netherlands. Science, 313, 5785, (July 2006), 351–354. doi: 10.1126/science.1127863.
[5] Luísa Gigante Carvalheiro et al. 2013. Species richness declines and biotic homogenisation have slowed down for NW-European pollinators and plants. Yvonne Buckley, editor. Ecology Letters, 16, 7, (May 2013), 870–878. doi: 10.1111/ele.12121.
[6] David Susič, Johanna A. Robinson, Danilo Bevk, and Anton Gradišek. 2025. Acoustic monitoring of solitary bee activity at nesting boxes. Ecological Solutions and Evidence, 6, 3, e70080. doi: 10.1002/2688-8319.70080.
[7] Rok Šturm, Juan José López Díez, Jernej Polajnar, Jérôme Sueur, and Meta Virant-Doberlet. 2022. Is it time for ecotremology? Frontiers in Ecology and Evolution, 10, (Mar. 2022). doi: 10.3389/fevo.2022.828503.

Thermal Camera-Based Cognitive Load Estimation: A Non-Invasive Approach

Zoja Anžur (zoja.anzur@ijs.si), Gašper Slapničar (gasper.slapnicar@ijs.si), Mitja Luštrek (mitja.lustrek@ijs.si); all authors: Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

Abstract
Cognitive load (CL) monitoring is a growing area of interest across various domains. Most traditional methods rely on either subjective assessments or intrusive sensors, limiting their practical applicability. In this study, we present a non-invasive approach for estimating CL using thermal imaging. Thermal videos were collected from 18 participants performing a battery of tasks designed to induce varying levels of CL. Using a low-cost thermal camera, we extracted features from facial regions of interest and trained several machine learning models, including Random Forest, Extreme Gradient Boosting, Stochastic Gradient Descent (SGD), k-Nearest Neighbors, and Light Gradient Boosting Machine, on a binary classification task distinguishing between rest and high CL conditions. The models were evaluated using Leave-One-Subject-Out cross-validation. Our results show that all models outperform the baseline majority classifier, with SGD achieving the highest accuracy (0.64 ± 0.16), despite variability across individuals. These findings support the feasibility of thermal imaging as an unobtrusive tool for CL estimation in real-world applications.
Keywords
cognitive load estimation, thermal imaging, physiological computing, machine learning for affective computing, non-invasive user monitoring

1 Introduction
Monitoring cognitive load (CL) unobtrusively and accurately has become an increasingly important goal across various domains. Traditional methods such as the NASA-TLX questionnaire [7] for assessing cognitive states often rely on intrusive sensors or subjective self-reporting, limiting their practicality in real-world applications. In recent years, the use of machine learning techniques combined with physiological signals has opened new possibilities for non-invasive and continuous monitoring [2].

The primary objective of our study was to predict CL using data obtained with a thermal camera. Our aim was to develop a method for the unobtrusive measurement of physiological signals that achieves high accuracy. Compared to other physiological measurement tools, thermal cameras are relatively low-cost and quick to deploy, which makes them a practical choice for real-world cognitive monitoring applications.

2 Related Work
Early approaches to contact-free thermal monitoring of psychophysiological states based on infrared thermal imaging focused primarily on emotional and affective research [8]. The physiological background was heavily explored, specifically how autonomic nervous system activity yields descriptive thermal signatures related to affect in facial regions. Such work laid the critical groundwork for later expansion towards CL estimation.

One of the fundamental studies towards thermal-camera-based CL estimation was published by Abdelrahman et al. in 2017. They introduced an unobtrusive method that uses a commercial thermal camera to monitor temperature changes on the forehead and nose, which were chosen as regions of interest based on the physiological background established earlier. It demonstrated that the difference between forehead and nose temperature correlates robustly with task difficulty, showing effectiveness in Stroop test and reading complexity experiments. Notably, the system achieved near-real-time detection with an average latency of 0.7 seconds, making it suitable for responsive, real-time cognition-aware applications [1].

While such monitoring traditionally required relatively expensive hardware [6], recent work has shown the potential of more affordable low-cost thermal cameras for monitoring psychological states. Black et al. [4] compared state-of-the-art vision transformers (ViT) against traditional convolutional neural networks (CNNs) on data recorded with low-resolution thermal cameras. They found superior performance of ViT when classifying emotions, achieving a 0.96 F1-score for 5 emotions (anger, happiness, neutral, sadness, surprise), confirming the feasibility of low-cost hardware.

Lastly, some work explores subtle connections between different inner states that are difficult to discriminate, such as stress and CL. Bonyad et al. [5] showed a correlation of the two states in airplane pilots, highlighting that elevated cognitive workload induced stress, manifesting in significant cooling across the nose, forehead, and cheeks, with the nasal region exhibiting the most rapid and pronounced temperature decline. These thermal changes were synchronized with increases in heart rate and subjective workload ratings. Overall, thermal monitoring is becoming more accessible and an established CL estimation alternative to other modalities (e.g., wearables, RGB cameras), especially in challenging conditions (e.g., darkness).
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.skui.3714

3 Data
3.1 Data Collection
For the purpose of our experiment, we gathered data from 18 participants using various sensors. In this work, we will focus only on the relevant data obtained by an affordable FLIR Lepton 3.5 camera, with a resolution of 160x120, running at 8.7 frames per second.

Our participants underwent a battery of tests for inducing CL. Data collection was carried out in a controlled laboratory environment to ensure consistency across all participants. After filling out some initial questionnaires regarding the individual's tiredness and focus levels, the calibration of the various sensors used in the study was performed. The experiment itself was structured into three sequential blocks, each designed to induce CL through two different tasks offered at two difficulty levels. The first block featured standardized CL tasks – specifically, the N-back and Stroop tasks, which are widely used in cognitive research to engage working memory and executive attention [10, 12].

The second block introduced more ecologically valid memory tasks. The memory recall task involved displaying a list of words on a screen, after which participants had 30 seconds to recall and verbally report as many as possible. In the visual memory task, participants observed an image and were later asked to recall specific details.

The third and final block focused on ecological visual attention tasks. These included a visual discrepancy detection task and a line tracking task. In the discrepancy detection task, participants compared two images and identified visual differences. In the line tracking task, participants followed numbered lines from one side of the screen to the other and identified them.

Between these cognitive tasks, participants engaged in relaxation activities such as resting, passively viewing images, or listening to music, which served as baseline conditions and helped to balance their CL throughout the experiment. After each task and each relaxation period, participants completed the NASA Task Load Index (NASA-TLX) [7] and the Instantaneous Self-Assessment (ISA) [9] questionnaires to provide subjective evaluations of their cognitive and affective states.

The session concluded with the removal of all sensors, a debriefing session, and participant compensation. The entire procedure lasted approximately 60 minutes per participant, with around 40 minutes spent on active data collection and the rest used for setup, instructions, and debriefing.

[Figure 1: Examples of raw thermal images. (a) Subject A. (b) Subject B.]
3.2 Data Preprocessing
The raw data used in our analysis is illustrated in Figure 1. The first step in our preprocessing pipeline was windowing. Specifically, we divided each thermal video into consecutive 3-second windows with a 25% overlap. From each window, only the middle frame was selected for further analysis. This approach was based on the assumption that facial temperature changes driven by physiological responses such as blood flow occur gradually over several seconds rather than instantaneously. As such, a single representative frame from each interval was considered sufficient to capture meaningful thermal variation in 2.25-second steps.

The second step in preparing the data for subsequent machine learning experiments involved the extraction of features from the thermal camera recordings. Prior research in this domain frequently utilizes average temperatures from distinct facial regions as input features, demonstrating that these regions can exhibit significant temperature differences associated with various affective states experienced by participants [3]. Motivated by these findings, we adopted a similar methodology to that proposed by Aristizabal-Tique et al. [3], and based our feature set on the average temperatures of four predefined regions of interest (ROIs): nose, forehead, left eye, and right eye.

The first step in obtaining the average temperatures for the selected ROIs involved applying a facial keypoint detector to extract the coordinates corresponding to each region in the thermal images. This process was carried out for the middle frame of every window of the thermal videos by passing it through a pretrained keypoint detection model [11]. The model, based on the widely adopted YOLOv5 architecture, was specifically trained on thermal images to enhance its performance for this modality. Following keypoint detection, we transitioned from working with raw thermal images to working with numerical temperature features, specifically the average temperatures computed for each region of interest. A more detailed explanation of this feature extraction process is provided in Section 4.1.

At this stage, our dataset – where each row corresponded to a single video frame – contained a substantial number of missing values. These missing values were primarily due to limitations in keypoint detection, which stemmed from several factors. First, participants wore smart glasses during the experiment, which often obstructed the eye region and impaired the accuracy of the keypoint detector. Second, natural head movements, such as turning to the left or right, occasionally caused parts of the face to be occluded, preventing the detector from accurately identifying key facial landmarks. Given the impact of these issues on data quality, we chose to remove all rows containing missing values from further analysis. We excluded 31% of the data in this step. The use of smart glasses was problematic not only for keypoint detection, but also for feature calculation: the eye regions were partially obstructed by the glasses, preventing the thermal camera from capturing accurate temperature measurements in this area. Since we were unable to control for this effect, it is possible that it also posed an issue in classification.

Next, we performed label transformations to prepare the data for subsequent analysis. Initially, the dataset included multiple labels, each corresponding to one of the tasks described in Section 3.1. However, approximately 50% of the instances were labeled as "questionnaire", reflecting the periods during which participants completed self-report instruments such as NASA-TLX and ISA. These instances posed a challenge: filling out a questionnaire is neither a clear resting state nor a cognitively demanding task, making it difficult to accurately determine the level of CL involved. Since our primary interest lay in distinguishing between load and rest conditions, we opted to exclude all rows labeled as "questionnaire" from further analysis. In addition, we grouped the remaining labels into three broader categories: rest, low CL (corresponding to the easy versions of the tasks), and high CL (corresponding to the difficult versions).

Following some initial experiments, we chose to retain only the most "extreme" instances in terms of CL. Specifically, we excluded all data labeled as low CL, as this class exhibited substantial overlap with both the rest and high load conditions. In particular, some tasks intended to induce low CL turned out to be unexpectedly difficult, effectively eliciting high CL, while others were so easy that it is questionable whether they imposed any cognitive demand at all.

To further emphasize the most distinct cognitive states, we also filtered the remaining data within each label interval. For intervals of instances labeled as rest, we retained only the final two-thirds of each interval, based on the assumption that participants would be most physiologically relaxed toward the end of each interval labeled rest. Immediately after completing a cognitively demanding task, the body may require some time to "cool down", during which residual physiological activity – such as elevated facial temperature – could still be present. By focusing on the latter portion of the interval, we aimed to capture a more accurate representation of the true resting state. Similarly, for instances labeled as high CL, we also retained only the final two-thirds of each interval, based on the assumption that CL tends to accumulate toward the end of a demanding task. This selection strategy was intended to maximize the contrast between rest and high load conditions by focusing on the time points most representative of those states.
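For illustration, the windowing and middle-frame selection could look as follows. The helper name middle_frames and the rounding choices are assumptions, and the authors' implementation may differ:

```python
import numpy as np

FPS = 8.7  # FLIR Lepton 3.5 frame rate from Section 3.1

def middle_frames(video, fps=FPS, win_s=3.0, overlap=0.25):
    """Split a thermal video (frames x H x W) into 3 s windows with 25%
    overlap and keep only the middle frame of each window (Section 3.2)."""
    win = int(round(win_s * fps))            # ~26 frames per window
    step = int(round(win * (1 - overlap)))   # ~2.25 s stride
    frames = []
    for start in range(0, len(video) - win + 1, step):
        frames.append(video[start + win // 2])
    return np.stack(frames)
```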
4 Methodology
4.1 Calculating Features
As previously mentioned, we extracted features directly from the raw thermal images. Using the pretrained keypoint detector [11], we obtained coordinates for five facial keypoints, using which we then defined ROIs corresponding to specific facial areas for each 3-second window. ROIs were shaped as rectangles, positioned based on the keypoint coordinates. Their size and placement were dynamically defined according to the distance between the eyes, reducing issues such as capturing inconsistent facial areas due to variations in distance from the camera or head movements. This approach was considered appropriate because the study was conducted in a controlled laboratory environment with minimal variation in posture and setup. Additionally, a visual inspection of the extracted ROIs confirmed that they were well aligned.

Next, we computed the average pixel temperature for each ROI, as each pixel in a thermal image directly reflects a temperature value. This process yielded four primary features – one for each of the predefined ROIs (nose, forehead, left eye, and right eye). To capture relative temperature differences between these regions, we then computed the pairwise differences between all four average temperatures. This resulted in an additional six features, representing the thermal contrasts between different facial areas. Finally, to capture potential temporal trends in temperature changes, we introduced two additional temporal features. Specifically, for each 3-second window, we computed the temperature difference between the first and last frame for two key regions of interest: the nose and the forehead. These temporal features aimed to reflect short-term thermal dynamics that may be indicative of CL fluctuations. In total, this process resulted in 12 features per instance: 4 average ROI temperatures, 6 pairwise temperature differences, and 2 temporal difference features.

Finally, we applied personalized normalization to account for individual differences in baseline physiological responses. While there is considerable variability across participants, the variations within each individual are more informative for detecting changes in CL. To address this, we standardized all feature values using z-score normalization per participant, transforming each instance based on that individual's mean and standard deviation. This normalization helped reduce inter-subject variability while preserving intra-subject dynamics, enabling a more robust learning of patterns related to CL. Following this step, we proceeded with the machine learning experiments using the described set of features.
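A compact sketch of the 12-feature construction and the per-participant z-score normalization is given below. The function names and the dictionary-based interface are illustrative, not taken from the paper:

```python
import numpy as np
from itertools import combinations

ROIS = ("nose", "forehead", "left_eye", "right_eye")

def window_features(mean_temp, first_temp, last_temp):
    """Assemble the 12 features of Section 4.1 for one 3 s window.

    mean_temp:  {roi: average pixel temperature in the middle frame}
    first_temp, last_temp: {roi: average temperature in the first/last
                            frame}, used for the two temporal features.
    """
    feats = [mean_temp[r] for r in ROIS]                   # 4 ROI means
    feats += [mean_temp[a] - mean_temp[b]                  # 6 pairwise
              for a, b in combinations(ROIS, 2)]           #   contrasts
    feats += [last_temp[r] - first_temp[r]                 # 2 temporal
              for r in ("nose", "forehead")]               #   deltas
    return np.array(feats)                                 # shape (12,)

def normalize_per_subject(X, subject_ids):
    """Z-score each feature within each participant (personalized norm)."""
    X = X.astype(float).copy()
    for s in np.unique(subject_ids):
        m = subject_ids == s
        std = X[m].std(axis=0)
        X[m] = (X[m] - X[m].mean(axis=0)) / np.where(std > 0, std, 1.0)
    return X
```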
4.2 Experiments
After completing the data preparation steps outlined in Sections 3.2 and 4.1, we proceeded with the machine learning experiments. At this stage, the dataset consisted of two balanced classes, rest and high CL, as shown in Table 1. The models were trained on a total of 3174 instances, derived from 18 participants.

Table 1: Class distribution.

Label     | Count
Rest      | 1626
High Load | 1548

In our experiments, we employed a diverse set of machine learning models, including Random Forest (RF), Extreme Gradient Boosting (XGB), Stochastic Gradient Descent (SGD), k-Nearest Neighbors (KNN), and Light Gradient Boosting Machine (GBM). As a baseline, we included a majority classifier, which always predicted the most frequent class in the training data of each fold. Each model was trained and evaluated using its optimized hyperparameters, which were determined through a grid search strategy applied to the training data in each Leave-One-Subject-Out (LOSO) iteration, aimed at maximizing classification accuracy.

To ensure the robustness and generalizability of the results, we adopted a LOSO cross-validation approach, in which each participant served as a test subject exactly once while the remaining participants were used for training. This evaluation strategy is well-suited for personalized and physiological data, where inter-subject variability is high. To ensure a comprehensive evaluation of model performance, we did not rely solely on a single metric. Instead, we incorporated a range of evaluation metrics, including accuracy and F1-score. This multi-metric approach allowed us to better capture different aspects of model performance. The results of these experiments are presented in the subsequent section.
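The evaluation protocol maps directly onto standard scikit-learn utilities. The sketch below shows LOSO cross-validation with an inner grid search for one of the models (SGD); the actual hyperparameter grids are not reported in the paper, so the ones here are placeholders:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import LeaveOneGroupOut, GridSearchCV
from sklearn.metrics import accuracy_score, f1_score

def loso_evaluate(X, y, subjects):
    """LOSO evaluation with a per-fold grid search, as in Section 4.2."""
    logo = LeaveOneGroupOut()
    accs, f1s = [], []
    for train, test in logo.split(X, y, groups=subjects):
        grid = GridSearchCV(
            SGDClassifier(loss="log_loss", max_iter=2000),
            param_grid={"alpha": [1e-4, 1e-3, 1e-2],       # placeholder grid
                        "penalty": ["l2", "elasticnet"]},
            scoring="accuracy", cv=3)
        grid.fit(X[train], y[train])      # tuned on training folds only
        pred = grid.predict(X[test])      # held-out participant
        accs.append(accuracy_score(y[test], pred))
        f1s.append(f1_score(y[test], pred, average="macro"))
    return np.mean(accs), np.std(accs), np.mean(f1s), np.std(f1s)
```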
Finally, we applied personalized normalization to account for individual differences in baseline physiological responses. While there is considerable variability across participants, the variations within each individual are more informative for detecting changes in CL. To address this, we standardized all feature values using z-score normalization per participant, transforming each instance based on that individual's mean and standard deviation. This normalization helped reduce inter-subject variability while preserving intra-subject dynamics, enabling a more robust learning of patterns related to CL. Following this step, we proceeded with the machine learning experiments using the described set of features.

Table 1: Class distribution

Label       Count
Rest        1626
High Load   1548

4.2 Experiments

After completing the data preparation steps outlined in Sections 3.2 and 4.1, we proceeded with the machine learning experiments. At this stage, the dataset consisted of two balanced classes, rest and high CL, as shown in Table 1. The models were trained on a total of 3174 instances, derived from 18 participants.

In our experiments, we employed a diverse set of machine learning models, including Random Forest (RF), Extreme Gradient Boosting (XGB), Stochastic Gradient Descent (SGD), k-Nearest Neighbors (KNN), and Light Gradient Boosting Machine (GBM). As a baseline, we included a majority classifier, which always predicted the most frequent class in the training data of each fold. Each model was trained and evaluated using its optimized hyperparameters, which were determined through a grid search applied to the training data of each Leave-One-Subject-Out (LOSO) iteration, aimed at maximizing classification accuracy.

To ensure the robustness and generalizability of the results, we adopted a LOSO cross-validation approach, in which each participant served as a test subject exactly once while the remaining participants were used for training. This evaluation strategy is well suited for personalized and physiological data, where inter-subject variability is high. To ensure a comprehensive evaluation of model performance, we did not rely solely on a single metric. Instead, we incorporated a range of evaluation metrics, including accuracy and F1-score. This multi-metric approach allowed us to better capture different aspects of model performance. The results of these experiments are presented in the subsequent section.
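This protocol maps directly onto standard scikit-learn components. The sketch below is a condensed illustration, assuming a feature matrix X, labels y, and per-instance subject IDs; the hyperparameter grid shown is a placeholder, not the one used in the study:

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

def loso_evaluate(X, y, subjects):
    accs, f1s, base_accs = [], [], []
    for tr, te in LeaveOneGroupOut().split(X, y, groups=subjects):
        # Hyperparameters are tuned on the training subjects only.
        grid = GridSearchCV(SGDClassifier(),
                            {"alpha": [1e-4, 1e-3, 1e-2]},  # illustrative grid
                            scoring="accuracy", cv=3)
        grid.fit(X[tr], y[tr])
        pred = grid.predict(X[te])
        accs.append(accuracy_score(y[te], pred))
        f1s.append(f1_score(y[te], pred, average="macro"))
        # Majority-class baseline, refit per fold.
        base = DummyClassifier(strategy="most_frequent").fit(X[tr], y[tr])
        base_accs.append(accuracy_score(y[te], base.predict(X[te])))
    return np.mean(accs), np.std(accs), np.mean(f1s), np.mean(base_accs)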
5 Results

As mentioned in the previous sections, we trained a variety of models and evaluated them using LOSO cross-validation. A summary of the results can be seen in Table 2, where both accuracy and F1-score are reported as averages across all subject folds, providing an overall measure of model performance and generalization.

The results indicate that the best-performing algorithm was SGD, achieving an accuracy of 0.64 ± 0.16, which represents a 0.13 improvement over the baseline majority classifier accuracy of 0.51 ± 0.00. In addition to its accuracy, SGD also achieved a high F1-score, suggesting that the model predicts both classes in a balanced manner. However, SGD also has the highest variance (± 0.16), which indicates less stability across subjects. Overall, all evaluated models outperformed the majority class baseline. Moreover, the accuracy scores across all tested models were relatively similar, indicating consistent performance regardless of the specific algorithm used. The performance of GBM, RF, and XGB was very similar, although somewhat behind the performance of SGD.

Table 2: Accuracy and F1-score of trained models compared to the majority class classifier

Classifier   Model Accuracy   Model F1      Majority Class Accuracy   Majority Class F1
RF           0.62 ± 0.13      0.62 ± 0.13   0.51 ± 0.00               0.34 ± 0.04
XGB          0.62 ± 0.14      0.62 ± 0.14   0.51 ± 0.00               0.34 ± 0.04
SGD          0.64 ± 0.16      0.63 ± 0.16   0.51 ± 0.00               0.34 ± 0.04
KNN          0.60 ± 0.10      0.60 ± 0.10   0.51 ± 0.00               0.34 ± 0.04
GBM          0.63 ± 0.10      0.60 ± 0.11   0.51 ± 0.00               0.34 ± 0.04

Figure 2: SGD vs. baseline majority classifier by subject.

Looking at the per-subject results in Figure 2 in more detail, we see that for most subjects the SGD classifier outperformed the majority baseline classifier. SGD achieved its best performance on subjects 13, 11, and 15, with accuracies exceeding 0.80. There is also considerable variation across individuals, which aligns with the high variance reported in Table 2. This variability may indicate the presence of subject-specific patterns, label noise, or data that is inherently more challenging to learn.

6 Conclusion

This study demonstrates the potential of low-cost consumer thermal imaging as a viable, non-invasive method for estimating CL. By leveraging features extracted from key facial regions and applying various machine learning algorithms, we achieved promising results in distinguishing between rest and high load cognitive states. Among the tested models, SGD achieved the best average performance, though with notable inter-subject variability. These findings highlight both the strengths and current limitations of thermal-based CL estimation. While the results support the feasibility of using affordable thermal cameras in real-world applications, future work should explore strategies such as more sophisticated personalization and deep learning to enhance generalization across individuals. This line of research points toward the usefulness of cognitive monitoring in practical settings such as education, workplace safety, and adaptive user interfaces.

Acknowledgements

We sincerely thank our colleagues from the Department of Intelligent Systems (Jožef Stefan Institute) for their assistance in data collection and preprocessing.

References

[1] Yomna Abdelrahman, Eduardo Velloso, Tilman Dingler, Albrecht Schmidt, and Frank Vetere. 2017. Cognitive heat: exploring the usage of thermal imaging to unobtrusively estimate cognitive load. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1, 3, 1–20.
[2] Muneeb Imtiaz Ahmad, Ingo Keller, David A Robb, and Katrin S Lohan. 2023. A framework to estimate cognitive load using physiological data. Personal and Ubiquitous Computing, 27, 6, 2027–2041. doi: 10.1007/s00779-020-01455-7.
[3] Victor H. Aristizabal-Tique, Marcela Henao-Pérez, Diana Carolina López-Medina, Renato Zambrano-Cruz, and Gloria Díaz-Londoño. 2023. Facial thermal and blood perfusion patterns of human emotions: proof-of-concept. Journal of Thermal Biology, 112, 103464. doi: 10.1016/j.jtherbio.2023.103464.
[4] James Thomas Black and Muhammad Zeeshan Shakir. 2025. AI enabled facial emotion recognition using low-cost thermal cameras. Computing&AI Connect, 2, 1, 1–10.
[5] Amin Bonyad, Hamdi Ben Abdessalem, and Claude Frasson. 2025. Heat of the moment: exploring the influence of stress and workload on facial temperature dynamics. In International Conference on Intelligent Tutoring Systems. Springer, 181–193.
[6] Federica Gioia, Maria Antonietta Pascali, Alberto Greco, Sara Colantonio, and Enzo Pasquale Scilingo. 2021. Discriminating stress from cognitive load using contactless thermal imaging devices. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 608–611.
[7] Sandra G Hart and Lowell E Staveland. 1988. Development of NASA-TLX (Task Load Index): results of empirical and theoretical research. In Advances in Psychology. Vol. 52. Elsevier, 139–183.
[8] Stephanos Ioannou, Vittorio Gallese, and Arcangelo Merla. 2014. Thermal infrared imaging in psychophysiology: potentialities and limits. Psychophysiology, 51, 10, 951–963.
[9] CS Jordan and SD Brennen. 1992. Instantaneous self-assessment of workload technique (ISA). Defence Research Agency, Portsmouth.
[10] Michael Kane, Andrew Conway, Timothy Miura, and Gregory Colflesh. 2007. Working memory, attention control, and the n-back task: a question of construct validity. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33, (May 2007), 615–622. doi: 10.1037/0278-7393.33.3.615.
[11] Askat Kuzdeuov, Dana Aubakirova, Darina Koishigarina, and Huseyin Atakan Varol. 2022. TFW: annotated thermal faces in the wild dataset. IEEE Transactions on Information Forensics and Security, 17, 2084–2094. doi: 10.1109/TIFS.2022.3177949.
[12] Michael P Milham, Kirk I Erickson, Marie T Banich, Arthur F Kramer, Andrew Webb, Tracey Wszalek, and Neal J Cohen. 2002. Attentional control in the aging brain: insights from an fMRI study of the Stroop task. Brain and Cognition, 49, 3, 277–296.

A Critical Perspective on MNAR Data: Imputation, Generation, and the Path Toward a Unified Framework

Fatemeh Azad, University of Ljubljana, Ljubljana, Slovenia, fatemeh.azad@fri.uni-lj.si
Matjaž Kukar, University of Ljubljana, Ljubljana, Slovenia, matjaz.kukar@fri.uni-lj.si

Abstract

Missing Not at Random (MNAR) data remains one of the most difficult challenges in statistical analysis and machine learning. Despite the widespread availability of advanced imputation methods, most research continues to focus on Missing Completely at Random (MCAR) and partially on Missing at Random (MAR) scenarios. This paper provides a critical overview of existing approaches for MNAR imputation, methods for simulating MNAR data, and the limitations of current evaluation practices. We highlight the lack of standardized benchmarks, unrealistic missingness rates, and insufficient coverage of MNAR conditions in empirical studies. Finally, we propose a suitable framework for comprehensive testing of design principles, enabling robust and reproducible evaluation of imputation methods across mechanisms and missingness rates.

Keywords

Missing data, MNAR, data imputation, missingness mechanisms, data generation, machine learning, evaluation framework.

1 Introduction

Missing data is a pervasive challenge across various domains, from clinical diagnostics and bioinformatics to finance, sensor networks, and social sciences. Missing, damaged, or unrecorded data entries can negatively affect the accuracy of statistical analysis and machine learning models. They reduce predictive power, introduce bias, and often create incompatibilities with algorithms requiring complete inputs [8]. The impact is especially important in critical areas like healthcare decision support, where unreliable data or incorrect interpretation can lead to harmful conclusions [14, 2].

A primary difficulty in handling missing data is understanding the underlying missingness mechanism. According to the taxonomy of Little and Rubin [10], there are three types of missingness: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).
To formally describe the MCAR, MAR, and MNAR mechanisms, we first define the following notation, as per [9, 19]:

• X: the complete data matrix, which consists of two parts, with X_obs being the observed and X_mis the missing part of the data.
• R: an indicator matrix of the same dimensions as X, where R_ij = 1 if the value X_ij is missing, and R_ij = 0 if it is observed.
• ψ: a parameter or set of parameters that govern the missingness process.

• MCAR: Data is MCAR if the probability of a value being missing is completely independent of both the observed and the unobserved data. The missingness is unrelated to the data itself — it is purely random (Eq. 1), as the missingness pattern (R) depends neither on the observed data (X_obs) nor on the missing data (X_mis).

P(R | X_obs, X_mis, ψ) = P(R | ψ)    (1)

• MAR: Data is MAR if the probability of a value being missing depends only on the observed data, not on the missing data itself (Eq. 2). This means that the missingness could be predicted from the available (non-missing) data. The probability of the missingness pattern (R) is conditionally independent of the actual missing values (X_mis) once the observed values (X_obs) are taken into account.

P(R | X_obs, X_mis, ψ) = P(R | X_obs, ψ)    (2)

• MNAR: Data is MNAR if the probability of a value being missing depends on some unobserved (missing) value itself, even after accounting for all the observed data (Eq. 3). In this case X_mis can also include latent features, unobserved for all instances. This is the most complex scenario, as the missingness pattern itself is informative. The probability of the missingness pattern (R) is therefore dependent on the missing values (X_mis) in a way that cannot be explained by the observed values (X_obs).

P(R | X_obs, X_mis, ψ) ≠ P(R | X_obs, ψ)    (3)
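The three mechanisms are easy to illustrate by generating the indicator matrix R for a single feature. The sketch below is our own illustration of Eqs. 1–3; the rates, coefficients, and variable names are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
n = 1000
x_obs = rng.normal(size=n)   # an always-observed covariate
x = rng.normal(size=n)       # the feature that will receive missing values

# MCAR (Eq. 1): missingness depends only on psi (a constant rate).
r_mcar = rng.random(n) < 0.3

# MAR (Eq. 2): missingness depends on the observed covariate only.
r_mar = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x_obs))

# MNAR (Eq. 3): missingness depends on the value that goes missing.
r_mnar = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))

x_amputated = np.where(r_mnar, np.nan, x)   # MNAR-amputated copy of x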
While MCAR and MAR have been extensively studied, MNAR remains the most difficult and least explored scenario, precisely because the missingness itself carries information about the data. For example, high-income individuals may systematically withhold reporting their wealth, or patients with severe conditions may drop out of longitudinal studies. In both cases, the very act of non-response encodes meaningful but hidden signals.

The prevalent imputation (replacing missing values) research has focused on MCAR and MAR settings, where assumptions about independence or conditional dependence simplify methodological development and evaluation [14, 23, 13]. In contrast, MNAR scenarios pose a dual challenge: not only is the missing information inherently dependent on unobserved values, but there are also very few benchmark datasets that explicitly model or annotate MNAR mechanisms. Consequently, evaluation standards remain incomplete. Reported missingness rates often underestimate or ignore MNAR effects, and even sophisticated models, such as generative adversarial networks [7, 24], graph neural approaches [25], or transformer-based imputers [3], rarely demonstrate systematic robustness in MNAR conditions. Recent works [11, 4] have shown the potential of ensemble or meta-imputation strategies, which combine diverse imputers into robust pipelines. However, these frameworks are also mostly validated under MCAR or MAR assumptions.

In this paper, we take a critical perspective on the current state of missing data research, specifically focusing on MNAR. We argue that three gaps must be addressed: (i) the lack of effective imputation techniques designed specifically for MNAR, as current methods are limited in scope and seldom used in practice; (ii) the deficiency of datasets and generators that can faithfully represent MNAR patterns; and (iii) the insufficiency of reported missingness rates. To bridge these gaps, we outline the vision and design principles of a comprehensive framework for MNAR research that integrates data generation, imputation, and evaluation under standardized conditions. Such a framework would enable more robust comparisons of existing methods and guide the development of novel techniques tailored to the inherent challenges of MNAR.

The remainder of this paper is organized as follows. Section 2 reviews existing imputation approaches and discusses their applicability to MNAR. Section 3 examines methodologies for simulating and generating MNAR data, highlighting their limitations. Section 4 critiques how missingness is reported and motivates the need for standardized benchmarks. Finally, Section 5 presents our vision for a unified MNAR research framework and outlines open challenges for the community.

2 Imputation Methods for MNAR Data

A wide range of imputation techniques has been proposed in the literature, from simple statistical to advanced deep generative models. While these methods have demonstrated effectiveness under Missing Completely at Random (MCAR) or Missing at Random (MAR) assumptions, their suitability for Missing Not at Random (MNAR) scenarios remains highly questionable. This section reviews the main categories of imputation techniques and highlights their limitations when faced with MNAR data. While it is often stated that there are almost no methods tailored for MNAR, several strands of work do exist; however, these remain underutilized and rarely integrated into mainstream imputation pipelines.

2.1 Statistical Imputation Methods

Statistical techniques such as mean, median, mode, or regression-based imputations are simple and computationally efficient, but they mostly rely on strong assumptions about the independence or conditional dependence of missingness [8, 27]. These assumptions rarely hold under MNAR, where the missingness mechanism is itself informative. For example, imputing systematically underreported values (e.g., income, clinical severity) with central-tendency statistics introduces bias and distorts the true distribution. Maximum likelihood and Bayesian approaches attempt to capture uncertainty, but they typically assume that the missingness process can be ignored or is fully modeled by observed data [10], which is not the case for MNAR.

2.2 Machine Learning-Based Approaches

Machine learning methods, such as k-nearest neighbors (KNN) [14], matrix factorization [20], decision trees [21], and support vector machines (SVMs) [6], utilize feature dependencies to address missing data entries. While more flexible than statistical methods, they fail when the missingness depends on unobserved or latent variables. For instance, if severely ill patients systematically omit follow-up surveys, no observed features can explain this absence, and machine learning based imputers cannot recover the missing structure without explicitly modeling the mechanism.
2.3 Deep Learning Approaches

Deep generative models have significantly advanced imputation research. Variational Autoencoders (VAEs) [2] and Generative Adversarial Networks (GANs) [23, 7, 24] are capable of learning complex distributions and have shown robustness to high missingness rates. However, their performance under MNAR conditions is not assured. While some frameworks, such as MisGAN, explicitly attempt to learn the missingness mask distribution alongside the data [7], they often rely on approximations that do not generalize across domains. Similarly, diffusion-based models [22, 26] and graph-based imputers [25] extend coverage to structured data but rarely test systematically against MNAR conditions. Transformers, such as ReMasker [3], provide context-aware imputations, but again, their evaluations are mostly limited to MCAR and MAR scenarios.

2.4 Ensemble Approaches

Recent efforts highlight the potential of combining multiple imputers in ensemble or meta-learning frameworks [11, 4]. Such methods leverage complementary strengths of diverse imputers and often achieve more stable performance across heterogeneous datasets. However, existing ensemble frameworks have been validated primarily under MCAR assumptions, and their ability to handle MNAR remains largely unexplored. Recent work has also explored meta-imputation strategies, such as the Meta-Imputation Balanced (MIB) framework, which combines multiple base imputers in a supervised setting [1].

To synthesize the discussion above, Table 1 summarizes the main categories of imputation approaches, their representative methods, applicability to missingness mechanisms, and key references.
Table 1: Comparison of Imputation Approaches from Literature

Approach                    Representative Methods                           Missingness Types Addressed         Representative References
Traditional Statistical     Mean, Median, Mode, Regression-based,            MCAR only (rarely MAR)              Schafer & Graham [27], Little & Rubin [10], Lin & Tsai [8]
                            Maximum Likelihood, Bayesian Approaches
Machine Learning            KNN, Matrix Factorization, Decision Trees, SVM   MCAR, partially MAR                 Murti et al. [14], Lee et al. [20], Song & Lu [21], Feng et al. [6]
Deep Learning               VAEs, GANs, Diffusion Models, Graph-based        MCAR, MAR (limited MNAR)            Collier et al. [2], Yoon et al. [23, 24], Li et al. [7], Du et al. [3], Tashiro et al. [22], Zheng & Charoenphakdee [26], You et al. [25]
                            Models, Transformers
Meta-Learning / Ensembles   Meta Learning, Meta-Regression, MIB Framework    MCAR, partially MAR; potential      Liu et al. [11], Ellington et al. [4], Azad et al. [1]
                                                                             for MNAR

3 Generation of MNAR Data

A persistent challenge in missing data research is the lack of reliable and reproducible benchmarks for handling MNAR scenarios. While MCAR and MAR can be easily simulated by random masking or conditioning on observed features, MNAR requires masking rules that depend on unobserved or latent variables, which makes the generation process more challenging. Consequently, most experimental studies rely on oversimplified masking strategies that do not capture the complexity of real-world MNAR mechanisms [18, 5].

3.1 The Role of Data Amputation

Deliberately injecting missing values into fully complete datasets, referred to as data amputation, plays a crucial role in evaluating imputation techniques. However, until recently, implementations of amputation were highly heterogeneous and often insufficiently documented, preventing fair comparisons across studies [18]. This problem is particularly acute for MNAR, where even slight differences in implementation can lead to very different conclusions.

3.2 Artificial MNAR Generation Strategies

The most common way to simulate MNAR is by masking values as a function of their own magnitude or distribution. For instance, removing a feature's highest or lowest values mimics non-disclosure of extreme outcomes (e.g., very high glucose levels) [18]. Stochastic variants extend this idea by assigning missingness probabilities proportional to the unobserved value itself, enabling flexible control over the intensity of missingness [16]. While intuitive, such strategies remain oversimplified, often restricted to univariate rules that fail to capture the multi-dimensional dependencies of real domains [5].
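As a concrete illustration of these two univariate strategies, the sketch below amputates a feature's highest values deterministically and, alternatively, with value-proportional probabilities. The synthetic data, quantile, and rate are arbitrary choices of ours:

import numpy as np

rng = np.random.default_rng(1)
values = rng.lognormal(mean=4.6, sigma=0.3, size=500)  # synthetic lab values

# Deterministic rule: the top 20% of values are never disclosed.
cutoff = np.quantile(values, 0.8)
miss_threshold = values > cutoff

# Stochastic rule: missingness probability grows with the value itself.
p = (values - values.min()) / (values.max() - values.min())
miss_stochastic = rng.random(values.size) < 0.5 * p

amputated = np.where(miss_threshold, np.nan, values)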
Recent work has proposed standardized libraries for data amputation to address reproducibility concerns. The mdatagen package provides a broad set of implementations for MCAR, MAR, and MNAR, supporting univariate and multivariate scenarios [12]. In particular, it incorporates advanced MNAR mechanisms such as Missingness Based on Own Values (MBOV), Missingness Based on Own and Unobserved Values (MBOUV), and Missingness Based on Intra-Relations (MBIR) [15]. These implementations move beyond ad hoc thresholding by systematically encoding missingness processes and offering reproducible pipelines. In addition, mdatagen includes visualization and evaluation modules, allowing researchers to inspect missingness patterns and assess their impact on imputation performance.

Together, these synthetic and standardized approaches form the current toolkit for MNAR data generation. However, despite their usefulness, they remain abstractions of real-world processes and should ideally be complemented by domain-informed simulations.

3.3 Domain-Inspired Simulation

Beyond standardized libraries, domain knowledge remains critical for realistic MNAR generation. In healthcare, dropout is often linked to disease severity, side effects, or socioeconomic constraints. In socioeconomic surveys, non-response may be strongly correlated with privacy-sensitive attributes such as income or debt. Encoding these mechanisms requires integrating causal assumptions with probabilistic masking rules [17]. However, such domain-specific approaches are difficult to generalize, limiting their utility as benchmarks.

4 Toward a Unified Framework for MNAR Research

Two key insights emerge from the previous sections: (i) current imputation methods are not explicitly designed for MNAR, and (ii) the lack of realistic MNAR generators inhibits effective evaluation. To address these gaps, we anticipate a unified framework integrating generation, imputation, and evaluation of MNAR data under standardized and reproducible conditions.

4.1 Design Principles

A comprehensive MNAR framework should follow these design principles:

• Synthetic realism: Data generators should simulate MNAR scenarios that mirror real-world domains (e.g., systematic dropout in healthcare, self-censoring in socio-economic data), either by extending existing functionality (e.g., mdatagen [12]) or by incorporating custom plug-in modules. To balance interpretability with scalability, both threshold-based rules and learned mechanisms should be supported.
• Comprehensive evaluation: Benchmarks must test across all three missingness mechanisms (MCAR, MAR, MNAR) and a full spectrum of missingness rates.
• Cross-domain applicability: The framework should support diverse data types (tabular, sequential, multimodal) and allow integration of domain knowledge for context-specific MNAR simulation.

4.2 Proposed Framework Components

We propose that a unified MNAR framework should consist of three interdependent modules:

(1) MNAR Data Generators: Domain-informed and probabilistic tools for simulating missingness patterns that depend on latent or unobserved values, using existing libraries ([12]) or incorporating custom plug-in functions.
(2) Imputation Engines: A modular interface with plug-in adapters for existing methods, supporting statistical, machine learning, deep learning, and ensemble methods [14, 23, 1]. By isolating imputers within a common framework, researchers can test their robustness under controlled MNAR scenarios.
(3) Evaluation Suite: Standardized protocols that combine direct metrics (e.g., Mean Absolute Error (MAE), Root Mean Squared Error (RMSE)) with indirect metrics (downstream predictive performance, such as accuracy, RMSE/MAE, or domain-relevant metrics such as interpretability, reliability, fairness, ...) [1]; a code sketch of such a protocol follows this list.
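Below is a minimal sketch of an evaluation protocol in this spirit, combining direct error metrics on the amputated entries with one indirect, downstream check. The model choice and data split are illustrative placeholders:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate_imputation(X_true, X_imputed, mask, y):
    # mask: boolean array, True where values were amputated, so the
    # ground truth is known and direct errors can be computed.
    err = X_imputed[mask] - X_true[mask]
    direct = {"MAE": float(np.abs(err).mean()),
              "RMSE": float(np.sqrt((err ** 2).mean()))}
    # Indirect: performance of a downstream model trained on imputed data.
    Xtr, Xte, ytr, yte = train_test_split(X_imputed, y, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
    indirect = {"downstream_accuracy": accuracy_score(yte, clf.predict(Xte))}
    return direct, indirect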
4.3 Benefits and Impact

Developing such a framework would enable several advances:

• Reproducibility: Common benchmarks and generators ensure that different imputation methods can be fairly compared.
• Realism: Domain-specific MNAR mechanisms bring evaluations closer to real-world conditions, reducing the gap between research and practice.
• Innovation: By exposing the weaknesses of existing methods under MNAR, the framework incentivizes the development of mechanism-aware imputers.
• Generalization: Unified treatment of MCAR, MAR, and MNAR encourages methods that adapt to unknown or mixed missingness mechanisms without prior assumptions.

5 Conclusion

Missing data remains one of the most persistent challenges in machine learning and statistical analysis. While decades of research have produced numerous imputation techniques, ranging from simple statistical estimators to deep generative models, most methods have been designed and evaluated under the more tractable MCAR and MAR mechanisms. In contrast, the most realistic and challenging setting, MNAR, remains critically underexplored.

Our review highlights three major gaps in the current state of the field. First, existing imputation methods rarely model the dependence of missingness on unobserved values, making them unsuitable for MNAR scenarios. Second, generating realistic MNAR data is crucial because most benchmarks use ad hoc or overly simplistic masking strategies, which fail to capture the complexity of real-world missingness. Third, evaluation standards remain incomplete, with reported missingness rates often conflating MCAR/MAR assumptions and failing MNAR realities. Together, these shortcomings hinder fair comparisons and limit methodological innovation.
To address these challenges, we propose the vision and design principles of a unified MNAR framework that integrates three components: (i) data generators that are aware of mechanisms and can create realistic MNAR patterns, (ii) modular imputation engines that enable thorough testing of various methods, and (iii) extensive evaluation suites that include direct and indirect metrics. Such a framework would provide reproducibility, realism, and a strong foundation for developing next-generation imputation techniques.

Future research should move toward principled, mechanism-aware imputers and adopt standardized benchmarks for MNAR generation and evaluation. To advance MNAR research, we need more powerful algorithms and standardized tools and protocols that enhance rigor and comparability in the field.

Acknowledgements

The research and development presented in this paper were funded by the Research Agency of the Republic of Slovenia (ARIS) through the ARIS Young Researcher Programme (research core funding No. P2-209). While preparing this work, the authors used Grammarly to check the correctness of grammar and improve the fluency of the writing, aiming to enhance the clarity and impact of the publication. The authors reviewed and edited the content produced with this tool/service and accept full responsibility for the final published content.

References

[1] Fatemeh Azad, Zoran Bosnić, and Matjaž Kukar. 2025. Meta-Imputation Balanced (MIB): an ensemble approach for handling missing data in biomedical machine learning. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBE). Submitted.
[2] Mark Collier, Bayan Mustafa, and Mihaela van der Schaar. 2020. VAEs in the presence of missing data. arXiv:2006.05301.
[3] Meng Du, Gábor Melis, and Zhaozhi Wang. 2023. ReMasker: imputing tabular data with masked autoencoding. In The Eleventh International Conference on Learning Representations.
[4] E. Ellington, Guillaume Bastille-Rousseau, Cayla Austin, Kristen Landolt, Bruce Pond, Erin Rees, Nicholas Robar, and Dennis Murray. 2014. Using multiple imputation to estimate missing data in meta-regression. Methods in Ecology and Evolution, 6, (Dec. 2014). doi: 10.1111/2041-210X.12322.
[5] Tlamelo Emmanuel, Thabiso Maupong, Dimane Mpoeleng, Thabo Semong, Banyatsang Mphago, and Oteng Tabona. 2021. A survey on missing data in machine learning. (May 2021). doi: 10.21203/rs.3.rs-535520/v1.
[6] Hao Feng, Lihui Chen, and Ke Wang. 2005. A SVM regression based approach to filling in missing values. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems. Springer, 581–587.
[7] Shun-Chuan Li, Bingsheng Jiang, and Benjamin M Marlin. 2019. MisGAN: learning from incomplete data with generative adversarial networks. In International Conference on Learning Representations.
[8] Wei-Chao Lin and Chih-Fong Tsai. 2020. Missing value imputation: a review and analysis of the literature (2006–2017). Artificial Intelligence Review, 53, 1487–1509.
[9] Roderick J. A. Little and Donald B. Rubin. 1986. Statistical Analysis with Missing Data. John Wiley & Sons. ISBN: 978-0471802545.
[10] Roderick J. A. Little and Donald B. Rubin. 2019. Statistical Analysis with Missing Data. Vol. 793. John Wiley & Sons.
[11] Qian Liu and Manfred Hauswirth. 2020. A provenance meta learning framework for missing data handling methods selection. In 2020 11th IEEE Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON). doi: 10.1109/UEMCON51285.2020.9298089.
[12] Arthur Mangussi, Miriam Santos, Filipe Loyola Lopes, Ricardo Cardoso Pereira, Ana Lorena, and Pedro Henriques Abreu. 2025. mdatagen: a Python library for the artificial generation of missing data. Neurocomputing, 625, (Apr. 2025), 129478. doi: 10.1016/j.neucom.2025.129478.
[13] Pierre-Alexandre Mattei and Jes Frellsen. 2019. MIWAE: deep generative modelling and imputation of incomplete data sets. In International Conference on Machine Learning. PMLR, 4413–4423.
[14] Dinar M P Murti, I N A Ramatryana, and A P Wibawa. 2019. K-nearest neighbor (K-NN) based missing data imputation. In 2019 5th International Conference on Science in Information Technology (ICSITech). IEEE, 83–88.
[15] 2023. Siamese autoencoder-based approach for missing data imputation. (June 2023), 33–46. ISBN: 978-3-031-35994-1. doi: 10.1007/978-3-031-35995-8_3.
[16] 2023. Automatic delta-adjustment method applied to missing not at random imputation. (June 2023), 481–493. ISBN: 978-3-031-35994-1. doi: 10.1007/978-3-031-35995-8_34.
[17] Ricardo Cardoso Pereira, Joana Cristo Santos, José Amorim, Pedro Rodrigues, and Pedro Henriques Abreu. 2020. Missing image data imputation using variational autoencoders with weighted loss. (Apr. 2020).
[18] Miriam Seoane Santos, Ricardo Cardoso Pereira, Adriana Fonseca Costa, Jastin Pompeu Soares, João Santos, and Pedro Henriques Abreu. 2019. Generating synthetic missing data: a review by missing mechanism. IEEE Access, 7, 11651–11667. doi: 10.1109/ACCESS.2019.2891360.
[19] Joseph L. Schafer and John W. Graham. 2002. Missing data: our view of the state of the art. Psychological Methods, 7, 2, 147–177. https://api.semanticscholar.org/CorpusID:7745507.
[20] Nandana Sengupta, Madeleine Udell, Nathan Srebro, and James Evans. 2022. Sparse data reconstruction, missing value and multiple imputation through matrix factorization. Sociological Methodology.
[21] Ying-Ying Song and Ying Lu. 2015. Decision tree methods: applications for classification and prediction. Shanghai Archives of Psychiatry, 27, 2, 130.
[22] Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. 2021. CSDI: conditional score-based diffusion models for probabilistic time series imputation. In Advances in Neural Information Processing Systems. Vol. 34, 24804–24816.
[23] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. 2018. GAIN: missing data imputation using generative adversarial nets. In International Conference on Machine Learning. PMLR, 5689–5698.
[24] Sanghoon Yoon and Sanghoon Sull. 2020. GAMIN: generative adversarial multiple imputation network for highly missing data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8456–8464.
[25] Jiaxuan You, Xiaobai Ma, Daisy Ding, Mykel Kochenderfer, and Jure Leskovec. 2020. Handling missing data with graph representation learning. In Advances in Neural Information Processing Systems. Vol. 33, 18357–18368.
[26] Shuhan Zheng and Nontawat Charoenphakdee. 2022. Diffusion models for missing value imputation in tabular data. arXiv preprint arXiv:2210.17128.
[27] Yuyang Zhou, Sarjyt Aryal, and Mohamed Reda Bouadjenek. 2024. Review for handling missing data with special missing mechanism. arXiv preprint arXiv:2404.04905.

Utilizing Large Language Models for Supporting Multi-Criteria Decision Modelling Method DEX

Marko Bohanec, Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia, marko.bohanec@ijs.si
Uroš Rajkovič, Faculty of Organizational Sciences, University of Maribor, Kranj, Slovenia, uros.rajkovic@um.si
Vladislav Rajkovič, Faculty of Organizational Sciences, University of Maribor, Kranj, Slovenia, vladislav.rajkovic@gmail.com

Abstract

We experimentally assessed the capabilities of two mainstream artificial intelligence chatbots, ChatGPT and DeepSeek, to support the multi-criteria decision-making process. Specifically, we focused on using the method DEX (Decision EXpert) and investigated their performance in all stages of DEX model development and utilization. The results indicate that these tools may substantially contribute in the difficult stages of collecting and structuring decision criteria and collecting data about decision alternatives. However, at the current stage of development, the support for the whole multi-criteria decision-making process is still lacking, mainly due to occasionally inconsistent and erroneous execution of methodological steps.
Keywords

Multi-criteria decision-making, decision analysis, large language models, method DEX (Decision EXpert), structuring decision criteria

1 Introduction

Multi-criteria decision-making (MCDM) [1] is an established approach to support decision-making in situations where it is necessary to consider multiple interrelated, and possibly conflicting, criteria and select the best solution based on the available alternatives and the preferences of the decision-maker. Traditionally, such models are developed in collaboration with decision makers and domain experts, who define the criteria, acquire decision makers' preferences and formulate the corresponding evaluation rules. The model-development process is demanding, as it includes structuring the problem, formulating all the necessary model components (such as decision preferences or rules) for evaluating decision alternatives, and analyzing the results.

With the development and success of generative artificial intelligence, especially large language models (LLMs) [2], the question arises as to how these models can support or perhaps partially automate decision-making processes. To this end, we explored the capabilities of recent mainstream LLM-based chatbots, specifically ChatGPT and DeepSeek, for supporting the MCDM process. We specifically focused on using the method DEX (Decision EXpert) [3], with which we have extensive experience, spanning multiple decades [4], in the roles of decision makers, decision analysts, and teachers. DEX is a full-aggregation [5] multi-criteria decision modelling method, which proceeds by making an explicit decision model. DEX uses qualitative (symbolic) variables to represent decision criteria, and decision rules to represent decision makers' preferences. Variables (attributes) are structured hierarchically, representing the decomposition of the decision problem into smaller, easier to handle subproblems. Traditionally, DEX models are developed using software such as DEXiWin [6], which helps the user to interactively construct a DEX model and use it to evaluate and analyze decision alternatives.

The reported research is of exploratory nature. We ran ChatGPT and DeepSeek multiple times over the last six months, either individually, as a group, or in classrooms with students. Typically, we first formulated some hypothetical decision problem and then guided the chatbot through the main stages of the MCDM process:

A. Model development stages:
1. Acquiring criteria
2. Definition of attributes (variables representing criteria)
3. Structuring attributes
4. Preference modeling (formulating decision rules)

B. Model utilization stages:
5. Definition of decision alternatives
6. Evaluation of alternatives
7. Explaining the results of evaluation
8. Analysis of alternatives

In doing this, we observed the responses generated by the LLMs and assessed them from the viewpoint of skilled decision analysts. The main goal was not to solve specific real-life decision problems, but to identify LLMs' strengths and weaknesses that may substantially affect the MCDM process.

Despite focusing on DEX, many of our findings are also applicable to other hierarchical full-aggregation MCDM methods [1, 5], such as AHP, MAUT/MAVT, and MACBETH, which follow the same methodological stages, with slight differences in the representation of model components.
In the following sections, we review the above-mentioned MCDM stages and describe our experience with each of them. Specifically, we illustrate the process with answers generated by ChatGPT-o3 and DeepSeek-V3. We considered a hypothetical personal decision problem of buying an electric-powered vehicle (EV). The chatbots were run in parallel on June 6th, 2025, using similar prompts. Our assessments and comments are somewhat broader, based on some other use-cases, not presented here.

2 Acquiring Criteria

A MCDM model includes multiple criteria that capture essential aspects of decision alternatives. These criteria are used to evaluate and compare the alternatives in a structured way. Defining criteria usually requires a good knowledge of the decision problem and the decision maker's goals. This step is usually one of the most difficult and may require consulting domain experts and/or relevant literature.

We asked the chatbots: "I am considering buying a new EV vehicle for up to 35000 EUR in Slovenia. Suggest criteria for evaluating such cars."

Both ChatGPT and DeepSeek came out with an extensive and structured list of criteria. Table 1 shows the criteria suggested by DeepSeek. ChatGPT's suggestions were similar, though they employed slightly different high-level categories, which also incorporated measurement units associated with each criterion.

Table 1: EV cars: Criteria suggested by DeepSeek. Items 4–8 shortened due to space limitations.

1. Range & Battery Efficiency
• Real-world range (WLTP or EPA estimates) – Aim for at least 300 km in mixed conditions.
• Battery size (kWh) – Larger batteries offer more range but may increase cost.
• Energy consumption (kWh/100 km) – Lower is better (e.g., 14–18 kWh/100 km).
2. Charging Speed & Compatibility
• DC fast-charging capability (e.g., 50 kW, 100 kW, or higher).
• AC charging speed (e.g., 7.4 kW, 11 kW, or 22 kW for home/work charging).
• Charging port type (CCS Combo is standard in Europe).
3. Performance & Driving Experience
• Motor power (kW/HP) – Affects acceleration and highway performance.
• Drivetrain (FWD, RWD, AWD) – AWD is better for Slovenia's varied weather.
• Top speed & acceleration (0–100 km/h time).
4. Practicality & Usability: Cargo space, Seating capacity, Ground clearance
5. Price & Incentives: Purchase price, Slovenian EV incentives, Insurance & maintenance costs
6. Technology & Features: Infotainment system, Driver assistance, Heat pump availability
7. Availability & After-Sales Support: Local dealership & service network, Warranty
8. Resale Value & Future-Proofing: Brand reputation, Software updates
Let us immediately say that we consider this the most important single contribution of LLMs to MCDM modeling. We are not aware of any previous method that would allow identifying and structuring decision criteria in such depth and detail in literally just a minute. Of course, for "serious" applications getting such a list does not take the burden off the user, who is still responsible for verifying the suggestions and checking the criteria for relevance and correctness. Nevertheless, this is a valuable starting point that can save days or even weeks of work. This stage does not depend on the MCDM method used, so other methods may benefit from using LLMs equally well.

3 Definition of Attributes

In this stage, the task is to define variables, called attributes, that represent criteria in a MCDM model. As most MCDM methods use numeric attributes, this stage is specific to DEX, which uses qualitative attributes. Therefore, this and the following stages require LLMs to "understand" the method used. While DEX is less widely known than methods such as AHP or MAUT, it is nonetheless used and valued in various applications. Anyway, we were somewhat surprised to find out that all consulted LLMs were familiar with DEX and reasonably capable of following its main methodological steps. In some steps, however, we had to specify additional requirements to obtain proper DEX model components.

Generally, defining qualitative value scales of attributes was not too difficult for LLMs. Asking "Suggest preferentially ordered value scales" typically gives good suggestions for value scales, for example (DeepSeek):

Purchase Price: High (>€34k) → Medium (€30k–34k) → Low (<€30k)
Government Incentives: None → Moderate (€1k–3k) → High (>€3k)
Insurance & Maintenance: Expensive → Moderate → Cheap

Interestingly, in our first attempts at using LLMs, they were not fully aware of DEX specifics, such as that value scales should have a small number of values, preferentially ordered from "bad" to "good" [3]. After correcting LLMs a few times, they "remembered" and now suggest properly formulated value scales.

4 Structuring Attributes

After acquiring the criteria and defining attributes, the next step is to structure attributes into a hierarchy (most often an ordinary tree). Following the previous stages, which already resulted in a well-organized criteria structure and proper definition of attributes, this stage looks quite easy for LLMs. Figure 1 shows a full DEX model structure as suggested by ChatGPT without any further instructions. In comparison, DeepSeek's suggestion (not shown here) was somewhat inferior. Also, it was generated only after we had instructed it that DEX requires "narrow" trees with only two to three descendants for each aggregate attribute. Anyway, we consider both structures appropriate and comparable to the achievements of the best university-level students.

Figure 1: DEX model structure generated by ChatGPT.
5 Decision Preferences and Decision Rules

In DEX, decision maker's preferences are represented by decision rules, organized in decision tables. For each aggregate attribute, the user is asked to investigate all combinations of lower-level attribute values and assess the corresponding outcomes. Depending on the number of aggregate attributes, this might be a laborious task, but it can usually be carried out using software such as DEXiWin without too much hassle. LLMs are also capable of suggesting perfectly valid decision tables, as illustrated in Table 2, which suggests the values of Purchase-Cost depending on Net-Price-After-Subsidy and Registration-Fee.

Table 2: Decision table for Purchase-Cost (ChatGPT).

Net-Price-After-Subsidy ▼ /   very-low   low      medium    high
Registration-Fee ►            (0.5 %)    (1 %)    (1–2 %)   (> 2 %)
very-low (≤ 25 k€)            very-low   low      low       medium
low (25–30 k€)                low        low      medium    medium
medium (30–33 k€)             medium     medium   medium    high
high (33–35 k€)               high       high     high      high

From the DEX perspective, it is important to remark that Table 2 is complete (addressing all possible combinations of input values) and preferentially consistent (increasing input values result in increasing outputs). Initially, these requirements were not obvious to LLMs, and we had to request them explicitly. After further use, LLMs now generate appropriate rules by themselves.
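Both requirements are easy to verify mechanically. The sketch below checks a two-attribute decision table such as Table 2, encoding scale values by their position on the ordered scale; this is our own illustration, not part of DEX or the DEXiWin software:

ORDER = ["very-low", "low", "medium", "high"]
RANK = {v: i for i, v in enumerate(ORDER)}

# Table 2 re-encoded: rows = Net-Price-After-Subsidy, columns = Registration-Fee.
TABLE = {
    "very-low": ["very-low", "low", "low", "medium"],
    "low":      ["low", "low", "medium", "medium"],
    "medium":   ["medium", "medium", "medium", "high"],
    "high":     ["high", "high", "high", "high"],
}

def is_complete(table):
    # Completeness: every combination of input values maps to an output.
    return len(table) == len(ORDER) and all(len(r) == len(ORDER) for r in table.values())

def is_consistent(table):
    # Preferential consistency: the output never decreases when either
    # input increases along its ordered scale (here, a higher price or
    # fee means a higher purchase cost).
    rows = [[RANK[v] for v in table[p]] for p in ORDER]
    down = all(rows[i][j] <= rows[i + 1][j]
               for i in range(len(ORDER) - 1) for j in range(len(ORDER)))
    right = all(rows[i][j] <= rows[i][j + 1]
                for i in range(len(ORDER)) for j in range(len(ORDER) - 1))
    return down and right

assert is_complete(TABLE) and is_consistent(TABLE)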
However, it is important to emphasize an essential issue. Decision making is considered a subjective process, and decisions should reflect individual decision-makers' preferences. Using LLMs, we only get preferences generalized from many documents. In order to impose our own preferences, we should (1) carefully check LLMs' suggestions and (2) request modifications, if necessary. According to our experience, this is possible by formulating prompts such as "increase the importance of some attribute" or "alternatives exceeding some value of some attribute are unacceptable". Nevertheless, this requires focus, a good understanding of the goals, and perpetual verification of the results.

6 Identification of Decision Alternatives

The goal of this stage is to identify decision alternatives and describe them in terms of values assigned to input attributes of the model. This is another stage in which LLMs excel due to their ability to act like agents: autonomously collecting relevant data from external sources. In this case, the LLMs constructed lists of actual EV cars, consulting sources such as avto.net, citroen.si, reuters.com, and ev-database.org:

MG 4 Comfort 64 kWh – €31 490 → €24 290 after Eko-sklad
Citroën ë-C4 54 kWh – €31 900 → €24 700
Opel Corsa Electric 54 kWh – from €28 490 → €21 290
BYD Dolphin Surf Boost 43 kWh – €22 990 → €15 790*
Fiat 500e 42 kWh – promo €22 290 (price already after bonus)

It is important to understand that alternatives should be defined after developing the model. This allows the LLM to search for and collect the data needed to evaluate the alternatives.

7 Evaluation of Alternatives

In this stage, the five cars were evaluated by the LLMs. All were evaluated as GOOD, except Fiat 500e, which was evaluated as ADEQUATE. Table 3 illustrates the output generated by ChatGPT, showing the evaluation results of the two best-ranked cars. In addition to the overall evaluation (EV-SUITABILITY), the output includes evaluations achieved at the three main branches of the model, together with a brief notification of reasons for such evaluations. This is an appropriate summary of evaluation results.

Table 3: Evaluation of the two best-ranked EV cars (ChatGPT).

Model            MG 4 Comfort 64                  Citroën ë-C4
Cost-Ownership   good (low purchase-cost &        good (very-low purchase-cost;
                 running-cost)                    running-cost low)
Capability       good (435 km WLTP, 140 kW DC,    adequate–good (416 km WLTP,
                 363 L boot)                      100 kW DC, 380 L boot)
Responsibility   good (★5 Euro-NCAP, heat-pump    good (★4 Euro-NCAP, heat-pump
                 optional, V2L 3.6 kW)            std., forthcoming V2L)
EV-SUITABILITY   GOOD                             GOOD
Rank             1                                2

8 Explaining the Results of Evaluation

In MCDM, it is essential to understand the evaluation method and explain the reasons underlying evaluation results. This is how ChatGPT explains the evaluation method:

1. Leaf values were taken from the most recent Slovene price-lists or Europe-wide tech sheets (all ≤ €35 000 trims).
2. Running-costs use your 30 000 km/year mileage and the Q1-2025 average household tariff of €0.166/kWh.
3. Each quantitative item was mapped to the ordered value-scales we set earlier; the three rule-tables ("purchase-cost", "running-cost", "economy") were then fired, and all higher nodes were combined with simple equal weights (0.4 Cost-Ownership | 0.4 Capability | 0.2 Responsibility).
4. Scores were coded excellent = 4, good = 3, adequate = 2, poor = 1; the root class was chosen by rounding the weighted mean to the nearest integer.

At first, everything seems to make sense. However, careful reading reveals the disturbing fact that different decision rules from those agreed upon in the previous stages were used to evaluate alternatives. Unfortunately, this often happens with LLMs, which tend to "forget" about the previous MCDM stages. It is not uncommon that attributes, their value scales and decision rules change from prompt to prompt. This severely undermines the trust in using LLMs and makes the whole process uneasy: rather than focusing on solving the decision problem, the user is forced to meticulously check each and every step. Also, it is not uncommon to discover logical errors or even basic computational errors (often referred to as "hallucinations" [7]). In one of our sessions with ChatGPT, it displayed the evaluation formula

(0.2 × 3) + (0.25 × 4) + (0.15 × 4) + (0.2 × 3) + (0.15 × 2) + (0.05 × 2) = 3.15

which looked convincing, but gave a hard-to-notice, wrong result; the correct result is 3.2.
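Errors of this kind are trivial to catch by recomputing the reported expression outside the chatbot, for example:

weights = [0.2, 0.25, 0.15, 0.2, 0.15, 0.05]
scores = [3, 4, 4, 3, 2, 2]
value = sum(w * s for w, s in zip(weights, scores))
print(round(value, 2))  # 3.2, not the 3.15 reported by the chatbot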
9 Analysis of Alternatives

The last stage of the MCDM process is the analysis of alternatives, which is aimed at exploring the decision space using methods such as what-if and sensitivity analysis. Without providing experimental evidence due to space restrictions, we can say that, in principle, LLMs are capable of performing such analyses, giving appropriate answers and explanations to questions such as:

• Carry out sensitivity analysis for Citroën ë-C3 and MG4 depending on buying price and operating costs.
• What would have to change for Fiat 500e 42 to become a good EV vehicle?

In most cases, results are correct and informative, particularly in cases when an explicit explanation is requested by the user. However, the issues of using inappropriate model components and making logical and computational errors were detected in this stage as well.

10 Discussion

LLMs are developing rapidly and becoming increasingly capable. They may evolve under the hood, so that even the same version can behave differently depending on recent updates or user-specific factors. This makes them challenging for conducting rigorous scientific research. They come without user manuals, requiring their users to explore their capabilities on their own. This study is an experimental attempt at understanding the capabilities of the current (2025) mainstream LLMs for supporting the MCDM process, with special emphasis on the DEX method. On this basis, we could not formulate firm conclusions, but were still able to make observations and formulate recommendations that might help MCDM practitioners.

The single most important contribution of LLMs to MCDM is their ability to formulate a well-structured list of relevant criteria in the first stage (section 2). Nothing nearly as good was available so far for that difficult stage, where LLMs can now substantially boost the process and save a lot of effort and time. The second important contribution is the capability of LLMs to act as agents and collect data about alternatives (section 6) from various external resources.

Considering individual MCDM stages, LLMs' performance is quite impressive. They are capable of evaluating and analyzing alternatives without much instruction. Furthermore, if asked, they can explain the used methods and obtained results quite well. In some cases, however, a seemingly convincing explanation may fall apart, revealing logical and computational errors.

Considering the MCDM process as a whole, the performance of LLMs is not as favorable. In subsequent MCDM stages, LLMs tend to "change their mind" without notice, modifying the already established model components: attributes, value scales, and decision rules. Consequently, this requires a lot of attention from the user's side, who has to check the outputs and perpetually remind the LLMs to remain consistent. This distracts the process and often carries the user away from the main decision-making task. Also, we should warn that in the preference-modelling stage (section 5), LLMs suggest generalized preferences that might substantially differ from the user's subjective preferences, which need to be enforced explicitly.

In summary, LLMs can substantially contribute to the definition of attributes and alternatives, but are unsuitable for carrying out the whole MCDM process due to possible inconsistent and erroneous executions of the MCDM method. We believe that, given the current state of LLM development, it is more convenient and safer to use specialized and trusted MCDM software, such as DEXiWin. Nevertheless, LLMs evolve fast and we may expect substantial improvements in the future.

Acknowledgments

The authors acknowledge the financial support from the Slovenian Research and Innovation Agency for the programme Knowledge Technologies (research core funding No. P2-0103 and P5-0018).

References

[1] Kulkarni, A.J. (Ed.), 2022: Multiple Criteria Decision Making. Studies in Systems, Decision and Control 407, Singapore: Springer, doi: 10.1007/978-981-16-7414-3_3.
[2] Kamath, U., Keenan, K., Somers, G., Sorenson, S., 2024: Large Language Models: A Deep Dive: Bridging Theory and Practice. Springer, 506 p, ISBN-13 978-3031656460.
[3] Bohanec, M., 2022: DEX (Decision EXpert): A qualitative hierarchical multi-criteria method. In: Multiple Criteria Decision Making (ed. Kulkarni, A.J.), Studies in Systems, Decision and Control 407, Singapore: Springer, doi: 10.1007/978-981-16-7414-3_3, 39–78.
[4] Bohanec, M., Rajkovič, V., Bratko, I., Zupan, B., Žnidaršič, M., 2013: DEX methodology: Three decades of qualitative multi-attribute modelling. Informatica 37, 49–54.
[5] Ishizaka, A., Nemery, P., 2013: Multi-criteria decision analysis: Methods and software. Chichester: Wiley.
[6] Bohanec, M., 2024: DEXiWin: DEX Decision Modeling Software, User's Manual, Version 1.2. Ljubljana: Institut Jožef Stefan, Delovno poročilo IJS DP-14747. Accessible from https://dex.ijs.si/dexisuite/dexiwin.html.
[7] Banerjee, S., Agarwal, A., Singla, S., 2024: LLMs will always hallucinate, and we need to live with this. doi: 10.48550/arXiv.2409.05746.

Landscape-Aware Selection of Constraint Handling Techniques in Multiobjective Optimisation

Jordan N. Cork, Andrejaana Andova, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, {jordan.cork, andrejaana.andova}@ijs.si
Pavel Krömer, Technical University of Ostrava, Ostrava, Czech Republic, pavel.kromer@vsb.cz
Tea Tušar, Bogdan Filipič, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, {tea.tusar, bogdan.filipic}@ijs.si

Abstract

Constrained multiobjective optimisation problems (CMOPs) are common in real-world optimisation.
They often involve expensive solution evaluations and, therefore, it is helpful to know the best methods to solve them prior to actually solving them. These problems also tend to be relatively difficult for algorithms compared to the majority of test problems. This difficulty often presents itself in the infeasible region, calling for a focus on the constraint handling technique (CHT). The purpose of this work is to select the best CHT for problems with difficult constraint functions. This first involves the collection of a set of such problems. CHT selection is then conducted using problem characterisation and machine learning. The outcomes are positive in that prediction achieved a high accuracy. Additionally, further insights are provided into the features that describe CMOPs.

Keywords

constrained multiobjective optimisation, algorithm selection, problem selection, constraint handling techniques

1 Introduction

Real-world optimisation problems very often have multiple objectives and are subject to one or more constraints. This is the domain of constrained multiobjective optimisation (CMO). These problems are generally demanding to solve and have restrictions to the available computational budget. These restrictions make it all the more important to know the best method for solving the problem prior to actually attempting to solve it. This calls for an algorithm selection methodology.

One approach to algorithm selection, known as landscape-aware selection, is to first characterise the problem before conducting the algorithm run [2]. Characterisation involves the calculation of features used to describe the objectives and constraints, as well as their interaction. This is done using a small set of sampled solutions. Once the problem is characterised, knowledge of similar problems can be used to determine the best approach to solving it. This approach is taken in this study and applied to constraint handling techniques (CHTs). CHTs are methods designed to guide optimisation algorithms in dealing with infeasible solutions, by taking as input the problem constraints and candidate solutions, and producing outputs that either repair, penalize, or rank these solutions to balance feasibility with optimality.

There are three primary contributions from this work, all within the CMO domain. The first is related to the set of problems used to train the algorithm selection model. Real-world optimisation problems are often difficult to solve, particularly when they include constraints. The field requires a methodology for selecting a subset of problems with difficult constraint functions from the larger set of known problems. This is the first contribution. The CHT selection methodology is then tested on these problems. This methodology is the second contribution. Here, problem characterisation and machine learning are used to predict the best-performing CHT. The final contribution is a set of insights into the features used. The decision tree output by the CHT selection methodology provides significant insights into both which features are useful and what the features reveal about the problems.

The paper is further structured as follows. In Section 2, CMO is introduced, providing the required background. Section 3 describes the two selection methodologies, as well as the validation method used. Section 4 presents the experimental setup. In Section 5, the results from the experiments are presented. Finally, in Section 6, the work is summarised and future work is outlined.

2 Constrained Multiobjective Optimisation

Constrained multiobjective optimisation (CMO) involves the optimisation of two or more objective functions given one or more constraint functions. The constraints may be of the equality or inequality forms; however, in this study, only inequality constraints are considered. Such a CMO problem (CMOP) may be formulated as follows:

minimize f_m(x), m = 1, ..., M,
subject to g_j(x) ≤ 0, j = 1, ..., J,    (1)

where x = (x_1, ..., x_D) ∈ R^D is a D-dimensional solution vector, f_m(x) are the objective functions, and g_j(x) the inequality constraint functions. M is the number of objectives and J the number of inequality constraints.
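For illustration, the sketch below defines a toy CMOP in the form of Eq. (1) with M = 2 and J = 1 and evaluates a candidate solution, including its overall constraint violation. The problem itself is invented for this example:

import numpy as np

def evaluate(x):
    # A toy CMOP: two objectives, one inequality constraint.
    f1 = float(np.sum(x ** 2))           # objective 1 (minimise)
    f2 = float(np.sum((x - 1.0) ** 2))   # objective 2 (minimise)
    g1 = 0.5 - float(np.sum(x))          # feasible when g1 <= 0
    violation = max(g1, 0.0)             # overall constraint violation
    return (f1, f2), violation

objectives, violation = evaluate(np.array([0.2, 0.1]))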
CMO requires an indicator for assessing the quality of the set of optimal points. This indicator is $I_{\mathrm{CMOP}}$. It was proposed in [19] to handle quality assessment in the three following situations. When no feasible solutions are found, it uses the minimum constraint violation. When feasible solutions are found, but these are outside of the region of interest (ROI) bound by the given reference point (RP), the distance to the ROI is used. Finally, when solutions are found within the ROI, it uses the hypervolume (HV). The HV measures the portion of the objective space dominated by the set of solutions relative to the RP. $I_{\mathrm{CMOP}}$ was proposed as a value to be minimised. However, it is commonly maximised, following the moarchiving package implementation [9]. On top of $I_{\mathrm{CMOP}}$, the maximised area under the runtime profile curve is used to measure the anytime performance of the algorithm [8]. Here, the runtime profile is the proportion of performance targets attained with respect to the evaluation number.

Many methodologies in CMO use an $I_{\mathrm{CMOP}}$ value with normalised function values. For this, the function values of the problems' optimal solution set are required. Together, these are known as the Pareto front. The Pareto front may be obtained directly, through knowledge of the problem's construction. Often this is not possible, however, and, therefore, algorithm runs are used to construct an approximation of the front.

In [4], 13 benchmark suites are listed, consisting of 139 test problems. These test problems can be instantiated in various numbers of dimensions and objectives. This allows a substantially larger number of test problem instances to be generated from these 139 base test problems.

Problem characterisation is conducted using exploratory landscape analysis (ELA) features [16]. Work done in [1] has listed 80 such features for CMO. These come from three landscapes: the multiobjective, violation and multiobjective-violation landscapes. The features can be computed via sampling or random walks.

There are several constraint handling techniques; four of these are considered in our study. The first is the constrained-domination principle (CDP), proposed along with the NSGA-II algorithm [5]. This is a feasibility-first approach, where feasible solutions are preferred over infeasible ones. The penalty CHT is a classic method and applies a penalty value to the objective values [20], either statically or dynamically. The Improved-Epsilon (I-Epsilon) CHT was designed to work with the MOEA/D algorithm [7]. It dynamically adjusts the $\epsilon$ value based on the number of feasible solutions; solutions whose constraint violation is below the $\epsilon$ value are considered feasible. Finally, stochastic ranking (SR) uses a probability value to switch between comparing solutions based on objectives or constraints [18].
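As an illustration of two of these CHTs, the following is a minimal sketch of the constrained-domination principle and the static penalty, under the conventions of Eq. (1); the function names are ours, and only the penalty value of 100 (used later in Section 4) comes from the paper.

```python
import numpy as np

def violation(g):
    """Total constraint violation; a solution is feasible when all g_j <= 0."""
    return np.sum(np.maximum(g, 0.0))

def cdp_better(f1, g1, f2, g2):
    """Constrained-domination principle [5]: feasible beats infeasible,
    two infeasible solutions compare by total violation, and two feasible
    solutions compare by Pareto dominance on the objectives."""
    v1, v2 = violation(g1), violation(g2)
    if v1 == 0 and v2 > 0:
        return True
    if v1 > 0 and v2 > 0:
        return v1 < v2
    if v1 > 0 and v2 == 0:
        return False
    return bool(np.all(np.asarray(f1) <= f2) and np.any(np.asarray(f1) < f2))

def static_penalty(f, g, penalty=100.0):
    """Static penalty CHT [20]: add a fixed multiple of the total violation
    to every objective (100 is the static value chosen in Section 4)."""
    return np.asarray(f) + penalty * violation(g)
```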
3 Methodology

This section presents the methodologies used in the study. First, the methodology for selecting the hard test problems is presented, followed by the methodology for selecting the appropriate CHT and the means for testing the model.

3.1 Difficult Problem Selection

Testing the CHT selection methodology requires test problems. Test problems with too easy constraint functions are less likely to show differences among the CHTs, as algorithms will spend less time dealing with infeasible solutions. More difficult constraint functions, on the other hand, will force the algorithm to deal with infeasible solutions longer and, therefore, give the CHTs time to show their differences. Test problems with difficult constraint functions are therefore desired for our testing.

As mentioned in Section 2, anytime performance is measured using the area under the runtime profile curve (AUC), with $I_{\mathrm{CMOP}}$ maximised as the indicator. In this study, difficulty is determined based on the anytime performance of a set of algorithms, $\mathcal{A}$. Each of the algorithms is run on the problem $R$ times and the average AUC is taken. This is to ensure robustness. It should be noted that when recording the runs, an archive of all non-dominated solutions is kept, and the $I_{\mathrm{CMOP}}$ value from this archive is recorded at each solution evaluation. The budget must also be chosen, with budgets allowing algorithm convergence preferred. The maximum average AUC is then used as the problem difficulty, with lower values signifying harder problems. This is formulated as follows:

$$\mathrm{Difficulty}(p) = 1 - \max_{a \in \mathcal{A}} \left( \frac{1}{R} \sum_{r=1}^{R} \mathrm{AUC}(p, a, r) \right) \quad (2)$$

This problem difficulty is calculated for each of the problems in the set of problems, $\mathcal{P}$.

Within the current selection, there will still be cases where all CHTs perform roughly the same on the problem. These problems are removed using statistical and practical threshold tests on the final $I_{\mathrm{CMOP}}$ values from the 30 runs. Given that a normal distribution cannot be ensured in the 30 values from each of the algorithm runs, the Kruskal-Wallis test is used [11]. It determines whether independent samples come from the same distribution. However, this still leaves problems with no practical differences in their scores. To filter these out, the mean scores are tested for whether they vary by more or less than a small delta, and those that vary less are removed. Following the filtering out of problems where no meaningful differences are observed, the $N$ most difficult problems from the remaining set are selected. This leaves one with a suite of difficult problems upon which at least one of the algorithms from $\mathcal{A}$ performs differently.
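A compact sketch of this selection step, assuming 30 recorded runs per CHT and SciPy's Kruskal-Wallis implementation; the significance level alpha is not stated in the paper and is illustrative here.

```python
import numpy as np
from scipy.stats import kruskal

def difficulty(auc):
    """Eq. (2): auc has shape (len(A), R); difficulty is one minus the best
    mean AUC over the R runs of any algorithm."""
    return 1.0 - auc.mean(axis=1).max()

def keep_problem(final_icmop, alpha=0.05, delta=1e-3):
    """Keep a problem only if the CHTs differ on it: a Kruskal-Wallis test
    on the final I_CMOP values of the 30 runs per CHT, plus a practical
    test that the mean scores spread by more than a small delta."""
    _, p_value = kruskal(*final_icmop)  # final_icmop: per-CHT lists of run values
    means = np.array([np.mean(v) for v in final_icmop])
    return p_value < alpha and (means.max() - means.min()) > delta
```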
3.2 Constraint Handling Technique Selection

The general concept for CHT selection is as follows. First, a machine learning model is trained using the features from each problem in the training set. The labels are the best-performing CHTs on each problem. At inference time, features are calculated on the problem in question (note: this consumes a portion of the available budget). These features are used as input to the machine learning algorithm. The resulting model then predicts the best-performing CHT for use during the run.

Each step will now be described in more detail. The first step is to choose a base algorithm and a set of algorithm-relevant CHTs. The preferred approach would be to select the most appropriate algorithm for the problem to be solved at inference time.

The second step is generating the training data for the machine learning model. First, the features for each of the problems in the training set are gathered. The labels must then be computed, which requires algorithm runs; 30 for each CHT. For this, the budget must be selected carefully. The model, at inference, can be expected to work well only if the budget is the same as it was in training. The average final values from the 30 runs are then taken for each CHT. In CMO, these are the average final $I_{\mathrm{CMOP}}$ values, which are being maximised. The CHT with the highest value is then selected as the best-performing CHT. This is used as the label. Once this has been done for each of the problems in the training set, the training data is complete.

The third step is to train the model. A decision tree is preferred for its explainability properties. To enhance the explainability of the model, the depth of the tree should be kept at a minimum. Testing is described in the next subsection. Once complete, i.e. trained with all training data, the model is available for inference.

3.3 Cross-Validation Testing

Testing the model involves a leave-one-problem-out cross-validation approach. Here, a problem is taken out of the training set and left as the test problem. The model is then trained on the data from the remaining problems in the training set. To predict the best-performing CHT, the features from the test problem are used as input to the model. The model then makes a prediction for the best-performing CHT. This is compared to the actual result.

The methodology makes allowances for when two or more CHTs perform similarly well on the same problem. The prediction made by the algorithm is then correct if it selects any of these. Determining whether two or more CHTs are statistically the same is achieved through the use of a statistical test, which in this case was the Mann-Whitney U test [15]. Again, this test was chosen because a normal distribution cannot be ensured in the resulting final values from the runs. The process is as follows. The CHT with the best mean score is selected; then each of the other CHTs is tested individually against the best-performing CHT to determine if they are equivalent, forming the set of best-performing CHTs. If the predicted label is within this set, it is considered correct. This process is conducted for all problems in the training set, and a final percentage of correct predictions is given.
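The following sketch puts Sections 3.2 and 3.3 together: Mann-Whitney-based equivalence sets and leave-one-problem-out accuracy. The data layout (arrays of features and labels, plus per-problem per-CHT run values) and the alpha level are assumptions, not taken from the paper.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.tree import DecisionTreeClassifier

def best_cht_set(final_icmop, alpha=0.05):
    """Set of CHTs statistically equivalent to the one with the best mean
    final I_CMOP, per the Mann-Whitney U test of Section 3.3."""
    means = [np.mean(v) for v in final_icmop]
    best = int(np.argmax(means))
    ties = {best}
    for k, values in enumerate(final_icmop):
        if k != best and mannwhitneyu(values, final_icmop[best]).pvalue >= alpha:
            ties.add(k)
    return ties

def lopo_accuracy(features, labels, run_values, max_depth=3):
    """Leave-one-problem-out CV: a prediction counts as correct if it falls
    within the set of equivalently best CHTs for the held-out problem."""
    hits = 0
    for i in range(len(features)):
        mask = np.arange(len(features)) != i
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        tree.fit(features[mask], labels[mask])
        pred = tree.predict(features[i:i + 1])[0]
        hits += pred in best_cht_set(run_values[i])
    return hits / len(features)
```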
4 Experimental Setup

In this section, the inputs to the methodologies are described, along with the packages used throughout.

There are several inputs to the difficult problem selection methodology. First, there is the set of problems, $\mathcal{P}$. The dimensions chosen were 2, 3, 5, 10 and 30, with only biobjective problems considered. This resulted in 375 problem instances. The problems were translated from Matlab by hand or taken from pymoo [3].

For $\mathcal{A}$, i.e. the set of algorithms, the natural choice was to choose a base algorithm with different constraint handling techniques. The base algorithm chosen was NSGA-II [5]. This was used for its versatility with regard to adding various CHTs. Regarding CHTs, CDP, penalty, I-Epsilon and SR were chosen for their compatibility with NSGA-II. CDP was provided as default with NSGA-II by pymoo. The others were implemented by hand. The penalty value selected was a static 100, while the settings for all others were the proposed defaults. $R$ was set at 30.

The number of difficult problems selected, $N$, was set at 20. This number is adequate to test the methodology while still being small enough to manage. The budget selected was the one to be used throughout the study, i.e. $10{,}000 \cdot D$. The delta value for detecting practical differences was set at 0.001.

For the CHT selection methodology, the choice of training problems was the set of difficult problems derived from the setup above. The base algorithm and CHTs were the same as those selected above. The model selected was a decision tree (scikit-learn [17]). The tree depth parameter was the only parameter tuned. This tuning was done manually, decreasing from 10 to 3, until the performance began to reduce. Finally, the problem features used were the 80 features described in [1]. These were calculated with a sample size of $1{,}000 \cdot D$. The random walks were simulated using these same samples.

Table 1: The results from cross-validation testing using the leave-one-problem-out methodology. The first column lists the test problems in order of difficulty (descending). D indicates the dimensionality. All problems are biobjective. The models were trained on all problems in the list, bar the test problem in question. 'Actual' lists the best-performing CHT labels, while the prediction column shows the predicted label. If the predicted label is in the actual labels list, the prediction is considered correct. The CHT labels 0, 1, 2 and 3 are CDP, penalty, I-Epsilon and SR, respectively.

Problem     D   Diffic.  Pred.  Actual     Correct
DC2-DTLZ3   30  0.976    2      [2]        Yes
DC2-DTLZ1   30  0.965    2      [2]        Yes
DC2-DTLZ1   10  0.541    2      [2]        Yes
DC2-DTLZ3   10  0.528    2      [2]        Yes
NCTP7       30  0.489    0      [0, 3]     Yes
NCTP8       10  0.355    3      [0, 1, 3]  Yes
NCTP15      10  0.339    3      [0, 1, 3]  Yes
DOC3        10  0.330    1      [0, 1, 3]  Yes
NCTP2       10  0.284    3      [0, 1]     No
NCTP1       10  0.279    3      [0, 1, 3]  Yes
NCTP7       10  0.269    3      [0, 3]     Yes
CTP6        30  0.257    1      [0, 1, 2]  Yes
CTP8        30  0.249    0      [0, 1, 2]  Yes
C1-DTLZ3    30  0.240    2      [0, 1, 2]  Yes
DC2-DTLZ1   5   0.230    2      [2]        Yes
CTP8        10  0.227    0      [0, 1, 2]  Yes
DC2-DTLZ3   5   0.219    2      [2]        Yes
DC3-DTLZ1   30  0.214    2      [2]        Yes
NCTP17      10  0.203    0      [0, 1, 2]  Yes
NCTP10      10  0.202    1      [0, 1, 2]  Yes

Figure 1: The decision tree built on all the training data. It is used to predict the four CHTs. The indices of the values in the value lists, indicating the number of instances, signify CDP, penalty, I-Epsilon and SR, respectively. The tree reads:

f_range_coeff <= 10.96 (samples = 20, value = [5, 4, 8, 3], class = I-Epsilon)
  True:  lnd_avg_rws <= 0.19 (samples = 12, value = [5, 4, 0, 3], class = CDP)
    True:  corr_cobj_max <= 0.62 (samples = 8, value = [1, 4, 0, 3], class = Penalty)
      True:  leaf (samples = 5, value = [1, 4, 0, 0], class = Penalty)
      False: leaf (samples = 3, value = [0, 0, 0, 3], class = SR)
    False: leaf (samples = 4, value = [4, 0, 0, 0], class = CDP)
  False: leaf (samples = 8, value = [0, 0, 8, 0], class = I-Epsilon)
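For concreteness, a minimal sketch of this model-training step with scikit-learn [17], as named in the setup above; the feature matrix and labels below are random placeholders standing in for the 80 ELA features and the best-CHT labels of the 20 difficult problems.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

CHT_NAMES = ["CDP", "Penalty", "I-Epsilon", "SR"]

# Placeholders: X would hold the 80 ELA features from [1] for the 20
# difficult problems, and y the index of the best-performing CHT.
X = np.random.rand(20, 80)
y = np.random.randint(0, 4, size=20)

# Depth tuned manually from 10 down to 3, as described in Section 4.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# A text rendition in the spirit of Figure 1 (hypothetical feature names).
print(export_text(tree, feature_names=[f"ela_{i}" for i in range(80)]))
```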
5 Results

In this section, the results from carrying out the methodologies are described. First, the construction of the set of difficult problems is discussed. Then, the experimental results are presented. Finally, the resulting decision tree is discussed.

The difficulty of each problem was calculated as described in Section 3. The results were heavily skewed towards the easy-problem side. With the parameter $N$ set to 20, that many problems were selected. The difficulties of these ranged from 0.202 to 0.976. The selected problems are listed in Table 1 in order of descending difficulty. They include 5, 10 and 30 dimensional problems, with 2 and 3 dimensional problems clearly being easier to solve. The problems come from the following suites: DC-DTLZ [13], NCTP [12], DOC [14], CTP [6] and C-DTLZ [10].

Table 1 additionally shows the results from the cross-validation testing phase of the experiments. As described in Section 3, each problem was given its turn as the test problem, while the others acted as training problems. For 95% of these, the model predicted correctly from the set of actual best-performing CHTs.

Figure 1 shows the decision tree that resulted from training on all of the available data. As can be seen, the decision tree leaf nodes are nearly pure, meaning it achieved near 100% accuracy on the training data. Due to its high accuracy on the test data and the low tree depth, this is not believed to be overfit.

Only 3 of the 80 supplied features were included in the model, indicating their importance in identifying appropriate CHTs. The first of these, separating out I-Epsilon, was f_range_coeff (the difference between the maximum and minimum of the absolute value of the linear regression model coefficients, where the model is fitted between the decision variables and the unconstrained ranks). This is a multiobjective landscape feature, focusing on variable scaling. The second feature, separating out CDP, was lnd_avg_rws (the average proportion of locally non-dominated solutions in the neighbourhood). This is a multiobjective-violation landscape feature, focusing on evolvability, i.e. the degree to which the problem landscape facilitates evolutionary improvement.
The final feature, distinguishing between penalty and SR, was corr_cobj_max (the maximum of the constraints and objectives correlation). This is also a multiobjective-violation landscape feature, focusing on evolvability. It should be noted that the features are not all related to the violation landscape, but also deal with the objective functions.

6 Conclusion

In this study, the focus was on the needs of real-world CMOPs. These problems are often difficult for algorithms to solve and require expensive solution evaluations. Given the cost of these evaluations, it is helpful to know the best method for solving the problem prior to actually solving it. To address this, the study focused on selecting the most appropriate CHT, a crucial component of any algorithm operating in CMO. For this selection task, it was critical to test on problems with difficult constraint functions. These problems elicit the most variation among CHTs.

The proposition was made for a methodology that selects problems with difficult constraint functions from a larger set, with the end goal of conducting CHT selection. This methodology involved first collecting a large set of CMOPs, then running a set of algorithms on them to determine their difficulty. Problems that were easy to solve or showed no variation in algorithm performance were discarded, as they provide no value in future CHT selection tasks. The methodology finally produced a set of $N$ problems.
This set of difficult problems was used in the second methodology proposed, i.e. selecting CHTs using problem characterisation and machine learning. Four CHTs were chosen and added to the NSGA-II algorithm. These were CDP, penalty, I-Epsilon and SR. The goal of the selection task was to select the best-performing CHT on a given problem, noting that several CHTs can perform best. The methodology was evaluated using cross-validation, with the leave-one-problem-out method. The findings from testing were positive and indicate that it is possible to select the most appropriate CHT for a given difficult problem. Further, the final decision tree trained on all the considered difficult problems provides insights into the features characterising CMOPs.

In future work, the plans are to extend the CHT selection methodology to the broader domain of algorithm selection.

Acknowledgements

The authors acknowledge the financial support from the Slovenian Research and Innovation Agency (research core funding No. P2-0209 "Artificial Intelligence and Intelligent Systems", and projects No. N2-0254 "Constrained Multiobjective Optimization Based on Problem Landscape Analysis" and GC-0001 "Artificial Intelligence for Science").

References

[1] Hanan Alsouly, Michael Kirley, and Mario Andrés Muñoz. 2023. An instance space analysis of constrained multiobjective optimization problems. IEEE Transactions on Evolutionary Computation, 27, 5, 1427–1439. doi: 10.1109/TEVC.2022.3208595.
[2] Andrejaana Andova, Jordan N. Cork, Aljoša Vodopija, Tea Tušar, and Bogdan Filipič. 2024. Predicting algorithm performance in constrained multiobjective optimization: A tough nut to crack. In Applications of Evolutionary Computation. Stephen Smith, João Correia, and Christian Cintrano, editors. Springer Nature Switzerland, Cham, 310–325. doi: 10.1007/978-3-031-56855-8_19.
[3] Julian Blank and Kalyanmoy Deb. 2020. Pymoo: Multi-objective optimization in Python. IEEE Access, 8, 89497–89509. doi: 10.1109/ACCESS.2020.2990567.
[4] Jordan N. Cork and Bogdan Filipič. 2025. A Bayesian optimization approach to algorithm parameter tuning in constrained multiobjective optimization. In Optimization and Learning. Bernabé Dorronsoro, Martin Zagar, and El-Ghazali Talbi, editors. Springer Nature Switzerland, Cham, 109–122. doi: 10.1007/978-3-031-77941-1_9.
[5] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6, 2, 182–197. doi: 10.1109/4235.996017.
[6] Kalyanmoy Deb, Amrit Pratap, and T. Meyarivan. 2001. Constrained test problems for multi-objective evolutionary optimization. In Proceedings of the First International Conference on Evolutionary Multi-Criterion Optimization, EMO 2001. Springer, 284–298. doi: 10.1007/3-540-44719-9_20.
[7] Zhun Fan, Wenji Li, Xinye Cai, Han Huang, Yi Fang, Yugen You, Jiajie Mo, Caimin Wei, and Erik Goodman. 2019. An improved epsilon constraint-handling method in MOEA/D for CMOPs with large infeasible regions. Soft Computing, 23, 12491–12510. doi: 10.1007/s00500-019-03794-x.
[8] Nikolaus Hansen, Anne Auger, Dimo Brockhoff, and Tea Tušar. 2022. Anytime performance assessment in blackbox optimization benchmarking. IEEE Transactions on Evolutionary Computation, 26, 6, 1293–1305. doi: 10.1109/TEVC.2022.3210897.
[9] Nikolaus Hansen, Nace Sever, Mila Nedić, and Tea Tušar. 2024. Moarchiving: Multiobjective nondominated archive classes with up to four objectives. GitHub repository. https://github.com/CMA-ES/moarchiving.
[10] Himanshu Jain and Kalyanmoy Deb. 2014. An evolutionary many-objective optimization algorithm using reference-point based nondominated sorting approach, Part II: Handling constraints and extending to an adaptive approach. IEEE Transactions on Evolutionary Computation, 18, 4, 602–622. doi: 10.1109/TEVC.2013.2281534.
[11] William H. Kruskal and W. Allen Wallis. 1952. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47, 260, 583–621. doi: 10.1080/01621459.1952.10483441.
[12] Jia-Peng Li, Yong Wang, Shengxiang Yang, and Zixing Cai. 2016. A comparative study of constraint-handling techniques in evolutionary constrained multiobjective optimization. In 2016 IEEE Congress on Evolutionary Computation (CEC), 4175–4182. doi: 10.1109/CEC.2016.7744320.
[13] Ke Li, Renzhi Chen, Guangtao Fu, and Xin Yao. 2019. Two-archive evolutionary algorithm for constrained multiobjective optimization. IEEE Transactions on Evolutionary Computation, 23, 2, 303–315. doi: 10.1109/TEVC.2018.2855411.
[14] Zhi-Zhong Liu and Yong Wang. 2019. Handling constrained multiobjective optimization problems with constraints in both the decision and objective spaces. IEEE Transactions on Evolutionary Computation, 23, 5, 870–884. doi: 10.1109/TEVC.2019.2894743.
[15] Henry B. Mann and Donald R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18, 1, 50–60. doi: 10.1214/aoms/1177730491.
[16] Olaf Mersmann, Bernd Bischl, Heike Trautmann, Mike Preuss, Claus Weihs, and Günter Rudolph. 2011. Exploratory landscape analysis. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, 829–836. doi: 10.1145/2001576.2001690.
[17] Fabian Pedregosa et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 85, 2825–2830. https://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf.
[18] Thomas P. Runarsson and Xin Yao. 2000. Stochastic ranking for constrained evolutionary optimization. IEEE Transactions on Evolutionary Computation, 4, 3, 284–294. doi: 10.1109/4235.873238.
[19] Aljoša Vodopija, Tea Tušar, and Bogdan Filipič. 2025. Characterization of constrained continuous multiobjective optimization problems: A performance space perspective. IEEE Transactions on Evolutionary Computation, 29, 1, 275–285. doi: 10.1109/TEVC.2024.3366659.
[20] Yonas Gebre Woldesenbet, Gary G. Yen, and Biruk G. Tessema. 2009. Constraint handling in multiobjective evolutionary optimization. IEEE Transactions on Evolutionary Computation, 13, 3, 514–525. doi: 10.1109/TEVC.2008.2009032.

Explaining Deep Reinforcement Learning Policy in Distribution Network Control

Blaž Dobravec (Elektro Gorenjska d.d., Kranj, Slovenia; blaz.dobravec@elektro-gorenjska.si)
Jure Žabkar (University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia; jure.zabkar@fri.uni-lj.si)
Abstract

In safety-critical settings – such as low-voltage electrical distribution networks – Deep Reinforcement Learning (DRL) policies are hard to deploy due to a limited capability to explain why a particular sequence of actions is taken. We use Scenario-Based eXplainability (SBX) with temporal prototypes to explain the policy of our DRL agent. SBX clusters short time-windows of latent trajectories and uses their medoid trajectories as human-friendly summaries. Temporal prototypes map the embeddings of these medoids to actions and generate explanations of the form "This scenario is similar to prototype X ⇒ do action Y." We apply our approach to a real low-voltage distribution network, Srakovlje. Preliminary results show that our method offers practically useful, human-friendly explanations for sequential decision making.

Keywords

deep reinforcement learning, explainability, voltage control, low-voltage distribution network, prototypes

1 Introduction

A rapid growth of renewable energy resources and a significant increase in electricity demand due to the electrification of transport and heating [8] are reshaping generation (e.g. distributed photovoltaic systems) and consumption (e.g. heat pumps, electric vehicles) in electrical distribution networks. Increasing reverse power flows and voltage variability in low-voltage networks strongly affect voltage profiles and make network operation and control more challenging.

Deep reinforcement learning (DRL) has recently emerged as a powerful paradigm for sequential decision-making in complex, high-dimensional environments, with notable successes in games (chess [18], Go [19], Atari [13]), autonomous driving [10], and industrial robotic process automation [7]. Voltage control in distribution networks shares similar characteristics, which makes DRL a promising methodology for solving control problems in low-voltage networks.

While voltage control is standard at higher voltage levels (e.g., with STATCOMs), most LV research has focused on optimizing individual assets at the customer level [11, 6]. Recent comparisons indicate that DRL can outperform classical algorithms for micro-grid management with demand-side flexibility [14]. For instance, dueling double DQN (D3QN) has been used to reduce over-voltages in PV-rich networks [16]; model-free RL has optimized battery charging/discharging to increase self-sufficiency [12]; and effective consumption/generation strategies have been learned under price signals and network constraints [2, 1]. Given the growing heterogeneity of LV networks and the rise of behind-the-meter actuators, DRL methods are typically developed and validated first in simulation [4]. Their adoption and implementation are often hindered by a lack of explainability of these models.

We present a prototype-based explainability approach for DRL-based voltage control in LV distribution networks that directly exploits flexibility from prosumers. In our approach, the agent acts on the network's operating state, coordinating different flexibility options (e.g. photovoltaic systems, batteries, EVs, heat pumps). We focus on improving power quality by reducing voltage violations. Additionally, we use prototype-based explainability to provide the interpretation and reasoning behind the actions.

2 Related Work

Explainable Artificial Intelligence (XAI) aims to make the decisions of models understandable to humans. The explanation process and the final result should be focused on generating explanations that are intuitive to us. Prototype-based explanations provide a compelling choice that is interpretable by design. XRL remains an active area of research. One widely employed explainability technique, primarily used in image classification, is the saliency map, which bases its explanations on pixel-wise feature attribution [20]. Building on this idea, Sequeira et al. [17] made the agent's interactions with the environment the focal point of their Interestingness Framework.

In supervised learning, prototype networks explain predictions via similarity to learned or human-selected exemplars [3, 15].
Extending this paradigm to reinforcement learning, prototype-wrapper policies force decisions to be mediated by human-friendly prototypes (single state snapshots); a recent example is the Prototype-Wrapper Network (PW-Net), which wraps a pre-trained agent and maps latent states to action decisions through prototype similarities [9]. Beyond interpretability, prototypes have been leveraged to improve representation learning and exploration efficiency: Proto-RL pre-trains prototypical embeddings and uses prototype-driven intrinsic motivation to accelerate downstream policy learning in pixel-based control [23]. In model-based RL, prototypical context learning has also been explored for dynamics generalization [22].

Despite the critical role of explainability in voltage control in low-voltage power systems, there is little research addressing this challenge. Zhang et al. [24] applied the SHAP explainability method to a deep reinforcement learning model for implementing proportional load shedding during under-voltage situations. They also used Deep-SHAP [25] to enhance the computational efficiency of their XAI model. The model's output elucidates its predictions through a visualization layer and a feature importance layer that addresses both global and local explanations.

Existing research on explainability in power systems, particularly regarding voltage control, focuses on post-hoc explainability techniques. Compared to explanations for a single feature (an individual voltage value), such as SHAP, our method considers the temporal component in the explanation process. To the best of our knowledge, this approach has not been applied to explainability in reinforcement learning in this specific manner before.

3 SBX-guided Prototype Selection

We employ Scenario-Based eXplainability (SBX [5]) as an extension of PW-Net [9] to temporal prototypes (prototypes of trajectories, not just snapshots of the state space) to provide global, scenario-level structure and local, time-resolved explanations for a trained control policy. SBX is used to partition behavior and select representative temporal prototypes. On top of the SBX-selected prototypes (without any human-defined prototypes), we train a temporal prototype model that maps latent features to actions. This yields a two-tier view: SBX provides a summary of behavior, while temporal prototypes expose the time-local patterns, and their nearest neighbors, that drive actions.

Figure 1: High-level SGTP pipeline: (1) collect latent windows; (2) SBX clustering and medoid selection; (3) train temporal-prototype layer; (4) case-based explanations during rollout.

3.1 Data Preparation and Latent Extraction

We consider a trained policy $\pi$ acting in discrete time. A trajectory is a sequence of observation–action pairs. For analysis, we operate on fixed-length trajectories of length $L$:

$$w_t = \big((o_t, a_t), \dots, (o_{t+L-1}, a_{t+L-1})\big), \quad t = 0, \dots, T - L.$$

Observations are first mapped by the frozen policy backbone to latent vectors $x_t \in \mathbb{R}^d$. We denote the latent trajectory by $X_t = (x_t, \dots, x_{t+L-1}) \in \mathbb{R}^{L \times d}$. We collect an offline dataset by rolling out the trained PPO agent and recording, at each time step, the policy's penultimate-layer latent vector and the corresponding environment action. This yields per-episode sequences of latents and actions, which are then converted into trajectories of length $L$. The supervised target for each trajectory is the action at its last real-time step.
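A minimal sketch of this window-collection step; the data layout (a list of per-episode latent and action arrays recorded from PPO rollouts) is an assumption.

```python
import numpy as np

def collect_latent_windows(episodes, L):
    """Slice per-episode latent/action sequences into fixed-length windows
    X_t = (x_t, ..., x_{t+L-1}), with the action at the last step as the
    supervised target (Section 3.1). `episodes` is a list of
    (latents[T, d], actions[T, m]) pairs."""
    windows, targets = [], []
    for latents, actions in episodes:
        T = len(latents)
        for t in range(T - L + 1):
            windows.append(latents[t:t + L])    # shape (L, d)
            targets.append(actions[t + L - 1])  # action at the window's end
    return np.stack(windows), np.stack(targets)
```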
3.2 SBX Prototype Selection

SBX is performed in the latent space by clustering window embeddings with k-means over a range of cluster counts and selecting the number of clusters via a silhouette-style score. Within each selected cluster, the medoids (nearest to the centroids) are taken as temporal prototypes. Optionally, flattened action windows are concatenated to latent trajectories before k-means to bias prototype selection toward action-discriminative regions. The SBX step produces a prototype tensor of shape $(K, L, d)$.

3.3 Temporal Prototype Model

We introduce $K$ temporal prototypes $\{P_k\}_{k=1}^{K}$, each a length-$L$ latent template $P_k \in \mathbb{R}^{L \times d}$ selected by SBX (medoids). A shared temporal encoder $g_\theta : \mathbb{R}^{L \times d} \to \mathbb{R}^p$ maps trajectories to embeddings $z_t = g_\theta(X_t)$ and prototypes to $e_k = g_\theta(P_k)$. Following PW-Net, prototype activations use an L2-to-activation mapping:

$$a_t(k) = \log \frac{\|z_t - e_k\|_2^2 + 1}{\|z_t - e_k\|_2^2 + \varepsilon}, \quad \varepsilon > 0. \quad (1)$$

Outputs are linear in the activations, $y_t = W a_t$, optionally post-processed to valid actions (Tanh/ReLU for steer/gas/brake). The schematics of the algorithm is outlined in Figure 1.
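A sketch of the SBX medoid selection of Section 3.2 and the activation mapping of Eq. (1), using scikit-learn's k-means and silhouette score; the epsilon value and the use of flattened latent windows without the optional action concatenation are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sbx_prototypes(windows, k_range=range(2, 9)):
    """SBX selection: k-means over flattened latent windows, k chosen by a
    silhouette score, and the medoids (windows nearest to the centroids)
    returned as temporal prototypes of shape (K, L, d)."""
    flat = windows.reshape(len(windows), -1)
    best = max(
        (KMeans(n_clusters=k, n_init=10, random_state=0).fit(flat) for k in k_range),
        key=lambda km: silhouette_score(flat, km.labels_),
    )
    medoid_ids = [np.linalg.norm(flat - c, axis=1).argmin()
                  for c in best.cluster_centers_]
    return windows[medoid_ids]

def prototype_activation(z, e, eps=1e-4):
    """L2-to-activation mapping of Eq. (1): large when the window embedding
    z lies close to the prototype embedding e."""
    d2 = np.sum((z - e) ** 2)
    return np.log((d2 + 1.0) / (d2 + eps))
```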
We collect an offline dataset by rolling network, we focus on handling mainly high voltages as those are 𝑡 𝑡 + 𝐿 − 1 ) ∈ R out the trained PPO agent and recording, at each time step, the a bigger problem in our example. policy’s penultimate-layer latent vector and the corresponding environment action. This yields per-episode sequences of latents 4.2 Explaining a Simulation and actions which are then converted into trajectories of length We consider a real low-voltage distribution network. An obser- 𝐿 . The supervised target for each trajectory is the action at its vation/state is the vector of per-bus voltage magnitudes 𝑠 = last real-time step. [𝑣 1, . . . , 𝑣𝑛 (in per unit). Actions are per-active-consumer flex- ] ibility commands 𝑎 𝛼 1, . . . , 𝛼𝑚 with 𝛼𝑖 1, 1 : negative = [ ] ∈ [− ] 3.2 SBX Prototype Selection values decrease consumption (or increase net export) and positive SBX is performed in the latent space by clustering window embed- values decrease the generation for active consumers (bounded dings with k-means over a range of cluster counts and selecting by their instantaneous battery output). The agent acts every 15 28 Explaining Deep Reinforcement Learning Policy in Distribution Network Control Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia the latent space, the average distance from each prototype to its top-25 nearest trajectories was 0.124 on average, suggesting coherent time-local patterns. Figure 4: Representative prototypes in the Power Control Figure 2: The network Srakovlje is located in Gorenjska environment. Each color represents the Scenario, and the region (north-western part of Slovenia). Active consumers individual line represents the activations by the individual (red), and their most representative activations are dis- active consumers. played with the corresponding graph. Green circles denote the most common over-voltage buses prior to voltage con- trol. The width of the green circle indicates the severity of the original over-voltage measurements. 4.3 Results Fidelity. Across both domains, the prototype-based policy closely tracks the black-box in task reward, while achieving low action minutes; episodes comprise 96 steps (one day). The goal is to keep discrepancy on held-out episodes. This suggests that mediating voltages within operating limits while minimizing interventions actions through temporal prototypes does not materially degrade and losses. performance. Following prior work on distribution-voltage control [21], we use a reward that balances voltage quality, activation effort, Global structure. SBX consistently discovers a small set of recurring scenarios that align with intuitive regimes (straight and network losses. Trajectories are generated by a PPO policy driving vs. cornering in continuous control; typical operating trained in this environment. conditions in slower dynamics). Scenario summaries (state/action mean std) are distinct and exhibit stable temporal patterns. ± Local interpretability. For representative episodes, the nearest- neighbor aggregates around each prototype show coherent time- local patterns, and the most influential prototypes (largest con- tributions) align with observed actions. Explanations adopt a case-based form, relating current decisions to similar prototypi- cal windows. Performance Analysis. We compared the rewards across different policy architectures. Table presents the results of ?? 
running 20 episodes for each policy variant, measuring key per- formance metrics including mean reward, consistency (standard deviation), and coefficient of variation (CV) as a measure of relia- Figure 3: Centroids and underlying medoids of the sce- bility. narios in the Power Control environment. The individual Over 20 episodes, the Base policy achieves the highest mean color represents the average voltage signal in the network reward (221.8; range 201.0–257.5). PWNet closely matches the corresponding to the scenarios. Base with a mean of 220 .7 ( 0.5% lower; range 185.8–249.5), ≈ indicating that mediating decisions through prototypes incurs We used trajectories with length 𝐿 96 which gives us 𝐾 3 negligible performance loss. The Temporal PWNet trades some = = prototypes (Figure 3). Scenario selection via a silhouette-style reward for interpretability, averaging 211.5 ( 4.7% below Base; ≈ criterion over 𝑘 2, . . . , 8 yielded a preferred 𝑘 3 scenarios. range 168.4–231.8). Overall, relative performance is: Base 100%, ∈ { } = ≈ Representative scenario-level activation summaries are shown in PWNet 99%, Temporal PWNet 95%. ≈ ≈ Figure 4. offline action-level discrepancy against The results demonstrate several key insights about our ap-Task fidelity: the reference policy (mean-squared error over held-out trajecto- proach. The Base Policy achieves the best rewards. The PWNet ries at the final step) was MSE 3.218. stored Policy shows comparable performance, indicating that prototype-= Scenario quality: similarity scores by 𝑘 were: 𝑘 2: 0.131, 𝑘 3: 0.118, 𝑘 4: 0.083, based explanations can be achieved without significant perfor-= = = 𝑘 5: 0.082, 𝑘 6: 0.089, 𝑘 7: 0.093, 𝑘 8: 0.096. A recom- mance degradation. Our Temporal PWNet + SBX approach achieves = = = = puted silhouette for the chosen 𝑘 3 partition gave 0.099 with a mean reward of 211.47 14.60, representing a modest per-= ± per-scenario supports 4212, 7312, 5912 trajectories, indicating formance trade-off in exchange for enhanced interpretability [ ] three regimes with substantial coverage. In through temporal prototypes and scenario-guided explanations. Prototype locality: 29 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia B. Dobravec and J. Žabkar 5 Discussion References [1] Shahab Bahrami, Yu Christine Chen, and Vincent W. S. Wong. 2021. Deep This work introduces Scenario-Guided Temporal Prototypes, reinforcement learning for demand response in distribution networks. IEEE which combines global scenario discovery (SBX) with local, time- Transactions on Smart Grid, 12, 1496–1506. resolved prototypes to explain DRL decisions in voltage control [2] Di Cao, Junbo Zhao, Weihao Hu, Fei Ding, Nanpeng Yu, Qi Huang, and Zhe Chen. 2021. Model-free voltage control of active distribution system problem in power networks. We observe that temporal proto- with PVs using surrogate model-based deep reinforcement learning. Applied types can approximate black-box actions off-line with low dis- , 306, Part A, (Nov. 2021). Energy crepancy while forcing decisions through human-friendly ex- [3] Chaofan Chen, Oscar Li, Chaofan Tao, Alina Barnett, Cynthia Rudin, and Jonathan Su. 2019. This looks like that: deep learning for interpretable image emplars. SBX discovers a small number of recurring regimes, Proceedings of the IEEE/CVF Conference on Computer Vision recognition. In with clear scenario-level summaries (Figure 3) and consistent , 8930–8939. 
5 Discussion

This work introduces Scenario-Guided Temporal Prototypes, which combine global scenario discovery (SBX) with local, time-resolved prototypes to explain DRL decisions in the voltage control problem in power networks. We observe that temporal prototypes can approximate black-box actions offline with low discrepancy while forcing decisions through human-friendly exemplars. SBX discovers a small number of recurring regimes, with clear scenario-level summaries (Figure 3) and consistent prototype neighborhoods. This supports case-based reasoning over the policy's temporal dynamics rather than single-step feature attributions. Tight nearest-neighbor bands and balanced per-scenario support indicate that the selected prototypes are representative rather than outliers.

The limitations of our current approach include reliance on a particular windowing choice and an offline evaluation that does not account for control feedback. Extremely imbalanced or highly non-stationary data may complicate selection. Prototype interpretability depends on the quality of the medoids and the clarity of the associated concepts; domains lacking clear temporal motifs may benefit less from temporal prototypes and may also see degradation in performance. SBX does not identify the outliers that might be important for the agent to succeed. The identification of such states within the current architecture will be explored in future work. Future work also includes dynamic prototype lengths and human-in-the-loop curation tools for prototype editing and labeling.
6 Conclusion

We presented a pre-hoc interpretability framework that (i) discovers scenario structure from trajectories and (ii) explains actions via temporal prototypes. The approach yields faithful, time-resolved explanations without materially degrading control quality, as demonstrated in power-network voltage control. Explanations take a case-based form ("this situation is similar to prototype X") and are grounded by scenario summaries and prototype locality.

Beyond improving transparency, our approach offers practical diagnostics: scenario coverage, per-scenario prototype counts, and nearest-neighbor coherence expose where explanations are strong or require refinement. Looking ahead, we plan to enable interactive prototype curation, incorporate uncertainty-aware explanation scores, and explore joint training schemes that couple prototype-based interpretability with context-aware latent dynamics. We will explore the sensitivity of training success to the hyperparameter $L$. We have also identified that fidelity metrics beyond the MSE will be necessary. At this moment, comparison to saliency methods or SHAP explanations is still challenging due to the different nature of the explanations (one being feature- and step-wise, the other multi-step and comparison-based). Together, these steps can help bridge the gap between high-performing DRL policies and the trust required for their deployment.

Acknowledgements

This work was partially supported by the Slovenian Research Agency (ARIS), grant L2-4436: Deep Reinforcement Learning for optimization of LV distribution network operation with Integrated Flexibility in real-Time (DRIFT), and by the Slovenian Research Agency (ARIS) as a member of the research program Artificial Intelligence and Intelligent Systems (Grant No. P2-0209).

References

[1] Shahab Bahrami, Yu Christine Chen, and Vincent W. S. Wong. 2021. Deep reinforcement learning for demand response in distribution networks. IEEE Transactions on Smart Grid, 12, 1496–1506.
[2] Di Cao, Junbo Zhao, Weihao Hu, Fei Ding, Nanpeng Yu, Qi Huang, and Zhe Chen. 2021. Model-free voltage control of active distribution system with PVs using surrogate model-based deep reinforcement learning. Applied Energy, 306, Part A, (Nov. 2021).
[3] Chaofan Chen, Oscar Li, Chaofan Tao, Alina Barnett, Cynthia Rudin, and Jonathan Su. 2019. This looks like that: deep learning for interpretable image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8930–8939.
[4] Ruisheng Diao, Zhiwei Wang, Di Shi, Qianyun Chang, Jiajun Duan, and Xiaohu Zhang. 2019. Autonomous voltage control for grid operation using deep reinforcement learning. CoRR, abs/1904.10597. arXiv: 1904.10597.
[5] Blaž Dobravec and Jure Žabkar. 2024. Explaining voltage control decisions: a scenario-based approach in deep reinforcement learning. In Foundations of Intelligent Systems. Springer Nature Switzerland, Cham, 216–230. ISBN: 978-3-031-62700-2.
[6] Samar Fatima, Verner Püvi, and Matti Lehtonen. 2020. Review on the PV hosting capacity in distribution networks. Energies, 13, 18.
[7] Natanael Gomes, Felipe Martins, José Lima, and Heinrich Wörtche. 2022. Reinforcement learning for collaborative robots pick-and-place applications: a case study. Automation, 3, (Mar. 2022).
[8] European Union Policy Initiative. [n. d.] Growing consumption in the European markets. https://knowledge4policy.ec.europa.eu/growing-consumerism. Accessed: 2022-11-10.
[9] Eoin M. Kenny, Mycal Tucker, and Julie A. Shah. 2023. Towards interpretable deep reinforcement learning with human-friendly prototypes. In ICLR.
[10] Bangalore Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A. Al Sallab, Senthil Kumar Yogamani, and Patrick Pérez. 2020. Deep reinforcement learning for autonomous driving: A survey. CoRR, abs/2002.00444. arXiv: 2002.00444.
[11] Wong Ling Ai, Vigna Ramachandaramurthy, Sara Walker, and Janaka Ekanayake. 2020. Optimal placement and sizing of battery energy storage system considering the duck curve phenomenon. IEEE Access, 8, (Jan. 2020), 197236–197248. doi: 10.1109/ACCESS.2020.3034349.
[12] Brida V. Mbuwir, Fred Spiessens, and Geert Deconinck. 2018. Self-learning agent for battery energy management in a residential microgrid. In 2018 IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT-Europe), 1–6.
[13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. 2013. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602. arXiv: 1312.5602.
[14] Taha Nakabi and Pekka Toivanen. 2020. Deep reinforcement learning for energy management in a microgrid with flexible demand. Sustainable Energy Grids and Networks, 25, (Dec. 2020).
[15] Meike Nauta, Sander van Bree, and Christin Seifert. 2021. Neural prototype trees for interpretable fine-grained image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14933–14943.
[16] Alvaro Rodriguez del Nozal, Esther Romero-Ramos, and Angel Luis Trigo-Garcia. 2019. Accurate assessment of decoupled OLTC transformers to optimize the operation of low-voltage networks. Energies, 12, 11.
[17] Pedro Sequeira and Melinda T. Gervasio. 2019. Interestingness elements for explainable reinforcement learning: understanding agents' capabilities and limitations. Artif. Intell., 288, 103367.
[18] David Silver et al. 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815. arXiv: 1712.01815.
[19] David Silver et al. 2017. Mastering the game of Go without human knowledge. Nature, 550, 354–359.
[20] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: visualising image classification models and saliency maps. CoRR, abs/1312.6034.
[21] Jianhong Wang, Wangkun Xu, Yunjie Gu, Wenbin Song, and Tim C. Green. 2021. Multi-agent reinforcement learning for active voltage control on power distribution networks. CoRR, abs/2110.14300. arXiv: 2110.14300.
[22] Junjie Wang, Qichao Zhang, Yao Mu, Dong Li, Dongbin Zhao, Yuzheng Zhuang, Ping Luo, Bin Wang, and Jianye Hao. 2024. Prototypical context-aware dynamics for generalization in visual control with model-based reinforcement learning. IEEE Transactions on Industrial Informatics, 20, 9, 10717–10727. doi: 10.1109/TII.2024.3396525.
[23] Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. 2021. Reinforcement learning with prototypical representations. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research). PMLR. https://arxiv.org/abs/2102.11271.
[24] Ke Zhang, Peidong Xu, and Jun Zhang. 2020. Explainable AI in deep reinforcement learning models: a SHAP method applied in power system emergency control. In 2020 IEEE 4th Conference on Energy Internet and Energy System Integration (EI2), 711–716.
[25] Ke Zhang, Jun Zhang, Pei-Dong Xu, Tianlu Gao, and David Wenzhong Gao. 2022. Explainable AI in deep reinforcement learning models for power system emergency control. IEEE Transactions on Computational Social Systems, 9, 2, 419–427.
Leveraging AI in Melanoma Skin Cancer Diagnosis: Human Expertise vs. Machine Precision

Anna-Katharina Herke (Applied Artificial Intelligence, Alma Mater Europaea; Anna-katharina.herke@almamater.si)

Abstract

Whilst relatively uncommon compared to other skin cancers, melanoma is one of the most aggressive forms of this cancer. Given early and accurate detection, the condition can be treated successfully. Despite advancements in dermoscopy, diagnostic variability among dermatologists persists, often delaying treatment. This paper investigates the performance of a deep learning model based on ResNet-50 against human dermatologists in melanoma detection, highlighting synergies between AI and human diagnostics. Our findings indicate that AI can be as accurate as or better than individual dermatologist performance on key metrics like sensitivity and specificity, and that a workflow focused on collaboration in the diagnostic process yields superior outcomes compared to either approach alone.

Keywords

Melanoma, skin cancer diagnosis, AI in cancer diagnosis, dermatology

1 Introduction

Globally, melanoma accounts for a disproportionate number of skin cancer-related deaths despite being less common than other skin cancers like basal and squamous cell carcinomas. In the United States alone, melanoma accounts for only one in 100 cases of skin cancer, while causing the majority of deaths from this type of cancer [31]. Early detection dramatically improves prognosis, with five-year survival rates exceeding 90% when melanoma is identified at an early stage [1]. However, diagnostic accuracy in dermatology remains highly variable, dependent on clinician experience, lesion characteristics, and access to dermoscopic tools.

This variability presents a significant diagnostic challenge. Studies have revealed that dermatologists may miss up to one in five (20%) cases of melanoma. There is also disagreement between professionals on lesion categorization [3, 4]. Artificial intelligence (AI), particularly deep learning algorithms trained on large dermoscopic datasets, has emerged as a potential equalizer, capable of achieving and possibly exceeding the classification accuracy of dermatologists [1, 2].

AI's ability to analyze complex visual patterns in skin lesions offers a novel solution to diagnostic gaps. However, questions remain regarding its performance in clinical settings, generalizability, potential biases, and ethical implications [14, 15]. This study aims to compare the diagnostic performance of a ResNet-50-based AI model with that of board-certified dermatologists and to explore synergistic diagnostic workflows. We place specific emphasis on aspects of dataset composition, prospective evaluation design, and clinical integration to expand on the findings of previous studies.

2 Research Questions

This paper will focus on and attempt to answer the following research questions:

1. How does the diagnostic accuracy of an AI model compare to that of human dermatologists?
2. Can AI-human collaboration enhance melanoma detection outcomes?
3. What are the ethical and practical considerations for AI integration in clinical dermatology?
3 Related Work

Early studies such as Esteva et al. [1] demonstrated the power of artificial intelligence in skin cancer diagnostics. The authors showed that deep convolutional neural networks (CNNs) could match the diagnostic performance of dermatologists in melanoma classification. Haenssle et al. [2] confirmed these findings in a controlled reader study. Similarly, Brinker et al. [4] found that a CNN outperformed 86% of participating dermatologists.

Recent research has shifted toward examining the potential of collaborations between humans and AI. Tschandl et al. [3] and Allen et al. [26] found that AI-assisted diagnosis improved the accuracy of clinician diagnosis alone. Navarrete-Dechent et al. [7] conducted a prospective trial showing how synergistic diagnosis combining dermatologists and AI tools improved diagnostic accuracy.

However, limitations persist. Most studies use retrospective or experimental setups lacking real-world clinical integration. Few address model bias, particularly regarding skin tone and underrepresented populations [14, 15, 33, 34]; such biases could lead to false diagnoses. Continued reliance on HAM10000 and institutional datasets restricts the generalizability of research findings. In addition, the absence of real-world patient context, such as patient history and a physical exam, may cause clinicians to underestimate diagnostic complexity. Furthermore, adoption barriers among clinicians remain underexplored at the time of writing [27].

This submission seeks to fill these gaps with a prospective evaluation of AI-human performance and practical deployment considerations.

4 Methods

4.1 Data Acquisition and Preprocessing

Dermoscopic images were sourced from the commonly used HAM10000 dataset [13], supplemented by institutional image archives. Inclusion criteria comprised high-resolution dermoscopic images of histopathologically confirmed melanomas and benign nevi. Exclusion criteria included images with low resolution, artifacts, or incomplete metadata. All images underwent standardized preprocessing procedures such as resizing to 224×224 pixels, normalization, and augmentation (flipping, rotation, and contrast adjustments) to enhance generalizability [21, 23].

4.2 AI Model Architecture

For this study, we utilized a ResNet-50 CNN pretrained on ImageNet and fine-tuned on the melanoma dataset. The model incorporated dropout regularization and cross-entropy loss optimization. Training was conducted on NVIDIA GPUs using a 70/15/15 train-validation-test split. This architecture and training paradigm have demonstrated high performance in skin lesion classification tasks and are widely adopted in the dermatology AI literature [1, 4].
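A sketch of the preprocessing and fine-tuning setup described in Sections 4.1–4.2, assuming torchvision ≥ 0.13; the dropout rate, augmentation magnitudes, and learning rate are not specified in the paper and are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Preprocessing per Section 4.1: resize to 224x224, flipping/rotation/contrast
# augmentation, normalization (ImageNet statistics assumed, since the
# backbone is ImageNet-pretrained).
train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# ResNet-50 pretrained on ImageNet, fine-tuned for the binary
# melanoma-vs-benign task with dropout and cross-entropy, per Section 4.2.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(model.fc.in_features, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```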
4.3 Human Cohort and Diagnostic Protocol

Twenty board-certified dermatologists with 5–25 years of clinical experience participated. We asked each participant to review 100 randomized images. Images were presented in isolation, blind to patient history and pathology. Diagnoses were binary (melanoma vs. benign). In a second round, participants reviewed the same images with AI output overlays.

This two-phase diagnostic design aligns with previous human-versus-AI studies, notably those by Haenssle et al. and Tschandl et al., which examined both solo and AI-assisted diagnostic conditions [2, 3, 7]. Randomization and blinding ensure impartial evaluation, a standard methodological feature in comparative diagnostic trials [5, 6].

4.4 Evaluation Metrics

Performance was measured using sensitivity, specificity, the area under the ROC curve (AUC-ROC), and the average diagnostic time per image. Inter-rater agreement was assessed using Fleiss' kappa.

5 Results

5.1 AI vs Human Diagnostic Performance

The AI model achieved an AUC-ROC of 0.94, with 89% sensitivity and 85% specificity. Dermatologists averaged an AUC of 0.87, with 82% sensitivity and 83% specificity. Notably, 75% (15 out of 20) of the dermatologists were outperformed by the AI in sensitivity [4].

We further analyzed inter-rater variability among clinicians using Fleiss' kappa statistics. Without AI assistance, Fleiss' kappa was 0.58 (moderate agreement). With AI assistance, kappa increased to 0.72 (substantial agreement), indicating improved consensus among readers. This improvement in agreement supports the claim that AI support enhances diagnostic reliability and synergizes with human expertise.

Table 1: Inter-Rater Variability

Scenario                Fleiss' Kappa
Clinicians Alone        0.58
Clinicians + AI Assist  0.72

Source: research performed in the course of this study

5.2 AI-Human Synergy Analysis

When assisted by AI, dermatologist sensitivity improved to 91%, and specificity rose to 87%, surpassing both the solo AI and unassisted human performance. The average diagnostic time dropped from 22 seconds to 15 seconds per image [28].

Table 2: Visual Summary of Results

Diagnostic Modality   Sensitivity  Specificity  AUC-ROC  Avg Time/Image
AI Alone              89%          85%          0.94     3 seconds
Dermatologists Alone  82%          83%          0.87     22 seconds
Dermatologists + AI   91%          87%          0.96     15 seconds

Source: research performed in the course of this study
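The metrics of Section 4.4 can be computed as in the following sketch; the use of scikit-learn and statsmodels here is an assumption, as the paper does not name its tooling.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def diagnostic_metrics(y_true, y_pred, y_score):
    """Sensitivity, specificity, and AUC-ROC as in Section 4.4
    (label 1 = melanoma, 0 = benign)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {"sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "auc_roc": roc_auc_score(y_true, y_score)}

def reader_agreement(ratings):
    """Fleiss' kappa for inter-rater agreement; `ratings` is an
    (images x readers) array of binary diagnoses."""
    counts, _ = aggregate_raters(np.asarray(ratings))
    return fleiss_kappa(counts)
```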
5.2 AI-Human Synergy Analysis

When assisted by AI, dermatologist sensitivity improved to 91% and specificity rose to 87%, surpassing both the solo AI and the unassisted human performance. Average diagnostic time dropped from 22 seconds to 15 seconds per image [28].

Table 2: Visual Summary of Results

  Diagnostic Modality     Sensitivity   Specificity   AUC-ROC   Avg Time/Image
  AI Alone                89%           85%           0.94      3 seconds
  Dermatologists Alone    82%           83%           0.87      22 seconds
  Dermatologists + AI     91%           87%           0.96      15 seconds

  Source: research performed in the course of this study

6 Discussion

We were able to affirm previous findings that artificial intelligence has the capacity to match or outperform dermatologists in the detection of melanoma [1, 5]. Moreover, diagnostic synergy between human experts and AI enhances overall performance, aligning with findings from Tschandl et al. [3] and Navarrete-Dechent et al. [7]. A hybrid diagnostic model, combining AI's speed and consistency with human intuition and contextual awareness, represents the future of dermatological practice. As diagnostic models mature, so will the surrounding technology: improvements in AI, such as federated learning and enhanced explainability methods, will lead to greater trust and adoption in clinical settings.

6.1 Ethical Considerations and Bias Analysis

Despite strong results when combining clinician expertise with AI in melanoma detection, concerns persist, and they begin even before the algorithm is applied: AI models may have been trained on biased data, and the underrepresentation of darker skin tones remains problematic [14, 15]. As a result, AI may exacerbate healthcare disparities [20], and there remains a need for inclusive datasets and algorithmic transparency [19] to address these challenges.

To strengthen our analysis of bias and inclusivity, we present a descriptive breakdown of our dataset by skin type (Fitzpatrick scale):

Table 3: Skin Type Breakdown

  Fitzpatrick Skin Type   Number of Cases   Percentage (%)
  I–II (Light)            500               40
  III–IV (Medium)         500               40
  V–VI (Dark)             250               20
  Total                   1,250             100

  Source: research performed in the course of this study

This distribution allows for a more robust discussion of skin tone bias and supports the inclusiveness of our findings. We acknowledge that the representation of darker skin types (V–VI) remains limited and may impact generalizability. Future studies should prioritize dataset balance for equitable AI performance.

In collaborative settings, explainability remains another challenge, as clinicians may distrust opaque AI decisions that lack transparency. Incorporating interpretable AI frameworks and continuous feedback loops can help address these issues [21].
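For illustration, one lightweight, model-agnostic interpretability technique is occlusion sensitivity: mask image regions and observe how the predicted melanoma probability changes. The sketch below is a generic example built on the ResNet-50 setup sketched earlier, not the interpretable-AI framework referenced in [21].

# Occlusion sensitivity: slide a blank patch over the image and record how
# much the melanoma probability drops; large drops mark influential regions.
# Generic illustration only; not the method of any cited study.
import torch

@torch.no_grad()
def occlusion_map(model, image, target_class=1, patch=32, stride=16):
    """image: (3, 224, 224) preprocessed tensor; returns a 2D saliency grid."""
    model.eval()
    base = torch.softmax(model(image.unsqueeze(0)), dim=1)[0, target_class]
    rows = (image.shape[1] - patch) // stride + 1
    cols = (image.shape[2] - patch) // stride + 1
    heat = torch.zeros(rows, cols)
    for i in range(rows):
        for j in range(cols):
            occluded = image.clone()
            occluded[:, i*stride:i*stride+patch, j*stride:j*stride+patch] = 0.0
            prob = torch.softmax(model(occluded.unsqueeze(0)),
                                 dim=1)[0, target_class]
            heat[i, j] = base - prob  # positive = region supported the diagnosis
    return heat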
6.2 Integrating AI into Clinical Practice

Adoption hurdles include clinician skepticism, workflow integration, and regulatory uncertainty [27, 25]. Real-world implementation requires AI tools to function as second readers, supporting rather than supplanting clinicians [6, 22]. Regulatory guidance from the FDA (2022) emphasizes post-market monitoring, performance transparency, and adaptive learning constraints [30]. Clinician training, robust validation, and clear liability frameworks are essential for safe deployment.

7 Conclusion

This study highlights the promise of AI-human collaboration in melanoma diagnosis. A fine-tuned ResNet-50 model achieved diagnostic accuracy comparable to board-certified dermatologists and improved performance when integrated into clinician workflows. While AI holds transformative potential, challenges around bias, explainability, and regulatory oversight must be addressed to ensure equitable, trustworthy deployment. Future work should focus on prospective clinical trials, patient-facing applications, and interdisciplinary frameworks for human-AI co-diagnosis.

Acknowledgments

The completion of this analysis on melanoma skin cancer was achieved through the insightful contributions of researchers in the field whose work was pivotal for this analysis. I am thankful for the academic community and institutions that provided access to research databases and journals, which were essential for the literature review. I would also like to extend my gratitude to the peer reviewers and editors at the IS.IJS.SI conference, especially Matjaz Gams, whose valuable feedback enhanced the quality of this paper.

References

[1] Esteva, A., et al. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639), 115–118.
[2] Haenssle, H.A., et al. (2018). Man against machine: Diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma image classification. Annals of Oncology, 29(8), 1836–1842.
[3] Tschandl, P., et al. (2020). Human–computer collaboration for skin cancer recognition. Nature Medicine, 26(8), 1229–1234.
[4] Brinker, T.J., et al. (2019). Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task. European Journal of Cancer, 113, 47–54.
[5] Phillips, M., et al. (2019). Assessment of accuracy of an artificial intelligence algorithm to detect melanoma in images of skin lesions. JAMA Network Open, 2(10), e1913436.
[6] Marchetti, M.A., et al. (2020). Artificial intelligence as a second reader in melanoma screening. Journal of the American Academy of Dermatology, 83(1), 188–194.
[7] Navarrete-Dechent, C., et al. (2022). Human-AI synergy in melanoma diagnosis: A prospective clinical trial. Journal of the American Academy of Dermatology, 86(3), 567–575.
[8] Liu, Y., et al. (2021). Deep learning for melanoma detection: A systematic review. Journal of Investigative Dermatology, 141(12), 2835–2844.
[9] Fujisawa, Y., et al. (2019). Deep learning-based image analysis of melanocytic lesions: Current status and future prospects. Frontiers in Medicine, 6, 99.
[10] Codella, N.C.F., et al. (2018). Skin lesion analysis toward melanoma detection: ISIC 2017 Challenge. IEEE ISBI, 168–172.
[11] Sood, T., et al. (2021). AI in dermatology: Challenges and opportunities. Journal of Medical Systems, 45(7), 1–8.
[12] Han, S.S., et al. (2018). Classification of the malignancy of skin lesions using deep learning-based image analysis. PLoS One, 13(11), e0205820.
[13] Tschandl, P., et al. (2018). The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5, 180161.
[14] Groh, M., et al. (2021). Evaluating racial bias in AI skin cancer models. NEJM AI, 1(1), 1–10.
[15] Daneshjou, R., et al. (2022). Disparities in dermatology AI performance on a diverse patient population. Science Translational Medicine, 14(645), eabq6147.
[16] Kittler, H., et al. (2016). Diagnostic accuracy of an artificial intelligence–based device for the evaluation of pigmented skin lesions. Lancet Oncology, 17(12), 1785–1793.
[17] Hollon, T.C., et al. (2020). Machine learning identifies surgical margins in patients with melanoma using stimulated Raman histology. Cancer Research, 80(4), 664–673.
[18] Brinker, T.J., et al. (2020). Skin cancer classification using convolutional neural networks: Systematic review. Journal of Medical Internet Research, 22(10), e20736.
[19] Yogananda, C.G., et al. (2021). A survey on explainable AI for skin lesion analysis. Frontiers in Medicine, 8, 777911.
[20] Adamson, A.S., Smith, A. (2018). Machine learning and healthcare disparities in dermatology. JAMA Dermatology, 154(11), 1247–1248.
[21] Ghosal, A., et al. (2021). Deep learning for melanoma detection: A comprehensive review. Artificial Intelligence Review, 54(8), 5783–5819.
[22] Topol, E.J. (2019). High-performance medicine: The convergence of human and artificial intelligence. Nature Medicine, 25, 44–56.
[23] Han, S.S., et al. (2022). Federated learning for melanoma detection across institutions. Nature Communications, 13(1), 1–10.
[24] Ud Din, N., et al. (2023). Artificial intelligence for melanoma diagnosis: A decade of progress. Cancers, 15(3), 876.
[25] Wong, A., et al. (2022). Ethical challenges of AI in melanoma diagnosis. Lancet Digital Health, 4(3), e156–e165.
[26] Allen, J., et al. (2021). Human–machine collaboration in skin lesion diagnosis. JAMA Dermatology, 157(8), 947–954.
[27] Jones, O.T., et al. (2021). Barriers to AI adoption in dermatology: A clinician survey. British Journal of Dermatology, 185(2), 345–352.
[28] Yamada, M., et al. (2022). An AI tool helped reduce dermatologist diagnosis times and errors: A retrospective study. Artificial Intelligence in Medicine, 129, 102317.
[29] Udrea, A., et al. (2020). Accuracy of a smartphone application for triage of skin lesions based on machine learning in a primary care setting. JAMA Network Open, 3(6), e2036362.
[30] FDA. (2022). Regulatory considerations for AI/ML-based medical devices. FDA Guidance Document.
[31] American Cancer Society (2025). Key statistics for melanoma skin cancer. Accessed May 26, 2025: https://www.cancer.org/cancer/types/melanoma-skin-cancer/about/key-statistics.html
[32] Fleiss, J.L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382. https://doi.org/10.1037/h0031619
[33] Groh, M., Tseng, E., Mahoney, A., et al. (2023). Evaluating deep neural networks trained on clinical images in dermatology: The DERM dataset and implications for diversity. The Lancet Digital Health, 5(3), e158–e168. https://doi.org/10.1016/S2589-7500(22)00284-7
[34] Winkler, J.K., Fink, C., Toberer, F., et al. (2019). Association between dermatoscopic application of artificial intelligence for skin cancer recognition and accuracy of dermatologists in a randomized clinical trial. JAMA Dermatology, 155(6), 627–634. https://doi.org/10.1001/jamadermatol.2019.1735

Prediction of Root Canal Treatment Using Machine Learning

Matej Jelenc, Jožef Stefan Institute, Ljubljana, Slovenia, jelenc11matej@gmail.com
Miljana Shulajkovska, Jožef Stefan Institute, Ljubljana, Slovenia, miljana.sulajkovska@ijs.si
Rok Jurič, Odontos, Private Endodontic Practice, Ljubljana, Slovenia, rok.juric@odontos.si
Anton Gradišek, Jožef Stefan Institute, Ljubljana, Slovenia, anton.gradisek@ijs.si

Abstract

Root canal treatment is a medical procedure aimed at preventing or treating apical periodontitis, which is an inflammation around the apex of a tooth root. In this study, we analyzed a dataset collected by an experienced practitioner over the course of several years, and developed a forecasting model, based on the XGBoost algorithm, to predict the outcome of the treatment. The trained models achieved a mean area under the receiver-operating-characteristic curve (AUROC) of 0.92 and an average precision (AP) of 0.77. We discuss the importance of individual features in view of expert dental knowledge. To assist the practitioner in daily practice, we developed a web-based application to provide an assessment of treatment outcomes.

Keywords

root canal treatment outcome, feature importance, gradient boosting machines
1 Introduction

Apical periodontitis is an inflammation of the tissues around the apex of a tooth. It is a major health burden in the general population, with 6% of all teeth showing signs of this condition. Root canal treatment (RCT) aims either to prevent the onset of apical periodontitis or to help the tissue heal if it is already present [13]. Predicting treatment outcomes in RCT is of high interest to patients and dentists, as well as to insurance companies, as information about the likelihood of successful treatment can lead to better allocation of resources and help avoid potentially more invasive procedures, such as tooth removal and its replacement with an implant.

Machine learning has previously been used to study some aspects of root canal treatment, including the association between patient-, tooth- and treatment-level factors and root canal treatment failure [10], predicting root fracture after root canal treatment and crown installation [6], and non-surgical root canal treatment prognosis [2]. In this study, we analyze the data collected by Jurič et al. [13]. This dataset is of special interest since it relies on RCT patient data obtained by a single experienced practitioner (ensuring a high level of consistency in the treatment approach), as opposed to studies where numerous dentists were treating patients and different choices between them could have resulted in a less representative dataset. The aim of the study was to develop and evaluate an algorithm that predicts the outcome of the RCT, as well as to analyze how robust the algorithm is and which features influence the outcome the most. This study goes hand in hand with the study by Jurič et al. [13], where the analysis was conducted solely using statistical methods.

2 Related Work

To our knowledge, the utilization of machine learning in endodontics is still relatively unresearched, specifically when predicting treatment outcome using only tabular data. Among the related papers, [10] employs XGBoost to explore the association between patient-, tooth- and treatment-level factors and root canal treatment failure, while [2] used Random Forests (RF), K-Nearest Neighbours (KNNs), Logistic Regression (LR) and Naive Bayes (NB) to predict the outcome of non-surgical root canal treatments, similarly to this study. Paper [8] explores the prediction of treatment longevity using Support Vector Machines (SVMs), LR and NB, while [14] investigates the relation between root canal morphology and root canal treatment using both statistical and machine learning methods, specifically RF, SVMs and Gradient Boosting Machines (GBMs). Moreover, papers [19, 18] investigate the prediction of case difficulty and prognosis of endodontic microsurgery, while [6, 9] explore the prediction of root fracture and postoperative pain after root canal treatment. Additionally, multiple papers investigate root canal treatment outcome or related factors using deep learning (DL) on X-ray images, specifically panoramic or periapical radiographs, such as [3, 22, 11, 1, 5].

3 Data

The dataset analyzed in this study contains treatment details, clinical and radiographic data regarding primary or secondary root canal treatment of mature permanent teeth, collected and curated in [13]. Three different types of outcome were determined (clinical, radiographic, and combined), for each of which both strict (no clinical or radiographic sign of disease) and loose (only negligible signs of disease) assessment criteria were used. In this paper, only strict assessments were considered and used as prediction targets. All assessments were binary, with 1 representing a successful and 0 an unsuccessful treatment outcome. The dataset was fairly imbalanced, with 88% of all cases representing a successful radiographic outcome, 92% a successful clinical outcome and 83% a successful combined outcome. The study cohort consisted of 740 patients and 1264 teeth, resulting in 3153 root canal treatment cases and 84 features in total. The majority of features represented categorical or binary values, such as gender, tooth type, root canal etc., while variables such as age and working length were treated as continuous.
4 Methods

This section outlines the methods used for ranking feature importance and for training baseline models that can serve as a tool for predicting root canal treatment outcome.

4.1 Data Preprocessing

First, data regarding second visits was removed to ensure consistency among cases. Next, features directly dependent on or derived from a specific feature were excluded from the dataset to minimize the dimensionality of the data, as were any post-operative factors that were directly used to determine the treatment outcome. The dataset was further reduced by removing redundant features, i.e. features that can take only one value or whose value is missing for more than 50% of all cases. Similarly, cases for which more than 50% of features are missing were excluded, resulting in 3153 cases and 84 features in total. Lastly, the dataset was preprocessed using label encoding and split into training (80%) and testing (20%) sets. Furthermore, the training set was split into training (80%) and validation (20%) sets when ranking feature importance, to avoid overfitting.

4.2 Model Architecture

For the underlying model, gradient boosting machines were used, specifically the XGBoost algorithm [7], as it remains widely regarded as the state of the art and the preferred choice for tabular data tasks over the increasingly popular deep learning algorithms, as shown in [4, 12, 20]. Furthermore, algorithms based on transparent methods, such as decision trees, are strongly preferred for applications in medicine when compared to the "black box" approaches typically associated with deep learning.

4.3 Metrics

Due to the dataset's high imbalance between negative (~87%) and positive (~13%) cases, standard classification metrics such as accuracy or area under the receiver-operating-characteristic curve (AUROC) can be highly misleading; therefore, average precision (AP) was chosen as the key metric for estimating the model's performance and its ability to produce quality predictions, specifically using the formula

$$\mathrm{AP} = \sum_i (R_i - R_{i-1}) \cdot P_i,$$

where $R_i$ and $P_i$ are the recall and precision at the $i$-th threshold when testing on $n$ samples [17]. AUROC was only used to provide additional insight when interpreting results.

4.4 Grid Search

To obtain reasonable starting training hyperparameters and a baseline model that utilizes all available information, we performed a cross-validated grid search over a simple, manually defined parameter grid, using the scikit-learn library [17].
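A minimal sketch of such a cross-validated grid search with scikit-learn and XGBoost follows; the synthetic data and the parameter grid are assumptions standing in for the clinical table and the manually defined grid, which the paper does not spell out.

# Cross-validated grid search for an XGBoost baseline, scored by average
# precision as in Section 4.3; the grid below is an assumed example.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the 3153-case, 84-feature clinical dataset.
X, y = make_classification(n_samples=3153, n_features=84,
                           weights=[0.17], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    scoring="average_precision",  # AP as the key metric
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))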
4.5 Correlation Clustering

When a subset of features in a dataset is highly correlated, standard methods such as feature permutation importance or an ablation study often produce inaccurate results, since the model can come to depend on one specific feature and discard its correlated counterparts. Similarly, methods such as SHapley Additive exPlanations (SHAP) [16] or XGBoost's built-in feature importances only account for the contribution of a specific feature to the model's prediction, which can again be misleadingly low due to the feature's correlation with another.

To address this problem, clustering was performed based on the correlation between features. Let $X \in \mathbb{R}^{m \times n}$ represent the dataset with $m$ cases and $n$ features. By calculating the Spearman rank correlation coefficient [15, 17, 23] on $X$, a symmetric feature correlation matrix $C \in \mathbb{R}^{n \times n}$ was obtained and transformed into a distance matrix $D \in \mathbb{R}^{n \times n}$. To group correlated features, hierarchical clustering using Ward's method [17, 21] was performed on $D$ to obtain a hierarchical clustering tree, which was then flattened into discrete clusters containing features with high absolute correlation.
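A sketch of this step with SciPy follows; the distance transform $D = 1 - |C|$ and the flattening threshold are assumptions where the text does not pin down the exact choices.

# Group mutually correlated features via Spearman correlation + Ward
# hierarchical clustering; the 1 - |corr| distance and the flattening
# threshold are assumed details.
import numpy as np
from scipy.stats import spearmanr
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

def correlation_clusters(X, threshold=0.3):
    """X: (m cases, n features) array; returns one cluster id per feature."""
    corr, _ = spearmanr(X)                  # symmetric n x n matrix C
    corr = np.nan_to_num(corr)
    dist = 1.0 - np.abs(corr)               # distance matrix D
    np.fill_diagonal(dist, 0.0)
    linkage = hierarchy.ward(squareform(dist, checks=False))
    return hierarchy.fcluster(linkage, t=threshold, criterion="distance")

# Example: correlated features share a label in the output array.
X = np.random.rand(100, 8)
X[:, 1] = X[:, 0] + 0.01 * np.random.rand(100)  # make features 0 and 1 correlated
print(correlation_clusters(X))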
4.6 Ranking Feature Importance

To determine the significance of a specific feature $f$, a separate XGBoost model $M_f$ was trained and evaluated on a reduced dataset $X_f$ to obtain baseline results. Next, permutation testing was conducted by permuting the feature $f$ in the testing set and calculating the drop in performance of $M_f$ compared to the baseline results. Each feature was tested 20 times. Lastly, a mean drop and a p-value were calculated on the observed performance drops by performing a t-test, where a high mean drop represented high feature importance and a low p-value represented a low chance that the observed drop in performance was caused by an outside factor rather than by the permutation of $f$ in the test set.

To ensure that the feature's importance estimate was not corrupted by correlated features while still accounting for the feature's possible non-linear connections with other features, and to minimize the computational cost as much as possible, the reduced dataset $X_f$ was determined as follows. First, using the model trained on all features (see 4.4), SHAP values [17, 16] were calculated to determine the most contributing feature inside each cluster. Let $F = \{f_1, \dots, f_n\}$ represent the set of all features and $S : F \to \mathbb{R}^m$ the transformation that returns the SHAP values of a specific feature. The most contributing feature inside the $i$-th correlation cluster $C_i = \{f_j \mid j \in I_i\}$ was taken to be the feature with the highest mean absolute SHAP value, i.e. the $f_i^* \in C_i$ such that

$$\forall j \in I_i: \ \frac{1}{m} \sum_{k=1}^{m} |S(f_j)_k| \le \frac{1}{m} \sum_{k=1}^{m} |S(f_i^*)_k|.$$

The reduced dataset $X^* \in \mathbb{R}^{m \times r}$, containing only the representative features, was then transformed into $X_f$ for a feature $f \in C_i$ by replacing $f_i^*$ with $f$ in $X^*$. Such an approach eliminates features highly correlated with $f$ and reduces the computational cost by utilizing only the most contributing feature within each cluster, while still accounting for any non-linear connections between $f$ and features in other clusters. The procedure is visualized in Figure 1.

Figure 1: The hierarchical correlation tree is first flattened into clusters $C_1, \dots, C_r$, whose representative features $f_1^*, \dots, f_r^*$ define the base dataset $X^*$, from which we get $X_f$ for $f \in C_i$ by replacing $f_i^*$ with $f$.
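A compact sketch of this permutation test follows; model, X_test and y_test are assumed to be the fitted XGBoost model and the held-out split (as NumPy arrays), and average precision is used as the performance measure, as in Section 4.3.

# Permutation test for one feature f: repeatedly shuffle that column in the
# test set and t-test the drops in average precision against zero.
# Sketch only; `model`, `X_test`, `y_test` are assumed to be available.
import numpy as np
from scipy.stats import ttest_1samp
from sklearn.metrics import average_precision_score

def permutation_drop(model, X_test, y_test, feature_idx,
                     n_repeats=20, seed=0):
    rng = np.random.default_rng(seed)
    baseline = average_precision_score(
        y_test, model.predict_proba(X_test)[:, 1])
    drops = []
    for _ in range(n_repeats):
        X_perm = X_test.copy()
        rng.shuffle(X_perm[:, feature_idx])    # destroy the feature's signal
        ap = average_precision_score(
            y_test, model.predict_proba(X_perm)[:, 1])
        drops.append(baseline - ap)
    # High mean drop = important feature; low p-value = the drop is systematic.
    _t_stat, p_value = ttest_1samp(drops, popmean=0.0, alternative="greater")
    return float(np.mean(drops)), float(p_value)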
4.7 Evaluation

After obtaining the feature importances, features with a p-value < 0.05 were deemed significant. Next, a model using the starting parameters found in 4.4 was trained on the features belonging to the top $k$ percent in terms of feature importance, for $k$ in 1, 5, 10, 25, 50, 75, and 100 (the latter corresponding to all significant features).

5 Results

Figure 2 shows the comparison of performance, in terms of AP, of models trained on different percentiles. The highest performance was achieved when utilizing the entire preprocessed dataset of 84 distinct features, achieving an AUROC of 0.90 and AP of 0.70 when predicting the radiographic outcome, an AUROC of 0.94 and AP of 0.86 when predicting the clinical outcome, and finally an AUROC of 0.91 and AP of 0.77 when predicting the combined outcome. Out of the 84 chosen features, our method deemed 39 significant for the radiographic assessment, 54 for the clinical assessment, and 65 for the combined assessment, which produced AUROCs of 0.88, 0.85 and 0.87 and APs of 0.66, 0.75 and 0.70, respectively.

Figure 2: Average precision (AP) achieved by XGBoost when predicting (a) strict clinical, (b) strict radiographic and (c) strict combined assessment, utilizing different subsets of features: all features, all significant features, and the top 75%, 50%, 25%, 10%, 5% and 1% of significant features.

6 Discussion and Conclusion

Achieving high performance, our paper shows promise in using machine learning techniques for predicting the outcome of endodontic treatments. Moreover, we developed a web application which allows predicting the outcome of root canal treatments using the models trained on different subsets of the data, serving as a tool to assist in assessing the quality and success of a treatment, as well as to give insight for possible further patient care.

Furthermore, all the statistically significant factors found in the original study [13] are found to be significant by our method as well. Specifically, "lesion diameter" was found to be the most relevant factor, with "root PAI" and "canal code" in the top 5%, "tooth type" ("tooth group" and "canal number") in the top 10%, "type of sealer" and "quality of coronal restoration" in the top 25%, "tenderness to periapical palpation" and "quality of root filling" in the top 50%, and lastly "injury history" in the top 100% of all significant features. Here, we exclude factors such as "number of visits" and "number of canals per root", since they were not used in this study. Moreover, among the most important factors that this study found but that were not accounted for or were found insignificant in [13] are "age" as the second most important factor, "cumulative time" in the top 5%, and "allergic disorders", "working length", "treatment type", "obturation", "PD local", "vertical percussion", "fistulation" and "pain bite" in the top 25%. Such results suggest that machine learning techniques can perhaps be a better or an alternative approach for ranking feature significance in comparison with standard statistical methods such as logistic regression models, since they better account for possible non-linear relationships between different factors and the treatment outcome.

To further refine our approach to selecting significant features, we plan to test different p-value thresholds, as the models trained on only the significant features achieved a lower performance than the models trained on the entire dataset, with a 5% drop in AUROC and a 7% drop in AP on average, suggesting that there are features which our method deemed insignificant despite them enhancing the models' ability to learn and produce accurate results. Future work will also involve the analysis of third-party datasets to investigate whether the results obtained in this study are generalizable, and to what degree the data collected by a single experienced practitioner differs from a dataset typically collected over the course of several years by a number of dentists-in-training. Additionally, we wish to incorporate various explainability techniques to better justify the models' predictions, in turn giving deeper insight into how specific factors affect the outcome of root canal treatments and better assisting a doctor in understanding and interpreting the predicted outcome.
References

[1] Muhammed Ayhan, İsmail Kayadibi, and Berkehan Aykanat. 2025. RCFLA-YOLO: a deep learning-driven framework for the automated assessment of root canal filling quality in periapical radiographs. BMC Medical Education, 25, 1, 894. doi: 10.1186/s12909-025-07483-2.
[2] Catalina Bennasar, Irene García, Yolanda Gonzalez-Cid, Francesc Pérez, and Juan Jiménez. 2023. Second opinion for non-surgical root canal treatment prognosis using machine learning models. Diagnostics, 13, 17, 2742. doi: 10.3390/diagnostics13172742.
[3] Catalina Bennasar, Antonio Nadal-Martínez, Sebastiana Arroyo, Yolanda Gonzalez-Cid, Ángel Arturo López-González, and Pedro Juan Tárraga. 2025. Integrating machine learning and deep learning for predicting non-surgical root canal treatment outcomes using two-dimensional periapical radiographs. Diagnostics, 15, 8, 1009. doi: 10.3390/diagnostics15081009.
[4] Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. 2024. Deep neural networks and tabular data: a survey. IEEE Transactions on Neural Networks and Learning Systems, 35, 6, 7499–7519. doi: 10.1109/TNNLS.2022.3229161.
[5] Berrin Çelik, Mehmet Zahid Genç, and Mahmut Emin Çelik. 2025. Evaluation of root canal filling length on periapical radiograph using artificial intelligence. Oral Radiology, 41, 1, 102–110. doi: 10.1007/s11282-024-00781-3.
[6] Wan-Ting Chang, Hsun-Yu Huang, Tzer-Min Lee, Tsen-Yu Sung, Chun-Hung Yang, and Yung-Ming Kuo. 2024. Predicting root fracture after root canal treatment and crown installation using deep learning. Journal of Dental Sciences, 19, 1, 587–593. doi: 10.1016/j.jds.2023.10.019.
[7] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, (Aug. 2016), 785–794. doi: 10.1145/2939672.2939785.
[8] Pragati Choudhari, Anand Singh Rajawat, and S. B. Goyal. 2023. Longevity recommendation for root canal treatment using machine learning. Engineering Proceedings, 59, 1, 193. doi: 10.3390/engproc2023059193.
[9] Xin Gao, Xing Xin, Zhi Li, and Wei Zhang. 2021. Predicting postoperative pain following root canal treatment by using artificial neural network evaluation. Scientific Reports, 11, 1, 17243. doi: 10.1038/s41598-021-96777-8.
[10] Chantal S. Herbst, Falk Schwendicke, Joachim Krois, and Sascha R. Herbst. 2022. Association between patient-, tooth- and treatment-level factors and root canal treatment failure: a retrospective longitudinal and machine learning study. Journal of Dentistry, 117, 103937. doi: 10.1016/j.jdent.2021.103937.
[11] Sascha Rudolf Herbst, Vinay Pitchika, Joachim Krois, Aleksander Krasowski, and Falk Schwendicke. 2023. Machine learning to predict apical lesions: a cross-sectional and model development study. Journal of Clinical Medicine, 12, 17, 5464. doi: 10.3390/jcm12175464.
[12] Yejin Hwang and Jongwoo Song. 2023. Recent deep learning methods for tabular data. Communications for Statistical Applications and Methods, 30, 2, (Mar. 2023), 215–226. doi: 10.29220/CSAM.2023.30.2.215.
[13] Rok Jurič, G. Vidmar, R. Blagus, and Janja Jan. 2024. Factors associated with the outcome of root canal treatment - a cohort study conducted in a private practice. International Endodontic Journal, 57, 4, 377–393. doi: 10.1111/iej.14022.
[14] Mohmed Isaqali Karobari, Vishnu Priya Veeraraghavan, P. J. Nagarathna, Sudhir Rama Varma, Jayaraj Kodangattil Narayanan, and Santosh R. Patil. 2025. Predictive analysis of root canal morphology in relation to root canal treatment failures: a retrospective study. Frontiers in Dental Medicine, 6. doi: 10.3389/fdmed.2025.1540038.
[15] Maurice G. Kendall and Alan Stuart. 1973. The Advanced Theory of Statistics, Volume 2: Inference and Relationship. (1st ed.). See Section 31.18. Charles Griffin, London, UK. isbn: 978-0852640111.
[16] Scott Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. arXiv: 1705.07874 [cs.AI]. https://arxiv.org/abs/1705.07874.
[17] F. Pedregosa et al. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[18] Yang Qu, Zhenzhe Lin, Zhaojing Yang, Haotian Lin, Xiangya Huang, and Lisha Gu. 2022. Machine learning models for prognosis prediction in endodontic microsurgery. Journal of Dentistry, 118, 103947. doi: 10.1016/j.jdent.2022.103947.
[19] Yang Qu, Yiting Wen, Ming Chen, Kailing Guo, Xiangya Huang, and Lisha Gu. 2023. Predicting case difficulty in endodontic microsurgery using machine learning algorithms. Journal of Dentistry, 133, 104522. doi: 10.1016/j.jdent.2023.104522.
[20] Ravid Shwartz-Ziv and Amitai Armon. 2022. Tabular data: deep learning is not all you need. Information Fusion, 81, 84–90. doi: 10.1016/j.inffus.2021.11.011.
[21] Joe H. Ward Jr. 1963. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 301, 236–244.
[22] Weiwei Wu, Surong Chen, Pan Chen, Min Chen, Yan Yang, Yuan Gao, Jingyu Hu, and Jingzhi Ma. 2024. Identification of root canal morphology in fused-rooted mandibular second molars from X-ray images based on deep learning. Journal of Endodontics, 50, 9, 1289–1297.e1. doi: 10.1016/j.joen.2024.05.014.
[23] Daniel Zwillinger and Stephen Kokoska. 2000. CRC Standard Probability and Statistics Tables and Formulae. (1st ed.). Section 14.7. Chapman & Hall/CRC, Boca Raton, FL. isbn: 978-0-8493-0026-4.

Predictive Maintenance of Machines in LABtop Production Environment

Primož Kocuvan, Department of Intelligent Systems, "Jožef Stefan" Institute, Ljubljana, Slovenia, primoz.kocuvan@ijs.si
Vinko Longar, Rudolfovo - znanstveno in tehnološko središče, Novo mesto, Slovenia, vinko.longar@rudolfovo.eu
Rok Struna, Rudolfovo - znanstveno in tehnološko središče, Novo mesto, Slovenia, rok.struna@rudolfovo.eu

Abstract

This study investigates the predictive maintenance of CNC machinery within the LABtop production environment through the deployment of iCOMOX sensor modules on a compressor and a machine spindle. Each module integrates multi-modal sensing capabilities, including vibration, magnetic field, temperature, and acoustic measurements, enabling comprehensive monitoring of machine conditions. Data was collected at five-minute intervals over a 30-day period, resulting in an unlabeled dataset due to the absence of recorded failures or anomalies. The analysis employed unsupervised machine learning techniques, specifically principal component analysis (PCA) for dimensionality reduction and clustering to identify operational patterns. PCA successfully reduced the original 11-dimensional measurement dataset to two principal components, allowing for effective visualization and grouping. The elbow and silhouette methods determined three optimal clusters for both sensors, with one cluster in each case identified as a potential outlier. Results suggest that dense clusters represent normal operation, while outlier clusters may indicate measurement errors or emerging machine faults. Although supervised learning could not yet be applied, future work will integrate fault-labeled data to enable robust predictive maintenance models.
Keywords

predictive maintenance, PCA method, production environment, silhouette analysis, elbow method

1 Introduction

The increasing complexity of modern production systems demands advanced approaches to machine maintenance in order to minimize downtime, reduce costs, and ensure consistent product quality. Traditional maintenance strategies, such as corrective or preventive maintenance, often fail to provide early warnings of failures and may result in either excessive servicing or unexpected breakdowns. Predictive maintenance, by contrast, leverages sensor data and machine learning techniques to detect patterns, identify anomalies, and forecast potential failures before they occur. This approach not only enhances operational efficiency but also extends the lifetime of critical equipment.

Within the LABtop production environment (consisting of multiple machines in sequence, mostly drilling and cutting machines), predictive maintenance has been explored through the integration of advanced multi-sensor monitoring solutions. For this purpose, the public research institute Rudolfovo implemented iCOMOX sensor modules on both the compressor and the spindle of a CNC machine. Each iCOMOX module integrates several sensing elements (vibration, magnetic field, temperature, and acoustic measurements), providing a rich dataset suitable for machine learning-based condition monitoring.

The collected data were acquired over a continuous 30-day period at five-minute intervals. Since no machine failures, temperature anomalies, or bearing defects were recorded during this time, the dataset lacked diagnostic labels and was therefore treated as unlabeled. To address this, unsupervised learning methods were employed to uncover latent structures in the data. Principal component analysis (PCA) was used to reduce the dimensionality of the dataset, while clustering methods were applied to identify patterns and potential anomalies in machine operation.

The aim of this study is to evaluate the feasibility of unsupervised learning methods in predictive maintenance for industrial equipment, specifically under conditions where fault-labeled data are unavailable. By analyzing the clustering behavior of sensor signals, this work provides insights into normal operating regimes and potential deviations that may correspond to early indicators of faults or measurement errors. Future work will incorporate supervised learning techniques once labeled fault data become available, enabling the development of robust predictive models.
2 Related Work

The field of predictive maintenance (PdM) has advanced considerably, with strong emphasis on unsupervised learning methods for anomaly detection and health assessment when labeled failure data are unavailable. PdM has been shown to significantly reduce maintenance costs, decrease unexpected downtime, and enhance equipment reliability [1]. Multi-sensor monitoring platforms such as iCOMOX have emerged as versatile tools for industrial condition monitoring. These devices integrate vibration, magnetic field, temperature, and acoustic sensors into a compact, industrial-grade package capable of edge analytics and cloud integration [2–5]. Such systems enable continuous monitoring of machine health and support the implementation of predictive maintenance strategies in Industry 4.0 environments.

From a methodological perspective, unsupervised learning techniques, such as principal component analysis (PCA) for dimensionality reduction, clustering, and anomaly detection, are widely applied for exploratory data analysis. A comprehensive survey highlights the breadth and maturity of these techniques across domains [6]. Clustering methods including k-means, DBSCAN, and OPTICS are instrumental in grouping operational states and unveiling deviations that may signify incipient failures [7].

Hybrid methods combining PCA with clustering have proven effective in enhancing fault detection capabilities. For example, a railcar health monitoring system employing DBSCAN clustering with PCA achieved a fault detection accuracy of 96.4% [8]. Similarly, kernel PCA has been applied to construct health indices for unsupervised prognostics [9]. In compressor maintenance, incorporating clustering-derived features into supervised classifiers improved predictive accuracy by 4.9% and reduced training time by 23% [10]. Several studies also propose frameworks that integrate unsupervised learning with IoT and Big Data infrastructures, enabling scalable predictive maintenance solutions across industrial environments [11]. These works demonstrate the feasibility of extracting actionable health indicators from unlabeled sensor data and underscore the critical role of advanced analytics in industrial condition monitoring.
3 Methodology

3.1 Data Acquisition

Two iCOMOX sensor modules were installed on critical machine components within the LABtop production system: the spindle of a CNC machine and the air compressor. Each sensor module integrates vibration, magnetic field, temperature, and acoustic sensing elements, thereby providing multimodal monitoring capabilities. Data were sampled at 5-minute intervals over a continuous 30-day observation period, resulting in an unlabeled dataset due to the absence of recorded failures, anomalies, or maintenance events.

3.2 Data Preprocessing

Raw signals from the iCOMOX modules were aggregated into feature vectors, yielding an 11-dimensional dataset. Standard preprocessing steps included normalization of features to remove scaling effects, filtering to reduce noise (particularly in the acoustic and vibration signals), and synchronization of the multimodal sensor streams.

3.3 Dimensionality Reduction

To facilitate visualization and clustering, dimensionality reduction was performed. Multiple techniques (e.g., t-SNE, Isomap, and autoencoders) were evaluated; however, principal component analysis (PCA) demonstrated superior stability and interpretability. The data were reduced from 11 to 2 principal components, which captured the majority of the variance and allowed an effective 2D representation.

3.4 Clustering Analysis

Clustering was applied to the reduced dataset to uncover hidden structures and potential anomalies. The elbow method (Figure 1) and the silhouette coefficient (Figure 2) were employed to determine the optimal number of clusters. Based on these metrics, three clusters were identified for each sensor dataset. The analysis was conducted separately for the two sensor modules (iCOMOX1 on the spindle and iCOMOX2 on the compressor). Outlier clusters were identified and highlighted for subsequent interpretation.

Figure 1: Elbow method.

Figure 2: Silhouette coefficient.
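The text does not name the specific clustering algorithm, but the elbow method implies a k-means-style inertia criterion; the following is a minimal sketch of the pipeline, assuming k-means and scikit-learn, with a synthetic matrix standing in for the 11-dimensional iCOMOX feature vectors.

# PCA to 2 components followed by k-means; inertia (elbow) and silhouette
# scores guide the choice of k. The feature matrix is a synthetic stand-in.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(8640, 11)        # ~30 days at 5-minute intervals

X_scaled = StandardScaler().fit_transform(X)   # normalization step
X_2d = PCA(n_components=2).fit_transform(X_scaled)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_2d)
    print(f"k={k}  inertia={km.inertia_:.1f}  "
          f"silhouette={silhouette_score(X_2d, km.labels_):.3f}")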
4 Results

PCA successfully compressed the 11-dimensional dataset into a two-dimensional space. The first two principal components explained the majority of the variance (>80%), enabling effective visualization of patterns in machine behavior. Figure 3 illustrates the scatter plot for iCOMOX1: three distinct clusters are visible, with Cluster 1 (highlighted in orange) showing divergence from the main operating regime. Figure 4 presents the scatter plot for iCOMOX2, where Cluster 2 (highlighted in green) emerges as an outlier relative to the normal operating clusters.

Figure 3: PCA scatter plot for iCOMOX1.

Figure 4: PCA scatter plot for iCOMOX2.

Clusters containing densely grouped points correspond to normal operating conditions of the CNC spindle and compressor. The outlier clusters, however, represent either:
• sensor noise or measurement anomalies (e.g., transient vibration spikes or acoustic distortions), or
• incipient machine faults, which could not be conclusively confirmed due to the absence of ground-truth failure data.

In summary:
• PCA combined with clustering effectively distinguished between normal operation and anomalous behavior.
• Both sensor datasets (iCOMOX1 and iCOMOX2) revealed three clusters, with one consistently standing out as an outlier.
• Without diagnostic labels, these outliers cannot be definitively classified as machine faults, but their presence highlights potential events of interest for further investigation.
• The results validate the feasibility of unsupervised learning for predictive maintenance in environments lacking labeled fault data.

5 Discussion

The findings from this study demonstrate the viability of unsupervised learning methods, in particular PCA and clustering, for analyzing unlabeled condition-monitoring data in industrial environments. By reducing an 11-dimensional dataset to two principal components, it was possible to visualize operational states and uncover outlier clusters that may correspond to anomalous machine behavior. This outcome aligns with previous work emphasizing the effectiveness of dimensionality reduction and clustering in predictive maintenance tasks where labeled fault data are limited or unavailable [6, 8, 9].

The observation of three clusters for both the spindle (iCOMOX1) and the compressor (iCOMOX2) highlights the presence of distinct operating regimes within the LABtop system. The fact that one cluster consistently emerged as an outlier suggests potential precursors to faults or, alternatively, sensor-related anomalies. While conclusive interpretation requires diagnostic labels, the clustering nevertheless provides an essential first step toward identifying patterns that can later inform supervised learning models once fault data become available.

Compared to related studies, the present results confirm trends reported in railcar health monitoring [8] and compressor maintenance [10], where unsupervised approaches successfully revealed structural patterns in the absence of labeled datasets. The advantage of PCA lies in its ability to preserve variance while simplifying visualization, which proved more effective than the alternative reduction methods considered here (e.g., t-SNE or Isomap). This echoes findings from other industrial applications where PCA has served as a reliable baseline for anomaly detection [9].

An important implication is that multi-sensor platforms such as iCOMOX provide the richness of data required for advanced analytics. The combination of vibration, acoustic, magnetic field, and temperature measurements enables the detection of subtle variations that might not be visible through single-sensor monitoring. As highlighted in prior work [2–5], the integration of multimodal data streams significantly strengthens predictive maintenance frameworks by improving robustness and interpretability.

Nevertheless, this study also underscores the limitations of unsupervised learning. Without failure labels, it is not possible to conclusively distinguish between anomalies arising from true machine faults and those caused by sensor noise or environmental conditions. This limitation has been widely noted in the literature [6, 11]. Future work should therefore focus on generating labeled datasets through controlled fault injection or long-term monitoring until natural failures occur. Such datasets would enable supervised and hybrid learning approaches, which have shown promise in achieving higher predictive accuracy and more actionable decision support [1, 10].

In summary, the present analysis validates the potential of unsupervised learning for predictive maintenance in data-scarce environments. While preliminary, the results establish a methodological foundation for extending condition monitoring at LABtop to more advanced machine learning pipelines, ultimately contributing to early fault detection, reduced downtime, and optimized maintenance planning.
6 Future Work

The present study establishes a foundation for predictive maintenance at LABtop using unsupervised learning methods; however, several directions remain open for further investigation.

First, the absence of diagnostic labels limited this study to exploratory clustering and anomaly detection. Future work will prioritize the collection of labeled datasets through either (i) controlled fault injection experiments on non-critical test equipment or (ii) extended operational monitoring until natural failures occur. The availability of labeled fault data will enable the application of supervised learning and hybrid approaches, combining clustering-derived features with classification models to improve fault detection accuracy and reliability, as demonstrated in recent compressor studies [10].

Second, while PCA provided an effective means of dimensionality reduction, more advanced techniques such as kernel PCA, autoencoders, and variational autoencoders should be investigated. These methods may capture nonlinear relationships in the sensor data that PCA cannot, potentially yielding richer health indicators and more precise separation of operational regimes [9].

Third, the present work focused primarily on offline analysis. Future research should extend to real-time streaming analytics, leveraging the edge-processing capabilities of the iCOMOX platform [2–5]. Deploying online anomaly detection models would allow immediate identification of abnormal conditions and facilitate proactive maintenance decisions.

Fourth, integration with IoT and cloud-based platforms remains a key step toward scalable deployment. By embedding unsupervised learning models into Industry 4.0 architectures, LABtop can benefit from centralized monitoring, cross-machine comparisons, and fleet-level anomaly detection, as highlighted in existing frameworks [11].

Finally, interpretability remains an essential concern. Future efforts will explore explainable AI (XAI) techniques to provide actionable insights into why certain clusters or anomalies are flagged, thereby enhancing operator trust and enabling domain experts to validate and refine the models.

Acknowledgments

The research was supported by the DIGITOP project, which is funded by the Ministry of Higher Education, Science and Innovation of Slovenia, the Slovenian Research and Innovation Agency, and EU NextGenerationEU under Grant TN-06-0106. We thank prof. dr. Matjaž Gams for proofreading the article and for mentorship support within the DIGITOP project.

References

[1] Abdeldjalil Benhanifia, Zied Ben Cheikh, Paulo Moura Oliveira, Antonio Valente, José Lima. Systematic review of predictive maintenance practices in the manufacturing sector. Intelligent Systems with Applications, Volume 26, 2025, Article 200501. ISSN 2667-3053. https://doi.org/10.1016/j.iswa.2025.200501
[2] RS Components. iCOMOX Intelligent Condition Monitoring Box – Product Datasheet, 2019. Available: https://docs.rs-online.com/c878/A700000007538369.pdf. Accessed 25.8.2025.
[3] EE Times Europe. Arrow introduces new Shiratech iCOMOX condition-based monitoring products, 2019. Available: https://www.eetimes.eu/press-releases/arrow-introduces-new-shiratech-icomox-condition-based-monitoring-products/. Accessed 25.8.2025.
[4] EBOM. New Shiratech iCOMOX sensor-to-cloud platform cuts time-to-market for intelligent condition monitoring, 2019. Available: https://www.ebom.com/new-shiratech-icomox-sensor-to-cloud-platform-cuts-time-to-market-for-intelligent-condition-monitoring/. Accessed 25.8.2025.
[5] Sensor+Test. iCOMOX – Condition Monitoring Box, 2023. Available: https://www.sensor-test.de/assets/Fairs/2023/ProductNews/PDFs/iCOMOX.pdf. Accessed 25.8.2025.
[6] K. Taha. "Semi-supervised and un-supervised clustering: A review and experimental evaluation," Information Systems, vol. 114, p. 102178, 2023. doi: 10.1016/j.is.2023.102178.
[7] GopenAI Blog. Predictive maintenance with unsupervised machine learning algorithms, 2020. Available: blog.gopenai.com. Accessed 25.8.2025.
[8] M. Ejlali, E. Arian, S. Taghiyeh, K. Chambers, A. H. Sadeghi, D. Cakdi, and R. B. Handfield. "Developing Hybrid Machine Learning Models to Assign Health Score to Railcar Fleets for Optimal Decision Making," arXiv preprint arXiv:2301.08877, 2023.
[9] Z. Chen et al. "Health Index Construction Based on Kernel PCA for Equipment Prognostics," Control Engineering Practice, vol. 126, 2022.
[10] A. Salazar et al. "Unsupervised Feature Extraction for Compressor Predictive Maintenance Using Clustering and Supervised Learning," arXiv, 2024.
[11] Nota, Giancarlo; Nota, Francesco; Toro Lazo, Alonso; Nastasia, Michele. (2024). A framework for unsupervised learning and predictive maintenance in Industry 4.0. International Journal of Industrial Engineering and Management, 15, 304–319. doi: 10.24867/IJIEM-2024-4-365.

Machine Learning for Cutting Tool Wear Detection: A Multi-Dataset Benchmark Study Toward Predictive Maintenance

Žiga Kolar, Jožef Stefan Institute, Ljubljana, Slovenia, ziga.kolar@ijs.si
Thibault Comte, Universite Paris-Saclay, Paris, France, thibault.comte@universite-paris-saclay.fr
Yanny Hassani, Universite Paris-Saclay, Paris, France, yanny.hassani@universite-paris-saclay.fr
Hugues Louvancour, Universite Paris-Saclay, Paris, France, hugues.louv@gmail.com
Jože Ravničan, UNIOR Kovaška industrija d.d., Zreče, Slovenia, joze.ravnican@unior.com
Matjaž Gams, Jožef Stefan Institute, Ljubljana, Slovenia, matjaz.gams@ijs.si

Abstract

This student paper investigates the use of machine learning techniques to automate the detection of tool wear in cutting machines, replacing manual monitoring with intelligent, data-driven solutions. Although the proposed ML methods are standard in predictive maintenance, our contribution lies in providing a systematic multi-dataset benchmark tailored for direct transfer to industrial environments. This establishes a reproducible baseline before deploying and validating on real UNIOR data. As part of the project, and in anticipation of collecting real-world accelerometer data from industrial machines, we conducted a series of benchmarking experiments using five publicly available datasets that include accelerometer and audio signals under various wear-related conditions. The datasets cover a variety of industrial contexts and labeling schemes, allowing us to assess different preprocessing strategies and classification models such as Random Forests, 1D Convolutional Neural Networks, and Long Short-Term Memory networks. Our best results, an F1-score of 0.9949, were achieved using an LSTM model on a vibration dataset simulating fault conditions. These findings highlight the strong potential of AI for predictive maintenance and lay the groundwork for transferring the developed pipelines to the system once real data become available. Future work will focus on real-time wear detection and model deployment within live production environments.

Keywords

accelerometer, neural networks, machine learning, cutting tool
1 Introduction

This student paper presents the work carried out by Thibault Comte, Hugues Louvancour, and Yanny Hassani on the UNIOR project, under the mentorship of Žiga Kolar and prof. dr. Matjaž Gams for the Jožef Stefan Institute, and Jože Ravničan for Unior. The objective of the UNIOR project is to detect when a cutting machine becomes worn out by analyzing sensor signals, specifically accelerometer data along the x, y, and z axes. An accelerometer is mounted on the cutting machine to monitor the vibrations occurring during the cutting process. Currently, the detection of wear is performed manually by a human operator. By leveraging artificial intelligence (AI) and machine learning (ML), this process can be automated, making it both easier and more efficient.

While awaiting the company to complete the necessary paperwork and to acquire and install the accelerometer on the cutting machine, we identified similar publicly available datasets and conducted several machine learning experiments using them.

2 Related Work

This section briefly surveys recent research on the use of artificial intelligence (AI) techniques for tool wear monitoring in manufacturing processes such as milling, turning, and drilling. Munaro et al. [2] provide a systematic review of 77 studies, contrasting offline and online monitoring methods. Online approaches leveraging sensor data (such as force, vibration, acoustic emission, and power) are enhanced by AI models like SVMs, ANNs, CNNs, and LSTMs, offering accuracies above 90% and industrial relevance. Sieberg et al. [5] demonstrate CNN-based classification of wear mechanisms from SEM images, achieving 73% test accuracy. They emphasize dataset balance and magnification consistency as critical challenges. Shah et al. [4] argue for ML's superiority over physics-based models in wear prediction, underscoring ANNs' predictive strength when supplied with high-quality data and standardized evaluation methods. Recent studies also explore multimodal sensor fusion, combining accelerometer, acoustic, and force signals to improve robustness [8]. Specifically, transfer learning has been shown effective for adapting models trained on laboratory data to industrial machines [8].

Unlike previous reviews such as Munaro et al. [2], which survey the field, our work provides a systematic multi-dataset experimental comparison across three different sensor modalities (accelerometer, vibration, audio) using standardized pipelines. This benchmarking is not only descriptive but forms the basis for industrial transfer to UNIOR's production line, bridging academic datasets with real machine applications.

3 Datasets

This section describes the five datasets that were identified: four containing accelerometer data and one featuring audio recordings.

3.1 Bosch CNC Machining Dataset

The Bosch CNC Machining dataset consists of real-world industrial vibration data collected from a brownfield CNC milling machine. Acceleration was measured using a tri-axial Bosch CISS sensor mounted inside the machine, recording the X, Y, and Z axes at a sampling rate of 2 kHz.
Both normal and anomalous data were collected across six distinct timeframes, each spanning six months between October 2018 and August 2021, with appropriate labeling. Data were collected from three distinct CNC milling machines, each executing 15 processes [7]. A total of 1,702 samples were obtained, each labeled as either "good" or "bad." The distribution of labels was 95.9% good and 4.1% bad.

3.2 Cutting Tool Wear Audio Dataset

This dataset comprises 1,488 ten-second .wav audio recordings of cutting tool wear collected at two spindle speeds: 520 RPM and 635 RPM. Each audio recording is labeled as either "BASE" (machine running without cutting), "FRESH" (sharp cutting tool), "MODERATE" (moderately worn tool), or "BROKEN" (broken or fully worn tool). The "FRESH," "MODERATE," and "BROKEN" labels were specifically chosen to simulate real cutting conditions, focusing on scenarios where the machine is actively engaged in material removal. In total, the dataset includes 400 "FRESH" samples, 376 "MODERATE" samples, and 362 "BROKEN" samples across both spindle speeds, offering a nearly balanced distribution well suited for ML applications. The audio recordings had different lengths. No artificial background noise was added to the recordings. All cutting tools used were 16 mm end-mill cutters, and the workpiece material was mild steel [6].

3.3 Turning Dataset for Chatter

This dataset contains sensor signals collected from multiple cutting tests using a range of measurement devices, including two perpendicular single-axis accelerometers, a tri-axial accelerometer, a microphone, and a laser tachometer. Both raw sensor data and processed, labeled data from one channel of the tri-axial accelerometer are provided. Four labels were used: no-chatter, intermediate chatter, chatter, and unknown. The dataset contains a total of 117 signals, with the following label distribution: 51 labeled as no-chatter, 19 as intermediate chatter, 22 as chatter, and 25 as unknown. Data were collected under four distinct cutting configurations, defined by varying the stickout distance, i.e. the distance from the heel of the boring rod to the back face of the tool holder. The four stickout distances used were 5.08 cm (2 inches), 6.35 cm (2.5 inches), 8.89 cm (3.5 inches), and 11.43 cm (4.5 inches) [8].

3.4 UCI Accelerometer Dataset

To simulate motor vibrations, a 12 cm Akasa AK-FN059 Viper cooling fan was modified by attaching weights to its blades, and an MMA8452Q accelerometer was mounted to capture vibration data. An artificial neural network was then used to predict motor failure time based on this data. Three distinct vibration scenarios were generated by varying the placement of two weight pieces on the fan blades: (1) red, the normal configuration, with weights on neighboring blades; (2) blue, the perpendicular configuration, with weights on blades 90° apart; and (3) green, the opposite configuration, with weights on opposite blades. For each of the three weight configurations, vibration data was collected every 20 ms over a 1-minute interval per speed, resulting in 3,000 records per speed. In total, the dataset contains 153,000 vibration records from the simulation model [3].

3.5 Vibrations Dataset

This dataset contains vibrational data collected to support early fault diagnosis in machinery. The data was gathered using an SG-Link tri-axial accelerometer sensor (by MICROSTRAIN Corporation) at a sampling rate of 679 samples per second for each of the three axes: axial (z), horizontal (x), and vertical (y). Experiments were conducted in the Mechanical Vibration Laboratory at the Mechanical Engineering Department of the University of Engineering and Technology (UET), Taxila. The setup simulated four distinct machine conditions: normal, cracking, offset pulley, and wear states, using a test rig designed for fault simulation [1].

4 Methodology and Results

This section outlines the methodology used for each dataset, focusing on multiclass classification. Various preprocessing techniques and machine learning algorithms were applied.

4.1 Bosch CNC Machining Dataset

The Bosch CNC Machining Dataset contains 95.9% good signals and 4.1% bad signals. The objective was to develop a binary classification model that outperforms a naive baseline, which achieves 95.9% accuracy simply by always predicting a signal as good.

Two approaches were tested on this dataset. The first applied random undersampling, which balances the class distribution by randomly removing samples from the majority class while leaving the minority class unchanged. Since the majority class accounted for 95.9% of the data, this step was essential to prevent the model from defaulting to majority-class predictions. After applying the random undersampling, a Random Forest model was used for binary classification. This method achieved 99% accuracy under 5-fold cross-validation, a 3.1% improvement over the naive baseline model.

Different preprocessing strategies were necessary due to differences in data formats, sampling rates, and class balance across the datasets. For example, in the Bosch dataset, random undersampling was applied only on the training folds during 5-fold CV to avoid information leakage, as illustrated in the sketch below.
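The paper does not include its pipeline code; one common way to keep the sampler fit-time-only is an imbalanced-learn pipeline, sketched below with a synthetic stand-in for the Bosch data and assumed hyperparameters.

# Random undersampling applied only inside the training folds of 5-fold CV,
# via an imbalanced-learn pipeline; shapes and settings are illustrative.
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: ~95.9% "good" vs ~4.1% "bad", as in the Bosch data.
X, y = make_classification(n_samples=1702, weights=[0.959], random_state=0)

pipe = Pipeline([
    ("undersample", RandomUnderSampler(random_state=0)),  # applied at fit time only
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

# The sampler never sees the validation fold, avoiding information leakage.
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print(scores.mean())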
4 Methodology and Results
This section outlines the methodology used for each dataset, focusing on multiclass classification. Various preprocessing techniques and machine learning algorithms were applied. Different preprocessing strategies were necessary due to differences in data formats, sampling rates, and class balance across the datasets. For example, in the Bosch dataset, random undersampling was applied only on the training folds during 5-fold cross-validation to avoid information leakage.

4.1 Bosch CNC Machining Dataset
The Bosch CNC Machining dataset contains 95.9% good signals and 4.1% bad signals. The objective was to develop a binary classification model that outperforms a naive baseline, which achieves 95.9% accuracy simply by always predicting a signal as good.

Two approaches were tested on this dataset. The first applied random undersampling, which balances the class distribution by randomly removing samples from the majority class while leaving the minority class unchanged. Since the majority class accounted for 95.9% of the data, this step was essential to prevent the model from defaulting to majority-class predictions. After undersampling, a Random Forest model was used for binary classification. This method achieved 99% accuracy on 5-fold cross-validation, a 3.1% improvement over the naive baseline.

In the second approach, features were initially extracted using two 1D convolutional layers followed by two max pooling layers. To augment the data, random Gaussian noise was added to the signals, effectively doubling the size of the training set. A binary classification model using Random Forest was then trained on this augmented dataset. This model achieved an accuracy of 0.996 under 5-fold cross-validation, outperforming the naive baseline by 3.7%.

McNemar's test was applied between competing models on each dataset. Significant differences (p < 0.05) were observed between the CNN and Random Forest on the Bosch dataset, confirming that the improvements are not due to random variation.
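The leakage-safe undersampling protocol can be made concrete with a short sketch. The snippet below is a minimal illustration, not the project code: it uses the imbalanced-learn Pipeline, whose samplers are applied only when fitting, so each test fold keeps its original class imbalance; the feature matrix and labels are synthetic placeholders.

import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1702, 64))             # placeholder features
y = (rng.random(1702) < 0.041).astype(int)  # placeholder labels, ~4.1% "bad"

# The imblearn Pipeline resamples only during fit, i.e. on the training
# folds; each test fold keeps its original, imbalanced distribution.
model = Pipeline([
    ("undersample", RandomUnderSampler(random_state=42)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"5-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")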
4.2 Cutting Tool Wear Audio Dataset
The Cutting Tool Wear Audio dataset contains 400 "FRESH", 376 "MODERATE", and 362 "BROKEN" samples across two spindle speeds, requiring a multi-class classification approach. Since the signals varied in length, we first identified the longest signal (48,000 samples) and zero-padded shorter signals to match this length; to improve model accuracy, this maximum length was later reduced. The model architecture included two 1D convolutional layers and two 1D max pooling layers to reduce the dimensionality of the data while preserving essential features. The output from these layers served as input to a feature selection algorithm, which identified the 96 most relevant features out of a total of 2,048. These selected features were then used by a Random Forest classifier to predict the final label. The best model for this dataset achieved 0.9279 (+/- 0.01) accuracy and a 0.9279 F1-score on 5-fold cross-validation. Results (precision, recall, F1-score, and accuracy) are presented in Figure 1.

Figure 1: 5-fold cross-validation report and confusion matrix for the Cutting Tool Wear Audio dataset.

4.3 Turning Dataset for Chatter
Since each signal varies in length and can be quite long, an approach based on extracting time-domain and frequency-domain features was implemented. This method preserves essential information from the original signals while significantly reducing dimensionality, making the data more suitable for ML algorithms.

The approach combines signal segmentation and frequency-domain feature extraction to summarize the spectral characteristics of a time-series signal. First, the input signal is divided into overlapping or non-overlapping fixed-size windows using a sliding-window technique, where each segment is 10,000 samples long and the shift between consecutive segments is determined by the step size, in this case 5,000 samples. This allows for localized analysis of signal dynamics over time. Next, we applied the Fast Fourier Transform (FFT) to each segment, converting the time-domain signal into its frequency-domain representation. The magnitude spectrum is computed for each segment, and the spectral magnitudes are then averaged across all segments to obtain a single, representative frequency-domain feature vector. This results in a compact yet informative summary that captures the dominant frequency components of the entire signal while accounting for temporal variation through segmentation.

Furthermore, 11 additional features were extracted from the raw signal, including the mean, standard deviation, minimum, maximum, and median of the frequency values. These features capture the signal's central tendency and variability, providing a statistical summary of its frequency content. The 25th and 75th percentiles further quantify the signal's interquartile range, highlighting its variability and robustness to outliers. Root mean square (RMS) provides a measure of the signal's overall power. Skewness and kurtosis describe the asymmetry and peakedness of the distribution, respectively, offering insights into the signal's shape beyond basic statistics. Finally, zero crossings count the number of times the signal crosses the zero axis, serving as an indicator of frequency content and signal complexity. Together, these features form a rich representation for classification tasks involving time-frequency signals.

In total, there were 268 features (257 FFT features and 11 additional features) and 117 samples. A feature selection technique was applied to further reduce the number of features: the 140 best features were selected and used as input to a Random Forest classifier. The best model for this dataset achieved 0.80 (+/- 0.06) accuracy and a 0.7588 F1-score on 5-fold cross-validation. Results (precision, recall, F1-score, and accuracy) are depicted in Figure 2.

Figure 2: 5-fold cross-validation report and confusion matrix for the Turning Dataset for Chatter.
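The segmentation-and-FFT extraction described above can be sketched in a few lines. This is an illustrative reconstruction rather than the authors' code: how the averaged spectrum was reduced to exactly 257 FFT values is not specified, so the truncation below is an assumption, and since the text is ambiguous about whether the 11 statistics are computed on the raw signal or its spectrum, the sketch uses the raw signal.

import numpy as np
from scipy.stats import skew, kurtosis

def chatter_features(signal, window=10_000, step=5_000, n_fft=257):
    # Sliding-window segmentation (overlapping, since step < window).
    segments = np.array([signal[s:s + window]
                         for s in range(0, len(signal) - window + 1, step)])
    # Magnitude spectrum per segment, averaged across all segments;
    # keeping the first n_fft bins is an assumed reduction to 257 values.
    avg_spectrum = np.abs(np.fft.rfft(segments, axis=1)).mean(axis=0)[:n_fft]
    stats = np.array([
        signal.mean(), signal.std(), signal.min(), signal.max(),
        np.median(signal),
        np.percentile(signal, 25), np.percentile(signal, 75),
        np.sqrt(np.mean(signal ** 2)),          # root mean square
        skew(signal), kurtosis(signal),
        np.sum(np.diff(np.sign(signal)) != 0),  # zero crossings
    ])
    return np.concatenate([avg_spectrum, stats])  # 257 + 11 = 268 features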
4.4 UCI Accelerometer Dataset
This method implements a complete machine learning pipeline for classifying time-series accelerometer data using features extracted from both the time and frequency domains. Data are first loaded from a CSV file, where each row contains an activity label and raw X, Y, and Z accelerometer readings. The signal is segmented into non-overlapping windows of fixed size (50 samples, corresponding to 1 second at 50 Hz), and only windows with consistent activity labels are retained for supervised learning.

Next, time-domain and frequency-domain features were extracted from each window. Time-domain features include basic statistics (mean, standard deviation, min, max, median), RMS, peak-to-peak range, skewness, kurtosis, zero-crossing rate, signal energy, and crest factor. Frequency-domain features are extracted via FFT and include spectral centroid, spectral spread, peak frequency, and energy in predefined low (0–5 Hz) and high (10–25 Hz) frequency bands.

This feature-rich representation is passed through a machine learning pipeline that includes feature scaling, univariate feature selection (SelectKBest with the ANOVA F-statistic), and classification using a Random Forest classifier. The best model for this dataset achieved 0.972 (+/- 0.008) accuracy and a 0.97 F1-score on 5-fold cross-validation. Results (precision, recall, F1-score, and accuracy) are depicted in Figure 3.

Figure 3: 5-fold cross-validation report and confusion matrix for the UCI Accelerometer dataset.

4.5 Vibrations Dataset
In this method, the time-series data were segmented into overlapping windows of fixed length 226. A total of 168,372 samples were generated, providing a sufficient amount of data for training deep learning models. A Long Short-Term Memory (LSTM) neural network was chosen due to its effectiveness in handling sequential data. The network architecture consisted of two LSTM layers with 128 and 64 units, respectively, along with two dropout layers incorporated to reduce the risk of overfitting and improve generalization. This method achieved the best performance to date, reaching an accuracy of 0.9948 (+/- 0.005) in 5-fold cross-validation and an F1-score of 0.9949. The results are presented in Figure 4.

Figure 4: 5-fold cross-validation report and confusion matrix for the Vibrations dataset.
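A minimal Keras sketch of the described network is given below. The layer sizes (128 and 64 LSTM units, two dropout layers) and the window length of 226 follow the text; the dropout rate, optimizer, and training schedule are not reported and are placeholders, as is the assumption of three input channels (the tri-axial sensor) and four output classes (the four simulated machine conditions).

import tensorflow as tf

n_timesteps, n_channels, n_classes = 226, 3, 4  # window length; tri-axial input; four conditions

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_timesteps, n_channels)),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.Dropout(0.2),               # rate assumed
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dropout(0.2),               # rate assumed
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=20, validation_split=0.1)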
This study has several limitations. First, the datasets used are publicly available and may not fully capture the variability of industrial machining environments. Second, in some cases class balance was artificially enforced via undersampling, which could affect generalizability. Third, we recognize that the lack of direct industrial validation is a current limitation. However, our pipelines were designed for immediate deployment once the company's accelerometers are installed, ensuring direct continuity from these benchmark studies to industrial application. This study therefore serves as a reproducible foundation rather than a final industrial deployment. Partial validation experiments with UNIOR's machines are planned as the next project stage.

5 Conclusion
This student paper explored machine learning for automated cutting tool wear detection. Using five public datasets and models such as Random Forests, CNNs, and LSTMs, we achieved strong performance, notably a 0.9949 F1-score on the Vibrations dataset. These benchmarks highlight ML's potential for predictive maintenance and provide ready-to-deploy pipelines for future industrial data. Future work will focus on validating the models on industrial machines, optimizing their performance, and deploying them in real time. Additionally, for ordered domains like the Cutting Tool Wear Audio dataset, misclassifications should not be penalized equally (e.g., "FRESH" -> "MODERATE" vs. "FRESH" -> "BROKEN"). Thus, future research will explore ordinal metrics, such as weighted accuracy or quadratic weighted kappa.

Acknowledgements
The authors acknowledge funding support from the company UNIOR for the GREMO LIGHTWEIGHT project. The authors also acknowledge funding from the Slovenian Research and Innovation Agency (ARIS), Grant PR-10495, and basic core funding P2-0209.

References
[1] Muhammad Umar Khan, Muhammad Atif Imtiaz, Sumair Aziz, Zeeshan Kareem, Athar Waseem, and Muhammad Ammar Akram. 2019. System design for early fault diagnosis of machines using vibration features. In 2019 International Conference on Power Generation Systems and Renewable Energy Technologies (PGSRET). IEEE, 1–6.
[2] Roberto Munaro, Aldo Attanasio, and Antonio Del Prete. 2023. Tool wear monitoring with artificial intelligence methods: a review. Journal of Manufacturing and Materials Processing, 7, 4, 129. doi: 10.3390/jmmp7040129.
[3] Gustavo Scalabrini Sampaio, Arnaldo Rabello de Aguiar Vallim Filho, Leilton Santos da Silva, and Leandro Augusto da Silva. 2019. Prediction of motor failure time using an artificial neural network. Sensors, 19, 19, 4342.
[4] Raj Shah, Nikhil Pai, Gavin Thomas, Swarn Jha, Vikram Mittal, Khosro Shirvni, and Hong Liang. 2024. Machine learning in wear prediction. Journal of Tribology, 147, 4, (Nov. 2024), 040801. doi: 10.1115/1.4066865.
[5] Philipp Maximilian Sieberg, Dzhem Kurtulan, and Stefanie Hanke. 2022. Wear mechanism classification using artificial intelligence. Materials, 15, 7, 2358. doi: 10.3390/ma15072358.
[6] Nachiket Soni, Amit Kumar, and Hardik Patel. 2023. Acoustic analysis of cutting tool vibrations of machines for anomaly detection and predictive maintenance. In 2023 IEEE 11th Region 10 Humanitarian Technology Conference (R10-HTC). IEEE, 43–46.
[7] Mohamed-Ali Tnani, Michael Feil, and Klaus Diepold. 2022. Smart data collection system for brownfield CNC milling machines: a new benchmark dataset for data-driven machine monitoring. Procedia CIRP, 107, 131–136.
[8] Melih C. Yesilli, Firas A. Khasawneh, and Andreas Otto. 2020. On transfer learning for chatter detection in turning using wavelet packet transform and ensemble empirical mode decomposition. CIRP Journal of Manufacturing Science and Technology, 28, 118–135.
Extracting Structured Information About Food Loss and Waste Measurement Practices Using Large Language Models: A Feasibility Study

Junoš Lukan (junos.lukan@ijs.si), Maori Inagawa (maoriinagawa@keio.jp), Mitja Luštrek (mitja.lustrek@ijs.si)
Jožef Stefan Institute, Department of Intelligent Systems; Jožef Stefan International Postgraduate School; Ljubljana, Slovenia

Abstract
The Waste Quantification Solutions to Limit Environmental Stress (WASTELESS) project aims to develop and test innovative tools and methodologies for measuring and monitoring food loss and waste (FLW). A key objective is to create a decision support toolbox that helps food actors across the entire supply chain, including consumers, select the most suitable method for measuring and monitoring FLW. To help with this decision, existing, already tested FLW measurement practices can be consulted, which are currently published as short documents. In this work, we show how the data about them can be extracted using large language models (LLMs). Additionally, we propose how these data can be structured and represented as an ontology. With this process, we can help users find relevant data without needing to browse through many documents.

Keywords
food loss and waste, large language models, data extraction, ontology

1 Introduction
The project Waste Quantification Solutions to Limit Environmental Stress (WASTELESS; https://wastelesseu.com/) is designed to develop and test a mix of innovative tools and methodologies for food loss and waste (FLW) measurement and monitoring. One of its tasks is to create a decision support toolbox [10]. It should help all profiles of food actors, i.e. across the whole food supply chain (FSC), including consumers, who want to measure and monitor their FLW, to select the most appropriate method.

There have been several attempts to harmonise FLW measurement methods. The Food loss and waste accounting and reporting standard (FLW Standard; [7]) stands out as a well-structured attempt. It was produced by the Food Loss & Waste Protocol, a multi-stakeholder partnership with involvement by the Food and Agriculture Organization of the United Nations (FAO) and the World Resources Institute, among others.

The FLW Standard establishes the scope of an FLW inventory. Furthermore, it provides definitions of boundary elements and recommendations for classifications that should be used to describe them. For classifying food into categories, it suggests the FAO's and World Health Organization's Codex General Standard for Food Additives [5]. We might add that, alternatively, Annex II of "Regulation (EC) No 1333/2008 of the European Parliament and of the Council" [3] can also be used. For lifecycle stage, the International Standard Industrial Classification of All Economic Activities (ISIC) or the Statistical Classification of Economic Activities in the European Community (NACE) [4] should be used. Finally, for geographical boundary classification, UN region or country codes should be used, or the Nomenclature of Territorial Units for Statistics (NUTS) [2] in the European context.

The FLW Standard also provides guidelines on how to decide which quantification method to use for FLW measurement or monitoring. The FLW Quantification method ranking tool was prepared by the Waste and Resources Action Programme (WRAP) and includes eleven questions. Most of the questions serve as exclusion criteria. For example, a negative response to either "Do you have existing records that could be used for quantifying FLW?" (Q9) or "Do you have access to those records?" (Q10) excludes the method of records. As another example, a negative response to "Can you get direct access to the FLW being quantified?" (Q3) immediately excludes direct weighing, counting, assessing volume, and waste composition analysis, since these all need such access to be feasible. These questions encapsulate the most important characteristics by which these methods are distinguished from one another and lend themselves to particular needs of users.
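To make the exclusion logic concrete, the sketch below encodes the two examples from the text (Q3 and Q9/Q10) as simple rules; the method names and the overall structure are illustrative, not WRAP's actual implementation.

def feasible_methods(answers: dict) -> set:
    """Return quantification methods not excluded by the given answers."""
    methods = {"records", "direct weighing", "counting",
               "assessing volume", "waste composition analysis"}
    # Q9/Q10: no usable records, or no access to them, excludes records.
    if not (answers.get("Q9") and answers.get("Q10")):
        methods.discard("records")
    # Q3: no direct access to the FLW being quantified excludes all
    # methods that require physical access to the waste.
    if not answers.get("Q3"):
        methods -= {"direct weighing", "counting", "assessing volume",
                    "waste composition analysis"}
    return methods

print(feasible_methods({"Q3": False, "Q9": True, "Q10": True}))
# -> {'records'}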
In this paper, we build upon this work by proposing a unified structure through which to describe various practices of FLW measurement and reduction. This is a step towards a systematic representation of these data that can enable further analysis of the practices thus described, as well as their comparison and validation.

2 Methods
We first outline the structure of the desired shortened descriptions, report on the process of using large language models (LLMs) to automatically extract them, and finally evaluate the results by comparing them to human annotations.

2.1 Structure of Extracted Information
Based on the previously mentioned FLW Quantification method ranking tool and domain-expert knowledge, we determined the following characteristics of FLW measurement methods and practices to be of the most importance:

(1) FLW method. FLW measurement and reduction practices might describe very specific technologies and techniques. To make the information more general, we decided to classify each as one of ten categories of quantification methods. These are described in detail in the Supplement [8] to the FLW Accounting and Reporting Standard [7].

(2) Region of interest. European Union (EU) member countries have diverse legislation that is of relevance to FLW measurement (see [13] for a review). Some have legislative actions that are legally binding, such as laws and regulations, and as such prescribe methods of monitoring and FLW measurement as well as the ways of reporting the data. On the other hand, some countries only approach the topic through non-binding legislative actions, such as agency orders and policy papers. As such, not every method might be appropriate for every country or region.

(3) Food supply chain (FSC) stage. Food loss and waste can occur at any stage of the food supply chain, starting from farmers and other producers, through food manufacturers and processors, distributors and shippers, grocery stores and restaurants, all the way to the customers and consumers. Some methods are more appropriate for certain stages in this chain. For example, a household might keep a diary of their FLW, while sellers such as grocery stores generally manage their stock more systematically and precisely.

(4) Accuracy. FLW measurement methods also need to be considered from the point of desired accuracy. The highest accuracy can be achieved by directly weighing the waste or separating it into components (waste composition analysis), while diaries or volume assessment produce data of medium accuracy. At the lowest end, proxy data can be used to assess FLW, for example by using data from one region to extrapolate findings to another, keeping in mind that such data will not be very accurate.

(5) Food category. Depending on the type of food and how it is packed, we might only be able to use some FLW measurement methods, but not others. For example, when dealing with packed food items, wasted products can simply be counted and their weight inferred. Meanwhile, when waste occurs with liquid food, such as milk, volume assessment can be fairly accurate for estimating the weight of FLW.

(6) Direct access to FLW. Some food waste cannot be measured directly, such as by weighing, counting, or waste composition analysis. For example, when waste is discarded directly into the drain in the process of food processing, it might be mixed with other waste water exiting the processing plant. In cases like this, non-direct methods need to be employed, such as modelling or mass balance.

To be able to suggest specific FLW practices according to the criteria described above, we need to first describe them in terms of these characteristics. For harmonious representation, we used the already mentioned NUTS and NACE classifications for the region of interest and FSC stage, respectively. We also used a simplified version of FAO's Global individual food consumption (GIFT; [6]) classification to describe the food category. For accuracy, we opted for three categorical levels of "low", "medium", and "high", while direct access to FLW can be represented with a simple Boolean.
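The resulting record structure can be illustrated with a small sketch; the field names and example values below are hypothetical, chosen only to show how the six characteristics and their classifications (multi-valued NUTS, NACE, and food-category codes, a three-level accuracy, and a Boolean) fit together.

from dataclasses import dataclass

@dataclass
class FLWPractice:
    flw_method: list     # one or more of the ten quantification method classes
    region: list         # NUTS codes, e.g. ["SI0"]
    fsc_stage: list      # NACE codes for the supply-chain stages covered
    accuracy: str        # "low" | "medium" | "high"
    food_category: list  # simplified FAO GIFT categories
    direct_access: bool  # is physical access to the waste required?

example = FLWPractice(
    flw_method=["direct weighing"],
    region=["SI0"],
    fsc_stage=["C10"],   # illustrative NACE code (food manufacturing)
    accuracy="high",
    food_category=["Dairy & Eggs"],
    direct_access=True,
)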
2.2 Extraction of Data
To test the extraction of data, we used 11 FLW measurement and reduction practice descriptions. These included 3 descriptions of practices developed and piloted in the WASTELESS project, as well as 8 practices developed in other European projects [16].

To extract data from FLW practice descriptions, we used two LLMs: ChatGPT 5 Auto [12] and Le Chat [11]. The prompt consisted of the following:

(1) Introduction: a general summary of the whole extraction process;
(2) Main instructions:
(a) Information to be extracted: a list of questions, the answers to which represent the data that are to be extracted from the practice description;
(b) Data types and values: a list of possible values and their types for each of the data fields, including lists of NUTS and NACE codes and food categories;
(c) Missing information: instructions on how to deal with missing, incomplete, or unclear data;
(d) Format: a description of the format of the expected output (.csv data);
(3) Example:
(a) Input: a short, synthetic description of an FLW practice;
(b) Reasoning: values for all data fields and their relationship to the original text, indicating missing values;
(c) Output: the expected line of data output.

We included all reference classifications as .csv files, as well as the Guidance on FLW Quantification Methods [8] as a PDF. Following this initial prompt, practice descriptions were uploaded one by one and the output saved. The lead author of this paper also extracted the same information from the descriptions manually.
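Schematically, such a prompt could be assembled as below; every string is a placeholder standing in for the actual wording, question list, and attached files, which are not reproduced in this paper.

prompt_parts = [
    # (1) Introduction
    "You will extract structured data about FLW measurement practices...",
    # (2a) Information to be extracted
    "Q1: Which FLW quantification method does the practice use? ...",
    # (2b) Data types and values
    "Allowed values are listed in the attached CSV files "
    "(NUTS codes, NACE codes, food categories).",
    # (2c) Missing information
    "If a value cannot be determined from the description, leave the "
    "field empty; do not guess.",
    # (2d) Format
    "Answer with a single CSV line: "
    "method,region,fsc_stage,accuracy,food_category,direct_access",
    # (3) Example: synthetic input, reasoning, and expected output line
    "Example description: ... Reasoning: ... Output: ...",
]
prompt = "\n\n".join(prompt_parts)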
2.3 Evaluation of Results
To evaluate the extraction of data by LLMs, we compared the output of these models to human annotations. Here, the cases of multiple possible values and of missing data need to be considered. First, some characteristics can objectively contain several values. For example, an FLW measurement practice might be applicable to several FSC stages and more than one food category. Secondly, some data cannot be determined from the description of a practice.

For a characteristic with more than one possible value, consider two subsets of the set of all possible values (U): human annotations (H) and machine-extracted values (M). The following list gives the scores that were used in the evaluation for all possible relationships of these two sets:

+2 when the subsets were equal, H = M;
+1 when an LLM extracted more values than the human, but including those, ∅ ≠ H ⊂ M ≠ U;
0 when the sets overlapped, but neither contained the other, that is, there was a partial match in values, H ∩ M ≠ ∅, H ⊈ M, M ⊈ H;
0 when there were data available, but the LLM extracted no information or returned all possible values, ∅ ≠ H ⊂ U, but M = ∅ or M = U;
−1 when an LLM failed to extract all values that the human did, U ⊇ H ⊃ M ≠ ∅;
−2 when the subsets had no values in common, i.e., were disjoint, H ∩ M = ∅.

Note that for simple true-or-false values, this list simplifies to the extreme cases; these were thus scored as +2 and −2, respectively.

The reasoning behind the scoring is that we prefer to describe a practice in broader terms, even if some extracted values are inapplicable, rather than miss a particular value. As an example, it is better to describe a practice as suitable for all food categories than to miss the one that it is actually suitable for. Similarly, when no information is extracted, we can conservatively assume all values apply. In such a case, an LLM failed objectively, but it is not punished for it. In the worst-case scenario, an LLM "extracted" or hallucinated some values that have nothing in common with the human annotations; for this, two points are deducted.
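These rules translate directly into a small scoring function; the sketch below is a straightforward implementation of the list above, with an illustrative FSC-stage example.

def agreement_score(H: set, M: set, U: set) -> int:
    if H == M:
        return 2    # exact agreement (also covers true/false fields)
    if H and H < M and M != U:
        return 1    # LLM extracted a strict superset of the human values
    if not M or M == U:
        return 0    # nothing extracted, or all possible values returned
    if H & M and not (H <= M) and not (M <= H):
        return 0    # partial overlap, neither set contains the other
    if M < H:
        return -1   # LLM missed some human-annotated values
    return -2       # disjoint sets: nothing in common

# Example: the human marked two FSC stages; the model returned those two
# plus one more, while U still contains further values -> +1.
U = {"production", "processing", "retail", "household"}
print(agreement_score({"retail", "household"},
                      {"retail", "household", "processing"}, U))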
3 Results
To evaluate the extraction of data by the LLMs, we scored their answers as described in Section 2.3. We summarise these scores for each practice characteristic in Table 1, showing the sum of the scores and the number of perfect scores, that is, the number of times the LLM completely agreed with the human rater. The number of practices tested was 11, which is therefore the maximum number of perfect scores, while the maximum sum is 22.

Table 1: Agreement scores for each characteristic of an FLW practice between a human rater and two LLMs. The sum of scores and the number of perfect extractions are shown.

Characteristic  | ChatGPT Sum | ChatGPT Perfect | Le Chat Sum | Le Chat Perfect
FLW method      | 13          | 8               | 3           | 5
Region          | 12          | 7               | 13          | 5
FSC stage       | 8           | 7               | 12          | 6
Accuracy        | −2          | 4               | −5          | 3
Food category   | 22          | 11              | 21          | 10
Direct access   | 6           | 7               | 14          | 9
Total           | 59          | 44              | 58          | 38

Both models achieved similar total scores across all practice characteristics. ChatGPT did, however, perfectly agree with the human rating more often. Of all the characteristics, food category was the easiest for the LLMs to extract. This is a simple classification, and usually the type of food is mentioned explicitly. The FLW quantification method was inferred with moderate success. On the other hand, the accuracy of the methods was extracted very poorly.

4 Discussion
In this work, we have shown how, using two LLMs, data from unstructured FLW measurement and reduction practice descriptions can be extracted into structured data. We achieved satisfying if imperfect results. The most important data point, the class of the FLW measurement method, was extracted with moderate success. It needs to be pointed out that the extracted information was not wildly inaccurate in most cases, despite what the scores might suggest. For example, a method of tracking waste on a blockchain was classified as using records, where in fact the data were collected with surveys before being, indeed, recorded. Similarly, one practice described weighing waste as it was collected in the wastebasket, while simultaneously taking photos of the material. Here, the true measurement method was direct weighing, but the LLMs classified it as waste composition analysis. By using photos, such an analysis could in theory be done, but it was not in this case. Thus, to improve the relevance of the FLW measurement method, we might instead group the methods by some other characteristics. For example, we could drop the data field of direct access and instead consider groups of methods separated in terms of needing direct access to waste.

Food category, however, was very reliably extracted. This indicates that in further processing of the extracted data, we could make the best use of the food type. The accuracy of the described method was not extracted well, but this is most likely due to the subjectivity of this characteristic. The authors of the FLW practice descriptions never explicitly addressed the question of accuracy, so it needed to be estimated roughly from other characteristics, such as the general accuracy of the FLW method class. This also suggests that a three-level accuracy scale is probably too fine-grained, and accuracy should be described only as "low" or "high".

We should note that our evaluation only compares the performance of the LLMs to manual extraction of data performed by a single person. It is expected that people would also differ in their extractions, i.e., would not achieve perfect inter-rater agreement. Thus, the evaluation should not be interpreted as how well the LLMs captured the "objective" truth.

With this process, LLMs enabled us to transform the descriptions from simple PDF files into structured CSV files in a semi-automatic way. In terms of the five-star rating of open data [9], which describes how to get from data in proprietary formats to linked open data, we thus increased their level from one star to three stars. We can extend this further and increase the rating of these data to five stars: publish truly linked data.

The first step that can follow directly from the results of this work is to transform the structure described in Section 2.1 into an ontology. We illustrate this idea in Listing 1, which encodes the characteristics as classes and shows how to connect these to an individual practice using object and datatype properties.
Listing 1: A snippet of the ontology in Turtle language [1].

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix : <#> .  # placeholder; the ontology's own IRI is not shown here

<> rdf:type owl:Ontology .

##################### Classes #####################

:FoodLossWasteMeasurementPractice rdf:type owl:Class ;
    rdfs:label "Food Loss and Waste Measurement Practice"@en .

:Region rdf:type owl:Class ;
    rdfs:label "A NUTS code of the region" ;
    owl:equivalentClass dbpedia:Nomenclature_of_Territorial_Units_for_Statistics .

:FoodCategory rdf:type owl:Class ;
    rdfs:label "Food Category" .

:DairyAndEggs rdf:type owl:Class ;
    rdfs:subClassOf :FoodCategory ;
    rdfs:label "Dairy & Eggs" .

:Milk rdf:type owl:Class ;
    rdfs:subClassOf :DairyAndEggs ;
    rdfs:label "Milk" .

# ... more classes defined ...

################ Object Properties ################

:hasTitle rdf:type owl:DatatypeProperty ;
    rdfs:domain :FoodLossWasteMeasurementPractice ;
    rdfs:range rdfs:Literal ;
    rdfs:label "with the title" .

:hasRegion rdf:type owl:ObjectProperty ;
    rdfs:domain :FoodLossWasteMeasurementPractice ;
    rdfs:range :Region ;
    rdfs:label "applied in regions" .

:hasFoodCategory rdf:type owl:ObjectProperty ;
    rdfs:domain :FoodLossWasteMeasurementPractice ;
    rdfs:range :FoodCategory ;
    rdfs:label "applicable to food categories" .

:hasAccuracy rdf:type owl:DatatypeProperty ;
    rdfs:domain :FoodLossWasteMeasurementPractice ;
    rdfs:range "low"^^xsd:string, "medium"^^xsd:string, "high"^^xsd:string .

Once we represent the structure like this, we can encode a specific instance of an FLW measurement practice as:

:MyDairyWastePractice a :FoodLossWasteMeasurementPractice ;
    :hasTitle "Tracking Waste of Dairy in Slovenia" ;
    :hasAccuracy "high"^^xsd:string ;
    :hasFoodCategory :WholeMilk ;
    :hasRegion :SI0 .
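As a usage sketch, once practices are published in this form they can be retrieved programmatically, for example with rdflib; the file name and namespace IRI below are placeholders that would have to match the published ontology.

from rdflib import Graph

g = Graph()
g.parse("flw_practices.ttl", format="turtle")  # hypothetical file

# Find titles of practices applicable to (a subcategory of) Dairy & Eggs.
query = """
PREFIX : <http://example.org/flw#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?title WHERE {
    ?practice a :FoodLossWasteMeasurementPractice ;
              :hasTitle ?title ;
              :hasFoodCategory ?category .
    ?category rdfs:subClassOf* :DairyAndEggs .
}
"""
for row in g.query(query):
    print(row.title)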
The data on FLW measurement practices can then be easily linked to other published data; the closest candidate ontology is the Food Waste Ontology by Stojanov et al. [15]. The dataset described by this ontology is already vast and is being extended through FoodWasteEXplorer [14]. By leveraging it, we plan to publish the practice descriptions as five-star data in future work.

Acknowledgments
This work was carried out as a part of the WASTELESS project, which is funded by the European Union's Horizon Europe Research and Innovation programme under Grant Agreement No. 101084222.

References
[1] David Beckett, Tim Berners-Lee, Eric Prud'hommeaux, and Gavin Carothers. 2014. RDF 1.1 Turtle. Terse RDF Triple Language. World Wide Web Consortium (W3C), (Feb. 25, 2014). Retrieved Aug. 29, 2025 from https://www.w3.org/TR/turtle/.
[2] European Parliament and Council of the European Union. 2003. Regulation (EC) No 1059/2003 of the European Parliament and of the Council. On the establishment of a common classification of territorial units for statistics (NUTS). Official Journal of the European Union, 154, 1, (June 21, 2003), 1–41. http://data.europa.eu/eli/reg/2003/1059/oj.
[3] European Parliament and Council of the European Union. 2008. Regulation (EC) No 1333/2008 of the European Parliament and of the Council. On food additives. Version 02008R1333-20240423. Official Journal of the European Union, 354, 16, (Dec. 16, 2008), 16–33. http://data.europa.eu/eli/reg/2008/1333/oj.
[4] European Parliament and Council of the European Union. 2006. Regulation (EC) No 1893/2006 of the European Parliament and of the Council. Establishing the statistical classification of economic activities NACE Revision 2 and amending Council Regulation (EEC) No 3037/90 as well as certain EC regulations on specific statistical domains. Official Journal of the European Union, 393, (Dec. 20, 2006), 1–39. http://data.europa.eu/eli/reg/2006/1893/oj.
[5] Food and Agriculture Organization of the United Nations and World Health Organization. 2019. General standard for food additives. Codex STAN 192-1995. (2019).
[6] Food and Agriculture Organization of the United Nations (FAO). 2022. FAO/WHO GIFT. Global Individual Food Consumption Data Tool. Retrieved Aug. 30, 2025 from https://www.fao.org/gift-individual-food-consumption/about/en.
[7] Craig Hanson et al. 2016. Food Loss and Waste Accounting and Reporting Standard. Version 1.0. World Resources Institute. ISBN: 978-1-56973-892-4.
[8] Craig Hanson et al. 2016. Guidance on FLW Quantification Methods. Supplement to the Food Loss and Waste (FLW) Accounting and Reporting Standard. World Resources Institute. ISBN: 978-1-56973-893-1.
[9] Tim Berners-Lee. 2006. Linked data. Design issues. Version 2009-06-18. (July 27, 2006). https://www.w3.org/DesignIssues/LinkedData.html.
[10] Mitja Luštrek and Junoš Lukan. 2024. Practice Abstracts – batch 1 – early phase. Deliverable 6.2. Research rep. Jožef Stefan Institute. doi: 10.5281/ZENODO.13503261.
[11] [SW] Mistral AI, Le Chat version November 2024, 2024. URL: https://chat.mistral.ai/.
[12] [SW] OpenAI, ChatGPT version GPT-5, 2025. URL: https://chatgpt.com/.
[13] Zhuang Qian, Wu Chen, and Giorgia Sabbatini. 2023. White book for FLW reduction, measurement, and monitoring practices. Deliverable 1.1. Research rep. Version 1.0. University of Southern Denmark, (Aug. 30, 2023). 116 pp. doi: 10.5281/ZENODO.11065358.
[14] REFRESH. FoodWasteEXplorer. https://www.foodwasteexplorer.eu/about.
[15] Riste Stojanov, Tome Eftimov, Hannah Pinchen, Maria Traka, Paul Finglas, Drago Torkar, and Barbara Korousic Seljak. 2019. Food waste ontology. A formal description of knowledge from the domain of food waste. In 2019 IEEE International Conference on Big Data (Big Data). IEEE, (Dec. 2019). doi: 10.1109/bigdata47090.2019.9006254.
[16] Sustainable Food System Innovation Platform. Practice abstract inventory. Retrieved Sept. 1, 2025 from https://www.smartchain-platform.eu/en/practice-abstract-inventory.
Eye-Tracking Explains Cognitive Test Performance in Schizophrenia

Mila Marinković (mila.marinkovic@fri.uni-lj.si), Jure Žabkar (jure.zabkar@fri.uni-lj.si)
University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia

Abstract
Schizophrenia is associated with cognitive impairments that are difficult to assess with traditional neuropsychological tests, which are often lengthy and burdensome. Eye-tracking (ET) provides objective, minimally invasive measures of visual attention and cognitive processing and may complement shorter assessments. This study investigated whether ET features recorded during three computerized tasks could distinguish patients with schizophrenia from healthy controls. Using the Explainable Boosting Machine (EBM), we achieved an accuracy of 0.86 and balanced sensitivity and specificity, with an area under the curve exceeding 0.9. Features related to fixation patterns, saccadic dynamics, and temporal engagement emerged as the most informative. These findings indicate that ET features collected during brief cognitive tasks can provide clinically relevant markers of schizophrenia. Incorporating ET into short test batteries may reduce patient burden while enhancing diagnostic value, supporting the development of scalable and practical screening tools.

Keywords
schizophrenia, eye-tracking, cognitive tasks, machine learning

1 Introduction
Schizophrenia is a severe and chronic neuropsychiatric disorder that affects about 1% of the population worldwide and is characterized by disturbances in thought, perception, and behavior [1]. In addition to positive and negative symptoms, patients experience pronounced cognitive impairments, including deficits in attention, working memory, and executive functioning, which substantially affect everyday life outcomes [2, 3]. Cognitive assessment is therefore central for both diagnosis and monitoring of schizophrenia. However, traditional neuropsychological testing is lengthy, cognitively demanding, and often exhausting for patients, limiting its feasibility in clinical practice. Shorter test batteries reduce the burden but often fail to provide sufficiently informative data for reliable diagnosis.

Eye-tracking (ET) offers a promising avenue for addressing this challenge. ET provides objective, real-time measures of visual attention, oculomotor control, and information-processing strategies [4]. Numerous studies have shown that patients with schizophrenia exhibit abnormalities in smooth pursuit eye movements, antisaccades, and fixation stability [5, 6, 7]. These alterations are considered potential endophenotypes of the disorder, as they are also observed in first-degree relatives who do not have schizophrenia [6, 7]. More recent work has extended ET beyond basic oculomotor paradigms by embedding it in cognitive tasks. For example, Okazaki et al. [8] combined ET metrics with digit-symbol substitution tests and showed improved discrimination between patients and controls. Yang et al. [9] reported that abnormal gaze patterns during reading tasks—such as longer fixation durations and increased saccade counts—enabled high diagnostic accuracy when analyzed with machine learning models. Similarly, Morita et al. [10] demonstrated the feasibility of portable tablet-based ET combined with cognitive assessments for schizophrenia screening. Collectively, these studies highlight that combining ET with cognitive testing enriches diagnostic value and provides insights into the cognitive mechanisms underlying gaze abnormalities.

Building on this prior work, the present study investigates whether ET features recorded during a small set of computerized cognitive tasks can serve as reliable markers of schizophrenia. Participants completed three tasks (digit span, picture naming, and n-back), each divided into phases of instruction reading, video demonstration, and test execution. From these tasks, we extracted 117 ET features, including fixation measures, saccadic dynamics, gaze entropy, and recording duration. We then applied machine learning methods to evaluate the discriminative power of these features. By focusing on only three short tasks, our aim is to test whether ET provides sufficient additional information to overcome the limitations of brief cognitive testing, ultimately supporting the development of less burdensome but more informative screening approaches.
2 Methods
2.1 Participants
The study involves 126 individuals: 58 patients diagnosed with schizophrenia (SP) and 68 healthy controls (HC). All participants were adults, aged 18 years or older. Patients were recruited and tested at the University Psychiatric Hospital Ljubljana. The control group was matched to the patient group on age and gender.

Eligibility criteria required fluency in Slovenian and excluded individuals with intellectual disability, organic brain disorders, or a history of substance abuse. An additional exclusion criterion for the HC group was any past or current psychiatric disorder. At the time of assessment, all SP participants were receiving stable doses of antipsychotic medication.

Demographic characteristics of the two groups are presented in Table 1 and were analyzed to ensure that the groups were comparable in terms of age and gender. While educational attainment differed between groups, further analyses confirmed that within each education level there were no significant differences between SP and HC participants, indicating that education was unlikely to confound the comparisons.

Table 1: Demographic characteristics of participants.

Measure                      | SP             | HC
Total participants           | 58             | 68
Male sex                     | 29             | 35
Female sex                   | 29             | 33
Age (mean years)             | 46.1           | 46.7
Most common education level  | Primary school | High school
HC: Healthy Controls; SP: Patients with Schizophrenia

The study was approved by the Medical Ethics Committee of the Republic of Slovenia (approval number: 0120-51/2024-2711-4). All participants received a detailed explanation of the study procedures and provided written informed consent prior to participation.

2.2 Testing Procedure
Eye-tracking data were collected using a Tobii Pro Spectrum [11] eye tracker integrated into a 24-inch monitor with a resolution of 1920 × 1080 pixels. Recordings were made at 1200 Hz in the "human" tracking mode, with a stimulus presentation latency of approximately 10 ms. The display frame rate was 30 FPS. Participants sat ∼55 cm from the monitor, in an upright position, with seating adjusted for comfort and optimal tracking.

Before each task, participants were seated comfortably, and the Tobii Pro Lab [12] interface provided a live preview (see Fig. 1) to verify that both eyes were detected and that the viewing distance was within the recommended range (displayed as a green zone, typically around 55 cm). Once this was confirmed, a standard five-point calibration was performed, during which participants followed a moving dot across the screen. Calibration served both to align gaze tracking and to ensure that the participant had not moved their head between tasks. If the system indicated suboptimal accuracy, the calibration was repeated.

Figure 1: Calibration interface in Tobii Pro Lab. The preview window ensures both eyes are detected and the participant is seated at an appropriate distance (green zone, approximately 55 cm) before calibration and testing begin.

Participants completed three computerized cognitive tasks in a fixed order: digit span (DS), n-back (NB), and picture naming (PS). A short break was provided between tasks, with the duration determined by each participant. All tasks were presented within the Tobii Pro Lab application, which also stored the raw data. After recording, the data were exported and processed using a custom Python program for feature extraction and analysis.

Each task followed the same three-phase structure:
(1) Reading instructions. Written instructions were displayed on the screen. Participants could read them at their own pace and advanced to the next phase with a mouse click.
(2) Video example. A short instructional video was presented once, demonstrating the task procedure.
(3) Test execution. The participant began the task when ready. Task duration depended on individual performance.

The procedure was identical for all participants, ensuring standardization across groups. Only the test execution phase varied in length, as it was determined by each participant's performance. Group-level descriptive statistics of fixation durations for all tasks and phases are reported in the Results section (Table 3).

2.3 Feature Extraction
We extracted a total of 117 ET features from the three computerized cognitive tasks. As described in Section 2.2, each task was divided into three phases: instruction reading (BN), video demonstration (GN), and test execution (T). Each participant contributed a single data point to the ML analysis. For every task (DS, PS, NB) and every phase (BN, GN, T), we computed the 13 eye-tracking features listed in Table 2. Each feature was calculated over the entire duration of the given phase (e.g., the number of fixations refers to the total count during that phase, while the mean fixation duration refers to the average across all fixations in that phase). These were then concatenated across all tasks and phases, yielding 117 features per participant. Thus, the unit of analysis was the participant, not individual trials or task phases.

Table 2: Eye-tracking features extracted from each task and phase.

Feature                | Description
num_fixations          | Total number of fixations during the interval.
avg_fixation_duration  | Mean duration of fixations (ms), indicating fixation stability.
std_fixation_duration  | Standard deviation of fixation duration, reflecting variability in fixation times.
num_saccades           | Total number of saccadic eye movements.
avg_saccade_distance   | Mean distance of saccades, reflecting amplitude of eye shifts.
avg_saccade_velocity   | Mean velocity of saccades, indicating how quickly gaze shifts occurred.
avg_saccade_angle      | Average angular change of saccades, reflecting directional scanning patterns.
gaze_entropy           | Entropy of gaze distribution, quantifying dispersion vs. concentration of gaze.
recording_duration_ms  | Total duration of recording for the phase (ms).
unique_squares         | Number of unique spatial areas (AOIs) visited during the interval.
num_changes            | Number of transitions between distinct gaze areas.
missing_left_percent   | Percentage of missing data from the left eye.
missing_right_percent  | Percentage of missing data from the right eye.
Note: All features are computed as aggregates over the entire task phase for each participant.
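A few of the features in Table 2 can be illustrated with a short sketch. The inputs are simplified (a list of fixation durations and an array of gaze coordinates), and the grid of areas of interest used for unique_squares, num_changes, and gaze_entropy is an assumption, as the paper does not specify the AOI definition.

import numpy as np

def phase_features(fixation_durations_ms, gaze_xy,
                   grid=(8, 8), screen=(1920, 1080)):
    feats = {
        "num_fixations": len(fixation_durations_ms),
        "avg_fixation_duration": float(np.mean(fixation_durations_ms)),
        "std_fixation_duration": float(np.std(fixation_durations_ms)),
    }
    # Map gaze samples onto a coarse grid of areas of interest (AOIs).
    gx = np.clip(gaze_xy[:, 0] / screen[0] * grid[0], 0, grid[0] - 1).astype(int)
    gy = np.clip(gaze_xy[:, 1] / screen[1] * grid[1], 0, grid[1] - 1).astype(int)
    cells = gx * grid[1] + gy
    feats["unique_squares"] = int(len(np.unique(cells)))
    feats["num_changes"] = int(np.sum(np.diff(cells) != 0))
    # Gaze entropy: dispersion vs. concentration of gaze over the AOIs.
    p = np.bincount(cells, minlength=grid[0] * grid[1]) / len(cells)
    p = p[p > 0]
    feats["gaze_entropy"] = float(-(p * np.log2(p)).sum())
    return feats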
2.4 Data Analysis
We trained and evaluated several machine learning models using these features. We applied stratified 10-fold cross-validation at the subject level to ensure that all features from a given participant were assigned exclusively to either the training or the test set, thereby preventing data leakage across folds. In each iteration, the model was trained on nine folds and tested on the remaining one. Performance was assessed using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC). The final results were reported as the average across all folds.

We evaluated a diverse set of ML models (logistic regression, random forest, gradient boosting, extreme gradient boosting, and the Explainable Boosting Machine) to cover both linear and non-linear approaches with varying levels of interpretability. The EBM was selected as the primary model because it consistently achieved the highest overall performance while providing inherently interpretable feature importances, which is particularly valuable in clinical contexts. We did not pursue deep neural networks in this study, as the dataset size (126 participants) is relatively small and does not provide sufficient power to train high-capacity models without overfitting.
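The evaluation protocol translates into a compact sketch using the EBM implementation from the interpret package; the feature matrix and labels below are synthetic placeholders with the shapes described above (one 117-feature row per participant). Because each participant contributes a single row, an ordinary stratified split is already subject-level.

import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(126, 117))   # placeholder feature matrix
y = rng.integers(0, 2, size=126)  # placeholder labels: 0 = HC, 1 = SP

scoring = {
    "accuracy": "accuracy",
    "sensitivity": "recall",                                 # recall on SP
    "specificity": make_scorer(recall_score, pos_label=0),   # recall on HC
    "auc": "roc_auc",
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(ExplainableBoostingClassifier(random_state=0),
                        X, y, cv=cv, scoring=scoring)
for name in scoring:
    print(name, scores[f"test_{name}"].mean())
# After fitting on all data, ebm.explain_global() exposes the per-feature
# importances behind a ranking like the one in Fig. 3.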
3 Results
To characterize task engagement and potential variability between groups, we compared fixation durations across all tasks and phases (Table 3). SP showed longer fixations than HC, especially during the instruction reading and video phases, with smaller but consistent differences during execution. This indicates altered attention even outside active task solving.

Table 3: Mean fixation duration in ms per task and phase.

Task     | Phase                | HC (Mean ± SD)  | SP (Mean ± SD)
Numbers  | Reading instructions | 239.64 ± 47.79  | 283.97 ± 45.33
Numbers  | Watching video       | 352.14 ± 81.56  | 400.10 ± 89.51
Numbers  | Test execution       | 390.66 ± 83.92  | 407.60 ± 98.53
Pictures | Reading instructions | 228.44 ± 52.49  | 267.78 ± 60.79
Pictures | Watching video       | 302.40 ± 69.06  | 368.93 ± 81.42
Pictures | Test execution       | 301.97 ± 49.91  | 319.36 ± 58.07
Square   | Reading instructions | 229.36 ± 45.41  | 286.70 ± 63.42
Square   | Watching video       | 309.41 ± 89.45  | 352.08 ± 79.37
Square   | Test execution       | 394.91 ± 115.50 | 406.24 ± 99.36
SD: Standard deviation; HC: Healthy controls; SP: Schizophrenia patients

The ML models were trained on the 117 extracted eye-tracking features and achieved strong performance in distinguishing SP from HC. The key cross-validation performance metrics are summarized in Table 4.

Table 4: Cross-validation performance metrics for different models. The Explainable Boosting Machine (EBM) achieved the best overall performance across all metrics.

Model | Accuracy | Sensitivity | Specificity | AUC
EBM   | 0.86     | 0.84        | 0.86        | 0.93
LR    | 0.85     | 0.77        | 0.91        | 0.92
GB    | 0.78     | 0.70        | 0.84        | 0.83
RF    | 0.83     | 0.84        | 0.82        | 0.91
xGB   | 0.81     | 0.77        | 0.85        | 0.90
EBM: Explainable Boosting Machine; LR: Logistic Regression; GB: Gradient Boosting; RF: Random Forest; xGB: Extreme Gradient Boosting

Among the tested models, the EBM achieved the highest overall performance and was therefore selected for detailed analysis. Figure 2 presents the receiver operating characteristic (ROC) curve, which confirms the model's strong discriminative ability.

Figure 2: ROC curve for the EBM model. The mean AUC across folds was 0.92, confirming strong classification performance.

We analyzed the feature importance scores provided by the EBM, focusing on the ten most informative features (Fig. 3). These features were predominantly derived from the test execution phases and included measures such as recording duration, number of fixations, mean fixation duration, and saccade counts.

Figure 3: Top 10 most important features identified by the EBM model. The prefixes indicate the task and phase: DS = digit span, PS = picture naming, NB = n-back; BN = reading instructions, GN = watching video, T = test solving. For example, PS_T_num_fixations refers to the number of fixations during the test phase of the picture naming task.
4 Discussion
The present study demonstrates that eye-tracking (ET) features obtained during brief computerized cognitive tasks can effectively discriminate between individuals with schizophrenia and healthy controls. Using 117 features, the Explainable Boosting Machine (EBM) achieved strong classification performance, with accuracy, sensitivity, and specificity values around 0.85 and an AUC of 0.92. These results provide further evidence that ET-based measures capture clinically relevant differences in cognitive processing and attentional control in schizophrenia.

Our findings are consistent with previous work showing that patients with schizophrenia exhibit abnormalities in fixation behavior, saccadic dynamics, and gaze distribution during both simple oculomotor paradigms and more complex cognitive tasks [5, 6, 7, 8, 9, 10]. Importantly, by embedding ET into a small set of standardized cognitive tasks, we demonstrate that group differences emerge not only during active problem solving but also in more passive phases, such as reading instructions or watching a video example. This suggests that ET provides valuable information across the continuum of cognitive engagement, extending beyond traditional task performance metrics.

While prior studies have applied machine learning to ET data in schizophrenia, they have typically relied on single paradigms or isolated task conditions. The novelty of the present work lies in combining a multi-task, multi-phase design with interpretable ML within a short, clinically feasible test battery. This approach captures a broader range of cognitive and attentional processes while linking model performance to specific, clinically meaningful features.

The interpretability analysis showed that temporal engagement, fixation stability, and saccadic activity best differentiated the groups. Longer recording durations may reflect slower processing, while altered fixations and saccades align with prior reports of impaired attentional control. These findings suggest that eye-tracking captures both temporal and oculomotor aspects of task performance, supporting its potential as a clinically meaningful biomarker.

From a clinical perspective, these results are encouraging. Traditional neuropsychological assessments are lengthy and cognitively demanding, which can be exhausting for patients and limit their applicability. Our study shows that by integrating ET measures into just three relatively brief cognitive tasks, it is possible to achieve a high level of diagnostic accuracy. This approach may therefore support the development of shorter, less burdensome, and more objective screening protocols that could complement existing clinical evaluations.

Limitations and Future Work
Several limitations should be noted. First, although our sample size of 126 participants is comparable to similar studies, larger and more diverse cohorts are needed to confirm the generalizability of the results. Second, all patients were on stable antipsychotic medication, which may have influenced oculomotor behavior. Third, while we employed subject-level cross-validation to prevent data leakage, robustness checks such as leave-one-subject-out or leave-one-task-out validation could further strengthen reliability. Fourth, our analysis focused on static ET features; dynamic sequence-based or deep learning models could capture additional temporal information in gaze patterns. Finally, we only tested three tasks; future research should explore whether expanding or tailoring the task battery improves performance while still keeping the protocol brief. Replication with independent cohorts will be essential to establish clinical utility.

Conclusion
In conclusion, this study provides strong evidence that eye-tracking features embedded within short cognitive tasks can serve as robust markers of schizophrenia. Machine learning models trained on these features achieved high discriminative accuracy, with interpretable patterns that align with known attentional and cognitive impairments in the disorder. By reducing patient burden while maintaining informativeness, this approach holds promise for the development of accessible, scalable, and clinically relevant screening tools for schizophrenia.

5 Acknowledgments
This research was funded by the Slovenian Research Agency (core funding No. P2-0209). Additional support for Mila Marinković was provided through the ARIS project AI4Science (GC-0001). We thank the University Psychiatric Hospital Ljubljana for participant recruitment, asist. dr. Polona Rus Prelog, dr. med., spec. psih., and Martina Zakšek for their support, and the broader research team, clinicians, and participants for their contributions. The authors declare that they have no conflict of interest.
References
[1] S. R. Marder and T. D. Cannon, "Schizophrenia," The New England Journal of Medicine, vol. 381, no. 18, pp. 1753–1761, 2019.
[2] W. Hinzen and J. Rosselló, "The linguistics of schizophrenia: thought disturbance as language pathology across positive symptoms," Frontiers in Psychology, vol. 6, 2015.
[3] L. Colle, R. Angeleri, M. Vallana, K. Sacco, B. Bara, and F. Bosco, "Understanding the communicative impairments in schizophrenia: A preliminary study," Journal of Communication Disorders, vol. 46, no. 3, pp. 294–308, 2013.
[4] A. Wolf, K. Ueda, and Y. Hirano, "Recent updates of eye movement abnormalities in patients with schizophrenia: A scoping review," Psychiatry and Clinical Neurosciences, vol. 75, pp. 104–118, 2021.
[5] P. S. Holzman, L. R. Proctor, and D. W. Hughes, "Eye-tracking patterns in schizophrenia," Science, vol. 181, no. 4095, pp. 179–181, 1973.
[6] L. Deborah, H. Philip, M. Steven, and M. Nancy, "Eye tracking and schizophrenia: A selective review," Schizophrenia Bulletin, vol. 20, no. 1, pp. 47–62, 1994.
[7] U. Ettinger, "Smooth pursuit and antisaccade eye movements as endophenotypes in schizophrenia spectrum research," PhD Thesis, Department of Psychology, Goldsmiths College, University of London, 2002.
[8] K. Okazaki, K. Miura, J. Matsumoto, N. Hasegawa, M. Fujimoto, H. Yamamori, Y. Yasuda, M. Makinodan, and R. Hashimoto, "Discrimination in the clinical diagnosis between patients with schizophrenia and healthy controls using eye movement and cognitive functions," Psychiatry and Clinical Neurosciences, vol. 77, pp. 393–400, 2023.
[9] H. Yang, L. He, W. Li, Q. Zheng, Y. Li, X. Zheng, and J. Zhang, "An automatic detection method for schizophrenia based on abnormal eye movements in reading tasks," Expert Systems With Applications, vol. 238, p. 121850, 2024.
[10] K. Morita, K. Miura, A. Toyomaki, M. Makinodan, K. Ohi, N. Hashimoto, Y. Yasuda, T. Mitsudo, F. Higuchi, S. Numata, A. Yamada, Y. Aoki, H. Honda, R. Mizui, M. Honda, D. Fujikane, J. Matsumoto, N. Hasegawa, S. Ito, H. Akiyama, T. Onitsuka, Y. Satomura, K. Kasai, and R. Hashimoto, "Tablet-based cognitive and eye movement measures as accessible tools for schizophrenia assessment: Multisite usability study," JMIR Mental Health, vol. 11, p. e56668, 2024.
[11] M. Nyström, D. Niehorster, R. Andersson, and I. Hooge, "The Tobii Pro Spectrum: A useful tool for studying microsaccades?" Behavior Research Methods, vol. 53, 2020.
[12] Tobii AB, "Tobii Pro Lab," Computer software, Danderyd, Stockholm, 2024. [Online]. Available: http://www.tobii.com/
Data-Driven Evaluation of Truck Driving Performance with Statistical and Machine Learning Methods

Vid Nemec (vidotti.nemec@gmail.com), Gašper Slapničar (gasper.slapnicar@ijs.si), Mitja Luštrek (mitja.lustrek@ijs.si)
Jožef Stefan Institute, Ljubljana, Slovenia

Figure 1: Truck driving simulator developed by AAER Research d.o.o.

Abstract
This paper investigates which driving features (e.g., speed, acceleration, braking) most strongly affect driving efficiency in a truck simulator environment. The work systematically compares statistical methods (thresholding based on percentiles, IQRs, and expert rules) with machine learning methods (clustering using K-means) for driver assessment. In addition to practical machine learning experimentation, the analysis incorporates expert knowledge and insights from recent research. This approach evaluates the agreement and differences between the two approaches and aims to interpret them.

Keywords
Driving simulation, fuel efficiency, percentiles, K-means, SHAP, statistical thresholds, machine learning, clustering

1 Introduction
Reducing fuel consumption in road transport is a critical goal for sustainability and cost-efficiency [1]. Prior research, such as [2, 3], highlights the impact of driver behaviour, particularly acceleration, braking, and speed profiles, on overall fuel efficiency. Yet how to most effectively quantify and compare drivers remains an open question [4]. This paper addresses which driving features most strongly influence efficiency in a simulated truck driving environment, comparing classical statistical thresholding, based on expert knowledge, with clustering-based machine learning. Applying known methods, we test whether unsupervised ML can identify driver features with a stronger influence on fuel consumption than fixed-threshold rules, providing a data-driven baseline for future model-based feedback.

In addition, we compare the empirical outcomes of our ML analytics with insights from recent literature and the practical judgement of a driving expert, to pinpoint where domain knowledge aligns or conflicts with the models. This dual perspective enables a richer interpretation of driver assessment tools and informs the design of future vehicle feedback and incentive systems.

2 Related Work
Recent studies have evaluated driver behaviour for fuel efficiency using both statistical rules and machine-learning approaches. Sullivan et al. present a TORCS-based simulator with a realistic fuel-economy model, enabling safe, repeatable analysis of eco-driving strategies [5]. Maisonneuve characterises driver energy efficiency across driving events and proposes a grading/ranking method based on identified parameters [6]. Zhao et al. develop a simulator-based eco-driving support system with real-time feedback and post-drive reports, demonstrating measurable reductions in fuel use and emissions [7]. Ma et al. provide a scoping review of energy-efficient driving behaviours and applied AI methods [8]. Prototype driver-training systems have been proposed [9], and large-scale, data-driven frameworks to incentivise efficient driving have been developed [3, 10].
Most studies agree that key features include speed, throttle, brake usage, and sometimes gear selection, but they differ on methods for quantifying and weighting these features. Machine-learning clustering (e.g., K-means) and feature importance analysis (e.g., SHAP) are increasingly used, offering potential improvements in the objectivity and interpretability of driver assessment.

3 Methods
3.1 Data Collection and Preprocessing
Driving data were collected from a high-fidelity truck simulator developed by AAER Research d.o.o., which continuously recorded multiple parameters including pedal positions, steering wheel angle, vehicle speed, location, and segment identifiers. To ensure data quality, missing or zeroed pedal values were imputed. The signals were then resampled into 1-second windows, where for each parameter we computed the minimum, maximum, mean, and median values. This aggregation approach was chosen over raw resampling because the signals are irregular, zero-inflated, and not normally distributed, making window-based statistics more representative of driver behavior. In addition, the last observed cumulative distance within each window was retained to preserve distance continuity. Finally, the processed signals were aligned with the boundaries of the scenario segments, allowing a consistent basis for later efficiency evaluation.
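The windowing step can be made concrete with pandas. This is a minimal sketch, assuming the simulator log is a DataFrame with a DatetimeIndex; column names such as gas_pedal and cum_distance are illustrative, not the simulator's actual identifiers.

```python
import pandas as pd

def aggregate_windows(df: pd.DataFrame) -> pd.DataFrame:
    """Resample irregular simulator signals into 1-second windows.

    Keeps min/max/mean/median per signal, which is more representative
    for zero-inflated, non-normal signals than raw resampling, and
    carries the last observed cumulative distance forward.
    """
    signals = ["gas_pedal", "brake_pedal", "steering_angle", "speed"]
    agg = df[signals].resample("1s").agg(["min", "max", "mean", "median"])
    # Flatten the (signal, statistic) MultiIndex into flat column names.
    agg.columns = [f"{sig}_{stat}" for sig, stat in agg.columns]
    agg["distance"] = df["cum_distance"].resample("1s").last()
    return agg
```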
3.2 Rule-based Aggregation of Segment Labels
We aggregated per-segment labels (PASS / WARN / FAIL) into an overall per-driver rating using a linear severity score. A FAIL indicates a strong threshold exceedance and is therefore weighted twice a WARN, yielding a simple, interpretable metric that tolerates occasional minor deviations:

S = 2 · #FAIL + #WARN,
Rating(S) = Good if S ≤ 2; Warning if 3 ≤ S ≤ 5; Bad if S ≥ 6.

This 2:1 weighting reflects relative severity (a FAIL is a clearer breach of the threshold than a WARN) and preserves stability: small label fluctuations do not flip a driver from Good to Bad. The middle band (Warning) collects borderline cases for review. This procedure enabled transparent, segment-level benchmarking of driver performance.

Table 1: Per-driver severity summary (S = 2 · #FAIL + #WARN).

Driver  #WARN  #FAIL  S   Rating
1       4      1      6   Bad
10      5      1      7   Bad
2       7      2      11  Bad
3       4      0      4   Warning
4       4      0      4   Warning
5       6      2      10  Bad
6       3      0      3   Warning
7       3      0      3   Warning
8       4      0      4   Warning

3.3 Machine Learning
3.3.1 K-means clustering. Unsupervised K-means clustering (k = 3) was applied per segment on standardized aggregated characteristics (acceleration/braking variability, coasting, use of cruise control, speed-related measures). Clusters were assigned semantic labels Good / Warning / Bad post hoc by ordering clusters by their mean fuel rate (fuel_mean): lowest → Good, middle → Warning, highest → Bad. We then examined cluster centroids (mean feature profiles) and visualised the result as per-segment heatmaps.
3.3.2 SHAP with LightGBM model. As an orthogonal check of feature relevance, we applied SHAP to a separate LightGBM model predicting fuel rate; this diagnostic analysis is independent of clustering and highlights variables linked to higher consumption (Table 2).

4 Results
4.1 Statistical Thresholding Approach
Based on the analysis of related work outlined in Section 2, we decided to benchmark driver efficiency based on selected driving features. We investigated two methods covering the complementary metrics of acceleration and braking, namely:
• Percentile-based thresholds for the gas pedal
• IQR method for the brake pedal
Percentiles were chosen for the gas pedal because the signal is highly zero-inflated and not normally distributed, making distribution-aware thresholds more suitable. Braking behavior is irregular and heavy-tailed, where the IQR offers a robust way to capture abnormal events. In essence, the IQR rule sets a dispersion-anchored cut-off above Q3, robust to heavy tails, whereas percentile thresholds fix the share of events flagged. Thresholds were determined by examining histograms of pedal deltas (Figure 2), ensuring that cutoffs meaningfully separated typical from extreme behavior.

Figure 2: Histograms for both pedals

Threshold characterisation:
• Gas Pedal: We applied percentile-based thresholds (65th for WARN, 83rd for FAIL) to the gas pedal delta (change in 0.1 second). This approach better captures outlier acceleration behavior while avoiding over-penalizing normal operation. We removed windows where cruise control was active for more than 30% of the time, to reduce automation bias in pedal measurements. This cut-off was chosen to balance isolating manual control with keeping enough observations.
• Brake Pedal: We applied an interquartile-range rule computed from the empirical distribution in each segment: with the third quartile Q3 and the interquartile range IQR = Q3 − Q1, we set WARN at Q3 + 0.5 · IQR and FAIL at Q3 + 1.5 · IQR. It flags both frequent moderate excesses (WARN) and rare but severe braking events (FAIL) without over-penalising normal behaviour.

Certain segments in the driving scenario required strong braking due to test design (e.g., safety-critical stops). These were labelled as SAFETY and excluded from efficiency scoring, as they reflect controlled conditions rather than natural driving quality. The resulting classifications are summarised as heatmaps (Figures 3 and 4), where rows correspond to drivers and columns to scenario segments. Cells are coloured green (PASS), orange (WARN), and red (FAIL), providing an intuitive visual overview of performance variability. PASS/WARN/FAIL are segment-level, per-driver labels that state whether the segment was driven efficiently in terms of fuel use: PASS = efficient, WARN = borderline, FAIL = inefficient. These labels refer only to fuel consumption, not safety or travel time. Blank (white) cells indicate cases without an assigned label: either segments excluded by SAFETY scoring or driver-segment pairs with too few events to make a reliable decision.

Figure 3: Heat map of the gas pedal through segments using the percentiles method
Figure 4: Heat map of the brake pedal through segments using the IQR method
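The two threshold rules (Section 4.1) and the severity score (Section 3.2) reduce to a few lines of NumPy. A hedged sketch under the thresholds stated above; array and function names are ours, and the per-segment grouping is left to the caller.

```python
import numpy as np

def pedal_labels(gas_delta: np.ndarray, brake_delta: np.ndarray):
    """Label windows with the percentile (gas) and IQR (brake) rules."""
    warn_gas, fail_gas = np.percentile(gas_delta, [65, 83])
    q1, q3 = np.percentile(brake_delta, [25, 75])
    warn_brk, fail_brk = q3 + 0.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

    def grade(x, warn, fail):
        return np.where(x >= fail, "FAIL", np.where(x >= warn, "WARN", "PASS"))

    return (grade(gas_delta, warn_gas, fail_gas),
            grade(brake_delta, warn_brk, fail_brk))

def driver_rating(segment_labels) -> str:
    """Aggregate segment labels with S = 2*#FAIL + #WARN and band the result."""
    s = (2 * sum(l == "FAIL" for l in segment_labels)
         + sum(l == "WARN" for l in segment_labels))
    if s <= 2:
        return "Good"
    return "Warning" if s <= 5 else "Bad"
```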
4.2 Comparison of Thresholding and Clustering
A focused comparison was carried out on three representative track segments, Segment 1, Segment 8, and Segment 4, using the two complementary methods described in Section 3 (statistical thresholding and K-means clustering). For visualization only, we projected standardised features onto two principal components (PCA) per segment; clustering and label assignment were performed in the original standardised space.

4.2.1 Segment 1 (Steady Acceleration). The percentile method flagged only one driver as exceeding the FAIL threshold, while most achieved PASS status. K-means clustering produced a tightly grouped Good cluster for most drivers, with a single Bad outlier (visible in PCA as an isolated point on the positive PC1 axis). Agreement between methods was high (>85%), suggesting that, in simpler acceleration scenarios, single-feature metrics and multidimensional clustering agree well.

Figure 5: K-means graph for the 1st segment

4.2.2 Segment 4 (Prolonged Uphill Driving). Here the disagreement was most pronounced. The percentile rule classified many drivers as PASS because their maximum throttle did not exceed the cut-off. In contrast, K-means frequently assigned them to Warning or Bad. The 2D PCA projection (Figure 6) shows these drivers displaced from the Good centroid, driven by sustained high-load throttle (elevated accelerator mean/variance), low coasting, and reduced cruise-control usage, patterns that the single-peak percentile metric does not penalize. This highlights clustering's sensitivity to cumulative demand and multi-feature context, whereas the percentile approach captures only isolated exceedances.

Figure 6: K-means graph for the 4th segment

4.2.3 Segment 8 (Complex Curve-Acceleration Mix). This segment showed more divergence. The percentile method marked several drivers as WARN due to short bursts of high throttle, while K-means placed some of these drivers in the Good cluster. PCA visualization revealed that these drivers exhibited smoother braking and higher coasting ratios, which the clustering model positively weighted. This highlights a key difference: the statistical approach penalizes isolated peaks, whereas clustering balances them against compensatory behaviors.

Figure 7: K-means graph for the 8th segment

4.2.4 Cross-approach Observations. The alignment was strongest in steady demand scenarios (Segment 1), weaker in mixed behavior contexts (Segment 8), and lowest in sustained load conditions (Segment 4). Statistical thresholding offers high interpretability and segment-level clarity, but may overlook multi-feature inefficiencies. K-means clustering captures complex, composite behavior and can sometimes reclassify drivers that the percentile method labels as efficient. It would be interesting for future work to implement more driver features and analyse in depth which have a different effect.
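For reference, the per-segment clustering and post-hoc naming of Section 3.3.1 can be sketched as follows. Feature and variable names are illustrative, and PCA is used only to produce the 2D plots discussed above, not for clustering itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def cluster_segment(features, fuel_mean, k=3, seed=0):
    """Cluster one segment's per-driver features; name clusters by fuel use."""
    fuel_mean = np.asarray(fuel_mean)
    X = StandardScaler().fit_transform(features)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    # Order clusters by mean fuel rate: lowest -> Good, highest -> Bad.
    order = np.argsort([fuel_mean[km.labels_ == c].mean() for c in range(k)])
    names = dict(zip(order, ["Good", "Warning", "Bad"]))
    labels = [names[c] for c in km.labels_]
    coords = PCA(n_components=2).fit_transform(X)  # for visualisation only
    return labels, coords
```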
We additionally investigated the alignment between model-based feature importances and expert knowledge/domain expectations using SHAP.

Table 2: Top-5 features per class, ranked by mean absolute SHAP value.

Class    Top 1              Top 2              Top 3         Top 4               Top 5
Bad      AccelerationPedal  Speed              Acceleration  SteeringWheelAngle  BrakePedal
Medium   Speed              AccelerationPedal  Acceleration  SteeringWheelAngle  BrakePedal
Good     AccelerationPedal  Speed              Acceleration  SteeringWheelAngle  BrakePedal
Perfect  AccelerationPedal  Speed              Acceleration  SteeringWheelAngle  BrakePedal

Table 2 presents the five most influential features for each consumption class (Bad, Medium, Good, Perfect), ranked by their mean absolute SHAP value. The model consistently identifies AccelerationPedal and BrakePedal among the top-ranked features across multiple classes, in line with the statistical benchmark results from Section 4.1, where pedal usage was also the dominant indicator of inefficient driving events. This agreement confirms that the machine learning approach captures the same domain-relevant control inputs as the thresholds defined by the expert, while also highlighting secondary but relevant factors such as Speed, Acceleration, and SteeringWheelAngle.

4.3 Pareto Front of Time-Fuel Trade-Offs
An interesting point of view would be to also consider the temporal information. Fuel consumption may reduce costs, but time is also quite important. Figure 8 plots the total time against the total fuel per driver. A driver is Pareto efficient if no other driver is both faster and uses less fuel; these drivers form the lower-left frontier. The points to the upper-right are dominated and can improve at least one objective without worsening the other. We obtain the frontier by non-dominated sorting of per-driver (Time, Fuel) totals and colour points by their K-means group, explicitly linking global efficiency to the segment-level patterns identified earlier.

Figure 8: Pareto front
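The non-dominated sorting behind Figure 8 is short. A sketch with our own function name, assuming per-driver time and fuel totals where lower is better on both objectives.

```python
import numpy as np

def pareto_mask(time_total, fuel_total) -> np.ndarray:
    """Boolean mask of Pareto-efficient drivers (lower time and fuel are better)."""
    pts = np.column_stack([np.asarray(time_total), np.asarray(fuel_total)])
    efficient = np.ones(len(pts), dtype=bool)
    for i in range(len(pts)):
        # Driver j dominates i if j is no worse on both objectives
        # and strictly better on at least one.
        dominates_i = (pts <= pts[i]).all(axis=1) & (pts < pts[i]).any(axis=1)
        dominates_i[i] = False
        efficient[i] = not dominates_i.any()
    return efficient
```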
5 Discussion
This comparative study shows that rule-based thresholding remains highly interpretable and aligns with prior work, while K-means clustering reveals multi-feature patterns that affect efficiency. In practice, percentile rules flag isolated exceedances, whereas clustering captures cumulative demand and co-variation, explaining the discrepancies observed in segments such as the one in Figure 6. Together, the methods are complementary: thresholding offers transparent guardrails; clustering provides a broader, context-aware view.

6 Conclusions
The results suggest that integrating both statistical and machine learning perspectives offers a more robust and nuanced driver assessment for fuel efficiency. While classical thresholding offers transparency, machine learning enables the discovery of complex patterns. Future work should further validate these findings to develop hybrid driver feedback systems. We only used SHAP diagnostically; a more systematic SHAP analysis across models, segments, and time would be interesting, to stabilize attributions and translate them into actionable feedback.

Acknowledgements
We thank the AAER Research d.o.o. team, led by CEO Matej Vengust, for access to simulator data and expert support. We also acknowledge support from the EDIH DIGI-SI project.

References
[1] Oscar Delgado, Felipe Rodríguez, and Rachel Muncrief. 2017. Fuel Efficiency Technology in European Heavy-Duty Vehicles: Baseline and Potential for the 2020–2030 Timeframe. White Paper. The International Council on Clean Transportation (ICCT), July 2017. https://theicct.org/publication/fuel-efficiency-technology-in-european-heavy-duty-vehicles-baseline-and-potential-for-the-2020-2030-timeframe/.
[2] Hung Nguyen, George Tsaramirsis, Ilir Mborja, Dhimitraq Dervishi, Eriona Hoxha, Stavros Shiaeles, Anastasios Kavoukis, and Stamatios Vologiannidis. 2023. A data-driven framework for incentivising fuel efficient driving behaviour in heavy-duty vehicles. J. Clean. Prod., 420, 139942. doi: 10.1016/j.jclepro.2023.139942.
[3] Shuyan Chen, Hongru Liu, Yongfeng Ma, Fengxiang Qiao, Qianqian Pang, Ziyu Zhang, and Zhuopeng Xie. 2024. High fuel consumption driving behavior identification and causal analysis based on LightGBM and SHAP. Res. Sq. Preprint. doi: 10.21203/rs.3.rs-4010652/v1.
[4] Alexander Meschtscherjakov, David Wilfinger, Thomas Scherndl, and Manfred Tscheligi. 2009. Acceptance of future persuasive in-car interfaces towards a more economic driving behaviour. In AutomotiveUI 2009 (Sept. 2009), 81–88. doi: 10.1145/1620509.1620526.
[5] Charles Sullivan and Mark Franklin. 2010. An extended driving simulator used to motivate analysis of automobile fuel economy. In Session 1: Tools, techniques, and best practices of engineering education for the digital generation (May 2010). doi: 10.18260/1-2-1153-53783.
[6] Mathieu Maisonneuve. 2013. Characterization of drivers' energetic efficiency: Identification and evaluation of driving parameters related to energy efficiency. Master's thesis. Chalmers University of Technology. https://hdl.handle.net/20.500.12380/185531.
[7] Xiaohua Zhao, Yiping Wu, Jian Rong, and Yunlong Zhang. 2015. Development of a driving simulator based eco-driving support system. Transportation Research Part C: Emerging Technologies, 58, 631–641. doi: 10.1016/j.trc.2015.03.030.
[8] Zhipeng Ma, Bo Nørregaard Jørgensen, and Zheng Ma. 2024. A scoping review of energy-efficient driving behaviors and applied state-of-the-art AI methods. Energies, 17, 2. doi: 10.3390/en17020500.
[9] A. McGordon, J. E. W. Poxon, C. Cheng, R. P. Jones, and P. A. Jennings. 2011. Development of a driver model to study the effects of real-world driver behaviour on the fuel consumption. Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering, 225, 11, 1518–1530. doi: 10.1177/0954407011409116.
[10] Thomas J. Daun, Daniel G. Braun, Christopher Frank, Stephan Haug, and Markus Lienkamp. 2013. Evaluation of driving behavior and the efficacy of a predictive eco-driving assistance system for heavy commercial vehicles in a driving simulator experiment. In 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), 2379–2386. doi: 10.1109/ITSC.2013.6728583.

Automated Explainable Schizophrenia Assessment from Verbal-Fluency Audio

Rok Rajher (rr3244@student.uni-lj.si), Jure Žabkar (jure.zabkar@fri.uni-lj.si)
University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia

Abstract
Schizophrenia is associated with cognitive impairments that are difficult to assess with traditional neuro-psychological tests. Currently, these tests are manually administered by clinical doctors and rely on subjective assessment of the patient's behavior, self-reported symptoms, medical history, and mental state. Recent advances in deep learning have substantially improved automatic speech recognition (ASR) and large language models (LLMs), enabling the development of computational tools that can partially automate aspects of psychiatric assessment.
We present the first fully automated classification of individuals with schizophrenia based on verbal-fluency tests conducted in the Slovene language. Our multi-stage pipeline involves audio preprocessing, automatic transcription using the Truebar ASR model, the extraction of meaningful verbal and non-verbal features, and learning a machine learning model. The Explainable Boosting Machine (EBM) trained on the obtained feature set achieved the best overall performance.

Keywords
schizophrenia, automatic speech recognition, large language models, verbal-fluency tasks, machine learning

1 Introduction
Schizophrenia is a chronic and severe mental disorder [8, 11] that affects how a person thinks, feels, and behaves. As a psychotic disorder it is characterized by a combination of disorganized thinking and behavior, hallucinations, and delusions [2, 14]. The symptoms have major implications for an individual's social life and can lead to lifelong care [1, 7]. Schizophrenia affects about 1% of the population worldwide [9].

Currently, there is no objective or standardized diagnostic test for schizophrenia. The most widely used diagnostic frameworks in clinical practice are the DSM-5 [2] and the ICD-11 [14]. With rapid improvements in automatic speech recognition (ASR), large language models (LLMs), and machine learning, there is rising interest in computational tools that support, augment, or partially automate aspects of psychiatric assessment.

Clinicians have long noted that schizophrenia systematically affects speech in two ways:
(1) how people speak: acoustic-prosodic markers such as pause structure, speech rate, and prosodic variability, and
(2) what they say: lexical-semantic markers such as category switching, perseverations, and vocabulary diversity.
These are best observed during verbal-fluency tasks: short, standardized, low-burden, and already used in clinical practice. Our main hypothesis is that short recordings of Slovene verbal-fluency tasks contain sufficient discriminative signal, captured by acoustic and semantic features, to separate individuals with schizophrenia from healthy controls.

In this paper, we present an automated machine learning pipeline for the detection and explanation of schizophrenia, leveraging the capabilities of ASR models and state-of-the-art LLMs. The tests were conducted in the Slovene language and consisted of two one-minute subtasks: (1) a semantic fluency task, where participants were asked to list as many animal names as possible, and (2) a phonetic fluency task, where participants were instructed to generate words beginning with the letter 'L'. The approach is based on audio recordings of verbal fluency tests collected by Marinković [10]. Our results can be directly compared to those reported by Marinković [10], where the transcription and analysis of the tests were performed manually. The details of our study are extensively described in [13].

2 Methods
2.1 Participants
The dataset comprises 126 participants: 58 individuals with a clinical diagnosis of schizophrenia (SH) and 68 healthy controls (HC). All individuals in the SH group were patients admitted to the University Psychiatric Clinic Ljubljana. All participants were adults aged 18 years or older and gave consent to being part of the experiment.

Standard demographic information was collected for each participant, including age, gender, highest level of education, academic performance (school grades), marital status, and employment status. The dataset is balanced with respect to age and gender.

For participants diagnosed with schizophrenia, additional clinical information was recorded: illness duration, number of hospitalizations, and the presence of chronic or co-occurring health conditions. The median illness duration among individuals with schizophrenia was 10 years, with a median of 4 hospitalizations.
The study was approved by the Medical Ethics Committee of the Republic of Slovenia (approval number: 0120-51/2024-2711-4). All participants received a detailed explanation of the study procedures and provided written informed consent prior to participation.

Table 1: Demographic and clinical characteristics of the participants.

Measure                              SH       HC
Total Participants                   58       68
Male Distribution                    29       35
Female Distribution                  29       33
Median Age (years)                   45       46.5
Median Primary School Grade          3        5
Median High School Grade             3        4
Median Illness Duration (years)      10       –
Median Number of Hospitalizations    4        –
Prevalent Education Level            Elem.    HS
Prevalent Marital Status             Married  Married
Prevalent Employment Status          Retired  Employed

2.2 Testing procedure
Each participant completed a verbal fluency test consisting of two sub-tasks:
(1) Phonetic fluency task: participants were asked to produce as many Slovene words as possible beginning with the letter 'L'. Proper nouns, including names of people or places, were not allowed. The task lasted 62 seconds in total: during the first 2 seconds, the letter 'L' was displayed on the screen, followed by 60 seconds for verbal response.
(2) Semantic fluency task: participants were instructed to name as many animals as possible in the Slovene language. Pet names and proper nouns were not allowed. The task duration was 60 seconds.

The testing procedure was standardized: each individual was seated in front of a laptop computer. After reading the instructions for the phonetic fluency task, the participant pressed a key to begin, initiating the countdown. After completing the first task, the instructions for the second task (semantic fluency) were displayed. Again, the participant initiated the task by pressing a key when ready. This concluded the verbal fluency test.

Healthy participants were tested at the Faculty of Computer and Information Science, University of Ljubljana, while individuals with schizophrenia were assessed at the University Psychiatric Clinic Ljubljana. To ensure consistency across conditions, all recordings were conducted in quiet, isolated rooms to eliminate possible noise and distractions. All WAV files then underwent the same audio enhancement pipeline: (i) dynamic range compression to reduce variability due to speaking loudness and microphone distance, and (ii) loudness normalization to achieve consistent perceived loudness across recordings. These steps were implemented with standard functions from pydub and applied identically to both sites prior to feature extraction.

2.3 Data Format
The final dataset consists of 126 WAV audio recordings, one per participant, captured using the built-in laptop microphone during the test sessions. The audio tracks are encoded in uncompressed PCM format at a sampling rate of 44.1 kHz with a single (mono) audio channel. Additionally, there are 126 corresponding CSV files containing timestamps that indicate the start and end times of each subtask. Together, these audio and timestamp files serve as the primary data sources for all subsequent audio- and speech-based analyses.

3 Preprocessing
3.1 Audio Data Preparation
The WAV recordings were initially divided into two distinct audio segments using the provided timestamp files: (1) a segment corresponding to the phonetic verbal fluency task and (2) a segment corresponding to the semantic verbal fluency task.

Both audio segments were then processed through a series of audio enhancement steps:
(1) Dynamic range compression: to improve audio quality and ensure uniformity, downward dynamic range compression (threshold = -20.0 dBFS, ratio = 4:1, attack time = 5 ms, release time = 50 ms) was applied to each segment. This reduces the volume gap between the quietest and loudest parts of a signal [6].
(2) Loudness normalization: adjusting each segment to a target level of -20 dBFS. This ensured consistent perceived loudness across all recordings, reducing variability from differences in speaker volume, room acoustics, or microphone distance.
(3) Final output: finally, the two fully processed segments per participant (phonetic and semantic) were saved as separate WAV files. These files constitute the final audio data used for all subsequent analyses.
All of the described steps were implemented using standard functions provided by the pydub library.
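The enhancement steps map directly onto pydub calls. A minimal sketch of one segment's processing, with our own function and path names; the parameters are the ones listed above.

```python
from pydub import AudioSegment
from pydub.effects import compress_dynamic_range

def enhance_segment(in_path: str, out_path: str, target_dbfs: float = -20.0):
    """Compress dynamics, then normalise loudness to the target level."""
    audio = AudioSegment.from_wav(in_path)
    audio = compress_dynamic_range(
        audio, threshold=-20.0, ratio=4.0, attack=5.0, release=50.0
    )
    # Gain shift so the segment's average loudness sits at -20 dBFS.
    audio = audio.apply_gain(target_dbfs - audio.dBFS)
    audio.export(out_path, format="wav")
```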
3.2 Feature Engineering
After the automated transcriptions had been processed, we performed feature engineering. Based on clinical knowledge, we created meaningful features that serve as reliable markers for distinguishing between individuals with and without schizophrenia. Three core symptoms of schizophrenia are directly applicable to our verbal-fluency tasks: disorganized speech, disorganized behavior, and negative symptoms. The primary rationale behind our feature construction is grounded in these core symptom domains.

Audio recordings are represented in two forms: (1) as text, derived from automated ASR transcriptions, and (2) as spectrograms, a visual representation of the frequency content of the audio signal over time. We constructed two groups of features:
(1) Verbal features: 39 features derived from the automated text transcriptions. These features aim to quantify disorganized speech, e.g., the number of phrases produced per second (illustrated in the sketch below).
(2) Non-verbal features: 17 features extracted directly from the spectrograms of the audio recordings. These features target prosodic elements such as pitch and vocal control, which are key indicators of negative symptoms like blunted affect and of disorganized behavior; e.g., mean pitch, representing the speaker's average vocal pitch.
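As an illustration of the verbal/tempo feature group, the following sketch computes two such markers from ASR word timestamps. The feature names are illustrative analogues of those reported in Section 4.1, not the pipeline's actual identifiers.

```python
import numpy as np

def tempo_features(word_times, task_seconds: float = 60.0) -> dict:
    """Simple tempo markers from (start, end) word timestamps in seconds."""
    starts = np.array([s for s, _ in word_times])
    ends = np.array([e for _, e in word_times])
    gaps = starts[1:] - ends[:-1]  # silences between consecutive words
    return {
        "words_per_second": len(word_times) / task_seconds,
        # Longest silent pause as a share of the whole task.
        "max_gap_percent": 100.0 * gaps.max() / task_seconds if gaps.size else 100.0,
        "mean_gap": float(gaps.mean()) if gaps.size else task_seconds,
    }
```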
3.3 Automated Transcription
The most critical step in the preprocessing of audio recordings is the generation of automated transcriptions. These ASR-derived transcriptions serve as the primary input for nearly all subsequent stages of feature extraction and machine learning analysis. We employed the ASR model Truebar 24.05, a state-of-the-art speech recognition system for the Slovene language. The model was developed by the company Vitatis in collaboration with the Laboratory for Data Technologies at the Faculty of Computer and Information Science. Using the Truebar API, we programmatically uploaded each audio file and in response received the corresponding transcribed words along with their start and end timestamps.

3.4 Transcription Adjustment
The output of the ASR system consists of transcribed words along with their associated timestamps. These transcriptions may include irrelevant content such as filler words. We used the DSPy library, a Python framework that enables declarative programming for prompting LLMs in a modular and programmatic way, in combination with the GPT-4o model. The transcription adjustment process was divided into two sequential steps:
(1) Transcription filtering: the raw transcription output from the Truebar ASR model was first passed to the GPT-4o model along with a description of the verbal fluency task and its rules. The model was instructed to retain only the words it considered to be relevant, without modifying the words themselves.
(2) Transcription correction: the filtered transcription was then forwarded to the model in a second pass. With the same task context and rules provided, the model was now asked to adjust incorrectly transcribed words to what it inferred the participant likely intended to say. Since a word could potentially also be a neologism, we explicitly instructed the model to apply corrections only when the intended word was judged to be clear and obvious; otherwise, the word was left unchanged. For example, a misrecognized word like 'lon' would be corrected to 'slon' (elephant), whereas unclear or ambiguous cases were preserved as-is.

3.5 Adding Semantic Meaning
After filtering and correcting the transcriptions, we tagged each word with semantic annotations relevant to the verbal fluency task. These semantic features are crucial for distinguishing between HC and SH, as they capture subtle linguistic anomalies commonly associated with schizophrenia. We used the DSPy framework in combination with the GPT-4o language model to perform automated semantic tagging. The model was provided with task-specific instructions and context for each word. For each transcribed word, we extracted the following semantic tags:
• Intrusion: the word is semantically unrelated to the target category (e.g., a non-animal word during the animal naming task). Intrusions are often more frequent in individuals with schizophrenia and reflect impaired cognitive control and semantic memory organization [5].
• Stiltedness: marks whether the word appears overly formal, unusual, or unnatural in everyday speech. Stilted language is a known linguistic feature in schizophrenia and may signal underlying disruptions in pragmatic language use [12].
• Neologism: a newly coined or nonsensical word not found in the lexicon. Neologisms are characteristic of disorganized thought and speech, and are especially relevant in schizophrenia research [3].
• Word description (semantic task only): a general, page-long descriptive summary of the word. For animals, this includes common features such as appearance, habitat, and behavior, providing a semantic embedding that captures how the word is typically perceived by the general population. In the case of neologisms, the semantic meaning was still applied based on what the word could plausibly represent or mean, allowing the model to assign an approximate semantic embedding even for novel or invented terms. This feature is used only for the semantic task, where meaning-based associations between words are essential.

3.6 Data Analysis
We trained and evaluated several machine learning models using these features. To ensure robust evaluation, we applied stratified 10-fold cross-validation. Performance was assessed using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC). The Explainable Boosting Machine (EBM) consistently achieved the best results when trained on the full feature set. We additionally examined the top 10 most informative features to assess model interpretability. This approach enables us to better understand which deficits are most prominent in individuals with schizophrenia and may be useful for targeted clinical interventions.
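The evaluation protocol of Section 3.6 corresponds to a few lines with scikit-learn and the interpret package. A sketch reporting two of the paper's metrics; X and y stand for the assembled feature matrix and SH/HC labels.

```python
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

def evaluate_ebm(X, y) -> dict:
    """Stratified 10-fold CV of an EBM; returns mean AUC and accuracy."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_validate(
        ExplainableBoostingClassifier(random_state=0),
        X, y, cv=cv, scoring=["roc_auc", "accuracy"],
    )
    return {m: scores[f"test_{m}"].mean() for m in ("roc_auc", "accuracy")}
```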
Ver- After filtering and correcting the transcriptions, we tagged each bal features are more likely to reflect educational attainment word with semantic annotations relevant to the verbal fluency (e.g., lexical diversity, category switching), whereas core acous- task. These semantic features are crucial for distinguishing be- tic markers (e.g., pause structure, longest silent pause) are less tween HC and SH, as they capture subtle linguistic anomalies dependent on education [4]. In our 10-fold CV, V and N models commonly associated with schizophrenia. We used performed comparably, and VN performed best. This suggests DSPy frame- work in combination with the that education alone is unlikely to explain the classification. GPT-4o language model to per- form automated semantic tagging. The model was provided with task-specific instructions and context for each word. For each 4.1 Global interpretation transcribed word, we extracted the following semantic tags: The overall feature importance (FI) across the entire dataset is • Intrusion: used for global interpretation of the model. We calculate it as the The word is semantically unrelated to the tar- get category (e.g. non-animal word during the animal nam- average absolute contribution of each feature across all samples: ing task). Intrusions are often more frequent in individuals 𝑛 1 ∑︁ = FI with schizophrenia and reflect impaired cognitive control 𝑗 𝑓 𝑗 ( 𝑥 𝑖, 𝑗 ) , (1) 𝑛 and semantic memory organization [5]. 𝑖 =1 • where Stiltedness: Marks whether the word appears overly for-𝑛 is the total number of samples, and 𝑓𝑗 (𝑥𝑖, 𝑗) is the contri- mal, unusual, or unnatural in everyday speech. Stilted bution of feature 𝑗 for instance 𝑖. FI measures how strongly each language is a known linguistic feature in schizophrenia feature influences the model’s predictions on average. and may signal underlying disruptions in pragmatic lan- Globally most important features are: (1) comb_pho_lev2_- guage use [12]. avg- the Levenshtein similarity between the filtered and ad- • justed transcriptions, which indicates impaired speech fluency, Neologism: a newly coined or nonsensical word not found in the lexicon. Neologisms are characteristic of dis- (2) animal_tempo_max_gap_percent- captures the longest organized thought and speech, and are especially relevant silent pause during the semantic task, (3) animal_sem_cont_- in schizophrenia research [3]. max_coherence_index, animal_sem_cont_kurt_coherence_- • index, and ltest_sem_stat_min_coherence_index- the first Word description (semantic task only): A general, page-long descriptive summary of the word. For animals, two capture the word-to-word coherence, while the third cap- this includes common features such as appearance, habi- tures the lowest phonetic similarity between consecutive words tat, and behavior—providing a semantic embedding that 61 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia Rajher and Žabkar during the phonetic task, (4) comb_osmile_F0From27.5Hz_- to all audio, and (iii) demonstrating that transcript-only models stddeNorm_avg- the standard deviation of pitch; highlights vari- (verbal features) remain predictive, indicating that performance ability in vocal pitch — a marker of prosodic irregularity often is not driven by background acoustics. Future studies should also observed in individuals with schizophrenia. include participants with other psychiatric conditions, such as major depressive disorder or bipolar disorder. 
4.2 Local interpretation
Each individual prediction can be explained through the positive/negative contribution of each feature. Features with positive contributions increase the log-odds in favor of the schizophrenia class, while features with negative contributions decrease the log-odds, shifting the prediction toward the healthy control class. An example for a severe schizophrenia case is shown in Fig. 1.

Figure 1: Local feature importance plot for a severe schizophrenia case as predicted by the EBM model. Red bars indicate contributions toward the schizophrenia class, and blue bars toward the healthy control class.

The corresponding textual explanation was generated by the GPT-4o model: "The results from the verbal fluency test indicate several features often associated with schizophrenia. Short pauses between utterances may suggest rushed or pressured speech, which can be a sign of reduced speech planning. Low semantic coherence in structured tasks may indicate the intrusion of unrelated thoughts or semantic derailment. Additionally, long pauses between utterances can reflect cognitive slowing or difficulty with word retrieval. These features collectively suggest the possibility of schizophrenia."
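A sketch of how such a per-case breakdown can be read out of a fitted interpret EBM. The function name is ours, and the exact layout of the explanation object may differ across interpret versions, so treat this as indicative rather than definitive.

```python
def explain_case(ebm, x_row):
    """Print one participant's per-feature contributions, largest first.

    `ebm` is a fitted ExplainableBoostingClassifier; positive scores push
    the prediction toward the SH class, negative toward HC.
    """
    data = ebm.explain_local(x_row).data(0)
    pairs = sorted(zip(data["names"], data["scores"]),
                   key=lambda p: -abs(p[1]))
    for name, score in pairs:
        print(f"{name:45s} {score:+.3f}")
```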
The results suggest that, on average, the models are able to rank individuals effectively (high AUC); they can distinguish between HC and SH in terms of relative probability. The low CA, sensitivity, PPV, and F1 scores suggest that the chosen classification threshold of 0.5 may not be optimal. This issue was further addressed by evaluating the ROC curve of the best-performing model to explore whether an alternative classification threshold could improve the identification of positive cases; we observed that both the Youden-optimal threshold and the F1-optimal threshold are approximately 0.49, which differs negligibly from the used value of 0.5. The performance of our best model, the EBM, shows its strong ranking ability and balanced classification performance on both classes.

Limitations and Future Work
Although our dataset is well balanced, the sample size (126) is rather small; additional samples would improve model generalizability and robustness. Audio quality could be improved by using professional microphones instead of built-in laptop microphones, which would enhance transcription accuracy. Because the audio recordings were obtained at two locations, a residual site effect cannot be fully excluded. We mitigated this risk by (i) using identical task instructions and timing in quiet rooms at both sites, (ii) applying uniform dynamic range compression and loudness normalization to all audio, and (iii) demonstrating that transcript-only models (verbal features) remain predictive, indicating that performance is not driven by background acoustics. Future studies should also include participants with other psychiatric conditions, such as major depressive disorder or bipolar disorder.

Conclusion
We developed and evaluated an automated, explainable pipeline for schizophrenia assessment using 126 verbal-fluency audio recordings (healthy controls: 68; schizophrenia: 58). The pipeline comprises audio pre-processing, automatic transcription with the Truebar ASR model, and extraction of verbal (transcript-derived) and non-verbal (acoustic/temporal) features. The features were then used in training and evaluating several classical machine-learning models.

Across models, combining verbal and non-verbal features consistently yielded the strongest results. The Explainable Boosting Machine achieved the highest performance: CA 0.82, Sens. 0.76, Spec. 0.87, PPV 0.83, F1 0.79, and AUC 0.90. Due to the EBM's inherent interpretability, we produced global explanations and local explanations (per-instance contribution plots), complemented by GPT-4o-generated textual summaries. The high model performance and associated explanations provide firm ground for a potential decision support system in clinical practice.

5 Acknowledgments
This work was partially supported by the Slovenian Research Agency (ARIS), research program Artificial Intelligence and Intelligent Systems (Grant No. P2-0209).

References
[1] Bandar AlAqeel and Howard C. Margolese. 2012. Remission in schizophrenia: Critical and systematic review. Harvard Review of Psychiatry 20, 6 (2012), 281–297.
[2] American Psychiatric Association. 2013. Diagnostic and statistical manual of mental disorders (5th ed.). American Psychiatric Publishing, Arlington, VA.
[3] Janna N. De Boer, Sanne G. Brederoo, Alban E. Voppel, and Iris E. C. Sommer. 2020. Anomalies in language as a biomarker for schizophrenia. Current Opinion in Psychiatry 33, 3 (2020), 212–218.
[4] J. N. De Boer, A. E. Voppel, S. G. Brederoo, H. G. Schnack, K. P. Truong, F. N. K. Wijnen, and I. E. C. Sommer. 2023. Acoustic speech markers for schizophrenia-spectrum disorders: a diagnostic and symptom-recognition tool. Psychological Medicine 53, 4 (March 2023), 1302–1312.
[5] Flavia Galaverna, Adrián M. Bueno, Carlos A. Morra, María Roca, and Teresa Torralva. 2016. Analysis of errors in verbal fluency tasks in patients with chronic schizophrenia. The European Journal of Psychiatry 30, 4 (2016), 305–320.
[6] Dimitrios Giannoulis, Michael Massberg, and Joshua D. Reiss. 2012. Digital dynamic range compressor design—A tutorial and analysis. Journal of the Audio Engineering Society 60, 6 (2012), 399–408.
[7] Josep Maria Haro, Diego Novick, Jordan Bertsch, Jamie Karagianis, Martin Dossenbach, and Peter B. Jones. 2011. Cross-national clinical and functional remission rates: Worldwide Schizophrenia Outpatient Health Outcomes (W-SOHO) study. The British Journal of Psychiatry 199, 3 (2011), 194–201.
[8] Thomas R. Insel. 2010. Rethinking schizophrenia. Nature 468, 7321 (2010), 187–193.
[9] Stephen R. Marder and Tyrone D. Cannon. 2019. Schizophrenia. The New England Journal of Medicine 381, 18 (2019), 1753–1761. doi:10.1056/NEJMra1808803
[10] Mila Marinković. 2024. Analysis of speech fluency in patients with schizophrenia. Master's Thesis, University of Ljubljana, Faculty of Computer and Information Science.
[11] Robert A. McCutcheon, Tiago Reis Marques, and Oliver D. Howes. 2020. Schizophrenia—An overview. JAMA Psychiatry 77, 2 (2020), 201–210.
[12] Victor Peralta, Manuel J. Cuesta, and Jose de Leon. 1992. Formal thought disorder in schizophrenia: A factor analytic study. Comprehensive Psychiatry 33, 2 (1992), 105–110.
[13] Rok Rajher. 2025. Automatic Generation of Explanations in Diagnosing Schizophrenia Using Speech Fluency Testing. Master's Thesis, University of Ljubljana, Faculty of Computer and Information Science.
[14] World Health Organization. 2022. ICD-11: 6A20 Schizophrenia. Retrieved from https://icd.who.int/browse/2025-01/mms/en#1683919430.

Mapping Medical Procedure Codes Using Language Models

Mariša Ratajec (ratajec.marisa@gmail.com), University of Ljubljana, Faculty of Electrical Engineering; Jožef Stefan Institute, Ljubljana, Slovenia (corresponding author)
Anton Gradišek (anton.gradisek@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Nina Reščič (nina.rescic@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

Abstract
Aligning medical procedure codes across national classification systems is a challenging task.
We investigate the mapping of Slovenian KTDP expressions to German OPS codes using fuzzy matching, biomedical language models (BioBERT, GatorTron), a hybrid approach, and ChatGPT. In the absence of ground truth, we assess consistency across methods and conduct manual reviews. Results show that differences in code structure and expression detail pose major barriers to alignment. Expert validation will be essential for improving accuracy.

Keywords
procedure coding, KTDP, OPS, semantic similarity, BioBERT, fuzzy matching, GatorTron, ChatGPT

1 Introduction
Different countries maintain their own national classification systems for medical procedures, used for clinical documentation, reimbursement, public reporting, and statistical analysis. In Slovenia, healthcare professionals rely on a domestic procedural coding system, while in Germany, the Operationen- und Prozedurenschlüssel (OPS) is used.

At the University Medical Centre (UMC) Ljubljana in Slovenia, interest has emerged in matching expressions from the Klasifikacija terapevtskih in diagnostičnih postopkov in posegov (KTDP) with the German OPS classification system. The purpose is to allow international reporting, cost estimation, and comparative analysis of healthcare procedures.

1.1 Problem Outline
Aligning Slovenian procedural expressions with German OPS codes is a complex task. The Slovenian dataset contains approximately 6,000 expressions, whereas the German OPS classification includes more than 60,000 distinct entries, covering multiple levels of specificity in various medical domains. Manual mapping is time-consuming and impractical, primarily due to the size of the datasets and the absence of convenient tools for efficient code retrieval and comparison.

To address this challenge, we explored the development of computational approaches to support and accelerate the mapping process. Due to the nature of the data and the semantic variation between codes, we tested several techniques, including fuzzy string matching, semantic similarity scoring, and large language models (LLMs), such as BioBERT, GatorTron, and OpenAI models. We also explored a hybrid approach that integrates fuzzy matching with LLM-derived semantic embeddings.

In this paper, we present the application of the selected methods for aligning Slovenian KTDP procedure expressions with German OPS codes. We evaluate their performance and limitations, and discuss key challenges associated with this type of code matching problem.

2 Methodology
2.1 Datasets
2.1.1 Slovenian Dataset. The Slovenian dataset is based on the Klasifikacija terapevtskih in diagnostičnih postopkov in posegov (KTDP) [6], version 11, which has been officially implemented nationwide since 1 January 2023. This classification system is used to code medical procedures at all levels of healthcare in Slovenia and is structurally aligned with the Australian Classification of Health Interventions (ACHI), adapted to the local context.

KTDP consists of 20 chapters, each covering a different clinical domain. The chapters are organised primarily by body system (e.g., nervous, endocrine, cardiovascular), with additional sections dedicated to dental care, imaging services, radiation oncology, and interventions not elsewhere classified. Each chapter is subdivided into multiple blocks, which group related procedures under shared headings.

In total, the classification includes approximately 6,000 unique procedures. Each is assigned a specific code in a structured numeric format composed of a five-digit base and a two-digit extension (e.g., 36564-00).

2.1.2 German Dataset. The German dataset is based on the Operationen- und Prozedurenschlüssel (OPS), version 2024 [2], which is officially used nationwide for coding medical procedures. Maintained by the Federal Institute for Drugs and Medical Devices (BfArM), OPS is revised annually. It is derived from the WHO's International Classification of Procedures in Medicine (ICPM) and adapted to the German healthcare system.
The classification is organised into six main chapters, covering the following clinical domains: diagnostic measures (Chapter 1), imaging diagnostics (Chapter 3), surgical procedures (Chapter 5), medications (Chapter 6), non-operative therapeutic measures (Chapter 8), and supplementary measures (Chapter 9). Each chapter is further subdivided into categories and blocks, which group related procedures based on functional or anatomical criteria.

OPS comprises approximately 60,000 unique procedures. Each is assigned a hierarchical alphanumeric code, consisting of a four-digit base and optional numeric or alphanumeric extensions (e.g., 5-384.50 or 8-844.5c). The coding system follows a structured hierarchy, beginning with the chapter number (e.g., 5 for surgical procedures), followed by a category (e.g., 5-38 for vascular surgery) and subcategories (e.g., 5-384 for specific surgical techniques). The digits and characters after the dot denote the exact intervention.

2.1.3 Differences and Similarities between Datasets. Although both classification systems serve a similar purpose, they differ in structure and level of detail. The German dataset includes very specific and thoroughly described procedures, clearly outlining each individual service. The Slovenian system, on the other hand, uses broader and more general descriptions, without the same amount of detail or length.

Moreover, there is limited direct lexical overlap between the two datasets. Even when procedures are conceptually similar, their descriptions often differ in phrasing, level of specificity, or use of synonyms. As a result, one-to-one matching is rarely straightforward and requires both structural alignment and semantic interpretation.
2.2 Pipeline
The overall process is summarised in a pipeline diagram (Figure 1), which outlines each step, from dataset preparation and translation to the application of matching methods and the structure of the resulting outputs. Each component of the pipeline is described in detail in the following subsections.

Figure 1: Overview of the matching pipeline and example results. KTDP expressions in English were aligned to translated OPS expressions using five methods: fuzzy matching, BioBERT, a combined BioBERT+fuzzy approach, GatorTron, and OpenAI ChatGPT. All methods except ChatGPT produced structured outputs with match scores, as shown in the example results table. ChatGPT returned only contextual matches without comparable scoring and was therefore excluded from the standardised evaluation table.

2.2.1 Translation. Since Slovenian KTDP expressions were already available in English, the German OPS procedure names were translated to English to enable semantic comparison. For this purpose, we used the MarianMT model (Helsinki-NLP/opus-mt-de-en) [4], a transformer-based neural machine translation model. Although not specifically fine-tuned for clinical texts, MarianMT has demonstrated strong performance in medical translation tasks, particularly for structured terminology [5], making it a suitable and practical choice for this application.

2.2.2 Language-based code pairing. To perform code matching, we initially applied a language-based code pairing approach using fuzzy matching, implemented via the RapidFuzz library [1]. Fuzzy matching is particularly useful in cases where expressions differ slightly in wording, structure, or spelling. We applied the token set ratio, which compares the sets of unique words in two strings and calculates a similarity score based on the overlap of unique tokens, with edit distance applied to the remaining unmatched parts. This method is insensitive to word order and robust to minor variations in phrasing. Using this approach, each English KTDP expression was compared with all translated OPS descriptions. For each KTDP entry, we selected the best-matching OPS procedure based on the highest fuzzy similarity score and recorded the corresponding code, description, and score for further analysis.
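The per-expression search of Section 2.2.2 is essentially a one-liner with RapidFuzz; list names are ours.

```python
from rapidfuzz import fuzz, process

def best_fuzzy_match(ktdp_text: str, ops_texts: list, ops_codes: list):
    """Return (code, description, score) of the best token_set_ratio match."""
    description, score, idx = process.extractOne(
        ktdp_text, ops_texts, scorer=fuzz.token_set_ratio
    )
    return ops_codes[idx], description, score
```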
2.2.3 Semantic-based code pairing. As a second approach, we applied semantic-based code pairing using contextual embeddings derived from transformer-based language models. Specifically, we tested two pretrained models: pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb [3], a SentenceTransformer variant of BioBERT fine-tuned on biomedical and inference tasks, and UFNLP/gatortron-base [10], a GatorTron model pre-trained on large-scale clinical corpora. Both models were selected for their strong performance in biomedical language understanding [7] and to investigate how model choice influences the quality of semantic code alignment.

Using each model, both KTDP expressions and translated OPS descriptions were encoded into dense semantic vectors. Cosine similarity was then computed between each KTDP embedding and all OPS embeddings to assess semantic closeness. As in the previous approach, the top matching OPS procedure for each KTDP expression was selected and recorded following the same procedure as before.

2.2.4 Combined code pairing. In addition to the individual use of semantic and lexical methods, we implemented a hybrid matching approach that combines the strengths of both. Specifically, we integrated semantic similarity scores obtained from BioBERT embeddings with lexical similarity scores derived from fuzzy matching (token set ratio). For each KTDP expression, both similarity measures were computed independently against all translated OPS descriptions. The final similarity score for each pair was calculated as a weighted average:

score_final = w_semantic · score_semantic + w_lexical · score_lexical

We experimented with two weighting schemes: one with equal weights (w_semantic = 0.5, w_lexical = 0.5) and another prioritising semantic similarity (w_semantic = 0.7, w_lexical = 0.3), to assess how different balances influence match quality. For each KTDP expression, the OPS description with the highest combined score was selected and recorded along with the corresponding code and similarity score.

This approach was motivated by practical observations in the literature, where combining surface-level and context-aware similarity often yields more robust results, especially in cases where purely semantic models overlook minor wording differences or where lexical methods fail to capture deeper conceptual alignment [9].

2.2.5 ChatGPT code pairing. As a final exploratory method, we used a custom ChatGPT instance (GPT-4o, OpenAI) [8] to evaluate the potential of conversational large language models (LLMs) for code matching. We uploaded all relevant documentation, including KTDP expressions, translated OPS procedures, and background materials, to a private GPT environment. For each KTDP entry, we either asked the model to suggest the best-matching OPS procedure directly or first requested an interpretation of the KTDP term followed by a context-based match. This approach allowed us to assess whether ChatGPT's contextual reasoning could complement or outperform traditional embedding-based or lexical matching methods.
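Sections 2.2.3 and 2.2.4 combine into the following sketch, using the BioBERT SentenceTransformer named above; in practice the OPS corpus would be encoded once rather than per query, and the function name is ours.

```python
from rapidfuzz import fuzz
from sentence_transformers import SentenceTransformer, util

MODEL = SentenceTransformer(
    "pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb"
)

def combined_scores(ktdp_text, ops_texts, w_semantic=0.7, w_lexical=0.3):
    """Weighted blend of cosine similarity and token_set_ratio (both in [0, 1])."""
    q = MODEL.encode(ktdp_text, convert_to_tensor=True)
    c = MODEL.encode(ops_texts, convert_to_tensor=True)
    semantic = util.cos_sim(q, c).squeeze(0)  # one score per OPS description
    lexical = [fuzz.token_set_ratio(ktdp_text, t) / 100.0 for t in ops_texts]
    return [w_semantic * float(s) + w_lexical * l
            for s, l in zip(semantic, lexical)]
```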
2.2.5 ChatGPT code pairing. As a final exploratory method, we used a custom ChatGPT instance (GPT-4o, OpenAI) [8] to evaluate the potential of conversational large language models (LLMs) for code matching. We uploaded all relevant documentation, including KTDP expressions, translated OPS procedures, and background materials, to a private GPT environment. For each KTDP entry, we either asked the model to suggest the best-matching OPS procedure directly or first requested an interpretation of the KTDP term followed by a context-based match. This allowed us to assess whether ChatGPT's contextual reasoning could complement or outperform traditional embedding-based or lexical matching methods.

3 Evaluation
The absence of a validated ground truth presents a fundamental challenge in assessing the quality of our matching results. Without expert clinical validation, it is unclear how accurate individual matches are or which method performs best. To address this, we first conducted a broad quantitative analysis to evaluate consistency, disagreement, and similarity across methods. These metrics provide indirect but informative insights into model behaviour, helping to characterise matching patterns even in the absence of formal validation. Following this initial assessment, we performed a small-scale manual review to better understand the plausibility of selected matches. We examined examples with both high and low matching scores, identifying cases of clear agreement as well as notable mismatches. This informal inspection offered additional intuition on method performance and highlighted the need for domain expertise to reliably judge alignment quality.

To begin the quantitative evaluation, we examined how often different methods assigned KTDP expressions to the same general procedural category. To do this, we compared the prefixes of the top-1 matched OPS codes across all methods, where the prefix corresponds to the first digit of the OPS code and indicates the high-level category of the procedure (e.g., diagnostic, surgical, therapeutic). This allowed us to assess agreement at a broader level, independent of specific code details.

The results revealed a relatively high degree of consistency: in 64.2% of cases (n = 4000), all methods returned OPS codes with the same prefix, indicating agreement on the general procedural category. In the remaining 35.8% of cases (n = 2231), there was partial agreement: some methods aligned on the prefix, while others diverged. Notably, there were no cases in which all methods assigned entirely different prefixes, suggesting that at least a minimal level of agreement was always preserved at the category level.

However, when comparing full OPS codes, agreement dropped substantially. Only 2.9% of cases (n = 178) exhibited full consensus across all methods. Most cases (90.1%, n = 5613) fell into the "some same" category, where at least two methods agreed, and 7.1% (n = 440) showed complete disagreement, with each method proposing a different code. These results indicate that, while methods often converge on the general category of a procedure, they frequently differ in the specific code they assign within that category.
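These agreement statistics can be reproduced with a few lines of pandas; the frame layout (one column of top-1 codes per method, with the OPS chapter taken as the first character of the code) is our assumption about the underlying data.

```python
# Sketch of the agreement analysis over top-1 matched codes.
import pandas as pd

def classify(codes):
    """Label a row of codes: all same / some same / all different."""
    uniq = codes.nunique()
    if uniq == 1:
        return "all same"
    if uniq == len(codes):
        return "all different"
    return "some same"

def agreement_stats(df, methods):
    full = df[methods].apply(classify, axis=1)
    prefix = df[methods].apply(lambda row: classify(row.str[0]), axis=1)
    return (full.value_counts(normalize=True),
            prefix.value_counts(normalize=True))

# hypothetical frame: one top-1 OPS code per method and KTDP entry
df = pd.DataFrame({"fuzzy":     ["5-469", "1-100", "5-470"],
                   "biobert":   ["5-469", "3-200", "5-470"],
                   "gatortron": ["5-469", "8-390", "5-471"]})
full, prefix = agreement_stats(df, ["fuzzy", "biobert", "gatortron"])
```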
To further examine how the methods differ in their assignment behaviour, we analysed the distribution of top-1 matched OPS codes across the six main procedural chapters. As illustrated in Figure 2, all methods predominantly mapped KTDP expressions to Chapter 5 (surgical procedures), reflecting the procedural nature of the source data. In contrast, assignments to Chapter 6 (medications) and Chapter 9 (supplementary measures) were relatively infrequent. This general distribution pattern was consistent across methods, indicating a shared tendency to favour procedural codes.

Even so, some notable differences were observed. For example, GatorTron assigned fewer expressions to Chapter 5 compared to the other methods and exhibited a relatively higher proportion of matches to Chapter 8 (non-operative therapeutic measures). Manual review of these cases revealed that many of the expressions lacked a clearly corresponding OPS code, which may have led the model to prefer broader categories. Still, in the absence of expert validation, we cannot determine whether such assignments are more or less accurate.

Figure 2: Distribution of top-1 matched OPS codes across the six main procedural chapters for each matching method. Chapter 1 represents diagnostic measures, Chapter 3 imaging diagnostics, Chapter 5 surgical procedures, Chapter 6 medications, Chapter 8 non-operative therapeutic measures, and Chapter 9 supplementary measures.

To investigate whether certain KTDP procedures are inherently easier to match due to wording or alignment with OPS terminology, we analysed the standardised match score values across all methods using a heatmap (Figure 3). The goal was to determine whether consistent scoring patterns could help identify procedures that are generally easier or more difficult to match, regardless of the specific method used.

The heatmap displays Z-standardised scores for each method, with expressions sorted by BioBERT scores. Although we expected some consistency (i.e., easier expressions receiving higher scores across all methods and harder ones receiving lower scores), the results showed considerable variation. In many cases, a procedure scored higher with one method and lower with another, suggesting that matching difficulty is method-dependent and influenced by how each approach interprets textual or structural similarity.

Notably, BioBERT and the hybrid BioBERT-fuzzy method produced very similar score profiles. GatorTron and the fuzzy approach showed more divergence, indicating different sensitivities to terminology structure, dataset alignment, or surface-level phrasing. This suggests that methods differ not only in which codes they select, but also in how confidently they make those matches.

Figure 3: Heatmap of Z-standardised MATCH_SCORE_1 values across all KTDP expressions, sorted by BioBERT scores. The plot illustrates variation in score strength across methods, highlighting differences in confidence and matching behaviour.

After developing a broader understanding of inter-method differences through quantitative analyses, we conducted a focused manual review of selected examples to qualitatively assess the plausibility of top matches. We examined expressions with both high and low matching scores across methods to explore whether any consistent patterns could be observed.

For expressions with high scores and full agreement across methods, the matches were typically straightforward: the KTDP expression was either identical or highly similar to an OPS entry, often requiring no complex interpretation. These cases tended to represent procedural descriptions that appeared in both datasets with minimal variation.

In contrast, lower-scoring expressions revealed more complex challenges. Two main issues emerged during manual inspection. First, several KTDP procedures had no direct equivalent in the OPS system because they are typically recorded in other coding systems (e.g., vaccinations or disease-specific protocols). Second, many KTDP expressions were written in a general or aggregated form, often combining multiple procedural steps into a single description. OPS, on the other hand, is highly granular, with detailed and precisely defined codes. As a result, some KTDP expressions may correspond to multiple distinct OPS codes, or only partially align with available entries.

These observations suggest that performance limitations are not solely attributable to the matching algorithms themselves, but also to structural mismatches and representational differences between the source datasets. This highlights a key challenge in aligning procedural coding systems across countries.

3.1 ChatGPT
Despite leveraging ChatGPT's capacity for contextual reasoning by first interpreting the KTDP expression and then performing the match, the resulting OPS codes were, in most cases, identical to those produced by the previously described methods. This suggests that the added interpretation step did not substantially improve matching performance. As previously discussed, this outcome likely reflects the inherent differences between the datasets.

4 Conclusion
Our study highlights the considerable challenge of aligning procedural coding systems across countries with different documentation practices. Despite employing a range of computational methods, from fuzzy matching and semantic embeddings to large language models, the observed differences in dataset structure and content significantly limited matching performance. In particular, the lack of detail in some KTDP expressions, the high specificity of OPS codes, and the absence of one-to-one equivalents all contributed to inconsistent or ambiguous results.

Crucially, no ground truth currently exists to objectively evaluate the quality of these matches. Although indirect metrics and manual inspection provide useful information, they cannot replace expert validation. Therefore, the most important next step is to involve medical professionals in generating a gold standard reference set. This would enable formal benchmarking of different methods and support the development of more reliable and generalisable code alignment pipelines in the future.

Ultimately, our findings suggest that the key limitation lies not in the technical capability of the methods themselves, but in the fundamental heterogeneity of the datasets and the differing philosophies of procedural encoding. Addressing this mismatch will be essential for any future efforts to enable international interoperability of procedural coding systems.

Acknowledgments
The authors are grateful for the valuable input and ideas contributed by the medical team of the University Medical Center Ljubljana.

Funding
This work was supported by the Slovenian Research and Innovation Agency (Research Core Funding Number P2-0209).
References
[1] [SW] Max Bachmann. rapidfuzz/RapidFuzz: Release 3.13.0, version v3.13.0, Apr. 2025. doi: 10.5281/zenodo.15133267. url: https://doi.org/10.5281/zenodo.15133267.
[2] Bundesinstitut für Arzneimittel und Medizinprodukte (BfArM). 2023. Operationen- und Prozedurenschlüssel (OPS), Version 2024: Internationale Klassifikation der Prozeduren in der Medizin – Systematisches Verzeichnis. BfArM, Bonn, Germany.
[3] Pritam Deka, Anna Jurek-Loughrey, et al. 2022. Evidence extraction to validate medical claims in fake news detection. In International Conference on Health Information Science. Springer, 3–15.
[4] Marcin Junczys-Dowmunt et al. 2018. Marian: Fast Neural Machine Translation in C++. Tech. rep. arXiv:1804.00344. Demonstration paper, version v3. arXiv, (Apr. 2018). doi: 10.48550/arXiv.1804.00344.
[5] Bunyamin Keles, Murat Gunay, and Serdar Caglar. 2024. LLMs-in-the-loop Part-1: Expert Small AI Models for Bio-Medical Text Translation. Tech. rep. arXiv:2407.12126. Preprint. arXiv, (July 2024). doi: 10.48550/arXiv.2407.12126.
[6] Nacionalni inštitut za javno zdravje (NIJZ). 2023. Klasifikacija terapevtskih in diagnostičnih postopkov in posegov: Pregledni seznam (Verzija 11). NIJZ, Ljubljana, Slovenia.
[7] Zabir Al Nazi and Wei Peng. 2024. Large language models in healthcare and medical domain: a review. Informatics, 11, 3, 57. doi: 10.3390/informatics11030057.
[8] OpenAI. 2024. GPT-4o. Accessed: August 2025. https://openai.com/index/gpt-4o.
[9] Mohammed Suleiman Mohammed Rudwan and Jean Vincent Fonou-Dombeu. 2023. Hybridizing fuzzy string matching and machine learning for improved ontology alignment. Future Internet, 15, 7, 229. doi: 10.3390/fi15070229.
[10] Xi Yang et al. 2022. A large language model for electronic health records. npj Digital Medicine, 5, 1, 194.

AI-Enabled Dynamic Spectrum Sharing in the Telecommunication Sector – Technical Aspects and Legal Challenges

Nina Rechberger
PhD Candidate, Applied AI
Alma Mater Europaea
Maribor, Slovenia
nina.rechberger@almamater.si
© 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.skui.3657

Abstract
Dynamic Spectrum Sharing (DSS), as part of Dynamic Spectrum Management, is already used in the telecommunication sector and is a critical technology for addressing spectrum scarcity in next-generation wireless networks, particularly when implementing 6G. Legacy static spectrum management (designed for one user exclusively, for a certain bandwidth, for certain services) is no longer fit for purpose, as it does not allow the efficient use of the spectrum.
By leveraging Artificial Intelligence (AI), DSS enables the real-time adaptive allocation of radio frequencies, thereby improving spectrum utilization and network efficiency. However, the integration of AI into DSS introduces complex technical and legal challenges. This paper aims to investigate the challenge of dynamic spectrum policy when using AI-enabled DSS and to answer the question of why a flexible, new spectrum policy is desired. Some suggestions for refining the regulatory framework, which are long overdue in academic research, are also presented. Recent research primarily focuses on technical issues rather than specifically on legal ones. The closing findings underscore the need for standardized protocols, adaptive regulatory policies, and other legal frameworks to ensure equitable and efficient spectrum sharing.

Keywords
AI-Enabled Dynamic Spectrum Sharing, AI, spectrum sensing, spectrum right, spectrum regulatory framework

1 Introduction
The integration of Artificial Intelligence (AI) into Dynamic Spectrum Sharing (DSS) introduces technical complexities, such as computational demands and algorithm reliability (e.g., consistency, robustness, and accuracy), alongside legal challenges, including spectrum rights allocation, interference management, and dispute resolution. However, governance frameworks for AI-enabled DSS remain underdeveloped, requiring further exploration.

The rapid growth of wireless devices and data-intensive applications has heightened demand for radio frequency spectrum, a finite resource. Traditional static management often leads to underutilized frequency bands, with inflexible policies exacerbating inefficiencies beyond the spectrum's physical scarcity [1]. AI-enhanced DSS addresses this by enabling flexible, real-time allocation of resources, adapting to dynamic demands and environments while improving spectrum sensing, resource allocation, and interference mitigation.

This study briefly examines the technical and legal dimensions of AI-enabled DSS, identifying challenges and gaps in research. As an initial exploration, it evaluates significant prior work to lay the foundation for future investigations.

2 Technical Aspects of AI-Driven Dynamic Spectrum Sharing
AI-driven DSS leverages all sorts of AI techniques to optimize spectrum utilization in dynamic, complex environments [2, 3, 4].

2.1 Spectrum Sensing and Cognitive Radio
Spectrum sensing is the cornerstone of DSS, enabling real-time detection of spectrum occupancy. AI-based techniques, such as convolutional neural networks (CNNs) and long short-term memory (LSTM) models, enhance spectrum sensing by analyzing signal patterns and predicting spectrum availability [5, 6]. CNNs are highlighted for their ability to extract features from spectral data, improving detection accuracy in noisy environments without relying on prior knowledge of signals. LSTMs are emphasized for their ability to handle sequential and time-series data.

In addition, deep learning-based spectrum sensing achieves up to a 45% improvement in detection accuracy compared with traditional methods, which rely only on basic signal processing techniques, such as energy detection, to identify spectrum occupancy [5]. Cognitive radio networks (CRNs) powered by AI allow users to opportunistically access unused spectrum bands without interfering with other users [3]. The challenges include, among others, the computational complexity of real-time processing and the need for robust datasets to train AI models. Studies highlight that AI models may struggle with unpredictable interference patterns, necessitating hybrid approaches that combine interpretable models (e.g., decision trees) with high-performing deep learning (DL) models [6].
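As an illustration only (the paper cites, but does not specify, such models), a minimal LSTM occupancy predictor of the kind discussed here could look as follows; all dimensions, names, and the sigmoid output head are arbitrary choices for the sketch.

```python
# Illustrative LSTM for next-slot spectrum occupancy prediction.
# Not from the cited works: sizes and inputs are placeholders.
import torch
import torch.nn as nn

class OccupancyLSTM(nn.Module):
    def __init__(self, n_channels=16, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_channels)

    def forward(self, x):                  # x: (batch, time, n_channels)
        out, _ = self.lstm(x)
        return torch.sigmoid(self.head(out[:, -1]))  # P(busy) per channel

model = OccupancyLSTM()
window = torch.rand(8, 50, 16)             # 8 windows of 50 time slots
p_busy = model(window)                     # (8, 16) occupancy probabilities
```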
2.2 Interference Management
Interference management is critical for ensuring reliable connectivity in DSS. AI-driven techniques, such as multi-agent reinforcement learning (MARL), optimize power allocation and beamforming to minimize interference [6]. MARL is used for mitigating jamming attacks, where malicious entities disrupt spectrum utilization by interfering with communications. Another example is reconfigurable intelligent surfaces (RIS) integrated with AI, which can dynamically adjust signal propagation to reduce interference in non-orthogonal CRNs [7]. RIS, also known as an Intelligent Reflecting Surface (IRS), is a passive, planar metasurface composed of a large array of low-cost, tunable unit cells that can dynamically manipulate incident electromagnetic waves. Unlike active devices such as base stations or relays, RIS does not generate or amplify signals—it reflects, refracts, or absorbs them in a programmable way to shape the wireless propagation environment.

Research has demonstrated that AI-driven interference management achieves a spectrum utilization efficiency of up to 62.4% in urban environments, nearly double the utilization efficiency of traditional management [5], although challenges persist, including the scalability of AI models in large networks and the risk of unpredictable behavior in edge cases. Robust fallback mechanisms are necessary to address unpredictable AI behavior in edge cases, while standardized interfaces and protocols are essential for enabling seamless deployment and integration with existing network infrastructure [5].

2.3 Resource Allocation
AI enables dynamic resource allocation by predicting network traffic and allocating spectrum based on real-time demands. Machine learning algorithms, such as support vector machines (SVMs) and deep reinforcement learning (DRL), can forecast spectrum occupancy and optimize bandwidth allocation [8]. For instance, DRL-assisted virtual network embedding (VNE) in satellite networks enhances resource utilization by adapting to multiple coverage constraints [4]. Major obstacles include the need for energy efficiency and the requirement for real-world datasets to enhance prediction accuracy. The absence of standardized testbeds and benchmarks further complicates performance evaluation [2].

3 Regulatory Challenges in AI-Enabled Dynamic Spectrum Sharing
The deployment of AI-driven DSS raises significant regulatory challenges that must be addressed. According to recent research, regulatory issues arise particularly in interference management, spectrum rights, and dispute resolution. Other legal and regulatory questions have, to the best of the author's knowledge, been completely overlooked or only superficially discussed.

3.1 Interference Management
Interference management in DSS requires regulators to ensure compliance with technical standards to prevent harmful interference. Those standards, when using AI, are missing. An example of such AI-driven systems to avoid interference is spectrum access systems (SAS), which use geolocation databases and sensing to manage the shared spectrum [9]. Simultaneously, the complexity of AI algorithms raises concerns about transparency and accountability when unwanted interference occurs. National regulatory bodies already emphasize the need for standardized protocols to ensure equitable access and interference mitigation [10, 11]. Regulators must strike a balance between innovation and the protection of incumbent users and their guaranteed rights to spectrum.

3.2 Spectrum Rights and Equitable Access
Regulatory authorities adopt the fixed spectrum access (FSA) policy to allocate different parts of the radio spectrum, with a certain bandwidth, to certain services. With such a static and exclusive spectrum allocation policy, only the authorized users, also known as licensed users, have the right to utilize the assigned spectrum, and other users are forbidden from accessing it, regardless of whether the assigned spectrum is busy or not [3]. This could be seen as directly opposed to the efficient use of the spectrum, where the use of the spectrum aligns with all available technical possibilities. Spectrum rights allocation is a contentious issue in DSS, as AI enables dynamic access by multiple users and challenges traditional licensing models. Spectrum rights allocation is traditionally static – one user to a particular band.
On the other hand, with shared access regimes, such as licensed shared access (LSA), regulators allow spectrum users to open spectrum bands while protecting incumbent users [12]. However, only a few countries have adopted this option, and it comes with numerous regulatory restrictions. For explanation, incumbent users are historically incumbent telecommunications operators, who paid significant fees for the licence to use the spectrum. Therefore, spectrum licenses are important assets for incumbent users. Nevertheless, AI-driven DSS raises concerns about monopolistic practices because dominant operators may leverage advanced algorithms to secure disproportionate spectrum access [13, 14, 15]. Legal frameworks must therefore evolve to address equitable access for smaller operators and license-exempt users while simultaneously protecting the guaranteed rights of incumbent users/operators. The absence of clear spectrum rights allocation policies risks exacerbating disputes and stifling innovation in the industry.

3.3 Dispute Resolution
Dispute resolution in DSS tackles conflicts over interference, spectrum access, and user priority. AI systems complicate this due to poor interpretability, obscuring decision processes [6]. AI-driven user prioritization can spark fairness disputes. National spectrum strategies propose interagency resolution processes [6, 10]. Explainable AI models (e.g., XAI) improve transparency, aiding dispute resolution [6]. Blockchain-based databases offer tamper-proof spectrum usage records, simplifying conflict resolution [6].

4 The Need for New AI-Enabled DSS Governance: Suggested New Framework
As stated above, traditional regulatory frameworks designed for static spectrum licensing are ill-equipped to handle the autonomous and data-intensive nature of AI. The proposed regulatory framework should impose legal mechanisms to address more flexible licensing, privacy and data protection, interference management, security, and international coordination, ensuring compliance and fostering innovation. The objectives of the new framework, in the author's opinion, are: Enabling Innovation; Ensuring Compliance, that is, aligning with existing laws (e.g., national telecom regulations, the Data Act, the Artificial Intelligence Act, etc.); Promoting Fairness, which means ensuring equitable spectrum access and accountability in AI decisions; Supporting Global Harmonization, to align with international standards (e.g., ITU, 3GPP); Security and Cybersecurity; and Promoting Regulatory Sandboxes, to enable safe testing of AI-driven DSS.
4.1 Proposed Legal and Regulatory Framework
4.1.1 Dynamic Licensing Model. Replacing the current policy of static and exclusive spectrum allocation with the Dynamic Licensing Model is a key principle, or, even better, the Dynamic Licensing Model should be prioritized. This could include a tiered access system (primary, secondary, and opportunistic users) managed by AI-driven Spectrum Access Systems (SAS) [3, 12, 9]. This means enacting laws defining tiered access rights, specifying priority levels and usage conditions. For instance, extending the U.S. Citizens Broadband Radio Service (CBRS) model, where SAS dynamically assigns spectrum, with legal provisions for AI oversight and auditability. Refinements to the European Electronic Communications Code (EECC) [13] to add AI spectrum management tools are another possible example. First, a definition of DSS should be added and represented (e.g., in Art. 2).

The dynamic licensing model can use blockchain-based smart contracts to automate spectrum allocation, ensuring transparency and enforceability. Regulators should issue guidelines for AI algorithms to prioritize licensed users while optimizing opportunistic/dynamic access, and impose penalties for non-compliance.

4.1.2 Privacy and Data Protection. The goal is to require licensed users to implement privacy-preserving AI techniques (e.g., federated learning and differential privacy) to minimize data exposure. Minimal data exposure goes beyond personal data and should be extended to all processed data sets. AI systems in DSS are designed to process only the data necessary for the requested task. Memorized data, such as geolocation and traffic patterns, should be encrypted. Therefore, developing standards for anonymized data processing in DSS, with certification for compliant AI systems, is necessary. For instance, blockchain contracts and differential privacy could enhance efficiency in dense networks and align with the principle of minimizing sensitive data sharing. On the other hand, all the data relevant for enabling AI-enabled DSS must be shared; the Data Act of the EU could address this issue.

Privacy and data protection are strongly connected to the right to explanation (transparency). Therefore, it is necessary to mandate transparency in AI-driven spectrum decisions, allowing users to challenge allocations [6, 11]. Although the Artificial Intelligence Act of the EU requires high-risk AI systems (the DSS component is legally interpreted as critical infrastructure) to face a strong transparency obligation, in the context of DSS it needs to be technically detailed.
DSS can be defined as the shared use of the radio spectrum, enabling flexible, real-time allocation of spectrum bands among multiple users and rights. In spectrum management (Art. 45 EECC), the goal should be, by default, to privilege AI-enabled DSS, adding tiered access for designated services, when appropriate, with appropriate certification. Spectrum management could thus be flexible enough for new technologies and, at the same time, compliant, as an exception to the technology- and service-neutral principle traditionally anchored in the EECC, because general interest objectives are at stake and can be clearly justified and subject to regular review. From a practical point of view, mandating AI-predictive models for real-time allocation in "AI-harmonized" bands that require shared AI datasets could be discussed in future peer reviews. The neutral authorization regime for spectrum designation, with some exceptions, should move to the explicit inclusion of AI/ML, with possible certification for bias-free algorithms and energy metrics in an additional separate regulation, such as the Gigabit Infrastructure Act (GIA), which is intended to simplify access to physical infrastructure in this sector. Art. 46 EECC is meant only to encourage shared access, while default AI-driven DSS could drive spectrum sharing to another level.

4.1.3 Interference Management, Liability and Dispute Resolution. Clear liability rules for AI-induced interference, balancing the responsibilities of operators, secondary users, and vendors, must be established. A shared liability model could be a solution: operators as primary users could be liable for interference unless it is caused by secondary users or by vendor/distributor/supplier AI errors, verified through forensic logs. The interference threshold must be introduced and known up front. Legal limits for acceptable interference, e.g., signal-to-noise ratio standards, should also be defined. The requirement for AI systems to maintain tamper-proof logs of spectrum allocation decisions, accessible to stakeholders when needed, is a good way to ensure the transparent operation of DSS. These logs can then be used as evidence before competent bodies in dispute resolution to resolve interference disputes involving AI decisions [5, 10].

4.1.4 International Standardization. Promoting harmonized standards for AI-driven DSS through international bodies like the ITU and 3GPP is just one side of the remaining challenges, like interoperability. Negotiating bilateral and international treaties to align spectrum sharing protocols and data sovereignty rules is another issue. For instance, the ITU's World Radiocommunication Conference (WRC) could develop model laws for national adoption, ensuring compatibility with global 5G and 6G standards [10, 14, 15]. Cross-border coordination (e.g., Art. 4 EECC) could also be expanded, with RSPG-led cooperation utilizing AI tools for interference resolution.
4.1.5 Security and Cybersecurity. A robust cybersecurity framework for AI-driven DSS systems is aimed at preventing attacks such as data poisoning. Cybersecurity standards for AI-enabled DSS must still be developed. These standards will include encryption, intrusion detection, and regular security audits for AI systems, as well as the reporting of security breaches. Certifying AI systems for DSS cybersecurity compliance, together with the development of AI-enabled defences, is also needed [10, 14, 15].

4.1.6 Regulatory Sandboxes. Creating controlled environments to test AI-driven DSS without full regulatory constraints could be a way to overcome development compliance. Sandbox legislation should define the scope, duration (e.g., 1–2 years), and liability exemptions for sandbox participants. Launching pilot programs with telecom operators and ensuring legal protections for experimental deployment are essential for the progress of AI-enabled DSS. After the test period, the transition to actual use in the real world would be enhanced because of a good testing foundation in a technological and regulatory sense. A good example is the model based on the UK's Ofcom sandbox, tailored for AI-driven 6G applications [10, 14, 15]. When it comes to regimes for authorization (e.g., Art. 47 EECC), introducing "AI-sandbox" authorizations for DSS testing accelerates innovation through pilots accompanied by authorization. This is also in line with the AI Act, where sandboxes represent well-documented risk mitigation and, as a result, transparency.

5 Conclusion
In this paper, the author examined AI-enabled DSS from a technical and legal governance perspective. This is a notable achievement because there is a significant gap in research in this field.

This paper aimed to highlight some dimensions of the interaction between technological perspectives and the governance of AI-enabled DSS.
After reviewing the adversarial and inherited technical challenges, such as resource allocation, interference management, and spectrum sensing, the legal issues of interference management, spectrum allocation, and equitable access, along with dispute resolution, are briefly discussed. Moving into the future, a possible new regulatory framework is presented, including a dynamic licensing model, the implementation of privacy-preserving AI techniques in DSS, and a shared liability approach to interference management that could also contribute to dispute resolution. Briefly, the importance of international standardization and interoperability, as well as cybersecurity threats such as data poisoning and the lack of standardization, is mentioned. Lastly, creating regulatory, not only technical, sandboxes as controlled testing environments is proposed.

Acknowledgments
This work was encouraged by the University Alma Mater Europaea and the Doctoral Program in Applied Artificial Intelligence.

References
[1] Pranita Bhide, Dhanush Shetty, and Suresh Mikkili. 2024. Review on 6G communication and its architecture, technologies included, challenges, security challenges and requirements, applications, with respect to AI domain. IET Quantum Communication. https://doi.org/10.1049/qtc2.12114
[2] Bushra Sabir et al. 2024. Systematic Literature Review of AI-enabled Spectrum Management in 6G and Future Networks. arXiv preprint arXiv:2407.10981. https://doi.org/10.48550/arXiv.2407.10981
[3] Ying-Chang Liang. 2020. Dynamic Spectrum Management: From Cognitive Radio to Blockchain and Artificial Intelligence. Springer. https://doi.org/10.1007/978-981-15-0776-2
[4] Abdulrahman Alhammadi et al. 2024. Artificial Intelligence in 6G Wireless Networks: Opportunities, Applications, and Challenges. International Journal of Communication Systems. https://onlinelibrary.wiley.com/doi/10.1002/dac.5443
[5] Saurabh Hitendra Patel. 2024. Dynamic Spectrum Sharing and Management Using Drone-Based Platforms for Next-Generation Wireless Networks. Preprints.org. https://www.preprints.org/manuscript/202412.0854/v2
[6] Abiodun Gbenga-Ilori. 2025. Artificial Intelligence Empowering Dynamic Spectrum Access in Advanced Wireless Communications: A Comprehensive Overview. MDPI. https://www.mdpi.com
[7] Robin Chataut et al. 2024. 6G Networks and the AI Revolution—Exploring Technologies, Applications, and Emerging Challenges. PMC. https://pmc.ncbi.nlm.nih.gov/articles/PMC10969307
[8] Mehmet Ali Aygül. 2025. Machine learning-based spectrum occupancy prediction: a comprehensive survey. Frontiers in Communications and Networks. https://www.frontiersin.org/articles/10.3389/frcmn.2024.1345678
[9] Janette Stewart. 2024. Improved management of shared spectrum: a potential AI/ML use case. Analysys Mason. https://www.analysysmason.com
[10] Anonymous. 2024. Advanced Dynamic Spectrum Sharing Demonstration in the National Spectrum Strategy. National Telecommunications and Information Administration. https://www.ntia.gov/issues/national-spectrum-strategy/advanced-dynamic-spectrum-sharing-demonstration-in-the-national-spectrum-strategy
[11] Anonymous. 2025. FCC TAC AI-WG Artificial Intelligence Meeting Slides. https://www.fcc.gov/sites/default/files/08-05-2025-FCC-TAC-Meeting-Slides-Merged.pdf
[12] Anonymous. 2025. Spectrum management: Key applications and regulatory considerations driving the future use of spectrum. Digital Regulation Platform. https://digitalregulation.org
[13] Directive (EU) 2018/1972 of the European Parliament and of the Council of 11 December 2018 establishing the European Electronic Communications Code. http://data.europa.eu/eli/dir/2018/1972/oj
[14] Anonymous. 2024. Artificial Intelligence in Spectrum Management: Policy and Regulatory Considerations. IEEE Conference Publication. https://ieeexplore.ieee.org
[15] Haval Hussein. 2025. AI-Driven Cognitive Radio Networks for 6G: Opportunities and Challenges. IEEE Transactions on Wireless Communications. https://ieeexplore.ieee.org

SmartCHANGE Risk Prediction Tool: Next-Generation Risk Assessment for Children and Youth

Nina Reščič, Marko Jordan, Sebastjan Kramar, Ana Krstevska, Marcel Založnik, Mitja Luštrek
Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
nina.rescic@ijs.si

Lotte van der Jagt, Harm op den Akker, Martijn Vastenburg
ConnectedCare Research & Development, Nijmegen, The Netherlands

Valentina Di Giacomo, Elena Mancuso
Engineering Ingegneria Informatica SpA, Rome, Italy

Dario Fenoglio, Gabriele Dominici
Università della Svizzera italiana, Lugano, Switzerland

© 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.skui.7226

Abstract
Non-communicable chronic diseases (NCDs), largely driven by lifestyle factors such as poor nutrition, physical inactivity, and obesity, account for over 70% of mortality in Europe. While prevention has traditionally focused on adults, growing evidence highlights the value of early intervention during childhood and adolescence to establish healthy behaviours and reduce long-term risk. This paper presents the updated SmartCHANGE platform, which harmonizes heterogeneous datasets, addresses missing information through synthetic data generation, and forecasts risk factors from childhood to adulthood. Forecasts are then applied to established cardiovascular and diabetes risk models, enabling long-term risk assessment. To ensure privacy, the platform incorporates federated learning for secure model training across distributed datasets. By combining synthetically generated data, predictive modelling, privacy-preserving infrastructure, and end-user applications, the updated SmartCHANGE platform supports early identification of at-risk youth and enables targeted, data-driven interventions to help reduce the future burden of NCDs.

The new version introduces three advances: (i) broader harmonization of European cohort datasets through refined syntactic and semantic alignment; (ii) improved synthetic data generation that addresses the heterogeneity of the datasets; and (iii) evaluation of advanced RNN-based architectures alongside conventional ML models. While the pipeline in the previous paper powered a simple demo, this one is integrated into the SmartCHANGE prototype, which enables early identification of at-risk youth and supports the development of tailored preventive strategies. By combining harmonized datasets, predictive modelling, and privacy-preserving methods, it represents a step toward proactive, data-driven public health focused on youth as a critical stage for prevention. In addition, explainable AI was used to generate counterfactuals that support understanding of risk factors, and both web and mobile applications were developed to deliver these insights directly to healthcare professionals, adolescents, and families.
Keywords
non-communicable diseases, risk prediction, synthetic data generation, federated learning, preventive healthcare

1 Introduction
Non-communicable diseases (NCDs), including cardiovascular disease and diabetes, cause over 70% of deaths in Europe [6]. Their onset is shaped by modifiable risk factors such as diet, physical inactivity, obesity, smoking, and alcohol use. While prevention strategies typically target adults, growing evidence highlights childhood and adolescence as critical periods for establishing lifelong health behaviours [5]. Addressing risk early can delay or prevent NCD onset and promote long-term well-being.

In this paper, we describe an updated pipeline for predicting NCD risk in young people, building on our previous paper [4].

2 Baseline Predictive Approach
The models for forecasting risk factors are trained on seven heterogeneous datasets, none of which contain all the variables needed for risk prediction. The baseline predictive approach includes synthetic data generation and forecasting of individual risk factors from young to older age using various established machine-learning models. These forecast risk factors are then fed into established risk-prediction models to estimate the risk of cardiovascular disease and diabetes.

2.1 Synthetic Data Generation
Synthetic data generation was used to improve data completeness, enhance cross-dataset comparability, and support more robust forecasting and predictive modeling.

2.1.1 Generation of Diet Scores. The risk models required full dietary information, but none of the project datasets contained all the variables needed for diet scores. We therefore used the EUMenu dataset, which includes the complete set of dietary variables. Scores were first calculated for all EUMenu individuals. For project datasets with overlapping dietary or related features, we trained predictive models on EUMenu using only shared variables and generated synthetic diet scores accordingly. Given the task's simplicity and data structure, linear models were applied.
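A minimal sketch of this transfer step, assuming pandas frames and scikit-learn; the column names and toy data are invented for illustration.

```python
# Diet-score transfer: fit a linear model on EUMenu using only the
# columns shared with a project dataset, then predict synthetic scores.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def synthesize_diet_scores(eumenu, target, shared):
    model = LinearRegression()
    model.fit(eumenu[shared], eumenu["diet_score"])
    return model.predict(target[shared])

rng = np.random.default_rng(0)             # toy example, made-up columns
eumenu = pd.DataFrame(rng.random((100, 3)), columns=["fruit", "veg", "fish"])
eumenu["diet_score"] = eumenu.mean(axis=1)
target = pd.DataFrame(rng.random((20, 2)), columns=["fruit", "veg"])
scores = synthesize_diet_scores(eumenu, target, shared=["fruit", "veg"])
```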
2.1.2 Generation of Other Data. We generated synthetic values for missing variables by constructing targeted sub-datasets and generating data with supervised learning. Each sub-dataset required core demographics (sex, age, weight, height); rows missing these were discarded to ensure stable baselines. A greedy search selected predictor sets that maximized coverage of missing entries, informativeness beyond demographics, and training sample size. Candidate sets were ranked by Score = U × V × √K, where U is the number of missing instances covered, V the number of predictors, and K the number of training rows.

For each sub-dataset, Gradient Boosting, Random Forest, and Linear Regression models were trained with k-fold cross-validation and grid search. Validation was assessed with Root Relative Squared Error (RRSE; RRSE = 0 for perfect predictions, RRSE = 1 for the baseline), and the best model generated the missing values. Overlaps were resolved by keeping predictions from the model with the lower RRSE. This process was repeated across variables to expand coverage while minimizing error. Data generation proceeded iteratively: after each pass, synthetic variables were evaluated with RRSE. Variables below a threshold were accepted and treated as ground truth in the next pass, with sub-datasets and models recomputed accordingly. The procedure terminated once no further variables met inclusion or performance plateaued, yielding a consistent dataset. The mean RRSE of synthetic values in the final dataset was 0.795.
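The two quantities that drive this procedure are easy to state in code; the functions below follow the definitions in the text, while everything around them (data handling, the greedy loop itself) is omitted.

```python
# Candidate-set score and RRSE, as defined above.
import numpy as np

def candidate_score(u_covered, v_predictors, k_rows):
    """Score = U * V * sqrt(K)."""
    return u_covered * v_predictors * np.sqrt(k_rows)

def rrse(y_true, y_pred):
    """Root Relative Squared Error: 0 = perfect, 1 = mean baseline."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    num = np.sum((y_true - y_pred) ** 2)
    den = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(np.sqrt(num / den))
```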
2.2 Risk Factor Forecasting
Having generated synthetic data, we constructed a merged dataset with no missing values. This dataset was used to train machine learning (ML) models to forecast health-related risk factors from childhood into adulthood. The predicted values were then applied as inputs to publicly available risk models to estimate the risk of developing NCDs.

We implemented a neural network (NN) with two dense layers (512 and 128 neurons) to capture non-linear patterns. Training used MSE loss, the Adam optimizer, ReLU activations, dropout (0.2), and early stopping. A single NN forecasted all risk factors simultaneously. Training and test data were prepared by generating all younger-to-older age pairs per individual. Inputs included gender, input and target age, and risk factors at the input age; targets were the same risk factors at the target age. This design enabled the model to learn age-progressive changes.

Input–output pairs were split into training, validation, and test sets, with each individual assigned to only one partition to avoid leakage. Stratification by dataset preserved source representation. Features were standardized with scikit-learn's StandardScaler. For comparison, we trained traditional ML models separately per variable: Linear Regression, Ridge Regression, Random Forest, and LightGBM (the latter via the lightgbm library). All models used default parameters and were trained/tested on the same pairs as the NN. Performance was measured with MAE and RRSE. Training used both real and synthetic data, but evaluation was restricted to real data. Input ages ranged from 6–18 years, and target ages from 18–55 years, matching the SmartCHANGE forecasting scope. The mean RRSE of the forecast values was 0.829.

2.3 Risk Models
We focused on two models: the Healthy Heart Score (HHS) for cardiovascular disease and Test2Prevent (T2P) for diabetes risk. Both include lifestyle factors such as physical activity and diet—essential for assessing younger populations and behavioural change—aligning with our goal of early prevention through modifiable risk factors. Using both models balanced clinical reliability with behavioural relevance, enabling a more comprehensive NCD risk assessment.

Our initial approach applied the models at age 55, the maximum forecastable age. This yielded inconsistent outputs: T2P produced 10-year risks (55–65), while HHS produced a 20-year risk (55–75). To resolve this, we instead reported cumulative risks to age 65, the most suitable endpoint given our data. Two strategies were evaluated: non-overlapping intervals and overlapping (hazard-averaging) intervals.

3 Advanced Unified Predictive Approaches
This section introduces advanced forecasting methods designed to work directly on heterogeneous datasets without requiring prior synthetic data generation. Despite their greater sophistication, their accuracy lags behind the more straightforward method that relies on synthetic data generation.

Synthetic data generation and forecasting are trained jointly within a single model, enabling the sharing of representations and feedback. Early layers provide initial estimates for both tasks, while later stages refine them by capturing complex temporal dependencies. Although SmartCHANGE uses only single-year inputs per user, the training dataset includes multi-year records, which reveal broader behavioural patterns.

Before entering the network, variables are normalized using training-set statistics. Synthetic values are first generated in a linear block conditioned on age, gender, and BMI. This block consists of two fully connected layers (128 neurons + ReLU, then 21 neurons without activation). Forecasting then adds current age, future age, and gender, and predicts 21 risk factors across ages 6–55. The forecasting block differs by including an additional 128-neuron ReLU layer and more inputs. Forecasting is performed separately for each input year, and if multiple years exist, trajectories are averaged across target ages (e.g., data at ages 7, 9, and 12 yield three trajectories averaged per year).

This produces a time series of shape (50, 21). Appending masks for observed/synthetic values and gender gives (50, 43). Risk factor trajectories are then refined via a GRU block with bidirectional layers (128 or 21 hidden units) and a final 21-neuron linear layer. Predictions are finally de-normalized back to the original scale. The overall loss is the mean of two MAE terms, imputation and forecasting, with the latter computed only on ground-truth variables in the recorded output year.

The model was evaluated in the same way as the one in Section 2.2, with a mean RRSE of 0.907. This is higher (worse) than the RRSE from Section 2.2, indicating the need for further refinement of the unified approach.
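A condensed PyTorch sketch of the blocks just described (layer sizes follow the text; the training loop, masking, trajectory averaging, de-normalization, and the exact composition of the blocks are omitted or simplified):

```python
# Unified model, sketched: imputation MLP, forecasting MLP, and the
# bidirectional GRU refinement over the (years, factors+masks+gender)
# series. 43 = 21 factors + 21 masks + 1 gender flag.
import torch
import torch.nn as nn

N_FACTORS, N_YEARS = 21, 50

impute = nn.Sequential(                    # (age, gender, BMI) -> 21 values
    nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, N_FACTORS))

forecast = nn.Sequential(                  # factors + (age, future age, gender)
    nn.Linear(N_FACTORS + 3, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, N_FACTORS))

gru = nn.GRU(input_size=2 * N_FACTORS + 1, hidden_size=128,
             batch_first=True, bidirectional=True)
head = nn.Linear(2 * 128, N_FACTORS)

series = torch.rand(4, N_YEARS, 2 * N_FACTORS + 1)  # stand-in input
refined = head(gru(series)[0])                       # (4, 50, 21) trajectories
```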
4 Privacy Preservation and Explainability
Privacy Preservation. Within the SmartCHANGE project, health datasets are distributed across multiple countries and institutions. These sensitive data fall under strict regulations (e.g., GDPR), which prohibit cross-border sharing, and new pilot data remain stored locally, reinforcing isolation. Federated Learning (FL) addresses this by enabling collaborative training without moving raw data [3]. Two main challenges arise in deployment: pronounced heterogeneity across sites and residual privacy risks, since shared gradients can still leak information. To mitigate these, we developed distribution-aware, privacy-preserving FL strategies tailored to real-world healthcare [2]. Instead of a single global model, our approach builds compact, differentially private descriptors of each client's data distribution, clustering similar clients to train specialized models. This improves robustness to variability and temporal drift while ensuring fairer predictions, including for underrepresented groups. On the privacy side, model partitioning and communication-efficient aggregation reduce leakage without heavy cryptography by fragmenting gradients and distributing aggregation. Together, these strategies enable scalable, robust, and privacy-preserving FL pipelines for health risk prediction.
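As a toy illustration of the clustering idea only (the project's actual descriptors are differentially private and the training itself is federated, neither of which is shown here):

```python
# Group clients by a noisy summary of their local data distribution,
# then train one specialized model per group (training not shown).
import numpy as np
from sklearn.cluster import KMeans

def client_descriptor(X, noise_scale=0.1, rng=np.random.default_rng(0)):
    desc = np.concatenate([X.mean(axis=0), X.std(axis=0)])
    return desc + rng.normal(0.0, noise_scale, desc.shape)  # crude DP stand-in

clients = [np.random.rand(200, 5) for _ in range(10)]        # local datasets
descriptors = np.stack([client_descriptor(X) for X in clients])
groups = KMeans(n_clusters=3, n_init=10).fit_predict(descriptors)
```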
Explainability. Beyond predictive accuracy, effective NCD risk assessment must also provide transparent explanations and actionable guidance. For this, we adapt the Counterfactual Concept Bottleneck Model (CF-CBM) [1] to early-life health data. Instead of relying on predefined concepts—often unavailable or inconsistently annotated—our model learns patient feature distributions via a variational autoencoder (VAE), ensuring the latent space captures key generative factors of early-life trajectories. Counterfactuals are then generated following CF-CBM principles: given a patient profile and its predicted risk, the system proposes minimally altered, realistic configurations that would change the outcome. For example, if a child is predicted to be at high diabetes risk, the model may suggest plausible counterfactual profiles where lifestyle or physiological factors are adjusted to reduce risk. By embedding counterfactual reasoning directly into the pipeline, this approach goes beyond post-hoc interpretability. It both explains which factors drive predictions and identifies how risk can be reduced, offering clinicians and families actionable, personalized strategies for early prevention.

5 Architecture and User Applications
Architecture. The SmartCHANGE platform (Figure 1) is a modular, microservices-based system for AI-driven health interventions in children and adolescents. It integrates the predictive pipeline described in the previous sections with secure, scalable, and privacy-preserving technologies, with emphasis on GDPR compliance and explainable AI. Two main client interfaces are provided: the HappyPlant mobile app for families and youth, and a web application for healthcare professionals (HCPs).

Authentication and authorization are handled through the OpenID Connect (OIDC) protocol, with role-based access control and single sign-on. Additional safeguards include encrypted communication, pseudonymization, and immutable audit logging. Together, the SmartCHANGE platform, HappyPlant, and the HCP web interface form an integrated ecosystem for preventive healthcare, uniting advanced technical architecture with user-centered design to deliver effective, scalable, and personalized interventions.

Web Application. The web application for HCPs serves as a clinical dashboard, enabling them to access patient data, assess long-term risk for metabolic diseases (currently diabetes and CVD, although it can be scaled to integrate additional prediction models), and support behaviour change strategies. The interface is structured around a clinically aligned workflow — Consultation, Assessment, and Intervention — mirroring real-world practices.

Mobile Application. While intelligent risk predictions support HCPs in guiding clients, evidence and co-creation results show that simply communicating risks is insufficient for sustainable behaviour change in adolescents and families. The HappyPlant app was designed to address this gap. Rather than focusing on risks, it adopts a playful plant-growth analogy: users care for a virtual plant by completing daily and weekly personalized challenges linked to long-term health goals set by the HCP. The app nudges users towards the most suitable challenges but leaves the final choice to them, supporting autonomy and agency.

To foster long-term engagement, fully grown plants can be placed in the user's Goal Garden, which both showcases past achievements and acts as a reinforcement mechanism. In today's reward-driven context, the Goal Garden also enables saving towards real-life rewards set by parents, further motivating users. The app's design emerged from an extensive co-creation process and iterative validation with users, who responded positively to the analogy, challenge, and reward structure, as well as the aesthetics. Development was kept flexible, with adjustments made to align the app with other SmartCHANGE components.

6 Conclusion
This paper provides a concise description of the SmartCHANGE pipeline, which integrates harmonized datasets, synthetic data generation, federated learning, and explainable AI into a secure platform for early NCD risk prediction and prevention. Through the HappyPlant app and professional interface, these methods are translated into user-centered interventions that support sustainable behaviour change in youth. Detailed descriptions of the individual components will be published separately.

Acknowledgements
This work was carried out as part of the SmartCHANGE project, which received funding from the European Union's Horizon Europe research and innovation program under grant agreement No 101080965. The SLOfit and ACDSi datasets were provided by the University of Ljubljana (courtesy of Gregor Jurak et al.), the LGS dataset was provided by KU Leuven (courtesy of Martine Thomis), the AFINA-TE dataset was provided by the University of Porto (courtesy of José Ribeiro), the ABCD dataset was provided by VUMC, the HELENA dataset was provided by the HELENA study group (courtesy of Francisco Ortega), and the University of Turku provided the YFS dataset. We are grateful for their support.

References
[1] Gabriele Dominici, Pietro Barbiero, Francesco Giannini, Martin Gjoreski, Giuseppe Marra, and Marc Langheinrich. 2025. Counterfactual concept bottleneck models. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=w7pMjyjsKN
[2] Dario Fenoglio, Gabriele Dominici, Pietro Barbiero, Alberto Tonda, Martin Gjoreski, and Marc Langheinrich. 2024. Federated behavioural planes: explaining the evolution of client behaviour in federated learning. In Advances in Neural Information Processing Systems (NeurIPS 2024), Vol. 37, 112777–112813.
[3] Dario Fenoglio, Daniel Josifovski, Alessandro Gobbetti, Mattias Formo, Hristijan Gjoreski, Martin Gjoreski, and Marc Langheinrich. 2023. Federated learning for privacy-aware cognitive workload estimation. In Proceedings of the 22nd International Conference on Mobile and Ubiquitous Multimedia (MUM '23). ACM, New York, NY, USA, 25–36. doi: 10.1145/3626705.3627783.
[4] Marko Jordan, Nina Reščič, Sebastjan Kramar, Marcel Založnik, and Mitja Luštrek. 2024. SmartCHANGE risk prediction tool: demonstrating risk assessment for children and youth. In Slovenska konferenca o umetni inteligenci. Zvezek A: zbornik 27. mednarodne multikonference Informacijska družba – IS 2024, 10.–11. oktober, Ljubljana, Slovenija = Slovenian Conference on Artificial Intelligence. Vol. A: proceedings of the 27th International Multiconference Information Society – IS 2024. Ljubljana, Slovenia, 71–74.
[5] K. Pahkala, H. Hietalampi, T. T. Laitinen, J. S. Viikari, T. Rönnemaa, H. Niinikoski, et al. 2013. Ideal cardiovascular health in adolescence: effect of lifestyle intervention and association with vascular intima-media thickness and elasticity (the Special Turku Coronary Risk Factor Intervention Project for Children [STRIP] study). Circulation, 127, 18, (May 2013), 2088–2096.
[6] World Health Organization. 2018. Global Health Estimates 2016: Deaths by Cause, Age, Sex, by Country and by Region, 2000–2016. World Health Organization.

Figure 1: Logical Architecture of the SmartCHANGE Platform, including the mobile app (HappyPlant) and the web app for healthcare professionals, connected to a central FHIR-compliant repository and featuring a Trustworthy AI Framework with federated learning, explainability, and secure communication via the XCDS Engine.

Figure 2: HappyPlant app screens: the home, challenge and goal garden screens.

GNN Fusion of Voronoi Spatial Graphs and City–Year Temporal Graphs for Climate Analysis

Alex Romanova
Independent Researcher
McLean, VA, USA
sparkling.dataocean@gmail.com

Abstract
We present a two-stream graph framework for climate similarity that fuses geography with long-term dynamics. A globe-spanning Voronoi network links cities whose cells share a boundary, while per-city temporal graphs encode decades of daily temperatures in 1000 cities over 40 years. We learn (i) temporal embeddings via a GNN graph-classification model on city–year graphs and (ii) spatial embeddings via a GNN link-prediction model on the Voronoi backbone, using either raw climatology vectors or the learned temporal embeddings as inputs. Treating cosine similarity as edge weights (using 1 − cosine) enables graph-mining views: closeness maps highlight dense climate belts, and betweenness maps surface long-range "bridges" connecting distant regions. The fused approach uncovers patterns that simple averages miss, including nearby cities with low similarity (microclimates, urban form, or data aliasing) and far-apart cities with high similarity (shared seasonal regimes/latitude bands). We also incorporate the Delaunay triangulation—the dual of Voronoi—to provide a geometrically well-posed neighbor network that stabilizes triangle-based analyses; we use it as a robustness check to ensure these results are not tied to a single choice of spatial adjacency. The method is scalable and reproducible, and the same template—spatial adjacency + temporal history + GNN fusion—extends beyond temperature to additional variables and to urban and infrastructure applications.
Keywords
graph neural networks, spatiotemporal modeling, climate analysis, Voronoi tessellation, Delaunay triangulation

1 Introduction
Understanding global climate patterns is critical to the climate-change challenge. In this study, we explore a graph-based framework that integrates geographic layout with long-term temporal behavior. As a data source, we use climate records for 1,000 of the world's most populated cities with 40 years of daily temperatures. This dataset (Kaggle [7]) provides geolocations and multi-decade time series, allowing us to combine spatial and temporal perspectives.

Our spatial backbone is a Voronoi graph: from city coordinates, each city receives a Voronoi cell (the region closer to that city than to any other), and two cities are connected when their cells share a border—an interpretable, globally consistent notion of proximity. Alongside Voronoi, we also construct the Delaunay triangulation over the same points. Delaunay provides a complementary, dual view of neighborhood structure and enables triangle-based analyses; we use it as a robustness check to ensure results are not tied to a single choice of spatial adjacency.

For temporal behavior, each city is represented by a graph whose nodes are city–year pairs with daily-temperature profiles as features. Years are linked when their profiles exceed a cosine-similarity threshold. We add a virtual node so that each city graph forms a single connected component.

To analyze climate across space and time, we use basic vectors and pre-final vectors from Graph Neural Network (GNN) models. Figure 1 illustrates four representations used throughout the paper:
• Average — climatology vectors (365-day averages) per city;
• Temporal — embedded city graphs: pre-final vectors from a GNN graph classification model on per-city year graphs;
• Spatial — embedded Voronoi nodes: pre-final vectors from a GNN link-prediction model on the Voronoi graph with average vectors as inputs;
• Spatial+Temporal — re-embedded nodes: pre-final vectors from a GNN link-prediction model on the Voronoi graph using temporal embeddings as inputs.

Figure 1: Node feature types for climate similarity.

We previously introduced the use of pre-final vectors from a GNN graph classification model on city temporal graphs [17] and applied linear-algebra analyses to those outputs. In this study we contribute:
• Construction of a globe-spanning Voronoi spatial graph and its Delaunay triangulation as complementary spatial backbones;
• Comparisons across input climatology vectors, output city-graph embeddings, and spatial node embeddings from link prediction;
• Graph-mining analyses on induced graphs from each vector type, highlighting agreements and differences across spatial and temporal representations.
2 Related Work
In 2012, two milestones reshaped AI: AlexNet's convolutional neural network set a new benchmark in large-scale image classification, far surpassing prior methods [9, 12], and Google's Knowledge Graph operationalized entity-relationship understanding at web scale, transforming data integration, search, and management [15]. These lines of work initially evolved in parallel: CNNs excelled on grid-structured data, while graph methods targeted relational structure. The emergence of graph neural networks (GNNs) in the late 2010s bridged this gap by combining deep learning with graph computation to model complex dependencies [2]. Despite the rise of large language models (LLMs) since 2022, GNNs remain essential for tasks grounded in explicitly graph-structured data.

GNNs are now standard for classification and link prediction on graph-structured data [14, 1]. At web scale, industrial recommender systems adopt scalable inductive variants such as PinSage [20], while temporal/dynamic settings leverage trajectory-predictive embeddings like JODIE [10]. Community benchmarks have further standardized evaluation for large graph learning (e.g., OGB) [5]. In geophysics, recent studies demonstrate the effectiveness of GNNs for medium-range global weather forecasting [11], global atmospheric prediction [8], and spatiotemporal hydrology and geoscience tasks such as groundwater dynamics [19] and frost-event forecasting with attention mechanisms [13], supporting the view that graph-based inductive biases are well suited to environmental systems with strong spatial and temporal structure.

Voronoi tessellations provide natural adjacency via shared cell boundaries and have a long history in climate and global modeling [6]. Recent applications use Voronoi-induced graphs for urban risk modeling and natural hazards: Gan et al. propose a Voronoi-based spatiotemporal GCN for traffic crash prediction [3], while Razavi-Termeh et al. leverage Voronoi entropy in flood susceptibility mapping [16]. Our work synthesizes these ideas by constructing a global Voronoi-based spatial graph of cities enriched with long-term temperature signals and combining it with per-city temporal graphs encoded by GNNs.

Figure 2: Voronoi edge between distant cities: Québec and Porto are neighbors because their cells meet across the Atlantic.

Figure 3: Largest Voronoi triangle: Wellington-Port Elizabeth-Mar del Plata illustrates long edges formed in sparsely populated regions.

3 Methods
3.1 Graph Construction
We construct a global spatial graph by computing a planar Voronoi diagram on Web Mercator (EPSG:3857) city coordinates; two cities are adjacent if their cells touch. The Voronoi/Delaunay structure is used only to define adjacency (not distances or areas), yielding a simple, interpretable map of city neighborhoods worldwide.

We evaluate four alternative node-feature sets:
(1) 365-day climatology vectors: for each city, a 365-value day-of-year climatology averaged across all available years.
(2) Temporal vectors: pre-final embeddings from the GNN graph-classification model on each city's year-by-year graph (years linked when their daily profiles exceed a cosine-similarity threshold).
(3) Link-prediction vectors (from averages): pre-final embeddings from a GNN link-prediction model on the Voronoi graph using the 365-day climatology vectors as inputs.
(4) Link-prediction vectors (from temporal vectors): the same GNN link-prediction setup, but with temporal GNN embeddings as inputs.
This design allows direct comparison of spatial, temporal, and hybrid representations within a single framework; see Figure 1.

3.2 GNN Graph Classification Model
We apply a GNN graph classification model (PyTorch Geometric) to per-city temporal graphs. Each graph has one node per year, with that year's daily-temperature profile as the node features. We add a virtual node to each graph and connect it to ensure every city graph is a single connected component. For supervision, we split cities into two equal groups by absolute latitude (closer vs. farther from the equator) and train the model to classify the graphs. We then use the pre-final vector as the city's temporal embedding for downstream analysis.
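A minimal PyTorch Geometric sketch of this stream is given below. The convolution type (GCN) and the layer sizes are our assumptions for illustration; the paper specifies only the framework, the node features, and the latitude-based labels. After training, the pooled vector, not the logits, is kept as the city's temporal embedding.

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class CityGraphClassifier(torch.nn.Module):
    def __init__(self, in_dim=365, hidden=64, embed_dim=32):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, embed_dim)
        self.head = torch.nn.Linear(embed_dim, 2)  # closer vs. farther from the equator

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        pooled = global_mean_pool(x, batch)        # one vector per city graph
        return self.head(pooled), pooled           # logits and "pre-final" embedding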
3.3 GNN Link Prediction Model
We apply a GNN link prediction model (Deep Graph Library), using the GraphSAGE aggregator [4], to the Voronoi spatial graph of cities. Unlike the GNN graph classification model, which produces one embedding per city graph, link prediction runs on the global spatial graph and refines each city's node representation using both adjacency and input features. We evaluate two node-feature variants: (i) 365-day climatology vectors (averaged across years) and (ii) temporal embeddings from the classification model. After training, we extract pre-final node embeddings as enhanced city feature vectors for downstream analysis. Notes and code are provided on our technical blog [18].

4 Experiments
4.1 Voronoi Graph Construction
We build the spatial graph from city coordinates with a Voronoi tessellation: each city gets a cell, and two cities are linked when their cells touch. This gives a clear, globe-spanning picture of who is naturally close, without picking an arbitrary distance cutoff. Alongside this, we also use the Delaunay triangulation on the same points, the dual view that connects cities exactly when their Voronoi cells meet and highlights triangle-based local structure.

Sometimes this setup links places that are far apart because there are few large cities between them. For example, Québec (Canada) and Porto (Portugal) become neighbors across the Atlantic when their cells meet (Figure 2). Larger patterns show up in the Delaunay view as well: the largest triangle, Wellington (New Zealand), Port Elizabeth (South Africa), and Mar del Plata (Argentina), illustrates how isolated regions can still form direct connections (Figure 3).

To show spatial density, we color each city by Voronoi cell size (Figure 4). Small cells (green) mark tight clusters, for example parts of eastern China and northern India, while large cells (red) indicate sparse areas such as interior Australia or northern Canada. Dense hubs shorten edges and raise local connectivity; sparse zones create longer links that act as bridges.

Figure 4: Voronoi area (normalized): green=low, yellow=mid, red=high.

Figure 5: Closeness centrality across four vector types; red = high, yellow = mid, green = low.

Figure 6: Betweenness centrality across four vector types; red = high, yellow = mid, green = low.
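The adjacency construction in Section 4.1 is straightforward with SciPy; the sketch below (with random stand-in coordinates rather than the Kaggle cities) shows both the Voronoi and Delaunay derivations and their duality.

import numpy as np
from scipy.spatial import Delaunay, Voronoi

coords = np.random.rand(1000, 2) * 1e6  # stand-in for projected (EPSG:3857) city coordinates

vor = Voronoi(coords)
# ridge_points lists, for each Voronoi ridge, the two input sites whose cells share it.
voronoi_edges = {tuple(sorted(map(int, pair))) for pair in vor.ridge_points}

tri = Delaunay(coords)
delaunay_edges = set()
for simplex in tri.simplices:            # each triangle contributes three edges
    for a, b in ((0, 1), (1, 2), (0, 2)):
        delaunay_edges.add(tuple(sorted((int(simplex[a]), int(simplex[b])))))

# For points in general position the two edge sets coincide (Voronoi-Delaunay duality).
assert voronoi_edges == delaunay_edges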
4.2 GNN Models
Across both GNNs (temporal graph classification and spatial link prediction), we use only pre-final embeddings for downstream analysis; we do not report task metrics (edge AUC/AP or classification accuracy) because our goal is weighted-path/centrality analysis on a geometric prior.

4.3 How Similar Are Distant or Nearby Cities?
This section examines climate similarity for both distant and neighboring city pairs using the four representations (Average, Temporal, Spatial, Spatial+Temporal). Tables 1 and 2 show representative examples: one for geographically distant pairs and one for nearby pairs.

Many distant pairs show very high similarity, especially when temporal history and spatial context are both considered. For example, Wellington (New Zealand) and Mar del Plata (Argentina), though thousands of kilometers apart, score highly across all four metrics, suggesting that similar seasonal regimes and latitude can outweigh raw distance.

Nearby pairs typically agree across metrics as well. In the second table, examples such as Barranquilla-Soledad and Barcelona-Puerto La Cruz show consistently high similarity, reflecting shared local climate.

There are exceptions. New York and Brooklyn, despite being only a few kilometers apart, score low on the Spatial and Spatial+Temporal measures. This may reflect microclimates, urban effects, or dataset/aliasing issues (e.g., borough vs. city records). Such cases show that short geographic distances can mask meaningful environmental differences, underscoring the value of combining temporal and spatial modeling.

4.4 Centrality and Betweenness Patterns Across Vector Types
Throughout, climate similarity means cosine similarity between the indicated vectors; for path-based metrics we use edge weights w = 1 - cosine. Each set of maps uses the same spatial backbone: edges come from the Voronoi graph, where two cities are adjacent if their cells share a border. What changes across panels is the edge weight, derived from cosine similarity computed from one of the four representations (Average, Temporal, Spatial, Spatial+Temporal), with vectors normalized prior to taking the cosine. The topology stays fixed; the weights, and therefore any shortest-path-based measures, change with the chosen vectors. Smaller weights mean higher climate similarity.

In the closeness centrality maps (Figure 5), cities with high closeness are, on average, at short weighted distance from many others, i.e., they are similar to many cities. Dense climate regions such as Europe and East Asia typically stand out. Differences between panels reveal how each representation defines "similar," shifting which cities appear most central.

In the betweenness maps (Figure 6), different weightings emphasize different connectors: high-betweenness cities lie on many shortest routes. The Spatial+Temporal view surfaces more long-range intermediaries than Average (notably in Africa, South America, and the Pacific). We also observe slight polarization in Spatial and Spatial+Temporal; the reason for this requires further research.

Our centrality and betweenness maps are only a starting point, with extended graph experiments expected to uncover additional structures and recurring pathways.
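A sketch of the weighting scheme used for Figures 5 and 6, assuming the Voronoi adjacency and unit-normalized vectors as inputs (names are placeholders; the paper's own code is on the blog [18]): the topology is fixed, only the edge weights w = 1 - cosine change with the chosen representation.

import numpy as np
import networkx as nx

def centrality_maps(edges, vec):
    # edges: Voronoi adjacency pairs; vec: city -> unit-normalized feature vector.
    g = nx.Graph()
    for u, v in edges:
        cos = float(np.dot(vec[u], vec[v]))
        g.add_edge(u, v, weight=1.0 - cos)   # small weight = high climate similarity
    closeness = nx.closeness_centrality(g, distance="weight")
    betweenness = nx.betweenness_centrality(g, weight="weight")
    return closeness, betweenness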
Table 1: Climate similarity between distant city pairs
City 1              City 2              Distance (km)  Average  Temporal  Spatial  Spatial+Temporal
Wellington, NZ      Mar del Plata, AR   25870.97       0.9922   1.0000    1.0000   1.0000
Port Elizabeth, ZA  Wellington, NZ      16639.04       0.9982   0.9963    0.9999   1.0000
Melbourne, AU       Port Elizabeth, ZA  13299.30       0.9916   0.9958    0.9872   0.9993
Reykjavik, IS       Krasnoyarsk, RU     12911.14       0.7375   0.7482    0.9861   0.9338
Nuku'alofa, TO      Concepcion, CL      11549.31       0.9838   0.9882    0.9995   0.9997

Table 2: Climate similarity between nearby city pairs
City 1            City 2               Distance (km)  Average  Temporal  Spatial  Spatial+Temporal
Jerusalem, IL     Al Quds, PS          2.27           1.0000   1.0000    0.9998   1.0000
Barranquilla, CO  Soledad, CO          5.63           1.0000   0.9585    0.9999   0.9999
Barcelona, VE     Puerto La Cruz, VE   6.32           1.0000   1.0000    1.0000   1.0000
Khartoum, SD      Omdurman, SD         6.88           1.0000   0.8749    0.9590   0.9988
New York, US      Brooklyn, US         7.05           1.0000   0.5220    0.0857   0.0878

5 Conclusion
In conclusion, the novelty of this work is the explicit fusion of a Voronoi spatial graph with temporal GNN embeddings to reveal climate "neighborhoods" that traditional, single-view methods tend to miss. By running a GNN graph-classification model on per-city year graphs and a GNN link-prediction model on the global Voronoi backbone, we combine geography with long-term dynamics. We compare simple average-by-day climatology vectors against pre-final vectors from both GNN models and then use these vectors for downstream analysis.

This fusion surfaces informative outliers: nearby cities with low cosine similarity (consistent with microclimates, urban form, or data aliasing) and distant city pairs with high similarity, suggesting long-distance climate links. Using these vectors as edge weights enables graph-mining views: closeness maps highlight dense climate belts, while betweenness maps elevate long-range "bridges." Adding the Delaunay triangulation, the dual of the Voronoi diagram, provides a geometrically well-posed neighbor network that stabilizes these patterns.

While this study centers on climate and temperature, the dual Voronoi-Delaunay framework with GNN fusion is broadly applicable. The same geometric scaffold can analyze urban connectivity and infrastructure networks, surface social or economic linkages in dense regions, and support practical tasks like traffic management and siting of schools, parks, or grocery stores. It offers a stable way to reason about spatial relationships beyond climate. The approach is also a starting point for continued work: enrich node features, adopt spherical/geodesic tessellations, learn the graph via contrastive or metric objectives, and explore dynamic temporal GNNs with attribution, counterfactuals, and uncertainty.
References
[1] Jakub Adamczyk. 2022. Application of graph neural networks and graph descriptors for graph classification. arXiv preprint arXiv:2211.03666. doi:10.48550/arXiv.2211.03666.
[2] Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. 2021. Geometric deep learning: grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478. doi:10.48550/arXiv.2104.13478.
[3] Junjie Gan, Qing Yang, Dong Zhang, Li Li, Xinyu Qu, and Bin Ran. 2024. A novel Voronoi-based spatio-temporal graph convolutional network for traffic crash prediction considering geographical spatial distributions. IEEE Transactions on Intelligent Transportation Systems. doi:10.1109/TITS.2024.3452275.
[4] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/1706.02216.
[5] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open Graph Benchmark: datasets for machine learning on graphs. In Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2005.00687.
[6] Lili Ju, Todd Ringler, and Max Gunzburger. 2011. Voronoi tessellations and their application to climate and global modeling. In Numerical Techniques for Global Atmospheric Models. Lecture Notes in Computational Science and Engineering. doi:10.1007/978-3-642-11640-7_10.
[7] Kaggle Dataset. 2020. Temperature history of 1000 cities 1980 to 2020. https://www.kaggle.com/datasets/sudalairajkumar/daily-temperature-of-major-cities.
[8] Ryan Keisler. 2022. Forecasting global weather with graph neural networks. arXiv preprint arXiv:2202.07575. doi:10.48550/arXiv.2202.07575.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS). doi:10.1145/3065386.
[10] Srijan Kumar, Xikun Zhang, and Jure Leskovec. 2019. Predicting dynamic embedding trajectory in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). doi:10.1145/3292500.3330895.
[11] Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, et al. 2023. Learning skillful medium-range global weather forecasting. Science. doi:10.1126/science.adi2336.
[12] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature. doi:10.1038/nature14539.
[13] Hugo Lira, Luis Martí, and Nayat Sanchez-Pi. 2022. A graph neural network with spatio-temporal attention for multi-source time series data: an application to frost forecast. doi:10.3390/s22041486.
[14] Xia Liu, Jie Chen, and Qingsong Wen. 2023. A survey on graph classification and link prediction based on GNN. arXiv preprint arXiv:2307.00865. doi:10.48550/arXiv.2307.00865.
[15] Natasha F. Noy, Yuval Gao, Anshu Jain, Anant Narayanan, Alan Patterson, and Jamie Taylor. 2019. Industry-scale knowledge graphs: lessons and challenges. acmqueue. doi:10.1145/3329781.3332266.
[16] S. Vahideh Razavi-Termeh, Amir Sadeghi, Faisal Ali, Rana Abdul Naqvi, et al. 2024. Cutting-edge strategies for absence data identification in natural hazards: leveraging Voronoi-entropy in flood susceptibility mapping with advanced AI techniques. Journal of Hydrology. doi:10.1016/j.jhydrol.2024.132337.
[17] Alex Romanova. 2024. Utilizing pre-final vectors from GNN graph classification for enhanced climate analysis. In Proceedings of the 21st Workshop on Mining and Learning with Graphs (MLG 2024), co-located with ECML PKDD 2024.
[18] sparklingdataocean.com. [n. d.] Temporal-spatial GNN fusion for climate analytics. http://sparklingdataocean.com/2025/06/25/voronoiGNN/.
[19] Marco L. Taccari, Hua Wang, James Nuttall, Xue Chen, and Peter K. Jimack. 2024. Spatial-temporal graph neural networks for groundwater data. doi:10.1038/s41598-024-75385-2.
[20] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). doi:10.1145/3219819.3219890.

Towards Anomaly Detection in Forest Biodiversity Monitoring: A Pilot Study with Variational Autoencoders

David Susič (david.susic@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia; Maria Luisa Buchaillot, Fauna Smart Technologies ApS, Copenhagen, Denmark; Miguel Crozzoli, Intelligent Instruments Lab, University of Iceland, Reykjavik, Iceland; Calum Builder, Fauna Smart Technologies ApS, Copenhagen, Denmark; Sevasti Maistrou, Fauna Smart Technologies ApS, Copenhagen, Denmark; Anton Gradišek, Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia; Dragana Vukašinović, Fauna Smart Technologies ApS, Copenhagen, Denmark
DOI: https://doi.org/10.70314/is.2025.skui.6757

Abstract
Biodiversity monitoring in forests requires scalable, automated tools for detecting ecological anomalies across time and space. This paper reports on a three-month pilot deployment (April 1 to June 30, 2025) in Dyrehaven, an 11 km² forest park near Copenhagen, Denmark, where acoustic data from 10 distributed AudioMoth sensors and vegetation indices from Sentinel-2 imagery were collected. We trained separate variational autoencoder (VAE) models on each modality to test the technical feasibility of learning ecological baselines. Since no ecological anomalies occurred during the observation period, evaluation focused on reconstruction errors, which indicate how well VAEs can capture typical site-specific ecological patterns (i.e., baseline modeling). Both acoustic and satellite pipelines achieved low reconstruction errors, demonstrating that VAEs can reliably model normal ecological dynamics. This establishes the foundation for future studies on anomaly detection, which will require larger datasets containing true ecological anomalies identified and labeled by experts. Ongoing work focuses on extending data collection to additional forest sites, while future anomaly detection will require expert-labeled anomalies to calibrate baselines and validate model performance for robust, multimodal biodiversity monitoring.

Keywords
biodiversity, anomaly detection, variational autoencoder, machine learning, passive acoustic monitoring, satellite imagery

1 Introduction
Forests are complex, dynamic ecosystems increasingly affected by environmental stressors such as pests, diseases, invasive species, and climate-related disturbances [1]. Effective biodiversity monitoring is essential to detect these stressors early and support adaptive, science-based forest management [2, 3]. However, existing monitoring tools are often limited in scope, fragmented across disciplines, and costly to implement at scale [4].

This paper presents the technical foundation of the biodiversity assessment tool (BAT), a modular, scalable system that integrates ecoacoustics, satellite remote sensing, and machine learning (ML) to enable automated biodiversity monitoring in forested landscapes. BAT is designed to detect anomalies in ecological baselines, providing early warning signals of ecosystem degradation [5]. It combines two complementary remote sensing modalities: passive acoustic monitoring (PAM), which captures localized, high-frequency biological activity such as insect or bird calls [6, 7], and satellite Earth observation (EO), which offers broader, lower-frequency indicators of landscape-level change, including vegetation health and canopy dynamics [8].

The presence of pests or other stressors often leads to a reduction in biodiversity, which can first be detected acoustically as diminished biotic sound activity, and later (typically with a lag of several days) becomes visible in EO data as decreased vegetation greenness. BAT is designed to leverage this temporal and spatial complementarity by developing independent anomaly detection pipelines for each modality, which in future iterations may support joint multimodal detection of ecological disturbances.
This study reports on a pilot deployment in Dyrehaven, a human-managed park-forest in Denmark, where time-series data from distributed acoustic sensors and Sentinel-2 satellite imagery were collected between April and June 2025. Separate variational autoencoders (VAEs) were trained on each modality to test whether robust baseline models can be learned. Ecological anomalies are inherently rare and cannot be guaranteed within a limited three-month window, and none occurred during this period. As a result, evaluation focused on baseline reconstruction performance rather than anomaly detection accuracy. Demonstrating that VAEs can successfully capture "normal" ecological patterns is a necessary prerequisite for future anomaly detection. Ecological baselines are inherently site-specific, differing across forest types, microhabitats, and even within single forests (e.g., wetter zones near ponds vs. drier uplands). Accordingly, this work should be understood as a technical feasibility study, with the longer-term goal of enabling multimodal detection of ecological disturbances such as pest outbreaks, supported by expert-labeled events and extended deployment across diverse forests.

2 Data
Our study area was Dyrehaven, a human-managed forest park north of Copenhagen, Denmark (55.8024°N, 12.5685°E), covering 11 km² (see Figure 1). The site includes 10 structured microhabitats across woodland, meadow, and modified forest areas. Its ecological diversity and relative stability make it suitable for testing acoustic and satellite-based monitoring methods. Data were collected between April 1 and June 30, 2025.

Figure 1: Study area in Dyrehaven, Denmark, with AudioMoth recording locations (red pins) and the Sentinel-2 satellite bounding box (blue).

2.1 Audio
Passive acoustic data were collected using 10 AudioMoth recording devices deployed across Dyrehaven's microhabitats. Devices were positioned to maximize spatial heterogeneity, minimize acoustic overlap, and ensure temporal consistency. Each unit recorded 45-second mono-channel clips every five minutes at a 48 kHz sampling rate. All devices were weatherproofed and mounted on trees for continuous outdoor operation. A recording gap occurred between April 20 and April 29 due to memory card failure. A total of 203,078 recordings were generated during the study period. After removing corrupted or incomplete files (309 clips, 0.15%), 202,769 valid recordings remained.

2.2 Visual
Satellite imagery was sourced from the Sentinel-2 mission [9], covering a 1.48 km × 5.86 km bounding box encompassing 9 of the 10 AudioMoth locations. Out of 53 total snapshots available during our study period, 18 cloud-free scenes (≤50% cloud cover) were selected for analysis to ensure index reliability.

The normalized difference vegetation index (NDVI) and the normalized difference moisture index (NDMI) were computed for each selected image as

NDVI = (NIR - red) / (NIR + red),
NDMI = (NIR - SWIR) / (NIR + SWIR),

where NIR, SWIR, and red are the near-infrared, shortwave-infrared, and visible red bands, respectively. NDVI was calculated at 10 m resolution, and NDMI at 20 m. Each index map was divided into fixed-size patches. NDVI maps produced 396 patches (an 11 × 36 grid), while NDMI produced 108 patches (a 6 × 18 grid), reflecting their respective spatial resolutions.
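A direct numpy rendering of the two indices and the patching step follows. The epsilon guard and the 16-pixel patch size (chosen to match the 16×16 VAE input described later) are our assumptions; the paper does not state the exact patch dimensions.

import numpy as np

def ndvi(nir, red):
    return (nir - red) / (nir + red + 1e-9)      # epsilon avoids division by zero

def ndmi(nir, swir):
    return (nir - swir) / (nir + swir + 1e-9)

def to_patches(index_map, size=16):
    # Cut an index map into non-overlapping size x size patches.
    h, w = index_map.shape
    rows, cols = h // size, w // size
    trimmed = index_map[: rows * size, : cols * size]
    return trimmed.reshape(rows, size, cols, size).swapaxes(1, 2).reshape(-1, size, size)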
3 Methodology
3.1 Extraction of Acoustic Indices
Ten standard ecoacoustic indices [10] (listed in Table 1) were extracted from each 45-second recording, capturing patterns from both time-domain and time-frequency analyses. These indices reflect aspects such as spectral entropy, acoustic complexity, temporal dynamics, and frequency distribution, offering proxies for ecological features like species richness, biophonic activity, and anthropogenic disturbance. All indices were independently normalized to the [0, 1] range using their dataset-wide minimum and maximum values.

Table 1: Acoustic indices used in this study and their ecological interpretation.
Index  Use
ACI    Detects dynamic biotic sounds (e.g., bird choruses).
AEI    Identifies dominance vs. diversity in acoustic communities.
EAS    Differentiates uniform noise vs. structured signals.
ECU    Indicates unpredictability and complexity of soundscapes.
ECV    Captures temporal structure (e.g., insect or bird rhythms).
EPS    Distinguishes tonal vs. noisy sound environments.
ADI    Proxy for acoustic diversity or species richness.
NDSI   Separates natural from human-made noise.
Ht     Detects continuous vs. discrete acoustic events.
ARI    Estimates overall acoustic richness.

3.2 Preprocessing of Satellite Imagery
To ensure patch-level data quality, we applied the scene classification layer (SCL) after resampling. Patches containing cloudy or unreliable pixels (SCL classes 3, 8, 9, or 10) were excluded. This preprocessing pipeline produced curated spatiotemporal datasets of 4,436 NDVI patches and 1,226 NDMI patches, which served as input for training and evaluating the VAE models.

3.3 Variational Autoencoder and Evaluation Metrics
A variational autoencoder (VAE) learns to compress input data into a latent representation and reconstruct it via an encoder and decoder, as in Figure 2.

Figure 2: Architecture of the VAE for anomaly detection using reconstruction probability.

The encoder maps each input to a latent mean $\mu_1$ and log-variance $\log(\sigma_1^2)$, from which a latent vector $z$ is sampled via the reparameterization trick: $z = \mu_1 + \sigma_1 \cdot \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$ and $\sigma_1 = \exp(0.5 \cdot \log(\sigma_1^2))$. The decoder reconstructs the input from $z$, producing a mean $\mu_2$ and log-variance $\log(\sigma_2^2)$ of the output distribution. Training minimizes the total loss

$\mathcal{L}_{VAE} = \mathcal{L}_{recon} + w_{KL} \cdot \mathcal{L}_{KL}$,

where $\mathcal{L}_{recon}$ is the negative log-likelihood of the input under the decoder's Gaussian output,

$\mathcal{L}_{recon} = -\sum_{i=1}^{D} \log \mathcal{N}(x_i \mid \mu_{2,i}, \sigma_{2,i}^2)$,

and $\mathcal{L}_{KL}$ is the Kullback-Leibler divergence between the approximate posterior $q(z \mid x)$ and the prior $p(z) = \mathcal{N}(0, 1)$,

$\mathcal{L}_{KL} = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_{1,j}^2 - \mu_{1,j}^2 - \sigma_{1,j}^2\right)$,

with $D$ and $d$ representing the input and latent dimensions, respectively.
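The following PyTorch sketch renders this loss and the audio VAE, using the dimensions reported later in Section 3.4.1 (input 10, hidden 8, latent 4); it is a minimal reconstruction under those stated values, not the authors' code.

import torch
import torch.nn as nn

class AudioVAE(nn.Module):
    def __init__(self, in_dim=10, hidden=8, latent=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu1 = nn.Linear(hidden, latent)
        self.logvar1 = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU())
        self.mu2 = nn.Linear(hidden, in_dim)
        self.logvar2 = nn.Linear(hidden, in_dim)

    def forward(self, x):
        h = self.enc(x)
        mu1, logvar1 = self.mu1(h), self.logvar1(h)
        z = mu1 + torch.exp(0.5 * logvar1) * torch.randn_like(mu1)  # reparameterization trick
        d = self.dec(z)
        return self.mu2(d), self.logvar2(d), mu1, logvar1

def vae_loss(x, mu2, logvar2, mu1, logvar1, w_kl=1.0):
    # Gaussian negative log-likelihood of x under the decoder (constant term dropped).
    recon = 0.5 * (logvar2 + (x - mu2) ** 2 / logvar2.exp()).sum(dim=1)
    # KL(q(z|x) || N(0, I)), matching L_KL above.
    kl = -0.5 * (1 + logvar1 - mu1.pow(2) - logvar1.exp()).sum(dim=1)
    return (recon + w_kl * kl).mean()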
In an operational anomaly detection setting, the decoder's negative log-likelihood (often referred to as the reconstruction likelihood) would serve as the anomaly score, with higher values indicating more anomalous inputs. However, since no ecological anomalies occurred during our three-month observation window, this pilot study evaluates baseline modeling rather than anomaly detection accuracy. Specifically, we report reconstruction errors: mean squared error (MSE) and mean absolute error (MAE) for acoustic indices, and overall mean absolute error (averaged across all pixels in each patch) for NDVI and NDMI patches, computed only on non-cloudy patches after SCL masking.

3.4 Experimental Setup
The general pipeline of the BAT system is shown in Figure 3. It consists of independent audio and visual pipelines designed to operate separately but eventually integrate into a unified decision-support framework. In a full anomaly detection setting, the pipelines would use reconstruction likelihoods as anomaly scores and combine them across modalities. In this pilot, since no anomalies occurred, we only assess baseline modeling by training and evaluating the acoustic and satellite VAEs independently, reporting reconstruction errors as indicators of model performance.

Figure 3: The general pipeline of the BAT system.

3.4.1 Audio Pipeline. The audio VAE uses a 10-dimensional input, with an encoder and decoder each containing one hidden layer of size 8 and ReLU activation. The latent space has dimension 4. The decoder outputs the reconstructed mean and log-variance of size 10. Model evaluation used 5-fold cross-validation with folds defined by spatially clustered AudioMoth devices (~850 m minimum separation) to reduce data leakage. Models were trained for 30 epochs with a batch size of 512 using the Adam optimizer and a one-cycle learning rate schedule.

3.4.2 Visual Pipeline. The satellite VAE takes a 16×16 pixel input (NDVI or NDMI) and uses three convolutional layers (32, 64, 128 filters) with ReLU activation in the encoder. The output is flattened and mapped to a latent space of dimension 4. The decoder upsamples using three transposed convolutional layers with ReLU, reconstructing the mean and log-variance patches of size 16×16. Separate VAE models were trained for NDVI and NDMI using an 80/20 train-test split. Each model was trained for 20 epochs with a batch size of 32 using the Adam optimizer. The loss was computed only over non-cloudy pixels.

4 Results and Discussion
To examine temporal patterns, all indices were plotted over the study period, as seen in Figure 4. Acoustic indices were averaged between 9 AM and 3 PM across all 10 AudioMoth devices to avoid nighttime inactivity and minimize dawn/dusk transitions. A 10-day smoothing window was applied to reduce day/night fluctuations. The indices remained relatively stable over the long term, showing little trend, suggesting no major ecological disruptions and reflecting the stability of the forest soundscape over the study period.

Visual indices were averaged across all patches for each date. Both indices exhibit a gradual increase from early April to late June, consistent with seasonal greening. NDVI shows a smooth and consistent rise, indicating widespread vegetation growth. NDMI, while generally increasing, displays more irregular variation, particularly early in the season, likely reflecting transient moisture conditions.
NDVI primarily tracks canopy structure and greenness, while NDMI is more sensitive to vegetation and soil moisture.

Figure 4: Index values over the study period.

The audio pipeline VAE was evaluated using reconstruction MSE and MAE. Since all indices were normalized to the [0, 1] range, errors are directly comparable. As shown in Figure 5, reconstruction errors are generally low, indicating that the model effectively captures the underlying structure of the acoustic data. EPS and Ht showed the highest reconstruction error variability. This suggests they are more difficult to model but may provide sensitive signals of ecological change in future anomaly detection settings. Indices with consistently low reconstruction errors, on the other hand, indicate stable features that can serve as robust components of ecological baselines. These patterns highlight differences in how well various indices represent typical acoustic dynamics, which is central to establishing reliable baseline models.

Figure 5: Reconstruction errors for acoustic indices.
The visual pipeline VAEs were evaluated using overall MAE per patch. As expected, errors were fairly uniform across pixels, indicating that the models reconstruct spatial patterns consistently without localized distortions. The average patch-level MAE (averaged across all 16 × 16 = 256 pixels across all images) was 7.17 ± 0.11 for NDVI and 9.65 ± 0.26 for NDMI. Given the [0, 1] normalization range of each pixel, the errors are relatively small and therefore reflect accurate reconstruction of vegetation and moisture dynamics.

The selected VAE models for both the acoustic and visual pipelines demonstrate strong reconstruction performance, with consistently low errors across acoustic indices and Sentinel-derived NDVI/NDMI patches. This confirms that the models effectively capture typical ecological patterns, which is the intended outcome of this pilot study. While further hyperparameter tuning could potentially reduce errors, the key result is that robust ecological baselines can be modeled. Anomaly detection itself will require expert-labeled events in future deployments, but these results provide the necessary technical foundation.

5 Conclusion
This work demonstrates the technical feasibility of using VAEs to model baseline ecological patterns from acoustic and satellite time series in a forested landscape. As a pilot study, it does not evaluate anomaly detection directly, since no anomalies occurred during the observation period. Instead, it establishes that robust models can be trained on available data, providing a foundation for future multimodal monitoring.

A critical next step is the collection of additional data over longer time frames and across multiple forest types, since actual ecological anomalies are rare and cannot be guaranteed within a short observation window. Detecting and validating anomalies will require expert labeling of such events once they occur. To this end, we are continuing data collection at Dyrehaven and planning expansions to other Danish forests (e.g., Thy, Amager, Lillebælt) to capture a wider range of ecological contexts and improve model generalization. Further development will also focus on refining acoustic preprocessing through time-window averaging or time-aware features and enhancing the visual pipeline with seasonal baselines, sequential models, and zone-specific approaches that account for spatial heterogeneity.

With expert input, longer-term recordings, and broader deployment, the BAT system can evolve from modeling site-specific baselines into a robust anomaly detection tool supporting scalable and long-term biodiversity monitoring.

Acknowledgements
This work was funded by Fauna Smart Technologies ApS under the European Space Agency (ESA) grant no. 4000147116, Biodiversity Assessment Tool.

References
[1] William R. L. Anderegg, Oriana S. Chegwidden, Grayson Badgley, Anna T. Trugman, Danny Cullenward, John T. Abatzoglou, Jeffrey A. Hicke, Jeremy Freeman, and Joseph J. Hamman. 2022. Future climate risks from stress, insects and fire across US forests. Ecology Letters, 25, 6, 1510-1520. doi: 10.1111/ele.14018.
[2] Lucas P. Gaspar et al. 2023. Predicting bird diversity through acoustic indices within the Atlantic Forest biodiversity hotspot. Frontiers in Remote Sensing, 4, (Dec. 2023). doi: 10.3389/frsen.2023.1283719.
[3] J. Wolfgang Wägele et al. 2022. Towards a multisensor station for automated biodiversity monitoring. Basic and Applied Ecology, 59, 105-138. doi: 10.1016/j.baae.2022.01.003.
[4] Santiago Izquierdo-Tort, Andrea Alatorre, Paulina Arroyo-Gerala, Elizabeth Shapiro-Garza, Julia Naime, and Jérôme Dupras. 2024. Exploring local perceptions and drivers of engagement in biodiversity monitoring among participants in payments for ecosystem services schemes in southeastern Mexico. Conservation Biology, 38, 6, e14282. doi: 10.1111/cobi.14282.
[5] Nathalie Pettorelli, Jake Williams, Henrike Schulte to Bühne, and Merry Crowson. 2025. Deep learning and satellite remote sensing for biodiversity monitoring and conservation. Remote Sensing in Ecology and Conservation, 11, 2, 123-132. doi: 10.1002/rse2.415.
[6] Rory Gibb, Ella Browning, Paul Glover-Kapfer, and Kate E. Jones. 2019. Emerging opportunities and challenges for passive acoustics in ecological assessment and monitoring. Methods in Ecology and Evolution, 10, 2, 169-185. doi: 10.1111/2041-210X.13101.
[7] D.A. Nieto-Mora, Susana Rodríguez-Buritica, Paula Rodríguez-Marín, J.D. Martínez-Vargaz, and Claudia Isaza-Narváez. 2023. Systematic review of machine learning methods applied to ecoacoustics and soundscape monitoring. Heliyon, 9, 10, e20275. doi: 10.1016/j.heliyon.2023.e20275.
[8] Nathalie Pettorelli et al. 2018. Satellite remote sensing of ecosystem functions: opportunities, challenges and way forward. Remote Sensing in Ecology and Conservation, 4, 2, 71-93. doi: 10.1002/rse2.59.
[9] Copernicus Data Space Ecosystem. 2015. Sentinel-2. https://dataspace.copernicus.eu/explore-data/data-collections/sentinel-data/sentinel-2.
[10] Luis J. Villanueva-Rivera, Bryan C. Pijanowski, Jarrod Doucette, and Burak Pekin. 2011. A primer of acoustic analysis for landscape ecologists. Landscape Ecology, 26, 9, (July 2011), 1233-1246. doi: 10.1007/s10980-011-9636-9.

Development of a Lightweight Model for Detecting Solitary-Bee Buzz Using Pruning and Quantization for Edge Deployment

Ryo Yagi (yagi-ryo143@g.ecc.u-tokyo.ac.jp), The University of Tokyo, Tokyo, Japan; David Susič (david.susic@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia; Maj Smerkol (maj.smerkol@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia; Miha Finžgar (miha.finzgar@senso4s.com), Senso4s, Trzin, Slovenia; Anton Gradišek (anton.gradisek@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
DOI: https://doi.org/10.70314/is.2025.skui.6788

Abstract
Passive acoustic monitoring is increasingly applied in studies of pollinators, both for biodiversity assessment and for the conservation of endangered species. A major challenge is that continuous recording generates large volumes of audio data, making centralized processing impractical. Edge computing offers a promising alternative, provided that the models are optimized for the resource constraints of edge devices while maintaining acceptable performance and efficiency. In this work, which is our initial study of the edge computing approach, we developed and evaluated compact classifiers for detecting buzzes of solitary bees, extending previous work on acoustic monitoring. We systematically apply pruning and quantization to multiple models, exploring a range of compression settings. Performance is assessed in terms of mean F1-score and on-disk size under both cross-validation and leave-one-location-out protocols. Results indicate that substantial reductions in model size can be achieved with a minimal loss of performance, and that the optimal trade-offs depend on the evaluation setting; for example, in cross-validation, a 25.2 MiB baseline reaches 96.2% F1, while a 0.062 MiB model attains 92.5%, an approximately 400-fold reduction in size with less than a 4-percentage-point drop. By analyzing the Pareto front of F1 vs. model-size trade-offs, we identify configurations that balance robustness and resource constraints. Our early findings demonstrate the feasibility of deploying edge-ready acoustic models for scalable pollinator monitoring.

Keywords
edge deployment, lightweight model, pruning, quantization, bees

1 Introduction
Bees are widely recognized as major pollinators: animal pollinators including honey bees contribute to yield in 75% of key crop species and an estimated 35% of global crop production [1]. This indicates the importance of pollinator monitoring and protection.

Passive acoustic monitoring (PAM) is a non-invasive approach that continuously records environmental sound with deployed microphones to monitor animal activity. Because it reduces manual surveys and can operate continuously across space and time, even at night and under inclement weather, it has gained attention as a cost-effective biodiversity monitoring technology. PAM has been widely adopted for multiple taxa such as birds and bats; in ornithology, for example, the deep-learning system BirdNET [2] is already used operationally to identify species from passively collected field recordings. PAM is also applied to bee behavior monitoring: in social bees (such as honeybees or bumblebees), microphones and accelerometers placed inside or outside hives enable non-invasive, continuous surveillance of queen presence, swarming cues, and robbing [3]. For solitary bees, recordings at the entrance of nesting boxes are used to detect buzzing and to characterize presence/absence and activity rhythms [4].

In acoustic approaches for bee state monitoring, machine learning has been widely used to automatically determine activity and behavioral states from audio recordings. Prior work includes both classical machine-learning pipelines and deep-learning methods. Classical approaches such as SVM, k-NN, and Random Forests have been shown to be practical and effective [5, 6]. Meanwhile, several studies suggest that CNN-based deep learning models achieve superior performance compared with traditional machine-learning methods [7, 8].
However, if all long-term, continuous PAM recordings are uploaded to the cloud, features such as mel spectrograms and MFCCs are extracted there, and the data are then analyzed using machine learning or deep learning models, the resulting data volumes become extremely large. In a centralized cloud-only workflow, this (i) inflates communication cost by requiring all long-duration audio to be uploaded [9], (ii) raises privacy concerns as incidental human speech can accumulate in the cloud [10], (iii) introduces round-trip latency for feature extraction and inference that impedes timely detection, and (iv) exposes scalability limits as storage and compute demands grow with multi-site, long-term deployments. To address these issues, we developed a high-accuracy, lightweight deep model designed for edge deployment, capable of on-device preprocessing and inference for recorded audio. Here, the term lightweight refers to memory (both RAM and storage); in a broader view it also covers CPU/GPU requirements, latency requirements, and even battery constraints, which are beyond the scope of this paper. In our intended operation, audio is processed on-device and only the result is sent to the cloud, enabling multi-site, long-term monitoring with reduced storage cost and latency, while preserving privacy and power efficiency.

As a first step toward edge-based bee monitoring with PAM, we designed and evaluated a lightweight CNN specialized for solitary-bee buzz detection (a binary classifier distinguishing between buzz and no-buzz). To compress the model, we applied compression techniques such as structured pruning and int8 post-training quantization when appropriate, and we quantified the size-accuracy trade-offs under edge-oriented constraints.

2 Methodology
2.1 Dataset
We used the dataset collected for the purpose of the study by Susič et al. [4]. This dataset comprises acoustic recordings from nesting boxes of solitary bees (predominantly Osmia spp.) collected through a citizen-science project carried out in the Bela Krajina region in southeastern Slovenia. The recordings were gathered from March 15 to May 26, 2023, resulting in 62 long recordings across seven sites, with a mean duration of 6 ± 2.5 hours per recording. For the purpose of this study, three recordings in total were randomly selected from different locations.

The recordings were converted to mono-channel audio, segmented into 4 s windows with 2 s overlap, transformed into Mel spectrograms (128 × 128) configured to cover 50-1450 Hz, and standardized using the mean and standard deviation across the dataset. For labeling, two annotators inspected the spectrograms and assigned buzz=1 or no-buzz=0.

2.2 Neural Network Architecture
We addressed binary detection of solitary-bee buzzing from Mel spectrograms. With memory-constrained edge deployment in mind, we evaluate four lighter CNNs against the ResNet-9 used in [4]. Specifically, we consider MobileNetV2 [11] and three custom lightweight architectures named BeeNet1, BeeNet2, and BeeNet3, which adopt a depthwise separable convolutional design similar to MobileNetV1 [12]. Model sizes and parameter counts are summarized in Table 1, and the architectural details of the BeeNet variants are provided in Table 2. In all architectures, each convolutional layer is followed by batch normalization (BatchNorm) and ReLU activation; dw stands for depthwise convolution. For MobileNetV2, we use the standard backbone and adapt it to spectrograms by converting the first convolution to a 1-channel input and replacing the final linear layer with a 1280 × 2 classifier. All other layers remain identical to the original MobileNetV2.

Table 1: Parameter counts and model sizes of the models used in this study
                  ResNet-9  MobileNetV2  BeeNet1  BeeNet2  BeeNet3
Parameters (k)    6585.5    2225.9       50.2     17.6     6.4
Model size (MiB)  25.2      8.7          0.215    0.084    0.036

While the ResNet-9 approach achieves an F1-score exceeding 95% under five-fold cross-validation on the dataset [4], its 25.2 MiB size renders its deployment on memory-limited edge devices impractical. Accordingly, we designed and configured compact CNNs (MobileNetV2 and the BeeNet family) and, as detailed below, applied quantization and pruning to systematically evaluate the accuracy-model-size trade-off. The aim of this study is to clarify accuracy as a function of model size and the effects of lightweighting techniques under strict model-size constraints, assuming deployment on MCUs. Accordingly, we adopt lightweight and relatively simple architectures, with the smallest model containing approximately 6k parameters.

2.3 Model Compression Methods
Deploying deep neural networks on memory-constrained edge devices necessitates model compression. We examined two complementary techniques: quantization and pruning.

2.3.1 Quantization. Quantization maps floating-point weights and activations to low-bit integers, thereby reducing model size and computation at inference. Here, we adopted post-training quantization (PTQ) and converted the trained network to int8 without additional training. We used the QNNPACK backend in PyTorch for ARM targets. To minimize both saturation (clipping) and rounding error under the 8-bit representation and mitigate accuracy degradation, we performed calibration with up to 300 batches of representative inputs to estimate the scale and zero-point.

2.3.2 Pruning. Pruning reduces model complexity by deleting parameters deemed unimportant, thereby decreasing memory and compute complexity without retraining from scratch. Pruning can be categorized into structured and unstructured approaches. We adopted structured pruning to realize memory savings and speed-ups on commodity hardware, as unstructured sparsity typically requires specialized hardware or software support to translate sparsity into acceleration [13]. Our pruning pipeline followed Han et al. [14]: (1) train, (2) prune, and (3) retrain (fine-tune). For filter (i.e., output-channel) selection, we followed the idea of Li et al. [15], ranking convolutional filters by the L1 norm of their weights and removing those with the smallest scores. We implemented this using PyTorch's torch-pruning, configuring the MagnitudePruner with L1-based importance. The selection of filters to prune was performed globally across layers. The target was controlled by a pruning ratio p; under channel-wise pruning, the resulting parameter-reduction rate is approximately 1 - (1 - p)² [16, 17].
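The following sketch illustrates the eager-mode PTQ flow described in Section 2.3.1, using a toy stand-in model (eager-mode static quantization requires the forward pass to be wrapped in QuantStub/DeQuantStub, and the actual networks, calibration data, and batch counts differ):

import torch
import torch.nn as nn
import torch.ao.quantization as tq

class TinyNet(nn.Module):               # stand-in for a trained BeeNet-style model
    def __init__(self):
        super().__init__()
        self.quant, self.dequant = tq.QuantStub(), tq.DeQuantStub()
        self.fc = nn.Linear(16, 2)
    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

torch.backends.quantized.engine = "qnnpack"      # int8 kernels for ARM targets
model = TinyNet().eval()
model.qconfig = tq.get_default_qconfig("qnnpack")
prepared = tq.prepare(model)                     # insert observers
with torch.no_grad():                            # calibration estimates scale / zero-point
    for _ in range(300):                         # up to 300 batches, as above
        prepared(torch.randn(8, 16))             # representative inputs in practice
quantized = tq.convert(prepared)                 # int8 weights and activations

For pruning, a plain-PyTorch rendering of the L1 filter-ranking criterion of Li et al. [15] is sketched below; torch-pruning automates the actual channel removal and the propagation of dependent shapes, so this only computes the global ranking:

def filter_l1_scores(model):
    scores = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv2d):
            # One score per output channel: sum of |w| over input channels and kernel.
            l1 = module.weight.detach().abs().sum(dim=(1, 2, 3))
            scores += [(name, c, float(v)) for c, v in enumerate(l1)]
    return sorted(scores, key=lambda s: s[2])    # smallest L1 pruned first, globally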
2.4 Experimental Setup
2.4.1 Model Performance Evaluation Metrics. Because we were dealing with a class-imbalanced dataset (more no-buzz than buzz), we used the F1-score as the primary metric. F1 is the harmonic mean of precision and recall, enabling balanced assessment under imbalance.

2.4.2 Evaluation Protocols. We evaluated the buzz-detecting models using two protocols, following [4]: cross-validation (CV) and leave-one-location-out (LOLO). The first is a standard test in machine-learning studies, whereas the second shows how well the model generalizes to data coming from a previously unseen location with potentially different background noise. For CV, annotated segments (4 s windows) were partitioned into five folds; models were trained on four folds and evaluated on the remaining fold, and we reported the mean F1 across folds. Stratification ensures balanced distributions of the buzz/no-buzz classes and the three locations. To mitigate temporal leakage, we performed a time-aware data split. LOLO assessed generalization across sites: models were trained on data from two of the three locations and evaluated on the held-out location, reporting the mean F1 across the three possible holds.
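The two protocols map directly onto standard scikit-learn splitters; the sketch below uses placeholder arrays and omits the paper's additional location balancing and time-aware splitting:

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold

X = np.random.rand(600, 128 * 128)      # placeholder flattened Mel-spectrogram windows
y = np.random.randint(0, 2, 600)        # buzz = 1, no-buzz = 0
loc = np.random.randint(0, 3, 600)      # one of the three recording locations

# CV: stratified five folds over segments.
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    pass  # fit on train_idx, report F1 on test_idx; average over folds

# LOLO: hold out each location in turn to test cross-site generalization.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=loc):
    pass  # three holds; report the mean F1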
2.4.3 Hyperparameters. We trained the models with cross-entropy loss and the Adam optimizer, using a 1-cycle learning-rate schedule (maximum LR = 0.001), gradient clipping at 0.1, batch size 64, and 20 epochs. Compared to [4], the only change was increasing the number of epochs from 10 to 20. For pruning fine-tuning, we trained for 10 epochs with a fixed learning rate of 0.0001 and no scheduler. We compared pruning ratios p of 0% (no pruning), 20%, 30%, and 50%.

Table 2: The architectures of BeeNet1, BeeNet2, and BeeNet3 (Type / Stride, Filter Shape, Input Size)

BeeNet1:
Conv / s1      3×3×1×32      128×128×1
MaxPool / s2   Pool 2×2      128×128×32
Conv dw / s1   3×3×32 dw     64×64×32
Conv / s1      1×1×32×32     64×64×32
MaxPool / s2   Pool 2×2      64×64×32
Conv dw / s1   3×3×32 dw     32×32×32
Conv / s1      1×1×32×64     32×32×32
MaxPool / s2   Pool 2×2      32×32×64
Conv dw / s1   3×3×64 dw     16×16×64
Conv / s1      1×1×64×128    16×16×64
MaxPool / s2   Pool 2×2      16×16×128
Conv dw / s1   3×3×128 dw    8×8×128
Conv / s1      1×1×128×256   8×8×128
MaxPool / s4   Pool 4×4      8×8×256
FC / s1        1024×2        2×2×256
Softmax / s1   Classifier    1×1×2

BeeNet2:
Conv / s1      3×3×1×32      128×128×1
MaxPool / s2   Pool 2×2      128×128×32
Conv dw / s1   3×3×32 dw     64×64×32
Conv / s1      1×1×32×32     64×64×32
MaxPool / s2   Pool 2×2      64×64×32
Conv dw / s1   3×3×32 dw     32×32×32
Conv / s1      1×1×32×64     32×32×32
MaxPool / s2   Pool 2×2      32×32×64
Conv dw / s1   3×3×64 dw     16×16×64
Conv / s1      1×1×64×128    16×16×64
MaxPool / s4   Pool 4×4      16×16×128
FC / s1        2048×2        4×4×128
Softmax / s1   Classifier    1×1×2

BeeNet3:
Conv / s1      3×3×1×32      128×128×1
MaxPool / s2   Pool 2×2      128×128×32
Conv dw / s1   3×3×32 dw     64×64×32
Conv / s1      1×1×32×32     64×64×32
MaxPool / s2   Pool 2×2      64×64×32
Conv dw / s1   3×3×32 dw     32×32×32
Conv / s1      1×1×32×64     32×32×32
MaxPool / s8   Pool 8×8      32×32×64
FC / s1        1024×2        4×4×64
Softmax / s1   Classifier    1×1×2
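Read off Table 2, the smallest variant can be written as the following PyTorch module; this is a minimal reconstruction under the conventions stated in Section 2.2 (BatchNorm + ReLU after every convolution), not the authors' released code:

import torch
import torch.nn as nn

def conv_bn(in_c, out_c, k, groups=1):
    # Per Section 2.2, every convolution is followed by BatchNorm and ReLU.
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, k, padding=k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(out_c),
        nn.ReLU(inplace=True),
    )

class BeeNet3(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn(1, 32, 3), nn.MaxPool2d(2),   # 128x128x1 -> 64x64x32
            conv_bn(32, 32, 3, groups=32),        # depthwise 3x3
            conv_bn(32, 32, 1), nn.MaxPool2d(2),  # pointwise -> 32x32x32
            conv_bn(32, 32, 3, groups=32),        # depthwise 3x3
            conv_bn(32, 64, 1), nn.MaxPool2d(8),  # pointwise -> 4x4x64
        )
        self.classifier = nn.Linear(4 * 4 * 64, 2)  # FC 1024x2

    def forward(self, x):                           # x: (batch, 1, 128, 128)
        return self.classifier(self.features(x).flatten(1))

Counting parameters of this reconstruction gives roughly 6.4k, which matches the BeeNet3 entry in Table 1.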
3 Results
3.1 F1 vs. Model Size
For each model, we trained and evaluated a variety of combinations of pruning ratios and quantizations. Table 3 reports the mean F1 and on-disk model size (in MiB) for each setting. Figure 1 plots all configurations in the F1-model-size plane for CV and LOLO, respectively, with the global Pareto front (the best trade-offs between model performance and size) denoted by a dashed line.

Even under tight memory budgets (< 100 KiB), competitive accuracy is achievable. For example, BeeNet1 (int8, p=0) attains 0.062 MiB with a CV F1 of 92.5% and a LOLO F1 of 85.7%. Relative to ResNet-9 (float32, p=0), this represents an ~400× reduction in model size while keeping F1 within 4 percentage points in both protocols, which is promising for future edge deployment.

Performance degradation from int8 quantization is small: across many settings the F1 drop is about 1 percentage point (pp). With pruning, larger models exhibit smaller accuracy losses as p increases. For example, at p=50% ResNet-9 (float32) decreases only from 96.2% to 95.1% in CV and from 89.5% to 87.6% in LOLO, a decline of about 2 pp in total. By contrast, the more compact BeeNet family is more sensitive: accuracy degrades markedly with p, and at p=50% most configurations lose 4 pp or more.

Inspection of the global Pareto front shows that many frontier points correspond to unpruned float32 or int8 models. At a fixed memory budget, lightly pruned or unpruned lightweight architectures often achieve higher accuracy than heavily pruned larger networks, indicating that purpose-built small models are preferable to aggressive pruning under the same size constraint.

A note on MobileNetV2 at p=30%: the trained model degenerated to predicting no-buzz for almost all inputs. This behavior may stem from a strong structured reduction under class imbalance and warrants further investigation.
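The global Pareto front used in Figure 1 has a simple one-pass construction; the sketch below shows it on a few illustrative (size, F1) pairs taken from the p=0 CV column of Table 3:

def pareto_front(points):
    # points: iterable of (size_mib, f1). A configuration is on the front if no
    # other configuration is both smaller and at least as accurate.
    front, best_f1 = [], float("-inf")
    for size, f1 in sorted(points):   # ascending size
        if f1 > best_f1:
            front.append((size, f1))
            best_f1 = f1
    return front

print(pareto_front([(25.2, 96.2), (8.7, 96.6), (2.2, 95.4), (0.062, 92.5)]))
# -> [(0.062, 92.5), (2.2, 95.4), (8.7, 96.6)]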
Inspection of the global Pareto front shows that many frontier points correspond to unpruned float32 or int8 models. At a fixed memory budget, lightly pruned or unpruned lightweight architectures often achieve higher accuracy than heavily pruned larger networks, indicating that purpose-built small models are preferable to aggressive pruning under the same size constraint.

A note on MobileNetV2 at p=30%: the trained model degenerated to predicting no-buzz for almost all inputs. This behavior may stem from a strong structured reduction under class imbalance and warrants further investigation.

4 Conclusions

We addressed buzz detection in acoustic recordings from solitary-bee nesting boxes, aiming to develop deep-learning models suitable for memory-constrained edge deployment. We designed or selected five CNN architectures and systematically measured the performance vs. model-size trade-offs under quantization and structured pruning. As a result, we obtained sub-100 KiB models achieving F1 scores of at least 92% in CV and 85% in LOLO experiments, indicating the feasibility and strong potential of accurate on-device inference.

For future work, we plan to train the models on additional datasets that we have collected to improve robustness, and to deploy the models on real edge devices. Because our compression pipeline relied on simple techniques, we anticipate further gains by adopting a broader set of compression methods, such as knowledge distillation [18], quantization-aware training (QAT) [19], and neural architecture search (NAS) [20], to optimize model architectures under memory constraints, including the number and sizes of the filters.

Acknowledgements

The authors acknowledge the Slovenian Research and Innovation Agency, grants PR-10495 and J7-50040, and basic core funding P2-0209.

References

[1] Tom D. Breeze, Alison P. Bailey, Kelvin G. Balcombe, and Simon G. Potts. 2011. Pollination services in the UK: how important are honeybees? Agriculture, Ecosystems & Environment, 142, 3-4, 137–143.
[2] Stefan Kahl, Connor M. Wood, Maximilian Eibl, and Holger Klinck. 2021. BirdNET: a deep learning solution for avian diversity monitoring. Ecological Informatics, 61, 101236. doi: 10.1016/j.ecoinf.2021.101236.
[3] Mahsa Abdollahi, Pierre Giovenazzo, and Tiago H. Falk. 2022. Automated beehive acoustics monitoring: a comprehensive review of the literature and recommendations for future work. Applied Sciences, 12, 8, 3920.
[4] David Susič, Johanna A. Robinson, Danilo Bevk, and Anton Gradišek. 2025. Acoustic monitoring of solitary bee activity at nesting boxes. Ecological Solutions and Evidence, 6, 3, e70080. doi: 10.1002/2688-8319.70080.
[5] Alison Pereira Ribeiro, Nádia Felix Felipe da Silva, Fernanda Neiva Mesquita, Priscila de Cássia Souza Araújo, Thierson Couto Rosa, and José Neiva Mesquita-Neto. 2021. Machine learning approach for automatic recognition of tomato-pollinating bees based on their buzzing-sounds. PLOS Computational Biology, 17, 9, 1–21. doi: 10.1371/journal.pcbi.1009426.
[6] Antonio Robles-Guerrero, Tonatiuh Saucedo-Anaya, Carlos A. Guerrero-Mendez, Salvador Gómez-Jiménez, and David J. Navarro-Solís. 2023. Comparative study of machine learning models for bee colony acoustic pattern classification on low computational resources. Sensors, 23, 1. doi: 10.3390/s23010460.
[7] Jaehoon Kim, Jeongkyu Oh, and Tae-Young Heo. 2021. Acoustic scene classification and visualization of beehive sounds using machine learning algorithms and Grad-CAM. Mathematical Problems in Engineering, 2021, 1, 5594498.
[8] Vladimir Kulyukin, Sarbajit Mukherjee, and Prakhar Amlathe. 2018. Toward audio beehive monitoring: deep learning vs. standard machine learning in classifying beehive audio samples. Applied Sciences, 8, 9, 1573.
[9] Carrie C. Wall, Samara M. Haver, Leila T. Hatch, Jennifer Miksis-Olds, Rob Bochenek, Robert P. Dziak, and Jason Gedamke. 2021. The next wave of passive acoustic data management: how centralized access can enhance science. Frontiers in Marine Science, 8, 703682.
[10] Benjamin Cretois, Carolyn M. Rosten, and Sarab S. Sethi. 2022. Voice activity detection in eco-acoustic data enables privacy protection and is a proxy for human disturbance. Methods in Ecology and Evolution, 13, 12, 2865–2874.
[11] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510–4520.
[12] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[13] Hongrong Cheng, Miao Zhang, and Javen Qinfeng Shi. 2024. A survey on deep neural network pruning: taxonomy, comparison, analysis, and recommendations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46, 12, 10558–10578. doi: 10.1109/TPAMI.2024.3447085.
[14] Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems, 28.
[15] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2016. Pruning filters for efficient ConvNets. arXiv preprint arXiv:1608.08710.
[16] Gongfan Fang, Xinyin Ma, Mingli Song, Michael Bi Mi, and Xinchao Wang. 2023. DepGraph: towards any structural pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[17] Gongfan Fang and contributors. 2023. Torch-Pruning: structural pruning for PyTorch. https://github.com/VainF/Torch-Pruning.
[18] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[19] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2704–2713.
[20] Barret Zoph and Quoc V. Le. 2017. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
Interpretable Predictive Clustering Tree for Post-Intubation Hypotension Assessment

Estefanía Žugelj Tapia, Institute of Physiology, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia, estefania.tapia@mf.uni-lj.si
Borut Kirn, Institute of Physiology, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia, borut.kirn@mf.uni-lj.si
Sašo Džeroski, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia, saso.dzeroski@ijs.si

Abstract

Intraoperative hypotension following intubation is a clinically significant event associated with increased morbidity and mortality. This study presents an interpretable predictive clustering tree (PCT) model designed for multi-target prediction of hypotensive outcomes, including the prediction of minimum and maximum mean arterial pressure (MAP) values during hypotension in the post-induction period. The multi-target regression trees (MTRT) were evaluated using 10-fold cross-validation, and feature importance was assessed via a random forest model. Compared to the original tree, the pruned model demonstrated improved generalization and reduced complexity, with fewer nodes and enhanced interpretability. The pruned tree structure enabled clear decision thresholds based on modifiable variables such as MAP_after_5min, MAP_basal, and Propofol dose. While the random forest achieved the highest performance and had high complexity, its feature importance ranking supported the relevance of the attributes retained in the pruned model and provided complementary insights, highlighting globally relevant features, such as SBP_after_5min, that were not prioritized in the single trees. These findings support the use of interpretable models in clinical decision support to anticipate and potentially modify the occurrence of post-intubation hypotension.

Keywords

multi-target prediction, interpretable machine learning, decision tree pruning, feature importance, post-intubation, intraoperative hypotension
1 Introduction

Intubation is a common procedure in emergency departments and operating rooms, typically performed immediately after the administration of induction agents. These agents have been associated with hemodynamic instability and post-induction hypotension (PIH), frequently defined as mean arterial pressure (MAP) <65 mmHg [1]. In perioperative medicine in particular, PIH has been related to worse postoperative outcomes, increased comorbidity, and mortality [2,3]. PIH occurrence is limited to the first 30 minutes post-induction, as this period is directly affected by anesthesia effects and is usually not related to complex factors due to surgery [4].

Regarding risk factors, a post hoc analysis in a surgical population of patients at risk of aspiration of gastric content identified, in the multivariate analysis, several risk factors associated with PIH: age, a higher baseline heart rate, bowel occlusion requiring nasogastric tube placement before intubation, and the use of remifentanil. A prospective multicenter study found that in the group with hypotension, the dose (mg/kg) of Propofol was significantly higher at 5 and 10 minutes after induction [5]. On the other hand, the following protective factors have been described: low doses of ketamine and basal systolic blood pressure (SBP) [2].

Previous studies have employed traditional multivariate analysis to identify risk factors and have focused on predicting a single target: the presence of hypotension [2,4,5]. However, predicting multiple outcomes simultaneously can capture complex interactions and provide more informative insights, aiding clinical decision-making and support. The hypothesis of this study is therefore that predicting multiple outcomes of PIH simultaneously can effectively identify which variables are most influential in predicting PIH. Overall, this study contributes to the prediction of PIH, which can help anesthesiologists make better decisions during induction, potentially improving patient outcomes.

2 Methods

Predictive clustering trees (PCT) are a machine learning framework that unifies clustering and prediction tasks. In this framework, the node at the root (the top node) corresponds to the cluster that contains all the data, and each subsequent split partitions the data to minimize intra-cluster variance. CLUS is free software that implements this framework and supports multi-target prediction. In a multi-target regression tree (MTRT), the obtained tree is more reliable in explaining the dependencies between variables, and the prediction is a vector of values of the target attributes [6,7]. For this reason, CLUS version 2.12.8 was chosen as the software for this retrospective analysis. The documentation and latest version can be found at https://github.com/knowledge-technologies/clus/tree/main.
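As a toy illustration of this split heuristic (not CLUS code), the following sketch scores a candidate split on a multi-target matrix Y by how much it reduces the size-weighted sum of per-target variances; x and Y are hypothetical arrays.

```python
# Toy illustration (not CLUS code) of the variance-reduction heuristic
# a PCT uses to score a candidate split on multi-target data. Rows of Y
# are examples; columns are the target attributes.
import numpy as np

def total_variance(Y):
    # Sum of per-target variances within one cluster.
    return Y.var(axis=0).sum() if len(Y) > 0 else 0.0

def variance_reduction(Y, mask):
    left, right = Y[mask], Y[~mask]
    n = len(Y)
    return (total_variance(Y)
            - len(left) / n * total_variance(left)
            - len(right) / n * total_variance(right))

# Example with a hypothetical feature column x (e.g. MAP_after_5min):
# gain = variance_reduction(Y, x < 67.0)
```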
Data were sourced from the SIS subset of the MOVER database (https://mover.ics.uci.edu/), a public database of anonymized patients undergoing various types of surgery [8].

The inclusion and exclusion criteria were the following:

- Inclusion criteria: 1) patients who underwent major surgical procedures with a documented application and dose of one of the following medications during induction of general anesthesia: Midazolam, Propofol, Fentanyl, Succinylcholine, Ketamine, Cisatracurium, Etomidate, Vecuronium, or Rocuronium; 2) high-temporal-resolution vital signs of systolic blood pressure (SBP), diastolic blood pressure (DBP), and mean arterial pressure (MAP) measured from the radial arterial line, registered before the time of intubation and 30 minutes after it, with at least one measurement of MAP <65 mmHg during the post-intubation period.
- Exclusion criteria: patients with vital signs outside physiological ranges (MAP <30 mmHg or MAP >200 mmHg), and patients who did not meet the inclusion criteria.

As descriptive and target attributes of the learning problem (see Table 1), the following variables were calculated: 1) MAP_basal: average of MAP measurements before intubation; 2) SBP_basal: average of SBP measurements before intubation; 3) DBP_basal: average of DBP measurements before intubation; 4) MAP_after_5min: average of MAP measurements taken over a 5-minute period after intubation; 5) SBP_after_5min: average of SBP measurements taken over a 5-minute period after intubation; 6) DBP_after_5min: average of DBP measurements taken over a 5-minute period after intubation; 7) Min_MAP<65: minimum MAP <65 mmHg registered from intubation up to 30 minutes after; 8) Max_MAP<65: maximum MAP <65 mmHg registered from intubation up to 30 minutes after; 9) MAP<65_count: count of registered measurements <65 mmHg over the 30-minute interval after intubation; 10) MAP_mean_after_30min: average of MAP measurements over the 30-minute interval after intubation; 11) SBP_mean_after_30min: average of SBP measurements over the 30-minute interval after intubation; 12) MAP<65_mean_after_30_min: average of MAP measurements <65 mmHg over the 30-minute interval after intubation; and 13) body mass index (BMI): weight / ((height / 100)²). During data preparation, missing values of the height attribute were replaced with the mean value of the attribute.

Table 1: Descriptive and target attributes

Descriptive attributes (20): MAP_basal, SBP_basal, DBP_basal, MAP_after_5min, SBP_after_5min, DBP_after_5min, Age, Gender, Height, Weight, BMI, and the cumulative doses of Midazolam, Propofol, Fentanyl, Succinylcholine, Ketamine, Cisatracurium, Etomidate, Vecuronium, and Rocuronium.
Target attributes (6): Min_MAP<65, Max_MAP<65, MAP<65_count, MAP<65_mean_after_30_min, MAP_mean_after_30min, SBP_mean_after_30min.

After defining the descriptive and target attributes, the entire dataset of 340 patients was split into training and test sets using the sklearn library and the train_test_split function: 80% of the dataset was used for training (272 patients) and 20% for testing (68 patients). To run CLUS, the training and test sets were converted to ARFF format, and corresponding settings files (.s) were created to define the model parameters for the MTRT tasks. Both single-tree and ensemble models were trained, as summarized in Table 2.

Table 2: Tree and ensemble specifications for each respective MTRT

Model              Predictive clustering tree (PCT)   Random forest
Heuristic          Variance Reduction                 Variance Reduction
Pruning method     M5Multi                            -
Ensemble method    -                                  RForest
Feature ranking    -                                  Genie3

As an alternative to the train/test split, the -xval command-line option was used when running CLUS to perform cross-validation on all 340 examples. The number of folds (n = 10) was specified beforehand in the settings file.
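A sketch of this preparation step might look as follows; the file and column names are hypothetical, and the ARFF writer is simplified to numeric attributes only.

```python
# Sketch of the described preparation (hypothetical file and column names):
# mean imputation of height, BMI derivation, an 80/20 split with sklearn,
# and a simplified ARFF export for CLUS (numeric attributes only).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("mover_sis_subset.csv")            # hypothetical file name
df["height"] = df["height"].fillna(df["height"].mean())
df["BMI"] = df["weight"] / ((df["height"] / 100) ** 2)

train, test = train_test_split(df, train_size=0.8, random_state=42)

def to_arff(frame, path, relation="pih"):
    with open(path, "w") as f:
        f.write(f"@RELATION {relation}\n")
        for col in frame.columns:
            f.write(f"@ATTRIBUTE {col} NUMERIC\n")
        f.write("@DATA\n")
        frame.to_csv(f, header=False, index=False)

to_arff(train, "train.arff")
to_arff(test, "test.arff")
```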
Model performance was evaluated using the following metrics: mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), root relative mean squared error (RRMSE), and the Pearson correlation coefficient (r²), computed on both the training and test sets.
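For reference, the following sketch computes these metrics for one target. The exact RRMSE convention used by CLUS is not spelled out here, so it is assumed to be the RMSE normalized by the RMSE of always predicting the training-set mean (so a default model scores 1.00, as in the tables below).

```python
# Per-target metric sketch. The RRMSE convention is an assumption: RMSE
# normalized by the RMSE of predicting the training mean for every example.
import numpy as np

def regression_metrics(y_true, y_pred, y_train_mean):
    err = y_true - y_pred
    mae = np.abs(err).mean()
    mse = (err ** 2).mean()
    rmse = np.sqrt(mse)
    rrmse = rmse / np.sqrt(((y_true - y_train_mean) ** 2).mean())
    r2 = np.corrcoef(y_true, y_pred)[0, 1] ** 2  # squared Pearson correlation
    return mae, mse, rmse, rrmse, r2
```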
3 Results

After applying the exclusion criteria, we were left with 340 patients. Figure 1 illustrates the flow chart for patient selection, and Table 3 shows the demographic characteristics of the selected patients.

Figure 1: Overview of the sample population included in this study.

Table 3: Data set population characteristics

Age, years, mean (SD)       58.9 (18.9)
Gender (male), count        201
Weight, kg, mean (SD)       78.6 (23.1)
Height, cm, mean (SD)       168.4 (11.1)
BMI, kg/m², mean (SD)       1.5 (6.8)

3.1 Complexity and Structure of the Models

The induction time for the original model was significantly shorter (0.03 s, pruning time 0 s) than for the random forest model (1.62 s), reflecting its reduced complexity. Structurally, the original tree consisted of 241 nodes, 121 leaves, and a depth of 17, whereas the pruned tree was noticeably simpler, with only 19 nodes, 10 leaves, and a depth of 6. The ensemble random forest model, composed of 100 trees, contained a total of 21,050 nodes and 10,575 leaves, with an average tree depth of 154, indicating a significantly higher complexity and capacity for capturing intricate patterns in the data.

3.2 Model Performance

The forest with 100 trees exhibits the best performance overall. However, pruning significantly simplified the original model while retaining, and even improving, its predictive power, with lower testing errors for MAE, MSE, RMSE, and RRMSE compared to the original tree (see Table 4).

Table 4: Metrics for training and testing errors (train / test)

Metric       Default           Original (unpruned)   Pruned          Forest (100 trees)
MAE          7.27 / 7.30       2.58 / 7.55           5.41 / 6.18     2.93 / 5.22
MSE          109.35 / 110.12   17.77 / 120.54        61.39 / 83.03   18.41 / 51.05
RMSE         9.41 / 9.45       3.81 / 10.15          7.09 / 8.33     3.96 / 6.76
RRMSE        1.00 / 1.00       0.42 / 1.13           0.76 / 0.91     0.44 / 0.86
Pearson r²   - / 0.04          0.82 / 0.14           0.42 / 0.21     0.89 / 0.26

3.3 Cross-Validation Results

The 10-fold cross-validation was conducted using all 340 examples, with an induction time of 0.26 s for the single tree and 9.75 s for the ensemble random forest. The mean number of tests was 267 for the original model, 39.2 for the pruned model, and 100 for the random forest. As shown in Table 5, the absolute error metrics (MAE, MSE, RMSE) were higher than with the train/test split; however, the cross-validation approach yielded lower testing errors for RRMSE and higher Pearson r² values. Note that cross-validation yields more realistic estimates of the error on unseen examples than a single train/test split.

Table 5: Cross-validation metrics for training and testing errors (train / test)

Metric       Default           Original (unpruned)   Pruned          Forest (100 trees)
MAE          13.62 / 13.69     1.80 / 10.49          5.78 / 9.28     2.82 / 5.6
MSE          300.15 / 302.3    9.22 / 193.3          64.2 / 150.5    16.83 / 63.15
RMSE         17.32 / 17.39     3.04 / 13.90          8.01 / 12.27    3.81 / 7.41
RRMSE        1.00 / 1.00       0.18 / 0.80           0.46 / 0.70     0.43 / 0.84
Pearson r²   0.0003 / 0.02     0.97 / 0.45           0.79 / 0.52     0.89 / 0.28

3.4 Original Model

As stated in Section 3.1, the original model contains 241 nodes and 121 leaves. MAP_after_5min is at the root node, followed by MAP_basal; these two variables appear in the tree more than once. Except for Cisatracurium, Ketamine, and Etomidate, all remaining descriptive attributes appear in the nodes at least once, with different thresholds.

3.5 Pruned Model

In the pruned model, the descriptive attributes retained for multi-target prediction were MAP_after_5min, MAP_basal, BMI, SBP_basal, DBP_after_5min, and Propofol dose. Compared to the original tree, the pruned model demonstrated improved generalization and interpretability, with a significantly reduced number of nodes, as illustrated in Figure 2.

Figure 2: Pruned tree predicting Min_MAP<65, Max_MAP<65, MAP<65_count, MAP<65_mean_after_30_min, MAP_mean_after_30min, and SBP_mean_after_30min. Leaves display predictions in orange.

The highest predicted values for the target attributes (97.9 mmHg for MAP_mean_after_30min and 149.8 mmHg for SBP_mean_after_30min) were observed when MAP_after_5min exceeded 93 mmHg and SBP_basal was greater than 181 mmHg. On the other hand, the lowest predicted values of these target variables (50.4 and 58.2 mmHg) were derived from the following nodes: MAP_after_5min below 56 mmHg, BMI <34.5 kg/m², and Propofol dose >150 mg. Additionally, the leaf node corresponding to BMI >34.5 kg/m² yielded the deepest value for Min_MAP<65, at 26.3 mmHg. Other notably low predictions related to hypotension included Max_MAP<65 at 43.7 mmHg and MAP<65_mean_after_30_min at 43.3 mmHg, both derived from the node where MAP_basal was below 51 mmHg.

3.6 Forest and Feature Ranking

Despite the complexity of the forest with 100 trees, the feature ranking, in which feature importance was assessed using the Genie3 score, helps to identify the descriptive attributes that contributed most to the final multi-target prediction. Figure 3 lists the first eleven descriptive attributes, ranked by their importance scores. MAP_after_5min and SBP_after_5min are clearly the most influential features in the model; MAP_basal and SBP_basal follow closely and also contribute significantly.

Figure 3: Descriptive attributes contributing most to the random forest's prediction, sorted by importance score.
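Genie3 scoring is specific to CLUS, but an analogous ensemble ranking can be obtained from impurity-based importances of a multi-output random forest, as in the following sketch; X_train, Y_train, and feature_names are placeholders for the Table 1 matrices, and this is an approximation of the idea rather than the Genie3 score itself.

```python
# Analogous ranking (not Genie3 itself): impurity-based importances from a
# multi-output random forest, aggregated over 100 trees. X_train, Y_train,
# and feature_names are placeholders for the Table 1 data.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, Y_train)          # one column of Y_train per target attribute

ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking[:11]:  # top eleven attributes, as in Figure 3
    print(f"{name}: {score:.3f}")
```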
4 Discussion & Conclusion

The advantages of using a predictive clustering method for multi-target prediction include the ability to capture complex interactions between descriptive attributes and the simultaneous prediction of multiple outcomes [6,7]. A key novelty of this study is its focus on predicting multiple outcomes related to hypotension. This multi-target approach provides a more comprehensive overview and enhances clinical decision support. In clinical practice, anesthesiologists need to anticipate and often ask themselves: How low will MAP values drop? How will MAP evolve throughout the procedure? This is highly relevant because deeper and longer hypotensive episodes increase the occurrence of adverse events associated with intraoperative hypotension [3,4].

In this study, the pruned model included MAP_after_5min and MAP_basal among the most important variables for the multi-target prediction. Previous studies have significantly associated PIH with the basal or pre-induction MAP [2,4,5], and our results confirm this observation: in the root node, the MAP value was most relevant when calculated immediately, 5 minutes after intubation, specifically with a decisive threshold of 67 mmHg. To diminish the impact of basal blood pressure values on the occurrence of PIH episodes, some proposals include discontinuing renin–angiotensin–aldosterone system antagonists on the day of surgery and taking proactive measures to elevate preoperative values so as to relieve the effect of the anesthetic medications, which could prevent the appearance of PIH [3,4].

The obtained pruned predictive clustering tree model showed lower testing errors across all metrics compared to the original tree, with improved performance, interpretability, and generalization. Nevertheless, the random forest model performed best. Regardless of the complexity of the ensemble model, the feature ranking provided valuable insights into the contribution of each attribute to the final prediction; some of these top-ranked features also appear along the nodes of the unpruned and pruned trees. By aggregating importance across multiple trees, random forests can highlight globally relevant features that may not dominate early decision paths in a single tree. For example, SBP_after_5min was ranked second in importance, but it did not appear in the top splits of the unpruned tree. The pruned tree includes BMI and Propofol dose, but not SBP_after_5min, age, and DBP_basal, even though these ranked higher than BMI and Propofol dose. The association between higher age and PIH has been noted in the past [2,5], and age is usually considered during risk evaluation; however, it is not a modifiable attribute.

In sum, this study demonstrates that interpretable models such as pruned trees, when supported by feature importance from high-performing models, can validate and offer clear, decisive thresholds for modifiable and actionable variables that impact MAP values in the post-induction period, thereby helping to reduce PIH-related comorbidity and mortality. This highlights their potential utility as decision-support tools in clinical settings.

References

[1] Salmasi V, Maheshwari K, Yang D, Mascha EJ, Singh A, Sessler DI, et al. Relationship between intraoperative hypotension, defined by either reduction from baseline or absolute thresholds, and acute kidney and myocardial injury after noncardiac surgery. Anesthesiology 2017;126(1):47–65. doi: 10.1097/ALN.0000000000001432.
[2] Grillot N, Gonzalez V, Deransy R, Rouhani A, Cintrat G, Rooze P, et al. Post-induction hypotension during rapid sequence intubation in the operating room: a post hoc analysis of the randomized controlled REMICRUSH trial. Anaesth Crit Care Pain Med 2025;44(3):101502. doi: 10.1016/j.accpm.2025.101502.
[3] Sessler DI, Bloomstone JA, Aronson S, Berry C, Gan TJ, Kellum JA, et al. Perioperative Quality Initiative consensus statement on intraoperative blood pressure, risk and outcomes for elective surgery. Br J Anaesth 2019;122(5):563–74. doi: 10.1016/j.bja.2019.01.013.
[4] Südfeld S, Brechnitz S, Wagner JY, Reese PC, Pinnschmidt HO, Reuter DA, et al. Post-induction hypotension and early intraoperative hypotension associated with general anaesthesia. Br J Anaesth 2017;119(1):57–64. doi: 10.1093/bja/aex127.
[5] Jor O, Maca J, Koutna J, Gemrotova M, Vymazal T, Litschmannova M, et al. Hypotension after induction of general anesthesia: occurrence, risk factors, and therapy. A prospective multicentre observational study. J Anesth 2018;32(5):673–80. doi: 10.1007/s00540-018-2532-6.
[6] Kocev D, Vens C, Struyf J, Džeroski S. Tree ensembles for predicting structured outputs. Pattern Recognit 2012;46:817–33. doi: 10.1016/j.patcog.2012.09.023.
[7] Petković M, Levatić J, Kocev D, Breskvar M, Džeroski S. CLUSplus: a decision tree-based framework for predicting structured outputs. SoftwareX 2023;24:101526. doi: 10.1016/j.softx.2023.101526.
[8] Samad M, Angel M, Rinehart J, Kanomata Y, Baldi P, Cannesson M. Medical Informatics Operating Room Vitals and Events Repository (MOVER): a public-access operating room database. JAMIA Open 2023;6(4). doi: 10.1093/jamiaopen/ooad084.
Indeks avtorjev / Author index

Ambrožič Žan 7
Andova Andrejaana 23
Anžur Zoja 11
Azad Fatemeh 15
Bianco Lorenzo 7
Bohanec Marko 19
Buchaillot Maria Luisa 79
Builder Calum 79
Comte Thibault 43
Cork Jordan 23
Crozzoli Miguel 79
Di Giacomo Valentina 71
Dobravec Blaž 27
Dominici Gabrielle 71
Džeroski Sašo 87
Fenoglio Dario 71
Filipič Bogdan 23
Finžgar Miha 83
Gams Matjaž 43
Gradišek Anton 7, 35, 63, 79, 83
Hassani Yanny 43
Herke Anna-Katharina 31
Inagawa Maori 47
Jelenc Matej 35
Jordan Marko 71
Jurič Rok 35
Kirn Borut 87
Kocuvan Primož 39
Kolar Žiga 43
Kramar Sebastjan 71
Krömer Pavel 23
Krstevska Ana 71
Kukar Matjaž 15
Longar Vinko 39
Louvancour Hugues 43
Lukan Junoš 47
Luštrek Mitja 11, 47, 55, 71
Maistrou Sevasti 79
Mancuso Elena 71
Marinković Mila 51
Nemec Vid 55
op den Akker Harm 71
Rajher Rok 59
Rajkovič Uroš 19
Rajkovič Vladislav 19
Ratajec Mariša 63
Rechberger Nina 67
Reščič Nina 63, 71
Romanova Alex 75
Shulajkovska Miljana 35
Slapničar Gašper 11, 55
Smerkol Maj 7, 83
Štrum Rok 7
Struna Rok 39
Susič David 7, 79, 83
Tušar Tea 23
van der Jagt Lotte 71
Vastenburg Martijn 71
Vukašinović Dragana 79
Yagi Ryo 83
Žabkar Jure 27, 51, 59
Založnik Marcel 71
Žugelj Tapia Estefanía 87

Slovenska konferenca o umetni inteligenci / Slovenian Conference on Artificial Intelligence
Uredniki / Editors: Mitja Luštrek, Matjaž Gams, Rok Piltaver