8.–9. oktober 2025 | 8–9 October 2025, Ljubljana, Slovenia

IS 2025 INFORMACIJSKA DRUŽBA | INFORMATION SOCIETY

Slovenska konferenca o umetni inteligenci | Slovenian Conference on Artificial Intelligence

Zbornik 28. mednarodne multikonference, Zvezek A | Proceedings of the 28th International Multiconference, Volume A

Uredniki | Editors: Mitja Luštrek, Matjaž Gams, Rok Piltaver

Zbornik 28. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2025, Zvezek A
Proceedings of the 28th International Multiconference INFORMATION SOCIETY – IS 2025, Volume A

Slovenska konferenca o umetni inteligenci
Slovenian Conference on Artificial Intelligence

Uredniki / Editors: Mitja Luštrek, Matjaž Gams, Rok Piltaver

http://is.ijs.si

8.–9. oktober 2025 / 8–9 October 2025, Ljubljana, Slovenia

Uredniki:
Mitja Luštrek, Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana
Matjaž Gams, Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana
Rok Piltaver, Outfit7, Ljubljana

Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič, uporabljena slika iz Pixabay
Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2025

Informacijska družba, ISSN 2630-371X
DOI: https://doi.org/10.70314/is.2025.skui
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 255435267
ISBN 978-961-264-319-5 (PDF)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2025

28. mednarodna multikonferenca Informacijska družba se odvija v času izjemne rasti umetne inteligence, njenih aplikacij in vplivov na človeštvo. Vsako leto vstopamo v novo dobo, v kateri generativna umetna inteligenca ter drugi inovativni pristopi oblikujejo poti k superinteligenci in singularnosti, ki bosta krojili prihodnost človeške civilizacije. Naša konferenca je tako hkrati tradicionalna znanstvena in akademsko odprta, pa tudi inkubator novih, pogumnih idej in pogledov. Letošnja konferenca poleg umetne inteligence vključuje tudi razprave o perečih temah današnjega časa: ohranjanje okolja, demografski izzivi, zdravstvo in preobrazba družbenih struktur. Razvoj UI ponuja rešitve za številne sodobne izzive, kar poudarja pomen sodelovanja med raziskovalci, strokovnjaki in odločevalci pri oblikovanju trajnostnih strategij. Zavedamo se, da živimo v obdobju velikih sprememb, kjer je ključno, da z inovativnimi pristopi in poglobljenim znanjem ustvarimo informacijsko družbo, ki bo varna, vključujoča in trajnostna.

V okviru multikonference smo letos združili dvanajst vsebinsko raznolikih srečanj, ki odražajo širino in globino informacijskih ved: od umetne inteligence v zdravstvu, demografskih in družinskih analiz, digitalne preobrazbe zdravstvene nege ter digitalne vključenosti v informacijski družbi, do raziskav na področju kognitivne znanosti, zdrave dolgoživosti ter vzgoje in izobraževanja v informacijski družbi. Pridružujejo se konference o legendah računalništva in informatike, prenosu tehnologij, mitih in resnicah o varovanju okolja, odkrivanju znanja in podatkovnih skladiščih ter seveda Slovenska konferenca o umetni inteligenci. Poleg referatov bodo okrogle mize in delavnice omogočile poglobljeno izmenjavo mnenj, ki bo pomembno prispevala k oblikovanju prihodnje informacijske družbe. »Legende računalništva in informatike« predstavljajo domači »Hall of Fame« za izjemne posameznike s tega področja.
Še naprej bomo spodbujali raziskovanje in razvoj, odličnost in sodelovanje; razširjeni referati bodo objavljeni v reviji Informatica, s podporo dolgoletne tradicije in v sodelovanju z akademskimi institucijami ter strokovnimi združenji, kot so ACM Slovenija, SLAIS, Slovensko društvo Informatika in Inženirska akademija Slovenije.

Vsako leto izberemo najbolj izstopajoče dosežke. Letos je nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe prejel Niko Schlamberger, priznanje za raziskovalni dosežek leta pa Tome Eftimov. »Informacijsko limono« za najmanj primerno informacijsko tematiko je prejela odsotnost obveznega pouka računalništva v osnovnih šolah. »Informacijsko jagodo« za najboljši sistem ali storitev v letih 2024/2025 pa so prejeli Marko Robnik Šikonja, Domen Vreš in Simon Krek s skupino za slovenski veliki jezikovni model GaMS. Iskrene čestitke vsem nagrajencem!

Naša vizija ostaja jasna: prepoznati, izkoristiti in oblikovati priložnosti, ki jih prinaša digitalna preobrazba, ter ustvariti informacijsko družbo, ki koristi vsem njenim članom. Vsem sodelujočim se zahvaljujemo za njihov prispevek — veseli nas, da bomo skupaj oblikovali prihodnje dosežke, ki jih bo soustvarjala ta konferenca.

Mojca Ciglarič, predsednica programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD TO THE MULTICONFERENCE INFORMATION SOCIETY 2025

The 28th International Multiconference on the Information Society takes place at a time of remarkable growth in artificial intelligence, its applications, and its impact on humanity. Each year we enter a new era in which generative AI and other innovative approaches shape the path toward superintelligence and singularity — phenomena that will shape the future of human civilization. The conference is both a traditional scientific forum and an academically open incubator for new, bold ideas and perspectives. In addition to artificial intelligence, this year's conference addresses other pressing issues of our time: environmental preservation, demographic challenges, healthcare, and the transformation of social structures. The rapid development of AI offers potential solutions to many of today's challenges and highlights the importance of collaboration among researchers, experts, and policymakers in designing sustainable strategies. We are acutely aware that we live in an era of profound change, where innovative approaches and deep knowledge are essential to creating an information society that is safe, inclusive, and sustainable.

This year's multiconference brings together twelve thematically diverse meetings reflecting the breadth and depth of the information sciences: from artificial intelligence in healthcare, demographic and family studies, and the digital transformation of nursing and digital inclusion, to research in cognitive science, healthy longevity, and education in the information society. Additional conferences include Legends of Computing and Informatics, Technology Transfer, Myths and Truths of Environmental Protection, Knowledge Discovery and Data Warehouses, and, of course, the Slovenian Conference on Artificial Intelligence. Alongside scientific papers, round tables and workshops will provide opportunities for in-depth exchanges of views, making an important contribution to shaping the future information society. Legends of Computing and Informatics serves as a national »Hall of Fame« honoring outstanding individuals in the field.
We will continue to promote research and development, excellence, and collaboration. Extended papers will be published in the journal Informatica, supported by a long-standing tradition and in cooperation with academic institutions and professional associations such as ACM Slovenia, SLAIS, the Slovenian Society Informatika, and the Slovenian Academy of Engineering.

Each year we recognize the most distinguished achievements. In 2025, the Michie-Turing Award for lifetime contribution to the development and promotion of the information society was awarded to Niko Schlamberger, while the Award for Research Achievement of the Year went to Tome Eftimov. The »Information Lemon« for the least appropriate information-related topic was awarded to the absence of compulsory computer science education in primary schools. The »Information Strawberry« for the best system or service in 2024/2025 was awarded to Marko Robnik Šikonja, Domen Vreš and Simon Krek together with their team, for developing the Slovenian large language model GaMS. We extend our warmest congratulations to all awardees.

Our vision remains clear: to identify, seize, and shape the opportunities offered by digital transformation, and to create an information society that benefits all its members. We sincerely thank all participants for their contributions and look forward to jointly shaping the future achievements that this conference will help bring about.

Mojca Ciglarič, Chair of the Program Committee
Matjaž Gams, Chair of the Organizing Committee

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee: Vladimir Bajic (South Africa), Heiner Benking (Germany), Se Woo Cheon (South Korea), Howie Firth (UK), Olga Fomichova (Russia), Vladimir Fomichov (Russia), Vesna Hljuz Dobric (Croatia), Alfred Inselberg (Israel), Jay Liebowitz (USA), Huan Liu (Singapore), Henz Martin (Germany), Marcin Paprzycki (USA), Claude Sammut (Australia), Jiri Wiedermann (Czech Republic), Xindong Wu (USA), Yiming Ye (USA), Ning Zhong (USA), Wray Buntine (Australia), Bezalel Gavish (USA), Gal A. Kaminka (Israel), Mike Bain (Australia), Michela Milano (Italy), Derong Liu (Chicago, USA), Toby Walsh (Australia), Sergio Campos-Cordobes (Spain), Shabnam Farahmand (Finland), Sergio Crovella (Italy)

Organizing Committee: Matjaž Gams (chair), Mitja Luštrek, Lana Zemljak, Vesna Koricki, Mitja Lasič, Blaž Mahnič

Programme Committee: Mojca Ciglarič (chair), Marjan Heričko, Boštjan Vilfan, Bojan Orel, Borka Jerman Blažič Džonova, Baldomir Zajc, Franc Solina, Gorazd Kandus, Blaž Zupan, Viljan Mahnič, Urban Kordeš, Boris Žemva, Cene Bavec, Marjan Krisper, Leon Žlajpah, Tomaž Kalin, Andrej Kuščer, Niko Zimic, Jozsef Györkös, Jadran Lenarčič, Rok Piltaver, Tadej Bajd, Borut Likar, Toma Strle, Jaroslav Berce, Janez Malačič, Tine Kolenik, Mojca Bernik, Olga Markič, Franci Pivec, Marko Bohanec, Dunja Mladenič, Uroš Rajkovič, Ivan Bratko, Franc Novak, Borut Batagelj, Andrej Brodnik, Vladislav Rajkovič, Tomaž Ogrin, Dušan Caf, Grega Repovš, Aleš Ude, Saša Divjak, Ivan Rozman, Bojan Blažica, Tomaž Erjavec, Niko Schlamberger, Matjaž Kljun, Bogdan Filipič, Gašper Slapničar, Robert Blatnik, Andrej Gams, Stanko Strmčnik, Erik Dovgan, Matjaž Gams, Jurij Šilc, Špela Stres, Mitja Luštrek, Jurij Tasič, Anton Gradišek, Marko Grobelnik, Denis Trček, Nikola Guid, Andrej Ule

KAZALO / TABLE OF CONTENTS

Slovenska konferenca o umetni inteligenci / Slovenian Conference on Artificial Intelligence .......... 1
PREDGOVOR / FOREWORD .......... 3
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES .......... 5
Detecting Pollinators from Stem Vibrations Using a Neural Network / Ambrožič Žan, Bianco Lorenzo, Šturm Rok, Susič David, Smerkol Maj, Gradišek Anton .......... 7
Thermal Camera-Based Cognitive Load Estimation: A Non-Invasive Approach / Anžur Zoja, Slapničar Gašper, Luštrek Mitja .......... 11
A Critical Perspective on MNAR Data: Imputation, Generation, and the Path Toward a Unified Framework / Azad Fatemeh, Kukar Matjaž .......... 15
Utilizing Large Language Models for Supporting Multi-Criteria Decision Modelling Method DEX / Bohanec Marko, Rajkovič Uroš, Rajkovič Vladislav .......... 19
Landscape-Aware Selection of Constraint Handling Techniques in Multiobjective Optimisation / Cork Jordan, Andova Andrejaana, Krömer Pavel, Tušar Tea, Filipič Bogdan .......... 23
Explaining Deep Reinforcement Learning Policy in Distribution Network Control / Dobravec Blaž, Žabkar Jure .......... 27
Leveraging AI in Melanoma Skin Cancer Diagnosis: Human Expertise vs. Machine Precision / Herke Anna-Katharina .......... 31
Prediction of Root Canal Treatment Using Machine Learning / Jelenc Matej, Shulajkovska Miljana, Jurič Rok, Gradišek Anton .......... 35
Predictive Maintenance of Machines in LABtop Production Environment / Kocuvan Primož, Longar Vinko, Struna Rok .......... 39
Machine Learning for Cutting Tool Wear Detection: A Multi-Dataset Benchmark Study Toward Predictive Maintenance / Kolar Žiga, Comte Thibault, Hassani Yanny, Louvancour Hugues, Gams Matjaž .......... 43
Extracting Structured Information About Food Loss and Waste Measurement Practices Using Large Language Models: A Feasibility Study / Lukan Junoš, Inagawa Maori, Luštrek Mitja .......... 47
Eye-Tracking Explains Cognitive Test Performance in Schizophrenia / Marinković Mila, Žabkar Jure .......... 51
Data-Driven Evaluation of Truck Driving Performance with Statistical and Machine Learning Methods / Nemec Vid, Slapničar Gašper, Luštrek Mitja .......... 55
Automated Explainable Schizophrenia Assessment from Verbal-Fluency Audio / Rajher Rok, Žabkar Jure .......... 59
Mapping Medical Procedure Codes Using Language Models / Ratajec Mariša, Gradišek Anton, Reščič Nina .......... 63
AI-Enabled Dynamic Spectrum Sharing in the Telecommunication Sector – Technical Aspects and Legal Challenges / Rechberger Nina .......... 67
SmartCHANGE Risk Prediction Tool: Next-Generation Risk Assessment for Children and Youth / Reščič Nina, Jordan Marko, Kramar Sebastjan, Krstevska Ana, Založnik Marcel, van der Jagt Lotte, op den Akker Harm, Vastenburg Martijn, Di Giacomo Valentina, Mancuso Elena, Fenoglio Dario, Dominici Gabrielle, Luštrek Mitja .......... 71
GNN Fusion of Voronoi Spatial Graphs and City–Year Temporal Graphs for Climate Analysis / Romanova Alex .......... 75
Towards Anomaly Detection in Forest Biodiversity Monitoring: A Pilot Study with Variational Autoencoders / Susič David, Buchaillot Maria Luisa, Crozzoli Miguel, Builder Calum, Maistrou Sevasti, Gradišek Anton, Vukašinović Dragana .......... 79
Development of a Lightweight Model for Detecting Solitary-Bee Buzz Using Pruning and Quantization for Edge Deployment / Yagi Ryo, Susič David, Smerkol Maj, Finžgar Miha, Gradišek Anton .......... 83
Interpretable Predictive Clustering Tree for Post-Intubation Hypotension Assessment / Žugelj Tapia Estefanía, Kirn Borut, Džeroski Sašo .......... 87
Indeks avtorjev / Author index .......... 91

Zbornik 28. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2025, Zvezek A
Proceedings of the 28th International Multiconference INFORMATION SOCIETY – IS 2025, Volume A

Slovenska konferenca o umetni inteligenci
Slovenian Conference on Artificial Intelligence

Uredniki / Editors: Mitja Luštrek, Matjaž Gams, Rok Piltaver

http://is.ijs.si

8.–9. oktober 2025 / 8–9 October 2025, Ljubljana, Slovenia

PREDGOVOR SLOVENSKI KONFERENCI O UMETNI INTELIGENCI

Slovenska konferenca o umetni inteligenci se letos odvija v času, ko umetna inteligenca še naprej intenzivno prodira v znanost, industrijo in vsakdanje življenje, še nikoli tako hitro in tako koristno. Še vedno so v ospredju veliki jezikovni modeli, ki so svoje zmožnosti razumevanja in generiranja že uspešno razširili na zvok, slike in video. Zanimivo raziskovalno področje so tudi temeljni (angl. foundation) modeli za druge vrste podatkov – npr. senzorskih in bioloških, pa tudi takih za robotske akcije, ki jih je takisto mogoče povezati z jezikom. Ti modeli so posebej dragoceni v medicinskih raziskavah, kjer so že privedli do razvoja novih zdravilnih učinkovin. Tovrstne raziskave bodo morda privedle do modelov, ki bodo znali celostno razumevati svet in nanj tudi vplivati, kar močno diši po umetni splošni inteligenci. Najnaprednejše raziskave umetne inteligence danes zahtevajo infrastrukturo, ki je v Sloveniji nimamo in se je tudi ne moremo nadejati, vseeno pa se je v zadnjem letu tudi v domačih logih zgodilo marsikaj zanimivega.
Največji dogodek je bil bržkone pridobitev financiranja za Slovensko tovarno umetne inteligence – superračunalnik za 150 milijonov EUR, prilagojen umetni inteligenci. Poleg tega je bil zgrajen velik jezikovni model za slovenščino GaMS, ki omogoča boljše izražanje v našem jeziku in prispeva k slovenski digitalni suverenosti. V Sloveniji nastaja tudi veliko aplikacij, ki uporabljajo velike jezikovne modele. Med njimi bi radi izpostavili zdravstvenega pomočnika HomeDOCtor, ki zna državljanom svetovati glede zdravstvenih težav bolje kot splošnonamenski modeli.

Vrnimo se zdaj h konferenci: letos ima 21 prispevkov, kar je največ po rekordnem letu 2020. Od teh jih dve tretjini prihajata z Instituta Jožef Stefan, kar ne odstopa dosti od statistike zadnjih let. Tako širša zastopanost različnih slovenskih ustanov vključno z industrijo še vedno ostaja naša želja.

Ponosni smo, da smo letošnjo konferenco obogatili s kar tremi posebnimi dogodki. Otvoritev sestavljata uvodni nagovor predstavnice Ministrstva za digitalno preobrazbo in vabljeno predavanje Eve Tube, ki je v Slovenijo prišla na prestižno mesto ERA Chair v okviru projekta AutoLearn-SI. Ker umetna inteligenca prodira v vse pore našega življenja, med katere sodi tudi umetnost, smo organizirali sekcijo Beyond Human Art prav na to temo. In nenazadnje smo Slovensko tovarno umetne inteligence obeležili s sekcijo, kjer smo se poučili o tovarni, njeni uporabi v znanstvenih raziskavah in vlogi pri obdelavi senzorskih podatkov.

Konferenca ostaja enkraten slovenski in mednarodni prostor odličnosti, odprte akademske razprave in novih idej. Ponosni smo, da skupaj gradimo slovensko skupnost umetne inteligence, ki s svojim znanjem in inovacijami prispeva k reševanju ključnih izzivov sodobnega časa ter krepi vlogo Slovenije v evropskem in svetovnem prostoru.

Mitja Luštrek, Matjaž Gams, Rok Piltaver

FOREWORD TO THE SLOVENIAN CONFERENCE ON ARTIFICIAL INTELLIGENCE

The Slovenian Conference on Artificial Intelligence is taking place this year at a time when AI continues to advance rapidly into science, industry, and everyday life, faster and more usefully than ever before. Large language models are still at the forefront, having already successfully expanded their capabilities to the understanding and generation of sound, images and video. Another interesting research area is foundation models for other types of data – for example, sensor and biological data, as well as robotic actions, which can likewise be connected to language. These models are especially valuable in medical research, where they have already led to the development of new therapeutic compounds. Such research may eventually result in models capable of comprehensively understanding the world and interacting with it, which strongly suggests artificial general intelligence. The most advanced artificial intelligence research today requires infrastructure that Slovenia does not have and cannot realistically expect, yet the past year has nevertheless seen several significant and interesting advances in Slovenia as well.

The most important milestone was probably securing the funding for the Slovenian Artificial Intelligence Factory – a 150 million EUR supercomputer tailored to artificial intelligence. In addition, a large language model for Slovenian, GaMS, was built, enabling better expression in our language and contributing to Slovenian digital sovereignty. Slovenia is also seeing the rise of many applications that make use of large language models.
Among them we would like to highlight the healthcare assistant HomeDOCtor, which is able to advise citizens on health issues better than general-purpose models.

Returning to the conference: this year it features 21 papers, the highest number since the record year of 2020. Of these, two thirds come from the Jožef Stefan Institute, which does not differ much from the statistics of recent years; a broader representation of various Slovenian institutions, including industry, thus remains our goal.

We are proud that this year's conference was enriched with three special events. The opening included a welcome address by a representative of the Ministry of Digital Transformation and a keynote lecture by Eva Tuba, who came to Slovenia to take up the prestigious ERA Chair position within the AutoLearn-SI project. Since artificial intelligence is making its way into every aspect of our lives, including art, we organized a special section titled Beyond Human Art dedicated to this theme. Finally, we marked the Slovenian Artificial Intelligence Factory with a session where we learned about the factory itself, its use in scientific research, and its role in processing sensor data.

The conference is a unique Slovenian and international venue for excellence, open academic debate and new ideas. We are proud that together we are building the Slovenian AI community, which, through its knowledge and innovations, contributes to addressing the key challenges of our time and strengthens Slovenia's role in Europe and globally.

Mitja Luštrek, Matjaž Gams, Rok Piltaver

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE
Mitja Luštrek, Matjaž Gams, Rok Piltaver, Zoja Anžur, Cene Bavec, Marko Bohanec, Marko Bonač, Ivan Bratko, Bojan Cestnik, Aleš Dobnikar, Erik Dovgan, Bogdan Filipič, Borka Jerman Blažič, Marjan Krisper, Marjan Mernik, Biljana Mileva Boshkoska, Vladislav Rajkovič, Niko Schlamberger, Tomaž Seljak, Peter Stanovnik, Damjan Strnad, Miha Štajdohar, Vasja Vehovar

Detecting Pollinators from Stem Vibrations Using a Neural Network

Žan Ambrožič (za44564@student.uni-lj.si), Faculty of Mathematics and Physics, University of Ljubljana, Ljubljana, Slovenia
Lorenzo Bianco (l.bianco@unito.it), Department of Life Science and System Biology, University of Turin, Turin, Italy
Rok Šturm (rok.sturm@nib.si), National Institute for Biology, Ljubljana, Slovenia
David Susič, Maj Smerkol, Anton Gradišek, Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia

Abstract
Passive sensing of pollinator activity is important for biodiversity monitoring and conservation, yet conventional acoustic or visual methods produce large amounts of data and face deployment challenges. In this work, we present initial results on investigating stem vibration as an alternative signal for detecting pollinator presence on flowers. Vibration recordings were collected with a laser vibration instrument from various flower species at multiple locations in Slovenia, totaling approximately 14 hours, of which 3 hours were expert-annotated as positive (insect activity present). The task was formulated as a binary classification problem: determining whether a vibration segment corresponds to a pollinator physically touching the flower. Using a neural network model, performance was evaluated with five-fold cross-validation across three experiments: (i) using a balanced subset, (ii) using the full dataset, and (iii) using the full dataset with heuristic prediction smoothing. On the balanced subset, the model achieved an average F1-score of 0.86 ± 0.06; on the full dataset, 0.62 ± 0.07; and with heuristic smoothing, 0.69 ± 0.11, demonstrating both the feasibility of vibration-based detection and the benefits of post-processing. Beyond binary detection, future work will focus on species- and activity-level classification. Ultimately, the goal is to develop lightweight vibration detectors deployable directly on plants, enabling scalable estimation of pollinator visitation rates in natural and agricultural environments.

Keywords
stem vibrations, pollination, neural networks, buzz detection, spectrograms

1 Introduction
Europe supports a rich diversity of wild pollinators, among them 2,051 species of bees and 892 species of hoverflies. Collectively, pollinators provide a wide range of benefits to society, including a contribution of more than €15 billion per year to the market value of European crops, pollinating around 78 percent of wild flowering plants. This pollination service ensures healthy ecosystem functioning and maintains wider biodiversity as well as culturally important flower-rich landscapes [1]. Many reviews highlight the global decline in insects [2], [3] and in particular wild bees [4], [5]. Internationally, the UN Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES) and the Convention on Biological Diversity (CBD) highlighted the need to collect long-term, high-quality data on pollinators and pollination services in order to direct policy and practice responses to address the decline. There have already been some attempts to monitor pollinators' activity from sound/soundscape recordings (e.g., [6]). Here, we explore for the first time the monitoring of pollination activity using vibroscape recordings [7] from flowering plants visited by different pollinators. We investigated the possibility of using neural networks for the automatic detection of pollinator visits on flowers.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.skui.5707
2 Dataset
The dataset comprises vibration waveforms from flowers, which were used for model training, and auxiliary audio and camera recordings collected for labeling and species identification. All recordings were obtained during July and August 2024 at various locations in Slovenia. The vibrations were measured using a VibroGo (Polytec, Germany) laser vibration instrument, which has an operational range of up to 30 m and can detect movements of up to 6 m/s at frequencies up to 320 kHz. For this study, measurements were performed at close range, with a frequency span of 0–24 kHz and a sampling rate of 48 kHz.

For the measurements, the flower stem was fixed to a pole to minimize large movements, and a small piece of reflective foil was attached slightly below the flower to enable the laser vibrometer to capture fine vibrations caused by insect activity. Our data acquisition setup is shown in Figure 1.

[Figure 1: Data acquisition setup for recording vibration signals, audio, and visual recordings from flowers.]

The dataset comprised vibration recordings of up to 10 minutes each, collected from flowers belonging to the genera Calystegia, Cichorium, Crepis, Epilobium, Knautia (the majority of samples), Leontodon, Lotus, Pastinaca, and Trifolium. In total, the recordings amounted to approximately 14 hours, of which 3 hours were annotated for insect activity (as positive), while the rest did not contain insect activity and was considered negative. Labeling was performed in Raven Pro by expert annotators, who used synchronized audio and video recordings to confirm insect presence and identify species. Each recording was annotated with time intervals indicating insect activity, insect species, activity type, and, when relevant, additional notes. For the purpose of this study, where we are only interested in the binary classification of detecting pollinators, all intervals with any insect activity that included physically touching the flower were labeled as 1, and 0 otherwise.

Labeled intervals were cut into clips of one second with 0.1-second overlap (positive instances), whereas unlabeled portions were similarly divided and treated as negative instances. To balance the dataset, the negative instances were randomly downsampled. Some negative instances contained environmental noise, such as speech, machinery, or wind, and wind noise was occasionally present in positive instances. Examples of vibration signals from honey bee foraging and from wind are shown in Figures 2 and 3, respectively. The final balanced subset consisted of 7334 positive and 8664 negative instances. The positive data distribution by insects is given in Table 1.

[Figure 2: Sample spectrogram of honey bee foraging (positive).]
[Figure 3: Sample spectrogram of light wind blowing (negative).]

Table 1: Number of labels and the corresponding number of instances by insect.

Insect      | Number of labels | Instances
fly         | 76               | 4146
honey bee   | 253              | 1688
wild bee    | 98               | 1307
hoverfly    | 82               | 155
bumble bee  | 14               | 24
wasp        | 3                | 9
moth        | 1                | 5
Total       | 527              | 7334
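The clip-extraction step described above can be made concrete with a short sketch. This is an illustrative reimplementation, not the authors' code: the function name extract_clips and the representation of annotated intervals as (start, end) pairs in seconds are assumptions, and the paper's exact handling of labeled versus unlabeled portions may differ in detail.

```python
import numpy as np

def extract_clips(waveform, sr, intervals, clip_s=1.0, overlap_s=0.1):
    """Cut a recording into 1 s clips with 0.1 s overlap; a clip is labeled
    1 if it overlaps an expert-annotated insect-activity interval."""
    clip = int(clip_s * sr)
    step = clip - int(overlap_s * sr)   # consecutive clips share 0.1 s
    clips, labels = [], []
    for start in range(0, len(waveform) - clip + 1, step):
        t0, t1 = start / sr, (start + clip) / sr
        # positive if the clip overlaps any annotated (a, b) interval
        positive = any(t0 < b and a < t1 for (a, b) in intervals)
        clips.append(waveform[start:start + clip])
        labels.append(int(positive))
    return np.stack(clips), np.array(labels)
```

Negative clips would then be randomly downsampled to produce the balanced subset described in the text.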
3 Methodology
The objective of this study was to assess whether stem vibrations can be used to detect the presence of pollinators on flowers. From a machine learning perspective, the problem was framed as a binary classification task: distinguishing between the presence and absence of insects in physical contact with the flower. The methodology consisted of the initial recording of waveforms and labeling, preprocessing the data, selecting the appropriate neural network architecture, and training and evaluating the model.

3.1 Data Preprocessing
First, the instances that were shorter than one second (in cases where the expert-labeled interval was shorter than one second) were padded. After that, all instances were converted into Mel spectrograms of size 64x64 using the fast Fourier transform with a frequency range of 0–3 kHz.
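As a rough illustration of this preprocessing step, the sketch below computes a 64x64 log-Mel spectrogram for a one-second clip using the librosa library. It is a minimal sketch under stated assumptions (48 kHz input, a hop length chosen to yield 64 time frames, FFT size 2048, and log scaling); the paper does not report these implementation details.

```python
import numpy as np
import librosa

SR = 48_000  # recording sampling rate from Section 2

def clip_to_melspec(clip, sr=SR, n_mels=64, n_frames=64, fmax=3000):
    """Convert a 1 s vibration clip into a 64x64 log-Mel spectrogram
    restricted to 0-3 kHz, as described in Section 3.1."""
    if len(clip) < sr:                        # pad short expert intervals
        clip = np.pad(clip, (0, sr - len(clip)))
    hop = int(np.ceil(len(clip) / n_frames))  # ~750 samples -> 64 frames
    mel = librosa.feature.melspectrogram(
        y=clip.astype(float), sr=sr, n_fft=2048, hop_length=hop,
        n_mels=n_mels, fmin=0.0, fmax=fmax)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    return mel_db[:, :n_frames]               # 64 mel bins x 64 frames
```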
3.2 Model Architecture
For the model, a network of residual blocks in combination with convolution was used. It is a smaller version of ResNet-style models (e.g., ResNet-18). Residual blocks offer efficient reuse of features across the layers. As shown in Figure 4, the input spectrogram goes through a 3x3 convolution, followed by three residual blocks, before final pooling. The residual block, shown in Figure 5, consists of two 3x3 convolutions to identify features, while the residual path only uses stride to match the shape before addition. To prevent overfitting and to enable extended training, dropout of 0.5 was used, which improved performance more than data augmentation (and was also computationally more efficient).

[Figure 4: Model architecture. Input (1×64×64) → Conv 3×3 + BN + ReLU (1→32) → Res Block (32→64) → Res Block (64→128) → Res Block (128→256) → Global AvgPool → Linear (256→1).]

[Figure 5: Residual block (Res Block) in Figure 4. Main path: Conv 3×3 (stride 2) + BN + ReLU → Conv 3×3 (stride 1) + BN; residual path: 1×1 Conv (stride 2) + BN; Add → ReLU.]

3.3 Model Training Settings
The model was trained using the binary cross-entropy loss. Optimization was performed with the Adam optimizer with a learning rate of $10^{-4}$ and weight decay of $10^{-5}$. A batch size of 16 and 30 epochs were used.
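A minimal PyTorch sketch of the architecture in Figures 4 and 5 and the training settings of Section 3.3 follows. This is a plausible reconstruction rather than the authors' code: the exact placement of the dropout layer, padding, and bias choices are assumptions not specified in the paper.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block from Figure 5: two 3x3 convolutions on the main path,
    a strided 1x1 convolution on the skip path to match shapes."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(c_out))
        self.skip = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
            nn.BatchNorm2d(c_out))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.main(x) + self.skip(x))

class BuzzDetector(nn.Module):
    """Figure 4: Conv 3x3 (1->32) + three residual blocks + global pooling."""
    def __init__(self, dropout=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1, bias=False),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            ResBlock(32, 64), ResBlock(64, 128), ResBlock(128, 256),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Sequential(nn.Flatten(), nn.Dropout(dropout),
                                  nn.Linear(256, 1))

    def forward(self, x):                    # x: (batch, 1, 64, 64)
        return self.head(self.features(x))   # one logit per clip

model = BuzzDetector()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on logits
```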
4 Evaluation Metrics
We used standard performance evaluation metrics: accuracy, precision, recall and F1-score, which were computed from the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)$$
$$\mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (4)$$

In the confusion matrices, we used relative numbers of samples for the colors instead of absolute numbers (which are only listed), because there were many more negatives than positives in the detection test. Relative shares are based on true labels (e.g., the fraction of FN among all negatively labeled samples).

4.1 Experiments
The model was evaluated in three experimental settings, all using 5-fold cross-validation. Instances originating from the same recording were always assigned to the same fold to better reflect real-world variability. Training and testing were repeated five times, each with a different fold held out for testing and the remaining folds used for training. Reported results are averages across the five folds.

4.1.1 Balanced labeled subset. In the first experiment, called "Subset", only the manually labeled subset of the dataset was used. This consisted of the 7334 positive and 8664 negative instances described above. These were treated as balanced binary classification data and evaluated directly.

4.1.2 Full dataset with raw labeling. In the second experiment, called "Full data (raw)", the entire dataset was included by segmenting recordings into 1.0 s windows with a step size of 0.1 s. Expert annotations were then used to assign labels to these windows, yielding a much larger evaluation set. However, such raw labeling frequently introduced short, isolated positive or negative events that were likely erroneous. When the model predicted such isolated events, performance metrics were underestimated, as the evaluation framework treated them as genuine labels. This motivated the introduction of a heuristic smoothing procedure.

4.1.3 Full dataset with heuristic labeling. The third experiment, called "Full data (heuristics)", used the same sliding-window segmentation as the raw labeling experiment, but applied a heuristic smoothing procedure to adjust labels. The aim was to reduce the influence of short, likely erroneous events while preserving longer, fragmented signals as single detections. Two rules were applied (a code sketch follows below):

- If the model predicted at least 10 consecutive positive windows (equivalent to 1.0 s), the entire interval was relabeled as positive.
- If at least 82% of 50 consecutive windows (equivalent to 5.0 s) were predicted as positive, the entire interval was relabeled as positive.

These empirically determined thresholds suppressed short false positives while ensuring that extended pollinator events with intermittent weak signals were still detected as continuous segments. Finally, because the sliding window (1.0 s) exceeded the step size (0.1 s), prediction timestamps were shifted backward by 0.5 s to align the window centers with the expert annotations.
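The two smoothing rules can be expressed compactly in code. The sketch below is one possible reading of the procedure (the paper does not publish an implementation): short positive runs are treated as erroneous and dropped, while long runs and densely positive 5-second stretches are kept as continuous positive segments.

```python
import numpy as np

def smooth_predictions(pred, min_run=10, window=50, frac=0.82):
    """Heuristic smoothing of per-window predictions (Section 4.1.3).

    pred: binary model predictions for consecutive 0.1 s-step windows.
    Rule 1: runs of >= min_run (1.0 s) consecutive positives are kept.
    Rule 2: any stretch of `window` (5.0 s) windows whose positive share
            reaches `frac` is relabeled entirely positive.
    """
    pred = np.asarray(pred, dtype=int)
    out = np.zeros_like(pred)

    # Rule 1: keep sufficiently long runs of consecutive positives
    run_start = None
    for i, p in enumerate(np.append(pred, 0)):   # sentinel ends last run
        if p and run_start is None:
            run_start = i
        elif not p and run_start is not None:
            if i - run_start >= min_run:
                out[run_start:i] = 1
            run_start = None

    # Rule 2: densely positive 5 s stretches become fully positive
    for i in range(len(pred) - window + 1):
        if pred[i:i + window].mean() >= frac:
            out[i:i + window] = 1
    return out
```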
5 Results and Discussion
The results of all three experiments are shown in Table 2, along with the confusion matrices in Figure 6.

Table 2: Results of all experiments. The numbers represent the average ± standard deviation across the five folds of the cross-validation.

Metric    | Subset      | Full data (raw) | Full data (heur.)
Accuracy  | 0.87 ± 0.03 | 0.80 ± 0.02     | 0.86 ± 0.05
Precision | 0.85 ± 0.09 | 0.54 ± 0.11     | 0.68 ± 0.15
Recall    | 0.87 ± 0.04 | 0.75 ± 0.11     | 0.73 ± 0.13
F1-score  | 0.86 ± 0.06 | 0.62 ± 0.07     | 0.69 ± 0.11

[Figure 6: Confusion matrices of all three experiments described in Section 4.1 (rows: actual, columns: predicted). Subset: TP 6409 (0.87), FN 925 (0.13), FP 1095 (0.13), TN 7569 (0.87). Full data (raw): TP 85k (0.74), FN 29k (0.26), FP 72k (0.18), TN 322k (0.82). Full data (heur.): TP 82k (0.72), FN 32k (0.28), FP 40k (0.10), TN 354k (0.90).]

The results show that there was a significant reduction in performance when we switched from the balanced subset to recordings from the full dataset. There are several possible sources of error. Labels are annotated on the waveform, and samples are extracted in such a way that the whole non-padded (therefore non-silent) part is either positive or negative; furthermore, the prediction for a specific time $t$ is generated based on the window beginning at $t$ and ending at $t + 1$ s, which might lead to inaccuracies at the edges of labels, although we shifted the time to match it as well as possible. In addition, the balanced samples contain no other insects or activities, whereas these do occur in the full recordings and are sometimes falsely detected as positive. It is important to note that "Full data" is not a balanced set (while "Subset" is) and is meant as a test for a real-world scenario, where conditions and the frequency of pollinators vary on short time scales (hours), which makes loss balancing (which would reduce the gap between recall and precision) very difficult in practice. For this reason, we did not use it, and we left the thresholds the same as in the "Subset" experiment, so the results serve as a valid estimation of the performance in reality. Figure 7 shows how the heuristics helped the model by smoothing out short erroneous predictions, resulting in improved performance. To improve model performance even further, additional heuristic filters may be added.

[Figure 7: Output example: (blue) model prediction, (green) heuristic filter, (yellow) expert labels.]

6 Conclusion
We presented initial results on the feasibility of detecting pollinator presence on flowers from stem vibration recordings using machine learning methods. We evaluated models under three experimental settings: a balanced labeled subset, the full dataset with raw expert annotations, and the full dataset with heuristic label smoothing. The results demonstrate that pollinator activity can be reliably inferred from vibration signals, with heuristic post-processing substantially reducing the impact of isolated erroneous predictions and improving the robustness of detection.

Future work will focus on extending the models beyond binary detection towards classification of pollinator species and potentially of behavioral activities. From an applied perspective, the long-term goal is to develop lightweight vibration detectors that can be mounted directly on plants to automatically register pollinator visits. Deploying a small number of such sensors in a field or meadow would enable scalable estimation of pollinator abundance and activity, providing a valuable tool for biodiversity monitoring and conservation studies.

Acknowledgements
The authors acknowledge the funding from the Slovenian Research and Innovation Agency, Grants PR-10495, P1-0255, J7-50040, Z1-50018, the basic core funding P2-0209, and the support received from the Erasmus+ Traineeship programme.

References
[1] European Commission, Joint Research Centre. 2021. Proposal for an EU pollinator monitoring scheme. Publications Office, LU. doi: 10.2760/881843.
[2] David L. Wagner. 2020. Insect declines in the Anthropocene. Annual Review of Entomology, 65, 1, (Jan. 2020), 457–480. doi: 10.1146/annurev-ento-011019-025151.
[3] Roel van Klink, Diana E. Bowler, Konstantin B. Gongalsky, Ann B. Swengel, Alessandro Gentile, and Jonathan M. Chase. 2020. Meta-analysis reveals declines in terrestrial but increases in freshwater insect abundances. Science, 368, 6489, (Apr. 2020), 417–420. doi: 10.1126/science.aax9931.
[4] J. C. Biesmeijer et al. 2006. Parallel declines in pollinators and insect-pollinated plants in Britain and the Netherlands. Science, 313, 5785, (July 2006), 351–354. doi: 10.1126/science.1127863.
[5] Luísa Gigante Carvalheiro et al. 2013. Species richness declines and biotic homogenisation have slowed down for NW-European pollinators and plants. Yvonne Buckley, editor. Ecology Letters, 16, 7, (May 2013), 870–878. doi: 10.1111/ele.12121.
[6] David Susič, Johanna A. Robinson, Danilo Bevk, and Anton Gradišek. 2025. Acoustic monitoring of solitary bee activity at nesting boxes. Ecological Solutions and Evidence, 6, 3, e70080. doi: 10.1002/2688-8319.70080.
[7] Rok Šturm, Juan José López Díez, Jernej Polajnar, Jérôme Sueur, and Meta Virant-Doberlet. 2022. Is it time for ecotremology? Frontiers in Ecology and Evolution, 10, (Mar. 2022). doi: 10.3389/fevo.2022.828503.

Thermal Camera-Based Cognitive Load Estimation: A Non-Invasive Approach

Zoja Anžur (zoja.anzur@ijs.si), Gašper Slapničar (gasper.slapnicar@ijs.si), Mitja Luštrek (mitja.lustrek@ijs.si); all authors: Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

Abstract
Cognitive load (CL) monitoring is a growing area of interest across various domains. Most traditional methods rely on either subjective assessments or intrusive sensors, limiting their practical applicability. In this study, we present a non-invasive approach for estimating CL using thermal imaging. Thermal videos were collected from 18 participants performing a battery of tasks designed to induce varying levels of CL. Using a low-cost thermal camera, we extracted features from facial regions of interest and trained several machine learning models, including Random Forest, Extreme Gradient Boosting, Stochastic Gradient Descent (SGD), k-Nearest Neighbors, and Light Gradient Boosting Machine, on a binary classification task distinguishing between rest and high CL conditions. The models were evaluated using Leave-One-Subject-Out cross-validation. Our results show that all models outperform the baseline majority classifier, with SGD achieving the highest accuracy (0.64 ± 0.16), despite variability across individuals. These findings support the feasibility of thermal imaging as an unobtrusive tool for CL estimation in real-world applications.
Keywords
cognitive load estimation, thermal imaging, physiological computing, machine learning for affective computing, non-invasive user monitoring

1 Introduction
Monitoring cognitive load (CL) unobtrusively and accurately has become an increasingly important goal across various domains. Traditional methods such as the NASA-TLX questionnaire [7] for assessing cognitive states often rely on intrusive sensors or subjective self-reporting, limiting their practicality in real-world applications. In recent years, the use of machine learning techniques combined with physiological signals has opened new possibilities for non-invasive and continuous monitoring [2].

The primary objective of our study was to predict CL using data obtained with a thermal camera. Our aim was to develop a method for the unobtrusive measurement of physiological signals that achieves high accuracy. Compared to other physiological measurement tools, thermal cameras are relatively low-cost and quick to deploy, which makes them a practical choice for real-world cognitive monitoring applications.

2 Related Work
Early approaches to contact-free thermal monitoring of psychophysiological states based on infrared thermal imaging focused primarily on emotional and affective research [8]. The physiological background was heavily explored, specifically how autonomic nervous system activity yields descriptive thermal signatures related to affect in facial regions. Such work laid the critical groundwork for later expansion towards CL estimation.

One of the fundamental studies towards thermal-camera-based CL estimation was published by Abdelrahman et al. in 2017. They introduced an unobtrusive method that uses a commercial thermal camera to monitor temperature changes on the forehead and nose, which were chosen as regions of interest based on the physiological background established earlier. It demonstrated that the difference between forehead and nose temperature correlates robustly with task difficulty, showing effectiveness in Stroop test and reading complexity experiments. Notably, the system achieved near-real-time detection with an average latency of 0.7 seconds, making it suitable for responsive, real-time cognition-aware applications [1].

While such monitoring traditionally required relatively expensive hardware [6], recent work has shown the potential of more affordable low-cost thermal cameras for monitoring psychological states. Black et al. [4] compared state-of-the-art vision transformers (ViT) against traditional convolutional neural networks (CNNs) on data recorded with low-resolution thermal cameras. They found superior performance of ViT when classifying emotions, achieving a 0.96 F1-score for 5 emotions (anger, happiness, neutral, sadness, surprise), confirming the feasibility of low-cost hardware.

Lastly, some work explores subtle connections between different inner states that are difficult to discriminate, such as stress and CL. Bonyad et al. [5] showed a correlation of the two states in airplane pilots, highlighting that elevated cognitive workload induced stress, manifesting in significant cooling across the nose, forehead, and cheeks, with the nasal region exhibiting the most rapid and pronounced temperature decline. These thermal changes were synchronized with increases in heart rate and subjective workload ratings. Overall, thermal monitoring is becoming more accessible and an established CL estimation alternative to other modalities (e.g., wearables, RGB cameras), especially in challenging conditions (e.g., darkness).
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2025, Ljubljana, Slovenia
© 2025 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2025.skui.3714

3 Data
3.1 Data Collection
For the purpose of our experiment, we gathered data from 18 participants using various sensors. In this work, we will focus only on the relevant data obtained by an affordable FLIR Lepton 3.5 camera, with a resolution of 160x120, running at 8.7 frames per second.

Our participants underwent a battery of tests for inducing CL. Data collection was carried out in a controlled laboratory environment to ensure consistency across all participants. After filling out some initial questionnaires regarding the individual's tiredness and focus levels, the calibration of the various sensors used in the study was performed. The experiment itself was structured into three sequential blocks, each designed to induce CL through two different tasks offered at two difficulty levels. The first block featured standardized CL tasks – specifically, the N-back and Stroop tasks, which are widely used in cognitive research to engage working memory and executive attention [10, 12].

The second block introduced more ecologically valid memory tasks. The memory recall task involved displaying a list of words on a screen, after which participants had 30 seconds to recall and verbally report as many as possible. In the visual memory task, participants observed an image and were later asked to recall specific details.

The third and final block focused on ecological visual attention tasks. These included a visual discrepancy detection task and a line tracking task. In the discrepancy detection task, participants compared two images and identified visual differences. In the line tracking task, participants followed numbered lines from one side of the screen to the other and identified them.

Between these cognitive tasks, participants engaged in relaxation activities such as resting, passively viewing images, or listening to music, which served as baseline conditions and helped to balance their CL throughout the experiment. After each task and each relaxation period, participants completed the NASA Task Load Index (NASA-TLX) [7] and the Instantaneous Self-Assessment (ISA) [9] questionnaires to provide subjective evaluations of their cognitive and affective states.

The session concluded with the removal of all sensors, a debriefing session, and participant compensation. The entire procedure lasted approximately 60 minutes per participant, with around 40 minutes spent on active data collection and the rest used for setup, instructions, and debriefing.

[Figure 1: Examples of raw thermal images. (a) Subject A. (b) Subject B.]
3.2 Data Preprocessing
The raw data used in our analysis is illustrated in Figure 1. The first step in our preprocessing pipeline was windowing. Specifically, we divided each thermal video into consecutive 3-second windows with a 25% overlap. From each window, only the middle frame was selected for further analysis. This approach was based on the assumption that facial temperature changes driven by physiological responses such as blood flow occur gradually over several seconds rather than instantaneously. As such, a single representative frame from each interval was considered sufficient to capture meaningful thermal variation in 2.25-second steps.

The second step in preparing the data for subsequent machine learning experiments involved the extraction of features from the thermal camera recordings. Prior research in this domain frequently utilizes average temperatures from distinct facial regions as input features, demonstrating that these regions can exhibit significant temperature differences associated with various affective states experienced by participants [3]. Motivated by these findings, we adopted a similar methodology to that proposed by Aristizabal-Tique et al. [3], and based our feature set on the average temperatures of four predefined regions of interest (ROIs): nose, forehead, left eye, and right eye.

The first step in obtaining the average temperatures for the selected ROIs involved applying a facial keypoint detector to extract the coordinates corresponding to each region in the thermal images. This process was carried out for the middle frame of every window of the thermal videos by passing it through a pretrained keypoint detection model [11]. The model, based on the widely adopted YOLOv5 architecture, was specifically trained on thermal images to enhance its performance for this modality. Following keypoint detection, we transitioned from working with raw thermal images to working with numerical temperature features, specifically the average temperatures computed for each region of interest. A more detailed explanation of this feature extraction process is provided in Section 4.1.

At this stage, our dataset – where each row corresponded to a single video frame – contained a substantial number of missing values. These missing values were primarily due to limitations in keypoint detection, which stemmed from several factors. First, participants wore smart glasses during the experiment, which often obstructed the eye region and impaired the accuracy of the keypoint detector. Second, natural head movements, such as turning to the left or right, occasionally caused parts of the face to be occluded, preventing the detector from accurately identifying key facial landmarks. Given the impact of these issues on data quality, we chose to remove all rows containing missing values from further analysis. We excluded 31% of the data in this step. The use of smart glasses was problematic not only for keypoint detection, but also for feature calculation: the eye regions were partially obstructed by the glasses, preventing the thermal camera from capturing accurate temperature measurements in this area. Since we were unable to control for this effect, it is possible that it also posed an issue in classification.

Next, we performed label transformations to prepare the data for subsequent analysis. Initially, the dataset included multiple labels, each corresponding to one of the tasks described in Section 3.1. However, approximately 50% of the instances were labeled as "questionnaire", reflecting the periods during which participants completed self-report instruments such as NASA-TLX and ISA. These instances posed a challenge: filling out a questionnaire is neither a clear resting state nor a cognitively demanding task, making it difficult to accurately determine the level of CL involved. Since our primary interest lay in distinguishing between load and rest conditions, we opted to exclude all rows labeled as "questionnaire" from further analysis. In addition, we grouped the remaining labels into three broader categories: rest, low CL (corresponding to the easy versions of the tasks), and high CL (corresponding to the difficult versions).

Following some initial experiments, we chose to retain only the most "extreme" instances in terms of CL. Specifically, we excluded all data labeled as low CL, as this class exhibited substantial overlap with both the rest and high load conditions. In particular, some tasks intended to induce low CL turned out to be unexpectedly difficult, effectively eliciting high CL, while others were so easy that it is questionable whether they imposed any cognitive demand at all.

To further emphasize the most distinct cognitive states, we also filtered the remaining data within each label interval. For intervals of instances labeled as rest, we retained only the final two-thirds of each interval, based on the assumption that participants would be most physiologically relaxed toward the end of each interval labeled rest. Immediately after completing a cognitively demanding task, the body may require some time to "cool down", during which residual physiological activity – such as elevated facial temperature – could still be present. By focusing on the latter portion of the interval, we aimed to capture a more accurate representation of the true resting state. Similarly, for instances labeled as high CL, we also retained only the final two-thirds of each interval, based on the assumption that CL tends to accumulate toward the end of a demanding task. This selection strategy was intended to maximize the contrast between rest and high load conditions by focusing on the time points most representative of those states.
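For illustration, the windowing and middle-frame selection could look as follows. The helper name middle_frames and the rounding choices are assumptions, and the authors' implementation may differ:

```python
import numpy as np

FPS = 8.7  # FLIR Lepton 3.5 frame rate from Section 3.1

def middle_frames(video, fps=FPS, win_s=3.0, overlap=0.25):
    """Split a thermal video (frames x H x W) into 3 s windows with 25%
    overlap and keep only the middle frame of each window (Section 3.2)."""
    win = int(round(win_s * fps))            # ~26 frames per window
    step = int(round(win * (1 - overlap)))   # ~2.25 s stride
    frames = []
    for start in range(0, len(video) - win + 1, step):
        frames.append(video[start + win // 2])
    return np.stack(frames)
```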
4 Methodology
4.1 Calculating Features
As previously mentioned, we extracted features directly from the raw thermal images. Using the pretrained keypoint detector [11], we obtained coordinates for five facial keypoints, using which we then defined ROIs corresponding to specific facial areas for each 3-second window. ROIs were shaped as rectangles, positioned based on the keypoint coordinates. Their size and placement were dynamically defined according to the distance between the eyes, reducing issues such as capturing inconsistent facial areas due to variations in distance from the camera or head movements. This approach was considered appropriate because the study was conducted in a controlled laboratory environment with minimal variation in posture and setup. Additionally, a visual inspection of the extracted ROIs confirmed that they were well aligned.

Next, we computed the average pixel temperature for each ROI, as each pixel in a thermal image directly reflects a temperature value. This process yielded four primary features – one for each of the predefined ROIs (nose, forehead, left eye, and right eye). To capture relative temperature differences between these regions, we then computed the pairwise differences between all four average temperatures. This resulted in an additional six features, representing the thermal contrasts between different facial areas. Finally, to capture potential temporal trends in temperature changes, we introduced two additional temporal features. Specifically, for each 3-second window, we computed the temperature difference between the first and last frame for two key regions of interest: the nose and the forehead. These temporal features aimed to reflect short-term thermal dynamics that may be indicative of CL fluctuations. In total, this process resulted in 12 features per instance: 4 average ROI temperatures, 6 pairwise temperature differences, and 2 temporal difference features.

Finally, we applied personalized normalization to account for individual differences in baseline physiological responses. While there is considerable variability across participants, the variations within each individual are more informative for detecting changes in CL. To address this, we standardized all feature values using z-score normalization per participant, transforming each instance based on that individual's mean and standard deviation. This normalization helped reduce inter-subject variability while preserving intra-subject dynamics, enabling a more robust learning of patterns related to CL. Following this step, we proceeded with the machine learning experiments using the described set of features.
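A compact sketch of the 12-feature construction and the per-participant z-score normalization is given below. The function names and the dictionary-based interface are illustrative, not taken from the paper:

```python
import numpy as np
from itertools import combinations

ROIS = ("nose", "forehead", "left_eye", "right_eye")

def window_features(mean_temp, first_temp, last_temp):
    """Assemble the 12 features of Section 4.1 for one 3 s window.

    mean_temp:  {roi: average pixel temperature in the middle frame}
    first_temp, last_temp: {roi: average temperature in the first/last
                            frame}, used for the two temporal features.
    """
    feats = [mean_temp[r] for r in ROIS]                   # 4 ROI means
    feats += [mean_temp[a] - mean_temp[b]                  # 6 pairwise
              for a, b in combinations(ROIS, 2)]           #   contrasts
    feats += [last_temp[r] - first_temp[r]                 # 2 temporal
              for r in ("nose", "forehead")]               #   deltas
    return np.array(feats)                                 # shape (12,)

def normalize_per_subject(X, subject_ids):
    """Z-score each feature within each participant (personalized norm)."""
    X = X.astype(float).copy()
    for s in np.unique(subject_ids):
        m = subject_ids == s
        std = X[m].std(axis=0)
        X[m] = (X[m] - X[m].mean(axis=0)) / np.where(std > 0, std, 1.0)
    return X
```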
4.2 Experiments
After completing the data preparation steps outlined in Sections 3.2 and 4.1, we proceeded with the machine learning experiments. At this stage, the dataset consisted of two balanced classes, rest and high CL, as shown in Table 1. The models were trained on a total of 3174 instances, derived from 18 participants.

Table 1: Class distribution.

Label     | Count
Rest      | 1626
High Load | 1548

In our experiments, we employed a diverse set of machine learning models, including Random Forest (RF), Extreme Gradient Boosting (XGB), Stochastic Gradient Descent (SGD), k-Nearest Neighbors (KNN), and Light Gradient Boosting Machine (GBM). As a baseline, we included a majority classifier, which always predicted the most frequent class in the training data of each fold. Each model was trained and evaluated using its optimized hyperparameters, which were determined through a grid search strategy applied to the training data in each Leave-One-Subject-Out (LOSO) iteration, aimed at maximizing classification accuracy.

To ensure the robustness and generalizability of the results, we adopted a LOSO cross-validation approach, in which each participant served as a test subject exactly once while the remaining participants were used for training. This evaluation strategy is well-suited for personalized and physiological data, where inter-subject variability is high. To ensure a comprehensive evaluation of model performance, we did not rely solely on a single metric. Instead, we incorporated a range of evaluation metrics, including accuracy and F1-score. This multi-metric approach allowed us to better capture different aspects of model performance. The results of these experiments are presented in the subsequent section.
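The evaluation protocol maps directly onto standard scikit-learn utilities. The sketch below shows LOSO cross-validation with an inner grid search for one of the models (SGD); the actual hyperparameter grids are not reported in the paper, so the ones here are placeholders:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import LeaveOneGroupOut, GridSearchCV
from sklearn.metrics import accuracy_score, f1_score

def loso_evaluate(X, y, subjects):
    """LOSO evaluation with a per-fold grid search, as in Section 4.2."""
    logo = LeaveOneGroupOut()
    accs, f1s = [], []
    for train, test in logo.split(X, y, groups=subjects):
        grid = GridSearchCV(
            SGDClassifier(loss="log_loss", max_iter=2000),
            param_grid={"alpha": [1e-4, 1e-3, 1e-2],       # placeholder grid
                        "penalty": ["l2", "elasticnet"]},
            scoring="accuracy", cv=3)
        grid.fit(X[train], y[train])      # tuned on training folds only
        pred = grid.predict(X[test])      # held-out participant
        accs.append(accuracy_score(y[test], pred))
        f1s.append(f1_score(y[test], pred, average="macro"))
    return np.mean(accs), np.std(accs), np.mean(f1s), np.std(f1s)
```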
Finally, we applied personalized normalization to account for individual differences in baseline physiological responses. While there is considerable variability across participants, the variations within each individual are more informative for detecting changes in CL. To address this, we standardized all feature values using z-score normalization per participant, transforming each instance based on that individual's mean and standard deviation. This normalization helped reduce inter-subject variability while preserving intra-subject dynamics, enabling a more robust learning of patterns related to CL. Following this step, we proceeded with the machine learning experiments using the described set of features.

Table 1: Class distribution

Label       Count
Rest        1626
High Load   1548

4.2 Experiments

After completing the data preparation steps outlined in Sections 3.2 and 4.1, we proceeded with the machine learning experiments. At this stage, the dataset consisted of two balanced classes, rest and high CL, as shown in Table 1. The models were trained on a total of 3174 instances, derived from 18 participants.

In our experiments, we employed a diverse set of machine learning models, including Random Forest (RF), Extreme Gradient Boosting (XGB), Stochastic Gradient Descent (SGD), k-Nearest Neighbors (KNN), and Light Gradient Boosting Machine (GBM). As a baseline, we included a majority classifier, which always predicted the most frequent class in the training data of each fold. Each model was trained and evaluated using its optimized hyperparameters, which were determined through a grid search applied to the training data of each Leave-One-Subject-Out (LOSO) iteration, aimed at maximizing classification accuracy.

To ensure the robustness and generalizability of the results, we adopted a LOSO cross-validation approach, in which each participant served as a test subject exactly once while the remaining participants were used for training. This evaluation strategy is well suited for personalized and physiological data, where inter-subject variability is high. To ensure a comprehensive evaluation of model performance, we did not rely solely on a single metric. Instead, we incorporated a range of evaluation metrics, including accuracy and F1-score. This multi-metric approach allowed us to better capture different aspects of model performance. The results of these experiments are presented in the subsequent section.
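This protocol maps directly onto standard scikit-learn components. The sketch below is a condensed illustration, assuming a feature matrix X, labels y, and per-instance subject IDs; the hyperparameter grid shown is a placeholder, not the one used in the study:

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

def loso_evaluate(X, y, subjects):
    accs, f1s, base_accs = [], [], []
    for tr, te in LeaveOneGroupOut().split(X, y, groups=subjects):
        # Hyperparameters are tuned on the training subjects only.
        grid = GridSearchCV(SGDClassifier(),
                            {"alpha": [1e-4, 1e-3, 1e-2]},  # illustrative grid
                            scoring="accuracy", cv=3)
        grid.fit(X[tr], y[tr])
        pred = grid.predict(X[te])
        accs.append(accuracy_score(y[te], pred))
        f1s.append(f1_score(y[te], pred, average="macro"))
        # Majority-class baseline, refit per fold.
        base = DummyClassifier(strategy="most_frequent").fit(X[tr], y[tr])
        base_accs.append(accuracy_score(y[te], base.predict(X[te])))
    return np.mean(accs), np.std(accs), np.mean(f1s), np.mean(base_accs)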
5 Results

As mentioned in the previous sections, we trained a variety of models and evaluated them using LOSO cross-validation. A summary of the results can be seen in Table 2, where both accuracy and F1-score are reported as averages across all subject folds, providing an overall measure of model performance and generalization.

The results indicate that the best-performing algorithm was SGD, achieving an accuracy of 0.64 ± 0.16, which represents a 0.13 improvement over the baseline majority classifier accuracy of 0.51 ± 0.00. In addition to its accuracy, SGD also achieved a high F1-score, suggesting that the model predicts both classes in a balanced manner. However, SGD also has the highest variance (± 0.16), which indicates less stability across subjects. Overall, all evaluated models outperformed the majority class baseline. Moreover, the accuracy scores across all tested models were relatively similar, indicating consistent performance regardless of the specific algorithm used. The performance of GBM, RF, and XGB was very similar, although somewhat behind the performance of SGD.

Table 2: Accuracy and F1-score of trained models compared to the majority class classifier

Classifier   Model Accuracy   Model F1      Majority Class Accuracy   Majority Class F1
RF           0.62 ± 0.13      0.62 ± 0.13   0.51 ± 0.00               0.34 ± 0.04
XGB          0.62 ± 0.14      0.62 ± 0.14   0.51 ± 0.00               0.34 ± 0.04
SGD          0.64 ± 0.16      0.63 ± 0.16   0.51 ± 0.00               0.34 ± 0.04
KNN          0.60 ± 0.10      0.60 ± 0.10   0.51 ± 0.00               0.34 ± 0.04
GBM          0.63 ± 0.10      0.60 ± 0.11   0.51 ± 0.00               0.34 ± 0.04

Figure 2: SGD vs. baseline majority classifier by subject.

Looking at the per-subject results in Figure 2 in more detail, we see that for most subjects the SGD classifier outperformed the majority baseline classifier. SGD achieved its best performance on subjects 13, 11, and 15, with accuracies exceeding 0.80. There is also considerable variation across individuals, which aligns with the high variance reported in Table 2. This variability may indicate the presence of subject-specific patterns, label noise, or data that is inherently more challenging to learn.

6 Conclusion

This study demonstrates the potential of low-cost consumer thermal imaging as a viable, non-invasive method for estimating CL. By leveraging features extracted from key facial regions and applying various machine learning algorithms, we achieved promising results in distinguishing between rest and high load cognitive states. Among the tested models, SGD achieved the best average performance, though with notable inter-subject variability. These findings highlight both the strengths and current limitations of thermal-based CL estimation. While the results support the feasibility of using affordable thermal cameras in real-world applications, future work should explore strategies such as more sophisticated personalization and deep learning to enhance generalization across individuals. This line of research points toward the usefulness of cognitive monitoring in practical settings such as education, workplace safety, and adaptive user interfaces.

Acknowledgements

We sincerely thank our colleagues from the Department of Intelligent Systems (Jožef Stefan Institute) for their assistance in data collection and preprocessing.

References

[1] Yomna Abdelrahman, Eduardo Velloso, Tilman Dingler, Albrecht Schmidt, and Frank Vetere. 2017. Cognitive heat: exploring the usage of thermal imaging to unobtrusively estimate cognitive load. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1, 3, 1–20.
[2] Muneeb Imtiaz Ahmad, Ingo Keller, David A Robb, and Katrin S Lohan. 2023. A framework to estimate cognitive load using physiological data. Personal and Ubiquitous Computing, 27, 6, 2027–2041. doi: 10.1007/s00779-020-01455-7.
[3] Victor H. Aristizabal-Tique, Marcela Henao-Pérez, Diana Carolina López-Medina, Renato Zambrano-Cruz, and Gloria Díaz-Londoño. 2023. Facial thermal and blood perfusion patterns of human emotions: proof-of-concept. Journal of Thermal Biology, 112, 103464. doi: 10.1016/j.jtherbio.2023.103464.
[4] James Thomas Black and Muhammad Zeeshan Shakir. 2025. AI enabled facial emotion recognition using low-cost thermal cameras. Computing&AI Connect, 2, 1, 1–10.
[5] Amin Bonyad, Hamdi Ben Abdessalem, and Claude Frasson. 2025. Heat of the moment: exploring the influence of stress and workload on facial temperature dynamics. In International Conference on Intelligent Tutoring Systems. Springer, 181–193.
[6] Federica Gioia, Maria Antonietta Pascali, Alberto Greco, Sara Colantonio, and Enzo Pasquale Scilingo. 2021. Discriminating stress from cognitive load using contactless thermal imaging devices. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 608–611.
[7] Sandra G Hart and Lowell E Staveland. 1988. Development of NASA-TLX (Task Load Index): results of empirical and theoretical research. In Advances in Psychology. Vol. 52. Elsevier, 139–183.
[8] Stephanos Ioannou, Vittorio Gallese, and Arcangelo Merla. 2014. Thermal infrared imaging in psychophysiology: potentialities and limits. Psychophysiology, 51, 10, 951–963.
[9] CS Jordan and SD Brennen. 1992. Instantaneous self-assessment of workload technique (ISA). Defence Research Agency, Portsmouth.
[10] Michael Kane, Andrew Conway, Timothy Miura, and Gregory Colflesh. 2007. Working memory, attention control, and the n-back task: a question of construct validity. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33, (May 2007), 615–622. doi: 10.1037/0278-7393.33.3.615.
[11] Askat Kuzdeuov, Dana Aubakirova, Darina Koishigarina, and Huseyin Atakan Varol. 2022. TFW: annotated thermal faces in the wild dataset. IEEE Transactions on Information Forensics and Security, 17, 2084–2094. doi: 10.1109/TIFS.2022.3177949.
[12] Michael P Milham, Kirk I Erickson, Marie T Banich, Arthur F Kramer, Andrew Webb, Tracey Wszalek, and Neal J Cohen. 2002. Attentional control in the aging brain: insights from an fMRI study of the Stroop task. Brain and Cognition, 49, 3, 277–296.

A Critical Perspective on MNAR Data: Imputation, Generation, and the Path Toward a Unified Framework

Fatemeh Azad, University of Ljubljana, Ljubljana, Slovenia, fatemeh.azad@fri.uni-lj.si
Matjaž Kukar, University of Ljubljana, Ljubljana, Slovenia, matjaz.kukar@fri.uni-lj.si

Abstract

Missing Not at Random (MNAR) data remains one of the most difficult challenges in statistical analysis and machine learning. Despite the widespread availability of advanced imputation methods, most research continues to focus on Missing Completely at Random (MCAR) and partially on Missing at Random (MAR) scenarios. This paper provides a critical overview of existing approaches for MNAR imputation, methods for simulating MNAR data, and the limitations of current evaluation practices. We highlight the lack of standardized benchmarks, unrealistic missingness rates, and insufficient coverage of MNAR conditions in empirical studies. Finally, we propose a suitable framework for comprehensive testing of design principles, enabling robust and reproducible evaluation of imputation methods across mechanisms and missingness rates.

Keywords

Missing data, MNAR, data imputation, missingness mechanisms, data generation, machine learning, evaluation framework.

1 Introduction

Missing data is a pervasive challenge across various domains, from clinical diagnostics and bioinformatics to finance, sensor networks, and social sciences. Missing, damaged, or unrecorded data entries can negatively affect the accuracy of statistical analysis and machine learning models. They reduce predictive power, introduce bias, and often create incompatibilities with algorithms requiring complete inputs [8]. The impact is especially important in critical areas like healthcare decision support, where unreliable data or incorrect interpretation can lead to harmful conclusions [14, 2].

A primary difficulty in handling missing data is understanding the underlying missingness mechanism. According to the taxonomy of Little and Rubin [10], there are three types of missingness: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).
To formally describe the MCAR, MAR, and MNAR mechanisms, we first define the following notation, as per [9, 19]:

• X: the complete data matrix, which consists of two parts, with X_obs being the observed and X_mis the missing part of the data.
• R: an indicator matrix of the same dimensions as X, where R_ij = 1 if the value X_ij is missing, and R_ij = 0 if it is observed.
• ψ: a parameter or set of parameters that govern the missingness process.

• MCAR: Data is MCAR if the probability of a value being missing is completely independent of both the observed and the unobserved data. The missingness is unrelated to the data itself — it is purely random (Eq. 1), as the missingness pattern (R) depends neither on the observed data (X_obs) nor on the missing data (X_mis).

P(R | X_obs, X_mis, ψ) = P(R | ψ)    (1)

• MAR: Data is MAR if the probability of a value being missing depends only on the observed data, not on the missing data itself (Eq. 2). This means that the missingness could be predicted from the available (non-missing) data. The probability of the missingness pattern (R) is conditionally independent of the actual missing values (X_mis) once the observed values (X_obs) are taken into account.

P(R | X_obs, X_mis, ψ) = P(R | X_obs, ψ)    (2)

• MNAR: Data is MNAR if the probability of a value being missing depends on some unobserved (missing) value itself, even after accounting for all the observed data (Eq. 3). In this case X_mis can also include latent features, unobserved for all instances. This is the most complex scenario, as the missingness pattern itself is informative. The probability of the missingness pattern (R) is therefore dependent on the missing values (X_mis) in a way that cannot be explained by the observed values (X_obs).

P(R | X_obs, X_mis, ψ) ≠ P(R | X_obs, ψ)    (3)
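The three mechanisms are easy to illustrate by generating the indicator matrix R for a single feature. The sketch below is our own illustration of Eqs. 1–3; the rates, coefficients, and variable names are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
n = 1000
x_obs = rng.normal(size=n)   # an always-observed covariate
x = rng.normal(size=n)       # the feature that will receive missing values

# MCAR (Eq. 1): missingness depends only on psi (a constant rate).
r_mcar = rng.random(n) < 0.3

# MAR (Eq. 2): missingness depends on the observed covariate only.
r_mar = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x_obs))

# MNAR (Eq. 3): missingness depends on the value that goes missing.
r_mnar = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))

x_amputated = np.where(r_mnar, np.nan, x)   # MNAR-amputated copy of x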
While MCAR and MAR have been extensively studied, MNAR remains the most difficult and least explored scenario, precisely because the missingness itself carries information about the data. For example, high-income individuals may systematically withhold reporting their wealth, or patients with severe conditions may drop out of longitudinal studies. In both cases, the very act of non-response encodes meaningful but hidden signals.

The prevalent imputation (replacing missing values) research has focused on MCAR and MAR settings, where assumptions about independence or conditional dependence simplify methodological development and evaluation [14, 23, 13]. In contrast, MNAR scenarios pose a dual challenge: not only is the missing information inherently dependent on unobserved values, but there are also very few benchmark datasets that explicitly model or annotate MNAR mechanisms. Consequently, evaluation standards remain incomplete. Reported missingness rates often underestimate or ignore MNAR effects, and even sophisticated models, such as generative adversarial networks [7, 24], graph neural approaches [25], or transformer-based imputers [3], rarely demonstrate systematic robustness in MNAR conditions. Recent works [11, 4] have shown the potential of ensemble or meta-imputation strategies, which combine diverse imputers into robust pipelines. However, these frameworks are also mostly validated under MCAR or MAR assumptions.

In this paper, we take a critical perspective on the current state of missing data research, specifically focusing on MNAR. We argue that three gaps must be addressed: (i) the lack of effective imputation techniques designed specifically for MNAR, as current methods are limited in scope and seldom used in practice; (ii) the deficiency of datasets and generators that can faithfully represent MNAR patterns; and (iii) the insufficiency of reported missingness rates. To bridge these gaps, we outline the vision and design principles of a comprehensive framework for MNAR research that integrates data generation, imputation, and evaluation under standardized conditions. Such a framework would enable more robust comparisons of existing methods and guide the development of novel techniques tailored to the inherent challenges of MNAR.

The remainder of this paper is organized as follows. Section 2 reviews existing imputation approaches and discusses their applicability to MNAR. Section 3 examines methodologies for simulating and generating MNAR data, highlighting their limitations. Section 4 critiques how missingness is reported and motivates the need for standardized benchmarks. Finally, Section 5 presents our vision for a unified MNAR research framework and outlines open challenges for the community.

2 Imputation Methods for MNAR Data

A wide range of imputation techniques has been proposed in the literature, from simple statistical to advanced deep generative models. While these methods have demonstrated effectiveness under Missing Completely at Random (MCAR) or Missing at Random (MAR) assumptions, their suitability for Missing Not at Random (MNAR) scenarios remains highly questionable. This section reviews the main categories of imputation techniques and highlights their limitations when faced with MNAR data. While it is often stated that there are almost no methods tailored for MNAR, several strands of work do exist; however, these remain underutilized and rarely integrated into mainstream imputation pipelines.

2.1 Statistical Imputation Methods

Statistical techniques such as mean, median, mode, or regression-based imputations are simple and computationally efficient, but they mostly rely on strong assumptions about the independence or conditional dependence of missingness [8, 27]. These assumptions rarely hold under MNAR, where the missingness mechanism is itself informative. For example, imputing systematically underreported values (e.g., income, clinical severity) with central-tendency statistics introduces bias and distorts the true distribution. Maximum likelihood and Bayesian approaches attempt to capture uncertainty, but they typically assume that the missingness process can be ignored or is fully modeled by observed data [10], which is not the case for MNAR.

2.2 Machine Learning-Based Approaches

Machine learning methods, such as k-nearest neighbors (KNN) [14], matrix factorization [20], decision trees [21], and support vector machines (SVMs) [6], utilize feature dependencies to address missing data entries. While more flexible than statistical methods, they fail when the missingness depends on unobserved or latent variables. For instance, if severely ill patients systematically omit follow-up surveys, no observed features can explain this absence, and machine learning based imputers cannot recover the missing structure without explicitly modeling the mechanism.
2.3 Deep Learning Approaches

Deep generative models have significantly advanced imputation research. Variational Autoencoders (VAEs) [2] and Generative Adversarial Networks (GANs) [23, 7, 24] are capable of learning complex distributions and have shown robustness to high missingness rates. However, their performance under MNAR conditions is not assured. While some frameworks, such as MisGAN, explicitly attempt to learn the missingness mask distribution alongside the data [7], they often rely on approximations that do not generalize across domains. Similarly, diffusion-based models [22, 26] and graph-based imputers [25] extend coverage to structured data but rarely test systematically against MNAR conditions. Transformers, such as ReMasker [3], provide context-aware imputations, but again, their evaluations are mostly limited to MCAR and MAR scenarios.

2.4 Ensemble Approaches

Recent efforts highlight the potential of combining multiple imputers in ensemble or meta-learning frameworks [11, 4]. Such methods leverage complementary strengths of diverse imputers and often achieve more stable performance across heterogeneous datasets. However, existing ensemble frameworks have been validated primarily under MCAR assumptions, and their ability to handle MNAR remains largely unexplored. Recent work has also explored meta-imputation strategies, such as the Meta-Imputation Balanced (MIB) framework, which combines multiple base imputers in a supervised setting [1].

To synthesize the discussion above, Table 1 summarizes the main categories of imputation approaches, their representative methods, applicability to missingness mechanisms, and key references.
Table 1: Comparison of Imputation Approaches from Literature

Approach                    Representative Methods                           Missingness Types Addressed         Representative References
Traditional Statistical     Mean, Median, Mode, Regression-based,            MCAR only (rarely MAR)              Schafer & Graham [27], Little & Rubin [10], Lin & Tsai [8]
                            Maximum Likelihood, Bayesian Approaches
Machine Learning            KNN, Matrix Factorization, Decision Trees, SVM   MCAR, partially MAR                 Murti et al. [14], Lee et al. [20], Song & Lu [21], Feng et al. [6]
Deep Learning               VAEs, GANs, Diffusion Models, Graph-based        MCAR, MAR (limited MNAR)            Collier et al. [2], Yoon et al. [23, 24], Li et al. [7], Du et al. [3], Tashiro et al. [22], Zheng & Charoenphakdee [26], You et al. [25]
                            Models, Transformers
Meta-Learning / Ensembles   Meta Learning, Meta-Regression, MIB Framework    MCAR, partially MAR; potential      Liu et al. [11], Ellington et al. [4], Azad et al. [1]
                                                                             for MNAR

3 Generation of MNAR Data

A persistent challenge in missing data research is the lack of reliable and reproducible benchmarks for handling MNAR scenarios. While MCAR and MAR can be easily simulated by random masking or conditioning on observed features, MNAR requires masking rules that depend on unobserved or latent variables, which makes the generation process more challenging. Consequently, most experimental studies rely on oversimplified masking strategies that do not capture the complexity of real-world MNAR mechanisms [18, 5].

3.1 The Role of Data Amputation

Deliberately injecting missing values into fully complete datasets, referred to as data amputation, plays a crucial role in evaluating imputation techniques. However, until recently, implementations of amputation were highly heterogeneous and often insufficiently documented, preventing fair comparisons across studies [18]. This problem is particularly acute for MNAR, where even slight differences in implementation can lead to very different conclusions.

3.2 Artificial MNAR Generation Strategies

The most common way to simulate MNAR is by masking values as a function of their own magnitude or distribution. For instance, removing a feature's highest or lowest values mimics non-disclosure of extreme outcomes (e.g., very high glucose levels) [18]. Stochastic variants extend this idea by assigning missingness probabilities proportional to the unobserved value itself, enabling flexible control over the intensity of missingness [16]. While intuitive, such strategies remain oversimplified, often restricted to univariate rules that fail to capture the multi-dimensional dependencies of real domains [5].
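As a concrete illustration of these two univariate strategies, the sketch below amputates a feature's highest values deterministically and, alternatively, with value-proportional probabilities. The synthetic data, quantile, and rate are arbitrary choices of ours:

import numpy as np

rng = np.random.default_rng(1)
values = rng.lognormal(mean=4.6, sigma=0.3, size=500)  # synthetic lab values

# Deterministic rule: the top 20% of values are never disclosed.
cutoff = np.quantile(values, 0.8)
miss_threshold = values > cutoff

# Stochastic rule: missingness probability grows with the value itself.
p = (values - values.min()) / (values.max() - values.min())
miss_stochastic = rng.random(values.size) < 0.5 * p

amputated = np.where(miss_threshold, np.nan, values)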
Recent work has proposed standardized libraries for data amputation to address reproducibility concerns. The mdatagen package provides a broad set of implementations for MCAR, MAR, and MNAR, supporting univariate and multivariate scenarios [12]. In particular, it incorporates advanced MNAR mechanisms such as Missingness Based on Own Values (MBOV), Missingness Based on Own and Unobserved Values (MBOUV), and Missingness Based on Intra-Relations (MBIR) [15]. These implementations move beyond ad hoc thresholding by systematically encoding missingness processes and offering reproducible pipelines. In addition, mdatagen includes visualization and evaluation modules, allowing researchers to inspect missingness patterns and assess their impact on imputation performance.

Together, these synthetic and standardized approaches form the current toolkit for MNAR data generation. However, despite their usefulness, they remain abstractions of real-world processes and should ideally be complemented by domain-informed simulations.

3.3 Domain-Inspired Simulation

Beyond standardized libraries, domain knowledge remains critical for realistic MNAR generation. In healthcare, dropout is often linked to disease severity, side effects, or socioeconomic constraints. In socioeconomic surveys, non-response may be strongly correlated with privacy-sensitive attributes such as income or debt. Encoding these mechanisms requires integrating causal assumptions with probabilistic masking rules [17]. However, such domain-specific approaches are difficult to generalize, limiting their utility as benchmarks.

4 Toward a Unified Framework for MNAR Research

Two key insights emerge from the previous sections: (i) current imputation methods are not explicitly designed for MNAR, and (ii) the lack of realistic MNAR generators inhibits effective evaluation. To address these gaps, we anticipate a unified framework integrating generation, imputation, and evaluation of MNAR data under standardized and reproducible conditions.

4.1 Design Principles

A comprehensive MNAR framework should follow these design principles:

• Synthetic realism: Data generators should simulate MNAR scenarios that mirror real-world domains (e.g., systematic dropout in healthcare, self-censoring in socio-economic data), either by extending existing functionality (e.g., mdatagen [12]) or by incorporating custom plug-in modules. To balance interpretability with scalability, both threshold-based rules and learned mechanisms should be supported.
• Comprehensive evaluation: Benchmarks must test across all three missingness mechanisms (MCAR, MAR, MNAR) and a full spectrum of missingness rates.
• Cross-domain applicability: The framework should support diverse data types (tabular, sequential, multimodal) and allow integration of domain knowledge for context-specific MNAR simulation.

4.2 Proposed Framework Components

We propose that a unified MNAR framework should consist of three interdependent modules:

(1) MNAR Data Generators: Domain-informed and probabilistic tools for simulating missingness patterns that depend on latent or unobserved values, using existing libraries ([12]) or incorporating custom plug-in functions.
(2) Imputation Engines: A modular interface with plug-in adapters for existing methods, supporting statistical, machine learning, deep learning, and ensemble methods [14, 23, 1]. By isolating imputers within a common framework, researchers can test their robustness under controlled MNAR scenarios.
(3) Evaluation Suite: Standardized protocols that combine direct metrics (e.g., Mean Absolute Error (MAE), Root Mean Squared Error (RMSE)) with indirect metrics (downstream predictive performance, such as accuracy, RMSE/MAE, or domain-relevant metrics such as interpretability, reliability, fairness, ...) [1]; a code sketch of such a protocol follows this list.
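Below is a minimal sketch of an evaluation protocol in this spirit, combining direct error metrics on the amputated entries with one indirect, downstream check. The model choice and data split are illustrative placeholders:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate_imputation(X_true, X_imputed, mask, y):
    # mask: boolean array, True where values were amputated, so the
    # ground truth is known and direct errors can be computed.
    err = X_imputed[mask] - X_true[mask]
    direct = {"MAE": float(np.abs(err).mean()),
              "RMSE": float(np.sqrt((err ** 2).mean()))}
    # Indirect: performance of a downstream model trained on imputed data.
    Xtr, Xte, ytr, yte = train_test_split(X_imputed, y, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
    indirect = {"downstream_accuracy": accuracy_score(yte, clf.predict(Xte))}
    return direct, indirect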
4.3 Benefits and Impact

Developing such a framework would enable several advances:

• Reproducibility: Common benchmarks and generators ensure that different imputation methods can be fairly compared.
• Realism: Domain-specific MNAR mechanisms bring evaluations closer to real-world conditions, reducing the gap between research and practice.
• Innovation: By exposing the weaknesses of existing methods under MNAR, the framework incentivizes the development of mechanism-aware imputers.
• Generalization: Unified treatment of MCAR, MAR, and MNAR encourages methods that adapt to unknown or mixed missingness mechanisms without prior assumptions.

5 Conclusion

Missing data remains one of the most persistent challenges in machine learning and statistical analysis. While decades of research have produced numerous imputation techniques, ranging from simple statistical estimators to deep generative models, most methods have been designed and evaluated under the more tractable MCAR and MAR mechanisms. In contrast, the most realistic and challenging setting, MNAR, remains critically underexplored.

Our review highlights three major gaps in the current state of the field. First, existing imputation methods rarely model the dependence of missingness on unobserved values, making them unsuitable for MNAR scenarios. Second, generating realistic MNAR data is crucial because most benchmarks use ad hoc or overly simplistic masking strategies, which fail to capture the complexity of real-world missingness. Third, evaluation standards remain incomplete, with reported missingness rates often conflating MCAR/MAR assumptions and failing MNAR realities. Together, these shortcomings hinder fair comparisons and limit methodological innovation.
To address these challenges, we propose the vision and design principles of a unified MNAR framework that integrates three components: (i) data generators that are aware of mechanisms and can create realistic MNAR patterns, (ii) modular imputation engines that enable thorough testing of various methods, and (iii) extensive evaluation suites that include direct and indirect metrics. Such a framework would provide reproducibility, realism, and a strong foundation for developing next-generation imputation techniques.

Future research should move toward principled, mechanism-aware imputers and adopt standardized benchmarks for MNAR generation and evaluation. To advance MNAR research, we need more powerful algorithms and standardized tools and protocols that enhance rigor and comparability in the field.

Acknowledgements

The research and development presented in this paper were funded by the Research Agency of the Republic of Slovenia (ARIS) through the ARIS Young Researcher Programme (research core funding No. P2-209). While preparing this work, the authors used Grammarly to check the correctness of grammar and improve the fluency of the writing, aiming to enhance the clarity and impact of the publication. The authors reviewed and edited the content produced with this tool/service and accept full responsibility for the final published content.

References

[1] Fatemeh Azad, Zoran Bosnić, and Matjaž Kukar. 2025. Meta-Imputation Balanced (MIB): an ensemble approach for handling missing data in biomedical machine learning. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBE). Submitted.
[2] Mark Collier, Bayan Mustafa, and Mihaela van der Schaar. 2020. VAEs in the presence of missing data. arXiv:2006.05301.
[3] Meng Du, Gábor Melis, and Zhaozhi Wang. 2023. ReMasker: imputing tabular data with masked autoencoding. In The Eleventh International Conference on Learning Representations.
[4] E. Ellington, Guillaume Bastille-Rousseau, Cayla Austin, Kristen Landolt, Bruce Pond, Erin Rees, Nicholas Robar, and Dennis Murray. 2014. Using multiple imputation to estimate missing data in meta-regression. Methods in Ecology and Evolution, 6, (Dec. 2014). doi: 10.1111/2041-210X.12322.
[5] Tlamelo Emmanuel, Thabiso Maupong, Dimane Mpoeleng, Thabo Semong, Banyatsang Mphago, and Oteng Tabona. 2021. A survey on missing data in machine learning. (May 2021). doi: 10.21203/rs.3.rs-535520/v1.
[6] Hao Feng, Lihui Chen, and Ke Wang. 2005. A SVM regression based approach to filling in missing values. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems. Springer, 581–587.
[7] Shun-Chuan Li, Bingsheng Jiang, and Benjamin M Marlin. 2019. MisGAN: learning from incomplete data with generative adversarial networks. In International Conference on Learning Representations.
[8] Wei-Chao Lin and Chih-Fong Tsai. 2020. Missing value imputation: a review and analysis of the literature (2006–2017). Artificial Intelligence Review, 53, 1487–1509.
[9] Roderick J. A. Little and Donald B. Rubin. 1986. Statistical Analysis with Missing Data. John Wiley & Sons. ISBN: 978-0471802545.
[10] Roderick J. A. Little and Donald B. Rubin. 2019. Statistical Analysis with Missing Data. Vol. 793. John Wiley & Sons.
[11] Qian Liu and Manfred Hauswirth. 2020. A provenance meta learning framework for missing data handling methods selection. In 2020 11th IEEE Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON). doi: 10.1109/UEMCON51285.2020.9298089.
[12] Arthur Mangussi, Miriam Santos, Filipe Loyola Lopes, Ricardo Cardoso Pereira, Ana Lorena, and Pedro Henriques Abreu. 2025. mdatagen: a Python library for the artificial generation of missing data. Neurocomputing, 625, (Apr. 2025), 129478. doi: 10.1016/j.neucom.2025.129478.
[13] Pierre-Alexandre Mattei and Jes Frellsen. 2019. MIWAE: deep generative modelling and imputation of incomplete data sets. In International Conference on Machine Learning. PMLR, 4413–4423.
[14] Dinar M P Murti, I N A Ramatryana, and A P Wibawa. 2019. K-nearest neighbor (K-NN) based missing data imputation. In 2019 5th International Conference on Science in Information Technology (ICSITech). IEEE, 83–88.
[15] 2023. Siamese autoencoder-based approach for missing data imputation. (June 2023), 33–46. ISBN: 978-3-031-35994-1. doi: 10.1007/978-3-031-35995-8_3.
[16] 2023. Automatic delta-adjustment method applied to missing not at random imputation. (June 2023), 481–493. ISBN: 978-3-031-35994-1. doi: 10.1007/978-3-031-35995-8_34.
[17] Ricardo Cardoso Pereira, Joana Cristo Santos, José Amorim, Pedro Rodrigues, and Pedro Henriques Abreu. 2020. Missing image data imputation using variational autoencoders with weighted loss. (Apr. 2020).
[18] Miriam Seoane Santos, Ricardo Cardoso Pereira, Adriana Fonseca Costa, Jastin Pompeu Soares, João Santos, and Pedro Henriques Abreu. 2019. Generating synthetic missing data: a review by missing mechanism. IEEE Access, 7, 11651–11667. doi: 10.1109/ACCESS.2019.2891360.
[19] Joseph L. Schafer and John W. Graham. 2002. Missing data: our view of the state of the art. Psychological Methods, 7, 2, 147–177. https://api.semanticscholar.org/CorpusID:7745507.
[20] Nandana Sengupta, Madeleine Udell, Nathan Srebro, and James Evans. 2022. Sparse data reconstruction, missing value and multiple imputation through matrix factorization. Sociological Methodology.
[21] Ying-Ying Song and Ying Lu. 2015. Decision tree methods: applications for classification and prediction. Shanghai Archives of Psychiatry, 27, 2, 130.
[22] Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. 2021. CSDI: conditional score-based diffusion models for probabilistic time series imputation. In Advances in Neural Information Processing Systems. Vol. 34, 24804–24816.
[23] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. 2018. GAIN: missing data imputation using generative adversarial nets. In International Conference on Machine Learning. PMLR, 5689–5698.
[24] Sanghoon Yoon and Sanghoon Sull. 2020. GAMIN: generative adversarial multiple imputation network for highly missing data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8456–8464.
[25] Jiaxuan You, Xiaobai Ma, Daisy Ding, Mykel Kochenderfer, and Jure Leskovec. 2020. Handling missing data with graph representation learning. In Advances in Neural Information Processing Systems. Vol. 33, 18357–18368.
[26] Shuhan Zheng and Nontawat Charoenphakdee. 2022. Diffusion models for missing value imputation in tabular data. arXiv preprint arXiv:2210.17128.
[27] Yuyang Zhou, Sarjyt Aryal, and Mohamed Reda Bouadjenek. 2024. Review for handling missing data with special missing mechanism. arXiv preprint arXiv:2404.04905.

Utilizing Large Language Models for Supporting Multi-Criteria Decision Modelling Method DEX

Marko Bohanec, Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia, marko.bohanec@ijs.si
Uroš Rajkovič, Faculty of Organizational Sciences, University of Maribor, Kranj, Slovenia, uros.rajkovic@um.si
Vladislav Rajkovič, Faculty of Organizational Sciences, University of Maribor, Kranj, Slovenia, vladislav.rajkovic@gmail.com

Abstract

We experimentally assessed the capabilities of two mainstream artificial intelligence chatbots, ChatGPT and DeepSeek, to support the multi-criteria decision-making process. Specifically, we focused on using the method DEX (Decision EXpert) and investigated their performance in all stages of DEX model development and utilization. The results indicate that these tools may substantially contribute in the difficult stages of collecting and structuring decision criteria and collecting data about decision alternatives. However, at the current stage of development, the support for the whole multi-criteria decision-making process is still lacking, mainly due to occasionally inconsistent and erroneous execution of methodological steps.
Keywords

Multi-criteria decision-making, decision analysis, large language models, method DEX (Decision EXpert), structuring decision criteria

1 Introduction

Multi-criteria decision-making (MCDM) [1] is an established approach to support decision-making in situations where it is necessary to consider multiple interrelated, and possibly conflicting, criteria and select the best solution based on the available alternatives and the preferences of the decision-maker. Traditionally, such models are developed in collaboration with decision makers and domain experts, who define the criteria, acquire decision makers' preferences and formulate the corresponding evaluation rules. The model-development process is demanding, as it includes structuring the problem, formulating all the necessary model components (such as decision preferences or rules) for evaluating decision alternatives, and analyzing the results.

With the development and success of generative artificial intelligence, especially large language models (LLMs) [2], the question arises as to how these models can support or perhaps partially automate decision-making processes. To this end, we explored the capabilities of recent mainstream LLM-based chatbots, specifically ChatGPT and DeepSeek, for supporting the MCDM process. We specifically focused on using the method DEX (Decision EXpert) [3], with which we have extensive experience, spanning multiple decades [4], in the roles of decision makers, decision analysts, and teachers. DEX is a full-aggregation [5] multi-criteria decision modelling method, which proceeds by making an explicit decision model. DEX uses qualitative (symbolic) variables to represent decision criteria, and decision rules to represent decision makers' preferences. Variables (attributes) are structured hierarchically, representing the decomposition of the decision problem into smaller, easier to handle subproblems. Traditionally, DEX models are developed using software such as DEXiWin [6], which helps the user to interactively construct a DEX model and use it to evaluate and analyze decision alternatives.

The reported research is of exploratory nature. We ran ChatGPT and DeepSeek multiple times over the last six months, either individually, as a group, or in classrooms with students. Typically, we first formulated some hypothetical decision problem and then guided the chatbot through the main stages of the MCDM process:

A. Model development stages:
1. Acquiring criteria
2. Definition of attributes (variables representing criteria)
3. Structuring attributes
4. Preference modeling (formulating decision rules)

B. Model utilization stages:
5. Definition of decision alternatives
6. Evaluation of alternatives
7. Explaining the results of evaluation
8. Analysis of alternatives

In doing this, we observed the responses generated by the LLMs and assessed them from the viewpoint of skilled decision analysts. The main goal was not to solve specific real-life decision problems, but to identify LLMs' strengths and weaknesses that may substantially affect the MCDM process.

Despite focusing on DEX, many of our findings are also applicable to other hierarchical full-aggregation MCDM methods [1, 5], such as AHP, MAUT/MAVT, and MACBETH, which follow the same methodological stages, with slight differences in the representation of model components.
In the following sections, we review the above-mentioned MCDM stages and describe our experience with each of them. Specifically, we illustrate the process with answers generated by ChatGPT-o3 and DeepSeek-V3. We considered a hypothetical personal decision problem of buying an electric-powered vehicle (EV). The chatbots were run in parallel on June 6th, 2025, using similar prompts. Our assessments and comments are somewhat broader, based on some other use-cases, not presented here.

2 Acquiring Criteria

A MCDM model includes multiple criteria that capture essential aspects of decision alternatives. These criteria are used to evaluate and compare the alternatives in a structured way. Defining criteria usually requires a good knowledge of the decision problem and the decision maker's goals. This step is usually one of the most difficult and may require consulting domain experts and/or relevant literature.

We asked the chatbots: "I am considering buying a new EV vehicle for up to 35000 EUR in Slovenia. Suggest criteria for evaluating such cars."

Both ChatGPT and DeepSeek came out with an extensive and structured list of criteria. Table 1 shows the criteria suggested by DeepSeek. ChatGPT's suggestions were similar, though they employed slightly different high-level categories, which also incorporated measurement units associated with each criterion.

Table 1: EV cars: Criteria suggested by DeepSeek. Items 4–8 shortened due to space limitations.

1. Range & Battery Efficiency
• Real-world range (WLTP or EPA estimates) – Aim for at least 300 km in mixed conditions.
• Battery size (kWh) – Larger batteries offer more range but may increase cost.
• Energy consumption (kWh/100 km) – Lower is better (e.g., 14–18 kWh/100 km).
2. Charging Speed & Compatibility
• DC fast-charging capability (e.g., 50 kW, 100 kW, or higher).
• AC charging speed (e.g., 7.4 kW, 11 kW, or 22 kW for home/work charging).
• Charging port type (CCS Combo is standard in Europe).
3. Performance & Driving Experience
• Motor power (kW/HP) – Affects acceleration and highway performance.
• Drivetrain (FWD, RWD, AWD) – AWD is better for Slovenia's varied weather.
• Top speed & acceleration (0–100 km/h time).
4. Practicality & Usability: Cargo space, Seating capacity, Ground clearance
5. Price & Incentives: Purchase price, Slovenian EV incentives, Insurance & maintenance costs
6. Technology & Features: Infotainment system, Driver assistance, Heat pump availability
7. Availability & After-Sales Support: Local dealership & service network, Warranty
8. Resale Value & Future-Proofing: Brand reputation, Software updates
Let us immediately say that we consider this the most important single contribution of LLMs to MCDM modeling. We are not aware of any previous method that would allow identifying and structuring decision criteria in such depth and detail in literally just a minute. Of course, for "serious" applications getting such a list does not take the burden off the user, who is still responsible for verifying the suggestions and checking the criteria for relevance and correctness. Nevertheless, this is a valuable starting point that can save days or even weeks of work. This stage does not depend on the MCDM method used, so other methods may benefit from using LLMs equally well.

3 Definition of Attributes

In this stage, the task is to define variables, called attributes, that represent criteria in a MCDM model. As most MCDM methods use numeric attributes, this stage is specific to DEX, which uses qualitative attributes. Therefore, this and the following stages require LLMs to "understand" the method used. While DEX is less widely known than methods such as AHP or MAUT, it is nonetheless used and valued in various applications. Anyway, we were somewhat surprised to find out that all consulted LLMs were familiar with DEX and reasonably capable of following its main methodological steps. In some steps, however, we had to specify additional requirements to obtain proper DEX model components.

Generally, defining qualitative value scales of attributes was not too difficult for LLMs. Asking "Suggest preferentially ordered value scales" typically gives good suggestions for value scales, for example (DeepSeek):

Purchase Price: High (>€34k) → Medium (€30k–34k) → Low (<€30k)
Government Incentives: None → Moderate (€1k–3k) → High (>€3k)
Insurance & Maintenance: Expensive → Moderate → Cheap

Interestingly, in our first attempts at using LLMs, they were not fully aware of DEX specifics, such as that value scales should have a small number of values, preferentially ordered from "bad" to "good" [3]. After correcting LLMs a few times, they "remembered" and now suggest properly formulated value scales.

4 Structuring Attributes

After acquiring the criteria and defining attributes, the next step is to structure attributes into a hierarchy (most often an ordinary tree). Following the previous stages, which already resulted in a well-organized criteria structure and proper definition of attributes, this stage looks quite easy for LLMs. Figure 1 shows a full DEX model structure as suggested by ChatGPT without any further instructions. In comparison, DeepSeek's suggestion (not shown here) was somewhat inferior. Also, it was generated only after we had instructed it that DEX requires "narrow" trees with only two to three descendants for each aggregate attribute. Anyway, we consider both structures appropriate and comparable to the achievements of the best university-level students.

Figure 1: DEX model structure generated by ChatGPT.
5 Decision Preferences and Decision Rules

In DEX, decision maker's preferences are represented by decision rules, organized in decision tables. For each aggregate attribute, the user is asked to investigate all combinations of lower-level attribute values and assess the corresponding outcomes. Depending on the number of aggregate attributes, this might be a laborious task, but it can usually be carried out using software such as DEXiWin without too much hassle. LLMs are also capable of suggesting perfectly valid decision tables, as illustrated in Table 2, which suggests the values of Purchase-Cost depending on Net-Price-After-Subsidy and Registration-Fee.

Table 2: Decision table for Purchase-Cost (ChatGPT).

Net-Price-After-Subsidy ▼ /   very-low   low      medium    high
Registration-Fee ►            (0.5 %)    (1 %)    (1–2 %)   (> 2 %)
very-low (≤ 25 k€)            very-low   low      low       medium
low (25–30 k€)                low        low      medium    medium
medium (30–33 k€)             medium     medium   medium    high
high (33–35 k€)               high       high     high      high

From the DEX perspective, it is important to remark that Table 2 is complete (addressing all possible combinations of input values) and preferentially consistent (increasing input values result in increasing outputs). Initially, these requirements were not obvious to LLMs, and we had to request them explicitly. After further use, LLMs now generate appropriate rules by themselves.
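Both requirements are easy to verify mechanically. The sketch below checks a two-attribute decision table such as Table 2, encoding scale values by their position on the ordered scale; this is our own illustration, not part of DEX or the DEXiWin software:

ORDER = ["very-low", "low", "medium", "high"]
RANK = {v: i for i, v in enumerate(ORDER)}

# Table 2 re-encoded: rows = Net-Price-After-Subsidy, columns = Registration-Fee.
TABLE = {
    "very-low": ["very-low", "low", "low", "medium"],
    "low":      ["low", "low", "medium", "medium"],
    "medium":   ["medium", "medium", "medium", "high"],
    "high":     ["high", "high", "high", "high"],
}

def is_complete(table):
    # Completeness: every combination of input values maps to an output.
    return len(table) == len(ORDER) and all(len(r) == len(ORDER) for r in table.values())

def is_consistent(table):
    # Preferential consistency: the output never decreases when either
    # input increases along its ordered scale (here, a higher price or
    # fee means a higher purchase cost).
    rows = [[RANK[v] for v in table[p]] for p in ORDER]
    down = all(rows[i][j] <= rows[i + 1][j]
               for i in range(len(ORDER) - 1) for j in range(len(ORDER)))
    right = all(rows[i][j] <= rows[i][j + 1]
                for i in range(len(ORDER)) for j in range(len(ORDER) - 1))
    return down and right

assert is_complete(TABLE) and is_consistent(TABLE)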
However, it is important to emphasize an essential issue. Decision making is considered a subjective process, and decisions should reflect individual decision-makers' preferences. Using LLMs, we only get preferences generalized from many documents. In order to impose our own preferences, we should (1) carefully check LLMs' suggestions and (2) request modifications, if necessary. According to our experience, this is possible by formulating prompts such as "increase the importance of some attribute" or "alternatives exceeding some value of some attribute are unacceptable". Nevertheless, this requires focus, a good understanding of the goals, and perpetual verification of the results.

6 Identification of Decision Alternatives

The goal of this stage is to identify decision alternatives and describe them in terms of values assigned to input attributes of the model. This is another stage in which LLMs excel due to their ability to act like agents: autonomously collecting relevant data from external sources. In this case, the LLMs constructed lists of actual EV cars, consulting sources such as avto.net, citroen.si, reuters.com, and ev-database.org:

MG 4 Comfort 64 kWh – €31 490 → €24 290 after Eko-sklad
Citroën ë-C4 54 kWh – €31 900 → €24 700
Opel Corsa Electric 54 kWh – from €28 490 → €21 290
BYD Dolphin Surf Boost 43 kWh – €22 990 → €15 790*
Fiat 500e 42 kWh – promo €22 290 (price already after bonus)

It is important to understand that alternatives should be defined after developing the model. This allows the LLM to search for and collect the data needed to evaluate the alternatives.

7 Evaluation of Alternatives

In this stage, the five cars were evaluated by the LLMs. All were evaluated as GOOD, except Fiat 500e, which was evaluated as ADEQUATE. Table 3 illustrates the output generated by ChatGPT, showing the evaluation results of the two best-ranked cars. In addition to the overall evaluation (EV-SUITABILITY), the output includes evaluations achieved at the three main branches of the model, together with a brief notification of reasons for such evaluations. This is an appropriate summary of evaluation results.

Table 3: Evaluation of the two best-ranked EV cars (ChatGPT).

Model            MG 4 Comfort 64                  Citroën ë-C4
Cost-Ownership   good (low purchase-cost &        good (very-low purchase-cost;
                 running-cost)                    running-cost low)
Capability       good (435 km WLTP, 140 kW DC,    adequate–good (416 km WLTP,
                 363 L boot)                      100 kW DC, 380 L boot)
Responsibility   good (★5 Euro-NCAP, heat-pump    good (★4 Euro-NCAP, heat-pump
                 optional, V2L 3.6 kW)            std., forthcoming V2L)
EV-SUITABILITY   GOOD                             GOOD
Rank             1                                2

8 Explaining the Results of Evaluation

In MCDM, it is essential to understand the evaluation method and explain the reasons underlying evaluation results. This is how ChatGPT explains the evaluation method:

1. Leaf values were taken from the most recent Slovene price-lists or Europe-wide tech sheets (all ≤ €35 000 trims).
2. Running-costs use your 30 000 km/year mileage and the Q1-2025 average household tariff of €0.166/kWh.
3. Each quantitative item was mapped to the ordered value-scales we set earlier; the three rule-tables ("purchase-cost", "running-cost", "economy") were then fired, and all higher nodes were combined with simple equal weights (0.4 Cost-Ownership | 0.4 Capability | 0.2 Responsibility).
4. Scores were coded excellent = 4, good = 3, adequate = 2, poor = 1; the root class was chosen by rounding the weighted mean to the nearest integer.

At first, everything seems to make sense. However, careful reading reveals the disturbing fact that different decision rules from those agreed upon in the previous stages were used to evaluate alternatives. Unfortunately, this often happens with LLMs, which tend to "forget" about the previous MCDM stages. It is not uncommon that attributes, their value scales and decision rules change from prompt to prompt. This severely undermines the trust in using LLMs and makes the whole process uneasy: rather than focusing on solving the decision problem, the user is forced to meticulously check each and every step. Also, it is not uncommon to discover logical errors or even basic computational errors (often referred to as "hallucinations" [7]). In one of our sessions with ChatGPT, it displayed the evaluation formula

(0.2 × 3) + (0.25 × 4) + (0.15 × 4) + (0.2 × 3) + (0.15 × 2) + (0.05 × 2) = 3.15

which looked convincing, but gave a hard-to-notice, wrong result; the correct result is 3.2.
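Errors of this kind are trivial to catch by recomputing the reported expression outside the chatbot, for example:

weights = [0.2, 0.25, 0.15, 0.2, 0.15, 0.05]
scores = [3, 4, 4, 3, 2, 2]
value = sum(w * s for w, s in zip(weights, scores))
print(round(value, 2))  # 3.2, not the 3.15 reported by the chatbot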
9 Analysis of Alternatives

The last stage of the MCDM process is the analysis of alternatives, which is aimed at exploring the decision space using methods such as what-if and sensitivity analysis. Without providing experimental evidence due to space restrictions, we can say that, in principle, LLMs are capable of performing such analyses, giving appropriate answers and explanations to questions such as:

• Carry out sensitivity analysis for Citroën ë-C3 and MG4 depending on buying price and operating costs.
• What would have to change for Fiat 500e 42 to become a good EV vehicle?

In most cases, results are correct and informative, particularly in cases when an explicit explanation is requested by the user. However, the issues of using inappropriate model components and making logical and computational errors were detected in this stage as well.

10 Discussion

LLMs are developing rapidly and becoming increasingly capable. They may evolve under the hood, so that even the same version can behave differently depending on recent updates or user-specific factors. This makes them challenging for conducting rigorous scientific research. They come without user manuals, requiring their users to explore their capabilities on their own. This study is an experimental attempt at understanding the capabilities of the current (2025) mainstream LLMs for supporting the MCDM process, with special emphasis on the DEX method. On this basis, we could not formulate firm conclusions, but were still able to make observations and formulate recommendations that might help MCDM practitioners.

The single most important contribution of LLMs to MCDM is their ability to formulate a well-structured list of relevant criteria in the first stage (section 2). Nothing nearly as good was available so far for that difficult stage, where LLMs can now substantially boost the process and save a lot of effort and time. The second important contribution is the capability of LLMs to act as agents and collect data about alternatives (section 6) from various external resources.

Considering individual MCDM stages, LLMs' performance is quite impressive. They are capable of evaluating and analyzing alternatives without much instruction. Furthermore, if asked, they can explain the used methods and obtained results quite well. In some cases, however, a seemingly convincing explanation may fall apart, revealing logical and computational errors.

Considering the MCDM process as a whole, the performance of LLMs is not as favorable. In subsequent MCDM stages, LLMs tend to "change their mind" without notice, modifying the already established model components: attributes, value scales, and decision rules. Consequently, this requires a lot of attention from the user's side, who has to check the outputs and perpetually remind the LLMs to remain consistent. This distracts the process and often carries the user away from the main decision-making task. Also, we should warn that in the preference-modelling stage (section 5), LLMs suggest generalized preferences that might substantially differ from the user's subjective preferences, which need to be enforced explicitly.

In summary, LLMs can substantially contribute to the definition of attributes and alternatives, but are unsuitable for carrying out the whole MCDM process due to possible inconsistent and erroneous executions of the MCDM method. We believe that, given the current state of LLM development, it is more convenient and safer to use specialized and trusted MCDM software, such as DEXiWin. Nevertheless, LLMs evolve fast and we may expect substantial improvements in the future.

Acknowledgments

The authors acknowledge the financial support from the Slovenian Research and Innovation Agency for the programme Knowledge Technologies (research core funding No. P2-0103 and P5-0018).

References

[1] Kulkarni, A.J. (Ed.), 2022: Multiple Criteria Decision Making. Studies in Systems, Decision and Control 407, Singapore: Springer, doi: 10.1007/978-981-16-7414-3_3.
[2] Kamath, U., Keenan, K., Somers, G., Sorenson, S., 2024: Large Language Models: A Deep Dive: Bridging Theory and Practice. Springer, 506 p, ISBN-13 978-3031656460.
[3] Bohanec, M., 2022: DEX (Decision EXpert): A qualitative hierarchical multi-criteria method. In: Multiple Criteria Decision Making (ed. Kulkarni, A.J.), Studies in Systems, Decision and Control 407, Singapore: Springer, doi: 10.1007/978-981-16-7414-3_3, 39–78.
[4] Bohanec, M., Rajkovič, V., Bratko, I., Zupan, B., Žnidaršič, M., 2013: DEX methodology: Three decades of qualitative multi-attribute modelling. Informatica 37, 49–54.
[5] Ishizaka, A., Nemery, P., 2013: Multi-criteria decision analysis: Methods and software. Chichester: Wiley.
[6] Bohanec, M., 2024: DEXiWin: DEX Decision Modeling Software, User's Manual, Version 1.2. Ljubljana: Institut Jožef Stefan, Delovno poročilo IJS DP-14747. Accessible from https://dex.ijs.si/dexisuite/dexiwin.html.
[7] Banerjee, S., Agarwal, A., Singla, S., 2024: LLMs will always hallucinate, and we need to live with this. doi: 10.48550/arXiv.2409.05746.

Landscape-Aware Selection of Constraint Handling Techniques in Multiobjective Optimisation

Jordan N. Cork, Andrejaana Andova, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, {jordan.cork, andrejaana.andova}@ijs.si
Pavel Krömer, Technical University of Ostrava, Ostrava, Czech Republic, pavel.kromer@vsb.cz
Tea Tušar, Bogdan Filipič, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, {tea.tusar, bogdan.filipic}@ijs.si

Abstract

Constrained multiobjective optimisation problems (CMOPs) are common in real-world optimisation.
They often involve expensive solution evaluations and, therefore, it is helpful to know the best methods to solve them prior to actually solving them. These problems also tend to be relatively difficult for algorithms compared to the majority of test problems. This difficulty often presents itself in the infeasible region, calling for a focus on the constraint handling technique (CHT). The purpose of this work is to select the best CHT for problems with difficult constraint functions. This first involves the collection of a set of such problems. CHT selection is then conducted using problem characterisation and machine learning. The outcomes are positive in that prediction achieved a high accuracy. Additionally, further insights are provided into the features that describe CMOPs.

Keywords

constrained multiobjective optimisation, algorithm selection, problem selection, constraint handling techniques

1 Introduction

Real-world optimisation problems very often have multiple objectives and are subject to one or more constraints. This is the domain of constrained multiobjective optimisation (CMO). These problems are generally demanding to solve and have restrictions to the available computational budget. These restrictions make it all the more important to know the best method for solving the problem prior to actually attempting to solve it. This calls for an algorithm selection methodology.

One approach to algorithm selection, known as landscape-aware selection, is to first characterise the problem before conducting the algorithm run [2]. Characterisation involves the calculation of features used to describe the objectives and constraints, as well as their interaction. This is done using a small set of sampled solutions. Once the problem is characterised, knowledge of similar problems can be used to determine the best approach to solving it. This approach is taken in this study and applied to constraint handling techniques (CHTs). CHTs are methods designed to guide optimisation algorithms in dealing with infeasible solutions, by taking as input the problem constraints and candidate solutions, and producing outputs that either repair, penalize, or rank these solutions to balance feasibility with optimality.

There are three primary contributions from this work, all within the CMO domain. The first is related to the set of problems used to train the algorithm selection model. Real-world optimisation problems are often difficult to solve, particularly when they include constraints. The field requires a methodology for selecting a subset of problems with difficult constraint functions from the larger set of known problems. This is the first contribution. The CHT selection methodology is then tested on these problems. This methodology is the second contribution. Here, problem characterisation and machine learning are used to predict the best-performing CHT. The final contribution is a set of insights into the features used. The decision tree output by the CHT selection methodology provides significant insights into both which features are useful and what the features reveal about the problems.

The paper is further structured as follows. In Section 2, CMO is introduced, providing the required background. Section 3 describes the two selection methodologies, as well as the validation method used. Section 4 presents the experimental setup. In Section 5, the results from the experiments are presented. Finally, in Section 6, the work is summarised and future work is outlined.

2 Constrained Multiobjective Optimisation

Constrained multiobjective optimisation (CMO) involves the optimisation of two or more objective functions given one or more constraint functions. The constraints may be of the equality or inequality forms; however, in this study, only inequality constraints are considered. Such a CMO problem (CMOP) may be formulated as follows:

minimize f_m(x), m = 1, ..., M,
subject to g_j(x) ≤ 0, j = 1, ..., J,    (1)

where x = (x_1, ..., x_D) ∈ R^D is a D-dimensional solution vector, f_m(x) are the objective functions, and g_j(x) the inequality constraint functions. M is the number of objectives and J the number of inequality constraints.
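For illustration, the sketch below defines a toy CMOP in the form of Eq. (1) with M = 2 and J = 1 and evaluates a candidate solution, including its overall constraint violation. The problem itself is invented for this example:

import numpy as np

def evaluate(x):
    # A toy CMOP: two objectives, one inequality constraint.
    f1 = float(np.sum(x ** 2))           # objective 1 (minimise)
    f2 = float(np.sum((x - 1.0) ** 2))   # objective 2 (minimise)
    g1 = 0.5 - float(np.sum(x))          # feasible when g1 <= 0
    violation = max(g1, 0.0)             # overall constraint violation
    return (f1, f2), violation

objectives, violation = evaluate(np.array([0.2, 0.1]))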
CMO requires an indicator for assessing the quality of the set of optimal points. This indicator is $I_{\mathrm{CMOP}}$. It was proposed in [19] to handle quality assessment in the three following situations. When no feasible solutions are found, it uses the minimum constraint violation. When feasible solutions are found, but these are outside of the region of interest (ROI) bound by the given reference point (RP), the distance to the ROI is used. Finally, when solutions are found within the ROI, it uses the hypervolume (HV). The HV measures the portion of the objective space dominated by the set of solutions relative to the RP. $I_{\mathrm{CMOP}}$ was proposed as a value to be minimised. However, it is commonly maximised, following the moarchiving package implementation [9]. On top of $I_{\mathrm{CMOP}}$, the maximised area under the runtime profile curve is used to measure the anytime performance of the algorithm [8]. Here, the runtime profile is the proportion of performance targets attained with respect to the evaluation number.

Many methodologies in CMO use an $I_{\mathrm{CMOP}}$ value with normalised function values. For this, the function values of the problems' optimal solution set are required. Together, these are known as the Pareto front. The Pareto front may be obtained directly, through knowledge of the problem's construction. Often this is not possible, however, and, therefore, algorithm runs are used to construct an approximation of the front.

In [4], 13 benchmark suites are listed, consisting of 139 test problems. These test problems can be instantiated in various numbers of dimensions and objectives. This allows a substantially larger number of test problem instances to be generated from these 139 base test problems.

Problem characterisation is conducted using exploratory landscape analysis (ELA) features [16]. Work done in [1] has listed 80 such features for CMO. These come from three landscapes: the multiobjective, violation and multiobjective-violation landscapes. The features can be computed via sampling or random walks.

There are several constraint handling techniques; four of these are considered in our study. The first is the constrained-domination principle (CDP), proposed along with the NSGA-II algorithm [5]. This is a feasibility-first approach, where feasible solutions are preferred over infeasible ones. The penalty CHT is a classic method and applies a penalty value to the objective values [20], either statically or dynamically. The Improved-Epsilon (I-Epsilon) CHT was designed to work with the MOEA/D algorithm [7]. It dynamically adjusts the $\epsilon$ value based on the number of feasible solutions; solutions whose constraint violation is below the $\epsilon$ value are considered feasible. Finally, stochastic ranking (SR) uses a probability value to switch between comparing solutions based on objectives or constraints [18].
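As an illustration of two of these CHTs, the following is a minimal sketch of the constrained-domination principle and the static penalty, under the conventions of Eq. (1); the function names are ours, and only the penalty value of 100 (used later in Section 4) comes from the paper.

```python
import numpy as np

def violation(g):
    """Total constraint violation; a solution is feasible when all g_j <= 0."""
    return np.sum(np.maximum(g, 0.0))

def cdp_better(f1, g1, f2, g2):
    """Constrained-domination principle [5]: feasible beats infeasible,
    two infeasible solutions compare by total violation, and two feasible
    solutions compare by Pareto dominance on the objectives."""
    v1, v2 = violation(g1), violation(g2)
    if v1 == 0 and v2 > 0:
        return True
    if v1 > 0 and v2 > 0:
        return v1 < v2
    if v1 > 0 and v2 == 0:
        return False
    return bool(np.all(np.asarray(f1) <= f2) and np.any(np.asarray(f1) < f2))

def static_penalty(f, g, penalty=100.0):
    """Static penalty CHT [20]: add a fixed multiple of the total violation
    to every objective (100 is the static value chosen in Section 4)."""
    return np.asarray(f) + penalty * violation(g)
```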
3 Methodology

This section presents the methodologies used in the study. First, the methodology for selecting the hard test problems is presented, followed by the methodology for selecting the appropriate CHT and the means for testing the model.

3.1 Difficult Problem Selection

Testing the CHT selection methodology requires test problems. Test problems with too easy constraint functions are less likely to show differences among the CHTs, as algorithms will spend less time dealing with infeasible solutions. More difficult constraint functions, on the other hand, will force the algorithm to deal with infeasible solutions longer and, therefore, give the CHTs time to show their differences. Test problems with difficult constraint functions are therefore desired for our testing.

As mentioned in Section 2, anytime performance is measured using the area under the runtime profile curve (AUC), with $I_{\mathrm{CMOP}}$ maximised as the indicator. In this study, difficulty is determined based on the anytime performance of a set of algorithms, $\mathcal{A}$. Each of the algorithms is run on the problem $R$ times and the average AUC is taken. This is to ensure robustness. It should be noted that when recording the runs, an archive of all non-dominated solutions is kept, and the $I_{\mathrm{CMOP}}$ value from this archive is recorded at each solution evaluation. The budget must also be chosen, with budgets allowing algorithm convergence preferred. The maximum average AUC is then used as the problem difficulty, with lower values signifying harder problems. This is formulated as follows:

$$\mathrm{Difficulty}(p) = 1 - \max_{a \in \mathcal{A}} \left( \frac{1}{R} \sum_{r=1}^{R} \mathrm{AUC}(p, a, r) \right) \quad (2)$$

This problem difficulty is calculated for each of the problems in the set of problems, $\mathcal{P}$.

Within the current selection, there will still be cases where all CHTs perform roughly the same on the problem. These problems are removed using statistical and practical threshold tests on the final $I_{\mathrm{CMOP}}$ values from the 30 runs. Given that a normal distribution cannot be ensured in the 30 values from each of the algorithm runs, the Kruskal-Wallis test is used [11]. It determines whether independent samples come from the same distribution. However, this still leaves problems with no practical differences in their scores. To filter these out, the mean scores are tested for whether they vary by more or less than a small delta, and those that vary less are removed. Following the filtering out of problems where no meaningful differences are observed, the $N$ most difficult problems from the remaining set are selected. This leaves one with a suite of difficult problems upon which at least one of the algorithms from $\mathcal{A}$ performs differently.
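A compact sketch of this selection step, assuming 30 recorded runs per CHT and SciPy's Kruskal-Wallis implementation; the significance level alpha is not stated in the paper and is illustrative here.

```python
import numpy as np
from scipy.stats import kruskal

def difficulty(auc):
    """Eq. (2): auc has shape (len(A), R); difficulty is one minus the best
    mean AUC over the R runs of any algorithm."""
    return 1.0 - auc.mean(axis=1).max()

def keep_problem(final_icmop, alpha=0.05, delta=1e-3):
    """Keep a problem only if the CHTs differ on it: a Kruskal-Wallis test
    on the final I_CMOP values of the 30 runs per CHT, plus a practical
    test that the mean scores spread by more than a small delta."""
    _, p_value = kruskal(*final_icmop)  # final_icmop: per-CHT lists of run values
    means = np.array([np.mean(v) for v in final_icmop])
    return p_value < alpha and (means.max() - means.min()) > delta
```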
3.2 Constraint Handling Technique Selection

The general concept for CHT selection is as follows. First, a machine learning model is trained using the features from each problem in the training set. The labels are the best-performing CHTs on each problem. At inference time, features are calculated on the problem in question (note: this consumes a portion of the available budget). These features are used as input to the machine learning algorithm. The resulting model then predicts the best-performing CHT for use during the run.

Each step will now be described in more detail. The first step is to choose a base algorithm and a set of algorithm-relevant CHTs. The preferred approach would be to select the most appropriate algorithm for the problem to be solved at inference time.

The second step is generating the training data for the machine learning model. First, the features for each of the problems in the training set are gathered. The labels must then be computed, which requires algorithm runs; 30 for each CHT. For this, the budget must be selected carefully. The model, at inference, can be expected to work well only if the budget is the same as it was in training. The average final values from the 30 runs are then taken for each CHT. In CMO, these are the average final $I_{\mathrm{CMOP}}$ values, which are being maximised. The CHT with the highest value is then selected as the best-performing CHT. This is used as the label. Once this has been done for each of the problems in the training set, the training data is complete.

The third step is to train the model. A decision tree is preferred for its explainability properties. To enhance the explainability of the model, the depth of the tree should be kept at a minimum. Testing is described in the next subsection. Once complete, i.e. trained with all training data, the model is available for inference.

3.3 Cross-Validation Testing

Testing the model involves a leave-one-problem-out cross-validation approach. Here, a problem is taken out of the training set and left as the test problem. The model is then trained on the data from the remaining problems in the training set. To predict the best-performing CHT, the features from the test problem are used as input to the model. The model then makes a prediction for the best-performing CHT. This is compared to the actual result.

The methodology makes allowances for when two or more CHTs perform similarly well on the same problem. The prediction made by the algorithm is then correct if it selects any of these. Determining whether two or more CHTs are statistically the same is achieved through the use of a statistical test, which in this case was the Mann-Whitney U test [15]. Again, this test was chosen because a normal distribution cannot be ensured in the resulting final values from the runs. The process is as follows. The CHT with the best mean score is selected; then each of the other CHTs is tested individually against the best-performing CHT to determine if they are equivalent, forming the set of best-performing CHTs. If the predicted label is within this set, it is considered correct. This process is conducted for all problems in the training set, and a final percentage of correct predictions is given.
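The following sketch puts Sections 3.2 and 3.3 together: Mann-Whitney-based equivalence sets and leave-one-problem-out accuracy. The data layout (arrays of features and labels, plus per-problem per-CHT run values) and the alpha level are assumptions, not taken from the paper.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.tree import DecisionTreeClassifier

def best_cht_set(final_icmop, alpha=0.05):
    """Set of CHTs statistically equivalent to the one with the best mean
    final I_CMOP, per the Mann-Whitney U test of Section 3.3."""
    means = [np.mean(v) for v in final_icmop]
    best = int(np.argmax(means))
    ties = {best}
    for k, values in enumerate(final_icmop):
        if k != best and mannwhitneyu(values, final_icmop[best]).pvalue >= alpha:
            ties.add(k)
    return ties

def lopo_accuracy(features, labels, run_values, max_depth=3):
    """Leave-one-problem-out CV: a prediction counts as correct if it falls
    within the set of equivalently best CHTs for the held-out problem."""
    hits = 0
    for i in range(len(features)):
        mask = np.arange(len(features)) != i
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        tree.fit(features[mask], labels[mask])
        pred = tree.predict(features[i:i + 1])[0]
        hits += pred in best_cht_set(run_values[i])
    return hits / len(features)
```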
4 Experimental Setup

In this section, the inputs to the methodologies are described, along with the packages used throughout.

There are several inputs to the difficult problem selection methodology. First, there is the set of problems, $\mathcal{P}$. The dimensions chosen were 2, 3, 5, 10 and 30, with only biobjective problems considered. This resulted in 375 problem instances. The problems were translated from Matlab by hand or taken from pymoo [3].

For $\mathcal{A}$, i.e. the set of algorithms, the natural choice was to choose a base algorithm with different constraint handling techniques. The base algorithm chosen was NSGA-II [5]. This was used for its versatility with regard to adding various CHTs. Regarding CHTs, CDP, penalty, I-Epsilon and SR were chosen for their compatibility with NSGA-II. CDP was provided as default with NSGA-II by pymoo. The others were implemented by hand. The penalty value selected was a static 100, while the settings for all others were the proposed defaults. $R$ was set at 30.

The number of difficult problems selected, $N$, was set at 20. This number is adequate to test the methodology while still being small enough to manage. The budget selected was the one to be used throughout the study, i.e. $10{,}000 \cdot D$. The delta value for detecting practical differences was set at 0.001.

For the CHT selection methodology, the choice of training problems was the set of difficult problems derived from the setup above. The base algorithm and CHTs were the same as those selected above. The model selected was a decision tree (scikit-learn [17]). The tree depth parameter was the only parameter tuned. This tuning was done manually, decreasing from 10 to 3, until the performance began to reduce. Finally, the problem features used were the 80 features described in [1]. These were calculated with a sample size of $1{,}000 \cdot D$. The random walks were simulated using these same samples.

Table 1: The results from cross-validation testing using the leave-one-problem-out methodology. The first column lists the test problems in order of difficulty (descending). D indicates the dimensionality. All problems are biobjective. The models were trained on all problems in the list, bar the test problem in question. 'Actual' lists the best-performing CHT labels, while the prediction column shows the predicted label. If the predicted label is in the actual labels list, the prediction is considered correct. The CHT labels 0, 1, 2 and 3 are CDP, penalty, I-Epsilon and SR, respectively.

Problem     D   Diffic.  Pred.  Actual     Correct
DC2-DTLZ3   30  0.976    2      [2]        Yes
DC2-DTLZ1   30  0.965    2      [2]        Yes
DC2-DTLZ1   10  0.541    2      [2]        Yes
DC2-DTLZ3   10  0.528    2      [2]        Yes
NCTP7       30  0.489    0      [0, 3]     Yes
NCTP8       10  0.355    3      [0, 1, 3]  Yes
NCTP15      10  0.339    3      [0, 1, 3]  Yes
DOC3        10  0.330    1      [0, 1, 3]  Yes
NCTP2       10  0.284    3      [0, 1]     No
NCTP1       10  0.279    3      [0, 1, 3]  Yes
NCTP7       10  0.269    3      [0, 3]     Yes
CTP6        30  0.257    1      [0, 1, 2]  Yes
CTP8        30  0.249    0      [0, 1, 2]  Yes
C1-DTLZ3    30  0.240    2      [0, 1, 2]  Yes
DC2-DTLZ1   5   0.230    2      [2]        Yes
CTP8        10  0.227    0      [0, 1, 2]  Yes
DC2-DTLZ3   5   0.219    2      [2]        Yes
DC3-DTLZ1   30  0.214    2      [2]        Yes
NCTP17      10  0.203    0      [0, 1, 2]  Yes
NCTP10      10  0.202    1      [0, 1, 2]  Yes

Figure 1: The decision tree built on all the training data. It is used to predict the four CHTs. The indices of the values in the value lists, indicating the number of instances, signify CDP, penalty, I-Epsilon and SR, respectively. The tree reads:

f_range_coeff <= 10.96 (samples = 20, value = [5, 4, 8, 3], class = I-Epsilon)
  True:  lnd_avg_rws <= 0.19 (samples = 12, value = [5, 4, 0, 3], class = CDP)
    True:  corr_cobj_max <= 0.62 (samples = 8, value = [1, 4, 0, 3], class = Penalty)
      True:  leaf (samples = 5, value = [1, 4, 0, 0], class = Penalty)
      False: leaf (samples = 3, value = [0, 0, 0, 3], class = SR)
    False: leaf (samples = 4, value = [4, 0, 0, 0], class = CDP)
  False: leaf (samples = 8, value = [0, 0, 8, 0], class = I-Epsilon)
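For concreteness, a minimal sketch of this model-training step with scikit-learn [17], as named in the setup above; the feature matrix and labels below are random placeholders standing in for the 80 ELA features and the best-CHT labels of the 20 difficult problems.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

CHT_NAMES = ["CDP", "Penalty", "I-Epsilon", "SR"]

# Placeholders: X would hold the 80 ELA features from [1] for the 20
# difficult problems, and y the index of the best-performing CHT.
X = np.random.rand(20, 80)
y = np.random.randint(0, 4, size=20)

# Depth tuned manually from 10 down to 3, as described in Section 4.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# A text rendition in the spirit of Figure 1 (hypothetical feature names).
print(export_text(tree, feature_names=[f"ela_{i}" for i in range(80)]))
```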
5 Results

In this section, the results from carrying out the methodologies are described. First, the construction of the set of difficult problems is discussed. Then, the experimental results are presented. Finally, the resulting decision tree is discussed.

The difficulty of each problem was calculated as described in Section 3. The results were heavily skewed towards the easy-problem side. With the parameter $N$ set to 20, that many problems were selected. The difficulties of these ranged from 0.202 to 0.976. The selected problems are listed in Table 1 in order of descending difficulty. They include 5, 10 and 30 dimensional problems, with 2 and 3 dimensional problems clearly being easier to solve. The problems come from the following suites: DC-DTLZ [13], NCTP [12], DOC [14], CTP [6] and C-DTLZ [10].

Table 1 additionally shows the results from the cross-validation testing phase of the experiments. As described in Section 3, each problem was given its turn as the test problem, while the others acted as training problems. For 95% of these, the model predicted correctly from the set of actual best-performing CHTs.

Figure 1 shows the decision tree that resulted from training on all of the available data. As can be seen, the decision tree leaf nodes are nearly pure, meaning it achieved near 100% accuracy on the training data. Due to its high accuracy on the test data and the low tree depth, this is not believed to be overfit.

Only 3 of the 80 supplied features were included in the model, indicating their importance in identifying appropriate CHTs. The first of these, separating out I-Epsilon, was f_range_coeff (the difference between the maximum and minimum of the absolute value of the linear regression model coefficients, where the model is fitted between the decision variables and the unconstrained ranks). This is a multiobjective landscape feature, focusing on variable scaling. The second feature, separating out CDP, was lnd_avg_rws (the average proportion of locally non-dominated solutions in the neighbourhood). This is a multiobjective-violation landscape feature, focusing on evolvability, i.e. the degree to which the problem landscape facilitates evolutionary improvement.
The final feature, distinguishing between penalty and SR, was corr_cobj_max (the maximum of the constraints and objectives correlation). This is also a multiobjective-violation landscape feature, focusing on evolvability. It should be noted that the features are not all related to the violation landscape, but also deal with the objective functions.

6 Conclusion

In this study, the focus was on the needs of real-world CMOPs. These problems are often difficult for algorithms to solve and require expensive solution evaluations. Given the cost of these evaluations, it is helpful to know the best method for solving the problem prior to actually solving it. To address this, the study focused on selecting the most appropriate CHT, a crucial component of any algorithm operating in CMO. For this selection task, it was critical to test on problems with difficult constraint functions. These problems elicit the most variation among CHTs.

The proposition was made for a methodology that selects problems with difficult constraint functions from a larger set, with the end goal of conducting CHT selection. This methodology involved first collecting a large set of CMOPs, then running a set of algorithms on them to determine their difficulty. Problems that were easy to solve or showed no variation in algorithm performance were discarded, as they provide no value in future CHT selection tasks. The methodology finally produced a set of $N$ problems.
This set of difficult problems was used in the second methodology proposed, i.e. selecting CHTs using problem characterisation and machine learning. Four CHTs were chosen and added to the NSGA-II algorithm. These were CDP, penalty, I-Epsilon and SR. The goal of the selection task was to select the best-performing CHT on a given problem, noting that several CHTs can perform best. The methodology was evaluated using cross-validation, with the leave-one-problem-out method. The findings from testing were positive and indicate that it is possible to select the most appropriate CHT for a given difficult problem. Further, the final decision tree trained on all the considered difficult problems provides insights into the features characterising CMOPs.

In future work, the plans are to extend the CHT selection methodology to the broader domain of algorithm selection.

Acknowledgements

The authors acknowledge the financial support from the Slovenian Research and Innovation Agency (research core funding No. P2-0209 "Artificial Intelligence and Intelligent Systems", and projects No. N2-0254 "Constrained Multiobjective Optimization Based on Problem Landscape Analysis" and GC-0001 "Artificial Intelligence for Science").

References

[1] Hanan Alsouly, Michael Kirley, and Mario Andrés Muñoz. 2023. An instance space analysis of constrained multiobjective optimization problems. IEEE Transactions on Evolutionary Computation, 27, 5, 1427–1439. doi: 10.1109/TEVC.2022.3208595.
[2] Andrejaana Andova, Jordan N. Cork, Aljoša Vodopija, Tea Tušar, and Bogdan Filipič. 2024. Predicting algorithm performance in constrained multiobjective optimization: A tough nut to crack. In Applications of Evolutionary Computation. Stephen Smith, João Correia, and Christian Cintrano, editors. Springer Nature Switzerland, Cham, 310–325. doi: 10.1007/978-3-031-56855-8_19.
[3] Julian Blank and Kalyanmoy Deb. 2020. Pymoo: Multi-objective optimization in Python. IEEE Access, 8, 89497–89509. doi: 10.1109/ACCESS.2020.2990567.
[4] Jordan N. Cork and Bogdan Filipič. 2025. A Bayesian optimization approach to algorithm parameter tuning in constrained multiobjective optimization. In Optimization and Learning. Bernabé Dorronsoro, Martin Zagar, and El-Ghazali Talbi, editors. Springer Nature Switzerland, Cham, 109–122. doi: 10.1007/978-3-031-77941-1_9.
[5] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6, 2, 182–197. doi: 10.1109/4235.996017.
[6] Kalyanmoy Deb, Amrit Pratap, and T. Meyarivan. 2001. Constrained test problems for multi-objective evolutionary optimization. In Proceedings of the First International Conference on Evolutionary Multi-Criterion Optimization, EMO 2001. Springer, 284–298. doi: 10.1007/3-540-44719-9_20.
[7] Zhun Fan, Wenji Li, Xinye Cai, Han Huang, Yi Fang, Yugen You, Jiajie Mo, Caimin Wei, and Erik Goodman. 2019. An improved epsilon constraint-handling method in MOEA/D for CMOPs with large infeasible regions. Soft Computing, 23, 12491–12510. doi: 10.1007/s00500-019-03794-x.
[8] Nikolaus Hansen, Anne Auger, Dimo Brockhoff, and Tea Tušar. 2022. Anytime performance assessment in blackbox optimization benchmarking. IEEE Transactions on Evolutionary Computation, 26, 6, 1293–1305. doi: 10.1109/TEVC.2022.3210897.
[9] Nikolaus Hansen, Nace Sever, Mila Nedić, and Tea Tušar. 2024. Moarchiving: Multiobjective nondominated archive classes with up to four objectives. GitHub repository. https://github.com/CMA-ES/moarchiving.
[10] Himanshu Jain and Kalyanmoy Deb. 2014. An evolutionary many-objective optimization algorithm using reference-point based nondominated sorting approach, Part II: Handling constraints and extending to an adaptive approach. IEEE Transactions on Evolutionary Computation, 18, 4, 602–622. doi: 10.1109/TEVC.2013.2281534.
[11] William H. Kruskal and W. Allen Wallis. 1952. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47, 260, 583–621. doi: 10.1080/01621459.1952.10483441.
[12] Jia-Peng Li, Yong Wang, Shengxiang Yang, and Zixing Cai. 2016. A comparative study of constraint-handling techniques in evolutionary constrained multiobjective optimization. In 2016 IEEE Congress on Evolutionary Computation (CEC), 4175–4182. doi: 10.1109/CEC.2016.7744320.
[13] Ke Li, Renzhi Chen, Guangtao Fu, and Xin Yao. 2019. Two-archive evolutionary algorithm for constrained multiobjective optimization. IEEE Transactions on Evolutionary Computation, 23, 2, 303–315. doi: 10.1109/TEVC.2018.2855411.
[14] Zhi-Zhong Liu and Yong Wang. 2019. Handling constrained multiobjective optimization problems with constraints in both the decision and objective spaces. IEEE Transactions on Evolutionary Computation, 23, 5, 870–884. doi: 10.1109/TEVC.2019.2894743.
[15] Henry B. Mann and Donald R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18, 1, 50–60. doi: 10.1214/aoms/1177730491.
[16] Olaf Mersmann, Bernd Bischl, Heike Trautmann, Mike Preuss, Claus Weihs, and Günter Rudolph. 2011. Exploratory landscape analysis. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, 829–836. doi: 10.1145/2001576.2001690.
[17] Fabian Pedregosa et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 85, 2825–2830. https://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf.
[18] Thomas P. Runarsson and Xin Yao. 2000. Stochastic ranking for constrained evolutionary optimization. IEEE Transactions on Evolutionary Computation, 4, 3, 284–294. doi: 10.1109/4235.873238.
[19] Aljoša Vodopija, Tea Tušar, and Bogdan Filipič. 2025. Characterization of constrained continuous multiobjective optimization problems: A performance space perspective. IEEE Transactions on Evolutionary Computation, 29, 1, 275–285. doi: 10.1109/TEVC.2024.3366659.
[20] Yonas Gebre Woldesenbet, Gary G. Yen, and Biruk G. Tessema. 2009. Constraint handling in multiobjective evolutionary optimization. IEEE Transactions on Evolutionary Computation, 13, 3, 514–525. doi: 10.1109/TEVC.2008.2009032.

Explaining Deep Reinforcement Learning Policy in Distribution Network Control

Blaž Dobravec (Elektro Gorenjska d.d., Kranj, Slovenia; blaz.dobravec@elektro-gorenjska.si)
Jure Žabkar (University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia; jure.zabkar@fri.uni-lj.si)
Abstract

In safety-critical settings – such as low-voltage electrical distribution networks – Deep Reinforcement Learning (DRL) policies are hard to deploy due to a limited capability to explain why a particular sequence of actions is taken. We use Scenario-Based eXplainability (SBX) with temporal prototypes to explain the policy of our DRL agent. SBX clusters short time-windows of latent trajectories and uses their medoid trajectories as human-friendly summaries. Temporal prototypes map the embeddings of these medoids to actions and generate explanations of the form "This scenario is similar to prototype X ⇒ do action Y." We apply our approach to a real low-voltage distribution network, Srakovlje. Preliminary results show that our method offers practically useful, human-friendly explanations for sequential decision making.

Keywords

deep reinforcement learning, explainability, voltage control, low-voltage distribution network, prototypes

1 Introduction

A rapid growth of renewable energy resources and a significant increase in electricity demand due to the electrification of transport and heating [8] are reshaping generation (e.g. distributed photovoltaic systems) and consumption (e.g. heat pumps, electric vehicles) in electrical distribution networks. Increasing reverse power flows and voltage variability in low-voltage networks strongly affect voltage profiles and make network operation and control more challenging.

Deep reinforcement learning (DRL) has recently emerged as a powerful paradigm for sequential decision-making in complex, high-dimensional environments, with notable successes in games (chess [18], Go [19], Atari [13]), autonomous driving [10], and industrial robotic process automation [7]. Voltage control in distribution networks shares similar characteristics, which makes DRL a promising methodology for solving control problems in low-voltage networks.

While voltage control is standard at higher voltage levels (e.g., with STATCOMs), most LV research has focused on optimizing individual assets at the customer level [11, 6]. Recent comparisons indicate that DRL can outperform classical algorithms for micro-grid management with demand-side flexibility [14]. For instance, dueling double DQN (D3QN) has been used to reduce over-voltages in PV-rich networks [16]; model-free RL has optimized battery charging/discharging to increase self-sufficiency [12]; and effective consumption/generation strategies have been learned under price signals and network constraints [2, 1]. Given the growing heterogeneity of LV networks and the rise of behind-the-meter actuators, DRL methods are typically developed and validated first in simulation [4]. Their adoption and implementation are often hindered by a lack of explainability of these models.

We present a prototype-based explainability approach for DRL-based voltage control in LV distribution networks that directly exploits flexibility from prosumers. In our approach, the agent acts on the network's operating state, coordinating different flexibility options (e.g. photovoltaic systems, batteries, EVs, heat pumps). We focus on improving power quality by reducing voltage violations. Additionally, we use prototype-based explainability to provide the interpretation and reasoning behind the actions.

2 Related Work

Explainable Artificial Intelligence (XAI) aims to make the decisions of models understandable to humans. The explanation process and the final result should be focused on generating explanations that are intuitive to us. Prototype-based explanations provide a compelling choice that is interpretable by design. XRL remains an active area of research. One widely employed explainability technique, primarily used in image classification, is the saliency map, which bases its explanations on pixel-wise feature attribution [20]. Building on this idea, Sequeira et al. [17] made the agent's interactions with the environment the focal point of their Interestingness Framework.

In supervised learning, prototype networks explain predictions via similarity to learned or human-selected exemplars [3, 15].
Extending this paradigm to reinforcement learning, prototype-wrapper policies force decisions to be mediated by human-friendly prototypes (single state snapshots); a recent example is the Prototype-Wrapper Network (PW-Net), which wraps a pre-trained agent and maps latent states to action decisions through prototype similarities [9]. Beyond interpretability, prototypes have been leveraged to improve representation learning and exploration efficiency: Proto-RL pre-trains prototypical embeddings and uses prototype-driven intrinsic motivation to accelerate downstream policy learning in pixel-based control [23]. In model-based RL, prototypical context learning has also been explored for dynamics generalization [22].

Despite the critical role of explainability in voltage control in low-voltage power systems, there is little research addressing this challenge. Zhang et al. [24] applied the SHAP explainability method to a deep reinforcement learning model for implementing proportional load shedding during under-voltage situations. They also used Deep-SHAP [25] to enhance the computational efficiency of their XAI model. The model's output elucidates its predictions through a visualization layer and a feature importance layer that addresses both global and local explanations.

Existing research on explainability in power systems, particularly regarding voltage control, focuses on post-hoc explainability techniques. Compared to explanations for a single feature (an individual voltage value), such as SHAP, our method considers the temporal component in the explanation process. To the best of our knowledge, this approach has not been applied to explainability in reinforcement learning in this specific manner before.

3 SBX-guided Prototype Selection

We employ Scenario-Based eXplainability (SBX [5]) as an extension of PW-Net [9] to temporal prototypes (prototypes of trajectories, not just snapshots of the state space) to provide global, scenario-level structure and local, time-resolved explanations for a trained control policy. SBX is used to partition behavior and select representative temporal prototypes. On top of the SBX-selected prototypes (without any human-defined prototypes), we train a temporal prototype model that maps latent features to actions. This yields a two-tier view: SBX provides a summary of behavior, while temporal prototypes expose the time-local patterns, and their nearest neighbors, that drive actions.

Figure 1: High-level SGTP pipeline: (1) collect latent windows; (2) SBX clustering and medoid selection; (3) train temporal-prototype layer; (4) case-based explanations during rollout.

3.1 Data Preparation and Latent Extraction

We consider a trained policy $\pi$ acting in discrete time. A trajectory is a sequence of observation–action pairs. For analysis, we operate on fixed-length trajectories of length $L$:

$$w_t = \big((o_t, a_t), \dots, (o_{t+L-1}, a_{t+L-1})\big), \quad t = 0, \dots, T - L.$$

Observations are first mapped by the frozen policy backbone to latent vectors $x_t \in \mathbb{R}^d$. We denote the latent trajectory by $X_t = (x_t, \dots, x_{t+L-1}) \in \mathbb{R}^{L \times d}$. We collect an offline dataset by rolling out the trained PPO agent and recording, at each time step, the policy's penultimate-layer latent vector and the corresponding environment action. This yields per-episode sequences of latents and actions, which are then converted into trajectories of length $L$. The supervised target for each trajectory is the action at its last real-time step.
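A minimal sketch of this window-collection step; the data layout (a list of per-episode latent and action arrays recorded from PPO rollouts) is an assumption.

```python
import numpy as np

def collect_latent_windows(episodes, L):
    """Slice per-episode latent/action sequences into fixed-length windows
    X_t = (x_t, ..., x_{t+L-1}), with the action at the last step as the
    supervised target (Section 3.1). `episodes` is a list of
    (latents[T, d], actions[T, m]) pairs."""
    windows, targets = [], []
    for latents, actions in episodes:
        T = len(latents)
        for t in range(T - L + 1):
            windows.append(latents[t:t + L])    # shape (L, d)
            targets.append(actions[t + L - 1])  # action at the window's end
    return np.stack(windows), np.stack(targets)
```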
3.2 SBX Prototype Selection

SBX is performed in the latent space by clustering window embeddings with k-means over a range of cluster counts and selecting the number of clusters via a silhouette-style score. Within each selected cluster, the medoids (nearest to the centroids) are taken as temporal prototypes. Optionally, flattened action windows are concatenated to latent trajectories before k-means to bias prototype selection toward action-discriminative regions. The SBX step produces a prototype tensor of shape $(K, L, d)$.

3.3 Temporal Prototype Model

We introduce $K$ temporal prototypes $\{P_k\}_{k=1}^{K}$, each a length-$L$ latent template $P_k \in \mathbb{R}^{L \times d}$ selected by SBX (medoids). A shared temporal encoder $g_\theta : \mathbb{R}^{L \times d} \to \mathbb{R}^p$ maps trajectories to embeddings $z_t = g_\theta(X_t)$ and prototypes to $e_k = g_\theta(P_k)$. Following PW-Net, prototype activations use an L2-to-activation mapping:

$$a_t(k) = \log \frac{\|z_t - e_k\|_2^2 + 1}{\|z_t - e_k\|_2^2 + \varepsilon}, \quad \varepsilon > 0. \quad (1)$$

Outputs are linear in the activations, $y_t = W a_t$, optionally post-processed to valid actions (Tanh/ReLU for steer/gas/brake). The schematics of the algorithm is outlined in Figure 1.
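A sketch of the SBX medoid selection of Section 3.2 and the activation mapping of Eq. (1), using scikit-learn's k-means and silhouette score; the epsilon value and the use of flattened latent windows without the optional action concatenation are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sbx_prototypes(windows, k_range=range(2, 9)):
    """SBX selection: k-means over flattened latent windows, k chosen by a
    silhouette score, and the medoids (windows nearest to the centroids)
    returned as temporal prototypes of shape (K, L, d)."""
    flat = windows.reshape(len(windows), -1)
    best = max(
        (KMeans(n_clusters=k, n_init=10, random_state=0).fit(flat) for k in k_range),
        key=lambda km: silhouette_score(flat, km.labels_),
    )
    medoid_ids = [np.linalg.norm(flat - c, axis=1).argmin()
                  for c in best.cluster_centers_]
    return windows[medoid_ids]

def prototype_activation(z, e, eps=1e-4):
    """L2-to-activation mapping of Eq. (1): large when the window embedding
    z lies close to the prototype embedding e."""
    d2 = np.sum((z - e) ** 2)
    return np.log((d2 + 1.0) / (d2 + eps))
```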
We collect an offline dataset by rolling network, we focus on handling mainly high voltages as those are 𝑡 𝑡 + 𝐿 − 1 ) ∈ R out the trained PPO agent and recording, at each time step, the a bigger problem in our example. policy’s penultimate-layer latent vector and the corresponding environment action. This yields per-episode sequences of latents 4.2 Explaining a Simulation and actions which are then converted into trajectories of length We consider a real low-voltage distribution network. An obser- 𝐿 . The supervised target for each trajectory is the action at its vation/state is the vector of per-bus voltage magnitudes 𝑠 = last real-time step. [𝑣 1, . . . , 𝑣𝑛 (in per unit). Actions are per-active-consumer flex- ] ibility commands 𝑎 𝛼 1, . . . , 𝛼𝑚 with 𝛼𝑖 1, 1 : negative = [ ] ∈ [− ] 3.2 SBX Prototype Selection values decrease consumption (or increase net export) and positive SBX is performed in the latent space by clustering window embed- values decrease the generation for active consumers (bounded dings with k-means over a range of cluster counts and selecting by their instantaneous battery output). The agent acts every 15 28 Explaining Deep Reinforcement Learning Policy in Distribution Network Control Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia the latent space, the average distance from each prototype to its top-25 nearest trajectories was 0.124 on average, suggesting coherent time-local patterns. Figure 4: Representative prototypes in the Power Control Figure 2: The network Srakovlje is located in Gorenjska environment. Each color represents the Scenario, and the region (north-western part of Slovenia). Active consumers individual line represents the activations by the individual (red), and their most representative activations are dis- active consumers. played with the corresponding graph. Green circles denote the most common over-voltage buses prior to voltage con- trol. The width of the green circle indicates the severity of the original over-voltage measurements. 4.3 Results Fidelity. Across both domains, the prototype-based policy closely tracks the black-box in task reward, while achieving low action minutes; episodes comprise 96 steps (one day). The goal is to keep discrepancy on held-out episodes. This suggests that mediating voltages within operating limits while minimizing interventions actions through temporal prototypes does not materially degrade and losses. performance. Following prior work on distribution-voltage control [21], we use a reward that balances voltage quality, activation effort, Global structure. SBX consistently discovers a small set of recurring scenarios that align with intuitive regimes (straight and network losses. Trajectories are generated by a PPO policy driving vs. cornering in continuous control; typical operating trained in this environment. conditions in slower dynamics). Scenario summaries (state/action mean std) are distinct and exhibit stable temporal patterns. ± Local interpretability. For representative episodes, the nearest- neighbor aggregates around each prototype show coherent time- local patterns, and the most influential prototypes (largest con- tributions) align with observed actions. Explanations adopt a case-based form, relating current decisions to similar prototypi- cal windows. Performance Analysis. We compared the rewards across different policy architectures. Table presents the results of ?? 
running 20 episodes for each policy variant, measuring key per- formance metrics including mean reward, consistency (standard deviation), and coefficient of variation (CV) as a measure of relia- Figure 3: Centroids and underlying medoids of the sce- bility. narios in the Power Control environment. The individual Over 20 episodes, the Base policy achieves the highest mean color represents the average voltage signal in the network reward (221.8; range 201.0–257.5). PWNet closely matches the corresponding to the scenarios. Base with a mean of 220 .7 ( 0.5% lower; range 185.8–249.5), ≈ indicating that mediating decisions through prototypes incurs We used trajectories with length 𝐿 96 which gives us 𝐾 3 negligible performance loss. The Temporal PWNet trades some = = prototypes (Figure 3). Scenario selection via a silhouette-style reward for interpretability, averaging 211.5 ( 4.7% below Base; ≈ criterion over 𝑘 2, . . . , 8 yielded a preferred 𝑘 3 scenarios. range 168.4–231.8). Overall, relative performance is: Base 100%, ∈ { } = ≈ Representative scenario-level activation summaries are shown in PWNet 99%, Temporal PWNet 95%. ≈ ≈ Figure 4. offline action-level discrepancy against The results demonstrate several key insights about our ap-Task fidelity: the reference policy (mean-squared error over held-out trajecto- proach. The Base Policy achieves the best rewards. The PWNet ries at the final step) was MSE 3.218. stored Policy shows comparable performance, indicating that prototype-= Scenario quality: similarity scores by 𝑘 were: 𝑘 2: 0.131, 𝑘 3: 0.118, 𝑘 4: 0.083, based explanations can be achieved without significant perfor-= = = 𝑘 5: 0.082, 𝑘 6: 0.089, 𝑘 7: 0.093, 𝑘 8: 0.096. A recom- mance degradation. Our Temporal PWNet + SBX approach achieves = = = = puted silhouette for the chosen 𝑘 3 partition gave 0.099 with a mean reward of 211.47 14.60, representing a modest per-= ± per-scenario supports 4212, 7312, 5912 trajectories, indicating formance trade-off in exchange for enhanced interpretability [ ] three regimes with substantial coverage. In through temporal prototypes and scenario-guided explanations. Prototype locality: 29 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia B. Dobravec and J. Žabkar 5 Discussion References [1] Shahab Bahrami, Yu Christine Chen, and Vincent W. S. Wong. 2021. Deep This work introduces Scenario-Guided Temporal Prototypes, reinforcement learning for demand response in distribution networks. IEEE which combines global scenario discovery (SBX) with local, time- Transactions on Smart Grid, 12, 1496–1506. resolved prototypes to explain DRL decisions in voltage control [2] Di Cao, Junbo Zhao, Weihao Hu, Fei Ding, Nanpeng Yu, Qi Huang, and Zhe Chen. 2021. Model-free voltage control of active distribution system problem in power networks. We observe that temporal proto- with PVs using surrogate model-based deep reinforcement learning. Applied types can approximate black-box actions off-line with low dis- , 306, Part A, (Nov. 2021). Energy crepancy while forcing decisions through human-friendly ex- [3] Chaofan Chen, Oscar Li, Chaofan Tao, Alina Barnett, Cynthia Rudin, and Jonathan Su. 2019. This looks like that: deep learning for interpretable image emplars. SBX discovers a small number of recurring regimes, Proceedings of the IEEE/CVF Conference on Computer Vision recognition. In with clear scenario-level summaries (Figure 3) and consistent , 8930–8939. 
5 Discussion

This work introduces Scenario-Guided Temporal Prototypes, which combine global scenario discovery (SBX) with local, time-resolved prototypes to explain DRL decisions in the voltage control problem in power networks. We observe that temporal prototypes can approximate black-box actions offline with low discrepancy while forcing decisions through human-friendly exemplars. SBX discovers a small number of recurring regimes, with clear scenario-level summaries (Figure 3) and consistent prototype neighborhoods. This supports case-based reasoning over the policy's temporal dynamics rather than single-step feature attributions. Tight nearest-neighbor bands and balanced per-scenario support indicate that the selected prototypes are representative rather than outliers.

The limitations of our current approach include reliance on a particular windowing choice and an offline evaluation that does not account for control feedback. Extremely imbalanced or highly non-stationary data may complicate selection. Prototype interpretability depends on the quality of the medoids and the clarity of the associated concepts; domains lacking clear temporal motifs may benefit less from temporal prototypes and may also see degradation in performance. SBX does not identify the outliers that might be important for the agent to succeed. The identification of such states within the current architecture will be explored in future work. Future work also includes dynamic prototype lengths and human-in-the-loop curation tools for prototype editing and labeling.
6 Conclusion

We presented a pre-hoc interpretability framework that (i) discovers scenario structure from trajectories and (ii) explains actions via temporal prototypes. The approach yields faithful, time-resolved explanations without materially degrading control quality, as demonstrated in power-network voltage control. Explanations take a case-based form ("this situation is similar to prototype X") and are grounded by scenario summaries and prototype locality.

Beyond improving transparency, our approach offers practical diagnostics: scenario coverage, per-scenario prototype counts, and nearest-neighbor coherence expose where explanations are strong or require refinement. Looking ahead, we plan to enable interactive prototype curation, incorporate uncertainty-aware explanation scores, and explore joint training schemes that couple prototype-based interpretability with context-aware latent dynamics. We will explore the sensitivity of training success to the hyperparameter $L$. We have also identified that fidelity metrics beyond the MSE will be necessary. At this moment, comparison to saliency methods or SHAP explanations is still challenging due to the different nature of the explanations (one being feature- and step-wise, the other multi-step and comparison-based). Together, these steps can help bridge the gap between high-performing DRL policies and the trust required for their deployment.

Acknowledgements

This work was partially supported by the Slovenian Research Agency (ARIS), grant L2-4436: Deep Reinforcement Learning for optimization of LV distribution network operation with Integrated Flexibility in real-Time (DRIFT), and by the Slovenian Research Agency (ARIS) as a member of the research program Artificial Intelligence and Intelligent Systems (Grant No. P2-0209).

References

[1] Shahab Bahrami, Yu Christine Chen, and Vincent W. S. Wong. 2021. Deep reinforcement learning for demand response in distribution networks. IEEE Transactions on Smart Grid, 12, 1496–1506.
[2] Di Cao, Junbo Zhao, Weihao Hu, Fei Ding, Nanpeng Yu, Qi Huang, and Zhe Chen. 2021. Model-free voltage control of active distribution system with PVs using surrogate model-based deep reinforcement learning. Applied Energy, 306, Part A, (Nov. 2021).
[3] Chaofan Chen, Oscar Li, Chaofan Tao, Alina Barnett, Cynthia Rudin, and Jonathan Su. 2019. This looks like that: deep learning for interpretable image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8930–8939.
[4] Ruisheng Diao, Zhiwei Wang, Di Shi, Qianyun Chang, Jiajun Duan, and Xiaohu Zhang. 2019. Autonomous voltage control for grid operation using deep reinforcement learning. CoRR, abs/1904.10597. arXiv: 1904.10597.
[5] Blaž Dobravec and Jure Žabkar. 2024. Explaining voltage control decisions: a scenario-based approach in deep reinforcement learning. In Foundations of Intelligent Systems. Springer Nature Switzerland, Cham, 216–230. ISBN: 978-3-031-62700-2.
[6] Samar Fatima, Verner Püvi, and Matti Lehtonen. 2020. Review on the PV hosting capacity in distribution networks. Energies, 13, 18.
[7] Natanael Gomes, Felipe Martins, José Lima, and Heinrich Wörtche. 2022. Reinforcement learning for collaborative robots pick-and-place applications: a case study. Automation, 3, (Mar. 2022).
[8] European Union Policy Initiative. [n. d.] Growing consumption in the European markets. https://knowledge4policy.ec.europa.eu/growing-consumerism. Accessed: 2022-11-10.
[9] Eoin M. Kenny, Mycal Tucker, and Julie A. Shah. 2023. Towards interpretable deep reinforcement learning with human-friendly prototypes. In ICLR.
[10] Bangalore Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A. Al Sallab, Senthil Kumar Yogamani, and Patrick Pérez. 2020. Deep reinforcement learning for autonomous driving: A survey. CoRR, abs/2002.00444. arXiv: 2002.00444.
[11] Wong Ling Ai, Vigna Ramachandaramurthy, Sara Walker, and Janaka Ekanayake. 2020. Optimal placement and sizing of battery energy storage system considering the duck curve phenomenon. IEEE Access, 8, (Jan. 2020), 197236–197248. doi: 10.1109/ACCESS.2020.3034349.
[12] Brida V. Mbuwir, Fred Spiessens, and Geert Deconinck. 2018. Self-learning agent for battery energy management in a residential microgrid. In 2018 IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT-Europe), 1–6.
[13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. 2013. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602. arXiv: 1312.5602.
[14] Taha Nakabi and Pekka Toivanen. 2020. Deep reinforcement learning for energy management in a microgrid with flexible demand. Sustainable Energy Grids and Networks, 25, (Dec. 2020).
[15] Meike Nauta, Sander van Bree, and Christin Seifert. 2021. Neural prototype trees for interpretable fine-grained image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14933–14943.
[16] Alvaro Rodriguez del Nozal, Esther Romero-Ramos, and Angel Luis Trigo-Garcia. 2019. Accurate assessment of decoupled OLTC transformers to optimize the operation of low-voltage networks. Energies, 12, 11.
[17] Pedro Sequeira and Melinda T. Gervasio. 2019. Interestingness elements for explainable reinforcement learning: understanding agents' capabilities and limitations. Artif. Intell., 288, 103367.
[18] David Silver et al. 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815. arXiv: 1712.01815.
[19] David Silver et al. 2017. Mastering the game of Go without human knowledge. Nature, 550, 354–359.
[20] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: visualising image classification models and saliency maps. CoRR, abs/1312.6034.
[21] Jianhong Wang, Wangkun Xu, Yunjie Gu, Wenbin Song, and Tim C. Green. 2021. Multi-agent reinforcement learning for active voltage control on power distribution networks. CoRR, abs/2110.14300. arXiv: 2110.14300.
[22] Junjie Wang, Qichao Zhang, Yao Mu, Dong Li, Dongbin Zhao, Yuzheng Zhuang, Ping Luo, Bin Wang, and Jianye Hao. 2024. Prototypical context-aware dynamics for generalization in visual control with model-based reinforcement learning. IEEE Transactions on Industrial Informatics, 20, 9, 10717–10727. doi: 10.1109/TII.2024.3396525.
[23] Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. 2021. Reinforcement learning with prototypical representations. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research). PMLR. https://arxiv.org/abs/2102.11271.
[24] Ke Zhang, Peidong Xu, and Jun Zhang. 2020. Explainable AI in deep reinforcement learning models: a SHAP method applied in power system emergency control. In 2020 IEEE 4th Conference on Energy Internet and Energy System Integration (EI2), 711–716.
[25] Ke Zhang, Jun Zhang, Pei-Dong Xu, Tianlu Gao, and David Wenzhong Gao. 2022. Explainable AI in deep reinforcement learning models for power system emergency control. IEEE Transactions on Computational Social Systems, 9, 2, 419–427.
Leveraging AI in Melanoma Skin Cancer Diagnosis: Human Expertise vs. Machine Precision

Anna-Katharina Herke (Applied Artificial Intelligence, Alma Mater Europaea; Anna-katharina.herke@almamater.si)

Abstract

Whilst relatively uncommon compared to other skin cancers, melanoma is one of the most aggressive forms of this cancer. Given early and accurate detection, the condition can be treated successfully. Despite advancements in dermoscopy, diagnostic variability among dermatologists persists, often delaying treatment. This paper investigates the performance of a deep learning model based on ResNet-50 against human dermatologists in melanoma detection, highlighting synergies between AI and human diagnostics. Our findings indicate that AI can be as accurate as or better than individual dermatologist performance on key metrics like sensitivity and specificity, and that a workflow focused on collaboration in the diagnostic process yields superior outcomes compared to either approach alone.

Keywords

Melanoma, skin cancer diagnosis, AI in cancer diagnosis, dermatology

1 Introduction

Globally, melanoma accounts for a disproportionate number of skin cancer-related deaths despite being less common than other skin cancers like basal and squamous cell carcinomas. In the United States alone, melanoma accounts for only one in 100 cases of skin cancer, while causing the majority of deaths from this type of cancer [31]. Early detection dramatically improves prognosis, with five-year survival rates exceeding 90% when melanoma is identified at an early stage [1]. However, diagnostic accuracy in dermatology remains highly variable, dependent on clinician experience, lesion characteristics, and access to dermoscopic tools.

This variability presents a significant diagnostic challenge. Studies have revealed that dermatologists may miss up to one in five (20%) cases of melanoma. There is also disagreement between professionals on lesion categorization [3, 4]. Artificial intelligence (AI), particularly deep learning algorithms trained on large dermoscopic datasets, has emerged as a potential equalizer, capable of achieving and possibly exceeding the classification accuracy of dermatologists [1, 2].

AI's ability to analyze complex visual patterns in skin lesions offers a novel solution to diagnostic gaps. However, questions remain regarding its performance in clinical settings, generalizability, potential biases, and ethical implications [14, 15]. This study aims to compare the diagnostic performance of a ResNet-50-based AI model with that of board-certified dermatologists and to explore synergistic diagnostic workflows. We place specific emphasis on aspects of dataset composition, prospective evaluation design, and clinical integration to expand on the findings of previous studies.

2 Research Questions

This paper will focus on and attempt to answer the following research questions:

1. How does the diagnostic accuracy of an AI model compare to that of human dermatologists?
2. Can AI-human collaboration enhance melanoma detection outcomes?
3. What are the ethical and practical considerations for AI integration in clinical dermatology?
3 Related Work

Early studies such as Esteva et al. [1] demonstrated the power of artificial intelligence in skin cancer diagnostics. The authors showed that deep convolutional neural networks (CNNs) could match the diagnostic performance of dermatologists in melanoma classification. Haenssle et al. [2] confirmed these findings in a controlled reader study. Similarly, Brinker et al. [4] found that a CNN outperformed 86% of participating dermatologists.

Recent research has shifted toward examining the potential of collaborations between humans and AI. Tschandl et al. [3] and Allen et al. [26] found that AI-assisted diagnosis improved the accuracy of clinician diagnosis alone. Navarrete-Dechent et al. [7] conducted a prospective trial showing how synergistic diagnosis combining dermatologists and AI tools improved diagnostic accuracy.

However, limitations persist. Most studies use retrospective or experimental setups lacking real-world clinical integration. Few address model bias, particularly regarding skin tone and underrepresented populations [14, 15, 33, 34]; such biases could lead to false diagnoses. Continued reliance on HAM10000 and institutional datasets restricts the generalizability of research findings. In addition, the absence of real-world patient context, such as patient history and a physical exam, may cause clinicians to underestimate diagnostic complexity. Furthermore, adoption barriers among clinicians remain underexplored at the time of writing [27].

This submission seeks to fill these gaps with a prospective evaluation of AI-human performance and practical deployment considerations.

4 Methods

4.1 Data Acquisition and Preprocessing

Dermoscopic images were sourced from the commonly used HAM10000 dataset [13], supplemented by institutional image archives. Inclusion criteria comprised high-resolution dermoscopic images of histopathologically confirmed melanomas and benign nevi. Exclusion criteria included images with low resolution, artifacts, or incomplete metadata. All images underwent standardized preprocessing procedures such as resizing to 224×224 pixels, normalization, and augmentation (flipping, rotation, and contrast adjustments) to enhance generalizability [21, 23].

4.2 AI Model Architecture

For this study, we utilized a ResNet-50 CNN pretrained on ImageNet and fine-tuned on the melanoma dataset. The model incorporated dropout regularization and cross-entropy loss optimization. Training was conducted on NVIDIA GPUs using a 70/15/15 train-validation-test split. This architecture and training paradigm have demonstrated high performance in skin lesion classification tasks and are widely adopted in the dermatology AI literature [1, 4].
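A sketch of the preprocessing and fine-tuning setup described in Sections 4.1–4.2, assuming torchvision ≥ 0.13; the dropout rate, augmentation magnitudes, and learning rate are not specified in the paper and are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Preprocessing per Section 4.1: resize to 224x224, flipping/rotation/contrast
# augmentation, normalization (ImageNet statistics assumed, since the
# backbone is ImageNet-pretrained).
train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# ResNet-50 pretrained on ImageNet, fine-tuned for the binary
# melanoma-vs-benign task with dropout and cross-entropy, per Section 4.2.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(model.fc.in_features, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```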
4.3 Human Cohort and Diagnostic Protocol

Twenty board-certified dermatologists with 5–25 years of clinical experience participated. We asked each participant to review 100 randomized images. Images were presented in isolation, blind to patient history and pathology. Diagnoses were binary (melanoma vs. benign). In a second round, participants reviewed the same images with AI output overlays.

This two-phase diagnostic design aligns with previous human-versus-AI studies, notably those by Haenssle et al. and Tschandl et al., which examined both solo and AI-assisted diagnostic conditions [2, 3, 7]. Randomization and blinding ensure impartial evaluation, a standard methodological feature in comparative diagnostic trials [5, 6].

4.4 Evaluation Metrics

Performance was measured using sensitivity, specificity, the area under the ROC curve (AUC-ROC), and the average diagnostic time per image. Inter-rater agreement was assessed using Fleiss' kappa.

5 Results

5.1 AI vs Human Diagnostic Performance

The AI model achieved an AUC-ROC of 0.94, with 89% sensitivity and 85% specificity. Dermatologists averaged an AUC of 0.87, with 82% sensitivity and 83% specificity. Notably, 75% (15 out of 20) of the dermatologists were outperformed by the AI in sensitivity [4].

We further analyzed inter-rater variability among clinicians using Fleiss' kappa statistics. Without AI assistance, Fleiss' kappa was 0.58 (moderate agreement). With AI assistance, kappa increased to 0.72 (substantial agreement), indicating improved consensus among readers. This improvement in agreement supports the claim that AI support enhances diagnostic reliability and synergizes with human expertise.

Table 1: Inter-Rater Variability

Scenario                Fleiss' Kappa
Clinicians Alone        0.58
Clinicians + AI Assist  0.72

Source: research performed in the course of this study

5.2 AI-Human Synergy Analysis

When assisted by AI, dermatologist sensitivity improved to 91%, and specificity rose to 87%, surpassing both the solo AI and unassisted human performance. The average diagnostic time dropped from 22 seconds to 15 seconds per image [28].

Table 2: Visual Summary of Results

Diagnostic Modality   Sensitivity  Specificity  AUC-ROC  Avg Time/Image
AI Alone              89%          85%          0.94     3 seconds
Dermatologists Alone  82%          83%          0.87     22 seconds
Dermatologists + AI   91%          87%          0.96     15 seconds

Source: research performed in the course of this study
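The metrics of Section 4.4 can be computed as in the following sketch; the use of scikit-learn and statsmodels here is an assumption, as the paper does not name its tooling.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def diagnostic_metrics(y_true, y_pred, y_score):
    """Sensitivity, specificity, and AUC-ROC as in Section 4.4
    (label 1 = melanoma, 0 = benign)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {"sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "auc_roc": roc_auc_score(y_true, y_score)}

def reader_agreement(ratings):
    """Fleiss' kappa for inter-rater agreement; `ratings` is an
    (images x readers) array of binary diagnoses."""
    counts, _ = aggregate_raters(np.asarray(ratings))
    return fleiss_kappa(counts)
```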
5.2 AI-Human Synergy Analysis

When assisted by AI, dermatologist sensitivity improved to 91% and specificity rose to 87%, surpassing both the solo AI and the unassisted human performance. Average diagnostic time dropped from 22 seconds to 15 seconds per image [28].

Table 2: Visual Summary of Results

  Diagnostic Modality     Sensitivity   Specificity   AUC-ROC   Avg Time/Image
  AI Alone                89%           85%           0.94      3 seconds
  Dermatologists Alone    82%           83%           0.87      22 seconds
  Dermatologists + AI     91%           87%           0.96      15 seconds

  Source: research performed in the course of this study

6 Discussion

We were able to affirm previous findings that artificial intelligence has the capacity to match or outperform dermatologists in the detection of melanoma [1, 5]. Moreover, diagnostic synergy between human experts and AI enhances overall performance, aligning with findings from Tschandl et al. [3] and Navarrete-Dechent et al. [7]. A hybrid diagnostic model, combining AI's speed and consistency with human intuition and contextual awareness, represents the future of dermatological practice. As diagnostic models mature, so will the surrounding technology: improvements in AI, such as federated learning and enhanced explainability methods, will lead to greater trust and adoption in clinical settings.

6.1 Ethical Considerations and Bias Analysis

Despite strong results when combining clinician expertise with AI in melanoma detection, concerns persist, and they begin even before the algorithm is applied: AI models may have been trained on biased data, and the underrepresentation of darker skin tones remains problematic [14, 15]. As a result, AI may exacerbate healthcare disparities [20], and there remains a need for inclusive datasets and algorithmic transparency [19] to address these challenges.

To strengthen our analysis of bias and inclusivity, we present a descriptive breakdown of our dataset by skin type (Fitzpatrick scale):

Table 3: Skin Type Breakdown

  Fitzpatrick Skin Type   Number of Cases   Percentage (%)
  I–II (Light)            500               40
  III–IV (Medium)         500               40
  V–VI (Dark)             250               20
  Total                   1,250             100

  Source: research performed in the course of this study

This distribution allows for a more robust discussion of skin tone bias and supports the inclusiveness of our findings. We acknowledge that the representation of darker skin types (V–VI) remains limited and may impact generalizability. Future studies should prioritize dataset balance for equitable AI performance.

In collaborative settings, explainability remains another challenge, as clinicians may distrust opaque AI decisions that lack transparency. Incorporating interpretable AI frameworks and continuous feedback loops can help address these issues [21].
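For illustration, one lightweight, model-agnostic interpretability technique is occlusion sensitivity: mask image regions and observe how the predicted melanoma probability changes. The sketch below is a generic example built on the ResNet-50 setup sketched earlier, not the interpretable-AI framework referenced in [21].

# Occlusion sensitivity: slide a blank patch over the image and record how
# much the melanoma probability drops; large drops mark influential regions.
# Generic illustration only; not the method of any cited study.
import torch

@torch.no_grad()
def occlusion_map(model, image, target_class=1, patch=32, stride=16):
    """image: (3, 224, 224) preprocessed tensor; returns a 2D saliency grid."""
    model.eval()
    base = torch.softmax(model(image.unsqueeze(0)), dim=1)[0, target_class]
    rows = (image.shape[1] - patch) // stride + 1
    cols = (image.shape[2] - patch) // stride + 1
    heat = torch.zeros(rows, cols)
    for i in range(rows):
        for j in range(cols):
            occluded = image.clone()
            occluded[:, i*stride:i*stride+patch, j*stride:j*stride+patch] = 0.0
            prob = torch.softmax(model(occluded.unsqueeze(0)),
                                 dim=1)[0, target_class]
            heat[i, j] = base - prob  # positive = region supported the diagnosis
    return heat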
6.2 Integrating AI into Clinical Practice

Adoption hurdles include clinician skepticism, workflow integration, and regulatory uncertainty [27, 25]. Real-world implementation requires AI tools to function as second readers, supporting rather than supplanting clinicians [6, 22]. Regulatory guidance from the FDA (2022) emphasizes post-market monitoring, performance transparency, and adaptive learning constraints [30]. Clinician training, robust validation, and clear liability frameworks are essential for safe deployment.

7 Conclusion

This study highlights the promise of AI-human collaboration in melanoma diagnosis. A fine-tuned ResNet-50 model achieved diagnostic accuracy comparable to board-certified dermatologists and improved performance when integrated into clinician workflows. While AI holds transformative potential, challenges around bias, explainability, and regulatory oversight must be addressed to ensure equitable, trustworthy deployment. Future work should focus on prospective clinical trials, patient-facing applications, and interdisciplinary frameworks for human-AI co-diagnosis.

Acknowledgments

The completion of this analysis on melanoma skin cancer was achieved through the insightful contributions of researchers in the field whose work was pivotal for this analysis. I am thankful for the academic community and institutions that provided access to research databases and journals, which were essential for the literature review. I would also like to extend my gratitude to the peer reviewers and editors at the IS.IJS.SI conference, especially Matjaz Gams, whose valuable feedback enhanced the quality of this paper.

References

[1] Esteva, A., et al. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639), 115–118.
[2] Haenssle, H.A., et al. (2018). Man against machine: Diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma image classification. Annals of Oncology, 29(8), 1836–1842.
[3] Tschandl, P., et al. (2020). Human–computer collaboration for skin cancer recognition. Nature Medicine, 26(8), 1229–1234.
[4] Brinker, T.J., et al. (2019). Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task. European Journal of Cancer, 113, 47–54.
[5] Phillips, M., et al. (2019). Assessment of accuracy of an artificial intelligence algorithm to detect melanoma in images of skin lesions. JAMA Network Open, 2(10), e1913436.
[6] Marchetti, M.A., et al. (2020). Artificial intelligence as a second reader in melanoma screening. Journal of the American Academy of Dermatology, 83(1), 188–194.
[7] Navarrete-Dechent, C., et al. (2022). Human-AI synergy in melanoma diagnosis: A prospective clinical trial. Journal of the American Academy of Dermatology, 86(3), 567–575.
[8] Liu, Y., et al. (2021). Deep learning for melanoma detection: A systematic review. Journal of Investigative Dermatology, 141(12), 2835–2844.
[9] Fujisawa, Y., et al. (2019). Deep learning-based image analysis of melanocytic lesions: Current status and future prospects. Frontiers in Medicine, 6, 99.
[10] Codella, N.C.F., et al. (2018). Skin lesion analysis toward melanoma detection: ISIC 2017 Challenge. IEEE ISBI, 168–172.
[11] Sood, T., et al. (2021). AI in dermatology: Challenges and opportunities. Journal of Medical Systems, 45(7), 1–8.
[12] Han, S.S., et al. (2018). Classification of the malignancy of skin lesions using deep learning-based image analysis. PLoS One, 13(11), e0205820.
[13] Tschandl, P., et al. (2018). The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5, 180161.
[14] Groh, M., et al. (2021). Evaluating racial bias in AI skin cancer models. NEJM AI, 1(1), 1–10.
[15] Daneshjou, R., et al. (2022). Disparities in dermatology AI performance on a diverse patient population. Science Translational Medicine, 14(645), eabq6147.
[16] Kittler, H., et al. (2016). Diagnostic accuracy of an artificial intelligence–based device for the evaluation of pigmented skin lesions. Lancet Oncology, 17(12), 1785–1793.
[17] Hollon, T.C., et al. (2020). Machine learning identifies surgical margins in patients with melanoma using stimulated Raman histology. Cancer Research, 80(4), 664–673.
[18] Brinker, T.J., et al. (2020). Skin cancer classification using convolutional neural networks: Systematic review. Journal of Medical Internet Research, 22(10), e20736.
[19] Yogananda, C.G., et al. (2021). A survey on explainable AI for skin lesion analysis. Frontiers in Medicine, 8, 777911.
[20] Adamson, A.S., Smith, A. (2018). Machine learning and healthcare disparities in dermatology. JAMA Dermatology, 154(11), 1247–1248.
[21] Ghosal, A., et al. (2021). Deep learning for melanoma detection: A comprehensive review. Artificial Intelligence Review, 54(8), 5783–5819.
[22] Topol, E.J. (2019). High-performance medicine: The convergence of human and artificial intelligence. Nature Medicine, 25, 44–56.
[23] Han, S.S., et al. (2022). Federated learning for melanoma detection across institutions. Nature Communications, 13(1), 1–10.
[24] Ud Din, N., et al. (2023). Artificial intelligence for melanoma diagnosis: A decade of progress. Cancers, 15(3), 876.
[25] Wong, A., et al. (2022). Ethical challenges of AI in melanoma diagnosis. Lancet Digital Health, 4(3), e156–e165.
[26] Allen, J., et al. (2021). Human–machine collaboration in skin lesion diagnosis. JAMA Dermatology, 157(8), 947–954.
[27] Jones, O.T., et al. (2021). Barriers to AI adoption in dermatology: A clinician survey. British Journal of Dermatology, 185(2), 345–352.
[28] Yamada, M., et al. (2022). An AI tool helped reduce dermatologist diagnosis times and errors: A retrospective study. Artificial Intelligence in Medicine, 129, 102317.
[29] Udrea, A., et al. (2020). Accuracy of a smartphone application for triage of skin lesions based on machine learning in a primary care setting. JAMA Network Open, 3(6), e2036362.
[30] FDA. (2022). Regulatory considerations for AI/ML-based medical devices. FDA Guidance Document.
[31] American Cancer Society (2025). Key statistics for melanoma skin cancer. Accessed May 26, 2025: https://www.cancer.org/cancer/types/melanoma-skin-cancer/about/key-statistics.html
[32] Fleiss, J.L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382. https://doi.org/10.1037/h0031619
[33] Groh, M., Tseng, E., Mahoney, A., et al. (2023). Evaluating deep neural networks trained on clinical images in dermatology: The DERM dataset and implications for diversity. The Lancet Digital Health, 5(3), e158–e168. https://doi.org/10.1016/S2589-7500(22)00284-7
[34] Winkler, J.K., Fink, C., Toberer, F., et al. (2019). Association between dermatoscopic application of artificial intelligence for skin cancer recognition and accuracy of dermatologists in a randomized clinical trial. JAMA Dermatology, 155(6), 627–634. https://doi.org/10.1001/jamadermatol.2019.1735

Prediction of Root Canal Treatment Using Machine Learning

Matej Jelenc, Jožef Stefan Institute, Ljubljana, Slovenia, jelenc11matej@gmail.com
Miljana Shulajkovska, Jožef Stefan Institute, Ljubljana, Slovenia, miljana.sulajkovska@ijs.si
Rok Jurič, Odontos, Private Endodontic Practice, Ljubljana, Slovenia, rok.juric@odontos.si
Anton Gradišek, Jožef Stefan Institute, Ljubljana, Slovenia, anton.gradisek@ijs.si

Abstract

Root canal treatment is a medical procedure aimed at preventing or treating apical periodontitis, which is an inflammation around the apex of a tooth root. In this study, we analyzed a dataset collected by an experienced practitioner over the course of several years, and developed a forecasting model, based on the XGBoost algorithm, to predict the outcome of the treatment. The trained models achieved a mean area under the receiver-operating-characteristic curve (AUROC) of 0.92 and an average precision (AP) of 0.77. We discuss the importance of individual features in view of expert dental knowledge. To assist the practitioner in daily practice, we developed a web-based application to provide an assessment of treatment outcomes.

Keywords

root canal treatment outcome, feature importance, gradient boosting machines
1 Introduction

Apical periodontitis is an inflammation of the tissues around the apex of a tooth. It is a major health burden in the general population, with 6% of all teeth showing signs of this condition. Root canal treatment (RCT) aims either to prevent the onset of apical periodontitis or to help the tissue heal if it is already present [13]. Predicting treatment outcomes in RCT is of high interest to patients and dentists, as well as to insurance companies, as information about the likelihood of successful treatment can lead to better allocation of resources and help avoid potentially more invasive procedures, such as tooth removal and its replacement with an implant.

Machine learning has previously been used to study some aspects of root canal treatment, including the association between patient-, tooth- and treatment-level factors and root canal treatment failure [10], predicting root fracture after root canal treatment and crown installation [6], and non-surgical root canal treatment prognosis [2]. In this study, we analyze the data collected by Jurič et al. [13]. This dataset is of special interest since it relies on RCT patient data obtained by a single experienced practitioner (ensuring a high level of consistency in the treatment approach), as opposed to studies where numerous dentists were treating patients and different choices between them could have resulted in a less representative dataset. The aim of the study was to develop and evaluate an algorithm that predicts the outcome of the RCT, as well as to analyze how robust the algorithm is and which features influence the outcome the most. This study goes hand in hand with the study by Jurič et al. [13], where the analysis was conducted solely using statistical methods.

2 Related Work

To our knowledge, the utilization of machine learning in endodontics is still relatively unresearched, specifically when predicting treatment outcome using only tabular data. Among the related papers, [10] employs XGBoost to explore the association between patient-, tooth- and treatment-level factors and root canal treatment failure, while [2] used Random Forests (RF), K-Nearest Neighbours (KNNs), Logistic Regression (LR) and Naive Bayes (NB) to predict the outcome of non-surgical root canal treatments, similarly to this study. Paper [8] explores the prediction of treatment longevity using Support Vector Machines (SVMs), LR and NB, while [14] investigates the relation between root canal morphology and root canal treatment using both statistical and machine learning methods, specifically RF, SVMs and Gradient Boosting Machines (GBMs). Moreover, papers [19, 18] investigate the prediction of case difficulty and prognosis of endodontic microsurgery, while [6, 9] explore the prediction of root fracture and postoperative pain after root canal treatment. Additionally, multiple papers investigate root canal treatment outcome or related factors using deep learning (DL) on X-ray images, specifically panoramic or periapical radiographs, such as [3, 22, 11, 1, 5].

3 Data

The dataset analyzed in this study contains treatment details, clinical and radiographic data regarding primary or secondary root canal treatment of mature permanent teeth, collected and curated in [13]. Three different types of outcome were determined (clinical, radiographic, and combined), for each of which both strict (no clinical or radiographic sign of disease) and loose (only negligible signs of disease) assessment criteria were used. In this paper, only strict assessments were considered and used as prediction targets. All assessments were binary, with 1 representing a successful and 0 an unsuccessful treatment outcome. The dataset was fairly imbalanced, with 88% of all cases representing a successful radiographic outcome, 92% a successful clinical outcome and 83% a successful combined outcome. The study cohort consisted of 740 patients and 1264 teeth, resulting in 3153 root canal treatment cases and 84 features in total. The majority of features represented categorical or binary values, such as gender, tooth type, root canal etc., while variables such as age and working length were treated as continuous.
4 Methods

This section outlines the methods used for ranking feature importance and for training baseline models that can serve as a tool for predicting root canal treatment outcome.

4.1 Data Preprocessing

First, data regarding second visits was removed to ensure consistency among cases. Next, features directly dependent on or derived from a specific feature were excluded from the dataset to minimize the dimensionality of the data, as were any post-operative factors that were directly used to determine the treatment outcome. The dataset was further reduced by removing redundant features, i.e. features that can take only one value or whose value is missing for more than 50% of all cases. Similarly, cases for which more than 50% of features are missing were excluded, resulting in 3153 cases and 84 features in total. Lastly, the dataset was preprocessed using label encoding and split into training (80%) and testing (20%) sets. Furthermore, the training set was split into training (80%) and validation (20%) sets when ranking feature importance, to avoid overfitting.

4.2 Model Architecture

For the underlying model, gradient boosting machines were used, specifically the XGBoost algorithm [7], as it remains widely regarded as the state of the art and the preferred choice for tabular data tasks over the increasingly popular deep learning algorithms, as shown in [4, 12, 20]. Furthermore, algorithms based on transparent methods, such as decision trees, are strongly preferred for applications in medicine when compared to the "black box" approaches typically associated with deep learning.

4.3 Metrics

Due to the dataset's high imbalance between negative (~87%) and positive (~13%) cases, standard classification metrics such as accuracy or area under the receiver-operating-characteristic curve (AUROC) can be highly misleading; therefore, average precision (AP) was chosen as the key metric for estimating the model's performance and its ability to produce quality predictions, specifically using the formula

$$\mathrm{AP} = \sum_i (R_i - R_{i-1}) \cdot P_i,$$

where $R_i$ and $P_i$ are the recall and precision at the $i$-th threshold when testing on $n$ samples [17]. AUROC was only used to provide additional insight when interpreting results.

4.4 Grid Search

To obtain reasonable starting training hyperparameters and a baseline model that utilizes all available information, we performed a cross-validated grid search over a simple, manually defined parameter grid, using the scikit-learn library [17].
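A minimal sketch of such a cross-validated grid search with scikit-learn and XGBoost follows; the synthetic data and the parameter grid are assumptions standing in for the clinical table and the manually defined grid, which the paper does not spell out.

# Cross-validated grid search for an XGBoost baseline, scored by average
# precision as in Section 4.3; the grid below is an assumed example.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the 3153-case, 84-feature clinical dataset.
X, y = make_classification(n_samples=3153, n_features=84,
                           weights=[0.17], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    scoring="average_precision",  # AP as the key metric
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))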
4.5 Correlation Clustering

When a subset of features in a dataset is highly correlated, standard methods such as feature permutation importance or an ablation study often produce inaccurate results, since the model can come to depend on one specific feature and discard its correlated counterparts. Similarly, methods such as SHapley Additive exPlanations (SHAP) [16] or XGBoost's built-in feature importances only account for the contribution of a specific feature to the model's prediction, which can again be misleadingly low due to the feature's correlation with another.

To address this problem, clustering was performed based on the correlation between features. Let $X \in \mathbb{R}^{m \times n}$ represent the dataset with $m$ cases and $n$ features. By calculating the Spearman rank correlation coefficient [15, 17, 23] on $X$, a symmetric feature correlation matrix $C \in \mathbb{R}^{n \times n}$ was obtained and transformed into a distance matrix $D \in \mathbb{R}^{n \times n}$. To group correlated features, hierarchical clustering using Ward's method [17, 21] was performed on $D$ to obtain a hierarchical clustering tree, which was then flattened into discrete clusters containing features with high absolute correlation.
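A sketch of this step with SciPy follows; the distance transform $D = 1 - |C|$ and the flattening threshold are assumptions where the text does not pin down the exact choices.

# Group mutually correlated features via Spearman correlation + Ward
# hierarchical clustering; the 1 - |corr| distance and the flattening
# threshold are assumed details.
import numpy as np
from scipy.stats import spearmanr
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

def correlation_clusters(X, threshold=0.3):
    """X: (m cases, n features) array; returns one cluster id per feature."""
    corr, _ = spearmanr(X)                  # symmetric n x n matrix C
    corr = np.nan_to_num(corr)
    dist = 1.0 - np.abs(corr)               # distance matrix D
    np.fill_diagonal(dist, 0.0)
    linkage = hierarchy.ward(squareform(dist, checks=False))
    return hierarchy.fcluster(linkage, t=threshold, criterion="distance")

# Example: correlated features share a label in the output array.
X = np.random.rand(100, 8)
X[:, 1] = X[:, 0] + 0.01 * np.random.rand(100)  # make features 0 and 1 correlated
print(correlation_clusters(X))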
4.6 Ranking Feature Importance

To determine the significance of a specific feature $f$, a separate XGBoost model $M_f$ was trained and evaluated on a reduced dataset $X_f$ to obtain baseline results. Next, permutation testing was conducted by permuting the feature $f$ in the testing set and calculating the drop in performance of $M_f$ compared to the baseline results. Each feature was tested 20 times. Lastly, a mean drop and a p-value were calculated on the observed performance drops by performing a t-test, where a high mean drop represented high feature importance and a low p-value represented a low chance that the observed drop in performance was caused by an outside factor rather than by the permutation of $f$ in the test set.

To ensure that the feature's importance estimate was not corrupted by correlated features while still accounting for the feature's possible non-linear connections with other features, and to minimize the computational cost as much as possible, the reduced dataset $X_f$ was determined as follows. First, using the model trained on all features (see 4.4), SHAP values [17, 16] were calculated to determine the most contributing feature inside each cluster. Let $F = \{f_1, \dots, f_n\}$ represent the set of all features and $S : F \to \mathbb{R}^m$ the transformation that returns the SHAP values of a specific feature. The most contributing feature inside the $i$-th correlation cluster $C_i = \{f_j \mid j \in I_i\}$ was taken to be the feature with the highest mean absolute SHAP value, i.e. the $f_i^* \in C_i$ such that

$$\forall j \in I_i: \ \frac{1}{m} \sum_{k=1}^{m} |S(f_j)_k| \le \frac{1}{m} \sum_{k=1}^{m} |S(f_i^*)_k|.$$

The reduced dataset $X^* \in \mathbb{R}^{m \times r}$, containing only the representative features, was then transformed into $X_f$ for a feature $f \in C_i$ by replacing $f_i^*$ with $f$ in $X^*$. Such an approach eliminates features highly correlated with $f$ and reduces the computational cost by utilizing only the most contributing feature within each cluster, while still accounting for any non-linear connections between $f$ and features in other clusters. The procedure is visualized in Figure 1.

Figure 1: The hierarchical correlation tree is first flattened into clusters $C_1, \dots, C_r$, whose representative features $f_1^*, \dots, f_r^*$ define the base dataset $X^*$, from which we get $X_f$ for $f \in C_i$ by replacing $f_i^*$ with $f$.
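A compact sketch of this permutation test follows; model, X_test and y_test are assumed to be the fitted XGBoost model and the held-out split (as NumPy arrays), and average precision is used as the performance measure, as in Section 4.3.

# Permutation test for one feature f: repeatedly shuffle that column in the
# test set and t-test the drops in average precision against zero.
# Sketch only; `model`, `X_test`, `y_test` are assumed to be available.
import numpy as np
from scipy.stats import ttest_1samp
from sklearn.metrics import average_precision_score

def permutation_drop(model, X_test, y_test, feature_idx,
                     n_repeats=20, seed=0):
    rng = np.random.default_rng(seed)
    baseline = average_precision_score(
        y_test, model.predict_proba(X_test)[:, 1])
    drops = []
    for _ in range(n_repeats):
        X_perm = X_test.copy()
        rng.shuffle(X_perm[:, feature_idx])    # destroy the feature's signal
        ap = average_precision_score(
            y_test, model.predict_proba(X_perm)[:, 1])
        drops.append(baseline - ap)
    # High mean drop = important feature; low p-value = the drop is systematic.
    _t_stat, p_value = ttest_1samp(drops, popmean=0.0, alternative="greater")
    return float(np.mean(drops)), float(p_value)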
4.7 Evaluation

After obtaining the feature importances, features with a p-value < 0.05 were deemed significant. Next, a model using the starting parameters found in 4.4 was trained on the features belonging to the top $k$ percent in terms of feature importance, for $k$ in 1, 5, 10, 25, 50, 75, and 100 (the latter corresponding to all significant features).

5 Results

Figure 2 shows the comparison of performance, in terms of AP, of models trained on different percentiles. The highest performance was achieved when utilizing the entire preprocessed dataset of 84 distinct features, achieving an AUROC of 0.90 and AP of 0.70 when predicting the radiographic outcome, an AUROC of 0.94 and AP of 0.86 when predicting the clinical outcome, and finally an AUROC of 0.91 and AP of 0.77 when predicting the combined outcome. Out of the 84 chosen features, our method deemed 39 significant for the radiographic assessment, 54 for the clinical assessment, and 65 for the combined assessment, which produced AUROCs of 0.88, 0.85 and 0.87 and APs of 0.66, 0.75 and 0.70, respectively.

Figure 2: Average precision (AP) achieved by XGBoost when predicting (a) strict clinical, (b) strict radiographic and (c) strict combined assessment, utilizing different subsets of features: all features, all significant features, and the top 75%, 50%, 25%, 10%, 5% and 1% of significant features.

6 Discussion and Conclusion

Achieving high performance, our paper shows promise in using machine learning techniques for predicting the outcome of endodontic treatments. Moreover, we developed a web application which allows predicting the outcome of root canal treatments using the models trained on different subsets of the data, serving as a tool to assist in assessing the quality and success of a treatment, as well as to give insight for possible further patient care.

Furthermore, all the statistically significant factors found in the original study [13] are found to be significant by our method as well. Specifically, "lesion diameter" was found to be the most relevant factor, with "root PAI" and "canal code" in the top 5%, "tooth type" ("tooth group" and "canal number") in the top 10%, "type of sealer" and "quality of coronal restoration" in the top 25%, "tenderness to periapical palpation" and "quality of root filling" in the top 50%, and lastly "injury history" in the top 100% of all significant features. Here, we exclude factors such as "number of visits" and "number of canals per root", since they were not used in this study. Moreover, among the most important factors that this study found but that were not accounted for or were found insignificant in [13] are "age" as the second most important factor, "cumulative time" in the top 5%, and "allergic disorders", "working length", "treatment type", "obturation", "PD local", "vertical percussion", "fistulation" and "pain bite" in the top 25%. Such results suggest that machine learning techniques can perhaps be a better or an alternative approach for ranking feature significance in comparison with standard statistical methods such as logistic regression models, since they better account for possible non-linear relationships between different factors and the treatment outcome.

To further refine our approach to selecting significant features, we plan to test different p-value thresholds, as the models trained on only the significant features achieved a lower performance than the models trained on the entire dataset, with a 5% drop in AUROC and a 7% drop in AP on average, suggesting that there are features which our method deemed insignificant despite them enhancing the models' ability to learn and produce accurate results. Future work will also involve the analysis of third-party datasets to investigate whether the results obtained in this study are generalizable, and to what degree the data collected by a single experienced practitioner differs from a dataset typically collected over the course of several years by a number of dentists-in-training. Additionally, we wish to incorporate various explainability techniques to better justify the models' predictions, in turn giving deeper insight into how specific factors affect the outcome of root canal treatments and better assisting a doctor in understanding and interpreting the predicted outcome.
References

[1] Muhammed Ayhan, İsmail Kayadibi, and Berkehan Aykanat. 2025. RCFLA-YOLO: a deep learning-driven framework for the automated assessment of root canal filling quality in periapical radiographs. BMC Medical Education, 25, 1, 894. doi: 10.1186/s12909-025-07483-2.
[2] Catalina Bennasar, Irene García, Yolanda Gonzalez-Cid, Francesc Pérez, and Juan Jiménez. 2023. Second opinion for non-surgical root canal treatment prognosis using machine learning models. Diagnostics, 13, 17, 2742. doi: 10.3390/diagnostics13172742.
[3] Catalina Bennasar, Antonio Nadal-Martínez, Sebastiana Arroyo, Yolanda Gonzalez-Cid, Ángel Arturo López-González, and Pedro Juan Tárraga. 2025. Integrating machine learning and deep learning for predicting non-surgical root canal treatment outcomes using two-dimensional periapical radiographs. Diagnostics, 15, 8, 1009. doi: 10.3390/diagnostics15081009.
[4] Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. 2024. Deep neural networks and tabular data: a survey. IEEE Transactions on Neural Networks and Learning Systems, 35, 6, 7499–7519. doi: 10.1109/TNNLS.2022.3229161.
[5] Berrin Çelik, Mehmet Zahid Genç, and Mahmut Emin Çelik. 2025. Evaluation of root canal filling length on periapical radiograph using artificial intelligence. Oral Radiology, 41, 1, 102–110. doi: 10.1007/s11282-024-00781-3.
[6] Wan-Ting Chang, Hsun-Yu Huang, Tzer-Min Lee, Tsen-Yu Sung, Chun-Hung Yang, and Yung-Ming Kuo. 2024. Predicting root fracture after root canal treatment and crown installation using deep learning. Journal of Dental Sciences, 19, 1, 587–593. doi: 10.1016/j.jds.2023.10.019.
[7] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, (Aug. 2016), 785–794. doi: 10.1145/2939672.2939785.
[8] Pragati Choudhari, Anand Singh Rajawat, and S. B. Goyal. 2023. Longevity recommendation for root canal treatment using machine learning. Engineering Proceedings, 59, 1, 193. doi: 10.3390/engproc2023059193.
[9] Xin Gao, Xing Xin, Zhi Li, and Wei Zhang. 2021. Predicting postoperative pain following root canal treatment by using artificial neural network evaluation. Scientific Reports, 11, 1, 17243. doi: 10.1038/s41598-021-96777-8.
[10] Chantal S. Herbst, Falk Schwendicke, Joachim Krois, and Sascha R. Herbst. 2022. Association between patient-, tooth- and treatment-level factors and root canal treatment failure: a retrospective longitudinal and machine learning study. Journal of Dentistry, 117, 103937. doi: 10.1016/j.jdent.2021.103937.
[11] Sascha Rudolf Herbst, Vinay Pitchika, Joachim Krois, Aleksander Krasowski, and Falk Schwendicke. 2023. Machine learning to predict apical lesions: a cross-sectional and model development study. Journal of Clinical Medicine, 12, 17, 5464. doi: 10.3390/jcm12175464.
[12] Yejin Hwang and Jongwoo Song. 2023. Recent deep learning methods for tabular data. Communications for Statistical Applications and Methods, 30, 2, (Mar. 2023), 215–226. doi: 10.29220/CSAM.2023.30.2.215.
[13] Rok Jurič, G. Vidmar, R. Blagus, and Janja Jan. 2024. Factors associated with the outcome of root canal treatment - a cohort study conducted in a private practice. International Endodontic Journal, 57, 4, 377–393. doi: 10.1111/iej.14022.
[14] Mohmed Isaqali Karobari, Vishnu Priya Veeraraghavan, P. J. Nagarathna, Sudhir Rama Varma, Jayaraj Kodangattil Narayanan, and Santosh R. Patil. 2025. Predictive analysis of root canal morphology in relation to root canal treatment failures: a retrospective study. Frontiers in Dental Medicine, 6. doi: 10.3389/fdmed.2025.1540038.
[15] Maurice G. Kendall and Alan Stuart. 1973. The Advanced Theory of Statistics, Volume 2: Inference and Relationship. (1st ed.). See Section 31.18. Charles Griffin, London, UK. isbn: 978-0852640111.
[16] Scott Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. arXiv: 1705.07874 [cs.AI]. https://arxiv.org/abs/1705.07874.
[17] F. Pedregosa et al. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[18] Yang Qu, Zhenzhe Lin, Zhaojing Yang, Haotian Lin, Xiangya Huang, and Lisha Gu. 2022. Machine learning models for prognosis prediction in endodontic microsurgery. Journal of Dentistry, 118, 103947. doi: 10.1016/j.jdent.2022.103947.
[19] Yang Qu, Yiting Wen, Ming Chen, Kailing Guo, Xiangya Huang, and Lisha Gu. 2023. Predicting case difficulty in endodontic microsurgery using machine learning algorithms. Journal of Dentistry, 133, 104522. doi: 10.1016/j.jdent.2023.104522.
[20] Ravid Shwartz-Ziv and Amitai Armon. 2022. Tabular data: deep learning is not all you need. Information Fusion, 81, 84–90. doi: 10.1016/j.inffus.2021.11.011.
[21] Joe H. Ward Jr. 1963. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 301, 236–244.
[22] Weiwei Wu, Surong Chen, Pan Chen, Min Chen, Yan Yang, Yuan Gao, Jingyu Hu, and Jingzhi Ma. 2024. Identification of root canal morphology in fused-rooted mandibular second molars from X-ray images based on deep learning. Journal of Endodontics, 50, 9, 1289–1297.e1. doi: 10.1016/j.joen.2024.05.014.
[23] Daniel Zwillinger and Stephen Kokoska. 2000. CRC Standard Probability and Statistics Tables and Formulae. (1st ed.). Section 14.7. Chapman & Hall/CRC, Boca Raton, FL. isbn: 978-0-8493-0026-4.

Predictive Maintenance of Machines in LABtop Production Environment

Primož Kocuvan, Department of Intelligent Systems, "Jožef Stefan" Institute, Ljubljana, Slovenia, primoz.kocuvan@ijs.si
Vinko Longar, Rudolfovo - znanstveno in tehnološko središče, Novo mesto, Slovenia, vinko.longar@rudolfovo.eu
Rok Struna, Rudolfovo - znanstveno in tehnološko središče, Novo mesto, Slovenia, rok.struna@rudolfovo.eu

Abstract

This study investigates the predictive maintenance of CNC machinery within the LABtop production environment through the deployment of iCOMOX sensor modules on a compressor and a machine spindle. Each module integrates multi-modal sensing capabilities, including vibration, magnetic field, temperature, and acoustic measurements, enabling comprehensive monitoring of machine conditions. Data was collected at five-minute intervals over a 30-day period, resulting in an unlabeled dataset due to the absence of recorded failures or anomalies. The analysis employed unsupervised machine learning techniques, specifically principal component analysis (PCA) for dimensionality reduction and clustering to identify operational patterns. PCA successfully reduced the original 11-dimensional measurement dataset to two principal components, allowing for effective visualization and grouping. The elbow and silhouette methods determined three optimal clusters for both sensors, with one cluster in each case identified as a potential outlier. Results suggest that dense clusters represent normal operation, while outlier clusters may indicate measurement errors or emerging machine faults. Although supervised learning could not yet be applied, future work will integrate fault-labeled data to enable robust predictive maintenance models.
Keywords

predictive maintenance, PCA method, production environment, silhouette analysis, elbow method

1 Introduction

The increasing complexity of modern production systems demands advanced approaches to machine maintenance in order to minimize downtime, reduce costs, and ensure consistent product quality. Traditional maintenance strategies, such as corrective or preventive maintenance, often fail to provide early warnings of failures and may result in either excessive servicing or unexpected breakdowns. Predictive maintenance, by contrast, leverages sensor data and machine learning techniques to detect patterns, identify anomalies, and forecast potential failures before they occur. This approach not only enhances operational efficiency but also extends the lifetime of critical equipment.

Within the LABtop production environment (consisting of multiple machines in sequence, mostly drilling and cutting machines), predictive maintenance has been explored through the integration of advanced multi-sensor monitoring solutions. For this purpose, the public research institute Rudolfovo implemented iCOMOX sensor modules on both the compressor and the spindle of a CNC machine. Each iCOMOX module integrates several sensing elements (vibration, magnetic field, temperature, and acoustic measurements), providing a rich dataset suitable for machine learning-based condition monitoring.

The collected data were acquired over a continuous 30-day period at five-minute intervals. Since no machine failures, temperature anomalies, or bearing defects were recorded during this time, the dataset lacked diagnostic labels and was therefore treated as unlabeled. To address this, unsupervised learning methods were employed to uncover latent structures in the data. Principal component analysis (PCA) was used to reduce the dimensionality of the dataset, while clustering methods were applied to identify patterns and potential anomalies in machine operation.

The aim of this study is to evaluate the feasibility of unsupervised learning methods in predictive maintenance for industrial equipment, specifically under conditions where fault-labeled data are unavailable. By analyzing the clustering behavior of sensor signals, this work provides insights into normal operating regimes and potential deviations that may correspond to early indicators of faults or measurement errors. Future work will incorporate supervised learning techniques once labeled fault data become available, enabling the development of robust predictive models.
2 Related Work

The field of predictive maintenance (PdM) has advanced considerably, with strong emphasis on unsupervised learning methods for anomaly detection and health assessment when labeled failure data are unavailable. PdM has been shown to significantly reduce maintenance costs, decrease unexpected downtime, and enhance equipment reliability [1]. Multi-sensor monitoring platforms such as iCOMOX have emerged as versatile tools for industrial condition monitoring. These devices integrate vibration, magnetic field, temperature, and acoustic sensors into a compact, industrial-grade package capable of edge analytics and cloud integration [2–5]. Such systems enable continuous monitoring of machine health and support the implementation of predictive maintenance strategies in Industry 4.0 environments.

From a methodological perspective, unsupervised learning techniques, such as principal component analysis (PCA) for dimensionality reduction, clustering, and anomaly detection, are widely applied for exploratory data analysis. A comprehensive survey highlights the breadth and maturity of these techniques across domains [6]. Clustering methods including k-means, DBSCAN, and OPTICS are instrumental in grouping operational states and unveiling deviations that may signify incipient failures [7].

Hybrid methods combining PCA with clustering have proven effective in enhancing fault detection capabilities. For example, a railcar health monitoring system employing DBSCAN clustering with PCA achieved a fault detection accuracy of 96.4% [8]. Similarly, kernel PCA has been applied to construct health indices for unsupervised prognostics [9]. In compressor maintenance, incorporating clustering-derived features into supervised classifiers improved predictive accuracy by 4.9% and reduced training time by 23% [10]. Several studies also propose frameworks that integrate unsupervised learning with IoT and Big Data infrastructures, enabling scalable predictive maintenance solutions across industrial environments [11]. These works demonstrate the feasibility of extracting actionable health indicators from unlabeled sensor data and underscore the critical role of advanced analytics in industrial condition monitoring.
3 Methodology

3.1 Data Acquisition

Two iCOMOX sensor modules were installed on critical machine components within the LABtop production system: the spindle of a CNC machine and the air compressor. Each sensor module integrates vibration, magnetic field, temperature, and acoustic sensing elements, thereby providing multimodal monitoring capabilities. Data were sampled at 5-minute intervals over a continuous 30-day observation period, resulting in an unlabeled dataset due to the absence of recorded failures, anomalies, or maintenance events.

3.2 Data Preprocessing

Raw signals from the iCOMOX modules were aggregated into feature vectors, yielding an 11-dimensional dataset. Standard preprocessing steps included normalization of features to remove scaling effects, filtering to reduce noise (particularly in the acoustic and vibration signals), and synchronization of the multimodal sensor streams.

3.3 Dimensionality Reduction

To facilitate visualization and clustering, dimensionality reduction was performed. Multiple techniques (e.g., t-SNE, Isomap, and autoencoders) were evaluated; however, principal component analysis (PCA) demonstrated superior stability and interpretability. The data were reduced from 11 to 2 principal components, which captured the majority of the variance and allowed an effective 2D representation.

3.4 Clustering Analysis

Clustering was applied to the reduced dataset to uncover hidden structures and potential anomalies. The elbow method (Figure 1) and the silhouette coefficient (Figure 2) were employed to determine the optimal number of clusters. Based on these metrics, three clusters were identified for each sensor dataset. The analysis was conducted separately for the two sensor modules (iCOMOX1 on the spindle and iCOMOX2 on the compressor). Outlier clusters were identified and highlighted for subsequent interpretation.

Figure 1: Elbow method.

Figure 2: Silhouette coefficient.
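The text does not name the specific clustering algorithm, but the elbow method implies a k-means-style inertia criterion; the following is a minimal sketch of the pipeline, assuming k-means and scikit-learn, with a synthetic matrix standing in for the 11-dimensional iCOMOX feature vectors.

# PCA to 2 components followed by k-means; inertia (elbow) and silhouette
# scores guide the choice of k. The feature matrix is a synthetic stand-in.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(8640, 11)        # ~30 days at 5-minute intervals

X_scaled = StandardScaler().fit_transform(X)   # normalization step
X_2d = PCA(n_components=2).fit_transform(X_scaled)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_2d)
    print(f"k={k}  inertia={km.inertia_:.1f}  "
          f"silhouette={silhouette_score(X_2d, km.labels_):.3f}")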
4 Results

PCA successfully compressed the 11-dimensional dataset into a two-dimensional space. The first two principal components explained the majority of the variance (>80%), enabling effective visualization of patterns in machine behavior. Figure 3 illustrates the scatter plot for iCOMOX1: three distinct clusters are visible, with Cluster 1 (highlighted in orange) showing divergence from the main operating regime. Figure 4 presents the scatter plot for iCOMOX2, where Cluster 2 (highlighted in green) emerges as an outlier relative to the normal operating clusters.

Figure 3: PCA scatter plot for iCOMOX1.

Figure 4: PCA scatter plot for iCOMOX2.

Clusters containing densely grouped points correspond to normal operating conditions of the CNC spindle and compressor. The outlier clusters, however, represent either:
• sensor noise or measurement anomalies (e.g., transient vibration spikes or acoustic distortions), or
• incipient machine faults, which could not be conclusively confirmed due to the absence of ground-truth failure data.

In summary:
• PCA combined with clustering effectively distinguished between normal operation and anomalous behavior.
• Both sensor datasets (iCOMOX1 and iCOMOX2) revealed three clusters, with one consistently standing out as an outlier.
• Without diagnostic labels, these outliers cannot be definitively classified as machine faults, but their presence highlights potential events of interest for further investigation.
• The results validate the feasibility of unsupervised learning for predictive maintenance in environments lacking labeled fault data.

5 Discussion

The findings from this study demonstrate the viability of unsupervised learning methods, in particular PCA and clustering, for analyzing unlabeled condition-monitoring data in industrial environments. By reducing an 11-dimensional dataset to two principal components, it was possible to visualize operational states and uncover outlier clusters that may correspond to anomalous machine behavior. This outcome aligns with previous work emphasizing the effectiveness of dimensionality reduction and clustering in predictive maintenance tasks where labeled fault data are limited or unavailable [6, 8, 9].

The observation of three clusters for both the spindle (iCOMOX1) and the compressor (iCOMOX2) highlights the presence of distinct operating regimes within the LABtop system. The fact that one cluster consistently emerged as an outlier suggests potential precursors to faults or, alternatively, sensor-related anomalies. While conclusive interpretation requires diagnostic labels, the clustering nevertheless provides an essential first step toward identifying patterns that can later inform supervised learning models once fault data become available.

Compared to related studies, the present results confirm trends reported in railcar health monitoring [8] and compressor maintenance [10], where unsupervised approaches successfully revealed structural patterns in the absence of labeled datasets. The advantage of PCA lies in its ability to preserve variance while simplifying visualization, which proved more effective than the alternative reduction methods considered here (e.g., t-SNE or Isomap). This echoes findings from other industrial applications where PCA has served as a reliable baseline for anomaly detection [9].

An important implication is that multi-sensor platforms such as iCOMOX provide the richness of data required for advanced analytics. The combination of vibration, acoustic, magnetic field, and temperature measurements enables the detection of subtle variations that might not be visible through single-sensor monitoring. As highlighted in prior work [2–5], the integration of multimodal data streams significantly strengthens predictive maintenance frameworks by improving robustness and interpretability.

Nevertheless, this study also underscores the limitations of unsupervised learning. Without failure labels, it is not possible to conclusively distinguish between anomalies arising from true machine faults and those caused by sensor noise or environmental conditions. This limitation has been widely noted in the literature [6, 11]. Future work should therefore focus on generating labeled datasets through controlled fault injection or long-term monitoring until natural failures occur. Such datasets would enable supervised and hybrid learning approaches, which have shown promise in achieving higher predictive accuracy and more actionable decision support [1, 10].

In summary, the present analysis validates the potential of unsupervised learning for predictive maintenance in data-scarce environments. While preliminary, the results establish a methodological foundation for extending condition monitoring at LABtop to more advanced machine learning pipelines, ultimately contributing to early fault detection, reduced downtime, and optimized maintenance planning.
6 Future Work

The present study establishes a foundation for predictive maintenance at LABtop using unsupervised learning methods; however, several directions remain open for further investigation.

First, the absence of diagnostic labels limited this study to exploratory clustering and anomaly detection. Future work will prioritize the collection of labeled datasets through either (i) controlled fault injection experiments on non-critical test equipment or (ii) extended operational monitoring until natural failures occur. The availability of labeled fault data will enable the application of supervised learning and hybrid approaches, combining clustering-derived features with classification models to improve fault detection accuracy and reliability, as demonstrated in recent compressor studies [10].

Second, while PCA provided an effective means of dimensionality reduction, more advanced techniques such as kernel PCA, autoencoders, and variational autoencoders should be investigated. These methods may capture nonlinear relationships in the sensor data that PCA cannot, potentially yielding richer health indicators and more precise separation of operational regimes [9].

Third, the present work focused primarily on offline analysis. Future research should extend to real-time streaming analytics, leveraging the edge-processing capabilities of the iCOMOX platform [2–5]. Deploying online anomaly detection models would allow immediate identification of abnormal conditions and facilitate proactive maintenance decisions.

Fourth, integration with IoT and cloud-based platforms remains a key step toward scalable deployment. By embedding unsupervised learning models into Industry 4.0 architectures, LABtop can benefit from centralized monitoring, cross-machine comparisons, and fleet-level anomaly detection, as highlighted in existing frameworks [11].

Finally, interpretability remains an essential concern. Future efforts will explore explainable AI (XAI) techniques to provide actionable insights into why certain clusters or anomalies are flagged, thereby enhancing operator trust and enabling domain experts to validate and refine the models.

Acknowledgments

The research was supported by the DIGITOP project, which is funded by the Ministry of Higher Education, Science and Innovation of Slovenia, the Slovenian Research and Innovation Agency, and EU NextGenerationEU under Grant TN-06-0106. We thank prof. dr. Matjaž Gams for proofreading the article and for mentorship support within the DIGITOP project.

References

[1] Abdeldjalil Benhanifia, Zied Ben Cheikh, Paulo Moura Oliveira, Antonio Valente, José Lima. Systematic review of predictive maintenance practices in the manufacturing sector. Intelligent Systems with Applications, Volume 26, 2025, Article 200501. ISSN 2667-3053. https://doi.org/10.1016/j.iswa.2025.200501
[2] RS Components. iCOMOX Intelligent Condition Monitoring Box – Product Datasheet, 2019. Available: https://docs.rs-online.com/c878/A700000007538369.pdf. Accessed 25.8.2025.
[3] EE Times Europe. Arrow introduces new Shiratech iCOMOX condition-based monitoring products, 2019. Available: https://www.eetimes.eu/press-releases/arrow-introduces-new-shiratech-icomox-condition-based-monitoring-products/. Accessed 25.8.2025.
[4] EBOM. New Shiratech iCOMOX sensor-to-cloud platform cuts time-to-market for intelligent condition monitoring, 2019. Available: https://www.ebom.com/new-shiratech-icomox-sensor-to-cloud-platform-cuts-time-to-market-for-intelligent-condition-monitoring/. Accessed 25.8.2025.
[5] Sensor+Test. iCOMOX – Condition Monitoring Box, 2023. Available: https://www.sensor-test.de/assets/Fairs/2023/ProductNews/PDFs/iCOMOX.pdf. Accessed 25.8.2025.
[6] K. Taha. "Semi-supervised and un-supervised clustering: A review and experimental evaluation," Information Systems, vol. 114, p. 102178, 2023. doi: 10.1016/j.is.2023.102178.
[7] GopenAI Blog. Predictive maintenance with unsupervised machine learning algorithms, 2020. Available: blog.gopenai.com. Accessed 25.8.2025.
[8] M. Ejlali, E. Arian, S. Taghiyeh, K. Chambers, A. H. Sadeghi, D. Cakdi, and R. B. Handfield. "Developing Hybrid Machine Learning Models to Assign Health Score to Railcar Fleets for Optimal Decision Making," arXiv preprint arXiv:2301.08877, 2023.
[9] Z. Chen et al. "Health Index Construction Based on Kernel PCA for Equipment Prognostics," Control Engineering Practice, vol. 126, 2022.
[10] A. Salazar et al. "Unsupervised Feature Extraction for Compressor Predictive Maintenance Using Clustering and Supervised Learning," arXiv, 2024.
[11] Nota, Giancarlo; Nota, Francesco; Toro Lazo, Alonso; Nastasia, Michele. (2024). A framework for unsupervised learning and predictive maintenance in Industry 4.0. International Journal of Industrial Engineering and Management, 15, 304–319. doi: 10.24867/IJIEM-2024-4-365.

Machine Learning for Cutting Tool Wear Detection: A Multi-Dataset Benchmark Study Toward Predictive Maintenance

Žiga Kolar, Jožef Stefan Institute, Ljubljana, Slovenia, ziga.kolar@ijs.si
Thibault Comte, Universite Paris-Saclay, Paris, France, thibault.comte@universite-paris-saclay.fr
Yanny Hassani, Universite Paris-Saclay, Paris, France, yanny.hassani@universite-paris-saclay.fr
Hugues Louvancour, Universite Paris-Saclay, Paris, France, hugues.louv@gmail.com
Jože Ravničan, UNIOR Kovaška industrija d.d., Zreče, Slovenia, joze.ravnican@unior.com
Matjaž Gams, Jožef Stefan Institute, Ljubljana, Slovenia, matjaz.gams@ijs.si

Abstract

This student paper investigates the use of machine learning techniques to automate the detection of tool wear in cutting machines, replacing manual monitoring with intelligent, data-driven solutions. Although the proposed ML methods are standard in predictive maintenance, our contribution lies in providing a systematic multi-dataset benchmark tailored for direct transfer to industrial environments. This establishes a reproducible baseline before deploying and validating on real UNIOR data. As part of the project, and in anticipation of collecting real-world accelerometer data from industrial machines, we conducted a series of benchmarking experiments using five publicly available datasets that include accelerometer and audio signals under various wear-related conditions. The datasets cover a variety of industrial contexts and labeling schemes, allowing us to assess different preprocessing strategies and classification models such as Random Forests, 1D Convolutional Neural Networks, and Long Short-Term Memory networks. Our best results, an F1-score of 0.9949, were achieved using an LSTM model on a vibration dataset simulating fault conditions. These findings highlight the strong potential of AI for predictive maintenance and lay the groundwork for transferring the developed pipelines to the system once real data become available. Future work will focus on real-time wear detection and model deployment within live production environments.

Keywords

accelerometer, neural networks, machine learning, cutting tool
1 Introduction

This student paper presents the work carried out by Thibault Comte, Hugues Louvancour, and Yanny Hassani on the UNIOR project, under the mentorship of Žiga Kolar and prof. dr. Matjaž Gams for the Jožef Stefan Institute, and Jože Ravničan for Unior. The objective of the UNIOR project is to detect when a cutting machine becomes worn out by analyzing sensor signals, specifically accelerometer data along the x, y, and z axes. An accelerometer is mounted on the cutting machine to monitor the vibrations occurring during the cutting process. Currently, the detection of wear is performed manually by a human operator. By leveraging artificial intelligence (AI) and machine learning (ML), this process can be automated, making it both easier and more efficient.

While awaiting the company to complete the necessary paperwork and to acquire and install the accelerometer on the cutting machine, we identified similar publicly available datasets and conducted several machine learning experiments using them.

2 Related Work

This section briefly surveys recent research on the use of artificial intelligence (AI) techniques for tool wear monitoring in manufacturing processes such as milling, turning, and drilling. Munaro et al. [2] provide a systematic review of 77 studies, contrasting offline and online monitoring methods. Online approaches leveraging sensor data (such as force, vibration, acoustic emission, and power) are enhanced by AI models like SVMs, ANNs, CNNs, and LSTMs, offering accuracies above 90% and industrial relevance. Sieberg et al. [5] demonstrate CNN-based classification of wear mechanisms from SEM images, achieving 73% test accuracy. They emphasize dataset balance and magnification consistency as critical challenges. Shah et al. [4] argue for ML's superiority over physics-based models in wear prediction, underscoring ANNs' predictive strength when supplied with high-quality data and standardized evaluation methods. Recent studies also explore multimodal sensor fusion, combining accelerometer, acoustic, and force signals to improve robustness [8]. Specifically, transfer learning has been shown effective for adapting models trained on laboratory data to industrial machines [8].

Unlike previous reviews such as Munaro et al. [2], which survey the field, our work provides a systematic multi-dataset experimental comparison across three different sensor modalities (accelerometer, vibration, audio) using standardized pipelines. This benchmarking is not only descriptive but forms the basis for industrial transfer to UNIOR's production line, bridging academic datasets with real machine applications.

3 Datasets

This section describes the five datasets that were identified: four containing accelerometer data and one featuring audio recordings.

3.1 Bosch CNC Machining Dataset

The Bosch CNC Machining dataset consists of real-world industrial vibration data collected from a brownfield CNC milling machine. Acceleration was measured using a tri-axial Bosch CISS sensor mounted inside the machine, recording the X, Y, and Z axes at a sampling rate of 2 kHz.
Both normal and anomalous data were collected across six distinct timeframes, each spanning six months between October 2018 and August 2021, with appropriate labeling. Data were collected from three distinct CNC milling machines, each executing 15 processes [7]. A total of 1,702 samples were obtained, each labeled as either "good" or "bad." The distribution of labels was 95.9% good and 4.1% bad.

3.2 Cutting Tool Wear Audio Dataset

This dataset comprises 1,488 ten-second .wav audio recordings of cutting tool wear collected at two spindle speeds: 520 RPM and 635 RPM. Each audio recording is labeled as either "BASE" (machine running without cutting), "FRESH" (sharp cutting tool), "MODERATE" (moderately worn tool), or "BROKEN" (broken or fully worn tool). The "FRESH," "MODERATE," and "BROKEN" labels were specifically chosen to simulate real cutting conditions, focusing on scenarios where the machine is actively engaged in material removal. In total, the dataset includes 400 "FRESH" samples, 376 "MODERATE" samples, and 362 "BROKEN" samples across both spindle speeds, offering a nearly balanced distribution well suited for ML applications. The audio recordings had different lengths. No artificial background noise was added to the recordings. All cutting tools used were 16 mm end-mill cutters, and the workpiece material was mild steel [6].

3.3 Turning Dataset for Chatter

This dataset contains sensor signals collected from multiple cutting tests using a range of measurement devices, including two perpendicular single-axis accelerometers, a tri-axial accelerometer, a microphone, and a laser tachometer. Both raw sensor data and processed, labeled data from one channel of the tri-axial accelerometer are provided. Four labels were used: no-chatter, intermediate chatter, chatter, and unknown. The dataset contains a total of 117 signals, with the following label distribution: 51 labeled as no-chatter, 19 as intermediate chatter, 22 as chatter, and 25 as unknown. Data were collected under four distinct cutting configurations, defined by varying the stickout distance, i.e. the distance from the heel of the boring rod to the back face of the tool holder. The four stickout distances used were 5.08 cm (2 inches), 6.35 cm (2.5 inches), 8.89 cm (3.5 inches), and 11.43 cm (4.5 inches) [8].

3.4 UCI Accelerometer Dataset

To simulate motor vibrations, a 12 cm Akasa AK-FN059 Viper cooling fan was modified by attaching weights to its blades, and an MMA8452Q accelerometer was mounted to capture vibration data. An artificial neural network was then used to predict motor failure time based on this data. Three distinct vibration scenarios were generated by varying the placement of two weight pieces on the fan blades: (1) red, the normal configuration, with weights on neighboring blades; (2) blue, the perpendicular configuration, with weights on blades 90° apart; and (3) green, the opposite configuration, with weights on opposite blades. For each of the three weight configurations, vibration data was collected every 20 ms over a 1-minute interval per speed, resulting in 3,000 records per speed. In total, the dataset contains 153,000 vibration records from the simulation model [3].

3.5 Vibrations Dataset

This dataset contains vibrational data collected to support early fault diagnosis in machinery. The data was gathered using an SG-Link tri-axial accelerometer sensor (by MICROSTRAIN Corporation) at a sampling rate of 679 samples per second for each of the three axes: axial (z), horizontal (x), and vertical (y). Experiments were conducted in the Mechanical Vibration Laboratory at the Mechanical Engineering Department of the University of Engineering and Technology (UET), Taxila. The setup simulated four distinct machine conditions: normal, cracking, offset pulley, and wear states, using a test rig designed for fault simulation [1].

4 Methodology and Results

This section outlines the methodology used for each dataset, focusing on multiclass classification. Various preprocessing techniques and machine learning algorithms were applied.

4.1 Bosch CNC Machining Dataset

The Bosch CNC Machining Dataset contains 95.9% good signals and 4.1% bad signals. The objective was to develop a binary classification model that outperforms a naive baseline, which achieves 95.9% accuracy simply by always predicting a signal as good.

Two approaches were tested on this dataset. The first applied random undersampling, which balances the class distribution by randomly removing samples from the majority class while leaving the minority class unchanged. Since the majority class accounted for 95.9% of the data, this step was essential to prevent the model from defaulting to majority-class predictions. After applying the random undersampling, a Random Forest model was used for binary classification. This method achieved 99% accuracy under 5-fold cross-validation, a 3.1% improvement over the naive baseline model.

Different preprocessing strategies were necessary due to differences in data formats, sampling rates, and class balance across the datasets. For example, in the Bosch dataset, random undersampling was applied only on the training folds during 5-fold CV to avoid information leakage, as illustrated in the sketch below.
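The paper does not include its pipeline code; one common way to keep the sampler fit-time-only is an imbalanced-learn pipeline, sketched below with a synthetic stand-in for the Bosch data and assumed hyperparameters.

# Random undersampling applied only inside the training folds of 5-fold CV,
# via an imbalanced-learn pipeline; shapes and settings are illustrative.
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: ~95.9% "good" vs ~4.1% "bad", as in the Bosch data.
X, y = make_classification(n_samples=1702, weights=[0.959], random_state=0)

pipe = Pipeline([
    ("undersample", RandomUnderSampler(random_state=0)),  # applied at fit time only
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

# The sampler never sees the validation fold, avoiding information leakage.
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print(scores.mean())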
4 Methodology and Results
This section outlines the methodology used for each dataset, focusing on multiclass classification. Various preprocessing techniques and machine learning algorithms were applied. Different preprocessing strategies were necessary due to differences in data formats, sampling rates, and class balance across the datasets. For example, in the Bosch dataset, random undersampling was applied only on the training folds during 5-fold cross-validation to avoid information leakage.

4.1 Bosch CNC Machining Dataset
The Bosch CNC Machining dataset contains 95.9% good signals and 4.1% bad signals. The objective was to develop a binary classification model that outperforms a naive baseline, which achieves 95.9% accuracy simply by always predicting a signal as good.

Two approaches were tested on this dataset. The first applied random undersampling, which balances the class distribution by randomly removing samples from the majority class while leaving the minority class unchanged. Since the majority class accounted for 95.9% of the data, this step was essential to prevent the model from defaulting to majority-class predictions. After undersampling, a Random Forest model was used for binary classification. This method achieved 99% accuracy on 5-fold cross-validation, a 3.1% improvement over the naive baseline.

In the second approach, features were initially extracted using two 1D convolutional layers followed by two max pooling layers. To augment the data, random Gaussian noise was added to the signals, effectively doubling the size of the training set. A binary classification model using Random Forest was then trained on this augmented dataset. This model achieved an accuracy of 0.996 under 5-fold cross-validation, outperforming the naive baseline by 3.7%.

McNemar's test was applied between competing models on each dataset. Significant differences (p < 0.05) were observed between the CNN and Random Forest on the Bosch dataset, confirming that the improvements are not due to random variation.
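The leakage-safe undersampling protocol can be made concrete with a short sketch. The snippet below is a minimal illustration, not the project code: it uses the imbalanced-learn Pipeline, whose samplers are applied only when fitting, so each test fold keeps its original class imbalance; the feature matrix and labels are synthetic placeholders.

import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1702, 64))             # placeholder features
y = (rng.random(1702) < 0.041).astype(int)  # placeholder labels, ~4.1% "bad"

# The imblearn Pipeline resamples only during fit, i.e. on the training
# folds; each test fold keeps its original, imbalanced distribution.
model = Pipeline([
    ("undersample", RandomUnderSampler(random_state=42)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"5-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")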
4.2 Cutting Tool Wear Audio Dataset
The Cutting Tool Wear Audio dataset contains 400 "FRESH", 376 "MODERATE", and 362 "BROKEN" samples across two spindle speeds, requiring a multi-class classification approach. Since the signals varied in length, we first identified the longest signal (48,000 samples) and zero-padded shorter signals to match this length; to improve model accuracy, this maximum length was later reduced. The model architecture included two 1D convolutional layers and two 1D max pooling layers to reduce the dimensionality of the data while preserving essential features. The output from these layers served as input to a feature selection algorithm, which identified the 96 most relevant features out of a total of 2,048. These selected features were then used by a Random Forest classifier to predict the final label. The best model for this dataset achieved 0.9279 (+/- 0.01) accuracy and a 0.9279 F1-score on 5-fold cross-validation. Results (precision, recall, F1-score, and accuracy) are presented in Figure 1.

Figure 1: 5-fold cross-validation report and confusion matrix for the Cutting Tool Wear Audio dataset.

4.3 Turning Dataset for Chatter
Since each signal varies in length and can be quite long, an approach based on extracting time-domain and frequency-domain features was implemented. This method preserves essential information from the original signals while significantly reducing dimensionality, making the data more suitable for ML algorithms.

The approach combines signal segmentation and frequency-domain feature extraction to summarize the spectral characteristics of a time-series signal. First, the input signal is divided into overlapping or non-overlapping fixed-size windows using a sliding-window technique, where each segment is 10,000 samples long and the shift between consecutive segments is determined by the step size, in this case 5,000 samples. This allows for localized analysis of signal dynamics over time. Next, we applied the Fast Fourier Transform (FFT) to each segment, converting the time-domain signal into its frequency-domain representation. The magnitude spectrum is computed for each segment, and the spectral magnitudes are then averaged across all segments to obtain a single, representative frequency-domain feature vector. This results in a compact yet informative summary that captures the dominant frequency components of the entire signal while accounting for temporal variation through segmentation.

Furthermore, 11 additional features were extracted from the raw signal, including the mean, standard deviation, minimum, maximum, and median of the frequency values. These features capture the signal's central tendency and variability, providing a statistical summary of its frequency content. The 25th and 75th percentiles further quantify the signal's interquartile range, highlighting its variability and robustness to outliers. Root mean square (RMS) provides a measure of the signal's overall power. Skewness and kurtosis describe the asymmetry and peakedness of the distribution, respectively, offering insights into the signal's shape beyond basic statistics. Finally, zero crossings count the number of times the signal crosses the zero axis, serving as an indicator of frequency content and signal complexity. Together, these features form a rich representation for classification tasks involving time-frequency signals.

In total, there were 268 features (257 FFT features and 11 additional features) and 117 samples. A feature selection technique was applied to further reduce the number of features: the 140 best features were selected and used as input to a Random Forest classifier. The best model for this dataset achieved 0.80 (+/- 0.06) accuracy and a 0.7588 F1-score on 5-fold cross-validation. Results (precision, recall, F1-score, and accuracy) are depicted in Figure 2.

Figure 2: 5-fold cross-validation report and confusion matrix for the Turning Dataset for Chatter.
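The segmentation-and-FFT extraction described above can be sketched in a few lines. This is an illustrative reconstruction rather than the authors' code: how the averaged spectrum was reduced to exactly 257 FFT values is not specified, so the truncation below is an assumption, and since the text is ambiguous about whether the 11 statistics are computed on the raw signal or its spectrum, the sketch uses the raw signal.

import numpy as np
from scipy.stats import skew, kurtosis

def chatter_features(signal, window=10_000, step=5_000, n_fft=257):
    # Sliding-window segmentation (overlapping, since step < window).
    segments = np.array([signal[s:s + window]
                         for s in range(0, len(signal) - window + 1, step)])
    # Magnitude spectrum per segment, averaged across all segments;
    # keeping the first n_fft bins is an assumed reduction to 257 values.
    avg_spectrum = np.abs(np.fft.rfft(segments, axis=1)).mean(axis=0)[:n_fft]
    stats = np.array([
        signal.mean(), signal.std(), signal.min(), signal.max(),
        np.median(signal),
        np.percentile(signal, 25), np.percentile(signal, 75),
        np.sqrt(np.mean(signal ** 2)),          # root mean square
        skew(signal), kurtosis(signal),
        np.sum(np.diff(np.sign(signal)) != 0),  # zero crossings
    ])
    return np.concatenate([avg_spectrum, stats])  # 257 + 11 = 268 features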
4.4 UCI Accelerometer Dataset
This method implements a complete machine learning pipeline for classifying time-series accelerometer data using features extracted from both the time and frequency domains. Data are first loaded from a CSV file, where each row contains an activity label and raw X, Y, and Z accelerometer readings. The signal is segmented into non-overlapping windows of fixed size (50 samples, corresponding to 1 second at 50 Hz), and only windows with consistent activity labels are retained for supervised learning.

Next, time-domain and frequency-domain features were extracted from each window. Time-domain features include basic statistics (mean, standard deviation, min, max, median), RMS, peak-to-peak range, skewness, kurtosis, zero-crossing rate, signal energy, and crest factor. Frequency-domain features are extracted via FFT and include spectral centroid, spectral spread, peak frequency, and energy in predefined low (0–5 Hz) and high (10–25 Hz) frequency bands.

This feature-rich representation is passed through a machine learning pipeline that includes feature scaling, univariate feature selection (SelectKBest with the ANOVA F-statistic), and classification using a Random Forest classifier. The best model for this dataset achieved 0.972 (+/- 0.008) accuracy and a 0.97 F1-score on 5-fold cross-validation. Results (precision, recall, F1-score, and accuracy) are depicted in Figure 3.

Figure 3: 5-fold cross-validation report and confusion matrix for the UCI Accelerometer dataset.

4.5 Vibrations Dataset
In this method, the time-series data were segmented into overlapping windows of fixed length 226. A total of 168,372 samples were generated, providing a sufficient amount of data for training deep learning models. A Long Short-Term Memory (LSTM) neural network was chosen due to its effectiveness in handling sequential data. The network architecture consisted of two LSTM layers with 128 and 64 units, respectively, along with two dropout layers incorporated to reduce the risk of overfitting and improve generalization. This method achieved the best performance to date, reaching an accuracy of 0.9948 (+/- 0.005) in 5-fold cross-validation and an F1-score of 0.9949. The results are presented in Figure 4.

Figure 4: 5-fold cross-validation report and confusion matrix for the Vibrations dataset.
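A minimal Keras sketch of the described network is given below. The layer sizes (128 and 64 LSTM units, two dropout layers) and the window length of 226 follow the text; the dropout rate, optimizer, and training schedule are not reported and are placeholders, as is the assumption of three input channels (the tri-axial sensor) and four output classes (the four simulated machine conditions).

import tensorflow as tf

n_timesteps, n_channels, n_classes = 226, 3, 4  # window length; tri-axial input; four conditions

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_timesteps, n_channels)),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.Dropout(0.2),               # rate assumed
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dropout(0.2),               # rate assumed
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=20, validation_split=0.1)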
This study has several limitations. First, the datasets used are publicly available and may not fully capture the variability of industrial machining environments. Second, in some cases class balance was artificially enforced via undersampling, which could affect generalizability. Third, we recognize that the lack of direct industrial validation is a current limitation. However, our pipelines were designed for immediate deployment once the company's accelerometers are installed, ensuring direct continuity from these benchmark studies to industrial application. This study therefore serves as a reproducible foundation rather than a final industrial deployment. Partial validation experiments with UNIOR's machines are planned as the next project stage.

5 Conclusion
This student paper explored machine learning for automated cutting tool wear detection. Using five public datasets and models such as Random Forests, CNNs, and LSTMs, we achieved strong performance, notably a 0.9949 F1-score on the Vibrations dataset. These benchmarks highlight ML's potential for predictive maintenance and provide ready-to-deploy pipelines for future industrial data. Future work will focus on validating the models on industrial machines, optimizing their performance, and deploying them in real time. Additionally, for ordered domains like the Cutting Tool Wear Audio dataset, misclassifications should not be penalized equally (e.g., "FRESH" -> "MODERATE" vs. "FRESH" -> "BROKEN"). Thus, future research will explore ordinal metrics, such as weighted accuracy or quadratic weighted kappa.

Acknowledgements
The authors acknowledge funding support from the company UNIOR for the GREMO LIGHTWEIGHT project. The authors also acknowledge funding from the Slovenian Research and Innovation Agency (ARIS), Grant PR-10495, and basic core funding P2-0209.

References
[1] Muhammad Umar Khan, Muhammad Atif Imtiaz, Sumair Aziz, Zeeshan Kareem, Athar Waseem, and Muhammad Ammar Akram. 2019. System design for early fault diagnosis of machines using vibration features. In 2019 International Conference on Power Generation Systems and Renewable Energy Technologies (PGSRET). IEEE, 1–6.
[2] Roberto Munaro, Aldo Attanasio, and Antonio Del Prete. 2023. Tool wear monitoring with artificial intelligence methods: a review. Journal of Manufacturing and Materials Processing, 7, 4, 129. doi: 10.3390/jmmp7040129.
[3] Gustavo Scalabrini Sampaio, Arnaldo Rabello de Aguiar Vallim Filho, Leilton Santos da Silva, and Leandro Augusto da Silva. 2019. Prediction of motor failure time using an artificial neural network. Sensors, 19, 19, 4342.
[4] Raj Shah, Nikhil Pai, Gavin Thomas, Swarn Jha, Vikram Mittal, Khosro Shirvni, and Hong Liang. 2024. Machine learning in wear prediction. Journal of Tribology, 147, 4, (Nov. 2024), 040801. doi: 10.1115/1.4066865.
[5] Philipp Maximilian Sieberg, Dzhem Kurtulan, and Stefanie Hanke. 2022. Wear mechanism classification using artificial intelligence. Materials, 15, 7, 2358. doi: 10.3390/ma15072358.
[6] Nachiket Soni, Amit Kumar, and Hardik Patel. 2023. Acoustic analysis of cutting tool vibrations of machines for anomaly detection and predictive maintenance. In 2023 IEEE 11th Region 10 Humanitarian Technology Conference (R10-HTC). IEEE, 43–46.
[7] Mohamed-Ali Tnani, Michael Feil, and Klaus Diepold. 2022. Smart data collection system for brownfield CNC milling machines: a new benchmark dataset for data-driven machine monitoring. Procedia CIRP, 107, 131–136.
[8] Melih C. Yesilli, Firas A. Khasawneh, and Andreas Otto. 2020. On transfer learning for chatter detection in turning using wavelet packet transform and ensemble empirical mode decomposition. CIRP Journal of Manufacturing Science and Technology, 28, 118–135.
Extracting Structured Information About Food Loss and Waste Measurement Practices Using Large Language Models: A Feasibility Study

Junoš Lukan (junos.lukan@ijs.si), Maori Inagawa (maoriinagawa@keio.jp), Mitja Luštrek (mitja.lustrek@ijs.si)
Jožef Stefan Institute, Department of Intelligent Systems; Jožef Stefan International Postgraduate School; Ljubljana, Slovenia

Abstract
The Waste Quantification Solutions to Limit Environmental Stress (WASTELESS) project aims to develop and test innovative tools and methodologies for measuring and monitoring food loss and waste (FLW). A key objective is to create a decision support toolbox that helps food actors across the entire supply chain, including consumers, select the most suitable method for measuring and monitoring FLW. To help with this decision, existing, already tested FLW measurement practices can be consulted, which are currently published as short documents. In this work, we show how the data about them can be extracted using large language models (LLMs). Additionally, we propose how these data can be structured and represented as an ontology. With this process, we can help users find relevant data without needing to browse through many documents.

Keywords
food loss and waste, large language models, data extraction, ontology

1 Introduction
The project Waste Quantification Solutions to Limit Environmental Stress (WASTELESS; https://wastelesseu.com/) is designed to develop and test a mix of innovative tools and methodologies for food loss and waste (FLW) measurement and monitoring. One of its tasks is to create a decision support toolbox [10]. It should help all profiles of food actors, i.e. across the whole food supply chain (FSC), including consumers, who want to measure and monitor their FLW, to select the most appropriate method.

There have been several attempts to harmonise FLW measurement methods. The Food loss and waste accounting and reporting standard (FLW Standard; [7]) stands out as a well-structured attempt. It was produced by the Food Loss & Waste Protocol, a multi-stakeholder partnership with involvement by the Food and Agriculture Organization of the United Nations (FAO) and the World Resources Institute, among others.

The FLW Standard establishes the scope of an FLW inventory. Furthermore, it provides definitions of boundary elements and recommendations for classifications that should be used to describe them. For classifying food into categories, it suggests the FAO's and World Health Organization's Codex General Standard for Food Additives [5]. We might add that, alternatively, Annex II of "Regulation (EC) No 1333/2008 of the European Parliament and of the Council" [3] can also be used. For lifecycle stage, the International Standard Industrial Classification of All Economic Activities (ISIC) or the Statistical Classification of Economic Activities in the European Community (NACE) [4] should be used. Finally, for geographical boundary classification, UN region or country codes should be used, or the Nomenclature of Territorial Units for Statistics (NUTS) [2] in the European context.

The FLW Standard also provides guidelines on how to decide which quantification method to use for FLW measurement or monitoring. The FLW Quantification method ranking tool was prepared by the Waste and Resources Action Programme (WRAP) and includes eleven questions. Most of the questions serve as exclusion criteria. For example, a negative response to either "Do you have existing records that could be used for quantifying FLW?" (Q9) or "Do you have access to those records?" (Q10) excludes the method of records. As another example, a negative response to "Can you get direct access to the FLW being quantified?" (Q3) immediately excludes direct weighing, counting, assessing volume, and waste composition analysis, since these all need such access to be feasible. These questions encapsulate the most important characteristics by which these methods are distinguished from one another and lend themselves to particular needs of users.
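To make the exclusion logic concrete, the sketch below encodes the two examples from the text (Q3 and Q9/Q10) as simple rules; the method names and the overall structure are illustrative, not WRAP's actual implementation.

def feasible_methods(answers: dict) -> set:
    """Return quantification methods not excluded by the given answers."""
    methods = {"records", "direct weighing", "counting",
               "assessing volume", "waste composition analysis"}
    # Q9/Q10: no usable records, or no access to them, excludes records.
    if not (answers.get("Q9") and answers.get("Q10")):
        methods.discard("records")
    # Q3: no direct access to the FLW being quantified excludes all
    # methods that require physical access to the waste.
    if not answers.get("Q3"):
        methods -= {"direct weighing", "counting", "assessing volume",
                    "waste composition analysis"}
    return methods

print(feasible_methods({"Q3": False, "Q9": True, "Q10": True}))
# -> {'records'}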
In this paper, we build upon this work by proposing a unified structure through which to describe various practices of FLW measurement and reduction. This is a step towards a systematic representation of these data that can enable further analysis of the practices thus described, as well as their comparison and validation.

2 Methods
We first outline the structure of the desired shortened descriptions, report on the process of using large language models (LLMs) to automatically extract them, and finally evaluate the results by comparing them to human annotations.

2.1 Structure of Extracted Information
Based on the previously mentioned FLW Quantification method ranking tool and domain-expert knowledge, we determined the following characteristics of FLW measurement methods and practices to be of the most importance:

(1) FLW method. FLW measurement and reduction practices might describe very specific technologies and techniques. To make the information more general, we decided to classify each as one of ten categories of quantification methods. These are described in detail in the Supplement [8] to the FLW Accounting and Reporting Standard [7].

(2) Region of interest. European Union (EU) member countries have diverse legislation that is of relevance to FLW measurement (see [13] for a review). Some have legislative actions that are legally binding, such as laws and regulations, and as such prescribe methods of monitoring and FLW measurement as well as the ways of reporting the data. On the other hand, some countries only approach the topic through non-binding legislative actions, such as agency orders and policy papers. As such, not every method might be appropriate for every country or region.

(3) Food supply chain (FSC) stage. Food loss and waste can occur at any stage of the food supply chain, starting from farmers and other producers, through food manufacturers and processors, distributors and shippers, grocery stores and restaurants, all the way to the customers and consumers. Some methods are more appropriate for certain stages in this chain. For example, a household might keep a diary of their FLW, while sellers such as grocery stores generally manage their stock more systematically and precisely.

(4) Accuracy. FLW measurement methods also need to be considered from the point of desired accuracy. The highest accuracy can be achieved by directly weighing the waste or separating it into components (waste composition analysis), while diaries or volume assessment produce data of medium accuracy. At the lowest end, proxy data can be used to assess FLW, for example by using data from one region to extrapolate findings to another, keeping in mind that such data will not be very accurate.

(5) Food category. Depending on the type of food and how it is packed, we might only be able to use some FLW measurement methods, but not others. For example, when dealing with packed food items, wasted products can simply be counted and their weight inferred. Meanwhile, when waste occurs with liquid food, such as milk, volume assessment can be fairly accurate for estimating the weight of FLW.

(6) Direct access to FLW. Some food waste cannot be measured directly, such as by weighing, counting, or waste composition analysis. For example, when waste is discarded directly into the drain in the process of food processing, it might be mixed with other waste water exiting the processing plant. In cases like this, non-direct methods need to be employed, such as modelling or mass balance.

To be able to suggest specific FLW practices according to the criteria described above, we need to first describe them in terms of these characteristics. For harmonious representation, we used the already mentioned NUTS and NACE classifications for the region of interest and FSC stage, respectively. We also used a simplified version of FAO's Global individual food consumption (GIFT; [6]) classification to describe the food category. For accuracy, we opted for three categorical levels of "low", "medium", and "high", while direct access to FLW can be represented with a simple Boolean.
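The resulting record structure can be illustrated with a small sketch; the field names and example values below are hypothetical, chosen only to show how the six characteristics and their classifications (multi-valued NUTS, NACE, and food-category codes, a three-level accuracy, and a Boolean) fit together.

from dataclasses import dataclass

@dataclass
class FLWPractice:
    flw_method: list     # one or more of the ten quantification method classes
    region: list         # NUTS codes, e.g. ["SI0"]
    fsc_stage: list      # NACE codes for the supply-chain stages covered
    accuracy: str        # "low" | "medium" | "high"
    food_category: list  # simplified FAO GIFT categories
    direct_access: bool  # is physical access to the waste required?

example = FLWPractice(
    flw_method=["direct weighing"],
    region=["SI0"],
    fsc_stage=["C10"],   # illustrative NACE code (food manufacturing)
    accuracy="high",
    food_category=["Dairy & Eggs"],
    direct_access=True,
)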
2.2 Extraction of Data
To test the extraction of data, we used 11 FLW measurement and reduction practice descriptions. These included 3 descriptions of practices developed and piloted in the WASTELESS project, as well as 8 practices developed in other European projects [16].

To extract data from FLW practice descriptions, we used two LLMs: ChatGPT 5 Auto [12] and Le Chat [11]. The prompt consisted of the following:

(1) Introduction: a general summary of the whole extraction process;
(2) Main instructions:
(a) Information to be extracted: a list of questions, the answers to which represent the data that are to be extracted from the practice description;
(b) Data types and values: a list of possible values and their types for each of the data fields, including lists of NUTS and NACE codes and food categories;
(c) Missing information: instructions on how to deal with missing, incomplete, or unclear data;
(d) Format: a description of the format of the expected output (.csv data);
(3) Example:
(a) Input: a short, synthetic description of an FLW practice;
(b) Reasoning: values for all data fields and their relationship to the original text, indicating missing values;
(c) Output: the expected line of data output.

We included all reference classifications as .csv files, as well as the Guidance on FLW Quantification Methods [8] as a PDF. Following this initial prompt, practice descriptions were uploaded one by one and the output saved. The lead author of this paper also extracted the same information from the descriptions manually.
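Schematically, such a prompt could be assembled as below; every string is a placeholder standing in for the actual wording, question list, and attached files, which are not reproduced in this paper.

prompt_parts = [
    # (1) Introduction
    "You will extract structured data about FLW measurement practices...",
    # (2a) Information to be extracted
    "Q1: Which FLW quantification method does the practice use? ...",
    # (2b) Data types and values
    "Allowed values are listed in the attached CSV files "
    "(NUTS codes, NACE codes, food categories).",
    # (2c) Missing information
    "If a value cannot be determined from the description, leave the "
    "field empty; do not guess.",
    # (2d) Format
    "Answer with a single CSV line: "
    "method,region,fsc_stage,accuracy,food_category,direct_access",
    # (3) Example: synthetic input, reasoning, and expected output line
    "Example description: ... Reasoning: ... Output: ...",
]
prompt = "\n\n".join(prompt_parts)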
2.3 Evaluation of Results
To evaluate the extraction of data by LLMs, we compared the output of these models to human annotations. Here, the cases of multiple possible values and of missing data need to be considered. First, some characteristics can objectively contain several values. For example, an FLW measurement practice might be applicable to several FSC stages and more than one food category. Secondly, some data cannot be determined from the description of a practice.

For a characteristic with more than one possible value, consider two subsets of the set of all possible values (U): human annotations (H) and machine-extracted values (M). The following list gives the scores that were used in the evaluation for all possible relationships of these two sets:

+2 when the subsets were equal, H = M;
+1 when an LLM extracted more values than the human, but including those, ∅ ≠ H ⊂ M ≠ U;
0 when the sets overlapped, but neither contained the other, that is, there was a partial match in values, H ∩ M ≠ ∅, H ⊈ M, M ⊈ H;
0 when there were data available, but the LLM extracted no information or returned all possible values, ∅ ≠ H ⊂ U, but M = ∅ or M = U;
−1 when an LLM failed to extract all values that the human did, U ⊇ H ⊃ M ≠ ∅;
−2 when the subsets had no values in common, i.e., were disjoint, H ∩ M = ∅.

Note that for simple true-or-false values, this list simplifies to the extreme cases; these were thus scored as +2 and −2, respectively.

The reasoning behind the scoring is that we prefer to describe a practice in broader terms, even if some extracted values are inapplicable, rather than miss a particular value. As an example, it is better to describe a practice as suitable for all food categories than to miss the one that it is actually suitable for. Similarly, when no information is extracted, we can conservatively assume all values apply. In such a case, an LLM failed objectively, but it is not punished for it. In the worst-case scenario, an LLM "extracted" or hallucinated some values that have nothing in common with the human annotations; for this, two points are deducted.
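These rules translate directly into a small scoring function; the sketch below is a straightforward implementation of the list above, with an illustrative FSC-stage example.

def agreement_score(H: set, M: set, U: set) -> int:
    if H == M:
        return 2    # exact agreement (also covers true/false fields)
    if H and H < M and M != U:
        return 1    # LLM extracted a strict superset of the human values
    if not M or M == U:
        return 0    # nothing extracted, or all possible values returned
    if H & M and not (H <= M) and not (M <= H):
        return 0    # partial overlap, neither set contains the other
    if M < H:
        return -1   # LLM missed some human-annotated values
    return -2       # disjoint sets: nothing in common

# Example: the human marked two FSC stages; the model returned those two
# plus one more, while U still contains further values -> +1.
U = {"production", "processing", "retail", "household"}
print(agreement_score({"retail", "household"},
                      {"retail", "household", "processing"}, U))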
3 Results
To evaluate the extraction of data by the LLMs, we scored their answers as described in Section 2.3. We summarise these scores for each practice characteristic in Table 1, showing the sum of the scores and the number of perfect scores, that is, the number of times the LLM completely agreed with the human rater. The number of practices tested was 11, which is therefore the maximum number of perfect scores, while the maximum sum is 22.

Table 1: Agreement scores for each characteristic of an FLW practice between a human rater and two LLMs. The sum of scores and the number of perfect extractions are shown.

Characteristic  | ChatGPT Sum | ChatGPT Perfect | Le Chat Sum | Le Chat Perfect
FLW method      | 13          | 8               | 3           | 5
Region          | 12          | 7               | 13          | 5
FSC stage       | 8           | 7               | 12          | 6
Accuracy        | −2          | 4               | −5          | 3
Food category   | 22          | 11              | 21          | 10
Direct access   | 6           | 7               | 14          | 9
Total           | 59          | 44              | 58          | 38

Both models achieved similar total scores across all practice characteristics. ChatGPT did, however, perfectly agree with the human rating more often. Of all the characteristics, food category was the easiest for the LLMs to extract. This is a simple classification, and usually the type of food is mentioned explicitly. The FLW quantification method was inferred with moderate success. On the other hand, the accuracy of the methods was extracted very poorly.

4 Discussion
In this work, we have shown how, using two LLMs, data from unstructured FLW measurement and reduction practice descriptions can be extracted into structured data. We achieved satisfying if imperfect results. The most important data point, the class of the FLW measurement method, was extracted with moderate success. It needs to be pointed out that the extracted information was not wildly inaccurate in most cases, despite what the scores might suggest. For example, a method of tracking waste on a blockchain was classified as using records, where in fact the data were collected with surveys before being, indeed, recorded. Similarly, one practice described weighing waste as it was collected in the wastebasket, while simultaneously taking photos of the material. Here, the true measurement method was direct weighing, but the LLMs classified it as waste composition analysis. By using photos, such an analysis could in theory be done, but it was not in this case. Thus, to improve the relevance of the FLW measurement method, we might instead group the methods by some other characteristics. For example, we could drop the data field of direct access and instead consider groups of methods separated in terms of needing direct access to waste.

Food category, however, was very reliably extracted. This indicates that in further processing of the extracted data, we could make the best use of the food type. The accuracy of the described method was not extracted well, but this is most likely due to the subjectivity of this characteristic. The authors of the FLW practice descriptions never explicitly addressed the question of accuracy, so it needed to be estimated roughly from other characteristics, such as the general accuracy of the FLW method class. This also suggests that a three-level accuracy scale is probably too fine-grained, and accuracy should be described only as "low" or "high".

We should note that our evaluation only compares the performance of the LLMs to manual extraction of data performed by a single person. It is expected that people would also differ in their extractions, i.e., would not achieve perfect inter-rater agreement. Thus, the evaluation should not be interpreted as how well the LLMs captured the "objective" truth.

With this process, LLMs enabled us to transform the descriptions from simple PDF files into structured CSV files in a semi-automatic way. In terms of the five-star rating of open data [9], which describes how to get from data in proprietary formats to linked open data, we thus increased their level from one star to three stars. We can extend this further and increase the rating of these data to five stars: publish truly linked data.

The first step that can follow directly from the results of this work is to transform the structure described in Section 2.1 into an ontology. We illustrate this idea in Listing 1, which encodes the characteristics as classes and shows how to connect these to an individual practice using object and datatype properties.
Listing 1: A snippet of the ontology in Turtle language [1].

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix : <#> .  # placeholder; the ontology's own IRI is not shown here

<> rdf:type owl:Ontology .

##################### Classes #####################

:FoodLossWasteMeasurementPractice rdf:type owl:Class ;
    rdfs:label "Food Loss and Waste Measurement Practice"@en .

:Region rdf:type owl:Class ;
    rdfs:label "A NUTS code of the region" ;
    owl:equivalentClass dbpedia:Nomenclature_of_Territorial_Units_for_Statistics .

:FoodCategory rdf:type owl:Class ;
    rdfs:label "Food Category" .

:DairyAndEggs rdf:type owl:Class ;
    rdfs:subClassOf :FoodCategory ;
    rdfs:label "Dairy & Eggs" .

:Milk rdf:type owl:Class ;
    rdfs:subClassOf :DairyAndEggs ;
    rdfs:label "Milk" .

# ... more classes defined ...

################ Object Properties ################

:hasTitle rdf:type owl:DatatypeProperty ;
    rdfs:domain :FoodLossWasteMeasurementPractice ;
    rdfs:range rdfs:Literal ;
    rdfs:label "with the title" .

:hasRegion rdf:type owl:ObjectProperty ;
    rdfs:domain :FoodLossWasteMeasurementPractice ;
    rdfs:range :Region ;
    rdfs:label "applied in regions" .

:hasFoodCategory rdf:type owl:ObjectProperty ;
    rdfs:domain :FoodLossWasteMeasurementPractice ;
    rdfs:range :FoodCategory ;
    rdfs:label "applicable to food categories" .

:hasAccuracy rdf:type owl:DatatypeProperty ;
    rdfs:domain :FoodLossWasteMeasurementPractice ;
    rdfs:range "low"^^xsd:string, "medium"^^xsd:string, "high"^^xsd:string .

Once we represent the structure like this, we can encode a specific instance of an FLW measurement practice as:

:MyDairyWastePractice a :FoodLossWasteMeasurementPractice ;
    :hasTitle "Tracking Waste of Dairy in Slovenia" ;
    :hasAccuracy "high"^^xsd:string ;
    :hasFoodCategory :WholeMilk ;
    :hasRegion :SI0 .
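As a usage sketch, once practices are published in this form they can be retrieved programmatically, for example with rdflib; the file name and namespace IRI below are placeholders that would have to match the published ontology.

from rdflib import Graph

g = Graph()
g.parse("flw_practices.ttl", format="turtle")  # hypothetical file

# Find titles of practices applicable to (a subcategory of) Dairy & Eggs.
query = """
PREFIX : <http://example.org/flw#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?title WHERE {
    ?practice a :FoodLossWasteMeasurementPractice ;
              :hasTitle ?title ;
              :hasFoodCategory ?category .
    ?category rdfs:subClassOf* :DairyAndEggs .
}
"""
for row in g.query(query):
    print(row.title)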
The data on FLW measurement practices can then be easily linked to other published data; the closest candidate ontology is the Food Waste Ontology by Stojanov et al. [15]. The dataset described by this ontology is already vast and is being extended through FoodWasteEXplorer [14]. By leveraging it, we plan to publish the practice descriptions as five-star data in future work.

Acknowledgments
This work was carried out as a part of the WASTELESS project, which is funded by the European Union's Horizon Europe Research and Innovation programme under Grant Agreement No. 101084222.

References
[1] David Beckett, Tim Berners-Lee, Eric Prud'hommeaux, and Gavin Carothers. 2014. RDF 1.1 Turtle. Terse RDF Triple Language. World Wide Web Consortium (W3C), (Feb. 25, 2014). Retrieved Aug. 29, 2025 from https://www.w3.org/TR/turtle/.
[2] European Parliament and Council of the European Union. 2003. Regulation (EC) No 1059/2003 of the European Parliament and of the Council. On the establishment of a common classification of territorial units for statistics (NUTS). Official Journal of the European Union, 154, 1, (June 21, 2003), 1–41. http://data.europa.eu/eli/reg/2003/1059/oj.
[3] European Parliament and Council of the European Union. 2008. Regulation (EC) No 1333/2008 of the European Parliament and of the Council. On food additives. Version 02008R1333-20240423. Official Journal of the European Union, 354, 16, (Dec. 16, 2008), 16–33. http://data.europa.eu/eli/reg/2008/1333/oj.
[4] European Parliament and Council of the European Union. 2006. Regulation (EC) No 1893/2006 of the European Parliament and of the Council. Establishing the statistical classification of economic activities NACE Revision 2 and amending Council Regulation (EEC) No 3037/90 as well as certain EC regulations on specific statistical domains. Official Journal of the European Union, 393, (Dec. 20, 2006), 1–39. http://data.europa.eu/eli/reg/2006/1893/oj.
[5] Food and Agriculture Organization of the United Nations and World Health Organization. 2019. General standard for food additives. Codex STAN 192-1995. (2019).
[6] Food and Agriculture Organization of the United Nations (FAO). 2022. FAO/WHO GIFT. Global Individual Food Consumption Data Tool. Retrieved Aug. 30, 2025 from https://www.fao.org/gift-individual-food-consumption/about/en.
[7] Craig Hanson et al. 2016. Food Loss and Waste Accounting and Reporting Standard. Version 1.0. World Resources Institute. ISBN: 978-1-56973-892-4.
[8] Craig Hanson et al. 2016. Guidance on FLW Quantification Methods. Supplement to the Food Loss and Waste (FLW) Accounting and Reporting Standard. World Resources Institute. ISBN: 978-1-56973-893-1.
[9] Tim Berners-Lee. 2006. Linked data. Design issues. Version 2009-06-18. (July 27, 2006). https://www.w3.org/DesignIssues/LinkedData.html.
[10] Mitja Luštrek and Junoš Lukan. 2024. Practice Abstracts – batch 1 – early phase. Deliverable 6.2. Research rep. Jožef Stefan Institute. doi: 10.5281/ZENODO.13503261.
[11] [SW] Mistral AI, Le Chat version November 2024, 2024. URL: https://chat.mistral.ai/.
[12] [SW] OpenAI, ChatGPT version GPT-5, 2025. URL: https://chatgpt.com/.
[13] Zhuang Qian, Wu Chen, and Giorgia Sabbatini. 2023. White book for FLW reduction, measurement, and monitoring practices. Deliverable 1.1. Research rep. Version 1.0. University of Southern Denmark, (Aug. 30, 2023). 116 pp. doi: 10.5281/ZENODO.11065358.
[14] REFRESH. FoodWasteEXplorer. https://www.foodwasteexplorer.eu/about.
[15] Riste Stojanov, Tome Eftimov, Hannah Pinchen, Maria Traka, Paul Finglas, Drago Torkar, and Barbara Korousic Seljak. 2019. Food waste ontology. A formal description of knowledge from the domain of food waste. In 2019 IEEE International Conference on Big Data (Big Data). IEEE, (Dec. 2019). doi: 10.1109/bigdata47090.2019.9006254.
[16] Sustainable Food System Innovation Platform. Practice abstract inventory. Retrieved Sept. 1, 2025 from https://www.smartchain-platform.eu/en/practice-abstract-inventory.
Eye-Tracking Explains Cognitive Test Performance in Schizophrenia

Mila Marinković (mila.marinkovic@fri.uni-lj.si), Jure Žabkar (jure.zabkar@fri.uni-lj.si)
University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia

Abstract
Schizophrenia is associated with cognitive impairments that are difficult to assess with traditional neuropsychological tests, which are often lengthy and burdensome. Eye-tracking (ET) provides objective, minimally invasive measures of visual attention and cognitive processing and may complement shorter assessments. This study investigated whether ET features recorded during three computerized tasks could distinguish patients with schizophrenia from healthy controls. Using the Explainable Boosting Machine (EBM), we achieved an accuracy of 0.86 and balanced sensitivity and specificity, with an area under the curve exceeding 0.9. Features related to fixation patterns, saccadic dynamics, and temporal engagement emerged as the most informative. These findings indicate that ET features collected during brief cognitive tasks can provide clinically relevant markers of schizophrenia. Incorporating ET into short test batteries may reduce patient burden while enhancing diagnostic value, supporting the development of scalable and practical screening tools.

Keywords
schizophrenia, eye-tracking, cognitive tasks, machine learning

1 Introduction
Schizophrenia is a severe and chronic neuropsychiatric disorder that affects about 1% of the population worldwide and is characterized by disturbances in thought, perception, and behavior [1]. In addition to positive and negative symptoms, patients experience pronounced cognitive impairments, including deficits in attention, working memory, and executive functioning, which substantially affect everyday life outcomes [2, 3]. Cognitive assessment is therefore central for both diagnosis and monitoring of schizophrenia. However, traditional neuropsychological testing is lengthy, cognitively demanding, and often exhausting for patients, limiting its feasibility in clinical practice. Shorter test batteries reduce the burden but often fail to provide sufficiently informative data for reliable diagnosis.

Eye-tracking (ET) offers a promising avenue for addressing this challenge. ET provides objective, real-time measures of visual attention, oculomotor control, and information-processing strategies [4]. Numerous studies have shown that patients with schizophrenia exhibit abnormalities in smooth pursuit eye movements, antisaccades, and fixation stability [5, 6, 7]. These alterations are considered potential endophenotypes of the disorder, as they are also observed in first-degree relatives who do not have schizophrenia [6, 7]. More recent work has extended ET beyond basic oculomotor paradigms by embedding it in cognitive tasks. For example, Okazaki et al. [8] combined ET metrics with digit-symbol substitution tests and showed improved discrimination between patients and controls. Yang et al. [9] reported that abnormal gaze patterns during reading tasks—such as longer fixation durations and increased saccade counts—enabled high diagnostic accuracy when analyzed with machine learning models. Similarly, Morita et al. [10] demonstrated the feasibility of portable tablet-based ET combined with cognitive assessments for schizophrenia screening. Collectively, these studies highlight that combining ET with cognitive testing enriches diagnostic value and provides insights into the cognitive mechanisms underlying gaze abnormalities.

Building on this prior work, the present study investigates whether ET features recorded during a small set of computerized cognitive tasks can serve as reliable markers of schizophrenia. Participants completed three tasks (digit span, picture naming, and n-back), each divided into phases of instruction reading, video demonstration, and test execution. From these tasks, we extracted 117 ET features, including fixation measures, saccadic dynamics, gaze entropy, and recording duration. We then applied machine learning methods to evaluate the discriminative power of these features. By focusing on only three short tasks, our aim is to test whether ET provides sufficient additional information to overcome the limitations of brief cognitive testing, ultimately supporting the development of less burdensome but more informative screening approaches.
2 Methods
2.1 Participants
The study involves 126 individuals: 58 patients diagnosed with schizophrenia (SP) and 68 healthy controls (HC). All participants were adults, aged 18 years or older. Patients were recruited and tested at the University Psychiatric Hospital Ljubljana. The control group was matched to the patient group on age and gender.

Eligibility criteria required fluency in Slovenian and excluded individuals with intellectual disability, organic brain disorders, or a history of substance abuse. An additional exclusion criterion for the HC group was any past or current psychiatric disorder. At the time of assessment, all SP participants were receiving stable doses of antipsychotic medication.

Demographic characteristics of the two groups are presented in Table 1 and were analyzed to ensure that the groups were comparable in terms of age and gender. While educational attainment differed between groups, further analyses confirmed that within each education level there were no significant differences between SP and HC participants, indicating that education was unlikely to confound the comparisons.

Table 1: Demographic characteristics of participants.

Measure                      | SP             | HC
Total participants           | 58             | 68
Male sex                     | 29             | 35
Female sex                   | 29             | 33
Age (mean years)             | 46.1           | 46.7
Most common education level  | Primary school | High school
HC: Healthy Controls; SP: Patients with Schizophrenia

The study was approved by the Medical Ethics Committee of the Republic of Slovenia (approval number: 0120-51/2024-2711-4). All participants received a detailed explanation of the study procedures and provided written informed consent prior to participation.

2.2 Testing Procedure
Eye-tracking data were collected using a Tobii Pro Spectrum [11] eye tracker integrated into a 24-inch monitor with a resolution of 1920 × 1080 pixels. Recordings were made at 1200 Hz in the "human" tracking mode, with a stimulus presentation latency of approximately 10 ms. The display frame rate was 30 FPS. Participants sat ∼55 cm from the monitor, in an upright position, with seating adjusted for comfort and optimal tracking.

Before each task, participants were seated comfortably, and the Tobii Pro Lab [12] interface provided a live preview (see Fig. 1) to verify that both eyes were detected and that the viewing distance was within the recommended range (displayed as a green zone, typically around 55 cm). Once this was confirmed, a standard five-point calibration was performed, during which participants followed a moving dot across the screen. Calibration served both to align gaze tracking and to ensure that the participant had not moved their head between tasks. If the system indicated suboptimal accuracy, the calibration was repeated.

Figure 1: Calibration interface in Tobii Pro Lab. The preview window ensures both eyes are detected and the participant is seated at an appropriate distance (green zone, approximately 55 cm) before calibration and testing begin.

Participants completed three computerized cognitive tasks in a fixed order: digit span (DS), n-back (NB), and picture naming (PS). A short break was provided between tasks, with the duration determined by each participant. All tasks were presented within the Tobii Pro Lab application, which also stored the raw data. After recording, the data were exported and processed using a custom Python program for feature extraction and analysis.

Each task followed the same three-phase structure:
(1) Reading instructions. Written instructions were displayed on the screen. Participants could read them at their own pace and advanced to the next phase with a mouse click.
(2) Video example. A short instructional video was presented once, demonstrating the task procedure.
(3) Test execution. The participant began the task when ready. Task duration depended on individual performance.

The procedure was identical for all participants, ensuring standardization across groups. Only the test execution phase varied in length, as it was determined by each participant's performance. Group-level descriptive statistics of fixation durations for all tasks and phases are reported in the Results section (Table 3).

2.3 Feature Extraction
We extracted a total of 117 ET features from the three computerized cognitive tasks. As described in Section 2.2, each task was divided into three phases: instruction reading (BN), video demonstration (GN), and test execution (T). Each participant contributed a single data point to the ML analysis. For every task (DS, PS, NB) and every phase (BN, GN, T), we computed the 13 eye-tracking features listed in Table 2. Each feature was calculated over the entire duration of the given phase (e.g., the number of fixations refers to the total count during that phase, while the mean fixation duration refers to the average across all fixations in that phase). These were then concatenated across all tasks and phases, yielding 117 features per participant. Thus, the unit of analysis was the participant, not individual trials or task phases.

Table 2: Eye-tracking features extracted from each task and phase.

Feature                | Description
num_fixations          | Total number of fixations during the interval.
avg_fixation_duration  | Mean duration of fixations (ms), indicating fixation stability.
std_fixation_duration  | Standard deviation of fixation duration, reflecting variability in fixation times.
num_saccades           | Total number of saccadic eye movements.
avg_saccade_distance   | Mean distance of saccades, reflecting amplitude of eye shifts.
avg_saccade_velocity   | Mean velocity of saccades, indicating how quickly gaze shifts occurred.
avg_saccade_angle      | Average angular change of saccades, reflecting directional scanning patterns.
gaze_entropy           | Entropy of gaze distribution, quantifying dispersion vs. concentration of gaze.
recording_duration_ms  | Total duration of recording for the phase (ms).
unique_squares         | Number of unique spatial areas (AOIs) visited during the interval.
num_changes            | Number of transitions between distinct gaze areas.
missing_left_percent   | Percentage of missing data from the left eye.
missing_right_percent  | Percentage of missing data from the right eye.
Note: All features are computed as aggregates over the entire task phase for each participant.
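A few of the features in Table 2 can be illustrated with a short sketch. The inputs are simplified (a list of fixation durations and an array of gaze coordinates), and the grid of areas of interest used for unique_squares, num_changes, and gaze_entropy is an assumption, as the paper does not specify the AOI definition.

import numpy as np

def phase_features(fixation_durations_ms, gaze_xy,
                   grid=(8, 8), screen=(1920, 1080)):
    feats = {
        "num_fixations": len(fixation_durations_ms),
        "avg_fixation_duration": float(np.mean(fixation_durations_ms)),
        "std_fixation_duration": float(np.std(fixation_durations_ms)),
    }
    # Map gaze samples onto a coarse grid of areas of interest (AOIs).
    gx = np.clip(gaze_xy[:, 0] / screen[0] * grid[0], 0, grid[0] - 1).astype(int)
    gy = np.clip(gaze_xy[:, 1] / screen[1] * grid[1], 0, grid[1] - 1).astype(int)
    cells = gx * grid[1] + gy
    feats["unique_squares"] = int(len(np.unique(cells)))
    feats["num_changes"] = int(np.sum(np.diff(cells) != 0))
    # Gaze entropy: dispersion vs. concentration of gaze over the AOIs.
    p = np.bincount(cells, minlength=grid[0] * grid[1]) / len(cells)
    p = p[p > 0]
    feats["gaze_entropy"] = float(-(p * np.log2(p)).sum())
    return feats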
2.4 Data Analysis
We trained and evaluated several machine learning models using these features. We applied stratified 10-fold cross-validation at the subject level to ensure that all features from a given participant were assigned exclusively to either the training or the test set, thereby preventing data leakage across folds. In each iteration, the model was trained on nine folds and tested on the remaining one. Performance was assessed using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC). The final results were reported as the average across all folds.

We evaluated a diverse set of ML models (logistic regression, random forest, gradient boosting, extreme gradient boosting, and the Explainable Boosting Machine) to cover both linear and non-linear approaches with varying levels of interpretability. The EBM was selected as the primary model because it consistently achieved the highest overall performance while providing inherently interpretable feature importances, which is particularly valuable in clinical contexts. We did not pursue deep neural networks in this study, as the dataset size (126 participants) is relatively small and does not provide sufficient power to train high-capacity models without overfitting.
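The evaluation protocol translates into a compact sketch using the EBM implementation from the interpret package; the feature matrix and labels below are synthetic placeholders with the shapes described above (one 117-feature row per participant). Because each participant contributes a single row, an ordinary stratified split is already subject-level.

import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(126, 117))   # placeholder feature matrix
y = rng.integers(0, 2, size=126)  # placeholder labels: 0 = HC, 1 = SP

scoring = {
    "accuracy": "accuracy",
    "sensitivity": "recall",                                 # recall on SP
    "specificity": make_scorer(recall_score, pos_label=0),   # recall on HC
    "auc": "roc_auc",
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(ExplainableBoostingClassifier(random_state=0),
                        X, y, cv=cv, scoring=scoring)
for name in scoring:
    print(name, scores[f"test_{name}"].mean())
# After fitting on all data, ebm.explain_global() exposes the per-feature
# importances behind a ranking like the one in Fig. 3.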
3 Results
To characterize task engagement and potential variability between groups, we compared fixation durations across all tasks and phases (Table 3). SP showed longer fixations than HC, especially during the instruction reading and video phases, with smaller but consistent differences during execution. This indicates altered attention even outside active task solving.

Table 3: Mean fixation duration in ms per task and phase.

Task     | Phase                | HC (Mean ± SD)  | SP (Mean ± SD)
Numbers  | Reading instructions | 239.64 ± 47.79  | 283.97 ± 45.33
Numbers  | Watching video       | 352.14 ± 81.56  | 400.10 ± 89.51
Numbers  | Test execution       | 390.66 ± 83.92  | 407.60 ± 98.53
Pictures | Reading instructions | 228.44 ± 52.49  | 267.78 ± 60.79
Pictures | Watching video       | 302.40 ± 69.06  | 368.93 ± 81.42
Pictures | Test execution       | 301.97 ± 49.91  | 319.36 ± 58.07
Square   | Reading instructions | 229.36 ± 45.41  | 286.70 ± 63.42
Square   | Watching video       | 309.41 ± 89.45  | 352.08 ± 79.37
Square   | Test execution       | 394.91 ± 115.50 | 406.24 ± 99.36
SD: Standard deviation; HC: Healthy controls; SP: Schizophrenia patients

The ML models were trained on the 117 extracted eye-tracking features and achieved strong performance in distinguishing SP from HC. The key cross-validation performance metrics are summarized in Table 4.

Table 4: Cross-validation performance metrics for different models. The Explainable Boosting Machine (EBM) achieved the best overall performance across all metrics.

Model | Accuracy | Sensitivity | Specificity | AUC
EBM   | 0.86     | 0.84        | 0.86        | 0.93
LR    | 0.85     | 0.77        | 0.91        | 0.92
GB    | 0.78     | 0.70        | 0.84        | 0.83
RF    | 0.83     | 0.84        | 0.82        | 0.91
xGB   | 0.81     | 0.77        | 0.85        | 0.90
EBM: Explainable Boosting Machine; LR: Logistic Regression; GB: Gradient Boosting; RF: Random Forest; xGB: Extreme Gradient Boosting

Among the tested models, the EBM achieved the highest overall performance and was therefore selected for detailed analysis. Figure 2 presents the receiver operating characteristic (ROC) curve, which confirms the model's strong discriminative ability.

Figure 2: ROC curve for the EBM model. The mean AUC across folds was 0.92, confirming strong classification performance.

We analyzed the feature importance scores provided by the EBM, focusing on the ten most informative features (Fig. 3). These features were predominantly derived from the test execution phases and included measures such as recording duration, number of fixations, mean fixation duration, and saccade counts.

Figure 3: Top 10 most important features identified by the EBM model. The prefixes indicate the task and phase: DS = digit span, PS = picture naming, NB = n-back; BN = reading instructions, GN = watching video, T = test solving. For example, PS_T_num_fixations refers to the number of fixations during the test phase of the picture naming task.
4 Discussion
The present study demonstrates that eye-tracking (ET) features obtained during brief computerized cognitive tasks can effectively discriminate between individuals with schizophrenia and healthy controls. Using 117 features, the Explainable Boosting Machine (EBM) achieved strong classification performance, with accuracy, sensitivity, and specificity values around 0.85 and an AUC of 0.92. These results provide further evidence that ET-based measures capture clinically relevant differences in cognitive processing and attentional control in schizophrenia.

Our findings are consistent with previous work showing that patients with schizophrenia exhibit abnormalities in fixation behavior, saccadic dynamics, and gaze distribution during both simple oculomotor paradigms and more complex cognitive tasks [5, 6, 7, 8, 9, 10]. Importantly, by embedding ET into a small set of standardized cognitive tasks, we demonstrate that group differences emerge not only during active problem solving but also in more passive phases, such as reading instructions or watching a video example. This suggests that ET provides valuable information across the continuum of cognitive engagement, extending beyond traditional task performance metrics.

While prior studies have applied machine learning to ET data in schizophrenia, they have typically relied on single paradigms or isolated task conditions. The novelty of the present work lies in combining a multi-task, multi-phase design with interpretable ML within a short, clinically feasible test battery. This approach captures a broader range of cognitive and attentional processes while linking model performance to specific, clinically meaningful features.

The interpretability analysis showed that temporal engagement, fixation stability, and saccadic activity best differentiated the groups. Longer recording durations may reflect slower processing, while altered fixations and saccades align with prior reports of impaired attentional control. These findings suggest that eye-tracking captures both temporal and oculomotor aspects of task performance, supporting its potential as a clinically meaningful biomarker.

From a clinical perspective, these results are encouraging. Traditional neuropsychological assessments are lengthy and cognitively demanding, which can be exhausting for patients and limit their applicability. Our study shows that by integrating ET measures into just three relatively brief cognitive tasks, it is possible to achieve a high level of diagnostic accuracy. This approach may therefore support the development of shorter, less burdensome, and more objective screening protocols that could complement existing clinical evaluations.

Limitations and Future Work
Several limitations should be noted. First, although our sample size of 126 participants is comparable to similar studies, larger and more diverse cohorts are needed to confirm the generalizability of the results. Second, all patients were on stable antipsychotic medication, which may have influenced oculomotor behavior. Third, while we employed subject-level cross-validation to prevent data leakage, robustness checks such as leave-one-subject-out or leave-one-task-out validation could further strengthen reliability. Fourth, our analysis focused on static ET features; dynamic sequence-based or deep learning models could capture additional temporal information in gaze patterns. Finally, we only tested three tasks; future research should explore whether expanding or tailoring the task battery improves performance while still keeping the protocol brief. Replication with independent cohorts will be essential to establish clinical utility.

Conclusion
In conclusion, this study provides strong evidence that eye-tracking features embedded within short cognitive tasks can serve as robust markers of schizophrenia. Machine learning models trained on these features achieved high discriminative accuracy, with interpretable patterns that align with known attentional and cognitive impairments in the disorder. By reducing patient burden while maintaining informativeness, this approach holds promise for the development of accessible, scalable, and clinically relevant screening tools for schizophrenia.

5 Acknowledgments
This research was funded by the Slovenian Research Agency (core funding No. P2-0209). Additional support for Mila Marinković was provided through the ARIS project AI4Science (GC-0001). We thank the University Psychiatric Hospital Ljubljana for participant recruitment, asist. dr. Polona Rus Prelog, dr. med., spec. psih., and Martina Zakšek for their support, and the broader research team, clinicians, and participants for their contributions. The authors declare that they have no conflict of interest.
References
[1] S. R. Marder and T. D. Cannon, "Schizophrenia," The New England Journal of Medicine, vol. 381, no. 18, pp. 1753–1761, 2019.
[2] W. Hinzen and J. Rosselló, "The linguistics of schizophrenia: thought disturbance as language pathology across positive symptoms," Frontiers in Psychology, vol. 6, 2015.
[3] L. Colle, R. Angeleri, M. Vallana, K. Sacco, B. Bara, and F. Bosco, "Understanding the communicative impairments in schizophrenia: A preliminary study," Journal of Communication Disorders, vol. 46, no. 3, pp. 294–308, 2013.
[4] A. Wolf, K. Ueda, and Y. Hirano, "Recent updates of eye movement abnormalities in patients with schizophrenia: A scoping review," Psychiatry and Clinical Neurosciences, vol. 75, pp. 104–118, 2021.
[5] P. S. Holzman, L. R. Proctor, and D. W. Hughes, "Eye-tracking patterns in schizophrenia," Science, vol. 181, no. 4095, pp. 179–181, 1973.
[6] L. Deborah, H. Philip, M. Steven, and M. Nancy, "Eye tracking and schizophrenia: A selective review," Schizophrenia Bulletin, vol. 20, no. 1, pp. 47–62, 1994.
[7] U. Ettinger, "Smooth pursuit and antisaccade eye movements as endophenotypes in schizophrenia spectrum research," PhD Thesis, Department of Psychology, Goldsmiths College, University of London, 2002.
[8] K. Okazaki, K. Miura, J. Matsumoto, N. Hasegawa, M. Fujimoto, H. Yamamori, Y. Yasuda, M. Makinodan, and R. Hashimoto, "Discrimination in the clinical diagnosis between patients with schizophrenia and healthy controls using eye movement and cognitive functions," Psychiatry and Clinical Neurosciences, vol. 77, pp. 393–400, 2023.
[9] H. Yang, L. He, W. Li, Q. Zheng, Y. Li, X. Zheng, and J. Zhang, "An automatic detection method for schizophrenia based on abnormal eye movements in reading tasks," Expert Systems With Applications, vol. 238, p. 121850, 2024.
[10] K. Morita, K. Miura, A. Toyomaki, M. Makinodan, K. Ohi, N. Hashimoto, Y. Yasuda, T. Mitsudo, F. Higuchi, S. Numata, A. Yamada, Y. Aoki, H. Honda, R. Mizui, M. Honda, D. Fujikane, J. Matsumoto, N. Hasegawa, S. Ito, H. Akiyama, T. Onitsuka, Y. Satomura, K. Kasai, and R. Hashimoto, "Tablet-based cognitive and eye movement measures as accessible tools for schizophrenia assessment: Multisite usability study," JMIR Mental Health, vol. 11, p. e56668, 2024.
[11] M. Nyström, D. Niehorster, R. Andersson, and I. Hooge, "The Tobii Pro Spectrum: A useful tool for studying microsaccades?" Behavior Research Methods, vol. 53, 2020.
[12] Tobii AB, "Tobii Pro Lab," Computer software, Danderyd, Stockholm, 2024. [Online]. Available: http://www.tobii.com/
Data-Driven Evaluation of Truck Driving Performance with Statistical and Machine Learning Methods

Vid Nemec (vidotti.nemec@gmail.com), Gašper Slapničar (gasper.slapnicar@ijs.si), Mitja Luštrek (mitja.lustrek@ijs.si)
Jožef Stefan Institute, Ljubljana, Slovenia

Figure 1: Truck driving simulator developed by AAER Research d.o.o.

Abstract
This paper investigates which driving features (e.g., speed, acceleration, braking) most strongly affect driving efficiency in a truck simulator environment. The work systematically compares statistical methods (thresholding based on percentiles, IQRs, and expert rules) with machine learning methods (clustering using K-means) for driver assessment. In addition to practical machine learning experimentation, the analysis incorporates expert knowledge and insights from recent research. This approach evaluates the agreement and differences between the two approaches and aims to interpret them.

Keywords
Driving simulation, fuel efficiency, percentiles, K-means, SHAP, statistical thresholds, machine learning, clustering

1 Introduction
Reducing fuel consumption in road transport is a critical goal for sustainability and cost-efficiency [1]. Prior research, such as [2, 3], highlights the impact of driver behaviour, particularly acceleration, braking, and speed profiles, on overall fuel efficiency. Yet how to most effectively quantify and compare drivers remains an open question [4]. This paper addresses which driving features most strongly influence efficiency in a simulated truck driving environment, comparing classical statistical thresholding, based on expert knowledge, with clustering-based machine learning. Applying known methods, we test whether unsupervised ML can identify driver features with a stronger influence on fuel consumption than fixed-threshold rules, providing a data-driven baseline for future model-based feedback.

In addition, we compare the empirical outcomes of our ML analytics with insights from recent literature and the practical judgement of a driving expert, to pinpoint where domain knowledge aligns or conflicts with the models. This dual perspective enables a richer interpretation of driver assessment tools and informs the design of future vehicle feedback and incentive systems.

2 Related Work
Recent studies have evaluated driver behaviour for fuel efficiency using both statistical rules and machine-learning approaches. Sullivan et al. present a TORCS-based simulator with a realistic fuel-economy model, enabling safe, repeatable analysis of eco-driving strategies [5]. Maisonneuve characterises driver energy efficiency across driving events and proposes a grading/ranking method based on identified parameters [6]. Zhao et al. develop a simulator-based eco-driving support system with real-time feedback and post-drive reports, demonstrating measurable reductions in fuel use and emissions [7]. Ma et al. provide a scoping review of energy-efficient driving behaviours and applied AI methods [8]. Prototype driver-training systems have been proposed [9], and large-scale, data-driven frameworks to incentivise efficient driving have been developed [3, 10].
Most studies agree that key features include speed, throttle, brake usage, and sometimes gear selection, but they differ on methods for quantifying and weighting these features. Machine-learning clustering (e.g., K-means) and feature importance analysis (e.g., SHAP) are increasingly used, offering potential improvements in the objectivity and interpretability of driver assessment.

3 Methods
3.1 Data Collection and Preprocessing
Driving data were collected from a high-fidelity truck simulator developed by AAER Research d.o.o., which continuously recorded multiple parameters including pedal positions, steering wheel angle, vehicle speed, location, and segment identifiers. To ensure data quality, missing or zeroed pedal values were imputed. The signals were then resampled into 1-second windows, where for each parameter we computed the minimum, maximum, mean, and median values. This aggregation approach was chosen over raw resampling because the signals are irregular, zero-inflated, and not normally distributed, making window-based statistics more representative of driver behavior. In addition, the last observed cumulative distance within each window was retained to preserve distance continuity. Finally, the processed signals were aligned with the boundaries of the scenario segments, allowing a consistent basis for later efficiency evaluation.
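The windowing step can be made concrete with pandas. This is a minimal sketch, assuming the simulator log is a DataFrame with a DatetimeIndex; column names such as gas_pedal and cum_distance are illustrative, not the simulator's actual identifiers.

```python
import pandas as pd

def aggregate_windows(df: pd.DataFrame) -> pd.DataFrame:
    """Resample irregular simulator signals into 1-second windows.

    Keeps min/max/mean/median per signal, which is more representative
    for zero-inflated, non-normal signals than raw resampling, and
    carries the last observed cumulative distance forward.
    """
    signals = ["gas_pedal", "brake_pedal", "steering_angle", "speed"]
    agg = df[signals].resample("1s").agg(["min", "max", "mean", "median"])
    # Flatten the (signal, statistic) MultiIndex into flat column names.
    agg.columns = [f"{sig}_{stat}" for sig, stat in agg.columns]
    agg["distance"] = df["cum_distance"].resample("1s").last()
    return agg
```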
3.2 Rule-based Aggregation of Segment Labels
We aggregated per-segment labels (PASS / WARN / FAIL) into an overall per-driver rating using a linear severity score. A FAIL indicates a strong threshold exceedance and is therefore weighted twice a WARN, yielding a simple, interpretable metric that tolerates occasional minor deviations:

S = 2 · #FAIL + #WARN,
Rating(S) = Good if S ≤ 2; Warning if 3 ≤ S ≤ 5; Bad if S ≥ 6.

This 2:1 weighting reflects relative severity (a FAIL is a clearer breach of the threshold than a WARN) and preserves stability: small label fluctuations do not flip a driver from Good to Bad. The middle band (Warning) collects borderline cases for review. This procedure enabled transparent, segment-level benchmarking of driver performance.

Table 1: Per-driver severity summary (S = 2 · #FAIL + #WARN).

Driver  #WARN  #FAIL  S   Rating
1       4      1      6   Bad
10      5      1      7   Bad
2       7      2      11  Bad
3       4      0      4   Warning
4       4      0      4   Warning
5       6      2      10  Bad
6       3      0      3   Warning
7       3      0      3   Warning
8       4      0      4   Warning

3.3 Machine Learning
3.3.1 K-means clustering. Unsupervised K-means clustering (k = 3) was applied per segment on standardized aggregated characteristics (acceleration/braking variability, coasting, use of cruise control, speed-related measures). Clusters were assigned semantic labels Good / Warning / Bad post hoc by ordering clusters by their mean fuel rate (fuel_mean): lowest → Good, middle → Warning, highest → Bad. We then examined cluster centroids (mean feature profiles) and visualised the result as per-segment heatmaps.
3.3.2 SHAP with LightGBM model. As an orthogonal check of feature relevance, we applied SHAP to a separate LightGBM model predicting fuel rate; this diagnostic analysis is independent of clustering and highlights variables linked to higher consumption (Table 2).

4 Results
4.1 Statistical Thresholding Approach
Based on the analysis of related work outlined in Section 2, we decided to benchmark driver efficiency based on selected driving features. We investigated two methods covering the complementary metrics of acceleration and braking, namely:
• Percentile-based thresholds for the gas pedal
• IQR method for the brake pedal
Percentiles were chosen for the gas pedal because the signal is highly zero-inflated and not normally distributed, making distribution-aware thresholds more suitable. Braking behavior is irregular and heavy-tailed, where the IQR offers a robust way to capture abnormal events. In essence, the IQR rule sets a dispersion-anchored cut-off above Q3, robust to heavy tails, whereas percentile thresholds fix the share of events flagged. Thresholds were determined by examining histograms of pedal deltas (Figure 2), ensuring that cutoffs meaningfully separated typical from extreme behavior.

Figure 2: Histograms for both pedals

Threshold characterisation:
• Gas Pedal: We applied percentile-based thresholds (65th for WARN, 83rd for FAIL) to the gas pedal delta (change in 0.1 second). This approach better captures outlier acceleration behavior while avoiding over-penalizing normal operation. We removed windows where cruise control was active for more than 30% of the time, to reduce automation bias in pedal measurements. This cut-off was chosen to balance isolating manual control with keeping enough observations.
• Brake Pedal: We applied an interquartile-range rule computed from the empirical distribution in each segment: with the third quartile Q3 and the interquartile range IQR = Q3 − Q1, we set WARN at Q3 + 0.5 · IQR and FAIL at Q3 + 1.5 · IQR. It flags both frequent moderate excesses (WARN) and rare but severe braking events (FAIL) without over-penalising normal behaviour.

Certain segments in the driving scenario required strong braking due to test design (e.g., safety-critical stops). These were labelled as SAFETY and excluded from efficiency scoring, as they reflect controlled conditions rather than natural driving quality. The resulting classifications are summarised as heatmaps (Figures 3 and 4), where rows correspond to drivers and columns to scenario segments. Cells are coloured green (PASS), orange (WARN), and red (FAIL), providing an intuitive visual overview of performance variability. PASS/WARN/FAIL are segment-level, per-driver labels that state whether the segment was driven efficiently in terms of fuel use: PASS = efficient, WARN = borderline, FAIL = inefficient. These labels refer only to fuel consumption, not safety or travel time. Blank (white) cells indicate cases without an assigned label: either segments excluded by SAFETY scoring or driver-segment pairs with too few events to make a reliable decision.

Figure 3: Heat map of the gas pedal through segments using the percentiles method
Figure 4: Heat map of the brake pedal through segments using the IQR method
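The two threshold rules (Section 4.1) and the severity score (Section 3.2) reduce to a few lines of NumPy. A hedged sketch under the thresholds stated above; array and function names are ours, and the per-segment grouping is left to the caller.

```python
import numpy as np

def pedal_labels(gas_delta: np.ndarray, brake_delta: np.ndarray):
    """Label windows with the percentile (gas) and IQR (brake) rules."""
    warn_gas, fail_gas = np.percentile(gas_delta, [65, 83])
    q1, q3 = np.percentile(brake_delta, [25, 75])
    warn_brk, fail_brk = q3 + 0.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

    def grade(x, warn, fail):
        return np.where(x >= fail, "FAIL", np.where(x >= warn, "WARN", "PASS"))

    return (grade(gas_delta, warn_gas, fail_gas),
            grade(brake_delta, warn_brk, fail_brk))

def driver_rating(segment_labels) -> str:
    """Aggregate segment labels with S = 2*#FAIL + #WARN and band the result."""
    s = (2 * sum(l == "FAIL" for l in segment_labels)
         + sum(l == "WARN" for l in segment_labels))
    if s <= 2:
        return "Good"
    return "Warning" if s <= 5 else "Bad"
```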
4.2 Comparison of Thresholding and Clustering
A focused comparison was carried out on three representative track segments, Segment 1, Segment 8, and Segment 4, using the two complementary methods described in Section 3 (statistical thresholding and K-means clustering). For visualization only, we projected standardised features onto two principal components (PCA) per segment; clustering and label assignment were performed in the original standardised space.

4.2.1 Segment 1 (Steady Acceleration). The percentile method flagged only one driver as exceeding the FAIL threshold, while most achieved PASS status. K-means clustering produced a tightly grouped Good cluster for most drivers, with a single Bad outlier (visible in PCA as an isolated point on the positive PC1 axis). Agreement between methods was high (>85%), suggesting that, in simpler acceleration scenarios, single-feature metrics and multidimensional clustering agree well.

Figure 5: K-means graph for the 1st segment

4.2.2 Segment 4 (Prolonged Uphill Driving). Here the disagreement was most pronounced. The percentile rule classified many drivers as PASS because their maximum throttle did not exceed the cut-off. In contrast, K-means frequently assigned them to Warning or Bad. The 2D PCA projection (Figure 6) shows these drivers displaced from the Good centroid, driven by sustained high-load throttle (elevated accelerator mean/variance), low coasting, and reduced cruise-control usage, patterns that the single-peak percentile metric does not penalize. This highlights clustering's sensitivity to cumulative demand and multi-feature context, whereas the percentile approach captures only isolated exceedances.

Figure 6: K-means graph for the 4th segment

4.2.3 Segment 8 (Complex Curve-Acceleration Mix). This segment showed more divergence. The percentile method marked several drivers as WARN due to short bursts of high throttle, while K-means placed some of these drivers in the Good cluster. PCA visualization revealed that these drivers exhibited smoother braking and higher coasting ratios, which the clustering model positively weighted. This highlights a key difference: the statistical approach penalizes isolated peaks, whereas clustering balances them against compensatory behaviors.

Figure 7: K-means graph for the 8th segment

4.2.4 Cross-approach Observations. The alignment was strongest in steady demand scenarios (Segment 1), weaker in mixed behavior contexts (Segment 8), and lowest in sustained load conditions (Segment 4). Statistical thresholding offers high interpretability and segment-level clarity, but may overlook multi-feature inefficiencies. K-means clustering captures complex, composite behavior and can sometimes reclassify drivers that the percentile method labels as efficient. It would be interesting for future work to implement more driver features and analyse in depth which have a different effect.
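For reference, the per-segment clustering and post-hoc naming of Section 3.3.1 can be sketched as follows. Feature and variable names are illustrative, and PCA is used only to produce the 2D plots discussed above, not for clustering itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def cluster_segment(features, fuel_mean, k=3, seed=0):
    """Cluster one segment's per-driver features; name clusters by fuel use."""
    fuel_mean = np.asarray(fuel_mean)
    X = StandardScaler().fit_transform(features)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    # Order clusters by mean fuel rate: lowest -> Good, highest -> Bad.
    order = np.argsort([fuel_mean[km.labels_ == c].mean() for c in range(k)])
    names = dict(zip(order, ["Good", "Warning", "Bad"]))
    labels = [names[c] for c in km.labels_]
    coords = PCA(n_components=2).fit_transform(X)  # for visualisation only
    return labels, coords
```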
We additionally investigated the alignment between model-based feature importances and expert knowledge/domain expectations using SHAP.

Table 2: Top-5 features per class, ranked by mean absolute SHAP value.

Class    Top 1              Top 2              Top 3         Top 4               Top 5
Bad      AccelerationPedal  Speed              Acceleration  SteeringWheelAngle  BrakePedal
Medium   Speed              AccelerationPedal  Acceleration  SteeringWheelAngle  BrakePedal
Good     AccelerationPedal  Speed              Acceleration  SteeringWheelAngle  BrakePedal
Perfect  AccelerationPedal  Speed              Acceleration  SteeringWheelAngle  BrakePedal

Table 2 presents the five most influential features for each consumption class (Bad, Medium, Good, Perfect), ranked by their mean absolute SHAP value. The model consistently identifies AccelerationPedal and BrakePedal among the top-ranked features across multiple classes, in line with the statistical benchmark results from Section 4.1, where pedal usage was also the dominant indicator of inefficient driving events. This agreement confirms that the machine learning approach captures the same domain-relevant control inputs as the thresholds defined by the expert, while also highlighting secondary but relevant factors such as Speed, Acceleration, and SteeringWheelAngle.

4.3 Pareto Front of Time-Fuel Trade-Offs
An interesting point of view would be to also consider the temporal information. Fuel consumption may reduce costs, but time is also quite important. Figure 8 plots the total time against the total fuel per driver. A driver is Pareto efficient if no other driver is both faster and uses less fuel; these drivers form the lower-left frontier. The points to the upper-right are dominated and can improve at least one objective without worsening the other. We obtain the frontier by non-dominated sorting of per-driver (Time, Fuel) totals and colour points by their K-means group, explicitly linking global efficiency to the segment-level patterns identified earlier.

Figure 8: Pareto front
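The non-dominated sorting behind Figure 8 is short. A sketch with our own function name, assuming per-driver time and fuel totals where lower is better on both objectives.

```python
import numpy as np

def pareto_mask(time_total, fuel_total) -> np.ndarray:
    """Boolean mask of Pareto-efficient drivers (lower time and fuel are better)."""
    pts = np.column_stack([np.asarray(time_total), np.asarray(fuel_total)])
    efficient = np.ones(len(pts), dtype=bool)
    for i in range(len(pts)):
        # Driver j dominates i if j is no worse on both objectives
        # and strictly better on at least one.
        dominates_i = (pts <= pts[i]).all(axis=1) & (pts < pts[i]).any(axis=1)
        dominates_i[i] = False
        efficient[i] = not dominates_i.any()
    return efficient
```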
5 Discussion
This comparative study shows that rule-based thresholding remains highly interpretable and aligns with prior work, while K-means clustering reveals multi-feature patterns that affect efficiency. In practice, percentile rules flag isolated exceedances, whereas clustering captures cumulative demand and co-variation, explaining the discrepancies observed in segments such as the one in Figure 6. Together, the methods are complementary: thresholding offers transparent guardrails; clustering provides a broader, context-aware view.

6 Conclusions
The results suggest that integrating both statistical and machine learning perspectives offers a more robust and nuanced driver assessment for fuel efficiency. While classical thresholding offers transparency, machine learning enables the discovery of complex patterns. Future work should further validate these findings to develop hybrid driver feedback systems. We only used SHAP diagnostically; a more systematic SHAP analysis across models, segments, and time would be interesting, to stabilize attributions and translate them into actionable feedback.

Acknowledgements
We thank the AAER Research d.o.o. team, led by CEO Matej Vengust, for access to simulator data and expert support. We also acknowledge support from the EDIH DIGI-SI project.

References
[1] Oscar Delgado, Felipe Rodríguez, and Rachel Muncrief. 2017. Fuel Efficiency Technology in European Heavy-Duty Vehicles: Baseline and Potential for the 2020–2030 Timeframe. White Paper. The International Council on Clean Transportation (ICCT), July 2017. https://theicct.org/publication/fuel-efficiency-technology-in-european-heavy-duty-vehicles-baseline-and-potential-for-the-2020-2030-timeframe/.
[2] Hung Nguyen, George Tsaramirsis, Ilir Mborja, Dhimitraq Dervishi, Eriona Hoxha, Stavros Shiaeles, Anastasios Kavoukis, and Stamatios Vologiannidis. 2023. A data-driven framework for incentivising fuel efficient driving behaviour in heavy-duty vehicles. J. Clean. Prod., 420, 139942. doi: 10.1016/j.jclepro.2023.139942.
[3] Shuyan Chen, Hongru Liu, Yongfeng Ma, Fengxiang Qiao, Qianqian Pang, Ziyu Zhang, and Zhuopeng Xie. 2024. High fuel consumption driving behavior identification and causal analysis based on LightGBM and SHAP. Res. Sq. Preprint. doi: 10.21203/rs.3.rs-4010652/v1.
[4] Alexander Meschtscherjakov, David Wilfinger, Thomas Scherndl, and Manfred Tscheligi. 2009. Acceptance of future persuasive in-car interfaces towards a more economic driving behaviour. In AutomotiveUI 2009 (Sept. 2009), 81–88. doi: 10.1145/1620509.1620526.
[5] Charles Sullivan and Mark Franklin. 2010. An extended driving simulator used to motivate analysis of automobile fuel economy. In Session 1: Tools, techniques, and best practices of engineering education for the digital generation (May 2010). doi: 10.18260/1-2-1153-53783.
[6] Mathieu Maisonneuve. 2013. Characterization of drivers' energetic efficiency: Identification and evaluation of driving parameters related to energy efficiency. Master's thesis. Chalmers University of Technology. https://hdl.handle.net/20.500.12380/185531.
[7] Xiaohua Zhao, Yiping Wu, Jian Rong, and Yunlong Zhang. 2015. Development of a driving simulator based eco-driving support system. Transportation Research Part C: Emerging Technologies, 58, 631–641. doi: 10.1016/j.trc.2015.03.030.
[8] Zhipeng Ma, Bo Nørregaard Jørgensen, and Zheng Ma. 2024. A scoping review of energy-efficient driving behaviors and applied state-of-the-art AI methods. Energies, 17, 2. doi: 10.3390/en17020500.
[9] A. McGordon, J. E. W. Poxon, C. Cheng, R. P. Jones, and P. A. Jennings. 2011. Development of a driver model to study the effects of real-world driver behaviour on the fuel consumption. Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering, 225, 11, 1518–1530. doi: 10.1177/0954407011409116.
[10] Thomas J. Daun, Daniel G. Braun, Christopher Frank, Stephan Haug, and Markus Lienkamp. 2013. Evaluation of driving behavior and the efficacy of a predictive eco-driving assistance system for heavy commercial vehicles in a driving simulator experiment. In 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), 2379–2386. doi: 10.1109/ITSC.2013.6728583.

Automated Explainable Schizophrenia Assessment from Verbal-Fluency Audio

Rok Rajher (rr3244@student.uni-lj.si), Jure Žabkar (jure.zabkar@fri.uni-lj.si)
University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia

Abstract
Schizophrenia is associated with cognitive impairments that are difficult to assess with traditional neuro-psychological tests. Currently, these tests are manually administered by clinical doctors and rely on subjective assessment of the patient's behavior, self-reported symptoms, medical history, and mental state. Recent advances in deep learning have substantially improved automatic speech recognition (ASR) and large language models (LLMs), enabling the development of computational tools that can partially automate aspects of psychiatric assessment.
We present the first fully automated classification of individuals with schizophrenia based on verbal-fluency tests conducted in the Slovene language. Our multi-stage pipeline involves audio preprocessing, automatic transcription using the Truebar ASR model, the extraction of meaningful verbal and non-verbal features, and learning a machine learning model. The Explainable Boosting Machine (EBM) trained on the obtained feature set achieved the best overall performance.

Keywords
schizophrenia, automatic speech recognition, large language models, verbal-fluency tasks, machine learning

1 Introduction
Schizophrenia is a chronic and severe mental disorder [8, 11] that affects how a person thinks, feels, and behaves. As a psychotic disorder it is characterized by a combination of disorganized thinking and behavior, hallucinations, and delusions [2, 14]. The symptoms have major implications for an individual's social life and can lead to lifelong care [1, 7]. Schizophrenia affects about 1% of the population worldwide [9].

Currently, there is no objective or standardized diagnostic test for schizophrenia. The most widely used diagnostic frameworks in clinical practice are the DSM-5 [2] and the ICD-11 [14]. With rapid improvements in automatic speech recognition (ASR), large language models (LLMs), and machine learning, there is rising interest in computational tools that support, augment, or partially automate aspects of psychiatric assessment.

Clinicians have long noted that schizophrenia systematically affects speech in two ways:
(1) how people speak: acoustic-prosodic markers such as pause structure, speech rate, and prosodic variability, and
(2) what they say: lexical-semantic markers such as category switching, perseverations, and vocabulary diversity.
These are best observed during verbal-fluency tasks: short, standardized, low-burden, and already used in clinical practice. Our main hypothesis is that short recordings of Slovene verbal-fluency tasks contain sufficient discriminative signal, captured by acoustic and semantic features, to separate individuals with schizophrenia from healthy controls.

In this paper, we present an automated machine learning pipeline for the detection and explanation of schizophrenia, leveraging the capabilities of ASR models and state-of-the-art LLMs. The tests were conducted in the Slovene language and consisted of two one-minute subtasks: (1) a semantic fluency task, where participants were asked to list as many animal names as possible, and (2) a phonetic fluency task, where participants were instructed to generate words beginning with the letter 'L'. The approach is based on audio recordings of verbal fluency tests collected by Marinković [10]. Our results can be directly compared to those reported by Marinković [10], where the transcription and analysis of the tests were performed manually. The details of our study are extensively described in [13].

2 Methods
2.1 Participants
The dataset comprises 126 participants: 58 individuals with a clinical diagnosis of schizophrenia (SH) and 68 healthy controls (HC). All individuals in the SH group were patients admitted to the University Psychiatric Clinic Ljubljana. All participants were adults aged 18 years or older and gave consent to being part of the experiment.

Standard demographic information was collected for each participant, including age, gender, highest level of education, academic performance (school grades), marital status, and employment status. The dataset is balanced with respect to age and gender.

For participants diagnosed with schizophrenia, additional clinical information was recorded: illness duration, number of hospitalizations, and the presence of chronic or co-occurring health conditions. The median illness duration among individuals with schizophrenia was 10 years, with a median of 4 hospitalizations.
The study was approved by the Medical Ethics Committee of the Republic of Slovenia (approval number: 0120-51/2024-2711-4). All participants received a detailed explanation of the study procedures and provided written informed consent prior to participation.

Table 1: Demographic and clinical characteristics of the participants.

Measure                              SH       HC
Total Participants                   58       68
Male Distribution                    29       35
Female Distribution                  29       33
Median Age (years)                   45       46.5
Median Primary School Grade          3        5
Median High School Grade             3        4
Median Illness Duration (years)      10       –
Median Number of Hospitalizations    4        –
Prevalent Education Level            Elem.    HS
Prevalent Marital Status             Married  Married
Prevalent Employment Status          Retired  Employed

2.2 Testing procedure
Each participant completed a verbal fluency test consisting of two sub-tasks:
(1) Phonetic fluency task: participants were asked to produce as many Slovene words as possible beginning with the letter 'L'. Proper nouns, including names of people or places, were not allowed. The task lasted 62 seconds in total: during the first 2 seconds, the letter 'L' was displayed on the screen, followed by 60 seconds for verbal response.
(2) Semantic fluency task: participants were instructed to name as many animals as possible in the Slovene language. Pet names and proper nouns were not allowed. The task duration was 60 seconds.

The testing procedure was standardized: each individual was seated in front of a laptop computer. After reading the instructions for the phonetic fluency task, the participant pressed a key to begin, initiating the countdown. After completing the first task, the instructions for the second task (semantic fluency) were displayed. Again, the participant initiated the task by pressing a key when ready. This concluded the verbal fluency test.

Healthy participants were tested at the Faculty of Computer and Information Science, University of Ljubljana, while individuals with schizophrenia were assessed at the University Psychiatric Clinic Ljubljana. To ensure consistency across conditions, all recordings were conducted in quiet, isolated rooms to eliminate possible noise and distractions. All WAV files then underwent the same audio enhancement pipeline: (i) dynamic range compression to reduce variability due to speaking loudness and microphone distance, and (ii) loudness normalization to achieve consistent perceived loudness across recordings. These steps were implemented with standard functions from pydub and applied identically to both sites prior to feature extraction.

2.3 Data Format
The final dataset consists of 126 WAV audio recordings, one per participant, captured using the built-in laptop microphone during the test sessions. The audio tracks are encoded in uncompressed PCM format at a sampling rate of 44.1 kHz with a single (mono) audio channel. Additionally, there are 126 corresponding CSV files containing timestamps that indicate the start and end times of each subtask. Together, these audio and timestamp files serve as the primary data sources for all subsequent audio- and speech-based analyses.

3 Preprocessing
3.1 Audio Data Preparation
The WAV recordings were initially divided into two distinct audio segments using the provided timestamp files: (1) a segment corresponding to the phonetic verbal fluency task and (2) a segment corresponding to the semantic verbal fluency task.

Both audio segments were then processed through a series of audio enhancement steps:
(1) Dynamic range compression: to improve audio quality and ensure uniformity, downward dynamic range compression (threshold = -20.0 dBFS, ratio = 4:1, attack time = 5 ms, release time = 50 ms) was applied to each segment. This reduces the volume gap between the quietest and loudest parts of a signal [6].
(2) Loudness normalization: adjusting each segment to a target level of -20 dBFS. This ensured consistent perceived loudness across all recordings, reducing variability from differences in speaker volume, room acoustics, or microphone distance.
(3) Final output: finally, the two fully processed segments per participant (phonetic and semantic) were saved as separate WAV files. These files constitute the final audio data used for all subsequent analyses.
All of the described steps were implemented using standard functions provided by the pydub library.
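The enhancement steps map directly onto pydub calls. A minimal sketch of one segment's processing, with our own function and path names; the parameters are the ones listed above.

```python
from pydub import AudioSegment
from pydub.effects import compress_dynamic_range

def enhance_segment(in_path: str, out_path: str, target_dbfs: float = -20.0):
    """Compress dynamics, then normalise loudness to the target level."""
    audio = AudioSegment.from_wav(in_path)
    audio = compress_dynamic_range(
        audio, threshold=-20.0, ratio=4.0, attack=5.0, release=50.0
    )
    # Gain shift so the segment's average loudness sits at -20 dBFS.
    audio = audio.apply_gain(target_dbfs - audio.dBFS)
    audio.export(out_path, format="wav")
```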
3.2 Feature Engineering
After the automated transcriptions had been processed, we performed feature engineering. Based on clinical knowledge, we created meaningful features that serve as reliable markers for distinguishing between individuals with and without schizophrenia. Three core symptoms of schizophrenia are directly applicable to our verbal-fluency tasks: disorganized speech, disorganized behavior, and negative symptoms. The primary rationale behind our feature construction is grounded in these core symptom domains.

Audio recordings are represented in two forms: (1) as text, derived from automated ASR transcriptions, and (2) as spectrograms, a visual representation of the frequency content of the audio signal over time. We constructed two groups of features:
(1) Verbal features: 39 features derived from the automated text transcriptions. These features aim to quantify disorganized speech, e.g., the number of phrases produced per second (illustrated in the sketch below).
(2) Non-verbal features: 17 features extracted directly from the spectrograms of the audio recordings. These features target prosodic elements such as pitch and vocal control, which are key indicators of negative symptoms like blunted affect and of disorganized behavior; e.g., mean pitch, representing the speaker's average vocal pitch.
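As an illustration of the verbal/tempo feature group, the following sketch computes two such markers from ASR word timestamps. The feature names are illustrative analogues of those reported in Section 4.1, not the pipeline's actual identifiers.

```python
import numpy as np

def tempo_features(word_times, task_seconds: float = 60.0) -> dict:
    """Simple tempo markers from (start, end) word timestamps in seconds."""
    starts = np.array([s for s, _ in word_times])
    ends = np.array([e for _, e in word_times])
    gaps = starts[1:] - ends[:-1]  # silences between consecutive words
    return {
        "words_per_second": len(word_times) / task_seconds,
        # Longest silent pause as a share of the whole task.
        "max_gap_percent": 100.0 * gaps.max() / task_seconds if gaps.size else 100.0,
        "mean_gap": float(gaps.mean()) if gaps.size else task_seconds,
    }
```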
3.3 Automated Transcription
The most critical step in the preprocessing of audio recordings is the generation of automated transcriptions. These ASR-derived transcriptions serve as the primary input for nearly all subsequent stages of feature extraction and machine learning analysis. We employed the ASR model Truebar 24.05, a state-of-the-art speech recognition system for the Slovene language. The model was developed by the company Vitatis in collaboration with the Laboratory for Data Technologies at the Faculty of Computer and Information Science. Using the Truebar API, we programmatically uploaded each audio file and in response received the corresponding transcribed words along with their start and end timestamps.

3.4 Transcription Adjustment
The output of the ASR system consists of transcribed words along with their associated timestamps. These transcriptions may include irrelevant content such as filler words. We used the DSPy library, a Python framework that enables declarative programming for prompting LLMs in a modular and programmatic way, in combination with the GPT-4o model. The transcription adjustment process was divided into two sequential steps:
(1) Transcription filtering: the raw transcription output from the Truebar ASR model was first passed to the GPT-4o model along with a description of the verbal fluency task and its rules. The model was instructed to retain only the words it considered to be relevant, without modifying the words themselves.
(2) Transcription correction: the filtered transcription was then forwarded to the model in a second pass. With the same task context and rules provided, the model was now asked to adjust incorrectly transcribed words to what it inferred the participant likely intended to say. Since a word could potentially also be a neologism, we explicitly instructed the model to apply corrections only when the intended word was judged to be clear and obvious; otherwise, the word was left unchanged. For example, a misrecognized word like 'lon' would be corrected to 'slon' (elephant), whereas unclear or ambiguous cases were preserved as-is.

3.5 Adding Semantic Meaning
After filtering and correcting the transcriptions, we tagged each word with semantic annotations relevant to the verbal fluency task. These semantic features are crucial for distinguishing between HC and SH, as they capture subtle linguistic anomalies commonly associated with schizophrenia. We used the DSPy framework in combination with the GPT-4o language model to perform automated semantic tagging. The model was provided with task-specific instructions and context for each word. For each transcribed word, we extracted the following semantic tags:
• Intrusion: the word is semantically unrelated to the target category (e.g., a non-animal word during the animal naming task). Intrusions are often more frequent in individuals with schizophrenia and reflect impaired cognitive control and semantic memory organization [5].
• Stiltedness: marks whether the word appears overly formal, unusual, or unnatural in everyday speech. Stilted language is a known linguistic feature in schizophrenia and may signal underlying disruptions in pragmatic language use [12].
• Neologism: a newly coined or nonsensical word not found in the lexicon. Neologisms are characteristic of disorganized thought and speech, and are especially relevant in schizophrenia research [3].
• Word description (semantic task only): a general, page-long descriptive summary of the word. For animals, this includes common features such as appearance, habitat, and behavior, providing a semantic embedding that captures how the word is typically perceived by the general population. In the case of neologisms, the semantic meaning was still applied based on what the word could plausibly represent or mean, allowing the model to assign an approximate semantic embedding even for novel or invented terms. This feature is used only for the semantic task, where meaning-based associations between words are essential.

3.6 Data Analysis
We trained and evaluated several machine learning models using these features. To ensure robust evaluation, we applied stratified 10-fold cross-validation. Performance was assessed using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC). The Explainable Boosting Machine (EBM) consistently achieved the best results when trained on the full feature set. We additionally examined the top 10 most informative features to assess model interpretability. This approach enables us to better understand which deficits are most prominent in individuals with schizophrenia and may be useful for targeted clinical interventions.
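The evaluation protocol of Section 3.6 corresponds to a few lines with scikit-learn and the interpret package. A sketch reporting two of the paper's metrics; X and y stand for the assembled feature matrix and SH/HC labels.

```python
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

def evaluate_ebm(X, y) -> dict:
    """Stratified 10-fold CV of an EBM; returns mean AUC and accuracy."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_validate(
        ExplainableBoostingClassifier(random_state=0),
        X, y, cv=cv, scoring=["roc_auc", "accuracy"],
    )
    return {m: scores[f"test_{m}"].mean() for m in ("roc_auc", "accuracy")}
```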
Ver- After filtering and correcting the transcriptions, we tagged each bal features are more likely to reflect educational attainment word with semantic annotations relevant to the verbal fluency (e.g., lexical diversity, category switching), whereas core acous- task. These semantic features are crucial for distinguishing be- tic markers (e.g., pause structure, longest silent pause) are less tween HC and SH, as they capture subtle linguistic anomalies dependent on education [4]. In our 10-fold CV, V and N models commonly associated with schizophrenia. We used performed comparably, and VN performed best. This suggests DSPy frame- work in combination with the that education alone is unlikely to explain the classification. GPT-4o language model to per- form automated semantic tagging. The model was provided with task-specific instructions and context for each word. For each 4.1 Global interpretation transcribed word, we extracted the following semantic tags: The overall feature importance (FI) across the entire dataset is • Intrusion: used for global interpretation of the model. We calculate it as the The word is semantically unrelated to the tar- get category (e.g. non-animal word during the animal nam- average absolute contribution of each feature across all samples: ing task). Intrusions are often more frequent in individuals 𝑛 1 ∑︁ = FI with schizophrenia and reflect impaired cognitive control 𝑗 𝑓 𝑗 ( 𝑥 𝑖, 𝑗 ) , (1) 𝑛 and semantic memory organization [5]. 𝑖 =1 • where Stiltedness: Marks whether the word appears overly for-𝑛 is the total number of samples, and 𝑓𝑗 (𝑥𝑖, 𝑗) is the contri- mal, unusual, or unnatural in everyday speech. Stilted bution of feature 𝑗 for instance 𝑖. FI measures how strongly each language is a known linguistic feature in schizophrenia feature influences the model’s predictions on average. and may signal underlying disruptions in pragmatic lan- Globally most important features are: (1) comb_pho_lev2_- guage use [12]. avg- the Levenshtein similarity between the filtered and ad- • justed transcriptions, which indicates impaired speech fluency, Neologism: a newly coined or nonsensical word not found in the lexicon. Neologisms are characteristic of dis- (2) animal_tempo_max_gap_percent- captures the longest organized thought and speech, and are especially relevant silent pause during the semantic task, (3) animal_sem_cont_- in schizophrenia research [3]. max_coherence_index, animal_sem_cont_kurt_coherence_- • index, and ltest_sem_stat_min_coherence_index- the first Word description (semantic task only): A general, page-long descriptive summary of the word. For animals, two capture the word-to-word coherence, while the third cap- this includes common features such as appearance, habi- tures the lowest phonetic similarity between consecutive words tat, and behavior—providing a semantic embedding that 61 Information Society 2025, 6–10 October 2025, Ljubljana, Slovenia Rajher and Žabkar during the phonetic task, (4) comb_osmile_F0From27.5Hz_- to all audio, and (iii) demonstrating that transcript-only models stddeNorm_avg- the standard deviation of pitch; highlights vari- (verbal features) remain predictive, indicating that performance ability in vocal pitch — a marker of prosodic irregularity often is not driven by background acoustics. Future studies should also observed in individuals with schizophrenia. include participants with other psychiatric conditions, such as major depressive disorder or bipolar disorder. 
4.2 Local interpretation
Each individual prediction can be explained through the positive/negative contribution of each feature. Features with positive contributions increase the log-odds in favor of the schizophrenia class, while features with negative contributions decrease the log-odds, shifting the prediction toward the healthy control class. An example for a severe schizophrenia case is shown in Fig. 1.

Figure 1: Local feature importance plot for a severe schizophrenia case as predicted by the EBM model. Red bars indicate contributions toward the schizophrenia class, and blue bars toward the healthy control class.

The corresponding textual explanation was generated by the GPT-4o model: "The results from the verbal fluency test indicate several features often associated with schizophrenia. Short pauses between utterances may suggest rushed or pressured speech, which can be a sign of reduced speech planning. Low semantic coherence in structured tasks may indicate the intrusion of unrelated thoughts or semantic derailment. Additionally, long pauses between utterances can reflect cognitive slowing or difficulty with word retrieval. These features collectively suggest the possibility of schizophrenia."
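A sketch of how such a per-case breakdown can be read out of a fitted interpret EBM. The function name is ours, and the exact layout of the explanation object may differ across interpret versions, so treat this as indicative rather than definitive.

```python
def explain_case(ebm, x_row):
    """Print one participant's per-feature contributions, largest first.

    `ebm` is a fitted ExplainableBoostingClassifier; positive scores push
    the prediction toward the SH class, negative toward HC.
    """
    data = ebm.explain_local(x_row).data(0)
    pairs = sorted(zip(data["names"], data["scores"]),
                   key=lambda p: -abs(p[1]))
    for name, score in pairs:
        print(f"{name:45s} {score:+.3f}")
```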
The results suggest that, on average, the models are able to rank individuals effectively (high AUC); they can distinguish between HC and SH in terms of relative probability. The low CA, sensitivity, PPV, and F1 scores suggest that the chosen classification threshold of 0.5 may not be optimal. This issue was further addressed by evaluating the ROC curve of the best-performing model to explore whether an alternative classification threshold could improve the identification of positive cases; we observed that both the Youden-optimal threshold and the F1-optimal threshold are approximately 0.49, which differs negligibly from the used value of 0.5. The performance of our best model, the EBM, shows its strong ranking ability and balanced classification performance on both classes.

Limitations and Future Work
Although our dataset is well balanced, the sample size (126) is rather small; additional samples would improve model generalizability and robustness. Audio quality could be improved by using professional microphones instead of built-in laptop microphones, which would enhance transcription accuracy. Because the audio recordings were obtained at two locations, a residual site effect cannot be fully excluded. We mitigated this risk by (i) using identical task instructions and timing in quiet rooms at both sites, (ii) applying uniform dynamic range compression and loudness normalization to all audio, and (iii) demonstrating that transcript-only models (verbal features) remain predictive, indicating that performance is not driven by background acoustics. Future studies should also include participants with other psychiatric conditions, such as major depressive disorder or bipolar disorder.

Conclusion
We developed and evaluated an automated, explainable pipeline for schizophrenia assessment using 126 verbal-fluency audio recordings (healthy controls: 68; schizophrenia: 58). The pipeline comprises audio pre-processing, automatic transcription with the Truebar ASR model, and extraction of verbal (transcript-derived) and non-verbal (acoustic/temporal) features. The features were then used in training and evaluating several classical machine-learning models.

Across models, combining verbal and non-verbal features consistently yielded the strongest results. The Explainable Boosting Machine achieved the highest performance: CA 0.82, Sens. 0.76, Spec. 0.87, PPV 0.83, F1 0.79, and AUC 0.90. Due to the EBM's inherent interpretability, we produced global explanations and local explanations (per-instance contribution plots), complemented by GPT-4o-generated textual summaries. The high model performance and associated explanations provide firm ground for a potential decision support system in clinical practice.

5 Acknowledgments
This work was partially supported by the Slovenian Research Agency (ARIS), research program Artificial Intelligence and Intelligent Systems (Grant No. P2-0209).

References
[1] Bandar AlAqeel and Howard C. Margolese. 2012. Remission in schizophrenia: Critical and systematic review. Harvard Review of Psychiatry 20, 6 (2012), 281–297.
[2] American Psychiatric Association. 2013. Diagnostic and statistical manual of mental disorders (5th ed.). American Psychiatric Publishing, Arlington, VA.
[3] Janna N. De Boer, Sanne G. Brederoo, Alban E. Voppel, and Iris E. C. Sommer. 2020. Anomalies in language as a biomarker for schizophrenia. Current Opinion in Psychiatry 33, 3 (2020), 212–218.
[4] J. N. De Boer, A. E. Voppel, S. G. Brederoo, H. G. Schnack, K. P. Truong, F. N. K. Wijnen, and I. E. C. Sommer. 2023. Acoustic speech markers for schizophrenia-spectrum disorders: a diagnostic and symptom-recognition tool. Psychological Medicine 53, 4 (March 2023), 1302–1312.
[5] Flavia Galaverna, Adrián M. Bueno, Carlos A. Morra, María Roca, and Teresa Torralva. 2016. Analysis of errors in verbal fluency tasks in patients with chronic schizophrenia. The European Journal of Psychiatry 30, 4 (2016), 305–320.
[6] Dimitrios Giannoulis, Michael Massberg, and Joshua D. Reiss. 2012. Digital dynamic range compressor design—A tutorial and analysis. Journal of the Audio Engineering Society 60, 6 (2012), 399–408.
[7] Josep Maria Haro, Diego Novick, Jordan Bertsch, Jamie Karagianis, Martin Dossenbach, and Peter B. Jones. 2011. Cross-national clinical and functional remission rates: Worldwide Schizophrenia Outpatient Health Outcomes (W-SOHO) study. The British Journal of Psychiatry 199, 3 (2011), 194–201.
[8] Thomas R. Insel. 2010. Rethinking schizophrenia. Nature 468, 7321 (2010), 187–193.
[9] Stephen R. Marder and Tyrone D. Cannon. 2019. Schizophrenia. The New England Journal of Medicine 381, 18 (2019), 1753–1761. doi:10.1056/NEJMra1808803
[10] Mila Marinković. 2024. Analysis of speech fluency in patients with schizophrenia. Master's Thesis, University of Ljubljana, Faculty of Computer and Information Science.
[11] Robert A. McCutcheon, Tiago Reis Marques, and Oliver D. Howes. 2020. Schizophrenia—An overview. JAMA Psychiatry 77, 2 (2020), 201–210.
[12] Victor Peralta, Manuel J. Cuesta, and Jose de Leon. 1992. Formal thought disorder in schizophrenia: A factor analytic study. Comprehensive Psychiatry 33, 2 (1992), 105–110.
[13] Rok Rajher. 2025. Automatic Generation of Explanations in Diagnosing Schizophrenia Using Speech Fluency Testing. Master's Thesis, University of Ljubljana, Faculty of Computer and Information Science.
[14] World Health Organization. 2022. ICD-11: 6A20 Schizophrenia. Retrieved from https://icd.who.int/browse/2025-01/mms/en#1683919430.

Mapping Medical Procedure Codes Using Language Models

Mariša Ratajec (ratajec.marisa@gmail.com), University of Ljubljana, Faculty of Electrical Engineering; Jožef Stefan Institute, Ljubljana, Slovenia (corresponding author)
Anton Gradišek (anton.gradisek@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Nina Reščič (nina.rescic@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

Abstract
Aligning medical procedure codes across national classification systems is a challenging task.
We investigate the mapping of Slovenian KTDP expressions to German OPS codes using fuzzy matching, biomedical language models (BioBERT, GatorTron), a hybrid approach, and ChatGPT. In the absence of ground truth, we assess consistency across methods and conduct manual reviews. Results show that differences in code structure and expression detail pose major barriers to alignment. Expert validation will be essential for improving accuracy.

Keywords
procedure coding, KTDP, OPS, semantic similarity, BioBERT, fuzzy matching, GatorTron, ChatGPT

1 Introduction
Different countries maintain their own national classification systems for medical procedures, used for clinical documentation, reimbursement, public reporting, and statistical analysis. In Slovenia, healthcare professionals rely on a domestic procedural coding system, while in Germany, the Operationen- und Prozedurenschlüssel (OPS) is used.

At the University Medical Centre (UMC) Ljubljana in Slovenia, interest has emerged in matching expressions from the Klasifikacija terapevtskih in diagnostičnih postopkov in posegov (KTDP) with the German OPS classification system. The purpose is to allow international reporting, cost estimation, and comparative analysis of healthcare procedures.

1.1 Problem Outline
Aligning Slovenian procedural expressions with German OPS codes is a complex task. The Slovenian dataset contains approximately 6,000 expressions, whereas the German OPS classification includes more than 60,000 distinct entries, covering multiple levels of specificity in various medical domains. Manual mapping is time-consuming and impractical, primarily due to the size of the datasets and the absence of convenient tools for efficient code retrieval and comparison.

To address this challenge, we explored the development of computational approaches to support and accelerate the mapping process. Due to the nature of the data and the semantic variation between codes, we tested several techniques, including fuzzy string matching, semantic similarity scoring, and large language models (LLMs), such as BioBERT, GatorTron, and OpenAI models. We also explored a hybrid approach that integrates fuzzy matching with LLM-derived semantic embeddings.

In this paper, we present the application of the selected methods for aligning Slovenian KTDP procedure expressions with German OPS codes. We evaluate their performance and limitations, and discuss key challenges associated with this type of code matching problem.

2 Methodology
2.1 Datasets
2.1.1 Slovenian Dataset. The Slovenian dataset is based on the Klasifikacija terapevtskih in diagnostičnih postopkov in posegov (KTDP) [6], version 11, which has been officially implemented nationwide since 1 January 2023. This classification system is used to code medical procedures at all levels of healthcare in Slovenia and is structurally aligned with the Australian Classification of Health Interventions (ACHI), adapted to the local context.

KTDP consists of 20 chapters, each covering a different clinical domain. The chapters are organised primarily by body system (e.g., nervous, endocrine, cardiovascular), with additional sections dedicated to dental care, imaging services, radiation oncology, and interventions not elsewhere classified. Each chapter is subdivided into multiple blocks, which group related procedures under shared headings.

In total, the classification includes approximately 6,000 unique procedures. Each is assigned a specific code in a structured numeric format composed of a five-digit base and a two-digit extension (e.g., 36564-00).

2.1.2 German Dataset. The German dataset is based on the Operationen- und Prozedurenschlüssel (OPS), version 2024 [2], which is officially used nationwide for coding medical procedures. Maintained by the Federal Institute for Drugs and Medical Devices (BfArM), OPS is revised annually. It is derived from the WHO's International Classification of Procedures in Medicine (ICPM) and adapted to the German healthcare system.
The classification is organised into six main chapters, covering the following clinical domains: diagnostic measures (Chapter 1), imaging diagnostics (Chapter 3), surgical procedures (Chapter 5), medications (Chapter 6), non-operative therapeutic measures (Chapter 8), and supplementary measures (Chapter 9). Each chapter is further subdivided into categories and blocks, which group related procedures based on functional or anatomical criteria.

OPS comprises approximately 60,000 unique procedures. Each is assigned a hierarchical alphanumeric code, consisting of a four-digit base and optional numeric or alphanumeric extensions (e.g., 5-384.50 or 8-844.5c). The coding system follows a structured hierarchy, beginning with the chapter number (e.g., 5 for surgical procedures), followed by a category (e.g., 5-38 for vascular surgery) and subcategories (e.g., 5-384 for specific surgical techniques). The digits and characters after the dot denote the exact intervention.

2.1.3 Differences and Similarities between Datasets. Although both classification systems serve a similar purpose, they differ in structure and level of detail. The German dataset includes very specific and thoroughly described procedures, clearly outlining each individual service. The Slovenian system, on the other hand, uses broader and more general descriptions, without the same amount of detail or length.

Moreover, there is limited direct lexical overlap between the two datasets. Even when procedures are conceptually similar, their descriptions often differ in phrasing, level of specificity, or use of synonyms. As a result, one-to-one matching is rarely straightforward and requires both structural alignment and semantic interpretation.
2.2 Pipeline
The overall process is summarised in a pipeline diagram (Figure 1), which outlines each step, from dataset preparation and translation to the application of matching methods and the structure of the resulting outputs. Each component of the pipeline is described in detail in the following subsections.

Figure 1: Overview of the matching pipeline and example results. KTDP expressions in English were aligned to translated OPS expressions using five methods: fuzzy matching, BioBERT, a combined BioBERT+fuzzy approach, GatorTron, and OpenAI ChatGPT. All methods except ChatGPT produced structured outputs with match scores, as shown in the example results table. ChatGPT returned only contextual matches without comparable scoring and was therefore excluded from the standardised evaluation table.

2.2.1 Translation. Since Slovenian KTDP expressions were already available in English, the German OPS procedure names were translated to English to enable semantic comparison. For this purpose, we used the MarianMT model (Helsinki-NLP/opus-mt-de-en) [4], a transformer-based neural machine translation model. Although not specifically fine-tuned for clinical texts, MarianMT has demonstrated strong performance in medical translation tasks, particularly for structured terminology [5], making it a suitable and practical choice for this application.

2.2.2 Language-based code pairing. To perform code matching, we initially applied a language-based code pairing approach using fuzzy matching, implemented via the RapidFuzz library [1]. Fuzzy matching is particularly useful in cases where expressions differ slightly in wording, structure, or spelling. We applied the token set ratio, which compares the sets of unique words in two strings and calculates a similarity score based on the overlap of unique tokens, with edit distance applied to the remaining unmatched parts. This method is insensitive to word order and robust to minor variations in phrasing. Using this approach, each English KTDP expression was compared with all translated OPS descriptions. For each KTDP entry, we selected the best-matching OPS procedure based on the highest fuzzy similarity score and recorded the corresponding code, description, and score for further analysis.
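The per-expression search of Section 2.2.2 is essentially a one-liner with RapidFuzz; list names are ours.

```python
from rapidfuzz import fuzz, process

def best_fuzzy_match(ktdp_text: str, ops_texts: list, ops_codes: list):
    """Return (code, description, score) of the best token_set_ratio match."""
    description, score, idx = process.extractOne(
        ktdp_text, ops_texts, scorer=fuzz.token_set_ratio
    )
    return ops_codes[idx], description, score
```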
2.2.3 Semantic-based code pairing. As a second approach, we applied semantic-based code pairing using contextual embeddings derived from transformer-based language models. Specifically, we tested two pretrained models: pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb [3], a SentenceTransformer variant of BioBERT fine-tuned on biomedical and inference tasks, and UFNLP/gatortron-base [10], a GatorTron model pre-trained on large-scale clinical corpora. Both models were selected for their strong performance in biomedical language understanding [7] and to investigate how model choice influences the quality of semantic code alignment.

Using each model, both KTDP expressions and translated OPS descriptions were encoded into dense semantic vectors. Cosine similarity was then computed between each KTDP embedding and all OPS embeddings to assess semantic closeness. As in the previous approach, the top matching OPS procedure for each KTDP expression was selected and recorded following the same procedure as before.

2.2.4 Combined code pairing. In addition to the individual use of semantic and lexical methods, we implemented a hybrid matching approach that combines the strengths of both. Specifically, we integrated semantic similarity scores obtained from BioBERT embeddings with lexical similarity scores derived from fuzzy matching (token set ratio). For each KTDP expression, both similarity measures were computed independently against all translated OPS descriptions. The final similarity score for each pair was calculated as a weighted average:

score_final = w_semantic · score_semantic + w_lexical · score_lexical

We experimented with two weighting schemes: one with equal weights (w_semantic = 0.5, w_lexical = 0.5) and another prioritising semantic similarity (w_semantic = 0.7, w_lexical = 0.3), to assess how different balances influence match quality. For each KTDP expression, the OPS description with the highest combined score was selected and recorded along with the corresponding code and similarity score.

This approach was motivated by practical observations in the literature, where combining surface-level and context-aware similarity often yields more robust results, especially in cases where purely semantic models overlook minor wording differences or where lexical methods fail to capture deeper conceptual alignment [9].

2.2.5 ChatGPT code pairing. As a final exploratory method, we used a custom ChatGPT instance (GPT-4o, OpenAI) [8] to evaluate the potential of conversational large language models (LLMs) for code matching. We uploaded all relevant documentation, including KTDP expressions, translated OPS procedures, and background materials, to a private GPT environment. For each KTDP entry, we either asked the model to suggest the best-matching OPS procedure directly or first requested an interpretation of the KTDP term followed by a context-based match. This approach allowed us to assess whether ChatGPT's contextual reasoning could complement or outperform traditional embedding-based or lexical matching methods.
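Sections 2.2.3 and 2.2.4 combine into the following sketch, using the BioBERT SentenceTransformer named above; in practice the OPS corpus would be encoded once rather than per query, and the function name is ours.

```python
from rapidfuzz import fuzz
from sentence_transformers import SentenceTransformer, util

MODEL = SentenceTransformer(
    "pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb"
)

def combined_scores(ktdp_text, ops_texts, w_semantic=0.7, w_lexical=0.3):
    """Weighted blend of cosine similarity and token_set_ratio (both in [0, 1])."""
    q = MODEL.encode(ktdp_text, convert_to_tensor=True)
    c = MODEL.encode(ops_texts, convert_to_tensor=True)
    semantic = util.cos_sim(q, c).squeeze(0)  # one score per OPS description
    lexical = [fuzz.token_set_ratio(ktdp_text, t) / 100.0 for t in ops_texts]
    return [w_semantic * float(s) + w_lexical * l
            for s, l in zip(semantic, lexical)]
```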
2.2.5 ChatGPT code pairing. As a final exploratory method, we used a custom ChatGPT instance (GPT-4o, OpenAI) [8] to evaluate the potential of conversational large language models (LLMs) for code matching. We uploaded all relevant documentation, including KTDP expressions, translated OPS procedures, and background materials, to a private GPT environment. For each KTDP entry, we either asked the model to suggest the best-matching OPS procedure directly or first requested an interpretation of the KTDP term followed by a context-based match. This allowed us to assess whether ChatGPT's contextual reasoning could complement or outperform traditional embedding-based or lexical matching methods.

3 Evaluation
The absence of a validated ground truth presents a fundamental challenge in assessing the quality of our matching results. Without expert clinical validation, it is unclear how accurate individual matches are or which method performs best. To address this, we first conducted a broad quantitative analysis to evaluate consistency, disagreement, and similarity across methods. These metrics provide indirect but informative insights into model behaviour, helping to characterise matching patterns even in the absence of formal validation. Following this initial assessment, we performed a small-scale manual review to better understand the plausibility of selected matches. We examined examples with both high and low matching scores, identifying cases of clear agreement as well as notable mismatches. This informal inspection offered additional intuition on method performance and highlighted the need for domain expertise to reliably judge alignment quality.

To begin the quantitative evaluation, we examined how often different methods assigned KTDP expressions to the same general procedural category. To do this, we compared the prefixes of the top-1 matched OPS codes across all methods, where the prefix corresponds to the first digit of the OPS code and indicates the high-level category of the procedure (e.g., diagnostic, surgical, therapeutic). This allowed us to assess agreement at a broader level, independent of specific code details.

The results revealed a relatively high degree of consistency: in 64.2% of cases (n = 4000), all methods returned OPS codes with the same prefix, indicating agreement on the general procedural category. In the remaining 35.8% of cases (n = 2231), there was partial agreement: some methods aligned on the prefix, while others diverged. Notably, there were no cases in which all methods assigned entirely different prefixes, suggesting that at least a minimal level of agreement was always preserved at the category level.

However, when comparing full OPS codes, agreement dropped substantially. Only 2.9% of cases (n = 178) exhibited full consensus across all methods. Most cases (90.1%, n = 5613) fell into the "some same" category, where at least two methods agreed, and 7.1% (n = 440) showed complete disagreement, with each method proposing a different code. These results indicate that, while methods often converge on the general category of a procedure, they frequently differ in the specific code they assign within that category.
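These agreement statistics can be reproduced with a few lines of pandas; the frame layout (one column of top-1 codes per method, with the OPS chapter taken as the first character of the code) is our assumption about the underlying data.

```python
# Sketch of the agreement analysis over top-1 matched codes.
import pandas as pd

def classify(codes):
    """Label a row of codes: all same / some same / all different."""
    uniq = codes.nunique()
    if uniq == 1:
        return "all same"
    if uniq == len(codes):
        return "all different"
    return "some same"

def agreement_stats(df, methods):
    full = df[methods].apply(classify, axis=1)
    prefix = df[methods].apply(lambda row: classify(row.str[0]), axis=1)
    return (full.value_counts(normalize=True),
            prefix.value_counts(normalize=True))

# hypothetical frame: one top-1 OPS code per method and KTDP entry
df = pd.DataFrame({"fuzzy":     ["5-469", "1-100", "5-470"],
                   "biobert":   ["5-469", "3-200", "5-470"],
                   "gatortron": ["5-469", "8-390", "5-471"]})
full, prefix = agreement_stats(df, ["fuzzy", "biobert", "gatortron"])
```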
To further examine how the methods differ in their assignment behaviour, we analysed the distribution of top-1 matched OPS codes across the six main procedural chapters. As illustrated in Figure 2, all methods predominantly mapped KTDP expressions to Chapter 5 (surgical procedures), reflecting the procedural nature of the source data. In contrast, assignments to Chapter 6 (medications) and Chapter 9 (supplementary measures) were relatively infrequent. This general distribution pattern was consistent across methods, indicating a shared tendency to favour procedural codes.

Even so, some notable differences were observed. For example, GatorTron assigned fewer expressions to Chapter 5 compared to the other methods and exhibited a relatively higher proportion of matches to Chapter 8 (non-operative therapeutic measures). Manual review of these cases revealed that many of the expressions lacked a clearly corresponding OPS code, which may have led the model to prefer broader categories. Still, in the absence of expert validation, we cannot determine whether such assignments are more or less accurate.

Figure 2: Distribution of top-1 matched OPS codes across the six main procedural chapters for each matching method. Chapter 1 represents diagnostic measures, Chapter 3 imaging diagnostics, Chapter 5 surgical procedures, Chapter 6 medications, Chapter 8 non-operative therapeutic measures, and Chapter 9 supplementary measures.

To investigate whether certain KTDP procedures are inherently easier to match due to wording or alignment with OPS terminology, we analysed the standardised match score values across all methods using a heatmap (Figure 3). The goal was to determine whether consistent scoring patterns could help identify procedures that are generally easier or more difficult to match, regardless of the specific method used.

The heatmap displays Z-standardised scores for each method, with expressions sorted by BioBERT scores. Although we expected some consistency (i.e., easier expressions receiving higher scores across all methods and harder ones receiving lower scores), the results showed considerable variation. In many cases, a procedure scored higher with one method and lower with another, suggesting that matching difficulty is method-dependent and influenced by how each approach interprets textual or structural similarity.

Notably, BioBERT and the hybrid BioBERT-fuzzy method produced very similar score profiles. GatorTron and the fuzzy approach showed more divergence, indicating different sensitivities to terminology structure, dataset alignment, or surface-level phrasing. This suggests that methods differ not only in which codes they select, but also in how confidently they make those matches.

Figure 3: Heatmap of Z-standardised MATCH_SCORE_1 values across all KTDP expressions, sorted by BioBERT scores. The plot illustrates variation in score strength across methods, highlighting differences in confidence and matching behaviour.

After developing a broader understanding of inter-method differences through quantitative analyses, we conducted a focused manual review of selected examples to qualitatively assess the plausibility of top matches. We examined expressions with both high and low matching scores across methods to explore whether any consistent patterns could be observed.

For expressions with high scores and full agreement across methods, the matches were typically straightforward: the KTDP expression was either identical or highly similar to an OPS entry, often requiring no complex interpretation. These cases tended to represent procedural descriptions that appeared in both datasets with minimal variation.

In contrast, lower-scoring expressions revealed more complex challenges. Two main issues emerged during manual inspection. First, several KTDP procedures had no direct equivalent in the OPS system because they are typically recorded in other coding systems (e.g., vaccinations or disease-specific protocols). Second, many KTDP expressions were written in a general or aggregated form, often combining multiple procedural steps into a single description. OPS, on the other hand, is highly granular, with detailed and precisely defined codes. As a result, some KTDP expressions may correspond to multiple distinct OPS codes, or only partially align with available entries.

These observations suggest that performance limitations are not solely attributable to the matching algorithms themselves, but also to structural mismatches and representational differences between the source datasets. This highlights a key challenge in aligning procedural coding systems across countries.

3.1 ChatGPT
Despite leveraging ChatGPT's capacity for contextual reasoning by first interpreting the KTDP expression and then performing the match, the resulting OPS codes were, in most cases, identical to those produced by the previously described methods. This suggests that the added interpretation step did not substantially improve matching performance. As previously discussed, this outcome likely reflects the inherent differences between the datasets.

4 Conclusion
Our study highlights the considerable challenge of aligning procedural coding systems across countries with different documentation practices. Despite employing a range of computational methods, from fuzzy matching and semantic embeddings to large language models, the observed differences in dataset structure and content significantly limited matching performance. In particular, the lack of detail in some KTDP expressions, the high specificity of OPS codes, and the absence of one-to-one equivalents all contributed to inconsistent or ambiguous results.

Crucially, no ground truth currently exists to objectively evaluate the quality of these matches. Although indirect metrics and manual inspection provide useful information, they cannot replace expert validation. Therefore, the most important next step is to involve medical professionals in generating a gold standard reference set. This would enable formal benchmarking of different methods and support the development of more reliable and generalisable code alignment pipelines in the future.

Ultimately, our findings suggest that the key limitation lies not in the technical capability of the methods themselves, but in the fundamental heterogeneity of the datasets and the differing philosophies of procedural encoding. Addressing this mismatch will be essential for any future efforts to enable international interoperability of procedural coding systems.

Acknowledgments
The authors are grateful for the valuable input and ideas contributed by the medical team of the University Medical Center Ljubljana.

Funding
This work was supported by the Slovenian Research and Innovation Agency (Research Core Funding Number P2-0209).
References
[1] [SW] Max Bachmann. rapidfuzz/RapidFuzz: Release 3.13.0, version v3.13.0, Apr. 2025. doi: 10.5281/zenodo.15133267. url: https://doi.org/10.5281/zenodo.15133267.
[2] Bundesinstitut für Arzneimittel und Medizinprodukte (BfArM). 2023. Operationen- und Prozedurenschlüssel (OPS), Version 2024: Internationale Klassifikation der Prozeduren in der Medizin – Systematisches Verzeichnis. BfArM, Bonn, Germany.
[3] Pritam Deka, Anna Jurek-Loughrey, et al. 2022. Evidence extraction to validate medical claims in fake news detection. In International Conference on Health Information Science. Springer, 3–15.
[4] Marcin Junczys-Dowmunt et al. 2018. Marian: Fast Neural Machine Translation in C++. Tech. rep. arXiv:1804.00344. Demonstration paper, version v3. arXiv, (Apr. 2018). doi: 10.48550/arXiv.1804.00344.
[5] Bunyamin Keles, Murat Gunay, and Serdar Caglar. 2024. LLMs-in-the-loop Part-1: Expert Small AI Models for Bio-Medical Text Translation. Tech. rep. arXiv:2407.12126. Preprint. arXiv, (July 2024). doi: 10.48550/arXiv.2407.12126.
[6] Nacionalni inštitut za javno zdravje (NIJZ). 2023. Klasifikacija terapevtskih in diagnostičnih postopkov in posegov: Pregledni seznam (Verzija 11). NIJZ, Ljubljana, Slovenia.
[7] Zabir Al Nazi and Wei Peng. 2024. Large language models in healthcare and medical domain: a review. Informatics, 11, 3, 57. doi: 10.3390/informatics11030057.
[8] OpenAI. 2024. GPT-4o. Accessed: August 2025. https://openai.com/index/gpt-4o.
[9] Mohammed Suleiman Mohammed Rudwan and Jean Vincent Fonou-Dombeu. 2023. Hybridizing fuzzy string matching and machine learning for improved ontology alignment. Future Internet, 15, 7, 229. doi: 10.3390/fi15070229.
[10] Xi Yang et al. 2022. A large language model for electronic health records. npj Digital Medicine, 5, 1, 194.

AI-Enabled Dynamic Spectrum Sharing in the Telecommunication Sector – Technical Aspects and Legal Challenges

Nina Rechberger
PhD Candidate, Applied AI
Alma Mater Europaea
Maribor, Slovenia
nina.rechberger@almamater.si
© 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.skui.3657

Abstract
Dynamic Spectrum Sharing (DSS), as part of Dynamic Spectrum Management, is already used in the telecommunication sector and is a critical technology for addressing spectrum scarcity in next-generation wireless networks, particularly when implementing 6G. Legacy static spectrum management (designed for one user exclusively, for a certain bandwidth, for certain services) is no longer fit for purpose, as it does not allow the efficient use of the spectrum.
By leveraging Artificial Intelligence (AI), DSS enables the real-time adaptive allocation of radio frequencies, thereby improving spectrum utilization and network efficiency. However, the integration of AI into DSS introduces complex technical and legal challenges. This paper aims to investigate the challenge of dynamic spectrum policy when using AI-enabled DSS and to answer the question of why a flexible, new spectrum policy is desired. Some suggestions for refining the regulatory framework, which are long overdue in academic research, are also presented. Recent research primarily focuses on technical issues rather than specifically on legal ones. The closing findings underscore the need for standardized protocols, adaptive regulatory policies, and other legal frameworks to ensure equitable and efficient spectrum sharing.

Keywords
AI-Enabled Dynamic Spectrum Sharing, AI, spectrum sensing, spectrum right, spectrum regulatory framework

1 Introduction
The integration of Artificial Intelligence (AI) into Dynamic Spectrum Sharing (DSS) introduces technical complexities, such as computational demands and algorithm reliability (e.g., consistency, robustness, and accuracy), alongside legal challenges, including spectrum rights allocation, interference management, and dispute resolution. However, governance frameworks for AI-enabled DSS remain underdeveloped, requiring further exploration.

The rapid growth of wireless devices and data-intensive applications has heightened demand for radio frequency spectrum, a finite resource. Traditional static management often leads to underutilized frequency bands, with inflexible policies exacerbating inefficiencies beyond the spectrum's physical scarcity [1]. AI-enhanced DSS addresses this by enabling flexible, real-time allocation of resources, adapting to dynamic demands and environments while improving spectrum sensing, resource allocation, and interference mitigation.

This study briefly examines the technical and legal dimensions of AI-enabled DSS, identifying challenges and gaps in research. As an initial exploration, it evaluates significant prior work to lay the foundation for future investigations.

2 Technical Aspects of AI-Driven Dynamic Spectrum Sharing
AI-driven DSS leverages all sorts of AI techniques to optimize spectrum utilization in dynamic, complex environments [2, 3, 4].

2.1 Spectrum Sensing and Cognitive Radio
Spectrum sensing is the cornerstone of DSS, enabling real-time detection of spectrum occupancy. AI-based techniques, such as convolutional neural networks (CNNs) and long short-term memory (LSTM) models, enhance spectrum sensing by analyzing signal patterns and predicting spectrum availability [5, 6]. CNNs are highlighted for their ability to extract features from spectral data, improving detection accuracy in noisy environments without relying on prior knowledge of signals. LSTMs are emphasized for their ability to handle sequential and time-series data.

In addition, deep learning-based spectrum sensing achieves up to a 45% improvement in detection accuracy compared with traditional methods, which rely only on basic signal processing techniques, such as energy detection, to identify spectrum occupancy [5]. Cognitive radio networks (CRNs) powered by AI allow users to opportunistically access unused spectrum bands without interfering with other users [3]. The challenges include, among others, the computational complexity of real-time processing and the need for robust datasets to train AI models. Studies highlight that AI models may struggle with unpredictable interference patterns, necessitating hybrid approaches that combine interpretable models (e.g., decision trees) with high-performing deep learning (DL) models [6].
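As an illustration only (the paper cites, but does not specify, such models), a minimal LSTM occupancy predictor of the kind discussed here could look as follows; all dimensions, names, and the sigmoid output head are arbitrary choices for the sketch.

```python
# Illustrative LSTM for next-slot spectrum occupancy prediction.
# Not from the cited works: sizes and inputs are placeholders.
import torch
import torch.nn as nn

class OccupancyLSTM(nn.Module):
    def __init__(self, n_channels=16, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_channels)

    def forward(self, x):                  # x: (batch, time, n_channels)
        out, _ = self.lstm(x)
        return torch.sigmoid(self.head(out[:, -1]))  # P(busy) per channel

model = OccupancyLSTM()
window = torch.rand(8, 50, 16)             # 8 windows of 50 time slots
p_busy = model(window)                     # (8, 16) occupancy probabilities
```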
2.2 Interference Management
Interference management is critical for ensuring reliable connectivity in DSS. AI-driven techniques, such as multi-agent reinforcement learning (MARL), optimize power allocation and beamforming to minimize interference [6]. MARL is used for mitigating jamming attacks, where malicious entities disrupt spectrum utilization by interfering with communications. Another example is reconfigurable intelligent surfaces (RIS) integrated with AI, which can dynamically adjust signal propagation to reduce interference in non-orthogonal CRNs [7]. RIS, also known as an Intelligent Reflecting Surface (IRS), is a passive, planar metasurface composed of a large array of low-cost, tunable unit cells that can dynamically manipulate incident electromagnetic waves. Unlike active devices such as base stations or relays, RIS does not generate or amplify signals—it reflects, refracts, or absorbs them in a programmable way to shape the wireless propagation environment.

Research has demonstrated that AI-driven interference management achieves a spectrum utilization efficiency of up to 62.4% in urban environments, nearly double the utilization efficiency of traditional management [5], although challenges persist, including the scalability of AI models in large networks and the risk of unpredictable behavior in edge cases. Robust fallback mechanisms are necessary to address unpredictable AI behavior in edge cases, while standardized interfaces and protocols are essential for enabling seamless deployment and integration with existing network infrastructure [5].

2.3 Resource Allocation
AI enables dynamic resource allocation by predicting network traffic and allocating spectrum based on real-time demands. Machine learning algorithms, such as support vector machines (SVMs) and deep reinforcement learning (DRL), can forecast spectrum occupancy and optimize bandwidth allocation [8]. For instance, DRL-assisted virtual network embedding (VNE) in satellite networks enhances resource utilization by adapting to multiple coverage constraints [4]. Major obstacles include the need for energy efficiency and the requirement for real-world datasets to enhance prediction accuracy. The absence of standardized testbeds and benchmarks further complicates performance evaluation [2].

3 Regulatory Challenges in AI-Enabled Dynamic Spectrum Sharing
The deployment of AI-driven DSS raises significant regulatory challenges that must be addressed. According to recent research, regulatory issues arise particularly in interference management, spectrum rights, and dispute resolution. Other legal and regulatory questions have, to the best of the author's knowledge, been completely overlooked or only superficially discussed.

3.1 Interference Management
Interference management in DSS requires regulators to ensure compliance with technical standards to prevent harmful interference. Those standards, when using AI, are missing. An example of such AI-driven systems to avoid interference is spectrum access systems (SAS), which use geolocation databases and sensing to manage the shared spectrum [9]. Simultaneously, the complexity of AI algorithms raises concerns about transparency and accountability when unwanted interference occurs. National regulatory bodies already emphasize the need for standardized protocols to ensure equitable access and interference mitigation [10, 11]. Regulators must strike a balance between innovation and the protection of incumbent users and their guaranteed rights to spectrum.

3.2 Spectrum Rights and Equitable Access
Regulatory authorities adopt the fixed spectrum access (FSA) policy to allocate different parts of the radio spectrum, with a certain bandwidth, to certain services. With such a static and exclusive spectrum allocation policy, only the authorized users, also known as licensed users, have the right to utilize the assigned spectrum, and other users are forbidden from accessing it, regardless of whether the assigned spectrum is busy or not [3]. This could be seen as directly opposed to the efficient use of the spectrum, where the use of the spectrum aligns with all available technical possibilities. Spectrum rights allocation is a contentious issue in DSS, as AI enables dynamic access by multiple users and challenges traditional licensing models. Spectrum rights allocation is traditionally static – one user to a particular band.
On the other hand, with shared access regimes, such as licensed shared access (LSA), regulators allow spectrum users to open spectrum bands while protecting incumbent users [12]. However, only a few countries have adopted this option, and it comes with numerous regulatory restrictions. For explanation, incumbent users are historically incumbent telecommunications operators, who paid significant fees for the licence to use the spectrum. Therefore, spectrum licenses are important assets for incumbent users. Nevertheless, AI-driven DSS raises concerns about monopolistic practices because dominant operators may leverage advanced algorithms to secure disproportionate spectrum access [13, 14, 15]. Legal frameworks must therefore evolve to address equitable access for smaller operators and license-exempt users while simultaneously protecting the guaranteed rights of incumbent users/operators. The absence of clear spectrum rights allocation policies risks exacerbating disputes and stifling innovation in the industry.

3.3 Dispute Resolution
Dispute resolution in DSS tackles conflicts over interference, spectrum access, and user priority. AI systems complicate this due to poor interpretability, obscuring decision processes [6]. AI-driven user prioritization can spark fairness disputes. National spectrum strategies propose interagency resolution processes [6, 10]. Explainable AI models (e.g., XAI) improve transparency, aiding dispute resolution [6]. Blockchain-based databases offer tamper-proof spectrum usage records, simplifying conflict resolution [6].

4 The Need for New AI-Enabled DSS Governance: Suggested New Framework
As stated above, traditional regulatory frameworks designed for static spectrum licensing are ill-equipped to handle the autonomous and data-intensive nature of AI. The proposed regulatory framework should impose legal mechanisms to address more flexible licensing, privacy and data protection, interference management, security, and international coordination, ensuring compliance and fostering innovation. The objectives of the new framework, in the author's opinion, are: Enabling Innovation; Ensuring Compliance, that is, aligning with existing laws (e.g., national telecom regulations, the Data Act, the Artificial Intelligence Act, etc.); Promoting Fairness, which means ensuring equitable spectrum access and accountability in AI decisions; Supporting Global Harmonization, to align with international standards (e.g., ITU, 3GPP); Security and Cybersecurity; and Promoting Regulatory Sandboxes, to enable safe testing of AI-driven DSS.
4.1 Proposed Legal and Regulatory Framework
4.1.1 Dynamic Licensing Model. Replacing the current policy of static and exclusive spectrum allocation with the Dynamic Licensing Model is a key principle, or, even better, the Dynamic Licensing Model should be prioritized. This could include a tiered access system (primary, secondary, and opportunistic users) managed by AI-driven Spectrum Access Systems (SAS) [3, 12, 9]. This means enacting laws defining tiered access rights, specifying priority levels and usage conditions. For instance, extending the U.S. Citizens Broadband Radio Service (CBRS) model, where SAS dynamically assigns spectrum, with legal provisions for AI oversight and auditability. Refinements to the European Electronic Communications Code (EECC) [13] to add AI spectrum management tools are another possible example. First, a definition of DSS should be added and represented (e.g., in Art. 2).

The dynamic licensing model can use blockchain-based smart contracts to automate spectrum allocation, ensuring transparency and enforceability. Regulators should issue guidelines for AI algorithms to prioritize licensed users while optimizing opportunistic/dynamic access, and impose penalties for non-compliance.

4.1.2 Privacy and Data Protection. The goal is to require licensed users to implement privacy-preserving AI techniques (e.g., federated learning and differential privacy) to minimize data exposure. Minimal data exposure goes beyond personal data and should be extended to all processed data sets. AI systems in DSS are designed to process only the data necessary for the requested task. Memorized data, such as geolocation and traffic patterns, should be encrypted. Therefore, developing standards for anonymized data processing in DSS, with certification for compliant AI systems, is necessary. For instance, blockchain contracts and differential privacy could enhance efficiency in dense networks and align with the principle of minimizing sensitive data sharing. On the other hand, all the data relevant for enabling AI-enabled DSS must be shared; the Data Act of the EU could address this issue.

Privacy and data protection are strongly connected to the right to explanation (transparency). Therefore, it is necessary to mandate transparency in AI-driven spectrum decisions, allowing users to challenge allocations [6, 11]. Although the Artificial Intelligence Act of the EU requires high-risk AI systems (the DSS component is legally interpreted as critical infrastructure) to face a strong transparency obligation, in the context of DSS it needs to be technically detailed.
DSS can be defined as the shared use of the radio spectrum, enabling flexible, real-time allocation of spectrum bands among multiple users and rights. In spectrum management (Art. 45 EECC), the goal should be, by default, to privilege AI-enabled DSS, adding tiered access for designated services, when appropriate, with appropriate certification. Spectrum management could thus be flexible enough for new technologies and, at the same time, compliant, as an exception to the technology- and service-neutral principle traditionally anchored in the EECC, because general interest objectives are at stake and can be clearly justified and subject to regular review. From a practical point of view, mandating AI-predictive models for real-time allocation in "AI-harmonized" bands that require shared AI datasets could be discussed in future peer reviews. The neutral authorization regime for spectrum designation, with some exceptions, should move to the explicit inclusion of AI/ML, with possible certification for bias-free algorithms and energy metrics in an additional separate regulation, such as the Gigabit Infrastructure Act (GIA), which is intended to simplify access to physical infrastructure in this sector. Art. 46 EECC is meant only to encourage shared access, while default AI-driven DSS could drive spectrum sharing to another level.

4.1.3 Interference Management, Liability and Dispute Resolution. Clear liability rules for AI-induced interference, balancing the responsibilities of operators, secondary users, and vendors, must be established. A shared liability model could be a solution: operators as primary users could be liable for interference unless it is caused by secondary users or by vendor/distributor/supplier AI errors, verified through forensic logs. The interference threshold must be introduced and known up front. Legal limits for acceptable interference, e.g., signal-to-noise ratio standards, should also be defined. The requirement for AI systems to maintain tamper-proof logs of spectrum allocation decisions, accessible to stakeholders when needed, is a good way to ensure the transparent operation of DSS. These logs can then be used as evidence before competent bodies in dispute resolution to resolve interference disputes involving AI decisions [5, 10].

4.1.4 International Standardization. Promoting harmonized standards for AI-driven DSS through international bodies like the ITU and 3GPP is just one side of the remaining challenges, like interoperability. Negotiating bilateral and international treaties to align spectrum sharing protocols and data sovereignty rules is another issue. For instance, the ITU's World Radiocommunication Conference (WRC) could develop model laws for national adoption, ensuring compatibility with global 5G and 6G standards [10, 14, 15]. Cross-border coordination (e.g., Art. 4 EECC) could also be expanded, with RSPG-led cooperation utilizing AI tools for interference resolution.
4.1.5 Security and Cybersecurity. A robust cybersecurity framework for AI-driven DSS systems is aimed at preventing attacks such as data poisoning. Cybersecurity standards for AI-enabled DSS must still be developed. These standards will include encryption, intrusion detection, and regular security audits for AI systems, as well as the reporting of security breaches. Certifying AI systems for DSS cybersecurity compliance, together with the development of AI-enabled defences, is also needed [10, 14, 15].

4.1.6 Regulatory Sandboxes. Creating controlled environments to test AI-driven DSS without full regulatory constraints could be a way to overcome development compliance. Sandbox legislation should define the scope, duration (e.g., 1–2 years), and liability exemptions for sandbox participants. Launching pilot programs with telecom operators and ensuring legal protections for experimental deployment are essential for the progress of AI-enabled DSS. After the test period, the transition to actual use in the real world would be enhanced because of a good testing foundation in a technological and regulatory sense. A good example is the model based on the UK's Ofcom sandbox, tailored for AI-driven 6G applications [10, 14, 15]. When it comes to regimes for authorization (e.g., Art. 47 EECC), introducing "AI-sandbox" authorizations for DSS testing accelerates innovation through pilots accompanied by authorization. This is also in line with the AI Act, where sandboxes represent well-documented risk mitigation and, as a result, transparency.

5 Conclusion
In this paper, the author examined AI-enabled DSS from a technical and legal governance perspective. This is a notable achievement because there is a significant gap in research in this field.

This paper aimed to highlight some dimensions of the interaction between technological perspectives and the governance of AI-enabled DSS.
After reviewing the adversarial and inherited technical challenges, such as resource allocation, interference management, and spectrum sensing, the legal issues of interference management, spectrum allocation, and equitable access, along with dispute resolution, are briefly discussed. Moving into the future, a possible new regulatory framework is presented, including a dynamic licensing model, the implementation of privacy-preserving AI techniques in DSS, and a shared liability approach to interference management that could also contribute to dispute resolution. Briefly, the importance of international standardization and interoperability, as well as cybersecurity threats such as data poisoning and the lack of standardization, is mentioned. Lastly, creating regulatory, not only technical, sandboxes as controlled testing environments is proposed.

Acknowledgments
This work was encouraged by the University Alma Mater Europaea and the Doctoral Program in Applied Artificial Intelligence.

References
[1] Pranita Bhide, Dhanush Shetty, and Suresh Mikkili. 2024. Review on 6G communication and its architecture, technologies included, challenges, security challenges and requirements, applications, with respect to AI domain. IET Quantum Communication. https://doi.org/10.1049/qtc2.12114
[2] Bushra Sabir et al. 2024. Systematic Literature Review of AI-enabled Spectrum Management in 6G and Future Networks. arXiv preprint arXiv:2407.10981. https://doi.org/10.48550/arXiv.2407.10981
[3] Ying-Chang Liang. 2020. Dynamic Spectrum Management: From Cognitive Radio to Blockchain and Artificial Intelligence. Springer. https://doi.org/10.1007/978-981-15-0776-2
[4] Abdulrahman Alhammadi et al. 2024. Artificial Intelligence in 6G Wireless Networks: Opportunities, Applications, and Challenges. International Journal of Communication Systems. https://onlinelibrary.wiley.com/doi/10.1002/dac.5443
[5] Saurabh Hitendra Patel. 2024. Dynamic Spectrum Sharing and Management Using Drone-Based Platforms for Next-Generation Wireless Networks. Preprints.org. https://www.preprints.org/manuscript/202412.0854/v2
[6] Abiodun Gbenga-Ilori. 2025. Artificial Intelligence Empowering Dynamic Spectrum Access in Advanced Wireless Communications: A Comprehensive Overview. MDPI. https://www.mdpi.com
[7] Robin Chataut et al. 2024. 6G Networks and the AI Revolution—Exploring Technologies, Applications, and Emerging Challenges. PMC. https://pmc.ncbi.nlm.nih.gov/articles/PMC10969307
[8] Mehmet Ali Aygül. 2025. Machine learning-based spectrum occupancy prediction: a comprehensive survey. Frontiers in Communications and Networks. https://www.frontiersin.org/articles/10.3389/frcmn.2024.1345678
[9] Janette Stewart. 2024. Improved management of shared spectrum: a potential AI/ML use case. Analysys Mason. https://www.analysysmason.com
[10] Anonymous. 2024. Advanced Dynamic Spectrum Sharing Demonstration in the National Spectrum Strategy. National Telecommunications and Information Administration. https://www.ntia.gov/issues/national-spectrum-strategy/advanced-dynamic-spectrum-sharing-demonstration-in-the-national-spectrum-strategy
[11] Anonymous. 2025. FCC TAC AI-WG Artificial Intelligence Meeting Slides. https://www.fcc.gov/sites/default/files/08-05-2025-FCC-TAC-Meeting-Slides-Merged.pdf
[12] Anonymous. 2025. Spectrum management: Key applications and regulatory considerations driving the future use of spectrum. Digital Regulation Platform. https://digitalregulation.org
[13] Directive (EU) 2018/1972 of the European Parliament and of the Council of 11 December 2018 establishing the European Electronic Communications Code. http://data.europa.eu/eli/dir/2018/1972/oj
[14] Anonymous. 2024. Artificial Intelligence in Spectrum Management: Policy and Regulatory Considerations. IEEE Conference Publication. https://ieeexplore.ieee.org
[15] Haval Hussein. 2025. AI-Driven Cognitive Radio Networks for 6G: Opportunities and Challenges. IEEE Transactions on Wireless Communications. https://ieeexplore.ieee.org

SmartCHANGE Risk Prediction Tool: Next-Generation Risk Assessment for Children and Youth

Nina Reščič, Marko Jordan, Sebastjan Kramar, Ana Krstevska, Marcel Založnik, Mitja Luštrek
Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
nina.rescic@ijs.si

Lotte van der Jagt, Harm op den Akker, Martijn Vastenburg
ConnectedCare Research & Development, Nijmegen, The Netherlands

Valentina Di Giacomo, Elena Mancuso
Engineering Ingegneria Informatica SpA, Rome, Italy

Dario Fenoglio, Gabriele Dominici
Università della Svizzera italiana, Lugano, Switzerland

© 2025 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2025.skui.7226

Abstract
Non-communicable chronic diseases (NCDs), largely driven by lifestyle factors such as poor nutrition, physical inactivity, and obesity, account for over 70% of mortality in Europe. While prevention has traditionally focused on adults, growing evidence highlights the value of early intervention during childhood and adolescence to establish healthy behaviours and reduce long-term risk. This paper presents the updated SmartCHANGE platform, which harmonizes heterogeneous datasets, addresses missing information through synthetic data generation, and forecasts risk factors from childhood to adulthood. Forecasts are then applied to established cardiovascular and diabetes risk models, enabling long-term risk assessment. To ensure privacy, the platform incorporates federated learning for secure model training across distributed datasets. By combining synthetically generated data, predictive modelling, privacy-preserving infrastructure, and end-user applications, the updated SmartCHANGE platform supports early identification of at-risk youth and enables targeted, data-driven interventions to help reduce the future burden of NCDs.

The new version introduces three advances: (i) broader harmonization of European cohort datasets through refined syntactic and semantic alignment; (ii) improved synthetic data generation that addresses the heterogeneity of the datasets; and (iii) evaluation of advanced RNN-based architectures alongside conventional ML models. While the pipeline in the previous paper powered a simple demo, this one is integrated into the SmartCHANGE prototype, which enables early identification of at-risk youth and supports the development of tailored preventive strategies. By combining harmonized datasets, predictive modelling, and privacy-preserving methods, it represents a step toward proactive, data-driven public health focused on youth as a critical stage for prevention. In addition, explainable AI was used to generate counterfactuals that support understanding of risk factors, and both web and mobile applications were developed to deliver these insights directly to healthcare professionals, adolescents, and families.
Keywords
non-communicable diseases, risk prediction, synthetic data generation, federated learning, preventive healthcare

1 Introduction
Non-communicable diseases (NCDs), including cardiovascular disease and diabetes, cause over 70% of deaths in Europe [6]. Their onset is shaped by modifiable risk factors such as diet, physical inactivity, obesity, smoking, and alcohol use. While prevention strategies typically target adults, growing evidence highlights childhood and adolescence as critical periods for establishing lifelong health behaviours [5]. Addressing risk early can delay or prevent NCD onset and promote long-term well-being.

In this paper, we describe an updated pipeline for predicting NCD risk in young people, building on our previous paper [4].

2 Baseline Predictive Approach
The models for forecasting risk factors are trained on seven heterogeneous datasets, none of which contain all the variables needed for risk prediction. The baseline predictive approach includes synthetic data generation and forecasting of individual risk factors from young to older age using various established machine-learning models. These forecast risk factors are then fed into established risk-prediction models to estimate the risk of cardiovascular disease and diabetes.

2.1 Synthetic Data Generation
Synthetic data generation was used to improve data completeness, enhance cross-dataset comparability, and support more robust forecasting and predictive modeling.

2.1.1 Generation of Diet Scores. The risk models required full dietary information, but none of the project datasets contained all the variables needed for diet scores. We therefore used the EUMenu dataset, which includes the complete set of dietary variables. Scores were first calculated for all EUMenu individuals. For project datasets with overlapping dietary or related features, we trained predictive models on EUMenu using only shared variables and generated synthetic diet scores accordingly. Given the task's simplicity and data structure, linear models were applied.
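A minimal sketch of this transfer step, assuming pandas frames and scikit-learn; the column names and toy data are invented for illustration.

```python
# Diet-score transfer: fit a linear model on EUMenu using only the
# columns shared with a project dataset, then predict synthetic scores.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def synthesize_diet_scores(eumenu, target, shared):
    model = LinearRegression()
    model.fit(eumenu[shared], eumenu["diet_score"])
    return model.predict(target[shared])

rng = np.random.default_rng(0)             # toy example, made-up columns
eumenu = pd.DataFrame(rng.random((100, 3)), columns=["fruit", "veg", "fish"])
eumenu["diet_score"] = eumenu.mean(axis=1)
target = pd.DataFrame(rng.random((20, 2)), columns=["fruit", "veg"])
scores = synthesize_diet_scores(eumenu, target, shared=["fruit", "veg"])
```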
2.1.2 Generation of Other Data. We generated synthetic values for missing variables by constructing targeted sub-datasets and generating data with supervised learning. Each sub-dataset required core demographics (sex, age, weight, height); rows missing these were discarded to ensure stable baselines. A greedy search selected predictor sets that maximized coverage of missing entries, informativeness beyond demographics, and training sample size. Candidate sets were ranked by Score = U × V × √K, where U is the number of missing instances covered, V the number of predictors, and K the number of training rows.

For each sub-dataset, Gradient Boosting, Random Forest, and Linear Regression models were trained with k-fold cross-validation and grid search. Validation was assessed with Root Relative Squared Error (RRSE; RRSE = 0 for perfect predictions, RRSE = 1 for the baseline), and the best model generated the missing values. Overlaps were resolved by keeping predictions from the model with the lower RRSE. This process was repeated across variables to expand coverage while minimizing error. Data generation proceeded iteratively: after each pass, synthetic variables were evaluated with RRSE. Variables below a threshold were accepted and treated as ground truth in the next pass, with sub-datasets and models recomputed accordingly. The procedure terminated once no further variables met inclusion or performance plateaued, yielding a consistent dataset. The mean RRSE of synthetic values in the final dataset was 0.795.
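The two quantities that drive this procedure are easy to state in code; the functions below follow the definitions in the text, while everything around them (data handling, the greedy loop itself) is omitted.

```python
# Candidate-set score and RRSE, as defined above.
import numpy as np

def candidate_score(u_covered, v_predictors, k_rows):
    """Score = U * V * sqrt(K)."""
    return u_covered * v_predictors * np.sqrt(k_rows)

def rrse(y_true, y_pred):
    """Root Relative Squared Error: 0 = perfect, 1 = mean baseline."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    num = np.sum((y_true - y_pred) ** 2)
    den = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(np.sqrt(num / den))
```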
2.2 Risk Factor Forecasting
Having generated synthetic data, we constructed a merged dataset with no missing values. This dataset was used to train machine learning (ML) models to forecast health-related risk factors from childhood into adulthood. The predicted values were then applied as inputs to publicly available risk models to estimate the risk of developing NCDs.

We implemented a neural network (NN) with two dense layers (512 and 128 neurons) to capture non-linear patterns. Training used MSE loss, the Adam optimizer, ReLU activations, dropout (0.2), and early stopping. A single NN forecasted all risk factors simultaneously. Training and test data were prepared by generating all younger-to-older age pairs per individual. Inputs included gender, input and target age, and risk factors at the input age; targets were the same risk factors at the target age. This design enabled the model to learn age-progressive changes.

Input–output pairs were split into training, validation, and test sets, with each individual assigned to only one partition to avoid leakage. Stratification by dataset preserved source representation. Features were standardized with scikit-learn's StandardScaler. For comparison, we trained traditional ML models separately per variable: Linear Regression, Ridge Regression, Random Forest, and LightGBM (the latter via the lightgbm library). All models used default parameters and were trained/tested on the same pairs as the NN. Performance was measured with MAE and RRSE. Training used both real and synthetic data, but evaluation was restricted to real data. Input ages ranged from 6–18 years, and target ages from 18–55 years, matching the SmartCHANGE forecasting scope. The mean RRSE of the forecast values was 0.829.

2.3 Risk Models
We focused on two models: the Healthy Heart Score (HHS) for cardiovascular disease and Test2Prevent (T2P) for diabetes risk. Both include lifestyle factors such as physical activity and diet—essential for assessing younger populations and behavioural change—aligning with our goal of early prevention through modifiable risk factors. Using both models balanced clinical reliability with behavioural relevance, enabling a more comprehensive NCD risk assessment.

Our initial approach applied the models at age 55, the maximum forecastable age. This yielded inconsistent outputs: T2P produced 10-year risks (55–65), while HHS produced a 20-year risk (55–75). To resolve this, we instead reported cumulative risks to age 65, the most suitable endpoint given our data. Two strategies were evaluated: non-overlapping intervals and overlapping (hazard-averaging) intervals.

3 Advanced Unified Predictive Approaches
This section introduces advanced forecasting methods designed to work directly on heterogeneous datasets without requiring prior synthetic data generation. Despite their greater sophistication, their accuracy lags behind the more straightforward method that relies on synthetic data generation.

Synthetic data generation and forecasting are trained jointly within a single model, enabling the sharing of representations and feedback. Early layers provide initial estimates for both tasks, while later stages refine them by capturing complex temporal dependencies. Although SmartCHANGE uses only single-year inputs per user, the training dataset includes multi-year records, which reveal broader behavioural patterns.

Before entering the network, variables are normalized using training-set statistics. Synthetic values are first generated in a linear block conditioned on age, gender, and BMI. This block consists of two fully connected layers (128 neurons + ReLU, then 21 neurons without activation). Forecasting then adds current age, future age, and gender, and predicts 21 risk factors across ages 6–55. The forecasting block differs by including an additional 128-neuron ReLU layer and more inputs. Forecasting is performed separately for each input year, and if multiple years exist, trajectories are averaged across target ages (e.g., data at ages 7, 9, and 12 yield three trajectories averaged per year).

This produces a time series of shape (50, 21). Appending masks for observed/synthetic values and gender gives (50, 43). Risk factor trajectories are then refined via a GRU block with bidirectional layers (128 or 21 hidden units) and a final 21-neuron linear layer. Predictions are finally de-normalized back to the original scale. The overall loss is the mean of two MAE terms, imputation and forecasting, with the latter computed only on ground-truth variables in the recorded output year.

The model was evaluated in the same way as the one in Section 2.2, with a mean RRSE of 0.907. This is higher (worse) than the RRSE from Section 2.2, indicating the need for further refinement of the unified approach.
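A condensed PyTorch sketch of the blocks just described (layer sizes follow the text; the training loop, masking, trajectory averaging, de-normalization, and the exact composition of the blocks are omitted or simplified):

```python
# Unified model, sketched: imputation MLP, forecasting MLP, and the
# bidirectional GRU refinement over the (years, factors+masks+gender)
# series. 43 = 21 factors + 21 masks + 1 gender flag.
import torch
import torch.nn as nn

N_FACTORS, N_YEARS = 21, 50

impute = nn.Sequential(                    # (age, gender, BMI) -> 21 values
    nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, N_FACTORS))

forecast = nn.Sequential(                  # factors + (age, future age, gender)
    nn.Linear(N_FACTORS + 3, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, N_FACTORS))

gru = nn.GRU(input_size=2 * N_FACTORS + 1, hidden_size=128,
             batch_first=True, bidirectional=True)
head = nn.Linear(2 * 128, N_FACTORS)

series = torch.rand(4, N_YEARS, 2 * N_FACTORS + 1)  # stand-in input
refined = head(gru(series)[0])                       # (4, 50, 21) trajectories
```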
4 Privacy Preservation and Explainability
Privacy Preservation. Within the SmartCHANGE project, health datasets are distributed across multiple countries and institutions. These sensitive data fall under strict regulations (e.g., GDPR), which prohibit cross-border sharing, and new pilot data remain stored locally, reinforcing isolation. Federated Learning (FL) addresses this by enabling collaborative training without moving raw data [3]. Two main challenges arise in deployment: pronounced heterogeneity across sites and residual privacy risks, since shared gradients can still leak information. To mitigate these, we developed distribution-aware, privacy-preserving FL strategies tailored to real-world healthcare [2]. Instead of a single global model, our approach builds compact, differentially private descriptors of each client's data distribution, clustering similar clients to train specialized models. This improves robustness to variability and temporal drift while ensuring fairer predictions, including for underrepresented groups. On the privacy side, model partitioning and communication-efficient aggregation reduce leakage without heavy cryptography by fragmenting gradients and distributing aggregation. Together, these strategies enable scalable, robust, and privacy-preserving FL pipelines for health risk prediction.
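As a toy illustration of the clustering idea only (the project's actual descriptors are differentially private and the training itself is federated, neither of which is shown here):

```python
# Group clients by a noisy summary of their local data distribution,
# then train one specialized model per group (training not shown).
import numpy as np
from sklearn.cluster import KMeans

def client_descriptor(X, noise_scale=0.1, rng=np.random.default_rng(0)):
    desc = np.concatenate([X.mean(axis=0), X.std(axis=0)])
    return desc + rng.normal(0.0, noise_scale, desc.shape)  # crude DP stand-in

clients = [np.random.rand(200, 5) for _ in range(10)]        # local datasets
descriptors = np.stack([client_descriptor(X) for X in clients])
groups = KMeans(n_clusters=3, n_init=10).fit_predict(descriptors)
```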
Explainability. Beyond predictive accuracy, effective NCD risk assessment must also provide transparent explanations and actionable guidance. For this, we adapt the Counterfactual Concept Bottleneck Model (CF-CBM) [1] to early-life health data. Instead of relying on predefined concepts—often unavailable or inconsistently annotated—our model learns patient feature distributions via a variational autoencoder (VAE), ensuring the latent space captures key generative factors of early-life trajectories. Counterfactuals are then generated following CF-CBM principles: given a patient profile and its predicted risk, the system proposes minimally altered, realistic configurations that would change the outcome. For example, if a child is predicted to be at high diabetes risk, the model may suggest plausible counterfactual profiles where lifestyle or physiological factors are adjusted to reduce risk. By embedding counterfactual reasoning directly into the pipeline, this approach goes beyond post-hoc interpretability. It both explains which factors drive predictions and identifies how risk can be reduced, offering clinicians and families actionable, personalized strategies for early prevention.

5 Architecture and User Applications
Architecture. The SmartCHANGE platform (Figure 1) is a modular, microservices-based system for AI-driven health interventions in children and adolescents. It integrates the predictive pipeline described in the previous sections with secure, scalable, and privacy-preserving technologies, with emphasis on GDPR compliance and explainable AI. Two main client interfaces are provided: the HappyPlant mobile app for families and youth, and a web application for healthcare professionals (HCPs).

Authentication and authorization are handled through the OpenID Connect (OIDC) protocol, with role-based access control and single sign-on. Additional safeguards include encrypted communication, pseudonymization, and immutable audit logging. Together, the SmartCHANGE platform, HappyPlant, and the HCP web interface form an integrated ecosystem for preventive healthcare, uniting advanced technical architecture with user-centered design to deliver effective, scalable, and personalized interventions.

Web Application. The web application for HCPs serves as a clinical dashboard, enabling them to access patient data, assess long-term risk for metabolic diseases (currently diabetes and CVD, although it can be scaled to integrate additional prediction models), and support behaviour change strategies. The interface is structured around a clinically aligned workflow — Consultation, Assessment, and Intervention — mirroring real-world practices.

Mobile Application. While intelligent risk predictions support HCPs in guiding clients, evidence and co-creation results show that simply communicating risks is insufficient for sustainable behaviour change in adolescents and families. The HappyPlant app was designed to address this gap. Rather than focusing on risks, it adopts a playful plant-growth analogy: users care for a virtual plant by completing daily and weekly personalized challenges linked to long-term health goals set by the HCP. The app nudges users towards the most suitable challenges but leaves the final choice to them, supporting autonomy and agency.

To foster long-term engagement, fully grown plants can be placed in the user's Goal Garden, which both showcases past achievements and acts as a reinforcement mechanism. In today's reward-driven context, the Goal Garden also enables saving towards real-life rewards set by parents, further motivating users. The app's design emerged from an extensive co-creation process and iterative validation with users, who responded positively to the analogy, challenge, and reward structure, as well as the aesthetics. Development was kept flexible, with adjustments made to align the app with other SmartCHANGE components.

6 Conclusion
This paper provides a concise description of the SmartCHANGE pipeline, which integrates harmonized datasets, synthetic data generation, federated learning, and explainable AI into a secure platform for early NCD risk prediction and prevention. Through the HappyPlant app and professional interface, these methods are translated into user-centered interventions that support sustainable behaviour change in youth. Detailed descriptions of the individual components will be published separately.

Acknowledgements
This work was carried out as part of the SmartCHANGE project, which received funding from the European Union's Horizon Europe research and innovation program under grant agreement No 101080965. The SLOfit and ACDSi datasets were provided by the University of Ljubljana (courtesy of Gregor Jurak et al.), the LGS dataset was provided by KU Leuven (courtesy of Martine Thomis), the AFINA-TE dataset was provided by the University of Porto (courtesy of José Ribeiro), the ABCD dataset was provided by VUMC, the HELENA dataset was provided by the HELENA study group (courtesy of Francisco Ortega), and the University of Turku provided the YFS dataset. We are grateful for their support.

References
[1] Gabriele Dominici, Pietro Barbiero, Francesco Giannini, Martin Gjoreski, Giuseppe Marra, and Marc Langheinrich. 2025. Counterfactual concept bottleneck models. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=w7pMjyjsKN
[2] Dario Fenoglio, Gabriele Dominici, Pietro Barbiero, Alberto Tonda, Martin Gjoreski, and Marc Langheinrich. 2024. Federated behavioural planes: explaining the evolution of client behaviour in federated learning. In Advances in Neural Information Processing Systems (NeurIPS 2024), Vol. 37, 112777–112813.
[3] Dario Fenoglio, Daniel Josifovski, Alessandro Gobbetti, Mattias Formo, Hristijan Gjoreski, Martin Gjoreski, and Marc Langheinrich. 2023. Federated learning for privacy-aware cognitive workload estimation. In Proceedings of the 22nd International Conference on Mobile and Ubiquitous Multimedia (MUM '23). ACM, New York, NY, USA, 25–36. doi: 10.1145/3626705.3627783.
[4] Marko Jordan, Nina Reščič, Sebastjan Kramar, Marcel Založnik, and Mitja Luštrek. 2024. SmartCHANGE risk prediction tool: demonstrating risk assessment for children and youth. In Slovenska konferenca o umetni inteligenci. Zvezek A: zbornik 27. mednarodne multikonference Informacijska družba – IS 2024, 10.–11. oktober, Ljubljana, Slovenija = Slovenian Conference on Artificial Intelligence. Vol. A: proceedings of the 27th International Multiconference Information Society – IS 2024. Ljubljana, Slovenia, 71–74.
[5] K. Pahkala, H. Hietalampi, T. T. Laitinen, J. S. Viikari, T. Rönnemaa, H. Niinikoski, et al. 2013. Ideal cardiovascular health in adolescence: effect of lifestyle intervention and association with vascular intima-media thickness and elasticity (the Special Turku Coronary Risk Factor Intervention Project for Children [STRIP] study). Circulation, 127, 18, (May 2013), 2088–2096.
[6] World Health Organization. 2018. Global Health Estimates 2016: Deaths by Cause, Age, Sex, by Country and by Region, 2000–2016. World Health Organization.

Figure 1: Logical Architecture of the SmartCHANGE Platform, including the mobile app (HappyPlant) and the web app for healthcare professionals, connected to a central FHIR-compliant repository and featuring a Trustworthy AI Framework with federated learning, explainability, and secure communication via the XCDS Engine.

Figure 2: HappyPlant app screens: the home, challenge and goal garden screens.

GNN Fusion of Voronoi Spatial Graphs and City–Year Temporal Graphs for Climate Analysis

Alex Romanova
Independent Researcher
McLean, VA, USA
sparkling.dataocean@gmail.com

Abstract
We present a two-stream graph framework for climate similarity that fuses geography with long-term dynamics. A globe-spanning Voronoi network links cities whose cells share a boundary, while per-city temporal graphs encode decades of daily temperatures in 1000 cities over 40 years. We learn (i) temporal embeddings via a GNN graph-classification model on city–year graphs and (ii) spatial embeddings via a GNN link-prediction model on the Voronoi backbone, using either raw climatology vectors or the learned temporal embeddings as inputs. Treating cosine similarity as edge weights (using 1 − cosine) enables graph-mining views: closeness maps highlight dense climate belts, and betweenness maps surface long-range "bridges" connecting distant regions. The fused approach uncovers patterns that simple averages miss, including nearby cities with low similarity (microclimates, urban form, or data aliasing) and far-apart cities with high similarity (shared seasonal regimes/latitude bands). We also incorporate the Delaunay triangulation—the dual of Voronoi—to provide a geometrically well-posed neighbor network that stabilizes triangle-based analyses; we use it as a robustness check to ensure these results are not tied to a single choice of spatial adjacency. The method is scalable and reproducible, and the same template—spatial adjacency + temporal history + GNN fusion—extends beyond temperature to additional variables and to urban and infrastructure applications.
Keywords
graph neural networks, spatiotemporal modeling, climate analysis, Voronoi tessellation, Delaunay triangulation

1 Introduction
Understanding global climate patterns is critical to the climate-change challenge. In this study, we explore a graph-based framework that integrates geographic layout with long-term temporal behavior. As a data source, we use climate records for 1,000 of the world's most populated cities with 40 years of daily temperatures. This dataset (Kaggle [7]) provides geolocations and multi-decade time series, allowing us to combine spatial and temporal perspectives.

Our spatial backbone is a Voronoi graph: from city coordinates, each city receives a Voronoi cell (the region closer to that city than to any other), and two cities are connected when their cells share a border—an interpretable, globally consistent notion of proximity. Alongside Voronoi, we also construct the Delaunay triangulation over the same points. Delaunay provides a complementary, dual view of neighborhood structure and enables triangle-based analyses; we use it as a robustness check to ensure results are not tied to a single choice of spatial adjacency.

For temporal behavior, each city is represented by a graph whose nodes are city–year pairs with daily-temperature profiles as features. Years are linked when their profiles exceed a cosine-similarity threshold. We add a virtual node so that each city graph forms a single connected component.

To analyze climate across space and time, we use basic vectors and pre-final vectors from Graph Neural Network (GNN) models. Figure 1 illustrates four representations used throughout the paper:
• Average — climatology vectors (365-day averages) per city;
• Temporal — embedded city graphs: pre-final vectors from a GNN graph classification model on per-city year graphs;
• Spatial — embedded Voronoi nodes: pre-final vectors from a GNN link-prediction model on the Voronoi graph with average vectors as inputs;
• Spatial+Temporal — re-embedded nodes: pre-final vectors from a GNN link-prediction model on the Voronoi graph using temporal embeddings as inputs.

Figure 1: Node feature types for climate similarity.

We previously introduced the use of pre-final vectors from a GNN graph classification model on city temporal graphs [17] and applied linear-algebra analyses to those outputs. In this study we contribute:
• Construction of a globe-spanning Voronoi spatial graph and its Delaunay triangulation as complementary spatial backbones;
• Comparisons across input climatology vectors, output city-graph embeddings, and spatial node embeddings from link prediction;
• Graph-mining analyses on induced graphs from each vector type, highlighting agreements and differences across spatial and temporal representations.
2 Related Work
In 2012, two milestones reshaped AI: AlexNet's convolutional neural network set a new benchmark in large-scale image classification, far surpassing prior methods [9, 12], and Google's Knowledge Graph operationalized entity-relationship understanding at web scale, transforming data integration, search, and management [15]. These lines of work initially evolved in parallel: CNNs excelled on grid-structured data, while graph methods targeted relational structure. The emergence of graph neural networks (GNNs) in the late 2010s bridged this gap by combining deep learning with graph computation to model complex dependencies [2]. Despite the rise of large language models (LLMs) since 2022, GNNs remain essential for tasks grounded in explicitly graph-structured data.

GNNs are now standard for classification and link prediction on graph-structured data [14, 1]. At web scale, industrial recommender systems adopt scalable inductive variants such as PinSage [20], while temporal/dynamic settings leverage trajectory-predictive embeddings like JODIE [10]. Community benchmarks have further standardized evaluation for large graph learning (e.g., OGB) [5]. In geophysics, recent studies demonstrate the effectiveness of GNNs for medium-range global weather forecasting [11], global atmospheric prediction [8], and spatiotemporal hydrology and geoscience tasks such as groundwater dynamics [19] and frost-event forecasting with attention mechanisms [13], supporting the view that graph-based inductive biases are well suited to environmental systems with strong spatial and temporal structure.

Voronoi tessellations provide natural adjacency via shared cell boundaries and have a long history in climate and global modeling [6]. Recent applications use Voronoi-induced graphs for urban risk modeling and natural hazards: Gan et al. propose a Voronoi-based spatiotemporal GCN for traffic crash prediction [3], while Razavi-Termeh et al. leverage Voronoi entropy in flood susceptibility mapping [16]. Our work synthesizes these ideas by constructing a global Voronoi-based spatial graph of cities enriched with long-term temperature signals and combining it with per-city temporal graphs encoded by GNNs.

Figure 2: Voronoi edge between distant cities: Québec and Porto are neighbors because their cells meet across the Atlantic.

Figure 3: Largest Voronoi triangle: Wellington-Port Elizabeth-Mar del Plata illustrates long edges formed in sparsely populated regions.

3 Methods
3.1 Graph Construction
We construct a global spatial graph by computing a planar Voronoi diagram on Web Mercator (EPSG:3857) city coordinates; two cities are adjacent if their cells touch. The Voronoi/Delaunay structure is used only to define adjacency (not distances or areas), yielding a simple, interpretable map of city neighborhoods worldwide.

We evaluate four alternative node-feature sets:
(1) 365-day climatology vectors: for each city, a 365-value day-of-year climatology averaged across all available years.
(2) Temporal vectors: pre-final embeddings from the GNN graph-classification model on each city's year-by-year graph (years linked when their daily profiles exceed a cosine-similarity threshold).
(3) Link-prediction vectors (from averages): pre-final embeddings from a GNN link-prediction model on the Voronoi graph using the 365-day climatology vectors as inputs.
(4) Link-prediction vectors (from temporal vectors): the same GNN link-prediction setup, but with temporal GNN embeddings as inputs.
This design allows direct comparison of spatial, temporal, and hybrid representations within a single framework; see Figure 1.

3.2 GNN Graph Classification Model
We apply a GNN graph classification model (PyTorch Geometric) to per-city temporal graphs. Each graph has one node per year, with that year's daily-temperature profile as the node features. We add a virtual node to each graph and connect it to ensure every city graph is a single connected component. For supervision, we split cities into two equal groups by absolute latitude (closer vs. farther from the equator) and train the model to classify the graphs. We then use the pre-final vector as the city's temporal embedding for downstream analysis.
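A minimal PyTorch Geometric sketch of this stream is given below. The convolution type (GCN) and the layer sizes are our assumptions for illustration; the paper specifies only the framework, the node features, and the latitude-based labels. After training, the pooled vector, not the logits, is kept as the city's temporal embedding.

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class CityGraphClassifier(torch.nn.Module):
    def __init__(self, in_dim=365, hidden=64, embed_dim=32):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, embed_dim)
        self.head = torch.nn.Linear(embed_dim, 2)  # closer vs. farther from the equator

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        pooled = global_mean_pool(x, batch)        # one vector per city graph
        return self.head(pooled), pooled           # logits and "pre-final" embedding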
3.3 GNN Link Prediction Model
We apply a GNN link prediction model (Deep Graph Library), using the GraphSAGE aggregator [4], to the Voronoi spatial graph of cities. Unlike the GNN graph classification model, which produces one embedding per city graph, link prediction runs on the global spatial graph and refines each city's node representation using both adjacency and input features. We evaluate two node-feature variants: (i) 365-day climatology vectors (averaged across years) and (ii) temporal embeddings from the classification model. After training, we extract pre-final node embeddings as enhanced city feature vectors for downstream analysis. Notes and code are provided on our technical blog [18].

4 Experiments
4.1 Voronoi Graph Construction
We build the spatial graph from city coordinates with a Voronoi tessellation: each city gets a cell, and two cities are linked when their cells touch. This gives a clear, globe-spanning picture of who is naturally close, without picking an arbitrary distance cutoff. Alongside this, we also use the Delaunay triangulation on the same points, the dual view that connects cities exactly when their Voronoi cells meet and highlights triangle-based local structure.

Sometimes this setup links places that are far apart because there are few large cities between them. For example, Québec (Canada) and Porto (Portugal) become neighbors across the Atlantic when their cells meet (Figure 2). Larger patterns show up in the Delaunay view as well: the largest triangle, Wellington (New Zealand), Port Elizabeth (South Africa), and Mar del Plata (Argentina), illustrates how isolated regions can still form direct connections (Figure 3).

To show spatial density, we color each city by Voronoi cell size (Figure 4). Small cells (green) mark tight clusters, for example parts of eastern China and northern India, while large cells (red) indicate sparse areas such as interior Australia or northern Canada. Dense hubs shorten edges and raise local connectivity; sparse zones create longer links that act as bridges.

Figure 4: Voronoi area (normalized): green=low, yellow=mid, red=high.

Figure 5: Closeness centrality across four vector types; red = high, yellow = mid, green = low.

Figure 6: Betweenness centrality across four vector types; red = high, yellow = mid, green = low.
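The adjacency construction in Section 4.1 is straightforward with SciPy; the sketch below (with random stand-in coordinates rather than the Kaggle cities) shows both the Voronoi and Delaunay derivations and their duality.

import numpy as np
from scipy.spatial import Delaunay, Voronoi

coords = np.random.rand(1000, 2) * 1e6  # stand-in for projected (EPSG:3857) city coordinates

vor = Voronoi(coords)
# ridge_points lists, for each Voronoi ridge, the two input sites whose cells share it.
voronoi_edges = {tuple(sorted(map(int, pair))) for pair in vor.ridge_points}

tri = Delaunay(coords)
delaunay_edges = set()
for simplex in tri.simplices:            # each triangle contributes three edges
    for a, b in ((0, 1), (1, 2), (0, 2)):
        delaunay_edges.add(tuple(sorted((int(simplex[a]), int(simplex[b])))))

# For points in general position the two edge sets coincide (Voronoi-Delaunay duality).
assert voronoi_edges == delaunay_edges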
4.2 GNN Models
Across both GNNs (temporal graph classification and spatial link prediction), we use only pre-final embeddings for downstream analysis; we do not report task metrics (edge AUC/AP or classification accuracy) because our goal is weighted-path/centrality analysis on a geometric prior.

4.3 How Similar Are Distant or Nearby Cities?
This section examines climate similarity for both distant and neighboring city pairs using the four representations (Average, Temporal, Spatial, Spatial+Temporal). Tables 1 and 2 show representative examples: one for geographically distant pairs and one for nearby pairs.

Many distant pairs show very high similarity, especially when temporal history and spatial context are both considered. For example, Wellington (New Zealand) and Mar del Plata (Argentina), though thousands of kilometers apart, score highly across all four metrics, suggesting that similar seasonal regimes and latitude can outweigh raw distance.

Nearby pairs typically agree across metrics as well. In the second table, examples such as Barranquilla-Soledad and Barcelona-Puerto La Cruz show consistently high similarity, reflecting shared local climate.

There are exceptions. New York and Brooklyn, despite being only a few kilometers apart, score low on the Spatial and Spatial+Temporal measures. This may reflect microclimates, urban effects, or dataset/aliasing issues (e.g., borough vs. city records). Such cases show that short geographic distances can mask meaningful environmental differences, underscoring the value of combining temporal and spatial modeling.

4.4 Centrality and Betweenness Patterns Across Vector Types
Throughout, climate similarity means cosine similarity between the indicated vectors; for path-based metrics we use edge weights w = 1 - cosine. Each set of maps uses the same spatial backbone: edges come from the Voronoi graph, where two cities are adjacent if their cells share a border. What changes across panels is the edge weight, derived from cosine similarity computed from one of the four representations (Average, Temporal, Spatial, Spatial+Temporal), with vectors normalized prior to taking the cosine. The topology stays fixed; the weights, and therefore any shortest-path-based measures, change with the chosen vectors. Smaller weights mean higher climate similarity.

In the closeness centrality maps (Figure 5), cities with high closeness are, on average, at short weighted distance from many others, i.e., they are similar to many cities. Dense climate regions such as Europe and East Asia typically stand out. Differences between panels reveal how each representation defines "similar," shifting which cities appear most central.

In the betweenness maps (Figure 6), different weightings emphasize different connectors: high-betweenness cities lie on many shortest routes. The Spatial+Temporal view surfaces more long-range intermediaries than Average (notably in Africa, South America, and the Pacific). We also observe slight polarization in Spatial and Spatial+Temporal; the reason for this requires further research.

Our centrality and betweenness maps are only a starting point, with extended graph experiments expected to uncover additional structures and recurring pathways.
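A sketch of the weighting scheme used for Figures 5 and 6, assuming the Voronoi adjacency and unit-normalized vectors as inputs (names are placeholders; the paper's own code is on the blog [18]): the topology is fixed, only the edge weights w = 1 - cosine change with the chosen representation.

import numpy as np
import networkx as nx

def centrality_maps(edges, vec):
    # edges: Voronoi adjacency pairs; vec: city -> unit-normalized feature vector.
    g = nx.Graph()
    for u, v in edges:
        cos = float(np.dot(vec[u], vec[v]))
        g.add_edge(u, v, weight=1.0 - cos)   # small weight = high climate similarity
    closeness = nx.closeness_centrality(g, distance="weight")
    betweenness = nx.betweenness_centrality(g, weight="weight")
    return closeness, betweenness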
Table 1: Climate similarity between distant city pairs
City 1              City 2              Distance (km)  Average  Temporal  Spatial  Spatial+Temporal
Wellington, NZ      Mar del Plata, AR   25870.97       0.9922   1.0000    1.0000   1.0000
Port Elizabeth, ZA  Wellington, NZ      16639.04       0.9982   0.9963    0.9999   1.0000
Melbourne, AU       Port Elizabeth, ZA  13299.30       0.9916   0.9958    0.9872   0.9993
Reykjavik, IS       Krasnoyarsk, RU     12911.14       0.7375   0.7482    0.9861   0.9338
Nuku'alofa, TO      Concepcion, CL      11549.31       0.9838   0.9882    0.9995   0.9997

Table 2: Climate similarity between nearby city pairs
City 1            City 2               Distance (km)  Average  Temporal  Spatial  Spatial+Temporal
Jerusalem, IL     Al Quds, PS          2.27           1.0000   1.0000    0.9998   1.0000
Barranquilla, CO  Soledad, CO          5.63           1.0000   0.9585    0.9999   0.9999
Barcelona, VE     Puerto La Cruz, VE   6.32           1.0000   1.0000    1.0000   1.0000
Khartoum, SD      Omdurman, SD         6.88           1.0000   0.8749    0.9590   0.9988
New York, US      Brooklyn, US         7.05           1.0000   0.5220    0.0857   0.0878

5 Conclusion
In conclusion, the novelty of this work is the explicit fusion of a Voronoi spatial graph with temporal GNN embeddings to reveal climate "neighborhoods" that traditional, single-view methods tend to miss. By running a GNN graph-classification model on per-city year graphs and a GNN link-prediction model on the global Voronoi backbone, we combine geography with long-term dynamics. We compare simple average-by-day climatology vectors against pre-final vectors from both GNN models and then use these vectors for downstream analysis.

This fusion surfaces informative outliers: nearby cities with low cosine similarity (consistent with microclimates, urban form, or data aliasing) and distant city pairs with high similarity, suggesting long-distance climate links. Using these vectors as edge weights enables graph-mining views: closeness maps highlight dense climate belts, while betweenness maps elevate long-range "bridges." Adding the Delaunay triangulation, the dual of the Voronoi diagram, provides a geometrically well-posed neighbor network that stabilizes these patterns.

While this study centers on climate and temperature, the dual Voronoi-Delaunay framework with GNN fusion is broadly applicable. The same geometric scaffold can analyze urban connectivity and infrastructure networks, surface social or economic linkages in dense regions, and support practical tasks like traffic management and siting of schools, parks, or grocery stores. It offers a stable way to reason about spatial relationships beyond climate. The approach is also a starting point for continued work: enrich node features, adopt spherical/geodesic tessellations, learn the graph via contrastive or metric objectives, and explore dynamic temporal GNNs with attribution, counterfactuals, and uncertainty.
References
[1] Jakub Adamczyk. 2022. Application of graph neural networks and graph descriptors for graph classification. arXiv preprint arXiv:2211.03666. doi:10.48550/arXiv.2211.03666.
[2] Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. 2021. Geometric deep learning: grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478. doi:10.48550/arXiv.2104.13478.
[3] Junjie Gan, Qing Yang, Dong Zhang, Li Li, Xinyu Qu, and Bin Ran. 2024. A novel Voronoi-based spatio-temporal graph convolutional network for traffic crash prediction considering geographical spatial distributions. IEEE Transactions on Intelligent Transportation Systems. doi:10.1109/TITS.2024.3452275.
[4] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/1706.02216.
[5] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open Graph Benchmark: datasets for machine learning on graphs. In Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2005.00687.
[6] Lili Ju, Todd Ringler, and Max Gunzburger. 2011. Voronoi tessellations and their application to climate and global modeling. In Numerical Techniques for Global Atmospheric Models. Lecture Notes in Computational Science and Engineering. doi:10.1007/978-3-642-11640-7_10.
[7] Kaggle Dataset. 2020. Temperature history of 1000 cities 1980 to 2020. https://www.kaggle.com/datasets/sudalairajkumar/daily-temperature-of-major-cities.
[8] Ryan Keisler. 2022. Forecasting global weather with graph neural networks. arXiv preprint arXiv:2202.07575. doi:10.48550/arXiv.2202.07575.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS). doi:10.1145/3065386.
[10] Srijan Kumar, Xikun Zhang, and Jure Leskovec. 2019. Predicting dynamic embedding trajectory in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). doi:10.1145/3292500.3330895.
[11] Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, et al. 2023. Learning skillful medium-range global weather forecasting. Science. doi:10.1126/science.adi2336.
[12] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature. doi:10.1038/nature14539.
[13] Hugo Lira, Luis Martí, and Nayat Sanchez-Pi. 2022. A graph neural network with spatio-temporal attention for multi-source time series data: an application to frost forecast. doi:10.3390/s22041486.
[14] Xia Liu, Jie Chen, and Qingsong Wen. 2023. A survey on graph classification and link prediction based on GNN. arXiv preprint arXiv:2307.00865. doi:10.48550/arXiv.2307.00865.
[15] Natasha F. Noy, Yuval Gao, Anshu Jain, Anant Narayanan, Alan Patterson, and Jamie Taylor. 2019. Industry-scale knowledge graphs: lessons and challenges. acmqueue. doi:10.1145/3329781.3332266.
[16] S. Vahideh Razavi-Termeh, Amir Sadeghi, Faisal Ali, Rana Abdul Naqvi, et al. 2024. Cutting-edge strategies for absence data identification in natural hazards: leveraging Voronoi-entropy in flood susceptibility mapping with advanced AI techniques. Journal of Hydrology. doi:10.1016/j.jhydrol.2024.132337.
[17] Alex Romanova. 2024. Utilizing pre-final vectors from GNN graph classification for enhanced climate analysis. In Proceedings of the 21st Workshop on Mining and Learning with Graphs (MLG 2024), co-located with ECML PKDD 2024.
[18] sparklingdataocean.com. [n. d.] Temporal-spatial GNN fusion for climate analytics. http://sparklingdataocean.com/2025/06/25/voronoiGNN/.
[19] Marco L. Taccari, Hua Wang, James Nuttall, Xue Chen, and Peter K. Jimack. 2024. Spatial-temporal graph neural networks for groundwater data. doi:10.1038/s41598-024-75385-2.
[20] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). doi:10.1145/3219819.3219890.

Towards Anomaly Detection in Forest Biodiversity Monitoring: A Pilot Study with Variational Autoencoders

David Susič (david.susic@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia; Maria Luisa Buchaillot, Fauna Smart Technologies ApS, Copenhagen, Denmark; Miguel Crozzoli, Intelligent Instruments Lab, University of Iceland, Reykjavik, Iceland; Calum Builder, Fauna Smart Technologies ApS, Copenhagen, Denmark; Sevasti Maistrou, Fauna Smart Technologies ApS, Copenhagen, Denmark; Anton Gradišek, Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia; Dragana Vukašinović, Fauna Smart Technologies ApS, Copenhagen, Denmark
DOI: https://doi.org/10.70314/is.2025.skui.6757

Abstract
Biodiversity monitoring in forests requires scalable, automated tools for detecting ecological anomalies across time and space. This paper reports on a three-month pilot deployment (April 1 to June 30, 2025) in Dyrehaven, an 11 km² forest park near Copenhagen, Denmark, where acoustic data from 10 distributed AudioMoth sensors and vegetation indices from Sentinel-2 imagery were collected. We trained separate variational autoencoder (VAE) models on each modality to test the technical feasibility of learning ecological baselines. Since no ecological anomalies occurred during the observation period, evaluation focused on reconstruction errors, which indicate how well VAEs can capture typical site-specific ecological patterns (i.e., baseline modeling). Both acoustic and satellite pipelines achieved low reconstruction errors, demonstrating that VAEs can reliably model normal ecological dynamics. This establishes the foundation for future studies on anomaly detection, which will require larger datasets containing true ecological anomalies identified and labeled by experts. Ongoing work focuses on extending data collection to additional forest sites, while future anomaly detection will require expert-labeled anomalies to calibrate baselines and validate model performance for robust, multimodal biodiversity monitoring.

Keywords
biodiversity, anomaly detection, variational autoencoder, machine learning, passive acoustic monitoring, satellite imagery

1 Introduction
Forests are complex, dynamic ecosystems increasingly affected by environmental stressors such as pests, diseases, invasive species, and climate-related disturbances [1]. Effective biodiversity monitoring is essential to detect these stressors early and support adaptive, science-based forest management [2, 3]. However, existing monitoring tools are often limited in scope, fragmented across disciplines, and costly to implement at scale [4].

This paper presents the technical foundation of the biodiversity assessment tool (BAT), a modular, scalable system that integrates ecoacoustics, satellite remote sensing, and machine learning (ML) to enable automated biodiversity monitoring in forested landscapes. BAT is designed to detect anomalies in ecological baselines, providing early warning signals of ecosystem degradation [5]. It combines two complementary remote sensing modalities: passive acoustic monitoring (PAM), which captures localized, high-frequency biological activity such as insect or bird calls [6, 7], and satellite Earth observation (EO), which offers broader, lower-frequency indicators of landscape-level change, including vegetation health and canopy dynamics [8].

The presence of pests or other stressors often leads to a reduction in biodiversity, which can first be detected acoustically as diminished biotic sound activity, and later (typically with a lag of several days) becomes visible in EO data as decreased vegetation greenness. BAT is designed to leverage this temporal and spatial complementarity by developing independent anomaly detection pipelines for each modality, which in future iterations may support joint multimodal detection of ecological disturbances.
This study reports on a pilot deployment in Dyrehaven, a human-managed park-forest in Denmark, where time-series data from distributed acoustic sensors and Sentinel-2 satellite imagery were collected between April and June 2025. Separate variational autoencoders (VAEs) were trained on each modality to test whether robust baseline models can be learned. Ecological anomalies are inherently rare and cannot be guaranteed within a limited three-month window, and none occurred during this period. As a result, evaluation focused on baseline reconstruction performance rather than anomaly detection accuracy. Demonstrating that VAEs can successfully capture "normal" ecological patterns is a necessary prerequisite for future anomaly detection. Ecological baselines are inherently site-specific, differing across forest types, microhabitats, and even within single forests (e.g., wetter zones near ponds vs. drier uplands). Accordingly, this work should be understood as a technical feasibility study, with the longer-term goal of enabling multimodal detection of ecological disturbances such as pest outbreaks, supported by expert-labeled events and extended deployment across diverse forests.

2 Data
Our study area was Dyrehaven, a human-managed forest park north of Copenhagen, Denmark (55.8024°N, 12.5685°E), covering 11 km² (see Figure 1). The site includes 10 structured microhabitats across woodland, meadow, and modified forest areas. Its ecological diversity and relative stability make it suitable for testing acoustic and satellite-based monitoring methods. Data were collected between April 1 and June 30, 2025.

Figure 1: Study area in Dyrehaven, Denmark, with AudioMoth recording locations (red pins) and the Sentinel-2 satellite bounding box (blue).

2.1 Audio
Passive acoustic data were collected using 10 AudioMoth recording devices deployed across Dyrehaven's microhabitats. Devices were positioned to maximize spatial heterogeneity, minimize acoustic overlap, and ensure temporal consistency. Each unit recorded 45-second mono-channel clips every five minutes at a 48 kHz sampling rate. All devices were weatherproofed and mounted on trees for continuous outdoor operation. A recording gap occurred between April 20 and April 29 due to memory card failure. A total of 203,078 recordings were generated during the study period. After removing corrupted or incomplete files (309 clips, 0.15%), 202,769 valid recordings remained.

2.2 Visual
Satellite imagery was sourced from the Sentinel-2 mission [9], covering a 1.48 km × 5.86 km bounding box encompassing 9 of the 10 AudioMoth locations. Out of 53 total snapshots available during our study period, 18 cloud-free scenes (≤50% cloud cover) were selected for analysis to ensure index reliability.

The normalized difference vegetation index (NDVI) and the normalized difference moisture index (NDMI) were computed for each selected image as

NDVI = (NIR - red) / (NIR + red),
NDMI = (NIR - SWIR) / (NIR + SWIR),

where NIR, SWIR, and red are the near-infrared, shortwave-infrared, and visible red bands, respectively. NDVI was calculated at 10 m resolution, and NDMI at 20 m. Each index map was divided into fixed-size patches. NDVI maps produced 396 patches (an 11 × 36 grid), while NDMI produced 108 patches (a 6 × 18 grid), reflecting their respective spatial resolutions.
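A direct numpy rendering of the two indices and the patching step follows. The epsilon guard and the 16-pixel patch size (chosen to match the 16×16 VAE input described later) are our assumptions; the paper does not state the exact patch dimensions.

import numpy as np

def ndvi(nir, red):
    return (nir - red) / (nir + red + 1e-9)      # epsilon avoids division by zero

def ndmi(nir, swir):
    return (nir - swir) / (nir + swir + 1e-9)

def to_patches(index_map, size=16):
    # Cut an index map into non-overlapping size x size patches.
    h, w = index_map.shape
    rows, cols = h // size, w // size
    trimmed = index_map[: rows * size, : cols * size]
    return trimmed.reshape(rows, size, cols, size).swapaxes(1, 2).reshape(-1, size, size)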
3 Methodology
3.1 Extraction of Acoustic Indices
Ten standard ecoacoustic indices [10] (listed in Table 1) were extracted from each 45-second recording, capturing patterns from both time-domain and time-frequency analyses. These indices reflect aspects such as spectral entropy, acoustic complexity, temporal dynamics, and frequency distribution, offering proxies for ecological features like species richness, biophonic activity, and anthropogenic disturbance. All indices were independently normalized to the [0, 1] range using their dataset-wide minimum and maximum values.

Table 1: Acoustic indices used in this study and their ecological interpretation.
Index  Use
ACI    Detects dynamic biotic sounds (e.g., bird choruses).
AEI    Identifies dominance vs. diversity in acoustic communities.
EAS    Differentiates uniform noise vs. structured signals.
ECU    Indicates unpredictability and complexity of soundscapes.
ECV    Captures temporal structure (e.g., insect or bird rhythms).
EPS    Distinguishes tonal vs. noisy sound environments.
ADI    Proxy for acoustic diversity or species richness.
NDSI   Separates natural from human-made noise.
Ht     Detects continuous vs. discrete acoustic events.
ARI    Estimates overall acoustic richness.

3.2 Preprocessing of Satellite Imagery
To ensure patch-level data quality, we applied the scene classification layer (SCL) after resampling. Patches containing cloudy or unreliable pixels (SCL classes 3, 8, 9, or 10) were excluded. This preprocessing pipeline produced curated spatiotemporal datasets of 4,436 NDVI patches and 1,226 NDMI patches, which served as input for training and evaluating the VAE models.

3.3 Variational Autoencoder and Evaluation Metrics
A variational autoencoder (VAE) learns to compress input data into a latent representation and reconstruct it via an encoder and decoder, as in Figure 2.

Figure 2: Architecture of the VAE for anomaly detection using reconstruction probability.

The encoder maps each input to a latent mean $\mu_1$ and log-variance $\log(\sigma_1^2)$, from which a latent vector $z$ is sampled via the reparameterization trick: $z = \mu_1 + \sigma_1 \cdot \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$ and $\sigma_1 = \exp(0.5 \cdot \log(\sigma_1^2))$. The decoder reconstructs the input from $z$, producing a mean $\mu_2$ and log-variance $\log(\sigma_2^2)$ of the output distribution. Training minimizes the total loss

$\mathcal{L}_{VAE} = \mathcal{L}_{recon} + w_{KL} \cdot \mathcal{L}_{KL}$,

where $\mathcal{L}_{recon}$ is the negative log-likelihood of the input under the decoder's Gaussian output,

$\mathcal{L}_{recon} = -\sum_{i=1}^{D} \log \mathcal{N}(x_i \mid \mu_{2,i}, \sigma_{2,i}^2)$,

and $\mathcal{L}_{KL}$ is the Kullback-Leibler divergence between the approximate posterior $q(z \mid x)$ and the prior $p(z) = \mathcal{N}(0, 1)$,

$\mathcal{L}_{KL} = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_{1,j}^2 - \mu_{1,j}^2 - \sigma_{1,j}^2\right)$,

with $D$ and $d$ representing the input and latent dimensions, respectively.
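The following PyTorch sketch renders this loss and the audio VAE, using the dimensions reported later in Section 3.4.1 (input 10, hidden 8, latent 4); it is a minimal reconstruction under those stated values, not the authors' code.

import torch
import torch.nn as nn

class AudioVAE(nn.Module):
    def __init__(self, in_dim=10, hidden=8, latent=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu1 = nn.Linear(hidden, latent)
        self.logvar1 = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU())
        self.mu2 = nn.Linear(hidden, in_dim)
        self.logvar2 = nn.Linear(hidden, in_dim)

    def forward(self, x):
        h = self.enc(x)
        mu1, logvar1 = self.mu1(h), self.logvar1(h)
        z = mu1 + torch.exp(0.5 * logvar1) * torch.randn_like(mu1)  # reparameterization trick
        d = self.dec(z)
        return self.mu2(d), self.logvar2(d), mu1, logvar1

def vae_loss(x, mu2, logvar2, mu1, logvar1, w_kl=1.0):
    # Gaussian negative log-likelihood of x under the decoder (constant term dropped).
    recon = 0.5 * (logvar2 + (x - mu2) ** 2 / logvar2.exp()).sum(dim=1)
    # KL(q(z|x) || N(0, I)), matching L_KL above.
    kl = -0.5 * (1 + logvar1 - mu1.pow(2) - logvar1.exp()).sum(dim=1)
    return (recon + w_kl * kl).mean()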
In an operational anomaly detection setting, the decoder's negative log-likelihood (often referred to as the reconstruction likelihood) would serve as the anomaly score, with higher values indicating more anomalous inputs. However, since no ecological anomalies occurred during our three-month observation window, this pilot study evaluates baseline modeling rather than anomaly detection accuracy. Specifically, we report reconstruction errors: mean squared error (MSE) and mean absolute error (MAE) for acoustic indices, and overall mean absolute error (averaged across all pixels in each patch) for NDVI and NDMI patches, computed only on non-cloudy patches after SCL masking.

3.4 Experimental Setup
The general pipeline of the BAT system is shown in Figure 3. It consists of independent audio and visual pipelines designed to operate separately but eventually integrate into a unified decision-support framework. In a full anomaly detection setting, the pipelines would use reconstruction likelihoods as anomaly scores and combine them across modalities. In this pilot, since no anomalies occurred, we only assess baseline modeling by training and evaluating the acoustic and satellite VAEs independently, reporting reconstruction errors as indicators of model performance.

Figure 3: The general pipeline of the BAT system.

3.4.1 Audio Pipeline. The audio VAE uses a 10-dimensional input, with an encoder and decoder each containing one hidden layer of size 8 and ReLU activation. The latent space has dimension 4. The decoder outputs the reconstructed mean and log-variance of size 10. Model evaluation used 5-fold cross-validation with folds defined by spatially clustered AudioMoth devices (~850 m minimum separation) to reduce data leakage. Models were trained for 30 epochs with a batch size of 512 using the Adam optimizer and a one-cycle learning rate schedule.

3.4.2 Visual Pipeline. The satellite VAE takes a 16×16 pixel input (NDVI or NDMI) and uses three convolutional layers (32, 64, 128 filters) with ReLU activation in the encoder. The output is flattened and mapped to a latent space of dimension 4. The decoder upsamples using three transposed convolutional layers with ReLU, reconstructing the mean and log-variance patches of size 16×16. Separate VAE models were trained for NDVI and NDMI using an 80/20 train-test split. Each model was trained for 20 epochs with a batch size of 32 using the Adam optimizer. The loss was computed only over non-cloudy pixels.

4 Results and Discussion
To examine temporal patterns, all indices were plotted over the study period, as seen in Figure 4. Acoustic indices were averaged between 9 AM and 3 PM across all 10 AudioMoth devices to avoid nighttime inactivity and minimize dawn/dusk transitions. A 10-day smoothing window was applied to reduce day/night fluctuations. The indices remained relatively stable over the long term, showing little trend, suggesting no major ecological disruptions and reflecting the stability of the forest soundscape over the study period.

Visual indices were averaged across all patches for each date. Both indices exhibit a gradual increase from early April to late June, consistent with seasonal greening. NDVI shows a smooth and consistent rise, indicating widespread vegetation growth. NDMI, while generally increasing, displays more irregular variation, particularly early in the season, likely reflecting transient moisture conditions.
NDVI primarily tracks canopy structure and greenness, while NDMI is more sensitive to vegetation and soil moisture.

Figure 4: Index values over the study period.

The audio pipeline VAE was evaluated using reconstruction MSE and MAE. Since all indices were normalized to the [0, 1] range, errors are directly comparable. As shown in Figure 5, reconstruction errors are generally low, indicating that the model effectively captures the underlying structure of the acoustic data. EPS and Ht showed the highest reconstruction error variability. This suggests they are more difficult to model but may provide sensitive signals of ecological change in future anomaly detection settings. Indices with consistently low reconstruction errors, on the other hand, indicate stable features that can serve as robust components of ecological baselines. These patterns highlight differences in how well various indices represent typical acoustic dynamics, which is central to establishing reliable baseline models.

Figure 5: Reconstruction errors for acoustic indices.
The visual pipeline VAEs were evaluated using overall MAE per patch. As expected, errors were fairly uniform across pixels, indicating that the models reconstruct spatial patterns consistently without localized distortions. The average patch-level MAE (averaged across all 16 × 16 = 256 pixels across all images) was 7.17 ± 0.11 for NDVI and 9.65 ± 0.26 for NDMI. Given the [0, 1] normalization range of each pixel, the errors are relatively small and therefore reflect accurate reconstruction of vegetation and moisture dynamics.

The selected VAE models for both the acoustic and visual pipelines demonstrate strong reconstruction performance, with consistently low errors across acoustic indices and Sentinel-derived NDVI/NDMI patches. This confirms that the models effectively capture typical ecological patterns, which is the intended outcome of this pilot study. While further hyperparameter tuning could potentially reduce errors, the key result is that robust ecological baselines can be modeled. Anomaly detection itself will require expert-labeled events in future deployments, but these results provide the necessary technical foundation.

5 Conclusion
This work demonstrates the technical feasibility of using VAEs to model baseline ecological patterns from acoustic and satellite time series in a forested landscape. As a pilot study, it does not evaluate anomaly detection directly, since no anomalies occurred during the observation period. Instead, it establishes that robust models can be trained on available data, providing a foundation for future multimodal monitoring.

A critical next step is the collection of additional data over longer time frames and across multiple forest types, since actual ecological anomalies are rare and cannot be guaranteed within a short observation window. Detecting and validating anomalies will require expert labeling of such events once they occur. To this end, we are continuing data collection at Dyrehaven and planning expansions to other Danish forests (e.g., Thy, Amager, Lillebælt) to capture a wider range of ecological contexts and improve model generalization. Further development will also focus on refining acoustic preprocessing through time-window averaging or time-aware features and enhancing the visual pipeline with seasonal baselines, sequential models, and zone-specific approaches that account for spatial heterogeneity.

With expert input, longer-term recordings, and broader deployment, the BAT system can evolve from modeling site-specific baselines into a robust anomaly detection tool supporting scalable and long-term biodiversity monitoring.

Acknowledgements
This work was funded by Fauna Smart Technologies ApS under the European Space Agency (ESA) grant no. 4000147116, Biodiversity Assessment Tool.

References
[1] William R. L. Anderegg, Oriana S. Chegwidden, Grayson Badgley, Anna T. Trugman, Danny Cullenward, John T. Abatzoglou, Jeffrey A. Hicke, Jeremy Freeman, and Joseph J. Hamman. 2022. Future climate risks from stress, insects and fire across US forests. Ecology Letters, 25, 6, 1510-1520. doi: 10.1111/ele.14018.
[2] Lucas P. Gaspar et al. 2023. Predicting bird diversity through acoustic indices within the Atlantic Forest biodiversity hotspot. Frontiers in Remote Sensing, 4, (Dec. 2023). doi: 10.3389/frsen.2023.1283719.
[3] J. Wolfgang Wägele et al. 2022. Towards a multisensor station for automated biodiversity monitoring. Basic and Applied Ecology, 59, 105-138. doi: 10.1016/j.baae.2022.01.003.
[4] Santiago Izquierdo-Tort, Andrea Alatorre, Paulina Arroyo-Gerala, Elizabeth Shapiro-Garza, Julia Naime, and Jérôme Dupras. 2024. Exploring local perceptions and drivers of engagement in biodiversity monitoring among participants in payments for ecosystem services schemes in southeastern Mexico. Conservation Biology, 38, 6, e14282. doi: 10.1111/cobi.14282.
[5] Nathalie Pettorelli, Jake Williams, Henrike Schulte to Bühne, and Merry Crowson. 2025. Deep learning and satellite remote sensing for biodiversity monitoring and conservation. Remote Sensing in Ecology and Conservation, 11, 2, 123-132. doi: 10.1002/rse2.415.
[6] Rory Gibb, Ella Browning, Paul Glover-Kapfer, and Kate E. Jones. 2019. Emerging opportunities and challenges for passive acoustics in ecological assessment and monitoring. Methods in Ecology and Evolution, 10, 2, 169-185. doi: 10.1111/2041-210X.13101.
[7] D.A. Nieto-Mora, Susana Rodríguez-Buritica, Paula Rodríguez-Marín, J.D. Martínez-Vargaz, and Claudia Isaza-Narváez. 2023. Systematic review of machine learning methods applied to ecoacoustics and soundscape monitoring. Heliyon, 9, 10, e20275. doi: 10.1016/j.heliyon.2023.e20275.
[8] Nathalie Pettorelli et al. 2018. Satellite remote sensing of ecosystem functions: opportunities, challenges and way forward. Remote Sensing in Ecology and Conservation, 4, 2, 71-93. doi: 10.1002/rse2.59.
[9] Copernicus Data Space Ecosystem. 2015. Sentinel-2. https://dataspace.copernicus.eu/explore-data/data-collections/sentinel-data/sentinel-2.
[10] Luis J. Villanueva-Rivera, Bryan C. Pijanowski, Jarrod Doucette, and Burak Pekin. 2011. A primer of acoustic analysis for landscape ecologists. Landscape Ecology, 26, 9, (July 2011), 1233-1246. doi: 10.1007/s10980-011-9636-9.

Development of a Lightweight Model for Detecting Solitary-Bee Buzz Using Pruning and Quantization for Edge Deployment

Ryo Yagi (yagi-ryo143@g.ecc.u-tokyo.ac.jp), The University of Tokyo, Tokyo, Japan; David Susič (david.susic@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia; Maj Smerkol (maj.smerkol@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia; Miha Finžgar (miha.finzgar@senso4s.com), Senso4s, Trzin, Slovenia; Anton Gradišek (anton.gradisek@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
DOI: https://doi.org/10.70314/is.2025.skui.6788

Abstract
Passive acoustic monitoring is increasingly applied in studies of pollinators, both for biodiversity assessment and for the conservation of endangered species. A major challenge is that continuous recording generates large volumes of audio data, making centralized processing impractical. Edge computing offers a promising alternative, provided that the models are optimized for the resource constraints of edge devices while maintaining acceptable performance and efficiency. In this work, which is our initial study of the edge computing approach, we developed and evaluated compact classifiers for detecting buzzes of solitary bees, extending previous work on acoustic monitoring. We systematically apply pruning and quantization to multiple models, exploring a range of compression settings. Performance is assessed in terms of mean F1-score and on-disk size under both cross-validation and leave-one-location-out protocols. Results indicate that substantial reductions in model size can be achieved with a minimal loss of performance, and that the optimal trade-offs depend on the evaluation setting; for example, in cross-validation, a 25.2 MiB baseline reaches 96.2% F1, while a 0.062 MiB model attains 92.5%, an approximately 400-fold reduction in size with less than a 4-percentage-point drop. By analyzing the Pareto front of F1 vs. model-size trade-offs, we identify configurations that balance robustness and resource constraints. Our early findings demonstrate the feasibility of deploying edge-ready acoustic models for scalable pollinator monitoring.

Keywords
edge deployment, lightweight model, pruning, quantization, bees

1 Introduction
Bees are widely recognized as major pollinators: animal pollinators including honey bees contribute to yield in 75% of key crop species and an estimated 35% of global crop production [1]. This indicates the importance of pollinator monitoring and protection.

Passive acoustic monitoring (PAM) is a non-invasive approach that continuously records environmental sound with deployed microphones to monitor animal activity. Because it reduces manual surveys and can operate continuously across space and time, even at night and under inclement weather, it has gained attention as a cost-effective biodiversity monitoring technology. PAM has been widely adopted for multiple taxa such as birds and bats; in ornithology, for example, the deep-learning system BirdNET [2] is already used operationally to identify species from passively collected field recordings. PAM is also applied to bee behavior monitoring: in social bees (such as honeybees or bumblebees), microphones and accelerometers placed inside or outside hives enable non-invasive, continuous surveillance of queen presence, swarming cues, and robbing [3]. For solitary bees, recordings at the entrance of nesting boxes are used to detect buzzing and to characterize presence/absence and activity rhythms [4].

In acoustic approaches for bee state monitoring, machine learning has been widely used to automatically determine activity and behavioral states from audio recordings. Prior work includes both classical machine-learning pipelines and deep-learning methods. Classical approaches such as SVM, k-NN, and Random Forests have been shown to be practical and effective [5, 6]. Meanwhile, several studies suggest that CNN-based deep learning models achieve superior performance compared with traditional machine-learning methods [7, 8].
However, if all long-term, continuous PAM recordings are uploaded to the cloud, features such as mel spectrograms and MFCCs are extracted there, and the data are then analyzed using machine learning or deep learning models, the resulting data volumes become extremely large. In a centralized cloud-only workflow, this (i) inflates communication cost by requiring all long-duration audio to be uploaded [9], (ii) raises privacy concerns as incidental human speech can accumulate in the cloud [10], (iii) introduces round-trip latency for feature extraction and inference that impedes timely detection, and (iv) exposes scalability limits as storage and compute demands grow with multi-site, long-term deployments. To address these issues, we developed a high-accuracy, lightweight deep model designed for edge deployment, capable of on-device preprocessing and inference for recorded audio. Here, the term lightweight refers to memory (both RAM and storage); in a broader view it also covers CPU/GPU requirements, latency requirements, and even battery constraints, which are beyond the scope of this paper. In our intended operation, audio is processed on-device and only the result is sent to the cloud, enabling multi-site, long-term monitoring with reduced storage cost and latency, while preserving privacy and power efficiency.

As a first step toward edge-based bee monitoring with PAM, we designed and evaluated a lightweight CNN specialized for solitary-bee buzz detection (a binary classifier distinguishing between buzz and no-buzz). To compress the model, we applied compression techniques such as structured pruning and int8 post-training quantization when appropriate, and we quantified the size-accuracy trade-offs under edge-oriented constraints.

2 Methodology
2.1 Dataset
We used the dataset collected for the purpose of the study by Susič et al. [4]. This dataset comprises acoustic recordings from nesting boxes of solitary bees (predominantly Osmia spp.) collected through a citizen-science project carried out in the Bela Krajina region in southeastern Slovenia. The recordings were gathered from March 15 to May 26, 2023, resulting in 62 long recordings across seven sites, with a mean duration of 6 ± 2.5 hours per recording. For the purpose of this study, three recordings in total were randomly selected from different locations.

The recordings were converted to mono-channel audio, segmented into 4 s windows with 2 s overlap, transformed into Mel spectrograms (128 × 128) configured to cover 50-1450 Hz, and standardized using the mean and standard deviation across the dataset. For labeling, two annotators inspected the spectrograms and assigned buzz=1 or no-buzz=0.

2.2 Neural Network Architecture
We addressed binary detection of solitary-bee buzzing from Mel spectrograms. With memory-constrained edge deployment in mind, we evaluate four lighter CNNs against the ResNet-9 used in [4]. Specifically, we consider MobileNetV2 [11] and three custom lightweight architectures named BeeNet1, BeeNet2, and BeeNet3, which adopt a depthwise separable convolutional design similar to MobileNetV1 [12]. Model sizes and parameter counts are summarized in Table 1, and the architectural details of the BeeNet variants are provided in Table 2. In all architectures, each convolutional layer is followed by batch normalization (BatchNorm) and ReLU activation; dw stands for depthwise convolution. For MobileNetV2, we use the standard backbone and adapt it to spectrograms by converting the first convolution to a 1-channel input and replacing the final linear layer with a 1280 × 2 classifier. All other layers remain identical to the original MobileNetV2.

Table 1: Parameter counts and model sizes of the models used in this study
                  ResNet-9  MobileNetV2  BeeNet1  BeeNet2  BeeNet3
Parameters (k)    6585.5    2225.9       50.2     17.6     6.4
Model size (MiB)  25.2      8.7          0.215    0.084    0.036

While the ResNet-9 approach achieves an F1-score exceeding 95% under five-fold cross-validation on the dataset [4], its 25.2 MiB size renders its deployment on memory-limited edge devices impractical. Accordingly, we designed and configured compact CNNs (MobileNetV2 and the BeeNet family) and, as detailed below, applied quantization and pruning to systematically evaluate the accuracy-model-size trade-off. The aim of this study is to clarify accuracy as a function of model size and the effects of lightweighting techniques under strict model-size constraints, assuming deployment on MCUs. Accordingly, we adopt lightweight and relatively simple architectures, with the smallest model containing approximately 6k parameters.

2.3 Model Compression Methods
Deploying deep neural networks on memory-constrained edge devices necessitates model compression. We examined two complementary techniques: quantization and pruning.

2.3.1 Quantization. Quantization maps floating-point weights and activations to low-bit integers, thereby reducing model size and computation at inference. Here, we adopted post-training quantization (PTQ) and converted the trained network to int8 without additional training. We used the QNNPACK backend in PyTorch for ARM targets. To minimize both saturation (clipping) and rounding error under the 8-bit representation and mitigate accuracy degradation, we performed calibration with up to 300 batches of representative inputs to estimate the scale and zero-point.

2.3.2 Pruning. Pruning reduces model complexity by deleting parameters deemed unimportant, thereby decreasing memory and compute complexity without retraining from scratch. Pruning can be categorized into structured and unstructured approaches. We adopted structured pruning to realize memory savings and speed-ups on commodity hardware, as unstructured sparsity typically requires specialized hardware or software support to translate sparsity into acceleration [13]. Our pruning pipeline followed Han et al. [14]: (1) train, (2) prune, and (3) retrain (fine-tune). For filter (i.e., output-channel) selection, we followed the idea of Li et al. [15], ranking convolutional filters by the L1 norm of their weights and removing those with the smallest scores. We implemented this using PyTorch's torch-pruning, configuring the MagnitudePruner with L1-based importance. The selection of filters to prune was performed globally across layers. The target was controlled by a pruning ratio p; under channel-wise pruning, the resulting parameter-reduction rate is approximately 1 - (1 - p)² [16, 17].
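The following sketch illustrates the eager-mode PTQ flow described in Section 2.3.1, using a toy stand-in model (eager-mode static quantization requires the forward pass to be wrapped in QuantStub/DeQuantStub, and the actual networks, calibration data, and batch counts differ):

import torch
import torch.nn as nn
import torch.ao.quantization as tq

class TinyNet(nn.Module):               # stand-in for a trained BeeNet-style model
    def __init__(self):
        super().__init__()
        self.quant, self.dequant = tq.QuantStub(), tq.DeQuantStub()
        self.fc = nn.Linear(16, 2)
    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

torch.backends.quantized.engine = "qnnpack"      # int8 kernels for ARM targets
model = TinyNet().eval()
model.qconfig = tq.get_default_qconfig("qnnpack")
prepared = tq.prepare(model)                     # insert observers
with torch.no_grad():                            # calibration estimates scale / zero-point
    for _ in range(300):                         # up to 300 batches, as above
        prepared(torch.randn(8, 16))             # representative inputs in practice
quantized = tq.convert(prepared)                 # int8 weights and activations

For pruning, a plain-PyTorch rendering of the L1 filter-ranking criterion of Li et al. [15] is sketched below; torch-pruning automates the actual channel removal and the propagation of dependent shapes, so this only computes the global ranking:

def filter_l1_scores(model):
    scores = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv2d):
            # One score per output channel: sum of |w| over input channels and kernel.
            l1 = module.weight.detach().abs().sum(dim=(1, 2, 3))
            scores += [(name, c, float(v)) for c, v in enumerate(l1)]
    return sorted(scores, key=lambda s: s[2])    # smallest L1 pruned first, globally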
2.4 Experimental Setup
2.4.1 Model Performance Evaluation Metrics. Because we were dealing with a class-imbalanced dataset (more no-buzz than buzz), we used the F1-score as the primary metric. F1 is the harmonic mean of precision and recall, enabling balanced assessment under imbalance.

2.4.2 Evaluation Protocols. We evaluated the buzz-detecting models using two protocols, following [4]: cross-validation (CV) and leave-one-location-out (LOLO). The first is a standard test in machine-learning studies, whereas the second shows how well the model generalizes to data coming from a previously unseen location with potentially different background noise. For CV, annotated segments (4 s windows) were partitioned into five folds; models were trained on four folds and evaluated on the remaining fold, and we reported the mean F1 across folds. Stratification ensures balanced distributions of the buzz/no-buzz classes and the three locations. To mitigate temporal leakage, we performed a time-aware data split. LOLO assessed generalization across sites: models were trained on data from two of the three locations and evaluated on the held-out location, reporting the mean F1 across the three possible holds.
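The two protocols map directly onto standard scikit-learn splitters; the sketch below uses placeholder arrays and omits the paper's additional location balancing and time-aware splitting:

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold

X = np.random.rand(600, 128 * 128)      # placeholder flattened Mel-spectrogram windows
y = np.random.randint(0, 2, 600)        # buzz = 1, no-buzz = 0
loc = np.random.randint(0, 3, 600)      # one of the three recording locations

# CV: stratified five folds over segments.
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    pass  # fit on train_idx, report F1 on test_idx; average over folds

# LOLO: hold out each location in turn to test cross-site generalization.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=loc):
    pass  # three holds; report the mean F1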
2.4.3 Hyperparameters. We trained the models with cross-entropy loss and the Adam optimizer, using a 1-cycle learning-rate schedule (maximum LR = 0.001), gradient clipping at 0.1, batch size 64, and 20 epochs. Compared to [4], the only change was increasing the number of epochs from 10 to 20. For pruning fine-tuning, we trained for 10 epochs with a fixed learning rate of 0.0001 and no scheduler. We compared pruning ratios p of 0% (no pruning), 20%, 30%, and 50%.

Table 2: The architectures of BeeNet1, BeeNet2, and BeeNet3 (Type / Stride, Filter Shape, Input Size)

BeeNet1:
Conv / s1      3×3×1×32      128×128×1
MaxPool / s2   Pool 2×2      128×128×32
Conv dw / s1   3×3×32 dw     64×64×32
Conv / s1      1×1×32×32     64×64×32
MaxPool / s2   Pool 2×2      64×64×32
Conv dw / s1   3×3×32 dw     32×32×32
Conv / s1      1×1×32×64     32×32×32
MaxPool / s2   Pool 2×2      32×32×64
Conv dw / s1   3×3×64 dw     16×16×64
Conv / s1      1×1×64×128    16×16×64
MaxPool / s2   Pool 2×2      16×16×128
Conv dw / s1   3×3×128 dw    8×8×128
Conv / s1      1×1×128×256   8×8×128
MaxPool / s4   Pool 4×4      8×8×256
FC / s1        1024×2        2×2×256
Softmax / s1   Classifier    1×1×2

BeeNet2:
Conv / s1      3×3×1×32      128×128×1
MaxPool / s2   Pool 2×2      128×128×32
Conv dw / s1   3×3×32 dw     64×64×32
Conv / s1      1×1×32×32     64×64×32
MaxPool / s2   Pool 2×2      64×64×32
Conv dw / s1   3×3×32 dw     32×32×32
Conv / s1      1×1×32×64     32×32×32
MaxPool / s2   Pool 2×2      32×32×64
Conv dw / s1   3×3×64 dw     16×16×64
Conv / s1      1×1×64×128    16×16×64
MaxPool / s4   Pool 4×4      16×16×128
FC / s1        2048×2        4×4×128
Softmax / s1   Classifier    1×1×2

BeeNet3:
Conv / s1      3×3×1×32      128×128×1
MaxPool / s2   Pool 2×2      128×128×32
Conv dw / s1   3×3×32 dw     64×64×32
Conv / s1      1×1×32×32     64×64×32
MaxPool / s2   Pool 2×2      64×64×32
Conv dw / s1   3×3×32 dw     32×32×32
Conv / s1      1×1×32×64     32×32×32
MaxPool / s8   Pool 8×8      32×32×64
FC / s1        1024×2        4×4×64
Softmax / s1   Classifier    1×1×2
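Read off Table 2, the smallest variant can be written as the following PyTorch module; this is a minimal reconstruction under the conventions stated in Section 2.2 (BatchNorm + ReLU after every convolution), not the authors' released code:

import torch
import torch.nn as nn

def conv_bn(in_c, out_c, k, groups=1):
    # Per Section 2.2, every convolution is followed by BatchNorm and ReLU.
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, k, padding=k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(out_c),
        nn.ReLU(inplace=True),
    )

class BeeNet3(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn(1, 32, 3), nn.MaxPool2d(2),   # 128x128x1 -> 64x64x32
            conv_bn(32, 32, 3, groups=32),        # depthwise 3x3
            conv_bn(32, 32, 1), nn.MaxPool2d(2),  # pointwise -> 32x32x32
            conv_bn(32, 32, 3, groups=32),        # depthwise 3x3
            conv_bn(32, 64, 1), nn.MaxPool2d(8),  # pointwise -> 4x4x64
        )
        self.classifier = nn.Linear(4 * 4 * 64, 2)  # FC 1024x2

    def forward(self, x):                           # x: (batch, 1, 128, 128)
        return self.classifier(self.features(x).flatten(1))

Counting parameters of this reconstruction gives roughly 6.4k, which matches the BeeNet3 entry in Table 1.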
3 Results
3.1 F1 vs. Model Size
For each model, we trained and evaluated a variety of combinations of pruning ratios and quantizations. Table 3 reports the mean F1 and on-disk model size (in MiB) for each setting. Figure 1 plots all configurations in the F1-model-size plane for CV and LOLO, respectively, with the global Pareto front (the best trade-offs between model performance and size) denoted by a dashed line.

Even under tight memory budgets (< 100 KiB), competitive accuracy is achievable. For example, BeeNet1 (int8, p=0) attains 0.062 MiB with a CV F1 of 92.5% and a LOLO F1 of 85.7%. Relative to ResNet-9 (float32, p=0), this represents an ~400× reduction in model size while keeping F1 within 4 percentage points in both protocols, which is promising for future edge deployment.

Performance degradation from int8 quantization is small: across many settings the F1 drop is about 1 percentage point (pp). With pruning, larger models exhibit smaller accuracy losses as p increases. For example, at p=50% ResNet-9 (float32) decreases only from 96.2% to 95.1% in CV and from 89.5% to 87.6% in LOLO, a decline of about 2 pp in total. By contrast, the more compact BeeNet family is more sensitive: accuracy degrades markedly with p, and at p=50% most configurations lose 4 pp or more.

Inspection of the global Pareto front shows that many frontier points correspond to unpruned float32 or int8 models. At a fixed memory budget, lightly pruned or unpruned lightweight architectures often achieve higher accuracy than heavily pruned larger networks, indicating that purpose-built small models are preferable to aggressive pruning under the same size constraint.

A note on MobileNetV2 at p=30%: the trained model degenerated to predicting no-buzz for almost all inputs. This behavior may stem from a strong structured reduction under class imbalance and warrants further investigation.
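The global Pareto front used in Figure 1 has a simple one-pass construction; the sketch below shows it on a few illustrative (size, F1) pairs taken from the p=0 CV column of Table 3:

def pareto_front(points):
    # points: iterable of (size_mib, f1). A configuration is on the front if no
    # other configuration is both smaller and at least as accurate.
    front, best_f1 = [], float("-inf")
    for size, f1 in sorted(points):   # ascending size
        if f1 > best_f1:
            front.append((size, f1))
            best_f1 = f1
    return front

print(pareto_front([(25.2, 96.2), (8.7, 96.6), (2.2, 95.4), (0.062, 92.5)]))
# -> [(0.062, 92.5), (2.2, 95.4), (8.7, 96.6)]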
Inspection of the global Pareto front shows that many frontier points correspond to unpruned float32 or int8 models. At a fixed memory budget, lightly pruned or unpruned lightweight architectures often achieve higher accuracy than heavily pruned larger networks, indicating that purpose-built small models are preferable to aggressive pruning under the same size constraint.

A note on MobileNetV2 at p=30%: the trained model degenerated to predicting no-buzz for almost all inputs. This behavior may stem from a strong structured reduction under class imbalance and warrants further investigation.

4 Conclusions

We addressed buzz detection in acoustic recordings from solitary-bee nesting boxes, aiming to develop deep-learning models suitable for memory-constrained edge deployment. We designed or selected five CNN architectures and systematically measured the performance vs. model-size trade-offs under quantization and structured pruning. As a result, we obtained sub-100 KiB models achieving F1 scores of at least 92% in CV and 85% in LOLO experiments, indicating the feasibility and strong potential of accurate on-device inference.

For future work, we plan to train the models on additional datasets that we have collected to improve robustness, and to deploy the models on real edge devices. Because our compression pipeline relied on simple techniques, we anticipate further gains by adopting a broader set of compression methods, such as knowledge distillation [18], quantization-aware training (QAT) [19], and neural architecture search (NAS) [20], to optimize model architectures under memory constraints, including the number and sizes of the filters.

Acknowledgements

The authors acknowledge the Slovenian Research and Innovation Agency, grants PR-10495 and J7-50040, and basic core funding P2-0209.

References

[1] Tom D. Breeze, Alison P. Bailey, Kelvin G. Balcombe, and Simon G. Potts. 2011. Pollination services in the UK: how important are honeybees? Agriculture, Ecosystems & Environment, 142, 3-4, 137–143.
[2] Stefan Kahl, Connor M. Wood, Maximilian Eibl, and Holger Klinck. 2021. BirdNET: a deep learning solution for avian diversity monitoring. Ecological Informatics, 61, 101236. doi: 10.1016/j.ecoinf.2021.101236.
[3] Mahsa Abdollahi, Pierre Giovenazzo, and Tiago H. Falk. 2022. Automated beehive acoustics monitoring: a comprehensive review of the literature and recommendations for future work. Applied Sciences, 12, 8, 3920.
[4] David Susič, Johanna A. Robinson, Danilo Bevk, and Anton Gradišek. 2025. Acoustic monitoring of solitary bee activity at nesting boxes. Ecological Solutions and Evidence, 6, 3, e70080. doi: 10.1002/2688-8319.70080.
[5] Alison Pereira Ribeiro, Nádia Felix Felipe da Silva, Fernanda Neiva Mesquita, Priscila de Cássia Souza Araújo, Thierson Couto Rosa, and José Neiva Mesquita-Neto. 2021. Machine learning approach for automatic recognition of tomato-pollinating bees based on their buzzing-sounds. PLOS Computational Biology, 17, 9, 1–21. doi: 10.1371/journal.pcbi.1009426.
[6] Antonio Robles-Guerrero, Tonatiuh Saucedo-Anaya, Carlos A. Guerrero-Mendez, Salvador Gómez-Jiménez, and David J. Navarro-Solís. 2023. Comparative study of machine learning models for bee colony acoustic pattern classification on low computational resources. Sensors, 23, 1. doi: 10.3390/s23010460.
[7] Jaehoon Kim, Jeongkyu Oh, and Tae-Young Heo. 2021. Acoustic scene classification and visualization of beehive sounds using machine learning algorithms and Grad-CAM. Mathematical Problems in Engineering, 2021, 1, 5594498.
[8] Vladimir Kulyukin, Sarbajit Mukherjee, and Prakhar Amlathe. 2018. Toward audio beehive monitoring: deep learning vs. standard machine learning in classifying beehive audio samples. Applied Sciences, 8, 9, 1573.
[9] Carrie C. Wall, Samara M. Haver, Leila T. Hatch, Jennifer Miksis-Olds, Rob Bochenek, Robert P. Dziak, and Jason Gedamke. 2021. The next wave of passive acoustic data management: how centralized access can enhance science. Frontiers in Marine Science, 8, 703682.
[10] Benjamin Cretois, Carolyn M. Rosten, and Sarab S. Sethi. 2022. Voice activity detection in eco-acoustic data enables privacy protection and is a proxy for human disturbance. Methods in Ecology and Evolution, 13, 12, 2865–2874.
[11] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510–4520.
[12] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[13] Hongrong Cheng, Miao Zhang, and Javen Qinfeng Shi. 2024. A survey on deep neural network pruning: taxonomy, comparison, analysis, and recommendations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46, 12, 10558–10578. doi: 10.1109/TPAMI.2024.3447085.
[14] Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems, 28.
[15] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2016. Pruning filters for efficient ConvNets. arXiv preprint arXiv:1608.08710.
[16] Gongfan Fang, Xinyin Ma, Mingli Song, Michael Bi Mi, and Xinchao Wang. 2023. DepGraph: towards any structural pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[17] Gongfan Fang and contributors. 2023. Torch-Pruning: structural pruning for PyTorch. https://github.com/VainF/Torch-Pruning.
[18] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[19] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2704–2713.
[20] Barret Zoph and Quoc V. Le. 2017. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
Interpretable Predictive Clustering Tree for Post-Intubation Hypotension Assessment

Estefanía Žugelj Tapia, Institute of Physiology, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia, estefania.tapia@mf.uni-lj.si
Borut Kirn, Institute of Physiology, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia, borut.kirn@mf.uni-lj.si
Sašo Džeroski, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia, saso.dzeroski@ijs.si

Abstract

Intraoperative hypotension following intubation is a clinically significant event associated with increased morbidity and mortality. This study presents an interpretable predictive clustering tree (PCT) model designed for multi-target prediction of hypotensive outcomes, including the prediction of minimum and maximum mean arterial pressure (MAP) values during hypotension in the post-induction period. The multi-target regression trees (MTRT) were evaluated using 10-fold cross-validation, and feature importance was assessed via a random forest model. Compared to the original tree, the pruned model demonstrated improved generalization and reduced complexity, with fewer nodes and enhanced interpretability. The pruned tree structure enabled clear decision thresholds based on modifiable variables such as MAP_after_5min, MAP_basal, and Propofol dose. While the random forest achieved the highest performance and had high complexity, its feature importance ranking supported the relevance of the attributes retained in the pruned model and provided complementary insights, highlighting globally relevant features, such as SBP_after_5min, that were not prioritized in the single trees. These findings support the use of interpretable models in clinical decision support to anticipate and potentially modify the occurrence of post-intubation hypotension.

Keywords

multi-target prediction, interpretable machine learning, decision tree pruning, feature importance, post-intubation, intraoperative hypotension
1 Introduction

Intubation is a common procedure in emergency departments and operating rooms, typically performed immediately after the administration of induction agents. These agents have been associated with hemodynamic instability and post-induction hypotension (PIH), frequently defined as mean arterial pressure (MAP) <65 mmHg [1]. In perioperative medicine in particular, PIH has been related to worse postoperative outcomes, increased comorbidity, and mortality [2,3]. PIH occurrence is limited to the first 30 minutes post-induction, as this period is directly affected by anesthesia effects and is usually not related to complex factors due to surgery [4].

Regarding risk factors, a post hoc analysis in a surgical population of patients at risk of aspiration of gastric content identified, in the multivariate analysis, several risk factors associated with PIH: age, a higher baseline heart rate, bowel occlusion requiring nasogastric tube placement before intubation, and the use of remifentanil. A prospective multicenter study found that in the group with hypotension, the dose (mg/kg) of Propofol was significantly higher at 5 and 10 minutes after induction [5]. On the other hand, the following protective factors have been described: low doses of ketamine and basal systolic blood pressure (SBP) [2].

Previous studies have employed traditional multivariate analysis to identify risk factors and have focused on predicting a single target: the presence of hypotension [2,4,5]. However, predicting multiple outcomes simultaneously can capture complex interactions and provide more informative insights, aiding clinical decision-making and support. The hypothesis of this study is therefore that predicting multiple outcomes of PIH simultaneously can effectively identify which variables are most influential in predicting PIH. Overall, this study contributes to the prediction of PIH, which can help anesthesiologists make better decisions during induction, potentially improving patient outcomes.

2 Methods

Predictive clustering trees (PCT) are a machine learning framework that unifies clustering and prediction tasks. In this framework, the node at the root (the top node) corresponds to the cluster that contains all the data, and each subsequent split partitions the data to minimize intra-cluster variance. CLUS is free software that implements this framework and supports multi-target prediction. In a multi-target regression tree (MTRT), the obtained tree is more reliable in explaining the dependencies between variables, and the prediction is a vector of values of the target attributes [6,7]. For this reason, CLUS version 2.12.8 was chosen as the software for this retrospective analysis. The documentation and latest version can be found at https://github.com/knowledge-technologies/clus/tree/main.
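As a toy illustration of this split heuristic (not CLUS code), the following sketch scores a candidate split on a multi-target matrix Y by how much it reduces the size-weighted sum of per-target variances; x and Y are hypothetical arrays.

```python
# Toy illustration (not CLUS code) of the variance-reduction heuristic
# a PCT uses to score a candidate split on multi-target data. Rows of Y
# are examples; columns are the target attributes.
import numpy as np

def total_variance(Y):
    # Sum of per-target variances within one cluster.
    return Y.var(axis=0).sum() if len(Y) > 0 else 0.0

def variance_reduction(Y, mask):
    left, right = Y[mask], Y[~mask]
    n = len(Y)
    return (total_variance(Y)
            - len(left) / n * total_variance(left)
            - len(right) / n * total_variance(right))

# Example with a hypothetical feature column x (e.g. MAP_after_5min):
# gain = variance_reduction(Y, x < 67.0)
```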
Data were sourced from the SIS subset of the MOVER database (https://mover.ics.uci.edu/), a public database of anonymized patients undergoing various types of surgery [8].

The inclusion and exclusion criteria were the following:

- Inclusion criteria: 1) patients who underwent major surgical procedures with a documented application and dose of one of the following medications during induction of general anesthesia: Midazolam, Propofol, Fentanyl, Succinylcholine, Ketamine, Cisatracurium, Etomidate, Vecuronium, or Rocuronium; 2) high-temporal-resolution vital signs of systolic blood pressure (SBP), diastolic blood pressure (DBP), and mean arterial pressure (MAP) measured from the radial arterial line, registered before the time of intubation and 30 minutes after it, with at least one measurement of MAP <65 mmHg during the post-intubation period.
- Exclusion criteria: patients with vital signs outside physiological ranges (MAP <30 mmHg or MAP >200 mmHg), and patients who did not meet the inclusion criteria.

As descriptive and target attributes of the learning problem (see Table 1), the following variables were calculated: 1) MAP_basal: average of MAP measurements before intubation; 2) SBP_basal: average of SBP measurements before intubation; 3) DBP_basal: average of DBP measurements before intubation; 4) MAP_after_5min: average of MAP measurements taken over a 5-minute period after intubation; 5) SBP_after_5min: average of SBP measurements taken over a 5-minute period after intubation; 6) DBP_after_5min: average of DBP measurements taken over a 5-minute period after intubation; 7) Min_MAP<65: minimum MAP <65 mmHg registered from intubation up to 30 minutes after; 8) Max_MAP<65: maximum MAP <65 mmHg registered from intubation up to 30 minutes after; 9) MAP<65_count: count of registered measurements <65 mmHg over the 30-minute interval after intubation; 10) MAP_mean_after_30min: average of MAP measurements over the 30-minute interval after intubation; 11) SBP_mean_after_30min: average of SBP measurements over the 30-minute interval after intubation; 12) MAP<65_mean_after_30_min: average of MAP measurements <65 mmHg over the 30-minute interval after intubation; and 13) body mass index (BMI): weight / ((height / 100)²). During data preparation, missing values of the height attribute were replaced with the mean value of the attribute.

Table 1: Descriptive and target attributes

Descriptive attributes (20): MAP_basal, SBP_basal, DBP_basal, MAP_after_5min, SBP_after_5min, DBP_after_5min, Age, Gender, Height, Weight, BMI, and the cumulative doses of Midazolam, Propofol, Fentanyl, Succinylcholine, Ketamine, Cisatracurium, Etomidate, Vecuronium, and Rocuronium.
Target attributes (6): Min_MAP<65, Max_MAP<65, MAP<65_count, MAP<65_mean_after_30_min, MAP_mean_after_30min, SBP_mean_after_30min.

After defining the descriptive and target attributes, the entire dataset of 340 patients was split into training and test sets using the sklearn library and the train_test_split function: 80% of the dataset was used for training (272 patients) and 20% for testing (68 patients). To run CLUS, the training and test sets were converted to ARFF format, and corresponding settings files (.s) were created to define the model parameters for the MTRT tasks. Both single-tree and ensemble models were trained, as summarized in Table 2.

Table 2: Tree and ensemble specifications for each respective MTRT

Model              Predictive clustering tree (PCT)   Random forest
Heuristic          Variance Reduction                 Variance Reduction
Pruning method     M5Multi                            -
Ensemble method    -                                  RForest
Feature ranking    -                                  Genie3

As an alternative to the train/test split, the -xval command-line option was used when running CLUS to perform cross-validation on all 340 examples. The number of folds (n = 10) was specified beforehand in the settings file.
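A sketch of this preparation step might look as follows; the file and column names are hypothetical, and the ARFF writer is simplified to numeric attributes only.

```python
# Sketch of the described preparation (hypothetical file and column names):
# mean imputation of height, BMI derivation, an 80/20 split with sklearn,
# and a simplified ARFF export for CLUS (numeric attributes only).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("mover_sis_subset.csv")            # hypothetical file name
df["height"] = df["height"].fillna(df["height"].mean())
df["BMI"] = df["weight"] / ((df["height"] / 100) ** 2)

train, test = train_test_split(df, train_size=0.8, random_state=42)

def to_arff(frame, path, relation="pih"):
    with open(path, "w") as f:
        f.write(f"@RELATION {relation}\n")
        for col in frame.columns:
            f.write(f"@ATTRIBUTE {col} NUMERIC\n")
        f.write("@DATA\n")
        frame.to_csv(f, header=False, index=False)

to_arff(train, "train.arff")
to_arff(test, "test.arff")
```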
Model performance was evaluated using the following metrics: mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), root relative mean squared error (RRMSE), and the Pearson correlation coefficient (r²), computed on both the training and test sets.
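For reference, the following sketch computes these metrics for one target. The exact RRMSE convention used by CLUS is not spelled out here, so it is assumed to be the RMSE normalized by the RMSE of always predicting the training-set mean (so a default model scores 1.00, as in the tables below).

```python
# Per-target metric sketch. The RRMSE convention is an assumption: RMSE
# normalized by the RMSE of predicting the training mean for every example.
import numpy as np

def regression_metrics(y_true, y_pred, y_train_mean):
    err = y_true - y_pred
    mae = np.abs(err).mean()
    mse = (err ** 2).mean()
    rmse = np.sqrt(mse)
    rrmse = rmse / np.sqrt(((y_true - y_train_mean) ** 2).mean())
    r2 = np.corrcoef(y_true, y_pred)[0, 1] ** 2  # squared Pearson correlation
    return mae, mse, rmse, rrmse, r2
```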
3 Results

After applying the exclusion criteria, we were left with 340 patients. Figure 1 illustrates the flow chart for patient selection, and Table 3 shows the demographic characteristics of the selected patients.

Figure 1: Overview of the sample population included in this study.

Table 3: Data set population characteristics

Age, years, mean (SD)       58.9 (18.9)
Gender (male), count        201
Weight, kg, mean (SD)       78.6 (23.1)
Height, cm, mean (SD)       168.4 (11.1)
BMI, kg/m², mean (SD)       1.5 (6.8)

3.1 Complexity and Structure of the Models

The induction time for the original model was significantly shorter (0.03 s, pruning time 0 s) than for the random forest model (1.62 s), reflecting its reduced complexity. Structurally, the original tree consisted of 241 nodes, 121 leaves, and a depth of 17, whereas the pruned tree was noticeably simpler, with only 19 nodes, 10 leaves, and a depth of 6. The ensemble random forest model, composed of 100 trees, contained a total of 21,050 nodes and 10,575 leaves, with an average tree depth of 154, indicating a significantly higher complexity and capacity for capturing intricate patterns in the data.

3.2 Model Performance

The forest with 100 trees exhibits the best performance overall. However, pruning significantly simplified the original model while retaining, and even improving, its predictive power, with lower testing errors for MAE, MSE, RMSE, and RRMSE compared to the original tree (see Table 4).

Table 4: Metrics for training and testing errors (train / test)

Metric       Default           Original (unpruned)   Pruned          Forest (100 trees)
MAE          7.27 / 7.30       2.58 / 7.55           5.41 / 6.18     2.93 / 5.22
MSE          109.35 / 110.12   17.77 / 120.54        61.39 / 83.03   18.41 / 51.05
RMSE         9.41 / 9.45       3.81 / 10.15          7.09 / 8.33     3.96 / 6.76
RRMSE        1.00 / 1.00       0.42 / 1.13           0.76 / 0.91     0.44 / 0.86
Pearson r²   - / 0.04          0.82 / 0.14           0.42 / 0.21     0.89 / 0.26

3.3 Cross-Validation Results

The 10-fold cross-validation was conducted using all 340 examples, with an induction time of 0.26 s for the single tree and 9.75 s for the ensemble random forest. The mean number of tests was 267 for the original model, 39.2 for the pruned model, and 100 for the random forest. As shown in Table 5, the absolute error metrics (MAE, MSE, RMSE) were higher than with the train/test split; however, the cross-validation approach yielded lower testing errors for RRMSE and higher Pearson r² values. Note that cross-validation yields more realistic estimates of the error on unseen examples than a single train/test split.

Table 5: Cross-validation metrics for training and testing errors (train / test)

Metric       Default           Original (unpruned)   Pruned          Forest (100 trees)
MAE          13.62 / 13.69     1.80 / 10.49          5.78 / 9.28     2.82 / 5.6
MSE          300.15 / 302.3    9.22 / 193.3          64.2 / 150.5    16.83 / 63.15
RMSE         17.32 / 17.39     3.04 / 13.90          8.01 / 12.27    3.81 / 7.41
RRMSE        1.00 / 1.00       0.18 / 0.80           0.46 / 0.70     0.43 / 0.84
Pearson r²   0.0003 / 0.02     0.97 / 0.45           0.79 / 0.52     0.89 / 0.28

3.4 Original Model

As stated in Section 3.1, the original model contains 241 nodes and 121 leaves. MAP_after_5min is at the root node, followed by MAP_basal; these two variables appear in the tree more than once. Except for Cisatracurium, Ketamine, and Etomidate, all remaining descriptive attributes appear in the nodes at least once, with different thresholds.

3.5 Pruned Model

In the pruned model, the descriptive attributes retained for multi-target prediction were MAP_after_5min, MAP_basal, BMI, SBP_basal, DBP_after_5min, and Propofol dose. Compared to the original tree, the pruned model demonstrated improved generalization and interpretability, with a significantly reduced number of nodes, as illustrated in Figure 2.

Figure 2: Pruned tree predicting Min_MAP<65, Max_MAP<65, MAP<65_count, MAP<65_mean_after_30_min, MAP_mean_after_30min, and SBP_mean_after_30min. Leaves display predictions in orange.

The highest predicted values for the target attributes (97.9 mmHg for MAP_mean_after_30min and 149.8 mmHg for SBP_mean_after_30min) were observed when MAP_after_5min exceeded 93 mmHg and SBP_basal was greater than 181 mmHg. On the other hand, the lowest predicted values of these target variables (50.4 and 58.2 mmHg) were derived from the following nodes: MAP_after_5min below 56 mmHg, BMI <34.5 kg/m², and Propofol dose >150 mg. Additionally, the leaf node corresponding to BMI >34.5 kg/m² yielded the deepest value for Min_MAP<65, at 26.3 mmHg. Other notably low predictions related to hypotension included Max_MAP<65 at 43.7 mmHg and MAP<65_mean_after_30_min at 43.3 mmHg, both derived from the node where MAP_basal was below 51 mmHg.

3.6 Forest and Feature Ranking

Despite the complexity of the forest with 100 trees, the feature ranking, in which feature importance was assessed using the Genie3 score, helps to identify the descriptive attributes that contributed most to the final multi-target prediction. Figure 3 lists the first eleven descriptive attributes, ranked by their importance scores. MAP_after_5min and SBP_after_5min are clearly the most influential features in the model; MAP_basal and SBP_basal follow closely and also contribute significantly.

Figure 3: Descriptive attributes contributing most to the random forest's prediction, sorted by importance score.
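Genie3 scoring is specific to CLUS, but an analogous ensemble ranking can be obtained from impurity-based importances of a multi-output random forest, as in the following sketch; X_train, Y_train, and feature_names are placeholders for the Table 1 matrices, and this is an approximation of the idea rather than the Genie3 score itself.

```python
# Analogous ranking (not Genie3 itself): impurity-based importances from a
# multi-output random forest, aggregated over 100 trees. X_train, Y_train,
# and feature_names are placeholders for the Table 1 data.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, Y_train)          # one column of Y_train per target attribute

ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking[:11]:  # top eleven attributes, as in Figure 3
    print(f"{name}: {score:.3f}")
```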
4 Discussion & Conclusion

The advantages of using a predictive clustering method for multi-target prediction include the ability to capture complex interactions between descriptive attributes and the simultaneous prediction of multiple outcomes [6,7]. A key novelty of this study is its focus on predicting multiple outcomes related to hypotension. This multi-target approach provides a more comprehensive overview and enhances clinical decision support. In clinical practice, anesthesiologists need to anticipate and often ask themselves: How low will MAP values drop? How will MAP evolve throughout the procedure? This is highly relevant because deeper and longer hypotensive episodes increase the occurrence of adverse events associated with intraoperative hypotension [3,4].

In this study, the pruned model included MAP_after_5min and MAP_basal among the most important variables for the multi-target prediction. Previous studies have significantly associated PIH with the basal or pre-induction MAP [2,4,5], and our results confirm this observation: in the root node, the MAP value was most relevant when calculated immediately, 5 minutes after intubation, specifically with a decisive threshold of 67 mmHg. To diminish the impact of basal blood pressure values on the occurrence of PIH episodes, some proposals include discontinuing renin–angiotensin–aldosterone system antagonists on the day of surgery and taking proactive measures to elevate preoperative values so as to relieve the effect of the anesthetic medications, which could prevent the appearance of PIH [3,4].

The obtained pruned predictive clustering tree model showed lower testing errors across all metrics compared to the original tree, with improved performance, interpretability, and generalization. Nevertheless, the random forest model performed best. Regardless of the complexity of the ensemble model, the feature ranking provided valuable insights into the contribution of each attribute to the final prediction; some of these top-ranked features also appear along the nodes of the unpruned and pruned trees. By aggregating importance across multiple trees, random forests can highlight globally relevant features that may not dominate early decision paths in a single tree. For example, SBP_after_5min was ranked second in importance, but it did not appear in the top splits of the unpruned tree. The pruned tree includes BMI and Propofol dose, but not SBP_after_5min, age, and DBP_basal, even though these ranked higher than BMI and Propofol dose. The association between higher age and PIH has been noted in the past [2,5], and age is usually considered during risk evaluation; however, it is not a modifiable attribute.

In sum, this study demonstrates that interpretable models such as pruned trees, when supported by feature importance from high-performing models, can validate and offer clear, decisive thresholds for modifiable and actionable variables that impact MAP values in the post-induction period, thereby helping to reduce PIH-related comorbidity and mortality. This highlights their potential utility as decision-support tools in clinical settings.

References

[1] Salmasi V, Maheshwari K, Yang D, Mascha EJ, Singh A, Sessler DI, et al. Relationship between intraoperative hypotension, defined by either reduction from baseline or absolute thresholds, and acute kidney and myocardial injury after noncardiac surgery. Anesthesiology 2017;126(1):47–65. doi: 10.1097/ALN.0000000000001432.
[2] Grillot N, Gonzalez V, Deransy R, Rouhani A, Cintrat G, Rooze P, et al. Post-induction hypotension during rapid sequence intubation in the operating room: a post hoc analysis of the randomized controlled REMICRUSH trial. Anaesth Crit Care Pain Med 2025;44(3):101502. doi: 10.1016/j.accpm.2025.101502.
[3] Sessler DI, Bloomstone JA, Aronson S, Berry C, Gan TJ, Kellum JA, et al. Perioperative Quality Initiative consensus statement on intraoperative blood pressure, risk and outcomes for elective surgery. Br J Anaesth 2019;122(5):563–74. doi: 10.1016/j.bja.2019.01.013.
[4] Südfeld S, Brechnitz S, Wagner JY, Reese PC, Pinnschmidt HO, Reuter DA, et al. Post-induction hypotension and early intraoperative hypotension associated with general anaesthesia. Br J Anaesth 2017;119(1):57–64. doi: 10.1093/bja/aex127.
[5] Jor O, Maca J, Koutna J, Gemrotova M, Vymazal T, Litschmannova M, et al. Hypotension after induction of general anesthesia: occurrence, risk factors, and therapy. A prospective multicentre observational study. J Anesth 2018;32(5):673–80. doi: 10.1007/s00540-018-2532-6.
[6] Kocev D, Vens C, Struyf J, Džeroski S. Tree ensembles for predicting structured outputs. Pattern Recognit 2012;46:817–33. doi: 10.1016/j.patcog.2012.09.023.
[7] Petković M, Levatić J, Kocev D, Breskvar M, Džeroski S. CLUSplus: a decision tree-based framework for predicting structured outputs. SoftwareX 2023;24:101526. doi: 10.1016/j.softx.2023.101526.
[8] Samad M, Angel M, Rinehart J, Kanomata Y, Baldi P, Cannesson M. Medical Informatics Operating Room Vitals and Events Repository (MOVER): a public-access operating room database. JAMIA Open 2023;6(4). doi: 10.1093/jamiaopen/ooad084.
Indeks avtorjev / Author index

Ambrožič Žan 7
Andova Andrejaana 23
Anžur Zoja 11
Azad Fatemeh 15
Bianco Lorenzo 7
Bohanec Marko 19
Buchaillot Maria Luisa 79
Builder Calum 79
Comte Thibault 43
Cork Jordan 23
Crozzoli Miguel 79
Di Giacomo Valentina 71
Dobravec Blaž 27
Dominici Gabrielle 71
Džeroski Sašo 87
Fenoglio Dario 71
Filipič Bogdan 23
Finžgar Miha 83
Gams Matjaž 43
Gradišek Anton 7, 35, 63, 79, 83
Hassani Yanny 43
Herke Anna-Katharina 31
Inagawa Maori 47
Jelenc Matej 35
Jordan Marko 71
Jurič Rok 35
Kirn Borut 87
Kocuvan Primož 39
Kolar Žiga 43
Kramar Sebastjan 71
Krömer Pavel 23
Krstevska Ana 71
Kukar Matjaž 15
Longar Vinko 39
Louvancour Hugues 43
Lukan Junoš 47
Luštrek Mitja 11, 47, 55, 71
Maistrou Sevasti 79
Mancuso Elena 71
Marinković Mila 51
Nemec Vid 55
op den Akker Harm 71
Rajher Rok 59
Rajkovič Uroš 19
Rajkovič Vladislav 19
Ratajec Mariša 63
Rechberger Nina 67
Reščič Nina 63, 71
Romanova Alex 75
Shulajkovska Miljana 35
Slapničar Gašper 11, 55
Smerkol Maj 7, 83
Štrum Rok 7
Struna Rok 39
Susič David 7, 79, 83
Tušar Tea 23
van der Jagt Lotte 71
Vastenburg Martijn 71
Vukašinović Dragana 79
Yagi Ryo 83
Žabkar Jure 27, 51, 59
Založnik Marcel 71
Žugelj Tapia Estefanía 87

Slovenska konferenca o umetni inteligenci / Slovenian Conference on Artificial Intelligence
Uredniki / Editors: Mitja Luštrek, Matjaž Gams, Rok Piltaver