Zbornik 27. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2024 Zvezek A Proceedings of the 27th International Multiconference INFORMATION SOCIETY – IS 2024 Volume A Slovenska konferenca o umetni inteligenci Slovenian Conference on Artificial Intelligence Uredniki / Editors Mitja Luštrek, Matjaž Gams, Rok Piltaver http://is.ijs.si 10.–11. oktober 2024 / 10–11 October 2024 Ljubljana, Slovenia Uredniki: Mitja Luštrek Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana Matjaž Gams Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana Rok Piltaver Outfit7, Ljubljana Založnik: Institut »Jožef Stefan«, Ljubljana Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak Oblikovanje naslovnice: Vesna Lasič Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety Ljubljana, oktober 2024 Informacijska družba ISSN 2630-371X Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani COBISS.SI-ID 214409987 ISBN 978-961-264-299-0 (PDF) PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2024 Leto 2024 je hkrati udarno in tradicionalno. Že sedaj, še bolj pa v prihodnosti bosta računalništvo, informatika (RI) in umetna inteligenca (UI) igrali ključno vlogo pri oblikovanju napredne in trajnostne družbe. Smo na pragu nove dobe, v kateri generativna umetna inteligenca, kot je ChatGPT, in drugi inovativni pristopi utirajo pot k superinteligenci in singularnosti, ključnim elementom, ki bodo definirali razcvet človeške civilizacije. Naša konferenca je zato hkrati tradicionalna znanstvena, pa tudi povsem akademsko odprta za nove pogumne ideje, inkubator novih pogledov in idej. Letošnja konferenca ne le da analizira področja RI, temveč prinaša tudi osrednje razprave o perečih temah današnjega časa – ohranjanje okolja, demografski izzivi, zdravstvo in preobrazba družbenih struktur. Razvoj UI ponuja rešitve za skoraj vse izzive, s katerimi se soočamo, kar poudarja pomen sodelovanja med strokovnjaki, raziskovalci in odločevalci, da bi skupaj oblikovali strategije za prihodnost. Zavedamo se, da živimo v času velikih sprememb, kjer je ključno, da s poglobljenim znanjem in inovativnimi pristopi oblikujemo informacijsko družbo, ki bo varna, vključujoča in trajnostna. Letos smo ponosni, da smo v okviru multikonference združili dvanajst izjemnih konferenc, ki odražajo širino in globino informacijskih ved: CHATMED v zdravstvu, Demografske in družinske analize, Digitalna preobrazba zdravstvene nege, Digitalna vključenost v informacijski družbi – DIGIN 2024, Kognitivna znanost, Konferenca o zdravi dolgoživosti, Legende računalništva in informatike, Mednarodna konferenca o prenosu tehnologij, Miti in resnice o varovanju okolja, Odkrivanje znanja in podatkovna skladišča – SIKDD 2024, Slovenska konferenca o umetni inteligenci, Vzgoja in izobraževanje v RI. Poleg referatov bodo razprave na okroglih mizah in delavnicah omogočile poglobljeno izmenjavo mnenj, ki bo oblikovala prihodnjo informacijsko družbo. “Legende računalništva in informatike” predstavljajo slovenski “Hall of Fame” za odlične posameznike s tega področja, razširjeni referati, objavljeni v reviji Informatica z 48-letno tradicijo odličnosti, in sodelovanje s številnimi akademskimi institucijami in združenji, kot so ACM Slovenija, SLAIS in Inženirska akademija Slovenije, bodo še naprej spodbujali razvoj informacijske družbe. Skupaj bomo gradili temelje za prihodnost, ki bo oblikovana s tehnologijami, osredotočena na človeka in njegove potrebe. 
S podelitvijo nagrad, še posebej z nagrado Michie-Turing, se avtonomna RI stroka vsako leto opredeli do najbolj izstopajočih dosežkov. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe je prejel prof. dr. Borut Žalik. Priznanje za dosežek leta pripada prof. dr. Sašu Džeroskemu za izjemne raziskovalne dosežke. »Informacijsko limono« za najmanj primerno informacijsko tematiko je prejela nabava in razdeljevanje osebnih računalnikov ministrstva, »informacijsko jagodo« kot najboljšo potezo pa so prejeli organizatorji tekmovanja ACM Slovenija. Čestitke nagrajencem! Naša vizija je jasna: prepoznati, izkoristiti in oblikovati priložnosti, ki jih prinaša digitalna preobrazba, ter ustvariti informacijsko družbo, ki bo koristila vsem njenim članom. Vsem sodelujočim se zahvaljujemo za njihov prispevek k tej viziji in se veselimo prihodnjih dosežkov, ki jih bo oblikovala ta konferenca. Mojca Ciglarič, predsednica programskega odbora; Matjaž Gams, predsednik organizacijskega odbora

PREFACE TO THE MULTICONFERENCE INFORMATION SOCIETY 2024

The year 2024 is both ground-breaking and traditional. Now, and even more so in the future, computer science, informatics (CS/I), and artificial intelligence (AI) will play a crucial role in shaping an advanced and sustainable society. We are on the brink of a new era where generative artificial intelligence, such as ChatGPT, and other innovative approaches are paving the way for superintelligence and singularity—key elements that will define the flourishing of human civilization. Our conference is therefore both a traditional scientific gathering and an academically open incubator for bold new ideas and perspectives. This year's conference analyzes key CS/I areas and brings forward central discussions on pressing contemporary issues—environmental preservation, demographic challenges, healthcare, and the transformation of social structures. AI development offers solutions to nearly all challenges we face, emphasizing the importance of collaboration between experts, researchers, and policymakers to shape future strategies collectively. We recognize that we live in times of significant change, where it is crucial to build an information society that is safe, inclusive, and sustainable, through deep knowledge and innovative approaches. This year, we are proud to have brought together twelve exceptional conferences within the multiconference framework, reflecting the breadth and depth of information sciences: • CHATMED in Healthcare • Demographic and Family Analyses • Digital Transformation of Healthcare Nursing • Digital Inclusion in the Information Society – DIGIN 2024 • Cognitive Science • Conference on Healthy Longevity • Legends of Computer Science and Informatics • International Conference on Technology Transfer • Myths and Facts on Environmental Protection • Data Mining and Data Warehouses – SIKDD 2024 • Slovenian Conference on Artificial Intelligence • Education and Training in CS/IS. In addition to papers, roundtable discussions and workshops will facilitate in-depth exchanges that will help shape the future information society. The "Legends of Computer Science and Informatics" represents Slovenia's "Hall of Fame" for outstanding individuals in this field.
At the same time, extended papers published in the Informatica journal, with over 48 years of excellence, and collaboration with numerous academic institutions and associations, such as ACM Slovenia, SLAIS, and the Slovenian Academy of Engineering, will continue to foster the development of the information society. Together, we will build the foundation for a future shaped by technology, yet focused on human needs. The autonomous CS/IS community annually recognizes the most outstanding achievements through the awards ceremony. The Michie-Turing Award for an exceptional lifetime contribution to the development and promotion of the information society was awarded to Prof. Dr. Borut Žalik. The Achievement of the Year Award goes to Prof. Dr. Sašo Džeroski. The "Information Lemon" for the least appropriate information topic was given to the ministry's procurement and distribution of personal computers. At the same time, the "Information Strawberry" for the best initiative was awarded to the organizers of the ACM Slovenia competition. Congratulations to all the award winners! Our vision is clear: to recognize, seize, and shape the opportunities brought by digital transformation and create an information society that benefits all its members. We thank all participants for their contributions and look forward to this conference's future achievements. Mojca Ciglarič, Chair of the Program Committee; Matjaž Gams, Chair of the Organizing Committee

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee: Vladimir Bajic (South Africa), Heiner Benking (Germany), Se Woo Cheon (South Korea), Howie Firth (UK), Olga Fomichova (Russia), Vladimir Fomichov (Russia), Vesna Hljuz Dobric (Croatia), Alfred Inselberg (Israel), Jay Liebowitz (USA), Huan Liu (Singapore), Henz Martin (Germany), Marcin Paprzycki (USA), Claude Sammut (Australia), Jiri Wiedermann (Czech Republic), Xindong Wu (USA), Yiming Ye (USA), Ning Zhong (USA), Wray Buntine (Australia), Bezalel Gavish (USA), Gal A. Kaminka (Israel), Mike Bain (Australia), Michela Milano (Italy), Derong Liu (Chicago, USA), Toby Walsh (Australia), Sergio Campos-Cordobes (Spain), Shabnam Farahmand (Finland), Sergio Crovella (Italy)

Organizing Committee: Matjaž Gams (chair), Mitja Luštrek, Lana Zemljak, Vesna Koricki, Mitja Lasič, Blaž Mahnič

Programme Committee: Mojca Ciglarič (chair), Marjan Heričko, Baldomir Zajc, Bojan Orel, Borka Jerman Blažič Džonova, Blaž Zupan, Franc Solina, Gorazd Kandus, Boris Žemva, Viljan Mahnič, Urban Kordeš, Leon Žlajpah, Cene Bavec, Marjan Krisper, Niko Zimic, Tomaž Kalin, Andrej Kuščer, Rok Piltaver, Jozsef Györkös, Jadran Lenarčič, Toma Strle, Tadej Bajd, Borut Likar, Tine Kolenik, Jaroslav Berce, Janez Malačič, Franci Pivec, Mojca Bernik, Olga Markič, Uroš Rajkovič, Marko Bohanec, Dunja Mladenič, Borut Batagelj, Ivan Bratko, Franc Novak, Tomaž Ogrin, Andrej Brodnik, Vladislav Rajkovič, Aleš Ude, Dušan Caf, Grega Repovš, Bojan Blažica, Saša Divjak, Ivan Rozman, Matjaž Kljun, Tomaž Erjavec, Niko Schlamberger, Robert Blatnik, Bogdan Filipič, Stanko Strmčnik, Erik Dovgan, Andrej Gams, Jurij Šilc, Špela Stres, Matjaž Gams, Jurij Tasič, Anton Gradišek, Mitja Luštrek, Denis Trček, Marko Grobelnik, Andrej Ule, Nikola Guid, Boštjan Vilfan

KAZALO / TABLE OF CONTENTS Slovenska konferenca o umetni inteligenci / Slovenian Conference on Artificial Intelligence ................ 1 PREDGOVOR / FOREWORD ............................................................................................................................... 3 PROGRAMSKI ODBORI / PROGRAMME COMMITTEES ...............................................................................
5 PandaChat-RAG: Towards the Benchmark for Slovenian RAG Applications / Kuzman Taja, Pavleska Tanja, Rupnik Urban, Cigoj Primož .............................................................................................................................. 7 Choosing Features for Stress Prediction with Machine Learning / Bengeri Katja, Lukan Junoš, Luštrek Mitja . 11 Predictive Modeling of Football Results in the WWIN League of Bosnia and Herzegovina / Vladić Ervin, Mehanović Dželila, Avdić Elma ...................................................................................................................... 15 Sarcasm Detection in a Less-Resourced Language / Đoković Lazar, Robnik-Šikonja Marko ............................ 19 Speech-to-Service: Using LLMs to Facilitate Recording of Services in Healthcare / Smerkol Maj, Susič Rok, Ratajec Mariša, Halbwachs Helena, Gradišek Anton ...................................................................................... 23 Performance Comparison of Axle Weight Prediction Algorithms on Time-Series Data / Kolar Žiga, Susič David, Konečnik Martin, Prestor Domen, Pejanovič Nosaka Tomo, Kulauzović Bajko, Kalin Jan, Skobir Matjaž, Gams Matjaž ....................................................................................................................................... 27 Comparison of Feature- and Embedding-based Approaches for Audio and Visual Emotion Classification / Trojer Sebastijan, Anžur Zoja, Luštrek Mitja, Slapničar Gašper ..................................................................... 31 Multi-modal Data Collection and Preliminary Statistical Analysis for Cognitive Load Assessment / Krstevska Ana, Kramar Sebastjan, Gjoreski Hristijan, Gjoreski Martin, Lukan Junoš, Trojer Sebastijan, Luštrek Mitja, Slapničar Gašper .............................................................................................................................................. 35 Predicting Health-Related Absenteeism with Machine Learning: A Case Study / Piciga Aleksander, Kukar Matjaž ............................................................................................................................................................... 39 Puzzle Generation for Ultimate-Tic-Tac-Toe / Zirkelbach Maj, Sadikov Aleksander ......................................... 43 Ethical Consideration and Sociological Challenges in the Integration of Artificial Intelligence in Mental Health Services / Poljak Lukek Saša........................................................................................................................... 47 Optimization Problem Inspector: A Tool for Analysis of Industrial Optimization Problems and Their Solutions / Tušar Tea, Cork Jordan, Andova Andrejaana, Filipič Bogdan ........................................................................ 51 Multi-Agent System for Autonomous Table Football: A Winning Strategy / Založnik Marcel, Šoln Kristjan ... 55 Towards a Decision Support System for Project Planning: Multi-Criteria Evaluation of Past Projects Success / Hafner Miha, Bohanec Marko .......................................................................................................................... 59 Minimizing Costs and Risks in Demand Response Optimization: Insights from Initial Experiments / Nedić Mila, Tušar Tea ................................................................................................................................................ 
63 Predicting Hydrogen Adsorption Energies on Platinum Nanoparticles and Surfaces With Machine Learning / Gašparič Lea, Kokalj Anton, Džeroski Sašo .................................................................................................... 67 SmartCHANGE Risk Prediction Tool: Demonstrating Risk Assessment for Children and Youth / Jordan Marko, Reščič Nina, Kramar Sebastjan, Založnik Marcel, Luštrek Mitja ....................................................... 71 Predicting Mental States During VR Sessions Using Sensor Data and Machine Learning / Kizhevska Emilija, Luštrek Mitja .................................................................................................................................................... 75 Biomarker Prediction in Colorectal Cancer Using Multiple Instance Learning / Shulajkovska Miljana, Jelenc Matej, Jonnagaddala Jitenndra, Gradišek Anton .............................................................................................. 79 Feature-Based Emotion Classification Using Eye-Tracking Data / Božak Tomi, Luštrek Mitja, Slapničar Gašper .......................................................................................................................................................................... 83 Indeks avtorjev / Author index ................................................................................................................... 87 v vi Zbornik 27. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2024 Zvezek A Proceedings of the 27th International Multiconference INFORMATION SOCIETY – IS 2024 Volume A Slovenska konferenca o umetni inteligenci Slovenian Conference on Artificial Intelligence Uredniki / Editors Mitja Luštrek, Matjaž Gams, Rok Piltaver http://is.ijs.si 10.–11. oktober 2024 / 10–11 October 2024 Ljubljana, Slovenia 1 2 PREDGOVOR Umetna inteligenca doživlja neverjeten in pospešen razvoj, ko se po tričetrt stoletja, ko je Alan Mathison Turing postavil temelje računalništva in umetne inteligence, končno približuje ne le človeški inteligenci, temveč tudi drugim ključnim človeškim lastnostim, kot sta ustvarjalnost, čustvena inteligenca in zavest. Na številnih področjih umetna inteligenca že presega zmogljivosti večine ljudi in celo strokovnjakov. Veliki jezikovni modeli dosegajo tovrstne rezultate tudi pri dosti manj strukturiranih problemih, kot je bilo predstavljivo pred nekaj leti, npr. pri strokovnih izpitih ter besedilnih nalogah iz matematike in programiranja. Generativna umetna inteligenca že zdaj spreminja svet. Postala je nepogrešljivo orodje v poslovnem svetu, raziskavah in vsakdanjem življenju, saj omogoča pisanje besedil, ustvarjanje kode, generiranje slik in reševanje kompleksnih problemov. Možno je celo, da smo priča začetkom singularnosti – prelomnega trenutka, ko bo umetna inteligenca presegla človeško inteligenco in omogočila revolucijo na področju produktivnosti in inovacij, čeprav bo treba na sodbo o tem še počakati. Optimizem glede prihodnosti je utemeljen: če se bo razvoj nadaljeval s trenutnim tempom, si lahko predstavljamo svet, kjer bo umetna inteligenca povsem preoblikovala gospodarstvo, znanost in način življenja, pri čemer bo omogočila višjo kakovost življenja za vse. Čeprav nekateri umetno inteligenco vidijo kot grožnjo, njen trenutni razmah resnejših težav še ni prinesel. Nadejamo se, da bo zadosten del raziskav usmerjen v varnost umetne inteligence, da bo tako ostalo. 
Z morebitnimi škodljivimi učinki umetne inteligence se spopadajo tudi regulatorji, za katere upamo, da bodo uspešno krmarili med tem ciljem in pretiranim zaviranjem razvoja. Dostopnost velikih jezikovnih modelov, kot so GPT-ji, pomeni, da so naloge, ki zahtevajo razumevanje in generiranje naravnega jezika, lažje kot kadar koli prej. Mnogi raziskovalci verjamejo, da bo prihodnost programiranja prešla iz tradicionalnih jezikov, kot je Python, na velike jezikovne modele, kjer bo umetna inteligenca generirala kodo in rešitve po meri. Čeprav je razvoj teh modelov zahtevna naloga, ki presega zmožnosti večine organizacij, se ljudje navajamo na uporabo tega fenomenalnega orodja. Pričakujemo, da bo umetna inteligenca postala učinkovit in zanesljiv partner človeštva. Že letos vidimo, da so konference v sklopu Informacijske družbe posvečene prav velikim jezikovnim modelom. V okviru Slovenske konference o umetni inteligenci organiziramo formalno debato dijakov – izkušenih debaterjev, ki se udeležujejo mednarodnih tekmovanj – o tem, kako bo umetna inteligenca oblikovala prihodnost in zakaj bi to lahko bila najboljša prihodnost doslej. Matjaž Gams Mitja Luštrek Rok Piltaver predsedniki Slovenske konference o umetni inteligenci 3 FOREWORD Artificial intelligence is experiencing incredible and accelerated development. After three-quarters of a century since Alan Mathison Turing laid the foundations of computing and artificial intelligence, it is finally approaching not only human intelligence but also other key human traits such as creativity, emotional intelligence and consciousness. In many areas, artificial intelligence already surpasses the capabilities of most people and even experts. Large language models are achieving such results even in much less structured problems than was imaginable a few years ago, such as professional exams, and mathematics and programming tasks described in free text. Generative artificial intelligence is already transforming the world. It has become an indispensable tool in the business world, research, and everyday life, enabling text writing, code generation, image creation, and solving complex problems. It is even possible that we are witnessing the beginnings of the singularity—the pivotal moment when artificial intelligence will surpass human intelligence and enable a revolution in productivity and innovation, although time will show whether this is actually the case. Optimism about the future is well-founded: if development continues at its current pace, we can imagine a world where artificial intelligence completely transforms the economy, science, and way of life, leading to a higher quality of life for all. Although some see artificial intelligence as a threat, its current rapid progress has not yet led to serious problems. We hope that a sufficient part of the research will be directed towards AI safety so that this remains the case. Regulators are also addressing the potential harmful effects of artificial intelligence, and we hope they will successfully navigate between this goal and excessive hindering of development. The accessibility of large language models, such as GPTs, means that tasks requiring the understanding and generation of natural language are easier than ever before. Many researchers believe that the future of programming will shift from traditional languages, like Python, to large language models, where artificial intelligence will generate custom code and solutions. 
Although developing these models is a challenging task beyond the capabilities of most organizations, people are getting accustomed to using this phenomenal tool. We expect artificial intelligence to become an effective and reliable partner for humanity. Already this year, we are seeing conferences within the framework of the Information Society dedicated to large language models. As part of the Slovenian Conference on Artificial Intelligence, we are organizing a formal debate for high school students—experienced debaters who participate in international competitions—on how artificial intelligence will shape the future and why this might be the best future yet. Matjaž Gams, Mitja Luštrek, Rok Piltaver, Slovenian Conference on Artificial Intelligence chairs

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE: Mitja Luštrek, Matjaž Gams, Rok Piltaver, Cene Bavec, Marko Bohanec, Marko Bonač, Ivan Bratko, Bojan Cestnik, Aleš Dobnikar, Erik Dovgan, Bogdan Filipič, Borka Jerman Blažič, Marjan Krisper, Marjan Mernik, Biljana Mileva Boshkoska, Vladislav Rajkovič, Niko Schlamberger, Tomaž Seljak, Peter Stanovnik, Damjan Strnad, Miha Štajdohar, Vasja Vehovar

PandaChat-RAG: Towards the Benchmark for Slovenian RAG Applications

Taja Kuzman, Urban Rupnik, Tanja Pavleska, Primož Cigoj
PC7, d.o.o., Ljubljana, Slovenia; Jožef Stefan Institute, Ljubljana, Slovenia
{taja,tanja}@pc7.io, {urban,primoz}@pc7.io
DOI: https://doi.org/10.70314/is.2024.scai.538

Abstract: Retrieval-augmented generation (RAG) is a recent method for enriching the large language models' text generation abilities with external knowledge through document retrieval. Due to its high usefulness for various applications, it already powers multiple products. However, despite the widespread adoption, there is a notable lack of evaluation benchmarks for RAG systems, particularly for less-resourced languages. This paper introduces PandaChat-RAG – the first Slovenian RAG benchmark, established on a newly developed test dataset. The test dataset is based on the semi-automatic extraction of authentic questions and answers from a genre-annotated web corpus. The methodology for the test dataset construction can be efficiently applied to any of the comparable corpora in numerous European languages. The test dataset is used to assess the RAG system's performance in retrieving relevant sources essential for providing accurate answers to the given questions. The evaluation involves comparing the performance of eight open- and closed-source embedding models, and investigating how the retrieval performance is influenced by factors such as the document chunk size and the number of retrieved sources. These findings contribute to establishing guidelines for optimal RAG system configurations, not only for Slovenian but also for other languages.

Keywords: retrieval-augmented generation, RAG, embedding models, large language models, LLMs, benchmark, Slovenian
1 Introduction

The advent of large language models (LLMs) has introduced significant advancements in the field of natural language processing (NLP). Although LLMs have shown impressive capabilities in generating coherent text, they are prone to hallucinations [7, 16], i.e., providing false information. Furthermore, they are dependent on static and potentially outdated corpora [9]. Retrieval-augmented generation (RAG) is a method devised to address these challenges by augmenting LLMs with external information retrieved from a provided document collection. Connecting LLMs with a relevant database improves the factual accuracy and temporal relevance of the generated responses. Moreover, RAG contributes to the explainability of the generated answers by providing verifiable sources, which facilitates the evaluation of the system's accuracy [2]. These advantages have spurred quick adoption of RAG systems across various applications. For instance, PandaChat (https://pandachat.ai/) leverages RAG to provide explainable responses with high accuracy in Slovenian and other languages, integrated in customer service bots and platforms that allow LLM-based retrieval of information from texts.

Although RAG benchmarking is a relatively recent endeavor, some initial frameworks have already emerged [3, 5, 7]. However, these benchmarks are limited to English and Chinese, leaving a gap in the evaluation of RAG systems for other languages. To address this gap, we make the following contributions:
• We present the first benchmark for RAG systems for the Slovenian language. The benchmark is based on the newly developed PandaChat-RAG-sl test dataset, which comprises authentic questions, answers and source texts. The benchmark and its test dataset are openly available at https://github.com/TajaKuzman/pandachat-rag-benchmark.
• We introduce a methodology for an efficient semi-automated development of RAG test datasets that is easily replicable for the languages included in the MaCoCu [1] and CLASSLA-web [10] corpora collections, which include all South Slavic languages, Albanian, Catalan, Greek, Icelandic, Maltese, Ukrainian and Turkish.
• As the first step of RAG evaluation, we evaluate the retriever's performance in terms of its ability to provide relevant sources crucial for retrieving accurate answers to the posed questions. The evaluation encompasses a comparison of the performance of several open- and closed-source embedding models. Furthermore, we provide insights on the impact of the document chunk size and the number of retrieved sources, to identify optimal configurations of the indexing and retrieval components for robust and accurate retrieval.

The paper is organized as follows: in Section 2, we review previous research concerning the evaluation of RAG systems; Section 3 introduces the PandaChat-RAG-sl dataset (Section 3.1) and the RAG system architecture (Section 3.2), which is evaluated in Section 4. Finally, in Section 5, we conclude the paper with a discussion of the main findings and suggestions for future work.

2 Related Work

Despite the recent introduction of the RAG architecture, several benchmarking initiatives have already emerged [3, 5, 7, 15]. However, since RAG systems can be applied to various end tasks, the benchmarks focus on different aspects of these systems. Inter alia, current benchmarks assess their performance in text citation [7], text continuation, question-answering with support of external knowledge, hallucination modification, and multi-document summarization [12]. The closest to our work is the evaluation of RAG systems on the task of Attributable Question Answering [2]. This task involves providing a question as input to the system, which then generates both an answer and an attribution, indicating the source text on which the answer is based. The advantage of this task over the closed-book question-answering task is that it also measures the system's capability to provide the correct source.

The majority of RAG benchmarks assess RAG systems in English [3, 5, 7, 15] or Chinese [5, 12]. Consequently, the generalizability of their findings to other languages remains uncertain. Furthermore, a limitation of many benchmarks is their reliance on synthetic data generated by LLMs [5, 12, 15]. To avoid potential biases introduced by LLMs and to better represent the complexity and diversity of real-world language use, a more reliable evaluation would be based on non-synthetic test datasets. Despite focusing on a different task, recent research [6] has shown that resource-efficient development of non-synthetic and non-machine-translated question-answering datasets is feasible by leveraging the availability of general web corpora and genre classifiers.
3 Methodology

3.1 PandaChat-RAG-sl Dataset

The PandaChat-RAG-sl dataset comprises questions, answers, and the corresponding source texts that encompass the answers. It was created through a semi-automated process involving the extraction of texts from the Slovenian web corpus CLASSLA-web.sl 1.0 [11], followed by a manual extraction of high-quality instances. Since the texts were automatically extracted from a general text collection, the dataset encompasses a diverse range of topics that were not predefined or decided upon.

The CLASSLA-web.sl 1.0 corpus is a collection of texts collected from the web in 2021 and 2022 [10]. It was chosen due to its numerous advantages: 1) it has high-quality content, with the majority of texts meeting the criteria for publishable quality [17]; 2) it is one of the largest and most up-to-date collections of Slovenian texts, comprising approximately 4 million texts; 3) the texts are enriched with genre labels, facilitating genre-based text selection; and 4) it is developed in the same manner as 6 other CLASSLA-web corpora [10] and 7 additional MaCoCu web corpora in various European languages [1]. This enables easy expansion of the benchmark to other languages, including all South Slavic languages and various European languages, such as Albanian, Catalan, Greek, Icelandic, Ukrainian and Turkish.

The development of the PandaChat-RAG-sl dataset involves the following steps: 1) the genre-based selection of texts from the CLASSLA-web.sl corpus; 2) the extraction of texts that comprise paragraphs ending with a question (80,215 texts); 3) the extraction of questions and answers (paragraphs following the question); and 4) a manual review process to identify high-quality instances. In the genre-based selection phase, we extract texts labeled with genres that are most likely to contain objective questions and answers, that is, Information/Explanation, Instruction and Legal. In its present iteration, the dataset consists of 206 instances derived from the first 1,800 extracted texts. It is important to note that this effort can easily be continued with further manual inspection of the extracted texts, should there be a need to prepare a larger dataset.

Table 1 provides a statistical overview of the PandaChat-RAG-sl dataset. The dataset consists of 206 instances, that is, triplets of a question, an answer and a source text, derived from 160 texts. The total size of the dataset is 84,651 words, encompassing both the questions and the texts containing the answers.

Table 1: Statistics for the PandaChat-RAG-sl dataset.
    Instances: 206
    Unique texts: 160
    Words (questions): 1,184
    Words (texts w/o questions): 83,467
    Total words (questions + texts): 84,651
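To make the extraction steps concrete, the following is a minimal sketch of the question-answer mining heuristic described above. It assumes the corpus has already been loaded into an iterable of documents carrying a genre label and a list of paragraph strings; the actual CLASSLA-web.sl distribution format and loading code are not shown, and the helper names are hypothetical.

    # Minimal sketch of steps 1-3 of the dataset construction (step 4,
    # the manual review, is done by hand). `corpus` is assumed to be an
    # iterable of dicts with "genre" and "paragraphs" keys.
    TARGET_GENRES = {"Information/Explanation", "Instruction", "Legal"}

    def extract_candidates(corpus):
        """Yield (question, answer, source_text) triplets for manual review."""
        for doc in corpus:
            if doc["genre"] not in TARGET_GENRES:
                continue  # step 1: genre-based selection
            pars = doc["paragraphs"]
            for i, par in enumerate(pars[:-1]):
                # steps 2-3: a paragraph ending with a question mark is the
                # question; the paragraph that follows it is the answer.
                if par.rstrip().endswith("?"):
                    yield par.strip(), pars[i + 1].strip(), "\n".join(pars)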
3.2 RAG System

The RAG pipeline encompasses three main components: indexing, retrieval, and text generation. During the indexing phase, the user-provided text collection is transformed into a database of numerical vectors (embeddings) to facilitate document retrieval by the retriever. This process involves segmenting the documents into fixed-length chunks, which are then converted into embeddings using large language models. The choice of the embedding model and the chunk size are critical factors that can significantly impact the retrieval performance of the model. Selecting an appropriate embedding model is essential to ensure that the textual information is converted into a meaningful numerical representation for effective retrieval. Moreover, the chunk size, in terms of the number of tokens, plays a crucial role in determining the informativeness of the embeddings. Incorrect chunk sizes may lead to numerical vectors that lack important information necessary for connecting the question to the corresponding text chunk, thereby compromising retrieval accuracy [12].

When presented with a question, the retrieval component uses semantic search (also known as dense retrieval) to retrieve the most relevant text chunks. The search is based on determining the smallest cosine distance between the chunk vectors and the question vector. Lastly, during the text generation phase, the retriever provides the large language model (LLM) with a selection of top retrieved sources. The LLM is prompted to provide a human-like answer to the provided question based on the retrieved text sources. The selection of an appropriate number of top retrieved sources is crucial in this phase: including more than just one retrieved source may enhance retrieval accuracy and address situations where the first retrieved source fails to encompass all relevant information, especially when more texts cover the same subject matter. However, increasing the number of sources also leads to a longer prompt provided to the LLM, potentially increasing the costs of using the RAG system.

In this study, we assess the indexing and retrieval components, focusing on the impact of different embedding models, chunk sizes, and the number of retrieved sources on retrieval performance.
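The pipeline just described can be illustrated with a short, self-contained sketch. This is not the PandaChat-RAG implementation: chunking is simplified to fixed-size word windows rather than token windows, and one of the evaluated open-source models (mE5-small) is loaded through the sentence-transformers library as an assumed stand-in.

    # Illustrative indexing + dense retrieval, assuming sentence-transformers
    # is installed. E5 models expect "passage: "/"query: " prefixes; with
    # normalized embeddings, cosine similarity reduces to a dot product.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/multilingual-e5-small")

    def chunk(text, size=128, overlap=20):
        words = text.split()
        step = size - overlap
        return [" ".join(words[i:i + size])
                for i in range(0, max(len(words) - overlap, 1), step)]

    def build_index(docs):
        chunks = [c for d in docs for c in chunk(d)]
        emb = model.encode([f"passage: {c}" for c in chunks],
                           normalize_embeddings=True)
        return chunks, emb

    def retrieve(question, chunks, emb, top_k=2):
        q = model.encode([f"query: {question}"], normalize_embeddings=True)[0]
        best = np.argsort(emb @ q)[::-1][:top_k]  # highest cosine similarity
        return [chunks[i] for i in best]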
Embedding Models. The evaluation includes a range of multilingual open-source and closed-source models. The selection of open-source models is based on the Massive Text Embedding Benchmark (MTEB) Leaderboard (https://huggingface.co/spaces/mteb/leaderboard) [13]. Specifically, we choose medium-sized multilingual models with up to 600 million parameters that have demonstrated strong performance on Polish and Russian – Slavic languages that are linguistically related to Slovenian. The models used in the evaluation are:
• Closed-source embedding models provided by OpenAI: an older model text-embedding-ada-002 (OpenAI-Ada) [8], and two recently published models: text-embedding-3-small (OpenAI-3-small) and text-embedding-3-large (OpenAI-3-large) [14].
• Open-source embedding models, available on the Hugging Face repository: the BGE-M3 model [4], the base-sized mGTE model (mGTE-base) [19], and the small (mE5-small), base (mE5-base) and large (mE5-large) sizes of the Multilingual E5 model [18].

Chunk size. The impact of the chunk size on retrieval performance is assessed by varying chunk sizes of 128, 256, 512, and 1024 tokens, with a default chunk overlap of 20 tokens. In these experiments, the performance is evaluated based on the first retrieved source.

Number of retrieved sources. Previous work indicates that increasing the number of retrieved sources improves the retrieval accuracy [12]. In this study, we examine the retrieval accuracy of embedding models, with the chunk size set to 128 tokens, when the models retrieve 1 to 5 sources. In this scenario, if any of the multiple retrieved sources matches the correct source, the output is evaluated as correct.

The retrieval capabilities of the RAG system are evaluated on the task of Attributed Question-Answering. The evaluation is based on accuracy, measured as the percentage of questions correctly matched with the relevant source. The experiments are performed using the LlamaIndex library (https://www.llamaindex.ai/). The chunk size is defined using the SentenceSplitter method in the indexing phase. The number of retrieved sources (similarity top k), the embedding model and the prompt for the LLM are specified as parameters of the chat engine. The closed-source embedding models are used via the OpenAI API, while the experiments with the open-source models are conducted on a GPU machine.
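As a rough guide to how such a setup can be wired together, the sketch below configures LlamaIndex with one of the evaluated open-source embedding models and scores retrieval accuracy at k: a question counts as correct if any of the top-k retrieved chunks comes from its gold source text. The import paths follow the llama-index 0.10+ package layout and may differ between versions; the dataset field names are assumptions mirroring the benchmark triplets, not the paper's actual code.

    from llama_index.core import Document, Settings, VectorStoreIndex
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding

    # Indexing configuration: embedding model and 128-token chunks.
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")
    Settings.node_parser = SentenceSplitter(chunk_size=128, chunk_overlap=20)

    def retrieval_accuracy(dataset, top_k=2):
        """dataset: list of dicts with 'question', 'source_id', 'source_text'."""
        docs = [Document(text=ex["source_text"], metadata={"id": ex["source_id"]})
                for ex in dataset]
        retriever = VectorStoreIndex.from_documents(docs).as_retriever(
            similarity_top_k=top_k)
        hits = sum(
            any(n.node.metadata["id"] == ex["source_id"]
                for n in retriever.retrieve(ex["question"]))
            for ex in dataset)
        return 100 * hits / len(dataset)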
4 Experiments and Results

In this section, we present the results of the experiments examining the impact of the chunk size, the number of retrieved sources, and the selection of the embedding model on the retrieval performance of the RAG system.

4.1 Chunk Size

Figure 1 shows the impact of the chunk size on the retrieval performance of RAG systems based on different embedding models. The findings suggest that, with the exception of the OpenAI-Ada model, all systems demonstrate the best performance when the text chunk size is set to 128 tokens. Increasing the chunk size hinders the retrieval performance, which is consistent with previous research [12]. These results confirm that smaller chunk sizes enable the embedding models to capture finer details that are essential for retrieving the most relevant text for the given question.

[Figure 1: The impact of the chunk size on the retrieval performance.]

4.2 Number of Retrieved Sources

Figure 2 shows the performance of the RAG systems when increasing the number of retrieved sources. The results demonstrate that increasing the number of retrieved sources initially improves the performance; however, after a certain threshold, the performance levels off. Increasing the number of retrieved sources results in larger inputs to the LLM in the text generation component, incurring higher costs. Using more than two retrieved sources does not significantly improve results in most systems. What is more, with the top two retrieved sources, certain embedding models, namely BGE-M3 and mE5-large, already reach perfect accuracy. Thus, our findings indicate that using more than the top two retrieved sources is unnecessary.

[Figure 2: Impact of the number of retrieved sources on the retrieval performance.]

4.3 Embedding Models

We provide a final comparison of the performance of systems that use different embedding models. We use the parameters that were shown to provide the best results in the previous experiments: a chunk size of 128 tokens and the top two retrieved sources. As shown in Table 2, the retrieval systems that use the open-source BGE-M3 and mE5-large embedding models achieve a perfect retrieval score. They are closely followed by the closed-source OpenAI-3-small and the mE5-base models, which achieve an accuracy of 99.5%. While having slightly lower scores, all other retrieval systems still achieve high performance, ranging between 98.5% and 99% in accuracy.
Table 2: Performance comparison between the open-source and closed-source embedding models.
    embedding model | speed (s) | retrieval accuracy
    BGE-M3          | 0.58      | 100
    mE5-large       | 0.58      | 100
    OpenAI-3-small  | 0.69      | 99.51
    mE5-base        | 0.29      | 99.51
    OpenAI-3-large  | 1.19      | 99.03
    mGTE-base       | 0.31      | 99.03
    OpenAI-Ada      | 0.63      | 98.54
    mE5-small       | 0.15      | 98.54

Additionally, Table 2 provides the inference speed of the models, measured in seconds per instance. If inference speed is a priority, the mE5-base model emerges as the optimal selection, as it yields a high retrieval accuracy of 99.51% and is two times faster than the two best-performing models. In cases where users are restricted to closed-source models due to the unavailability of GPU resources, the OpenAI-3-small model stands out as the most suitable option. Its inference speed is comparable to the OpenAI-Ada model, while it achieves a superior retrieval accuracy.

5 Conclusion and Future Work

In this paper, a novel test dataset was introduced to assess the performance of RAG systems on the Slovenian language. A general methodology for the efficient creation of non-synthetic RAG test datasets was established that can be extended to other languages. We evaluated the retrieval accuracy of the RAG system, examining the impact of the embedding models, the document chunk size, and the number of retrieved sources. The assessment of embedding models encompassed eight open-source and closed-source models. It revealed that open-source models, specifically BGE-M3 and mE5-large, reached perfect retrieval accuracy, demonstrating their suitability for RAG applications on Slovenian texts. Furthermore, the evaluation of the optimal chunk size and the number of retrieved sources showed that smaller chunk sizes yielded superior results. In contrast, increasing the number of retrieved sources enhanced results up to a certain threshold, beyond which the model performance plateaued. Certain models already achieved perfect accuracy when evaluated based on the top two retrieved sources.

While the novel test dataset can be used to evaluate all the components of the RAG system, in this paper we focused on the evaluation of the indexing and retrieval components. In our future work, we will extend the evaluations to the text generation component with regard to fluency, correctness, and usefulness of the generated answers. Furthermore, we plan to expand the benchmark to encompass a wider range of languages. The plans include extending the dataset and evaluation to South Slavic languages and other European languages that are covered by the comparable MaCoCu [1] and CLASSLA-web [10] corpora.
References
[1] Marta Bañón et al. 2022. MaCoCu: Massive collection and curation of monolingual and bilingual data: Focus on under-resourced languages. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation. European Association for Machine Translation, Ghent, Belgium, (June 2022), 303–304. https://aclanthology.org/2022.eamt-1.41.
[2] Bernd Bohnet et al. 2022. Attributed question answering: Evaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037.
[3] Shuyang Cao and Lu Wang. 2024. Verifiable Generation with Subsentence-Level Fine-Grained Citations. arXiv preprint arXiv:2406.06125.
[4] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv: 2402.03216 [cs.CL].
[5] Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2024. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, No. 16, 17754–17762.
[6] Anni Eskelinen, Amanda Myntti, Erik Henriksson, Sampo Pyysalo, and Veronika Laippala. 2024. Building Question-Answer Data Using Web Register Identification. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, Torino, Italia, (May 2024), 2595–2611. https://aclanthology.org/2024.lrec-main.234.
[7] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023. Enabling Large Language Models to Generate Text with Citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 6465–6488.
[8] Ryan Greene, Ted Sanders, Lilian Weng, and Arvind Neelakantan. 2022. New and improved embedding model. https://openai.com/index/new-and-improved-embedding-model/. [Accessed 26-08-2024].
[9] Angeliki Lazaridou et al. 2021. Mind the gap: Assessing temporal generalization in neural language models. Advances in Neural Information Processing Systems, 34, 29348–29363.
[10] Nikola Ljubešić and Taja Kuzman. 2024. CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 3271–3282.
[11] Nikola Ljubešić, Peter Rupnik, and Taja Kuzman. 2024. Slovenian web corpus CLASSLA-web.sl 1.0. In Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1882.
[12] Yuanjie Lyu et al. 2024. CRUD-RAG: A comprehensive Chinese benchmark for retrieval-augmented generation of large language models. arXiv preprint arXiv:2401.17043.
[13] Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2014–2037.
[14] OpenAI. 2024. New embedding models and API updates. https://openai.com/index/new-embedding-models-and-api-updates/. [Accessed 26-08-2024].
[15] Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2024. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 338–354.
[16] Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval Augmentation Reduces Hallucination in Conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, 3784–3803.
[17] Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, and Antonio Toral. 2024. Do Language Models Care about Text Quality? Evaluating Web-Crawled Corpora across 11 Languages. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 5221–5234.
[18] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. Multilingual E5 Text Embeddings: A Technical Report. arXiv preprint arXiv:2402.05672.
[19] Xin Zhang et al. 2024. mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval. arXiv: 2407.19669 [cs.CL]. https://arxiv.org/abs/2407.19669.

Choosing Features for Stress Prediction with Machine Learning

Katja Bengeri (University of Ljubljana, Ljubljana, Slovenia; kb96968@student.uni-lj.si), Junoš Lukan* (Jožef Stefan Institute, Department of Intelligent Systems, Ljubljana, Slovenia; junos.lukan@ijs.si), Mitja Luštrek* (Jožef Stefan Institute, Department of Intelligent Systems, Ljubljana, Slovenia; mitja.lustrek@ijs.si)
*Also with Jožef Stefan International Postgraduate School.
DOI: https://doi.org/10.70314/is.2024.scai.991

Abstract: Feature selection is a crucial step in building effective machine learning models, as it directly impacts model accuracy and interpretability. Driven by the aim of improving stress prediction models, this article evaluates multiple approaches for identifying the most relevant features. The study explores filter-based methods that assess feature importance through correlation analysis, alongside wrapper methods that iteratively optimize feature subsets. Additionally, techniques such as Boruta are analysed for their effectiveness in identifying all important features, while strategies for handling highly correlated variables are also considered. By conducting a comprehensive analysis of these approaches, we assess the role of feature selection in developing stress prediction models.

Keywords: Feature selection, Correlation matrix, Balanced accuracy score
1 Introduction

Machine learning models are increasingly being applied to predict stress, which is critical in various domains such as healthcare, workplace management, and wearable technology. However, one of the major challenges in developing reliable predictive models is identifying the most relevant features from extensive datasets comprising physiological and behavioural information.

Feature selection plays a key role in addressing this challenge. By selecting only the most informative features, we can reduce noise, prevent overfitting, and enhance model accuracy. As we showed in previous work [8], even simple feature selection techniques can increase the F1 score of predictive models. This paper builds upon this finding and explores several feature selection techniques, ranging from simple correlation-based methods to more sophisticated wrapper approaches.

The aim of this work is to assess how feature selection can enhance stress prediction models. By comparing different methods, we aim to identify the optimal strategies for feature selection in stress prediction, which would lead to more reliable and more easily interpretable machine learning models.

2 Data collection

The data used in this work comes from the STRAW project [1], the results of which have been previously presented at Information Society [6, 8]. The dataset includes the data of 56 participants, recruited from academic institutions in Belgium (29 participants) and Slovenia (26 participants). They answered questionnaires named Ecological Momentary Assessments (EMAs) roughly every 90 minutes, with smartphone sensor and usage data continuously collected by an Android application [7], while also wearing an Empatica E4 wristband recording physiological data. In the 15 days of their participation, each participant responded to more than 96 EMA sessions on average, which resulted in around 2200 labels.

3 Target and feature extraction

To fully leverage the potential of the data, we computed a comprehensive set of features. While some sensors only reported relatively rare events, such as phone calls, others had a high sampling frequency, such as blood volume pulse, which was sampled at 32 Hz. On the other hand, labels were only available every 90 min. Therefore, we preprocessed the data in several steps.

3.1 Target variable

While participants responded to various questionnaires, for this study we selected their responses to the Stress Appraisal Measurement [9] as the target variable. It was used to report stress levels on a scale from 0 to 4, so using it as is, the prediction task can be approached as a regression problem. However, many stress detection studies tend towards a discrete approach, treating stress predominantly as a classification task, often only working with a binary target variable. To convert this into a classification problem, we discretized the target variable into two distinct categories: "no stress", which included all responses with a value of 0, while all others were coded as "stress". With that, we ensured a balanced distribution of the target variable values.
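For illustration, the discretization described above amounts to a one-line thresholding step. A minimal sketch, assuming the responses are held in a pandas Series (the variable names are hypothetical):

    import pandas as pd

    stress = pd.Series([0, 2, 0, 1, 4, 0])     # toy stress responses on the 0-4 scale
    y = (stress > 0).astype(int)               # 0 = "no stress", 1 = "stress"
    print(y.value_counts(normalize=True))      # check the class balance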
3.2 Features

3.2.1 Data preprocessing. In our work, features were calculated on 30-minute intervals preceding each questionnaire session. From the wide variety of smartphone data and physiological measures, a total of 352 features were extracted and grouped into 22 categories, listed in Table 1. Using physiological data from the Empatica wristband, we first calculated specialized physiological features on smaller windows (from 4 s to 120 s, depending on the sensor; see [4] for more details), which were then aggregated over 30 min windows by calculating simple statistical features: mean, median, standard deviation, minimum, and maximum. All of the categorical features were converted into a set of binary features using the one-hot encoding technique, and the missing values were replaced with the mode.

Table 1: Feature categories with the number of features included in each category in parentheses.
    1. Empatica electrodermal activity (99)    12. Phone screen events (7)
    2. Empatica inter-beat interval (50)       13. Phone light (6)
    3. Empatica temperature (33)               14. Phone battery (5)
    4. Empatica accelerometer (23)             15. Phone speech (4)
    5. Empatica data yield (1)                 16. Phone interactions (2)
    6. Phone applications foreground (47)      17. Phone messages (2)
    7. Phone location (18)                     18. Phone data yield (1)
    8. Phone Bluetooth connections (18)        19. Baseline psychological features (7)
    9. Phone calls (10)                        20. Language (2)
    10. Phone activity recognition (7)         21. Gender (2)
    11. Phone Wi-Fi connections (7)            22. Age (1)

First, some preliminary data cleaning was performed by excluding one of the features in pairs exhibiting a correlation coefficient of |r| ≥ 0.95. Despite this, some of the remaining features still exhibited quite strong correlations, as shown in Fig. 1. An interesting observation, used in the later stages of feature selection, was that high correlation, |r| ≥ 0.8, was mostly observed between features of the same category. As an example, correlations between features related to phone application use are shown in Fig. 2.

[Figure 1: Correlation matrix of the initial feature set. Only feature categories with more than two features are labelled.]

[Figure 2: Correlation matrix of the feature set in the Phone applications foreground category.]
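As a rough illustration of this preprocessing, the sketch below aggregates time-indexed sensor features over the 30-minute window preceding each EMA session; the variable names are hypothetical, and the actual STRAW pipeline is considerably more involved.

    import pandas as pd

    def window_features(sensor_df, ema_times, window="30min"):
        """sensor_df: time-indexed numeric feature frame; ema_times: EMA timestamps."""
        rows = []
        for t in ema_times:
            win = sensor_df.loc[t - pd.Timedelta(window):t]
            # mean, median, std, min and max of each column over the window
            rows.append(win.agg(["mean", "median", "std", "min", "max"]).unstack())
        out = pd.DataFrame(rows, index=ema_times)
        out.columns = ["_".join(col) for col in out.columns]
        return out

    # Categorical features would then be one-hot encoded, with missing
    # values replaced by the mode, e.g.:
    # X_cat = pd.get_dummies(cat_df.fillna(cat_df.mode().iloc[0]))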
Phone applications foreground 0.50 We evaluated various predictive models, as shown in Table 2, Phone battery 0.75 Phone bluetooth all as implemented in scikit-learn [10]. Among these, the Phone calls Phone light 1.00 Random Forest model yielded the best performance. Phone location Phone screen In this work, we aimed to find the best model for predicting Phone speech Phone wifi stress and improve it using the optimal feature subset. Conse- e ound een quently, we used the Random Forest as the benchmark for com- ometer egr mal activity Limesurveyecognition Phone callsPhone light Phone wifi paring feature selection algorithms. Phone battery Phone speech oder Phone bluetooth Phone location Phone scr Empatica acceler Empatica temperatur Table 2: Performance of different models for the classifica- Phone activity r Empatica inter beat interval tion problem. The mean over five folds, its standard error, Empatica electr Phone applications for and the maximum are shown. Figure 1: Correlation matrix of the initial feature set. Only feature categories with more than two features are labelled. Model Mean Max SEM Logistic Regression 0.077 0.151 0.025 Support Vector Machines 0.090 0.158 0.022 Gaussian Naive Bayes 1.00 0.061 0.122 0.020 Stochastic Gradient Descent 0.027 0.054 0.007 0.75 Random Forest 0.475 0.558 0.026 0.50 XGBoost 0.441 0.473 0.013 0.25 0.00 In Table 2, SEM represents the Standard Error of the Mean. 0.25 It measures how far the sample mean of the data is likely to be 0.50 from the true population mean. 0.75 4.3 Correlation-Based Feature Reduction Figure 2: Correlation matrix of the feature set in the Phone We began the feature selection process by eliminating highly applications foreground category. correlated features. For each highly correlated pair, we removed the feature with the lower rank when sorted by mutual informa- tion, setting the correlation threshold at |𝑟 | ≥ 0.8 to maintain 4 Prediction models a manageable number of features. This reduction left us with approximately 180 features out of the original 352 for model 4.1 Model performance and validation training and evaluation. To evaluate the performance of the models we used balanced While selecting the optimal set of features for stress prediction, accuracy score which is defined as the average of recall obtained we aimed to retain all 22 different categories from Table 1, as 12 Choosing Features for Stress Prediction with Machine Learning Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia Empatica accelerometer of features selected varied across folds, ranging from 50 to 93 features. Empatica electrodermal activity 1.0 0.8 4.6 Sequential Forward Selection Empatica inter beat interval 0.6 Another feature selection method we employed was Sequential 0.4 Feature Selector (SFS), a wrapper-based technique [2]. SFS and Empatica temperature 0.2 RFECV differ in their approaches. SFS constructs models for each Limesurvey Phone activity recognition 0.0 feature subset at every step, while RFECV builds a single model Phone applications foreground and evaluates feature importance scores. Consequently, SFS is 0.2 Phone battery more computationally expensive, as it must evaluate numerous 0.4 Phone bluetooth feature combinations before identifying the optimal subset. 
Figure 3: Correlation matrix of the feature set after correlation-based feature reduction. Only feature categories with more than two features are labelled.

4.4 Feature Selection using the mutual information scoring function

Before applying more complex feature selection algorithms, it was necessary to reduce computational complexity by further shrinking the set of roughly 180 features obtained through correlation-based reduction. We therefore used the SelectKBest method with the mutual information scoring function to retain the top 100 features. This resulted in features drawn from 19 to 20 categories, as the categories language, gender, and, in some cases, Empatica accelerometer were not deemed important for predicting stress.

Going forward, we will refer to the elimination of features within highly correlated pairs and the selection of the top 100 features using the mutual information scoring function as the preprocessing step.

4.5 Recursive Feature Elimination with Cross-Validation (RFECV)

One of the previously mentioned complex feature selection methods we employed was Recursive Feature Elimination with Cross-Validation (RFECV) [3]. The feature set obtained after the preprocessing step was passed to the RFECV algorithm for thorough evaluation.

RFECV operates by initially fitting a model to the dataset and evaluating its performance through cross-validation. After the initial fit, RFECV ranks feature importance and iteratively removes the least important features based on the model's feature importance attributes, which in the case of Random Forest are impurity-based feature importances. This process continues until there is no significant improvement in the model's performance. To ensure a reasonable duration for the feature selection process, we set the cross-validation in RFECV to 3 folds. The number of features selected varied across folds, ranging from 50 to 93.
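The two preprocessing stages and the RFECV run could look roughly as follows, reusing the placeholder names from the earlier sketches (X_corr is the ~180-feature set and adjusted_balanced_accuracy the scorer defined above).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, SelectKBest, mutual_info_classif

# Preprocessing, part 2: keep the 100 features with the highest
# mutual information with the stress labels.
kbest = SelectKBest(score_func=mutual_info_classif, k=100).fit(X_corr, y)
X_pre = X_corr.loc[:, kbest.get_support()]

# RFECV with a Random Forest and 3-fold CV, as described in the text.
rfecv = RFECV(estimator=RandomForestClassifier(random_state=0),
              cv=3, scoring=adjusted_balanced_accuracy)
rfecv.fit(X_pre, y)
print(rfecv.n_features_)  # number of features kept, e.g. between 50 and 93
```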
4.6 Sequential Forward Selection

Another feature selection method we employed was the Sequential Feature Selector (SFS), a wrapper-based technique [2]. SFS and RFECV differ in their approaches: SFS constructs models for each candidate feature subset at every step, while RFECV builds a single model and evaluates feature importance scores. Consequently, SFS is more computationally expensive, as it must evaluate numerous feature combinations before identifying the optimal subset.

In the absence of specified values for the number of features to select (n_features_to_select) and the tolerance (tol), the method defaults to selecting half of the available features. The default configuration was used in our analysis, leading the SFS to select the top 50 features.

4.7 Boruta method

The final feature selection technique we employed was the Boruta method [5]. With the assistance of "shadow features", which are copies of the original features that have been randomly shuffled, the method identifies a subset of features that are relevant to the classification task at hand.

In our case, shadow features were introduced into the feature subset obtained after the preprocessing step. The updated dataset was used to train the Random Forest model for 100 iterations. In each iteration, all original features ranked higher in importance than the highest-ranked shadow feature were marked as relevant. Ultimately, a binomial test is used to decide which features can be kept in the final selection with sufficient confidence. The number of features selected varied across folds, ranging from 47 to 55.
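Both of these selectors have off-the-shelf implementations; a sketch under the same placeholder names is given below. Note that the boruta package's BorutaPy is one implementation of the method and may differ in details from the one used by the authors.

```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

rf = RandomForestClassifier(random_state=0)

# SFS: with the default n_features_to_select and tol, scikit-learn keeps
# half of the available features (here 50 out of the 100 preprocessed ones).
sfs = SequentialFeatureSelector(rf, direction="forward", cv=3)
sfs.fit(X_pre, y)
sfs_selected = X_pre.columns[sfs.get_support()]

# Boruta: repeatedly compares each feature against randomly shuffled
# "shadow" copies and keeps those that beat the best shadow feature.
boruta = BorutaPy(rf, n_estimators="auto", max_iter=100, random_state=0)
boruta.fit(X_pre.values, y.values)   # BorutaPy expects numpy arrays
boruta_selected = X_pre.columns[boruta.support_]
```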
5 Results

Table 3 presents the final scores of a Random Forest model built on the various feature subsets derived from the methods described above. The data was split using shuffled 5-fold cross-validation to ensure that the results were not overly dependent on a single data split.

Table 3: Adjusted balanced accuracy scores of a Random Forest model trained on the different feature sets. The last column gives the number of features selected.

Feature set                   Mean    Max     SEM     N
All available features        0.464   0.498   0.011   352
Correlation-based reduction   0.483   0.507   0.007   ~180
Correlation-based, 100 best   0.486   0.498   0.006   100
Preprocessing, RFECV          0.471   0.511   0.012   50 to 93
Preprocessing, SFS            0.483   0.520   0.017   50
Preprocessing, Boruta         0.481   0.545   0.020   47 to 55
RFECV only                    0.465   0.494   0.020   16 to 89
SFS only                      0.426   0.468   0.017   30
Boruta only                   0.456   0.509   0.015   ~75

From Table 3, we can see that the most significant improvement in accuracy came from removing the highly correlated features, with the average adjusted balanced accuracy rising from 0.46 to 0.48. The best mean accuracy was achieved after the full preprocessing step, with a further, minor improvement from 0.483 to 0.486.

After eliminating highly correlated features, the wrapper methods did not significantly improve the accuracy on average (rows 3 to 6 in Table 3). The Boruta method performed best among the three, with the highest overall maximum accuracy in a single fold. These results led us to investigate whether the wrapper feature selection methods alone could manage correlated features without their prior removal, and to evaluate the impact of the correlation threshold.

We employed the RFECV, SFS, and Boruta methods on the entire feature set of 352 features, without applying the preprocessing step. For SFS, only 30 features were selected due to its computational complexity. As shown in the last three rows of Table 3, none of the methods alone were able to improve on the result achieved with correlation removal. Highly correlated features were left in the final feature set: for example, we identified three pairs of features with a correlation coefficient exceeding |r| ≥ 0.8 when using SFS alone. The poor results could be attributed either to the importance of the correlation removal step or, in the case of SFS, to the feature subset being too small.

5.1 Selecting the best correlation threshold

As previously mentioned, the biggest improvement in score came from removing one feature inside each highly correlated pair. We therefore also experimented with different correlation cut-off values to determine the best threshold.

The highest score was achieved with a correlation threshold of |r| ≥ 0.75 (Table 4). Considering the impact of the cross-validation splits and the relatively minor variance in scores, it appears that our initial threshold of |r| ≥ 0.8 was also quite effective.

Table 4: Adjusted balanced accuracy scores of a Random Forest model trained on a feature subset excluding features above the correlation threshold. The number of features left after correlation-based feature selection differed over validation folds, and its range is shown in the final column.

Threshold   Mean    Max     SEM     N
0.55        0.462   0.506   0.018   28 to 33
0.60        0.467   0.493   0.009   39 to 41
0.65        0.474   0.498   0.008   47 to 50
0.70        0.460   0.501   0.017   61 to 65
0.75        0.498   0.526   0.012   74 to 80
0.80        0.470   0.543   0.022   101 to 107
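The threshold sweep can be expressed as a small loop around the reduction function from the Section 4.3 sketch. For brevity, this version reduces the features once on the full data, whereas in the paper the reduction was performed per fold, which is why the feature counts in Table 4 vary across folds.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# drop_correlated(), adjusted_balanced_accuracy and cv are the helpers
# defined in the earlier sketches; X and y are the full 352-feature data.
for threshold in (0.55, 0.60, 0.65, 0.70, 0.75, 0.80):
    X_reduced = drop_correlated(X, y, threshold=threshold)
    scores = cross_val_score(RandomForestClassifier(random_state=0),
                             X_reduced, y,
                             scoring=adjusted_balanced_accuracy, cv=cv)
    print(f"|r| >= {threshold:.2f}: mean={scores.mean():.3f}, "
          f"max={scores.max():.3f}, n={X_reduced.shape[1]}")
```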
6 Conclusions

This paper examined different feature selection algorithms to find the most effective feature subset for stress prediction. The model using the feature subset after correlation removal achieved the highest adjusted balanced accuracy score of 0.483.

Alternative feature selection approaches, including the wrapper methods SFS and RFECV as well as the Boruta method, applied to the preprocessed feature subset, did not lead to further optimization of the feature subset in terms of model performance. Additionally, applying these methods to the entire set of features did not achieve accuracy levels as high as those obtained after the correlation-based reduction. In the case of SFS, this may be attributed to its selection of only 30 features.

Our results therefore underscore the critical role of the correlation-based reduction step. When this step was omitted, the wrapper methods alone were unable to effectively perform correlation-based feature reduction. We can therefore conclude that simply relying on feature selection methods, however sophisticated, is not as effective as also considering the relationships between features.

It should be noted that the improvements in balanced accuracy are small in all cases. This indicates that the results cannot be easily generalized, and correlation-based feature selection should not be seen as sufficient in general. Instead, we can speculate that no single feature selection method is the best one and that several should be considered. We should also note that the Pearson correlation coefficient used in this work only captures linear relationships between features; other methods can select features even if they are related in a different way.

References
[1] Larissa Bolliger, Junoš Lukan, Mitja Luštrek, Dirk De Bacquer, and Els Clays. 2020. Protocol of the STRess at Work (STRAW) project: how to disentangle day-to-day occupational stress among academics based on EMA, physiological data, and smartphone sensor and usage data. International Journal of Environmental Research and Public Health, 17, 23, (Nov. 2020), 8835. doi: 10.3390/ijerph17238835.
[2] Francesc J. Ferri, Pavel Pudil, Mohamad Hatef, and Josef V. Kittler. 1994. Comparative study of techniques for large-scale feature selection. Machine Intelligence and Pattern Recognition, 16, 403–413. doi: 10.1016/b978-0-444-81892-8.50040-7.
[3] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. 2002. Gene selection for cancer classification using Support Vector Machines. Machine Learning, 46, 1/3, 389–422. doi: 10.1023/a:1012487302797.
[4] Vito Janko, Matjaž Boštic, Junoš Lukan, and Gašper Slapničar. 2021. Library for feature calculation in the context-recognition domain. In Proceedings of the 24th International Multiconference Information Society – IS 2021. Slovenian Conference on Artificial Intelligence (Ljubljana, Slovenia, Oct. 4–8, 2021). Vol. A, 23–26.
[5] Miron B. Kursa and Witold R. Rudnicki. 2010. Feature selection with the Boruta package. Journal of Statistical Software, 36, 11, 1–13. doi: 10.18637/jss.v036.i11.
[6] Junoš Lukan, Larissa Bolliger, Els Clays, Primož Šiško, and Mitja Luštrek. 2022. Assessing sources of variability of hierarchical data in a repeated-measures diary study of stress. In Proceedings of the 25th International Multiconference Information Society – IS 2022. Pervasive Health and Smart Sensing (Ljubljana, Slovenia, Oct. 10–14, 2022). Vol. A, 31–34.
[7] Junoš Lukan, Marko Katrašnik, Larissa Bolliger, Els Clays, and Mitja Luštrek. 2020. STRAW application for collecting context data and ecological momentary assessment. In Proceedings of the 23rd International Multiconference Information Society – IS 2020. Slovenian Conference on Artificial Intelligence (Ljubljana, Slovenia, Oct. 5–9, 2020). Vol. A, 63–67.
[8] Marcel Franse Martinšek, Junoš Lukan, Larissa Bolliger, Els Clays, Primož Šiško, and Mitja Luštrek. 2023. Social interaction prediction from smartphone sensor data. In Proceedings of the 26th International Multiconference Information Society – IS 2023. Slovenian Conference on Artificial Intelligence (Ljubljana, Slovenia, Oct. 9–13, 2023). Vol. A, 11–14.
[9] Edward J. Peacock and Paul T. P. Wong. 1990. The stress appraisal measure (SAM): a multidimensional approach to cognitive appraisal. Stress Medicine, 6, 3, (July 1990), 227–236. doi: 10.1002/smi.2460060308.
[10] F. Pedregosa et al. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[11] Charles Sanders Peirce. 1884. The numerical measure of the success of predictions. Science, ns-4, 93, 453–454. doi: 10.1126/science.ns-4.93.453.b.

Predictive Modeling of Football Results in the WWIN League of Bosnia and Herzegovina

Ervin Vladić, Dželila Mehanović, Elma Avdić
International Burch University, Sarajevo, Bosnia and Herzegovina
ervin.vladic@stu.ibu.edu.ba, dzelila.mehanovic@ibu.edu.ba, elma.avdic@ibu.edu.ba

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.scai.1642

Abstract

Predictive modeling in football has emerged as a valuable tool for enhancing decision-making in sports management. This study applies machine learning techniques to predict football match outcomes in the WWIN League of Bosnia and Herzegovina. The aim is to evaluate the effectiveness of various models, including Support Vector Machines (SVM), Logistic Regression, Random Forest, Gradient Boosting, and k-Nearest Neighbors (kNN), in accurately predicting match results based on key features such as shots on target, possession percentage, and home/away status. By (1) gathering and analyzing match data from three seasons, (2) comparing the performance of machine learning models, and (3) drawing conclusions on key performance factors, we demonstrate that SVM achieves the highest accuracy at 83%, outperforming the other models. These insights contribute to football management, allowing for data-driven strategic planning and performance optimization.
Future research will integrate additional factors, such as player injuries and weather conditions, to further improve the predictive models.

Keywords

Football match prediction, machine learning, WWIN league, Support Vector Machines, Random Forest

1 Introduction

Accurate predictions of match outcomes can inform a wide range of decisions, from tactical adjustments to player acquisitions, and can improve engagement for fans and stakeholders. While predictive modeling has been extensively applied to top-tier football leagues such as the English Premier League, there is limited research on regional leagues such as the WWIN League of Bosnia and Herzegovina. The specific context of Bosnia and Herzegovina and the WWIN League, which has not previously been studied in sports research, provides the motivation for this work.

The WWIN League of Bosnia and Herzegovina was established in the year 2000 through the merging of three leagues, becoming a league covering the entire territory of Bosnia and Herzegovina. Originally, the league consisted of 16 clubs; since the 2016–2017 season, it contains 12 clubs, which raises the level of the league [25]. The winner is the team with the most points after thirty-three rounds; this position grants the team a place in the UEFA Champions League qualifications [10], while the remaining two top teams and the cup winner compete for a place in the UEFA Conference League. Since the founding of the WWIN League, the team with the highest number of titles has been HŠK Zrinjski from Mostar, which won eight times, followed by Sarajevo with four titles, Željezničar and Borac with three each, Široki Brijeg with two, and Leotar and Modriča with one each [12]. Depending on the entity association they belong to, the teams occupying the last two places in the league at the end of the season are relegated to the league below, with two teams from the First League of the Federation of BiH and the First League of the RS promoted in their stead.

To elevate football in the country to the highest level, in-depth analyses of matches and the factors influencing their outcomes are needed. Such analyses can enable coaches to fine-tune strategies for future games, help commentators provide more insightful commentary, and allow fans to develop a deeper understanding of, and take more pleasure in, the match.

The study aims to evaluate the performance of various ML models, including Support Vector Machines (SVM), Logistic Regression, Random Forest, Gradient Boosting, and k-Nearest Neighbors (kNN), in predicting match results. Examining key features such as shots on target, possession percentage, and home/away status, we conduct an analysis based on match data from three seasons of the WWIN League, encompassing 594 matches and key performance metrics.

The remainder of the paper is structured as follows: Section 2 provides an overview of related work on ML-based football prediction. Section 3 describes the methodology, including the dataset and models used. Section 4 presents the results and an analysis of the models' performance, with a discussion of the practical implications of the findings for football management. Finally, Section 5 concludes the paper and outlines directions for future research.

2 Literature Review

The prediction of football match results has recently been studied extensively because of its relation to betting and decision-making in sports. Studies examining the use of ML methods have primarily focused on large European leagues, where extensive and highly detailed data is available. The application of these techniques to regional football leagues, such as the WWIN League of B&H, remains underexplored.

Rodrigues and Pinto [15] used a variety of ML methods, including Naive Bayes, k-nearest neighbors, Random Forest, and SVM, to predict match outcomes based on previous match data and player attributes. Their study reported excellent results in terms of soccer betting profit margins, with the Random Forest approach obtaining a success rate of 65.26% and a profit margin of 26.74%.
Rahman [13] dedicated his work to employing deep learning frameworks, especially Deep Neural Networks (DNNs), for football match outcome prediction, particularly during the FIFA World Cup 2018. The study classified match outcomes with 63.3% accuracy using DNN architectures with LSTM or GRU cells. Baboota and Kaur [3] used machine learning approaches to predict English Premier League match results; the compared models included Support Vector Machines, Random Forest, and Gradient Boosting, and they found that Gradient Boosting outperformed the other models in accuracy and overall predictiveness. The authors of [16] used machine learning techniques, notably SVM and a Random Forest classifier, to predict English Premier League (EPL) match results, obtaining 54.3% accuracy with SVM and 49.8% with Random Forest on data from the 2013/2014 to 2018/2019 seasons. Another study [8] employed several machine learning algorithms to predict matches of the EPL 2017–2018 season; among Linear Regression, SVM, Logistic Regression, Random Forest, and Multinomial Naive Bayes, the k-nearest neighbors classifier gave the most accurate predictions.

In summary, while existing studies have demonstrated the effectiveness of machine learning in football match prediction, there remains a gap in the application of these techniques to regional leagues like the WWIN League, due to the availability and quality of data. The characteristics of these leagues, such as smaller datasets and potentially different factors influencing match outcomes, require a tailored approach; in lesser-known football leagues, models might also perform differently due to variations in competitive structures and gameplay strategies. The study of Munđar and Šimić [11], who developed a simulation model using the Poisson distribution to predict the seasonal rankings of teams in the Croatian First Football League, highlighted the predictive power of statistical models and demonstrated the significance of home advantage in determining match outcomes, an important factor in the WWIN League as well.

3 Materials and Methods

In this section, we describe the study conducted, detailing the data collection and feature selection processes, the machine learning models applied, the evaluation metrics used to assess model performance, and the approach taken to analyze the key features influencing match outcomes. The methodology is illustrated graphically in Fig. 1: the steps involved in predicting the outcomes of the WWIN League of Bosnia and Herzegovina include data collection, preprocessing, model development, and algorithm evaluation.

Figure 1: Workflow diagram
3.1 Dataset

The authors created the dataset for this study by consolidating information from rezultati.com [14], 1XBET [1], and Sofascore [24]. The resulting dataset covers the 2021/2022, 2022/2023, and 2023/2024 seasons of the WWIN League of Bosnia and Herzegovina. These platforms provide a wide range of football match data, making it easy to find the information needed for analysis. The dataset includes key match facts such as the date, day of the week, and time; the home and away teams; final and half-time goals; referee details; shots at goal, shots on target, and the corner kicks resulting from these attempts; bookings made during play by both teams; and other relevant performance indicators.

Table 1: Class Distribution

Match Type   Count
Home Win     301
Away Win     142
Draw         151

The table summarizes each type of match result by its frequency in the dataset. Of the 594 recorded matches, 301 ended in home-team victories, 142 in away-team victories, and 151 were drawn. Home wins are in the majority, comprising 50.7% of all matches, while away victories account for approximately 23.9% of all recorded results and draws for 25.4%. Fig. 2 depicts the distribution of the match outcomes.

Figure 2: Class distribution of the dataset

3.2 Machine Learning Prediction

In football, machine learning prediction entails developing models that forecast match outcomes based on the teams' and players' histories and other attributes [5]. These models employ methods such as regression analysis, classification, and neural networks to determine the results from the data fed as input.

3.2.1 Models initialisation, preprocessing, training and testing. For Logistic Regression, we set max_iter = 1000 and random_state = 42. For the SVM classifier, the kernel argument was set to linear, and random_state was again fixed at 42 to keep the results reproducible. Gaussian Naive Bayes was employed without modification of its settings because of the model's simplicity, and for Random Forest, Gradient Boosting, and k-Nearest Neighbors (kNN) we likewise used the default parameters.

Following that, we divided the gathered data into two sets, allocating 70% to training and 30% to testing.

Subsequently, a preprocessing stage was set up to transform the data consistently for model training. For feature transformation, we used scikit-learn's ColumnTransformer [17] to normalize the numeric features via the StandardScaler [23], while transforming the categorical variables into binary format with the OneHotEncoder [18]. This ensures that the different feature types are handled uniformly. The preprocessing steps and the model are then joined in a single pipeline, which guarantees that the same transformations are applied during both training and testing and keeps variability manageable. With the pipeline defined, we proceed to model training, as sketched below.
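A condensed sketch of this setup follows. The names X, y, numeric_cols, and categorical_cols are placeholders for the match data, and handle_unknown="ignore" is our addition to keep the encoder robust to unseen categories, not a setting reported in the paper.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),                      # normalize numerics
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # binarize categoricals
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),  # or SVC(kernel="linear"), etc.
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # 70/30 split as in the text
model.fit(X_train, y_train)
```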
3.2.2 Models in Detail. In this study, several supervised learning classifiers that have proven valuable for predictive purposes in sports are employed. Logistic Regression is a statistical technique that predicts the probability of a binary classification, using a sigmoid function to map outputs to a [0, 1] probability space. Its coefficients indicate the strength and direction of relationships between variables, with positive values increasing the likelihood of an event and negative values decreasing it [9].

Random Forest extends the bagging method by generating multiple decision trees from randomly selected data samples. Each tree operates independently, and the final prediction is aggregated across all trees, reducing overfitting and improving accuracy in classification tasks [4].

SVM aims to find the best hyperplane to separate data points by class, maximizing the margin between them. It handles non-linear boundaries by transforming the input data into a higher-dimensional space [2].

Naive Bayes, based on Bayes' theorem, assumes feature independence, making it fast and easy to implement, especially in applications like spam detection and text classification. Despite the simplicity of this assumption, it performs well in practice [26].

Gradient Boosting combines multiple weak learners (typically decision trees) into a stronger predictive model, improving accuracy by focusing on correcting the errors of the previous models [6].

k-Nearest Neighbors (kNN) is an instance-based learning method that classifies data by identifying the majority label among the k closest points. Though simple, it can be computationally expensive, as it requires storing all training data and performing real-time comparisons [7].

3.2.3 Evaluation Metrics. Finally, the trained models are assessed using metrics such as accuracy [19], precision [21], recall [22], and F1-score [20]. This evaluation shows how well each of the models is likely to perform in predicting match outcomes; a sketch of the computation follows.
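On the held-out 30% of the data, the metrics could be computed as follows; macro averaging over the three outcome classes is our assumption, as the paper does not state which averaging was used.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
# Macro averaging treats the H/D/A classes equally, one plausible choice
# for the three-class match-outcome problem.
print("precision:", precision_score(y_test, y_pred, average="macro"))
print("recall   :", recall_score(y_test, y_pred, average="macro"))
print("f1       :", f1_score(y_test, y_pred, average="macro"))
```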
4 Results and Discussion

In this study, we employed six different classifiers to predict football match outcomes and conducted a comparative analysis of their performance. The effectiveness of each classifier was evaluated based on its accuracy, providing a clear comparison of their predictive capabilities across the dataset.

4.1 Model Performance

Among the classifiers employed, SVM produced the most accurate predictions, at 83%. The model performed well across the board, with balanced precision and recall over all three classes (A, D, and H), showing that it can reliably predict match outcomes. Logistic Regression performed somewhat worse than SVM, with an accuracy of 77%. Despite its simplicity and computational efficiency, Gaussian Naive Bayes had the lowest accuracy of any classifier tested, at 39%; this model struggled in particular with class 'D', with low precision and recall scores. Random Forest, an established ensemble learning approach, did not perform well either, with an accuracy of 54%, although its precision and recall were roughly balanced across all classes, making it an acceptable alternative for predicting match results. Gradient Boosting, another ensemble learning technique, achieved a somewhat higher accuracy than Random Forest, at 64%; while it is recognized for its ability to capture complicated relationships, it produced poorer recall scores, especially for class 'D'. Lastly, k-Nearest Neighbors (kNN) reached 51% accuracy; although relatively weak overall, it had fairly even precision and recall across the classes.

In summary, for the match predictions we employed the following classification models: Logistic Regression, Support Vector Machine, Gaussian Naive Bayes, Random Forest, Gradient Boosting, and k-Nearest Neighbors. The resulting accuracies varied from 39% to 83%, with Support Vector Machines being the most effective. Our findings are partially consistent with prior research, as classifiers like Support Vector Machines, Logistic Regression, and Random Forest have shown robustness in predicting match outcomes across datasets. Nevertheless, the results do not conform with some recent work, particularly concerning the efficacy of Gaussian Naive Bayes, which performed poorly in our study in contrast to other research. It should be noted that results may vary significantly between studies depending on the quality, quantity, and nature of the data used to build the models.

Table 2: Model Performances

Model                 Accuracy   Precision   Recall   F1 score
Logistic Regression   77%        75%         74%      74%
SVM                   83%        86%         83%      84%
Gaussian NB           39%        47%         42%      36%
Random Forest         54%        43%         46%      43%
Gradient Boosting     64%        64%         59%      60%
kNN                   51%        49%         49%      49%

Table 2 summarizes how accurately the various machine learning models predict WWIN League of Bosnia and Herzegovina match outcomes.
4.1.1 Key factors influencing match outcomes. While this study does not perform a formal feature analysis, the observed performance trends allow us to draw conclusions about the key factors influencing match outcomes. In line with prior research, home advantage emerged as a critical factor, with teams winning at home in over 50% of cases (Table 1), which reinforces the psychological and tactical advantages of playing on familiar ground.

Offensive metrics, particularly shots on target, also emerged as strong predictors of success. Teams that generated more attempts on goal were significantly more likely to win, reinforcing the widely accepted view that aggressive, forward-driven play translates directly into better results. This trend mirrors observations from other football leagues, where offensive intensity is often directly correlated with victory.

4.1.2 Limitations and future work. Despite the promising results, this study has several limitations. First, the dataset used does not account for external factors such as player injuries, weather conditions, or team morale, all of which can influence match outcomes; future research should incorporate these factors to improve the accuracy of predictions. Second, while SVM performed well in this context, more advanced models such as deep learning could potentially offer even better predictive performance, particularly on larger datasets.

Future work will explore the integration of additional domain-specific features, such as player statistics, team form, and environmental conditions, to further refine the predictive models. We will also experiment with more complex algorithms, such as neural networks, to capture intricate relationships between features that may be missed by traditional machine learning models.
5 Conclusion

This study demonstrates that machine learning, particularly SVM, can effectively predict football match outcomes in the WWIN League of Bosnia and Herzegovina. The Support Vector Machine was found to be the most accurate classifier, with an 83% accuracy rate on match result prediction, and showed balanced precision and recall across all three outcome classes (Home Win, Away Win, and Draw), indicating its applicability to football prediction. However, the performance of the other classifiers varied, with Logistic Regression achieving 77% accuracy and Gaussian Naive Bayes a poor 39%. Random Forest and Gradient Boosting, both ensemble learning algorithms, reached similar accuracy levels of 54% and 64%, respectively. While further refinement of the models is needed, the current findings establish a strong foundation for data-driven decision-making in football management. Future work should incorporate additional factors, such as player injuries and weather conditions, to enhance predictive accuracy.

References
[1] 1XBET. 2007–2024. 1xbet. Retrieved May 26, 2024 from https://1xlite-579542.top/en?tag=s_245231m_5435c_.
[2] Mariette Awad and Rahul Khanna. 2015. Support vector machines for classification. In Efficient Learning Machines. Apress, 39–66. doi: 10.1007/978-1-4302-5990-9_3.
[3] Rahul Baboota and Harleen Kaur. 2019. Predictive analysis and modelling football results using a machine learning approach for the English Premier League. International Journal of Forecasting, 35, 2, 741–755. doi: 10.1016/j.ijforecast.2018.01.003.
[4] Leo Breiman. 2001. Random forests. Machine Learning, 45, 5–32. doi: 10.1023/A:1010933404324.
[5] Rory P. Bunker and Fadi Thabtah. 2019. A machine learning framework for sport result prediction. Applied Computing and Informatics, 15, 1, 27–33.
[6] Stefanos Fafalios, Pavlos Charonyktakis, and Ioannis Tsamardinos. 2020. Gradient Boosting Trees. Gnosis Data Analysis PC, (Apr. 2020).
[7] Gongde Guo, Hui Wang, David Bell, Yaxin Bi, and Kieran Greer. 2003. KNN model-based approach in classification. In On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE. Springer Berlin Heidelberg, Catania, Sicily, Italy, (Nov. 2003), 986–996.
[8] Ishan Jawade, Rushikesh Jadhav, Mark Joseph Vaz, and Vaishnavi Yamgekar. 2021. Predicting football match results using machine learning. International Research Journal of Engineering and Technology (IRJET), 8, 7, 177. https://www.irjet.net.
[9] Daniel Jurafsky and James H. Martin. 2023. Logistic Regression. Stanford University, chapter 5. https://web.stanford.edu/~jurafsky/slp3/5.pdf.
[10] Haris Kruskic. 2019. UEFA Champions League explained: how the tournament works. Bleacher Report. Retrieved from https://bleacherreport.com/articles/2819840-uefa-champions-league-explained-how-the-tournament-works.
[11] Dušan Munđar and Diana Šimić. 2016. Croatian first football league: teams' performance in the championship. Croatian Review of Economic, Business and Social Statistics, 2, 1, 15–23. https://hrcak.srce.hr/file/245359.
[12] Prva Liga BiH. 2022. Osvajači trofeja. Retrieved from https://plbih.ba/osvajaci-trofeja/.
[13] Ashiqur Rahman. 2020. A deep learning framework for football match prediction. SN Applied Sciences, 2, 2, 165. doi: 10.1007/s42452-019-1821-5.
[14] 2006–2024. Rezultati. Retrieved May 26, 2024 from https://www.rezultati.com/.
[15] Fátima Rodrigues and Ângelo Pinto. 2022. Prediction of football match results with machine learning. Procedia Computer Science, 204, 463–470. doi: 10.1016/j.procs.2022.08.057.
[16] Sayed Muhammad Yonus Saiedy, Muhammad Aslam HemmatQachmas, and Amanullah Faqiri. 2020. Predicting EPL football match results using machine learning algorithms. International Journal of Engineering Applied Sciences and Technology, 5, 3, 83–91. http://www.ijeast.com.
[17] scikit-learn. 2024. ColumnTransformer. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html.
[18] scikit-learn. 2024. OneHotEncoder. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
[19] scikit-learn. 2024. sklearn.metrics.accuracy_score. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html.
[20] scikit-learn. 2024. sklearn.metrics.f1_score. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html.
[21] scikit-learn. 2024. sklearn.metrics.precision_score. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html.
[22] scikit-learn. 2024. sklearn.metrics.recall_score. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html.
[23] scikit-learn developers. 2024. sklearn.preprocessing.StandardScaler. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html.
[24] Sofascore. 2024. Sofascore. Retrieved May 26, 2024 from https://www.sofascore.com/.
[25] SportMonks. 2022. Premier league API Bosnia. Retrieved from https://www.sportmonks.com/football-api/premier-league-api-bosnia/.
[26] Geoffrey I. Webb. 2016. Naïve Bayes. In Encyclopedia of Machine Learning and Data Mining. (Jan. 2016), 1–2. doi: 10.1007/978-1-4899-7502-7_581-1.

Sarcasm Detection in a Less-Resourced Language

Lazar Ðoković, Marko Robnik-Šikonja
University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia
lazardjokoviclaki02@gmail.com, marko.robnik@fri.uni-lj.si

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.scai.4212

Abstract

The sarcasm detection task in natural language processing tries to classify whether an utterance is sarcastic or not. It is related to sentiment analysis, since sarcasm often inverts the surface sentiment. Because sarcastic sentences are highly dependent on context and are often accompanied by various non-verbal cues, the task is challenging. Most related work focuses on high-resourced languages like English.
To build a sarcasm detection dataset for a less-resourced language, such as Slovenian, we leverage two modern techniques: a medium-size transformer model specialized for machine translation, and a very large generative language model. We explore the viability of the translated datasets and how the size of a pretrained transformer affects its ability to detect sarcasm. We train ensembles of detection models and evaluate their performance. The results show that larger models generally outperform smaller ones and that ensembling can slightly improve sarcasm detection performance. Our best ensemble approach achieves an F1-score of 0.765, which is close to the annotators' agreement in the source language.

Keywords

natural language processing, large language models, sarcasm detection, neural machine translation, BERT model, GPT model, LLaMa model, ensembles

1 Introduction

Sentiment analysis is a popular task in natural language processing (NLP), concerned with the extraction of underlying attitudes and opinions, usually categorized as "positive", "negative", and "neutral". Detecting sentiment is challenging when utterances are ironic or sarcastic. Sarcasm is a form of verbal irony that transforms the surface polarity of an apparently positive or negative utterance into its opposite [6]. Sarcasm is frequent in day-to-day communication, especially on social media [5]. This poses a significant problem for sentiment analysis tools, since sarcasm polarity switches create ambiguity in meaning. Sarcasm is also highly dependent on its context: for example, the sentence "I just love hot weather" could be interpreted as sarcastic depending on the situation, e.g., during a summer heat wave.

Historical developments of sarcasm detection are surveyed by Joshi et al. [3], while recent developments are covered by Moores and Mago [5]. The problem of automatic sarcasm detection in text is most commonly formulated as a classification task. Unfortunately, sarcasm detection is affected by the lack of large-scale, noise-free datasets. Existing datasets are mostly harvested from microblogging platforms such as Twitter and Reddit, relying on user annotation via distant supervision through hashtags such as #sarcasm, #sarcastic, #not, etc. This method is popular since (1) only the author of a post can determine whether it is sarcastic or not, and (2) it allows large-scale dataset creation. However, it introduces large amounts of noise due to lack of context, user errors, and common misuse on social media platforms. Sarcasm detection datasets created through manual annotation tend to be of higher quality but are typically much smaller. These problems are further compounded for non-English datasets, both manually labeled and automatically collected. Further, as sarcasm strongly relies on its context, using classical machine translation (MT) from English often produces inadequate results. This makes sarcasm detection in less-resourced languages, like Slovenian, an even bigger challenge. Developing reliable sarcasm detection models is therefore of crucial importance for robust sentiment analysis in these languages.

We develop a methodology for sarcasm detection in less-resourced languages and test it on Slovenian. We address the problem of missing datasets by comparing state-of-the-art machine translation with large generative models. We explore the viability of such datasets and how the number of parameters affects a model's ability to detect sarcasm. We construct various ensembles of large pretrained language models and evaluate their performance.

The rest of this work is organized as follows. In Section 2, we discuss the proposed approach for detecting sarcasm in a less-resourced language such as Slovenian, presenting the creation of the dataset, the details of the training methodology, and the deployed ensemble techniques. We lay out our experimental results and their interpretation in Sections 2.3 and 4. In Section 5, we provide conclusions and directions for future work.

2 Sarcasm Detection Dataset

Existing attempts at automatic sarcasm detection have resulted in datasets in a small number of languages, of differing sizes and quality. It is unclear whether models trained on these datasets would generalize well to unseen languages [1]. Since no dataset exists for Slovenian, we leverage recent advances in machine translation and large language models (LLMs) to create a dataset for supervised sarcasm detection.
We thus apply a translate-train approach when fine-tuning our models. The prevalence of research on sarcasm in English means that English datasets are usually larger and of higher quality than their counterparts in other languages. Further, as translation from (and to) English is usually of better quality, we consider only English datasets.

Preliminary tests showed poor quality and poor translatability of the Sarcasm on Reddit dataset (www.kaggle.com/datasets/danofer/sarcasm) and the News Headlines Dataset For Sarcasm Detection (www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection). Hence, we chose the recent iSarcasmEval dataset (github.com/iabufarha/iSarcasmEval) from the SemEval-2022 shared task. We believe that the relatively low performance scores obtained in this shared task could be improved with the use of larger LLMs.

2.1 iSarcasmEval Dataset

iSarcasmEval is a dataset of English and Arabic sarcastic and non-sarcastic short-form tweets obtained from Twitter. We use only the English part, which is pre-split into a train and a test set. Both sets are unbalanced: the former has 867 sarcastic and 2601 non-sarcastic examples, while the latter has 200 sarcastic and 1200 non-sarcastic examples. The authors of the shared task note that both distant supervision and manual annotation produce noisy labels, in terms of both false positives and false negatives [1]. They therefore asked Twitter users to directly provide one sarcastic and three non-sarcastic tweets they had posted in the past; these responses were then filtered to ensure their quality. The produced dataset is still not entirely clean, since it contains links, emojis, and capitalized text. We chose to leave all of these potential features in the text, as they commonly occur in online conversations and could be indicative of sarcasm. For reference, an ensemble approach with 15 transformer models and transfer from three external sarcasm datasets proved to be the most accurate modeling technique for English [9], achieving an F1-score of 0.605.

2.2 Translating iSarcasmEval

Our preliminary testing using smaller BERT-like classifiers showed that the models learned the distribution of the data and defaulted to the majority classifier (1200/1400 = 0.857 test accuracy). To discourage this, we merged the train and test sets, kept all the sarcastic instances, and randomly sampled an equal number of non-sarcastic examples. This left us with a balanced dataset of 2134 samples.

To enable task-specific instructions that would preserve sarcasm, we skipped classical machine translation tools and tried two alternative translation approaches:

• using a medium-sized T5 model trained specifically for neural machine translation,
• leveraging a significantly larger model via OpenAI's API.

The T5 model uses both the encoder and decoder stacks of the Transformer architecture and is trained within a text-to-text framework. We chose Google's 32-layer T5 model called MADLAD400-10B-MT (huggingface.co/google/madlad400-10b-mt), which has 10.7 billion parameters and is pretrained on the MADLAD-400 [4] dataset of 250 billion tokens covering 450 languages. Fine-tuning for machine translation was done on a combination of parallel data sources in 157 languages, including Slovenian.

As a generative model, we chose OpenAI's decoder-based GPT-4o-2024-05-13 (platform.openai.com/docs/models/gpt-4o). Its true size is not known to the public, but it is speculated to be significantly smaller than GPT-4, since it is much faster and more efficient. OpenAI claims that it has the best performance across non-English languages of any of their models, making it suitable for our task.

When prompting generative decoder-based models, it is necessary to craft clear and specific instructions to achieve the best results. We used few-shot learning [2]: we randomly sampled three training instances, manually translated them, and included them in the following prompt, where the double forward slash delimits the query and the expected response.

You will be provided with a sarcastic/non-sarcastic sentence in English, and your task is to translate it into the Slovenian language. It should keep the original meaning. Examples:
• love getting assignments at 6:25pm on a Friday!! // obožujem, ko mi v petek ob 18:25 pošljejo naloge!!
• I still can't believe England won the World Cup // Še vedno ne morem verjeti, da je Anglija zmagala na svetovnem prvenstvu
• taking spanish at ut was not my best decision // Učenje španščine na UT ni bila moja najboljša odločitev

We manually assessed the outputs of both transformers in order to determine the best translations for fine-tuning the detection models.
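A sketch of how such a few-shot prompt could be sent through OpenAI's Python client is given below; the message layout and temperature setting are our assumptions, and only the first few-shot pair from the prompt above is shown.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM = ("You will be provided with a sarcastic/non-sarcastic sentence in "
          "English, and your task is to translate it into the Slovenian "
          "language. It should keep the original meaning.")
FEW_SHOT = [  # manually translated training instances, as in the prompt above
    ("love getting assignments at 6:25pm on a Friday!!",
     "obožujem, ko mi v petek ob 18:25 pošljejo naloge!!"),
]

def translate(tweet: str) -> str:
    messages = [{"role": "system", "content": SYSTEM}]
    for en, sl in FEW_SHOT:
        messages += [{"role": "user", "content": en},
                     {"role": "assistant", "content": sl}]
    messages.append({"role": "user", "content": tweet})
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13", messages=messages, temperature=0)
    return response.choices[0].message.content
```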
2.3 Translation Results

During translation, the T5 model sometimes had trouble with examples containing multiple consecutive newline characters, and it occasionally dropped parts of texts it did not understand (mostly slang and various types of informal text). This shows that a 10B-parameter model is not large enough to robustly translate all registers of a language such as English into a less-resourced language such as Slovenian.

The GPT model, on the other hand, performed surprisingly well in most instances and seemed to have a more nuanced understanding of phrases used in online speech. It consistently translated entire texts, keeping the original structure and meaning. Consequently, we used GPT's translations when training the sarcasm detection models. The translations can be found in our repository (github.com/GalaxyGHz/Diploma).

3 Model Training

We tested the performance of a wide range of LLMs of different sizes; an overview is given in Table 1.

Table 1: Summary of used sarcasm detection models.

Model                           Parameters
SloBERTa                        110M
BERT-BASE-MULTILINGUAL-CASED    179M
XLM-RoBERTa-BASE                279M
XLM-RoBERTa-LARGE               561M
META-Llama-3.1-8B-INSTRUCT      8.03B
META-Llama-3.1-70B-INSTRUCT     70.6B
META-Llama-3.1-405B-INSTRUCT    406B
GPT-3.5-TURBO-0125              ?
GPT-4o-2024-05-13               ?

3.1 Encoder Models Under 1B Parameters

The four smallest models are encoder-based models that embed the input text and use a classification head to assign it a class. They required additional fine-tuning to perform sarcasm detection, and for these models we conducted hyperparameter optimization.

We chose the SloBERTa [7, 8] model in order to check whether using a monolingual Slovenian model impacts sarcasm detection performance. We also wanted to compare BERT- and RoBERTa-like models, so we used their multilingual variants and fine-tuned them on the Slovenian data.

The models were trained for a maximum of five epochs with early stopping: training was halted if the validation loss did not improve for two epochs.
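With the Hugging Face transformers library, this fine-tuning regime could be expressed roughly as follows; the model identifier EMBEDDIA/sloberta and the dataset variables train_ds and val_ds are assumptions for illustration, not settings reported in the paper.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta")
model = AutoModelForSequenceClassification.from_pretrained(
    "EMBEDDIA/sloberta", num_labels=2)  # sarcastic vs. non-sarcastic

args = TrainingArguments(
    output_dir="sarcasm-sloberta",
    num_train_epochs=5,            # at most five epochs, as in the text
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,   # keep the checkpoint with the lowest loss
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # pre-tokenised splits (placeholders)
    eval_dataset=val_ds,
    # stop when validation loss fails to improve for two epochs
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```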
3.2 Llama 3.1 Models

Since the teams that competed in the 2022 shared task on sarcasm mostly used BERT and RoBERTa models, we extend the testing to include significantly larger models. We chose Meta's open-source Llama family, specifically the newest Llama 3.1 variants. These come in three different sizes, which is convenient for studying the effect of parameter count on sarcasm detection. We use the "instruct" version of all three models, since these were fine-tuned to follow instructions.

When prompting the Llama and GPT generative models, the following few-shot classification prompt was given, with two positive and two negative examples randomly sampled from our dataset:

You will be provided with text in the Slovenian language, and your task is to classify whether it is sarcastic or not. Use ONLY token 0 (not sarcastic) or 1 (sarcastic) as in the examples:
• Spanje? Kaj je to... Še nikoli nisem slišal za to? 1
• Lepo je biti primerjan z zidom 1
• To sploh nima smisla. Nehaj kopati. 0
• Dne 12. 10. 21 ob 10:30 je bil nivo reke 0,37 m. 0

We used full versions of the 8B and 70B parameter models, while the 405B parameter model was loaded in 16-bit precision mode. To minimize resource use and costs, we employed LoRA parameter-efficient fine-tuning.
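With the peft library, a LoRA setup along these lines could be used; the rank, scaling factor, and target modules below are illustrative defaults, as the paper does not report its exact LoRA configuration.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct")

# Illustrative LoRA hyperparameters (assumed, not taken from the paper).
lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only a small fraction of weights train
```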
We provided the models with the training and validation sets and trained them for a maximum of 10 epochs. No hyperparameter optimization was conducted in this case due to time constraints. We used the validation loss to choose the best model, with the same early stopping technique as for the smaller models.

3.3 GPT 3 and 4 Models

We also tested two models offered on the OpenAI platform, GPT-4o-2024-05-13 and GPT-3.5-TURBO-0125. We first used them in few-shot mode and classified all our examples without any fine-tuning. For fine-tuning, the platform's tier system limited us to the smaller GPT-3.5-TURBO-0125 model, which we fine-tuned for a maximum of three epochs. In the end, we used the model with the lowest validation loss to classify the examples in the test set.

3.4 Sarcasm Detection Ensembles

When constructing ensemble models, we tried two techniques: stacking and voting. In both cases, we used the predicted probability of the sarcastic class from each model as the input features.

3.4.1 Stacking With Regularized Logistic Regression. Our first ensemble used a stacking approach, with logistic regression with ridge (L2) regularization as the meta-level classifier. This choice was motivated primarily by the need for feature selection: we wanted to identify the most important model predictions and determine which models would be assigned a lower weight. The best models were then used for voting.

3.4.2 Standard and Mixed Voting. The second ensembling method was voting. We tried cut-off-based mixed voting inspired by [9]. Formally, we used hard voting when the absolute difference between the number of sarcastic and non-sarcastic predictions was greater than n, and soft voting otherwise. When n is set to zero, this approach is equivalent to hard voting, and when n equals the predictor count, it is equivalent to soft voting; we report both results. We optimized the value of n based on the ensemble's performance on our validation set. Additionally, we compare the results of voting with all trained models against voting with only the models that received large weights in our regularized logistic regression ensemble. A sketch of both ensembling techniques follows.
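In this minimal sketch, probs and P_train are placeholders for the base models' predicted probabilities of the sarcastic class; it illustrates the described techniques rather than the authors' exact code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mixed_vote(probs: np.ndarray, n: int, threshold: float = 0.5) -> int:
    """Cut-off-based mixed voting for one example.

    probs holds each base model's predicted probability of the sarcastic
    class. n = 0 reduces to hard voting, n = len(probs) to soft voting."""
    sarcastic = int((probs >= threshold).sum())
    non_sarcastic = len(probs) - sarcastic
    if abs(sarcastic - non_sarcastic) > n:   # clear majority: hard voting
        return int(sarcastic > non_sarcastic)
    return int(probs.mean() >= threshold)    # otherwise: soft voting

# Stacking: an L2-regularised logistic regression over the base models'
# probabilities; P_train has shape (n_examples, n_models).
meta = LogisticRegression(penalty="l2").fit(P_train, y_train)
```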
4 Sarcasm Detection Results

Table 2 summarizes all our results. It is roughly sorted by model size, with smaller models on top and larger ones at the bottom. The (NFT) tag indicates that a model was not fine-tuned, while the (LoRA) tag means that a model was trained with LoRA. Results are rounded to three decimal places.

Table 2: Summary of performance results for all tested models.

Model                                   Accuracy   F1-score
SloBERTa                                0.621      0.632
BERT-BASE-MULTILINGUAL-CASED            0.499      0.666
XLM-RoBERTa-BASE                        0.578      0.579
XLM-RoBERTa-LARGE                       0.550      0.597
Llama-3.1-8B-INSTRUCT (NFT)             0.560      0.676
Llama-3.1-8B-INSTRUCT (LoRA)            0.569      0.682
Llama-3.1-70B-INSTRUCT (NFT)            0.660      0.724
Llama-3.1-70B-INSTRUCT (4-bit-LoRA)     0.637      0.717
Llama-3.1-405B-INSTRUCT (16-bit-NFT)    0.686      0.751
GPT-3.5-TURBO-0125 (NFT)                0.564      0.679
GPT-3.5-TURBO-0125                      0.749      0.760
GPT-4o-2024-05-13 (NFT)                 0.686      0.746
L2-LOGISTIC-REGRESSION                  0.759      0.765
L2-LOGISTIC-REGRESSION-NON-COMMERCIAL   0.707      0.749
HARD-VOTING-ALL                         0.670      0.738
SOFT-VOTING-ALL                         0.658      0.732
HARD-VOTING-BEST-5                      0.686      0.749
SOFT-VOTING-BEST-5                      0.686      0.749

Individual Model Performance

Of all the models used, only BERT-BASE-MULTILINGUAL-CASED failed to learn any pattern in our data and defaulted to the dummy classifier.

GPT-3.5-TURBO-0125 sometimes predicted the correct token but then continued to generate additional text, such as 11 and 1\n1. This happened for a small number of examples in our test set; we truncated these responses and kept only the first token as the answer. The Llama models sometimes refused to generate token zero or one; we dropped these examples altogether, report the test accuracy without them, and trained the ensemble models without them.

The smaller encoder models performed poorly compared to some of the larger models: only SloBERTa achieved an accuracy above 0.6. Despite being the smallest of the four small models tested, SloBERTa performed best among them. This suggests that the three larger multilingual encoder models may lack sufficient understanding of Slovenian, and it highlights that model size alone does not necessarily correlate with better sarcasm detection performance.

The Llama models fared better, achieving accuracies of up to 0.686, with the 405B model comparable to GPT-4o in performance. They still fell short of the fine-tuned GPT-3.5-TURBO-0125 model, which successfully classified about three-quarters of our examples, with an F1-score of 0.76.

Some models had significantly higher F1-scores but lower accuracies. Table 3 shows the confusion matrix of one of the models that exhibited the largest such difference. These models tend to incorrectly classify non-sarcastic text as sarcastic, leading to a high rate of false positives.

Table 3: Confusion matrix for the non-fine-tuned Llama-3.1-405B-INSTRUCT model.

Predicted \ Actual   Positive   Negative
Positive             202        123
Negative             11         91

Our testing also showed that loading the Llama-3.1-70B-INSTRUCT model in 4-bit mode and fine-tuning it with LoRA does not produce satisfactory results; it is thus better to conduct full fine-tuning with the smaller Llama model or to use one of OpenAI's models via their fine-tuning API.

GPT-3.5-TURBO-0125 performed best among the individual models, so if the costs associated with OpenAI's API are acceptable, we recommend it for sarcasm detection in Slovenian. This shows that very large models can effectively identify sarcasm. We believe that, with better parameter tuning, Llama 8B could be one of the best (and most economical) options for sarcasm detection in Slovenian, provided that the user has sufficient hardware resources.
constructing our voting ensembles since its dummy classification [3] Aditya Joshi, Pushpak Bhattacharyya, and Mark J. Carman. 2017. Automatic sarcasm detection: a survey. ACM Comput. Surv., 50, 5, Article 73, 22 pages. didn’t contribute to overall model performance. Both of these doi: 10.1145/3124420. two voting classifiers had an odd number of predictors, so there [4] Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, was no need for a tie-breaker mechanism. Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. [n. d.] Madlad- 400: a multilingual and document-level large audited dataset. In Proceedings Voting proved to be ineffective in our setups, even scoring of the 37th International Conference on Neural Information Processing Systems lower than some of its base models. hard voting generally out- Article 2940, 13 pages. [5] Bleau Moores and Vijay Mago. 2022. A survey on automated sarcasm detec- performed soft voting. We also note that there was no benefit tion on twitter. arXiv preprint. doi: 10.48550/arXiv.2202.02516. in using mixed voting, at least for the sets of predictors that we [6] Smaranda Muresan, Roberto Gonzalez-Ibanez, Debanjan Ghosh, and Nina obtained as hard voting always had a higher accuracy. This was Wacholder. 2016. Identification of nonliteral language in social media: a case study on sarcasm. Journal of the Association for Information Science and true for both the classifiers that used all and only five of the base Technology, 67, 11, 2725–2737. doi: 10.1002/asi.23624. models. [7] Matej Ulčar and Marko Robnik Šikonja. 2021. Sloberta: slovene monolingual Regularized logistic regression managed to improve on the large pretrained masked language model. In Proceedings of Data Mining and Data Warehousing, SiKDD, 17–20. http://library.ijs.si/Stacks/Proceedings/Inf scores of individual models, raising accuracy by one percent, thus ormationSociety/2021/IS2021_Volume_C.pdf . achieving the best performance out of all of the tested approaches. [8] Matej Ulčar and Marko Robnik Šikonja. 2021. Slovenian RoBERTa contextual This shows that there is still performance to be gained from embeddings model: SloBERTa 2.0. CLARIN.SI data & tools. Nasl. z nasl. zaslona. Fakulteta za računalništvo in informatiko. http://hdl.handle.net/11356/1397. ensembles; however, it is still necessary to use commercial models [9] Mengfei Yuan, Zhou Mengyuan, Lianxin Jiang, Yang Mo, and Xiaofeng Shi. for top performance. 2022. Stce at SemEval-2022 task 6: sarcasm detection in English tweets. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval- 2022), 820–826. doi: 10.18653/v1/2022.semeval-1.113. 5 Conclusion In this work, we presented the task of sarcasm detection in the less-resourced Slovenian language. Our code and results are freely 7 available . We tackled the missing dataset problem by using two LLMs to perform neural translation of an English dataset into Slove- nian. The translations generated by GPT-4o-2024-05-13 out- paced those generated by a large T5 model specifically trained for neural machine translation in terms of quality. We used this data to train a plethora of Transformer-based models in various settings. 
We found that fine-tuning GPT-3.5-TURBO-0125 via OpenAI's API results in the highest individual Slovenian sarcasm-detection performance, but we also note that a possible alternative could be local fine-tuning of the Llama-3.1-8B-INSTRUCT model. Our testing shows that aggressive quantization combined with LoRA results in significant performance degradation.

We also constructed ensemble models based on voting and stacking methods. Voting did not yield any performance improvements; on the other hand, stacking with a regularized logistic regression managed to improve on the performance of its base models.

Additional work needs to be done on dataset construction. Sarcastic examples could be extended with context or with labels of the types of sarcasm they represent, which might help guide models towards a better understanding of sarcasm. Future work could also explore incorporating heterogeneous models into ensembles, or creating Mixture-of-Experts (MoE) ensembles whose base models focus on different aspects of sarcasm.

Acknowledgements

This research was supported by the Slovenian Research and Innovation Agency (ARIS) core research programme P6-0411 and projects J7-3159, CRP V5-2297, L2-50070, and PoVeJMo.

References

[1] Ibrahim Abu Farha, Silviu Vlad Oprea, Steven Wilson, and Walid Magdy. 2022. SemEval-2022 task 6: iSarcasmEval, intended sarcasm detection in English and Arabic. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), 802–814. doi: 10.18653/v1/2022.semeval-1.111.
[2] Tom Brown et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33, 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[3] Aditya Joshi, Pushpak Bhattacharyya, and Mark J. Carman. 2017. Automatic sarcasm detection: a survey. ACM Comput. Surv., 50, 5, Article 73, 22 pages. doi: 10.1145/3124420.
[4] Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. [n. d.] MADLAD-400: a multilingual and document-level large audited dataset. In Proceedings of the 37th International Conference on Neural Information Processing Systems, Article 2940, 13 pages.
[5] Bleau Moores and Vijay Mago. 2022. A survey on automated sarcasm detection on Twitter. arXiv preprint. doi: 10.48550/arXiv.2202.02516.
[6] Smaranda Muresan, Roberto Gonzalez-Ibanez, Debanjan Ghosh, and Nina Wacholder. 2016. Identification of nonliteral language in social media: a case study on sarcasm. Journal of the Association for Information Science and Technology, 67, 11, 2725–2737. doi: 10.1002/asi.23624.
[7] Matej Ulčar and Marko Robnik-Šikonja. 2021. SloBERTa: Slovene monolingual large pretrained masked language model. In Proceedings of Data Mining and Data Warehousing, SiKDD, 17–20. http://library.ijs.si/Stacks/Proceedings/InformationSociety/2021/IS2021_Volume_C.pdf
[8] Matej Ulčar and Marko Robnik-Šikonja. 2021. Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0. CLARIN.SI data & tools. http://hdl.handle.net/11356/1397.
[9] Mengfei Yuan, Zhou Mengyuan, Lianxin Jiang, Yang Mo, and Xiaofeng Shi. 2022. stce at SemEval-2022 task 6: sarcasm detection in English tweets. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), 820–826. doi: 10.18653/v1/2022.semeval-1.113.


Speech-to-Service: Using LLMs to Facilitate Recording of Services in Healthcare

Maj Smerkol (maj.smerkol@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Rok Susič (rs36117@student.uni-lj.si), University of Ljubljana, Faculty of Mathematics and Physics, Ljubljana, Slovenia
Mariša Ratajec (mr97744@student.uni-lj.si), University of Ljubljana, Faculty of Electrical Engineering, Ljubljana, Slovenia
Helena Halbwachs (h.halbwachs@senecura.si), SeneCura Kliniken- und Heimebetriedsgesellschaft m.b.H., Vienna, Austria
Anton Gradišek (anton.gradisek@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.scai.4550

Abstract

Digital tracking of services is one of the main administrative burdens of healthcare staff. Here, we present a proof-of-concept study of a so-called speech-to-service (S2S) system aimed at facilitating the recording of activities by extracting information from the conversation between a healthcare provider and a care recipient. The system comprises a speech recorder, a diarization component, an LLM that interprets the conversation, and a recommendation system integrated in a smart tablet, which records completed activities and suggests other activities that may still be required. We tested the system on 350 conversations and obtained 95% accuracy, 97% precision, and 94% recall.

Keywords

healthcare, LLM, speech recognition, recommendation system
1 Introduction

Healthcare workers, including nurses, technicians, and care personnel, form the backbone of the health system, as they care for patients and tend to their needs. However, the standardization and systematization of healthcare professions and services often brings a large bureaucratic burden, as healthcare workers have to record all the activities and services they provide to the patients. This process is needed, as it provides traceability and ensures that all the required activities were taken care of; the problem, however, is that the interfaces designed for activity logging are often not user-friendly and require the users to choose activities from extensive lists of drop-down menus. In total, this amounts to substantial time spent solely on tedious administrative tasks, time that could be spent more beneficially otherwise.

With the aim of alleviating the administrative burden of activity logging, we explored the possibilities of novel technologies to assist the healthcare staff in their logging tasks. We developed and tested a proof-of-concept system that records the conversation between the healthcare worker and a patient, identifies the activities, and allows the healthcare worker to batch-confirm them on a dedicated smart tablet. Batch-confirmation saves a lot of time by significantly lowering the number of clicks required in the UI. The system is built using open-source or publicly accessible components, particularly a speech-to-text system that transcribes the recorded conversation and a large language model (LLM) that leverages its natural-language-processing capabilities. The recommender system shows possibly required tasks, serving as a reminder and suggesting tasks that are expected soon, which may lower the number of visits per patient. These recommended tasks are then suggested to the healthcare worker, who can review and confirm them using the LLM-assisted interface. LLMs, such as ChatGPT and Llama, have seen a surge in popularity across a wide variety of topics, in particular since the unveiling of ChatGPT in the autumn of 2022.

Several LLM-based systems have been proposed recently, covering administrative task automation [6], decision-making support [10], improving existing automatic speech recognition (ASR) systems [1], and providing patients with needed information [9]. A recent study [11] concludes that utilising ASR to ease some administrative tasks leads to faster, more efficient work and even improves workers' moods.

2 System Architecture

This paper describes two early prototype systems, both aiming to alleviate the workload of healthcare workers by easing the task of documenting the care actions performed. These are an ASR system that logs care actions based on the captured dialogue between the healthcare worker and the patient, and a recommender system that predicts the services required at a specific time. The recommender system relies on historical data and is appropriate for long-term patient care facilities.

Both systems are limited in scope and only target the most common healthcare services in the dataset for detection or prediction, respectively. This can still greatly ease the workload of medical workers, since the top 10 most common tasks out of around 200 care-action types represent around 80% of all services performed.

The recommender system allows the care workers to anticipate tasks in advance and serves as a reminder. This aims to lower the number of patient visits, which also alleviates the workload.

2.1 Speech-to-Service ASR

The ASR system consists of a speech diarization model, capable of segmenting the recorded speech based on who is currently speaking; a speech transcription model that transcribes the audio to text; and an LLM fine-tuned to extract specific information from the text. Figure 1 shows the architecture of the prototype system. We employ the pretrained speaker-diarization model [4] (pyannote/speaker-diarization-3.1) for diarization, the pretrained Whisper model [7] for transcription, and a fine-tuned Llama3 model (meta/meta-llama-3-8b) [2] for information extraction, generating JSON-formatted output.

Figure 1: Overview of the ASR system.
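As an illustration of how such a pipeline can be wired together, the following is a minimal sketch using the two named pretrained models; the file name, the Whisper model size, and the placeholder HuggingFace token are assumptions, and the authors' actual integration code may differ.

```python
# Sketch: pyannote diarization -> per-segment Whisper transcription ->
# play-like transcript with (possibly ambiguous) speaker labels.
import whisper
from pyannote.audio import Pipeline

SAMPLE_RATE = 16_000  # whisper.load_audio resamples to 16 kHz

diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN")
asr = whisper.load_model("small")

audio = whisper.load_audio("conversation.wav")   # float32 waveform
diarization = diarizer("conversation.wav")

lines = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
    # Cut out the diarized segment and transcribe it separately.
    segment = audio[int(turn.start * SAMPLE_RATE):int(turn.end * SAMPLE_RATE)]
    text = asr.transcribe(segment)["text"].strip()
    # Even an unreliable speaker label helps the downstream extractor.
    lines.append(f"{speaker}: {text}")

transcript = "\n".join(lines)  # fed to the fine-tuned Llama3 extractor
print(transcript)
```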
2.2 Recommender System

The recommender system prototype is based on machine-learning prediction of the events expected to occur in a certain time window for a specific patient, with the addition of tasks that commonly follow the predicted tasks. Due to the sensitive nature of the data, we base our predictions only on the time window, the patient ID, and the care type. We therefore consider multi-output binary classifiers that do not require large amounts of data for training. Additional tasks are added to the list based on a Markov chain model of tasks that commonly follow one another; e.g., the task 'clean table' follows the task 'lunch'.

The feature vector includes the time of day, day of week, week of month, and month of year as numbers, allowing the capture of periodic events with different periods. Due to the lack of patient data, we opted for personalized models, trained for each patient separately. We believe that the results can be further improved by adding more patient-related attributes. Model training used five months of collected data, with cross-validation, and accuracy was evaluated on the data collected during the sixth month. Because patients' medical states change over time, some data drift is expected, which is reflected in our results.
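A minimal sketch of the follow-up-task rule is given below, assuming task logs are available as per-patient chronological lists of task names; the example data, the probability threshold, and the function names are illustrative, not the authors' exact implementation.

```python
from collections import Counter, defaultdict

def fit_transitions(sequences):
    """Estimate first-order Markov transition counts between tasks."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current_task, next_task in zip(seq, seq[1:]):
            counts[current_task][next_task] += 1
    return counts

def follow_up_tasks(predicted, counts, min_prob=0.5):
    """Extend predicted tasks with tasks that commonly follow them."""
    extra = set()
    for task in predicted:
        total = sum(counts[task].values())
        for nxt, c in counts[task].items():
            if total and c / total >= min_prob:
                extra.add(nxt)
    return list(predicted) + sorted(extra - set(predicted))

history = [["lunch", "clean table", "medication"], ["lunch", "clean table"]]
counts = fit_transitions(history)
print(follow_up_tasks(["lunch"], counts))  # ['lunch', 'clean table']
```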
3 Dataset

To fine-tune the information-extraction model based on Llama3, we created a dataset of conversations in text form with appropriate outputs for each of them, as the task at hand is very specific and we did not find any existing appropriate dataset. We automated the process and manually removed any bad examples. A real dataset, ideally recorded in the target environment, is needed for the final implementation; LLM-generated datasets used for training LLMs are only appropriate in preliminary studies.

To generate the dataset, we prompted a BERT LLM (google-bert/bert-base-multilingual-cased) [5]. A training sample was generated by first randomly selecting 2 of the 10 target actions and programmatically generating the target output JSON. The BERT model was then tasked with generating a conversation in which these two tasks are mentioned as having been done. We generated several hundred conversations that way and manually checked for mistakes in the model output. Many conversations were removed because the selected actions were not mentioned, or for other reasons. Finally, the resulting dataset contains 350 conversations with JSON-formatted lists of tasks.

For the prediction of the services required during a visit, we acquired a log of all services performed in one long-term patient care facility over a period of 6 months, with the next version expanding to data from six facilities. The tasks in the dataset include measurements (body temperature, heart rate, blood pressure, ...), medical tasks (monitoring medicine intake, performing examinations, turning the patient in bed), and care tasks (breakfast, lunch, cleaning). Over 200 different tasks are mentioned.

The dataset includes limited patient information: the patient ID, the care type, and a detailed chronological history of services received. The care types (CareType I, CareType II, CareType III/A, CareType III/B, CareType IIII) represent an estimate of how much assistance a person requires. Legal restrictions on accessing sensitive health data prevented us from obtaining more detailed patient records, so we developed prediction models based on these limited data points, balancing accuracy with regulatory constraints.

The data preprocessing involved determining each patient's presence in the facility by identifying the timestamps of their first and last recorded service. Patients with a stay of less than four months were excluded from the analysis to ensure sufficient data for reliable predictions.

4 Methods

This section describes the methodology used to develop the ASR system and the recommender system.

4.1 Clustering

The primary goal of the clustering process was to group patients with similar patterns in terms of the type and frequency of the services they received, allowing us to predict relevant services more effectively for each cluster (since it was not clear, even among experts, whether the care type and the actual care provided were correlated).

The clusters, as shown in Figure 2, demonstrate that patients within the same care type tend to receive similar services. Some deviations, where multiple classifications appear within a cluster, are likely due to temporary conditions we could not fully exclude: for instance, an individual categorized under CareType II may temporarily receive services typical of CareType III/A (e.g., due to a broken arm), while their care type classification remains unchanged. Despite this, the care types differentiate well enough across clusters, leading us to use the care type as one of the key attributes for further service predictions.

In the clustering process, we excluded CareType IIII, because this group is characterized by highly diverse healthcare needs due to specific diseases, and experts advised us to omit it from this part of the analysis.

Figure 2: Clustering of patients closely aligns with pre-existing care type assignments, ranging from minimal personal assistance (CareType I) to moderate assistance (CareType II), and full or intensive personal assistance (CareTypes III/A and III/B) for those with more severe care needs.

4.2 Recommender System

To recommend the required services, we constructed the training dataset using a detailed log of care actions performed over a 6-month period. For each patient, the data was divided into consecutive 4-hour time windows. In each window, we examined whether specific care actions were performed, marking them as "positive" if they occurred within that time frame. This granular approach allowed us to capture the temporal dynamics of service delivery, ensuring that for each time window we had a clear record of the services provided. As a result, we generated over 1,000 labeled examples per patient, with each example representing a specific time window and its associated care actions. This enabled the model to learn patterns in service requirements throughout the day and week.

To identify the best predictive model, we evaluated various classification algorithms, including Random Forest, Decision Tree, K-Neighbors, Support Vector Classifier (SVC), Gradient Boosting, and Naive Bayes. Each model was trained using a multi-output classification approach, with features including the frequency of the top services provided and the relevant time attributes. To ensure robust model evaluation, we implemented 5-fold cross-validation and subsequently tested the models on the sixth month's data to assess their predictive performance.
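The following is a minimal sketch of this per-patient setup, assuming one patient's log with 'timestamp' and 'service' columns; the column names, the service list, and the choice of k are illustrative assumptions, not the authors' exact configuration.

```python
import pandas as pd
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier

TOP_SERVICES = ["breakfast", "lunch", "blood pressure"]  # top-k task types

def windowed_dataset(log: pd.DataFrame) -> pd.DataFrame:
    """Turn one patient's service log into 4-hour-window training rows."""
    log = log.set_index("timestamp").sort_index()
    windows = log.groupby(pd.Grouper(freq="4h"))["service"].agg(set)
    df = pd.DataFrame(index=windows.index)
    df["hour"] = df.index.hour                       # calendar features
    df["day_of_week"] = df.index.dayofweek
    df["week_of_month"] = (df.index.day - 1) // 7 + 1
    df["month"] = df.index.month
    for s in TOP_SERVICES:                           # one binary target each
        df[s] = windows.apply(
            lambda done: int(isinstance(done, set) and s in done)).values
    return df

def train(df: pd.DataFrame):
    X = df[["hour", "day_of_week", "week_of_month", "month"]]
    y = df[TOP_SERVICES]
    model = MultiOutputClassifier(KNeighborsClassifier(n_neighbors=5))
    return model.fit(X, y)
```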
4.3 Speech Recognition and Information Extraction

Due to the limited availability of training data, only the information-extraction model based on Llama3 was fine-tuned, using few-shot LoRA (low-rank adapter) supervised training. The diarization and transcription models are used unchanged.

The diarization model used is speaker-diarization [4]. Initial experiments with few-shot LoRA fine-tuning [3] did not improve the performance, hinting at the need for a larger training dataset. The model's performance is satisfactory at the task of segmentation, but less so at identifying which segments belong to which speaker, especially for longer conversations. For a two-speaker situation, the model seems to assume the speakers take turns speaking, which causes mistakes when a single speaker pauses before continuing to speak.

The transcription model used is Whisper [8]. The model transcribes each segment separately. As mentioned above, the speakers are not robustly recognised, and we cannot reliably assign a speaker to each line of text. Still, labeling each line of text even with an ambiguous label improves the downstream task of information extraction. The transcribed lines of text are concatenated, and at the start of each utterance a label marking it as such is added. Thus, the transcribed text resembles a play with unknown characters speaking.

The information-extraction model is Llama3 [2], fine-tuned using LoRA few-shot fine-tuning. Our approach was to fine-tune the model for the task of extracting information about specific care actions and generating the output in JSON format, providing structured data directly. A small training dataset was prepared as described in Section 3.

5 Results and Discussion

5.1 LLM-Based Information Retrieval Model

The Llama3-based information-extraction model was evaluated using 5-fold cross-validation, achieving 95% accuracy, 97% precision, and 94% recall. For evaluation, the model's JSON-formatted strings were deserialized into objects and tested against the known correct objects, so that the results could be interpreted as multi-label binary classification.

The LLM information-extraction model sometimes generates invalid JSON after fine-tuning, most commonly due to duplicated keys or to getting stuck in a loop, generating the same elements until the maximum output size is reached. The generated strings are therefore post-processed to fix these mistakes via simple string manipulation; however, this indicates that experiments with different output formats, or with avoiding free-form generation of the answers, should be performed.

The whole ASR pipeline, including diarization and transcription, has not yet been evaluated and falls within the scope of future work.
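The exact post-processing is not specified in the paper, so the following is only an illustrative sketch of this kind of string-level repair, assuming the desired output is a single JSON object: it takes the first complete JSON value from the generation (discarding a looping tail) and collapses duplicated keys.

```python
import json

def repair_llm_json(text: str):
    """Best-effort recovery of one JSON object from an LLM generation."""
    # dict(pairs) silently collapses duplicated keys (later ones win).
    decoder = json.JSONDecoder(object_pairs_hook=lambda pairs: dict(pairs))
    start = text.find("{")
    if start == -1:
        return None
    try:
        # raw_decode stops after the first valid value, ignoring any
        # repeated elements the model appended while stuck in a loop.
        obj, _ = decoder.raw_decode(text[start:])
        return obj
    except json.JSONDecodeError:
        return None

print(repair_llm_json('{"tasks": ["lunch"], "tasks": ["lunch"]} {"tasks"'))
# -> {'tasks': ['lunch']}
```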
5.2 Recommender System

Tables 1 and 3 present the classification results. Table 1 reports the average performance across all patients, including standard deviations, for the different models, while Table 3 shows classification accuracy by care type, with averages and standard deviations across all patients within each care type, based on the model with the best results, which in this case is K-Neighbors (KNN).

Results are reported in two ways: Tables 1 and 3 show accuracy over all target attributes, counting a prediction as correct only when all targets are predicted correctly, while Table 2 shows the average of the accuracies for each target attribute.

Table 1: Cross-validation and test accuracy (mean ± standard deviation) across all patients for various classification models.

Model             CV Accuracy   Test Accuracy
RandomForest      0.71 ± 0.14   0.66 ± 0.16
DecisionTree      0.65 ± 0.16   0.66 ± 0.16
KNeighbors        0.73 ± 0.13   0.71 ± 0.16
SVC               0.63 ± 0.12   0.63 ± 0.14
GradientBoosting  0.68 ± 0.12   0.66 ± 0.15
NaiveBayes        0.57 ± 0.17   0.55 ± 0.20

The K-Neighbors (KNN) classifier outperformed the other models, achieving an average CV accuracy of 73%, a test accuracy of 71%, and an R2 score of 0.44. This made it the most effective model for predicting service plans. Random Forest also performed reasonably well, achieving a test accuracy of 66%, though it did not surpass KNN in overall performance.

Table 2: Majority class percentage and task-wise test accuracy (mean ± standard deviation) across all patients for various classification models.

Model             Majority Class Percentage   Task-wise Accuracy
RandomForest      0.72 ± 0.19                 0.89 ± 0.10
DecisionTree      0.72 ± 0.19                 0.89 ± 0.11
KNeighbors        0.72 ± 0.19                 0.91 ± 0.10
SVC               0.65 ± 0.16                 0.88 ± 0.10
GradientBoosting  0.65 ± 0.16                 0.89 ± 0.09
NaiveBayes        0.72 ± 0.19                 0.85 ± 0.15

Table 3: Classification performance of K-Neighbors (KNN) by care type, showing cross-validation and test accuracy (mean ± standard deviation), averaged across all patients within each care type.

CareType        CV Accuracy   Test Accuracy
CareType I      0.79 ± 0.12   0.76 ± 0.16
CareType II     0.79 ± 0.11   0.78 ± 0.13
CareType III/A  0.68 ± 0.13   0.66 ± 0.15
CareType III/B  0.70 ± 0.14   0.68 ± 0.17
CareType IIII   0.68 ± 0.10   0.67 ± 0.12

Since all predictive accuracy values exceed the 70% majority-class baseline, this is an excellent result. In multi-label classification, where multiple services are predicted simultaneously, it is important to focus not only on the overall accuracy but also on the accuracy of each individual task. By achieving 90% accuracy on the most common tasks, the model ensures that key services are reliably predicted.

The lower test accuracy compared to cross-validation can be explained by temporal changes in patient conditions, as the test set only included the last month of data. As patient care needs shift over time, predicting long-term patterns is more challenging than shorter-term cross-validation, where care remains more stable.

The test accuracy also reflected noticeable differences across care types. CareType I and CareType II showed higher accuracy rates, while more complex types, such as CareType III/A, III/B, and IIII, exhibited a drop in accuracy of around 10%. This is likely due to the more diverse and unpredictable care needs in these groups, making service prediction more challenging.

This approach, particularly with the strong performance of our K-Neighbors (KNN) model, demonstrated the potential of machine learning to enhance personalized planning in healthcare. In future work, including additional patient-specific features beyond time-based data, such as health-related attributes, could further improve accuracy, particularly for the more complex care types.
6 Conclusions

This is early work, and further improvements are underway. The whole ASR pipeline needs to be evaluated, and we expect noticeably worse performance compared to the information-extraction model alone, due to the larger complexity and the possibility of failure at each step. The information-retrieval model itself is not inefficient in terms of the computational time and memory required, but the diarization and transcription steps are. The required-service prediction should also be further improved; with the current dataset, an alternative approach that may improve performance is the use of sequence-modelling or event-prediction approaches. Finally, the two models could work in tandem: predicting the required actions and using that information in the ASR pipeline could be beneficial.

Based on the proof-of-concept study, we conclude that the suggested approach is in principle feasible and can be beneficial to healthcare providers. However, in view of regulations, special caution has to be paid during the implementation of any such system in a real-world setting. Recording and diarizing conversations between healthcare staff and patients is likely to include highly personal data, which falls under the relevant EU legislation, specifically the GDPR (General Data Protection Regulation, https://gdpr-info.eu/) and the EU AI Act (Regulation (EU) 2024/1689, https://artificialintelligenceact.eu/the-act/). Furthermore, indiscriminately recording conversations and feeding them into an LLM will likely be considered "high risk" in view of the AI Act. This means that implementing such services will require extensive screening, documentation, a clear division of ownership and access roles, and other compliance with legal requirements.

Acknowledgements

We thank the healthcare provider organization for the dataset and for insightful discussions.

References

[1] Ayo Adedeji, Sarita Joshi, and Brendan Doohan. 2024. The sound of healthcare: improving medical transcription ASR accuracy with large language models. arXiv preprint arXiv:2402.07658.
[2] AI@Meta. 2024. Llama 3 model card. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[3] Shamil Ayupov and Nadezhda Chirkova. 2022. Parameter-efficient finetuning of transformers for source code. arXiv, abs/2212.05901. https://api.semanticscholar.org/CorpusID:254564456.
[4] Hervé Bredin and Antoine Laurent. 2021. End-to-end speaker segmentation for overlap-aware resegmentation. In Proc. Interspeech 2021, Brno, Czech Republic (Aug. 2021).
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805. http://arxiv.org/abs/1810.04805.
[6] Senay A. Gebreab, Khaled Salah, Raja Jayaraman, Muhammad Habib ur Rehman, and Samer Ellaham. 2024. LLM-based framework for administrative task automation in healthcare. In 2024 12th International Symposium on Digital Forensics and Security (ISDFS), 1–7. doi: 10.1109/ISDFS60797.2024.10527275.
[7] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. doi: 10.48550/ARXIV.2212.04356.
[8] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. doi: 10.48550/ARXIV.2212.04356.
[9] Prakasam S, N. Balakrishnan, Kirthickram T R, Ajith Jerom B, and Deepak S. 2023. Design and development of AI-powered healthcare WhatsApp chatbot. In 2023 2nd International Conference on Vision Towards Emerging Trends in Communication and Networking Technologies (ViTECoN), 1–6. https://api.semanticscholar.org/CorpusID:259280109.
[10] Raja Vavekanand, Pinja Karttunen, Yue Xu, Stephanie Milani, and Huao Li. 2024. Large language models in healthcare decision support: a review.
[11] Markus Vogel, Wolfgang Kaisers, Ralf Wassmuth, and Ertan Mayatepek. 2015. Analysis of documentation speed using web-based medical speech recognition technology: randomized controlled trial. Journal of Medical Internet Research, 17, 11, e247.
Performance Comparison of Axle Weight Prediction Algorithms on Time-Series Data

Žiga Kolar (ziga.kolar@ijs.si), Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
David Susič (david.susic@ijs.si), Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
Martin Konečnik (martin.konecnik@cestel.si), Cestel Cestni Inženiring d.o.o., Špruha 32, Trzin, Slovenia
Domen Prestor (domen.prestor@cestel.si), Cestel Cestni Inženiring d.o.o., Špruha 32, Trzin, Slovenia
Tomo Pejanovič Nosaka (tomo.pejanovic@cestel.si), Cestel Cestni Inženiring d.o.o., Špruha 32, Trzin, Slovenia
Bajko Kulauzović (bajko@cestel.si), Cestel Cestni Inženiring d.o.o., Špruha 32, Trzin, Slovenia
Jan Kalin (jan.kalin@zag.si), Zavod za gradbeništvo Slovenije, Dimičeva ulica 12, Ljubljana, Slovenia
Matjaž Skobir (matjaz.skobir@cestel.si), Cestel Cestni Inženiring d.o.o., Špruha 32, Trzin, Slovenia
Matjaž Gams (matjaz.gams@ijs.si), Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.scai.4752

Abstract

Accurate vehicle axle weight estimation is essential for the maintenance and safety of transportation infrastructure. This study evaluates and compares the performance of various algorithms for axle weight prediction using time-series data. The algorithms assessed include traditional machine learning models (e.g., random forest) and advanced deep learning techniques (e.g., convolutional neural networks). The evaluation utilized datasets comprising time-series data from 10 sensors positioned on a single lane of a bridge, with the goal of predicting each vehicle's axle weights based on the signals from these sensors. Each algorithm's performance was measured against the OIML R-134 recommendation, where a prediction was classified as accurate if the error was within ±4 percent for two-axle vehicles and within ±8 percent for vehicles with more than two axles. Tests were conducted on several bridges, with this paper presenting detailed results from the Lopata bridge. The findings indicate that deep learning models, particularly convolutional neural networks, significantly outperform traditional methods in terms of accuracy and their ability to adapt to complex patterns in time-series data.
This study provides a valuable reference for researchers and practitioners aiming to enhance axle weight prediction systems, thereby contributing to more effective infrastructure management and safety monitoring.

Keywords

time-series data, axle weight, machine learning, neural network

1 Introduction

Accurate axle weight prediction plays a pivotal role in the maintenance and safety of transportation infrastructure [7]. The precise estimation of axle weights is essential for various applications, including road maintenance planning, traffic management, and the prevention of overloading, which can lead to premature road wear and increased accident risks [8]. Traditional methods for axle weight measurement often rely on static scales or weigh-in-motion (WIM) systems. While these methods provide direct measurements, they are susceptible to limitations such as high installation and maintenance costs, potential measurement inaccuracies due to environmental factors, and the need for frequent calibration.

In recent years, the advent of advanced computational techniques has opened new avenues for improving axle weight prediction. Machine learning (ML) and deep learning (DL) algorithms, in particular, offer promising alternatives by leveraging time-series data to model the complex, non-linear relationships inherent in vehicular weight patterns. These methods can enhance prediction accuracy, handle large volumes of data, and adapt to varying conditions, making them suitable for real-world applications where traditional methods may fall short.

This study systematically evaluates and compares the performance of various axle weight prediction algorithms using time-series data. We focus on a diverse set of algorithms, including machine learning models like random forests (RF) [6] and advanced deep learning techniques such as convolutional neural networks (CNN) [4].

The objective of this research is to explore the potential of combining traditional WIM systems with advanced ML and DL models to enhance axle weight predictions. By comparing the performance of different methodologies, including the SIWIM traditional model, a random forest (IJS RF), a hybrid approach (AVERAGE(IJS, SIWIM traditional)), and a CNN-based model, this study aims to identify the most effective strategies for accurate and reliable axle weight estimation. Additionally, it examines the impact of synthetic data generation on the performance of these models, providing a comprehensive evaluation of their practical applicability in real-world scenarios.

The study aimed to predict the axle weights of vehicles using ten input signals from sensors placed under the Lopata bridge. Each predictive algorithm's performance was evaluated according to the OIML R-134 recommendation, under which a prediction is deemed accurate if the error margin for the axle weight is within ±4% for vehicles with two axles and within ±8% for vehicles with more than two axles.

The dataset comprised 1478 samples, i.e., passings of a vehicle, each containing 10 signals per vehicle. For each sample, a static weight for each axle was assigned as the target value. Static weight refers to the weight measured by a scale when the vehicle is stationary.

This paper is structured as follows. Section 2 reviews several state-of-the-art approaches. Section 3 details the preprocessing steps necessary before applying machine learning methods. In Section 4, the algorithms used for predicting axle weights are presented. Section 5 presents the final results of the axle weight predictions. Finally, Section 6 summarizes the findings and proposes ideas for future research.

2 Related Work

The prediction of axle weights using time-series data has often been studied in recent years, resulting in a substantial body of related work. Below, several state-of-the-art (SOTA) approaches are described.

Zhou et al. [10] differentiated between high-speed and low-speed weigh-in-motion (WIM) systems and analyzed the characteristics of axle weight signals. They proposed a nonlinear curve-fitting algorithm, detailing its implementation. Numerical simulations and field experiments assessed the method's performance, demonstrating its effectiveness with maximum weighing errors for the front axle, rear axle, and gross weights recorded at 4.01%, 5.24%, and 3.92%, respectively, at speeds of 15 km/h or lower.

Wu et al. [8] introduced a modified encoder-decoder architecture with a signal-reconstruction layer to identify vehicle properties (velocity, wheelbase, axle weight) using the bridge's dynamic response. This unsupervised encoder-decoder method extracts higher-level features from the original data. A numerical bridge model based on vehicle-bridge coupling vibration theory demonstrated the method's applicability. The results indicated that the proposed approach accurately predicts traffic loads without additional sensors or vehicle weight labels, achieving better stability and reliability even with significant data pollution.
Xu et al. [9] applied a wavelet transform for denoising and reconstructing the WIM signal, and used a back-propagation (BP) neural network optimized by the brain storm optimization (BSO) algorithm to process the WIM signal. Comparing the predictive abilities of BP neural networks optimized by different algorithms, they found the BSO-BP WIM model to exhibit fast convergence and high accuracy, with a maximum gross weight relative error of 1.41% and a maximum axle weight relative error of 6.69%.

Kim et al. [5] developed signal analysis algorithms using artificial neural networks (ANN) for bridge weigh-in-motion (B-WIM) systems. Their procedure involved extracting information on vehicle weight, speed, and axle count from time-domain strain data. ANNs were selected for their effectiveness in incorporating dynamic effects and bridge-vehicle interactions. Vehicle experiments with various load cases were conducted on two bridge types: a simply supported pre-stressed concrete girder bridge and a cable-stayed bridge. High-speed and low-speed WIM systems were used to cross-check and validate the algorithms' performance.

Bosso et al. [1] proposed a method using weigh-in-motion (WIM) data and regression trees to identify patterns in overloaded truck weights and travel. The analysis reveals that truck type is the key predictor of overloading, while the time of day is crucial for axle overloading, with most incidents occurring late at night or early in the morning. These insights can enhance enforcement strategies and inform pavement management and design, optimizing infrastructure longevity and safety.

He et al. [2] introduced a new method that uses only the flexural strain signals from weighing sensors to identify axle spacing and weights, reducing installation costs and expanding BWIM applications. The method's accuracy is validated through numerical simulations and laboratory experiments with a scaled vehicle-bridge interaction model, showing promising results for accurate axle spacing and weight identification.

3 Data Preprocessing

Before applying various algorithms to the dataset, several preprocessing steps were necessary. Due to the differing lengths of the signals from each sample, padding was performed to standardize them to the length of the longest signal. Samples with a gross weight below 5 kN were excluded from both the training and test datasets. Each signal was cropped by removing data to the left of the leftmost peak value minus 100 and to the right of the rightmost peak value plus 100; the peak values were calculated in advance.

To address the limited availability of data required for deep learning, which typically necessitates tens of thousands of samples for effective training, synthetic data generation was employed. The original dataset comprised 1,478 samples (from January 2022 to December 2023), i.e., passings of a vehicle, each containing 10 signals per vehicle. An additional 20,000 synthetic samples were generated with the following algorithm: in each of 20,000 iterations, a random training sample and a random strain factor were selected, where the strain factor is a random value between 0.5 and 0.99. The signals of the selected training sample were then scaled by the chosen strain factor. This scaling models the property that doubling the amplitude of the signal corresponds to doubling the weight.
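A minimal sketch of this augmentation loop follows, assuming the signals are stored as an array of shape (n_samples, 10, length) and that the target axle weights are scaled by the same strain factor, which is how we read the stated amplitude-weight proportionality; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def synthesize(signals, axle_weights, n_new=20_000):
    """Generate synthetic samples by amplitude scaling.

    signals: array (n_samples, 10, length); axle_weights: per-axle targets.
    """
    synth_x, synth_y = [], []
    for _ in range(n_new):
        i = rng.integers(len(signals))      # pick a random training sample
        factor = rng.uniform(0.5, 0.99)     # random strain factor
        # Scaling the amplitude by `factor` models a vehicle whose axle
        # weights are scaled by the same factor.
        synth_x.append(signals[i] * factor)
        synth_y.append(axle_weights[i] * factor)
    return np.stack(synth_x), np.stack(synth_y)
```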
A crucial aspect of the data preprocessing involved the normalization of the sensor signals to ensure uniformity across the dataset. Each signal was normalized to have a mean of zero and a standard deviation of one, which helps improve the convergence of machine learning algorithms by ensuring that each feature contributes equally to the learning process.

The selection of training and test data was conducted using a rolling-window approach [3]. Specifically, for each testing month, the training data comprised all available data up to, but not including, the testing month. For instance, if May 2023 was designated as the testing month, the training dataset consisted of data from January 2022 through April 2023. This process was systematically repeated for each testing month from March 2022 to December 2023.

4 Methodology

Four methods were identified as applicable for predicting vehicle axle weights. The first method, known as SIWIM traditional [11], calculated the number of axles, the axle lengths, and the axle weights by utilizing influence lines to model the signal and determine the correct output. For validation purposes, each predicted output was stored in a separate file alongside the signal data, enabling direct comparison with the actual values.

The second method used a random forest [6] (named IJS RF) to predict the vehicle axle weights. The model relied on accurately identifying the positions of the peaks to function correctly. Peak values were determined using the find_peaks method from the SciPy library, which identifies peaks based on a specified minimum height. Once the peaks were identified, the algorithm extracted the values within a ±5 range of each peak. These extracted values were then used as input features for the random forest model. Additionally, the random forest model incorporated temperature, axle distances, and gross weight as input features. Random forest algorithms are not inherently suited to time-series data; however, they perform effectively with numerical data such as temperature, axle distance, and gross weight, which is why this algorithm was chosen for this type of input.
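A minimal sketch of the peak-window feature extraction follows, using SciPy's find_peaks as named in the text; the minimum height and the way the auxiliary features are appended are illustrative assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def peak_features(signal, min_height=0.1, half_window=5):
    """Collect signal values within +/-5 samples of each detected peak."""
    peaks, _ = find_peaks(signal, height=min_height)
    feats = []
    for p in peaks:
        lo = max(p - half_window, 0)
        hi = min(p + half_window + 1, len(signal))
        feats.extend(signal[lo:hi])
    return np.array(feats)

# The auxiliary numeric features are appended to the peak-window values:
# features = np.concatenate(
#     [peak_features(sig), [temperature, gross_weight, *axle_distances]])
```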
The third method integrated the first two approaches by averaging the outputs of the SIWIM traditional and IJS RF models (named AVERAGE(IJS, SIWIM traditional)). This approach is motivated by the observation that combining multiple models can often yield more accurate results than relying on a single model alone [12].

The final method employed a convolutional neural network (CNN) to predict axle weights. The CNN utilized synthetic data, as detailed in Section 3, during the training phase. This method processed all 10 signals as input to calculate the axle weights. The detailed architecture of the CNN is shown in Figure 1. 2D convolutional layers (Conv2D) were used instead of 1D convolutional layers because the input data consist of 10 sensor signals. The number of filters and the kernel size are specified within the parentheses of each Conv2D layer, while the pooling size is defined in the parentheses of each 2D max-pooling layer (MaxPooling2D). The last dense layer has 100 neurons. To mitigate overfitting, a dropout layer was added after the final dense layer, and batch normalization was applied after each 2D convolutional layer.

Although long short-term memory (LSTM) and gated recurrent unit (GRU) neural networks could be used for this task, a CNN was chosen instead because of its strengths in capturing spatial hierarchies and local patterns within the data. CNNs are highly effective at extracting local features and detecting patterns, while LSTMs and GRUs are better suited to handling temporal dependencies, which are less relevant to this specific task.

Figure 1: Architecture of the CNN for predicting axle weights.
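Since the exact filter counts and kernel sizes are only given in Figure 1, the following Keras sketch reproduces just the general shape of the described network (Conv2D plus batch-normalization blocks, 2D max pooling, a 100-neuron dense layer, and dropout); the specific layer sizes, the signal length, and the fixed number of axle outputs are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

SIGNAL_LEN, N_SENSORS, MAX_AXLES = 512, 10, 5   # illustrative values

model = keras.Sequential([
    layers.Input(shape=(N_SENSORS, SIGNAL_LEN, 1)),
    layers.Conv2D(32, (3, 7), padding="same", activation="relu"),
    layers.BatchNormalization(),                 # after each Conv2D block
    layers.MaxPooling2D((1, 2)),                 # pool along the time axis
    layers.Conv2D(64, (3, 5), padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((1, 2)),
    layers.Flatten(),
    layers.Dense(100, activation="relu"),        # 100-neuron dense layer
    layers.Dropout(0.5),                         # dropout after final Dense
    layers.Dense(MAX_AXLES),                     # one regression output per axle
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```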
5 Results

The results of each method described in Section 4 are illustrated in Figure 2. Among the methods evaluated, SIWIM traditional exhibited the poorest performance, with fluctuating trends observed throughout the entire two-year period. The CNN began to outperform the other three approaches after December 2022. Conversely, the AVERAGE(IJS, SIWIM traditional) method showed superior performance during the initial testing months, from March 2022 to June 2022.

Figure 2: Accuracies of all algorithms for each testing month.

The performance of the CNN improved with an increasing amount of data, whereas the IJS RF and AVERAGE(IJS, SIWIM traditional) methods were more effective during the initial phase, when less training data was available. However, the improvement in the CNN's accuracy was not linear. This non-linear trend can be attributed to the random initialization of the CNN's weights before each training session, which occasionally leads to suboptimal convergence.

An additional analysis compared the performance of the models under varying environmental conditions, such as temperature fluctuations and differing traffic patterns. This analysis revealed that the CNN model maintained its accuracy more consistently across different conditions, indicating its robustness and adaptability. Furthermore, the inclusion of synthetic data in training the CNN model contributed to its superior performance, as it allowed the model to learn from a more diverse set of examples. Future research should focus on expanding the range of synthetic data and exploring additional ensemble techniques to further enhance prediction accuracy.

Despite the high accuracy of the CNN model, with the highest accuracy reaching 0.94, this most accurate method still falls short of meeting the OIML R-134 recommendation, by 4.4%. Furthermore, the results show that more static data may be needed for the learning phase: 1,000 static samples, even augmented, might not be sufficient to reach the OIML R-134 recommendation.

In summary, the results indicate that while traditional methods such as IJS RF and AVERAGE(IJS, SIWIM traditional) perform well with limited data, convolutional neural networks (CNNs) demonstrate superior performance as more data becomes available, despite some variability in their convergence. In addition, a sufficient number of training examples is needed to approach the desired OIML R-134 recommendation.

6 Conclusion and Discussion

In this study, a performance comparison of various axle weight prediction algorithms was conducted using time-series data collected from 10 sensors positioned on the Lopata bridge. The algorithms evaluated encompassed traditional machine learning models, such as random forests, and advanced deep learning techniques, notably convolutional neural networks.

The major findings reveal that CNNs achieved significantly better results in predicting axle weights during the latter months of the experiment. The CNNs' ability to adapt to and learn from complex patterns within the time-series data was a key factor in their superior performance. Despite achieving a peak accuracy of 0.94, the CNN model still falls short of meeting the OIML R-134 recommendation by 4.4%.

Overall, there are three implications of this study. First, it demonstrates the potential of deep learning techniques to enhance the accuracy of axle weight predictions where sufficient data is available, thereby facilitating more reliable infrastructure management. Second, for smaller datasets, it is more effective to use classical machine learning systems in combination with methods like SIWIM traditional. Third, it provides a valuable benchmark for researchers and practitioners, guiding the development and implementation of more effective axle weight prediction systems.

To achieve the OIML R-134 recommendation, two options are possible:

• Add more data: if the trend continues, adding another half a year of measurements would enable achieving the standard. Another option would be to take measurements on a bridge with more traffic.
• Improve the methods by incorporating advanced ensemble techniques.
To introduce the ensemble approaches, one potential improvement involves modeling each sensor individually. This approach entails building a separate CNN model for each of the ten sensors, allowing for more specialized and potentially more accurate predictions from each sensor's data. By focusing on the unique characteristics and data patterns of each sensor, the models can be better tailored to capture the specific nuances in the time-series data.

After developing individual models for each sensor, the next step would be to combine the predictions of these models into a single final prediction. This can be achieved using an ensemble method, such as a random forest, which would take the ten individual predictions (one from each sensor model) as input features and produce a consolidated final axle weight prediction.

This method not only holds the potential to improve the accuracy and robustness of the axle weight predictions but also provides a scalable framework that can be adapted to different datasets and sensor configurations. Future work should explore the implementation of this approach, including the optimization of the individual sensor models and the integration of their predictions through an ensemble method.

By advancing the CNN model in this manner, it is anticipated that the performance gap relative to the OIML R-134 recommendation could be further reduced, bringing the predictions closer to the required accuracy levels with a smaller amount of data and enhancing the overall efficacy of the axle weight prediction system.

Acknowledgements

This study received funding from the company Cestel. The authors acknowledge funding from the Slovenian Research and Innovation Agency (ARIS), Grant PR-10495, and basic core funding P2-0209. The authors made use of ChatGPT to assist with this article; it was employed as a tool for enhancing the language of the initial draft, without altering the length of the text. ChatGPT-4 was accessed at chatgpt.com and used with modification in July 2024.

References

[1] Mariana Bosso, Kamilla L Vasconcelos, Linda Lee Ho, and Liedi LB Bernucci. 2020. Use of regression trees to predict overweight trucks from historical weigh-in-motion data. Journal of Traffic and Transportation Engineering (English Edition), 7, 6, 843–859.
[2] Wei He, Tianyang Ling, Eugene J OBrien, and Lu Deng. 2019. Virtual axle method for bridge weigh-in-motion systems requiring no axle detector. Journal of Bridge Engineering, 24, 9, 04019086.
[3] Hamed Kalhori, Mehrisadat Makki Alamdari, Xinqun Zhu, Bijan Samali, and Samir Mustapha. 2017. Non-intrusive schemes for speed and axle identification in bridge weigh-in-motion systems. Measurement Science and Technology, 28, 2, 025102.
[4] Teja Kattenborn, Jens Leitloff, Felix Schiefer, and Stefan Hinz. 2021. Review on convolutional neural networks (CNN) in vegetation remote sensing. ISPRS Journal of Photogrammetry and Remote Sensing, 173, 24–49.
[5] Sungkon Kim, Jungwhee Lee, Min-Seok Park, and Byung-Wan Jo. 2009. Vehicle signal analysis using artificial neural networks for a bridge weigh-in-motion system. Sensors, 9, 10, 7943–7956.
[6] Steven J Rigatti. 2017. Random forest. Journal of Insurance Medicine, 47, 1, 31–39.
[7] Mohhammad Sujon and Fei Dai. 2021. Application of weigh-in-motion technologies for pavement and bridge response monitoring: state-of-the-art review. Automation in Construction, 130, 103844.
[8] Yuhan Wu, Lu Deng, and Wei He. 2020. BwimNet: a novel method for identifying moving vehicles utilizing a modified encoder-decoder architecture. Sensors, 20, 24, 7170.
[9] Suan Xu, Xing Chen, Yaqiong Fu, Hongwei Xu, and Kaixing Hong. 2022. Research on weigh-in-motion algorithm of vehicles based on BSO-BP. Sensors, 22, 6, 2109.
[10] ZF Zhou, P Cai, and RX Chen. 2007. Estimating the axle weight of vehicle in motion based on nonlinear curve-fitting. IET Science, Measurement & Technology, 1, 4, 185–190.
[11] A Žnidarič, J Kalin, M Kreslin, M Mavrič, et al. 2016. Recent advances in bridge WIM technology. In Proc. 7th International Conference on WIM.
[12] Hui Zou and Yuhong Yang. 2004. Combining time series models for forecasting. International Journal of Forecasting, 20, 1, 69–84.
Comparison of Feature- and Embedding-based Approaches for Audio and Visual Emotion Classification

Sebastijan Trojer (st5804@student.uni-lj.si), Jožef Stefan Institute and Faculty of Computer and Information Science, Ljubljana, Slovenia
Zoja Anžur (zoja.anzur@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Mitja Luštrek (mitja.lustrek@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Gašper Slapničar (gasper.slapnicar@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.scai.6883

Abstract

This paper presents a comparative analysis of feature- and embedding-based approaches for audio-visual emotion classification. We compared the performance of traditional handcrafted features (MediaPipe for visual features and Mel-frequency cepstral coefficients (MFCCs) for audio features) against neural network (NN)-based embeddings obtained from pretrained models suitable for emotion recognition (ER). The study employs separate uni-modal datasets for the audio and visual modalities to rigorously assess the performance of each feature set on each modality. The results demonstrate that, in the case of visual data, NN-based embeddings significantly outperform handcrafted features in terms of accuracy and F1 score when training a traditional classifier; for audio data, however, the performance is similar across all feature sets. Handcrafted features, such as facial blendshapes computed from MediaPipe keypoints and MFCCs, remain relevant in resource-constrained settings due to their lower computational demands. This research provides insights into the trade-offs between traditional feature extraction methods and modern deep learning techniques, offering guidance for the development of future emotion classification systems.

Keywords

emotion recognition, embeddings, hand-crafted features

1 Introduction

Automated emotion recognition (ER) often focuses on two modalities: video and audio.
This is akin to human emotion recognition, as we heavily rely on audio-visual characteristics, such as facial expressions and audio cues, to deduce emotional state [7]. Both audio and video are relatively simple to obtain, as the required sensors are unobtrusive and easily available (e.g., web-cameras), and they can be used to train machine learning (ML) models for emotion recognition.

In the past decade, deep-learning (DL) approaches have achieved state-of-the-art (SOTA) results in many domains, including emotion recognition [16]. However, despite the superior performance of such models, many doubts have been cast on their black-box nature, which lacks explainability and interpretability of the internally derived features [9]. Furthermore, while some research suggests superior performance of embeddings compared to traditional features [20], this is not universally agreed upon [8], especially when taking into account the potentially much higher computational complexity of deriving embeddings with deep artificial neural networks (ANNs).

Our research question is thus whether it is better to compute embeddings using SOTA pretrained DL models instead of using hand-crafted features, as ANN embeddings promise to increase detection accuracy at the cost of interpretability and computational complexity. In this work, we compared the performance of hand-crafted features and embeddings obtained with pretrained SOTA models for the downstream task of emotion recognition. We independently compared ER performance on the audio and video modalities, using established benchmark datasets for each. The hand-crafted features were chosen based on the literature, and the embeddings were computed with task-suitable pretrained models available in existing Python libraries. Both were formatted in a way that allowed us to train a set of traditional ML models, listed in Section 3.3, for ER, using the hand-crafted features, the embeddings, or a union of both as inputs.

2 Related Work

The performance comparison of hand-crafted features and learned embeddings has been discussed in depth in the computer-vision domain. Schonberger et al. [15] demonstrated that hand-crafted features (e.g., SIFT) still perform on par with or better than learned embeddings in image reconstruction. They warned of high variance across datasets when using learned embeddings as features. Similarly, Antipov et al. [2] reported similar performance of hand-crafted features (e.g., HOG) and learned embeddings when classifying pedestrian gender from images using small datasets. They also highlighted the superior generalization performance of embeddings across (unseen) datasets. In emotion recognition from audio, Papakostas et al. [13] compared hand-crafted MFCC-based features with embeddings from a custom convolutional neural network (CNN) trained on spectrograms. The latter slightly outperformed the hand-crafted features, by 1% on average in terms of F1 score, again showing similar performance. Ye et al. [21] recently showed that using a union of both hand-crafted features and learned embeddings achieves superior performance in user identification, compared to using each input individually.

There is moderate (but not universal) agreement in recent literature that the performance of hand-crafted features and learned embeddings is similar; however, most work comparing their performance is limited to a single modality or task. We compared performance between two different modalities for the task of ER and investigated potential performance improvements from feature-level fusion (hand-crafted + embeddings).

3 Methodology

Our task consisted of two parts: hand-crafted feature and embedding computation, and ER model training for classification. Both were done on the (separate) audio and visual modalities and will be described per modality in the following sections.

3.1 Datasets

As mentioned previously, the ER task is most often audio-visual, so we decided to use an audio and a visual dataset to independently evaluate the performance of the different feature sets. While many datasets exist that contain both modalities, they often suffer from imprecise, coarse emotion labelling [18], as the labels are video-based, while emotions can be exhibited and changed in much shorter windows. Splitting a video into frames yields a large number of (different) instances with the same label, so we wanted datasets with individual labels. As our focus was on comparing the performance of hand-crafted and embedding-based features, we chose two well-established benchmark datasets dedicated to audio and visual emotion classification. These datasets contain short audio clips and individual images with precise short-term and per-frame labels, circumventing the mentioned per-video labelling problem.

3.1.1 Audio Dataset. For the evaluation on audio data, we decided to use the crowd-sourced emotional multimodal actors dataset (CREMA-D) [4]. It contains short clips of 91 actors between the ages of 20 and 74, coming from a variety of races and ethnicities, who exhibited six different emotions (Anger, Disgust, Fear, Happy, Neutral, Sad). Each actor produced about 80 clips (with small variation), saying specific sentences while exhibiting different emotions. The distribution of labels was balanced, with each class representing approx. 16% of the data. The intended emotions were verified with 2,443 crowd-sourced human raters as a baseline. These raters predicted emotions based on audio only, video only, or both, achieving 40.9%, 58.2%, and 63.6% recognition of the intended (acted) emotion, respectively.

3.1.2 Visual Dataset. For visual data, we chose the extended Cohn-Kanade dataset (CK+) [11], a staple dataset in ER evaluation from facial expressions. It contains images of 118 adults, aged between 18 and 50, again of different ethnicities. Participants were instructed to perform a series of 23 facial displays relating to one of seven emotions (Anger, Contempt, Disgust, Fear, Happy, Sad, Surprise). The distribution of classes in CK+ is not balanced: Surprise is the majority class at 25% and Contempt the minority class at 6%, with the others in between. This distribution also changes between subjects. As part of preprocessing, the CK+ images were reshaped to 48x48 pixels, converted to grayscale, and cropped using a frontal-face Haar cascade classifier [1]. The emotion labels were validated by experts via facial action unit rules (e.g., Happy requires Action Unit 12, the lip corner puller, to be active).

3.2 Feature Computation

For the selection of hand-crafted features, we relied on the literature and previous work in ER for each modality. For the embeddings, on the other hand, we chose SOTA pretrained models trained for related tasks. We extracted embeddings at a model-specific point before the learning layers and formatted them using principal component analysis (PCA) in order to reduce their dimensionality while maintaining the relevant information.

3.2.1 Audio Features. MFCCs are historically well-established in ER from audio [10], as they give a good approximation of the human auditory system's response. For each audio clip, we computed a common set of statistical aggregate features (averages, standard deviations) for MFCCs, Root Mean Square (RMS) energy (volume), Zero-Crossing Rate, Spectral Bandwidth, Spectral Contrast, and Spectral Roll-off, using the librosa Python library.
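A minimal sketch of this feature set using the librosa library named above; the number of MFCCs and the file name are illustrative assumptions.

```python
import librosa
import numpy as np

def audio_features(path):
    """Clip-level means and standard deviations of frame-level descriptors."""
    y, sr = librosa.load(path, sr=None)
    feats = {
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
        "rms": librosa.feature.rms(y=y),
        "zcr": librosa.feature.zero_crossing_rate(y),
        "bandwidth": librosa.feature.spectral_bandwidth(y=y, sr=sr),
        "contrast": librosa.feature.spectral_contrast(y=y, sr=sr),
        "rolloff": librosa.feature.spectral_rolloff(y=y, sr=sr),
    }
    # Aggregate each frame-level descriptor into clip-level statistics.
    return np.concatenate(
        [np.r_[f.mean(axis=1), f.std(axis=1)] for f in feats.values()])

x = audio_features("clip.wav")  # one fixed-length feature vector per clip
```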
3.1.1 Audio Dataset. For evaluation on audio data we used the crowd-sourced emotional multimodal actors dataset (CREMA-D) [4]. It contains short clips of 91 actors between the ages of 20 and 74, coming from a variety of races and ethnicities, who exhibited six different emotions (Anger, Disgust, Fear, Happy, Neutral, Sad). Each actor produced about 80 clips (with small variation), saying specific sentences while exhibiting different emotions. The distribution of labels was balanced, with each class representing approx. 16% of the data. The intended emotions were verified with 2,443 crowd-sourced human raters as a baseline. These raters predicted emotions based on audio only, video only, or both, achieving 40.9%, 58.2% and 63.6% recognition of the intended (acted) emotion, respectively.

3.1.2 Visual Dataset. For visual data we chose the extended Cohn-Kanade dataset (CK+) [11], which is a staple dataset in ER evaluation from facial expressions. It contains images of 118 adults, aged between 18 and 50, again of different ethnicities. Participants were instructed to perform a series of 23 facial displays, relating to one of seven emotions (Anger, Contempt, Disgust, Fear, Happy, Sad, Surprise). The distribution of classes in CK+ is not balanced – Surprise is the majority class at 25% and Contempt the minority class at 6%, with the others in between. This distribution also changes between subjects. As part of preprocessing, CK+ images were reshaped to 48×48 pixels, converted to grayscale and cropped using a frontal-face Haar cascade classifier [1]. The emotion labels were validated by experts via facial action unit rules (e.g., Happy: action unit 12, the lip corner puller, must be active).

3.2 Feature Computation

For the selection of hand-crafted features we relied on the literature and previous work in ER for each modality. For embeddings, on the other hand, we chose SOTA pretrained models trained for related tasks. We extracted embeddings at a model-specific point before the learning layers, and formatted them using principal component analysis (PCA) in order to reduce their dimensionality while maintaining the relevant information.

3.2.1 Audio Features. MFCCs are historically well-established in ER from audio [10], as they give a good approximation of the human auditory system's response. For each audio clip, we computed a common set of statistical aggregate features (averages, standard deviations) for MFCCs, Root Mean Square (RMS) energy (volume), Zero-Crossing Rate, Spectral Bandwidth, Spectral Contrast, and Spectral Roll-off, using the librosa Python library.

For embeddings, we investigated models pretrained on similar audio tasks (e.g., emotion recognition) and used them up to the point where embeddings are available, which typically means the upper part of the ANN architecture, responsible for computing the embeddings that represent the features. Three pretrained models were investigated in our evaluation, all based on the wav2vec2 architecture, a self-supervised model for learning speech representations proposed by Facebook AI Research (FAIR) [3]. The full wav2vec2 pretraining framework comprises a latent feature encoder, a context network using the transformer architecture, a quantization module and a contrastive loss (the pre-training objective). For our purposes the feature encoder is important: a 7-layer 1D CNN reducing the dimensionality of audio inputs into a sequence of feature vectors. The initial model version was pretrained on the LibriSpeech dataset, another version was fine-tuned on the IEMOCAP dataset specifically for ER, and finally a large general cross-lingual model (XLSR) was trained on millions of hours of unlabeled audio data in 53 (later extended) languages [5]. These three variants were used to extract their corresponding embeddings. Since the input data from CREMA-D is of inconsistent shape (varying by < 1 sec), we had to employ an additional adaptive average pooling layer to ensure consistently shaped outputs. We designed this pooling layer so that we lost minimal information (a short segment length for pooling), and the outputs were then flattened. PCA was employed to subsequently reduce them to 10 dimensions. The number of dimensions was chosen arbitrarily and could be changed; however, we believe that 10 dimensions offer a good balance between retained information and computational (and spatial) requirements. Moreover, this number of PCA components is on the same order of magnitude as the number of hand-crafted features, making the two more comparable.
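To make the audio pipeline concrete, the following is a minimal sketch of both feature types. It is illustrative rather than the authors' code: the number of MFCCs (13), the pooled sequence length (64) and the specific Hugging Face checkpoint are our assumptions, as the paper does not state them.

```python
import librosa
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

def handcrafted_audio_features(path):
    """Statistical aggregates of the spectral descriptors listed in Sec. 3.2.1."""
    y, sr = librosa.load(path, sr=16000)
    descriptors = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),   # 13 MFCCs is an assumption
        librosa.feature.rms(y=y),                      # RMS energy (volume)
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
    ]
    # Mean and standard deviation over time for every descriptor row.
    return np.concatenate([np.hstack([d.mean(axis=1), d.std(axis=1)])
                           for d in descriptors])

# LibriSpeech-pretrained wav2vec2 as one of the three investigated variants.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
pool = torch.nn.AdaptiveAvgPool1d(64)  # fixed output length; 64 is an assumption

def wav2vec2_embedding(path):
    """Variable-length clip -> fixed-size flattened embedding."""
    y, sr = librosa.load(path, sr=16000)
    inputs = extractor(y, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state  # (1, T, 768)
    pooled = pool(hidden.transpose(1, 2))                      # (1, 768, 64)
    return pooled.flatten().numpy()

# PCA to 10 components, fitted on training clips only, e.g.:
# from sklearn.decomposition import PCA
# emb = np.stack([wav2vec2_embedding(p) for p in train_paths])
# emb10 = PCA(n_components=10).fit_transform(emb)
```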
3.2.2 Visual Features. For visual features, we focused on the movement of specific facial keypoints, such as the corners of the mouth and the eyebrows, which form the basis of the Facial Action Coding System (FACS) – a taxonomy that categorizes human facial expressions based on muscle movements [6]. We employed the MediaPipe (MP) framework [12] to extract values representing the activation of various facial blendshapes, which correspond approximately to the regions defined in FACS. In this paper, we classify MediaPipe features as "handcrafted" because, despite being neural network-based, they quantify predefined facial areas with human-interpretable metrics. This contrasts with CNN-based embeddings, which capture patterns without direct interpretability.

For comparison, we used embeddings from two pretrained models: FaceNet [17] and EfficientNet [19] from the HSEmotion library [14]. The FaceNet architecture is based on GoogleNet, a variant of deep CNN, and is trained using a triplet loss. It was optimized for facial recognition, verification, and clustering. EfficientNet comprises several inverted-bottleneck convolutional residual blocks. It achieved SOTA results on the AffectNet ER dataset, while being relatively light-weight. Again, PCA was used to reduce the embeddings to 10 dimensions.
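A minimal sketch of the blendshape extraction with the MediaPipe Tasks API follows. It is one plausible way to obtain these values under the current Tasks interface; the model bundle path and input image are placeholders, and the paper does not specify which MediaPipe interface the authors used.

```python
import mediapipe as mp
from mediapipe.tasks import python as mp_python
from mediapipe.tasks.python import vision

# The face_landmarker.task bundle is downloaded separately (placeholder path);
# with output_face_blendshapes=True it yields ~52 named activation scores.
options = vision.FaceLandmarkerOptions(
    base_options=mp_python.BaseOptions(model_asset_path="face_landmarker.task"),
    output_face_blendshapes=True,
    num_faces=1,
)
landmarker = vision.FaceLandmarker.create_from_options(options)

image = mp.Image.create_from_file("face.png")  # placeholder input image
result = landmarker.detect(image)

# One interpretable activation per named blendshape, e.g. 'mouthSmileLeft'.
blendshapes = {b.category_name: b.score for b in result.face_blendshapes[0]}
```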
3.2.3 Computational and Spatial Requirements. In order to have a clear overview of the trade-off between the computational and spatial requirements of each feature computation method and its classification performance, discussed in the next section, we first report the average time to compute and the disk size of the output (per instance) for each method in Table 1.

Table 1: Average time and disk space needed for feature computation using each method.

Modality | Feature method       | Avg. time | Avg. space
---------+----------------------+-----------+-----------
Audio    | MFCC stats           | 19 ms     | < 1 kB
Audio    | wav2vec2 LibriSpeech | 99 ms     | 194 kB
Audio    | wav2vec2 XLSR        | 274 ms    | 258 kB
Audio    | wav2vec2 IEMOCAP     | 101 ms    | 5 kB
Video    | MediaPipe            | 10 ms     | < 1 kB
Video    | FaceNet              | 29 ms     | 2 kB
Video    | EfficientNet         | 2 ms      | 5 kB

When interpreting the results in Table 1, it must also be considered that the DL-based methods require additional computational time for the PCA applied on top of the raw embeddings.

3.3 Emotion Classification

Data splitting is a crucial step in the evaluation of ML models, as it must be done in a way that avoids overfitting and provides a robust evaluation of a model's generalization capabilities. The aim of this research was primarily not to evaluate the absolute performance of ER, but rather to compare the performance of hand-crafted vs. embedding features. It was therefore crucial to ensure that the same data splits and models were used in each experiment, for each of the compared inputs. We opted for the most robust option, leave-one-subject-out (LOSO) evaluation, always using default model hyperparameters. Such an experimental setup minimized overfitting and also gave a good overview of the generalization performance of the emotion classifiers.

4 Experiments and Results

The outputs of the previous step were used as inputs (features) to train a traditional ML model for emotion classification. We evaluated several options: taking the 10 PCA components of the embeddings obtained from each pretrained model as inputs, taking only hand-crafted features as inputs, and taking the union of both as input. Each of these cases was evaluated for the audio and visual modality separately, using the LOSO experimental setup. Several popular ML models were compared (with default hyperparameters), including k-Nearest Neighbours (kNN), Random Forest (RF), Support Vector Machines (SVM) with linear kernel, and eXtreme Gradient Boosting (XGB). We monitored classification accuracy and macro F1 score as metrics of model performance. All results were compared with a baseline majority classifier and are reported as averages across all iterations of LOSO, where the majority class was always taken from the training data (all subjects except the left-out one).

4.1 Audio Emotion Classification

As mentioned in Section 3, we investigated the following options as feature inputs:

(1) Hand-crafted statistical features relating to MFCCs
(2) 10-component PCA of wav2vec2 embeddings from a model trained on LibriSpeech
(3) 10-component PCA of wav2vec2 embeddings from a model trained on IEMOCAP
(4) 10-component PCA of wav2vec2 embeddings from a cross-lingual XLSR model
(5) Union of hand-crafted features and the best-performing embeddings (from above)

These were compared in experiments as described in Section 3.3, using a set of four ML models. Results of the best-performing model for each feature set in terms of accuracy and F1 are given in Table 2. Fused data was acquired by concatenating the feature sets.

Table 2: Best-performing models for each feature set and corresponding accuracy and F1 scores for audio data. Note that embeddings were represented with 10 components obtained from PCA.

Feature set          | Best model | Accuracy  | F1 score
---------------------+------------+-----------+----------
N/A                  | Majority   | 0.17±0.00 | 0.05±0.00
MFCC stats           | RF         | 0.46±0.08 | 0.43±0.09
wav2vec2 LibriSpeech | SVM        | 0.47±0.08 | 0.45±0.09
wav2vec2 XLSR        | SVM        | 0.30±0.05 | 0.27±0.05
wav2vec2 IEMOCAP     | SVM        | 0.47±0.08 | 0.42±0.09
MFCC + best wav2vec2 | SVM        | 0.52±0.09 | 0.50±0.10
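The evaluation protocol just described maps directly onto scikit-learn's LeaveOneGroupOut splitter. The sketch below is a minimal illustration, assuming a feature matrix X, integer-encoded labels y and a per-instance array of subject IDs; feature-level fusion is plain concatenation, as noted above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

def loso_evaluate(X, y, subjects, model):
    """LOSO evaluation: mean/std of accuracy and macro F1 across left-out subjects."""
    accs, f1s = [], []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], pred))
        f1s.append(f1_score(y[test_idx], pred, average="macro"))
    return np.mean(accs), np.std(accs), np.mean(f1s), np.std(f1s)

# Default hyperparameters throughout, as in Section 3.3.
models = {
    "kNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(),
    "SVM": SVC(kernel="linear"),
    "XGB": XGBClassifier(),
}
# Feature-level fusion by concatenation (hypothetical array names):
# X_fused = np.hstack([X_handcrafted, X_embeddings_pca])
# for name, m in models.items():
#     print(name, loso_evaluate(X_fused, y, subjects, m))
```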
4.2 Image Emotion Classification

To stay consistent with the audio experiments, we performed the same LOSO experiments described in Section 3.3. We compared model performances using the following features as inputs:

(1) MediaPipe blendshapes
(2) 10-component PCA of FaceNet embeddings
(3) 10-component PCA of EfficientNet embeddings
(4) Union of MP blendshapes and FaceNet embeddings
(5) Union of MP blendshapes and EfficientNet embeddings

Accuracy and F1 scores for the best-performing models for each set of features are again reported in Table 3.

Table 3: Best-performing models for each feature set and corresponding accuracy and F1 scores for visual data. Note that embeddings were represented with 10 components obtained from PCA.

Feature set              | Best model | Accuracy  | F1 score
-------------------------+------------+-----------+----------
N/A                      | Majority   | 0.25±0.00 | 0.40±0.00
MediaPipe                | RF         | 0.62±0.28 | 0.51±0.29
FaceNet                  | SVM        | 0.45±0.30 | 0.36±0.30
EfficientNet             | RF         | 0.93±0.16 | 0.90±0.20
MediaPipe + FaceNet      | XGB        | 0.70±0.28 | 0.60±0.29
MediaPipe + EfficientNet | XGB        | 0.93±0.17 | 0.90±0.21

4.3 Discussion

From Tables 2 and 3 we can observe that for audio the best performance is achieved with the union of hand-crafted and embedding features, while for visual ER the performance of embeddings alone and of the union is nearly identical. The improvement from the feature union is thus generally small: for visual data we get the same result as with only the best embeddings (a 1% difference in standard deviation), while for audio data the improvement in both metrics is about 5% compared to the individual feature sets. All results substantially outperform the baseline majority classifiers.

For audio data we can see that the best embedding set (wav2vec2 LibriSpeech) performs nearly the same as the hand-crafted features (MFCC stats), which is in agreement with some literature [13]. It is surprising that the LibriSpeech embeddings slightly outperform the IEMOCAP ones, since the latter were trained specifically for emotion recognition, while the former were not. The subpar performance of XLSR is expected, since it is a more general cross-lingual unsupervised model, while the investigated data is spoken English. For visual data, on the other hand, the best embeddings (EfficientNet) substantially outperform the hand-crafted facial expression features (MediaPipe) and those obtained from FaceNet. This is expected, as EfficientNet was trained specifically for emotion recognition, while FaceNet was trained for face recognition.
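For reference, FaceNet-style embeddings such as those compared above can be extracted as sketched below. The paper does not name its FaceNet implementation; we use the facenet-pytorch package as one plausible choice (512-dimensional embeddings from 160×160 face crops), followed by the same 10-component PCA.

```python
import numpy as np
import torch
from facenet_pytorch import InceptionResnetV1
from sklearn.decomposition import PCA

facenet = InceptionResnetV1(pretrained="vggface2").eval()  # 512-d face embeddings

def facenet_embedding(face_rgb):
    """face_rgb: float32 array of shape (160, 160, 3), roughly standardized."""
    tensor = torch.from_numpy(face_rgb).permute(2, 0, 1).unsqueeze(0)
    with torch.no_grad():
        return facenet(tensor).squeeze(0).numpy()

# Fit PCA on training faces only, then transform both splits, e.g.:
# train_emb = np.stack([facenet_embedding(f) for f in train_faces])
# pca = PCA(n_components=10).fit(train_emb)
# X_train = pca.transform(train_emb)
```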
In terms of ML models, we consistently observed the best performance of SVM for ER from audio data, while for video data the best model was not as homogeneous. Importantly, the performance of different models (RF, SVM and XGB) was often within 1%.

Another important observation is the relative stability of results across subjects when classifying from audio, with standard deviations around 8%. The same was not observed in the evaluation from visual data, where much higher standard deviations indicate lower stability and greater variation between subjects.

To address our initial research question, we observed similar performance of hand-crafted features and embeddings from SOTA DL models for audio-based ER, with the union of both achieving the best results. Image-based visual ER achieved much better performance with learned embeddings as inputs, while the union of features showed no improvement. However, the cost of hand-crafted features and embeddings, in terms of the computational power required to compute them and the space required to store them, is not the same. While hand-crafted features are usually computed quickly and represented with a few numbers, as reported in Table 1, the embeddings require loading a (commonly large) pretrained ANN, which performs a large number of matrix multiplications, resulting in high-dimensional embeddings (e.g., 64×512). This in turn requires additional dimensionality reduction, such as the PCA employed in this work. Our results indicate that for image-based visual ER the additional cost is worthwhile, due to large improvements in performance, while audio-based ER achieved a much smaller improvement, making the use of embeddings from pretrained models less attractive.

Finally, hand-crafted features mostly offer direct interpretability (e.g., audio loudness), while embeddings are commonly black-box in nature, lacking explainability without suitable mechanisms on top. The clear meaning of hand-crafted features can be helpful when training traditional ML models, where feature importances can be compared and subsequently interpreted.

5 Conclusion

In summary, we compared using hand-crafted features, embeddings of pretrained SOTA models, or the union of both as inputs for ER models using audio and visual data. We found that the embedding-based approach is substantially superior with visual data, outweighing the computational cost – the latter is in fact the lowest when using EfficientNet. For audio data, an improvement was only seen with the union of inputs, and it was relatively small.

As future work it would be worthwhile to compare merged audio-visual features and embeddings in a single ER problem on the same dataset having both modalities. Furthermore, all data used here was simulated/acted, so the interpretation of these results must take that into account. The numbers are expected to decrease on a more realistic dataset, as emotions in everyday life are quite subtle [18]. It would thus make sense to run similar experiments on more realistic data as well, although such data is more scarce.

Acknowledgements

This work was supported by the bilateral Weave project, funded by the Slovenian Agency of Research and Innovation (ARIS) under grant agreement N1-0319, and by the Swiss National Science Foundation (SNSF) under grant agreement 214991.
References

[1] Shahad Salh Ali, Jamila Harbi Al' Ameri, and Thekra Abbas. 2022. Face detection using Haar cascade algorithm. In 2022 Fifth College of Science International Conference of Recent Trends in Information Technology (CSCTIT), 198–201. doi: 10.1109/csctit56299.2022.10145680.
[2] Grigory Antipov, Sid-Ahmed Berrani, Natacha Ruchaud, et al. 2015. Learned vs. hand-crafted features for pedestrian gender recognition. In Proceedings of the 23rd ACM International Conference on Multimedia, 1263–1266.
[3] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, et al. 2020. Wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449–12460.
[4] Houwei Cao, David G Cooper, Michael K Keutmann, et al. 2014. CREMA-D: crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 5, 4, 377–390.
[5] Alexis Conneau, Alexei Baevski, et al. 2020. Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979.
[6] Paul Ekman and Erika L Rosenberg. 1997. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford University Press, USA.
[7] Monica Gori, Lucia Schiatti, and Maria B. Amadeo. 2021. Masking emotions: face masks impair how we read emotions. Frontiers in Psychology, 12, 669432.
[8] Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. 2022. Why do tree-based models still outperform deep learning on typical tabular data? Advances in Neural Information Processing Systems, 35, 507–520.
[9] Xuhong Li, Haoyi Xiong, Xingjian Li, et al. 2022. Interpretable deep learning: interpretation, interpretability, trustworthiness, and beyond. Knowledge and Information Systems, 64, 12, 3197–3234.
[10] MS Likitha, Sri Raksha R Gupta, K Hasitha, et al. 2017. Speech based human emotion recognition using MFCC. In 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), 2257–2260.
[11] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, et al. 2010. The extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition – Workshops. IEEE, 94–101.
[12] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, et al. 2019. MediaPipe: a framework for building perception pipelines. https://arxiv.org/abs/1906.08172.
[13] Michalis Papakostas, Evaggelos Spyrou, Theodoros Giannakopoulos, et al. 2017. Deep visual attributes vs. hand-crafted audio features on multidomain speech emotion recognition. Computation, 5, 2, 26.
[14] Andrey Savchenko. 2023. Facial expression recognition with adaptive frame rate based on multiple testing correction. In Proceedings of the 40th International Conference on Machine Learning (ICML), PMLR 202, 30119–30129. https://proceedings.mlr.press/v202/savchenko23a.html.
[15] Johannes L Schonberger, Hans Hardmeier, Torsten Sattler, et al. 2017. Comparative evaluation of hand-crafted and learned local features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1482–1491.
[16] Liam Schoneveld, Alice Othmani, and Hazem Abdelkawy. 2021. Leveraging recent advances in deep learning for audio-visual emotion recognition. Pattern Recognition Letters, 146, 1–7.
[17] Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: a unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. doi: 10.1109/cvpr.2015.7298682.
[18] Gašper Slapničar, Zoja Anžur, Sebastijan Trojer, et al. 2024. Contact-free emotion recognition for monitoring of well-being: early prospects and future ideas. In Intelligent Environments 2024: Combined Proceedings of Workshops and Demos & Videos Session. IOS Press, 58–67.
[19] Mingxing Tan and Quoc V. Le. 2019. EfficientNet: rethinking model scaling for convolutional neural networks. CoRR. http://arxiv.org/abs/1905.11946.
[20] Pawan Kumar Verma, Prateek Agrawal, Ivone Amorim, et al. 2021. WELFake: word embedding over linguistic features for fake news detection. IEEE Transactions on Computational Social Systems, 8, 4, 881–893.
[21] Cuicui Ye, Jing Yang, and Yan Mao. 2024. FDHFUI: fusing deep representation and hand-crafted features for user identification. IEEE Transactions on Consumer Electronics.
Multi-modal Data Collection and Preliminary Statistical Analysis for Cognitive Load Assessment

Ana Krstevska (ana.krstevska2001@gmail.com), Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia
Sebastjan Kramar (sebastjan.kramar@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia
Hristijan Gjoreski (hristijang@feit.ukim.edu.mk), Faculty of Electrical Engineering and Information Technologies, Skopje, Macedonia
Martin Gjoreski (martin.gjoreski@usi.ch), Università della Svizzera italiana (USI), Lugano, Switzerland
Junoš Lukan (junos.lukan@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia
Sebastijan Trojer (st5804@student.uni-lj.si), Department of Intelligent Systems, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Mitja Luštrek (mitja.lustrek@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Gašper Slapničar (gasper.slapnicar@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

To mitigate distractions during complex tasks, ubiquitous computing devices should adapt to the user's cognitive load. However, accurately assessing cognitive load remains a significant challenge. This study presents a sophisticated, multi-modal data collection that can enable accurate estimation of cognitive load using wearable and contact-free devices. A total of 25 participants took part in six cognitive load-inducing tasks, each presented at two levels of difficulty. Simultaneously, physiological and behavioral data were collected from a multi-modal sensory setup including an Empatica E4 wristband, Emteq OCOsense glasses, an eye tracker, a thermal camera, a depth camera and an RGB video camera. Additionally, participants provided subjective measures of cognitive load by completing the standardized NASA Task Load Index (NASA TLX) and Instantaneous Self-Assessment (ISA) questionnaires following each cognitive task. Preliminary statistical analyses were conducted on participant demographics, performance metrics, and the perceived difficulty of tasks, as reported in the completed questionnaires.

Keywords

cognitive load inference, wearable sensors, contact-free unobtrusive sensors

1 Introduction

Human attention is a critical resource that is increasingly targeted by mobile applications, online services, and other forms of digital engagement. In an era of constant connectivity, capturing and retaining user attention has become a primary objective for many technologies. However, as users engage in cognitively demanding tasks, distractions can lead to performance degradation and increased stress. Therefore, to minimize interruptions and maintain productivity, ubiquitous computing systems must become capable of recognizing and adapting to the user's cognitive load in real time.

Cognitive load, defined as the mental effort required to process information and perform tasks, triggers a series of physiological responses in the human body. These responses are largely governed by the activation of the sympathetic nervous system. When cognitive load increases, measurable changes can be observed in physiological markers, including blood pressure, brain activity, eye movements, electrodermal activity (EDA), respiration rate, heart rate variability, etc. Furthermore, changes are also reflected in facial expressions, posture, and other behavioural patterns.
This study seeks to offer a unique multi-modal dataset with a rich set of wearable and unobtrusive sensors to capture the subtle changes that occur with the gradual activation of the sympathetic nervous system. Rather than solely focusing on maximizing accuracy through the use of numerous devices, this approach also aims to identify the minimum set of sensors required to achieve reliable cognitive load assessment. To that end, rich multi-modal data was collected from a myriad of sensors, including wearables (OCOsense glasses and an Empatica E4 wristband) and contact-free unobtrusive sensors such as an advanced eye tracker, a thermal camera, a depth camera, and an RGB video camera. To the best of our knowledge, no prior dataset exists containing such rich multi-modal data obtained with such an elaborate sensory setup.

2 Related Work

The challenge of cognitive load estimation has been extensively studied across various fields. Significant emphasis has been placed on reducing cognitive load in dynamic environments, such as aviation [1]. Recent research has increasingly focused on transitioning from direct measurements, such as electroencephalography (EEG), to indirect methods of cognitive load assessment. For instance, ocular metrics, including pupil diameter and blink rate, have been shown to accurately estimate cognitive load [2, 3, 4]. Additionally, facial temperature variations have been widely correlated with cognitive workload, providing another non-invasive means of assessment [5, 6]. Novak et al. demonstrated that biometric indicators, such as galvanic skin response and skin temperature, can signal increased cognitive load; however, these measures are insufficient to distinguish between varying levels of cognitive load [7]. Wang et al. demonstrated that visual cues, including face pose, eye gaze, eye blinking, and yawn frequency, can serve as indicators of cognitive load [8]. This research aims to address the complexities of cognitive load estimation by integrating a wide range of psychophysiological signals, offering a more comprehensive approach to this task.

3 Experimental Setup

The objective of our data collection was to capture participants' cognitive load under varying levels of difficulty imposed by cognitive load-inducing tasks. The study was conducted in a quiet, temperature-controlled room, with participants tested individually. At the beginning of each session, participants were seated in a comfortable chair in front of a 24" monitor and given instructions about the experiment and their expected role. The Empatica E4 wristband was then fitted to the participant's non-dominant hand, and the OCOsense glasses for emotion recognition were equipped in line with the product instructions.

Data collection was further enriched through the use of unobtrusive sensing technologies, including a Tobii Spark eye tracker (60 frames per second), an Intel RealSense Depth Camera D455 (providing depth data at 30 fps), a Logitech BRIO 4K streaming webcam at 10 fps with HDR and noise-canceling microphones, and a FLIR Lepton 3 thermal camera delivering a full 160×120-pixel thermal resolution at 8 fps. We used this set of devices to continuously monitor participants throughout the recording session. The experimental setup can be observed in Figure 1.

[Figure 1: Experimental setup]

4 Data Collection Protocol

Prior to the experiment, participants completed a brief sleep questionnaire to gather information about their sleep patterns (e.g., hours slept the night before and usual sleep duration) and rated their levels of fatigue and focus on a scale of 1 to 10.

Calibration data for the OCOsense glasses was then recorded by having participants replicate four facial expressions (smiling, frowning, brow raising, and eye squeezing) three times each. Calibration for the eye tracker followed, during which participants tracked a moving dot with their eyes. This calibration aimed to optimize the participant's seating position for accurate eye tracking.
The experiment's main phase involved participants completing cognitive load-inducing tasks that tested three aspects of cognition: attention, memory, and visual perception. For each cognitive domain, two widely recognized tasks were presented, each with two levels of difficulty (easy and difficult). This design allowed for the differentiation of cognitive load levels. Following each category of cognitive tasks, participants engaged in relaxation tasks that were not expected to induce cognitive load, such as meditation with open eyes, listening to music to relieve stress, and passive viewing of aesthetically pleasing images. These tasks provided baseline data for periods of minimal cognitive load.

In summary, each recording session included six cognitive load-inducing tasks (each with two levels of difficulty) and three relaxation tasks, totaling 15 tasks. After each task, participants completed the NASA Task Load Index (NASA TLX) questionnaire, a validated instrument for assessing cognitive load across six dimensions: mental demand, physical demand, temporal demand, performance, effort, and frustration [9]. Each question was rated on a scale of 0 to 100. In this study, the unweighted version of the NASA TLX, known as the Raw NASA TLX, was used. Additionally, participants completed a single-item Instantaneous Self-Assessment (ISA) of workload, providing a subjective measure of the cognitive load induced by the task [10]. These questionnaires served as subjective assessments of cognitive load and as reference points for the difficulty of each task [11].

The tasks were implemented using PsychoPy, an open-source software package commonly used in neuroscience and experimental psychology research [12]. For the attention-related tasks, participants completed the N-back and Stroop tests. In the N-back task, participants were presented with a sequence of letters and asked to determine whether the current letter matched the one presented N trials earlier, with task difficulty increasing as N increases [13]. Participants completed both a 2-back and a 3-back task. In the Stroop test, participants identified whether the word matched the color in which it was written, with the easier version involving two colors (red and blue) and the more difficult version incorporating five colors [14].
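To illustrate how such a task is implemented in PsychoPy, the following is a minimal 2-back sketch; the letter set, stimulus timing, sequence length and response key are our illustrative assumptions rather than the study's exact parameters.

```python
import random
from psychopy import core, event, visual

N = 2                                    # 2-back; the study also used N = 3
letters = [random.choice("BCDFGHJKL") for _ in range(30)]
win = visual.Window(fullscr=False, color="black")
stim = visual.TextStim(win, height=0.2)

hits = 0
for i, letter in enumerate(letters):
    stim.text = letter
    stim.draw()
    win.flip()
    core.wait(1.5)                       # stimulus duration (assumed)
    keys = event.getKeys(keyList=["space"])
    is_target = i >= N and letter == letters[i - N]
    if is_target and keys:
        hits += 1                        # space pressed on an N-back match

print(f"hits: {hits}")
win.close()
core.quit()
```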
Memory-related tasks included a memory game and a question-answering task based on a previously shown image. In the memory game, participants recalled as many words as possible from a set, with the easier version comprising seven words and the more difficult version consisting of 15 words. In the question-answering task, participants focused on an image and then answered questions about it (e.g., remembering the number of particular objects in the image), with the hard version using an image with greater detail.

The visual perception tasks included a "spot the difference" task and a pursuit test. In the "spot the difference" task, participants were presented with two images and asked to identify as many differences as possible within a one-minute time frame. The difficulty of this task varied, with the more challenging version involving an image that contained greater detail compared to the simpler, easier version. The pursuit test required participants to visually track irregularly curved, overlapping lines. As with the "spot the difference" task, the pursuit test was administered at two levels of difficulty. The more difficult version featured a more intricate image, with longer and more tangled lines, as opposed to the less complex image used in the easier version of the task.

5 Statistics

In this section, we present some descriptive demographic and task-related statistics for the participants involved in the experiment. The average age of participants was 29.28 years, with a standard deviation of 8.31. In terms of educational background, the majority of participants (44%) had obtained a Bachelor's degree (BSc), followed by those with a Master's degree (MSc) at 28%. A smaller portion had completed only high school (16%) or had earned a PhD (12%). Additionally, 60% of the participants were male.

We then looked at the descriptive statistics derived from the performance of the participants in each task. These indicate that participants performed consistently well on tasks such as the 2-back task, both versions of the Stroop test, the easy memory task (where participants recalled an average of 5 out of 7 words), the easy version of the "spot the difference" task (with an average detection rate of approximately 90% of all the differences), and both versions of the pursuit test. Notably, participants performed slightly better on the difficult version of the Stroop test, likely due to their increased familiarity with the task. However, performance was lower on the 3-back test (which most participants perceived as highly or extremely difficult), the difficult memory task (with an average recall rate of 39%), and both the easy and difficult question-answering tasks. The difficult version of the "spot the difference" task also showed lower performance, with participants detecting only 25% of the differences on average. Consistent performance among subjects (with low standard deviation) was observed across all tasks except the N-back tasks. Notably, the N-back tasks were always presented first to participants, suggesting that they may have required additional time to adjust to the testing environment and fully engage with the task.
Next, an inferential statistical analysis was performed on the relationship between task scores and various variables of the sleep pattern. To investigate the potential influence of tiredness on performance, responses from the sleep patterns questionnaire were analyzed. A non-parametric Kruskal-Wallis test was performed to determine whether there was a statistically significant difference in overall scores across different levels of tiredness (low, medium, and high). The resulting p-value (0.91) indicated no significant difference in performance between these groups. Thus, tiredness levels did not show a statistically significant impact on performance within a 95% confidence interval.

Similarly, the effect of focus level (low vs. high) on overall performance was examined using a non-parametric Mann-Whitney test. The p-value was 0.12, indicating no statistically significant difference in performance between the low- and high-focus groups at the 5% significance level.

Furthermore, the relationship between hours of sleep the night before the experiment and participant performance was examined using Spearman's correlation. The p-value was 0.42, indicating no statistically significant correlation between overall performance scores and hours of sleep the night before the experiment.

The potential influence of participants' highest education level on overall performance was also investigated. To assess this, a non-parametric Kruskal-Wallis test was conducted. The results (p-value of 0.33) indicated no statistically significant difference in performance scores across different education levels among the participants.

Overall, the small sample size may have constrained the ability to detect significant effects. The limited variability in the sample's educational background and other factors likely contributed to the lack of observed differences, emphasizing the need for a larger, more diverse sample to better understand the impact of these variables on cognitive load performance.

[Figure 2: Reported perceived difficulty per cognitive task]

As shown in Figure 2, participants consistently perceived the difficulty of the two N-back tasks and the difficult version of the "spot the difference" task as somewhat high or high. This suggests a general consensus regarding the difficulty of these tasks. In contrast, the NASA TLX-based perceived difficulty of the remaining tasks exhibited significant variability among participants.

To assess differences in performance across task difficulties and evaluate the potential for differentiating cognitive load using machine learning models, we conducted additional inferential statistical analyses. The Wilcoxon signed-rank test was used to compare participant performance on the easier and more difficult versions of each cognitive task. Statistically significant differences in performance were found between the two difficulty levels for all tasks. For the N-back, "spot the difference", and pursuit tasks, participants performed significantly better on the easier versions, indicating that increased difficulty negatively impacted performance. Conversely, for the Stroop, memory, and question-answering tasks, participants performed better on the more difficult versions.

The statistical analysis conducted in this study provides initial evidence supporting the validity of the data collection protocol, particularly with respect to the selection of tasks and task difficulty levels. The tasks chosen for this experiment varied significantly in terms of their cognitive demands, as reflected by the substantial differences in performance between the easier and more difficult versions of each task. These results indicate that cognitive load and performance are task-specific, and the significant differences observed support the feasibility of using machine learning models to differentiate between varying levels of cognitive load.
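All four reported tests map directly onto scipy.stats. The sketch below shows this mapping on a hypothetical per-participant summary table; the file and column names are invented for illustration.

```python
import pandas as pd
from scipy.stats import kruskal, mannwhitneyu, spearmanr, wilcoxon

df = pd.read_csv("participant_summary.csv")  # hypothetical aggregated results

# Kruskal-Wallis: overall score across tiredness levels (low/medium/high).
groups = [g["score"].values for _, g in df.groupby("tiredness_level")]
print(kruskal(*groups))

# Mann-Whitney U: overall score for low- vs. high-focus participants.
print(mannwhitneyu(df[df.focus == "low"]["score"],
                   df[df.focus == "high"]["score"]))

# Spearman correlation: hours slept the night before vs. overall score.
print(spearmanr(df["hours_slept"], df["score"]))

# Wilcoxon signed-rank: paired easy vs. difficult scores for one task.
print(wilcoxon(df["nback_easy_score"], df["nback_hard_score"]))
```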
6 Conclusion and Future Work

This study employs a novel approach to data collection for cognitive load inference by combining psychophysiological data obtained from a multi-modal sensory setup, including wearable and unobtrusive contact-free sensors. The decision to utilize a diverse set of devices was motivated by the hypothesis that integrating data from multiple sources could provide a more accurate assessment of cognitive load, while also aiming to identify the minimal sensor configuration required to achieve reliable results. This is particularly relevant in dynamic and high-stakes environments, such as driving, where accurate cognitive load assessment could have life-saving implications. To the best of our knowledge, no prior research has incorporated such a comprehensive and multifaceted setup for cognitive load evaluation.

The statistical analyses conducted thus far offer promising validation for the data collection protocol. The selection of tasks and task difficulty levels proved effective in eliciting a range of cognitive load levels, as evidenced by the significant performance differences between task difficulties.

To further enhance the validity of the data collection protocol, several changes could be implemented in potential subsequent collections. Refining task difficulty levels could offer more granularity in cognitive load differentiation, ensuring a clearer distinction between varying levels of cognitive load. Furthermore, increasing the diversity of participants in terms of age, educational background, and other demographic factors is desirable to enhance the generalizability of the findings.

In future work, the collected data will be processed and utilized to train machine learning models aimed at estimating cognitive load. Ground truth for the machine learning models can be derived from various sources, including the perceived task difficulty reported through the standardized questionnaires, the designed difficulty level of the tasks, or the participants' performance on the tasks. These machine learning models will leverage sophisticated ML techniques to effectively integrate and analyze multi-modal data, aiming to enhance the accuracy of cognitive load predictions. We also plan to further expand the dataset with another phase of data collection, offering a rich dataset both in terms of modalities and in terms of participants. The collected dataset will serve as a stepping stone towards robust multi-modal cognitive load assessment, allowing for the creation and benchmarking of ML models, and will be made available to the general public after the collection is finalized.

Acknowledgements

This work was supported by the Jožef Stefan Institute and Università della Svizzera italiana (funded by SNSF through the project XAI-PAC (PZ00P2_216405)).

References

[1] Jonathan Mead, Mark Middendorf, Christina Gruenwald, Chelsey Credlebaugh, and Scott Galster. 2017. Investigating Facial Electromyography as an Indicator of Cognitive Workload. In 19th International Symposium on Aviation Psychology, 377–382.
[2] Muneeb Imtiaz Ahmad, Ingo Keller, David A. Robb, and Katrin S. Lohan. 2020. A framework to estimate cognitive load using physiological data. Personal and Ubiquitous Computing, 27, 2027–2041.
[3] Tobias Appel, Christian Scharinger, Peter Gerjets, and Enkelejda Kasneci. 2018. Cross-subject workload classification using pupil-related measures. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications, 4, 1–8.
[4] Tobias Appel, Natalia Sevcenko, Franz Wortha, Katerina Tsarava, Korbinian Moeller, Manuel Ninaus, Enkelejda Kasneci, and Peter Gerjets. 2019. Predicting Cognitive Load in an Emergency Simulation Based on Behavioral and Physiological Measures. In Proceedings of the 21st ACM International Conference on Multimodal Interaction (ICMI), 154–163.
[5] Fangqing Zhengren, George Chernyshov, Dingding Zheng, and Kai Kunze. 2019. Cognitive load assessment from facial temperature using smart eyewear. In Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers, 657–660.
[6] Yomna Abdelrahman, Eduardo Velloso, Tillman Dingler, Albrecht Schmidt, and Frank Vetere. 2017. Cognitive Heat: Exploring the Usage of Thermal Imaging to Unobtrusively Estimate Cognitive Load. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 33, 1–20.
[7] Klemen Novak, Kristina Stojmenova, Grega Jakus, and Jaka Sodnik. 2017. Assessment of cognitive load through biometric monitoring. In 7th International Conference on Information Society and Technology (ICIST).
[8] Zixuan Wang, Jinyun Yan, and Hamid Aghajan. 2012. A framework of personal assistant for computer users by analyzing video stream. In Proceedings of the 4th Workshop on Eye Gaze in Intelligent Human Machine Interaction, 1–3.
[9] Sandra G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. In Advances in Psychology, 52, 139–183.
[10] Andrew J. Tattersall and Penelope S. Foord. 2007. An experimental evaluation of instantaneous self-assessment as a measure of workload. Ergonomics, 39, 740–748.
[11] Thomas Kosch, Jakob Karolus, Johannes Zagermann, Harald Reiterer, Albrecht Schmidt, and Paweł W. Woźniak. 2023. A Survey on Measuring Cognitive Workload in Human-Computer Interaction. ACM Computing Surveys, 55, 1–39.
[12] Jonathan Peirce, Rebecca Hirst, and Michael MacAskill. 2022. Building Experiments in PsychoPy. Sage Publications.
[13] Michael J. Kane and Andrew Conway. 2016. The invention of n-back: an extremely brief history. The Winnower.
[14] John Ridley Stroop. 1992. Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 121, 15–23.

Predicting Health-Related Absenteeism with Machine Learning: A Case Study

Aleksander Piciga (ap7377@student.uni-lj.si), Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
Matjaž Kukar (matjaz.kukar@fri.uni-lj.si), Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia

Abstract

Health-related absenteeism, or sick leave, is a complex issue with significant financial and operational implications for businesses. We present a machine learning approach to predict employee absenteeism in a Slovenian company. The study involved preprocessing and augmenting the dataset by incorporating domain knowledge, and evaluating various machine learning models. Gradient Boosted Regression Trees emerged as the most effective model, significantly outperforming the baseline model, which merely predicted the previous year's absenteeism rate. Key attributes influencing absenteeism were identified, notably including current absenteeism, performance evaluations, and various job-type- and location-related features. The results highlight the potential of machine learning in proactively managing absenteeism and offer recommendations for future research, such as modeling absenteeism as a time series and incorporating additional data sources. We also show that the current data is not detailed and granular enough to further improve the results.
Keywords

absenteeism, data analysis, data augmentation, machine learning

[Figure 1: The increase in absenteeism rate in Slovenia between 2014 and 2022 [8]. We can observe a steady increase throughout the years.]

1 Introduction

Absenteeism — temporary absence from work due to health reasons — is a widespread issue. In Slovenia, it has been on the rise since 2014 (Figure 1), with an average of 56,128 individuals absent daily in 2022, representing approximately 5.91% of the workforce [8]. This carries substantial financial burdens: direct costs like sick pay, and indirect costs from overstaffing, reduced productivity and lower service quality [2]. The complexity of absenteeism, rooted in personal and organizational factors, makes it challenging to predict and manage effectively [10].

Recent years have witnessed a growing interest in leveraging artificial intelligence (AI) and machine learning (ML) to address the absenteeism challenge [5]. Various machine learning techniques, including neural networks, decision trees, random forests, and gradient boosting, have been employed to predict absenteeism and identify its underlying causes [3, 9]. These studies have demonstrated the potential of machine learning in providing valuable insights for proactive absenteeism management.

This paper presents a case study conducted in collaboration with a Slovenian IT company¹ aiming to improve absenteeism prediction and management. The study includes preprocessing and augmenting the company's employee data by incorporating domain knowledge, and evaluating various machine learning models. The findings highlight key attributes influencing absenteeism and offer recommendations for future research and interventions.

The significance of our work extends beyond Company X, offering a blueprint for organizations tackling absenteeism. By showcasing machine learning's efficacy in predicting absenteeism and revealing its drivers, we contribute to the broader field and pave the way for data-driven interventions promoting a healthier, more productive workforce. This aligns with the growing trend of using AI and ML to address complex organizational challenges. Insights from such analyses can aid strategic workforce planning, optimize resource allocation, and ultimately contribute to a more sustainable and resilient organization.

In Section 2 we detail the data and preprocessing, Section 3 outlines the methodology, Section 4 presents the results, and Section 5 discusses the findings and concludes the study.

¹ The company asked to remain anonymous, so it is referred to as Company X.

2 Materials

The data used in our work spanned six years, from 2017 to 2022, and initially comprised 13,798 instances (aggregated employee records) with up to 49 attributes each. They include demographic details, work-related factors, performance evaluations and the current year's absenteeism rate for each employee, but no particulars about sick leave and other personal data.
The initial dataset required substantial preprocessing to prepare it for analysis and machine learning [6]. The data cleaning process involved addressing inconsistencies in attribute values, such as removing extraneous spaces and converting text to lowercase for uniformity. A significant challenge in the dataset was the presence of missing values, denoted by '/'. Their meaning and handling were discussed with a company representative to determine their origins and ensure appropriate treatment. In some cases, missing values were imputed based on the average values of similar instances. For example, missing values in 'Kilometers to work' were attributed to errors in data entry and were imputed using the average value for employees living in the same location and working at the same place. On the other hand, missing values in performance evaluations were due to the employee's absence on evaluation days.

The target variable — the health-related absenteeism rate in the following year — is a continuous variable ranging from 0 to 1. It signifies the proportion of workdays an employee is absent due to health reasons compared to the total number of workdays. The distribution of this target variable is heavily skewed to the right, with most values clustered near zero, indicating that the majority of employees have low absenteeism rates. However, there exist some outliers with extremely high absenteeism rates (Figure 2).

[Figure 2: Log-distribution of the target variable (median 0.0125, 95th percentile 0.2750). Most workers have very little absence, causing a right-tailed distribution with an "outlier" spike on the right.]

The skewed distribution of the target variable has implications for the statistical analysis and machine learning modeling. Therefore, non-parametric statistical tests, such as Spearman's rank correlation and the Kruskal-Wallis test, were employed in EDA and data preprocessing. Additionally, the presence of outliers necessitates careful consideration during model building and evaluation.

The final dataset, comprising 10,347 instances and 42 attributes, serves as the foundation for the subsequent machine learning, where various models are trained to predict absenteeism rates.

3 Methods

The research methodology encompassed a multi-faceted approach, integrating exploratory data analysis, feature engineering, and the application of diverse machine learning models. The ultimate goal was to establish a robust predictive framework for health-related absenteeism, while also ensuring model interpretability to obtain actionable insights.

3.1 Exploratory Data Analysis (EDA)

The initial phase involved a thorough EDA to understand the underlying data distribution, identify potential outliers, and uncover preliminary relationships between attributes and the target variable (absenteeism in the following year). Given the skewed nature of the target variable, visualizations like histograms and box plots were complemented by non-parametric statistical tests. Spearman's rank correlation coefficient was employed to assess monotonic relationships between continuous attributes and the target variable, while the Kruskal-Wallis test was utilized to discern statistically significant differences across groups defined by categorical attributes.

3.2 Data Augmentation/Feature Engineering

The original dataset underwent a series of transformations to enhance its suitability for machine learning. This included data cleaning, handling missing values, and the creation of new attributes based on domain knowledge, statistical analysis, and insights from the EDA. The new attributes included indicators for elevated absenteeism, receipt of bonuses or awards, high and low performance evaluations, and absenteeism rates within the employee's team and job type. External factors, such as average absenteeism rates in the employee's residential and work locations, were also incorporated. The feature engineering process was iterative, involving close collaboration with domain experts to ensure the derived attributes were meaningful and captured relevant aspects of employee behavior and organizational dynamics.
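A minimal pandas sketch of the group-based imputation and two of the derived attributes follows; the file and column names are hypothetical, since the actual schema is not public.

```python
import pandas as pd

df = pd.read_csv("employees.csv", na_values="/")  # '/' marks missing values

# Impute 'Kilometers to work' with the mean over employees who share both
# the residential and the work location, as described in Section 2.
group_km = df.groupby(["home_location", "work_location"])["km_to_work"]
df["km_to_work"] = df["km_to_work"].fillna(group_km.transform("mean"))

# Derived attribute: mean absenteeism within the employee's team.
df["team_absenteeism"] = df.groupby("team")["absenteeism"].transform("mean")

# Derived indicator: absenteeism elevated relative to the company median.
df["elevated_absenteeism"] = (
    df["absenteeism"] > df["absenteeism"].median()
).astype(int)
```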
3.3 Machine Learning Models

Several well-known machine learning models were employed for absenteeism prediction, including Decision Trees, Linear Regression with L1 regularization, K-Nearest Neighbors (KNN), Support Vector Regression (SVR), Gradient Boosted Regression Trees (GBRT), and Random Forest. Hyperparameter optimization was conducted using the Optuna toolkit [1].

3.4 Model Evaluation and Selection

Model evaluation was performed using the Mean Absolute Error (MAE), the Root Mean Squared Error (RMSE), and the coefficient of determination (R²). The models were trained on past years' data and tested on the subsequent year, with the training set size increasing each year. The MAE provided an intuitive measure of the average prediction error, while the RMSE penalized larger errors more severely. The R² quantified the proportion of variance in the target variable explained by the model. The models were compared against a baseline model that simply predicted the previous year's absenteeism rate, to gauge the added value of the machine learning approach.
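The Optuna-based search of Section 3.3 can be sketched as below, using scikit-learn's GradientBoostingRegressor as a stand-in for the GBRT model (the paper does not name the implementation); the searched hyperparameters and ranges are illustrative, and the random data merely makes the snippet runnable.

```python
import numpy as np
import optuna
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Stand-in data with the shape of the real problem (42 attributes).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(800, 42)), rng.random(800) * 0.2
X_valid, y_valid = rng.normal(size=(200, 42)), rng.random(200) * 0.2

def objective(trial):
    model = GradientBoostingRegressor(
        n_estimators=trial.suggest_int("n_estimators", 50, 500),
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        max_depth=trial.suggest_int("max_depth", 2, 8),
    )
    # Train on past years, validate on the held-out subsequent year.
    model.fit(X_train, y_train)
    return mean_absolute_error(y_valid, model.predict(X_valid))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```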
3.5 Model Interpretation

SHAP (SHapley Additive exPlanations) values [4, 7] were calculated to interpret model predictions and assess attribute importance. SHAP values provide a unified framework for interpreting any machine learning model, quantifying the contribution of each feature to the model's prediction for a given instance. By analyzing the SHAP values, it was possible to identify the most influential attributes and their directional impact on absenteeism.

3.6 Data Splitting

To ensure robust model evaluation and mitigate the risk of overfitting, the dataset was split into training and testing sets in a prequential manner (year after year). The models were trained on the training set and their performance was assessed on the unseen testing set for the subsequent year. This comprehensive methodological framework enabled a systematic exploration of the factors influencing health-related absenteeism and the development of a predictive model to proactively manage this critical issue.

4 Results

The primary objective of our work was to develop machine learning models capable of predicting health-related absenteeism in the subsequent year. The models were evaluated using three key metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination (R²). The baseline model, which simply predicted the previous year's absenteeism, served as a benchmark for comparison (Table 1).

Table 1: Model performance averaged year-over-year.

Model                    | RMSE  | MAE   | R²
-------------------------+-------+-------+------
Random Forest            | 0.107 | 0.052 | 0.344
GBRT                     | 0.108 | 0.051 | 0.333
Linear Regression        | 0.108 | 0.051 | 0.331
Regression Decision Tree | 0.112 | 0.051 | 0.281
KNN                      | 0.121 | 0.057 | 0.173
SVR                      | 0.117 | 0.075 | 0.215
Baseline Model           | 0.121 | 0.051 | 0.156

As we can see, all machine learning models outperform the baseline model in terms of RMSE and R². This indicates their superior ability to explain the variance in the target variable (absenteeism in the following year). While the MAE remains relatively consistent across models, the improvement in RMSE and R² suggests that the models are particularly effective in handling larger deviations in absenteeism predictions.
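The prequential protocol of Section 3.6 and the three metrics can be expressed compactly as below; the data-frame layout, with a year column and a next-year absenteeism target, is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def prequential_evaluate(df, feature_cols, model_factory, target="target_next_year"):
    """Train on all years before t, test on year t, for every year but the first."""
    rows = []
    for t in sorted(df["year"].unique())[1:]:
        train, test = df[df["year"] < t], df[df["year"] == t]
        model = model_factory()
        model.fit(train[feature_cols], train[target])
        pred = model.predict(test[feature_cols])
        rows.append({
            "year": t,
            "MAE": mean_absolute_error(test[target], pred),
            "RMSE": np.sqrt(mean_squared_error(test[target], pred)),
            "R2": r2_score(test[target], pred),
        })
    return pd.DataFrame(rows)

# Hypothetical usage on the prepared frame:
# results = prequential_evaluate(df, feature_cols, GradientBoostingRegressor)
```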
3.6 Data Splitting

To ensure robust model evaluation and mitigate the risk of overfitting, the dataset was split into training and testing sets in a prequential manner (year after year). The models were trained on the training set and their performance was assessed on the unseen testing set for the subsequent year. This comprehensive methodological framework enabled a systematic exploration of the factors influencing health-related absenteeism and the development of a predictive model to proactively manage this critical issue.

4 Results

The primary objective of our work was to develop machine learning models capable of predicting health-related absenteeism in the subsequent year. The models were evaluated using three key metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination (R²). The baseline model, which simply predicted the previous year's absenteeism, served as a benchmark for comparison (Table 1).

Table 1: Model performance averaged year-over-year.

Model                    | RMSE  | MAE   | R²
Random Forest            | 0.107 | 0.052 | 0.344
GBRT                     | 0.108 | 0.051 | 0.333
Linear Regression        | 0.108 | 0.051 | 0.331
Regression Decision Tree | 0.112 | 0.051 | 0.281
KNN                      | 0.121 | 0.057 | 0.173
SVR                      | 0.117 | 0.075 | 0.215
Baseline Model           | 0.121 | 0.051 | 0.156

As we can see, all machine learning models outperform the baseline model in terms of RMSE and R². This indicates their superior ability to explain the variance in the target variable (absenteeism in the following year). While the MAE remains relatively consistent across models, the improvement in RMSE and R² suggests that the models are particularly effective in handling larger deviations in absenteeism predictions.

To establish the statistical significance of the model improvements, we conducted a paired t-test comparing the predictions of each model against the baseline model. All the selected models demonstrated statistically significant improvements (p < 0.05) in RMSE and R²; this ensures that their superior performance is statistically substantiated and not due to chance.
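The sketch below outlines this evaluation protocol: training on all past years, testing on the subsequent year, and pairing the per-employee absolute errors of a model with those of the baseline for the t-test. The column names and the synthetic data are illustrative assumptions, not the study's actual setup.

```python
# Sketch of the prequential evaluation with a paired t-test against the
# previous-year baseline; column names are illustrative assumptions.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

def prequential_eval(df, feature_cols):
    for test_year in sorted(df["year"].unique())[1:]:
        train, test = df[df["year"] < test_year], df[df["year"] == test_year]
        model = GradientBoostingRegressor(random_state=0)
        model.fit(train[feature_cols], train["target"])
        pred = model.predict(test[feature_cols])
        truth = test["target"].to_numpy()
        base = test["current_absenteeism"].to_numpy()  # baseline prediction
        # Paired t-test on per-employee absolute errors vs. the baseline.
        _, p = stats.ttest_rel(np.abs(pred - truth), np.abs(base - truth))
        print(test_year, mean_absolute_error(truth, pred),
              mean_absolute_error(truth, base), p)

rng = np.random.default_rng(0)
df = pd.DataFrame({"year": rng.integers(2017, 2022, 600),
                   "current_absenteeism": rng.random(600),
                   "age": rng.integers(20, 60, 600)})
df["target"] = 0.6 * df["current_absenteeism"] + 0.1 * rng.random(600)
prequential_eval(df, ["current_absenteeism", "age"])
```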
4.1 Performance Trends and Impact of Additional Data per Employee

To gain deeper insights into model behavior, we examined the performance trends of the models over the years. Figure 3 illustrates the evolution of MAE for each model.

Figure 3: MAE trend over time (2018–2021) with additional training data from past years, for Random Forest, GBRT, Linear Regression, Regression Decision Tree and the baseline model.

Among the evaluated models, GBRT exhibited the best performance, achieving an MAE of 0.045, RMSE of 0.10, and R² of 0.40 on the latest year's data. These results were statistically significantly better than the baseline model, demonstrating the effectiveness of GBRT in capturing the complex patterns underlying absenteeism.

Figure 3 reveals a general trend of MAE improvement for most models in later years, surpassing the baseline model in the final year. This suggests that the models benefit from the increasing amount of training data available in later years. The RMSE and R² charts (not shown) exhibit almost identical properties. It is clear that ML models profit tremendously from increasing amounts of data, as can be expected.

Given the observed performance gains in later years with larger training sets, we explored the impact of incorporating data from previous years. Figure 4 showcases the change in MAE for the final year when models were trained on data from the past year and the past three years, respectively.

Figure 4: Impact of additional attributes from past years on MAE (base dataset vs. dataset with the past year vs. dataset with the past three years).

The GBRT model exhibited notable improvement with the inclusion of additional data, achieving an MAE of 0.044, RMSE of 0.093, and R² of 0.36. This underscores the value of historical data in enhancing the predictive capabilities of machine learning models for absenteeism and suggests that including even more historical data per employee would be beneficial.

4.2 Interpretability and Additional Insights

Analysis of SHAP values yielded the following key attributes influencing absenteeism:

• Current absenteeism rate
• Performance evaluations
• With respect to the employee's job type and location:
  – Absenteeism rate
  – Proportion of employees with elevated absenteeism
  – Proportion of employees without bonuses

Our findings suggest that absenteeism is influenced by a combination of individual factors (current absenteeism, performance evaluations) and organizational factors (job type, location, bonuses).

Additionally, a rather simple EDA visualisation of the functional grouping of employees was quite surprising (Figure 5). Its interpretation can be quite speculative, possibly related to increased job satisfaction or engagement in certain groups. Another, somewhat surprising finding from EDA is that the COVID-19 pandemic did not significantly influence absenteeism rates in 2020, but it may have in 2021 (Figure 6).

Figure 5: Target variable according to functional partitioning within the company (work fields: Support, Commercial, Technology).

Figure 6: Target variable by year (2017–2021). Note the sharp increase in 2021, possibly attributable to the COVID-19 pandemic.

Finally, a t-SNE visualization of the full dataset shows that employees cannot easily be separated into clusters with similar absenteeism (Figure 7). We can identify some distinct subgroups (like the cluster of red dots on the left); however, most data points are quite intermingled. This suggests that, with our current set of attributes, we should not anticipate a significant improvement in predictive performance.

Figure 7: Data visualized in 2D space with a t-SNE projection. Red dots represent examples with absenteeism in the next year above 0.25. Blue shades depict examples with absenteeism between 0 (light blue) and 0.25 (dark blue).

5 Discussion and Conclusion

Our work successfully demonstrates the application of machine learning to predict health-related absenteeism. The GBRT model's superior performance highlights its ability to capture complex data relationships, outperforming simpler models and the baseline. Also, identifying the key attributes influencing absenteeism, such as current absenteeism, denied bonuses, work type and location, and performance evaluations, provides valuable insights.

The findings align with existing literature highlighting the multifactorial nature of absenteeism. The strong influence of current absenteeism on future absenteeism emphasizes its predictive power, suggesting that past behavior can be a significant indicator of future trends. The negative correlation between performance evaluations and absenteeism suggests that employees with higher evaluations tend to be less absent, potentially due to increased job satisfaction or engagement. The impact of denied bonuses on absenteeism points to the potential role of financial incentives and recognition in influencing employee attendance.

The limitations of our work include the relatively short time span and the potential influence of unmeasured external factors. Future research could address these limitations by: modeling absenteeism as a time series to capture its dynamic nature; incorporating additional data sources such as employee surveys, participation in wellness programs, and (within legal limits) health and personal circumstances data; analyzing absenteeism at a finer granularity (e.g., monthly or daily); exploring the inclusion of employee health records and workplace environmental factors in predictive models; and conducting longitudinal studies to track absenteeism patterns over extended periods.

While the quantitative improvements of the ML model predictions are not overwhelming, the gained insights can enable targeted interventions to reduce absenteeism and promote a healthier workforce. By leveraging ML and data-driven insights, organizations can proactively manage absenteeism, thus improving productivity, financial stability, and employee well-being.

Acknowledgements

The authors sincerely thank Company X for providing the data, domain expertise and several fruitful discussions. The authors acknowledge the financial support from the Slovenian Research Agency (research core funding No. P2-209).

References

[1] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Association for Computing Machinery, 2623–2631. isbn: 9781450362016. doi: 10.1145/3292500.3330701.
[2] M. Bregant, E. Boštjančič, J. Buzeti, M. Ceglar Ključevšek, A. Hiršl, M. Klun, T. Kozjek, N. Tomaževič, and J. Stare. 2012. Izboljševanje delovnega okolja z inovativnimi rešitvami. Združenje delodajalcev Slovenije.
[3] B. Hu. 2021. The application of machine learning in predicting absenteeism at work. In 2021 2nd International Conference on Computing and Data Science (CDS), 270–276. doi: 10.1109/CDS52072.2021.00054.
[4] Y. Meng, N. Yang, Z. Qian, and G. Zhang. 2021. What makes an online review more helpful: An interpretation framework using XGBoost and SHAP values. Journal of Theoretical and Applied Electronic Commerce Research, 16, 3, 466–490. doi: 10.3390/jtaer16030029.
[5] I. H. Montano, G. Marques, S. G. Alonso, M. López Coronado, and I. de la Torre Díez. 2020. Predicting absenteeism and temporary disability using machine learning: A systematic review and analysis. Journal of Medical Systems, 44, 9, (Aug. 2020), 162. doi: 10.1007/s10916-020-01626-2.
[6] A. Piciga. 2024. Napovedovanje zdravstvenega absentizma s strojnim učenjem. Bachelor's thesis. Univerza v Ljubljani, Fakulteta za računalništvo in informatiko. https://repozitorij.uni-lj.si/IzpisGradiva.php?lang=slv&id=160413.
[7] E. Štrumbelj and I. Kononenko. 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41, 3, (Dec. 2014), 647–665. doi: 10.1007/s10115-013-0679-x.
[8] M. Zaletel, D. Vardič, and M. Hladnik. 2024. Zdravstveni statistični letopis Slovenije 2022. Retrieved June 5, 2024 from https://nijz.si/publikacije/zdravstveni-statisticni-letopis-2022/.
[9] W. Zaman, S. Zaidi, A. I. Abdullah, and B. Touhid. 2019. Predicting absenteeism at work using tree-based learners. In Proceedings of the 3rd International Conference on Machine Learning and Soft Computing. Association for Computing Machinery, 7–11. doi: 10.1145/3310986.3310994.
[10] S. Zupanc. 2011. Absentizem, kolegialnost in obremenjenost posameznikov. Bachelor's thesis. Univerza v Ljubljani. http://www.cek.ef.uni-lj.si/UPES/zupanc1175.pdf.
Puzzle Generation for Ultimate Tic-Tac-Toe

Maj Zirkelbach, mz5153@student.uni-lj.si, University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia
Aleksander Sadikov, aleksander.sadikov@fri.uni-lj.si, University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia

Abstract

Ultimate Tic-Tac-Toe is an interesting and popular variant of Tic-Tac-Toe that lacks available resources for improving gameplay skills. In this paper, we present a semi-automatic system for generating puzzles as a part of a larger tutorial application aimed at teaching Ultimate Tic-Tac-Toe. The puzzles are designed to enhance players' tactical and strategic understanding by presenting game scenarios where they must identify correct continuations. To ensure the quality of the generated puzzles, we tested the application with a group of volunteers. The results have shown that the number of solved puzzles positively impacted users' ability to reach higher strength levels but had less of an effect on lower levels.

Keywords

Ultimate Tic-Tac-Toe, puzzle generation, minimax algorithm, tutor application
1 Introduction

For centuries, people have enjoyed playing board games like chess and Go. Over time, these games have led to the development of extensive theory and the accumulation of knowledge, helping players navigate their complexity. Today, advanced artificial intelligence (AI) programs such as AlphaZero [14] surpass even the strongest human players, offering new insights into strategies. However, many lesser-known games have yet to be thoroughly explored, despite their popularity. One such game is Ultimate Tic-Tac-Toe, an advanced version of the classic Tic-Tac-Toe. This game is played on a 3x3 grid of local Tic-Tac-Toe boards, creating a global board (Figure 1a). The goal is to win three local boards in a row, while players must make their moves within specific local boards determined by their opponent's previous move. For example, if a player moves in the top-left corner of a local board, the next player must play on the top-left local board. If the designated board is full or decided, the player can choose any other available board (a minimal code sketch of this forced-board rule is given at the end of this introduction). Despite its apparent simplicity, the game has enough spatial complexity that it cannot currently be solved using brute-force methods.

While there are several online implementations of the game, most focus on building strong AI agents; there is a noticeable lack of resources aimed at teaching and helping players understand the deeper strategies of the game, which could make the learning curve more manageable for new and aspiring players. Thus, we have created an application that addresses the lack of learning tools available for Ultimate Tic-Tac-Toe. This article places particular emphasis on the puzzle generation aspect of our application, which is designed to enhance players' tactical and strategic thinking.

In Section 2 we present the related work, and in Section 3 we detail the technical aspects of the developed application. In Section 4 we present the implemented agents and their approximate strength. In Section 5 we provide a description of the different types of puzzles and the methodology for their construction. In Section 6 we present the evaluation, and we discuss the results in Section 7. Finally, in Section 8 we present the conclusions and give possible extensions and enhancements for future work.
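For readers who prefer code to prose, the following small sketch spells out the forced-board rule described above. The encoding, with local boards and cells both indexed 0–8 in row-major order, is a hypothetical choice for illustration and not the application's internal representation.

```python
# Sketch of the Ultimate Tic-Tac-Toe forced-board rule; the 0-8 row-major
# indexing of boards and cells is an illustrative assumption.
def playable_boards(last_move, closed_boards):
    """Boards the next player may play on.

    last_move:     (board, cell) of the opponent's previous move,
                   or None at the start of the game.
    closed_boards: set of local boards that are already decided or full.
    """
    if last_move is None:
        return set(range(9)) - closed_boards
    _, cell = last_move
    # The cell index of the last move designates the next local board.
    if cell in closed_boards:
        return set(range(9)) - closed_boards  # free choice
    return {cell}

# A move in the top-left cell (0) of any board sends the opponent to the
# top-left local board, unless that board is already decided or full.
print(playable_boards((4, 0), closed_boards=set()))   # -> {0}
print(playable_boards((4, 0), closed_boards={0, 4}))  # free choice among the rest
```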
2 Related Work

There are many implementations of Ultimate Tic-Tac-Toe available online, mostly appearing as mobile games aimed primarily at entertainment and lacking advanced playing agents [12, 9, 10], as well as web and desktop applications developed to create the strongest possible programs [15, 7, 13]. An example of the latter is an agent based on the ideas of the AlphaZero program [14], currently considered one of the strongest players of this game [13]. During the development of this agent, significant strategies were discovered, which were also useful in developing our application. Some researchers have attempted to solve the game theoretically, but the spatial complexity proved too great to allow for a complete solution [5].

It is important to differentiate between the various versions of Ultimate Tic-Tac-Toe. One variant allows the game to continue playing on already-won local boards, which drastically changes the game's dynamics. In this variant, researchers have demonstrated an optimal strategy for the starting player, who can win in 43 moves [1]. Further research has focused on enabling a more balanced game by introducing random opening moves, which reduces the predictability of forced wins [4]. Despite these interesting findings, research on these variants is not so relevant for us, as it does not contribute to the understanding of the main game.

While there is a lack of educational material specific to our game, much can be learned from related fields, such as chess, which has been extensively researched. The paper by Gobet and Jansen [8] describes a scientific approach to learning chess, which includes methods to improve memory, perception, and problem-solving skills in players. In this context, it focuses on the acquisition and organization of knowledge, including both explicit and implicit learning of tactics and strategies. This approach facilitates a deep understanding of games and the development of more effective learning methods.

Chess also offers highly sophisticated practical solutions from which we can learn a great deal. Platforms such as chess.com [2] and lichess.org [11] offer extensive resources and tools for learning chess, especially in the areas of tactics and openings. These platforms allow players to learn through interactive lessons, solving puzzles, and studying various openings, which contribute to a deeper understanding of the game and improve playing skills. This approach has proven extremely effective in helping players master complex strategic and tactical concepts in chess.

On the mentioned platforms, the methods for learning tactics are designed to allow players to solve problems based on concrete game situations, which improves pattern recognition and decision-making abilities in real games. Similarly, learning openings involves demonstrating optimal opening moves and their continuations, helping players develop effective strategies at the beginning of the game.

We have applied similar methods in our Ultimate Tic-Tac-Toe application. For example, adapting approaches for learning tactics can help users improve their recognition and solving of complex situations in the game, while learning openings helps to understand key opening moves and their impact on the further course of the game. By incorporating these methods into our application, we ensured more effective learning processes and improved the overall gaming experience.

3 Application Details

In addition to puzzle-solving, the app offers a comprehensive learning experience through various other features. It includes AI opponents of different difficulty levels, game analysis, and exploration of effective opening strategies, allowing players to refine their understanding in all phases of the game. The user interface ensures smooth navigation between these modes, making the app a versatile tool for both playing and learning Ultimate Tic-Tac-Toe. By integrating these elements, the app serves as a resource for players at all levels, helping them to deepen their understanding and improve their skills.

To reach a broader audience, the application was developed for both Android and Windows, the dominant operating systems in the market [15]. It uses Flutter components to deliver a responsive and user-friendly interface. Local data storage is utilized for user settings, progress, and puzzle data, ensuring efficient performance and data management.

We employed modern technologies and mobile development practices, including state management patterns, to create an easily expandable app for future updates and enhancements. The entire project is hosted on GitHub, though it is not open-source. Test versions of the app for Android and Windows are available on Google Drive: https://drive.google.com/drive/folders/1SnO_mN_ZVa2wXd0OGI07kLiYKQTDHuEe?usp=drive_link, while the Android production version is accessible on the Google Play Store: https://play.google.com/store/apps/details?id=com.uttt_tutor.
4 AI Agents and Rating System

Playing against intelligent agents allows users to refine their skills by competing against various virtual opponents. The application includes nine different agents, each varying in difficulty and gameplay strategies. These agents are designed using Minimax and Monte Carlo Tree Search [3] algorithms, which provide different levels of complexity and depth in move analysis. The agents and their approximate strengths are shown in Table 1.

To better understand the quality of the agents and evaluate user progress, we need to establish a system for measuring their strength. Since Ultimate Tic-Tac-Toe is not widely popular, there is no established system for rating player abilities. Therefore, we decided to use the chess rating system as an approximation for our agents.

The chess rating system is used to measure the playing strength of chess players. The most commonly used system is the Elo rating [6], which predicts the likelihood of one player winning against another based on their ratings:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}},$$

where $E_A$ represents the expected score for player A, $R_A$ is the rating of player A, and $R_B$ is the rating of player B.

Table 1: Approximate agent strengths. Each agent played 100 games (50 as X and 50 as O) against the agent one level lower. The result columns show the number of points each agent earned with each symbol, as well as the total score. A win awarded 1 point, while a draw awarded 0.5 points. The last line shows the results of the strongest freely available agent against level 9; it had the same amount of time to think, and they played 30 games.

Agent                | X    | O    | Combined | Estimated Rating
Confused Chimp - 1   | –    | –    | –        | 1
Goofy Goblin - 2     | 49   | 49   | 98       | 620
Casual Carl - 3      | 41.5 | 35.5 | 77       | 835
Average Joe - 4      | 37   | 25   | 62       | 926
Hustling Hugo - 5    | 39.5 | 34.5 | 74       | 1114
Witty Walter - 6     | 43   | 30   | 73       | 1293
Thinking Tiffany - 7 | 35   | 24   | 59       | 1361
Brainy Bob - 8       | 42.5 | 26.5 | 69       | 1506
Bossman - 9          | 36.5 | 22.5 | 59       | 1574
UTTT AI              | 14.5 | 12.5 | 27       | 1948
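As a quick sanity check of Table 1, the expected-score formula can be evaluated directly; the helper below is a straightforward transcription of the equation above.

```python
# The Elo expected-score formula from above as a small helper.
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Example: Casual Carl (835) vs. Goofy Goblin (620) yields about 0.78,
# in line with the observed 77 points out of 100 in Table 1.
print(round(expected_score(835, 620), 2))
```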
5 Puzzle Description and Methodology

In this section, we describe the different types of puzzles and the methodology employed to generate them for our game.

5.1 Puzzles

The puzzles in the application are divided into tactical and strategic, with each type of puzzle covering different aspects of the game and helping players improve specific skills.

Tactical puzzles are useful for understanding tactical ideas and are particularly applicable in the endgame and middlegame phases. They focus on specific situations that require precise and thoughtful moves, helping players develop the ability to think quickly and effectively. In total, we generated 1,263 tactical puzzles, distributed across five levels. The number of puzzles for each level is shown in Table 2.

Table 2: Number of tactical puzzles on each level.

Level | Puzzle Depth | Quantity
1     | 1            | 273
2     | 3            | 493
3     | 5            | 231
4     | 7            | 176
5     | 9            | 90

Unlike tactical puzzles, strategic puzzles aim to understand the position and long-term plans. They are instrumental in the opening and middlegame, where it is crucial to recognize strategic ideas and develop plans that provide an advantage as the game continues. There are 50 strategic puzzles available, currently arranged in one level, with the possibility of expansion in the future.

5.2 Tactical Puzzle Generation

To generate tactical puzzles, we developed a specialized minimax agent that builds a tree of all possible moves leading to victory from the solver's perspective. A key step in this process is the selection of tree branches to retain only relevant and correct solutions. It is essential to preserve all of the winner's possibilities while limiting the loser's responses to those that make finding a solution as difficult as possible. Therefore, we select the continuation that allows the longest possible game for the loser while leading to the fewest continuations for the winner.

From the tree, we extract all the correct solutions for the given position. For a high-quality puzzle, it must not have too many solutions. The criterion we set is that the number of solutions must be less than the depth of the puzzle. We also decided to discard all puzzles that have multiple correct continuations for the first move. This way, we avoid trivial puzzles that would be too simple. An example of a level 3 tactical puzzle with its generated solution tree is shown in Figure 1.

Figure 1: An example of a tactical puzzle and its generated solution tree. (a) Level 3 tactical puzzle. (b) Solution tree.

The generation of tactical puzzles for different difficulty levels was automated by conducting matches between agents of equal strength, with the search depth of both agents corresponding to the depth of the puzzle we wanted to find. We chose this approach to ensure that the resulting positions were interesting and balanced, as otherwise the stronger side would usually have an overly obvious advantage at the start of the puzzle, which would make it boring to solve.
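The two quality criteria can be summarized as a small filter over the solutions extracted from the tree; the list-of-move-sequences representation below is an illustrative assumption, not the generator's actual data structure.

```python
# Sketch of the tactical-puzzle quality filter described above; the
# move-sequence representation is an illustrative assumption.
def is_quality_puzzle(solutions, depth):
    """Accept a candidate puzzle extracted from the minimax solution tree.

    solutions: all correct move sequences for the winning side.
    depth:     puzzle depth (plies to the forced win).
    """
    if not solutions:
        return False
    if len(solutions) >= depth:       # too many solutions: ambiguous
        return False
    first_moves = {seq[0] for seq in solutions}
    return len(first_moves) == 1      # several correct first moves: too trivial

# A depth-3 puzzle with a single winning line is accepted.
print(is_quality_puzzle([("E5", "A1", "E4")], depth=3))  # True
```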
5.3 Strategic Puzzle Generation

Automating the creation of strategic puzzles is impossible without a program that could interpret the given position and simultaneously provide a human-understandable explanation. Additionally, generating strategic puzzles requires an agent with an advanced strategic understanding of the position, which our agents, using relatively simple heuristics, are incapable of. Therefore, we resorted to the most powerful freely available agent [13], which is based on the ideas of the AlphaZero program.

Thus, we generated the strategic puzzles manually. We searched for interesting and instructive positions that arose in games between the aforementioned agent and our stronger programs. We focused on moments when there was a significant deviation in the position evaluation between the two agents. When the agent with better strategic understanding detected an important change, we saved the given position, studied it more closely, and, based on our understanding of the game, formulated a solution. The most common examples of such situations involved sacrificing the edge board to gain control over the central board. A basic example of this can be seen in Figure 2.

Figure 2: Different interpretations of the same position, based on which we built the strategic puzzle. (a) User interface of the most powerful freely available agent: for the given position, it ran 1000 simulations and assessed the move F2 as the best with an 82% probability; it evaluates the position with a value of +16.85, which means it assigns approximately a 58.4% win probability to player X (a value of 0 means a draw, 100 a win, and -100 a loss). (b) Minimax agent with a search depth of 12: it marks the move F2 as the worst, as it does not recognize the long-term advantage.

6 Evaluation and Results

We conducted a quality analysis of the application with 14 volunteers. Their task was to use the app for an extended period to improve their knowledge of the game. We were interested in determining whether using the app had a positive impact on the development of their Ultimate Tic-Tac-Toe playing skills and whether progress was dependent on motivation or the time spent learning.

To assess individual progress, participants played against the agent at the start of testing to determine their initial skill level. The application then tracked the highest level each user defeated, providing an estimate of their improvement over time. This progress, in relation to the number of puzzles solved, is illustrated in Figure 3. For a more concrete interpretation of the obtained level strengths, refer to Section 4.

Figure 3: Progress in relation to the number of solved puzzles (number of solved puzzles vs. AI level beaten). Each arrow represents a human tester and indicates the change in the achieved level from the beginning to the end of the application's use.
7 Discussion

The results in Figure 3 indicate that solving more puzzles impacted users' ability to reach higher levels, but had less of an effect on lower levels. This is likely due to the fact that beginners can improve relatively quickly by simply playing the game, whereas advanced players require more effort to progress (e.g., it is a lot easier to gain 100 rating points when you are rated 500 than when you are rated 1500).

The reason for this is that at lower ratings there is generally more room for rapid improvement, because the skill gap between players tends to be more pronounced, and beginners can quickly benefit from fundamental knowledge and tactical awareness. As a result, achieving a higher rating is initially easier, as players can fix obvious mistakes and exploit weaker opponents' errors. However, as players reach higher levels, the competition becomes tougher and the differences in skill become more nuanced. Players at this level are more consistent and less likely to make blunders, so improving further requires mastering advanced strategies, pattern recognition, and deeper positional understanding, making progress slower and more challenging. This reflects the diminishing returns on improvement as you climb the rating ladder.

It must also be mentioned that users were free to use any tools within the app during testing, and solving more puzzles did not correlate with longer app usage. For a clearer assessment of puzzle significance, a controlled test focusing solely on puzzle-solving would be more appropriate.

8 Conclusion

In this work, we presented methods for generating puzzles for the game of Ultimate Tic-Tac-Toe. To evaluate the quality of these puzzles, we tracked how the number of solved puzzles impacted individual user progress. Our results indicate a correlation between the number of puzzles solved and the ability to reach stronger AI levels.

However, the evaluation could be refined by focusing exclusively on the puzzle-solving component, isolating it from other functionalities of the application. Additionally, the automation of tactical puzzle generation could be expanded to cover the middlegame phase, rather than being limited to endgame scenarios. Another area of improvement is providing clearer assessments of puzzle difficulty. This could be achieved by implementing a rating system that ranks puzzles based on completion rates, offering a more accurate measure of challenge for each puzzle.

Acknowledgements

The author would like to thank the family and friends who participated in testing the application.

References

[1] Guillaume Bertholon, Rémi Géraud-Stewart, Axel Kugelmann, Théo Lenoir, and David Naccache. 2020. At most 43 moves, at least 29: Optimal strategies and bounds for ultimate tic-tac-toe. doi: 10.48550/ARXIV.2006.02353.
[2] Chess.com. 2024. Chess.com. (June 2024). https://www.chess.com/.
[3] Rémi Coulom. 2006. Efficient selectivity and backup operators in Monte-Carlo tree search. In Proceedings Computers and Games 2006. Springer-Verlag.
[4] Justin Diamond. 2022. A practical method for preventing forced wins in ultimate tic-tac-toe. doi: 10.48550/ARXIV.2207.06239.
[5] Nelson Elhage. 2020. Solving ultimate tic tac toe. (July 2020). https://minimax.dev/docs/ultimate/.
[6] Arpad E. Elo. 1978. The Rating of Chessplayers, Past and Present. Arco Pub.
[7] Ofek Gila. 2019. Ultimate tic tac toe. (2019). https://www.theofekfoundation.org/games/UltimateTicTacToe/.
[8] Fernand Gobet and Peter J. Jansen. 2006. Training in chess: A scientific approach. Education and chess.
[9] Henryk. 2023. Ultimate tic tac toe. (Dec. 2023). https://play.google.com/store/apps/details?id=com.henrykvdb.sttt.
[10] HPStudios. 2024. Ultimate tic tac toe. (June 2024). https://play.google.com/store/apps/details?id=com.MertTaylan.UltimateTicTacToe.
[11] Lichess. 2024. Lichess. (June 2024). https://lichess.org/.
[12] Levi Moore. 2020. Ultimate tic tac toe. (Nov. 2020). https://play.google.com/store/apps/details?id=com.ZeroRare.UltimateTicTacToe.
[13] Arkadiusz Nowaczynski. 2021. ar-nowaczynski/utttai: AlphaZero-like AI solution for playing ultimate tic-tac-toe in the browser. (Dec. 2021). https://github.com/ar-nowaczynski/utttai.
[14] David Silver et al. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362, 6419, 1140–1144. doi: 10.1126/science.aar6404.
[15] Michael Xing. 2019. Ultimate tic-tac-toe. (Oct. 2019). https://www.michaelxing.com/UltimateTTT/v3/.

Ethical Consideration and Sociological Challenges in the Integration of Artificial Intelligence in Mental Health Services¹

Saša Poljak Lukek, sasa.poljaklukek@teof.uni-lj.si, University of Ljubljana, Faculty of Theology, Ljubljana, Slovenia

Abstract

This article explores the transformative potential of artificial intelligence (AI) in the field of mental health, with a particular focus on ethical considerations and social challenges. As AI tools become increasingly sophisticated, their ability to support mental health interventions presents both opportunities and challenges. We discuss the importance of a human-centered approach to AI development and the need for comprehensive ethical guidelines to ensure patient safety and well-being. In addition, this paper explores key social trends, such as the evolving dynamics of modern families, aging populations and migration, and considers how AI can be integrated into these contexts to improve mental health care.

Keywords:
Artificial Intelligence, Mental Health, Human-Centered Approach, Ethics, Modern Family Dynamics, Aging Populations, Migration
1 Introduction

1.1 Artificial intelligence in mental health services

Research on the application of AI in mental health care has shown some positive effects on the treatment of mental health problems [1], including early detection [2,3], providing feedback and personalized treatment plans [4], and the development of novel diagnostic tools [2].

AI in mental health services is implemented through models like chatbots, digital platforms, and avatar therapy, enhancing accessibility and treatment options. Chatbots provide therapy via natural language processing [5], while digital platforms support mostly online cognitive behavioral therapeutic interventions [6]. Avatar therapy uses AI to help patients manage conditions like dementia, autism spectrum disorder, and schizophrenia [7].

1.2 The prospect of artificial intelligence in mental health services

The future orientation underlines the importance of digital health in overcoming challenges such as limited access to services, especially in underserved regions, and outlines measures to ensure equitable access to digital health solutions across the European region [8]. The use of AI in mental health services raises questions about the role of non-human interventions, transparency in the use of algorithms, and the long-term impact on the understanding of illness and the human condition [9]. There are also concerns about potential bias, gaps in ethical and legal frameworks, and the possibility of misuse [10,11].

However, there are at least two potentially positive effects of the use of AI in healthcare: accessibility and personalization of services. AI offers new mechanisms to reach those who might not otherwise be served. AI-supported tools can improve the early detection and diagnosis of mental disorders [12]. AI chatbots have shown promise in increasing referrals to mental health services, especially for minority groups who are blocked from accessing traditional care [13]. These technologies can provide initial assessments, psychoeducation and even treatment, expanding access to mental health support [12]. AI-driven virtual assistants and wearable devices enable continuous monitoring and personalized care, which could improve patient outcomes [11,14].

The integration of artificial intelligence into mental health services represents a promising avenue for the development of personalized treatment plans through the sophisticated analysis of large datasets, enabling the identification of optimal therapeutic strategies tailored to specific client profiles [15,16]. This data-driven methodology enables the dynamic adaptation of therapy to the evolving needs of the client.

¹ This publication is a part of the research program The Intersection of Virtue, Experience, and Digital Culture: Ethical and Theological Insights, financed by the University of Ljubljana.
2 Overcoming Sociological Challenges through the Integration of Artificial Intelligence in Mental Health Services

2.1 Modern Family Dynamics

Modern family trends show that family structures and attitudes have changed significantly in recent decades [17]. There is a growing acceptance of different family forms, including unmarried cohabitation, same-sex relationships and joint custody arrangements [18]. These changes reflect an expansion of developmental idealism and increasing support for individual freedom in family choice [17].

On the other hand, there is a growing need for mental health services for families [19]. As the most vulnerable members of the family, the children, are usually also at risk, quick and effective action in family mental health is of great importance. Many families are struggling with various psychological problems. Together with the changing family structure, this means a great burden for every family member. In addition, access to psychologists, psychiatrists and therapists is limited, leading to an acute shortage of mental health professionals worldwide.

The accessibility of services is probably the strongest argument for the integration of AI in healthcare [12]. AI-powered conversational agents can improve the accessibility of mental health services: they are available online at all times and in underserved areas; they are scalable, reliable, fatigue-free and provide consistent support; they can adapt in a culturally sensitive way; and they can help with education and symptom management.

2.2 Aging Populations

AI offers promising solutions for supporting an aging population, particularly in addressing cognitive decline and mental health challenges. AI applications can monitor vital signs, health indicators, and cognition, as well as provide support for daily activities [20]. With an increasing number of elderly individuals, AI can support mental health care by providing companionship through intelligent animal-like robots (e.g., Paro, a harp seal robot) and assisting in monitoring and managing conditions like dementia [21,22]. AI can also help in tracking cognitive health and providing timely interventions to maintain mental well-being in older adults. These technologies have the potential to enhance independent living and quality of life for older adults and their families.

2.3 Migration

Migrants often face mental health challenges due to displacement, cultural adjustment and language barriers. AI can help migrants access mental health services by providing culturally and linguistically relevant resources and support. Chatbots and AI-driven platforms can bridge gaps in care by providing immediate help and continuity of care across different regions [23].

Recent research highlights the increasing role of digitalization and artificial intelligence (AI) in migration and mobility systems, especially in the context of the COVID-19 pandemic [24]. While these technologies offer opportunities for improving human rights and supporting international development, they also bring challenges that require careful consideration of design, development and implementation aspects. The integration of AI into migration processes requires a focus on human rights at all stages that goes beyond technical feasibility and companies' claims of inclusivity [24].
3 Ethical Consideration in the Integration of Artificial Intelligence in Mental Health Services

One of the main caveats to the use of AI in mental health is the introduction of new ethical standards to ensure user safety. The approach to integrating AI into services should therefore be human-centered [25]. Any innovation should focus on people in their most vulnerable position. It is important to assess all the risks with sufficient accuracy and to avoid misuse of AI as much as possible. The most important areas for ethical consideration when integrating AI into mental health services are privacy, bias, transparency and security.

Data privacy and security are critical in digital healthcare and require robust measures to protect sensitive information and prevent unauthorized access. Protecting privacy rights and ensuring informed consent are critical to maintaining trust and ethical standards in the use of personal health data [11]. Combining multiple data streams increases the risk of unauthorized use, which exacerbates privacy issues. Ensuring informed consent and maintaining transparency, especially in emergency operations, are critical to addressing these ethical concerns and protecting the rights of participants [26].

The use of AI in mental health treatment raises concerns about bias, particularly among marginalized populations who are already discriminated against and lack access to mental health care. It is uncertain whether AI-assisted psychotherapy can effectively address cultural differences and close treatment gaps in diverse populations [27]. In addition, populations that are traditionally marginalized in fields such as psychology and psychiatry are most vulnerable to algorithmic biases in AI and machine learning [27,28]. These biases limit the ability of AI to provide culturally and linguistically appropriate mental health resources, exacerbating existing inequalities. The persistence of such biases in AI systems not only risks increasing health inequalities, but also exacerbates existing social inequalities and raises critical ethical considerations [9].

The future of artificial intelligence in clinical settings is affected by a significant ethical dilemma concerning the trade-off between the performance and interpretability of machine learning models [29]. The lack of transparency in AI models makes it difficult to detect and correct biases. This underscores the need for greater transparency to ensure ethical and fair clinical decision-making.

In summary, the integration of AI into mental health services requires the establishment of strict ethical standards to protect the safety and privacy of users. A human-centered approach is essential, with a focus on dealing with potential bias, especially among marginalized groups, the risks associated with data privacy and security, and the challenges posed by the lack of transparency of AI models.

4 Conclusion

We propose to define AI as a new ethical entity in the field of mental health [30]. AI represents a novel artifact that changes interactions, concepts, epistemic fields and normative requirements. This change requires a redefinition of the role of AI, which lies on a spectrum between a tool and an agent. This shift underscores the need for new ethical standards and guidelines that recognize the unique status of AI as a distinct and influential actor in the field of mental health.

The integration of AI into services can, on the one hand, provide more efficient and faster solutions to some of the sociological challenges of today's society, but, on the other hand, it requires a precise and correct definition of the limits within which these models can be used. These efforts aim to bridge the gap between technology and human-centered care and ensure that AI complements, rather than replaces, the therapeutic benefits of human interaction.

Literature

[1] Sandhya Bhatt. 2024. Digital Mental Health: Role of Artificial Intelligence in Psychotherapy. Annals of Neurosciences, 0, 0, 1–11. doi: 10.1177/09727531231221612.
[2] Sijia Zhou, Jingping Zhao and Lulu Zhang. 2022. Application of Artificial Intelligence on Psychological Interventions and Diagnosis: An Overview. Frontiers in Psychiatry, 13 (March), 1–7. https://doi.org/10.3389/fpsyt.2022.811665.
[3] Klaudia Kister, Jakub Laskowski, Agata Makarewicz and Jakub Tarkowski. 2023. Application of artificial intelligence tools in diagnosis and treatment of mental disorders. Current Problems of Psychiatry, 24, 1–18. https://doi.org/10.12923/2353-8627/2023-0001.
[4] Rachel L. Horn and John R. Weisz. 2020. Can Artificial Intelligence Improve Psychotherapy Research and Practice? Administration and Policy in Mental Health and Mental Health Services Research, 47, 5, 852–855. https://doi.org/10.1007/s10488-020-01056-9.
[5] Kerstin Denecke, Alaa Abd-alrazaq and Mowafa Househ. 2021. Artificial Intelligence for Chatbots in Mental Health: Opportunities and Challenges. In Househ, M., Borycki, E., Kushniruk, A. (eds), Multiple Perspectives on Artificial Intelligence in Healthcare, 115–128. https://doi.org/10.1007/978-3-030-67303-1_10.
[6] Elias Aboujaoude, Lina Gega, Michelle B. Parish and Donald M. Hilty. 2020. Editorial: Digital Interventions in Mental Health: Current Status and Future Directions. Frontiers in Psychiatry, 11, 111. doi: 10.3389/fpsyt.2020.00111.
[7] Kay T. Pham, Amir Nabizadeh and Salih Selek. 2022. Artificial Intelligence and Chatbots in Psychiatry. Psychiatric Quarterly, 93, 1, 249–253. https://doi.org/10.1007/s11126-022-09973-8.
[8] WHO. 2022. Regional digital health action plan for the WHO European Region 2023–2030 (RC72). (July 2022). Retrieved August 20, 2024 from https://www.who.int/europe/publications/i/item/EUR-RC72-5.
[9] Amelia Fiske, Peter Henningsen and Alena Buyx. 2019. Your Robot Therapist Will See You Now: Ethical Implications of Embodied Artificial Intelligence in Psychiatry, Psychology, and Psychotherapy. Journal of Medical Internet Research, 21, 5, e13216. https://doi.org/10.2196/13216.
[10] Elizabeth C. Stade, Shannon Wiltsey Stirman, Lyle Ungar, Cody L. Boland, H. Andrew Schwartz, David B. Yaden, Joao Sedoc, Robert J. DeRubeis, Robb Willer and Johannes C. Eichstaedt. 2024. Large Language Models Could Change the Future of Behavioral Healthcare: A Proposal for Responsible Development and Evaluation. Mental Health Research, 3, 12. https://doi.org/10.1038/s44184-024-00056-z.
[11] David B. Olawade, Ojima Z. Wada, Aderonke Odetayo, Aanuoluwapo Clement David-Olawade, Fiyinfoluwa Asaolu and Judith Eberhardt. 2024. Enhancing mental health with Artificial Intelligence: Current trends and future prospects. Journal of Medicine, Surgery, and Public Health, 3, 100099. https://doi.org/10.1016/j.glmedi.2024.100099.
[12] Koki Shimada. 2023. The Role of Artificial Intelligence in Mental Health: A Review. Science Insights, 43, 5, 1119–1127. doi: 10.15354/si.23.re820.
[13] Max Rollwage, Johanna Habicht, Keno Juechems, Ben Carrington, Sruthi Viswanathan, Mona Stylianou, Tobias U. Hauser and Ross Harper. 2023. Using Conversational AI to Facilitate Mental Health Assessments and Improve Clinical Efficiency Within Psychotherapy Services: Real-World Observational Study. JMIR AI, 2, e44358. https://doi.org/10.2196/44358.
[14] David D. Luxton. 2020. Ethical implications of conversational agents in global public health. Bulletin of the World Health Organization, 98, 4, 285–287. https://doi.org/10.2471/BLT.19.237636.
[15] Leonard Bickman. 2020. Improving Mental Health Services: A 50-Year Journey from Randomized Experiments to Artificial Intelligence and Precision Mental Health. Administration and Policy in Mental Health, 47, 795–843. https://doi.org/10.1007/s10488-020-01065-8.
[16] Silvan Hornstein, Valerie Forman-Hoffman, Albert Nazander, Kristian Ranta and Kevin Hilbert. 2021. Predicting therapy outcome in a digital mental health intervention for depression and anxiety: A machine learning approach. Digital Health, 7, 1–11. doi: 10.1177/20552076211060659.
[17] Josef Ehmer. 2021. A historical perspective on family change in Europe. In Norbert F. Schneider and Michaela Kreyenfeld (eds), Research Handbook on the Sociology of the Family, 143–161. https://doi.org/10.4337/9781788975544.00018.
[18] Keera Allendorf, Linda Young-Demarco and Arland Thornton. 2023. Developmental Idealism and a Half-Century of Family Attitude Trends in the United States. Sociology of Development, 9, 1, 1–32. https://doi.org/10.1525/sod.2022.0003.
[19] WHO. 2022. World mental health report: Transforming mental health for all. (June 2022). Retrieved August 20, 2024 from https://www.who.int/publications/i/item/9789240049338.
[20] Sara J. Czaja and Marco Ceruso. 2022. The Promise of Artificial Intelligence in Supporting an Aging Population. Journal of Cognitive Engineering and Decision Making, 16, 4, 182–193. https://doi.org/10.1177/15553434221129914.
[21] Maria R. Lima. 2024. Home Integration of Conversational Robots to Enhance Ageing and Dementia Care. In ACM/IEEE International Conference on Human-Robot Interaction, 115–117. https://doi.org/10.1145/3610978.3638378.
[22] Wendy Moyle. 2019. The promise of technology in the future of dementia care. Nature Reviews Neurology, 15, 6, 353–359. https://doi.org/10.1038/s41582-019-0188-y.
[23] Zahra Abtahi, Miriam Potocky, Zarin Eizadyar, Shanna L. Burke and Nicole M. Fava. 2022. Digital Interventions for the Mental Health and Well-Being of International Migrants: A Systematic Review. Research on Social Work Practice, 33, 5, 518–529. doi: 10.1177/10497315221118854.
[24] Marie McAuliffe, Jenna Blower and Ana Beduschi. 2021. Digitalization and artificial intelligence in migration and mobility: Transnational implications of the COVID-19 pandemic. Societies, 11, 4, 135. https://doi.org/10.3390/soc11040135.
[25] Luke Balcombe and Diego de Leo. 2022. Human-Computer Interaction in Digital Mental Health. Informatics, 9, 1, 14. https://doi.org/10.3390/informatics9010014.
[26] Nicholas C. Jacobson and Matthew D. Nemesure. 2021. Using Artificial Intelligence to Predict Change in Depression and Anxiety Symptoms in a Digital Intervention: Evidence from a Transdiagnostic Randomized Controlled Trial. Psychiatry Research, 295, 113618. https://doi.org/10.1016/j.psychres.2020.113618.
[27] Bennett Knox, Pierce Christoffersen, Kalista Leggitt, Zeia Woodruff and Matthew H. Haber. 2023. Justice, Vulnerable Populations, and the Use of Conversational AI in Psychotherapy. American Journal of Bioethics, 23, 5, 48–50. https://doi.org/10.1080/15265161.2023.2191040.
[28] Zoha Khawaja and Jean C. Bélisle-Pipon. 2023. Your robot therapist is not your therapist: Understanding the role of AI-powered mental health chatbots. Frontiers in Digital Health, 5, 1278186. doi: 10.3389/fdgth.2023.1278186.
[29] Danilo Bzdok and Andreas Meyer-Lindenberg. 2018. Machine Learning for Precision Psychiatry: Opportunities and Challenges. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 3, 3, 223–230. https://doi.org/10.1016/j.bpsc.2017.11.007.
[30] Jana Sedlakova and Manuel Trachsel. 2023. Conversational Artificial Intelligence in Psychotherapy: A New Therapeutic Tool or Agent? American Journal of Bioethics, 23, 5, 4–13. https://doi.org/10.1080/15265161.2022.2048739.
Optimization Problem Inspector: A Tool for Analysis of Industrial Optimization Problems and Their Solutions

Tea Tušar, Jordan N. Cork, Andrejaana Andova, Bogdan Filipič
Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
{tea.tusar, jordan.cork, andrejaana.andova, bogdan.filipic}@ijs.si

Abstract

This paper presents the Optimization Problem Inspector (OPI) tool for assisting researchers and practitioners in analyzing industrial optimization problems and their solutions. OPI is a highly interactive web application requiring no programming knowledge to be used. It helps the users to better understand their problem by: 1) comparing the landscape features of the analyzed problem with those of some well-understood reference problems, and 2) visualizing the values of solution variables, objectives, constraints and any other user-specified solution parameters. The features of OPI are presented using a bi-objective pressure vessel design problem as an example.

Keywords

optimization, black-box problems, sampling, problem characterization, visualization

1 Introduction

Industrial optimization problems often require simulations to evaluate solutions. For example, in electrical motor design [18, 19], assessing the efficiency and electromagnetic performance of a proposed design is done by running a simulator that analyzes the motor magnetic field and flux distribution. Such evaluations are black boxes to the user and the optimization algorithm alike, i.e., the underlying functions cannot be explicitly expressed, which makes the problem hard to understand and solve.

The established way to gain a better understanding of industrial problems is through the analysis of their solutions. Depending on the problem at hand, this can be a challenging task, as industrial problems often have a large number of variables, multiple objectives and constraints [20].

The Optimization Problem Inspector (OPI) presented in this paper is a tool conceived to ease this task for both problem experts and optimization algorithm developers. OPI provides two ways to further the understanding of an optimization problem:

(1) It computes a set of landscape features of the analyzed problem and compares them to those of well-understood reference problems.
(2) It provides visualizations of solutions through the values of their variables, objectives, constraints and any other user-specified solution parameters.

OPI is a web application, implemented by a Python library called optimization-problem-inspector included in the PyPI Python package index¹. It is highly interactive and requires no programming knowledge to be used.

¹ https://pypi.org/project/optimization-problem-inspector/

Freely available contemporary software tools for multiobjective optimization, such as DESDEO [12], jMetal [7] (and jMetalPy [2]), the MOEA Framework [8], ParadisEO-MOEO [10], platEMO [17], pygmo [3], pymoo [4], and Scilab [15], provide the implementation of various optimization algorithms and test problems. While the majority of them include some visualization of solutions, the plots are mostly focused on showing algorithm results for the purpose of comparing algorithm performance and not on increasing problem understanding. In addition, none of these tools compute additional problem features as OPI does. Therefore, OPI brings a unique perspective to optimization problem analysis and understanding.

Next, Section 2 presents the real-world problem that will be used to showcase the features of OPI in Section 3. The paper concludes with some remarks in Section 4.

2 Real-World Use Case

Our chosen real-world problem is a version of the well-known pressure vessel design problem, first proposed more than 30 years ago [16]. In this work, we adapt the formulation from [5] to handle the pressure vessel volume as a constraint, as well as an objective. We also remove one unnecessary constraint and use the original boundary constraints for the first two variables.

A pressure vessel is a tank, designed to store compressed gasses or liquids. It consists of a cylindrical middle part capped at both ends by hemispherical heads.
The pressure vessel has four design variables (see Figure 1): the shell thickness, x1 = Ts, the head thickness, x2 = Th, the inner radius, x3 = R, and the length of the cylindrical section of the vessel, x4 = L. The two thickness variables are integer multiples of 0.0625 inches, which correspond to the available thicknesses of rolled steel plates, while the length and the radius are continuous. The problem has three constraints, two on the search variables and one on the volume. Its two objectives are to minimize the total costs, including the costs of the material, forming and welding, and to maximize the volume. The problem is formally defined as follows:

$$
\begin{aligned}
\min\; f_1(\mathbf{x}) &= 0.6224\,z_1 x_3 x_4 + 1.7781\,z_2 x_3^2 + 3.1661\,z_1^2 x_4 + 19.84\,z_1^2 x_3\\
\max\; f_2(\mathbf{x}) &= \pi x_3^2 x_4 + \tfrac{4}{3}\pi x_3^3\\
\text{subject to}\quad g_1(\mathbf{x}) &= 0.0193\,x_3 - z_1 \le 0\\
g_2(\mathbf{x}) &= 0.00954\,x_3 - z_2 \le 0\\
g_3(\mathbf{x}) &= f_2(\mathbf{x}) \ge 1\,296\,000\\
x_1 &\in \{18, \ldots, 32\}\\
x_2 &\in \{10, \ldots, 32\}\\
x_3, x_4 &\in [10, 200],
\end{aligned}
$$

where $z_1 = 0.0625\,x_1$ and $z_2 = 0.0625\,x_2$.

Figure 1: Pressure vessel design variables (x1 = Ts, x2 = Th, x3 = R, x4 = L).
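Written out in code, the formulation reads as follows. This is a minimal sketch of the kind of external evaluation script that produces the data OPI consumes (see Section 3.3), not part of the OPI library; rewriting g3 in the "feasible when ≤ 0" form is a convention chosen here for uniformity.

```python
# The pressure vessel problem as an external evaluation function; a sketch
# for producing OPI-style data, not code from the OPI library itself.
import math

def evaluate(x1, x2, x3, x4):
    z1, z2 = 0.0625 * x1, 0.0625 * x2                    # thicknesses in inches
    f1 = (0.6224 * z1 * x3 * x4 + 1.7781 * z2 * x3**2
          + 3.1661 * z1**2 * x4 + 19.84 * z1**2 * x3)    # total cost (minimize)
    f2 = math.pi * x3**2 * x4 + 4.0 / 3.0 * math.pi * x3**3  # volume (maximize)
    return {
        "f1": f1,
        "f2": f2,
        "g1": 0.0193 * x3 - z1,    # feasible when <= 0
        "g2": 0.00954 * x3 - z2,   # feasible when <= 0
        "g3": 1_296_000 - f2,      # volume constraint rewritten as <= 0
    }

print(evaluate(x1=20, x2=12, x3=50.0, x4=100.0))
```

3 Optimization Problem Inspector Features

OPI is a web application, organized into five functional sections and a help section providing guidance to the user. OPI expects the user to provide the problem specification and its data (evaluated problem solutions). Then, it generates and visualizes comparisons to artificial reference problems and visualizes the provided data. Next, we describe the main features of OPI through its five content sections: problem specification, sample generation, data, comparison to reference problems, and data visualization.

3.1 Problem Specification

In the first OPI section, the user can provide the specification of the industrial problem to be studied. The tool needs this information to properly generate the samples, described in Section 3.2, and to set up the visualisations.

The problem specification must be given in the yaml file format and needs to contain some basic information about the problem parameters (variables, objectives, constraints) to be included in the analysis. OPI can handle one or more objectives and zero or more constraints. In addition to variables, objectives and constraints, the user can specify any number of other parameters that they want analyzed and visualized, for example, the name of the algorithm that found a solution or the time required to evaluate a solution.

For each of the parameters, the user needs to specify its name and its grouping (whether it is a variable, objective, constraint or something else). For variables, their type (continuous, integer or categorical) and the upper and lower bounds (for non-categorical types) are also required. An example yaml file, specifying a constrained multiobjective problem with several variables, is already provided within the tool to guide the user.

For the pressure vessel design problem, we can input four variables (the first two are integer and the last two are continuous), two objectives and three constraints. Alternatively, we can decide to skip the individual constraints and only use the total constraint violation instead.

3.2 Sample Generation

In OPI, a sample is a set of x-values, corresponding to the variables set in the problem specification section. In other words, a sample is a set of non-evaluated solutions.

If needed, the sample can be generated by the tool itself, based on the variable information provided in the problem specification step. However, this is not a required step in using OPI. A user that already has a set of (evaluated) solutions to work with can skip it and input the data directly (see Section 3.3).

Sample generation requires one to choose the number of desired samples, set to a default of 100, and the sample generation method. Three sample generation methods are supported: random, Sobol and Latin Hypercube, with random sampling being the default. The user may alter the settings of these sampling methods, such as the random generator seed. Selecting the button to generate and download the sample will download it in a csv-formatted file.

In the pressure vessel use case, OPI warns the user that not all sample generation methods are appropriate. In fact, the Sobol sampler and the Latin Hypercube sampler are not compatible with non-continuous parameters. If used nevertheless, they may produce unexpected results.

For completeness, the stand-alone sketch below generates a comparable random sample for the pressure vessel variables and writes it to a csv file. It is not the OPI implementation; the integer variables are drawn uniformly from their admissible levels, which is exactly what makes the Sobol and Latin Hypercube samplers problematic here.

```python
# Stand-alone sketch of random sample generation for the pressure vessel
# variables (not the OPI implementation); integer variables are drawn
# uniformly from their admissible levels.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
n = 100  # OPI's default sample size

sample = pd.DataFrame({
    "x1": rng.integers(18, 33, size=n),      # integer, {18, ..., 32}
    "x2": rng.integers(10, 33, size=n),      # integer, {10, ..., 32}
    "x3": rng.uniform(10.0, 200.0, size=n),  # continuous, [10, 200]
    "x4": rng.uniform(10.0, 200.0, size=n),  # continuous, [10, 200]
})
sample.to_csv("pressure_vessel_sample.csv", index=False)  # file name is arbitrary
```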
3.3 Data

In OPI, the data is essentially a set of evaluated solutions, where each solution must contain a value for all objectives, constraints and other parameters included in the problem specification. The evaluation is conducted externally to the tool.

The data needs to be uploaded in a file in csv format. If any parameters from the problem specification are missing from the data, the tool displays a warning message. Any data parameters that are not included in the problem specification are ignored without raising any warnings. When the data is correctly input, the user can view and inspect it in tabular format.

Inputting the data completes the setup stage of the process. The user may then begin generating visualisations to assist them in understanding their problem.

3.4 Comparison to Reference Problems

The first visualization mechanism provided by OPI visually compares the problem to a set of artificial reference problems with known properties. This is done by displaying the landscape features of the user-defined problem alongside the same features of each of the reference problems in a parallel coordinates plot. The plot is interactive: the user can highlight some of the problems by brushing along one of the parallel axes. In addition, the feature values can be viewed in a table and downloaded to a file in csv format.

The reference problems can be set by the user; however, they are confined to the collection labelled here as GBBOB, i.e., generalised BBOB, where BBOB stands for the well-known suite of 24 Black-Box Optimization Benchmarking problems with diverse properties [9]. OPI provides a generator of GBBOB problems that match the analyzed problem in terms of the number of variables and objectives and the presence or absence of constraints. For the objectives and (optionally) the constraint, any single-objective BBOB problem instance can be used. The user can specify the desired GBBOB problems in the yaml format. OPI already contains five GBBOB problems to start with.

A problem can be characterized by a large number of features, most of which are hard for a human to interpret. In OPI, we included the following problem landscape features that are understandable to an expert user [1, 11, 13, 14]: CorrObj, MinCV, FR, constr_obj_corr, H_MAX, UPO_N, PO_N and a set of neighborhood features. CorrObj is a feature that shows the correlation between the objectives. MinCV represents the minimum constraint violation among all solutions in the population. FR represents the proportion of feasible solutions in the population. constr_obj_corr presents the maximum correlation between the constraints and all the problem objectives. H_MAX is the maximum information content among all objectives. UPO_N is the proportion of unconstrained non-dominated solutions, while PO_N is the proportion of the constrained non-dominated solutions. The neighbourhood features, denoted by neighbourhood_feats, are a collection of features explaining the neighborhood of solutions, e.g., how many neighbors of a solution dominate the solution, how many neighbors are dominated by the solution, how many are incomparable to the solution, how close the neighboring solutions are, etc. OPI offers a total of 16 features, but the user can choose which to compute and visualize.

Figure 2: The initial part of the parallel coordinate plot visualizing feature values for the analyzed problem and the chosen set of artificial test problems.

Figure 2 shows the initial part of the parallel coordinates plot (as the entire plot would not fit the paper) for the pressure vessel problem. In the comparison, we use the default five GBBOB reference problems as well as a custom-created one. We notice that the pressure vessel problem is most similar to the custom GBBOB problem with the first objective equal to the step ellipsoid function $f_7$, the second to the multimodal peaks function $f_{22}$, and the constraint to the linear function $f_5$. This similarity might be due to our mixed-integer problem containing plateaus in the continuous landscape space in which the features are computed, which is similar to the step ellipsoid function, and having linear constraints.
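The exact feature definitions follow [1, 11, 13, 14]; to give a flavor of how simple some of them are, the sketch below computes CorrObj, MinCV and FR from a sample, assuming a two-objective problem and the common convention that $g \leq 0$ means a constraint is satisfied (for the pressure vessel, $g_3$ would first need its sign flipped):

```python
import numpy as np

def basic_features(F, G):
    """Compute a few of the listed landscape features from a sample.

    F: (n, 2) array of objective values; G: (n, k) array of
    constraint values with g <= 0 meaning satisfied.
    """
    corr_obj = np.corrcoef(F[:, 0], F[:, 1])[0, 1]  # CorrObj
    cv = np.maximum(G, 0).sum(axis=1)  # violation per solution
    min_cv = cv.min()                  # MinCV
    fr = np.mean(cv == 0)              # FR: share of feasible solutions
    return corr_obj, min_cv, fr

rng = np.random.default_rng(0)
print(basic_features(rng.random((100, 2)), rng.random((100, 3)) - 0.5))
```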
3.5 Data Visualization

In the data visualization section of the web application, the supplied data can be visualized using either a scatter plot matrix or a parallel coordinates plot. In both cases, the user can choose which problem parameters to visualize among all those listed in the problem specification. Additionally, simple data filtering that limits any variable between the desired minimum and maximum values is also supported and can be manipulated via the OPI interface in yaml format. The parameter used for coloring the solutions, as well as the color map, can also be specified by the user. Both visualizations support interaction and can be downloaded in html or png format.

3.5.1 Scatter Plot Matrix. The scatter plot matrix consists of $n^2$ plots for $n$ chosen problem parameters, as it contains 2-D scatter plots for all possible parameter pairs. In OPI, the user can apply brushing and linking to select the desired solutions in one or more of the scatter plots. These are then highlighted in all scatter plots in the matrix.

Figure 3 shows such a scatter plot matrix for our pressure vessel problem. This visualization includes data from two sources. The first comes from a random sampling of the search space (shown in light blue) and the second from running the NSGA-II algorithm [6] on this problem for $2 \cdot 10^6$ function evaluations to achieve a good approximation of the Pareto front (shown in black). The two sources are set apart by a custom parameter that is then used for coloring the solutions. Some solutions from Figure 3 are highlighted; see the rectangle in the $(x_3, x_1)$ scatter plot (third from the left in the top row). These plots clearly show the linear relationship of the near-optimal solutions between $x_1$ and $x_2$ as well as $x_1$ and $x_3$. When only $f_1$ and $f_2$ are chosen, it is distinctly visible that the Pareto set approximation is piece-wise linear and disconnected.

Figure 3: Random (light blue) and near-optimal (black) solutions of the pressure vessel design problem visualized in OPI with a scatter plot matrix containing variables $x_1$ to $x_4$ and objectives $f_1$ and $f_2$.

3.5.2 Parallel Coordinates Plot. The parallel coordinates plot shows all chosen parameters as parallel coordinates and solutions as lines in the plot. Similarly to the scatter plot matrix, interaction via brushing and linking is supported to select solutions that fit the desired values.
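The paper does not disclose how OPI renders these plots; an equivalent interactive scatter plot matrix with html and png export can, for instance, be produced with plotly (all column names below, including the coloring parameter "source", are illustrative):

```python
import pandas as pd
import plotly.express as px

# Assumes a csv of evaluated solutions with a 'source' column that
# distinguishes random from near-optimal solutions.
df = pd.read_csv("solutions.csv")
fig = px.scatter_matrix(
    df,
    dimensions=["x1", "x2", "x3", "x4", "f1", "f2"],
    color="source",  # custom parameter used for coloring
)
fig.write_html("scatter_matrix.html")  # interactive export
fig.write_image("scatter_matrix.png")  # static export (needs kaleido)
```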
4 Conclusions

This work presented the features of the Optimization Problem Inspector – a web application to support problem experts and algorithm designers in gaining a better understanding of industrial optimization problems. The tool provides comparisons to well-understood reference problems and interactive and highly customizable visualizations, which can be exported in html and png formats. Samples can be exported and solutions imported using the standard csv format, which makes data exchange between OPI and various optimization software easy. OPI functionality is made to be simple and at the same time flexible. It is therefore usable by non-experts and experts alike, providing a wide range of angles from which to view the problems.

Acknowledgements

The authors acknowledge the financial support from the Slovenian Research and Innovation Agency (research core funding No. P2-0209 "Artificial Intelligence and Intelligent Systems", and project No. N2-0254 "Constrained Multiobjective Optimization Based on Problem Landscape Analysis") and from the Jožef Stefan Innovation Fund (project "A Tool for Analysis of Industrial Optimization Problems and Their Solutions"). This publication is also based upon work from COST Action "Randomised Optimisation Algorithms Research Network" (ROAR-NET), CA22137, supported by COST (European Cooperation in Science and Technology). We are grateful to Jernej Zupančič for implementing the core functionalities of the Optimization Problem Inspector.

References

[1] Hanan Alsouly, Michael Kirley, and Mario Andrés Muñoz. 2023. An instance space analysis of constrained multiobjective optimization problems. IEEE Transactions on Evolutionary Computation, 27, 5, 1427–1439. doi: 10.1109/TEVC.2022.3208595.
[2] Antonio Benítez-Hidalgo, Antonio J. Nebro, José García-Nieto, Izaskun Oregi, and Javier Del Ser. 2019. jMetalPy: A Python framework for multi-objective optimization with metaheuristics. Swarm and Evolutionary Computation, 51, 100598. doi: 10.1016/J.SWEVO.2019.100598.
[3] Francesco Biscani and Dario Izzo. 2020. A parallel global multiobjective framework for optimization: pagmo. Journal of Open Source Software, 5, 53, 2338. doi: 10.21105/joss.02338.
[4] Julian Blank and Kalyanmoy Deb. 2020. Pymoo: Multi-objective optimization in Python. IEEE Access, 8, 89497–89509. doi: 10.1109/ACCESS.2020.2990567.
[5] Carlos A. Coello Coello. 2002. Theoretical and numerical constraint-handling techniques used with evolutionary algorithms: A survey of the state of the art. Computer Methods in Applied Mechanics and Engineering, 191, 11, 1245–1287. doi: 10.1016/S0045-7825(01)00323-1.
[6] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6, 2, 182–197. doi: 10.1109/4235.996017.
[7] Juan José Durillo and Antonio J. Nebro. 2011. jMetal: A Java framework for multi-objective optimization. Advances in Engineering Software, 42, 10, 760–771. doi: 10.1016/J.ADVENGSOFT.2011.05.014.
[8] David Hadka. 2024. MOEA Framework: A free and open source Java framework for multiobjective optimization. Computer software, version 4.4. https://github.com/MOEAFramework/MOEAFramework.
[9] Nikolaus Hansen, Steffen Finck, Raymond Ros, and Anne Auger. 2009. Real-Parameter Black-Box Optimization Benchmarking 2009: Noiseless Functions Definitions. Research Report RR-6829. INRIA. https://hal.inria.fr/inria-00362633v2.
[10] Arnaud Liefooghe, Laetitia Jourdan, and El-Ghazali Talbi. 2011. A software framework based on a conceptual unified model for evolutionary multiobjective optimization: ParadisEO-MOEO. European Journal of Operational Research, 209, 2, 104–112. doi: 10.1016/J.EJOR.2010.07.023.
[11] K. M. Malan, J. F. Oberholzer, and A. P. Engelbrecht. 2015. Characterising constrained continuous optimisation problems. In Proceedings of the 2015 IEEE Congress on Evolutionary Computation (CEC 2015), 1351–1358. doi: 10.1109/CEC.2015.7257045.
[12] Giovanni Misitano, Bhupinder Singh Saini, Bekir Afsar, Babooshka Shavazipour, and Kaisa Miettinen. 2021. DESDEO: The modular and open source framework for interactive multiobjective optimization. IEEE Access, 9, 148277–148295. doi: 10.1109/ACCESS.2021.3123825.
[13] Mario A. Muñoz, Michael Kirley, and Saman K. Halgamuge. 2015. Exploratory landscape analysis of continuous space optimization problems using information content. IEEE Transactions on Evolutionary Computation, 19, 1, 74–87. doi: 10.1109/TEVC.2014.2302006.
[14] Cyril Picard and Jürg Schiffmann. 2021. Realistic constrained multiobjective optimization benchmark problems from design. IEEE Transactions on Evolutionary Computation, 25, 2, 234–246. doi: 10.1109/TEVC.2020.3020046.
[15] Philippe Roux and Perrine Mathieu. 2016. Scilab: I. Fundamentals. In Scilab, from theory to practice. D-Booker Editions.
[16] E. Sandgren. 1990. Nonlinear integer and discrete programming in mechanical design optimization. Journal of Mechanical Design, 112, 2, 223–229.
[17] Ye Tian, Ran Cheng, Xingyi Zhang, and Yaochu Jin. 2017. PlatEMO: A MATLAB platform for evolutionary multi-objective optimization. IEEE Computational Intelligence Magazine, 12, 4, 73–87. doi: 10.1109/MCI.2017.2742868.
[18] Tea Tušar, Peter Korošec, and Bogdan Filipič. 2023. A multi-step evaluation process in electric motor design. In Slovenian Conference on Artificial Intelligence, Proceedings of the 26th International Multiconference Information Society (IS 2023). Vol. A. Jožef Stefan Institute, Ljubljana, Slovenia, 48–51.
[19] Tea Tušar, Peter Korošec, Gregor Papa, Bogdan Filipič, and Jurij Šilc. 2007. A comparative study of stochastic optimization methods in electric motor design. Applied Intelligence, 27, 2, 101–111. doi: 10.1007/S10489-006-0022-2.
[20] Koen van der Blom, Timo M. Deist, Vanessa Volz, Mariapia Marchi, Yusuke Nojima, Boris Naujoks, Akira Oyama, and Tea Tušar. 2023. Identifying properties of real-world optimisation problems through a questionnaire. In Many-Criteria Optimization and Decision Analysis: State-of-the-Art, Present Challenges, and Future Perspectives. Natural Computing Series. Dimo Brockhoff, Michael Emmerich, Boris Naujoks, and Robin C. Purshouse, editors. Springer, 59–80. doi: 10.1007/978-3-031-25263-1_3.
Multi-Agent System for Autonomous Table Football: A Winning Strategy

Marcel Založnik* (Jožef Stefan Institute, Ljubljana, Slovenia, marcel.zaloznik@gmail.com) and Kristjan Šoln* (Faculty of Electrical Engineering, University of Ljubljana, Ljubljana, Slovenia, ks4835@student.uni-lj.si)
* Both authors contributed equally to this research.

Abstract

This paper presents a multi-agent system (MAS) for autonomous table football, developed for the FuzbAI competition at the University of Ljubljana. Our system consists of four independent agents, each dynamically assigned specific roles (Goalkeeper, Defender, Midfielder, and Attacker) based on real-time game analysis. This role-based architecture enabled seamless coordination between offensive and defensive strategies, allowing our team to secure first place. We describe the simulation framework used, the processing of sensor data, and the control strategies that allowed the agents to execute precise actions in a dynamic environment. The results highlight the effectiveness of adaptive, role-based decision-making, demonstrating the potential of MAS in real-time, competitive settings.

Keywords

multi-agent system, autonomous table football, role-based strategy, real-time decision making, AI in robotics

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.scai.8341

1 Introduction

The FuzbAI competition, held as part of the "Dnevi Avtomatike" event at the Faculty of Electrical Engineering, University of Ljubljana, is a premier contest for students specializing in automation and artificial intelligence [11]. This event challenges participants to develop intelligent autonomous agents capable of playing table football without human intervention. The competition not only serves as a platform for demonstrating technical skills but also fosters innovation in the application of AI and machine learning techniques in real-time environments. Figure 1 illustrates the table setup used in the competition.

Figure 1: Table setup for the FuzbAI autonomous football competition.
The FuzbAI competition is structured in a way that teams must design and implement a fully autonomous system capable of effectively competing against other AI-driven systems. Each match is a test of the participants' ability to integrate advanced algorithms and robotics, simulating the dynamics of a real football game on a miniature scale. The competitive format includes both qualification rounds and knockout stages, ensuring that only the most capable and innovative solutions advance to the final stages.

Our entry into the FuzbAI competition focused on the development of a multi-agent system (MAS), where each of our four rods functioned as an independent agent. These agents were designed to collaborate through a streamlined decision-making process, selecting roles that dictated their actions during gameplay. This strategic approach enabled our team to outperform competitors and ultimately secure first place in the competition.

This paper delves into the development and implementation of our multi-agent system. We explore the architectural choices, the role-based decision-making strategies employed by each agent, and the overall system's performance in the context of the FuzbAI competition.

2 Competition Setup and System Description

The FuzbAI competition required all participants to develop programs capable of playing table football autonomously. To facilitate this, the competition provided a standardized simulation environment and a set of initial tools that every team used as the foundation for their development. This section describes the simulation framework, the types of data available from the system, and the means by which agents could interact with both the simulated and real game environments.

2.1 Simulation Framework

Participants were provided with a Python-based simulation framework designed to emulate a real table football match, as shown in Figure 2. This simulator accurately replicated the physics of the game, including the movement of the ball and rods, and managed the interactions between the environment and the agents controlling the rods. The framework included fundamental functionalities such as ball tracking, rod positioning, and interaction rules, allowing all teams to concentrate on AI development without needing to construct the simulation infrastructure themselves.

Figure 2: Simulator interface.

One of the key features of the competition setup was that the interaction protocols for the simulator and the physical table were identical. The same signals and commands used to control the actuators in the simulator were also used for the real table without any modification. This feature ensured that teams could seamlessly transition their algorithms from the simulated environment to the physical table setup, which was used in the final rounds of the competition. As a result, the simulation provided a consistent testing ground that mirrored the actual physical setup, enabling teams to develop and refine their strategies under uniform conditions.
2.2 Sensor Data

Both the simulation environment and the real table provided each team with data from two cameras, one placed on each side of the table. Each camera captured different views of the game, and teams had to decide how to combine the information from both cameras. The data provided by each camera included:

• Ball position: the coordinates of the ball on the 2D plane of the table.
• Ball speed: the velocity of the ball.
• Ball size: the area of the ball in the captured image (in pixels).
• Rod positions: the calibrated positions of all rods (in the interval [0, 1]).
• Rod angles: the calibrated angles of all rods (in the interval [−32, +32]).

This camera data was streamed continuously, requiring teams to process and merge the inputs from both cameras to accurately interpret the game's state. The accuracy and frequency of the data were sufficient to enable real-time decision-making by the autonomous agents, whether interacting with the simulator or the physical table.

2.3 Actuator Outputs

To interact with the environment, each agent could send commands to the actuators that controlled the rods. The system allowed for two primary types of commands:

• Translatory movement: moving the rod left or right across the table.
• Rotational movement: rotating the rod to control the angle at which the players struck the ball.

Precise and timely commands were crucial for effective game control, as they enabled the agents to optimally position their figures, strike the ball accurately, and execute defensive or offensive strategies effectively.
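How the two camera streams were fused was left to each team. One simple heuristic, sketched below with illustrative field names (the paper does not prescribe this scheme), is to trust the camera that sees the ball best and to average the calibrated rod readings:

```python
from dataclasses import dataclass

@dataclass
class CameraFrame:
    """Per-camera measurements from Section 2.2 (names illustrative)."""
    ball_xy: tuple        # ball position on the 2D table plane
    ball_speed: float
    ball_area_px: float   # apparent ball size in the image
    rod_positions: list   # calibrated, in [0, 1]
    rod_angles: list      # calibrated, in [-32, +32]

def merge(frame_a: CameraFrame, frame_b: CameraFrame) -> CameraFrame:
    """Prefer the camera seeing the ball larger (closer, less occluded)
    for ball data; average the calibrated rod readings."""
    best = frame_a if frame_a.ball_area_px >= frame_b.ball_area_px else frame_b
    rods = [(a + b) / 2 for a, b in zip(frame_a.rod_positions,
                                        frame_b.rod_positions)]
    angles = [(a + b) / 2 for a, b in zip(frame_a.rod_angles,
                                          frame_b.rod_angles)]
    return CameraFrame(best.ball_xy, best.ball_speed, best.ball_area_px,
                       rods, angles)

a = CameraFrame((0.40, 0.50), 1.2, 90.0, [0.5] * 8, [0.0] * 8)
b = CameraFrame((0.41, 0.50), 1.1, 70.0, [0.52] * 8, [2.0] * 8)
print(merge(a, b).ball_xy)  # -> (0.40, 0.50), from the better camera
```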
3 Related Work

Research on multi-agent systems (MAS) and their application in robotic football has been extensively explored. This section reviews some contributions that have informed the development of autonomous systems for table football and real football.

Moos et al. (2024) [5] developed an automated football table as a research platform for reinforcement learning, highlighting the challenges of transferring learned behaviors from simulation to real-world environments and the need for robust algorithms to handle uncertainties. While reinforcement learning is a common approach in such studies, we did not achieve satisfactory results with it; we therefore decided to use multi-agent systems instead. Klančar et al. (2002) [4] investigated cooperative control in robot football (real football) using multi-agent systems, focusing on behavior-based control and dynamic role assignment among robots to optimize performance. Their approach emphasized effective communication for coordination in multi-agent settings. This work particularly inspired our approach to multi-agent systems, where we focused on behavior-based control and dynamic role assignment. Ribeiro et al. (2024) [6] proposed a probability-based strategy (PBS) for robotic football (real football), utilizing real-time data for centralized decision-making without relying heavily on pre-defined plays. Their approach demonstrated flexibility across different environments. Smit et al. (2023) [8] explored scaling multi-agent reinforcement learning (MARL) to a full 11v11 simulated football environment (real football), focusing on computational efficiency and the use of attention mechanisms to enhance scalability in large-scale multi-agent settings. Song et al. (2024) [9] conducted an empirical study on the Google Research Football platform (real football), introducing a population-based MARL training pipeline to quickly develop competitive AI players, highlighting the importance of scalable training frameworks. Scott et al. (2022) [7] examined end-to-end learning in RoboCup simulations (real football), optimizing both low-level skills and high-level strategies through competitive self-play, providing a comprehensive approach to multi-agent training in competitive environments.

4 MAS Approach to Autonomous Table Football Control

In this section, we describe the methodology of our approach. We describe the agent architecture and the different agent roles, and outline the actions they can take. Then, we discuss the conditions and priorities for role assignment during the game and evaluate the behavior of the system as a whole.

4.1 Agent Architecture

There exist several agent architectures commonly used in MAS. Approaches such as [4, 10, 12, 13] use a role-based approach for interaction between agents and with the environment. In the role-based approach, based on the concepts from role theory [1], the agents are assigned roles which affect their behavior. While the overall long-term goal of the system is typically predefined and does not change, e.g., to win a table football match, the current role of an individual agent defines the agent's short-term goals, which influence the agent's behavior, its decision-making process, and how it interacts with the rest of the system. Furthermore, separating agent functionality into independent roles can simplify and decouple individual tasks, leading to a more modular system, which can simplify and improve the extensibility of the implementation [3].

There exist several approaches to role and behavior implementation in MAS, such as merging different roles, role models and class members [2, 3, 4]. In our implementation, we simplify the architecture by allowing an agent to occupy only a single role at a time, and by defining the roles in a way that allows reassignment between iterations of the algorithm without regard to the previous role or state of the agent.

Each role defines a set of possible actions an agent can take. The agents decide which action to take based on their priority and the current environment. More complex roles can be implemented in a stateful manner, meaning the decision on which action to take depends on previous actions as well. An agent can only be assigned a single role at a time, but can switch between roles throughout iterations, regardless of whether the particular goal is fulfilled, when appropriate conditions arise. Additionally, every agent must have a role assigned at all times.

An action is a discrete, autonomous task that an agent can take on by making appropriate decisions and acting on the environment, e.g., by sending commands to the actuators. This advances the agent toward the goal imposed by the current role. An agent can only execute a single action at a time. Additionally, every agent must be actively executing an action at all times.

These concepts were implemented using an object-oriented approach, as suggested by the authors of the competition. In our implementation, each agent repeatedly executes a fast processing routine. In every iteration, the environment data is updated and role selection for the agent is performed. Then, as the agent decides on a role for that iteration, the appropriate role processing function is called, executing individual actions.
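A minimal object-oriented sketch of this iteration loop could look as follows; all class and method names are illustrative, not taken from our actual implementation:

```python
class Role:
    """Base role: defines the short-term goal and available actions."""
    name = "role"
    def act(self, agent, state):
        raise NotImplementedError

class Midfielder(Role):
    name = "midfielder"
    def act(self, agent, state):
        agent.command(rotation=30)  # raise figures to let the ball pass

class Defender(Role):
    name = "defender"
    def act(self, agent, state):
        agent.command(position=state["ball_x"])  # follow the ball

class Agent:
    """One rod acting as an independent agent."""
    def __init__(self, rod_id):
        self.rod_id = rod_id
        self.role = Midfielder()  # an agent always holds a role

    def select_role(self, state):
        # Simplified stand-in for the conditions of Section 4.3.
        return Defender() if state["ball_in_front"] else Midfielder()

    def command(self, position=None, rotation=None):
        print(self.rod_id, self.role.name, position, rotation)

    def step(self, state):
        """One iteration of the fast processing routine."""
        self.role = self.select_role(state)  # role selection
        self.role.act(self, state)           # execute the role's action

agent = Agent(rod_id=2)
agent.step({"ball_x": 0.4, "ball_in_front": True})
```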
4.2 Role Description

A typical table football setup consists of four rods per player, each with a number of mounted figures. In this implementation, each rod is considered an agent, resulting in a system with four agents, for which we define the following roles, typically associated with table football games.

Goalkeeper is the final line of defense, primarily responsible for intercepting the ball before it reaches the goal. Typically the left-most rod, which is nearest to the goal and has a single figure, the goalkeeper follows the ball position using two possible actions: follow and misaligned follow. The follow action simply tries to align the figure on the rod with the current ball position. However, if the velocity of the ball exceeds a predefined threshold, the agent instead attempts to estimate the ball trajectory based on its velocity vector. This estimation is simplified by assuming that the ball maintains a straight-line path. The figure is therefore positioned at the intersection of the rod and the estimated trajectory in an attempt to intercept a fast-moving ball.

The misaligned follow action is an augmented variant of the former action, designed to increase the overall defense surface of the defending agents. A common scenario in table football occurs when an attacker attempts to bypass the defenses by slightly pushing the ball parallel to the rod and striking it immediately after. Even though a human player might react fast enough to block such an attack, actuator response times are often insufficient. A defense strategy against such attacks is to misalign the goalkeeper and defender figures, increasing the defense surface. Here, this is implemented by the misaligned follow action, which is activated whenever the ball is relatively slow, in the possession of the opponent, and another agent in front of the Goalkeeper is currently in a Defender role. This decreases the chances of the opponent scoring even if the actuators fail to respond fast enough to block this style of attack. Here, communication between the two agents is performed implicitly, as each agent perceives the roles of other agents as a part of the overall environmental state.

Defender is an agent tasked with blocking opponent attacks by intercepting the ball when it is in the opponent's possession or moving towards the goal. This role utilizes a single follow action, similar to the Goalkeeper's follow action. Whenever the Defender role is active, the agent tracks the position and velocity of the ball, trying to match either its current coordinate or the estimated intersection with the trajectory of the ball. The agent identifies the figure closest to the intersection and attempts to move the rod using a minimal amount of movement. This approach allows for faster adjustments during the game, improving defensive efficiency.

Midfielder is an agent role with the primary task of raising the figures to allow passing the ball from behind the current agent. This role, although simple, is essential in order to avoid accidentally breaking a friendly attack by an Attacker agent behind the current rod.

Attacker is an agent with the task of kicking the ball towards the opponent goal in an attempt to score a point. Unlike other roles, the Attacker role is implemented in a stateful manner. Actions can only happen in a specified order, when the corresponding conditions are met. The role implements follow, kick and prevent back-kick actions.

Whenever the agent is assigned this role, the follow action is executed first. During the follow action, the agent slightly raises the figures in order to prepare for a kick. The figure closest to the ball is selected and the rod offset is adjusted in order to align the figure with the ball. Whenever the agent determines that the alignment with the ball is sufficient, the agent moves on to the next state, the kick action. Here, the rod is rotated in order to strike the ball. During this state, it is still necessary to track the position of the ball, as the ball can move significantly within a few iterations of the algorithm. As the rod completes the forward rotation, the agent monitors the position of the ball and assesses whether the figure successfully hit the ball. In that case, the next action is set back to follow, and the agent is usually assigned a new role according to the environment. However, if the figure missed the ball during the kick, the agent moves on to the prevent back-kick action. This final action is meant to prevent an accidental kick in the direction opposite of the intended one. The rod is translated sideways and slowly rotated into a neutral position, in order to circumvent the ball. While executing this action, role switching for the current agent is disabled as well.

During execution, the agent aligns the rod position with the ball; however, a perfectly aligned figure results in a straight shot, which is easily defended by maintaining alignment with the ball. A more effective strategy involves kicking at an angle to aim for the goal or create a rebound off the wall, which is harder to defend. This role achieves this by slightly misaligning the figure before and during the kick. The agent computes the angle between the ball's current position and the selected target, with the figure's required misalignment set proportionally to this angle and adjusted by a tunable parameter for fine-tuning. This attack strategy significantly increases the performance of the Attacker role.
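The straight-line interception used by the follow actions reduces to a one-line geometric computation. The sketch below assumes the figure moves along one axis of the table plane while the rod line is fixed on the other; the coordinate convention and the speed threshold are illustrative:

```python
def intercept_x(ball_pos, ball_vel, rod_y, speed_threshold=0.2):
    """Estimate where to place the defending figure on a rod.

    Assumes the ball keeps a straight-line path, as in the follow
    action of the Goalkeeper and Defender roles.
    """
    bx, by = ball_pos
    vx, vy = ball_vel
    speed = (vx**2 + vy**2) ** 0.5
    if speed < speed_threshold or vy == 0:
        return bx  # slow ball: simply mirror its coordinate
    t = (rod_y - by) / vy  # time until the ball reaches the rod line
    if t < 0:
        return bx  # ball moving away from the rod
    return bx + vx * t  # intersection of trajectory and rod line

print(intercept_x((0.3, 0.2), (0.1, 0.4), rod_y=0.6))  # -> 0.4
```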
4.3 Role Assignment

Individual roles are assigned to agents according to defined assignment conditions and rules. Some approaches use an objective function in order to select a role, often taking role priority into account [4]. In this approach, we instead define a simple set of conditions which, along with role priority, decide on the most appropriate role for a particular agent based on the current state of the environment.

If, at a particular instant, more than one role fulfills the assignment conditions for a particular agent, the role with the higher priority is selected. In this implementation, the highest priority belongs to the Attacker role, followed by the Goalkeeper, the Defender and finally the Midfielder with the lowest priority. This ordering is based on the strictness of the assignment conditions for each role and the importance of that particular role. For example, the Attacker role has the strictest selection conditions among all roles and is therefore assigned the highest priority, while the Midfielder role has a very broad assignment condition and is not as important compared to an Attacker agent.

We define the role selection conditions as follows. The Attacker role is selected whenever the ball speed drops below a specified threshold and the ball is within kicking clearance of the rod. The Goalkeeper role is selected if that particular agent belongs to the left-most rod, closest to the player's goal. The Defender role is selected whenever the ball is in front of the rod. Lastly, the Midfielder role is selected whenever the ball is behind the rod, as the role's only task is to raise the figures to allow the ball to pass forward.

This set of conditions, combined with the defined role priority, allows the agents to switch between roles effectively and covers the main functionality required to play the game. Role priority ensures that the agent works toward a correct goal based on the circumstances. For example, any rod, even the Goalkeeper, should attempt to kick the ball if it is close and slow enough, while only the left-most rod should attempt to be the goalkeeper. A minimal sketch of this priority-ordered selection is given below.
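Since roles are checked in strict priority order, the whole assignment rule fits in one function. The field names and the threshold value below are illustrative placeholders, not constants from our system:

```python
KICK_SPEED_THRESHOLD = 0.1  # illustrative value; tuned in practice

def assign_role(agent, state):
    """Return the highest-priority role whose condition holds.

    Priority order: Attacker > Goalkeeper > Defender > Midfielder.
    """
    if (state["ball_speed"] < KICK_SPEED_THRESHOLD
            and state["ball_in_kicking_range"]):
        return "attacker"
    if agent["is_leftmost_rod"]:
        return "goalkeeper"
    if state["ball_in_front_of_rod"]:
        return "defender"
    return "midfielder"  # broad fallback: ball is behind the rod

print(assign_role({"is_leftmost_rod": False},
                  {"ball_speed": 0.02, "ball_in_kicking_range": True,
                   "ball_in_front_of_rod": True}))  # -> attacker
```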
4.4 Behavior of the System as a Whole

The system's primary offensive strategy is for the Attacker agents to advance the ball as far forward as possible, ultimately aiming for the goal, while Midfielder agents ensure that they do not obstruct forward passes. During opponent attacks, the system's primary defensive strategy is for the Defender and Goalkeeper roles to intercept the ball. In certain situations, they collaborate to expand the defense surface, compensating for the limitations posed by actuator response times. Once the opponent's attack ends, agents detect the change in the environment and the roles are reassigned to shift the game towards offensive play.

The system's game strategy can be adjusted by modifying parameters such as role priority, assignment rules, or individual actions. For instance, a more defensive strategy can be achieved by tightening the conditions for assigning the Attacker role.

Overall, the implemented algorithm performs well, with the combination of discrete roles resulting in competent gameplay. However, delays and noise present in the measurements, and delays due to actuator response times, sometimes cause the system to miss, e.g., during attacks. The prevent back-kick action of the Attacker role proves essential in such situations, performing careful repositioning. Another surprisingly successful strategy is aiming at the goal or the wall during the attack action. Even if the ball does not follow the intended trajectory due to measurement noise and system delays, it still considerably increases the attack success rate. Additionally, even though there are no explicit, intentional passes between agents, the strategy of simply passing the ball as far forward as possible is enough for successful gameplay.

The system overall is sensitive to changes in parameters and requires precise tuning. The simulator, although effective, does not perfectly simulate the physical table, and additional parameter tuning is required when transitioning from the simulator to the real-world application.

5 Conclusion

This paper presented a multi-agent system (MAS) for autonomous table football, developed for the FuzbAI competition. Our role-based design allowed each rod to act as an independent agent, dynamically adapting to the game state. This approach enabled effective coordination between offense and defense, contributing to our system's first-place win.

The results demonstrate the effectiveness of a modular, adaptive architecture in dynamic environments, highlighting the importance of robust decision-making and quick role-switching. Future work could include machine learning to predict opponent behavior and optimize strategies, as well as expanding the system to more complex environments. Overall, our MAS showed strong performance in a competitive setting, offering valuable insights for future developments in autonomous systems.
References

[1] Bruce J. Biddle. 1986. Recent developments in role theory. Annual Review of Sociology, 12, 1, 67–92.
[2] G. Cabri, L. Ferrari, and L. Leonardi. 2004. Agent role-based collaboration and coordination: a survey about existing approaches. In 2004 IEEE International Conference on Systems, Man and Cybernetics. Vol. 6. IEEE. doi: 10.1109/ICSMC.2004.1401064.
[3] E. A. Kendall. 1999. Role modelling for agent system analysis, design, and implementation. In Proceedings of the 1st International Symposium on Agent Systems and Applications and 3rd International Symposium on Mobile Agents (ASA/MA 1999). IEEE, 204–218. doi: 10.1109/ASAMA.1999.805405.
[4] Gregor Klančar, Marko Lepetič, Boštjan Potočnik, Rihard Karba, and Drago Matko. 2002. Cooperative control of mobile agents in soccer game. Faculty of Electrical Engineering, University of Ljubljana, Slovenia.
[5] Janosch Moos, Cedric Derstroff, Niklas Schröder, and Debora Clever. 2024. Learning to play foosball: system and baselines. arXiv. doi: 10.48550/arxiv.2407.16606.
[6] António Fernando Alcântara Ribeiro, Ana Carolina Coelho Lopes, Tiago Alcântara Ribeiro, Nino Sancho Sampaio Martins Pereira, Gil Teixeira Lopes, and António Fernando Macedo Ribeiro. 2024. Probability-based strategy for a football multi-agent autonomous robot system. Robotics, 13, 1. doi: 10.3390/robotics13010005.
[7] Atom Scott, Keisuke Fujii, and Masaki Onishi. 2022. How does AI play football? An analysis of RL and real-world football strategies. In International Conference on Agents and Artificial Intelligence. Vol. 1, 42–52. doi: 10.5220/0010844300003116.
[8] Andries Smit, Herman A. Engelbrecht, Willie Brink, and Arnu Pretorius. 2023. Scaling multi-agent reinforcement learning to full 11 versus 11 simulated robotic football. Autonomous Agents and Multi-Agent Systems, 37, 1. doi: 10.1007/s10458-023-09603-y.
[9] Yan Song, He Jiang, Zheng Tian, Haifeng Zhang, Yingping Zhang, Jiangcheng Zhu, Zonghong Dai, Weinan Zhang, and Jun Wang. 2024. An empirical study on Google Research Football multi-agent scenarios. International Journal of Automation and Computing, 21, 3, 549–570. doi: 10.1007/s11633-023-1426-8.
[10] Manuela Veloso, Peter Stone, and Kwun Han. 1998. The CMUnited-97 robotic soccer team: perception and multiagent control. In Proceedings of the Second International Conference on Autonomous Agents, 78–85.
[11] Laboratorij za avtomatiko in kibernetiko. 2024. Dnevi avtomatike. Accessed: 2024-08-21. https://dnevi-avtomatike.si/?page_id=270.
[12] Franco Zambonelli, Nicholas R. Jennings, and Michael Wooldridge. 2003. Developing multiagent systems: the Gaia methodology. ACM Transactions on Software Engineering and Methodology, 12, 3, 317–370. doi: 10.1145/958961.958963.
[13] Xiaoqin Zhang, Haiping Xu, and Bhavesh Shrestha. 2007. An integrated role-based approach for modeling, designing and implementing multi-agent systems. Journal of the Brazilian Computer Society, 13, 2, 45–60. doi: 10.1007/bf03192409.
Towards a Decision Support System for Project Planning: Multi-Criteria Evaluation of Past Projects Success

Miha Hafner (Elea iC d.o.o., Department for Tunnels and Geotechnics, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, miha.hafner@elea.si) and Marko Bohanec (Jožef Stefan Institute, Department of Knowledge Technologies, Ljubljana, Slovenia, marko.bohanec@ijs.si)

Abstract

Project planning typically refers to the project management step in which project assets, timelines, budgets, milestones, subcontractors, etc., are determined before the new project starts. In this paper, we address infrastructure design projects in the context of a specific company (Elea iC) and explore the idea of using data about past finished projects to help project managers and project leaders in project planning. A crucial requirement in this context is the ability to evaluate/assess the success of finished/new projects. This paper proposes a solution using a multi-criteria model to evaluate finished projects. This way, we add project success information to the finished projects database, which we shall use in the decision support system being designed to extract knowledge for the new project plan.

Keywords

Project success evaluation, multi-criteria model, decision support systems, data analysis, data mining, project management, project leading tools

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.scai.8463

1 Introduction

Infrastructure, such as tunnels, bridges, schools, houses, sewage systems, roads, etc., and its design discipline play a vital role in society. Thus, infrastructure design must have properly and thoroughly defined requirements, objectives, scope and constraints concerning many expert fields such as civil engineering, architecture, geology, geotechnics, environmental engineering, urban planning, and other expert fields [1], [2], [3].

The term design is connected to the process that ends with technical documentation, technical approvals, models, and other deliverables prepared at the end of the design process. Each such process is referred to as the project [4]. The projects are expected to have clearly defined:

• Goals defining the project's desired result, e.g., a building permit for a bridge, static analysis of a retaining wall, architectural design for a subway station, geotechnical exploration for a tunnel, etc. [4].
• Objectives that support project goals, including concrete and measurable project characteristics such as deliverables, milestones, and other steps and strategies to achieve the goals [7].
• Scope and requirements concerning project boundaries, e.g., the need for experts, potential subcontractors, technical equipment and other requirements to finish the project.
• Constraints and limitations concerning project deadlines, costs, etc. [8].

Besides that, each project should finish with the client's and stakeholders' satisfaction [8]. To achieve the above for the new project, project planning is vital at the beginning of each new project [6], [8]. It is the task of project management and project leaders to recognize and include all these in the project plan so that the work and processes lead to successful project completion.

This study aims to support this process in the context of the Elea iC company, an interdisciplinary provider of engineering services and projects in Slovenia [5]. We wanted to include the knowledge obtained from past finished projects in the project planning process for new projects. The company has collected this data since 2001. The assumptions are as follows:

1. The finished projects in the database offer valuable information for the new project planning phase.
2. The project workflows and requirements established in the company remained similar over the years.

The main challenge related to this question is the new project success assessment and its consideration in light of the available finished project data [7].
Unfortunately, the actual finished projects database does not contain much information about the finished projects' success. To bridge this, we had to construct a project success evaluation model, evaluate the finished projects in the database and add this information to the database. The expected result of those steps is a database suitable for applying data-analysis and knowledge extraction methods, such as hierarchical clustering and machine learning [20].

This paper describes the finished project success evaluation component (hereafter called FPSE), which is part of the future decision support system (DSS, [12], [13], [14]) for project planning (hereafter called E-DSS). First, we present the general architecture of the E-DSS, explaining the role and integration of FPSE in its context. In section 3, we present the database of finished projects and its preparation for supporting the configuration of new projects. The evaluation model used and the experimental evaluation of FPSE are presented in sections 4 and 5, respectively. Section 6 concludes the paper.

2 E-DSS Architecture

E-DSS is a DSS under construction to support the project management and project leaders in the Elea iC company (hereafter called "the user") in configuring the new project plan parameters when a new project starts. The user is expected to define the E-DSS input as shown in Figure 1: the new project objectives, requirements, desires, and expectations. Practically, this means that the user collects all the available new project data by:

• Extracting the new project data from the new project assignment and contracts containing relevant information for the project planning.
• Checking the company's and potential subcontractors' state of the resources and assets needed to complete the new project.

Examples of those data include the projected monetary value, project scope and goals, project start and finish date, the expert fields needed for project completion, etc.

The E-DSS output (Figure 1) consists of the new project plan configuration together with the corresponding success scores (+S). The configuration comprises data such as the number of employees involved, the number of subcontractors, work distribution, work duration, the number of pauses, etc. Project success scores are assessed assuming these configuration settings.

Figure 1: E-DSS architecture

Accordingly, E-DSS is composed of the following components (Figure 1):

• NPPE (New Project Parameters Extraction) is the component that extracts the potential new project configuration parameters and corresponding data to support the decision-making. NPPE is currently under construction and is aimed to operate interactively with the user and support: searching for similar projects in FPD+S according to a predefined range, searching the projects by desired success score, project segmentation and project group identification (unsupervised descriptive analytics), and parameter prediction by supervised machine learning methods. The component NPPE+S inside NPPE evaluates the success of the potential new project's configuration parameters obtained. The evaluation is made by EM, which is part of FPSE.
• FPSE (Finished Project Success Evaluation) consists of:
  o EM (Project success Evaluation Model) for evaluating the new project configuration (described in section 4).
  o FPD+S, the database of finished projects with project success evaluations (section 3).

Figure 1 also shows the element E-DSS administrator, used to upgrade FPSE periodically by upgrading the database of finished projects or making changes in EM according to the system's operational requirements and expected results.

Figure 2: FPD+S development workflow

This paper focuses on the development of FPSE. The workflow is shown in Figure 2, consisting of the following steps:

Step A. Finished projects database preparation (FPD), as described in section 3.
Step B. Project success evaluation model (EM), as described in section 4.
Step C. Finished projects database with EM success scores (FPD+S): the result of the FPSE is the upgraded database of the finished projects with the finished projects' success scores (FPD+S).
3 Data Description

E-DSS is a data-driven system that operates on data from past finished projects. This data was collected in the Elea iC company from the year 2001 to 2023. At the beginning of the data collection, the number of observed variables was relatively small, but it has grown substantially over the years. At the time of this study, the database contained data on 4704 finished projects, described by 39 numeric variables; 6 of them were date/time/year variables, and 2 were categorical variables.

Data preparation (Step A, Figure 2) was carried out as follows:

1. Data cleaning: replacing "NaN" values and deleting erroneous data;
2. Outlier removal using the interquartile range approach [18];
3. Data imputation: replacing the missing values using a descriptive statistic (e.g., mean, median, or most frequent) along each column or using a constant value [19]. We employed the mean strategy.
4. Sensitive data and information removal. For this reason, all numeric data was scaled to a range between 0 and 1.

We ended up with the database FPD containing data on 3132 finished projects described by 27 numeric variables. The variables describe the main project management characteristics, such as project financial results, workload distribution, number of employees, subcontractors, etc. Table 1 shows the list of all variables together with their basic statistics.

Table 1: Basic statistics of the variables after data cleaning, outlier removal, and data scaling

This way, the finished project database (FPD) was prepared for the FPSE component. FPD is the main resource for exploratory data analysis for observing the data and its properties, such as variable correlations, variable information gain, etc. These operations are invoked interactively by the user in the context of NPPE and are not discussed further in this paper.
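The paper cites scikit-learn's SimpleImputer [19] for the imputation step. Assuming the raw data lives in a csv file (the file name and column handling below are illustrative), preparation steps 2 to 4 could be reproduced along these lines:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("finished_projects.csv")  # hypothetical file name
num = df.select_dtypes("number")

# Step 2: outlier removal with the interquartile range rule [18].
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1
mask = ((num >= q1 - 1.5 * iqr) & (num <= q3 + 1.5 * iqr)).all(axis=1)
num = num[mask]

# Step 3: mean imputation of missing values [19].
imputed = SimpleImputer(strategy="mean").fit_transform(num)

# Step 4: scaling to [0, 1], which also masks sensitive absolute values.
fpd = pd.DataFrame(MinMaxScaler().fit_transform(imputed),
                   columns=num.columns)
```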
4 Evaluation of Projects' Success

The project success evaluation model (EM), developed in Step B (Figure 2), is aimed at:

• the evaluation of the projects in FPD, resulting in the FPD+S (Figure 2);
• the evaluation of potential new projects suggested through interaction between the user and NPPE+S (Figure 1).

Project success evaluation involves multiple criteria that have to be aggregated into a single evaluation score. Different criteria might be of different importance and affect the score differently, i.e., with different weights. For this purpose, we chose MAUT (Multi-Attribute Utility Theory) [11], a multi-criteria modelling approach that facilitates both hierarchical structuring of criteria and using weights for the aggregation of scores. Considering the above requirements, the available FPD data and other multi-criteria approaches to project evaluation ([15], [16], [17]), we developed the EM as presented in Figure 3.

EM consists of three components [10]: input parameters, evaluation parameters and aggregation functions.

Input parameters are variables in the leaf nodes of the model:

• Project Work Concentration: explains the distribution of the work on the project. If the value is closer to 0 or 1, the majority of the work is done at the beginning or at the end of the project, respectively.
• Time Reserve: explains if the project work ended earlier than defined in the contract.
• Number of Pauses: the number of times the work on the project stopped.
• Pauses Time Share: the ratio between the months the employees did not work and the total number of months.
• Hour Income: the ratio between the project value and the number of work hours necessary to finish the project.

Figure 3: Multi-criteria model for the projects' success evaluation

Evaluation parameters represent outputs of the model:

• WORK DISTRIBUTION: assesses the characteristics of the work distribution over the project duration.
• PROJECT PAUSES: assesses the work pauses.
• PROJECT WORKFLOW: combines the evaluation parameters WORK DISTRIBUTION and PROJECT PAUSES.
• PROJECT FINANCIAL RESULT: assesses the project's success from the financial point of view.
• PROJECT SUCCESS SCORE: the overall success score, determined by aggregation of all subordinate parameters.

Aggregation functions map subordinate EM parameters to the corresponding parent parameters. The weighted average function is employed, using the weights shown in Figure 3. Currently, the weights are chosen to make all parameters equally important.
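A bottom-up weighted-average evaluation of such a tree is straightforward to sketch. Note that the assignment of input parameters to parent parameters below is only our reading of Figure 3 and may not match the actual model; equal weights are assumed, as stated above, and inputs are assumed to be already normalized so that higher means better:

```python
def weighted_avg(values, weights=None):
    """Weighted average; equal weights when none are given."""
    weights = weights or [1 / len(values)] * len(values)
    return sum(v * w for v, w in zip(values, weights))

def evaluate_project(p):
    """Bottom-up evaluation of an EM-like tree (illustrative mapping)."""
    work_distribution = weighted_avg([p["work_concentration"],
                                      p["time_reserve"]])
    project_pauses = weighted_avg([p["number_of_pauses"],
                                   p["pauses_time_share"]])
    project_workflow = weighted_avg([work_distribution, project_pauses])
    financial_result = p["hour_income"]  # single financial input assumed
    return weighted_avg([project_workflow, financial_result])

# Values chosen to mirror the example scores in Section 5.
print(evaluate_project({"work_concentration": 0.8, "time_reserve": 0.7,
                        "number_of_pauses": 0.8, "pauses_time_share": 0.7,
                        "hour_income": 0.29}))  # -> 0.52
```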
5 Experimental Evaluation of FPSE

Figure 4 shows an example of evaluating a project from FPD. Input parameters' values (terminal nodes) were obtained from the database, while evaluation parameters' values (green nodes) were calculated by EM. The example project shows a good workflow score (0.75) but a poor financial score (0.29), both leading to an average success score (0.52) of the project. Several other projects of different types were evaluated in this way, confirming the appropriateness of the EM structure and its conformance with the requirements of potential users. In this way, the quality of EM was assessed on a sample of past projects. Further assessment is planned in the next stages while configuring new projects, where EM's results can be confronted with the opinions of project leaders actively involved in the process.

Figure 4: Example of evaluating a project using EM

EM already enables the evaluation of multiple finished projects. In Step C (Figure 2), FPD was extended by adding five variables corresponding to the five evaluation parameters of EM. All projects in FPD were evaluated by EM, resulting in FPD+S.

Basic statistics of FPD+S are presented by the distribution of the variables in Figure 5. The variables marked with red colour on the x-axis are E-DSS input parameters, the green uppercase variables are those corresponding to success scores, and the blue variables are potential new project parameters. The distribution of the final project evaluation, PROJECT_SUCCESS_SCORE (average = 0.52, min = 0.15, max = 0.94), indicates that it covers the range of possible outcomes well and enables the discrimination and sorting of projects.

Figure 5: Distribution of the FPD+S features, including EM project success assessments

6 Conclusions

E-DSS is a DSS under construction, aimed at supporting the project management and project leaders' process in the new infrastructure project planning phase. We presented the design and development of the FPSE component, consisting of a multi-criteria project success evaluation model EM and a database of projects extended with success evaluation scores, FPD+S.

EM has been developed using the MAUT approach and has turned out to be fit for purpose. It employs the data that is available in the projects' database. It meaningfully describes aspects of the projects' success and offers a practical and functional model for the evaluation of multiple projects in the database.

FPSE is a key decision-support resource for E-DSS. E-DSS will allow the user to interactively search for similar past projects, to filter them according to the success score and to simulate the effects of alternative project configurations, ultimately proposing the best one. Approaches based on unsupervised descriptive analytics (clustering) and supervised machine learning methods for the prediction of E-DSS output parameters are foreseen for this purpose. We have, in fact, already tested hierarchical clustering and decision tree classification methods on FPD+S, and the first results are encouraging: we obtained meaningful clusters of past projects and created decision trees for the prediction of individual output parameters that may lead to high new project success scores.

Future work will primarily continue with further data analysis and data mining of FPD+S, attempting to design effective algorithms for interactive exploration of past projects and suggesting as good as possible configurations of new projects. On this basis, we shall make a detailed functional specification of the NPPE+S component and design/implement the E-DSS. Although the E-DSS considered here is tailor-made for a specific business environment and bound to a specific database, the approach seems general enough to be applied to similar environments, projects and processes [9]. This work is a showcase of the substantial effort needed to prepare a corporate database for decision support, which is often neglected in the literature. The main contribution is a combination of data processing with MAUT-based multi-criteria decision modelling.
References

[1] Fransje L. Hooimeijer, Jeremy D. Bricker, Adam J. Pel, A. D. Brand, Frans H. M. Van de Ven, and Amin Askarinejad. 2022. Multi- and interdisciplinary design of urban infrastructure development. In Proceedings of the Institution of Civil Engineers: Urban Design and Planning. Vol. 175. TU Delft, 153–168.
[2] Simon Christian Becker and Philip Sander. 2023. Development of a Project Objective and Requirement System (PORS) for major infrastructure projects to align the interests of all the stakeholders. In Expanding Underground – Knowledge and Passion to Make a Positive Impact on the World. CRC Press, London, UK, 3369–3376. doi: 10.1201/9781003348030-408.
[3] Michel-Alexandre Cardin, Ana Mijic, and Jennifer Whyte. 2023. Data-driven infrastructure systems design for uncertainty, sustainability, and resilience. In Life-Cycle of Structures and Infrastructure Systems. CRC Press, London, UK, 2565–2572. doi: 10.1201/9781003323020-312.
[4] Saša Žagar. 2016. Organizacijski model v projektivnem podjetju Elea iC d.o.o. B.Sc. Thesis, Maribor. Retrieved July 12, 2024 from https://dk.um.si/IzpisGradiva.php?id=58799&lang=eng.
[5] Elea iC webpage. https://www.elea.si/en/.
[6] Jürg Kuster, Eugen Huber, Robert Lippmann, Alphons Schmid, Emil Schneider, Urs Witschi, and Roger Wüst. 2015. Project Management Handbook. Springer-Verlag, Berlin Heidelberg, Germany.
[7] Anton Hauc. 2007. Projektni management (2nd ed.). GV Založba, Ljubljana, Slovenija.
[8] Harvey A. Levine. 2002. Practical Project Management: Tips, Tactics, and Tools. John Wiley & Sons, Inc., New York, NY.
[9] Nadja Damij and Talib Damij. 2014. Process Management. Springer-Verlag, Berlin Heidelberg, Germany.
[10] Marko Bohanec. 2012. Odločanje in modeli. DMFA – založništvo, Ljubljana, Slovenija.
[11] Salvatore Greco, Matthias Ehrgott, and José Rui Figueira. 2016. Multiple Criteria Decision Analysis: State of the Art Surveys. Springer. doi: 10.1007/978-1-4939-3094-4.
[12] Maria Rashidi, Maryam Ghodrat, Bijan Samali, and Masoud Mohammadi. 2018. Decision support systems. In Management of Information Systems. IntechOpen, London, UK, 19–38. doi: 10.5772/intechopen.79390.
[13] Daniel Joseph Power. 2013. Decision Support, Analytics, and Business Intelligence. Business Expert Press, New York, NY. doi: 10.4128/9781606496190.
[14] Sofiat Abioye, Lukumon Oyedele, Lukman Akanbi, Anuoluwapo Ajayi, Juan Manuel Davila Delgado, Muhammad Bilal, Olugbenga Akinade, and Ashraf Ahmed. 2021. Artificial intelligence in the construction industry: A review of present status, opportunities and future challenges. Journal of Building Engineering, 44. doi: 10.1016/j.jobe.2021.103299.
[15] Erwin Berghuis. 2018. Measuring Systems Engineering and Project Success. Master's Thesis, University of Twente. https://purl.utwente.nl/essays/75088.
[16] Ali Beiki Ashkezari, Mahsa Zokaee, Amir Aghsami, Fariborz Jolai, and Maziar Yazdani. 2022. Selecting an appropriate configuration in a construction project using a hybrid multiple attribute decision making and failure analysis methods. Buildings, 12, 643. doi: 10.3390/buildings9050112.
[17] Urban Pinter and Igor Pšunder. 2013. Evaluating construction project success with use of the M-TOPSIS method. Journal of Civil Engineering and Management, 19, 1, 16–23. doi: 10.3846/13923730.2012.734849.
[18] Interquartile range. Retrieved May 15, 2024 from https://en.wikipedia.org/wiki/Interquartile_range.
[19] SimpleImputer. Retrieved May 15, 2024 from https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html.
[20] Charu C. Aggarwal. 2015. Data Mining: The Textbook. Springer, New York, USA.
Minimizing Costs and Risks in Demand Response Optimization: Insights from Initial Experiments

Mila Nedić
Faculty of Mathematics and Physics, University of Ljubljana, Ljubljana, Slovenia
mn38120@student.uni-lj.si

Tea Tušar
Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
tea.tusar@ijs.si

DOI: https://doi.org/10.70314/is.2024.scai.8587

Abstract
This paper presents a method for changing the energy use of consumers participating in Demand Response (DR) programs, focusing on peak balancing to improve grid stability. Multiple objectives including costs and risks are considered, and a weighted sum is used to transform them into a single objective. This results in an optimization problem that can be optimally solved. To calculate the costs, the load consumption baseline needs to be established. Since this is challenging and can be exploited, we conduct initial experiments to test whether our method to adjust the baseline can be easily manipulated. We explore an original scenario and three of its variants to examine the effects of various parameters on the optimization outcome. Our results indicate that 1) an excessive emphasis on risk results in no energy change, 2) enforcing a net zero energy change minimizes energy use while still securing the rebate, and 3) without an adjustment period, the consumer is less inclined to increase the load just before the demand period. In future work, we will reformulate some objectives to avoid exploitation and better reflect the real-world needs of DR.

Keywords
multiobjective optimization, mixed-integer linear programming, demand response, baseline consumption, electrical grid

1 Introduction
Peaks in energy demand can strain the electrical grid, leading to inefficiencies and potential failures. A widely used strategy for balancing these peaks is Demand Response (DR), in which the Distribution System Operator (DSO) forecasts future peaks and requests from consumers to adjust their energy use to reduce them. In the peak time rebate DR program [2], consumers receive a rebate if they reduce their load in the demand period. On the other hand, if they commit to respond to the demand, but fail to do so, they can be penalized. It is therefore of utmost importance to accurately assess whether and how much a consumer reduced their load to meet the demand.

The load reduction of a consumer is computed as the difference between its baseline (the amount of energy the customer would have consumed without a demand request) and its actual use [2]. The importance of establishing a baseline and the various ways of calculating it are presented in [5]. Common methods for calculating baselines include simple historical data averages, exponential moving averages and short-term load forecasting techniques. However, baselines can be exploited, e.g., when consumers artificially increase consumption before an event to inflate their baseline and maximize the awarded rebate.

Through the SEEDS project (https://project-seeds.eu/), we are developing a methodology for providing energy flexibility services to prosumers – participants in energy markets capable of both producing and consuming energy – in order to enhance grid stability. Machine learning is used to predict the baseline energy usage of prosumers and their flexibility, while mixed-integer linear programming (MILP) is used to optimize the operation of prosumers within their flexibility. Our approach will be tested in the Slovenian pilot, in collaboration with Petrol d.d. and Elektro Celje d.d.

Our work integrates prosumer flexibility into DR optimization, focusing on minimizing costs and risks while limiting energy fluctuations. While the goal is to eventually use this approach on real-world data from the pilot, this paper reports on some initial experiments verifying whether the current problem formulation results in solutions with desired properties. In particular, we wish to test if our adjusted consumer baseline approach can be easily exploited.

Research on prosumer flexibility, optimization techniques, and demand response optimization includes a wide range of approaches [8]. In [3], Balázs et al. quantify residential prosumer flexibility using engineering models and real-world data. Their work provides valuable insight into prosumer behavior and energy management. Capone et al. [4] optimize district energy systems by balancing costs and carbon emissions with genetic algorithms and linear programming, showing significant emission reductions at a modest cost increase. Magalhães and Antunes [7] compare thermal load models in demand response strategies using MILP, finding that discrete control formulations improve computational efficiency. Thus, our methodology is in line with related work, while the actual optimization problem (its variables, objectives and constraints) differs from existing ones as it is adapted to our specific use case.

This paper is further organized as follows. In Section 2, we provide a brief overview of the optimization problem, followed by its detailed definition in terms of its variables, constraints and objectives. The optimization approach is explained in Section 3, where we discuss the scalarization technique used to transform our multi-objective problem into a single-objective MILP form and the method used to solve it. The experiments and their results are given in Section 4. Finally, conclusions and further work ideas are described in Section 5.

2 Optimization Problem
The problem formulation in this work assumes a peak time rebate DR program in which the DSO and the consumer have a contract stipulating the following conditions: 1) the consumer can choose whether to respond to a demand request, 2) if the consumer participates in DR, it receives a rebate proportional to the reduced load, 3) if the consumer participates in DR but does not reduce the load by at least 75 % of the required amount, it is penalized, 4) the load reduction is estimated using an adjusted consumer baseline, which takes into account the forecast consumer energy usage as well as its actual consumption before the demand period.

The optimization task is to set the energy consumption of all loads of a consumer participating in DR, taking into account their flexibility, so that consumer costs, risks and energy fluctuations are minimized. This ensures efficient grid operation while maintaining economic feasibility for the consumer.

To formally define our optimization problem, we first introduce its variables, followed by the constraints and the objective functions we aim to optimize. Finally, we provide an overview of the weighted sum approach, which serves as the scalarization technique to transform all objective values into a single one.
  1 ∑︁  F F A 𝐸 − 𝐸 − 𝐸 , if 𝑛 > 0; The optimization task is to set the energy consumption of 𝑗 A   𝑡 A 𝑗 𝐸 = 𝑛 𝑡 S A all loads of a consumer participating in DR taking into account 𝑗 =𝑡 −𝑛    F their flexibility so that consumer costs, risks and energy fluctua- 𝐸 , otherwise 𝑡  tions are minimized. This ensures efficient grid operation while S E for all intervals 𝑡 ∈ {𝑡 , . . . , 𝑡 } in the demand period. Then, maintaining economic feasibility for the consumer. R the recognized load reduction 𝐸 at demand time interval 𝑡 ∈ To formally define our optimization problem, we first intro- 𝑡 S E duce its variables, followed by the constraints and the objective {𝑡 , . . . , 𝑡 } is determined as functions we aim to optimize. Finally, we provide an overview R A 𝐸 = 𝐸 − 𝐸 , 𝑡 of the weighted sum approach, which serves as the scalarization 𝑡 𝑡 technique to transform all objective values into a single one. R while the total recognized load reduction 𝐸 is computed as 2.1 Variables E 𝑡 ∑︁ R R 𝐸 = 𝐸 . A solution is specified by the energy amounts 𝐸 ∈ 𝑐,𝑖 R for each 𝑡 S consumer load 𝑐 ∈ C and time interval 𝑖 ∈ {1, . . . , 𝑛}. They 𝑡 =𝑡 correspond to the change of consumption from the forecast one. R A rebate is awarded if 𝐸 is negative (the consumption has These are the only variables of this optimization problem. been reduced). If the total recognized load reduction exceeds the From these energy amounts and the forecast timetable of en- T total demanded energy reduction 𝐸 , the rebate is capped, i.e., ergy usage, the resulting energy consumption 𝐸 in time interval 𝑖 𝑖 ∈ {1, . . . , 𝑛 } is computed as ( B R T R 𝑝 min 𝐸 , 𝐸 , if 𝐸 < 0 R ∑︁ 𝑓 = . F 𝐸 = 𝐸 + 𝐸 . 𝑖 𝑖 𝑐,𝑖 0, otherwise 𝑐 ∈ C Finally, a penalty is added to the total costs if the demand 2.2 Constraints has not been met, that is, the ratio between the recognized and D The energy amounts of a solution need to adhere to two kinds of demanded energy reduction, 𝐸 , in any of the demand time S E constraints. The first type are the interval energy constraints: intervals 𝑡 ∈ {𝑡 , . . . , 𝑡 } is lower than 75 %, min max 𝐸 ≤ 𝐸 ≤ 𝐸 , R 𝐸 𝑐,𝑖 𝑐,𝑖 𝑐,𝑖   P | T | 𝑡 S E } P  𝑝 𝐸 , if < 75 % for one or more 𝑡 ∈ {𝑡 , . . . , 𝑡 𝑓 = D . for each consumer load 𝑐 ∈ C and time interval 𝑖 ∈ {1, . . . , 𝑛}. 𝐸  0, otherwise The second are the total energy constraints:  𝑛 The second optimization objective represents risks. In order ∑︁ 𝑓2 𝑇 ,min 𝑇 ,max 𝐸 ≤ 𝐸 ≤ 𝐸 , 𝑐 𝑐,𝑖 𝑐 to penalize any changes to the timetable when the risks are high, 𝑖 =1 the objective function is defined as for each consumer load 𝑐 ∈ C. 𝑛 ∑︁ ∑︁ 𝑓 = 𝑟 2.3 Objective Functions 2 𝑖 𝐸𝑐,𝑖 , 𝑖 =1 𝑐 ∈ C The three objectives to be minimized in this scenario are the where 𝑟 represents the risk at time interval 𝑖 . costs, risks and energy fluctuations. 𝑖 To penalize unnecessary energy fluctuations, the third objec- The first optimization objective 𝑓 consists of all costs associ- 1 tive 𝑓 averages the consecutive changes in energy amounts for ated with the solution and equals 3 all consumer loads, i.e., E R P 𝑓 = − + 1 𝑓 𝑓 𝑓 , 𝑛 1 ∑︁ ∑︁ E R where 𝑓 represents the energy costs, 𝑓 is the rebate for the 𝑓 = − 3 𝐸 𝐸 𝑐,𝑖 𝑐,𝑖 −1 . (𝑛 − 1) |C | P recognized load reduction and 𝑓 is the penalty that is charged 𝑖 =2 𝑐 ∈ C in case the recognized load reduction does not meet the require- 2.4 Weighted Sum Approach ments. E The energy costs 𝑓 equal the sum of energy costs over all Since the optimal solutions to this problem appear to reside in time intervals 𝑖 ∈ {1, . . . 
2.4 Weighted Sum Approach
Since the optimal solutions to this problem appear to reside in the convex region of the objective space, we use a weighted sum approach to transform all objective values into a single one. The single objective function to be minimized thus equals

\[ f = w_1 f_1 + w_2 f_2 + w_3 f_3 \]

under the condition $w_1 + w_2 = 1$. The weight $w_3$ can be set independently of $w_1$ and $w_2$ and serves as a measure of limiting the energy fluctuations.

3 Optimization Approach
3.1 Setting Weights in the Weighted Sum
To obtain diverse solutions with the weighted sum approach, a good strategy for setting the weights is needed. While we plan to use a more sophisticated approach for this purpose in future work, these initial experiments were made by choosing equidistant values of $w_1$ from the interval $[0.1, 1]$ and defining $w_2$ as $1 - w_1$. In order to limit energy fluctuations, we set $w_3$ to $10^{-3}$. Smaller weights proved insufficient in limiting the fluctuations, while larger weights interfered with the first two objectives, which are more important than the third.

3.2 Linearization
Since all of the objective functions specified in Section 2.3 are either non-linear or contain non-linear parts, specific techniques are required to linearize these objectives and ensure the problem fits the MILP form. In particular, it is necessary to linearize the absolute value of a real variable, the product of a binary variable and a real variable, and the minimum of two variables, along with other non-linear function conditions. We use standard approaches to achieve linearization for all these cases [9]. For example, a minimized term $|x|$ can be replaced by an auxiliary variable $a$ with the constraints $a \geq x$ and $a \geq -x$, which is exact whenever the objective pushes $a$ down.

3.3 Tool and Solver
We use the OR-Tools Python library (https://developers.google.com/optimization) to implement and solve the single-objective MILP problem. The library is a comprehensive tool for solving optimization problems, including linear programming, integer programming, and combinatorial optimization. Specifically, we use the SCIP (Solving Constraint Integer Programs) solver [1] integrated within OR-Tools (see https://github.com/google/or-tools/blob/stable/ortools/linear_solver/samples/mip_var_array.py) for solving MILP problem instances.

To solve a MILP problem using OR-Tools and the integrated SCIP solver, the following steps are performed: import the linear solver wrapper, declare the SCIP solver, define the variables with their respective bounds, set the constraints and the objective function, and lastly, solve the problem and analyze and display the solution. A sketch of these steps is given below.
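The following sketch instantiates these steps on a deliberately stripped-down instance: a single load with the interval bounds of the basic scenario, a zero total energy change constraint, and only the energy-cost and risk terms in the objective, with $|E_i|$ linearized through auxiliary variables. The rebate and penalty terms of the full formulation are omitted, so this is an illustration of the tooling, not the authors' complete model.

```python
from ortools.linear_solver import pywraplp

# Import the linear solver wrapper and declare the SCIP solver.
solver = pywraplp.Solver.CreateSolver("SCIP")

n = 28  # 15-minute intervals, as in the basic scenario

# Variables with their respective bounds: load change per interval.
E = [solver.NumVar(-3.0, 3.0, f"E_{i}") for i in range(n)]

# Constraint: zero total energy change (as in scenario variants #2 and #3).
solver.Add(solver.Sum(E) == 0.0)

# Linearize |E_i| with auxiliary variables a_i >= E_i and a_i >= -E_i,
# which is exact because the a_i are minimized in the objective.
a = [solver.NumVar(0.0, solver.infinity(), f"a_{i}") for i in range(n)]
for i in range(n):
    solver.Add(a[i] >= E[i])
    solver.Add(a[i] >= -E[i])

# Toy weighted-sum objective: energy costs plus risk-weighted |E_i|.
price, risk, w1, w2 = 0.25, 0.50, 0.6, 0.4
solver.Minimize(solver.Sum([w1 * price * E[i] + w2 * risk * a[i]
                            for i in range(n)]))

# Solve, then analyze and display the solution.
if solver.Solve() == pywraplp.Solver.OPTIMAL:
    print("objective:", solver.Objective().Value())
    print("changes:  ", [round(x.solution_value(), 2) for x in E])
```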
4 Experiments
We first conduct experiments using a basic scenario with a single consumer load. Then, we vary some parameters of this scenario to see how they affect the resulting solutions.

4.1 Experimental Setup
The basic scenario has the following parameters:
• Time is represented as 28 15-minute intervals.
• The demand period starts at $i = 13$ and ends at $i = 16$.
• The total required reduction $E^{\mathrm{T}}$ equals $-8$ kWh and the required reduction $E^{\mathrm{D}}$ at each interval equals $-2$ kWh.
• The adjustment period has a duration of four intervals.
• The load change needs to be within $[-3\ \mathrm{kWh}, 3\ \mathrm{kWh}]$ for each interval $i = 5, 6, \dots, 24$ and is fixed to 0 kWh for the remaining intervals.
• The forecast timetable energy $E^{\mathrm{F}}_i$ is constant and equals 12 kWh for all time intervals.
• The total energy constraint is unbounded.
• The risk equals 0.50 for all time intervals.
• All prices are constant: $p_i = 0.25$ EUR, $p^{\mathrm{B}} = 0.50$ EUR and $p^{\mathrm{P}} = 1.00$ EUR.

The three scenario variants differ from the basic one as follows. The first scenario variant has no demand. In the second and third scenario variants, the total energy change is set to 0 kWh, ensuring the reduction in energy consumption in some intervals is matched with its increase in others. Additionally, the third scenario variant has no adjustment period, i.e., $n^{\mathrm{A}} = 0$.

4.2 Results and Discussion
We discuss here the results of our original scenario and its three variants. They are also depicted in the plots in Figures 1 to 4, which show with a black line how the consumer load changes from its planned timetable. Consumer load flexibility at each time interval is shown in gray (there is no flexibility in the first four and last four intervals). The demand period is denoted in red and the adjustment period in blue. In most cases (unless the risk has a large weight), the consumer reduces the load in the demand period enough to meet the required demand and earn the entire available rebate while not incurring any penalty. The amount of this reduction and the energy change outside of this period differ for the various scenario variants.

4.2.1 Original Scenario. When the risk has a large weight, the load does not change outside of the demand period (see the top plot in Figure 1). However, when the impact of risk is minimal (bottom plot in Figure 1), the load is reduced everywhere except during the adjustment period. This strategy artificially increases the perceived load reduction to maximize the rebate, as dictated by the rebate calculation formula.

4.2.2 Scenario Variant #1: No Demand. If the optimization is called without a demand, the result depends on the weighting of the first two objectives. As long as the impact of risk is significant (top plot in Figure 2), the load does not change. Otherwise, the load is reduced to the maximum extent in each interval (bottom plot in Figure 2). This approach minimizes the function $f^{\mathrm{E}}$, therefore reducing costs. This means that the consumer behavior can change when optimized even if no demand is present.

4.2.3 Scenario Variant #2: Zero Total Energy Change. Due to the zero energy constraint, the consumer makes adjustments solely within the demand and adjustment periods (see Figure 3). During the adjustment period, the user offsets the consumption from the demand period, thereby achieving a maximal rebate. To adhere to the requirement of minimizing risks and fluctuations in other intervals, no additional changes are made, as such actions would increase the objective value.

4.2.4 Scenario Variant #3: Zero Total Energy Change and No Adjustment Period. When the baseline is not adjusted, the load is increased in intervals outside of the demand period, regardless of whether they occur before or after it. The specific intervals when this happens depend on the solver and are random, as they lead to the same objective function value. An example of such a case is depicted in Figure 4. The last two variants additionally confirm that the usage of the adjustment period enables exploitation – the entire rebate can be gained with a smaller load reduction in the demand period if the load is increased in the adjustment period.
5 Conclusions
This paper focuses on demand response optimization and the growing role of prosumers in energy systems. A standard MILP framework is used to set the consumer load energies within their flexibility so that the costs, risks and energy fluctuations are all minimized. Since the objectives are scalarized with the weighted sum approach, correctly setting their weights is crucial for generating a set of diverse solutions representing various trade-offs between costs and risks.

By creating three scenario variants, we were able to explore the effect of some parameters on the optimization outcome. We observe that:
• Regardless of the variant, the optimal load schedule does not deviate from the forecast one if the importance of risk is too high, i.e., if the weight $w_2$ is too large. This critical value of $w_2$ depends on the scenario variant.
• If the consumer is obliged to a zero sum in load increase and reduction, the optimal solution uses the minimal necessary resources to earn a rebate while avoiding excessive energy changes.
• When the adjustment period is unspecified, the prosumer is less likely to increase the load just before the demand period.

Moving forward, we need to refine the objectives. The current method to assess the baseline consumption is susceptible to exploitation and should be amended. We could calculate the consumer baseline from similar consumers that do not participate in DR, as suggested in [6]. We will also need to revise the penalty calculation to account for the imminent change of tariffs in the Slovenian energy market. We additionally plan to improve the calculation of risks to ensure more robust optimization and real-world applicability. Finally, we intend to develop a better strategy for setting the weights, targeting values with the most significant impact rather than evenly distributing them.

Figure 1: Results for the original scenario with $w_1 = 0.6$ and $w_2 = 0.4$ (top) and $w_1 = 0.8$ and $w_2 = 0.2$ (bottom).

Figure 2: Results for the variant without demand with $w_1 = 0.5$ and $w_2 = 0.5$ (top) and $w_1 = 0.7$ and $w_2 = 0.3$ (bottom).

Figure 3: Results for the variant with zero total energy change with $w_1 = 0.6$ and $w_2 = 0.4$.

Figure 4: Results for the variant with zero total energy change and no adjustment period with $w_1 = 0.6$ and $w_2 = 0.4$.

Acknowledgements
The SEEDS project is co-funded by the European Union's Horizon Europe innovation actions programme under the Grant Agreement n°101138211. The authors acknowledge the financial support from the Slovenian Research and Innovation Agency (research core funding No. P2-0209). The authors wish to thank Bernard Ženko, Martin Žnidaršič and Aljaž Osojnik for helpful discussions when shaping this work.

References
[1] Tobias Achterberg. 2009. SCIP: Solving constraint integer programs. Mathematical Programming Computation, 1, 1–41. doi: 10.1007/s12532-008-0001-1.
[2] AEIC Load Research Committee. 2009. Demand response measurement & verification: Applications for load research. Tech. rep. AEIC Load Research Committee.
[3] István Balázs, Attila Fodor, and Attila Magyar. 2021. Quantification of the flexibility of residential prosumers. Energies, 14, 4860. doi: 10.3390/en14164860.
[4] Martina Capone, Elisa Guelpa, and Vittorio Verda. 2021. Multi-objective optimization of district energy systems with demand response. Energy, 227, 120472. doi: 10.1016/j.energy.2021.120472.
[5] Antonio Gabaldón, Ana García-Garre, María Carmen Ruiz-Abellón, Antonio Guillamón, Carlos Álvarez-Bel, and Luis Alfredo Fernandez-Jimenez. 2021. Improvement of customer baselines for the evaluation of demand response through the use of physically-based load models. Utilities Policy, 70, 101213. doi: 10.1016/j.jup.2021.101213.
[6] Joe Glass, Stephen Suffian, Adam Scheer, and Carmen Best. 2022. Demand response advanced measurement methodology: Analysis of open-source baseline and comparison group methods to enable CAISO demand response resource performance evaluation. Tech. rep. California Independent System Operator (CAISO).
[7] Pedro L. Magalhães and Carlos Henggeler Antunes. 2020. Comparison of thermal load models for MILP-based demand response planning. In Sustainable Energy for Smart Cities. Springer International Publishing, Cham, 110–124.
[8] Javier Parra-Domínguez, Esteban Sánchez, and Ángel Ordóñez. 2023. The prosumer: A systematic review of the new paradigm in energy and sustainable development. Sustainability, 15, 13. doi: 10.3390/su151310552.
[9] Nace Sever. 2022. Časovno razporejanje terenskih nalog z mešanim celoštevilskim linearnim programiranjem. Bachelor's Thesis. University of Ljubljana, Faculty of Mathematics and Physics. https://repozitorij.uni-lj.si/IzpisGradiva.php?lang=slv&id=140427.

Predicting Hydrogen Adsorption Energies on Platinum Nanoparticles and Surfaces with Machine Learning

Lea Gašparič (lea.gasparic@ijs.si), Anton Kokalj (tone.kokalj@ijs.si), Sašo Džeroski (saso.dzeroski@ijs.si)
Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

DOI: https://doi.org/10.70314/is.2024.scai.8689

Abstract
The growing interest in hydrogen gas as a fuel drives research into environmentally friendly hydrogen production methods. One viable approach of obtaining hydrogen is the electrocatalysis of water, which includes the hydrogen evolution reaction (HER) as one of the half-reactions. In the search of highly active catalysts for the HER, machine learning can be effectively utilized to develop models for calculating hydrogen adsorption energy, a key descriptor of catalytic activity.
In this study, we learned models for predicting hydrogen adsorption energy on platinum. We used various machine-learning (ML) techniques on two datasets, one for extended surfaces and the other for nanoparticles. The respective results reveal that ML models for extended surfaces are more accurate than those for nanoparticles, and that the features describing the local environment are the most significant for the predictions. For surfaces, the coordination number is the most relevant feature, while the d-band center is the most important for nanoparticles. The ML models developed in this study lack sufficient accuracy to provide reliable results, highlighting the need for further investigation with additional features or larger datasets.

Keywords
platinum, hydrogen, DFT calculations, decision trees, feature ranking

1 Introduction
A lot of scientific and societal interest is devoted to hydrogen fuel, which can generate electrical power by producing water as a byproduct. One environmentally friendly method of producing hydrogen is through the electrocatalysis of water, where hydrogen and oxygen gases are formed. This process involves two reactions: the oxygen and hydrogen evolution reactions. Considerable effort is being directed towards improving catalysts for both reactions and understanding the fundamental processes involved [21, 13]. In this contribution, we will focus on the hydrogen evolution reaction (HER), for which platinum is known to be a highly active catalyst due to its near-optimal hydrogen adsorption free energy [15, 21]. However, the high cost of platinum motivates ongoing research of alternative materials.

The mechanism of HER includes an adsorbed hydrogen atom (H*) as an intermediate. Consequently, the adsorption energy of hydrogen is often used as a descriptor of the catalytic activity of the material [15, 21]. The most straightforward approach to obtain the adsorption energies is with density-functional theory (DFT) calculations. However, as the size of the system and the number of different adsorption sites increase, a full DFT analysis becomes computationally unfeasible. To address this challenge, machine-learning methods can be employed to predict hydrogen adsorption energies based on DFT results, enabling the investigation of more complex systems [10]. For example, bimetallic nanoparticles were investigated by Jäger et al. [8], and Zhang et al. investigated amorphous systems [20].

This contribution focuses on the use of machine learning for predicting hydrogen adsorption energies on platinum using electronic and geometric descriptors. Two separate datasets were constructed, one for surfaces and the other for nanoparticles. By employing supervised learning and attribute ranking, we built ML models, assessed their accuracy and analyzed whether the two datasets exhibit similar correlations. The idea of the contribution is illustrated in Figure 1.

Figure 1: Supervised machine learning and feature ranking was performed for hydrogen adsorption energy on platinum catalysts modeled as surfaces and nanoparticles.

2 Materials and Methods
2.1 DFT Calculations and Datasets
We utilized DFT calculations to calculate hydrogen adsorption energies (the target variable for ML) and electronic descriptors for ML. We also utilized geometric descriptors. Two datasets were constructed, one for platinum nanoparticles and the other for platinum surfaces.

DFT calculations were performed with the Perdew-Burke-Ernzerhof (PBE) approximation [17], a plane-wave basis set, and PAW pseudopotentials [3].
Energy cutoffs were set to 50 and 575 Ry for wavefunctions and electron density, respectively. Methfessel-Paxton smearing [12] of 0.02 eV was employed. Pt(111), Pt(100), and Pt(110) surface slab models were constructed with the calculated lattice parameter of bulk Pt (3.97 Å). The models of the Pt(111) and Pt(100) surfaces consist of 4 atomic layers, with the bottom layer fixed to bulk positions, while Pt(110) has 6 atomic layers with the bottom two layers fixed. To achieve a greater variety of adsorption sites, Pt(111) and Pt(100) were also modeled with a missing-row defect. All surface models are shown in Figure 2. Calculations accounted for the dipole correction, and periodic images of slabs were separated by at least 15 Å of vacuum. Different sizes of surface supercells were used, and the k-point grids for (1×1) surface unit cells of Pt(111), Pt(100), and Pt(110) were 12×12×1, 11×11×1, and 11×8×1, respectively. For larger supercells, the number of k-points was adapted accordingly.

Figure 2: Models of extended surfaces used to calculate hydrogen adsorption energies.

Calculations with nanoparticles were performed with the gamma k-point and the Martyna-Tuckerman correction for isolated systems [11]. Nanoparticles were modeled with different shapes and sizes, consisting of 3 and up to 116 atoms. Their periodic images were separated by at least 15 Å of vacuum. All calculations were performed with the Quantum ESPRESSO package [5].

The hydrogen adsorption energy was calculated as:

\[ E_{\mathrm{ads}} = E_{\mathrm{H}^*} - E_{*} - \tfrac{1}{2} E_{\mathrm{H}_2} \quad (1) \]

where $E_{\mathrm{H}^*}$ is the calculated energy of the optimized adsorption system, $E_{*}$ is the energy of the standalone platinum system, and $E_{\mathrm{H}_2}$ is the energy of the hydrogen molecule. All performed calculations included only one adsorbed H atom per supercell or nanoparticle.
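As a toy numerical example of Equation (1), the following snippet combines three made-up DFT total energies into an adsorption energy; the values are hypothetical and serve only to illustrate the bookkeeping.

```python
# Hypothetical DFT total energies in eV for one adsorption configuration.
E_H_star = -31052.48   # platinum system with one adsorbed H atom
E_star   = -31035.72   # the clean (standalone) platinum system
E_H2     = -31.68      # an isolated H2 molecule

# Equation (1): E_ads = E_H* - E_* - (1/2) E_H2
E_ads = E_H_star - E_star - 0.5 * E_H2
print(f"E_ads = {E_ads:.2f} eV")   # -0.92 eV; negative means favourable adsorption
```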
As an electronic descriptor, we used the d-band center, which is considered to be a good indicator of metal reactivity [6]. It was obtained through DFT calculations using the following equation:

\[ \varepsilon_{\mathrm{d}} = \frac{\int_{-\infty}^{\infty} n_{\mathrm{d}}(E)\, E\, \mathrm{d}E}{\int_{-\infty}^{\infty} n_{\mathrm{d}}(E)\, \mathrm{d}E} \quad (2) \]

where $E$ is the energy and $n_{\mathrm{d}}$ is the projected density of states on the d-orbitals of the atoms forming the adsorption site.

For the geometric descriptors, we determined the average coordination number of the Pt atoms forming the adsorption site, as well as the generalized coordination number (GCN) of the adsorption site [2], calculated as:

\[ \mathrm{GCN}(i) = \sum_{j=1}^{N_i} \frac{\mathrm{CN}(j)}{\mathrm{CN}_{\max}} \quad (3) \]

where $i$ denotes an atom or a group of atoms forming the adsorption site, $N_i$ is the number of first nearest neighbors of $i$, which are denoted with $j$, $\mathrm{CN}(j)$ is the coordination number of atom $j$, and $\mathrm{CN}_{\max}$ is the maximal coordination of a given site found in the bulk material.

In addition, the type of adsorption site was used as a descriptor. For extended surfaces, the coverage of H atoms, the surface area per H atom and the surface type were also used for learning. For nanoparticles, some descriptors relevant to the size of the nanoparticles were also utilized, in particular: the number of all atoms in the nanoparticle ($N_{\mathrm{all}}$), the number of surface atoms ($N_{\mathrm{surf}}$), the maximal ($r_{\max}$) and minimal ($r_{\min}$) distances from the center of the nanoparticle to the surface atoms, and the distance from the center of the nanoparticle to the adsorption site ($r_{\mathrm{ads}}$). The datasets for surfaces and nanoparticles contained 46 and 85 data points, respectively.
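A short sketch of how the two key descriptors can be computed is given below, assuming a d-projected density of states on a uniform energy grid (Equation 2) and a precomputed first-nearest-neighbor list (Equation 3). The input data are toy stand-ins, and $\mathrm{CN}_{\max}$ must be chosen per site type (e.g., 12 for a top site on an fcc metal).

```python
import numpy as np

# Toy d-projected density of states n_d(E) on a uniform energy grid (eV).
energies = np.linspace(-10.0, 5.0, 1501)
pdos_d = np.exp(-0.5 * ((energies + 2.5) / 1.5) ** 2)

# Equation (2): d-band center as the first moment of n_d(E).
eps_d = np.trapz(pdos_d * energies, energies) / np.trapz(pdos_d, energies)

def generalized_cn(site_atoms, neighbors, cn_max=12):
    """Equation (3): generalized coordination number of an adsorption site.

    site_atoms: indices of the atoms forming the site
    neighbors:  dict mapping atom index -> list of first nearest neighbors
    """
    first_shell = {j for i in site_atoms for j in neighbors[i]
                   if j not in site_atoms}
    return sum(len(neighbors[j]) for j in first_shell) / cn_max

# Toy cluster described only by its neighbor list; the site is atom 0 (top site).
nbrs = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(f"d-band center: {eps_d:.2f} eV, GCN: {generalized_cn([0], nbrs):.2f}")
```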
2.2 Machine-Learning Methods
The prepared datasets were analyzed using the Weka software package [4]. The target value in both datasets is the hydrogen adsorption energy, making this a regression task. Supervised machine learning was employed to develop models for predicting the target value, which were evaluated by 10-fold cross-validation.

One of the used methods is linear regression, which computes the linear relationship between the target value and the descriptors. The relevant descriptors included in the equation were selected according to the M5 method [18]. This method iteratively removes the descriptors with the smallest effect on the model until the error of the model no longer decreases.

We also used the random forest method [7, 1] with 100 trees of unlimited depth. With this method, multiple decision trees were constructed by selecting relevant features from a random subset of $\mathrm{int}(\log_2(m) + 1)$ features, where $m$ is the total number of features. The final values are the averages of the predictions from the individual trees.

To obtain an explainable ML model, we also built regression trees using the M5' method [18, 19]. In this method, trees are built by splitting the training sets according to the attributes that maximize the standard deviation reduction. After the trees are constructed, they are pruned to avoid overfitting and smoothed to address discontinuities between the leaves. For our datasets, we used unpruned trees to prevent the formation of trees that are too small and give poor predictions. We also restricted tree branching to a minimum of 6 instances per leaf node for surfaces and 20 for nanoparticles to avoid overfitting the data and to ensure trees of sufficient size.

We also performed variable importance estimation and ranking for our selected descriptors, with all data points used as a test set. To evaluate the importance of the descriptors with respect to the hydrogen adsorption energy, we employed two methods: ReliefF [9] and correlation [16]. The ReliefF method is more sensitive to feature interactions and works by calculating the distances between training instances and identifying the 'nearest hit' and 'nearest miss'. It then adjusts the weights of the descriptors that differ between the target and nearest instances. The correlation method evaluates the Pearson correlation coefficient [16] between the features and the target variable, without accounting for interactions between features. It gives scores ranging from −1 to 1, with 1 being the highest correlation score; a score of −1 indicates anti-correlation, and 0 indicates no correlation.
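The study itself used Weka; purely as an analogous sketch, the same regression-with-10-fold-cross-validation workflow can be expressed with scikit-learn as below, with a synthetic array standing in for the descriptor table (the real datasets have 46 and 85 rows).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the surface dataset: 46 sites, 8 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(46, 8))
y = rng.normal(loc=-0.4, scale=0.2, size=46)       # "adsorption energies" (eV)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
models = [
    ("linear regression", LinearRegression()),
    # max_features="log2" loosely mirrors Weka's int(log2(m) + 1) subset size.
    ("random forest", RandomForestRegressor(n_estimators=100,
                                            max_features="log2",
                                            random_state=0)),
]
for name, model in models:
    rmse = -cross_val_score(model, X, y, cv=cv,
                            scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE = {rmse.mean():.3f} eV")
```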
3 Results and Discussion
3.1 Machine-Learning Models
Supervised machine learning was performed using linear regression, random forest, and the M5' regression tree. The obtained Pearson's correlation coefficients and root mean squared errors (RMSE) between true and predicted values are shown in Table 1.

Table 1: Pearson's correlation coefficients (CC) and root mean squared errors (RMSE) in eV units for all three used ML methods. For comparison, the RMSE of the default predictor is also given.

                     Surfaces         Nanoparticles
                     CC     RMSE      CC     RMSE
linear regression    0.71   0.13      0.38   0.22
random forest        0.69   0.13      0.34   0.22
M5' decision tree    0.49   0.19      0.34   0.22
default predictor    /      0.18      /      0.23

We can observe that not all ML models provide better RMSE values compared to those calculated with a simple arithmetic average, referred to as the default predictor. For surfaces, linear regression and random forest perform the best and yield similar results. The regression tree model performs the worst and has a higher RMSE compared to the default predictor. For nanoparticles, all methods yield errors close to those of the default predictor and correlation coefficients below 0.5.

The obtained results indicate that with the selected descriptors, the hydrogen adsorption energies are more accurately predicted on surfaces, which are simpler compared to nanoparticles. Surfaces have high symmetry and only a handful of different adsorption sites, while nanoparticles have different shapes and sizes, consist of different facets, and each nanoparticle has numerous different adsorption sites. This gives a huge variety of adsorption sites that can make the prediction of adsorption energies harder.

Considering the best models, the obtained adsorption energies have an error of ±0.13 eV for surfaces and ±0.22 eV for nanoparticles. Due to the exponential dependence of the reaction rate on the adsorption energy, even a small error in adsorption energy hugely affects the reaction rate. Hence, the models, particularly for nanoparticles, do not provide sufficiently accurate results for any practical use.

The selected ML models also provide insights into the relations between the considered features and the target variable. The linear regression model for nanoparticles includes only the d-band center and a factor for the hollow adsorption site, whereas the equation for surfaces is more complex. It includes the adsorption site, surface type, and both coordination numbers. This indicates that for nanoparticles, the d-band center is the most relevant factor, while for surfaces, geometric factors exhibit greater predictive value. The regression-tree models shown in Figure 3 have lower accuracy and, consequently, are less reliable.

Figure 3: Schematic representation of the obtained regression-tree models for ideal surfaces and nanoparticles. Nodes are denoted with orange, and the resulting classes are represented with turquoise circles and include the number of data points in the class.

The ML models could be improved by expanding the dataset or by calculating additional descriptors. For surfaces, more data can be obtained through calculations on a wider variety of surface types and by accounting for different surface defects. However, expanding the dataset for nanoparticles is limited by their size, since DFT calculations for larger particles are computationally too demanding. Therefore, a larger number of different smaller particles can be tested instead. Using more sophisticated descriptors such as atom-centered symmetry functions, smooth overlap of atomic positions and the many-body tensor representation could also improve the results, but would require different sampling of adsorption structures. The use of transfer learning from pre-trained models based on chemical structures could also lead to significant improvements.

3.2 Feature Ranking
Feature ranking was performed for both surfaces and nanoparticles, with the results presented in Figure 4. The ReliefF and correlation importance criteria provide different rankings of features. For surfaces, the coordination number is identified as the most relevant descriptor, followed by the generalized coordination number. In contrast, for nanoparticles, the d-band center is the most important descriptor. Features describing the size of different nanoparticles show lower relevance for predictions. The most relevant features in both data sets describe the local environment of the adsorption site, indicating the local nature of adsorption.

Figure 4: Variable importance scores calculated by the ReliefF and correlation criteria. Importance scores for the correlation criteria are given as absolute values.

The importance of the d-band center is already well-documented in the literature [14], as it correlates with the reactivity of metals. As seen from the graphs, the d-band center is not so strongly correlated with the hydrogen binding energy on surfaces. This can be attributed to the fact that on a perfectly flat surface, all surface atoms have the same d-band center. In contrast, on nanoparticles, the d-band center varies for each adsorption site because the atoms are not equivalent. Therefore, the d-band center is expected to be more relevant for nanoparticles.
For the ranking based on correlation, the calculated factors for the d-band center are negative. This indicates that a lower d-band center corresponds to a higher adsorption energy and consequently a less reactive site, which is physically intuitive.

It is also interesting to note that the surface type descriptor is not very relevant according to correlation, yet it becomes the second most important feature when other descriptors are considered. This can be attributed to the fact that this descriptor has the same value for all adsorption sites on the same surface. However, when combined with other descriptors, it can give additional information, as similar adsorption sites on different surfaces can yield considerably different adsorption energies.

4 Conclusion
We applied different ML techniques to predict the adsorption energy of hydrogen on platinum surfaces and nanoparticles using simple geometric and electronic descriptors. Models for predicting the adsorption energy on surfaces performed better, with the linear regression and random forest methods showing the highest correlation coefficient and accuracy. In contrast, predictions for nanoparticles yielded lower correlation coefficients and accuracy similar to the one calculated by a default predictor. Therefore, the models presented in this contribution do not provide an accurate estimation of hydrogen adsorption energies. Utilizing more sophisticated descriptors and larger training data sets could enhance the performance of these models.

Differences between the datasets are also evident in the feature ranking. For surfaces, coordination numbers are the most relevant descriptors, while for nanoparticles, the d-band center shows the highest relevance. All these relevant descriptors are related to the local environment of the adsorption site, indicating that adsorption is a local phenomenon.

References
[1] Leo Breiman. 2001. Random forests. Machine Learning, 45, 1, 5–32. doi: 10.1023/A:1010933404324.
[2] Federico Calle-Vallejo, José I. Martínez, Juan M. García-Lastra, Philippe Sautet, and David Loffreda. 2014. Fast prediction of adsorption properties for platinum nanocatalysts with generalized coordination numbers. Angew. Chem. Int. Ed., 53, 32, 8316–8319. doi: 10.1002/anie.201402958.
[3] Andrea Dal Corso. 2014. Pseudopotentials periodic table: From H to Pu. Comput. Mater. Sci., 95, 337–350. (Files: H.pbe-kjpaw_psl.1.0.0.UPF, Pt.pbe-n-kjpaw_psl.1.0.0.UPF.) doi: 10.1016/j.commatsci.2014.07.043.
[4] Eibe Frank, Mark A. Hall, and Ian H. Witten. 2016. The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques". Fourth Edition. Morgan Kaufmann. https://ml.cms.waikato.ac.nz/weka/Witten_et_al_2016_appendix.pdf.
[5] Paolo Giannozzi et al. 2009. Quantum ESPRESSO: a modular and open-source software project for quantum simulations of materials. J. Phys.: Condens. Matter, 21, 39, 395502. Code available from http://www.quantum-espresso.org/. doi: 10.1088/0953-8984/21/39/395502.
[6] Bjørk Hammer and Jens K. Nørskov. 1995. Electronic factors determining the reactivity of metal surfaces. Surf. Sci., 343, 3, 211–220. doi: 10.1016/0039-6028(96)80007-0.
[7] Tin Kam Ho. 1995. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Vol. 1. IEEE, 278–282.
[8] Marc O. J. Jäger, Yashasvi S. Ranawat, Filippo Federici Canova, Eiaki V. Morooka, and Adam S. Foster. 2020. Efficient machine-learning-aided screening of hydrogen adsorption on bimetallic nanoclusters. ACS Comb. Sci., 22, 12, 768–781. doi: 10.1021/acscombsci.0c00102.
[9] Igor Kononenko, Edvard Šimec, and Marko Robnik-Šikonja. 1997. Overcoming the myopia of inductive learning algorithms with RELIEFF. Applied Intelligence, 7, 1, 39–55. doi: 10.1023/A:1008280620621.
[10] Jin Li et al. 2023. Machine learning-assisted low-dimensional electrocatalysts design for hydrogen evolution reaction. Nano-Micro Lett., 15, 1, 227. doi: 10.1007/s40820-023-01192-5.
[11] Glenn J. Martyna and Mark E. Tuckerman. 1999. A reciprocal space based method for treating long range interactions in ab initio and force-field-based calculations in clusters. J. Chem. Phys., 110, 6, 2810–2821. doi: 10.1063/1.477923.
[12] Michael Methfessel and Anthony Thomas Paxton. 1989. High-precision sampling for Brillouin-zone integration in metals. Phys. Rev. B, 40, 6, 3616–3621. doi: 10.1103/PhysRevB.40.3616.
[13] Bishnupad Mohanty, Piyali Bhanja, and Bikash Kumar Jena. 2022. An overview on advances in design and development of materials for electrochemical generation of hydrogen and oxygen. Mater. Today Energy, 23, 100902. doi: 10.1016/j.mtener.2021.100902.
[14] Anders Nilsson, Lars G. M. Pettersson, Bjørk Hammer, Thomas Bligaard, Claus Hviid Christensen, and Jens K. Nørskov. 2005. The electronic structure effect in heterogeneous catalysis. Catal. Lett., 100, 3, 111–114. doi: 10.1007/s10562-004-3434-9.
[15] Jens Kehlet Nørskov, Thomas Bligaard, Ashildur Logadottir, John R. Kitchin, Jingguang G. Chen, Stanislav Pandelov, and Ulrich Stimming. 2005. Trends in the exchange current for hydrogen evolution. J. Electrochem. Soc., 152, 3, J23. doi: 10.1149/1.1856988.
[16] Karl Pearson. 1895. VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58, 240–242.
[17] John P. Perdew, Kieron Burke, and Matthias Ernzerhof. 1996. Generalized gradient approximation made simple. Phys. Rev. Lett., 77, 18, 3865–3868. doi: 10.1103/PhysRevLett.77.3865.
[18] John R. Quinlan et al. 1992. Learning with continuous classes. In 5th Australian Joint Conference on Artificial Intelligence, Vol. 92. World Scientific, 343–348.
[19] Yong Wang and Ian H. Witten. 1997. Inducing model trees for continuous classes. In Proceedings of the Ninth European Conference on Machine Learning, Vol. 9. 128–137.
[20] Jiawei Zhang, Peijun Hu, and Haifeng Wang. 2020. Amorphous catalysis: machine learning driven high-throughput screening of superior active site for hydrogen evolution reaction. J. Phys. Chem. C, 124, 19, 10483–10494. doi: 10.1021/acs.jpcc.0c00406.
[21] Jing Zhu, Liangsheng Hu, Pengxiang Zhao, Lawrence Yoon Suk Lee, and Kwok-Yin Wong. 2020. Recent advances in electrocatalytic hydrogen evolution using nanoparticles. Chem. Rev., 120, 2, 851–918. doi: 10.1021/acs.chemrev.9b00248.
SmartCHANGE Risk Prediction Tool: Demonstrating Risk Assessment for Children and Youth

Marko Jordan, Nina Reščič, Sebastjan Kramar, Marcel Založnik, Mitja Luštrek
Jožef Stefan Institute, Department of Intelligent Systems, Ljubljana, Slovenia (Nina Reščič and Mitja Luštrek also: Jožef Stefan International Postgraduate School, Ljubljana, Slovenia)
marko.jordan@ijs.si, nina.rescic@ijs.si, sebastjan.kramar@ijs.si, marcel.zaloznik@ijs.si, mitja.lustrek@ijs.si

DOI: https://doi.org/10.70314/is.2024.scai.8844

Abstract
Non-communicable diseases (NCDs) have become a significant public health challenge in developed countries, driven by common risk factors such as obesity, low physical activity, and unhealthy lifestyle choices. Early childhood and adolescence are crucial for establishing healthy behaviours, and early intervention can play a crucial role in preventing or delaying the onset of NCDs later in life. However, current tools for identifying high-risk individuals are primarily designed for adults, which results in missed early detection opportunities in younger populations. The SmartCHANGE project (https://smart-change.eu/) seeks to bridge this gap by developing reliable AI tools that assess risk factors in children and adolescents as accurately as possible while promoting optimized risk reduction strategies.

In developing the risk assessment tool, we addressed the challenge of merging diverse datasets, predicting missing data to create longitudinal datasets, implementing existing validated models for diabetes (QxMD) and cardiovascular disease (SCORE2), and ultimately creating a simple online application to demonstrate the functionality of the developed risk tool.

Keywords
risk tool, dataset merge, neural networks, online application

1 Introduction
In developed countries, non-communicable chronic diseases (NCDs) have emerged as the foremost public health challenge over recent decades. According to the World Health Organization (WHO), NCDs account for more than 70% of mortality in the European region [18]. Common risk factors for NCDs include obesity, poor physical fitness, and unhealthy lifestyle habits such as inadequate physical activity, sedentary behaviour, poor nutrition, insufficient sleep, smoking, and excessive alcohol consumption. Embracing a healthy lifestyle can improve physical, social, and mental well-being, especially among youth, while mitigating the risks of NCD-related morbidity and mortality [15], [14], [5].

Traditionally, clinical prevention strategies for NCDs have been directed at adults, as the risk factors typically become evident in adulthood. However, recent evidence suggests that focusing interventions on children and adolescents can be a more effective strategy for reducing NCD risk through behaviour modification [13]. While NCDs may not appear in childhood or adolescence, early signs can already be present. Tackling risk factors and promoting healthy habits during these stages can prevent or delay NCDs later in life [12]. Childhood and youth are also crucial periods for establishing healthy lifestyle habits. Since risk factors for NCDs often persist from childhood into adulthood [9], early risk assessment and reduction of risk factors can potentially lower the incidence of NCDs.
Lastly, NCDs in youth are a significant global health challenge, with nearly one in five adolescents worldwide being overweight or obese [1].

Identifying high-risk individuals for future health problems is essential for targeted preventive interventions. Existing tools focus mainly on adults [6], for instance predicting the 10-year risk of developing cardiovascular disease [17] or diabetes [8], missing the opportunity to identify high-risk individuals during childhood and adolescence, a critical period for forming lifestyle habits.

However, recognition of health risks is not a trivial task. For instance, only 35% of doctors in the UK are aware of the recommendations for physical activity, and only 13% can specify the recommended weekly duration. Moreover, more than 80% of parents of inactive children incorrectly believe that their children are sufficiently active [4]. Developing risk prediction tools for children and youth would significantly improve NCD prevention and promote cost-effective strategies.

This paper presents the development of an initial demo application of a risk assessment tool designed for children and adolescents in the SmartCHANGE project [3]: merging datasets, predicting missing data to build longitudinal datasets, implementing existing validated models for diabetes (QxMD) and cardiovascular disease (SCORE2), and finally, the application development.

Table 1: Overview of Selected Datasets

Dataset Name          SLOfit      LGS      YFS      AFINA-TE
Country of Origin     SI          BE       FI       PT
Age Range             5–20        5–25     0–60     5–25
Longitudinal Study    Yes         Yes      Yes      No
# of Participants     280,165     17,991   3,596    1,632
# of Measurements     3,121,399   31,127   32,364   1,632
# of Variables        13          80       24       59
% of Missing Values   2.55%       16.25%   39.49%   33.53%

2 Methodology
2.1 Datasets
To estimate the risk of non-communicable diseases in children, one would ideally need a dataset that tracks risk factors from a young age (when the prediction is made) to an older age (when these diseases typically emerge). Such comprehensive longitudinal datasets would allow for accurate predictions of an individual's likelihood of developing a disease later in life based on their early risk factors. However, such datasets are currently unavailable, so we must rely on a collection of partial and often heterogeneous datasets.

In our study, we have chosen 16 types of variables that are used by the risk models SCORE2 [17] and QxMD [8]. The datasets we were using are described in Table 1. The SLOfit program is a school fitness monitoring initiative in Slovenia [11]. The Leuven Growth Study (LGS) [2, 16] is a longitudinal study initiated in 1969 that evaluates physical fitness. The Cardiovascular Risk in Young Finns Study (YFS) [10], started in the late 1970s, focuses on early cardiovascular disease risk factors. The AFINA-TE dataset [7] is part of an intervention program in Portugal designed to enhance physical fitness, activity, and nutritional knowledge among children and adolescents.

2.2 Data Imputation Through Datasets
The first step involved imputing missing values within each dataset (see Figure 1 for a representation). To guide this process, we calculated the coverage for each variable. Initially, we used only fully observed variables, such as height, weight, and sex, as features in models to impute missing values for other variables. The variables were imputed in order of their coverage using machine learning on existing features. After this initial imputation sweep, we had a complete, though potentially imperfect, dataset. In the second sweep, we treated all columns as complete, incorporating the newly imputed values from the first sweep. This allowed us to train models with a more comprehensive dataset, improving the accuracy of the imputation.

Figure 1: (a) Example of the datasets pre-imputation; (b) example of the datasets post-imputation. The YFS dataset (blue) covers a broad range of variables across a wide age span but includes a relatively small number of participants. In contrast, the SLOfit dataset (green) has many participants but includes fewer variables over a shorter age span. In the first step, we imputed the missing variables across the datasets (grey).
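A minimal sketch of this two-sweep procedure is given below, assuming a purely numeric pandas DataFrame and using a random forest as the (interchangeable) imputation model; all names are hypothetical, and the real pipeline handles categorical variables and per-dataset specifics that are omitted here.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_two_sweeps(df: pd.DataFrame, complete_cols: list) -> pd.DataFrame:
    """Two-sweep imputation sketch for one (numeric) dataset."""
    df = df.copy()
    orig_missing = {c: df[c].isna() for c in df.columns}
    # Impute in decreasing order of coverage (share of observed values).
    targets = sorted((c for c in df.columns if c not in complete_cols),
                     key=lambda c: df[c].notna().mean(), reverse=True)

    # Sweep 1: predict each incomplete column from the columns that are
    # already complete (initially only the fully observed ones).
    features = list(complete_cols)
    for col in targets:
        mask = orig_missing[col]
        if mask.any():
            model = RandomForestRegressor(n_estimators=50, random_state=0)
            model.fit(df.loc[~mask, features], df.loc[~mask, col])
            df.loc[mask, col] = model.predict(df.loc[mask, features])
        features.append(col)   # the column is now complete and usable

    # Sweep 2: re-impute the originally missing cells, now treating all
    # other columns (including first-sweep imputations) as features.
    for col in targets:
        mask = orig_missing[col]
        if not mask.any():
            continue
        feats = [f for f in df.columns if f != col]
        model = RandomForestRegressor(n_estimators=50, random_state=0)
        model.fit(df.loc[~mask, feats], df.loc[~mask, col])
        df.loc[mask, col] = model.predict(df.loc[mask, feats])
    return df
```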
2.3 Longitudinal Data Imputation
In the second step, we employed a similar approach but focused on merging the datasets to fill the new merged dataset longitudinally (see Figure 2 for a representation). To maximize their overlap, we treated certain variables as equivalent, such as the vertical jump from the LGS dataset and the standing long jump from the SLOfit dataset.

Figure 2: Longitudinal filling of the datasets.

Since the raw values of these variables differ, we standardized them by converting them to z-scores, calculated as follows:

\[ z\_score = \frac{variable - mean}{standard\_deviation}. \]

For instance, a vertical jump one standard deviation above the mean in the LGS dataset was considered equivalent to a standing long jump one standard deviation above the mean in the SLOfit dataset. After matching and standardizing the columns across datasets, we merged the individual datasets into a single, comprehensive dataset and repeated the imputation process.

With a merged dataset free of missing values, we built models to predict attribute values at age 55 (the oldest age supported by our data) using values from age 14. Due to the lack of data covering the entire age range from 14 to 55, we approached this in two stages: predicting from age 14 to 18 and then from 18 to 55. The models used were simple neural networks with a single hidden layer.

This individual forecasting approach required available data for the same person from the start to the end age. However, since we had more data available for different people of various ages, we also explored a population-based approach to forecast the typical evolution of each variable. While this method is less personalized, it is also less prone to anomalies caused by atypical individuals. In the population-based approach, we again used z-scores, assuming that each person's z-score remains constant. For example, if someone's blood pressure is one standard deviation below the mean at age 14, it is assumed to stay one standard deviation below the mean at age 55 (see Figure 3).

Figure 3: Population-based approach using z-scores.
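A minimal sketch of this constant-z-score forecast is shown below; the per-age means and standard deviations are hypothetical stand-ins for the statistics of the merged dataset.

```python
def population_forecast(value_at_14, stats):
    """Keep the person's z-score fixed and map it through the cohort
    mean/std at each later age (stats: age -> (mean, std))."""
    mean14, std14 = stats[14]
    z = (value_at_14 - mean14) / std14
    return {age: mean + z * std for age, (mean, std) in stats.items()}

# Hypothetical systolic blood pressure statistics (mmHg) at three ages.
sbp_stats = {14: (115.0, 10.0), 18: (120.0, 11.0), 55: (130.0, 15.0)}
print(population_forecast(105.0, sbp_stats))
# One standard deviation below the mean at 14 stays one below at 18 and 55.
```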
2.4 Risk Models

The SCORE2 and QxMD models were used in the application to assess cardiovascular disease and type 2 diabetes risk. These models were chosen for their validity, robustness and effectiveness in predicting these chronic conditions. By incorporating both, healthcare practitioners can comprehensively evaluate cardiometabolic risk factors, aiding in well-informed patient management and intervention decisions.

The SCORE2 model, developed by the European Society of Cardiology, estimates the risk of cardiovascular events over ten years. It calculates the risk score by incorporating variables such as age, sex, smoking status, blood pressure, and lipid profile. Additionally, SCORE2 considers regional variations in risk factors, providing more accurate predictions tailored to specific populations [17].

The QxMD Diabetes Risk Calculator, a comprehensive clinical decision support tool, is employed to evaluate the risk of developing type 2 diabetes mellitus. This model integrates risk factors, including age, BMI, family history, physical activity level, and dietary habits, to estimate an individual's diabetes risk [8].

3 Evaluation

Table 2 presents the cross-validated evaluation results of our forecasting models. As anticipated, the errors in the first stage of individual forecasting are low due to the relatively short period. The errors in the second stage are higher but still considered acceptable, with the notable exceptions of weight and smoking. We hypothesize that the high variability during puberty, which many adolescents experience around age 14, complicates accurate weight forecasting. In population forecasting, the errors are generally larger, which aligns with the less personalized nature of this method. However, weight is forecasted with greater accuracy in this approach. In the future, we may explore combining both methods or selecting the more accurate one depending on the variable.

Variable | Ind. 18 | Ind. 55 | Pop.
Height [cm] | 3.11 | 3.47 | 1.62
Weight [kg] | 4.79 | 13.60 | 10.58
SBP [mmHg] | 1.46 | 2.39 | 10.91
Total cholesterol [mmol/L] | 0.05 | 0.10 | 0.64
HDL [mmol/L] | 0.02 | 0.08 | 0.21
LDL [mmol/L] | 0.05 | 0.17 | 0.51
Smoking [1-9] | 1.01 | 1.72 | 2.26

Table 2: Mean absolute error for individual forecasting to ages 18 and 55, and for population forecasting.

4 Demo Application

To show the general idea of the project, we constructed a demo application implemented in Python with the Dash framework. In the app, a user can specify the inputs to the models (some inputs, such as steroid use, were fixed to keep the app concise), which in turn yield two plots showing how the cardiovascular and diabetes risks evolve from the currently selected age up to age 55. In a separate plot, we also show how a chosen risk factor changes over time.

4.1 Risk Prediction Using the Demo Application

The developed demo application interface offers a dynamic tool for visualizing health risks based on various user-input parameters used in the risk models (Figure 4). By allowing users to adjust these parameters, the dashboard generates real-time projections of two key risk metrics: a 10-year cardiovascular risk score and a 10-year risk of developing diabetes. These risks are shown in two line graphs, illustrating how the probability of these conditions evolves with age. Additionally, the dashboard includes a feature that tracks the progression of a selected health parameter (BMI, systolic blood pressure, total cholesterol, HDL) over time, providing insight into how this factor might change as the individual ages. The developed tool intuitively explains how lifestyle and physiological factors contribute to long-term health risks, offering valuable insights for clinical decision-making and personal health management.

Figure 4: Dashboard interface that allows users to input various health-related parameters and observe the evolution of associated risks over time.
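To illustrate the kind of Dash setup described above, the following is a minimal single-slider sketch; predict_risk is a hypothetical stand-in for the forecasting models combined with the risk scores, and the layout is far simpler than the actual dashboard in Figure 4.

```python
from dash import Dash, dcc, html, Input, Output
import plotly.graph_objects as go

def predict_risk(age: int, sbp: float) -> float:
    # Hypothetical stand-in for the age-forecasting models plus SCORE2.
    return min(1.0, 0.002 * age * (sbp / 120))

app = Dash(__name__)
app.layout = html.Div([
    html.Label("Systolic blood pressure [mmHg]"),
    dcc.Slider(id="sbp", min=90, max=160, step=1, value=110),
    dcc.Graph(id="risk-plot"),
])

@app.callback(Output("risk-plot", "figure"), Input("sbp", "value"))
def update_plot(sbp):
    ages = list(range(14, 56))
    fig = go.Figure(go.Scatter(x=ages, y=[predict_risk(a, sbp) for a in ages]))
    fig.update_layout(xaxis_title="Age", yaxis_title="10-year cardiovascular risk")
    return fig

if __name__ == "__main__":
    app.run(debug=True)
```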
4.2 Further Development of the Application

The current version of the demo application is based on the data and models currently available. However, there remains an open question regarding the specific needs and preferences of the medical experts who will ultimately use the final application. To address this, we plan to present the current version to these experts and, based on their feedback, refine and enhance the application in subsequent iterations.

5 Conclusion

The SmartCHANGE project represents a significant step toward improving the early detection and prevention of non-communicable diseases (NCDs) in children and youth. While the tool presented in this paper is a demo version demonstrating some basic functionalities, our future work will focus on developing a more comprehensive web application for medical professionals and a mobile application for families. We also plan to enhance the tool by replacing the current SCORE2 and QxMD risk models with more advanced models—Test2Prevent for diabetes and Healthy Heart Score for cardiovascular disease—incorporating features related to diet and physical activity. Additionally, the application will be updated to meet medical experts' needs based on their feedback.

Acknowledgements

This work was carried out as a part of the SmartCHANGE project, which received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101080965. The SLOfit dataset was provided by the University of Ljubljana (courtesy of Gregor Jurak et al.), the LGS dataset was provided by KU Leuven (courtesy of Martine Thomis), the AFINA-TE dataset was provided by the University of Porto (courtesy of José Ribeiro), and the YFS dataset was provided by the University of Turku. We are grateful for their support.

References

[1] P. S. Azzopardi, S. J. C. Hearps, K. L. Francis, E. C. Kennedy, A. H. Mokdad, N. J. Kassebaum, S. Lim, et al. 2019. Progress in adolescent health and wellbeing: tracking 12 headline indicators for 195 countries and territories, 1990-2016. Lancet, 393, 10190, (Mar. 2019), 1101–1120.
[2] Gaston P. Beunen, Robert M. Malina, Marc A. Van't Hof, Jan Simons, Michel Ostyn, Roland Renson, and Dirk Van Gerven. 1988. Adolescent Growth and Motor Performance: A Longitudinal Study of Belgian Boys. Human Kinetics Publishers.
[3] SmartCHANGE Consortium. 2024. SmartCHANGE – Horizon Europe project. Accessed: 2024-09-02. https://www.smart-change.eu/.
[4] K. Corder, E. M. van Sluijs, I. Goodyer, C. L. Ridgway, R. M. Steele, D. Bamber, V. Dunn, S. J. Griffin, and U. Ekelund. 2011. Physical activity awareness of British adolescents. Archives of Pediatrics & Adolescent Medicine, 165, 3, 281–289.
[5] A. García-Hermoso, R. Ramírez-Campillo, and M. Izquierdo. 2019. Is muscular fitness associated with future health benefits in children and adolescents? A systematic review and meta-analysis of longitudinal studies. Sports Medicine, 49, 7, (July 2019), 975–989.
[6] D. C. Goff Jr., D. M. Lloyd-Jones, G. Bennett, S. Coady, R. B. D'Agostino, and R. Gibbons. 2014. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Circulation, 129, 25, (June 2014), S49–S73.
[7] Noelia González-Gálvez, Jose Carlos Ribeiro, and Jorge Mota. 2022. Cardiorespiratory fitness, obesity and physical activity in schoolchildren: the effect of mediation. International Journal of Environmental Research and Public Health, 19, 23, 16262–16270. doi: 10.3390/ijerph192316262.
[8] S. J. Griffin, P. S. Little, C. N. Hales, A. L. Kinmonth, and N. J. Wareham. 2000. Diabetes risk score: towards earlier detection of type 2 diabetes in general practice. Diabetes/Metabolism Research and Reviews, 16, 3, 164–171.
[9] D. R. Jacobs, J. G. Woo, A. R. Sinaiko, S. R. Daniels, J. Ikonen, and M. Juonala. 2022. Childhood cardiovascular risk factors and adult cardiovascular events. New England Journal of Medicine, 386, 19, (May 2022), 1765–1777.
[10] Markus Juonala et al. 2008. Cohort profile: the Cardiovascular Risk in Young Finns Study. International Journal of Epidemiology, 37, 6, 1220–1226.
[11] Gregor Jurak et al. 2020. SLOfit surveillance system of somatic and motor development of children and adolescents: upgrading the Slovenian sports educational chart. Acta Universitatis Carolinae. Kinanthropologica, 56, 1, 28–40. doi: 10.14712/23366052.2020.4.
[12] H. C. McGill Jr., C. A. McMahan, E. E. Herderick, G. T. Malcom, R. E. Tracy, and J. P. Strong. 2000. Origin of atherosclerosis in childhood and adolescence. American Journal of Clinical Nutrition, 72, 5, (Nov. 2000), 1307S–1315S.
[13] K. Pahkala, H. Hietalampi, T. T. Laitinen, J. S. Viikari, T. Rönnemaa, H. Niinikoski, et al. 2013. Ideal cardiovascular health in adolescence: effect of lifestyle intervention and association with vascular intima-media thickness and elasticity (the Special Turku Coronary Risk Factor Intervention Project for Children [STRIP] study). Circulation, 127, 18, (May 2013), 2088–2096.
[14] J. R. Ruiz, I. Cavero-Redondo, F. B. Ortega, G. J. Welk, L. B. Andersen, and V. Martinez-Vizcaino. 2016. Cardiorespiratory fitness cut points to avoid cardiovascular disease risk in children and adolescents; what level of fitness should raise a red flag? A systematic review and meta-analysis. British Journal of Sports Medicine, 50, 13, 773–779.
[15] T. J. Saunders, C. E. Gray, V. J. Poitras, J. P. Chaput, I. Janssen, P. T. Katzmarzyk, et al. 2016. Combinations of physical activity, sedentary behaviour and sleep: relationships with health indicators in school-aged children and youth. Applied Physiology, Nutrition, and Metabolism, 41, 6, (June 2016), 486–505.
[16] Johan Simons, Gaston Beunen, Roland Renson, Albrecht L. M. Claessens, Bernard Vanreusel, and Jos A. V. Lefevre. 1990. Growth and Fitness of Flemish Girls: The Leuven Growth Study. Human Kinetics, Champaign, IL.
[17] SCORE2 working group and ESC Cardiovascular risk collaboration. 2021. SCORE2 risk prediction algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe. European Heart Journal, 42, 25, (June 2021), 2439–2454.
[18] World Health Organization. 2018. Global Health Estimates 2016: Deaths by Cause, Age, Sex, by Country and by Region, 2000-2016. World Health Organization.

Predicting Mental States During VR Sessions Using Sensor Data and Machine Learning

Emilija Kizhevska (emilija.kizhevska@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School (IPS), Ljubljana, Slovenia
Mitja Luštrek (mitja.lustrek@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School (IPS), Ljubljana, Slovenia

Abstract

Empathy is a multifaceted concept with both cognitive and emotional components that plays a crucial role in social interactions, prosocial behavior, and mental health. In our study, empathy and general arousal were induced via VR, with physiological signals measured and ground truth collected through questionnaires. Data from over 100 participants were collected and analyzed using multiple machine learning models and classification algorithms to predict empathy based on physiological responses. We explored different data balancing techniques and labeled the data in multiple ways to enhance model performance. Our results show that the models are effective in detecting general arousal and empathy and in differentiating between non-empathic and empathic arousal, but they encountered difficulties with precise emotion detection. The dataset extracted at 5-second intervals and models using Random Forest and Extreme Gradient Boosting showed the best performance. Future work will focus on refining emotion detection through advanced modeling techniques and investigating gender differences in empathy.

Keywords: VR, mental states, machine learning, sensor data
1 Introduction

Empathy is a multifaceted concept explored across various fields, including psychology, neuroscience, and sociology. Though no universal definition exists, empathy is generally understood to include both cognitive (understanding another's perspective) and emotional (experiencing another's feelings) components [8]. Our research defines empathy as the ability to model others' emotional states and respond sensitively while recognizing the self-other distinction [14].

There is no "golden standard" for measuring empathy [10], with methods varying from self-report questionnaires to psychophysiological measures like heart rate and skin conductance. Each method has its pros and cons, often leading to a combination of approaches for a comprehensive assessment. Psychophysiological measures offer objective data but face challenges due to individual variability and non-empathetic factors. Our study addresses these issues by using machine learning to directly measure empathy from physiological signals, offering a novel approach.

VR creates an immersive environment that enhances empathy by allowing users to experience different perspectives and engage emotionally. VR is effective for empathy training and is referred to as 'the ultimate empathy machine' [1, 11] for various reasons: 1) Immersive Experience: provides a strong sense of presence, helping users adopt new viewpoints [15]. 2) Perspective-Taking and Emotional Engagement: simulates realistic scenarios to provoke emotional responses and understanding [19]. 3) Empathy Training: effective in healthcare, education, and diversity training by challenging preconceptions and deepening emotional insights [16]. 4) Ethical Considerations: ensures respectful use of VR, balancing immersive experiences with participants' well-being [2].

The objective of this study was to examine how participants' empathy correlates with changes in their physiological metrics, measured using sensors such as an inertial measurement unit (IMU), photoplethysmograph (PPG), and electromyography (EMG). Participants were immersed in 360° VR videos featuring actors displaying various emotions (sadness, happiness, anger, and anxiety) and reported their empathetic experiences via brief questionnaires. Using data from these sensors and questionnaires, machine learning models were developed to predict empathy scores based on physiological responses during the VR sessions [9].

2 Materials and Data Collection Process

2.1 Materials and Setup for Empathy Elicitation in VR

To elicit empathy, we immersed participants in a 360° and 3D virtual environment, as VR has proven more effective than methods like 2D videos, workshops, or text-based exercises [8, 13, 17, 20]. We used videos featuring actors expressing four emotions—happiness, sadness, anger, and anxiousness—without additional content to avoid confounding factors [2]. Recognizing the impact of understanding emotional context, an audio narrative version was also created, followed by a corresponding video (50–120 seconds). To ensure gender balance, we recorded videos with two male and two female actors. Five versions were developed: four with narratives (two male, two female) and one non-narrative, where all emotions are portrayed by all actors without accompanying narratives. The non-narrative version allows gradual transitions between emotions, making it suitable for participants of all linguistic backgrounds.

Additionally, a 2-minute forest video ("The Amsterdam Forest in Springtime") was included at the start to establish a relaxed baseline, and a roller coaster video ("Official 360 POV - Yukon Striker - Canada's Wonderland") at the end to control for non-empathic arousal. Both videos were sourced from YouTube.

Participants completed trait empathy questionnaires (QCAE) [14] and, after each emotion-specific video, provided feedback on their empathic state (State Empathy Scale) [18], arousal and valence levels (SAM) [3], and personal distress (IRI) [5]. Each VR session lasted around 20 minutes to minimize VR sickness, with participants viewing one of the five versions.

Sensor data were collected using the emteqPRO system attached to a Pico Neo 3 Pro Eye VR headset, including EMG for facial muscle activation, PPG for heart rate, and IMU for head motion tracking. The device also uses an internal clock [12].
2.2 Dataset Description

In this research, we used convenience sampling to recruit participants from the general public without a specific selection pattern. Participants were invited from various sources, including Jožef Stefan Institute employees, university students, and the general public. Invitations were sent verbally or in writing. Data collection concluded with 105 participants, averaging 22.43 ± 5.31 years of age (range 19–45), with 75.24% identifying as female. Participants had diverse educational and professional backgrounds. Ethical clearance for this study was obtained from the Research Ethics Committee at the Faculty of Arts, University of Maribor, Slovenia (No. 038-11-146/2023/13FF UM), and written informed consent was obtained from the actors prior to recording.

The emteqPRO system not only provides raw sensor data but also generates derived variables through the Emteq Emotion AI Engine, which uses data fusion and machine learning to analyze multimodal sensor data and assess the user's emotional state. The system provides a file with 29 derived features, called affective insights, for each recording: 7 features for heart-rate variability (HRV) and 3 for breathing rate; 2 features for facial expressions; 4 features for arousal and 4 for valence; 1 feature for facial activation; and 1 feature for facial valence. Additionally, head activity is tracked, reflecting the percentage of the session with head movement. Dry EMG electrodes on facial muscles such as the zygomatic, corrugator, frontalis, and orbicularis provide four more features, each representing muscle activation as a percentage of the maximum activation observed during calibration. The data also include the time elapsed since the start of the recording and the row index.

3 Methodology

3.1 Preprocessing

All the features or insights are numeric except "Expression/Type," which has three values—smile, frown, and neutral—so we applied one-hot encoding, a preprocessing technique in which categorical (non-numeric) variables are transformed into a numerical format: each unique value of the original feature becomes a separate binary (0 or 1) feature. Next, because missing values represent less than 1% of the total data for each participant, they were filled in with the average of each feature's values. Scaling the values of the descriptive features between 0 and 1 was the final preprocessing step.
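A minimal sketch of these three preprocessing steps with pandas and scikit-learn; the DataFrame layout is assumed, and only the "Expression/Type" column name and its values come from the description above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # One-hot encode the only categorical insight (smile / frown / neutral).
    df = pd.get_dummies(df, columns=["Expression/Type"])
    # Missing values are <1% per participant: fill with each feature's mean.
    df = df.fillna(df.mean(numeric_only=True))
    # Scale all descriptive features to the [0, 1] range.
    df[df.columns] = MinMaxScaler().fit_transform(df)
    return df
```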
3.2 Feature Engineering

Since features were provided at intervals ranging from 1 second to 500 milliseconds, we divided the data into two window sizes: one of 5 seconds and one of 500 milliseconds. For each window, we computed features from the 22 insights across the seven modules, as well as from the features for head activity and the facial muscle electrodes, deriving a total of 108 new features, including the minimum, maximum, average, and standard deviation of each original feature or insight. Additionally, the features for head activity and the facial muscle electrodes were used to define "Expression/Type," and the time and row index were used as provided. However, the row index was disregarded further in the study.

We labeled the dataset in six different ways: 1) as a binary classification aiming to detect empathic arousal, comparing the empathic parts with the forest part of the video while excluding the non-empathic content of the roller coaster video; 2) as a binary classification using the forest and roller coaster, aiming to detect non-empathic arousal; 3) again as a binary classification, but including only the empathic parts and the roller coaster, aiming to distinguish between empathic and non-empathic arousal and examining the differences in physiological responses between empathic content and non-empathic arousal-inducing content such as the roller coaster video; 4) aiming to detect arousal in general, regardless of whether it is empathic or non-empathic, by splitting the entire dataset into two classes: the forest and everything else, including the empathic parts and the roller coaster; 5) into three classes, treating the chunks of the roller coaster and the forest as separate classes and grouping all the empathic parts into one class without differentiating between the different emotions, the goal being to distinguish among no arousal, empathic arousal, and non-empathic arousal; and 6) with the average of participants' answers to the state empathy questions for each part of the video, with each part of the empathic content considered a separate chunk plus two additional classes for the forest and the roller coaster, the aim being to detect the level of empathy participants experience during the session. We also included each participant's ID, intending to later use it for model evaluation with the leave-one-subject-out technique.
4 Experiments and Results

4.1 Experimental Setup

To build models for predicting a participant's state empathy during the VR session, we used six different classification algorithms: Gaussian Naive Bayes, Stochastic Gradient Descent Classifier, K-Nearest Neighbors Classifier, Decision Tree Classifier, Random Forest Classifier, and Extreme Gradient Boosting Classifier. The balanced accuracy score was used as the evaluation metric for the classification models: it evaluates the overall balanced accuracy of a model by calculating the average of the recall obtained on each class. Additionally, we used confusion matrices to evaluate the performance of the classification models by comparing the actual and predicted labels.

For model evaluation, we used a leave-one-subject-out (LOSO) cross-validation setup, where each subject is a unique participant identified by their ID.

Because labeling schemes 2, 3, 5, and 6 are not balanced (with 80% of the data in the majority class), we conducted four experiments for each developed model: 1) applying the Synthetic Minority Over-sampling Technique (SMOTE) to create synthetic samples of the minority class to balance the dataset; 2) using the RandomUnderSampler (RUnderS) method to randomly select samples from the majority class, thereby reducing their count and balancing the dataset; 3) using SMOTETomek, a combination of SMOTE for oversampling and Tomek links for undersampling, which targets both the minority and majority classes; and 4) using the dataset as it is, without any undersampling or oversampling.
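The following is a minimal sketch of the LOSO evaluation with balanced accuracy, with SMOTE applied only to the training folds to avoid leakage; the Random Forest classifier stands in for any of the six algorithms, and the function is illustrative rather than the study's actual code.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

def loso_balanced_accuracy(X, y, participant_ids):
    """X, y, participant_ids: NumPy arrays; each fold holds out one participant."""
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=participant_ids):
        # Resample only the training folds so no synthetic samples leak into the test.
        X_tr, y_tr = SMOTE().fit_resample(X[train_idx], y[train_idx])
        model = RandomForestClassifier().fit(X_tr, y_tr)
        scores.append(balanced_accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return np.mean(scores)
```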
4.2 Results

Including models developed with six different classification algorithms on two distinct datasets (with two different window sizes), four different data balancing techniques (undersampling, oversampling, a combination technique, and the dataset in its original form), and six different labeling schemes, we obtained 288 unique confusion matrices and corresponding accuracies, one for each combination.

We computed a correlation matrix, which revealed that the highest correlations with the state empathy feature were found for the derived maximum and minimum values of the mean heart rate, the derived maximum and minimum values of the arousal class feature, and the average of the arousal class—the insight, which can be -1 (low), 0 (medium), or 1 (high). The derived standard deviation, maximum, and minimum values of the activation—expressed as a percentage of the maximum activation of particular muscles from the calibration session, especially the zygomaticus and orbicularis muscles—were also highly correlated.

Regarding the labeling schemes, we can conclude the following: 1) we can detect empathic arousal, with confusion matrices showing a relatively good distribution of correct predictions across both classes and high accuracies for most of the developed models; 2) we can detect non-empathic arousal, with almost every developed model achieving a balanced accuracy higher than 60%, reaching up to 78%, and a reasonable balance between classes, indicating satisfactory classification performance; 3) we can even distinguish between empathic and non-empathic arousal, with a balanced accuracy of 79%; 4) we can detect arousal in general, again with high accuracies and balanced classes; 5) we can distinguish to some extent among no arousal, empathic arousal, and non-empathic arousal; 6) however, it is currently very challenging to detect the precise level of empathy participants are feeling during the session using these methods, and to determine whether they are empathizing by mirroring emotions or experiencing something different while observing specific emotions. The best we can detect in this regard is up to 28% balanced accuracy, with confusion matrices showing a relatively balanced performance across multiple classes and a good number of correct classifications, particularly in the more frequent classes.

Regarding the two window sizes, both sets of models showed similar class balance and balanced accuracy scores. However, the dataset extracted at 5-second intervals performed slightly better: false positives and false negatives were reduced more effectively, which led to more reliable classification performance, especially in terms of precision and recall, despite the smaller scale. Thus, the models developed using the 5-second interval dataset generally performed better, showing more effective classification and fewer errors. The simpler confusion matrices and potentially better handling of fewer classes suggest that it performs better in practical terms (Figure 1, Figure 2).

Regarding the data balancing techniques, the undersampling technique never produced the best results. For the dataset extracted at 500 ms intervals, the SMOTE oversampling technique and SMOTETomek yielded the best results. For the dataset extracted at 5-second intervals, using the entire dataset yielded the best results, although models developed using SMOTETomek yielded only slightly lower results for each combination of labeling schemes.

Regarding the classification algorithms, Gaussian Naive Bayes performed the worst in terms of balanced confusion matrices, while the Random Forest Classifier and Extreme Gradient Boosting performed the best across all combinations, with the Random Forest Classifier showing slightly better results for most combinations (Figure 1, Figure 2).

Figure 1: The best accuracies for each group of models, developed using datasets extracted at two different frequencies and various data balancing techniques, presented for all the labeling schemes.

Figure 2: The best confusion matrices for each group of models, developed using the dataset extracted at a 5 s window size and various data balancing techniques, shown for all labeling schemes.
4.3 Conclusion

In this study, we defined the entire plan for developing the materials, methods, and environments to evoke and measure the level of empathy. We started by defining the videos and the session, creating or selecting questionnaires for later use as ground truth, writing the narratives, recording the VR videos, and then editing and preparing them for use. Additionally, we collected a dataset from over 100 participants, which we filtered, preprocessed, and prepared for feature engineering and analysis.

We conducted and analysed four groups of experiments, totaling 288 combinations, in which we developed models using two different window sizes, six classification algorithms, and three resampling techniques (plus the original, unresampled dataset), with six different labeling schemes aimed at detecting various aspects of the dataset chunks: four empathetic parts, the forest, and the roller coaster. The main conclusion is that we can detect arousal in general, non-empathic arousal, and empathy, and differentiate between non-empathic and empathic arousal, as well as between relaxed states and arousal. However, we face difficulties in detecting and distinguishing between the precise levels of empathy during VR sessions using these methods and approaches.

Our next steps involve refining the detection of empathy levels during VR sessions by applying detailed data filtering and transforming the data into a stationary format. Furthermore, we will develop models such as autoregressive, moving average, and extended recurrent moving average models, and use clustering techniques like DBSCAN and HDBSCAN. Additionally, we will extract more features from the raw data or use end-to-end neural networks. We plan to analyze gender differences in empathy with a t-test [7], and explore the impact of narrative context and emotions on empathic responses using ANOVA and MANOVA [4, 6].

Acknowledgements

The work of Emilija Kizhevska was supported by the Slovenian Research and Innovation Agency (ARIS) as part of the young researcher PhD program, grant PR-12879. The technical aspects of the videos, the recording and the video editing were skillfully conducted by Igor Djilas and Luka Komar. The actors featured in the videos were Sara Janašković, Kristýna Šajtošová, Domen Puš, and Jure Žavbi. The questionnaires were selected and created, the narratives were written, and the psychological aspects of the video creation were considered by Kristina Šparemblek.

References

[1] D. Banakou, P. D. Hanumanthu, and M. Slater. 2016. Virtual embodiment of white people in a black virtual body leads to a sustained reduction in their implicit racial bias. Frontiers in Human Neuroscience, 10, 226766.
[2] P. Bertrand, J. Guegan, L. Robieux, C. A. McCall, and F. Zenasni. 2018. Learning empathy through virtual reality: multiple strategies for training empathy-related abilities using body ownership illusions in embodied virtual reality. Frontiers in Robotics and AI, 5, 326671.
[3] M. M. Bradley and P. J. Lang. 1994. Measuring emotion: the self-assessment manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry, 25, 1, 49–59.
[4] A. Cuevas, M. Febrero, and R. Fraiman. 2004. An ANOVA test for functional data. Computational Statistics and Data Analysis, 47, 1, 111–122.
[5] M. H. Davis. 1980. A multidimensional approach to individual differences in empathy. JSAS Catalog of Selected Documents in Psychology, 85.
[6] A. French, M. Macedo, J. Poulsen, T. Waterson, and A. Yu. 2008. Multivariate analysis of variance (MANOVA). San Francisco State University.
[7] T. K. Kim. 2015. T test as a parametric statistic. Korean Journal of Anesthesiology, 68, 6, 540–546.
[8] E. Kizhevska, F. Ferreira-Brito, T. Guerreiro, and M. Luštrek. 2022. Using virtual reality to elicit empathy: a narrative review. VR4Health@MUM, 19–22.
[9] E. Kizhevska, K. Šparemblek, and M. Luštrek. 2024. Protocol of the study for predicting empathy during VR sessions using sensor data and machine learning. PloS One, 19, 7, e0307385.
[10] F. F. D. Lima and F. D. L. Osório. 2021. Empathy: assessment instruments and psychometric quality – a systematic literature review with a meta-analysis of the past ten years. Frontiers in Psychology, 12, 781346.
[11] M. Mado, F. Herrera, K. Nowak, and J. Bailenson. 2021. Effect of virtual reality perspective-taking on related and unrelated contexts. Cyberpsychology, Behavior, and Social Networking, 24, 12, 839–845.
[12] M. J. Magnée, B. De Gelder, H. Van Engeland, and C. Kemner. 2007. Facial electromyographic responses to emotional information from faces and voices in individuals with pervasive developmental disorder. Journal of Child Psychology and Psychiatry, 48, 11, 1122–1130.
[13] K. M. Nelson, E. Anggraini, and A. Schlüter. 2020. Virtual reality as a tool for environmental conservation and fundraising. PloS One, 15, 4, e0223631.
[14] R. L. E. P. Reniers, R. Corcoran, R. Drake, N. M. Shryane, and B. A. Völlm. 2011. The QCAE: a questionnaire of cognitive and affective empathy. Journal of Personality Assessment, 93, 1, 84–95. doi: 10.1080/00223891.2010.528484.
[15] G. Riva, J. A. Waterworth, and E. L. Waterworth. 2004. The layers of presence: a bio-cultural approach to understanding presence in natural and mediated environments. CyberPsychology & Behavior, 7, 4, 402–416.
[16] R. O. Roswell, C. D. Cogburn, J. Tocco, J. Martinez, C. Bangeranye, J. N. Bailenson, and L. Smith. 2020. Cultivating empathy through virtual reality: advancing conversations about racism, inequity, and climate in medicine. Academic Medicine, 95, 12, 1882–1886.
[17] N. S. Schutte and E. J. Stilinović. 2017. Facilitating empathy through virtual reality. Motivation and Emotion, 41, 708–712.
[18] L. Shen. 2010. On a scale of state empathy during message processing. Western Journal of Communication, 74, 5, 504–524.
[19] M. Slater, A. Antley, A. Davison, D. Swapp, C. Guger, C. Barker, and M. V. Sanchez-Vives. 2006. A virtual reprise of the Stanley Milgram obedience experiments. PloS One, 1, 1, e39.
[20] J. Stargatt, S. Bhar, T. Petrovich, J. Bhowmik, D. Sykes, and K. Burns. 2021. The effects of virtual reality-based education on empathy and understanding of the physical environment for dementia care workers in Australia: a controlled study. Journal of Alzheimer's Disease, 84, 3, 1247–1257.

Biomarker Prediction in Colorectal Cancer Using Multiple Instance Learning

Miljana Shulajkovska (miljana.sulajkovska@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Matej Jelenc (jelenc11matej@gmail.com), Jožef Stefan Institute, Ljubljana, Slovenia
Jitendra Jonnagaddala (jitendra.jonnagaddala@unsw.edu.au), School of Population Health, Faculty of Medicine and Health, Sydney, Australia
Anton Gradišek (anton.gradisek@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

Microsatellite instability (MSI) is a crucial biomarker in colorectal cancer, guiding personalised treatment strategies. The focus of our paper is on evaluating how different state-of-the-art pretrained artificial intelligence models perform in extracting features on the Molecular and Cellular Oncology (MCO) study dataset to predict biomarkers. In this study, we present an advanced approach for MSI prediction using multiple instance learning on whole slide images (WSIs). Our process begins with comprehensive preprocessing of WSIs, followed by tessellation, which breaks down large images into manageable tiles. State-of-the-art feature extraction techniques are utilised on these selected tiles, employing pretrained models to capture rich, discriminative features. Various aggregation methods are applied to combine these features, leading to the prediction of MSI status across the entire slide. We assess the performance of different pretrained models within this framework, demonstrating their effectiveness in accurately predicting MSI, with results showing an AUROC of 0.91 on the MCO dataset.
Our findings underscore the potential of multiple instance learning-based approaches in enhancing biomarker prediction in colorectal cancer, contributing to more targeted and effective treatment strategies.

Keywords: multiple instance learning, whole slide images, colorectal cancer, biomarker prediction

1 Introduction

MSI is a crucial biomarker in colorectal cancer (CRC) that indicates defects in the DNA mismatch repair system, leading to a high mutation rate within tumor cells. MSI status has significant clinical implications, influencing treatment decisions, particularly the use of immunotherapy, and providing prognostic information. Traditionally, MSI is determined through laboratory tests such as PCR-based assays or immunohistochemistry (IHC) on tumor tissue samples, which require invasive biopsy procedures. However, these methods can be time-consuming, costly, and dependent on the availability of sufficient tissue samples.

Deep learning methods have emerged as a promising non-invasive alternative for MSI prediction by analysing whole slide images (WSIs) of histopathological samples. These models can detect patterns linked to MSI, eliminating the need for genetic testing. WSIs provide a comprehensive view of tumor histology, offering a faster, less invasive, and more accessible means of diagnosis. Integrating deep learning into clinical practice can improve early MSI detection, personalise treatment, and reduce invasive procedures. WSI-based methods streamline diagnostics and enhance cancer care with accessible predictive analytics.

Due to the vast size of WSIs, computational resources can easily be overwhelmed, so WSIs are often divided into smaller regions or patches. A common method for addressing these issues is Multiple Instance Learning (MIL) [3, 8]. MIL is a machine learning technique that operates on sets or "bags" of instances, where the label is assigned to the entire bag rather than to individual instances. This is particularly advantageous in WSI analysis, where labels such as MSI status apply to the entire slide, which is composed of numerous smaller regions or patches.

In this context, [4] demonstrates state-of-the-art (SOTA) results in predicting MSI in colorectal cancer. Their workflow utilizes the Swin-T model on small datasets to predict MSI. First, a pretrained tissue classification model is employed to filter out non-tissue patches, followed by fine-tuning a pretrained model to classify the remaining patches. Both intra-cohort and external validation are performed. When trained on the MCO dataset (N=1065), the model achieved a mean AUROC of 0.92 ± 0.05 for MSI prediction. Similarly, [11] employs a transformer-based approach for a large-scale multi-cohort evaluation, involving over 13,000 patients for biomarker prediction and achieving a negative predictive value of over 0.99 for MSI prediction. When trained and tested only on a single cohort (MCO), the model achieved an AUROC of 0.85. While [4] achieved promising results on the MCO dataset using an additional tissue classifier, we obtained comparable performance without the need for tissue classification. On the other hand, [11] used a multicentric cohort, which demands additional computational resources. In comparison to their results on the MCO dataset, we achieved a 6% improvement using a smaller dataset.

In this study, we leverage MIL to process WSIs for the prediction of MSI in CRC. By testing SOTA models on the MCO dataset, we aim to assess their performance in MSI prediction using MIL. This approach not only highlights the potential of MIL in processing complex, unannotated WSIs but also contributes to the broader goal of improving biomarker prediction in CRC, ultimately supporting more personalized and effective treatment strategies.

The paper is organised as follows: Section 2 outlines the methods used in the pipeline, Section 3 provides a description of the data, Section 4 presents the results, and Section 5 discusses the findings and potential directions for future work.
2 Methods

This section outlines the pipeline for MSI prediction, as illustrated in Figure 1. The process begins with the preprocessing of WSIs, including tessellation into smaller patches. Next, SOTA pretrained models are employed to extract features from these patches. These models, trained on large and diverse datasets, capture rich and discriminative features crucial for accurate MSI prediction. Finally, aggregation techniques are applied to combine the information from the patches, enabling precise MSI status prediction for the entire slide. Each subsection provides a concise explanation of these individual processes.

Figure 1: General architecture: multiple-instance learning approach.

2.1 Preprocessing

WSIs are first tessellated into smaller, more manageable patches to facilitate further processing. This step involves dividing the large images into smaller regions using the tiatoolbox presented in [9]. Non-informative tissue patches are removed to ensure the analysis focuses solely on relevant tissue areas. Specifically, patches that are out of bounds—where only a portion contains actual image data and the remainder consists of padding—are discarded. Patches that consist entirely of tissue are retained for subsequent analysis. This preprocessing step ensures that only informative and relevant patches are used for feature extraction and MSI prediction.

2.2 Feature Extraction Methods

Since only WSI-level annotations are available, several pretrained feature extraction models – UNI [1], ProvGigaPath [13], Phikon [2] and CTransPath [12] – are applied to the patches, removing the need for detailed patch-level labeling. These SOTA models, trained on large datasets, can capture complex and discriminative features essential for accurate biomarker prediction. The extracted feature embeddings are then used as input for the aggregation and classification stages, laying the foundation for precise MSI status prediction. For technical details about these models, see Table 1.

2.3 Aggregation Methods

After feature extraction, we apply aggregation techniques to combine patch-level features into a slide-level representation. Traditional pooling methods like max-pooling and mean-pooling provide straightforward approaches. However, these methods are limited by their lack of trainability. In recent years, attention-based pooling, or ABMIL, has become a popular technique that addresses this issue [6]. ABMIL assigns a weight $\alpha_i$ to each patch's feature vector $f_i$, reflecting its importance:

$F = \sum_{i \in P} \alpha_i f_i$

The attention scores $\alpha_i$ are computed as:

$\alpha_i = \frac{\exp(w^\top \tanh(V f_i))}{\sum_{k \in P} \exp(w^\top \tanh(V f_k))}$

where $w$ and $V$ are trainable parameters. This approach allows the model to dynamically focus on the most relevant patches, leading to more accurate MSI predictions.
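As an illustration of the ABMIL aggregation just described, the following is a minimal PyTorch sketch of attention pooling followed by a sigmoid slide-level classifier; the dimensions (e.g. 768-dimensional Phikon-style embeddings) and module layout are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ABMIL(nn.Module):
    """Attention-based MIL pooling [6]: alpha_i = softmax_i(w^T tanh(V f_i)),
    F = sum_i alpha_i f_i, followed by a sigmoid slide-level classifier."""
    def __init__(self, feat_dim: int = 768, attn_dim: int = 128):
        super().__init__()
        self.V = nn.Linear(feat_dim, attn_dim, bias=False)
        self.w = nn.Linear(attn_dim, 1, bias=False)
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (num_patches, feat_dim) feature embeddings of one WSI
        scores = self.w(torch.tanh(self.V(bag)))              # (num_patches, 1)
        alpha = torch.softmax(scores, dim=0)                  # attention weights
        slide_feature = (alpha * bag).sum(dim=0)              # F = sum alpha_i f_i
        return torch.sigmoid(self.classifier(slide_feature))  # P(slide is MSI)

# Example: a bag of 3,000 hypothetical 768-dimensional patch embeddings.
prob_msi = ABMIL(feat_dim=768)(torch.randn(3000, 768))
```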
Another technique similar to attention is DSMIL [7], a dual-stream aggregator consisting of two branches, employing both an instance classifier and a bag classifier. Let $h_i \in \mathbb{R}^{L \times 1}$ be a feature embedding, and $B = \{h_0, \dots, h_{n-1}\}$ a bag of embeddings. The first stream uses an instance classifier, followed by a max-pooling operation, to obtain a score $c_m(B)$ and the critical embedding $h_m$. The second stream aggregates the embeddings into a single bag embedding, which is then passed through a bag classifier:

$c_b(B) = \sum_{i=0}^{n-1} W_b \, U(h_i, h_m) \, v_i$

where $W_b$ is a weight vector for classification, $v_i$ is an information vector, and $U$ is a distance measurement between an arbitrary embedding and the critical embedding:

$U(h_i, h_m) = \frac{\exp(\langle q_i, q_m \rangle)}{\sum_{k=0}^{n-1} \exp(\langle q_k, q_m \rangle)}$

where $q_i$ is a query vector. Both $q_i$ and $v_i$ are calculated by:

$q_i = W_q h_i, \quad v_i = W_v h_i, \quad i = 0, \dots, n-1$

where $W_q$ and $W_v$ are weight matrices. The final prediction is given by:

$c(B) = \frac{1}{2}\left(c_m(B) + c_b(B)\right)$

The last approach for feature aggregation reviewed in this paper is TransMIL, as proposed in [10]: a Transformer-based aggregation method which, unlike the aforementioned methods, also takes spatial information into account. By treating a bag of embeddings as a sequence of tokens, TransMIL uses a novel TPT module made up of two Transformer layers and a position encoding layer, where the Transformer layers are designed for aggregating morphological information and the Pyramid Position Encoding Generator (PPEG) encodes spatial information, followed by a multi-layer perceptron (MLP) which classifies the bag.

2.4 MSI Classification

The aggregation step produces a single feature vector F, which encapsulates the most informative characteristics of the entire slide. This aggregated feature vector F is then passed through one or more fully connected (dense) layers. These layers apply learned weights and biases to transform the features into a form that is more suitable for classification. The output of the fully connected layer is often passed through an activation function, such as a sigmoid or softmax, depending on whether the classification task is binary (microsatellite instability, MSI, vs. microsatellite stability, MSS) or multi-class. For MSI prediction, a sigmoid function is typically used, outputting a probability value between 0 and 1. The final output of the model is a single probability value indicating the likelihood of the slide being MSI. A threshold (e.g., 0.5) is applied to this probability to make a binary decision.
3 Data

For this paper, the MCO study [5] was used for training and testing. The MCO study collection contains 1,500 digitized whole slide images (WSIs) of colorectal cancer tissues. Conducted by the Molecular and Cellular Oncology (MCO) Study group from 1994 to 2010, this study systematically gathered tissue samples and clinical data from over 1,500 patients who underwent colorectal cancer surgery. Each slide, representing a typical tumor section, is stained with hematoxylin and eosin and scanned at a 40x objective, achieving a resolution of 0.25 mpp, comparable to an optical microscope (~100,000 dpi). The total data size is approximately 3 terabytes, and the collection is available on the Intersect Australia RDSI Node.

feature extractor | architecture | dataset | embedding size
UNI [1] | ViT-large, DINOv2, 16 heads | Mass-100k: in-house histopathology slides from MGH and BWH, and external slides from the GTEx consortium, containing >100M images derived from >100,000 WSIs across 20 major tissue types | 1024
ProvGigaPath [13] | ViT-large, DINOv2, 24 heads | Prov-Path: dataset from Providence, a large US health network comprising 28 cancer centres, consisting of 1.3B images from 171,189 WSIs | 1536
Phikon [2] | ViT-large, iBOT combining MIM and CL | PanCancer40M: dataset from TCGA, covering 13 anatomic sites and 16 cancer subtypes, consisting of 43.4M images from 6,093 WSIs | 768
CTransPath [12] | CNN with multi-scale Swin Transformer | dataset from TCGA and PAIP, consisting of 15M images from 32,220 WSIs | 768

Table 1: Technical details about the pretrained feature extraction models.

4 Results

The dataset used in this study comprised 996 whole slide images (WSIs), with 242 labeled as MSI and 754 as MSS. To evaluate the performance of the various aggregation methods, models were trained using 5-fold cross-validation, which ensured robust training and validation. To create a balanced testing set of 96 samples, 20% of the positive (MSI) samples and an equal number of negative (MSS) samples were randomly excluded. The remaining data was split into five equally balanced parts for cross-validation, with each fold consisting of 180 samples in the validation set and 720 samples in the training set.

WSIs were then preprocessed into bags, each containing approximately 2,000 to 4,000 patches.
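A sketch of the slide-level split described above, under the assumption that slides are indexed by integer IDs; the label array mirrors the 242/754 MSI/MSS counts, and the exact sampling details of the original study may differ.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical labels: 1 = MSI, 0 = MSS, mirroring the 242/754 split above.
y = np.array([1] * 242 + [0] * 754)
slides = np.arange(len(y))

# Hold out a balanced test set: 20% of MSI slides plus as many MSS slides.
msi_train, msi_test = train_test_split(slides[y == 1], test_size=0.2, random_state=0)
mss_train, mss_test = train_test_split(slides[y == 0], test_size=len(msi_test), random_state=0)
test_idx = np.concatenate([msi_test, mss_test])

# Remaining slides -> 5 folds with the class ratio preserved in each fold.
rest = np.concatenate([msi_train, mss_train])
for tr, va in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(rest, y[rest]):
    pass  # train on rest[tr], validate on rest[va]
```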
Each patch was then converted into feature embeddings using four different feature extraction methods: Phikon, CTransPath, ProvGigaPath, and UNI. Specifically, CTransPath and Phikon produced embeddings with 768 features, UNI with 1024 features, and ProvGigaPath with 1536 features.

Three feature aggregation methods—ABMIL, DSMIL, and TransMIL—were applied to the extracted features to generate a single representative feature for each WSI. Following aggregation, a simple neural network with a sigmoid activation function and a threshold of 0.5 was used to classify MSI and MSS.

Each aggregation model was then trained for each feature extraction method on each fold, with training conducted over 50 epochs using the AdamW optimiser and the 1-cycle learning rate scheduler to adjust the learning rate as the models approached convergence. Binary cross-entropy (BCE) was used as the loss function. After each epoch, model performance was evaluated on the validation set using the AUROC metric to select the best checkpoint, as most models tended to overfit toward the end of training. The selected checkpoints were then tested to calculate the mean AUROC across all folds.
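The training recipe described above (AdamW, 1-cycle learning-rate schedule, binary cross-entropy over 50 epochs) can be sketched as follows; the bags and the mean-pool-plus-linear classifier are hypothetical stand-ins for the aggregation models, kept deliberately simple.

```python
import torch

# Hypothetical bags: (patch embeddings, slide label) pairs, one per WSI.
train_bags = [(torch.randn(2000, 1536), torch.tensor([1.0])) for _ in range(4)]

model = torch.nn.Linear(1536, 1)  # stand-in for ABMIL/DSMIL/TransMIL + head
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1e-4, epochs=50, steps_per_epoch=len(train_bags))
loss_fn = torch.nn.BCEWithLogitsLoss()

for epoch in range(50):
    for bag, label in train_bags:
        opt.zero_grad()
        logit = model(bag.mean(dim=0))  # mean-pooled bag -> slide logit
        loss = loss_fn(logit, label)
        loss.backward()
        opt.step()
        sched.step()  # 1-cycle schedule advances once per batch
```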
Results are presented in Figure 2a. The best performance was achieved using the DSMIL aggregation method with the ProvGigaPath feature extractor, yielding an AUROC of 0.91 ± 0.01. The ABMIL method performed best with the Phikon and UNI extractors, achieving AUROCs of 0.91 ± 0.02. Finally, the TransMIL method combined with ProvGigaPath resulted in an AUROC of 0.90 ± 0.01. Additionally, statistical analysis was performed, specifically the Wilcoxon signed-rank test, which yielded an average p-value of 0.446, showing a relatively insignificant difference in the performance of the different feature extraction methods, as expected.

Figure 2: Predictive performance of 5-fold cross-validation of different feature extractors and aggregation methods for (a) ABMIL, (b) DSMIL, and (c) TransMIL. AUROC plots for prediction of MSI/MSS status. The true positive rate represents sensitivity and the false positive rate represents 1-specificity. The shaded areas represent the standard deviation (SD). The value in the lower right of each plot represents the mean AUROC ± SD.

5 Discussion and Conclusion

In this study, we explored the potential of MIL combined with SOTA pretrained models for predicting MSI in colorectal cancer. Our results indicate that the approach is highly effective, achieving an AUROC of 0.913 on the MCO dataset. This is a notable achievement, particularly when compared to previous studies, such as [4] and [11], which reported AUROCs of 0.92 and 0.85, respectively, on the same dataset. Our results not only validate the effectiveness of our approach but also suggest that the careful selection and combination of feature extraction and aggregation methods can yield improvements in predictive accuracy.

The positive and negative rates observed in our results reflect the model's ability to correctly classify MSI and MSS cases. A high true positive rate (sensitivity) indicates the model's proficiency in identifying MSI-positive cases, which is crucial for ensuring that patients who could benefit from MSI-targeted therapies are accurately identified. Conversely, a high true negative rate (specificity) shows the model's effectiveness in correctly classifying MSS cases, thereby minimising false positives.

To further enhance the accuracy and reliability of MSI prediction, several avenues for future work are planned.

Utilisation of the entire dataset: We plan to leverage the full dataset to improve the robustness of our model. Training on a larger dataset may help in capturing more nuanced patterns and variations, leading to even more accurate predictions.

Fine-tuning of pretrained models: While we used pretrained models without fine-tuning in this study, fine-tuning these models specifically for the task of MSI prediction could further improve their performance. Tailoring the models to our specific data distribution and task requirements may yield significant gains in accuracy.

Incorporation of a tissue classifier: Since MSI is typically found in tumor tissue, we plan to integrate a tissue classifier to automatically remove non-tumor tissue from the analysis. This step should enhance the model's focus on relevant tissue regions, potentially improving MSI prediction accuracy and speeding up the whole process.

Development of advanced aggregation methods: We also plan to explore more sophisticated aggregation techniques that can better capture the complex relationships between patches within a WSI. Advanced methods may help refine the prediction process, leading to further improvements in model performance.

Overall, our study demonstrates the potential of MIL-based approaches in enhancing biomarker prediction in colorectal cancer, paving the way for more personalized and effective treatment strategies.

References

[1] Richard J. Chen et al. 2024. Towards a general-purpose foundation model for computational pathology. Nature Medicine, 30, 3, 850–862.
[2] Alexandre Filiot, Ridouane Ghermi, Antoine Olivier, Paul Jacob, Lucas Fidon, Alice Mac Kain, Charlie Saillard, and Jean-Baptiste Schiratti. 2023. Scaling self-supervised learning for histopathology with masked image modeling. medRxiv, 2023–07.
[3] Michael Gadermayr and Maximilian Tschuchnig. 2024. Multiple instance learning for digital pathology: a review of the state-of-the-art, limitations & future potential. Computerized Medical Imaging and Graphics, 102337.
[4] Bangwei Guo, Xingyu Li, Jitendra Jonnagaddala, Hong Zhang, and Xu Steven Xu. 2022. Predicting microsatellite instability and key biomarkers in colorectal cancer from H&E-stained images: achieving SOTA predictive performance with fewer data using Swin Transformer. arXiv preprint arXiv:2208.10495.
[5] Nick Hawkins. 2015. MCO study whole slide image collection.
[6] Maximilian Ilse, Jakub Tomczak, and Max Welling. 2018. Attention-based deep multiple instance learning. In International Conference on Machine Learning. PMLR, 2127–2136.
[7] Bin Li, Yin Li, and Kevin W. Eliceiri. 2021. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14318–14328.
[8] Oded Maron and Tomás Lozano-Pérez. 1997. A framework for multiple-instance learning. Advances in Neural Information Processing Systems, 10.
[9] Johnathan Pocock et al. 2022. TIAToolbox as an end-to-end library for advanced tissue image analytics. Communications Medicine, 2, 1, 120.
[10] Zhuchen Shao, Hao Bian, Yang Chen, Yifeng Wang, Jian Zhang, Xiangyang Ji, et al. 2021. TransMIL: transformer based correlated multiple instance learning for whole slide image classification. Advances in Neural Information Processing Systems, 34, 2136–2147.
[11] Sophia J. Wagner et al. 2023. Transformer-based biomarker prediction from colorectal cancer histology: a large-scale multicentric study. Cancer Cell, 41, 9, 1650–1661.
[12] Xiyue Wang, Sen Yang, Jun Zhang, Minghui Wang, Jing Zhang, Wei Yang, Junzhou Huang, and Xiao Han. 2022. Transformer-based unsupervised contrastive learning for histopathological image classification. Medical Image Analysis, 81, 102559.
[13] Hanwen Xu et al. 2024. A whole-slide foundation model for digital pathology from real-world data. Nature, 1–8.

Feature-Based Emotion Classification Using Eye-Tracking Data

Tomi Božak (tb85088@student.uni-lj.si), Jožef Stefan Institute, Ljubljana, Slovenia
Mitja Luštrek (mitja.lustrek@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Gašper Slapničar (gasper.slapnicar@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

The field of emotion recognition from eye-tracking data is well-established and offers near-real-time insights into human affective states.
It is less obtrusive than some other modalities, such as electroencephalogram (EEG), electrocardiogram (ECG) and galvanic skin response (GSR), which are often used in emotion recognition tasks. This study examined the practical feasibility of emotion recognition using an eye-tracker with a lower frequency than that typically employed in similar research. Using ocular features, we explored the efficacy of classical machine learning (ML) models in classifying four emotions (anger, disgust, sadness, and tenderness) as well as neutral and "undefined" emotions. The features included gaze direction, pupil size, saccadic movements, fixations, and blink data. The data from the "emotional State Estimation based on Eye-tracking database" was preprocessed and segmented into various time windows, with 22 features extracted for model training. Feature importance analysis revealed that pupil size and fixation duration were most important for emotion classification. The efficacy of different window lengths (1 to 10 seconds) was evaluated using leave-one-subject-out (LOSO) and 10-fold cross-validation (CV). The results demonstrated that accuracies of up to 0.76 could be achieved with 10-fold CV when differentiating between positive, negative, and neutral emotions. The analysis of model performance across different window lengths revealed that longer time windows generally resulted in improved model performance. When the data was split using a marginally personalised 10-fold CV within video, the Random Forest classifier (RF) achieved an accuracy of 0.60 in differentiating between the six aforementioned emotions. Some challenges remain, particularly with regard to data granularity, model generalization across subjects, and the impact of downsampling on feature dynamics.

Keywords: eye-tracking, emotion recognition, machine learning

1 Introduction

Emotion recognition is a vibrant area of research, leveraging diverse data sources such as images [11], audio [16], and also ocular features like pupil dilation, gaze direction, blinks, and saccadic movements [3, 8, 12]. Such eye-related features provide valuable insights into emotional states, offering a less-invasive and real-time approach to understanding human affective responses. Most studies that tried to predict emotions from these eye-related features relied not only on eye-tracking data but also on EEG [8, 12]. We hypothesized that eye-tracking data is a valuable modality for multi-modal emotion recognition on its own, with potential applications in real-world scenarios like office work, driving, and psychological assessments, as well as in estimating well-being. Our motivation was to explore eye-tracker-based predictive models as an essential component in such practical applications.

The primary objective of our study was to validate existing findings on the performance of classical ML models for emotion classification from eye-tracking data, using models – Support Vector Machine (SVM) and k-Nearest Neighbors (KNN) – and features already explored in the literature [9, 15], as well as exploring classifiers not so frequently used in this field, such as RF and XGBoost (XGB). Additionally, we aimed to explore the potential of emotion recognition at the lower sampling frequencies available in most non-professional eye trackers. For this early feasibility study, we used an existing dataset collected with a wearable eye-tracker, but the findings could possibly be extended to high-quality unobtrusive contact-free trackers. Our research also focused on understanding the impact of individual features and window lengths on model performance.

2 Related Work

In the literature, various physiological signals have been employed for emotion recognition, with a particular focus on modalities such as EEG, GSR, and eye-tracking systems [1, 6, 9]. Researchers have explored both uni- and multi-modal approaches, finding that the integration of multiple modalities can significantly enhance emotion recognition accuracy. Lu et al. achieved 0.78 accuracy with eye-related features recorded with eye-tracking glasses, which are not contact-free but record at relatively low frequencies of 60 Hz or 120 Hz. They predicted positive, negative and neutral classes with SVM.
3 Methodology

3.1 Data

In our research, we used the "emotional State Estimation based on Eye-tracking database" (eSEEd) [13]. The eSEEd comprises data from 48 participants, each of whom watched 10 carefully selected videos intended to evoke specific emotional responses. After viewing each video, participants ranked their emotions – anger, disgust, sadness, and tenderness – on a scale from 0 to 10. Tenderness is not regarded as one of the basic emotions, but it has been widely utilized in emotion research in recent years [13]. Since the participants ranked all four emotions for every video, a labelling problem emerged when multiple emotions shared the highest score, leading in our case to "undefined" labels. In our study, emotions were mapped by applying a set of extraction rules in the following order: if the highest-ranked emotion is below four, the response is labelled as neutral; if multiple emotions share the highest rank, the label is undefined; otherwise, the emotion with the highest rank is chosen. The boundary of four was chosen because the original study on eSEEd constructed this rule and we adapted it from there [13]. Although the initial study design aimed for an even distribution of emotions, neutral responses dominate, representing about one-fourth of the labels (depending on window length).

3.1.1 Data Preprocessing. We preprocessed the data to make it more suitable for our research and to reduce its size. We wanted to study performance at a relatively low sampling frequency of 60 Hz, which is used by relatively affordable mid-tier eye-trackers like the Tobii Pro Spark. Firstly, features that were uninformative or could be misleading (e.g. the raw tracker signal and timestamps) were removed, and the following set of features was preserved: 2D screen coordinates of gaze points (for the standard deviation (std) of screen gaze coordinates), 3D coordinates of gaze points (exclusively for saccade calculations), pupil sizes (the a and b axes of the pupil ellipse), and eye IDs (each eye has its own pupil size features). Secondly, rows containing any NaN values were removed, as there were no large consecutive blocks of such rows and downsampling of the data was planned. Finally, we downsampled the data to 60 Hz, matching the sampling frequency of a mid-tier eye-tracker. However, we acknowledge that downsampling might lead to the loss of high-frequency information, which could be important for capturing subtle dynamics in gaze behaviour and pupil responses. This is particularly relevant considering that recent studies, such as those by Collins et al. [3] and the SEED project [4, 17], have utilized data collected at much higher frequencies to preserve these subtle dynamics. Therefore, while downsampling makes the data more relevant to our research question and more computationally manageable, it is important to keep the reduced temporal resolution in mind when discussing the results.
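The label-extraction rules described above can be stated compactly in code. The following is a minimal sketch, assuming the ratings of one participant for one video arrive as a dictionary of 0–10 self-reports; the data layout and the map_label helper are our assumptions, not the authors' actual pipeline.

```python
from typing import Dict

EMOTIONS = ["anger", "disgust", "sadness", "tenderness"]

def map_label(ratings: Dict[str, int]) -> str:
    """Apply the extraction rules from Section 3.1 in order:
    1) if the highest rating is below 4, the response is 'neutral';
    2) if several emotions tie for the highest rating, it is 'undefined';
    3) otherwise the top-rated emotion is the label."""
    top = max(ratings.values())
    if top < 4:
        return "neutral"
    winners = [e for e, r in ratings.items() if r == top]
    if len(winners) > 1:
        return "undefined"
    return winners[0]

# A tie between two negative emotions yields 'undefined':
print(map_label({"anger": 7, "disgust": 7, "sadness": 2, "tenderness": 0}))
```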
Following the preprocessing, window segmentation was applied to the data. This step is essential for analyzing temporal patterns, as it allows the capture of trends and behaviours over specific time intervals. By segmenting the data into windows, we can improve the robustness of feature extraction and model training, enabling the detection of meaningful patterns that might be obscured in raw, unsegmented data. Additionally, window segmentation increases the number of training instances, which is commonly better for learning more robust ML models and for rigorous evaluation. Hence, multiple window lengths were examined, namely 1, 3, 5 and 10 s, with a 50% sliding window overlap. From each window, we computed 22 features belonging to the following groups (a windowing sketch is given at the end of this subsection):

(1) gaze coordinates on screen: std of x and y coordinates
(2) pupil ellipse sizes a and b for each eye: mean, std
(3) blinks: number; mean and std of duration (all 0 if no blinks)
(4) saccades: number; mean speed; mean, std and total duration
(5) fixations: number; mean, std and total duration

Saccade and, implicitly, fixation calculations were done using existing code based on the algorithm proposed by Engbert et al. [5, 10]. The algorithm calculates the velocity and acceleration of eye movements and uses a velocity-threshold identification method to detect saccades from continuous 3D gaze data. In our study we define a fixation (interval) as the absence of a saccade (interval); thus one fixation is declared between every two saccades (and before the first and after the last one).

As mentioned previously, our data was imbalanced in terms of class distribution: the proportions for anger, disgust, sadness, neutral, tenderness and undefined were 8.7%, 13.6%, 17.5%, 25.7%, 15.8% and 18.7%, respectively. Notably, for the 1 s window length the number of windows was 67,181, whilst for the 10 s window length the number of instances decreased to 6,507.
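To make the windowing step concrete, here is a small sketch of 50%-overlap segmentation and a handful of the per-window statistics. It assumes a hypothetical per-recording DataFrame with columns such as gaze_x, gaze_y, pupil_a and pupil_b; the actual eSEEd column names and the full 22-feature computation (including blink, saccade and fixation statistics) are not shown.

```python
import pandas as pd

FS = 60  # Hz, the post-downsampling rate used in the paper

def sliding_windows(rec: pd.DataFrame, win_s: float, overlap: float = 0.5):
    """Yield fixed-length windows of win_s seconds with the given overlap."""
    size = int(win_s * FS)
    step = max(1, int(size * (1 - overlap)))
    for start in range(0, len(rec) - size + 1, step):
        yield rec.iloc[start:start + size]

def window_features(win: pd.DataFrame) -> dict:
    """A few of the 22 per-window features; event-based statistics
    (blinks, saccades, fixations) would be added analogously."""
    return {
        "gaze_x_std": win["gaze_x"].std(),
        "gaze_y_std": win["gaze_y"].std(),
        "pupil_a_mean": win["pupil_a"].mean(),
        "pupil_a_std": win["pupil_a"].std(),
        "pupil_b_mean": win["pupil_b"].mean(),
        "pupil_b_std": win["pupil_b"].std(),
    }

# features = pd.DataFrame(window_features(w) for w in sliding_windows(rec, 10))
```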
3.2 Experiments

We initially examined feature correlation matrices to identify potential correlations between features, as well as between features and the class. Then, we compared the following classifiers: Random Forest (RF), Support Vector Machine (SVM) and k-Nearest Neighbors (kNN) from the Scikit-learn library, XGBoost (XGB) from the XGBoost library, and a majority-vote ensemble of the aforementioned classifiers. We compared all results against a baseline majority classifier. Each model was trained and tested using its default hyperparameters. To evaluate the models' performance, we implemented multiple CV techniques.

The first CV technique was Leave-One-Subject-Out (LOSO). Secondly, we implemented a marginally personalised 10-fold CV "within video". In this approach, a standard 10-fold CV was performed where 90% of temporally sequential windows were used for training and 10% for testing; the splits were done separately for each video within every subject. All the training data from every video was combined to train a single model, and all the test data was combined to evaluate the model, ensuring that the model was exposed to data from all subjects and videos. We named the experiment "marginally" personalised because most training data does not come from any single subject and is thus not very personalised. Finally, we explored a completely personalised 10-fold CV "within subject", where training and testing were done only on the data of one subject. In all three CV methods, the instances were never shuffled, to preserve the temporal and subject ordering and to minimize overfitting.

We also merged certain classes, grouping the negative emotions – anger, disgust, and sadness – under the category "negative", while labelling tenderness as "positive". The neutral label remained unchanged, while the undefined label was changed to "negative" because it always resulted from multiple negative emotions scoring equally. Lastly, feature importances were analysed for different combinations of data splits and models in order to identify potentially consistently important features.
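The validation schemes can be sketched as follows. LOSO corresponds directly to Scikit-learn's LeaveOneGroupOut with subject IDs as groups; the within-video split below is one plausible reading of the description above (unshuffled, temporally contiguous folds per recording, later combined across recordings), not the authors' exact code.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# LOSO: each fold holds out every window of one subject.
# X, y are window features/labels; subjects holds one subject ID per window.
logo = LeaveOneGroupOut()
# for train_idx, test_idx in logo.split(X, y, groups=subjects): ...

def within_video_folds(n_windows: int, n_folds: int = 10):
    """Unshuffled, temporally contiguous folds over the windows of a single
    (subject, video) recording: fold i tests on its contiguous 10%."""
    bounds = np.linspace(0, n_windows, n_folds + 1, dtype=int)
    for i in range(n_folds):
        test = np.arange(bounds[i], bounds[i + 1])
        train = np.setdiff1d(np.arange(n_windows), test)
        yield train, test

# Class merging used in the grouped experiments:
MERGE = {"anger": "negative", "disgust": "negative", "sadness": "negative",
         "undefined": "negative", "tenderness": "positive", "neutral": "neutral"}
```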
4 Results

The results described in the following subsections are summarised in Table 1.

Table 1: Best-performing models and their corresponding results, along with the results of the Majority Class Classifier for the same parameters. Window lengths are 10 s.

Settings                                                  Model Acc     Model F1      Majority Class Acc   Majority Class F1
LOSO, RF                                                  0.28 ± 0.13   0.28 ± 0.16   0.25 ± 0.25          0.15 ± 0.26
LOSO, SVM, negative emotions grouped                      0.59 ± 0.19   0.46 ± 0.18   0.59 ± 0.19          0.46 ± 0.18
10-fold within video, RF                                  0.60 ± 0.07   0.60 ± 0.08   0.21 ± 0.01          0.07 ± 0.01
10-fold within video, XGB, negative emotions grouped      0.76 ± 0.04   0.73 ± 0.04   0.66 ± 0.02          0.52 ± 0.02
10-fold within subject, RF                                0.38 ± 0.20   0.42 ± 0.19   0.33 ± 0.26          0.29 ± 0.26
10-fold within subject, SVM, negative emotions grouped    0.64 ± 0.13   0.63 ± 0.12   0.67 ± 0.16          0.61 ± 0.16

4.1 Feature Correlations

The first important observation from the correlation matrices was that no output class is closely correlated with any single feature. Secondly, we noticed some strong correlations between features; for example, a 1.0 correlation between the number of fixations and the number of saccades, because one simply equals the other increased by one. More importantly, we noticed little-to-no correlation between the features that proved most important in the best-performing models, meaning each of these features brought some novel information to the model. The only exception among important correlated features are those representing the mean pupil size, i.e. the ellipse a and b axes, which are expected to be correlated; their correlation exceeded 0.8. However, we decided not to remove any features, because we assessed the feature count of 22 to be well balanced in relation to the number of instances.
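A sketch of the kind of correlation screening described above, assuming the per-window feature table from the earlier sketch; the 0.8 threshold mirrors the pupil a/b observation, and the helper name is ours.

```python
import pandas as pd

def correlated_pairs(features: pd.DataFrame, threshold: float = 0.8):
    """Report feature pairs whose absolute Pearson correlation exceeds
    the threshold, e.g. the pupil-ellipse a/b means from Section 4.1."""
    corr = features.corr().abs()
    cols = list(corr.columns)
    return [(cols[i], cols[j], round(float(corr.iloc[i, j]), 2))
            for i in range(len(cols))
            for j in range(i + 1, len(cols))
            if corr.iloc[i, j] > threshold]
```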
4.2 Leave-One-Subject-Out

With the goal of training a robust general model for our dataset, we first applied the LOSO CV technique. The best performance was achieved by RF on 10 s windows, yielding an accuracy of 0.28 ± 0.13 and an F1-score of 0.28 ± 0.16; it outperformed the majority classifier by 0.03 in accuracy and 0.13 in F1-score. In a subsequent experiment, the negative emotions were grouped. This adjustment led to an overall increase in performance; however, with such grouping the majority classifier score also increased, to 0.59 accuracy, which is the same as the best-performing model. Further analysis revealed that high accuracy mainly implied that the subject predominantly reported "neutral" feelings, and low accuracy implied little-to-no "neutral" labels. However, not every subject with a high "neutral" count achieved outstanding results, and not every subject with a wide range of emotions yielded poor results. We also compared the number of windows of the left-out subject with their performance and found no correlation. The 10 s window length performed better than the shorter windows of 1–5 s. We also tested longer (60 s) windows, and the resulting accuracies were higher than those from 10 s windows, but we judged the number of instances insufficient for the results to be representative.

4.3 Marginally Personalised 10-fold Cross-Validation Within Video

Given that LOSO yielded relatively poor results, the next step was to explore 10-fold CV. Experiments showed an average accuracy of 0.60 ± 0.07 and an F1-score of 0.60 ± 0.08, produced by RF on 10 s windows, the best-performing model. This should be compared to the results of the majority classifier: an average accuracy of 0.21 ± 0.01 and an F1-score of 0.07 ± 0.01. With negative emotions grouped, the accuracy and F1-score rose to 0.76 ± 0.04 and 0.73 ± 0.04, respectively, for the best-performing XGB on 10 s windows; the majority class classifier yielded an accuracy of 0.66 ± 0.02 and an F1-score of 0.52 ± 0.02.

4.4 Personalised 10-fold Cross-Validation

Even though the 10-fold CV within video resulted in much better performance compared to LOSO, we wanted to see the performance of completely personalised models. All the models performed similarly well, with the absolute best being RF on 10 s windows, which outperformed the majority classifier by 0.05 and 0.13 for accuracy and F1-score, respectively. When grouping the negative emotions, we observed an absolute improvement in the models' performance, but a relative decline against the majority classifier benchmark. The best model in this case did not surpass the majority classifier in terms of accuracy: the majority classifier achieved 0.67 ± 0.16 accuracy and 0.61 ± 0.16 F1-score, while SVM, the best-performing model, scored an accuracy of 0.64 ± 0.13 and an F1-score of 0.63 ± 0.12.

4.5 Feature Importances

Following model training, we analyzed the feature importances of the best-performing models. For RF, importance was calculated based on the Mean Decrease in Impurity, summing the impurity reduction each feature contributes across all trees; for XGB, feature importances were calculated using the "weight" metric, which counts the number of times each feature is used to split the data across all trees. For SVM we did not calculate feature importances. In the completely personalised 10-fold experiments, feature importances varied significantly across different subjects and even between different runs within the same subject, specifically with RF, as the random state was not fixed. In contrast, feature importances were notably consistent in experiments where models were trained on data from multiple subjects, such as LOSO and the 10-fold within video, even with a variable random state of the RF model. The most important features of the best-performing models were those related to average pupil sizes, followed by fixation duration. These results partially align with those of Collins et al., who found features relating to pupil diameter and saccades statistically significant [3].
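For reference, the two importance metrics named above can be read out of trained models as follows. This sketch uses random stand-in data rather than the actual window features, purely to show where the numbers come from.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 22))      # stand-in for the 22 window features
y = rng.integers(0, 6, size=200)    # stand-in for the six emotion labels

rf = RandomForestClassifier(random_state=0).fit(X, y)
# Mean Decrease in Impurity: impurity reduction summed over all trees.
mdi = rf.feature_importances_

xgb = XGBClassifier().fit(X, y)
# "weight" importance: how often each feature is used to split, across trees.
weight = xgb.get_booster().get_score(importance_type="weight")
```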
5 Conclusion

Our research explored emotion classification from eye-tracking data with classical ML models and hand-crafted features. The data was downsampled to a lower-than-standard frequency, i.e. 60 Hz, which is more realistic for consumer contact-free eye-tracker data. This made the problem harder and not directly comparable with other studies working on eSEEd, but valuable from a practical perspective.

Window segmentation significantly impacted model performance, with the best results consistently obtained using the largest window length. This suggests that longer observation periods capture more comprehensive information, making smaller windows less effective for emotion classification. We hypothesize that this does not transfer to realistic scenarios, as users might experience emotions in short bursts while being neutral for the majority of the time; it is more expected in specifically designed cases where an emotion is consistently induced for longer periods of time (like our dataset).

The LOSO validation strategy, which tests model generalization across different subjects, yielded poor results. The variability in performance across subjects indicates the challenge of capturing general relationships between eye features and emotions. While both 10-fold CV approaches showed an increase in performance, their generalizability is limited. The completely personalised 10-fold CV showed worse results than the marginally personalised one, presumably because of the low number of videos per emotion within an individual subject.

An important issue with the eSEEd data is that all participants watched the same 10 emotion-evoking videos in the exact same order. This uniformity raises concerns that, given the small number of videos (two intended per emotion)¹, the models might learn to associate features unrelated to emotions, such as video dynamics or illumination. We circumvented the problem with video dynamics by dropping the mean gaze coordinate features and not using them in our experiments.

Despite these challenges, our experiments offer valuable insights into the feasibility of emotion recognition from low-frequency eye-tracker data, providing a foundation for future work. We initially opted for classical models due to their explainability, lower computational complexity, and efficiency, which are in our opinion essential for understanding the data before transitioning to more complex deep learning models.

In future work, several enhancements could be explored to improve the robustness and accuracy of emotion classification models using eye-tracking data. One approach could involve analyzing distinct fixation areas as an additional feature, potentially offering deeper insights into visual attention patterns. Moreover, considering that each emotion is (in some cases) represented by two videos, a valuable experiment would be to train models on one video and test on the other; this could help assess the model's ability to generalize across different stimuli within the same emotional category.

Further analysis could focus on demographic factors by examining the LOSO results for potential correlations between model predictions and participant characteristics such as gender, age, and education; this might reveal underlying biases or trends that affect model performance. Additionally, rather than downsampling and removing rows with missing data, future work could explore retaining or imputing these rows. Furthermore, training neural networks on raw, non-downsampled data from multiple modalities is another promising direction, as other studies have already observed promising results with such approaches. Finally, we should address the issue of overlapping emotions, which could involve developing a multi-label output model, reflecting the real-world scenario where multiple emotions can be present simultaneously; this approach could also help reduce the number of undefined labels, increasing the amount of useful data.
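As an illustration of the proposed multi-label direction (our sketch, not part of the study), each emotion could become an independent binary target, so that a window may be flagged with several emotions at once; the data here is random stand-in data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 22))           # stand-in window features
# One binary target per emotion; a window may be positive for several at once.
Y = rng.integers(0, 2, size=(300, 4))    # anger, disgust, sadness, tenderness

clf = MultiOutputClassifier(RandomForestClassifier(random_state=0)).fit(X, Y)
print(clf.predict(X[:3]))                # shape (3, 4): one flag per emotion
```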
Acknowledgements

This work was supported by the bilateral Weave project, funded by the Slovenian Research and Innovation Agency (ARIS) under grant agreement N1-0319 and by the Swiss National Science Foundation (SNSF) under grant agreement 214991. The authors acknowledge the use of OpenAI's ChatGPT for generating text suggestions during the preparation of this paper. All generated content has been reviewed and edited by the authors to ensure accuracy and relevance to the research.

¹ The average percentage of the videos for which the participants reported the target emotion (also known as the "hit rate") was 71.8% [13].

References
[1] Zeeshan Ahmad and Naimul Khan. 2022. A survey on physiological signal-based emotion recognition. Bioengineering, 9, 11, 688. https://www.mdpi.com/2306-5354/9/11/688.
[2] Aracena Claudio, Basterrech Sebastián, Snáel Václav, and Velásquez Juan. 2015. Neural networks for emotion recognition based on eye tracking data. In 2015 IEEE International Conference on Systems, Man, and Cybernetics, 2632–2637. doi: 10.1109/smc.2015.460.
[3] Mackenzie L. Collins and T. Claire Davies. 2023. Emotion differentiation through features of eye-tracking and pupil diameter for monitoring well-being. In 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). doi: 10.1109/embc40787.2023.10340178.
[4] Ruo-Nan Duan, Jia-Yi Zhu, and Bao-Liang Lu. 2013. Differential entropy feature for EEG-based emotion classification. In 6th International IEEE/EMBS Conference on Neural Engineering (NER). IEEE, 81–84.
[5] Ralf Engbert, Lars Rothkegel, Daniel Backhaus, and Hans A. Trukenbrod. 2016. Evaluation of velocity-based saccade detection in the SMI-ETG 2W system. Technical report, Allgemeine und Biologische Psychologie, Universität Potsdam.
[6] Atefeh Goshvarpour, Ataollah Abbasi, and Ateke Goshvarpour. 2017. An accurate emotion recognition system using ECG and GSR signals and matching pursuit method. Biomedical Journal. doi: 10.1016/j.bj.2017.11.001.
[7] Jiang-Jian Guo, Rong Zhou, Li-Ming Zhao, and Bao-Liang Lu. 2019. Multi-modal emotion recognition from eye image, eye movement and EEG using deep neural networks. In 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 3071–3074. doi: 10.1109/embc.2019.8856563.
[8] Robert Jenke, Angelika Peer, and Martin Buss. 2014. Feature extraction and selection for emotion recognition from EEG. IEEE Transactions on Affective Computing, 5, 3, 327–339. doi: 10.1109/taffc.2014.2339834.
[9] Lim Jia Zheng, Mountstephens James, and Jason Teo. 2020. Emotion recognition using eye-tracking: taxonomy, review and current challenges. Sensors, 20, 8, 2384. https://doi.org/10.3390/s20082384.
[10] Fjorda Kazazi. 2022. Detect saccades and saccade mean velocity in Python from data collected in Pupil Labs eye tracker. Accessed: 25. 7. 2024. https://www.fjordakazazi.com/detect_saccades.
[11] Yousif Khaireddin and Zhuofa Chen. 2021. Facial emotion recognition: state of the art performance on FER2013. arXiv preprint arXiv:2105.03588. doi: 10.48550/arXiv.2105.03588.
[12] Yifei Lu, Wei-Long Zheng, Binbin Li, and Bao-Liang Lu. 2015. Combining eye movements and EEG to enhance emotion recognition. In IJCAI. Vol. 15. Buenos Aires, 1170–1176.
[13] Vasileios Skaramagkas and Emmanouil Ktistakis. 2023. eSEE-d: emotional state estimation based on eye-tracking dataset. Brain Sciences, 13, 4. doi: 10.3390/brainsci13040589.
[14] Mohammad Soleymani, Maja Pantic, and Thierry Pun. 2012. Multimodal emotion recognition in response to videos. IEEE Transactions on Affective Computing, 3, 2, 211–223. doi: 10.1109/t-affc.2011.37.
[15] Paweł Tarnowski, Marcin Kołodziej, Andrzej Majkowski, and Remigiusz Jan Rak. 2020. Eye-tracking analysis for emotion recognition. Computational Intelligence and Neuroscience. https://onlinelibrary.wiley.com/doi/10.1155/2020/2909267.
[16] Shiqing Zhang, Shiliang Zhang, Tiejun Huang, Wen Gao, and Qi Tian. 2018. Learning affective features with a hybrid deep model for audio–visual emotion recognition. IEEE Transactions on Circuits and Systems for Video Technology, 28, 10, 3030–3043. doi: 10.1109/tcsvt.2017.2719043.
[17] Wei-Long Zheng and Bao-Liang Lu. 2015. Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks. IEEE Transactions on Autonomous Mental Development, 7, 3, 162–175. doi: 10.1109/TAMD.2015.2431497.
Indeks avtorjev / Author index

Andova Andrejaana 51
Anžur Zoja 31
Avdić Elma 15
Bengeri Katja 11
Bohanec Marko 59
Božak Tomi 83
Cigoj Primož 7
Cork Jordan 51
Đoković Lazar 19
Džeroski Sašo 67
Filipič Bogdan 51
Gams Matjaž 27
Gašparič Lea 67
Gjoreski Hristijan 35
Gjoreski Martin 35
Gradišek Anton 23, 79
Hafner Miha 59
Halbwachs Helena 23
Jelenc Matej 79
Jonnagaddala Jitendra 79
Jordan Marko 71
Kalin Jan 27
Kizhevska Emilija 75
Kokalj Anton 67
Kolar Žiga 27
Konečnik Martin 27
Kramar Sebastjan 35, 71
Krstevska Ana 35
Kukar Matjaž 39
Kulauzović Bajko 27
Kuzman Taja 7
Lukan Junoš 11, 35
Luštrek Mitja 11, 31, 35, 71, 75, 83
Mehanović Dželila 15
Nedić Mila 63
Pavleska Tanja 7
Pejanovič Nosaka Tomo 27
Piciga Aleksander 39
Poljak Lukek Saša 47
Prestor Domen 27
Ratajec Mariša 23
Reščič Nina 71
Robnik-Šikonja Marko 19
Rupnik Urban 7
Sadikov Aleksander 43
Shulajkovska Miljana 79
Skobir Matjaž 27
Slapničar Gašper 31, 35, 83
Smerkol Maj 23
Šoln Kristjan 55
Susič David 27
Susič Rok 23
Trojer Sebastijan 31, 35
Tušar Tea 51, 63
Vladić Ervin 15
Založnik Marcel 55, 71
Zirkelbach Maj 43

Slovenska konferenca o umetni inteligenci / Slovenian Conference on Artificial Intelligence
Uredniki / Editors: Mitja Luštrek, Matjaž Gams, Rok Piltaver